### CLASSIFICATION

About Dataset
Description: This dataset contains information on the performance of high school students in mathematics, reading, and writing, together with demographic information. It is described as coming from three high schools in the United States, but the description also carries this note:
"This dataset was created for educational purposes and was generated, not collected from actual data sources."

Columns:
• Gender: The gender of the student (male/female)
• Race/ethnicity: The student's racial or ethnic background, coded as anonymized groups (group A through group E)
• Parental level of education: The highest level of education attained by the student's parent(s) or guardian(s)
• Lunch: Whether the student receives a standard or a free/reduced-price lunch (standard, free/reduced)
• Test preparation course: Whether the student completed a test preparation course (completed/none)
• Math score: The student's score on a standardized mathematics test
• Reading score: The student's score on a standardized reading test
• Writing score: The student's score on a standardized writing test

This dataset could be used for various research questions related to education, such as examining the impact of parental education or test preparation courses on student performance. It could also be used to develop machine learning models to predict student performance based on demographic and other factors.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df1=pd.read_csv("exams.csv")

In [3]:
df1.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group D,some college,standard,completed,59,70,78
1,male,group D,associate's degree,standard,none,96,93,87
2,female,group D,some college,free/reduced,none,57,76,77
3,male,group B,some college,free/reduced,none,70,70,63
4,female,group D,associate's degree,standard,none,83,85,86


In [4]:
df1.tail()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
995,male,group C,some college,standard,none,77,77,71
996,male,group C,some college,standard,none,80,66,66
997,female,group A,high school,standard,completed,67,86,86
998,male,group E,high school,standard,none,80,72,62
999,male,group D,high school,standard,none,58,47,45


In [5]:
df1.shape

(1000, 8)

In [6]:
df1.describe()

Unnamed: 0,math score,reading score,writing score
count,1000.0,1000.0,1000.0
mean,67.81,70.382,69.14
std,15.250196,14.107413,15.025917
min,15.0,25.0,15.0
25%,58.0,61.0,59.0
50%,68.0,70.5,70.0
75%,79.25,80.0,80.0
max,100.0,100.0,100.0


In [7]:
df1.select_dtypes("number").corr()  # correlate only the numeric score columns

Unnamed: 0,math score,reading score,writing score
math score,1.0,0.811767,0.790055
reading score,0.811767,1.0,0.948909
writing score,0.790055,0.948909,1.0


In [8]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [9]:
df1.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

In [10]:
df2=df1[["math score", "reading score","writing score"]]

In [11]:
df2

Unnamed: 0,math score,reading score,writing score
0,59,70,78
1,96,93,87
2,57,76,77
3,70,70,63
4,83,85,86
...,...,...,...
995,77,77,71
996,80,66,66
997,67,86,86
998,80,72,62


In [12]:
df2=df2.assign(row_means=df2.mean(axis=1))  # assign returns a new DataFrame, so no SettingWithCopyWarning is raised


In [13]:
df2

Unnamed: 0,math score,reading score,writing score,row_means
0,59,70,78,69.000000
1,96,93,87,92.000000
2,57,76,77,70.000000
3,70,70,63,67.666667
4,83,85,86,84.666667
...,...,...,...,...
995,77,77,71,75.000000
996,80,66,66,70.666667
997,67,86,86,79.666667
998,80,72,62,71.333333


In [14]:
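# Caution: with no explicit key, pd.merge joins on every shared column (the three score columns),
# so students with identical score triples match more than once: 1000 rows become 1040 below.
df=pd.merge(df1,df2)
df.drop("race/ethnicity",axis=1,inplace=True)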
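The jump from 1000 rows in `df1` to 1040 rows later in the notebook is consistent with how a key-less `pd.merge` behaves: it joins on all columns the two frames share, and repeated key values are matched many-to-many. The sketch below reproduces the effect on a small hypothetical frame (`left` and `scores` are illustrative names, not part of the notebook) and shows an index-aligned `join` as one possible alternative that keeps one row per student.

```python
import pandas as pd

# Hypothetical stand-in for df1: the first two students share the same score triple.
left = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [70, 70, 55],
    "reading score": [80, 80, 60],
    "writing score": [75, 75, 58],
})

# Stand-in for df2: the score columns plus their row-wise mean.
scores = left[["math score", "reading score", "writing score"]].copy()
scores["row_means"] = scores.mean(axis=1)

# Key-less merge: joins on the three shared score columns, so the
# duplicated triple matches 2 x 2 = 4 times and the frame grows.
merged = pd.merge(left, scores)

# Index-aligned join: attaches row_means by row label, one row per student.
joined = left.join(scores[["row_means"]])

print(len(left), len(merged), len(joined))  # 3 5 3
```

With the notebook's frames, the equivalent of the second approach would be something like `df1.join(df2[["row_means"]])`; the outputs further down would then refer to 1000 rows rather than 1040.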

In [15]:
df.lunch.unique()

array(['standard', 'free/reduced'], dtype=object)

In [29]:
df["lunch"]=df["lunch"].map({"standard":0,"free/reduced":1})  # 1 = free/reduced lunch

In [17]:
df["test preparation course"].unique()

array(['completed', 'none'], dtype=object)

In [18]:
df["test preparation course"]=df["test preparation course"].map({"none":0,"completed":1})  # 1 = completed

In [19]:
df["gender"]=df["gender"].map({"male":0,"female":1})  # 1 = female

In [20]:
df["parental level of education"].unique()

array(['some college', "associate's degree", 'some high school',
       "bachelor's degree", "master's degree", 'high school'],
      dtype=object)

In [21]:
# Encode education levels as integers. Note that master's degree (4) is coded below
# bachelor's degree (5), so the codes are not strictly ordered by attainment.
edu_codes={"some high school":0,"high school":1,"some college":2,
           "associate's degree":3,"master's degree":4,"bachelor's degree":5}
df["parental level of education"]=df["parental level of education"].map(edu_codes)
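The cell above gives master's degree a lower code (4) than bachelor's degree (5). If the intent is an ordinal scale, one option is to state the order explicitly with an ordered `pd.Categorical` and take its integer codes. This is only a sketch: the ordering in `levels` is an assumption about how the six labels should rank, and `sample` is a hypothetical series rather than the notebook's column.

```python
import pandas as pd

# Assumed ranking of the six labels, lowest to highest attainment.
levels = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

# Hypothetical sample of labels in the same spelling as the dataset.
sample = pd.Series(["some college", "master's degree", "high school", "bachelor's degree"])

# An ordered categorical maps each label to its position in `levels`.
codes = pd.Categorical(sample, categories=levels, ordered=True).codes

print(codes.tolist())  # [2, 5, 1, 4]
```

Switching the notebook to this ordering would change the encoded feature, so the accuracy figures reported later would need to be recomputed.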

In [30]:
df

Unnamed: 0,gender,parental level of education,lunch,test preparation course,math score,reading score,writing score,row_means
0,1,2,0,1,59,70,78,69.000000
1,0,3,0,0,96,93,87,92.000000
2,1,2,1,0,57,76,77,70.000000
3,0,2,1,0,70,70,63,67.666667
4,1,3,0,0,83,85,86,84.666667
...,...,...,...,...,...,...,...,...
1035,1,3,0,0,82,97,90,89.666667
1036,0,2,0,0,77,77,71,75.000000
1037,0,2,0,0,80,66,66,70.666667
1038,1,1,0,1,67,86,86,79.666667


In [40]:
x=df.drop("test preparation course",axis=1)
y=df["test preparation course"]

In [41]:
y.info()

<class 'pandas.core.series.Series'>
Int64Index: 1040 entries, 0 to 1039
Series name: test preparation course
Non-Null Count  Dtype
--------------  -----
1040 non-null   int64
dtypes: int64(1)
memory usage: 16.2 KB


In [42]:
y.astype("int64")

0       1
1       0
2       0
3       0
4       0
       ..
1035    0
1036    0
1037    0
1038    1
1039    0
Name: test preparation course, Length: 1040, dtype: int64

In [43]:
x.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1040 entries, 0 to 1039
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   gender                       1040 non-null   int64  
 1   parental level of education  1040 non-null   int64  
 2   lunch                        1040 non-null   int64  
 3   math score                   1040 non-null   int64  
 4   reading score                1040 non-null   int64  
 5   writing score                1040 non-null   int64  
 6   row_means                    1040 non-null   float64
dtypes: float64(1), int64(6)
memory usage: 65.0 KB


In [44]:
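from sklearn.naive_bayes import BernoulliNB, GaussianNB
g=GaussianNB()   # models each feature as normally distributed within each class
b=BernoulliNB()  # models binary features; values are thresholded at binarize=0.0 by default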

In [46]:
g.fit(x,y)
b.fit(x,y)

BernoulliNB()
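`BernoulliNB` is intended for binary features and, with its default `binarize=0.0`, treats every value above zero as 1. The sketch below applies scikit-learn's `binarize` helper with that same threshold to the first two score rows shown in `df1.head()`, to suggest what the model effectively sees for the score columns; `head_scores` is an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import binarize

# First two rows of (math, reading, writing) scores from df1.head().
head_scores = np.array([[59, 70, 78],
                        [96, 93, 87]])

# Same threshold as BernoulliNB's default binarize=0.0:
# every value greater than 0 becomes 1, everything else becomes 0.
binarized = binarize(head_scores, threshold=0.0)

print(binarized)
```

Because every score in these rows is positive, each one collapses to 1, which suggests that the score columns carry little information for this model unless a different threshold or a different model is used.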

In [47]:
tahmin1=g.predict(x)
tahmin2=b.predict(x)

In [48]:
from sklearn.metrics import accuracy_score, confusion_matrix , classification_report

In [49]:
confusion_matrix(y,tahmin1)  # scikit-learn expects (y_true, y_pred): rows are true labels, columns are predictions

array([[457, 224],
       [163, 196]], dtype=int64)

In [50]:
print(classification_report(y,tahmin1))  # (y_true, y_pred), so support counts the true labels

              precision    recall  f1-score   support

           0       0.74      0.67      0.70       681
           1       0.47      0.55      0.50       359

    accuracy                           0.63      1040
   macro avg       0.60      0.61      0.60      1040
weighted avg       0.64      0.63      0.63      1040



In [52]:
accuracy_score(y,tahmin1)  # (y_true, y_pred); measured on the rows the model was fitted on

0.6278846153846154

In [53]:
from sklearn.neighbors import KNeighborsClassifier

In [54]:
k=KNeighborsClassifier()

In [55]:
k.fit(x,y)

KNeighborsClassifier()

In [56]:
tahmin3=k.predict(x)

In [57]:
accuracy_score(y,tahmin3)  # (y_true, y_pred); measured on the rows the model was fitted on

0.7894230769230769
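The KNN score above is measured on the training rows, where each row counts itself among its own nearest neighbours, so it is likely to be optimistic. A common alternative is cross-validation, with feature scaling so that the 0 to 100 score columns do not dominate the distance against the 0/1 columns. The sketch below uses synthetic data from `make_classification` as a stand-in for `x` and `y` (an assumption, since `exams.csv` is not available here), so its numbers say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's x and y (7 features, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold only.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validated accuracy: each fold is scored on rows the model never saw.
cv_scores = cross_val_score(knn, X_demo, y_demo, cv=5, scoring="accuracy")

print(cv_scores.mean())
```

In the notebook itself, `X_demo` and `y_demo` would be replaced by `x` and `y`.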

In [59]:
from sklearn.linear_model import LogisticRegression
l=LogisticRegression()

In [60]:
l.fit(x,y)

LogisticRegression()

In [61]:
tahmin4=l.predict(x)

In [62]:
accuracy_score(y,tahmin4)  # (y_true, y_pred); measured on the rows the model was fitted on

0.7673076923076924
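All accuracy figures above are computed on the same rows the models were fitted on, so they describe fit rather than predictive performance. A held-out split is one way to estimate the latter. The sketch below again uses synthetic data as a stand-in for `x` and `y` (an assumption, not the notebook's data); the 25% test size, `max_iter=1000`, and the random seeds are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's x and y.
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=0)

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Fit on the training part only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Compare accuracy on seen and unseen rows; note the (y_true, y_pred) order.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The same split could be reused for `GaussianNB`, `BernoulliNB`, and `KNeighborsClassifier`, so that the four models are compared on identical unseen rows.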