<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Data Analysis with Python</p><br>


This dataset addresses student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and they were collected from school reports and questionnaires. The dataset covers performance in Mathematics (mat) and can be modeled as a binary classification, five-level classification, or regression task.
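
As a rough sketch of how one grade column can serve several task types, the example below derives a binary and a five-level target from made-up final grades. The pass threshold (10) and the five-level cut points are assumptions chosen for illustration; the description above does not specify them.

```python
import pandas as pd

# Hypothetical final grades on the 0-20 scale, for illustration only
g3 = pd.Series([0, 9, 10, 13, 14, 16, 20])

# Binary target: pass if G3 >= 10 (assumed threshold)
binary = (g3 >= 10).map({True: 'pass', False: 'fail'})

# Five-level target with assumed cut points 0-9, 10-11, 12-13, 14-15, 16-20
five_level = pd.cut(g3, bins=[-1, 9, 11, 13, 15, 20],
                    labels=['V', 'IV', 'III', 'II', 'I'])

print(pd.DataFrame({'G3': g3, 'binary': binary, 'five_level': five_level}))
```

For the regression task, the numeric grade itself would be used as the target without any binning.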

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Descriptive statistics</p>
<br>
*Descriptive statistics* are brief descriptive coefficients that summarize a given data set.
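
As a minimal illustration on made-up numbers (the `grades` values below are hypothetical, not taken from the dataset), the most common descriptive statistics can be computed directly with pandas:

```python
import pandas as pd

# Hypothetical grades on a 0-20 scale, for illustration only
grades = pd.Series([10, 12, 8, 15, 15])

summary = {
    'count': int(grades.count()),
    'mean': grades.mean(),
    'median': grades.median(),
    'std': grades.std(),   # sample standard deviation (ddof=1)
    'min': grades.min(),
    'max': grades.max(),
}
print(summary)
```

Note that pandas reports the sample standard deviation by default, which divides by `n - 1` rather than `n`.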


In [None]:
import pandas as pd
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
from scipy import stats

In [None]:
df = pd.read_excel('student data-mat.xlsx', sheet_name='Data')
# create a new DataFrame 'df' from the 'Data' sheet of the workbook
df.head(10) # print the first 10 rows of DataFrame

In [None]:
print(df['school'].value_counts()) # count the number of students from each school
print('minimum value:', df['G3'].min()) # minimum value in the column "G3"
print('maximum value:', df['G3'].max()) # maximum value in the column "G3"
print('standard deviation:', df['G3'].std()) # sample standard deviation of the column "G3"
print('mean:', df['G3'].mean()) # mean of the column "G3"

In [None]:
df['G3'].describe()

In [None]:
df_cor = df.loc[:,['G1','G2','G3']]
pd.Series(stats.pearsonr(df_cor['G1'],df_cor['G3']),index=['Coef','p-value'])

In [None]:
pd.Series(stats.pearsonr(df_cor['G2'],df_cor['G3']),index=['Coef','p-value'])
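
To make the coefficient returned by `stats.pearsonr` less of a black box, the sketch below recomputes it by hand on a few made-up paired scores (the arrays `x` and `y` are hypothetical) and compares the result with SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical paired scores, for illustration only
x = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
y = np.array([7.0, 11.0, 11.0, 15.0, 16.0])

# Pearson r: sum of products of centered values, divided by the
# square root of the product of the centered sums of squares
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r_scipy, p_value = stats.pearsonr(x, y)
print(round(r_manual, 4), round(r_scipy, 4), round(p_value, 4))
```

The coefficient only captures linear association, so a scatter plot is a useful companion check.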

In [None]:
df_gen = df.loc[:,['gender','G3']]

In [None]:

df_gen

In [None]:
df_gen0 = df_gen[df_gen['gender']=='F']['G3'].copy()
df_gen0.mean()

In [None]:
df_gen1 = df_gen[df_gen['gender']=='M']['G3'].copy()
df_gen1.mean()

In [None]:
t_res = stats.ttest_ind(df_gen0, df_gen1) # two-sample t-test comparing the mean G3 of female and male students
pd.DataFrame([t_res.statistic, t_res.pvalue], index=['statistic','pvalue'], columns=['G3']).T
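
By default `stats.ttest_ind` assumes both groups have the same variance. When that is doubtful, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic samples; the data are randomly generated and unrelated to the student dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical samples with different spreads and sizes
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=11.0, scale=5.0, size=25)

student = stats.ttest_ind(group_a, group_b)                 # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
print('Student:', student.statistic, student.pvalue)
print('Welch:  ', welch.statistic, welch.pvalue)
```

If the two versions disagree noticeably, the equal-variance assumption deserves a closer look before the result is interpreted.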

In [None]:
df_chi = df.loc[:,['studytime','goout']]

In [None]:
df_chi

In [None]:
ct = pd.crosstab(df_chi['studytime'],df_chi['goout'])
ct

In [None]:
chi2,p,dof,ex=chi2_contingency(ct)

In [None]:
pd.DataFrame([chi2,p,dof],index = ['chi2','pvalue','dof'],columns=['Chi-square result']).T
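
The fourth value returned by `chi2_contingency` is the table of expected counts under independence. A common rule of thumb is that the chi-square approximation becomes unreliable when expected counts are small (roughly below 5), so it can be worth inspecting. The sketch below uses a made-up 2x3 table, not the `studytime`/`goout` crosstab.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table of counts
observed = np.array([[20, 15, 5],
                     [10, 25, 25]])

chi2, p, dof, expected = chi2_contingency(observed)

# Expected count for a cell = row total * column total / grand total
manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
print(expected.round(2))
print('smallest expected count:', expected.min())
```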

In [None]:
df['G3'].value_counts(sort=False) # student distribution of final grade 

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Data Visualization</p>
<br>
*Data Visualization* plays an important role in data analysis. It provides visual, often interactive, representations of abstract data that amplify cognition and facilitate understanding.

In [None]:
import matplotlib.pyplot as plt # use plt.function_name to call function
import numpy as np # numpy is a fundamental package used for scientific computing

In [None]:
df1 = df[df['G3']>=15].copy() # create a new dataframe for group one students
df2 = df[df['G3']<15].copy() # create a new dataframe for group two students

In [None]:
df1['G3'].value_counts(sort=False)

In [None]:
df2['G3'].value_counts(sort=False)

In [None]:
plt.scatter(df1['G1'],df1['G2'],c='r',alpha=0.4, label='Final Grade >=15')
plt.scatter(df2['G1'],df2['G2'],c='b',alpha=0.4, label='Final Grade <15')
plt.xlabel('First Period Grade')
plt.ylabel('Second Period Grade')
plt.legend()
plt.show()

In [None]:
df1['studytime'].value_counts(sort=False)

In [None]:
bin_labels = ['<2 hours','2 to 5 hours','5 to 10 hours','>10 hours'] # labels for studytime codes 1-4
data_bar1 = pd.cut(df1['studytime'], [0,1,2,3,4], labels=bin_labels).copy() # map each studytime code to its label for group one
data_bar2 = pd.cut(df2['studytime'], [0,1,2,3,4], labels=bin_labels).copy() # same as above for group two

In [None]:
data_bar1.value_counts(sort=False) # check the distribution of group one

In [None]:
data_bar2.value_counts(sort=False) # check the distribution of group two

In [None]:
X=np.arange(4)+1 # 4 categories
plt.bar(X, height=data_bar1.value_counts(sort=False),width=0.35,color='r',label='Final Grade >=15')
plt.bar(X+0.35, height=data_bar2.value_counts(sort=False),width=0.35,color='b',label='Final Grade <15')
plt.xticks(X+0.17,bin_labels) # set labels of x axis
plt.xlabel('Study Time')
plt.ylabel('Number of students')
plt.legend()
plt.show()
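
Because the two groups can differ a lot in size, raw counts may be hard to compare side by side. One option is to compare shares within each group instead. The sketch below does this with `value_counts(normalize=True)` on two small made-up series (hypothetical study-time codes, not the real groups).

```python
import pandas as pd

# Hypothetical study-time codes (1-4) for a small and a large group
small_group = pd.Series([1, 2, 2, 3])
large_group = pd.Series([1, 1, 2, 2, 2, 2, 3, 4, 1, 2, 2, 1])

codes = [1, 2, 3, 4]
# normalize=True returns fractions; reindex keeps codes that never occur
share_small = small_group.value_counts(normalize=True).reindex(codes, fill_value=0.0)
share_large = large_group.value_counts(normalize=True).reindex(codes, fill_value=0.0)

shares = pd.DataFrame({'small': share_small, 'large': share_large})
print(shares.round(2))
```

The resulting columns each sum to 1 and could be passed to `plt.bar` in place of the raw counts.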

In [None]:
pie_labels = ['none','primary education (4th grade)','5th to 9th grade','secondary education', 'higher education']
plt.rcParams['font.size']=25 # set the font size before drawing so it applies to this figure
fig1 = plt.figure(figsize=(30,30))
ax1_1 = fig1.add_subplot(1,2,1)
ax1_2 = fig1.add_subplot(1,2,2)
# reindex to education codes 0-4 so the counts always line up with pie_labels
ax1_1.pie(df1['Medu'].value_counts().reindex(range(5),fill_value=0),labels=pie_labels,labeldistance=1.05,autopct='%1.1f%%',pctdistance=0.85,startangle=90)
ax1_1.axis('equal')
ax1_2.pie(df2['Medu'].value_counts().reindex(range(5),fill_value=0),labels=pie_labels,labeldistance=1.05,autopct='%1.1f%%',pctdistance=0.85,startangle=90)
ax1_2.axis('equal')

ax1_1.set_title('mother\'s education of group one students')
ax1_2.set_title('mother\'s education of group two students')
plt.show()

In [None]:
plt.rcParams['font.size']=25 # set the font size before drawing so it applies to this figure
fig2 = plt.figure(figsize=(30,30))
ax2_1 = fig2.add_subplot(2,1,1)
ax2_2 = fig2.add_subplot(2,1,2)
# reindex to education codes 0-4 so the counts always line up with pie_labels
ax2_1.pie(df1['Fedu'].value_counts().reindex(range(5),fill_value=0),labels=pie_labels,labeldistance=1.05,autopct='%1.1f%%',pctdistance=0.85,startangle=90)
ax2_1.axis('equal')
ax2_2.pie(df2['Fedu'].value_counts().reindex(range(5),fill_value=0),labels=pie_labels,labeldistance=1.05,autopct='%1.1f%%',pctdistance=0.85,startangle=90)
ax2_2.axis('equal')

ax2_1.set_title('father\'s education of group one students')
ax2_2.set_title('father\'s education of group two students')
plt.show()

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Data analysis with machine learning</p>
<br>
*Machine learning* is a subfield of artificial intelligence that explores how machines can learn from data to analyze structures, help with decisions, and make predictions. In this section, we will try to predict the performance group of a student (group one or group two) with the attributes collected. 

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
# import the libraries that will be used

In [None]:
df['group'] = pd.cut(df['G3'], [-1,14,20],labels=['Two','One'],right=True).copy() # label G3 <= 14 as group 'Two' and G3 >= 15 as group 'One'
df['group'] = df['group'].astype(str)
df.head(5)

In [None]:
#features_c =['studytime','failures','schoolsup','famsup','paid','romantic','famrel','freetime','activities','goout','G1','G2']
X = df.loc[:,'school':'absences'].copy()
X.replace(['no','yes'],[0,1],inplace=True)
X.replace(['GP','MS'],[0,1],inplace=True)
X.replace(['F','M'],[0,1],inplace=True)
X.replace(['U','R'],[0,1],inplace=True)
X.replace(['LE3','GT3'],[0,1],inplace=True)
X.replace(['A','T'],[0,1],inplace=True)
# assign the result back: replace(inplace=True) on a column selection may not update X in recent pandas versions
X['Mjob'] = X['Mjob'].replace(['other','at_home','services','health','teacher'],[1,2,3,4,5])
X['Fjob'] = X['Fjob'].replace(['other','at_home','services','health','teacher'],[1,2,3,4,5])
X['reason'] = X['reason'].replace(['home','reputation','course','other'],[1,2,3,4])
X['guardian'] = X['guardian'].replace(['mother','father','other'],[1,2,3])
#X = X.join(df.loc[:,'Medu':'Fedu'])
y = df['group'].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,random_state=0)
y_train.value_counts()
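
Mapping nominal categories such as jobs to the integers 1 to 5 implies an ordering that a linear model will treat as meaningful. One-hot encoding is a common alternative that avoids this. The sketch below shows the idea with `pd.get_dummies` on a tiny made-up frame (the `toy` data are hypothetical); it is an alternative to, not a description of, the encoding used in the cell above.

```python
import pandas as pd

# Hypothetical rows with one nominal and one numeric column
toy = pd.DataFrame({'Mjob': ['teacher', 'other', 'health', 'other'],
                    'absences': [2, 0, 6, 4]})

# One indicator column per category instead of arbitrary integer codes
encoded = pd.get_dummies(toy, columns=['Mjob'], dtype=int)
print(encoded)
```

The trade-off is a wider feature matrix, with one column per category level.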

In [None]:
y_test.value_counts()

In [None]:
clf = LinearSVC(random_state=0)
clf.fit(X_train, y_train)

In [None]:
predictions = clf.predict(X_test)
predictions = pd.Series(predictions, index=y_test.index)
predictions

In [None]:
predictions.value_counts()

In [None]:
accuracy_score(y_true = y_test, y_pred = predictions) # calculate the accuracy with the accuracy_score function
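
Accuracy alone can be misleading when one group is much larger than the other, since always predicting the larger group already scores well. A simple check is to compare against a majority-class baseline and to look at the confusion matrix. The sketch below uses made-up labels and predictions (`y_true` and `y_pred` are hypothetical), not the output of the classifier above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical imbalanced labels: 8 'Two' and 2 'One'
y_true = np.array(['Two'] * 8 + ['One'] * 2)
y_pred = np.array(['Two'] * 7 + ['One'] + ['One', 'Two'])

# Baseline that always predicts the most frequent class
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this baseline
baseline = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_dummy))

model_acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=['One', 'Two'])  # rows: true, columns: predicted
print(baseline_acc, model_acc)
print(cm)
```

In this toy case the predictions match the baseline accuracy, which shows why the per-class breakdown in the confusion matrix is worth reading as well.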

In [None]:
from sklearn.feature_selection import SelectFromModel

In [None]:
coef = pd.DataFrame(clf.coef_,index=['coef'],columns=X.columns) # get weights assigned to the features of the linearSVC model
coef.T

In [None]:
abs(coef).T.mean()

In [None]:
abs(coef).T.sort_values('coef',ascending=False)

In [None]:
clf_new = LinearSVC(random_state=0)
sfm = SelectFromModel(clf_new,threshold='median') # feature selector setting
sfm.fit(X_train, y_train)
X_transform = sfm.transform(X_train)

In [None]:
selected_result =pd.Series(sfm.get_support(), index=X_train.columns)

In [None]:
selected_result

In [None]:
selected_result[selected_result==True] # Keep the feature name with True value only

In [None]:
selected_features = selected_result[selected_result==True].index
selected_features 

In [None]:
clf_new.fit(X_transform, y_train) # train a new classifier with transformed training data set

In [None]:
predictions_new = clf_new.predict(sfm.transform(X_test)) # apply the same feature selection to the test set
predictions_new = pd.Series(predictions_new, index=y_test.index)
predictions_new

In [None]:
accuracy_score(y_true = y_test, y_pred = predictions_new)
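
A single train/test split gives only one accuracy estimate, and linear SVMs tend to be sensitive to feature scale. One possible refinement, sketched below on synthetic data from `make_classification` (not the student dataset), is to wrap scaling and the classifier in a pipeline and evaluate it with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data; not the student dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaling happens inside the pipeline, so each fold is scaled
# using statistics from its own training part only
model = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough sense of how much the accuracy depends on the particular split.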

In [None]:
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation # import the algorithms that will be used

In [None]:
features_c =['studytime','failures','schoolsup','famsup','paid','romantic','famrel','freetime','activities','goout','absences','G1','G2']
X_c = df[features_c].copy()
X_c.replace(['no','yes'],[0,1],inplace=True)


In [None]:
X_c.head() # check the inputs

In [None]:
km = KMeans(n_clusters=8,algorithm="lloyd",random_state=0).fit(X_c) # "lloyd" is the current scikit-learn name for the algorithm formerly called "full"
af = AffinityPropagation().fit(X_c) # clustering

In [None]:
df['kmlabel']=km.labels_
df['aflabel']=af.labels_
df.head() # store the labels acquired from clustering and check

In [None]:
df['G3'].groupby(df['kmlabel']).mean() # calculate the means of K-means clusters

In [None]:
km_centers=pd.DataFrame(km.cluster_centers_,columns=X_c.columns)
km_centers.round(2)

In [None]:
df['group'].groupby(df['kmlabel']).value_counts(sort=False) # count the group one and group two students in each cluster
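
The K-means cells above fix the number of clusters at 8. One way to compare candidate values is the silhouette score, which is higher when clusters are compact and well separated. The sketch below runs this comparison on synthetic blobs with three known centers (the data and the range of `k` are assumptions for illustration, not the student features).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups; not the student dataset
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

On real data the peak is usually less clear-cut, so the score is best treated as one input alongside how interpretable the clusters are.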

In [None]:
mean_c = df['column name of G3'].groupby(df['column name of af label']).mean()
mean_c.round(2) # please fill in the column names above to show the mean G3 of each affinity propagation cluster

In [None]:
pd.DataFrame(af.cluster_centers_,columns=X_c.columns) # show the exemplars

In [None]:
af.cluster_centers_indices_ # show the index of the exemplars

In [None]:
df['column name of af label'].value_counts(sort=False)

In [None]:
df['column name of af group'].groupby(df['column name of af label']).value_counts(sort=False)