# Predicting Student Success (Naive Bayesian and Regression Approaches) 
Two fundamental learning approahces in predictive analytics is probability-based learning and error-based learning. In this homework assignment, you will explore how those two approaches can be used for building predictive models. 

You will use a subset of _Student Performance Data Set_ from UCI (https://archive.ics.uci.edu/ml/datasets/student+performance). This dataset addresses the factors impacting the student achievement in secondary education of two Portuguese schools. The data attributes include grades, demographic, social and school related features and it was collected by using school reports and questionnaires. In this homework assignment you will use the mathematics scores (file provided 'student-mat.csv'). You will apply initial pre-processing and build learning models for predicting student success. 

Initial loading and pre-processing steps are shown in the next cell. Please use the descriptive features outlined below. Target variable you will derive (or use) will be based on the third year/final grade of the students.  

In [1]:
# load libraries
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore')

# load and process data
mat_df = pd.read_csv('student-mat.csv', sep=';')
mat_df = mat_df.drop(columns=['G1', 'G2']) # G1 and G2 are first two years grades. We are interested in the final year
_target = 'G3'
desc_features = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 
        'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
        'Walc', 'health', 'absences']
mat_df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,...,no,no,5,5,4,4,5,4,11,9
391,MS,M,17,U,LE3,T,3,1,services,services,...,yes,no,2,4,5,3,4,2,3,16
392,MS,M,21,R,GT3,T,1,1,other,other,...,no,no,5,5,3,3,3,3,3,7
393,MS,M,18,R,LE3,T,3,2,services,other,...,yes,no,4,4,1,3,4,5,0,10


### Q1. Create a new categorical target variable, called 'passed' based on grade obtained in the third year (see G3). Any student who gets 9 or above will be considered as passed. Also, split the data into training and testing (33% for testing). 

In [2]:
# you answer for Q1 goes here.

In [3]:
# Creating passed column based on G3
mat_df['passed'] = [1 if i > 9 else 0 for i in mat_df['G3']]
mat_df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G3,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,4,3,4,1,1,3,6,6,0
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,no,5,3,3,1,1,3,4,6,0
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,no,4,3,2,2,3,3,10,10,1
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,3,2,2,1,1,5,2,15,1
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,4,3,2,1,2,5,4,10,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,...,no,5,5,4,4,5,4,11,9,0
391,MS,M,17,U,LE3,T,3,1,services,services,...,no,2,4,5,3,4,2,3,16,1
392,MS,M,21,R,GT3,T,1,1,other,other,...,no,5,5,3,3,3,3,3,7,0
393,MS,M,18,R,LE3,T,3,2,services,other,...,no,4,4,1,3,4,5,0,10,1


In [4]:
#Splitting training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(mat_df[desc_features], mat_df[['G3','passed']], test_size=0.33)

### Q2. Using the training and testing partitions form Q1, train a model using Naive Bayes algorithm. Test and analyze the trained model.
You are expected to use the 'passed' variable (categorical target) instead of G3 (do not include G3 in your training or testing set). You can use the GaussianNB() model available in sklearn library and make use of OrdinalEncoder utility class.
Using the testing set, calculate the accuracy score and display the confusion matrix.

Additionally, print the __class_prior___ attribute of your classifier. What do these numbers represent? How would the prediction results change if you initialized them to "[0.5, 0.5]"?

In [5]:
# answer for Q2
mat_df.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G3             int64
passed         int64
dtype: object

In [6]:
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OrdinalEncoder


In [7]:
cols=["school","sex", "address", "famsize", "Pstatus", "Mjob", "Fjob", "reason",
     "guardian", "schoolsup", "famsup", "paid", "activities", "nursery", "higher",
      "internet", "romantic"]

encoder = OrdinalEncoder()

X_train[cols] = encoder.fit_transform(X_train[cols])
X_test[cols] = encoder.fit_transform(X_test[cols])

In [8]:
# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, Y_train['passed'])
 
# making predictions on the testing set
Y_pred = gnb.predict(X_test)
 
from sklearn import metrics

accuracy=metrics.accuracy_score(Y_test['passed'], Y_pred)
print("Accuracy: ",round(accuracy,2)*100)
print("Error Rate: ", round(1-accuracy, 2)*100)

Accuracy:  73.0
Error Rate:  27.0


In [9]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test['passed'], Y_pred)
cm_df = pd.DataFrame(cm)
cm_df

Unnamed: 0,0,1
0,12,28
1,8,83


<p><b>class_prior_ :</b> It is basically an rough estimate of the probability that if we randomly sample a dataset and take an instance from it. This instance will yield the given class.

In [10]:
gnb.class_prior_

array([0.34090909, 0.65909091])

In [11]:
# class_prior setting

gnb_classifier = GaussianNB()
gnb_classifier.class_prior_= [0.5, 0.5]

In [12]:
gnb_classifier.class_prior_

[0.5, 0.5]

In [13]:
#Testing classifier wih new class_prior

gnb_classifier.fit(X_train, Y_train['passed'])
Y_pred = gnb_classifier.predict(X_test)


In [14]:
accuracy = metrics.accuracy_score(Y_test['passed'], Y_pred)
print("Accuracy: ",round(accuracy,2)*100)
print("Error Rate: ", round(1-accuracy, 2)*100)

Accuracy:  73.0
Error Rate:  27.0


In [15]:
cm = confusion_matrix(Y_test['passed'], Y_pred)
cm_df = pd.DataFrame(cm)
cm_df

Unnamed: 0,0,1
0,12,28
1,8,83


In [16]:
gnb_classifier.class_prior_

array([0.34090909, 0.65909091])

<p>Value of class_prior changes after training the model even if you change it in the beginning

### Q3. Build a logistic regression model using the same training/testing split you used in Q2. Test and analyze the trained model.  
Using the testing set, calculate the accuracy score and display the confusion matrix.

Also print the __coef___ attribute of your logistic regression classifier. What do these numbers in 'coef_' represent? 

In [17]:
# answer for Q3 

In [18]:
from sklearn.linear_model import LogisticRegression

logistic_reg = LogisticRegression()

logistic_reg.fit(X_train, Y_train['passed'])
Y_pred = logistic_reg.predict(X_test)

In [19]:
accuracy = metrics.accuracy_score(Y_test['passed'], Y_pred)
print("Accuracy: ",round(accuracy,2)*100)
print("Error Rate: ", round(1-accuracy, 2)*100)

Accuracy:  69.0
Error Rate:  31.0


In [20]:
cm = confusion_matrix(Y_test['passed'], Y_pred)
cm_df = pd.DataFrame(cm)
cm_df

Unnamed: 0,0,1
0,10,30
1,10,81


In [21]:
logistic_reg.coef_

array([[ 0.19944884,  0.79002079,  0.12150624,  0.2202989 , -0.24883772,
         0.05933296, -0.03473367, -0.12053084,  0.14847401,  0.14320858,
        -0.00387353,  0.10352532,  0.21658584, -0.83006074, -0.28918336,
        -0.27102376,  0.2623414 , -0.23070085, -0.4279059 ,  0.77583181,
        -0.03215685, -0.23438232,  0.18688747,  0.17773514, -0.49503954,
        -0.59548613,  0.39310105, -0.05052786,  0.00451094]])

<p><b>'coef_'</b> define weights associated with each descriptive feature

### Q4. Build a linear regression model using the same training/testing split you used in Q2.  
This time you are expected to use 'G3' attribute in the original dataset as target variable (continuous). The descriptive features you will use are provided below (in `reg_desc_feat` list). Use the same split in Q1 (33% testing set) and calculate the mean absolute error.

In addition, print the __coef___ and __intercept___ attributes of your linear regression model. What are these numbers?

In [22]:
from sklearn.linear_model import LinearRegression
reg_desc_feat = ['school', 'address', 'famsize', 'Mjob', 'Fjob', 'guardian', 'paid', 'activities', 'romantic' ]

# answer for Q4

In [23]:
X_train = X_train[reg_desc_feat]
X_test = X_test[reg_desc_feat]
Y_train=Y_train['G3']
Y_test=Y_test['G3']

In [24]:
linear_reg= LinearRegression().fit(X_train, Y_train)

In [25]:
# print slope and y-intercept
linear_reg.coef_

array([ 0.81777037,  0.66443181,  0.77322115,  0.16144928,  0.09669593,
       -0.24378864,  0.904716  , -0.36403423, -1.86330024])

<p><b>'coef_'</b> define weights associated with each descriptive feature

In [26]:
linear_reg.intercept_

9.43089774867824

<p><b>'intercept_'</b> is basically the y-intercept

In [27]:
Y_pred = linear_reg.predict(X_test)

In [28]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(Y_test, Y_pred)

3.5037641874492564