# Unit 1 Assignment

In this assignment, we will focus on education. This dataset contains data about high school students. Each row represents a single student. The school administrators want to predict a student's cumulative GPA at the time of graduation so that they can make interventions for struggling students. The goal is to predict the CGPA of a student. 

## Description of Variables

The description of variables are provided in "High School - Data Dictionary.docx"

## Goal

Use the **high_school.csv** data set and build a model to predict **CGPA**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code is also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Section 1: (6 points in total)

## Data Prep (5.5 points)

In [378]:
import numpy as np
import pandas as pd

np.random.seed(55)

In [379]:
highschool = pd.read_csv("high_school.csv")
highschool.head()

Unnamed: 0,Gender,ParentEdu,ParentMaritalStatus,ExtraCurricular,IsFirstChild,Siblings,Transportation,AvgReadingScore,AvgWritingScore,traveltime,studytime,internet,freetime,absences,CGPA
0,female,bachelor's degree,married,regularly,yes,3.0,school_bus,71,74,2,2,no,3,6,C
1,female,some college,married,sometimes,yes,0.0,,90,88,1,2,yes,3,4,D
2,female,master's degree,single,sometimes,yes,4.0,school_bus,93,91,1,2,yes,3,10,B
3,male,associate's degree,married,never,no,1.0,,56,42,1,3,yes,2,2,F
4,male,some college,married,sometimes,yes,0.0,school_bus,78,75,1,2,no,3,4,C


In [380]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(highschool, test_size=0.3)

In [381]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [382]:
train_target = train['CGPA']
test_target = test['CGPA']

train_inputs = train.drop('CGPA', axis=1)
test_inputs = test.drop('CGPA', axis=1)

In [383]:
train_target.dtypes

dtype('O')

In [384]:
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [385]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [387]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [389]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)],
        remainder='drop')

In [390]:
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[-1.47676357, -0.65447713, -0.29550821, ...,  0.        ,
         0.        ,  1.        ],
       [-0.77679959,  0.8385936 ,  1.04981146, ...,  1.        ,
         1.        ,  0.        ],
       [-0.07683561, -0.31514288,  0.28105736, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.07683561, -0.04367547,  0.15293168, ...,  0.        ,
         0.        ,  1.        ],
       [-0.07683561,  0.9743273 ,  1.24199999, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.32309235,  0.43139249,  0.40918305, ...,  0.        ,
         0.        ,  1.        ]])

In [391]:
train_x.shape

(1658, 33)

In [392]:
test_x = preprocessor.transform(test_inputs)

test_x

array([[-0.07683561, -0.24727602, -0.61582242, ...,  0.        ,
         0.        ,  1.        ],
       [-0.77679959, -0.17940917, -0.48769673, ...,  0.        ,
         0.        ,  1.        ],
       [-0.07683561,  1.38152841,  1.04981146, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.07683561, -0.11154232,  0.024806  , ...,  0.        ,
         0.        ,  1.        ],
       [-0.77679959,  0.15992508,  0.21699452, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.62312837,  0.56712619,  0.47324589, ...,  1.        ,
         0.        ,  1.        ]])

In [393]:
test_x.shape

(711, 33)

## Find the Baseline (0.5 point)

In [394]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(train_x, train_target)

In [395]:
from sklearn.metrics import accuracy_score

In [396]:
#Baseline Train Accuracy
dummy_train_pred = dummy_clf.predict(train_x)

baseline_train_acc = accuracy_score(train_target, dummy_train_pred)

print('Baseline Train Accuracy: {}' .format(baseline_train_acc))

Baseline Train Accuracy: 0.3335343787696019


In [397]:
#Baseline Test Accuracy
dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_target, dummy_test_pred)

print('Baseline Test Accuracy: {}' .format(baseline_test_acc))

Baseline Test Accuracy: 0.32489451476793246


In [398]:
from sklearn.metrics import confusion_matrix

confusion_matrix(test_target, test_target_pred)

array([[ 24,  18,   5,   0,   0],
       [ 11,  34,  31,   9,   3],
       [  4,  41,  68,  51,   6],
       [  0,  10,  49,  65,  51],
       [  1,   1,   6,  53, 170]], dtype=int64)

# Section 2: (3 points in total)

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)



## SVM Model 1:

In [399]:
#SVC(kernel='linear')
from sklearn.svm import SVC
 
lin_svm = SVC(kernel="linear")

lin_svm.fit(train_x, train_target)

In [400]:
from sklearn.metrics import accuracy_score

In [401]:
train_svm_linear = lin_svm.predict(train_x)

#Train accuracy
print('SVC_linear_Train_Accuracy: {}' .format(accuracy_score(train_target,train_svm_linear)))

SVC_linear_Train_Accuracy: 0.6791314837153196


In [402]:
test_svm_linear = lin_svm.predict(test_x)

#Test accuracy
print('SVC_linear_Test_Accuracy: {}' .format(accuracy_score(test_target, test_svm_linear)))

SVC_linear_Test_Accuracy: 0.6413502109704642


In [403]:
confusion_matrix(test_target, test_svm_linear)

array([[ 31,  15,   1,   0,   0],
       [  9,  40,  34,   4,   1],
       [  1,  19, 100,  50,   0],
       [  1,   3,  36,  94,  41],
       [  0,   0,   4,  36, 191]], dtype=int64)

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

In [404]:
#since the accuracy score difference between train and test is minimal model overfitting is negligible we can say the model performed well model correction is not necessary

## SVM Model 2:

In [405]:
#SVC(kernel='poly')
from sklearn.svm import SVC

pol_svm = SVC(kernel="poly", degree=3, coef0=1, C=1)

pol_svm.fit(train_x, train_target)

In [406]:
train_svm_poly = pol_svm.predict(train_x)

#Train accuracy
print('SVC_poly_Train_Accuracy: {}' .format(accuracy_score(train_target, train_svm_poly)))

SVC_poly_Train_Accuracy: 0.8540410132689988


In [407]:
test_svm_poly = pol_svm.predict(test_x)

#Test accuracy
print('SVC_poly_Test_Accuracy: {}' .format(accuracy_score(test_target, test_svm_poly)))

SVC_poly_Test_Accuracy: 0.5738396624472574


In [408]:
confusion_matrix(test_target, test_svm_poly)

array([[ 31,  14,   2,   0,   0],
       [ 14,  33,  36,   3,   2],
       [  2,  27,  91,  48,   2],
       [  1,   4,  53,  75,  42],
       [  0,   0,   1,  52, 178]], dtype=int64)

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

In [409]:
#The test accuracy score is significantly low compare to the train scores indicates a overfitting exists in the model.Correction is required to minimize overfitting
#Therefore Randomized Grid Search is performed to fine tune hyperparameters in order to reduce the overfitting in the model

In [410]:
##Grid Search: randomized
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
import random

param_distribs = {'degree': randint(low=2, high=5),
        'C': randint(low=1, high=5),
        'coef0': uniform(0.1, 1),    
    }

poly_svm = SVC(kernel="poly", decision_function_shape='ovr')

poly_search = RandomizedSearchCV(poly_svm, param_distributions=param_distribs,
                                n_iter=10, cv=10, scoring='accuracy', random_state=55)

poly_search.fit(train_x, train_target)

In [411]:
cvres = poly_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)

0.6025593282219789 {'C': 2, 'coef0': 0.795069469300329, 'degree': 2}
0.5609529025191675 {'C': 2, 'coef0': 0.34795496644469115, 'degree': 3}
0.5549288061336254 {'C': 4, 'coef0': 0.8233882046158927, 'degree': 3}
0.5338043081416576 {'C': 3, 'coef0': 0.41165441156253324, 'degree': 4}
0.5380284775465498 {'C': 3, 'coef0': 0.7848141682681444, 'degree': 4}
0.5790653523183643 {'C': 1, 'coef0': 0.5603683708289007, 'degree': 3}
0.5495290251916758 {'C': 2, 'coef0': 0.1630545006596247, 'degree': 4}
0.6055786783497628 {'C': 2, 'coef0': 1.0856115817551715, 'degree': 2}
0.5549251551661191 {'C': 4, 'coef0': 0.8483028908601774, 'degree': 3}
0.5832566630156991 {'C': 4, 'coef0': 0.2957184273277015, 'degree': 2}


In [412]:
poly_search.best_params_

{'C': 2, 'coef0': 1.0856115817551715, 'degree': 2}

In [455]:
final_model = poly_search.best_estimator_

test_predictions = final_model.predict(test_x)

print('SVC_mod2_Rand_search Test Accuracy: {}' .format(accuracy_score(test_target, test_predictions)))

SVC_mod2_Rand_search Test Accuracy: 0.5907172995780591


In [414]:
#Though the test accuracy improved a bit,it did not resolve the overfitting problem still a large gap between test and train accuracy persists maybe the overfitting issue cannot be rsolved with this model,further investigation is necessary for model evaluation

## SVM Model 3:

In [415]:
#SVC(kernel='rbf')
rbf_svm = SVC(kernel="rbf", C=10, gamma='scale')

rbf_svm.fit(train_x, train_target)

In [416]:
train_svm_rbf = rbf_svm.predict(train_x)

#Train accuracy
print('SVC_rbf_Train_Accuracy: {}' .format(accuracy_score(train_target, train_svm_rbf)))

SVC_rbf_Train_Accuracy: 0.9529553679131484


In [417]:
test_svm_rbf = rbf_svm.predict(test_x)

#Test accuracy
print('SVC_rbf_Test_Accuracy: {}' .format(accuracy_score(test_target, test_svm_rbf)))

SVC_rbf_Test_Accuracy: 0.559774964838256


In [418]:
confusion_matrix(test_target, test_svm_rbf)

array([[ 26,  19,   2,   0,   0],
       [ 13,  37,  32,   4,   2],
       [  3,  31,  83,  51,   2],
       [  1,   6,  49,  78,  41],
       [  0,   1,   3,  53, 174]], dtype=int64)

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

In [419]:
#The model is overfitting since the test accuracy score is comparatively low with train accuracy score so correction is needed to reduce overfitting
#Randomized Grid Search is performed to fine tune hyperparameters in order to reduce the overfitting in the model

In [420]:
#Grid Search: randomized
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
import random

param_distribs = {
        'C': randint(low=1,high=10),
        'gamma': uniform(0.1,0.3), 
    }

rbf_svm = SVC(kernel="rbf", decision_function_shape='ovr')

rbf_search = RandomizedSearchCV(rbf_svm, param_distributions=param_distribs,
                                n_iter=5,cv=5, scoring='accuracy', random_state=55)

rbf_search.fit(train_x, train_target)

In [421]:
cvres = rbf_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)

0.5247242747424744 {'C': 8, 'gamma': 0.39363371987013507}
0.5476613402249482 {'C': 6, 'gamma': 0.25933714894217597}
0.5386106358970626 {'C': 2, 'gamma': 0.3587891131154919}
0.5506752083864157 {'C': 4, 'gamma': 0.165390943978014}
0.5446329123139082 {'C': 9, 'gamma': 0.33271496159318015}


In [422]:
rbf_search.best_params_

{'C': 4, 'gamma': 0.165390943978014}

In [456]:
final_model = rbf_search.best_estimator_

test_predictions = final_model.predict(test_x)

print('SVC_mod3_Rand_Search Test Accuracy: {}' .format(accuracy_score(test_target, test_predictions)))

SVC_mod3_Rand_Search Test Accuracy: 0.5386779184247539


In [424]:
#The obtained test accuracy after randomized search is less than normal test accuracy many random search iterations are evaluated but none of the paramaters combination able to reduce the overfitting concern maybe the overfitting issue could not be resolved or further investigations are needed to evaluate model

# Section 3: (3 points in total)

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRregression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)

## SGD Model 1:

In [425]:
from sklearn.linear_model import SGDClassifier 

sgd_mod1 = SGDClassifier(max_iter=1000, penalty='l1') 

sgd_mod1.fit(train_x, train_target)

In [426]:
train_y_pred = sgd_mod1.predict(train_x)

#Train accuracy
print('SGD_mod1_Train_Accuracy: {}' .format(accuracy_score(train_target, train_target_pred)))

SGD_mod1_Train_Accuracy: 0.9873341375150784


In [427]:
test_y_pred = sgd_mod1.predict(test_x)

#Test accuracy
print('SGD_mod1_Test_Accuracy: {}' .format(accuracy_score(test_target, test_target_pred)))

SGD_mod1_Test_Accuracy: 0.5077355836849508


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

In [428]:
#This model is overfitting due to the significant difference between the test and train accuracy scores.lets see performing some regularizations on the model will make an impact on the accuracy score by reducing overfitting or not
#Performed L2 regularization, elasticnet 

In [429]:
sgd_mod1 = SGDClassifier(max_iter=1000, penalty='l2') 

sgd_mod1.fit(train_x, train_target)

In [430]:
train_y_pred = sgd_mod1.predict(train_x)

#Train accuracy

print('SGD_mod1_l2_Train_Accuracy: {}' .format(accuracy_score(train_target, train_target_pred)))

SGD_mod1_l2_Train_Accuracy: 0.9873341375150784


In [431]:
test_y_pred = sgd_mod1.predict(test_x)

#Test accuracy
print('SGD_mod1_l2_Test_Accuracy: {}' .format(accuracy_score(test_target, test_target_pred)))

SGD_mod1_l2_Test_Accuracy: 0.5077355836849508


In [432]:
sgd_mod1 = SGDClassifier(max_iter=1000, penalty='elasticnet',l1_ratio=0.5) 

sgd_mod1.fit(train_x, train_target)

In [433]:
train_y_pred = sgd_mod1.predict(train_x)

#Train accuracy
print('SGD_mod1_elast_Train_Accuracy: {}' .format(accuracy_score(train_target, train_target_pred)))

SGD_mod1_elast_Train_Accuracy: 0.9873341375150784


In [434]:
test_y_pred = sgd_mod1.predict(test_x)

#Test accuracy
print('SGD_mod1_elast_Test_Accuracy: {}' .format(accuracy_score(test_target, test_target_pred)))

SGD_mod1_elast_Test_Accuracy: 0.5077355836849508


In [435]:
#The test accuracy scores obtained after regularization techniques are similar to the intial model accuracy score. based on results the overfitting concern might not be resolved in this model

## SGD Model 2:

In [436]:
from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=3).fit(train_x)

train_x_poly = poly_features.transform(train_x)

test_x_poly = poly_features.transform(test_x)

In [437]:
sgd_class = SGDClassifier()
sgd_class.fit(train_x_poly,train_target)

In [438]:
train_target_pred = sgd_class.predict(train_x_poly)

#Train accuracy
print('SGD_mod2_poly_Train_Accuracy: {}' .format(accuracy_score(train_target, train_target_pred)))

SGD_mod2_poly_Train_Accuracy: 0.9553679131483716


In [439]:
test_target_pred = sgd_class.predict(test_x_poly)

#Test accuracy
print('SGD_mod2_poly_Test_Accuracy: {}' .format(accuracy_score(test_target, test_target_pred)))

SGD_mod2_poly_Test_Accuracy: 0.5049226441631505


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

In [440]:
#Overfitting exists in this model as the test accuracy score is low where as the train accuracy score is high.
#Performing L1,l2 and elasticnet regularization techniques to see if we can minimize the overfitting issue

In [441]:
sgd_class = SGDClassifier(penalty='l2')
sgd_class.fit(train_x_poly,train_target)

In [442]:
train_target_pred = sgd_class.predict(train_x_poly)

#Train accuracy
print('SGD_mod2_l2_Train_Accuracy: {}' .format(accuracy_score(train_target, train_target_pred)))

SGD_mod2_l2_Train_Accuracy: 0.8890229191797346


In [443]:
test_target_pred = sgd_class.predict(test_x_poly)

#Test accuracy
print('SGD_mod2_l2_Test_Accuracy: {}' .format(accuracy_score(test_target, test_target_pred)))

SGD_mod2_l2_Test_Accuracy: 0.4964838255977497


In [444]:
sgd_class = SGDClassifier(max_iter=1000,penalty='l1')
sgd_class.fit(train_x_poly,train_target)



In [445]:
train_target_pred = sgd_class.predict(train_x_poly)

#Train accuracy
print('SGD_mod2_l1_Train_Accuracy: {}' .format(accuracy_score(train_target, train_target_pred)))

SGD_mod2_l1_Train_Accuracy: 0.9837153196622437


In [446]:
test_target_pred = sgd_class.predict(test_x_poly)

#Test accuracy
print('SGD_mod2_l1_Test_Accuracy: {}' .format(accuracy_score(test_target, test_target_pred)))

SGD_mod2_l1_Test_Accuracy: 0.49929676511954996


In [447]:
sgd_class = SGDClassifier(penalty='elasticnet', l1_ratio=0.5,max_iter =1000,tol=0.1)
sgd_class.fit(train_x_poly,train_target)

In [448]:
train_target_pred = sgd_class.predict(train_x_poly)

#Train accuracy
print('SGD_mod2_elast_Train_Accuracy: {}' .format(accuracy_score(train_target, train_target_pred)))

SGD_mod2_elast_Train_Accuracy: 0.9891435464414958


In [449]:
test_target_pred = sgd_class.predict(test_x_poly)

#Test accuracy
print('SGD_mod2_elast_Test_Accuracy: {}' .format(accuracy_score(test_target, test_target_pred)))

SGD_mod2_elast_Test_Accuracy: 0.5035161744022504


In [450]:
#after comapring all regularization test accuracy scores though accuracy score is improved a bit.it did not minimize the model overfitting issue based on the results i can say the model overftting issue might not be resolved in this model 

## LogisticRegression Model:

In [451]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='liblinear')

log_reg.fit(train_x, train_target)

In [452]:
train_target_pred = log_reg.predict(train_x)

#Train accuracy
print('Log_Reg_Train_Accuracy: {}' .format(accuracy_score(train_target, train_target_pred)))

Log_Reg_Train_Accuracy: 0.6290711700844391


In [453]:
test_target_pred = log_reg.predict(test_x)

#Test accuracy
print('Log_Reg_Test_Accuracy: {}' .format(accuracy_score(test_target, test_target_pred)))

Log_Reg_Test_Accuracy: 0.5794655414908579


### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

In [454]:
#Logistic regression Model performed well. model overfitting is not an issue in this model as the differences between test and train accuracy scores is significantly low so model correction is not necessary in this model.

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)

**If the train/test values listed here do not match the outputs of models, you will lose points.**

## Which model performs the best and why? (1 point) 

Hint: The best model is the one that has the best TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## How does your best model compare to the baseline? (1 point)