In [2]:
## Put import statements here
import numpy as np
import pandas as pd

# Creating  a Machine Learning model

Today you will be creating your first machine learning model! There are many components to this creating these models. However, there is a general pipeline that you can follow and iterate over to simplify the model building process.

  1. Define the problem
  2. Prepare the data
  3. Spot check algorithms (to figure out the best ones)
  4. Improve results (usually requires going back to step 2 or 3)
  5. Present results
  
For a more detailed description of the results, visit <a href=http://machinelearningmastery.com/process-for-working-through-machine-learning-problems/>this website.</a>

Since we have been using the iris dataset a lot lately, we felt it was time to switch things up. Let's look at this dataset. It can be downloaded directly from the UCI Machine Learning repository. <a href = http://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION>Download the dataset here.</a> Once you have downloaded it, make sure the dataset is in the same folder as this ipython notebook. From there, we can begin working with it. 

## 1. Define the problem

To understand where this dataset might be useful, skim over these articles. They both show how machine learning can improve graduation rates by finding students at risk of dropping out. In this lab, we are going to take characteristics and grades for a group of students and see if we can predict whether they fall in low, medium, or high risk categories.

https://dssg.uchicago.edu/wp-content/uploads/2016/04/montogmery-kd2015.pdf

http://www.opb.org/news/article/npr-how-one-university-used-big-data-to-boost-graduation-rates/

Question 1: From the second article, by what percentage have graduation rates increased at Georgia State University since they implemented their new graduation and progression success (GPS) system and hired new advisors? 

## 2. Prepare the data

Skim over the student.txt file to better understand what is in this dataset. It is important to know where to find information about any of the variables in a dataset. We are just going to use student-por.csv for this labn. It contains data on the grades and characteristics of certain students in the class. Let's load the data.

In [7]:
import pandas as pd
student_grades = pd.read_csv('student-por.csv')

Make sure you check the dataframe using .head(). Is there something wrong? What can you do to fix this error? 

In [8]:
student_grades.head()

Unnamed: 0,school;sex;age;address;famsize;Pstatus;Medu;Fedu;Mjob;Fjob;reason;guardian;traveltime;studytime;failures;schoolsup;famsup;paid;activities;nursery;higher;internet;romantic;famrel;freetime;goout;Dalc;Walc;health;absences;G1;G2;G3
0,"GP;""F"";18;""U"";""GT3"";""A"";4;4;""at_home"";""teacher..."
1,"GP;""F"";17;""U"";""GT3"";""T"";1;1;""at_home"";""other"";..."
2,"GP;""F"";15;""U"";""LE3"";""T"";1;1;""at_home"";""other"";..."
3,"GP;""F"";15;""U"";""GT3"";""T"";4;2;""health"";""services..."
4,"GP;""F"";16;""U"";""GT3"";""T"";3;3;""other"";""other"";""h..."


In [10]:
#fix the delimiters

student_grades = pd.read_csv('student-por.csv',delimiter=';')
student_grades.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13


We are going to attempt to predict the final grade (G3 column). However, the scores range from 0 - 20. Thus, we will need to bin the values. Let's assume that we want our algorithm to flag anyone who may possibly score below a 10 on the final grade, to allow the teacher time to tutor or help the student boost their score. 

In [12]:
student_grades.G3

0      11
1      11
2      12
3      14
4      13
5      13
6      13
7      13
8      17
9      13
10     14
11     13
12     12
13     13
14     15
15     17
16     14
17     14
18      7
19     12
20     14
21     12
22     14
23     10
24     10
25     12
26     12
27     11
28     13
29     12
       ..
619    13
620    15
621    13
622     9
623    16
624     9
625    10
626     0
627    10
628    12
629     9
630    17
631    12
632     9
633    14
634    16
635     9
636    19
637     0
638    16
639     0
640     0
641    15
642    11
643    10
644    10
645    16
646     9
647    10
648    11
Name: G3, dtype: int64

Run this cell to create a variable that will flag a student with a score less than 10 with a 1, and all other students will be 0.

In [13]:
def categorize(val,high_risk):
    if val <= high_risk:
        return 1
    else:
        return 0
    
student_grades.loc[:,'flag_student'] = student_grades.loc[:,'G3'].map(lambda x: categorize(x,10))

In [15]:
student_grades.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3', 'flag_student'],
      dtype='object')

In [23]:
student_grades.Fjob.unique()

array(['teacher', 'other', 'services', 'health', 'at_home'], dtype=object)

In [25]:
student_grades_dummy = pd.get_dummies(student_grades)
student_grades_dummy.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,...,activities_no,activities_yes,nursery_no,nursery_yes,higher_no,higher_yes,internet_no,internet_yes,romantic_no,romantic_yes
0,18,4,4,2,2,0,4,3,4,1,...,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
1,17,1,1,1,2,0,5,3,3,1,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
2,15,1,1,1,2,0,4,3,2,2,...,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0
3,15,4,2,1,3,0,3,2,2,1,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
4,16,3,3,1,2,0,4,3,2,1,...,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0


In [116]:
student_grades_dummy.columns

Index(['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2',
       'G3', 'flag_student', 'school_GP', 'school_MS', 'sex_F', 'sex_M',
       'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A',
       'Pstatus_T', 'Mjob_at_home', 'Mjob_health', 'Mjob_other',
       'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health',
       'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course',
       'reason_home', 'reason_other', 'reason_reputation', 'guardian_father',
       'guardian_mother', 'guardian_other', 'schoolsup_no', 'schoolsup_yes',
       'famsup_no', 'famsup_yes', 'paid_no', 'paid_yes', 'activities_no',
       'activities_yes', 'nursery_no', 'nursery_yes', 'higher_no',
       'higher_yes', 'internet_no', 'internet_yes', 'romantic_no',
       'romantic_yes'],
      dtype='object')

'flag_students' will now be the column we are trying to predict. This is where your expertise kicks in! Choose which features to keep, and save them into the X variable (this will become our feature space). 

In [278]:
#We only need column for each X dummy variable (linear dependence) . Keep the yes columns, take other the 'other' jobs
#Also take out g1 and g2
X = student_grades_dummy.loc[:,['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 
        'school_MS', 'sex_F',
       'address_R',  'famsize_GT3', 'Pstatus_A',
        'Mjob_at_home', 'Mjob_health',
       'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health',
        'Fjob_services', 'Fjob_teacher', 'reason_course',
       'reason_home',  'reason_reputation', 'guardian_father',
       'guardian_mother',  'schoolsup_yes',
        'famsup_yes',  'paid_yes', 
       'activities_yes', 'nursery_yes', 
       'higher_yes',  'internet_yes',
       'romantic_yes']]
y = student_grades_dummy.loc[:,'flag_student']
np.shape(X)

(649, 39)

Bonus: Since KNN relies on distance, you cannot directly put categorical variables into the algorithm. If you want to include this type of information, you will first need to dummify the variables before putting them in the classifier. As an example, dummifying would take a column with 'yes' or 'no' and would change the 'yes' to a 1 and a 'no' to a zero. Try creating a method that will do this for you. 

# LDA

In [279]:
#two scatter matrices, within classes and between classes
#Start with between classes Sb
import numpy as np

mean_X = np.mean(X,axis=0)
mean_X = np.reshape(mean_X,(1,len(mean_X)))
print(np.shape(mean_X),np.shape(X))

(1, 39) (649, 39)


In [280]:
mean_X

array([[ 16.7442,   2.5146,   2.3066,   1.5686,   1.9307,   0.2219,
          3.9307,   3.1803,   3.1849,   1.5023,   2.2804,   3.5362,
          3.6595,   0.3482,   0.5901,   0.3035,   0.7042,   0.1233,
          0.208 ,   0.074 ,   0.2096,   0.1109,   0.0647,   0.0354,
          0.2789,   0.0555,   0.4391,   0.2296,   0.2203,   0.2357,
          0.7011,   0.1048,   0.6133,   0.0601,   0.4854,   0.8028,
          0.8937,   0.7673,   0.3683]])

In [281]:
#Filter for a particular class
student_grades_dummy[y==0].head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,...,activities_no,activities_yes,nursery_no,nursery_yes,higher_no,higher_yes,internet_no,internet_yes,romantic_no,romantic_yes
0,18,4,4,2,2,0,4,3,4,1,...,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
1,17,1,1,1,2,0,5,3,3,1,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
2,15,1,1,1,2,0,4,3,2,2,...,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0
3,15,4,2,1,3,0,3,2,2,1,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
4,16,3,3,1,2,0,4,3,2,1,...,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0


In [282]:
np.shape(X)

(649, 39)

In [283]:
#Mean vectors
np.set_printoptions(precision=4)

mean_vectors = []
for cl in range(2):
    mean_vectors.append(np.mean(X[y==cl], axis=0))
    print('Mean Vector class %s: %s\n' %(cl, mean_vectors[cl-1]))

Mean Vector class 0: age                  16.612832
Medu                  2.688053
Fedu                  2.469027
traveltime            1.508850
studytime             2.037611
failures              0.053097
famrel                3.957965
freetime              3.097345
goout                 3.130531
Dalc                  1.411504
Walc                  2.170354
health                3.495575
absences              3.212389
school_MS             0.252212
sex_F                 0.621681
address_R             0.258850
famsize_GT3           0.701327
Pstatus_A             0.123894
Mjob_at_home          0.172566
Mjob_health           0.079646
Mjob_services         0.216814
Mjob_teacher          0.141593
Fjob_at_home          0.055310
Fjob_health           0.035398
Fjob_services         0.258850
Fjob_teacher          0.073009
reason_course         0.389381
reason_home           0.256637
reason_reputation     0.256637
guardian_father       0.261062
guardian_mother       0.685841
schoolsup_yes     

In [284]:
#within class covariance
cov_within_classes = np.zeros((np.shape(X)[1],np.shape(X)[1]))
#print(np.shape(cov))
for i in range(0,2):
    X_centered = X[y==i] - np.mean(X[y==i])
    #print(np.shape(X[y==i] - np.mean(X[y==i]))) ## take the mean of every feature
    cov_within_classes +=X_centered.T.dot(X_centered)
cov_within_classes

array([[ 935.8344,  -62.6415,  -73.3264, ...,  -51.2062,   10.9417,
          61.8749],
       [ -62.6415,  789.3302,  481.6505, ...,   30.768 ,   74.0466,
          -2.7414],
       [ -73.3264,  481.6505,  744.7085, ...,   25.542 ,   47.187 ,
         -15.5509],
       ..., 
       [ -51.2062,   30.768 ,   25.542 , ...,   54.6347,    2.5135,
          -6.3187],
       [  10.9417,   74.0466,   47.187 , ...,    2.5135,  114.1913,
           6.2046],
       [  61.8749,   -2.7414,  -15.5509, ...,   -6.3187,    6.2046,
         149.4636]])

In [285]:
#Between class covariance

cov_between_classes = np.zeros((np.shape(X)[1],np.shape(X)[1]))

overall_mean = np.mean(X, axis=0)
print(overall_mean)

for i,mean_vec in enumerate(mean_vectors):  
    n = X[y==i].shape[0]
    print(n,'n')
    #print(i,'i')
    mean_vec = mean_vec.reshape(np.shape(X)[1],1) # make column vector
    #print(mean_vec,'mean ec')
    overall_mean = overall_mean.reshape(np.shape(X)[1],1) # make column vector
    
    cov_between_classes += n * (mean_vec - overall_mean).dot((mean_vec - overall_mean).T)

print('between-class Scatter Matrix:\n', cov_between_classes)


age                  16.744222
Medu                  2.514638
Fedu                  2.306626
traveltime            1.568567
studytime             1.930663
failures              0.221880
famrel                3.930663
freetime              3.180277
goout                 3.184900
Dalc                  1.502311
Walc                  2.280431
health                3.536210
absences              3.659476
school_MS             0.348228
sex_F                 0.590139
address_R             0.303544
famsize_GT3           0.704160
Pstatus_A             0.123267
Mjob_at_home          0.208012
Mjob_health           0.073960
Mjob_services         0.209553
Mjob_teacher          0.110940
Fjob_at_home          0.064715
Fjob_health           0.035439
Fjob_services         0.278891
Fjob_teacher          0.055470
reason_course         0.439137
reason_home           0.229584
reason_reputation     0.220339
guardian_father       0.235747
guardian_mother       0.701079
schoolsup_yes         0.104777
famsup_y

In [286]:
np.shape(mean_vectors)

(2, 39)

### Now, calculate the eigenvectors/eigenvalues

In [287]:


eig_vals, eig_vecs =  np.linalg.eig(np.linalg.inv(cov_within_classes).dot(cov_between_classes))

for i in range(len(eig_vals)):
    eigvec_sc = eig_vecs[:,i].reshape(np.shape(X)[1],1)
    print('\nEigenvector {}: \n{}'.format(i+1, eigvec_sc.real))
    print('Eigenvalue {:}: {:.2e}'.format(i+1, eig_vals[i].real))


Eigenvector 1: 
[[-0.8469]
 [-0.0065]
 [-0.0349]
 [-0.007 ]
 [-0.0246]
 [ 0.2255]
 [-0.0052]
 [ 0.0237]
 [-0.005 ]
 [ 0.0137]
 [ 0.0073]
 [ 0.0087]
 [ 0.0132]
 [ 0.2261]
 [-0.0899]
 [ 0.0245]
 [ 0.0043]
 [-0.0045]
 [ 0.0023]
 [ 0.0219]
 [-0.0343]
 [-0.086 ]
 [-0.0035]
 [ 0.1841]
 [ 0.0893]
 [-0.0115]
 [ 0.0612]
 [-0.0148]
 [ 0.0053]
 [ 0.0505]
 [ 0.1544]
 [ 0.1541]
 [ 0.0144]
 [-0.0204]
 [-0.0362]
 [ 0.0309]
 [-0.2417]
 [ 0.0033]
 [ 0.0507]]
Eigenvalue 1: 0.00e+00

Eigenvector 2: 
[[  2.1883e-04]
 [ -1.2161e-02]
 [ -6.5700e-02]
 [ -1.3217e-02]
 [ -4.6329e-02]
 [  4.2407e-01]
 [ -9.8257e-03]
 [  4.4481e-02]
 [ -9.4390e-03]
 [  2.5795e-02]
 [  1.3724e-02]
 [  1.6450e-02]
 [  2.4827e-02]
 [  4.2520e-01]
 [ -1.6899e-01]
 [  4.6021e-02]
 [  8.0969e-03]
 [ -8.5345e-03]
 [  4.3258e-03]
 [  4.1241e-02]
 [ -6.4492e-02]
 [ -1.6174e-01]
 [ -6.4878e-03]
 [  3.4624e-01]
 [  1.6784e-01]
 [ -2.1665e-02]
 [  1.1501e-01]
 [ -2.7811e-02]
 [  1.0058e-02]
 [  9.4933e-02]
 [  2.9039e-01]
 [  2.8978e-01]
 

In [288]:
np.linalg.eig?

In [289]:
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs = sorted(eig_pairs, key=lambda k: k[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues

print('Eigenvalues/vectors in decreasing order:\n')
for i in eig_pairs:
    print(i[0])
    print(i[1])


Eigenvalues/vectors in decreasing order:

0.584435095023
[  2.1883e-04+0.j  -1.2161e-02+0.j  -6.5700e-02+0.j  -1.3217e-02+0.j
  -4.6329e-02+0.j   4.2407e-01+0.j  -9.8257e-03+0.j   4.4481e-02+0.j
  -9.4390e-03+0.j   2.5795e-02+0.j   1.3724e-02+0.j   1.6450e-02+0.j
   2.4827e-02+0.j   4.2520e-01+0.j  -1.6899e-01+0.j   4.6021e-02+0.j
   8.0969e-03+0.j  -8.5345e-03+0.j   4.3258e-03+0.j   4.1241e-02+0.j
  -6.4492e-02+0.j  -1.6174e-01+0.j  -6.4878e-03+0.j   3.4624e-01+0.j
   1.6784e-01+0.j  -2.1665e-02+0.j   1.1501e-01+0.j  -2.7811e-02+0.j
   1.0058e-02+0.j   9.4933e-02+0.j   2.9039e-01+0.j   2.8978e-01+0.j
   2.7020e-02+0.j  -3.8334e-02+0.j  -6.8055e-02+0.j   5.8108e-02+0.j
  -4.5442e-01+0.j   6.1625e-03+0.j   9.5384e-02+0.j]
1.15723874474e-16
[ 0.0122+0.0404j  0.0120+0.1079j  0.0290-0.023j   0.0297-0.0094j
 -0.0951+0.0053j  0.2202-0.1234j  0.0086+0.0166j -0.0097+0.0706j
  0.0043-0.0208j -0.0575-0.0133j -0.0025+0.0038j -0.0129+0.0025j
 -0.0098-0.0012j  0.1239+0.2171j  0.2381+0.0117j -0.1132

### Above, there is only one eigenvalue (and corresponding eigenvector) that appear to account for the majority of explanatory power in the data (to predict at risk kids).

In [297]:
eig_pairs

[(0.58443509502306024,
  array([  2.1883e-04+0.j,  -1.2161e-02+0.j,  -6.5700e-02+0.j,
          -1.3217e-02+0.j,  -4.6329e-02+0.j,   4.2407e-01+0.j,
          -9.8257e-03+0.j,   4.4481e-02+0.j,  -9.4390e-03+0.j,
           2.5795e-02+0.j,   1.3724e-02+0.j,   1.6450e-02+0.j,
           2.4827e-02+0.j,   4.2520e-01+0.j,  -1.6899e-01+0.j,
           4.6021e-02+0.j,   8.0969e-03+0.j,  -8.5345e-03+0.j,
           4.3258e-03+0.j,   4.1241e-02+0.j,  -6.4492e-02+0.j,
          -1.6174e-01+0.j,  -6.4878e-03+0.j,   3.4624e-01+0.j,
           1.6784e-01+0.j,  -2.1665e-02+0.j,   1.1501e-01+0.j,
          -2.7811e-02+0.j,   1.0058e-02+0.j,   9.4933e-02+0.j,
           2.9039e-01+0.j,   2.8978e-01+0.j,   2.7020e-02+0.j,
          -3.8334e-02+0.j,  -6.8055e-02+0.j,   5.8108e-02+0.j,
          -4.5442e-01+0.j,   6.1625e-03+0.j,   9.5384e-02+0.j])),
 (1.157238744735696e-16,
  array([ 0.0122+0.0404j,  0.0120+0.1079j,  0.0290-0.023j ,  0.0297-0.0094j,
         -0.0951+0.0053j,  0.2202-0.1234j,  0.0086+0.

In [298]:
from matplotlib import pyplot as plt
%pylab inline

W = np.hstack((eig_pairs[0][1].reshape(39,1), eig_pairs[1][1].reshape(39,1)))

print('Matrix W:\n', W.real)

from matplotlib import pyplot as plt

X_lda = X.dot(W)





Populating the interactive namespace from numpy and matplotlib
Matrix W:
 [[  2.1883e-04   1.2176e-02]
 [ -1.2161e-02   1.2030e-02]
 [ -6.5700e-02   2.8976e-02]
 [ -1.3217e-02   2.9702e-02]
 [ -4.6329e-02  -9.5098e-02]
 [  4.2407e-01   2.2016e-01]
 [ -9.8257e-03   8.5547e-03]
 [  4.4481e-02  -9.6512e-03]
 [ -9.4390e-03   4.3496e-03]
 [  2.5795e-02  -5.7492e-02]
 [  1.3724e-02  -2.4827e-03]
 [  1.6450e-02  -1.2924e-02]
 [  2.4827e-02  -9.8332e-03]
 [  4.2520e-01   1.2395e-01]
 [ -1.6899e-01   2.3810e-01]
 [  4.6021e-02  -1.1321e-01]
 [  8.0969e-03  -2.1634e-02]
 [ -8.5345e-03  -1.1784e-02]
 [  4.3258e-03   4.0305e-02]
 [  4.1241e-02   5.2028e-02]
 [ -6.4492e-02  -7.1856e-02]
 [ -1.6174e-01   1.8477e-03]
 [ -6.4878e-03   3.9102e-03]
 [  3.4624e-01  -2.5466e-01]
 [  1.6784e-01   1.0087e-01]
 [ -2.1665e-02   1.1564e-01]
 [  1.1501e-01  -2.1763e-01]
 [ -2.7811e-02  -2.2699e-02]
 [  1.0058e-02  -5.2334e-02]
 [  9.4933e-02  -6.2478e-02]
 [  2.9039e-01  -3.7522e-01]
 [  2.8978e-01  -4.3322e-01

In [328]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
s = StandardScaler()

In [300]:
# Create an instance of the class
lda = LinearDiscriminantAnalysis(n_components = 1)

In [330]:
# Run a train test split on the data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.25)

In [339]:
s_X_train = s.fit_transform(X_train)
s_X_test = s.fit_transform(X_test)
y_train = y_train


In [340]:
# Fit the model to the training data, reduce the dimensionality of the training
lda.fit_transform(X_train,y_train)

# Print the accuracy score for the training set
print ("Accuracy Score Training Set: {}".format(accuracy_score(y_train,lda.predict(s_X_train))))

# Print the accuracy score for the testing set
print ("Accuracy Score Test Set: {}".format(accuracy_score(y_test,lda.predict(s_X_test))))

Accuracy Score Training Set: 0.7818930041152263
Accuracy Score Test Set: 0.754601226993865


## By keeping one component in LDA, we achieve an accuracy score of ~76% for our test set for normalized data.

- Try two components

In [341]:
# Create an instance of the class
lda = LinearDiscriminantAnalysis(n_components = 2)


# Fit the model to the training data
lda.fit_transform(X_train,y_train)

# Print the accuracy score for the training set
print ("Accuracy Score Training Set: {}".format(accuracy_score(y_train,lda.predict(s_X_train))))

# Print the accuracy score for the testing set
print ("Accuracy Score Test Set: {}".format(accuracy_score(y_test,lda.predict(s_X_test))))

Accuracy Score Training Set: 0.7818930041152263
Accuracy Score Test Set: 0.754601226993865


### - Accuracy goes down with two components.

# 3. Spot check algorithms

## PCA - use in conjunction with KNN

In [304]:
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

In [342]:
#Try one comonent
pca = PCA(1)

knn = KNeighborsClassifier(5) #Set k

# Transform our data
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.fit_transform(X_test)


knn.fit(X_train_pca,y_train)

# Print the accuracy score for the training set from PCA
print ("Accuracy Score Training Set: {}".format(accuracy_score(y_train,knn.predict(X_train_pca))))

# Print the accuracy score for the testing set from PCA
print ("Accuracy Score Test Set: {}".format(accuracy_score(y_test,knn.predict(X_test_pca))))

Accuracy Score Training Set: 0.7633744855967078
Accuracy Score Test Set: 0.6932515337423313


In [343]:
# tryy k -10 for one component of PCA



#Try one comonent
pca = PCA(1)

knn = KNeighborsClassifier(10) #Set k

# Transform our data
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.fit_transform(X_test)


knn.fit(X_train_pca,y_train)

# Print the accuracy score for the training set from PCA
print ("Accuracy Score Training Set: {}".format(accuracy_score(y_train,knn.predict(X_train_pca))))

# Print the accuracy score for the testing set from PCA
print ("Accuracy Score Test Set: {}".format(accuracy_score(y_test,knn.predict(X_test_pca))))

Accuracy Score Training Set: 0.7242798353909465
Accuracy Score Test Set: 0.6993865030674846


- Lower accuracy for k = 10

In [344]:
#Try two components
pca = PCA(2)

In [345]:
# Transform our data
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.fit_transform(X_test)

In [346]:
knn = KNeighborsClassifier(5)

In [347]:
knn.fit(X_train_pca,y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [348]:
# Print the accuracy score for the training set from PCA
print ("Accuracy Score Training Set: {}".format(accuracy_score(y_train,knn.predict(X_train_pca))))

# Print the accuracy score for the testing set from PCA
print ("Accuracy Score Test Set: {}".format(accuracy_score(y_test,knn.predict(X_test_pca))))

Accuracy Score Training Set: 0.7839506172839507
Accuracy Score Test Set: 0.6687116564417178


- Try K  = 10

In [349]:
knn = KNeighborsClassifier(10)
knn.fit(X_train_pca,y_train)
# Print the accuracy score for the training set from PCA
print ("Accuracy Score Training Set: {}".format(accuracy_score(y_train,knn.predict(X_train_pca))))

# Print the accuracy score for the testing set from PCA
print ("Accuracy Score Test Set: {}".format(accuracy_score(y_test,knn.predict(X_test_pca))))

Accuracy Score Training Set: 0.7263374485596708
Accuracy Score Test Set: 0.656441717791411


- Try k = 20

In [350]:
knn = KNeighborsClassifier(23)
knn.fit(X_train_pca,y_train)
# Print the accuracy score for the training set from PCA
print ("Accuracy Score Training Set: {}".format(accuracy_score(y_train,knn.predict(X_train_pca))))

# Print the accuracy score for the testing set from PCA
print ("Accuracy Score Test Set: {}".format(accuracy_score(y_test,knn.predict(X_test_pca))))

Accuracy Score Training Set: 0.7119341563786008
Accuracy Score Test Set: 0.6687116564417178


## Using  LDA has higher acuracy than KNN with PCA (76% vs 70%).

For now, we will use accuracy to improve upon our model. We want to maximize the accuracy in both the training and testing set. Play around and see how high you can get the scores! Watch out though, scores that are too high (such as 100% accuracy) can sometimes be flags for leakage and other improper modeling techniques. While using PCA or LDA, make sure to use the following pipeline. 

 1. Train/Test split
 2. Dimensionality reduction on training set
 3. Fit model to training set
 4. Accuracy of model on training set
 5. Dimensionality reduction on testing set
 6. Accuracy of model on testing set

Use LDA, PCA, and KNN to make a classifier to predict using attributes that suggest a student may be at high risk for under-performing in the course. Note: LDA can be used for dimensionality reduction and classification. 

# 4. Improve Results

## There are a few things we can do to maximize the score. One thing is to tune different parameters. Parameters can be number of components, number of nearest neighbors, which distance function to use, and so on. Change these numbers and see how the accuracy changes with them.

Bonus: Check out <a href=http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html>GridsearchCV</a>. This will allow you to choose combinations of parameters and it will let you know which one is the best. It is super powerful!

In [314]:
from sklearn.grid_search import GridSearchCV

In [351]:


params = {'n_neighbors':[i for i in range(1,100)]}


In [355]:
knn = KNeighborsClassifier()
grid = GridSearchCV(estimator=knn, param_grid = params)
grid.fit(X_train_pca,y_train)
print(grid)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


In [356]:
grid.best_params_

{'n_neighbors': 53}

In [357]:
# 53 is the best k for the pca transform data. Try this

In [358]:
knn = KNeighborsClassifier(53)
knn.fit(X_train_pca,y_train)
# Print the accuracy score for the training set from PCA
print ("Accuracy Score Training Set: {}".format(accuracy_score(y_train,knn.predict(X_train_pca))))

# Print the accuracy score for the testing set from PCA
print ("Accuracy Score Test Set: {}".format(accuracy_score(y_test,knn.predict(X_test_pca))))

Accuracy Score Training Set: 0.7037037037037037
Accuracy Score Test Set: 0.6993865030674846


In [318]:
import sklearn.metrics as met
# quickly evaluate models
def model_eval(model, X_test):
    pred = model.predict(X_test)
    print('Accuracy:\n-----------------------------')
    print(met.accuracy_score(y_test, pred))
    print('\nConfusion Matrix:\n-----------------------------')
    print(met.confusion_matrix(y_test, pred))
    print('\nClassification Report:\n-----------------------------')
    print(met.classification_report(y_test, pred))

In [319]:
knn = KNeighborsClassifier()
knn.fit(X_train_pca,y_train)
model_eval(knn,X_test_pca) ## evaluate knn

Accuracy:
-----------------------------
0.650306748466

Confusion Matrix:
-----------------------------
[[92 27]
 [30 14]]

Classification Report:
-----------------------------
             precision    recall  f1-score   support

          0       0.75      0.77      0.76       119
          1       0.34      0.32      0.33        44

avg / total       0.64      0.65      0.65       163



# 5. Present result

For a company, this usually includes a slide show or presentation on what your findings were. In this case, you will not have to do that. Although, you may want to think about these aspects of your model. 

 1. Are there ethical concerns with trying to find high risk students this way?
 2. Is there a possibility of neglecting the high performing students? What would the implications of this be?
 3. Would it be beneficial to allow a parent to have access to this information so that they can be informed when their student is flagged for possibly being at risk of failing the course? 
 
There are no right or wrong answers to these questions, but they are good to think about. You do have to provide a thoughtful reponse to atleast one of these questions. 

- ## 
>1) There are ethical concerns with trying to identify students based upon their G3 grade. Would could happen, is that our model might signal out a particular race or socioeconomic class for being at high risk. This could be from a biased training set, or the parameters of the model. However, as an educator, it is not possible to focus on one race (or socioeconomic class) and say they are at risk of failing (or falling behind in) school. At a certain point, this becomes a self-fulfilling prophecy where students who meet this criteria may not even try in school because they think they are going to fail.