# Student Intervention System

About the project: given student data, identify students who might need early intervention before they fail to graduate.

This is a classification problem. We are given student data and must determine whether each student needs intervention or not. In a classification problem, each example belongs to one of two or more classes, and these labeled examples are fed to the learning algorithm as training data.

In [2]:
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"

Student data read successfully!


In [2]:
# TODO: Calculate number of students
n_students = len(student_data.index)

# TODO: Calculate number of features
n_features = len(student_data.columns) - 1

# TODO: Calculate passing students
n_passed = len(student_data[student_data.passed=="yes"])

# TODO: Calculate failing students
n_failed = len(student_data[student_data.passed=="no"])

# TODO: Calculate graduation rate
grad_rate = (n_passed*100*1.0)/n_students

# Print the results
print "Total number of students: {}".format(n_students)
print "Number of features: {}".format(n_features)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%


In [3]:
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print "Feature columns:\n{}".format(feature_cols)
print "\nTarget column: {}".format(target_col)

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print "\nFeature values:"
print X_all.head()

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

In [4]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.iteritems():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print "Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
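
The jump from 30 to 48 columns follows from the dummy encoding: the 21 numeric or yes/no columns stay as single columns, while the nine remaining text columns expand into 27 dummy columns (5 × 2 + 5 + 5 + 4 + 3), and 21 + 27 = 48. As a minimal sketch of what that expansion does to one column, here is a plain-Python stand-in for the `pd.get_dummies` call (the `one_hot` helper and the four sample values are hypothetical, made up for illustration):

```python
def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one column per distinct value."""
    categories = sorted(set(values))
    return {"{}_{}".format(prefix, c): [1 if v == c else 0 for v in values]
            for c in categories}

# Hypothetical slice of the 'Mjob' column (not taken from the real file)
mjob = ['at_home', 'health', 'other', 'at_home']
encoded = one_hot(mjob, prefix='Mjob')

for name in sorted(encoded):
    print("{}: {}".format(name, encoded[name]))
# Mjob_at_home: [1, 0, 0, 1]
# Mjob_health: [0, 1, 0, 0]
# Mjob_other: [0, 0, 1, 0]
```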


In [6]:
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split

# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset
sss = StratifiedShuffleSplit(y_all,test_size=num_test, random_state=0)
for train_index, test_index in sss:
    X_train, X_test = X_all.iloc[train_index], X_all.iloc[test_index]
    # Use positional indexing for the labels too, matching X_all.iloc above
    y_train, y_test = y_all.iloc[train_index], y_all.iloc[test_index]
print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data

Training set: 300 samples
Test set: 95 samples
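
To illustrate what the stratification in the split above is meant to achieve, the following is a minimal standard-library sketch, not scikit-learn's implementation. The `stratified_split` helper is hypothetical; the class counts (265 'yes', 130 'no') are the ones reported earlier in this notebook. Each class is shuffled and split separately so that the pass/fail ratio is roughly the same in both sets.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with roughly equal class ratios in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # shuffle within the class
        n_test = int(round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same class counts as the student data
labels = ['yes'] * 265 + ['no'] * 130
train_idx, test_idx = stratified_split(labels, test_fraction=95 / 395.0)

test_yes = sum(1 for i in test_idx if labels[i] == 'yes')
print("train={} test={}".format(len(train_idx), len(test_idx)))
print("share of 'yes' in test set: {:.4f}".format(test_yes / float(len(test_idx))))
```

With these counts the test set keeps about 67% passing students, close to the 67.09% graduation rate of the whole class.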


## Training & Evaluating Models

I chose the following four supervised learning models available in scikit-learn for the student data:

    1) Support Vector Machines (SVM)

    2) Gaussian Naive Bayes (GaussianNB)

    3) Logistic Regression

    4) Random Forest

### Support Vector Machines
How an SVM works can be explained with the help of the maximal-margin classifier. The input variables (columns) of the data, 30 in the raw data and 48 after preprocessing, span a 48-dimensional input space. A hyperplane is a flat surface of one dimension less (a line in two dimensions) that splits this space, and it is chosen to separate the input points by their class. The margin is the perpendicular distance from the hyperplane to the closest points only, and the maximal-margin classifier picks the hyperplane for which this margin is largest. Those closest points are referred to as support vectors. Here, the SVM learns from the training data to assign each student one of the two labels of the `passed` column: "no" (the student is likely to fail and needs intervention) or "yes" (no intervention needed).
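
The margin described above can be computed by hand for a small case. The sketch below assumes a hypothetical 2-D hyperplane and four made-up points; the hyperplane is fixed manually rather than learned, so this only illustrates the distance calculation, not SVM training.

```python
import math

def signed_distance(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (dot + b) / norm

# Hypothetical hyperplane x1 + x2 - 3 = 0 and four labeled toy points
w, b = [1.0, 1.0], -3.0
points = [([0.0, 0.0], -1), ([1.0, 1.0], -1), ([2.0, 2.0], +1), ([3.0, 3.0], +1)]

# Multiplying by the label makes the distance positive for correctly
# classified points; the margin is the smallest such distance.
distances = [label * signed_distance(w, b, x) for x, label in points]
margin = min(distances)

# The points that sit exactly at the margin play the role of support vectors
support_vectors = [x for (x, _), d in zip(points, distances)
                   if abs(d - margin) < 1e-9]

print("margin = {:.4f}".format(margin))           # margin = 0.7071
print("closest points: {}".format(support_vectors))
```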

Advantages: 

    1) By introducing a kernel, SVMs gain flexibility in the form of the decision boundary that separates the input points.
    2) The classes need not be linearly separable, and the boundary need not have the same functional form everywhere, since a kernel SVM is non-parametric and operates locally.
    3) The SVM is an effective tool in high-dimensional spaces, which makes it particularly applicable to document classification and sentiment analysis, where the dimensionality can be extremely large (≥10^6).

Disadvantages:

    1) SVMs scale poorly to large datasets, as the time complexity of training them lies between O(N^2) and O(N^3) in the number of samples N.
    2) In situations where the number of features greatly exceeds the number of training samples, SVMs can perform poorly. Intuitively, if the high-dimensional feature space is much larger than the sample, there are fewer effective support vectors to pin down the optimal hyperplane.
    3) The results suffer when the classes overlap or the data contains a lot of noise.

In [7]:
import time

def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    print "Training model {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    
    # Print the results
    print "Model training completed in {:.4f} seconds".format(end - start)
    
from sklearn import svm
clf = svm.SVC()

# Call the helper above to fit the SVM on the training data
train_classifier(clf,X_train,y_train)
print clf

Training model SVC...
Model training completed in 0.0101 seconds
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [8]:
from sklearn.metrics import f1_score
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    
    # Print and return results
    print "Predictions made in {:.4f} seconds.".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')

trained_f1_score = predict_labels(clf,X_train,y_train)
print "F1 score for training dataset: {}".format(trained_f1_score)

Predicting labels using SVC...
Predictions made in 0.0068 seconds.
F1 score for training dataset: 0.867678958785


In [9]:
#finding f1 score for test data
print "F1 score for test dataset: {}".format(predict_labels(clf,X_test,y_test))

Predicting labels using SVC...
Predictions made in 0.0032 seconds.
F1 score for test dataset: 0.808219178082
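
As a rough reference point for reading these F1 scores, it helps to know what a trivial classifier would score. The sketch below uses the whole-dataset counts reported earlier (265 passed, 130 failed), not the exact composition of the 95-sample test split, and the `f1_from_counts` helper is a hypothetical name introduced for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts reported earlier in this notebook
n_passed, n_failed = 265, 130

# A classifier that always predicts 'yes' gets every passing student right
# (true positives) and mislabels every failing student (false positives).
baseline_f1 = f1_from_counts(tp=n_passed, fp=n_failed, fn=0)
print("All-'yes' baseline F1: {:.4f}".format(baseline_f1))   # 0.8030
```

Because about two thirds of the students passed, always answering "yes" already gives an F1 of roughly 0.80 with `pos_label='yes'`, so scores near that value indicate only a small gain over the trivial baseline.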


In [10]:
from sklearn.decomposition import PCA
import pylab as pl

def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    
    # Indicate the classifier and the training set size
    print "Training a {} using a training set of size {}. . .".format(clf.__class__.__name__, len(X_train))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print "F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test))
    print "\n\n"
    
train_predict(clf,X_train[:100],y_train[:100],X_test, y_test)
train_predict(clf,X_train[:200],y_train[:200],X_test, y_test)
train_predict(clf,X_train[:300],y_train[:300],X_test, y_test)

Training a SVC using a training set of size 100. . .
Training model SVC...
Model training completed in 0.0023 seconds
Predicting labels using SVC...
Predictions made in 0.0011 seconds.
F1 score for training set: 0.8383.
Predicting labels using SVC...
Predictions made in 0.0014 seconds.
F1 score for test set: 0.8050.



Training a SVC using a training set of size 200. . .
Training model SVC...
Model training completed in 0.0045 seconds
Predicting labels using SVC...
Predictions made in 0.0033 seconds.
F1 score for training set: 0.8371.
Predicting labels using SVC...
Predictions made in 0.0020 seconds.
F1 score for test set: 0.8344.



Training a SVC using a training set of size 300. . .
Training model SVC...
Model training completed in 0.0080 seconds
Predicting labels using SVC...
Predictions made in 0.0054 seconds.
F1 score for training set: 0.8677.
Predicting labels using SVC...
Predictions made in 0.0025 seconds.
F1 score for test set: 0.8082.





### Gaussian Naive Bayes (Gaussian NB)
Gaussian NB builds on Bayes' theorem (which provides a way to calculate the probability of a hypothesis given our prior knowledge) and on the Naive Bayes classifier (a classification algorithm for binary or multi-class problems that combines class probabilities with the conditional probabilities of each feature given the class). In addition to the probability of each class, the Gaussian NB algorithm calculates the mean and standard deviation of every input feature within each class and uses them to summarize that feature's distribution as a Gaussian.
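
The description above can be turned into a small standard-library sketch. This is an illustration under stated assumptions, not scikit-learn's `GaussianNB`: the helper names, the toy `[absences, failures]` rows, and the small constant added to each variance to avoid division by zero are all made up here.

```python
import math

def fit_gaussian_nb(X, y):
    """Store the prior plus per-feature mean and variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = float(len(rows))
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9  # assumed smoothing
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(y), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Pick the class with the largest log prior + summed log likelihoods."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for v, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Hypothetical toy data: [absences, failures] -> passed
X = [[2, 0], [4, 0], [3, 1], [20, 2], [25, 3], [18, 2]]
y = ['yes', 'yes', 'yes', 'no', 'no', 'no']

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [5, 0]))    # yes
print(predict_gaussian_nb(model, [22, 2]))   # no
```

Summing log likelihoods per feature is where the "naive" independence assumption enters: each feature contributes its own one-dimensional Gaussian term.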

Advantages:

    1) A fairly simple method that mostly involves counting and needs only a small amount of training data. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge more quickly than discriminative models like logistic regression, so you need less training data.
    
    2) Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
    
Disadvantages:

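    1) Despite the above advantages, Naive Bayes is a poor estimator: its predicted class probabilities are often badly calibrated, even when the predicted class is right.
    
    2) If many of the features depend strongly on one another, the independence assumption is violated and performance is poor.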

### Logistic Regression
Logistic regression models the probability of the default class. For example, if we are modeling people's sex as male or female from their height, then the default class could be male and the logistic regression model could be written as the probability of male given a person's height, or more formally: P(sex=male|height). To actually make a class prediction, this probability must be transformed into a binary value (0 or 1), typically by thresholding it. Logistic regression is a linear method, but the predictions are transformed using the logistic function. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features).
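
The height example can be sketched in a few lines. The coefficients below are hypothetical values chosen by hand for illustration; they are not fitted to any data, and the 0.5 threshold is the usual default rather than something specified in this notebook.

```python
import math

def logistic(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Linear combination of the inputs, squashed by the logistic function."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(z)

# Hypothetical coefficients for a single feature (height in cm)
weights, bias = [0.2], -34.0

for height in (160, 175, 190):
    p = predict_proba(weights, bias, [height])
    label = 1 if p >= 0.5 else 0   # threshold the probability to get a class
    print("height={} P(male|height)={:.3f} -> class {}".format(height, p, label))
```

The linear part `bias + w * height` is what makes the method linear; the logistic function only rescales that score into a probability.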

Advantages:

    1) Unlike decision trees or SVMs, the model can be updated easily. If you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model, then this method is beneficial.
    
    2) Logistic regression works best when there is a single decision boundary, which need not be parallel to an axis.
    
Disadvantages:

    1) Logistic regression attempts to predict outcomes based on a set of independent variables, but if researchers include the wrong independent variables, the model will have little to no predictive value. 
    
    2) It requires that each data point be independent of all other data points. If observations are related to one another, then the model will tend to overweight the significance of those observations. 

### Random Forest
Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. With a few exceptions, a random-forest classifier has all the hyperparameters of a decision-tree classifier and also all the hyperparameters of a bagging classifier, which control the ensemble itself. Instead of building a bagging classifier and passing it a decision-tree classifier as its base estimator, you can just use the random-forest classifier class, which is more convenient and optimized for decision trees.
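
The idea of building several trees on resampled data and merging their votes can be sketched with the standard library. This is a heavily simplified stand-in, not scikit-learn's `RandomForestClassifier`: each "tree" is a depth-1 stump on one randomly chosen feature, and the helper names and toy `[absences, failures]` rows are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted(set(row[feature] for row in X)):
        left = [t for row, t in zip(X, y) if row[feature] <= threshold]
        right = [t for row, t in zip(X, y) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(t != left_label for t in left) +
                  sum(t != right_label for t in right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]

def stump_predict(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(X, y, n_trees=25, seed=0):
    """Each stump sees a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(X)) for _ in range(len(X))]  # with replacement
        Xb, yb = [X[i] for i in sample], [y[i] for i in sample]
        forest.append(fit_stump(Xb, yb, rng.randrange(len(X[0]))))
    return forest

def forest_predict(forest, row):
    """Majority vote over the individual stump predictions."""
    votes = Counter(stump_predict(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: [absences, failures] -> passed
X = [[2, 0], [4, 0], [3, 1], [20, 2], [25, 3], [18, 2]]
y = ['yes', 'yes', 'yes', 'no', 'no', 'no']

forest = fit_forest(X, y)
print(forest_predict(forest, [2, 0]))
print(forest_predict(forest, [25, 3]))
```

An individual stump trained on an unlucky bootstrap sample can be wrong, which is why the vote over many of them tends to be more stable than any single one.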

Advantages:

    1) One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems.
    
    2) It is very easy to measure the relative importance of each feature on the prediction.
    
    3) It handles high-dimensional spaces as well as a large number of training examples really well.
    
    4) Random forest runtimes are quite fast, and they are able to deal with unbalanced and missing data.
    
Disadvantages:

    1) Random forests tend to overestimate the low values and underestimate the high values. This is because the response from random forests in the case of regression is the average (mean) of all of the trees.
    
    2) When used for regression, they cannot predict beyond the range of the training data, and they may overfit data sets that are particularly noisy.


In [12]:
from sklearn.naive_bayes import GaussianNB
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier

clfA = GaussianNB()
train_predict(clfA,X_train[:100],y_train[:100],X_test, y_test)
train_predict(clfA,X_train[:200],y_train[:200],X_test, y_test)
train_predict(clfA,X_train[:300],y_train[:300],X_test, y_test)

clfB = linear_model.LogisticRegression(C=1e5)
train_predict(clfB,X_train[:100],y_train[:100],X_test, y_test)
train_predict(clfB,X_train[:200],y_train[:200],X_test, y_test)
train_predict(clfB,X_train[:300],y_train[:300],X_test, y_test)

clfC = RandomForestClassifier(n_estimators=10)
train_predict(clfC,X_train[:100],y_train[:100],X_test, y_test)
train_predict(clfC,X_train[:200],y_train[:200],X_test, y_test)
train_predict(clfC,X_train[:300],y_train[:300],X_test, y_test)

Training a GaussianNB using a training set of size 100. . .
Training model GaussianNB...
Model training completed in 0.0025 seconds
Predicting labels using GaussianNB...
Predictions made in 0.0017 seconds.
F1 score for training set: 0.8163.
Predicting labels using GaussianNB...
Predictions made in 0.0009 seconds.
F1 score for test set: 0.8160.



Training a GaussianNB using a training set of size 200. . .
Training model GaussianNB...
Model training completed in 0.0027 seconds
Predicting labels using GaussianNB...
Predictions made in 0.0015 seconds.
F1 score for training set: 0.7839.
Predicting labels using GaussianNB...
Predictions made in 0.0013 seconds.
F1 score for test set: 0.7520.



Training a GaussianNB using a training set of size 300. . .
Training model GaussianNB...
Model training completed in 0.0032 seconds
Predicting labels using GaussianNB...
Predictions made in 0.0026 seconds.
F1 score for training set: 0.7781.
Predicting labels using GaussianNB...
Predictions made in 0.0

In [17]:
from sklearn import grid_search
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split

clf = svm.SVC()
param_grid = [
  {'C': [1,10, 50, 100, 200, 250, 300, 350, 400, 500, 600],
    'kernel':['rbf','poly','sigmoid'],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001, 0.00001],  # duplicates removed; same set of values
     'tol':[0.01,0.001,0.0001,0.00001,0.0000001]
  }
 ]

f1_scorer = make_scorer(f1_score, pos_label="yes")
# The search wraps a classifier, so use classifier-neutral names
grid_obj = grid_search.GridSearchCV(clf, param_grid, cv=5, scoring=f1_scorer)
grid_obj.fit(X_train, y_train)
best_clf = grid_obj.best_estimator_
print best_clf
print "\n"
train_predict(best_clf, X_train, y_train, X_test, y_test)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.01, verbose=False)


Training a SVC using a training set of size 300. . .
Training model SVC...
Model training completed in 0.0091 seconds
Predicting labels using SVC...
Predictions made in 0.0067 seconds.
F1 score for training set: 0.9781.
Predicting labels using SVC...
Predictions made in 0.0029 seconds.
F1 score for test set: 0.8153.



