# Classification and Prediction
Author: Simon

## Introduction
Many universities have a problem with students over-enrolling in courses at the beginning of the semester and then dropping most of them as they decide which classes to attend. This makes it difficult to plan for the semester and allocate resources. However, schools don’t want to restrict the choice of their students. One solution is to predict which students are likely to drop out of which courses and use these predictions to inform semester planning.

### Objectives
In this assignment you will model student data using at least two classification algorithms. We will:

1. use multiple algorithms to attempt to predict which students drop out of courses,
2. evaluate the performance of the algorithms,
3. discuss the advantages and disadvantages of each algorithm.

### Data

The data comes from a university registrar’s office. Your task is to build a classifier that predicts whether a student will drop out of a course (i.e., the complete variable) given a set of features that you choose.



* student_id = Student ID
* years = Number of years the student has been enrolled in their program of study
* entrance_test_score = Entrance exam test score
* courses_taken = Number of courses a student has taken during their program
* complete = Whether or not a student completed a course or dropped out (yes = completed)
* enroll_date_time = Date and time the student enrolled, in POSIXct format
* course_id = Course ID
* international = Is the student from overseas
* online = Is the student only taking online courses
* gender = One of five possible gender identities







## Project Write-up Guide

### Recap and articulate the project

In this project, I will use multiple algorithms to predict whether a student will drop out of a course, and then evaluate each algorithm.

### Feature engineering

I will perform the Extra Trees Classifier and ...

The Extra Trees Classifier is an ensemble learning method that combines the predictions of many decision trees.

### Algorithms
Logistic Regression, Support Vector Machine, Naive Bayes

## Implementation

### Import Data

In [10]:
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

# Import the Google Sheet into a pandas DataFrame.
worksheet = gc.open('drop-out').sheet1
rows = worksheet.get_all_values()

import pandas as pd
data = pd.DataFrame.from_records(rows)
# Use the first row of the sheet as the column header
new_header = data.iloc[0]
# Keep the remaining rows as the data
data = data[1:]
data.columns = new_header
# Show the first five rows of the dataset
data.head()

| | student_id | years | entrance_test_score | courses_taken | complete | enroll_date_time | course_id | international | online | gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 172777 | 0 | 47.0 | 4 | yes | 159227767 | 807728 | no | no | 1 |
| 2 | 172777 | 0 | 47.0 | 4 | yes | 159227782 | 658434 | no | no | 1 |
| 3 | 172777 | 0 | 47.0 | 4 | yes | 159227866 | 658463 | no | no | 1 |
| 4 | 172777 | 0 | 47.0 | 4 | yes | 159227948 | 658498 | no | no | 1 |
| 5 | 175658 | 0 | 92.8 | 22 | yes | 157446419 | 807728 | no | no | 1 |


In [11]:
data.describe()

| | student_id | years | entrance_test_score | courses_taken | complete | enroll_date_time | course_id | international | online | gender |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 5861 | 5861 | 5861 | 5861 | 5861 | 5861 | 5861 | 5861 | 5861 | 5861 |
| unique | 682 | 8 | 222 | 43 | 2 | 5861 | 35 | 2 | 2 | 5 |
| top | 240300 | 0 | 0 | 4 | yes | 159227767 | 658434 | no | no | 2 |
| freq | 60 | 4871 | 2528 | 476 | 4137 | 1 | 872 | 4997 | 4478 | 2805 |
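The `count`, `unique`, `top`, and `freq` rows above are what `describe()` reports for non-numeric (object) columns, which suggests that every value was read from the sheet as a string. Below is a minimal sketch of converting the numeric columns, using a small made-up frame in place of the real sheet. The list of columns treated as numeric is an assumption.

```python
import pandas as pd

# Toy stand-in for the sheet contents: every value arrives as a string.
raw = pd.DataFrame({
    "years": ["0", "0", "1"],
    "entrance_test_score": ["47.0", "92.8", "0"],
    "courses_taken": ["4", "22", "3"],
    "complete": ["yes", "yes", "no"],
})

# Assumed list of columns that should be numeric.
numeric_cols = ["years", "entrance_test_score", "courses_taken"]

# errors="coerce" turns unparseable cells into NaN instead of raising.
converted = raw.copy()
converted[numeric_cols] = converted[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
```

After the conversion, `describe()` reports numeric summaries (mean, standard deviation, quartiles) for these columns.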


### Data processing
Create three dummy variables: international, online, and complete.

In [12]:
# create international dummy variable.
dummyInternational = pd.get_dummies(data['international'], prefix = 'international')
data = pd.concat([data, dummyInternational], axis=1)
data = data.drop(['international', 'international_no'], axis=1)

# Create online dummy variable.
dummyOnline = pd.get_dummies(data['online'], prefix = 'online')
data = pd.concat([data, dummyOnline], axis=1)
data = data.drop(['online', 'online_no'], axis=1)

# create a dummy variable for complete
dummyComplete = pd.get_dummies(data['complete'], prefix = 'complete') 
data = pd.concat([data, dummyComplete], axis=1)
data = data.drop(['complete', 'complete_no'], axis=1)

# preview the data
data

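| | student_id | years | entrance_test_score | courses_taken | enroll_date_time | course_id | gender | international_yes | online_yes | complete_yes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 172777 | 0 | 47 | 4 | 159227767 | 807728 | 1 | 0 | 0 | 1 |
| 2 | 172777 | 0 | 47 | 4 | 159227782 | 658434 | 1 | 0 | 0 | 1 |
| 3 | 172777 | 0 | 47 | 4 | 159227866 | 658463 | 1 | 0 | 0 | 1 |
| 4 | 172777 | 0 | 47 | 4 | 159227948 | 658498 | 1 | 0 | 0 | 1 |
| 5 | 175658 | 0 | 92.8 | 22 | 157446419 | 807728 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5857 | 295097 | 0 | 0 | 3 | 159550455 | 807728 | 2 | 0 | 0 | 1 |
| 5858 | 295097 | 0 | 0 | 3 | 159550594 | 658434 | 2 | 0 | 0 | 0 |
| 5859 | 295097 | 0 | 0 | 3 | 159551590 | 658463 | 2 | 0 | 0 | 0 |
| 5860 | 299198 | 0 | 0 | 2 | 160838118 | 807728 | 2 | 0 | 0 | 1 |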
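The three encoding steps above follow the same pattern, so they could be collapsed into a single call. Below is a sketch on a made-up frame, assuming each column contains only `yes` and `no`. With `drop_first=True`, pandas drops the first level in sorted order (`no`), which leaves the same `*_yes` columns.

```python
import pandas as pd

# Made-up rows that mimic the three yes/no columns.
toy = pd.DataFrame({
    "international": ["no", "yes", "no"],
    "online": ["yes", "no", "no"],
    "complete": ["yes", "no", "yes"],
})

# Encode all three columns at once and keep only the "yes" indicator.
encoded = pd.get_dummies(
    toy,
    columns=["international", "online", "complete"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```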


### Feature Selection
I will use the Extra Trees Classifier for feature selection because it is fast and easy to compute. <br>
At each split point of a decision tree, the Extra Trees Classifier considers a random subset of the features and draws the split thresholds at random.

In [19]:
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np

x = data[['student_id', 'years', 'entrance_test_score', 'courses_taken',
          'enroll_date_time', 'course_id', 'gender', 'international_yes',
          'online_yes']]
# ravel the complete_yes for model fit
y = np.ravel(data[['complete_yes']])

# use the Extra Tree Classifier model
model = ExtraTreesClassifier(n_estimators=100)
model.fit(x, y)
print(model.feature_importances_)

[0.08086855 0.40185247 0.03079727 0.09503724 0.14994451 0.20644594
 0.0193025  0.00768053 0.00807099]


In [22]:
# Sort the features by importance (the index is each feature's position in x)
ext = pd.DataFrame(model.feature_importances_, columns=["extratrees"])
ext.sort_values(['extratrees'], ascending=False)

| | extratrees |
|---|---|
| 1 | 0.401852 |
| 5 | 0.206446 |
| 4 | 0.149945 |
| 3 | 0.095037 |
| 0 | 0.080869 |
| 2 | 0.030797 |
| 6 | 0.019303 |
| 8 | 0.008071 |
| 7 | 0.007681 |
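The integer index in the table above is the position of each feature in `x`. Below is a small sketch that pairs the importances printed earlier with the column names of `x`, so the ranking can be read by name. The values are copied from the output above, not recomputed.

```python
import pandas as pd

# Column order of x, as passed to ExtraTreesClassifier above.
feature_names = [
    "student_id", "years", "entrance_test_score", "courses_taken",
    "enroll_date_time", "course_id", "gender", "international_yes",
    "online_yes",
]
# Importances copied from the printed model.feature_importances_.
importances = [
    0.08086855, 0.40185247, 0.03079727, 0.09503724, 0.14994451,
    0.20644594, 0.0193025, 0.00768053, 0.00807099,
]

# Label each importance with its feature name and sort in descending order.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked.head(5))
```

In the notebook itself, the same labelling could be obtained with `pd.Series(model.feature_importances_, index=x.columns)`.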


According to the sorted importances, the top five features (positions 1, 5, 4, 3, and 0 in `x`) are:
1. years
2. course_id
3. enroll_date_time
4. courses_taken
5. student_id
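Picking the top-ranked features can also be done programmatically. Below is a sketch using scikit-learn's `SelectFromModel` on synthetic data. The toy columns `f0` to `f3` and `max_features=2` are illustrative assumptions; the selection in this notebook was done by hand.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f0", "f1", "f2", "f3"])
# The label depends only on f1, so f1 should be among the selected features.
y_toy = (X_toy["f1"] > 0).astype(int)

# threshold=-np.inf makes max_features the only selection criterion.
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    max_features=2,
    threshold=-np.inf,
).fit(X_toy, y_toy)

# Names of the columns the selector keeps.
selected = X_toy.columns[selector.get_support()].tolist()
print(selected)
```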

### Model Implementation & Validation

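For classifier selection, I will use logistic regression, a support vector machine, and Naive Bayes.
* Logistic regression: a linear model and the most common baseline for binary classification.
* Support vector machine (SVM): a margin-based classifier, used as a comparison with logistic regression. The `SVC` used here defaults to an RBF kernel, so its decision boundary is non-linear.
* Naive Bayes: a probabilistic classifier that applies Bayes' theorem under the assumption that the features are conditionally independent given the class label.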

In [23]:
# Prepare the training data and test data.
from sklearn.model_selection import train_test_split

# Split data into training sets and validation sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
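The split above is drawn at random on each run, so the accuracies reported below will vary slightly between runs. Below is a sketch of a reproducible, stratified variant on made-up labels. `random_state=42` is an arbitrary choice, and `stratify` keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 70% positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 70 + [0] * 30)

# Fixing random_state makes the split repeatable; stratify preserves
# the 70/30 class ratio in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```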

#### Logistic Regression

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# fit model and predict
LogitModel = LogisticRegression(max_iter=2000).fit(x_train, y_train)
y_pred = LogitModel.predict(x_test)

# compare the predicted labels with the actual labels in the test data and
# obtain the confusion matrix.
print(confusion_matrix(y_test, y_pred))

[[   0  500]
 [   0 1259]]


In [28]:
from sklearn.metrics import accuracy_score
ac_logit = accuracy_score(y_test, y_pred)
print(ac_logit)

0.7157475838544628


In [29]:
from sklearn import metrics

print(metrics.classification_report(y_test, y_pred, zero_division=0))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       500
           1       0.72      1.00      0.83      1259

    accuracy                           0.72      1759
   macro avg       0.36      0.50      0.42      1759
weighted avg       0.51      0.72      0.60      1759
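The confusion matrix shows that this model predicts class 1 (completed) for every test row, so its accuracy is simply the share of completed courses in the test set (1259 / 1759 ≈ 0.716). One possible cause is that the features are on very different scales (for example `student_id` versus `online_yes`). Below is a hedged sketch of the same estimator behind a `StandardScaler`. It runs on synthetic data rather than the registrar data, so the numbers it prints say nothing about the real result.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Two synthetic features on very different scales; only `small` is informative.
big = rng.normal(200_000, 50_000, size=n)
small = rng.normal(0, 1, size=n)
X_toy = np.column_stack([big, small])
y_toy = (small + rng.normal(0, 0.5, size=n) > 0).astype(int)

# Standardize each feature before fitting the logistic regression.
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
scaled_logit.fit(X_toy[:350], y_toy[:350])
pred = scaled_logit.predict(X_toy[350:])
print(confusion_matrix(y_toy[350:], pred))
```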



#### Support Vector Machine

In [34]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svcModel = make_pipeline(StandardScaler(), SVC(gamma='auto')).fit(x_train, y_train)
y_pred = svcModel.predict(x_test)

In [31]:
ac_svm = accuracy_score(y_test, y_pred)
print(ac_svm)

0.8732234223990903


In [35]:
# Cross Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(svcModel, x, y, cv=5)
scores

array([0.80903666, 0.8609215 , 0.87201365, 0.87372014, 0.80716724])

In [37]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.84 accuracy with a standard deviation of 0.03


#### Naive Bayes

In [33]:
from sklearn.naive_bayes import GaussianNB

NBModel = GaussianNB(priors=[0.5, 0.5]).fit(x_train, y_train)
y_pred = NBModel.predict(x_test)
ac_nb = accuracy_score(y_test, y_pred)
print(ac_nb)

0.549175667993178


#### Result

* Logistic Regression: hold-out accuracy = 0.7157
* **SVM: hold-out accuracy = 0.8732; 5-fold cross-validated accuracy = 0.84 with a standard deviation of 0.03**
* Naive Bayes: hold-out accuracy = 0.5492
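Only the SVM above was cross-validated; the other two models were scored on a single hold-out split. Below is a sketch of scoring all three models with the same 5-fold cross-validation. It uses synthetic data from `make_classification`, an assumption made only so the block runs on its own, and it scales the inputs for logistic regression, which the notebook did not do.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the registrar data.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    "SVM": make_pipeline(StandardScaler(), SVC(gamma="auto")),
    "Naive Bayes": GaussianNB(),
}

# Score every model with the same folds so the numbers are comparable.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.2f} accuracy with a standard deviation of {scores.std():.2f}")
```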