In [2]:
## Put import statements here
import numpy as np
import pandas as pd

# Creating a Machine Learning model

Today you will create your first machine learning model! These models have many components, but there is a general pipeline that you can follow and iterate over to simplify the model-building process.

  1. Define the problem
  2. Prepare the data
  3. Spot check algorithms (to figure out the best ones)
  4. Improve results (usually requires going back to step 2 or 3)
  5. Present results
  
For a more detailed description of these steps, visit <a href=http://machinelearningmastery.com/process-for-working-through-machine-learning-problems/>this website.</a>

Since we have been using the iris dataset a lot lately, we felt it was time to switch things up. Let's look at the Student Alcohol Consumption dataset instead. It can be downloaded directly from the UCI Machine Learning repository. <a href = http://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION>Download the dataset here.</a> Once you have downloaded it, make sure the dataset is in the same folder as this IPython notebook. From there, we can begin working with it.

## 1. Define the problem

To understand where this dataset might be useful, skim over these articles. They both show how machine learning can improve graduation rates by finding students at risk of dropping out. In this lab, we are going to take characteristics and grades for a group of students and see if we can predict whether they fall in low, medium, or high risk categories.

https://dssg.uchicago.edu/wp-content/uploads/2016/04/montogmery-kd2015.pdf

http://www.opb.org/news/article/npr-how-one-university-used-big-data-to-boost-graduation-rates/

Question 1: From the second article, by what percentage have graduation rates increased at Georgia State University since they implemented their new graduation and progression success (GPS) system and hired new advisors? 

## 2. Prepare the data

Skim over the student.txt file to better understand what is in this dataset. It is important to know where to find information about any of the variables in a dataset. We are only going to use student-por.csv for this lab. It contains data on the grades and characteristics of the students in the class. Let's load the data.

In [11]:
student_grades = pd.read_csv('student-por.csv')

Make sure you check the dataframe using .head(). Is there something wrong? What can you do to fix this error? 

We are going to attempt to predict the final grade (G3 column). However, the scores range from 0 to 20, so we will need to bin the values. Let's assume that we want our algorithm to flag anyone who may score below 10 on the final grade, so the teacher has time to tutor the student or otherwise help boost their score.

Run this cell to create a column that flags each student with a score below 10 as 1 and all other students as 0.

In [12]:
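def categorize(val, high_risk):
    # Return 1 when the grade is below the high-risk threshold, otherwise 0.
    if val < high_risk:
        return 1
    else:
        return 0

# The column name matches the 'flag_students' column used for y below.
student_grades.loc[:, 'flag_students'] = student_grades.loc[:, 'G3'].map(lambda x: categorize(x, 10))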
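
The same kind of flag can also be computed by comparing the whole column at once instead of calling a function per row. The sketch below is only an illustration: the `toy` frame and its values are made up, and it assumes the intended rule is a strict "less than 10".

```python
import pandas as pd

# Toy stand-in for the G3 column; these values are made up for illustration.
toy = pd.DataFrame({'G3': [0, 9, 10, 11, 20]})

# Comparing the column to the threshold yields booleans; astype(int) gives 1/0.
toy['flag_students'] = (toy['G3'] < 10).astype(int)

print(toy)
```

Note that a score of exactly 10 is not flagged under this assumption.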

'flag_students' will now be the column we are trying to predict. This is where your expertise kicks in! Choose which features to keep, and save them into the X variable (this will become our feature space). 

In [None]:
X = student_grades.loc[:,['Put Names of Columns to Keep Here']]
y = student_grades.loc[:,'flag_students']

Bonus: Since KNN relies on distance, you cannot directly put categorical variables into the algorithm. If you want to include this type of information, you will first need to dummify the variables before putting them in the classifier. As an example, dummifying would take a column with 'yes' or 'no' and change each 'yes' to a 1 and each 'no' to a 0. Try writing a function that will do this for you.
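
As a starting point, here is a minimal sketch of dummifying with pandas. The `toy` frame, its column names, and its values are assumptions chosen for illustration, so check student.txt for the real column names before adapting it.

```python
import pandas as pd

# Made-up rows; the column names here are illustrative assumptions.
toy = pd.DataFrame({
    'internet': ['yes', 'no', 'yes'],
    'Mjob': ['teacher', 'health', 'other'],
    'age': [15, 16, 17],
})

# Binary yes/no column: map the two labels directly to 1 and 0.
toy['internet'] = toy['internet'].map({'yes': 1, 'no': 0})

# Column with more than two categories: one 0/1 indicator column per category.
toy = pd.get_dummies(toy, columns=['Mjob'], dtype=int)

print(toy)
```

A column with several categories becomes several indicator columns, so the feature space grows with the number of distinct values.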

## 3. Spot check algorithms

For now, we will use accuracy to evaluate and improve our model. We want to maximize the accuracy on both the training and testing sets. Play around and see how high you can get the scores! Watch out though: scores that are too high (such as 100% accuracy) can be a sign of leakage or other improper modeling techniques. When using PCA or LDA, make sure to follow this pipeline.

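 1. Train/test split
 2. Fit the dimensionality reduction on the training set and transform the training set
 3. Fit the model to the reduced training set
 4. Measure the accuracy of the model on the training set
 5. Apply the same fitted dimensionality reduction to the testing set (do not refit it)
 6. Measure the accuracy of the model on the testing set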

Use LDA, PCA, and KNN to build a classifier that predicts, from the attributes you chose, whether a student may be at high risk of under-performing in the course. Note: LDA can be used for both dimensionality reduction and classification.
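
The six-step pipeline above can be sketched as follows. This is a rough illustration on synthetic data, not a solution to the lab: `X_demo`, `y_demo`, the number of components, and the number of neighbors are all assumptions picked for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 200 rows, 6 numeric features, binary label.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# 1. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# 2. Fit the reduction on the training set only, and transform the training set
pca = PCA(n_components=3)
X_train_red = pca.fit_transform(X_train)

# 3. Fit the model on the reduced training set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_red, y_train)

# 4. Accuracy on the training set
train_acc = knn.score(X_train_red, y_train)

# 5. Apply the already-fitted reduction to the testing set (transform, not fit)
X_test_red = pca.transform(X_test)

# 6. Accuracy on the testing set
test_acc = knn.score(X_test_red, y_test)

print(train_acc, test_acc)
```

Calling `transform` rather than `fit_transform` in step 5 keeps information from the testing set out of the fitted reduction, which is one of the leakage risks mentioned earlier.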

## 4. Improve results

There are a few things we can do to maximize the score. One is to tune the parameters of each step, such as the number of components, the number of nearest neighbors, or the distance function. Change these values and see how the accuracy changes with them.

Bonus: Check out <a href=http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html>GridSearchCV</a>. It lets you specify combinations of parameters to try and reports which combination scores best. It is super powerful!
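
Here is a minimal sketch of a grid search over KNN parameters. The linked page documents `sklearn.grid_search` from scikit-learn 0.17; later releases moved the class to `sklearn.model_selection`, which this sketch assumes. The data and the grid values are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 120 rows, 4 numeric features, binary label.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Every combination of these values is tried: 4 x 2 = 8 candidates.
param_grid = {
    'n_neighbors': [1, 3, 5, 7],
    'metric': ['euclidean', 'manhattan'],
}

# 5-fold cross-validation, scored by accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the best combination, so it is still worth checking the chosen model on a held-out testing set.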

## 5. Present results

At a company, this step usually involves a slide show or presentation of your findings. You will not have to do that here, but you should think about these aspects of your model.

 1. Are there ethical concerns with trying to find high risk students this way?
 2. Is there a possibility of neglecting the high performing students? What would the implications of this be?
 3. Would it be beneficial to allow a parent to have access to this information so that they can be informed when their student is flagged for possibly being at risk of failing the course? 
 
There are no right or wrong answers to these questions, but they are good to think about. You do have to provide a thoughtful response to at least one of these questions.