<h2>Predictive Analysis</h2>

With the data ready for modeling, we can now build, train, and test our models. This module discusses the two tasks in predictive analysis: regression and classification. Recall that both belong to **supervised learning**, which means they need a target in the provided data.

First, we import and split the data as usual.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/IT7143\ Module\ 5

Mounted at /content/drive
/content/drive/MyDrive/IT7143 Module 5


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

students = pd.read_csv('students_m5.csv')
students

Unnamed: 0,StudentID,FirstName,LastName,Major,HighSchoolGPA,FamilyIncome,State,AvgDailyStudyTime,TotalAbsence,FirstYearGPA,isGRA
0,202303595,Baxter,Dengler,Computer Science,2.82,45013,WA,2.01,14.0,1.93,0
1,202309162,Christian,Wickey,Data Science,3.07,128358,GA,5.41,,2.76,0
2,202306337,Lonnie,Wulff,Software Engineering,2.68,112392,GA,9.57,13.0,3.09,0
3,202306072,Mitchell,Deshotel,Software Engineering,3.21,190846,GA,8.57,16.0,3.08,0
4,202301733,Linwood,Willing,Information Technology,3.44,187163,GA,6.24,20.0,2.73,0
...,...,...,...,...,...,...,...,...,...,...,...
995,202302372,Michael,Richman,Computer Science,4.00,32210,SC,8.84,16.0,3.31,1
996,202309892,Lacy,Anton,Software Engineering,3.02,163481,GA,6.61,17.0,2.53,0
997,202308310,Ell,Benke,Software Engineering,2.05,45446,GA,3.68,30.0,1.77,0
998,202305648,Elzie,Enderle,Information Technology,2.19,44714,GA,2.74,17.0,2.11,0


In [3]:
features = students.drop(['StudentID','FirstName','LastName','FirstYearGPA','isGRA'], axis=1)
labels = students['FirstYearGPA']

from sklearn.model_selection import train_test_split

trainX, testX, trainY, testY = train_test_split(features, labels, test_size=0.2)

<h3>Processing Pipeline</h3>

Processing data with pandas works, but it is not very convenient, especially when the same process has to be repeated multiple times.

Instead, we will use the **SKLearn Pipeline** in this module. A pipeline wraps multiple processing steps inside a single, reusable object.

*This part is kept simple since this notebook focuses on modeling. Please refer to the Pipeline Explained notebook for more descriptions on each step in the pipeline.*

For steps like outlier clipping, log transformation, and removing rare values, we need to write our own functions. For more common processes like standardization and one-hot encoding, we will use modules from sklearn.

The pipeline code below is fairly standard. You can **reuse the code in different analyses**; just make sure to change the lists of numeric columns and class columns, and verify that the hard-coded numbers are what you want.

In [4]:
num_cols = ['HighSchoolGPA','FamilyIncome','AvgDailyStudyTime','TotalAbsence']
cat_cols = ['Major','State']

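#function to clip outliers
#note: the bounds are computed from the global trainX, so trainX must exist before this runs
def outlier_clip(data):
    num_sds = trainX[num_cols].std()
    num_means = trainX[num_cols].mean()
    #clip each column to its mean +/- 4 standard deviations
    return np.clip(data, num_means - 4*num_sds, num_means + 4*num_sds, axis=1)    #you can change 4 to other numbers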

#function to log transform
def log_transform(data):
    return pd.concat([data, np.log(data.add_suffix('_log') + 0.001)], axis=1)     #you can change 0.001 to other numbers

#function to remove rare classes
def remove_rare_classes(data):
    data_copy = data.copy()
    kept_classes = {}
    for col in cat_cols:
        cat_counts = trainX[col].value_counts()
        kept_classes[col] = cat_counts.index[cat_counts > 40]                     #you can change 40 to other numbers
    for col in cat_cols:
        data_copy.loc[~data_copy[col].isin(kept_classes[col]), col] = 'Other'
    return data_copy
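
The functions above read their statistics from the global `trainX`. One possible alternative, sketched here only as an illustration, is a small transformer class that learns its bounds in `fit()` and reuses them in `transform()`. The class name `OutlierClipper`, the parameter `n_sds`, and the toy data are hypothetical and are not part of this notebook or of sklearn.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clips each column to mean +/- n_sds standard deviations."""

    def __init__(self, n_sds=4):
        self.n_sds = n_sds

    def fit(self, X, y=None):
        # learn the clipping bounds from whatever data is passed to fit()
        X = pd.DataFrame(X)
        self.lower_ = X.mean() - self.n_sds * X.std()
        self.upper_ = X.mean() + self.n_sds * X.std()
        return self

    def transform(self, X):
        # apply the stored bounds column by column
        return pd.DataFrame(X).clip(lower=self.lower_, upper=self.upper_, axis=1)

# toy data, not taken from students_m5.csv
toy_train = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0]})
toy_new = pd.DataFrame({'x': [-100.0, 3.0, 100.0]})

clipper = OutlierClipper(n_sds=1).fit(toy_train)
clipped = clipper.transform(toy_new)
print(clipped)
```

Because the bounds are stored on the object, a transformer like this could sit in a pipeline step without depending on a variable defined elsewhere in the notebook.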

Now we import all the other needed tools from sklearn. 
- **FunctionTransformer** is used for transformations that are not prebuilt in sklearn. This includes clipping outliers, log transformation, and removing rare classes. 
    - FunctionTransformer needs the written function as an input
- **Pipeline** is an object that takes a *list of transformations* as input
    - Each item in the list requires a name (as a string) and the sklearn transformer, i.e., FunctionTransformer, SimpleImputer, etc. The two components are separated by a comma
    - The listed transformations will be performed in the order they appear
- **ColumnTransformer** combines the two pipelines for numeric and class columns into one uniform, final pipeline
    - Only use fit_transform() on training data
    - Testing data must be transformed with transform()
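
The last two points can be illustrated with a minimal sketch on toy arrays (not the student data). `fit_transform()` learns the statistics from the training data, and `transform()` reuses them on the test data without learning anything new. The variable names here are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy arrays, not taken from students_m5.csv
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[10.0]])

scaler = StandardScaler()
toy_train_scaled = scaler.fit_transform(toy_train)   # learns mean and std from the training data
toy_test_scaled = scaler.transform(toy_test)         # reuses the training mean and std

print(scaler.mean_)        # the mean comes from toy_train only
print(toy_test_scaled)     # the test value is scaled with the training statistics
```

If we called `fit_transform()` on the test data instead, the scaler would replace the training statistics with ones learned from the test set, and the two sets would no longer be processed the same way.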

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

#pipeline for numeric columns
num_pipeline = Pipeline([
    ('outlier clip', FunctionTransformer(outlier_clip)),
    ('log transform', FunctionTransformer(log_transform)),
    ('standardize', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))    
])

#pipeline for class columns
cat_pipeline = Pipeline([
    ('remove rare classes', FunctionTransformer(remove_rare_classes)),
    ('encode', OneHotEncoder())
])

from sklearn.compose import ColumnTransformer

#combining
full_pipeline = ColumnTransformer([
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])

#use the built pipeline to process training and testing data
trainX_prc = full_pipeline.fit_transform(trainX)
testX_prc = full_pipeline.transform(testX)

In [6]:
trainX_prc.shape, testX_prc.shape

((800, 10), (200, 10))

<h3> Predictive Analysis - Regression </h3>

Recall that regression is a supervised task in which the target/label is numeric:
- Numerical comparisons like < or > are meaningful
- Statistics like mean, variance, standard deviation, etc. are meaningful

We will talk more about different models in the next module. Here, we will use the basic Linear Regression and discuss how to evaluate a regression model.

In sklearn, modeling is very easy. Step by step, the cell below does the following:
1. imports the model from the correct sklearn module
2. creates a new model
3. trains the model. Notice you have to **provide both features and labels in fit()**.

In [None]:
from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression()

linear_reg.fit(trainX_prc,trainY)

LinearRegression()

The most common evaluation metric for regression problems is Mean Squared Error (MSE). Each data point has a true target value and a value predicted by the model; MSE is the average of the squared differences over all $n$ true/predicted pairs:

$MSE = \dfrac{\sum_{i=1}^{n}(true_i - predicted_i)^2}{n}$
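
To make the formula concrete, here is a small sketch that computes MSE by hand with numpy. The four true/predicted values are made up for illustration and do not come from the student data.

```python
import numpy as np

# toy values, not taken from students_m5.csv
true_vals = np.array([3.0, 2.5, 4.0, 3.5])
pred_vals = np.array([2.5, 2.5, 3.0, 4.0])

squared_diffs = (true_vals - pred_vals) ** 2        # [0.25, 0.0, 1.0, 0.25]
mse_manual = squared_diffs.sum() / len(true_vals)   # sum of squared differences divided by n
print(mse_manual)   # 0.375
```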

We can import the MSE function from sklearn to use without having to write too much code. 

After training, every sklearn model has a **predict()** function that predicts the labels from the features. Notice that we only feed the features to predict().

In [None]:
from sklearn.metrics import mean_squared_error

#get the prediction
trainY_pred = linear_reg.predict(trainX_prc)

#get the MSE
mse_lr = mean_squared_error(trainY, trainY_pred)
print(mse_lr)

0.05687847627686117


A lower MSE means a more accurate model. However, **do not compare MSE of models trained on different data**, because MSE depends on the scale of the target.
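
One reason for this warning is that MSE changes with the scale of the target. The sketch below uses made-up values and a hypothetical helper `mse()` to show the same predictions expressed on two scales. The errors are equally good in relative terms, yet the MSE values differ by a large factor.

```python
import numpy as np

def mse(true_vals, pred_vals):
    # average of the squared differences
    return np.mean((true_vals - pred_vals) ** 2)

# toy values, not taken from students_m5.csv
true_gpa = np.array([3.0, 2.5, 4.0, 3.5])
pred_gpa = np.array([2.5, 2.5, 3.0, 4.0])

# the same targets and predictions rescaled from a 0-4 scale to a 0-100 scale
true_pct = true_gpa * 25
pred_pct = pred_gpa * 25

print(mse(true_gpa, pred_gpa))   # 0.375
print(mse(true_pct, pred_pct))   # 234.375, which is 25**2 times larger
```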

A very similar metric is Root Mean Squared Error (RMSE), which is the square root of the MSE.

In [None]:
np.sqrt(mse_lr)

0.23849208849951642

RMSE can be interpreted as the typical size of the error between the predicted values and the true values, in the same units as the target. In this case, that is the error between the predicted and the true first-year GPA of the students.

Is this a good model? Recall that the range of GPA is from 0 - 4 (or in this data, 2.0 - 4.15), so is a typical error of about 0.238 good enough?

Metrics like MSE and RMSE depend on the target range and can sometimes be hard to interpret. A different metric we can use is R-squared.

In [None]:
from sklearn.metrics import r2_score

r2_lr = r2_score(trainY, trainY_pred)
print(r2_lr)

0.8106208828937093


The R2 score is at most 1, and it is interpreted as the proportion of variation in the target that our model can explain. For very poorly fitting models, R2 can be negative.

In this case, the linear regression model can explain about 81% of the variation in the data.
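
As a rough sketch of where this number comes from, R2 can be computed as one minus the ratio of the unexplained variation to the total variation in the target. The helper `r2_manual()` and the toy arrays below are made up for illustration. The second set of predictions shows how a badly fitting model ends up with a negative score.

```python
import numpy as np

def r2_manual(true_vals, pred_vals):
    ss_res = np.sum((true_vals - pred_vals) ** 2)          # variation the model fails to explain
    ss_tot = np.sum((true_vals - true_vals.mean()) ** 2)   # total variation around the mean
    return 1 - ss_res / ss_tot

# toy values, not taken from students_m5.csv
true_vals = np.array([1.0, 2.0, 3.0, 4.0])
good_pred = np.array([1.0, 2.0, 3.0, 5.0])   # close to the true values
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])    # worse than always predicting the mean

print(r2_manual(true_vals, good_pred))   # 0.8
print(r2_manual(true_vals, bad_pred))    # -3.0
```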

Now we can see how the model adapts to new data, i.e., the test set.

In [None]:
#get the predictions on the test set, then the MSE
testY_pred = linear_reg.predict(testX_prc)

mse_lr_test = mean_squared_error(testY, testY_pred)
print(mse_lr_test)

0.053492752679506615


In [None]:
r2_lr_test = r2_score(testY, testY_pred)
print(r2_lr_test)

0.8150714678652767


<h3> Predictive Analysis - Classification </h3>

Classification is a supervised task in which the target/label is discrete. We will now use the isGRA column as the label. This requires resetting the features and labels, and re-splitting them.

The pipeline, however, can be directly reused, since the lists of numeric columns and class columns, and all processing steps stay the same.

In [None]:
features = students.drop(['StudentID','FirstName','LastName','FirstYearGPA','isGRA'], axis=1)
labels = students['isGRA']

trainX, testX, trainY, testY = train_test_split(features, labels, test_size=0.2)

trainX_prc = full_pipeline.fit_transform(trainX)
testX_prc = full_pipeline.transform(testX)

For classification, we use a **Logistic Regression** model. Details will be discussed in later modules. For now, we will import, create, and train the model.

In [None]:
from sklearn.linear_model import LogisticRegression

logistic_reg = LogisticRegression()

logistic_reg.fit(trainX_prc, trainY)

LogisticRegression()

<h4>Accuracy</h4>

The easiest way to evaluate a classification model is to use the **accuracy rate**. Accuracy is the fraction of data points whose predicted label is equal to the true label.

Accuracy is always between 0 and 1, and can be converted to a percentage by multiplying by 100.

In [None]:
logistic_reg.score(trainX_prc, trainY)

0.89
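
For a classifier, `score()` returns the accuracy. As a minimal sketch with made-up labels, the same quantity can be computed by hand as the fraction of positions where the predicted label matches the true label.

```python
import numpy as np

# toy labels, not taken from students_m5.csv
true_labels = np.array([0, 0, 1, 1, 0])
pred_labels = np.array([0, 1, 1, 1, 0])

# 4 of the 5 predictions match the true labels
accuracy = np.mean(pred_labels == true_labels)
print(accuracy)   # 0.8
```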

In [None]:
trainY_pred = logistic_reg.predict(trainX_prc)

<h4>F1 Score</h4>

Accuracy is not always the best evaluation metric, especially when some label values are rare. In such cases, a more suitable metric is the **F1 score**. F1 is also between 0 and 1, and a higher F1 means a better model.
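
The sketch below uses made-up labels in which only 10 of 100 points are positive. It computes F1 by hand as the harmonic mean of precision and recall, two quantities this notebook does not otherwise cover. A model that finds just 2 of the 10 positives still reaches a high accuracy, while F1 exposes the weakness.

```python
import numpy as np

# toy labels, not taken from students_m5.csv: 10 positives, 90 negatives
true_labels = np.array([1] * 10 + [0] * 90)
# the model flags only the first 2 positives and predicts 0 for everything else
pred_labels = np.array([1] * 2 + [0] * 98)

accuracy = np.mean(pred_labels == true_labels)

tp = np.sum((pred_labels == 1) & (true_labels == 1))   # correctly predicted positives
precision = tp / np.sum(pred_labels == 1)              # share of predicted positives that are right
recall = tp / np.sum(true_labels == 1)                 # share of true positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)       # 0.92
print(round(f1, 3))   # 0.333
```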

In [None]:
from sklearn.metrics import f1_score
f1_score(trainY, trainY_pred)

0.7086092715231788

On the test data:

In [None]:
testY_pred = logistic_reg.predict(testX_prc)
logistic_reg.score(testX_prc, testY)

0.915

In [None]:
f1_score(testY, testY_pred)

0.7733333333333334