# Lab 2

In [334]:
FIRST_NAME = "Leng"
LAST_NAME = "Her"
STUDENT_ID = "5445877"

## Introduction

Scikit-learn is one of the most widely used tools for building machine learning models in the data science industry. It provides standardized tools for managing the entire machine learning development workflow.

In this lab, you will use the same simulated data set as last week to build, validate, and evaluate machine learning models with various regression algorithms. You will also save the best model to a file so it can be used to make future predictions.

## The Data Set

**Data Description**

This is a simulated data set of student performance in the INET 4062 class. _None of these are actual students._

**Data Dictionary**

| Column Name | Type | Description |
| :----------- | :-- | :----------- |
| studentId | `int` | Unique Id of student |
| gpa | `float` | Current cumulative GPA |
| labHours | `float` | Number of hours spent per week on labs |
| studyHours | `float` | Number of hours spent studying for each exam |
| took4061 | `int` | Binary if student took INET 4061 (0=No, 1=Yes) |
| pythonExp | `int` | A High, Medium, or Low rating from student on previous Python experience (0=Low, 1=Medium, 2=High) |
| statsRating | `int` | A 0-5 rating of ability on statistics |
| height | `float` | Height of student in inches |
| eyeColor | `str` | Eye color of student |
| followers | `int` | Number of followers on all social media accounts |
| grade | `float` | Percentage grade in INET 4062 out of 100 |
| letterGrade | `str` | Letter grade derived from the percentage |


**Data Sample**

|   studentId |     gpa |   labHours |   studyHours |   took4061 |   pythonExp |   statsRating |   height | eyeColor   |   followers |   grade | letterGrade   |
|------------:|--------:|-----------:|-------------:|-----------:|------------:|--------------:|---------:|:-----------|------------:|--------:|:--------------|
|           0 | 3.62061 |   3.0089   |     4.36066  |          1 |           2 |             3 |  69.4933 | brown      |         632 |   92.31 | A-            |
|           1 | 3.19391 |   2.524    |     4.88687  |          0 |           2 |             2 |  67.3275 | blue       |          44 |   85.59 | B             |
|           2 | 3.19453 |   0.903686 |     2.0478   |          1 |           1 |             5 |  69.3401 | green      |         181 |   88.39 | B+            |
|           3 | 3.27793 |   4.88015  |     0.822806 |          1 |           2 |             4 |  67.8951 | blue       |         347 |   90.91 | A-            |
|           4 | 2.5     |   1.47281  |     7.51036  |          0 |           2 |             4 |  67.708  | brown      |        1070 |   84.14 | B             |
|           5 | 2.56162 |   4.53166  |     5.50934  |          0 |           1 |             5 |  69.6897 | hazel      |          18 |   84.37 | B             |
|           6 | 3.15581 |   2.60646  |     1.56167  |          0 |           1 |             5 |  69.045  | hazel      |        7007 |   84.2  | B             |
|           7 | 3.73405 |   2.41052  |     2.99812  |          1 |           2 |             5 |  66.5794 | brown      |        5599 |   93.6  | A             |
|           8 | 2.98454 |   2.68131  |     1.73898  |          1 |           2 |             4 |  69.1432 | hazel      |        1206 |   88.73 | B+            |
|           9 | 3.84509 |   5.43147  |     4.9316   |          1 |           1 |             1 |  68.034  | blue       |       40097 |   92.12 | A-            |

In [335]:
import numpy as np
import pandas as pd

n = 1000 # number of records to simulate
np.random.seed(40) # set the seed of the random number generator

# Current GPA
gpa = 0.4 * np.random.randn(n) + 3.25
gpa = np.clip(gpa, 2.5, 4.0)

# Average hours per week on Labs
labHours = 5.5/np.exp(2*np.random.rand(n))

# Number of hours studying for exam
studyHours = np.power(2*np.random.rand(n) + 0.75, 2)

# Junior or Senior
isSenior = np.random.binomial(size=n, n=1, p=0.67)

# Took 4061
took4061 = np.random.binomial(size=n, n=1, p=0.75)

# Previous Python Experience
pythonExp = np.random.binomial(size=n, n=2, p=0.70)

# Ability in statistics
statsRating = np.random.binomial(size=n, n=5, p=0.75)

# Height
height = 4 * np.random.rand(n) + 66.5

# Eye Color
eyeColor = np.random.choice(["blue", "green", "brown", "hazel"], n)

# Social media followers
followers = (10 ** (1+5*np.random.beta(3, 7, size=n))).round()

# simulate grades
grade = 72 + (((gpa**2)/3 + np.sqrt(statsRating+1)) * 
              np.sqrt(labHours/3 + studyHours/6 + pythonExp + 3*took4061)) + \
              (1+3*np.random.rand())

# Compile columns into a DataFrame
students_df = pd.DataFrame({
    'gpa' : gpa,
    'labHours' : labHours,
    'studyHours' : studyHours,
    'took4061' : took4061,
    'pythonExp' : pythonExp,
    'statsRating' : statsRating,
    'height' : height,
    'eyeColor' : eyeColor,
    'followers' : followers,
    'grade' : grade.round(2)
})

# Define a function to calculate the letter grade based
# on the percentage in the class
def getLetterGrade(x):
    if x < 76.67:
        return "C"
    elif x < 80:
        return "C+"
    elif x < 83.33:
        return "B-"
    elif x < 86.67:
        return "B"
    elif x < 90:
        return "B+"
    elif x < 93.33:
        return "A-"
    else:
        return "A"

# Add the letter grade column to the DataFrame
students_df['letterGrade'] = students_df['grade'].apply(getLetterGrade)

# Rename the index of the DataFrame to be `studentId`
# because that index uniquely identifies 1 student
students_df.index.rename("studentId", inplace=True)
students_df.reset_index(drop=False, inplace=True)

In [336]:
students_df

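|     | studentId |      gpa | labHours | studyHours | took4061 | pythonExp | statsRating |    height | eyeColor | followers | grade | letterGrade |
|----:|----------:|---------:|---------:|-----------:|---------:|----------:|------------:|----------:|:---------|----------:|------:|:------------|
|   0 |         0 | 3.006981 | 3.013592 |   3.187031 |        1 |         1 |           2 | 66.605363 | blue     |     154.0 | 85.57 | B           |
|   1 |         1 | 3.199545 | 1.025472 |   2.750329 |        1 |         0 |           3 | 68.511122 | hazel    |    1091.0 | 84.96 | B           |
|   2 |         2 | 2.976157 | 3.032981 |   7.211016 |        1 |         2 |           4 | 68.770468 | brown    |     749.0 | 88.34 | B+          |
|   3 |         3 | 3.621486 | 1.799475 |   5.882726 |        1 |         2 |           4 | 67.058603 | brown    |     136.0 | 91.36 | A-          |
|   4 |         4 | 2.512240 | 3.950051 |   3.418261 |        1 |         2 |           3 | 69.320191 | brown    |     802.0 | 85.17 | B           |
| ... |       ... |      ... |      ... |        ... |      ... |       ... |         ... |       ... | ...      |       ... |   ... | ...         |
| 995 |       995 | 2.801776 | 2.742993 |   2.137272 |        1 |         2 |           4 | 68.828923 | blue     |    3712.0 | 86.56 | B           |
| 996 |       996 | 3.618104 | 1.017808 |   4.031764 |        1 |         0 |           4 | 69.533289 | blue     |    5676.0 | 87.62 | B+          |
| 997 |       997 | 3.838771 | 1.576157 |   3.818294 |        0 |         2 |           3 | 69.373830 | green    |     104.0 | 86.70 | B+          |
| 998 |       998 | 3.688545 | 1.297447 |   0.953019 |        1 |         1 |           3 | 68.642846 | blue     |     262.0 | 88.41 | B+          |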


## Question 1

The first step is to split data into the independent variables (features) and the dependent variable (label), and to split the data into a train and test set. 

First, create a DataFrame named `X` with only the following columns _"gpa", "labHours", "studyHours", "took4061", "pythonExp", "statsRating"._

Next, create a Series (a one-column DataFrame) named `y` that contains just the _"grade"_ column of the DataFrame.

Then do a train/test split on the `X` and `y` datasets where 75% of the rows are in the train set, and 25% are in the test set. Set the random seed to `4062` for the train/test split.
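
The column selection and split described above can be sketched on a small toy DataFrame. This is only an illustration: `toy_df` and its values are made up, and only two feature columns are used to keep it short.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for students_df (hypothetical values, 8 rows)
toy_df = pd.DataFrame({
    "gpa": [3.1, 3.5, 2.9, 3.8, 3.0, 3.3, 2.7, 3.6],
    "labHours": [2.0, 3.5, 1.0, 4.2, 2.8, 3.1, 0.9, 5.0],
    "grade": [85.0, 90.1, 80.3, 93.5, 84.2, 88.0, 78.9, 92.4],
})

X_toy = toy_df[["gpa", "labHours"]]  # list of names -> DataFrame
y_toy = toy_df["grade"]              # single name -> Series

# test_size=0.25 keeps 75% of the rows for training;
# random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=4062
)

print(X_tr.shape, X_te.shape)
# Features and labels keep the same row labels after the split
print(X_tr.index.equals(y_tr.index))
```

Because the split shuffles rows, `X_tr` and `y_tr` keep their original (now non-contiguous) index labels, which matters later when DataFrames are combined column-wise.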


### Resources

**Scikit Learn Documentation**
 * sklearn.model_selection.train_test_split ([link](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html))

**Pandas Documentation**
* Indexing and Selecting data ([link](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html))

**Examples**
 * Selecting multiple columns [datagy](https://datagy.io/pandas-select-columns/)
 * Selecting multiple columns [statology](https://www.statology.org/pandas-select-multiple-columns/)
 * Train Test Split ([Real Python](https://realpython.com/train-test-split-python-data/))
 * Train Test Split ([Machine Learning Mastery](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/))



### Answer

In [337]:
from sklearn.model_selection import train_test_split

In [338]:
# select the independent variables and store in X
X = students_df[["gpa", "labHours", "studyHours", "took4061", "pythonExp", "statsRating"]]

# select the dependent variable (label) and store in y
y = students_df["grade"]

# split the data into a train and test set with 75% of the data in the train set and 25% in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=4062)

## Question 2

One hot encode the categorical features for the model _"took4061", "pythonExp", "statsRating"_. Fit and transform the OneHotEncoder on the `X_train` DataFrame. Save the results of the one hot encoding transformation into a variable named `categoricalVars`.

Next, convert the `categoricalVars` array into a Pandas DataFrame named `categoricalVars_df` with the column names from the `.get_feature_names_out()` method of your OneHotEncoder object.




### Resources

**Scikit Learn Documentation**
 * sklearn.preprocessing.OneHotEncoder ([link](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html))

**Examples**
 * Datagy ([link](https://datagy.io/sklearn-one-hot-encode/))
 * Geeks4Geeks ([link](https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/))

### Answer

In [339]:
from sklearn.preprocessing import OneHotEncoder

In [340]:

encoder = OneHotEncoder()
categoricalVars = encoder.fit_transform(X_train[["took4061", "pythonExp", "statsRating"]])
categoricalVars_df = pd.DataFrame(categoricalVars.toarray(), columns=encoder.get_feature_names_out())


## Question 3

Normalize the numeric features for the machine learning models _"gpa", "labHours", "studyHours"_.  Fit and transform the StandardScaler on the `X_train` DataFrame. Save the results of the transformation into a variable named `normalizedVars`. 

Next, convert the `normalizedVars` array into a Pandas DataFrame named `normalizedVars_df` with the column names from the `.get_feature_names_out()` method of your StandardScaler object.
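
For intuition, `StandardScaler` subtracts each column's mean and divides by its standard deviation, using the population standard deviation (divide by `n`). The sketch below reproduces that arithmetic for a single column in plain Python; `standardize` is a hypothetical helper and the sample values are made up.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (divides by n, not n - 1)
    return [(v - mu) / sigma for v in values]

gpa_sample = [3.62, 3.19, 3.28, 2.50, 3.73]  # hypothetical GPA values
z = standardize(gpa_sample)

print([round(v, 3) for v in z])
```

After the transformation the column has mean 0 and standard deviation 1, which keeps features on different scales from dominating distance-based models such as K-Nearest Neighbors.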

### Resources

**Scikit Learn Documentation**
 * sklearn.preprocessing.StandardScaler ([link](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html))

**Examples**
 * Machine Learning Mastery ([link](https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/))
 * BenAlexKeen ([link](https://benalexkeen.com/feature-scaling-with-scikit-learn/))

### Answer

In [341]:
from sklearn.preprocessing import StandardScaler

In [342]:


# Select the numeric features
numeric_features = ["gpa", "labHours", "studyHours"]
X_train_num = X_train[numeric_features]

# Fit and transform the StandardScaler on the numeric features in X_train
scaler = StandardScaler()
normalizedVars = scaler.fit_transform(X_train_num)

# Convert normalizedVars array into a Pandas DataFrame, taking the column
# names from the fitted scaler as the question asks
normalizedVars_df = pd.DataFrame(normalizedVars, columns=scaler.get_feature_names_out())


## Question 4

Combine the columns of the two DataFrames `categoricalVars_df` and `normalizedVars_df`. Use the `pd.concat()` function with `axis='columns'`. Save the resulting DataFrame into a variable named `X_train_cleaned`.
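
One thing to keep in mind is that `pd.concat()` with `axis='columns'` aligns rows by index label, not by position. In this lab both DataFrames are built from NumPy arrays, so both get a default `RangeIndex` and line up. The sketch below, using made-up values, shows what can go wrong when the labels differ and one way to avoid it.

```python
import pandas as pd

left = pd.DataFrame({"took4061_1": [1.0, 0.0, 1.0]})              # index 0, 1, 2
right = pd.DataFrame({"gpa": [0.4, -1.2, 0.8]}, index=[7, 3, 5])  # shuffled labels

# Labels do not match, so pandas takes the union of both indexes and fills NaN
misaligned = pd.concat([left, right], axis="columns")

# Dropping the old labels restores positional alignment
aligned = pd.concat([left, right.reset_index(drop=True)], axis="columns")

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```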

### Resources

**Pandas Documentation**
* pd.concat ([link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html))

**Examples**
* Statology ([link](https://www.statology.org/concatenate-two-pandas-dataframes/))
* W3Resource ([link](https://www.w3resource.com/pandas/concat.php))

### Answer

In [343]:
X_train_cleaned = pd.concat([categoricalVars_df, normalizedVars_df], axis='columns')


In [344]:
X_train_cleaned

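|     | took4061_0 | took4061_1 | pythonExp_0 | pythonExp_1 | pythonExp_2 | statsRating_0 | statsRating_1 | statsRating_2 | statsRating_3 | statsRating_4 | statsRating_5 |       gpa |  labHours | studyHours |
|----:|-----------:|-----------:|------------:|------------:|------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|----------:|----------:|-----------:|
|   0 |        0.0 |        1.0 |         0.0 |         0.0 |         1.0 |           0.0 |           0.0 |           0.0 |           0.0 |           1.0 |           0.0 |  0.455838 | -0.704009 |  -0.676952 |
|   1 |        0.0 |        1.0 |         0.0 |         1.0 |         0.0 |           0.0 |           1.0 |           0.0 |           0.0 |           0.0 |           0.0 |  0.812140 | -0.939033 |  -0.352502 |
|   2 |        1.0 |        0.0 |         1.0 |         0.0 |         0.0 |           0.0 |           0.0 |           0.0 |           0.0 |           0.0 |           1.0 |  0.155625 |  1.593085 |   1.425394 |
|   3 |        0.0 |        1.0 |         0.0 |         0.0 |         1.0 |           0.0 |           0.0 |           0.0 |           1.0 |           0.0 |           0.0 |  1.294941 | -0.740124 |  -0.039968 |
|   4 |        0.0 |        1.0 |         0.0 |         0.0 |         1.0 |           0.0 |           0.0 |           0.0 |           1.0 |           0.0 |           0.0 | -1.462300 | -1.150700 |   0.052982 |
| ... |        ... |        ... |         ... |         ... |         ... |           ... |           ... |           ... |           ... |           ... |           ... |       ... |       ... |        ... |
| 745 |        0.0 |        1.0 |         0.0 |         0.0 |         1.0 |           0.0 |           0.0 |           0.0 |           0.0 |           1.0 |           0.0 | -1.010872 |  0.527532 |  -0.844072 |
| 746 |        0.0 |        1.0 |         0.0 |         1.0 |         0.0 |           0.0 |           0.0 |           0.0 |           1.0 |           0.0 |           0.0 |  1.936559 |  1.803265 |   1.516103 |
| 747 |        0.0 |        1.0 |         0.0 |         1.0 |         0.0 |           0.0 |           0.0 |           0.0 |           0.0 |           0.0 |           1.0 |  0.694991 | -1.129082 |   1.051331 |
| 748 |        1.0 |        0.0 |         0.0 |         0.0 |         1.0 |           0.0 |           0.0 |           0.0 |           0.0 |           1.0 |           0.0 |  0.197053 | -1.055185 |   0.639203 |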


## Question 5

Test the performance of three different algorithms to predict the grade of each student: Support Vector Machines, K-Nearest Neighbors, and Random Forest. Try several hyperparameter settings for each algorithm (_in this case, do not use `weights='distance'` for KNN_).

Track the Mean Absolute Error (MAE) on the training data set for each set of hyperparameters for each algorithm that you try. 
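
MAE is the average of the absolute differences between actual and predicted values. The plain-Python sketch below shows the arithmetic that `mean_absolute_error` performs for a single target; `mean_absolute_error_manual` is a hypothetical helper and the numbers are made up.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [92.31, 85.59, 88.39]     # hypothetical grades
predicted = [91.31, 86.59, 88.39]  # hypothetical model output

# Errors of 1, 1, and 0 grade points average to about 0.667
mae = mean_absolute_error_manual(actual, predicted)
print(round(mae, 3))
```

Because MAE is in the same units as the label, a value of 0.5 here would mean predictions are off by half a percentage point of grade on average.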

### Resources

**Scikit Learn Documentation**
* LinearSVR ([link](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html))
* KNeighborsRegressor ([link](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html))
* RandomForestRegressor ([link](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html))
* mean_absolute_error ([link](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error))


### Answer

In [345]:
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error

In [346]:
svm = LinearSVR()
knn = KNeighborsRegressor()
rf = RandomForestRegressor()

In [347]:
# Fitting the models
svm.fit(X_train_cleaned, y_train)
knn.fit(X_train_cleaned, y_train)
rf.fit(X_train_cleaned, y_train)

RandomForestRegressor()

In [348]:
# Predicting with the models
y_train_pred_svm = svm.predict(X_train_cleaned)
y_train_pred_knn = knn.predict(X_train_cleaned)
y_train_pred_rf = rf.predict(X_train_cleaned)

In [349]:
# Calculating MAE
mae_svm = mean_absolute_error(y_train, y_train_pred_svm)
mae_knn = mean_absolute_error(y_train, y_train_pred_knn)
mae_rf = mean_absolute_error(y_train, y_train_pred_rf)

In [350]:
print("Mean Absolute Error - SVM:", mae_svm)
print("Mean Absolute Error - KNN:", mae_knn)
print("Mean Absolute Error - RF:", mae_rf)

Mean Absolute Error - SVM: 0.2676725559973789
Mean Absolute Error - KNN: 0.48866133333333284
Mean Absolute Error - RF: 0.16256840000000358
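
The cells above fit each algorithm once with its default settings. One possible way to track training MAE across several hyperparameter settings is sketched below. It is an illustration only: it uses synthetic data (`X_demo`, `y_demo`) instead of `X_train_cleaned`, the candidate settings are arbitrary, and `LinearSVR` is left out for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the training data (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 80 + 3 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Arbitrary example settings; random_state fixes the forest's randomness
candidates = {
    "knn_k=1": KNeighborsRegressor(n_neighbors=1),
    "knn_k=5": KNeighborsRegressor(n_neighbors=5),
    "knn_k=15": KNeighborsRegressor(n_neighbors=15),
    "rf_depth=3": RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0),
    "rf_depth=None": RandomForestRegressor(n_estimators=50, max_depth=None, random_state=0),
}

# Fit every candidate and record its MAE on the training data
train_mae = {}
for name, model in candidates.items():
    model.fit(X_demo, y_demo)
    train_mae[name] = mean_absolute_error(y_demo, model.predict(X_demo))

# Print the settings from lowest to highest training MAE
for name, mae in sorted(train_mae.items(), key=lambda kv: kv[1]):
    print(f"{name}: {mae:.3f}")
```

Note that training MAE tends to favor flexible models: with `n_neighbors=1` and uniform weights, each training point is its own nearest neighbor, so the training MAE is 0 regardless of how the model would perform on new data.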


## Question 6

Evaluate the results of the models on the training data set. Select the model with the lowest Mean Absolute Error (MAE). Then re-train that algorithm with the same hyperparameters and calculate the MAE on the test data set.

In [351]:
encoder = OneHotEncoder()
categoricalVars = encoder.fit_transform(X_train[["took4061", "pythonExp", "statsRating"]])
categoricalVars_df = pd.DataFrame(categoricalVars.toarray(), columns=encoder.get_feature_names_out())


In [352]:
# Apply the preprocessing fitted on the training set to the test set.
# Test-specific variable names keep the training DataFrames from Questions 2-4 intact.

# Categorical feature preprocessing (transform only; the encoder was fit on X_train)
categoricalVars_test = encoder.transform(X_test[["took4061", "pythonExp", "statsRating"]])
categoricalVars_test_df = pd.DataFrame(categoricalVars_test.toarray(), columns=encoder.get_feature_names_out())

# Numerical feature preprocessing (transform only; the scaler was fit on X_train)
normalizedVars_test = scaler.transform(X_test[["gpa", "labHours", "studyHours"]])
normalizedVars_test_df = pd.DataFrame(normalizedVars_test, columns=scaler.get_feature_names_out())

# Combine all features
X_test_cleaned = pd.concat([categoricalVars_test_df, normalizedVars_test_df], axis='columns')

In [353]:
# Re-train the selected model with the same hyperparameters
rf.fit(X_train_cleaned, y_train)
y_pred = rf.predict(X_test_cleaned)
best_model_mae = mean_absolute_error(y_test, y_pred)
print('MAE on the test data set:', best_model_mae)

MAE on the test data set: 0.4021979999999989


## Question 7

Save the model with the lowest mean absolute error (MAE) to a Pickle file named `model.pkl`. 
* Retrain the algorithm with the same hyper parameters as in the previous step.
* Get the mean absolute error of the model on the test data set.
* Use the Python `pickle` package to save the model object to a file named `model.pkl`.
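
The save and load steps follow the same pattern for any picklable Python object. The sketch below round-trips a plain dictionary as a stand-in for the fitted model and writes to a temporary directory, so it does not touch a real `model.pkl`. Both the stand-in object and the temporary path are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model (hypothetical); a real estimator is saved the same way
stand_in_model = {"algorithm": "RandomForestRegressor", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"

    # "wb" = write binary; pickle produces bytes, not text
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # "rb" = read binary; load rebuilds an equal but separate object
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.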

### Resources

**Python Documentation**
* Pickle Examples ([link](https://docs.python.org/3/library/pickle.html#examples))
* Pickle Docs ([link](https://docs.python.org/3/library/pickle.html#))

**Examples**
* Datacamp ([link](https://www.datacamp.com/tutorial/pickle-python-tutorial))
* Real Python ([link](https://realpython.com/python-pickle-module/))

### Answer

In [354]:
import pickle

In [355]:
rf.fit(X_train_cleaned, y_train)

RandomForestRegressor()

In [356]:
y_pred2 = rf.predict(X_test_cleaned)

In [357]:
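# Score the retrained model's own predictions (y_pred2), not y_pred from Question 6.
# The forest is refit without a fixed random_state, so this value can differ
# slightly from the Question 6 result.
best_model_mae2 = mean_absolute_error(y_test, y_pred2)
print('MAE on the test data set:', best_model_mae2)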


In [358]:
with open("model.pkl", "wb") as file:
    pickle.dump(rf, file)

## Question 8

Write a Python function named `predict()` that takes in data about new students (a list of dictionaries, one per student, as in the example below) and returns the predictions from the `model.pkl` object that was recently saved.

On the example below, it should return something like:
```
{'predictions': [84.76219999999998]}
```

* First, read in the `model.pkl` file into a new python variable named "model". 
* Next, convert the input dictionary into a Pandas DataFrame using `pd.DataFrame()`.
* Then, run the `.predict()` function on the DataFrame to get an array with the prediction. Use the `.tolist()` method of the array to convert it into a list.
* Return the list in a dictionary with the key as `'predictions'`.

### Answer

In [359]:
new_students = [{
    'took4061_0' : 1,
    'took4061_1' : 0,
    'pythonExp_0' : 1,
    'pythonExp_1' : 0,
    'pythonExp_2' : 0,
    'statsRating_0' : 0,
    'statsRating_1' : 0,
    'statsRating_2' : 0,
    'statsRating_3' : 1,
    'statsRating_4' : 0,
    'statsRating_5' : 0,
    'gpa' : 3.1,
    'labHours' : 2.2,
    'studyHours' : 4.7
}]

In [360]:
def predict(new_students: list) -> dict:
    # new_students is a list of dictionaries, one per student
    # Read in the model
    with open('model.pkl', 'rb') as f:
        model = pickle.load(f)
    
    # Convert the input dictionary into a pandas DataFrame
    X_new = pd.DataFrame(new_students)
    
    # Ensure that the feature names are in the same order as they were in the training data
    X_new = X_new[['took4061_0', 'took4061_1', 'pythonExp_0', 'pythonExp_1', 'pythonExp_2', 'statsRating_0',
                   'statsRating_1', 'statsRating_2', 'statsRating_3', 'statsRating_4', 'statsRating_5',
                   'gpa', 'labHours', 'studyHours']]
    
    # Get predictions from the model
    predictions = model.predict(X_new).tolist()
    
    # Return the predictions in a dictionary with the key 'predictions'
    return {'predictions': predictions}

In [361]:
print(predict(new_students))

{'predictions': [84.77680000000004]}
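
The saved model was trained on one hot encoded and standardized columns, so `predict()` expects inputs that have already been transformed in the same way. One possible alternative, which this lab does not use, is to bundle the preprocessing and the model in a scikit-learn `Pipeline` so the pickled object accepts raw values. The sketch below assumes made-up training data (`raw_train`, `toy_grade`) and arbitrary model settings.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["took4061", "pythonExp", "statsRating"]
numeric = ["gpa", "labHours", "studyHours"]

# Hypothetical toy training data with the lab's raw column names
rng = np.random.default_rng(0)
n = 60
raw_train = pd.DataFrame({
    "gpa": rng.uniform(2.5, 4.0, n),
    "labHours": rng.uniform(0.7, 5.5, n),
    "studyHours": rng.uniform(0.5, 7.5, n),
    "took4061": rng.integers(0, 2, n),
    "pythonExp": rng.integers(0, 3, n),
    "statsRating": rng.integers(0, 6, n),
})
toy_grade = 75 + 4 * raw_train["gpa"] + rng.normal(0, 1, n)

# Encode the categorical columns and scale the numeric ones in a single step
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numeric),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipeline.fit(raw_train, toy_grade)

# Round-trip through pickle, then predict directly from raw values
restored = pickle.loads(pickle.dumps(pipeline))
new_raw = pd.DataFrame([{
    "gpa": 3.1, "labHours": 2.2, "studyHours": 4.7,
    "took4061": 0, "pythonExp": 0, "statsRating": 3,
}])
result = {"predictions": restored.predict(new_raw).tolist()}
print(result)
```

With this design the fitted encoder and scaler travel with the model, so whoever loads the file does not have to reproduce the preprocessing by hand.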


# End