# 23. GRADUATE ADMISSION: MODEL TRAINING 1
---

## 1. Presenting Our Objectives

- We will train 4 different models to find the best-performing one:
    - OLS Linear Regression
    - ElasticNet Linear Regression
    - K-Nearest Neighbors Regression
    - Random Forest Regression
    
- We will build prediction models in 2 different ways:
    - Manual: design one feature that will be used for a univariate model
    - Traditional: build models the usual way (train on all features and compare the results)
    
- We will work on 4 different variations of our data:
    - Data 1: No changes to the dataset
    - Data 2: Continuous features are discretized (binned)
    - Data 3: Remove target outliers, then
        - bin continuous features if `Data 2` performed better than `Data 1`,
        - skip binning if `Data 2` performed worse than `Data 1`
    - Data 4: Turn the objective into a multi-class classification problem by recoding the target as
        - `1=admit` right away
        - `2=waitlist`, the applicant is told to wait for a spot to open
        - `3=no chance`, tell the applicant "good luck in your future endeavors"
        
- We could end up training 4 models x 2 approaches x 4 datasets = `32 model variations`
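
The count of 32 can be checked by enumerating every combination. The sketch below is only an illustration: the short labels in `models`, `approaches`, and `datasets` are placeholder names chosen here, not identifiers used elsewhere in this notebook.

```python
from itertools import product

# Placeholder labels for the three dimensions listed above (names are assumptions)
models = ["OLS", "ElasticNet", "KNN", "RandomForest"]
approaches = ["manual", "traditional"]
datasets = ["Data 1", "Data 2", "Data 3", "Data 4"]

# Cartesian product: one tuple per (model, approach, dataset) combination
experiments = list(product(models, approaches, datasets))
print(len(experiments))
```

Keeping the combinations in a list like this would also make it easy to loop over them later and collect one score per experiment.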

## 2. Introducing Data 1
#### `No changes to the dataset`

In [2]:
import pandas as pd
import numpy as np
pd.set_option("display.max_columns", 99)
pd.set_option("display.max_rows", 999)
pd.set_option('display.precision', 3)

admission = pd.read_csv('data/Admission_1.1.csv')
# Two headers in the CSV carry a trailing space; copy them to clean names
admission['LOR'] = admission['LOR ']
admission['Chance of Admit'] = admission['Chance of Admit ']
# Drop the row ID and the original space-suffixed columns
admission_1 = admission.drop(['Serial No.', 'LOR ', 'Chance of Admit '], axis=1)
print(admission_1.shape)
print(admission_1.columns)
admission.head()

(500, 8)
Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'CGPA',
       'Research', 'LOR', 'Chance of Admit'],
      dtype='object')


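|   | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit | LOR | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 | 4.5 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 | 4.5 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.00 | 1 | 0.72 | 3.5 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.80 | 2.5 | 0.80 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 | 3.0 | 0.65 |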
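
The trailing-space headers can also be cleaned in one step by stripping whitespace from every column name. The sketch below runs on a small toy frame (its values are illustrative, and `raw` and `clean` are names introduced here). Unlike the cell above, it keeps the original column order instead of moving `LOR` and `Chance of Admit` to the end, so it is an alternative rather than a drop-in replacement.

```python
import pandas as pd

# Toy frame that mimics the trailing-space headers; values are illustrative only
raw = pd.DataFrame({
    "Serial No.": [1, 2],
    "LOR ": [4.5, 3.0],
    "Chance of Admit ": [0.92, 0.65],
})

# str.strip is applied to each column label, then the row ID is dropped
clean = raw.rename(columns=str.strip).drop(columns="Serial No.")
print(list(clean.columns))
```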


In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(admission_1, test_size=0.2, random_state=42)
print('Train:', train.shape, '\n', 'Test:', test.shape)

Train: (400, 8) 
 Test: (100, 8)
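
Since `train` and `test` still contain the target column, a later modeling step would typically separate features from the target. A minimal sketch on synthetic data follows; `toy`, `X_train`, and `y_train` are hypothetical names, and the random values only stand in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for admission_1 (random values, not the real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "CGPA": rng.uniform(6.8, 9.9, 50),
    "Research": rng.integers(0, 2, 50),
    "Chance of Admit": rng.uniform(0.3, 1.0, 50),
})

# Same 80/20 split and seed as the cell above
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=42)

# Separate the feature matrix from the target vector
X_train = train_toy.drop(columns="Chance of Admit")
y_train = train_toy["Chance of Admit"]
print(X_train.shape, y_train.shape)
```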


## 3. Manual Model: Feature Engineering

In [14]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
manual_cols = admission_1.columns
manual_df1_ = admission_1.copy()
manual_df1_[manual_cols] = scaler.fit_transform(admission_1)
manual_df1_.head()

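|   | GRE Score | TOEFL Score | University Rating | SOP | CGPA | Research | LOR | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.94 | 0.929 | 0.75 | 0.875 | 0.913 | 1.0 | 0.875 | 0.921 |
| 1 | 0.68 | 0.536 | 0.75 | 0.750 | 0.663 | 1.0 | 0.875 | 0.667 |
| 2 | 0.52 | 0.429 | 0.50 | 0.500 | 0.385 | 1.0 | 0.625 | 0.603 |
| 3 | 0.64 | 0.643 | 0.50 | 0.625 | 0.599 | 1.0 | 0.375 | 0.730 |
| 4 | 0.48 | 0.393 | 0.25 | 0.250 | 0.452 | 0.0 | 0.500 | 0.492 |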
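
`MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`. The sketch below checks that formula by hand on a toy frame; the numbers in `toy` are assumptions picked for illustration and are not claimed to be the dataset's actual minimum and maximum.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy values chosen for illustration only
toy = pd.DataFrame({
    "GRE Score": [290.0, 316.0, 324.0, 340.0],
    "CGPA": [6.8, 8.0, 8.87, 9.92],
})

# Library result versus the per-column formula written out by hand
scaled = MinMaxScaler().fit_transform(toy)
by_hand = (toy - toy.min()) / (toy.max() - toy.min())

assert np.allclose(scaled, by_hand.to_numpy())
print(by_hand.round(3))
```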


In [4]:
corr_matrix = admission_1.corr()
sorted_corr = corr_matrix['Chance of Admit'].sort_values(ascending=False)
sorted_corr

Chance of Admit      1.000
CGPA                 0.882
GRE Score            0.810
TOEFL Score          0.792
University Rating    0.690
SOP                  0.684
LOR                  0.645
Research             0.546
Name: Chance of Admit, dtype: float64

In [15]:
# Weight each scaled feature by its correlation with Chance of Admit (values from sorted_corr)
manual_df1 = pd.DataFrame()
manual_df1['CGPA'] = manual_df1_['CGPA']*0.882
manual_df1['GRE Score'] = manual_df1_['GRE Score']*0.810
manual_df1['TOEFL Score'] = manual_df1_['TOEFL Score']*0.792
manual_df1['U Rating'] = manual_df1_['University Rating']*0.690
manual_df1['SOP'] = manual_df1_['SOP']*0.684
manual_df1['LOR'] = manual_df1_['LOR']*0.645
manual_df1['Research'] = manual_df1_['Research']*0.546
manual_df1.head()

|   | CGPA | GRE Score | TOEFL Score | U Rating | SOP | LOR | Research |
|---|---|---|---|---|---|---|---|
| 0 | 0.806 | 0.761 | 0.735 | 0.517 | 0.599 | 0.564 | 0.546 |
| 1 | 0.585 | 0.551 | 0.424 | 0.517 | 0.513 | 0.564 | 0.546 |
| 2 | 0.339 | 0.421 | 0.339 | 0.345 | 0.342 | 0.403 | 0.546 |
| 3 | 0.529 | 0.518 | 0.509 | 0.345 | 0.428 | 0.242 | 0.546 |
| 4 | 0.399 | 0.389 | 0.311 | 0.172 | 0.171 | 0.323 | 0.000 |
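
The same weighting can be done without typing each coefficient by hand, by multiplying the scaled features by the correlation series directly. The sketch below uses synthetic data, and `toy`, `weights`, and `weighted` are names introduced here. It uses the unrounded correlations, so results on the real data could differ slightly from the hard-coded three-decimal weights above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the scaled frame (random values, not the real data)
rng = np.random.default_rng(0)
n = 200
cgpa = rng.uniform(0, 1, n)
toy = pd.DataFrame({
    "CGPA": cgpa,
    "Research": rng.integers(0, 2, n).astype(float),
    "Chance of Admit": 0.8 * cgpa + rng.normal(0, 0.05, n),
})

target = "Chance of Admit"
# Correlation of every feature with the target, excluding the target itself
weights = toy.corr()[target].drop(target)
# Multiply each feature column by its matching weight (aligned on column names)
weighted = toy[weights.index].mul(weights, axis=1)
print(weights.round(3))
```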


In [19]:
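# Sum only the seven weighted features, so re-running this cell cannot add an earlier sum_all or Admit
feature_cols = ['CGPA', 'GRE Score', 'TOEFL Score', 'U Rating', 'SOP', 'LOR', 'Research']
manual_df1['sum_all'] = manual_df1[feature_cols].sum(axis=1)
manual_df1['Admit'] = admission_1['Chance of Admit']
manual_df1.head()

|   | CGPA | GRE Score | TOEFL Score | U Rating | SOP | LOR | Research | sum_all | Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.806 | 0.761 | 0.735 | 0.517 | 0.599 | 0.564 | 0.546 | 4.529 | 0.92 |
| 1 | 0.585 | 0.551 | 0.424 | 0.517 | 0.513 | 0.564 | 0.546 | 3.701 | 0.76 |
| 2 | 0.339 | 0.421 | 0.339 | 0.345 | 0.342 | 0.403 | 0.546 | 2.736 | 0.72 |
| 3 | 0.529 | 0.518 | 0.509 | 0.345 | 0.428 | 0.242 | 0.546 | 3.117 | 0.80 |
| 4 | 0.399 | 0.389 | 0.311 | 0.172 | 0.171 | 0.323 | 0.000 | 1.765 | 0.65 |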
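
A row-wise sum over every column picks up whatever is already in the frame, including a `sum_all` or `Admit` column left over from an earlier run of the cell. Restricting the sum to an explicit feature list keeps the cell safe to re-run. The sketch below shows this on a toy frame; `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy weighted features (illustrative values only)
feature_cols = ["CGPA", "GRE Score"]
toy = pd.DataFrame({"CGPA": [0.8, 0.4], "GRE Score": [0.7, 0.3]})

# Simulate running the notebook cell three times
for _ in range(3):
    toy["sum_all"] = toy[feature_cols].sum(axis=1)
    toy["Admit"] = [0.92, 0.65]

# sum_all stays equal to the sum of the two features on every run
print(toy)
```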
