In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import sklearn.metrics
from sklearn.model_selection import GridSearchCV

# Assignment 3 (due Oct 30th at 11:59pm)

- This assignment covers Linear Regression, Model Selection, and Regularization. Please refer to the class notes and corresponding Colabs on the course website for the required background.

- The assignment requires that you participate of Kaggle Competitions.  We created a private competition that must be accessed via this [invitation link](https://www.kaggle.com/t/678a70d78f624f39b88d30184fe98e8b).

- This assignment is worth **100 points**.  After completing the solutions you will submit a copy of this notebook (`.ipynb`), including all your answers.

- You are free to use any Python library.

>  **Important**: Make sure all cells are executed before saving/downloading a copy of the notebook you will submit.

## Question 1 (5 points)
Load the data files available at the [Kaggle competition](https://www.kaggle.com/competitions/uri-ml-hw-3-f22) created for this assignent.  Print the shapes of all matrices/objects you are creating.  Can use `pandas` or `numpy`.

In [2]:
# your answer here 
train_x = pd.read_csv("uri-ml-hw-3-f22/trainx.csv")
train_y = pd.read_csv("uri-ml-hw-3-f22/trainy.csv")
test_x = pd.read_csv("uri-ml-hw-3-f22/testx.csv")

print("train_x shape: ",train_x.shape, "\ntrain_y shape: ", train_y.shape, "\ntest_x shape: ", test_x.shape)

train_x shape:  (14447, 9) 
train_y shape:  (14447, 2) 
test_x shape:  (6193, 9)


In [3]:
print("train_x first few: \n", train_x.head(), "\ntrain_y first few: \n", train_y.head(), "\ntest_x first few: \n", test_x.head())

## Question 2 (10 points)
Print the `min` and `max` values for all input features in the training data and the test data.  It will give you a rough idea of ranges on each column.

In [4]:
# your answer here
print("Max in each feature of train_x: \n", train_x.max(), "\nMin in each feature of train_x: \n", train_x.min())


In [5]:
print("Max in each feature of test_x: \n", test_x.max(), "\nMin in each feature of test_x: \n", test_x.min())

## Question 3 (75 pts)
Here you will perform model selection, testing a number of different configurations for a final model.

- play with regularization methods.  You are expected to try plain Linear Regression, and Linear Regression with L1/L2 regularization.  You can use the code provided in class or rely on `scikit-learn` implementations.  20 points are awarded for a correct use of all three methods.

- apply different pre-processing techniques.  30 points are awarded here for a correct pre-processing.  We expect you to preprocess your data with the following strategies:

    - normalization or scaling (e.g. `StandardScaler`, `MinMaxScaler`, `RobustScaler`)
    - feature transformations (e.g. `PolynomialFeatures`, applying functions like `log`, or both)
    - PCA after feature transformations

> Hint: most of the transformations require you to `fit` to the training data first and then `transform` the test data.  Please read the documentation of every transformation you intend to apply.  Alternatively, a `Pipeline` is extremely useful.  Examples can be found at [Pipelines and composite estimators](https://scikit-learn.org/stable/modules/compose.html).

- model selection must use **cross-validation**. You can use any of the functions provided in scikit-learn (e.g. `cross_val_score`, `train_test_split`).  Here is a good introduction to their usage: [Cross-Validation](https://scikit-learn.org/stable/modules/cross_validation.html).  25 points are awarded for a correct use of CV.

> Hint: if you dont want to implement your own cross-validation loop, there is a `GridSearchCV` object that can help in the process.  Documentation is available at [Tuning the hyper-parameters of an estimator](https://scikit-learn.org/stable/modules/grid_search.html).

### Initial Preprocessing

In [6]:
output_df = pd.DataFrame(test_x.Id)

In [7]:
output_df.head()

Unnamed: 0,Id
0,14447
1,14448
2,14449
3,14450
4,14451


In [8]:
train_x = train_x.drop(columns = 'Id')
train_y = train_y.drop(columns = 'Id')
test_x = test_x.drop(columns = 'Id')

In [9]:
train_x = train_x.to_numpy()
train_y = train_y.to_numpy()
test_x = test_x.to_numpy()

# First Model 
#### Regular Linear Regression with Robust Scaler and PolynomialFeatures == 2 and PCA with cutoff at 95%

In [10]:
lr_pipe = Pipeline([
    ('robust_scaler', preprocessing.RobustScaler()), 
    ('polynomailfeatures', preprocessing.PolynomialFeatures(2)), 
    ('PCA', PCA(n_components = 0.95)), 
    ('Linear Regression', linear_model.LinearRegression())
]
)

In [11]:
print(cross_val_score(lr_pipe, train_x, train_y, cv=10, scoring = 'neg_root_mean_squared_error'))

[-117288.67078766 -115115.13734309 -116447.86119489 -116808.56526957
 -112748.9515617  -115295.6363525  -113316.68318298 -114169.04216358
 -113450.38820082 -113428.22562681]


# Second Model
#### L2 Regression with Robust Scaler and PolynomialFeatures == 2 and PCA with cutoff at 95%

In [12]:
L2_pipe = Pipeline(steps = [
    ('robust_scaler', preprocessing.RobustScaler()),
    ('PolynomialFeatures', preprocessing.PolynomialFeatures(2)),
    ('PCA', PCA(n_components= .95)),
    ('L2 Regression', linear_model.Ridge(alpha = .0001))
])

In [13]:
print(-1 * (cross_val_score(L2_pipe, train_x, train_y, cv = 5, scoring = 'neg_root_mean_squared_error')))

[116120.731404   116639.11289825 114012.7648224  113709.40173904
 113420.60664128]


In [14]:
L2_pipe.fit(train_x, train_y)
L2_preds = L2_pipe.predict(test_x)

# Third Model
#### L1 Regression with Robust Scaler and PolynomalFeatures == 2 and PCA with cutoff at 95%

In [15]:
L1_pipe = Pipeline(steps = [
    ('robust_scaler', preprocessing.RobustScaler()), 
    ('PolynomaialFeatures', preprocessing.PolynomialFeatures(2)),
    ('PCA', PCA(n_components = .95)),
    ('L1 Regression', linear_model.Lasso(alpha = .0001))
])

In [16]:
print(-1 * (cross_val_score(L1_pipe, train_x, train_y, cv = 5, scoring = 'neg_root_mean_squared_error')))

[116120.73139847 116639.11289839 114012.76482262 113709.40173961
 113420.60664133]


# Fourth Model
#### L2 Regression with Grid Search on alpha values

In [17]:

ridge_param_grid = [{
    'L2_Regression__alpha': [.1, .01, .001, .0001]
    }]



In [18]:
Ridge_pipe = Pipeline([
    ('robust_scaler', preprocessing.RobustScaler()),
    ('PCA', PCA(n_components = .95)),
    ('L2_Regression', linear_model.Ridge())
])

In [19]:
grid_pipeline = GridSearchCV(Ridge_pipe, param_grid = ridge_param_grid, scoring = 'neg_root_mean_squared_error', cv = 5)
grid_pipeline.fit(train_x, train_y)
grid_pipeline.best_params_

{'L2_Regression__alpha': 0.0001}

In [20]:
print(-1 * (cross_val_score(grid_pipeline.best_estimator_, train_x, train_y, cv = 5, scoring = 'neg_root_mean_squared_error')))

[78654.98985389 80821.91585767 84300.18617634 80953.29873881
 81254.75254439]


# Fifth Model
#### L2 regression with multiple PolynomialFeature Transformations

In [21]:
degrees = [2, 3, 4, 5]

for degree in degrees:

    Ridge_pipe = Pipeline([
    ('robust_scaler', preprocessing.RobustScaler()),
    ('polynomialfeatures', preprocessing.PolynomialFeatures(degree = 2)),
    ('PCA', PCA(n_components = .95, svd_solver = 'full')),
    ('L2_Regression', linear_model.Ridge())
    ])
    grid_pipeline = GridSearchCV(Ridge_pipe, param_grid = ridge_param_grid, scoring = 'neg_root_mean_squared_error', cv = 5)
    grid_pipeline.fit(train_x, train_y)
    
    print("Degree {} RSME:\n{}\n".format(degree, -1 * (cross_val_score(grid_pipeline.best_estimator_, train_x, train_y, cv = 5, scoring = 'neg_root_mean_squared_error'))))



Degree 2 RSME:
[116120.73075934 116639.11290486 114012.76483189 113709.40179307
 113420.60664441]

Degree 3 RSME:
[116120.73075934 116639.11290486 114012.76483189 113709.40179307
 113420.60664441]

Degree 4 RSME:
[116120.73075934 116639.11290486 114012.76483189 113709.40179307
 113420.60664441]

Degree 5 RSME:
[116120.73075934 116639.11290486 114012.76483189 113709.40179307
 113420.60664441]



In [22]:
components = [8, 16, 32, 40]

for component in components:

    Ridge_pipe = Pipeline([
    ('robust_scaler', preprocessing.RobustScaler()),
    ('polynomialfeatures', preprocessing.PolynomialFeatures(degree = 2)),
    ('PCA', PCA(n_components = component)),
    ('L2_Regression', linear_model.Ridge())
    ])
    grid_pipeline = GridSearchCV(Ridge_pipe, param_grid = ridge_param_grid, scoring = 'neg_root_mean_squared_error', cv = 5)
    grid_pipeline.fit(train_x, train_y)
    
    print("Components {} RSME:\n{}\n".format(component, -1 * (cross_val_score(grid_pipeline.best_estimator_, train_x, train_y, cv = 5, scoring = 'neg_root_mean_squared_error'))))

Components 8 RSME:
[ 97563.0062469   99796.77529005 102170.99275965  99637.82593767
  99189.15623352]

Components 16 RSME:
[78067.25760645 79588.31378497 82817.09587501 78742.93592669
 79659.89942199]

Components 32 RSME:
[87320.52843385 69711.27933077 72583.39050994 68313.80831689
 67164.10828465]

Components 40 RSME:
[73594.48634389 66164.77660127 68132.54391928 63739.99819964
 63333.37430625]



In [23]:
Ridge_pipe = Pipeline([
('robust_scaler', preprocessing.RobustScaler()),
('polynomialfeatures', preprocessing.PolynomialFeatures(degree = 2)),
('PCA', PCA(n_components = 40)),
('L2_Regression', linear_model.Ridge())
])
grid_pipeline = GridSearchCV(Ridge_pipe, param_grid = ridge_param_grid, scoring = 'neg_root_mean_squared_error', cv = 5)
grid_pipeline.fit(train_x, train_y)
    
print("Cross validation cv = 5 RMSE:\n{}".format(-1 * (cross_val_score(grid_pipeline.best_estimator_, train_x, train_y, cv = 5, scoring = 'neg_root_mean_squared_error'))))

Cross validation cv = 5 RMSE:
[73594.48634389 66164.77660127 68132.54391928 63739.99819964
 63333.37430625]


In [24]:
curr_best_model = grid_pipeline.best_estimator_
bm_preds = curr_best_model.predict(test_x)

In [25]:
output_df['Predicted'] = bm_preds
output_df.head()

Unnamed: 0,Id,Predicted
0,14447,293112.944793
1,14448,217759.024055
2,14449,397950.619326
3,14450,379451.980303
4,14451,139152.207208


In [26]:
output_df.to_csv("best_model.csv", index = False)

## Question 4 (10 pts)
Include a description of your best solution, your place in the leaderboard, and public/private scores from the [Kaggle competition](https://www.kaggle.com/competitions/uri-ml-hw-3-f22).  You will only get points if your scores are above the Linear Regression baseline.

#### Answer
Currently, my position on the leaderboard is 2. I have beaten L1 regression and the random baseline, but have not optimized my L2 model enough to beat the L2 regression baseline. My model currently uses Ridge Regression (L2 Regression) with an alpha value of .0001. My preprocessing includes using RobustScaler, but I might change it to MinMaxScaler after seeing promising results after trying optimize a different model I tried building. I then perform PolynomialFeatures with degree of 2 after testing multiple degrees but discovered that it didn't really help (I am open to trying it again with different combinations in GridSearchCV). I then perform PCA with 40 Principal Components because I found that this was the right number of components after thouroughly testing a plethora of PCs. After that I perform grid search on different alpha values for the L2 regressor and create a "best model" with the best alpha value.   