
---
---

Applied Data Science in Python for Social Scientists



---
---
# Start Here
## Learning Goals
### General Goals
- Learn the fundamental concepts of applied machine learning
- Learn the fundamental concepts of supervised learning

### Specific Goals
- Learn the basics of regression
- Learn to apply different models of regression:
    - linear regression
    - polynomial regression
    - kNN regression
- Understand bias-variance tradeoff
- Learn to apply cross validation
- Learn to apply regularization (L1 vs. L2)
- Learn to evaluate and compare the performance of your regression models
- Learn to apply feature scaling
- Feature engineering
- Understand transfer learning




# Part I: Predicting the Prevalence of the CCD Disease (80 points)

For a long time now, humans of the **United States of America (USA)** have been suffering from a communicable disease called CCD, short for **Climate Change Denialism**, a serious disease that is making humans incapable of reasoning. True story! <sup>1</sup>

The Center of Logical Reasoning has been collecting data related to the disease since 2010, and has reached out to NYU for help in creating a model that predicts the prevalence of **Climate Change Denialism** in different states from a set of features. The dataset is **spatio-temporal**, as it has prevalence rates of the disease for ~50 states (spatial) across 7 years (temporal).

------------------
<sup>1. This is a work of fiction. The story, names, writing, and data depicted in this problem set are mostly fictitious. Any similarity to actual persons, living or dead, or to actual papers, is not purely coincidental but definitely inspirational. "Climate Change Denialism" is a fictitious disease that may have been inspired by a disorder of the same name found amongst certain individuals in the world.</sup>


## A. Training for the US (45 points)

Using the dataset `us_train.csv`, train a machine learning based regression model that predicts the prevalence of **Climate Change Denialism** disease for a particular state in the USA. The features are in the columns labeled as `A`, `B`, ..., `AC`. The outcome variable (i.e. the prevalence of CCD disease) is present in the column `outcome`. 

You may try different models (linear, polynomial, kNN) to see which one performs best at estimating the prevalence of the disease. You have data for the years 2010 to 2015 for 50 states in the U.S. Your model will be tested on data from 2016. The features for 2016 are provided in the file `us_test_x.csv`. The outcome/labels for 2016 are not provided.

For this part, you are required to train and evaluate your regression models in much the same way as we did in the recitation.
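As a rough illustration of comparing the three model families named above, the sketch below scores a linear, a polynomial, and a kNN regressor with 5-fold cross-validation on mean absolute error. The data is synthetic (it is not `us_train.csv`), and the names `X_demo`, `y_demo`, `candidates`, and `cv_mae`, as well as the hyperparameter values, are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for the real data (NOT us_train.csv): 300 rows, 5 features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(300, 5))
y_demo = 0.05 * X_demo[:, 0] + 0.001 * X_demo[:, 1] ** 2 + rng.normal(0, 0.5, 300)

# Each candidate is a pipeline, so the scaler is refit inside every CV fold.
candidates = {
    "linear (ridge)": make_pipeline(MinMaxScaler(), Ridge(alpha=1.0)),
    "polynomial (degree 2)": make_pipeline(
        MinMaxScaler(), PolynomialFeatures(degree=2), Ridge(alpha=1.0)
    ),
    "kNN (k=5)": make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=5)),
}

cv_mae = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so MAE is exposed as a negative number.
    scores = cross_val_score(
        model, X_demo, y_demo, scoring="neg_mean_absolute_error", cv=5
    )
    cv_mae[name] = -scores.mean()
    print(f"{name}: cross-validated MAE = {cv_mae[name]:.3f}")
```

Scoring on mean absolute error here matches the metric used on the Kaggle leaderboard; which model wins on the real data may well differ from this toy setup.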

As a submission for this part, you will fill the `us_predictions.csv` file and submit that along with this Notebook to NYU Classes. You will also submit `us_predictions.csv` file to Kaggle (see Part B).

In [None]:
# Importing libraries you "may" need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_absolute_error, r2_score
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
us_train = pd.read_csv("/content/drive/My Drive/Data_Science/Assignment/hw9_handout/us_train.csv")
us_test = pd.read_csv("/content/drive/My Drive/Data_Science/Assignment/hw9_handout/us_test_x.csv")

us_train = us_train[us_train.year != 2010]

In [None]:
#getting dummies and scaling train and test features
scaler = MinMaxScaler()
x_train = us_train.drop(['Id', 'year', 'outcome'], axis=1)
x_train = pd.get_dummies(x_train, columns=['states'], prefix="", prefix_sep="")
x_train.iloc[:,0:29] = scaler.fit_transform(x_train.iloc[:,0:29])
y_train = us_train['outcome']

x_test = us_test.drop(['Id', 'year'], axis=1)
x_test = pd.get_dummies(x_test, columns=['states'], prefix="", prefix_sep="")
x_test.iloc[:,0:29] = scaler.transform(x_test.iloc[:,0:29])


In [None]:
# Defining a range of alpha values for Lasso
alphas = np.linspace(0.01,1,100)
#alphas = np.linspace(0.0001,1,100)

# Initializing the instance of lasso
lasso = Lasso()

# Setting parameter grid for grid search
param_grid = {'alpha': alphas}

# defining grid search with 5-fold cross validation
grid_search = GridSearchCV(lasso, param_grid, scoring='r2', cv = 5)

# fitting the train
grid_search.fit(x_train, y_train)

# Printing the best set of parameters
print('Best parameters{}'.format(grid_search.best_params_))

# Printing the best score (here score is R squared score)
print('Best score {}'.format(grid_search.best_score_))

Best parameters{'alpha': 0.0001}
Best score 0.761925971322
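The grid search above tunes `alpha` for Lasso, which applies an L1 penalty. As a minimal sketch of how L1 differs from L2 (Ridge), the example below fits both on synthetic data in which only two of ten features matter and counts the coefficients that end up exactly zero. The data and the names `X_demo`, `y_demo`, `lasso_demo`, and `ridge_demo` are assumptions for illustration; note also that scikit-learn scales the two objectives differently, so equal `alpha` values are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of ten features influence the outcome.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0.0, 0.1, 200)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)  # L1 penalty
ridge_demo = Ridge(alpha=0.1).fit(X_demo, y_demo)  # L2 penalty

# L1 tends to drive irrelevant coefficients to exactly zero; L2 only shrinks them.
lasso_zeros = int(np.sum(lasso_demo.coef_ == 0))
ridge_zeros = int(np.sum(ridge_demo.coef_ == 0))
print("Lasso coefficients set exactly to zero:", lasso_zeros)
print("Ridge coefficients set exactly to zero:", ridge_zeros)
```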


In [None]:
linear_lasso_optimal = Lasso(alpha=grid_search.best_params_['alpha'])
linear_lasso_optimal.fit(x_train, y_train)

def evaluate(x, y, model):
    # Compute the predictions once and reuse them for both metrics
    predictions = model.predict(x)
    print ("Train mean_absolute_error (MAE)",
           mean_absolute_error(y, predictions))
    print ("Train R-squared Score (R2)",
           r2_score(y, predictions))

evaluate(x_train, y_train, linear_lasso_optimal)

('Train mean_absolute_error (MAE)', 0.3325357408961107)
('Train R-squared Score (R2)', 0.9327949017139796)


In [None]:
linear_lasso_optimal.predict(x_test)

array([12.95006068,  7.24689784, 10.47157262, 12.34292907, 10.1660963 ,
        7.25417224,  9.48093502, 11.1867625 ,  8.66226608, 11.11000771,
       11.51239037,  8.75550547, 10.33973038, 11.6178218 ,  9.46105172,
        9.59259171, 12.11156369, 12.43212198, 10.14244893, 10.16737968,
        9.30661649, 10.98165648,  8.10875611, 14.23688732, 11.76190856,
        8.45777062,  8.80864424, 10.17639623,  8.87686547,  9.8443763 ,
       11.29214099, 10.56621889, 11.43967284,  8.48577688, 11.51658872,
       12.25187071,  9.62781068, 10.89096339,  9.38529519, 12.85377601,
        8.85920709, 12.63232884, 11.17607157,  7.37922165,  8.19374114,
       11.00028022,  9.16724243, 13.80006748,  8.8737233 ,  8.33670379])

In [None]:
# This model is the best and is my final submission on Kaggle
us_predictions = pd.DataFrame()
us_predictions['Id'] = us_test['Id']
# predict() returns a 1-D array, so it can be assigned to the column directly
us_predictions['Predicted'] = linear_lasso_optimal.predict(x_test)
output_dir = '/content/drive/My Drive/Data_Science/Assignment/'
# to_csv returns None when given a path, so its result is not stored
us_predictions.to_csv(output_dir+'us_predictions.csv', index=False)

In [None]:

from sklearn.neighbors import KNeighborsRegressor

k_range = list(range(1,31))
weight_options = ["uniform", "distance"]

param_grid = dict(n_neighbors = k_range, weights = weight_options)
knn = KNeighborsRegressor()

grid = GridSearchCV(knn, param_grid, cv = 10)
grid.fit(x_train,y_train)


print (grid.best_score_)
print (grid.best_params_)
print (grid.best_estimator_)

0.8452447874650099
{'n_neighbors': 4, 'weights': 'distance'}
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=4, p=2,
          weights='distance')


In [None]:
best_params = grid.best_params_
knn = KNeighborsRegressor(n_neighbors=best_params['n_neighbors'], weights=best_params['weights'])
knn.fit(x_train,y_train)
predicted_value = knn.predict(x_test)
predicted_value

array([12.50416177,  7.49634181, 10.31328029, 12.18417014, 10.07277267,
        6.92137454,  9.21301559, 10.82459608,  8.30793724, 11.2612335 ,
       11.02552475,  8.0942518 ,  9.84831511, 11.09005695,  9.19400134,
       10.00226594, 12.49260391, 12.33955318, 10.34863907, 10.03362608,
        8.91586911, 10.41203039,  7.61568761, 14.0475303 , 10.96107948,
        7.95248145,  8.85979001,  9.48908416,  8.68742422,  9.21762938,
       11.08825391,  9.9464195 , 10.80405082,  8.70154992, 11.11409767,
       11.20925405,  9.85155634, 10.61883127,  9.27022143, 11.95672555,
        8.96331215, 12.54018888, 11.08131223,  7.07402586,  7.84612827,
       10.31897743,  8.60707925, 13.6999051 ,  8.4700293 ,  8.56157496])

In [None]:
us_predictions = pd.DataFrame()
us_predictions['Id'] = us_test['Id']
us_predictions['Predicted'] = knn.predict(x_test)
output_dir = '/content/drive/My Drive/Data_Science/Assignment/hw9_handout/'
# to_csv returns None when given a path, so its result is not stored
us_predictions.to_csv(output_dir+'us_predictions_knn.csv', index=False)

### Rubric

- +25 points for logical and reasonable steps to training and testing the models using the techniques taught in the course
- +20 points showing code and evaluation of **at least two regression models** at least one of which makes the same predictions as submitted on Kaggle and in the document `us_predictions.csv`

## B. Kaggle Submission (35 points)

Create an account on Kaggle, and submit your predictions as `us_predictions.csv` with the two columns `Id` and `Predicted` to Kaggle [here](https://www.kaggle.com/t/c621fd11faca492eb4db37ff5b9f78f0). 

You will be evaluated on the `Mean Absolute Error` as a scoring metric.
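For reference, the mean absolute error is the average of the absolute differences between the true and predicted values, $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. The sketch below computes it by hand on made-up numbers; the function name `mean_absolute_error_manual` and the sample values are hypothetical and only illustrate the formula.

```python
import numpy as np

def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up values: the absolute errors are 0.5, 0.5, 1.0 and 0.0.
y_true_demo = [10.0, 12.0, 8.0, 11.0]
y_pred_demo = [9.5, 12.5, 9.0, 11.0]
mae_demo = mean_absolute_error_manual(y_true_demo, y_pred_demo)
print(mae_demo)  # (0.5 + 0.5 + 1.0 + 0.0) / 4 = 0.5
```

Lower values are better, and the result is in the same units as the outcome variable.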

There are seven benchmarks/baselines that we have provided you on Kaggle. These are as follows: 

- `Trivial Baseline`
- `Baseline A (1 and 2)`
- `Baseline B (1 and 2)`
- `Baseline C (1 and 2)`

To be able to get full points on this task, you would need to pass the `Trivial Baseline`, either of `A1` or `A2`, either of `B1` or `B2`, **and** either of `C1` or `C2` baseline.

Note that the score you see on Kaggle Leaderboard for your submission is only based on 50% of the test dataset (i.e. 25 data points) -- we have hidden the other 50% of the dataset, and your score on those will only be revealed once the competition ends. In general, if you pass the baseline on the publicly available data, your model should pass the baselines on the hidden data as well. But we have kept it hidden so that you don't overfit your model on the test set. 

The Kaggle data points for the test set are from 2016, the features for which are provided in `us_test_x.csv`.

You have a maximum of 10 submissions per day on Kaggle. Before submitting the notebook, enter your Kaggle username in the **Kaggle Username** section above.

### Rubric

- +10 points for achieving/crossing Baseline A1 or Baseline A2 across both public (7 points) and hidden data points (3 points).
- +5 points for achieving/crossing Trivial Baseline across both public (3 points) and hidden data points (2 points).
- +10 points for achieving/crossing Baseline B1 or Baseline B2 across both public (7 points) and hidden data points (3 points).
- +10 points for achieving/crossing Baseline C1 or Baseline C2 across both public (7 points) and hidden data points (3 points).




## *Concepts required to complete this task*

*   Basics of Machine Learning
*   Basics of Regression
*   Feature Engineering




# Part II: Transfer Learning (40 points)

Many machine learning methods work well only under a common assumption: the training and test data are drawn from the same feature space and/or the same distribution. When the distribution changes, most statistical models need to be rebuilt from scratch using newly collected training data. In many real world applications, it is expensive or impossible to recollect the needed training data and rebuild the models. It would be nice to reduce the need and effort to recollect the training data. In such cases, **knowledge transfer** or **transfer learning** between task domains would be desirable. 

**Transfer learning** is a machine learning method in which a model developed for one task is reused as the starting point for a model on a second task. For example, in the paper on *Revealing Inherent Gender Biases in Using Word Embeddings for Sentiment Analysis* in PS7, the (imaginary) authors used word embeddings for sentiment analysis. That was a transfer learning approach where word embeddings were created from a machine learning model that was trained for the purpose of [predicting words](https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/#:~:text=Language%20modeling%20involves%20predicting%20the,machine%20translation%20and%20speech%20recognition.), but the model was later **reused** to extract word embeddings to be used for the sentiment analysis task. Another example would be a **spam filtering model** that has been trained on emails of one user (the source distribution) and is applied to a new user who receives significantly different emails (the target distribution). This process of applying the model to a different target distribution is sometimes also known as **domain adaptation**. <sup>2</sup>

---------
<sup> 2. Some people distinguish between **transfer learning** and **domain adaptation**, some don't. These are not very precisely defined terms in the literature.</sup> 
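To make the same-distribution assumption concrete, here is a minimal sketch in which a straight line is fitted on a source domain and then applied to a target domain whose inputs lie in a different range. The quadratic ground truth, the input ranges, and all names (`true_outcome`, `x_source`, `x_target`, `predict`) are assumptions made up for the example, not part of the assignment data.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_outcome(x):
    # Hypothetical ground truth, chosen only for illustration.
    return x ** 2

# Source domain: inputs in [0, 1]. Target domain: inputs in [1, 2].
x_source = rng.uniform(0.0, 1.0, 200)
x_target = rng.uniform(1.0, 2.0, 200)
y_source = true_outcome(x_source) + rng.normal(0.0, 0.05, 200)
y_target = true_outcome(x_target) + rng.normal(0.0, 0.05, 200)

# Fit a straight line on the source domain only.
slope, intercept = np.polyfit(x_source, y_source, deg=1)

def predict(x):
    return slope * x + intercept

mae_source = np.mean(np.abs(y_source - predict(x_source)))
mae_target = np.mean(np.abs(y_target - predict(x_target)))
print(f"MAE on source domain: {mae_source:.3f}")
print(f"MAE on target domain: {mae_target:.3f}")
```

The line is a reasonable approximation where it was fitted but drifts away from the truth outside that range, which is the kind of degradation that transfer learning and domain adaptation try to limit.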



## A. Cross-Country Generalizability of the Model (20 points)

Using a model trained on all the data from the United States, estimate the prevalence of Climate Change Denialism disease for the 8 provinces of the **Dominion of Canada** for the years 2011 to 2014. You may have to modify and retrain your model according to the data and features available to you for Canada. 

The dataset (i.e. features) for Canada is available as `ca_test_x.csv`. You will submit your final predictions as `ca_predictions.csv`, the template for which is provided to you in the handout. As you will notice, the features for Canada are a subset of the features for the USA; therefore, you'll have to train your US-based model accordingly.
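One way to restrict the US training data to the features that Canada also has is sketched below. The miniature frames `us_demo` and `ca_demo` and their values are hypothetical stand-ins for `us_train.csv` and `ca_test_x.csv`; the identifier columns are excluded because state and province names do not overlap.

```python
import pandas as pd

# Hypothetical miniature frames; the real files have many more columns and rows.
us_demo = pd.DataFrame({
    "Id": [0, 1], "states": ["NY", "CA"], "year": [2011, 2011],
    "A": [1.0, 2.0], "B": [3.0, 4.0], "F": [5.0, 6.0], "outcome": [9.1, 10.2],
})
ca_demo = pd.DataFrame({
    "Id": [0, 1], "states": ["ON", "QC"], "year": [2011, 2011],
    "B": [30.0, 40.0], "F": [50.0, 60.0],
})

id_cols = ["Id", "states", "year"]
# Keep only feature columns present in both frames, in the Canadian column order.
shared_features = [
    c for c in ca_demo.columns if c in us_demo.columns and c not in id_cols
]

X_us = us_demo[shared_features]   # training features restricted to shared columns
y_us = us_demo["outcome"]         # the outcome exists only in the US data
X_ca = ca_demo[shared_features]   # test features in the same column order
print(shared_features)
```

Selecting both frames with the same list keeps the column order identical, which matters because scikit-learn estimators match features by position when given arrays.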

As a submission for this part, you will fill the `ca_predictions.csv` file and submit that along with this Notebook to NYU Classes. You will also submit `ca_predictions.csv` file to Kaggle (see Part B).

In [None]:
ca_test = pd.read_csv("/content/drive/My Drive/Data_Science/Assignment/hw9_handout/ca_test_x.csv")
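# Keep only the columns that Canada also has, then drop the identifier columns.
# Chaining drop() avoids modifying a slice of us_train in place.
X_train = us_train[list(ca_test.columns)].drop(['Id', 'states', 'year'], axis=1)
y_train = us_train['outcome']
X_test = ca_test.drop(['Id', 'states', 'year'], axis=1)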

X_train.head()

| | B | F | H | I | K | L | M | O | T | W | X | Y | AA | AC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 55.929102 | 29.419012 | 66.509808 | 72.384352 | 34.337305 | 73.123699 | 42.101993 | 53.302688 | 31.846101 | 31.054514 | 43.67231 | 0.0 | 47.71848 | 60.503087 |
| 51 | 50.967649 | 20.082439 | 47.010948 | 0.0 | 33.21761 | 70.386448 | 38.274539 | 41.74591 | 27.171444 | 28.687129 | 0.0 | 0.0 | 37.736553 | 58.714818 |
| 52 | 51.19317 | 22.372542 | 47.545164 | 42.16071 | 33.590842 | 72.341628 | 39.85055 | 45.755405 | 28.047943 | 27.155292 | 34.836872 | 0.0 | 38.223477 | 52.455878 |
| 53 | 58.184308 | 19.730116 | 68.379561 | 69.59057 | 34.52392 | 74.687843 | 37.373962 | 48.113931 | 33.891264 | 30.358224 | 50.48822 | 0.0 | 45.040402 | 68.84834 |
| 54 | 44.202032 | 18.320822 | 33.12135 | 23.620157 | 22.953742 | 61.001589 | 33.546508 | 33.491069 | 22.64287 | 20.192397 | 30.040491 | 80.15661 | 31.650013 | 42.024312 |


In [None]:
X_test.head()

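| | B | F | H | I | K | L | M | O | T | W | X | Y | AA | AC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43.0 | 100.0 | 44.0 | 68.0 | 58.0 | 44.0 | 26.0 | 49.0 | 47.0 | 42.0 | 34.0 | 0.0 | 84.0 | 66.0 |
| 1 | 48.0 | 65.0 | 36.0 | 64.0 | 45.0 | 44.0 | 33.0 | 43.0 | 40.0 | 40.0 | 34.0 | 0.0 | 75.0 | 65.0 |
| 2 | 61.0 | 55.0 | 47.0 | 57.0 | 62.0 | 48.0 | 27.0 | 60.0 | 63.0 | 54.0 | 39.0 | 0.0 | 89.0 | 100.0 |
| 3 | 53.0 | 41.0 | 56.0 | 0.0 | 51.0 | 41.0 | 22.0 | 43.0 | 62.0 | 47.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 77.0 | 88.0 | 87.0 | 0.0 | 60.0 | 69.0 | 28.0 | 85.0 | 100.0 | 52.0 | 100.0 | 0.0 | 83.0 | 88.0 |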


In [None]:
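scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
# Note: fit_transform refits the scaler on the Canadian features, so each Canadian column is
# rescaled to its own [0, 1] range rather than the US range that scaler.transform(X_test) would apply.
X_test = scaler.fit_transform(X_test)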

In [None]:
# Defining a range of alpha values for Lasso
alphas = np.linspace(0.001,1,100)

# Initializing the instance of Lasso
lasso = Lasso()

# Setting parameter grid for grid search
param_grid = {'alpha': alphas}

# defining grid search with 5-fold cross validation
grid_search = GridSearchCV(lasso, param_grid, scoring='r2', cv = 5)

# fitting the train
grid_search.fit(X_train, y_train)

# Printing the best set of parameters
print('Best parameters{}'.format(grid_search.best_params_))

# Printing the best score (here score is R squared score)
print('Best score {}'.format(grid_search.best_score_))

Best parameters{'alpha': 0.011090909090909092}
Best score 0.606597208582


In [None]:
linear_lasso_optimal = Lasso(alpha=grid_search.best_params_['alpha'])
linear_lasso_optimal.fit(X_train, y_train)

def evaluate(X, Y, model):
    # Compute the predictions once and reuse them for both metrics
    predictions = model.predict(X)
    print ("Train mean_absolute_error (MAE)",
           mean_absolute_error(Y, predictions))
    print ("Train R-squared Score (R2)",
           r2_score(Y, predictions))

evaluate(X_train, y_train, linear_lasso_optimal)

('Train mean_absolute_error (MAE)', 0.6743379292806603)
('Train R-squared Score (R2)', 0.723125854610794)


In [None]:
linear_lasso_optimal.predict(X_test)

array([ 9.20401375,  8.7232817 ,  8.87039069,  8.36063583,  9.56016717,
        9.70366808,  9.0060616 ,  8.51884175,  9.19389384,  8.51306987,
        9.34342235,  8.34448389,  8.28236009,  9.99124087,  8.78094252,
       11.5449953 ,  8.85742643,  8.28403566,  8.19992046,  8.82215489,
        6.97105318,  9.7967199 ,  9.23615441, 10.57746835,  8.94988777,
        8.5851952 ,  9.09239234,  8.75214354,  8.42976749,  9.57482778,
        9.42038555,  9.94193502])

In [None]:
# This is the best model and is my final submission on Kaggle
ca_predictions = pd.DataFrame()
ca_predictions['Id'] = ca_test['Id']
# predict() returns a 1-D array, so it can be assigned to the column directly
ca_predictions['Predicted'] = linear_lasso_optimal.predict(X_test)
output_dir = '/content/drive/My Drive/Data_Science/Assignment/'
# to_csv returns None when given a path, so its result is not stored
ca_predictions.to_csv(output_dir+'ca_predictions.csv', index=False)

In [None]:

from sklearn.neighbors import KNeighborsRegressor

k_range = list(range(1,31))
weight_options = ["uniform", "distance"]

param_grid = dict(n_neighbors = k_range, weights = weight_options)
#print (param_grid)
knn = KNeighborsRegressor()

grid = GridSearchCV(knn, param_grid, cv = 10)
grid.fit(X_train,y_train)


print (grid.best_score_)
print (grid.best_params_)
print (grid.best_estimator_)

0.7369353759355491
{'n_neighbors': 4, 'weights': 'distance'}
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=4, p=2,
          weights='distance')


In [None]:
best_params = grid.best_params_
knn = KNeighborsRegressor(n_neighbors=best_params['n_neighbors'], weights=best_params['weights'])
knn.fit(X_train,y_train)
predicted_value = knn.predict(X_test)
predicted_value

array([10.15379827,  9.62778276,  9.60343147,  8.36378979,  9.04735846,
       10.39983411, 10.31420389,  8.11059539, 10.15664619,  8.99621168,
       10.33099189,  7.90901447,  7.77447097, 11.60392384,  9.9984013 ,
        8.28919905,  9.5208534 ,  8.98302055,  8.52327939,  7.92677283,
        9.76551628,  9.6502414 ,  9.69170643,  7.90616127, 10.12873253,
        9.37114469,  9.53523568,  9.16254356,  8.21105094, 10.69716281,
       10.04922247,  8.2435629 ])

In [None]:
ca_predictions = pd.DataFrame()
ca_predictions['Id'] = ca_test['Id']
ca_predictions['Predicted'] = knn.predict(X_test)
output_dir = '/content/drive/My Drive/Data_Science/Assignment/'
# to_csv returns None when given a path, so its result is not stored
ca_predictions.to_csv(output_dir+'ca_predictions_knn.csv', index=False)

### Rubric

- +10 points for logical and reasonable steps to training and testing the models using the techniques taught in the course
- +10 points showing code and evaluation of **at least two regression models** at least one of which makes the same predictions as submitted on Kaggle and in the document `ca_predictions.csv`.

## B. Kaggle Submission (20 points)

You will submit your predictions as `ca_predictions.csv` with the two columns `Id` and `Predicted` to Kaggle [here](https://www.kaggle.com/t/32971211c35047fcbeb5538e48fadd7d). 

You will be evaluated on the `Mean Absolute Error` scoring metric.

There is one benchmark/baseline that we have provided you on Kaggle that you will have to meet/beat to receive all the points.

Note that the score you see on Kaggle Leaderboard for your submission is only based on 75% of the dataset (i.e. 24 data points) -- we have hidden 25% of the dataset (8 data points), and your score on those will only be revealed once the competition ends.

You have a maximum of 10 submissions per day on Kaggle. Before submitting the notebook, enter your Kaggle username in the **Kaggle Username** section above.

### Rubric

- +20 points for achieving/crossing the baseline provided across both public (15 points) and hidden data points (5 points).

## *Concepts required to complete this task*

*   Basics of Machine Learning
*   Basics of Regression
*   Feature Engineering
*   Concept of Transfer Learning