# Assignment 2 - Data Set Description
The questions below relate to the data files for the contest 'DengAI: Predicting Disease Spread', published at the following website:
https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/data/

Anyone can join the contest. For details on contest submissions, visit the following webpage:
https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/submissions/
You can showcase your machine learning skills by ranking highly in the contest.

Problem description:
Your goal is to predict the total_cases label for each (city, year, weekofyear) in the test set. There are two cities, San Juan and Iquitos, with test data for each city spanning 5 and 3 years respectively. You will make one submission that contains predictions for both cities. The data for each city have been concatenated along with a city column indicating the source: sj for San Juan and iq for Iquitos. The test set is a pure future hold-out, meaning the test data are sequential and non-overlapping with any of the training data. Throughout, missing values have been filled as NaNs.

Assignment:
The goal is achieved through three successive assignments (1, 2 and 3), all using the same dataset.


The features in this dataset
You are provided the following set of information on a (year, weekofyear) timescale:

(Where appropriate, units are provided as a _unit suffix on the feature name.)

City and date indicators

    city – City abbreviations: sj for San Juan and iq for Iquitos
    week_start_date – Date given in yyyy-mm-dd format

NOAA's GHCN daily climate data weather station measurements

    station_max_temp_c – Maximum temperature
    station_min_temp_c – Minimum temperature
    station_avg_temp_c – Average temperature
    station_precip_mm – Total precipitation
    station_diur_temp_rng_c – Diurnal temperature range
    
PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)

    precipitation_amt_mm – Total precipitation

NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)

    reanalysis_sat_precip_amt_mm – Total precipitation
    reanalysis_dew_point_temp_k – Mean dew point temperature
    reanalysis_air_temp_k – Mean air temperature
    reanalysis_relative_humidity_percent – Mean relative humidity
    reanalysis_specific_humidity_g_per_kg – Mean specific humidity
    reanalysis_precip_amt_kg_per_m2 – Total precipitation
    reanalysis_max_air_temp_k – Maximum air temperature
    reanalysis_min_air_temp_k – Minimum air temperature
    reanalysis_avg_temp_k – Average air temperature
    reanalysis_tdtr_k – Diurnal temperature range

Satellite vegetation - Normalized difference vegetation index (NDVI) - NOAA's CDR Normalized Difference Vegetation Index (0.5x0.5 degree scale) measurements

    ndvi_se – Pixel southeast of city centroid
    ndvi_sw – Pixel southwest of city centroid
    ndvi_ne – Pixel northeast of city centroid
    ndvi_nw – Pixel northwest of city centroid

# Assignment 2 - Questions
Use the merged data frame from Assignment 1 for this assignment.
This assignment focuses on data preprocessing and model building. Continue with the datasets loaded in Assignment 1 (or reload them with the same steps and recreate the merged data frame) and make a stratified 80-20 split based on the target to ensure there are no biases in the dataset. Predict "total_cases" using a stochastic gradient descent regressor. Calculate the root mean square error (RMSE). Also, plot the learning curve for the model. Provide your interpretations based on these metrics.

<ul>
    <li>Import the required libraries</li>
    <li>Make an 80-20 stratified split based on the target data</li>
    <li>Preprocess the data (encode the categorical features and standardize the numerical features)</li>
    <li>Build a stochastic gradient descent regressor and train the model</li>
    <li>Evaluate your model based on applicable metrics. Show the metric(s) you chose and explain why you chose them.</li>
    <li>List the hyper-parameters that can be tuned in SGD. Show the code along with comments on the parameter value chosen (use class presentation, discussion notes, some online reading) and why this value was chosen. Show the improvement you achieved in model accuracy.</li>
    <li>Plot the learning curve and provide insights</li>
    <li>Create a submission file which has predictions for both cities in the submission format prescribed by the contest at the link https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/data/</li>
    <li>Optional: Submit your predictions to the contest. You will get a submission score. Update it here. As you improve your model in the next assignments, you can try to improve this score.</li>
</ul>
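
The modelling steps listed above (fit an SGD regressor, compute RMSE, build a learning curve) can be sketched as follows. This is a minimal illustration on synthetic data, not the contest data: the arrays `X_demo` and `y_demo`, the 160/40 split, and every hyper-parameter value are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 3 features, linear target plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# SGD is sensitive to feature scale, so scaling and the regressor share one pipeline
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=42),
)
model.fit(X_demo[:160], y_demo[:160])

# RMSE on the 40 held-out rows
rmse = np.sqrt(mean_squared_error(y_demo[160:], model.predict(X_demo[160:])))

# Learning curve: cross-validated RMSE at increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    model, X_demo, y_demo, cv=5,
    scoring="neg_root_mean_squared_error",
    train_sizes=np.linspace(0.2, 1.0, 5),
)
train_rmse = -train_scores.mean(axis=1)  # scores are negated RMSE values
val_rmse = -val_scores.mean(axis=1)
```

Plotting `train_rmse` and `val_rmse` against `train_sizes` gives the learning curve. A persistent gap between the two lines would suggest overfitting, while two high, converged lines would suggest underfitting.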

Submit the following for this assignment:
1. the Jupyter notebook, with code and outputs, in .ipynb and .html formats, and
2. the submission_format.csv with your predictions
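
Filling the submission file might look like the sketch below. It assumes the contest's format file has the columns `city`, `year`, `weekofyear` and `total_cases`, and that case counts should be non-negative integers. The two rows and the raw predictions are hypothetical stand-ins, not real contest data.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for submission_format.csv (column layout is an assumption)
format_csv = "city,year,weekofyear,total_cases\nsj,2008,18,0\niq,2010,26,0\n"
submission = pd.read_csv(io.StringIO(format_csv))

# Hypothetical raw regressor outputs, one per row of the format file
raw_pred = np.array([12.7, -0.4])

# Round to whole cases and clip negatives, since a case count cannot be below zero
submission["total_cases"] = np.clip(np.rint(raw_pred), 0, None).astype(int)

# Serialise without the index so the columns match the format file
csv_text = submission.to_csv(index=False)
```

In the notebook itself, `submission.to_csv("submission_format.csv", index=False)` would write the file to disk instead of returning a string.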

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import os

In [2]:
#Checking working directory
os.getcwd()

'D:\\Learning\\USF - Data Science Programming\\Assignment 2'

In [3]:
#Recreating the merged dataframe from Assignment 1
df = pd.read_csv("dengue_features_train.csv")
df_rn = df.copy()
df_rn.columns = df_rn.columns.str.replace('station', 'stn')
df_rn.columns = df_rn.columns.str.replace('precip', 'prec')
df_rn.columns = df_rn.columns.str.replace('humidity', 'hd')
df_rn.columns = df_rn.columns.str.replace('reanalysis', 're_an')
df_rn.year = df_rn.year.astype('category')
df_pred = pd.read_csv("dengue_labels_train.csv")
df_merged = pd.merge(df_rn, df_pred, on=['city','year','weekofyear'], how='outer')
#Forward-filling missing values; .ffill() replaces the deprecated fillna(method='ffill')
df_merged = df_merged.ffill()
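
One caveat with a plain forward fill on the concatenated frame: if the first rows of one city contain gaps, they are filled with the last values of the preceding city. The sketch below shows a grouped alternative on a toy frame; the values are hypothetical, and whether the real data has such leading gaps is not checked here.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): the first 'iq' row has a missing reading
toy = pd.DataFrame({
    "city": ["sj", "sj", "iq", "iq"],
    "ndvi_ne": [0.1, 0.2, np.nan, 0.4],
})

# Plain forward fill carries the last 'sj' value into the first 'iq' row
filled_global = toy.ffill()

# Grouped forward fill only propagates values within the same city,
# so the leading 'iq' gap stays NaN and can be handled separately
filled_by_city = toy.copy()
filled_by_city["ndvi_ne"] = toy.groupby("city")["ndvi_ne"].ffill()
```

Any gaps left at the start of a city after the grouped fill still need a strategy, for example a backward fill within the same group.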

In [4]:
df_merged.head(3)

Unnamed: 0,city,year,weekofyear,week_start_date,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precitation_amt_mm,re_an_air_temp_k,...,re_an_relative_hd_percent,re_an_sat_prec_amt_mm,re_an_specific_hd_g_per_kg,re_an_tdtr_k,stn_avg_temp_c,stn_diur_temp_rng_c,stn_max_temp_c,stn_min_temp_c,stn_prec_mm,total_cases
0,sj,1990,18,1990-04-30,0.1226,0.103725,0.198483,0.177617,12.42,297.572857,...,73.365714,12.42,14.012857,2.628571,25.442857,6.9,29.4,20.0,16.0,4
1,sj,1990,19,1990-05-07,0.1699,0.142175,0.162357,0.155486,22.82,298.211429,...,77.368571,22.82,15.372857,2.371429,26.714286,6.371429,31.7,22.2,8.6,5
2,sj,1990,20,1990-05-14,0.03225,0.172967,0.1572,0.170843,34.54,298.781429,...,82.052857,34.54,16.848571,2.3,26.714286,6.485714,32.2,22.8,41.4,4


In [5]:
df_copy = df_merged.copy()

In [6]:
df_copy['city'] = df_copy['city'].astype('category')

In [7]:
df_copy['week_start_date'] = df_copy['week_start_date'].astype('datetime64[ns]')

In [8]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1456 entries, 0 to 1455
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   city                        1456 non-null   category      
 1   year                        1456 non-null   int64         
 2   weekofyear                  1456 non-null   int64         
 3   week_start_date             1456 non-null   datetime64[ns]
 4   ndvi_ne                     1456 non-null   float64       
 5   ndvi_nw                     1456 non-null   float64       
 6   ndvi_se                     1456 non-null   float64       
 7   ndvi_sw                     1456 non-null   float64       
 8   precitation_amt_mm          1456 non-null   float64       
 9   re_an_air_temp_k            1456 non-null   float64       
 10  re_an_avg_temp_k            1456 non-null   float64       
 11  re_an_dew_point_temp_k      1456 non-null   float64     

In [9]:
df_copy['day'] = df_copy['week_start_date'].dt.day

In [10]:
df_copy['month'] = df_copy['week_start_date'].dt.month

In [44]:
df_copy.head()

Unnamed: 0,city,year,weekofyear,week_start_date,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precitation_amt_mm,re_an_air_temp_k,...,re_an_specific_hd_g_per_kg,re_an_tdtr_k,stn_avg_temp_c,stn_diur_temp_rng_c,stn_max_temp_c,stn_min_temp_c,stn_prec_mm,total_cases,day,month
0,sj,1990,18,1990-04-30,0.1226,0.103725,0.198483,0.177617,12.42,297.572857,...,14.012857,2.628571,25.442857,6.9,29.4,20.0,16.0,4,30,4
1,sj,1990,19,1990-05-07,0.1699,0.142175,0.162357,0.155486,22.82,298.211429,...,15.372857,2.371429,26.714286,6.371429,31.7,22.2,8.6,5,7,5
2,sj,1990,20,1990-05-14,0.03225,0.172967,0.1572,0.170843,34.54,298.781429,...,16.848571,2.3,26.714286,6.485714,32.2,22.8,41.4,4,14,5
3,sj,1990,21,1990-05-21,0.128633,0.245067,0.227557,0.235886,15.36,298.987143,...,16.672857,2.428571,27.471429,6.771429,33.3,23.3,4.0,3,21,5
4,sj,1990,22,1990-05-28,0.1962,0.2622,0.2512,0.24734,7.52,299.518571,...,17.21,3.014286,28.942857,9.371429,35.0,23.9,5.8,6,28,5


In [88]:
df_copy = df_copy.drop(['week_start_date'], axis = 1)

In [89]:
#Separating the total_cases column into the y-variable (to be predicted)
y = df_copy[["total_cases"]]

In [99]:
#Listing all column names; the target column is removed from X in a later cell
cols = df_copy.columns.to_list()

In [101]:
cols

['city',
 'year',
 'weekofyear',
 'ndvi_ne',
 'ndvi_nw',
 'ndvi_se',
 'ndvi_sw',
 'precitation_amt_mm',
 're_an_air_temp_k',
 're_an_avg_temp_k',
 're_an_dew_point_temp_k',
 're_an_max_air_temp_k',
 're_an_min_air_temp_k',
 're_an_prec_amt_kg_per_m2',
 're_an_relative_hd_percent',
 're_an_sat_prec_amt_mm',
 're_an_specific_hd_g_per_kg',
 're_an_tdtr_k',
 'stn_avg_temp_c',
 'stn_diur_temp_rng_c',
 'stn_max_temp_c',
 'stn_min_temp_c',
 'stn_prec_mm',
 'total_cases',
 'day',
 'month']

In [104]:
#Moving the last two columns (day, month) to the front
cols = cols[-2:] + cols[:-2]

In [106]:
X = df_copy[cols]
X

Unnamed: 0,day,month,city,year,weekofyear,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precitation_amt_mm,...,re_an_relative_hd_percent,re_an_sat_prec_amt_mm,re_an_specific_hd_g_per_kg,re_an_tdtr_k,stn_avg_temp_c,stn_diur_temp_rng_c,stn_max_temp_c,stn_min_temp_c,stn_prec_mm,total_cases
0,30,4,sj,1990,18,0.122600,0.103725,0.198483,0.177617,12.42,...,73.365714,12.42,14.012857,2.628571,25.442857,6.900000,29.4,20.0,16.0,4
1,7,5,sj,1990,19,0.169900,0.142175,0.162357,0.155486,22.82,...,77.368571,22.82,15.372857,2.371429,26.714286,6.371429,31.7,22.2,8.6,5
2,14,5,sj,1990,20,0.032250,0.172967,0.157200,0.170843,34.54,...,82.052857,34.54,16.848571,2.300000,26.714286,6.485714,32.2,22.8,41.4,4
3,21,5,sj,1990,21,0.128633,0.245067,0.227557,0.235886,15.36,...,80.337143,15.36,16.672857,2.428571,27.471429,6.771429,33.3,23.3,4.0,3
4,28,5,sj,1990,22,0.196200,0.262200,0.251200,0.247340,7.52,...,80.460000,7.52,17.210000,3.014286,28.942857,9.371429,35.0,23.9,5.8,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1451,28,5,iq,2010,21,0.342750,0.318900,0.256343,0.292514,55.30,...,88.765714,55.30,18.485714,9.800000,28.633333,11.933333,35.4,22.4,27.0,5
1452,4,6,iq,2010,22,0.160157,0.160371,0.136043,0.225657,86.47,...,91.600000,86.47,18.070000,7.471429,27.433333,10.500000,34.7,21.7,36.6,8
1453,11,6,iq,2010,23,0.247057,0.146057,0.250357,0.233714,58.94,...,94.280000,58.94,17.008571,7.500000,24.400000,6.900000,32.2,19.2,7.4,1
1454,18,6,iq,2010,24,0.333914,0.245771,0.278886,0.325486,59.67,...,94.660000,59.67,16.815714,7.871429,25.433333,8.733333,31.2,21.0,16.0,1


In [109]:
#Keeping the first 25 of 26 columns, which drops the trailing total_cases column
X = X.iloc[:,:25]
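
Removing the target by position works only while `total_cases` happens to be the last column. A name-based version, sketched here on a hypothetical toy frame, does not depend on column order:

```python
import pandas as pd

# Toy frame (hypothetical values) with the target in the middle of the columns
toy = pd.DataFrame({
    "city": ["sj", "iq"],
    "ndvi_ne": [0.12, 0.34],
    "total_cases": [4, 5],
    "day": [30, 28],
    "month": [4, 5],
})

# Target as a Series, features as everything except the target
y_toy = toy["total_cases"]
X_toy = toy.drop(columns=["total_cases"])
```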

In [14]:
from sklearn.model_selection import train_test_split

In [110]:
#Creating a stratified 80:20 split based on the City
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = X["city"], test_size = 0.2, random_state = 42, shuffle = True)
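
The split above stratifies on `city`, whereas the assignment text asks for a split stratified on the target. Because `total_cases` is numeric, one common approach is to bin it into quantiles and stratify on the bins. The sketch below uses synthetic data; the choice of five bins is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): one feature and a count-like target
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=200),
    "total_cases": rng.poisson(lam=20, size=200),
})

# Bin the target into up to five quantile buckets; duplicates="drop" merges
# buckets whose edges coincide (e.g. when many rows share the same count)
bins = pd.qcut(toy["total_cases"], q=5, labels=False, duplicates="drop")

# Stratifying on the bins keeps the target distribution similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["feature"]], toy["total_cases"],
    test_size=0.2, stratify=bins, random_state=42,
)
```

To respect both city and target, the bin label could be combined with the city code into a single stratification key, provided each combination has at least two rows.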

In [112]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1164 entries, 1449 to 572
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   day                         1164 non-null   int64   
 1   month                       1164 non-null   int64   
 2   city                        1164 non-null   category
 3   year                        1164 non-null   int64   
 4   weekofyear                  1164 non-null   int64   
 5   ndvi_ne                     1164 non-null   float64 
 6   ndvi_nw                     1164 non-null   float64 
 7   ndvi_se                     1164 non-null   float64 
 8   ndvi_sw                     1164 non-null   float64 
 9   precitation_amt_mm          1164 non-null   float64 
 10  re_an_air_temp_k            1164 non-null   float64 
 11  re_an_avg_temp_k            1164 non-null   float64 
 12  re_an_dew_point_temp_k      1164 non-null   float64 
 13  re_an_max_air_te

In [155]:
train_num_cols = X_train.select_dtypes(include=np.number)

In [163]:
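#Dropping the first five numeric columns: day, month, year, weekofyear and ndvi_ne
#Note: ndvi_ne is a climate feature; slicing with 4: instead would keep it
train_num_cols = train_num_cols.iloc[:,5:]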

In [164]:
train_num_cols

Unnamed: 0,ndvi_nw,ndvi_se,ndvi_sw,precitation_amt_mm,re_an_air_temp_k,re_an_avg_temp_k,re_an_dew_point_temp_k,re_an_max_air_temp_k,re_an_min_air_temp_k,re_an_prec_amt_kg_per_m2,re_an_relative_hd_percent,re_an_sat_prec_amt_mm,re_an_specific_hd_g_per_kg,re_an_tdtr_k,stn_avg_temp_c,stn_diur_temp_rng_c,stn_max_temp_c,stn_min_temp_c,stn_prec_mm
1449,0.158500,0.133071,0.145600,59.40,297.278571,297.935714,296.738571,306.0,294.0,87.30,97.445714,59.40,18.391429,6.185714,27.400000,10.400000,33.7,21.2,32.0
295,0.010867,0.091929,0.120443,6.35,297.412857,297.457143,294.540000,299.5,295.9,57.00,84.135714,6.35,15.888571,1.657143,25.185714,4.842857,28.3,21.7,46.5
571,0.058550,0.122729,0.113186,35.69,298.751429,298.835714,294.281429,301.3,296.7,22.36,76.581429,35.69,15.630000,3.128571,27.200000,7.771429,33.3,22.8,20.6
306,0.111800,0.185529,0.211271,0.00,297.428571,297.600000,293.834286,300.6,295.2,28.00,80.647143,0.00,15.235714,2.542857,26.200000,7.700000,32.2,22.2,14.2
823,0.161100,0.239671,0.285971,11.80,298.055714,298.207143,292.971429,300.1,296.7,8.10,73.414286,11.80,14.388571,2.385714,24.857143,6.342857,28.9,21.1,2.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
992,0.195617,0.188314,0.133900,27.97,297.291429,299.178571,293.452857,308.1,289.8,9.71,82.225714,27.97,14.985714,12.871429,26.520000,10.880000,34.0,18.4,10.9
593,0.122075,0.254571,0.246929,84.76,300.738571,300.785714,296.437143,303.1,298.7,19.69,77.555714,84.76,17.901429,3.171429,27.971429,6.928571,32.8,23.9,15.8
1093,0.162971,0.154129,0.161100,53.58,296.942857,298.307143,294.154286,307.3,291.4,13.55,86.655714,53.58,15.618571,10.100000,27.075000,10.875000,33.2,20.7,1.5
855,-0.035350,0.175500,0.195871,12.22,301.474286,301.692857,296.547143,303.7,299.6,14.10,74.732857,12.22,18.001429,3.071429,28.857143,7.385714,33.9,24.4,7.3


In [165]:
test_num_cols = X_test.select_dtypes(include=np.number)

In [166]:
#Applying the same column selection to the test split (this also drops ndvi_ne)
test_num_cols = test_num_cols.iloc[:,5:]

In [167]:
test_num_cols

Unnamed: 0,ndvi_nw,ndvi_se,ndvi_sw,precitation_amt_mm,re_an_air_temp_k,re_an_avg_temp_k,re_an_dew_point_temp_k,re_an_max_air_temp_k,re_an_min_air_temp_k,re_an_prec_amt_kg_per_m2,re_an_relative_hd_percent,re_an_sat_prec_amt_mm,re_an_specific_hd_g_per_kg,re_an_tdtr_k,stn_avg_temp_c,stn_diur_temp_rng_c,stn_max_temp_c,stn_min_temp_c,stn_prec_mm
979,0.179000,0.312829,0.262557,94.77,297.285714,298.650000,295.740000,305.6,293.1,103.87,92.192857,94.77,17.305714,7.414286,27.625000,11.325000,35.2,20.5,51.0
1143,0.153543,0.132829,0.197043,32.95,295.634286,296.814286,293.860000,304.2,289.8,25.90,90.447143,32.95,15.312857,8.400000,25.075000,9.075000,32.1,18.6,98.5
732,-0.008067,0.159114,0.158800,175.41,298.365714,298.457143,295.347143,300.4,297.2,90.20,83.550000,175.41,16.740000,2.000000,24.814286,4.914286,28.9,21.7,145.8
1195,0.246071,0.351857,0.260457,50.88,297.857143,299.057143,294.644286,307.7,291.9,11.00,84.467143,50.88,16.168571,10.485714,27.350000,9.550000,32.8,22.4,89.2
483,0.113950,0.212571,0.168914,44.23,299.970000,300.007143,296.742857,302.6,298.4,61.70,82.645714,44.23,18.212857,1.842857,28.085714,6.357143,31.7,23.9,26.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1097,0.174643,0.217414,0.232300,89.28,297.658571,298.550000,293.045714,308.2,290.8,6.50,77.861429,89.28,14.584286,12.642857,27.850000,11.200000,34.2,21.3,117.0
1179,0.269617,0.230800,0.277100,54.22,298.158571,299.392857,297.045714,305.9,294.4,44.51,94.401429,54.22,18.731429,8.014286,28.266667,9.200000,33.8,23.3,19.0
1295,0.082229,0.111057,0.064743,76.68,296.178571,297.185714,294.038571,306.8,286.9,29.10,89.280000,76.68,15.740000,10.200000,24.700000,10.633333,32.8,17.5,61.8
866,0.031800,0.107760,0.125517,0.00,299.302857,299.471429,294.551429,300.8,297.7,14.50,75.240000,0.00,15.917143,1.971429,25.928571,6.171429,30.0,22.2,26.9
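
Selecting the numeric features by position makes it easy to drop one column too many. A name-based selection, sketched below on a hypothetical toy frame, states exactly which columns are excluded; the list `calendar_cols` is an assumption about which columns should not be scaled.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical subset of the real columns)
toy = pd.DataFrame({
    "day": [30, 7], "month": [4, 5], "city": ["sj", "sj"],
    "year": [1990, 1990], "weekofyear": [18, 19],
    "ndvi_ne": [0.1226, 0.1699], "ndvi_nw": [0.103725, 0.142175],
})

# Calendar columns to leave out of scaling (assumption)
calendar_cols = ["day", "month", "year", "weekofyear"]

# Keep every numeric column that is not a calendar column, in original order
num_cols = [
    c for c in toy.select_dtypes(include=np.number).columns
    if c not in calendar_cols
]
toy_num = toy[num_cols]
```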


In [177]:
train_cat_cols = X_train[["city"]]

In [175]:
test_cat_cols = X_test[["city"]]

In [171]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_tr_num_scaled = ss.fit_transform(train_num_cols)

In [172]:
X_test_num_scaled = ss.transform(test_num_cols)

In [178]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
X_tr_cat_scaled = cat_encoder.fit_transform(train_cat_cols)
X_test_cat_scaled = cat_encoder.transform(test_cat_cols)

In [181]:
cat_encoder.categories_

[array(['iq', 'sj'], dtype=object)]
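
The separate scaling and encoding steps above can also be combined into one preprocessing object. The following is a minimal sketch on toy frames, assuming two numeric columns and the `city` column; it is not necessarily how the notebook proceeds.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy train and test frames (hypothetical values)
train_toy = pd.DataFrame({
    "city": ["sj", "sj", "iq", "iq"],
    "stn_avg_temp_c": [25.4, 26.7, 27.4, 28.6],
    "stn_prec_mm": [16.0, 8.6, 32.0, 27.0],
})
test_toy = pd.DataFrame({
    "city": ["iq", "sj"],
    "stn_avg_temp_c": [27.0, 26.0],
    "stn_prec_mm": [10.0, 20.0],
})

# Scale the numeric columns and one-hot encode the city in a single transformer
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["stn_avg_temp_c", "stn_prec_mm"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Fit on the training split only, then reuse the fitted statistics on the test split
X_train_prepared = preprocess.fit_transform(train_toy)
X_test_prepared = preprocess.transform(test_toy)
```

Fitting on the training split alone keeps test-set statistics out of the scaler, which matches the `fit_transform` / `transform` pattern already used above.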

In [183]:
from sklearn.compose import ColumnTransformer

X_tr_scaled = ColumnTransformer