# **Lab: Neural Networks**



## Exercise 1: Regression with Pytorch

In this exercise, we will build a neural network with PyTorch to predict pollution levels. We will work with the Beijing Pollution dataset:
https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/Beijing%20PM2.5

The steps are:
1.   Setup Repository
2.   Load and Explore Dataset
3.   Prepare Data
4.   Baseline Model
5.   Define Architecture
6.   Create Data Loader
7.   Train Model
8.   Assess Performance
9.   Push Changes


### 1. Setup Repository

**[1.1]** Go to a folder of your choice on your computer (where you store projects)

In [None]:
# Placeholder for student's code (1 command line)
# Task: Go to a folder of your choice on your computer (where you store projects)

In [None]:
# Solution
cd ~/Projects/

**[1.2]** Copy the cookiecutter data science template

In [None]:
# Placeholder for student's code (1 command line)
# Task: Copy the cookiecutter data science template

In [None]:
# Solution
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science

Follow the prompts (name both the project and the repo `adv_dsi_lab_5`)

**[1.3]** Go inside the created folder `adv_dsi_lab_5`


In [None]:
# Placeholder for student's code (1 command line)
# Task: Go inside the created folder adv_dsi_lab_5

In [None]:
# Solution
cd adv_dsi_lab_5

**[1.4]** Create a file called `Dockerfile` and add the following content:

`FROM jupyter/scipy-notebook:0ce64578df46`

`RUN pip install torch==1.9.0+cpu torchvision==0.10.0+cpu torchtext==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html`

`ENV PYTHONPATH "${PYTHONPATH}:/home/jovyan/work"`

`RUN echo "export PYTHONPATH=/home/jovyan/work" >> ~/.bashrc`

`WORKDIR /home/jovyan/work`


In [None]:
# Placeholder for student's code (1 command line)
# Task: Create a file called Dockerfile 

In [None]:
# Solution
vi Dockerfile

We will create our own Docker image based on the official `jupyter/scipy-notebook` image.

**[1.5]** Build the image from this Dockerfile

In [None]:
docker build -t pytorch-notebook:latest .

Syntax: docker build [OPTIONS] PATH 

Options:

`-t: Name and optionally a tag in the 'name:tag' format`

Documentation: https://docs.docker.com/engine/reference/commandline/build/

**[1.6]** Run the built image

In [75]:
# docker run -dit --rm --name adv_dsi_lab_5 -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ~/Projects/adv_dsi/adv_dsi_lab_5:/home/jovyan/work -v ~/Projects/adv_dsi/src:/home/jovyan/work/src pytorch-notebook:latest
docker run -dit --rm --name adv_dsi_lab_5 -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ~/Projects/adv_dsi/adv_dsi_lab_5:/home/jovyan/work pytorch-notebook:latest

**[1.7]** Display last 50 lines of logs

In [None]:
docker logs --tail 50 adv_dsi_lab_5

Copy the URL displayed and paste it into a browser to launch Jupyter Lab

**[1.8]** Initialise the repo

In [None]:
# Placeholder for student's code (1 command line)
# Task: Initialise the repo

In [None]:
# Solution
git init

**[1.9]** Log in to GitHub with your account (https://github.com/) and create a public repo with the name `adv_dsi_lab_5`

**[1.10]** In your local repo `adv_dsi_lab_5`, link it with GitHub (replace `<username>` in the URL with your username)

In [None]:
# Placeholder for student's code (1 command line)
# Task: Link repo with Github

In [None]:
# Solution
git remote add origin git@github.com:<username>/adv_dsi_lab_5.git

**[1.11]** Add your changes to the git staging area and commit them

In [None]:
# Placeholder for student's code (2 command lines)
# Task: Add your changes to the git staging area and commit them

In [None]:
# Solution
git add .
git commit -m "init"

**[1.12]** Push your master branch to origin

In [None]:
# Placeholder for student's code (1 command line)
# Task: Push your master branch to origin

In [None]:
# Solution
git push --set-upstream origin master

**[1.13]** Prevent pushes to the `master` branch

In [None]:
# Placeholder for student's code (1 command line)
# Task: Prevent pushes to the master branch

In [None]:
# Solution
git config branch.master.pushRemote no_push

**[1.14]** Create a new git branch called `pytorch_reg`

In [None]:
# Placeholder for student's code (1 command line)
# Task: Create a new git branch called pytorch_reg

In [None]:
# Solution
git checkout -b pytorch_reg

**[1.15]** Navigate to the folder `notebooks` and create a new Jupyter notebook called `1_pytorch_regression.ipynb`

### 2.   Load and Explore Dataset

**[2.1]** Download the dataset into the `data/raw` folder: https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Beijing%20PM2.5/PRSA_data_2010.1.1-2014.12.31.csv

In [141]:
!wget -P ../data/raw https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Beijing%20PM2.5/PRSA_data_2010.1.1-2014.12.31.csv

--2022-03-18 07:11:54--  https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Beijing%20PM2.5/PRSA_data_2010.1.1-2014.12.31.csv
Resolving code.datasciencedojo.com (code.datasciencedojo.com)... 167.99.111.153
Connecting to code.datasciencedojo.com (code.datasciencedojo.com)|167.99.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1966669 (1.9M) [text/plain]
Saving to: ‘../data/raw/PRSA_data_2010.1.1-2014.12.31.csv.1’


2022-03-18 07:11:56 (1.45 MB/s) - ‘../data/raw/PRSA_data_2010.1.1-2014.12.31.csv.1’ saved [1966669/1966669]



**[2.2]** Launch the magic commands for auto-reloading external modules

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Launch the magic commands for auto-reloading external modules

In [80]:
#Solution
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


**[2.3]** Import the pandas and numpy packages

In [81]:
# Placeholder for student's code (2 lines of Python code)
# Task: Import the pandas and numpy packages

In [147]:
#Solution
import pandas as pd
import numpy as np

**[2.4]** Load the data in a dataframe called `df`


In [148]:
# Placeholder for student's code (1 line of Python code)
# Task: Load the data in a dataframe called df

In [149]:
#Solution:
df = pd.read_csv('../data/raw/PRSA_data_2010.1.1-2014.12.31.csv')

**[2.5]** Display the first 5 rows of df

In [150]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the first 5 rows of df

In [151]:
# Solution
df.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


**[2.6]** Display the dimensions (shape) of df

In [152]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the dimensions (shape) of df

In [153]:
# Solution
df.shape

(43824, 13)

**[2.7]** Display the summary (info) of df

In [154]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the summary (info) of df

In [155]:
# Solution
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43824 entries, 0 to 43823
Data columns (total 13 columns):
No       43824 non-null int64
year     43824 non-null int64
month    43824 non-null int64
day      43824 non-null int64
hour     43824 non-null int64
pm2.5    41757 non-null float64
DEWP     43824 non-null int64
TEMP     43824 non-null float64
PRES     43824 non-null float64
cbwd     43824 non-null object
Iws      43824 non-null float64
Is       43824 non-null int64
Ir       43824 non-null int64
dtypes: float64(4), int64(8), object(1)
memory usage: 4.3+ MB


**[2.8]** Display the descriptive statistics of df


In [156]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the descriptive statistics of df

In [157]:
# Solution
df.describe()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir
count,43824.0,43824.0,43824.0,43824.0,43824.0,41757.0,43824.0,43824.0,43824.0,43824.0,43824.0,43824.0
mean,21912.5,2012.0,6.523549,15.72782,11.5,98.613215,1.817246,12.448521,1016.447654,23.88914,0.052734,0.194916
std,12651.043435,1.413842,3.448572,8.799425,6.922266,92.050387,14.43344,12.198613,10.268698,50.010635,0.760375,1.415867
min,1.0,2010.0,1.0,1.0,0.0,0.0,-40.0,-19.0,991.0,0.45,0.0,0.0
25%,10956.75,2011.0,4.0,8.0,5.75,29.0,-10.0,2.0,1008.0,1.79,0.0,0.0
50%,21912.5,2012.0,7.0,16.0,11.5,72.0,2.0,14.0,1016.0,5.37,0.0,0.0
75%,32868.25,2013.0,10.0,23.0,17.25,137.0,15.0,23.0,1025.0,21.91,0.0,0.0
max,43824.0,2014.0,12.0,31.0,23.0,994.0,28.0,42.0,1046.0,585.6,27.0,36.0


### 3. Prepare Data

**[3.1]** Create a copy of `df` and save it into a variable called `df_cleaned`

In [158]:
# Placeholder for student's code (1 line of Python code)
# Task: Create a copy of df and save it into a variable called df_cleaned

In [159]:
# Solution
df_cleaned = df.copy()

**[3.2]** Remove the column `No` as it is an identifier for rows

In [160]:
# Placeholder for student's code (1 line of Python code)
# Task: Remove the column No as it is an identifier for rows

In [161]:
# Solution
df_cleaned.drop('No', axis=1, inplace=True)

**[3.3]** Remove the missing values from the target variable `pm2.5`

In [162]:
# Placeholder for student's code (1 line of Python code)
# Task: Remove the missing values from the target variable pm2.5

In [163]:
# Solution
df_cleaned.dropna(subset=['pm2.5'], inplace=True)

**[3.4]** Reset the indexes of the dataframe

In [164]:
# Placeholder for student's code (1 line of Python code)
# Task: Reset the indexes of the dataframe

In [165]:
# Solution
df_cleaned.reset_index(drop=True, inplace=True)

In [166]:
df_cleaned

Unnamed: 0,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,2010,1,2,0,129.0,-16,-4.0,1020.0,SE,1.79,0,0
1,2010,1,2,1,148.0,-15,-4.0,1020.0,SE,2.68,0,0
2,2010,1,2,2,159.0,-11,-5.0,1021.0,SE,3.57,0,0
3,2010,1,2,3,181.0,-7,-5.0,1022.0,SE,5.36,1,0
4,2010,1,2,4,138.0,-7,-5.0,1022.0,SE,6.25,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...
41752,2014,12,31,19,8.0,-23,-2.0,1034.0,NW,231.97,0,0
41753,2014,12,31,20,10.0,-22,-3.0,1034.0,NW,237.78,0,0
41754,2014,12,31,21,10.0,-22,-3.0,1034.0,NW,242.70,0,0
41755,2014,12,31,22,8.0,-22,-4.0,1034.0,NW,246.72,0,0


**[3.5]** Import `StandardScaler` and `OneHotEncoder` from `sklearn.preprocessing`

In [167]:
# Placeholder for student's code (1 line of Python code)
# Task: Import StandardScaler and OneHotEncoder from sklearn.preprocessing


In [168]:
# Solution
from sklearn.preprocessing import StandardScaler, OneHotEncoder

**[3.6]** Create a list called `num_cols` that contains `year`, `DEWP`, `TEMP`, `PRES`, `Iws`, `Is`, `Ir`

In [169]:
# Placeholder for student's code (1 line of Python code)
# Task: Create a list called num_cols that contains year, DEWP, TEMP, PRES, Iws, Is, Ir

In [170]:
# Solution
num_cols = ['year', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir']

**[3.7]** Instantiate a `StandardScaler` and call it `sc`

In [171]:
# Placeholder for student's code (1 line of Python code)
# Task: Instantiate a StandardScaler and call it sc

In [172]:
# Solution
sc = StandardScaler()

**[3.8]** Fit and transform the numeric features of `df_cleaned` and write the scaled values back into it

In [173]:
# Placeholder for student's code (1 line of Python code)
# Task: Fit and transform the numeric features of df_cleaned and write the scaled values back into it

In [174]:
# Solution
df_cleaned[num_cols] = sc.fit_transform(df_cleaned[num_cols])
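
For intuition, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below reproduces that arithmetic on made-up values with the standard library only; the `standardize` helper is illustrative and not part of the lab's code.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up temperatures, not taken from the dataset
temps = [-4.0, -5.0, 0.0, 3.0, 6.0]
scaled = standardize(temps)
print([round(z, 3) for z in scaled])
```

After scaling, the values have a mean of 0 and a standard deviation of 1, which keeps features measured on very different scales (pressure in the thousands, rainfall near zero) comparable for the network.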

**[3.9]** Create a list called `cat_cols` that contains `month`, `day`, `hour`, `cbwd`

In [175]:
# Placeholder for student's code (1 line of Python code)
# Task: Create a list called cat_cols that contains month, day, hour, cbwd

In [176]:
# Solution (categorical variables in a list)
cat_cols = ['month', 'day', 'hour', 'cbwd']

**[3.10]** Instantiate a `OneHotEncoder` and call it `ohe`

In [177]:
# Placeholder for student's code (1 line of Python code)
# Task: Instantiate a OneHotEncoder and call it ohe

In [178]:
# Solution
ohe = OneHotEncoder(sparse=False)

**[3.11]** Perform One-Hot encoding on `cat_cols` and save them into a dataframe called `X_cat`

In [179]:
# Placeholder for student's code (1 line of Python code)
# Task: Perform One-Hot encoding on cat_cols and save them into a dataframe called X_cat

In [180]:
# Solution
X_cat = pd.DataFrame(ohe.fit_transform(df_cleaned[cat_cols]))

In [181]:
X_cat

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,61,62,63,64,65,66,67,68,69,70
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41752,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
41753,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
41754,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
41755,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


**[3.12]** Extract the feature names from `ohe` and replace the names of the columns of the `X_cat`

In [182]:
# Placeholder for student's code (1 line of Python code)
# Task: Extract the feature names from ohe and replace the names of the columns of the X_cat

In [183]:
# Solution
X_cat.columns = ohe.get_feature_names(cat_cols)

In [184]:
ohe.get_feature_names(cat_cols)

array(['month_1', 'month_2', 'month_3', 'month_4', 'month_5', 'month_6',
       'month_7', 'month_8', 'month_9', 'month_10', 'month_11',
       'month_12', 'day_1', 'day_2', 'day_3', 'day_4', 'day_5', 'day_6',
       'day_7', 'day_8', 'day_9', 'day_10', 'day_11', 'day_12', 'day_13',
       'day_14', 'day_15', 'day_16', 'day_17', 'day_18', 'day_19',
       'day_20', 'day_21', 'day_22', 'day_23', 'day_24', 'day_25',
       'day_26', 'day_27', 'day_28', 'day_29', 'day_30', 'day_31',
       'hour_0', 'hour_1', 'hour_2', 'hour_3', 'hour_4', 'hour_5',
       'hour_6', 'hour_7', 'hour_8', 'hour_9', 'hour_10', 'hour_11',
       'hour_12', 'hour_13', 'hour_14', 'hour_15', 'hour_16', 'hour_17',
       'hour_18', 'hour_19', 'hour_20', 'hour_21', 'hour_22', 'hour_23',
       'cbwd_NE', 'cbwd_NW', 'cbwd_SE', 'cbwd_cv'], dtype=object)
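
The 71 encoded columns follow from the number of distinct values per feature: 12 months, 31 days, 24 hours and 4 wind directions. Below is a pure-Python sketch of the encoding for a single column, assuming categories are sorted the way the feature names above suggest. `one_hot` is a hypothetical helper, not scikit-learn's API.

```python
def one_hot(values):
    """Encode labels as rows of 0.0/1.0, one column per sorted category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows

categories, rows = one_hot(['SE', 'NW', 'cv', 'NE', 'SE'])
print(categories)
print(rows[0])

# Column count for the lab's four categorical features
n_columns = 12 + 31 + 24 + 4
print(n_columns)
```

Each row contains exactly one `1.0` per original feature, so the network never sees an artificial ordering between categories such as `NE` and `SE`.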

**[3.13]** Drop the original columns of `cat_cols` from `df_cleaned`

In [185]:
# Placeholder for student's code (1 line of Python code)
# Task: Drop the original columns of cat_cols from df_cleaned

In [186]:
# Solution
df_cleaned.drop(cat_cols, axis=1, inplace=True)

In [187]:
df_cleaned

Unnamed: 0,year,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir
0,-1.443355,129.0,-1.229791,-1.347143,0.345329,-0.444944,-0.071057,-0.137408
1,-1.443355,148.0,-1.160508,-1.347143,0.345329,-0.427007,-0.071057,-0.137408
2,-1.443355,159.0,-0.883375,-1.429278,0.442411,-0.409069,-0.071057,-0.137408
3,-1.443355,181.0,-0.606241,-1.429278,0.539493,-0.372993,1.212862,-0.137408
4,-1.443355,138.0,-0.606241,-1.429278,0.539493,-0.355055,2.496781,-0.137408
...,...,...,...,...,...,...,...,...
41752,1.382914,8.0,-1.714775,-1.182873,1.704472,4.194201,-0.071057,-0.137408
41753,1.382914,10.0,-1.645491,-1.265008,1.704472,4.311298,-0.071057,-0.137408
41754,1.382914,10.0,-1.645491,-1.265008,1.704472,4.410458,-0.071057,-0.137408
41755,1.382914,8.0,-1.645491,-1.347143,1.704472,4.491479,-0.071057,-0.137408


**[3.14]** Concatenate `df_cleaned` with `X_cat` and save the result to a variable called `X`

In [188]:
# Placeholder for student's code (1 line of Python code)
# Task: Concatenate df_cleaned with X_cat and save the result to a variable called X

In [189]:
# Solution
X = pd.concat([df_cleaned, X_cat], axis=1)

In [190]:
X

Unnamed: 0,year,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir,month_1,month_2,...,hour_18,hour_19,hour_20,hour_21,hour_22,hour_23,cbwd_NE,cbwd_NW,cbwd_SE,cbwd_cv
0,-1.443355,129.0,-1.229791,-1.347143,0.345329,-0.444944,-0.071057,-0.137408,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-1.443355,148.0,-1.160508,-1.347143,0.345329,-0.427007,-0.071057,-0.137408,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,-1.443355,159.0,-0.883375,-1.429278,0.442411,-0.409069,-0.071057,-0.137408,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,-1.443355,181.0,-0.606241,-1.429278,0.539493,-0.372993,1.212862,-0.137408,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,-1.443355,138.0,-0.606241,-1.429278,0.539493,-0.355055,2.496781,-0.137408,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41752,1.382914,8.0,-1.714775,-1.182873,1.704472,4.194201,-0.071057,-0.137408,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
41753,1.382914,10.0,-1.645491,-1.265008,1.704472,4.311298,-0.071057,-0.137408,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
41754,1.382914,10.0,-1.645491,-1.265008,1.704472,4.410458,-0.071057,-0.137408,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
41755,1.382914,8.0,-1.645491,-1.347143,1.704472,4.491479,-0.071057,-0.137408,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


**[3.15]** Import `split_sets_by_time` and `save_sets` from `src.data.sets`

In [191]:
# Placeholder for student's code (1 line of Python code)
# Task: Import split_sets_by_time and save_sets from src.data.sets

In [192]:
# Solution
from src.data.sets import split_sets_by_time, save_sets

**[3.16]** Split the data chronologically into training, validation and testing sets, using a test ratio of 0.2

In [193]:
# Placeholder for student's code (1 line of Python code)
# Task: Split the data chronologically into training, validation and testing sets, using a test ratio of 0.2

In [194]:
# Solution
X_train, y_train, X_val, y_val, X_test, y_test = split_sets_by_time(X, target_col='pm2.5', test_ratio=0.2)
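
`split_sets_by_time` comes from `src/data/sets.py`, whose implementation is not shown in this notebook. The sketch below is one plausible way to split by time, assuming the rows are already in chronological order and that the validation set has the same size as the test set. `split_by_time_sketch` is a hypothetical name, not the lab's function.

```python
import pandas as pd

def split_by_time_sketch(df, target_col, test_ratio=0.2):
    """Hypothetical chronological split: oldest rows train, newest rows test.

    Assumes df is sorted by time. The validation block sits right before
    the test block and has the same number of rows.
    """
    features = df.copy()
    target = features.pop(target_col)
    n_holdout = int(len(features) * test_ratio)
    val_start = len(features) - 2 * n_holdout
    test_start = len(features) - n_holdout
    X_train, y_train = features.iloc[:val_start], target.iloc[:val_start]
    X_val, y_val = features.iloc[val_start:test_start], target.iloc[val_start:test_start]
    X_test, y_test = features.iloc[test_start:], target.iloc[test_start:]
    return X_train, y_train, X_val, y_val, X_test, y_test

# Made-up frame with 10 hourly rows
toy = pd.DataFrame({'hour': range(10), 'pm2.5': [float(i) for i in range(10)]})
X_tr, y_tr, X_va, y_va, X_te, y_te = split_by_time_sketch(toy, target_col='pm2.5')
print(len(X_tr), len(X_va), len(X_te))
```

Splitting by position rather than at random keeps future observations out of the training set, which matters for hourly measurements that are strongly correlated with their neighbours.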

**[3.17]** Create the following folder: `data/processed/beijing_pollution`

In [195]:
!mkdir -p ../data/processed/beijing_pollution


**[3.18]** Save the sets in the `data/processed/beijing_pollution` folder

In [196]:
# Placeholder for student's code (1 line of Python code)
# Task: Save the sets in the data/processed/beijing_pollution folder

In [197]:
save_sets(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, X_test=X_test, y_test=y_test, path='../data/processed/beijing_pollution/')

### 4. Baseline Model

**[4.1]** Import `NullModel` from `src.models.null`

In [198]:
# Placeholder for student's code (1 line of Python code)
# Task: Import NullModel from src.models.null

In [199]:
# Solution
from src.models.null import NullModel

**[4.2]** Instantiate a `NullModel` and call `.fit_predict()` on the training target to extract your predictions into a variable called `y_base`

In [200]:
# Placeholder for student's code (2 lines of Python code)
# Task: Instantiate a NullModel and call .fit_predict() on the training target to extract your predictions into a variable called y_base

In [217]:
# Solution:
baseline_model = NullModel()
y_base = baseline_model.fit_predict(y_train)
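
`NullModel` lives in `src/models/null.py`, which is not shown here. A common regression baseline predicts the mean of the training target for every observation; the sketch below assumes that behaviour. `MeanBaseline` is a hypothetical stand-in, not the lab's class.

```python
from statistics import fmean

class MeanBaseline:
    """Hypothetical baseline: always predict the mean of the fitted target."""

    def fit(self, y):
        self.prediction_ = fmean(y)
        return self

    def predict(self, y):
        # One constant prediction per observation
        return [self.prediction_ for _ in y]

    def fit_predict(self, y):
        return self.fit(y).predict(y)

y_toy = [10.0, 20.0, 60.0]  # made-up target values
y_pred = MeanBaseline().fit_predict(y_toy)
print(y_pred)
```

Any trained model should beat this constant prediction; if it does not, the features are adding no information.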

**[4.3]** Import `print_reg_perf` from `src.models.performance`

In [202]:
# Placeholder for student's code (1 line of Python code)
# Task: Import print_reg_perf from src.models.performance

In [218]:
# Solution:
from src.models.performance import print_reg_perf

**[4.4]** Print the regression metrics for this baseline model

In [219]:
# Placeholder for student's code (1 line of Python code)
# Task: Print the regression metrics for this baseline model

In [220]:
# Solution:
print_reg_perf(y_base, y_train, set_name='Training')

RMSE Training: 92.82545840756482
MAE Training: 69.67082209440568
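
`print_reg_perf` is defined in `src/models/performance.py` (not shown), but the two scores it prints are standard. Here is a minimal sketch of how RMSE and MAE are computed, using made-up numbers and the hypothetical helper names `rmse` and `mae`:

```python
from math import sqrt

def rmse(y_pred, y_true):
    """Root mean squared error: penalises large errors more heavily."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))

def mae(y_pred, y_true):
    """Mean absolute error: average size of the errors."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

y_true = [10.0, 20.0, 60.0]
y_pred = [30.0, 30.0, 30.0]
print(f'RMSE: {rmse(y_pred, y_true):.3f}')
print(f'MAE: {mae(y_pred, y_true):.3f}')
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few errors are much larger than the rest, as with pollution spikes.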


### 5. Define Architecture

**[5.1]** Import `torch`, `torch.nn` as `nn` and `torch.nn.functional` as `F`

In [206]:
# Placeholder for student's code (3 lines of Python code)
# Task: Import torch, torch.nn as nn and torch.nn.functional as F

In [221]:
# Solution:
import torch
import torch.nn as nn
import torch.nn.functional as F

**[5.2]** Create in `src/models/pytorch.py` a class called `PytorchRegression` that inherits from `nn.Module` with:
- `num_features` as input parameter
- attributes:
    - `layer_1`: fully-connected layer with 128 neurons
    - `layer_out`: fully-connected layer with 1 neuron
- methods:
    - `forward()` with `x` as input parameter: apply ReLU and dropout to the output of the fully-connected layer, then pass the result to the output layer

In [140]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Create a class called PytorchRegression that inherits from nn.Module

In [222]:
# Solution:
class PytorchRegression(nn.Module):
    def __init__(self, num_features):
        super(PytorchRegression, self).__init__()
        
        self.layer_1 = nn.Linear(num_features, 128)
        self.layer_out = nn.Linear(128, 1)

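    def forward(self, x):
        # ReLU then dropout; dropout is only active while the module is in training mode
        x = F.dropout(F.relu(self.layer_1(x)), training=self.training)
        x = self.layer_out(x)
        return x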
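
One detail worth knowing about the functional API: `F.dropout` has its own `training` flag, which defaults to `True` and is not changed by calling `model.eval()`. Passing `training=self.training` ties it to the module's mode. A small sketch on a tensor of ones:

```python
import torch
import torch.nn.functional as F

x = torch.ones(1000)

# Default call: dropout is active, so values are zeroed or scaled by 1 / (1 - p)
dropped = F.dropout(x, p=0.5)

# training=False turns dropout into a no-op, which is what evaluation needs
kept = F.dropout(x, p=0.5, training=False)

print(dropped.unique())
print(torch.equal(kept, x))
```

The `nn.Dropout` module is an alternative that follows `model.train()` and `model.eval()` automatically.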

**[5.3]** Instantiate `PytorchRegression` with the correct number of input features and save it into a variable called `model`

In [142]:
# Placeholder for student's code (2 lines of Python code)
# Task: Instantiate PytorchRegression with the correct number of input features and save it into a variable called model


In [227]:
# Solution:
from src.models.pytorch import PytorchRegression
model = PytorchRegression(X_train.shape[1])


**[5.4]** Create in `src/models/pytorch.py` a function called `get_device()` with:
- Logic: check whether CUDA is available and return `cuda:0` if it is, `cpu` otherwise
- Output: device to be used by Pytorch

In [228]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Create a function called get_device()

In [229]:
# Solution:
def get_device():
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
    else:
        device = torch.device('cpu')  # fall back to CPU when no GPU is available
    return device

**[5.5]** Set `model` to use the device available

In [230]:
# Placeholder for student's code (3 lines of Python code)
# Task: Set model to use the device available

In [231]:
# Solution:
from src.models.pytorch import get_device

device = get_device()
model.to(device)

PytorchRegression(
  (layer_1): Linear(in_features=78, out_features=128, bias=True)
  (layer_out): Linear(in_features=128, out_features=1, bias=True)
)

**[5.6]** Print the architecture of `model`

In [232]:
# Placeholder for student's code (1 line of Python code)
# Task: Print the architecture of model

In [233]:
# Solution:
print(model)

PytorchRegression(
  (layer_1): Linear(in_features=78, out_features=128, bias=True)
  (layer_out): Linear(in_features=128, out_features=1, bias=True)
)


### 6. Create Data Loader

**[6.1]** Import `Dataset` and `DataLoader` from `torch.utils.data`

In [234]:
# Placeholder for student's code (1 line of Python code)
# Task: Import Dataset and DataLoader from torch.utils.data

In [235]:
# Solution:
from torch.utils.data import Dataset, DataLoader

**[6.2]** Create in `src/models/pytorch.py` a class called `PytorchDataset` that inherits from `torch.utils.data.Dataset` with:
- `X` and `y` as input parameters
- attributes:
    - `X_tensor`: X converted to Pytorch tensor
    - `y_tensor`: y converted to Pytorch tensor
- methods:
    - `__getitem__(index)`
        Return features and target for a given index
    - `__len__`
        Return the number of observations
    - `to_tensor(data)`
        Convert Pandas Series to Pytorch tensor

In [212]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Create a class called PytorchDataset

In [236]:
# Solution:
class PytorchDataset(Dataset):
    """
    Pytorch dataset
    ...

    Attributes
    ----------
    X_tensor : Pytorch tensor
        Features tensor
    y_tensor : Pytorch tensor
        Target tensor

    Methods
    -------
    __getitem__(index)
        Return features and target for a given index
    __len__
        Return the number of observations
    to_tensor(data)
        Convert Pandas Series to Pytorch tensor
    """
        
    def __init__(self, X, y):
        self.X_tensor = self.to_tensor(X)
        self.y_tensor = self.to_tensor(y)
    
    def __getitem__(self, index):
        return self.X_tensor[index], self.y_tensor[index]
        
    def __len__ (self):
        return len(self.X_tensor)
    
    def to_tensor(self, data):
        return torch.Tensor(np.array(data))
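
To see what a `DataLoader` yields from such a dataset, the sketch below batches five made-up observations. `TinyDataset` is a compact stand-in that mirrors the class above so the example runs on its own.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TinyDataset(Dataset):
    """Compact stand-in that mirrors PytorchDataset for this demo."""

    def __init__(self, X, y):
        self.X_tensor = torch.Tensor(np.array(X))
        self.y_tensor = torch.Tensor(np.array(y))

    def __getitem__(self, index):
        return self.X_tensor[index], self.y_tensor[index]

    def __len__(self):
        return len(self.X_tensor)

# Made-up data: 5 observations with 3 features each
X_toy = [[0.1, 0.2, 0.3]] * 5
y_toy = [1.0, 2.0, 3.0, 4.0, 5.0]

loader = DataLoader(TinyDataset(X_toy, y_toy), batch_size=2)
batch_shapes = [(tuple(f.shape), tuple(t.shape)) for f, t in loader]
print(batch_shapes)
```

Note that the last batch is smaller when the dataset size is not a multiple of the batch size, and that the target batch is one-dimensional while a model with a single output neuron returns shape `(batch, 1)`.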

**[6.3]** Import this class from `src/models/pytorch` and convert all sets to PytorchDataset

In [237]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Import this class from src/models/pytorch and convert all sets to PytorchDataset

In [241]:
# Solution:
from src.models.pytorch import PytorchDataset

train_dataset = PytorchDataset(X=X_train, y=y_train)
val_dataset = PytorchDataset(X=X_val, y=y_val)
test_dataset = PytorchDataset(X=X_test, y=y_test)

**[6.4]** Import DataLoader from `torch.utils.data`

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Import DataLoader from torch.utils.data

In [242]:
# Solution:
from torch.utils.data import DataLoader

### 7. Train Model

**[7.1]** Instantiate a `nn.MSELoss()` and save it into a variable called `criterion` 

In [243]:
# Placeholder for student's code (1 line of Python code)
# Task: Instantiate a nn.MSELoss() and save it into a variable called criterion

In [244]:
# Solution:
criterion = nn.MSELoss()

**[7.2]** Instantiate a `torch.optim.Adam()` optimizer with the model's parameters and 0.001 as learning rate and save it into a variable called `optimizer`

In [245]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Instantiate a torch.optim.Adam() optimizer with the model's parameters and 0.001 as learning rate and save it into a variable called optimizer

In [246]:
# Solution:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

**[7.3]** Create a function called `train_regression()` that will perform forward and back propagation and calculate loss and RMSE scores

In [247]:
def train_regression(train_data, model, criterion, optimizer, batch_size, device, scheduler=None, collate_fn=None):
    """Train a Pytorch regression model

    Parameters
    ----------
    train_data : torch.utils.data.Dataset
        Pytorch dataset
    model: torch.nn.Module
        Pytorch Model
    criterion: function
        Loss function
    optimizer: torch.optim
        Optimizer
    batch_size : int
        Number of observations per batch
    device : str
        Name of the device used for the model
    scheduler : torch.optim.lr_scheduler
        Pytorch Scheduler used for updating learning rate
    collate_fn : function
        Function defining required pre-processing steps

    Returns
    -------
    Float
        Loss score
    Float:
        RMSE Score
    """
    
    # Set model to training mode
    model.train()
    train_loss = 0

    # Create data loader
    data = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    
    # Iterate through data by batch of observations
    for feature, target_class in data:
        
        # Reset gradients
        optimizer.zero_grad()
        
        # Load data to specified device
        feature, target_class = feature.to(device), target_class.to(device)
        
        # Make predictions
        output = model(feature)
        
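        # Calculate loss for given batch (target reshaped to match the (batch, 1) output)
        loss = criterion(output, target_class.reshape(output.shape))
        
        # Accumulate the batch's summed squared error (batch mean x batch size)
        train_loss += loss.item() * feature.size(0)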
        
        # Calculate gradients
        loss.backward()
        
        # Update Weights
        optimizer.step()
        
    # Adjust the learning rate
    if scheduler:
        scheduler.step()

    return train_loss / len(train_data), np.sqrt(train_loss / len(train_data))
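
Shapes matter in the loss call: the model returns `(batch, 1)`, while a target built from a pandas Series is assumed here to have shape `(batch,)`. When the two differ, `nn.MSELoss` broadcasts them against each other and averages over a `(batch, batch)` grid, which is not the per-observation error. A sketch with predictions that match the targets exactly:

```python
import warnings
import torch
import torch.nn as nn

criterion = nn.MSELoss()
output = torch.tensor([[1.0], [2.0]])   # shape (2, 1), like the model output
target = torch.tensor([1.0, 2.0])       # shape (2,), like a batch of targets

with warnings.catch_warnings():
    warnings.simplefilter('ignore')      # PyTorch warns about the size mismatch
    broadcast_loss = criterion(output, target).item()

# Reshaping the target to the output's shape compares element by element
aligned_loss = criterion(output, target.reshape(output.shape)).item()
print(broadcast_loss, aligned_loss)
```

The broadcast version reports a non-zero loss for perfect predictions, so aligning the shapes before calling the criterion is the safer habit.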

**[7.4]** Create a function called `test_regression()` that will perform a forward pass and calculate loss and RMSE scores

In [248]:
def test_regression(test_data, model, criterion, batch_size, device, collate_fn=None):
    """Calculate performance of a Pytorch regression model

    Parameters
    ----------
    test_data : torch.utils.data.Dataset
        Pytorch dataset
    model: torch.nn.Module
        Pytorch Model
    criterion: function
        Loss function
    batch_size : int
        Number of observations per batch
    device : str
        Name of the device used for the model
    collate_fn : function
        Function defining required pre-processing steps

    Returns
    -------
    Float
        Loss score
    Float:
        RMSE Score
    """    
    
    # Set model to evaluation mode
    model.eval()
    test_loss = 0

    # Create data loader
    data = DataLoader(test_data, batch_size=batch_size, collate_fn=collate_fn)
    
    # Iterate through data by batch of observations
    for feature, target_class in data:
        
        # Load data to specified device
        feature, target_class = feature.to(device), target_class.to(device)
        
        # Set no update to gradients
        with torch.no_grad():
            
            # Make predictions
            output = model(feature)
            
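            # Calculate loss for given batch (target reshaped to match the (batch, 1) output)
            loss = criterion(output, target_class.reshape(output.shape))
            
            # Accumulate the batch's summed squared error (batch mean x batch size)
            test_loss += loss.item() * feature.size(0)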
            
    return test_loss / len(test_data), np.sqrt(test_loss / len(test_data))
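
Because `nn.MSELoss()` returns the mean over a batch by default, an epoch-level MSE is the sum of batch means weighted by batch size, divided by the number of observations. A small arithmetic sketch with made-up numbers shows the difference between the two ways of averaging:

```python
# Made-up squared errors for 10 observations, processed in batches of 5
squared_errors = [4.0] * 10
batch_size = 5
batches = [squared_errors[i:i + batch_size]
           for i in range(0, len(squared_errors), batch_size)]

# What a mean-reduced loss returns for each batch
batch_means = [sum(b) / len(b) for b in batches]

# Dividing the sum of batch means by the number of observations shrinks the score
naive_mse = sum(batch_means) / len(squared_errors)

# Weighting each batch mean by its size recovers the true mean squared error
weighted_mse = sum(m * len(b) for m, b in zip(batch_means, batches)) / len(squared_errors)
print(naive_mse, weighted_mse)
```

The unweighted version is too small by roughly a factor of the batch size, which makes the resulting RMSE look better than it is when compared with a baseline computed per observation.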

**[7.5]** Create 2 variables called `N_EPOCHS` and `BATCH_SIZE` that will take respectively 5 and 32 as values

In [249]:
# Placeholder for student's code (2 lines of Python code)
# Task: Create 2 variables called N_EPOCHS and BATCH_SIZE that will take respectively 5 and 32 as values

In [250]:
# Solution:
N_EPOCHS = 5
BATCH_SIZE = 32

**[7.6]** Create a for loop that iterates through the specified number of epochs, trains the model on the training set, assesses its performance on the validation set and prints both sets of scores

In [251]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Create a for loop that iterates through the specified number of epochs, trains the model on the training set, assesses its performance on the validation set and prints both sets of scores

In [253]:
# Solution:
from src.models.pytorch import train_regression, test_regression

for epoch in range(N_EPOCHS):
    train_loss, train_rmse = train_regression(train_dataset, model=model, criterion=criterion, optimizer=optimizer, batch_size=BATCH_SIZE, device=device)
    valid_loss, valid_rmse = test_regression(val_dataset, model=model, criterion=criterion, batch_size=BATCH_SIZE, device=device)

    print(f'Epoch: {epoch}')
    print(f'\t(train)\tLoss: {train_loss:.4f}\t|\tRMSE: {train_rmse:.1f}')
    print(f'\t(valid)\tLoss: {valid_loss:.4f}\t|\tRMSE: {valid_rmse:.1f}')

  return F.mse_loss(input, target, reduction=self.reduction)
  return F.mse_loss(input, target, reduction=self.reduction)


Epoch: 0
	(train)	Loss: 349.5482	|	RMSE: 18.7
	(valid)	Loss: 243.9819	|	RMSE: 15.6
Epoch: 1
	(train)	Loss: 273.4944	|	RMSE: 16.5
	(valid)	Loss: 239.4132	|	RMSE: 15.5
Epoch: 2
	(train)	Loss: 272.7597	|	RMSE: 16.5
	(valid)	Loss: 237.2977	|	RMSE: 15.4
Epoch: 3
	(train)	Loss: 272.5247	|	RMSE: 16.5
	(valid)	Loss: 238.1760	|	RMSE: 15.4
Epoch: 4
	(train)	Loss: 272.3270	|	RMSE: 16.5
	(valid)	Loss: 237.6604	|	RMSE: 15.4


**[7.7]** Save the model into the `models` folder

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Save the model into the models folder

In [254]:
# Solution
torch.save(model, "../models/pytorch_reg_pm2_5.pt")
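
Saving the whole model object pickles it, so loading it later requires the class definition to be importable from the same module path. A common alternative is to save only the parameters with `state_dict()`. The sketch below uses a small stand-in network and a temporary file, both of which are assumptions for illustration.

```python
import os
import tempfile
import torch
import torch.nn as nn

# Small stand-in network; the lab's model would be handled the same way
net = nn.Linear(3, 1)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'weights.pt')
    torch.save(net.state_dict(), path)       # save parameters only

    restored = nn.Linear(3, 1)               # rebuild the architecture first
    restored.load_state_dict(torch.load(path))

weights_match = torch.equal(net.weight, restored.weight)
print(weights_match)
```

With this approach the architecture is recreated in code and only the learned weights are read from disk.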

### 8.   Assess Performance

**[8.1]** Assess the model performance on the testing set and print its scores

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Assess the model performance on the testing set and print its scores

In [255]:
test_loss, test_rmse = test_regression(test_dataset, model=model, criterion=criterion, batch_size=BATCH_SIZE, device=device)
print(f'\tLoss: {test_loss:.4f}\t|\tRMSE: {test_rmse:.1f}')

	Loss: 275.9460	|	RMSE: 16.6


### 9.   Push changes

**[9.1]** Add your changes to the git staging area

In [None]:
# Placeholder for student's code (1 command line)
# Task: Add your changes to the git staging area

In [None]:
# Solution:
git add .

**[9.2]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (1 command line)
# Task: Create the snapshot of your repository and add a description

In [None]:
# Solution:
git commit -m "pytorch regression"

**[9.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (1 command line)
# Task: Push your snapshot to GitHub

In [None]:
# Solution:
git push

**[9.4]** Check out the `master` branch

In [None]:
# Placeholder for student's code (1 command line)
# Task: Check out the master branch

In [None]:
# Solution:
git checkout master

**[9.5]** Pull the latest updates

In [None]:
# Placeholder for student's code (1 command line)
# Task: Pull the latest updates

In [None]:
# Solution:
git pull

**[9.6]** Check out the `pytorch_reg` branch

In [None]:
# Placeholder for student's code (1 command line)
# Task: Check out the pytorch_reg branch

In [None]:
# Solution:
git checkout pytorch_reg

**[9.7]** Merge the `master` branch and push your changes

In [None]:
# Placeholder for student's code (2 command lines)
# Task: Merge the master branch and push your changes

In [None]:
# Solution:
git merge master
git push

**[9.8]** Go to GitHub and merge the branch after reviewing the code and resolving any conflicts

**[9.9]** Stop the Docker container

In [None]:
# Placeholder for student's code (1 command line)
# Task: Stop the Docker container

In [None]:
# Solution:
docker stop adv_dsi_lab_5