# Advanced Data Science for Innovation - Assessment Task 2

## Multiclass Classification with PyTorch

**Setup Repository**<br>
cd ~/Projects

**Copy Cookiecutter template** <br>
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science


***cookiecutter details:***

Project name: adv_dsi_at2 

repo name: adv_dsi_at2

author name: Justin

description: Adv DSI Assessment Task 2 

No license and python 3

#### Enter the terminal command:

   **nano dockerfile**
    
Then enter these details in dockerfile:


`FROM jupyter/scipy-notebook:0ce64578df46`

`RUN pip install torch==1.9.0+cpu torchvision==0.10.0+cpu torchtext==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html`

`ENV PYTHONPATH "${PYTHONPATH}:/home/jovyan/work"`

`RUN echo "export PYTHONPATH=/home/jovyan/work" >> ~/.bashrc`

`WORKDIR /home/jovyan/work`



#### Enter command in terminal

**docker build -t pytorch-notebook:latest .**

#### Run docker

docker run  -dit --rm --name adv_dsi_at2 -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ~/Projects/adv_dsi_at2:/home/jovyan/work -v ~/Projects/adv_dsi/src:/home/jovyan/work/src pytorch-notebook:latest 
                

## *GITHUB*

Create a repository in your Github account `adv_dsi_at2`. 

**Enter commands in Terminal**

- git init

- git remote add origin https://github.com/justinmuts/adv_dsi_at2.git


**Adding changes to git staging area and commit**

- git add .

- git commit -m "init"

**Push local repository into Github account**

- git push --set-upstream origin master



# Load the dataset

In [1]:
# Launch the magic commands for auto-reloading external modules
%load_ext autoreload
%autoreload 2

In [2]:
#import packages
import pandas as pd
import numpy as np
import os

In [3]:
# save beer reviews data as 'df'
df = pd.read_csv('../data/raw/beer_reviews.csv')


In [4]:
# first five records
df.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883


In [5]:
# the dimension of the data df
df.shape

(1586614, 13)

In [6]:
# information on df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586614 entries, 0 to 1586613
Data columns (total 13 columns):
brewery_id            1586614 non-null int64
brewery_name          1586599 non-null object
review_time           1586614 non-null int64
review_overall        1586614 non-null float64
review_aroma          1586614 non-null float64
review_appearance     1586614 non-null float64
review_profilename    1586266 non-null object
beer_style            1586614 non-null object
review_palate         1586614 non-null float64
review_taste          1586614 non-null float64
beer_name             1586614 non-null object
beer_abv              1518829 non-null float64
beer_beerid           1586614 non-null int64
dtypes: float64(6), int64(3), object(4)
memory usage: 157.4+ MB


We can identify that 
- `brewery_name`, 
- `review_profilename`,
- `beer_style` and 
- `beer_name` 

are text-based categorical variables.

Our target variable is ***beer_style***.

Null entries in the beer dataset:

Column | Null count
---|---
`brewery_name` | 15
`review_profilename` | 348
`beer_abv` | 67,785

In [7]:
# summary statistics
df.describe()


Unnamed: 0,brewery_id,review_time,review_overall,review_aroma,review_appearance,review_palate,review_taste,beer_abv,beer_beerid
count,1586614.0,1586614.0,1586614.0,1586614.0,1586614.0,1586614.0,1586614.0,1518829.0,1586614.0
mean,3130.099,1224089000.0,3.815581,3.735636,3.841642,3.743701,3.79286,7.042387,21712.79
std,5578.104,76544270.0,0.7206219,0.6976167,0.6160928,0.6822184,0.7319696,2.322526,21818.34
min,1.0,840672000.0,0.0,1.0,0.0,1.0,1.0,0.01,3.0
25%,143.0,1173224000.0,3.5,3.5,3.5,3.5,3.5,5.2,1717.0
50%,429.0,1239203000.0,4.0,4.0,4.0,4.0,4.0,6.5,13906.0
75%,2372.0,1288568000.0,4.5,4.0,4.0,4.0,4.5,8.5,39441.0
max,28003.0,1326285000.0,5.0,5.0,5.0,5.0,5.0,57.7,77317.0


In [8]:
# Filter the dataframe for any records that have NA or Null values in the dataset.
df_null_entry_records = df[df.isna().any(axis=1)]

In [9]:
df_null_entry_records

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
273,1075,Caldera Brewing Company,1103668195,3.0,3.0,3.0,RedDiamond,American Stout,4.0,3.0,Cauldron Espresso Stout,,21241
430,850,Moon River Brewing Company,1110736110,3.5,4.0,4.5,cMonkey,Scotch Ale / Wee Heavy,3.5,3.5,The Highland Stagger,,20689
603,850,Moon River Brewing Company,1100038819,4.0,3.5,4.0,aracauna,Scotch Ale / Wee Heavy,3.5,3.5,The Highland Stagger,,20689
733,1075,Caldera Brewing Company,1260673921,4.0,4.0,4.0,plaid75,American IPA,4.0,4.0,Alpha Beta,,54723
798,1075,Caldera Brewing Company,1212201268,4.5,4.5,4.0,grumpy,American Double / Imperial Stout,4.0,4.5,Imperial Stout,,42964
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1586568,14359,The Defiant Brewing Company,1187052567,4.0,3.5,4.0,maddogruss,Bock,4.0,4.0,Bock,,36424
1586587,14359,The Defiant Brewing Company,1177842168,3.5,4.5,4.0,BBM,Maibock / Helles Bock,4.5,4.0,Maibock,,36555
1586596,14359,The Defiant Brewing Company,1287951067,4.0,3.0,5.0,hoppymcgee,Belgian Strong Pale Ale,4.0,3.5,Resolution #2,,48360
1586597,14359,The Defiant Brewing Company,1241906223,4.5,4.5,4.0,WesWes,Belgian Strong Pale Ale,4.0,4.0,Resolution #2,,48360


In [10]:
df['brewery_name'].value_counts()

Boston Beer Company (Samuel Adams)    39444
Dogfish Head Brewery                  33839
Stone Brewing Co.                     33066
Sierra Nevada Brewing Co.             28751
Bell's Brewery, Inc.                  25191
                                      ...  
Le Moulin De Saint Martin                 1
Nanuan Corp.                              1
Ajeper S.A.                               1
Palmbräu Zorn Söhne KG                    1
Seychelles Breweries Limited (SBL)        1
Name: brewery_name, Length: 5742, dtype: int64

In [11]:
df['review_profilename'].value_counts()

northyorksammy    5817
BuckeyeNation     4661
mikesgroove       4617
Thorpe429         3518
womencantsail     3497
                  ... 
WolverineWench       1
Beowolf1911          1
omnihappiness        1
DDKennemore          1
OldPuppy61           1
Name: review_profilename, Length: 33387, dtype: int64

In [12]:
df['beer_style'].value_counts()

American IPA                        117586
American Double / Imperial IPA       85977
American Pale Ale (APA)              63469
Russian Imperial Stout               54129
American Double / Imperial Stout     50705
                                     ...  
Gose                                   686
Faro                                   609
Roggenbier                             466
Kvass                                  297
Happoshu                               241
Name: beer_style, Length: 104, dtype: int64

In [13]:
df['beer_name'].value_counts()

90 Minute IPA                          3290
India Pale Ale                         3130
Old Rasputin Russian Imperial Stout    3111
Sierra Nevada Celebration Ale          3000
Two Hearted Ale                        2728
                                       ... 
Guardian Angel                            1
Creekside '09 Sour Spice                  1
Buffalo Bills Bellehop Porter             1
Belgian Blonde Bomber                     1
Senior Moment Old Ale                     1
Name: beer_name, Length: 56857, dtype: int64

In [14]:
# How many distinct entries are there for our target variable 'beer_style'?
df['beer_style'].value_counts()

# We have 104 unique categories in total for this variable.

American IPA                        117586
American Double / Imperial IPA       85977
American Pale Ale (APA)              63469
Russian Imperial Stout               54129
American Double / Imperial Stout     50705
                                     ...  
Gose                                   686
Faro                                   609
Roggenbier                             466
Kvass                                  297
Happoshu                               241
Name: beer_style, Length: 104, dtype: int64

## Prepare the data

In [15]:
df_cleaned = df.copy()

### Drop null values in the dataset

In [16]:
# Drop every record that has a null value in any column (how='any'). Missing values cannot be
# imputed into the target variable, and 'beer_style' itself has no nulls in this dataset.
# df_cleaned['beer_style'].dropna(axis=0,inplace= True)
df_cleaned.dropna(axis=0, inplace=True, how='any')
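
Note that `how='any'` removes a row when any column is missing, not only the target. A minimal sketch on a toy frame (the values are made up for illustration) contrasts this with restricting the check to one column through `subset`:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): one gap in the target, one gap in a feature
toy = pd.DataFrame({
    "beer_style": ["Hefeweizen", None, "Bock"],
    "beer_abv": [5.0, 6.2, np.nan],
})

# how="any" drops every row that has at least one missing value
dropped_any = toy.dropna(axis=0, how="any")

# subset=[...] only looks at the listed columns when deciding what to drop
dropped_target_only = toy.dropna(axis=0, subset=["beer_style"])

print(len(dropped_any), len(dropped_target_only))
```

With `subset`, rows that only lack `beer_abv` would be kept and would need imputing before training.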

In [17]:
df_cleaned.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883


In [18]:
# Check for any null values
df_cleaned.isnull().sum()

brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64

In [19]:
# Check the dimensions of the dataset df_cleaned
df_cleaned.shape

(1518478, 13)

- We have 1,518,478 review records and 13 variables

In [20]:
# df_brewname_na = df_cleaned[df_cleaned['brewery_name'].isna()]


In [21]:
# df_brewname_na[df_brewname_na['beer_name'].str.contains("WRONG")].count()
# Select variables

# ['review_overall','review_aroma', 'review_appearance', 'beer_style','review_palate', 'review_taste']

### Drop variables

We keep `review_aroma`, `review_appearance`, `review_palate` and `review_taste` as directed by the objective, together with `brewery_name` and `beer_abv`, which the selection below also retains. I did not choose `beer_name`, as that would defeat the purpose of having the machine learning model predict the type of beer.

`review_overall` and `review_profilename` are not relevant to the task, because we are trying to predict the type of beer.

In [22]:
# choosing these variables only
df_cleaned = df_cleaned.loc[:,['brewery_name','review_aroma', 'review_appearance', 'beer_style','review_palate', 'review_taste','beer_abv']]

In [23]:
df_cleaned.head()

Unnamed: 0,brewery_name,review_aroma,review_appearance,beer_style,review_palate,review_taste,beer_abv
0,Vecchio Birraio,2.0,2.5,Hefeweizen,1.5,1.5,5.0
1,Vecchio Birraio,2.5,3.0,English Strong Ale,3.0,3.0,6.2
2,Vecchio Birraio,2.5,3.0,Foreign / Export Stout,3.0,3.0,6.5
3,Vecchio Birraio,3.0,3.5,German Pilsener,2.5,3.0,5.0
4,Caldera Brewing Company,4.5,4.0,American Double / Imperial IPA,4.0,4.5,7.7


In [24]:
# target variable 'beer_style' save under the name 'target'
target = 'beer_style'

In [25]:
# create list containing categorical columns
cat_cols = ['brewery_name', 'beer_style']

In [26]:
# Create a list for numerical columns
num_cols = list(set(df_cleaned.columns) - (set(cat_cols)  | set([target])))

In [27]:
num_cols

['review_palate',
 'review_appearance',
 'review_aroma',
 'beer_abv',
 'review_taste']
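
The list above is not in DataFrame column order because a Python `set` does not preserve ordering. If a stable feature order matters (for example when the same columns must be rebuilt at inference time), one option is a list comprehension. This is a sketch of an alternative, not what the notebook ran:

```python
# Column names as they appear in df_cleaned after the selection step
columns = ['brewery_name', 'review_aroma', 'review_appearance', 'beer_style',
           'review_palate', 'review_taste', 'beer_abv']
cat_cols = ['brewery_name', 'beer_style']
target = 'beer_style'

# Keep the original column order while filtering out categorical and target columns
excluded = set(cat_cols) | {target}
num_cols = [col for col in columns if col not in excluded]
print(num_cols)
```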

In [28]:
list_brewery_name = list(np.sort(df_cleaned['brewery_name'].unique()))
list_beer_style = list(np.sort(df_cleaned['beer_style'].unique()))



In [29]:
# Create a dictionary called cats_dict that contains the categorical variables as keys and their respective values sorted in ascending order

cats_dict = {
    'brewery_name': [list_brewery_name],
    'beer_style': [list_beer_style]
}

In [30]:
cats_dict

{'brewery_name': [["'t Hofbrouwerijke",
   '(512) Brewing Company',
   '10 Barrel Brewing Co.',
   '1516 Brewing Company',
   '16 Mile Brewing Company',
   '1648 Brewing Company Ltd',
   '1702 / The Address Brewing Co.',
   '192 Brewing Company',
   '2 Brothers Brewery',
   '21st Amendment Brewery',
   '23rd Street Brewery',
   '2nd Shift Brewery',
   '3 Ravens Brewing',
   '3 Stars Brewing Company',
   '32 Via Dei Birrai',
   '4 Hands Brewing Co.',
   '4 Pines Brewing Company',
   '4Seasons Sports Bar & Brew Pub',
   '4th Street Brewing Co.',
   '5 Rabbit Cerveceria',
   '50 Back Brewing Company',
   '508 Gastrobrewery',
   '5280 Roadhouse and Brewery',
   '7 Seas Brewery and Taproom',
   '75th Street Brewery',
   '7venth (Seventh) Sun Brewery',
   '8 Wired Brewing Co.',
   '961 Beer',
   'A Tribbiera',
   'A.J.I. Beer Inc',
   'A1A Aleworks',
   'AB Group, Ltd.',
   'AC Golden Brewing Company',
   'ALDI Stores Australia',
   'AMB - Maître Brasseur',
   'AO Susyndar',
   'AS L&#257;&#

In [31]:
pd.DataFrame(list_brewery_name).to_csv("../data/brewery_name_list.csv",index=False)
pd.DataFrame(list_beer_style).to_csv("../data/beer_style_list.csv",index=False)
    

## Ordinal encoder and Standard scaler from sklearn

In [32]:
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

**Note:** 
1) https://datascience.stackexchange.com/questions/39317/difference-between-ordinalencoder-and-labelencoder

Both encoders map categories to integers; the difference is the intended use. `OrdinalEncoder` converts features (2D input), while `LabelEncoder` converts the target variable (1D input).

`LabelEncoder` learns `classes_`.

`OrdinalEncoder` learns `categories_`.


2) https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/

- We apply One-Hot Encoding when:
  - the categorical feature is not ordinal (for example, country names)
  - the number of categories is small, so one-hot encoding can be applied effectively

- We apply Label Encoding when:
  - the categorical feature is ordinal (for example, junior kindergarten, senior kindergarten, primary school, high school)
  - the number of categories is quite large, as one-hot encoding can lead to high memory consumption
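
To make the `classes_` versus `categories_` distinction concrete, here is a small sketch on made-up style names. It only illustrates the two scikit-learn encoders side by side:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

styles = np.array(["Hefeweizen", "Bock", "Hefeweizen", "Gose"])  # made-up sample

# OrdinalEncoder expects a 2D feature matrix and stores one array per column in categories_
feature_encoder = OrdinalEncoder()
encoded_feature = feature_encoder.fit_transform(styles.reshape(-1, 1))

# LabelEncoder expects a 1D target vector and stores the sorted labels in classes_
target_encoder = LabelEncoder()
encoded_target = target_encoder.fit_transform(styles)

print(feature_encoder.categories_[0])
print(target_encoder.classes_)
print(encoded_feature.ravel(), encoded_target)
```

Both assign the same integer per category here; `OrdinalEncoder` returns floats in a 2D array, while `LabelEncoder` returns a 1D integer array.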

In [33]:
# [val for sublist in matrix for val in sublist]
df_cleaned.head()

Unnamed: 0,brewery_name,review_aroma,review_appearance,beer_style,review_palate,review_taste,beer_abv
0,Vecchio Birraio,2.0,2.5,Hefeweizen,1.5,1.5,5.0
1,Vecchio Birraio,2.5,3.0,English Strong Ale,3.0,3.0,6.2
2,Vecchio Birraio,2.5,3.0,Foreign / Export Stout,3.0,3.0,6.5
3,Vecchio Birraio,3.0,3.5,German Pilsener,2.5,3.0,5.0
4,Caldera Brewing Company,4.5,4.0,American Double / Imperial IPA,4.0,4.5,7.7


In [34]:
for col, cats in cats_dict.items():
    col_encoder = OrdinalEncoder(categories=cats)
    df_cleaned[col] = col_encoder.fit_transform(df_cleaned[[col]])
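
Because the encoder is given an explicit `categories` list, each encoded value is simply the position of the category in that sorted list. This is why the saved `beer_style_list.csv` can later be used to turn a predicted class index back into a style name. A toy sketch with made-up rows:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"beer_style": ["Hefeweizen", "Bock", "Gose", "Bock"]})  # made-up rows
list_beer_style = list(np.sort(toy["beer_style"].unique()))

# Same pattern as the loop above: one list of categories per encoded column
encoder = OrdinalEncoder(categories=[list_beer_style])
codes = encoder.fit_transform(toy[["beer_style"]]).ravel()

# Each code equals the index of the style in the sorted list, so the list decodes predictions
lookup = [list_beer_style.index(style) for style in toy["beer_style"]]
decoded = [list_beer_style[int(code)] for code in codes]
print(codes, lookup, decoded)
```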

In [35]:
# using standard scaler
sc = StandardScaler()

In [36]:
# Fit and transform numerical column variables (including the ordinal-encoded 'brewery_name') using standard scaler
num_cols = ['brewery_name', 'review_aroma','review_appearance','review_palate','review_taste','beer_abv']
df_cleaned[num_cols] = sc.fit_transform(df_cleaned[num_cols])
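
The cell above fits the scaler on the full dataset before it is split, so validation and test rows influence the scaling statistics, and the fitted scaler is not saved for later use. One possible alternative, sketched here on made-up data with a hypothetical file name, is to fit on the training rows only and persist the scaler with `joblib` so that inference can apply identical scaling:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # made-up numeric features

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_scaled = scaler.transform(X_test)        # the same statistics are reused

# Persist the fitted scaler (hypothetical file name) and load it back
scaler_path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")
joblib.dump(scaler, scaler_path)
restored = joblib.load(scaler_path)
```

The restored object transforms new rows exactly like the original one, which is what an API serving the model would need.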

In [37]:
df_cleaned.head()

Unnamed: 0,brewery_name,review_aroma,review_appearance,beer_style,review_palate,review_taste,beer_abv
0,1.45482,-2.511302,-2.19821,65.0,-3.317561,-3.162309,-0.87941
1,1.45482,-1.792233,-1.384289,51.0,-1.109519,-1.103587,-0.36274
2,1.45482,-1.792233,-1.384289,59.0,-1.109519,-1.103587,-0.233573
3,1.45482,-1.073164,-0.570368,61.0,-1.845533,-1.103587,-0.87941
4,-0.808444,1.084042,0.243553,9.0,0.36251,0.955134,0.283097


In [38]:
# One-hot encoding is not used here (kept for reference only)
# ohe = OneHotEncoder(sparse=False)

# Convert the column 'beer_style' to integer class indices
df_cleaned['beer_style'] = df_cleaned['beer_style'].astype(int)

In [39]:
df_cleaned.head()

Unnamed: 0,brewery_name,review_aroma,review_appearance,beer_style,review_palate,review_taste,beer_abv
0,1.45482,-2.511302,-2.19821,65,-3.317561,-3.162309,-0.87941
1,1.45482,-1.792233,-1.384289,51,-1.109519,-1.103587,-0.36274
2,1.45482,-1.792233,-1.384289,59,-1.109519,-1.103587,-0.233573
3,1.45482,-1.073164,-0.570368,61,-1.845533,-1.103587,-0.87941
4,-0.808444,1.084042,0.243553,9,0.36251,0.955134,0.283097


In [43]:
from src.data.sets import split_sets_random, save_sets

In [44]:
# Random split into training, validation and test sets (test_ratio=0.2)

X_train, y_train, X_val, y_val, X_test, y_test = split_sets_random(df_cleaned, target_col='beer_style', test_ratio=0.2)
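
The source of `split_sets_random` is not shown in this notebook. As a rough idea of what such a helper could look like, here is a hypothetical stand-in built on scikit-learn's `train_test_split`. The function name, the `random_state` default and the choice to make the validation set the same size as the test set are all assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_sets_random_sketch(df, target_col, test_ratio=0.2, random_state=8):
    """Hypothetical stand-in for split_sets_random: random train/val/test split."""
    features = df.drop(columns=[target_col])
    target = df[target_col]

    # Hold out the test set first
    X_rest, X_test, y_rest, y_test = train_test_split(
        features, target, test_size=test_ratio, random_state=random_state
    )
    # Carve a validation set of about the same size out of the remainder (assumption)
    val_ratio = test_ratio / (1 - test_ratio)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_ratio, random_state=random_state
    )
    return X_train, y_train, X_val, y_val, X_test, y_test


toy = pd.DataFrame({"x": range(100), "beer_style": [i % 4 for i in range(100)]})
X_train, y_train, X_val, y_val, X_test, y_test = split_sets_random_sketch(toy, "beer_style")
print(len(X_train), len(X_val), len(X_test))
```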

In [45]:
# Create the beer_style folder inside data/processed (-p avoids an error if it already exists)

!mkdir -p ../data/processed/beer_style


In [46]:
# Save the sets in the data/processed/beer_style folder
save_sets(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, X_test=X_test, y_test=y_test, path='../data/processed/beer_style/')

In [47]:
# Wrap the sets in PyTorch datasets

from src.models.pytorch import PytorchDataset

train_dataset = PytorchDataset(X=X_train, y=y_train)
val_dataset = PytorchDataset(X=X_val, y=y_val)
test_dataset = PytorchDataset(X=X_test, y=y_test)

## Baseline model

In [48]:
from src.models.null import NullModel

In [49]:
# y_base

In [50]:
# Instantiate a NullModel and call .fit_predict() on the training target to extract your predictions into a variable called y_base
baseline_model = NullModel(target_type='classification')
y_base = baseline_model.fit_predict(y_train)

In [51]:
# from joblib import dump 

# dump(gmm_pipe,  '../models/gmm_pipeline.joblib')


In [52]:
# Import print_class_perf from src.models.performance
from src.models.performance import print_class_perf

In [53]:
print_class_perf(y_base, y_train, set_name='Training', average='weighted')

Accuracy Training: 0.07444192974099043
F1 Training: 0.010315310209270104
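
An accuracy of about 7.4% is what a predictor that always outputs the most frequent class would score if that class covers about 7.4% of the training rows. Assuming that is how `NullModel` behaves for `target_type='classification'` (its source is not shown here), a hypothetical stand-in could look like this:

```python
import numpy as np


class MostFrequentBaseline:
    """Hypothetical stand-in for NullModel(target_type='classification')."""

    def fit(self, y):
        # Remember the class that occurs most often in the training target
        values, counts = np.unique(y, return_counts=True)
        self.prediction_ = values[np.argmax(counts)]
        return self

    def predict(self, y):
        # Predict that same class for every observation
        return np.full(len(y), self.prediction_)

    def fit_predict(self, y):
        return self.fit(y).predict(y)


y_toy = np.array([9, 9, 65, 51, 9, 65])  # made-up encoded styles
y_base_toy = MostFrequentBaseline().fit_predict(y_toy)
accuracy = float((y_base_toy == y_toy).mean())
print(y_base_toy, accuracy)
```

Any trained model should beat this score to be considered useful.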


In [54]:
df_cleaned
# 6 features (predictor variables) 

Unnamed: 0,brewery_name,review_aroma,review_appearance,beer_style,review_palate,review_taste,beer_abv
0,1.454820,-2.511302,-2.198210,65,-3.317561,-3.162309,-0.879410
1,1.454820,-1.792233,-1.384289,51,-1.109519,-1.103587,-0.362740
2,1.454820,-1.792233,-1.384289,59,-1.109519,-1.103587,-0.233573
3,1.454820,-1.073164,-0.570368,61,-1.845533,-1.103587,-0.879410
4,-0.808444,1.084042,0.243553,9,0.362510,0.955134,0.283097
...,...,...,...,...,...,...,...
1586609,1.282155,0.364974,-0.570368,85,0.362510,0.268894,-0.793298
1586610,1.282155,1.803111,-2.198210,85,-2.581547,0.268894,-0.793298
1586611,1.282155,-0.354095,-1.384289,85,-0.373505,0.268894,-0.793298
1586612,1.282155,1.084042,1.057473,85,1.098524,0.955134,-0.793298


## Define Architecture

In [55]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [56]:
from src.models.pytorch import PytorchMultiClass

model = PytorchMultiClass(X_train.shape[1])

In [57]:
from src.models.pytorch import get_device

device = get_device()
model.to(device)

PytorchMultiClass(
  (layer_1): Linear(in_features=6, out_features=32, bias=True)
  (layer_out): Linear(in_features=32, out_features=104, bias=True)
  (softmax): Softmax(dim=1)
)

In [58]:
# Solution:
print(model)

PytorchMultiClass(
  (layer_1): Linear(in_features=6, out_features=32, bias=True)
  (layer_out): Linear(in_features=32, out_features=104, bias=True)
  (softmax): Softmax(dim=1)
)


## Train the model 

In [59]:
# Instantiate a nn.CrossEntropyLoss() and save it into a variable called criterion
criterion = nn.CrossEntropyLoss()
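
`nn.CrossEntropyLoss` applies a log-softmax to its input internally, so it expects raw scores (logits). The model printed above contains a `Softmax` layer; if its `forward` applies that layer (the source of `PytorchMultiClass` is not shown here, so this is an assumption), the loss is computed on probabilities and cannot drop below log(1 + 103/e), roughly 3.66, for 104 classes. A NumPy-only sketch that re-implements the loss by hand for a single sample shows the effect:

```python
import numpy as np


def cross_entropy_from_logits(logits, target):
    """Cross-entropy of one sample, computed as -log_softmax(logits)[target]."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target])


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


n_classes = 104
target = 3

# Raw logits that confidently favour the correct class
logits = np.zeros(n_classes)
logits[target] = 20.0

loss_on_logits = cross_entropy_from_logits(logits, target)
# Passing probabilities instead squeezes every input into [0, 1] before the log-softmax
loss_on_probs = cross_entropy_from_logits(softmax(logits), target)

# Lowest loss reachable when the input is a probability vector (one-hot on the target)
floor = np.log(1 + (n_classes - 1) / np.e)
print(loss_on_logits, loss_on_probs, floor)
```

If this applies to the model here, returning the output of the last linear layer directly and applying softmax only when probabilities are needed would avoid the problem.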

In [60]:
# learning optimiser
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

In [61]:
from torch.utils.data import DataLoader

def train_classification(train_data, model, criterion, optimizer, batch_size, device, scheduler=None, generate_batch=None):
    """Train a Pytorch multi-class classification model

    Parameters
    ----------
    train_data : torch.utils.data.Dataset
        Pytorch dataset
    model: torch.nn.Module
        Pytorch Model
    criterion: function
        Loss function
    optimizer: torch.optim
        Optimizer
    batch_size : int
        Number of observations per batch
    device : str
        Name of the device used for the model
    scheduler : torch.optim.lr_scheduler
        Pytorch Scheduler used for updating learning rate
    generate_batch : function
        Function defining required pre-processing steps (passed to DataLoader as collate_fn)

    Returns
    -------
    Float
        Loss score
    Float:
        Accuracy Score
    """
    
    # Set model to training mode
    model.train()
    train_loss = 0
    train_acc = 0
    
    # Create data loader
    data = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=generate_batch)
    
    # Iterate through data by batch of observations
    for feature, target_class in data:

        # Reset gradients
        optimizer.zero_grad()
        
        # Load data to specified device
        feature, target_class = feature.to(device), target_class.to(device)
        
        # Make predictions
        output = model(feature)
        
        # Calculate loss for given batch
        loss = criterion(output, target_class.long())

        # Calculate global loss
        train_loss += loss.item()
        
        # Calculate gradients
        loss.backward()

        # Update Weights
        optimizer.step()
        
        # Calculate global accuracy
        train_acc += (output.argmax(1) == target_class).sum().item()

    # Adjust the learning rate
    if scheduler:
        scheduler.step()

    return train_loss / len(train_data), train_acc / len(train_data)

In [62]:
def test_classification(test_data, model, criterion, batch_size, device, generate_batch=None):
    """Calculate performance of a Pytorch multi-class classification model

    Parameters
    ----------
    test_data : torch.utils.data.Dataset
        Pytorch dataset
    model: torch.nn.Module
        Pytorch Model
    criterion: function
        Loss function
    batch_size : int
        Number of observations per batch
    device : str
        Name of the device used for the model
    generate_batch : function
        Function defining required pre-processing steps (passed to DataLoader as collate_fn)

    Returns
    -------
    Float
        Loss score
    Float:
        Accuracy Score
    """    
    
    # Set model to evaluation mode
    model.eval()
    test_loss = 0
    test_acc = 0
    
    # Create data loader
    data = DataLoader(test_data, batch_size=batch_size, collate_fn=generate_batch)
    
    # Iterate through data by batch of observations
    for feature, target_class in data:
        
        # Load data to specified device
        feature, target_class = feature.to(device), target_class.to(device)
        
        # Set no update to gradients
        with torch.no_grad():
            
            # Make predictions
            output = model(feature)
            
            # Calculate loss for given batch
            loss = criterion(output, target_class.long())

            # Calculate global loss
            test_loss += loss.item()
            
            # Calculate global accuracy
            test_acc += (output.argmax(1) == target_class).sum().item()

    return test_loss / len(test_data), test_acc / len(test_data)

In [63]:
# Create 2 variables called N_EPOCHS and BATCH_SIZE that will take respectively 5 and 32 as values
N_EPOCHS = 5
BATCH_SIZE = 32

In [64]:
# Create a for loop that will iterate through the specified number of epochs and will train the model with the training set
# and assess the performance on the validation set and print their scores
from src.models.pytorch import train_classification, test_classification

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_classification(train_dataset, model=model, criterion=criterion, optimizer=optimizer, batch_size=BATCH_SIZE, device=device)
    valid_loss, valid_acc = test_classification(val_dataset, model=model, criterion=criterion, batch_size=BATCH_SIZE, device=device)

    print(f'Epoch: {epoch}')
    print(f'\t(train)\t|\tLoss: {train_loss:.4f}\t|\tAcc: {train_acc * 100:.1f}%')
    print(f'\t(valid)\t|\tLoss: {valid_loss:.4f}\t|\tAcc: {valid_acc * 100:.1f}%')

Epoch: 0
	(train)	|	Loss: 0.1421	|	Acc: 11.5%
	(valid)	|	Loss: 0.1419	|	Acc: 11.9%
Epoch: 1
	(train)	|	Loss: 0.1420	|	Acc: 11.8%
	(valid)	|	Loss: 0.1418	|	Acc: 12.5%
Epoch: 2
	(train)	|	Loss: 0.1419	|	Acc: 11.8%
	(valid)	|	Loss: 0.1418	|	Acc: 12.5%
Epoch: 3
	(train)	|	Loss: 0.1419	|	Acc: 11.9%
	(valid)	|	Loss: 0.1418	|	Acc: 12.4%
Epoch: 4
	(train)	|	Loss: 0.1419	|	Acc: 11.9%
	(valid)	|	Loss: 0.1417	|	Acc: 12.7%
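
A note on reading these loss values: `nn.CrossEntropyLoss` returns the mean loss of each batch by default. Assuming the imported `train_classification` matches the definition shown earlier, those batch means are summed and then divided by the number of observations, so the printed loss is roughly the per-sample loss divided by the batch size. A quick back-of-the-envelope check:

```python
import math

BATCH_SIZE = 32
reported_loss = 0.1421  # epoch 0 training loss printed above
n_classes = 104

# Undo the extra division by the batch size (approximate: the last batch may be smaller)
per_sample_loss = reported_loss * BATCH_SIZE

# Cross-entropy of a uniform guess over all classes, for comparison
uniform_loss = math.log(n_classes)

print(round(per_sample_loss, 3), round(uniform_loss, 3))
```

The rescaled loss of about 4.55 is only slightly below ln(104), about 4.64, which suggests the network is doing little better than guessing uniformly across the 104 styles.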


In [69]:
# save the model 
# torch.save(model, "./models/pytorch_multi_classification_beer.pt")
torch.save(model.state_dict(), '../models/pytorch_multi_classification_beer_250322.pt')
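
Saving only the `state_dict` means the class definition must be available again at load time, for instance inside the API. The sketch below uses a hypothetical `TinyClassifier` with the layer sizes printed earlier (its `forward` is an assumption) and an in-memory buffer in place of the `.pt` file to show the round trip:

```python
import io

import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in with the layer sizes printed earlier (6 -> 32 -> 104)."""

    def __init__(self, num_features=6, num_classes=104):
        super().__init__()
        self.layer_1 = nn.Linear(num_features, 32)
        self.layer_out = nn.Linear(32, num_classes)

    def forward(self, x):
        # Assumed forward pass: one hidden layer with ReLU, raw logits as output
        return self.layer_out(torch.relu(self.layer_1(x)))


trained = TinyClassifier()

# Save only the weights; an in-memory buffer stands in for the .pt file
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Loading requires building the architecture first, then restoring the weights
restored = TinyClassifier()
restored.load_state_dict(torch.load(buffer))
restored.eval()

sample = torch.zeros(1, 6)
same_output = torch.equal(trained(sample), restored(sample))
print(same_output)
```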

## Assess the performance of the model

In [284]:
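test_loss, test_acc = test_classification(test_dataset, model=model, criterion=criterion, batch_size=BATCH_SIZE, device=device)
# test_acc is a fraction between 0 and 1 here (it is not multiplied by 100 as in the training loop)
print(f'\tLoss: {test_loss:.4f}\t|\tAccuracy: {test_acc:.1f}')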

	Loss: 0.1417	|	Accuracy: 0.1


## PUSH CHANGES

In [None]:
#add changes to git staging area
git add .

# git commit snapshot of  the repository
git commit -m "pytorch beer style multi-classification"

#git push to Github
git push

# checkout master branch
git checkout master

# pull the latest updates:
git pull

# git checkout pytorch_multi_class
git checkout pytorch_multi_class

# Merge the master branch and push your changes
git merge master
git push


In [None]:
# stop docker 
docker stop adv_dsi_at2

## DEPLOY FAST API

In [None]:
# IN TERMINAL APPLICATION 

# Go to the folder
cd ~/Projects/adv_dsi_at2

# Create folder 'api'
mkdir api

# Go to the folder api
cd api

# Copy models folder 
cp -r ../models .


In [None]:
# initialise the Git repo

git init

Go to github.com and create a repository called **'adv_dsi_at2_api'** and get the link to the repository:

https://github.com/justinmuts/adv_dsi_at2_api.git

In [None]:
# make directory app in terminal
mkdir app

In [291]:
df_cleaned.columns

Index(['brewery_name', 'review_aroma', 'review_appearance', 'beer_style',
       'review_palate', 'review_taste', 'beer_abv'],
      dtype='object')

In [71]:
#**Enter this in the format_features function**

#brewery_name: str, review_aroma: float, review_appearance: float, review_palate: float, review_taste: float, beer_abv: float

# def format_features(brewery_name: str, review_aroma: float, review_appearance: float, review_palate: float, review_taste: float, beer_abv: float):
#     return {
#         'brewery_name': [brewery_name],
#         'review_aroma': [review_aroma],
#         'review_appearance': [review_appearance],
#         'review_palate': [review_palate],
#         'review_taste': [review_taste],
#         'beer_abv': [beer_abv]
#     }
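
As a runnable sketch of this helper: the review scores and ABV in the data are decimals (2.5, 6.2, ...), so the version below annotates them as `float`, and wrapping each value in a list lets the dictionary map directly onto a one-row DataFrame. How the API would consume that frame afterwards is not covered here:

```python
import pandas as pd


def format_features(brewery_name: str, review_aroma: float, review_appearance: float,
                    review_palate: float, review_taste: float, beer_abv: float) -> dict:
    """Wrap each input in a list so the dict converts to a one-row DataFrame."""
    return {
        'brewery_name': [brewery_name],
        'review_aroma': [review_aroma],
        'review_appearance': [review_appearance],
        'review_palate': [review_palate],
        'review_taste': [review_taste],
        'beer_abv': [beer_abv],
    }


# Values taken from the first record of the raw data shown earlier
features = format_features('Vecchio Birraio', 2.0, 2.5, 1.5, 1.5, 5.0)
obs = pd.DataFrame(features)
print(obs)
```

The same ordinal encoding and scaling used during training would still have to be applied to `obs` before it is passed to the model.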

In [292]:
df.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883
