# **Lab: Model Optimization**



## Exercise 1: S3 Datalake

In this exercise, we will load data from an AWS S3 bucket. The dataset is the NYC Yellow Cab trip record data, which contains records of taxi trips in New York:
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

In the next exercise we will build a model to predict trip duration.

First we will set up our project with a custom Docker image and will prepare the dataset for modelling (exercises 2 and 3).

The steps are:
1.   Setup Repository
2.   Load and Explore Dataset
3.   Prepare Data
4.   Split Dataset
5.   Baseline Model
6.   Push Changes


### 1. Setup Repository

**[1.1]** Go to a folder of your choice on your computer (where you store projects)

In [None]:
# Placeholder for student's code (1 command line)
# Task: Go to a folder of your choice on your computer (where you store projects)

In [None]:
#Solution:
cd ~/Projects/adv_dsi

**[1.2]** Copy the cookiecutter data science template

In [None]:
# Placeholder for student's code (1 command line)
# Task: Copy the cookiecutter data science template

In [None]:
#Solution:
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science

Follow the prompt (name the project and repo adv_dsi_lab_3)

**[1.3]** Go inside the created folder `adv_dsi_lab_3`


In [None]:
# Placeholder for student's code (1 command line)
# Task: Go inside the created folder adv_dsi_lab_3

In [None]:
#Solution:
cd adv_dsi_lab_3

**[1.4]** Create a file called `Dockerfile` and add the following content:

`FROM jupyter/scipy-notebook:0ce64578df46`

`RUN conda install xgboost`

`RUN conda install boto3`

`RUN conda install s3fs`

`RUN conda install lime`

`RUN conda install hyperopt`

`RUN conda install graphviz`

`ENV PYTHONPATH "${PYTHONPATH}:/home/jovyan/work"`

`RUN echo "export PYTHONPATH=/home/jovyan/work" >> ~/.bashrc`

`WORKDIR /home/jovyan/work`


In [None]:
# Placeholder for student's code (1 command line)
# Task: Create a file called Dockerfile 

In [None]:
#Solution:
vi Dockerfile

We will create our own Docker image based on the official jupyter/scipy-notebook.

**[1.5]** Build the image from this Dockerfile

In [None]:
docker build -t xgboost-notebook:latest .

Syntax: docker build [OPTIONS] PATH 

Options:

`-t: Name and optionally a tag in the 'name:tag' format`

Documentation: https://docs.docker.com/engine/reference/commandline/build/

**[1.6]** Run the built image

In [None]:
docker run  -dit --rm --name adv_dsi_lab_3 -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ~/Projects/adv_dsi/adv_dsi_lab_3:/home/jovyan/work -v ~/Projects/adv_dsi/.aws:/home/jovyan/.aws -v ~/Projects/adv_dsi/src:/home/jovyan/work/src xgboost-notebook:latest 

**[1.7]** Display last 50 lines of logs

In [None]:
docker logs --tail 50 adv_dsi_lab_3

Copy the url displayed and paste it to a browser in order to launch Jupyter Lab

**[1.8]** Initialise the repo

In [None]:
# Placeholder for student's code (1 command line)
# Task: Initialise the repo

In [None]:
#Solution:
git init

**[1.9]** Log in to GitHub with your account (https://github.com/) and create a public repo with the name `adv_dsi_lab_3`

**[1.10]** In your local repo `adv_dsi_lab_3`, link it with Github (replace the url with your username)

In [None]:
# Placeholder for student's code (1 command line)
# Task: Link repo with Github

In [None]:
#Solution:
git remote add origin git@github.com:<username>/adv_dsi_lab_3

**[1.11]** Add your changes to git staging area and commit them

In [None]:
# Placeholder for student's code (2 command lines)
# Task: Add your changes to git staging area and commit them

In [None]:
#Solution:
git add .
git commit -m "init"

**[1.12]** Push your master branch to origin

In [None]:
# Placeholder for student's code (1 command line)
# Task: Push your master branch to origin

In [None]:
#Solution:
git push --set-upstream origin master

**[1.13]** Prevent pushes to the `master` branch

In [None]:
# Placeholder for student's code (1 command line)
# Task: Preventing push to master branch

In [None]:
# Solution
git config branch.master.pushRemote no_push

**[1.14]** Create a new git branch called `data_prep`

In [None]:
# Placeholder for student's code (1 command line)
# Task: Create a new git branch called data_prep

In [None]:
#Solution:
git checkout -b data_prep

**[1.15]** Navigate to the folder `notebooks` and create a new Jupyter notebook called `1_data_prep.ipynb`

### 2.   Load and Explore Dataset

**[2.1]** Import the boto3, pandas and numpy packages

In [None]:
# Placeholder for student's code (3 lines of Python code)
# Task: Import the boto3, pandas and numpy packages

In [1]:
#Solution
import boto3
import pandas as pd
import numpy as np

**[2.2]** Create a function that will list all files from an S3 bucket whose key contains a given string

In [None]:
def list_bucket_contents(bucket, match=''):
    s3_resource = boto3.resource('s3')
    bucket_resource = s3_resource.Bucket(bucket)
    for key in bucket_resource.objects.all():
        if match in key.key:
            print(key.key)
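
Printing the keys works for exploration, but returning them makes the result reusable in later cells. The following is a minimal sketch, not part of the lab: `list_matching_keys` is a hypothetical helper that takes an already-created bucket object, and `fake_bucket` is a stand-in that only mimics the `.objects.all()` and `.key` interface so the example runs without AWS access.

```python
from types import SimpleNamespace


def list_matching_keys(bucket_resource, match=''):
    """Return the keys of all objects in the bucket whose key contains `match`."""
    return [obj.key for obj in bucket_resource.objects.all() if match in obj.key]


# Stand-in for a real bucket object (assumption: it only needs `.objects.all()`
# returning items that expose a `.key` attribute).
fake_bucket = SimpleNamespace(
    objects=SimpleNamespace(
        all=lambda: [
            SimpleNamespace(key='trip data/fhv_tripdata_2020-01.csv'),
            SimpleNamespace(key='trip data/yellow_tripdata_2020-04.csv'),
        ]
    )
)

yellow_keys = list_matching_keys(fake_bucket, match='yellow')
print(yellow_keys)
```

With real credentials, the same helper would presumably be called with `boto3.resource('s3').Bucket('nyc-tlc')` in place of `fake_bucket`.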

**[2.3]** Call the function you defined to list the files of the 'nyc-tlc' bucket that contain the string '2020'

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Call the function you defined to list the files of the 'nyc-tlc' bucket that contain the string '2020'

In [None]:
#Solution
list_bucket_contents(bucket='nyc-tlc', match='2020')

trip data/fhv_tripdata_2020-01.csv
trip data/fhv_tripdata_2020-02.csv
trip data/fhv_tripdata_2020-03.csv
trip data/fhv_tripdata_2020-04.csv
trip data/fhv_tripdata_2020-05.csv
trip data/fhv_tripdata_2020-06.csv
trip data/fhv_tripdata_2020-07.csv
trip data/fhv_tripdata_2020-08.csv
trip data/fhv_tripdata_2020-09.csv
trip data/fhv_tripdata_2020-10.csv
trip data/fhv_tripdata_2020-11.csv
trip data/fhv_tripdata_2020-12.csv
trip data/fhvhv_tripdata_2020-01.csv
trip data/fhvhv_tripdata_2020-02.csv
trip data/fhvhv_tripdata_2020-03.csv
trip data/fhvhv_tripdata_2020-04.csv
trip data/fhvhv_tripdata_2020-05.csv
trip data/fhvhv_tripdata_2020-06.csv
trip data/fhvhv_tripdata_2020-07.csv
trip data/fhvhv_tripdata_2020-08.csv
trip data/fhvhv_tripdata_2020-09.csv
trip data/fhvhv_tripdata_2020-10.csv
trip data/fhvhv_tripdata_2020-11.csv
trip data/fhvhv_tripdata_2020-12.csv
trip data/green_tripdata_2020-01.csv
trip data/green_tripdata_2020-02.csv
trip data/green_tripdata_2020-03.csv
trip data/green_tripdata_

**[2.4]** Load the file named `trip data/yellow_tripdata_2020-04.csv` into a dataframe called df. Specify `s3://` as prefix for the file url


In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Load the file named trip data/yellow_tripdata_2020-04.csv into a dataframe called df. Specify s3:// as prefix for the file url

In [2]:
#Solution:
df = pd.read_csv('s3://nyc-tlc/trip data/yellow_tripdata_2020-04.csv')


**[2.5]** Display the first 5 rows of df

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the first 5 rows of df

In [3]:
# Solution
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-04-01 00:41:22,2020-04-01 01:01:53,1.0,1.2,1.0,N,41,24,2.0,5.5,0.5,0.5,0.0,0.0,0.3,6.8,0.0
1,1.0,2020-04-01 00:56:00,2020-04-01 01:09:25,1.0,3.4,1.0,N,95,197,1.0,12.5,0.5,0.5,2.75,0.0,0.3,16.55,0.0
2,1.0,2020-04-01 00:00:26,2020-04-01 00:09:25,1.0,2.8,1.0,N,237,137,1.0,10.0,3.0,0.5,1.0,0.0,0.3,14.8,2.5
3,1.0,2020-04-01 00:24:38,2020-04-01 00:34:38,0.0,2.6,1.0,N,68,142,1.0,10.0,3.0,0.5,1.0,0.0,0.3,14.8,2.5
4,2.0,2020-04-01 00:13:24,2020-04-01 00:18:26,1.0,1.44,1.0,Y,263,74,1.0,6.5,0.5,0.5,3.0,0.0,0.3,13.3,2.5


**[2.6]** Display the dimensions (shape) of df

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the dimensions (shape) of df

In [4]:
# Solution
df.shape

(237993, 18)

**[2.7]** Display the summary (info) of df

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the summary (info) of df

In [5]:
# Solution
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237993 entries, 0 to 237992
Data columns (total 18 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   VendorID               218480 non-null  float64
 1   tpep_pickup_datetime   237993 non-null  object 
 2   tpep_dropoff_datetime  237993 non-null  object 
 3   passenger_count        218480 non-null  float64
 4   trip_distance          237993 non-null  float64
 5   RatecodeID             218480 non-null  float64
 6   store_and_fwd_flag     218480 non-null  object 
 7   PULocationID           237993 non-null  int64  
 8   DOLocationID           237993 non-null  int64  
 9   payment_type           218480 non-null  float64
 10  fare_amount            237993 non-null  float64
 11  extra                  237993 non-null  float64
 12  mta_tax                237993 non-null  float64
 13  tip_amount             237993 non-null  float64
 14  tolls_amount           237993 non-nu

**[2.8]** Display the descriptive statistics of df


In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Display the descriptive statistics of df

In [6]:
# Solution
df.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
count,218480.0,218480.0,237993.0,218480.0,237993.0,237993.0,218480.0,237993.0,237993.0,237993.0,237993.0,237993.0,237993.0,237993.0,237993.0
mean,1.564949,1.296764,4.039981,1.034081,154.908422,150.361414,1.425673,11.666027,1.066739,0.487,1.530229,0.220504,0.296331,16.408621,1.927536
std,0.495765,0.983595,294.879052,0.865044,70.749496,74.474108,0.555915,11.728767,1.26017,0.094993,2.295523,1.342351,0.045429,13.155858,1.072839
min,1.0,0.0,0.0,1.0,1.0,1.0,1.0,-118.0,-4.5,-0.5,-5.0,-19.87,-0.3,-138.17,-2.5
25%,1.0,1.0,0.95,1.0,97.0,75.0,1.0,5.5,0.0,0.5,0.0,0.0,0.3,9.8,2.5
50%,2.0,1.0,1.74,1.0,143.0,143.0,1.0,8.0,0.5,0.5,1.0,0.0,0.3,12.8,2.5
75%,2.0,1.0,3.4,1.0,234.0,233.0,2.0,13.0,2.5,0.5,2.46,0.0,0.3,18.36,2.5
max,2.0,7.0,126501.77,99.0,265.0,265.0,4.0,903.02,7.0,1.1,117.28,98.75,0.3,903.32,2.5


**[2.9]** Save the dataframe locally in the `data/raw` folder

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Save the dataframe locally in the data/raw folder

In [8]:
# Solution
df.to_csv('../data/raw/yellow_tripdata_2020-04.csv')

### 3. Prepare Data

**[3.1]** Create a copy of df and save it into a variable called df_cleaned

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Create a copy of df and save it into a variable called df_cleaned

In [9]:
# Solution
df_cleaned = df.copy()

**[3.2]** Launch magic commands to automatically reload modules

In [10]:
%load_ext autoreload
%autoreload 2

**[3.4]** Import your new function `convert_to_date` from `src.features.dates`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import your new function convert_to_date from src.features.dates

In [11]:
# Solution
from src.features.dates import convert_to_date
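
This section does not show the body of `convert_to_date`. A minimal sketch of what `src/features/dates.py` might contain, assuming the function simply parses the listed columns with `pd.to_datetime` and returns a new dataframe (the actual course implementation may differ):

```python
import pandas as pd


def convert_to_date(df, cols):
    """Return a copy of `df` with the columns in `cols` parsed as datetimes.

    Assumed behaviour: the lab only shows the call signature, not the body.
    """
    df = df.copy()
    for col in cols:
        df[col] = pd.to_datetime(df[col])
    return df


# Small check on one timestamp taken from the dataset preview
sample = pd.DataFrame({'tpep_pickup_datetime': ['2020-04-01 00:41:22']})
converted = convert_to_date(sample, ['tpep_pickup_datetime'])
print(converted.dtypes)
```

Working on a copy keeps the input dataframe unchanged, which matches the way the result is reassigned to `df_cleaned` in the next step.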

**[3.5]** Convert the column `tpep_pickup_datetime`, `tpep_dropoff_datetime` with your function `convert_to_date`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Convert the column tpep_pickup_datetime, tpep_dropoff_datetime with your function convert_to_date

In [12]:
# Solution
df_cleaned = convert_to_date(df_cleaned, ['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

**[3.6]** Create a new column `trip_duration` that corresponds to the duration of the trip in seconds (difference between `tpep_dropoff_datetime` and `tpep_pickup_datetime`)

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Create a new column trip_duration that corresponds to the duration of the trip in seconds (difference between tpep_dropoff_datetime and tpep_pickup_datetime)

In [13]:
# Solution
df_cleaned['trip_duration'] = (df_cleaned['tpep_dropoff_datetime'] - df_cleaned['tpep_pickup_datetime']).dt.total_seconds()
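
As a quick sanity check of the subtraction, the sketch below applies the same expression to two pickup/dropoff pairs taken from the dataset preview. The variable names are illustrative only.

```python
import pandas as pd

# Two pickup/dropoff pairs copied from the first rows of the dataset preview
pickup = pd.Series(pd.to_datetime(['2020-04-01 00:41:22', '2020-04-01 00:00:26']))
dropoff = pd.Series(pd.to_datetime(['2020-04-01 01:01:53', '2020-04-01 00:09:25']))

# Subtracting datetimes gives timedeltas; .dt.total_seconds() turns them into floats
duration_s = (dropoff - pickup).dt.total_seconds()
print(duration_s.tolist())  # [1231.0, 539.0]
```

The first trip lasts 20 minutes 31 seconds (1231 s) and the second 8 minutes 59 seconds (539 s).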

**[3.7]** Convert the `trip_duration` column into 4 bins (labelled 0 to 3) using the edges [-1, 300, 600, 1800, 300000]; the lowest edge is -1 so that trips of 0 seconds fall into the first bin

In [14]:
df_cleaned['trip_duration'] = pd.cut(df_cleaned['trip_duration'], bins=[-1, 300, 600, 1800, 300000], labels=[0, 1, 2, 3])
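
To see how `pd.cut` treats the edges, here is a small sketch on made-up durations (the values are illustrative, not taken from the dataset). By default each bin is closed on the right, so 300 seconds lands in bin 0 and 301 seconds in bin 1, and anything outside the outer edges becomes missing.

```python
import pandas as pd

# Illustrative durations in seconds, chosen to sit on and around the bin edges
durations = pd.Series([0, 300, 301, 600, 1800, 1801, 400000])

binned = pd.cut(durations, bins=[-1, 300, 600, 1800, 300000], labels=[0, 1, 2, 3])
print(binned.tolist())
```

The last value (400000) is above the top edge and is returned as NaN, as would be any duration below -1, for example when the dropoff time precedes the pickup time.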

**[3.8]** Extract the day of month component from `tpep_pickup_datetime` and save the results in the column `tpep_pickup_dayofmonth`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Extract the day of month component from tpep_pickup_datetime and save the results in the column tpep_pickup_dayofmonth

In [15]:
# Solution
df_cleaned['tpep_pickup_dayofmonth'] = df_cleaned['tpep_pickup_datetime'].dt.day

**[3.9]** Extract the hour component from `tpep_pickup_datetime` and save the results in the column `tpep_pickup_hourofday`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Extract the hour component from tpep_pickup_datetime and save the results in the column tpep_pickup_hourofday

In [16]:
# Solution
df_cleaned['tpep_pickup_hourofday'] = df_cleaned['tpep_pickup_datetime'].dt.hour

**[3.10]** Extract the day of week component from `tpep_pickup_datetime` and save the results in the column `tpep_pickup_dayofweek`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Extract the day of week component from tpep_pickup_datetime and save the results in the column tpep_pickup_dayofweek

In [17]:
# Solution
df_cleaned['tpep_pickup_dayofweek'] = df_cleaned['tpep_pickup_datetime'].dt.dayofweek
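
The three `.dt` accessors used in steps 3.8 to 3.10 can be checked on a single timestamp. This is a small sketch using the first pickup time from the dataset preview; the `parts` dictionary is only there for illustration.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2020-04-01 00:41:22']))

parts = {
    'dayofmonth': int(ts.dt.day[0]),       # day of the month
    'hourofday': int(ts.dt.hour[0]),       # hour on a 24-hour clock
    'dayofweek': int(ts.dt.dayofweek[0]),  # Monday=0 ... Sunday=6
}
print(parts)  # {'dayofmonth': 1, 'hourofday': 0, 'dayofweek': 2}
```

1 April 2020 was a Wednesday, hence the value 2 for `dayofweek`.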

In [19]:
df_cleaned.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,trip_duration,tpep_pickup_dayofmonth,tpep_pickup_hourofday,tpep_pickup_dayofweek
0,1.0,2020-04-01 00:41:22,2020-04-01 01:01:53,1.0,1.2,1.0,N,41,24,2.0,...,0.5,0.0,0.0,0.3,6.8,0.0,2,1,0,2
1,1.0,2020-04-01 00:56:00,2020-04-01 01:09:25,1.0,3.4,1.0,N,95,197,1.0,...,0.5,2.75,0.0,0.3,16.55,0.0,2,1,0,2
2,1.0,2020-04-01 00:00:26,2020-04-01 00:09:25,1.0,2.8,1.0,N,237,137,1.0,...,0.5,1.0,0.0,0.3,14.8,2.5,1,1,0,2
3,1.0,2020-04-01 00:24:38,2020-04-01 00:34:38,0.0,2.6,1.0,N,68,142,1.0,...,0.5,1.0,0.0,0.3,14.8,2.5,1,1,0,2
4,2.0,2020-04-01 00:13:24,2020-04-01 00:18:26,1.0,1.44,1.0,Y,263,74,1.0,...,0.5,3.0,0.0,0.3,13.3,2.5,1,1,0,2


**[3.11]** Perform One-Hot encoding on the categorical features (`VendorID`, `RatecodeID`, `store_and_fwd_flag`)

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Perform One-Hot encoding on the categorical features (VendorID, RatecodeID, store_and_fwd_flag)

In [20]:
# Solution
df_cleaned = pd.get_dummies(df_cleaned, columns=['VendorID', 'RatecodeID', 'store_and_fwd_flag'])
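
A toy example may help to see what `pd.get_dummies` produces. The dataframe below is made up for illustration; it includes one missing value because several categorical columns in the real dataset contain nulls.

```python
import pandas as pd

# Toy dataframe (illustrative values), with one missing category
toy = pd.DataFrame({
    'fare_amount': [5.5, 6.5, 8.0],
    'store_and_fwd_flag': ['N', 'Y', None],
})

encoded = pd.get_dummies(toy, columns=['store_and_fwd_flag'])
print(encoded)
```

The original column is replaced by one indicator column per category (`store_and_fwd_flag_N`, `store_and_fwd_flag_Y`). By default a missing value gets no column of its own, so that row is 0 (or False, depending on the pandas version) in every indicator; passing `dummy_na=True` would add a dedicated column instead.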

In [23]:
# pd.get_dummies() converted the categorical columns into numerical indicator columns
df_cleaned.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,...,VendorID_2.0,RatecodeID_1.0,RatecodeID_2.0,RatecodeID_3.0,RatecodeID_4.0,RatecodeID_5.0,RatecodeID_6.0,RatecodeID_99.0,store_and_fwd_flag_N,store_and_fwd_flag_Y
0,2020-04-01 00:41:22,2020-04-01 01:01:53,1.0,1.2,41,24,2.0,5.5,0.5,0.5,...,0,1,0,0,0,0,0,0,1,0
1,2020-04-01 00:56:00,2020-04-01 01:09:25,1.0,3.4,95,197,1.0,12.5,0.5,0.5,...,0,1,0,0,0,0,0,0,1,0
2,2020-04-01 00:00:26,2020-04-01 00:09:25,1.0,2.8,237,137,1.0,10.0,3.0,0.5,...,0,1,0,0,0,0,0,0,1,0
3,2020-04-01 00:24:38,2020-04-01 00:34:38,0.0,2.6,68,142,1.0,10.0,3.0,0.5,...,0,1,0,0,0,0,0,0,1,0
4,2020-04-01 00:13:24,2020-04-01 00:18:26,1.0,1.44,263,74,1.0,6.5,0.5,0.5,...,1,1,0,0,0,0,0,0,0,1


**[3.12]** Drop the columns `tpep_pickup_datetime`, `tpep_dropoff_datetime`, `PULocationID`, `DOLocationID`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Drop the columns tpep_pickup_datetime, tpep_dropoff_datetime, PULocationID, DOLocationID

In [24]:
# Solution
df_cleaned.drop(['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'PULocationID', 'DOLocationID'], axis=1, inplace=True)

**[3.13]** Save the prepared dataframe in the `data/interim` folder

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Save the prepared dataframe in the data/interim folder

In [25]:
# Solution
df_cleaned.to_csv('../data/interim/yellow_tripdata_2020-04_prepared.csv')

### 4. Split Dataset

**[4.1]** In the file `src/data/sets.py` create a function called `pop_target` with the following logic:
- input parameters: dataframe (`df`), target column name (`target_col`), flag to convert to Numpy arrays, which is False by default (`to_numpy`)
- logic: extract the target variable from the input dataframe and convert both to Numpy arrays if needed
- output parameters: features and target

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: create a function called pop_target 

In [26]:
# Solution
def pop_target(df, target_col, to_numpy=False):
    """Extract target variable from dataframe and convert to Numpy arrays if required

    Parameters
    ----------
    df : pd.DataFrame
        Dataframe
    target_col : str
        Name of the target variable
    to_numpy : bool
        Flag stating to convert to numpy array or not

    Returns
    -------
    pd.DataFrame/Numpy array
        Subsetted Pandas dataframe containing all features
    pd.DataFrame/Numpy array
        Subsetted Pandas dataframe containing the target
    """

    df_copy = df.copy()
    target = df_copy.pop(target_col)
    
    if to_numpy:
        df_copy = df_copy.to_numpy()
        target = target.to_numpy()
    
    return df_copy, target

**[4.2]** In the file `src/data/sets.py` create a function called `split_sets_random` with the following logic:
- input parameters: dataframe (`df`), target column name (`target_col`), ratio used for the validation and testing sets (`test_ratio`), flag to convert to Numpy arrays (`to_numpy`)
- logic: extract the target variable with `pop_target`, then randomly split the data into training, validation and testing sets using the specified ratio
- output parameters: training, validation and testing sets

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Create a function called split_sets_random

In [27]:
# Solution
def split_sets_random(df, target_col, test_ratio=0.2, to_numpy=False):
    """Split sets randomly

    Parameters
    ----------
    df : pd.DataFrame
        Input dataframe
    target_col : str
        Name of the target column
    test_ratio : float
        Ratio used for the validation and testing sets (default: 0.2)
    to_numpy : bool
        Flag stating to convert to numpy array or not (default: False)

    Returns
    -------
    Numpy Array
        Features for the training set
    Numpy Array
        Target for the training set
    Numpy Array
        Features for the validation set
    Numpy Array
        Target for the validation set
    Numpy Array
        Features for the testing set
    Numpy Array
        Target for the testing set
    """
    
    from sklearn.model_selection import train_test_split
    
    features, target = pop_target(df=df, target_col=target_col, to_numpy=to_numpy)
    
    X_data, X_test, y_data, y_test = train_test_split(features, target, test_size=test_ratio, random_state=8)
    
    val_ratio = test_ratio / (1 - test_ratio)
    X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=val_ratio, random_state=8)

    return X_train, y_train, X_val, y_val, X_test, y_test
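
The expression `val_ratio = test_ratio / (1 - test_ratio)` is what makes the validation set the same size as the testing set: the second split is applied to the rows left after the test set has been removed, so the ratio has to be rescaled. The sketch below works through the arithmetic with a hypothetical helper `split_sizes`; the exact rounding inside `train_test_split` may differ by a row.

```python
def split_sizes(n_rows, test_ratio=0.2):
    """Approximate row counts produced by the two successive splits."""
    n_test = round(n_rows * test_ratio)
    n_data = n_rows - n_test                    # rows left for train + validation
    val_ratio = test_ratio / (1 - test_ratio)   # e.g. 0.2 / 0.8 = 0.25
    n_val = round(n_data * val_ratio)           # 25% of the remaining 80% = 20% overall
    n_train = n_data - n_val
    return n_train, n_val, n_test


print(split_sizes(1000))  # (600, 200, 200)
```

With `test_ratio=0.2` the result is therefore a 60/20/20 split.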

**[4.3]** Import your new function `split_sets_random` and split the data into several sets as Numpy arrays

In [28]:
# Placeholder for student's code (2 lines of Python code)
# Task: Import your new function split_sets_random and split the data into several sets as Numpy arrays

In [29]:
# Solution
from src.data.sets import split_sets_random

X_train, y_train, X_val, y_val, X_test, y_test = split_sets_random(df_cleaned, target_col='trip_duration', test_ratio=0.2, to_numpy=True)

**[4.4]** Import save_sets from src.data.sets and save the sets into the folder `data/processed`

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: Import save_sets from src.data.sets and save the sets into the folder data/processed

In [30]:
# Solution
from src.data.sets import save_sets

save_sets(X_train, y_train, X_val, y_val, X_test, y_test, path='../data/processed/')
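
`save_sets` is imported here but its body is not shown in this section. A minimal sketch of what it might look like, assuming each set is written as a separate `.npy` file named after the variable (the file names and the `np.save` format are assumptions, not confirmed by the lab):

```python
import os
import tempfile

import numpy as np


def save_sets(X_train=None, y_train=None, X_val=None, y_val=None,
              X_test=None, y_test=None, path='../data/processed/'):
    """Save every provided set as `<name>.npy` under `path` (assumed behaviour)."""
    sets = {
        'X_train': X_train, 'y_train': y_train,
        'X_val': X_val, 'y_val': y_val,
        'X_test': X_test, 'y_test': y_test,
    }
    for name, data in sets.items():
        if data is not None:
            np.save(os.path.join(path, f'{name}.npy'), data)


# Demonstration in a temporary folder so nothing is written to the project
with tempfile.TemporaryDirectory() as tmp_dir:
    save_sets(X_train=np.array([[1.0, 2.0]]), y_train=np.array([1]), path=tmp_dir)
    saved_files = sorted(os.listdir(tmp_dir))
    reloaded = np.load(os.path.join(tmp_dir, 'X_train.npy'))

print(saved_files)
```

The positional order of the parameters follows the call used in the solution above.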

### 5. Baseline Model

**[5.1]** In the `src/models` folder, create a script called `null.py` and define a class called `NullModel` with:

Attributes
    ----------
    target_type : str
        Type of ML problem (default regression)
    y : Numpy Array-like
        Target variable
    pred_value : Float
        Value to be used for prediction
    preds : Numpy Array
        Predicted array

Methods
-------
    fit(y)
        Store the input target variable and calculate the predicted value to be used based on the problem type
    predict(y)
        Generate the predictions
    fit_predict(y)
        Perform a fit followed by predict

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: In the src/models folder, create a script called null.py and define a class called NullModel

In [31]:
# Solution:
import pandas as pd
import numpy as np

class NullModel:
    """
    Class used as baseline model for both regression and classification
    ...

    Attributes
    ----------
    target_type : str
        Type of ML problem (default regression)
    y : Numpy Array-like
        Target variable
    pred_value : Float
        Value to be used for prediction
    preds : Numpy Array
        Predicted array

    Methods
    -------
    fit(y)
        Store the input target variable and calculate the predicted value to be used based on the problem type
    predict(y)
        Generate the predictions
    fit_predict(y)
        Perform a fit followed by predict
    """
        
    
    def __init__(self, target_type: str = "regression"):
        self.target_type = target_type
        self.y = None
        self.pred_value = None
        self.preds = None
        
    def fit(self, y):
        self.y = y
        if self.target_type == "regression":
            self.pred_value = y.mean()
        else:
            from scipy.stats import mode
            self.pred_value = mode(y)[0][0]
    
    def predict(self, y):
        self.preds = np.full((len(y), 1), self.pred_value)
        return self.preds
    
    def fit_predict(self, y):
        self.fit(y)
        return self.predict(self.y)
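
The indexing `mode(y)[0][0]` relies on the array-shaped result returned by the SciPy version shipped with the pinned Docker image. In more recent SciPy releases `scipy.stats.mode` returns a scalar for 1-D input by default, so the double index may fail there. A version-independent alternative is sketched below with a hypothetical helper `most_frequent` based on `np.unique`; it is an illustration, not the lab's implementation.

```python
import numpy as np


def most_frequent(y):
    """Return the most frequent value in `y` (the smallest one in case of a tie)."""
    values, counts = np.unique(np.asarray(y), return_counts=True)
    return values[np.argmax(counts)]


majority = most_frequent(np.array([1, 2, 1, 0, 1, 3]))
print(majority)  # 1
```

Because `np.unique` returns sorted values and `np.argmax` picks the first maximum, ties resolve to the smallest value.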

**[5.2]** Import `NullModel` from `src.models.null`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import NullModel from src.models.null

In [32]:
# Solution:
from src.models.null import NullModel

**[5.3]** Instantiate a `NullModel` with `target_type='classification'` and save it into a variable called `base_model`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Instantiate a NullModel with target_type='classification' and save it into a variable called base_model

In [33]:
# Solution:
base_model = NullModel(target_type="classification")

**[5.4]** Make a prediction using `fit_predict()` and save the results in a variable called `y_base`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Make a prediction using fit_predict() and save the results in a variable called y_base

In [34]:
# Solution:
y_base = base_model.fit_predict(y_train)

In [37]:
# review the y_base
y_base

array([[1],
       [1],
       [1],
       ...,
       [1],
       [1],
       [1]])

**[5.5]** In the `src/models/performance.py` file, create a function called `print_class_perf` with the following logic:
- input parameters: predicted target (`y_preds`), actual target (`y_actuals`), name of the set (`set_name`) and F1 averaging method (`average`)
- logic: print the Accuracy and F1 score for the provided data
- output parameters: None

In [None]:
# Placeholder for student's code (multiple lines of Python code)
# Task: In the src/models/performance.py file, create a function called print_class_perf 

In [38]:
def print_class_perf(y_preds, y_actuals, set_name=None, average='binary'):
    """Print the Accuracy and F1 score for the provided data

    Parameters
    ----------
    y_preds : Numpy Array
        Predicted target
    y_actuals : Numpy Array
        Actual target
    set_name : str
        Name of the set to be printed
    average : str
        Parameter for F1-score averaging

    Returns
    -------
    None
    """
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import f1_score

    print(f"Accuracy {set_name}: {accuracy_score(y_actuals, y_preds)}")
    print(f"F1 {set_name}: {f1_score(y_actuals, y_preds, average=average)}")

**[5.6]** Display the Accuracy and F1 scores of this baseline model on the training set

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Display the Accuracy and F1 scores of this baseline model on the training set

In [39]:
from src.models.performance import print_class_perf

print_class_perf(y_preds=y_base, y_actuals=y_train, set_name='Training', average='weighted')

Accuracy Training: 0.34539024475646907
F1 Training: 0.17733802015864514
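
These two numbers are linked. When every prediction is the same class, accuracy equals the share `p` of that class in the actual target. That class has precision `p` and recall 1, so its F1 is `2p / (1 + p)`; every other class is never predicted and scores an F1 of 0. Weighting by class frequency gives a weighted F1 of `p * 2p / (1 + p)`. The sketch below checks this with a hypothetical helper, assuming never-predicted classes count as 0.

```python
def constant_prediction_scores(class_share):
    """Accuracy and support-weighted F1 when a single class is always predicted.

    `class_share` is the proportion of the predicted class in the actual target.
    """
    accuracy = class_share
    # Predicted class: precision = class_share, recall = 1
    f1_predicted_class = 2 * class_share / (1 + class_share)
    # All other classes contribute 0, so only the predicted class is weighted in
    weighted_f1 = class_share * f1_predicted_class
    return accuracy, weighted_f1


accuracy, weighted_f1 = constant_prediction_scores(0.34539024475646907)
print(accuracy, weighted_f1)
```

Plugging in the training accuracy reported above gives a weighted F1 of about 0.1773, in line with the printed score. Any useful model should beat both figures.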


### 6.   Push changes

**[6.1]** Add your changes to git staging area

In [None]:
# Placeholder for student's code (1 command line)
# Task: Add your changes to git staging area

In [None]:
# Solution:
git add .

**[6.2]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (1 command line)
# Task: Create the snapshot of your repository and add a description

In [None]:
# Solution:
git commit -m "data prep"

**[6.3]** Push your snapshot to Github

In [None]:
# Placeholder for student's code (1 command line)
# Task: Push your snapshot to Github

In [None]:
# Solution:
git push

**[6.4]** Check out to the master branch

In [None]:
# Placeholder for student's code (1 command line)
# Task: Check out to the master branch

In [None]:
# Solution:
git checkout master

**[6.5]** Pull the latest updates

In [None]:
# Placeholder for student's code (1 command line)
# Task: Pull the latest updates

In [None]:
git pull

**[6.6]** Check out to the `data_prep` branch


In [None]:
# Placeholder for student's code (1 command line)
# Task: Check out to the data_prep branch

In [None]:
# Solution:
git checkout data_prep

**[6.7]** Merge the `master` branch and push your changes

In [None]:
# Placeholder for student's code (2 command lines)
# Task: Merge the master branch and push your changes

In [None]:
# Solution:
git merge master
git push

**[6.8]** Go to Github and merge the branch after reviewing the code and fixing any conflict




**[6.9]** Stop the Docker container

In [None]:
# Placeholder for student's code (1 command line)
# Task: Stop the Docker container

In [None]:
# Solution:
docker stop adv_dsi_lab_3