# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
## Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [8]:
# Import libraries
import pandas as pd
import numpy as np
import re
import pickle
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import classification_report, make_scorer, fbeta_score
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score
from sklearn.svm import LinearSVC
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
from functools import partial

# Set NumPy printing precision
np.set_printoptions(precision=4)

In [9]:
# Load data from database
engine = create_engine('sqlite:///../data/DisasterResponse.db')
df = pd.read_sql_table('messages', con=engine)

# Drop 'child_alone' column since this has 1 class only
df.drop(columns=['child_alone'], inplace=True)

# Define feature, target and category name variables
X = df.message
y = df.iloc[:,4:].values
category_names = df.iloc[:,4:].columns

In [10]:
# Quick look at data
print(df.shape)
df.head()

(26215, 39)


Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# Display the percentage of messages relevant for each category
round(100 * df.iloc[:,4:].sum(axis=0) / len(df), 2)

related                   76.65
request                   17.07
offer                      0.45
aid_related               41.43
medical_help               7.95
medical_products           5.01
search_and_rescue          2.76
security                   1.80
military                   3.28
water                      6.38
food                      11.15
shelter                    8.83
clothing                   1.54
money                      2.30
missing_people             1.14
refugees                   3.34
death                      4.55
other_aid                 13.15
infrastructure_related     6.50
transport                  4.58
buildings                  5.08
electricity                2.03
tools                      0.61
hospitals                  1.08
shops                      0.46
aid_centers                1.18
other_infrastructure       4.39
weather_related           27.84
floods                     8.22
storm                      9.32
fire                       1.08
earthqua

### Observations

Given the nature of the dataset, it's not surprising that it's strongly imbalanced for almost all categories.  

Just 3 of the 35 categories above have more than 20% of messages assigned to them, and 28 of 35 categories have less than 10% of messages assigned to them.

Note that the 'child_alone' category has been dropped from the dataset because no messages were assigned to it, so there's nothing for a machine learning algorithm to learn in this instance.  A category with only 1 class would also be problematic for the algorithms selected below.
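
A check like the following could flag single-class categories before modelling. This is a minimal, dependency-free sketch: the `labels` dictionary and the helper names `positive_rate` and `single_class_categories` are hypothetical stand-ins for the category columns of `df`, and the toy values are illustrative only.

```python
def positive_rate(values):
    """Return the percentage of entries equal to 1."""
    return 100 * sum(1 for v in values if v == 1) / len(values)


def single_class_categories(labels):
    """Return the names of categories whose labels take only one distinct value."""
    return [name for name, values in labels.items() if len(set(values)) == 1]


# Hypothetical toy labels (not taken from the real dataset)
labels = {
    'related':     [1, 1, 0, 1],
    'offer':       [0, 0, 0, 1],
    'child_alone': [0, 0, 0, 0],
}

for name, values in labels.items():
    print(f'{name}: {positive_rate(values):.2f}% positive')

# Categories with a single class give a classifier nothing to separate
print(single_class_categories(labels))
```

With pandas, the same check can be expressed on the category columns as `df.iloc[:, 4:].nunique() == 1`.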

## Write a tokenization function to process your text data

In [11]:
def tokenize(text: str) -> list:
    
    """ Normalise and tokenise a text string, remove stop words and return 
    list of tokens after lemmatisation and stemming.
    
    Args:
    text: str.  A string of text to be processed.
    
    Returns:
    list. A list of processed tokens.
    """
    
    # Normalise text by removing capitalisation and punctuation
    # Replace non-alpha-numeric characters with a space to avoid 
    # incorrect concatenation of words within text
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
    
    # Tokenise text
    tokens = word_tokenize(text)
    
    # Remove stop words and lemmatise text
    # (build the stop word set and the lemmatiser once per call, not once per token)
    nltk_stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token.strip()) for token in tokens
              if token.strip() not in nltk_stop_words]
    
    # Finally apply stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    
    return tokens    
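
To illustrate why the normalisation step replaces punctuation with a space rather than deleting it, the sketch below applies both approaches to the start of one of the messages in the dataset, using only the standard library. The helper names are hypothetical, and `str.split` stands in for `word_tokenize` to keep the example dependency-free.

```python
import re


def normalise_with_space(text):
    """Lower-case the text and replace each non-alphanumeric character with a space."""
    return re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())


def normalise_by_deletion(text):
    """Lower-case the text and delete punctuation, keeping existing whitespace."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())


message = 'UN reports Leogane 80-90 destroyed.'

# Replacing with a space keeps '80' and '90' as separate tokens
print(normalise_with_space(message).split())

# Deleting the hyphen fuses them into the single token '8090'
print(normalise_by_deletion(message).split())
```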
    

In [7]:
# View and test output of tokenize function
for i in range(0,10):
    print(f'{X[i]} \n {tokenize(X[i])}\n')

Weather update - a cold front from Cuba that could pass over Haiti 
 ['weather', 'updat', 'cold', 'front', 'cuba', 'could', 'pa', 'haiti']

Is the Hurricane over or is it not over 
 ['hurrican']

Looking for someone but no name 
 ['look', 'someon', 'name']

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately. 
 ['un', 'report', 'leogan', '80', '90', 'destroy', 'hospit', 'st', 'croix', 'function', 'need', 'suppli', 'desper']

says: west side of Haiti, rest of the country today and tonight 
 ['say', 'west', 'side', 'haiti', 'rest', 'countri', 'today', 'tonight']

Information about the National Palace- 
 ['inform', 'nation', 'palac']

Storm at sacred heart of jesus 
 ['storm', 'sacr', 'heart', 'jesu']

Please, we need tents and water. We are in Silo, Thank you! 
 ['pleas', 'need', 'tent', 'water', 'silo', 'thank']

I would like to receive the messages, thank you 
 ['would', 'like', 'receiv', 'messag', 'thank']

I am in Croix-des-Bouquets. We

## Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset (35 in this notebook, since `child_alone` was dropped above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [8]:
# Define pipeline to fit and transform data with tfidf step and predict message categories
# using a RandomForestClassifier
pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                     ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=55), \
                                                   n_jobs=-1))], verbose=True)

## Train pipeline
- Split data into train and test sets
- Train pipeline

In [12]:
# Split data into train and test sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=55)

In [10]:
# Fit pipeline on training data
pipeline = pipeline.fit(X_train, y_train)

[Pipeline] ............. (step 1 of 2) Processing tfidf, total=  18.0s
[Pipeline] ............... (step 2 of 2) Processing clf, total= 1.8min


## Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [11]:
# Generate predictions on test set
preds = pipeline.predict(X_test)

### Model evaluation helper functions

In [13]:
# Define helper functions to assist with model evaluation

def display_class_report(y_test: np.array, preds: np.array, category_names: list):
    """ For each message category, display the classification report, baseline accuracy 
    and model F2 score. At the end of the classification reports, display the mean baseline 
    accuracy, mean model accuracy, mean binary & weighted F1 & F2 scores, and mean 
    binary precision and recall scores across all categories.
    
    Args:
    y_test: np.array.  Array of true target (label) test data.
    preds: np.array.  Array of predicted target (label) test data.
    category_names: list.  A list of category names.
    
    Returns:
    None
    """
    
    # Assign lists to store baseline accuracy, model accuracy, and binary 
    # and weighted F1 & F2 scores for each category
    base_acc, model_acc, f1, f2, f1_w, f2_w, prec, recall = [], [], [], [], [], [], [], []
    
    # Cast y_test array as a DataFrame for easier analysis
    y_test_df = pd.DataFrame(y_test, columns=category_names)
    
    # Iterate through each category
    for i in range(y_test.shape[1]):
                
        # Calculate additional evaluation metrics for each category and append to list
        base_acc.append(y_test_df[category_names[i]].value_counts().max() / len(y_test_df))
        model_acc.append(accuracy_score(y_test[:,i], preds[:,i]))
        f1.append(f1_score(y_test[:,i], preds[:,i], average='binary', zero_division=0))
        f2.append(fbeta_score(y_test[:,i], preds[:,i], beta=2, average='binary', zero_division=0))
        f1_w.append(f1_score(y_test[:,i], preds[:,i], average='weighted', zero_division=0))
        f2_w.append(fbeta_score(y_test[:,i], preds[:,i], beta=2, average='weighted', zero_division=0))
        prec.append(precision_score(y_test[:,i], preds[:,i], average='binary', zero_division=0))
        recall.append(recall_score(y_test[:,i], preds[:,i], average='binary', zero_division=0))
        
        # Display classification report, baseline accuracy and F2 score for each category
        print(f'Category: {category_names[i]}.  Baseline accuracy: {round(base_acc[i],4)}.  '
              f'F2 score: {round(f2[i],4)}.')
        print(classification_report(y_test[:,i], preds[:,i], digits=4, zero_division=0))
        print('-'*100)
        print()
        
    # Display mean evaluation metrics across all categories after the classification reports
    print('Mean evaluation metrics across all categories...\n')
    print(f'Mean baseline Accuracy: {round(np.mean(base_acc),4)}')
    print(f'Mean model Accuracy: {round(np.mean(model_acc),4)}')
    print(f'Mean binary F1: {round(np.mean(f1),4)}')
    print(f'Mean binary F2: {round(np.mean(f2),4)}')
    print(f'Mean weighted F1: {round(np.mean(f1_w),4)}')
    print(f'Mean weighted F2: {round(np.mean(f2_w),4)}')
    print(f'Mean binary Precision: {round(np.mean(prec),4)}')
    print(f'Mean binary Recall: {round(np.mean(recall),4)}\n')
    print('-'*100)
    

def return_model_metrics(y_test: np.array, preds: np.array, category_names:list) -> dict:
    """ Return a dictionary of the baseline accuracy, model accuracy, binary & 
    weighted F1 & F2 scores and binary precision and recall scores for each category
    
    Args:
    y_test: np.array.  Array of true target (label) test data.
    preds: np.array.  Array of predicted target (label) test data.
    category_names: list.  A list of category names.
    
    Returns:
    dict.  A dictionary of evaluation metrics for each category.
    """
    
    # Create metrics dictionary
    metrics = {}
    
    # Cast y_test array as a DataFrame for easier analysis
    y_test_df = pd.DataFrame(y_test, columns=category_names)
    
    # Iterate through message categories
    for i in range(y_test.shape[1]):
        category = category_names[i]
        # Define a dictionary for each category
        metrics[category] = {}
        # Calculate evaluation metrics and insert into nested dictionary
        metrics[category]['Baseline Accuracy'] = y_test_df[category].value_counts().max() \
                                                 / len(y_test_df)
        metrics[category]['Model Accuracy'] = accuracy_score(y_test[:,i], preds[:,i])
        metrics[category]['Model binary Recall'] = recall_score(y_test[:,i], preds[:,i], \
                                                        zero_division=0)
        metrics[category]['Model binary Precision'] = precision_score(y_test[:,i], preds[:,i], \
                                                              zero_division=0)
        metrics[category]['Model binary F1 score'] = f1_score(y_test[:,i], preds[:,i], \
                                                       average='binary', zero_division=0)
        metrics[category]['Model binary F2 score'] = fbeta_score(y_test[:,i], preds[:,i], beta=2, \
                                                          average='binary', zero_division=0)
        
        metrics[category]['Model weighted F1 score'] = f1_score(y_test[:,i], preds[:,i], \
                                                       average='weighted', zero_division=0)
        metrics[category]['Model weighted F2 score'] = fbeta_score(y_test[:,i], preds[:,i], beta=2, \
                                                          average='weighted', zero_division=0)
    
    return metrics


def print_metrics(metrics: dict, mean_metrics=True):
    """Print evaluation metric name and its mean value across all categories
    
    Args:
    metrics: dict.  A dictionary of evaluation metrics returned by the 
    return_model_metrics function.
    mean_metrics: boolean.  Default=True.  If True, print mean metrics across all 
    categories.  If False, print metrics for each category.
    
    Returns:
    None
    """
    
    # Define ordered list of metric types for display
    metric_type_list = ['Baseline Accuracy', 'Model Accuracy', 'Model binary F1 score', \
                        'Model binary F2 score', 'Model weighted F1 score', \
                        'Model weighted F2 score', 'Model binary Precision',\
                        'Model binary Recall']

    
    if mean_metrics:
        # Print mean metric value across all categories
        # For each metric type create a list of metric values across all categories
        for metric_type in metric_type_list:
            metric_vals = [cat_metrics[metric_type] for cat_metrics in metrics.values()]

            # Print metric type and its mean value across all categories
            print(f'{metric_type}: {round(np.mean(metric_vals),4)}')
    else: 
        # Print metrics for each category
        for category in metrics.keys():
            print(f'Category: {category}')
            cat_metrics = metrics[category]
            for (k, v) in cat_metrics.items():
                print(f'{k}: {round(v,4)}')
            print('-'*100)
        
        

In [13]:
# Display classification report for each category, along with mean metrics across 
# all categories
display_class_report(y_test, preds, category_names)

Category: related.  Baseline accuracy: 0.7644.  F2 score: 0.9278.
              precision    recall  f1-score   support

           0     0.7266    0.4015    0.5172      1853
           1     0.8379    0.9534    0.8919      6012

    accuracy                         0.8234      7865
   macro avg     0.7822    0.6775    0.7046      7865
weighted avg     0.8117    0.8234    0.8036      7865

----------------------------------------------------------------------------------------------------

Category: request.  Baseline accuracy: 0.8272.  F2 score: 0.5504.
              precision    recall  f1-score   support

           0     0.9049    0.9782    0.9401      6506
           1     0.8293    0.5077    0.6298      1359

    accuracy                         0.8969      7865
   macro avg     0.8671    0.7430    0.7850      7865
weighted avg     0.8918    0.8969    0.8865      7865

----------------------------------------------------------------------------------------------------

Category: 

Category: tools.  Baseline accuracy: 0.9945.  F2 score: 0.0.
              precision    recall  f1-score   support

           0     0.9945    1.0000    0.9973      7822
           1     0.0000    0.0000    0.0000        43

    accuracy                         0.9945      7865
   macro avg     0.4973    0.5000    0.4986      7865
weighted avg     0.9891    0.9945    0.9918      7865

----------------------------------------------------------------------------------------------------

Category: hospitals.  Baseline accuracy: 0.9906.  F2 score: 0.0.
              precision    recall  f1-score   support

           0     0.9906    1.0000    0.9953      7791
           1     0.0000    0.0000    0.0000        74

    accuracy                         0.9906      7865
   macro avg     0.4953    0.5000    0.4976      7865
weighted avg     0.9813    0.9906    0.9859      7865

----------------------------------------------------------------------------------------------------

Category: shops.

In [14]:
# Test function to print mean metrics across all categories, without classification report
print_metrics(return_model_metrics(y_test, preds, category_names))

Baseline Accuracy: 0.9242
Model Accuracy: 0.9468
Model binary F1 score: 0.2539
Model binary F2 score: 0.2205
Model weighted F1 score: 0.9345
Model weighted F2 score: 0.9414
Model binary Precision: 0.5944
Model binary Recall: 0.2057


### Observations
For most categories, the model accuracy is equal to or only slightly higher than the baseline accuracy (i.e. the accuracy that would be achieved by predicting the majority class for every message).
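
The baseline comparison can be reproduced by hand. The sketch below is a minimal illustration that uses the class counts from the 'tools' report above (7822 negatives, 43 positives); the helper names are hypothetical and the "model" is simply a list of all-zero predictions.

```python
from collections import Counter


def baseline_accuracy(y_true):
    """Accuracy of always predicting the most common class in y_true."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives that were predicted as positive."""
    positives = sum(1 for t in y_true if t == 1)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / positives if positives else 0.0


# Class counts from the 'tools' report above
y_true = [0] * 7822 + [1] * 43
y_pred = [0] * len(y_true)  # always predict the majority class

print(round(baseline_accuracy(y_true), 4))  # 0.9945
print(round(accuracy(y_true, y_pred), 4))   # 0.9945
print(recall(y_true, y_pred))               # 0.0
```

A high accuracy can therefore coexist with a recall of zero, which is why accuracy alone is a poor guide for the rare categories.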

Given the use case, recall is important here, and the recall for predicting a message category is often much lower than the precision. While the model does well at avoiding false positives, it misses more true positives (i.e. predicts more false negatives) than we'd like in this scenario.  Indeed, recall is low for quite a few categories: the mean recall across all categories is 0.2057 vs. 0.5944 for precision.  This isn't a surprise given the imbalance of the dataset, and it is something to improve on.
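
To make the link between false negatives and low recall concrete, the sketch below computes precision and recall from confusion counts. The counts are hypothetical: they are not printed anywhere in this notebook, and were chosen only to be consistent with the 'request' report above (1359 positive messages, precision 0.8293, recall 0.5077).

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive
    and false negative counts, using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts consistent with the 'request' report (tp + fn = 1359)
tp, fp, fn = 690, 142, 669

precision, recall = precision_recall(tp, fp, fn)
print(round(precision, 4), round(recall, 4))
```

Under these assumed counts, few false positives (142) keep precision high, while the many false negatives (669) pull recall down to roughly one half.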

Accuracy is therefore not the measure we should seek to maximise here. Recall and the F-beta score, which is a weighted harmonic mean of precision and recall, are more informative.  Recall is likely to be more important in this scenario, hence beta should be greater than 1 to give more weight to recall.  The mean F2 score for the model across the categories is 0.2205, reflecting a recall that we'd like to improve upon.  
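
The effect of beta can be seen by evaluating the standard definition `F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)` at several values of beta. The sketch below is a minimal illustration using the rounded class-1 precision and recall from the 'request' report above; the helper name `fbeta` is hypothetical.

```python
def fbeta(precision, recall, beta):
    """F-beta score from precision and recall; 0.0 if both are zero."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded class-1 values from the 'request' classification report
precision, recall = 0.8293, 0.5077

for beta in (0.5, 1, 2):
    print(f'beta={beta}: {fbeta(precision, recall, beta):.4f}')
```

Because recall is lower than precision for this category, the score falls as beta grows: the F1 value is approximately 0.6298 and the F2 value approximately 0.5504, in line with the report.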

Based on this, the F2 score will be the score I work to maximise in future runs of this model.

**<font color='red'>I want prediction accuracy to improve across ALL categories, rather than a subset.  For this reason, when evaluating the model, I'll focus on the mean metrics across ALL categories, rather than the metrics for each individual category.  These mean metrics are displayed after the classification reports for each category.  I'll also define a custom scorer that measures the mean F2 score across all categories.</font>**

In [15]:
# Manual cross-check of evaluation metrics in classification report above
# Cross check metrics for the 'direct_report' category

# Cast predictions and test data as DataFrames to facilitate manual cross check of metrics
preds_df = pd.DataFrame(preds, columns=category_names)
y_test_df = pd.DataFrame(y_test, columns=category_names)

category='direct_report'
tp = ((y_test_df[category]==1) & (preds_df[category]==1)).sum()
tn = ((y_test_df[category]==0) & (preds_df[category]==0)).sum()
fp = ((y_test_df[category]==0) & (preds_df[category]==1)).sum()
fn = ((y_test_df[category]==1) & (preds_df[category]==0)).sum()
print('Class 1:')
print(f'Precision: {round(tp/(tp+fp),4)}')
print(f'Recall: {round(tp/(tp+fn),4)}')
print(f'Accuracy: {round((tp+tn)/(tp+tn+fp+fn),4)}')
print()

tp_0 = ((y_test_df[category]==0) & (preds_df[category]==0)).sum()
tn_0 = ((y_test_df[category]==1) & (preds_df[category]==1)).sum()
fp_0 = ((y_test_df[category]==1) & (preds_df[category]==0)).sum()
fn_0 = ((y_test_df[category]==0) & (preds_df[category]==1)).sum()
print('Class 0:')
print(f'Precision: {round(tp_0/(tp_0+fp_0),4)}')
print(f'Recall: {round(tp_0/(tp_0+fn_0),4)}')
print(f'Accuracy: {round((tp_0+tn_0)/(tp_0+tn_0+fp_0+fn_0),4)}')

Class 1:
Precision: 0.7871
Recall: 0.368
Accuracy: 0.858

Class 0:
Precision: 0.8651
Recall: 0.976
Accuracy: 0.858


## Improve your model
Use grid search to find better parameters. 

In [16]:
# View pipeline parameters available for tuning
pipeline.get_params()

{'memory': None,
 'steps': [('tfidf',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.float64'>, encoding='utf-8',
                   input='content', lowercase=True, max_df=1.0, max_features=None,
                   min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                   smooth_idf=True, stop_words=None, strip_accents=None,
                   sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=<function tokenize at 0x00000280D723D288>,
                   use_idf=True, vocabulary=None)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                          ccp_alpha=0.0,
                                                          class_weight=None,
                                                          criterion='gini',
                                                          max_depth=None,
    

### Define custom scorer function for mean F2 value

As explained above, reflecting the importance of recall in this scenario, I'm defining a custom scoring method that returns the mean F2 score across ALL categories.  This is the score I will seek to maximise in my grid searches below.

In [14]:
# Define custom scoring function for multiple outputs
def calculate_multioutput_f2(y_test, preds, beta=2):
    """ Custom scoring function to calculate and return the mean binary
    F2 score across all categories
    
    Args:
    y_test: np.array.  Array of true target (label) test data.
    preds: np.array.  Array of predicted target (label) test data.
    beta: float.  Default=2.  Weight given to recall relative to precision.
    
    Returns:
    float.  Mean F2 score across all categories.
    """
    
    # Create a list of F2 scores across all categories
    score_list = []
    for i in range(y_test.shape[1]):
        score_list.append(fbeta_score(y_test[:,i], preds[:,i], \
                                      beta=beta, average='binary', zero_division=0))
    
    # Return mean of F2 scores across all categories
    return np.mean(score_list)
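
As a dependency-free cross-check of what `calculate_multioutput_f2` computes, the sketch below re-implements the per-category F-beta score from confusion counts and averages it across columns. The function names `fbeta_binary` and `mean_fbeta` are hypothetical, the toy labels are illustrative only, and returning 0 when the denominator is zero is intended to mirror `zero_division=0`.

```python
def fbeta_binary(y_true, y_pred, beta=2.0):
    """Binary F-beta score for the positive class, computed from counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def mean_fbeta(y_true_rows, y_pred_rows, beta=2.0):
    """Mean binary F-beta score across the columns (categories)."""
    true_cols = zip(*y_true_rows)  # transpose rows into per-category columns
    pred_cols = zip(*y_pred_rows)
    scores = [fbeta_binary(t, p, beta) for t, p in zip(true_cols, pred_cols)]
    return sum(scores) / len(scores)


# Toy data: 4 messages, 2 categories (hypothetical values)
y_true_rows = [[1, 0], [1, 0], [0, 1], [1, 0]]
y_pred_rows = [[1, 0], [0, 0], [0, 0], [1, 0]]

# Category 0 scores 10/14; category 1 is never predicted, so it scores 0
print(round(mean_fbeta(y_true_rows, y_pred_rows), 4))
```

A category that the model never predicts contributes a score of 0 and pulls the mean down, which is the behaviour wanted when improvement is sought across all categories.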

In [15]:
# Assign custom scorer as the 'calculate_multioutput_f2' function
scorer = make_scorer(calculate_multioutput_f2, greater_is_better=True)

### Grid search with RandomForestClassifier (part 1)

In [22]:
# Define pipeline to fit and transform data with tfidf step and predict message categories
# using a RandomForestClassifier
pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                     ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=55)))],
                   verbose=True)

# Define a grid of workflow hyperparameters for Gridsearch
param_grid = {'tfidf__ngram_range': [(1, 1), (1,2)], 
              'tfidf__max_df': [0.7, 1.0], 
              'clf__estimator__n_estimators':[50, 100],
              'clf__estimator__max_depth':[10, None],
             }

# Define GridSearchCV object with custom scorer
gs = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring=scorer, \
                  cv=KFold(shuffle=True, random_state=55))

In [23]:
# Fit GridSearchCV object to training data
gs.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed: 85.6min finished


[Pipeline] ............. (step 1 of 2) Processing tfidf, total=  17.5s
[Pipeline] ............... (step 2 of 2) Processing clf, total= 4.9min


GridSearchCV(cv=KFold(n_splits=5, random_state=55, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1

In [24]:
# Display best mean F2 score
gs.best_score_

0.21625175644527914

In [26]:
# Display classification report for each category, along with mean metrics across 
# all categories
display_class_report(y_test, gs.best_estimator_.predict(X_test), category_names)

Category: related.  Baseline accuracy: 0.7644.  F2 score: 0.9278.
              precision    recall  f1-score   support

           0     0.7266    0.4015    0.5172      1853
           1     0.8379    0.9534    0.8919      6012

    accuracy                         0.8234      7865
   macro avg     0.7822    0.6775    0.7046      7865
weighted avg     0.8117    0.8234    0.8036      7865

----------------------------------------------------------------------------------------------------

Category: request.  Baseline accuracy: 0.8272.  F2 score: 0.5504.
              precision    recall  f1-score   support

           0     0.9049    0.9782    0.9401      6506
           1     0.8293    0.5077    0.6298      1359

    accuracy                         0.8969      7865
   macro avg     0.8671    0.7430    0.7850      7865
weighted avg     0.8918    0.8969    0.8865      7865

----------------------------------------------------------------------------------------------------

Category: 

              precision    recall  f1-score   support

           0     0.9945    1.0000    0.9973      7822
           1     0.0000    0.0000    0.0000        43

    accuracy                         0.9945      7865
   macro avg     0.4973    0.5000    0.4986      7865
weighted avg     0.9891    0.9945    0.9918      7865

----------------------------------------------------------------------------------------------------

Category: hospitals.  Baseline accuracy: 0.9906.  F2 score: 0.0.
              precision    recall  f1-score   support

           0     0.9906    1.0000    0.9953      7791
           1     0.0000    0.0000    0.0000        74

    accuracy                         0.9906      7865
   macro avg     0.4953    0.5000    0.4976      7865
weighted avg     0.9813    0.9906    0.9859      7865

----------------------------------------------------------------------------------------------------

Category: shops.  Baseline accuracy: 0.9963.  F2 score: 0.0.
              pr

#### Observations
The mean F2 score hasn't improved. I'll try a different set of hyperparameters, focussed on reducing the number of trees in each RandomForest ensemble.
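
The cost of a grid search grows multiplicatively with the number of values per hyperparameter. The sketch below uses only the standard library to enumerate the part 1 grid and count the fits, assuming 5 folds as shown in the `GridSearchCV` output above; it is an illustration of the arithmetic, not of how scikit-learn schedules the work.

```python
from itertools import product

# The hyperparameter grid from part 1
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.7, 1.0],
    'clf__estimator__n_estimators': [50, 100],
    'clf__estimator__max_depth': [10, None],
}
n_folds = 5  # assumed from KFold(n_splits=5, ...) in the output above

# Every combination of one value per hyperparameter is a candidate
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 16 80
```

Adding a third value to any one hyperparameter would raise the count to 24 candidates and 120 fits, so keeping each list short matters when a single fit takes minutes. scikit-learn's `ParameterGrid` provides the same enumeration.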


### Grid search with RandomForestClassifier (part 2)

In [27]:
# Define a different grid of workflow hyperparameters for grid search
param_grid = {'tfidf__ngram_range': [(1, 1), (1,2)], 
              'tfidf__max_df': [0.7, 1.0], 
              'clf__estimator__n_estimators':[10, 20],
              'clf__estimator__max_depth':[10, None],
             }

# Define GridSearchCV object with custom scorer and fit to training data
gs1 = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring=scorer, \
                  cv=KFold(shuffle=True, random_state=55))

gs1.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed: 22.8min finished


[Pipeline] ............. (step 1 of 2) Processing tfidf, total=  16.8s
[Pipeline] ............... (step 2 of 2) Processing clf, total=  58.5s


GridSearchCV(cv=KFold(n_splits=5, random_state=55, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1

In [28]:
# Display best mean F2 score 
gs1.best_score_

0.21249503424837504

In [29]:
# Display classification report for each category, along with mean metrics across 
# all categories
display_class_report(y_test, gs1.best_estimator_.predict(X_test), category_names)

Category: related.  Baseline accuracy: 0.7644.  F2 score: 0.9167.
              precision    recall  f1-score   support

           0     0.6797    0.4317    0.5281      1853
           1     0.8426    0.9373    0.8874      6012

    accuracy                         0.8182      7865
   macro avg     0.7611    0.6845    0.7077      7865
weighted avg     0.8042    0.8182    0.8027      7865

----------------------------------------------------------------------------------------------------

Category: request.  Baseline accuracy: 0.8272.  F2 score: 0.5012.
              precision    recall  f1-score   support

           0     0.8962    0.9816    0.9369      6506
           1     0.8376    0.4555    0.5901      1359

    accuracy                         0.8907      7865
   macro avg     0.8669    0.7185    0.7635      7865
weighted avg     0.8860    0.8907    0.8770      7865

----------------------------------------------------------------------------------------------------

Category: 

              precision    recall  f1-score   support

           0     0.9817    0.9992    0.9904      7710
           1     0.6471    0.0710    0.1279       155

    accuracy                         0.9809      7865
   macro avg     0.8144    0.5351    0.5591      7865
weighted avg     0.9751    0.9809    0.9734      7865

----------------------------------------------------------------------------------------------------

Category: tools.  Baseline accuracy: 0.9945.  F2 score: 0.0.
              precision    recall  f1-score   support

           0     0.9945    1.0000    0.9973      7822
           1     0.0000    0.0000    0.0000        43

    accuracy                         0.9945      7865
   macro avg     0.4973    0.5000    0.4986      7865
weighted avg     0.9891    0.9945    0.9918      7865

----------------------------------------------------------------------------------------------------

Category: hospitals.  Baseline accuracy: 0.9906.  F2 score: 0.0.
              pr

#### Observations
Again, the mean F2 score hasn't improved.  I'm going to try a different model.
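
The per-category reports above also show why accuracy alone says little for the rare categories. A minimal sketch, assuming the "Baseline accuracy" printed by `display_class_report` (defined earlier in the notebook) is the share of the majority class; the support counts in the reports are consistent with that reading. The helper name `baseline_accuracy` is hypothetical.

```python
def baseline_accuracy(n_negative, n_positive):
    """Accuracy obtained by always predicting the more frequent class."""
    total = n_negative + n_positive
    return max(n_negative, n_positive) / total


# Support counts (class 0, class 1) copied from the classification reports
# above; each category has 7,865 test messages in total.
support = {
    "related": (1853, 6012),
    "request": (6506, 1359),
    "tools": (7822, 43),
}

for category, (n_neg, n_pos) in support.items():
    score = baseline_accuracy(n_neg, n_pos)
    print(f"{category}: baseline accuracy = {score:.4f}")
```

For `tools`, the model predicts class 0 for every message, so its accuracy equals the baseline while its recall and F2 score are both 0. This is the reason the F2 score, rather than accuracy, is used to compare models.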

### Train a LinearSVC model

In [30]:
# Define pipeline to fit and transform data with tfidf step and predict message categories 
# using a Support Vector Classifier with a linear kernel
pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                     ('clf', MultiOutputClassifier(LinearSVC(random_state=55, dual=False)))])

# Fit pipeline to training data
pipeline = pipeline.fit(X_train, y_train)

# Display classification report for each category, along with mean metrics across 
# all categories
display_class_report(y_test, pipeline.predict(X_test), category_names)

Category: related.  Baseline accuracy: 0.7644.  F2 score: 0.8983.
              precision    recall  f1-score   support

           0     0.6383    0.5267    0.5772      1853
           1     0.8616    0.9080    0.8842      6012

    accuracy                         0.8182      7865
   macro avg     0.7500    0.7174    0.7307      7865
weighted avg     0.8090    0.8182    0.8119      7865

----------------------------------------------------------------------------------------------------

Category: request.  Baseline accuracy: 0.8272.  F2 score: 0.6269.
              precision    recall  f1-score   support

           0     0.9201    0.9607    0.9399      6506
           1     0.7612    0.6004    0.6713      1359

    accuracy                         0.8984      7865
   macro avg     0.8406    0.7805    0.8056      7865
weighted avg     0.8926    0.8984    0.8935      7865

----------------------------------------------------------------------------------------------------

Category: 

Category: transport.  Baseline accuracy: 0.9538.  F2 score: 0.2925.
              precision    recall  f1-score   support

           0     0.9651    0.9940    0.9793      7502
           1     0.6739    0.2562    0.3713       363

    accuracy                         0.9599      7865
   macro avg     0.8195    0.6251    0.6753      7865
weighted avg     0.9516    0.9599    0.9513      7865

----------------------------------------------------------------------------------------------------

Category: buildings.  Baseline accuracy: 0.9485.  F2 score: 0.4039.
              precision    recall  f1-score   support

           0     0.9666    0.9883    0.9773      7460
           1     0.6329    0.3704    0.4673       405

    accuracy                         0.9565      7865
   macro avg     0.7997    0.6794    0.7223      7865
weighted avg     0.9494    0.9565    0.9511      7865

----------------------------------------------------------------------------------------------------

Catego

#### Observations

Using the LinearSVC model has greatly increased the mean binary F2 score, from the previous highest value of 0.2205 to 0.3557.  Recall has also increased from a previous maximum value of 0.2057 to 0.3322, without a loss in mean binary precision.  
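
For reference, the per-category F2 values in the reports can be reproduced from the class-1 precision and recall. The sketch below uses the standard F-beta formula; the function name `f_beta` is hypothetical and is not the notebook's own scorer.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score from precision and recall; beta > 1 weights recall more."""
    if precision == 0 and recall == 0:
        return 0.0  # convention used when the score is otherwise undefined
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Class-1 precision and recall from the LinearSVC reports above
print(round(f_beta(0.8616, 0.9080), 4))            # F2 for 'related'
print(round(f_beta(0.7612, 0.6004), 4))            # F2 for 'request'
print(round(f_beta(0.7612, 0.6004, beta=1.0), 4))  # F1 for 'request'
```

With `beta=2`, recall counts four times as much as precision in the weighted harmonic mean, which is why a model with lower precision but higher recall can still score better.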

Let's see if we can improve on this by grid searching over a few different values of the regularisation parameter.

### Grid search with LinearSVC (part 1)

In [31]:
# Define a pipeline to fit and transform data with tfidf step and predict message categories 
# using a Support Vector Classifier with linear kernel
pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                     ('clf', MultiOutputClassifier(LinearSVC(random_state=55, dual=False)))])

# Define a grid of hyperparameters, searching 3 different values for the regularisation parameter
param_grid = {'clf__estimator__C':[0.5, 1.0, 2.0]}

# Define GridSearchCV object with custom scorer 
gs2 = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring=scorer,
                   cv=KFold(shuffle=True, random_state=55))

# Fit GridSearchCV object to training data
gs2.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of  15 | elapsed:  1.0min remaining:   55.0s
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  1.8min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=55, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1

In [33]:
# Display best mean F2 score
gs2.best_score_

0.36134177137912815

In [34]:
# Generate predictions for best estimator identified in grid search and display 
# classification report for each category, along with mean metrics across 
# all categories
display_class_report(y_test, gs2.best_estimator_.predict(X_test), category_names)

Category: related.  Baseline accuracy: 0.7644.  F2 score: 0.8886.
              precision    recall  f1-score   support

           0     0.6144    0.5434    0.5767      1853
           1     0.8641    0.8949    0.8792      6012

    accuracy                         0.8121      7865
   macro avg     0.7393    0.7192    0.7280      7865
weighted avg     0.8053    0.8121    0.8080      7865

----------------------------------------------------------------------------------------------------

Category: request.  Baseline accuracy: 0.8272.  F2 score: 0.6244.
              precision    recall  f1-score   support

           0     0.9201    0.9502    0.9349      6506
           1     0.7173    0.6049    0.6563      1359

    accuracy                         0.8905      7865
   macro avg     0.8187    0.7775    0.7956      7865
weighted avg     0.8850    0.8905    0.8868      7865

----------------------------------------------------------------------------------------------------

Category: 

Category: buildings.  Baseline accuracy: 0.9485.  F2 score: 0.4244.
              precision    recall  f1-score   support

           0     0.9678    0.9859    0.9768      7460
           1     0.6038    0.3951    0.4776       405

    accuracy                         0.9555      7865
   macro avg     0.7858    0.6905    0.7272      7865
weighted avg     0.9490    0.9555    0.9511      7865

----------------------------------------------------------------------------------------------------

Category: electricity.  Baseline accuracy: 0.9803.  F2 score: 0.2493.
              precision    recall  f1-score   support

           0     0.9845    0.9964    0.9904      7710
           1     0.5484    0.2194    0.3134       155

    accuracy                         0.9811      7865
   macro avg     0.7664    0.6079    0.6519      7865
weighted avg     0.9759    0.9811    0.9771      7865

----------------------------------------------------------------------------------------------------

Cate

In [35]:
# Display parameters of best estimator identified in grid search above
gs2.best_estimator_

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at 0x00000280D723D288>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 MultiOutputClassifier(estimator=LinearSVC(C=2.0,
                                              

#### Observations

The mean binary F2 and recall scores have improved slightly with this grid search, at the expense of a slight decrease in precision and overall accuracy.

The best-performing LinearSVC model (selected by maximising the mean cross-validated F2 score on the training data) uses C=2.0, the largest value in the grid.  In LinearSVC a larger C means weaker regularisation, i.e. a heavier penalty on misclassified training samples, and this appears to have improved the model.  Based on this, I'm going to grid search over a wider range of C values for the LinearSVC model.
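
A wider range of regularisation values is usually sampled on a log scale, so that each order of magnitude gets similar coverage. The sketch below is a pure-Python stand-in for `np.logspace(-5, 5, 10)`, the call used in the next cell; the helper name `log_spaced` is hypothetical.

```python
def log_spaced(start_exp, stop_exp, num):
    """Return `num` (>= 2) values evenly spaced on a log10 scale, endpoints included."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]


# Ten candidate values of C between 1e-5 and 1e5
c_grid = log_spaced(-5, 5, 10)
for c in c_grid:
    print(f"{c:.4g}")
```

Neighbouring candidates differ by a factor of roughly 13, so this grid is coarse around the previous optimum of C=2.0. Its closest values are about 0.28 and 3.59.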

### Grid Search with LinearSVC (part 2)

In [36]:
# Define pipeline to fit and transform data with tfidf step and predict message categories 
# using a Support Vector Classifier with a linear kernel
pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                     ('clf', MultiOutputClassifier(LinearSVC(random_state=55)))])

# Define a grid of hyperparameters, searching a greater range of values for the regularisation parameter
param_grid = {'clf__estimator__C':np.logspace(-5,5,10)}

# Define GridSearchCV object with custom scorer
gs3 = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring=scorer,
                   cv=KFold(shuffle=True, random_state=55))

In [37]:
# Fit GridSearchCV object to training data
gs3.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  6.9min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=55, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1

In [38]:
# Display best mean F2 score
gs3.best_score_

0.3628972041188219

In [39]:
# Generate predictions for best estimator identified in grid search and display 
# classification report for each category, along with mean metrics across 
# all categories
display_class_report(y_test, gs3.best_estimator_.predict(X_test), category_names)

Category: related.  Baseline accuracy: 0.7644.  F2 score: 0.879.
              precision    recall  f1-score   support

           0     0.5888    0.5424    0.5646      1853
           1     0.8623    0.8832    0.8726      6012

    accuracy                         0.8029      7865
   macro avg     0.7255    0.7128    0.7186      7865
weighted avg     0.7978    0.8029    0.8001      7865

----------------------------------------------------------------------------------------------------

Category: request.  Baseline accuracy: 0.8272.  F2 score: 0.6152.
              precision    recall  f1-score   support

           0     0.9185    0.9427    0.9304      6506
           1     0.6860    0.5997    0.6400      1359

    accuracy                         0.8834      7865
   macro avg     0.8023    0.7712    0.7852      7865
weighted avg     0.8784    0.8834    0.8802      7865

----------------------------------------------------------------------------------------------------

Category: o


----------------------------------------------------------------------------------------------------

Category: infrastructure_related.  Baseline accuracy: 0.9358.  F2 score: 0.152.
              precision    recall  f1-score   support

           0     0.9426    0.9711    0.9566      7360
           1     0.2473    0.1386    0.1777       505

    accuracy                         0.9176      7865
   macro avg     0.5950    0.5548    0.5671      7865
weighted avg     0.8980    0.9176    0.9066      7865

----------------------------------------------------------------------------------------------------

Category: transport.  Baseline accuracy: 0.9538.  F2 score: 0.3131.
              precision    recall  f1-score   support

           0     0.9662    0.9840    0.9750      7502
           1     0.4667    0.2893    0.3571       363

    accuracy                         0.9519      7865
   macro avg     0.7164    0.6366    0.6661      7865
weighted avg     0.9432    0.9519    0.9465     

#### Observations
The best cross-validated mean F2 score has only moved from 0.3613 to 0.3629, so there doesn't appear to be further improvement in the model from grid searching the regularisation parameter of the LinearSVC model alone.  I'll now grid search the entire pipeline workflow to see if I can get further improvement.
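
Searching the vectoriser and classifier settings together multiplies the number of candidates, and with it the run time. As a rough sketch, the count for a grid shaped like the one in the workflow search below (two values for each of three tf-idf settings and 11 values of C, with 5-fold cross-validation) can be worked out as follows. The C values here are placeholders, since only their number matters for the count.

```python
from itertools import product

# Same shape as the workflow grid; the C entries are placeholders
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_df": [0.7, 1.0],
    "tfidf__use_idf": [True, False],
    "clf__estimator__C": list(range(11)),
}
n_folds = 5

# Every combination of one value per parameter is one candidate
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * n_folds
print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
```

This gives 88 candidates and 440 fits, compared with 50 fits for the search over C alone.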

In [40]:
# Display parameters of best estimator identified in grid search
gs3.best_estimator_

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...,
                                 tokenizer=<function tokenize at 0x00000280D723D288>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 MultiOutputClassifier(estimator=LinearSVC(C=3.593813663804626,
                                             

### Grid Search across Pipeline workflow (with LinearSVC)

In [44]:
# Define pipeline to fit and transform data with tfidf step and predict message categories 
# using a Support Vector Classifier with a linear kernel
pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                     ('clf', MultiOutputClassifier(LinearSVC(random_state=55, dual=False)))])

# Define a grid of hyperparameters for the workflow
param_grid = {'tfidf__ngram_range': [(1, 1), (1,2)], 
              'tfidf__max_df': [0.7, 1.0], 
              'tfidf__use_idf': [True, False], 
              'clf__estimator__C':np.logspace(-2, 5, 11),
             }

# Define GridSearchCV object with custom scorer
gs4 = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring=scorer,
                   cv=KFold(shuffle=True, random_state=55))

In [45]:
# Fit GridSearchCV object to training data
gs4.fit(X_train, y_train)

Fitting 5 folds for each of 88 candidates, totalling 440 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 27.7min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 99.6min
[Parallel(n_jobs=-1)]: Done 440 out of 440 | elapsed: 126.9min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=55, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1

In [46]:
# Display best mean F2 score
gs4.best_score_

0.40617477550295095

In [47]:
# Display parameters of best estimator identified in grid search
gs4.best_estimator_

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.7, max_features=None,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...
                                 tokenizer=<function tokenize at 0x00000280D723D288>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 MultiOutputClassifier(estimator=LinearSVC(C=794.3282347242805,
                                              

In [48]:
# Generate predictions using best estimator identified in grid search
preds = gs4.best_estimator_.predict(X_test)

In [49]:
# Display classification report for each category, along with mean metrics across 
# all categories for best estimator identified in grid search
display_class_report(y_test, preds, category_names)

Category: related.  Baseline accuracy: 0.7644.  F2 score: 0.9003.
              precision    recall  f1-score   support

           0     0.6321    0.4830    0.5476      1853
           1     0.8514    0.9133    0.8813      6012

    accuracy                         0.8120      7865
   macro avg     0.7418    0.6982    0.7144      7865
weighted avg     0.7998    0.8120    0.8027      7865

----------------------------------------------------------------------------------------------------

Category: request.  Baseline accuracy: 0.8272.  F2 score: 0.6557.
              precision    recall  f1-score   support

           0     0.9276    0.9353    0.9314      6506
           1     0.6774    0.6505    0.6637      1359

    accuracy                         0.8861      7865
   macro avg     0.8025    0.7929    0.7975      7865
weighted avg     0.8844    0.8861    0.8852      7865

----------------------------------------------------------------------------------------------------

Category: 

Category: transport.  Baseline accuracy: 0.9538.  F2 score: 0.3509.
              precision    recall  f1-score   support

           0     0.9678    0.9869    0.9773      7502
           1     0.5442    0.3223    0.4048       363

    accuracy                         0.9563      7865
   macro avg     0.7560    0.6546    0.6911      7865
weighted avg     0.9483    0.9563    0.9509      7865

----------------------------------------------------------------------------------------------------

Category: buildings.  Baseline accuracy: 0.9485.  F2 score: 0.501.
              precision    recall  f1-score   support

           0     0.9722    0.9812    0.9767      7460
           1     0.5833    0.4840    0.5290       405

    accuracy                         0.9556      7865
   macro avg     0.7778    0.7326    0.7529      7865
weighted avg     0.9522    0.9556    0.9537      7865

----------------------------------------------------------------------------------------------------

Categor

In [50]:
# Display mean metrics for best estimator identified in grid search
print_metrics(return_model_metrics(y_test, preds, category_names))

Baseline Accuracy: 0.9242
Model Accuracy: 0.9432
Model binary F1 score: 0.4415
Model binary F2 score: 0.4143
Model weighted F1 score: 0.94
Model weighted F2 score: 0.9418
Model binary Precision: 0.5401
Model binary Recall: 0.4003


#### Observations
Grid searching the entire workflow has increased the mean binary F2 score from a previous maximum value of 0.3667 to 0.4143.  Mean binary recall has also increased from a previous maximum of 0.3443 to 0.4003.  This has come at the expense of a small decrease in binary precision but the overall mean model accuracy remains above 94%.  

This has almost doubled both the mean binary recall and F2 score of the initial RandomForestClassifier model, and I'm happy with this model at this stage.
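
To make the headline metric concrete, here is a hypothetical pure-Python reimplementation of a "mean binary F2 score": the F2 score of the positive class is computed for each category column and then averaged. The notebook's own `scorer` and `return_model_metrics` are defined earlier and may differ in detail, so treat this as an illustration under that assumption.

```python
def binary_f_beta(y_true, y_pred, beta=2.0):
    """F-beta of the positive class (label 1) for one category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # Score is 0 when there are no true or predicted positives
    return (1 + b2) * tp / denom if denom else 0.0


def mean_binary_f_beta(Y_true, Y_pred, beta=2.0):
    """Average the per-category score; rows are messages, columns are categories."""
    n_categories = len(Y_true[0])
    scores = []
    for j in range(n_categories):
        col_true = [row[j] for row in Y_true]
        col_pred = [row[j] for row in Y_pred]
        scores.append(binary_f_beta(col_true, col_pred, beta))
    return sum(scores) / n_categories


# Toy data: 4 messages, 2 categories (values are made up for illustration)
Y_true = [[1, 0], [1, 1], [0, 0], [1, 0]]
Y_pred = [[1, 0], [0, 0], [0, 0], [1, 1]]
print(round(mean_binary_f_beta(Y_true, Y_pred), 4))
```

In the toy data the first category scores 10/14 and the second scores 0, so the mean is about 0.357. A category the model never predicts correctly pulls the mean down just as much as a frequent one, which is why the mean binary F2 sits far below the weighted scores.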

### Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [105]:
# Generate predictions using best estimator found in grid search above
preds = gs4.best_estimator_.predict(X_test)

In [111]:
# Display mean metrics for best estimator identified in grid search
print_metrics(return_model_metrics(y_test, preds, category_names))

Baseline Accuracy: 0.9242
Model Accuracy: 0.9432
Model binary F1 score: 0.4415
Model binary F2 score: 0.4143
Model weighted F1 score: 0.94
Model weighted F2 score: 0.9418
Model binary Precision: 0.5401
Model binary Recall: 0.4003


In [113]:
# Print metric values for each category
print_metrics(return_model_metrics(y_test, preds, category_names), mean_metrics=False)

Category: related
Baseline Accuracy: 0.7644
Model Accuracy: 0.812
Model binary Recall: 0.9133
Model binary Precision: 0.8514
Model binary F1 score: 0.8813
Model binary F2 score: 0.9003
Model weighted F1 score: 0.8027
Model weighted F2 score: 0.8076
----------------------------------------------------------------------------------------------------
Category: request
Baseline Accuracy: 0.8272
Model Accuracy: 0.8861
Model binary Recall: 0.6505
Model binary Precision: 0.6774
Model binary F1 score: 0.6637
Model binary F2 score: 0.6557
Model weighted F1 score: 0.8852
Model weighted F2 score: 0.8857
----------------------------------------------------------------------------------------------------
Category: offer
Baseline Accuracy: 0.9955
Model Accuracy: 0.9955
Model binary Recall: 0.0286
Model binary Precision: 0.5
Model binary F1 score: 0.0541
Model binary F2 score: 0.0352
Model weighted F1 score: 0.9936
Model weighted F2 score: 0.9947
------------------------------------------------------

In [53]:
# Display classification report for each category, along with mean metrics across 
# all categories for best estimator identified in grid search
display_class_report(y_test, preds, category_names)

Category: related.  Baseline accuracy: 0.7644.  F2 score: 0.9003.
              precision    recall  f1-score   support

           0     0.6321    0.4830    0.5476      1853
           1     0.8514    0.9133    0.8813      6012

    accuracy                         0.8120      7865
   macro avg     0.7418    0.6982    0.7144      7865
weighted avg     0.7998    0.8120    0.8027      7865

----------------------------------------------------------------------------------------------------

Category: request.  Baseline accuracy: 0.8272.  F2 score: 0.6557.
              precision    recall  f1-score   support

           0     0.9276    0.9353    0.9314      6506
           1     0.6774    0.6505    0.6637      1359

    accuracy                         0.8861      7865
   macro avg     0.8025    0.7929    0.7975      7865
weighted avg     0.8844    0.8861    0.8852      7865

----------------------------------------------------------------------------------------------------

Category: 

              precision    recall  f1-score   support

           0     0.9044    0.9242    0.9142      6845
           1     0.4034    0.3441    0.3714      1020

    accuracy                         0.8490      7865
   macro avg     0.6539    0.6341    0.6428      7865
weighted avg     0.8394    0.8490    0.8438      7865

----------------------------------------------------------------------------------------------------

Category: infrastructure_related.  Baseline accuracy: 0.9358.  F2 score: 0.1828.
              precision    recall  f1-score   support

           0     0.9446    0.9773    0.9607      7360
           1     0.3320    0.1644    0.2199       505

    accuracy                         0.9251      7865
   macro avg     0.6383    0.5708    0.5903      7865
weighted avg     0.9053    0.9251    0.9131      7865

----------------------------------------------------------------------------------------------------

Category: transport.  Baseline accuracy: 0.9538.  F2 score: 0

In [122]:
# Test model with a message
msg_test = "A massive fire has broken out after the storm. Homes are destroyed and people have been \
            left homeless.  We need doctors and clothing."

test_pred = gs4.best_estimator_.predict([msg_test])
print(y_test_df.columns[(test_pred==1).flatten()])

Index(['related', 'request', 'aid_related', 'medical_help', 'shelter',
       'clothing', 'buildings', 'weather_related', 'storm', 'fire',
       'direct_report'],
      dtype='object')


#### Observations
The mean binary F2 score across all categories has increased from a starting point of 0.2205 with the RandomForestClassifier model to 0.4143 with the LinearSVC model, and the binary recall has almost doubled from 0.2057 to 0.4003.  While precision has decreased slightly (due to predicting more false positives), the mean accuracy across all categories has remained above 94%.

Recall has increased greatly for many categories, e.g. medical_help (from 0.0972 to 0.3953), military (from 0.0435 to 0.4743), refugees (from 0.0149 to 0.3309), clothing (from 0.1066 to 0.4180) and shelter (from 0.3994 to 0.6709), which I'm happy with.

The model also returns 'sensible' results for the test message above.
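
The prediction for a single message is a row of 0/1 flags, one per category, and the names are recovered by keeping the columns flagged 1. A minimal sketch of that decoding step without pandas; the function name `predicted_categories` and the example row are hypothetical.

```python
def predicted_categories(prediction_row, category_names):
    """Return the names of the categories flagged with 1 in a prediction row."""
    return [name for name, flag in zip(category_names, prediction_row) if flag == 1]


# First four category columns of the dataset, with a made-up prediction row
names = ["related", "request", "offer", "aid_related"]
row = [1, 1, 0, 1]
print(predicted_categories(row, names))
```

This prints `['related', 'request', 'aid_related']`, the same kind of result as the boolean-mask indexing used in the cell above.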



## Try improving your model further

Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### Grid Search with AdaBoostClassifier

In [54]:
# Define a pipeline to fit and transform data with tfidf step and predict message categories 
# using an AdaBoost classifier
pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                     ('clf', MultiOutputClassifier(AdaBoostClassifier(random_state=55)))])

# Define a grid of hyperparameters for the workflow
param_grid = {'tfidf__ngram_range': [(1, 1), (1,2)], 
              'tfidf__max_df': [0.7, 1.0], 
              'tfidf__use_idf': [True, False], 
              'clf__estimator__n_estimators':[10, 50, 100],
              'clf__estimator__learning_rate':[0.1, 1.0, 2.0]
             }

# Define GridSearchCV object with custom scorer
gs5 = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring=scorer,
                   cv=KFold(shuffle=True, random_state=55))

In [55]:
# Fit GridSearchCV object to training data
gs5 = gs5.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 72.2min
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed: 204.3min finished


In [56]:
# Display best mean F2 score
gs5.best_score_

0.38889594094468577

In [57]:
# Generate predictions using best estimator identified in grid search
preds = gs5.best_estimator_.predict(X_test)

In [58]:
# Display mean metrics for best estimator identified in grid search
print_metrics(return_model_metrics(y_test, preds, category_names))

Baseline Accuracy: 0.9242
Model Accuracy: 0.944
Model binary F1 score: 0.4276
Model binary F2 score: 0.3959
Model weighted F1 score: 0.9391
Model weighted F2 score: 0.9418
Model binary Precision: 0.5081
Model binary Recall: 0.3787


#### Observations
This hasn't improved the model - compared with the tuned LinearSVC model, the mean binary F2 score (0.3959 vs. 0.4143), recall (0.3787 vs. 0.4003) and precision (0.5081 vs. 0.5401) have all decreased.



### Grid Search with MultinomialNB

In [59]:
# Define a pipeline to fit and transform data with cvec step and predict message categories 
# using a Multinomial Naive Bayes classifier.  Since this model is designed for classification 
# with discrete features such as word counts, use word counts rather than tfidf values

pipeline = Pipeline([('cvec', CountVectorizer(tokenizer=tokenize)),
                     ('clf', MultiOutputClassifier(MultinomialNB()))])

# Define a grid of hyperparameters for the workflow
param_grid = {'cvec__ngram_range': [(1, 1), (1,2)], 
              'cvec__max_df': [0.7, 0.85, 1.0], 
              'cvec__max_features': [None, 10000, 5000], 
              'clf__estimator__fit_prior':[True, False],
              'clf__estimator__alpha':[0.05, 0.1, 1.0, 2.0]
             }

# Define GridSearchCV object with custom scorer
gs6 = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring=scorer,
                   cv=KFold(shuffle=True, random_state=55))

In [60]:
# Fit GridSearchCV object to training data
gs6 = gs6.fit(X_train, y_train)

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 14.3min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 33.9min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed: 61.4min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 69.9min finished


In [61]:
# Display best mean F2 score
gs6.best_score_

0.4257618707530084

In [62]:
# Generate predictions using best estimator identified in grid search
preds = gs6.predict(X_test)

In [63]:
# Display mean metrics for best estimator identified in grid search
print_metrics(return_model_metrics(y_test, preds, category_names))

Baseline Accuracy: 0.9242
Model Accuracy: 0.913
Model binary F1 score: 0.3768
Model binary F2 score: 0.4329
Model weighted F1 score: 0.9208
Model weighted F2 score: 0.9155
Model binary Precision: 0.317
Model binary Recall: 0.487


#### Observations
Compared with the LinearSVC model, the mean binary F2 score has increased from 0.4143 to 0.4329 and the mean binary recall from 0.4003 to 0.487.  At first glance, this increase in recall seems like a great result.  However, it has come at the cost of a large drop in mean binary precision (from 0.5401 to 0.317) and a drop in mean model accuracy from 0.9432 to 0.913, which is below the baseline accuracy of 0.9242.

For these reasons, I'll retain the LinearSVC model tested in Section 1.6.8 above.
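
To make the precision/recall trade-off concrete, here is a minimal sketch of the F-beta formula for a single precision/recall pair. The helper `f_beta` and the two example classifiers are hypothetical. The means reported in this notebook are averages of per-category scores, so plugging the mean precision and recall into this formula will not reproduce them exactly.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score from a single precision/recall pair."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical classifiers with the precision and recall values swapped
high_recall = f_beta(precision=0.3, recall=0.5)
high_precision = f_beta(precision=0.5, recall=0.3)
print(round(high_recall, 4), round(high_precision, 4))
```

With beta = 2 the score rewards recall more heavily than precision, which is why a model can gain F2 while losing a lot of precision.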

### Define a custom transformer

As the analysis below shows, messages assigned to at least 1 category have a mean length of 154.5 characters, compared with 112.7 characters for messages not assigned to any category.  Based on this, I've decided to add a custom transformer to the pipeline that includes the length of the message as an additional feature.

In [123]:
# See if there's a difference in the mean length of messages not relevant 
# to any categories vs. those that are 

# Print average length of message for those not relevant to any categories
print(np.mean([len(msg) for msg in list(df.loc[df.iloc[:,4:].sum(axis=1)==0,'message'])]))

# Print average length of message for those relevant to at least 1 category
print(np.mean([len(msg) for msg in list(df.loc[df.iloc[:,4:].sum(axis=1)>0,'message'])]))

112.72280300555374
154.51983277758424
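
The same comparison can also be written with vectorised pandas operations. The sketch below uses a small hypothetical DataFrame (`toy`) in place of `df`, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df: one text column plus two binary category columns
toy = pd.DataFrame({
    'message': ['need water', 'hello', 'flood in the city centre', 'ok'],
    'water': [1, 0, 0, 0],
    'floods': [0, 0, 1, 0],
})

# True where a message belongs to at least one category
has_category = toy[['water', 'floods']].sum(axis=1) > 0

# Mean message length for each group
mean_lengths = toy['message'].str.len().groupby(has_category).mean()
print(mean_lengths)
```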


In [64]:
# Define a custom transformer for inclusion in the workflow

class msgLengthExtractor(BaseEstimator, TransformerMixin):

    """Custom transformer that returns the length of each text string 
    in an array"""
    
    def return_msg_length(self, text: str) -> int:
        # Return length of text string
        return (len(text))

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Apply return_msg_length function to all values in X and 
        # return values in a DataFrame
        msg_lengths = X.map(self.return_msg_length)
        return pd.DataFrame(msg_lengths)

In [66]:
# Test custom transformer
print(X_train[:10].map(len).values)

# Instantiate custom transformer and transform data
test_transformer = msgLengthExtractor()
print(test_transformer.transform(X_train[:10]))

[ 44 181 236 205 111 123 101 238 132 428]
       message
7433        44
18678      181
13183      236
13947      205
20303      111
23573      123
24021      101
19696      238
10151      132
20192      428
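
The transformer above relies on `X.map`, so it expects a pandas Series. As a sketch of how such a feature combines with tfidf, the hypothetical variant below (`LengthFeature`, not part of the notebook) accepts any iterable of strings and returns a single-column NumPy array, which `FeatureUnion` stacks next to the tfidf columns.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class LengthFeature(BaseEstimator, TransformerMixin):
    """Hypothetical variant of msgLengthExtractor for any iterable of strings."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per message, one column holding its character count
        return np.array([len(text) for text in X], dtype=float).reshape(-1, 1)


messages = ['need water', 'flood in the city centre', 'ok']
union = FeatureUnion([('tfidf', TfidfVectorizer()), ('length', LengthFeature())])
features = union.fit_transform(messages)

# The length feature is appended after the tfidf columns
lengths = features.toarray()[:, -1]
print(features.shape, lengths)
```

The last column holds raw character counts, so it sits on a much larger scale than the tfidf columns.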


#### Grid Search LinearSVC pipeline with custom transformer (part 1)

In [90]:
# Define a pipeline to fit and transform data with tfidf and custom transformer and 
# predict message categories using a Support Vector Classifier with linear kernel
pipeline = Pipeline([    
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
        ('length_ext', msgLengthExtractor())])),
    ('clf', MultiOutputClassifier(LinearSVC(random_state=55, dual=False)))
    ])

# Define a grid of hyperparameters for the workflow
param_grid = {'features__tfidf__ngram_range': [(1,1), (1,2)], 
              'features__tfidf__max_df': [0.7], 
              'features__tfidf__use_idf': [False, True], 
              'clf__estimator__C':np.logspace(-2, 5, 11)[5:8]
             }

# Define GridSearchCV object with custom scorer
gs8 = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring=scorer, \
                  cv=KFold(shuffle=True, random_state=55))
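
For reference, the slice `np.logspace(-2, 5, 11)[5:8]` used for `C` selects three values from a log-spaced grid. The short sketch below prints them; `c_grid` is a name introduced here for illustration.

```python
import numpy as np

c_grid = np.logspace(-2, 5, 11)  # 11 values from 10**-2 to 10**5, exponent step 0.7
print(c_grid[5:8])               # exponents 1.5, 2.2 and 2.9
```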

In [91]:
# Fit GridSearchCV object to training data
gs8.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  8.1min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 20.1min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=55, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('tfidf',
                                                                        TfidfVectorizer(analyzer='word',
                                                                                        binary=False,
                                                                                        decode_error='strict',
                                                                                        dtype=<class 'numpy.float64'>,
                                                                                        encoding='utf-8',
                                                                                        input='content',
            

In [94]:
# Display parameters of best estimator identified in grid search
gs8.best_estimator_

Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('tfidf',
                                                 TfidfVectorizer(analyzer='word',
                                                                 binary=False,
                                                                 decode_error='strict',
                                                                 dtype=<class 'numpy.float64'>,
                                                                 encoding='utf-8',
                                                                 input='content',
                                                                 lowercase=True,
                                                                 max_df=0.7,
                                                                 max_features=None,
                                                                 min_df=1,
                     

In [95]:
# Display best mean F2 score
gs8.best_score_

0.4031301436122952

In [96]:
# Display mean metrics for best estimator identified in grid search
print_metrics(return_model_metrics(y_test, gs8.predict(X_test), category_names))

Baseline Accuracy: 0.9242
Model Accuracy: 0.9463
Model binary F1 score: 0.4442
Model binary F2 score: 0.4104
Model weighted F1 score: 0.9424
Model weighted F2 score: 0.9446
Model binary Precision: 0.5578
Model binary Recall: 0.3928


##### Observations
The metrics are comparable with those achieved with the model tested in section 1.6.8, so I'll be staying with that earlier model.



#### Grid Search LinearSVC pipeline with custom transformer (part 2)

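The length_ext feature (raw character counts) is on a very different scale from the tfidf features (values between 0 and 1), and I'm keen to see how scaling the features impacts performance.  

Since I'm working with a sparse matrix, the options for standardisation are limited: centring the data would make the matrix dense, so StandardScaler has to be used with with_mean=False.  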
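
A minimal sketch of that limitation, using a toy sparse matrix (`X_sparse`, hypothetical data): `StandardScaler` with its default centring rejects sparse input, whereas `MaxAbsScaler` rescales each column by its maximum absolute value and keeps the matrix sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Toy sparse matrix: two tfidf-like columns and one message-length column
X_sparse = sparse.csr_matrix(np.array([[0.0, 0.5, 44.0],
                                       [0.2, 0.0, 181.0],
                                       [0.0, 0.0, 428.0]]))

try:
    StandardScaler().fit(X_sparse)  # with_mean=True is the default
    centering_error = None
except ValueError as err:
    centering_error = str(err)
print(centering_error)

# MaxAbsScaler divides each column by its largest absolute value
scaled = MaxAbsScaler().fit_transform(X_sparse)
print(sparse.issparse(scaled), scaled.toarray().max(axis=0))
```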

In [87]:
# Define a pipeline to fit and transform data with tfidf and custom transformer and 
# predict message categories using a Support Vector Classifier with linear kernel
pipeline = Pipeline([    
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
        ('length_ext', msgLengthExtractor())])),
    ('ss', StandardScaler(with_mean=False)),
    ('clf', MultiOutputClassifier(LinearSVC(random_state=55, dual=False)))])

# Define a grid of hyperparameters for the workflow
param_grid = {'features__tfidf__ngram_range': [(1,1), (1,2)], 
              'features__tfidf__max_df': [0.7], 
              'features__tfidf__use_idf': [False, True], 
              'clf__estimator__C':np.logspace(-2, 5, 11)[5:8]
             }

# Define GridSearchCV object with custom scorer
gs7 = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring=scorer, \
                  cv=KFold(shuffle=True, random_state=55))

In [88]:
gs7.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed: 119.8min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 368.4min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=55, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('tfidf',
                                                                        TfidfVectorizer(analyzer='word',
                                                                                        binary=False,
                                                                                        decode_error='strict',
                                                                                        dtype=<class 'numpy.float64'>,
                                                                                        encoding='utf-8',
                                                                                        input='content',
            

In [89]:
gs7.best_score_

0.2956965613576542

##### Observations
This hasn't improved model performance (best mean F2 score of 0.2957, compared with 0.4031 without scaling) and it took a very long time to run (368.4 minutes).  Below, I experiment with a different scaler, MaxAbsScaler, to see how it compares.

In [97]:
# Define a pipeline to fit and transform data with tfidf and custom transformer and 
# predict message categories using a Support Vector Classifier with linear kernel
pipeline = Pipeline([    
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
        ('length_ext', msgLengthExtractor())])),
    ('ss', MaxAbsScaler()),
    ('clf', MultiOutputClassifier(LinearSVC(random_state=55, dual=False)))])

# Define a grid of hyperparameters for the workflow
param_grid = {'features__tfidf__ngram_range': [(1,1), (1,2)], 
              'features__tfidf__max_df': [0.7], 
              'features__tfidf__use_idf': [False, True], 
              'clf__estimator__C':np.logspace(-2, 5, 11)[5:6]
             }

# Define GridSearchCV object with custom scorer
gs9 = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring=scorer, \
                  cv=KFold(shuffle=True, random_state=55))

In [98]:
gs9.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 out of  20 | elapsed: 23.5min remaining:  5.9min
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed: 27.3min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=55, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('tfidf',
                                                                        TfidfVectorizer(analyzer='word',
                                                                                        binary=False,
                                                                                        decode_error='strict',
                                                                                        dtype=<class 'numpy.float64'>,
                                                                                        encoding='utf-8',
                                                                                        input='content',
            

In [99]:
gs9.best_score_

0.3612200674339511

In [100]:
# Display mean metrics for best estimator identified in grid search
print_metrics(return_model_metrics(y_test, gs9.predict(X_test), category_names))

Baseline Accuracy: 0.9242
Model Accuracy: 0.9427
Model binary F1 score: 0.4082
Model binary F2 score: 0.3718
Model weighted F1 score: 0.9383
Model weighted F2 score: 0.9408
Model binary Precision: 0.547
Model binary Recall: 0.3538


##### Observations
Scaling the features hasn't improved model performance, so I'll be staying with the model tested in section 1.6.8 above.

## Export your model as a pickle file

For the reasons explained in the 'Observations' sections throughout this notebook, I'm going to export the model tested in section 1.6.8.

### Recreate model to be pickled

I need to recreate the model using the partial function within the tfidf step to enable it to be pickled successfully.
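
The notebook doesn't record which error prompted this change, so the sketch below is only a general illustration of how `pickle` treats callables: functions are stored by reference to an importable name, and a `functools.partial` wrapping such a function round-trips, while a lambda does not. The example uses `re.split` from the standard library as a stand-in for `tokenize`.

```python
import pickle
import re
from functools import partial

# re.split is importable by name, so a partial wrapping it can be pickled
split_on_whitespace = partial(re.split, r'\s+')
restored = pickle.loads(pickle.dumps(split_on_whitespace))
print(restored('water  needed now'))

# A lambda has no importable name, so pickle refuses it
try:
    pickle.dumps(lambda text: text.split())
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickled = False
print(lambda_pickled)
```

Because the wrapped function is stored by name rather than by value, any script that later loads the pickled pipeline will likely need `tokenize` to be defined or importable under the same module name.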

In [127]:
# Display best estimator parameters
gs4.best_estimator_

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.7, max_features=None,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...
                                 tokenizer=<function tokenize at 0x00000280D723D288>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 MultiOutputClassifier(estimator=LinearSVC(C=794.3282347242805,
                                              

In [133]:
# Re-run best model with partial function specified for tokenizer in Pipeline
# to enable model to be saved as a pickle file

# Define pipeline to fit and transform data with tfidf step and predict message categories 
# using a Support Vector Classifier with a linear kernel
pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=partial(tokenize))),
                     ('clf', MultiOutputClassifier(LinearSVC(random_state=55, dual=False)))])

# Define a grid of hyperparameters for the workflow
# Narrow hyperparameter grid search to replicate best model (gs4) above
param_grid = {'tfidf__ngram_range': [(1,2)], 
              'tfidf__max_df': [0.7], 
              'tfidf__use_idf': [True], 
              'clf__estimator__C':np.logspace(-2, 5, 11)[5:8],
             }

# Define GridSearchCV object with custom scorer
model = GridSearchCV(pipeline, param_grid, verbose=3, n_jobs=-1, scoring=scorer, \
                  cv=KFold(shuffle=True, random_state=55))

In [134]:
# Fit GridSearchCV object to training data
model.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  15 | elapsed:  5.4min remaining:  8.2min
[Parallel(n_jobs=-1)]: Done  12 out of  15 | elapsed:  7.9min remaining:  2.0min
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  8.1min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=55, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1

In [137]:
# Assign best estimator from grid search above
clf = model.best_estimator_

# Generate predictions for best estimator identified in grid search and display 
# classification report for each category, along with mean metrics across 
# all categories
preds = clf.predict(X_test)
display_class_report(y_test, preds, category_names)

Category: related.  Baseline accuracy: 0.7644.  F2 score: 0.9003.
              precision    recall  f1-score   support

           0     0.6321    0.4830    0.5476      1853
           1     0.8514    0.9133    0.8813      6012

    accuracy                         0.8120      7865
   macro avg     0.7418    0.6982    0.7144      7865
weighted avg     0.7998    0.8120    0.8027      7865

----------------------------------------------------------------------------------------------------

Category: request.  Baseline accuracy: 0.8272.  F2 score: 0.6557.
              precision    recall  f1-score   support

           0     0.9276    0.9353    0.9314      6506
           1     0.6774    0.6505    0.6637      1359

    accuracy                         0.8861      7865
   macro avg     0.8025    0.7929    0.7975      7865
weighted avg     0.8844    0.8861    0.8852      7865

----------------------------------------------------------------------------------------------------

Category: 

              precision    recall  f1-score   support

           0     0.9722    0.9812    0.9767      7460
           1     0.5833    0.4840    0.5290       405

    accuracy                         0.9556      7865
   macro avg     0.7778    0.7326    0.7529      7865
weighted avg     0.9522    0.9556    0.9537      7865

----------------------------------------------------------------------------------------------------

Category: electricity.  Baseline accuracy: 0.9803.  F2 score: 0.3343.
              precision    recall  f1-score   support

           0     0.9862    0.9935    0.9899      7710
           1     0.4898    0.3097    0.3794       155

    accuracy                         0.9800      7865
   macro avg     0.7380    0.6516    0.6847      7865
weighted avg     0.9764    0.9800    0.9778      7865

----------------------------------------------------------------------------------------------------

Category: tools.  Baseline accuracy: 0.9945.  F2 score: 0.0286.
        

In [139]:
# Confirm metrics are aligned with those reported in section 1.6.8
print_metrics(return_model_metrics(y_test, preds, category_names))

Baseline Accuracy: 0.9242
Model Accuracy: 0.9432
Model binary F1 score: 0.4415
Model binary F2 score: 0.4143
Model weighted F1 score: 0.94
Model weighted F2 score: 0.9418
Model binary Precision: 0.5401
Model binary Recall: 0.4003


Model metrics confirmed as being aligned with those reported in section 1.6.8.

### Export model

In [140]:
# Export model as a pickle file
pkl_filename = "classifier.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(clf, file)

### Reload and test model

In [16]:
# Load pickled model and test results
with open('classifier.pkl', 'rb') as file:
    clf_load = pickle.load(file)

# Display mean metrics for best estimator identified in grid search
print_metrics(return_model_metrics(y_test, clf_load.predict(X_test), category_names))

Baseline Accuracy: 0.9242
Model Accuracy: 0.9432
Model binary F1 score: 0.4415
Model binary F2 score: 0.4143
Model weighted F1 score: 0.94
Model weighted F2 score: 0.9418
Model binary Precision: 0.5401
Model binary Recall: 0.4003


Model successfully loaded from pickle file and demonstrated to produce same results as that in section 1.6.8.
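
Matching mean metrics are strong evidence that the reload worked. A stricter check is to compare predictions element by element. The sketch below does this on a toy `LinearSVC` model with hypothetical data (`X_toy`, `y_toy`), serialising to bytes in memory rather than to `classifier.pkl`.

```python
import pickle

import numpy as np
from sklearn.svm import LinearSVC

# Toy data and model standing in for the real pipeline
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LinearSVC(random_state=55, dual=False).fit(X_toy, y_toy)

# Serialise to bytes and back (an in-memory stand-in for the pickle file)
reloaded = pickle.loads(pickle.dumps(toy_model))

# Compare predictions from the original and reloaded models
same_predictions = np.array_equal(toy_model.predict(X_toy), reloaded.predict(X_toy))
print(same_predictions)
```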

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.