# ML Pipeline Preparation

### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np
import pdb
from sklearn.metrics import precision_recall_fscore_support
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import pickle
from nltk.corpus import stopwords  
from nltk.tokenize import word_tokenize 

In [2]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lbrutton\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
engine = create_engine('sqlite:///DisasterResponse.db')
with engine.connect() as connection:
    result = connection.execute("select * from Messages")
result
pd.read_sql_table('Messages', con=engine)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26211,30261,The training demonstrated how to enhance micro...,,news,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26212,30262,A suitable candidate has been selected and OCH...,,news,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26213,30263,"Proshika, operating in Cox's Bazar municipalit...",,news,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26214,30264,"Some 2,000 women protesting against the conduc...",,news,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('Messages', con=engine)
df = df[df['related'] != 2]
X = df['message']
y = df.iloc[:,-36:]

In [5]:
df

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26211,30261,The training demonstrated how to enhance micro...,,news,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26212,30262,A suitable candidate has been selected and OCH...,,news,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26213,30263,"Proshika, operating in Cox's Bazar municipalit...",,news,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26214,30264,"Some 2,000 women protesting against the conduc...",,news,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process text data

In [6]:
import re
import string

In [7]:
def tokenize(text):
    """Clean and tokenize words in a given message."""
    # pdb.set_trace()
    text = re.sub(r"""
               [,.;@#?!&$()'-]+  # Accept one or more copies of punctuation
               \ *           # plus zero or more copies of a space,
               """,
               " ",          # and replace it with a single space
               text, flags=re.VERBOSE)
    # pdb.set_trace()
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)
        
    stop_words = set(stopwords.words('english')) 
    
    filtered_sentence = [] 
    
    for w in clean_tokens:  
        if w not in stop_words:  
            filtered_sentence.append(w) 

    return filtered_sentence
    pass

### 3. Build a machine learning pipeline
This machine pipeline will take in the `message` column as input and output classification results on the other 36 categories in the dataset. Using the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html), as in this case each message could take multiple labels, any number between 0 and 36. 

We'll start by trying the kneigbors classifier, but will likely test other models further down:

In [55]:
pipeline_knn = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(KNeighborsClassifier()))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
pipeline_knn.fit(X_train, y_train)

### 5. Test model
Use the new model to make predictions using the test data. 

Report the f1 score, precision and recall for each output category of the dataset. Iterate over all the columns of the prediction matrix, using the precision_recall_fscore_support method, and adding these results to a dataframe. 

Finally, take the average of these scores to obtain an easy to read series representing the quality of the model output.

In [29]:
y_pred = pipeline_knn.predict(X_test)

In [24]:
def classify_model_output(y_test, y_pred):
    classification_scores = []

    for i, column in enumerate(y_test.columns):
        # print(column + str(i) + ' has the following score: \n' + classification_report(y_test[column], y_pred[:, i]))
        classification_scores.append(precision_recall_fscore_support(y_test[column], y_pred[:, i], zero_division=0))
        # pdb.set_trace()
    
    df_classification = pd.DataFrame(classification_scores)
    # pdb.set_trace()
    df_classification.columns = ['precision', 'recall', 'fscore', 'support']
    df_classification.set_index(y_test.columns, inplace=True)

    # currently the child_alone column has labeled entries, thus dropping it for now
    df_classification.drop(['child_alone'], axis=0, inplace=True) 

    # below loop splits the precision, recall and f-score columns into two, one for negatives and one for positives (0 and 1)
    for column in df_classification.columns:
        column_1 = df_classification[column].apply(lambda x: x[0]).rename(column+str(0), inplace=True)
        # pdb.set_trace()
        column_2 = df_classification[column].apply(lambda x: x[1]).rename(column+str(1), inplace=True)
        # pdb.set_trace()
        df_classification.drop([column], axis=1, inplace=True)
        df_classification = pd.concat([df_classification, column_1, column_2], axis=1)
        # pdb.set_trace()

    # finally, take the average of the dataframe to get a classifier for the model                                                                    
    df_classification_avg = df_classification.mean(axis=0)
                                                                          
    return df_classification_avg

In [40]:
df_classification_avg_knn = classify_model_output(y_test, y_pred)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [43]:
df_classification_avg_knn

precision0       0.942332
precision1       0.345074
recall0          0.940049
recall1          0.260572
fscore0          0.940238
fscore1          0.265945
support0      7802.000000
support1       788.000000
dtype: float64

### 6. Improve the model using grid search
Use grid search to find better parameters. 

In [44]:
# check which parameters are currently available in this pipeline
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x000001F779015940>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=KNeighborsClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x000001F779015940>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=KNeighborsClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,
 'tf

In [45]:
# try with two counts of nearest neighbors
parameters_knn = { 'clf__estimator__n_neighbors': [5, 7]}

cv_knn = GridSearchCV(pipeline, parameters)

In [46]:
# fit the grid search object to the training data
cv_knn.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x000001F779015940>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=KNeighborsClassifier()))]),
             param_grid={'clf__estimator__n_neighbors': [5, 7]})

In [47]:
# check which value was selected fotr the optimal number of neighbors
cv_knn.best_params_

{'clf__estimator__n_neighbors': 7}

In [48]:
# use the model to make predictions using the test data
y_pred_2 = cv_knn.predict(X_test)

In [49]:
# re-use the previous function to classify the output of the updated model

df_classification_knn_avg_2 = classify_model_output(y_test, y_pred_2)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [50]:
df_classification_knn_avg_2

precision0       0.943338
precision1       0.353025
recall0          0.942209
recall1          0.252193
fscore0          0.941367
fscore1          0.261706
support0      7802.000000
support1       788.000000
dtype: float64

### 7. Test the model
Compare the precision, recall and f-score of the tuned model to the original model. 

In [25]:
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

In [81]:
df_compare = pd.concat([df_classification_avg_knn, df_classification_knn_avg_2], axis=1)
df_compare.rename(columns = {0: 'df_classification_avg_knn', 1: 'df_classification_knn_avg_2'}, inplace=True)
df_compare.style.apply(highlight_max, axis=1)

Unnamed: 0,df_classification_avg_knn,df_classification_knn_avg_2
precision0,0.942332,0.943338
precision1,0.345074,0.353025
recall0,0.940049,0.942209
recall1,0.260572,0.252193
fscore0,0.940238,0.941367
fscore1,0.265945,0.261706
support0,7802.0,7802.0
support1,788.0,788.0


### 8. Try improving the model further
* try using the RandomForestClassifier
* add other features besides the TF-IDF

In [57]:
pipeline_rf = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

In [83]:
pipeline_rf.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x000001F7512C0D30>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x000001F7512C0D30>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,


In [58]:
parameters_rf = {} # not trying any other parameters for now to save time here

cv_rf = GridSearchCV(pipeline_rf, parameters_rf)

In [61]:
cv_rf.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x000001CC5E97FC10>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={})

In [62]:
y_pred_rf = cv_rf.predict(X_test)

In [63]:
df_classification_avg_rf_nt = classify_model_output(y_test, y_pred_rf)

In [64]:
df_classification_avg_rf_nt

precision0       0.945838
precision1       0.615753
recall0          0.973564
recall1          0.208881
fscore0          0.957671
fscore1          0.259247
support0      7802.000000
support1       788.000000
dtype: float64

In [97]:
df_compare = pd.concat([df_classification_avg_knn, df_classification_knn_avg_2, df_classification_avg_rf], axis=1)
df_compare.rename(columns = {0: 'df_classification_avg_knn', 
                             1: 'df_classification_knn_avg_2', 
                             2: 'df_classification_avg_rf'}, inplace=True)
df_compare.style.apply(highlight_max, axis=1)

Unnamed: 0,df_classification_avg_knn,df_classification_knn_avg_2,df_classification_avg_rf
precision0,0.942332,0.943338,0.942719
precision1,0.345074,0.353025,0.616413
recall0,0.940049,0.942209,0.972526
recall1,0.260572,0.252193,0.170683
fscore0,0.940238,0.941367,0.953246
fscore1,0.265945,0.261706,0.218726
support0,7802.0,7802.0,7802.0
support1,788.0,788.0,788.0


More testing below, playing around with the params of the random forest classifier:

In [104]:
pipeline_rf_2 = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

In [105]:
pipeline_rf_2.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x000001F7512C0D30>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x000001F7512C0D30>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,


In [111]:
parameters_rf_2 = { 'clf__estimator__n_estimators': [100, 150]}

cv_rf_2 = GridSearchCV(pipeline_rf_2, parameters_rf_2)

In [112]:
cv_rf_2.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x000001F7512C0D30>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'clf__estimator__n_estimators': [100, 150]})

In [113]:
cv_rf_2.best_params_

{'clf__estimator__n_estimators': 100}

### 8.b Try using Doc2Vec instead of Bag of words

In [8]:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [9]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

In [10]:
vector = model.infer_vector(["system", "response"])
vector

array([-0.0464276 , -0.07531845, -0.03542244,  0.08753716,  0.08961979],
      dtype=float32)

In [12]:
tokens = []
for x in X:
    tokens.append(model.infer_vector(tokenize(x)))
tokens

[array([-0.02102805, -0.09544075, -0.03447378,  0.04223363, -0.09201233],
       dtype=float32),
 array([ 0.00555632,  0.08723989, -0.03303778,  0.01438427,  0.07157621],
       dtype=float32),
 array([-0.04040078,  0.05684828, -0.03204436, -0.05835181,  0.0522645 ],
       dtype=float32),
 array([-0.09336329,  0.08730442,  0.01225083,  0.06184048,  0.05011596],
       dtype=float32),
 array([-0.07235494,  0.09407878,  0.05878349,  0.04743649,  0.02464597],
       dtype=float32),
 array([-0.06075339, -0.08356941,  0.00116646,  0.05917516,  0.03670799],
       dtype=float32),
 array([-0.03379729,  0.0242969 ,  0.08997396,  0.01484654,  0.022728  ],
       dtype=float32),
 array([-0.00382158, -0.03672004,  0.04759531,  0.07700966,  0.01942977],
       dtype=float32),
 array([-0.02656337,  0.05242578,  0.04158869, -0.02589305,  0.0011935 ],
       dtype=float32),
 array([-0.01400361,  0.06656753,  0.07035922, -0.07938597,  0.03109539],
       dtype=float32),
 array([-0.00354923,  0.060262

In [15]:
X_doc2vec = tokens
y_doc2vec = y

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    X_doc2vec, y_doc2vec, test_size=0.33, random_state=42)

In [19]:
pipeline_doc2vec_rf = Pipeline([
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

In [20]:
pipeline_doc2vec_rf.fit(X_train, y_train)

Pipeline(steps=[('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

In [21]:
y_pred_doc2vec = pipeline_doc2vec_rf.predict(X_test)

In [26]:
df_classification_avg_d2v_rf = classify_model_output(y_test, y_pred_doc2vec)

In [27]:
df_classification_avg_d2v_rf

precision0       0.917305
precision1       0.331409
recall0          0.965783
recall1          0.038131
fscore0          0.932570
fscore1          0.040048
support0      7802.000000
support1       788.000000
dtype: float64

With the current parameters, using doc2vec this way doesn't provide a better result than simply using bad of words. However it's possible that with a different set of parameters it might improve... not sure how to use gridsearch here, so for now I'm just going to try increasing the Doc2Vec vector size to see how that impacts the result - I'd expect the result to improve on precion and recall on catching positives...

In [28]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, workers=4)

In [29]:
vector = model.infer_vector(["system", "response"])
vector

array([-4.6427562e-03, -7.5318352e-03, -3.5422309e-03,  8.7537067e-03,
        8.9619793e-03,  6.8707028e-03,  6.9264518e-03,  6.2866341e-03,
        2.7919442e-03,  8.4868241e-03, -4.8740525e-04,  3.1649088e-03,
        8.2890700e-05, -5.2334056e-03, -9.1936113e-03, -8.9575965e-03,
       -2.7224752e-03, -1.7723162e-03, -3.5701951e-03, -6.4971042e-03,
        2.0644539e-03, -6.5533752e-03, -3.5976879e-03, -9.2489654e-03,
        6.8578203e-03, -1.6074858e-03,  8.1894072e-03,  9.2428178e-03,
       -8.8699767e-03, -8.2957176e-03,  1.4843202e-04, -5.3289551e-03,
       -6.0247360e-03, -7.4426061e-04,  7.2017019e-03,  6.4165266e-03,
       -7.3573082e-03, -1.1945019e-03, -2.2880470e-03,  2.1151870e-03,
       -5.1961667e-03,  6.9974019e-04, -6.5722573e-03, -6.1040740e-03,
        7.8199599e-03,  8.3594238e-03, -7.2375761e-04,  8.5355230e-03,
       -7.6397206e-03,  9.1029210e-03], dtype=float32)

In [30]:
tokens = []
for x in X:
    tokens.append(model.infer_vector(tokenize(x)))

In [32]:
X_doc2vec_v2 = tokens
y_doc2vec_v2 = y

In [31]:
pipeline_doc2vec_rf_v2 = Pipeline([
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

In [33]:
X_train, X_test, y_train, y_test = train_test_split(
    X_doc2vec_v2, y_doc2vec_v2, test_size=0.33, random_state=42)

In [34]:
pipeline_doc2vec_rf_v2.fit(X_train, y_train)

Pipeline(steps=[('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

In [35]:
y_pred_doc2vec_v2 = pipeline_doc2vec_rf_v2.predict(X_test)

In [36]:
df_classification_avg_d2v_rf_v2 = classify_model_output(y_test, y_pred_doc2vec_v2)

In [37]:
df_classification_avg_d2v_rf_v2

precision0       0.925480
precision1       0.386917
recall0          0.969701
recall1          0.033870
fscore0          0.933590
fscore1          0.034649
support0      7802.000000
support1       788.000000
dtype: float64

### 9. Export the model as a pickle file to be used in the webapp

In [102]:
final_model = cv_rf
filename = 'final_model.pkl'
pickle.dump(final_model, open(filename, 'wb'))