# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine

import numpy as np

#sklearn package
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,make_scorer
from sklearn.base import BaseEstimator,TransformerMixin

import random

#nlp pacakges
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk import bigrams,trigrams,pos_tag

#pickle

import pickle

In [2]:
# load data from database
engine = create_engine('sqlite:///Udacity_Disaster_Response.db')
df = pd.read_sql_table("message_categories",con=engine)
#id is useless for prediction,
#original is the same as message, but in different language
#for now only use message
#we may add genre later
X = df[df.columns[1]]
Y = df[df.columns[4:]]

In [4]:
#some check
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
#check the all columns
df.columns.tolist()

['id',
 'message',
 'original',
 'genre',
 'related',
 'request',
 'offer',
 'aid_related',
 'medical_help',
 'medical_products',
 'search_and_rescue',
 'security',
 'military',
 'child_alone',
 'water',
 'food',
 'shelter',
 'clothing',
 'money',
 'missing_people',
 'refugees',
 'death',
 'other_aid',
 'infrastructure_related',
 'transport',
 'buildings',
 'electricity',
 'tools',
 'hospitals',
 'shops',
 'aid_centers',
 'other_infrastructure',
 'weather_related',
 'floods',
 'storm',
 'fire',
 'earthquake',
 'cold',
 'other_weather',
 'direct_report']

### check the data of different type

In [62]:
messages_type = Y.columns.tolist()
print(messages_type)

['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']


In [63]:
for cat in messages_type:
    print(cat)
    print(df[cat].value_counts())
    for t in df[cat].value_counts().index.tolist():
        print("value "+str(t))
        print(df[df[cat]==t]['message'].head(2).values)
    print()

related
1    19906
0     6122
2      188
Name: related, dtype: int64
value 1
['Weather update - a cold front from Cuba that could pass over Haiti'
 'Is the Hurricane over or is it not over']
value 0
['Information about the National Palace-'
 'I would like to receive the messages, thank you']
value 2
['Dans la zone de Saint Etienne la route de Jacmel est bloqu, il est trsdifficile de se rendre  Jacmel'
 '. .. i with limited means. Certain patients come from the capital.']

request
0    21742
1     4474
Name: request, dtype: int64
value 0
['Weather update - a cold front from Cuba that could pass over Haiti'
 'Is the Hurricane over or is it not over']
value 1
['UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.'
 'Please, we need tents and water. We are in Silo, Thank you!']

offer
0    26098
1      118
Name: offer, dtype: int64
value 0
['Weather update - a cold front from Cuba that could pass over Haiti'
 'Is the Hurricane over or is it n

['Weather update - a cold front from Cuba that could pass over Haiti'
 'Is the Hurricane over or is it not over']
value 1
['Please we need help, food and toiletries.'
 'we are in need of food tentes corvers water money. we are in croix des missions/ route butte boyer in the churche mormon an. we are 50 people. pascale saint georges']

hospitals
0    25933
1      283
Name: hospitals, dtype: int64
value 0
['Weather update - a cold front from Cuba that could pass over Haiti'
 'Is the Hurricane over or is it not over']
value 1
['UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.'
 "For your information, There are people that are found in the rubbles of the School of Trinite, ( L'Ecole Sainte Trinite ) in Jacmel. Cookies brought to them by Colombian dogs are keeping them alive."]

shops
0    26096
1      120
Name: shops, dtype: int64
value 0
['Weather update - a cold front from Cuba that could pass over Haiti'
 'Is the Hurricane over or is i

### Some Obersevation
All message categories have very highly imbalenced dataset. For some categories, there exist some ambiguity.For example,"There's nothing to eat and water, we starving and thirsty.", this is weather_related. But from the message, there is no clue about the weather.    

In [23]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [72]:
#2 for related 
df['related'].value_counts()

1    19906
0     6122
2      188
Name: related, dtype: int64

In [6]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [28]:
#one message may have several categories
Y.sum(axis=1).head()

0    1
1    5
2    1
3    8
4    1
dtype: int64

In [4]:
#wee drop column child_alone since all value is zero 
Y.sum()

related                   20282
request                    4474
offer                       118
aid_related               10860
medical_help               2084
medical_products           1313
search_and_rescue           724
security                    471
military                    860
child_alone                   0
water                      1672
food                       2923
shelter                    2314
clothing                    405
money                       604
missing_people              298
refugees                    875
death                      1194
other_aid                  3446
infrastructure_related     1705
transport                  1201
buildings                  1333
electricity                 532
tools                       159
hospitals                   283
shops                       120
aid_centers                 309
other_infrastructure       1151
weather_related            7297
floods                     2155
storm                      2443
fire    

In [3]:
Y=Y.drop(['child_alone'],axis=1)

In [6]:
Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
#make some detail check of the message and original
#they are same but different language
for i in range(8):
    print(str(i))
    print("message: "+df['message'][i])
    print("original: "+df['original'][i])

0
message: Weather update - a cold front from Cuba that could pass over Haiti
original: Un front froid se retrouve sur Cuba ce matin. Il pourrait traverser Haiti demain. Des averses de pluie isolee sont encore prevues sur notre region ce soi
1
message: Is the Hurricane over or is it not over
original: Cyclone nan fini osinon li pa fini
2
message: Looking for someone but no name
original: Patnm, di Maryani relem pou li banm nouvel li ak timoun yo. Mesi se john jean depi Monben kwochi.
3
message: UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
original: UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
4
message: says: west side of Haiti, rest of the country today and tonight
original: facade ouest d Haiti et le reste du pays aujourd hui et ce soir
5
message: Information about the National Palace-
original: Informtion au nivaux palais nationl
6
message: Storm at sacred heart of jesus
o

In [17]:
df['genre'].value_counts(dropna=False)

news      13054
direct    10766
social     2396
Name: genre, dtype: int64

### 2. Write a tokenization function to process your text data

In [3]:
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

In [4]:
def tokenize(text):
    """
    input is a string text
    will remove punctuation,and stop words
    return the tokens
    """
    #normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]"," ",text.lower())
    #tokenize text
    tokens = word_tokenize(text)
    
    #lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    
    return tokens

In [5]:
#make some test
# works
tokenize(df['message'][1])

['hurricane']

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [29]:
pipeline = Pipeline([('vect',CountVectorizer(tokenizer=tokenize)),
                    ('tfidf',TfidfTransformer()),
                    ('clf',MultiOutputClassifier(estimator=RandomForestClassifier(n_estimators=50)))
        ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [7]:
#split train and test
X_train,X_test,y_train,y_test = train_test_split(X,Y,random_state=0)

In [8]:
#fit the model
pipeline.fit(X_train,y_train);

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [9]:
y_predict = pipeline.predict(X_test)

#### Print   f1 score,precision , recall and accuracy

In [26]:
for i in range(y_test.shape[1]):
    print("+++++++++++++   "+y_test.columns[i]+"   +++++++++++++++++")
    y_col_true =  np.array(y_test[y_test.columns[i]])
    y_col_pred = y_predict[:,i]
    accuracy = np.mean(y_col_true==y_col_pred).mean()
    y_col_true_pos_idx = y_col_true==1
    y_col_pred_pos_idx = y_col_pred==1
    
    precision = np.sum(np.logical_and(y_col_pred_pos_idx, y_col_true_pos_idx))/np.sum(y_col_pred_pos_idx)
    recall = np.sum(np.logical_and(y_col_pred_pos_idx,y_col_true_pos_idx))/np.sum(y_col_true_pos_idx)
    
    f1_socre = 0
    if precision>0 and recall>0:
        f1_socre = 2*precision*recall/(precision+recall)
    
    
    print("accuracy: {} \nprecision: {} \nrecall: {} \nf1 score: {}".format(accuracy,precision,recall,f1_socre))
    #print(classification_report(y_test[y_test.columns[i]],y_predict[:,i]))

+++++++++++++   related   +++++++++++++++++
accuracy: 0.8191943851083308 
precision: 0.843834141087776 
recall: 0.9394484412470024 
f1 score: 0.8890780141843972
+++++++++++++   request   +++++++++++++++++
accuracy: 0.897467195605737 
precision: 0.8470764617691154 
recall: 0.4977973568281938 
f1 score: 0.6270810210876804
+++++++++++++   offer   +++++++++++++++++
accuracy: 0.996490692706744 
precision: nan 
recall: 0.0 
f1 score: 0
+++++++++++++   aid_related   +++++++++++++++++
accuracy: 0.7789136405248703 
precision: 0.7699373695198329 
recall: 0.6725018234865062 
f1 score: 0.7179287521899942
+++++++++++++   medical_help   +++++++++++++++++
accuracy: 0.9218797680805615 
precision: 0.6363636363636364 
recall: 0.05343511450381679 
f1 score: 0.09859154929577464
+++++++++++++   medical_products   +++++++++++++++++
accuracy: 0.9536161122978334 
precision: 0.8709677419354839 
recall: 0.08256880733944955 
f1 score: 0.15083798882681565
+++++++++++++   search_and_rescue   +++++++++++++++++
accu

  if __name__ == '__main__':
  # Remove the CWD from sys.path while we load stuff.


### 6. Improve your model
Use grid search to find better parameters. 

In [51]:
parameters = {'clf__estimator__n_estimators':[20,50,70],
              'clf__estimator__min_samples_split':[2,4]
             }
cv = GridSearchCV(pipeline,param_grid=parameters,n_jobs=-1)

#### Fit GridSearchCV

In [52]:
cv.fit(X_train,y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip..._score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=None))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'clf__estimator__n_estimators': [20, 50, 70], 'clf__estimator__min_samples_split': [2, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [53]:
best_fit = cv.best_estimator_

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [54]:
best_predict = best_fit.predict(X_test)

#### Print result for best_fit

In [55]:
for i in range(y_test.shape[1]):
    print("+++++++++++++   "+y_test.columns[i]+"   +++++++++++++++++")
    y_col_true =  np.array(y_test[y_test.columns[i]])
    y_col_pred = best_predict[:,i]
    accuracy = np.mean(y_col_true==y_col_pred).mean()
    y_col_true_pos_idx = y_col_true==1
    y_col_pred_pos_idx = y_col_pred==1
    
    precision = np.sum(np.logical_and(y_col_pred_pos_idx, y_col_true_pos_idx))/np.sum(y_col_pred_pos_idx)
    recall = np.sum(np.logical_and(y_col_pred_pos_idx,y_col_true_pos_idx))/np.sum(y_col_true_pos_idx)
    
    f1_socre = 0
    if precision>0 and recall>0:
        f1_socre = 2*precision*recall/(precision+recall)
    
    
    print("accuracy: {} \nprecision: {} \nrecall: {} \nf1 score: {}".format(accuracy,precision,recall,f1_socre))
    #print(classification_report(y_test[y_test.columns[i]],y_predict[:,i]))

+++++++++++++   related   +++++++++++++++++
accuracy: 0.816447970704913 
precision: 0.8370776578807713 
recall: 0.9456434852118305 
f1 score: 0.8880547996621938
+++++++++++++   request   +++++++++++++++++
accuracy: 0.8941104668904486 
precision: 0.8376722817764165 
recall: 0.4819383259911894 
f1 score: 0.6118568232662192
+++++++++++++   offer   +++++++++++++++++
accuracy: 0.996490692706744 
precision: nan 
recall: 0.0 
f1 score: 0
+++++++++++++   aid_related   +++++++++++++++++
accuracy: 0.7763198046994202 
precision: 0.7623355263157895 
recall: 0.6761487964989059 
f1 score: 0.7166602241979126
+++++++++++++   medical_help   +++++++++++++++++
accuracy: 0.9223375038144644 
precision: 0.6229508196721312 
recall: 0.07251908396946564 
f1 score: 0.12991452991452992
+++++++++++++   medical_products   +++++++++++++++++
accuracy: 0.9531583765639304 
precision: 0.7777777777777778 
recall: 0.0856269113149847 
f1 score: 0.15426997245179064
+++++++++++++   search_and_rescue   +++++++++++++++++
accu

  if __name__ == '__main__':
  # Remove the CWD from sys.path while we load stuff.


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### Try other algo

In [56]:
#try other algorithms  adaboost
pipeline_ada = Pipeline([('vect',CountVectorizer(tokenizer=tokenize)),
                    ('tfidf',TfidfTransformer()),
                    ('clf',MultiOutputClassifier(estimator=AdaBoostClassifier(),n_jobs=-1))
                    ])
pipeline_ada.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=-1))])

In [57]:
y_predict = pipeline_ada.predict(X_test)
for i in range(y_test.shape[1]):
    print("+++++++++++++   "+y_test.columns[i]+"   +++++++++++++++++")
    y_col_true =  np.array(y_test[y_test.columns[i]])
    y_col_pred = y_predict[:,i]
    accuracy = np.mean(y_col_true==y_col_pred).mean()
    y_col_true_pos_idx = y_col_true==1
    y_col_pred_pos_idx = y_col_pred==1
    
    precision = np.sum(np.logical_and(y_col_pred_pos_idx, y_col_true_pos_idx))/np.sum(y_col_pred_pos_idx)
    recall = np.sum(np.logical_and(y_col_pred_pos_idx,y_col_true_pos_idx))/np.sum(y_col_true_pos_idx)
    
    f1_socre = 0
    if precision>0 and recall>0:
        f1_socre = 2*precision*recall/(precision+recall)
    
    
    print("accuracy: {} \nprecision: {} \nrecall: {} \nf1 score: {}".format(accuracy,precision,recall,f1_socre))
    #print(classification_report(y_test[y_test.columns[i]],y_predict[:,i]))

+++++++++++++   related   +++++++++++++++++
accuracy: 0.767622825755264 
precision: 0.7787738114294862 
recall: 0.9722222222222222 
f1 score: 0.8648120167096259
+++++++++++++   request   +++++++++++++++++
accuracy: 0.8832773878547452 
precision: 0.7421465968586387 
recall: 0.4995594713656388 
f1 score: 0.5971563981042654
+++++++++++++   offer   +++++++++++++++++
accuracy: 0.9949649069270674 
precision: 0.0 
recall: 0.0 
f1 score: 0
+++++++++++++   aid_related   +++++++++++++++++
accuracy: 0.7604516325907843 
precision: 0.7707948243992606 
recall: 0.6083150984682714 
f1 score: 0.6799836934366082
+++++++++++++   medical_help   +++++++++++++++++
accuracy: 0.9292035398230089 
precision: 0.6363636363636364 
recall: 0.26717557251908397 
f1 score: 0.3763440860215054
+++++++++++++   medical_products   +++++++++++++++++
accuracy: 0.9580408910588953 
precision: 0.678082191780822 
recall: 0.30275229357798167 
f1 score: 0.4186046511627907
+++++++++++++   search_and_rescue   +++++++++++++++++
accur

  # Remove the CWD from sys.path while we load stuff.
  # This is added back by InteractiveShellApp.init_path()


###  conclusion
adaBoost seems work better

### let's check some examples which are fail to predict (use water)

In [59]:
def get_example(y_true,y_pred,X,number=5,tp="TP"):
    """
    get want example of message for water
    y_true: a list of true value
    y_pred: predict value
    X:  dataframe of message
    number: number of example want to display
    tp: TP (True Positive),FP(False Positive),FN(False Negative)
        TN(True Negative)
    """
    y_true_p_idx = y_true==1
    y_pred_p_idx = y_pred==1
    y_true_n_idx = y_true==0
    y_pred_n_idx = y_pred==0
    
    tp_idx = np.logical_and(y_true_p_idx,y_pred_p_idx)
    fp_idx = np.logical_and(y_true_p_idx,y_pred_n_idx)
    tn_idx = np.logical_and(y_true_n_idx,y_pred_n_idx)
    fn_idx = np.logical_and(y_true_n_idx,y_pred_p_idx)
    
    if tp=='TP':
        for i in range(number):
            print(X[tp_idx][i:i+1].values)
    if tp=='FP':
        for i in range(number):
            print(X[fp_idx][i:i+1].values)
    if tp=='TN':
        for i in range(number):
            print(X[tn_idx][i:i+1].values)
    if tp=='FN':
        for i in range(number):
            print(X[fn_idx][i:i+1].values)

In [57]:
y_predict_water = pipeline_ada.predict(X_test)[:,9]
y_true_water = np.array(y_test['water'])

In [58]:
#let's check some FP examples
#the second and third examples are hard to decide if it is
# related with need of the water
#the first and fourth examples are strongly related with water
get_example(y_true_water,y_predict_water,X_test,tp="FP")

['Please, we need tents and water. We are in Silo, Thank you!']
['In June 2014, just a month after the Chibok schoolgirls were abducted, Nigerian military and police began detaining journalists, confiscated print publications and intercepted vehicles in an attempt to halt the circulation of critical information.']
['During the last six years, USAID has supported RSA to gain international search and rescue training and the ability to train others in life-saving procedures following crises such as earthquakes, floods, trench rescue, swift water rescue, aircraft accident body recovery and numerous other skills.']
["we'are in delmas 33 camp in front of Clini-Med hospital we don't find some water to use since two weeks. do something for us.thank you. "]
['In the area of water and sanitation, testing of water supplies indicate that recontamination of tubewells has not been as widespread as first feared.']


In [60]:
#let's check examples of FN
get_example(y_true_water,y_predict_water,X_test,tp="FN")

['non-perishable , any pet food as well , hygiene products , water , baby supplies']
['Coordinators with the Spanish Red Cross have been distributing blankets, sleeping mats, jerry cans for storing drinking water, purification tablets, soap, and insecticide-treated mosquito nets to guard against malaria.']
['Thank you very much. I am from Mariani - 1 Rue La Victoire extended. . We are facing all sorts of problems. Even water is scarce. We are maybe too far, this can be the reason why we are not being reached. ..']
["Tajikistan's needs for heating fuel in the coming winter months are estimated to be largely uncovered, due to the country's outstanding debt to external suppliers, as well as limited capacities to produce enough electrical energy locally, as the long-lasting drought reduced the level of water in the main rivers."]
['The system included drilling a well and installing two above-ground tanks that each hold 1,000 gallons and a submersible pump that moves water into the tanks.']

### Add new features

In [197]:
class Sentence_verb(BaseEstimator,TransformerMixin):
    """
    add new feautres:
    for each text and each sentence, get the verb and
    convert the to the array of verb with value 0 and 1
    0: the verb doesn't appear
    1: the verb appears
    """
    def __init__(self):
        self.verb_dic={}
    
    def get_verb_dic(self,X):
        print('perform get_verb_dic')
        lemmatizer = WordNetLemmatizer()
        for m in X:
            msents = sent_tokenize(m.lower())
            for sent in msents:
                sent=re.sub(r"[^a-zA-Z0-9]"," ",sent)
                tokens = word_tokenize(sent)
                #if 'water' not in tokens:
                #    continue
                tags = pos_tag(tokens)
                for tg in tags:
                    if tg[1] in ['VBP','VB']:
                        lem_w = lemmatizer.lemmatize(tg[0])
                        if lem_w in self.verb_dic:
                            self.verb_dic[lem_w] += 1
                        else:
                            self.verb_dic[lem_w] = 1
        ct=0
        for k in self.verb_dic:
            #select only the counts > 10
            if self.verb_dic[k]>10:
                self.verb_dic[k]=ct
                ct += 1
                
    def get_verb_array(self,text):
        n = len(self.verb_dic)
        lemmatizer = WordNetLemmatizer()
        ary = [0]*n
        msents = sent_tokenize(text.lower())
        for sent in msents:
            sent=re.sub(r"[^a-zA-Z0-9]"," ",sent)
            tokens = word_tokenize(sent)
            #if 'water' not in tokens:
            #    continue
            tags = pos_tag(tokens)
            for tg in tags:
                if tg[1] in ['VBP','VB']:
                    lem_w = lemmatizer.lemmatize(tg[0])
                    if lem_w not in self.verb_dic:
                        continue
                    ary[self.verb_dic[lem_w]] = 1
        return ary
    
    
    def fit(self,x,y=None):
        return self
    
    def transform(self,X):
        if len(self.verb_dic)==0:
            self.get_verb_dic(X)
        
        X_tag = pd.Series(X).apply(lambda x: pd.Series(self.get_verb_array(x)))
        
        return X_tag
    
    

In [198]:
pipeline_new_feat = Pipeline([
                              ('features',FeatureUnion([
                                  ('text_pipeline',Pipeline([
                                      ('vect',CountVectorizer(tokenizer=tokenize)),
                                      ('tfidf',TfidfTransformer())
                                  ])),
                                  ('sent_verb',Sentence_verb())
                              ])),
                              ('clf',MultiOutputClassifier(estimator=AdaBoostClassifier(),n_jobs=-1))
                    ])

In [199]:
pipeline_new_feat.fit(X_train,y_train)

perform get_verb_dic


Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=None,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, ma...ator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=-1))])

In [200]:
y_predict = pipeline_new_feat.predict(X_test)
for i in range(y_test.shape[1]):
    print("+++++++++++++   "+y_test.columns[i]+"   +++++++++++++++++")
    print(classification_report(y_test[y_test.columns[i]],y_predict[:,i]))
    #print(classification_report)

+++++++++++++   related   +++++++++++++++++
              precision    recall  f1-score   support

           0       0.60      0.05      0.09      1509
           1       0.77      0.99      0.87      5004
           2       0.30      0.07      0.12        41

   micro avg       0.77      0.77      0.77      6554
   macro avg       0.56      0.37      0.36      6554
weighted avg       0.73      0.77      0.68      6554

+++++++++++++   request   +++++++++++++++++
              precision    recall  f1-score   support

           0       0.91      0.97      0.93      5419
           1       0.76      0.52      0.62      1135

   micro avg       0.89      0.89      0.89      6554
   macro avg       0.83      0.74      0.78      6554
weighted avg       0.88      0.89      0.88      6554

+++++++++++++   offer   +++++++++++++++++
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6531
           1       0.00      0.00      0.00        23

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6507
           1       1.00      0.04      0.08        47

   micro avg       0.99      0.99      0.99      6554
   macro avg       1.00      0.52      0.54      6554
weighted avg       0.99      0.99      0.99      6554

+++++++++++++   hospitals   +++++++++++++++++
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6474
           1       0.18      0.03      0.04        80

   micro avg       0.99      0.99      0.99      6554
   macro avg       0.58      0.51      0.52      6554
weighted avg       0.98      0.99      0.98      6554

+++++++++++++   shops   +++++++++++++++++
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6522
           1       0.00      0.00      0.00        32

   micro avg       0.99      0.99      0.99      6554
   macro avg       0.50      0.50      0

### Conclusion
very small improvement but take a long time

### attack the unbalenced data 
here we try to use the undersampling method, we use the water as a example

In [9]:
# first lets under sampling the data for non water
# we will undersampling 30% of the non water data
df['water'].value_counts()

0    24544
1     1672
Name: water, dtype: int64

In [10]:
df_ud = df.copy()

In [13]:
drop_index=[]
for i in range(df_ud.shape[0]):
    if df_ud[i:i+1]['water'].values[0]==1:
        continue
    else:
        rd = random.uniform(0,1)
        if rd < 0.7:
            drop_index.append(i)

In [14]:
df_ud.drop(drop_index,inplace=True)

In [16]:
#check the sample of water
df_ud['water'].value_counts()

0    7422
1    1672
Name: water, dtype: int64

In [17]:
#get X and Y
X_ud = df_ud[df_ud.columns[1]]
Y_ud = df_ud[df_ud.columns[4:]]
#child_alone is all 0 
Y_ud=Y_ud.drop(['child_alone'],axis=1)

In [18]:
X_train_ud,X_test_ud,y_train_ud,y_test_ud = train_test_split(X_ud,Y_ud,random_state=0)

In [54]:
#fit the model
pipeline_ada_ud = Pipeline([('vect',CountVectorizer(tokenizer=tokenize)),
                    ('tfidf',TfidfTransformer()),
                    ('clf',MultiOutputClassifier(estimator=AdaBoostClassifier()))
                    ])
pipeline_ada_ud.fit(X_train_ud,y_train_ud)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...or=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=None))])

In [55]:
#test on under sampling sampling data
#for water, the result is better
y_predict = pipeline_ada_ud.predict(X_test_ud)
for i in range(y_test_ud.shape[1]):
    print("+++++++++++++   "+y_test.columns[i]+"   +++++++++++++++++")
    print(classification_report(y_test_ud[y_test_ud.columns[i]],y_predict[:,i]))
    #print(classification_report)

+++++++++++++   related   +++++++++++++++++
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       473
           1       0.79      1.00      0.88      1784
           2       0.25      0.06      0.10        17

   micro avg       0.78      0.78      0.78      2274
   macro avg       0.35      0.35      0.32      2274
weighted avg       0.62      0.78      0.69      2274

+++++++++++++   request   +++++++++++++++++
              precision    recall  f1-score   support

           0       0.88      0.97      0.92      1755
           1       0.82      0.54      0.65       519

   micro avg       0.87      0.87      0.87      2274
   macro avg       0.85      0.75      0.78      2274
weighted avg       0.86      0.87      0.86      2274

+++++++++++++   offer   +++++++++++++++++
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2267
           1       0.00      0.00      0.00         7

In [52]:
#let's append all dropped rows to test set
df_drop = df.iloc[drop_index,:].copy()
df_drop.drop(['child_alone'],axis=1,inplace=True)

In [32]:
X_test_ud_add_drop = X_test_ud.append(df_drop[df_drop.columns[1]])
y_test_ud_add_drop = y_test_ud.append(df_drop[df_drop.columns[4:]])

In [38]:
X_test_ud_add_drop.shape

(19396,)

In [39]:
y_test_ud_add_drop.shape

(19396, 35)

In [57]:
y_test_ud_add_drop.tail()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
26208,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26211,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26212,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26213,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26215,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [58]:
#test on set with dropped rows
#the prediction become worse 
y_predict = pipeline_ada_ud.predict(X_test_ud_add_drop)
for i in range(y_test.shape[1]):
    print("+++++++++++++   "+y_test_ud_add_drop.columns[i]+"   +++++++++++++++++")
    print(classification_report(y_test_ud_add_drop[y_test_ud_add_drop.columns[i]],y_predict[:,i]))
    #print(classification_report)

+++++++++++++   related   +++++++++++++++++
              precision    recall  f1-score   support

           0       0.31      0.00      0.01      4759
           1       0.75      1.00      0.85     14488
           2       0.28      0.09      0.13       149

   micro avg       0.75      0.75      0.75     19396
   macro avg       0.44      0.36      0.33     19396
weighted avg       0.64      0.75      0.64     19396

+++++++++++++   request   +++++++++++++++++
              precision    recall  f1-score   support

           0       0.91      0.96      0.93     16383
           1       0.69      0.47      0.56      3013

   micro avg       0.88      0.88      0.88     19396
   macro avg       0.80      0.72      0.75     19396
weighted avg       0.87      0.88      0.88     19396

+++++++++++++   offer   +++++++++++++++++
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19318
           1       0.00      0.00      0.00        78

              precision    recall  f1-score   support

           0       0.99      1.00      0.99     19184
           1       0.23      0.06      0.09       212

   micro avg       0.99      0.99      0.99     19396
   macro avg       0.61      0.53      0.54     19396
weighted avg       0.98      0.99      0.98     19396

+++++++++++++   other_infrastructure   +++++++++++++++++
              precision    recall  f1-score   support

           0       0.96      0.99      0.97     18604
           1       0.26      0.10      0.15       792

   micro avg       0.95      0.95      0.95     19396
   macro avg       0.61      0.55      0.56     19396
weighted avg       0.93      0.95      0.94     19396

+++++++++++++   weather_related   +++++++++++++++++
              precision    recall  f1-score   support

           0       0.89      0.94      0.92     14001
           1       0.83      0.71      0.76      5395

   micro avg       0.88      0.88      0.88     19396
   macro avg       

### Conclusion
the model works well on undersampling dataset, but when testing on the dropped examples, the performance is bad

### 9. Export your model as a pickle file

In [60]:
with open('best_fit.pickle','wb') as pickle_out:
    pickle.dump(best_fit,pickle_out)

In [27]:
#test if we can get back
with open('best_fit.pickle','rb') as pickle_in:
    test_fit = pickle.load(pickle_in)

In [28]:
print(test_fit)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip..._score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=None))])


### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.