# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [97]:
# import libraries
import datetime
import pandas as pd 
import numpy as np
from sqlalchemy import create_engine
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk import bigrams
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.multioutput import MultiOutputClassifier
from nltk.stem.wordnet import WordNetLemmatizer

import pickle


from sklearn.pipeline import Pipeline,FeatureUnion
from sklearn.ensemble import  RandomForestClassifier,AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVR
from sklearn.model_selection  import train_test_split,GridSearchCV
from sklearn.metrics import roc_auc_score,f1_score,precision_score,recall_score,accuracy_score,make_scorer,classification_report,confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer,TfidfTransformer
from sklearn.decomposition import PCA, TruncatedSVD


import seaborn as sns
import matplotlib.pyplot as plt

nltk.download("wordnet")
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
# load data from database
engine = create_engine('sqlite:///disaster_response.db')
df = pd.read_sql_table(table_name="disaster_response",con=engine)
X = df['message']
Y = df.drop(['message','original','genre','id'],axis=1)

#### 1.1 Data Exploration

In [4]:
Y.describe()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26177.0,26177.0,26177.0,26177.0,26177.0,26177.0,26177.0,26177.0,26177.0,26177.0,...,26177.0,26177.0,26177.0,26177.0,26177.0,26177.0,26177.0,26177.0,26177.0,26177.0
mean,0.77358,0.170531,0.00447,0.414104,0.079459,0.050044,0.027658,0.017993,0.032815,0.0,...,0.011804,0.04397,0.278336,0.082095,0.093212,0.010773,0.09367,0.02017,0.052565,0.193414
std,0.435345,0.376106,0.066707,0.492576,0.270459,0.21804,0.163994,0.132928,0.178156,0.0,...,0.108006,0.205032,0.448188,0.274515,0.290734,0.103234,0.291375,0.140586,0.223168,0.394982
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [5]:
Y.describe().loc['max',]

related                   2.0
request                   1.0
offer                     1.0
aid_related               1.0
medical_help              1.0
medical_products          1.0
search_and_rescue         1.0
security                  1.0
military                  1.0
child_alone               0.0
water                     1.0
food                      1.0
shelter                   1.0
clothing                  1.0
money                     1.0
missing_people            1.0
refugees                  1.0
death                     1.0
other_aid                 1.0
infrastructure_related    1.0
transport                 1.0
buildings                 1.0
electricity               1.0
tools                     1.0
hospitals                 1.0
shops                     1.0
aid_centers               1.0
other_infrastructure      1.0
weather_related           1.0
floods                    1.0
storm                     1.0
fire                      1.0
earthquake                1.0
cold      

In [6]:
Y.describe().loc['min',]

related                   0.0
request                   0.0
offer                     0.0
aid_related               0.0
medical_help              0.0
medical_products          0.0
search_and_rescue         0.0
security                  0.0
military                  0.0
child_alone               0.0
water                     0.0
food                      0.0
shelter                   0.0
clothing                  0.0
money                     0.0
missing_people            0.0
refugees                  0.0
death                     0.0
other_aid                 0.0
infrastructure_related    0.0
transport                 0.0
buildings                 0.0
electricity               0.0
tools                     0.0
hospitals                 0.0
shops                     0.0
aid_centers               0.0
other_infrastructure      0.0
weather_related           0.0
floods                    0.0
storm                     0.0
fire                      0.0
earthquake                0.0
cold      

There is an incorrect record in the `related` column: it contains the value 2, but every label should be either 0 or 1. In addition, `child_alone` has a maximum of 0, so every row in that column is 0. We can drop it because it carries no information for the models.
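
The two checks described above can also be run programmatically rather than read off `describe()`. The following is a minimal sketch on a made-up label frame; the column names mirror the dataset, but the values and the variable names (`labels`, `non_binary`, `constant`) are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy label frame standing in for Y; the values are made up for illustration.
labels = pd.DataFrame({
    "related": [1, 0, 2, 1],
    "request": [0, 1, 0, 0],
    "child_alone": [0, 0, 0, 0],
})

# Columns containing anything other than 0 or 1.
non_binary = [col for col in labels.columns if not labels[col].isin([0, 1]).all()]

# Columns with a single distinct value carry no signal for a classifier.
constant = [col for col in labels.columns if labels[col].nunique() == 1]

print(non_binary)  # ['related']
print(constant)    # ['child_alone']
```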

In [7]:
[print(i) for i in df.sample(10)['message']]

Before making landfall in neighbouring Bangladesh on July 30, Cyclone Komen had swept past western Myanmar, causing high winds and heavy rains throughout the country and increasing the severity of the seasonal downpours.
2.5 WFP is preparing a deployment plan for the WFP-chartered helicopter which arrived in Addis Ababa on 2 June.
I'd like a psychologist to see me 
good evening please I have heard news about infomation that we must know 
Lutheran World Relief is channeling funds through the Action by Churches Together (ACT) network to ACT member organisations in Indonesia that have begun to assist survivors with medical care and food.
RT ASA_Astronauts: http://twitpic.com/16ad2g - San Antonio, Chile. One of the closest port to Santiago. /via @Astro_Soichi
New Overview Thousands feared dead as major quake strikes Haiti n Reuters Reuters A major earthquake rocke.. http bit.ly 6GNZii
Uste from deep of my heart.Bouy as well that I know THAT YOU'RE SUFFERING BECAUSE OF me. you're with me.wa

[None, None, None, None, None, None, None, None, None, None]

Checking random samples of the messages suggests that we need to remove special characters and URLs.

In [8]:
[print(i) for i in df.sample(10)['message']]

The inclement weather has suspended helicopter relief flights and triggered landslides that have severed road links in different parts of the quake-affected areas.
According to Rehman Awan, a social mobilisation specialist with the UN Human Settlements Programme (UN HABITAT) in Muzaffarabad, there is still much to be done to provide landless families with an alternative choice of future residence.
good evening please can I know what will be tomorrow? 
They had to adjust to the taste of chlorinated water and learn that it was safe to drink.
WE ARE 2 BROTHERS. WE GOT MARRIED 1 MONTH BEFORE THE FLOOD. THE FLOOD HAS DESTROYED OUR HOUSE. I AM A METRIC STUDENT ANS MY BROTHER IS A FSC STUDENT. WE DO NOT HAVE OUR ID CARDS DUE TO WHICH WE CANNOT CONTINUE OUR STUDIES. HOMELESS AND HELPLESS
We are in Mon Repos Carrefour, and we never find help. 
We need hygiene products ( deodorant , soap , dish detergent , mop , broom ) , a place to wash clothes , temporary shelter , a non-electrical heater . We

[None, None, None, None, None, None, None, None, None, None]

There are 60 messages containing a URL that is broken up by spaces (`http : //`); these can be repaired.

In [9]:
df[df.message.str.contains("http : //")]

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
10503,12079,Sandy relief efforts in full swing @foodcoop @...,,social,1,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
10504,12080,#foodtruck to the rescue #eastvillage #nyc #sa...,,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10505,12081,Sandy Takes NY Out : .. .and now the morning a...,,social,1,0,0,1,1,0,...,0,0,1,1,1,0,0,0,0,1
10511,12087,What we thought could never happen : Dark #Str...,,social,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,1,1
10513,12089,"Hide your kids , hide your wife .. .. And hide...",,social,1,0,0,0,0,0,...,0,0,1,0,1,0,0,0,1,0
10515,12092,Stocking up on some food before the storm ( @W...,,social,1,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,1
10530,12110,the big storm Hurricane Sandy : The calm befor...,,social,1,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,0
10563,12146,"In case of hurricane , buy as much junk food a...",,social,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
10564,12147,BB scouting team on the move .. FDR flooded .....,,social,1,0,0,1,0,0,...,0,0,1,1,1,0,0,0,0,0
10571,12154,Over 25 people in line at Starbucks . People n...,,social,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# take the first token after "http " (the domain) from every affected message, de-duplicated
urls_to_fix = list(set([i.split("http ")[1].split()[:2][0] for i in df[df.message.str.contains("http ")]['message']]))
urls_to_fix = ["http "+i for i in urls_to_fix]

In [11]:
[i for i in urls_to_fix if 'bi' in i]

['http bi', 'http bit.ly']

In [12]:
urls_to_fix.remove('http bi')

In [13]:
[print(i) for i in df[df.message.str.contains("http ")].sample(10)['message']]

Haiti Earthquake One Local Reaction Video webcastr.com http bit.ly 6JEwuX
RT msnbc BREAKING NEWS 7.3 magnitude earthquake hits near Haiti coast USGS reports http bit.ly 3ieuwr
Haiti earthquake what we re hearing http bit.ly 6mOsWE #cnn
UN says Haiti headquarters collapsed in earthquake Independent The headquarters of the UN peacekeeping mission .. http bit.ly 91bTsB
My experience volunteering for Sandy church with food and supplies , if you have time take time @Kaiser Park http : //t.co/gOyTj0oJ
haiti earthquake lane kiffin usc haiti news u2026 What&#8217 s Famous Keywords About Online Right now? n n n n nhai. http bit.ly 5gG8KI
I have no words RT gillo Eyewitness Tweets from Haiti http bit.ly 5QVS4c #haiti #earthquake
#Ireland Only noticed that Irish papers had missed Haiti s earthquake til I saw this article in Examiner. http bit.ly 7rjoZv
Orange earthquake alert Haiti M=5.5 potentially affecting 4.8 million people. http bit.ly 7M8Njh
Haiti hit by 7.0 magnitude earthquake buildings l

[None, None, None, None, None, None, None, None, None, None]

In [14]:
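def empty_url_fixer(x):
    """Rejoin URLs that were split on spaces, e.g. 'http bit.ly 8qMxXz'."""
    for i in urls_to_fix:
        try:
            if i in x:
                # first token after the "http <domain>" prefix is the URL path
                unique_url = x.split(i,1)[1].split()[0]

                # drop the loose path token (note: removes every occurrence in x)
                x = x.replace(unique_url,"")

                # "http bit.ly" -> "http://bit.ly/<path>"
                x = x.replace(i,i.replace(" ","://")+"/"+unique_url)
        except Exception as e:
            # an IndexError means nothing follows the prefix; log it and move on
            print(e)
            print(i)
    return x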
    

In [15]:
a = "RT thezonedotfm If you want to DONATE to HAITI EARTHQUAKE RELIEF http tinyurl.com ya6kpzm Pol"
b = """Haitians in the U.S. are trying desperately to contact their loved ones on the earthquake ravaged island http bit.ly 8qMxXz
RT Stodsports http twitpic.com xvgbm image of the devestation in #Haiti after 7.0 mag #earthquake via North Angel endlesshugs dds .."""

In [16]:
empty_url_fixer(a)

'RT thezonedotfm If you want to DONATE to HAITI EARTHQUAKE RELIEF http://tiny/url.com ya6kpzm Pol'

In [17]:
empty_url_fixer(b)

'Haitians in the U.S. are trying desperately to contact their loved ones on the earthquake ravaged island http://bit.ly/8qMxXz \nRT Stodsports http://twitpic.com/xvgbm  image of the devestation in #Haiti after 7.0 mag #earthquake via North Angel endlesshugs dds ..'
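
The output for `a` above is malformed (`http://tiny/url.com ya6kpzm`), which suggests that `urls_to_fix` may still hold a truncated prefix such as `'http tiny'`, much like the `'http bi'` entry that was removed by hand. A regex-based alternative avoids maintaining that list. The sketch below is not the notebook's method: it assumes a broken link always looks like `http <domain> <path>`, and the names `BROKEN_URL` and `rejoin_split_urls` are hypothetical.

```python
import re

# Assumed shape of a broken link: "http", a domain token containing a dot,
# then a single path token.
BROKEN_URL = re.compile(r"\bhttp ((?:[\w-]+\.)+[A-Za-z]{2,}) (\S+)")

def rejoin_split_urls(text):
    """Turn 'http bit.ly 8qMxXz' into 'http://bit.ly/8qMxXz'."""
    return BROKEN_URL.sub(r"http://\1/\2", text)

sample = "RELIEF http tinyurl.com ya6kpzm Pol"
fixed = rejoin_split_urls(sample)
print(fixed)  # RELIEF http://tinyurl.com/ya6kpzm Pol
```

A message that ends right after the domain (for example `http www.haitifeed.com`) is left unchanged, because there is no path token to attach. The trade-off is that a domain followed by an ordinary word would be joined to that word, so the pattern is worth checking against real samples first.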

We can also strip HTML tags from the messages.

In [18]:
def remove_html_tags(x):
    CLEANR = re.compile('<.*?>')
    x = re.sub(CLEANR, '', x)
    return x
    

In [19]:
df['message' ] = df['message' ].apply(lambda x: x.replace("http : //","http://"))

In [20]:
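def remove_urls(x):
    # raw string so the regex escapes are not treated as Python string escapes
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, x)
    for url in detected_urls:
        # mask each URL so that all links map to the same token
        x = x.replace(url, "urlplaceholder")
    return x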


In [21]:
remove_urls(empty_url_fixer(b))

'Haitians in the U.S. are trying desperately to contact their loved ones on the earthquake ravaged island urlplaceholder \nRT Stodsports urlplaceholder  image of the devestation in #Haiti after 7.0 mag #earthquake via North Angel endlesshugs dds ..'

#### 1.2 Data Cleaning

In [22]:
df['related'] = df['related'].apply(lambda x: 1 if x==2 else x)
df.drop('child_alone',axis=1,inplace=True)

In [23]:
X = df['message']
Y = df.drop(['message','original','genre','id'],axis=1)

### 2. Write a tokenization function to process your text data

In [24]:
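def tokenize(text):
    # repair split URLs, then mask every URL with a placeholder token
    text = remove_urls(empty_url_fixer(text))
    # separate words that were glued together, e.g. "helpNeeded" -> "help. Needed"
    text = re.sub(r'([a-z])([A-Z])', r'\1. \2', text)
    text = re.sub(r'\s+', ' ', text) # collapse \t, \n and repeated spaces
    text = text.translate(str.maketrans('','',string.punctuation))
    # build these once per call instead of once per word
    stop_words = set(stopwords.words("english"))
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
    text = " ".join([w for w in text.split() if w not in stop_words])
    text = " ".join([stemmer.stem(w) for w in text.split()])
    text = " ".join([lemmatizer.lemmatize(w) for w in text.split()]).lower()
    text = word_tokenize(text)

    return text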
    
    

In [25]:
tokenize(b)

['haitian',
 'us',
 'tri',
 'desper',
 'contact',
 'love',
 'one',
 'earthquak',
 'ravag',
 'island',
 'urlplacehold',
 'rt',
 'stodsport',
 'urlplacehold',
 'imag',
 'devest',
 'haiti',
 '70',
 'mag',
 'earthquak',
 'via',
 'north',
 'angel',
 'endlesshug',
 'dd']

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other categories in the dataset (36 in the raw data; 35 remain here because `child_alone` was dropped). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [40]:
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize,use_idf=True,smooth_idf=True,max_df=0.98,min_df=0.01)
clf = RandomForestClassifier(random_state=1)
pipeline = Pipeline([('tfidf_vectorizer',tfidf_vectorizer),('clf',clf)])
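
`RandomForestClassifier` accepts a two-dimensional target directly, which is what the pipeline above relies on. For estimators that only handle one target at a time, `MultiOutputClassifier` fits one copy per label column. The following is a minimal sketch on made-up messages and two hypothetical label columns; `toy_pipeline`, `messages`, and `labels` are illustrative names only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy messages and two made-up label columns (think: water, food).
messages = [
    "we need water", "send food please", "water and food needed",
    "no help required", "food is running out", "clean water wanted",
]
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [0, 1], [1, 0]])

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # One independent forest is fitted per label column.
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=1))),
])
toy_pipeline.fit(messages, labels)

pred = toy_pipeline.predict(["need water now"])
print(pred.shape)  # (1, 2): one row per message, one column per label
```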

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 1)

In [42]:
pipeline.fit(X_train, y_train)

list index out of range
http www.haitifeed.com
list index out of range
http ..
list index out of range
http angelmissionshaiti.org


Pipeline(memory=None,
     steps=[('tfidf_vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.98, max_features=None, min_df=0.01,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smo...estimators=10, n_jobs=1,
            oob_score=False, random_state=1, verbose=0, warm_start=False))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [43]:
y_pred_test = pipeline.predict(X_test)
print(classification_report(y_test.values, y_pred_test, target_names=Y.columns.values))

list index out of range
http www.ifrc.org
                        precision    recall  f1-score   support

               related       0.84      0.90      0.87      4990
               request       0.74      0.46      0.56      1151
                 offer       0.00      0.00      0.00        26
           aid_related       0.72      0.55      0.63      2751
          medical_help       0.44      0.06      0.10       530
      medical_products       0.54      0.08      0.14       328
     search_and_rescue       0.00      0.00      0.00       179
              security       0.11      0.01      0.02       113
              military       0.19      0.02      0.04       206
                 water       0.81      0.46      0.58       416
                  food       0.81      0.58      0.68       762
               shelter       0.76      0.45      0.57       585
              clothing       0.72      0.22      0.34       104
                 money       0.70      0.10      0.17       1

  'precision', 'predicted', average, warn_for)
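
The cell above passes the whole label matrix to `classification_report` in one call. The per-column iteration that the instructions mention could look like the sketch below. The data is made up, the category names are placeholders, and `zero_division` assumes scikit-learn 0.22 or newer, which may not match the version used for the outputs in this notebook.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up ground truth and predictions for two hypothetical categories.
category_names = ["water", "food"]
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])

reports = {}
for idx, name in enumerate(category_names):
    # Slice out one column so each category gets its own 0/1 report.
    reports[name] = classification_report(
        y_true[:, idx], y_pred[:, idx], zero_division=0
    )
    print(name)
    print(reports[name])
```

Reporting each column separately shows precision and recall for both the 0 and 1 class of a category, which makes it easier to spot labels the model never predicts.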


### 6. Improve your model
Use grid search to find better parameters. 

In [39]:
pipeline.get_params()

{'memory': None,
 'steps': [('tfidf_vectorizer',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=0.95, max_features=None, min_df=0.05,
           ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
           stop_words=None, strip_accents=None, sublinear_tf=False,
           token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fa67dc9abf8>, use_idf=True,
           vocabulary=None)),
  ('clf',
   RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=1, verbose=0, warm_start=False))]

In [53]:
parameters = {
 'clf__max_depth': [  30, 100],
 'clf__min_samples_leaf': [1, 4],
 'clf__n_estimators': [10, 150]}

cv = GridSearchCV(pipeline, param_grid=parameters, cv=5, n_jobs=-1, verbose=2)
print(datetime.datetime.now())
cv.fit(X_train, y_train)


2022-12-29 07:43:26.569612
Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] clf__max_depth=30, clf__min_samples_leaf=1, clf__n_estimators=10 
[CV]  clf__max_depth=30, clf__min_samples_leaf=1, clf__n_estimators=10, total= 1.6min
[CV] clf__max_depth=30, clf__min_samples_leaf=1, clf__n_estimators=10 


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.9min remaining:    0.0s


[CV]  clf__max_depth=30, clf__min_samples_leaf=1, clf__n_estimators=10, total= 1.6min
[CV] clf__max_depth=30, clf__min_samples_leaf=1, clf__n_estimators=10 
[CV]  clf__max_depth=30, clf__min_samples_leaf=1, clf__n_estimators=10, total= 1.6min
[CV] clf__max_depth=30, clf__min_samples_leaf=1, clf__n_estimators=10 
[CV]  clf__max_depth=30, clf__m

[CV]  clf__max_depth=100, clf__min_samples_leaf=1, clf__n_estimators=10, total= 1.7min
[CV] clf__max_depth=100, clf__min_samples_leaf=1, clf__n_estimators=10 
[CV]  clf__max_depth=100, clf__min_samples_leaf=1, clf__n_estimators=10, total= 1.7min
[CV] clf__max_depth=100, clf__min_samples_leaf=1, clf__n_estimators=10 
[CV]  clf__max_depth=100, c

[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 133.7min finished




GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf_vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.98, max_features=None, min_df=0.01,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smo...estimators=10, n_jobs=1,
            oob_score=False, random_state=1, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'clf__max_depth': [30, 100], 'clf__min_samples_leaf': [1, 4], 'clf__n_estimators': [10, 150]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)

In [55]:
cv.best_params_

{'clf__max_depth': 100, 'clf__min_samples_leaf': 1, 'clf__n_estimators': 150}
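
With `scoring=None`, the grid search ranks candidates by the estimator's default score, which for a multi-output classifier is exact-match accuracy over all labels at once. If per-label quality matters more, a scorer can be passed explicitly. The sketch below uses micro-averaged F1 on toy data; the choice of metric, the data, and the names `micro_f1` and `search` are assumptions for illustration, not what the notebook ran.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy messages with two made-up label columns.
messages = [
    "we need water", "send food please", "water and food needed",
    "no help required", "food is running out", "clean water wanted",
    "water shortage here", "food supplies low",
]
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [0, 1], [1, 0], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=1)),
])

# Micro-averaged F1 pools true/false positives across every label column.
micro_f1 = make_scorer(f1_score, average="micro", zero_division=0)

search = GridSearchCV(
    toy_pipeline,
    param_grid={"clf__n_estimators": [5, 10]},
    scoring=micro_f1,
    cv=2,
)
search.fit(messages, labels)
print(search.best_params_)
```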

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [56]:
y_pred_test = cv.predict(X_test)

print(classification_report(y_test.values, y_pred_test, target_names=Y.columns.values))

list index out of range
http www.ifrc.org
                        precision    recall  f1-score   support

               related       0.81      0.95      0.88      4990
               request       0.78      0.52      0.62      1151
                 offer       0.00      0.00      0.00        26
           aid_related       0.75      0.59      0.66      2751
          medical_help       0.59      0.07      0.12       530
      medical_products       0.55      0.05      0.10       328
     search_and_rescue       0.00      0.00      0.00       179
              security       0.14      0.01      0.02       113
              military       0.20      0.01      0.02       206
                 water       0.76      0.54      0.63       416
                  food       0.80      0.63      0.71       762
               shelter       0.80      0.45      0.58       585
              clothing       0.64      0.20      0.31       104
                 money       0.80      0.07      0.13       1

  'precision', 'predicted', average, warn_for)


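Hyperparameter optimization helped for the larger categories shown: F1 rose for `request` (0.56 to 0.62), `aid_related` (0.63 to 0.66), `water` (0.58 to 0.63), and `food` (0.68 to 0.71). Some rare categories, such as `medical_products` (0.14 to 0.10) and `clothing` (0.34 to 0.31), scored slightly lower.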

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

#### 8.1 Try other features

Best Params: {'clf__max_depth': 100, 'clf__min_samples_leaf': 1, 'clf__n_estimators': 150}

Add truncated SVD components as an extra feature alongside the TF-IDF matrix.

In [76]:
improve_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('svd_features', Pipeline([('tfidf_vectorizer', tfidf_vectorizer),
                                   ('svd', TruncatedSVD(n_components=50))])),
        ('tfidf_vectorizer', tfidf_vectorizer),
    ])),
    ('clf', RandomForestClassifier(max_depth=100, min_samples_leaf=1,
                                   n_estimators=150, random_state=1, n_jobs=-1)),
])
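
The pipeline above hands the same `tfidf_vectorizer` object to both branches of the `FeatureUnion`, so that one instance is fitted once per branch. Giving each branch its own vectorizer keeps the branches independent. The following is a minimal sketch on toy messages, assuming default `TfidfVectorizer` settings and a deliberately small `n_components`; all names are illustrative.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

messages = [
    "we need water", "send food please", "water and food needed",
    "no help required", "food is running out", "clean water wanted",
]

features = FeatureUnion([
    # Branch 1: dense low-rank projection of the TF-IDF matrix.
    ("svd_features", Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("svd", TruncatedSVD(n_components=2, random_state=1)),
    ])),
    # Branch 2: the raw TF-IDF matrix, from a separate vectorizer instance.
    ("tfidf", TfidfVectorizer()),
])

combined = features.fit_transform(messages)
vocab_size = len(features.transformer_list[1][1].vocabulary_)

# The union stacks the 2 SVD columns next to the TF-IDF columns.
print(combined.shape == (len(messages), 2 + vocab_size))
```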

In [77]:
improve_pipeline.fit(X_train, y_train)
y_pred_rf_test = improve_pipeline.predict(X_test)

print(classification_report(y_test.values, y_pred_rf_test, target_names=Y.columns.values))

list index out of range
http www.haitifeed.com
list index out of range
http ..
list index out of range
http angelmissionshaiti.org
list index out of range
http www.haitifeed.com
list index out of range
http ..
list index out of range
http angelmissionshaiti.org
list index out of range
http www.ifrc.org
list index out of range
http www.ifrc.org
                        precision    recall  f1-score   support

               related       0.81      0.96      0.88      4990
               request       0.80      0.47      0.60      1151
                 offer       0.00      0.00      0.00        26
           aid_related       0.72      0.61      0.66      2751
          medical_help       0.73      0.02      0.04       530
      medical_products       0.32      0.02      0.04       328
     search_and_rescue       0.00      0.00      0.00       179
              security       0.12      0.01      0.02       113
              military       0.00      0.00      0.00       206
             

  'precision', 'predicted', average, warn_for)


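The new feature didn't improve the results: among the categories shown, F1 stayed the same for `related` and `aid_related` and dropped for `request` (0.62 to 0.60) and `medical_help` (0.12 to 0.04).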

#### 8.2 Try other ML algorithms

In [94]:
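from sklearn.linear_model import SGDClassifier
# SGDClassifier handles one target at a time, so wrap it for the multi-label Y
improve_pipeline = Pipeline([('tfidf_vectorizer',tfidf_vectorizer),('clf',MultiOutputClassifier(SGDClassifier()))])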

### 9. Export your model as a pickle file

In [98]:
with open('classifier.pkl', 'wb') as f:
    pickle.dump(cv, f)
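
Before relying on the exported file, it can be worth checking that the model survives a save and load round trip. The sketch below does this in memory with a toy pipeline; the model, data, and variable names are stand-ins, not the notebook's `cv` object.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Small stand-in model trained on made-up messages.
toy_model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
toy_model.fit(
    ["we need water", "send food", "water please", "food needed"],
    ["water", "food", "water", "food"],
)

# Serialize to bytes and restore, as pickle.dump / pickle.load do with a file.
blob = pickle.dumps(toy_model)
restored = pickle.loads(blob)

same = list(restored.predict(["water"])) == list(toy_model.predict(["water"]))
print(same)
```

Note that pickle stores a custom function such as `tokenize` by reference (module and name), not by value, so any script that loads `classifier.pkl` needs to be able to import `tokenize` from the same module path.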

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
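
One way to let the user specify the dataset is to read the file paths from the command line. The sketch below shows only that argument-handling step; the argument names `database_filepath` and `model_filepath` and the helper `parse_args` are hypothetical and may differ from the template in the Resources folder.

```python
import argparse

def parse_args(argv=None):
    """Read the input database and output model paths from the command line."""
    parser = argparse.ArgumentParser(description="Train the message classifier.")
    parser.add_argument("database_filepath", help="SQLite database holding the cleaned messages")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)

# Passing a list simulates: python train.py disaster_response.db classifier.pkl
args = parse_args(["disaster_response.db", "classifier.pkl"])
print(args.database_filepath, args.model_filepath)
```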