# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd 
import re
import nltk
nltk.download(['punkt', 'wordnet','stopwords'])
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_recall_fscore_support
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
import pickle

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('InsertTableName', con=engine)
#X = 
#Y = 

In [3]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# Data visualization
# 1. Top categories
df.iloc[:,4:].sum().sort_values(ascending=False)

related                   20282
aid_related               10860
weather_related            7297
direct_report              5075
request                    4474
other_aid                  3446
food                       2923
earthquake                 2455
storm                      2443
shelter                    2314
floods                     2155
medical_help               2084
infrastructure_related     1705
water                      1672
other_weather              1376
buildings                  1333
medical_products           1313
transport                  1201
death                      1194
other_infrastructure       1151
refugees                    875
military                    860
search_and_rescue           724
money                       604
electricity                 532
cold                        530
security                    471
clothing                    405
aid_centers                 309
missing_people              298
hospitals                   283
fire    

In [20]:
# Visualization 2: number of messages per genre
df.groupby('genre').count()['message']

genre
direct    10766
news      13054
social     2396
Name: message, dtype: int64

In [21]:
# Visualization 3: top categories for direct messages
df[df['genre'] == 'direct'].iloc[:,4:].sum().sort_values(ascending=False)

related                   7446
aid_related               4338
request                   3696
direct_report             3613
food                      1807
other_aid                 1575
weather_related           1521
shelter                   1152
water                      836
earthquake                 796
medical_help               592
medical_products           471
buildings                  391
infrastructure_related     327
storm                      315
floods                     304
death                      254
clothing                   247
search_and_rescue          216
transport                  210
other_weather              207
other_infrastructure       186
refugees                   174
money                      148
security                   131
missing_people              86
electricity                 81
aid_centers                 78
cold                        63
hospitals                   54
military                    46
offer                       46
fire    

In [4]:
# X is the input data: the text of each message
# y holds the 36 categories to predict, which are the last 36 columns

X = df['message']
y = df[df.columns[-36:]]

In [5]:
y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


All 36 columns are already encoded as binary (0/1) indicator columns, so no further preprocessing is required.
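
The claim above is easy to verify rather than assume. The following is a minimal sketch on a small made-up frame (`toy_y` and its values are illustrative, not taken from the dataset); the same two expressions could be run against the real `y`.

```python
import pandas as pd

# Hypothetical stand-in for the label frame `y`; values are made up.
toy_y = pd.DataFrame({
    'related': [1, 0, 1],
    'request': [0, 0, 1],
    'offer':   [0, 0, 0],
})

# True only if every cell in every column is 0 or 1.
is_binary = bool(toy_y.isin([0, 1]).all().all())

# Names of any columns that contain values other than 0 or 1.
non_binary_cols = [c for c in toy_y.columns if not toy_y[c].isin([0, 1]).all()]

print(is_binary, non_binary_cols)
```

If `non_binary_cols` turns out to be non-empty on the real data, those columns would need cleaning before training.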

### 2. Write a tokenization function to process your text data

In [6]:
def tokenize(text):
    """
    Tokenize and preprocess text:
    1. Normalize the text by converting it to lower case
    2. Replace punctuation symbols with spaces
    3. Tokenize it using the NLTK word tokenizer
    4. Remove English stop words
    5. Lemmatize each token, first as a noun, then as a verb
    """
    # 1. Normalize the text by converting it to lower case
    text = text.lower()
    # 2. Replace punctuation symbols with spaces
    text = re.sub('[^ a-zA-Z0-9]', ' ', text)
    # 3. Tokenize using the word tokenizer
    tokens = word_tokenize(text)
    # 4. Remove stop words (a set gives fast membership tests)
    stwords = set(stopwords.words('english'))
    tokens_without_stop = [w for w in tokens if w not in stwords]
    # 5. Lemmatize as nouns, then as verbs, reusing one lemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmed = [lemmatizer.lemmatize(w) for w in tokens_without_stop]
    clean_tokens = [lemmatizer.lemmatize(w, pos='v') for w in lemmed]

    return clean_tokens
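
To see what the first steps of `tokenize` do without downloading any NLTK data, here is a dependency-free sketch. It is a simplification, not the function above: `normalize_and_split` and `DEMO_STOP_WORDS` are hypothetical names, whitespace splitting stands in for `word_tokenize`, the stop-word set is a tiny made-up subset, and lemmatization is left out.

```python
import re

# Tiny illustrative stop-word set (assumption; NLTK's English list is much longer).
DEMO_STOP_WORDS = {'is', 'the', 'or', 'it', 'not'}

def normalize_and_split(text, stop_words=DEMO_STOP_WORDS):
    """Lower-case, replace punctuation with spaces, split, drop stop words."""
    text = text.lower()
    # Same character class as in tokenize(): keep spaces, letters, and digits.
    text = re.sub('[^ a-zA-Z0-9]', ' ', text)
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]

demo_tokens = normalize_and_split('Is the Hurricane over, or is it not over?')
print(demo_tokens)
```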

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

### Is it multi-class or multi-label?

Multi-label learning is a special case of multi-target learning (also known as multi-dimensional or multi-objective learning). In multi-target learning each target can take one of several values, whereas in multi-label learning each label is binary and only indicates whether it is relevant or not. [ref](https://www.quora.com/Is-multi-target-classification-the-same-as-multi-label-classification-from-scikit-learn-If-not-what-are-some-relevant-articles)

A better explanation is given in this [ref](https://www.analyticsvidhya.com/blog/2017/08/introduction-to-multi-label-classification/). The image below makes the distinction between multi-class and multi-label clear:
![image](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/08/25230542/Screen-Shot-2017-08-20-at-12.20.24-AM.png)
**Certificate** can take only one value out of many possible values, hence it is **multi-class**,
whereas **genre of movie** can take multiple values at once, hence it is **multi-label**.

Here we need to predict 36 categories, and a message can belong to more than one category at a time, so this is a multi-label problem.

scikit-learn's MultiOutputClassifier fits one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification.
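
The "one classifier per target" strategy can be observed directly on a toy problem. The sketch below uses made-up data (`X_toy`, `Y_toy`) with three binary labels per sample; after fitting, the wrapper exposes one fitted estimator per label column in `estimators_`.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data (illustrative only): 6 samples, 2 features, 3 binary labels each.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
Y_toy = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [1, 0, 0],
                  [1, 1, 0],
                  [1, 0, 1],
                  [1, 1, 1]])

toy_moc = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
toy_moc.fit(X_toy, Y_toy)

# One independent decision tree was fitted for each label column.
n_fitted = len(toy_moc.estimators_)

# predict() returns one column of 0/1 predictions per label.
toy_pred = toy_moc.predict(X_toy)
print(n_fitted, toy_pred.shape)
```

Because each label gets its own independent model, correlations between labels are not exploited; that is the price of this simple strategy.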

In [7]:
moc = MultiOutputClassifier(DecisionTreeClassifier())

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', moc)
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
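
As a sketch of the column-by-column approach described above, the loop below calls `classification_report` once per category. The arrays `demo_true` and `demo_pred` and the two column names are made up for illustration; with the real data they would be `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import classification_report

# Illustrative labels for two categories (not real results).
demo_columns = ['related', 'request']
demo_true = np.array([[1, 0], [1, 1], [0, 0], [1, 1]])
demo_pred = np.array([[1, 0], [1, 0], [0, 0], [1, 1]])

reports = {}
for i, col in enumerate(demo_columns):
    # Column i of the prediction array corresponds to category `col`.
    reports[col] = classification_report(demo_true[:, i], demo_pred[:, i])
    print(col)
    print(reports[col])
```

The notebook itself uses `precision_recall_fscore_support` instead, which returns numbers that are easier to collect into a DataFrame than the formatted text report.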

In [9]:
# Compute per-category metrics and collect them in a DataFrame.
def get_results(y_test, y_pred):
    rows = []
    for num, cat in enumerate(y_test.columns):
        # Weighted average over the classes present in this category
        precision, recall, f_score, _ = precision_recall_fscore_support(
            y_test[cat], y_pred[:, num], average='weighted')
        rows.append({'Category': cat, 'f_score': f_score,
                     'precision': precision, 'recall': recall})
    # Index from 1; avoids the deprecated DataFrame.set_value
    results = pd.DataFrame(rows, columns=['Category', 'f_score', 'precision', 'recall'],
                           index=range(1, len(rows) + 1))
    print('Aggregated f_score:', results['f_score'].mean())
    print('Aggregated precision:', results['precision'].mean())
    print('Aggregated recall:', results['recall'].mean())
    return results

In [10]:
results = get_results(y_test, y_pred)
results

Aggregated f_score: 0.931496162888
Aggregated precision: 0.930610558694
Aggregated recall: 0.932530600481



| | Category | f_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.766888 | 0.769773 | 0.764571 |
| 2 | request | 0.851313 | 0.849568 | 0.853372 |
| 3 | offer | 0.989478 | 0.988115 | 0.990845 |
| 4 | aid_related | 0.714681 | 0.714265 | 0.715441 |
| 5 | medical_help | 0.903928 | 0.901373 | 0.906774 |
| 6 | medical_products | 0.943097 | 0.941969 | 0.944309 |
| 7 | search_and_rescue | 0.960264 | 0.959327 | 0.961245 |
| 8 | security | 0.966987 | 0.965337 | 0.968721 |
| 9 | military | 0.957377 | 0.955668 | 0.959567 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |


### 6. Improve your model
Use grid search to find better parameters. 

In [20]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7f66b6994598>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, presort=False, random_state=None,
               splitter='best'),
              n_jobs=1))],
 

### Overfitting vs Underfitting

In tree-based algorithms such as decision trees or random forests, we can control overfitting and underfitting by controlling:
- the maximum depth of the tree
- the maximum number of leaf nodes
- the minimum number of samples per leaf

Hence, let's create a grid search over these three parameters.
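
Before running the search it helps to know how much work it implies. The sketch below enumerates the same parameter values used in the next cell with plain `itertools`. Each key follows the path `clf` (pipeline step) → `estimator` (the tree wrapped by `MultiOutputClassifier`) → parameter name. The fold count `n_folds = 3` is an assumption; the default number of cross-validation folds depends on the installed scikit-learn version.

```python
from itertools import product

# Same values as the grid searched in this notebook.
param_grid = {
    'clf__estimator__max_depth': [5, 10, 20, None],
    'clf__estimator__min_samples_leaf': [1, 2, 5],
    'clf__estimator__max_leaf_nodes': [5, 10, None],
}

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
n_candidates = len(candidates)

# Assumed number of cross-validation folds (version-dependent default).
n_folds = 3
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)
```

Each of those fits trains 36 trees, one per category, which is why the search takes noticeably longer than a single `pipeline.fit`.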

In [21]:
parameters = {'clf__estimator__max_depth': [5,10, 20,None],
              'clf__estimator__min_samples_leaf':[1,2,5],
             'clf__estimator__max_leaf_nodes':[5,10,None]}

cv = GridSearchCV(pipeline, parameters)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [22]:
# Fit on the pandas objects directly; as_matrix() is deprecated
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
results2 = get_results(y_test, y_pred)
results2


Aggregated f_score: 0.937710273344
Aggregated precision: 0.938493852245
Aggregated recall: 0.946173668328



| | Category | f_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.716788 | 0.742476 | 0.770369 |
| 2 | request | 0.878412 | 0.882172 | 0.889991 |
| 3 | offer | 0.992452 | 0.99086 | 0.994049 |
| 4 | aid_related | 0.716782 | 0.742255 | 0.730851 |
| 5 | medical_help | 0.908158 | 0.908853 | 0.925999 |
| 6 | medical_products | 0.95299 | 0.952615 | 0.959719 |
| 7 | search_and_rescue | 0.965048 | 0.96418 | 0.972078 |
| 8 | security | 0.971561 | 0.967357 | 0.978486 |
| 9 | military | 0.9591 | 0.956583 | 0.965212 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |


In [31]:
cv.best_estimator_

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ion_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
           n_jobs=1))])

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

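**AdaBoost Algorithm**

AdaBoost is an ensemble algorithm: it trains a sequence of weak learners, and each new learner focuses on the examples that the previous ones misclassified.
By default the weak learner in scikit-learn is a decision tree of depth 1 (a decision stump).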
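
To make the idea of "focusing on part of the problem" concrete, here is a sketch of one reweighting round of classic discrete AdaBoost for labels in {-1, +1}. It illustrates the textbook update and is not scikit-learn's exact implementation; `adaboost_round`, `labels`, and `guesses` are hypothetical names, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def adaboost_round(weights, labels, guesses):
    """One round of discrete AdaBoost reweighting for labels in {-1, +1}."""
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, t, p in zip(weights, labels, guesses) if t != p)
    # Vote strength of this weak learner (assumes 0 < err < 1).
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down.
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, labels, guesses)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

labels = [1, 1, -1, -1]
guesses = [1, -1, -1, -1]          # the weak learner gets the 2nd sample wrong
start_weights = [0.25, 0.25, 0.25, 0.25]

alpha, new_weights = adaboost_round(start_weights, labels, guesses)
print(alpha, new_weights)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.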



In [11]:
# Note: scikit-learn 1.2 and later rename `base_estimator` to `estimator`
moc = MultiOutputClassifier(AdaBoostClassifier(base_estimator=DecisionTreeClassifier()))

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', moc)
    ])

In [12]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
get_results(y_test, y_pred)

Aggregated f_score: 0.936666872922
Aggregated precision: 0.935164868241
Aggregated recall: 0.942469399519



| | Category | f_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.804268 | 0.802393 | 0.810955 |
| 2 | request | 0.84456 | 0.847066 | 0.842386 |
| 3 | offer | 0.99009 | 0.988123 | 0.992066 |
| 4 | aid_related | 0.725923 | 0.729301 | 0.730546 |
| 5 | medical_help | 0.903293 | 0.89986 | 0.922948 |
| 6 | medical_products | 0.945656 | 0.942474 | 0.950412 |
| 7 | search_and_rescue | 0.967206 | 0.965979 | 0.973451 |
| 8 | security | 0.968972 | 0.963041 | 0.97635 |
| 9 | military | 0.95161 | 0.951789 | 0.962313 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |


In [13]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fa24339e2f0>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=AdaBoostClassifier(algorithm='SAMME.R',
             base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, presort=False, random_state

In [14]:
parameters = {'clf__estimator__n_estimators':[25,50], 
              'clf__estimator__learning_rate':[0.1, 0.5],
              'clf__estimator__base_estimator__max_depth' : [1,2]
             }

In [15]:
cv = GridSearchCV(pipeline, parameters)
# Fit on the pandas objects directly; as_matrix() is deprecated
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
get_results(y_test, y_pred)

  


Aggregated f_score: 0.938485577868
Aggregated precision: 0.94008622793
Aggregated recall: 0.947818126335



| | Category | f_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.725952 | 0.76412 | 0.783949 |
| 2 | request | 0.878848 | 0.882247 | 0.889533 |
| 3 | offer | 0.990625 | 0.988129 | 0.993134 |
| 4 | aid_related | 0.746151 | 0.756556 | 0.753128 |
| 5 | medical_help | 0.910735 | 0.910244 | 0.927678 |
| 6 | medical_products | 0.949752 | 0.949335 | 0.958346 |
| 7 | search_and_rescue | 0.966819 | 0.969186 | 0.974825 |
| 8 | security | 0.969795 | 0.964288 | 0.978334 |
| 9 | military | 0.960255 | 0.961304 | 0.967196 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |


### 9. Export your model as a pickle file

In [16]:
# Use a context manager so the file is closed once the model is written
with open('model.pkl', 'wb') as f:
    pickle.dump(cv, f)
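
Loading the model back is the mirror image of saving it. The sketch below round-trips a small stand-in object through a temporary file; `stand_in_model` and the file name `demo_model.pkl` are placeholders chosen for illustration, not the real trained model.

```python
import os
import pickle
import tempfile

# Placeholder object standing in for the fitted GridSearchCV instance.
stand_in_model = {'name': 'demo', 'params': {'max_depth': 5}}

# Write to a temporary directory so the example leaves no files behind in the project.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
with open(demo_path, 'wb') as f:
    pickle.dump(stand_in_model, f)

# Reading uses 'rb' mode and pickle.load.
with open(demo_path, 'rb') as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

One caveat for the real model: pickle stores functions by reference, so the `tokenize` function must be importable under the same name in whatever process later loads `model.pkl`.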

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.