# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [3]:
# import libraries
import pandas as pd
import numpy as np

# Show full column contents when printing (None replaces the deprecated -1 in pandas >= 1.0)
pd.set_option('display.max_colwidth', None)


from sqlalchemy import create_engine

import re

import nltk
nltk.download(['punkt', 'wordnet','stopwords'])

from nltk.stem.wordnet import WordNetLemmatizer

from nltk.tokenize import word_tokenize

from nltk.corpus import stopwords

from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from sklearn.multioutput import MultiOutputClassifier

from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC


from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix,classification_report

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# load data from database
engine = create_engine('sqlite:///data/DisasterResponse.db')
df = pd.read_sql("SELECT * FROM DataFrame_generated", engine)
X = df['message']
Y = df.drop(columns=['id','genre','original','message'])
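
The load-and-split step can be tried without the real database. The sketch below is a minimal illustration, not the notebook's setup: it uses the standard-library `sqlite3` driver with an in-memory database instead of SQLAlchemy, and the rows are invented. Only the table and column names are taken from the cell above.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the real table; column names mirror the notebook,
# the rows themselves are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["we need water", "storm damaged the road", "thank you"],
    "original": ["o1", "o2", "o3"],
    "genre": ["direct", "news", "social"],
    "related": [1, 1, 0],
    "water": [1, 0, 0],
})

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
toy.to_sql("DataFrame_generated", conn, index=False)

loaded = pd.read_sql("SELECT * FROM DataFrame_generated", conn)
conn.close()

# Features: the raw text. Targets: every remaining category column.
X_toy = loaded["message"]
Y_toy = loaded.drop(columns=["id", "genre", "original", "message"])

print(list(Y_toy.columns))
```

Dropping the four non-label columns leaves one target column per category, which is the shape `MultiOutputClassifier` expects for `Y`.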

In [5]:
print("First 5 rows of Y: \n",Y.head())
print("\nFirst 5 rows of X: \n",X.head())

First 5 rows of Y: 
    related  request  offer  aid_related  medical_help  medical_products  \
0  1        0        0      0            0             0                  
1  1        0        0      1            0             0                  
2  1        0        0      0            0             0                  
3  1        1        0      1            0             1                  
4  1        0        0      0            0             0                  

   search_and_rescue  security  military  child_alone      ...        \
0  0                  0         0         0                ...         
1  0                  0         0         0                ...         
2  0                  0         0         0                ...         
3  0                  0         0         0                ...         
4  0                  0         0         0                ...         

   aid_centers  other_infrastructure  weather_related  floods  storm  fire  \
0  0            0

In [6]:
for column in Y.columns:
    print("Frequencies for {} \n".format(column),Y[column].value_counts()," \n ")

# child_alone will be dropped: it contains only 0 values, so it carries no information for modelling.

Frequencies for related 
 1    19906
0    6122 
2    188  
Name: related, dtype: int64  
 
Frequencies for request 
 0    21742
1    4474 
Name: request, dtype: int64  
 
Frequencies for offer 
 0    26098
1    118  
Name: offer, dtype: int64  
 
Frequencies for aid_related 
 0    15356
1    10860
Name: aid_related, dtype: int64  
 
Frequencies for medical_help 
 0    24132
1    2084 
Name: medical_help, dtype: int64  
 
Frequencies for medical_products 
 0    24903
1    1313 
Name: medical_products, dtype: int64  
 
Frequencies for search_and_rescue 
 0    25492
1    724  
Name: search_and_rescue, dtype: int64  
 
Frequencies for security 
 0    25745
1    471  
Name: security, dtype: int64  
 
Frequencies for military 
 0    25356
1    860  
Name: military, dtype: int64  
 
Frequencies for child_alone 
 0    26216
Name: child_alone, dtype: int64  
 
Frequencies for water 
 0    24544
1    1672 
Name: water, dtype: int64  
 
Frequencies for food 
 0    23293
1    2923 
Name: food, dty
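
The `child_alone` observation was read off the frequency tables by eye. As a rough sketch, the same check can be automated with `nunique`. The data below is invented and `constant_cols` is a name chosen for this example.

```python
import pandas as pd

# Invented labels: `child_alone` never takes the value 1.
Y_toy = pd.DataFrame({
    "related": [1, 0, 1, 2],
    "request": [0, 1, 0, 0],
    "child_alone": [0, 0, 0, 0],
})

# A column with a single distinct value carries no signal for a classifier.
constant_cols = [col for col in Y_toy.columns if Y_toy[col].nunique() == 1]
Y_clean = Y_toy.drop(columns=constant_cols)

print(constant_cols)
```

Applied to the real `Y`, this approach would flag any category with a single value without having to scan 36 frequency tables.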

### 2. Write a tokenization function to process your text data

In [7]:
def tokenize(text):
    """
    Normalize, tokenize, and lemmatize a message.

    text: the string to be transformed into tokens
    Returns a list of lemmatized tokens with English stop words removed.
    """
    # Normalize case and replace punctuation with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    # Split the text into word tokens
    tokens = word_tokenize(text)

    # Set up the stop-word list and the lemmatizer used to normalize the tokens
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    # Lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens
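
To see what the normalization and stop-word steps do in isolation, here is a simplified sketch that needs no NLTK downloads. It is not the notebook's tokenizer: it splits on whitespace instead of calling `word_tokenize`, skips lemmatization, and uses a tiny hypothetical stop-word set (`TOY_STOP_WORDS`).

```python
import re

# Hypothetical, tiny stop-word list used only for this sketch;
# the notebook uses NLTK's full English list instead.
TOY_STOP_WORDS = {"to", "the", "we", "a"}

def simple_tokenize(text):
    """Lowercase, replace non-alphanumerics with spaces, split, drop stop words."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in TOY_STOP_WORDS]

tokens = simple_tokenize("We need water!! Send help to 5th Ave.")
print(tokens)  # ['need', 'water', 'send', 'help', '5th', 'ave']
```

The regular expression is the same one used in `tokenize`, so punctuation handling matches. Only the tokenizer and lemmatizer are swapped out.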
    

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [9]:
pipeline = Pipeline([
        ('vect',CountVectorizer(tokenizer=tokenize)),
        ('tfidf',TfidfTransformer()),
        ('clf',MultiOutputClassifier(KNeighborsClassifier()))
    ])
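
The three-step structure can be exercised on a toy corpus to see the shape of the output. This is a sketch under stated assumptions: the messages and the two label columns are invented, the default `CountVectorizer` tokenizer replaces the custom `tokenize`, and `n_neighbors=1` is used only because there are four training rows.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

# Invented messages and two invented label columns (water, storm).
messages = [
    "we need clean water",
    "no water in the village",
    "the storm destroyed our roof",
    "strong storm and wind tonight",
]
labels = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),          # default tokenizer, not the notebook's
    ("tfidf", TfidfTransformer()),
    # n_neighbors=1 only because the toy corpus has four rows
    ("clf", MultiOutputClassifier(KNeighborsClassifier(n_neighbors=1))),
])
toy_pipeline.fit(messages, labels)

pred = toy_pipeline.predict(["water is urgently needed"])
print(pred.shape)  # (1, 2): one row per message, one column per category
```

`MultiOutputClassifier` fits one copy of the base estimator per label column, so the prediction has as many columns as `labels`.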

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [10]:
# Split X and Y into train and test sets (child_alone is dropped because it only contains 0 values)
X_train, X_test, y_train, y_test = train_test_split(X, Y.drop(columns=['child_alone'],axis=1))

#Train pipeline
pipeline.fit(X_train, y_train)


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
           n_jobs=1))])
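
The split in this section passes no `random_state`, so each run draws a different partition. That is one plausible reason the support counts differ between reports in this notebook. A minimal sketch of a reproducible split follows; the arrays and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20)
y_toy = np.arange(20) % 2

# Same seed -> same partition on every call.
split_a = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

# Index 1 of the returned list is the test portion of X.
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows, len(split_a[1]))
```

The 25% test size written out here matches the `train_test_split` default, consistent with the 6554 test rows out of 26216 reported in the outputs.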

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [8]:
# Predicted values
y_pred = pd.DataFrame(pipeline.predict(X_test),columns=y_test.columns)

#Print results
for column in y_test.columns:
    print("Classification report for {}:".format(column),"\n")
    print(classification_report(y_test[column], y_pred[column]))
    print("\n")
    print("Labels for {}:".format(column),np.unique(y_pred[column]))
    print("\n")
    print("Confusion Matrix for {}:".format(column),"\n")
    print(confusion_matrix(y_test[column], y_pred[column], labels=np.unique(y_pred[column])))
    print("\n")
    accuracy=(y_test[column].reset_index(drop=True)==y_pred[column].reset_index(drop=True)).mean()
    print("Accuracy of prediction for {}:".format(column),accuracy)
    print("\n")
    print("----------------------------------------------------")
    print("\n")
    print("\n")




Classification report for related: 

             precision    recall  f1-score   support

          0       0.58      0.33      0.42      1544
          1       0.81      0.92      0.86      4969
          2       0.27      0.34      0.30        41

avg / total       0.76      0.78      0.76      6554



Labels for related: [0 1 2]


Confusion Matrix for related: 

[[ 506 1031    7]
 [ 357 4581   31]
 [   2   25   14]]


Accuracy of prediction for related: 0.778303326213


----------------------------------------------------




Classification report for request: 

             precision    recall  f1-score   support

          0       0.85      1.00      0.91      5475
          1       0.79      0.08      0.15      1079

avg / total       0.84      0.85      0.79      6554



Labels for request: [0 1]


Confusion Matrix for request: 

[[5451   24]
 [ 990   89]]


Accuracy of prediction for request: 0.845285321941


----------------------------------------------------




Classificat

  'precision', 'predicted', average, warn_for)


[[6258    0]
 [ 291    5]]


Accuracy of prediction for death: 0.955599633811


----------------------------------------------------




Classification report for other_aid: 

             precision    recall  f1-score   support

          0       0.87      1.00      0.93      5706
          1       0.47      0.01      0.02       848

avg / total       0.82      0.87      0.81      6554



Labels for other_aid: [0 1]


Confusion Matrix for other_aid: 

[[5696   10]
 [ 839    9]]


Accuracy of prediction for other_aid: 0.870460787305


----------------------------------------------------




Classification report for infrastructure_related: 

             precision    recall  f1-score   support

          0       0.94      1.00      0.97      6137
          1       0.00      0.00      0.00       417

avg / total       0.88      0.94      0.91      6554



Labels for infrastructure_related: [0]


Confusion Matrix for infrastructure_related: 

[[6137]]


Accuracy of prediction for infrast
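
The report loop passes `labels=np.unique(y_pred[column])` to `confusion_matrix`, so a class that is never predicted disappears from the matrix. This is why `infrastructure_related` shows the 1x1 matrix `[[6137]]` even though 417 test rows belong to class 1. One possible alternative, sketched on invented data, is to take the labels from both the true and the predicted values. The helper name `confusion_with_all_labels` is made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented data: the classifier never predicts class 1 for `infrastructure`.
y_true = pd.DataFrame({"infrastructure": [0, 0, 0, 1, 1]})
y_hat = pd.DataFrame({"infrastructure": [0, 0, 0, 0, 0]})

def confusion_with_all_labels(true_col, pred_col):
    """Confusion matrix over labels seen in either the truth or the predictions."""
    labels = np.union1d(true_col, pred_col)
    return labels, confusion_matrix(true_col, pred_col, labels=labels)

labels, cm = confusion_with_all_labels(y_true["infrastructure"], y_hat["infrastructure"])
print(labels)
print(cm)
```

With the union of labels, the second row of the matrix makes the missed positives visible instead of silently dropping them.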

### 6. Improve your model
Use grid search to find better parameters. 

In [10]:
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_features': (None,500),
    'tfidf__use_idf': (True, False),
    'clf__estimator' : (
        RandomForestClassifier(),
        
        KNeighborsClassifier(),
        
        MultinomialNB(),
        
        AdaBoostClassifier()
        
    )
}    

cv = GridSearchCV(pipeline, param_grid=parameters)
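
Before fitting, it can help to know how many candidates the grid contains, since every candidate is refit once per cross-validation fold. The sketch below rebuilds the same dictionary and counts the combinations with scikit-learn's `ParameterGrid`. The name `n_candidates` is chosen for this example.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

# Same grid as in the cell above.
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_features': (None, 500),
    'tfidf__use_idf': (True, False),
    'clf__estimator': (
        RandomForestClassifier(),
        KNeighborsClassifier(),
        MultinomialNB(),
        AdaBoostClassifier(),
    ),
}

# 2 n-gram ranges x 2 vocabulary sizes x 2 idf settings x 4 estimators
n_candidates = len(ParameterGrid(parameters))
print(n_candidates)  # 32
```

The total number of fits is `n_candidates` times the number of folds, plus one final refit. The default fold count depends on the scikit-learn version (3 before 0.22, 5 since).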

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [11]:
#Fit the model
cv.fit(X_train, y_train)

#Take the predictions
y_pred =  pd.DataFrame(cv.predict(X_test),columns=y_test.columns)



#Results

print("\nBest Parameters of GridSearchCV: \n", cv.best_params_)


for column in y_test.columns:
    print("Classification report for {}:".format(column),"\n")
    print(classification_report(y_test[column], y_pred[column]))
    print("\n")
    print("Labels for {}:".format(column),np.unique(y_pred[column]))
    print("\n")
    print("Confusion Matrix for {}:".format(column),"\n")
    print(confusion_matrix(y_test[column], y_pred[column], labels=np.unique(y_pred[column])))
    print("\n")
    accuracy=(y_test[column].reset_index(drop=True)==y_pred[column].reset_index(drop=True)).mean()
    print("Accuracy of prediction for {}:".format(column),accuracy)
    print("\n")
    print("----------------------------------------------------")
    print("\n")
    print("\n")


Best Parameters of GridSearchCV: {'clf__estimator': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'), 'tfidf__use_idf': False, 'vect__max_features': 500, 'vect__ngram_range': (1, 1)} 

Classification report for related: 

             precision    recall  f1-score   support

          0       0.34      0.59      0.43      1513
          1       0.84      0.66      0.74      4993
          2       0.14      0.10      0.12        48

avg / total       0.72      0.64      0.66      6554



Labels for related: [0 1 2]


Confusion Matrix for related: 

[[ 895  609    9]
 [1688 3282   23]
 [  28   15    5]]


Accuracy of prediction for related: 0.638083613061


----------------------------------------------------




Classification report for request: 

             precision    recall  f1-score   support

          0       0.90      0.97      0.93      5446
          1       0

  'precision', 'predicted', average, warn_for)





Accuracy of prediction for missing_people: 0.98870918523


----------------------------------------------------




Classification report for refugees: 

             precision    recall  f1-score   support

          0       0.96      1.00      0.98      6307
          1       0.56      0.02      0.04       247

avg / total       0.95      0.96      0.95      6554



Labels for refugees: [0 1]


Confusion Matrix for refugees: 

[[6303    4]
 [ 242    5]]


Accuracy of prediction for refugees: 0.96246566982


----------------------------------------------------




Classification report for death: 

             precision    recall  f1-score   support

          0       0.96      1.00      0.98      6262
          1       0.83      0.07      0.13       292

avg / total       0.95      0.96      0.94      6554



Labels for death: [0 1]


Confusion Matrix for death: 

[[6258    4]
 [ 272   20]]


Accuracy of prediction for death: 0.957888312481


-------------------------------------

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

The best parameters found by the first grid search are:

Best Parameters of GridSearchCV: {'clf__estimator': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'), 'tfidf__use_idf': False, 'vect__max_features': 500, 'vect__ngram_range': (1, 1)} 
           

We therefore fix the vectorizer and TF-IDF settings to these values and run a second grid search that tunes only the `KNeighborsClassifier` hyperparameters (`n_neighbors` and `weights`).

In [18]:
#New pipeline:
pipeline2 = Pipeline([
        ('vect',CountVectorizer(tokenizer=tokenize, ngram_range=(1, 1), max_features=500)),
        ('tfidf',TfidfTransformer(use_idf=False)),
        ('clf',MultiOutputClassifier(KNeighborsClassifier()))
    ])

#Set up new parameters for the new gridsearch which will optimize the classifier

parameters2 = {
    'clf__estimator' : (
        KNeighborsClassifier(n_neighbors=3,weights='uniform'),
        KNeighborsClassifier(n_neighbors=3,weights='distance'),
 
        KNeighborsClassifier(n_neighbors=4,weights='uniform'),
        KNeighborsClassifier(n_neighbors=4,weights='distance'),
 
        KNeighborsClassifier(n_neighbors=5,weights='uniform'),
        KNeighborsClassifier(n_neighbors=5,weights='distance')
    )

}    

#new gridsearch
cv2 = GridSearchCV(pipeline2, param_grid=parameters2)
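
The same six candidates can also be described by tuning the nested hyperparameters directly rather than listing whole estimator objects. This is an alternative formulation, not what the notebook ran. The names `sketch_pipeline` and `nested_grid` are invented, and the custom `tokenize` is left out to keep the sketch self-contained.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import ParameterGrid

sketch_pipeline = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 1), max_features=500)),
    ("tfidf", TfidfTransformer(use_idf=False)),
    ("clf", MultiOutputClassifier(KNeighborsClassifier())),
])

# Double underscores walk down the nesting: step -> wrapper -> base estimator.
nested_grid = {
    "clf__estimator__n_neighbors": [3, 4, 5],
    "clf__estimator__weights": ["uniform", "distance"],
}

# Check that every key is a parameter the pipeline actually exposes.
valid_names = sketch_pipeline.get_params()
all_valid = all(name in valid_names for name in nested_grid)
n_candidates = len(ParameterGrid(nested_grid))
print(all_valid, n_candidates)
```

With this form, `best_params_` would report plain values such as `n_neighbors` and `weights` instead of a full estimator repr, which tends to be easier to read.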


In [21]:
#Fit the model for the new gridsearch
cv2.fit(X_train, y_train)

#Take the predictions of the new gridsearch
y_pred =  pd.DataFrame(cv2.predict(X_test),columns=y_test.columns)

#Results of new gridsearch

print("\nBest Parameters of GridSearchCV: \n", cv2.best_params_)


for column in y_test.columns:
    print("Classification report for {}:".format(column),"\n")
    print(classification_report(y_test[column], y_pred[column]))
    print("\n")
    print("Labels for {}:".format(column),np.unique(y_pred[column]))
    print("\n")
    print("Confusion Matrix for {}:".format(column),"\n")
    print(confusion_matrix(y_test[column], y_pred[column], labels=np.unique(y_pred[column])))
    print("\n")
    accuracy=(y_test[column].reset_index(drop=True)==y_pred[column].reset_index(drop=True)).mean()
    print("Accuracy of prediction for {}:".format(column),accuracy)
    print("\n")
    print("----------------------------------------------------")
    print("\n")


Best Parameters of GridSearchCV: 
 {'clf__estimator': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='uniform')}
Classification report for related: 

             precision    recall  f1-score   support

          0       0.37      0.70      0.48      1544
          1       0.87      0.61      0.71      4969
          2       0.03      0.07      0.04        41

avg / total       0.74      0.63      0.66      6554



Labels for related: [0 1 2]


Confusion Matrix for related: 

[[1087  445   12]
 [1865 3018   86]
 [  21   17    3]]


Accuracy of prediction for related: 0.626792798291


----------------------------------------------------


Classification report for request: 

             precision    recall  f1-score   support

          0       0.89      0.98      0.93      5475
          1       0.79      0.38      0.51      1079

avg / total       0.87      0.88      0.86      

  'precision', 'predicted', average, warn_for)


[[5655   51]
 [ 792   56]]


Accuracy of prediction for other_aid: 0.871376258773


----------------------------------------------------


Classification report for infrastructure_related: 

             precision    recall  f1-score   support

          0       0.94      1.00      0.97      6137
          1       0.14      0.00      0.00       417

avg / total       0.89      0.94      0.91      6554



Labels for infrastructure_related: [0 1]


Confusion Matrix for infrastructure_related: 

[[6131    6]
 [ 416    1]]


Accuracy of prediction for infrastructure_related: 0.935611840098


----------------------------------------------------


Classification report for transport: 

             precision    recall  f1-score   support

          0       0.95      1.00      0.98      6244
          1       0.70      0.02      0.04       310

avg / total       0.94      0.95      0.93      6554



Labels for transport: [0 1]


Confusion Matrix for transport: 

[[6241    3]
 [ 303    7]]


A

In [11]:
# So we keep the following model:
pipeline3 = Pipeline([
        ('vect',CountVectorizer(tokenizer=tokenize, ngram_range=(1, 1), max_features=500)),
        ('tfidf',TfidfTransformer(use_idf=False)),
        ('clf',MultiOutputClassifier(KNeighborsClassifier(n_neighbors=4)))
    ])

# weights='uniform' is not set explicitly because it is the default value

# Fit the final pipeline so that the exported model is able to make predictions
pipeline3.fit(X_train, y_train)

### 9. Export your model as a pickle file

In [15]:
import pickle

Pkl_Filename = "Pickle_F8_Model.pkl"  

with open(Pkl_Filename, 'wb') as file:  
    pickle.dump(pipeline3, file)

In [18]:
Pkl_Filename

'Pickle_F8_Model.pkl'
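
A quick way to check an exported model is to load it back and compare predictions. The sketch below does this round trip with a small pipeline fitted on invented data and a hypothetical file name in a temporary directory. It uses the default tokenizer on purpose: pickle stores a custom function such as `tokenize` by reference only, so a script that loads the real model needs that function defined or importable under the same module path.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

# Invented training data with two label columns.
messages = [
    "we need clean water",
    "no water in the village",
    "the storm destroyed our roof",
    "strong storm and wind tonight",
]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(KNeighborsClassifier(n_neighbors=1))),
]).fit(messages, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored pipeline should behave exactly like the original.
query = ["storm warning for tonight"]
same = np.array_equal(model.predict(query), restored.predict(query))
print(same)
```

Only a fitted pipeline is useful after loading. Calling `predict` on a pipeline that was pickled before `fit` raises an error.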

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.