# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np 
from sqlalchemy import create_engine
import re
import pickle


from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import classification_report as cr
from sklearn.metrics import f1_score,accuracy_score,precision_score,recall_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.base import BaseEstimator, TransformerMixin

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings('ignore')
# uncomment on the first run to fetch the NLTK resources used below
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('stopwords')

In [2]:
# load data from database
engine = create_engine('sqlite:///disaster_messages.db')
df = pd.read_sql_table('message categories',engine)
X = df['message']
y = df.iloc[:,4:]

### 2. Write a tokenization function to process your text data

In [3]:
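def tokenize(text):
    """
    Clean, tokenize, lemmatize and stem text.
    """
    # replace every non-word character with a space
    text = re.sub(r'\W',' ',text)
    token = word_tokenize(text)
    # create the lemmatizer and stemmer once instead of once per token
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    lemm = [lemmatizer.lemmatize(i.lower(),pos='v') for i in token]
    stem = [stemmer.stem(i) for i in lemm]
    return stem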
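
To make the cleaning step concrete, here is a minimal stdlib-only sketch of what the regex in `tokenize` does to a message. `rough_tokenize` is a hypothetical stand-in, not the function above: it uses `str.split` in place of `word_tokenize` and skips lemmatization and stemming, so its output only approximates the real tokens.

```python
import re

def rough_tokenize(text):
    """Stand-in for tokenize(): strip punctuation, split on whitespace, lowercase."""
    # same regex as tokenize(): every non-word character becomes a space
    cleaned = re.sub(r'\W', ' ', text)
    # str.split is an assumption standing in for nltk's word_tokenize
    return [word.lower() for word in cleaned.split()]

sample_tokens = rough_tokenize("Need water & food, urgently!")
print(sample_tokens)  # ['need', 'water', 'food', 'urgently']
```

Because punctuation is removed before splitting, symbols such as `&` and `!` never reach the vectorizer as separate tokens.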

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [4]:
class input_transform(BaseEstimator, TransformerMixin):
    """Convert the input from sparse matrix to array
    """

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.toarray()

In [5]:
pipeline = Pipeline([('vect',CountVectorizer(tokenizer = tokenize,stop_words='english')),
                     ('tfidf',TfidfTransformer()),
                     ('convert input',input_transform()),
                     ('multi output classifier',MultiOutputClassifier(RandomForestClassifier(n_estimators=25,max_depth = 15)))],verbose=True)
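
As a rough illustration of what a multi-output pipeline returns, the sketch below fits the same kind of pipeline on a few made-up messages with two made-up label columns. The toy data, the step name `clf`, and the default tokenizer are assumptions for the example. It also leaves out the dense conversion step, since `RandomForestClassifier` accepts sparse input.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# made-up messages and two made-up label columns (e.g. "water", "fire")
toy_messages = ["we need water", "house on fire",
                "need water and food", "fire near the school"]
toy_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # one random forest is fitted per label column
    ('clf', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

toy_preds = toy_pipeline.predict(["please send water"])
print(toy_preds.shape)  # (1, 2): one row per message, one column per label
```

With the real data the prediction array has one column per category, which is why the evaluation step can iterate over columns.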

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.4,random_state = 42)

In [7]:
pipeline.fit(X_train,y_train)

[Pipeline] .............. (step 1 of 4) Processing vect, total=  12.8s
[Pipeline] ............. (step 2 of 4) Processing tfidf, total=   0.0s
[Pipeline] ..... (step 3 of 4) Processing convert input, total=   0.2s
[Pipeline]  (step 4 of 4) Processing multi output classifier, total= 6.3min


Pipeline(steps=[('vect',
                 CountVectorizer(stop_words='english',
                                 tokenizer=<function tokenize at 0x0000018527D080D0>)),
                ('tfidf', TfidfTransformer()),
                ('convert input', input_transform()),
                ('multi output classifier',
                 MultiOutputClassifier(estimator=RandomForestClassifier(max_depth=15,
                                                                        n_estimators=25)))],
         verbose=True)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [7]:
def report(model,X_test,y_test):
    """Print a classification report per category and the average accuracy, precision, recall and f1 score
    Args:
    model : fitted model to evaluate
    X_test : messages column for testing
    y_test : true category labels, one column per category
    
    Returns : None
    Prints the classification report for each category, then the accuracy and the
    macro-averaged precision, recall and f1 score, each averaged over all categories
    
    """
    y_preds = model.predict(X_test)
    y_preds = pd.DataFrame(y_preds, columns = y_test.columns)
    acc = []
    precision = []
    recall = []
    f1 = []
    # i is the column position, j is the category name
    for i, j in enumerate(y_test.columns):
        print('category : ',j)
        print(cr(y_test.iloc[:,i],y_preds.iloc[:,i]))
        print('---------------------------')
        acc.append(accuracy_score(y_test.iloc[:,i],y_preds.iloc[:,i]))
        precision.append(precision_score(y_test.iloc[:,i],y_preds.iloc[:,i],average='macro'))
        recall.append(recall_score(y_test.iloc[:,i],y_preds.iloc[:,i],average='macro'))
        f1.append(f1_score(y_test.iloc[:,i],y_preds.iloc[:,i],average='macro'))
    
    print('average accuracy score : ' , np.mean(acc))
    print('average precision score : ' , np.mean(precision))
    print('average recall score : ' , np.mean(recall))
    print('average f1 score : ' , np.mean(f1))

In [9]:
report(pipeline,X_test,y_test)

category :  related
              precision    recall  f1-score   support

           0       1.00      0.00      0.01      2464
           1       0.77      1.00      0.87      8023

    accuracy                           0.77     10487
   macro avg       0.88      0.50      0.44     10487
weighted avg       0.82      0.77      0.67     10487

---------------------------
category :  request
              precision    recall  f1-score   support

           0       0.83      1.00      0.91      8685
           1       1.00      0.00      0.00      1802

    accuracy                           0.83     10487
   macro avg       0.91      0.50      0.45     10487
weighted avg       0.86      0.83      0.75     10487

---------------------------
category :  offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     10437
           1       0.00      0.00      0.00        50

    accuracy                           1.00     10487
   macro avg  

              precision    recall  f1-score   support

           0       0.99      1.00      0.99     10382
           1       0.00      0.00      0.00       105

    accuracy                           0.99     10487
   macro avg       0.49      0.50      0.50     10487
weighted avg       0.98      0.99      0.99     10487

---------------------------
category :  earthquake
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      9514
           1       0.88      0.02      0.03       973

    accuracy                           0.91     10487
   macro avg       0.90      0.51      0.49     10487
weighted avg       0.91      0.91      0.87     10487

---------------------------
category :  cold
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     10267
           1       0.00      0.00      0.00       220

    accuracy                           0.98     10487
   macro avg       0.49      0.5

The classification report above shows the effect of class imbalance in the data.

The model predicts class 0 much better than class 1 because class 0 is the majority class.

It also suggests that words like 'water' or 'electricity' rarely indicate an actual need for water or electricity.
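
The effect of imbalance can be checked by hand. The sketch below takes the support counts of the `offer` category from the report above (10437 zeros, 50 ones) and scores a hypothetical model that always predicts 0. The always-zero model is an assumption for illustration, not the fitted pipeline.

```python
# support counts for the 'offer' category, taken from the report above
n_negative, n_positive = 10437, 50

y_true = [0] * n_negative + [1] * n_positive
y_pred = [0] * len(y_true)  # hypothetical model that always predicts class 0

pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(pairs)
recall_neg = sum(t == 0 and p == 0 for t, p in pairs) / n_negative
recall_pos = sum(t == 1 and p == 1 for t, p in pairs) / n_positive
macro_recall = (recall_neg + recall_pos) / 2

print(round(accuracy, 2))  # 1.0
print(recall_pos)          # 0.0
print(macro_recall)        # 0.5
```

These values match the accuracy and class-1 recall shown for `offer`, which suggests the random forest behaves much like this trivial baseline on rare categories.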

### 6. Improve your model
Use grid search to find better parameters. 
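
Grid-search keys for a pipeline follow the pattern `<step name>__estimator__<parameter>`, where `estimator` is the attribute of `MultiOutputClassifier` that holds the wrapped model. One way to find valid keys is to inspect `get_params()`. The sketch below does that on a small stand-in pipeline; the step names `vect` and `clf` are assumptions chosen for brevity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# stand-in pipeline; only the parameter names matter here, nothing is fitted
demo_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

# keep only the parameters of the wrapped random forest
tunable = sorted(name for name in demo_pipeline.get_params()
                 if name.startswith('clf__estimator__'))
print(tunable[:5])
```

Any name in `tunable` can be used as a key in the `parameters` dictionary passed to `GridSearchCV`.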

In [10]:
parameters = {'multi output classifier__estimator__n_estimators':[25,30]}

cv = GridSearchCV(pipeline,parameters,cv=2)
cv.fit(X_train,y_train)
print(cv.best_params_)

[Pipeline] .............. (step 1 of 4) Processing vect, total=   5.6s
[Pipeline] ............. (step 2 of 4) Processing tfidf, total=   0.0s
[Pipeline] ..... (step 3 of 4) Processing convert input, total=   0.1s
[Pipeline]  (step 4 of 4) Processing multi output classifier, total= 2.4min
[Pipeline] .............. (step 1 of 4) Processing vect, total=   5.6s
[Pipeline] ............. (step 2 of 4) Processing tfidf, total=   0.0s
[Pipeline] ..... (step 3 of 4) Processing convert input, total=   0.1s
[Pipeline]  (step 4 of 4) Processing multi output classifier, total= 2.4min
[Pipeline] .............. (step 1 of 4) Processing vect, total=   5.5s
[Pipeline] ............. (step 2 of 4) Processing tfidf, total=   0.0s
[Pipeline] ..... (step 3 of 4) Processing convert input, total=   0.1s
[Pipeline]  (step 4 of 4) Processing multi output classifier, total= 2.8min
[Pipeline] .............. (step 1 of 4) Processing vect, total=   5.5s
[Pipeline] ............. (step 2 of 4) Processing tfidf, total

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [12]:
report(cv,X_test,y_test)

category :  related
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      2464
           1       0.77      1.00      0.87      8023

    accuracy                           0.77     10487
   macro avg       0.38      0.50      0.43     10487
weighted avg       0.59      0.77      0.66     10487

---------------------------
category :  request
              precision    recall  f1-score   support

           0       0.83      1.00      0.91      8685
           1       1.00      0.00      0.00      1802

    accuracy                           0.83     10487
   macro avg       0.91      0.50      0.46     10487
weighted avg       0.86      0.83      0.75     10487

---------------------------
category :  offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     10437
           1       0.00      0.00      0.00        50

    accuracy                           1.00     10487
   macro avg  

category :  earthquake
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      9514
           1       1.00      0.00      0.01       973

    accuracy                           0.91     10487
   macro avg       0.95      0.50      0.48     10487
weighted avg       0.92      0.91      0.86     10487

---------------------------
category :  cold
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     10267
           1       0.00      0.00      0.00       220

    accuracy                           0.98     10487
   macro avg       0.49      0.50      0.49     10487
weighted avg       0.96      0.98      0.97     10487

---------------------------
category :  other_weather
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      9950
           1       0.00      0.00      0.00       537

    accuracy                           0.95     10487
   mac

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [15]:
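from sklearn.linear_model import RidgeClassifier
# note: this split uses test_size=0.2, while the random forest above was evaluated with test_size=0.4
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2,random_state = 42)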
pipeline1 = Pipeline([('vect',CountVectorizer(tokenizer = tokenize,stop_words='english')),
                     ('tfidf',TfidfTransformer()),
                     ('multi output classifier',MultiOutputClassifier(RidgeClassifier()))],verbose=True)

In [16]:
pipeline1.fit(X_train,y_train)

[Pipeline] .............. (step 1 of 3) Processing vect, total=  14.9s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline]  (step 3 of 3) Processing multi output classifier, total=   2.5s


Pipeline(steps=[('vect',
                 CountVectorizer(stop_words='english',
                                 tokenizer=<function tokenize at 0x0000010DE5337CA0>)),
                ('tfidf', TfidfTransformer()),
                ('multi output classifier',
                 MultiOutputClassifier(estimator=RidgeClassifier(alpha=1)))],
         verbose=True)

In [17]:
report(pipeline1,X_test,y_test)

category :  related
              precision    recall  f1-score   support

           0       0.69      0.46      0.55      1266
           1       0.85      0.93      0.89      3978

    accuracy                           0.82      5244
   macro avg       0.77      0.70      0.72      5244
weighted avg       0.81      0.82      0.81      5244

---------------------------
category :  request
              precision    recall  f1-score   support

           0       0.91      0.97      0.94      4349
           1       0.80      0.55      0.65       895

    accuracy                           0.90      5244
   macro avg       0.86      0.76      0.80      5244
weighted avg       0.89      0.90      0.89      5244

---------------------------
category :  offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5218
           1       0.00      0.00      0.00        26

    accuracy                           1.00      5244
   macro avg  

category :  storm
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      4758
           1       0.72      0.56      0.63       486

    accuracy                           0.94      5244
   macro avg       0.84      0.77      0.80      5244
weighted avg       0.93      0.94      0.94      5244

---------------------------
category :  fire
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      5191
           1       0.75      0.06      0.11        53

    accuracy                           0.99      5244
   macro avg       0.87      0.53      0.55      5244
weighted avg       0.99      0.99      0.99      5244

---------------------------
category :  earthquake
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      4766
           1       0.88      0.69      0.77       478

    accuracy                           0.96      5244
   macro avg  

To explain the results above:

- Accuracy is high because the model mostly predicts class 0, which is the majority class
- Precision is somewhat lower than accuracy: the model predicts class 1 so rarely that there are few false positives, but in categories where it never predicts class 1 (such as `offer`) the class-1 precision is reported as 0.00, which lowers the macro average
- Recall is low because there are many false negatives: the model fails to predict class 1 due to the imbalanced data
- F1 score is low because it is the harmonic mean of precision and recall, so the low recall pulls it down
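
The harmonic-mean point can be verified with the class-1 numbers reported for the `fire` category above (precision 0.75, recall 0.06). The helper `f1_from` is a hypothetical name used only for this sketch.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# class-1 precision and recall for 'fire', taken from the report above
fire_f1 = f1_from(0.75, 0.06)
print(round(fire_f1, 2))  # 0.11
```

The result matches the f1-score in the report: a high precision cannot compensate for a very low recall, because the harmonic mean stays close to the smaller of the two values.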



#### As we can see, the best model by all metrics was the Ridge Classifier, after also trying other models such as random forest and AdaBoost

### 9. Export your model as a pickle file

In [16]:
# use a context manager so the file is closed after writing
with open('classifier.pkl', 'wb') as f:
    pickle.dump(pipeline1, f)
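
Loading the model back follows the same pattern in reverse. The sketch below round-trips a stand-in object through a temporary file. The dictionary and the temporary path are assumptions for illustration, used so the example runs without the trained pipeline.

```python
import os
import pickle
import tempfile

# stand-in object; in the notebook this would be the fitted pipeline
stand_in_model = {'name': 'stand-in for pipeline1'}
model_path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')

with open(model_path, 'wb') as f:
    pickle.dump(stand_in_model, f)

with open(model_path, 'rb') as f:
    restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

One caveat for the real pipeline: pickle stores functions by reference, not by value, so the script that loads `classifier.pkl` needs a `tokenize` function importable under the same module and name as when the model was saved.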

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.