# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [37]:
# import libraries
import re
import pandas as pd
import numpy as np
#from sqlalchemy import create_engine
import sqlite3

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
#from sklearn.compose import ColumnTransformer
import pickle


In [2]:
pd.__version__

'1.3.4'

In [3]:
import sklearn
sklearn.show_versions()


System:
    python: 3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]
executable: C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\python.exe
   machine: Windows-10-10.0.19044-SP0

Python dependencies:
      sklearn: 1.1.1
          pip: 22.3.1
   setuptools: 57.4.0
        numpy: 1.21.2+mkl
        scipy: 1.8.1
       Cython: None
       pandas: 1.3.4
   matplotlib: 3.6.3
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libiomp
       filepath: C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\Lib\site-packages\numpy\DLLs\libiomp5md.dll
        version: None
    num_threads: 8

       user_api: blas
   internal_api: mkl
         prefix: mkl_rt
       filepath: C:\Users\M.Hedia\AppData\Local\Programs\Python\Python310\Lib\site-packages\numpy\DLLs\mkl_rt.1.dll
        version: 2021.4-Product
threading_layer: intel
    num_threads: 4

     

In [4]:
import nltk
nltk.download(['punkt', 'wordnet'])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\M.Hedia\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\M.Hedia\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
# load data from database
#engine = create_engine('sqlite:///DisasterResponse.db')
#df = pd.read_sql_table('MessageCategory', engine)

# connect to the database
conn = sqlite3.connect('DisasterResponse.db')

# run a query
df=pd.read_sql('SELECT * FROM MessageCategory', conn)
df=df[df['related']!=2]
#df['child_alone'].iloc[0]=1
X = df[['message', 'genre']]#.values
#X = df[['message']]#.values
Y = df.iloc[:,4:]#.values
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
Y.shape

(26028, 36)

In [7]:
X.shape

(26028, 2)

In [8]:
X['message'].shape

(26028,)

In [9]:
X['genre'].shape

(26028,)

In [10]:
df=df[df['related']!=2]

In [11]:
Y.shape

(26028, 36)

In [12]:
single_value_targets=[]
target_cols=list(df.columns[4:])
for col in target_cols:#df.iloc[:,4:].columns:
    if len(df[col].unique())==1:
        print(col, df[col].unique(), len(df[col].unique()), target_cols.index(col))
        single_value_targets.append(target_cols.index(col))

child_alone [0] 1 9


In [13]:
df.child_alone.unique()

array([0], dtype=int64)

In [14]:
df.related.unique()

array([1, 0], dtype=int64)

In [15]:
df.genre.unique()

array(['direct', 'social', 'news'], dtype=object)

In [33]:
target_names=list(df.columns[4:])

### 2. Write a tokenization function to process your text data

In [16]:
# raw string so that \( and \) are not treated as (invalid) string escapes
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def tokenize(text):
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
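
The URL-replacement step of `tokenize` can be checked on its own, without the NLTK downloads. The sketch below reuses the notebook's pattern; the helper name `replace_urls` and the sample message are made up for illustration.

```python
import re

# Same pattern as in the notebook, written as a raw string.
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL matched by url_regex with a fixed placeholder token."""
    for url in re.findall(url_regex, text):
        text = text.replace(url, placeholder)
    return text


# Made-up message, not taken from the dataset.
sample = "Water needed, see http://example.com/help for details"
cleaned = replace_urls(sample)
print(cleaned)
```

Note that the character class `[$-_@.&+]` is a range from `$` to `_`, so punctuation such as `,` or `)` directly after a URL is swallowed into the match as well.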

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
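
As a minimal sketch of what `MultiOutputClassifier` does, the toy example below fits one classifier per target column. The data is made up, and `LogisticRegression` is used only to keep the example fast; it is not the estimator the notebook ends up training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy data (made up): 6 samples, 2 numeric features, 2 binary target columns.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]], dtype=float)
Y_toy = np.array([[0, 0], [0, 1], [0, 0], [1, 1], [1, 0], [1, 1]])

# One independent LogisticRegression is fitted for each column of Y_toy.
clf = MultiOutputClassifier(LogisticRegression())
clf.fit(X_toy, Y_toy)

n_fitted = len(clf.estimators_)          # one estimator per target column
pred_shape = clf.predict(X_toy).shape    # (n_samples, n_targets)
print(n_fitted, pred_shape)
```

Because each column gets its own estimator, every target column in the training data needs at least two distinct values for classifiers that require them.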

In [17]:
class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrices (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        if self.key=='genre':
            return data_dict[[self.key]]
        else:
            return data_dict[self.key]


In [18]:
message_transformer = Pipeline([
    ('selector', ItemSelector(key='message')),
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer())])

In [19]:
genre_transformer = Pipeline([
    ('selector', ItemSelector(key='genre')),
    ('onehot', OneHotEncoder())])

In [20]:
msg_gnre_pipeline=Pipeline([
    ('features', FeatureUnion([
        ('message_pipe', message_transformer),
        ('genre_pipe', genre_transformer)
    ])),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
    #('clf', MultiOutputClassifier(LogisticRegression()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [22]:
for col in y_train:
    print(y_train[col].unique())

[1 0]
[0 1]
[0 1]
[1 0]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[1 0]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]


In [23]:
X_train.head()

Unnamed: 0,message,genre
15445,The failure of rescuers to reach these areas p...,news
15974,"NEW YORK, 21 September 2007 - The exceptionall...",news
7112,Human rights group Be a pillar on which those ...,direct
3285,Please send us information concerning them.,direct
23842,In deference to the concerns of the Malian gov...,news


In [24]:
y_train.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
15445,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15974,1,0,0,1,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,0
7112,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3285,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23842,1,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
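# Workaround: targets listed in single_value_targets (here child_alone) contain
# only 0, so flip the first training row to 1 to give each column two classes.
for i in single_value_targets:
    y_train.iloc[0, i] = 1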

In [26]:
y_train.iloc[0,9]

1

In [27]:
msg_gnre_pipeline.fit(X_train, y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
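
A minimal sketch of the column-by-column approach described above, on made-up labels for two of the categories. The variable names are illustrative; `zero_division=0` is passed so that categories with no predicted positives report 0 instead of emitting a warning.

```python
import numpy as np
from sklearn.metrics import classification_report

category_names = ["related", "request"]  # two of the 36 category names

# Made-up true and predicted labels: rows are messages, columns are categories.
y_true = np.array([[1, 0], [1, 1], [0, 0], [1, 0]])
y_pred = np.array([[1, 0], [1, 0], [0, 0], [0, 0]])

reports = {}
for i, name in enumerate(category_names):
    # Keep a dict version for programmatic access and print the text version.
    reports[name] = classification_report(
        y_true[:, i], y_pred[:, i], zero_division=0, output_dict=True
    )
    print(name)
    print(classification_report(y_true[:, i], y_pred[:, i], zero_division=0))
```

With the notebook's data, the same loop would run over `y_test.columns` with `y_test[col]` and the matching column of the prediction array.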

In [28]:
#target_names=list(df.columns[4:])

y_pred_msg_gnre_rfc = msg_gnre_pipeline.predict(X_test)

In [29]:
pd.DataFrame(y_pred_msg_gnre_rfc).sum(axis=0)

0     4742
1      508
2        0
3     1708
4       23
5       25
6        8
7        1
8       11
9        0
10     101
11     297
12     126
13      10
14       7
15       1
16       4
17      32
18      15
19       3
20      30
21      20
22       2
23       0
24       0
25       0
26       0
27       2
28    1038
29     174
30     198
31       1
32     380
33       4
34       8
35     475
dtype: int64

In [30]:
y_test.values.shape

(5206, 36)

In [31]:
y_pred_msg_gnre_rfc.shape

(5206, 36)

In [34]:
report_msg_gnre_rfc = classification_report(y_test.values, y_pred_msg_gnre_rfc, target_names=target_names)
print(report_msg_gnre_rfc)

                        precision    recall  f1-score   support

               related       0.81      0.97      0.88      3959
               request       0.89      0.50      0.64       902
                 offer       0.00      0.00      0.00        25
           aid_related       0.79      0.63      0.70      2156
          medical_help       0.70      0.04      0.07       431
      medical_products       0.84      0.08      0.15       264
     search_and_rescue       0.62      0.03      0.06       151
              security       0.00      0.00      0.00       106
              military       0.82      0.05      0.10       175
           child_alone       0.00      0.00      0.00         0
                 water       0.91      0.27      0.41       344
                  food       0.88      0.44      0.59       586
               shelter       0.87      0.23      0.36       487
              clothing       0.80      0.10      0.18        79
                 money       0.86      



### 6. Improve your model
Use grid search to find better parameters. 
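
Grid-search parameter names are built by joining step names with double underscores. The sketch below rebuilds a simplified pipeline with the same step names (without the genre branch or the custom tokenizer) and checks that each key exists before any expensive fit is started. `toy_pipeline` and `text_pipe` are illustrative names.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import ParameterGrid
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

# Simplified stand-in for the notebook's pipeline, using the same step names.
text_pipe = Pipeline([("vect", CountVectorizer()), ("tfidf", TfidfTransformer())])
toy_pipeline = Pipeline([
    ("features", FeatureUnion([("message_pipe", text_pipe)])),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

parameters = {
    "features__message_pipe__vect__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__n_estimators": [50, 100, 200],
    "clf__estimator__min_samples_split": [2, 3, 4],
}

# Every grid key must be a valid parameter name of the pipeline.
available = toy_pipeline.get_params()
missing = [key for key in parameters if key not in available]

# Number of parameter combinations the grid search will try.
n_candidates = len(ParameterGrid(parameters))
print(missing, n_candidates)
```

The 18 combinations, multiplied by the default 5-fold cross-validation, give the 90 fits reported in the log below; at several minutes per fit, a smaller grid or `n_jobs=-1` may be worth considering.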

In [35]:
parameters = {
    'features__message_pipe__vect__ngram_range': ((1, 1), (1, 2)),
    'clf__estimator__n_estimators': [50, 100, 200],
    'clf__estimator__min_samples_split': [2, 3, 4]
}

# create grid search object
cv = GridSearchCV(msg_gnre_pipeline, parameters, verbose=4)
cv.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV 1/5] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50, features__message_pipe__vect__ngram_range=(1, 1);, score=0.236 total time= 2.8min


KeyboardInterrupt: 

In [39]:
# load the previously saved grid-search object; the context manager closes the file
with open('cv_model.pickle', 'rb') as f:
    cv = pickle.load(f)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
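
For reference, the three metrics named above can be computed by hand for a single binary category. This is a plain-Python sketch on made-up labels; the helper `binary_metrics` is hypothetical and not part of the notebook.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for one binary category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is empty (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels for a single category.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For rare categories, accuracy alone can look high even when recall is close to zero, which is why precision and recall are reported alongside it.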

In [None]:
# predict on test data
y_pred_cv = cv.predict(X_test)

report_cv = classification_report(y_test, y_pred_cv, target_names=target_names)
print(report_cv)
#print(classification_report(np.hstack(y_test), np.hstack(y_pred)))

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
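
One possible extra feature is the length of each message. The transformer below is a hypothetical sketch (`TextLengthExtractor` does not appear elsewhere in this notebook); it could be added as another branch of the `FeatureUnion`, after an `ItemSelector(key='message')` step.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        # Nothing to learn; present only to satisfy the transformer interface.
        return self

    def transform(self, X):
        # Return a 2D array of shape (n_samples, 1), as FeatureUnion expects.
        return np.array([[len(text)] for text in X])


# Made-up messages, not taken from the dataset.
messages = ["Need water", "We are trapped near the bridge"]
lengths = TextLengthExtractor().fit_transform(messages)
print(lengths.tolist())
```

Whether such a feature actually helps would need to be checked against the test-set reports; it is shown here only to illustrate the mechanics.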

### 9. Export your model as a pickle file

In [28]:
filename = "cv_model.pickle"  # save model
with open(filename, "wb") as f:
    pickle.dump(cv, f)
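
A quick way to confirm that an export worked is to load the file back and compare. The sketch below does this in a temporary directory with a small dictionary standing in for the fitted model; `model_stub` is a made-up placeholder.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted grid-search object (made up for illustration).
model_stub = {"name": "cv_model", "params": {"n_estimators": 50}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cv_model.pickle")

    # Context managers make sure the file is flushed and closed.
    with open(path, "wb") as f:
        pickle.dump(model_stub, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stub)
```

Pickled scikit-learn models should be loaded with the same library version that saved them, and only from sources you trust.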

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
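
A possible skeleton for such a script is sketched below. The argument names `database_filepath` and `model_filepath` are assumptions, not taken from the template file, and the body of `main` only marks where the notebook steps would go.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths the script needs (argument names are assumed)."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath", help="SQLite database, e.g. DisasterResponse.db")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would be called here, in order:
    # 1. load X and Y from args.database_filepath
    # 2. build the pipeline and grid search
    # 3. fit on the training split
    # 4. print the classification report for the test split
    # 5. pickle the fitted model to args.model_filepath
    print(f"Would load {args.database_filepath} and save to {args.model_filepath}")
    return args


# Example invocation with explicit arguments instead of sys.argv.
args = main(["DisasterResponse.db", "cv_model.pickle"])
```

In a real script, the last line would typically be replaced by an `if __name__ == "__main__":` guard that calls `main()` so the paths come from the command line.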