# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [119]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, hamming_loss, accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.exceptions import UndefinedMetricWarning
import re  # used by simple_tokenizer below
import time
import warnings
import pickle

warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)

In [67]:
# load data from database
engine = create_engine('sqlite:///../data/messages.db')
df = pd.read_sql_table('categorized_messages', con=engine)
df.head(2)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0


In [68]:
print(df.columns)

Index(['id', 'message', 'original', 'genre', 'related', 'request', 'offer',
       'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
       'security', 'military', 'child_alone', 'water', 'food', 'shelter',
       'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')


In [69]:
X = df[['message']]
Y = df[['related', 'request', 'offer',
       'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
       'security', 'military', 'child_alone', 'water', 'food', 'shelter',
       'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report']]

### 2. Writing a tokenization function to process text data

In [70]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26028 entries, 0 to 26027
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   message  26028 non-null  object
dtypes: object(1)
memory usage: 203.5+ KB


In [71]:
def simple_tokenizer(text):
    # Lowercase, replace non-alphanumeric characters with spaces, then split on whitespace
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()
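As a quick check of what the tokenizer returns, here is a minimal sketch that runs it on one of the sample messages shown by `df.head(2)`. The function is repeated so the snippet runs on its own.

```python
import re


def simple_tokenizer(text):
    # Lowercase, turn every non-alphanumeric character into a space, then split
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()


tokens = simple_tokenizer("Is the Hurricane over or is it not over")
print(tokens)
# ['is', 'the', 'hurricane', 'over', 'or', 'is', 'it', 'not', 'over']
```

Note that the pattern only keeps ASCII letters and digits, so accented characters would be treated as separators. That is probably acceptable for the English `message` column, but it is worth keeping in mind.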



### 3. Building a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset.

In [72]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=simple_tokenizer)),  # Tokenization + TF-IDF vectorization
    ('clf', MultiOutputClassifier(RandomForestClassifier()))  # Multi-label classifier
])
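To see the input and output shapes this pipeline works with, the sketch below fits the same two steps on a tiny made-up dataset. The messages, the two labels, and the small `n_estimators` value are all assumptions chosen for illustration, not taken from the real data.

```python
import re
import warnings

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# A custom tokenizer makes TfidfVectorizer warn that token_pattern is unused
warnings.filterwarnings("ignore", category=UserWarning)


def simple_tokenizer(text):
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()


# Hypothetical toy data: four messages, two label columns (water, food)
toy_messages = [
    "we need water",
    "please send food",
    "water and food needed",
    "the road is blocked",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=simple_tokenizer)),
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

# predict returns one row per message and one column per label
toy_pred = toy_pipeline.predict(["need water now", "no food left"])
print(toy_pred.shape)  # (2, 2)
```

`MultiOutputClassifier` fits one independent `RandomForestClassifier` per label column, which is why training time grows with the number of categories.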


### 4. Training pipeline
- Split data into train and test sets
- Train pipeline

In [75]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
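For a sense of the resulting set sizes, the arithmetic can be sketched as follows. This assumes, as I understand scikit-learn's behaviour, that a fractional `test_size` is rounded up and the remaining rows go to the training set.

```python
import math

n_samples = 26028  # row count reported by X.info()
test_size = 0.2

# Assumption: the test share is rounded up; everything else is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 20822 5206
```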


In [76]:
X_train = X_train.values.ravel()
X_test = X_test.values.ravel()
Y_train_array = Y_train.values
Y_test_array = Y_test.values


In [81]:
# Because the classes are imbalanced, I'll use weighted scores (F1, precision, and recall) for evaluation.
Y.sum().sort_values()

child_alone                   0
offer                       118
shops                       120
tools                       159
fire                        282
hospitals                   283
missing_people              298
aid_centers                 309
clothing                    405
security                    471
cold                        530
electricity                 532
money                       604
search_and_rescue           724
military                    860
refugees                    875
other_infrastructure       1151
death                      1194
transport                  1201
medical_products           1313
buildings                  1333
other_weather              1376
water                      1672
infrastructure_related     1705
medical_help               2084
floods                     2155
shelter                    2314
storm                      2443
earthquake                 2455
food                       2923
other_aid                  3446
request 

In [82]:
start_time = time.time()


pipeline.fit(X_train, Y_train)

end_time = time.time()

# Calculate the time taken for a single fit
time_per_fit = (end_time - start_time)/60
print(f"Time taken for a single fit: {time_per_fit:.2f} mins")



Time taken for a single fit: 3.65 mins


### 5. Testing model

In [83]:
Y_pred = pipeline.predict(X_test)

In [86]:
# Ensure Y_test and Y_pred are NumPy arrays
Y_test = np.array(Y_test)
Y_pred = np.array(Y_pred)


# Initialize lists to store metrics and weights (number of true instances per label)
f1_scores, precision_scores, recall_scores, weights = [], [], [], []

# Iterate over each label
for i in range(Y_test.shape[1]):
    tp = np.sum((Y_test[:, i] == 1) & (Y_pred[:, i] == 1))
    fp = np.sum((Y_test[:, i] == 0) & (Y_pred[:, i] == 1))
    fn = np.sum((Y_test[:, i] == 1) & (Y_pred[:, i] == 0))
    tn = np.sum((Y_test[:, i] == 0) & (Y_pred[:, i] == 0))

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    f1_scores.append(f1)
    precision_scores.append(precision)
    recall_scores.append(recall)
    
    # Calculate weight as the number of true instances for each label
    weights.append(np.sum(Y_test[:, i]))

# Weighted average metrics
weighted_f1 = np.average(f1_scores, weights=weights)
weighted_precision = np.average(precision_scores, weights=weights)
weighted_recall = np.average(recall_scores, weights=weights)

print(f"Weighted F1: {weighted_f1:.2f}")
print(f"Weighted Precision: {weighted_precision:.2f}")
print(f"Weighted Recall: {weighted_recall:.2f}")


Weighted F1: 0.55
Weighted Precision: 0.79
Weighted Recall: 0.50
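The support-weighted averaging used above can be checked by hand on a small example. The sketch below uses made-up arrays (4 samples, 2 labels), not the real predictions, and follows the same per-label loop.

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are labels
Y_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
Y_hat = np.array([[1, 0], [0, 1], [0, 1], [1, 1]])

precisions, recalls, f1s, supports = [], [], [], []
for i in range(Y_true.shape[1]):
    tp = np.sum((Y_true[:, i] == 1) & (Y_hat[:, i] == 1))
    fp = np.sum((Y_true[:, i] == 0) & (Y_hat[:, i] == 1))
    fn = np.sum((Y_true[:, i] == 1) & (Y_hat[:, i] == 0))

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)
    supports.append(np.sum(Y_true[:, i]))  # number of true positives-to-find per label

# Each label contributes in proportion to its number of true instances
weighted_precision = np.average(precisions, weights=supports)
weighted_recall = np.average(recalls, weights=supports)
weighted_f1 = np.average(f1s, weights=supports)

print(f"{weighted_f1:.2f} {weighted_precision:.2f} {weighted_recall:.2f}")  # 0.80 0.87 0.80
```

Label 0 has precision 1.0 and recall 2/3 with support 3; label 1 has precision 2/3 and recall 1.0 with support 2. A label with no true instances in the test set, such as `child_alone`, gets weight 0 and drops out of the average.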


### 6. Improving our model through GridSearch
Use grid search to find better parameters. 

In [91]:
parameters = {
    'clf__estimator__n_estimators': [100, 200],
    'clf__estimator__max_depth': [None, 10],
    'clf__estimator__min_samples_split': [2, 5],
}


cv = GridSearchCV(pipeline, parameters, scoring='f1_weighted', cv=2, verbose=2, n_jobs=-1)
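The size of this search can be worked out before running it. The sketch below enumerates the combinations with `itertools.product`; it only counts candidates and does not reproduce how `GridSearchCV` orders them.

```python
from itertools import product

parameters = {
    'clf__estimator__n_estimators': [100, 200],
    'clf__estimator__max_depth': [None, 10],
    'clf__estimator__min_samples_split': [2, 5],
}
cv_folds = 2

# One candidate per combination of values, one fit per candidate per fold
candidates = [dict(zip(parameters, combo)) for combo in product(*parameters.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 8 16
```

With a single fit taking a few minutes, 16 fits is already substantial, which is a reason to keep the grid small. By default `GridSearchCV` also refits the best candidate once on the full training set after the search.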


#### 6a. Test our model
Showing weighted F1, precision, and recall of the tuned model.  


In [92]:
cv.fit(X_train, Y_train)

Fitting 2 folds for each of 8 candidates, totalling 16 fits


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

[CV] END clf__estimator__max_depth=None, clf__estimator__min_samples_split=5, clf__estimator__n_estimators=100; total time= 2.0min
[CV] END clf__estimator__max_depth=10, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100; total time=  17.6s
[CV] END clf__estimator__max_depth=10, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=200; total time=  29.7s
[CV] END clf__estimator__max_depth=None, clf__estimator__min_samples_split=5, clf__estimator__n_estimators=100; total time= 2.0min
[CV] END clf__estimator__max_depth=10, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100; total time=  16.3s
[CV] END clf__estimator__max_depth=10, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=200; total time=  30.5s
[CV] END clf__estimator__max_depth=None, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100; total time= 2.4min
[CV] END clf__estimator__max_depth=10, clf__estimator__min_samples_split=5, clf__estimator_

In [93]:
print("Best Parameters:", cv.best_params_)
print("Best Cross-Validation Score:", cv.best_score_)


Best Parameters: {'clf__estimator__max_depth': None, 'clf__estimator__min_samples_split': 5, 'clf__estimator__n_estimators': 200}
Best Cross-Validation Score: 0.5346928507055196


In [112]:
best_model_1 = cv.best_estimator_
Y_pred = best_model_1.predict(X_test)


In [113]:
# Initialize lists to store metrics for each label
f1_scores_1 = []
precision_scores_1 = []
recall_scores_1 = []

# Ensure Y_test and Y_pred are NumPy arrays
Y_test = np.array(Y_test)
Y_pred = np.array(Y_pred)

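# Iterate over each label. With average='weighted' on a single binary column,
# each score is averaged over both classes (0 and 1), so these numbers are not
# directly comparable to the positive-class metrics in step 5.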
for i in range(Y_test.shape[1]):
    f1 = f1_score(Y_test[:, i], Y_pred[:, i], average='weighted')
    precision = precision_score(Y_test[:, i], Y_pred[:, i], average='weighted')
    recall = recall_score(Y_test[:, i], Y_pred[:, i], average='weighted')

    f1_scores_1.append(f1)
    precision_scores_1.append(precision)
    recall_scores_1.append(recall)

# Aggregate metrics across all labels
average_f1_1 = sum(f1_scores_1) / len(f1_scores_1)
average_precision_1 = sum(precision_scores_1) / len(precision_scores_1)
average_recall_1 = sum(recall_scores_1) / len(recall_scores_1)

print(f"Average F1 Score: {average_f1_1:.2f}")
print(f"Average Precision: {average_precision_1:.2f}")
print(f"Average Recall: {average_recall_1:.2f}")

Average F1 Score: 0.93
Average Precision: 0.94
Average Recall: 0.95
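These scores are computed differently from the ones in step 5: here each label's score is averaged over both classes (0 and 1), so the large negative class dominates. The sketch below, on a made-up rare label, shows how that averaging can look strong even when no positive is ever predicted. It is an illustration of the effect and may explain only part of the gap between the two sets of numbers.

```python
def binary_f1(y_true, y_pred, positive):
    # F1 for one class treated as the "positive" class
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Hypothetical rare label: 5 positives out of 100, and a model that always predicts 0
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

f1_pos = binary_f1(y_true, y_pred, positive=1)
f1_neg = binary_f1(y_true, y_pred, positive=0)

# Class-weighted F1: each class weighted by its number of true instances
class_weighted_f1 = (5 * f1_pos + 95 * f1_neg) / 100

print(f"{f1_pos:.2f} {class_weighted_f1:.2f}")  # 0.00 0.93
```

To compare the tuned model with the baseline fairly, the same metric code should be applied to both sets of predictions.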


### 7. Exporting our model as a pickle file

In [120]:
model_filepath = 'classifier.pkl'
with open(model_filepath, 'wb') as f:
    pickle.dump(best_model_1, f)

print(f"Model saved as {model_filepath}")

Model saved as classifier.pkl
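Loading the file back is the mirror image of saving it. The sketch below shows the round trip with a stand-in dictionary in place of `best_model_1` and a temporary directory, both assumptions made so that it runs without a trained model.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for best_model_1 so the snippet runs on its own
stand_in_model = {"name": "toy-model", "params": {"n_estimators": 200}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_filepath = os.path.join(tmp_dir, "classifier.pkl")

    # Save
    with open(model_filepath, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load
    with open(model_filepath, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

Because pickle stores functions by reference, loading the real pipeline requires `simple_tokenizer` to be importable under the same module path as when it was saved. Only unpickle files from a trusted source.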


This notebook is used to complete `train_classifier.py`.