# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [14]:
# import libraries
import pickle
import sqlite3
import warnings

import nltk
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    hamming_loss,
    jaccard_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

nltk.download('punkt')  # tokenizer models used by nltk.word_tokenize
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\letsm005\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# load data from database
conn = sqlite3.connect('etl_disaster_data.db')

# Read the table into a DataFrame, keeping only the first 5,000 rows
query = "SELECT * FROM etl_disaster_table"
df = pd.read_sql_query(query, conn).head(5000)

# Close the connection
conn.close()
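
The step list above also asks for feature and target variables X and Y. A minimal sketch of that step follows, using a tiny in-memory stand-in database so it runs on its own. The table name `toy_table` and the columns `water` and `transport` are hypothetical, not the real contents of `etl_disaster_data.db`.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical stand-in for the ETL output: an id, a message and two label columns.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["need water", "road blocked", "all fine"],
    "water": [1, 0, 0],
    "transport": [0, 1, 0],
})

# closing() guarantees the connection is closed even if the query raises.
with closing(sqlite3.connect(":memory:")) as toy_conn:
    toy.to_sql("toy_table", toy_conn, index=False)
    df_toy = pd.read_sql_query("SELECT * FROM toy_table", toy_conn)

# Features are the raw messages; targets are every remaining label column.
X = df_toy["message"]
Y = df_toy.drop(columns=["id", "message"])
print(X.shape, Y.shape)
```

Dropping by name with `columns=[...]` keeps the target definition explicit, which makes it easier to spot non-label columns that slip into Y.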

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [3]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df.drop(["message","id"],axis=1), test_size=0.2, random_state=42
)

In [4]:
# Define pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),  # Text preprocessing
    ('clf', MultiOutputClassifier(RandomForestClassifier()))  # Multi-output classifier
])



In [5]:
### 4. Train pipeline
pipeline.fit(X_train, y_train)
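
To make the shape of the predictions concrete, here is a small sketch of the same pipeline structure on toy data. The messages and the two target columns are invented for illustration. The point is that `predict` returns one row per message and one column per target, which is why the evaluation code can index predictions as `y_pred[:, idx]`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four messages and two binary target columns.
toy_messages = ["we need water", "road is blocked", "need food and water", "bridge is closed"]
toy_targets = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_targets)

# One row per input message, one column per target.
toy_pred = toy_pipeline.predict(["people need water"])
print(toy_pred.shape)
```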

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [6]:
y_pred = pipeline.predict(X_test)
# Calculate accuracy for each label
accuracies = [accuracy_score(y_test[label], y_pred[:, idx]) for idx, label in enumerate(y_test.columns)]
print("Accuracy for each label:", accuracies)

# Calculate Hamming Loss for each label
hamming_losses = [hamming_loss(y_test[label], y_pred[:, idx]) for idx, label in enumerate(y_test.columns)]
print("Hamming Loss for each label:", hamming_losses)

# Calculate Jaccard Score for each label
jaccard_scores = [jaccard_score(y_test[label], y_pred[:, idx], average=None) for idx, label in enumerate(y_test.columns)]
print("Jaccard Score for each label:", jaccard_scores)

# Calculate F1 Score for each label
f1_scores = [f1_score(y_test[label], y_pred[:, idx], average=None) for idx, label in enumerate(y_test.columns)]
print("F1 Score for each label:", f1_scores)


Accuracy for each label: [0.809, 0.822, 0.997, 0.782, 0.921, 0.942, 0.962, 0.976, 0.994, 1.0, 0.952, 0.927, 0.926, 0.991, 0.987, 0.979, 0.982, 0.963, 0.825, 0.956, 0.968, 0.95, 0.994, 0.997, 0.988, 0.997, 0.994, 0.974, 0.915, 0.971, 0.98, 0.995, 0.973, 0.991, 0.974, 0.787, 1.0, 1.0, 1.0]
Hamming Loss for each label: [0.191, 0.178, 0.003, 0.218, 0.079, 0.058, 0.038, 0.024, 0.006, 0.0, 0.048, 0.073, 0.074, 0.009, 0.013, 0.021, 0.018, 0.037, 0.175, 0.044, 0.032, 0.05, 0.006, 0.003, 0.012, 0.003, 0.006, 0.026, 0.085, 0.029, 0.02, 0.005, 0.027, 0.009, 0.026, 0.213, 0.0, 0.0, 0.0]
Jaccard Score for each label: [array([0.26377953, 0.79528403, 0.        ]), array([0.70529801, 0.68989547]), array([0.997, 0.   ]), array([0.62478485, 0.6577708 ]), array([0.9204431 , 0.08139535]), array([0.94159114, 0.10769231]), array([0.962, 0.   ]), array([0.976, 0.   ]), array([0.994, 0.   ]), array([1.]), array([0.94713656, 0.65714286]), array([0.91119221, 0.70916335]), array([0.92136026, 0.44360902]), array(

In [7]:


# y_test holds the true labels and y_pred the predicted labels

# Flatten y_test and y_pred so classification_report scores all label columns pooled together
y_test_flat = y_test.values.ravel()
y_pred_flat = y_pred.ravel()

# Generate the classification report
report = classification_report(y_test_flat, y_pred_flat)

print("Classification Report:")
print(report)


Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.97     34245
           1       0.87      0.71      0.78      4751
           2       0.00      0.00      0.00         4

    accuracy                           0.95     39000
   macro avg       0.61      0.57      0.59     39000
weighted avg       0.95      0.95      0.95     39000
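
The flattened report above pools every label into one table, whereas the instructions for this step ask for a report per output category. A minimal sketch of the per-column loop follows, on made-up labels (`water` and `food` are hypothetical column names). With the real data, `y_true_toy` and `y_pred_toy` would be replaced by `y_test` and `y_pred`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels (DataFrame) and predictions (array), shaped like y_test / y_pred.
y_true_toy = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
y_pred_toy = np.array([[1, 0], [0, 0], [0, 1], [0, 1]])

reports = {}
for idx, label in enumerate(y_true_toy.columns):
    # zero_division=0 reports 0 instead of warning when a class is never predicted.
    print(f"Category: {label}")
    print(classification_report(y_true_toy[label], y_pred_toy[:, idx], zero_division=0))
    reports[label] = classification_report(
        y_true_toy[label], y_pred_toy[:, idx], zero_division=0, output_dict=True
    )
```

Keeping the `output_dict=True` version makes it possible to collect per-category precision, recall and F1 into a DataFrame later.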



### 6. Improve your model
Use grid search to find better parameters. 

In [8]:
param_grid = {
    'tfidf__max_features': [1000, 2000, 3000],  # Number of features to consider
    'tfidf__ngram_range': [(1, 1), (1, 2)],      # Range of n-grams
    'clf__estimator__n_estimators': [100, 200, 300],  # Number of trees in the forest
    'clf__estimator__max_depth': [10, 20, 30],       # Maximum depth of the tree
}

# Perform grid search cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=3, verbose=2, n_jobs=-1)

# Fit grid search on training data
grid_search.fit(X_train, y_train)

# Evaluate performance on test set
best_pipeline = grid_search.best_estimator_
y_pred = best_pipeline.predict(X_test)

Fitting 3 folds for each of 54 candidates, totalling 162 fits


In [9]:
print("Optimal parameters :",best_pipeline)

Optimal parameters : Pipeline(steps=[('tfidf',
                 TfidfVectorizer(max_features=1000, ngram_range=(1, 2))),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier(max_depth=30,
                                                                        n_estimators=200)))])
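
Printing `best_estimator_` shows the whole fitted pipeline. If only the winning settings are needed, `best_params_` and `best_score_` on the search object give them directly. The sketch below uses a toy pipeline and grid (the messages, labels and the `LogisticRegression` step are assumptions for illustration) and also shows how grid keys follow the `<step name>__<parameter>` pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: eight messages with a single binary label.
toy_messages = [
    "we need water", "road is blocked", "need food and water", "bridge is closed",
    "send water please", "highway is flooded", "children need food", "tunnel is closed",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])

# Keys are "<step name>__<parameter>", matching the step names given to Pipeline.
toy_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0]}

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_search.fit(toy_messages, toy_labels)

print("Best parameters:", toy_search.best_params_)
print("Best CV score:", toy_search.best_score_)
```

Because `GridSearchCV` refits the best configuration on the full training set by default, `best_estimator_` can be used for prediction without rebuilding the pipeline by hand.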


In [10]:
# Define pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, ngram_range=(1, 2))),  # Text preprocessing
    ('clf', MultiOutputClassifier(RandomForestClassifier(
        max_depth=30,n_estimators=200
    )))  # Multi-output classifier
])

In [11]:
# Retrain the pipeline with the tuned parameters from the grid search
pipeline.fit(X_train, y_train)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [12]:
y_pred = pipeline.predict(X_test)
# Calculate accuracy for each label
accuracies = [accuracy_score(y_test[label], y_pred[:, idx]) for idx, label in enumerate(y_test.columns)]
print("Accuracy for each label:", accuracies)

# Calculate Hamming Loss for each label
hamming_losses = [hamming_loss(y_test[label], y_pred[:, idx]) for idx, label in enumerate(y_test.columns)]
print("Hamming Loss for each label:", hamming_losses)

# Calculate Jaccard Score for each label
jaccard_scores = [jaccard_score(y_test[label], y_pred[:, idx], average=None) for idx, label in enumerate(y_test.columns)]
print("Jaccard Score for each label:", jaccard_scores)

# Calculate F1 Score for each label
f1_scores = [f1_score(y_test[label], y_pred[:, idx], average=None) for idx, label in enumerate(y_test.columns)]
print("F1 Score for each label:", f1_scores)


Accuracy for each label: [0.806, 0.821, 0.997, 0.796, 0.922, 0.944, 0.963, 0.976, 0.994, 1.0, 0.972, 0.949, 0.939, 0.991, 0.987, 0.979, 0.982, 0.963, 0.826, 0.956, 0.968, 0.957, 0.994, 0.997, 0.988, 0.997, 0.994, 0.974, 0.915, 0.971, 0.979, 0.995, 0.964, 0.991, 0.974, 0.791, 1.0, 1.0, 1.0]
Hamming Loss for each label: [0.194, 0.179, 0.003, 0.204, 0.078, 0.056, 0.037, 0.024, 0.006, 0.0, 0.028, 0.051, 0.061, 0.009, 0.013, 0.021, 0.018, 0.037, 0.174, 0.044, 0.032, 0.043, 0.006, 0.003, 0.012, 0.003, 0.006, 0.026, 0.085, 0.029, 0.021, 0.005, 0.036, 0.009, 0.026, 0.209, 0.0, 0.0, 0.0]
Jaccard Score for each label: [array([0.22131148, 0.794926  , 0.        ]), array([0.70462046, 0.68760908]), array([0.997, 0.   ]), array([0.64766839, 0.6736    ]), array([0.92145015, 0.08235294]), array([0.94337715, 0.1641791 ]), array([0.96296296, 0.02631579]), array([0.976, 0.   ]), array([0.994, 0.   ]), array([1.]), array([0.96828992, 0.80689655]), array([0.93609023, 0.79841897]), array([0.93376764, 0.5642

In [13]:

# y_test holds the true labels and y_pred the predictions of the tuned pipeline

# Flatten y_test and y_pred so classification_report scores all label columns pooled together
y_test_flat = y_test.values.ravel()
y_pred_flat = y_pred.ravel()

# Generate the classification report
report = classification_report(y_test_flat, y_pred_flat)

print("Classification Report:")
print(report)


Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.97     34245
           1       0.87      0.73      0.80      4751
           2       0.00      0.00      0.00         4

    accuracy                           0.95     39000
   macro avg       0.61      0.57      0.59     39000
weighted avg       0.95      0.95      0.95     39000



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [15]:
# Define pipeline with CountVectorizer and an XGBoost classifier
pipeline_v2 = Pipeline([
    ('count_vectorizer', CountVectorizer()),  # Text preprocessing
    ('clf', MultiOutputClassifier(XGBClassifier()))  # Multi-output classifier
])



In [19]:
# Per-column minimum of the training labels
min_label = y_train.min()

# Shift each column so its smallest label is 0 (XGBClassifier expects class labels that start at 0)
y_train_adjusted = y_train - min_label

# Fit the pipeline with the adjusted labels
pipeline_v2.fit(X_train, y_train_adjusted)



In [20]:

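# NOTE: y_pred still holds the tuned random forest predictions from In [12];
# pipeline_v2 has not been used to predict yet, so this report repeats In [13].
# To score pipeline_v2, recompute first: y_pred = pipeline_v2.predict(X_test)

# Flatten y_test and y_pred to fit classification_report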
y_test_flat = y_test.values.ravel()
y_pred_flat = y_pred.ravel()

# Generate the classification report
report = classification_report(y_test_flat, y_pred_flat)

print("Classification Report:")
print(report)

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.97     34245
           1       0.87      0.73      0.80      4751
           2       0.00      0.00      0.00         4

    accuracy                           0.95     39000
   macro avg       0.61      0.57      0.59     39000
weighted avg       0.95      0.95      0.95     39000



In [22]:
pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer()),  # Text preprocessing
    ('clf', MultiOutputClassifier(XGBClassifier()))  # Multi-output classifier
])

# Find the minimum value in y_train
min_label = y_train.min()

# Adjust the target labels
y_train_adjusted = y_train - min_label

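# Initialize GridSearchCV with adjusted target labels.
# param_grid must use keys that exist in this pipeline (clf__estimator__...);
# the tfidf__ keys from the grid in step 6 do not match the 'count_vectorizer' step.
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=3, scoring='accuracy')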

# Perform grid search with adjusted target labels
grid_search.fit(X_train, y_train_adjusted)

# Print best parameters and best score
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)


Best Parameters:  {'clf__estimator__learning_rate': 0.1, 'clf__estimator__max_depth': 3, 'clf__estimator__n_estimators': 100}
Best Score:  nan
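
A `nan` best score means no candidate received a usable score. One likely cause here is that `scoring='accuracy'` does not accept multi-output targets in which a column has more than two classes, and the first label has three. A possible workaround, sketched below on random toy data, is a custom scorer that averages correctness over every label cell. The function name `mean_label_accuracy` and the `DecisionTreeClassifier` stand-in are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier


def mean_label_accuracy(y_true, y_pred):
    """Fraction of individual label cells predicted correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Toy data: 60 samples, 3 target columns; the first column has three classes (0, 1, 2).
rng = np.random.RandomState(0)
X_toy = rng.rand(60, 4)
y_toy = np.column_stack([
    rng.randint(0, 3, size=60),
    rng.randint(0, 2, size=60),
    rng.randint(0, 2, size=60),
])

toy_search = GridSearchCV(
    MultiOutputClassifier(DecisionTreeClassifier(random_state=0)),
    param_grid={"estimator__max_depth": [1, 3]},
    cv=3,
    scoring=make_scorer(mean_label_accuracy),
)
toy_search.fit(X_toy, y_toy)

print("Best Parameters: ", toy_search.best_params_)
print("Best Score: ", toy_search.best_score_)
```

Passing `error_score='raise'` to `GridSearchCV` is another way to surface the underlying exception instead of a silent `nan`, which is especially useful when warnings are filtered out.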


### 9. Export your model as a pickle file

In [None]:
# Refit 'pipeline' (whichever pipeline was defined last) on the training data before exporting
pipeline.fit(X_train, y_train)

# Serialize the pipeline using pickle
with open('model.pkl', 'wb') as file:
    pickle.dump(pipeline, file)
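
To check that an exported model is usable, it can be loaded back and asked for predictions. The sketch below does a full round trip with a small toy pipeline written to a temporary directory. The file name `toy_model.pkl`, the messages and the `LogisticRegression` step are placeholders, not part of the project.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data and pipeline standing in for the trained model.
toy_messages = ["we need water", "the road is blocked", "need food and water", "bridge is closed"]
toy_labels = [1, 0, 1, 0]

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
toy_pipeline.fit(toy_messages, toy_labels)

# Serialize to a temporary location, then load the pipeline back.
model_path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(model_path, "wb") as file:
    pickle.dump(toy_pipeline, file)

with open(model_path, "rb") as file:
    restored = pickle.load(file)

# The restored pipeline should reproduce the original predictions exactly.
same = bool((restored.predict(toy_messages) == toy_pipeline.predict(toy_messages)).all())
print("Predictions match:", same)
```

A pickled pipeline should be loaded with the same scikit-learn version that created it, and only from sources you trust, since unpickling can execute arbitrary code.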

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.