# Disaster Messages Classification Using NLP

![ai.jpeg](ai.jpeg)

### Problem Statement:

This project is part of the Data Science Nanodegree Program offered by Udacity. The dataset comprises pre-labeled tweets and messages derived from actual disaster events. The primary objective of this project is to develop a Natural Language Processing (NLP) model capable of categorizing messages in real-time, aiding disaster response teams in efficiently providing assistance.

**The project unfolds through the following key sections:**

1. **Data Processing:**
   - Creation of an Extract, Transform, Load (ETL) pipeline to retrieve data from the source.
   - Cleaning and organizing the data for effective utilization.
   - Storage of processed data in a SQLite database for accessibility and future use.

2. **Machine Learning Pipeline:**
   - Development of a machine learning pipeline designed to train a model capable of categorizing text messages into multiple predefined categories.
   - Incorporation of natural language processing techniques and machine learning algorithms to enhance model accuracy.

3. **Web Application:**
   - Implementation of a web application that showcases real-time results generated by the trained model.
   - The web app serves as a user-friendly interface for disaster response teams to quickly assess and act upon incoming messages.

By seamlessly integrating these components, the project aims to contribute to the effective and swift classification of messages during disaster situations, ultimately aiding in a more efficient allocation of resources by response teams.

#  <font color='red'> Part 1: ETL (Extract, Transform, and Load) </font>

### 1. Import libraries and load datasets.
- Import Python libraries
- Load `messages.csv` into a dataframe and inspect the first few lines.
- Load `categories.csv` into a dataframe and inspect the first few lines.

In [1]:
# import libraries 

# Importing NumPy library for numerical operations and array handling
import numpy as np

# Importing Pandas library for data manipulation and analysis
import pandas as pd

# Importing create_engine from SQLAlchemy for database connectivity and interaction
from sqlalchemy import create_engine

import plotly.express as px


### Dataset Overview:

In this project, we are working with two essential data files:

1. **disaster_messages.csv:**
   - This file contains messages submitted by individuals during disaster events.
   
2. **disaster_categories.csv:**
   - This file provides information about the categories associated with each message in the corresponding disaster_messages.csv file.

The dataset is labeled, enabling the application of supervised learning techniques, specifically the classification method, to achieve our goal. The objective is to develop a model that can effectively categorize messages based on their content and associated categories. This labeled dataset will serve as the foundation for training our supervised learning model, enhancing its ability to accurately classify messages in real-world disaster scenarios.


###  <font color='red'> **Load messages dataset** </font>

In [2]:
# load messages dataset
messages = pd.read_csv("disaster_messages.csv")
messages.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [3]:
messages.shape

(26248, 4)

#### Observation :
- The dataset comprises a total of 26,248 messages.
- It contains four types of information: Message_id, the message itself, the message in its original language, and the genre.

I'm not sure about the language of the original messages, that is why I am going to take help of Google translator to auto-detect the original language. 

In [4]:
import mitosheet

In [5]:
mitosheet.sheet(analysis_to_replay="id-tlfdbehind")

In [6]:
from mitosheet.public.v3 import *; # Analysis Name:id-tlfdbehind;
import pandas as pd

# Imported disaster_messages.csv
disaster_messages = pd.read_csv(r'disaster_messages.csv')

# Pivoted disaster_messages into disaster_messages_pivot
tmp_df = disaster_messages[['id', 'genre']].copy()
pivot_table = tmp_df.pivot_table(
    index=['genre'],
    values=['id'],
    aggfunc={'id': ['count']}
)
pivot_table.set_axis([flatten_column_header(col) for col in pivot_table.keys()], axis=1, inplace=True)
disaster_messages_pivot = pivot_table.reset_index()


#### Observation :

- I extracted messages from the table above and utilized Google Translator to identify the language of the original messages.

- The majority of the messages are in French, while a few of them are detected as Hetian Croel.

###  <font color='red'> **Load categories dataset** </font>

In [7]:
# load categories dataset
categories = pd.read_csv("disaster_categories.csv")
categories.head()

Unnamed: 0,id,categories
0,2,related-1;request-0;offer-0;aid_related-0;medi...
1,7,related-1;request-0;offer-0;aid_related-1;medi...
2,8,related-1;request-0;offer-0;aid_related-0;medi...
3,9,related-1;request-1;offer-0;aid_related-1;medi...
4,12,related-1;request-0;offer-0;aid_related-0;medi...


### 2. Merge datasets.
- Merge the messages and categories datasets using the common id
- Assign this combined dataset to `df`, which will be cleaned in the following steps

In [8]:
# merge datasets
df = pd.merge(messages, categories, on="id")
df.head()

Unnamed: 0,id,message,original,genre,categories
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,related-1;request-0;offer-0;aid_related-0;medi...
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,related-1;request-0;offer-0;aid_related-1;medi...
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,related-1;request-0;offer-0;aid_related-0;medi...
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,related-1;request-1;offer-0;aid_related-1;medi...
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,related-1;request-0;offer-0;aid_related-0;medi...


Categories is our target variable, but it has many categories. 

### 3. Split `categories` into separate category columns.
- Split the values in the `categories` column on the `;` character so that each value becomes a separate column. You'll find [this method](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.str.split.html) very helpful! Make sure to set `expand=True`.
- Use the first row of categories dataframe to create column names for the categories data.
- Rename columns of `categories` with new column names.

In [9]:
# create a dataframe of the 36 individual category columns
categories = df.categories.str.split(";", expand=True)
categories.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


In [10]:
# select the first row of the categories dataframe
row = categories.iloc[0,:]

# use this row to extract a list of new column names for categories.
# one way is to apply a lambda function that takes everything 
# up to the second to last character of each string with slicing
category_colnames = row.apply(lambda name: name[:-2]).tolist() #Removed last two characters from the names
print(category_colnames)

['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']


In [11]:
# rename the columns of `categories`
categories.columns = category_colnames
categories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


### 4. Convert category values to just numbers 0 or 1.
- Iterate through the category columns in df to keep only the last character of each string (the 1 or 0). For example, `related-0` becomes `0`, `related-1` becomes `1`. Convert the string to a numeric value.
- You can perform [normal string actions on Pandas Series](https://pandas.pydata.org/pandas-docs/stable/text.html#indexing-with-str), like indexing, by including `.str` after the Series. You may need to first convert the Series to be of type string, which you can do with `astype(str)`.

In [12]:
for column in categories:
    # set each value to be the last character of the string
    categories[column] = categories[column].str.split("-").str[-1]
    
    # convert column from string to numeric
    categories[column] = categories[column].astype(int)
categories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 5. Replace `categories` column in `df` with new category columns.
- Drop the categories column from the df dataframe since it is no longer needed.
- Concatenate df and categories data frames.

In [13]:
# drop the original categories column from `df`
df = df.drop(["categories"], axis=1)

df.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [14]:
# concatenate the original dataframe with the new `categories` dataframe
df = pd.concat([df, categories], axis=1)
df.head(2)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0


### 6. Remove duplicates.
- Check how many duplicates are in this dataset.
- Drop the duplicates.
- Confirm duplicates were removed.

In [15]:
# check number of duplicates
df.duplicated().sum()

170

In [16]:
# drop duplicates
df = df.drop_duplicates()

In [17]:
# check number of duplicates
df.duplicated().sum()

0

In [18]:
# Columns in your DataFrame
columns_list = ['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
                'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people',
                'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity',
                'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm',
                'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']

# Create a new DataFrame with the column-wise sum
column_sum = df[columns_list].sum().sort_values(ascending=False)

# Create an interactive bar chart using plotly
fig = px.bar(column_sum, x=column_sum.index, y=column_sum.values, labels={'x': 'Category', 'y': 'Count'},
             title='Count of Messages in Each Category', text=column_sum.values)

# Rotate x-axis labels for better readability
fig.update_layout(xaxis=dict(tickangle=45))

# Show the plot
fig.show()

In [19]:
df.shape

(26216, 40)

In [20]:
import pandas as pd

# Assuming df is your DataFrame
# Columns in your DataFrame
columns_list = ['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
                'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people',
                'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity',
                'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm',
                'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']

# Create a new DataFrame with the column-wise sum
column_sum = df[columns_list].sum()/len(df)

# Sort the result in descending order
column_sum_sorted = column_sum.sort_values(ascending=False)

# Print the result
print("Column-wise sum (sorted in descending order):")
print(column_sum_sorted)


Column-wise sum (sorted in descending order):
related                   0.773650
aid_related               0.414251
weather_related           0.278341
direct_report             0.193584
request                   0.170659
other_aid                 0.131446
food                      0.111497
earthquake                0.093645
storm                     0.093187
shelter                   0.088267
floods                    0.082202
medical_help              0.079493
infrastructure_related    0.065037
water                     0.063778
other_weather             0.052487
buildings                 0.050847
medical_products          0.050084
transport                 0.045812
death                     0.045545
other_infrastructure      0.043904
refugees                  0.033377
military                  0.032804
search_and_rescue         0.027617
money                     0.023039
electricity               0.020293
cold                      0.020217
security                  0.017966
clothing 

The column-wise sum, represented as a percentage of the total number of rows in the dataset, provides insights into the distribution of categories within the data. The top categories, in terms of the highest percentage of occurrence, include:

1. **Related (77.37%):** The majority of messages are related to some form of disaster or crisis.
2. **Aid-related (41.43%):** A significant proportion of messages are associated with requests for aid or assistance.
3. **Weather-related (27.83%):** A considerable portion of messages pertains to weather-related events.
4. **Direct report (19.36%):** A noteworthy percentage of messages involves direct reporting of incidents.
5. **Request (17.07%):** A substantial number of messages include specific requests for assistance.
6. **Child Alone (0%):**  On the other hand, some categories have relatively lower percentages, such as "Child Alone" with 0.00%, indicating a minimal occurrence in the dataset.

Understanding these percentages is valuable for prioritizing and focusing efforts on specific categories, especially in disaster response scenarios where certain types of messages may be more prevalent. This summary aids in the interpretation of the distribution of messages across categories in the context of the dataset.

7. Save the clean dataset into an sqlite database. For this notebook we will us df from the above cells for our next tasks, but to be aligned with udacity project we will do the following step 
You can do this with pandas [`to_sql` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) combined with the SQLAlchemy library. Remember to import SQLAlchemy's `create_engine` in the first cell of this notebook to use it below.

In [21]:
engine = create_engine('sqlite:///Disaster_Response.db')
df.to_sql('Disaster_Response', engine, index=False)

ValueError: Table 'Disaster_Response' already exists.

# <font color='red'> Model Training and Testing <font>


### Machine Learning Pipeline Preparation

Certainly! You can integrate the information into your project report as follows:

"In the subsequent phases of our project, we will implement a pivotal step—data splitting. This involves dividing our dataset into two distinct subsets: training data and testing data. The rationale behind this division is to facilitate the training of our machine learning model on the training data, allowing it to discern patterns and relationships within the dataset. Following the training phase, the model will be put to the test using the testing data, which serves as an independent dataset for evaluating the model's performance on previously unseen instances.

The sequential breakdown is as follows:

### 1. Data Splitting:

We will partition our dataset into training data and testing data, with the former constituting the input for training our machine learning model.

### 2. Training the Model:

Our machine learning model will be trained using the training data, enabling it to learn from the input features and associated target labels. The objective is to instill the model with the capacity to generalize well to new, unseen data.

### 3. Testing the Model:

The trained model will then be tested on the designated testing data. This phase assesses the model's performance on instances it has not encountered during the training process, providing valuable insights into its ability to generalize effectively.

### 4. Performance Evaluation:

Various metrics, including accuracy, precision, recall, and F1 score, will be employed to evaluate the model's performance on the testing data. These metrics serve as quantifiable indicators of the model's classification accuracy and alignment with the project's objectives.

### 5. Iterative Process:

The evaluation results may prompt iterations in the model or adjustments to its parameters. This iterative process continues until the model demonstrates satisfactory performance in line with our project goals.

This meticulous approach ensures a robust evaluation of our machine learning model's capabilities, fostering continuous improvement as we refine and enhance its performance based on real-world testing scenarios."



#### Import libraries 
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [22]:
# Importing re for regular expressions
import re

# Importing pickle for serializing and deserializing Python objects
import pickle

# Importing create_engine from SQLAlchemy for database connectivity and interaction
from sqlalchemy import create_engine

# Importing Natural Language Toolkit (NLTK) for natural language processing tasks
import nltk

# Importing stopwords, word_tokenize, and WordNetLemmatizer from NLTK for text processing
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

# Importing classification_report, MultiOutputClassifier, confusion_matrix, GridSearchCV,
# RandomForestClassifier, train_test_split, Pipeline, FeatureUnion, BaseEstimator, and TransformerMixin
# from scikit-learn for machine learning tasks
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

# Importing CountVectorizer and TfidfTransformer from scikit-learn for text feature extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


In [23]:
# Downloading NLTK resources (punkt, wordnet, stopwords) for text processing tasks
nltk.download(['punkt', 'wordnet', 'stopwords'])
#including tokenization data (punkt), the WordNet lexical database (wordnet), and a list of common stopwords (stopwords

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kapilwankhede/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kapilwankhede/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kapilwankhede/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
# Creating a database engine using SQLAlchemy to connect to the SQLite database 'Disaster_Response.db'
#engine = create_engine('sqlite:///Disaster_Response.db')

# Reading data from the 'Disaster_Response' table in the SQLite database into a Pandas DataFrame 'df'
#df = pd.read_sql_table("Disaster_Response", engine)

# Extracting the 'message' column from the DataFrame 'df' as the feature variable 'X'
X = df["message"]

# Creating a copy of the DataFrame 'df' and storing it in 'df_sample'
df_sample = df

# Extracting the target variables from 'df_sample' by dropping columns 'id', 'message', 'original', and 'genre'
Y = df_sample.drop(["id", "message", "original", "genre"], axis=1)


Now we have x and y 

In [25]:
X.head(2)

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
Name: message, dtype: object

In [26]:
Y.head(3)

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [27]:
def tokenize(text):
    # Converting all the text to lowercase
    text.lower()
    
    # Removing punctuation characters from the text and replacing them with an empty space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    # Using the word tokenizer to convert text into tokens
    tokens = word_tokenize(text)
    
    # Removing stopwords by calling a "for loop" for the tokens and using English stopwords from NLTK
    tokens = [word for word in tokens if word not in stopwords.words("english")]
    
    # Using lemmatization to strip all the words
    tokens = [WordNetLemmatizer().lemmatize(word).strip() for word in tokens]
    
    # Finally, return the tokens
    return tokens


### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [28]:
# Creating a machine learning pipeline using scikit-learn
pipeline = Pipeline([
    # Step 1: Tokenize and convert text data to a bag-of-words representation
    ('vect', CountVectorizer(tokenizer=tokenize)),
    
    # Step 2: Transforming the bag-of-words representation into TF-IDF (Term Frequency-Inverse Document Frequency)
    ('tfidf', TfidfTransformer()),
    
    # Step 3: Building a multi-output classification model using Random Forest Classifier
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

# Displaying the pipeline
pipeline


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [29]:
# Split data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [30]:
pipeline.fit(X_train, Y_train)


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'



### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [31]:
category_names = Y.columns.tolist()
print(category_names)

['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']


In [32]:
from sklearn.metrics import classification_report

predict_y = pipeline.predict(X_test)


# Create an empty DataFrame to store the classification report metrics
classification_reports_df = pd.DataFrame(columns=['Category', 'Precision', 'Recall', 'F1-Score', 'Support'])

# Loop through each category and generate the classification report
for i in range(len(category_names)):
    category = category_names[i]
    report = classification_report(Y_test[category], predict_y[:, i], output_dict=True, zero_division=1)
    
    # Extract relevant metrics from the classification report
    precision = report['weighted avg']['precision']
    recall = report['weighted avg']['recall']
    f1_score = report['weighted avg']['f1-score']
    support = report['weighted avg']['support']
    
    # Append the metrics to the DataFrame
    classification_reports_df = classification_reports_df.append({
        'Category': category,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1_score,
        'Support': support
    }, ignore_index=True)

# Sort the DataFrame based on the 'F1-Score' column in descending order
classification_reports_df = classification_reports_df.sort_values(by='F1-Score', ascending=False)

# Display the sorted classification report DataFrame
print(classification_reports_df)


                  Category  Precision    Recall  F1-Score  Support
9              child_alone   1.000000  1.000000  1.000000   7865.0
25                   shops   0.996453  0.996440  0.994663   7865.0
2                    offer   0.995444  0.995423  0.993139   7865.0
23                   tools   0.993934  0.993897  0.990855   7865.0
7                 security   0.963710  0.981310  0.990739   7865.0
24               hospitals   0.990181  0.990083  0.985149   7865.0
31                    fire   0.984405  0.989447  0.984446   7865.0
15          missing_people   0.988439  0.988303  0.982488   7865.0
13                clothing   0.983044  0.986141  0.980567   7865.0
26             aid_centers   0.987076  0.986904  0.980399   7865.0
33                    cold   0.976593  0.979657  0.971425   7865.0
32              earthquake   0.970870  0.971774  0.971081   7865.0
14                   money   0.974897  0.979021  0.969695   7865.0
22             electricity   0.974119  0.978894  0.968943   78

Here are five key points from the above classification performance metrics:

1. **Exceptional Performance for 'child_alone':**
   - The category 'child_alone' demonstrates outstanding precision, recall, and F1-score, each reaching a perfect score of 1.0. This is because it's not have any positive values. In the training and testing dataset no message classified as 'child alone'. In that case we should remove this category. 

2. **Consistently High Performance Across Multiple Categories:**
   - Several categories, such as 'fire', 'shops', 'offer', and 'tools', exhibit high precision, recall, and F1-scores, all surpassing 0.99. Similarly, the total number of messages categorized into these 'fire', 'shops', 'offer', and 'tools' categories are less than 2%. 
   
3. **Balanced Performance Across Various Categories:**
   - A number of categories, including 'earthquake', 'water', 'death', 'refugees', and 'floods', showcase well-balanced precision, recall, and F1-scores. This indicates that the model performs consistently across different types of categories, maintaining a balance between precision and recall.

4. **Challenges in 'other_infrastructure' Category:**
   - The 'other_infrastructure' category, while still demonstrating relatively high precision and recall, has a lower F1-score. This suggests potential challenges in achieving a harmonious balance between precision and recall for this specific category.

5. **Room for Improvement in 'direct_report':**
   - The 'direct_report' category exhibits comparatively lower precision, recall, and F1-score. This implies that there may be opportunities to enhance the model's performance in accurately classifying instances related to direct reports.

These insights provide a nuanced understanding of the model's performance across various categories, highlighting areas of strength and pointing towards potential refinements for specific categories with room for improvement.

### Store the trained model as classifier.pkl

In [34]:
import gzip

In [42]:
with open("../models/classifier.pkl", 'wb') as file:  
    pickle.dump(pipeline, file)

In [46]:
import gzip
import pickle

# Load your model or data
with open('../models/classifier.pkl', 'rb') as file:
    data = pickle.load(file)

# Compress and save
with gzip.open('../models/classifier.pkl.gz', 'wb') as file:
    pickle.dump(data, file)


In [49]:

import pickle

# Specify the path to your pickle file
file_path = '../models/classifier.pkl'

# Load the data from the pickle file
with open(file_path, 'rb') as f:
    data = pickle.load(f)

# Now, you can print or explore the contents of the loaded data
print(data)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x7f7cd30a6b80>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])


### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
pipeline.get_params()

parameters = {
    "clf__estimator__n_estimators": [50, 100, 200],
    "clf__estimator__min_samples_split": [2, 3]
}

cv = GridSearchCV(pipeline, param_grid=parameters)

# Back-testing 

### Test the model


In [258]:
import pickle
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
#from sklearn.ensemble import AdaBoostClassifier

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import joblib  # Use joblib for compatibility with older scikit-learn versions



# Load the trained model
#model_path = '../models/classifier.pkl'
#loaded_model = joblib.load(model_path)

# Define the preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation characters
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stopwords.words("english")]
    # Lemmatize the words
    tokens = [WordNetLemmatizer().lemmatize(word).strip() for word in tokens]
    # Join tokens back into a string
    processed_text = ' '.join(tokens)
    return processed_text

# Get user input
user_input = input("Enter a message: ")

# Preprocess the user input
processed_input = preprocess_text(user_input)

# Use the loaded model to predict categories
predicted_categories = pipeline.predict([processed_input])

# Display the predicted categories that are 1
category_names = ['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
                   'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
                   'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings',
                   'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related',
                   'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']

predicted_categories = [category_names[i] for i in range(len(predicted_categories[0])) if predicted_categories[0][i] == 1]

print("Predicted categories:", predicted_categories)


Enter a message: 'I am in Croix-des-Bouquets. We have health issues. They ( workers ) are in Santo 15. ( an area in Croix-des-Bouquets 
Predicted categories: ['related', 'request', 'aid_related', 'medical_help', 'medical_products', 'direct_report']


In [261]:
print(df.message[[7, 9, 40,44]])

7     Please, we need tents and water. We are in Sil...
9     I am in Croix-des-Bouquets. We have health iss...
40    People from Dal blocked since Wednesday in Car...
44                 Good evening, is the earthquake end?
Name: message, dtype: object


In [267]:
print(df.message[40])

People from Dal blocked since Wednesday in Carrefour, we having water shortage, food and medical assistance.


In [41]:
with open("../models/classifier.pkl", 'wb') as file:  
    pickle.dump(pipeline, file)

## <font color='red'> 8. Try improving your model further. Here are a few ideas:<font> 
* Try other machine learning algorithms
* Add other features besides the TF-IDF

In [None]:
class StartingVerbExtractor(BaseEstimator, TransformerMixin):

    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            first_word, first_tag = pos_tags[0]
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True
        return False

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

In [None]:
pipeline_modified = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            ('starting_verb', StartingVerbExtractor())
        ])),

        ('clf', RandomForestClassifier())
    ])

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameters you want to tune for RandomForestClassifier
param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__min_samples_split': [2, 3, 4],
    'features__transformer_weights': [
        {'text_pipeline': 1, 'starting_verb': 0.5},
        {'text_pipeline': 0.5, 'starting_verb': 1},
        {'text_pipeline': 0.8, 'starting_verb': 1},
    ]
}

# Create the pipeline with the RandomForestClassifier
pipeline_modified = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),
        ('starting_verb', StartingVerbExtractor())
    ])),
    ('clf', RandomForestClassifier())
])

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline_modified, param_grid=param_grid, cv=3, scoring='f1_micro')

# Fit the model with the training data
grid_search.fit(X_train, Y_train)

# Get the best parameters and the best estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Print the best parameters
print("Best Parameters:", best_params)

# Print the best estimator
print("Best Estimator:", best_estimator)


## <font color='red'> It is taking time to train the model that is why I am considering going with the base model<font>

In [None]:
pipeline_modified

parameters = {
        'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2)),
        'features__text_pipeline__vect__max_df': (0.5, 0.75, 1.0),
        'features__text_pipeline__vect__max_features': (None, 5000, 10000),
        'features__text_pipeline__tfidf__use_idf': (True, False),
        'clf__n_estimators': [50, 100, 200],
        'clf__min_samples_split': [2, 3, 4],
        'features__transformer_weights': (
            {'text_pipeline': 1, 'starting_verb': 0.5},
            {'text_pipeline': 0.5, 'starting_verb': 1},
            {'text_pipeline': 0.8, 'starting_verb': 1},
        )
    }

In [None]:
parameters = {
    'features__text_pipeline__vect__ngram_range': [(1, 1), (1, 2)],
    'features__text_pipeline__vect__max_df': [0.5, 1.0],
    'features__text_pipeline__vect__max_features': [None, 5000],
    'features__text_pipeline__tfidf__use_idf': [True, False],
    'clf__n_estimators': [50, 100],
    'clf__min_samples_split': [2, 4],
    'features__transformer_weights': [
        {'text_pipeline': 1, 'starting_verb': 0.5},
        {'text_pipeline': 0.5, 'starting_verb': 1},
    ]
}



In [None]:
cv = GridSearchCV(pipeline_modified, param_grid=parameters)

In [None]:
model_md = cv

In [None]:
model_md.fit(X_train, Y_train)

In [None]:
from sklearn.metrics import classification_report

predict_y_mod = model_md.predict(X_test)


# Create an empty DataFrame to store the classification report metrics
classification_reports_df = pd.DataFrame(columns=['Category', 'Precision', 'Recall', 'F1-Score', 'Support'])

# Loop through each category and generate the classification report
for i in range(len(category_names)):
    category = category_names[i]
    report = classification_report(Y_test[category], predict_y_mod[:, i], output_dict=True, zero_division=1)
    
    # Extract relevant metrics from the classification report
    precision = report['weighted avg']['precision']
    recall = report['weighted avg']['recall']
    f1_score = report['weighted avg']['f1-score']
    support = report['weighted avg']['support']
    
    # Append the metrics to the DataFrame
    classification_reports_df = classification_reports_df.append({
        'Category': category,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1_score,
        'Support': support
    }, ignore_index=True)

# Sort the DataFrame based on the 'F1-Score' column in descending order
classification_reports_df = classification_reports_df.sort_values(by='F1-Score', ascending=False)

# Display the sorted classification report DataFrame
print(classification_reports_df)


In [None]:
import pickle
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
#from sklearn.ensemble import AdaBoostClassifier

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import joblib  # Use joblib for compatibility with older scikit-learn versions



# Load the trained model
#model_path = '../models/classifier.pkl'
#loaded_model = joblib.load(model_path)

# Define the preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation characters
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stopwords.words("english")]
    # Lemmatize the words
    tokens = [WordNetLemmatizer().lemmatize(word).strip() for word in tokens]
    # Join tokens back into a string
    processed_text = ' '.join(tokens)
    return processed_text

# Get user input
user_input = input("Enter a message: ")

# Preprocess the user input
processed_input = preprocess_text(user_input)

# Use the loaded model to predict categories
predicted_categories = model_md.predict([processed_input])

# Display the predicted categories that are 1
category_names = ['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
                   'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
                   'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings',
                   'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related',
                   'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']

predicted_categories = [category_names[i] for i in range(len(predicted_categories[0])) if predicted_categories[0][i] == 1]

print("Predicted categories:", predicted_categories)


### <font color='red'> 9. Export your model as a pickle file <font>

In [None]:
with open("../models/classifier_md.pkl", 'wb') as file:  
    pickle.dump(model_md, file)