# Project Title | Classification of Crisis-related Tweets

## Introduction

Social media has emerged as an invaluable resource for humanitarian aid organizations to get real-time information about the situation on the ground during a natural disaster. Microblogging platforms like Twitter allow individuals to easily and quickly relay information on location, time, damage severity, etc. to the public at large. With a clearer understanding of what is happening at a micro-level, humanitarian aid organizations can more efficiently allocate and dispatch assistance, as well as inform victims about where and how to seek help.

**While social media has the advantage of real-time information transmission during a fast-changing crisis, the sudden flood of posts and messages can easily overwhelm humanitarian organizations hoping to act on critical and relevant information.** Not all posts and messages tied to a disaster-related event are informative and useful to these organizations.

**Machine learning models can help sift through the deluge of disaster-related posts and identify which ones are most informative for humanitarian aid purposes.**

This project will compare **seven machine learning classification models** (**Multinomial Naive Bayes**, **CatBoostClassifier**, **LogisticRegression**, **RandomForestClassifier**, **XGBoost**, **GradientBoostingClassifier** and **K-Nearest Neighbors**) on this task and propose which model could perform most effectively during a disaster-related event. We will train and test the models on tweets gathered from 7 natural disasters that occurred in 2017: Hurricanes Irma, Harvey, and Maria; earthquakes in Mexico and Iran & Iraq; wildfires in California; and floods in Sri Lanka.
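
As a rough illustration of how such a comparison could be set up, the sketch below cross-validates two of the listed models on TF-IDF features. The toy corpus, the choice of F1 as the metric, and the two-fold split are assumptions made for the example, not the project's actual data or evaluation protocol.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (invented for illustration, not taken from CrisisMMD).
texts = [
    "bridge collapsed after the earthquake, road closed",
    "shelter open at the high school for flood victims",
    "power outage across the county after the hurricane",
    "wildfire evacuation ordered for the north valley",
    "thoughts and prayers for everyone",
    "cannot believe this weather lol",
    "watching the news all day",
    "stay strong, sending love",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = informative, 0 = not informative

# Two of the seven models named above; the others would be added the same way.
candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Assumed evaluation setup: stratified 2-fold cross-validation scored with F1.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
mean_f1 = {}
for name, model in candidates.items():
    # The vectorizer sits inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1")
    mean_f1[name] = scores.mean()

for name, score in mean_f1.items():
    print(f"{name}: mean F1 = {score:.2f}")
```

Scores on a corpus this small carry no meaning; the point is only that every model is evaluated with the same features, the same folds, and the same metric.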

## Objective

**Business Objective:**

To accurately classify tweets related to natural disasters as informative or not informative, thereby enhancing the ability of humanitarian organizations to quickly identify and utilize valuable information for disaster response and relief efforts.

## Data

### Source

The data for this project comes from the **CrisisMMD (Crisis Multi-Modal Dataset)** https://crisisnlp.qcri.org/crisismmd, a collection gathered and made available by the **Crisis Computing team** at **Qatar Computing Research Institute (QCRI)** of **Hamad Bin Khalifa University (HBKU)**. 

Within this dataset are thousands of **tweets** sampled from over 14 million tweets collected during **seven major disasters** (earthquakes, floods, hurricanes, and wildfires) occurring in **2017** in various parts of the world (United States, Puerto Rico, Mexico, Sri Lanka, Iran and Iraq): **Hurricane Irma, Hurricane Harvey, Hurricane Maria, California wildfires, Mexico earthquake, Iran-Iraq earthquake,** and the **Sri Lanka floods**.

### Annotation

Tweets have been manually annotated with three types of labels.

1. **Informative vs Not informative** (Text and Image)
    - Informative
    - Not informative

2. **Humanitarian Categories** (Text and Image)
    - Affected individuals
    - Infrastructure and utility damage
    - Injured or dead people
    - Missing or found people
    - Rescue, volunteering or donation effort
    - Vehicle damage
    - Other relevant information
    - Not humanitarian
    
3. **Damage Severity Assessment** (Image)
    - Severe damage
    - Mild damage
    - Little or no damage
    - Don't know or can't judge
    
However, this project focuses only on textual analysis and on the labels for **Informative vs. Not informative**. We will discard image-related data as well as the labels for **Humanitarian Categories** and **Damage Severity Assessment**.

The label **Informative vs. Not informative** indicates whether or not the text of the tweet is useful for humanitarian aid purposes.  
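
A minimal sketch of that selection step with pandas follows. The rows are invented; the column names `tweet_id`, `tweet_text`, `text_info` and `image_info` come from the dataset preview later in this document, while the label string `not_informative` and the one-row-per-image layout are assumptions that should be checked against the real files.

```python
import pandas as pd

# Invented rows. The first two share a tweet_id on the assumption that a tweet
# with two images is listed once per image.
raw = pd.DataFrame(
    {
        "tweet_id": [101, 101, 102],
        "image_id": ["101_0", "101_1", "102_0"],
        "text_info": ["informative", "informative", "not_informative"],
        "image_info": ["informative", "not_informative", "not_informative"],
        "tweet_text": ["Road closed by flooding", "Road closed by flooding", "What a day"],
    }
)

# Keep only the text and its label, one row per tweet.
text_only = (
    raw[["tweet_id", "tweet_text", "text_info"]]
    .drop_duplicates(subset="tweet_id")
    .reset_index(drop=True)
)

# Encode the label as 1 (informative) or 0 (not informative).
# "not_informative" is an assumed spelling of the negative label.
text_only["label"] = text_only["text_info"].map({"informative": 1, "not_informative": 0})
print(text_only)
```

Dropping duplicates on `tweet_id` keeps the same tweet text from being counted several times, which would otherwise leak identical examples across the train and test splits.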


## PreProcessing

### Import Necessary Libraries

I will load every library used throughout this project, grouped by purpose so the imports are easier to read.

In [3]:
# basic libraries

import os
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# libraries for data ingestion
import zipfile
import requests
import traceback
import urllib.request as request


# Libraries for cleaning.
import re
import nltk
import string
from autocorrect import Speller
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# 'all' already includes the stopwords and wordnet corpora, so separate downloads are redundant
nltk.download('all', quiet=True)


# python dataclass decorator
from dataclasses import dataclass


# custom functions
from src.logger import log
from src.exception import CustomException
from src.utils import unzip_data, download_file

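
The cleaning imports above (`re`, `string`, and the NLTK tokenizer, stopword list, and stemmer) suggest the usual tweet-normalization steps. The sketch below illustrates those steps using only `re` and `string`; the function name `clean_tweet`, the regular expressions, and the tiny stopword set are assumptions for illustration, and the project's own cleaning code is not shown here.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+:?")
RETWEET_RE = re.compile(r"^rt\s+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)
# Tiny stand-in stopword list; nltk.corpus.stopwords would replace it in practice.
STOPWORDS = {"a", "an", "the", "in", "of", "to", "and", "is", "are"}


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, the retweet marker, mentions, punctuation and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = RETWEET_RE.sub("", text)  # after lowercasing, a leading "RT " is "rt "
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


# Invented example tweet.
example = "RT @ExampleUser: Roads are flooded in the city center! https://example.com/x"
print(clean_tweet(example))  # roads flooded city center
```

Stemming or lemmatization (for instance with `PorterStemmer` or `WordNetLemmatizer`) would be applied to the token list before it is joined back into a string.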


### Data Ingestion

In [5]:
# Step 2: Define the download source and destination

# URL of the source data to be downloaded
source_url: str = "https://github.com/xplict33/mlproject/raw/main/data.zip"

# Path to the downloaded zip file
zip_dir = "data.zip"

In [None]:
# Step 3: Download the dataset

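# Fail fast on a network problem or HTTP error instead of saving an error page as data.zip
response = requests.get(source_url, timeout=60)
response.raise_for_status()
with open(zip_dir, 'wb') as file:
    file.write(response.content)
print(f"Downloaded file to {zip_dir}")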

In [None]:
# Step 4: Extract the dataset

extracted_dir = 'data'
with zipfile.ZipFile(zip_dir, 'r') as zip_ref:
    zip_ref.extractall(extracted_dir)
print(f"Unzipped file to {extracted_dir}")

In [None]:
# Step 5: List the files in the extracted directory

extracted_files = os.listdir(extracted_dir)
print(f"Files in the extracted directory: {extracted_files}")
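
Steps 3 to 5 can also be wrapped in small reusable helpers. The sketch below uses only the standard library; `fetch_file` and `extract_archive` are hypothetical names and signatures chosen for this example, and they are not the `download_file` and `unzip_data` functions imported from `src.utils`, whose implementations are not shown.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch_file(url: str, dest: Path, timeout: float = 60.0) -> Path:
    """Stream url to dest, skipping the download if dest already exists."""
    dest = Path(dest)
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download is never
    # mistaken for a complete file on the next run.
    partial = dest.with_suffix(dest.suffix + ".part")
    with urllib.request.urlopen(url, timeout=timeout) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return dest


def extract_archive(zip_path: Path, target_dir: Path) -> list[str]:
    """Extract zip_path into target_dir and return the sorted member names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


# Usage with the variables defined in Step 2:
# fetch_file(source_url, Path(zip_dir))
# extract_archive(Path(zip_dir), Path("data"))
```

Skipping an existing file makes the notebook safe to re-run without downloading the archive again.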

In [2]:
# Preview the cleaned dataset written by the data-cleaning step
t = pd.read_csv(r"C:\disaster-tweets\artifacts\data_cleaner\cleaned_data.csv")
t.head(3)

| Unnamed: 0 | tweet_id | image_id | text_info | text_info_conf | image_info | image_info_conf | text_human | text_human_conf | image_human | image_human_conf | tweet_text | image_url | image_path | cleanText |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 917791044158185473 | 917791044158185473_0 | informative | 1.0 | informative | 0.6766 | other_relevant_information | 1.0 | other_relevant_information | 0.6766 | RT @Gizmodo: Wildfires raging through Northern... | http://pbs.twimg.com/media/DLyi_WYVYAApwNg.jpg | data_image/california_wildfires/10_10_2017/917... | gi ##z ##mo ##do wild ##fire rage northern cal... |
| 1 | 917791130590183424 | 917791130590183424_0 | informative | 1.0 | informative | 0.6667 | infrastructure_and_utility_damage | 1.0 | affected_individuals | 0.6667 | PHOTOS: Deadly wildfires rage in California ht... | http://pbs.twimg.com/media/DLymKm9UMAAu0qw.jpg | data_image/california_wildfires/10_10_2017/917... | photo deadli wild ##fire rage california ##tc ... |
| 2 | 917791291823591425 | 917791291823591425_0 | informative | 0.6813 | informative | 1.0 | other_relevant_information | 0.6813 | infrastructure_and_utility_damage | 1.0 | RT @Cal_OES: PLS SHARE: We’re capturing wild... | http://pbs.twimg.com/media/DLudaaZV4AAjT7x.jpg | data_image/california_wildfires/10_10_2017/917... | cano ## pl ## share captur wild ##fire respons... |
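
The `Unnamed: 0` column in this preview is what pandas typically produces when a CSV that was saved together with its index is read back with default settings. The sketch below shows one way to avoid it; the CSV text is a small stand-in invented for the example, and reading `tweet_id` as a string is a suggestion, not something the original notebook does.

```python
import io

import pandas as pd

# Invented two-row CSV mimicking a file written by DataFrame.to_csv() with its index.
csv_text = (
    ",tweet_id,text_info,cleanText\n"
    "0,917791044158185473,informative,example cleaned text\n"
    "1,917791130590183424,informative,another cleaned text\n"
)

# Default read: the saved index comes back as an extra "Unnamed: 0" column.
default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores the saved index; reading tweet_id as str keeps the
# 18-digit identifiers from being treated as numbers.
tidy = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={"tweet_id": str})

print(list(default.columns))
print(list(tidy.columns))
```

Writing the file with `to_csv(path, index=False)` in the first place avoids the extra column altogether.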
