# **NLP Twitter Disaster Classifier Project**
This project classifies tweets by whether or not they describe a real disaster.

---
## **1. Importing data**
Data set from [Kaggle](https://www.kaggle.com/competitions/nlp-getting-started).

In [None]:
import pandas as pd

df = pd.read_csv('..\\Datasets\\Twitter Disaster\\train.csv')

df

---
## **2. Preprocessing data**
**This includes:**
- Removing unnecessary text.
- Converting the text to lowercase.
- Tokenizing the text.
- Removing stopwords.
- Applying Lemmatization.
- Extracting important info (such as hashtags).

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

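# Download only the resources used below; nltk.download() with no arguments opens the interactive downloader.
# Newer NLTK releases look for 'punkt_tab', older ones for 'punkt', so both are requested.
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # set membership checks are faster than list lookups

# Removes URLs, HTML tags, and every non-letter character (including '#', digits, and punctuation),
# then lowercases, tokenizes, drops stopwords, and lemmatizes the text.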
def preprocess_text(text):
    text = re.sub(r'http\S+', ' ', text)
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    words = nltk.word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

# Processing Hashtags specifically
def preprocess_hashtags(text):
    text = text.lower()
    text = re.findall(r'#\w[\w_]*', text)
    return ' '.join(text)

# Apply preprocessing to the tweet text
df['cleaned_text'] = df['text'].apply(preprocess_text)
# Extract hashtags to a separate column
df['hashtags'] = df['text'].apply(preprocess_hashtags)

df
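
To make the regular expressions above concrete, here is a minimal sketch that applies the same patterns step by step to one made-up tweet. The sample text is invented for illustration, and a plain whitespace split stands in for `nltk.word_tokenize` so the snippet runs without any NLTK downloads; stopword removal and lemmatization are left out.

```python
import re

# Invented sample tweet, not taken from the data set.
sample = "Forest #fire near La Ronge <b>Sask.</b> Canada http://t.co/abc123 #Wildfire_2015"

# Same three substitutions as preprocess_text, applied one at a time.
no_urls = re.sub(r'http\S+', ' ', sample)
no_tags = re.sub(r'<.*?>', ' ', no_urls)
letters_only = re.sub(r'[^a-zA-Z]', ' ', no_tags).lower()
tokens = letters_only.split()  # whitespace split stands in for nltk.word_tokenize

# Same pattern as preprocess_hashtags, run on the untouched (lowercased) tweet.
hashtags = re.findall(r'#\w[\w_]*', sample.lower())

print(tokens)
print(hashtags)
```

The hashtag words stay in the cleaned text without their `#`, while digits and underscores are dropped there and survive only in the extracted hashtags.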

---
## **3. Feature Extraction**
Vectorize the text using TF-IDF.

In [None]:
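from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix, hstack

# TF-IDF features for the cleaned tweet text
text_vectorizer = TfidfVectorizer()
X_text = text_vectorizer.fit_transform(df['cleaned_text'])
y = df['target']

# Use a separate vectorizer for hashtags so that fitting it does not overwrite the text vocabulary.
# Note: the default token pattern drops the leading '#'.
hashtag_vectorizer = TfidfVectorizer()
X_hashtags = hashtag_vectorizer.fit_transform(df['hashtags'])

# Binary indicator columns for a few hand-picked hashtags (substring match)
hashtag_set = ['#disaster', '#earthquake', '#flood', '#fire']
for hashtag in hashtag_set:
    df[hashtag] = df['hashtags'].apply(lambda x: int(hashtag in x))
X_flags = csr_matrix(df[hashtag_set].values)

# Stack the text, hashtag, and indicator features side by side in one sparse matrix
X = hstack([X_text, X_hashtags, X_flags]).tocsr()

X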
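
One detail worth checking when vectorizing the hashtag column: with its default `token_pattern`, `TfidfVectorizer` keeps only word characters, so `#fire` is indexed as `fire`. The sketch below compares the default with a custom pattern that keeps the `#`. The toy documents and the `r'#\w+'` pattern are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy hashtag strings; the empty string mimics a tweet without hashtags.
hashtag_docs = ['#fire #wildfire', '', '#flood']

# Default token pattern: only runs of two or more word characters are kept.
default_vec = TfidfVectorizer()
default_vec.fit(hashtag_docs)
default_vocab = sorted(default_vec.vocabulary_)

# Hypothetical alternative: a pattern that keeps the leading '#'.
hash_vec = TfidfVectorizer(token_pattern=r'#\w+')
hash_vec.fit(hashtag_docs)
hash_vocab = sorted(hash_vec.vocabulary_)

print(default_vocab)
print(hash_vocab)
```

Whether keeping the `#` helps depends on the data. It mainly matters if hashtag tokens should stay distinct from the same words in the tweet body.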

---
## **4-5. Model Training and Evaluation**
Train multiple models and compare them to select the best one.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=26)

print(f'Training data shape: {X_train.shape}')
print(f'Test data shape: {X_test.shape}')
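
The split above is purely random. When the classes are imbalanced, a stratified split keeps the class proportions the same in both parts. The following is a minimal sketch on toy data; `X_toy` and `y_toy` are made-up stand-ins, and using `stratify` is a suggestion rather than something the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real features and labels: 100 rows, 40% positive.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy asks for the same class ratio in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=26, stratify=y_toy
)

print(f'Positive share in train: {y_tr.mean():.2f}')
print(f'Positive share in test:  {y_te.mean():.2f}')
```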

In [None]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
import numpy as np

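# Oversample the minority class in the training split only; the test split stays untouched.
# Caveat: cross-validating on data that is already resampled tends to give optimistic scores.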
smote = SMOTE(random_state=26)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Candidate models to compare
models = {
    'Logistic Regression': LogisticRegression(class_weight='balanced', max_iter=1000),
    'SVM': SVC(class_weight='balanced', probability=True),
    # 'logloss' is the binary log-loss metric ('mlogloss' is its multi-class counterpart)
    'XGBoost': xgb.XGBClassifier(scale_pos_weight=(np.sum(y_train == 0)/np.sum(y_train == 1)), eval_metric='logloss'),
    'Random Forest': RandomForestClassifier(class_weight='balanced'),
}

evaluation_results = {}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=26)

for model_name, model in models.items():
    print(f"Evaluating {model_name}...")
    
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    conf_matrix = confusion_matrix(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)

    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1')
    mean_cv_score = np.mean(cv_scores)

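    evaluation_results[model_name] = {
        'Accuracy': accuracy,
        'Weighted Precision': precision,
        'Weighted Recall': recall,
        'Weighted F1 Score': f1,
        'Classification Report': classification_rep,
        'Confusion Matrix': conf_matrix,
        'Mean Cross-Validation F1 Score': mean_cv_score,
    }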
    
    print(f'{model_name} Performance:')
    print('Classification Report:\n', classification_rep)
    print('Mean Cross-Validation F1 Score:', mean_cv_score)
    print('Confusion Matrix:\n', conf_matrix)
    print('\n')
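
In the loop above, cross-validation runs on features that were vectorized and resampled beforehand, so each validation fold has already influenced the preprocessing. One common alternative is to wrap the preprocessing and the classifier in a pipeline, so that every step is refit on the training part of each fold. The sketch below shows the idea with scikit-learn's `Pipeline` and a tiny invented corpus. The texts, labels, and the 3-fold setting are assumptions chosen only to keep it self-contained. A resampling step such as SMOTE could be added in the same way with `imblearn.pipeline.Pipeline`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (1 = real disaster, 0 = not), for illustration only.
texts = [
    'forest fire spreading near town', 'earthquake damage reported downtown',
    'flood water rising evacuate now', 'wildfire smoke forces evacuation',
    'storm destroys homes in county', 'earthquake kills dozens overnight',
    'this song is fire', 'my exam was a disaster lol',
    'flood of emails this morning', 'that movie destroyed me',
    'traffic is an earthquake of noise', 'i am on fire at work today',
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is part of the model, so it is fit on the training folds only.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=1000)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=26)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring='f1')

print('Per-fold F1:', scores)
print('Mean F1:', scores.mean())
```

On a corpus this small the scores themselves mean little; the point is only the structure of the evaluation.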


---
## **6. Interpretation and Application**
Use the evaluation metrics to find the best model for this data.

In [None]:
# Determine the best model based on Mean Cross-Validation F1 Score
best_model_name = max(evaluation_results, key=lambda k: evaluation_results[k]['Mean Cross-Validation F1 Score'])
best_model_metrics = evaluation_results[best_model_name]

# Print the model's info
print('======================================')
print(f"The best model is: {best_model_name}")
print('======================================')
print('Metrics:\n---------------------')
print('Classification Report:\n', best_model_metrics['Classification Report'])
print('Confusion Matrix:\n', best_model_metrics['Confusion Matrix'])
print('Mean Cross-Validation F1 Score:', best_model_metrics['Mean Cross-Validation F1 Score'])

### **How Can We Use This Model Efficiently?**

#### **Real-Time Monitoring**

We can **connect** the model to a social media feed to watch for **disaster-related tweets**. It could then automatically **alert authorities** to emerging situations, helping them respond quickly and coordinate efforts.
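
As a rough sketch of what such monitoring could look like, the snippet below scores a batch of incoming tweets and keeps those whose predicted disaster probability reaches a threshold. Everything here is hypothetical: the training examples are invented, `flag_tweets` is an illustrative helper, and the 0.5 threshold is an arbitrary starting point. A real deployment would load the trained model and read from an actual feed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training examples (1 = real disaster, 0 = not), standing in for the trained model.
train_texts = [
    'earthquake destroys buildings in the city', 'flood waters force evacuation',
    'wildfire spreads across the hills', 'i love this pizza place',
    'great concert last night', 'my phone battery is a disaster',
]
train_labels = [1, 1, 1, 0, 0, 0]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(train_texts, train_labels)

def flag_tweets(tweets, threshold=0.5):
    """Return (tweet, probability) pairs whose disaster probability meets the threshold."""
    probabilities = classifier.predict_proba(tweets)[:, 1]  # column 1 = class 1 (disaster)
    return [(t, float(p)) for t, p in zip(tweets, probabilities) if p >= threshold]

incoming = ['earthquake reported near the coast', 'great pizza tonight']
for tweet, probability in flag_tweets(incoming):
    print(f'{probability:.2f}  {tweet}')
```

Lowering the threshold catches more real disasters at the cost of more false alarms, so it should be tuned against the precision and recall reported earlier.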

#### **Spotting Key Information**

The model can help **pick out** and **highlight** crucial information on social media, so that **important updates** reach the right people in time and affected communities stay informed.

#### **Guiding Resource Use**

Aggregating the model's predictions over time and by location reveals **tweet trends** that can help decide where **resources** should go. By indicating how **serious** and **widespread** a disaster is, these trends support **smarter planning** and quicker responses.
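
A minimal sketch of such a trend summary, assuming the predictions are stored next to a location field: the `scored` DataFrame, its `predicted` column, and the place names below are all made up for illustration.

```python
import pandas as pd

# Hypothetical model output: one row per tweet with its location and predicted label.
scored = pd.DataFrame({
    'location': ['Jakarta', 'Jakarta', 'Lima', 'Lima', 'Lima', None],
    'predicted': [1, 1, 1, 0, 0, 1],
})

# Count predicted disaster tweets and total tweets per location, busiest first.
summary = (
    scored.dropna(subset=['location'])
    .groupby('location')['predicted']
    .agg(disaster_tweets='sum', total_tweets='count')
    .sort_values('disaster_tweets', ascending=False)
)

print(summary)
```

Tweets without a location are dropped here, which may hide activity in areas where users rarely fill in that field.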