# Natural Language Processing with Disaster Tweets (v3)

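A sample NLP notebook: classify tweets as disaster-related or not, using TF-IDF features and a Multinomial Naive Bayes baseline.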

## Dataset

Kaggle competition: Natural Language Processing with Disaster Tweets

- Predict which Tweets are about real disasters and which ones are not
  - https://www.kaggle.com/competitions/nlp-getting-started/overview


In [1]:
import pandas as pd
import numpy as np
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from sklearn.base import BaseEstimator
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score


In [2]:
# Helper functions
def clean_text(text: str) -> str:
    """Remove hashtags, user mentions, and URLs from the text."""
    text = re.sub(r"#\w+", "", text)
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    return text


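def fill_missing_keyword_and_location(df: pd.DataFrame) -> None:
    """Fill missing values in the 'keyword' and 'location' columns (modifies df in place)."""
    # Assign the result back: fillna(inplace=True) on a column selection
    # triggers a chained-assignment warning in recent pandas versions.
    df['keyword'] = df['keyword'].fillna('unknown_keyword')
    df['location'] = df['location'].fillna('unknown_location')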


def evaluate_trained_model(
    model: BaseEstimator,
    X_test_data: list,
    y_test_data: list,
    is_transformer: bool = False
) -> None:
    """Evaluate a trained Machine Learning model using various metrics

    This function provides:
    - Accuracy Score: Measures how accurately the class labels are predicted.
    - Precision Score: Evaluates how many of the items predicted as positive are actually positive.
    - Confusion Matrix: Provides a matrix representing TP, FP, FN, TN for each class.
    - Classification Report: Generates a detailed report including Precision, Recall, F1-score, and Support for each class.

    Args:
        model: Trained machine learning model.
        X_test_data, y_test_data: Test data and labels.
        is_transformer: If the model is a transformers model (like ALBERT), set to True
    """
    if is_transformer:
        # Assuming model outputs are logits
        logits = model(X_test_data)
        y_pred = logits.argmax(dim=1).cpu().numpy()
    else:
        y_pred = model.predict(X_test_data)

    print(f"Evaluation: {model.__class__.__name__}\n")  
    print("Accuracy:", accuracy_score(y_test_data, y_pred))
    print("Precision:", precision_score(y_test_data, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test_data, y_pred))
    print("Classification Report:\n", classification_report(y_test_data, y_pred))
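
To make the effect of `clean_text` concrete, here is a minimal sketch that applies the same three regexes to a made-up tweet (the sample string is hypothetical, not from the dataset). Note that the whole hashtag is dropped, word included, and that the gaps left behind are not collapsed; the final `split`/`join` step is an optional extra that the notebook itself does not perform.

```python
import re


def clean_text(text: str) -> str:
    """Same regexes as the notebook's clean_text."""
    text = re.sub(r"#\w+", "", text)      # hashtags, including the tagged word
    text = re.sub(r"http\S+", "", text)   # URLs
    text = re.sub(r"@\w+", "", text)      # user mentions
    return text


# Hypothetical tweet, made up for illustration
sample = "Fire near @city_hall #wildfire http://t.co/abc123 stay safe"
cleaned = clean_text(sample)
print(repr(cleaned))  # runs of spaces remain where tokens were removed

# Optional extra step (not in the notebook): collapse leftover whitespace
normalized = " ".join(cleaned.split())
print(normalized)  # Fire near stay safe
```

Leftover whitespace is harmless for `TfidfVectorizer`, whose default tokenizer ignores it.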


In [3]:
# Load Train Dataset
df_train = pd.read_csv("./raw_data/train.csv")

df_train.head(3)

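|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |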


In [4]:
# Preprocessing: fill NaN and clean text
fill_missing_keyword_and_location(df_train)
df_train['text'] = df_train['text'].apply(clean_text)

df_train.head(3)

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 | unknown_keyword | unknown_location | Our Deeds are the Reason of this May ALLAH Fo... | 1 |
| 1 | 4 | unknown_keyword | unknown_location | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | unknown_keyword | unknown_location | All residents asked to 'shelter in place' are ... | 1 |


In [5]:
# Target labels: 1 = real disaster, 0 = not
y = df_train['target']

## Naive Bayes
---

In [6]:
# Feature Engineering: TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000)

X_tfidf = vectorizer.fit_transform(df_train['text'])

In [7]:
# Inspect the vectorizer vocabulary (maps each token to its column index in the TF-IDF matrix)
vocabulary = vectorizer.vocabulary_

first_n_pairs = {k: vocabulary[k] for k in list(vocabulary)[:10]}
print("First 10 vocabulary items:", first_n_pairs)

First 10 vocabulary items: {'our': 3172, 'are': 416, 'the': 4398, 'reason': 3563, 'of': 3101, 'this': 4419, 'may': 2809, 'allah': 300, 'us': 4627, 'all': 299}
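
The integers in `vocabulary_` are column indices into the TF-IDF matrix, not counts or weights, and scikit-learn assigns them in alphabetical order of the tokens (which is why `'all'` and `'allah'` are neighbours at 299 and 300 above). A minimal sketch on a three-document toy corpus, where `docs`, `toy_vectorizer`, and `X_toy` are illustrative names and not part of the notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration
docs = [
    "forest fire near town",
    "fire crews near the forest",
    "nice sunny day",
]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(docs)

# One row per document, one column per distinct token
print(X_toy.shape)  # (3, 9)

# Sort the vocabulary by column index to see the alphabetical ordering
vocab = toy_vectorizer.vocabulary_
for token, column in sorted(vocab.items(), key=lambda kv: kv[1]):
    print(column, token)
```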


In [8]:
# Model Building: split the data
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(
    X_tfidf,
    y,
    test_size=0.2,
    random_state=42
)
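
The split above is purely random, so the class ratio in the test set can drift slightly from the ratio in the full data. If that matters, one option is the `stratify` argument of `train_test_split`. The sketch below uses made-up arrays (`X_demo`, `y_demo`) with a 60/40 class balance; it is an optional variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 60 negatives, 40 positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,  # keep the 60/40 ratio in both splits
)

print(len(X_tr), len(X_te))    # 80 20
print(y_tr.mean(), y_te.mean())  # 0.4 0.4
```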

In [9]:
# Train the Naive Bayes model
model_nb = MultinomialNB()
model_nb.fit(X_train_tfidf, y_train)
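
One thing to be aware of: the vectorizer in this notebook is fitted on all of `df_train` before the split, so the IDF weights have already seen the held-out rows. A possible alternative, sketched here on a made-up toy corpus, is to wrap both steps in a scikit-learn pipeline and fit it on raw training text only. The names `texts`, `labels`, and `pipe` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for illustration (1 = disaster, 0 = not)
texts = [
    "forest fire near the town",
    "earthquake damages buildings downtown",
    "flood waters rising fast",
    "wildfire evacuation ordered today",
    "great pizza for dinner",
    "watching a movie with friends",
    "this song is amazing",
    "love this sunny weekend",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is fitted inside the pipeline, on training text only
pipe = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
pipe.fit(texts, labels)

pred = pipe.predict(["forest fire evacuation", "pizza and a movie"])
print(pred)  # [1 0]
```

With this layout, `train_test_split` would be applied to the raw text column rather than to the TF-IDF matrix.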

In [10]:
evaluate_trained_model(
    model_nb,
    X_test_tfidf,
    y_test
)

Evaluation: MultinomialNB

Accuracy: 0.804333552199606
Precision: 0.8447937131630648
Confusion Matrix:
 [[795  79]
 [219 430]]
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.91      0.84       874
           1       0.84      0.66      0.74       649

    accuracy                           0.80      1523
   macro avg       0.81      0.79      0.79      1523
weighted avg       0.81      0.80      0.80      1523
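
The headline numbers can be recomputed by hand from the confusion matrix, which helps when reading the report. In scikit-learn's layout, rows are true labels and columns are predicted labels, so for a binary problem the flattened matrix reads TN, FP, FN, TP. A short sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix from the evaluation output above
cm = np.array([[795, 79],
               [219, 430]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all tweets classified correctly
precision = tp / (tp + fp)        # of predicted disasters, how many were real
recall = tp / (tp + fn)           # of real disasters, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.8043 precision=0.8448 recall=0.6626 f1=0.7427
```

The gap between precision (0.84) and recall (0.66) for class 1 comes from the 219 false negatives: the model misses about a third of the real disaster tweets in the test split.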

