# Natural Language Processing with Disaster Tweets (v3)

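A sample machine learning notebook for NLP: it classifies tweets as disaster-related or not, comparing a TF-IDF Naive Bayes baseline with a fine-tuned ALBERT model.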

## Dataset

Natural Language Processing with Disaster Tweets

- Predict which Tweets are about real disasters and which ones are not
  - https://www.kaggle.com/competitions/nlp-getting-started/overview


In [1]:
import pandas as pd
import numpy as np
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification, Trainer, TrainingArguments

from sklearn.base import BaseEstimator
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score


In [2]:
# Methods preparation
def clean_text(text: str) -> str:
    """Remove hashtags, URLs, and user mentions from the text"""
    text = re.sub(r"#\w+", "", text)
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    return text


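def fill_missing_keyword_and_location(df: pd.DataFrame) -> None:
    """Fill missing values in the 'keyword' and 'location' columns of a DataFrame in place"""
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which recent pandas versions warn about and may not apply to df
    df['keyword'] = df['keyword'].fillna('unknown_keyword')
    df['location'] = df['location'].fillna('unknown_location')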


def evaluate_trained_model(
    model: BaseEstimator,
    X_test_data: list,
    y_test_data: list,
    is_transformer: bool = False
) -> None:
    """Evaluate a trained Machine Learning model using various metrics

    This function provides:
    - Accuracy Score: Measures how accurately the class labels are predicted.
    - Precision Score: Evaluates how many of the items predicted as positive are actually positive.
    - Confusion Matrix: Provides a matrix representing TP, FP, FN, TN for each class.
    - Classification Report: Generates a detailed report including Precision, Recall, F1-score, and Support for each class.

    Args:
        model: Trained scikit-learn estimator, or a transformers Trainer when is_transformer is True.
        X_test_data, y_test_data: Test data and labels.
        is_transformer: Set to True when model is a transformers Trainer (e.g. one wrapping ALBERT).
    """
    if is_transformer:
        # Trainer.predict returns (predictions, label_ids, metrics);
        # predictions holds the logits, so argmax picks the class index
        predictions = model.predict(X_test_data)
        y_pred = predictions[0].argmax(axis=-1)
    else:
        y_pred = model.predict(X_test_data)

    print(f"Evaluation: {model.__class__.__name__ if not is_transformer else 'Transformer Model'}\n")  
    print("Accuracy:", accuracy_score(y_test_data, y_pred))
    print("Precision:", precision_score(y_test_data, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test_data, y_pred))
    print("Classification Report:\n", classification_report(y_test_data, y_pred))
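
To make the effect of `clean_text` concrete, here is a small sketch on a made-up tweet. The sample string and the `normalize_whitespace` helper are illustrative assumptions, not part of the notebook. Note that the hashtag pattern drops the whole tag, word included, and that the removals leave extra spaces behind.

```python
import re


def clean_text(text: str) -> str:
    """Same three substitutions as the notebook's clean_text."""
    text = re.sub(r"#\w+", "", text)      # drop hashtags, including the tag word
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)      # drop user mentions
    return text


def normalize_whitespace(text: str) -> str:
    """Hypothetical helper: collapse whitespace runs and trim both ends."""
    return " ".join(text.split())


# Made-up sample tweet, used only for illustration
sample = "Forest fire near #LaRonge http://t.co/abc @user stay safe"
cleaned = clean_text(sample)

print(repr(cleaned))                        # gaps remain where tokens were removed
print(repr(normalize_whitespace(cleaned)))  # single spaces only
```

If the hashtag word itself carries signal (for example `#earthquake`), a variant that strips only the `#` character may be worth comparing against this one.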


In [3]:
# Load Train Dataset
df_train = pd.read_csv("./raw_data/train.csv")

df_train.head(3)

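|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |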


In [4]:
# Preprocessing: fill NaN and clean text
fill_missing_keyword_and_location(df_train)
df_train['text'] = df_train['text'].apply(clean_text)

df_train.head(3)

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 | unknown_keyword | unknown_location | Our Deeds are the Reason of this May ALLAH Fo... | 1 |
| 1 | 4 | unknown_keyword | unknown_location | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | unknown_keyword | unknown_location | All residents asked to 'shelter in place' are ... | 1 |


In [5]:
y = df_train['target']

## Naive Bayes
---

In [6]:
# Feature Engineering: TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000)

X_tfidf = vectorizer.fit_transform(df_train['text'])

In [7]:
# Check vectorizer vocabulary
vocabulary = vectorizer.vocabulary_

first_n_pairs = {k: vocabulary[k] for k in list(vocabulary)[:10]}
print("First 10 vocabulary items:", first_n_pairs)

First 10 vocabulary items: {'our': 3172, 'are': 416, 'the': 4398, 'reason': 3563, 'of': 3101, 'this': 4419, 'may': 2809, 'allah': 300, 'us': 4627, 'all': 299}
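
For intuition about what `TfidfVectorizer` computes, the sketch below reimplements its default weighting in plain Python under stated assumptions: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows. The `tfidf_rows` function and the two sample documents are hypothetical, the tokenizer is a simplified stand-in for scikit-learn's, and `max_features` pruning is left out.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tfidf_rows(docs: List[str]) -> List[Dict[str, float]]:
    """Hypothetical helper: TF-IDF weights per document as {token: weight}."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = {tok: counts[tok] * idf[tok] for tok in counts}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so every non-empty row has unit length
        rows.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return rows


# Made-up documents, used only for illustration
rows = tfidf_rows(["forest fire near town", "fire alarm test"])
for row in rows:
    print({tok: round(w, 3) for tok, w in row.items()})
```

The token "fire" appears in both sample documents, so it receives a lower weight than tokens unique to one document.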


In [8]:
# Model Building: split the data
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(
    X_tfidf,
    y,
    test_size=0.2,
    random_state=42
)

In [9]:
# Train the Naive Bayes model
model_nb = MultinomialNB()
model_nb.fit(X_train_tfidf, y_train)

In [10]:
evaluate_trained_model(
    model_nb,
    X_test_tfidf,
    y_test
)

Evaluation: MultinomialNB

Accuracy: 0.804333552199606
Precision: 0.8447937131630648
Confusion Matrix:
 [[795  79]
 [219 430]]
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.91      0.84       874
           1       0.84      0.66      0.74       649

    accuracy                           0.80      1523
   macro avg       0.81      0.79      0.79      1523
weighted avg       0.81      0.80      0.80      1523
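
The headline numbers can be re-derived from the confusion matrix, which is a useful sanity check. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`) and plugs in the MultinomialNB counts printed above.

```python
# Counts copied from the MultinomialNB confusion matrix above,
# read as [[TN, FP], [FN, TP]] (assumed scikit-learn layout)
tn, fp = 795, 79
fn, tp = 219, 430

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all tweets labeled correctly
precision = tp / (tp + fp)     # share of predicted disasters that are real
recall = tp / (tp + fn)        # share of real disasters that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The recall figure shows where this model is weak: 219 of the 649 real disaster tweets in the evaluation split are missed.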



## ALBERT (BERT)
---
- ref
  - ALBERT https://huggingface.co/docs/transformers/model_doc/albert
    - albert-base-v2 https://huggingface.co/albert-base-v2
  - SentencePiece https://github.com/google/sentencepiece#installation

In [11]:
# Create torch Dataset
class DisasterDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, index):
        item = {
            key: torch.tensor(val[index]) for key, val in self.encodings.items()
        }
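        # Compare against None explicitly so the check does not depend on truthiness
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[index])
        return item

    def __len__(self):
        if self.labels is not None:
            return len(self.labels)
        return len(self.encodings['input_ids'])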


In [12]:
# Model Building: Tokenization
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')

In [13]:
# Model Building: split the data
X_train_texts, X_test_texts, y_train_labels, y_test_labels = train_test_split(
    df_train['text'].tolist(),
    y.tolist(),
    test_size=0.2,
    random_state=42
)

In [14]:
# Encode the data
MAX_LENGTH: int = 256

X_train_encodings = tokenizer(
    X_train_texts,
    truncation=True,
    padding=True,
    max_length=MAX_LENGTH
)
X_test_encodings = tokenizer(
    X_test_texts,
    truncation=True,
    padding=True,
    max_length=MAX_LENGTH
)
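
With `truncation=True`, `padding=True`, and `max_length=MAX_LENGTH`, the token ids are cut to at most 256 tokens and then padded to the longest sequence in that call, with a matching attention mask. The plain-Python sketch below imitates that effect only. `pad_and_truncate` is a hypothetical helper, the token ids are made up, and a pad id of 0 is an assumption rather than a value read from the tokenizer.

```python
from typing import Dict, List


def pad_and_truncate(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: truncate to max_length, then pad to the longest row."""
    # Simplification: a plain slice; a real tokenizer also keeps its special tokens
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)

    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, used only for illustration
batch = [[2, 13, 48, 3], [2, 7, 3], [2, 5, 9, 21, 30, 3]]
encoded = pad_and_truncate(batch, max_length=5)

for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

Because padding goes to the longest sequence per call, the train and test encodings can end up with different widths. Tweets are short, so the 256-token cap is unlikely to be reached.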

In [15]:
# Dataset for train
train_dataset = DisasterDataset(
    X_train_encodings,
    y_train_labels
)

# Dataset for evaluation
test_dataset = DisasterDataset(
    X_test_encodings,
    y_test_labels
)

In [16]:
# Prepare training arguments (Hyperparameter adjustment)
training_arguments = TrainingArguments(
    output_dir='./v3_training_results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./v3_training_logs',
    learning_rate=2e-5,
)
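
With `warmup_steps=500` and the `Trainer` default linear scheduler, the learning rate ramps up from 0 to `2e-5` over the first 500 steps and then decays linearly to 0 at the last step. The sketch below reproduces that shape in plain Python. `linear_lr` is a hypothetical helper, and the total of 1143 steps is taken from the training output of this run.

```python
def linear_lr(
    step: int,
    base_lr: float = 2e-5,
    warmup_steps: int = 500,
    total_steps: int = 1143,
) -> float:
    """Hypothetical helper: linear warm-up followed by linear decay to zero."""
    if step < warmup_steps:
        # Warm-up phase: scale linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 1000, 1143):
    print(step, f"{linear_lr(step):.2e}")
```

Warm-up covers 500 of the 1143 steps (about 44%), so the model spends little time near the peak learning rate. A shorter warm-up may be worth trying on a dataset this small.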

In [17]:
# Train the ALBERT model
model_albert = AlbertForSequenceClassification.from_pretrained(
    'albert-base-v2',
    num_labels=2
)

# Create a trainer for the Albert model
trainer = Trainer(
    model=model_albert,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500 | 0.4859 |
| 1000 | 0.4698 |


TrainOutput(global_step=1143, training_loss=0.46477560275287244, metrics={'train_runtime': 36260.9037, 'train_samples_per_second': 0.504, 'train_steps_per_second': 0.032, 'total_flos': 69927152677200.0, 'train_loss': 0.46477560275287244, 'epoch': 3.0})

### Train process result

The logged training loss decreases slightly over the run, which suggests the model is learning.

- Training Loss: The logged training loss falls from 0.4859 at step 500 to 0.4698 at step 1000.
  - The drop is small, so it is only modest evidence that the model is still improving.
- Global Step: Overall, 1143 steps (batches) were trained.
- Training Loss (Final): The average training loss over the whole run is about 0.465.
  - This is slightly below the 0.4859 logged at step 500.
- Train Runtime: Training took about 36,261 seconds (roughly 10 hours).
- Train Samples Per Second: Approximately 0.504 samples per second were processed.
- Train Steps Per Second: Approximately 0.032 steps (batches) per second were processed.
- Total FLOs: The total number of floating point operations is about 69,927,152,677,200 (about 70 trillion).
- Epoch: The dataset was iterated over three times (3 epochs).
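
The step count and throughput figures are tied together by simple arithmetic. The sketch below assumes 6,090 training rows and a single device. The row count is inferred, not printed by the notebook: the 1,523 evaluation rows from the classification report are taken to be 20% of a 7,613-row `train.csv`. The results are compared with the `TrainOutput` values.

```python
import math

# Assumption: train.csv has 7,613 rows, of which 1,523 were held out for evaluation
TRAIN_ROWS = 7613 - 1523
BATCH_SIZE = 16                # per_device_train_batch_size, single device assumed
EPOCHS = 3                     # num_train_epochs
RUNTIME_SECONDS = 36260.9037   # train_runtime from TrainOutput

steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)  # the last batch is partial
total_steps = steps_per_epoch * EPOCHS
samples_per_second = TRAIN_ROWS * EPOCHS / RUNTIME_SECONDS
steps_per_second = total_steps / RUNTIME_SECONDS

print("total steps:", total_steps)
print(f"samples/s: {samples_per_second:.3f}")
print(f"steps/s: {steps_per_second:.3f}")
```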


In [19]:
evaluate_trained_model(
    trainer,
    test_dataset,
    y_test_labels,
    is_transformer=True
)

Evaluation: Transformer Model

Accuracy: 0.8319107025607354
Precision: 0.8324873096446701
Confusion Matrix:
 [[775  99]
 [157 492]]
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.89      0.86       874
           1       0.83      0.76      0.79       649

    accuracy                           0.83      1523
   macro avg       0.83      0.82      0.83      1523
weighted avg       0.83      0.83      0.83      1523



### Comparison results of MultinomialNB and ALBERT

- The ALBERT model is better in overall accuracy (0.832 vs. 0.804) and in recall for the disaster class (0.76 vs. 0.66), but falls slightly short in precision (0.832 vs. 0.845).
- ALBERT can capture more of the context and structure of the text, but is computationally demanding.

For __v3__, we select the __ALBERT__ model.
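
Since the two models trade precision against recall, a single combined number helps the comparison. The sketch below computes class-1 recall and F1 for both models from the confusion matrices printed above, assuming scikit-learn's `[[TN, FP], [FN, TP]]` layout. `recall_and_f1` is a hypothetical helper.

```python
from typing import List, Tuple


def recall_and_f1(matrix: List[List[int]]) -> Tuple[float, float]:
    """Hypothetical helper: recall and F1 for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return recall, f1


# Confusion matrices copied from the two evaluation outputs above
results = {
    "MultinomialNB": [[795, 79], [219, 430]],
    "ALBERT": [[775, 99], [157, 492]],
}

scores = {name: recall_and_f1(matrix) for name, matrix in results.items()}
for name, (recall, f1) in scores.items():
    print(f"{name}: recall={recall:.2f} f1={f1:.2f}")
```

In absolute terms, ALBERT finds 62 more disaster tweets (492 vs. 430) at the cost of 20 more false alarms (99 vs. 79).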

In [20]:
# Load Test Dataset
df_test = pd.read_csv("./raw_data/test.csv")

df_test.head(3)

|   | id | keyword | location | text |
|---|----|---------|----------|------|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |


In [21]:
# Preprocessing: fill NaN and clean text
fill_missing_keyword_and_location(df_test)
df_test['text'] = df_test['text'].apply(clean_text)

In [22]:
test_texts = df_test['text'].tolist()

test_encodings = tokenizer(
    test_texts,
    truncation=True,
    padding=True,
    # Reuse the constant from the training encodings to keep both in sync
    max_length=MAX_LENGTH
)

In [23]:
# Test Dataset
test_dataset = DisasterDataset(
    test_encodings,
    # The Kaggle test set has no labels, so pass None
    None
)

In [24]:
# Prediction
test_predictions = trainer.predict(test_dataset)[0].argmax(axis=-1)
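
`trainer.predict` returns the raw logits as its first element, one row per tweet and one column per class, and `argmax(axis=-1)` picks the higher-scoring class. The sketch below shows the same step in plain Python on made-up logits and adds a softmax to read them as probabilities. The values and the helper names are illustrative assumptions.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Turn one row of logits into probabilities that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row: List[float]) -> int:
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


# Made-up logits for three tweets: columns are class 0 and class 1
logits = [[-1.2, 2.3], [0.8, -0.4], [0.1, 0.3]]

labels = [argmax(row) for row in logits]
probs = [softmax(row) for row in logits]

for row, label, p in zip(logits, labels, probs):
    print(row, "->", label, [round(x, 3) for x in p])
```

Softmax does not change which class wins, so the submission labels are the same either way. The probabilities are only useful for inspecting how confident each prediction is.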

In [25]:
result_df = pd.DataFrame({
    'id': df_test['id'],
    'target': test_predictions
})

In [26]:
display(result_df)

|   | id | target |
|---|----|--------|
| 0 | 0 | 1 |
| 1 | 2 | 0 |
| 2 | 3 | 1 |
| 3 | 9 | 0 |
| 4 | 11 | 1 |
| ... | ... | ... |
| 3258 | 10861 | 1 |
| 3259 | 10865 | 1 |
| 3260 | 10868 | 1 |
| 3261 | 10874 | 1 |


In [27]:
count_target_1 = result_df['target'].value_counts().get(1, 0)
print(f"The number of rows where target=1: {count_target_1}")

The number of rows where target=1: 1209


In [28]:
result_df.to_csv(
    'v3_submission.csv',
    index=False
)

### Result

Score: 0.8296