# Introduction

Social media has become one of the most important channels through which information is conveyed. When a disaster occurs, it is important to learn about it immediately, and Twitter is one platform that is relied on to provide authentic information. Disaster response teams and other agencies monitor it to stay informed about unfolding events. One point matters a great deal here: context. For example,

"Keyboard warriors are on fire" means something very different from "The building is on fire."

We use machine learning to classify whether or not a tweet reports a real disaster.

The evaluation metric is the F1 score.
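Because the metric is F1 rather than accuracy, it can help to see how the score follows from the counts of true positives, false positives, and false negatives. The following is a minimal sketch in plain Python; the function name `f1_from_labels` and the toy labels are illustrative assumptions and do not come from the notebook.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical): 1 = disaster tweet, 0 = not a disaster
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 3))
```

Here precision and recall are both 2/3, so the F1 score is also 2/3.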

Training Data: 7613 records with 5 columns (id, keyword, location, text and target)

Test Data: 3263 records with 4 columns (id, keyword, location, text)

# Importing Libraries

In [None]:

import re
import os
import string
import unicodedata
from nltk.corpus import stopwords, wordnet
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.stem.porter import PorterStemmer
from nltk.corpus import words
word_dict = words.words()
stemmer = PorterStemmer()
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, KFold, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer, TFBertForSequenceClassification

# Reading csv files

In [None]:
train_data = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")

In [None]:
train_data.head(5)

In [None]:
train_data.shape

In [None]:
test_data = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

In [None]:
test_data.head(5)

In [None]:
test_data.shape

# EDA and Text Preprocessing

In [None]:
train_data.info()

In [None]:
train_data.isnull().sum()

In [None]:
test_data.info()

In [None]:
test_data.isnull().sum()

# A Few Findings

Location is missing in 2533 (33%) of the training records.

Location is missing in 1105 (33%) of the test records.

Keyword is missing in approximately 1% of records.

There are no missing values in text or target.

# We replace missing values with 'Unknown' for both keyword and location (mode imputation would also work)

In [None]:
train_data = train_data.fillna('Unknown')
test_data = test_data.fillna('Unknown')

In [None]:
# Keep the modelling columns (dropping id) and remove the 102 duplicate rows in the training data
train_data = train_data[['keyword', 'location', 'text', 'target']].drop_duplicates()

In [None]:
sns.displot(x = 'target', hue = 'target', data = train_data, palette = ['green', 'yellow'])
plt.legend(['No Disaster', 'Disaster'])
plt.show()

In [None]:
train_data['location'].value_counts()[0:20]

In [None]:
train_data['keyword'].value_counts()[0:20]

# Certain keywords point to disasters. Also, where location information is available, most of the tweets are from the US

# It's important to process the text by removing stop words, non-word characters such as #, and URLs

In [None]:
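def clean_txt(txt):
    res = unicodedata.normalize('NFKC', txt)  # normalize Unicode compatibility forms
    res = re.sub(r'[^\x00-\x7F]+', r'', res)  # drop non-ASCII characters
    res = re.sub(r'^RT[\s]+', r'', res)       # drop a leading retweet marker
    res = re.sub(r'\$\w*', r'', res)          # drop $-prefixed tokens such as stock tickers
    res = re.sub(r'&lt;', r'<', res)          # decode the HTML entities for < and >
    res = re.sub(r'&gt;', r'>', res)
    res = re.sub(r'&amp;?', r'and', res)      # replace the ampersand entity with "and"
    res = re.sub(r'<[^>]*?>', r'', res)       # strip HTML tags
    res = re.sub(r'#', r' #', res)            # put a space before every '#'
    res = re.sub(r'\s#\s', r' ', res)         # drop '#' characters that stand alone
    return res

train_data['text'] = train_data['text'].apply(clean_txt)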
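The heading above mentions URLs, but `clean_txt` as written has no step for them, so after punctuation is stripped later a link ends up as a single token such as `httptcoabc123`. A minimal sketch of an extra step is shown below. It assumes URLs start with `http://`, `https://`, or `www.` and run to the next whitespace; the helper name `strip_urls` and the sample tweet are hypothetical.

```python
import re

# Assumption: a URL starts with http://, https://, or www. and ends at whitespace
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def strip_urls(txt):
    """Remove URLs and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub('', txt)
    return re.sub(r'\s+', ' ', without_urls).strip()

sample = "Forest fire near La Ronge http://t.co/abc123 stay safe"
cleaned = strip_urls(sample)
print(cleaned)
```

If used, a step like this would need to run before punctuation is removed, since the pattern relies on `://` and `.` still being present.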

In [None]:
text = " ".join(review for review in train_data['text'])
wordcloud = WordCloud(background_color="white").generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
pd.Series(' '.join(train_data['text']).lower().split()).value_counts()[:50] # Top 50 words

# Data Cleaning

In [None]:
stop_words = set(stopwords.words('english'))
def text_processing(text):
    words = text.lower().split()
    filtered_words = [word for word in words if word not in stop_words]
    clean_text = ' '.join(filtered_words)
    clean_text = clean_text.translate(str.maketrans('', '', string.punctuation)).strip()
    return clean_text

In [None]:
train_data['text'] = train_data['text'].apply(text_processing)
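One detail of `text_processing` is worth noting: stop words are filtered before punctuation is stripped, so a token such as `on,` passes the filter and only afterwards becomes `on`. The sketch below compares the two orders. It uses a tiny hard-coded stop-word set as a stand-in for the NLTK list (an assumption made to keep the example self-contained), and both function names are hypothetical.

```python
import string

# Assumption: a tiny stand-in for stopwords.words('english')
STOP_WORDS = {"the", "is", "on", "a", "in"}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def filter_then_strip(text):
    """Order used in the notebook: drop stop words, then strip punctuation."""
    kept = [w for w in text.lower().split() if w not in STOP_WORDS]
    return ' '.join(kept).translate(PUNCT_TABLE).strip()

def strip_then_filter(text):
    """Alternative order: strip punctuation first, then drop stop words."""
    words = text.lower().translate(PUNCT_TABLE).split()
    return ' '.join(w for w in words if w not in STOP_WORDS)

sample = "The alarm is on, the roof is gone."
print(filter_then_strip(sample))  # the stop word "on" survives
print(strip_then_filter(sample))
```

Whether the difference matters for a BERT-based model is an open question; the sketch only makes the behavior visible.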

In [None]:
text = " ".join(review for review in train_data['text'])
wordcloud = WordCloud(background_color="white").generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
pd.Series(' '.join(train_data['text']).lower().split()).value_counts()[:50] # Top 50 words

# Model 1

# Splitting training data into 80:20

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_data['text'], train_data['target'], test_size=0.2, random_state=42)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

In [None]:
X_train1=X_train.tolist()
X_test1=X_test.tolist()
X_train_encoded = tokenizer(X_train1, padding=True, truncation=True, return_tensors="tf")
X_test_encoded = tokenizer(X_test1, padding=True, truncation=True, return_tensors="tf")
train_dataset = tf.data.Dataset.from_tensor_slices((dict(X_train_encoded), y_train)).shuffle(100).batch(32)
test_dataset = tf.data.Dataset.from_tensor_slices((dict(X_test_encoded), y_test)).batch(32)

In [None]:
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True) 
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.BinaryAccuracy(name='train_accuracy')
epochs = 3 
for epoch in range(epochs):
    train_loss.reset_states()
    train_accuracy.reset_states()

    for batch_inputs, batch_labels in train_dataset:
        with tf.GradientTape() as tape:
            outputs = model(batch_inputs, training=True).logits
            loss = loss_object(batch_labels, outputs)

        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        train_loss(loss)
        train_accuracy(batch_labels, tf.sigmoid(outputs))  # Apply sigmoid activation for accuracy calculation

    print(f"Epoch {epoch + 1}: Loss {train_loss.result()}, Accuracy {train_accuracy.result()}")

# Evaluation
test_accuracy = tf.keras.metrics.BinaryAccuracy(name='test_accuracy')  

for batch_inputs, batch_labels in test_dataset:
    test_predictions = model(batch_inputs, training=False).logits
    test_accuracy(batch_labels, tf.sigmoid(test_predictions))

print(f"Test Accuracy: {test_accuracy.result()}")


# Model 2 - With a different learning rate

In [None]:
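# Note: `model` still holds the weights fine-tuned in Model 1, so this run continues
# training from that state rather than starting again from the pretrained checkpoint.
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)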
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.BinaryAccuracy(name='train_accuracy')
epochs = 3 
for epoch in range(epochs):
    train_loss.reset_states()
    train_accuracy.reset_states()

    for batch_inputs, batch_labels in train_dataset:
        with tf.GradientTape() as tape:
            outputs = model(batch_inputs, training=True).logits
            loss = loss_object(batch_labels, outputs)

        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        train_loss(loss)
        train_accuracy(batch_labels, tf.sigmoid(outputs))  # Apply sigmoid activation for accuracy calculation

    print(f"Epoch {epoch + 1}: Loss {train_loss.result()}, Accuracy {train_accuracy.result()}")

# Evaluation
test_accuracy = tf.keras.metrics.BinaryAccuracy(name='test_accuracy')  

for batch_inputs, batch_labels in test_dataset:
    test_predictions = model(batch_inputs, training=False).logits
    test_accuracy(batch_labels, tf.sigmoid(test_predictions))

print(f"Test Accuracy: {test_accuracy.result()}")


# Apply preprocessing - The same as the one in training data

In [None]:
test_data['text'] = test_data['text'].apply(clean_txt)
test_data['text'] = test_data['text'].apply(text_processing)

# Use the best model for classification - Second model

In [None]:
encoded_texts = tokenizer(list(test_data["text"]), padding=True, truncation=True, return_tensors="tf")

dataset = tf.data.Dataset.from_tensor_slices((dict(encoded_texts)))
predictions = []

for batch_inputs in dataset.batch(32):
    batch_predictions = model(batch_inputs, training=False).logits
    batch_probabilities = tf.sigmoid(batch_predictions)
    batch_labels = [1 if p >= 0.5 else 0 for p in batch_probabilities]
    predictions.extend(batch_labels)
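The thresholding step above can also be read directly on the logits: the sigmoid is monotonic and maps 0 to 0.5, so `sigmoid(z) >= 0.5` holds exactly when `z >= 0`. The following minimal sketch checks this in plain Python; the logit values are made up and merely stand in for the model's single-output head.

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits standing in for model outputs
logits = [-2.3, 0.0, 0.4, 3.1, -0.1]

# Threshold on probabilities, as in the prediction loop
labels_from_probs = [1 if sigmoid(z) >= 0.5 else 0 for z in logits]
# Equivalent threshold applied directly to the logits
labels_from_logits = [1 if z >= 0 else 0 for z in logits]

print(labels_from_probs)
print(labels_from_logits)
```

Both lists are identical, which is why the sigmoid can be skipped when only the hard labels are needed.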



In [None]:
len(predictions)

In [None]:
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
submission['target'] = predictions

In [None]:
submission.head(5)

In [None]:
submission.to_csv("submission.csv",index=False)

# Conclusion

1. We started with TF-IDF, but it didn't give a great result on the test data, so we moved to Transformers.

2. BERT stands for Bidirectional Encoder Representations from Transformers.

3. BERT is built upon the Transformer architecture, which was introduced in the paper "Attention Is All You Need" by Vaswani et al. Transformers have proven to be highly effective in handling sequential data, making them well suited for NLP tasks.

4. Unlike earlier NLP models such as RNNs, LSTMs, and GRUs, which typically process text in one direction (either left to right or right to left), BERT uses context from both directions at once.

5. BERT has been used across different NLP tasks, such as classification and summarization.

6. The accuracy of these models hovers around 0.82.

7. Other models, such as DeBERTa and RoBERTa, could be tried to see how their performance compares.
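The TF-IDF attempt mentioned in the conclusion is not shown in the notebook. A baseline along those lines might look like the sketch below, which uses the `TfidfVectorizer` and `SVC` classes imported at the top. The toy corpus, the linear kernel, and all variable names are assumptions for illustration, not the settings actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus (hypothetical); 1 = disaster, 0 = not a disaster
train_texts = [
    "forest fire spreading near the town",
    "earthquake damages several buildings",
    "flood waters rising evacuate now",
    "this new album is fire",
    "my team got destroyed last night",
    "traffic was a disaster this morning lol",
]
train_labels = [1, 1, 1, 0, 0, 0]
valid_texts = ["wildfire forces evacuation of the town", "that concert was fire"]
valid_labels = [1, 0]

# TF-IDF features feeding a linear-kernel SVM (kernel choice is an assumption)
baseline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
baseline.fit(train_texts, train_labels)

valid_preds = baseline.predict(valid_texts)
baseline_f1 = f1_score(valid_labels, valid_preds)
print(baseline_f1)
```

On the real data, the same pipeline would be fitted on `X_train` and scored on `X_test`, which gives an F1 figure to compare against the BERT runs.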
