# Natural Language Processing with Disaster Tweets

The main goal of this competition is to predict which Tweets are about real disasters and which ones are not.

* Link to the competition website: https://www.kaggle.com/competitions/nlp-getting-started/overview

## Get Data

In [None]:
!pip install kaggle

In [None]:
from google.colab import userdata

# Retrieve credentials
KAGGLE_KEY =  userdata.get('KAGGLE_KEY')
KAGGLE_USERNAME = userdata.get('KAGGLE_USERNAME')


# Set environment variables with %env so the Kaggle API can find the credentials
%env KAGGLE_USERNAME=$KAGGLE_USERNAME
%env KAGGLE_KEY=$KAGGLE_KEY

In [None]:
# Import libraries
import os
import kaggle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

In [None]:
!kaggle competitions download -c nlp-getting-started

In [None]:
!unzip /content/nlp-getting-started.zip

## Inspect Data

In [None]:
import pandas as pd
test_df = pd.read_csv('/content/test.csv')
train_df = pd.read_csv('/content/train.csv')

In [None]:
# Check train_df
train_df.head()

In [None]:
train_df.shape

In [None]:
train_df.info()

In [None]:
# Check how much data is missing
train_df.isnull().sum()

In [None]:
# How many examples of each class are there?
train_df.target.value_counts()

## Prepare data

Preparing the data takes a few steps:
1. Lowercase the text so that tokens differing only in case are treated as the same token, meaning "Fire" is equal to "FIRE" or "fire".

2. Remove URLs, which are noise for an ML classifier.

3. Remove stop words, which carry little information.
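
As a rough illustration of these three steps on a single toy tweet, here is a minimal sketch that uses only the standard library. The tiny `TOY_STOP_WORDS` set, the `toy_clean` helper and the whitespace split are stand-ins invented for this example (assumptions); the notebook itself relies on NLTK's English stop-word list and tokenizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
TOY_STOP_WORDS = {"the", "is", "a", "in", "of", "and", "on"}

def toy_clean(text):
    # Step 1: lowercase so "FIRE", "Fire" and "fire" become the same token
    text = text.lower()
    # Step 2: remove URLs; \S+ stops at the first whitespace,
    # so words that follow the link are kept
    text = re.sub(r"https?://\S+", " ", text)
    # Step 3: drop stop words (a whitespace split stands in for a real tokenizer)
    tokens = [t for t in text.split() if t not in TOY_STOP_WORDS]
    return " ".join(tokens)

sample = "FIRE in the Forest http://t.co/abc123 evacuate NOW"
print(toy_clean(sample))  # fire forest evacuate now
```

Only the content-bearing words remain, which is the form the classifier will later see.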

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download("punkt_tab")

stop_words = set(stopwords.words('english'))

def clean_text(text):
  # Lowercasing text
  text = text.lower()

  # Removing URLs (\S+ stops at whitespace, so text after a link is kept)
  text = re.sub(r'https?://\S+', " ", text)

  # Tokenize text
  tokenized_text = word_tokenize(text)

  # Filter stop words
  filtered_tokens = [word for word in tokenized_text if word not in stop_words]

  # Join tokens back into a single string
  text = ' '.join(filtered_tokens)

  return text


## Split data

In [None]:
# Create X
X = train_df.drop("target", axis=1)
# Create y
y = train_df["target"]

In [None]:
# Split the data (80% for training, 20% held out for evaluation)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
len(X_train), len(X_test), len(y_train), len(y_test)

Now we need to use our `clean_text()` function to handle lowercasing, URL removal and stop word filtering.


In [None]:
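# Work on copies to avoid a possible pandas SettingWithCopyWarning when adding a column
X_train, X_test = X_train.copy(), X_test.copy()

# Apply the cleaning function to the 'text' column of the training set
X_train['cleaned_text'] = X_train['text'].apply(clean_text)

# Do the same for the testing set
X_test['cleaned_text'] = X_test['text'].apply(clean_text)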

## Create a Baseline model

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfid = TfidfVectorizer(ngram_range=(1,3), max_df=0.9)
X_train_vectors = tfid.fit_transform(X_train['cleaned_text'])
# Only transform() here: fitting on the test split would leak its vocabulary and IDF statistics
X_test_vectors = tfid.transform(X_test['cleaned_text'])
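
With `ngram_range=(1,3)` the vectorizer builds features from single words, word pairs and word triples. The sketch below is a plain-Python illustration of that idea on a toy token list; `word_ngrams` is a hypothetical helper written for this example and is not how scikit-learn implements it internally.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "forest fire near town".split()
grams = word_ngrams(tokens)
print(len(grams))  # 9
print(grams)
```

Four tokens yield 4 unigrams, 3 bigrams and 2 trigrams, which is why a wide n-gram range can grow the vocabulary quickly.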

In [None]:
from sklearn.linear_model import LogisticRegression

model_0 = LogisticRegression(random_state=42)
model_0.fit(X_train_vectors, y_train)

In [None]:
# Evaluate our baseline model
baseline_score = model_0.score(X_test_vectors, y_test)
baseline_score

In [None]:
# Make predictions
baseline_preds = model_0.predict(X_test_vectors)
baseline_preds[:10]

In [None]:
# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a binary classification model.

  Args:
  -----
  y_true = true labels in the form of a 1D array
  y_pred = predicted labels in the form of a 1D array

  Returns a dictionary of accuracy (as a percentage), precision, recall and f1-score (each between 0 and 1).
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
  return model_results
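
To make the "weighted" average concrete, here is a small worked example in plain Python: each class gets its own F1 score, and the scores are averaged using the number of true examples per class (the support) as weights. The `weighted_f1` helper and the toy labels are made up for this illustration; the notebook itself uses scikit-learn's `precision_recall_fscore_support`.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        support = tp + fn  # number of true examples of this class
        score += f1 * support / total
    return score

# Toy labels: 3 disaster tweets (1) and 5 non-disaster tweets (0)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(weighted_f1(y_true, y_pred))
```

Here class 1 has F1 = 2/3 with support 3 and class 0 has F1 = 4/5 with support 5, so the weighted F1 is (3 · 2/3 + 5 · 4/5) / 8 = 0.75.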

In [None]:
baseline_results = calculate_results(y_test, baseline_preds)
baseline_results