In [None]:
# The notebook was last run on 23rd August '22.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ishandandekar/This_Is_A_Disaster/blob/main/This_Is_A_Disaster_nbk.ipynb)

# This_Is_A_Disaster  
👋 Hello and welcome to the notebook. In this notebook, I build three models and make a submission to the **`nlp-getting-started`** Kaggle competition with each of them. This project was part of the [Zero-to-Mastery Tensorflow Developer course](https://zerotomastery.io/courses/learn-tensorflow/). At the end, I summarise the results of the submissions I made to the competition. This is the original notebook I used for the competition; I have removed the outputs of the code cells to keep the notebook clean.

In [None]:
# Check for GPU
!nvidia-smi -L

## Step 0: About the problem 

The goal is to predict whether a tweet is about a real disaster. If it is, mark the tweet as `1`; otherwise mark it as `0`.   

Files:
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format  

Columns:
- id - a unique identifier for each tweet
- text - the text of the tweet
- location - the location the tweet was sent from (may be blank)
- keyword - a particular keyword from the tweet (may be blank)
- target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

## Step 1: Get the data  
The data is given officially by the competition organisers on Kaggle.   
- Use Kaggle's API to download the data.
- Get some utility functions which will further help us.
- Configure data files to read using Python.


In [None]:
# Getting the helper functions script
!wget https://raw.githubusercontent.com/ishandandekar/This_Is_A_Disaster/main/helper_functions.py --no-verbose

# Get the necessary functions
from helper_functions import unzip_data, plot_loss_curves, gen_metrics_report, get_metrics

In [None]:
# Install the kaggle library
!pip install -q kaggle

# Upload the Kaggle API keys
from google.colab import files
files.upload()

!mkdir -p ~/.kaggle

# Copy the json file to the folder
!cp kaggle.json ~/.kaggle

# Change permissions for json to work with the Kaggle API
!chmod 600 ~/.kaggle/kaggle.json

# Download the dataset
!kaggle competitions download -c nlp-getting-started

# Unzip data
unzip_data("nlp-getting-started.zip")
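
The implementation of `unzip_data` lives in `helper_functions.py` and is not shown in this notebook. As a rough idea of what such a helper could do, here is a minimal sketch using only the standard library. The name `extract_zip` and the default `out_dir="data"` are assumptions for illustration (the later cells read from `data/`), not the actual helper.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, out_dir="data"):
    """Extract every member of zip_path into out_dir and return the member names.

    Hypothetical stand-in for the notebook's unzip_data helper.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
        return sorted(archive.namelist())
```

Calling `extract_zip("nlp-getting-started.zip")` would then place the CSV files under `data/`, assuming the archive holds them at its top level.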

## Step 2: Know more about the data
- Reading the data using `Pandas` library.
- Get the statistics about the data.
- Check if data is imbalanced.
- Visualizing sample data.
- Split training data into train and validation sets.

In [None]:
# Importing library
import pandas as pd

# Reading the files in
train_df= pd.read_csv("data/train.csv",index_col=[0])
test_df= pd.read_csv("data/test.csv",index_col=[0])

# Getting the first 5 rows of the training data
train_df.head(5)

In [None]:
# Getting a sample from training data
train_df.sample(frac=1,random_state=42).head()

In [None]:
# Check for label imbalance
train_df.target.value_counts().plot(kind="bar")

In [None]:
# Check train and test data size
len(train_df),len(test_df)

In [None]:
# Visualize some random training examples
import random
random_index = random.randint(0,len(train_df)-5)
for row in train_df[["text","target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}","(real disaster)" if target>0 else "(not real disaster)")
  print(f"Text:\n{text}")
  print("---\n")

In [None]:
# Importing necessary function(s)
from sklearn.model_selection import train_test_split

train_sentences,val_sentences,train_labels,val_labels = train_test_split(train_df["text"].to_numpy(),train_df["target"].to_numpy(),test_size=0.1,random_state=42)

# Check the sizes of the train and validation sets
print(f"Length of train set: {len(train_sentences)}")
print(f"Length of validation set: {len(val_sentences)}")
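
To make the imbalance check and the 90/10 split concrete, here is a dependency-free sketch of the same two ideas. The functions `class_balance` and `shuffled_split` and the toy data are made up for illustration; the shuffling is not the same as `train_test_split`, so the exact rows that end up in each set will differ.

```python
import random
from collections import Counter


def class_balance(labels):
    """Return the fraction of examples that carry each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def shuffled_split(items, labels, val_fraction=0.1, seed=42):
    """Shuffle paired items/labels and hold out val_fraction of them for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_val = round(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(idx):
        return [items[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(val_idx)


# Toy data, invented for this example only
texts = [f"tweet {i}" for i in range(20)]
targets = [1 if i % 5 < 2 else 0 for i in range(20)]

print(class_balance(targets))
(train_x, train_y), (val_x, val_y) = shuffled_split(texts, targets)
print(len(train_x), len(val_x))  # 18 2
```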

## Step 3: Building the first model
- Use transfer learning to get a pretrained text-vectorization and embedding layer.
- *(Optional)* Add more layers after the pretrained layer
- Compile the model.
- *(Optional)* Set up `ModelCheckpoint` and `EarlyStopping` callbacks.
- Plot loss curves.
- Make predictions on validation set.
- Make predictions on test set.
- Submit predictions to the competition.


In [None]:
# Get the pretrained embedding layer and using it as a Keras layer
# Using the Universal-Sentence-Encoder, another option is GloVe (pretty famous too)!
import tensorflow_hub as hub
import tensorflow as tf

sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",input_shape=[],dtype=tf.string,trainable=False,name="USE")

In [None]:
# Making the model using Sequential API
import tensorflow as tf
from tensorflow.keras import layers

model_0 = tf.keras.Sequential(
    [
     sentence_encoder_layer,
     layers.Dense(64,activation="relu"),
     layers.Dense(1,activation="sigmoid",name="output_layer")
    ],name="model_0_USE"
)

# Compile the model
model_0.compile(loss="binary_crossentropy",
                optimizer="Adam",
                metrics=["accuracy"])

# Get the summary
model_0.summary()

In [None]:
# Setting up variables
EPOCHS = 5

# Create ModelCheckpoint callback to save a model's progress during training
# Only the parent folder is needed, since the weights are saved to a single .h5 file
!mkdir -p model_checkpoint
checkpoint_path = "model_checkpoint/model_0.h5"
mc_callback_0 = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                      monitor="val_accuracy",
                                                      save_best_only=True,
                                                      save_weights_only=True,
                                                      verbose=0)
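
With `save_best_only=True` and `monitor="val_accuracy"`, the checkpoint file is only overwritten when the monitored value improves. The sketch below mimics that rule in plain Python. It is a simplification of the callback's behaviour, and the accuracy values are invented.

```python
def epochs_that_save(val_accuracies):
    """Return the epoch indices at which a 'save best only' checkpoint would be written."""
    best = float("-inf")
    saved = []
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:  # strictly better than anything seen so far
            best = acc
            saved.append(epoch)
    return saved


# Made-up validation accuracies for five epochs
history = [0.78, 0.81, 0.80, 0.82, 0.82]
print(epochs_that_save(history))  # [0, 1, 3]
```

In this toy run the file on disk after training would hold the weights from epoch index 3, which is what a later `load_weights(checkpoint_path)` call would restore.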

In [None]:
# Fitting the model on the training set
model_0_history = model_0.fit(train_sentences,
                              train_labels,
                              epochs=EPOCHS,
                              validation_data=(val_sentences,val_labels),
                              callbacks=[mc_callback_0])

In [None]:
# Plot loss curves
plot_loss_curves(model_0_history)

In [None]:
# Load-in best weights saved using checkpoint callback
model_0.load_weights(checkpoint_path)

# Evaluate on the validation data
model_0_pred_probs = model_0.predict(val_sentences)
model_0_preds = tf.squeeze(tf.round(model_0_pred_probs))
get_metrics(val_labels,model_0_preds)
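
`get_metrics` comes from `helper_functions.py`, so its exact output is not visible here. As a hedged illustration of the kind of numbers such a helper might report, the sketch below computes accuracy, precision, recall and F1 for the positive class by hand. The name `binary_metrics` and the labels are made up, and the real helper may use a different averaging scheme.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels and predictions for eight tweets
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```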

In [None]:
# Checking sample submission to format data for submission
sample = pd.read_csv('data/sample_submission.csv')
sample.head()

In [None]:
test_df

In [None]:
# Making predictions on test set
test_id,test_sentences = test_df.index,test_df['text'].to_numpy()
model_0_pred_probs_test = model_0.predict(test_sentences)
model_0_preds_test = tf.squeeze(tf.round(model_0_pred_probs_test))
submission_0 = pd.DataFrame({'id':test_id,'target':list(map(int,model_0_preds_test))})

In [None]:
# Submitting the csv to the competition
!mkdir -p submissions
submission_0.to_csv("submissions/submission_0.csv",index=False)

# Below line has been commented as I have already made a submission
# !kaggle competitions submit -c nlp-getting-started -f submissions/submission_0.csv -m "23/8 First submission using model_0"

In [None]:
submission_0.head()
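
The step from sigmoid probabilities to a submission file can be sketched without TensorFlow or pandas. The helpers `to_labels` and `submission_csv`, the ids and the probabilities below are all invented for illustration. The strict `p > 0.5` comparison is chosen so that a probability of exactly 0.5 maps to `0`, which is also what `tf.round` gives, since it rounds halves to the nearest even number.

```python
import csv
import io


def to_labels(probabilities, threshold=0.5):
    """Turn probabilities into 0/1 labels; values above the threshold become 1."""
    return [int(p > threshold) for p in probabilities]


def submission_csv(ids, labels):
    """Return CSV text with an 'id,target' header and one row per prediction."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "target"])
    writer.writerows(zip(ids, labels))
    return buffer.getvalue()


# Made-up ids and predicted probabilities
ids = [0, 2, 3]
probs = [0.91, 0.12, 0.5]
print(submission_csv(ids, to_labels(probs)))
```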

## Step 4: Building a better model
- Get a pretrained embedding layer and set `trainable=True`. This trains the parameters within the embedding layer too.
- (Optional) Add more layers to fine-tune.
- Compile the model.
- (Optional) Set up `ModelCheckpoint` and `EarlyStopping` callbacks.
- Plot loss curves.
- Make predictions on validation set.
- Make predictions on test set.
- Submit predictions to competition.

In [None]:
# Get the pretrained embedding layer and using it as a Keras layer
# Setup trainable as True so the params within this layer can be changed while training
import tensorflow_hub as hub

sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",input_shape=[],dtype=tf.string,trainable=True,name="USE")

In [None]:
# Making the model using Sequential API
import tensorflow as tf
from tensorflow.keras import layers

model_1 = tf.keras.Sequential(
    [
     sentence_encoder_layer,
     layers.Dense(64,activation="relu"),
     layers.Dropout(0.2),
     layers.Dense(64,activation="relu"),
     layers.Dense(1,activation="sigmoid",name="output_layer")
    ],name="model_1_trainable"
)

# Compile the model
model_1.compile(loss="binary_crossentropy",
                optimizer="Adam",
                metrics=["accuracy"])

# Get the model summary
model_1.summary()

In [None]:
# Setting up variables
EPOCHS = 10

# Create ModelCheckpoint callback to save a model's progress during training
checkpoint_path = "model_checkpoints/model_1_cp.ckpt"
mc_callback_1 = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                   monitor="val_accuracy",
                                                   save_best_only=True,
                                                   save_weights_only=True,
                                                   verbose=0)

es_callback = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=2,
    verbose=0,
    mode="auto",
    restore_best_weights=True,
)
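
To see what `patience=2` on `val_loss` means in practice, here is a small plain-Python sketch of a patience-based stopping rule. It is a simplified model of the callback, not its actual implementation, and the loss values are invented.

```python
def early_stopping_trace(val_losses, patience=2):
    """Return (stopped_epoch, best_epoch) for a patience-based rule on val_loss.

    stopped_epoch is None when the rule never triggers.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:  # improvement: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up validation losses
losses = [0.48, 0.43, 0.45, 0.44, 0.40]
print(early_stopping_trace(losses))  # (3, 1)
```

In this toy run training stops at epoch index 3 and the weights from epoch index 1 would be restored, so the later improvement at index 4 is never reached. Note that in the notebook the checkpoint monitors `val_accuracy` while early stopping monitors `val_loss`, so the two callbacks can pick different "best" epochs.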

In [None]:
# Training the model
model_1_history = model_1.fit(train_sentences,
                              train_labels,
                              epochs=EPOCHS,
                              validation_data=(val_sentences,val_labels),
                              callbacks=[mc_callback_1,es_callback])

In [None]:
# Plot loss curves
plot_loss_curves(model_1_history)

In [None]:
# Load-in best weights saved using checkpoint callback
model_1.load_weights(checkpoint_path)

# Evaluate on the validation data
model_1_pred_probs = model_1.predict(val_sentences)
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
gen_metrics_report(val_labels,model_1_preds)

In [None]:
# Making predictions on test set
model_1_pred_probs_test = model_1.predict(test_sentences)
model_1_preds_test = tf.squeeze(tf.round(model_1_pred_probs_test))
submission_1 = pd.DataFrame({'id':test_id,'target':list(map(int,model_1_preds_test))})

In [None]:
# Making submission to the competition
submission_1.to_csv("submissions/submission_1.csv",index=False)

# Below line has been commented, as the data has been already submitted
# !kaggle competitions submit -c nlp-getting-started -f submissions/submission_1.csv -m "23/8 Second submission using model_1"

## Step 5: Building a *better* better model
- Reuse the model architecture from **step 4**.
- Add callbacks.
- Train the model on whole training set (training set before split).
- Make predictions on test set.
- Submit predictions to the competition.

In [None]:
# Cloning model_1
model_2 = tf.keras.models.clone_model(model_1)

# Compile the model
model_2.compile(loss="binary_crossentropy",
                optimizer="Adam",
                metrics=["accuracy"])

# Get the model summary
model_2.summary()

In [None]:
# Setting up variables
EPOCHS = 10

In [None]:
# Formatting train data
train_sentences_full,train_labels_full = train_df["text"].to_numpy(),train_df["target"].to_numpy()

In [None]:
# Training the model
model_2_history = model_2.fit(train_sentences_full,
                              train_labels_full,
                              epochs=EPOCHS)

In [None]:
# Can't plot loss curves: with no validation data, the history object has no val_loss or val_accuracy entries
# plot_loss_curves(model_2_history)

In [None]:
# No checkpoint callback was used for model_2, so there are no saved weights to load
# model_2.load_weights(checkpoint_path)

In [None]:
# Making predictions on test set
model_2_pred_probs_test = model_2.predict(test_sentences)
model_2_preds_test = tf.squeeze(tf.round(model_2_pred_probs_test))
submission_2 = pd.DataFrame({'id':test_id,'target':list(map(int,model_2_preds_test))})

In [None]:
# Making submission to the competition
submission_2.to_csv("submissions/submission_2.csv",index=False)

# Below line has been commented, as the data has been already submitted
# !kaggle competitions submit -c nlp-getting-started -f submissions/submission_2.csv -m "23/8 Third submission using model_2"

## Step 6: Comparing the three models' results
- Check the results on Kaggle's website
- Analyze results
- How to improve these


1. **Model 0** : It had the simplest architecture of the three models. A `ModelCheckpoint` callback kept the weights with the best validation accuracy. The model was fit on 90% of the training data and got a score of **0.80570** on test data (taken from the Kaggle evaluation).  

1. **Model 1** : Its architecture differed slightly from the previous model. A Dropout layer and an extra Dense layer were added, and the encoder was made trainable, which adds many more trainable parameters. With that many trainable parameters, the model could have overfit. Model got a score of **0.80570** on test data (taken from the Kaggle evaluation), the same as Model 0. An identical score is surprising given the extra trainable parameters, so it is worth verifying that the submitted file contained Model 1's predictions.  

1. **Model 2** : The model had the same architecture as Model 1, but it was trained on the full training set. This left no validation set, so no validation-based callbacks could be used, and the model was highly vulnerable to overfitting during training. Model got a score of **0.76984** on test data (taken from the Kaggle evaluation), which suggests it overfit the training set.

Clearly the models could have been better. Changing the encoder, for example using GloVe instead, could give a different result. Adding more layers and dropout could also change the results. Analyzing the data further and creating additional features could help as well.
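
As one possible starting point for the feature-creation idea, the sketch below derives a few simple counts from a tweet's text. The function `tweet_features`, the chosen features and the example tweet are all hypothetical. None of this was used in the submissions above, and whether such features would improve the score is untested.

```python
import re


def tweet_features(text):
    """Extract a few hand-crafted counts from a tweet (illustrative only)."""
    return {
        "n_chars": len(text),
        "n_words": len(text.split()),
        "n_hashtags": len(re.findall(r"#\w+", text)),
        "n_mentions": len(re.findall(r"@\w+", text)),
        "n_urls": len(re.findall(r"https?://\S+", text)),
        "n_exclaims": text.count("!"),
    }


# Invented example tweet
example = "Flood warning for #Riverside! Stay safe @cityalerts https://example.com/info"
print(tweet_features(example))
```

Features like these could be concatenated with the sentence embedding before the Dense layers, which would need the Functional API rather than `Sequential`.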