Importing the necessary libraries and packages to execute the notebook. 

In [16]:
import numpy as np
import pandas as pd
import os
import tensorflow as tf
import spacy
import en_core_web_sm
from tensorflow.keras import preprocessing
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Dropout, GlobalAveragePooling1D
from tensorflow.keras.callbacks import EarlyStopping

The code reads in three CSV files: train data, test data, and a sample submission file for the competition.

In [17]:
df_train = pd.read_csv('train.csv', dtype={'id': np.int16, 'target': np.int8})
df_test = pd.read_csv('test.csv', dtype={'id': np.int16})
submission = pd.read_csv("sample_submission.csv")
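
The `dtype` arguments keep the DataFrames small, but a fixed-width integer type is only safe while every value in the column fits its range. A minimal sketch of that check in plain Python follows; `fits_signed` is a hypothetical helper, the ids are the ones visible in the sample rows, and the labels are made up for illustration.

```python
def fits_signed(values, bits):
    """Return True if every value fits in a signed integer of the given bit width."""
    low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return all(low <= v <= high for v in values)

example_ids = [1, 4, 5, 6, 7]        # ids as they appear in the sample rows
example_targets = [1, 1, 0, 1, 0]    # made-up binary labels

print(fits_signed(example_ids, 16))      # True: int16 covers -32768..32767
print(fits_signed(example_targets, 8))   # True: int8 covers -128..127
print(fits_signed([40000], 16))          # False: too large for int16
```

If a future version of the data contained ids above 32767, a wider type such as `np.int32` would be the safer choice.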

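This is what the training data looks like. Judging by its execution count, the cell was run after the stop word removal further down, so the text column already shows the cleaned tweets: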

In [25]:
df_train.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Deeds Reason # earthquake ALLAH Forgive | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask . Canada | 1 |
| 2 | 5 |  |  | residents asked ' shelter place ' notified off... | 1 |
| 3 | 6 |  |  | 13,000 people receive # wildfires evacuation o... | 1 |
| 4 | 7 |  |  | got sent photo Ruby # Alaska smoke # wildfires... | 1 |


In [19]:
df_train_text = df_train.text.values
df_train_labels = df_train.target.values
df_test_text = df_test.text.values

Loading spaCy's small English pipeline (en_core_web_sm) and defining a function remove_stopwords(), which takes an array of sentences and removes every token that spaCy flags as a stop word. The function is then applied to the training texts, and the first tweet is printed before and after cleaning to show the effect.

In [20]:
import en_core_web_sm

nlp = en_core_web_sm.load()

def remove_stopwords(training_data, nlp):
    """Remove stop words from every sentence in training_data.

    Stop words are identified with spaCy's token.is_stop flag.
    The input array is modified in place and also returned.
    """
    for k in range(len(training_data)):
        doc = nlp(training_data[k])
        # Keep only the tokens that spaCy does not flag as stop words.
        filtered = [token.text for token in doc if not token.is_stop]
        training_data[k] = " ".join(filtered)

    return training_data

print("Before cleaning: ",df_train_text[0])
df_train_text = remove_stopwords(df_train_text, nlp)
print("After cleaning: ",df_train_text[0])

Before cleaning:  Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
After cleaning:  Deeds Reason # earthquake ALLAH Forgive
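
One side effect is worth knowing: remove_stopwords() writes the cleaned sentences back into the array it receives, so the caller's data changes as well. The sketch below reproduces that pattern with a plain list. `clean_in_place` and its tiny stop word set are hypothetical stand-ins for the spaCy-based function, used only so the example runs without a language model.

```python
# Hypothetical stand-in for remove_stopwords(): same in-place write pattern,
# but with a tiny hard-coded stop word set instead of spaCy.
TOY_STOPWORDS = {"our", "are", "the", "of", "this", "may", "us", "all"}

def clean_in_place(sentences):
    for k in range(len(sentences)):
        kept = [w for w in sentences[k].split() if w.lower() not in TOY_STOPWORDS]
        sentences[k] = " ".join(kept)
    return sentences

original = ["Our Deeds are the Reason of this #earthquake"]

# Passing a copy leaves the caller's list untouched.
cleaned = clean_in_place(list(original))
unchanged_after_copy = original[0]

# Passing the list itself mutates it: the returned object is the same list.
same_object = clean_in_place(original) is original

print(cleaned)               # ['Deeds Reason #earthquake']
print(unchanged_after_copy)  # Our Deeds are the Reason of this #earthquake
print(same_object)           # True
```

In the notebook, passing `df_train_text.copy()` instead of `df_train_text` would be one way to keep the raw tweets available for later inspection.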


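Splitting the data into a training set (the first 70% of the rows) and a validation set (the remaining 30%). The split follows the file order and does not shuffle the rows: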

In [21]:
TRAINING_SPLIT = 0.7

def train_val_split(texts, labels, training_split):
    
    train_size = int(len(texts)*training_split)
    
    train_texts = texts[:train_size]
    train_labels = labels[:train_size]
    
    validation_texts = texts[train_size:]
    validation_labels = labels[train_size:]
    
    return train_texts, validation_texts, train_labels, validation_labels

train_texts, val_texts, train_labels, val_labels = train_val_split(df_train_text, df_train_labels, TRAINING_SPLIT)
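
Because the split above follows the file order, the validation set could differ systematically from the training set if the file happens to be sorted in some way. A possible alternative, sketched here under that assumption, shuffles the row indices with a fixed seed before cutting. `shuffled_train_val_split` and its `seed` parameter are hypothetical and not part of the notebook.

```python
import random

def shuffled_train_val_split(texts, labels, training_split, seed=42):
    """Split texts/labels after shuffling the indices with a fixed seed."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    train_size = int(len(texts) * training_split)
    train_idx, val_idx = indices[:train_size], indices[train_size:]

    # Index texts and labels with the same positions so pairs stay aligned.
    train_texts = [texts[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    val_texts = [texts[i] for i in val_idx]
    val_labels = [labels[i] for i in val_idx]
    return train_texts, val_texts, train_labels, val_labels

# Toy data: each text carries its own label so the pairing is easy to check.
toy_texts = [f"tweet {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
tr_x, va_x, tr_y, va_y = shuffled_train_val_split(toy_texts, toy_labels, 0.5)
print(len(tr_x), len(va_x))
```

The return order matches train_val_split(), so the function could be swapped in without changing the call site.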

The chosen model is a simple ***neural network*** with an embedding layer followed by a global average pooling layer, two dense layers (with 64 and 1 units respectively), and a dropout layer in between to prevent overfitting. The model is compiled with the Adam optimizer and binary cross-entropy loss, and the accuracy is used as the evaluation metric.
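
To make the pooling step concrete: the embedding layer turns each tweet into a matrix with one vector per token position, and global average pooling collapses that matrix into a single vector by averaging over the positions. The plain-Python sketch below shows the operation on a toy input. `global_average_pool` is a hypothetical helper, not the Keras layer itself.

```python
def global_average_pool(sequence_vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    steps = len(sequence_vectors)
    dim = len(sequence_vectors[0])
    return [sum(vec[d] for vec in sequence_vectors) / steps for d in range(dim)]

# Toy input: 3 token positions, each embedded in 2 dimensions
# (the model described above works with 120 positions x 128 dimensions).
embedded_tweet = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(embedded_tweet)
print(pooled)  # [3.0, 4.0]
```

The averaged vector has a fixed size regardless of tweet length and ignores word order. As far as the layer defaults go, padded positions also take part in the average unless masking is enabled on the embedding layer.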

The hyperparameters were chosen as follows:

**NUM_WORDS** = 10000: the number of words to keep in the tokenizer's word index based on frequency.

**OOV_TOKEN** = "<OOV>": a special token to replace words that are not in the tokenizer's word index.

**MAXLEN** = 120: the maximum length of the input sequences, which are padded or truncated to this length.

**PADDING** = 'post': the padding strategy for the sequences, which pads zeros at the end of the sequences.

**EMBEDDING_DIM** = 128: the dimensionality of the dense embedding layer, which determines the number of features in the embedding vectors.

**EPOCHS** = 100: the maximum number of passes over the training set. Early stopping can end training sooner, as it does in the run below.

**BATCH_SIZE** = 32: the number of samples processed by the model in each training batch.
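
The cells shown in this notebook never define `tokenizer`, `seq_and_pad`, `train_padded_seq` or `val_padded_seq`, so the step that turns text into padded integer sequences is not visible. The sketch below illustrates what NUM_WORDS, OOV_TOKEN, MAXLEN and PADDING control, using plain Python only. `fit_word_index` and `seq_and_pad_sketch` are hypothetical names, and the whitespace tokenization and end-side truncation are simplifying assumptions; the original notebook presumably relied on the Keras text preprocessing utilities instead.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency. Index 0 is reserved for padding, 1 for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def seq_and_pad_sketch(texts, word_index, num_words, padding, maxlen, oov_token="<OOV>"):
    """Map words to ids, replace rare/unknown words by OOV, pad or cut to maxlen."""
    oov_id = word_index[oov_token]
    padded = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word, oov_id)
            # Only indices below num_words are kept; the rest become OOV.
            ids.append(idx if idx < num_words else oov_id)
        ids = ids[:maxlen]                  # assumption: cut from the end
        pad = [0] * (maxlen - len(ids))     # zeros mark padding
        padded.append(ids + pad if padding == "post" else pad + ids)
    return padded

toy_index = fit_word_index(["fire fire fire smoke smoke flood"])
toy_seq = seq_and_pad_sketch(["fire flood quake"], toy_index,
                             num_words=4, padding="post", maxlen=5)
print(toy_index)  # {'<OOV>': 1, 'fire': 2, 'smoke': 3, 'flood': 4}
print(toy_seq)    # [[2, 1, 1, 0, 0]]
```

In the toy run, "flood" is in the vocabulary but ranks outside the first `num_words` indices, and "quake" was never seen, so both map to the OOV id 1. The two trailing zeros are the 'post' padding.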

In [22]:
# Hyperparameters
NUM_WORDS = 10000
OOV_TOKEN = "<OOV>"
MAXLEN = 120
PADDING = 'post'
EMBEDDING_DIM = 128
EPOCHS = 100
BATCH_SIZE = 32

# Define model
model = Sequential()
model.add(Embedding(input_dim=NUM_WORDS, output_dim=EMBEDDING_DIM, input_length=MAXLEN))
model.add(GlobalAveragePooling1D())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

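# Train model
# Note: train_padded_seq and val_padded_seq (presumably the tokenized, padded
# versions of train_texts and val_texts) must already exist; the cell that
# builds them is not shown here.
# Stop once val_loss has not improved for 3 epochs and restore the best weights.
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(train_padded_seq, train_labels, batch_size=BATCH_SIZE,
                    epochs=EPOCHS, validation_data=(val_padded_seq, val_labels), callbacks=[early_stopping])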

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
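
The log ends at epoch 9 of 100. If the stop was triggered by the early stopping callback (an assumption, since the per-epoch losses are not shown), then with patience=3 the best validation loss would have occurred at epoch 6, and restore_best_weights=True would roll the model back to that epoch. The sketch below mimics the patience logic in plain Python. `early_stopping_epoch` is a hypothetical helper and the loss values are invented purely for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for patience-based stopping."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Strict improvement resets the patience window.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Invented validation losses: improving until epoch 6, then getting worse.
made_up_val_losses = [0.68, 0.62, 0.55, 0.50, 0.47, 0.45, 0.46, 0.47, 0.49, 0.52]
stop_epoch, best_epoch = early_stopping_epoch(made_up_val_losses, patience=3)
print(stop_epoch, best_epoch)  # 9 6
```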


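With the model trained, the last step prepares the competition submission: the test tweets go through the same cleaning, tokenization and padding as the training data, the model predicts a probability for each tweet, and the rounded predictions are written to submission.csv.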

In [24]:
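# Clean the test tweets the same way as the training tweets.
test_texts_clean = remove_stopwords(df_test_text, nlp)
# Note: seq_and_pad and tokenizer come from a cell that is not shown here.
test_padded_seq = seq_and_pad(test_texts_clean, tokenizer, PADDING, MAXLEN)
# Predicted probabilities, rounded to 0/1 class labels.
test_probs = model.predict(test_padded_seq)
submission['target'] = test_probs.round().astype(int)
submission.to_csv('submission.csv', index=False)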
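
Rounding the sigmoid output amounts to a decision threshold of 0.5. Writing the threshold out makes the cut-off visible and easy to tune. The sketch below uses plain Python with made-up probabilities shaped like the output of model.predict (one single-element row per tweet). `probabilities_to_labels` is a hypothetical helper.

```python
def probabilities_to_labels(pred_rows, threshold=0.5):
    """Turn rows of single probabilities into 0/1 labels using a threshold."""
    return [int(row[0] >= threshold) for row in pred_rows]

fake_predictions = [[0.91], [0.50], [0.12]]   # made-up model outputs

labels = probabilities_to_labels(fake_predictions)
rounded = [round(row[0]) for row in fake_predictions]

print(labels)   # [1, 1, 0]
print(rounded)  # [1, 0, 0]
```

The two results differ only for the exact value 0.5, because rounding sends halves to the nearest even integer. In practice that case is rare, so the main benefit of the explicit threshold is that it can be changed if validation results suggest a different cut-off.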

