## Kaggle competition: disaster tweets classification

The purpose of this competition is to classify tweets into two classes: does the text describe a real disaster or not?

To implement my solution, I will use the Hugging Face Transformers and TensorFlow frameworks; more specifically, I will fine-tune DistilBERT, a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

First, let's import TensorFlow, the sequence-classification model class, and the tokenizer class that matches the one used to pre-train DistilBERT, so that our sequences of text are encoded the way the model expects.

In [2]:
import tensorflow as tf
from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification

Now let's load the CSV file into a Pandas DataFrame and keep the 'text' column as our independent feature and the 'target' column as our dependent feature.

In [3]:
df = pd.read_csv('../input/nlp-getting-started/train.csv')
df

In [4]:
X = list(df['text'])
y = list(df['target'])

Let's generate our training and validation datasets using Scikit-Learn:

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.20, random_state = 0)
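A purely random split can leave the two classes in slightly different proportions in the training and validation sets. As a minimal sketch, the helper below (the name `class_share` is hypothetical, and the labels are toy values, not the competition data) reports the share of each label so that the two splits can be compared:

```python
from collections import Counter

def class_share(labels):
    """Return {label: fraction of samples carrying that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

# Toy labels for illustration only (not the competition data).
toy_train = [0, 0, 0, 1, 1, 0, 1, 0]
toy_val = [0, 1]

train_share = class_share(toy_train)
val_share = class_share(toy_val)
print(train_share)
print(val_share)
```

If the shares differ noticeably, `train_test_split` also accepts a `stratify` argument (for example `stratify=y`) that keeps the class proportions the same in both splits.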

Let's instantiate our tokenizer to generate encodings for the training and validation datasets, using 'truncation' and 'padding' in order to make all sequences of equal length:

In [6]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [7]:
train_encodings = tokenizer(X_train,
                            truncation=True,
                            padding=True)

val_encodings = tokenizer(X_val,
                          truncation=True,
                          padding=True)
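To make the effect of 'truncation' and 'padding' concrete, here is a minimal pure-Python sketch of what happens to the token IDs. The helper `pad_and_truncate` and the toy ID lists are hypothetical; the real tokenizer does this internally and also keeps the closing special token when it truncates, which this sketch does not. The pad ID of 0 matches DistilBERT's `pad_token_id`.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in batch]
    target = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq))
                      for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token IDs for illustration only.
toy_batch = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 6, 2, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Every row now has the same length, and the attention mask tells the model which positions hold real tokens.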

Let's convert our datasets to 'tf.data.Dataset' format by combining the encodings and their labels:

In [8]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), y_train))
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), y_val))

Let's import the DistilBERT pretrained model with the following default configuration:

(vocab_size = 30522, max_position_embeddings = 512, sinusoidal_pos_embds = False, n_layers = 6, n_heads = 12, dim = 768, hidden_dim = 3072, dropout = 0.1, attention_dropout = 0.1, activation = 'gelu', initializer_range = 0.02, qa_dropout = 0.1, seq_classif_dropout = 0.2, pad_token_id = 0, **kwargs)

Let's load the model with a 2-class classification head, then compile it with an 'Adam' optimizer:

In [9]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

In [10]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

In [11]:
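# Labels are integers and the model outputs raw logits, hence from_logits=True.
model.compile(optimizer=optimizer, loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])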
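During training, the two raw scores (logits) that the model outputs for a tweet are compared with its integer label through a cross-entropy loss. As a minimal sketch of that computation for a single example, with made-up logits and a hypothetical helper name:

```python
import math

def cross_entropy_from_logits(logits, label):
    """Negative log-probability of the true class, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Made-up logits for illustration only.
confident_right = cross_entropy_from_logits([-2.0, 3.0], label=1)
confident_wrong = cross_entropy_from_logits([-2.0, 3.0], label=0)
uniform = cross_entropy_from_logits([0.0, 0.0], label=0)
print(confident_right, uniform, confident_wrong)
```

The loss is small when the highest logit belongs to the true class, equals log(2) when both classes get the same score, and grows when the model is confidently wrong.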

In [12]:
model.fit(train_dataset.shuffle(len(X_train)).batch(16),  # buffer covers the whole training set
          epochs=1,
          validation_data=val_dataset.batch(16))  # batching is set on the datasets; validation needs no shuffling
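`Dataset.shuffle(buffer_size)` does not shuffle the whole dataset at once: it keeps a buffer of `buffer_size` elements, emits one of them at random, and refills the slot with the next element. A buffer smaller than the dataset therefore only shuffles locally. The pure-Python sketch below mimics that documented behavior; the function name is hypothetical and this is not TensorFlow's actual implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Shuffle by sampling from a fixed-size buffer that is refilled in order."""
    rng = random.Random(seed)
    buffer, output = [], []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        output.append(buffer[idx])  # emit a random buffered element
        buffer[idx] = item          # replace it with the next input element
    rng.shuffle(buffer)             # drain what is left at the end
    output.extend(buffer)
    return output

shuffled = buffered_shuffle(list(range(20)), buffer_size=5)
in_order = buffered_shuffle(list(range(10)), buffer_size=1)
print(shuffled)
print(in_order)
```

With a buffer of 5, an element can never move more than 4 positions towards the front, and with a buffer of 1 the order is unchanged. A buffer at least as large as the dataset gives a full shuffle.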

Saving and loading our fine-tuned model:

In [13]:
model.save_pretrained("MSF_DistilBERT_CustomModel")

In [14]:
loaded_model = TFDistilBertForSequenceClassification.from_pretrained("MSF_DistilBERT_CustomModel")

Let's run inference by tokenizing a test sentence and passing it to our trained model. Applying 'softmax' to the output lets us visualize the probability of each class:

In [15]:
test_sentence = "Terrible earthquake this morning, everyone was scared."

In [16]:
predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

In [17]:
tf_output = loaded_model.predict(predict_input)[0]

In [18]:
tf_prediction = tf.nn.softmax(tf_output, axis=1).numpy()[0]
tf_prediction
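For reference, softmax turns the two logits into probabilities that sum to 1. Here is a minimal pure-Python sketch with made-up logits (the values are hypothetical, not the model's actual output):

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [-1.5, 2.0]  # hypothetical scores for class 0 and class 1
probs = softmax(toy_logits)
print(probs)
```

In this competition, index 1 corresponds to target 1 (a real disaster), so a higher second value means the model leans towards "disaster".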

Now let's generate predictions for the test set, which has no labels. We will tokenize each sequence of text and pass it to the 'predict' method. Then we will obtain the probabilities for each class by applying softmax, and finally return the index of the most likely class by using 'argmax':

In [19]:
df = pd.read_csv('../input/nlp-getting-started/test.csv')
data = list(df['text'])

In [20]:
results = []
for txt in data:
    tokenized_input = tokenizer.encode(txt,
                                       truncation=True,
                                       padding=True,
                                       return_tensors="tf")
    preds = loaded_model.predict(tokenized_input)
    proba = tf.math.softmax(preds.logits, axis=-1)
    label = proba.numpy()
    results.append(label.argmax())
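Calling 'predict' once per tweet works but is slow on a few thousand tweets. One possible speed-up is to process the texts in chunks. The helper below is a minimal sketch of the chunking step only; the name `iter_batches` is hypothetical and the toy strings stand in for the tweets.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

toy_texts = ["a", "b", "c", "d", "e", "f", "g"]  # stand-ins for tweets
batches = list(iter_batches(toy_texts, batch_size=3))
print(batches)
```

Each chunk could then be passed to the tokenizer as a list of strings, with `truncation=True` and `padding=True` so that the sequences in the chunk share one length, and the model called once per chunk instead of once per tweet.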

Let's combine the list of IDs and the matching list of predicted classes in a DataFrame, then write it to our final CSV submission file:

In [21]:
sub = pd.DataFrame(np.column_stack((list(df['id']), results)), columns=["id", "target"])

In [22]:
sub.to_csv("submission.csv", index=False)
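Before uploading, it can be worth checking that the file has the expected shape: an 'id' column, a 'target' column, and only 0 or 1 as targets. The checker below is a minimal sketch using the standard library; the function name `check_submission` and the toy CSV strings are hypothetical.

```python
import csv
import io

def check_submission(csv_text):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != ["id", "target"]:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    seen_ids = set()
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        if row["target"] not in {"0", "1"}:
            problems.append(f"line {line_number}: target is {row['target']!r}")
        if row["id"] in seen_ids:
            problems.append(f"line {line_number}: duplicate id {row['id']}")
        seen_ids.add(row["id"])
    return problems

# Toy CSV contents for illustration only.
good = "id,target\n0,1\n2,0\n3,1\n"
bad = "id,target\n0,1\n0,2\n"
good_problems = check_submission(good)
bad_problems = check_submission(bad)
print(good_problems)
print(bad_problems)
```

On the real file, the same check could be run with `check_submission(open("submission.csv").read())`.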