# Modeling and Evaluation

In [1]:
!python -m spacy download en_core_web_md

2023-01-07 15:39:04.534770: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.1/en_core_web_md-3.4.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 17.3 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [2]:
# Data wrangling
import pandas as pd

# Word processing
import en_core_web_md

# Text encoding
import tensorflow as tf 
import tensorflow_datasets as tfds

# Performing small tasks
import functools

pd.set_option("display.precision", 12)
import warnings
warnings.filterwarnings("ignore")



In [3]:
tweets_df = pd.read_csv("tweets_train.csv")

In [4]:
tweets_df.head()

Unnamed: 0,keyword,cleaned_text,target
0,ablaze,Wholesale Markets ablaze,1
1,ablaze,We always try to bring the heavy.,0
2,ablaze,: Breaking news:Nigeria flag set ablaze in Aba.,1
3,ablaze,Crying out for more! Set me ablaze,0
4,ablaze,On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0


In [5]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7383 entries, 0 to 7382
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   keyword       7383 non-null   object
 1   cleaned_text  7383 non-null   object
 2   target        7383 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 173.2+ KB


## Neural Networks

We train two neural networks to predict the target variable: the first uses the __keyword__ feature, the second uses the __cleaned_text__ feature.


### Model based on keyword variable
#### Preprocessing

In [6]:
nlp_keyword = en_core_web_md.load()
corpus_keyword = " ".join(tweets_df["keyword"].to_list())
doc_keyword = nlp_keyword(corpus_keyword)

In [7]:
tokens_keyword = [token.text for token in doc_keyword]
vocabulary_set_keyword= set(tokens_keyword)
vocab_keyword_size = len(vocabulary_set_keyword)

In [8]:
vocab_keyword_size

229

In [9]:
# small check to ensure that all the words in the feature keyword are present in vocabulary_set_keyword
words_in_vocabulary = []

for keyword in tweets_df["keyword"]:
  words_in_vocabulary.append(all(word in vocabulary_set_keyword for word in keyword.split()))

print("Size of the dataset:", tweets_df["keyword"].shape)
print("Counting the values of the words_in_vocabulary list:", pd.Series(words_in_vocabulary).value_counts())

Size of the dataset: (7383,)
Counting the values of the words_in_vocabulary list: True    7383
dtype: int64


In [10]:
encoder_keyword = tfds.deprecated.text.TokenTextEncoder(vocabulary_set_keyword)

In [11]:
encoder_keyword.vocab_size

231
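The encoder reports 231 entries for a vocabulary of 229 tokens. This is consistent with an encoder that reserves id 0 for padding and one extra id for out-of-vocabulary tokens. The sketch below illustrates that id layout with a hypothetical `ToyTokenEncoder` class and a three-word toy vocabulary; it is not the `tfds` implementation, only a minimal model of the behaviour observed in this notebook.

```python
class ToyTokenEncoder:
    """Hypothetical stand-in for a token-to-id encoder (not the tfds class)."""

    def __init__(self, vocab):
        # Sort the set so that ids are reproducible across runs.
        self.tokens = sorted(vocab)
        # Ids start at 1 so that 0 stays free for padding.
        self.token_to_id = {tok: i + 1 for i, tok in enumerate(self.tokens)}
        # A single extra id is shared by every out-of-vocabulary token.
        self.oov_id = len(self.tokens) + 1

    @property
    def vocab_size(self):
        # Known tokens + padding id + out-of-vocabulary id.
        return len(self.tokens) + 2

    def encode(self, text):
        return [self.token_to_id.get(tok, self.oov_id) for tok in text.split()]


toy_encoder = ToyTokenEncoder({"ablaze", "dust", "storm"})
print(toy_encoder.vocab_size)                # 3 tokens + 2 reserved ids
print(toy_encoder.encode("dust storm"))      # two known tokens
print(toy_encoder.encode("ablaze tsunami"))  # second token is out of vocabulary
```

Under this layout, the largest id an embedding layer may receive is `vocab_size - 1`, which is why the encoder's `vocab_size` is a natural choice for the embedding input dimension.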

In [12]:
tf_ds_keyword = tf.data.Dataset.from_tensor_slices((tweets_df["keyword"].values, tweets_df["target"].values))

In [13]:
example_keyword, example_target = next(iter(tf_ds_keyword))

In [14]:
print(tweets_df[["keyword", "target"]].head(1))
example_keyword.numpy(), example_target.numpy()

  keyword  target
0  ablaze       1


(b'ablaze', 1)

In [15]:
def encode_keyword(keyword, target):
  encoded_keyword = encoder_keyword.encode(keyword.numpy())
  return encoded_keyword, target


def encode_keyword_map(keyword, target):
  return tf.py_function(encode_keyword, inp=[keyword, target], Tout=(tf.int64, tf.int64))

In [16]:
all_encoded_keyword = tf_ds_keyword.map(encode_keyword_map)

In [17]:
# simple verification of the result of the encoding process
for batch_keyword, batch_target in all_encoded_keyword.take(1):
  print("First batch : keyword ->", batch_keyword, "AND target ->", batch_target)

print("Manual encoding of the first observation : keyword -> ", encoder_keyword.encode('ablaze'), "AND target ->", tweets_df.loc[0, "target"])

First batch : keyword -> tf.Tensor([1], shape=(1,), dtype=int64) AND target -> tf.Tensor(1, shape=(), dtype=int64)
Manual encoding of the first observation : keyword ->  [1] AND target -> 1


In [18]:
BATCH_SIZE = 16
TAKE_SIZE = int(0.8*len(tweets_df))

train_set_keyword = all_encoded_keyword.take(TAKE_SIZE).shuffle(len(tweets_df))
train_set_keyword = train_set_keyword.padded_batch(BATCH_SIZE,  padded_shapes=([-1], []))

validation_set_keyword = all_encoded_keyword.skip(TAKE_SIZE)
validation_set_keyword = validation_set_keyword.padded_batch(BATCH_SIZE, padded_shapes=([-1], []))
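One point worth checking in this split: `take` and `skip` cut the dataset sequentially, and the shuffle is applied only to the training part afterwards. If the CSV is ordered by keyword, as the first rows of `tweets_df` suggest (an assumption, not verified here), the validation set would then contain keywords the model never saw during training. The toy example below, built on made-up keyword lists, sketches the difference between a sequential split and a split done after a single seeded shuffle.

```python
import random

# Toy stand-in for a keyword column that is sorted by keyword.
keywords = ["ablaze"] * 4 + ["collapse"] * 4 + ["flood"] * 4 + ["storm"] * 4 + ["wreck"] * 4


def split(items, train_fraction=0.8):
    """Cut a list into a leading training part and a trailing validation part."""
    cut = int(train_fraction * len(items))
    return items[:cut], items[cut:]


# Sequential split: mirrors take/skip on an unshuffled dataset.
train_seq, val_seq = split(keywords)
unseen_seq = set(val_seq) - set(train_seq)
print("Validation keywords unseen in training (sequential):", unseen_seq)

# Shuffle once with a fixed seed, then split.
shuffled = keywords[:]
random.Random(42).shuffle(shuffled)
train_shuf, val_shuf = split(shuffled)
unseen_shuf = set(val_shuf) - set(train_shuf)
print("Validation keywords unseen in training (shuffled):", unseen_shuf)
```

With `tf.data`, one possible equivalent is to call `shuffle` with a fixed `seed` and `reshuffle_each_iteration=False` before `take` and `skip`, so that the two subsets stay disjoint from one epoch to the next.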

In [19]:
# What does the first batch look like?
# Which words correspond to the given ids?
for batch_keyword, batch_target in train_set_keyword.take(1):
  for i in range(len(batch_keyword)):
    print(batch_keyword[i, :].numpy(), "<==>", [encoder_keyword.decode([number]) for number in batch_keyword[i, :].numpy()])

[162 158] <==> ['dust', 'storm']
[82  0] <==> ['aftershock', '']
[50  0] <==> ['army', '']
[139   0] <==> ['quarantined', '']
[4 0] <==> ['devastation', '']
[15  0] <==> ['lightning', '']
[85  0] <==> ['hijacking', '']
[124   0] <==> ['mayhem', '']
[55  0] <==> ['death', '']
[120   0] <==> ['disaster', '']
[23  0] <==> ['evacuation', '']
[59  0] <==> ['debris', '']
[221   0] <==> ['collapse', '']
[164   0] <==> ['injury', '']
[224   0] <==> ['demolished', '']
[164   0] <==> ['injury', '']


#### Modeling

We first define the architecture of the neural network: the type of each layer, the number of units per layer, and the activation functions.

In [20]:
model_keyword = tf.keras.Sequential([       
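                  # input_dim is vocab_keyword_size + 1 = 230; encoder_keyword.vocab_size (231) would also cover the out-of-vocabulary id
                  tf.keras.layers.Embedding(vocab_keyword_size + 1, 16),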
                  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16, return_sequences=True)),
                  # tf.keras.layers.Conv1D(16, 3, activation="relu"),
                  # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
                  tf.keras.layers.LSTM(8, return_sequences=False),               
                  # tf.keras.layers.Dense(64, activation='relu'),
                  # tf.keras.layers.Dense(32, activation='relu'),
                  tf.keras.layers.Dense(16, activation='relu'),
                  tf.keras.layers.Dense(8, activation='relu'),
                  tf.keras.layers.Dense(1, activation="sigmoid")              
])

In [21]:
model_keyword.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 16)          3680      
                                                                 
 bidirectional (Bidirectiona  (None, None, 32)         4224      
 l)                                                              
                                                                 
 lstm_1 (LSTM)               (None, 8)                 1312      
                                                                 
 dense (Dense)               (None, 16)                144       
                                                                 
 dense_1 (Dense)             (None, 8)                 136       
                                                                 
 dense_2 (Dense)             (None, 1)                 9         
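The parameter counts in this summary can be checked by hand with the standard formulas for embedding, LSTM and dense layers. The helper functions below are written for this check only (their names are not part of Keras); the layer sizes are taken from the model definition above.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per token id.
    return input_dim * output_dim


def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * ((input_dim + units + 1) * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


param_counts = {
    "embedding": embedding_params(230, 16),
    "bidirectional": 2 * lstm_params(16, 16),  # forward and backward LSTM
    "lstm_1": lstm_params(32, 8),              # input is the 2 x 16 concatenation
    "dense": dense_params(8, 16),
    "dense_1": dense_params(16, 8),
    "dense_2": dense_params(8, 1),
}

for name, count in param_counts.items():
    print(f"{name:14s}{count:>6d}")
print(f"{'total':14s}{sum(param_counts.values()):>6d}")
```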
                                                        

We then define the parameters of the training process: the cost function, the evaluation metric, the optimisation algorithm, and the learning rate schedule.

In [22]:
initial_learning_rate = 0.001

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=2500,
    decay_rate=0.96,
    staircase=True
    )

model_keyword.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss=tf.keras.losses.binary_crossentropy,
    metrics=[tf.keras.metrics.binary_accuracy]
    )
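To get a feel for this schedule, the staircase exponential decay can be reproduced in a few lines: the learning rate is multiplied by `decay_rate` once every `decay_steps` optimizer steps. The sketch below assumes 370 steps per epoch (5906 training rows in batches of 16, keeping the final partial batch) and 20 epochs; the function is a plain-Python re-implementation for illustration, not the Keras class.

```python
import math


def exponential_decay(step, initial_lr=0.001, decay_steps=2500, decay_rate=0.96, staircase=True):
    """Learning rate after `step` optimizer steps under exponential decay."""
    exponent = step / decay_steps
    if staircase:
        # The rate only changes at multiples of decay_steps.
        exponent = math.floor(exponent)
    return initial_lr * decay_rate ** exponent


# Assumption: 5906 training rows, batch size 16, partial last batch kept.
steps_per_epoch = math.ceil(5906 / 16)
total_steps = steps_per_epoch * 20

for step in (0, 2499, 2500, 5000, total_steps - 1):
    print(step, exponential_decay(step))
```

Under these assumptions the learning rate drops only twice during training, so it stays close to its initial value from the first epoch to the last.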

In [23]:
history = model_keyword.fit(train_set_keyword, epochs=20, validation_data=validation_set_keyword)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [24]:
pd.DataFrame(history.history)

Unnamed: 0,loss,binary_accuracy,val_loss,val_binary_accuracy
0,0.629905998707,0.645276010036,0.705410838127,0.441435337067
1,0.548413693905,0.728242456913,0.684342265129,0.517264723778
2,0.540649950504,0.729427695274,0.68176817894,0.521327018738
3,0.536063671112,0.731798171997,0.67469650507,0.582938373089
4,0.533747255802,0.732814073563,0.674517273903,0.585646569729
5,0.531697392464,0.734507262707,0.673090219498,0.592417061329
6,0.530204057693,0.732644796371,0.6698371768,0.606635093689
7,0.530308485031,0.733660697937,0.671934902668,0.598510503769
8,0.529610753059,0.733999311924,0.67104113102,0.606635093689
9,0.529380023479,0.733491361141,0.670776605606,0.606635093689


* This model is clearly overfitted: there is a gap of about 20% between the accuracy on the training set and on the validation set, and a gap of about 25% between the final values of the two cost functions.
* Neither the cost function nor the accuracy changes much from the fifth epoch onwards.

### Model based on cleaned_text variable
#### Preprocessing

In [25]:
nlp_text = en_core_web_md.load()
corpus_text = " ".join(tweets_df["cleaned_text"].to_list())
doc_text = nlp_text(corpus_text)

In [26]:
tokens_text = [token.text for token in doc_text]
vocabulary_set_text= set(tokens_text)
vocab_text_size = len(vocabulary_set_text)

In [27]:
vocab_text_size

17937

In [28]:
encoder_text = tfds.deprecated.text.TokenTextEncoder(vocabulary_set_text)

In [29]:
encoder_text.vocab_size

17939

In [30]:
tf_ds_text = tf.data.Dataset.from_tensor_slices((tweets_df["cleaned_text"].values, tweets_df["target"].values))

In [31]:
example_text, example_target = next(iter(tf_ds_text))

In [32]:
print(tweets_df[["cleaned_text", "target"]].head(1))
print(example_text.numpy(), example_target.numpy())

                 cleaned_text  target
0   Wholesale Markets ablaze        1
b' Wholesale Markets ablaze ' 1


In [33]:
print("Initial text :", example_text.numpy(),"=> Encoding : ", encoder_text.encode(example_text.numpy()))

Initial text : b' Wholesale Markets ablaze ' => Encoding :  [5060, 15405, 3655]


In [34]:
def encode_text(text, target):
  encoded_text = encoder_text.encode(text.numpy())
  return encoded_text, target


def encode_text_map(text, target):
  return tf.py_function(encode_text, inp=[text, target], Tout=(tf.int64, tf.int64))

In [35]:
all_encoded_text = tf_ds_text.map(encode_text_map)

In [36]:
BATCH_SIZE = 16
TAKE_SIZE = int(0.8*len(tweets_df))

train_set_text = all_encoded_text.take(TAKE_SIZE).shuffle(len(tweets_df))
train_set_text = train_set_text.padded_batch(BATCH_SIZE,  padded_shapes=([-1], []))

validation_set_text = all_encoded_text.skip(TAKE_SIZE)
validation_set_text = validation_set_text.padded_batch(BATCH_SIZE, padded_shapes=([-1], []))

In [37]:
# just a look at the first batch
for batch_text, batch_target in train_set_text.take(1):
  print("Shape of the first batch", batch_text.shape)
  for i in range(len(batch_text)):
    print(batch_text[i, :].numpy())

Shape of the first batch (16, 22)
[ 2336  9703 14793 16849  5985 16957 15981    52  7635  8236 16765 17598
 14342 16359  2349  9703 14319 13109 13885 13534     0     0]
[ 5361  6456 11980 14868  4356  2666 11653 15981 13495  4516     0     0
     0     0     0     0     0     0     0     0     0     0]
[  285  2641 11157 14498  6399 14642 17503  5682 14804  6399 13212  4793
 11157 11688 11590  4357 12663     0     0     0     0     0]
[11157  1285  5770  2211 11785 17866     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0]
[12725   896  9703  8819  8700  4320  3081  8700  9773  6681 16765 14182
 13376 11221 14989     0     0     0     0     0     0     0]
[ 8995  5998  1438 17639 14197  3835 17245 15685 10156  9723 14197  6341
 15834  5210  8324  5998  3595 11968 15869     0     0     0]
[ 1083 11157 14584  5503 15092  6715  9288  8700   930   217  5583   556
  3457  9878  7477  1528 12868  1668 16066 10490 17654 14642]
[ 4203 14642   795 11

#### Modeling

In [38]:
model_text = tf.keras.Sequential([
                  tf.keras.layers.Embedding(encoder_text.vocab_size, 16),
                  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16, return_sequences=True)),
                  # tf.keras.layers.Conv1D(4, 3, activation="relu"),
                  # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16, return_sequences=True)),
                  tf.keras.layers.LSTM(8, return_sequences=False),               
                  # tf.keras.layers.Dense(16, activation='relu'),
                  # tf.keras.layers.Dense(8, activation='relu'),
                  tf.keras.layers.Dense(4, activation='relu'),
                  tf.keras.layers.Dense(2, activation='relu'),
                  tf.keras.layers.Dense(1, activation="sigmoid")              
                  ],
                   name="model_text"
            )

In [39]:
model_text.summary()

Model: "model_text"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 16)          287024    
                                                                 
 bidirectional_1 (Bidirectio  (None, None, 32)         4224      
 nal)                                                            
                                                                 
 lstm_3 (LSTM)               (None, 8)                 1312      
                                                                 
 dense_3 (Dense)             (None, 4)                 36        
                                                                 
 dense_4 (Dense)             (None, 2)                 10        
                                                                 
 dense_5 (Dense)             (None, 1)                 3         
                                                        

In [40]:
# number of full batches (training steps) per epoch
int(TAKE_SIZE/BATCH_SIZE)

369

In [41]:
initial_learning_rate

0.001

In [42]:
lr_schedule.get_config()

{'initial_learning_rate': 0.001,
 'decay_steps': 2500,
 'decay_rate': 0.96,
 'staircase': True,
 'name': None}

In [43]:
model_text.get_config()

{'name': 'model_text',
 'layers': [{'class_name': 'InputLayer',
   'config': {'batch_input_shape': (None, None),
    'dtype': 'float32',
    'sparse': False,
    'ragged': False,
    'name': 'embedding_1_input'}},
  {'class_name': 'Embedding',
   'config': {'name': 'embedding_1',
    'trainable': True,
    'batch_input_shape': (None, None),
    'dtype': 'float32',
    'input_dim': 17939,
    'output_dim': 16,
    'embeddings_initializer': {'class_name': 'RandomUniform',
     'config': {'minval': -0.05, 'maxval': 0.05, 'seed': None}},
    'embeddings_regularizer': None,
    'activity_regularizer': None,
    'embeddings_constraint': None,
    'mask_zero': False,
    'input_length': None}},
  {'class_name': 'Bidirectional',
   'config': {'name': 'bidirectional_1',
    'trainable': True,
    'dtype': 'float32',
    'layer': {'class_name': 'LSTM',
     'config': {'name': 'lstm_2',
      'trainable': True,
      'dtype': 'float32',
      'return_sequences': True,
      'return_state': False,

In [44]:
model_text.compile(
          optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
          loss=tf.keras.losses.BinaryCrossentropy(),
          metrics=[tf.keras.metrics.BinaryAccuracy()]
          )

In [45]:
history_text = model_text.fit(train_set_text, epochs=20, validation_data=validation_set_text)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [46]:
pd.DataFrame(history_text.history)

Unnamed: 0,loss,binary_accuracy,val_loss,val_binary_accuracy
0,0.643795073032,0.596850633621,0.627597093582,0.6756939888
1,0.520545601845,0.814256668091,0.605036377907,0.749492228031
2,0.417762845755,0.872841179371,0.577682971954,0.725118458271
3,0.33866211772,0.90060955286,0.63042396307,0.725118458271
4,0.276577919722,0.924483597279,0.68707126379,0.637779295444
5,0.236672505736,0.939553022385,0.652973175049,0.698713600636
6,0.217244133353,0.944293916225,0.771159410477,0.68449562788
7,0.195550709963,0.948526918888,0.749903440475,0.689234912395
8,0.182583555579,0.951066732407,0.752912580967,0.696005403996
9,0.172840684652,0.953267872334,0.793782889843,0.696005403996


* This model is highly overfitted: there is a gap of almost 30% between the accuracy on the training set and on the validation set.
* On the training set, the cost function decreases steadily from 0.6437 to 0.1098, while on the validation set it alternates between increasing and decreasing throughout the training process.
* On the validation set, the cost function reaches its minimum in the third epoch. If we keep the model at this stage, we obtain an accuracy of 87.28% on the training set and 72.51% on the validation set.
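Picking the epoch with the lowest validation loss can be done directly from the training history. The sketch below uses the values from the table above, rounded to six decimals, and only covers the ten epochs displayed there.

```python
# Values copied from the history table above (first ten epochs, rounded).
val_loss = [0.627597, 0.605036, 0.577683, 0.630424, 0.687071,
            0.652973, 0.771159, 0.749903, 0.752913, 0.793783]
train_acc = [0.596851, 0.814257, 0.872841, 0.900610, 0.924484,
             0.939553, 0.944294, 0.948527, 0.951067, 0.953268]
val_acc = [0.675694, 0.749492, 0.725118, 0.725118, 0.637779,
           0.698714, 0.684496, 0.689235, 0.696005, 0.696005]

# Index of the smallest validation loss; epochs are numbered from 1.
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = best_index + 1

print("Best epoch:", best_epoch)
print("Training accuracy:", train_acc[best_index])
print("Validation accuracy:", val_acc[best_index])
```

In Keras, the `tf.keras.callbacks.EarlyStopping` callback with `monitor="val_loss"` and `restore_best_weights=True` is one way to automate this choice during training; it was not used in the runs reported here.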

## Conclusion

* The results of the study are inconclusive. Getting a machine to "read" a tweet indicating that a natural disaster has occurred is not easy. 
* In the Data Understanding and Data Preparation phases, we saw that the keywords of tweets about a real disaster were more precise and descriptive. One way to exploit this would be to build bags of words with factor analysis algorithms, to get a clearer picture of the keywords associated with real natural disasters and of those associated with fake ones.
* A second way concerns the cleaning of the text contained in the tweets: we may have missed certain formatting errors, which could have degraded the predictive capacity of our models.
* A last way to improve the disaster predictions would be to modify the architecture of the neural networks presented in this study. Both models share the same overall architecture (embedding, bidirectional LSTM, LSTM, dense layers), although we tested several others in order to obtain better scores. In the time available, it was not possible to train every candidate architecture.