# Deep Learning models
## RoBERTa model:

This notebook was run on a Google Cloud Deep Learning VM with 2 vCPUs, 52 GB of memory, and an Nvidia Tesla P100 GPU. As with the other notebooks, we recommend installing the requirements first (`pip install -r requirements.txt`) and using a suitable GPU to keep training fast.
On the Google Cloud VM, fitting took about 6 hours.
We used TensorFlow, transformers, and pretrained models from https://huggingface.co/.

This model reached 0.878 accuracy and a 0.878 F1-score on AIcrowd.
Below, you will see an accuracy of 0.8704 on the training set and 0.8833 on the validation set.
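
For reference, the two metrics quoted above can be computed from true and predicted labels as in the minimal sketch below. It assumes labels in {-1, 1} (as in the submission file) and treats 1 as the positive class, which may differ from how AIcrowd scores submissions. The function name `accuracy_and_f1` and the toy arrays are illustrative and are not part of `helpers`.

```python
import numpy as np

def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    # Confusion-matrix counts for the positive class
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Toy labels, made up for illustration only
toy_true = [1, 1, -1, -1, 1]
toy_pred = [1, -1, -1, 1, 1]
acc, f1 = accuracy_and_f1(toy_true, toy_pred)
print(acc, f1)
```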

In [1]:
import tensorflow as tf
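# Import the RoBERTa tokenizer and the TensorFlow sequence-classification model.
# Note: this is the standard tokenizer, not the 'Fast' variant; it also accepts batched inputs.
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification, logging as transformers_logging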
from helpers import create_csv_submission
from helpers import load_cleaned_data
import numpy as np
print(tf.__version__)

2.3.1


In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import GlobalAveragePooling1D
from keras.layers import Convolution1D
from keras.layers import MaxPooling1D
from keras.layers import Embedding
from keras.layers import LSTM

In [2]:
# Check the current GPU infos
!nvidia-smi

Thu Dec 17 00:38:07 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    31W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
tf.test.gpu_device_name()

'/device:GPU:0'

In [3]:
from tensorflow.python.client import device_lib 
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8151212649712713456
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 3249601242645294926
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 2881377164702471986
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 15703311680
locality {
  bus_id: 1
  links {
  }
}
incarnation: 5416298941462155982
physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0"
]


In [9]:
train_pos,train_neg, test = load_cleaned_data(load=False, full=True,
                        emojis=True, repetitions=True,
                        numbers=False, hashtag=False,
                        apostrophes=False, tokenizing=True,
                        slang=False, spelling=True,
                        punctuations=False, stop_words=True,
                        stemming=False, lemmatizing=False)

Ommiting repetitions
Translating emojis
correcting spelling mistakes
tokenizing
removing stop words


In [28]:
def read_tweets(path):
    """Read one cleaned tweet per line; comma-separated tokens become a spaced string."""
    tweets = []
    with open(path, "r", encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line != '':
                # Each token is followed by a space, so tweets keep a trailing space
                tweets.append(''.join(word + ' ' for word in line.split(',')))
    return tweets

x_pos = read_tweets("Project/cleaned_data/cleaned_train_pos_full.txt")
x_neg = read_tweets("Project/cleaned_data/cleaned_train_neg_full.txt")

In [11]:
transformers_logging.set_verbosity_warning()

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')


In [12]:
max_length = 126
batch_size = 64

In [30]:
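def convert_example_to_feature(tweet):
    # Tokenize a batch of tweets and pad every sequence to the longest one in the batch.
    # padding='longest' replaces the deprecated pad_to_max_length=True, which behaves
    # the same way when max_length is None. The `max_length` defined above is not used here.
    return tokenizer(tweet, add_special_tokens=True,
                     padding='longest',
                     return_attention_mask=True,
                     return_token_type_ids=False)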
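
To make the tokenizer output more concrete, the sketch below mimics what padding to the longest sequence and the matching attention mask look like. It uses made-up token ids rather than the real RoBERTa vocabulary, and `pad_batch` is a hypothetical helper written for illustration, not a function of the transformers library. The pad id of 1 is an assumption about the `roberta-base` vocabulary.

```python
import numpy as np

def pad_batch(sequences, pad_id=1):
    """Pad variable-length id lists to the longest one and build the attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), longest), pad_id, dtype=np.int32)
    attention_mask = np.zeros((len(sequences), longest), dtype=np.int8)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq      # real tokens
        attention_mask[row, :len(seq)] = 1   # 1 = attend, 0 = padding
    return input_ids, attention_mask

# Two toy "tweets" of different lengths (ids are invented)
toy_ids, toy_mask = pad_batch([[0, 100, 200, 2], [0, 300, 2]])
print(toy_ids)
print(toy_mask)
```

The mask lets the model ignore padded positions, which is why it is passed to `fit` and `predict` together with the ids.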

In [31]:
bert_x = convert_example_to_feature((x_pos+x_neg))

In [32]:
y = np.concatenate([np.ones(len(x_pos)), np.zeros(len(x_neg))])

In [35]:
model= TFRobertaForSequenceClassification.from_pretrained("roberta-base")
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)

model.compile(optimizer=opt,
              loss=loss_fn,
              metrics=['accuracy'])

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaForSequenceClassification: ['lm_head']
- This IS expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
model.summary()

Model: "tf_roberta_for_sequence_classification_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
roberta (TFRobertaMainLayer) multiple                  124645632 
_________________________________________________________________
classifier (TFRobertaClassif multiple                  592130    
Total params: 125,237,762
Trainable params: 125,237,762
Non-trainable params: 0
_________________________________________________________________


In [37]:
x_pos, x_neg = (0, 0) # optimize memory

In [38]:
attention_mask = np.array(bert_x['attention_mask'],dtype=np.int8)
input_ids= np.array(bert_x['input_ids'],dtype=np.int32)
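
The explicit dtypes above matter because NumPy would otherwise pick a wider integer type (typically int64 on Linux) for lists of Python ints. The sketch below illustrates the saving on a toy array, whose shape is an arbitrary assumption and not the real dataset size. It also shows why token ids are not narrowed to int16: the RoBERTa vocabulary has roughly 50,000 entries, which exceeds the int16 range.

```python
import numpy as np

n_rows, n_cols = 1000, 64  # toy shape, chosen only for illustration

mask_int8 = np.ones((n_rows, n_cols), dtype=np.int8)
mask_int64 = np.ones((n_rows, n_cols), dtype=np.int64)

bytes_int8 = mask_int8.nbytes    # 1 byte per element
bytes_int64 = mask_int64.nbytes  # 8 bytes per element
print(bytes_int8, bytes_int64)

# Largest values representable by the candidate dtypes for token ids
max_int16 = np.iinfo(np.int16).max
max_int32 = np.iinfo(np.int32).max
print(max_int16, max_int32)
```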

In [41]:
bert_x = 0 # to optimize memory

In [42]:
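# Shuffle all examples, then keep the first 2,000,000 for training.
# The remainder (named test_idx) is used as the validation set during fit.
indices = np.random.permutation(input_ids.shape[0])
training_idx, test_idx = indices[:2000000], indices[2000000:]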
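
The split above is not seeded, so the validation set changes from run to run. A minimal sketch of a reproducible variant follows; the helper name `shuffled_split`, the seed, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def shuffled_split(n_samples, n_train, seed=0):
    """Return (train_idx, val_idx) from a seeded random permutation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    return indices[:n_train], indices[n_train:]

# Toy sizes; the notebook uses 2,000,000 training examples
train_idx, val_idx = shuffled_split(n_samples=10, n_train=8, seed=42)
print(train_idx, val_idx)
```

With a fixed seed, the same indices are produced on every run, which makes validation scores comparable across experiments.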

In [43]:
train_input_ids = tf.convert_to_tensor(input_ids[training_idx,:])
train_att_mask = tf.convert_to_tensor(attention_mask[training_idx,:])
y_train = tf.convert_to_tensor(y[training_idx],dtype=tf.float32)


test_input_ids = tf.convert_to_tensor(input_ids[test_idx,:])
test_att_mask = tf.convert_to_tensor(attention_mask[test_idx,:])
y_test = tf.convert_to_tensor(y[test_idx],dtype=tf.float32)

In [44]:
bert_history = model.fit([train_input_ids,train_att_mask], 
      y_train,
      validation_data=([test_input_ids,test_att_mask], y_test),
      epochs=1, batch_size=batch_size, verbose=1 )



In [45]:
model.save_weights("./full_training_roberta_weights_1M.h5")

In [46]:
# Read the cleaned test tweets the same way as the training files:
# comma-separated tokens become a space-separated string
test = []
with open("Project/cleaned_data/cleaned_test_data.txt", "r", encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line != '':
            test.append(''.join(word + ' ' for word in line.split(',')))

In [47]:
bert_test = convert_example_to_feature(test)

In [48]:
input_ids = tf.convert_to_tensor(bert_test.get('input_ids'))
attention_mask =tf.convert_to_tensor(bert_test.get('attention_mask'))

In [49]:
y_pred = model.predict([input_ids,attention_mask])

In [51]:
def softmax(x):
    # Subtract the largest logit before exponentiating to avoid overflow
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

In [52]:
predictions = [softmax(x) for x in y_pred[0]]

In [53]:
# Map class 0 (negative) to -1 and class 1 (positive) to 1
output = [-1 if elem[0] > elem[1] else 1 for elem in predictions]
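
The softmax and thresholding cells above can also be applied to the whole logits array at once. The sketch below assumes logits of shape `(n, 2)` with column 0 for the negative class and column 1 for the positive class, as in the loop above. The function names and the toy logits are illustrative. Since softmax preserves ordering, the labels can be read directly from the logits.

```python
import numpy as np

def stable_softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def logits_to_labels(logits):
    """Return -1 where the negative logit is strictly larger, else 1."""
    return np.where(logits[:, 0] > logits[:, 1], -1, 1)

# Made-up logits, including a large pair that would overflow a naive softmax
toy_logits = np.array([[2.0, -1.0], [0.5, 3.0], [1000.0, 999.0]])
toy_probs = stable_softmax(toy_logits)
toy_labels = logits_to_labels(toy_logits)
print(toy_probs)
print(toy_labels)
```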

In [55]:
create_csv_submission(output, './roBERTa_FULL.csv') 