<a href="https://colab.research.google.com/github/jjordana/twitter_sentiment_analysis/blob/master/roberta_sucio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# roBERTa's approach

A new way to approach this [Kaggle competition](https://www.kaggle.com/c/tweet-sentiment-extraction) is to use Facebook's **roBERTa neural network**. <br><br>
This NN is based on Google's BERT, with some modifications. Its name stands for Robustly optimized BERT approach. <br>
One of its main *strengths* is dynamic masking: the tokens that are masked during pre-training change from one *epoch* to the next, instead of being fixed once during preprocessing. <br>
For more info, please check this interesting [article](https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8).
<br><br>
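
To make the dynamic-masking idea concrete, here is a minimal sketch in plain Python. It is only an illustration, not roBERTa's actual pre-training code: the helper `dynamic_mask`, the 15% rate and the toy token ids are assumptions chosen for this example.

```python
import random

MASK = "<mask>"  # stand-in for the mask token (illustrative)

def dynamic_mask(tokens, rate, rng):
    """Return a copy of `tokens` where a freshly sampled subset is replaced by MASK."""
    n_mask = max(1, round(rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

rng = random.Random(16)
tokens = list(range(100, 120))  # 20 toy token ids

# Static masking would reuse a single mask for every epoch;
# dynamic masking samples a new one each time the sentence is seen.
epochs = [dynamic_mask(tokens, rate=0.15, rng=rng) for _ in range(3)]
for epoch, masked in enumerate(epochs, start=1):
    print(f"epoch {epoch}:", masked)
```

Each epoch hides the same number of tokens, but the positions are drawn again every time.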

For this first approach we are going to get our data directly from Kaggle, without any transformation.<br>
I have learnt a lot about **roBERTa** thanks to [Chris Deotte's](https://www.kaggle.com/cdeotte/tensorflow-roberta-0-705) notebook. Besides, the **HuggingFace** site explains everything quite well. I strongly recommend having a look at their [documentation](https://huggingface.co/transformers/model_doc/roberta.html).

## Kaggle's API

In [59]:
!pip install -q kaggle
!mkdir .kaggle

mkdir: cannot create directory ‘.kaggle’: File exists


In [0]:
import json
token = {"username":"jjordana16","key":"6e806145f7c3fdd4c09e7299f3a70d73"}
with open('/content/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)

In [61]:
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!kaggle config set -n path -v/content

- path is now set to: /content


In [62]:
!kaggle competitions download -c tweet-sentiment-extraction

sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)


## Importing Libraries & data

In [63]:
import pandas as pd
import numpy as np
import requests
import urllib.request

import tensorflow as tf
print('TF version',tf.__version__)
import tensorflow.keras.backend as K

from sklearn.model_selection import StratifiedKFold

TF version 2.2.0


In [64]:
pip install transformers



In [0]:
from transformers import RobertaConfig, TFRobertaModel
import tokenizers
from tokenizers import ByteLevelBPETokenizer

In [0]:
train = pd.read_csv('/content/competitions/tweet-sentiment-extraction/train.csv.zip').fillna('')
test = pd.read_csv('/content/competitions/tweet-sentiment-extraction/test.csv').fillna('')

Like any other NLP model, roBERTa needs its input data to be tokenized.<br>
We can either import our own pre-tokenized data or use one of the provided tokenizers. In our case, we are going to use the pretrained tokenizer files.<br>
**HuggingFace** documents all of this on its doc pages. I really recommend having a look at their [site](https://huggingface.co/transformers/_modules/transformers/tokenization_roberta.html).<br>
For the tokenizer, we need a vocab file and a merges file.
We will download those files from the URLs given on the HuggingFace site.

In [67]:
# We create a folder for all roberta's data
!mkdir /content/roberta

mkdir: cannot create directory ‘/content/roberta’: File exists


In [68]:
# Vocab json
site = requests.get('https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json')
data = site.json()
with open('/content/roberta/roberta-base-vocab.json', 'w') as f:
    json.dump(data, f)

# Merges txt
urllib.request.urlretrieve('https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt', 
                           "/content/roberta/roberta-base-merges.txt")

('/content/roberta/roberta-base-merges.txt',
 <http.client.HTTPMessage at 0x7f2107a2e0b8>)

These two files define our tokenizer. <br>
The `merges.txt` file lists the byte-pair merge rules, which determine how a piece of text is split into sub-word tokens.<br>
For example, the phrase *What's up* would be tokenized following the merges file as *`'What', "'s", 'Ġup'`*.
Then the `vocab` file maps each of those tokens to its index value (defined in this specific file). For example: *`[2061, 18, 62]`*.
<br>**Keep in mind that we have more than 50K values defined in our vocab**.
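
The two-step process (merge rules first, then vocab lookup) can be sketched with a toy example. Everything below is made up for illustration: `toy_merges` and `toy_vocab` are tiny hypothetical stand-ins for the real files, and the byte-level details (such as the `Ġ` space marker) are left out.

```python
def bpe_tokenize(word, merges):
    """Split `word` into characters, then apply merge rules in priority order."""
    symbols = list(word)
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules (earlier rules have higher priority) and vocab.
toy_merges = [("l", "o"), ("lo", "v"), ("lov", "e"), ("i", "n"), ("in", "g")]
toy_vocab = {"love": 0, "lov": 1, "ing": 2, "s": 3, "d": 4, "e": 5}

for word in ["loving", "loved"]:
    tokens = bpe_tokenize(word, toy_merges)
    ids = [toy_vocab[t] for t in tokens]
    print(word, "->", tokens, "->", ids)
```

The real tokenizer follows the same idea, only with tens of thousands of merge rules and vocab entries.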

`ByteLevelBPETokenizer` allows us to import this data and generate our tokenizer. We set `lowercase=True` and `add_prefix_space=True`, so a space is added before the string. This is useful for encoding/decoding.<br>
i.e. `tokenizer.decode(tokenizer.encode("Madrid").ids)` returns the text with a leading space (and lowercased, because of `lowercase=True`).

In [0]:
# Defining our tokenizer from the pretrained vocab and merges files from HuggingFace
tokenizer = ByteLevelBPETokenizer(
    add_prefix_space=True, 
    lowercase=True,
    vocab_file='/content/roberta/roberta-base-vocab.json', 
    merges_file='/content/roberta/roberta-base-merges.txt')

We will set **MAX_LEN** to the maximum tweet length.<br>
This value will be the maximum number of tokens per input example. Find more info [here](https://towardsdatascience.com/to-distil-or-not-to-distil-bert-roberta-and-xlnet-c777ad92f8)

In [70]:
print("Max tweet length:", max(train.text.apply(lambda x: len(x))))
print("Mean tweet length", np.mean(train.text.apply(lambda x: len(x))))

Max tweet length: 141
Mean tweet length 68.32753538808632


In [72]:
MAX_LEN = 141
EPOCHS = 3
BATCH_SIZE = 32
PAD_ID = 1
SEED = 16
LABEL_SMOOTHING = 0.1
tf.random.set_seed(SEED)
np.random.seed(SEED)
train.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


We encode each sentiment label as a single token id, which will later be appended to the input sequence.

In [0]:
sentiment_id = {'positive': 1313, 'negative': 2430, 'neutral': 7974}

Once we have defined our tokenizer, we need to prepare our training and testing data.<br>
We will create some empty arrays to fill in with all our data.

Each array has shape (number of rows in our dataset, MAX_LEN). `input_ids` is initialised with 1s and the other arrays with 0s.

The following code is based on [Chris Deotte's](https://www.kaggle.com/cdeotte/tensorflow-roberta-0-705) notebook.

In [0]:
rowsFile = train.shape[0]

input_ids = np.ones((rowsFile,MAX_LEN),dtype='int32')
attention_mask =  np.zeros((rowsFile,MAX_LEN),dtype='int32')
token_type_ids = np.zeros((rowsFile,MAX_LEN),dtype='int32')
start_tokens = np.zeros((rowsFile,MAX_LEN),dtype='int32')
end_tokens = np.zeros((rowsFile,MAX_LEN),dtype='int32')

We will proceed as follows: <br>
1. We take each tweet and its selected text. <br>
The idea is to find the index of the `selected_text` in the tweet `text`. In an all-zero vector X whose length equals the tweet's length (including the leading space we add), we set to 1 the positions covered by the selected_text, plus the space right before it.<br>
Then we encode the tweet, replacing each word by its corresponding tokens.<br>
       i.e.: tweet: " I love mondays"; len(text): 15; selected_text: "love"; len(selected_text): 4
             X =       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
             index where selected_text starts: 3
             result:   [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
             enc:      ['Ġi', 'Ġlove', 'Ġm', 'ond', 'ays']
2. Each enc object contains all the encoded words. Through `enc.ids` we can access their corresponding indices. <br>
In this part we get the length of each token: we decode each token id back into text, and we build a new list with the start and end character positions of every token in the tweet.
       i.e.: enc.ids:               [939, 657, 475, 2832, 4113]
             enc decoded:           [' i', ' love', ' m', 'ond', 'ays']
             offsets of each token: [(0, 2), (2, 7), (7, 9), (9, 12), (12, 15)]
3. Now we loop over each __offset__. For each token we take the slice of our vector **chars** (the result vector in the example above) that the token covers, and we sum it. If the sum is greater than 0, the token overlaps the selected text and we append its index to our list **toks**.
        i.e.: offsets of each token: [(0, 2), (2, 7), (7, 9), (9, 12), (12, 15)]
              result:                [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
              We take the slices. In this example, the first two are:
              Slice result[0:2] = [0, 0] ====> sum = 0 ====> token 0 is not added to toks
              Slice result[2:7] = [1, 1, 1, 1, 1] ====> sum = 5 > 0 ====> token 1 is added to toks
4. Finally, we fill the arrays created in the previous cell with the computed data: each row of `input_ids` gets `[0] + enc.ids + [2, 2] + [sentiment token] + [2]`, `attention_mask` is set to 1 over those positions, and the start/end arrays get a 1 at the first/last selected token (shifted by one because of the leading `0`).
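
Steps 1 to 3 can be reproduced on the "I love mondays" example without loading the tokenizer. In this sketch the decoded tokens are hard-coded from the example above, so it only illustrates the bookkeeping, not the tokenization itself.

```python
text = " I love mondays"                      # tweet with the leading space added
selected = "love"
tokens = [" i", " love", " m", "ond", "ays"]  # decoded tokens, copied from the example

# Step 1: character mask, 1 where the selected text (and the space before it) lies.
start = text.find(selected)
chars = [0] * len(text)
for i in range(start, start + len(selected)):
    chars[i] = 1
if text[start - 1] == " ":
    chars[start - 1] = 1

# Step 2: (start, end) character offsets of every token.
offsets, pos = [], 0
for tok in tokens:
    offsets.append((pos, pos + len(tok)))
    pos += len(tok)

# Step 3: keep the indices of the tokens that overlap the mask.
toks = [i for i, (a, b) in enumerate(offsets) if sum(chars[a:b]) > 0]

print("chars  :", chars)
print("offsets:", offsets)
print("toks   :", toks)
```

Only token 1 (`' love'`) overlaps the mask, so it is both the first and the last selected token.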

In [0]:
for val in range(rowsFile):
  # FIND OVERLAP
  text1 = " " + " ".join(train.loc[val,'text'].split())
  text2 = " ".join(train.loc[val,'selected_text'].split())

  idx = text1.find(text2) # Index where selected_text starts
  chars = np.zeros((len(text1))) # Vector of 0s with the length of the text
  chars[idx:idx+len(text2)] = 1 # Set to 1 the characters covered by selected_text
  if text1[idx-1]==' ': chars[idx-1] = 1 # Also mark the space right before it
  enc = tokenizer.encode(text1) 

  # ID_OFFSETS
  offsets = []
  idx=0
  for t in enc.ids:
    w = tokenizer.decode([t])
    offsets.append((idx,idx+len(w)))
    idx += len(w)

  # START END TOKENS
  toks = []
  for i,(a,b) in enumerate(offsets):  
    sm = np.sum(chars[a:b])
    if sm>0: toks.append(i) 
        
  s_tok = sentiment_id[train.loc[val,'sentiment']]
  input_ids[val,:len(enc.ids) + 5] = [0] + enc.ids + [2,2] + [s_tok] + [2]
  attention_mask[val,:len(enc.ids)+5] = 1
  if len(toks)>0:
    start_tokens[val,toks[0]+1] = 1
    end_tokens[val,toks[-1]+1] = 1
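
The row layout written by the loop above can be illustrated on a single toy example. This is a sketch in plain Python lists rather than NumPy arrays; `build_row` is a hypothetical helper, and `MAX_LEN` is shortened to 12 so the padding is visible.

```python
MAX_LEN = 12     # shortened for display; the notebook uses 141
PAD_ID = 1       # padding id, the value input_ids is initialised with
BOS, EOS = 0, 2  # ids the loop writes around the tweet and the sentiment token

def build_row(token_ids, sentiment_token, max_len=MAX_LEN):
    """Lay out one example as: BOS tweet EOS EOS sentiment EOS, then padding."""
    row = [BOS] + token_ids + [EOS, EOS] + [sentiment_token] + [EOS]
    if len(row) > max_len:
        raise ValueError("example does not fit in max_len")
    n_pad = max_len - len(row)
    input_ids = row + [PAD_ID] * n_pad
    attention_mask = [1] * len(row) + [0] * n_pad
    return input_ids, attention_mask

ids, mask = build_row([939, 657, 475, 2832, 4113], sentiment_token=1313)
print(ids)
print(mask)
```

Because of the leading `0`, token `i` of the tweet sits at position `i + 1` of the row, which is why the start and end targets are written at `toks[0]+1` and `toks[-1]+1`.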

Now we have to process the test data in the same way.

In [0]:
rowsFile = test.shape[0]
input_ids_t = np.ones((rowsFile,MAX_LEN),dtype='int32')
attention_mask_t = np.zeros((rowsFile,MAX_LEN),dtype='int32')
token_type_ids_t = np.zeros((rowsFile,MAX_LEN),dtype='int32')

In [0]:
for val in range(rowsFile):   
  # INPUT_IDS
  text1 = " "+" ".join(test.loc[val,'text'].split())
  enc = tokenizer.encode(text1)                
  s_tok = sentiment_id[test.loc[val,'sentiment']]
  input_ids_t[val,:len(enc.ids)+5] = [0] + enc.ids + [2,2] + [s_tok] + [2]
  attention_mask_t[val,:len(enc.ids)+5] = 1

Our data is prepared. All our tweets and selected_text have been tokenized.<br>
Now, we must define our model. We will be using the **HuggingFace** [configuration](https://github.com/huggingface/transformers/blob/master/src/transformers/configuration_roberta.py) and [pretrained models](https://huggingface.co/transformers/_modules/transformers/modeling_tf_roberta.html).

In [0]:
# Config json
site = requests.get('https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json')
data = site.json()
with open('/content/roberta/roberta-base-config.json', 'w') as f:
    json.dump(data, f)

The `build_model` function consists of: <br>
defining the **configuration** and **pretrained model** that we want to use, and creating the different layers for our model. <br>
`Dropout` is only active during training. At each step it randomly sets inputs to 0 with the given rate; the inputs that are not set to 0 are scaled up by 1/(1 - rate). It is applied directly inside build_model(). <br>
`Conv1D` creates a convolution kernel. The first parameter, _filters_, defines the dimension of the output; the second defines the size of the 1D convolution window.<br>
`Flatten` flattens the input matrix. <br>
`Activation`: in our case we use _softmax_ to activate the output. This function transforms the vector into a vector of categorical probabilities (a probability distribution).
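
What the `Conv1D(1,1)`, `Flatten` and softmax stack computes for one example can be mimicked in plain Python. This is a sketch under simplifying assumptions: the hidden states and weights are invented numbers with 2 features per token, and `conv1d_1x1` is a hypothetical helper, not a Keras function.

```python
import math

def conv1d_1x1(hidden, weights, bias=0.0):
    """One filter with kernel size 1: a single scalar logit per token position."""
    return [sum(h * w for h, w in zip(vec, weights)) + bias for vec in hidden]

def softmax(logits):
    """Turn the logits into a probability distribution over token positions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# 4 token positions, 2 features each (toy stand-in for the model's hidden states).
hidden = [[0.1, 0.3], [2.0, 1.0], [0.0, -1.0], [0.5, 0.5]]
weights = [1.0, 0.5]

logits = conv1d_1x1(hidden, weights)  # already flat: one value per position
probs = softmax(logits)
print("logits:", logits)
print("probs :", [round(p, 3) for p in probs])
print("most likely position:", probs.index(max(probs)))
```

The two heads in `build_model` do this twice with separate weights, once for the start position and once for the end position.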


In [0]:
def build_model():
  config = RobertaConfig.from_pretrained('/content/roberta/roberta-base-config.json')
  bert_model = TFRobertaModel.from_pretrained("https://cdn.huggingface.co/roberta-base-tf_model.h5", config=config)

  ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32) #ids layer
  att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32) #attention mask
  tok = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32) # token layer
  
  x = bert_model(ids,attention_mask=att,token_type_ids=tok)
  
  # First output
  x1 = tf.keras.layers.Dropout(0.1)(x[0]) 
  x1 = tf.keras.layers.Conv1D(1,1)(x1)
  x1 = tf.keras.layers.Flatten()(x1)
  x1 = tf.keras.layers.Activation('softmax')(x1)
  # Second output
  x2 = tf.keras.layers.Dropout(0.1)(x[0]) 
  x2 = tf.keras.layers.Conv1D(1,1)(x2)
  x2 = tf.keras.layers.Flatten()(x2)
  x2 = tf.keras.layers.Activation('softmax')(x2)

  model = tf.keras.models.Model(inputs=[ids, att, tok], outputs=[x1,x2])
  optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)

  model.compile(loss='categorical_crossentropy', optimizer=optimizer)

  return model

To measure our score we will use, as we have done before in other notebooks, the `jaccard score`.

In [0]:
def jaccardScore(str1, str2): 
  wordA = set(str1.lower().split())  # unique words of the first string
  wordB = set(str2.lower().split())  # unique words of the second string
  if len(wordA) == 0 and len(wordB) == 0: 
    return 0.5
  intersect = wordA.intersection(wordB)
  return float(len(intersect)) / (len(wordA) + len(wordB) - len(intersect))
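
As a quick check of what the metric rewards, the sketch below computes the word-level Jaccard score J(A, B) = |A ∩ B| / |A ∪ B| on a tweet from the training sample shown earlier. The helper name `jaccard_words` is only used for this illustration.

```python
def jaccard_words(str1, str2):
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 0.5  # same convention as the notebook for two empty strings
    inter = a & b
    return len(inter) / (len(a) + len(b) - len(inter))

tweet = "Sooo SAD I will miss you here in San Diego!!!"
print(jaccard_words("Sooo SAD", "Sooo SAD"))  # exact match
print(jaccard_words(tweet, "Sooo SAD"))       # predicting the whole tweet
print(jaccard_words("miss you", "Sooo SAD"))  # no word in common
```

Predicting the whole tweet shares only 2 of its 10 distinct words with the target, so the score drops well below 1 even though the target is fully contained in the prediction.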

In [0]:
jac = []
VER = 'v0'
DISPLAY = 1 # USE DISPLAY=1 FOR INTERACTIVE

# For the training data
tr_start = np.zeros((input_ids.shape[0],MAX_LEN))
tr_end = np.zeros((input_ids.shape[0],MAX_LEN))
# For the testing data
preds_start = np.zeros((input_ids_t.shape[0],MAX_LEN))
preds_end = np.zeros((input_ids_t.shape[0],MAX_LEN))

In [0]:
skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=777)
for fold,(idxT,idxV) in enumerate(skf.split(input_ids,train.sentiment.values)):

    print('#'*25)
    print('### FOLD %i'%(fold+1))
    print('#'*25)
    
    K.clear_session()
    model = build_model()
        
    sv = tf.keras.callbacks.ModelCheckpoint(
        '%s-roberta-%i.h5'%(VER,fold), monitor='val_loss', verbose=1, save_best_only=True,
        save_weights_only=True, mode='auto', save_freq='epoch')
        
    model.fit([input_ids[idxT,], attention_mask[idxT,], token_type_ids[idxT,]], [start_tokens[idxT,], end_tokens[idxT,]], 
        epochs=EPOCHS, batch_size=BATCH_SIZE, verbose=DISPLAY, callbacks=[sv],
        validation_data=([input_ids[idxV,],attention_mask[idxV,],token_type_ids[idxV,]], 
        [start_tokens[idxV,], end_tokens[idxV,]]))
    
    print('Loading model...')
    model.load_weights('%s-roberta-%i.h5'%(VER,fold))
    
    print('Predicting OOF...')
    tr_start[idxV,],tr_end[idxV,] = model.predict([input_ids[idxV,],attention_mask[idxV,],token_type_ids[idxV,]],verbose=DISPLAY)
    
    print('Predicting Test...')
    preds = model.predict([input_ids_t,attention_mask_t,token_type_ids_t],verbose=DISPLAY)
    preds_start += preds[0]/skf.n_splits
    preds_end += preds[1]/skf.n_splits
    
    # DISPLAY FOLD JACCARD
    scores = []  # per-tweet Jaccard scores (avoids shadowing the built-in all)
    for k in idxV:
        a = np.argmax(tr_start[k,])
        b = np.argmax(tr_end[k,])
        if a>b: 
            st = train.loc[k,'text'] # IMPROVE CV/LB with better choice here
        else:
            text1 = " "+" ".join(train.loc[k,'text'].split())
            enc = tokenizer.encode(text1)
            st = tokenizer.decode(enc.ids[a-1:b])
        scores.append(jaccardScore(st,train.loc[k,'selected_text']))
    jac.append(np.mean(scores))
    print('>>>> FOLD %i Jaccard ='%(fold+1),np.mean(scores))
    print()

#########################
### FOLD 1
#########################
Epoch 1/3

In [0]:
print('>>>> OVERALL 5Fold CV Jaccard =',np.mean(jac))

>>>> OVERALL 5Fold CV Jaccard = 0.7002155571532336


In [0]:
all = []
for k in range(input_ids_t.shape[0]):
    a = np.argmax(preds_start[k,])
    b = np.argmax(preds_end[k,])
    if a>b: 
        st = test.loc[k,'text']
    else:
        text1 = " "+" ".join(test.loc[k,'text'].split())
        enc = tokenizer.encode(text1)
        st = tokenizer.decode(enc.ids[a-1:b])
    all.append(st)
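
The decoding rule used above (take the most likely start and end positions, fall back to the whole tweet when they are inconsistent) can be sketched on toy probabilities. `decode_span` is a hypothetical helper and the tokens are hard-coded, so no tokenizer or trained model is needed.

```python
def decode_span(start_probs, end_probs, tokens, full_text):
    """Pick the most likely start/end positions and return the matching text."""
    a = max(range(len(start_probs)), key=start_probs.__getitem__)
    b = max(range(len(end_probs)), key=end_probs.__getitem__)
    if a > b:
        return full_text  # inconsistent prediction: fall back to the whole tweet
    # Positions are shifted by one because position 0 holds the leading token.
    return "".join(tokens[a - 1:b])

text = "I love mondays"
tokens = [" i", " love", " m", "ond", "ays"]  # decoded tokens of the tweet

consistent = decode_span([0, 0.1, 0.8, 0.1, 0, 0, 0],
                         [0, 0.1, 0.7, 0.2, 0, 0, 0], tokens, text)
fallback = decode_span([0, 0, 0, 0, 0.9, 0.1, 0],
                       [0, 0, 0.9, 0.1, 0, 0, 0], tokens, text)
print(repr(consistent))
print(repr(fallback))
```

The fallback to the full tweet is the simplest possible choice when the predicted start comes after the predicted end.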

In [0]:
test['selected_text'] = all
test[['textID','selected_text']].to_csv('submission.csv',index=False)
pd.set_option('max_colwidth', 60)
test.sample(25)

Unnamed: 0,textID,text,sentiment,selected_text
126,b3eadfc565,Folks thought it was hilarious when I told them the sto...,positive,hilarious
1075,e8ce4f8bdc,"ohh snapp, have fun",neutral,"ohh snapp, have fun"
2853,8244eb7ae9,"omg, NO ICECREAM",negative,"omg, no icecream"
2782,aa2c27ea07,I like Rio Ferdinand - when he`s wearing an England jersey.,positive,like
22,3dcf4f7e13,"... need retail therapy, bad. AHHH.....gimme money geebus",negative,bad.
240,abda0ac082,is fixin to clean the house for my mom for mother`s day,positive,is fixin to clean
2550,b8e35a07a2,bout to go to bed... pretty good day for a Monday.,positive,pretty good
419,d7b1ac80c4,Good morning! Just took the longest shower ive ever tak...,positive,good morning!
1533,bab26fdc24,pansy wtf codeh?!,negative,wtf
2488,4805e663c0,Have been writing since 6pm & I only have 300 words. Can...,neutral,have been writing since 6pm & i only have 300 words. ca...
