<a href="https://colab.research.google.com/github/prawizard/TweetsClassification_NLP/blob/main/TweetEval/TweetEval_EmotionTweetsClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Instructions to Execute:

*   To execute by loading my trained model and evaluating on test data:

1. #### Go to the following link, where you can access the weights from my Google Drive.

    https://drive.google.com/drive/folders/1NyLnCuDTWg476bTpihzhECd54EQjNMma?usp=sharing

    #### Download the folder 'Emotion_ModelWeights'. If it is downloaded as a zip file, please extract the folder.

    #### Upload the extracted folder 'Emotion_ModelWeights' to your Google Drive. You will mount your Drive in the Colab runtime and access the .h5 file from there. This is faster than uploading the .h5 file directly to Colab because of its large size (~1.1 GB).

    #### The trained weights could not be uploaded to GitHub either because of their size, so we access them from Google Drive.

    #### Further instructions on how to do this are provided in Section 14.

2. #### Run all cells in Sections **1 to 7 ONLY**.

3. #### Skip the cells afterwards until **Section 14**.

4. #### **Execute the cells in Section 14**. The last cell in this section provides the **F1-Score and Accuracy obtained by the model on the test set**.

*   To execute by training the model from scratch and then evaluating on test data:

#### Execute Sections 1 to 12. Section 13 can be ignored, as it only saves the model.

#### Section 14 mounts your Drive in Colab and loads the weights (you would have to upload the weights folder from my Drive link to your Drive before running Section 14, as described in the previous instruction). Hence, if you are not using my trained model, this section can be ignored too.




# 1 : Install/import the required libraries

In [None]:
!pip install ktrain



In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
import ktrain
from ktrain import text
import re
import requests
import sklearn.metrics

# 2 : Define Constants

In [None]:
TRAIN_TEXT_URL="https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/train_text.txt"
TRAIN_LABELS_URL="https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/train_labels.txt"
VAL_TEXT_URL="https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/val_text.txt"
VAL_LABELS_URL="https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/val_labels.txt"
TEST_TEXT_URL="https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/test_text.txt"
TEST_LABELS_URL="https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/test_labels.txt"
VOCAB_SIZE=2000

# 3 : Fetch the Data using URLs

In [None]:
r = requests.get(TRAIN_TEXT_URL, allow_redirects=True)
open('train_text.txt', 'wb').write(r.content)

r = requests.get(TRAIN_LABELS_URL, allow_redirects=True)
open('train_labels.txt', 'wb').write(r.content)

r = requests.get(VAL_TEXT_URL, allow_redirects=True)
open('val_text.txt', 'wb').write(r.content)

r = requests.get(VAL_LABELS_URL, allow_redirects=True)
open('val_labels.txt', 'wb').write(r.content)

r = requests.get(TEST_TEXT_URL, allow_redirects=True)
open('test_text.txt', 'wb').write(r.content)

r = requests.get(TEST_LABELS_URL, allow_redirects=True)
open('test_labels.txt', 'wb').write(r.content)

2842
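The six downloads above repeat the same two steps. A minimal sketch of a reusable helper follows, assuming the standard-library `urllib.request` in place of `requests`; the name `fetch_to_file` is hypothetical and not part of the original notebook. Unlike the cell above, it closes the output file explicitly, and `urlopen` raises an error on an HTTP failure status instead of saving an error page.

```python
import urllib.request


def fetch_to_file(url, path):
    """Download `url` and write the raw bytes to `path`.

    Hypothetical helper (not from the original notebook).
    Returns the number of bytes written.
    """
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(url) as response:
        content = response.read()
    # The context manager guarantees the file is flushed and closed.
    with open(path, "wb") as handle:
        return handle.write(content)
```

It could then be called as `fetch_to_file(TRAIN_TEXT_URL, 'train_text.txt')`, once per URL constant defined in Section 2.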

# 4 : Access the data from files

In [None]:
stream=open("train_text.txt")
tweets=stream.readlines()
stream.close()
val_stream=open("val_text.txt")
val_tweets=val_stream.readlines()
val_stream.close()
test_stream=open("test_text.txt")
test_tweets=test_stream.readlines()
test_stream.close()

# Labels
stream=open("train_labels.txt")
tweetsLabels=stream.readlines()
stream.close()
val_stream=open("val_labels.txt")
val_tweetsLabels=val_stream.readlines()
val_stream.close()
test_stream=open("test_labels.txt")
test_tweetsLabels=test_stream.readlines()
test_stream.close()


# Labels: convert each line to an integer.
# strip() also handles a final line that has no trailing newline.
labels=[int(line.strip()) for line in tweetsLabels]
val_labels=[int(line.strip()) for line in val_tweetsLabels]
test_labels=[int(line.strip()) for line in test_tweetsLabels]

In [None]:
print('Samples in Training set : ',len(labels),', Validation set : ', len(val_labels),', Test set : ', len(test_labels))

Samples in Training set :  3257 , Validation set :  374 , Test set :  1421
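The file-reading steps in this section can also be sketched as one function per split. This is only an illustration: `read_lines` and `load_split` are hypothetical names, UTF-8 encoding is an assumption, and unlike the cell above the tweets are returned with their trailing newline already removed.

```python
def read_lines(path):
    """Return the lines of a text file without their trailing newline."""
    # Assumption: the dataset files are UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_split(text_path, labels_path):
    """Read one dataset split and return (tweets, integer_labels)."""
    tweets = read_lines(text_path)
    # Skip blank lines so a stray empty line does not break int().
    labels = [int(line) for line in read_lines(labels_path) if line.strip()]
    if len(tweets) != len(labels):
        raise ValueError(
            f"{text_path}: {len(tweets)} tweets but {len(labels)} labels"
        )
    return tweets, labels
```

For example, `tweets, labels = load_split('train_text.txt', 'train_labels.txt')` would fail early if the text and label files were out of step.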


# 5 : Data Cleaning

## 5.1 : Remove the twitter handles

In [None]:
# Note: the '#' removal below only runs for tweets that contain '@user'.
for i in range(len(tweets)):
  if tweets[i].find('@user')!=-1:
    tweets[i]=re.sub('@user', '', tweets[i])
    tweets[i]=re.sub('#+', '', tweets[i])

for i in range(len(val_tweets)):
  if val_tweets[i].find('@user')!=-1:
    val_tweets[i]=re.sub('@user', '', val_tweets[i])
    val_tweets[i]=re.sub('#+', '', val_tweets[i])

for i in range(len(test_tweets)):
  if test_tweets[i].find('@user')!=-1:
    test_tweets[i]=re.sub('@user', '', test_tweets[i])
    test_tweets[i]=re.sub('#+', '', test_tweets[i])

## 5.2 : Remove the unnecessary hashtags

In [None]:
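# Note: str.find() matches its argument literally, not as a regular expression,
# so the conditions in this cell only match a tweet containing that exact text;
# in practice the tweets are left unchanged.
for i in range(len(tweets)):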
  if tweets[i].find('#[a-zA-Z]+')!=-1:
    tweets[i]=re.sub('#[a-zA-Z]+', '', tweets[i])

for i in range(len(val_tweets)):
  if val_tweets[i].find('#[a-zA-Z]+')!=-1:
    val_tweets[i]=re.sub('#[a-zA-Z]+', '', val_tweets[i])

for i in range(len(test_tweets)):
  if test_tweets[i].find('#[a-zA-Z]+')!=-1:
    test_tweets[i]=re.sub('#[a-zA-Z]+', '', test_tweets[i])

## 5.3 : Remove characters like \\n and unnecessary dots

In [None]:
for i in range(len(tweets)):
  if tweets[i].find('\n')!=-1:
    tweets[i]=re.sub('\n', '', tweets[i])

for i in range(len(val_tweets)):
  if val_tweets[i].find('\n')!=-1:
    val_tweets[i]=re.sub('\n', '', val_tweets[i])

for i in range(len(test_tweets)):
  if test_tweets[i].find('\n')!=-1:
    test_tweets[i]=re.sub('\n', '', test_tweets[i])

# Unnecessary dots
p=r'\.\.\.|\.\.'  # raw string avoids invalid-escape warnings

for i in range(len(tweets)):
  tweets[i]=re.sub(p, '', tweets[i])
  tweets[i]=tweets[i].lower()

for i in range(len(val_tweets)):
  val_tweets[i]=re.sub(p, '', val_tweets[i])
  val_tweets[i]=val_tweets[i].lower()

for i in range(len(test_tweets)):
  test_tweets[i]=re.sub(p, '', test_tweets[i])
  test_tweets[i]=test_tweets[i].lower()
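The cleaning steps of Section 5 can be gathered into one function and applied to all three splits. The sketch below is an assumption about how that might look, not the notebook's own code: `clean_tweet` is a hypothetical name, and it applies every step unconditionally, which differs slightly from the cells above, where '#' is stripped only from tweets that contain '@user'.

```python
import re

# Same pattern as `p` above: a run of three dots, or of two dots.
DOTS = re.compile(r"\.\.\.|\.\.")


def clean_tweet(tweet):
    """Apply the Section 5 cleaning steps to a single tweet."""
    tweet = tweet.replace("@user", "")   # 5.1: drop anonymised handles
    tweet = re.sub(r"#+", "", tweet)     # 5.1: drop '#' but keep the word
    tweet = tweet.replace("\n", "")      # 5.3: drop newline characters
    tweet = DOTS.sub("", tweet)          # 5.3: drop '..' and '...'
    return tweet.lower()                 # 5.3: lowercase
```

It would be used as `tweets = [clean_tweet(t) for t in tweets]`, and likewise for `val_tweets` and `test_tweets`. Because the behavior is not identical to the original cells, scores obtained with it may differ from those reported below.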

# 6 : Store the Data Processed In a Data Frame

In [None]:
rows=[]
rowIndices=[]
for i in range(len(tweets)):
  rows.append({"TWEET":tweets[i], "CATEGORY":labels[i]})
  rowIndices.append(i+1)
train_df=pd.DataFrame(rows, index=rowIndices)

val_rows=[]
val_rowIndices=[]
for i in range(len(val_tweets)):
  val_rows.append({"TWEET":val_tweets[i], "CATEGORY":val_labels[i]})
  val_rowIndices.append(i+1)
val_df=pd.DataFrame(val_rows, index=val_rowIndices)

test_rows=[]
test_rowIndices=[]
for i in range(len(test_tweets)):
  test_rows.append({"TWEET":test_tweets[i], "CATEGORY":test_labels[i]})
  test_rowIndices.append(i+1)
test_df=pd.DataFrame(test_rows, index=test_rowIndices)

# 7 : Segregate the Data Into Training and Validation/Dev Set

In [None]:
(X_train, y_train), (X_test, y_test), preproc = text.texts_from_df(train_df=train_df,
                                                                   text_column = 'TWEET',
                                                                   label_columns = 'CATEGORY',
                                                                   val_df = val_df,
                                                                   maxlen = 65,
                                                                   preprocess_mode = 'bert')

['CATEGORY_0', 'CATEGORY_1', 'CATEGORY_2', 'CATEGORY_3']
   CATEGORY_0  CATEGORY_1  CATEGORY_2  CATEGORY_3
1         0.0         0.0         1.0         0.0
2         1.0         0.0         0.0         0.0
3         0.0         1.0         0.0         0.0
4         1.0         0.0         0.0         0.0
5         0.0         0.0         0.0         1.0
['CATEGORY_0', 'CATEGORY_1', 'CATEGORY_2', 'CATEGORY_3']
   CATEGORY_0  CATEGORY_1  CATEGORY_2  CATEGORY_3
1         1.0         0.0         0.0         0.0
2         1.0         0.0         0.0         0.0
3         1.0         0.0         0.0         0.0
4         1.0         0.0         0.0         0.0
5         1.0         0.0         0.0         0.0
preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


# *To load my trained weights and evaluate instead of training the model again, skip all the cells below until Section 14, "Running the Model Using Weights (Pretrained)", and run all the cells from there onward.*

# Instructions on how the trained weights can be loaded are provided in that section (Section 14).

# *Otherwise, to train the model again, continue executing the cells below until Section 12, "Accuracy and F1-Score". Section 12 gives the result after training the model again and evaluating on test data.*

# 8 : Using BERT for pretrained weights

In [None]:
model = text.text_classifier(name = 'bert',
                             train_data = (X_train, y_train),
                             preproc = preproc)

Is Multi-Label? False
maxlen is 65
done.


# 9 : Tuning the hyper parameters

In [None]:
learner = ktrain.get_learner(model=model, train_data=(X_train, y_train),
                   val_data = (X_test, y_test),
                   batch_size = 3)

# 10 : Fit the Model

In [None]:
learner.fit_onecycle(lr = 2e-5, epochs = 3)

predictor = ktrain.get_predictor(learner.model, preproc)
predictor.save('/content/drive/My Drive/bert')



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/3
Epoch 2/3
Epoch 3/3


# 11 : Evaluate the Model Performance on Test Data

In [None]:
data = test_df['TWEET'].tolist()

In [None]:
bert_pred=predictor.predict(data)

In [None]:
bert_pred[:10]

['CATEGORY_3',
 'CATEGORY_0',
 'CATEGORY_3',
 'CATEGORY_1',
 'CATEGORY_1',
 'CATEGORY_0',
 'CATEGORY_3',
 'CATEGORY_3',
 'CATEGORY_1',
 'CATEGORY_0']

In [None]:
y_true=np.array(test_df['CATEGORY'].tolist())

In [None]:
y_true[:10]

array([3, 0, 3, 1, 1, 0, 3, 3, 3, 0])

In [None]:
res_bert=[]
for i in range(len(bert_pred)):
  if bert_pred[i]=='CATEGORY_0':
    res_bert.append(0)
  elif bert_pred[i]=='CATEGORY_1':
    res_bert.append(1)
  elif bert_pred[i]=='CATEGORY_2':
    res_bert.append(2)
  else:
    res_bert.append(3)
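The if/elif chain above maps each predicted class name back to its integer label. A shorter sketch, assuming every prediction has the form `CATEGORY_<n>` as in the output shown earlier, parses the number from the name; `category_to_int` and `categories_to_ints` are hypothetical helper names. Unlike the chain, an unexpected name raises an error instead of being counted as class 3.

```python
def category_to_int(name, prefix="CATEGORY_"):
    """Convert a class name such as 'CATEGORY_2' to the integer 2."""
    if not name.startswith(prefix):
        raise ValueError(f"unexpected class name: {name!r}")
    return int(name[len(prefix):])


def categories_to_ints(names):
    """Convert a list of class names to a list of integer labels."""
    return [category_to_int(name) for name in names]
```

With the predictions above, `res_bert = categories_to_ints(bert_pred)` would give the same list as the loop.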

# 12 : Accuracy and F1-Score

In [None]:
res_bert=np.array(res_bert)
np.mean(res_bert==y_true)

0.8142153413089374

In [None]:
F1_SCORE=sklearn.metrics.f1_score(y_true, res_bert, average='macro')
print('F1-SCORE OBTAINED :',round(F1_SCORE*100, 2))

F1-SCORE OBTAINED : 78.38
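To make the two metrics concrete, here is a plain-Python sketch of accuracy and macro-averaged F1. It is meant only as an illustration of what `average='macro'` computes (the unweighted mean of the per-class F1 scores), not as a replacement for `sklearn.metrics.f1_score`; the function names are hypothetical and the inputs are assumed to be plain lists of integer labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class counts equally in the macro average, a class with few test examples influences the F1 score as much as a frequent one, which is one reason the F1 score can sit below the accuracy.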


# 13 : Saving the Trained Weights

In [None]:
predictor.save('/content/EmotionTrainedWeights/Weights')

In [None]:
!zip -r /content/Emotion_Weights.zip /content/EmotionTrainedWeights/Weights

  adding: content/EmotionTrainedWeights/Weights/ (stored 0%)
  adding: content/EmotionTrainedWeights/Weights/tf_model.preproc (deflated 52%)
  adding: content/EmotionTrainedWeights/Weights/tf_model.h5 (deflated 18%)


In [None]:
from google.colab import files
files.download('/content/Emotion_Weights.zip')


# 14 : Running the Model Using Weights (Pretrained)

### Run the cell below. 
*You must have uploaded my trained weights folder to your Drive, as described at the top of this notebook. Use the same Google account here to which you uploaded the folder.*
### 1. Click on the link following "Go to this URL in a browser:"
### 2. Select your Google account / sign in to your Google account.
### 3. Click 'Allow'.
### 4. Copy the authorization code shown in that tab and paste it in the box provided (labelled 'Enter your authorization code:'). Use Ctrl+V to paste, as right-click paste might not work.
### 5. Hit the Enter key. You can now see the folder 'gdrive', with the files/folders in your Drive ready for access in the Colab runtime.

### We will use these weights folders to load the trained weights and evaluate the model.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

!cp -r '/content/gdrive/My Drive/Emotion_ModelWeights' Emotion_Weights_Drive

Mounted at /content/gdrive


### Load the weights

In [None]:
predictor_load = ktrain.load_predictor('/content/gdrive/My Drive/Emotion_ModelWeights')
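If the load above fails, a quick check of the folder contents can help locate the problem. The sketch below assumes the saved predictor consists of the two files listed in the zip output of Section 13 (`tf_model.h5` and `tf_model.preproc`); whether your Drive folder contains exactly these files is an assumption, and `missing_weight_files` is a hypothetical helper name.

```python
import os

# File names taken from the zip listing in Section 13 (assumed complete).
EXPECTED_FILES = ("tf_model.h5", "tf_model.preproc")


def missing_weight_files(folder):
    """Return the expected weight files that are absent from `folder`."""
    return [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(folder, name))
    ]
```

Calling `missing_weight_files('/content/gdrive/My Drive/Emotion_ModelWeights')` returns an empty list when both files are present; a non-empty list suggests the folder was uploaded to a different location or under a different name.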

## Evaluate on Test Data Using these loaded weights

In [None]:
data = test_df['TWEET'].tolist()

In [None]:
bert_pred_load=predictor_load.predict(data)

In [None]:
y_true=np.array(test_df['CATEGORY'].tolist())

In [None]:
res_bert_load=[]
for i in range(len(bert_pred_load)):
  # Use the predictions from the loaded model in every branch
  if bert_pred_load[i]=='CATEGORY_0':
    res_bert_load.append(0)
  elif bert_pred_load[i]=='CATEGORY_1':
    res_bert_load.append(1)
  elif bert_pred_load[i]=='CATEGORY_2':
    res_bert_load.append(2)
  else:
    res_bert_load.append(3)

## ACCURACY and F1_SCORE

In [None]:
res_bert_load=np.array(res_bert_load)
print('ACCURACY OBTAINED :', round(np.mean(res_bert_load==y_true)*100,2))

ACCURACY OBTAINED : 81.42


In [None]:
F1_SCORE=sklearn.metrics.f1_score(y_true, res_bert_load, average='macro')
print('F1_SCORE OBTAINED :', round(F1_SCORE*100, 2))

F1_SCORE OBTAINED : 78.38

