<a href="https://colab.research.google.com/github/zen030/CourseProject/blob/main/DEMO_Model_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**
<b>This notebook was implemented and tested in the Google Colab Pro environment.</b>


In this demo, I use a BERT model trained with the following parameters:
- Learning rate: 2e-5 (trained for 4 epochs)
- Optimizer epsilon value: 1e-8
- Random seed value: 17
- Evaluation is done only for the model saved at epoch 4

The files in this demo:
- The trained model used in this demo: https://drive.google.com/file/d/1sn3QT-GlFvgk7XHv-144WC6gf9gHCAnE/view?usp=sharing
- answer.txt generated from this demo: https://drive.google.com/file/d/14tLEIr07SK4uq5cx7lrUOj_IzJZuUGYO/view?usp=sharing


<b>Running this demo should generate the same answer.txt file in your Colab session. Please compare your result with the file linked above; the two should match.</b>


# 1. Colab environment configuration and module imports

In [None]:
# install required modules
!pip install transformers
!pip install PyDrive

In [2]:
# Import the required modules.

# Evaluation.
import pandas as pd
import json
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader, SequentialSampler
import torch.nn.functional as F 
import torch
from transformers import BertForSequenceClassification
import numpy as np

# To manage dataset.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [3]:
# Copy the trained model file and testing dataset
# from Google Drive to the Colab session.

# test.jsonl file location: 
# https://drive.google.com/file/d/1vA3uyqy1TZmahgZ0PeNRFx67LuYeAkoW/view?usp=sharing

################################################
# The pre-trained model using training dataset #
################################################
# Google Drive file name.
model_file = 'lr_2e-5_1e-8_17_4.model'
# Google Drive unique file ID
model_file_id = "1sn3QT-GlFvgk7XHv-144WC6gf9gHCAnE"

##################################
# The evaluation/testing dataset #
##################################
# Google Drive file name.
evaluation_file = 'test.jsonl'
# Google Drive unique file ID.
test_jsonl_file_id = "1vA3uyqy1TZmahgZ0PeNRFx67LuYeAkoW"

In [4]:
# The files are shared publicly.
# Log in with a Google account to proceed,
# then copy and paste the authorization code when prompted.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

downloaded = drive.CreateFile({'id':test_jsonl_file_id})
downloaded.GetContentFile(evaluation_file) 

downloaded = drive.CreateFile({'id':model_file_id})
downloaded.GetContentFile(model_file)

# 2. Testing Dataset Preparation

In [5]:
# Read testing dataset and store it
# in Pandas DataFrame.

# Read jsonl file into list (of json)
with open(evaluation_file) as f:
    # creating array of json
    lines = f.read().splitlines()
print(f'Number of lines in file: {len(lines)}')

# Normalize json into dataframe columns
df = pd.json_normalize(pd.DataFrame(lines)[0].apply(json.loads))
print(f'Number of records in Pandas DataFrame: {len(df)}')

# lowercase response text
df.response = df.response.str.lower()

# Check maximum character length of 'response'
max_response_chars = df.response.str.len().max()
print(f"Maximum character length of 'response': {max_response_chars}")

# Use the maximum character count plus a margin of 5 as the padded sequence length.
# The margin leaves room for the special tokens ([CLS], [SEP]) added by the tokenizer.
max_length = max_response_chars + 5

# Print DataFrame to have preview of the data
print(df)

Number of lines in file: 1800
Number of records in Pandas DataFrame: 1800
Maximum character length of 'response': 310
                id  ...                                            context
0        twitter_1  ...  [Well now that ’ s problematic AF <URL>, @USER...
1        twitter_2  ...  [Last week the Fake News said that a section o...
2        twitter_3  ...  [@USER Let ’ s Aplaud Brett When he deserves i...
3        twitter_4  ...  [Women generally hate this president . What's ...
4        twitter_5  ...  [Dear media Remoaners , you excitedly sharing ...
...            ...  ...                                                ...
1795  twitter_1796  ...  [I have been a business customer of MWeb @USER...
1796  twitter_1797  ...  [A woman refuses to have her temperature taken...
1797  twitter_1798  ...  [The reason big government wants @USER out is ...
1798  twitter_1799  ...  [Happy #musicmonday and #thanks for #all your ...
1799  twitter_1800  ...  [Not long wrapped on the amazing
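
The reading step above relies on the JSON Lines layout, where each line of the file is one JSON object. As a minimal sketch of the same idea using only the standard library, the snippet below parses a small in-memory sample, lowercases `response`, and computes the maximum character length. The two sample records and the helper name `read_jsonl` are made up for illustration; only the field names `id`, `response`, and `context` come from the dataset preview.

```python
import io
import json

# Hypothetical two-record sample in the same JSON Lines layout as test.jsonl
# (one JSON object per line); the texts are made up for illustration.
sample_jsonl = io.StringIO(
    '{"id": "twitter_1", "response": "Sure, GREAT idea", "context": ["first tweet", "a reply"]}\n'
    '{"id": "twitter_2", "response": "Thanks for sharing", "context": ["another tweet"]}\n'
)

def read_jsonl(handle):
    """Parse one JSON object per non-empty line and lowercase 'response'."""
    records = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        record["response"] = record["response"].lower()
        records.append(record)
    return records

records = read_jsonl(sample_jsonl)
max_response_chars = max(len(r["response"]) for r in records)
print(len(records), max_response_chars)
```

With the real file, the handle would come from `open(evaluation_file)` instead of the in-memory sample.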

# 3. Encode input data and Data Loader creation

In [6]:
# 1. Encode the data
# 2. Create Tensor Dataset
# 3. Create Dataloader for the evaluation

bert_model = 'bert-large-uncased'
batch_size = 5

tokenizer = BertTokenizer.from_pretrained(bert_model, do_lower_case=True)

encoded_data_evaluation = tokenizer.batch_encode_plus(
    df.response.values,
    add_special_tokens=True,
    return_attention_mask=True,
    max_length=max_length,
    padding='max_length',
    return_tensors='pt'
)

input_ids_evaluation = encoded_data_evaluation['input_ids']
attention_masks_evaluation = encoded_data_evaluation['attention_mask']

dataset_evaluation = TensorDataset(input_ids_evaluation, attention_masks_evaluation)

dataloader_eval = DataLoader(dataset_evaluation, sampler=SequentialSampler(dataset_evaluation), batch_size=batch_size)

# 4. Load pre-trained model and Evaluation

In [None]:
# If GPU is available.
if torch.cuda.is_available():    
    # PyTorch to use the GPU    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If GPU is not available. Use the CPU.
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

# Label-to-id mapping; its size sets the number of output classes
label_dict = {'SARCASM': 0, 'NOT_SARCASM': 1}

# Build the BERT classifier, then load the fine-tuned weights from the model file
model = BertForSequenceClassification.from_pretrained(bert_model,
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)
model.to(device)
model.load_state_dict(torch.load(model_file, map_location=device))

# Set the model to evaluation/testing mode
model.eval()
predictions = []

# Iterate the evaluation/testing data loader
for indx, batch in enumerate(dataloader_eval, start=1):
  batch = tuple(b.to(device) for b in batch)
  inputs = {'input_ids': batch[0], 'attention_mask': batch[1]}

  print(f'Processing batch # {indx}')
  with torch.no_grad():
    # Run the model on the test batch and collect the logits
    output = model(**inputs)
    logits = output[0]
    logits = logits.detach().cpu().numpy()
    predictions.append(logits)

predictions = np.concatenate(predictions, axis=0)
preds_flat = np.argmax(predictions, axis=1).flatten()


print('######################')
print('# Evaluation is done #')
print('######################')

No GPU available, using the CPU instead.
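
The evaluation cell turns per-batch logits into class ids with `np.concatenate` and `np.argmax`. The sketch below illustrates that step on made-up logits and, as an optional extra that the notebook itself does not do, converts the logits to probabilities with a softmax. The logit values and the helper names `softmax` and `id_to_label` are assumptions for illustration; the label mapping is the notebook's `label_dict`.

```python
import numpy as np

# Same mapping as the notebook's label_dict, inverted to go from id to label.
label_dict = {'SARCASM': 0, 'NOT_SARCASM': 1}
id_to_label = {idx: name for name, idx in label_dict.items()}

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for two batches (made-up values, shape: batch x 2 classes).
batches = [np.array([[2.0, -1.0], [0.5, 1.5]]), np.array([[-0.3, 0.3]])]

logits = np.concatenate(batches, axis=0)   # shape (3, 2)
probs = softmax(logits)                    # each row sums to 1
pred_ids = np.argmax(logits, axis=1)       # softmax keeps the ordering, so argmax is unchanged
labels = [id_to_label[int(i)] for i in pred_ids]
print(labels)
```

Because softmax is monotonic within each row, taking the argmax of the raw logits (as the notebook does) yields the same predictions as taking it over the probabilities.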



In [None]:
# Write the answer.txt file
#
# The file I generated using this demo, which is the same file
# I submitted to the LiveDataLab Leaderboard for evaluation, is available at:
# https://drive.google.com/file/d/14tLEIr07SK4uq5cx7lrUOj_IzJZuUGYO/view?usp=sharing
with open('answer.txt', 'w') as f:
  for i, pred in enumerate(preds_flat, start=1):
    if pred == 0:
      text = 'SARCASM'
    else:
      text = 'NOT_SARCASM'
    # Write one line per record: twitter_<n>,<label>
    f.write('twitter_{0},{1}\n'.format(i, text))
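
The introduction asks readers to compare their generated answer.txt with the reference copy. One possible way to do that is a line-by-line comparison, sketched below. The helper name `diff_answer_files`, the file name `answer_reference.txt`, and the two tiny sample files are hypothetical and exist only to make the example runnable on its own.

```python
import tempfile
from pathlib import Path

def diff_answer_files(generated_path, reference_path):
    """Return (line_number, generated_line, reference_line) for each mismatch."""
    generated = Path(generated_path).read_text().splitlines()
    reference = Path(reference_path).read_text().splitlines()
    mismatches = []
    for n in range(max(len(generated), len(reference))):
        g = generated[n] if n < len(generated) else None  # None marks a missing line
        r = reference[n] if n < len(reference) else None
        if g != r:
            mismatches.append((n + 1, g, r))
    return mismatches

# Demo on two small made-up files that differ on the second line.
with tempfile.TemporaryDirectory() as tmp:
    gen = Path(tmp) / 'answer.txt'
    ref = Path(tmp) / 'answer_reference.txt'
    gen.write_text('twitter_1,SARCASM\ntwitter_2,NOT_SARCASM\n')
    ref.write_text('twitter_1,SARCASM\ntwitter_2,SARCASM\n')
    mismatches = diff_answer_files(gen, ref)

print('Files match' if not mismatches else mismatches)
```

In a Colab session, the same helper could be pointed at the generated answer.txt and a downloaded copy of the reference file; an empty list means the two files are identical line for line.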

# 5. Summary

The generated answer.txt file achieves the following scores on the LiveDataLab Leaderboard:
- f1 = 0.757905138339921
- recall = 0.8522222222222222
- precision = 0.6823843416370107

The baseline score (f1, recall, and precision) is 0.723.
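
As a quick sanity check on the scores listed above, F1 can be recomputed from precision and recall. This sketch assumes the leaderboard uses the standard definition of F1 as the harmonic mean of precision and recall; the variable names are illustrative.

```python
# Scores reported above for the generated answer.txt
precision = 0.6823843416370107
recall = 0.8522222222222222
reported_f1 = 0.757905138339921

# F1 as the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Absolute difference between the recomputed and the reported value
difference = abs(f1 - reported_f1)
print(f'recomputed f1 = {f1:.6f}, difference = {difference:.2e}')
```

Under that assumption, the recomputed value agrees with the reported F1 to within floating-point rounding.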