<a href="https://colab.research.google.com/github/hishamp3/MasterThesis-Lies-DeceptiveText/blob/main/QnA_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install Transformers
!pip install transformers==4.20.0
!pip install torch torchvision
!pip install pandas
!pip install numpy



In [None]:
!pip install sentencepiece



In [None]:
# loading data
import pandas as pd
train_data_df = pd.read_json(path_or_buf="./sample_data/train.jsonl", lines=True,orient='records')
dev_data_df = pd.read_json(path_or_buf="./sample_data/dev.jsonl", lines=True,orient='records')

In [None]:
print(train_data_df.head(5))

                                            question  \
0    do iran and afghanistan speak the same language   
1  do good samaritan laws protect those who help ...   
2  is windows movie maker part of windows essentials   
3  is confectionary sugar the same as powdered sugar   
4         is elder scrolls online the same as skyrim   

                      title  answer  \
0          Persian language    True   
1        Good Samaritan law    True   
2       Windows Movie Maker    True   
3            Powdered sugar    True   
4  The Elder Scrolls Online   False   

                                             passage  
0  Persian (/ˈpɜːrʒən, -ʃən/), also known by its ...  
1  Good Samaritan laws offer legal protection to ...  
2  Windows Movie Maker (formerly known as Windows...  
3  Powdered sugar, also called confectioners' sug...  
4  As with other games in The Elder Scrolls serie...  


In [None]:
import random
import torch
import numpy as np
import pandas as pd
from tqdm import tqdm
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AutoTokenizer, AutoModel, AdamW, AutoModelForSequenceClassification

In [None]:
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu


In [None]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [None]:
# use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#setting seeds
random.seed(26)
np.random.seed(26)
torch.manual_seed(26)
model_name = 'roberta-base'
#model_name = 'bert-base-uncased'
#model_name = 'bert-base-cased'
#model_name = 'bert-large-uncased'
#model_name = 'roberta-large'

# xlm-roberta-base (disabled alternative)
# from transformers import AutoTokenizer, XLMRobertaForSequenceClassification
# tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-emotion")
# model = XLMRobertaForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-emotion")

# xlm-roberta-large (disabled alternative)
# the xlm-roberta-large checkpoint uses the XLMRoberta classes; the XLMRobertaXL classes are for the xlm-roberta-xl checkpoints
# from transformers import AutoTokenizer, XLMRobertaForSequenceClassification
# tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
# model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-large", num_labels=2)

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading:   0%|          | 0.00/478M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifi

In [None]:
model.to(device)
learning_rate = 1e-5
optimizer = AdamW(model.parameters(),lr=learning_rate,eps=1e-8)



In [None]:
def encode_data(tokenizer, questions, passages, max_length):
    """Encode the question/passage pairs into features that can be fed to the model."""
    input_ids = []
    attention_masks = []

    for question, passage in zip(questions, passages):
        # pad every pair to max_length; when a pair is too long, trim the longer segment first
        encoded_data = tokenizer.encode_plus(question, passage, max_length=max_length, padding="max_length", truncation="longest_first")
        encoded_pair = encoded_data["input_ids"]
        attention_mask = encoded_data["attention_mask"]

        input_ids.append(encoded_pair)
        attention_masks.append(attention_mask)

    return np.array(input_ids), np.array(attention_masks)
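
The padding and truncation that `encode_data` relies on can be illustrated without the tokenizer. The following is a simplified sketch of the idea, not the library's code: `truncate_longest_first` and `build_features` are hypothetical helpers, and the special-token ids (`<s>`=0, `</s>`=2, `<pad>`=1) and the four special tokens per pair are assumptions that match the RoBERTa layout `<s> question </s></s> passage </s>`.

```python
def truncate_longest_first(question_ids, passage_ids, max_length, num_special=4):
    """Drop tokens one at a time from the longer segment until the pair fits.

    num_special is the number of special tokens added around a pair
    (assumed to be 4, as in the RoBERTa pair layout).
    """
    if max_length < num_special:
        raise ValueError("max_length leaves no room for the special tokens")
    question_ids, passage_ids = list(question_ids), list(passage_ids)
    budget = max_length - num_special
    while len(question_ids) + len(passage_ids) > budget:
        if len(question_ids) > len(passage_ids):
            question_ids.pop()   # question is longer: trim its last token
        else:
            passage_ids.pop()    # otherwise trim the passage
    return question_ids, passage_ids


def build_features(question_ids, passage_ids, max_length, bos=0, eos=2, pad=1):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    q, p = truncate_longest_first(question_ids, passage_ids, max_length)
    ids = [bos] + q + [eos, eos] + p + [eos]
    mask = [1] * len(ids)                 # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad] * n_pad, mask + [0] * n_pad  # 0 marks padding


# a short pair is padded, a long pair is truncated
short_ids, short_mask = build_features([10, 11], [20, 21, 22], max_length=16)
long_ids, long_mask = build_features(list(range(100, 105)), list(range(200, 220)), max_length=16)
print(len(short_ids), sum(short_mask))
print(len(long_ids), sum(long_mask))
```

Because passages are usually much longer than questions, this strategy trims the passage and leaves the question intact in most cases.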

In [None]:
passages_train = train_data_df.passage.values
questions_train = train_data_df.question.values
answers_train = train_data_df.answer.values.astype(int)

passages_dev = dev_data_df.passage.values
questions_dev = dev_data_df.question.values
answers_dev = dev_data_df.answer.values.astype(int)

In [None]:
max_seq_length = 256
#max_seq_length = 512
input_ids_train,attention_masks_train = encode_data(tokenizer, questions_train, passages_train, max_seq_length)
input_ids_dev,attention_masks_dev = encode_data(tokenizer, questions_dev, passages_dev, max_seq_length)

train_features = (input_ids_train,attention_masks_train,answers_train)
dev_features = (input_ids_dev,attention_masks_dev,answers_dev)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [None]:
# building dataloaders
batch_size = 16
train_features_tensors = [torch.tensor(feature, dtype=torch.long) for feature in train_features]
dev_features_tensors = [torch.tensor(feature, dtype=torch.long) for feature in dev_features]

train_dataset = TensorDataset(*train_features_tensors)
dev_dataset = TensorDataset(*dev_features_tensors)

train_sampler = RandomSampler(train_dataset)
dev_sampler = SequentialSampler(dev_dataset)

train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)
dev_dataloader = DataLoader(dev_dataset, sampler=dev_sampler, batch_size=batch_size)

In [None]:
torch.cuda.empty_cache()
def cache_clear():
    gc.collect()
    torch.cuda.empty_cache()

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

In [None]:
epochs = 3
grad_acc_steps = 4
train_loss_values = []
dev_acc_values = []

for _ in tqdm(range(epochs), desc="Epoch"):

    #training
    epoch_train_loss = 0
    model.train()
    model.zero_grad()

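    for step, batch in enumerate(train_dataloader):
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        labels = batch[2].to(device)

        # cache_clear()
        outputs = model(input_ids, token_type_ids=None, attention_mask=attention_masks, labels=labels)
        #outputs = model(input_ids, attention_mask=attention_masks)

        # scale the loss so that gradients summed over grad_acc_steps batches average out
        loss = outputs[0]
        loss = loss / grad_acc_steps
        # note: the scaled loss is logged, so the reported value is 1/grad_acc_steps of the mean batch loss
        epoch_train_loss += loss.item()

        loss.backward()

        # update every grad_acc_steps batches, and once more at the end of the epoch
        # so that gradients from a trailing partial group are not discarded
        if (step + 1) % grad_acc_steps == 0 or (step + 1) == len(train_dataloader):
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            model.zero_grad()

    epoch_train_loss = epoch_train_loss / len(train_dataloader)
    train_loss_values.append(epoch_train_loss)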

    #Evaluation
    epoch_dev_accuracy = 0
    model.eval()

    y_pred = []
    y_true = []
    for batch in dev_dataloader:

        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        labels = batch[2]

        with torch.no_grad():
            outputs = model(input_ids, token_type_ids=None, attention_mask = attention_masks)
            # outputs = model(inputs_ids, attention_mask=attention_masks, labels=labels)

        logits = outputs[0]
        logits = logits.detach().cpu().numpy()

        predictions = np.argmax(logits,axis=1).flatten()
        labels = labels.numpy().flatten()

        y_pred.extend(predictions)
        y_true.extend(labels)
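        # per-batch accuracy; averaged over batches below, so a smaller final batch is slightly over-weighted
        epoch_dev_accuracy += np.sum(predictions==labels)/len(labels)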


    cf_matrix = confusion_matrix(y_true, y_pred)
    score_f1 = f1_score(y_true, y_pred)
    epoch_dev_accuracy = epoch_dev_accuracy / len(dev_dataloader)
    dev_acc_values.append(epoch_dev_accuracy)

Epoch: 100%|██████████| 3/3 [22:03<00:00, 441.03s/it]
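
The training loop divides each batch loss by `grad_acc_steps` before calling `backward()`. A minimal sketch of why, using a one-parameter least-squares model in plain Python rather than the notebook's network: with equally sized micro-batches, summing the scaled micro-batch gradients gives the gradient of the mean loss over the combined batch. The data, the weight `w`, and the helper `grad_mean_squared_error` are made up for illustration.

```python
def grad_mean_squared_error(w, xs, ys):
    """Derivative with respect to w of mean((w*x - y)**2) over the examples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# hypothetical toy data: 8 examples, split into 4 micro-batches of 2
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 1.0, -2.0]
ys = [1.0, 0.0, 3.0, 2.0, -1.0, 5.0, 1.5, -3.0]
w = 0.3
grad_acc_steps = 4
micro_batch_size = len(xs) // grad_acc_steps

# gradient of the mean loss over all 8 examples at once
full_grad = grad_mean_squared_error(w, xs, ys)

# gradient accumulated over micro-batches, each scaled by 1/grad_acc_steps
accumulated_grad = 0.0
for step in range(grad_acc_steps):
    start, stop = step * micro_batch_size, (step + 1) * micro_batch_size
    micro_grad = grad_mean_squared_error(w, xs[start:stop], ys[start:stop])
    accumulated_grad += micro_grad / grad_acc_steps

print(full_grad, accumulated_grad)
```

With `batch_size = 16` and `grad_acc_steps = 4`, each optimizer step in the loop therefore behaves like a step on a batch of 64 examples, up to the effect of gradient clipping.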


In [None]:
print(cf_matrix)

[[ 752  485]
 [ 265 1768]]


In [None]:
from sklearn.metrics import classification_report
target_names = ['No', 'Yes']
print(classification_report(y_true, y_pred,target_names=target_names))

              precision    recall  f1-score   support

          No       0.74      0.61      0.67      1237
         Yes       0.78      0.87      0.83      2033

    accuracy                           0.77      3270
   macro avg       0.76      0.74      0.75      3270
weighted avg       0.77      0.77      0.77      3270
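
The figures in the report can be reproduced by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, in the order No, Yes); `per_class_metrics` is a helper written for this illustration.

```python
# confusion matrix from the run above: rows = true No/Yes, columns = predicted No/Yes
cf = [[752, 485],
      [265, 1768]]


def per_class_metrics(cf, k):
    """Precision, recall and F1 for class index k."""
    true_positive = cf[k][k]
    predicted_as_k = sum(row[k] for row in cf)   # column sum
    actually_k = sum(cf[k])                      # row sum (the support)
    precision = true_positive / predicted_as_k
    recall = true_positive / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision_no, recall_no, f1_no = per_class_metrics(cf, 0)
precision_yes, recall_yes, f1_yes = per_class_metrics(cf, 1)
accuracy = (cf[0][0] + cf[1][1]) / sum(sum(row) for row in cf)

print("No :", round(precision_no, 2), round(recall_no, 2), round(f1_no, 2))
print("Yes:", round(precision_yes, 2), round(recall_yes, 2), round(f1_yes, 2))
print("accuracy:", round(accuracy, 2))
```

The low recall for the No class (485 of 1237 No examples predicted as Yes) shows that most of the model's errors are false Yes answers.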



In [None]:
print("Training Losses :",train_loss_values)
print("Accuracy :",dev_acc_values)

Training Losses : [0.16526830287302954, 0.14491456675074868, 0.11725915095184819]
Accuracy : [0.6275406504065041, 0.7116869918699188, 0.7708333333333334]
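
Two details of the loop affect how these numbers should be read. The logged training loss is the batch loss already divided by `grad_acc_steps`, and the dev accuracy is a mean of per-batch accuracies rather than a single ratio over all examples. The sketch below uses a small made-up list of correct/incorrect flags to show how the two accuracy definitions can differ when the last batch is smaller, and rescales the logged losses; the helper names are hypothetical.

```python
def mean_of_batch_accuracies(correct_flags, batch_size):
    """Average the accuracy of each batch, as the evaluation loop does."""
    batches = [correct_flags[i:i + batch_size]
               for i in range(0, len(correct_flags), batch_size)]
    return sum(sum(batch) / len(batch) for batch in batches) / len(batches)


def global_accuracy(correct_flags):
    """Fraction of correct predictions over all examples."""
    return sum(correct_flags) / len(correct_flags)


# hypothetical flags: 20 examples, so batch_size=16 gives batches of 16 and 4
flags = [1] * 12 + [0] * 4 + [1] * 4
batch_mean = mean_of_batch_accuracies(flags, batch_size=16)
overall = global_accuracy(flags)
print(batch_mean, overall)

# the logged losses were divided by grad_acc_steps = 4 before being summed
reported_losses = [0.16526830287302954, 0.14491456675074868, 0.11725915095184819]
unscaled_losses = [loss * 4 for loss in reported_losses]
print(unscaled_losses)
```

In the run above the dev set has 3270 examples, so with a batch size of 16 the final batch holds only 6. That is why the logged accuracy (0.7708) differs slightly from the ratio implied by the confusion matrix (2520/3270, about 0.7706).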


In [None]:
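def predict(question, passage):
  """Print the model's Yes/No probabilities for a question about a passage."""
  # truncation=True caps the input at the model's maximum length so long passages do not raise an error
  sequence = tokenizer.encode_plus(question, passage, truncation=True, return_tensors="pt")['input_ids'].to(device)

  # inference only: disable dropout and gradient tracking
  model.eval()
  with torch.no_grad():
    logits = model(sequence)[0]
  probabilities = torch.softmax(logits, dim=1).detach().cpu().tolist()[0]
  proba_yes = round(probabilities[1], 2)
  proba_no = round(probabilities[0], 2)

  print(f"Question: {question}, Yes: {proba_yes}, No: {proba_no}")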

passage_planets = '''A planet is a large, rounded astronomical body that is neither a star nor its remnant. The best available theory of planet formation is the nebular hypothesis, which posits that an interstellar cloud collapses out of a nebula to create a young protostar orbited by a protoplanetary disk. Planets grow in this disk by the gradual accumulation of material driven by gravity, a process called accretion. The Solar System has at least eight planets: the terrestrial planets Mercury, Venus, Earth and Mars, and the giant planets Jupiter, Saturn, Uranus and Neptune. These planets each rotate around an axis tilted with respect to its orbital pole. All planets of the Solar System other than Mercury possess a considerable atmosphere, and some share such features as ice caps, seasons, volcanism, hurricanes, tectonics, and even hydrology. Apart from Venus and Mars, the Solar System planets generate magnetic fields, and all except Venus and Mercury have natural satellites. The giant planets bear planetary rings, the most prominent being those of Saturn.'''
planets_questions = [
    "Mercury is a planet",
    "Planet is a star",
    "Mercury has a no atmosphere",
    "There are eight planets in solar system",
    "There are hundred planets in solar system",
    "Venus is closer to sun compared to mercury"

]

In [None]:
for s_question in planets_questions:
  predict(s_question, passage_planets)

Question: Mercury is a planet, Yes: 0.82, No: 0.18
Question: Planet is a star, Yes: 0.07, No: 0.93
Question: Mercury has a no atmosphere, Yes: 0.33, No: 0.67
Question: There are eight planets in solar system, Yes: 0.92, No: 0.08
Question: There are hundred planets in solar system, Yes: 0.74, No: 0.26
Question: Venus is closer to sun compared to mercury, Yes: 0.35, No: 0.65
