<a href="https://colab.research.google.com/github/jensman100/Fast.ai-Practical-Deep-Learning-for-Coders/blob/main/Lesson_4_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

This notebook fine-tunes a pretrained model on the JyotiNayak/political_ideologies dataset to work out whether a statement is liberal or conservative. It explores tokenisation and the creation of the model.

Importing data...

In [1]:
# datasets is a Hugging Face library for downloading any dataset hosted on the Hugging Face Hub
!pip install datasets -qq # -qq means less is printed

In [1]:
from datasets import load_dataset
import pandas as pd

In [2]:
dataset = load_dataset('JyotiNayak/political_ideologies')

In [3]:
# Displaying the dataset
df = dataset['train'].to_pandas()
df.head()

# Need to ensure that the column which holds the labels is called label

Unnamed: 0,statement,label,issue_type,__index_level_0__
0,"Climate change, and the escalating environment...",1,1,465
1,I believe in the foundational importance of th...,0,2,1191
2,I firmly believe that the principle of separat...,1,6,2440
3,I firmly believe in the separation of church a...,1,6,2406
4,I firmly believe in the power of free markets ...,0,0,1903


This dataset contains statements, each labelled 0 (conservative) or 1 (liberal).

Preparing text...  
The model cannot take raw strings of text as input. Instead, the text needs to be 'tokenised' and 'numericalised'.

*   Tokenised - The text is split into tokens (words, or parts of words)
*   Numericalised - Each token is replaced by a number, which is its index in a large dictionary of tokens (the vocabulary)
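The two steps above can be sketched with a toy example. This is only an illustration: the whitespace split and the four-entry `toy_vocab` are assumptions made up for the example, not how a pretrained tokenizer works internally.

```python
# Toy illustration of tokenising and numericalising.
# `toy_vocab` is a made-up vocabulary, not the one used by any real model.
toy_vocab = {"<unk>": 0, "free": 1, "markets": 2, "work": 3}

def toy_tokenise(text):
    # Lowercase and split on whitespace; real tokenizers use subword rules.
    return text.lower().split()

def toy_numericalise(tokens):
    # Map each token to its index; unknown tokens fall back to <unk>.
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in tokens]

tokens = toy_tokenise("Free markets work well")
ids = toy_numericalise(tokens)
print(tokens)  # ['free', 'markets', 'work', 'well']
print(ids)     # [1, 2, 3, 0]
```

The word "well" is not in the toy vocabulary, so it maps to the `<unk>` index. Real tokenizers avoid this by splitting unknown words into smaller known pieces.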

In [4]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer

In [5]:
# A pretrained model must be chosen first, as different models use different tokenisation
model_nm = 'microsoft/deberta-v3-small'
tokz = AutoTokenizer.from_pretrained(model_nm)



In [7]:
# An example
string = 'Hello, my name is Joe. I live in Llanfairpwllgwyngyll'
tokz.tokenize(string)

['▁Hello',
 ',',
 '▁my',
 '▁name',
 '▁is',
 '▁Joe',
 '.',
 '▁I',
 '▁live',
 '▁in',
 '▁Llan',
 'fair',
 'pw',
 'll',
 'g',
 'wyn',
 'gy',
 'll']

Each new word starts with a ▁ character.  
Punctuation is a separate token.  
When a word is rare or unusual, it may not be stored in the dictionary as a whole, so it is broken down into smaller tokens which are in the dictionary.
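The idea of falling back to smaller pieces can be sketched with a greedy longest-match splitter. This is a simplified stand-in, not the algorithm the pretrained tokenizer actually uses, and `toy_pieces` is a hypothetical vocabulary chosen to match the example output above.

```python
# Toy subword splitter: greedy longest-match against a made-up set of pieces.
# This only illustrates the idea; the real tokenizer's algorithm is more involved.
toy_pieces = {"Llan", "fair", "pw", "ll", "g", "wyn", "gy"}

def greedy_split(word, pieces):
    out, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in pieces:
                out.append(word[start:end])
                start = end
                break
        else:
            # No known piece matches: emit the single character as-is.
            out.append(word[start])
            start += 1
    return out

pieces_found = greedy_split("Llanfairpwllgwyngyll", toy_pieces)
print(pieces_found)
# ['Llan', 'fair', 'pw', 'll', 'g', 'wyn', 'gy', 'll']
```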

In [8]:
# Tokenising and numericalising the dataset
sample = dataset['train']['statement'][0][:100] # First 100 characters in the text
print('Original text:')
print(sample)
print('')

tokenised_sample = tokz.tokenize(sample)
print('Tokenised text:')
print(tokenised_sample)
print('')

indexed_sample = tokz(sample)
print('Numericalised Tokens:')
print(indexed_sample['input_ids'])
print('')

# Checking if the tokens are correct
print('Reversing index')
print(tokz.vocab['▁Climate'])
print(tokz.convert_ids_to_tokens([8868]))

Original text:
Climate change, and the escalating environmental degradation we witness daily, is an urgent issue th

Tokenised text:
['▁Climate', '▁change', ',', '▁and', '▁the', '▁escalating', '▁environmental', '▁degradation', '▁we', '▁witness', '▁daily', ',', '▁is', '▁an', '▁urgent', '▁issue', '▁th']

Numericalised Tokens:
[1, 8868, 575, 261, 263, 262, 24990, 2543, 15316, 301, 5276, 1323, 261, 269, 299, 9178, 889, 6554, 2]

Reversing index
8868
['▁Climate']
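The numericalised list has 19 ids while the tokenised text has 17 tokens. The two extra ids, 1 at the start and 2 at the end, appear to be the special start and end tokens that the tokenizer adds around every input (the training log in this notebook reports `bos_token_id: 1` and `eos_token_id: 2`). A quick check using the values printed above, with that interpretation marked as an assumption:

```python
# Values copied from the output above.
tokens = ['▁Climate', '▁change', ',', '▁and', '▁the', '▁escalating', '▁environmental',
          '▁degradation', '▁we', '▁witness', '▁daily', ',', '▁is', '▁an', '▁urgent',
          '▁issue', '▁th']
ids = [1, 8868, 575, 261, 263, 262, 24990, 2543, 15316, 301, 5276, 1323, 261, 269,
       299, 9178, 889, 6554, 2]

# Assumption: ids 1 and 2 are the start/end special tokens added by the tokenizer.
inner_ids = ids[1:-1]
print(len(tokens), len(ids), len(inner_ids))  # 17 19 17

# Repeated tokens map to the same id: both commas are 261.
comma_positions = [i for i, t in enumerate(tokens) if t == ',']
print([inner_ids[i] for i in comma_positions])  # [261, 261]
```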


In [6]:
# Use a function to tokenise the whole dataset quickly. batched=True passes many rows to the function at once, so the tokenizer can process them together
def tok_func(x): return tokz(x["statement"])

tok_ds = dataset.map(tok_func, batched=True)
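With `batched=True`, the mapped function receives a dictionary of columns, each holding a list of values, rather than a single row. A minimal sketch of that calling convention follows. The `toy_map_batched` helper and the stand-in `toy_tok_func` are hypothetical and are not part of the `datasets` library.

```python
# Hypothetical helper that mimics the calling convention of a batched map.
# It is NOT the datasets implementation, only an illustration.
def toy_map_batched(rows, func, batch_size=2):
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Convert a list of rows into a dict of columns, as a batched map does.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = func(batch)
        # Attach the new column values back onto each row.
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_cols.items():
                merged[key] = values[i]
            out.append(merged)
    return out

def toy_tok_func(batch):
    # Stand-in for tokz(batch["statement"]): word lengths play the role of ids.
    return {"input_ids": [[len(w) for w in s.split()] for s in batch["statement"]]}

rows = [{"statement": "free markets", "label": 0},
        {"statement": "climate change matters", "label": 1},
        {"statement": "equal rights", "label": 1}]
result = toy_map_batched(rows, toy_tok_func)
print(result[1]["input_ids"])  # [7, 6, 7]
```

The original columns are kept and the new `input_ids` column is added alongside them, which is why `tok_ds` still contains `statement` and `label`.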

Training the model...

In [7]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

In [8]:
bs = 32 # batch size: how many statements are fed in at once. Larger batches are usually faster but need more GPU memory
evt_epochs = 5 # number of passes over the training set
lr = 8e-5 # learning rate; recommended to start small and increase until training becomes unstable
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=evt_epochs, weight_decay=0.01, report_to='none')

import numpy as np
from sklearn.metrics import accuracy_score

# Define the metrics function
def compute_metrics(pred):
    logits, labels = pred.predictions, pred.label_ids
    # For classification, take argmax to get predicted class
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc}

# Loading pretrained model for binary classification
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2) # num_labels configures output layer

# Gets ready for training
trainer = Trainer(model, args, train_dataset=tok_ds['train'], eval_dataset=tok_ds['test'],
                  tokenizer=tokz, compute_metrics=compute_metrics)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
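The `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` settings mean the learning rate ramps up linearly over the first 10% of steps and then decays along a cosine curve. A rough sketch of that shape, assuming 400 total steps (the `global_step` reported by the training run in this notebook) and the peak rate of 8e-5. The `lr_at` function is illustrative, and the exact schedule used by the library may differ in small details.

```python
import math

def lr_at(step, total_steps=400, warmup_ratio=0.1, peak_lr=8e-5):
    # Linear warmup followed by cosine decay to zero (illustrative formula).
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, mid-warmup, peak, mid-decay, and end of training.
for step in (0, 20, 40, 220, 400):
    print(step, f"{lr_at(step):.2e}")
```

The rate is zero at step 0, reaches its peak at step 40, is back to half the peak at step 220, and returns to zero at step 400.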


In [71]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1}.




TrainOutput(global_step=400, training_loss=0.10488640785217285, metrics={'train_runtime': 114.2095, 'train_samples_per_second': 112.075, 'train_steps_per_second': 3.502, 'total_flos': 328961385156096.0, 'train_loss': 0.10488640785217285, 'epoch': 5.0})

Analysing the model...

In [9]:
from scipy.special import softmax
logits = trainer.predict(tok_ds['validation']).predictions # raw model outputs (logits) for the validation set
probs = softmax(logits, axis=-1)  # converts each row of logits into probabilities between 0 and 1 that sum to 1
pred_class = probs.argmax(axis=-1) # index of the highest probability in each row: 0 (conservative) or 1 (liberal)
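For a single row of logits, softmax and argmax can be written out by hand. The sketch below uses made-up logits and a hypothetical `softmax_row` helper. It is meant to show the arithmetic, not to replace `scipy.special.softmax`.

```python
import math

def softmax_row(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, -1.0]  # made-up logits for (conservative, liberal)
p = softmax_row(example_logits)
predicted = p.index(max(p))   # argmax: index of the largest probability
print(p, predicted)
```

Because softmax preserves the ordering of the logits, taking the argmax of the probabilities gives the same class as taking the argmax of the logits directly.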

In [73]:
# True labels
true_labels = np.array(tok_ds['validation']['label'])

# Count correct and incorrect
num_correct = np.sum(pred_class == true_labels)
num_incorrect = np.sum(pred_class != true_labels)

print(f"Correct predictions: {num_correct}")
print(f"Incorrect predictions: {num_incorrect}")

Correct predictions: 309
Incorrect predictions: 11
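These counts can be turned into an accuracy figure. A small sketch using the numbers printed above:

```python
num_correct, num_incorrect = 309, 11   # counts printed above
total = num_correct + num_incorrect
accuracy = num_correct / total
print(f"Accuracy: {accuracy:.1%} on {total} validation statements")
# Accuracy: 96.6% on 320 validation statements
```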


In [35]:
for i in range(5):
    print(f"Statement: {tok_ds['validation'][i]['statement']}")
    print(f"True label: {tok_ds['validation'][i]['label']}")
    print(f"Predicted class: {pred_class[i]}, Probabilities: {probs[i]}")
    print("---")

Statement: I firmly believe that all individuals, regardless of their race or ethnicity, should be treated with equal respect and dignity. Our focus should be on promoting unity, common values, and shared goals rather than emphasizing divisions. It's important to uphold meritocracy and create opportunities for all, ensuring we don't let race be the defining aspect of someone's potential or capabilities.
True label: 0
Predicted class: 0, Probabilities: [0.9970536 0.0029464]
---
Statement: I believe that we should work towards more diplomatic and peaceful solutions to the ongoing conflicts globally. It's equally important to address the root causes of these conflicts, such as socio-economic inequalities, climate change, and lack of access to education. We should also encourage stronger international cooperation and uphold human rights in all our foreign policies.
True label: 1
Predicted class: 1, Probabilities: [0.00447708 0.995523  ]
---
Statement: I firmly believe in the importance of 

In [74]:
import torch
phrase = "I think people should have the freedom to make their own choices while following the law."

device = next(model.parameters()).device  # get the device the model is on (CPU or GPU)

# Tokenise the phrase and return PyTorch tensors
inputs = tokz(phrase, truncation=True, padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}  # move inputs to the same device as the model

model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits.cpu().numpy()  # move logits back to CPU

probs = softmax(logits, axis=-1)
pred_class = np.argmax(probs, axis=-1)

choices = ['Conservative', 'Liberal']

print(f'This statement is {choices[pred_class[0]]} with probability {probs[0][pred_class[0]]*100:.1f}%')


This statement is Conservative with probability 97.9%
