In [None]:
!pip install datasets transformers[torch] sentencepiece -Uq
import random
random.seed(42)

# Introduction to Classification




### What is classification?

The task of a classification model is to assign an input to one of a fixed set of categories. In NLP, there are many applications of this type of model, including sentiment analysis: determining whether a sentence expresses a positive or negative sentiment.

### Today's example
Today, we're going to create our own sentiment analyzer using a classification model that will determine whether a text is of positive, neutral, or negative sentiment. Specifically, our dataset is made up of tweets. Each tweet is labeled as 'Positive', 'Neutral' or 'Negative'. This is a very common way of labeling data, but we will see later in the course that how we label our data can have a profound effect on our results.

# Wrangling our data

Unfortunately, we are not able to create a model out of raw text. Instead, we have to put it into a very specific form so that the software libraries we'll be using can read and process it correctly.

This dataset comes from a website called HuggingFace. This site hosts both pretrained models and datasets for training and fine-tuning new models. We will be using [this dataset](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction), which was collected by the Massive Text Embedding Benchmark (MTEB). This group uses datasets like this one to evaluate how well certain models perform compared to others. What we will be doing in this notebook is a key part of their evaluation process for new models.

This data comes to us as a HuggingFace Dataset. This is a specific Python-based data structure that is derived from parquet files hosted on the MTEB HuggingFace account.

In [None]:
import datasets
from datasets import load_dataset

dataset = load_dataset('mteb/tweet_sentiment_extraction')
train = dataset['train'].to_pandas()
test = dataset['test'].to_pandas()

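# We will train a model with a single numeric output (num_labels=1 below),
# so the labels are cast from integers to floats.
train['label'] = train['label'].astype('float32')
test['label'] = test['label'].astype('float32')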

dataset = datasets.DatasetDict({
    'train': datasets.Dataset.from_pandas(train),
    'test': datasets.Dataset.from_pandas(test)
})

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 27481
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 3534
    })
})

In [None]:
# this is just a fancy spreadsheet
dataset['train'].to_pandas()

Unnamed: 0,id,text,label,label_text
0,cb774db0d1,"I`d have responded, if I were going",1.0,neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,0.0,negative
2,088c60f138,my boss is bullying me...,0.0,negative
3,9642c003ef,what interview! leave me alone,0.0,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...",0.0,negative
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,0.0,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,0.0,negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,2.0,positive
27479,ed167662a5,But it was worth it ****.,2.0,positive


See! Just a spreadsheet! We'll talk about how I read it into our notebook in a couple classes.

Above are the tweets in our dataset, each paired with its sentiment label. These labels didn't just appear there magically. Someone (or more likely a group of people) had to classify the tweets by hand. As a result, labeled data like this can be extremely valuable.
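
To see how many tweets carry each label, we can tally the `label_text` column. The sketch below uses only the five rows visible at the top of the table, so the counts are not those of the full dataset. On the real DataFrame, `train['label_text'].value_counts()` should give the same kind of tally.

```python
from collections import Counter

# The label_text values of rows 0-4 in the table above.
label_text = ["neutral", "negative", "negative", "negative", "negative"]

# Count how often each label occurs, most frequent first.
counts = Counter(label_text)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

A heavily lopsided tally on the full data would be worth knowing about before training, since a model can look accurate just by guessing the most common label.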

## Fine-tuning

Although we're training our own sentiment analysis model today, we are not starting from nothing. Today we are going to use a process called 'fine-tuning.' Fine-tuning is when we take a model that someone else has already trained and train it further for a specific task. This is very common for classification tasks like this one.

Today, we'll be using Microsoft's `deberta` model.

In [None]:
model_nm = 'microsoft/deberta-v3-small'

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)


`deberta` will transform the text of the tweets into numerical representations that our model can learn from.

What is important, however, is the step we take before this conversion from text to numbers: **tokenization**.

### Tokenization

In [None]:
def tok_func(x):
  return tokz(x["text"])

Tokenization is the process of breaking a sentence into the pieces that make it up (words, parts of words, and punctuation) and then associating each piece with a number that we can refer to later.
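
To make the idea concrete, here is a deliberately simplified sketch of a tokenizer. The names `toy_tokenize` and `build_vocab` are made up for illustration. The real `deberta` tokenizer works on subword pieces from a pretrained SentencePiece vocabulary (the `spm.model` file downloaded above), so its pieces and numbers will differ from these.

```python
import re

def toy_tokenize(text):
    """Split text into lowercase words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(texts):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in toy_tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

# Two tweets taken from the table above.
texts = ["my boss is bullying me...", "what interview! leave me alone"]
vocab = build_vocab(texts)

tokens = toy_tokenize(texts[1])
ids = [vocab[token] for token in tokens]
print(tokens)  # ['what', 'interview', '!', 'leave', 'me', 'alone']
print(ids)     # [6, 7, 8, 9, 4, 10]
```

Notice that `me` gets the id 4 in both tweets: once a token is in the vocabulary, it always maps to the same number.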

In [None]:
tok_ds = dataset.map(tok_func, batched=True)

For example, this is what a single post looks like now.

In [None]:
tok_ds['train']['input_ids'][0], tok_ds['train']['text'][0]

([1, 273, 5459, 407, 286, 5028, 261, 337, 273, 332, 446, 2],
 ' I`d have responded, if I were going')

We can check that these numbers really correspond to the text by converting them back with the `decode` method.

In [None]:
tokz.decode(tok_ds['train']['input_ids'][0])

'[CLS] I`d have responded, if I were going[SEP]'

### Model training

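We need some way to know how well our model is doing while we train. There are many options, but today I'll use Pearson's r, which measures how closely the model's predicted scores track the true labels on a scale from -1 to 1 (1 means they rise and fall together perfectly). This metric will give us updates on our model while it's training.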

As you may have noticed above, our dataset contains two sets of data: a training set and a test set. The training set is much larger and is what we will fine-tune our model on. The model never trains on the test set; we only use it to check the model's predictions at the end of each pass (epoch) through the training set.

If the model looks like it's doing really well and getting all of the predictions right on the training set, there are two possibilities: either we have a really good model, or we have a model that has only memorized the answers on the training set and would be terrible on anything outside it. The second case is called **overfitting**, and it is a common problem in all machine learning applications.

As a result, we use the test set, a set of data that the model has never seen, to give us a sense of how well the model is doing beyond the training set.
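
Here is a minimal sketch of what overfitting looks like in its most extreme form. The data and the `memorizer` "model" are invented for illustration: it stores every training example and falls back to a fixed guess for anything it has not seen.

```python
# Hypothetical toy data: (text, label) pairs with 0 = negative, 2 = positive.
train_set = [("happy bday!", 2), ("my boss is bullying me...", 0), ("I like it!!", 2)]
test_set = [("happy mothers day", 2), ("leave me alone", 0)]

# A "model" that has only memorized the training set.
memory = dict(train_set)

def memorizer(text):
    # Return the stored label if the text was seen, otherwise always guess 2.
    return memory.get(text, 2)

def accuracy(model, data):
    """Fraction of examples where the model's label matches the true label."""
    return sum(model(text) == label for text, label in data) / len(data)

print(accuracy(memorizer, train_set))  # 1.0
print(accuracy(memorizer, test_set))   # 0.5
```

A perfect score on the training set tells us nothing here. Only the test set reveals that the model has learned no general rule at all.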

In [None]:
import numpy as np
def corr(x,y):
  return np.corrcoef(x,y)[0][1]

def corr_d(eval_pred):
  return {'pearson': corr(*eval_pred)}
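
To see what this metric computes, here is a sketch of Pearson's r written out by hand in plain Python. The function name `pearson_r` and the example numbers are made up for illustration; for two plain lists like these, it should agree with `np.corrcoef(x, y)[0][1]`.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical labels and two sets of hypothetical predictions.
labels = [0.0, 1.0, 2.0, 2.0, 0.0]
good_preds = [0.1, 1.2, 1.8, 2.1, -0.2]  # close to the labels
bad_preds = [2.0, 1.0, 0.0, 0.0, 2.0]    # exactly backwards

print(round(pearson_r(labels, good_preds), 3))  # close to 1
print(round(pearson_r(labels, bad_preds), 3))   # -1.0
```

Note that Pearson's r rewards predictions that move in step with the labels, even if they are shifted or scaled, so it is a measure of agreement in ordering and spacing rather than exact correctness.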

Now, let's get training!

In [None]:
from transformers import TrainingArguments,Trainer

# these three hyperparameters can be changed and tweaked to return a new model.
bs = 64
epochs = 6
lr = 1e-5

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
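
The `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` arguments mean the learning rate is not constant: it ramps up during the first 10% of training steps and then decays along a cosine curve. The sketch below shows the usual shape of such a schedule. The function `lr_at` is hypothetical, and the library's own implementation may differ in details. The total of 2580 steps is the `global_step` reported by the training run further down.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: grow linearly from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: follow half a cosine wave from the peak down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 2580
for step in (0, 129, 258, 1419, 2580):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

Starting small helps avoid large, damaging updates to the pretrained weights early on, and ending small lets the model settle.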



In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=tok_ds['train'], eval_dataset=tok_ds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.196478,0.829679
2,0.467600,0.207629,0.834406
3,0.181200,0.191214,0.843036
4,0.157400,0.198318,0.840735
5,0.144300,0.185594,0.841218
6,0.136100,0.187439,0.841398


TrainOutput(global_step=2580, training_loss=0.21463621309561323, metrics={'train_runtime': 551.6027, 'train_samples_per_second': 298.922, 'train_steps_per_second': 4.677, 'total_flos': 1868988087492990.0, 'train_loss': 0.21463621309561323, 'epoch': 6.0})
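
The numbers in this output can be reproduced from our settings, which is a useful sanity check. The sketch below assumes training ran on a single device with no gradient accumulation and that the last, smaller batch of each epoch is kept.

```python
import math

train_rows = 27481        # num_rows of the training split
batch_size = 64           # bs
epochs = 6
train_runtime = 551.6027  # seconds, from the TrainOutput above

# Each epoch needs enough batches to cover every row once.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 430 2580

# Throughput figures, to compare with the TrainOutput above.
steps_per_second = total_steps / train_runtime
samples_per_second = train_rows * epochs / train_runtime
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

The total matches `global_step=2580`, and the two throughput figures should come out very close to `train_steps_per_second` and `train_samples_per_second`.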

## Evaluation

Once training is finished, we can start predicting. Here are the model's predictions for the evaluation set, which is our test set.

In [None]:
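# Index [1] of predict()'s result is label_ids (the true labels); .predictions holds the model's scores.
# Round each score to the nearest label: 0 = negative, 1 = neutral, 2 = positive.
preds = trainer.predict(trainer.eval_dataset).predictions.squeeze().round().clip(0, 2)
preds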

array([1., 2., 0., ..., 0., 2., 2.], dtype=float32)

In [None]:
import pandas as pd
eval_df = pd.DataFrame({'text':trainer.eval_dataset['text'], 'pred':preds})
eval_df

Unnamed: 0,text,pred
0,Last session of the day http://twitpic.com/67ezh,1.0
1,Shanghai is also really exciting (precisely -...,2.0
2,"Recession hit Veronique Branquinho, she has to...",0.0
3,happy bday!,2.0
4,http://twitpic.com/4w75p - I like it!!,2.0
...,...,...
3529,"its at 3 am, im very tired but i can`t sleep ...",0.0
3530,All alone in this old house again. Thanks for...,2.0
3531,I know what you mean. My little dog is sinkin...,0.0
3532,_sutra what is your next youtube video gonna b...,2.0


In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=4)
# print five randomly chosen tweets alongside their predictions
for _, row in eval_df.sample(5).iterrows():
    pp.pprint([f"TEXT: {row['text']}", f"PREDICTION: {row['pred']}"])

[   'TEXT:  I need botox work on the lips if I`m going to change my name to '
    'Angelina Jolie, but it`s a thought! Sad though about the racism',
    'PREDICTION: 0.0']
[   'TEXT:  jersey weather - and good on you for the charity drive!',
    'PREDICTION: 2.0']
[   'TEXT:  someone`s a sweet tooth  i was dying for somethin sweet so i`ve '
    'attacked the chock coated tiny teddies all i could find lol',
    'PREDICTION: 2.0']
[   'TEXT:  HAPPY MOTHERS DAY!  Tell ur mom that`s she an awesome madre & such '
    'a great example to the Archuleta familia!!',
    'PREDICTION: 2.0']
[   'TEXT:  Yay I love it when you host Money For Breakfast Jenna Lee  4hrs of '
    'you the amazing and so pretty and sexy Jenna Lee yay.',
    'PREDICTION: 2.0']


A random sample of our evaluation data is looking good!
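
Eyeballing a sample is a good start, but we can also put a number on it. Because our model outputs a single continuous score, one simple option is to round each score to the nearest label and count how often that matches the true label. The sketch below uses made-up scores and labels. The helper names are hypothetical, and the id-to-name mapping is read off the `label` and `label_text` columns shown earlier.

```python
# Mapping read off the label / label_text columns of the dataset.
ID_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def score_to_label(score):
    """Round a continuous score to the nearest valid label id (0, 1 or 2)."""
    return min(2, max(0, round(score)))

def accuracy(scores, true_ids):
    """Fraction of scores that round to the true label id."""
    hits = sum(score_to_label(s) == t for s, t in zip(scores, true_ids))
    return hits / len(true_ids)

# Hypothetical model scores and true labels, for illustration only.
scores = [1.8, 0.2, 1.1, 2.4, 0.7]
true_ids = [2, 0, 1, 2, 0]

print([ID_TO_LABEL[score_to_label(s)] for s in scores])
print(accuracy(scores, true_ids))  # 0.8
```

In this toy example the last score, 0.7, rounds to neutral while the true label is negative, which is the kind of near miss a single accuracy number hides and a random sample can reveal.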

## Conclusion

This exercise is the most basic example of using a neural network to read sentiment from a document. There are a lot of optimizations and augmentations that can be implemented to cut down our loss even more.

I recommend the following links if you are interested in learning more:
* [Practical Deep Learning for Coders](https://course.fast.ai/)
* [The NLP HuggingFace Course](https://huggingface.co/course/chapter1/1)
* [Andrej Karpathy's Zero to Hero Series](https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)