# BERT on Sentence Classification
- https://colab.research.google.com/drive/1X7NEu0oPPv97U1LRS-FORAdY7CKLv4dd?authuser=1

# Huggingface
- Allows you to:
  - Apply Transformers!
  - Download NLP models and NLP datasets
- link: https://huggingface.co/

In [1]:
# You need these libraries
!pip install datasets
!pip install transformers[torch] 



#### Display function

In [2]:
# !pip install datasets
# Use this to display your dataset!
from datasets import ClassLabel, Sequence
from IPython.display import display, HTML
import pandas as pd
from random import randint
def show_elements(dataset, randomize = True, num_samples = 10):
    if isinstance(dataset,pd.DataFrame):                  
        if randomize:                                          
            dataset = dataset.sample(frac=1)
        display(HTML(dataset.iloc[:num_samples].to_html()))            
    else:                                                    
        if randomize:                                          
            dataset = dataset.shuffle(seed = randint(1,100))   
        dataset = pd.DataFrame(dataset.select(range(num_samples)))   
        display(HTML(dataset.to_html()))

# load_dataset
- Loads any NLP dataset from the HuggingFace Dataset Repository 
- HuggingFace Dataset Repository: https://huggingface.co/datasets
- load_dataset documentation: https://huggingface.co/docs/datasets/package_reference/loading_methods.html

In [3]:
from datasets import load_dataset

In [4]:
# load datasets by passing the name of the dataset as the first argument
huggingface_dataset_name = "financial_phrasebank"                                 # financial_phrasebank - a sentence-level Sentiment Analysis Dataset
huggingface_dataset = load_dataset(huggingface_dataset_name, 
                                   name = "sentences_75agree",                    
                                   split = "train")
show_elements(huggingface_dataset,randomize = True, num_samples = 10)

Reusing dataset financial_phrasebank (/home/students/s121md106_09/.cache/huggingface/datasets/financial_phrasebank/sentences_75agree/1.0.0/a6d468761d4e0c8ae215c77367e1092bead39deb08fbf4bffd7c0a6991febbf0)


Unnamed: 0,sentence,label
0,"The company said that its investments in the new market areas resulted in sales increase in Sweden , Poland , Russia and Lithuania .",2
1,"The broad-based WIG index ended Thursday 's session 0.1 pct up at 65,003.34 pts , while the blue-chip WIG20 was 1.13 down at 3,687.15 pts .",1
2,"After the transaction , M-real will own 30 % in Metsa-Botnia and UPM -- 17 % .",1
3,"We went to the market with yield guidance of the 7.25 % area , which gave us the flexibility to go up or down by 1-8th .",1
4,"At 10.33 am , Huhtamaki was the market 's biggest faller , 8.69 pct lower at 11.35 eur , while the OMX Helsinki 25 was 0.32 pct higher at 3,332.41 , and the OMX Helsinki was up 0.47 pct at 11,687.32 .",0
5,"Publishing Sweden 's operating loss was EUR 1.1 mn in Q1 of 2009 , compared to a profit of EUR 0.6 mn a year ago .",0
6,The output of the contracts totals 72 MWe .,1
7,The sale of the Healthcare Trade business supports Oriola-KD 's strategy to focus on Pharmaceutical Wholesale and Retail businesses .,2
8,Net sales increased to EUR193 .3 m from EUR179 .9 m and pretax profit rose by 34.2 % to EUR43 .1 m. ( EUR1 = USD1 .4 ),2
9,Curators have divided their material into eight themes .,1


# Load your own data 
- You can also load your own data instead of a dataset from the HuggingFace Dataset Repository, but be sure to convert it into a HuggingFace Dataset!
- Some useful methods to do so:
  - Dataset.from_pandas
  - Dataset.from_csv
  - Dataset.from_json
- link: https://huggingface.co/docs/datasets/package_reference/main_classes.html#dataset

In [5]:
import pandas as pd

In [6]:
path = "datasets/cleaned.csv"
pandas_dataframe = pd.read_csv(path)
pandas_dataframe.head()

Unnamed: 0.1,Unnamed: 0,content_9
0,0,propose use million bitcoin measure entire ag...
1,1,bitcoin death cross perfect die naturally beco...
2,2,candle day close open high low close chan...
3,3,bitcoin honest form money humanity ever see
4,4,punch poc easily lads btc


In [7]:
pandas_dataframe.drop('Unnamed: 0', axis=1, inplace=True)
pandas_dataframe.columns = ['sentence']

In [8]:
show_elements(pandas_dataframe,randomize = True, num_samples = 5)

Unnamed: 0,sentence
228523,really break bitcoin celebrate 🥳 cardano
95037,nobody really want say bitcoin poster boy cryptocurrency bad drink drug use problem 🤣 nobody want address cause get rich him…ticking
233479,must rise together fight elon musk crypto nonsense twitts bitcoin
43636,long position suggest bitcoin current price signal bullcount bearcount date
292891,back k btc crypto cryptodigital digitalart cryptocurrencies bitcoin bitcoinnews


In [9]:
from datasets import Dataset



In [10]:
huggingface_dataset = Dataset.from_pandas(pandas_dataframe)
huggingface_dataset

Dataset({
    features: ['sentence'],
    num_rows: 478040
})

In [11]:
show_elements(huggingface_dataset, randomize = True, num_samples = 11)

Unnamed: 0,sentence
0,hereby thank fed central bank accelerate bitcoin adoption rampant money printing
1,btcusd current bitcoin price day high day low year low year high day move avg day move avg bitcoin realmoney btc cypto
2,current price bitcoin decreased last hour bitcoin btc cryptocurrency
3,america racist country glazersout haiti pokemonunite fleet freebritney nationalnudityday novaccinepassports karen mufc bitcoin blinkensowoyane arizonaaudit theview tuckercarlson takeusbacktochina yr yourboyfriendgame lokifinale loveisland loot
4,last chance buy bitcoin k ever…
5,way feel today way nocoiner friend feel future one enough bitcoin even michael
6,define freedom bitcoin decentralize sovereign money humanity
7,anybody way dogecoin love respect shit you forget diversify bit legenddoge legendary too coin always powerhouse bitcoin ethereum 🤝
8,sxp short position entry price target stop binance bitcoin signal sell sxpusdt sxp
9,bitcoin legal tender el salvador meanwhile tether coin lot supply inmarket purchase btc currently ban new york although cannabis fully recreational now cryptocurrency world continue bumpy road web


# *train_test_split()* method
- This step is optional.
  - It is useful for evaluating your trained model on an unseen test set
- documentation: https://huggingface.co/docs/datasets/package_reference/main_classes.html#dataset

In [12]:
huggingface_dataset = huggingface_dataset.train_test_split(test_size = 0.3, 
                                                           shuffle = True, 
                                                           seed = 0)
huggingface_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence'],
        num_rows: 334628
    })
    test: Dataset({
        features: ['sentence'],
        num_rows: 143412
    })
})

# Choose your Transformers model
- Huggingface model repository: https://huggingface.co/models?sort=downloads

In [13]:
# Get the exact name of the model you want 
model_checkpoint = "distilbert-base-uncased"

# AutoModelForSequenceClassification
- AutoModelForSequenceClassification loads the correct model architecture for classifying sentences
- AutoModelForSequenceClassification documentation: https://huggingface.co/transformers/model_doc/auto.html#automodelforsequenceclassification

In [14]:
from transformers import AutoModelForSequenceClassification

In [15]:
# from_pretrained() method loads the model
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
                                                           num_labels=3)            # we have 3 labels: Positive, Negative and Neutral

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

# AutoTokenizer
- AutoTokenizer loads the right tokenizer for your model
- AutoTokenizer documentation: https://huggingface.co/transformers/model_doc/auto.html#autotokenizer

In [16]:
from transformers import AutoTokenizer

In [17]:
# from_pretrained() method loads the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Tokenize your Dataset
- You need to convert your raw input text into the correct input format that BERT expects!
  - input_ids - a sequence of indices where each index corresponds to a particular token
  - attention_mask - tells BERT not to focus on padding tokens
  - label - encoded target variable
- tokenizer documentation: https://huggingface.co/transformers/internal/tokenization_utils.html
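
To make `input_ids` and `attention_mask` concrete, here is a minimal pure-Python sketch of what padding to a fixed length produces. It is not the tokenizer's implementation: the helper `pad_and_mask` is hypothetical, and the special-token ids (101 for [CLS], 102 for [SEP], 0 for [PAD]) are simply the ones visible in this notebook's distilbert-base-uncased outputs.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids as they appear in this notebook's outputs

def pad_and_mask(token_ids, max_length=50):
    """Hypothetical helper: add [CLS]/[SEP], truncate, pad, and build the mask."""
    # leave room for the two special tokens when truncating
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # real tokens get 1 in the mask
    attention_mask = [1] * len(input_ids)
    # padding positions get the pad id and 0 in the mask
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

example = pad_and_mask([2342, 11481, 11865, 2094, 2978, 3597, 2378], max_length=12)
print(example["input_ids"])
print(example["attention_mask"])
```

Every position holding a padding id has a 0 in the mask, which is how the model knows to ignore it.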

In [18]:
# define your preprocessing function
def preprocess(examples):
    tokenized_examples = tokenizer(examples['sentence'], 
                                   padding = "max_length",
                                   truncation = True,
                                   max_length = 50)
    return tokenized_examples

In [19]:
# use the map() function to preprocess the whole Dataset
tokenized_dataset = huggingface_dataset.map(preprocess, 
                                            batched = True, 
                                            batch_size = 30,
                                            remove_columns = ["sentence"])
tokenized_dataset

  0%|          | 0/11155 [00:00<?, ?ba/s]

  0%|          | 0/4781 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 334628
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 143412
    })
})
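
With `batched = True`, the preprocessing function receives a dictionary of lists (one list per column) rather than a single row, and the progress bars count batches, not rows. The sketch below mimics that slicing in plain Python; `iter_batches` is a hypothetical helper, not part of the datasets library.

```python
def iter_batches(columns, batch_size):
    """Hypothetical helper: yield dict-of-lists slices, like a batched map would pass."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}

toy = {"sentence": ["a", "b", "c", "d", "e"]}
batches = list(iter_batches(toy, batch_size=2))
print([b["sentence"] for b in batches])

# number of batches for the split sizes in this notebook with batch_size = 30
n_train_batches = len(range(0, 334628, 30))
n_test_batches = len(range(0, 143412, 30))
print(n_train_batches, n_test_batches)
```

The two batch counts match the totals shown in the progress bars (11155 and 4781).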

In [20]:
# this table shows the input format that BERT expects!
show_elements(tokenized_dataset["train"],randomize = False, num_samples = 5)

Unnamed: 0,input_ids,attention_mask
0,"[101, 2651, 2792, 2716, 3661, 18411, 1042, 1040, 4965, 23816, 11514, 2978, 3597, 2378, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
1,"[101, 2342, 11481, 11865, 2094, 2978, 3597, 2378, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
2,"[101, 24264, 2213, 3407, 21117, 2278, 2897, 2273, 2100, 3444, 21117, 2278, 2897, 4495, 1057, 3693, 2897, 13970, 24065, 13970, 3597, 2378, 21117, 2278, 2033, 4168, 2978, 3597, 2378, 13970, 24065, 16294, 6651, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
3,"[101, 2204, 2622, 2293, 2250, 25711, 2279, 18863, 3643, 23961, 2034, 13180, 3643, 23961, 3745, 6523, 7941, 8406, 7533, 2378, 25217, 4353, 3058, 2279, 18863, 23961, 2250, 25711, 18411, 2278, 2002, 3597, 24925, 2078, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
4,"[101, 4025, 18411, 2278, 2978, 3597, 2378, 2735, 5012, 2490, 100, 2051, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


### example
- tokenizer documentation: https://huggingface.co/transformers/internal/tokenization_utils.html

In [21]:
sentence = "According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing"

In [22]:
# to tokenize, input the sentence as the first argument
tokenized_sentence = tokenizer(sentence, 
                               padding = "max_length",
                               truncation = True,
                               max_length = 50)
tokenized_sentence

{'input_ids': [101, 2429, 2000, 12604, 1010, 1996, 2194, 2038, 2053, 3488, 2000, 2693, 2035, 2537, 2000, 3607, 1010, 2348, 2008, 2003, 2073, 1996, 2194, 2003, 3652, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [23]:
# this checks if your text has been tokenized correctly
tokenizer.decode(tokenized_sentence["input_ids"])

'[CLS] according to gran, the company has no plans to move all production to russia, although that is where the company is growing [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

# TrainingArguments
- TrainingArguments allow us to set training configurations.
- The more important arguments are:
  - evaluation_strategy - determines when to evaluate your model on the test set 
  - learning_rate - determines how fast your model updates its parameters
  - num_train_epochs - determines number of epochs to train your model
- TrainingArguments documentation: https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments
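
To get a feel for how `learning_rate`, the batch size and `num_train_epochs` interact, the sketch below counts optimizer steps and evaluates a linearly decaying learning rate. It assumes a single device, no gradient accumulation, and a linear decay schedule without warmup; both helper functions are hypothetical and only illustrate the arithmetic, they are not Trainer code.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Hypothetical helper: optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs

def linear_decay_lr(step, total_steps, base_lr):
    """Hypothetical helper: learning rate that falls linearly from base_lr to 0."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

# values taken from this notebook: 334628 training rows, batch size 4, 1 epoch
total = total_training_steps(num_examples=334628, batch_size=4, num_epochs=1)
print(total)
for step in (0, total // 2, total):
    print(step, linear_decay_lr(step, total, base_lr=2e-5))
```

A larger batch size or fewer epochs shrinks the step count, which is one reason these arguments are usually tuned together.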

In [24]:
from transformers import TrainingArguments

In [25]:
args = TrainingArguments(
    "financial_phrasebank_checkpoints",
    evaluation_strategy = "epoch",                           
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    weight_decay=0.01,
    # load_best_model_at_end=True,                              
    metric_for_best_model="accuracy",
    logging_steps = 800,
    save_strategy = "no",
)

# load_metric
- You can load available metrics to evaluate your model here: https://huggingface.co/metrics
- source: https://huggingface.co/docs/datasets/package_reference/loading_methods.html 

In [26]:
from datasets import load_metric

In [27]:
metric = load_metric("accuracy")

In [28]:
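import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred                 # predictions are logits of shape (num_examples, num_labels)
    predictions = np.argmax(predictions, axis=-1)   # pick the highest-scoring class for each example
    return metric.compute(predictions=predictions, references=labels)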
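
For a classification head, accuracy is normally computed on class ids rather than raw logits. The following is a minimal NumPy sketch of that step with made-up logits and labels; `accuracy_from_logits` is a hypothetical helper, independent of the metric object loaded above.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Hypothetical helper: argmax over the class axis, then fraction of matches."""
    predicted = np.argmax(logits, axis=-1)
    return float((predicted == labels).mean())

# made-up logits for 4 examples and 3 classes
logits = np.array([[2.0, 0.1, -1.0],
                   [0.2, 0.3, 1.5],
                   [0.0, 3.0, 0.5],
                   [1.0, 0.9, 0.8]])
labels = np.array([0, 2, 1, 2])
print(accuracy_from_logits(logits, labels))
```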

# Trainer
- Trainer ties together all the components above and initiates training using the .train() method
- https://huggingface.co/transformers/main_classes/trainer.html#id1

In [29]:
from transformers import Trainer

In [30]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_dataset["train"],                  
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics
)

In [31]:
# runs training
# trainer.train()

# Summary: BERT for sequence classification
- AutoModelForSequenceClassification 
  - load our BERT classification model
- AutoTokenizer 
  - load the correct tokenizer for our model
- Convert your dataset into the expected input format:
  - input_ids 
  - attention_mask 
  - label 
- TrainingArguments
  - defines training configurations, e.g. number of epochs
- Metric (Optional)
- Trainer
  - Organizes all the components and runs training

# Making Predictions

## Load a fine-tuned model
- Instead of training your own model, you can download models shared by other people!
- Huggingface model repository: https://huggingface.co/models
- Why load other people's models?
    - Long training duration
    - Lack of data 

In [32]:
# load fine-tuned model
model_checkpoint = "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis"
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

loading configuration file https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis/resolve/main/config.json from cache at /home/students/s121md106_09/.cache/huggingface/transformers/32cfde1251e42836903ec367e0318ee96ead565f3a8f9e38d6e6d9a8625e32f7.1d32c415a454383dd90a88b64b406108719272aabc01efb82409aa8593f86983
Model config RobertaConfig {
  "_name_or_path": "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "negative",
    "1": "neutral",
    "2": "positive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "negative": 0,
    "neutral": 1,
    "positive": 2
  },
  "layer_norm_eps": 1e-05,
  "max_p

In [33]:
# download correct tokenizer - because different models may have different tokenization
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

loading file https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis/resolve/main/vocab.json from cache at /home/students/s121md106_09/.cache/huggingface/transformers/d1e38a9dbde9bd6aaff25762cc6371f58c172e5b50891ed94bd32a5fbbd4bc19.bfdcc444ff249bca1a95ca170ec350b442f81804d7df3a95a2252217574121d7
loading file https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis/resolve/main/merges.txt from cache at /home/students/s121md106_09/.cache/huggingface/transformers/5bbc28825fd3e73773bce80ec73605ce7b576501e3c9a37031dfd35277f1c738.f5b91da9e34259b8f4d88dbc97c740667a0e8430b96314460cdb04e86d4fc435
loading file https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis/resolve/main/tokenizer.json from cache at /home/students/s121md106_09/.cache/huggingface/transformers/d483430aa1deae75cd959527ed83ee01e5b2af12c5053d67eb52083ef482a8bd.cae845566850be098cc55c53bbeb4f4ad435550c9a6e72d26f6afd7d8a07a209
loadin

# Make Prediction!

In [34]:
import numpy as np
def make_prediction(sentence, model, tokenizer):
    tokenized_sentence = tokenizer(sentence, return_tensors = "pt")   # tokenize the sentence
    logits = model(**tokenized_sentence).logits                       # make a forward pass through the model to get the raw outputs i.e. logits
    prediction = int(np.argmax(logits.detach().numpy()))              # argmax because the raw output with the highest value is the most likely prediction
    label_mapping = model.config.id2label                             # get the label mappings from the model that was passed in
    return label_mapping[prediction]
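
The argmax only returns the winning label. If a confidence score is also wanted, the logits can be turned into probabilities with a softmax. This is a minimal NumPy sketch with made-up logits; the `id2label` dictionary copies the mapping printed in the model config above, and `softmax` is a hypothetical helper written here for illustration.

```python
import numpy as np

def softmax(logits):
    """Hypothetical helper: numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

id2label = {0: "negative", 1: "neutral", 2: "positive"}  # mapping from the printed config
logits = np.array([[-1.2, 3.4, 0.3]])                     # made-up values for one sentence
probs = softmax(logits)[0]
best = int(np.argmax(probs))
print(id2label[best], round(float(probs[best]), 3))
```

Softmax does not change which class wins, so the predicted label is the same as with a plain argmax.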

In [35]:
# sentence = "According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing"
# make_prediction(sentence, fine_tuned_model, tokenizer)

In [36]:
pandas_dataframe['prediction'] = pandas_dataframe['sentence'].apply(lambda sent: make_prediction(sent, fine_tuned_model, tokenizer))

In [37]:
# pandas_dataframe['sentence'].head().apply(lambda sent: make_prediction(sent, fine_tuned_model, tokenizer))

In [38]:
pandas_dataframe.head()

Unnamed: 0,sentence,prediction
0,propose use million bitcoin measure entire ag...,neutral
1,bitcoin death cross perfect die naturally beco...,positive
2,candle day close open high low close chan...,neutral
3,bitcoin honest form money humanity ever see,neutral
4,punch poc easily lads btc,neutral


In [39]:
pandas_dataframe.to_csv('./datasets/bert_prediction.csv')

In [46]:
neutral = len(pandas_dataframe.loc[pandas_dataframe['prediction'] == 'neutral'])
positive = len(pandas_dataframe.loc[pandas_dataframe['prediction'] == 'positive'])
negative = len(pandas_dataframe.loc[pandas_dataframe['prediction'] == 'negative'])

print(str(neutral) + ' neutral')
print(str(positive) + ' positive')
print(str(negative) + ' negative')

364101 neutral
75201 positive
38738 negative


478040
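
The three separate counts can also be produced in one pass, together with each label's share of the total. The sketch below uses `collections.Counter` on a toy list (pandas' `Series.value_counts()` would give the same counts on the `prediction` column); `label_shares` is a hypothetical helper.

```python
from collections import Counter

def label_shares(labels):
    """Hypothetical helper: map each label to (count, fraction of all labels)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

toy = ["neutral", "neutral", "positive", "negative", "neutral"]
for label, (n, share) in label_shares(toy).items():
    print(f"{label}: {n} ({share:.1%})")

# the same arithmetic applied to the counts reported above
reported = {"neutral": 364101, "positive": 75201, "negative": 38738}
total = sum(reported.values())
print(total, {k: round(v / total, 3) for k, v in reported.items()})
```

The reported counts add up to the 478040 rows of the dataset, with neutral predictions making up roughly three quarters of them.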