<a href="https://colab.research.google.com/github/pnadelofficial/hf_classification_hackthon2_12/blob/main/Classification_with_HuggingGace_transformers_Hackathon2_12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification with HuggingFace `transformers`

In this notebook, we are tasked to find and replace misspelled or misidentified placenames in geographic data. In addition to the placenames, we also have scores that come from a fuzzymatching process, which will give us a starting place for building our model.

This notebook should be a place of experimentation and exploration. I have tried to mark places where you could try other approaches and implementations.

A few notes on running this yourself:
1. It requires a GPU. You can access time on a GPU for free in Colab by going to the `Runtime` tab and selecting `Change runtime type`. Then find `GPU` in the `Hardware accelerator` drop down and hit `Save`. 
2. I tried to make it so that there were `pip install`s for all of the software packages we are using, but if you see an error it may be a versions or installation issue. 

In [1]:
!pip install transformers sentencepiece datasets -U # intalling and updating all of the packages we need
!wget "https://raw.githubusercontent.com/pnadelofficial/hf_classification_hackthon2_12/main/MATCH%20Fuzzy%20Matching%20(1).csv" # downloading the data

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m44.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m41.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m462.8/462.8 KB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m41.1 MB/s[0m eta 

In [2]:
import pandas as pd 
import numpy as np

from datasets import Dataset, DatasetDict
from transformers import AutoModelForSequenceClassification,AutoTokenizer,TrainingArguments,Trainer

## Data Preparation

We're going to need to prepare our data and make some important decisions about how the model will read our data. This notebok presents a simplification of many of the details of working with natural language. I encourage you to explore different way of preparing the data and record how that affects the preformance of your model. 

At this step especially, I will make some generalizations for the sake of time, but also so that you have a place to begin your experimentation with natural language processing.

In [3]:
# let's see what our data looks like using pandas
# read_csv let's us... read a csv
match_adm = pd.read_csv('/content/MATCH Fuzzy Matching (1).csv')
match_adm

Unnamed: 0.1,Unnamed: 0,target_adm,SURVEYID,MATCHED_ADM,score_sort,score_partial,ADM_012,GID_2
0,0,afghanistan - daykundi - ishtarlay,afgh38.csv,afghanistan - daykundi - shahristan,85,54,afghanistan - daykundi - shahristan,AFG.6.4_1
1,0,afghanistan - laghman - mihtariam,afgh39.csv,afghanistan - laghman - mihtarlam,97,205,afghanistan - laghman - mihtarlam,AFG.20.4_1
2,0,afghanistan - laghman - qarghayi,afgh39.csv,afghanistan - laghman - qarghayi,100,206,afghanistan - laghman - qarghayi,AFG.20.5_1
3,0,afghanistan - nangarhar - behsud,afgh40.csv,afghanistan - nangarhar - hisarak,84,219,afghanistan - nangarhar - hisarak,AFG.22.8_1
4,0,afghanistan - nangarhar - kama rodat,afgh40.csv,afghanistan - nangarhar - rodat,92,228,afghanistan - nangarhar - rodat,AFG.22.17_1
...,...,...,...,...,...,...,...,...
3418,0,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355093,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2
3419,25,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355094,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2
3420,50,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355095,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2
3421,75,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355096,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2


First, we need to create a consistent format for our natural language data. Generally the shape of your data doesn't matter so long as it's the same every time. Thankfully, `pandas` let's us do exactly that. 

In [4]:
match_adm['input'] = match_adm.apply(lambda x: f"TEXT1: {x['target_adm']}; TEXT2: {x['ADM_012']}",axis=1) # can change this
match_adm['input'][0]

'TEXT1: afghanistan - daykundi - ishtarlay; TEXT2: afghanistan - daykundi - shahristan'

Next, we need to decide what counts as "correct" and "incorrect" for the purposes of the classification. That is, upon what criterion are we going to classify? In this case, we are going to the fuzzymatching scores to get us started. 

Too we are also going to pull out about 1000 sample upon which we'll train our model and reserve the rest of the data to test our model.

In [5]:
match_adm['labels'] = match_adm['score_sort'].apply(lambda x: 1 if x>90 else 0) # can change this threshold (related to the fuzzy matching)
match_adm['labels'] = match_adm['labels'].astype(float)
data = match_adm[['input','labels']].sample(len(match_adm)//3) # can change this (how much of the original data we are using to train)

In [6]:
data

Unnamed: 0,input,labels
299,TEXT1: burkina faso - est - tapoa; TEXT2: burk...,1.0
624,TEXT1: chad - ouaddai - ouara; TEXT2: chad - o...,1.0
2209,TEXT1: malawi - central - nkhota kota; TEXT2: ...,0.0
2123,TEXT1: kenya - marsabit - maikona; TEXT2: keny...,0.0
2421,TEXT1: niger - maradi - mayahi; TEXT2: niger -...,1.0
...,...,...
3372,TEXT1: uganda - northern - abim; TEXT2: sudan ...,0.0
3189,TEXT1: south sudan - greater upper nile - jong...,0.0
1232,TEXT1: ethiopia - somali - gode; TEXT2: ethiop...,0.0
1115,TEXT1: ethiopia - amhara - north wollo; TEXT2:...,0.0


In [7]:
# convert from pandas df to hf Dataset
ds = Dataset.from_pandas(data)
ds

Dataset({
    features: ['input', 'labels', '__index_level_0__'],
    num_rows: 1141
})

## Tokenization, Metrics and Model Training

Now that we have our data in the correct form. We can talk about how we will train our model. We will use a process called fine-tuning, where we take a pretrained model whihc took hundreds of hours to train and adapt it for our purposes. For today, we'll use a smaller model from Microsoft. 

Each language model also comes with something called a tokenizer. Tokenization is the process of breaking down a sentence into its constituent words or tokens. We need to tokenize the data we created above in the exact way the creators of this model tokenized their data or else our data will be incompatiable. 

Luckily, HuggingFace includes a utility for model creators to upload not just their model but also its tokenizer. This tokenizer will assign an id for each unique word so that it will be easily accessible by our the pretrained model.


In [8]:
model_nm = 'microsoft/deberta-v3-small' # simple model, easy to download and run
tokz = AutoTokenizer.from_pretrained(model_nm) # need tokenizer

def tok_func(x): 
  return tokz(x["input"])

tok_ds = ds.map(tok_func, batched=True) # tokenization

Downloading (…)okenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/578 [00:00<?, ?B/s]

Downloading (…)"spm.model";:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


  0%|          | 0/2 [00:00<?, ?ba/s]

In [9]:
# what tokenization looks like:
print(tok_ds['input'][0])
tok_ds['input_ids'][0]

TEXT1: burkina faso - est - tapoa; TEXT2: burkina faso - est - tapoa


[1,
 54453,
 435,
 294,
 26028,
 50408,
 19315,
 3070,
 341,
 11148,
 341,
 5266,
 22496,
 346,
 54453,
 445,
 294,
 26028,
 50408,
 19315,
 3070,
 341,
 11148,
 341,
 5266,
 22496,
 2]

In [10]:
# hf datasets splits our data into train, test (validation), and evaluation (test) sets
tts = tok_ds.train_test_split(0.25, seed=42)
tts

DatasetDict({
    train: Dataset({
        features: ['input', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 855
    })
    test: Dataset({
        features: ['input', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 286
    })
})

In [11]:
# metrics, not loss
def corr_d(eval_pred): 
    return {'pearson': np.corrcoef(*eval_pred)[0][1]} # using r value for metrics
# can read more on r here: https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php

In [12]:
# feel free to play around here to find out what works best
# can change these to tune results
bs = 128
epochs = 8
lr = 6e-5 

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

In [13]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=tts['train'], eval_dataset=tts['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/286M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.dense.bias', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.dense.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

In [14]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: __index_level_0__, input. If __index_level_0__, input are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 855
  Num Epochs = 8
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 56
  Number of trainable parameters = 141895681
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.226964,0.706231
2,No log,0.10628,0.908184
3,No log,0.037943,0.925591
4,No log,0.015817,0.969041
5,No log,0.018665,0.9757
6,No log,0.017498,0.977604
7,No log,0.012645,0.978855
8,No log,0.012837,0.979414


The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: __index_level_0__, input. If __index_level_0__, input are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 286
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: __index_level_0__, input. If __index_level_0__, input are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 286
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: __index_level_0__, input. If __index_level_0__, input are not expected by `DebertaV2ForSequ

TrainOutput(global_step=56, training_loss=0.09283702714102608, metrics={'train_runtime': 19.6857, 'train_samples_per_second': 347.461, 'train_steps_per_second': 2.845, 'total_flos': 75730785532980.0, 'train_loss': 0.09283702714102608, 'epoch': 8.0})

In [15]:
# preds for out evaluation set
preds = trainer.predict(trainer.eval_dataset)[1]
preds

The following columns in the test set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: __index_level_0__, input. If __index_level_0__, input are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 286
  Batch size = 256


array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0.,
       0., 0., 0., 0., 1., 0., 1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0.,
       0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 1., 0., 1., 1., 0.,
       0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 1., 1., 1., 1., 1., 1., 0., 1., 0., 0., 1., 1., 1., 0., 0., 1.,
       1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1.,
       0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 0., 1., 1., 1., 1., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 1.,
       0., 1., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
       0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 1., 0., 1., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 1., 1., 0., 1., 0.,
       1., 1., 1., 0., 0.

In [16]:
# a new df with our eval predictions
preds_df = pd.DataFrame({'input':pd.Series(trainer.eval_dataset['input']), 'preds':preds})

In [17]:
# pred will be 1 if they are refering to the same place 
# pred will be 0 if they are refering to a different place
p = preds_df.sample(5).apply(lambda x: print(x['input'], x['preds'], '\n'),axis=1) # sampling 5 at a time

TEXT1: kenya - mandera - el wak or central; TEXT2: kenya - meru - igembe central 0.0 

TEXT1: uganda - northern - amudat; TEXT2: sudan - northern - addabah 0.0 

TEXT1: uganda - northern - moroto; TEXT2: uganda - moroto - moroto 0.0 

TEXT1: cote divoire - woroba - bafing; TEXT2: côte d'ivoire - woroba - bafing 1.0 

TEXT1: myanmar - west - rakhine; TEXT2: myanmar - rakhine - sittwe 0.0 



In [18]:
# save out model for later
trainer.save_model("SMART_correction_model_deberta-v3-small")
!zip -r SMART_spelling_model_deberta-v3-small.zip SMART_correction_model_deberta-v3-small

Saving model checkpoint to SMART_correction_model_deberta-v3-small
Configuration saved in SMART_correction_model_deberta-v3-small/config.json
Model weights saved in SMART_correction_model_deberta-v3-small/pytorch_model.bin
tokenizer config file saved in SMART_correction_model_deberta-v3-small/tokenizer_config.json
Special tokens file saved in SMART_correction_model_deberta-v3-small/special_tokens_map.json


  adding: SMART_correction_model_deberta-v3-small/ (stored 0%)
  adding: SMART_correction_model_deberta-v3-small/tokenizer_config.json (deflated 45%)
  adding: SMART_correction_model_deberta-v3-small/training_args.bin (deflated 48%)
  adding: SMART_correction_model_deberta-v3-small/tokenizer.json (deflated 77%)
  adding: SMART_correction_model_deberta-v3-small/config.json (deflated 53%)
  adding: SMART_correction_model_deberta-v3-small/added_tokens.json (stored 0%)
  adding: SMART_correction_model_deberta-v3-small/spm.model (deflated 50%)
  adding: SMART_correction_model_deberta-v3-small/pytorch_model.bin (deflated 28%)
  adding: SMART_correction_model_deberta-v3-small/special_tokens_map.json (deflated 54%)


## Conclusion

We have gotten a start on creating a classification model from a pretrained large language model using the HuggingFace `transformer` framework. I encourage you to play around with this code, especially try to apply the model we have now to the data we reserved at the beginning. 

Find useful links below:


1.   [HuggingFace Transformers Documentation](https://huggingface.co/docs/transformers/index)
2.   [Fast.ai course example using Transformers](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners)
3.   [Model card for Microsoft's deBERTa v3 small model](https://huggingface.co/microsoft/deberta-v3-small)

