# Sentiment Analysis with Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies. You can install their packages to access pre-built models, use them directly or fine-tune them (retrain them on your dataset while leveraging the knowledge acquired during the initial training), then host your trained models on the platform so that you can use them later on other devices and apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

The Hugging Face models are deep-learning based, so training them requires a lot of GPU computing power. Please use [Colab](https://colab.research.google.com/), another cloud GPU provider, or a local machine with an NVIDIA GPU.
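
Before launching a long training run, it can help to confirm that a GPU is actually visible. The sketch below assumes PyTorch is the backend, as in the rest of this notebook; `pick_device` is a hypothetical helper name, and it falls back to `"cpu"` when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see an NVIDIA GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency: only needed for the GPU check
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training would run on: {device}")
```

If this prints `cpu` on Colab, the runtime type probably needs to be switched to a GPU runtime before training.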

## Application of Hugging Face Text Classification Model Fine-tuning

Find below a simple example, with just `10 epochs of fine-tuning`. 

Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.).

In [15]:
!git clone https://github.com/muchaimaryanne/Natural-Language-Processing-Project-Sentiment-Analysis.git


Cloning into 'Natural-Language-Processing-Project-Sentiment-Analysis'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (45/45), done.
remote: Total 47 (delta 17), reused 15 (delta 1), pack-reused 0
Unpacking objects: 100% (47/47), 936.55 KiB | 2.05 MiB/s, done.


In [16]:
%cd Natural-Language-Processing-Project-Sentiment-Analysis


/content/Natural-Language-Processing-Project-Sentiment-Analysis/Natural-Language-Processing-Project-Sentiment-Analysis/Natural-Language-Processing-Project-Sentiment-Analysis


In [17]:
# Install the necessary package to create a virtual environment
!pip3 install virtualenv

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [18]:
# Create the virtual environment venv
!virtualenv venv

created virtual environment CPython3.10.11.final.0-64 in 191ms
  creator CPython3Posix(dest=/content/Natural-Language-Processing-Project-Sentiment-Analysis/Natural-Language-Processing-Project-Sentiment-Analysis/Natural-Language-Processing-Project-Sentiment-Analysis/venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==23.1.2, setuptools==67.7.2, wheel==0.40.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator


In [19]:
# Activate the virtual environment
!source venv/bin/activate
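
Note that in a notebook each `!` command runs in its own subshell, so `source venv/bin/activate` does not change the interpreter that executes the following cells. A quick way to see which environment the kernel really uses is the standard-library check sketched below (`in_virtualenv` is a hypothetical helper name).

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    # Inside a venv/virtualenv, sys.prefix points at the environment while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```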

In [20]:
!pip install --upgrade huggingface_hub
!pip install datasets
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Using cached datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting responses<0.19
  Using cached responses-0.18.0-py3-none-any.whl (38 kB)
Collecting dill<0.3.7,>=0.3.0
  Using cached dill-0.3.6-py3-none-any.whl (110 kB)
Collecting aiohttp
  Using cached aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting multiprocess
  Using cached multiprocess-0.70.14-py310-none-any.whl (134 kB)
Collecting xxhash
  Using cached xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting frozenlist>=1.1.1
  Using cached frozenlist-1.3.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (149 kB)
Collecting aiosignal>=1.1.2
  Using cached aiosignal-1.3.1-py3-none-any.wh

In [21]:
!pip install -r requirements.txt
!pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/cu111/torch_stable.html


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting backports.zoneinfo
  Using cached backports.zoneinfo-0.2.1.tar.gz (74 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting simpletransformers
  Using cached simpletransformers-0.63.11-py3-none-any.whl (250 kB)
Collecting jupyter
  Using cached jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting gradio
  Using cached gradio-3.28.3-py3-none-any.whl (17.3 MB)
Collecting sentencepiece
  Using cached sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting streamlit
  Using cached streamlit-1.22.0-py2.py3-none-any.whl (8.9 MB)
Collecting seqeval
  Using cached seqeval-1.2.2-py3-none-any.whl
Collecting wandb>=0.10.32
  Using cached wandb-0.15.2-py3-none-any.whl (2.0 MB)
Collecting qtconsole
  Using cached qt

In [22]:
import os
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification, Trainer, TrainingArguments
import numpy as np
from datasets import load_metric
from transformers import AutoModel
from huggingface_hub import login

In [23]:
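# Disable W&B logging
os.environ["WANDB_DISABLED"] = "true"
# Make CUDA calls synchronous so that GPU errors point at the failing line
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
# Never hard-code an access token in a notebook: put your own token here,
# or rely on the interactive login() call further down
os.environ["HUGGINGFACE_API_KEY"] = "<your-hugging-face-token>"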
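
Access tokens are credentials, so a safer pattern is to read them from the environment (or from a prompt) rather than writing them in a cell. Here is a minimal sketch that reuses the `HUGGINGFACE_API_KEY` variable name from the cell above; `read_hf_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os
from typing import Optional


def read_hf_token(var_name: str = "HUGGINGFACE_API_KEY") -> Optional[str]:
    """Return the token stored in the environment, or None when it is unset."""
    token = os.environ.get(var_name, "").strip()
    return token or None


token = read_hf_token()
if token is None:
    print("No token found: set the variable outside the notebook, or use login().")
else:
    # Never print the token itself; a short prefix is enough to check it.
    print(f"Token loaded (starts with {token[:3]}...)")
```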

In [24]:
# Load the dataset and display some values
df = pd.read_csv('/content/Natural-Language-Processing-Project-Sentiment-Analysis/zindi_challenge/data/Train.csv')

# A way to eliminate rows containing NaN values
df = df[~df.isna().any(axis=1)]


In [25]:
df.isna().sum()

tweet_id     0
safe_text    0
label        0
agreement    0
dtype: int64

In [26]:
df.isnull().sum()

tweet_id     0
safe_text    0
label        0
agreement    0
dtype: int64

I manually split the training set into a training subset (the data the model learns from) and an evaluation subset (the data used to compute metric scores, which helps detect training problems such as [overfitting](https://www.ibm.com/cloud/learn/overfitting)).

There are multiple ways to split the dataset. Two commented lines further down show another one.

In [27]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [28]:
train.head()

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 9305 | YMRMEDME | Mickey's Measles has gone international <url> | 0.0 | 1.0 |
| 3907 | 5GV8NEZS | S1256 [NEW] Extends exemption from charitable ... | 0.0 | 1.0 |
| 795 | EI10PS46 | <user> your ignorance on vaccines isn't just ... | 1.0 | 0.666667 |
| 5793 | OM26E6DG | Pakistan partly suspends polio vaccination pro... | 0.0 | 1.0 |
| 3431 | NBBY86FX | In other news I've gone up like 1000 mmr | 0.0 | 1.0 |


In [29]:
eval.head()

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 6571 | R7JPIFN7 | Children's Museum of Houston to Offer Free Vac... | 1.0 | 1.0 |
| 1754 | 2DD250VN | <user> no. I was properly immunized prior to t... | 1.0 | 1.0 |
| 3325 | ESEVBTFN | <user> thx for posting vaccinations are impera... | 1.0 | 1.0 |
| 1485 | S17ZU0LC | This Baby Is Exactly Why Everyone Needs To Vac... | 1.0 | 0.666667 |
| 4175 | IIN5D33V | Meeting tonight, 8:30pm in room 322 of the stu... | 1.0 | 1.0 |


In [30]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (7999, 4), eval is (2000, 4)
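
To make the role of `stratify` more concrete, here is a plain-Python sketch of a stratified 80/20 split. It illustrates the idea only and is not scikit-learn's actual algorithm; `stratified_split` and the toy rows are made up for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_size=0.2, seed=42):
    """Split rows into (train, eval) while keeping label proportions similar."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    train, held_out = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)  # shuffle inside each class before cutting
        n_eval = round(len(group) * test_size)
        held_out.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, held_out


# Toy data (made up): 10 negative, 50 neutral and 40 positive examples.
rows = [{"label": label} for label, n in ((-1, 10), (0, 50), (1, 40)) for _ in range(n)]
train_rows, eval_rows = stratified_split(rows, "label")
print(len(train_rows), len(eval_rows))  # 80 20
```

Each class contributes the same fraction of its rows to the evaluation subset, so a rare label cannot end up missing from it by chance.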


In [31]:
# Save the split subsets
train.to_csv("/content/Natural-Language-Processing-Project-Sentiment-Analysis/zindi_challenge/data/train_subset.csv", index=False)
eval.to_csv("/content/Natural-Language-Processing-Project-Sentiment-Analysis/zindi_challenge/data/eval_subset.csv", index=False)

In [32]:
dataset = load_dataset('csv',
                        data_files={'train': '/content/Natural-Language-Processing-Project-Sentiment-Analysis/zindi_challenge/data/train_subset.csv',
                        'eval': '/content/Natural-Language-Processing-Project-Sentiment-Analysis/zindi_challenge/data/eval_subset.csv'}, encoding = "ISO-8859-1")


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-da5b04f8cfc7f05d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-da5b04f8cfc7f05d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


In [33]:
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_length = 512)

In [34]:
def transform_labels(example):
    """Map the original labels (-1, 0, 1) to the class ids (0, 1, 2) used by the model."""
    label = example['label']
    num = 0
    if label == -1:  # Negative
        num = 0
    elif label == 0:  # Neutral
        num = 1
    elif label == 1:  # Positive
        num = 2

    return {'labels': num}

def tokenize_data(example):
    # Pad every tweet to the tokenizer's maximum length and truncate longer ones
    return tokenizer(example['safe_text'], padding='max_length', truncation=True)

# Convert the tweets into tokens that the model can process
dataset = dataset.map(tokenize_data, batched=True)

# Transform the labels and remove the columns that are no longer needed
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)

In [35]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7999
    })
    eval: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})
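
The three features listed above can be illustrated without the tokenizer. The sketch below shows, under simplifying assumptions, what padding to a fixed length and remapping a label produce for one example. The helper name `pad_to_max_length`, the short `max_length=8` and the three middle token ids are invented for the illustration; ids 0, 1 and 2 are the `<s>`, `<pad>` and `</s>` tokens of the `roberta-base` vocabulary.

```python
PAD_ID = 1  # id of the <pad> token in the roberta-base vocabulary
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}  # Negative, Neutral, Positive


def pad_to_max_length(input_ids, max_length, pad_id=PAD_ID):
    """Right-pad token ids and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("sequence longer than max_length; it would need truncation")
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": list(input_ids) + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


# Hypothetical token ids: 0 and 2 are the <s> and </s> markers, the three
# ids in between are invented for the example.
example = pad_to_max_length([0, 100, 200, 300, 2], max_length=8)
example["labels"] = LABEL_TO_ID[1]  # a tweet labelled 1 (positive) becomes class 2
print(example)
```

The attention mask tells the model which positions hold real tokens (1) and which are padding (0).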

In [36]:
# dataset['train']

In [37]:
from transformers import TrainingArguments

# Configure the training parameters, such as `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
training_args = TrainingArguments("test_trainer", evaluation_strategy='steps', num_train_epochs=10, load_best_model_at_end=True,)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [38]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=3)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

In [39]:
train_dataset = dataset['train'].shuffle(seed=10)  # add .select(range(n)) to keep only the first n rows
eval_dataset = dataset['eval'].shuffle(seed=10)

## Another way to split the train set: select index ranges after shuffling, e.g.
# # range(int(num_rows * .8)) for the first 80% and range(int(num_rows * .8), num_rows) for the remaining 20%
# train_dataset = dataset['train'].shuffle(seed=10).select(range(40000))
# eval_dataset = dataset['train'].shuffle(seed=10).select(range(40000, 41000))

In [40]:
from transformers import Trainer

trainer = Trainer(
    model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
)

In [41]:
# !wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-11-1_11.1.105-1_amd64.deb
# !dpkg -i cuda-11-1_11.1.105-1_amd64.deb
# !apt-get update
# !apt-get install cuda



In [42]:
# !nvcc --version


In [43]:
# Launch the learning process: training 
trainer.train()



| Step | Training Loss | Validation Loss |
|---|---|---|
| 500 | 0.9105 | 0.813192 |
| 1000 | 0.782 | 0.746815 |
| 1500 | 0.7538 | 0.738309 |
| 2000 | 0.7565 | 0.706449 |
| 2500 | 0.7538 | 0.77228 |
| 3000 | 0.721 | 0.762116 |
| 3500 | 0.7302 | 0.733978 |
| 4000 | 0.7088 | 0.72013 |
| 4500 | 0.732 | 0.755758 |
| 5000 | 0.7407 | 0.793369 |


TrainOutput(global_step=10000, training_loss=0.7011243133544922, metrics={'train_runtime': 8930.653, 'train_samples_per_second': 8.957, 'train_steps_per_second': 1.12, 'total_flos': 2.104644228406272e+16, 'train_loss': 0.7011243133544922, 'epoch': 10.0})

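Don't worry if a `KeyboardInterrupt` appears at this point: it only means that the training was stopped manually to avoid a long run. In the run shown above, the training completed all 10 epochs (`'epoch': 10.0` in `TrainOutput`).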
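
The figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes the default batch size of 8 on a single device, which this notebook does not set explicitly; the other numbers come from the outputs shown earlier.

```python
import math

n_train = 7999        # rows in the training subset (see the printed shapes)
batch_size = 8        # assumed: default per_device_train_batch_size, one device
epochs = 10           # num_train_epochs
runtime_s = 8930.653  # train_runtime reported by trainer.train()

# The last batch of each epoch is smaller than the others, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1000 10000

samples_per_s = n_train * epochs / runtime_s
steps_per_s = total_steps / runtime_s
print(round(samples_per_s, 3), round(steps_per_s, 2))  # 8.957 1.12
```

Under that assumption the results agree with `global_step=10000`, `train_samples_per_second` and `train_steps_per_second` in the output.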

In [44]:
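# Note: `load_metric` is deprecated in recent versions of `datasets`;
# the `evaluate` library (`evaluate.load("accuracy")`) replaces it
metric = load_metric("accuracy")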

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [45]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

In [46]:
# Launch the final evaluation 
trainer.evaluate()

{'eval_loss': 0.6884087324142456,
 'eval_accuracy': 0.7415,
 'eval_runtime': 60.09,
 'eval_samples_per_second': 33.283,
 'eval_steps_per_second': 4.16}
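
The accuracy reported above comes from taking, for each tweet, the class with the highest logit and comparing it with the reference label. A dependency-free sketch of that computation follows; the logits and labels are made up, and `argmax` and `accuracy` are illustrative helpers rather than the metric's real implementation.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four tweets over the classes 0=negative, 1=neutral, 2=positive.
logits = [
    [2.1, 0.3, -1.0],
    [0.2, 1.5, 0.1],
    [-0.5, 0.4, 1.9],
    [0.9, 0.8, 0.1],
]
labels = [0, 1, 2, 1]
print(accuracy(logits, labels))  # 0.75
```

By the same reasoning, an `eval_accuracy` of 0.7415 on the 2000 evaluation rows corresponds to 1483 correct predictions.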

In [47]:
# Authentication token for hugging face
login()

In [49]:
# Push the fine-tuned model to the Hugging Face Hub
finetuned_model = trainer.model
finetuned_model.push_to_hub("twitter-finetuned-model")

CommitInfo(commit_url='https://huggingface.co/MaryanneMuchai/twitter-finetuned-model/commit/5f1b1ad98f02a0219f53e4d5ed7d6176f29d8856', commit_message='Upload RobertaForSequenceClassification', commit_description='', oid='5f1b1ad98f02a0219f53e4d5ed7d6176f29d8856', pr_url=None, pr_revision=None, pr_num=None)

In [50]:
# Push the tokenizer to the same repository
tokenizer.push_to_hub("twitter-finetuned-model")

CommitInfo(commit_url='https://huggingface.co/MaryanneMuchai/twitter-finetuned-model/commit/0f83417f68e1a3cbf22cd48d58de505fad99ef04', commit_message='Upload tokenizer', commit_description='', oid='0f83417f68e1a3cbf22cd48d58de505fad99ef04', pr_url=None, pr_revision=None, pr_num=None)

In [51]:
!pip freeze > requirements.txt


Some checkpoints of the model are automatically saved locally in `test_trainer/` during the training.

You may also upload the model to the Hugging Face platform. [Read more](https://huggingface.co/docs/hub/models-uploading)
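
When the hosted model is used later in another app, its raw outputs (logits) have to be turned back into readable labels. Here is a minimal sketch of that decoding step, assuming the class order set by `transform_labels` (0 = negative, 1 = neutral, 2 = positive); `ID_TO_LABEL`, `softmax`, `decode` and the logits are illustrative.

```python
import math

ID_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return the most likely label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID_TO_LABEL[best], probs[best]


label, score = decode([-0.5, 0.4, 1.9])  # made-up logits for one tweet
print(label, round(score, 3))
```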

This notebook is inspired by an article: [Fine-Tuning Bert for Tweets Classification ft. Hugging Face](https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf)

Do not hesitate to read more and to ask questions: learning is a lifelong activity.