# Sentiment Analysis with Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies. You can install its packages to access pre-built models, either to use them directly or to fine-tune them (that is, retrain them on your own dataset while reusing the knowledge gained during the original training). You can then host your trained models on the platform and use them later on other devices and in other apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

Hugging Face models are deep-learning based, so training them requires a lot of GPU compute. Use [Colab](https://colab.research.google.com/), another GPU cloud provider, or a local machine with an NVIDIA GPU.
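
Before starting a long training run, it can help to confirm that a GPU is actually visible. The following is a minimal standard-library sketch. It only checks whether the `nvidia-smi` tool is on the `PATH`, which is a heuristic chosen for this example and not an official Hugging Face or PyTorch check. The helper name `nvidia_smi_available` is hypothetical.

```python
import shutil


def nvidia_smi_available() -> bool:
    """Return True if the `nvidia-smi` executable is found on the PATH."""
    return shutil.which("nvidia-smi") is not None


if nvidia_smi_available():
    print("nvidia-smi found: an NVIDIA driver appears to be installed.")
else:
    print("nvidia-smi not found: training will likely fall back to the CPU.")
```

In Colab, a GPU runtime has to be selected in the runtime settings before a check like this can succeed.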

## Application of Hugging Face Text Classification Model Fine-tuning

Find below a simple example with just a few epochs of fine-tuning (the training configuration in this notebook sets `num_train_epochs=10`).

Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.).

In [8]:
!git clone https://github.com/muchaimaryanne/Natural-Language-Processing-Project-Sentiment-Analysis.git


fatal: destination path 'Natural-Language-Processing-Project-Sentiment-Analysis' already exists and is not an empty directory.


In [9]:
%cd Natural-Language-Processing-Project-Sentiment-Analysis


/content/Natural-Language-Processing-Project-Sentiment-Analysis


In [10]:
# Install the necessary package to create a virtual environment
!pip3 install virtualenv

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [11]:
# Create the virtual environment venv
!virtualenv venv

created virtual environment CPython3.9.16.final.0-64 in 402ms
  creator CPython3Posix(dest=/content/Natural-Language-Processing-Project-Sentiment-Analysis/venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==23.1, setuptools==67.6.1, wheel==0.40.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator


In [12]:
# Activate the virtual environment
!source venv/bin/activate
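
Each `!` command in a notebook runs in its own shell, so `source venv/bin/activate` does not change the Python interpreter that executes the notebook cells. A quick way to see which interpreter is really in use is sketched below with the standard library only. The helper name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv/virtualenv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If this prints `False`, the packages installed in the next cell go into the notebook's default environment rather than into `venv`.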

In [13]:
!pip install -r requirements.txt
!pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/cu111/torch_stable.html


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch<2.1,>=1.7
  Using cached torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl (619.9 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.9.0+cu111
    Uninstalling torch-1.9.0+cu111:
      Successfully uninstalled torch-1.9.0+cu111
Successfully installed torch-2.0.0


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/cu111/torch_stable.html
Collecting torch==1.9.0+cu111
  Using cached https://download.pytorch.org/whl/cu111/torch-1.9.0%2Bcu111-cp39-cp39-linux_x86_64.whl (2041.4 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 2.0.0
    Uninstalling torch-2.0.0:
      Successfully uninstalled torch-2.0.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.15.1+cu118 requires torch==2.0.0, but you have torch 1.9.0+cu111 which is incompatible.
torchtext 0.15.1 requires torch==2.0.0, but you have torch 1.9.0+cu111 which is incompatible.
torchdata 0.6.0 requires torch==2.0.0, but you have torch 1.9.0+cu111 which is incompatible.
torchaudio 2.0.1+cu118 requir

In [14]:
import os

import numpy as np
import pandas as pd
from datasets import load_dataset, load_metric
from sklearn.model_selection import train_test_split
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

In [15]:
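# Disable Weights & Biases (W&B) logging
os.environ["WANDB_DISABLED"] = "true"
# Make CUDA calls synchronous so that errors are reported at the failing call
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'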

In [16]:
# Load the dataset and display some values
df = pd.read_csv('/content/Natural-Language-Processing-Project-Sentiment-Analysis/zindi_challenge/data/Train.csv')

# A way to eliminate rows containing NaN values
df = df[~df.isna().any(axis=1)]
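
To make the NaN filter above more concrete, here is a plain-Python sketch of what the mask `~df.isna().any(axis=1)` expresses: keep a row only if none of its values is missing. The rows are toy data invented for this illustration, and `is_missing` is a hypothetical helper, not a pandas function.

```python
import math

# Toy rows standing in for the CSV; None and NaN mark missing values.
rows = [
    {"tweet_id": "A1", "safe_text": "vaccines save lives", "label": 1.0},
    {"tweet_id": "B2", "safe_text": None, "label": 0.0},
    {"tweet_id": "C3", "safe_text": "not sure about this", "label": float("nan")},
]


def is_missing(value) -> bool:
    """Treat None and float NaN as missing, similar in spirit to pandas' isna()."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Keep a row only if none of its values is missing.
clean_rows = [row for row in rows if not any(is_missing(v) for v in row.values())]
print([row["tweet_id"] for row in clean_rows])  # prints ['A1']
```

In pandas itself, `df.dropna()` is a shorter way to get the same result.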


In [17]:
df.isna().sum()

tweet_id     0
safe_text    0
label        0
agreement    0
dtype: int64

In [18]:
df.isnull().sum()

tweet_id     0
safe_text    0
label        0
agreement    0
dtype: int64

I manually split the training set into a training subset (the data the model learns from) and an evaluation subset (the data used to compute metric scores, which helps to detect training problems such as [overfitting](https://www.ibm.com/cloud/learn/overfitting)).

There are multiple ways to split the dataset. Further down, two commented-out lines show another one.
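
The split in this notebook passes the `stratify` argument so that both subsets keep roughly the same label proportions. The sketch below illustrates that idea with the standard library only. It is not how scikit-learn implements it, and the function name `stratified_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, eval_indices) keeping label proportions similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, eval_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_eval = round(len(indices) * test_size)   # per-class evaluation share
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return train_idx, eval_idx


# Toy labels: 50 negative (-1), 30 neutral (0), 20 positive (1).
labels = [-1] * 50 + [0] * 30 + [1] * 20
train_idx, eval_idx = stratified_split(labels)
print(len(train_idx), len(eval_idx))  # prints 80 20
print(Counter(labels[i] for i in eval_idx))
```

Each class contributes 20% of its rows to the evaluation subset, so a rare class cannot end up missing from it by chance.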

In [19]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [20]:
train.head()

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 9305 | YMRMEDME | Mickey's Measles has gone international <url> | 0.0 | 1.0 |
| 3907 | 5GV8NEZS | S1256 [NEW] Extends exemption from charitable ... | 0.0 | 1.0 |
| 795 | EI10PS46 | <user> your ignorance on vaccines isn't just ... | 1.0 | 0.666667 |
| 5793 | OM26E6DG | Pakistan partly suspends polio vaccination pro... | 0.0 | 1.0 |
| 3431 | NBBY86FX | In other news I've gone up like 1000 mmr | 0.0 | 1.0 |


In [21]:
eval.head()

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 6571 | R7JPIFN7 | Children's Museum of Houston to Offer Free Vac... | 1.0 | 1.0 |
| 1754 | 2DD250VN | <user> no. I was properly immunized prior to t... | 1.0 | 1.0 |
| 3325 | ESEVBTFN | <user> thx for posting vaccinations are impera... | 1.0 | 1.0 |
| 1485 | S17ZU0LC | This Baby Is Exactly Why Everyone Needs To Vac... | 1.0 | 0.666667 |
| 4175 | IIN5D33V | Meeting tonight, 8:30pm in room 322 of the stu... | 1.0 | 1.0 |


In [22]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (7999, 4), eval is (2000, 4)


In [23]:
# Save the split subsets
train.to_csv("/content/Natural-Language-Processing-Project-Sentiment-Analysis/zindi_challenge/data/train_subset.csv", index=False)
eval.to_csv("/content/Natural-Language-Processing-Project-Sentiment-Analysis/zindi_challenge/data/eval_subset.csv", index=False)

In [24]:
dataset = load_dataset('csv',
                        data_files={'train': '/content/Natural-Language-Processing-Project-Sentiment-Analysis/zindi_challenge/data/train_subset.csv',
                        'eval': '/content/Natural-Language-Processing-Project-Sentiment-Analysis/zindi_challenge/data/eval_subset.csv'}, encoding = "ISO-8859-1")


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-b13ddabb51d03a38/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-b13ddabb51d03a38/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


In [25]:
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)

In [26]:
def transform_labels(example):
    # Map the raw labels (-1, 0, 1) to the class ids (0, 1, 2) expected by the model
    label = example['label']
    num = 0
    if label == -1:  # Negative
        num = 0
    elif label == 0:  # Neutral
        num = 1
    elif label == 1:  # Positive
        num = 2

    return {'labels': num}

def tokenize_data(example):
    # Pad every tweet to the tokenizer's maximum length and truncate longer ones
    return tokenizer(example['safe_text'], padding='max_length', truncation=True)

# Convert the tweets into tokens that the model can use
dataset = dataset.map(tokenize_data, batched=True)

# Transform the labels and remove the columns that are no longer needed
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)
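
The label mapping in `transform_labels` can also be written as a dictionary lookup. The sketch below is one possible variant, not the notebook's own code: it raises an error on an unexpected label instead of silently falling back to class 0. The names `LABEL_TO_ID` and `encode_label` are hypothetical.

```python
# Mapping taken from the comments in transform_labels: Negative, Neutral, Positive.
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}


def encode_label(raw_label) -> int:
    """Map a raw label (-1, 0, 1; possibly stored as a float) to a class id."""
    key = int(raw_label)
    if key != raw_label or key not in LABEL_TO_ID:
        raise ValueError(f"Unexpected label: {raw_label!r}")
    return LABEL_TO_ID[key]


print([encode_label(x) for x in (-1.0, 0.0, 1.0)])  # prints [0, 1, 2]
```

Failing loudly can be useful here, because a stray label value would otherwise be trained as "Negative" without any warning.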

In [27]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7999
    })
    eval: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})
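
The `input_ids` and `attention_mask` features listed above come from the tokenizer. As a rough illustration of how padding to a fixed length relates the two, here is a toy sketch. The token ids, the padding id `PAD_ID = 0`, and the tiny `MAX_LENGTH` are made up for the example. The real RoBERTa tokenizer uses its own vocabulary, special tokens, and padding id.

```python
PAD_ID = 0      # hypothetical padding id for this toy example
MAX_LENGTH = 6  # tiny limit so the effect is easy to see


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


print(pad_or_truncate([11, 12, 13]))
# prints ([11, 12, 13, 0, 0, 0], [1, 1, 1, 0, 0, 0])
print(pad_or_truncate([11, 12, 13, 14, 15, 16, 17, 18]))
# prints ([11, 12, 13, 14, 15, 16], [1, 1, 1, 1, 1, 1])
```

The attention mask is what lets the model ignore the padded positions.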

In [28]:
# dataset['train']

In [29]:
from transformers import TrainingArguments

# Configure the training parameters, such as `num_train_epochs`:
# the number of times the training loop passes over the whole dataset
training_args = TrainingArguments(
    "test_trainer",
    evaluation_strategy='steps',
    num_train_epochs=10,
    load_best_model_at_end=True,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
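
To get a feeling for how long this configuration trains, the number of optimization steps can be estimated with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, the library's default per-device batch size of 8, and the default evaluation interval of 500 steps. None of these values is set explicitly in this notebook, so treat them as assumptions.

```python
import math

num_examples = 7999   # size of the training subset reported earlier
num_epochs = 10       # num_train_epochs in the configuration
batch_size = 8        # assumed: default per-device batch size
eval_every = 500      # assumed: default evaluation interval in steps

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_every

print(steps_per_epoch, total_steps, num_evaluations)  # prints 1000 10000 20
```

Under these assumptions a full run is about 10,000 steps, which explains why training takes a long time on modest hardware.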


In [30]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=3)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

In [31]:
train_dataset = dataset['train'].shuffle(seed=10)  # append .select(range(n)) to keep only the first n rows
eval_dataset = dataset['eval'].shuffle(seed=10)

# Another way to split the train set: shuffle once, then select index ranges.
# Use range(int(num_rows * 0.8)) for the first 80% and
# range(int(num_rows * 0.8), num_rows) for the remaining 20%.
# train_dataset = dataset['train'].shuffle(seed=10).select(range(40000))
# eval_dataset = dataset['train'].shuffle(seed=10).select(range(40000, 41000))

In [32]:
from transformers import Trainer

trainer = Trainer(
    model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
)

In [33]:
# !wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-11-1_11.1.105-1_amd64.deb
# !dpkg -i cuda-11-1_11.1.105-1_amd64.deb
# !apt-get update
# !apt-get install cuda



In [34]:
# !nvcc --version


In [None]:
# Launch the learning process: training 
trainer.train()



Step,Training Loss,Validation Loss


Don't worry about the interrupted output above: it is a `KeyboardInterrupt`, which means I stopped the training early because a full run takes a long time.

In [None]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
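
To show what `compute_metrics` calculates, here is a dependency-free sketch of the same idea: take the class with the highest logit for each example, then compare it with the true label. The logits and labels are made up for illustration, and `accuracy_from_logits` is a hypothetical helper rather than part of any library.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four tweets and three classes (Negative, Neutral, Positive).
toy_logits = [
    [2.0, 0.1, -1.0],   # predicts 0
    [0.2, 1.5, 0.3],    # predicts 1
    [-0.5, 0.0, 0.9],   # predicts 2
    [1.1, 0.4, 0.2],    # predicts 0
]
toy_labels = [0, 1, 2, 1]

print(accuracy_from_logits(toy_logits, toy_labels))  # prints 0.75
```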

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

In [None]:
# Launch the final evaluation 
trainer.evaluate()

In [None]:
# Authenticate with Hugging Face. Never hard-code an access token in a notebook;
# notebook_login() prompts for it instead.
from huggingface_hub import notebook_login
notebook_login()

In [None]:
# Push the fine-tuned model to the Hugging Face Hub
finetuned_model = trainer.model
finetuned_model.push_to_hub("twitter-finetuned-model")

In [None]:
tokenizer.push_to_hub("twitter-finetuned-model")

Checkpoints of the model are saved automatically in `test_trainer/` during training.
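
Because the training arguments set `load_best_model_at_end=True`, the trainer reloads the checkpoint that scored best on the evaluation subset, which by default means the lowest validation loss. The selection itself is simple, as this sketch shows. The loss values below are invented for illustration and are not results from this notebook.

```python
# Hypothetical validation losses per checkpoint; these numbers are made up.
eval_loss_by_checkpoint = {
    "test_trainer/checkpoint-500": 0.71,
    "test_trainer/checkpoint-1000": 0.63,
    "test_trainer/checkpoint-1500": 0.66,
    "test_trainer/checkpoint-2000": 0.74,
}

# Lower validation loss is better, so pick the key with the minimum value.
best_checkpoint = min(eval_loss_by_checkpoint, key=eval_loss_by_checkpoint.get)
print(best_checkpoint)  # prints test_trainer/checkpoint-1000
```

In this made-up run the loss rises again after step 1000, which is the typical sign of overfitting that the evaluation subset is meant to reveal.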

You may also upload the model to the Hugging Face platform. [Read more](https://huggingface.co/docs/hub/models-uploading)

This notebook is inspired by the article [Fine-Tuning Bert for Tweets Classification ft. Hugging Face](https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf).

Do not hesitate to read more and to ask questions; learning is a lifelong activity.