## Resources used for fine-tuning the Hugging Face model
## Please read the following article, which is essentially a "Hello World"
## of fine-tuning a Hugging Face sentiment classification model
https://huggingface.co/blog/sentiment-analysis-python

## Creating and passing a custom dataset to the HuggingFace Model
https://huggingface.co/transformers/v3.2.0/custom_datasets.html

In [1]:
# Installing the dependencies
!pip install -q transformers[torch] datasets huggingface_hub
!apt-get install git-lfs


In [45]:
from transformers import AutoTokenizer
from transformers import DistilBertTokenizerFast

from transformers import AutoModelForSequenceClassification
from transformers import DistilBertForSequenceClassification

from transformers import DataCollatorWithPadding


from transformers import TrainingArguments, Trainer


import torch

import numpy as np
import pandas as pd


from sklearn.model_selection import train_test_split

# Metrics
from datasets import load_metric

import json
import gc
import copy

In [3]:
torch.cuda.is_available()

True

# ------------------------------------------------------
### Place kaggle.json into the /root/.kaggle directory
# ------------------------------------------------------

In [4]:
# !mv ./kaggle.json /root/.kaggle
# !chmod 600 /root/.kaggle/kaggle.json

# Downloading the Amazon Appliances reviews dataset

In [6]:
# Some other datasets that can be used for fine-tuning the Hugging Face model
# 1) https://nijianmo.github.io/amazon/index.html -> Contains various Amazon datasets for sentiment classification and other tasks
# 2) https://www.yelp.com/dataset -> Another dataset that can be used for training and fine-tuning a sentiment classification model

# 3) https://research.aimultiple.com/sentiment-analysis-dataset/ -> Many good datasets for sentiment classification
# 4) https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/ -> Download various Amazon reviews from here for sentiment classification.
#    Note that this link and the Amazon link above (1) point to the same data; this link was acquired after authentication, which can also be done via link (1).

In [7]:
!rm -rf ./Appliances.json.gz ./Appliances.json

In [8]:
!wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFiles/Appliances.json.gz
!gzip -d ./Appliances.json.gz

--2023-11-03 04:35:06--  https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFiles/Appliances.json.gz
Resolving datarepo.eng.ucsd.edu (datarepo.eng.ucsd.edu)... 132.239.8.30
Connecting to datarepo.eng.ucsd.edu (datarepo.eng.ucsd.edu)|132.239.8.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69677301 (66M) [application/x-gzip]
Saving to: ‘Appliances.json.gz’


2023-11-03 04:35:11 (14.0 MB/s) - ‘Appliances.json.gz’ saved [69677301/69677301]



In [9]:
# Loading in the JSON data
# It's not a standard JSON file:
# each record is written as a separate JSON object, one per line (JSON Lines),
# so we cannot parse the whole file with a single json.load call
appliances_data_json = []
with open("./Appliances.json", "r") as json_file:
  appliances_data_json = json_file.readlines()
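The comments above describe a one-JSON-object-per-line layout. A minimal sketch of parsing such data lazily is shown here, using a small in-memory sample instead of the real Appliances.json. The `iter_json_lines` helper and the two sample records are made up for illustration and are not part of the original notebook.

```python
import io
import json

def iter_json_lines(handle):
    """Yield one parsed dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical two-record sample standing in for Appliances.json
sample = io.StringIO(
    '{"overall": 5.0, "reviewText": "Works great", "summary": "Good"}\n'
    '{"overall": 2.0, "reviewText": "Broke in a week", "summary": "Bad"}\n'
)
records = list(iter_json_lines(sample))
print(len(records), sorted(records[0].keys()))
```

Parsing line by line like this avoids holding both the raw lines and the parsed objects in memory, which may matter for the larger category files.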

In [10]:
# Let's probe the structure of the JSON
appliances_data_json[0]

'{"overall": 5.0, "vote": "2", "verified": false, "reviewTime": "11 27, 2013", "reviewerID": "A3NHUQ33CFH3VM", "asin": "1118461304", "style": {"Format:": " Hardcover"}, "reviewerName": "Greeny", "reviewText": "Not one thing in this book seemed an obvious original thought. However, the clarity with which this author explains how innovation happens is remarkable.\\n\\nAlan Gregerman discusses the meaning of human interactions and the kinds of situations that tend to inspire original and/or clear thinking that leads to innovation. These things include how people communicate in certain situations such as when they are outside of their normal patterns.\\n\\nGregerman identifies the ingredients that make innovation more likely. This includes people being compelled to interact when they normally wouldn\'t, leading to serendipity. Sometimes the phenomenon will occur through collaboration, and sometimes by chance such as when an individual is away from home on travel.\\n\\nI recommend this book

In [11]:
j_obj = json.loads(appliances_data_json[0])
j_obj.keys()

dict_keys(['overall', 'vote', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style', 'reviewerName', 'reviewText', 'summary', 'unixReviewTime'])

## As we are interested in the sentiment classification task, we will extract and isolate the datapoints of interest.

In [12]:
# To keep the code tidy, I am putting the required lists inside a single dict
# This is not required for building the pandas DataFrame; it only keeps the related lists together
extracted_data = {
    "overall" : [],
    "reviewText" : [],
    "summary" : [],
    "label" : [] # Is not present in the data but will be computed and added on the fly
}

In [13]:
cols_of_interest = ["overall", "reviewText", "summary"]

## How many samples to take for (training  + validation)
### Due to limited GPU resources, I am reducing the number of training + validation samples

In [14]:
samples_to_take = 100_000

## Loading data into a custom dictionary before putting it into a PyTorch Dataset

In [15]:
cur_sample_count = 0

for data_point in appliances_data_json:

  # Control how many samples to take
  cur_sample_count+=1
  if cur_sample_count > samples_to_take:
    break

  parsed_json = json.loads(data_point)

  # Hold the current datapoint temporarily. After the cols_of_interest loop
  # we check that every value is present (no None/NaN), so that no
  # post-processing for missing data is needed after loading.
  temp_extracted_data = {
      "overall" : None,
      "reviewText" : None,
      "summary" : None
  }

  for col_of_interest in cols_of_interest:

    if col_of_interest in parsed_json:
      temp_extracted_data[col_of_interest] = parsed_json[col_of_interest]

  # Only add this sample if all the required fields are present
  if temp_extracted_data["overall"] and temp_extracted_data["reviewText"] and temp_extracted_data["summary"]:
    for k,v in temp_extracted_data.items():
      extracted_data[k].append(v)
    # Compute and add the label
    # Reviews rated 3 or lower are treated as critical/negative (0); ratings above 3 are positive (1)
    extracted_data["label"].append(1 if temp_extracted_data["overall"] > 3.0 else 0)
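The filtering and labelling rule used in the loop can also be expressed as one small function, which makes it easy to check on a few records. This is only a sketch: `to_labelled_sample`, `REQUIRED`, and the sample records are hypothetical names and data, and the threshold of 3.0 is taken from the loop above.

```python
REQUIRED = ("overall", "reviewText", "summary")

def to_labelled_sample(record, threshold=3.0):
    """Return the required fields plus a binary label, or None if a field is missing or empty."""
    if not all(record.get(col) for col in REQUIRED):
        return None
    sample = {col: record[col] for col in REQUIRED}
    # Ratings strictly above the threshold are positive (1), everything else negative (0)
    sample["label"] = 1 if record["overall"] > threshold else 0
    return sample

# Hypothetical records: positive, negative (a rating of exactly 3), and incomplete
positive = to_labelled_sample({"overall": 4.0, "reviewText": "Nice", "summary": "ok"})
negative = to_labelled_sample({"overall": 3.0, "reviewText": "Meh", "summary": "so-so"})
incomplete = to_labelled_sample({"overall": 5.0, "summary": "no text"})
print(positive, negative, incomplete)
```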

In [16]:
# For simplicity, rename `reviewText` to `text`
extracted_data["text"] = extracted_data.pop("reviewText")

### Converting to a pandas DataFrame
##  I am only using the pandas DataFrame for EDA and initial data probing
##  For data exchange with the model I will be using PyTorch datasets
## As this is the recommended way to do things in the official Hugging Face docs

In [17]:
df = pd.DataFrame(extracted_data)

In [18]:
# Percentage of rows with at least one missing value (across all columns)
miss_percentage = 100 * df.isna().any(axis=1).mean()
print(miss_percentage)

0.0


In [19]:
# Dropping NA rows, as the missing percentage is quite low
df = df.dropna()
df = df.reset_index(drop=True)

In [20]:
df["overall"].mean()

4.235152279185177

In [21]:
df.head()

Unnamed: 0,overall,summary,label,text
0,5.0,Clear on what leads to innovation,1,Not one thing in this book seemed an obvious o...
1,5.0,Becoming more innovative by opening yourself t...,1,I have enjoyed Dr. Alan Gregerman's weekly blo...
2,5.0,The World from Different Perspectives,1,Alan Gregerman believes that innovation comes ...
3,5.0,Strangers are Your New Best Friends,1,"Alan Gregerman is a smart, funny, entertaining..."
4,5.0,"How and why it is imperative to engage, learn ...",1,"As I began to read this book, I was again remi..."


In [22]:
def map_value(in_v, in_min, in_max, out_min, out_max):
    """Helper method to map an input value (in_v)
       between alternative max/min ranges."""
    v = (in_v - in_min) * (out_max - out_min) / (in_max - in_min) + out_min
    if v < out_min:
        v = out_min
    elif v > out_max:
        v = out_max
    return v
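To see what this mapping does to star ratings, here is a compact restatement of the same linear rescale-and-clamp, evaluated on a few inputs. The compact body and the out-of-range input 7.0 are illustrative additions, not the notebook's code. Because the input range is taken as 0 to 5 while ratings start at 1, the lowest rating maps to 20 rather than 0.

```python
def map_value(in_v, in_min, in_max, out_min, out_max):
    """Linearly rescale in_v from [in_min, in_max] to [out_min, out_max], then clamp."""
    v = (in_v - in_min) * (out_max - out_min) / (in_max - in_min) + out_min
    return max(out_min, min(out_max, v))

# 1.0, 3.0 and 5.0 are valid ratings; 7.0 is out of range and gets clamped
mapped = {rating: map_value(rating, 0, 5, 0, 100) for rating in (1.0, 3.0, 5.0, 7.0)}
print(mapped)
```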


In [23]:
df["positive_sentiment_percentage"] = df["overall"].map(lambda x: map_value(x, 0, 5, 0, 100))

In [24]:
# Also adding a label column which indicates whether or not a particular sentiment is positive or negative
# df["label"] = df["positive_sentiment_percentage"].map(lambda x: 1 if x > 70 else 0)

In [25]:
# Renaming reviewText to text for simplicity
df.rename(columns= {"reviewText" : "text"}, inplace=True)

In [26]:
df.head()

Unnamed: 0,overall,summary,label,text,positive_sentiment_percentage
0,5.0,Clear on what leads to innovation,1,Not one thing in this book seemed an obvious o...,100.0
1,5.0,Becoming more innovative by opening yourself t...,1,I have enjoyed Dr. Alan Gregerman's weekly blo...,100.0
2,5.0,The World from Different Perspectives,1,Alan Gregerman believes that innovation comes ...,100.0
3,5.0,Strangers are Your New Best Friends,1,"Alan Gregerman is a smart, funny, entertaining...",100.0
4,5.0,"How and why it is imperative to engage, learn ...",1,"As I began to read this book, I was again remi...",100.0


## Splitting data into training and validation sets

In [27]:
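# With shuffle=False the split is a plain slice (first 80% train, last 20% validation), so random_state has no effect
train_text, val_text, train_labels, val_labels = train_test_split(extracted_data["text"], extracted_data["label"], test_size=0.2, random_state=42, shuffle=False)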
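Since the split is done with `shuffle=False`, it amounts to slicing the data in file order. The sketch below mirrors that behaviour in plain Python, as far as I understand scikit-learn's rounding (test size rounded up); `ordered_split` and the toy `texts` list are hypothetical.

```python
import math

def ordered_split(items, test_size=0.2):
    """Slice-based split: the first part is used for training, the last part for validation."""
    n_test = math.ceil(len(items) * test_size)
    n_train = len(items) - n_test
    return items[:n_train], items[n_train:]

texts = [f"review {i}" for i in range(10)]
train_part, val_part = ordered_split(texts)
print(train_part)
print(val_part)
```

One consequence worth keeping in mind: if the file is ordered by product or by date, the validation set comes entirely from the tail of the file and may differ in distribution from the training set.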

## Tokenization and encodings

In [28]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


# Creating the PyTorch dataset that will be passed to the Hugging Face model

In [29]:
# Side note (difference between truncation and padding)
# Truncation -> Cuts every example longer than the model's maximum input length down to that length
# Padding -> Extends all inputs to the same length
# Padding everything up front consumes significantly more memory, so only truncation is applied here

train_encodings = tokenizer(train_text, truncation=True)
val_encodings = tokenizer(val_text, truncation=True)
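The memory remark in the side note can be made concrete with a little arithmetic. The sketch below compares padding every sequence to the global maximum with padding each batch only to its own longest sequence, which is the strategy `DataCollatorWithPadding` applies later in this notebook. The token lengths and the `padded_token_count` helper are made up for illustration.

```python
def padded_token_count(lengths, batch_size, dynamic):
    """Total number of token slots held after padding, summed over all batches."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Dynamic padding targets the longest sequence in this batch only
        target = max(batch) if dynamic else global_max
        total += target * len(batch)
    return total

# Hypothetical token counts for eight reviews (already truncated to at most 512)
lengths = [12, 15, 9, 20, 300, 512, 40, 35]
static_total = padded_token_count(lengths, batch_size=4, dynamic=False)
dynamic_total = padded_token_count(lengths, batch_size=4, dynamic=True)
print("pad to global max:", static_total)
print("pad per batch:", dynamic_total)
```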

In [30]:
class CustomAmazonProductDataset(torch.utils.data.Dataset):

  def __init__(self, text_encodings, labels):

    self.text_encodings = text_encodings
    self.labels = labels

  def __getitem__(self, idx):
    # If the encoding access below is unclear: the tokenizer output behaves
    # like a dict of parallel lists, with one entry per sample, e.g.
    '''
    text_encodings = {
        "input_ids" : [[101, ..., 102], ...],
        "attention_mask" : [[1, ..., 1], ...]
    }
    '''
    # For each key we pick out the values at position idx,
    # so `item` holds the encoding of a single sample.
    # Tensors stay on the CPU here; the Trainer moves each batch to the model's device.

    item = {key: torch.tensor(val[idx]) for key, val in self.text_encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])

    return item

  def __len__(self):
    return len(self.labels)
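The indexing logic of this dataset can be tried out without PyTorch or a tokenizer. The class below is a torch-free stand-in that keeps the same `__getitem__` / `__len__` shape but returns plain lists; `ListBackedDataset` and the toy encodings are hypothetical and only meant to show how one sample is assembled from the parallel lists.

```python
class ListBackedDataset:
    """Torch-free stand-in: same indexing logic as the dataset above, plain lists instead of tensors."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick position idx from every parallel list to build one sample
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

# Hypothetical encodings for two short texts of different length
encodings = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1]],
}
dataset = ListBackedDataset(encodings, labels=[1, 0])
print(len(dataset), dataset[1])
```

Note that the two samples have different lengths; bringing them to a common length is left to the data collator at batch time.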


In [31]:
train_dataset = CustomAmazonProductDataset(train_encodings, train_labels)
val_dataset = CustomAmazonProductDataset(val_encodings, val_labels)

## Now fine-tuning the HuggingFace Sentiment classification model

In [46]:
def compute_metrics(eval_pred):

  load_accuracy = load_metric("accuracy")
  load_f1 = load_metric("f1")

  logits, labels = eval_pred

  predictions = np.argmax(logits, axis=-1)

  accuracy = load_accuracy.compute(predictions=predictions, references=labels)
  f1 = load_f1.compute(predictions=predictions, references=labels)

  return {"accuracy": accuracy, "f1": f1}

In [33]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Instantiate a pretrained DistilBERT model

In [34]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
model.to("cuda")


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

## Logging in to the Hugging Face Hub (needed because push_to_hub=True is used below)

In [35]:
## Login to HuggingFace hub
from huggingface_hub import notebook_login
notebook_login()


## Defining training parameters

In [39]:
repo_name = "finetuning-sentiment-classification-model-with-amazon-appliances-data"

In [40]:
training_args = TrainingArguments(
   output_dir=repo_name,
   num_train_epochs=2,
   learning_rate=2e-5,
   per_device_train_batch_size=32,
   per_device_eval_batch_size=32,
   weight_decay=0.01,
   save_strategy="epoch",
   push_to_hub=True,
)

In [48]:
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=train_dataset,
   eval_dataset=val_dataset,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)

In [42]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.2512
1000,0.1943
1500,0.184
2000,0.1738
2500,0.164
3000,0.1277
3500,0.1319
4000,0.1238
4500,0.1246


TrainOutput(global_step=4998, training_loss=0.1598491294711244, metrics={'train_runtime': 4189.3684, 'train_samples_per_second': 38.172, 'train_steps_per_second': 1.193, 'total_flos': 1.2790771555902768e+16, 'train_loss': 0.1598491294711244, 'epoch': 2.0})
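The `global_step` and `train_steps_per_second` values in this output follow from the batch size, the number of epochs, and the training set size. The sketch below reproduces the arithmetic for a single-device run without gradient accumulation. The training set size is not printed anywhere in the notebook, so `n_train = 79_968` is an assumed value chosen to be consistent with the log; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(n_train, batch_size, epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs

n_train = 79_968  # assumption: not shown in the notebook, picked to match the log
steps = total_optimizer_steps(n_train, batch_size=32, epochs=2)
steps_per_second = steps / 4189.3684  # train_runtime taken from the output above
print(steps, round(steps_per_second, 3))
```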

In [49]:
trainer.evaluate()


Trainer is attempting to log a value of "{'accuracy': 0.9405702851425712}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.9630918354666336}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 0.15951979160308838,
 'eval_accuracy': {'accuracy': 0.9405702851425712},
 'eval_f1': {'f1': 0.9630918354666336},
 'eval_runtime': 156.0828,
 'eval_samples_per_second': 128.073,
 'eval_steps_per_second': 4.004}
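The TensorBoard warnings and the nested `eval_accuracy` / `eval_f1` entries above appear because `compute_metrics` returns the dicts produced by `.compute()` as values, rather than plain numbers. Returning floats (for example `accuracy["accuracy"]` and `f1["f1"]`) should avoid this. The sketch below shows the idea with accuracy and binary F1 computed by hand, so that it runs without downloading any metric scripts; `flat_metrics` and the toy logits are hypothetical and not the notebook's implementation.

```python
def flat_metrics(predictions, labels):
    """Accuracy and binary F1 (positive class = 1) returned as plain floats."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    correct = sum(1 for p, y in pairs if p == y)
    denom = 2 * tp + fp + fn
    return {
        "accuracy": correct / len(pairs),
        "f1": 2 * tp / denom if denom else 0.0,
    }

# Hypothetical logits for four samples; the index of the largest logit is the predicted class
logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4], [1.5, 0.2]]
predictions = [row.index(max(row)) for row in logits]
labels = [1, 0, 0, 0]
metrics = flat_metrics(predictions, labels)
print(metrics)
```

With flat values like these, each metric can be logged as a scalar, and the evaluation dict holds numbers directly instead of one-entry dicts.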