# Flipkart Customer Reviews

Dataset Source: [Kaggle](https://www.kaggle.com/datasets/niraliivaghani/flipkart-product-customer-reviews-dataset)

## <b>Sentiment Analysis</b>

### <b><i>Fine-tuning a DistilBERT model on the review data</i></b>

<br><br><br>
### Results:

Sentiment Categories - <i>Positive, Negative, Neutral</i> 

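Recall (micro-averaged): 0.94

Precision (micro-averaged): 0.94

Both scores were measured on the 160,379 reviews held out from the 20,000 used for fine-tuning. With one label per review, micro-averaged precision and recall both equal overall accuracy, which is why the two numbers are identical.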


<br><br><br>

## Libraries and Data

In [None]:
# import locale
# locale.getpreferredencoding = lambda: "UTF-8"
!pip install datasets transformers


In [3]:
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
from datasets import load_dataset, Dataset

import torch
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_metric


In [4]:
# Read data 

data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Flipkart Product Reviews - Kaggle/Dataset-SA.csv')

# Convert sentiments to category and drop NA value rows

data['sentiment_code'] = pd.Categorical(data.Sentiment).codes
data['sentiment_code'] = data['sentiment_code'].astype('Int64')
data.dropna(inplace = True)

# Convert data into Dataset object for using with distilBERT

data_2 = Dataset.from_pandas(data[['Summary', 'sentiment_code']])
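
`pd.Categorical(...).codes` numbers string categories in sorted order, so the labels end up coded alphabetically. The plain-Python sketch below mirrors that rule on a toy list to show where the 0/1/2 codes come from. The `labels` values are illustrative, not rows from the dataset.

```python
# Toy labels for illustration only; the real column is data.Sentiment.
labels = ["positive", "neutral", "negative", "positive"]

# Categories are the sorted unique values; each code is the index in that order.
categories = sorted(set(labels))
label_to_code = {label: code for code, label in enumerate(categories)}
codes = [label_to_code[label] for label in labels]

print(label_to_code)  # {'negative': 0, 'neutral': 1, 'positive': 2}
print(codes)          # [2, 1, 0, 2]
```

This matches the `sentiment_code` column shown in the next cell, where positive rows carry code 2, neutral rows code 1, and negative rows code 0.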


In [5]:
data.head()

Unnamed: 0,product_name,product_price,Rate,Review,Summary,Sentiment,sentiment_code
0,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,super!,great cooler excellent air flow and for this p...,positive,2
1,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,awesome,best budget 2 fit cooler nice cooling,positive,2
2,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,the quality is good but the power of air is de...,positive,2
3,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,1,useless product,very bad product its a only a fan,negative,0
4,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,ok ok product,neutral,1


In [6]:
len(data)

180379

In [7]:
data.Sentiment.value_counts()

positive    147171
negative     24401
neutral       8807
Name: Sentiment, dtype: int64
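
The counts above are heavily skewed towards positive reviews, which matters when reading an accuracy-style score. As a rough reference point, the sketch below computes each class's share and the accuracy of a trivial classifier that always predicts the majority class. It uses the full-dataset counts, so the share in the shuffled test split is only assumed to be similar.

```python
# Class counts copied from the value_counts() output above.
counts = {"positive": 147171, "negative": 24401, "neutral": 8807}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

for label, share in shares.items():
    print(f"{label:<9}{share:.1%}")

# Accuracy of always predicting the most frequent class.
print(f"Majority-class baseline ({majority_label}): {shares[majority_label]:.1%}")
```

Any reported accuracy is best compared against this baseline of roughly 82% rather than against zero.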

In [8]:
data.Summary[0]

'great cooler excellent air flow and for this price its so amazing and unbelievablejust love it'

## Modelling

In [9]:
torch.cuda.is_available()

True

In [10]:
# Train-Test data split

# Shuffle once, then take the first 20,000 rows for training and the rest for testing
shuffled = data_2.shuffle(seed=42)
train = shuffled.select(range(20000))
test = shuffled.select(range(20000, len(shuffled)))


print(train[0])
print(test[0])

{'Summary': 'good quality product i think price little bit high otherwise awesome stuff', 'sentiment_code': 2, '__index_level_0__': 138001}
{'Summary': 'gud product', 'sentiment_code': 2, '__index_level_0__': 17880}


In [11]:
# Load tokenizer from distilBERT

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


In [12]:
# Tokenize train and test data

def tokenize_function(batch):
    # With batched=True, `batch` is a dict of columns, so batch["Summary"] is a list of strings
    return tokenizer(batch["Summary"], truncation=True)

tokenized_train = train.map(tokenize_function, batched=True)
tokenized_test = test.map(tokenize_function, batched=True)

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/160379 [00:00<?, ? examples/s]

In [13]:
# Process train and test data to match the requirements of distilBERT 

tokenized_train = tokenized_train.remove_columns('__index_level_0__')
tokenized_test = tokenized_test.remove_columns('__index_level_0__')


tokenized_train = tokenized_train.rename_column("Summary", "text")
tokenized_train = tokenized_train.rename_column("sentiment_code", "labels")

tokenized_test = tokenized_test.rename_column("Summary", "text")
tokenized_test = tokenized_test.rename_column("sentiment_code", "labels")


In [14]:
tokenized_test

Dataset({
    features: ['text', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 160379
})

In [15]:
# Define data collator

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define DistilBERT
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_clas
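
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset to one fixed length. The sketch below reproduces that idea in plain Python. `pad_batch` is a hypothetical helper written for illustration, not part of `transformers`, and the token IDs are example values.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two tokenized reviews of different lengths (IDs are illustrative).
batch = [[101, 2204, 102], [101, 2204, 3737, 4031, 102]]
padded = pad_batch(batch)

print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch keeps short reviews such as 'gud product' from being stretched to the length of the longest review in the dataset.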

In [20]:
# Define function to compute metrics

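def compute_metrics(eval_pred):
    """Return micro-averaged recall and precision for a batch of evaluation predictions."""

    # Note: load_metric is deprecated in recent `datasets` releases;
    # the `evaluate` library (evaluate.load) is its replacement.
    load_recall = load_metric('recall')
    load_precision = load_metric('precision')

    # eval_pred holds the raw logits and the true labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Micro-averaging pools all classes, so both values equal overall accuracy here
    recall = load_recall.compute(predictions=predictions, references=labels, average="micro")["recall"]
    precision = load_precision.compute(predictions=predictions, references=labels, average="micro")["precision"]

    return {"recall": recall, "precision": precision}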
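
Because the metrics are micro-averaged, they pool every class together and can hide weak performance on the smaller negative and neutral classes. The sketch below is a dependency-free illustration of the difference between micro and macro averaging. `precision_recall` is a hypothetical helper and the labels are toy values, not model output.

```python
def precision_recall(y_true, y_pred, num_labels=3):
    """Return micro- and macro-averaged precision and recall for single-label data."""
    tp = [0] * num_labels  # true positives per class
    fp = [0] * num_labels  # false positives per class
    fn = [0] * num_labels  # false negatives per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def ratio(a, b):
        return a / b if b else 0.0

    return {
        "micro_precision": ratio(sum(tp), sum(tp) + sum(fp)),
        "micro_recall": ratio(sum(tp), sum(tp) + sum(fn)),
        "macro_precision": sum(ratio(tp[c], tp[c] + fp[c]) for c in range(num_labels)) / num_labels,
        "macro_recall": sum(ratio(tp[c], tp[c] + fn[c]) for c in range(num_labels)) / num_labels,
    }

# Toy case: a classifier that always predicts "positive" (code 2).
y_true = [2, 2, 2, 2, 0, 1]
y_pred = [2, 2, 2, 2, 2, 2]
scores = precision_recall(y_true, y_pred)
print(scores)
```

In this toy case the micro scores both equal the accuracy (4 of 6 correct), while macro recall drops to one third because two of the three classes are never predicted. Reporting macro or per-class scores alongside the micro ones would show whether the minority classes are handled well.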

In [21]:
# Define trainer object for model training and evaluation

repo_name = '/content/drive/MyDrive/Colab Notebooks/Flipkart Product Reviews - Kaggle'


training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [22]:
# Train model 

trainer.train()



Step,Training Loss
500,0.0296
1000,0.0293
1500,0.0238
2000,0.0296
2500,0.0355
3000,0.0275
3500,0.0279
4000,0.0208
4500,0.0228
5000,0.0268


TrainOutput(global_step=12500, training_loss=0.022602348480224608, metrics={'train_runtime': 1230.6241, 'train_samples_per_second': 162.519, 'train_steps_per_second': 10.157, 'total_flos': 2696856652046880.0, 'train_loss': 0.022602348480224608, 'epoch': 10.0})
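
The `global_step=12500` in the output follows directly from the split size and the training arguments. A quick check, assuming a single GPU and no gradient accumulation (neither is stated explicitly in the notebook):

```python
import math

train_examples = 20000  # size of the training split
batch_size = 16         # per_device_train_batch_size
epochs = 10             # num_train_epochs
num_devices = 1         # assumption: one GPU, no gradient accumulation

steps_per_epoch = math.ceil(train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 1250 12500
```

The loss table above therefore covers only the first 5,000 of the 12,500 optimisation steps.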

In [23]:
# Evaluate metrics for model on test data

trainer.evaluate()

{'eval_loss': 0.507477343082428,
 'eval_recall': 0.9404660211124898,
 'eval_precision': 0.9404660211124898,
 'eval_runtime': 265.3979,
 'eval_samples_per_second': 604.296,
 'eval_steps_per_second': 37.77,
 'epoch': 10.0}

## Saving Model

In [24]:
# Save trained model to directory

model_path = '/content/drive/MyDrive/Colab Notebooks/Flipkart Product Reviews - Kaggle/distilbert-base-uncased_finetuned_v2'

trainer.save_model(output_dir = model_path)

In [25]:
print("model saved")

model saved


## Load Model from Disk

In [26]:
# Load the fine-tuned model and tokenizer from the saved directory

tokenizer_finetuned = AutoTokenizer.from_pretrained(model_path)
model_finetuned = AutoModelForSequenceClassification.from_pretrained(model_path)
sentiment_map = {2: "positive", 
                 1: "neutral",
                 0: "negative"}


In [27]:
# Define a function to use the model to make predictions

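def use_model(input_text):
  """Return the predicted sentiment label for a single review string."""

  tokenized_text = tokenizer_finetuned(input_text,
                                     truncation=True,
                                     is_split_into_words=False,
                                     return_tensors='pt')
  # Inference only: skip gradient tracking and pass the attention mask along with the input IDs
  with torch.no_grad():
    outputs = model_finetuned(**tokenized_text)
  predicted_label = outputs.logits.argmax(-1)

  return sentiment_map[predicted_label.item()]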




In [28]:
use_model("I am not unhappy with this product.")

# In this example the model reads the double negation ("not unhappy") as positive.
# One input is not enough to conclude that negation is handled reliably in general.

'positive'
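
`use_model` returns only the top label. To see how confident a prediction is, the logits can be turned into probabilities with a softmax. The sketch below shows the calculation in plain Python. `softmax` and `label_probabilities` are hypothetical helpers, and `example_logits` holds made-up values rather than output from the fine-tuned model.

```python
import math

sentiment_map = {2: "positive", 1: "neutral", 0: "negative"}

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_probabilities(logits):
    """Map each sentiment label to its softmax probability."""
    return {sentiment_map[i]: p for i, p in enumerate(softmax(logits))}

example_logits = [-1.2, 0.3, 2.5]  # made-up values, ordered negative/neutral/positive
probs = label_probabilities(example_logits)
best = max(probs, key=probs.get)

print(best, probs)
```

With the real model, `outputs.logits[0].tolist()` inside `use_model` would supply the logits for a single input.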