# Defining the business problem

Amazon sells thousands of different products, and I'd like to understand the customer by collecting reviews posted online.


> Given an English Language product review for shoes, I want to predict a star rating between 1 and 5 to understand how satisfied customers are with their purchase

## Translating the business problem as a Machine Learning Problem

Here, I want to train a *multi-class model* for product reviews. The model should predict accurately the probabilities for each one of the five star ratings.

Possible metrics for evaluating the models are *F1 Score* and *Accuracy Score*


Thus the definition of the ML problem is **Given a test set for English Language Product reviews, I want to classify each review according to star ratings b/n 1 and 5, with an accuracy of at least xx% and an F1 Score of atleast 0.yy**

> The baseline metrics can be provided by the stakeholders or set via a statistical method

## Transfer Learning: looking for models

Take knowledge learnt from one task into a related domain. Here, I search the huggingface website for text classification or sentiment analysis models I can apply to this multi-class problem.

In [41]:
from datasets import load_dataset, Dataset
import pandas as pd
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, DistilBertTokenizer, DistilBertForSequenceClassification, Trainer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import torch


In [20]:

am_shoe_df = load_dataset("juliensimon/amazon-shoe-reviews")

# am_shoe_df.train()

Using custom data configuration juliensimon--amazon-shoe-reviews-2085fc7afbd18449
Found cached dataset parquet (C:/Users/INNO/.cache/huggingface/datasets/juliensimon___parquet/juliensimon--amazon-shoe-reviews-2085fc7afbd18449/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████| 2/2 [00:00<00:00, 346.18it/s]


In [16]:
# Convert to pandas df (you can use the to_pandas method)
amazon_shoes_train_df =  pd.DataFrame(am_shoe_df['train'])

# Check the value count of each rating (watchout for class-imbalance)
amazon_shoes_train_df['labels'].value_counts()

4    18044
2    18039
0    18004
1    17980
3    17933
Name: labels, dtype: int64

In [19]:
# Convert to pandas df (you can use the to_pandas method)
amazon_shoes_test_df =  pd.DataFrame(am_shoe_df['test'])

# Check the value count of each rating (watch out for class-imbalance)
amazon_shoes_test_df['labels'].value_counts()

3    2067
1    2020
0    1996
2    1961
4    1956
Name: labels, dtype: int64

In [21]:
train_df = Dataset.from_pandas(amazon_shoes_train_df, preserve_index=False)

test_df = Dataset.from_pandas(amazon_shoes_test_df, preserve_index=False)

## Training and Deploying Models

In this notebook, I train a model locally, and also deploy to Hugging Face Hub.

In [30]:
base_model_id_1 = 'distilbert-base-uncased'
base_model_id_2 = 'distilbert-base-uncased-finetuned-sst-2-english'

epochs = 1
# 5 labels
num_labels = 5
learning_rate = 5e-5
training_batch_size = 32
eval_batch_size = 64
save_strategy = 'no'
save_steps = 1000


output_data_dir = './output'
model_dir = './model'


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision = precision_score(labels, preds)
    recall = recall_score(labels, preds)
    f1 = f1_score(labels, preds)
    accuracy = accuracy_score(labels, preds)
    return {'accuracy_score': accuracy, 'f1': f1, 'recall': recall, 'precision': precision}

In [43]:
# Download the model and slap a classification tag of 5 labels at the end
model_1 =  AutoModelForSequenceClassification.from_pretrained(base_model_id_1, num_labels=num_labels)
# Convert natural language steps to integer for training
tokenizer_1 = AutoTokenizer.from_pretrained(base_model_id_1)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_clas

In [44]:
# Download the model and slap a classification tag of 5 labels at the end
model_2 =  AutoModelForSequenceClassification.from_pretrained(base_model_id_2, num_labels=num_labels, ignore_mismatched_sizes=True)
# Convert natural language steps to integer for training
tokenizer_2 = AutoTokenizer.from_pretrained(base_model_id_2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([5]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Define a function to tokenize the datasets. Data is processed in batches using the map function.

In [34]:
def tokenize(batch):
    # Convert strings to numbers
    return tokenizer(batch['text'], padding='max_length', truncation=True)


train_dataset = train_df.map(tokenize, batched=True, batch_size=len(train_df))
test_dataset = train_df.map(tokenize, batched=True, batch_size=len(test_df))

100%|██████████| 1/1 [00:51<00:00, 51.77s/ba]
100%|██████████| 9/9 [00:21<00:00,  2.38s/ba]


In [45]:
hub_model_id = 'mklomo/amazon-shoe-reviews'

training_args = TrainingArguments(
                                    hub_model_id=hub_model_id,          # Push the model to the remote repo
                                    output_dir=model_dir,
                                    num_train_epochs=epochs,
                                    per_device_train_batch_size=training_batch_size,
                                    per_device_eval_batch_size=eval_batch_size,
                                    save_strategy=save_strategy,
                                    save_steps=save_steps,
                                    evaluation_strategy='epoch',
                                    learning_rate=learning_rate
)


# Put the pieces together with the Trainer object

trainer_1 = Trainer(
                    model=model_1,
                    args=training_args,
                    tokenizer=tokenizer_1,
                    compute_metrics=compute_metrics,
                    train_dataset=train_dataset,
                    eval_dataset=test_dataset
)


trainer_2 = Trainer(
                    model=model_2,
                    args=training_args,
                    tokenizer=tokenizer_2,
                    compute_metrics=compute_metrics,
                    train_dataset=train_dataset,
                    eval_dataset=test_dataset
)

In [46]:
# Train model 1
trainer_1.train()

***** Running training *****
  Num examples = 90000
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 2813
  0%|          | 0/2813 [00:00<?, ?it/s]