# 🔍 Review Sense 

This Jupyter notebook applies NLP to detect sentiment in customer reviews. It uses the Hugging Face DistilBERT model, a pre-trained transformer designed to process text efficiently. Specifically, the model is fine-tuned on the Amazon Electronics reviews dataset, a large corpus of customer reviews for electronics products. By analyzing the language used in each review, the model classifies its sentiment as positive, negative, or neutral.

## Install Packages

In [3]:
!pip install "sagemaker>=2.140.0" "transformers==4.26.1" "datasets[s3]==2.10.1" --upgrade



In [4]:
import sagemaker.huggingface
import sagemaker 
import boto3
import pandas as pd
sagemaker_sess = sagemaker.Session() # Gets current SageMaker session 

## Permissions 

In [5]:
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sagemaker_sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sagemaker_sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sagemaker_sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_sess.default_bucket()}")
print(f"sagemaker session region: {sagemaker_sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::155047035098:role/service-role/AmazonSageMaker-ExecutionRole-20221019T153336
sagemaker bucket: sagemaker-us-east-1-155047035098
sagemaker session region: us-east-1
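The role ARN and bucket name printed above share the same account ID. As a minimal sketch, assuming the default bucket follows the `sagemaker-<region>-<account-id>` pattern visible in this output, the expected bucket name can be derived from the ARN. Both helper names below are hypothetical and not part of the SageMaker SDK.

```python
def account_id_from_arn(arn):
    """ARNs are colon-separated: arn:partition:service:region:account-id:resource."""
    return arn.split(":")[4]


def default_bucket_name(region, account_id):
    """Bucket name in the pattern seen in the output above (assumed, not an SDK call)."""
    return f"sagemaker-{region}-{account_id}"


# Values copied from the cell output above.
role_arn = (
    "arn:aws:iam::155047035098:role/service-role/"
    "AmazonSageMaker-ExecutionRole-20221019T153336"
)
account_id = account_id_from_arn(role_arn)
expected_bucket = default_bucket_name("us-east-1", account_id)
print(expected_bucket)
```

Comparing this value with `sagemaker_sess.default_bucket()` is a quick way to confirm the session is using the account you expect.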


## Data Ingestion 
- We use the Hugging Face `datasets` library to download the Amazon Electronics reviews dataset.

In [6]:
from datasets import load_dataset

dataset = load_dataset("amazon_us_reviews", "Electronics_v1_00", split="train[:10%]")

Found cached dataset amazon_us_reviews (/root/.cache/huggingface/datasets/amazon_us_reviews/Electronics_v1_00/0.1.0/17b2481be59723469538adeb8fd0a68b0ba363bbbdd71090e72c325ee6c7e563)


Let's take a look at the data we downloaded.

In [7]:
print(f"{dataset.size_in_bytes / 1024 / 1024 / 1024:.2f} GB")
print(dataset.shape)

2.40 GB
(309387, 15)


In [8]:
dataset[0]

{'marketplace': 'US',
 'customer_id': '41409413',
 'review_id': 'R2MTG1GCZLR2DK',
 'product_id': 'B00428R89M',
 'product_parent': '112201306',
 'product_title': 'yoomall 5M Antenna WIFI RP-SMA Female to Male Extensionl Cable',
 'product_category': 'Electronics',
 'star_rating': 5,
 'helpful_votes': 0,
 'total_votes': 0,
 'vine': 0,
 'verified_purchase': 1,
 'review_headline': 'Five Stars',
 'review_body': 'As described.',
 'review_date': '2015-08-31'}

## Preprocessing
- We are only interested in two columns, `star_rating` and `review_body`; all other columns are dropped.
- Look at the frequency of the star rating values.
- Transformer classification models expect class labels to start at 0, so we decrement all star ratings (1-5 becomes 0-4) using the `map()` function in `datasets`.

In [9]:
dataset = dataset.remove_columns(
    [
        "marketplace",
        "customer_id",
        "review_id",
        "product_id",
        "product_parent",
        "product_title",
        "product_category",
        "helpful_votes",
        "total_votes",
        "vine",
        "verified_purchase",
        "review_headline",
        "review_date",
    ]
)
dataset[0]

{'star_rating': 5, 'review_body': 'As described.'}

In [10]:
dataset.to_pandas().value_counts('star_rating')

star_rating
5    189462
4     46173
1     35433
3     21727
2     16592
dtype: int64

We can see the dataset is unbalanced: well over half of the reviews have a 5-star rating. We can rebalance the dataset so that each rating class is equally represented.
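To put numbers on the imbalance, here is a small sketch that turns the counts printed above into class shares. The counts are copied from the `value_counts` output; nothing else is assumed.

```python
# Star-rating counts copied from the value_counts() output above.
counts = {5: 189462, 4: 46173, 1: 35433, 3: 21727, 2: 16592}

total = sum(counts.values())  # should equal the row count from dataset.shape
shares = {stars: n / total for stars, n in counts.items()}

for stars in sorted(shares):
    print(f"{stars} stars: {counts[stars]:>7} reviews ({shares[stars]:.1%})")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest / smallest class: {imbalance_ratio:.1f}x")
```

The 5-star class alone accounts for roughly 61% of the rows, which is why a model trained on the raw data could score well by mostly predicting 5 stars.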

In [11]:
pd_dataset = dataset.to_pandas() # Convert dataset to pandas format 
pd_dataset_balanced = pd.DataFrame(columns=dataset.column_names)

# Select the first 15,000 reviews from each star-rating class
for star_rating in range(1,6):
    data = pd_dataset[pd_dataset['star_rating'] == star_rating][:15000]
    pd_dataset_balanced = pd.concat([pd_dataset_balanced, data])
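The loop above keeps the first 15,000 rows of each class, so the sample depends on the order of the source data. One possible alternative, shown here as a plain-Python sketch rather than what the notebook does, is to draw a seeded random sample per class. The function name `sample_per_class` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key, per_class, seed=0):
    """Return up to `per_class` randomly chosen rows for each label value."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        k = min(per_class, len(group))  # small classes keep all their rows
        balanced.extend(rng.sample(group, k))
    return balanced


# Toy data: 10 five-star reviews and 3 one-star reviews.
rows = [{"star_rating": 5, "review_body": f"good {i}"} for i in range(10)]
rows += [{"star_rating": 1, "review_body": f"bad {i}"} for i in range(3)]

balanced = sample_per_class(rows, "star_rating", per_class=3)
print(len(balanced))
```

With pandas, the same idea could likely be expressed with a grouped sample, but the pure-Python version keeps the logic explicit.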


In [12]:
pd_dataset_balanced.value_counts('star_rating') 

star_rating
1    15000
2    15000
3    15000
4    15000
5    15000
dtype: int64

Now that the dataset is balanced, we can switch back to the Hugging Face dataset format.

In [13]:
from datasets import Dataset

dataset_hf = Dataset.from_pandas(pd_dataset_balanced, preserve_index=False)
print(dataset_hf)

Dataset({
    features: ['star_rating', 'review_body'],
    num_rows: 75000
})


In [14]:
def decrement_stars(row):
    return {'star_rating': row['star_rating'] -1}

dataset_hf = dataset_hf.map(decrement_stars)


In [15]:
dataset_hf = dataset_hf.rename_column('star_rating', 'label')
dataset_hf = dataset_hf.rename_column('review_body', 'text')

dataset_hf[0]

{'label': 0, 'text': 'Did not work at all.'}
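After the decrement and rename, labels 0-4 stand for 1-5 stars. The sketch below maps a label back to its star rating and then to a coarse sentiment. The three-way grouping (1-2 negative, 3 neutral, 4-5 positive) is an assumption made for illustration; the notebook itself trains on the five star classes. Both function names are hypothetical.

```python
def label_to_stars(label):
    """Convert a zero-based class label (0-4) back to a star rating (1-5)."""
    if not 0 <= label <= 4:
        raise ValueError(f"label out of range: {label}")
    return label + 1


def stars_to_sentiment(stars):
    """Assumed convention: 1-2 stars negative, 3 neutral, 4-5 positive."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Example row copied from the cell output above.
example = {"label": 0, "text": "Did not work at all."}
stars = label_to_stars(example["label"])
sentiment = stars_to_sentiment(stars)
print(stars, sentiment)  # prints: 1 negative
```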

In [16]:
dataset_hf_split = dataset_hf.train_test_split(test_size=0.1, shuffle=True, seed=59)

In [17]:
dataset_hf_split

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 67500
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 7500
    })
})
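To show where the 67,500 / 7,500 split comes from, here is a minimal sketch of a seeded shuffle-and-split over row indices. It only illustrates the idea: the `datasets` library uses its own random generator, so the actual row assignment will differ from this sketch. The helper name `split_indices` is hypothetical.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(75000, test_size=0.1, seed=59)
print(len(train_idx), len(test_idx))  # prints: 67500 7500
```

Because the input was balanced before splitting, a shuffled split keeps the classes close to balanced in both parts, though not exactly equal.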

## Save data to S3

In [18]:
s3_prefix = 'amazon-electronics-reviews/data'

# Define path for train & test datasets
train_data_path = f's3://{sagemaker_sess.default_bucket()}/{s3_prefix}/train'
test_data_path = f's3://{sagemaker_sess.default_bucket()}/{s3_prefix}/test'


train_data = dataset_hf_split['train']
test_data = dataset_hf_split['test']

# Save train & test datasets to S3
train_data.save_to_disk(train_data_path)
test_data.save_to_disk(test_data_path)
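For reference, this sketch shows what the two S3 paths resolve to, using the bucket name printed earlier in the notebook. The helper `s3_uri` is hypothetical and only joins strings; it does not talk to S3.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI."""
    cleaned = [p.strip("/") for p in parts]
    key = "/".join(c for c in cleaned if c)  # skip empty parts
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


bucket = "sagemaker-us-east-1-155047035098"  # bucket name from the earlier output
s3_prefix = "amazon-electronics-reviews/data"

train_uri = s3_uri(bucket, s3_prefix, "train")
test_uri = s3_uri(bucket, s3_prefix, "test")
print(train_uri)
print(test_uri)
```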




Now for the fun part: defining our hyperparameters and kicking off a training job. 🚀

In [19]:
from sagemaker.huggingface import HuggingFace
from transformers import AutoTokenizer
base_model = 'distilbert-base-uncased'
# )
# model_name = ""
# hyperparams = {'epochs':1, 'train_batch_size': 32, 'model_name': 

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [20]:
tokenizer = AutoTokenizer.from_pretrained(base_model)

In [21]:
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)
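The `padding="max_length"` and `truncation=True` arguments make every encoded review the same length. The toy sketch below mimics that behaviour with a whitespace splitter. It is not the DistilBERT tokenizer, which uses WordPiece and adds special tokens; the function `toy_encode`, its id scheme, and the `max_length=6` value are made up for illustration.

```python
def toy_encode(text, vocab, max_length, pad_id=0):
    """Whitespace 'tokenizer' that pads or truncates to exactly max_length ids."""
    tokens = text.lower().split()
    # Hypothetical id scheme: each new token gets the next free id, starting at 1.
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]

    ids = ids[:max_length]             # truncation: drop tokens past max_length
    attention_mask = [1] * len(ids)    # 1 marks real tokens
    n_pad = max_length - len(ids)      # padding: fill up to max_length
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


vocab = {}
short_review = toy_encode("As described.", vocab, max_length=6)
long_review = toy_encode("Did not work at all after two days of use", vocab, max_length=6)

print(short_review["input_ids"])       # prints: [1, 2, 0, 0, 0, 0]
print(short_review["attention_mask"])  # prints: [1, 1, 0, 0, 0, 0]
print(long_review["input_ids"])        # prints: [3, 4, 5, 6, 7, 8]
```

With the real tokenizer, `max_length` defaults to the model's maximum input size, so a short review such as "As described." is still padded to the full length. That is simple but can cost memory and compute on a dataset of mostly short reviews.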

In [23]:
train_data = train_data.map(tokenize, batched=True)
