In [60]:
import os
import random

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 

pd.set_option("display.max_colwidth", None)
random.seed(6806)
random_state = 6806

In [22]:
csv_files = [f"./data/{file}" for file in os.listdir("./data") if file.endswith(".csv")]

In [24]:
csv_files

['./data/luxury_beauty.csv',
 './data/software.csv',
 './data/arts_crafts_and_sewing.csv',
 './data/prime_pantry.csv',
 './data/industrial_and_scientific.csv',
 './data/gift_cards.csv',
 './data/all_beauty.csv',
 './data/magazine_subscriptions.csv',
 './data/digital_music.csv',
 './data/appliances.csv',
 './data/musical_instruments.csv',
 './data/amazon_fashion.csv']

In [72]:
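dfs = []

for path in csv_files:
    # The first CSV column holds the original row index.
    category_df = pd.read_csv(path, index_col=0)
    # Derive the category from the file name, e.g. "./data/software.csv" -> "software".
    category_df["category"] = os.path.splitext(os.path.basename(path))[0]
    dfs.append(category_df)

# Row indices restart for each file, so index values are not unique after concatenation.
df = pd.concat(dfs, axis=0)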

In [73]:
# Preprocessing
# Drop neutral/indeterminate reviews; .copy() avoids a SettingWithCopyWarning on the next line.
df = df[df["overall"] != 3].copy()
df["is_positive"] = df["overall"] > 3

In [74]:
df

Unnamed: 0,reviewText,overall,category,is_positive
0,This handcream has a beautiful fragrance. It doesnt stay on or protect your hands through washing. This size is quite small.,5.0,luxury_beauty,True
1,"wonderful hand lotion, for seriously dry skin, stays on a long time, a little goes a long long way.. go easy.. wonderful scent.. maybe a bit strong at first, but dissipates after a while.",5.0,luxury_beauty,True
2,"Best hand cream around. Silky, thick, soaks in all the way leaving hands super soft.",5.0,luxury_beauty,True
3,Thanks!!,5.0,luxury_beauty,True
4,"Great hand lotion. Soaks right in and leaves skin super soft. No greasy residue, great scent!",5.0,luxury_beauty,True
...,...,...,...,...
3171,Perfect fit!,5.0,amazon_fashion,True
3172,My favorite cross trainers!,5.0,amazon_fashion,True
3173,Love them fit perfect,5.0,amazon_fashion,True
3174,Favorite Nike shoe ever! The flex sole is excellent for someone like me who loves the free feeling of sandals or being barefoot. These move effortlessly with the bend of my foot. I've worn these for multiple activities and I've had no foot or ankle pain. The white/green/dark grey color goes with so many outfits and the mesh breathes perfectly on hot summer days. Highly recommend!,5.0,amazon_fashion,True


In [75]:
df.sample(5, random_state=random_state)

Unnamed: 0,reviewText,overall,category,is_positive
79567,This collection of songs does not disappoint! Imagine Dragons are a very talented group. I Love Amazon's MP3 music downloads!,5.0,digital_music,True
25097,"I am a bit of a connoisseur for these types of items and this impressed me. I found this palette to be convenient and have unique packaging and to be a well put together kit with all the 'must haves' in one package. I also didn't find the product to be small like some, I found it to be a decent size and the quality of the make-up to be above average. Jouer has all the basics wrapped into one here, each with their neat little compartments and mirrors included. You get a palette for eyes, face, lip gloss and a tint and highlighter. Only thing they didn't have was a small applicator or brush with it. This they could've added for conveniency as i've seen in other palette's such as this but still it impressed me. I also think the colors included are very neutral/wearable and blend well or would look good on most people. Product comes in a nice sheer translucent bag. Would make a nice gift for yourself or someone else, if you're willing to spend a bit extra for what I'd call quality.",5.0,luxury_beauty,True
7967,"I had a very cheap version of this tool (not of this brand) many, many years ago and it was totally worthless, fit only for the garbage can. So when I ordered this step drill in an attempt to cut a 1 1/8"" hole in mild steel, my expectations were not high. But I was absolutely wrong. This little beauty drilled four holes through 3/16"" thick steel and looks as if it was still brand new and untouched!! I'm very impressed and would recommend it, especially for use in a drill press.",5.0,industrial_and_scientific,True
119987,"I like that there are several different sizes. they only thing that i don't like is that the smaller hearts don't fit perfectly in center of the one bigger. So are they truely ""nesting""? (you can tell in the pic)",4.0,arts_crafts_and_sewing,True
33972,The best always,5.0,prime_pantry,True


In [80]:
df["reviewText"].str.len().describe()

count    1.101300e+06
mean     2.076870e+02
std      4.488919e+02
min      1.000000e+00
25%      2.700000e+01
50%      8.800000e+01
75%      2.050000e+02
max      3.218400e+04
Name: reviewText, dtype: float64

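Some of the reviews are very long: the median length is 88 characters, but the longest review runs to 32,184 characters. DistilBERT accepts at most 512 tokens per input, so we may have to filter out or truncate these reviews before passing them to the model. The count here (1,101,300) is also lower than the number of ratings (1,102,047), so a small number of reviews have no text at all.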
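As a minimal sketch of the length filter mentioned above, the helper below drops reviews whose text exceeds a character cutoff, along with reviews that have no text. The function name `drop_long_reviews` and the 2,000-character cutoff are assumptions made for illustration; the cutoff is only a rough proxy for the 512-token limit and would need to be checked against the actual tokenizer.

```python
import pandas as pd

# Hypothetical cutoff (an assumption, not a measured value): a character
# count used as a rough stand-in for DistilBERT's 512-token limit.
MAX_CHARS = 2000


def drop_long_reviews(reviews: pd.DataFrame, max_chars: int = MAX_CHARS) -> pd.DataFrame:
    """Keep only rows whose reviewText is present and at most max_chars long."""
    lengths = reviews["reviewText"].str.len()
    # Missing reviews have no length; give them an over-limit value so they are dropped.
    mask = lengths.fillna(max_chars + 1) <= max_chars
    return reviews[mask].copy()


# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "reviewText": ["Thanks!!", "x" * 5000, None],
        "overall": [5.0, 4.0, 1.0],
    }
)
short = drop_long_reviews(toy)
print(len(toy), "->", len(short))
```

Filtering discards the long reviews entirely. Truncating them at tokenization time keeps every review but loses whatever sentiment is expressed after the cutoff, so the choice between the two is worth testing.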

In [76]:
df["overall"].value_counts()

overall
5.0    871509
4.0    167843
1.0     32711
2.0     29984
Name: count, dtype: int64

In [77]:
df["overall"].describe()

count    1.102047e+06
mean     4.647348e+00
std      8.647075e-01
min      1.000000e+00
25%      5.000000e+00
50%      5.000000e+00
75%      5.000000e+00
max      5.000000e+00
Name: overall, dtype: float64

In [51]:
df["is_positive"].value_counts()

is_positive
True     1110974
False      62695
Name: count, dtype: int64

In [52]:
df["is_positive"].describe()

count     1173669
unique          2
top          True
freq      1110974
Name: is_positive, dtype: object

Most of the reviews are positive: negative reviews make up only roughly 5-6% of the data. We may want to treat upsampling of negative reviews via data augmentation as a hyperparameter.

Note: we did not standardize the ratings to account for differences in how individual users interpret the rating scale.
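The upsampling idea above could be prototyped as follows. This sketch uses plain random oversampling (duplicating negative rows with replacement) as a simpler stand-in for data augmentation, which would generate new text instead. The function name `upsample_negatives` and the `ratio` parameter are hypothetical; `ratio` is the desired number of negatives per positive and is the knob that could be tuned as a hyperparameter.

```python
import pandas as pd


def upsample_negatives(reviews: pd.DataFrame, ratio: float, seed: int = 6806) -> pd.DataFrame:
    """Duplicate negative rows at random until there are ratio * n_positive of them."""
    positives = reviews[reviews["is_positive"]]
    negatives = reviews[~reviews["is_positive"]]
    target = int(ratio * len(positives))
    if negatives.empty or target <= len(negatives):
        # Nothing to resample from, or already at or above the requested ratio.
        return reviews.copy()
    extra = negatives.sample(n=target - len(negatives), replace=True, random_state=seed)
    # Shuffle so the duplicated rows are not clustered at the end.
    return pd.concat([reviews, extra]).sample(frac=1, random_state=seed)


# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "reviewText": ["great", "love it", "perfect", "nice", "broke fast"],
        "is_positive": [True, True, True, True, False],
    }
)
balanced = upsample_negatives(toy, ratio=1.0)
print(balanced["is_positive"].value_counts())
```

Oversampling should be applied to the training split only, after the train/test split, so that duplicated negatives do not leak into the evaluation data.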

In [63]:
df["category"].value_counts()

category
arts_crafts_and_sewing       494485
musical_instruments          231392
digital_music                169781
prime_pantry                 137788
industrial_and_scientific     77071
luxury_beauty                 34278
software                      12805
all_beauty                     5269
amazon_fashion                 3176
gift_cards                     2972
magazine_subscriptions         2375
appliances                     2277
Name: count, dtype: int64

We will take the three categories with the most reviews as our source domains and treat everything else as the target domain. In particular, we will start our experiments with one source domain and add the other two later on. This speeds up development and also lets us observe what happens as the size and breadth of the source domain increase. We may later choose to include reviews from completely different domains as well, such as movie reviews and restaurant reviews.

In [68]:
source_domains = ["arts_crafts_and_sewing", "musical_instruments", "digital_music"]

In [69]:
len(df[df["category"].isin(source_domains)]) / len(df)

0.7631265714609485
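The staged experiment described above (one source domain first, then all three) could be set up along these lines. The helper `split_domains` and its `n_sources` parameter are hypothetical names. The sketch also assumes that source domains not yet in use are held out of the target set, which the notebook does not specify.

```python
import pandas as pd

source_domains = ["arts_crafts_and_sewing", "musical_instruments", "digital_music"]


def split_domains(reviews: pd.DataFrame, n_sources: int):
    """Return (source, target) frames using the first n_sources source domains."""
    active = source_domains[:n_sources]
    source = reviews[reviews["category"].isin(active)]
    # Assumption: the target is every category outside the full source list,
    # so it stays the same as more source domains are added.
    target = reviews[~reviews["category"].isin(source_domains)]
    return source, target


# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "category": [
            "arts_crafts_and_sewing",
            "musical_instruments",
            "digital_music",
            "software",
            "gift_cards",
        ],
        "is_positive": [True, True, False, True, False],
    }
)
for n in (1, 3):
    source, target = split_domains(toy, n_sources=n)
    print(n, len(source), len(target))
```

Keeping the target set fixed across stages means that any change in target-domain accuracy can be attributed to the growing source data, not to a shifting evaluation set.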