# Prepare a Subset of Amazon Reviews for Class

The data is a partially processed version of the Home and Kitchen category of the Amazon Reviews dataset (see the dataset information below).

In [1]:
from IPython.display import display, Markdown
with open("../Data-AmazonReviews/Amazon Product Reviews.md") as f:
    display(Markdown(f.read()))

# Amazon Product Reviews

- URL: https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews 

## Description

This is a large crawl of product reviews from Amazon. This dataset contains 82.83 million unique reviews, from around 20 million users.

## Basic statistics

| Statistic | Value                |
| --------- | -------------------- |
| Ratings   | 82.83 million        |
| Users     | 20.98 million        |
| Items     | 9.35 million         |
| Timespan  | May 1996 - July 2014 |

## Metadata

- reviews and ratings
- item-to-item relationships (e.g. "people who bought X also bought Y")
- timestamps
- helpfulness votes
- product image (and CNN features)
- price
- category
- salesRank

## Example

```
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
```
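A record in this format can be parsed with the standard library. The following is a minimal sketch, not part of the original dataset page: the record is abbreviated (`reviewerName` and `reviewText` are omitted), and reading `helpful` as `[helpful votes, total votes]` is an assumption.

```python
import json
from datetime import datetime, timezone

# One raw record, abbreviated from the example above for space.
raw_line = (
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"helpful": [2, 3], "overall": 5.0, '
    '"summary": "Heavenly Highway Hymns", '
    '"unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}'
)

record = json.loads(raw_line)

# Assumption: "helpful" holds [helpful votes, total votes].
helpful_votes, total_votes = record["helpful"]

# unixReviewTime is seconds since the Unix epoch; convert it to a UTC date.
review_date = datetime.fromtimestamp(
    record["unixReviewTime"], tz=timezone.utc
).date()

print(record["summary"], record["overall"], review_date.isoformat())
```

The converted date agrees with the human-readable `reviewTime` field of the same record.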

## Download link

See the [Amazon Dataset Page](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/) for download information.

The 2014 version of this dataset is [also available](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html).

## Citation

Please cite the following if you use the data:

**Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering**

R. He, J. McAuley

*WWW*, 2016
[pdf](https://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf)

**Image-based recommendations on styles and substitutes**

J. McAuley, C. Targett, J. Shi, A. van den Hengel

*SIGIR*, 2015
[pdf](https://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf)

# Load the Data

We will load our **corpus** of reviews from a local file.

In [2]:
## Import necessary packages
import pandas as pd
import nltk
import os, glob, json

In [3]:
# List the files in the current directory
# (alternative: sorted(os.listdir("../Data-AmazonReviews/")))
sorted(os.listdir())

['.ipynb_checkpoints',
 'Amazon Product Reviews.md',
 'Prepare-Amazon-Reviews-Home-Kitchen-Subset-csv.ipynb',
 'amazon-reviews-home-ktichen_hoover.csv',
 'amazon-reviews-with-metadata-home-kitchen.csv.gz',
 'amazon-reviews-with-metadata-toys-games.csv.gz',
 'processed_data.csv',
 'processed_data.joblib']

In [4]:
# Select which raw reviews file to use
fpath_reviews = "amazon-reviews-with-metadata-home-kitchen.csv.gz"
# fpath_reviews = "../Data-AmazonReviews/amazon-reviews-video-games-combined.csv.gz"

In [5]:
## Load full corpus
df_full = pd.read_csv(fpath_reviews)
df_full.head(10)

Unnamed: 0,asin,reviewerID,date,reviewText,summary,overall,title,brand,main_cat,description,feature,category,imageURL
0,B000067DW6,A6VPIOPMDJ8H7,2005-05-28,"In my recently deleted review, I praised these...",A shadow of a once-great product,2.0,"PUR Basic Water Pitcher Replacement Filter, 2-...",PUR,Amazon Home,"['Replacement Filter Cartridge, 2-11/16x2-11/1...","[""PUR BASIC WATER FILTER REPLACEMENT: PUR's Ba...","['Home & Kitchen', 'Kitchen & Dining', 'Water ...",['https://images-na.ssl-images-amazon.com/imag...
1,B000067DW6,A2CBZMETQJTNEE,2005-01-01,"I lived in Peru, in the Andes where it was imp...","What once was a good filter is now, basicly, w...",1.0,"PUR Basic Water Pitcher Replacement Filter, 2-...",PUR,Amazon Home,"['Replacement Filter Cartridge, 2-11/16x2-11/1...","[""PUR BASIC WATER FILTER REPLACEMENT: PUR's Ba...","['Home & Kitchen', 'Kitchen & Dining', 'Water ...",['https://images-na.ssl-images-amazon.com/imag...
2,B000067DW6,A2ZR3YTMEEIIZ4,2003-10-03,Pur makes water filters for campers and backpa...,"Pur over Brita, and neither softens water.",5.0,"PUR Basic Water Pitcher Replacement Filter, 2-...",PUR,Amazon Home,"['Replacement Filter Cartridge, 2-11/16x2-11/1...","[""PUR BASIC WATER FILTER REPLACEMENT: PUR's Ba...","['Home & Kitchen', 'Kitchen & Dining', 'Water ...",['https://images-na.ssl-images-amazon.com/imag...
3,B000067DW6,A1IPFGQ9EH8BGL,2018-05-06,We've tried both flavors. Pur's basic filter w...,"Pur's Lead Filter Takes Forever To Work, Doesn...",2.0,"PUR Basic Water Pitcher Replacement Filter, 2-...",PUR,Amazon Home,"['Replacement Filter Cartridge, 2-11/16x2-11/1...","[""PUR BASIC WATER FILTER REPLACEMENT: PUR's Ba...","['Home & Kitchen', 'Kitchen & Dining', 'Water ...",['https://images-na.ssl-images-amazon.com/imag...
4,B000067DW6,A3668U9GFM4II9,2018-05-03,Finally got around to using these and they are...,LEAK,1.0,"PUR Basic Water Pitcher Replacement Filter, 2-...",PUR,Amazon Home,"['Replacement Filter Cartridge, 2-11/16x2-11/1...","[""PUR BASIC WATER FILTER REPLACEMENT: PUR's Ba...","['Home & Kitchen', 'Kitchen & Dining', 'Water ...",['https://images-na.ssl-images-amazon.com/imag...
5,B000067DW6,A168EZ6357ACMG,2018-05-01,as advertised and on time!,Five Stars,5.0,"PUR Basic Water Pitcher Replacement Filter, 2-...",PUR,Amazon Home,"['Replacement Filter Cartridge, 2-11/16x2-11/1...","[""PUR BASIC WATER FILTER REPLACEMENT: PUR's Ba...","['Home & Kitchen', 'Kitchen & Dining', 'Water ...",['https://images-na.ssl-images-amazon.com/imag...
6,B000067DW6,A19UUX709H8MTI,2018-04-28,Switched to Aquacrest filters because they are...,More Expensive and Spread More Carbon Than Kno...,3.0,"PUR Basic Water Pitcher Replacement Filter, 2-...",PUR,Amazon Home,"['Replacement Filter Cartridge, 2-11/16x2-11/1...","[""PUR BASIC WATER FILTER REPLACEMENT: PUR's Ba...","['Home & Kitchen', 'Kitchen & Dining', 'Water ...",['https://images-na.ssl-images-amazon.com/imag...
7,B000067DW6,AWDURRGBMK7LR,2018-04-25,"Water stops flowing every time, requiring the ...",Water stops flowing every time,1.0,"PUR Basic Water Pitcher Replacement Filter, 2-...",PUR,Amazon Home,"['Replacement Filter Cartridge, 2-11/16x2-11/1...","[""PUR BASIC WATER FILTER REPLACEMENT: PUR's Ba...","['Home & Kitchen', 'Kitchen & Dining', 'Water ...",['https://images-na.ssl-images-amazon.com/imag...
8,B000067DW6,A32QO1EGR8HQFP,2018-04-24,I ordered the filter and sadly when I added th...,I was very disappointed.,1.0,"PUR Basic Water Pitcher Replacement Filter, 2-...",PUR,Amazon Home,"['Replacement Filter Cartridge, 2-11/16x2-11/1...","[""PUR BASIC WATER FILTER REPLACEMENT: PUR's Ba...","['Home & Kitchen', 'Kitchen & Dining', 'Water ...",['https://images-na.ssl-images-amazon.com/imag...
9,B000067DW6,A3KEQ19CTTMR4J,2018-04-23,4 stars here... It works as it should be can't...,Good filter just a bit pricey,4.0,"PUR Basic Water Pitcher Replacement Filter, 2-...",PUR,Amazon Home,"['Replacement Filter Cartridge, 2-11/16x2-11/1...","[""PUR BASIC WATER FILTER REPLACEMENT: PUR's Ba...","['Home & Kitchen', 'Kitchen & Dining', 'Water ...",['https://images-na.ssl-images-amazon.com/imag...


In [6]:
# Keep only the required columns
use_cols = ['reviewText','summary','overall','brand','title']
df_full = df_full[use_cols]
df_full

Unnamed: 0,reviewText,summary,overall,brand,title
0,"In my recently deleted review, I praised these...",A shadow of a once-great product,2.0,PUR,"PUR Basic Water Pitcher Replacement Filter, 2-..."
1,"I lived in Peru, in the Andes where it was imp...","What once was a good filter is now, basicly, w...",1.0,PUR,"PUR Basic Water Pitcher Replacement Filter, 2-..."
2,Pur makes water filters for campers and backpa...,"Pur over Brita, and neither softens water.",5.0,PUR,"PUR Basic Water Pitcher Replacement Filter, 2-..."
3,We've tried both flavors. Pur's basic filter w...,"Pur's Lead Filter Takes Forever To Work, Doesn...",2.0,PUR,"PUR Basic Water Pitcher Replacement Filter, 2-..."
4,Finally got around to using these and they are...,LEAK,1.0,PUR,"PUR Basic Water Pitcher Replacement Filter, 2-..."
...,...,...,...,...,...
110207,This tumbler is just as good as a Yeti tumbler...,Amazing product and one that anyone would appr...,5.0,RTIC,RTIC 30 oz Stainless Steel Tumbler Cup w/ Spla...
110208,This is an amazing cup. Use it daily. Drapela ...,Cold forever,5.0,RTIC,RTIC 30 oz Stainless Steel Tumbler Cup w/ Spla...
110209,I bought two of these and they work great. Jus...,Keeps Ice All Day...and Longer,5.0,RTIC,RTIC 30 oz Stainless Steel Tumbler Cup w/ Spla...
110210,"Buy direct from RTIC, awesome cup!",Love!,5.0,RTIC,RTIC 30 oz Stainless Steel Tumbler Cup w/ Spla...
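The same column selection can also be done at load time, so the unused columns are never held in memory. The sketch below assumes a tiny in-memory CSV with made-up values in place of the real file; `usecols` is a standard `pd.read_csv` parameter.

```python
import io
import pandas as pd

# Tiny stand-in for the real file; the values are made up for illustration.
csv_text = """asin,reviewText,summary,overall,brand,title,imageURL
A1,Works well,Good,5.0,BrandA,Widget,http://example.com/a.jpg
A2,Broke quickly,Bad,1.0,BrandB,Gadget,http://example.com/b.jpg
"""

use_cols = ['reviewText', 'summary', 'overall', 'brand', 'title']

# Only the listed columns are parsed; the rest are skipped.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=use_cols)
print(df_small.columns.tolist())
```

Note that `usecols` keeps the column order of the file, not the order of the list, so an explicit reindex is still needed if a different order matters.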


# Some light EDA

In [7]:
df_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110212 entries, 0 to 110211
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   reviewText  110212 non-null  object 
 1   summary     110189 non-null  object 
 2   overall     110212 non-null  float64
 3   brand       110212 non-null  object 
 4   title       110212 non-null  object 
dtypes: float64(1), object(4)
memory usage: 4.2+ MB


In [8]:
# Check for duplicate rows
df_full.duplicated().sum()

5486

In [9]:
# # Drop duplicates
# df_full = df_full.drop_duplicates()
# df_full.duplicated().sum()
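For reference, here is a small sketch of how `duplicated()` and `drop_duplicates()` behave, using a made-up toy frame rather than the real corpus. A row counts as a duplicate only if every column matches an earlier row.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 are identical in every column.
toy = pd.DataFrame({
    "reviewText": ["Great", "Great", "Bad", "Great"],
    "overall": [5.0, 5.0, 1.0, 4.0],
})

# Number of rows that repeat an earlier row exactly
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

The last row shares its text with the first two but has a different rating, so it is not treated as a duplicate.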

In [10]:
# # Combine the summary and review text into a single 'text' column
# df_full['text'] = df_full['summary'] + ": " + df_full['reviewText']
# df_full
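One caveat if the commented-out combination is ever enabled: `df_full.info()` shows that `summary` has 23 missing values (110189 non-null of 110212), and concatenating a missing value with a string yields a missing result. The sketch below uses a made-up toy frame and shows one possible way to fall back to the review text alone; it is an illustration, not the notebook's method.

```python
import pandas as pd

# Toy data (hypothetical): the second row has no summary.
toy = pd.DataFrame({
    "summary": ["Five Stars", None],
    "reviewText": ["as advertised and on time!", "Works fine."],
})

# Naive concatenation: a missing summary makes the whole result missing.
naive = toy["summary"] + ": " + toy["reviewText"]

# Fallback: where the summary is missing, keep the review text alone;
# everywhere else, use "summary: reviewText".
combined = toy["reviewText"].where(toy["summary"].isna(), naive)

print(combined.tolist())
```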

In [11]:
# df_full['title-short'] = df_full['title'].map(lambda x: x[:40])

### Selecting a Subset

In [12]:
df_full['brand'].value_counts()

Instant Pot              9027
Thermos                  8957
Hoover                   8257
Contigo                  7443
Cuisinart                5960
Pinzon by Amazon         5947
New Metro Design         5825
Ozeri                    4547
Hamilton Beach           3991
Clara Clark              3962
Coop Home Goods          3721
Bissell                  3434
RTIC                     3345
InterDesign              3285
Aroma Housewares         3248
HC COLLECTION            3144
Zinus                    2768
Himalayan Glow           2720
Eureka                   2620
Lodge                    2543
Zyliss                   2288
BLACK+DECKER             2228
Presto                   2222
PUR                      2045
Keurig                   1940
SleepBetter              1761
American Weigh Scales    1528
Swissmar                 1456
Name: brand, dtype: int64

In [13]:
overall_by_brand_counts = df_full.groupby('brand')['overall'].value_counts().unstack(1)
overall_by_brand_counts.head()

overall,1.0,2.0,3.0,4.0,5.0
brand,,,,,
American Weigh Scales,78,49,87,209,1105
Aroma Housewares,194,165,202,525,2162
BLACK+DECKER,126,94,143,337,1528
Bissell,175,154,188,493,2424
Clara Clark,209,192,350,521,2690


In [14]:
# Finding brands with the most one-star reviews
overall_by_brand_counts.sort_values(1.0, ascending=False)

overall,1.0,2.0,3.0,4.0,5.0
brand,,,,,
New Metro Design,613,354,415,715,3728
Cuisinart,561,278,378,786,3957
Hoover,531,281,437,1397,5611
Thermos,399,306,506,1054,6692
Contigo,389,340,496,941,5277
Hamilton Beach,358,256,365,759,2253
InterDesign,299,259,442,529,1756
Instant Pot,271,147,275,738,7596
Pinzon by Amazon,262,276,539,991,3879
Coop Home Goods,249,235,297,483,2457
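Raw counts favor brands with many reviews. For example, Hamilton Beach has fewer one-star reviews than Hoover (358 vs. 531) but a higher share of them (358 of 3,991 vs. 531 of 8,257). A possible alternative, sketched here on made-up toy data, is to rank by row-normalized shares using `pd.crosstab`.

```python
import pandas as pd

# Toy data (hypothetical): brand A has 1 one-star review out of 4,
# brand B has 1 out of 2.
toy = pd.DataFrame({
    "brand":   ["A", "A", "A", "A", "B", "B"],
    "overall": [1.0, 5.0, 5.0, 5.0, 1.0, 5.0],
})

# Raw counts per brand and rating (same shape as the groupby/unstack table)
counts = pd.crosstab(toy["brand"], toy["overall"])

# Share of each rating within a brand: every row sums to 1
shares = pd.crosstab(toy["brand"], toy["overall"], normalize="index")

# Rank brands by the share of one-star reviews
ranked = shares.sort_values(1.0, ascending=False)
print(ranked)
```

Both brands tie on one-star counts here, yet brand B ranks first on share.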


In [15]:
SELECTED_BRAND = "Hoover"
new_column_order = [
    # 'text',
    'reviewText',
    'summary',
    'overall',
    'brand',
    'title']

# Keep only the selected brand's reviews, with columns in the desired order
df = df_full.loc[df_full['brand'] == SELECTED_BRAND, new_column_order]
# Shuffle the rows and reset the index
df = df.sample(frac=1, replace=False)
df = df.reset_index(drop=True)
df = df.copy()
df

Unnamed: 0,reviewText,summary,overall,brand,title
0,Used it twice already and I have absolutely se...,Not going to show you the dirty water on here ...,4.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
1,When you get the shampooer you have to put it ...,Makes carpet look brand new!!!,5.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
2,After getting an estimate on how much it would...,I Got What I Paid For,4.0,Hoover,Hoover Carpet Cleaner SteamVac with Clean Surg...
3,"I purchased this Hoover carpet cleaner, becaus...","Read tips before use, but overall great product",4.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
4,WORKED for maybe 1/2 hr and then it appeared t...,VERY DISAPPOINTED,1.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
...,...,...,...,...,...
8252,My husband and I love this little vacuum - it ...,A TWO STICK VACUUM FAMILY,5.0,Hoover,"Hoover Linx Cordless Stick Vacuum Cleaner, BH5..."
8253,An EXCELLENT carpet cleaner. We've had this a ...,Keep em seperated! Awesome machine,5.0,Hoover,Hoover Carpet Cleaner SteamVac with Clean Surg...
8254,This is a great carpet cleaner that will save ...,This is a great carpet cleaner that will save ...,5.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
8255,Bought years ago and I probably only use it ha...,Adequate cleaning but best for large areas,3.0,Hoover,Hoover Carpet Cleaner SteamVac with Clean Surg...
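The shuffle above produces a different row order on every run. If a repeatable order is wanted (for example, so everyone in a class sees the same first rows), `sample` accepts a `random_state` seed. A minimal sketch on made-up toy data, with 42 as an arbitrary seed:

```python
import pandas as pd

# Toy data (hypothetical) standing in for the brand subset
toy = pd.DataFrame({
    "reviewText": list("abcdef"),
    "overall": [1, 2, 3, 4, 5, 5],
})

# The same seed gives the same shuffled order on every run.
shuffled_1 = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_2 = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_1.equals(shuffled_2))
```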


In [16]:
df['overall'].value_counts()

5.0    5611
4.0    1397
1.0     531
3.0     437
2.0     281
Name: overall, dtype: int64

In [17]:
# pd.set_option('display.max_colwidth',300)

In [18]:
df.head()

Unnamed: 0,reviewText,summary,overall,brand,title
0,Used it twice already and I have absolutely se...,Not going to show you the dirty water on here ...,4.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
1,When you get the shampooer you have to put it ...,Makes carpet look brand new!!!,5.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
2,After getting an estimate on how much it would...,I Got What I Paid For,4.0,Hoover,Hoover Carpet Cleaner SteamVac with Clean Surg...
3,"I purchased this Hoover carpet cleaner, becaus...","Read tips before use, but overall great product",4.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
4,WORKED for maybe 1/2 hr and then it appeared t...,VERY DISAPPOINTED,1.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150


In [19]:
## Save the processed data
fpath_final = 'amazon-reviews-home-kitchen_hoover.csv'
df.to_csv(fpath_final, index=False)

In [20]:
# Reload the saved file to confirm it was written correctly
pd.read_csv(fpath_final)

Unnamed: 0,reviewText,summary,overall,brand,title
0,Used it twice already and I have absolutely se...,Not going to show you the dirty water on here ...,4.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
1,When you get the shampooer you have to put it ...,Makes carpet look brand new!!!,5.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
2,After getting an estimate on how much it would...,I Got What I Paid For,4.0,Hoover,Hoover Carpet Cleaner SteamVac with Clean Surg...
3,"I purchased this Hoover carpet cleaner, becaus...","Read tips before use, but overall great product",4.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
4,WORKED for maybe 1/2 hr and then it appeared t...,VERY DISAPPOINTED,1.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
...,...,...,...,...,...
8252,My husband and I love this little vacuum - it ...,A TWO STICK VACUUM FAMILY,5.0,Hoover,"Hoover Linx Cordless Stick Vacuum Cleaner, BH5..."
8253,An EXCELLENT carpet cleaner. We've had this a ...,Keep em seperated! Awesome machine,5.0,Hoover,Hoover Carpet Cleaner SteamVac with Clean Surg...
8254,This is a great carpet cleaner that will save ...,This is a great carpet cleaner that will save ...,5.0,Hoover,Hoover Power Scrub Deluxe Carpet Washer FH50150
8255,Bought years ago and I probably only use it ha...,Adequate cleaning but best for large areas,3.0,Hoover,Hoover Carpet Cleaner SteamVac with Clean Surg...
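Instead of comparing the reloaded table by eye, the round trip can be checked programmatically. The sketch below is one possible approach, using a made-up toy frame and a temporary directory so that no real file is touched.

```python
import os
import tempfile
import pandas as pd

# Toy data (hypothetical) standing in for the saved subset
toy = pd.DataFrame({
    "reviewText": ["Great vacuum", "Stopped working"],
    "overall": [5.0, 1.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-reviews.csv")
    # index=False avoids writing the index as an extra unnamed column
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if values, columns, or shape differ
pd.testing.assert_frame_equal(toy, reloaded, check_dtype=False)
print(reloaded.shape)
```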
