## Amazon Reviews

Data Source: https://nijianmo.github.io/amazon/index.html

This notebook reads the gzipped reviews and metadata for the Electronics category of the Amazon review dataset.

Since the files are large and product characteristics vary, we will extract one popular product's information for further modeling and analysis.

### Data Cleaning

JSON to DataFrame

In [1]:
import pandas as pd
import gzip
import json

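def parse(path):
    # Yield one review dict per line of the gzipped JSON-lines file.
    # The context manager closes the file once iteration finishes.
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield json.loads(line)

def getDF(path):
    # Collect the parsed records into a DataFrame, one row per review.
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')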

df = getDF('Electronics_5.json.gz')
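
Because the files are large, one option is to filter while streaming instead of loading every review. The sketch below assumes the same one-JSON-object-per-line layout parsed above. `load_reviews_for_asin` and the toy records are hypothetical and not part of the original notebook. It only helps once the product of interest is known; the grouping step later still needs the full review table.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_reviews_for_asin(path, asin):
    """Stream a gzipped JSON-lines file and keep only one product's reviews."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            if record.get("asin") == asin:
                rows.append(record)
    return pd.DataFrame(rows)


# Toy file standing in for Electronics_5.json.gz (made-up records).
toy_records = [
    {"asin": "A1", "overall": 5.0},
    {"asin": "B2", "overall": 3.0},
    {"asin": "A1", "overall": 4.0},
]
toy_path = os.path.join(tempfile.mkdtemp(), "toy_reviews.json.gz")
with gzip.open(toy_path, "wt", encoding="utf-8") as handle:
    for record in toy_records:
        handle.write(json.dumps(record) + "\n")

toy_df = load_reviews_for_asin(toy_path, "A1")
print(toy_df)
```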

In [20]:
df.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,image
0,5.0,67,True,"09 18, 1999",AAP7PPBU72QFM,151004714,{'Format:': ' Hardcover'},D. C. Carrad,This is the best novel I have read in 2 or 3 y...,A star is born,937612800,
1,3.0,5,True,"10 23, 2013",A2E168DTVGE6SV,151004714,{'Format:': ' Kindle Edition'},Evy,"Pages and pages of introspection, in the style...",A stream of consciousness novel,1382486400,
2,5.0,4,False,"09 2, 2008",A1ER5AYS3FQ9O3,151004714,{'Format:': ' Paperback'},Kcorn,This is the kind of novel to read when you hav...,I'm a huge fan of the author and this one did ...,1220313600,
3,5.0,13,False,"09 4, 2000",A1T17LMQABMBN5,151004714,{'Format:': ' Hardcover'},Caf Girl Writes,What gorgeous language! What an incredible wri...,The most beautiful book I have ever read!,968025600,
4,3.0,8,True,"02 4, 2000",A3QHJ0FXK33OBE,151004714,{'Format:': ' Hardcover'},W. Shane Schmidt,I was taken in by reviews that compared this b...,A dissenting view--In part.,949622400,


### Aggregate with Metadata

#### Group reviews data by products

asin: Amazon Standard Identification Number, the product ID used as the grouping key below.

In [27]:
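# count reviews per product; after this step 'overall' holds the review count, not a rating
tbl = df.groupby('asin').agg({'overall':'count'})
# most-reviewed products first
tbl = tbl.sort_values(by = ['overall'], ascending = False)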
tbl['asin'] = tbl.index
tbl = tbl.rename_axis("ID")
tbl

ID,overall,asin
B003L1ZYYW,8617,B003L1ZYYW
B0019HL8Q8,8160,B0019HL8Q8
B0019EHU8G,7777,B0019EHU8G
B0015DYMVO,7380,B0015DYMVO
B000VS4HDM,6802,B000VS4HDM
...,...,...
B000YJAWUA,1,B000YJAWUA
B0013MYX8E,1,B0013MYX8E
B000WGS5B8,1,B000WGS5B8
B00166G6EG,1,B00166G6EG
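
For reference, the same per-product table can probably be built with `value_counts`, which sorts in descending order by default. This is a minimal sketch on made-up toy data, not the notebook's own method. One difference to keep in mind: `value_counts` counts rows per asin, while the `count` aggregation above counts non-null `overall` values.

```python
import pandas as pd

# Toy reviews standing in for df (made-up values).
toy = pd.DataFrame({
    "asin": ["A1", "B2", "A1", "C3", "A1", "B2"],
    "overall": [5.0, 3.0, 4.0, 1.0, 5.0, 2.0],
})

# Count rows per product, name the count column 'overall' and the index 'ID'
# so the result has the same layout as tbl.
counts = (
    toy["asin"]
    .value_counts()
    .rename("overall")
    .rename_axis("ID")
    .to_frame()
)
counts["asin"] = counts.index
print(counts)
```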


#### Read in metadata and merge with aggregated review data

In [28]:
# read in meta data and inner merge with review data by asin
meta = pd.read_csv('meta.csv')
new = tbl.merge(meta, on = 'asin', how = 'inner')



In [29]:
new.columns

Index(['overall', 'asin', 'Unnamed: 0', 'category', 'description', 'title',
       'image', 'brand', 'feature', 'rank', 'main_cat', 'date', 'price',
       'also_buy', 'also_view', 'similar_item', 'tech1', 'tech2', 'details',
       'fit'],
      dtype='object')
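
An inner merge silently drops products that have no metadata row. One way to see how many are lost is a left merge with `indicator=True`. The sketch below uses made-up toy frames (`toy_tbl`, `toy_meta`), and the `validate="many_to_one"` check assumes each asin should appear at most once in the metadata, which may not hold for the real `meta.csv`.

```python
import pandas as pd

# Toy stand-ins for tbl and meta (made-up values).
toy_tbl = pd.DataFrame({"asin": ["A1", "B2", "C3"], "overall": [3, 2, 1]})
toy_meta = pd.DataFrame({"asin": ["A1", "B2"], "title": ["Keyboard", "Cable"]})

# Keep every product from the review side and record where each row matched.
# validate raises MergeError if an asin is duplicated on the metadata side.
checked = toy_tbl.merge(
    toy_meta, on="asin", how="left", indicator=True, validate="many_to_one"
)

# Products that an inner merge would have dropped.
unmatched = checked.loc[checked["_merge"] == "left_only", "asin"].tolist()
print(unmatched)
```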

#### Exploratory playground

In [135]:
def keep_recent(row):
    '''Return 1 if the year in the last four characters of date is after 2016, else 0.'''
    try:
        return int(int(row['date'][-4:]) > 2016)
    except (TypeError, ValueError):
        # missing or unparseable date
        return 0
    
new['newitem'] = new.apply(keep_recent, axis = 1)
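
The row-wise `apply` can be slow on a large table. A vectorized version, sketched here on made-up dates, should give the same flag for typical date strings, though it has not been checked against the real `date` column. It assumes, as the function above does, that the year is the last four characters.

```python
import pandas as pd

# Toy stand-in for the metadata 'date' column (made-up values).
toy = pd.DataFrame({"date": ["March 3, 2017", "July 1, 2015", None, "n/a"]})

# Take the last four characters and parse them as a number;
# missing or unparseable values become NaN.
year = pd.to_numeric(toy["date"].str[-4:], errors="coerce")

# NaN > 2016 evaluates to False, so bad dates get 0.
toy["newitem"] = (year > 2016).astype(int)
print(toy)
```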

In [137]:
sim_df = new[new['newitem'] == 1][['asin', 'title', 'overall', 'brand']]

In [142]:
sim_df.head(10)

Unnamed: 0,asin,title,overall,brand
19,B00BP5KOPA,"Logitech MK270 Wireless Keyboard and Mouse Combo - Keyboard and Mouse Included, 2.4GHz Dropout-Free Connection, Long Battery Life (Frustration-Free Packaging)",4625,Logitech
26,B001TH7GUU,AmazonBasics USB 2.0 Extension Cable - A-Male to A-Female - 9.8 Feet (3 Meters),4503,AmazonBasics
36,B009D79VH4,"Transcend USB 3.0 SDHC / SDXC / microSDHC / SDXC Card Reader, TS-RDF5K (Black)",3747,Transcend
48,B001XURP8Q,"Sandisk Cruzer 32GB USB 32 GB Flash Drive, Black - SDCZ36-032G",3166,SanDisk
51,B004OVECU0,"Logitech Harmony 650 Infrared All in One Remote Control, Universal Remote Logitech, Programmable Remote (Silver)",3132,Logitech
61,B001TH7GSW,AmazonBasics Digital Optical Audio Toslink Cable - 6 Feet (1.8 Meters),3016,AmazonBasics
75,B00APCMMEK,Transcend 16GB MicroSDHC Class10 UHS-1 Memory Card with Adapter 60 MB/s (TS16GUSDU1),2848,Transcend
89,B000WU2LXC,"ARCTIC MX-2 - Thermal Compound Paste, Carbon Based High Performance, Heatsink Paste, Thermal Compound CPU for All Coolers, Thermal Interface Material - 4 Grams",2586,ARCTIC
92,B000A6PPOK,Microsoft Natural Ergonomic Keyboard 4000,2567,Microsoft
102,B00DSUTX3O,WD Black 750GB Performance Mobile Hard Disk Drive - 7200 RPM SATA 6 Gb/s 16MB Cache 9.5 MM 2.5 Inch - WD7500BPKX,2442,Western Digital


##### Explore different brands

In [140]:
# Amazon products
sim_df[sim_df['brand']=='Amazon'].head()

Unnamed: 0,asin,title,overall,brand


In [141]:
# Apple products
sim_df[sim_df['brand']=='Apple'].head()

Unnamed: 0,asin,title,overall,brand
2913,B00UGBMRQ8,"Apple MacBook Pro 15"" Core i7 2.8GHz Retina (MGXG2LL/A), 16GB RAM, 512GB Solid State Drive (Refurbished)",349,Apple
4039,B0096VDM8G,"Apple MacBook Pro 15-Inch Laptop with Retina Display, 2.2 Ghz Intel core i7, 16GB DDR3L, 512GB SSD (Z0RC0005YR)",269,Apple
10007,B00B3Y4U4E,Apple MD862ZM/A Thunderbolt Cable - 0.5 M (NEWEST VERSION),126,Apple
14257,B0186RZAWQ,Apple Smart Keyboard for iPad Pro 12.9,92,Apple
18190,B01H29JY62,"Apple Magic Keyboard (Wireless, Rechargable) (Spanish) - Silver",73,Apple


In [66]:
sim_df[sim_df['brand']=='Xiaomi'].head()

Unnamed: 0,asin,title,overall,brand
5382,B00UHHRLEO,New Original Gold Xiaomi 2nd Piston Earphone I...,213,Xiaomi
8266,B00V9RKSAA,Xiaomi Piston III Headset Earphones with Remot...,149,Xiaomi
12652,B018AMDCLI,Xiaomi ZBW4326TY Professional Store Hybrid Dua...,103,Xiaomi
29086,B00ODPLC8C,Original Xiaomi Piston Earphone Ii Headphone H...,45,Xiaomi
50448,B013N5VS5Y,Xiaomi Piston 3 Headphones In-Ear Bass Earphon...,24,Xiaomi


In [67]:
sim_df[sim_df['brand']=='Huawei'].head()

Unnamed: 0,asin,title,overall,brand
2025,B013LKLS2E,Huawei Watch Stainless Steel with Stainless St...,455,Huawei
3572,B013LKLIC4,Huawei Watch Stainless Steel with Stainless St...,296,Huawei
17091,B00KKRC4I4,Huawei Ascend W1 - Windows 8 Smartphone - Unlo...,77,Huawei
26725,B01FWIJ690,Huawei MateBook Signature Edition 2 in 1 PC Ta...,49,Huawei
27650,B00CTPQAGW,"OEM Manufactured Standard Battery (1500 mAh, N...",47,Huawei


In [80]:
sim_df[sim_df['brand']=='Lenovo'].head()

Unnamed: 0,asin,title,overall,brand
5625,B00FU83YWS,Lenovo Thinkpad E545 20B20011US Laptop (Windo...,205,Lenovo
8614,B00F2ENU92,Lenovo Yoga Multimode 10-inch Tablet,144,Lenovo
8677,B005L2NTTQ,Lenovo N5902 Enhanced Multimedia Remote with B...,143,Lenovo
9658,B000LRI2ZC,2GB PC2-5300 667MHZ DDR2 Sdram,130,Lenovo
9703,B00JAIEAU4,Lenovo IdeaTab A8-50 8-Inch 16 GB Tablet,130,Lenovo


##### Explore different products

In [71]:
sim_df[sim_df['title'].str.contains("camera",na = False)].head()

Unnamed: 0,asin,title,overall,brand
135,B00P7EVST6,Arlo - Wireless Home Security Camera System | ...,2130,"Arlo Technologies, Inc"
3110,B0010SIAV2,Giottos MH621 Quick Release Adapter with Short...,332,Giotto's
3848,B00K3GI8Y6,General 3-way Adjustable Hand Grip Stabilizer ...,279,General
4147,B0042J6VUS,Mini Adjustable Tripod+camera Holder for Iphon...,264,EastVita
5300,B00K67QUQK,The OFFICIAL ROXANT PRO video camera stabilize...,215,Roxant
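
`str.contains` is case-sensitive by default, so the search above only matches titles with a lowercase "camera". A minimal sketch of a case-insensitive search, using made-up toy titles:

```python
import pandas as pd

# Toy titles (made-up values), including a missing one.
toy = pd.DataFrame({
    "title": ["Arlo Security Camera", "Mini tripod camera holder", "USB cable", None]
})

# case=False ignores capitalization; na=False treats missing titles as non-matches.
mask = toy["title"].str.contains("camera", case=False, na=False)
matches = toy.loc[mask, "title"].tolist()
print(matches)
```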


In [92]:
sim_df[sim_df['title'].str.contains("MacBook",na = False)].head()

Unnamed: 0,asin,title,overall,brand
151,B00XMD7KPU,Anker 4-Port USB 3.0 Ultra Slim Data Hub for M...,2000,Anker
431,B001N7PANQ,var aPageStart = (new Date()).getTime();\nvar ...,1176,mCover
444,B001NJO6AC,var aPageStart = (new Date()).getTime();\nvar ...,1166,mCover
525,B01AHKYIRS,[2 in 1 Pack] Anker USB-C (Male) to Micro USB ...,1067,Anker
568,B00DQ5RYP0,"HooToo USB C Hub, 6-in-1 Premium USB C Adapter...",1022,HooToo


### Extract Sample Data

In [148]:
#pd.set_option('display.max_colwidth', 1)
print(new[new['asin']=='B000A6PPOK']['title'])

92    Microsoft Natural Ergonomic Keyboard 4000
Name: title, dtype: object


In [153]:
new[new['asin']=='B000A6PPOK'].head()

In [149]:
df.columns

Index(['overall', 'vote', 'verified', 'reviewTime', 'reviewerID', 'asin',
       'style', 'reviewerName', 'reviewText', 'summary', 'unixReviewTime',
       'image'],
      dtype='object')

In [151]:
# keep only the reviews for the selected product and save them as the sample
pdt = df[df['asin']=='B000A6PPOK']
pdt.to_csv('sample.csv')