This notebook walks through downloading data from the 2018 Amazon dataset
and storing it locally in a SQLite database for further processing.

In [1]:
from datetime import date
from pathlib import Path
from typing import NamedTuple, Optional, Tuple, List

import pandas as pd
import plotly.express as px
import scipy.sparse as sp

import amazon_dataset

# 1. Download Data

These are the available datasets. In this notebook we process one dataset at a time.

In [2]:
DATASETS = [
    # 'Baby',
    'Clothing_Shoes_and_Jewelry',
    # 'Home_and_Kitchen',
    # 'Movies_and_TV',
    # 'Musical_Instruments',
    # 'Office_Products',
    # 'Sports_and_Outdoors',
    # 'Toys_and_Games',
]

# 2. Import data to database

In [3]:
DATASET = 'Clothing_Shoes_and_Jewelry'

In [4]:
try:
    amazon_dataset.load_amazon_dataset(
        DATASET,
        force=False,
        min_date=date.fromisoformat('2018-03-01'),
        max_date=date.fromisoformat('2018-10-01'),
        min_reviews_per_reviewer=5,
        min_reviews_per_asin=5
    )
except ValueError as ex:
    # This is OK: we don't want to wipe the database, since extracting
    # the data can take several minutes
    print(ex)

Loading reviews : 100%|██████████| 32.3M/32.3M [08:01<00:00, 67.1kreview/s, Added 2051627 reviews]
Deleting asins  : 100%|██████████| 403k/403k [00:05<00:00, 68.6krow/s] 
Deleting reviews: 100%|██████████| 1.47M/1.47M [00:22<00:00, 65.4krow/s]
Loading prods   : 100%|██████████| 2.69M/2.69M [04:30<00:00, 9.91kproduct/s, Added 38493 products]


Total reviews: 178944 Reviewers: 23318 Products: 38493 Density: 0.0199%
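The density in the summary line reads as the fraction of reviewer/product pairs that actually have a review. A minimal check using only the three counts printed above; the function name `interaction_density` is made up for this illustration and is not part of `amazon_dataset`, and the formula assumes each reviewer/product pair is counted at most once:

```python
def interaction_density(n_reviews: int, n_reviewers: int, n_products: int) -> float:
    """Fraction of the reviewer x product matrix that holds a review."""
    return n_reviews / (n_reviewers * n_products)


# Counts copied from the summary line above.
density = interaction_density(178944, 23318, 38493)
print(f"Density: {density:.4%}")
```

A matrix this sparse is the reason a sparse format such as `scipy.sparse` is the natural choice for later processing.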


In [9]:
reviews = amazon_dataset.reviews_df(DATASET)
reviews.sample(n=5)

,id,asin,reviewerID,reviewerName,overall,text,reviewTime,summary,verified,vote
119349,21125619,B01BVRPQ68,A1S3EKIHURJQ1J,Cyndi L,4.0,Love these jeans !! The gold buttons are a nic...,2018-05-31,Cute high waisted jeans with button details,True,
64995,15014343,B00NNEUYLK,ABMWIVQGO73Y,Diungano,5.0,Excelent,2018-06-24,Five Stars,True,
176487,32026294,B01FP4VTY6,A23ZXUEASHM8SL,Sharlene,5.0,regal,2018-08-03,Five Stars,True,
60527,14062419,B00L9Q5DBA,ALBKTPA5FQAUV,Amazon Customer,2.0,NICE SKIRT BUT TOO BIG AND BULKY,2018-03-17,Two Stars,True,
163961,30049062,B011SZGR2A,A3ITL1PV9XVV2B,Chiquita Rolle Jones,3.0,"Shoes are cute. However, my granddaughter wore...",2018-06-21,My granddaughter was sad..,True,
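The load step in section 2 takes a date window and minimum review counts per reviewer and per product. How `amazon_dataset` applies them internally is not shown here. One plausible reading, sketched below in pandas on a frame with the same `reviewerID`, `asin` and `reviewTime` columns, is a date filter followed by repeated pruning until every remaining reviewer and product meets its threshold. `filter_reviews` is a hypothetical helper, and the inclusive/exclusive handling of the date bounds is an assumption:

```python
from datetime import date

import pandas as pd


def filter_reviews(
    reviews: pd.DataFrame,
    min_date: date,
    max_date: date,
    min_reviews_per_reviewer: int = 5,
    min_reviews_per_asin: int = 5,
) -> pd.DataFrame:
    """Hypothetical re-implementation; not the amazon_dataset code."""
    when = pd.to_datetime(reviews["reviewTime"])
    # Assumption: min_date is inclusive and max_date is exclusive.
    kept = reviews[(when >= pd.Timestamp(min_date)) & (when < pd.Timestamp(max_date))]

    # Dropping sparse products can push a reviewer below the threshold
    # (and vice versa), so repeat until a pass removes nothing.
    while True:
        before = len(kept)
        reviewer_counts = kept["reviewerID"].value_counts()
        active = reviewer_counts[reviewer_counts >= min_reviews_per_reviewer].index
        kept = kept[kept["reviewerID"].isin(active)]
        asin_counts = kept["asin"].value_counts()
        popular = asin_counts[asin_counts >= min_reviews_per_asin].index
        kept = kept[kept["asin"].isin(popular)]
        if len(kept) == before:
            return kept
```

Calling `filter_reviews(reviews, date(2018, 3, 1), date(2018, 10, 1))` on the frame above should leave it essentially unchanged if the database was already filtered this way.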


In [10]:
products = amazon_dataset.products_df(DATASET)
products.sample(n=3)

id,asin,description,title,brand,main_cat,rank,price,image_slug,image_url,feature,category,tech_detail
2323046,B0193MJAV0,,YAN & LEI Sterling Silver 6MM Freshwater Cultu...,,,"218,292inClothing,ShoesJewelry(",$13.99,"[41PocfHaypL, 41gRs7ckenL, 51kqMu-Aq%2BL, 51gi...",[https://images-na.ssl-images-amazon.com/image...,[*SUPERIOR QUALITY* AAA level Handpicked fresh...,"[Clothing, Shoes & Jewelry, Women, Jewelry, Ea...",
2139537,B014X9RF60,,Dippin' Daisy's Womens Modern Print Bandeau Bl...,Dippin' Daisy's,,"85,393inClothing,ShoesJewelry(",$23.99,"[41jNN%2Bg-DWL, 412sZoF1lwL, 410ybydSRoL, 41pW...",[https://images-na.ssl-images-amazon.com/image...,"[MADE IN USA. 80% NYLON, 20% SPANDEX., TRENDY ...","[Clothing, Shoes & Jewelry, Women, Clothing, S...",
2478555,B01CS7YMBU,,Olivia's Sweetheart Lace up Bridesmaid Dresses...,Olivia's,,"1,113,459inClothing,ShoesJewelry(",$42.50,"[41rKDSR-bAL, 410Iv5DGP3L, 41vxFJuKVFL, 51iBZY...",[https://images-na.ssl-images-amazon.com/image...,[Shipping Information:\n \n...,"[Clothing, Shoes & Jewelry, Women, Clothing, D...",


In [11]:
amazon_dataset.product_images_df(DATASET)

id,url,main,slug,product_id
1,https://images-na.ssl-images-amazon.com/images...,False,51HJbA8UG2L,47
2,https://images-na.ssl-images-amazon.com/images...,False,51FufN7RbSL,47
3,https://images-na.ssl-images-amazon.com/images...,False,51vKjwQ6eAL,47
4,https://images-na.ssl-images-amazon.com/images...,False,410fEp9sdjL,47
5,https://images-na.ssl-images-amazon.com/images...,False,51vFScdjWiL,47
...,...,...,...,...
364457,https://m.media-amazon.com/images/I/316MQnmFeR...,True,316MQnmFeRL,19408
364458,https://m.media-amazon.com/images/I/31hZqQaMRd...,True,31hZqQaMRdS,19470
364459,https://m.media-amazon.com/images/I/41Vu1V9R-R...,True,41Vu1V9R-RL,19458
364460,https://m.media-amazon.com/images/I/41CqphnHfL...,True,41CqphnHfLL,19430


Download the product images using the Amazon web service.
This process can take around 2 hours and retrieves around 90% of the product images.

In [13]:
amazon_dataset.download_main_product_images_webservice(DATASET)

100%|██████████| 3642/3642 [07:43<00:00,  7.85image/s, Errors 3642 https://ws-na.amazon-adsystem.com/widgets/q?_encoding=UTF8&MarketPlace=US&ASIN=B01HJ0WZ7E&ServiceVersion=20070822&ID=AsinImage&WS=1&Format=SS400]


Some products will still have no main image. For those we can use the first
parsed image of each product as its main image.

This only covers products with at least one parsed image!

In [15]:
amazon_dataset.download_main_image_heuristic(DATASET)

100%|██████████| 33/33 [00:08<00:00,  4.11product/s, Errors 33 product_id=2578953]
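The fallback described above can be sketched on a frame shaped like the `product_images_df` output (`main`, `slug`, `product_id`, indexed by image `id`). `pick_fallback_main_images` is a hypothetical helper, not the library function, and it assumes that "first" means the lowest image id for a product:

```python
import pandas as pd


def pick_fallback_main_images(images: pd.DataFrame) -> pd.DataFrame:
    """Return one image row per product that has no image flagged as main."""
    has_main = images.groupby("product_id")["main"].transform("any")
    candidates = images[~has_main]
    # Assumption: "first" means the lowest image id for that product.
    return candidates.sort_index().groupby("product_id").head(1)


# Toy data: product 47 has no main image, product 48 already has one.
images = pd.DataFrame(
    {
        "main": [False, False, True, False],
        "slug": ["a1", "a2", "b1", "b2"],
        "product_id": [47, 47, 48, 48],
    },
    index=pd.Index([1, 2, 3, 4], name="id"),
)
fallback = pick_fallback_main_images(images)
print(fallback)
```

Only product 47 appears in `fallback`, with its first image `a1`.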


A sanity check to verify that the images in the image folder match those recorded in the database.

In [17]:
amazon_dataset.check_all_images_are_ok(DATASET)
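What this check compares internally is not shown. A rough sketch of the idea, assuming one file per image named `<slug>.jpg` in a single folder (both the layout and the helper `compare_images` are assumptions made for this illustration):

```python
import tempfile
from pathlib import Path
from typing import Iterable, Set, Tuple


def compare_images(image_dir: Path, db_slugs: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (slugs with no file on disk, files with no database row)."""
    on_disk = {path.stem for path in image_dir.glob("*.jpg")}
    in_db = set(db_slugs)
    return in_db - on_disk, on_disk - in_db


# Toy folder with two files; the "database" knows about a different pair.
image_dir = Path(tempfile.mkdtemp())
for slug in ("51HJbA8UG2L", "51FufN7RbSL"):
    (image_dir / f"{slug}.jpg").write_bytes(b"not a real image")

missing, orphans = compare_images(image_dir, ["51HJbA8UG2L", "51vKjwQ6eAL"])
print("missing on disk:", missing)
print("not in database:", orphans)
```

Both sets being empty would correspond to the check passing silently, as it does above.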

In [18]:
# Products still with no images at all!
amazon_dataset.products_with_no_main_image_df(DATASET)

id,asin,description,title,brand,main_cat,rank,price
3129,B0002NZ87U,"An enduring favorite, our comfortable classic ...",Port Authority® Ladies Silk Touch™ Polo. L500,Port Authority,,"14,986inClothing,ShoesJewelry(",$11.90 - $70.51
16337,B000BTDP1Q,Gildan is a leading provider of everyday quali...,Gildan Men's Ultra Cotton Tee Extended Sizes,,,"2,304inClothing,ShoesJewelry(",$0.11 - $43.03
35790,B000JL2KJE,,Spiritual Guy Adult Costume,Safari Garden,,,
36528,B000JUFG5K,"Lightweight and comfortable, these gloves are ...",Isotoner Therapeutic Gloves,,Health & Personal Care,,$19.50 - $40.05
43534,B000N5CP0U,Comfortable 100% brushed cotton twill. raised ...,Military Caps Vietnam Veteran Logo Baseball Ca...,The Military Trail Gear Shop,,"56,295inClothing,ShoesJewelry(",$10.60
...,...,...,...,...,...,...,...
2645126,B01GKPS0Y6,Material:100% CottonCondition: 100% brand newS...,Dolphin&Fish Boys Girls Pajamas Toddler Boys S...,,,"809,062inClothing,ShoesJewelry(",
2664148,B01H3AOME4,VamJump Womens Black Casual Zipper Drawstring ...,VamJump Women Sleeveless Zipper Tie Waist One ...,VamJump,,"1,442,482inClothing,ShoesJewelry(",
2670342,B01H7DHVQ8,"X-Temp technology adapts to temperature, envir...",KingSize Men's Big & Tall Hanes X-Temp Boxer B...,,,"1,169,253inClothing,ShoesJewelry(",
2671251,B01H7WNDRU,,Women's Fashion Comfy Vegan Suede Block Heel S...,RF ROOM OF FASHION,,"4,311,008inClothing,ShoesJewelry(",$57.99


There are also products with duplicate images.

In [28]:
amazon_dataset.delete_non_relevant_images(DATASET)

Deleting non main product images
Getting duplicated product images
Deleting duplicated
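How `delete_non_relevant_images` decides that two images are duplicates is not shown. One common technique is to hash the file contents and group files that share a digest. The sketch below uses only the standard library; `find_duplicate_files` is a hypothetical helper and only detects byte-identical files:

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def find_duplicate_files(image_dir: Path) -> List[List[Path]]:
    """Group files in image_dir whose bytes are identical."""
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(image_dir.iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


# Toy folder: a.jpg and b.jpg share the same bytes, c.jpg differs.
image_dir = Path(tempfile.mkdtemp())
(image_dir / "a.jpg").write_bytes(b"same bytes")
(image_dir / "b.jpg").write_bytes(b"same bytes")
(image_dir / "c.jpg").write_bytes(b"other bytes")

duplicate_groups = find_duplicate_files(image_dir)
print([[path.name for path in group] for group in duplicate_groups])
```

Resized or re-encoded copies of the same picture would need a perceptual hash instead, which is outside this sketch.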


In [29]:
amazon_dataset.check_all_images_are_ok(DATASET)

In [32]:
amazon_dataset.vacuum_dataset(DATASET)
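After deleting many rows, a SQLite file keeps its size until the free pages are reclaimed. Assuming `vacuum_dataset` issues a SQLite `VACUUM` (its implementation is not shown), the effect can be reproduced on a throwaway database with the standard `sqlite3` module; the table and file names are made up:

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE image (id INTEGER PRIMARY KEY, data BLOB)")
conn.executemany(
    "INSERT INTO image (data) VALUES (?)",
    [(b"x" * 4096,) for _ in range(500)],
)
conn.commit()

# Deleting rows only marks their pages as free; the file does not shrink.
conn.execute("DELETE FROM image")
conn.commit()
size_before = os.path.getsize(db_path)

# VACUUM rebuilds the file; it must run outside an open transaction.
conn.execute("VACUUM")
size_after = os.path.getsize(db_path)
conn.close()

print(f"{size_before} bytes -> {size_after} bytes")
```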

# 3. Analyze Data

The plots below show the distribution of star ratings and the number of reviews per user.

In [34]:
fig = px.histogram(reviews, x="overall", title='Stars per review')
fig.show()

In [35]:
fig = px.bar(
    reviews.groupby('reviewerID')['reviewerID'].count().value_counts(),
    log_y=True,
    title='Users vs Number of Reviews'
)
fig.show()
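The expression passed to `px.bar` first counts reviews per reviewer, then counts how many reviewers share each review count. A toy frame with made-up reviewer ids makes the two steps visible:

```python
import pandas as pd

toy_reviews = pd.DataFrame(
    {"reviewerID": ["u1", "u1", "u1", "u2", "u2", "u3", "u3"]}
)

# Step 1: number of reviews written by each reviewer.
reviews_per_user = toy_reviews.groupby("reviewerID")["reviewerID"].count()

# Step 2: number of reviewers for each review count
# (the index becomes the x axis of the bar chart, the values the y axis).
users_per_review_count = reviews_per_user.value_counts()
print(users_per_review_count)
```

Here two reviewers wrote 2 reviews and one wrote 3, so the bar chart would have a bar of height 2 at x=2 and a bar of height 1 at x=3.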