# NIRS (data analysis)

We use the Amazon product review dataset ([link to the dataset](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/)), specifically the Office Products category.

In [2]:
import pandas as pd
import numpy as np
import gzip
import json
import utils
import matplotlib.pyplot as plt
import seaborn as sns
import data_cleaning as cleaning

utils.seed_everything(42)
%matplotlib inline
sns.set_theme()

## Reading data

In [31]:
# review data
df_reviews_init = utils.getDF('data/Office_Products_5.json.gz')

# product data
df_products_init = utils.getDF('data/meta_Office_Products.json.gz')

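The body of `utils.getDF` is not shown in this notebook. As a rough sketch, assuming the `.json.gz` files contain one JSON object per line, a loader along these lines would produce a comparable DataFrame. The names `parse_json_lines` and `get_df_sketch` and the `max_rows` parameter are hypothetical and only illustrate the idea; `max_rows` is an added convenience for trying the loader on a small slice before committing to a full read.

```python
import gzip
import json

import pandas as pd


def parse_json_lines(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def get_df_sketch(path, max_rows=None):
    """Hypothetical stand-in for utils.getDF (not the project's actual helper)."""
    records = []
    for i, record in enumerate(parse_json_lines(path)):
        # Stop early when a row limit is given.
        if max_rows is not None and i >= max_rows:
            break
        records.append(record)
    return pd.DataFrame.from_records(records)
```

For example, `get_df_sketch('data/Office_Products_5.json.gz', max_rows=1000)` would read only the first 1000 reviews.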

As in the data preparation step, we take a sample of both the reviews and the products (or read the samples from disk if they already exist).

In [5]:
import os

reviews_sampled_path = 'data/reviews_sampled.csv'
products_sampled_path = 'data/products_sampled.csv'

# sample data
if os.path.exists(reviews_sampled_path) and os.path.exists(products_sampled_path):
    df_reviews_sampled = pd.read_csv(reviews_sampled_path)
    df_products_sampled = pd.read_csv(products_sampled_path)
else:
    df_reviews_sampled, df_products_sampled = utils.sample_data(
        df_reviews_init, df_products_init,
        min_reviews_count=10, max_users=1000, frac_products=0.1)
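The implementation of `utils.sample_data` is not part of this notebook, so the meaning of its parameters can only be guessed from their names. The sketch below shows one plausible reading: keep users with at least `min_reviews_count` reviews, cap them at `max_users`, and keep all reviewed products plus a random fraction `frac_products` of the remaining ones. The function name `sample_data_sketch`, the `seed` parameter, and every step of the logic are assumptions.

```python
import pandas as pd


def sample_data_sketch(reviews, products, min_reviews_count=10,
                       max_users=1000, frac_products=0.1, seed=42):
    """Hypothetical sampling routine; not the project's utils.sample_data."""
    # Assumption: keep only users with at least min_reviews_count reviews.
    counts = reviews["reviewerID"].value_counts()
    active = counts[counts >= min_reviews_count].index.to_series()

    # Assumption: cap the number of retained users at max_users.
    if len(active) > max_users:
        active = active.sample(n=max_users, random_state=seed)
    reviews_s = reviews[reviews["reviewerID"].isin(active)]

    # Assumption: keep every reviewed product plus a random fraction of the rest.
    reviewed = products["asin"].isin(reviews_s["asin"])
    extra = products[~reviewed].sample(frac=frac_products, random_state=seed)
    products_s = pd.concat([products[reviewed], extra])

    return reviews_s.reset_index(drop=True), products_s.reset_index(drop=True)
```

Passing a fixed `seed` keeps the sample reproducible across runs, in the same spirit as the `utils.seed_everything(42)` call at the top of the notebook.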

A little bit of cleaning:

In [6]:
# drop irrelevant columns for the analysis
df_reviews = cleaning.clean_reviews_data(df_reviews_sampled)
df_products = cleaning.clean_products_data(df_products_sampled)

In [8]:
df_products.head()

Unnamed: 0,description,title,brand,feature,rank,main_cat,date,asin
0,['Protect yourself and your RFID card with a S...,Black RFID Blocking ID Badge Holder (Holds 2 C...,Specialist ID,"['RFID Blocking 2 Card Holder', 'FIPS 201 Appr...","['>#43,873 in Office Products (See top 100)', ...",Office Products,"October 14, 2011",B005VSY1VK
1,['The Star Wars Moleskine Saga continues in 20...,Moleskine 2015 Star Wars Limited Edition Daily...,Moleskine,[],[],Office Products,"December 26, 2013",8867323296
2,"['Staples Washable Glue Sticks, Purple, .26 oz...","Staples Washable Glue Sticks, Purple, .26 oz.,...",Staples,[],"['>#161,293 in Office Products (See top 100)',...",Office Products,"June 22, 2015",B011LAU4R6
3,"['Exclusive design, classic']",Best Abstract Fiery Floral Design Mouse Pads C...,Luxlady?Mousepad,['Material is made of the best plastic manufac...,"['>#143,156 in Computers & Accessories > Compu...",Cell Phones & Accessories,"September 21, 1677",B00KH94VSG
4,"['Kitten On Piano Keys Mouse Pad is 8"" x 8"" x ...",3dRose LLC 8 x 8 x 0.25 Inches Kitten on Piano...,3dRose,"['Dimensions (in inches): 8 W x 8 H x 0.25 D',...","['>#1,396,217 in Office Products (See top 100)...",Office Products,"July 14, 2014",B00CX71JNU
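In the output above, the list-valued columns (`description`, `feature`, `rank`) show quoted items such as `['RFID Blocking 2 Card Holder', ...`, which suggests they were read back from CSV as plain strings rather than Python lists. If that is the case, a small helper based on `ast.literal_eval` can restore them. The helper name `parse_list_cell` is hypothetical, and the fallback behaviour (wrapping unparseable text in a one-element list, mapping missing values to an empty list) is a design assumption.

```python
import ast

import pandas as pd


def parse_list_cell(value):
    """Turn a stringified list such as "['a', 'b']" back into a Python list."""
    if isinstance(value, list):
        return value  # already a list, nothing to do
    if not isinstance(value, str):
        return []  # missing values (None / NaN) become an empty list
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return [value]  # not a Python literal: keep the raw text
    return parsed if isinstance(parsed, list) else [value]


# Toy frame imitating the stringified column seen after pd.read_csv.
example = pd.DataFrame(
    {"feature": ["['RFID Blocking 2 Card Holder']", "[]", None]}
)
example["feature"] = example["feature"].apply(parse_list_cell)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text coming from a data file.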


In [20]:
# join each review with the metadata of the reviewed product
df = pd.merge(df_reviews, df_products, on='asin')

In [21]:
utils.print_shapes(df_reviews_init, df_products)

Reviews df shape: (800357, 12)
Products df shape: (315458, 8)


In [22]:
print(f'Number of unique products: {df_products["asin"].nunique()}')
print(f'Number of unique users: {df_reviews["reviewerID"].nunique()}')

Number of unique products: 306617
Number of unique users: 101501
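The outputs above report 315458 product rows but only 306617 unique `asin` values, so some products appear more than once in `df_products`. With an inner merge on `asin`, every review of such a product is repeated once per duplicate row. The toy example below (made-up data, not taken from the dataset) illustrates the effect and one possible remedy: drop duplicate `asin` rows first and let `validate="many_to_one"` check the assumption. Keeping the first duplicate is an arbitrary choice made for this sketch.

```python
import pandas as pd

# Toy data: product "A" is listed twice in the product table.
reviews = pd.DataFrame({
    "reviewerID": ["u1", "u2", "u3"],
    "asin": ["A", "A", "B"],
    "overall": [4.0, 5.0, 3.0],
})
products = pd.DataFrame({
    "asin": ["A", "A", "B"],
    "title": ["Stapler", "Stapler", "Glue stick"],
})

# Naive merge: each review of "A" matches both product rows.
naive = pd.merge(reviews, products, on="asin")

# Deduplicate on the key, then ask pandas to verify that each asin is unique
# on the right-hand side; a violation raises pandas.errors.MergeError.
products_unique = products.drop_duplicates(subset="asin", keep="first")
merged = pd.merge(reviews, products_unique, on="asin",
                  how="inner", validate="many_to_one")

print(len(reviews), len(naive), len(merged))
```

Here the naive merge yields 5 rows from 3 reviews, while the deduplicated merge keeps exactly one row per review.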


In [23]:
df_reviews.head(3)

Unnamed: 0,overall,reviewTime,reviewerID,asin,reviewerName,reviewText,summary
0,4.0,"11 7, 2017",A2NIJTYWADLK57,140503528,cotton clay,kids like story BUT while i really wanted a bo...,"good story, small size book though"
1,4.0,"03 7, 2017",A2827D8EEURMP4,140503528,emankcin,Bought this used and it came in great conditio...,Good
2,5.0,"06 25, 2016",APB6087F4J09J,140503528,Starbucks Fan,Every story and book about Corduroy is Fantast...,Best Books for All Children
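The `reviewTime` values shown above are strings of the form `"11 7, 2017"` (month, day, year). For any time-based analysis they would need to be converted to datetimes; a minimal sketch, using the three values from the output above and assuming every row follows the same pattern:

```python
import pandas as pd

# The three reviewTime strings from the preview above.
review_times = pd.Series(["11 7, 2017", "03 7, 2017", "06 25, 2016"])

# Explicit format: month number, day of month, four-digit year.
parsed = pd.to_datetime(review_times, format="%m %d, %Y")
```

On the real frame this would be applied as `pd.to_datetime(df_reviews["reviewTime"], format="%m %d, %Y")`; rows that deviate from the pattern raise an error unless `errors="coerce"` is passed.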


In [24]:
df_products.head(3)

Unnamed: 0,description,title,brand,feature,rank,main_cat,date,asin
0,[Sequential Spelling is based on the classic O...,Sequential Spelling Level 1 Bundle with Studen...,STL Distributors,[],"[>#439,654 in Office Products (See top 100), >...",Office Products,"August 15, 2014",12624861
1,"[Unusual book, , ]","Mathematics, Applications and Concepts, Course...",bailey,[],"3,839,628 in Books (",Books,,78652669
2,[Pearson MyHistoryLab Online Access Code for A...,Pearson MyHistoryLab Online Access Code for Am...,Pearson MyHistoryLab,[Pearson MyHistoryLab Online Access Code for A...,"[>#1,925,354 in Office Products (See top 100)]",Office Products,"June 21, 2012",136039847
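The `rank` column mixes list entries such as `>#439,654 in Office Products (See top 100)` with bare strings such as `3,839,628 in Books (`. If a numeric sales rank is needed later, it can be extracted with a regular expression. The helper `parse_rank` and the pattern below are hypothetical and were written only against the handful of values visible in this notebook.

```python
import re

# Optional '#', a number with thousands separators, ' in ', then the category
# name up to the next '(', '>' or bracket.
RANK_PATTERN = re.compile(r"#?(\d[\d,]*)\s+in\s+([^(>\[\]]+)")


def parse_rank(entry):
    """Return (position, category) from one rank string, or None if absent."""
    match = RANK_PATTERN.search(entry)
    if match is None:
        return None
    position = int(match.group(1).replace(",", ""))
    category = match.group(2).strip()
    return position, category


print(parse_rank(">#439,654 in Office Products (See top 100)"))
print(parse_rank("3,839,628 in Books ("))
```

For list-valued cells, the function would be applied to each element of the list, for instance to the first entry only.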


In [25]:
df.head(3)

Unnamed: 0,overall,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,description,title,brand,feature,rank,main_cat,date
0,4.0,"11 7, 2017",A2NIJTYWADLK57,140503528,cotton clay,kids like story BUT while i really wanted a bo...,"good story, small size book though",[Corduroy the bear goes to the launderette wit...,A Pocket for Corduroy,Ingram Book & Distributor,[9780140503524],"[>#422,894 in Office Products (See top 100), >...",Office Products,"September 14, 2006"
1,4.0,"03 7, 2017",A2827D8EEURMP4,140503528,emankcin,Bought this used and it came in great conditio...,Good,[Corduroy the bear goes to the launderette wit...,A Pocket for Corduroy,Ingram Book & Distributor,[9780140503524],"[>#422,894 in Office Products (See top 100), >...",Office Products,"September 14, 2006"
2,5.0,"06 25, 2016",APB6087F4J09J,140503528,Starbucks Fan,Every story and book about Corduroy is Fantast...,Best Books for All Children,[Corduroy the bear goes to the launderette wit...,A Pocket for Corduroy,Ingram Book & Distributor,[9780140503524],"[>#422,894 in Office Products (See top 100), >...",Office Products,"September 14, 2006"


## Exploratory Data Analysis (EDA)