<a href="https://colab.research.google.com/github/lmbd92/DataScienceMonograph/blob/main/Notebooks/EDA_AMAZON_Lina.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EDA for Development and Evaluation of Recommendation Systems with Sentiment Analysis for Sales Optimization in the Online Market: A Data Analytical Approach

## Step 1: Import Python Libraries

In [4]:
# data manipulation
import pandas as pd
import numpy as np
import os
import json
import gzip
from urllib.request import urlopen


# data viz
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns


# use the ggplot style and a wider default figure size
plt.style.use("ggplot")
rcParams['figure.figsize'] = (12, 6)

## Step 2: Reading Datasets

### **Meta Data**

https://drive.google.com/file/d/1Bywp06mLpf9n9eXS4_UgWRJUoBjWFeui/view?usp=sharing

In [5]:
!gdown '1Bywp06mLpf9n9eXS4_UgWRJUoBjWFeui'

Downloading...
From: https://drive.google.com/uc?id=1Bywp06mLpf9n9eXS4_UgWRJUoBjWFeui
To: /content/meta_AMAZON_FASHION.json.gz
100% 33.0M/33.0M [00:01<00:00, 24.7MB/s]


In [7]:
# load the metadata: the gzipped file holds one JSON object per line

metaData = []
with gzip.open('meta_AMAZON_FASHION.json.gz') as f:
    for line in f:
        metaData.append(json.loads(line.strip()))

# length of the list = total number of products
print(len(metaData))

# first row of the list
print(metaData[0])

186637
{'title': 'Slime Time Fall Fest [With CDROM and Collector Cards and Neutron Balls, Incredi-Ball and Glow Stick Necklace, Paper Fram', 'brand': 'Group Publishing (CO)', 'feature': ['Product Dimensions:\n                    \n8.7 x 3.6 x 11.4 inches', 'Shipping Weight:\n                    \n2.4 pounds'], 'rank': '13,052,976inClothing,Shoesamp;Jewelry(', 'date': '8.70 inches', 'asin': '0764443682', 'imageURL': ['https://images-na.ssl-images-amazon.com/images/I/51bSrINiWpL._US40_.jpg'], 'imageURLHighRes': ['https://images-na.ssl-images-amazon.com/images/I/51bSrINiWpL.jpg']}
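
The loading cell above reads the gzipped file line by line and parses each line as JSON. A minimal sketch of the same idea as a reusable generator follows. It assumes UTF-8 text and skips blank lines; the helper name `read_json_lines` and the two sample records are hypothetical, written to a temporary file so the sketch runs without the real download.

```python
import gzip
import json
import os
import tempfile


def read_json_lines(path):
    """Yield one parsed record per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing inside json.loads
                yield json.loads(line)


# Hypothetical two-record sample, not taken from the Amazon dataset.
sample_records = [
    {"asin": "0000000001", "title": "Sample product A"},
    {"asin": "0000000002", "title": "Sample product B"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_meta.json.gz")
with gzip.open(sample_path, mode="wt", encoding="utf-8") as handle:
    for record in sample_records:
        handle.write(json.dumps(record) + "\n")
    handle.write("\n")  # trailing blank line, to show that it is skipped

loaded = list(read_json_lines(sample_path))
print(len(loaded))
```

Because the helper is a generator, records can also be filtered while streaming rather than collected into one large list first.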


In [8]:
# convert list into pandas dataframe

meta_df = pd.DataFrame.from_dict(metaData)

print(len(meta_df))

186637


In [9]:
# clean the "title" feature: drop rows whose title is unformatted (some titles still contain HTML/JavaScript residue)

meta_df3 = meta_df.fillna('')
meta_df4 = meta_df3[meta_df3.title.str.contains('getTime')]   # rows with unformatted titles
meta_df5 = meta_df3[~meta_df3.title.str.contains('getTime')]  # keep only rows with clean titles
print(len(meta_df4))
print(len(meta_df5))

430
186207


In [10]:
# what those unformatted rows look like
meta_df4.iloc[0]

title              var aPageStart = (new Date()).getTime();\nvar ...
brand                                                               
feature            [Package Dimensions:\n                    \n3....
rank                              24,954,464inClothing,ShoesJewelry(
date                                                          Fossil
asin                                                      B0013HNSPS
imageURL                                                            
imageURLHighRes                                                     
description                                                         
price                                                               
also_view                                                           
also_buy                                                            
fit                                                                 
details                                                             
similar_item                      

In [17]:
meta_df5.head()

Unnamed: 0,title,brand,feature,rank,date,asin,imageURL,imageURLHighRes,description,price,also_view,also_buy,fit,details,similar_item,tech1
0,Slime Time Fall Fest [With CDROM and Collector...,Group Publishing (CO),[Product Dimensions:\n \n8....,"13,052,976inClothing,Shoesamp;Jewelry(",8.70 inches,764443682,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,,,,
1,XCC Qi promise new spider snake preparing men'...,,,"11,654,581inClothing,Shoesamp;Jewelry(",5 star,1291691480,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,,,,
2,Magical Things I Really Do Do Too!,Christopher Manos,[Package Dimensions:\n \n8....,"19,308,073inClothing,ShoesJewelry(",5 star,1940280001,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,[For the professional or amateur magician. Ro...,,,,,,,
3,"Ashes to Ashes, Oranges to Oranges",Flickerlamp Publishing,[Package Dimensions:\n \n8....,"19,734,184inClothing,ShoesJewelry(",5 star,1940735033,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,,,,
4,Aether & Empire #1 - 2016 First Printing Comic...,,[Package Dimensions:\n \n10...,"10,558,646inClothing,Shoesamp;Jewelry(",5 star,1940967805,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,$4.50,,,,,,


In [18]:
meta_df5.tail()

Unnamed: 0,title,brand,feature,rank,date,asin,imageURL,imageURLHighRes,description,price,also_view,also_buy,fit,details,similar_item,tech1
186632,JT Women's Elegant Off Shoulder Chiffon Maxi L...,JT,,"9,835,890inClothing,ShoesJewelry(",5 star,B01HJGXL4O,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,,,,
186633,Microcosm Retro Vintage Black Crochet Lace One...,Microcosm,[Package Dimensions:\n \n7....,"11,390,771inClothing,ShoesJewelry(",5 star5 star (0%),B01HJHF97K,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,,,,
186634,Lookatool Classic Plain Vintage Army Military ...,Lookatool,"[Cotton+Polyester, Imported, Item type:Basebal...","972,275inClothing,ShoesJewelry(",5 star,B01HJGJ9LS,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,$8.53,"[B00XLECZMS, B0018MQAOY, B00N833I4Q, B074DQSPP...","[B07BHQ1FXL, B00XLECZMS, B07CJWM5WY, B07CS97C1...","class=""a-normal a-align-center a-spacing-smal...",,,
186635,Edith Windsor Women's Deep V-neck Beaded Sequi...,Edith Windsor,[Product Dimensions:\n \n9....,"1,964,585inClothing,ShoesJewelry(",5 star,B01HJHTH5U,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,[B077ZLGMJ3],,,,
186636,Aeropostale Women's Sun & Waves Crop Cami L Gr...,,[Product Dimensions:\n \n5 ...,"9,379,125inClothing,ShoesJewelry(",5 star,B01HJFNU7S,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,,,,


In [19]:
meta_df5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 186207 entries, 0 to 186636
Data columns (total 16 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   title            186207 non-null  object
 1   brand            186207 non-null  object
 2   feature          186207 non-null  object
 3   rank             186207 non-null  object
 4   date             186207 non-null  object
 5   asin             186207 non-null  object
 6   imageURL         186207 non-null  object
 7   imageURLHighRes  186207 non-null  object
 8   description      186207 non-null  object
 9   price            186207 non-null  object
 10  also_view        186207 non-null  object
 11  also_buy         186207 non-null  object
 12  fit              186207 non-null  object
 13  details          186207 non-null  object
 14  similar_item     186207 non-null  object
 15  tech1            186207 non-null  object
dtypes: object(16)
memory usage: 24.2+ MB


In [23]:
meta_df5.nunique()

TypeError: ignored
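
The `TypeError` above is most likely raised because several columns (for example `feature`, `imageURL`, `also_buy`) hold Python lists, which are unhashable, and `nunique()` needs hashable values. One possible workaround, sketched here on a hypothetical miniature frame, is to convert lists to tuples before counting. The names `toy_meta`, `to_hashable` and `nunique_safe` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical miniature of the metadata frame: 'feature' holds lists, as in meta_df5.
toy_meta = pd.DataFrame(
    {
        "asin": ["A1", "A2", "A3"],
        "brand": ["JT", "JT", ""],
        "feature": [["Imported"], ["Imported"], ""],
    }
)


def to_hashable(value):
    """Turn lists into tuples so pandas can hash them; leave other values alone."""
    return tuple(value) if isinstance(value, list) else value


def nunique_safe(frame):
    """Count distinct values per column, tolerating list-valued cells."""
    return pd.Series(
        {col: frame[col].map(to_hashable).nunique() for col in frame.columns}
    )


unique_counts = nunique_safe(toy_meta)
print(unique_counts)
```

On the real data the same call would be `nunique_safe(meta_df5)`. Note that the empty strings introduced by `fillna('')` count as a distinct value in each column.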

In [22]:
meta_df5.isnull().sum()

title              0
brand              0
feature            0
rank               0
date               0
asin               0
imageURL           0
imageURLHighRes    0
description        0
price              0
also_view          0
also_buy           0
fit                0
details            0
similar_item       0
tech1              0
dtype: int64
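
The zeros above do not mean the metadata is complete: `meta_df5` descends from `meta_df.fillna('')`, so every missing value has already been replaced by an empty string. A rough way to recover the missing counts is to treat empty strings and empty containers as missing. This is sketched below on a hypothetical miniature frame; `toy_filled` and `is_blank` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature frame mimicking meta_df5 after fillna('').
toy_filled = pd.DataFrame(
    {
        "title": ["Dress", "Cap", "Cami"],
        "brand": ["JT", "", ""],
        "price": ["", "$8.53", ""],
        "also_buy": ["", ["B077ZLGMJ3"], []],
    }
)


def is_blank(value):
    """Treat None/NaN, blank strings and empty containers as missing."""
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0
    return pd.isna(value)


# Per-column count of cells that carry no information.
blank_counts = toy_filled.apply(lambda col: col.map(is_blank).sum())
print(blank_counts)
```

Alternatively, running `isnull().sum()` on `meta_df` before the `fillna('')` step reports the original missing values directly.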

### **Review Data**

https://drive.google.com/file/d/11Hu87ubzUZr33rvZPY_jFWQFHpBRjKnh/view?usp=sharing

In [12]:
!gdown '11Hu87ubzUZr33rvZPY_jFWQFHpBRjKnh'

Downloading...
From: https://drive.google.com/uc?id=11Hu87ubzUZr33rvZPY_jFWQFHpBRjKnh
To: /content/AMAZON_FASHION.json.gz
100% 93.2M/93.2M [00:02<00:00, 39.2MB/s]


In [13]:
# load the review data: the gzipped file holds one JSON object per line

reviewData = []
with gzip.open('AMAZON_FASHION.json.gz') as f:
    for line in f:
        reviewData.append(json.loads(line.strip()))

# length of the list = total number of reviews
print(len(reviewData))

# first row of the list
print(reviewData[0])

883636
{'overall': 5.0, 'verified': True, 'reviewTime': '10 20, 2014', 'reviewerID': 'A1D4G1SNUZWQOT', 'asin': '7106116521', 'reviewerName': 'Tracy', 'reviewText': 'Exactly what I needed.', 'summary': 'perfect replacements!!', 'unixReviewTime': 1413763200}


In [14]:
# convert list into pandas dataframe

review_df = pd.DataFrame.from_dict(reviewData)

print(len(review_df))

883636


In [20]:
review_df.head(5)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,5.0,True,"10 20, 2014",A1D4G1SNUZWQOT,7106116521,Tracy,Exactly what I needed.,perfect replacements!!,1413763200,,,
1,2.0,True,"09 28, 2014",A3DDWDH9PX2YX2,7106116521,Sonja Lau,"I agree with the other review, the opening is ...","I agree with the other review, the opening is ...",1411862400,3.0,,
2,4.0,False,"08 25, 2014",A2MWC41EW7XL15,7106116521,Kathleen,Love these... I am going to order another pack...,My New 'Friends' !!,1408924800,,,
3,2.0,True,"08 24, 2014",A2UH2QQ275NV45,7106116521,Jodi Stoner,too tiny an opening,Two Stars,1408838400,,,
4,3.0,False,"07 27, 2014",A89F3LQADZBS5,7106116521,Alexander D.,Okay,Three Stars,1406419200,,,
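
Each review carries its date twice: as text in `reviewTime` and as an integer in `unixReviewTime`. The sketch below converts both to datetimes and checks that they agree, using the first two rows from the output above. It assumes `unixReviewTime` is in seconds since the Unix epoch; `toy_reviews` and the `review_date` column are illustrative names that are not in the notebook.

```python
import pandas as pd

# First two review rows copied from the output above; the frame name is hypothetical.
toy_reviews = pd.DataFrame(
    {
        "overall": [5.0, 2.0],
        "reviewTime": ["10 20, 2014", "09 28, 2014"],
        "unixReviewTime": [1413763200, 1411862400],
    }
)

# Interpret the integer column as seconds since the Unix epoch (an assumption).
toy_reviews["review_date"] = pd.to_datetime(toy_reviews["unixReviewTime"], unit="s")

# Parse the human-readable column as well, to cross-check the two fields.
parsed_text_date = pd.to_datetime(toy_reviews["reviewTime"], format="%m %d, %Y")
dates_agree = bool((parsed_text_date == toy_reviews["review_date"]).all())

print(toy_reviews[["reviewTime", "review_date"]])
print(dates_agree)
```

A proper datetime column makes later time-based grouping (for example reviews per year) straightforward.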


## Step 3: Data Reduction

## Step 4: Feature Engineering

## Step 5: Creating Features

## Step 6: Data Cleaning/Wrangling

## Step 7: EDA Exploratory Data Analysis

## Step 8: Statistics Summary

## Step 9: EDA Univariate Analysis

## Step 10: Data Transformation

## Step 11: EDA Bivariate Analysis

## Step 12: EDA Multivariate Analysis

## Step 13: Impute Missing values