# Recommender System Using Amazon Reviews

## 2. Data Wrangling
This step focuses on collecting your data, organizing it, and making sure it's well defined. Some data cleaning can be done at this stage, but it's important not to be overzealous in your cleaning before you've explored the data to better understand it.
![dsm_2](img/dsm_2.png)

Data-source: https://www.kaggle.com/rogate16/amazon-reviews-2018-full-dataset

### 2.0 Import Packages

In [1]:
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen

import random
import numpy as np
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
from collections import defaultdict

import seaborn as sns

In [2]:
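## Helper functions to load gzipped, line-delimited records into a DataFrame
import ast

def parse(path):
  # Each line is a Python dict literal; literal_eval parses it without executing code
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield ast.literal_eval(l.decode('utf-8'))

def getDF(path):
  # Key each record by its row number, then build the DataFrame from the dict
  df = {}
  for i, d in enumerate(parse(path)):
    df[i] = d
  return pd.DataFrame.from_dict(df, orient='index')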

### 2.1 Data Collection
The term data collection refers to the process of acquiring data, collating it, and loading it into your work environment of choice (such as Jupyter Notebook). Here, the product metadata and the reviews are each loaded from a local gzipped file with the `getDF` helper defined above.

In [3]:
## Get data for product details
product_df = getDF('meta_Beauty.json.gz')

In [None]:
## Get data for product reviews
review_df = getDF('reviews_Beauty.json.gz')

In [4]:
product_df.head()

Unnamed: 0,asin,description,title,imUrl,salesRank,categories,price,related,brand
0,205616461,"As we age, our once youthful, healthy skin suc...",Bio-Active Anti-Aging Serum (Firming Ultra-Hyd...,http://ecx.images-amazon.com/images/I/41DecrGO...,{'Health & Personal Care': 461765},"[[Beauty, Skin Care, Face, Creams & Moisturize...",,,
1,558925278,Mineral Powder Brush--Apply powder or mineral ...,Eco Friendly Ecotools Quality Natural Bamboo C...,http://ecx.images-amazon.com/images/I/51L%2BzY...,{'Beauty': 402875},"[[Beauty, Tools & Accessories, Makeup Brushes ...",,,
2,733001998,"From the Greek island of Chios, this Mastiha b...",Mastiha Body Lotion,http://ecx.images-amazon.com/images/I/311WK5y1...,{'Beauty': 540255},"[[Beauty, Skin Care, Body, Moisturizers, Lotio...",,,
3,737104473,Limited edition Hello Kitty Lipstick featuring...,Hello Kitty Lustre Lipstick (See sellers comme...,http://ecx.images-amazon.com/images/I/31u6Hrzk...,{'Beauty': 931125},"[[Beauty, Makeup, Lips, Lipstick]]",,,
4,762451459,"The mermaid is an elusive (okay, mythical) cre...",Stephanie Johnson Mermaid Round Snap Mirror,http://ecx.images-amazon.com/images/I/41y2%2BF...,,"[[Beauty, Tools & Accessories, Mirrors, Makeup...",19.98,,


In [None]:
review_df.head()

### 2.2 Data Exploration

#### 2.2.1 Product Data Exploration

In [5]:
product_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 259204 entries, 0 to 259203
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   asin         259204 non-null  object 
 1   description  234497 non-null  object 
 2   title        258760 non-null  object 
 3   imUrl        259116 non-null  object 
 4   salesRank    254016 non-null  object 
 5   categories   259204 non-null  object 
 6   price        189930 non-null  float64
 7   related      207854 non-null  object 
 8   brand        128166 non-null  object 
dtypes: float64(1), object(8)
memory usage: 19.8+ MB


<ul>
    <li><tt>asin</tt> - ID of the product, e.g. <a href="http://www.amazon.com/dp/0000031852">0000031852</a></li>
    <li><tt>description</tt> - description of the product</li>
    <li><tt>title</tt> - name of the product</li>
    <li><tt>price</tt> - price in US dollars (at time of crawl)</li>
    <li><tt>imUrl</tt> - url of the product image</li>
    <li><tt>related</tt> - related products (also bought, also viewed, bought together, buy after viewing)</li>
    <li><tt>salesRank</tt> - sales rank information</li>
    <li><tt>brand</tt> - brand name</li>
    <li><tt>categories</tt> - list of categories the product belongs to</li>
 </ul>

In [6]:
missing = pd.concat([product_df.isnull().sum(), 100 * product_df.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
asin,0,0.0
categories,0,0.0
imUrl,88,0.03395
title,444,0.171294
salesRank,5188,2.001512
description,24707,9.531875
related,51350,19.810651
price,69274,26.725668
brand,131038,50.554004


#### 2.2.2 Review Data Exploration

In [None]:
review_df.info()

<ul>
    <li><tt>reviewerID</tt> - ID of the reviewer, e.g. <a href="http://www.amazon.com/gp/cdp/member-reviews/A2SUAM1J3GNN3B">A2SUAM1J3GNN3B</a></li>
    <li><tt>asin</tt> - ID of the product, e.g. <a href="http://www.amazon.com/dp/0000013714">0000013714</a></li>
    <li><tt>reviewerName</tt> - name of the reviewer</li>
    <li><tt>helpful</tt> - helpfulness rating of the review, e.g. 2/3</li>
    <li><tt>reviewText</tt> - text of the review</li>
    <li><tt>overall</tt> - rating of the product</li>
    <li><tt>summary</tt> - summary of the review</li>
    <li><tt>unixReviewTime</tt> - time of the review (unix time)</li>
    <li><tt>reviewTime</tt> - time of the review (raw)</li>
</ul>

In [None]:
missing = pd.concat([review_df.isnull().sum(), 100 * review_df.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

### 2.3 Data Definitions
- The goal of building data definitions is to describe the features in your dataset with enough detail to:
    1. Identify any issues that will require cleaning
    2. Identify the features of your dataset
    3. Begin to understand how those features will support the data science project you have in mind
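
One way to start a data definition is to tabulate the dtype, non-null count, missing share, and number of distinct values for every column. The sketch below is only an illustration: `describe_features` and `toy_df` are hypothetical names, and the toy values are made up rather than taken from the dataset.

```python
import pandas as pd

def describe_features(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, non-null count, missing share, distinct values."""
    summary = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'non_null': df.notnull().sum(),
        'missing_pct': 100 * df.isnull().mean(),
    })
    # nunique() fails on unhashable cells (lists, dicts), so compare their string form
    summary['n_unique'] = [df[col].dropna().astype(str).nunique() for col in df.columns]
    return summary

# Toy frame standing in for product_df (values are made up for illustration)
toy_df = pd.DataFrame({
    'asin': ['A1', 'A2', 'A3'],
    'price': [9.99, None, 9.99],
    'categories': [[['Beauty', 'Makeup']], [['Beauty']], [['Beauty', 'Makeup']]],
})
definitions = describe_features(toy_df)
print(definitions)
```

Comparing the string form of each cell is a workaround that lets list and dictionary columns be counted alongside plain ones.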

#### Possible Questions to understand datasets:

1. How should I handle the roughly 50% of null values in the *brand* column of *product_df*?
2. Which reviewer wrote the most reviews? Are those reviews helpful?
3. How many products has each brand registered?
4. Do I need to combine the two datasets?
5. How relevant is *reviewTime* to the positivity or negativity of a review?
6. How should I use the *helpful* column?
7. How many reviews does each product have?

#### Noteworthy:
1. This dataset focuses on reviews, so the recommendations should be based primarily on reviews too.
2. This dataset does not have many numerical values.
3. The *related* column lists other products related to each product. (This may help when clustering products together with keywords from the descriptions.)
4. I do not expect to use the product images going forward. (Images could help, but image processing is not my primary goal.)
5. *related* holds useful information of four different kinds (also bought, also viewed, bought together, buy after viewing); it is probably better to split them into separate columns.

#### 2.3.1 Product Data Definitions

In [4]:
product_df.head()

Unnamed: 0,asin,description,title,imUrl,salesRank,categories,price,related,brand
0,205616461,"As we age, our once youthful, healthy skin suc...",Bio-Active Anti-Aging Serum (Firming Ultra-Hyd...,http://ecx.images-amazon.com/images/I/41DecrGO...,{'Health & Personal Care': 461765},"[[Beauty, Skin Care, Face, Creams & Moisturize...",,,
1,558925278,Mineral Powder Brush--Apply powder or mineral ...,Eco Friendly Ecotools Quality Natural Bamboo C...,http://ecx.images-amazon.com/images/I/51L%2BzY...,{'Beauty': 402875},"[[Beauty, Tools & Accessories, Makeup Brushes ...",,,
2,733001998,"From the Greek island of Chios, this Mastiha b...",Mastiha Body Lotion,http://ecx.images-amazon.com/images/I/311WK5y1...,{'Beauty': 540255},"[[Beauty, Skin Care, Body, Moisturizers, Lotio...",,,
3,737104473,Limited edition Hello Kitty Lipstick featuring...,Hello Kitty Lustre Lipstick (See sellers comme...,http://ecx.images-amazon.com/images/I/31u6Hrzk...,{'Beauty': 931125},"[[Beauty, Makeup, Lips, Lipstick]]",,,
4,762451459,"The mermaid is an elusive (okay, mythical) cre...",Stephanie Johnson Mermaid Round Snap Mirror,http://ecx.images-amazon.com/images/I/41y2%2BF...,,"[[Beauty, Tools & Accessories, Mirrors, Makeup...",19.98,,


In [5]:
product_df.columns

Index(['asin', 'description', 'title', 'imUrl', 'salesRank', 'categories',
       'price', 'related', 'brand'],
      dtype='object')

In [6]:
product_df[['asin', 'description', 'title', 'imUrl', 'price', 'brand']].nunique()

asin           259204
description    196916
title          254924
imUrl          248421
price            9830
brand           13189
dtype: int64

##### Checkpoint
- 'asin' -> should be used as the index
- 'description' -> usage is uncertain
- 'title' -> mandatory
- 'imUrl' -> unnecessary column
- 'price' -> some items have no price
- 'related', 'salesRank' and 'categories' -> these columns hold dictionaries or lists and have to be cleaned before use

In [212]:
# drop 'imUrl' (copy so that later column assignments do not write to a view of product_df)
product_m_df = product_df[['asin', 'description', 'title', 'salesRank', 'categories',
       'price', 'related', 'brand']].copy()

##### Let's learn about the 'categories' column

In [291]:
# show what the categories column looks like
print(product_m_df['categories'])

0         [[Beauty, Skin Care, Face, Creams & Moisturize...
1         [[Beauty, Tools & Accessories, Makeup Brushes ...
2         [[Beauty, Skin Care, Body, Moisturizers, Lotio...
3                        [[Beauty, Makeup, Lips, Lipstick]]
4         [[Beauty, Tools & Accessories, Mirrors, Makeup...
                                ...                        
259199    [[Beauty, Hair Care, Styling Tools, Styling Ac...
259200               [[Beauty, Makeup, Nails, Nail Polish]]
259201    [[Beauty, Skin Care, Face, Creams & Moisturize...
259202    [[Beauty, Hair Care, Styling Tools, Styling Ac...
259203    [[Beauty, Tools & Accessories, Bags & Cases, T...
Name: categories, Length: 259204, dtype: object


In [292]:
# summary 1 
print('Each row of categories column has data type of', type(product_m_df['categories'][0]))

Each row of categories column has data type of <class 'list'>


In [293]:
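# get the unique category names within the beauty categories
## a set ignores duplicates, so no explicit membership checks are needed
## note: the '' seed makes this literal a set (not a dict) and it stays in the set
category_result_set = {''}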

for row in product_m_df['categories']:
    for rowlist in row:
        for listdata in rowlist:
            category_result_set.add(listdata)

category_result_df = pd.DataFrame(category_result_set)

In [299]:
print("The number of unique categories is", category_result_df.shape[0])

The number of unique categories is 657


Knowing the unique categories is useful, but it does not tell us how popular each category is.
Would category popularity introduce bias, or would it be helpful information?
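
The popularity question can be explored by counting how often each category name appears across products. This is a minimal sketch on made-up data; `toy_categories` is a hypothetical stand-in for `product_m_df['categories']`, where each cell is a list of category paths.

```python
import pandas as pd

# Toy stand-in for product_m_df['categories']: each cell is a list of category paths
toy_categories = pd.Series([
    [['Beauty', 'Makeup', 'Nails', 'Nail Polish']],
    [['Beauty', 'Makeup', 'Lips', 'Lipstick']],
    [['Beauty', 'Makeup', 'Nails', 'False Nails']],
])

# First explode: one row per category path; second explode: one row per category name
category_counts = toy_categories.explode().explode().value_counts()
print(category_counts)
```

A product whose paths repeat the same name would be counted once per occurrence here, so the counts are occurrences rather than distinct products.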

The client's business model centers on nail and eyelash products.

In [189]:
# distinguish interested categories - the keywords are 'Nail' and 'Eyelash'
interest_categories_list = category_result_df[category_result_df[0].str.contains('Nail|Eyelash', regex=True, na=False)]
print(interest_categories_list)
print("The number of interested categories is", interest_categories_list.shape[0])

                              0
15                   Nail Tools
39                        Nails
67   Fake Eyelashes & Adhesives
75     Nail Thickening Solution
161   Nails, Screws & Fasteners
190         Nail Polish Remover
201          Nail Art Equipment
287                Nail Brushes
368                 False Nails
384              Nail Whitening
385               Hands & Nails
404             Nail Treatments
466                    Nail Art
476                 Nail Polish
508                 Nail Dryers
510               Eyelash Tools
597          Nail Strengthening
610                 Nail Repair
651        Nail Files & Buffers
The number of interested categories is 19


Among these, we want to focus on 'False Nails' and 'Fake Eyelashes & Adhesives'.
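
A substring pattern such as 'Nail|Eyelash' also matches names like 'Nails, Screws & Fasteners', so an exact-name check may be a better fit for this narrower focus. The sketch below is one possible approach, not the method used later in this notebook; `TARGET_CATEGORIES`, `in_target`, and the toy category paths are hypothetical.

```python
import pandas as pd

TARGET_CATEGORIES = {'False Nails', 'Fake Eyelashes & Adhesives'}

def in_target(category_paths):
    """Return True if any category path contains one of the target category names."""
    if not isinstance(category_paths, list):
        return False  # guard against missing values
    return any(name in TARGET_CATEGORIES for path in category_paths for name in path)

# Toy products; the category paths are made up for illustration
toy_products = pd.DataFrame({
    'asin': ['A1', 'A2', 'A3'],
    'categories': [
        [['Beauty', 'Makeup', 'Nails', 'False Nails']],
        [['Beauty', 'Makeup', 'Nails', 'Nail Polish']],
        [['Beauty', 'Makeup', 'Eyes', 'Fake Eyelashes & Adhesives']],
    ],
})
toy_products['is_target'] = toy_products['categories'].apply(in_target)
print(toy_products[['asin', 'is_target']])
```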

In [190]:
product_m_df['interested_categories'] = " "

In [14]:
type(product_m_df['categories'][0])

list

In [323]:
product_m_df = pd.concat((product_m_df, product_m_df['categories'].apply(pd.Series)), axis=1)

In [324]:
# checkpoint
product_m_categories_df = product_m_df

In [325]:
product_m_categories_df = product_m_categories_df.rename(columns={0:'categories_list'})

In [327]:
product_m_categories_df.head()

Unnamed: 0,asin,description,title,salesRank,categories,price,related,brand,categories_list,1,...,5,6,7,8,9,10,11,12,13,14
0,205616461,"As we age, our once youthful, healthy skin suc...",Bio-Active Anti-Aging Serum (Firming Ultra-Hyd...,{'Health & Personal Care': 461765},"[[Beauty, Skin Care, Face, Creams & Moisturize...",,,,"[Beauty, Skin Care, Face, Creams & Moisturizers]",,...,,,,,,,,,,
1,558925278,Mineral Powder Brush--Apply powder or mineral ...,Eco Friendly Ecotools Quality Natural Bamboo C...,{'Beauty': 402875},"[[Beauty, Tools & Accessories, Makeup Brushes ...",,,,"[Beauty, Tools & Accessories, Makeup Brushes &...",,...,,,,,,,,,,
2,733001998,"From the Greek island of Chios, this Mastiha b...",Mastiha Body Lotion,{'Beauty': 540255},"[[Beauty, Skin Care, Body, Moisturizers, Lotio...",,,,"[Beauty, Skin Care, Body, Moisturizers, Lotions]",,...,,,,,,,,,,
3,737104473,Limited edition Hello Kitty Lipstick featuring...,Hello Kitty Lustre Lipstick (See sellers comme...,{'Beauty': 931125},"[[Beauty, Makeup, Lips, Lipstick]]",,,,"[Beauty, Makeup, Lips, Lipstick]",,...,,,,,,,,,,
4,762451459,"The mermaid is an elusive (okay, mythical) cre...",Stephanie Johnson Mermaid Round Snap Mirror,,"[[Beauty, Tools & Accessories, Mirrors, Makeup...",19.98,,,"[Beauty, Tools & Accessories, Mirrors, Makeup ...",,...,,,,,,,,,,


In [326]:
product_m_categories_df.columns

Index([           'asin',     'description',           'title',
             'salesRank',      'categories',           'price',
               'related',           'brand', 'categories_list',
                       1,                 2,                 3,
                       4,                 5,                 6,
                       7,                 8,                 9,
                      10,                11,                12,
                      13,                14],
      dtype='object')

Columns 1 to 14 hold additional category paths that are less relevant (mostly CDs), so they are dropped below.
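
If only the first category path is needed, it can be taken directly instead of expanding every path into its own column and dropping the extras. A minimal sketch on made-up data, assuming each cell is a non-empty list of paths:

```python
import pandas as pd

# Toy products; the second one has an extra, unrelated category path
toy_products = pd.DataFrame({
    'asin': ['A1', 'A2'],
    'categories': [
        [['Beauty', 'Makeup', 'Lips', 'Lipstick']],
        [['Beauty', 'Skin Care'], ['Music', 'CDs']],
    ],
})

# .str[0] indexes into each list cell, so this keeps only the first category path
toy_products['categories_list'] = toy_products['categories'].str[0]
toy_products = toy_products.drop(columns='categories')
print(toy_products)
```

This avoids creating the intermediate numbered columns, which may matter for a frame with hundreds of thousands of rows.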

In [330]:
product_m_categories_df.drop([1,2,3,4,5,6,7,8,9,10,11,12,13,14], axis=1, inplace=True)
product_m_categories_df.drop('categories', axis=1, inplace=True)

In [331]:
product_m_categories_df

Unnamed: 0,asin,description,title,salesRank,price,related,brand,categories_list
0,0205616461,"As we age, our once youthful, healthy skin suc...",Bio-Active Anti-Aging Serum (Firming Ultra-Hyd...,{'Health & Personal Care': 461765},,,,"[Beauty, Skin Care, Face, Creams & Moisturizers]"
1,0558925278,Mineral Powder Brush--Apply powder or mineral ...,Eco Friendly Ecotools Quality Natural Bamboo C...,{'Beauty': 402875},,,,"[Beauty, Tools & Accessories, Makeup Brushes &..."
2,0733001998,"From the Greek island of Chios, this Mastiha b...",Mastiha Body Lotion,{'Beauty': 540255},,,,"[Beauty, Skin Care, Body, Moisturizers, Lotions]"
3,0737104473,Limited edition Hello Kitty Lipstick featuring...,Hello Kitty Lustre Lipstick (See sellers comme...,{'Beauty': 931125},,,,"[Beauty, Makeup, Lips, Lipstick]"
4,0762451459,"The mermaid is an elusive (okay, mythical) cre...",Stephanie Johnson Mermaid Round Snap Mirror,,19.98,,,"[Beauty, Tools & Accessories, Mirrors, Makeup ..."
...,...,...,...,...,...,...,...,...
259199,B00LP2YB8E,Color: White\nFullness72 inches\nCenter Gather...,2t 2t Edge Crystal Rhinestones Bridal Wedding ...,,,,,"[Beauty, Hair Care, Styling Tools, Styling Acc..."
259200,B00LOS7MEE,"The secret to long lasting colors, healthy nai...",French Manicure Gel Nail Polish Set - &quot;Se...,{'Beauty': 108820},,"{'also_viewed': ['B0057JCYYE', 'B00LMXHR1Y', '...",,"[Beauty, Makeup, Nails, Nail Polish]"
259201,B00LPVG6V0,ResQ Organics Face & Body Wash - With Aloe Ver...,ResQ Organics Face &amp; Body Wash - Aloe Vera...,,,,,"[Beauty, Skin Care, Face, Creams & Moisturizers]"
259202,B00LTDUHJQ,Color: White\n2 Tier \nFullness 72 inches\nSew...,2 Tier Tulle Elbow Wedding Veil with Ribbon Ed...,,,,,"[Beauty, Hair Care, Styling Tools, Styling Acc..."


In [391]:
product_m_categories_df['interested'] = " "

In [336]:
product_m_categories_df['categories_list_string'] = [','.join(map(str, l)) for l in product_m_categories_df['categories_list']]

In [341]:
# str.contains already returns booleans, so no np.where wrapper is needed
product_m_categories_df['interested'] = product_m_categories_df['categories_list_string'].str.contains('Nail|Eyelash', regex=True, na=False)

In [360]:
product_m_categories_df

Unnamed: 0,asin,description,title,salesRank,price,related,brand,categories_list,interested,categories_list_string,interested2
0,0205616461,"As we age, our once youthful, healthy skin suc...",Bio-Active Anti-Aging Serum (Firming Ultra-Hyd...,{'Health & Personal Care': 461765},,,,"[Beauty, Skin Care, Face, Creams & Moisturizers]",False,"Beauty,Skin Care,Face,Creams & Moisturizers",na
1,0558925278,Mineral Powder Brush--Apply powder or mineral ...,Eco Friendly Ecotools Quality Natural Bamboo C...,{'Beauty': 402875},,,,"[Beauty, Tools & Accessories, Makeup Brushes &...",False,"Beauty,Tools & Accessories,Makeup Brushes & To...",na
2,0733001998,"From the Greek island of Chios, this Mastiha b...",Mastiha Body Lotion,{'Beauty': 540255},,,,"[Beauty, Skin Care, Body, Moisturizers, Lotions]",False,"Beauty,Skin Care,Body,Moisturizers,Lotions",na
3,0737104473,Limited edition Hello Kitty Lipstick featuring...,Hello Kitty Lustre Lipstick (See sellers comme...,{'Beauty': 931125},,,,"[Beauty, Makeup, Lips, Lipstick]",False,"Beauty,Makeup,Lips,Lipstick",na
4,0762451459,"The mermaid is an elusive (okay, mythical) cre...",Stephanie Johnson Mermaid Round Snap Mirror,,19.98,,,"[Beauty, Tools & Accessories, Mirrors, Makeup ...",False,"Beauty,Tools & Accessories,Mirrors,Makeup Mirrors",na
...,...,...,...,...,...,...,...,...,...,...,...
259199,B00LP2YB8E,Color: White\nFullness72 inches\nCenter Gather...,2t 2t Edge Crystal Rhinestones Bridal Wedding ...,,,,,"[Beauty, Hair Care, Styling Tools, Styling Acc...",False,"Beauty,Hair Care,Styling Tools,Styling Accesso...",na
259200,B00LOS7MEE,"The secret to long lasting colors, healthy nai...",French Manicure Gel Nail Polish Set - &quot;Se...,{'Beauty': 108820},,"{'also_viewed': ['B0057JCYYE', 'B00LMXHR1Y', '...",,"[Beauty, Makeup, Nails, Nail Polish]",True,"Beauty,Makeup,Nails,Nail Polish",Nails
259201,B00LPVG6V0,ResQ Organics Face & Body Wash - With Aloe Ver...,ResQ Organics Face &amp; Body Wash - Aloe Vera...,,,,,"[Beauty, Skin Care, Face, Creams & Moisturizers]",False,"Beauty,Skin Care,Face,Creams & Moisturizers",na
259202,B00LTDUHJQ,Color: White\n2 Tier \nFullness 72 inches\nSew...,2 Tier Tulle Elbow Wedding Veil with Ribbon Ed...,,,,,"[Beauty, Hair Care, Styling Tools, Styling Acc...",False,"Beauty,Hair Care,Styling Tools,Styling Accesso...",na


In [344]:
len(product_m_categories_df[product_m_categories_df['interested']==True])

28634

In [359]:
product_m_categories_df.to_csv('review_addInterestedCategories_df.csv')

##### Column 'salesRank'

In [381]:
#product_m_df = pd.read_csv('review_addInterestedCategories_df.csv')
#product_m_df.drop('Unnamed: 0', axis=1, inplace=True)
#product_m_df

In [361]:
product_m_df = product_m_categories_df

In [363]:
product_m_rank_df= product_m_df

In [364]:
product_m_rank_df = pd.concat((product_m_rank_df, product_m_rank_df['salesRank'].apply(pd.Series)), axis=1)

In [380]:
product_m_rank_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
price,189930.0,24.87816,33.43119,0.01,8.24,15.69,29.3,999.99
0,0.0,,,,,,,
"Arts, Crafts & Sewing",272.0,137353.4,138841.5,51.0,29025.25,97347.5,199541.0,730022.0
Automotive,9.0,5704.556,4083.638,886.0,2919.0,3394.0,9454.0,12481.0
Baby,17.0,50997.12,56170.45,1252.0,3246.0,45706.0,75403.0,188802.0
Beauty,215131.0,289337.0,250079.0,1.0,90207.5,225985.0,420338.0,1233410.0
Books,1.0,1204746.0,,1204746.0,1204746.0,1204746.0,1204746.0,1204746.0
Camera &amp; Photo,8.0,16017.5,6909.146,5159.0,11099.5,17699.0,20196.5,25243.0
Cell Phones & Accessories,3.0,202951.0,205831.8,27440.0,89676.5,151913.0,290706.5,429500.0
Clothing,2448.0,653489.5,665006.6,183.0,142155.75,435464.5,949063.75,3578230.0


The 'salesRank' column contains twenty-nine distinct category keys. We do not need all of them. Our primary focus will be 'Beauty'; 'Health & Personal Care' can be used if needed.
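
Since only two keys matter, the ranks could also be pulled out of the dictionaries directly rather than expanding all keys and dropping the rest. The sketch below assumes each 'salesRank' cell is either a dict or missing; `rank_in` is a hypothetical helper and the third toy row is made up.

```python
import numpy as np
import pandas as pd

def rank_in(sales_rank, category):
    """Return the rank for `category`, or NaN when the cell is missing or lacks it."""
    if isinstance(sales_rank, dict):
        return sales_rank.get(category, np.nan)
    return np.nan

toy_products = pd.DataFrame({
    'asin': ['A1', 'A2', 'A3'],
    'salesRank': [{'Beauty': 402875}, {'Health & Personal Care': 461765}, np.nan],
})
# One new column per category of interest
for category in ['Beauty', 'Health & Personal Care']:
    toy_products[category] = toy_products['salesRank'].apply(rank_in, args=(category,))
print(toy_products)
```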

In [385]:
product_m_rank_df.drop(['Arts, Crafts & Sewing', 'Automotive', 'Baby', 'Books', 'Camera &amp; Photo',
                   'Cell Phones & Accessories', 'Clothing', 'Computers & Accessories', 'Electronics',
                   'Grocery & Gourmet Food', 'Home &amp; Kitchen', 'Home Improvement',
                   'Industrial & Scientific', 'Jewelry', 'Kitchen & Dining', 'Magazines', 'Movies & TV', 'Music',
                   'Musical Instruments', 'Office Products', 'Patio, Lawn & Garden', 'Pet Supplies', 'Shoes',
                   'Software', 'Sports &amp; Outdoors', 'Toys & Games', 'Watches'], axis=1, inplace=True)

In [386]:
product_m_rank_df

Unnamed: 0,asin,description,title,salesRank,price,related,brand,categories_list,interested,categories_list_string,interested2,0,Beauty,Health & Personal Care
0,0205616461,"As we age, our once youthful, healthy skin suc...",Bio-Active Anti-Aging Serum (Firming Ultra-Hyd...,{'Health & Personal Care': 461765},,,,"[Beauty, Skin Care, Face, Creams & Moisturizers]",False,"Beauty,Skin Care,Face,Creams & Moisturizers",na,,,461765.0
1,0558925278,Mineral Powder Brush--Apply powder or mineral ...,Eco Friendly Ecotools Quality Natural Bamboo C...,{'Beauty': 402875},,,,"[Beauty, Tools & Accessories, Makeup Brushes &...",False,"Beauty,Tools & Accessories,Makeup Brushes & To...",na,,402875.0,
2,0733001998,"From the Greek island of Chios, this Mastiha b...",Mastiha Body Lotion,{'Beauty': 540255},,,,"[Beauty, Skin Care, Body, Moisturizers, Lotions]",False,"Beauty,Skin Care,Body,Moisturizers,Lotions",na,,540255.0,
3,0737104473,Limited edition Hello Kitty Lipstick featuring...,Hello Kitty Lustre Lipstick (See sellers comme...,{'Beauty': 931125},,,,"[Beauty, Makeup, Lips, Lipstick]",False,"Beauty,Makeup,Lips,Lipstick",na,,931125.0,
4,0762451459,"The mermaid is an elusive (okay, mythical) cre...",Stephanie Johnson Mermaid Round Snap Mirror,,19.98,,,"[Beauty, Tools & Accessories, Mirrors, Makeup ...",False,"Beauty,Tools & Accessories,Mirrors,Makeup Mirrors",na,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259199,B00LP2YB8E,Color: White\nFullness72 inches\nCenter Gather...,2t 2t Edge Crystal Rhinestones Bridal Wedding ...,,,,,"[Beauty, Hair Care, Styling Tools, Styling Acc...",False,"Beauty,Hair Care,Styling Tools,Styling Accesso...",na,,,
259200,B00LOS7MEE,"The secret to long lasting colors, healthy nai...",French Manicure Gel Nail Polish Set - &quot;Se...,{'Beauty': 108820},,"{'also_viewed': ['B0057JCYYE', 'B00LMXHR1Y', '...",,"[Beauty, Makeup, Nails, Nail Polish]",True,"Beauty,Makeup,Nails,Nail Polish",Nails,,108820.0,
259201,B00LPVG6V0,ResQ Organics Face & Body Wash - With Aloe Ver...,ResQ Organics Face &amp; Body Wash - Aloe Vera...,,,,,"[Beauty, Skin Care, Face, Creams & Moisturizers]",False,"Beauty,Skin Care,Face,Creams & Moisturizers",na,,,
259202,B00LTDUHJQ,Color: White\n2 Tier \nFullness 72 inches\nSew...,2 Tier Tulle Elbow Wedding Veil with Ribbon Ed...,,,,,"[Beauty, Hair Care, Styling Tools, Styling Acc...",False,"Beauty,Hair Care,Styling Tools,Styling Accesso...",na,,,


In [387]:
product_m_df = product_m_rank_df 

##### Column 'related'

In [388]:
product_m_related_df = product_m_df

In [390]:
product_m_related_df = pd.concat((product_m_related_df, product_m_related_df['related'].apply(pd.Series)), axis=1)

- The dictionaries in column 'related' use four keys: 'also_bought', 'also_viewed', 'bought_together', and 'buy_after_viewing'.
- The 'related' column can be split into four columns, each holding a list of 'asin' values.
- Because the current focus is on categorizing/clustering, this is less important at the moment.
- Let's move on.
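
Should the 'related' information become useful later, one option is to turn a single key into a long table of product pairs. This is a sketch under the assumption that each cell is either a dict of 'asin' lists or missing; the toy values and the `related_asin` column name are made up.

```python
import numpy as np
import pandas as pd

toy_products = pd.DataFrame({
    'asin': ['A1', 'A2'],
    'related': [{'also_bought': ['A2', 'A3'], 'also_viewed': ['A3']}, np.nan],
})

# Pull one key out of each dict; missing cells or missing keys become an empty list
toy_products['also_bought'] = toy_products['related'].apply(
    lambda r: r.get('also_bought', []) if isinstance(r, dict) else []
)

# One row per (product, related product) pair; products with no entries are dropped
also_bought_edges = (
    toy_products[['asin', 'also_bought']]
    .explode('also_bought')
    .dropna(subset=['also_bought'])
    .rename(columns={'also_bought': 'related_asin'})
)
print(also_bought_edges)
```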

In [392]:
product_m_related_df

Unnamed: 0,asin,description,title,salesRank,price,related,brand,categories_list,interested,categories_list_string,interested2,0,Beauty,Health & Personal Care,0.1,also_bought,also_viewed,bought_together,buy_after_viewing
0,0205616461,"As we age, our once youthful, healthy skin suc...",Bio-Active Anti-Aging Serum (Firming Ultra-Hyd...,{'Health & Personal Care': 461765},,,,"[Beauty, Skin Care, Face, Creams & Moisturizers]",False,"Beauty,Skin Care,Face,Creams & Moisturizers",na,,,461765.0,,,,,
1,0558925278,Mineral Powder Brush--Apply powder or mineral ...,Eco Friendly Ecotools Quality Natural Bamboo C...,{'Beauty': 402875},,,,"[Beauty, Tools & Accessories, Makeup Brushes &...",False,"Beauty,Tools & Accessories,Makeup Brushes & To...",na,,402875.0,,,,,,
2,0733001998,"From the Greek island of Chios, this Mastiha b...",Mastiha Body Lotion,{'Beauty': 540255},,,,"[Beauty, Skin Care, Body, Moisturizers, Lotions]",False,"Beauty,Skin Care,Body,Moisturizers,Lotions",na,,540255.0,,,,,,
3,0737104473,Limited edition Hello Kitty Lipstick featuring...,Hello Kitty Lustre Lipstick (See sellers comme...,{'Beauty': 931125},,,,"[Beauty, Makeup, Lips, Lipstick]",False,"Beauty,Makeup,Lips,Lipstick",na,,931125.0,,,,,,
4,0762451459,"The mermaid is an elusive (okay, mythical) cre...",Stephanie Johnson Mermaid Round Snap Mirror,,19.98,,,"[Beauty, Tools & Accessories, Mirrors, Makeup ...",False,"Beauty,Tools & Accessories,Mirrors,Makeup Mirrors",na,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259199,B00LP2YB8E,Color: White\nFullness72 inches\nCenter Gather...,2t 2t Edge Crystal Rhinestones Bridal Wedding ...,,,,,"[Beauty, Hair Care, Styling Tools, Styling Acc...",False,"Beauty,Hair Care,Styling Tools,Styling Accesso...",na,,,,,,,,
259200,B00LOS7MEE,"The secret to long lasting colors, healthy nai...",French Manicure Gel Nail Polish Set - &quot;Se...,{'Beauty': 108820},,"{'also_viewed': ['B0057JCYYE', 'B00LMXHR1Y', '...",,"[Beauty, Makeup, Nails, Nail Polish]",True,"Beauty,Makeup,Nails,Nail Polish",Nails,,108820.0,,,,"[B0057JCYYE, B00LMXHR1Y, B00993T7YY, B006FRS9O...",,
259201,B00LPVG6V0,ResQ Organics Face & Body Wash - With Aloe Ver...,ResQ Organics Face &amp; Body Wash - Aloe Vera...,,,,,"[Beauty, Skin Care, Face, Creams & Moisturizers]",False,"Beauty,Skin Care,Face,Creams & Moisturizers",na,,,,,,,,
259202,B00LTDUHJQ,Color: White\n2 Tier \nFullness 72 inches\nSew...,2 Tier Tulle Elbow Wedding Veil with Ribbon Ed...,,,,,"[Beauty, Hair Care, Styling Tools, Styling Acc...",False,"Beauty,Hair Care,Styling Tools,Styling Accesso...",na,,,,,,,,


#### 2.3.2 Review Data Definitions

In [31]:
review_df.head()

NameError: name 'review_df' is not defined

In [None]:
review_df.columns

In [None]:
review_df[['reviewerID', 'asin', 'reviewerName', 'reviewText',
       'overall', 'summary', 'unixReviewTime', 'reviewTime']].nunique()

In [None]:
review_m_df = review_df

##### Split **reviewTime** column

In [None]:
review_m_df.reviewTime = pd.to_datetime(review_m_df.reviewTime)

In [None]:
review_m_df['reviewed_year'], review_m_df['reviewed_month'], review_m_df['reviewed_day'] = review_m_df.reviewTime.dt.year, review_m_df.reviewTime.dt.month, review_m_df.reviewTime.dt.day

In [None]:
review_m_df = review_m_df.drop('unixReviewTime', axis=1)

##### Split **helpful** column

In [None]:
review_m_df.helpful

In [None]:
review_m_df[['helpful_positive','helpful_negative']] = pd.DataFrame(review_m_df.helpful.tolist(), index= review_m_df.index)

In [None]:
review_m_df = review_m_df.drop('helpful', axis=1)

In [None]:
review_m_df.head()
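
The `helpful` example given earlier, 2/3, suggests that each pair stores the helpful votes followed by the total votes. If that assumption holds, the second split column is a total rather than a count of negative votes, and a helpfulness ratio can be derived as sketched below. The toy values and the `unhelpful_votes` and `helpful_ratio` names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy rows using the notebook's column names (values are made up)
toy_reviews = pd.DataFrame({
    'helpful_positive': [2, 0, 5],
    'helpful_negative': [3, 0, 6],  # assumed here to hold TOTAL votes, not negative votes
})

# Under that assumption, unhelpful votes are total minus helpful
toy_reviews['unhelpful_votes'] = toy_reviews['helpful_negative'] - toy_reviews['helpful_positive']

# Share of helpful votes; reviews with no votes get NaN rather than a division error
total_votes = toy_reviews['helpful_negative'].replace(0, np.nan)
toy_reviews['helpful_ratio'] = toy_reviews['helpful_positive'] / total_votes
print(toy_reviews)
```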

##### Export Data

In [None]:
review_m_df.to_csv('review_cleaned_df.csv')

### 2.5 Data Cleaning
- Data cleaning is an essential step when data wrangling; if you don’t clean your data, you’re likely to run into some issues when it comes time to build your models. Putting the time into cleaning your data will result in a seamless transition from wrangling to EDA and modeling. 

In [36]:
## Get data for CLEANED product reviews
review_c_df = pd.read_csv('review_cleaned_df.csv')

In [37]:
review_c_df.columns

Index(['Unnamed: 0', 'reviewerID', 'asin', 'reviewerName', 'reviewText',
       'overall', 'summary', 'reviewTime', 'reviewed_year', 'reviewed_month',
       'reviewed_day', 'helpful_positive', 'helpful_negative'],
      dtype='object')

In [38]:
review_c_df[['reviewerID', 'asin', 'reviewerName', 'reviewText',
       'overall', 'summary', 'reviewTime', 'reviewed_year', 'reviewed_month',
       'reviewed_day', 'helpful_positive', 'helpful_negative']].nunique()

reviewerID          1210271
asin                 249274
reviewerName         880660
reviewText          2018084
overall                   5
summary             1110278
reviewTime             4231
reviewed_year            17
reviewed_month           12
reviewed_day             31
helpful_positive        435
helpful_negative        469
dtype: int64

#### Drop Products With Under 50 Reviews

In [33]:
product_review_counts = review_c_df.asin.value_counts().rename('product_review_counts')
product_review_counts 

B001MA0QY2    7533
B0009V1YR8    2869
B0043OYFKU    2477
B0000YUXI0    2143
B003V265QW    2088
              ... 
B00510GXJG       1
B000NCV5ZE       1
B002QBSIOG       1
B0073I084U       1
B004UDWHOK       1
Name: product_review_counts, Length: 249274, dtype: int64

In [34]:
review_c_df = review_c_df.merge(product_review_counts.to_frame(),
                                left_on='asin',
                                right_index=True)
review_c_df = review_c_df[review_c_df.product_review_counts >= 50]
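
The same filter can be written without a merge by broadcasting each product's review count onto its rows. A minimal sketch on made-up data; `MIN_REVIEWS` is a hypothetical name, and the threshold is lowered from 50 so the toy frame shows the effect.

```python
import pandas as pd

MIN_REVIEWS = 2  # the notebook uses 50; lowered so the toy data shows the effect

toy_reviews = pd.DataFrame({
    'asin': ['A1', 'A1', 'A2', 'A3', 'A3', 'A3'],
    'overall': [5.0, 4.0, 1.0, 3.0, 5.0, 2.0],
})

# transform('size') broadcasts each product's review count back onto its rows
toy_reviews['product_review_counts'] = toy_reviews.groupby('asin')['asin'].transform('size')
kept_reviews = toy_reviews[toy_reviews['product_review_counts'] >= MIN_REVIEWS]
print(kept_reviews)
```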

In [35]:
review_c_df

Unnamed: 0.1,Unnamed: 0,reviewerID,asin,reviewerName,reviewText,overall,summary,reviewTime,reviewed_year,reviewed_month,reviewed_day,helpful_positive,helpful_negative,product_review_counts
1705,1705,A2SR9M2CWC2OCP,9790790961,"AMYR ""AR""",Even though this perfume caused a skin reactio...,4.0,"Love the scent, long lasting - spray mister sp...",2013-09-23,2013,9,23,0,1,70
1706,1706,A3V1EVBYP0U77W,9790790961,Angelica,I was really excited about getting this perfum...,2.0,Not a good purchase/very dissapointed,2014-04-16,2014,4,16,0,0,70
1707,1707,AVJKKAS4P52P9,9790790961,angelica moreno,I love it,5.0,Five Stars,2014-07-04,2014,7,4,0,0,70
1708,1708,A2NQQDBBGFW1OT,9790790961,Anna M. Finnerty,I absolutely loved this. Was not expecting the...,5.0,Versace Brite Crystal,2014-01-26,2014,1,26,0,0,70
1709,1709,A1OFNEUHZ7BSCB,9790790961,Annquienette Burkes,One of the most complimented fragrances ever!!...,5.0,One of the most complimented fragrances ever!!!,2014-02-14,2014,2,14,0,1,70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022803,2022803,A2EIA53X91F8J,B00L5JHZJO,vicki raines,This argan oil is GREAT. The size of the bott...,5.0,YES!,2014-05-13,2014,5,13,0,0,488
2022804,2022804,A1MDF6GJDANUTH,B00L5JHZJO,Wendy D,Love this stuff! I put it in my oil vleansing ...,5.0,Love it!,2014-07-06,2014,7,6,0,0,488
2022805,2022805,A3T20TGEE4OPB5,B00L5JHZJO,"Wizard of Oooozzzz ""Trike Rider""",I have used this on my dry elbows and noticed ...,5.0,Wonderful Product!,2014-05-22,2014,5,22,0,0,488
2022806,2022806,A2MCFN2F0IINLY,B00L5JHZJO,yb2perfect,I love this oil! it feels so great on my skin ...,5.0,Great product!,2014-05-13,2014,5,13,0,0,488


In [None]:
review_c_df[[ 'reviewerID', 'asin', 'reviewerName', 'reviewText',
       'overall', 'summary', 'reviewTime', 'reviewed_year', 'reviewed_month',
       'reviewed_day', 'helpful_positive', 'helpful_negative',
       'product_review_counts']].nunique()

In [None]:
1 - ((2023070 - 613926) / 2023070)

About 39% of the reviewers remained (466438 of 1210271).

In [None]:
1 - ((249274 - 2719) / 249274)

About 1% of the products remained (2719 of 249274).
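
Rather than typing the counts by hand, the retained shares could be computed from the frames before and after filtering. The sketch below uses made-up data; `retained_share`, `before`, and `after` are hypothetical names.

```python
import pandas as pd

def retained_share(before: pd.Series, after: pd.Series) -> float:
    """Fraction of the distinct values in `before` that are still present in `after`."""
    return after.nunique() / before.nunique()

# Toy review frame before and after a filter that keeps only product A1
before = pd.DataFrame({
    'reviewerID': ['R1', 'R2', 'R3', 'R4'],
    'asin': ['A1', 'A1', 'A2', 'A3'],
})
after = before[before['asin'] == 'A1']

reviewer_share = retained_share(before['reviewerID'], after['reviewerID'])
product_share = retained_share(before['asin'], after['asin'])
print(f"{reviewer_share:.0%} of reviewers and {product_share:.0%} of products remained")
```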

In [39]:
product_ratings = pd.merge(review_c_df, product_m_df, on='asin', how='inner')
product_ratings

Unnamed: 0,Unnamed: 0_x,reviewerID,asin,reviewerName,reviewText,overall,summary,reviewTime,reviewed_year,reviewed_month,...,helpful_negative,Unnamed: 0_y,description,title,salesRank,categories,price,related,brand,interested_categories
0,105277,A4GWNDA2BENF7,B0002Z8SDY,Amazon Customer,Awesome Product! Has been holding my set of t...,5.0,Bondini,2013-02-11,2013,2,...,1,6759,The one glue that works to apply nails and wra...,"Spilo: MISC Big Bondini Plus Nail Glue, 0.14 oz",{'Beauty': 87102},"[['Beauty', 'Tools & Accessories', 'Nail Tools...",5.39,"{'also_bought': ['B00GVJQ8W8', 'B00819OIXC', '...",Spilo,True
1,105278,A38LZBEPIQELNU,B0002Z8SDY,Angie,One of the best nail glues on the market. Conv...,5.0,Bondini,2013-09-08,2013,9,...,0,6759,The one glue that works to apply nails and wra...,"Spilo: MISC Big Bondini Plus Nail Glue, 0.14 oz",{'Beauty': 87102},"[['Beauty', 'Tools & Accessories', 'Nail Tools...",5.39,"{'also_bought': ['B00GVJQ8W8', 'B00819OIXC', '...",Spilo,True
2,105279,A2DN4CA6A7HWN,B0002Z8SDY,"Ellen P. Stucker ""Author--Memphis""",I've been dealing with acrylic nails and nail ...,5.0,Best Nail Glue I've Ever Used,2010-09-08,2010,9,...,6,6759,The one glue that works to apply nails and wra...,"Spilo: MISC Big Bondini Plus Nail Glue, 0.14 oz",{'Beauty': 87102},"[['Beauty', 'Tools & Accessories', 'Nail Tools...",5.39,"{'also_bought': ['B00GVJQ8W8', 'B00819OIXC', '...",Spilo,True
3,105280,A3A0BSKJCVFDIH,B0002Z8SDY,iyana,Besides from the fact that it kind of dries up...,4.0,Best Nail Glue,2013-07-26,2013,7,...,0,6759,The one glue that works to apply nails and wra...,"Spilo: MISC Big Bondini Plus Nail Glue, 0.14 oz",{'Beauty': 87102},"[['Beauty', 'Tools & Accessories', 'Nail Tools...",5.39,"{'also_bought': ['B00GVJQ8W8', 'B00819OIXC', '...",Spilo,True
4,105281,AQM8T7O35LH05,B0002Z8SDY,Joyce E Flora,Big Bondini was a total flop. I bought it bec...,1.0,Big Bondini,2012-03-13,2012,3,...,1,6759,The one glue that works to apply nails and wra...,"Spilo: MISC Big Bondini Plus Nail Glue, 0.14 oz",{'Beauty': 87102},"[['Beauty', 'Tools & Accessories', 'Nail Tools...",5.39,"{'also_bought': ['B00GVJQ8W8', 'B00819OIXC', '...",Spilo,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5691,2017413,A3PI10AWOQ2725,B00JWM7N1Q,Donna,Theses are awesome,5.0,Five Stars,2014-07-03,2014,7,...,0,258019,Features :- Brand: SAB nails- Material: Acryli...,Lady Acrylic Style Artificial False Nails Full...,{'Beauty': 444654},"[['Beauty', 'Tools & Accessories', 'Nail Tools...",4.99,"{'also_viewed': ['B00LHVEVAG', 'B00LHV2HFC', '...",,True
5692,2018420,A3HR8XMM5JBFYK,B00K5M17VE,Bonnie,"These are ok, for some people with long nails....",3.0,Nail Art,2014-06-23,2014,6,...,0,258390,Features:Great for Both Professional Nail Spec...,Yesurprise 3D Women Beauty 1pcs Full WRAP Butt...,{'Beauty': 149173},"[['Beauty', 'Tools & Accessories', 'Nail Tools...",7.21,"{'also_viewed': ['B0081AI78S', 'B00HK2ON5Y', '...",Yesurprise,True
5693,2018450,A1R430S1BZCJBC,B00K5NJF26,Karen,You can wear this for 1 week. Design is pretty...,5.0,Pretty Good,2014-06-24,2014,6,...,0,258391,,GURAIO 24pcs False Nails Set Pre Design Acryli...,{'Beauty': 541862},"[['Beauty', 'Tools & Accessories', 'Nail Tools...",5.94,"{'also_viewed': ['B002MQJSRQ', 'B00K5NJEL8', '...",,True
5694,2018451,A3IGDMRQAKG9K,B00K5NMTSS,irene naranjo,i did like the design but unfortunately the de...,2.0,i did like the design but unfortunately the de...,2014-07-02,2014,7,...,0,258393,10 Sizes 20 Pcs Aztec Tribal Pattern Artificia...,Artificial Aztec Tribal Pattern False Nail Art...,{'Beauty': 798548},"[['Beauty', 'Tools & Accessories', 'Nail Tools...",11.99,"{'also_viewed': ['B00K5NMUXM', 'B00D2IKT6W']}",,True
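
The `Unnamed: 0`, `Unnamed: 0_x`, and `Unnamed: 0_y` columns in the output above look like index columns carried over from saved CSV files. That is an assumption, since the loading step is not shown here. A minimal sketch, with toy frames standing in for `review_c_df` and `product_m_df`, of dropping such columns before the merge and using `validate` to confirm that each `asin` appears only once on the product side (`drop_unnamed` is a hypothetical helper):

```python
import pandas as pd

# Toy stand-ins for review_c_df and product_m_df (values are illustrative only).
reviews = pd.DataFrame({
    "Unnamed: 0": [10, 11, 12],
    "reviewerID": ["R1", "R2", "R3"],
    "asin": ["A", "A", "B"],
    "overall": [5.0, 4.0, 1.0],
})
products = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "asin": ["A", "B", "C"],
    "brand": ["Spilo", "Kiss", None],
})

def drop_unnamed(df: pd.DataFrame) -> pd.DataFrame:
    """Drop leftover index columns whose names start with 'Unnamed'."""
    return df.loc[:, ~df.columns.str.startswith("Unnamed")]

merged = pd.merge(
    drop_unnamed(reviews),
    drop_unnamed(products),
    on="asin",
    how="inner",
    validate="many_to_one",  # raises MergeError if a product asin is duplicated
)
print(merged)
```

With `how='inner'`, products that have no reviews (here `asin` "C") are dropped from the result, which matches the behaviour of the merge in the cell above.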


In [54]:
product_ratings[product_ratings['brand']=='Kiss']['reviewed_year'].value_counts()

2013    59
2014    50
2012    18
2011     5
2009     1
2007     1
Name: reviewed_year, dtype: int64

In [56]:
product_ratings[product_ratings['brand']=='Kiss Products']['reviewed_year'].value_counts()

2012    2
2014    1
Name: reviewed_year, dtype: int64
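
The two cells above count reviews per year for `Kiss` and `Kiss Products` separately. As a sketch only, the same comparison can be made in one table with `pd.crosstab`. The toy frame below stands in for `product_ratings`, and whether the two names refer to the same brand is left open here:

```python
import pandas as pd

# Toy stand-in for product_ratings (values are illustrative only).
ratings = pd.DataFrame({
    "brand": ["Kiss", "Kiss", "Kiss Products", "Spilo", "Kiss"],
    "reviewed_year": [2013, 2014, 2012, 2013, 2013],
})

# Keep only the brand spellings of interest, then count reviews per year.
subset = ratings[ratings["brand"].isin(["Kiss", "Kiss Products"])]
counts = pd.crosstab(subset["reviewed_year"], subset["brand"])
print(counts)
```

Years in which a brand has no reviews appear as 0 in the table, so gaps are visible at a glance.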

### Conclusion 

1. There are two ways we can approach the recommendation system: product clustering and sentiment analysis. Because the dataset includes review text, sentiment analysis is a natural fit. Is there a way to accommodate both approaches?
    - Combining a clustering algorithm with sentiment analysis would make the project more challenging.
    - The work would have to be broken into small, well-defined pieces to be handled properly.
2. What strategy should I use for long review texts?