# EDA using Google Facets Overview

In this notebook, the ML team uses Facets Overview, which takes the input feature data of 4 supercategories, analyzes it feature by feature, and visualizes the results.
Overview gives users a quick understanding of the distribution of values across the features of their dataset(s). It can uncover several common and uncommon issues, such as:
- unexpected feature values
- missing feature values for a large number of observations
- training/serving skew
- train/test/validation set skew
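
As a rough illustration of two of the issues listed above (missing values and skew between datasets), the sketch below computes comparable numbers with plain pandas. It is not what Facets Overview computes internally; the helper names `missing_fraction` and `mean_shift` and the toy data are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def missing_fraction(df):
    """Fraction of missing (NaN) values per feature."""
    return df.isna().mean()


def mean_shift(train, test):
    """Absolute difference in per-feature means, in units of the train std."""
    std = train.std().replace(0, np.nan)  # avoid dividing by zero
    return (train.mean() - test.mean()).abs() / std


# Toy data (hypothetical): numreviews has a missing value and a shifted mean
train = pd.DataFrame({"numreviews": [1.0, 2.0, 3.0, np.nan],
                      "avgrating": [4.0, 5.0, 4.0, 5.0]})
test = pd.DataFrame({"numreviews": [10.0, 12.0, 11.0, 13.0],
                     "avgrating": [4.0, 5.0, 5.0, 4.0]})

print(missing_fraction(train))
print(mean_shift(train, test))
```

A large value from `mean_shift` for one feature suggests that the two datasets differ in that feature and that it deserves a closer look in the visualization.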

To run this notebook, please follow the setup instructions in the GitHub repository below:

https://github.com/PAIR-code/facets


In [93]:
# Add the facets overview python code to the python path
import sys
sys.path.append('./python')
import numpy as np
import pandas as pd

# Features Description
For each supercategory, we will analyze the 16 features described below:
- total_sales_volume: Total sales volume of the current month
- total_sales_price: Total sales in dollars of the current month
- pm_total_sales_volume: Total sales volume of the previous month
- pm_total_sales_price: Total sales price of the previous month
- l3m_total_sales_volume: Total sales volume of the previous 3 months
- l3m_total_sales_price: Total sales price of the previous 3 months
- l12m_total_sales_volume: Total sales volume of the previous 12 months
- l12m_total_sales_price: Total sales price of the previous 12 months
- pm_numreviews: Number of reviews of the previous month
- pm_avgrating: Average rating of the previous month
- l3m_numreviews: Number of reviews of the previous 3 months
- l3m_avgrating: Average rating of the previous 3 months
- l12m_numreviews: Number of reviews of the previous 12 months
- l12m_avgrating: Average rating of the previous 12 months
- numreviews: Number of reviews of the current month
- avgrating: Average rating of the current month
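
The notebook does not show how the `pm_`, `l3m_` and `l12m_` features were built. The sketch below is one way such lagged sums could be derived with pandas, assuming one row per month with no gaps and zero where no history exists. The monthly volumes are copied from the first five sample rows for nodeid 2945 printed later in this notebook, and the results agree with those rows, but the actual feature pipeline may differ.

```python
import pandas as pd

# Monthly totals copied from the first five sample rows for nodeid 2945
monthly = pd.DataFrame({
    "year": [2009, 2009, 2009, 2010, 2010],
    "month": [10, 11, 12, 1, 2],
    "total_sales_volume": [1880.0, 376.0, 280.0, 52.0, 72.0],
})

# Shift by one row so each window covers only months before the current one
prev = monthly["total_sales_volume"].shift(1)

# Previous month, previous 3 months, previous 12 months (0 when no history)
monthly["pm_total_sales_volume"] = prev.fillna(0)
monthly["l3m_total_sales_volume"] = prev.rolling(3, min_periods=1).sum().fillna(0)
monthly["l12m_total_sales_volume"] = prev.rolling(12, min_periods=1).sum().fillna(0)

print(monthly)
```

Shifting before rolling keeps the current month out of every window, which matches the "previous N months" wording of the feature descriptions.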


In [124]:
# Load the supercategory CSV files into pandas DataFrames.
# Only supercategories 3 and 4 are loaded here; the loads for 1 and 2 are commented out.

features = ['id', 'nodeid', 'year', 'month',
            'total_sales_volume', 'total_sales_price',
            'pm_total_sales_volume', 'pm_total_sales_price',
            'l3m_total_sales_volume', 'l3m_total_sales_price',
            'l12m_total_sales_volume', 'l12m_total_sales_price',
            'pm_numreviews', 'pm_avgrating',
            'l3m_numreviews', 'l3m_avgrating',
            'l12m_numreviews', 'l12m_avgrating',
            'numreviews', 'avgrating']

# supercat1 = pd.read_csv('ML_feat_cat1.csv',
#                         index_col=['nodeid', 'year', 'month'],
#                         names=features, skiprows=[0], dtype=np.float32)

# supercat2 = pd.read_csv('ML_feat_cat2.csv',
#                         index_col=['nodeid', 'year', 'month'],
#                         names=features, skiprows=[0], dtype=np.float32)

# skiprows=[0] discards each file's header row; names=features supplies the column names instead.
# dtype=np.float32 also applies to the index columns, so nodeid, year and month are stored as floats.
supercat3 = pd.read_csv('ML_feat_cat3.csv',
                        index_col=['nodeid', 'year', 'month'],
                        names=features, skiprows=[0], dtype=np.float32)
supercat3.drop('id', axis=1, inplace=True)

supercat4 = pd.read_csv('ML_feat_cat4.csv',
                        index_col=['nodeid', 'year', 'month'],
                        names=features, skiprows=[0], dtype=np.float32)
supercat4.drop('id', axis=1, inplace=True)

# Sanity check: display the DataFrame to confirm the data was read in correctly
supercat4

| nodeid | year | month | total_sales_volume | total_sales_price | pm_total_sales_volume | pm_total_sales_price | l3m_total_sales_volume | l3m_total_sales_price | l12m_total_sales_volume | l12m_total_sales_price | pm_numreviews | pm_avgrating | l3m_numreviews | l3m_avgrating | l12m_numreviews | l12m_avgrating | numreviews | avgrating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.945000e+03 | 2009.0 | 10.0 | 1880.0 | 18760.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
| 2.945000e+03 | 2009.0 | 11.0 | 376.0 | 3455.000000 | 1880.0 | 18760.000000 | 1880.0 | 18760.000000 | 1880.0 | 18760.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
| 2.945000e+03 | 2009.0 | 12.0 | 280.0 | 2055.000000 | 376.0 | 3455.000000 | 2256.0 | 22215.000000 | 2256.0 | 22215.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
| 2.945000e+03 | 2010.0 | 1.0 | 52.0 | 460.000000 | 280.0 | 2055.000000 | 2536.0 | 24270.000000 | 2536.0 | 24270.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
| 2.945000e+03 | 2010.0 | 2.0 | 72.0 | 200.000000 | 52.0 | 460.000000 | 708.0 | 5970.000000 | 2588.0 | 24730.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
| 2.945000e+03 | 2010.0 | 3.0 | 114.0 | 130.000000 | 72.0 | 200.000000 | 404.0 | 2715.000000 | 2660.0 | 24930.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
| 2.945000e+03 | 2010.0 | 4.0 | 3.0 | 10.000000 | 114.0 | 130.000000 | 238.0 | 790.000000 | 2774.0 | 25060.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
| 2.945000e+03 | 2010.0 | 5.0 | 14.0 | 90.000000 | 3.0 | 10.000000 | 189.0 | 340.000000 | 2777.0 | 25070.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
| 2.945000e+03 | 2010.0 | 6.0 | 6.0 | 40.000000 | 14.0 | 90.000000 | 131.0 | 230.000000 | 2791.0 | 25160.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
| 2.945000e+03 | 2010.0 | 7.0 | 3.0 | 20.000000 | 6.0 | 40.000000 | 23.0 | 140.000000 | 2797.0 | 25200.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
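
Eyeballing the displayed rows can be complemented with a few assertions. The sketch below applies the same `read_csv` arguments to a tiny in-memory CSV; the file contents and the shortened column list are made up for illustration, and the checks are only suggestions of what could be verified on the real files.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-row CSV with a header line, standing in for ML_feat_cat4.csv
csv_text = """id,nodeid,year,month,total_sales_volume,numreviews
0,2945,2009,10,1880,2
1,2945,2009,11,376,2
"""
columns = ["id", "nodeid", "year", "month", "total_sales_volume", "numreviews"]

df = pd.read_csv(io.StringIO(csv_text),
                 index_col=["nodeid", "year", "month"],
                 names=columns, skiprows=[0], dtype=np.float32)
df = df.drop(columns="id")

# Checks that replace a purely visual inspection
assert list(df.index.names) == ["nodeid", "year", "month"]
assert "id" not in df.columns
assert not df.isna().any().any()   # no missing values slipped in
assert df.index.is_unique          # one row per (nodeid, year, month)
print(df.shape)
```

Checking that the index is unique is useful here because a duplicated (nodeid, year, month) key would silently distort the per-feature statistics computed in the next cell.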


In [125]:
# Calculate the feature statistics proto from the datasets and stringify it for use in facets overview
from generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
import base64

gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{'name': 'supercat3', 'table': supercat3},
                                  {'name': 'supercat4', 'table': supercat4}])

protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
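
The serialized proto is base64-encoded so that arbitrary bytes can be embedded as a plain string in the HTML template used in the next cell. The sketch below demonstrates that property on a stand-in byte string; `payload` is a made-up example, not an actual Facets proto.

```python
import base64

# Stand-in for proto.SerializeToString(): bytes including a NUL, a quote and angle brackets
payload = bytes([0, 255, 34, 60, 62])

encoded = base64.b64encode(payload).decode("utf-8")

# The encoded text uses only letters, digits, '+', '/' and '=' padding,
# so it cannot terminate the surrounding quoted JavaScript string
is_safe = all(c.isalnum() or c in "+/=" for c in encoded)

# Decoding restores the original bytes exactly
round_trip = base64.b64decode(encoded)

print(encoded, is_safe, round_trip == payload)
```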



In [126]:
# Visualize data
from IPython.display import display, HTML

HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))



In [6]:
# Sanity check: confirm that HTML output renders in this notebook
from IPython.display import display, HTML
display(HTML('<h1>Hello, world!</h1>'))