<hr style="height:1px">
<hr style="height:3px">

#   Etsy Data Challenge : Analyzing Food-Mart Data

Question 1 : Create visuals to understand the product data dimensions in terms of :

-  #### Category of products sold
-  #### Brand of products sold


<hr style="height:1px">
<hr style="height:3px">


## Procedure : 

For this section I worked only with the **Product** and **Product Class** tables to learn about the range of products stocked in the Food Mart stores. Initial exploration shows that there are **1560 unique products**, spread across the following dimensions : 

- Product Family : 3 varieties (<font color='red'>Drinks</font>, <font color='blue'>Food</font> and <font color='green'>Non-Consumables</font>)
- Product Department : 22 types
- Product Category : 45 types
- Brand name : 111 
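
Counts like these can be obtained by counting distinct values per column. The snippet below is a minimal sketch on a made-up five-row dataframe; the rows, the placeholder labels and the column name `product_department` are assumptions for illustration, not data from the Food Mart database.

```python
import pandas as pd

# Hypothetical stand-in for the joined product / product_class data.
sample = pd.DataFrame({
    "product_id": [1, 2, 3, 4, 5],
    "product_family": ["Drink", "Drink", "Food", "Food", "Non-Consumable"],
    "product_department": ["Dept A", "Dept A", "Dept B", "Dept C", "Dept D"],
    "product_category": ["Cat 1", "Cat 2", "Cat 3", "Cat 3", "Cat 4"],
    "brand_name": ["Brand A", "Brand B", "Brand A", "Brand C", "Brand C"],
})

# nunique() counts the distinct values in each column.
unique_counts = sample.nunique()
print(unique_counts)
```

Run against the real joined table, the same call would be expected to give the figures listed above.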

To visualise this information I created two different D3 plots, one from the point of view of product categories and the other for Brands.

-  Figure-1 : A **Circular Dendrogram** that shows the number of unique products stocked in the stores for each Brand within a particular Category. It is a modification of the following tree structure :

$$ \text{Product Family} \rightarrow \text{Product Category} \rightarrow \text{Brands} \rightarrow \text{Number of Products} $$

- The circular shape of the tree allows the entire data dimension to be viewed at once. 


- Figure-2 : A **Bubble Chart** showing the number of unique product items sold by each brand, where the brands are color coded according to the product family to which they belong.

To see the figures, we will need to start a local server first and then click on the links to the figures provided at the end of the notebook.
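
The notebook starts the server with a shell command further down. As an alternative, here is a minimal Python 3 sketch of the same idea using the standard library. The helper name `build_server` is hypothetical, and binding to `127.0.0.1` only (rather than all interfaces) is a choice made for this sketch.

```python
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler

def build_server(directory=".", port=7000):
    """Return an HTTP server that serves static files from `directory`."""
    # The `directory` argument of SimpleHTTPRequestHandler needs Python 3.7+.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return HTTPServer(("127.0.0.1", port), handler)

# Usage (blocks until interrupted):
#   server = build_server(".", 7000)
#   server.serve_forever()
```

The directory served should be the one containing the HTML pages and the generated JSON files.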


<hr style="height:1px">




In [1]:
# Import the necessary packages

from sqlalchemy import create_engine
import psycopg2
import pandas as pd
import numpy as np
import json

pd.set_option('notebook_repr_html',False)

In [2]:
# Setting up connection using SQL Alchemy : 

dbname = 'etsy'
username = 'parama'
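# 'postgresql://' is the dialect name; the old 'postgres://' alias is rejected by SQLAlchemy 1.4+
engine = create_engine('postgresql://%s@localhost/%s'%(username,dbname))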

In [3]:
# Setting up connection using Psycopg2 : 

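try:
    conn = psycopg2.connect(database = 'etsy', user = 'parama' , password = 'pargop')
except psycopg2.OperationalError as err:
    # print() with a single argument behaves the same under Python 2 and Python 3
    print('Unable to connect: %s' % err)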

### Making the Circular Dendrogram :

I first made a new table that combines the relevant information from the product and product_class tables. I call it the **new_product** table; it is used here and will also be useful for future analysis. 

Next I query this new table to generate a dataframe that is easy to transform into a JSON object. The query should produce rows of this form : 

      | product_family | product_category | brand_name | no_of_products | ncolor(red/blue/green) |  
      
The resulting JSON file was then used to make a D3 plot, with HTML adapted from [here](https://bl.ocks.org/mbostock/4339607).

The circular dendrogram has all products at its root, which branches out into the three product families :

- <font color='red'>Drinks</font>
- <font color='blue'>Food</font>
- <font color='green'>Non-Consumables</font>

These further branch out into the specific product categories unique to them. 

Lastly, each product category branches out into the brands available for those products. Each leaf node ends in a circle whose **color is decided by the product family** it originated from and whose **size is decided by the number of unique products** in that node. 
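
To make the target structure concrete, here is a minimal sketch that nests flat rows into the root, family, category, brand hierarchy described above. It is an illustration with made-up rows, not the notebook's own `dict_maker`; the function name `build_tree` and the placeholder labels are hypothetical. It keeps a dictionary of nodes already created, so each row is placed without scanning the list of children.

```python
import json

# Hypothetical rows in the (product_family, product_category, brand_name,
# no_of_products, ncolor) layout; the values are made up for illustration.
rows = [
    ("Drink", "Category X", "Brand A", 3, "red"),
    ("Drink", "Category X", "Brand B", 5, "red"),
    ("Drink", "Category Y", "Brand A", 1, "red"),
    ("Food", "Category Z", "Brand C", 2, "blue"),
]

def build_tree(rows):
    """Nest flat rows as root -> family -> category -> brand leaf."""
    root = {"name": "products", "children": []}
    families = {}    # family name -> family node
    categories = {}  # (family, category) -> category node
    for family, category, brand, count, color in rows:
        if family not in families:
            families[family] = {"name": family, "ncolor": color, "children": []}
            root["children"].append(families[family])
        key = (family, category)
        if key not in categories:
            categories[key] = {"name": category, "children": []}
            families[family]["children"].append(categories[key])
        # Leaf: one brand within one category, sized by its product count.
        categories[key]["children"].append(
            {"name": brand, "size": count, "ncolor": color}
        )
    return root

tree = build_tree(rows)
print(json.dumps(tree, indent=2))
```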





In [8]:
# Make a new table combining the information from Product and Product_Class tables using the following SQL Query:
# To be executed from within the database
sql_query = """ SELECT a.product_id, a.brand_name, a.product_name, b.* INTO new_product
FROM product a INNER JOIN product_class b ON a.product_class_id = b.product_class_id; """ 

In [9]:
# Read in the necessary data to make a json input file for plotting : 
# Can be run here
sql_query = """
SELECT product_family, product_category, brand_name, count(product_id) AS no_of_products, 
case when product_family = 'Drink' then 'red' when product_family = 'Food' then 'blue' else 'green' end as ncolor
    FROM new_product GROUP BY product_family, product_category, brand_name 
    ORDER BY product_family, product_category, brand_name;
"""
tree_data = pd.read_sql_query(sql_query,conn)

In [10]:
# Make dataframes specific to the product family

drink = tree_data[tree_data.product_family=='Drink'][['product_category','brand_name','no_of_products','ncolor']]
food = tree_data[tree_data.product_family=='Food'][['product_category','brand_name','no_of_products','ncolor']]
nonc = tree_data[tree_data.product_family=='Non-Consumable'][['product_category','brand_name','no_of_products','ncolor']]

In [11]:
# Function to create a nested dictionary structure from the Dataframe

def dict_maker(name,color,dframe): 
    # Root node for one product family; its children are the product categories
    d = {"name": name, "ncolor":color, "size": 5,  "children": []}
    
    for line in dframe.values:
        the_parent = line[0]   # product_category
        the_child = line[1]    # brand_name
        child_size = line[2]   # no_of_products
        ncolor = line[3]

        # Names of the category nodes added so far
        keys_list = [item['name'] for item in d['children']]

        # If 'the_parent' is NOT a key then append it with its first child
        if the_parent not in keys_list:
            d['children'].append({"name":the_parent, "children":[{"name":the_child, "size":child_size, "ncolor":ncolor}]})

        # if 'the_parent' IS a key add a new child to it
        else:
            d['children'][keys_list.index(the_parent)]['children'].append({"name":the_child, "size":child_size, "ncolor":ncolor})
    return d       

In [12]:
# Making the json file 

data_out = {"name": 'products', "size": 15, "ncolor":"black", "children" :[]}

family = ['Drinks','Food','Non-Consumable']
tcolor = ['red','blue','green']
# Separate name so the tree_data dataframe from In [9] is not overwritten by a list
family_frames = [drink,food,nonc]

for name, color, frame in zip(family, tcolor, family_frames):
    data_out['children'].append(dict_maker(name, color, frame))

# Export the final result to a json file
with open('cat_brand.json', 'w') as outfile:
    json.dump(data_out, outfile, indent=4, sort_keys=True, separators=(',',':'))                 
                         

### Making the Bubble Chart :

While the circular dendrogram gives a comprehensive picture of product categories and brand names together, it may be more information than Food Mart executives need if they are only looking at **brand representation** in their stock. 

In that case the following bubble chart is more helpful.

The bubbles are **color coded according to the product family** (<font color='red'>Drinks</font>/ <font color='blue'>Food</font>/ <font color='green'>Non-Consumables</font>) to which they belong and their **sizes are representative of the number of unique products** carried under the particular brand.

The bubble chart was generated in the same way, using [HTML code](https://bl.ocks.org/mbostock/4063269) fed by a JSON file created in the following steps. 
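
Because each bubble's size is a product count, the sizes in the JSON can be summed as a quick sanity check against the number of rows in `new_product`. The sketch below shows the idea on a made-up miniature hierarchy; the helper `leaf_total` and the sample values are hypothetical.

```python
def leaf_total(node):
    """Sum the "size" of every leaf (a node without children) below `node`."""
    children = node.get("children")
    if not children:
        return node.get("size", 0)
    return sum(leaf_total(child) for child in children)

# Hypothetical miniature of the bubble-chart hierarchy (values are made up).
bubble = {
    "name": "products",
    "children": [
        {"name": "Drink", "children": [
            {"name": "Brand A", "size": 4, "tcolor": "red"},
            {"name": "Brand B", "size": 6, "tcolor": "red"},
        ]},
        {"name": "Food", "children": [
            {"name": "Brand C", "size": 10, "tcolor": "blue"},
        ]},
    ],
}

# Product count per family, and the grand total over all bubbles.
per_family = {child["name"]: leaf_total(child) for child in bubble["children"]}
total = leaf_total(bubble)
print(per_family, total)
```

On the real data, the grand total would be expected to match the 1560 products reported earlier, assuming each product appears exactly once in `new_product`.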


In [13]:
# SQL query to generate dataframe

sql_query = """
SELECT product_family,brand_name,count(*) AS no_of_items, 
CASE WHEN product_family = 'Drink' THEN 'red' 
     WHEN product_family = 'Food' THEN 'blue' 
     ELSE 'green' END AS color FROM new_product 
     GROUP BY product_family,brand_name ORDER BY product_family,brand_name;
"""
tree_data = pd.read_sql_query(sql_query,conn)

In [14]:
# Function to produce nested dictionary structure

def dict_maker(dframe): 
    # Note: this redefines the dict_maker used for the dendrogram above
    d = {"name": "products", "children": []}
    
    for line in dframe.values:
        the_parent = line[0]    # product_family
        the_child = line[1]     # brand_name
        child_size = line[2]    # no_of_items
        child_color = line[3]

        # Names of the family nodes added so far
        keys_list = [item['name'] for item in d['children']]

        # If 'the_parent' is NOT a key append it with its first child
        if the_parent not in keys_list:
            d['children'].append({"name":the_parent, "children":[{"name":the_child, "size":child_size, "tcolor":child_color}]})

        # if 'the_parent' IS a key add a new child to it
        else:
            d['children'][keys_list.index(the_parent)]['children'].append({"name":the_child, "size":child_size, "tcolor":child_color})
    return d       

In [15]:
# Writing out the json file

data_out= dict_maker(tree_data)
with open('bubble_chart.json', 'w') as outfile:
    json.dump(data_out, outfile, indent=4, sort_keys=True, separators=(',',':'))                 
                         

In [None]:
# Starting the local server 
# (SimpleHTTPServer is the Python 2 module; on Python 3 use `python -m http.server 7000`)
! python -m SimpleHTTPServer 7000

Serving HTTP on 0.0.0.0 port 7000 ...
127.0.0.1 - - [10/Aug/2016 11:03:17] "GET /index_cir.html HTTP/1.1" 200 -
127.0.0.1 - - [10/Aug/2016 11:03:17] "GET /cat_brand.json HTTP/1.1" 200 -
127.0.0.1 - - [10/Aug/2016 11:03:17] code 404, message File not found
127.0.0.1 - - [10/Aug/2016 11:03:17] "GET /favicon.ico HTTP/1.1" 404 -
127.0.0.1 - - [10/Aug/2016 11:04:16] "GET /index_bubble.html HTTP/1.1" 200 -
127.0.0.1 - - [10/Aug/2016 11:04:16] "GET /bubble_chart.json HTTP/1.1" 200 -


## Links to Figures : 

-  [Circular Dendrogram - How many products in each category and each brand](http://localhost:7000/index_cir.html)
 

- [Bubble Chart - How many products in each brand](http://localhost:7000/index_bubble.html)
    