# Tree structure & Category Products

Exploring product categories using the [anytree] library.

[anytree]: https://pypi.org/project/anytree/2.8.0/

In [2]:
from anytree import Node
from anytree import RenderTree, find, find_by_attr, findall
import json
from urllib.request import urlopen
import pandas as pd

In [3]:
prod_open = urlopen('https://raw.githubusercontent.com/anyoneai/e-commerce-open-data-set/master/products.json')
js_prod = json.loads(prod_open.read().decode('utf-8'))

**TO-DO**:  
Integrate with `build_df.py`, which builds a dataframe using a threshold for the minimum number of products per category.
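
As a rough illustration of the thresholding idea in this TO-DO, the sketch below keeps only products whose leaf category has enough products. It is not the code in `build_df.py`: the function name `filter_by_min_products` is hypothetical, and counting by the last entry of each category list is an assumption.

```python
from collections import Counter

def filter_by_min_products(products, threshold):
    """Keep products whose leaf category has at least `threshold` products."""
    # Assumption: the leaf category is the last entry of each category list
    counts = Counter(p["category"][-1]["id"] for p in products if p["category"])
    return [
        p for p in products
        if p["category"] and counts[p["category"][-1]["id"]] >= threshold
    ]

# Toy records shaped like the entries of products.json (values are made up)
toy_products = [
    {"sku": 1, "category": [{"id": "a", "name": "A"}, {"id": "a1", "name": "A1"}]},
    {"sku": 2, "category": [{"id": "a", "name": "A"}, {"id": "a1", "name": "A1"}]},
    {"sku": 3, "category": [{"id": "b", "name": "B"}]},
]
kept = filter_by_min_products(toy_products, threshold=2)
print([p["sku"] for p in kept])
```

The same function could be applied to `js_prod` before building `df_prod`, if the real records always carry `id` keys in their category lists.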

In [4]:
df_prod = pd.DataFrame(js_prod)
df_prod.head()

| | sku | name | type | price | upc | category | shipping | description | manufacturer | model | url | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43900 | Duracell - AAA Batteries (4-Pack) | HardGood | 5.49 | 41333424019 | [{'id': 'pcmcat312300050015', 'name': 'Connect... | 5.49 | Compatible with select electronic devices; AAA... | Duracell | MN2400B4Z | http://www.bestbuy.com/site/duracell-aaa-batte... | http://img.bbystatic.com/BestBuy_US/images/pro... |
| 1 | 48530 | Duracell - AA 1.5V CopperTop Batteries (4-Pack) | HardGood | 5.49 | 41333415017 | [{'id': 'pcmcat312300050015', 'name': 'Connect... | 5.49 | Long-lasting energy; DURALOCK Power Preserve t... | Duracell | MN1500B4Z | http://www.bestbuy.com/site/duracell-aa-1-5v-c... | http://img.bbystatic.com/BestBuy_US/images/pro... |
| 2 | 127687 | Duracell - AA Batteries (8-Pack) | HardGood | 7.49 | 41333825014 | [{'id': 'pcmcat312300050015', 'name': 'Connect... | 5.49 | Compatible with select electronic devices; AA ... | Duracell | MN1500B8Z | http://www.bestbuy.com/site/duracell-aa-batter... | http://img.bbystatic.com/BestBuy_US/images/pro... |
| 3 | 150115 | Energizer - MAX Batteries AA (4-Pack) | HardGood | 4.99 | 39800011329 | [{'id': 'pcmcat312300050015', 'name': 'Connect... | 5.49 | 4-pack AA alkaline batteries; battery tester i... | Energizer | E91BP-4 | http://www.bestbuy.com/site/energizer-max-batt... | http://img.bbystatic.com/BestBuy_US/images/pro... |
| 4 | 185230 | Duracell - C Batteries (4-Pack) | HardGood | 8.99 | 41333440019 | [{'id': 'pcmcat312300050015', 'name': 'Connect... | 5.49 | Compatible with select electronic devices; C s... | Duracell | MN1400R4Z | http://www.bestbuy.com/site/duracell-c-batteri... | http://img.bbystatic.com/BestBuy_US/images/pro... |


First, we create the tree using the `make_tree()` function.


In [5]:
def make_tree(df, df_column_cat, root_name):
  """Build a category tree with anytree, print it, and return its nodes.

  Parameters:
  df: pandas DataFrame with one product per row
  df_column_cat: column of df whose values are lists of category
    dictionaries (keys 'id' and 'name'), ordered from parent to subcategory
  root_name: str, name of the root node

  Prints the tree and returns a dictionary with node names (str) as keys
  and anytree nodes as values.
  """
  # Set the root node
  root = Node(root_name)
  # Dictionary of the nodes generated by the function, keyed by node name
  nodes = {}
  nodes[root.name] = root
  # Iterate over rows
  for index, row in df.iterrows():
    # List of category dictionaries for this row
    c = df_column_cat[index]
    for i_cat in range(len(c)):
      # Node names combine the id and the category name. To use only the id,
      # uncomment the next line and comment out the one after it
      #cat_name = c[i_cat]['id']
      cat_name = c[i_cat]['id'] + ' ' + c[i_cat]['name']

      # Parent category: create it under the root if it does not exist yet
      if i_cat == 0 and cat_name not in nodes:
        nodes[cat_name] = Node(cat_name, parent=root)

      # Subcategory: create it under its predecessor if it does not exist yet
      elif i_cat > 0 and cat_name not in nodes:
        # Use the same naming scheme as cat_name (id only, or id plus name)
        #predecessor = c[i_cat - 1]['id']
        predecessor = c[i_cat - 1]['id'] + ' ' + c[i_cat - 1]['name']
        nodes[cat_name] = Node(cat_name, parent=nodes[predecessor])

  for pre, _, node in RenderTree(root):
    print("%s%s" % (pre, node.name))

  return nodes
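
The idea behind `make_tree()` is that each product's category list is a path from a parent category down to a subcategory, so consecutive entries form parent/child pairs. The sketch below shows that step with a plain dictionary instead of anytree, so it runs without the library or the dataset. The helper name `build_parent_map` is hypothetical, and the two paths are copied from categories that appear in the rendered tree.

```python
def build_parent_map(category_paths, root_name):
    """Map each node name to its parent's name; the root maps to None."""
    parent = {root_name: None}
    for path in category_paths:
        previous = root_name
        for cat in path:
            name = cat["id"] + " " + cat["name"]
            # Register a category only the first time it is seen
            parent.setdefault(name, previous)
            previous = name
    return parent

shared_prefix = [
    {"id": "pcmcat312300050015", "name": "Connected Home & Housewares"},
    {"id": "pcmcat248700050021", "name": "Housewares"},
    {"id": "pcmcat303600050001", "name": "Household Batteries"},
]
paths = [
    shared_prefix + [{"id": "abcat0208002", "name": "Alkaline Batteries"}],
    shared_prefix + [{"id": "abcat0208003", "name": "Lithium Batteries"}],
]
parent_of = build_parent_map(paths, "Categories")
print(parent_of["abcat0208003 Lithium Batteries"])
```

The shared prefix is registered once, which mirrors the `cat_name not in nodes` checks in `make_tree()`.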

We store the returned dictionary in the `tree_dict` variable so that we can use it with our `dist_nodes()` function.

In [6]:
tree_dict = make_tree(df_prod, df_prod['category'], "Categories")

Categories
├── pcmcat312300050015 Connected Home & Housewares
│   ├── pcmcat248700050021 Housewares
│   │   ├── pcmcat303600050001 Household Batteries
│   │   │   ├── abcat0208002 Alkaline Batteries
│   │   │   ├── abcat0208006 Specialty Batteries
│   │   │   ├── abcat0208005 Rechargeable Batteries
│   │   │   └── abcat0208003 Lithium Batteries
│   │   ├── pcmcat179100050006 Outdoor Living
│   │   │   ├── pcmcat179200050003 Grills
│   │   │   │   ├── pcmcat179200050004 Gas Grills
│   │   │   │   ├── pcmcat179200050007 Grill Accessories
│   │   │   │   ├── pcmcat179200050005 Electric Grills
│   │   │   │   ├── pcmcat270300050004 Charcoal Grills
│   │   │   │   └── pcmcat270300050005 Smokers
│   │   │   ├── pcmcat179200050017 Outdoor Audio
│   │   │   ├── pcmcat179200050008 Patio Furniture & Decor
│   │   │   │   ├── pcmcat179200050009 Fire Pits
│   │   │   │   ├── pcmcat748300323222 Outdoor Furniture Sets
│   │   │   │   │   ├── pcmcat748300323342 Outdoor Dining Sets
│   │   │   │   │  

# Distance between nodes

Defining our `dist_nodes()` function 

In [49]:
def dist_nodes(node_nm1, node_nm2, cat_tree_dict):
    """Return a distance between two category nodes of the anytree tree.

    node_nm1, node_nm2: str, node names (keys of cat_tree_dict)
    cat_tree_dict: dictionary returned by the make_tree() function
    return: int, 1 plus the number of nodes on the longer root-to-node path
    that are not on the shorter one
    """
    cat_node1 = cat_tree_dict[node_nm1]
    cat_node2 = cat_tree_dict[node_nm2]

    # Paths from the root to each node, without the root itself
    path_list1 = list(cat_node1.path)[1:]
    path_list2 = list(cat_node2.path)[1:]

    # Pick the longer and the shorter path; on a tie either order works
    if len(path_list1) >= len(path_list2):
        max_length_list, min_length_list = path_list1, path_list2
    else:
        max_length_list, min_length_list = path_list2, path_list1

    dist = 1
    common_path = []

    for nd in max_length_list:
        if nd in min_length_list:
            common_path.append(nd)
        else:
            dist += 1
    return dist # (dist, common_path)
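
For comparison, a common alternative definition of distance is the number of edges on the path between two nodes through their lowest common ancestor. The sketch below is not the notebook's `dist_nodes()`. It works on plain lists of names rather than anytree nodes, and the function name `tree_distance` is hypothetical. The paths are taken from the rendered tree, with ids dropped for brevity.

```python
def tree_distance(path1, path2):
    """Number of edges between two nodes, given their root-to-node paths."""
    # Length of the shared prefix = depth of the lowest common ancestor
    shared = 0
    for a, b in zip(path1, path2):
        if a != b:
            break
        shared += 1
    # Walk up from node 1 to the common ancestor, then down to node 2
    return (len(path1) - shared) + (len(path2) - shared)

housewares = ["Connected Home & Housewares", "Housewares"]
electric_grills = housewares + ["Outdoor Living", "Grills", "Electric Grills"]
household_batteries = housewares + ["Household Batteries"]
rechargeable = household_batteries + ["Rechargeable Batteries"]
lithium = household_batteries + ["Lithium Batteries"]

print(tree_distance(electric_grills, household_batteries))  # 4
print(tree_distance(rechargeable, lithium))  # 2
```

The two definitions agree whenever the shorter path has exactly one node that is not on the longer path, and can differ otherwise. With anytree nodes, the upward and downward legs should also be obtainable from `Walker().walk(a, b)`, although that route is not exercised here.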

### Applying `dist_nodes()` function 

**Example 1.** We apply the `dist_nodes()` function to *pcmcat303600050001 Household Batteries* and *pcmcat179200050005 Electric Grills*, and check that the result is the same for both argument orders.

Their closest common parent is *pcmcat248700050021 Housewares*.

In [52]:
node1 = 'pcmcat179200050005 Electric Grills'
node2 = 'pcmcat303600050001 Household Batteries'

dist_nodes(node1, node2, tree_dict) == dist_nodes(node2, node1, tree_dict)

True

**Example 2.** We apply the `dist_nodes()` function to determine the distance between *abcat0208005 Rechargeable Batteries* and *abcat0208003 Lithium Batteries*.

Both belong to the same subcategory: *pcmcat303600050001 Household Batteries*.

In [42]:
dist_nodes('abcat0208005 Rechargeable Batteries','abcat0208003 Lithium Batteries', tree_dict )

2

**Example 3.** We apply the `dist_nodes()` function to determine the distance between *abcat0712007 Sports* and *pcmcat179200050005 Electric Grills*.

They have no parent category in common.

In [10]:
dist_nodes('abcat0712007 Sports', 'pcmcat179200050005 Electric Grills', tree_dict)

6