Using the following dataset, https://github.com/BestBuyAPIs/open-data-set, build
an API with one endpoint that receives the “name” and “description” of a 
new product as input parameters, and outputs the “category” or “categories” that 
this new product should be in.

Expectations:

    • You have a data pipeline that handles the dataset.
    • You build a classifier that predicts the category(s) of a new product.
        ◦ You can decide if your model would output one or multiple labels.
        ◦ You don't need to spend lots of time comparing different models.
        ◦ You don't need to spend lots of time on state-of-the-art feature engineering.
    • You build one API endpoint that exposes the classifier as a solution to label new products.

In [1]:
# import libraries
import common
import pandas as pd

pd.set_option('display.max_columns', None)
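
The `common` module is not shown in this notebook. A minimal sketch of what `common.read_file` might look like, assuming it simply parses a JSON file into Python objects (the body below is a guess, not the author's code):

```python
import json
from pathlib import Path


def read_file(path):
    """Hypothetical stand-in for common.read_file: parse a JSON file.

    Assumption: the real helper returns the decoded JSON, i.e. a list of
    dicts for categories.json, products.json and stores.json.
    """
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

A list of dicts returned this way can be passed straight to `pd.DataFrame.from_dict`, as the cells below do.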

## Categories

In [2]:
# Read data from file
categories_path = 'open-data-set-master/categories.json'
categories = common.read_file(categories_path)

# build the dataframe and split each path (a list of {'id', 'name'} dicts)
# into one column per hierarchy level
cat_df = pd.DataFrame.from_dict(categories)
cat_list = ['cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6']
cat_df[cat_list] = pd.DataFrame(cat_df.path.tolist(), index=cat_df.index)

cat_df.head()

Unnamed: 0,id,name,path,subCategories,cat1,cat2,cat3,cat4,cat5,cat6
0,abcat0010000,Gift Ideas,"[{'id': 'abcat0010000', 'name': 'Gift Ideas'}]","[{'id': 'pcmcat140000050035', 'name': 'Capturi...","{'id': 'abcat0010000', 'name': 'Gift Ideas'}",,,,,
1,abcat0020001,Learning Toys,"[{'id': 'abcat0010000', 'name': 'Gift Ideas'},...",[],"{'id': 'abcat0010000', 'name': 'Gift Ideas'}","{'id': 'abcat0014000', 'name': 'Kids'}","{'id': 'abcat0020000', 'name': 'Toys'}","{'id': 'abcat0020001', 'name': 'Learning Toys'}",,
2,abcat0020002,DVD Games,"[{'id': 'abcat0010000', 'name': 'Gift Ideas'},...",[],"{'id': 'abcat0010000', 'name': 'Gift Ideas'}","{'id': 'abcat0014000', 'name': 'Kids'}","{'id': 'abcat0020000', 'name': 'Toys'}","{'id': 'abcat0020002', 'name': 'DVD Games'}",,
3,abcat0020004,Unique Gifts,"[{'id': 'abcat0010000', 'name': 'Gift Ideas'},...",[],"{'id': 'abcat0010000', 'name': 'Gift Ideas'}","{'id': 'abcat0020004', 'name': 'Unique Gifts'}",,,,
4,abcat0100000,TV & Home Theater,"[{'id': 'abcat0100000', 'name': 'TV & Home The...","[{'id': 'abcat0101000', 'name': 'TVs'}, {'id':...","{'id': 'abcat0100000', 'name': 'TV & Home Thea...",,,,,


In [3]:
print('There are {} rows in the categories table'.format(cat_df.shape[0]))
print('There are {} unique ids'.format(cat_df['id'].nunique()))
print('There are {} unique names'.format(len(cat_df['name'].unique())))
print('There are {} unique paths'.format(len(cat_df['path'].astype(str).unique())))
print('There are {} unique subCategories'.format(len(cat_df['subCategories'].astype(str).unique())))

There are 4584 rows in the categories table
There are 4584 unique ids
There are 4228 unique names
There are 4584 unique paths
There are 791 unique subCategories


The target for the classification model will be the id column, since it is the primary identifier of each category; names are not unique (4228 distinct names across 4584 rows).

The subCategories column suggests that the categories form a hierarchy. See https://www.sciencedirect.com/science/article/pii/S089812211300432X
The model to implement will be a flat classifier, as described in the paper above:

<img src="pictures/flat_classification.png">
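
To make the flat approach concrete: every category id is treated as one class of a single multi-class classifier, and the hierarchy is ignored. The sketch below uses a tiny multinomial Naive Bayes written with the standard library, purely for illustration. The class name `FlatNaiveBayes`, the tokenizer and the shortened toy rows are all assumptions; this notebook does not state which model is actually trained.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FlatNaiveBayes:
    """Multinomial Naive Bayes over category ids, ignoring the hierarchy."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy rows (illustrative, shortened): name + description -> category id.
train_texts = [
    "Duracell AAA Batteries 4-Pack alkaline",
    "Energizer MAX Batteries AA alkaline",
    "Honeywell HEPA replacement filters air purifier",
    "Dyson hard floor wipes vacuum",
]
train_labels = ["abcat0208002", "abcat0208002", "pcmcat303700050016", "abcat0916009"]

model = FlatNaiveBayes().fit(train_texts, train_labels)
predicted = model.predict("Duracell AA alkaline batteries 8-Pack")
print(predicted)  # abcat0208002
```

With 4584 categories in the real data, a library implementation with sparse features would be the practical choice; the point here is only that the label space is one flat set of ids.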

This additional hierarchical information could be extracted and used to improve a future model, such as a Local Classifier per Node (LCN), also described in the paper above:

<img src="pictures/LCN.png">
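
As a rough illustration of how the hierarchy could feed a per-node local classifier setup, the sketch below derives, for every node that appears in the category paths, a binary target saying whether each product sits at or below that node. One binary classifier would then be trained per node. The function name `per_node_targets` is an assumption; the path structure (a root-to-leaf list of `{'id', 'name'}` dicts) and the toy paths come from the category rows shown earlier.

```python
from collections import defaultdict


def per_node_targets(paths):
    """Map each node id to a list of 0/1 targets, one entry per product.

    paths: one root-to-leaf list of {'id': ..., 'name': ...} dicts per product.
    """
    node_ids = sorted({node["id"] for path in paths for node in path})
    targets = defaultdict(list)
    for path in paths:
        on_path = {node["id"] for node in path}
        for node_id in node_ids:
            targets[node_id].append(1 if node_id in on_path else 0)
    return dict(targets)


# Toy paths, one per (hypothetical) product.
paths = [
    [{"id": "abcat0010000", "name": "Gift Ideas"},
     {"id": "abcat0014000", "name": "Kids"},
     {"id": "abcat0020000", "name": "Toys"},
     {"id": "abcat0020001", "name": "Learning Toys"}],
    [{"id": "abcat0010000", "name": "Gift Ideas"},
     {"id": "abcat0020004", "name": "Unique Gifts"}],
]

targets = per_node_targets(paths)
```

Here `targets["abcat0010000"]` is positive for both products because both paths pass through Gift Ideas, while each leaf is positive for only one of them.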


## Products

In [4]:
# load data
products_path = 'open-data-set-master/products.json'
products = common.read_file(products_path)
prod_df = pd.DataFrame.from_dict(products)

# split the category path (a list of dicts) into one column per hierarchy level
cat_list = ['cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7']
prod_df[cat_list] = pd.DataFrame(prod_df.category.tolist(), index=prod_df.index)
    
prod_df.head()

Unnamed: 0,sku,name,type,price,upc,category,shipping,description,manufacturer,model,url,image,cat1,cat2,cat3,cat4,cat5,cat6,cat7
0,43900,Duracell - AAA Batteries (4-Pack),HardGood,5.49,41333424019,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Compatible with select electronic devices; AAA...,Duracell,MN2400B4Z,http://www.bestbuy.com/site/duracell-aaa-batte...,http://img.bbystatic.com/BestBuy_US/images/pro...,"{'id': 'pcmcat312300050015', 'name': 'Connecte...","{'id': 'pcmcat248700050021', 'name': 'Housewar...","{'id': 'pcmcat303600050001', 'name': 'Househol...","{'id': 'abcat0208002', 'name': 'Alkaline Batte...",,,
1,48530,Duracell - AA 1.5V CopperTop Batteries (4-Pack),HardGood,5.49,41333415017,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Long-lasting energy; DURALOCK Power Preserve t...,Duracell,MN1500B4Z,http://www.bestbuy.com/site/duracell-aa-1-5v-c...,http://img.bbystatic.com/BestBuy_US/images/pro...,"{'id': 'pcmcat312300050015', 'name': 'Connecte...","{'id': 'pcmcat248700050021', 'name': 'Housewar...","{'id': 'pcmcat303600050001', 'name': 'Househol...","{'id': 'abcat0208002', 'name': 'Alkaline Batte...",,,
2,127687,Duracell - AA Batteries (8-Pack),HardGood,7.49,41333825014,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Compatible with select electronic devices; AA ...,Duracell,MN1500B8Z,http://www.bestbuy.com/site/duracell-aa-batter...,http://img.bbystatic.com/BestBuy_US/images/pro...,"{'id': 'pcmcat312300050015', 'name': 'Connecte...","{'id': 'pcmcat248700050021', 'name': 'Housewar...","{'id': 'pcmcat303600050001', 'name': 'Househol...","{'id': 'abcat0208002', 'name': 'Alkaline Batte...",,,
3,150115,Energizer - MAX Batteries AA (4-Pack),HardGood,4.99,39800011329,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,4-pack AA alkaline batteries; battery tester i...,Energizer,E91BP-4,http://www.bestbuy.com/site/energizer-max-batt...,http://img.bbystatic.com/BestBuy_US/images/pro...,"{'id': 'pcmcat312300050015', 'name': 'Connecte...","{'id': 'pcmcat248700050021', 'name': 'Housewar...","{'id': 'pcmcat303600050001', 'name': 'Househol...","{'id': 'abcat0208002', 'name': 'Alkaline Batte...",,,
4,185230,Duracell - C Batteries (4-Pack),HardGood,8.99,41333440019,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Compatible with select electronic devices; C s...,Duracell,MN1400R4Z,http://www.bestbuy.com/site/duracell-c-batteri...,http://img.bbystatic.com/BestBuy_US/images/pro...,"{'id': 'pcmcat312300050015', 'name': 'Connecte...","{'id': 'pcmcat248700050021', 'name': 'Housewar...","{'id': 'pcmcat303600050001', 'name': 'Househol...","{'id': 'abcat0208002', 'name': 'Alkaline Batte...",,,


In [5]:
# Merge the product table with the categories to attach a single category id to each product
# (both path columns are cast to str so the lists can be used as join keys)

prod_df['category'] = prod_df['category'].astype('str')
cat_df['path'] = cat_df['path'].astype('str')

prod_df = prod_df.merge(
    cat_df[['path', 'id']],
    left_on='category',
    right_on='path',
    how='left'
    )

prod_df.head()

Unnamed: 0,sku,name,type,price,upc,category,shipping,description,manufacturer,model,url,image,cat1,cat2,cat3,cat4,cat5,cat6,cat7,path,id
0,43900,Duracell - AAA Batteries (4-Pack),HardGood,5.49,41333424019,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Compatible with select electronic devices; AAA...,Duracell,MN2400B4Z,http://www.bestbuy.com/site/duracell-aaa-batte...,http://img.bbystatic.com/BestBuy_US/images/pro...,"{'id': 'pcmcat312300050015', 'name': 'Connecte...","{'id': 'pcmcat248700050021', 'name': 'Housewar...","{'id': 'pcmcat303600050001', 'name': 'Househol...","{'id': 'abcat0208002', 'name': 'Alkaline Batte...",,,,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",abcat0208002
1,48530,Duracell - AA 1.5V CopperTop Batteries (4-Pack),HardGood,5.49,41333415017,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Long-lasting energy; DURALOCK Power Preserve t...,Duracell,MN1500B4Z,http://www.bestbuy.com/site/duracell-aa-1-5v-c...,http://img.bbystatic.com/BestBuy_US/images/pro...,"{'id': 'pcmcat312300050015', 'name': 'Connecte...","{'id': 'pcmcat248700050021', 'name': 'Housewar...","{'id': 'pcmcat303600050001', 'name': 'Househol...","{'id': 'abcat0208002', 'name': 'Alkaline Batte...",,,,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",abcat0208002
2,127687,Duracell - AA Batteries (8-Pack),HardGood,7.49,41333825014,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Compatible with select electronic devices; AA ...,Duracell,MN1500B8Z,http://www.bestbuy.com/site/duracell-aa-batter...,http://img.bbystatic.com/BestBuy_US/images/pro...,"{'id': 'pcmcat312300050015', 'name': 'Connecte...","{'id': 'pcmcat248700050021', 'name': 'Housewar...","{'id': 'pcmcat303600050001', 'name': 'Househol...","{'id': 'abcat0208002', 'name': 'Alkaline Batte...",,,,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",abcat0208002
3,150115,Energizer - MAX Batteries AA (4-Pack),HardGood,4.99,39800011329,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,4-pack AA alkaline batteries; battery tester i...,Energizer,E91BP-4,http://www.bestbuy.com/site/energizer-max-batt...,http://img.bbystatic.com/BestBuy_US/images/pro...,"{'id': 'pcmcat312300050015', 'name': 'Connecte...","{'id': 'pcmcat248700050021', 'name': 'Housewar...","{'id': 'pcmcat303600050001', 'name': 'Househol...","{'id': 'abcat0208002', 'name': 'Alkaline Batte...",,,,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",abcat0208002
4,185230,Duracell - C Batteries (4-Pack),HardGood,8.99,41333440019,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Compatible with select electronic devices; C s...,Duracell,MN1400R4Z,http://www.bestbuy.com/site/duracell-c-batteri...,http://img.bbystatic.com/BestBuy_US/images/pro...,"{'id': 'pcmcat312300050015', 'name': 'Connecte...","{'id': 'pcmcat248700050021', 'name': 'Housewar...","{'id': 'pcmcat303600050001', 'name': 'Househol...","{'id': 'abcat0208002', 'name': 'Alkaline Batte...",,,,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",abcat0208002
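
The merge above joins on the stringified path, which relies on both tables serializing the list identically. In the rows shown, the matched id equals the id of the last element of the product's category list, so one possible alternative is to read that leaf id directly and check it against the known category ids. This is a sketch under that assumption, and it can differ from the path-based merge when only the leaf is known; `leaf_category_id` and the toy inputs are hypothetical.

```python
def leaf_category_id(category_path, known_ids):
    """Return the id of the most specific category, or None if it is unknown."""
    if not category_path:
        return None
    leaf_id = category_path[-1]["id"]
    return leaf_id if leaf_id in known_ids else None


# Toy inputs: ids taken from the rows above, names omitted for brevity.
known_ids = {"abcat0208002", "abcat0010000"}
battery_path = [
    {"id": "pcmcat312300050015"},
    {"id": "pcmcat248700050021"},
    {"id": "pcmcat303600050001"},
    {"id": "abcat0208002"},
]

matched = leaf_category_id(battery_path, known_ids)
```

In the notebook this would have to run before the category column is cast to str, since it needs the original list of dicts.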


In [6]:
print('The dataframe with the information to train the model will be the following:')
prod_df[['name', 'description', 'id']]

The dataframe with the information to train the model will be the following:


Unnamed: 0,name,description,id
0,Duracell - AAA Batteries (4-Pack),Compatible with select electronic devices; AAA...,abcat0208002
1,Duracell - AA 1.5V CopperTop Batteries (4-Pack),Long-lasting energy; DURALOCK Power Preserve t...,abcat0208002
2,Duracell - AA Batteries (8-Pack),Compatible with select electronic devices; AA ...,abcat0208002
3,Energizer - MAX Batteries AA (4-Pack),4-pack AA alkaline batteries; battery tester i...,abcat0208002
4,Duracell - C Batteries (4-Pack),Compatible with select electronic devices; C s...,abcat0208002
...,...,...,...
51641,Honeywell - True HEPA Replacement Filters for ...,Compatible with select Honeywell air purifier ...,pcmcat303700050016
51642,Dyson - Hard Floor Wipes for Dyson Hard DC56 V...,Removes dirt and grime from hard floors; cloth...,abcat0916009
51643,Aleratec - Drive Enclosure - Internal - Black,"1 x Total Bay - 1 x 2.5"" Bay",pcmcat186100050005
51644,Amazon - Fire TV Stick,"Streams 1080p content; dual-band, dual-antenna...",


In [14]:
print('There are {} products without a match in the categories table'.format(prod_df[prod_df['id'].isnull()].shape[0]))

There are 1162 products without a match in the categories table


The products without a matching category will be removed from the dataset before training.

In a business setting, the proper way to handle this would be to raise it with the responsible team (data engineering, product, etc.) and agree on the best solution.
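
Dropping the unmatched rows before training could look like the sketch below. The two-row toy frame stands in for prod_df after the merge, with shortened descriptions; the names `n_unmatched` and `train_df` are assumptions.

```python
import pandas as pd

# Toy stand-in for prod_df after the merge; a missing id means no category matched.
prod_df = pd.DataFrame({
    "name": ["Duracell - AAA Batteries (4-Pack)", "Amazon - Fire TV Stick"],
    "description": ["Compatible with select electronic devices", "Streams 1080p content"],
    "id": ["abcat0208002", None],
})

# Count the unmatched rows, then keep only labeled rows and the training columns.
n_unmatched = int(prod_df["id"].isna().sum())
train_df = (
    prod_df.dropna(subset=["id"])[["name", "description", "id"]]
    .reset_index(drop=True)
)
```

Keeping the count alongside the filtered frame makes it easy to report how much data was discarded.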

## Stores

In [7]:
# load data
stores_path = 'open-data-set-master/stores.json'
stores = common.read_file(stores_path)
store_df = pd.DataFrame.from_dict(stores)

store_df.head()

Unnamed: 0,id,type,name,address,address2,city,state,zip,location,hours,services
0,1000,BigBox,Mall of America,340 W Market,,Bloomington,MN,55425,"{'lat': 44.85466, 'lon': -93.24565}",Mon: 10-9:30; Tue: 10-9:30; Wed: 10-9:30; Thur...,"[Geek Squad Services, Best Buy Mobile, Best Bu..."
1,1002,BigBox,Tempe Marketplace,1900 E Rio Salado Pkwy,,Tempe,AZ,85281,"{'lat': 33.430729, 'lon': -111.89966}",Mon: 10-9; Tue: 10-9; Wed: 10-9; Thurs: 10-9; ...,"[Windows Store, Geek Squad Services, Best Buy ..."
2,1003,BigBox,Lexington Park,45235 Worth Ave.,,California,MD,20619,"{'lat': 38.29697, 'lon': -76.512016}",Mon: 10-9; Tue: 10-9; Wed: 10-9; Thurs: 10-9; ...,"[Geek Squad Services, Best Buy Mobile, Best Bu..."
3,1004,BigBox,Trussville,5072 Pinnacle Sq,,Birmingham,AL,35235,"{'lat': 33.605438, 'lon': -86.642662}",Mon: 10-9; Tue: 10-9; Wed: 10-9; Thurs: 10-9; ...,"[Geek Squad Services, Best Buy Mobile, Best Bu..."
4,1008,BigBox,Vacaville,1621 E Monte Vista Ave,,Vacaville,CA,95688,"{'lat': 38.367649, 'lon': -121.96328}",Mon: 10-9; Tue: 10-9; Wed: 10-9; Thurs: 10-9; ...,"[Geek Squad Services, Best Buy Mobile, Best Bu..."


The Stores table does not appear to be linked to the Products or Categories tables, so it will not be used for the model.