Using the following dataset, https://github.com/BestBuyAPIs/open-data-set, build 
an API with one endpoint that receives the “name” and “description” of a 
new product as input parameters and outputs the “category” or “categories” that 
this new product should belong to.

Expectations:

    • You have a data pipeline that handles the dataset.
    • You build a classifier that predicts the category or categories of a new product.
        ◦ You can decide whether your model outputs one or multiple labels.
        ◦ You don't need to spend a lot of time comparing different models.
        ◦ You don't need to spend a lot of time on state-of-the-art feature engineering.
    • You build one API endpoint that exposes the classifier as a solution to label new products.

In [1]:
# import libraries
import common
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# read data
categories_path = 'open-data-set-master/categories.json'
categories = common.read_file(categories_path)
cat_df = pd.DataFrame.from_dict(categories)

products_path = 'open-data-set-master/products.json'
products = common.read_file(products_path)
prod_df = pd.DataFrame.from_dict(products)

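# Merge the product table with the categories to attach a single category id to each product.
# Both columns hold lists of dicts, which are not hashable, so compare them by their string form.
prod_df['category'] = prod_df['category'].astype('str')
cat_df['path'] = cat_df['path'].astype('str')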

prod_df = prod_df.merge(
    cat_df[['path', 'id']],
    left_on='category',
    right_on='path',
    how='left'
    )

# Remove products without a matching category
prod_df = prod_df[~prod_df['id'].isnull()]

prod_df.head()

Unnamed: 0,sku,name,type,price,upc,category,shipping,description,manufacturer,model,url,image,path,id
0,43900,Duracell - AAA Batteries (4-Pack),HardGood,5.49,41333424019,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Compatible with select electronic devices; AAA...,Duracell,MN2400B4Z,http://www.bestbuy.com/site/duracell-aaa-batte...,http://img.bbystatic.com/BestBuy_US/images/pro...,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",abcat0208002
1,48530,Duracell - AA 1.5V CopperTop Batteries (4-Pack),HardGood,5.49,41333415017,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Long-lasting energy; DURALOCK Power Preserve t...,Duracell,MN1500B4Z,http://www.bestbuy.com/site/duracell-aa-1-5v-c...,http://img.bbystatic.com/BestBuy_US/images/pro...,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",abcat0208002
2,127687,Duracell - AA Batteries (8-Pack),HardGood,7.49,41333825014,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Compatible with select electronic devices; AA ...,Duracell,MN1500B8Z,http://www.bestbuy.com/site/duracell-aa-batter...,http://img.bbystatic.com/BestBuy_US/images/pro...,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",abcat0208002
3,150115,Energizer - MAX Batteries AA (4-Pack),HardGood,4.99,39800011329,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,4-pack AA alkaline batteries; battery tester i...,Energizer,E91BP-4,http://www.bestbuy.com/site/energizer-max-batt...,http://img.bbystatic.com/BestBuy_US/images/pro...,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",abcat0208002
4,185230,Duracell - C Batteries (4-Pack),HardGood,8.99,41333440019,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",5.49,Compatible with select electronic devices; C s...,Duracell,MN1400R4Z,http://www.bestbuy.com/site/duracell-c-batteri...,http://img.bbystatic.com/BestBuy_US/images/pro...,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",abcat0208002
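
The merge in the cell above joins each product's full category path to the identical path in the category table, and drops products with no exact match. A minimal sketch of that behaviour on made-up rows (the names, ids, and paths below are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values) mimicking the shape of products.json and categories.json
toy_products = pd.DataFrame({
    "name": ["AAA Batteries", "Mystery Item"],
    "category": [
        [{"id": "c1", "name": "Household"}, {"id": "c2", "name": "Batteries"}],
        [{"id": "c9", "name": "Unknown"}],
    ],
})
toy_categories = pd.DataFrame({
    "id": ["c2"],
    "path": [[{"id": "c1", "name": "Household"}, {"id": "c2", "name": "Batteries"}]],
})

# Lists are unhashable, so compare their string representations instead
toy_products["category"] = toy_products["category"].astype("str")
toy_categories["path"] = toy_categories["path"].astype("str")

merged = toy_products.merge(
    toy_categories[["path", "id"]],
    left_on="category",
    right_on="path",
    how="left",
)

# Products whose path has no exact match get a missing id and are dropped
matched = merged.dropna(subset=["id"])
print(matched[["name", "id"]])
```

Because the comparison is on strings, two paths match only if their dicts have the same keys in the same order, so any difference in serialization would silently drop a product.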


In [3]:
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    prod_df, 
    prod_df['id'], 
    test_size=0.3, 
    random_state=42,
    shuffle=True
)

# train model
# Note: see pipeline definition in common.py
text_clf = common.text_clf.fit(X_train, y_train)

# evaluate accuracy (fraction of test products whose predicted category id is exactly right)
predicted = text_clf.predict(X_test)
accuracy = np.mean(predicted == y_test)
print('Accuracy over all classes = {}'.format(accuracy))

Accuracy over all classes = 0.5755314934636208
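
The notebook relies on common.py, which is not shown here. Judging only from the imports in the first cell, it might look roughly like the sketch below. Everything in it is an assumption: the body of `read_file`, the `combine_text` helper, the step names, and the choice to concatenate "name" and "description" are illustrative, not the actual contents of common.py.

```python
import json

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def read_file(path):
    """Load a JSON file into Python objects (assumed behaviour of common.read_file)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def combine_text(frame):
    """Join the 'name' and 'description' columns into one string per row (hypothetical helper)."""
    return (frame["name"].fillna("") + " " + frame["description"].fillna("")).tolist()


# Hypothetical pipeline: bag-of-words counts fed into multinomial naive Bayes
text_clf = Pipeline([
    ("combine", FunctionTransformer(combine_text)),
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# Tiny usage example on made-up rows
toy = pd.DataFrame({
    "name": ["AA Batteries", "HDMI Cable", "AAA Batteries", "USB Cable"],
    "description": ["alkaline battery pack", "video cable", None, "charging cable"],
})
labels = ["batteries", "cables", "batteries", "cables"]
text_clf.fit(toy, labels)

new_item = pd.DataFrame({"name": ["C Batteries"], "description": ["alkaline battery"]})
prediction = text_clf.predict(new_item)
print(prediction)
```

Keeping the column selection inside the pipeline means the fitted object can be handed a whole DataFrame, as the training cell does, and the same preprocessing is applied at prediction time.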


In [4]:
# save classifier
dump(text_clf, 'text_clf.joblib') 

['text_clf.joblib']
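
The saved classifier still has to be exposed through an endpoint. One possible sketch, using only the standard library HTTP server, is shown below. The route name `/categorize`, the port, the response shape, and the helper names are all assumptions made for illustration, and it presumes the fitted pipeline accepts a one-row DataFrame with "name" and "description" columns.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import pandas as pd


def categorize(model, payload):
    """Validate the request payload and return (status code, response body)."""
    if not isinstance(payload, dict):
        return 400, {"error": "expected a JSON object"}
    missing = [key for key in ("name", "description") if key not in payload]
    if missing:
        return 400, {"error": "missing fields: " + ", ".join(missing)}
    # The pipeline was fitted on a DataFrame, so wrap the input in a single row
    row = pd.DataFrame([{"name": payload["name"], "description": payload["description"]}])
    return 200, {"category": str(model.predict(row)[0])}


def make_handler(model):
    """Build a request handler class bound to an already loaded model."""

    class CategoryHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/categorize":  # hypothetical route name
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
            except json.JSONDecodeError:
                status, body = 400, {"error": "invalid JSON"}
            else:
                status, body = categorize(model, payload)
            data = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return CategoryHandler


def serve(model_path="text_clf.joblib", port=8000):
    """Load the saved classifier and serve predictions until interrupted."""
    from joblib import load  # imported here so categorize() can be reused without joblib

    model = load(model_path)
    HTTPServer(("127.0.0.1", port), make_handler(model)).serve_forever()
```

Calling `serve()` starts the server, after which a POST to `/categorize` with a JSON body containing "name" and "description" returns the predicted category id. Loading the joblib file requires that any module the pipeline refers to (here, presumably common.py) is importable in the serving process.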