Using the dataset at https://github.com/BestBuyAPIs/open-data-set, build 
an API with one endpoint that receives the “name” and “description” of a 
new product as input parameters and outputs the “category” or “categories” that 
the new product belongs in.

Expectations:

    • You have a data pipeline that handles the dataset.
    • You build a classifier that predicts the category or categories of a new product.
        ◦ You can decide whether your model outputs one label or multiple labels.
        ◦ You don't need to spend a lot of time comparing different models.
        ◦ You don't need to spend a lot of time on state-of-the-art feature engineering.
    • You build one API endpoint that exposes the classifier as a solution to label new products.

In [1]:
import common
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# read data
categories_path = 'open-data-set-master/categories.json'
products_path = 'open-data-set-master/products.json'
stores_path = 'open-data-set-master/stores.json'

categories = common.read_file(categories_path)
products = common.read_file(products_path)
stores = common.read_file(stores_path)
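
`common.read_file` comes from a local module that is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming each file holds a single JSON document (here, a list of records); the function body is an assumption, not the original implementation:

```python
import json
from pathlib import Path


def read_file(path):
    """Load a JSON file and return the parsed content (assumed: a list of dicts)."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

If the real files were instead stored one JSON object per line, the helper would need to parse line by line.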

In [3]:
# feature engineering: spread each product's category path over cat1..cat7 columns
cat_df = pd.DataFrame.from_dict(categories)

prod_df = pd.DataFrame.from_dict(products)
cat_list = ['cat1','cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7']

prod_df[cat_list] = pd.DataFrame(prod_df.category.tolist(), index=prod_df.index)

# expand every level into catN.id / catN.name columns
# TODO: the model below trains on cat1 only; the deeper levels still need to be included
for cat in cat_list:
    print(cat)
    cat_col = prod_df[cat].values.tolist()
    cat_col = [{'id': None, 'name': None} if v is None else v for v in cat_col]
    cat_col = pd.DataFrame(cat_col)
    cat_col.columns = cat + '.' + cat_col.columns

    prod_df[cat_col.columns] = cat_col

cat1
cat2
cat3
cat4
cat5
cat6
cat7
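
The loop above spreads each product's `category` list (a path from the top-level category down to the most specific one) over fixed `catN.id` / `catN.name` columns, padding missing levels with `None`. The same transformation for a single record can be sketched in plain Python. `flatten_categories` is a hypothetical helper, not part of `common`, and the ids in the example are placeholders rather than real catalog ids:

```python
def flatten_categories(category_path, max_depth=7):
    """Map a list of {'id', 'name'} dicts to flat 'catN.id' / 'catN.name' keys."""
    flat = {}
    for level in range(max_depth):
        # Levels beyond the end of the path are padded with None values.
        entry = category_path[level] if level < len(category_path) else None
        entry = entry or {"id": None, "name": None}
        flat[f"cat{level + 1}.id"] = entry["id"]
        flat[f"cat{level + 1}.name"] = entry["name"]
    return flat


# Placeholder two-level path, for illustration only.
path = [{"id": "c1", "name": "Top"}, {"id": "c2", "name": "Sub"}]
row = flatten_categories(path)
print(row["cat1.id"], row["cat2.name"], row["cat3.id"])  # prints: c1 Sub None
```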


In [4]:
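# positional split (not shuffled): first 40,000 products for training, the rest for testing
train_data = prod_df[0:40000]
test_data = prod_df[40000:]

# target: id of the top-level category (cat1)
train_target = prod_df['cat1.id'][0:40000]
test_target = prod_df['cat1.id'][40000:]

text_clf = common.text_clf.fit(train_data, train_target)

# accuracy on the first 200 test rows only
predicted = text_clf.predict(test_data[0:200])
np.mean(predicted == test_target[0:200])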

0.9
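
`common.text_clf` is defined in a local module that is not shown here. Given the imports in the first cell, it is presumably a scikit-learn `Pipeline` built on `CountVectorizer` and `MultinomialNB`. The sketch below is an assumption about its shape, not the original code: `combine_text` is a hypothetical helper that joins the `name` and `description` columns so the pipeline can accept a product DataFrame directly, and the toy rows and labels are invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def combine_text(frame):
    """Join 'name' and 'description' into one string per row (missing -> '')."""
    return (frame["name"].fillna("") + " " + frame["description"].fillna("")).tolist()


text_clf_sketch = Pipeline([
    ("combine", FunctionTransformer(combine_text)),  # DataFrame -> list of strings
    ("vect", CountVectorizer()),                     # bag-of-words counts
    ("clf", MultinomialNB()),                        # multinomial naive Bayes
])

# Invented toy data, only to show that the pipeline fits and predicts.
toy = pd.DataFrame({
    "name": ["phone case", "coffee pods", "phone charger", "espresso beans"],
    "description": ["case for a phone", "pods for coffee machines",
                    "charger for a phone", "beans for espresso coffee"],
})
labels = ["phones", "coffee", "phones", "coffee"]

text_clf_sketch.fit(toy, labels)
new_item = pd.DataFrame([{"name": "coffee filter", "description": "filter for coffee"}])
print(text_clf_sketch.predict(new_item))
```

Because the first step selects columns by name, extra columns in the input DataFrame are ignored, which would be consistent with the notebook calling `predict` on both the full `test_data` and on `test_data[['name', 'description']]`.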

In [5]:
dump(text_clf, 'text_clf.joblib') 

['text_clf.joblib']

In [6]:
test_data[0:200]

Unnamed: 0,sku,name,type,price,upc,category,shipping,description,manufacturer,model,...,cat3.id,cat3.name,cat4.id,cat4.name,cat5.id,cat5.name,cat6.id,cat6.name,cat7.id,cat7.name
40000,7669121,Ted Baker - Hex Folio Case for Apple® iPhone® ...,HardGood,44.99,886075022013,"[{'id': 'abcat0800000', 'name': 'Cell Phones'}...",0,Designed for Apple iPhone 6; faux leather mate...,Ted Baker,22013CS,...,pcmcat191200050015,iPhone Accessories,pcmcat214700050000,iPhone Cases & Clips,,,,,,
40001,7669301,Metra - Radio Installation Dash Kit for Most 1...,HardGood,16.99,086429010011,"[{'id': 'abcat0300000', 'name': 'Car Electroni...",0,From our expanded online assortment; compatibl...,Metra,99-7401,...,pcmcat331600050007,Car Audio Installation Parts,pcmcat165900050031,Deck Installation Parts,pcmcat165900050033,Dash Installation Kits,,,,
40002,7670024,Keurig - Green Mountain Breakfast Blend Decaf ...,HardGood,13.99,099555046045,"[{'id': 'abcat0900000', 'name': 'Appliances'},...",5.49,Compatible with Keurig 2.0 beverage machines; ...,Keurig,,...,pcmcat367400050002,"Coffee, Tea & Espresso",pcmcat367400050005,Coffee Pods & Beans,abcat0912008,Coffee Pods,,,,
40003,7670033,Keurig - The Original Donut Shop K-Carafe Pods...,HardGood,13.99,099555046014,"[{'id': 'abcat0900000', 'name': 'Appliances'},...",5.49,Compatible with Keurig 2.0 beverage machines; ...,Keurig,,...,pcmcat367400050002,"Coffee, Tea & Espresso",pcmcat367400050005,Coffee Pods & Beans,abcat0912008,Coffee Pods,,,,
40004,7670088,Keurig - Green Mountain French Vanilla K-Caraf...,HardGood,13.99,099555046052,"[{'id': 'abcat0900000', 'name': 'Appliances'},...",5.49,Compatible with Keurig 2.0 beverage machines; ...,Keurig,,...,pcmcat367400050002,"Coffee, Tea & Espresso",pcmcat367400050005,Coffee Pods & Beans,abcat0912008,Coffee Pods,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40195,7744052,Bakker Elkhuizen - Document Holder/Writing Slo...,HardGood,211.99,855539004124,"[{'id': 'pcmcat312300050015', 'name': 'Connect...",0,Rigid acrylic construction; satin surface; hei...,Bakker Elkhuizen,BNEFDESK640A,...,pcmcat275600050004,Organizers & Storage,,,,,,,,
40196,7744067,"Farberware - 11"" Square Meal Griddle - Gray",HardGood,24.99,631899204424,"[{'id': 'abcat0900000', 'name': 'Appliances'},...",5.99,"FARBERWARE 11"" Square Meal Griddle: Aluminum c...",Farberware,20442,...,pcmcat748301695371,"Cookware, Bakeware & Cutlery",pcmcat226900050013,Cookware,,,,,,
40197,7744076,Circulon - Acclaim 4.5-Quart Covered Casserole...,HardGood,49.99,051153834868,"[{'id': 'abcat0900000', 'name': 'Appliances'},...",0,CIRCULON Acclaim 4.5-Quart Covered Casserole D...,Circulon,83486,...,pcmcat748301695371,"Cookware, Bakeware & Cutlery",pcmcat226900050013,Cookware,,,,,,
40198,7744085,Farberware - New Traditions 12-Quart Covered S...,HardGood,39.99,631899145123,"[{'id': 'abcat0900000', 'name': 'Appliances'},...",0,FARBERWARE New Traditions 12-Quart Covered Sto...,Farberware,14512,...,pcmcat748301695371,"Cookware, Bakeware & Cutlery",pcmcat226900050013,Cookware,,,,,,


In [16]:
test_data[['name', 'description']][0:1]

Unnamed: 0,name,description
40000,Ted Baker - Hex Folio Case for Apple® iPhone® ...,Designed for Apple iPhone 6; faux leather mate...


In [10]:
text_clf.predict(test_data[['name', 'description']][0:1])


array(['abcat0800000'], dtype='<U18')

In [18]:
# the values are Portuguese: "potatoes" / "potatoes are not onions"
new_product = {'name': 'batatas', 'description': 'batatas nao sao cebolas'}

input_df = pd.DataFrame.from_records([new_product])

In [19]:
text_clf.predict(input_df)


array(['abcat0900000'], dtype='<U18')
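
The task also asks for an endpoint that accepts `name` and `description`, which this notebook does not show. A minimal sketch using only the standard library follows. Everything in it is an assumption: the `/predict` route, the port, the response shape, and the helper names are hypothetical, and `predict_category` is a stand-in that would be replaced by a call to the model loaded from `text_clf.joblib` (for example `text_clf.predict(pd.DataFrame.from_records([payload]))[0]`).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_category(name, description):
    """Stand-in for the trained model; always returns a fixed placeholder label."""
    return "placeholder-category-id"


def handle_payload(raw_body):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "body must be a JSON object"}
    missing = [key for key in ("name", "description")
               if not isinstance(payload.get(key), str)]
    if missing:
        return 400, {"error": "missing or non-string fields: " + ", ".join(missing)}
    label = predict_category(payload["name"], payload["description"])
    return 200, {"categories": [label]}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        encoded = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        self.wfile.write(encoded)


def run(port=8000):
    """Serve the endpoint on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `run()` starts the server; a POST to `/predict` with a JSON body such as `{"name": "...", "description": "..."}` returns a JSON object with a `categories` list. Keeping the validation in `handle_payload` separate from the HTTP handler makes it possible to test the request logic without opening a socket.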