Using the dataset at https://github.com/BestBuyAPIs/open-data-set, build 
an API with one endpoint that receives the “name” and “description” of a 
new product as input parameters and outputs the “category” or “categories” that 
the new product should belong to.

Expectations:

    • You have a data pipeline that handles the dataset.
    • You build a classifier that predicts the category or categories of a new product.
        ◦ You can decide if your model would output one or multiple labels.
        ◦ You don't need to spend lots of time comparing different models.
        ◦ You don't need to spend a lot of time on state-of-the-art feature engineering.
    • You build one API endpoint that exposes the classifier as a solution to label new products.

In [2]:
import common
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
# read data
categories_path = 'open-data-set-master/categories.json'
products_path = 'open-data-set-master/products.json'
stores_path = 'open-data-set-master/stores.json'

categories = common.read_file(categories_path)
products = common.read_file(products_path)
stores = common.read_file(stores_path)
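
The `common` module is a local helper that is not shown in this notebook. A minimal sketch of what `read_file` might look like, assuming each dataset file is a single JSON document (the function body below is a hypothetical stand-in, not the original helper):

```python
import json
from pathlib import Path


def read_file(path):
    """Load one JSON file and return the parsed Python object.

    Hypothetical stand-in for common.read_file, assuming the file
    holds a single JSON document such as a list of records.
    """
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

With a helper like this, `read_file(products_path)` would return the parsed records, which `pd.DataFrame.from_dict` can then turn into a table.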

In [4]:
# feature engineering
cat_df = pd.DataFrame.from_dict(categories)

prod_df = pd.DataFrame.from_dict(products)
cat_list = ['cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7']

# each product carries a list of category dicts; spread the list into one column per position
prod_df[cat_list] = pd.DataFrame(prod_df.category.tolist(), index=prod_df.index)

# split every category column into '<cat>.id' and '<cat>.name' columns
# (all seven are expanded here, but the model below only uses cat1.id for now)
for cat in cat_list:
    print(cat)
    cat_col = prod_df[cat].values.tolist()
    # products with fewer categories have None here; substitute an empty record
    cat_col = [{'id': None, 'name': None} if v is None else v for v in cat_col]
    cat_col = pd.DataFrame(cat_col)
    cat_col.columns = cat + '.' + cat_col.columns

    prod_df[cat_col.columns] = cat_col

cat1
cat2
cat3
cat4
cat5
cat6
cat7


In [5]:
from sklearn.feature_extraction.text import TfidfTransformer

# positional split: the first 40,000 products for training, the rest for testing
train_data = prod_df['description'][0:40000]
test_data = prod_df['description'][40000:]

# the target is the id of the first category only
train_target = prod_df['cat1.id'][0:40000]
test_target = prod_df['cat1.id'][40000:]

# bag-of-words counts -> tf-idf weights -> multinomial naive Bayes
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
text_clf = text_clf.fit(train_data, train_target)

# accuracy on the held-out products
predicted = text_clf.predict(test_data)
np.mean(predicted == test_target)

0.8413189077794951
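
The model above predicts a single label (`cat1.id`). If all categories of a product should be predicted instead, one possible approach is a multi-label setup. The sketch below is only an illustration on made-up toy data, not the method used in this notebook: it assumes scikit-learn's `MultiLabelBinarizer` to encode the label lists and `OneVsRestClassifier` to fit one naive Bayes model per label. The texts and label names are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for product descriptions and their category lists
# (made-up values, not taken from the dataset).
texts = [
    "alkaline aa batteries long lasting power",
    "rechargeable aaa batteries with charger",
    "rock music cd album studio recording",
    "pop music cd greatest hits",
]
labels = [
    ["household", "batteries"],
    ["household", "batteries"],
    ["music", "cds"],
    ["music", "cds"],
]

# Encode each label list as a row of 0/1 indicators, one column per label.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

# One binary naive Bayes classifier per label, on top of tf-idf features.
multi_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(MultinomialNB())),
])
multi_clf.fit(texts, y)

# Predict the indicator row for a new text and decode it back to label names.
predicted = multi_clf.predict(["aa batteries value pack"])
predicted_labels = binarizer.inverse_transform(predicted)
```

On the real data, `labels` would be built from the `cat1.id` to `cat7.id` columns, skipping the `None` entries, and plain accuracy would no longer be the natural metric.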

In [6]:
dump(text_clf, 'text_clf.joblib') 

['text_clf.joblib']
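
The notebook stops at saving the pipeline; the endpoint itself is not shown. A minimal sketch of how the saved model could be exposed, using only the Python standard library, might look like the following. The `/predict` path, the JSON request shape, and the function names `classify_product`, `make_handler` and `serve` are assumptions made for this illustration, and a web framework would work just as well.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify_product(model, payload):
    """Validate the request payload and return the predicted category id."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    name = payload.get("name")
    description = payload.get("description")
    if not isinstance(name, str) or not isinstance(description, str):
        raise ValueError("'name' and 'description' must both be strings")
    # The pipeline in this notebook was trained on descriptions only, so only
    # the description is passed to the model; the name is validated but unused.
    prediction = model.predict([description])
    return {"category": str(prediction[0])}


def make_handler(model):
    """Build a request handler class bound to a fitted model."""

    class PredictHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Hypothetical route; only POST /predict is handled.
            if self.path != "/predict":
                self._reply(404, {"error": "not found"})
                return
            try:
                length = int(self.headers.get("Content-Length", 0))
                payload = json.loads(self.rfile.read(length) or b"{}")
                self._reply(200, classify_product(model, payload))
            except ValueError as exc:
                # Covers malformed JSON as well as failed validation.
                self._reply(400, {"error": str(exc)})

        def _reply(self, status, body):
            data = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return PredictHandler


def serve(model, host="127.0.0.1", port=8000):
    """Serve predictions for the given fitted model until interrupted."""
    HTTPServer((host, port), make_handler(model)).serve_forever()
```

Under these assumptions, calling `serve(load('text_clf.joblib'))` would start the server, and a POST to `/predict` with a body such as `{"name": "...", "description": "..."}` would return a JSON object with a `category` field.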