# Nike Shoe Data Ingestion Process

This notebook downloads a Nike shoe dataset from data.world and transforms it into a format compatible with AI Product Catalog. AI Product Catalog is built generically: it supports different product types, and products of different types can coexist in the same repository.

The process is roughly as follows:
1. Download dataset
2. Clean and transform
3. Extract subcategories
4. Store in DB (blending approach: merges with existing, non-overlapping datasets)
5. Generate and store embeddings in Vector DB

Dataset-specific tasks live in this notebook, while generic data processing is deferred to the supporting AI Product Catalog Product Library.
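
To make the blending idea in step 4 concrete, here is a minimal sketch of one possible reading: rows belonging to the dataset being imported are replaced, and rows from every other dataset are left alone. This is an assumption about the behavior, not the Product Library's actual implementation. The `blend` function, the `dataset` field, and the sample rows are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def blend(existing: List[Row], incoming: List[Row], dataset: str) -> List[Row]:
    """Replace rows owned by `dataset` with `incoming`; keep all other rows."""
    # Keep rows that came from other datasets (assumed "non-overlapping" data).
    kept = [row for row in existing if row["dataset"] != dataset]
    # Tag the incoming rows with the dataset they belong to.
    tagged = [{**row, "dataset": dataset} for row in incoming]
    return kept + tagged


# Hypothetical catalog contents before the import.
catalog = [
    {"dataset": "hypothetical/furniture", "sku": "F-1", "name": "Oak chair"},
    {"dataset": "data-hut/product-data-from-nike", "sku": "OLD-1", "name": "Stale shoe"},
]
nike_rows = [{"sku": "N-1", "name": "Example shoe"}]

blended = blend(catalog, nike_rows, "data-hut/product-data-from-nike")
print([row["sku"] for row in blended])
```

Under this reading, re-running the notebook refreshes the Nike rows without touching products imported from other datasets.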

## Initialization

In [None]:
import datadotworld as dw
import pandas as pd
import os

from product_dataset_lib import ProductDataset
from product_dataset_lib import get_config_value

In [None]:
DATASET_NAME = 'data-hut/product-data-from-nike'

CATEGORY_DESC = "Shoes"

SENTENCE_TRANSFORMER_MODEL = "sentence-transformers/all-mpnet-base-v2"

In [None]:
DB_HOST = get_config_value("DB_HOST", "127.0.0.1")
DB_PORT = get_config_value("DB_PORT", "5432")
DB_NAME = get_config_value("DB_NAME", "ai_product_catalog")
DB_USER = get_config_value("DB_USER", "ai_product_catalog")
DB_PASSWORD = get_config_value("DB_PASSWORD", "ai_product_catalog123")

DB_CONNECTION_STRING = f"host={DB_HOST} port={DB_PORT} dbname={DB_NAME} user={DB_USER} password={DB_PASSWORD}"
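# Show the connection settings without printing the password into the notebook output
print(f"DB connection: host={DB_HOST} port={DB_PORT} dbname={DB_NAME} user={DB_USER} password=****")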
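
The `get_config_value` helper comes from `product_dataset_lib`, and its implementation is not shown in this notebook. A plausible sketch, assuming it reads an environment variable and falls back to the supplied default, might look like the following. The name `get_config_value_sketch` is hypothetical and the real helper may read from a different source.

```python
import os


def get_config_value_sketch(key: str, default: str) -> str:
    """Return the environment variable `key`, or `default` if unset or empty."""
    value = os.environ.get(key)
    return value if value else default


# Simulate one overridden setting and one that falls back to its default.
os.environ["DB_PORT"] = "5433"
os.environ.pop("DB_HOST", None)

port = get_config_value_sketch("DB_PORT", "5432")
host = get_config_value_sketch("DB_HOST", "127.0.0.1")
print(port, host)
```

With this pattern, the defaults in the cell above only apply to a local development database; a deployed environment would override them without changing the notebook.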

## Download

In [None]:
product_data = dw.load_dataset(DATASET_NAME)
product_data.describe()

In [None]:
df = product_data.dataframes["nike_2020_04_13"]
df = df.drop_duplicates()
print(df.shape)
df.head()

In [None]:
print(f"Shape = {df.shape}")
print(f"Number of unique Product IDs = {df['product_id'].nunique(dropna=False)}")
print(f"Maximum Length of Product ID Column = {df['product_id'].str.len().max()}")
print(f"Number of unique Brands = {df['brand'].nunique(dropna=False)}")
print(f"Maximum Length of Product Name Column = {df['product_name'].str.len().max()}")
print(f"Maximum Length of Product Description Column = {df['description'].str.len().max()}")
print(f"Maximum Length of Brand Column = {df['brand'].str.len().max()}")

## Transform

In [None]:
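# sale_price is assumed to be stored in cents; convert it to dollars.
# Note that the sale price, not a list price, is what ends up in the msrp column.
df['msrp'] = df['sale_price'].astype('float') / 100.0
df['msrp']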

In [None]:
df["category"] = CATEGORY_DESC

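# Assign the result back: fillna(inplace=True) on a column selection is deprecated
# in pandas 2.x and may not update df
df["description"] = df["description"].fillna('')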

## Process

In [None]:
product_dataset = ProductDataset(DATASET_NAME,
                                DB_CONNECTION_STRING,
                                SENTENCE_TRANSFORMER_MODEL)

# Map this dataset's column names to the catalog's product columns
resultDF = product_dataset.import_df(df,
                    {
                        "product_id": product_dataset.ProductColumns.SKU,
                        "msrp": product_dataset.ProductColumns.PRICE,
                        "brand": product_dataset.ProductColumns.BRAND_DESC,
                        "category": product_dataset.ProductColumns.CATEGORY_DESC,
                        "product_name": product_dataset.ProductColumns.NAME,
                        "description": product_dataset.ProductColumns.DESC
                    }
                )
# Persist the imported products, then load, refresh, and persist their embeddings
product_dataset.persist()
product_dataset.load_embeddings()
product_dataset.refresh_embeddings()
product_dataset.persist_embeddings()

print(resultDF.head())
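
The embedding calls above are handled inside the Product Library, so their internals are not visible here. As a rough illustration of what generating embeddings for products could involve, the sketch below turns each product into a text string, encodes it, and pairs the vector with the SKU. Everything in it is an assumption: `embedding_text`, `build_embedding_rows`, the name-plus-description convention, and the toy encoder are hypothetical and are not taken from the library.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Encoder = Callable[[List[str]], Sequence[Sequence[float]]]


def embedding_text(name: str, description: str) -> str:
    """Combine name and description into one string to embed (assumed convention)."""
    return f"{name}. {description}" if description else name


def build_embedding_rows(
    products: List[Dict[str, str]], encode: Encoder
) -> List[Tuple[str, List[float]]]:
    """Return (sku, vector) pairs that could be written to a vector store."""
    texts = [embedding_text(p["name"], p.get("description", "")) for p in products]
    vectors = encode(texts)  # one vector per input text, in the same order
    return [(p["sku"], [float(x) for x in v]) for p, v in zip(products, vectors)]


def toy_encode(texts: List[str]) -> List[List[float]]:
    """Stand-in encoder so the sketch runs without downloading a model."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


# Hypothetical products; real rows would come from the imported DataFrame.
products = [
    {"sku": "N-1", "name": "Example shoe", "description": "Light runner"},
    {"sku": "N-2", "name": "Plain shoe", "description": ""},
]
rows = build_embedding_rows(products, toy_encode)
print(rows)
```

With the `sentence-transformers` package installed, `SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode` could be passed as `encode` in place of `toy_encode`; that model produces 768-dimensional vectors. Keeping the encoder as a parameter makes the pairing logic testable without the model.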