# Trove Take Home Project
### Data Scientist

This notebook provides a minimal viable product (MVP) that marginally improves Trove's pricing strategy. The approach uses Keras and Deep Learning to integrate a wide range of supply and demand signals into Trove's pricing model. We train a Keras neural network for regression and continous value prediction, specifically focusing on predicting the price of individual inventory items from Homebody, a fictional partner whose Trove-powered resale program offers used home goods.

#### 1. We begin by importing the necessary python modules for this project

In [1]:
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam
import numpy as np
import locale
import os

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


#### 2. We initialize the list of columns in the csv file and load the data using Pandas

In [2]:
def load_item_data(inputPath):
    cols = ['unique_item_id','used_list_price','used_condition','department','category','item_parent_sku','color','size','msrp_new','last_known_retail_price_new','first_approved_date','first_ordered_date']
    df = pd.read_csv(inputPath, sep=",", header=1, names=cols)
    return df

#### 3. In the following cell, we preprocess the dataset prior to training our machine learning model. The dataset contains both numerical and categorical attributes. For categorical features with low cardinality or small set of distinct values (less than 7 distinct values), we use One Hot Encoding for converting string values to a numerical vector that can be easily interpreted by our machine learning algorithm. For attributes with high cardinality or large set of distinct values, we employ the "hashing trick" to encode the categorical features and limit the number of features to a predefined space with fewer new dimensions. In the sample Homebody dataset, the listed transformation are applied to the following categorical columns:
- used_condition -> one hot encoding
- department -> one hot encoding
- category -> feature hash encoding
- color -> feature hash encoding
- size -> feature hash encoding

#### We perform min-max scaling on each continuous feature column to restrict their values between 0 and 1. This ensures that variables that were measured at different scales contribute equally to the model fitting and model learned function and avoid creating bias. In the sample Homebody dataset, we apply min-max scaling to the following continuous columns:
- msrp_new
- last_known_retail_price_new
- sale_duration

Here, *sale_duration* is obtained by calculating the number of days between first_approved and first_ordered date. For items without a first_ordered date, we make the assumption that they were sold on the very last day of business at the listed price.

In [3]:
def process_item_data(dataFrame):
    
    # one-hot encode used_condition column
    uc_le = LabelEncoder()
    uc_ohe = OneHotEncoder()
    dataFrame["used_condition_code"] = uc_le.fit_transform(dataFrame["used_condition"])
    uc_feature_arr = uc_ohe.fit_transform(dataFrame[["used_condition_code"]]).toarray()
    uc_feature_labels = list(uc_le.classes_)
    uc_features_df = pd.DataFrame(uc_feature_arr, columns=uc_feature_labels)    
    
    # one-hot encode department column
    dept_le = LabelEncoder()
    dept_ohe = OneHotEncoder()
    dataFrame["department_code"] = dept_le.fit_transform(dataFrame["department"])
    dept_feature_arr = dept_ohe.fit_transform(dataFrame[["department_code"]]).toarray()
    dept_feature_labels = list(dept_le.classes_)
    dept_features_df = pd.DataFrame(dept_feature_arr, columns=dept_feature_labels)    
    
    # feature hash encode category column
    cat_fh = FeatureHasher(n_features=7, input_type='string')
    cat_hashed_features = cat_fh.fit_transform(dataFrame['category'])
    cat_hashed_features = cat_hashed_features.toarray()
    cat_hashed_features_df = pd.DataFrame(cat_hashed_features)

    # feature hash encode color column
    col_fh = FeatureHasher(n_features=5, input_type='string')
    col_hashed_features = col_fh.fit_transform(dataFrame['color'])
    col_hashed_features = col_hashed_features.toarray()
    col_hashed_features_df = pd.DataFrame(col_hashed_features)    
    
    # feature hash encode size column
    size_fh = FeatureHasher(n_features=8, input_type='string')
    size_hashed_features = size_fh.fit_transform(dataFrame['size'])
    size_hashed_features = size_hashed_features.toarray()
    size_hashed_features_df = pd.DataFrame(size_hashed_features)     
    
    # fill last known retail price with msrp 
    dataFrame.last_known_retail_price_new.fillna(dataFrame.msrp_new, inplace=True)

    dataFrame['first_approved_date'] = pd.to_datetime(df['first_approved_date'], format='%Y-%m-%dT%H:%M:%S')
    dataFrame['first_ordered_date'] = pd.to_datetime(df['first_ordered_date'], format='%Y-%m-%dT%H:%M:%S')
    dataFrame.first_ordered_date.fillna(dataFrame.first_approved_date.max(), inplace=True)
    dataFrame["sale_duration"] = dataFrame['first_ordered_date'] - dataFrame['first_approved_date']
    dataFrame["sale_duration"] = dataFrame['sale_duration'].dt.days
    
    # initialize the column names of the continuous data
    continuous = ["msrp_new", "last_known_retail_price_new", "sale_duration"]    

    # perform min-max scaling each continuous feature column to the range [0, 1]
    cs = MinMaxScaler()
    data_continuous = cs.fit_transform(dataFrame[continuous])
    continuous_features_df = pd.DataFrame(data_continuous)    
    
    processed_df = pd.concat([uc_features_df, dept_features_df, cat_hashed_features_df, col_hashed_features_df, size_hashed_features_df, continuous_features_df, dataFrame['used_list_price']], axis=1)
        
    return processed_df

#### 4. In the cell below, we define the Keras Multilayer Perceptron (MLP) network for performing linear regression 

In [4]:
def create_mlp(dim, regress=False):
    # define our MLP network
    model = Sequential()
    model.add(Dense(8, input_dim=dim, activation="relu"))
    model.add(Dense(4, activation="relu"))

    # check to see if the regression node should be added
    if regress:
        model.add(Dense(1, activation="linear"))

    # return our model
    return model

#### 5. We load the dataset from the csv file, process the data as defined in step 3 and generate train/test spit for training and evaluating our model.

In [5]:
df = load_item_data("data-science/item_data.csv")
df = process_item_data(df)
(train, test) = train_test_split(df, test_size=0.25, random_state=42)

#### 6. We find the largest item price in the training set and use it to scale our item prices to the range [0, 1] (this will lead to better training and convergence)

In [6]:
maxPrice = train["used_list_price"].max()
trainY = train["used_list_price"] / maxPrice
testY = test["used_list_price"] / maxPrice

#### 7. We remove the used_list_price (our Y) to obtain our X dataFrame with only the transformed input features.

In [7]:
trainX = train.drop('used_list_price', 1)
testX = test.drop('used_list_price', 1)

#### 8. We create our MLP and then compile the model using mean absolute percentage error as our loss, implying that we seek to minimize the absolute percentage difference between our **price predictions** and the **actual prices**.

In [8]:
model = create_mlp(trainX.shape[1], regress=True)
opt = Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss="mean_absolute_percentage_error", optimizer=opt)

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


#### 9. We train the model.

In [9]:
print("[INFO] training model...")
model.fit(x=trainX, y=trainY, validation_data=(testX, testY), epochs=200, batch_size=8)

[INFO] training model...
Train on 14983 samples, validate on 4995 samples
Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200

<tensorflow.python.keras.callbacks.History at 0x7fe9bd3179b0>

#### 10. We make predictions on the testing data

In [10]:
preds = model.predict(testX)

#### 11. We compute the difference between the *predicted* item prices and the *actual* item prices, then compute the percentage difference and the absolute percentage difference

In [11]:
diff = preds.flatten() - testY
percentDiff = (diff / testY) * 100
absPercentDiff = np.abs(percentDiff)

#### 12. We compute the mean and standard deviation of the absolute percentage difference

In [12]:
mean = np.mean(absPercentDiff)
std = np.std(absPercentDiff)

#### 13. Finally, we show some statistics on our model

In [13]:
locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
print("[INFO] avg. item price: {}, std item price: {}".format(locale.currency(df["used_list_price"].mean(), grouping=True), locale.currency(df["used_list_price"].std(), grouping=True)))
print("[INFO] mean: {:.2f}%, std: {:.2f}%".format(mean, std))

[INFO] avg. item price: $70.71, std item price: $53.44
[INFO] mean: 14.64%, std: 17.26%
