This is the Python notebook where you can do your work. I'd recommend using Google Colab: download this as an .ipynb from GitHub, then upload it to Colab and work on it there. If you haven't done this before or you need help, text me and I can give you more detailed instructions.

So right now, we have a model that can predict the price of a shoe and then make a market on it, which is great! There's one catch though: we don't know if it works for shoes that are not in our dataset. Your objective is to test how our model does when we run it on shoes it has never seen.

Before we even try a shoe from today that's missing a lot of data and has a lot of problems, we will first try removing a shoe from the dataset, and using that as the test shoe. If it fails on that, it's definitely going to fail on a shoe today that's missing even more data.

So, first run the blocks of code below.

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

In [2]:
# Load data (takes like 9 seconds to run don't panic)

data = pd.read_excel('https://s3.amazonaws.com/stockx-sneaker-analysis/wp-content/uploads/2019/02/StockX-Data-Contest-2019.xlsx', sheet_name = 1)
df = data.copy()
df.head()

Unnamed: 0,Order Date,Brand,Sneaker Name,Sale Price,Retail Price,Release Date,Shoe Size,Buyer Region
0,2017-09-01,Yeezy,Adidas-Yeezy-Boost-350-Low-V2-Beluga,1097.0,220,2016-09-24,11.0,California
1,2017-09-01,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Copper,685.0,220,2016-11-23,11.0,California
2,2017-09-01,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Green,690.0,220,2016-11-23,11.0,California
3,2017-09-01,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Red,1075.0,220,2016-11-23,11.5,Kentucky
4,2017-09-01,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Red-2017,828.0,220,2017-02-11,11.0,Rhode Island


In [3]:
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Release Date'] = pd.to_datetime(df['Release Date'])

df['Order Date'] = df['Order Date'].apply(lambda x: x.toordinal())
df['Release Date'] = df['Release Date'].apply(lambda x: x.toordinal())

df

Unnamed: 0,Order Date,Brand,Sneaker Name,Sale Price,Retail Price,Release Date,Shoe Size,Buyer Region
0,736573,Yeezy,Adidas-Yeezy-Boost-350-Low-V2-Beluga,1097.0,220,736231,11.0,California
1,736573,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Copper,685.0,220,736291,11.0,California
2,736573,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Green,690.0,220,736291,11.0,California
3,736573,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Red,1075.0,220,736291,11.5,Kentucky
4,736573,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Red-2017,828.0,220,736371,11.0,Rhode Island
...,...,...,...,...,...,...,...,...
99951,737103,Yeezy,adidas-Yeezy-Boost-350-V2-Static-Reflective,565.0,220,737054,8.0,Oregon
99952,737103,Yeezy,adidas-Yeezy-Boost-350-V2-Static-Reflective,598.0,220,737054,8.5,California
99953,737103,Yeezy,adidas-Yeezy-Boost-350-V2-Static-Reflective,605.0,220,737054,5.5,New York
99954,737103,Yeezy,adidas-Yeezy-Boost-350-V2-Static-Reflective,650.0,220,737054,11.0,California
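The integers now sitting in the date columns are day ordinals (day 1 is 0001-01-01), which gives the models a plain number to work with. As a quick sanity check, here is a minimal sketch using only the standard library and the dates from the first row of the table; the variable names are just for illustration.

```python
from datetime import date

# Order and release dates from the first row of the table above
order_date = date(2017, 9, 1)
release_date = date(2016, 9, 24)

# toordinal() counts days, with 0001-01-01 as day 1
order_ordinal = order_date.toordinal()
release_ordinal = release_date.toordinal()

# fromordinal() reverses the conversion, so no information is lost
recovered = date.fromordinal(order_ordinal)

# Subtracting two ordinals gives the number of days between the dates
days_since_release = order_ordinal - release_ordinal

print(order_ordinal, release_ordinal, recovered, days_since_release)
```

Because ordinals are day counts, the gap between the two columns is the age of the shoe in days at the time of sale.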


This is the crucial part, where we split the data into our training and test sets.

I want you to pick a shoe in the dataset; it can be any shoe. Then you need to
1) define train_set - it should be all rows in the dataset where the name doesn't match the name of the shoe you chose
and 2) define test_set - it should be all rows in the dataset where the name does match.

Once you add those two things, that block of code and the one below it should run. Report the r^2 you get. This is very important: a high r^2 means our model can still work great for a shoe it's never seen before. A low r^2 means our model only works for the shoes in the dataset, so it's useless for anything else.
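If the split is hard to picture, here is a minimal sketch of holding out one shoe on a tiny made-up frame. The shoe names and prices are invented for illustration and are not from the real dataset. The training set keeps every row whose name differs from the chosen shoe, and the test set keeps only the matching rows.

```python
import pandas as pd

# Toy data: names and prices are made up for illustration only
toy = pd.DataFrame({
    'Sneaker Name': ['shoe-a', 'shoe-a', 'shoe-b', 'shoe-c', 'shoe-b'],
    'Sale Price': [100.0, 110.0, 300.0, 250.0, 320.0],
})

held_out = 'shoe-b'  # the shoe we pretend the model has never seen

# Boolean mask: True for rows that belong to the held-out shoe
is_held_out = toy['Sneaker Name'] == held_out

toy_train = toy[~is_held_out]  # every other shoe
toy_test = toy[is_held_out]    # only the held-out shoe

# The two sets should share no shoe names
overlap = set(toy_train['Sneaker Name']) & set(toy_test['Sneaker Name'])

print(len(toy_train), len(toy_test), overlap)
```

The empty overlap is the whole point: if any rows of the held-out shoe leak into training, the test no longer tells us anything about unseen shoes.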



In [4]:
# Selecting Nike-Zoom-Fly-Mercurial-Off-White-Total-Orange as "New Shoe"
new_shoe = 'Nike-Zoom-Fly-Mercurial-Off-White-Total-Orange'

# One-hot encode the categorical columns once, before splitting
dummies = pd.get_dummies(df, columns=['Brand', 'Buyer Region'])

# Train on every other shoe, test only on the held-out shoe
train_set = dummies[dummies['Sneaker Name'] != new_shoe]
test_set = dummies[dummies['Sneaker Name'] == new_shoe]

# Features are everything except the target and the shoe name
y_train = train_set['Sale Price']
X_train = pd.get_dummies(train_set.drop(['Sale Price', 'Sneaker Name'], axis=1))
X_train_no_dummies = train_set.drop(['Sale Price', 'Sneaker Name'], axis=1)

y_test = test_set['Sale Price']
X_test = pd.get_dummies(test_set.drop(['Sale Price', 'Sneaker Name'], axis=1))
X_test_no_dummies = test_set.drop(['Sale Price', 'Sneaker Name'], axis=1)

# Make sure X_test has exactly the same columns as X_train
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

In [5]:
# models we'll use
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'ElasticNet': ElasticNet(),
    'Random Forest': RandomForestRegressor(n_estimators=100),
    'XGBoost': XGBRegressor(n_estimators=100),
    'Decision Tree': DecisionTreeRegressor()
}

# for each model, we want to train it, predict, then grade it
results = {}
for name, model in models.items():
    # the linear models use the *_no_dummies frames; everything else uses X_train / X_test
    if name in ('Linear Regression', 'Ridge Regression', 'Lasso Regression'):
        model.fit(X_train_no_dummies, y_train)
        predictions = model.predict(X_test_no_dummies)
    else:
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    results[name] = {'MSE': mse, 'R^2': r2}

# results: one row per model, one column per metric
results_df = pd.DataFrame(results).T

print(results_df)

                             MSE        R^2
Linear Regression  160338.225184 -89.436151
Ridge Regression   160329.161016 -89.431039
Lasso Regression   155595.858558 -86.761297
ElasticNet          59220.409354 -32.402302
Random Forest        4875.456613  -1.749921
XGBoost              5563.153403  -2.137806
Decision Tree        5115.089050  -1.885082
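
A note on reading the table above: r^2 is 1 minus (squared error of the model) / (squared error of always guessing the average test price). So 1 is perfect, 0 is no better than guessing the average, and a negative number means the model did worse than guessing the average. Below is a minimal sketch of that formula in plain Python; the helper `r_squared` and the toy numbers are made up for illustration, and it is meant to mirror the definition behind `r2_score`, not replace it.

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot uses the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy prices, made up for illustration
y_true = [100.0, 110.0, 120.0]

close_preds = [101.0, 109.0, 121.0]   # off by 1 each time
mean_preds = [110.0, 110.0, 110.0]    # always guess the average
far_preds = [200.0, 210.0, 220.0]     # off by 100 each time

r2_close = r_squared(y_true, close_preds)
r2_mean = r_squared(y_true, mean_preds)
r2_far = r_squared(y_true, far_preds)

print(r2_close, r2_mean, r2_far)
```

Note that the last set of predictions gets the ordering of the prices right and still scores far below zero, because every guess is shifted by 100 while the true prices only vary by about 10.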
