# StockX-Sneaker-Data-Contest

## Context:

Currently the dataset consists of a single file of sales provided by StockX: ~100,000 shoe sales (99,956 rows) across 50 different models (Nike x Off-White and Yeezy).

In the coming weeks more data will be added, including the estimated number of pairs released for each model and other information that might be useful for making predictions. Additionally, some of the data types will be modified to make numerical analysis easier.

## Tasks:
- What shoes are most popular?
- Which shoes have the best/worst profit margins?
- What factors affect profit margin?
- Is it possible to predict the sale price of a shoe at a given time? (i.e. when should I sell?)

In [1]:
#data analysis and wrangling
import pandas as pd
import numpy as np

#visualization
import matplotlib.pyplot as plt
import seaborn as sns

## Acquire data

Thanks to the pandas library, we load our dataset into memory as a table called a DataFrame. We then make a copy of this dataset for our various processing steps. This arrangement simplifies processing a large amount of data.

In [2]:
original_df = pd.read_csv(r".\StockX-Data-Contest-2019-3.csv")
df = original_df.copy()

## Preliminary analysis 

In this section we explore our dataset in search of answers and propose hypotheses. 

#### Which features are available in the dataset?

First, we can look at the different types of data that make up our dataset. This first step is important because it gives us a global overview. The command ```df.columns.values``` gives us the names of the columns of our data frame.

In [3]:
print(df.columns.values)

['Order Date' 'Brand' 'Sneaker Name' 'Sale Price' 'Retail Price'
 'Release Date' 'Shoe Size' 'Buyer Region']


We can differentiate two main types of data: categorical and numerical. 

**Numerical** data is quantitative data whose values have a sense of size or magnitude. It splits into two subtypes: continuous (Sale Price, Retail Price) and discrete (Shoe Size).

**Categorical** data holds the values of a qualitative variable, often a number, word, or symbol. Each value indicates which of the available choices the observation belongs to. Here it splits into two subtypes: categorical (Brand, Sneaker Name, Buyer Region) and interval (Order Date, Release Date).
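The split between numerical and categorical columns can also be checked programmatically. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the dataset). Note that in the raw file the price columns are stored as strings, so they would only count as numerical after conversion.

```python
import pandas as pd

# Toy frame mirroring a few of the dataset's columns (illustrative values).
sample = pd.DataFrame({
    "Brand": ["Yeezy", "Off-White"],
    "Sale Price": [1097.0, 685.0],
    "Shoe Size": [11.0, 11.5],
})

# Columns with a numeric dtype versus everything else.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```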

In [4]:
df.head()

Unnamed: 0,Order Date,Brand,Sneaker Name,Sale Price,Retail Price,Release Date,Shoe Size,Buyer Region
0,9/1/17,Yeezy,Adidas-Yeezy-Boost-350-Low-V2-Beluga,"$1,097",$220,9/24/16,11.0,California
1,9/1/17,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Copper,$685,$220,11/23/16,11.0,California
2,9/1/17,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Green,$690,$220,11/23/16,11.0,California
3,9/1/17,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Red,"$1,075",$220,11/23/16,11.5,Kentucky
4,9/1/17,Yeezy,Adidas-Yeezy-Boost-350-V2-Core-Black-Red-2017,$828,$220,2/11/17,11.0,Rhode Island


#### Do we have null or empty values?

We need to make sure that there are no missing values. To do this we can use the command ```df.isnull().sum()```, which returns the number of null values per column. The output shows that no values are missing.

In [5]:
df.isnull().sum()

Order Date      0
Brand           0
Sneaker Name    0
Sale Price      0
Retail Price    0
Release Date    0
Shoe Size       0
Buyer Region    0
dtype: int64

#### What types of data do we have?

To get an idea of the data types we have, we can use the ```df.info()``` command. It shows that some columns must be converted before machine learning algorithms can process them: Sale Price and Retail Price must be converted to float, and Order Date and Release Date to datetime.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99956 entries, 0 to 99955
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Order Date    99956 non-null  object 
 1   Brand         99956 non-null  object 
 2   Sneaker Name  99956 non-null  object 
 3   Sale Price    99956 non-null  object 
 4   Retail Price  99956 non-null  object 
 5   Release Date  99956 non-null  object 
 6   Shoe Size     99956 non-null  float64
 7   Buyer Region  99956 non-null  object 
dtypes: float64(1), object(7)
memory usage: 6.1+ MB
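As a rough illustration of the conversions described above, here is a sketch on two made-up rows. It assumes prices look like `$1,097` and dates are month/day/two-digit-year, as in the rows displayed by ```df.head()```; other formats would need a different pattern.

```python
import pandas as pd

# Two illustrative rows in the same string format as the raw file.
sample = pd.DataFrame({
    "Sale Price": ["$1,097", "$685"],
    "Order Date": ["9/1/17", "9/1/17"],
})

# Strip the dollar sign and thousands separator, then cast to float.
sample["Sale Price"] = (
    sample["Sale Price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Assumption: month/day/two-digit-year, e.g. "9/1/17" is 1 September 2017.
sample["Order Date"] = pd.to_datetime(sample["Order Date"], format="%m/%d/%y")

print(sample.dtypes)
```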


#### Statistical description of our numerical values 

In [7]:
df['Sale Price'] = df['Sale Price'].map(lambda x: x[1:].replace(',', '')).astype(float)
df['Retail Price'] = df['Retail Price'].map(lambda x: x[1:].replace(',', '')).astype(float)

In [8]:
df['Profit'] = df['Sale Price'] -  df['Retail Price']
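`Profit` here is an absolute dollar difference. To compare shoes with different retail prices, a relative figure may be more useful. The sketch below computes profit as a fraction of retail price on made-up rows; the column name `Profit Ratio` is hypothetical, and the calculation ignores fees and shipping, which the dataset does not contain.

```python
import pandas as pd

# Illustrative prices, already converted to float.
sample = pd.DataFrame({
    "Sale Price": [440.0, 330.0],
    "Retail Price": [220.0, 220.0],
})

sample["Profit"] = sample["Sale Price"] - sample["Retail Price"]
# Hypothetical column: profit relative to what the shoe cost at retail.
sample["Profit Ratio"] = sample["Profit"] / sample["Retail Price"]

print(sample)
```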

In [9]:
df.describe()

Unnamed: 0,Sale Price,Retail Price,Shoe Size,Profit
count,99956.0,99956.0,99956.0,99956.0
mean,446.634719,208.61359,9.344181,238.021129
std,255.982969,25.20001,2.329588,266.133179
min,186.0,130.0,3.5,-34.0
25%,275.0,220.0,8.0,58.0
50%,370.0,220.0,9.5,154.0
75%,540.0,220.0,11.0,342.0
max,4050.0,250.0,17.0,3860.0


#### Statistical description of our categorical values 

In [10]:
df.describe(include=['O'])

Unnamed: 0,Order Date,Brand,Sneaker Name,Release Date,Buyer Region
count,99956,99956,99956,99956,99956
unique,531,2,50,35,51
top,11/16/18,Yeezy,adidas-Yeezy-Boost-350-V2-Butter,6/30/18,California
freq,1388,72162,11423,11423,19349


## Clean data

### Checking the missing values

In [11]:
df.isnull().sum()

Order Date      0
Brand           0
Sneaker Name    0
Sale Price      0
Retail Price    0
Release Date    0
Shoe Size       0
Buyer Region    0
Profit          0
dtype: int64

## Convert data type

In [12]:
df.dtypes

Order Date       object
Brand            object
Sneaker Name     object
Sale Price      float64
Retail Price    float64
Release Date     object
Shoe Size       float64
Buyer Region     object
Profit          float64
dtype: object

In [13]:
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Release Date'] = pd.to_datetime(df['Release Date'])

In [14]:
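# Sale Price and Retail Price were already converted to float in cell 7.
# Re-running the string conversion on floats raises
# "TypeError: 'float' object is not subscriptable", so only convert
# columns that still hold strings.
for col in ['Sale Price', 'Retail Price']:
    if df[col].dtype == object:
        df[col] = df[col].map(lambda x: x[1:].replace(',', '')).astype(float)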

In [None]:
# from sklearn.preprocessing import OrdinalEncoder

# ordinal_encoder = OrdinalEncoder()
# df["Buyer Region"] = ordinal_encoder.fit_transform(df[["Buyer Region"]])
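The commented-out cell above would encode `Buyer Region` as numbers with scikit-learn. As an alternative sketch using pandas only (not the encoder named above), category codes give a similar integer encoding. Bear in mind that such codes impose an arbitrary order on the regions, so one-hot encoding may suit some models better.

```python
import pandas as pd

# Illustrative regions taken from the rows shown by df.head().
sample = pd.DataFrame({
    "Buyer Region": ["California", "Kentucky", "Rhode Island", "California"],
})

# Categories are sorted alphabetically, then numbered from 0.
sample["Buyer Region Code"] = sample["Buyer Region"].astype("category").cat.codes

print(sample)
```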

In [None]:
df.head()

In [None]:
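# numeric_only=True restricts the correlation to numeric columns;
# recent pandas versions raise an error on string columns otherwise.
corr = df.corr(numeric_only=True)
sns.heatmap(corr,
        xticklabels=corr.columns,
        yticklabels=corr.columns)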


In [None]:
df.describe()

In [None]:
df.describe(include=['O'])

In [None]:
df[['Sale Price', 'Shoe Size']].groupby(['Shoe Size']).mean().sort_values(by='Sale Price', ascending = False)

In [None]:
df_Region_count = df[['Sale Price', 'Buyer Region']].groupby(['Buyer Region']).count().sort_values(by='Sale Price', ascending = False)
df_Region_mean = df[['Sale Price', 'Buyer Region']].groupby(['Buyer Region']).mean().sort_values(by='Sale Price', ascending = False)
pd.concat([df_Region_count, df_Region_mean], axis=1, join="inner")
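The count and mean per region can also be produced in a single pass with named aggregation, which avoids ending up with two columns that are both called `Sale Price`. This is a sketch on made-up rows; `sales_count` and `mean_sale_price` are hypothetical column names.

```python
import pandas as pd

# Illustrative sales, already converted to float.
sample = pd.DataFrame({
    "Buyer Region": ["California", "California", "Kentucky"],
    "Sale Price": [1000.0, 600.0, 500.0],
})

# One groupby producing both statistics with explicit column names.
region_summary = (
    sample.groupby("Buyer Region")["Sale Price"]
    .agg(sales_count="count", mean_sale_price="mean")
    .sort_values("sales_count", ascending=False)
)

print(region_summary)
```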

In [None]:
sns.displot(data=df, x="Shoe Size", kde=True)
# g = sns.FacetGrid(df, col ='Shoe Size')
# g.map(plt.hist, 'Sale Price', bins=20)
#df[['Sale Price', 'Shoe Size']].groupby(['Shoe Size']).count().sort_values(by='Sale Price', ascending = True)
#df[['Shoe Size', '']].groupby(['Shoe Size']).count()

## Exploratory Data Analysis 

In [None]:
df['Profit'] = df["Sale Price"] - df["Retail Price"]
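With `Profit` available, one possible next step toward the popularity and profit questions listed in the tasks is to summarise sales per model. The sketch below uses made-up rows and hypothetical model names; on the real data the grouping column would be `Sneaker Name` as well.

```python
import pandas as pd

# Illustrative rows; "Model-A" and "Model-B" are placeholder names.
sample = pd.DataFrame({
    "Sneaker Name": ["Model-A", "Model-A", "Model-B"],
    "Sale Price": [440.0, 330.0, 250.0],
    "Retail Price": [220.0, 220.0, 250.0],
})
sample["Profit"] = sample["Sale Price"] - sample["Retail Price"]

# Number of sales (a proxy for popularity) and mean profit per model.
by_model = (
    sample.groupby("Sneaker Name")["Profit"]
    .agg(sales="count", mean_profit="mean")
    .sort_values("mean_profit", ascending=False)
)

print(by_model)
```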