## Pre-processing and Training Data Development

The purpose of this notebook is to preprocess the data and prepare it for modeling.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


In [2]:
df = pd.read_csv('liquor_dataset_.csv', index_col = 0, dtype={'week': object, 'store_number': object})

In [3]:
df.shape

(477063, 7)

In [4]:
df.head()

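|   | week | store_number | general_alcohol_category | city | county | initial claims | volume_sold_(liters) |
|---|------|--------------|--------------------------|------|--------|----------------|----------------------|
| 0 | 1 | 2500 | amaretto | AMES | STORY | 167.0 | 6.75 |
| 1 | 1 | 2500 | amaretto | AMES | STORY | 248.0 | 3.75 |
| 2 | 1 | 2500 | amaretto | AMES | STORY | 306.0 | 2.25 |
| 3 | 1 | 2500 | brandy | AMES | STORY | 159.0 | 33.73 |
| 4 | 1 | 2500 | brandy | AMES | STORY | 167.0 | 33.8 |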


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 477063 entries, 0 to 477062
Data columns (total 7 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   week                      477063 non-null  object 
 1   store_number              477063 non-null  object 
 2   general_alcohol_category  477063 non-null  object 
 3   city                      477063 non-null  object 
 4   county                    477063 non-null  object 
 5   initial claims            477063 non-null  float64
 6   volume_sold_(liters)      477063 non-null  float64
dtypes: float64(2), object(5)
memory usage: 29.1+ MB


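Only two columns, `initial claims` and `volume_sold_(liters)`, are numeric; the rest are categorical. A correlation matrix therefore says little here: it can report only the single pairwise coefficient between those two columns.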

In [6]:
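# Restrict to the numeric columns so the call does not depend on how the pandas version handles object columns
df[['initial claims', 'volume_sold_(liters)']].corr()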

|   | initial claims | volume_sold_(liters) |
|---|----------------|----------------------|
| initial claims | 1.0 | 0.049004 |
| volume_sold_(liters) | 0.049004 | 1.0 |


## Dummy Variable Creation
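The three categorical text columns (`general_alcohol_category`, `city`, and `county`) are converted to dummy variables so that they can be used as numeric model inputs. `pd.get_dummies` creates one column per distinct category, so the dataframe grows from 7 to 171 columns once all the encoded features are concatenated.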

In [7]:
df = pd.concat([df.drop('general_alcohol_category', axis=1), pd.get_dummies(df['general_alcohol_category'])], axis=1)

In [8]:
df = pd.concat([df.drop('city', axis = 1),pd.get_dummies(df['city'])], axis=1)

In [9]:
df = pd.concat([df.drop('county', axis = 1), pd.get_dummies(df['county'])], axis=1)

In [10]:
df.shape

(477063, 171)
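
The three encoding cells above can also be collapsed into a single call. The following is a minimal sketch on a made-up toy frame (the rows and the names `toy`, `categorical`, and `encoded` are illustrative and not taken from `liquor_dataset_.csv`). It shows that `pd.get_dummies` with `columns=` adds one column per distinct category, and how the resulting width can be predicted in advance.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "general_alcohol_category": ["amaretto", "brandy", "brandy", "gin"],
    "city": ["AMES", "AMES", "DES MOINES", "AMES"],
    "county": ["STORY", "STORY", "POLK", "STORY"],
    "volume_sold_(liters)": [6.75, 33.73, 33.8, 2.25],
})

categorical = ["general_alcohol_category", "city", "county"]

# Encode all three columns in one call. The automatic prefixes
# (e.g. "city_AMES") keep the new column names unique.
encoded = pd.get_dummies(toy, columns=categorical)

# Expected width: untouched columns plus one column per distinct category.
expected_width = (toy.shape[1] - len(categorical)) + sum(
    toy[col].nunique() for col in categorical
)
print(encoded.shape[1], expected_width)
```

One design difference from the cells above: the prefixed names avoid duplicate column labels, which un-prefixed dummies would produce if a city and a county happened to share a name.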

## Train-Test Split

At this stage the dataset is split into a training set and a test set. I use 75% of the rows for training and hold out the remaining 25% for testing.

In [13]:
from sklearn.model_selection import train_test_split

# X keeps every column except the target, which leaves 170 regressors (171 - 1)
X = df.drop(['volume_sold_(liters)'], axis =1)
y = df[['volume_sold_(liters)']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 123)
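
As a quick sanity check on the split, the expected row counts can be worked out by hand. This sketch assumes that scikit-learn rounds the test share up and gives the remaining rows to the training set; the row count comes from the `df.shape` output earlier in the notebook, and the variable names are illustrative.

```python
import math

n_rows = 477063      # rows reported by df.shape
test_size = 0.25     # same value passed to train_test_split

# Assumption: the test count is rounded up and the training set takes the rest.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```

If the split behaves as assumed, `X_train.shape[0]` and `X_test.shape[0]` should match these two numbers.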

## Scaling the Features

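Here I use scikit-learn's `StandardScaler`, which standardizes each feature by subtracting its mean and dividing by its standard deviation, so that each training feature is centered at 0 with a standard deviation of 1. It does not require the features to be normally distributed, although the mean and standard deviation are sensitive to outliers. The scaler is fit on the training set only and then applied to both sets, so no information from the test set leaks into the scaling parameters.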

In [14]:
from sklearn import preprocessing

# Fit the scaler on the training features only, then apply it to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
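
To make the transformation concrete, here is a small NumPy sketch of the same standardization on invented numbers. It illustrates the arithmetic only and is not a replacement for `StandardScaler`; the arrays `train` and `test` are hypothetical.

```python
import numpy as np

# Invented values for illustration: two features, three training rows.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Because the test rows are scaled with the training statistics, their scaled values need not have a mean of 0 or a standard deviation of 1.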

The next step is modeling.