# **Rossmann Store Sales Prediction**

Rossmann operates over 3,000 drug stores in 7 European countries. Here we predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation.

![alt text](https://storage.googleapis.com/kaggle-competitions/kaggle/4594/media/rossmann_banner2.png)

# **Data Exploration and Engineering**

First, we mount Google Drive and load the data into the Google Colab workspace.

In [1]:
# import necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
data_path = "./Hari-Assignment/data"

store = pd.read_csv(data_path+"/store.csv",sep=',',dtype= {'StoreType':str,
                                                          'Assortment':str,
                                                          'PromoInterval':str})

train = pd.read_csv(data_path+"/train.csv",sep= ',', parse_dates=['Date'], dtype= {'StateHoliday': str, 'SchoolHoliday':str} )
test =  pd.read_csv(data_path+"/test.csv",sep= ',', parse_dates=['Date'], dtype= {'StateHoliday': str, 'SchoolHoliday':str} )

**Cleaning Train dataset**

In [None]:
train['Year'] = pd.DatetimeIndex(train['Date']).year
train['Month'] = pd.DatetimeIndex(train['Date']).month



In [None]:
def factor_to_integer(df, colname, start_value=0):
    """Replace the levels of an object column with integer codes, in place.

    Codes are assigned in order of first appearance, starting at start_value.
    """
    if df[colname].dtype == object:
        myval = start_value # factor starts at "start_value".
        for sval in df[colname].unique():
            df.loc[df[colname] == sval, colname] = myval
            myval += 1
        df[colname] = df[colname].astype(int)
    print('levels :', df[colname].unique(), '; data type :', df[colname].dtype)

In [None]:
factor_to_integer(train, 'SchoolHoliday')
factor_to_integer(train, 'StateHoliday')
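
The helper above numbers the levels in order of first appearance, so the same label can receive different codes in two datasets if it first appears at a different position. As a minimal sketch of the same idea, the block below uses `pd.factorize` (a swapped-in pandas function, not the notebook's helper) on a small hypothetical frame; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy column with three string levels.
toy = pd.DataFrame({"StateHoliday": ["0", "a", "0", "b", "a"]})

# factorize returns integer codes (in order of first appearance)
# together with the list of levels the codes refer to.
codes, levels = pd.factorize(toy["StateHoliday"])
toy["StateHoliday_code"] = codes

print(toy)
print("levels:", list(levels))
```

Keeping `levels` around makes it possible to check which original label each code stands for.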

Check the number of NaNs in each column.

In [None]:
print("NANs for individual columns")
print("---------------------------")
from collections import Counter
x = {colname : train[colname].isnull().sum() for colname in train.columns}
Counter(x).most_common()
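
The same per-column count can also be obtained directly from pandas. The sketch below runs on a small hypothetical frame (the column values are made up) and is one possible alternative to the `Counter` approach, not a replacement the notebook requires.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame({
    "Open": [1.0, np.nan, 0.0, np.nan],
    "Promo": [1, 0, 1, 1],
    "Sales": [5.0, 3.0, np.nan, 2.0],
})

# isnull() marks missing cells; sum() counts them per column.
nan_counts = toy.isnull().sum().sort_values(ascending=False)
print(nan_counts)
```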

**Cleaning Test dataset**

In [None]:
test['Year'] = pd.DatetimeIndex(test['Date']).year
test['Month'] = pd.DatetimeIndex(test['Date']).month

In [None]:
print("NANs for individual columns")
print("---------------------------")
from collections import Counter
x = {colname : test[colname].isnull().sum() for colname in test.columns}
Counter(x).most_common()

There are 11 missing values in the Open column. Let’s have a detailed look at those:

In [None]:
test.loc[np.isnan(test['Open'])]

Do we have any information about store 622? Check the train dataset.

In [None]:
train.loc[train['Store'] == 622].head()

The train dataset records store 622 as open (1), so we replace the NaN values in the Open column of the test dataset with 1.

In [None]:
test.loc[np.isnan(test['Open']),'Open']=1

Checking for missing values in test dataset

In [None]:
print("NANs for individual columns")
print("---------------------------")
from collections import Counter
x = {colname : test[colname].isnull().sum() for colname in test.columns}
Counter(x).most_common()

In [None]:
factor_to_integer(test, 'StateHoliday')
factor_to_integer(test, 'SchoolHoliday')

Because only the StateHoliday values 0 and 1 exist in the test dataset, we drop the rows of the train dataset whose StateHoliday value is anything other than 0 or 1.


In [None]:
train.loc[train['StateHoliday'] > 1].shape

In [None]:
train = train.loc[train['StateHoliday']<2]

**Cleaning Store Dataset**

In [None]:
print("NANs for individual columns")
print("---------------------------")
from collections import Counter
x = {colname : store[colname].isnull().sum() for colname in store.columns}
Counter(x).most_common()

In [None]:
store['PromoInterval'].unique()

If there is no promotion, then the corresponding columns should have zero values.

In [None]:
store.loc[store['Promo2'] == 0, ['Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval']] = 0

In [None]:
store.loc[store['Promo2'] != 0].head()

In [None]:
store.loc[store['Promo2'] != 0, 'Promo2SinceWeek'] = store['Promo2SinceWeek'].max() - store.loc[store['Promo2'] != 0, 'Promo2SinceWeek']
store.loc[store['Promo2'] != 0, 'Promo2SinceYear'] = store['Promo2SinceYear'].max() - store.loc[store['Promo2'] != 0, 'Promo2SinceYear']

In [None]:
store.dtypes

In [None]:
factor_to_integer(store, 'PromoInterval', start_value=0)

In [None]:
factor_to_integer(store, 'StoreType')
factor_to_integer(store, 'Assortment')

Are there still missing values?

In [None]:
print("NANs for individual columns")
print("---------------------------")
from collections import Counter
x = {colname : store[colname].isnull().sum() for colname in store.columns}
Counter(x).most_common()

Fill the remaining missing values with scikit-learn’s SimpleImputer, using the median of each column.

In [None]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer( strategy='median').fit(store)
store_imputed = imp.transform(store)
store2 = pd.DataFrame(store_imputed, columns=store.columns.values)
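
To make the effect of median imputation concrete, here is a minimal sketch on a hypothetical two-column frame. The values are invented; only `SimpleImputer(strategy='median')` is taken from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: one NaN per column.
toy = pd.DataFrame({
    "CompetitionDistance": [100.0, np.nan, 300.0, 500.0],
    "Promo2": [0.0, 1.0, 1.0, np.nan],
})

# Each NaN is replaced by the median of the observed values in its column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Note that `fit_transform` returns a plain NumPy array, which is why the column names have to be restored when rebuilding the DataFrame.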

In [None]:
print("NANs for individual columns")
print("---------------------------")
from collections import Counter
x = {colname : store2[colname].isnull().sum() for colname in store2.columns}
Counter(x).most_common()

In [None]:
store2['CompetitionOpenSinceMonth'] = store2['CompetitionOpenSinceMonth'].max() - store2['CompetitionOpenSinceMonth']
store2['CompetitionOpenSinceYear'] = store2['CompetitionOpenSinceYear'].max() - store2['CompetitionOpenSinceYear']


In [None]:
store2.tail()

In [None]:
train_store = pd.merge(train, store2, how = 'left', on='Store')
test_store = test.reset_index().merge(store2, how = 'left', on='Store').set_index('Id')
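
A left join keeps every row of the sales table and attaches the matching store attributes. The sketch below illustrates this on two small hypothetical frames; the data is made up, and store 3 is deliberately missing from the store table to show what happens to unmatched rows.

```python
import pandas as pd

# Hypothetical daily sales rows (left table) and store attributes (right table).
sales = pd.DataFrame({"Store": [1, 2, 1, 3], "Sales": [10, 20, 30, 40]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": [0, 1]})

# how='left' keeps all rows of `sales`; unmatched stores get NaN attributes.
merged = pd.merge(sales, stores, how="left", on="Store")
print(merged)
```

Checking the merged frame for new NaNs is a quick way to spot stores that have no entry in the store table.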

**Visual Exploration**

In [None]:
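import seaborn as sns
# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot.
sns.histplot(train_store['Sales'], kde=True)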

In [None]:
train_store.boxplot(column='Sales', by='Year')
plt.show()

In [None]:
train_store.boxplot(column='Sales', by='Month')
plt.show()

In [None]:
train_store.hist(column='Sales', by='Year', bins=30)
plt.show()

In [None]:
train_store.hist(column='Sales', by='Month', bins=30)
plt.show()

# Modeling

In [None]:
print(train_store.columns.values)
print(test_store.columns.values)

In [None]:
train_model = train_store.drop(['Customers', 'Date'], axis=1)
train_model['Year'] = train_model['Year'].max() - train_model['Year']
#print(train_model.head())

## Linear Modeling

**Is the relationship significant?**

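Correlation is any of a broad class of statistical relationships involving dependence. Here we compute the pairwise Pearson correlation between the columns of train_model to see which features move together with Sales.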

In [None]:
import matplotlib.pyplot as plt

corr = train_model.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
# Take ticks and labels from the correlation matrix itself so they always match.
ticks = np.arange(len(corr.columns))
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticklabels(corr.columns)
plt.show()

In [None]:
print(corr["Sales"].sort_values(ascending=False))

**AIC & BIC**

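Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC) measure the goodness of fit of an estimated statistical model while penalizing its complexity, and can be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model. For a least-squares fit with Gaussian errors, L can be written in terms of the sum of squared errors (sse), which gives the forms below up to an additive constant. Lower values indicate a better model.

k = number of estimated parameters (here, the number of features)

n = number of observations

AIC = n*ln(sse/n) + 2k

BIC = n*ln(sse/n) + k*ln(n)


In [None]:

def calAIC(y,y_hat,k):
  # Least-squares form of AIC (up to an additive constant).
  n = len(y)
  resid = y - y_hat
  sse = np.sum(resid**2)
  AIC = n*np.log(sse/n) + 2*k
  return AIC
  
def calBIC(y,y_hat,k):
  # Least-squares form of BIC (up to an additive constant).
  n = len(y)
  resid = y - y_hat
  sse = np.sum(resid**2)
  BIC = n*np.log(sse/n) + k*np.log(n)
  return BIC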
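
For reference, the link between the likelihood-based definitions and the sum of squared errors can be sketched as follows. This assumes a linear model with independent Gaussian errors of variance σ², which is an assumption of the sketch rather than something the notebook verifies; SSE denotes the sum of squared residuals and "const" collects terms that depend only on n.

```latex
\begin{align*}
\ln L &= -\frac{n}{2}\ln\!\left(2\pi\sigma^2\right) - \frac{\mathrm{SSE}}{2\sigma^2},
\qquad \hat{\sigma}^2 = \frac{\mathrm{SSE}}{n} \\
-2\ln\hat{L} &= n\ln\!\left(\frac{\mathrm{SSE}}{n}\right) + n\,(1 + \ln 2\pi) \\
\mathrm{AIC} &= 2k - 2\ln\hat{L} = n\ln\!\left(\frac{\mathrm{SSE}}{n}\right) + 2k + \text{const} \\
\mathrm{BIC} &= k\ln n - 2\ln\hat{L} = n\ln\!\left(\frac{\mathrm{SSE}}{n}\right) + k\ln n + \text{const}
\end{align*}
```

Because the constant is the same for every model fitted to the same n observations, it can be dropped when comparing models.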

## **Building Linear Models**

In [None]:
from sklearn.model_selection import train_test_split
#Creating the features 

features = train_model.drop('Sales', axis=1)
target = train_model['Sales']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### **Multi Linear Model with scaling**

In [None]:
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score


#Setting up the scaling pipeline 

pipeline_order = [('scaler', StandardScaler()), ('linear_reg', linear_model.LinearRegression())]

Model_Pipeline = Pipeline(pipeline_order)

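# evaluate pipeline with 3-fold cross-validation on the training split
# (random_state only takes effect when shuffle=True; recent scikit-learn raises an error otherwise)
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
results = cross_val_score(Model_Pipeline, X_train, y_train, cv=kfold,scoring= 'r2')
Model_Pipeline.fit(X_train, y_train)
preds_train = Model_Pipeline.predict(X_train)
preds = Model_Pipeline.predict(X_test)
print("CV R^2 on train (mean, std):",round(results.mean(),3), round(results.std(),3))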
print("Train AIC, BIC :",round(calAIC(y_train,preds_train,len(X_train.columns)),3),",", round(calBIC(y_train,preds_train,len(X_train.columns))))
print("-----------------------------------------------")
print("Test R^2:",round(r2_score(y_test, preds),3))
print("Test AIC, BIC :",round(calAIC(y_test,preds,len(X_test.columns)),3),",", round(calBIC(y_test,preds,len(X_test.columns))))
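
To show what the scaled pipeline and the cross-validation call do in isolation, the sketch below repeats the same steps on synthetic data. The data, coefficients, and noise level are hypothetical and chosen only so the example runs without the Rossmann files.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 200 rows, 3 features, known linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The scaler is refit inside every fold, so validation rows never influence it.
pipe = Pipeline([("scaler", StandardScaler()), ("linear_reg", LinearRegression())])
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
scores = cross_val_score(pipe, X, y, cv=kfold, scoring="r2")
print("CV R^2 per fold:", scores.round(3))
```

Wrapping the scaler in the pipeline, rather than scaling the whole training set up front, is what keeps the cross-validation scores free of leakage from the held-out fold.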



### **Multi Linear model without scaling**

In [None]:
from sklearn import linear_model
#Initializing a linear regression model 

linear_reg = linear_model.LinearRegression()

#Fitting the model on the data

model_wos = linear_reg.fit(X_train, y_train)

#Accuracy of the model
preds = model_wos.predict(X_test)
print("Test R^2:",round(r2_score(y_test, preds),3))
print("Test AIC, BIC :",round(calAIC(y_test,preds,len(X_test.columns)),3),",", round(calBIC(y_test,preds,len(X_test.columns))))

In [None]:
plt.scatter(y_test,linear_reg.predict(X_test))

### **Cross validation with K-Fold**

### **Linear Model using SGD**

In [None]:
pipeline_order_sgd = [('scaler', StandardScaler()), ('linear_reg', linear_model.SGDRegressor())]

Model_Pipeline_sgd = Pipeline(pipeline_order_sgd)

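# evaluate pipeline with 3-fold cross-validation on the training split
# (random_state only takes effect when shuffle=True; recent scikit-learn raises an error otherwise)
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
results = cross_val_score(Model_Pipeline_sgd, X_train, y_train, cv=kfold,scoring= 'r2')
Model_Pipeline_sgd.fit(X_train, y_train)
preds_train = Model_Pipeline_sgd.predict(X_train)
preds = Model_Pipeline_sgd.predict(X_test)
print("CV R^2 on train (mean, std):",round(results.mean(),3), round(results.std(),3))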
print("Train AIC, BIC :",round(calAIC(y_train,preds_train,len(X_train.columns)),3),",", round(calBIC(y_train,preds_train,len(X_train.columns))))
print("-----------------------------------------------")
print("Test R^2:",round(r2_score(y_test, preds),3))
print("Test AIC, BIC :",round(calAIC(y_test,preds,len(X_test.columns)),3),",", round(calBIC(y_test,preds,len(X_test.columns))))


## **Part C: Multicollinearity and stepwise regression**

## **Part E: Regularization**

In [None]:
train_model.head(5)

In [None]:
train_model['MeanSalesStore'] = train_model.groupby('Store')['Sales'].transform('mean')

In [None]:
# .copy() avoids a SettingWithCopyWarning when new columns are added to the filtered frame.
train_model = train_model[train_model['Sales']>0].copy()

In [None]:
train_model['CrossedMeanSales'] = np.where(train_model['Sales'] >= train_model['MeanSalesStore'], 1,0 )
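
The construction of the binary target can be followed on a small hypothetical frame. The sketch below repeats the two steps (per-store mean via `groupby(...).transform('mean')`, then a flag via `np.where`) on made-up numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two stores, two days each.
toy = pd.DataFrame({"Store": [1, 1, 2, 2], "Sales": [10, 30, 5, 5]})

# transform('mean') broadcasts each store's mean back onto its own rows.
toy["MeanSalesStore"] = toy.groupby("Store")["Sales"].transform("mean")

# Flag rows whose sales reach or exceed the mean of their store.
toy["CrossedMeanSales"] = np.where(toy["Sales"] >= toy["MeanSalesStore"], 1, 0)
print(toy)
```

Because the comparison uses `>=`, a store whose sales are constant is flagged 1 on every day.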

In [None]:
train_model.to_csv('train_model_classification.csv')