
# **Rossmann Store Sales Prediction**

Rossmann operates over 3,000 drug stores in 7 European countries. Here we predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation.

![alt text](https://storage.googleapis.com/kaggle-competitions/kaggle/4594/media/rossmann_banner2.png)

# **Data Exploration and Engineering**

First, we mount Google Drive and load the data into the Google Colab workspace.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# import necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns

In [0]:
data_path = "/content/drive/My Drive/Data Science/Rossman-salesforecast/data/"

# data_path already ends with "/", so the file names are appended without a leading slash.
# Categorical columns are read as strings; Date is parsed as a datetime.
store = pd.read_csv(data_path + "store.csv", dtype={'StoreType': str,
                                                    'Assortment': str,
                                                    'PromoInterval': str})

train = pd.read_csv(data_path + "train.csv", parse_dates=['Date'], dtype={'StateHoliday': str, 'SchoolHoliday': str})
test = pd.read_csv(data_path + "test.csv", parse_dates=['Date'], dtype={'StateHoliday': str, 'SchoolHoliday': str})

**Cleaning Train dataset**

In [0]:
train['Year'] = pd.DatetimeIndex(train['Date']).year
train['Month'] = pd.DatetimeIndex(train['Date']).month
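
The same year and month columns can also be derived with the `.dt` accessor, which avoids building a separate `DatetimeIndex`. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in for `train`), so it is an illustration rather than a replacement for the cell above.

```python
import pandas as pd

# Toy frame standing in for train/test; only the Date column matters here.
toy = pd.DataFrame({"Date": pd.to_datetime(["2015-07-31", "2015-09-17", "2014-12-24"])})

# Series.dt exposes the same fields as pd.DatetimeIndex(...).
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month

print(toy)
```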



In [0]:
def factor_to_integer(df, colname, start_value=0):
    """Encode the string levels of df[colname] as integers, in place.

    Codes are assigned in order of first appearance, starting at start_value.
    """
    if df[colname].dtype == object:  # only encode columns that still hold strings
        codes, _ = pd.factorize(df[colname])
        df[colname] = codes + start_value
    print('levels :', df[colname].unique(), '; data type :', df[colname].dtype)

In [13]:
factor_to_integer(train, 'SchoolHoliday')
factor_to_integer(train, 'StateHoliday')

levels : [0 1] ; data type : int64
levels : [0 1 2 3] ; data type : int64
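
Because `factor_to_integer` numbers the levels in order of first appearance, the same level could receive a different code in another frame whose rows are ordered differently. One way to guard against that is an explicit mapping applied to every frame. The sketch below assumes the raw `StateHoliday` levels are `'0'`, `'a'`, `'b'` and `'c'`, as in the Kaggle data description; `STATE_HOLIDAY_CODES` and `encode_with_mapping` are hypothetical names, not part of this notebook.

```python
import pandas as pd

# Assumed raw levels of StateHoliday; adjust if the data contains others.
STATE_HOLIDAY_CODES = {"0": 0, "a": 1, "b": 2, "c": 3}

def encode_with_mapping(df, colname, mapping):
    """Return a copy of df with colname mapped to integers; fail on unknown levels."""
    encoded = df[colname].map(mapping)
    if encoded.isna().any():
        unknown = df.loc[encoded.isna(), colname].unique()
        raise ValueError(f"unmapped levels in {colname}: {list(unknown)}")
    out = df.copy()
    out[colname] = encoded.astype(int)
    return out

# Made-up frames in which the levels first appear in a different order.
toy_train = pd.DataFrame({"StateHoliday": ["0", "a", "0", "b", "c"]})
toy_test = pd.DataFrame({"StateHoliday": ["a", "0"]})

enc_train = encode_with_mapping(toy_train, "StateHoliday", STATE_HOLIDAY_CODES)
enc_test = encode_with_mapping(toy_test, "StateHoliday", STATE_HOLIDAY_CODES)
print(enc_train["StateHoliday"].tolist(), enc_test["StateHoliday"].tolist())
```

With a fixed mapping, `'a'` is encoded as 1 in both frames regardless of row order, and an unexpected level raises an error instead of silently getting a new code.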


Count the NaNs in each column.

In [15]:
print("NANs for individual columns")
print("---------------------------")
from collections import Counter
x = {colname : train[colname].isnull().sum() for colname in train.columns}
Counter(x).most_common()

NANs for individual columns
---------------------------


[('Store', 0),
 ('DayOfWeek', 0),
 ('Date', 0),
 ('Sales', 0),
 ('Customers', 0),
 ('Open', 0),
 ('Promo', 0),
 ('StateHoliday', 0),
 ('SchoolHoliday', 0),
 ('Year', 0),
 ('Month', 0)]
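
The same per-column count can be written without `Counter` by chaining pandas methods. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical), shown only to illustrate the idiom.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values.
toy = pd.DataFrame({
    "Store": [1, 2, 3],
    "Open": [1.0, np.nan, np.nan],
    "Promo": [0, 1, np.nan],
})

# isna() marks missing cells, sum() counts them per column,
# and sort_values() lists the columns with the most NaNs first.
nan_counts = toy.isna().sum().sort_values(ascending=False)
print(nan_counts)
```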

**Cleaning Test dataset**

In [0]:
test['Year'] = pd.DatetimeIndex(test['Date']).year
test['Month'] = pd.DatetimeIndex(test['Date']).month

In [18]:
print("NANs for individual columns")
print("---------------------------")
from collections import Counter
x = {colname : test[colname].isnull().sum() for colname in test.columns}
Counter(x).most_common()

NANs for individual columns
---------------------------


[('Open', 11),
 ('Id', 0),
 ('Store', 0),
 ('DayOfWeek', 0),
 ('Date', 0),
 ('Promo', 0),
 ('StateHoliday', 0),
 ('SchoolHoliday', 0),
 ('Year', 0),
 ('Month', 0)]

There are 11 missing values in the `Open` column. Let's take a closer look at them:

In [20]:
# Select the rows where Open is missing.
test.loc[test['Open'].isna()]

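| | Id | Store | DayOfWeek | Date | Open | Promo | StateHoliday | SchoolHoliday | Year | Month |
|---|---|---|---|---|---|---|---|---|---|---|
| 479 | 480 | 622 | 4 | 2015-09-17 | NaN | 1 | 0 | 0 | 2015 | 9 |
| 1335 | 1336 | 622 | 3 | 2015-09-16 | NaN | 1 | 0 | 0 | 2015 | 9 |
| 2191 | 2192 | 622 | 2 | 2015-09-15 | NaN | 1 | 0 | 0 | 2015 | 9 |
| 3047 | 3048 | 622 | 1 | 2015-09-14 | NaN | 1 | 0 | 0 | 2015 | 9 |
| 4759 | 4760 | 622 | 6 | 2015-09-12 | NaN | 0 | 0 | 0 | 2015 | 9 |
| 5615 | 5616 | 622 | 5 | 2015-09-11 | NaN | 0 | 0 | 0 | 2015 | 9 |
| 6471 | 6472 | 622 | 4 | 2015-09-10 | NaN | 0 | 0 | 0 | 2015 | 9 |
| 7327 | 7328 | 622 | 3 | 2015-09-09 | NaN | 0 | 0 | 0 | 2015 | 9 |
| 8183 | 8184 | 622 | 2 | 2015-09-08 | NaN | 0 | 0 | 0 | 2015 | 9 |
| 9039 | 9040 | 622 | 1 | 2015-09-07 | NaN | 0 | 0 | 0 | 2015 | 9 |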


All of the missing values belong to store 622. Do we have any information about this store? Let's check the train dataset.

In [23]:
# A boolean mask selects the rows of store 622 directly.
train.loc[train['Store'] == 622].head()

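| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday | Year | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 621 | 622 | 5 | 2015-07-31 | 6306 | 540 | 1 | 1 | 0 | 0 | 2015 | 7 |
| 1736 | 622 | 4 | 2015-07-30 | 5412 | 406 | 1 | 1 | 0 | 0 | 2015 | 7 |
| 2851 | 622 | 3 | 2015-07-29 | 5326 | 468 | 1 | 1 | 0 | 0 | 2015 | 7 |
| 3966 | 622 | 2 | 2015-07-28 | 4966 | 417 | 1 | 1 | 0 | 0 | 2015 | 7 |
| 5081 | 622 | 1 | 2015-07-27 | 5413 | 517 | 1 | 1 | 0 | 0 | 2015 | 7 |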
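
`head()` covers only the five most recent rows. A somewhat broader check is the share of open days per weekday for a single store. The sketch below uses made-up numbers (`toy_train` is hypothetical, not the real Rossmann data) purely to show the pattern.

```python
import pandas as pd

# Made-up history: store 622 open on Mondays (1), closed on Sundays (7).
toy_train = pd.DataFrame({
    "Store": [622, 622, 622, 622, 1],
    "DayOfWeek": [1, 1, 7, 7, 1],
    "Open": [1, 1, 0, 0, 1],
})

# Mean of the 0/1 Open flag per weekday = fraction of days the store was open.
open_rate = (
    toy_train[toy_train["Store"] == 622]
    .groupby("DayOfWeek")["Open"]
    .mean()
)
print(open_rate)
```

A weekday with a rate close to 1 would support treating a missing `Open` value on that weekday as open.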


The train dataset records store 622 as open (1) on its most recent days, so we assume the store was also open on the missing dates and replace the NaN values in the test dataset with 1.

In [0]:
# Impute the missing Open flags with 1 (open).
test.loc[test['Open'].isna(), 'Open'] = 1
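
An equivalent way to express this imputation is `fillna`, optionally followed by a cast to integer, since a column that held NaNs is stored as float. This is a minimal sketch on a made-up frame (`toy_test` is hypothetical); the cast is an optional extra step, not something the notebook itself performs.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Open flag.
toy_test = pd.DataFrame({"Store": [622, 622, 1], "Open": [np.nan, 1.0, 0.0]})

# Fill missing flags with 1 (open), then store the flag as an integer.
toy_test["Open"] = toy_test["Open"].fillna(1).astype(int)
print(toy_test)
```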

Check again for missing values in the test dataset:

In [26]:
print("NANs for individual columns")
print("---------------------------")
from collections import Counter
x = {colname : test[colname].isnull().sum() for colname in test.columns}
Counter(x).most_common()

NANs for individual columns
---------------------------


[('Id', 0),
 ('Store', 0),
 ('DayOfWeek', 0),
 ('Date', 0),
 ('Open', 0),
 ('Promo', 0),
 ('StateHoliday', 0),
 ('SchoolHoliday', 0),
 ('Year', 0),
 ('Month', 0)]