# Rossmann Drug Store Chain Sales Prediction

## 1. Problem Definition
Rossmann is Germany's second-largest drug store chain. We have been provided with historical sales data for 1,115 Rossmann stores. The goal of this notebook is to build a model that forecasts the **Sales** column using only the following input columns:
* Store
* DayOfWeek
* Date
* Customers
* Open
* Promo
* StateHoliday
* SchoolHoliday

![Drug](https://images.unsplash.com/photo-1631549916768-4119b2e5f926?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1179&q=80)

## 2. Datasets
We are given two data files:
* train.csv - _contains sales data on a daily frequency_
* store.csv - _contains store information_

In [19]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

In [27]:
# Import our sales and store dataset
sales_df = pd.read_csv("data/train.csv", low_memory=False, parse_dates=["Date"])
stores_df = pd.read_csv("data/store.csv", low_memory=False)

In [28]:
sales_df.head(3)

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,4,2015-04-30,6228,650,1,1,0,0
1,2,4,2015-04-30,6884,716,1,1,0,0
2,3,4,2015-04-30,9971,979,1,1,0,0


## 3.  Features

### 3.1 Sales Data

In [30]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 914629 entries, 0 to 914628
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   Store          914629 non-null  int64         
 1   DayOfWeek      914629 non-null  int64         
 2   Date           914629 non-null  datetime64[ns]
 3   Sales          914629 non-null  int64         
 4   Customers      914629 non-null  int64         
 5   Open           914629 non-null  int64         
 6   Promo          914629 non-null  int64         
 7   StateHoliday   914629 non-null  object        
 8   SchoolHoliday  914629 non-null  int64         
dtypes: datetime64[ns](1), int64(7), object(1)
memory usage: 62.8+ MB


In [31]:
# Sort DataFrame in date order.
sales_df.sort_values(by=["Date"], inplace=True, ascending=True)
sales_df.Date.head(20)

914628   2013-01-01
913893   2013-01-01
913892   2013-01-01
913891   2013-01-01
913890   2013-01-01
913889   2013-01-01
913888   2013-01-01
913887   2013-01-01
913886   2013-01-01
913885   2013-01-01
913884   2013-01-01
913883   2013-01-01
913882   2013-01-01
913881   2013-01-01
913880   2013-01-01
913894   2013-01-01
913879   2013-01-01
913877   2013-01-01
913876   2013-01-01
913875   2013-01-01
Name: Date, dtype: datetime64[ns]

In [33]:
# Let us create a restore point of our sales dataset.
sales_df_backup = sales_df.copy(deep=True)

In [36]:
sales_df[:1].Date # Tuesday

914628   2013-01-01
Name: Date, dtype: datetime64[ns]

In [37]:
sales_df[:1].DayOfWeek 

914628    2
Name: DayOfWeek, dtype: int64

In [42]:
# The DayOfWeek column already provided in the dataset is 1-based, so 1 is Monday.
sales_df.DayOfWeek.unique()

array([2, 3, 4, 5, 6, 7, 1], dtype=int64)

In [49]:
# pandas can also derive the day of the week from the date, but we won't use it here.
# It assumes the week starts on Monday, denoted by 0, and ends on Sunday, denoted by 6.
sales_df[:1].Date.dt.dayofweek
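
The two conventions differ only by an offset of one. As a minimal sketch on a single illustrative date (the variable names here are hypothetical, not part of the notebook), adding 1 to the pandas value reproduces the dataset's `DayOfWeek`:

```python
import pandas as pd

# 2013-01-01 was a Tuesday (the first date in the sorted sales data).
first_day = pd.Series(pd.to_datetime(["2013-01-01"]))

zero_based = first_day.dt.dayofweek      # pandas convention: Monday=0 ... Sunday=6
one_based = first_day.dt.dayofweek + 1   # dataset convention: Monday=1 ... Sunday=7

print(zero_based.iloc[0], one_based.iloc[0])  # 1 2
```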

In [50]:
sales_df["SaleYear"] = sales_df.Date.dt.year
sales_df["SaleMonth"] = sales_df.Date.dt.month
sales_df["SaleDay"] = sales_df.Date.dt.day
sales_df["SaleDayOfYear"] = sales_df.Date.dt.dayofyear
sales_df.tail().T

Unnamed: 0,745,746,747,741,0
Store,746,747,748,742,1
DayOfWeek,4,4,4,4,4
Date,2015-04-30 00:00:00,2015-04-30 00:00:00,2015-04-30 00:00:00,2015-04-30 00:00:00,2015-04-30 00:00:00
Sales,9469,12123,9524,12225,6228
Customers,748,1017,746,1196,650
Open,1,1,1,1,1
Promo,1,1,1,1,1
StateHoliday,0,0,0,0,0
SchoolHoliday,0,0,0,0,0
SaleYear,2015,2015,2015,2015,2015


In [48]:
sales_df.isna().sum()

Store            0
DayOfWeek        0
Date             0
Sales            0
Customers        0
Open             0
Promo            0
StateHoliday     0
SchoolHoliday    0
saleYear         0
saleMonth        0
saleDay          0
saleDayOfYear    0
dtype: int64

In [51]:
# Now that the date has been expanded into separate feature columns, we can remove the Date column
sales_df.drop("Date", axis=1, inplace=True)

In [61]:
for label, content in sales_df.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)

StateHoliday


In [62]:
sales_df.StateHoliday.value_counts()

0    887690
a     16149
b      6690
c      4100
Name: StateHoliday, dtype: int64

In [65]:
# Turn non-numeric columns into integer category codes
for label, content in sales_df.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Replace each category with its integer code
        sales_df[label] = pd.Categorical(content).codes

In [67]:
pd.Categorical(sales_df["StateHoliday"]).codes

array([1, 1, 1, ..., 0, 0, 0], dtype=int8)

In [68]:
sales_df.StateHoliday.value_counts()

0    887690
1     16149
2      6690
3      4100
Name: StateHoliday, dtype: int64
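
`pd.Categorical` assigns codes in the sorted order of the distinct values, which is why the counts above line up as `0` to 0, `a` to 1, `b` to 2 and `c` to 3. A small sketch on made-up values (the `holiday` series below is illustrative only) shows how the code-to-label mapping can be recovered:

```python
import pandas as pd

# Illustrative values using the same labels as StateHoliday.
holiday = pd.Series(["0", "a", "0", "b", "c", "a"])

cat = pd.Categorical(holiday)
codes = cat.codes.tolist()

# Map each integer code back to its original label.
code_to_label = dict(enumerate(cat.categories))
print(code_to_label)
```

Keeping this mapping around makes it possible to interpret the encoded column later, since the original labels are overwritten in place.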

In [69]:
X_train, y_train = sales_df.drop("Sales", axis=1), sales_df.Sales

In [70]:
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_jobs=-1, random_state=42, n_estimators = 100, max_samples=10000)

In [71]:
model.fit(X_train, y_train)

### 3.2 Stores Data

In [96]:
# stores_df = pd.read_csv("data/store.csv", low_memory=False)
stores_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Store                      1115 non-null   int64  
 1   StoreType                  1115 non-null   object 
 2   Assortment                 1115 non-null   object 
 3   CompetitionDistance        1112 non-null   float64
 4   CompetitionOpenSinceMonth  761 non-null    float64
 5   CompetitionOpenSinceYear   761 non-null    float64
 6   Promo2                     1115 non-null   int64  
 7   Promo2SinceWeek            571 non-null    float64
 8   Promo2SinceYear            571 non-null    float64
 9   PromoInterval              571 non-null    object 
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB


In [97]:
stores_df.head(5)

Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,c,a,1270.0,9.0,2008.0,0,,,
1,2,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,c,c,620.0,9.0,2009.0,0,,,
4,5,a,a,29910.0,4.0,2015.0,0,,,


In [109]:
# Let us create a restore point of our stores dataset.
stores_df_backup = stores_df.copy(deep=True)

In [110]:
# Check for Missing Values
stores_df.isna().sum()

Store                          0
StoreType                      0
Assortment                     0
CompetitionDistance            3
CompetitionOpenSinceMonth    354
CompetitionOpenSinceYear     354
Promo2                         0
Promo2SinceWeek              544
Promo2SinceYear              544
PromoInterval                544
dtype: int64

In [111]:
stores_df.describe()

Unnamed: 0,Store,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear
count,1115.0,1112.0,761.0,761.0,1115.0,571.0,571.0
mean,558.0,5404.901079,7.224704,2008.668857,0.512108,23.595447,2011.763573
std,322.01708,7663.17472,3.212348,6.195983,0.500078,14.141984,1.674935
min,1.0,20.0,1.0,1900.0,0.0,1.0,2009.0
25%,279.5,717.5,4.0,2006.0,0.0,13.0,2011.0
50%,558.0,2325.0,8.0,2010.0,1.0,22.0,2012.0
75%,836.5,6882.5,10.0,2013.0,1.0,37.0,2013.0
max,1115.0,75860.0,12.0,2015.0,1.0,50.0,2015.0


In [113]:
# Replace missing values in numeric columns with the column median
stores_df = stores_df.fillna(stores_df.median(numeric_only=True))
stores_df.isnull().sum()


Store                          0
StoreType                      0
Assortment                     0
CompetitionDistance            0
CompetitionOpenSinceMonth      0
CompetitionOpenSinceYear       0
Promo2                         0
Promo2SinceWeek                0
Promo2SinceYear                0
PromoInterval                544
dtype: int64
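
The median fill only affects numeric columns, which is why `PromoInterval` still reports 544 missing values. A minimal sketch on a made-up frame (the `toy` data is hypothetical) illustrates the behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame: one numeric column, one string column.
toy = pd.DataFrame({
    "CompetitionDistance": [100.0, np.nan, 300.0],
    "PromoInterval": ["Jan,Apr,Jul,Oct", None, None],
})

# median(numeric_only=True) skips the string column, so only the
# numeric column receives a fill value.
filled = toy.fillna(toy.median(numeric_only=True))
print(filled)
```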

#### Strings

In [118]:
# Find columns which contain strings
for label, content in stores_df.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

StoreType
Assortment
PromoInterval


In [120]:
stores_df.StoreType.value_counts()

a    602
d    348
c    148
b     17
Name: StoreType, dtype: int64

In [121]:
stores_df.Assortment.value_counts()

a    593
c    513
b      9
Name: Assortment, dtype: int64

In [122]:
# Turn string columns into integer category codes (PromoInterval is handled separately)
for label, content in stores_df.items():
    if pd.api.types.is_string_dtype(content) and label != 'PromoInterval':
        # Replace each category with its integer code
        stores_df[label] = pd.Categorical(content).codes

In [123]:
stores_df.StoreType.value_counts()

0    602
3    348
2    148
1     17
Name: StoreType, dtype: int64

In [124]:
stores_df.Assortment.value_counts()

0    593
2    513
1      9
Name: Assortment, dtype: int64

In [138]:
stores_df.fillna('', inplace=True)

In [139]:
stores_df.PromoInterval.value_counts()

                    544
Jan,Apr,Jul,Oct     335
Feb,May,Aug,Nov     130
Mar,Jun,Sept,Dec    106
Name: PromoInterval, dtype: int64

In [147]:
stores_df.isnull().sum()

Store                          0
StoreType                      0
Assortment                     0
CompetitionDistance            0
CompetitionOpenSinceMonth      0
CompetitionOpenSinceYear       0
Promo2                         0
Promo2SinceWeek                0
Promo2SinceYear                0
PromoInterval                  0
Bumpanes                     780
dtype: int64

In [140]:
# Promo Interval
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
len(months)

12

In [143]:
df_month = pd.DataFrame()
for month in months:
    df_month["PromoInterval" + month] = 1
df_month

Unnamed: 0,PromoIntervalJan,PromoIntervalFeb,PromoIntervalMar,PromoIntervalApr,PromoIntervalMay,PromoIntervalJun,PromoIntervalJul,PromoIntervalAug,PromoIntervalSep,PromoIntervalOct,PromoIntervalNov,PromoIntervalDec


In [145]:
stores_df.loc[stores_df['PromoInterval'].str.contains("Jan", case=False), "Bumpanes"] = 1

In [146]:
stores_df

Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval,Bumpanes
0,1,2,0,1270.0,9.0,2008.0,0,22.0,2012.0,,
1,2,0,0,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct",1.0
2,3,0,0,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct",1.0
3,4,2,2,620.0,9.0,2009.0,0,22.0,2012.0,,
4,5,0,0,29910.0,4.0,2015.0,0,22.0,2012.0,,
...,...,...,...,...,...,...,...,...,...,...,...
1110,1111,0,0,1900.0,6.0,2014.0,1,31.0,2013.0,"Jan,Apr,Jul,Oct",1.0
1111,1112,2,2,1880.0,4.0,2006.0,0,22.0,2012.0,,
1112,1113,0,2,9260.0,8.0,2010.0,0,22.0,2012.0,,
1113,1114,0,2,870.0,8.0,2010.0,0,22.0,2012.0,,
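
The single-month assignment above could be generalised to one indicator column per month. The following is a minimal sketch under stated assumptions: the `toy` frame is made up, the `PromoInterval<Month>` column names follow the `df_month` naming, and the data's `Sept` is assumed to mean the same month as `Sep` in the `months` list.

```python
import numpy as np
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical stores covering the three intervals plus a store without Promo2.
toy = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "PromoInterval": [np.nan, "Jan,Apr,Jul,Oct",
                      "Feb,May,Aug,Nov", "Mar,Jun,Sept,Dec"],
})

# Normalise the "Sept" spelling so it matches the months list.
cleaned = toy["PromoInterval"].fillna("").str.replace("Sept", "Sep", regex=False)

# One 0/1 column per month; stores with no interval get zeros everywhere.
indicators = (
    cleaned.str.get_dummies(sep=",")
    .reindex(columns=months, fill_value=0)
    .add_prefix("PromoInterval")
)

# Replace the string column with the numeric indicators.
encoded = pd.concat([toy.drop(columns="PromoInterval"), indicators], axis=1)
print(encoded)
```

Encoding the column this way leaves the frame fully numeric, at the cost of twelve extra columns.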


In [128]:
X_train, y_train = stores_df.drop("CompetitionDistance", axis=1), stores_df.CompetitionDistance

In [129]:
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_jobs=-1, random_state=42, n_estimators = 100, max_samples=10000)

In [130]:
model.fit(stores_df.drop("CompetitionDistance", axis=1), stores_df.CompetitionDistance)

ValueError: could not convert string to float: 'Jan,Apr,Jul,Oct'
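
The fit fails because `PromoInterval` still holds strings. One way to catch this before fitting, sketched here on a made-up frame (the `toy` data and variable names are assumptions for illustration), is to list the non-numeric columns and exclude or encode them first:

```python
import pandas as pd

# Hypothetical slice of the stores data after category encoding.
toy = pd.DataFrame({
    "StoreType": [2, 0],
    "Promo2": [0, 1],
    "PromoInterval": ["", "Jan,Apr,Jul,Oct"],
})

# Columns that a scikit-learn estimator cannot convert to float.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Keep only the numeric columns as model input.
X = toy.drop(columns=non_numeric)
```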

In [None]:
# Create evaluation function: Root Mean Squared Log Error (RMSLE)
from sklearn.metrics import mean_squared_log_error, mean_absolute_error

def rmsle(y_test, y_preds):
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

# Create function to evaluate our model
# (expects X_train, y_train, X_valid and y_valid to be defined beforehand)
def show_scores(model):
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_valid)
    scores = {"Training MAE": mean_absolute_error(y_train, train_preds),
              "Valid MAE": mean_absolute_error(y_valid, val_preds),
              "Training RMSLE": rmsle(y_train, train_preds),
              "Valid RMSLE": rmsle(y_valid, val_preds),
              "Training R^2": model.score(X_train, y_train),
              "Valid R^2": model.score(X_valid, y_valid)}
    return scores
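
`show_scores` relies on `X_valid` and `y_valid`, which this notebook has not created yet. Because the task is a forecast, one reasonable option is to hold out the most recent period for validation. The sketch below uses made-up rows and a hypothetical cutoff year; the real split point is a modelling decision this notebook does not specify.

```python
import pandas as pd

# Toy stand-in for the sales data (values are made up for illustration).
toy = pd.DataFrame({
    "SaleYear":  [2013, 2013, 2014, 2014, 2015, 2015],
    "SaleMonth": [1, 6, 1, 6, 1, 4],
    "Store":     [1, 2, 1, 2, 1, 2],
    "Sales":     [5000, 6000, 5500, 6500, 5200, 6100],
})

# Hypothetical cutoff: rows from 2015 onwards are held out for validation.
cutoff_year = 2015
train = toy[toy["SaleYear"] < cutoff_year]
valid = toy[toy["SaleYear"] >= cutoff_year]

# Separate the features from the target in each split.
X_train, y_train = train.drop("Sales", axis=1), train["Sales"]
X_valid, y_valid = valid.drop("Sales", axis=1), valid["Sales"]

print(len(X_train), len(X_valid))
```

Splitting by time rather than at random keeps future rows out of the training set, so the validation scores better reflect how the model would behave on unseen dates.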