<a href="https://colab.research.google.com/github/mdkamrulhasan/data_mining_kdd/blob/main/notebooks/Exploratory_data_Analysis_Retail_part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



## 1. Indexing
## 2. Datetime
## 3. Data Processing: aggregation
## 4. Data visualizations: plotly

# Importing libraries

In [1]:
# accessing google drive
from google.colab import drive
# data processing
import pandas as pd
import numpy as np
# visualization
import plotly.express as px
import plotly.graph_objects as go



---



# Loading Data

[Data (Retail) source](https://www.kaggle.com/datasets/manjeetsingh/retaildataset)

Note: the file names used here differ slightly from those in the original Kaggle dataset.

In [2]:
features = pd.read_csv("https://raw.githubusercontent.com/mdkamrulhasan/data_mining_kdd/main/data/retail/Features-data-set.csv")
sales = pd.read_csv("https://raw.githubusercontent.com/mdkamrulhasan/data_mining_kdd/main/data/retail/sales-data-set.csv")
stores = pd.read_csv("https://raw.githubusercontent.com/mdkamrulhasan/data_mining_kdd/main/data/retail/stores-data-set.csv")

Let's take a first look at the data in each table.

In [3]:
features.head(3)

Unnamed: 0,Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,IsHoliday
0,1,05/02/2010,42.31,2.572,,,,,,211.096358,8.106,False
1,1,12/02/2010,38.51,2.548,,,,,,211.24217,8.106,True
2,1,19/02/2010,39.93,2.514,,,,,,211.289143,8.106,False


In [4]:
sales.head(3)

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday
0,1,1,05/02/2010,24924.5,False
1,1,1,12/02/2010,46039.49,True
2,1,1,19/02/2010,41595.55,False


In [5]:
stores.Type.unique()

array(['A', 'B', 'C'], dtype=object)

**We will mostly be talking about the sales data today.**

In [6]:
sales.index

RangeIndex(start=0, stop=421570, step=1)

In [7]:
sales.index.values[:10]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [8]:
sales.head(2)

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday
0,1,1,05/02/2010,24924.5,False
1,1,1,12/02/2010,46039.49,True


In [9]:
sales.loc[1]

Unnamed: 0,1
Store,1
Dept,1
Date,12/02/2010
Weekly_Sales,46039.49
IsHoliday,True


In [10]:
sales.iloc[1]

Unnamed: 0,1
Store,1
Dept,1
Date,12/02/2010
Weekly_Sales,46039.49
IsHoliday,True


In [11]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Store         421570 non-null  int64  
 1   Dept          421570 non-null  int64  
 2   Date          421570 non-null  object 
 3   Weekly_Sales  421570 non-null  float64
 4   IsHoliday     421570 non-null  bool   
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 13.3+ MB


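We want to find the earliest and latest dates available (and, for comparison, the range of weekly sales). Note that `Date` is still stored as strings (dtype `object`), so `min()` and `max()` compare it lexicographically and the result below is not the true date range.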

In [12]:
sales.Weekly_Sales.min(), sales.Weekly_Sales.max()

(-4988.94, 693099.36)

In [13]:
sales.Date.min(), sales.Date.max()

('01/04/2011', '31/12/2010')

In [14]:
sales.dtypes

Unnamed: 0,0
Store,int64
Dept,int64
Date,object
Weekly_Sales,float64
IsHoliday,bool


Converting `Date` to datetime format

In [15]:
sales['Date'] = pd.to_datetime(sales.Date, format="%d/%m/%Y")

In [16]:
sales.Date.min(), sales.Date.max()

(Timestamp('2010-02-05 00:00:00'), Timestamp('2012-10-26 00:00:00'))
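To make the difference between the two `min()`/`max()` results concrete, here is a minimal sketch on a hand-made series (the variable names `raw` and `parsed` are illustrative and not part of the notebook). It uses four dates that appear in the outputs above and compares them first as strings, then as parsed timestamps.

```python
import pandas as pd

# Four dates in the dataset's day/month/year string format (illustrative sample).
raw = pd.Series(["05/02/2010", "26/10/2012", "01/04/2011", "31/12/2010"])

# As strings, min/max compare character by character, so the day part decides.
string_min, string_max = raw.min(), raw.max()

# After parsing with an explicit format, min/max follow the calendar.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")
date_min, date_max = parsed.min(), parsed.max()

print("as strings:  ", string_min, string_max)
print("as datetimes:", date_min.date(), date_max.date())
```

Passing an explicit `format` also avoids any ambiguity between day-first and month-first strings such as `05/02/2010`.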

In [17]:
sales.dtypes

Unnamed: 0,0
Store,int64
Dept,int64
Date,datetime64[ns]
Weekly_Sales,float64
IsHoliday,bool


### Let's create some additional time-related columns

In [20]:
pd.__version__

'2.1.4'

In [27]:
sales['Year'] = sales.Date.dt.year
sales['Month'] = sales.Date.dt.month
sales['Week'] = sales.Date.dt.isocalendar().week
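One thing worth keeping in mind about `dt.isocalendar()`: it follows ISO 8601, where a week belongs to the year that contains its Thursday. Near a year boundary the ISO year can therefore differ from the calendar year returned by `dt.year`. The sketch below shows this on two dates; the second date (1 January 2011) is chosen purely for illustration and is an assumption, not a date taken from the sales table.

```python
import pandas as pd

# First date is the first week in the sales data; the second is a hypothetical
# date picked only to show the year-boundary effect.
dates = pd.Series(pd.to_datetime(["2010-02-05", "2011-01-01"]))

iso = dates.dt.isocalendar()  # DataFrame with columns: year, week, day

calendar_year = dates.dt.year.tolist()
iso_year = iso["year"].astype(int).tolist()
iso_week = iso["week"].astype(int).tolist()

print("calendar year:", calendar_year)
print("ISO year:     ", iso_year)
print("ISO week:     ", iso_week)
```

If weeks are later combined with `Year` in a groupby, using the ISO year from the same `isocalendar()` call keeps the two columns consistent.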

Number of unique stores

In [28]:
sales.Store.nunique()

45

In [29]:
yearly_agg = sales.groupby(['Year']).agg({'Weekly_Sales': 'sum'})
yearly_agg.head()

Unnamed: 0_level_0,Weekly_Sales
Year,Unnamed: 1_level_1
2010,2288886000.0
2011,2448200000.0
2012,2000133000.0
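The dictionary passed to `agg` maps a column name to an aggregation function. As a small sketch on made-up numbers (the frame `toy` and the output column names are hypothetical), the same pattern can be extended with named aggregation to compute several statistics per group in one pass.

```python
import pandas as pd

# Made-up data, only to illustrate the groupby/agg pattern.
toy = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011, 2011],
    "Weekly_Sales": [100.0, 150.0, 200.0, 50.0, 25.0],
})

# Dictionary form, as in the cell above: column -> aggregation function.
totals = toy.groupby("Year").agg({"Weekly_Sales": "sum"})

# Named aggregation: each output column is (input column, function).
summary = toy.groupby("Year").agg(
    total_sales=("Weekly_Sales", "sum"),
    mean_sales=("Weekly_Sales", "mean"),
    n_rows=("Weekly_Sales", "count"),
)
print(summary)
```

In both cases the grouping column (`Year`) becomes the index of the result, which matters for the `loc`/`iloc` discussion that follows.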


Let's come back to the indexing idea we had at the beginning.

In [30]:
sales.index.values[:10]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [31]:
sales.loc[0]

Unnamed: 0,0
Store,1
Dept,1
Date,2010-02-05 00:00:00
Weekly_Sales,24924.5
IsHoliday,False
Year,2010
Month,2
Week,5


In [32]:
yearly_agg.index.values[:5]

array([2010, 2011, 2012], dtype=int32)

In [33]:
# loc looks up by index label; after the groupby the labels are the years,
# so 2010 works here (yearly_agg.loc[0] would raise a KeyError)
yearly_agg.loc[2010]

Unnamed: 0,2010
Weekly_Sales,2288886000.0


In [34]:
# iloc still works with 0 because it selects by integer position, whatever the index labels are
yearly_agg.iloc[0]

Unnamed: 0,2010
Weekly_Sales,2288886000.0
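The contrast between label-based and position-based lookup can be reproduced on a tiny frame. This is a sketch with made-up values; `yearly` stands in for `yearly_agg` and is not part of the notebook.

```python
import pandas as pd

# A small stand-in for yearly_agg: the index holds years, not 0, 1, 2.
yearly = pd.DataFrame(
    {"Weekly_Sales": [10.0, 20.0, 30.0]},
    index=pd.Index([2010, 2011, 2012], name="Year"),
)

by_label = yearly.loc[2010, "Weekly_Sales"]   # lookup by index label
by_position = yearly.iloc[0, 0]               # lookup by row/column position

# The label 0 does not exist in this index, so loc raises a KeyError.
try:
    yearly.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```

On the original `sales` frame both calls return the same row only because the default `RangeIndex` labels happen to coincide with the positions.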


Let's look at the sales pattern of a specific store.

In [35]:
query_store_id = 1
store_x = sales[sales.Store == query_store_id]

In [36]:
store_x.Dept.unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 40, 41, 42, 44, 45, 46, 47, 48, 49, 51, 52, 54, 55, 56,
       58, 59, 60, 67, 71, 72, 74, 77, 78, 79, 80, 81, 82, 83, 85, 87, 90,
       91, 92, 93, 94, 95, 96, 97, 98, 99])

In [37]:
fig = px.bar(store_x, x='Date', y='Weekly_Sales')
fig.show()
# this will show sales for all departments for our selected store



Let's see what it looks like for a specific department.

In [38]:
query_dept_id = 1
store_dept_x = store_x[store_x.Dept == query_dept_id]
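The two filtering steps (store first, then department) can also be written as a single boolean mask. The sketch below uses a made-up frame `toy_sales`; note that each condition needs its own parentheses because `&` binds more tightly than `==`.

```python
import pandas as pd

# Made-up rows, only to illustrate combined filtering.
toy_sales = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Dept": [1, 2, 1, 2],
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0],
})

# Element-wise AND of two conditions.
mask = (toy_sales["Store"] == 1) & (toy_sales["Dept"] == 1)
selected = toy_sales[mask]

# The same selection expressed with DataFrame.query.
selected_q = toy_sales.query("Store == 1 and Dept == 1")

print(selected)
```

Keeping the intermediate `store_x` frame, as the notebook does, is still convenient when the store-level data is reused for other plots.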

In [39]:
fig = px.bar(store_dept_x, x='Date', y='Weekly_Sales')
fig.show()





What are the yearly sales of the selected (store, department) pair?

In [40]:
yearly_agg = store_dept_x.groupby(['Year']).agg({'Weekly_Sales': 'sum'})
yearly_agg.head()

Unnamed: 0_level_0,Weekly_Sales
Year,Unnamed: 1_level_1
2010,1126348.73
2011,1171550.8
2012,921505.65


In [41]:
fig = px.bar(yearly_agg, x=yearly_agg.index.values, y='Weekly_Sales')
fig.show()

Do we have partial data for any year?

In [42]:
store_dept_x.Date.min(), store_dept_x.Date.max()

(Timestamp('2010-02-05 00:00:00'), Timestamp('2012-10-26 00:00:00'))

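Yes: the data starts in February 2010 and ends in October 2012, so 2010 and 2012 are partial years and their yearly sums are not directly comparable with 2011. Let's plot the average sales instead.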
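Another way to check for partial years is to count how many distinct weeks each year contributes. The sketch below does this on a synthetic weekly calendar; it assumes one observation every Friday between the two dates shown above, which is an assumption about the data rather than something verified here.

```python
import pandas as pd

# Synthetic calendar: every Friday between the first and last date in the output above.
fridays = pd.date_range("2010-02-05", "2012-10-26", freq="W-FRI")
toy = pd.DataFrame({"Date": fridays})
toy["Year"] = toy["Date"].dt.year

# Number of distinct weekly dates per year.
weeks_per_year = toy.groupby("Year")["Date"].nunique()
print(weeks_per_year)
```

On the real data the same `groupby("Year")["Date"].nunique()` call could be applied to `store_dept_x` to see how many weeks each yearly sum is based on.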

In [43]:
monthly_agg = store_dept_x.groupby(['Month']).agg({'Weekly_Sales': 'mean'})
monthly_agg.head(3)

Unnamed: 0_level_0,Weekly_Sales
Month,Unnamed: 1_level_1
1,17418.9925
2,32700.750833
3,22210.847692


In [44]:
fig = px.bar(monthly_agg, x=monthly_agg.index.values, y='Weekly_Sales')
fig.show()

Let's see the aggregated values at the weekly level.

In [45]:
weekly_agg = store_dept_x.groupby(['Week']).agg({'Weekly_Sales': 'mean'})
weekly_agg.head(3)

Unnamed: 0_level_0,Weekly_Sales
Week,Unnamed: 1_level_1
1,16275.965
2,17127.05
3,17853.285


In [46]:
fig = px.bar(weekly_agg, x=weekly_agg.index.values, y='Weekly_Sales')
fig.show()

# Rolling average (an example window function)

In [48]:
weekly_agg['sales_roll3'] = weekly_agg['Weekly_Sales'].rolling(3).mean()
weekly_agg['sales_roll5'] = weekly_agg['Weekly_Sales'].rolling(5).mean()
weekly_agg['sales_roll20'] = weekly_agg['Weekly_Sales'].rolling(20).mean()
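By default, `rolling(n)` only produces a value once a full window of `n` observations is available, so the first `n - 1` entries are `NaN` (19 of them for the 20-week window). A minimal sketch on a made-up series shows this, along with the `min_periods` parameter that relaxes the requirement.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Default: a value is emitted only when the window holds 3 observations.
strict = s.rolling(3).mean()

# min_periods=1: average over whatever is available at the start.
relaxed = s.rolling(3, min_periods=1).mean()

print(pd.DataFrame({"value": s, "strict": strict, "relaxed": relaxed}))
```

Whether the partial-window averages are meaningful depends on the analysis; the notebook's default keeps only full windows.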

In [49]:
weekly_agg.index

Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
       37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52],
      dtype='UInt32', name='Week')

Resetting the index

In [50]:
weekly_agg = weekly_agg.reset_index()

In [51]:
weekly_agg.index.values

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51])
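What `reset_index` does is easier to see on a two-row frame: the named index becomes a regular column and a fresh `RangeIndex` takes its place. This is a sketch with made-up values; `agg` and `flat` are illustrative names.

```python
import pandas as pd

# A small stand-in for weekly_agg: 'Week' is the index, not a column.
agg = pd.DataFrame(
    {"Weekly_Sales": [5.0, 7.0]},
    index=pd.Index([1, 2], name="Week"),
)

flat = agg.reset_index()  # 'Week' moves into the columns

print(list(agg.columns), "->", list(flat.columns))
print(flat.index.tolist())
```

Having `Week` as a column is what allows it to be referenced by name (for example as the `x` values of a plot) instead of through `index.values`.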

In [52]:
weekly_agg.head(6)

Unnamed: 0,Week,Weekly_Sales,sales_roll3,sales_roll5,sales_roll20
0,1,16275.965,,,
1,2,17127.05,,,
2,3,17853.285,17085.433333,,
3,4,18419.67,17800.001667,,
4,5,23366.916667,19879.957222,18608.577333,
5,6,40305.05,27363.878889,23414.394333,


Plotting multiple traces using plotly graph objects

In [53]:
fig = go.Figure([
    # one named trace per series so the legend is readable
    go.Scatter(x=weekly_agg['Week'],
               y=weekly_agg['Weekly_Sales'], name='weekly mean'),
    go.Scatter(x=weekly_agg['Week'],
               y=weekly_agg['sales_roll3'], name='rolling mean (3)'),
    go.Scatter(x=weekly_agg['Week'],
               y=weekly_agg['sales_roll5'], name='rolling mean (5)'),
    go.Scatter(x=weekly_agg['Week'],
               y=weekly_agg['sales_roll20'], name='rolling mean (20)'),
])

fig.update_layout(
    title="Rolling averages of weekly sales",
    yaxis_title="weekly sales average", legend_title="rolling window")

fig.update_layout(
    legend=dict(
        x=0.05,
        y=0.95,
        traceorder="normal",
        font=dict(
            family="sans-serif",
            size=12,
            color="black"
        ),
    )
)
fig.show()