(Work in progress...)

The first part of this notebook includes exploratory analysis.

The second part will feature future prediction.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # graphs
import plotly.offline 
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)
iplot = plotly.offline.iplot

from IPython.display import display

## Data

In [None]:
item_cat = pd.read_csv('../input/item_categories.csv')
items = pd.read_csv('../input/items.csv')
shops = pd.read_csv('../input/shops.csv')
train = pd.read_csv('../input/sales_train.csv')
test = pd.read_csv('../input/test.csv')

print(f'''Shapes:
Item categories: {item_cat.shape}
Items: {items.shape}
Shops: {shops.shape}
Train set: {train.shape}
Test set: {test.shape}''')

### Item categories

There are 84 categories

In [None]:
print(item_cat.info())
item_cat.head()

### Items

There are 22170 items

In [None]:
print(items.info())
items.head()

### Shops

There are 60 shops

In [None]:
print(shops.info())
shops.head()

### Train set
6 features: date, date_block_num, shop_id, item_id, item_price, item_cnt_day

2935849 records

In [None]:
print(train.info())
train.head()

### Test set
214200 entries

In [None]:
print(test.info())
test.head()

## Target variable: item_cnt_day

Number of products sold. We are predicting the monthly total of this measure.

- The majority of records are for just 1 item.
- Some are negative (returned items).
- Some are over 20 items.

The maximum is 2169 items, on 28/10/2015, shop_id 12, item_id 11373 (Boxberry)


In [None]:
# train['item_cnt_day'].value_counts().sort_index()
pd.cut(train['item_cnt_day'], [-np.inf] + list(range(0, 21)) + [np.inf]).value_counts()
# items[items.item_id == train.sort_values('item_cnt_day', ascending=False).head()['item_id'].iloc[0]]
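
Because the target is a monthly amount, the daily rows eventually have to be summed per month, shop and item. A minimal sketch of that aggregation, assuming `date_block_num` serves as the month index; the toy rows and the column name `item_cnt_month` are made up for illustration and are not taken from the data:

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the train set
toy = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1, 1],
    'shop_id':        [5, 5, 5, 5, 7],
    'item_id':        [10, 10, 11, 10, 10],
    'item_cnt_day':   [1.0, 2.0, -1.0, 4.0, 1.0],  # negative = returned item
})

# Sum the daily counts into one row per (month, shop, item)
monthly = (
    toy.groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)['item_cnt_day']
       .sum()
       .rename(columns={'item_cnt_day': 'item_cnt_month'})  # illustrative name
)
print(monthly)
```

Returns stay in the sum as negative counts, so a monthly total can itself be negative.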

## Predictor variables

We create a new feature, item_sale, defined as item_price × item_cnt_day.

We also merge item_category_id from items into the train and test sets.

In [None]:
train['item_sale'] = train['item_price'] * train['item_cnt_day']
train = pd.merge(train, items[['item_id', 'item_category_id']], how='left', on='item_id')
test = pd.merge(test, items[['item_id', 'item_category_id']], how='left', on='item_id')
display(train.head(), test.head())

We want to check that each item_id belongs to only one item_category_id

In [None]:
# Count distinct categories per item; every item should map to exactly 1
train.groupby('item_id')['item_category_id'].nunique().value_counts()
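
The same one-category-per-item assumption can also be enforced at merge time. A small sketch on made-up toy tables (not the real data), using the `validate` argument of `pd.merge`, which raises an error if `item_id` is duplicated on the right-hand side:

```python
import pandas as pd

# Hypothetical toy tables mirroring train and items
toy_train = pd.DataFrame({'item_id': [1, 1, 2], 'item_cnt_day': [1.0, 2.0, 1.0]})
toy_items = pd.DataFrame({'item_id': [1, 2], 'item_category_id': [40, 19]})

# validate='many_to_one' raises pandas.errors.MergeError
# if item_id is not unique in toy_items
merged = pd.merge(toy_train, toy_items, how='left', on='item_id',
                  validate='many_to_one')

# Largest number of distinct categories attached to any single item
max_categories = merged.groupby('item_id')['item_category_id'].nunique().max()
print(max_categories)
```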

### date
Convert from string to datetime type

In [None]:
train['date'] = pd.to_datetime(train['date'], format='%d.%m.%Y')

Now we have date (day, month, year, week, etc.) as well as item_id, item_category_id and shop_id. We group by each of them and look at item_cnt_day and item_sale.


There is a clear weekly cycle, with more sales on Thu, Fri and Sat

If we group them by month, we can clearly see that:
- Dec and Jan had the highest item_cnt_day.
- Dec also had the highest item_sale (likely Xmas)
- Nov had the lowest item_cnt_day but Jul had the lowest item_sale

There are also a few peaks over the years:
- End of Nov 2013, high revenue
- End of Dec 2013, lots of item_cnt_day, but relatively lower item_sale compared to Nov 2013
- End of Dec 2014, lots of item_cnt_day
- There are also peaks around the end of May 2014 and 2015

We also see declines in item_cnt_day and item_sale from 2013 to 2015

In [None]:
def groupby(thing,  label = ''):
    tmp = train.groupby(thing)[['item_cnt_day', 'item_sale']].sum()
    return dict(
        args=[{
            'x': [tmp.index, tmp.index],
            'y': [tmp['item_cnt_day'], tmp['item_sale']],
        }],
        method='update', label=label
    )

tmp = groupby('date')
trace1 = go.Scatter(x=tmp['args'][0]['x'][0], y=tmp['args'][0]['y'][0], opacity=0.75, name='item_cnt_day')
trace2 = go.Scatter(x=tmp['args'][0]['x'][1], y=tmp['args'][0]['y'][1], opacity=0.75, name='item_sale', yaxis='y2')
data = [trace1, trace2]
layout = go.Layout(
    yaxis=dict(title='Item counts'),
    yaxis2=dict(title='Item sale', overlaying='y', side='right')
)
fig = go.Figure(data=data, layout=layout)
fig.layout.updatemenus = list([
    dict(
        buttons=[groupby(i, l) for i, l in [
            (train.date.dt.date, 'date'),
            (train.date.dt.dayofyear, 'day of year'),
            (train.date.dt.day, 'day of month'),
            (train.date.dt.dayofweek, 'day of week'),
            (train.date.dt.month, 'month'),
            # .dt.week and .dt.weekofyear were aliases and were removed in pandas 2.0;
            # .dt.isocalendar().week is the replacement
            (train.date.dt.isocalendar().week, 'week of year'),
            (train.date.dt.year, 'year')
        ]] ,
        direction = 'down',
        showactive = True,
        x = 0, xanchor = 'left',
        y = 1.25, yanchor = 'top' 
    ),
])
iplot(fig)
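
The weekly cycle noted above can also be checked numerically rather than read off the plot. A minimal sketch on a made-up toy frame (the dates and counts are illustrative only), grouping by weekday name:

```python
import pandas as pd

# Hypothetical toy rows; dates use the same dd.mm.yyyy format as the train set
toy = pd.DataFrame({
    'date': ['02.01.2013', '03.01.2013', '04.01.2013', '05.01.2013', '09.01.2013'],
    'item_cnt_day': [1.0, 2.0, 3.0, 4.0, 5.0],
})
toy['date'] = pd.to_datetime(toy['date'], format='%d.%m.%Y')

# day_name() gives readable labels; dt.dayofweek would give 0 (Monday) to 6 (Sunday)
by_weekday = toy.groupby(toy['date'].dt.day_name())['item_cnt_day'].sum()
print(by_weekday)
```

On the real data the same call would be applied to `train`; comparing sums per weekday only makes sense if each weekday occurs roughly equally often in the period.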

In total, there were 3.6M items sold, with a combined revenue of \$3.39 billion.

By item, the most popular were:
- By item_cnt_day: item_id 20949, with 187642 units sold, generating \$929K
- By item_sale: item_id 6675, worth \$219M in revenue, with 10289 units sold

The worst were items 1590, 11871, 18062, 13474 and 13477. Shops lost money on them

By shop:
- shop_id 31 had the highest item_cnt_day (310777 items sold), and also the highest item_sale (\$235M)
- shop_id 36 had the lowest item_cnt_day (330), with a revenue of \$377K

By item category, the most popular were:
- By item_cnt_day: item_category_id 40, with 634171 units sold, generating \$170M
- By item_sale: item_category_id 19, with 254887 units sold (\$412M)

The worst performers were item_category_id 51, with only 1 unit sold (\$129), and item_category_id 50, with 3 units sold (\$24)

In [None]:
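# Note: this redefines the groupby helper from the previous cell
def groupby(thing, sort_values = 'item_cnt_day'):
    # Sort ascending, then [:-100:-1] keeps the top 99 groups in descending order
    tmp = train.groupby(thing)[['item_cnt_day', 'item_sale']].sum().sort_values(sort_values).reset_index()[:-100:-1]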
    return dict(
        args=[{
            'x': [tmp[thing], tmp[thing]],
            'y': [tmp['item_cnt_day'], tmp['item_sale']],
        }, {
            'xaxis': dict(type='category')
        }],
        method='update', label=thing + ' sort by ' + sort_values
    )

tmp = groupby('item_id')
trace1 = go.Bar(x=tmp['args'][0]['x'][0], y=tmp['args'][0]['y'][0], opacity=0.5, name='item_cnt_day')
trace2 = go.Bar(x=tmp['args'][0]['x'][1], y=tmp['args'][0]['y'][1], opacity=0.5, name='item_sale', yaxis='y2')
data = [trace1, trace2]
layout = go.Layout(
    xaxis=dict(type='category'),
    yaxis=dict(title='Item counts'),
    yaxis2=dict(title='Item sale', overlaying='y', side='right')
)
fig = go.Figure(data=data, layout=layout)
fig.layout.updatemenus = list([
    dict(
        buttons=[groupby(i, s) for i, s in [
            ('item_id', 'item_cnt_day'),
            ('item_id', 'item_sale'),
            ('item_category_id', 'item_cnt_day'),
            ('item_category_id', 'item_sale'),
            ('shop_id', 'item_cnt_day'),
            ('shop_id', 'item_sale'),
        ]] ,
        direction = 'down',
        showactive = True,
        x = 0, xanchor = 'left',
        y = 1.25, yanchor = 'top' 
    ),
])
iplot(fig)
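
The rankings quoted above come down to a grouped sum followed by a sort. A minimal sketch on made-up toy rows (the prices and counts are illustrative only), which also shows how an item can end up with negative total revenue when returns outweigh sales:

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the train set
toy = pd.DataFrame({
    'item_id':      [1, 1, 2, 3],
    'item_price':   [100.0, 100.0, 2500.0, 400.0],
    'item_cnt_day': [3.0, 2.0, 1.0, -1.0],  # negative count = returned item
})
toy['item_sale'] = toy['item_price'] * toy['item_cnt_day']

totals = toy.groupby('item_id')[['item_cnt_day', 'item_sale']].sum()

top_by_count = totals['item_cnt_day'].idxmax()  # item with most units sold
top_by_sale = totals['item_sale'].idxmax()      # item with most revenue
net_negative = totals[totals['item_sale'] < 0].index.tolist()  # returns exceed sales
print(top_by_count, top_by_sale, net_negative)
```

As in the real figures, the best seller by units and the best seller by revenue need not be the same item.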

We now want to compare the items and shops in the test set with those in the train set:
- All shop_id values present in the test set are also present in the train set.
- All item_category_id values present in the test set are also present in the train set.
- There are 363 item_id values present in the test set that are not in the train set.

In [None]:
# Print each count; a notebook cell only displays its last bare expression
print(len(set(test['shop_id']).difference(set(train['shop_id'])))) # all shop_id in the test set are present in the train set
print(len(set(test['item_category_id']).difference(set(train['item_category_id'])))) # all categories in the test set are present in the train set
print(len(set(test['item_id']).difference(set(train['item_id'])))) # 363 item_id in the test set are NOT present in the train set
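
Beyond counting unseen items, it can help to know how many test rows they affect. A small sketch on made-up toy frames (not the real data), using `isin` to flag test rows whose item_id never appears in train:

```python
import pandas as pd

# Hypothetical toy frames mirroring train and test
toy_train = pd.DataFrame({'item_id': [1, 2, 2, 3]})
toy_test = pd.DataFrame({'shop_id': [5, 5, 7, 7], 'item_id': [2, 4, 2, 4]})

# True for test rows whose item_id is absent from train
unseen_mask = ~toy_test['item_id'].isin(toy_train['item_id'])

n_unseen_items = toy_test.loc[unseen_mask, 'item_id'].nunique()
share_unseen_rows = unseen_mask.mean()  # fraction of test rows affected
print(n_unseen_items, share_unseen_rows)
```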