# Exploratory Data Analysis

In this notebook, we explore the train and test data.

In [None]:
import pandas as pd
import numpy as np
import itertools
import os
from matplotlib import pyplot
import matplotlib as mpl

%matplotlib inline
%load_ext autoreload
%autoreload 2

## Load Data

According to the competition page, the fields in the dataset are described as follows:

- **ID** - an Id that represents a (Shop, Item) tuple within the test set
- **shop_id** - unique identifier of a shop
- **item_id** - unique identifier of a product
- **item_category_id** - unique identifier of item category
- **item_cnt_day** - number of products sold. You are predicting a monthly amount of this measure
- **item_price** - current price of an item
- **date** - date in format dd/mm/yyyy (the CSV file itself uses dots as separators, dd.mm.yyyy, which is how it is parsed below)
- **date_block_num** - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
- **item_name** - name of item
- **shop_name** - name of shop
- **item_category_name** - name of item category

We load the data from the CSV files and parse the `date` column. Later in the notebook, we aggregate the data by month, summing all item sales during that month for each shop and item ID.
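
As a rough illustration of the monthly aggregation described above, the sketch below sums daily counts per (month, shop, item) on a toy frame. The values are made up; only the column names follow `sales_train.csv`, and `item_cnt_month` is a name chosen here for the aggregated column.

```python
import pandas as pd

# Toy daily sales records (hypothetical values, real column names)
daily = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1, 1],
    'shop_id':        [5, 5, 7, 5, 7],
    'item_id':        [10, 10, 10, 10, 11],
    'item_cnt_day':   [1.0, 2.0, 1.0, 3.0, 1.0],
})

# Sum the daily counts within each (month, shop, item) group
monthly = (daily
           .groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)
           .agg({'item_cnt_day': 'sum'})
           .rename(columns={'item_cnt_day': 'item_cnt_month'}))
print(monthly)
```

The two records for shop 5 and item 10 in month 0 collapse into a single row with a count of 3.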

In [None]:
data_path = os.path.join(os.pardir, 'input')
train = pd.read_csv(os.path.join(data_path, 'sales_train.csv'))
train['date'] = pd.to_datetime(train['date'], format='%d.%m.%Y')

test = pd.read_csv(os.path.join(data_path, 'test.csv'))
shops = pd.read_csv(os.path.join(data_path, 'shops.csv'))
items = pd.read_csv(os.path.join(data_path, 'items.csv'))
categories = pd.read_csv(os.path.join(data_path, 'item_categories.csv'))

Check the sizes of the train and test data. In addition, print the total number of items, shops, and categories.

In [None]:
print('Train size:', train.shape[0])
print('Test size:', test.shape[0])
print('Number of items:', items.shape[0])
print('Number of shops:', shops.shape[0])
print('Number of categories:', categories.shape[0])

Look at the number of unique values in each column of the training dataset.

In [None]:
train.nunique()

The training dataset includes all shops listed in the shops dataframe. In addition, it contains 21,807 of the 22,170 possible items.

As for the test set:

In [None]:
test.nunique()

The test set includes 42 of the 60 shops and 5,100 of the 22,170 items.
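
Coverage figures like the ones above can be computed rather than read off by eye. The helper below is a minimal sketch; `coverage` is a hypothetical name, and the toy series stand in for `items.item_id` and `test.item_id`.

```python
import pandas as pd

def coverage(observed, catalog):
    """Return (n_seen, n_total, fraction) of catalog IDs that appear in `observed`."""
    observed_ids = set(observed)
    catalog_ids = set(catalog)
    n_seen = len(observed_ids & catalog_ids)
    return n_seen, len(catalog_ids), n_seen / len(catalog_ids)

# Toy stand-ins for items.item_id and test.item_id (hypothetical values)
catalog_items = pd.Series([0, 1, 2, 3, 4], name='item_id')
test_items = pd.Series([1, 1, 3, 3, 3], name='item_id')

n_seen, n_total, frac = coverage(test_items, catalog_items)
print(f'{n_seen}/{n_total} items ({frac:.0%})')  # prints: 2/5 items (40%)
```

On the real data, a call such as `coverage(test.item_id, items.item_id)` should give the item counts quoted above.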

## Train data

### Outliers: Items count and items price

Below, we plot histograms of the number of items sold per day and of the item price across the dataset. This helps us look for outliers.

In [None]:
fig, ax = pyplot.subplots(1, 2, figsize=(15, 4))
ax[0].hist(train.item_cnt_day, 100, edgecolor='k', alpha=0.5)
ax[0].set_ylim([0, 15])
ax[0].set_xlabel('item_cnt_day')
ax[0].set_ylabel('Number of items')
ax[0].set_title('Items sold')

ax[1].hist(train.item_price, 100, edgecolor='k', alpha=0.5)
ax[1].set_ylim([0, 15])
# ax[1].set_xlim([0, 25000])
ax[1].set_xlabel('item price')
ax[1].set_ylabel('Number of items')
ax[1].set_title('Items price')
pass

Note that the histograms above are clipped on the y-axis in order to home in on the rare records in the tails. For most records, fewer than 250 items were sold per day, but there are two notable outliers, one at about 1000 and one above 2000 items. As for item price, one price is above 300,000, which is an obvious outlier. With the clipped axis, most prices appear to fall below 50,000; however, without the clipping, most items turn out to be priced within 4000.
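
One possible way to act on these observations is to filter with explicit cutoffs. The sketch below uses toy values, and the thresholds `MAX_COUNT` and `MAX_PRICE` are assumptions read off the histograms rather than values fixed anywhere in this notebook.

```python
import pandas as pd

# Toy records (hypothetical values); column names follow sales_train.csv
sales = pd.DataFrame({
    'item_cnt_day': [1.0, 2.0, 1000.0, 3.0, 2100.0],
    'item_price':   [400.0, 310000.0, 5.0, 1500.0, 1.0],
})

# Assumed cutoffs, chosen by eye from the histograms; tune as needed
MAX_COUNT = 1000      # drop records with 1000 or more items sold in a day
MAX_PRICE = 300000    # drop records priced at 300000 or more

mask = (sales['item_cnt_day'] < MAX_COUNT) & (sales['item_price'] < MAX_PRICE)
cleaned = sales[mask]
print(f'Dropped {len(sales) - len(cleaned)} of {len(sales)} rows')
```

Whether to drop or clip such records is a modelling decision; the filter only makes the choice explicit and repeatable.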

### Number of shops and items per month

For each month period, we find the number of unique stores and items.

In [None]:
date = []
num_shops = []
num_items = []

for idx, group in train.groupby('date_block_num'):
    num_shops.append(len(group.shop_id.unique()))
    num_items.append(len(group.item_id.unique()))
    date.append((group.date.dt.month.iloc[0], group.date.dt.year.iloc[0]))

In [None]:
fig, ax = pyplot.subplots(1, 2, figsize=(16, 4))
x = np.arange(len(date))
ax[0].bar(x, num_shops, edgecolor='k', alpha=0.75)
ax[0].axhline(len(shops), color='#ff7f0e', linewidth=3)
ax[0].set_xticks(x)
ax[0].set_xticklabels(date, rotation='vertical')
ax[0].set_title('Number of shops per month')
ax[0].set_xlabel('Month, year')
ax[0].set_ylabel('Number of shops')

ax[1].bar(x, num_items, edgecolor='k', alpha=0.75)
ax[1].axhline(len(items), color='#ff7f0e', linewidth=3)
ax[1].set_xticks(x)
ax[1].set_xticklabels(date, rotation='vertical')
ax[1].set_title('Number of items per month')
ax[1].set_xlabel('Month, year')
ax[1].set_ylabel('Number of items');

In the figures above, we show the number of shops (left) and number of items (right) for each month in the dataset. The horizontal lines indicate the total number of possible shops (left) and total number of possible items (right).

There are about 45-50 shops recorded for each month, out of a total of 60 shops in the whole dataset. As for items, there are 22,170 possible items, but only about 5000-8000 are recorded each month, across all shops. Thus the dataset is sparse.
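
Sparsity can also be quantified per month as the share of that month's shop × item grid that has at least one sales record. This is one possible measure among several, sketched on toy data; `monthly_density` is a hypothetical helper name.

```python
import pandas as pd

# Toy sales records (hypothetical values)
records = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1, 1],
    'shop_id':        [1, 1, 2, 1, 2],
    'item_id':        [10, 11, 10, 10, 12],
})

def monthly_density(df):
    """Fraction of the month's shop x item grid that has a sales record."""
    pairs = df[['shop_id', 'item_id']].drop_duplicates()
    possible = df['shop_id'].nunique() * df['item_id'].nunique()
    return len(pairs) / possible

density = {block: monthly_density(group)
           for block, group in records.groupby('date_block_num')}
print(density)
```

In the toy data, month 0 fills 3 of 4 possible pairs and month 1 fills 2 of 4.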

### Total sales per month

Calculate the total number of items sold and the mean item price across all shops and items for each month.

In [None]:
total_sales = (train.groupby('date_block_num')
               .agg({'item_cnt_day': 'sum', 'item_price': 'mean'})
               .rename({'item_cnt_day': 'item_cnt_month', 'item_price': 'mean_price'}, axis=1))

In [None]:
fig, ax = pyplot.subplots(2, 1, sharex=True, figsize=(12, 7))
total_sales.item_cnt_month.plot(ax=ax[0], marker='o')
ax[0].set_ylabel('Total items sold')

total_sales.mean_price.plot(ax=ax[1], marker='o')
ax[1].set_ylabel('Mean items price')

for axi in ax:
    axi.set_xticks(x)
    axi.set_xticklabels(date, rotation='vertical')
    axi.grid(alpha=0.5)

ax[0].set_title('Number of sales in each month')
ax[1].set_title('Mean item price in each month')
pyplot.tight_layout()

As one would expect, the total number of items sold jumps in December, around Christmas. For the last month of the train set, October 2015, the mean item price is starting to increase, while the total number of items sold decreases slightly compared to the previous month, September 2015. Also, 2015 looks like a year with significantly fewer sales than the previous two years.
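
When comparing years, keep in mind that the train set ends in October 2015, so 2015 covers only ten months. The sketch below recovers year and month from `date_block_num` and compares like with like; the constant monthly totals are made up purely to show the effect.

```python
import numpy as np
import pandas as pd

# date_block_num 0 is January 2013, ..., 33 is October 2015
blocks = np.arange(34)
monthly_total = pd.Series(100, index=blocks)  # hypothetical: 100 items every month

year = 2013 + blocks // 12   # integer division gives the year offset
month = blocks % 12 + 1      # 1 = January, ..., 12 = December

# Naive yearly totals: 2015 looks smaller only because it has 10 months
full = monthly_total.groupby(year).sum()

# Like-for-like totals: restrict every year to January through October
keep = month <= 10
like_for_like = monthly_total[keep].groupby(year[keep]).sum()

print(full)
print(like_for_like)
```

With identical monthly sales, the naive totals still show 2015 below the other years, while the like-for-like totals are equal.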

## Train vs Test

The train and test sets contain different sets of shop and item combinations. Let's see how big the difference is. For each month, we find which shop/item combinations are present in the train dataset but not in the test dataset, and vice versa.

In [None]:
test_combs = set(map(tuple, test[['shop_id', 'item_id']].values))

train_not_test = []
test_not_train = []
train_size = []
for idx, group in train.groupby('date_block_num'):   
    train_combs = set(map(tuple, group[['shop_id', 'item_id']].values))
    train_size.append(len(train_combs))
    train_not_test.append(len(train_combs - test_combs))
    test_not_train.append(len(test_combs - train_combs))

fig, ax = pyplot.subplots(2, 1, sharex=True, figsize=(12, 6))
ax[0].plot(x, train_not_test, '-o', label='Not in test')
ax[0].plot(x, train_size, '-o', label='Train size')
ax[0].set_title('Combinations in train')
ax[0].set_ylabel('Number of (shop, item)')
ax[0].legend()

ax[1].plot(x, test_not_train, '-o', label='Not in train')
ax[1].plot(x, [len(test_combs)]*len(x), '-o', color='#ff7f0e', label='Test size')
ax[1].set_title('Combinations in test')
ax[1].set_ylabel('Number of (shop, item)')
ax[1].legend()

for axi in ax:
    axi.set_xticks(x)
    axi.set_xticklabels(date, rotation='vertical')
    axi.grid(alpha=0.5)

pyplot.tight_layout()

In the top figure, we plot the total number of combinations in the train dataset, which changes from month to month. We also plot the number of combinations in each month that are not in the test set.

**We see that as we go back in time, most of the combinations in the train set are not in the test set.** For October 2015, which is the last month in the train set, most of the combinations in the train set are in the test set.

In the bottom figure, we show, for each month of the train set, how many combinations in the test set are missing from that month (the counterpart of the top figure). Here too, as we go back in time, more combinations in the test dataset are not in the train set. In October 2015, out of about 215,000 combinations in the test set, only about 25,000 are present in the train dataset!

Thus, for October 2015, although most combinations in the train set are in the test set, the opposite is not true. On the other hand, October 2015 (the last month in the train set) has more combinations in common with the test set (which corresponds to November 2015) than the previous months.
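
The same comparison can be done without Python sets, using an outer merge with `indicator=True`, which labels every pair as present in the left frame only, the right frame only, or both. This is an alternative sketch on toy pairs, not the method used in the cells above.

```python
import pandas as pd

# Toy (shop, item) pairs (hypothetical values)
train_pairs = pd.DataFrame({'shop_id': [1, 1, 2], 'item_id': [10, 11, 10]})
test_pairs = pd.DataFrame({'shop_id': [1, 2, 2], 'item_id': [10, 10, 12]})

# The _merge column records where each pair was found
merged = pd.merge(train_pairs.drop_duplicates(), test_pairs.drop_duplicates(),
                  on=['shop_id', 'item_id'], how='outer', indicator=True)
counts = merged['_merge'].value_counts()
print(counts)
```

Here `left_only` counts pairs in train but not in test, and `right_only` counts pairs in test but not in train.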

### Using all combinations of shops/items

In the week 3 programming assignment of the class, all combinations of shops and items were generated for each month and merged with the existing shop/item pairs. This way, pairs not already in the train dataset are assigned 0 total sales. In the following, we repeat the comparison above, but now after adding all possible combinations for each month, as was done in the assignment.
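
Before the full implementation, here is a compact sketch of the same idea for a single month, using `pd.MultiIndex.from_product` and `reindex` instead of `itertools.product` and a merge. The data is made up, and this variant is an alternative illustration rather than the assignment's code.

```python
import pandas as pd

# Toy sales for a single month (hypothetical values)
month_sales = pd.DataFrame({
    'shop_id':      [1, 1, 2],
    'item_id':      [10, 11, 10],
    'item_cnt_day': [2.0, 1.0, 4.0],
})

# Every shop seen this month crossed with every item seen this month
full_index = pd.MultiIndex.from_product(
    [month_sales['shop_id'].unique(), month_sales['item_id'].unique()],
    names=['shop_id', 'item_id'])

# Sum sales per pair, then reindex onto the full grid; missing pairs become 0
target = (month_sales.groupby(['shop_id', 'item_id'])['item_cnt_day'].sum()
          .reindex(full_index, fill_value=0)
          .rename('target')
          .reset_index())
print(target)
```

The pair (shop 2, item 11) never appears in the toy sales, yet it shows up in the result with a target of 0.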

In [None]:
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month, create a grid of all shop/item combinations seen in that month
grid = []
for block_num in train['date_block_num'].unique():
    cur = train[train['date_block_num'] == block_num]
    cur_shops = cur['shop_id'].unique()
    cur_items = cur['item_id'].unique()
    grid.append(np.array(list(itertools.product(cur_shops, cur_items, [block_num])), dtype='int32'))

# Turn the grid into a pandas DataFrame
grid = pd.DataFrame(np.vstack(grid), columns=index_cols, dtype=np.int32)

# Get aggregated values for each (shop_id, item_id, month)
gb = (train.groupby(index_cols, as_index=False)
      .agg({'item_cnt_day': 'sum'})
      .rename(columns={'item_cnt_day': 'target'}))

# Join aggregated data to the grid; pairs without sales get a target of 0
all_data = pd.merge(grid, gb, how='left', on=index_cols).fillna(0)
# Sort the data
all_data.sort_values(['date_block_num', 'shop_id', 'item_id'], inplace=True)

In [None]:
all_data.head()

In [None]:
train_not_test = []
test_not_train = []
train_size = []
for idx, group in all_data.groupby('date_block_num'):   
    train_combs = set(map(tuple, group[['shop_id', 'item_id']].values))
    train_size.append(len(train_combs))
    train_not_test.append(len(train_combs - test_combs))
    test_not_train.append(len(test_combs - train_combs))

In [None]:
fig, ax = pyplot.subplots(2, 1, sharex=True, figsize=(12, 6))
ax[0].plot(x, train_not_test, '-o', label='Not in test')
ax[0].plot(x, train_size, '-o', label='Train size')
ax[0].set_title('Combinations in train')
ax[0].set_ylabel('Number of (shop, item)')
ax[0].legend()

ax[1].plot(x, test_not_train, '-o', label='Not in train')
ax[1].plot(x, [len(test_combs)]*len(x), '-o', color='#ff7f0e', label='Test size')
ax[1].set_title('Combinations in test')
ax[1].set_ylabel('Number of (shop, item)')
ax[1].legend()

for axi in ax:
    axi.set_xticks(x)
    axi.set_xticklabels(date, rotation='vertical')
    axi.grid(alpha=0.5)

pyplot.tight_layout()

Using this method, most of the combinations in the test dataset are now present in the training dataset (look at the bottom figure, where the blue "Not in train" line drops to about 50,000 for October 2015, compared with about 185,000 without the grid).
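
To see why the grid closes most of the gap but not all of it, consider the toy example below (all pairs are made up). A test pair is recovered when both its shop and its item appear somewhere in the month; it stays missing when either one is absent from that month entirely.

```python
from itertools import product

# Toy (shop, item) pairs sold in one month, and toy test pairs (hypothetical values)
observed = {(1, 10), (1, 11), (2, 10)}
test_pairs = {(1, 10), (2, 11), (3, 10)}

# Full grid: every shop seen this month crossed with every item seen this month
month_shops = {shop for shop, _ in observed}
month_items = {item for _, item in observed}
grid = set(product(month_shops, month_items))

missing_before = test_pairs - observed   # test pairs with no sales record
missing_after = test_pairs - grid        # test pairs still absent from the grid
print(len(missing_before), len(missing_after))  # prints: 2 1
```

The pair (2, 11) is recovered by the grid, while (3, 10) remains missing because shop 3 has no sales in that month.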