In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from IPython.core.interactiveshell import InteractiveShell
import zipfile
import os

%matplotlib inline

In [2]:
# os.getcwd()

## Files: reading

There are 5 files associated with this project. They will be read directly from the zip file they are in.
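
Before loading, it can help to confirm which members the archive actually contains and what their headers look like. The following is a minimal sketch using only the standard library. It builds a tiny in-memory archive with a hypothetical member `demo.csv` so that it runs without the competition zip, and the helper name `peek_header` is an assumption made for illustration.

```python
import io
import zipfile


def peek_header(archive, member):
    """Return the column names from the first line of a CSV member."""
    with archive.open(member) as raw:
        # ZipFile.open yields bytes, so wrap the stream to decode text
        text = io.TextIOWrapper(raw, encoding="utf-8")
        # Naive split: good enough for headers without quoted commas
        return text.readline().strip().split(",")


# Build a small stand-in archive in memory (hypothetical content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("demo.csv", "shop_id,item_id\n5,5037\n")

with zipfile.ZipFile(buffer) as demo_zip:
    members = demo_zip.namelist()
    header = peek_header(demo_zip, "demo.csv")

print(members)
print(header)
```

With the real archive, the same two calls (`namelist()` and `open()`) apply to the `ZipFile` object created in the next cell.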

In [3]:
zf = zipfile.ZipFile('competitive-data-science-predict-future-sales.zip') 

In [4]:
# Read items
items = pd.read_csv(zf.open('items.csv'))

# Read sales train
sales_train = pd.read_csv(zf.open('sales_train.csv'))
sales_train['date'] = pd.to_datetime(sales_train['date'], format= '%d.%m.%Y')
sales_train['revenue'] = sales_train['item_price']*sales_train['item_cnt_day']

# Read item categories
item_categories = pd.read_csv(zf.open('item_categories.csv'))

# Read shops
shops = pd.read_csv(zf.open('shops.csv'))

# Read test
test = pd.read_csv(zf.open('test.csv'))


## Files: schema

The files associated with this competition are presented in this order:
1. **Training data sets:**<br>
   1.1. sales_train<br>
   1.2. shops<br>
   1.3. items<br>
   1.4. item_categories<br>
   
2. **Test data sets:**<br>
   2.1. test<br>
   2.2. sample submission<br>
  
A quick look at the tables' schemas raises the following questions:
  1. __Are all the features across the training sets reasonably usable?__<br>
  2. __Is it a pricing model?__<br>
  3. __If so, what is the right approach to the training info?__<br>
     3.1 __Is a time series analysis (TSA) approach the right one?__<br>
     3.2 __Is an optimization approach the right one?__<br>
     3.3 __Or can a standard ML approach provide a good answer instead?__<br>
     
From our 1.0 Read Data file we found out that the text in the features item_name, shop_name, and item_category_name is in Russian. We could translate it with the googletrans package, but given the cap of 150 translations per day, getting those translations alone would take almost a month. Therefore, only the sales_train and items dataframes are used to create our initial training data set.
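
The time estimate above is a simple division of the number of strings by the daily cap. A minimal sketch of that arithmetic follows; the function name `days_to_translate` and the example count of 4500 strings are hypothetical, chosen only to illustrate the calculation.

```python
import math

DAILY_CAP = 150  # translations per day, the cap quoted above


def days_to_translate(n_strings, daily_cap=DAILY_CAP):
    """Number of whole days needed to translate n_strings under the cap."""
    return math.ceil(n_strings / daily_cap)


# Hypothetical count, for illustration only
print(days_to_translate(4500))  # 30
```

With the real data, the input would be the number of distinct names, for example `items.item_name.nunique()`.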

See the graphic below:


<img src="schema.png" width="600" height="600" align="center"/>


In [None]:
# Number of distinct items, item categories, and shops
print(items.item_id.nunique())
print(items.item_category_id.nunique())
print(sales_train.shop_id.nunique())


In [None]:
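# item_categories has no item_id column, so look up item_category_id through items instead
sales_train = sales_train.merge(items[['item_id', 'item_category_id']], on='item_id', how='left')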
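
Attaching category information to the sales rows is a key-based lookup, and such lookups can fail silently when keys are duplicated or missing. The following is a minimal sketch with made-up toy data, assuming that the lookup table maps each `item_id` to one `item_category_id`. The names `toy_sales` and `toy_items` are hypothetical; `validate` and `indicator` are standard parameters of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up)
toy_sales = pd.DataFrame({"item_id": [10, 11, 10, 12], "item_cnt_day": [1, 2, 1, 3]})
toy_items = pd.DataFrame({"item_id": [10, 11], "item_category_id": [40, 55]})

merged = toy_sales.merge(
    toy_items,
    on="item_id",
    how="left",              # keep every sales row
    validate="many_to_one",  # raise if item_id is duplicated in toy_items
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows whose item_id has no match in the lookup table
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

A left merge keeps the row count of the sales table unchanged, so comparing lengths before and after is a quick sanity check.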

### Graph date_block_num

In [None]:
sales_train.date_block_num.plot()

In [None]:
# Plot the number of rows per distinct value, one figure per column
for col in sales_train.columns:
    sales_train.groupby(col).size().plot(figsize=(14, 8), title=col, legend=False);

In [None]:
sales_train.columns

In [None]:
# date_block_num counts months from the start of 2013, so integer division by 12 gives the year offset
sales_train['year'] = sales_train.date_block_num // 12 + 2013
sales_train

In [None]:
# Add 1 so months run from 1 to 12, matching date.dt.month
sales_train['month'] = sales_train.date_block_num % 12 + 1
sales_train

In [None]:
sales_train['year_dt'] = sales_train.date.dt.year
sales_train['month_dt'] = sales_train.date.dt.month
sales_train['day_dt'] = sales_train.date.dt.day
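
The year and month derived from date_block_num should agree with the ones taken from the parsed date. A minimal sketch of that cross-check follows, using only the standard library. It assumes the counter starts at 0 in January 2013 and numbers months from 1 to 12 as `dt.month` does; the helper name `block_to_year_month` and the sample rows are hypothetical.

```python
from datetime import datetime


def block_to_year_month(block, start_year=2013):
    """Map a month counter (0 = January of start_year) to (year, month)."""
    years, months = divmod(block, 12)
    return start_year + years, months + 1  # month is 1-based


# Hypothetical rows: (date string in the file's format, date_block_num)
rows = [("02.01.2013", 0), ("15.12.2013", 11), ("03.02.2014", 13)]

for date_str, block in rows:
    parsed = datetime.strptime(date_str, "%d.%m.%Y")
    # Fails loudly if the two derivations disagree
    assert block_to_year_month(block) == (parsed.year, parsed.month)
```

On the real dataframe, the equivalent check would compare the counter-derived columns with `year_dt` and `month_dt` row by row.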

In [None]:
sales_train