Let's get started!

I) I have gotten used to working through kernels and data science flows like this one with the "Feynman" technique (https://mattyford.com/blog/2014/1/23/the-feynman-technique-model).
![](https://cdn.britannica.com/84/19184-004-AE04C440.jpg)
He famously said that you only really understand a concept if you are able to explain it to a 5-year-old. Therefore I go through my work and comment / annotate every step and everything I learn (especially my mistakes) along the way. That forces me to really understand the topic, and it lets me look at my work later and still understand what worked and how. Additionally, I imagine it could be interesting for fellow learners.

II)  From earlier work on Kaggle I saved the following workflow and will use it for structuring the project:

 1. Prepare Problem
a) Load libraries
b) Load dataset

2. Summarize Data
a) Descriptive statistics
b) Data visualizations

3. Prepare Data
a) Data Cleaning
b) Feature Selection
c) Data Transforms (Normalize,...)

These steps go into a second kernel [here](https://www.kaggle.com/dennise/coursera-competition-modelling/edit)

4. Evaluate Algorithms
a) Split-out validation dataset
b) Test options and evaluation metric
c) Spot Check Algorithms
d) Compare Algorithms

5. Improve Accuracy
a) Algorithm Tuning
b) Ensembles

6. Finalize Model
a) Predictions on validation dataset
b) Create standalone model on entire training dataset
c) Save model for later use

In [None]:
"""
1) Prepare Problem
a) Load libraries
"""
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Lets see what other libraries I will be using
# Keras
# sklearn

In [None]:
"""
1) Prepare Problem
b) Load dataset
"""
items=pd.read_csv('../input/items.csv')
item_categories=pd.read_csv('../input/item_categories.csv')
shops=pd.read_csv('../input/shops.csv')

# What about these *.gz files?
# It is a compressed format: "For on-the-fly decompression of on-disk data"
test=pd.read_csv('../input/test.csv.gz',compression='gzip')
sample_submission=pd.read_csv('../input/sample_submission.csv.gz',compression='gzip')
sales_train=pd.read_csv('../input/sales_train.csv.gz',compression='gzip')
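
A side note on the `*.gz` question: as far as I know, `pd.read_csv` infers the compression from the file extension by default, so the explicit `compression='gzip'` should be optional. A minimal sketch with a made-up temporary file (file name and contents are hypothetical toy data, not the competition files):

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny gzip-compressed CSV into a temporary folder (toy data)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.csv.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("shop_id,item_id\n5,5037\n5,5320\n")

# No compression argument: the default "infer" picks gzip because of the ".gz" ending
toy = pd.read_csv(path)
print(toy)
```

Passing `compression='gzip'` explicitly, as in the cell above, does no harm and documents the intent.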

In [None]:
items.info()

In [None]:
items.head()

In [None]:
items.describe()

Observations:
- So there are 22,170 different items in the catalogue
- Each one has a unique ID
- There are 84 different categories (IDs start at 0, as can be seen below) and every item is assigned to one. No NaNs.

In [None]:
item_categories.info()

In [None]:
item_categories.head()

In [None]:
item_categories.describe()
# Gives descriptive statistics on quantitative features

Observations:
- There are 84 different categories (=IDs) (starting at 0 with "PC")
- Each category has a name

In [None]:
shops.info()

In [None]:
shops.head()

In [None]:
shops.describe()

Observations:
- There are 60 shops (IDs again starting at 0)
- Each shop has a name
    - Maybe the names refer to places, which could be meaningful for the analysis? (Added to open questions/ideas)

In [None]:
test.info()

In [None]:
test.head()

In [None]:
test.describe()

From the data description in the competition:
"the test set. You need to forecast the sales for these shops and products for November 2015"

Observations:
- 214,200 rows in test
- So I need to predict, per shop, how many of certain items are sold in that month
- Not every item_id is paired with every shop: 60 shops * 22,167 items would make roughly 1,330,000 predictions
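
To see how the test set is built, one could check whether it is a complete shop x item grid. A minimal sketch on made-up toy data (`toy_test` is a hypothetical stand-in for `test`, only the column names are taken from the real file):

```python
import pandas as pd

# Toy stand-in for test.csv: 2 shops x 3 items, every combination present
toy_test = pd.DataFrame({
    "ID": range(6),
    "shop_id": [5, 5, 5, 4, 4, 4],
    "item_id": [10, 11, 12, 10, 11, 12],
})

n_shops = toy_test["shop_id"].nunique()
n_items = toy_test["item_id"].nunique()
n_pairs = len(toy_test[["shop_id", "item_id"]].drop_duplicates())

# If the number of distinct pairs equals shops * items, every shop is paired with every item
is_full_grid = n_pairs == n_shops * n_items
print(n_shops, n_items, n_pairs, is_full_grid)
```

Running the same three counts on the real `test` frame would show whether the 214,200 rows form such a grid.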

In [None]:
sample_submission.info()

In [None]:
sample_submission.head()

In [None]:
sample_submission.describe()

Observations:
- So this is what I have to deliver in the end: an item_cnt_month per ID from the test set, i.e. how many units of an item (item_id) are sold in a specific shop (shop_id) in this month (November 2015)? (Added seasonality to the list of open to-dos)

In [None]:
sales_train.info()

In [None]:
sales_train.head()

In [None]:
sales_train.describe()

Observations:
- nearly 3 million rows in the train set
- the transaction history of 60 shops over 34 months (date_block_num 0 to 33)
- one row per shop, item and day with the number of units sold (item_cnt_day)
- revenue per item, shop and day = item_price*item_cnt_day (as learned in the Coursera-hosted pandas introduction session)
- looks like there are no NaNs in the data

Now we get to Phase 2:
2. Summarize Data
a) Descriptive statistics

= EDA = Exploratory data analysis (= week 2 of the Coursera course)

Of primary interest here is, of course, the training dataset.

In [None]:
sales_train.info()

In [None]:
sales_train.describe()

In [None]:
sales_train.item_price.hist()

In [None]:
sales_train.item_price.value_counts()

In [None]:
sales_train.item_price.nunique()

In [None]:
sales_train.item_price.max()

In [None]:
print(sales_train[sales_train.item_price==sales_train.item_price.max()])

In [None]:
print(sales_train[sales_train.item_price==sales_train.item_price.max()].item_id)

In [None]:
print(items[items.item_id==6066])

In [None]:
# Radmin is remote control software - I don't think it is that expensive. Let's check whether it was also sold at "normal" prices
print(sales_train[sales_train.item_id==6066])

In [None]:
# Only this one time. Interesting. Then maybe it is right. One huge license?
# Let's see if there are other Radmin versions
# and if this is the only outlier in price
print(sales_train[sales_train.item_price>50000])

In [None]:
# ok lets leave it in for now.
sales_train[sales_train.item_price<60000].item_price.hist()

In [None]:
sales_train[sales_train.item_price<30000].item_price.hist()

In [None]:
sales_train[sales_train.item_price<15000].item_price.hist()

In [None]:
sales_train[sales_train.item_price<5000].item_price.hist()

In [None]:
sales_train[sales_train.item_price<3000].item_price.hist()

In [None]:
sales_train[sales_train.item_price<1000].item_price.hist()

In [None]:
# So definitely let's build some price categories. The majority seems to be small B2C business, but there are also big B2B deals.

# In general I should understand better what the products actually are:
print(item_categories.head(300))
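
The price categories mentioned above could be built with `pd.cut`. A rough sketch on toy prices; the bin edges and labels are arbitrary assumptions of mine and would need tuning against the real price histogram:

```python
import pandas as pd

# Toy prices; bin edges and labels below are arbitrary assumptions
prices = pd.Series([99.0, 399.0, 1499.0, 4990.0, 27990.0, 60000.0], name="item_price")
bins = [0, 500, 5000, 50000, float("inf")]
labels = ["low", "mid", "high", "very_high"]

# pd.cut assigns each price to the interval (left, right] it falls into
price_band = pd.cut(prices, bins=bins, labels=labels)
print(price_band.value_counts().sort_index())
```

The resulting categorical column could later be used as a feature or to split the data into "mass product" and "big deal" parts.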

In [None]:
# Playstation, Xbox and Cyrillic things.
# Let's transliterate the column (and also the shop column - check if we can see cities)
"""from textblob import TextBlob

item_categories['english'] = item_categories['item_category_name'].str.encode('cp437', 'ignore').apply(lambda x:TextBlob(x.strip()).translate(to='en'))
"""
"""
from unidecode import unidecode
item_categories['english'] = unidecode(item_categories['item_category_name'])
"""
# 3rd solution worked out: https://stackoverflow.com/questions/14173421/use-string-translate-in-python-to-transliterate-cyrillic
symbols=(u"абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ", u"abvgdeejzijklmnoprstufhzcss_y_euaABVGDEEJZIJKLMNOPRSTUFHZCSS_Y_EUA")
english = {ord(a):ord(b) for a, b in zip(*symbols)}

item_categories['items_english'] = item_categories['item_category_name'].apply(lambda x: x.translate(english))

print(item_categories.items_english.head(100))

# Observations:
# The categories contain meta-categories: Accessories, Console, PC, programs, music...
# Added to to-do list: Take these meta-categories as features


In [None]:
#Split the metacategories with the "-"
item_categories["meta_category"]=item_categories.items_english.apply(lambda x:x.split(" - ")[0])
print(item_categories.meta_category.head(100))

In [None]:
item_categories.head()

In [None]:
print(item_categories.meta_category.unique())
print(item_categories.meta_category.nunique())
#Great! Only 20 macro-categories
print(item_categories.meta_category.value_counts())
print(item_categories.meta_category)
# Of course: I need to put the macro-categories into the data
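
One weakness of the split above: a category without " - " in its name gets its full name as meta-category. A small sketch of how this could be made visible, using three toy category names (`has_subcategory` is a hypothetical helper column):

```python
import pandas as pd

toy_categories = pd.DataFrame({
    "items_english": ["Igrovye konsoli - PS3", "Aksessuary - PS4", "Dostavka tovara"]
})

# Split once on " - "; expand=True yields one column per part
parts = toy_categories["items_english"].str.split(" - ", n=1, expand=True)
toy_categories["meta_category"] = parts[0]
# parts[1] is missing when there was no separator, i.e. there is no sub-category
toy_categories["has_subcategory"] = parts[1].notna()
print(toy_categories)
```

Whether such categories should keep their own name or be grouped into something like "other" is a modelling decision I have not made yet.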

In [None]:
shops.info()

# Translate shop names
shops['shops_english'] = shops['shop_name'].apply(lambda x: x.translate(english))
print(shops.shops_english.head(100))

# YES! First word is the city! Great feature to extract! Another "macro-category"
"""
# And because it is only 60 objects this can even be done and checked manually
shops["town"] =["Yakutsk","Yakutsk","Adygea","Balasiha","Volzhskij","Vologda","Voronej","Voronej",]
"""
# No, this was too stupid:
shops["town"]=shops.shops_english.apply(lambda x:x.split()[0])
print(shops.town)

# While doing this and researching the cities, next idea: another macro feature "region", e.g. Balashiha belongs to the Moscow region
shops["region"]=["Sakha","Sakha","Adygea","Moscow","Volgograd", "Vologda", "voronezh","voronezh","voronezh","Vyezdnaa", "Moscow", "Moscow","Internet", "tatarstan", "tatarstan","Kaluga", "Moscow", "Moscow", "Moscow", "Kursk", "Moscow", "Moscow", "Moscow", "Moscow", "Moscow", "Moscow", "Moscow", "Moscow", "Moscow", "Moscow", "Moscow", "Moscow", "Moscow","Moscow","novgorod","novgorod","novosibirsk","novosibirsk","omsk","rostov","rostov","rostov","Saint Peterburg","Saint Peterburg","samara","samara","moscow","Khanty-Mansi","Tomsk","Tyumen","tyumen","tyumen","Bashkortostan","Bashkortostan","moscow","zifrovoj","moscow","sakha","sakha","yaroslavl"]
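# Normalize capitalization first, otherwise "Moscow" and "moscow" count as two different regions
shops["region"]=shops["region"].str.title()
print(shops.town.nunique())
print(shops.region.nunique())
# hmmm, didn't help much - only 6 towns belong to the Moscow region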

shops.to_csv('final_shops.csv',index=False)

In [None]:
# Always print (parts of) data that you are examining just to get an idea
# done
print(test.shape)
print(sales_train.shape)

# 3 features missing in test
print(test.columns)
print(sales_train.columns)

print(test.head())
# ok, test is really only the form I need to fill in: a forecast per shop and item for the specific month
# Therefore I need to split the training data into train & validation sets later
# Feedback on the test set comes from the evaluation via Kaggle and/or Coursera

In [None]:
sales_train.columns
# Let's start with the dates column: parse it once, then derive the parts
parsed_date = pd.to_datetime(sales_train['date'], format = '%d.%m.%Y')
sales_train['day'] = parsed_date.dt.day
sales_train['month'] = parsed_date.dt.month
sales_train['year'] = parsed_date.dt.year
sales_train['weekday'] = parsed_date.dt.dayofweek
sales_train.columns
print(sales_train.head())

In [None]:
# Dates look ordered and shops also
sales_train.date_block_num.plot()

In [None]:
sales_train.shop_id.plot(figsize=(20,4))
# Interesting. There is a rhythm to it

In [None]:
sales_train.weekday[0:100].plot(figsize=(20,4))

In [None]:
sales_train.head(100)
"""
Aha. Order of train set is by
- month
- shops per month
- item_id per shop
- dates of items per shop
"""

In [None]:
sales_train[3000:4000]
# No, the hypothesis from above is wrong. Shop 25 appears again after shop 24

In [None]:
sales_train.item_id[1:10000].plot(figsize=(20,4))
# There seem to be groups. Maybe it has to do with categories?

In [None]:
sales_train=sales_train.merge(items, how='left')
sales_train=sales_train.merge(item_categories,how="left")
sales_train=sales_train.merge(shops,how="left")

In [None]:
sales_train.head()

In [None]:
sales_train.drop("item_name",axis=1,inplace=True)
sales_train.drop("shop_name",axis=1,inplace=True)
sales_train.drop("item_category_name",axis=1,inplace=True)
sales_train.head()

In [None]:
sales_train.isnull().values.any()
# No values NaN
# Maybe NaN values have been replaced by a number? Carefully check for outliers
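
To follow up on the idea that missing values might hide behind placeholder numbers, a quick count of implausible values could help. A sketch on toy data; which values count as "suspicious" (non-positive prices, negative counts) is my assumption:

```python
import pandas as pd

toy_sales = pd.DataFrame({
    "item_price": [999.0, -1.0, 1499.0, 299.0],
    "item_cnt_day": [1.0, 1.0, -1.0, 2.0],
})

# Count suspicious values per column: non-positive prices and negative counts
suspicious = pd.Series({
    "item_price <= 0": int((toy_sales["item_price"] <= 0).sum()),
    "item_cnt_day < 0": int((toy_sales["item_cnt_day"] < 0).sum()),
})
print(suspicious)
```

Negative counts are not necessarily errors; they could also be returns, so they deserve a separate look.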

In [None]:
sales_train.head(200)
# Can't see a strict order. Kind of per shop. Kind of per item_id. 

In [None]:
sales_train["revenue"]=sales_train.item_cnt_day * sales_train.item_price

In [None]:
sales_train.groupby("year").sum(numeric_only=True)
# Interesting: 2015 has lower revenue - no! We don't have full years. The data reaches from the beginning of 2013 to October 2015

In [None]:
sales_train.groupby("date_block_num").sum()["revenue"].plot()
# Very clearly a typical retail pattern: peak at Christmas and a low in summer

In [None]:
sales_train.groupby("weekday").sum()["revenue"].plot()
# looks like Friday, Saturday and Sunday are the busiest days

In [None]:
sales_train.groupby("shop_id").sum()["revenue"].plot.bar()
# Significant differences
# But careful: Could be that shops opened later or closed earlier
# And what about the online shop?

In [None]:
sales_train[sales_train["shops_english"]=="Internet-magazin CS"]["revenue"].sum()
# 1.1 on the scale above, so not the biggest one

In [None]:
sales_train[sales_train["shops_english"]=="Internet-magazin CS"]["revenue"].plot()
# funny spikes
# I would have expected steady and increasing sales for an online shop

In [None]:
sales_train[sales_train["shops_english"]=="Internet-magazin CS"].groupby("date_block_num").sum()["revenue"].plot()

In [None]:
# How does this look for other shops?
sales_train[sales_train["shops_english"]=="Vyezdnaa Torgovla"]["revenue"].plot()
# Spikes seem to be rather normal when larger things are being sold.
# These "things" need to be looked into much more deeply, including when they occur. They have a major impact!
# How can this be modelled? Are these yearly licenses? Or are they random occurrences that I should model as random?

In [None]:
sales_train[sales_train["shops_english"]=="Vyezdnaa Torgovla"].groupby("date_block_num").sum()["revenue"].plot()
# This looks very strange
# Does it have to do with a test-train split? No data for months 23-31?

In [None]:
sales_train.groupby("shop_id").sum()["revenue"]

In [None]:
# Understand the training data structure a bit better
sales_train.shop_id.plot(figsize=(20,4))
# Why so irregular?

In [None]:
sales_train.groupby("date_block_num").count().town.plot()
# The number of transactions is shrinking!

In [None]:
sales_train.groupby("date_block_num").item_price.mean().plot()
# But because the average price is increasing more strongly, the revenue is increasing

In [None]:
sales_train.groupby("date_block_num").revenue.mean().plot()

In [None]:
sales_train.groupby("date_block_num").sum().item_cnt_day.plot()
#Just double-checking that it is not only the number of transactions but also the total number of items

In [None]:
# https://tradingeconomics.com/russia/inflation-rate-mom
# Inflation rate in Russia over the period was oscillating between 0 and 1% per month
# except one huge peak (to 4% per month, very briefly) at the end of 2014 / beginning of 2015
# due to lower oil prices and Western sanctions imposed over Ukraine

# https://www.quora.com/Economy-of-Russia-What-caused-the-high-inflation-in-Russia-in-2014-and-2015
# It translates to a yearly inflation rate of 17%

In [None]:
sns.pairplot(sales_train)

# if you work in the kernel you should de-activate this as it takes a long time

"""Observations:
- one extreme outlier in price
- one extreme outlier in cnt_day
- one month (~10 date_block_num) has a lot of high revenue items
- certain item_ids have wide range of revenues, some have outliers
"""

In [None]:
sales_train.groupby(["shop_id","date_block_num"]).sum()
# Definitely a big difference in how long shops have been on the market

In [None]:
shop_life=pd.DataFrame(columns=["shop_id","Start", "Stop"])
shop_life["shop_id"]=np.arange(60)
shop_life["Start"]=sales_train.groupby("shop_id")["date_block_num"].min()
shop_life["Stop"]=sales_train.groupby("shop_id")["date_block_num"].max()
# Assign the merge result, otherwise the shop names never end up in shop_life
shop_life=shop_life.merge(shops, how="left").drop("shop_name",axis=1)
print(shop_life)
"""
Observations:
- shops 10 and 11 have the same name, just ^2 and ? -> Check if shop 10 is empty at month 25 (a)
- shops 39 and 40 seem to be the same? (b)
- definitely need to check which shops are in the test set (c)
- should closed shops be considered? (d)
"""

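The `shop_life` table above relies on the groupby result lining up with the row index 0..59. A sketch of an alternative that avoids this by using named aggregation, shown on toy data (`shop_life_sketch` is a hypothetical name so it does not clash with the real table):

```python
import pandas as pd

toy_sales = pd.DataFrame({
    "shop_id":        [0, 0, 1, 1, 1, 2],
    "date_block_num": [0, 1, 5, 20, 33, 14],
})

# First and last month with sales per shop, computed in one pass
shop_life_sketch = (
    toy_sales.groupby("shop_id")["date_block_num"]
    .agg(Start="min", Stop="max")
    .reset_index()
)
print(shop_life_sketch)
```

This also works if some shop ids are missing from the data, because no fixed `np.arange(60)` is needed.
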
In [None]:
# (a)
sales_train[(sales_train["shop_id"]==10) & (sales_train["date_block_num"]==25)]

In [None]:
sales_train[(sales_train["shop_id"]==11) & (sales_train["date_block_num"]==25)]

In [None]:
sales_train[(sales_train["shop_id"]==10) & (sales_train["date_block_num"]==24)]

In [None]:
sales_train[(sales_train["shop_id"]==10) & (sales_train["date_block_num"]==26)]

In [None]:
sales_train[(sales_train["shop_id"]==11) & (sales_train["date_block_num"]==24)]

In [None]:
sales_train[(sales_train["shop_id"]==11) & (sales_train["date_block_num"]==26)]

In [None]:
# Good. Brute-force but clear.
# Let's have a 100% picture:
sales_train[(sales_train["shop_id"]==10) | (sales_train["shop_id"]==11)].groupby(["shop_id","date_block_num"]).sum()
# Yes, definitely.

In [None]:
sales_train.loc[sales_train["shop_id"]==11,"shop_id"]=10
sales_train[sales_train["shop_id"]==11]

In [None]:
sales_train[(sales_train["shop_id"]==10)&(sales_train["date_block_num"]==25)]

In [None]:
sales_train.to_csv('sales_train.csv',index=False)

In [None]:
# Good. Next one:
# b) shops 39 and 40 seem to be the same?
sales_train[(sales_train["shop_id"]==39) | (sales_train["shop_id"]==40)].groupby(["shop_id","date_block_num"]).sum()
# No, seems to be two separate shops. Both opened in month 14, one closed earlier than the other

In [None]:
#c) Check what shops are in the test-set
print(sorted(test.shop_id.unique()))
test_list=list(test.shop_id.unique())
complete_list=list(range(60))
out_of_test=[x for x in complete_list if x not in test_list]
print(out_of_test)
print(shop_life[shop_life["Stop"]<33])
# 9, 11, 20, are not in test but were active in time_period 33
# What could be the reason?
# Maybe they closed then?
# A good question is whether the train model should also look at the shops that are in test!?
print(shops.loc[9])
#print(shops.loc[11])
# Yes of course this one I deleted manually
print(shops.loc[20])
sales_train[(sales_train["shop_id"]==9) | (sales_train["shop_id"]==20)].groupby(["shop_id","date_block_num"]).sum()["revenue"]
# Aha, yet another trick. There is data only for limited periods for these shops. 
# Lets check if this is the reason and others are consistently in business
sales_train[(sales_train["shop_id"]==3) | (sales_train["shop_id"]==24)].groupby(["shop_id","date_block_num"]).sum()["revenue"]
# Yes looks fine

# I think whether I include these shops or not depends on what I want to achieve.
# Definitely a somewhat different distribution in train and test


In [None]:
# I want to see KPIs over time (prices, revenue per shop per month)
# Let's start with prices
sales_train.groupby("item_id").sum()
# Clearly visible: some items were on sale for only a very short time

In [None]:
sales_train.groupby("item_id").sum()["revenue"].hist(figsize=(20,4),bins=100)
# many many items with little revenue

In [None]:
sales_train.groupby("item_category_id").sum()

In [None]:
sales_train.groupby("item_category_id").sum()["revenue"].hist(figsize=(20,4),bins=100)
# many many categories with little revenue

In [None]:
sales_train.groupby(["date_block_num","item_category_id"]).sum()["revenue"].unstack()
# unstack is a great function!
# https://scentellegher.github.io/programming/2017/07/15/pandas-groupby-multiple-columns-plot.html
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html
# https://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/

sales_train.groupby(["date_block_num","item_category_id"]).sum()["revenue"].unstack().plot(figsize=(20,20))
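
For my own notes, what `unstack` does here, shown on a tiny toy table: the inner index level (the category) becomes the columns, so each category ends up as its own line in the plot.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_block_num":   [0, 0, 0, 1, 1],
    "item_category_id": [2, 2, 7, 2, 7],
    "revenue":          [100.0, 50.0, 30.0, 80.0, 10.0],
})

# Long form: one value per (month, category) pair in a two-level index
long_form = toy.groupby(["date_block_num", "item_category_id"])["revenue"].sum()
# Wide form: months as rows, categories as columns
wide_form = long_form.unstack()
print(wide_form)
```

Pairs that never occur in the data would show up as NaN in the wide form.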

In [None]:
# Let's just look at growth rates of categories (CAGR's)

In [None]:
sales_train.groupby(["date_block_num","meta_category"]).sum()["revenue"].unstack().plot(figsize=(20,20))
# Consoles with 2 big spikes
# Exactly at Christmas! :-) Playstations from Santa Claus. Or Father Frost, as he is called in Russia, I think

In [None]:
sales_train.groupby(["date_block_num","shop_id"]).sum()["revenue"].unstack().plot(figsize=(20,20))
# of course same spikes. Let's normalize

In [None]:
sales_train.groupby(["date","shop_id"]).sum()["revenue"].unstack().plot(figsize=(20,20))
# Very spiky (see above where I saw spikes for first time)
# Looks difficult to predict
# Probably a good idea to model low-cost predictable items separately and then model random big sales

In [None]:
# What is total revenue of company?
sales_train.groupby(["date_block_num"]).sum()["revenue"].plot(figsize=(20,5))

In [None]:
# I am working through the question list now - so maybe a bit random
# Slowly I am starting to think more about how to actually do the modelling
# At the moment I have somewhat lost track of what this will look like and how it actually works.

# Products out of sale:
#All item ids that are being sold in the last month in the train data
sales_train[sales_train.date_block_num==33]["item_id"]
print(sales_train[sales_train.date_block_num==33]["item_id"].nunique())
# All item ids from all times
print(sales_train.item_id.nunique())

# Wow: more than 3/4 of the items are no longer on sale. But it makes sense. Music titles, old programs, old consoles and PCs

# How about the test data?
print(test.item_id.nunique())
# 300 fewer than were sold in the month before
# Are there items that are new?
a=set(test.item_id)
b=set(sales_train[sales_train.date_block_num==33]["item_id"])

In [None]:
new_in_test_items=[x for x in a if x not in b]
print(len(new_in_test_items))
print(new_in_test_items[0:100])
print(sales_train[sales_train["item_id"]==8214])
# ok this is an item that is not sold often (only 2 times in the dataset)
print(sales_train[sales_train["item_id"]==4893]) #(18 times in dataset)

# a) Let's check how many items were sold <100 times - maybe different categories? FMCG vs. B2B

# Really new items:
c=set(sales_train["item_id"])
print(len(c))

new_in_test_items2=[x for x in a if x not in c]
print(len(new_in_test_items2))
print(new_in_test_items2[0:100])

print(sales_train[sales_train["item_id"]==83])
print(sales_train[sales_train["item_id"]==430])

# ok it is working. But funny that such low IDs appear for the first time.
# b) Aren't the IDs ever-growing numbers? Or are certain ID-number-blocks reserved for categories?
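
The list comprehensions above can also be written as a set difference or as a boolean mask with `isin`, which is handy for flagging the rows directly in the test frame. A sketch on toy frames (the ids are made up):

```python
import pandas as pd

toy_train = pd.DataFrame({"item_id": [1, 2, 2, 3], "date_block_num": [10, 32, 33, 33]})
toy_test = pd.DataFrame({"item_id": [2, 3, 4, 5]})

# Items in test that never occur in train
never_seen = sorted(set(toy_test["item_id"]) - set(toy_train["item_id"]))
# Same information as a boolean mask on the test rows
is_new = ~toy_test["item_id"].isin(toy_train["item_id"])
print(never_seen)
print(is_new.tolist())
```

The mask version could later become a feature like "item has no sales history".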

In [None]:
# How many items are only sold rarely?

In [None]:
# Aren't the IDs ever-growing numbers? Or are certain ID-number-blocks reserved for categories?
sales_train[sales_train.item_category_id==11]
# Interesting, the same item (PS3) has many different product ids

In [None]:
print(sales_train[sales_train.items_english=="Igrovye konsoli - PS3"])

In [None]:
print(sales_train[sales_train.items_english=="Igrovye konsoli - PS3"]["item_id"].unique())
print(sales_train[sales_train.items_english=="Igrovye konsoli - PS3"]["item_id"].nunique())

print(sales_train[sales_train.items_english=="Igrovye konsoli - PS3"]["item_price"].unique())
print(sales_train[sales_train.items_english=="Igrovye konsoli - PS3"]["item_price"].nunique())

In [None]:
prices_PS3=sales_train[sales_train.items_english=="Igrovye konsoli - PS3"]["item_price"]

plt.figure(figsize=(20, 8), dpi=80)
plt.scatter(prices_PS3.index, prices_PS3,s=0.1)
# quite a spread
# Why so many item_ids? Does the price correlate with the ID?

In [None]:
sales_train["value"]=1
pivot=pd.pivot_table(sales_train[sales_train.items_english=="Igrovye konsoli - PS3"], values="value", index=["item_id"], columns="item_price", fill_value=0) 
# No, it is not one item id per price


In [None]:
print(new_in_test_items)

In [None]:
# Now examine price changes and price developments of items
sales_train.groupby(["item_id","item_price"]).sum()

In [None]:
# Now examine the categories

# How many items in each category
#sales_train.groupby("item_category_id")["item_id"].nunique()

In [None]:
# Save the final dataset (to not always calculate everything above when restarting the Kernel)
sales_train.to_csv('mycsvfile.csv',index=False)

In [None]:
print(os.listdir("../"))

In [None]:
print(os.listdir("../working"))

In [None]:
# Load the pre-processed dataset to continue from here without always calculating everything above
train=pd.read_csv('../working/mycsvfile.csv')


Open questions / ideas:
- A lot around revenues
    - differences per shop: how many items, how many categories, how much revenue
    - differences between items: how many sold, how different the prices
    - price changes and price spread
    - check especially the online channel/shop: strongly increasing sales, I assume
- categories: How are they distributed? How many items per category? Prices per category?
- seasonality of sales? Difference per shop? Per price? Per category? Per product?
- how much does price change? Over time? Between shops?
- how do weekdays differ in sales? How do months differ? Check the number of weekends specifically to see how November 2015 compares
- some items will surely be out of the catalogue in Nov 2015 and others will have increasing sales (old models vs. new models of e.g. Playstations)
- should closed shops be considered? (shop ids: )
- should shops that are not in test be considered in training?
- should products that are not sold anymore be considered in training?
- Should I include negative revenues in the training? Or should I rather cancel them out against the sales and only model/train net sales? I guess returns are rather random, so it would be good to cancel them out
- looking at the revenue spikes per shop, groups of prices are definitely needed (mass product / B2B contracts / ...)
- correlation between the targeted month and product categories or price/categories will be interesting

* Check seasonality in the data. The forecast is for November. Maybe seasonality works differently per shop? Or per item? And what is the overall trajectory? Probably sales increase linearly over time, with seasonality around this trend, e.g. lower sales in November (relative to the ever-increasing base) ahead of the upcoming Christmas business.
* What do I lose when I aggregate train data to months? Is there a daily / weekly pattern that is relevant to our monthly sales predictions? Maybe number of weekends in November? # of working days!
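
The number of weekend days per month is easy to compute with the standard library. A small sketch; `weekend_days` is a hypothetical helper name, and counting only Saturdays and Sundays (no bank holidays) is a simplification:

```python
import calendar


def weekend_days(year, month):
    """Count the Saturdays and Sundays in the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    # calendar.weekday returns 0 for Monday ... 5 for Saturday, 6 for Sunday
    return sum(
        1
        for day in range(1, days_in_month + 1)
        if calendar.weekday(year, month, day) >= 5
    )


print(weekend_days(2015, 11))
print(weekend_days(2014, 11))
```

The same function could be mapped over all months in the training data to get a "weekend days" feature per date_block_num.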

* Categorize price into categories. Maybe some shops have larger presence in B2B deals
* Does test price distribution look similar? Very small number of very large prices?

- Is it true that the train data does not show all data? That random rows (30% or so) are missing? Maybe it is an idea to construct these rows and fill these NaN values with something meaningful? A moving average or something? These values for train could be learned through a model to then feed a complete dataset into the final algorithm... (easily written - already scared thinking of implementing it :-) )

-----------------------------------------------------------------------------------
For model building part
- set up a training and validation set from the sales_train set
- shuffle rows in training as they are ordered
- Validate with the same month of the year (to ensure the same bank holidays etc.)
- we need to check the item distribution in test and match it with the validation set
- the problem is a time series one. I need to forecast the next month for specific shops and have the past data for these shops
    - LSTM problem, isn't it?
    - or linear model
    - NN? (How would this work? What would the model be?)
- Try one model where rows / items with high prices/revenues are deleted / capped
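
Since this is a time series problem, one option besides shuffling is a time-based split: hold out the last available month as validation and train on everything before it. This is only a sketch of that idea under my own assumptions, on toy monthly data; `train_part` and `valid_part` are hypothetical names:

```python
import pandas as pd

toy_monthly = pd.DataFrame({
    "date_block_num": [31, 31, 32, 32, 33, 33],
    "shop_id":        [5, 6, 5, 6, 5, 6],
    "item_cnt_month": [1, 0, 2, 1, 3, 0],
})

# Hold out the most recent month; everything earlier is used for training
holdout_block = toy_monthly["date_block_num"].max()
train_part = toy_monthly[toy_monthly["date_block_num"] < holdout_block]
valid_part = toy_monthly[toy_monthly["date_block_num"] == holdout_block]
print(len(train_part), len(valid_part))
```

Using November of the previous year as the holdout month instead would follow the "same month" idea from the list above.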
------------------------------------------------------------------------------------
Done:
* split date
* put weekdays into data
* put test and train together (if same columns) (doesn't apply here)
* plot revenue over time per shop
* make first steps of analysis (NaNs)
* analyze for ordinal, categorical values
* Take meta-categories as features - problem with solution now is that if there is no meta-category it takes same value as category


Further notes:
- new_in_test_items: list of items that are new in the test set but haven't been in train
- shop_life: Dataframe with columns: shop_id, opening month, closing month

In [None]:
sample_submission.head(100)
# Do I have to predict only the amount, not the revenue / price???
# Indeed "We are asking you to predict total sales for every product and store in the next month."
# So price information is helpful only as a feature

In [None]:
test.head()

In [None]:
# submission.to_csv('submission.csv',index=False)
# Very first submission resulted in a score of 1.8-something - an extremely bad score, place 863 of 950
# I had the sum of items per month completely wrong

Ok, in week 2 of the course it was said that the score should be 1.16777

"A good exercise is to reproduce previous_value_benchmark. As the name suggest - in this benchmark for the each shop/item pair our predictions are just monthly sales from the previous month, i.e. October 2015.

The most important step at reproducing this score is correctly aggregating daily data and constructing monthly sales data frame. You need to get lagged values, fill NaNs with zeros and clip the values into [0,20] range. If you do it correctly, you'll get precisely 1.16777 on the public leaderboard.

Generating features like this is a necessary basis for more complex models. Also, if you decide to fit some model, don't forget to clip the target into [0,20] range, it makes a big difference."
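
The steps from the quote (aggregate to months, join onto test, fill NaNs with zeros, clip into [0,20]) can be sketched on toy data as follows. This is my reading of the description, not the official benchmark code; all numbers are made up:

```python
import pandas as pd

toy_sales = pd.DataFrame({
    "date_block_num": [33, 33, 33, 33, 32],
    "shop_id":        [5, 5, 5, 6, 6],
    "item_id":        [10, 10, 11, 10, 12],
    "item_cnt_day":   [15.0, 10.0, -1.0, 2.0, 4.0],
})
toy_test = pd.DataFrame({
    "ID":      [0, 1, 2, 3],
    "shop_id": [5, 5, 6, 6],
    "item_id": [10, 11, 10, 12],
})

# 1) Aggregate the daily sales of the last month to monthly totals per shop/item pair
last_month = (
    toy_sales[toy_sales["date_block_num"] == 33]
    .groupby(["shop_id", "item_id"], as_index=False)["item_cnt_day"].sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# 2) Left-join onto test, 3) fill pairs without sales with 0, 4) clip into [0, 20]
submission = toy_test.merge(last_month, on=["shop_id", "item_id"], how="left")
submission["item_cnt_month"] = submission["item_cnt_month"].fillna(0).clip(0, 20)
submission = submission[["ID", "item_cnt_month"]]
print(submission)
```

Note that the clipping happens after the monthly sum, not on the daily values.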

In [None]:
interim= sales_train[sales_train["date_block_num"]==33].groupby(["shop_id", "item_id"],as_index=False).sum()[["shop_id","item_id","item_cnt_day"]]
interim["item_cnt_day"]=interim["item_cnt_day"].clip(0,20)
interim

In [None]:
# the item_cnt_month are not properly entered into the grid
interim2=pd.merge(test, interim, how="left", left_on=["shop_id","item_id"], right_on = ["shop_id","item_id"])
interim2.info()
interim2=interim2[["ID","item_cnt_day"]]
interim2.columns=["ID","item_cnt_month"]
interim2.fillna(0,inplace=True)
interim2

In [None]:
interim2.to_csv('submission2.csv',index=False)
# "Your submission scored 16.05675" ? -> Forgot to clip values
# 1.96214: Still worse than before and lower than mentioned!? -> I had the summing up completely wrong

# v2: Yes: 1.02172, place 376

In [None]:
#Let's try November values from last year
interim= sales_train[sales_train["date_block_num"]==22].groupby(["shop_id", "item_id"],as_index=False).sum()[["shop_id","item_id","item_cnt_day"]]
interim["item_cnt_day"]=interim["item_cnt_day"].clip(0,20)
interim2=pd.merge(test, interim, how="left", left_on=["shop_id","item_id"], right_on = ["shop_id","item_id"])
interim2=interim2[["ID","item_cnt_day"]]
interim2.columns=["ID","item_cnt_month"]
interim2.fillna(0,inplace=True)
interim2.to_csv('submission3.csv',index=False)

# Interesting: 1.60233: Much worse. So October-November seasonal effect smaller than November-November between 2 years
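
To avoid counting months by hand when picking a date_block_num, a tiny helper can translate the index into a calendar month. It assumes that block 0 is January 2013, which matches the data starting at the beginning of 2013 and block 33 being October 2015; `block_to_year_month` is a hypothetical name:

```python
def block_to_year_month(date_block_num):
    """Map a month index to (year, month), assuming block 0 is January 2013."""
    return 2013 + date_block_num // 12, date_block_num % 12 + 1


print(block_to_year_month(22))
print(block_to_year_month(33))
```

With this, block 34 would be the month that has to be predicted.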

In [None]:
# Before starting with more complicated methods lets model something meaningful for items that are sold for the first time
# And let's check if there were items in the data sold in September, not in October, but then again in November

# Data that was for the first time in test:
new_in_test_items2
print(test[test.item_id.isin(new_in_test_items2)])
# No further information on the items. Can categories be learned from IDs? Taking 5320 as an example
sales_train[sales_train.item_id.isin(range(5310,5330))].groupby("item_id").max()
# Not really.
# One more example 3405-3408
sales_train[sales_train.item_id.isin(range(3400,3415))].groupby("item_id").max()
# Here the category seems to be Igry PC. But prices and counts vary very much

# Idea could be:
# - if category before and after the ID is the same use the average of this category
# - if they do not match take some average (eg of both categories or of all categories)

In [None]:
# Now for the next idea: items that were sold in september but not october
september = set(sales_train[sales_train.date_block_num==32].item_id)
october = set(sales_train[sales_train.date_block_num==33].item_id)
november = set(test.item_id)
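# Note: this keeps every test item that was not sold in October but exists somewhere in train.
# September itself is only enforced further down, where the month-32 rows are filtered with this list.
sep_but_not_oct=[x for x in november if x not in october and x not in new_in_test_items2]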
sep_but_not_oct
print(len(september))
print(len(october))
print(len(november))
print(len(sep_but_not_oct))
# 746 items where we could use the September figures

In [None]:
interim= sales_train[sales_train["date_block_num"]==33].groupby(["shop_id", "item_id"],as_index=False).sum()[["shop_id","item_id","item_cnt_day"]]
interim["item_cnt_day"]=interim["item_cnt_day"].clip(0,20)

interim2=sales_train[sales_train["date_block_num"]==32].groupby(["shop_id", "item_id"],as_index=False).sum()[["shop_id","item_id","item_cnt_day"]]
interim2=interim2[interim2.item_id.isin(sep_but_not_oct)]
interim2["item_cnt_day"]=interim2["item_cnt_day"].clip(0,20)

interim3=pd.merge(test, interim, how="left", left_on=["shop_id","item_id"], right_on = ["shop_id","item_id"])
interim3=pd.merge(interim3, interim2, how="left", left_on=["shop_id","item_id"], right_on = ["shop_id","item_id"])

interim3.fillna(0,inplace=True)

interim3["item_cnt_month"]=interim3[["item_cnt_day_x","item_cnt_day_y"]].max(axis=1) 

print(interim3)
interim4=interim3[["ID","item_cnt_month"]]
print(interim4)
interim4.to_csv('submission4.csv',index=False)

# Scored worse: 1.16602 - but I am not sure that the operations above did what I wanted them to do.
# I have to check more closely in the next working session

In [None]:
sales_train.groupby(["item_id","shop_id","date_block_num"],as_index=False).sum()[["item_id","shop_id","date_block_num","item_cnt_day"]]

# Need to figure out how to only include the latest date_block_num per item per shop in table as lookup
# value for test

In [None]:
abc=sales_train.groupby(["item_id","shop_id","date_block_num"],as_index=False).sum()[["item_id","shop_id","date_block_num","item_cnt_day"]].head(10)
abc

In [None]:
# Now find out how to get only the rows where item_id and shop_id are the same and date_block_num is max
# I.e. I want to have a table with the most recent item_cnt of a specific item per shop as lookup table to fill this
# most recent item_cnt into the test.
abc.groupby(["item_id","shop_id"]).last()
#YES!
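
`last()` only returns the most recent month because the grouped table happens to be sorted by date_block_num within each pair. A sketch of a variant that makes the ordering explicit, on toy data where the rows are deliberately out of order:

```python
import pandas as pd

toy_monthly = pd.DataFrame({
    "item_id":        [10, 10, 10, 11],
    "shop_id":        [5, 5, 6, 5],
    "date_block_num": [33, 30, 12, 7],
    "item_cnt_day":   [3.0, 1.0, 2.0, 4.0],
})

# Sort by month, then keep only the last (most recent) row per item/shop pair
latest = (
    toy_monthly.sort_values("date_block_num")
    .drop_duplicates(subset=["item_id", "shop_id"], keep="last")
    .sort_values(["item_id", "shop_id"])
    .reset_index(drop=True)
)
print(latest)
```

Keeping `date_block_num` in the result also makes it easy to filter by the age of the last sale afterwards.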

In [None]:
interim5=sales_train.groupby(["item_id","shop_id","date_block_num"],as_index=False).sum()[["item_id","shop_id","date_block_num","item_cnt_day"]].groupby(["item_id","shop_id"],as_index=False).last()
interim5=interim5[["item_id","shop_id","item_cnt_day"]]
interim5

In [None]:
interim5["item_cnt_day"]=interim5["item_cnt_day"].clip(0,20)
interim6=pd.merge(test, interim5, how="left", left_on=["shop_id","item_id"], right_on = ["shop_id","item_id"])
interim6=interim6[["ID","item_cnt_day"]]
interim6.columns=["ID","item_cnt_month"]
interim6.fillna(0,inplace=True)
interim6.to_csv('submission5.csv',index=False)
interim6
# scored 1.38739, worse than October alone (1.02172) and also worse than September and October together (1.16602). Probably some items are way too old
# let's restrict age

In [None]:
interim=sales_train.groupby(["item_id","shop_id","date_block_num"],as_index=False).sum()[["item_id","shop_id","date_block_num","item_cnt_day"]].groupby(["item_id","shop_id"],as_index=False).last()
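# Note: "<25" keeps only pairs whose last sale was BEFORE month 25; restricting to recent sales would need ">" instead
interim=interim[interim["date_block_num"]<25]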
interim=interim[["item_id","shop_id","item_cnt_day"]]
interim["item_cnt_day"]=interim["item_cnt_day"].clip(0,20)
interim7=pd.merge(test, interim, how="left", left_on=["shop_id","item_id"], right_on = ["shop_id","item_id"])
interim7=interim7[["ID","item_cnt_day"]]
interim7.columns=["ID","item_cnt_month"]
interim7.fillna(0,inplace=True)
interim7.to_csv('submission6.csv',index=False)
interim7
# Only slightly better: 1.32017
# So this is a dead end. Lets start with the modelling

* Ok. I now have a bit of an understanding of the data.

Let's start with step 3

3. Prepare Data
a) Data Cleaning
b) Feature Selection
c) Data Transforms (Normalize,...)

The kernel seems to have reached some capacity bottleneck. I'll continue the modelling [here](https://www.kaggle.com/dennise/coursera-competition-modelling?scriptVersionId=7573203):


Exporting the data to upload in this second kernel:

In [None]:
sales_train.to_csv('sales_train.csv',index=False)

It took me a while to figure out how to get such an output file from one kernel into another kernel. Here is how it goes:
- Go to your new Kernel
- Edit it
- Check on the right-hand side: There is an "Add Data" button where you can link to your previous work

![](https://i.imgur.com/ClP9kLb.png)

And pay attention. In this second Kernel there are now subfolders in the "input" folder:
![](https://i.imgur.com/rMZ3Kb8.png)