
- **`sales_train.csv`** Rows: 2935849 sales (January 2013 -> October 2015)
  - **date**: date in format dd.mm.yyyy.
  - **date_block_num**: a consecutive month number. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
  - **shop_id**: unique identifier of a shop
  - **item_id**: unique identifier of a product
  - **item_price**: current price of an item
  - **item_cnt_day**: number of products sold on that day. The prediction target is the monthly total of this measure.
- **`shops.csv`** Rows: 60 shops
  - **shop_id**
  - **shop_name**: name of shop (RUSSIAN 🇷🇺)
- **`items.csv`** Rows: 22170 products
  - **item_id**
  - **item_name**: name of item (RUSSIAN 🇷🇺)
  - **item_category_id**: unique identifier of item category
- **`item_categories.csv`** Rows: 84 product categories
  - **item_category_id**
  - **item_category_name**: name of item category (RUSSIAN 🇷🇺)
- **`test.csv`** Rows: 214200 (Shop, Item) pairs
  - **ID**: an Id that represents a (Shop, Item) tuple within the test set
  - **shop_id**
  - **item_id**
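
Since the target is a monthly total of `item_cnt_day`, the daily rows have to be summed per (`date_block_num`, `shop_id`, `item_id`) at some point. The following is a minimal sketch of that aggregation, assuming a small hand-made frame with the same columns as `sales_train.csv`; the row values and the column name `item_cnt_month` are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy rows shaped like sales_train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "date": ["02.01.2013", "15.01.2013", "03.02.2013", "04.02.2013"],
    "date_block_num": [0, 0, 1, 1],
    "shop_id": [25, 25, 25, 59],
    "item_id": [2552, 2552, 2552, 22154],
    "item_price": [899.0, 899.0, 899.0, 999.0],
    "item_cnt_day": [1.0, -1.0, 2.0, 1.0],
})

# Sum the daily counts within each month for every (shop, item) pair.
monthly = (
    toy.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
       .sum()
       .rename(columns={"item_cnt_day": "item_cnt_month"})  # hypothetical column name
)
print(monthly)
```

Grouping on `date_block_num` rather than on the `date` string avoids having to parse dates just to find the month.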


In [7]:
#!pip install missingno

Collecting missingno
  Downloading missingno-0.4.2-py3-none-any.whl (9.7 kB)
Installing collected packages: missingno
Successfully installed missingno-0.4.2


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import missingno as m
import seaborn as sns
from sklearn.ensemble import IsolationForest
from scipy import stats
import matplotlib.pyplot as plt  # plotting API (the bare `matplotlib` package has no plot functions)



# Preprocessing

from tqdm.notebook import tqdm  # `tqdm_notebook` is deprecated in recent tqdm releases
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


# Machine Learning
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes  import MultinomialNB
from sklearn.naive_bayes  import BernoulliNB
from sklearn.ensemble     import RandomForestClassifier
from xgboost              import XGBClassifier

# Machine Learning Evaluation
from sklearn.metrics         import accuracy_score, f1_score
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score


In [3]:
path = "../../datasets/predict-future-sales/"

train = pd.read_csv(path+"sales_train.csv") # Daily sales  Jan 2013 -> Oct 2015
shops = pd.read_csv(path+"shops-translated.csv")       # Shops    (60)
items = pd.read_csv(path+"items-translated.csv")       # Products  (22170)
oritem = pd.read_csv(path+"items.csv")                 # Products, original (untranslated) names
cats  = pd.read_csv(path+"item_categories-translated.csv") # Product categories (84)
test  = pd.read_csv(path+"test.csv", index_col="ID") # predict November 2015
sub   = pd.read_csv(path+"sample_submission.csv", index_col="ID")


In [4]:
train

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.00,1.0
1,03.01.2013,0,25,2552,899.00,1.0
2,05.01.2013,0,25,2552,899.00,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.00,1.0
...,...,...,...,...,...,...
2935844,10.10.2015,33,25,7409,299.00,1.0
2935845,09.10.2015,33,25,7460,299.00,1.0
2935846,14.10.2015,33,25,7459,349.00,1.0
2935847,22.10.2015,33,25,7440,299.00,1.0


In [7]:
dailysell = train.groupby(['date'])['item_cnt_day'].sum()

In [8]:
dailysell

date
01.01.2013     1951.0
01.01.2014     2310.0
01.01.2015     2117.0
01.02.2013     3817.0
01.02.2014     5711.0
               ...   
31.10.2013     3826.0
31.10.2014     3014.0
31.10.2015     3104.0
31.12.2013    10514.0
31.12.2014    11394.0
Name: item_cnt_day, Length: 1034, dtype: float64
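
The index above is ordered lexicographically because `date` is still a dd.mm.yyyy string, so `01.01.2014` lands directly after `01.01.2013`. One possible way to get a chronological series is to parse the column before grouping. The sketch below assumes a small made-up frame rather than the real `train` data.

```python
import pandas as pd

# Toy daily rows; the values are made up for illustration.
toy = pd.DataFrame({
    "date": ["01.02.2013", "01.01.2014", "01.01.2013", "01.01.2013"],
    "item_cnt_day": [3.0, 2.0, 1.0, 4.0],
})

# Parse the day.month.year strings into real timestamps.
toy["date"] = pd.to_datetime(toy["date"], format="%d.%m.%Y")

# Grouping on datetimes yields a chronologically sorted index.
daily = toy.groupby("date")["item_cnt_day"].sum()
print(daily)
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first interpretations of the string.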

In [11]:
df = pd.DataFrame(dailysell).reset_index()

In [12]:
df

Unnamed: 0,date,item_cnt_day
0,01.01.2013,1951.0
1,01.01.2014,2310.0
2,01.01.2015,2117.0
3,01.02.2013,3817.0
4,01.02.2014,5711.0
...,...,...
1029,31.10.2013,3826.0
1030,31.10.2014,3014.0
1031,31.10.2015,3104.0
1032,31.12.2013,10514.0
