<h1 style="font-size:30px">Digit Recognizer</h1>
<br>
This is a Kaggle competition in which the task is to correctly classify digits from the MNIST ("Modified National Institute of Standards and Technology") dataset of handwritten images.
<br>
<hr>
<h1 style="font-size:18px">Objective</h1>
<br>
Our goal is to learn computer vision fundamentals by training neural networks to correctly identify digits from a dataset of tens of thousands of handwritten images.
<hr>
<h1 style="font-size:18px">Content</h1>
<br>
This kernel will be divided into:
1. <a href="#basic">Basic Information</a>
2. <a href="#engineering">Feature Engineering</a>
3. <a href="#numeric">Numeric Features</a>

<h1 style="font-size:18px">File description</h1>
* train.csv - training set
* test.csv - test set
* sample_submission.csv - sample submission file in the correct format

<h1 style="font-size:18px">Data description</h1><br>
The data files contain gray-scale images of hand-drawn digits, from zero through nine.<br>
Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel value is an integer between 0 and 255, inclusive.<br>
<br>
* label - digit that was drawn by the user
* pixelX - contains the value of pixel X of the associated image
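
To make the pixel layout concrete, here is a minimal sketch that reshapes one flat row of 784 pixel values into a 28 x 28 grid. It assumes row-major ordering, so that pixelX sits at row X // 28 and column X % 28; the `row` array is synthetic and is not read from train.csv.

```python
import numpy as np

# Synthetic stand-in for one image row: 784 integer pixel values in [0, 255]
rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=784)

# Reshape the flat vector into a 28 x 28 grid (row-major order assumed)
image = row.reshape(28, 28)

# Under this layout, pixelX sits at row X // 28, column X % 28
x = 100
assert image[x // 28, x % 28] == row[x]
print(image.shape)
```

The same reshape could be applied to a real row after dropping the label column, for example before plotting it with `plt.imshow`.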

<h1 style="font-size:18px">Import libraries</h1>

In [1]:
# Numpy for numerical computing
import numpy as np

# Pandas for Dataframes
import pandas as pd
pd.set_option('display.max_columns',100)

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

# Datetime for handling date formats
import datetime as dt

<h1 style="font-size:18px">Load files</h1>

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

<br id="basic">
# 1. Basic Information
Let's first check some basic information about each loaded file, such as:
* Dimension
* Features type
* Number of missing values
* View the first 3 rows

In [3]:
# Dataframe dimensions
print('The dimension of the training set is:',train.shape,'\n')
print('The feature types are:\n', train.dtypes,'\n')
print('Number of missing values:\n',train.isnull().sum())
train.head(3)

The dimension of the training set is: (42000, 785) 

The feature types are:
 label       int64
pixel0      int64
pixel1      int64
pixel2      int64
pixel3      int64
pixel4      int64
pixel5      int64
pixel6      int64
pixel7      int64
pixel8      int64
pixel9      int64
pixel10     int64
pixel11     int64
pixel12     int64
pixel13     int64
pixel14     int64
pixel15     int64
pixel16     int64
pixel17     int64
pixel18     int64
pixel19     int64
pixel20     int64
pixel21     int64
pixel22     int64
pixel23     int64
pixel24     int64
pixel25     int64
pixel26     int64
pixel27     int64
pixel28     int64
            ...  
pixel754    int64
pixel755    int64
pixel756    int64
pixel757    int64
pixel758    int64
pixel759    int64
pixel760    int64
pixel761    int64
pixel762    int64
pixel763    int64
pixel764    int64
pixel765    int64
pixel766    int64
pixel767    int64
pixel768    int64
pixel769    int64
pixel770    int64
pixel771    int64
pixel772    int64
pixel773    int64
pixel

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,pixel11,pixel12,pixel13,pixel14,pixel15,pixel16,pixel17,pixel18,pixel19,pixel20,pixel21,pixel22,pixel23,pixel24,pixel25,pixel26,pixel27,pixel28,pixel29,pixel30,pixel31,pixel32,pixel33,pixel34,pixel35,pixel36,pixel37,pixel38,pixel39,pixel40,pixel41,pixel42,pixel43,pixel44,pixel45,pixel46,pixel47,pixel48,...,pixel734,pixel735,pixel736,pixel737,pixel738,pixel739,pixel740,pixel741,pixel742,pixel743,pixel744,pixel745,pixel746,pixel747,pixel748,pixel749,pixel750,pixel751,pixel752,pixel753,pixel754,pixel755,pixel756,pixel757,pixel758,pixel759,pixel760,pixel761,pixel762,pixel763,pixel764,pixel765,pixel766,pixel767,pixel768,pixel769,pixel770,pixel771,pixel772,pixel773,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
print('The dimension of the test set is:',test.shape,'\n')
print('The feature types are:\n', test.dtypes,'\n')
print('Number of missing values:\n',test.isnull().sum())
test.head(3)

The dimension of the test set is: (28000, 784) 

The feature types are:
 pixel0      int64
pixel1      int64
pixel2      int64
pixel3      int64
pixel4      int64
pixel5      int64
pixel6      int64
pixel7      int64
pixel8      int64
pixel9      int64
pixel10     int64
pixel11     int64
pixel12     int64
pixel13     int64
pixel14     int64
pixel15     int64
pixel16     int64
pixel17     int64
pixel18     int64
pixel19     int64
pixel20     int64
pixel21     int64
pixel22     int64
pixel23     int64
pixel24     int64
pixel25     int64
pixel26     int64
pixel27     int64
pixel28     int64
pixel29     int64
            ...  
pixel754    int64
pixel755    int64
pixel756    int64
pixel757    int64
pixel758    int64
pixel759    int64
pixel760    int64
pixel761    int64
pixel762    int64
pixel763    int64
pixel764    int64
pixel765    int64
pixel766    int64
pixel767    int64
pixel768    int64
pixel769    int64
pixel770    int64
pixel771    int64
pixel772    int64
pixel773    int64
pixel774 

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,pixel11,pixel12,pixel13,pixel14,pixel15,pixel16,pixel17,pixel18,pixel19,pixel20,pixel21,pixel22,pixel23,pixel24,pixel25,pixel26,pixel27,pixel28,pixel29,pixel30,pixel31,pixel32,pixel33,pixel34,pixel35,pixel36,pixel37,pixel38,pixel39,pixel40,pixel41,pixel42,pixel43,pixel44,pixel45,pixel46,pixel47,pixel48,pixel49,...,pixel734,pixel735,pixel736,pixel737,pixel738,pixel739,pixel740,pixel741,pixel742,pixel743,pixel744,pixel745,pixel746,pixel747,pixel748,pixel749,pixel750,pixel751,pixel752,pixel753,pixel754,pixel755,pixel756,pixel757,pixel758,pixel759,pixel760,pixel761,pixel762,pixel763,pixel764,pixel765,pixel766,pixel767,pixel768,pixel769,pixel770,pixel771,pixel772,pixel773,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


So far the data looks pretty good. There are no missing values, and the features seem to have the correct type.
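
With 785 columns, printing every dtype and every per-column null count produces output that gets truncated. A more compact check, sketched here on a tiny synthetic frame (the column names only mimic the real layout), counts columns per dtype and sums the missing cells over the whole frame.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame mimicking the layout: one label column plus pixel columns
df = pd.DataFrame(
    np.zeros((3, 5), dtype="int64"),
    columns=["label", "pixel0", "pixel1", "pixel2", "pixel3"],
)

# Count how many columns share each dtype instead of listing every column
dtype_counts = df.dtypes.value_counts()

# Total number of missing cells across the whole frame
total_missing = int(df.isnull().sum().sum())

print(dtype_counts)
print("Total missing values:", total_missing)
```

Running the same two expressions on `train` and `test` should give a one-line summary for each file.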

<br id="engineering">
# 2. Feature Engineering

Before we continue, let's do some preliminary feature engineering to make the data easier to deal with.<br>
<br>
First, let's change the date type from object to datetime. Then we split the **date** feature into 3 new columns for **year**, **month** and **day**.

In [7]:
import tensorflow as tf

ModuleNotFoundError: No module named 'tensorflow'

In [None]:
# Change the date type
date = train.date.apply(lambda x:dt.datetime.strptime(x, '%d.%m.%Y'))

# Create 3 new features for year, month and day
train['year'] = date.dt.year
train['month'] = date.dt.month
train['day'] = date.dt.day
train.head()

# Remove the "date" feature
train = train.drop('date', axis=1)
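
As an alternative to calling `strptime` row by row, the conversion can be sketched with `pd.to_datetime`, which parses the whole column at once. The toy frame below is an assumption made for illustration; only the `date` column name and the `%d.%m.%Y` format come from the cell above.

```python
import pandas as pd

# Toy frame with a "date" column in day.month.year format (values are made up)
df = pd.DataFrame({"date": ["02.01.2013", "15.10.2014", "31.12.2015"]})

# Parse all strings at once; the explicit format avoids day/month ambiguity
date = pd.to_datetime(df["date"], format="%d.%m.%Y")

# Create the three new features, then drop the original column
df["year"] = date.dt.year
df["month"] = date.dt.month
df["day"] = date.dt.day
df = df.drop("date", axis=1)
print(df)
```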

Checking the dataset, we can see that one feature appears to be missing from the **train set**: item_category_id.<br>
The **item_category_id** and the **item_id** are linked in the **items set**, so we can add the category as another feature.

In [None]:
# Add the "item_category_id" to the dataset
train = pd.merge(train, items.drop('item_name', axis=1), on='item_id')
train.head()
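
One thing worth checking with this kind of merge is whether rows are lost. `pd.merge` performs an inner join by default, so any row whose `item_id` is absent from the lookup table would be dropped. The sketch below uses small made-up frames (named `sales` and `items` for illustration) and a left join to keep every row.

```python
import pandas as pd

# Toy stand-ins for the sales rows and the item lookup table (values are made up)
sales = pd.DataFrame({"item_id": [10, 11, 10, 12], "item_cnt_day": [1, 2, 1, 3]})
items = pd.DataFrame({
    "item_id": [10, 11],
    "item_name": ["A", "B"],
    "item_category_id": [40, 55],
})

# A left join keeps every sales row; items missing from the lookup get NaN
merged = pd.merge(sales, items.drop("item_name", axis=1), on="item_id", how="left")

# Compare row counts and count unmatched rows
n_unmatched = int(merged["item_category_id"].isnull().sum())
print(len(sales), len(merged), n_unmatched)
```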

Since we have both the current price of the item (item_price) and the number of items sold (item_cnt_day), we can create another feature called "revenue" by multiplying them row by row.

In [None]:
# Create "revenue" feature
train['revenue'] = train.item_price*train.item_cnt_day
train.head()
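
To illustrate the difference between a row-by-row product and a dot product, here is a small sketch on made-up values. The element-wise product gives one revenue value per row, while `Series.dot` collapses both columns into a single total.

```python
import pandas as pd

# Made-up prices and daily counts for three sales rows
df = pd.DataFrame({"item_price": [100.0, 250.0, 40.0], "item_cnt_day": [2, 1, 5]})

# Element-wise product: one revenue value per row
df["revenue"] = df["item_price"] * df["item_cnt_day"]

# A dot product instead returns a single number, the total revenue
total = df["item_price"].dot(df["item_cnt_day"])

print(df["revenue"].tolist(), total)
```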

<br id="numeric">
# 3. Numeric Features
To make the numeric features easier to study, we will group the data by year and month to see whether there is any seasonality.<br>
<br>
Firstly we will analyze the feature item_cnt_day, which gives the number of products sold.

In [None]:
# Plot the total number of products sold by year
train.groupby('year').item_cnt_day.sum().plot()
plt.xticks(np.arange(2013, 2016, 1))
plt.xlabel('Year')
plt.ylabel('Total number of products sold')
plt.show()

# Plot the total number of products sold by month for each year
train.groupby(['month','year'])['item_cnt_day'].sum().unstack().plot()
plt.xlabel('Month')
plt.ylabel('Total number of products sold')
plt.show()
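
The month-by-year plot relies on `unstack`, which pivots the inner level of the group index into columns. A minimal sketch on made-up numbers shows the table that ends up being plotted, with one row per month and one column per year.

```python
import pandas as pd

# Made-up sales rows spanning two years and two months
df = pd.DataFrame({
    "year": [2013, 2013, 2013, 2014, 2014],
    "month": [1, 1, 2, 1, 2],
    "item_cnt_day": [3, 2, 4, 1, 6],
})

# Select the column before aggregating so only that column is summed,
# then pivot "year" (the inner index level) into columns
monthly = df.groupby(["month", "year"])["item_cnt_day"].sum().unstack()
print(monthly)
```

Calling `.plot()` on a table shaped like `monthly` draws one line per column, which is why each year gets its own curve.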

We can see that the number of products sold decreases over the years.<br>
Looking at the months, sales seem to vary within a certain range until October and then increase sharply.<br>
<br>
Now let's check the revenue behavior.

In [None]:
# Plot the total revenue by year
train.groupby('year').revenue.sum().plot()
plt.xticks(np.arange(2013, 2016, 1))
plt.xlabel('Year')
plt.ylabel('Total revenue')
plt.show()

# Plot the total revenue by month for each year
train.groupby(['month','year'])['revenue'].sum().unstack().plot()
plt.xlabel('Month')
plt.ylabel('Total revenue')
plt.show()

The revenue behavior is a little different from the number of total sales.<br>
In 2014 the total revenue increased, even though the total number of sales decreased from 2013. This is due to the "item_price" variable, which can fluctuate over time.<br>
We can also observe that, across the months, even though the number of products sold decreased, the revenue seems similar for the three years.<br>
<br>
Let's look at the top 10 items and the top 10 shops.

In [None]:
# Plot the top 10 items
sns.countplot(y='item_id', hue='year', data=train, order = train['item_id'].value_counts().iloc[:10].index)
plt.xlim(0,20000)
plt.xlabel('Number of times the item was sold')
plt.ylabel('Identifier of the item')
plt.show()

# Plot the top 10 shops
sns.countplot(y='shop_id', hue='year', data=train, order = train['shop_id'].value_counts().iloc[:10].index)
plt.xlabel('Number of times the shop sold')
plt.ylabel('Identifier of the shop')
plt.show()
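
The `order` argument in both plots comes from `value_counts`, which sorts by frequency in descending order, so the first N index labels are the N most frequent values. A small sketch on made-up identifiers:

```python
import pandas as pd

# Made-up item identifiers: 5 appears three times, 7 twice, 9 and 3 once
df = pd.DataFrame({"item_id": [5, 7, 5, 9, 5, 7, 3]})

# The first two index labels of value_counts are the two most frequent ids
top2 = df["item_id"].value_counts().iloc[:2].index.tolist()
print(top2)
```

Note that this ranks items by the number of rows in which they appear, not by the quantity sold in each row.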

Item 20949 is by far the sales champion over the years!<br>
The top 10 shops have similar sales behavior over the years.