# Exploration of dataset

## Load the dataset

In [5]:
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None

In [None]:
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

In [None]:
train_data.head(10)

## Target variable distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='target', data=train_data)
plt.show()

In [None]:
train_data['target'].value_counts()

As the plot and the counts above show, the target distribution is **skewed**. With a skewed distribution, `accuracy_score` is a poor evaluation metric, because a model that always predicts the majority class would still score high. We will use the *area under the ROC curve (AUC)* instead. *Precision* or *Recall* would also be reasonable choices, but AUC summarizes the trade-off between the true positive rate (recall) and the false positive rate across all thresholds, so we choose AUC.
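
To see why accuracy misleads on a skewed target, here is a minimal sketch on made-up labels (the 90/10 split and the constant predictions are assumptions for illustration, not taken from this dataset). A model that always predicts the majority class gets high accuracy, yet its AUC shows that it cannot rank positives above negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical skewed labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)     # hard class predictions
y_score = np.zeros(100, dtype=float)  # the same constant score for every sample

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

print('Accuracy:', acc)  # high, although nothing was learned
print('AUC     :', auc)  # chance level
```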

## Features

As we can make out from the column names, there are:
+ 05 Binary variables
+ 10 Nominal variables
+ 06 Ordinal variables
+ 02 Cyclic variables
+ 01 Target variable (It is a label or actual target column, not a feature)

Looking at the first 10 rows, the columns contain both numbers and strings, and there are `NaN` values as well. Machine learning models operate on numbers during training and evaluation, so we need to transform the strings into numbers before a model can learn patterns from this dataset and use them later for inference.

### Ordinal variables

Take the `ord_2` column from the dataset: its categories are stored as strings.

In [None]:
train_data['ord_2'].value_counts()

For columns of this type, we can define a dictionary that maps each category to a number and then replace each category in the column with its number.

In [None]:
list(train_data['ord_2'].value_counts().index)

In [None]:
mapping = {key:idx for idx,key in enumerate(train_data['ord_2'].value_counts().index)}
mapping

In [None]:
train_data['ord_2'] = train_data['ord_2'].map(mapping)

In [None]:
train_data['ord_2'].value_counts()
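
Note that `value_counts()` orders categories by frequency, so the mapping above does not necessarily follow the natural order of the categories. If the order matters, one option is to write the dictionary by hand. The sketch below assumes a temperature-like ordering of the `ord_2` categories and uses a small made-up Series rather than the real column; both are assumptions for illustration.

```python
import pandas as pd

# Assumed natural order of the categories (coldest to hottest)
ordered_mapping = {
    'Freezing': 0,
    'Cold': 1,
    'Warm': 2,
    'Hot': 3,
    'Boiling Hot': 4,
    'Lava Hot': 5,
}

# Made-up sample values standing in for train_data['ord_2']
sample = pd.Series(['Hot', 'Freezing', 'Lava Hot', None, 'Cold'])

# Values missing from the dictionary (including NaN) stay NaN after map()
encoded = sample.map(ordered_mapping)
print(encoded.tolist())
```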

## Label Encoding

Label encoding encodes each category as a numerical label, which is what we did in the `ord_2` column transformation above.

We can do the same thing with `LabelEncoder` from the `sklearn` package.

In [None]:
from sklearn import preprocessing

In [None]:
train_data = pd.read_csv('../input/train.csv')

In [None]:
# Fill NaN values in ord_2 column
train_data['ord_2'] = train_data['ord_2'].fillna('NONE')
train_data['ord_2'].value_counts()

In [None]:
# LabelEncoding
lbl_enc = preprocessing.LabelEncoder()

# Fit the encoder on the data
lbl_enc.fit(train_data['ord_2'].values)

train_data['ord_2'] = lbl_enc.transform(train_data['ord_2'].values)

In [None]:
train_data['ord_2'].value_counts()

> *`LabelEncoder` doesn't handle `NaN` values, so we need to fill them before fitting and transforming the column.*
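
As a small illustration of how `LabelEncoder` assigns labels, the sketch below uses a made-up list of values (an assumption, not the real column). The fitted `classes_` attribute lists the categories in sorted order, and a category's position in that list is its label; `inverse_transform` maps labels back to the categories.

```python
from sklearn import preprocessing

# Made-up values; 'NONE' stands in for a filled NaN
values = ['Freezing', 'Warm', 'Cold', 'NONE', 'Warm']

encoder = preprocessing.LabelEncoder()
labels = encoder.fit_transform(values)

print(list(encoder.classes_))                   # categories in sorted order
print(labels.tolist())                          # label = position in classes_
print(list(encoder.inverse_transform(labels)))  # back to the original strings
```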

**LabelEncoder** can be used directly in tree-based models:
+ Decision trees
+ Random forest
+ Extra trees
+ Boosted trees:
  + XGBoost
  + GBM
  + LightGBM
  
This type of encoding is not suitable for linear models, SVMs, or neural networks, as they expect data to be normalized (or standardized) and would treat the integer labels as magnitudes.

### Binarized Encoding

For the models mentioned above, we can binarize the labels, as shown below:

Category | Label | Bin_Label_0 | Bin_Label_1 | Bin_Label_2
--- | --- | --- | --- | ---
Freezing | 0 | 0 | 0 | 0
Warm | 1 | 0 | 0 | 1
Cold | 2 | 0 | 1 | 0
Boiling Hot | 3 | 0 | 1 | 1
Hot | 4 | 1 | 0 | 0
Lava Hot | 5 | 1 | 0 | 1
NONE | 6 | 1 | 1 | 0

Since we had only 7 categories, 3 binarized columns are enough to represent the one original column. With a huge number of categories, however, we end up with a large number of binarized columns, and the data becomes sparsely populated (i.e., relatively few entries are 1s).

If we store the binarized variables in a **sparse format**, i.e., store only the relevant values (the 1s), they take much less memory.
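
The table above can be reproduced by writing each label in base 2. A minimal sketch with NumPy follows; the helper name `binarize_labels` is hypothetical and not part of any library.

```python
import numpy as np

def binarize_labels(labels, n_categories):
    """Return one row of 0/1 bits per label, most significant bit first."""
    labels = np.asarray(labels)
    # Number of bits needed to represent the largest label (n_categories - 1)
    n_bits = max(1, int(n_categories - 1).bit_length())
    # Shift each label right by every bit position and keep the lowest bit
    shifts = np.arange(n_bits - 1, -1, -1)
    return (labels[:, None] >> shifts) & 1

# Labels 0..6 correspond to the 7 categories in the table
bits = binarize_labels([0, 1, 2, 3, 4, 5, 6], n_categories=7)
print(bits)
```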

### Comparison between dense and sparse format storage

In [None]:
# Example stored as dense matrix
import numpy as np

example = np.array(
[
    [0, 0, 1],
    [1, 0, 0],
    [1, 0, 1]
])

print(example.nbytes)

So it takes 72 bytes to store the data in dense format (9 elements of 8 bytes each, assuming NumPy's default 64-bit integer dtype). Let's see what happens if we store the data in sparse format.

In [None]:
import numpy as np
from scipy import sparse

example = np.array(
[
    [0, 0, 1],
    [1, 0, 0],
    [1, 0, 1]
])

sparse_example = sparse.csr_matrix(example)

print(sparse_example.data.nbytes)

# The total size of sparse csr matrix is the sum of three values
print(sparse_example.data.nbytes + 
     sparse_example.indptr.nbytes + 
     sparse_example.indices.nbytes)

We can see that the sparse format takes less space than the dense format. The difference can be very large for larger arrays.

In [1]:
import numpy as np
from scipy import sparse

bytes_to_gb = lambda x: x/(10**9)
sp_mat_total_size = lambda spm: spm.data.nbytes + spm.indptr.nbytes + spm.indices.nbytes

n_rows = 10000
n_cols = 100000

# Let's build a dense matrix with only 5% 1s
example = np.random.binomial(n=1, p=0.05, size=(n_rows, n_cols))

# print dense matrix size
print('Dense size       :', example.nbytes)

sp_mat = sparse.csr_matrix(example)

# print sparse matrix size and total_size
print('Sparse size      :', sp_mat.data.nbytes)
print('Sparse total size:', sp_mat_total_size(sp_mat))


print('Dense size (GB)       :', bytes_to_gb(example.nbytes))
print('Sparse size (GB)      :', bytes_to_gb(sp_mat.data.nbytes))
print('Sparse total size (GB):', bytes_to_gb(sp_mat_total_size(sp_mat)))

Dense size       : 8000000000
Sparse size      : 399981480
Sparse total size: 600012224
Dense size (GB)       : 8.0
Sparse size (GB)      : 0.39998148
Sparse total size (GB): 0.600012224


Comparing the two, the dense matrix takes 8 GB, whereas the sparse matrix takes about 0.6 GB in total.

### One-Hot Encoding

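One-hot encoding is also a kind of binary encoding, in the sense that it uses only 0s and 1s, but it is not the binary representation of the label. Each category gets its own column, and every row has exactly one 1. Stored in sparse format, it typically takes even less memory than binarized encoding.

Let's one-hot encode our `ord_2` categories, which can be represented as: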

Category | Category_Freezing | Category_Warm | Category_Cold | Category_Boiling_Hot | Category_Hot | Category_Lava_Hot | Category_NONE
--- | --- | --- | --- | --- | --- | --- | ---
Freezing | 1 | 0 | 0 | 0 | 0 | 0 | 0
Warm | 0 | 1 | 0 | 0 | 0 | 0 | 0
Cold | 0 | 0 | 1 | 0 | 0 | 0 | 0
Boiling Hot | 0 | 0 | 0 | 1 | 0 | 0 | 0
Hot | 0 | 0 | 0 | 0 | 1 | 0 | 0
Lava Hot | 0 | 0 | 0 | 0 | 0 | 1 | 0
NONE | 0 | 0 | 0 | 0 | 0 | 0 | 1
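
A minimal sketch of the same idea with scikit-learn's `OneHotEncoder`, which returns a sparse matrix by default. The label array is made up for illustration; the output columns follow the sorted label order.

```python
import numpy as np
from sklearn import preprocessing

# Made-up integer labels for 7 categories (0..6), one per row
labels = np.array([0, 1, 2, 3, 4, 5, 6, 3, 3, 1])

ohe = preprocessing.OneHotEncoder()
# The encoder expects a 2D array: one column per feature
onehot = ohe.fit_transform(labels.reshape(-1, 1))

print(onehot.shape)          # one row per sample, one column per category
print(onehot.toarray()[:3])  # dense view of the first three rows

# Total size of the sparse representation in bytes
print(onehot.data.nbytes + onehot.indptr.nbytes + onehot.indices.nbytes)
```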

### Summary to handle categorical variables in any ML project

Whenever you get categorical variables, follow these steps:
+ Fill the `NaN` values
+ Convert them to integers by applying Label Encoding using `LabelEncoder` of scikit-learn or by using a mapping dictionary.
+ Create one-hot encoding using `OneHotEncoder` from scikit-learn package. (You can skip binarization)
+ Go for ML model building.
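
The steps above can be sketched end to end on a toy DataFrame. The column names and values below are made up for illustration, and `LogisticRegression` is just one possible model choice, not something the steps prescribe.

```python
import pandas as pd
from sklearn import linear_model, preprocessing

# Toy data; column names and values are made up
df = pd.DataFrame({
    'color': ['red', 'blue', None, 'red', 'green', 'blue'],
    'size': ['S', None, 'L', 'M', 'S', 'L'],
    'target': [0, 1, 0, 1, 0, 1],
})
features = ['color', 'size']

# 1. Fill NaN values with a dedicated category
for feat in features:
    df[feat] = df[feat].fillna('NONE').astype(str)

# 2. Convert categories to integer labels
for feat in features:
    df[feat] = preprocessing.LabelEncoder().fit_transform(df[feat])

# 3. One-hot encode the integer labels (sparse output by default)
ohe = preprocessing.OneHotEncoder()
X = ohe.fit_transform(df[features])

# 4. Build a model on the encoded features
model = linear_model.LogisticRegression()
model.fit(X, df['target'])
print(X.shape)
```

With real data, the encoders would be fitted on the training split only (or on train plus test, as discussed in the next section) and reused to transform new data.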

### Handle "Rare" or "Unknown" categories

In real-world problems, the training dataset may have a column with a fixed set of categories. If, after training, the model encounters a new category in that column in production, it will throw an error because the category was never seen during training. Such a model is not robust, and we need to handle these situations.

We can introduce a "Rare" category, which groups together the many different categories that are not seen very often.
If we have a fixed test set, we can add the test set to the training set to learn about all the categories in a given feature. This is similar to semi-supervised learning, in which we use data that is not available for training to improve the model. It also takes care of rare values that appear only a few times in the training data but are abundant in the test data, which makes our model more robust.

To make sure this approach doesn't overfit, we design our cross-validation so that it replicates the prediction process we follow when running the model on test data; the model is then much less likely to overfit.

Let's understand it better by going through the code.

In [6]:
import pandas as pd
from sklearn import preprocessing
pd.options.display.max_columns = None
pd.options.display.max_rows = None

train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

# Create a fake target column in test_data, as it doesn't exist there
test_data['target'] = -1

# concatenate both training and test datasets
data = pd.concat([train_data, test_data]).reset_index(drop=True)

# Make a list of features we're interested in. Skip 'id' and 'target' columns
features = [col for col in train_data.columns if col not in ['id', 'target']]

for feat in features:
    # Create an object of LabelEncoder for each feature
    lbl_enc = preprocessing.LabelEncoder()
    
    # Fill NaN values
    # Since this is categorical data, we fill NaN with a string and convert all values to strings.
    # So whether a value is an int or a float, it becomes a string category.
    temp_col = data[feat].fillna('NONE').astype(str).values
    
    # fit_transform on the complete (train + test) dataset; otherwise, fit here and transform the unseen data
    data[feat] = lbl_enc.fit_transform(temp_col)
    
# Split the data to train and test again
train_data = data[data['target'] != -1].reset_index(drop=True)
test_data = data[data['target'] == -1].reset_index(drop=True)

In [7]:
train_data.head(10)

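index | id | bin_0 | bin_1 | bin_2 | bin_3 | bin_4 | nom_0 | nom_1 | nom_2 | nom_3 | nom_4 | nom_5 | nom_6 | nom_7 | nom_8 | nom_9 | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 | day | month | target
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 5 | 3 | 6 | 0 | 1060 | 1014 | 87 | 1 | 27 | 2 | 0 | 3 | 3 | 21 | 57 | 5 | 5 | 0
1 | 1 | 1 | 1 | 0 | 0 | 2 | 3 | 4 | 0 | 5 | 4 | 210 | 359 | 27 | 69 | 2113 | 2 | 2 | 6 | 5 | 24 | 151 | 6 | 9 | 0
2 | 2 | 0 | 1 | 0 | 0 | 0 | 3 | 1 | 3 | 0 | 0 | 861 | 694 | 90 | 102 | 1400 | 2 | 4 | 2 | 14 | 16 | 106 | 4 | 11 | 0
3 | 3 | 2 | 0 | 0 | 0 | 0 | 3 | 0 | 3 | 3 | 4 | 477 | 241 | 51 | 171 | 2168 | 0 | 5 | 4 | 1 | 2 | 46 | 2 | 5 | 0
4 | 4 | 0 | 2 | 0 | 2 | 0 | 3 | 6 | 3 | 2 | 1 | 556 | 361 | 183 | 151 | 1748 | 2 | 2 | 1 | 8 | 2 | 51 | 4 | 3 | 0
5 | 5 | 0 | 2 | 1 | 2 | 0 | 3 | 6 | 4 | 1 | 0 | 767 | 1060 | 138 | 93 | 59 | 1 | 1 | 3 | 2 | 17 | 181 | 2 | 6 | 0
6 | 6 | 0 | 0 | 0 | 0 | 0 | 3 | 6 | 3 | 2 | 0 | 645 | 1223 | 25 | 168 | 692 | 0 | 2 | 1 | 3 | 18 | 159 | 4 | 8 | 0
7 | 7 | 0 | 0 | 1 | 2 | 0 | 3 | 6 | 0 | 3 | 0 | 83 | 277 | 124 | 170 | 1465 | 2 | 1 | 1 | 2 | 25 | 55 | 0 | 0 | 0
8 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 6 | 2 | 6 | 1004 | 64 | 138 | 1400 | 0 | 5 | 0 | 3 | 13 | 137 | 5 | 5 | 0
9 | 9 | 0 | 0 | 2 | 0 | 2 | 3 | 2 | 3 | 3 | 4 | 754 | 1388 | 84 | 2 | 1086 | 2 | 0 | 4 | 14 | 8 | 51 | 0 | 10 | 1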


> Note: The above trick works when the test set is available. But what if we're working on a real-time problem where no test set is available? In such cases, we can use an "Unknown" category.

In [8]:
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None

In [9]:
train_data = pd.read_csv('../input/train.csv')

In [10]:
# Consider ord_4 column. 

train_data['ord_4'].value_counts()

N    39978
P    37890
Y    36657
A    36633
R    33045
U    32897
M    32504
X    32347
C    32112
H    31189
Q    30145
T    29723
O    25610
B    25212
E    21871
K    21676
I    19805
D    17284
F    16721
W     8268
Z     5790
S     4595
G     3404
V     3107
J     1950
L     1657
Name: ord_4, dtype: int64

In [11]:
# Let's fill NaN values
train_data['ord_4'].fillna('NONE').value_counts()

N       39978
P       37890
Y       36657
A       36633
R       33045
U       32897
M       32504
X       32347
C       32112
H       31189
Q       30145
T       29723
O       25610
B       25212
E       21871
K       21676
I       19805
NONE    17930
D       17284
F       16721
W        8268
Z        5790
S        4595
G        3404
V        3107
J        1950
L        1657
Name: ord_4, dtype: int64

We can see above that there are ~18K NaN values in the column.

Now we can define our criterion for the "rare" or "unknown" category. Let's assume that any category occurring fewer than 2000 times is considered "rare". From the counts above, the **J and L** categories fall into the "rare" category.

In [12]:
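# Fill missing values with a dedicated "NONE" category
train_data['ord_4'] = train_data['ord_4'].fillna('NONE')

# Look up each row's category count and relabel categories seen fewer than 2000 times
counts = train_data['ord_4'].value_counts()
train_data.loc[train_data['ord_4'].map(counts) < 2000, 'ord_4'] = 'RARE'
train_data['ord_4'].value_counts()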

N       39978
P       37890
Y       36657
A       36633
R       33045
U       32897
M       32504
X       32347
C       32112
H       31189
Q       30145
T       29723
O       25610
B       25212
E       21871
K       21676
I       19805
NONE    17930
D       17284
F       16721
W        8268
Z        5790
S        4595
RARE     3607
G        3404
V        3107
Name: ord_4, dtype: int64

So now, when it comes to test data, all new, unseen categories will be mapped to "RARE", and all missing values will be mapped to "NONE".

This approach ensures that the model works in a live setting, even if new categories appear.
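
How incoming data could be mapped at prediction time can be sketched as follows. The toy values and the threshold of 2 (standing in for the 2000 used above) are assumptions for illustration; the idea is to remember the categories kept during training and send everything else to "RARE".

```python
import pandas as pd

# Toy training column (made up); None marks missing values
train_col = pd.Series(['N', 'N', 'P', 'P', 'P', 'J', None, None])
train_col = train_col.fillna('NONE')

# Relabel categories seen fewer than 2 times as "RARE"
counts = train_col.value_counts()
train_col = train_col.where(train_col.map(counts) >= 2, 'RARE')

# Categories the model knows about after training
known = set(train_col.unique())

# New data: "Q" was never seen, "J" was rare in training
test_col = pd.Series(['P', 'Q', None, 'J', 'N'])
test_col = test_col.fillna('NONE')
test_col = test_col.where(test_col.isin(known), 'RARE')

print(test_col.tolist())
```

Here both the unseen category and the one that was rare in training end up as "RARE", while the missing value becomes "NONE".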