In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/cat-in-the-dat/train.csv
/kaggle/input/cat-in-the-dat/test.csv
/kaggle/input/cat-in-the-dat/sample_submission.csv


# Introduction
Let's talk about types of features in machine learning. Basically, there are two main types of features
- numerical features (usually continous numbers)
- categorical features (descrete numbers or string categories e.g. male / female)  

each of which has its own way of handling and preprocessing.  
In this notebook we descuss different types of categorical features and how to encode and preprocess them.  
### Types of categorical features
- Binary
- Nominal ( low and high cardinality )
- Ordinal ( low and high cardinality )
- Cyclic

> high cardenality (informally) means a lot of categories

In [2]:
# Importing needed libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# defining describer function

def summarize_features(df):
    # first column will be data types of each feature
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    # second column will contain the name of the feature
    summary['Name'] = summary['index']
    # switch name and dtypes
    summary = summary[['Name','dtypes']]
    # how many missing values in each feature
    summary['Missing'] = df.isnull().sum().values
    # how many unique values in each feature (cardinality indicator)
    summary['Uniques'] = df.nunique().values

    return summary

In [3]:
df_train = pd.read_csv('/kaggle/input/cat-in-the-dat/train.csv')
df_test = pd.read_csv('/kaggle/input/cat-in-the-dat/test.csv')
submission = pd.read_csv('/kaggle/input/cat-in-the-dat/sample_submission.csv', index_col='id')

In [4]:
summarize_features(df_train)

Unnamed: 0,Name,dtypes,Missing,Uniques
0,id,int64,0,300000
1,bin_0,int64,0,2
2,bin_1,int64,0,2
3,bin_2,int64,0,2
4,bin_3,object,0,2
5,bin_4,object,0,2
6,nom_0,object,0,3
7,nom_1,object,0,6
8,nom_2,object,0,6
9,nom_3,object,0,6


After inspecting the output of the function above we see that we don't have any missing value in any column, and it also seems that the target variable has two unique values which means that this is a binary classification problem.  
The names `nom`, `ord` and `bin` are to indicate that these columns represent nominal, ordinal and binary features (duh!!), day and month are both cyclic featuers, more about that later.  
now let's study how to encode these features.  
### Types of encoding
- Ordinal Encoding
- One Hot Encoding
- Encoding Cyclic Features
- Encoding from Statistics
- Target Encoding

#### Ordinal Encoding
This type of encoding is used when dealing with ordinal features which are features that have some sort of order to its values, for example the class in an airplane. It is obvious that first class is better than second class which is also better than third class, so this is an example of an ordinal feature.  
We use [sklearn.preprocessing.OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) with it.  
let's use this encoder on features from `ord_0` to `ord_5`

In [5]:
# getting names of all ordinal feature indices, same encoder can be used for binary data
ord_cols = ["ord_{}".format(i) for i in range(6)] + ["bin_{}".format(i) for i in range(5)]
# extracting data from dataframe, you can call ord_data.head() alone in a cell to see the output
ord_data = df_train[ord_cols]
# initializing an encoder object (you should know oop)
ord_encoder = OrdinalEncoder()
# the encoder learns the categories and replaces them with numeric equivalents
encoded_ord_data = ord_encoder.fit_transform(ord_data)

In [6]:
# you can access learned categories from the .categories_ attribute see the docs for more

## also you can pass the categories your self to the encoder object
## in case you need them to have a specific order which is most of the cases
## `OrdinalEncoder(["cat_1", "cat_2"...])`

In [7]:
encoded_ord_data

array([[1., 2., 1., ..., 0., 1., 1.],
       [0., 2., 3., ..., 0., 1., 1.],
       [0., 1., 4., ..., 0., 0., 1.],
       ...,
       [2., 4., 0., ..., 0., 0., 1.],
       [0., 3., 0., ..., 0., 0., 1.],
       [2., 0., 2., ..., 0., 0., 1.]])

#### OneHot Encoding
OneHot encoding is used with nominal data. Unlike ordinal data, nominal data does not have a specific order or quantitative attribute to it, for example, Blood type (A+, B-, etc...) nothing necissarely says that one is better than the other. in this case you would want to use one hot encoding which basically replaces each column with `n` columns where `n` is the number of categories in the column all of which will have the value `0` except one of the columns will have the value `1` the position of this `1` indicates to which category this sample belongs.  
To use it check out [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

In [8]:
# getting names of all nominal feature indices, same encoder can be used for binary data
nom_cols = ["nom_{}".format(i) for i in range(10)]
# extracting data from dataframe, you can call nom_data.head() alone in a cell to see the output
nom_data = df_train[ord_cols]
# initializing an encoder object (you should know oop)
nom_encoder = OneHotEncoder()
# the encoder learns the categories and replaces them with one hot array equivalent
encoded_nom_data = nom_encoder.fit_transform(ord_data)

In [9]:
encoded_nom_data.shape
# the shape is (m, 257) which means we have 257 columns after encoding,
# even though before encoding we only had 10 nominal columns, which means that
# each column has been replaced with multiple columns all containing zeros but one of them as explained

(300000, 257)

> **NOTE**
> You might be familiar with pandas.DataFrame.get_dummies which does onehot encoding too, but the sklearn encoder returns a sparce matrix which is more efficient and consumes less memory.

#### Encoding cyclic features, and a bit of OOP
The following image shows a two dimintional representation of a cyclic feature like month of the year.  
![cyclic features](https://miro.medium.com/max/343/1*70cevmU8wNggGJEdLam1lw.png)  
one way of handling this kind of feature is OneHotEncoding, another way is dealing with the month as an angle and calculate its sin and cosin then encode those using OneHotEncoding. both ways are valid the second seems to be more effective however the first is easier.  
Now we are going to implement our own CyclicEncoder class extending the sklearn BaseEstimator and TranformerMixin classes

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin

class CyclicEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X.copy()
        columns = X.columns
        for col in columns:
            X[col+'_sin'] = np.sin( (2*np.pi*X[col]) / X[col].nunique() )
            X[col+'_cos'] = np.cos( (2*np.pi*X[col]) / X[col].nunique() )
        
        onehot_encoder = OneHotEncoder(sparse=False, handle_unknown="ignore")
        new_data = X.drop(columns, axis=1)
        return onehot_encoder.fit_transform(new_data)
    

In [11]:
# getting names of all nominal feature indices, same encoder can be used for binary data
cyc_cols = ["day", "month"]
# extracting data from dataframe, you can call nom_data.head() alone in a cell to see the output
cyc_data = df_train[cyc_cols]
# initializing an encoder object (you should know oop)
cyc_encoder = CyclicEncoder()
# the encoder learns the categories and replaces them with one hot array equivalent
encoded_cyc_data = cyc_encoder.fit_transform(cyc_data)

In [12]:
encoded_cyc_data.shape

(300000, 36)

### Pipeline and ColumnTransformer
Before we descuss cardinality and what not let's get familiar with sklearn pipelines which are going to make your code much more organized and readable and also let's test a dummy LogisticRegression model, shall we?  
[sklearn.pipeline.Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) / [sklearn.compose.ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntransformer#sklearn.compose.ColumnTransformer)

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

basic_transformer = ColumnTransformer([
    ("ordinal_transformer", OrdinalEncoder(), ord_cols),
    ("nominal_transformer", OneHotEncoder(handle_unknown="ignore"), nom_cols),
    ("cyclic_transformer", CyclicEncoder(), cyc_cols)
])

X = df_train.drop(['target', 'id'], axis=1)
y = df_train['target']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# now we can preprocess the data all at once, now let's make a pipeline with a logistic regression model

basic_pipeline = Pipeline([
    ("preprocessor", basic_transformer),
    ("scaler", StandardScaler(with_mean=False)),
    ("model", LogisticRegression())
])

basic_pipeline.fit(X_train, y_train)
train_acc = basic_pipeline.score(X_train, y_train)
val_acc = basic_pipeline.score(X_val, y_val)
print("Training Accuracy = {}, Validation Accuracy = {}".format(train_acc, val_acc))

Training Accuracy = 0.7783416666666667, Validation Accuracy = 0.7358666666666667


## Advanced encoding techniques
So far so good, you can encode all categorical features now, but some features require further preprocessing and visualization, like features with high cardinality (many categories but in fancy terms).  
Also sometimes (usually with ordinal features) you might need to use statistics from the target to encode the feature (e.g. mean encoding) or even statistics from other features.

### Statistics Encoding
In some cases you might want to choose a certain statistic with which you want to encode some feature, for example the normalized frequency of the categories (frequency / number_of_samples).  
I like oop so I'll do as I did with Cyclyc features, so ... sorry, I guess!

In [14]:
# let's define an encoder that replaces each category with its normalized frequency

class NormalizedFrequencyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, log_scale=False):
        """
        For high cardinality features we are going to use logarithmic scaling,
        because we don't want numbers to be very small and close to one another
        """
        super().__init__()
        self.log_scale = log_scale
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X.copy()
        columns = X.columns
        m = X.shape[0]
        
        for col in columns:
            frequency_map = X[col].value_counts() / m
            X[col] = X[col].map(frequency_map)
            if self.log_scale == True:
                X[col] = np.log(X[col])
        
        return X

statistics_encoder = NormalizedFrequencyEncoder()
# let's test it on a couple of features
ord_3 = df_train.copy()[['ord_3', 'ord_4']]
preprocessed = statistics_encoder.fit_transform(ord_3)
preprocessed.head()

# for high cardinality features the numbers are going to be very close to zero.
# going to the logarithmic scale at this case could be a good idea

Unnamed: 0,ord_3,ord_4
0,0.082467,0.013247
1,0.117587,0.06086
2,0.082467,0.056423
3,0.093493,0.013247
4,0.117587,0.056423


### Target Encoding
In statistics encoding we used statistics from the feature itself, but that doesn't tell us anything about the label, how about we use statistics from the training set labels (target) to encode our features?  
In this example we are going to use mean encoding which uses the mean of the target to encode a categorical feature.  
There are many ways to do mean encoding, we are going to do the most straight forward way, but we encourage you to lookup more interesting and sophisticated ways.
> **NOTE** You only fit this encoder once on the training set and not on the test set
> as a general rule of thumb, only use fit on the training set nomatter what encoder you're using and use transform on the validation and test set.

In [15]:
class MeanEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
        
    def fit(self, X, y=None):
        self.columns = X.columns
        X = X.copy()
        X['target'] = y
        self.mean_maps = dict()
        for col in self.columns:
            self.mean_maps[col] = X.groupby([col])['target'].mean() 
        return self
    
    def transform(self, X, y=None):
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].map(self.mean_maps[col])
        
        return X

In [16]:
mean_encoder = MeanEncoder()
# let's test it on a couple of features
ord_3 = df_train.copy()[['ord_3', 'ord_4']] # target is needed for fitting
preprocessed = mean_encoder.fit_transform(ord_3, df_train['target'])
preprocessed.head()

# for high cardinality features the numbers are going to be very close to zero.
# going to the logarithmic scale at this case could be a good idea

Unnamed: 0,ord_3,ord_4
0,0.306993,0.208354
1,0.206599,0.186877
2,0.306993,0.351864
3,0.330148,0.208354
4,0.206599,0.351864


In [17]:
advanced_transformer = ColumnTransformer([
    ("ordinal_transformer", MeanEncoder(), ord_cols),
    ("nominal_low_cardinality_transformer", NormalizedFrequencyEncoder(), nom_cols[:5]),
    ("nominal_high_cardinality_transformer", NormalizedFrequencyEncoder(log_scale=True), nom_cols[5:]),
    ("cyclic_transformer", CyclicEncoder(), cyc_cols),
])

X = df_train.drop(['target', 'id'], axis=1)
y = df_train['target']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# now we can preprocess the data all at once, now let's make a pipeline with a logistic regression model

advanced_pipeline = Pipeline([
    ("preprocessor", advanced_transformer),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

advanced_pipeline.fit(X_train, y_train)
train_acc = advanced_pipeline.score(X_train, y_train)
val_acc = advanced_pipeline.score(X_val, y_val)
print("Training Accuracy = {}, Validation Accuracy = {}".format(train_acc, val_acc))

Training Accuracy = 0.7325666666666667, Validation Accuracy = 0.7335


In [18]:
# let's submit a solution
X = df_test.drop('id', axis=1)
preds = advanced_pipeline.predict(X)
preds

array([0, 0, 0, ..., 0, 1, 0])

In [19]:
submission['target'] = preds
submission.to_csv("results.csv")

read_submission = pd.read_csv("results.csv")
read_submission.head()

Unnamed: 0,id,target
0,300000,0
1,300001,0
2,300002,0
3,300003,0
4,300004,1


> **NOTE** The accuracy is not that good in this submission because we didn't do any kind of model selection, perhaps you can experamint on your own with different models and grid search through model parameters to make better predictions, however, our main focus in this notebook is to explain different preprocessing techniques and get you familiar with sklearn encoders and base classes.

# Conclusion and reading material
If you've gone this far, congratulations, you now know a lot of things about categorical feature encoding.  
If you use these techniques along with some data analysis and feature selection techniques, it shall give your models a nice boost.  
We didn't cover everything in this notebook, and we encourage you to look some keywords up and read the documentation of sklearn carefully or even read more notebooks on this compeition to understand more, and learn more advanced concepts.  
Happy learning :)  
## More readng material
[https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)

[https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca](https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca)

[https://towardsdatascience.com/ml-intro-5-one-hot-encoding-cyclic-representations-normalization-6f6e2f4ec001](https://towardsdatascience.com/ml-intro-5-one-hot-encoding-cyclic-representations-normalization-6f6e2f4ec001)

and please read [sklearn's documentation](https://scikit-learn.org/stable/)

