# What are categorical variables?
- nominal: two or more categories with no inherent order. For example, gender classified into two groups, male and female, is a nominal variable
- ordinal: 'levels' or categories with a particular order, for example low, medium and high. The order is important

- categorical variables can also be **binary** (only two categories) or **cyclic** (the categories repeat in a cycle, such as days of the week)


In [1]:
%cd drive/MyDrive/Colab\ Notebooks/Approaching-Any-ML-Problem

/content/drive/MyDrive/Colab Notebooks/Approaching-Any-ML-Problem


# cat-in-the-dat Data

In [5]:
import pandas as pd
df = pd.read_csv('data/cat_train.csv')
df.head()

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,...,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target
0,0,0.0,0.0,0.0,F,N,Red,Trapezoid,Hamster,Russia,...,02e7c8990,3.0,Contributor,Hot,c,U,Pw,6.0,3.0,0.0
1,1,1.0,1.0,0.0,F,Y,Red,Star,Axolotl,,...,f37df64af,3.0,Grandmaster,Warm,e,X,pE,7.0,7.0,0.0
2,2,0.0,1.0,0.0,F,N,Red,,Hamster,Canada,...,,3.0,,Freezing,n,P,eN,5.0,9.0,0.0
3,3,,0.0,0.0,F,N,Red,Circle,Hamster,Finland,...,f9d456e57,1.0,Novice,Lava Hot,a,C,,3.0,3.0,0.0
4,4,0.0,,0.0,T,N,Red,Triangle,Hamster,Costa Rica,...,c5361037c,3.0,Grandmaster,Cold,h,C,OZ,5.0,12.0,0.0


The dataset consists of all kinds of categorical variables:

- Nominal
- Ordinal
- Cyclical
- Binary

Overall, there are:

- Five binary variables
- Ten nominal variables
- Six ordinal variables
- Two cyclic variables
- And a target variable

In [6]:
df.target.value_counts()

0.0    78347
1.0    18166
Name: target, dtype: int64

- The classes are skewed (roughly 81% of the targets are 0), so AUC is a better metric than accuracy for this binary classification problem
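
To see why accuracy is misleading on skewed targets, here is a minimal sketch that scores a useless model which always predicts 0, using the class counts printed above. The constant-prediction "model" is an assumption made purely for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# class counts taken from df.target.value_counts() above
y_true = np.array([0] * 78347 + [1] * 18166)

# a hypothetical "model" that ignores its input and always predicts class 0
y_pred = np.zeros(len(y_true), dtype=int)

# accuracy rewards the majority-class guess
acc = accuracy_score(y_true, y_pred)

# AUC stays at chance level because constant scores cannot rank
# positives above negatives
auc = roc_auc_score(y_true, y_pred)

print(f"accuracy = {acc:.3f}, AUC = {auc:.3f}")
```

Accuracy looks respectable even though the model has learned nothing, while AUC exposes it as no better than chance.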

In [7]:
df.ord_2.unique()

array(['Hot', 'Warm', 'Freezing', 'Lava Hot', 'Cold', 'Boiling Hot', nan],
      dtype=object)

In [8]:
mapping = {"Freezing": 0,
  "Warm": 1,
  "Cold": 2,
  "Boiling Hot": 3,
  "Hot": 4,
  "Lava Hot": 5
}

In [9]:
df.loc[:, "ord_2"] = df.ord_2.map(mapping)
df.ord_2.value_counts()

0.0    23009
1.0    20028
2.0    15557
3.0    13802
4.0    10879
5.0    10395
Name: ord_2, dtype: int64

We can use `LabelEncoder` instead of doing the mapping manually

In [13]:
%cd src
from sklearn import preprocessing
# read the data
df = pd.read_csv("../data/cat_train.csv")
# fill NaN values in ord_2 column
df.loc[:, "ord_2"] = df.ord_2.fillna("NONE")
# initialize LabelEncoder
lbl_enc = preprocessing.LabelEncoder()
# fit label encoder and transform values on ord_2 column
# P.S: do not use this directly. fit first, then transform
df.loc[:, "ord_2"] = lbl_enc.fit_transform(df.ord_2.values)

[Errno 2] No such file or directory: 'src'
/content/drive/MyDrive/Colab Notebooks/Approaching-Any-ML-Problem/src


Binarized variables (columns that hold only 0s and 1s) are easy to store in bulk
if we keep them in a sparse format. A sparse format is a way of storing data in
memory in which you do not store all the values but only the values that matter.
In the case of binary variables, all that matters is where we have ones (1s).

Even though the sparse representation of binarized features takes much less
memory than its dense representation, there is another transformation for
categorical variables that takes even less memory. This is known as one-hot
encoding.
One-hot encoding is a binary encoding too in the sense that there are only two
values, 0s and 1s. However, it must be noted that it’s not a binary representation:
each category gets its own column, and every row has exactly one 1.
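
The memory argument can be checked with a small sketch. The toy column below (1000 rows, 100 categories) is made up for illustration. `OneHotEncoder` returns a SciPy sparse matrix by default, and `.toarray()` gives the dense equivalent:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical categorical column: 1000 rows cycling through 100 category ids
categories = (np.arange(1000) % 100).reshape(-1, 1)

# one-hot encode; the default output is a sparse matrix
ohe = OneHotEncoder()
sparse_ohe = ohe.fit_transform(categories)

# dense copy of the same matrix: every 0 and every 1 is stored
dense_ohe = sparse_ohe.toarray()
dense_bytes = dense_ohe.nbytes

# the sparse (CSR) form stores only the non-zero values plus their positions
sparse_bytes = (
    sparse_ohe.data.nbytes
    + sparse_ohe.indices.nbytes
    + sparse_ohe.indptr.nbytes
)

print(f"shape: {dense_ohe.shape}")
print(f"dense size:  {dense_bytes} bytes")
print(f"sparse size: {sparse_bytes} bytes")
```

Only one value per row is non-zero, so the sparse form grows with the number of rows rather than with rows times categories.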

We can also convert them into numerical variables.

Suppose we go back to the categorical features dataframe (the original
cat-in-the-dat-ii data) that we had. How many ids do we have in the dataframe
where the value of ord_2 is Boiling Hot?
We can easily calculate this from the shape of the dataframe filtered to the rows
where the ord_2 column has that value.


In [16]:
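# NOTE: LabelEncoder sorts classes alphabetically, so after the encoding above
# 3 is "Hot" (10879 rows); "Boiling Hot" is encoded as 0 (13802 rows)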
df[df.ord_2 == 3].shape

(10879, 25)

In [17]:
df.groupby(["ord_2"])["id"].count()

ord_2
0    13802
1    15557
2    23009
3    10879
4    10395
5     2844
6    20028
Name: id, dtype: int64

If we just replace ord_2 column with its count values, we have converted it to a
feature which is kind of numerical now. We can create a new column or replace this
column by using the `transform` function of pandas along with `groupby`.

In [18]:
df.groupby(['ord_2'])['id'].transform('count')

0        10879
1        20028
2        23009
3        10395
4        15557
         ...  
96509    15557
96510    20028
96511    15557
96512    23009
96513    23009
Name: id, Length: 96514, dtype: int64
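
To keep these counts as a feature, the result of `transform` can be assigned to a new column. A minimal sketch on a toy frame follows; the values and the column name `ord_2_count` are made up for illustration:

```python
import pandas as pd

# toy stand-in for the real dataframe
toy = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "ord_2": ["Hot", "Cold", "Hot", "Warm", "Hot", "Cold"],
})

# transform("count") returns one value per row, aligned with the original
# index, so it can be assigned directly as a new column
toy["ord_2_count"] = toy.groupby("ord_2")["id"].transform("count")

print(toy)
```

Unlike `count()`, which returns one row per group, `transform("count")` keeps the original row order and length.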

You can add count features for every column, replace the original columns with
them, or group by multiple columns and count the combinations. For example, the
following code counts rows grouped by the ord_1 and ord_2 columns.

In [19]:
df.groupby(
    [
    "ord_1",
    "ord_2"]
    )["id"].count().reset_index(name="count")

Unnamed: 0,ord_1,ord_2,count
0,Contributor,0,2531
1,Contributor,1,2832
2,Contributor,2,4235
3,Contributor,3,2016
4,Contributor,4,1934
5,Contributor,5,523
6,Contributor,6,3666
7,Expert,0,3243
8,Expert,1,3596
9,Expert,2,5327


One more trick is to create new features from these categorical variables. You can
create new categorical features from existing features, and this can be done in an
effortless manner.

In [20]:
df["new_feature"] = (
  df.ord_1.astype(str)
  + "_"
  + df.ord_2.astype(str)
  )

Here, we have combined ord_1 and ord_2 with an underscore after converting both
columns to string type. Note that NaN also converts to a string. That is okay: we
can treat NaN as a new category. Thus, we have a new feature which is a
combination of these two features. You can also combine three, four or even more
columns.

Whenever you get categorical variables, follow these simple steps:
- fill the NaN values (this is very important!)
- convert them to integers by applying label encoding using LabelEncoder
of scikit-learn or by using a mapping dictionary. If you didn’t fill up NaN
values with something, you might have to take care of them in this step
- create one-hot encoding
- Go for modeling
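
The steps above can be sketched end to end on a toy column. This is only an illustration under assumed data (the five values below are made up), not the full pipeline used later in these notes:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# toy column with one missing value
toy = pd.DataFrame({"ord_2": ["Hot", "Cold", None, "Hot", "Freezing"]})

# step 1: fill NaN values with a placeholder category
filled = toy["ord_2"].fillna("NONE").astype(str).to_numpy()

# step 2: convert the categories to integers
lbl_enc = LabelEncoder()
codes = lbl_enc.fit_transform(filled)

# step 3: one-hot encode the integer codes (the encoder expects a 2D array)
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1))

print(list(lbl_enc.classes_))
print(codes)
print(onehot.shape)
```

The resulting sparse matrix has one column per category, including the placeholder, and is what would be passed to the model in the last step.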

Another way of handling NaN values is to treat them as a completely new category.
This is the most preferred way of handling NaN values, and it is very simple to do
if you are using pandas.

In [21]:
df.ord_2.fillna("NONE").value_counts()

2    23009
6    20028
1    15557
0    13802
3    10879
4    10395
5     2844
Name: ord_2, dtype: int64

Wow! There were 2844 NaN values in this column that we didn’t even consider
using previously. (They show up as 5 above because ord_2 was already label
encoded, with NaN filled as “NONE”.) With the addition of this new category, the
total number of categories has now increased from 6 to 7. This is okay because now
when we build our models, we will also consider NaN. The more relevant
information we have, the better the model is.

Let’s assume that ord_2 did not have any NaN values. We see that all categories in
this column have a significant count. There are no “rare” categories, i.e.
categories which appear in only a small percentage of the total number of samples.
Now, let’s assume that you have deployed a model which uses this column in
production and, when the model or the project is live, you get a category in the
ord_2 column that is not present in train. Your model pipeline, in this case, will
throw an error and there is nothing that you can do about it. If this happens, then
probably something is wrong with your pipeline in production. If this is expected,
then you must modify your model pipeline and add a new category to these six
categories.
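
The failure is easy to reproduce with `LabelEncoder`, which raises a `ValueError` when asked to transform a label it was not fitted on. In this small sketch the category "Scorching" is invented for the demonstration:

```python
from sklearn.preprocessing import LabelEncoder

# the six categories seen in training
train_values = ["Freezing", "Cold", "Warm", "Hot", "Boiling Hot", "Lava Hot"]

lbl_enc = LabelEncoder()
lbl_enc.fit(train_values)

# "Scorching" is a made-up category that never appeared in training
try:
    lbl_enc.transform(["Hot", "Scorching"])
    failed = False
except ValueError as err:
    failed = True
    print(f"transform failed: {err}")
```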

This new category is known as the “rare” category. A rare category is a category
which is not seen very often and can include many different categories. You can
also try to “predict” the unknown category by using a nearest neighbour model.
Remember, if you predict this category, it will become one of the categories from
the training data.

If you have a fixed test set, you can add your test data to training to know about
the categories in a given feature. This is very similar to semi-supervised learning,
in which you use data that is not available for training to improve your model. This
will also take care of rare values that appear only a few times in training data but
are abundant in test data. Your model will be more robust.

Many people think that this idea overfits. It may or may not overfit. There is a
simple fix for that. If you design your cross-validation in such a way that it
replicates the prediction process when you run your model on test data, then it’s
never going to overfit. It means that the first step should be the separation of folds,
and in each fold, you should apply the same pre-processing that you want to apply
to test data. Suppose you want to concatenate training and test data: then in each
fold you must concatenate training and validation data and also make sure that your
validation dataset replicates the test set. In this specific case, you must design your
validation sets in such a way that they have categories which are “unseen” in the
training set.

In [23]:
# read training data
train = pd.read_csv("../data/cat_train.csv")
#read test data
test = pd.read_csv("../data/cat_test.csv")

# create a fake target column for test data
# since this column doesn't exist
test.loc[:, "target"] = -1

# concatenate both training and test data
data = pd.concat([train, test]).reset_index(drop=True)

# make a list of features we are interested in
# id and target is something we should not encode
features = [x for x in train.columns if x not in ["id", "target"]]

# loop over the features list
for feat in features:
  # create a new instance of LabelEncoder for each feature
  lbl_enc = preprocessing.LabelEncoder()
  # note the trick here
  # since its categorical data, we fillna with a string
  # and we convert all the data to string type
  # so, no matter its int or float, its converted to string
  # int/float but categorical!!!
  temp_col = data[feat].fillna("NONE").astype(str).values
  
  # we can use fit_transform here as we do not
  # have any extra test data that we need to
  # transform on separately
  data.loc[:, feat] = lbl_enc.fit_transform(temp_col)

# split the training and test data again
train = data[data.target != -1].reset_index(drop=True)
test = data[data.target == -1].reset_index(drop=True)

This trick works when you have a problem where you already have the test dataset.
It must be noted that this trick will not work in a live setting. For example, let’s say
you are in a company that builds a real-time bidding solution (RTB). RTB systems
bid on every user they see online to buy ad space. The features that can be used for
such a model may include pages viewed in a website. Let’s assume that features are
the last five categories/pages visited by the user. In this case, if the website
introduces new categories, we will no longer be able to predict accurately. Our
model, in this case, will fail. A situation like this can be avoided by using an
“unknown” category.

We can treat “NONE” as unknown. So, if during live testing, we get new categories
that we have not seen before, we will mark them as “NONE”.

This is very similar to natural language processing problems. We always build a
model based on a fixed vocabulary. Increasing the size of the vocabulary increases
the size of the model. Transformer models like BERT use a vocabulary of ~30000
tokens (for English). So, when we have a new word coming in, we mark it as UNK
(unknown).

So, you can either assume that your test data will have the same categories as
training or you can introduce a rare or unknown category to training to take care of
new categories in test data.

Let’s see the value counts in the ord_4 column after filling NaN values:

In [26]:
df.ord_4.fillna("NONE").value_counts()

N       6395
P       6014
Y       5904
A       5803
C       5396
M       5314
X       5253
U       5226
R       5226
H       5036
T       4942
Q       4865
O       4171
B       4114
E       3450
K       3398
I       3254
NONE    2810
D       2791
F       2565
W       1384
Z        912
S        705
G        513
V        488
J        303
L        282
Name: ord_4, dtype: int64

We can now define our criteria for calling a value “rare”. Let’s say the requirement
for a value being rare in this column is a count of less than 2000. So, it seems,
W, Z, S, G, V, J and L can be marked as rare values. With pandas, it is quite easy to
replace categories based on a count threshold. Let’s take a look at how it’s done.

In [28]:
df.ord_4 = df.ord_4.fillna("NONE")

In [29]:
df.loc[
  df["ord_4"].value_counts()[df["ord_4"]].values < 2000,
  "ord_4"
  ] = "RARE"

In [30]:
df.ord_4.value_counts()

N       6395
P       6014
Y       5904
A       5803
C       5396
M       5314
X       5253
U       5226
R       5226
H       5036
T       4942
Q       4865
RARE    4587
O       4171
B       4114
E       3450
K       3398
I       3254
NONE    2810
D       2791
F       2565
Name: ord_4, dtype: int64

We say that wherever the value count for a certain category is less than 2000,
replace it with “RARE”. So, now, when it comes to test data, all the new, unseen
categories will be mapped to “RARE”, and all missing values will be mapped to
“NONE”.

This approach will also ensure that the model works in a live setting, even if you
have new categories.
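
One way to apply the same rule to unseen data is to remember which categories were frequent in training and map everything else to “RARE”. The sketch below uses two hypothetical helper functions (`fit_known_categories` and `encode_with_rare` are not part of pandas or of the original notes) and toy data. As an assumption of this sketch, missing values always stay “NONE”, regardless of how often they occur:

```python
import pandas as pd


def fit_known_categories(train_col, min_count):
    """Collect the categories seen at least `min_count` times in training."""
    counts = train_col.fillna("NONE").value_counts()
    return set(counts[counts >= min_count].index)


def encode_with_rare(col, known):
    """Map missing values to "NONE" and rare or unseen categories to "RARE"."""
    filled = col.fillna("NONE")
    keep = filled.isin(known) | (filled == "NONE")
    # where() keeps values for which `keep` is True and replaces the rest
    return filled.where(keep, "RARE")


# toy training and test columns; "Z" never appears in training
train_col = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] + [None])
test_col = pd.Series(["A", "C", "Z", None])

known = fit_known_categories(train_col, min_count=2)
train_encoded = encode_with_rare(train_col, known)
test_encoded = encode_with_rare(test_col, known)

print(sorted(known))
print(test_encoded.tolist())
```

Because the set of known categories is computed once from training data, the same function can be applied to validation folds, test data or live requests.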
Now we have everything we need to approach any kind of problem with categorical
variables in it. Let’s try building our first model and try to improve its performance
in a step-wise manner.

Before going to any kind of model building, it’s essential to take care of
cross-validation. We have already seen the label/target distribution, and we know
that it is a binary classification problem with skewed targets. Thus, we will be using
StratifiedKFold to split the data here.

In [44]:
%%writefile create_folds.py

import pandas as pd
from sklearn import model_selection

if __name__ == "__main__":
  # Read training data
  df = pd.read_csv("../data/cat_train.csv")
  
  # we create a new column called kfold and fill it with -1
  df["kfold"] = -1
  
  # the next step is to randomize the rows of the data
  df = df.sample(frac=1).reset_index(drop=True)
  
  # fillna with 0
  df = df.fillna(0)

  # fetch labels
  y = df.target.values
  
  # initiate the kfold class from model_selection module
  kf = model_selection.StratifiedKFold(n_splits=5)
  
  # fill the new kfold column
  for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
    df.loc[v_, 'kfold'] = f

  # save the new csv with kfold column  
  df.to_csv("../input/cat_train_folds.csv", index=False)

Overwriting create_folds.py


In [45]:
!python create_folds.py

In [46]:
df = pd.read_csv('../input/cat_train_folds.csv')
df.kfold.value_counts()

0    19303
1    19303
2    19303
3    19303
4    19302
Name: kfold, dtype: int64

In [53]:
for i in range(5):
  print(i, df[df.kfold==i].target.value_counts())

0 0.0    15669
1.0     3634
Name: target, dtype: int64
1 0.0    15670
1.0     3633
Name: target, dtype: int64
2 0.0    15670
1.0     3633
Name: target, dtype: int64
3 0.0    15670
1.0     3633
Name: target, dtype: int64
4 0.0    15669
1.0     3633
Name: target, dtype: int64


## Logistic Regression w/ OneHot encoding

Build a simple model using logistic regression

In [58]:
%%writefile ohe_logres.py
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder

def run(fold):
  # Load the full training data with folds
  df = pd.read_csv('../input/cat_train_folds.csv')

  # all columns are features except id, target and kfold columns
  features = [
  f for f in df.columns if f not in ("id", "target", "kfold")
  ]
  # fill all NaN values with NONE
  # note that I am converting all columns to "strings"
  # it doesn’t matter because all are categories

  for col in features:
    df.loc[:, col] = df[col].fillna("NONE").astype(str)

  # get training data using folds
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # get validation data using folds
  df_valid = df[df.kfold == fold].reset_index(drop=True)

  # initialize OneHotEncoder from scikit-learn
  ohe = OneHotEncoder()
  # fit ohe on training + validation features
  full_data = pd.concat(
  [df_train[features], df_valid[features]],
  axis=0
  )

  ohe.fit(full_data[features])
  
  # transform training data
  x_train = ohe.transform(df_train[features])
  
  # transform validation data
  x_valid = ohe.transform(df_valid[features])
  
  # initialize Logistic Regression model
  model = LogisticRegression()

  # fit model on training data (ohe)
  model.fit(x_train, df_train.target.values)

  # predict on validation data
  # we need the probability values as we are calculating AUC
  # we will use the probability of 1s
  valid_preds = model.predict_proba(x_valid)[:, 1]

  # get roc auc score
  auc = roc_auc_score(df_valid.target.values, valid_preds)

  # print auc
  print(auc)

if __name__ == "__main__":
  run(0)

Overwriting ohe_logres.py


In [59]:
!python ohe_logres.py

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
0.7847864978358783


There are a few warnings. It seems logistic regression did not converge for the max
number of iterations. We didn’t play with the parameters, so that is fine. We see
that AUC is ~ 0.785.
Let’s run it for all folds now with a simple change in code.

In [63]:
%%writefile ohe_logres.py
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder

def run(fold):
  # Load the full training data with folds
  df = pd.read_csv('../input/cat_train_folds.csv')

  # all columns are features except id, target and kfold columns
  features = [
  f for f in df.columns if f not in ("id", "target", "kfold")
  ]
  # fill all NaN values with NONE
  # note that I am converting all columns to "strings"
  # it doesn’t matter because all are categories

  for col in features:
    df.loc[:, col] = df[col].fillna("NONE").astype(str)

  # get training data using folds
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # get validation data using folds
  df_valid = df[df.kfold == fold].reset_index(drop=True)

  # initialize OneHotEncoder from scikit-learn
  ohe = OneHotEncoder()
  # fit ohe on training + validation features
  full_data = pd.concat(
  [df_train[features], df_valid[features]],
  axis=0
  )

  ohe.fit(full_data[features])
  
  # transform training data
  x_train = ohe.transform(df_train[features])
  
  # transform validation data
  x_valid = ohe.transform(df_valid[features])
  
  # initialize Logistic Regression model
  model = LogisticRegression()

  # fit model on training data (ohe)
  model.fit(x_train, df_train.target.values)

  # predict on validation data
  # we need the probability values as we are calculating AUC
  # we will use the probability of 1s
  valid_preds = model.predict_proba(x_valid)[:, 1]

  # get roc auc score
  auc = roc_auc_score(df_valid.target.values, valid_preds)

  # print auc
  print(f"Fold = {fold}, AUC = {auc}")

if __name__ == "__main__":
  for fold_ in range(5):
    run(fold_)

Overwriting ohe_logres.py


In [64]:
!python -W ignore ohe_logres.py

Fold = 0, AUC = 0.7847864978358783
Fold = 1, AUC = 0.7853553678923606
Fold = 2, AUC = 0.7879321947478755
Fold = 3, AUC = 0.7870315760687687
Fold = 4, AUC = 0.7864695667409296


We see that AUC scores are quite stable across all folds. The average AUC is
about 0.7863. Quite good for our first model!

Many people will start this kind of problem with a tree-based model, such as
random forest. To apply random forest to this dataset, instead of one-hot
encoding, we can use label encoding and convert every category in every column to
an integer, as discussed previously.

## Random Forest w/ LabelEncoder

In [74]:
%%writefile lbl_rf.py

import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

def run(fold):
  # Load the full training data with folds
  df = pd.read_csv('../input/cat_train_folds.csv')

  # all columns are features except id, target and kfold columns
  features = [
  f for f in df.columns if f not in ("id", "target", "kfold")
  ]
  
  # fill all NaN values with NONE
  # note that I am converting all columns to "strings"
  # it doesn’t matter because all are categories
  for col in features:
    df.loc[:, col] = df[col].fillna("NONE").astype(str)

  # label encode the features
  for col in features:
    lbl = LabelEncoder()

    # fit label encoder on all data
    lbl.fit(df[col])

    # transform all the data
    df.loc[:, col] = lbl.transform(df[col])

  # get training data using folds
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # get validation data using folds
  df_valid = df[df.kfold == fold].reset_index(drop=True)
  
  # get training data
  x_train = df_train[features].values
  
  # get validation data
  x_valid = df_valid[features].values
  
  # initialize Random Forest model
  model = RandomForestClassifier(n_jobs=-1)

  # fit model on training data (label encoded)
  model.fit(x_train, df_train.target.values)

  # predict on validation data
  # we need the probability values as we are calculating AUC
  # we will use the probability of 1s
  valid_preds = model.predict_proba(x_valid)[:, 1]

  # get roc auc score
  auc = roc_auc_score(df_valid.target.values, valid_preds)

  # print auc
  print(f"Fold = {fold}, AUC = {auc}")

if __name__ == "__main__":
  for fold_ in range(5):
    run(fold_)

Overwriting lbl_rf.py


In [68]:
!python lbl_rf.py

Fold = 0, AUC = 0.716930900090115
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 651, in get
    self.wait(timeout)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 648, in wait
    self._event.wait(timeout)
  File "/usr/lib/python3.7/threading.py", line 552, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lbl_rf.py", line 64, in <module>
    run(fold_)
  File "lbl_rf.py", line 49, in run
    model.fit(x_train, df_train.target.values)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_forest.py", line 467, in fit
    for i, t in enumerate(trees)
  File "/usr/local/lib

Wow! Huge difference! The random forest model, without any tuning of
hyperparameters, performs a lot worse than simple logistic regression (AUC of
about 0.717 on fold 0 versus about 0.785).

And this is a reason why we should always start with simple models first. A fan of
random forest would begin with it here and ignore logistic regression, thinking it is
too simple a model to beat random forest. That would be a huge mistake. In our
implementation of random forest, the folds take much longer to complete than with
logistic regression (the run above was interrupted after the first fold). So, we are
not only losing on AUC but also taking much longer to complete the training. Please
note that inference is also time-consuming with random forest, and the model takes
up much more space.

If we want, we can also try to run random forest on sparse one-hot encoded data,
but that is going to take a lot of time. We can also try reducing the sparse one-hot
encoded matrices using singular value decomposition. This is a very common
method of extracting topics in natural language processing.

## Random Forest w/ TruncatedSVD one-hot encoded matrices

In [69]:
%%writefile ohe_svd_rf.py

import pandas as pd
from scipy import sparse

from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder

def run(fold):
  # Load the full training data with folds
  df = pd.read_csv('../input/cat_train_folds.csv')

  # all columns are features except id, target and kfold columns
  features = [
  f for f in df.columns if f not in ("id", "target", "kfold")
  ]
  
  # fill all NaN values with NONE
  # note that I am converting all columns to "strings"
  # it doesn’t matter because all are categories
  for col in features:
    df.loc[:, col] = df[col].fillna("NONE").astype(str)

  # get training data using folds
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # get validation data using folds
  df_valid = df[df.kfold == fold].reset_index(drop=True)

  # initialize OneHotEncoder from scikit-learn
  ohe = OneHotEncoder()
  
  # fit ohe on training + validation features
  full_data = pd.concat(
  [df_train[features], df_valid[features]],
  axis=0
  )

  ohe.fit(full_data[features])
  
  # transform training data
  x_train = ohe.transform(df_train[features])
  
  # transform validation data
  x_valid = ohe.transform(df_valid[features])
  
  # initialize TruncatedSVD
  svd = TruncatedSVD(n_components=120)

  # fit svd on full sparse training data
  full_sparse = sparse.vstack((x_train, x_valid))
  svd.fit(full_sparse)

  # transform sparse training data
  x_train = svd.transform(x_train)

  # transform sparse validation data
  x_valid = svd.transform(x_valid)

  # initialize Random Forest model
  model = RandomForestClassifier(n_jobs=-1)

  # fit model on training data (SVD-reduced ohe)
  model.fit(x_train, df_train.target.values)

  # predict on validation data
  # we need the probability values as we are calculating AUC
  # we will use the probability of 1s
  valid_preds = model.predict_proba(x_valid)[:, 1]

  # get roc auc score
  auc = roc_auc_score(df_valid.target.values, valid_preds)

  # print auc
  print(f"Fold = {fold}, AUC = {auc}")

if __name__ == "__main__":
  for fold_ in range(5):
    run(fold_)

Writing ohe_svd_rf.py


In [70]:
!python ohe_svd_rf.py

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 651, in get
    self.wait(timeout)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 648, in wait
    self._event.wait(timeout)
  File "/usr/lib/python3.7/threading.py", line 552, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ohe_svd_rf.py", line 80, in <module>
    run(fold_)
  File "ohe_svd_rf.py", line 65, in run
    model.fit(x_train, df_train.target.values)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_forest.py", line 467, in fit
    for i, t in enumerate(trees)
  File "/usr/local/lib/python3.7/dist-packages/j

This is even worse: the run above was interrupted before a single fold finished, so
no AUC is reported here. It seems like the best method for this problem is one-hot
encoding with logistic regression. Random forest appears to be taking way too
much time. Maybe we can give XGBoost a try. Since it’s a
tree-based algorithm, we will use label encoded data.

## XGBoost with LabelEncoder

In [72]:
%%writefile lbl_xgb.py

import pandas as pd

import xgboost as xgb

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

def run(fold):
  # Load the full training data with folds
  df = pd.read_csv('../input/cat_train_folds.csv')

  # all columns are features except id, target and kfold columns
  features = [
  f for f in df.columns if f not in ("id", "target", "kfold")
  ]
  
  # fill all NaN values with NONE
  # note that I am converting all columns to "strings"
  # it doesn’t matter because all are categories
  for col in features:
    df.loc[:, col] = df[col].fillna("NONE").astype(str)

  # label encode the features
  for col in features:
    lbl = LabelEncoder()

    # fit label encoder on all data
    lbl.fit(df[col])

    # transform all the data
    df.loc[:, col] = lbl.transform(df[col])

  # get training data using folds
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # get validation data using folds
  df_valid = df[df.kfold == fold].reset_index(drop=True)
  
  # get training data
  x_train = df_train[features].values
  
  # get validation data
  x_valid = df_valid[features].values
  
  # initialize XGBoost model
  model = xgb.XGBClassifier(n_jobs=-1,
                            max_depth=7,
                            n_estimators=200)

  # fit model on training data (label encoded)
  model.fit(x_train, df_train.target.values)

  # predict on validation data
  # we need the probability values as we are calculating AUC
  # we will use the probability of 1s
  valid_preds = model.predict_proba(x_valid)[:, 1]

  # get roc auc score
  auc = roc_auc_score(df_valid.target.values, valid_preds)

  # print auc
  print(f"Fold = {fold}, AUC = {auc}")

if __name__ == "__main__":
  for fold_ in range(5):
    run(fold_)

Overwriting lbl_rf.py


# US Adult Census data
Let’s switch to another dataset with a lot of categorical variables. One more
famous dataset is the US adult census data. The dataset contains some features,
and your job is to predict the salary bracket.

In [73]:
df = pd.read_csv("../data/adult.csv")

df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [75]:
df.income.value_counts()

<=50K    24720
>50K      7841
Name: income, dtype: int64

We see that there are 7841 instances with income greater than 50K USD. This is
~24% of the total number of samples. Thus, we will keep the same evaluation metric
as for the cat-in-the-dat dataset, i.e. AUC. Before we start modelling, for simplicity,
we will drop a few columns, which are numerical, namely:
- fnlwgt
- age
- capital.gain
- capital.loss
- hours.per.week
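
The scripts below read `../input/adult_folds.csv`, which is not created anywhere in these notes. A minimal sketch of how such a file could be produced, mirroring `create_folds.py` above, follows. The helper name `add_stratified_folds`, the fixed seed and the toy frame are assumptions made for illustration:

```python
import pandas as pd
from sklearn import model_selection


def add_stratified_folds(df, target_col, n_splits=5, seed=42):
    """Shuffle the rows and add a kfold column stratified on target_col."""
    # sample(frac=1) shuffles and returns a new dataframe
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    df["kfold"] = -1

    kf = model_selection.StratifiedKFold(n_splits=n_splits)
    y = df[target_col].values
    for fold, (_, valid_idx) in enumerate(kf.split(X=df, y=y)):
        df.loc[valid_idx, "kfold"] = fold
    return df


# toy stand-in for the adult data: 15 low-income and 5 high-income rows
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors"] * 10,
    "income": ["<=50K"] * 15 + [">50K"] * 5,
})
toy_folds = add_stratified_folds(toy, "income")
print(toy_folds.kfold.value_counts().sort_index())

# on the real data this would be, with the paths used in these notes:
# df = pd.read_csv("../data/adult.csv")
# add_stratified_folds(df, "income").to_csv("../input/adult_folds.csv", index=False)
```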

## Logistic Regression


In [76]:
%%writefile ohe_logres_adult.py
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder

def run(fold):
  # Load the full training data with folds
  df = pd.read_csv('../input/adult_folds.csv')

  # list of numerical columns
  num_cols = [
  "fnlwgt",
  "age",
  "capital.gain",
  "capital.loss",
  "hours.per.week"
  ]

  # drop numerical columns
  df = df.drop(num_cols, axis=1)

  # map targets to 0s and 1s
  target_mapping = {
  "<=50K": 0,
  ">50K": 1
  }
  df.loc[:, "income"] = df.income.map(target_mapping)

  # all columns are features except income (target) and kfold columns
  features = [
  f for f in df.columns if f not in ("kfold", "income")
  ]
  # fill all NaN values with NONE
  # note that I am converting all columns to "strings"
  # it doesn’t matter because all are categories

  for col in features:
    df.loc[:, col] = df[col].astype(str).fillna("NONE")

  # get training data using folds
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # get validation data using folds
  df_valid = df[df.kfold == fold].reset_index(drop=True)

  # initialize OneHotEncoder from scikit-learn
  ohe = OneHotEncoder()
  # fit ohe on training + validation features
  full_data = pd.concat(
  [df_train[features], df_valid[features]],
  axis=0
  )

  ohe.fit(full_data[features])
  
  # transform training data
  x_train = ohe.transform(df_train[features])
  
  # transform validation data
  x_valid = ohe.transform(df_valid[features])
  
  # initialize Logistic Regression model
  model = LogisticRegression()

  # fit model on training data (ohe)
  model.fit(x_train, df_train.income.values)

  # predict on validation data
  # we need the probability values as we are calculating AUC
  # we will use the probability of 1s
  valid_preds = model.predict_proba(x_valid)[:, 1]

  # get roc auc score
  auc = roc_auc_score(df_valid.income.values, valid_preds)

  # print auc
  print(f"Fold = {fold}, AUC = {auc}")

if __name__ == "__main__":
  for fold_ in range(5):
    run(fold_)

Writing ohe_logres_adult.py


In [77]:
!python -W ignore ohe_logres_adult.py

Fold = 0, AUC = 0.8794835490830637
Fold = 1, AUC = 0.887601339079321
Fold = 2, AUC = 0.8852609687685753
Fold = 3, AUC = 0.8681271052110165
Fold = 4, AUC = 0.8728581541840037


## XGBoost with numerical columns

In [78]:
%%writefile lbl_xgb_adult.py
import pandas as pd

import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

def run(fold):
  # Load the full training data with folds
  df = pd.read_csv('../input/adult_folds.csv')

  # list of numerical columns
  num_cols = [
  "fnlwgt",
  "age",
  "capital.gain",
  "capital.loss",
  "hours.per.week"
  ]

  # drop numerical columns
  df = df.drop(num_cols, axis=1)

  # map targets to 0s and 1s
  target_mapping = {
  "<=50K": 0,
  ">50K": 1
  }
  df.loc[:, "income"] = df.income.map(target_mapping)

  # all columns are features except the target (income) and kfold columns
  features = [
    f for f in df.columns if f not in ("kfold", "income")
  ]
  # fill all NaN values with NONE
  # note that I am converting all columns to "strings"
  # it doesn't matter because all are categories
  # (fillna must come before astype(str); otherwise NaN becomes the string "nan")
  for col in features:
    df.loc[:, col] = df[col].fillna("NONE").astype(str)

  # now its time to label encode the features
  for col in features:
    if col not in num_cols:
      # initialize LabelEncoder for each feature column
      lbl = LabelEncoder()
      # fit label encoder on all data
      lbl.fit(df[col])
      # transform all the data
      df.loc[:, col] = lbl.transform(df[col])

  # get training data using folds
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # get validation data using folds
  df_valid = df[df.kfold == fold].reset_index(drop=True)

  x_train = df_train[features].values
 
  # get validation data
  x_valid = df_valid[features].values
 
  # initialize xgboost model
  model = xgb.XGBClassifier(
  n_jobs=-1
  )

  # fit model on training data (label encoded)
  model.fit(x_train, df_train.income.values)

  # predict on validation data
  # we need the probability values as we are calculating AUC
  # we will use the probability of 1s
  valid_preds = model.predict_proba(x_valid)[:, 1]

  # get roc auc score
  auc = roc_auc_score(df_valid.income.values, valid_preds)

  # print auc
  print(f"Fold = {fold}, AUC = {auc}")

if __name__ == "__main__":
  for fold_ in range(5):
    run(fold_)

Writing lbl_xgb_adult.py


In [None]:
!python lbl_xgb_adult.py

## XGBoost with feature engineering

In [79]:
%%writefile xgb_num_feat.py

import itertools
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn import preprocessing

def feature_engineering(df, cat_cols):
  """
  This function is used for feature engineering
  :param df: the pandas dataframe with train/test data
  :param cat_cols: list of categorical columns
  :return: dataframe with new features
  """
  # this will create all 2-combinations of values
  # in this list
  # for example:
  # list(itertools.combinations([1,2,3], 2)) will return
  # [(1, 2), (1, 3), (2, 3)]
  combi = list(itertools.combinations(cat_cols, 2))
  for c1, c2 in combi:
    df.loc[
    :,
    c1 + "_" + c2
    ] = df[c1].astype(str) + "_" + df[c2].astype(str)
  return df

def run(fold):
  # load the full training data with folds
  df = pd.read_csv('../input/adult_folds.csv')

  # list of numerical columns
  num_cols = [
  "fnlwgt",
  "age",
  "capital.gain",
  "capital.loss",
  "hours.per.week"
  ]

  # drop numerical columns
  df = df.drop(num_cols, axis=1)

  # map targets to 0s and 1s
  target_mapping = {
  "<=50K": 0,
  ">50K": 1
  }
  df.loc[:, "income"] = df.income.map(target_mapping)

  # list of categorical columns for feature engineering
  cat_cols = [
  c for c in df.columns if c not in num_cols
  and c not in ("kfold", "income")
  ]
  
  # add new features
  df = feature_engineering(df, cat_cols)
  
  # all columns are features except the target (income) and kfold columns
  features = [
    f for f in df.columns if f not in ("kfold", "income")
  ]
  # fill all NaN values with NONE
  # note that I am converting all columns to "strings"
  # it doesn't matter because all are categories
  # (fillna must come before astype(str); otherwise NaN becomes the string "nan")
  for col in features:
    df.loc[:, col] = df[col].fillna("NONE").astype(str)

  # now its time to label encode the features
  for col in features:
    if col not in num_cols:
      # initialize LabelEncoder for each feature column
      lbl = preprocessing.LabelEncoder()
      # fit label encoder on all data
      lbl.fit(df[col])
      # transform all the data
      df.loc[:, col] = lbl.transform(df[col])

  # get training data using folds
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # get validation data using folds
  df_valid = df[df.kfold == fold].reset_index(drop=True)

  x_train = df_train[features].values
 
  # get validation data
  x_valid = df_valid[features].values
 
  # initialize xgboost model
  model = xgb.XGBClassifier(
  n_jobs=-1
  )

  # fit model on training data (label encoded)
  model.fit(x_train, df_train.income.values)

  # predict on validation data
  # we need the probability values as we are calculating AUC
  # we will use the probability of 1s
  valid_preds = model.predict_proba(x_valid)[:, 1]

  # get roc auc score
  auc = roc_auc_score(df_valid.income.values, valid_preds)

  # print auc
  print(f"Fold = {fold}, AUC = {auc}")

if __name__ == "__main__":
  for fold_ in range(5):
    run(fold_)  

Writing xgb_num_feat.py


This is a very naïve way of creating features from categorical columns. One should
take a look at the data and see which combinations make the most sense. If you use
this method, you might end up creating a lot of features, and in that case, you will
need to use some kind of feature selection to select the best features.
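
To get a feel for how quickly the pairwise approach grows, the sketch below counts the combined columns. It assumes the nine non-numerical columns from the dataframe header shown earlier are the categorical ones (`education.num` is treated as categorical here because it is not in `num_cols`); the count itself is just n(n-1)/2.

```python
import itertools
import math

# assumed list of categorical columns, taken from the dataframe header
cat_cols = [
    "workclass", "education", "education.num", "marital.status",
    "occupation", "relationship", "race", "sex", "native.country",
]

# same pairing logic as feature_engineering(), applied to names only
pairs = list(itertools.combinations(cat_cols, 2))
new_names = [c1 + "_" + c2 for c1, c2 in pairs]

# n columns give n * (n - 1) / 2 pairwise features
print(len(cat_cols), "columns ->", len(pairs), "pairwise features")
print("math.comb check:", math.comb(len(cat_cols), 2))
```

Going to 3-combinations would give `math.comb(n, 3)` extra columns, which is why some form of feature selection becomes necessary.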

In [None]:
!python xgb_num_feat.py

One more way of feature engineering from categorical features is to use target
encoding. However, you have to be very careful here as this might overfit your
model. Target encoding is a technique in which you map each category in a given
feature to its mean target value, but this must always be done in a cross-validated
manner. It means that the first thing you do is create the folds, and then use those
folds to create target encoding features for different columns of the data in the same
way you fit and predict the model on folds. So, if you have created 5 folds, you
have to create target encoding 5 times such that in the end, you have encoding for
variables in each fold which are not derived from the same fold. And then when
you fit your model, you must use the same folds again. Target encoding for unseen
test data can be derived from the full training data or can be an average of all the 5
folds.
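
For the last point, encoding unseen test data from the full training data, a minimal sketch could look as follows. The toy frames are made up for illustration, and falling back to the global mean for categories that never appear in training is an assumption of this sketch, not something prescribed above.

```python
import pandas as pd

# toy data, made up for illustration only
train = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov", "State-gov", "Private"],
    "income": [1, 0, 0, 0, 1],
})
test = pd.DataFrame({"workclass": ["Private", "State-gov", "Never-worked"]})

# category -> mean target, computed on the full training data
mapping = train.groupby("workclass")["income"].mean()
global_mean = train["income"].mean()

# categories never seen in training map to NaN;
# filling them with the global mean is an assumption of this sketch
test["workclass_enc"] = test["workclass"].map(mapping).fillna(global_mean)
print(test)
```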

In [80]:
%%writefile target_encoding.py
import copy
import pandas as pd
from sklearn import metrics
from sklearn import preprocessing
import xgboost as xgb
def mean_target_encoding(data):
  # make a copy of dataframe
  df = copy.deepcopy(data)
  
  # list of numerical columns
  num_cols = [
  "fnlwgt",
  "age",
  "capital.gain",
  "capital.loss",
  "hours.per.week"
  ]
  
  # map targets to 0s and 1s
  target_mapping = {
  "<=50K": 0,
  ">50K": 1
  }
  df.loc[:, "income"] = df.income.map(target_mapping)
  
  # all columns are features except income and kfold columns
  features = [
  f for f in df.columns if f not in ("kfold", "income")
  and f not in num_cols
  ]
  
  # fill all NaN values with NONE
  # note that I am converting all columns to "strings"
  # it doesn't matter because all are categories
  for col in features:
    # do not encode the numerical columns
    if col not in num_cols:
      df.loc[:, col] = df[col].fillna("NONE").astype(str)

  # now it's time to label encode the features
  for col in features:
    if col not in num_cols:
      # initialize LabelEncoder for each feature column
      lbl = preprocessing.LabelEncoder()
      # fit label encoder on all data
      lbl.fit(df[col])
      # transform all the data
      df.loc[:, col] = lbl.transform(df[col])
  
  # a list to store 5 validation dataframes
  encoded_dfs = []
  
  # go over all folds
  for fold in range(5):
  
    # fetch training and validation data
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
  
    # for all feature columns, i.e. categorical columns
    for column in features:
  
      # create dict of category:mean target
      mapping_dict = dict(
      df_train.groupby(column)["income"].mean()
      )
      # column_enc is the new column we have with mean encoding
      df_valid.loc[
      :, column + "_enc"
      ] = df_valid[column].map(mapping_dict)
    # append to our list of encoded validation dataframes
    encoded_dfs.append(df_valid)
  # create full data frame again and return
  encoded_df = pd.concat(encoded_dfs, axis=0)
  return encoded_df

def run(df, fold):
  # note that folds are same as before
  
  # get training data using folds
  df_train = df[df.kfold != fold].reset_index(drop=True)
  
  # get validation data using folds
  df_valid = df[df.kfold == fold].reset_index(drop=True)

  # all columns are features except income and kfold columns
  features = [
  f for f in df.columns if f not in ("kfold", "income")
  ]
  
  # get training data (no scaling is applied)
  x_train = df_train[features].values
  
  # get validation data
  x_valid = df_valid[features].values
  
  # initialize xgboost model
  model = xgb.XGBClassifier(
  n_jobs=-1,
  max_depth=7
  )
  
  # fit model on training data (label + mean target encoded)
  model.fit(x_train, df_train.income.values)
  
  # predict on validation data
  # we need the probability values as we are calculating AUC
  # we will use the probability of 1s
  valid_preds = model.predict_proba(x_valid)[:, 1]
  
  # get roc auc score
  auc = metrics.roc_auc_score(df_valid.income.values, valid_preds)
  
  # print auc
  print(f"Fold = {fold}, AUC = {auc}")
  
if __name__ == "__main__":
  # read data
  df = pd.read_csv("../input/adult_folds.csv")

  # create mean target encoded categories and
  # munge data
  df = mean_target_encoding(df)

  # run training and validation for 5 folds
  for fold_ in range(5):
    run(df, fold_)

Writing target_encoding.py


In [None]:
!python target_encoding.py

Nice! It seems like we have improved again. However, you must be very careful
when using target encoding as it is very prone to overfitting. When we use target
encoding, it's better to apply some kind of smoothing or to add noise to the encoded
values. The scikit-learn-contrib repository has a target encoding implementation
with smoothing, or you can create your own smoothing, which is not very difficult.
Smoothing introduces some kind of regularization that helps keep the model from
overfitting.
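
As a rough illustration of home-made smoothing, the sketch below shrinks each category mean towards the global mean, weighted by how many rows the category has. This is one common formulation, not necessarily the one used by any particular library; the function name `smoothed_target_mean` and the weight `m` are hypothetical.

```python
import pandas as pd

def smoothed_target_mean(train, column, target, m=10.0):
    """Return a category -> smoothed mean target Series.

    m is a hypothetical smoothing weight: the larger it is, the more
    rare categories are pulled towards the global mean.
    """
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["count", "mean"])
    return (
        stats["count"] * stats["mean"] + m * global_mean
    ) / (stats["count"] + m)

# toy data, made up for illustration only
train = pd.DataFrame({
    "occupation": ["Sales", "Sales", "Sales", "Sales", "Armed-Forces"],
    "income": [1, 0, 1, 0, 1],
})

smoothed = smoothed_target_mean(train, "occupation", "income", m=2.0)
print(smoothed)
```

The single "Armed-Forces" row has a raw mean of 1.0, but its smoothed value lands much closer to the global mean of 0.6. The mapping should still be computed on the training folds only, exactly as in `mean_target_encoding`.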



## Neural Network with entity embeddings

In entity embeddings, the categories are represented as vectors. We also represent
categories by vectors in both the binarization and one-hot encoding approaches.
But what if we have tens of thousands of categories? This will create huge matrices,
and it will take a long time to train complicated models. We can instead represent
each category by a much shorter vector of float values.

The idea is super simple. You have an embedding layer for each categorical feature,
so every category in a column can be mapped to an embedding (like mapping
words to embeddings in natural language processing). You then reshape these
embeddings to their dimension to make them flat and concatenate all the
flattened input embeddings. Then add a bunch of dense layers and an output layer,
and you are done.
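
The lookup-and-concatenate step can be sketched with plain NumPy. This is only an untrained forward pass meant to show the shapes involved, not a trainable model; the toy inputs, the embedding sizes and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# label-encoded categorical inputs: 4 rows, 2 features (toy values)
x = np.array([
    [0, 2],
    [1, 0],
    [2, 1],
    [0, 0],
])
n_unique = [3, 3]    # number of categories per feature
embed_dims = [2, 2]  # embedding size per feature (arbitrary choice)

# one embedding table per categorical feature, randomly initialised;
# in a real network these tables are trainable weights
tables = [rng.normal(size=(n, d)) for n, d in zip(n_unique, embed_dims)]

# look up each feature's vector and concatenate them per row
embedded = [tables[j][x[:, j]] for j in range(x.shape[1])]
features = np.concatenate(embedded, axis=1)  # shape: (4, 2 + 2)

# a single dense output layer with a sigmoid, untrained
w = rng.normal(size=(features.shape[1], 1))
b = np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))

print(features.shape, probs.shape)
```

Rows 0 and 3 share category 0 in the first feature, so the first two entries of their concatenated vectors are identical. That sharing is what lets the network learn one representation per category.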

In [None]:
%%writefile entity_embeddings.py


