#  Processors and Utils

Description of the main tools and utilities that one needs to prepare the data for a `WideDeep` model constructor. 

#### The `preprocessing`  module

There are 4 preprocessors, corresponding to 4 main components of the `WideDeep` model. These are

* `WidePreprocessor`
* `TabPreprocessor`
* `TextPreprocessor`
* `ImagePreprocessor` 

Behind the scenes, these preprocessors use a series of helper functions and classes that are in the `utils` module. If you are interested, please have a look at the documentation.

##  1. WidePreprocessor

The `wide` component of the model is a linear model that, in principle, could be implemented as a linear layer receiving the result of one-hot encoding the categorical columns. However, this is not memory efficient. Therefore, we implement the linear layer as an Embedding layer plus a bias. I will explain this in a bit more detail later. 

With that in mind, `WidePreprocessor` simply encodes the categories numerically so that they are the indexes of the lookup table that is an Embedding layer.
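To make the "Embedding layer plus a bias" idea a bit more concrete, here is a minimal NumPy sketch (not the library's code). It assumes a single output unit, a made-up weight table `weights` whose row `0` is kept at zero for padding, and a toy index matrix `X_idx`. It checks that summing the looked-up weights gives the same result as multiplying the one-hot matrix by the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 5  # encoded values 1..5; index 0 is kept for padding/unknown
weights = rng.normal(size=(n_categories + 1, 1))  # one weight per index
weights[0] = 0.0  # assumption: the padding row contributes nothing
bias = 0.5

# Two label-encoded columns per row, sharing one index space
X_idx = np.array([[1, 4], [2, 5], [0, 3]])

# Lookup version: pick the weight of every index and sum them up
out_lookup = weights[X_idx].sum(axis=1) + bias

# Dense version: build the one-hot matrix and apply a linear layer
one_hot = np.zeros((X_idx.shape[0], n_categories + 1))
for row, idxs in enumerate(X_idx):
    one_hot[row, idxs] = 1.0
out_dense = one_hot @ weights + bias

print(np.allclose(out_lookup, out_dense))
```

The lookup version never materialises the one-hot matrix, which is where the memory saving comes from when there are many categories.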

For example

In [1]:
import numpy as np
import pandas as pd
import pytorch_widedeep as wd

from pytorch_widedeep.datasets import load_adult
from pytorch_widedeep.preprocessing import WidePreprocessor



In [2]:
df = load_adult(as_frame=True)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [3]:
wide_cols = [
    "education",
    "relationship",
    "workclass",
    "occupation",
    "native-country",
    "gender",
]
crossed_cols = [("education", "occupation"), ("native-country", "occupation")]

In [4]:
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_wide = wide_preprocessor.transform(new_df)

In [5]:
X_wide

array([[  1,  17,  23, ...,  89,  91, 316],
       [  2,  18,  23, ...,  89,  92, 317],
       [  3,  18,  24, ...,  89,  93, 318],
       ...,
       [  2,  20,  23, ...,  90, 103, 323],
       [  2,  17,  23, ...,  89, 103, 323],
       [  2,  21,  29, ...,  90, 115, 324]])

Note that the label encoding starts from `1`. This is because it is convenient to reserve `0` for padding, i.e. unknown categories. Let's take, for example, the first entry

In [6]:
X_wide[0]

array([  1,  17,  23,  32,  47,  89,  91, 316])

In [7]:
wide_preprocessor.inverse_transform(X_wide[:1])

Unnamed: 0,education,relationship,workclass,occupation,native-country,gender,education_occupation,native-country_occupation
0,11th,Own-child,Private,Machine-op-inspct,United-States,Male,11th-Machine-op-inspct,United-States-Machine-op-inspct


As we can see, `wide_preprocessor` numerically encodes the `wide_cols` and the `crossed_cols`, which can be recovered using the method `inverse_transform`.
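The encoding above can be mimicked with a few lines of plain Python. This is only a sketch under my own assumptions, not the library's implementation: the helper names (`add_crosses`, `fit_encoding`, `transform`) are hypothetical, all columns share one index space that starts at `1`, and anything not seen during fitting maps to `0`. The crossed column names and values follow the pattern visible in the `inverse_transform` output.

```python
def add_crosses(row, crossed_cols):
    """Return a copy of `row` with one extra entry per crossed column."""
    out = dict(row)
    for cols in crossed_cols:
        out["_".join(cols)] = "-".join(row[c] for c in cols)
    return out


def fit_encoding(rows, cols):
    """Assign one integer (starting at 1) to every column/value pair."""
    encoding = {}
    for col in cols:  # column by column, so indexes are contiguous per column
        for row in rows:
            key = f"{col}_{row[col]}"
            if key not in encoding:
                encoding[key] = len(encoding) + 1
    return encoding


def transform(rows, cols, encoding):
    """Encode rows; unseen column/value pairs become 0."""
    return [[encoding.get(f"{col}_{row[col]}", 0) for col in cols] for row in rows]


wide_cols = ["education", "occupation"]
crossed_cols = [("education", "occupation")]
all_cols = wide_cols + ["_".join(c) for c in crossed_cols]

train = [
    {"education": "11th", "occupation": "Sales"},
    {"education": "HS-grad", "occupation": "Sales"},
]
train = [add_crosses(r, crossed_cols) for r in train]
encoding = fit_encoding(train, all_cols)
X = transform(train, all_cols, encoding)

# A new observation with an education level that was never seen
new = [add_crosses({"education": "PhD", "occupation": "Sales"}, crossed_cols)]
X_new = transform(new, all_cols, encoding)
print(X, X_new)
```

Because the unseen education value also changes the crossed feature, both of those entries fall back to `0` while the known occupation keeps its index.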

##  2. TabPreprocessor

The `TabPreprocessor` offers a lot of different functionality. Let's explore some of it in detail. In its basic use, the `TabPreprocessor` simply label encodes the categorical columns and normalises the numerical ones (unless otherwise specified).

In [8]:
from pytorch_widedeep.preprocessing import TabPreprocessor

In [9]:
# cat_embed_cols = [(column_name, embed_dim), ...]
cat_embed_cols = [
    ("education", 10),
    ("relationship", 8),
    ("workclass", 10),
    ("occupation", 10),
    ("native-country", 10),
]
continuous_cols = ["age", "hours-per-week"]

In [10]:
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=cat_embed_cols,
    continuous_cols=continuous_cols,
    cols_to_scale=["age"],  # or scale=True or cols_to_scale=continuous_cols
)
X_tab = tab_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_tab = tab_preprocessor.transform(new_df)

In [11]:
X_tab

array([[ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00, ...,
         1.00000000e+00, -9.95128932e-01,  4.00000000e+01],
       [ 2.00000000e+00,  2.00000000e+00,  1.00000000e+00, ...,
         1.00000000e+00, -4.69415091e-02,  5.00000000e+01],
       [ 3.00000000e+00,  2.00000000e+00,  2.00000000e+00, ...,
         1.00000000e+00, -7.76316450e-01,  4.00000000e+01],
       ...,
       [ 2.00000000e+00,  4.00000000e+00,  1.00000000e+00, ...,
         1.00000000e+00,  1.41180837e+00,  4.00000000e+01],
       [ 2.00000000e+00,  1.00000000e+00,  1.00000000e+00, ...,
         1.00000000e+00, -1.21394141e+00,  2.00000000e+01],
       [ 2.00000000e+00,  5.00000000e+00,  7.00000000e+00, ...,
         1.00000000e+00,  9.74183408e-01,  4.00000000e+01]])

Note that the label encoding starts from `1`. This is because it is convenient to reserve `0` for padding, i.e. unknown categories. Let's take, for example, the first entry

In [12]:
X_tab[0]

array([ 1.        ,  1.        ,  1.        ,  1.        ,  1.        ,
       -0.99512893, 40.        ])

In [13]:
tab_preprocessor.inverse_transform(X_tab[:1])

Unnamed: 0,education,relationship,workclass,occupation,native-country,age,hours-per-week
0,11th,Own-child,Private,Machine-op-inspct,United-States,25.0,40.0
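The scaled `age` column is recovered exactly by `inverse_transform` because scaling is an invertible operation. The sketch below assumes standard scaling (subtract the mean, divide by the standard deviation), which is an assumption on my side about what "normalises" means here. The toy `age` values are the first five rows shown earlier.

```python
import numpy as np

age = np.array([25.0, 38.0, 28.0, 44.0, 18.0])

# "fit": store the statistics of the training data
mean, std = age.mean(), age.std()

# "transform": centre and scale
scaled = (age - mean) / std

# "inverse_transform": undo the scaling using the stored statistics
recovered = scaled * std + mean

print(np.allclose(recovered, age))
```

The important point is that the statistics are computed once during `fit` and then reused for any new data.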


The `TabPreprocessor` has a series of useful attributes that can later be used when instantiating the different tabular models, such as the column indexes (used internally in the models to slice the tensors) or the categorical embeddings setup

In [14]:
tab_preprocessor.column_idx

{'education': 0,
 'relationship': 1,
 'workclass': 2,
 'occupation': 3,
 'native-country': 4,
 'age': 5,
 'hours-per-week': 6}

In [15]:
# column name, num unique, embedding dim
tab_preprocessor.cat_embed_input

[('education', 16, 10),
 ('relationship', 6, 8),
 ('workclass', 9, 10),
 ('occupation', 15, 10),
 ('native-country', 42, 10)]
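To give an idea of how these two attributes might be used downstream, here is a small NumPy sketch. It is not how the library's models are written. The toy `column_idx`, `cat_embed_input` and `X` are cut-down versions of the ones above, the embedding tables are random, and giving each table one extra row for index `0` is my assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

column_idx = {"education": 0, "relationship": 1, "age": 2, "hours-per-week": 3}
cat_embed_input = [("education", 16, 10), ("relationship", 6, 8)]
continuous_cols = ["age", "hours-per-week"]

X = np.array([[1.0, 1.0, -0.995, 40.0], [2.0, 2.0, -0.047, 50.0]])

# Slice the array by column name instead of hard-coding positions
X_cat = X[:, [column_idx[name] for name, _, _ in cat_embed_input]].astype(int)
X_cont = X[:, [column_idx[name] for name in continuous_cols]]

# One lookup table per categorical column: (n_unique + 1) rows, embed_dim columns
tables = {
    name: rng.normal(size=(n_unique + 1, dim))
    for name, n_unique, dim in cat_embed_input
}

# Look up each column and concatenate the resulting embeddings
embedded = np.concatenate(
    [tables[name][X_cat[:, i]] for i, (name, _, _) in enumerate(cat_embed_input)],
    axis=1,
)
print(embedded.shape, X_cont.shape)
```

The width of `embedded` is simply the sum of the embedding dims, here 10 + 8.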

As mentioned, there is more one can do, for example quantize (or bucketize) the continuous columns. For this we can use the `quantization_setup` param. This parameter accepts a number of different inputs and uses `pd.cut` under the hood to quantize the continuous columns. For more info, please read the docs. Let's use it here to quantize "age" and "hours-per-week" into 4 and 5 "buckets" respectively

In [16]:
quantization_setup = {
    "age": 4,
    "hours-per-week": 5,
}  # you can also pass a list of floats with the boundaries if you wanted
quant_tab_preprocessor = TabPreprocessor(
    cat_embed_cols=cat_embed_cols,
    continuous_cols=continuous_cols,
    quantization_setup=quantization_setup,
)
qX_tab = quant_tab_preprocessor.fit_transform(df)



In [17]:
qX_tab

array([[1, 1, 1, ..., 1, 1, 2],
       [2, 2, 1, ..., 1, 2, 3],
       [3, 2, 2, ..., 1, 1, 2],
       ...,
       [2, 4, 1, ..., 1, 3, 2],
       [2, 1, 1, ..., 1, 1, 1],
       [2, 5, 7, ..., 1, 2, 2]])

Note that the continuous columns that have been bucketised are treated as any other categorical column

In [18]:
quant_tab_preprocessor.cat_embed_input

[('education', 16, 10),
 ('relationship', 6, 8),
 ('workclass', 9, 10),
 ('occupation', 15, 10),
 ('native-country', 42, 10),
 ('age', 4, 4),
 ('hours-per-week', 5, 4)]

Here the column 'age' now has 4 categories, which will be encoded using embeddings of 4 dims. Note that, as with any other categorical column, the categorical "counter" starts at 1. This is because all incoming values that are lower than the lowest value, or higher than the highest value, in the train (or already seen) dataset will be encoded as 0. 

In [19]:
np.unique(qX_tab[:, quant_tab_preprocessor.column_idx["age"]])

array([1, 2, 3, 4])

Finally, if we now wanted to `inverse_transform` the transformed array into the original dataframe, we could still do it, but the continuous, bucketized columns will be transformed back to the middle of their bucket range

In [20]:
df_decoded = quant_tab_preprocessor.inverse_transform(qX_tab)

Note that quantized cols will be turned into the mid point of the corresponding bin


In [21]:
df.head(2)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K


In [22]:
df_decoded.head(2)

Unnamed: 0,education,relationship,workclass,occupation,native-country,age,hours-per-week
0,11th,Own-child,Private,Machine-op-inspct,United-States,26.0885,30.4
1,HS-grad,Husband,Private,Farming-fishing,United-States,44.375,50.0


There is one final comment to make regarding the `inverse_transform` functionality. As we mentioned before, the encoding `0` is reserved for values that fall outside the range covered by the data we used to run the `fit` method. For example

In [23]:
df.age.min(), df.age.max()

(17, 90)

All future age values outside that range will be encoded as 0 and decoded as `NaN`

In [24]:
tmp_df = df.head(1).copy()
tmp_df.loc[:, "age"] = 5

In [25]:
tmp_df

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,5,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K


In [26]:
# quant_tab_preprocessor has already been fitted on data with an age range between 17 and 90
tmp_qX_tab = quant_tab_preprocessor.transform(tmp_df)

In [27]:
tmp_qX_tab

array([[1, 1, 1, 1, 1, 0, 2]])

In [28]:
quant_tab_preprocessor.inverse_transform(tmp_qX_tab)

Note that quantized cols will be turned into the mid point of the corresponding bin


Unnamed: 0,education,relationship,workclass,occupation,native-country,age,hours-per-week
0,11th,Own-child,Private,Machine-op-inspct,United-States,,30.4
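The whole quantization round trip can be sketched with NumPy. This is a simplified stand-in for what the preprocessor does with `pd.cut`, not its actual code: the helper names (`fit_bins`, `encode`, `decode`) are hypothetical and the bins are plain equal-width bins between the minimum and the maximum seen at fit time. `pd.cut` also nudges the lowest edge down slightly so that the minimum is included, which is likely why the library decodes the first age bucket to 26.0885 while this sketch gives 26.125.

```python
import numpy as np


def fit_bins(values, n_bins):
    """Equal-width bin edges between the min and max of the fitted data."""
    return np.linspace(float(np.min(values)), float(np.max(values)), n_bins + 1)


def encode(values, edges):
    """Bucket index starting at 1; values outside the fitted range become 0."""
    values = np.asarray(values, dtype=float)
    codes = np.searchsorted(edges, values, side="left")
    codes[values == edges[0]] = 1  # the minimum belongs to the first bucket
    codes[(values < edges[0]) | (values > edges[-1])] = 0
    return codes


def decode(codes, edges):
    """Mid point of each bucket; code 0 cannot be recovered and becomes NaN."""
    codes = np.asarray(codes)
    mids = (edges[:-1] + edges[1:]) / 2
    out = np.full(codes.shape, np.nan)
    known = codes > 0
    out[known] = mids[codes[known] - 1]
    return out


ages = np.array([17, 25, 38, 44, 90])
edges = fit_bins(ages, n_bins=4)
codes = encode(ages, edges)
new_codes = encode([5, 95], edges)  # both outside the fitted range
decoded = decode(np.array([1, 2, 0]), edges)
print(codes, new_codes, decoded)
```

Decoding is lossy by construction: every value in a bucket comes back as the same mid point, and out-of-range values come back as `NaN`.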


##  3. TextPreprocessor

This preprocessor returns the tokenised, padded sequences that will be directly fed to the stack of LSTMs.

In [29]:
from pytorch_widedeep.preprocessing import TextPreprocessor

In [30]:
# The airbnb dataset, which you could get from here:
# http://insideairbnb.com/get-the-data.html, is too big to be included in
# our datasets module (when including images). Therefore, go there,
# download it, and use the download_images.py script to get the images
# and the airbnb_data_processing.py to process the data. We'll find
# better datasets in the future ;). Note that here we are only using a
# small sample to illustrate the use, so PLEASE ignore the results, just
# focus on usage
df = pd.read_csv("../tmp_data/airbnb/airbnb_sample.csv")

In [31]:
texts = df.description.tolist()
texts[:2]

["My bright double bedroom with a large window has a relaxed feeling! It comfortably fits one or two and is centrally located just two blocks from Finsbury Park. Enjoy great restaurants in the area and easy access to easy transport tubes, trains and buses. Babies and children of all ages are welcome. Hello Everyone, I'm offering my lovely double bedroom in Finsbury Park area (zone 2) for let in a shared apartment.  You will share the apartment with me and it is fully furnished with a self catering kitchen. Two people can easily sleep well as the room has a queen size bed. I also have a travel cot for a baby for guest with small children.  I will require a deposit up front as a security gesture on both our parts and will be given back to you when you return the keys.  I trust anyone who will be responding to this add would treat my home with care and respect .  Best Wishes  Alina Guest will have access to the self catering kitchen and bathroom. There is the flat is equipped wifi interne

In [32]:
text_preprocessor = TextPreprocessor(text_col="description")
X_text = text_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_text = text_preprocessor.transform(new_df)

The vocabulary contains 2192 tokens


In [33]:
print(X_text[0])

[  29   48   37  367  818   17  910   17  177   15  122  349   53  879
 1174  126  393   40  911    0   23  228   71  819    9   53   55 1380
  225   11   18  308   18 1564   10  755    0  942  239   53   55    0
   11   36 1013  277 1974   70   62   15 1475    9  943    5  251    5
    0    5    0    5  177   53   37   75   11   10  294  726   32    9
   42    5   25   12   10   22   12  136  100  145]
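For intuition, "tokenised, padded sequences" can be produced with a few lines of plain Python. The sketch below is deliberately naive and rests on my own assumptions: whitespace tokenisation, index `0` for unknown tokens, index `1` for padding, padding added at the front and truncation at the end. The library's tokenizer and its index conventions may well differ.

```python
from collections import Counter

UNK, PAD = 0, 1  # assumed reserved indexes for this sketch


def build_vocab(texts, min_freq=1):
    """Map each sufficiently frequent token to an integer, starting at 2."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {}
    for tok, count in counts.most_common():
        if count >= min_freq:
            vocab[tok] = len(vocab) + 2
    return vocab


def encode(text, vocab, max_len):
    """Token ids truncated to max_len and left-padded to a fixed length."""
    ids = [vocab.get(tok, UNK) for tok in text.lower().split()][:max_len]
    return [PAD] * (max_len - len(ids)) + ids


texts = ["a cosy room", "a bright room near the park"]
vocab = build_vocab(texts)

short = encode("a cosy flat", vocab, max_len=5)  # "flat" is unknown
long_ = encode("a bright room near the park", vocab, max_len=4)
print(short, long_)
```

Every sequence ends up with the same length, which is what allows the batch to be stacked into a single array.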


## 4. ImagePreprocessor

`ImagePreprocessor` simply resizes the images, being aware of the aspect ratio.  

In [34]:
from pytorch_widedeep.preprocessing import ImagePreprocessor

In [35]:
image_preprocessor = ImagePreprocessor(
    img_col="id", img_path="../tmp_data/airbnb/property_picture/"
)
X_images = image_preprocessor.fit_transform(df)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_images = image_preprocessor.transform(new_df)

Reading Images from ../tmp_data/airbnb/property_picture/
Resizing


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1001/1001 [00:01<00:00, 667.89it/s]


Computing normalisation metrics


In [36]:
X_images[0].shape

(224, 224, 3)
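One common way to resize "being aware of the aspect ratio" is to scale the image until it covers the target size and then crop the excess around the centre. The sketch below only computes the arithmetic of that plan for a given input size. The function name `aspect_aware_plan` is hypothetical, and whether the library follows exactly this recipe is an assumption on my side.

```python
def aspect_aware_plan(height, width, target_h=224, target_w=224):
    """Return the intermediate size and the crop offsets for a centre crop."""
    # Scale by the larger ratio so both sides are at least the target size
    scale = max(target_h / height, target_w / width)
    new_h = max(target_h, round(height * scale))
    new_w = max(target_w, round(width * scale))
    # Offsets of the centred target_h x target_w window
    top = (new_h - target_h) // 2
    left = (new_w - target_w) // 2
    return (new_h, new_w), (top, left)


landscape = aspect_aware_plan(480, 640)
square = aspect_aware_plan(300, 300)
print(landscape, square)
```

Compared with a plain resize to 224 x 224, this avoids stretching the image at the cost of discarding a strip on the longer side.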