## The `utils` module

Behind the preprocessors in the `preprocessing` module there is a series of utility functions.

Initially I did not intend to "expose" them to the user, but I believe they can be useful for all sorts of preprocessing tasks, so let me discuss them briefly.

The utilities in the module are:

* `dense_utils.label_encoder`
* `text_utils.simple_preprocess`
* `text_utils.get_texts`
* `text_utils.pad_sequences`
* `text_utils.build_embeddings_matrix`
* `fastai_transforms.Tokenizer`
* `fastai_transforms.Vocab`
* `image_utils.SimplePreprocessor`
* `image_utils.AspectAwarePreprocessor`

Let's have a look at what they do and how they might be useful to the user in general.

### 1.1 Dense utils

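`label_encoder` is used by the `DeepPreprocessor` class. It is simply a numerical encoder for categorical features: it returns the encoded DataFrame together with a dictionary that maps each category to its integer code.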

In [1]:
import pandas as pd
import pytorch_widedeep as wd

In [2]:
df = pd.read_csv("data/adult/adult.csv.zip")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [4]:
enc_df, enc_dict = wd.utils.dense_utils.label_encoder(df)

In [5]:
enc_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,0,226802,0,7,0,0,0,0,0,0,0,40,0,0
1,38,0,89814,1,9,1,1,1,1,0,0,0,50,0,0
2,28,1,336951,2,12,1,2,1,1,0,0,0,40,0,1
3,44,0,160323,3,10,1,0,1,0,0,7688,0,40,0,1
4,18,2,103497,3,10,0,3,0,1,1,0,0,30,0,0


In [6]:
enc_dict

{'workclass': {'Private': 0,
  'Local-gov': 1,
  '?': 2,
  'Self-emp-not-inc': 3,
  'Federal-gov': 4,
  'State-gov': 5,
  'Self-emp-inc': 6,
  'Without-pay': 7,
  'Never-worked': 8},
 'education': {'11th': 0,
  'HS-grad': 1,
  'Assoc-acdm': 2,
  'Some-college': 3,
  '10th': 4,
  'Prof-school': 5,
  '7th-8th': 6,
  'Bachelors': 7,
  'Masters': 8,
  'Doctorate': 9,
  '5th-6th': 10,
  'Assoc-voc': 11,
  '9th': 12,
  '12th': 13,
  '1st-4th': 14,
  'Preschool': 15},
 'marital-status': {'Never-married': 0,
  'Married-civ-spouse': 1,
  'Widowed': 2,
  'Divorced': 3,
  'Separated': 4,
  'Married-spouse-absent': 5,
  'Married-AF-spouse': 6},
 'occupation': {'Machine-op-inspct': 0,
  'Farming-fishing': 1,
  'Protective-serv': 2,
  '?': 3,
  'Other-service': 4,
  'Prof-specialty': 5,
  'Craft-repair': 6,
  'Adm-clerical': 7,
  'Exec-managerial': 8,
  'Tech-support': 9,
  'Sales': 10,
  'Priv-house-serv': 11,
  'Transport-moving': 12,
  'Handlers-cleaners': 13,
  'Armed-Forces': 14},
 'relationshi
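
For intuition, the same kind of encoding can be reproduced with plain pandas. The snippet below is a minimal sketch and not the library's implementation: the helper name `simple_label_encoder` and the toy DataFrame are made up for illustration, and it assumes that categories are numbered in order of first appearance, which is what the output above suggests. It also shows how the returned dictionary can be inverted to recover the original labels.

```python
import pandas as pd


def simple_label_encoder(df, cols=None):
    """Encode non-numeric columns as integers, numbering categories by first appearance."""
    out = df.copy()
    if cols is None:
        # Treat every non-numeric column as categorical (an assumption of this sketch)
        cols = [c for c in out.columns if not pd.api.types.is_numeric_dtype(out[c])]
    encoding_dict = {}
    for col in cols:
        # Series.unique() preserves the order in which values first appear
        mapping = {val: idx for idx, val in enumerate(out[col].unique())}
        out[col] = out[col].map(mapping)
        encoding_dict[col] = mapping
    return out, encoding_dict


# Toy data, invented for this example
toy = pd.DataFrame(
    {
        "age": [25, 38, 28, 44],
        "workclass": ["Private", "Private", "Local-gov", "Private"],
        "income": ["<=50K", "<=50K", ">50K", ">50K"],
    }
)
toy_enc, toy_dict = simple_label_encoder(toy)

# Invert the mapping to recover the original labels from the integer codes
inverse = {col: {idx: val for val, idx in m.items()} for col, m in toy_dict.items()}
decoded = toy_enc["workclass"].map(inverse["workclass"])
```

Keeping the dictionary around is what makes the encoding reusable: the same mapping can be applied to new data, and inverted when predictions need to be read back as labels.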

### 1.2 Text utils

The following utilities are used by the `TextPreprocessor`.

In [8]:
df = pd.read_csv("data/airbnb/airbnb_sample.csv")

In [10]:
texts = df.description.tolist()
texts[:2]

["My bright double bedroom with a large window has a relaxed feeling! It comfortably fits one or two and is centrally located just two blocks from Finsbury Park. Enjoy great restaurants in the area and easy access to easy transport tubes, trains and buses. Babies and children of all ages are welcome. Hello Everyone, I'm offering my lovely double bedroom in Finsbury Park area (zone 2) for let in a shared apartment.  You will share the apartment with me and it is fully furnished with a self catering kitchen. Two people can easily sleep well as the room has a queen size bed. I also have a travel cot for a baby for guest with small children.  I will require a deposit up front as a security gesture on both our parts and will be given back to you when you return the keys.  I trust anyone who will be responding to this add would treat my home with care and respect .  Best Wishes  Alina Guest will have access to the self catering kitchen and bathroom. There is the flat is equipped wifi interne

In [13]:
tokens = wd.utils.text_utils.get_texts(texts)
print(tokens[0])

['xxmaj', 'my', 'bright', 'double', 'bedroom', 'with', 'large', 'window', 'has', 'relaxed', 'feeling', 'xxmaj', 'it', 'comfortably', 'fits', 'one', 'or', 'two', 'and', 'is', 'centrally', 'located', 'just', 'two', 'blocks', 'from', 'xxmaj', 'finsbury', 'xxmaj', 'park', 'xxmaj', 'enjoy', 'great', 'restaurants', 'in', 'the', 'area', 'and', 'easy', 'access', 'to', 'easy', 'transport', 'tubes', 'trains', 'and', 'buses', 'xxmaj', 'babies', 'and', 'children', 'of', 'all', 'ages', 'are', 'welcome', 'xxmaj', 'hello', 'xxmaj', 'everyone', 'offering', 'my', 'lovely', 'double', 'bedroom', 'in', 'xxmaj', 'finsbury', 'xxmaj', 'park', 'area', 'zone', 'for', 'let', 'in', 'shared', 'apartment', 'xxmaj', 'you', 'will', 'share', 'the', 'apartment', 'with', 'me', 'and', 'it', 'is', 'fully', 'furnished', 'with', 'self', 'catering', 'kitchen', 'xxmaj', 'two', 'people', 'can', 'easily', 'sleep', 'well', 'as', 'the', 'room', 'has', 'queen', 'size', 'bed', 'also', 'have', 'travel', 'cot', 'for', 'baby', 'for',

In [16]:
vocab = wd.utils.fastai_transforms.Vocab

In [17]:
vocabulary = vocab.create(tokens, max_vocab=2000, min_freq=1)

In [20]:
vocabulary.stoi

defaultdict(int,
            {'xxunk': 0,
             'xxpad': 1,
             'xxbos': 2,
             'xxeos': 3,
             'xxfld': 4,
             'xxmaj': 5,
             'xxup': 6,
             'xxrep': 7,
             'xxwrep': 8,
             'and': 9,
             'the': 10,
             'to': 11,
             'is': 12,
             'in': 13,
             'of': 14,
             'with': 15,
             'london': 16,
             'for': 17,
             'you': 18,
             'room': 19,
             'are': 20,
             'from': 21,
             'flat': 22,
             'on': 23,
             'bedroom': 24,
             'there': 25,
             'it': 26,
             'walk': 27,
             'double': 28,
             'bed': 29,
             'house': 30,
             'has': 31,
             'kitchen': 32,
             'minutes': 33,
             'apartment': 34,
             'all': 35,
             'this': 36,
             'have': 37,
             'very': 38,
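
To make the mapping above less opaque, here is a minimal sketch of how a frequency-ordered vocabulary can be built with the standard library. It is an illustration and not the `Vocab.create` implementation: the helper names `build_vocab` and `numericalize` and the toy token lists are hypothetical, the special tokens are simply copied from the output above, and ties in frequency are broken by first appearance.

```python
from collections import Counter

# Special tokens, in the order shown in the `stoi` output above
SPECIAL_TOKENS = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld", "xxmaj", "xxup", "xxrep", "xxwrep"]


def build_vocab(tokenized_texts, max_vocab=2000, min_freq=1):
    """Return (itos, stoi): special tokens first, then tokens by descending frequency."""
    counts = Counter(tok for doc in tokenized_texts for tok in doc)
    itos = list(SPECIAL_TOKENS)
    for tok, freq in counts.most_common(max_vocab):
        if freq >= min_freq and tok not in SPECIAL_TOKENS:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(doc, stoi, unk_idx=0):
    """Map tokens to indices; unseen tokens fall back to the 'xxunk' index."""
    return [stoi.get(tok, unk_idx) for tok in doc]


# Toy tokenized texts, invented for this example
toy_tokens = [
    ["xxmaj", "bright", "double", "bedroom"],
    ["xxmaj", "double", "bedroom", "in", "xxmaj", "london"],
]
itos, stoi = build_vocab(toy_tokens)

# "garden" never appears in toy_tokens, so it maps to the unknown index
ids = numericalize(["double", "bedroom", "garden"], stoi)
```

Reserving the lowest indices for the special tokens keeps `xxunk` at 0 and `xxpad` at 1 regardless of the corpus, which is why `pad_idx=1` can be used safely when padding.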
         

In [27]:
sequences = [vocabulary.numericalize(t) for t in tokens]
padded_seq = [wd.utils.text_utils.pad_sequences(s, maxlen=200, pad_idx=1) for s in sequences]

In [28]:
padded_seq[0]

array([   1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    5,   70,  104,   28,   24,   15,   46,  450,   31,
        600, 1173,    5,   26,  508, 1299,   56,   41,   54,    9,   12,
        463,   69,   64,   54, 1474,   21,    5,  714,    5,   57,    5,
        200,   67,   61,   13,   10,   44,    9,  109,   75,   11,  109,
         94,  817,  400,    9,  164,    5, 1973,    9,  349,   14,   35,
          0,   20,  140,    5,  794,    5,  909,  646,   70,   77,   28,
         24,   13,    5,  714,    5,   57,   44,  163,   17,  392,   13,
        146,   34,    5,   18,   53,  293,   10,   34,   15,  247,    9,
         26,   12,  110,  171,   15,  294,  726,   32,    5,   54,  124,
         47,  583,  295,   79,   40,   10,   19,   31,  509,  191,   29,
         48,   37,  367,  818,   17,  910,   17,  177,   15,  122,  349,
         53,  879, 1174,  126,  393,   40,  911,    0,   23,  228,   71,
        819,    9,   53,   55, 1380,  225,   11,   
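
The output above shows that short sequences are padded on the left with `pad_idx`. A minimal NumPy sketch of that behaviour follows. The helper name `pad_left` is hypothetical, and keeping the last `maxlen` tokens of an over-long sequence is an assumption of this sketch rather than something the output above confirms.

```python
import numpy as np


def pad_left(seq, maxlen, pad_idx=1):
    """Return an int32 array of length maxlen, left-padded with pad_idx."""
    seq = list(seq)
    if len(seq) >= maxlen:
        # Assumption: over-long sequences keep their last maxlen tokens
        return np.array(seq[len(seq) - maxlen:], dtype="int32")
    out = np.full(maxlen, pad_idx, dtype="int32")
    if seq:
        # Write the tokens into the right-hand end of the array
        out[-len(seq):] = seq
    return out


short = pad_left([5, 70, 104], maxlen=6)  # shorter than maxlen: gets padded
long = pad_left(range(10), maxlen=4)      # longer than maxlen: gets truncated
```

Padding on the left keeps the actual tokens at the end of the sequence, so a recurrent model reads them last instead of finishing on a run of padding.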

### 1.4 Image utils