# Deep Learning Model using PyTorch OnFire
by José Fernández Portal

Let's start by installing the PyTorch OnFire library.

In [9]:
!pip3 install pytorch-onfire

Defaulting to user installation because normal site-packages is not writeable


**Disclaimer:** The library is in beta. If you find a bug or run into any problem, feel free to contact me at jose.fernandezportal@mercadolibre.com.

## Load data

In [5]:
import csv

In [3]:
def load_dataset(fname):
    with open(fname, 'r') as f:
        data = [dict(row) for row in csv.DictReader(f)]
    return data

def split_data(data, valid_size=0.2):
    train_data, valid_data = [], []
    for x in data:
        if abs(hash(x['item_id']) % 100) >= int(100*valid_size):
            train_data.append(x)
        else:
            valid_data.append(x)
    return train_data, valid_data


In [6]:
train_data, valid_data = split_data(load_dataset('csv/_train_2020_09_13.csv'))
test_data = load_dataset('csv/test.csv')

In [7]:
len(train_data), len(valid_data), len(test_data)

(115437, 29558, 19211)
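Note that `split_data` relies on Python's built-in `hash`, which is salted per process for strings unless `PYTHONHASHSEED` is set, so the exact split can change between sessions. Below is a minimal sketch of a reproducible variant. It assumes `item_id` is a string and swaps in `zlib.crc32` as the hash; the names `stable_bucket` and `stable_split` are hypothetical and not part of OnFire, and the resulting counts would differ from the ones shown above.

```python
import zlib

def stable_bucket(key, n_buckets=100):
    # crc32 is deterministic across processes, unlike the built-in hash() for str
    return zlib.crc32(str(key).encode('utf-8')) % n_buckets

def stable_split(data, valid_size=0.2, key='item_id'):
    # Rows whose bucket falls below the threshold go to the validation set
    threshold = int(100 * valid_size)
    train_rows, valid_rows = [], []
    for x in data:
        if stable_bucket(x[key]) >= threshold:
            train_rows.append(x)
        else:
            valid_rows.append(x)
    return train_rows, valid_rows

# Toy rows standing in for the CSV records
rows = [{'item_id': str(i)} for i in range(1000)]
train_rows, valid_rows = stable_split(rows)
```

Because the bucket depends only on the key, every row with the same `item_id` always lands on the same side of the split.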

## Preprocessing

### Feature engineering

You can define custom functions to transform your data. The `mappify` decorator lets you write element-wise functions that are then applied to entire collections.

In [8]:
from onfire.utils import mappify
from sklearn.preprocessing import FunctionTransformer as FT
from functools import partial
from datetime import datetime

In [9]:
@mappify
def contains(text, keyword):
    return keyword in text

@mappify
def has_value(x):
    return bool(x)

@mappify
def hour(date):
    dt = datetime.strptime(date[:13], '%Y-%m-%dT%H')
    return dt.hour

@mappify
def weekday(date):
    dt = datetime.strptime(date[:13], '%Y-%m-%dT%H')
    return dt.isoweekday()

@mappify
def discount(x):
    return 1 - float(x['price']) / float(x['original_price'])

@mappify
def absolute_position(x):
    return int(x['offset']) + int(x['print_position'])
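To make the element-wise behaviour concrete, here is a rough sketch of what a decorator like `mappify` could look like. It is a hypothetical stand-in (`mappify_sketch`), not OnFire's actual implementation, and the tag strings are made up for illustration.

```python
from functools import wraps, partial

def mappify_sketch(func):
    # Hypothetical stand-in for onfire.utils.mappify:
    # wrap an element-wise function so it maps over a whole collection
    @wraps(func)
    def inner(xs, **kwargs):
        return [func(x, **kwargs) for x in xs]
    return inner

@mappify_sketch
def contains_demo(text, keyword):
    return keyword in text

# Made-up tag strings, one per row
tags = ['[good_quality_picture]', '[deal_of_the_day, brand_verified]', '']
flags = partial(contains_demo, keyword='deal_of_the_day')(tags)
print(flags)  # [False, True, False]
```

Binding `keyword` with `partial` leaves a one-argument function over the collection, which is the shape a `FunctionTransformer` expects.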

### Descriptors

Describe & fit your features. You can easily process text, categorical and continuous features, and `FeatureGroup` conveniently merges them all for you. When describing your data, you can apply the feature-engineering functions you defined previously.

In [10]:
from onfire.fields import (
    TextFeature, CategoricalFeature, ContinuousFeature, FeatureGroup, SingleLabelTarget)

In [11]:
features = FeatureGroup([
    # text
    ('title', TextFeature(key='title', max_len=15, max_vocab=50000, min_freq=5)),
    ('warranty', TextFeature(key='warranty', max_len=30, max_vocab=50000, min_freq=5)),
    # categorical
    ('category_id', CategoricalFeature('category_id')),
    ('domain_id', CategoricalFeature('domain_id')),
    ('free_shipping', CategoricalFeature('free_shipping')),
    ('fulfillment', CategoricalFeature('fulfillment')),
    ('is_pdp', CategoricalFeature('is_pdp')),
    ('product_id', CategoricalFeature('product_id')),
    ('listing_type_id', CategoricalFeature('listing_type_id')),
    ('logistic_type', CategoricalFeature('logistic_type')),
    ('platform', CategoricalFeature('platform')),
    ('deal_of_the_day', CategoricalFeature('tags', 
        FT(partial(contains, keyword='deal_of_the_day')))),
    ('today_promotion', CategoricalFeature('tags', 
        FT(partial(contains, keyword='today_promotion')))),
    ('loyalty_discount_eligible', CategoricalFeature('tags', 
        FT(partial(contains, keyword='loyalty_discount_eligible')))),
    ('brand_verified', CategoricalFeature('tags', 
        FT(partial(contains, keyword='brand_verified')))),
    ('good_quality_picture', CategoricalFeature('tags', 
        FT(partial(contains, keyword='good_quality_picture')))),
    ('logged_user', CategoricalFeature('user_id', FT(has_value))),
    ('week_day', CategoricalFeature('print_server_timestamp', FT(weekday))),
    ('hour', CategoricalFeature('print_server_timestamp', FT(hour))),
    # continuous
    ('health', ContinuousFeature('health')),
    ('price', ContinuousFeature('price', log=True)),
    ('discount', ContinuousFeature(preprocessor=FT(discount))),
    ('position', ContinuousFeature(preprocessor=FT(absolute_position))),
    ('sold_quantity', ContinuousFeature('sold_quantity')),
    ('total_orders_item_30days', ContinuousFeature('total_orders_item_30days')),
    ('total_visits_item', ContinuousFeature('total_visits_item')),
]).fit(train_data);
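Each descriptor follows a fit-then-transform pattern: `fit` learns from the training data, and `transform` is later applied to any split. As a conceptual illustration of that pattern for a categorical column, here is a small sketch. `CategoricalSketch` is hypothetical and not OnFire's `CategoricalFeature`; in particular, reserving index 0 for unseen values is an assumption made for this example.

```python
class CategoricalSketch:
    """Conceptual stand-in, not OnFire's CategoricalFeature."""

    def __init__(self, key):
        self.key = key
        self.vocab = {}

    def fit(self, rows):
        # Index 0 is reserved for values unseen during fit (an assumption)
        values = sorted({row[self.key] for row in rows})
        self.vocab = {v: i + 1 for i, v in enumerate(values)}
        return self

    def transform(self, rows):
        # Map each value to its integer code, falling back to 0
        return [self.vocab.get(row[self.key], 0) for row in rows]

train_rows = [{'platform': 'android'}, {'platform': 'ios'}, {'platform': 'android'}]
feature = CategoricalSketch('platform').fit(train_rows)
codes = feature.transform([{'platform': 'ios'}, {'platform': 'web'}])
print(codes)  # [2, 0]
```

Fitting only on the training rows keeps validation and test values from leaking into the vocabulary.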

Describe & fit your target variable.

In [12]:
target = SingleLabelTarget('conversion').fit(train_data);

In [13]:
target.output_dim

2

## Dataloaders

Dataloaders transform and batch your data so that it is ready to be fed into your model.

In [14]:
from onfire.data import OnFireDataLoader as DataLoader

In [15]:
tfms = (features.transform, target.transform)
train_dl = DataLoader(train_data, tfms, shuffle=True, batch_size=1024)
valid_dl = DataLoader(valid_data, tfms, batch_size=1024)
test_dl  = DataLoader(test_data, features.transform, batch_size=1024)
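As a rough mental model of "transform and batch", the sketch below slices raw rows into chunks and applies each transform to the chunk. It is plain Python with a hypothetical `batches` helper; whether OnFire transforms per batch or ahead of time is not stated here, so treat that detail as an assumption.

```python
import random

def batches(data, tfms, batch_size, shuffle=False, seed=0):
    # Hypothetical helper: yield one tuple of transformed chunks per batch
    idx = list(range(len(data)))
    if shuffle:
        random.Random(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        chunk = [data[i] for i in idx[start:start + batch_size]]
        yield tuple(tfm(chunk) for tfm in tfms)

# Toy rows and transforms standing in for features.transform / target.transform
toy_rows = [{'x': i, 'y': i % 2} for i in range(5)]
toy_tfms = (lambda rows: [r['x'] for r in rows],
            lambda rows: [r['y'] for r in rows])

out = list(batches(toy_rows, toy_tfms, batch_size=2))
print(out)  # [([0, 1], [0, 1]), ([2, 3], [0, 1]), ([4], [0])]
```

The last batch is smaller than `batch_size` when the dataset size is not a multiple of it.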

## Model

Calling `features.build_embedder()` gives you a module that creates an embedding of all your features, so you only need to define the layers of your neural net that sit on top of it.

In [16]:
import torch.nn as nn

In [17]:
def get_model(ps):
    embedder = features.build_embedder()
    return nn.Sequential(embedder,
        nn.BatchNorm1d(embedder.output_dim),
        nn.Dropout(ps),
        nn.Linear(embedder.output_dim, 128, bias=False),
        nn.ReLU(),
        nn.Dropout(ps),
        nn.Linear(128, target.output_dim)
    )

## Train

Define the metric that you want to track during training.

In [18]:
from sklearn.metrics import roc_auc_score

def roc_auc(y_true, y_pred):
    y_pred = y_pred.softmax(dim=1)[:,1]
    return roc_auc_score(y_true, y_pred)
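The metric above turns the two logits into a positive-class probability and then scores how well that probability ranks positives above negatives. A pure-Python sketch of both steps follows, using the pairwise-ranking interpretation of ROC AUC; `softmax_positive` and `auc_sketch` are hypothetical helpers for illustration, and the logits are made up.

```python
import math

def softmax_positive(logits):
    # logits: list of [negative_logit, positive_logit] pairs
    probs = []
    for neg, pos in logits:
        m = max(neg, pos)  # subtract the max for numerical stability
        e_neg, e_pos = math.exp(neg - m), math.exp(pos - m)
        probs.append(e_pos / (e_neg + e_pos))
    return probs

def auc_sketch(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count as half
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

logits = [[2.0, -1.0], [0.0, 1.0], [1.0, 1.5], [0.5, 0.0]]
y_true = [0, 1, 0, 1]
scores = softmax_positive(logits)
print(auc_sketch(y_true, scores))  # 0.75
```

The pairwise loop is quadratic in the number of rows, so it is only meant to explain the metric; `roc_auc_score` remains the practical choice.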

`SupervisedRunner` helps you to easily train your model.

In [19]:
from onfire.colab import SupervisedRunner

In [20]:
model = get_model(ps=0.3)
loss_fn = nn.CrossEntropyLoss()
runner = SupervisedRunner(model, loss_fn)

In [21]:
%%time
runner.fit(train_dl, valid_dl, epochs=10, lr=1e-3, metrics=roc_auc)

epoch,train_loss,valid_loss,roc_auc


TypeError: cannot pickle 'Environment' object

## Predict

`SupervisedRunner` also provides a `predict` method.

In [22]:
y_pred = runner.predict(test_dl)
y_pred = y_pred.softmax(dim=1)[:,1]

TypeError: cannot pickle 'Environment' object

In [19]:
y_pred.shape

torch.Size([19211])

In [20]:
y_pred

tensor([0.0166, 0.0005, 0.1424,  ..., 0.0014, 0.1281, 0.0005])