In this notebook we will show how to use a Hugging Face model with this library, along with any other model in the library. In particular, we will show how to combine it with a tabular DL model.

Since we are here, we will also compare the performance of a few models on a text classification problem. 

Let's go

In [1]:
import numpy as np
import lightgbm as lgb
from lightgbm import Dataset as lgbDataset
from scipy.sparse import hstack, csr_matrix
from sklearn.metrics import (
    f1_score,
    recall_score,
    accuracy_score,
    precision_score,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from pytorch_widedeep import Trainer
from pytorch_widedeep.models import BasicRNN, WideDeep
from pytorch_widedeep.metrics import F1Score, Accuracy
from pytorch_widedeep.utils import Tokenizer, LabelEncoder
from pytorch_widedeep.preprocessing import TextPreprocessor
from pytorch_widedeep.datasets import load_womens_ecommerce




Let's load the data and have a look:

In [2]:
df = load_womens_ecommerce(as_frame=True)

df.columns = [c.replace(" ", "_").lower() for c in df.columns]

# shift the ratings down by one so that classes live in [0, num_class)
df["rating"] = (df["rating"] - 1).astype("int64")

# group reviews with 1 and 2 scores into one class
df.loc[df.rating == 0, "rating"] = 1

# and shift back again to [0, num_class), now with 4 classes
df["rating"] = (df["rating"] - 1).astype("int64")

# drop missing reviews and reviews shorter than 5 words
df = df[~df.review_text.isna()].copy()
df["review_length"] = df.review_text.apply(lambda x: len(x.split(" ")))
df = df[df.review_length >= 5]
df = df.drop("review_length", axis=1).reset_index(drop=True)
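
To make the relabelling above concrete, here is a minimal pure-Python sketch of the same three steps, assuming the raw ratings take the integer values 1 to 5. The helper name `collapse_rating` is made up for illustration and is not part of the notebook or the library.

```python
def collapse_rating(rating: int) -> int:
    """Mirror the three relabelling steps applied to df["rating"] above."""
    shifted = rating - 1  # 1..5 -> 0..4
    if shifted == 0:  # merge the lowest score into the next one
        shifted = 1
    return shifted - 1  # 1..4 -> 0..3


# mapping from raw rating to final class label
mapping = {r: collapse_rating(r) for r in range(1, 6)}
print(mapping)
```

Raw scores 1 and 2 end up in the same class, which is why the target has 4 classes rather than 5.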

In [3]:
df.head()

Unnamed: 0,clothing_id,age,title,review_text,rating,recommended_ind,positive_feedback_count,division_name,department_name,class_name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,2,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,3,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,1,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",3,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,3,1,6,General,Tops,Blouses


So, we will use the `review_text` column to predict the `rating`. Later on, we will try to combine it with some other columns (like `division_name` and `age`) to see if these help.

Let's first have a look at the distribution of ratings:

In [4]:
df.rating.value_counts()

3    12515
2     4904
1     2820
0     2369
Name: rating, dtype: int64

This shows that we could perhaps have grouped the original rating scores of 1, 2 and 3 into a single class... but anyway, let's just move on with these 4 classes.

We are not going to carry out any hyperparameter optimization here, so we only need a train and a test set (i.e. there is no need for a validation set in this notebook).

In [5]:
train, test = train_test_split(df, train_size=0.8, random_state=1, stratify=df.rating)

Let's see what we have to beat. What metrics would we obtain if we always predict the most common rating (3)?

In [6]:
most_common_pred = [train.rating.value_counts().index[0]] * len(test)

most_common_acc = accuracy_score(test.rating, most_common_pred)
most_common_f1 = f1_score(test.rating, most_common_pred, average="weighted")

In [7]:
print(f"Accuracy: {most_common_acc}. F1 Score: {most_common_f1}")

Accuracy: 0.553516143299425. F1 Score: 0.3944344218301668


OK, these are our "baseline" metrics.
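
As a sanity check, these two numbers can be reproduced by hand. The sketch below assumes the per-class test counts 474, 564, 981 and 2503, which are the row sums of the confusion matrices printed further down in this notebook; the variable names are illustrative.

```python
# test-set support per class (row sums of the confusion matrices in this notebook)
support = [474, 564, 981, 2503]
n = sum(support)
majority = max(range(len(support)), key=support.__getitem__)

# always predicting the majority class: its recall is 1 and its precision
# equals its share of the test set; every other class has F1 = 0
precision = support[majority] / n
recall = 1.0
f1_majority = 2 * precision * recall / (precision + recall)

accuracy = support[majority] / n
# weighted F1: per-class F1 weighted by support
weighted_f1 = (support[majority] / n) * f1_majority

print(f"Accuracy: {accuracy:.4f}. F1 Score: {weighted_f1:.4f}")
```

The weighted F1 is much lower than the accuracy because three of the four classes contribute an F1 of zero.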

Let's start by simply using tf-idf + lightGBM:

In [8]:
# this Tokenizer is part of our utils module but of course, any valid tokenizer can be used here
tok = Tokenizer()
tok_reviews_tr = tok.process_all(train.review_text.tolist())
tok_reviews_te = tok.process_all(test.review_text.tolist())

In [9]:
vectorizer = TfidfVectorizer(
    max_features=5000, preprocessor=lambda x: x, tokenizer=lambda x: x, min_df=5
)

X_text_tr = vectorizer.fit_transform(tok_reviews_tr)
X_text_te = vectorizer.transform(tok_reviews_te)
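
Because the reviews are already tokenized, the vectorizer above is given identity functions as `preprocessor` and `tokenizer`. For intuition, here is a small pure-Python sketch of the weighting that scikit-learn applies with its default settings (raw counts, smoothed idf, L2 normalisation). The toy documents are invented, and the sketch leaves out `max_features` and `min_df`.

```python
import math
from collections import Counter

# toy, already-tokenized documents (invented for illustration)
docs = [
    ["love", "this", "dress"],
    ["this", "dress", "runs", "small"],
    ["love", "love", "this", "top"],
]

n_docs = len(docs)
# document frequency: number of documents containing each token
df_counts = Counter(tok for doc in docs for tok in set(doc))
# smoothed idf, as in scikit-learn's default (smooth_idf=True)
idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in df_counts.items()}


def tfidf_row(doc):
    """Raw term counts times idf, followed by L2 normalisation."""
    counts = Counter(doc)
    weights = {tok: cnt * idf[tok] for tok, cnt in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}


rows = [tfidf_row(doc) for doc in docs]
print(rows[0])
```

A token that appears in every document (here `this`) gets the smallest idf, so it weighs less than rarer tokens in the same review.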


In [10]:
X_text_tr

<18086x4566 sparse matrix of type '<class 'numpy.float64'>'
	with 884074 stored elements in Compressed Sparse Row format>
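
The numbers in this printout are enough to see how sparse the matrix is. A quick calculation, using only the shape and the stored-element count shown above:

```python
n_rows, n_cols, n_stored = 18086, 4566, 884074

# fraction of cells that hold a stored (non-zero) value
density = n_stored / (n_rows * n_cols)
# average number of stored features per review
avg_stored_per_review = n_stored / n_rows

print(f"density: {density:.2%}, stored features per review: {avg_stored_per_review:.1f}")
```

Roughly one cell in a hundred is populated, which is one reason the sparse format pays off here.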

We now move our matrices to lightGBM `Dataset` format

In [11]:
lgbtrain_text = lgbDataset(
    X_text_tr,
    train.rating.values,
    free_raw_data=False,
)

lgbtest_text = lgbDataset(
    X_text_te,
    test.rating.values,
    reference=lgbtrain_text,
    free_raw_data=False,
)


And off we go. By the way, as we run the next cell, we should appreciate how fast lightGBM runs. Yes, the input is a sparse matrix, but still, it trains on an 18086x4566 matrix in a matter of seconds.

In [12]:
model_text = lgb.train(
    {"objective": "multiclass", "num_classes": 4},
    lgbtrain_text,
    valid_sets=[lgbtest_text, lgbtrain_text],
    valid_names=["test", "train"],
    verbose_eval=False,
)



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 143330
[LightGBM] [Info] Number of data points in the train set: 18086, number of used features: 2285
[LightGBM] [Info] Start training from score -2.255919
[LightGBM] [Info] Start training from score -2.081545
[LightGBM] [Info] Start training from score -1.528281
[LightGBM] [Info] Start training from score -0.591354
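
The four "Start training from score" values look like the natural logarithms of the class frequencies in the training set. This is a reading of the log rather than something stated in the notebook, but it is easy to check: exponentiating them should give class shares that sum to one.

```python
import math

# initial scores printed in the LightGBM log above, one per class
init_scores = [-2.255919, -2.081545, -1.528281, -0.591354]

# if the scores are log-priors, exp() recovers the class shares
priors = [math.exp(s) for s in init_scores]
print([round(p, 4) for p in priors], round(sum(priors), 4))
```

The largest share belongs to class 3, in line with the class imbalance seen in `value_counts()`.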


In [13]:
preds_text = model_text.predict(X_text_te)
pred_text_class = np.argmax(preds_text, 1)
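
With a multiclass objective, `predict` returns one row per review and one column per class, so the predicted class is the column with the highest probability. A dependency-free sketch of that row-wise argmax, using made-up probabilities:

```python
# hypothetical predicted probabilities for two reviews over the 4 classes
probs = [
    [0.10, 0.20, 0.30, 0.40],
    [0.55, 0.25, 0.15, 0.05],
]

# row-wise argmax: index of the largest probability in each row
pred_class = [max(range(len(row)), key=row.__getitem__) for row in probs]
print(pred_class)
```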

In [14]:
acc_text = accuracy_score(lgbtest_text.label, pred_text_class)
f1_text = f1_score(lgbtest_text.label, pred_text_class, average="weighted")
cm_text = confusion_matrix(lgbtest_text.label, pred_text_class)

In [15]:
acc_text, f1_text

(0.6444051304732419, 0.617154488246181)

In [16]:
cm_text

array([[ 199,  135,   61,   79],
       [ 123,  169,  149,  123],
       [  30,   94,  279,  578],
       [  16,   30,  190, 2267]])

OK, so with no hyperparameter optimization, lightGBM gets an accuracy of 0.64 and an F1 score of 0.62, clearly better than the 0.55 and 0.39 obtained by always predicting the most common class.
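
Both metrics can be recovered from the confusion matrix alone, which also makes the per-class behaviour easier to see. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes) and uses plain Python; the variable names are illustrative.

```python
cm = [
    [199, 135, 61, 79],
    [123, 169, 149, 123],
    [30, 94, 279, 578],
    [16, 30, 190, 2267],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

# rows are true classes, columns are predicted classes
support = [sum(row) for row in cm]
predicted = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]
true_pos = [cm[k][k] for k in range(n_classes)]

accuracy = sum(true_pos) / total
# per-class F1 = 2 * TP / (support + predicted), then weighted by support
f1_per_class = [2 * true_pos[k] / (support[k] + predicted[k]) for k in range(n_classes)]
weighted_f1 = sum(f * s for f, s in zip(f1_per_class, support)) / total
recall = [true_pos[k] / support[k] for k in range(n_classes)]

print(accuracy, weighted_f1)
print([round(r, 2) for r in recall])
```

The per-class recall shows that most of the accuracy comes from the majority class, while the middle classes are frequently confused with their neighbours.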

Let's see if, in this implementation, some additional features like `age` or `class_name` are of any help:

In [17]:
tab_cols = [
    "age",
    "division_name",
    "department_name",
    "class_name",
]

for tab_df in [train, test]:
    for c in ["division_name", "department_name", "class_name"]:
        # lowercase and fill missing values in a single assignment
        # (avoids a chained in-place `fillna` on a column selection)
        tab_df[c] = tab_df[c].str.lower().fillna("missing")


In [18]:
# This is our LabelEncoder. A class that is designed to work with the models in this library but 
# can be used for general purposes
le = LabelEncoder(columns_to_encode=["division_name", "department_name", "class_name"])
train_tab_le = le.fit_transform(train)
test_tab_le = le.transform(test)

In [19]:
train_tab_le.head()

Unnamed: 0,clothing_id,age,title,review_text,rating,recommended_ind,positive_feedback_count,division_name,department_name,class_name
4541,836,35,,Bought this on sale in my reg size- 10. im 5'9...,2,1,2,1,1,1
18573,1022,25,"Look like ""mom jeans""",Maybe i just have the wrong body type for thes...,1,0,0,2,2,2
1058,815,39,Ig brought me here,Love the way this top layers under my jackets ...,2,1,0,1,1,1
12132,984,47,Runs small especially the arms,I love this jacket. it's the prettiest and mos...,3,1,0,1,3,3
20756,1051,42,"True red, true beauty.",These pants are gorgeous--the fabric has a sat...,3,1,0,2,2,4


Let's, for example, have a look at the encodings for the categorical feature `class_name`:

In [20]:
le.encoding_dict['class_name']

{'blouses': 1,
 'jeans': 2,
 'jackets': 3,
 'pants': 4,
 'knits': 5,
 'dresses': 6,
 'skirts': 7,
 'sweaters': 8,
 'fine gauge': 9,
 'legwear': 10,
 'lounge': 11,
 'shorts': 12,
 'outerwear': 13,
 'intimates': 14,
 'swim': 15,
 'trend': 16,
 'sleep': 17,
 'layering': 18,
 'missing': 19,
 'casual bottoms': 20,
 'chemises': 21}
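
Note that the codes start at 1 and follow the order in which the categories first appear in the training data (compare with the first rows of `train_tab_le` above). The sketch below imitates that behaviour in plain Python. It assumes that 0 is kept for categories not seen at fit time, which is an assumption about the library rather than something shown in this notebook, and the function names are hypothetical.

```python
def fit_label_encoding(values):
    """Assign integer codes starting at 1, in order of first appearance."""
    encoding = {}
    for v in values:
        if v not in encoding:
            encoding[v] = len(encoding) + 1
    return encoding


def encode(values, encoding, unseen_code=0):
    """Map values to codes; values not seen at fit time get `unseen_code` (assumed)."""
    return [encoding.get(v, unseen_code) for v in values]


enc = fit_label_encoding(["blouses", "jeans", "blouses", "jackets"])
print(enc)
print(encode(["jeans", "swim"], enc))
```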

In [21]:
# tabular training and test sets
X_tab_tr = csr_matrix(train_tab_le[tab_cols].values)
X_tab_te = csr_matrix(test_tab_le[tab_cols].values)

# text + tabular training and test sets
X_tab_text_tr = hstack((X_tab_tr, X_text_tr))
X_tab_text_va = hstack((X_tab_te, X_text_te))

In [22]:
X_tab_tr

<18086x4 sparse matrix of type '<class 'numpy.int64'>'
	with 72344 stored elements in Compressed Sparse Row format>

In [23]:
X_tab_text_tr

<18086x4570 sparse matrix of type '<class 'numpy.float64'>'
	with 956418 stored elements in Compressed Sparse Row format>
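
The two printouts above are consistent with a plain column-wise concatenation, which can be verified with a little arithmetic on the reported shapes and stored-element counts:

```python
# shapes and stored-element counts printed above
tab_shape, tab_stored = (18086, 4), 72344
text_shape, text_stored = (18086, 4566), 884074

# horizontal stacking keeps the rows and concatenates the columns
stacked_shape = (tab_shape[0], tab_shape[1] + text_shape[1])
stacked_stored = tab_stored + text_stored

# every cell of the tabular block is stored, i.e. it contains no zeros
tab_is_dense = tab_stored == tab_shape[0] * tab_shape[1]

print(stacked_shape, stacked_stored, tab_is_dense)
```

The tabular columns therefore occupy indices 0 to 3 of the stacked matrix, in the order given by `tab_cols`.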

In [24]:
lgbtrain_tab_text = lgbDataset(
    X_tab_text_tr,
    train.rating.values,
    categorical_feature=[0, 1, 2, 3],  # note: index 0 is "age", so it is treated as categorical too
    free_raw_data=False,
)

lgbtest_tab_text = lgbDataset(
    X_tab_text_va,
    test.rating.values,
    reference=lgbtrain_tab_text,
    free_raw_data=False,
)



In [25]:
model_tab_text = lgb.train(
    {"objective": "multiclass", "num_classes": 4},
    lgbtrain_tab_text,
    valid_sets=[lgbtest_tab_text, lgbtrain_tab_text],
    valid_names=["test", "train"],
    verbose_eval=False
)



New categorical_feature is [0, 1, 2, 3]


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 143432
[LightGBM] [Info] Number of data points in the train set: 18086, number of used features: 2289
[LightGBM] [Info] Start training from score -2.255919
[LightGBM] [Info] Start training from score -2.081545
[LightGBM] [Info] Start training from score -1.528281
[LightGBM] [Info] Start training from score -0.591354




In [26]:
preds_tab_text = model_tab_text.predict(X_tab_text_va)
preds_tab_text_class = np.argmax(preds_tab_text, 1)

acc_tab_text = accuracy_score(lgbtest_tab_text.label, preds_tab_text_class)
f1_tab_text = f1_score(lgbtest_tab_text.label, preds_tab_text_class, average="weighted")
cm_tab_text = confusion_matrix(lgbtest_tab_text.label, preds_tab_text_class)


In [27]:
acc_tab_text, f1_tab_text

(0.6382131800088456, 0.6080251307242649)

In [28]:
cm_tab_text

array([[ 193,  123,   68,   90],
       [ 123,  146,  157,  138],
       [  37,   90,  272,  582],
       [  16,   37,  175, 2275]])

OK, so in this setup, adding the tabular columns does not help performance; in fact, it does the opposite.
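
To put the size of that difference in perspective, the two confusion matrices printed above can be compared directly. This is a small illustrative calculation on the reported numbers, not an additional experiment.

```python
cm_text = [[199, 135, 61, 79], [123, 169, 149, 123], [30, 94, 279, 578], [16, 30, 190, 2267]]
cm_tab_text = [[193, 123, 68, 90], [123, 146, 157, 138], [37, 90, 272, 582], [16, 37, 175, 2275]]


def n_correct(cm):
    """Number of correctly classified reviews (sum of the diagonal)."""
    return sum(cm[k][k] for k in range(len(cm)))


# change in correct predictions per class when the tabular columns are added
diff_per_class = [cm_tab_text[k][k] - cm_text[k][k] for k in range(4)]
print(n_correct(cm_text), n_correct(cm_tab_text), diff_per_class)
```

The gap is a few dozen reviews out of 4522, with a small gain on the majority class and losses on the other three.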

Moving on now to fully using `pytorch-widedeep` on this dataset, let's have a look at how one could use a simple RNN to predict the ratings with the library.

In [29]:
text_preprocessor = TextPreprocessor(
    text_col="review_text", max_vocab=5000, min_freq=5, maxlen=90, n_cpus=1
)

X_text_tr = text_preprocessor.fit_transform(train)
X_text_te = text_preprocessor.transform(test)


The vocabulary contains 4328 tokens
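
Conceptually, the preprocessor builds a vocabulary, maps each token to an integer id and pads every review to `maxlen`. The sketch below is a simplified stand-in, not the library's implementation: the special token names, left padding and keeping the last `maxlen` tokens are all assumptions made for illustration. The padding id of 1 is at least consistent with the `padding_idx=1` shown in the model printout further down.

```python
from collections import Counter

PAD, UNK = "xxpad", "xxunk"  # hypothetical special tokens


def build_vocab(tokenized_docs, max_vocab, min_freq):
    """Keep the most frequent tokens that appear at least `min_freq` times."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    itos = [UNK, PAD] + kept
    return {tok: i for i, tok in enumerate(itos)}


def encode_and_pad(doc, stoi, maxlen):
    """Map tokens to ids, keep the last `maxlen` ids and left-pad with the PAD id."""
    ids = [stoi.get(tok, stoi[UNK]) for tok in doc][-maxlen:]
    return [stoi[PAD]] * (maxlen - len(ids)) + ids


docs = [["love", "this", "dress"], ["love", "this", "top"], ["love", "it"]]
stoi = build_vocab(docs, max_vocab=10, min_freq=2)
print(stoi)
print(encode_and_pad(["love", "this", "skirt"], stoi, maxlen=5))
```

Tokens below `min_freq`, or absent from the training data, all collapse to the unknown id, which is how the vocabulary stays below `max_vocab`.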


In [30]:
basic_rnn = BasicRNN(
    vocab_size=len(text_preprocessor.vocab.itos),
    embed_dim=300,
    hidden_dim=64,
    n_layers=3,
    rnn_dropout=0.2,
    head_hidden_dims=[32],
)


model = WideDeep(deeptext=basic_rnn, pred_dim=4)



In [31]:
model

WideDeep(
  (deeptext): Sequential(
    (0): BasicRNN(
      (word_embed): Embedding(4328, 300, padding_idx=1)
      (rnn): LSTM(300, 64, num_layers=3, batch_first=True, dropout=0.2)
      (rnn_mlp): MLP(
        (mlp): Sequential(
          (dense_layer_0): Sequential(
            (0): Linear(in_features=64, out_features=32, bias=True)
            (1): ReLU(inplace=True)
          )
        )
      )
    )
    (1): Linear(in_features=32, out_features=4, bias=True)
  )
)
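
The printout is enough to estimate where the parameters of this model sit. The calculation below assumes PyTorch's standard LSTM parameterisation (four gates, each with input weights, recurrent weights and two bias vectors) and reads all sizes from the layers shown above.

```python
vocab_size, embed_dim, hidden_dim, n_layers = 4328, 300, 64, 3
head_dim, n_classes = 32, 4


def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)


embedding = vocab_size * embed_dim
lstm = lstm_layer_params(embed_dim, hidden_dim) + (n_layers - 1) * lstm_layer_params(
    hidden_dim, hidden_dim
)
mlp = hidden_dim * head_dim + head_dim
output = head_dim * n_classes + n_classes

total = embedding + lstm + mlp + output
print(embedding, lstm, mlp, output, total)
```

Almost nine out of ten parameters are in the word embeddings, so `embed_dim` and the vocabulary size dominate the model size.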

In [32]:
trainer = Trainer(
    model,
    objective="multiclass",
    metrics=[Accuracy, F1Score(average=True)],
)


In [None]:
trainer.fit(
    X_text=X_text_tr,
    target=train.rating.values,
    n_epochs=5,
    batch_size=256,
)


epoch 1:   0%|                                                                                                          | 0/71 [00:00<?, ?it/s]
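
The 71 steps shown in the progress bar match the size of the training set and the batch size, assuming the last, smaller batch is kept rather than dropped:

```python
import math

n_train, batch_size = 18086, 256

# number of batches per epoch when the final partial batch is kept
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)
```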