# To run the notebook

Use Google Colab (should try to use GPU since training is slow without it). Before running any code, upload *train.csv* and *test.csv* to the filesystem first. Then run all the cells. 

After the last cell runs, two files called *desc_train_pred.csv* and *desc_test_pred.csv* should be saved to the filesystem. Download these files since they're required for catboost.ipynb.

# Summary of techniques

For this notebook I only used the noisy text descriptions.

To process the data, I first created a mapping of all the unique words in the text descriptions to numerical values. I created training and validation datasets, where I would return the text description as a fixed length vector of numerical mappings (with padding at the end if necessary).

I used a transformer to analyze the text descriptions. For a given input vector, I first turn it into an embedding. However, I omit any positional embedding since the text descriptions seem more like bags of words to me (so the position doesn't seem relevant). I then run the embedding through the transformer, making sure to mask the padding. I also use a single padding value for the target vector.

I originally used just a transformer encoder, followed by some linear layers. I did this since I figured I wouldn't need a decoder if I'm just classifying text instead of generating a sequence.

I then trained the network and generated my predictions, which would go to catboost.ipynb.

I experimented with many different hyperparameters. I think just using the transformer encoder gave better results, but I'm not entirely sure (might've just gotten unlucky with my hyperparameters).


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn.functional as F
from torch import nn
from torch import Tensor
from torch.nn import Transformer, TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cuda device


In [3]:
df = pd.read_csv('train.csv')[['category', 'noisyTextDescription']]
df_test = pd.read_csv('test.csv')

c_to_i = {}
i_to_c = {}
for i, c in enumerate(df['category'].unique()):
    c_to_i[c] = i
    i_to_c[i] = c

num_categories = len(c_to_i)

text_mapping = {
    "<PAD>": 0
}
max_desc_len = 0
for desc in df['noisyTextDescription']:
    tokens = desc.split()
    max_desc_len = max(max_desc_len, len(tokens))

    for word in tokens:
        if word not in text_mapping:
            text_mapping[word] = len(text_mapping)

for desc in df_test['noisyTextDescription']:
    tokens = desc.split()
    max_desc_len = max(max_desc_len, len(tokens))

    for word in tokens:
        if word not in text_mapping:
            text_mapping[word] = len(text_mapping)

num_words = len(text_mapping)
print(max_desc_len)

def convert_text(text):
    tokens = [text_mapping[word] for word in text.split()]
    for _ in range(len(tokens), max_desc_len):
        tokens.append(0)
    return tokens

df['noisyTextDescription'] = df['noisyTextDescription'].apply(convert_text)
df['category'] = df['category'].apply(lambda x: c_to_i[x])

df = df.sample(frac=1).reset_index(drop=True)

print(df['noisyTextDescription'].iloc[0])

df_train, df_val = train_test_split(df, train_size=0.8)

14
[1211, 3975, 15, 16, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [4]:
class DescriptionData(Dataset):
    def __init__(self, df, test=False):
        self.df = df
        self.len = len(self.df)
        self.test = test

    def __getitem__(self, idx):
        desc = torch.tensor(self.df['noisyTextDescription'].iloc[idx]).to(device)
        
        if self.test:
            return desc
        else:
            category = self.df['category'].iloc[idx]
            return desc, category

    def __len__(self):
        return self.len

train_dataset = DescriptionData(df_train)
val_dataset = DescriptionData(df_val)
total_dataset = DescriptionData(df)

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=64, shuffle=True)
total_dataloader = DataLoader(total_dataset, batch_size=64, shuffle=True)

In [5]:
class DescriptionTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 vocab_size: int,
                 num_categories: int,
                 dim_feedforward: int = 2048,
                 dropout: float = 0.1,
                 max_input_len: int = 14,
                 num_decoder_layers=4):
        super(DescriptionTransformer, self).__init__()
        self.max_input_len = max_input_len
        self.embedding = nn.Embedding(vocab_size, emb_size)

        # self.encoder_layer = TransformerEncoderLayer(d_model=emb_size,
        #                                                 nhead=nhead,
        #                                                 dim_feedforward=dim_feedforward,
        #                                                 dropout=dropout,
        #                                                 batch_first=True)
        # self.encoder = TransformerEncoder(self.encoder_layer, num_layers=num_encoder_layers)
        # self.flatten = nn.Flatten(start_dim=1)
        # self.fc = nn.Linear(max_input_len*emb_size, 512)
        # self.drop = nn.Dropout(0.5)
        # self.generator = nn.Linear(512, num_categories)

        self.tgt_emb = nn.Embedding(1, emb_size)
        self.transform = Transformer(d_model=emb_size,
                                     nhead=nhead,
                                     num_encoder_layers=num_encoder_layers,
                                     num_decoder_layers=num_decoder_layers,
                                     dim_feedforward=dim_feedforward,
                                     dropout=dropout,
                                     batch_first=True)
        self.transform_fc = nn.Linear(emb_size, 64)
        self.transform_drop = nn.Dropout(0.45)
        self.transform_gen = nn.Linear(64, num_categories)

        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self,
                src: Tensor,
                src_key_padding_mask: Tensor = None):
        src_emb = self.embedding(src)

        # outs = self.encoder(src_emb, src_key_padding_mask=src_key_padding_mask)
        
        # if outs.shape[1] < self.max_input_len:
        #     diff = self.max_input_len - outs.shape[1]
        #     pad = torch.zeros((outs.shape[0], diff, outs.shape[2])).to(device)
        #     outs = torch.cat((outs, pad), axis=1)

        # outs = self.flatten(outs)
        # outs = self.drop(outs)
        # outs = self.fc(outs)
        # outs = self.drop(outs)
        # outs = self.generator(outs)

        tgt = torch.zeros((src.shape[0], 1), dtype=torch.long).to(device)
        tgt_emb = self.tgt_emb(tgt)
        tgt_key_padding_mask = torch.tensor([[False]]*src.shape[0]).to(device)
        outs = self.transform(
            src_emb,
            tgt_emb,
            src_key_padding_mask=src_key_padding_mask,
        )
        outs = outs.squeeze(dim=1)
        outs = self.transform_fc(outs)
        outs = self.transform_drop(outs)
        outs = self.transform_gen(outs)

        outs = self.softmax(outs)
        return outs

In [6]:
def train_epoch(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    train_loss, correct = 0, 0

    model.train()
    for _, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        mask = torch.tensor([[False if x > 0 else True for x in r] for r in X]).to(device)

        pred = model(src=X, src_key_padding_mask=mask)
        loss = loss_fn(pred, y)
        train_loss += loss.item()
        correct += (pred.argmax(1) == y).type(torch.float).sum().item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    average_train_loss = train_loss / num_batches
    average_train_accuracy = correct / size
    return average_train_accuracy, average_train_loss

def test_epoch(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)

            mask = torch.tensor([[False if x > 0 else True for x in r] for r in X]).to(device)

            pred = model(src=X, src_key_padding_mask=mask)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    average_test_loss = test_loss / num_batches
    average_test_accuracy = correct / size
    return average_test_accuracy, average_test_loss

def train_full(train_dataloader, val_dataloader, model, loss_fn, optimizer, epochs=10, save_weights=False):
    train_accuracies, val_accuracies = [], []
    top_val_accuracy = 0.0

    for t in tqdm(range(epochs)):
        train_accuracy, train_loss = train_epoch(train_dataloader, model, loss_fn, optimizer)
        train_accuracies += [train_accuracy]

        val_accuracy, val_loss = test_epoch(val_dataloader, model, loss_fn)
        val_accuracies += [val_accuracy]

        if val_accuracy > top_val_accuracy:
            top_val_accuracy = val_accuracy
            if save_weights:
                torch.save(model, 'desc_model.pth')

        print(f"Epoch {t+1}:\t Train accuracy: {100*train_accuracy:0.1f}%\t Avg train loss: {train_loss:>6f}\t Val accuracy: {100*val_accuracy:0.1f}%\t Avg val loss: {val_loss:>6f}")

    print(f"Top val accuracy: {top_val_accuracy}")
    return train_accuracies, val_accuracies

In [7]:
%%time

# desc_model = DescriptionTransformer(num_encoder_layers=6,
#                                 emb_size=256,
#                                 nhead=4,
#                                 vocab_size=num_words,
#                                 num_categories=num_categories,
#                                 dim_feedforward=256,
#                                 dropout=0.6,
#                                 max_input_len=max_desc_len).to(device)

# desc_model = DescriptionTransformer(num_encoder_layers=6,
#                                 emb_size=128,
#                                 nhead=4,
#                                 vocab_size=num_words,
#                                 num_categories=num_categories,
#                                 dim_feedforward=512,
#                                 dropout=0.45,
#                                 max_input_len=max_desc_len).to(device)

desc_model = DescriptionTransformer(num_encoder_layers=4,
                                emb_size=128,
                                nhead=2,
                                vocab_size=num_words,
                                num_categories=num_categories,
                                dim_feedforward=64,
                                dropout=0.45,
                                max_input_len=max_desc_len,
                                num_decoder_layers=2).to(device)

save_weights = True
load_weights = False
num_epochs = 20

if load_weights:
    desc_model = torch.load('desc_model.pth')

loss_fn = nn.NLLLoss()
optimizer = torch.optim.Adam(desc_model.parameters())

# train_accuracies, val_accuracies = train_full(train_dataloader, val_dataloader, desc_model, loss_fn, optimizer, epochs=num_epochs, save_weights=save_weights)
train_accuracies, val_accuracies = train_full(total_dataloader, val_dataloader, desc_model, loss_fn, optimizer, epochs=num_epochs, save_weights=save_weights)

  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)
  5%|▌         | 1/20 [00:21<06:41, 21.11s/it]

Epoch 1:	 Train accuracy: 41.6%	 Avg train loss: 2.247027	 Val accuracy: 50.4%	 Avg val loss: 1.863769


 10%|█         | 2/20 [00:39<05:52, 19.57s/it]

Epoch 2:	 Train accuracy: 51.9%	 Avg train loss: 1.834209	 Val accuracy: 56.8%	 Avg val loss: 1.664369


 15%|█▌        | 3/20 [00:57<05:18, 18.74s/it]

Epoch 3:	 Train accuracy: 57.7%	 Avg train loss: 1.652552	 Val accuracy: 66.6%	 Avg val loss: 1.441211


 20%|██        | 4/20 [01:15<04:56, 18.55s/it]

Epoch 4:	 Train accuracy: 61.9%	 Avg train loss: 1.522529	 Val accuracy: 69.0%	 Avg val loss: 1.335737


 25%|██▌       | 5/20 [01:33<04:34, 18.30s/it]

Epoch 5:	 Train accuracy: 65.3%	 Avg train loss: 1.411646	 Val accuracy: 72.0%	 Avg val loss: 1.235521


 30%|███       | 6/20 [01:51<04:16, 18.32s/it]

Epoch 6:	 Train accuracy: 67.7%	 Avg train loss: 1.339058	 Val accuracy: 73.0%	 Avg val loss: 1.164066


 35%|███▌      | 7/20 [02:09<03:55, 18.15s/it]

Epoch 7:	 Train accuracy: 69.9%	 Avg train loss: 1.260106	 Val accuracy: 74.5%	 Avg val loss: 1.139568


 40%|████      | 8/20 [02:27<03:37, 18.13s/it]

Epoch 8:	 Train accuracy: 71.8%	 Avg train loss: 1.213452	 Val accuracy: 77.9%	 Avg val loss: 1.081710


 45%|████▌     | 9/20 [02:45<03:18, 18.05s/it]

Epoch 9:	 Train accuracy: 73.2%	 Avg train loss: 1.169836	 Val accuracy: 79.4%	 Avg val loss: 1.023419


 50%|█████     | 10/20 [03:03<03:00, 18.00s/it]

Epoch 10:	 Train accuracy: 74.5%	 Avg train loss: 1.131493	 Val accuracy: 79.8%	 Avg val loss: 0.986460


 55%|█████▌    | 11/20 [03:21<02:42, 18.08s/it]

Epoch 11:	 Train accuracy: 76.0%	 Avg train loss: 1.101406	 Val accuracy: 80.8%	 Avg val loss: 0.922726


 60%|██████    | 12/20 [03:39<02:23, 18.00s/it]

Epoch 12:	 Train accuracy: 76.4%	 Avg train loss: 1.072780	 Val accuracy: 81.3%	 Avg val loss: 0.928944


 65%|██████▌   | 13/20 [03:57<02:06, 18.09s/it]

Epoch 13:	 Train accuracy: 76.6%	 Avg train loss: 1.058218	 Val accuracy: 80.7%	 Avg val loss: 0.928622


 70%|███████   | 14/20 [04:15<01:47, 17.96s/it]

Epoch 14:	 Train accuracy: 77.5%	 Avg train loss: 1.031319	 Val accuracy: 81.8%	 Avg val loss: 0.907492


 75%|███████▌  | 15/20 [04:34<01:30, 18.18s/it]

Epoch 15:	 Train accuracy: 78.5%	 Avg train loss: 1.008666	 Val accuracy: 82.9%	 Avg val loss: 0.881846


 80%|████████  | 16/20 [04:52<01:12, 18.08s/it]

Epoch 16:	 Train accuracy: 78.7%	 Avg train loss: 0.992611	 Val accuracy: 83.3%	 Avg val loss: 0.830994


 85%|████████▌ | 17/20 [05:10<00:54, 18.16s/it]

Epoch 17:	 Train accuracy: 79.0%	 Avg train loss: 0.993157	 Val accuracy: 83.5%	 Avg val loss: 0.850338


 90%|█████████ | 18/20 [05:28<00:35, 17.99s/it]

Epoch 18:	 Train accuracy: 78.9%	 Avg train loss: 0.993352	 Val accuracy: 82.4%	 Avg val loss: 0.859844


 95%|█████████▌| 19/20 [05:46<00:18, 18.19s/it]

Epoch 19:	 Train accuracy: 79.0%	 Avg train loss: 0.977918	 Val accuracy: 81.6%	 Avg val loss: 0.868281


100%|██████████| 20/20 [06:04<00:00, 18.22s/it]

Epoch 20:	 Train accuracy: 79.4%	 Avg train loss: 0.969861	 Val accuracy: 83.3%	 Avg val loss: 0.833192
Top val accuracy: 0.8349514563106796
CPU times: user 6min, sys: 2.53 s, total: 6min 2s
Wall time: 6min 9s





In [8]:
df = pd.read_csv('train.csv')[['category', 'noisyTextDescription']]

df['noisyTextDescription'] = df['noisyTextDescription'].apply(convert_text)
df['category'] = df['category'].apply(lambda x: c_to_i[x])

In [9]:
if save_weights or load_weights:
    desc_model = torch.load('desc_model.pth')

desc_model.eval()

def make_predictions(df):
    eval_dataset = DescriptionData(df, test=True)
    eval_dataloader = DataLoader(eval_dataset, batch_size=128, shuffle=False)

    predictions = []
    for _, X in enumerate(eval_dataloader):
        X = X.to(device)

        mask = torch.tensor([[False if x > 0 else True for x in r] for r in X]).to(device)

        pred = desc_model(src=X, src_key_padding_mask=mask)
        labels = pred.argmax(1)
        for j in range(pred.shape[0]):
            predictions.append(i_to_c[labels[j].item()])

    return predictions

def eval_pred(base, pred):
    assert(base['id'].equals(pred['id']))
    print('ids match')
    diff_count = (base['category'] == pred['category']).value_counts()

    if True not in diff_count:
        print("WTF")
        return 0.0
    else:
        return (100.0*diff_count[True])/len(base)


In [10]:
%%time

test_features = ['noisyTextDescription']

df = pd.read_csv('train.csv')

df['noisyTextDescription'] = df['noisyTextDescription'].apply(convert_text)

train_pred = df[['id']]
train_pred['category'] = make_predictions(df)

print(f'train accuracy: {eval_pred(df, train_pred)}')

df_test = pd.read_csv('test.csv')
df_test['noisyTextDescription'] = df_test['noisyTextDescription'].apply(convert_text)

test_pred = df_test[['id']]
test_pred['category'] = make_predictions(df_test)
assert(df_test['id'].equals(test_pred['id']))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


ids match
train accuracy: 83.02584732047903
CPU times: user 21.6 s, sys: 67.8 ms, total: 21.7 s
Wall time: 21.8 s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [11]:
train_pred.to_csv('desc_train_pred.csv', index=False)
test_pred.to_csv('desc_test_pred.csv', index=False)