The goal of this notebook is to recommend an article to a visitor of the site. We use information about the current content (content ID, category such as news or lifestyle, title, author, and date) to predict the next content the user will view. Note that the visitor ID is ignored in this task, so the model does not learn the browsing history of any particular visitor; it infers the next content from the current content alone.

The dataset is pulled from the publicly available Kurier.at dataset in BigQuery using SQL queries. Kurier.at is an Austrian news site. The data have already been split into a training set (179,092 records) and a test set (25,599 records; some records may belong to the same visitor).

## Import data sets.

In [1]:
import pandas as pd
import torch

train = pd.read_csv('training_set.csv', names = ['visitor_id', 'content_id', 'category', 'title', 'author', 'month_since_epoch', 'next_content_id'])
test = pd.read_csv('test_set.csv', names = ['visitor_id', 'content_id', 'category', 'title', 'author', 'month_since_epoch', 'next_content_id'])

train['next_content_id'] = train.next_content_id.astype('Int64')
test['next_content_id'] = test.next_content_id.astype('Int64')

print(train.head())

            visitor_id  content_id        category  \
0  1093134947164137327   299801977            News   
1  1110195149925330322   299844825            News   
2  1110195149925330322   299833903            News   
3  1114140322257389822   299425707  Stars & Kultur   
4  1114140322257389822   299775313  Stars & Kultur   

                                               title               author  \
0       Kanzlerin Merkel strebt stabile Regierung an          Peter Temel   
1  Regierungsbildung: SPD und CDU bringen sich in...  Sandra Lumetsberger   
2      Ungarns Regierung kämpft mit Buch gegen Soros          Peter Temel   
3  Angelina Jolie: Sie fordert mehr Geld von Brad...    Elisabeth Spitzer   
4  Alexander Newley Sohn von Joan Collins: "Mein ...   Christina Michlits   

   month_since_epoch  next_content_id  
0              574.0        299816215  
1              574.0        299866366  
2              574.0        298972803  
3              574.0        299775313  
4           

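The month_since_epoch attribute is derived as (browse date - 1/1/1970), measured in months, assuming 1/1/1970 is the date that the website was created. Missing values occur almost entirely in the 'category' and 'author' attributes; the training set also has a single missing value in each of 'month_since_epoch' and 'next_content_id'.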
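
As a rough illustration of this derivation, the sketch below counts whole calendar months between 1/1/1970 and a browse date. The exact formula used in the SQL query is not shown in this notebook, so the helper name `months_since_epoch` and the calendar-month convention are assumptions.

```python
from datetime import date

def months_since_epoch(d: date) -> int:
    """Whole calendar months elapsed between 1970-01-01 and d (assumed convention)."""
    return (d.year - 1970) * 12 + (d.month - 1)

# Under this convention, any day in November 2017 maps to 574,
# the value that appears in the first rows of the training set.
print(months_since_epoch(date(2017, 11, 15)))
```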

In [2]:
print('Missing values of training set:')
print(train.isnull().sum())
print('Missing values of test set:')
print(test.isnull().sum())

Missing values of training set:
visitor_id               0
content_id               0
category              1440
title                    0
author               32577
month_since_epoch        1
next_content_id          1
dtype: int64
Missing values of test set:
visitor_id              0
content_id              0
category              165
title                   0
author               4649
month_since_epoch       0
next_content_id         0
dtype: int64


## Data Preprocessing
Convert 'content_id' and 'author' to integer indices (i.e. 0, 1, 2, ...). Note that the vocabularies of 'content_id' and 'author' are built from both the training and test data, not only from the training set.

In [3]:
def get_list(col):
    '''Get the list of unique values of a column.
    
    Argument:
    -- col: dataset.column
    
    Return:
    -- A list of unique values.'''

    lst = list(col.unique())
    
    return lst


def map_to_id(col):
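    ''' Create a dictionary that maps each content_id/author value to an index (0, 1, 2, ...).
    
    Argument:
    -- col: a list of dataset columns, e.g. [dataset.content_id, dataset.next_content_id]
    
    Return:
    -- A dictionary, e.g. (for content_id) {299801977: 0, ...}
    -- size: the length of the concatenated per-column unique lists. A value that
       appears in more than one column is counted once per column, so this is an
       upper bound on the number of distinct values (and on the largest index + 1). '''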
    
    unique_lst = []
    for e in col:
        e_lst = get_list(e)
        unique_lst.extend(e_lst)
    size = len(unique_lst)
    dct = dict()
    i = 0
    while i < len(unique_lst):
        dct[unique_lst[i]] = i
        i += 1
        
    return dct, size


def contentID_author_preprocessing(train, test):
    ''' Preprocess the content_id and author columns.
    Convert them to integer indices in the range 0, 1, ..., V-1, where V is the size returned by map_to_id.'''
    
    full_dataset = pd.concat([test,train], axis=0)
    author_dct, author_size = map_to_id([full_dataset.author])
    contentID_dct, content_size = map_to_id([full_dataset.content_id, full_dataset.next_content_id])
    
    # change 'content_id' column to indices
    train['content_id'] = train.content_id.apply(lambda x: contentID_dct[x])
    test['content_id'] = test.content_id.apply(lambda x: contentID_dct[x])

    # change 'next_content_id' column to indices
    train['next_content_id'] = train.next_content_id.apply(lambda x: contentID_dct[x])
    test['next_content_id'] = test.next_content_id.apply(lambda x: contentID_dct[x])

    # change 'author' column to indices
    train['author'] = train.author.apply(lambda x: author_dct[x])
    test['author'] = test.author.apply(lambda x: author_dct[x])
    return train, test, author_size, content_size
    
train, test, author_size, content_size = contentID_author_preprocessing(train, test)
print(train.head(2))

            visitor_id  content_id category  \
0  1093134947164137327        3320     News   
1  1110195149925330322        3356     News   

                                               title  author  \
0       Kanzlerin Merkel strebt stabile Regierung an      28   
1  Regierungsbildung: SPD und CDU bringen sich in...       4   

   month_since_epoch  next_content_id  
0              574.0             3104  
1              574.0             3108  
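
The idea of a vocabulary shared by the training and test sets can be sketched on toy data as follows. This is an alternative, deduplicating formulation rather than the notebook's own `map_to_id`; the helper name `build_vocab` and the toy frames are hypothetical.

```python
import pandas as pd

def build_vocab(*columns):
    """Map every distinct value across the given columns to 0, 1, 2, ..."""
    values = pd.concat(columns, ignore_index=True)
    uniques = values.unique().tolist()   # order of first appearance, no duplicates
    return {v: i for i, v in enumerate(uniques)}

toy_train = pd.DataFrame({'content_id': [10, 20], 'next_content_id': [20, 30]})
toy_test = pd.DataFrame({'content_id': [30, 40], 'next_content_id': [10, 40]})

# Build one vocabulary from both splits so test-only IDs (here 40) get an index too.
vocab = build_vocab(toy_train.content_id, toy_train.next_content_id,
                    toy_test.content_id, toy_test.next_content_id)

toy_train['content_id'] = toy_train.content_id.map(vocab)
toy_test['content_id'] = toy_test.content_id.map(vocab)
print(vocab, len(vocab))
```

Because duplicates are removed before indexing, `len(vocab)` here equals the number of distinct IDs, which is the natural size for an embedding table or an output layer.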


Construct the 'category_month' column. First, bucketize 'month_since_epoch' using the bucket boundaries (400, 420, 440, ..., 660, 680), which gives 16 buckets. Then fill the missing cells of 'category' with the value 'missing'. Finally, apply one-hot encoding to the combinations of 'month_since_epoch' (16 possible buckets) and 'category' (4 unique values).
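
To make the bucketing step concrete, here is a small sketch of how `torch.bucketize` assigns indices with boundaries running from 400 to 680 in steps of 20. The sample month values are made up for illustration, and both tensors are created as floats so that their dtypes match.

```python
import torch

# 15 boundaries (400, 420, ..., 680) define 16 buckets with indices 0..15.
boundaries = torch.arange(400, 700, 20, dtype=torch.float32)
months = torch.tensor([574.0, 400.0, 399.0, 681.0])

# With the default right=False, bucket i satisfies boundaries[i-1] < v <= boundaries[i].
buckets = torch.bucketize(months, boundaries)
print(len(boundaries), buckets.tolist())
```

A value of 574 falls between 560 and 580 and therefore lands in bucket 9, while anything above 680 lands in the last bucket, 15.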

In these particular datasets, 'category_month' takes 18 unique values in the training data and 14 in the test data, and the test values are a subset of the training values. So I fit the OneHotEncoder on the training data and use it to transform both the training and test data.
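
The fit-on-train, transform-both pattern can be sketched on a few made-up 'category_month' values. The point is that the test matrix gets the same columns as the training matrix, even for categories that never occur in the test data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy values; the test values are a subset of the training values.
train_vals = np.array(['News9', 'Lifestyle9', 'News8']).reshape(-1, 1)
test_vals = np.array(['News9', 'News9']).reshape(-1, 1)

enc = OneHotEncoder()
enc.fit(train_vals)                                   # learn categories from train only
train_encoded = enc.transform(train_vals).toarray()
test_encoded = enc.transform(test_vals).toarray()     # same column layout as train

print(list(enc.categories_[0]))
print(test_encoded)
```

With the default settings, a test value that was not seen during fitting would raise an error, which is why the subset relationship matters here.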

In [4]:
from sklearn.preprocessing import OneHotEncoder

def create_category_month(train, test):
    '''Create a 'category_month' column.'''
    
    # fill 'missing' in the missing values
    train['category'] = train['category'].fillna('missing')
    test['category'] = test['category'].fillna('missing')

    # bucketize month_since_epoch into 16 buckets
    boundaries = torch.tensor(range(400,700,20))
    train['month_since_epoch'] = torch.bucketize(torch.tensor(train['month_since_epoch']), boundaries)
    test['month_since_epoch'] = torch.bucketize(torch.tensor(test['month_since_epoch']), boundaries)

    train['category_month'] = train['category'] + train['month_since_epoch'].astype(str)
    test['category_month'] = test['category'] + test['month_since_epoch'].astype(str)

    ohc = OneHotEncoder()
    ohe = ohc.fit(train.category_month.values.reshape(-1,1))

    a = ohe.transform(train.category_month.values.reshape(-1,1)).toarray()
    train = pd.concat([train, pd.DataFrame(a, columns = ['category_month_' + str(ohc.categories_[0][i]) for i in range(len(ohc.categories_[0]))])], axis=1)
    train.drop(['category_month'], axis = 1,inplace=True)

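    b = ohe.transform(test.category_month.values.reshape(-1,1)).toarray()
    # give the test columns the same names as the training columns
    test = pd.concat([test, pd.DataFrame(b, columns = ['category_month_' + str(c) for c in ohc.categories_[0]])], axis=1)
    test.drop(['category_month'], axis = 1,inplace=True)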
    
    return train, test
    
train, test = create_category_month(train, test)
print(train.head(2))

            visitor_id  content_id category  \
0  1093134947164137327        3320     News   
1  1110195149925330322        3356     News   

                                               title  author  \
0       Kanzlerin Merkel strebt stabile Regierung an      28   
1  Regierungsbildung: SPD und CDU bringen sich in...       4   

   month_since_epoch  next_content_id  category_month_Lifestyle15  \
0                  9             3104                         0.0   
1                  9             3108                         0.0   

   category_month_Lifestyle6  category_month_Lifestyle7  ...  \
0                        0.0                        0.0  ...   
1                        0.0                        0.0  ...   

   category_month_News8  category_month_News9  category_month_Stars & Kultur6  \
0                   0.0                   1.0                             0.0   
1                   0.0                   1.0                             0.0   

   category_month_St

The 'title' attribute is encoded with a pretrained embedding. I use the NNLM model for German (https://tfhub.dev/google/nnlm-de-dim50/2). Fifty dimensions are enough for this task.

In [5]:
import tensorflow_hub as hub

def title_embedding(train, test):
    ''' Convert 'title' column to pretrained embeddings (50 dimensions). '''
    
    pretrained_emb = hub.load("https://tfhub.dev/google/nnlm-de-dim50/2")
    train['title'] = train['title'].apply(lambda x: pretrained_emb([x]).numpy().reshape((50,)))
    test['title'] = test['title'].apply(lambda x: pretrained_emb([x]).numpy().reshape((50,)))

    emb_col_names = []
    for i in range(50):
        emb_col_names.append('title_emb_' + str(i+1))

    train[emb_col_names] = pd.DataFrame(train.title.tolist(), index= train.index)
    test[emb_col_names] = pd.DataFrame(test.title.tolist(), index= test.index)
    
    return train, test


train, test = title_embedding(train, test)

In [6]:
# apply OneHotEncoding to category
def One_hot_encoding(dataset, attr_lst):
    
    ''' Perform One Hot Encoding on an attribute of a dataset.
    Arguments:
        -- dataset: the data set.
        -- attr_lst: the list of attribute names, in string form. E.g. ['a', 'b']
        
    Output:
        return the dataset with added encoded attributes. And the original attribute is dropped'''
        
    ohc = OneHotEncoder()
    for attr in attr_lst:
        ohe = ohc.fit_transform(dataset[attr].values.reshape(-1,1)).toarray()
        dfOneHot = pd.DataFrame(ohe,columns = [attr + '_' + str(ohc.categories_[0][i]) for i in range(len(ohc.categories_[0]))])
        dataset = pd.concat([dataset,dfOneHot], axis=1)
        dataset.drop([attr], axis = 1,inplace=True)
    
    return dataset


train = One_hot_encoding(train, ['category'])
test = One_hot_encoding(test, ['category'])

# drop unwanted columns
train.drop(['title', 'visitor_id', 'month_since_epoch'], axis = 1,inplace=True)
test.drop(['title', 'visitor_id', 'month_since_epoch'], axis = 1,inplace=True)

print(train.head(2))

   content_id  author  next_content_id  category_month_Lifestyle15  \
0        3320      28             3104                         0.0   
1        3356       4             3108                         0.0   

   category_month_Lifestyle6  category_month_Lifestyle7  \
0                        0.0                        0.0   
1                        0.0                        0.0   

   category_month_Lifestyle8  category_month_Lifestyle9  category_month_News5  \
0                        0.0                        0.0                   0.0   
1                        0.0                        0.0                   0.0   

   category_month_News6  ...  title_emb_45  title_emb_46  title_emb_47  \
0                   0.0  ...      0.076455      0.075862     -0.197285   
1                   0.0  ...     -0.106623      0.025269     -0.109285   

   title_emb_48  title_emb_49  title_emb_50  category_Lifestyle  \
0      0.310615      0.024261     -0.149852                 0.0   
1     -0.0

Now the data set has 74 features ('content_id': 1, 'category': 4, 'title': 50, 'author': 1, 'category_month': 18) and 1 target column.

In [7]:
print('Number of features in train set:', train.shape[1]-1)
print('Number of features in test set:', test.shape[1]-1)

Number of features in train set: 74
Number of features in test set: 74


## Construct Dataset and DataLoader for efficient model training in PyTorch. 
Batch size is set to 512.

In [8]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class MyDataset(Dataset):

    def __init__(self, data):
        """
        Args:
            data: the dataset, in pandas DataFrame form.
        """
        self.df = data

    def __len__(self):
        return self.df.shape[0]
    
    def __getitem__(self, idx):
        ''' Retrieve instance(s) from data.
        
        Argument:
            -- idx: an integer or a list of integers.
            
        return: 
            -- A tuple storing (X_content, X_author, X_rest, y)
        '''

        if torch.is_tensor(idx):
            idx = idx.tolist()
            
        X_content = self.df.iloc[idx, self.df.columns == 'content_id']
        X_content = torch.tensor(X_content, dtype = torch.int64)
        X_author = self.df.iloc[idx, self.df.columns == 'author']
        X_author = torch.tensor(X_author, dtype = torch.int64)
        X_rest = self.df.iloc[idx, 3:]
        X_rest = torch.tensor(X_rest, dtype = torch.float)
        y = self.df.iloc[idx, self.df.columns == 'next_content_id']
        y = torch.tensor(y, dtype = torch.int64)
        sample = (X_content, X_author, X_rest, y)
        
        return sample

train_set = MyDataset(train)    
loader_train = DataLoader(train_set, batch_size=512, shuffle = True)  

test_set = MyDataset(test)   
loader_test = DataLoader(test_set, batch_size=512, shuffle = True) 
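
As a quick sanity check on the batch size, the number of mini-batches per epoch follows from the record counts given at the top of the notebook. The helper name `batches_per_epoch` is hypothetical, and the calculation assumes the DataLoader default of keeping the final, smaller batch.

```python
import math

def batches_per_epoch(n_samples: int, batch_size: int) -> tuple:
    """Return (number of batches, size of the last batch) when the last batch is kept."""
    n_batches = math.ceil(n_samples / batch_size)
    last_batch = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_batch

print(batches_per_epoch(179_092, 512))  # training set
print(batches_per_epoch(25_599, 512))   # test set
```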

## Build the neural network model. 

First, the 'content_id' goes through an embedding layer of dimension 10 and 'author' goes through an embedding layer of dimension 3. 

Next, concatenate the embedded 'content_id', embedded 'author' and the rest features. 

Then go through 'Linear1 (200 hidden neurons) - BatchNorm - ReLU - Linear2 (100 hidden neurons) - BatchNorm - ReLU - Linear3 (50 hidden neurons) - BatchNorm - ReLU - Linear4 (number_of_classes neurons) - softmax'.

Use cross-entropy as the loss function (cross-entropy loss = softmax + negative log-likelihood). In the code, the network outputs raw logits and nn.CrossEntropyLoss applies the softmax internally, so the model contains no explicit softmax layer.
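
The decomposition of the loss into a softmax followed by a negative logarithm can be checked numerically. The sketch below uses random logits and made-up labels, and compares PyTorch's built-in cross-entropy with the two-step version.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 6)            # 4 samples, 6 classes (toy sizes)
labels = torch.tensor([0, 5, 2, 2])

# Built-in cross-entropy takes raw logits.
ce = F.cross_entropy(logits, labels)

# Two-step version: log-softmax, then negative log-likelihood of the true class.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(ce.item(), manual.item())
```

This is why the final layer of the network can return logits directly when the loss is cross-entropy.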

Use AdaGrad as the optimization method. AdaGrad scales the step for each parameter by its accumulated squared gradients, which has two effects:
1. In high-dimensional problems, progress along "steep" directions is damped and progress along "flat" directions is accelerated.
2. The effective learning rate shrinks over time, which slows the updates as training approaches a minimum.
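
Both effects can be seen in a minimal sketch of the textbook AdaGrad update on a toy quadratic with one steep and one flat direction. The function names and the toy objective are illustrative assumptions; PyTorch's `optim.Adagrad` has further options (such as learning-rate decay and weight decay) that are not modelled here.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-10):
    """One textbook AdaGrad update: divide each step by the root of the summed squared gradients."""
    new_params, new_accum = [], []
    for p, g, a in zip(params, grads, accum):
        a = a + g * g                                   # accumulate squared gradient
        new_accum.append(a)
        new_params.append(p - lr * g / (math.sqrt(a) + eps))
    return new_params, new_accum

def grad(params):
    """Gradient of f(x, y) = 10*x**2 + 0.1*y**2 (steep in x, flat in y)."""
    x, y = params
    return [20.0 * x, 0.2 * y]

params, accum = [1.0, 1.0], [0.0, 0.0]
params, accum = adagrad_step(params, grad(params), accum)
first_step = list(params)

params, accum = adagrad_step(params, grad(params), accum)
print(first_step, params)
```

On the first step both coordinates move by roughly the learning rate, even though the gradients differ by a factor of 100. On the second step the move is smaller, because the accumulated squared gradients have grown.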

The weights use PyTorch's default initialization.

In [9]:
# the target labels do not need embedding. Because softmax simply takes the class with highest logit value as the predicted label.
import torch.nn as nn
import torch.optim as optim

class MyNet(nn.Module):
    def __init__(self, content_size, author_size):
        super().__init__()
        
        self.content_emb =  nn.Embedding(content_size, 10) 
        self.author_emb =  nn.Embedding(author_size, 3)
        self.comb = nn.Sequential(
                nn.Linear(85, 200),
                nn.BatchNorm1d(200),
                nn.ReLU(),
                nn.Linear(200, 100),
                nn.BatchNorm1d(100),
                nn.ReLU(),
                nn.Linear(100, 50),
                nn.BatchNorm1d(50),
                nn.ReLU(),
                nn.Linear(50, content_size)
                )

    def forward(self, X_content, X_author, X_rest):        # X_rest size ([512, 72])                  
        content = self.content_emb(X_content)              # torch.Size([512, 1, 10])
        content = content.squeeze(1)                       # drop only the length-1 dim, so a batch of one keeps its batch dim
        author = self.author_emb(X_author)                 # torch.Size([512, 1, 3])
        author = author.squeeze(1)
        output = torch.cat((content, author, X_rest), 1)   # torch.Size([512, 85])
        output = self.comb(output)                         # torch.Size([512, 6326])
        
        return output


net = MyNet(content_size, author_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adagrad(net.parameters(), lr=.1)
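
The input width of 85 in the first linear layer comes from concatenating the 10-dimensional content embedding, the 3-dimensional author embedding and the 72 remaining features. The following shape check uses a toy batch and made-up vocabulary sizes to illustrate this.

```python
import torch
import torch.nn as nn

batch, n_content, n_author, n_rest = 8, 100, 20, 72   # toy sizes, except n_rest

content_emb = nn.Embedding(n_content, 10)
author_emb = nn.Embedding(n_author, 3)

# Index tensors have shape [batch, 1], matching what the Dataset returns per sample.
X_content = torch.randint(0, n_content, (batch, 1))
X_author = torch.randint(0, n_author, (batch, 1))
X_rest = torch.randn(batch, n_rest)

content = content_emb(X_content).squeeze(1)            # [batch, 10]
author = author_emb(X_author).squeeze(1)               # [batch, 3]
combined = torch.cat((content, author, X_rest), dim=1) # [batch, 85]
print(tuple(combined.shape))
```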

## Train the model. 
For simplicity, hyperparameter tuning is skipped in this notebook. The learning rate is set to 0.1 and the network structure is fixed. Training runs for 20 epochs.

In [None]:
# CUDA 
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
net = net.to(device = device)

In [10]:
max_epoch = 20

for epoch in range(max_epoch):  
    running_loss = 0.0
    for i, data in enumerate(loader_train):    # 350 iterations per epoch (179,092 samples / batch size 512, rounded up)
        net.train()     

        X_content, X_author, X_rest, labels = data
        
        X_content = X_content.to(device = device)
        X_author = X_author.to(device = device)
        X_rest = X_rest.to(device = device)
        labels = labels.to(device = device)
        
        optimizer.zero_grad()

        outputs = net(X_content, X_author, X_rest)    # torch.Size([512, 6326])
        loss = criterion(outputs, labels.squeeze())
        
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 50 == 49:    # print every 50 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / (i+1)))            

[1,    50] loss: 5.383



[1,   100] loss: 5.204
[1,   150] loss: 5.113
[1,   200] loss: 5.065
[1,   250] loss: 5.021
[1,   300] loss: 4.989
[1,   350] loss: 4.969
[2,    50] loss: 4.740
[2,   100] loss: 4.737
[2,   150] loss: 4.740
[2,   200] loss: 4.738
[2,   250] loss: 4.735
[2,   300] loss: 4.732
[2,   350] loss: 4.729
[3,    50] loss: 4.668
[3,   100] loss: 4.665
[3,   150] loss: 4.669
[3,   200] loss: 4.670
[3,   250] loss: 4.666
[3,   300] loss: 4.666
[3,   350] loss: 4.666
[4,    50] loss: 4.610
[4,   100] loss: 4.621
[4,   150] loss: 4.618
[4,   200] loss: 4.630
[4,   250] loss: 4.631
[4,   300] loss: 4.631
[4,   350] loss: 4.629
[5,    50] loss: 4.581
[5,   100] loss: 4.592
[5,   150] loss: 4.596
[5,   200] loss: 4.598
[5,   250] loss: 4.601
[5,   300] loss: 4.602
[5,   350] loss: 4.601
[6,    50] loss: 4.580
[6,   100] loss: 4.572
[6,   150] loss: 4.572
[6,   200] loss: 4.577
[6,   250] loss: 4.581
[6,   300] loss: 4.581
[6,   350] loss: 4.581
[7,    50] loss: 4.538
[7,   100] loss: 4.551
[7,   150] 

In [11]:
def get_topK_accuracy(net, loader_test, k):
    
    '''Return the top K accuracy.
    Argument:
    -- net: The trained model.
    -- loader_test: The DataLoader of test set.
    -- k: Top k, where k is an integer.
    
    Return: Top K accuracy. '''
    num_correct = 0
    num_samples = 0
    net.eval()  
    with torch.no_grad():
        for X_content, X_author, X_rest, y in loader_test:
            X_content = X_content.to(device=device)  # move to device, e.g. GPU
            X_author = X_author.to(device=device)
            X_rest = X_rest.to(device=device)
            y = y.to(device=device)

            scores = net(X_content, X_author, X_rest)
            _, preds = scores.topk(k=k, dim=1)     
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        
    return acc


top1_acc = get_topK_accuracy(net, loader_test, 1)
top10_acc = get_topK_accuracy(net, loader_test, 10)
print('Top 1 accuracy:', top1_acc)
print('Top 10 accuracy:', top10_acc)

Top 1 accuracy: 0.0632446579944529
Top 10 accuracy: 0.3222000859408571
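
To show what the top-K metric counts, here is a small worked example with hand-made scores for three samples and three classes. The helper `topk_accuracy` is a hypothetical, single-batch variant of the function above. It uses a per-row `any` instead of a global sum, which gives the same count because each row has one label and the top-K indices are distinct.

```python
import torch

scores = torch.tensor([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2],
                       [0.2, 0.3, 0.5]])
y = torch.tensor([[1], [2], [1]])     # true class per sample, shape [3, 1]

def topk_accuracy(scores, y, k):
    """Fraction of rows whose true class is among the k highest-scoring classes."""
    _, preds = scores.topk(k=k, dim=1)        # [n, k] class indices
    hits = (preds == y).any(dim=1)            # broadcast y across the k columns
    return hits.float().mean().item()

top1 = topk_accuracy(scores, y, 1)
top2 = topk_accuracy(scores, y, 2)
print(top1, top2)
```

Only the first sample is correct at K=1. At K=2 the third sample also counts as a hit, because its true class has the second-highest score.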


## Performance.
On the test set, the top-1 accuracy is about 0.063 and the top-10 accuracy is about 0.322.