# Vowpal Wabbit Deep Dive

VW expects a specific [input format](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format), in this notebook to_vw() is a convenience function that converts the standard movielens dataset into the required data format. Datafiles are then written to disk and passed to VW for training.

The examples shown are to demonstrate functional capabilities of VW not to indicate performance advantages of different approaches. There are several hyper-parameters (e.g. learning rate and regularization terms) that can greatly impact performance of VW models which can be adjusted using [command line options](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Command-Line-Arguments). To properly compare approaches it is helpful to learn about and tune these parameters on the relevant dataset.

 # 0. Load Data and Global Setup

In [1]:
import sys
sys.path.append('../..')

import os
from subprocess import run
from tempfile import TemporaryDirectory
from time import process_time

import pandas as pd
import papermill as pm

from reco_utils.common.notebook_utils import is_jupyter
from reco_utils.dataset.movielens import load_pandas_df
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import (rmse, mae, exp_var, rsquared, get_top_k_items,
                                                     map_at_k, ndcg_at_k, precision_at_k, recall_at_k)

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Pandas version: 0.25.2


In [2]:
df = pd.read_csv('/Users/yumengxiao/Documents/7374/Assignment3/hetrec2011-lastfm-2k/user_artists.csv')
df.head()

Unnamed: 0,userID,artistID,weight
0,2,51,13883
1,2,52,11690
2,2,53,11351
3,2,54,10300
4,2,55,8983


The 'weight' is actually the listening count for each (user, artist) pair, so if weight is higher it will mean that the user likes this artist more. But beacause the range of weight is too big, we deicide to tranform it into integer 1 to 4.

In [3]:
df['weight'].describe()

count     92834.00000
mean        745.24393
std        3751.32208
min           1.00000
25%         107.00000
50%         260.00000
75%         614.00000
max      352698.00000
Name: weight, dtype: float64

In [4]:
df['weight'] = df['weight'].apply(lambda x: 1 if x <= 107 
                                    else (2 if x <= 260 
                                    else (3 if x <= 614 
                                    else  4)))
df.head()

Unnamed: 0,userID,artistID,weight
0,2,51,4
1,2,52,4
2,2,53,4
3,2,54,4
4,2,55,4


In [5]:
def to_vw(df, output, logistic=False):
    """Convert Pandas DataFrame to vw input format
    Args:
        df (pd.DataFrame): input DataFrame
        output (str): path to output file
        logistic (bool): flag to convert label to logistic value
    """
    with open(output, 'w') as f:
        tmp = df.reset_index()
        
        # convert rating to binary value
        if logistic:
            tmp['weight'] = tmp['weight'].apply(lambda x: 1 if x >= 3 else -1)
        
        # convert each row to VW input format
        # [label] [tag]|[user namespace] [user id feature] |[item namespace] [movie id feature]
        # label is the true rating, tag is a unique id for the example just used to link predictions to truth
        # user and item namespaces separate the features to support interaction features through command line options
        for _, row in tmp.iterrows():
            f.write('{weight:d} {index:d}|user {userID:d} |item {artistID:d}\n'.format_map(row))

In [6]:
def run_vw(train_params, test_params, test_data, prediction_path, logistic=False):
    """Convenience function to train, test, and show metrics of interest
    Args:
        train_params (str): vw training parameters
        test_params (str): vw testing parameters
        test_data (pd.dataFrame): test data
        prediction_path (str): path to vw prediction output
        logistic (bool): flag to convert label to logistic value
    Returns:
        (dict): metrics and timing information
    """

    # train model
    train_start = process_time()
    run(train_params.split(' '), check=True)
    train_stop = process_time()
    
    # test model
    test_start = process_time()
    run(test_params.split(' '), check=True)
    test_stop = process_time()
    
    # read in predictions
    pred_df = pd.read_csv(prediction_path, delim_whitespace=True, names=['prediction'], index_col=1).join(test_data)
    pred_df.drop("weight", axis=1, inplace=True)

    test_df = test_data.copy()
    if logistic:
        # make the true label binary so that the metrics are captured correctly
        test_df['weight'] = test['weight'].apply(lambda x: 1 if x >= 3 else -1)
    else:
        # ensure results are integers in correct range
        pred_df['prediction'] = pred_df['prediction'].apply(lambda x: int(max(1, min(4, round(x)))))

    # calculate metrics
    result = dict()
    result['RMSE'] = rmse(test_df, pred_df)
    result['MAE'] = mae(test_df, pred_df)
    result['R2'] = rsquared(test_df, pred_df)
    result['Explained Variance'] = exp_var(test_df, pred_df)
    result['Train Time (ms)'] = (train_stop - train_start) * 1000
    result['Test Time (ms)'] = (test_stop - test_start) * 1000
    
    return result

In [7]:
# create temp directory to maintain data files
tmpdir = TemporaryDirectory()

model_path = os.path.join(tmpdir.name, 'vw.model')
saved_model_path = os.path.join(tmpdir.name, 'vw_saved.model')
train_path = os.path.join(tmpdir.name, 'train.dat')
test_path = os.path.join(tmpdir.name, 'test.dat')
train_logistic_path = os.path.join(tmpdir.name, 'train_logistic.dat')
test_logistic_path = os.path.join(tmpdir.name, 'test_logistic.dat')
prediction_path = os.path.join(tmpdir.name, 'prediction.dat')
all_test_path = os.path.join(tmpdir.name, 'new_test.dat')
all_prediction_path = os.path.join(tmpdir.name, 'new_prediction.dat')

# 1. Load & Transform Data

In [8]:
# Select MovieLens data size: 100k, 1m, 10m, or 20m
TOP_K = 10

In [9]:
# split data to train and test sets, default values take 75% of each users ratings as train, and 25% as test
train, test = python_random_split(df, 0.75)

# save train and test data in vw format
to_vw(df=train, output=train_path)
to_vw(df=test, output=test_path)

# save data for logistic regression (requires adjusting the label)
to_vw(df=train, output=train_logistic_path, logistic=True)
to_vw(df=test, output=test_logistic_path, logistic=True)

# 2. Regression Based Recommendations

When considering different approaches for solving a problem with machine learning it is helpful to generate a baseline approach to understand how more complex solutions perform across dimensions of performance, time, and resource (memory or cpu) usage.

Regression based approaches are some of the simplest and fastest baselines to consider for many ML problems.

## 2.1 Linear Regression

By passing each user-artist listening count in, the model will begin to learn weights based on average count for each user as well as average count per artist.  
  
Parameters' description of command line:  
VW uses linear regression by default, so no extra command line options  
-f <model_path>: indicates where the final model file will reside after training  
-d <data_path>: indicates which data file to use for training or testing  
--quiet: this runs vw in quiet mode silencing stdout   
-i <model_path>: indicates where to load the previously model file created during training  
-t: this executes inference only (no learned updates to the model)  
-p <prediction_path>: indicates where to store prediction output  

In [7]:
train_params = 'vw -f {model} -d {data} --quiet'.format(model=model_path, data=train_path)
# save these results for later use during top-k analysis
test_params = 'vw -i {model} -d {data} -t -p {pred} --quiet'.format(model=model_path, data=test_path, pred=prediction_path)

result = run_vw(train_params=train_params, 
                test_params=test_params, 
                test_data=test, 
                prediction_path=prediction_path)

comparison = pd.DataFrame(result, index=['Linear Regression'])
comparison

Unnamed: 0,RMSE,MAE,R2,Explained Variance,Train Time (ms),Test Time (ms)
Linear Regression,0.988433,0.70988,0.227276,0.227286,62.5,15.625


## 2.2 Linear Regression with Interaction Features

To generate interaction features use the quadratic command line argument and specify the namespaces that should be combined: '-q ui' combines the user and item namespaces based on the first letter of each.

Currently the userIDs and itemIDs used are integers which means the feature ID is used directly, for instance when user ID 123 rates movie 456, the training example puts a 1 in the values for features 123 and 456. However when interaction is specified (or if a feature is a string) the resulting interaction feature is hashed into the available feature space. Feature hashing is a way to take a very sparse high dimensional feature space and reduce it into a lower dimensional space. This allows for reduced memory while retaining fast computation of feature and model weights.

In [8]:
"""
Quick description of command line parameters used
  -b <N>: sets the memory size to 2<sup>N</sup> entries
  -q <ab>: create quadratic feature interactions between features in namespaces starting with 'a' and 'b' 
"""
train_params = 'vw -b 26 -q ua -f {model} -d {data} --quiet'.format(model=saved_model_path, data=train_path)
test_params = 'vw -i {model} -d {data} -t -p {pred} --quiet'.format(model=saved_model_path, data=test_path, pred=prediction_path)

result = run_vw(train_params=train_params,
                test_params=test_params,
                test_data=test,
                prediction_path=prediction_path)
saved_result = result

comparison = comparison.append(pd.DataFrame(result, index=['Linear Regression w/ Interaction']))
comparison

Unnamed: 0,RMSE,MAE,R2,Explained Variance,Train Time (ms),Test Time (ms)
Linear Regression,0.988433,0.70988,0.227276,0.227286,62.5,15.625
Linear Regression w/ Interaction,0.985921,0.71292,0.231199,0.231338,15.625,31.25


## 2.3 Multinomial Logistic Regression
An alternative to linear regression is to leverage multiclas logistic regression, which treats each rating value as a distinct class. 

Basic multiclass logistic regression can be accomplished using the One Against All approach specified by the '--oaa N' option, where N is the number of classes and proving the logistic option for the loss function to be used.  
  
'One Against All' is: When classifying n-types of samples, for each class, take this class as one, and treat the remaining n-1 class samples as another to build a model, so that we can convert to n two-category questions. Finally, you can get n models.The samples to be predicted are passed into the n models, and the predicted output is the result of the model (label) with the largest likelihood  
  
Parameters' description of command line:  
--loss_function logistic: sets the model loss function for logistic regression  
--oaa <N>: trains N separate models using One-Against-All approach  
--link logistic: converts the predicted output from logit to probability

In [9]:
train_params = 'vw --loss_function logistic --oaa 4 -f {model} -d {data} --quiet'.format(model=model_path, data=train_path)
test_params = 'vw --link logistic -i {model} -d {data} -t -p {pred} --quiet'.format(model=model_path, data=test_path, pred=prediction_path)

result = run_vw(train_params=train_params,
                test_params=test_params,
                test_data=test,
                prediction_path=prediction_path)

comparison = comparison.append(pd.DataFrame(result, index=['Multinomial Regression']))
comparison

Unnamed: 0,RMSE,MAE,R2,Explained Variance,Train Time (ms),Test Time (ms)
Linear Regression,0.988433,0.70988,0.227276,0.227286,62.5,15.625
Linear Regression w/ Interaction,0.985921,0.71292,0.231199,0.231338,15.625,31.25
Multinomial Regression,1.11278,0.75564,0.020626,0.050111,62.5,31.25


## 2.4 Logistic Regression

Additionally, one might simply be interested in whether the user likes or dislikes an item and we can adjust the input data to represent a binary outcome, where ratings in (1,2] are dislikes and [3,4] are likes.

In [10]:
train_params = 'vw --loss_function logistic -f {model} -d {data} --quiet'.format(model=model_path, data=train_logistic_path)
test_params = 'vw --link logistic -i {model} -d {data} -t -p {pred} --quiet'.format(model=model_path, data=test_logistic_path, pred=prediction_path)

result = run_vw(train_params=train_params,
                test_params=test_params,
                test_data=test,
                prediction_path=prediction_path,
                logistic=True)

comparison = comparison.append(pd.DataFrame(result, index=['Logistic Regression']))
comparison

Unnamed: 0,RMSE,MAE,R2,Explained Variance,Train Time (ms),Test Time (ms)
Linear Regression,0.988433,0.70988,0.227276,0.227286,62.5,15.625
Linear Regression w/ Interaction,0.985921,0.71292,0.231199,0.231338,15.625,31.25
Multinomial Regression,1.11278,0.75564,0.020626,0.050111,62.5,31.25
Logistic Regression,0.717475,0.409551,0.096362,0.1425,46.875,31.25


We can see that r square of all the model is really low which means these model are not fitted well. So we decided to write the contented based system by ourselves.