In [2]:
# installations
!pip install numpy
!pip install pandas
!pip install scipy



In [2]:
import os
import pandas as pd
import numpy as np
import json


# Running a test
The following chunk runs a lightweight version of the experiment once, with only a handful of users.
At each round, the model gets fitted to every user in the dataset and a regret score is calculated.
At the end of all rounds (by default 100 rounds) the minimum cumulative regret indicates, which model performs best.
According to the source paper, *ts-seg-pessimistic* performs best on the offline experiment. Hence, they tested this models performance in a cascade versus non-cascade offline experiment, in which case the *ts-seg-pessimistic cascade* model performed best. This model also caused the highest display-to-stream gain.
All featured tests' results showed statistical significance at the 1% level. However, the paper does not state how those statistical tests were conducted.


In [15]:
# small dataset, all policies
!python main.py --users_path data/user_features_small.csv --policies random,etc-seg-explore,etc-seg-exploit,epsilon-greedy-explore,epsilon-greedy-exploit,kl-ucb-seg,ts-seg-naive,ts-seg-pessimistic,ts-lin-naive,ts-lin-pessimistic --n_users_per_round 9 --output_path general_experiment_results.json
!python plot_results.py --data_path general_experiment_results.json


INFO:__main__:LOADING DATA
INFO:__main__:Loading playlist data
INFO:__main__:Loading user data
 

9
9
1.0
Traceback (most recent call last):
  File "/Users/nikolaus/Documents/Uni/TU/DS Experiment Design/exercise 2/carousel_reproduction/main.py", line 79, in <module>
    raise hell
NameError: name 'hell' is not defined
Figure(640x480)


In [16]:
# full dataset, two policies
!python main.py --policies random,ts-seg-pessimistic --print_every 5 --output_path general_experiment_results.json
!python plot_results.py --data_path general_experiment_results.json

INFO:__main__:LOADING DATA
INFO:__main__:Loading playlist data
INFO:__main__:Loading user data
 

974960
20000
0.020513662098957906
Traceback (most recent call last):
  File "/Users/nikolaus/Documents/Uni/TU/DS Experiment Design/exercise 2/carousel_reproduction/main.py", line 79, in <module>
    raise hell
NameError: name 'hell' is not defined
Figure(640x480)


# My suggestion for reproducing tests
We can reproduce the full offline experiment. We just need to find a way to compare results.
https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/
According to this paper, it is common to do a 10-fold cross validation on the dataset at hand and grab the mean/median success metric of each model as a statistic for that model. We would then have the following hypotheses:
 * h0: there is no difference in model performances, i.e. ts-seg-pessimistic (cascade) performs just as well as the other models, i.e. the lower cumulative regret of the tests are due to chance
  * h1: there is a difference in model performances, i.e. ts-seg-pessimistic (cascade) outperforms the other models

### Does the original code cross-validate?
No, but something different. The argument *n_users_per_round* (default 20.000, 1% of full dataset) states, how many users actually get recommended anything in any round. That means that the actual dataset never changes for every time you run the experiment, but the sample on which you fit and evaluate the model changes on every round.
According to Brownlee (2018, https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/), this testing behaviour violates a key assumption of the Student's t-test, which is independency of observations in each sample per round.
My suggestion for fixing this. Run the full experiment k (10) times and give the program either 10% of the data to work on, while either continuing to let it use a sample of 20.000 users or even telling it not to resample anything, but rather work on the whole data. Instead of rewriting the source code to include cross-validation, I will split the users dataset itself into k samples and run the whole main.py file k times, with different arguments each time, i.e. changing the results-output-filename.
The only rewriting I will do, is putting the main workflow of the main.py file into a separate main() function with arguments passed in python rather than from the command line, so I can execute it using this notebook.


In [55]:
# test run of the new experiment.py

from experiment import run_experiment
run_experiment(
        users_path="data/user_features.csv",
        playlists_path="data/playlist_features.csv",
        output_path="results.json",
        policies="random,ts-seg-naive",
        n_recos=12,
        l_init=3,
        n_users_per_round=20000,
        n_rounds=100,
        print_every=10,
        user_indices=[3, 6, 7, 9, 12, 36, 47]

)

TypeError: run_experiment() got an unexpected keyword argument 'user_indices'

In [33]:
def split_dataset(data_filename, k):
        # grabs a datafile from the folder "data/"
        # randomly splits the dataset into k smaller sets of equal size
        # saves them in the folder "data/split/"

        with open(f"data/{data_filename}", "r") as f:
                full = np.array(f.readlines())

        data_filename, suffix = data_filename.split(".")

        # header row should be separate
        header_row, full = full[0], full[0:]

        N = len(full) # number of all rows
        indices = np.array(list((range(N)))) # numbers from 0 to N
        n = N // k # number of rows per k

        print(f"Splitting {N} rows from {data_filename}.{suffix} into {k} files with {n} rows each ...")

        for i in range(k):
                # get random row numbers from 0 to the length of indices (starts at N)
                random_indices = np.random.choice(len(indices), n, replace=False)
                # grab random row numbers of the remaining ones
                sample_indices = indices[random_indices]
                # remove the selected row numbers from the indices list
                indices = np.delete(indices, random_indices)
                # create a sample
                sample = full[sample_indices]
                # add the header row
                sample = np.append(header_row, sample)
                # save the sample
                print(f"\t saving file number {i}: {sample_indices}")
                f = f"data/split/{data_filename}_{i}.{suffix}"
                with open(f, "w") as f:
                        f.writelines(list(sample))


        print(f"All done. {len(indices)} users remain unsaved: {indices}")
        print()
        del full
        del indices
        del sample




# Preparing k datasets
data_filename = "user_features.csv"
data_filename = "user_features_s.csv"

k = 10

split_dataset(data_filename, k)

Splitting 97497 rows from user_features_s.csv into 10 files with 9749 rows each ...
	 saving file number 0: [85946  3872 31780 ... 47937 55567 82517]
	 saving file number 1: [22586 31592 96232 ... 73702 64310 54891]
	 saving file number 2: [76709 75098  2673 ...  5544 56415 92977]
	 saving file number 3: [63423 41334 63777 ... 60094 15604 37713]
	 saving file number 4: [19728 31058 70798 ... 16191 62876 61679]
	 saving file number 5: [88045 25461  6879 ... 32632 20313 31835]
	 saving file number 6: [94082 32249 73644 ... 48007 74561  1241]
	 saving file number 7: [69563 19952 90072 ... 81243 24419 55885]
	 saving file number 8: [10380  9242 92649 ... 37100 44500 91414]
	 saving file number 9: [50418 58201 96554 ... 38744 74385 92454]
All done. 7 users remain unsaved: [ 4459  8820 14441 42417 69647 69973 93180]



### Data for testing
* data/split/: small data
* data/splitL: big data

In [51]:
from experiment import run_experiment


all_policies = "random,etc-seg-explore,etc-seg-exploit,epsilon-greedy-explore,epsilon-greedy-exploit,kl-ucb-seg,ts-seg-naive,ts-seg-pessimistic,ts-lin-naive,ts-lin-pessimistic"

for f in os.listdir("data/split"):
        if "DS_Store" in f: continue
        i = f.split(".")[0][-1]
        f = f"data/split/{f}"
        run_experiment(
                users_path=f,
                playlists_path="data/playlist_features.csv",
                output_path=f"output/s/results_{i}.json",
                policies=all_policies,
                n_recos=12,
                l_init=3,
                n_users_per_round=None,
                n_rounds=100,
                print_every=25,
        )


INFO:experiment:LOADING DATA
INFO:experiment:Loading playlist data
INFO:experiment:Loading user data
 

INFO:experiment:SETTING UP SIMULATION ENVIRONMENT
INFO:experiment:for 9749 users, 862 playlists, 12 recommendations per carousel 
 

INFO:experiment:SETTING UP POLICIES
INFO:experiment:Policies to evaluate: random,etc-seg-explore,etc-seg-exploit,epsilon-greedy-explore,epsilon-greedy-exploit,kl-ucb-seg,ts-seg-naive,ts-seg-pessimistic,ts-lin-naive,ts-lin-pessimistic 
 

INFO:experiment:STARTING SIMULATIONS
INFO:experiment:for 100 rounds, with 9749 users per round (randomly drawn with replacement)
 

INFO:experiment:Round: 1/100. Elapsed time: 230.721847 sec.
INFO:experiment:Cumulative regrets: 
	random : 6394.4675012742355
	etc-seg-explore : 6397.4675012742355
	etc-seg-exploit : 6418.4675012742355
	epsilon-greedy-explore : 6456.4675012742355
	epsilon-greedy-exploit : 6379.4675012742355
	kl-ucb-seg : 6384.4675012742355
	ts-seg-naive : 6366.4675012742355
	ts-seg-pessimistic : 6334.467501

TypeError: can't multiply sequence by non-int of type 'float'

In [36]:
os.listdir("output/s")

['results_3.json']

In [37]:
import json

In [38]:
with open("output/s/results_3.json", "r") as f:
        res3 = json.load(f)

In [49]:
res = [[key, value[-1]] for key, value in res3.items()]
np.array(res)

array([['random', '639927.0835794248'],
       ['etc-seg-explore', '607537.0835794248'],
       ['etc-seg-exploit', '204077.0835794248'],
       ['epsilon-greedy-explore', '145837.0835794249'],
       ['epsilon-greedy-exploit', '180067.08357942486'],
       ['kl-ucb-seg', '361742.08357942494'],
       ['ts-seg-naive', '181456.08357942486'],
       ['ts-seg-pessimistic', '110566.08357942494'],
       ['ts-lin-naive', '183840.08357942486'],
       ['ts-lin-pessimistic', '190586.08357942486']], dtype='<U22')