Surprise Sandbox

Runs a variety of recommender-system (recsys) experiments using a fork of Surprise. The fork lives here. Exploring the results requires Jupyter, pandas, NumPy, seaborn, SciPy, and matplotlib; the recommended approach is to install the latest Anaconda distribution.

Running the experiments requires the Surprise fork.

Directory Setup

Some required directories are not tracked by git because they contain many large files generated throughout the experiment pipeline.

/predictions - predictions (user, movie, predicted rating, real rating) for each algorithm, boycott instance, and fold. Saving and loading predictions only works if the shuffle-based cross-validation folds use a constant random seed (currently random_state=0, so it works if you don't change anything). Subdirectories of /predictions: /standards and /boycotts.
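To illustrate why the constant seed matters, here is a minimal sketch using only the Python standard library. It is not the fork's splitting code; the function name shuffled_folds and the integer stand-ins for rating rows are hypothetical. The point is only that the same seed reproduces the same folds, so saved predictions line up with the folds on a later run.

```python
import random

def shuffled_folds(items, n_splits=5, random_state=0):
    """Split items into n_splits folds after a seeded shuffle (hypothetical helper)."""
    shuffled = list(items)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(random_state).shuffle(shuffled)
    # Deal items round-robin so fold sizes differ by at most one.
    return [shuffled[i::n_splits] for i in range(n_splits)]

ratings = list(range(20))  # stand-ins for (user, movie, rating) rows
first_run = shuffled_folds(ratings, random_state=0)
second_run = shuffled_folds(ratings, random_state=0)

# Same seed, same folds: predictions saved per fold can be reloaded safely.
same_folds = first_run == second_run
```

If the seed changes between runs, the folds generally change too, and predictions saved earlier no longer correspond to the current test sets.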

/results - raw results, e.g. "this algorithm had an RMSE of 0.9 under some boycott conditions".

/processed_results - computed additive and percent differences, e.g. "this algorithm had a 5% decrease in NDCG10 compared to no-boycott".

/standard_results - standard, no-boycott results.
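As a rough sketch of what "additive and percent differences" could mean, assuming the difference is taken as boycott minus standard and the percent difference is relative to the standard value (both are assumptions; the actual processing scripts may differ):

```python
def additive_and_percent_diff(standard, boycott):
    """Return (additive, percent) difference of a boycott metric vs. the standard.

    Hypothetical helper: assumes diff = boycott - standard and that the
    percent difference is measured relative to the standard value.
    """
    if standard == 0:
        raise ValueError("percent difference is undefined for a zero standard")
    additive = boycott - standard
    percent = 100.0 * additive / standard
    return additive, percent

# Made-up numbers: NDCG10 drops from 0.80 (no boycott) to 0.76 (boycott).
additive, percent = additive_and_percent_diff(standard=0.80, boycott=0.76)
print(f"additive: {additive:.2f}, percent: {percent:.1f}%")
```

With these made-up numbers the percent difference is about -5%, which corresponds to the "5% decrease" phrasing above.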

These directories are created when you run sandbox.py.

After a run of sandbox.py, you'll get prediction files in /predictions, performance measure results in /results, a list of which users participated in each simulated boycott in /standard_results/uid_sets_*, and the standard, no-boycott results for each algorithm (specified in specs.py) in /standard_results.
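sandbox.py creates these directories itself. If you want to create or check the layout up front, a minimal sketch follows. The helper name ensure_dirs is hypothetical, the directory list is taken from the descriptions above, and the example runs against a temporary directory rather than the repository root.

```python
from pathlib import Path
import tempfile

# Directory names taken from the list above; relative to the repository root.
REQUIRED_DIRS = [
    "predictions/standards",
    "predictions/boycotts",
    "results",
    "processed_results",
    "standard_results",
]

def ensure_dirs(root):
    """Create each required directory under root if it does not exist yet."""
    created = []
    for rel in REQUIRED_DIRS:
        path = Path(root) / rel
        path.mkdir(parents=True, exist_ok=True)  # no error if already present
        created.append(path)
    return created

# Demonstrate against a throwaway directory instead of the real repository.
with tempfile.TemporaryDirectory() as tmp:
    made = ensure_dirs(tmp)
    all_exist = all(p.is_dir() for p in made)
```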

Benchmarks

See benchmark_comparisons.csv and http://surpriselib.com/.

Inspecting Processed Data

The processed results are collected into an all_results.csv file for each dataset (1M and 20M).

Load the data_strikes_results notebook (via jupyter notebook, jupyter lab, etc.) to explore the various metrics and algorithms.

In the EDIT ME cell, you can edit a variety of global variables to re-run the notebook with different configurations. The various metrics discussed in the paper are already loaded in the notebook if you just want to check out the figures without running code.

Experiment Pipeline

Get the files in place with your choice of tool: command line, Finder/Explorer, the AWS CLI, etc.

Be sure to install the forked version of surprise.

Organize all the "standards" (i.e. the results used for comparison with boycotts) into a single directory. Here I've used misc_standards and hard-coded that directory in standards_for_uid_sets.py.

Merge the standards files: python .\standards_for_uid_sets.py --join

Do the processing (i.e. match up columns and do subtraction): python process_all.py

Re-run the visualization and/or statistics: run jupyter notebook, then select "visualize-v02".
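The two scripted steps above can be chained from Python. This is only a sketch: the run_pipeline helper is hypothetical, it assumes it is run from the repository root, and it defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess
import sys

# Script names and the --join flag are taken from the steps above.
PIPELINE = [
    ["standards_for_uid_sets.py", "--join"],  # merge the standards files
    ["process_all.py"],                       # match up columns and subtract
]

def run_pipeline(dry_run=True):
    """Run each pipeline step in order; with dry_run, only print the commands."""
    commands = [[sys.executable, *step] for step in PIPELINE]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
    return commands

planned = run_pipeline(dry_run=True)
```

Calling run_pipeline(dry_run=False) would execute the steps for real; the notebook step is left manual.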

Output Files

Currently, the experiments produce outputs that are written directly to files (as opposed to being stored in a database).
