# Data Splitting
To generate one of our custom data splits, we need four lists:
1. list of all non-boycott data
2. list of all boycott data
3. list of all users who are boycotting
4. list of all users who are like the boycotters but are not boycotting
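
To make the four lists concrete, here is a minimal sketch on made-up data. The tuples, the user ids, and the choice of who boycotts are all hypothetical, and for simplicity the sketch withholds every rating from a boycotting user, whereas the walkthrough in this notebook withholds only a sample of them.

```python
# Toy ratings as (user_id, movie_id, rating) tuples; all values are made up.
ratings = [
    (1, 10, 4), (1, 11, 5),
    (2, 10, 3), (2, 12, 2),
    (3, 11, 4), (3, 13, 5),
    (4, 10, 1),
]

# Hypothetical group: users 1, 2, and 3 share the trait of interest.
similar_uids = [1, 2, 3]
# Of those, only users 1 and 2 actually boycott.
boycott_uids = [1, 2]

# Assumption for this sketch: a boycotting user withholds ALL of their ratings.
boycott_ratings = [r for r in ratings if r[0] in boycott_uids]
non_boycott_ratings = [r for r in ratings if r[0] not in boycott_uids]

# Users who resemble the boycotters but did not join the boycott.
like_boycotters_uids = [u for u in similar_uids if u not in boycott_uids]

print(len(non_boycott_ratings), len(boycott_ratings))
print(boycott_uids, like_boycotters_uids)
```

Every rating lands in exactly one of the two data lists, and every user in the group lands in exactly one of the two user lists.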

To keep this worked example small enough to follow, we'll sample 10 users and work only with their ratings.

First we'll run some setup code: importing libraries and loading data into memory.

In [27]:
import numpy as np
import pandas as pd
from prep_organized_boycotts import group_by_gender
from surprise import SVD, KNNBasic, Dataset, KNNBaseline
from surprise.reader import Reader
from utils import get_dfs

## Load the data
The `load_builtin` function downloads ml-100k if you don't already have it.

The `get_dfs` function is our custom loader; it loads ratings, users, and movies into their own dataframes.

For this example, we sample 10 users and keep only their ratings (out of the 100k in the full dataset), so the output stays small enough to follow along.

In [28]:
_ = Dataset.load_builtin('ml-100k')
dfs = get_dfs('ml-100k')
ratings_df = dfs['ratings']
users_df = dfs['users']
data = Dataset.load_from_df(
    ratings_df[['user_id', 'movie_id', 'rating']],
    reader=Reader()
)
users_df_for_example = users_df.sample(10, random_state=0)
print(users_df_for_example)
uids_for_example = list(set(users_df_for_example.user_id))
sample_for_example = ratings_df[ratings_df.user_id.isin(uids_for_example)]

     user_id  age gender     occupation zip_code
345      346   34      M          other    76059
876      877   30      M          other    77504
558      559   69      M      executive    10022
667      668   29      F         writer    10016
236      237   49      M  administrator    63146
261      262   19      F        student    78264
651      652   35      M          other    22911
312      313   41      M      marketing    60035
14        15   49      F       educator    97301
674      675   34      M          other    28814


Next, let's identify our boycott users and boycott ratings.
In this example, we'll simulate a partial boycott by half of all male users, each of whom withholds 50% of their ratings.
We use random sampling to select both the boycotting half of the male users and the half of each boycotter's ratings that is withheld.
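
The two sampling steps described above can be sketched on a small synthetic dataframe. The data and the variable names here are made up for illustration, and the sketch uses `groupby(...).sample(...)` (available in pandas 1.1 and later) as a compact variant; it is not the per-user loop this notebook runs, so it will not pick the same rows.

```python
import pandas as pd

# Made-up toy data: four users in the group, four ratings each.
ratings = pd.DataFrame({
    "user_id": [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4,
    "movie_id": list(range(16)),
    "rating": [3, 4, 5, 2] * 4,
})
group_uids = pd.Series([1, 2, 3, 4], name="user_id")

# Step 1: sample half of the users in the group.
boycott_uids = set(group_uids.sample(frac=0.5, random_state=0))

# Step 2: for each boycotting user, sample half of that user's ratings.
from_boycotters = ratings[ratings.user_id.isin(boycott_uids)]
boycott_ratings = from_boycotters.groupby("user_id").sample(frac=0.5, random_state=0)

# Every rating that was not withheld remains available as non-boycott data.
non_boycott_ratings = ratings.drop(boycott_ratings.index)

print(len(boycott_ratings), len(non_boycott_ratings))
```

Because the sampled rows keep their original index, dropping that index from the full dataframe yields the complementary set of ratings.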

In [31]:
# Index 0 holds the male group in this run (the printed output shows only male users).
male_user_group = group_by_gender(users_df_for_example)[0]
male_user_df = male_user_group['df']

boycotting_male_users = male_user_df.sample(frac=0.5, random_state=0)
print('Boycotting Male Users')
print(boycotting_male_users)
boycott_uids = list(set(boycotting_male_users.user_id))
like_boycotters_but_not_boycotting = male_user_df[~male_user_df.user_id.isin(boycott_uids)]
print('Non-boycotting Male Users')
print(like_boycotters_but_not_boycotting)
like_boycotters_but_not_boycotting_uids = list(set(like_boycotters_but_not_boycotting.user_id))

Boycotting Male Users
     user_id  age gender     occupation zip_code
674      675   34      M          other    28814
558      559   69      M      executive    10022
876      877   30      M          other    77504
236      237   49      M  administrator    63146
Non-boycotting Male Users
     user_id  age gender occupation zip_code
345      346   34      M      other    76059
651      652   35      M      other    22911
312      313   41      M  marketing    60035


In [30]:
# Sample 50% of each boycotting user's ratings, then combine them into one dataframe.
boycott_ratings_df = pd.concat([
    sample_for_example[sample_for_example.user_id == uid].sample(frac=0.5, random_state=0)
    for uid in boycott_uids
])
print(boycott_ratings_df)

       user_id  movie_id  rating  unix_timestamp
31851      237       238       4       879376435
3292       237       100       5       879376381
28883      237         9       4       879376730
33744      237       197       4       879376515
34356      237       180       4       879376730
41060      237       134       5       879376327
35128      237       525       4       879376487
49942      237       498       4       879376698
4931       237        23       4       879376606
7740       237        98       4       879376327
8297       237       153       3       879376698
33745      237       489       4       879376381
34504      237       659       4       879376553
29848      237       485       4       879376553
98557      237       286       3       879376220
2803       237       494       4       879376553
92352      237        64       5       879376671
15480      237       502       4       879376487
11672      237       423       4       879376487
30206      237      

Great, we now have the critical elements to perform our evaluations:
1. list of all non-boycott data (