# Data Splitting
To generate one of our custom data splits, we're going to need four lists:
1. list of all non-boycott ratings
2. list of all boycott ratings
3. list of all users who are boycotting
4. list of all users who are like the boycotters but not boycotting

To make it practical to follow this worked example, let's just use data from 10 users.
First we'll run some setup code: importing libraries and loading data into memory.

In [21]:
import numpy as np
import pandas as pd
from prep_organized_boycotts import group_by_gender
from surprise import SVD, KNNBasic, Dataset, KNNBaseline
from surprise.reader import Reader
from utils import get_dfs
from itertools import chain



## Load the data
The `load_builtin` function downloads ml-100k if you don't already have it.

The `get_dfs` function is our custom loader; it loads ratings, users, and movies into their own dataframes.

In [22]:
_ = Dataset.load_builtin('ml-100k')
dfs = get_dfs('ml-100k')
ratings_df = dfs['ratings']
users_df = dfs['users']

users_df_for_example = users_df.sample(10, random_state=0)
print(users_df_for_example)
uids_for_example = list(set(users_df_for_example.user_id))
sample_for_example = ratings_df[ratings_df.user_id.isin(uids_for_example)]

     user_id  age gender     occupation zip_code
345      346   34      M          other    76059
876      877   30      M          other    77504
558      559   69      M      executive    10022
667      668   29      F         writer    10016
236      237   49      M  administrator    63146
261      262   19      F        student    78264
651      652   35      M          other    22911
312      313   41      M      marketing    60035
14        15   49      F       educator    97301
674      675   34      M          other    28814


Next, let's identify our boycott users and boycott ratings.
In this example, we'll be simulating a PARTIAL BOYCOTT (50% of each boycotting user's ratings) by half of all male users.
To select the half of male users that are boycotting and the half of ratings affected, we'll just use random sampling.

To get a list of all male users, we can use `group_by_gender` from the `prep_organized_boycotts` module. Similar functions exist for state, genre, power users, etc.

In [23]:
male_user_df = group_by_gender(users_df_for_example)[0]['df']

boycotting_male_users = male_user_df.sample(frac=0.5, random_state=0)
boycott_uid_set = set(boycotting_male_users.user_id)

like_boycotters_but_not_boycotting = male_user_df.drop(boycotting_male_users.index)
like_boycotters_but_not_boycotting_uid_set = set(like_boycotters_but_not_boycotting.user_id)

print('Boycotting Male Users')
print(boycotting_male_users)
print('Non-boycotting Male Users')
print(like_boycotters_but_not_boycotting)

Boycotting Male Users
     user_id  age gender     occupation zip_code
674      675   34      M          other    28814
558      559   69      M      executive    10022
876      877   30      M          other    77504
236      237   49      M  administrator    63146
Non-boycotting Male Users
     user_id  age gender occupation zip_code
345      346   34      M      other    76059
651      652   35      M      other    22911
312      313   41      M  marketing    60035
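
The sample-then-drop pattern used in the cell above can be checked on a toy frame. This is only a minimal sketch: `toy_users` is a hypothetical dataframe (not part of ml-100k), and the point is just to confirm that `sample` and `drop` together split the rows into two non-overlapping groups.

```python
import pandas as pd

# Hypothetical toy data: seven users, mirroring the seven male users above.
toy_users = pd.DataFrame({'user_id': [11, 12, 13, 14, 15, 16, 17]})

# Randomly pick half of the rows to boycott; a fixed seed keeps this repeatable.
boycotting = toy_users.sample(frac=0.5, random_state=0)
# Everyone who was not picked is "like the boycotters, but not boycotting".
not_boycotting = toy_users.drop(boycotting.index)

boycott_ids = set(boycotting.user_id)
other_ids = set(not_boycotting.user_id)

# The two groups never overlap and together cover every user.
print(boycott_ids.isdisjoint(other_ids))
print(boycott_ids | other_ids == set(toy_users.user_id))
```

With an odd number of rows the two groups differ in size by one, as in the 4/3 split printed above.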


Now that we've identified all the boycott users, we can set aside the list of ratings from non-boycott users (although we'll need to come back to this and add in any "lingering" ratings from a partial boycott).

In [24]:
non_boycott_user_ratings_df = sample_for_example[~sample_for_example.user_id.isin(boycott_uid_set)]

In [25]:
boycott_ratings_df = None
boycott_user_lingering_ratings_df = None

for uid in boycott_uid_set:
    ratings_belonging_to_user = sample_for_example[sample_for_example.user_id == uid]
    boycott_ratings_for_user = ratings_belonging_to_user.sample(frac=0.5, random_state=0)
    lingering_ratings_for_user = ratings_belonging_to_user.drop(boycott_ratings_for_user.index)
    if boycott_ratings_df is None:
        boycott_ratings_df = boycott_ratings_for_user
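        # Tag the first row with a sentinel rating of -2 (it shows up in the printed output).
        # Assign through a single .iloc call; chained indexing may write to a copy.
        boycott_ratings_df.iloc[0, boycott_ratings_df.columns.get_loc('rating')] = -2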
    else:
        boycott_ratings_df = pd.concat([boycott_ratings_df, boycott_ratings_for_user])
    if boycott_user_lingering_ratings_df is None:
        boycott_user_lingering_ratings_df = lingering_ratings_for_user
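        # Tag the first row with a sentinel rating of -1 (it shows up in the printed output).
        # Assign through a single .iloc call; chained indexing may write to a copy.
        boycott_user_lingering_ratings_df.iloc[0, boycott_user_lingering_ratings_df.columns.get_loc('rating')] = -1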
    else:
        boycott_user_lingering_ratings_df = pd.concat([boycott_user_lingering_ratings_df, lingering_ratings_for_user])
print(boycott_ratings_df.head())
print(boycott_user_lingering_ratings_df.head())
print('Boycott ratings: {}, Lingering Ratings from Boycott Users: {}'.format(
    len(boycott_ratings_df.index), len(boycott_user_lingering_ratings_df.index)
))

       user_id  movie_id  rating  unix_timestamp
31851      237       238      -2       879376435
3292       237       100       5       879376381
28883      237         9       4       879376730
33744      237       197       4       879376515
34356      237       180       4       879376730
      user_id  movie_id  rating  unix_timestamp
119       237       514      -1       879376641
1501      237       191       4       879376773
2944      237       528       5       879376606
3737      237       211       4       879376515
4738      237       499       2       879376487
Boycott ratings: 119, Lingering Ratings from Boycott Users: 120
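
As a possible alternative to the per-user loop, the same kind of split can be sketched with a grouped sample. This assumes pandas 1.1 or newer (where `DataFrameGroupBy.sample` is available) and uses a hypothetical `toy_ratings` frame; the random draws will not necessarily match those of the loop above, so treat it as an illustration rather than a drop-in replacement.

```python
import pandas as pd

# Hypothetical toy ratings: two boycotting users with four ratings each.
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 1, 1, 2, 2, 2, 2],
    'movie_id': [10, 11, 12, 13, 10, 14, 15, 16],
    'rating':   [5, 4, 3, 2, 1, 2, 3, 4],
})

# Withhold half of each user's ratings (the boycotted ratings).
withheld = toy_ratings.groupby('user_id').sample(frac=0.5, random_state=0)
# Whatever remains for those users is "lingering".
lingering = toy_ratings.drop(withheld.index)

print(withheld.groupby('user_id').size().to_dict())   # {1: 2, 2: 2}
print(lingering.groupby('user_id').size().to_dict())  # {1: 2, 2: 2}
```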


So we have all the ratings from non-boycott users in `non_boycott_user_ratings_df`.
However, in a partial boycott (like this example), the lingering ratings from the boycott users must also be included in the non-boycott ratings: they are still available to TRAIN ON, but whenever they are held out for testing they go into the BOYCOTT TESTSET.
To get all the non-boycott ratings together, we can just concatenate `non_boycott_user_ratings_df` with `boycott_user_lingering_ratings_df`.

In [26]:
all_non_boycott_ratings_df = pd.concat([non_boycott_user_ratings_df, boycott_user_lingering_ratings_df])
print(len(all_non_boycott_ratings_df.index))

907


Great, we now have the critical elements to perform our evaluations:
1. list of all non-boycott ratings (in `all_non_boycott_ratings_df`)
2. list of all boycott ratings (in `boycott_ratings_df`)
3. list of all boycott user ids (in `boycott_uid_set`)
4. list of all user ids of users who are like the boycotters but aren't boycotting (in `like_boycotters_but_not_boycotting_uid_set`)

As a quick sanity check, we can make sure that (1) each rating appears exactly once across the two rating lists and (2) each male user appears in exactly one of the two user sets.

In [27]:
assert len(all_non_boycott_ratings_df.index) + len(boycott_ratings_df.index) == len(sample_for_example.index)
assert len(boycott_uid_set) + len(like_boycotters_but_not_boycotting_uid_set) == len(male_user_df.index)
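
The assertions above compare counts only. A slightly stricter check, sketched here with a hypothetical helper `is_partition`, compares the row labels directly. It assumes the original dataframe has unique index labels and that they are preserved by `sample`, `drop`, and `pd.concat` (which holds as long as `ignore_index` is not used).

```python
import pandas as pd

def is_partition(parts, whole):
    """Return True if the row labels of `parts` are pairwise disjoint
    and together equal the row labels of `whole`."""
    seen = set()
    for part in parts:
        labels = set(part.index)
        if seen & labels:
            return False  # some row shows up in two parts
        seen |= labels
    return seen == set(whole.index)

# Hypothetical toy frame standing in for sample_for_example.
toy = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})
print(is_partition([toy.iloc[:2], toy.iloc[2:]], toy))  # True
print(is_partition([toy.iloc[:2], toy.iloc[1:]], toy))  # False
```

In the notebook this would be called as `is_partition([all_non_boycott_ratings_df, boycott_ratings_df], sample_for_example)`.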

One more thing before we move on to the actual data splitting: we need to convert the pandas dataframes into the `Dataset` objects that Surprise expects.

In [28]:
nonboycott = Dataset.load_from_df(
    all_non_boycott_ratings_df[['user_id', 'movie_id', 'rating']],
    reader=Reader()
)
boycott = Dataset.load_from_df(
    boycott_ratings_df[['user_id', 'movie_id', 'rating']],
    reader=Reader()
)

Ok - now it's really time to do the data splitting. In order to randomly shuffle the nonboycott and boycott ratings, we're just going to make a list of indices and shuffle that in place.

In [29]:
indices = np.arange(len(nonboycott.raw_ratings))
boycott_indices = np.arange(len(boycott.raw_ratings))

np.random.RandomState(0).shuffle(indices)
np.random.RandomState(0).shuffle(boycott_indices)
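
To see what the seeded shuffle does, here is a small sketch on a ten-element array (the length is arbitrary). It shows that `shuffle` permutes the array in place and that a fresh `RandomState(0)` reproduces the same order for an array of the same length.

```python
import numpy as np

indices = np.arange(10)
# shuffle() reorders the array in place and returns None.
np.random.RandomState(0).shuffle(indices)

# A second generator with the same seed produces the same permutation
# when the array has the same length.
again = np.arange(10)
np.random.RandomState(0).shuffle(again)

print(sorted(indices.tolist()) == list(range(10)))  # True
print(np.array_equal(indices, again))               # True
```

Because the two index arrays in the notebook have different lengths, reusing seed 0 there simply makes each shuffle reproducible from run to run.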

Initialize two pairs of counters that will be used to walk through the indices.

The roughly 20% of ratings between `start` and `stop` will be the testset for a given fold (by traversing the ratings in this manner, we make sure every single rating shows up in a testset exactly once). Likewise, the roughly 20% of boycott ratings between `boycott_start` and `boycott_stop` will be the boycott ratings that are tested on for a given fold.

In [30]:
start, stop = 0, 0
boycott_start, boycott_stop = 0, 0
num_splits = 5
ret = []
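
The way the counters advance can be sketched in isolation. The helper below, `fold_bounds`, is a hypothetical name introduced for illustration; it applies the same "integer division plus one leftover item for the earliest folds" rule that this walkthrough uses, to the 907 non-boycott ratings and 119 boycott ratings counted earlier.

```python
def fold_bounds(n_items, num_splits):
    """Return (start, stop) pairs that cut range(n_items) into
    num_splits contiguous folds whose sizes differ by at most one."""
    bounds = []
    start, stop = 0, 0
    for i_fold in range(num_splits):
        start = stop
        stop += n_items // num_splits
        # The first (n_items % num_splits) folds absorb one leftover item each.
        if i_fold < n_items % num_splits:
            stop += 1
        bounds.append((start, stop))
    return bounds

nonboycott_sizes = [stop - start for start, stop in fold_bounds(907, 5)]
boycott_sizes = [stop - start for start, stop in fold_bounds(119, 5)]
print(nonboycott_sizes)  # [182, 182, 181, 181, 181]
print(boycott_sizes)     # [24, 24, 24, 24, 23]
```

Adding the two lists fold by fold gives 206, 206, 205, 205, 204 test ratings per fold.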

In [31]:
like_boycott_uid_set = like_boycotters_but_not_boycotting_uid_set
for i_fold in range(num_splits):
    # "Move the goalposts"
    start = stop
    stop += len(indices) // num_splits
    if i_fold < len(indices) % num_splits:
        stop += 1
    # Move the goalposts in the boycott set
    boycott_start = boycott_stop
    boycott_stop += len(boycott_indices) // num_splits
    if i_fold < len(boycott_indices) % num_splits:
        boycott_stop += 1

    raw_trainset = [nonboycott.raw_ratings[i] for i in chain(indices[:start],
                                                       indices[stop:])]
    
    nonboycott_ratings_for_test = [nonboycott.raw_ratings[i] for i in indices[start:stop]]
    # some of these may be lingering ratings from boycott users, which are routed to the boycott testset below
    
    boycott_testratings = []
    nonboycott_testratings = []
    # full name: like-boycotting-users-but-not-boycotting
    like_boycott_but_testratings = []            
    for rating_row in nonboycott_ratings_for_test:
        uid = rating_row[0]
        if uid in boycott_uid_set:
            boycott_testratings.append(rating_row)
        else:
            nonboycott_testratings.append(rating_row)
            if uid in like_boycott_uid_set:
                like_boycott_but_testratings.append(rating_row)

    for rating_row in [
        boycott.raw_ratings[i] for i in boycott_indices[boycott_start:boycott_stop]
    ]:
        boycott_testratings.append(rating_row)

    all_like_boycott_testratings = boycott_testratings + like_boycott_but_testratings
    all_testratings = boycott_testratings + nonboycott_testratings


    # nonboycott is a Dataset object with the construct_trainset and construct_testset methods
    # whether we call nonboycott.construct_ or boycott.construct_ is arbitrary
    trainset = nonboycott.construct_trainset(raw_trainset)

    nonboycott_testset = nonboycott.construct_testset(nonboycott_testratings)
    boycott_testset = nonboycott.construct_testset(boycott_testratings)
    like_boycott_but_testset = nonboycott.construct_testset(like_boycott_but_testratings)
    all_like_boycott_testset = nonboycott.construct_testset(all_like_boycott_testratings)
    all_testset = nonboycott.construct_testset(all_testratings)

    # yield trainset, nonboycott_testset, boycott_testset, like_boycott_but_testset, all_like_boycott_testset, all_testset
    ret.append([trainset, nonboycott_testset, boycott_testset, like_boycott_but_testset, all_like_boycott_testset, all_testset])
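
The routing step inside the loop can be looked at on its own. The sketch below uses a hypothetical function `route_test_ratings` and simplified `(uid, iid, rating)` tuples (Surprise's raw ratings also carry a timestamp); it is meant to illustrate the branching, not to replace the cell above.

```python
def route_test_ratings(test_rows, boycott_uids, like_boycott_uids):
    """Split held-out rows into the three base test buckets by user id."""
    boycott, nonboycott, like_boycott_but = [], [], []
    for row in test_rows:
        uid = row[0]
        if uid in boycott_uids:
            # Lingering rating from a boycotting user.
            boycott.append(row)
        else:
            nonboycott.append(row)
            if uid in like_boycott_uids:
                # Similar to the boycotters, but did not boycott.
                like_boycott_but.append(row)
    return boycott, nonboycott, like_boycott_but

# Hypothetical rows: user 1 boycotts, user 2 is similar but not boycotting,
# user 3 is neither.
rows = [(1, 10, 4.0), (2, 11, 3.0), (3, 12, 5.0), (2, 13, 1.0)]
boycott, nonboycott, like_boycott_but = route_test_ratings(rows, {1}, {2})
print(len(boycott), len(nonboycott), len(like_boycott_but))  # 1 3 2
```

Note that `like_boycott_but` is a subset of `nonboycott`, so those rows are counted in both buckets.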

In [32]:
for row in ret:
    print('\nNext fold:')
    trainset, nonboycott_testset, boycott_testset, like_boycott_but_testset, all_like_boycott_testset, all_testset = row
    print('trainset:', list(trainset.all_ratings()))
    print('nonboycott_testset:', nonboycott_testset)
    print('boycott_testset:', boycott_testset)
    print('like_boycott_but_testset:', like_boycott_but_testset)
    print('all_like_boycott_testset:', all_like_boycott_testset)
    print('all_testset:', all_testset)


Next fold:
trainset: [(0, 0, 3.0), (0, 1, 4.0), (0, 4, 3.0), (0, 11, 1.0), (0, 19, 5.0), (0, 20, 3.0), (0, 32, 4.0), (0, 40, 4.0), (0, 51, 5.0), (0, 59, 4.0), (0, 60, 2.0), (0, 65, 3.0), (0, 71, 3.0), (0, 72, 2.0), (0, 81, 3.0), (0, 87, 1.0), (0, 90, 2.0), (0, 91, 1.0), (0, 95, 1.0), (0, 53, 5.0), (0, 101, 4.0), (0, 26, 5.0), (0, 102, 3.0), (0, 104, 3.0), (0, 110, 4.0), (0, 117, 3.0), (0, 121, 4.0), (0, 124, 4.0), (0, 129, 5.0), (0, 131, 2.0), (0, 106, 4.0), (0, 152, 3.0), (0, 149, 2.0), (0, 156, 4.0), (0, 157, 4.0), (0, 163, 4.0), (0, 170, 2.0), (0, 141, 3.0), (0, 174, 1.0), (0, 191, 1.0), (0, 33, 4.0), (0, 66, 3.0), (0, 187, 4.0), (0, 198, 3.0), (0, 201, 3.0), (0, 206, 3.0), (0, 62, 3.0), (0, 130, 4.0), (0, 54, 3.0), (0, 212, 4.0), (0, 35, 4.0), (0, 219, 3.0), (0, 220, 5.0), (0, 235, 2.0), (0, 240, 1.0), (0, 242, 3.0), (0, 246, 2.0), (0, 248, 4.0), (0, 175, 3.0), (0, 260, 4.0), (0, 169, 3.0), (0, 263, 5.0), (0, 267, 4.0), (0, 159, 5.0), (0, 214, 2.0), (0, 276, 3.0), (0, 143, 5.0), (

like_boycott_but_testset: [(346, 1222, 4.0), (346, 1228, 4.0), (346, 98, 2.0), (346, 932, 2.0), (313, 526, 4.0), (313, 969, 4.0), (313, 143, 3.0), (313, 420, 5.0), (313, 840, 2.0), (346, 748, 4.0), (346, 685, 3.0), (346, 739, 3.0), (346, 333, 4.0), (313, 486, 3.0), (346, 182, 5.0), (313, 118, 4.0), (313, 609, 3.0), (346, 181, 5.0), (346, 572, 5.0), (313, 849, 3.0), (346, 79, 5.0), (346, 809, 3.0), (346, 195, 5.0), (346, 110, 2.0), (346, 1231, 3.0), (313, 845, 3.0), (313, 63, 4.0), (313, 409, 2.0), (313, 73, 5.0), (346, 831, 3.0), (313, 194, 4.0), (313, 161, 4.0), (346, 161, 3.0), (346, 842, 1.0), (313, 768, 3.0), (346, 204, 4.0), (346, 265, 4.0), (346, 732, 3.0), (313, 230, 3.0), (313, 515, 5.0), (346, 147, 4.0), (652, 125, 2.0), (313, 82, 3.0), (313, 211, 5.0), (313, 505, 5.0), (313, 632, 4.0), (346, 127, 5.0), (313, 566, 4.0), (313, 661, 4.0), (346, 216, 3.0), (313, 25, 2.0), (313, 148, 2.0), (313, 448, 3.0), (313, 197, 5.0), (313, 127, 5.0), (313, 226, 4.0), (313, 745, 3.0), (313, 6

In [33]:
for row in ret:
    print('\nNext fold:')
    trainset, nonboycott_testset, boycott_testset, like_boycott_but_testset, all_like_boycott_testset, all_testset = row
    print('nonboycott_testset:', len(nonboycott_testset))
    print('boycott_testset:', len(boycott_testset))
    print('like_boycott_but_testset:', len(like_boycott_but_testset))
    print('all_like_boycott_testset:', len(all_like_boycott_testset))
    print('all_testset:', len(all_testset))
    


Next fold:
nonboycott_testset: 163
boycott_testset: 43
like_boycott_but_testset: 98
all_like_boycott_testset: 141
all_testset: 206

Next fold:
nonboycott_testset: 152
boycott_testset: 54
like_boycott_but_testset: 91
all_like_boycott_testset: 145
all_testset: 206

Next fold:
nonboycott_testset: 159
boycott_testset: 46
like_boycott_but_testset: 91
all_like_boycott_testset: 137
all_testset: 205

Next fold:
nonboycott_testset: 155
boycott_testset: 50
like_boycott_but_testset: 98
all_like_boycott_testset: 148
all_testset: 205

Next fold:
nonboycott_testset: 158
boycott_testset: 46
like_boycott_but_testset: 99
all_like_boycott_testset: 145
all_testset: 204
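
The printed sizes can be cross-checked against how the combined testsets are built: `all_like_boycott_testset` is the boycott testset plus `like_boycott_but_testset`, and `all_testset` is the boycott testset plus the non-boycott testset. A short sketch using the numbers copied from the output above:

```python
# Per-fold sizes copied from the printed output:
# (nonboycott, boycott, like_boycott_but, all_like_boycott, all)
fold_sizes = [
    (163, 43, 98, 141, 206),
    (152, 54, 91, 145, 206),
    (159, 46, 91, 137, 205),
    (155, 50, 98, 148, 205),
    (158, 46, 99, 145, 204),
]

for nonboycott, boycott, like_but, all_like, everything in fold_sizes:
    assert all_like == boycott + like_but
    assert everything == boycott + nonboycott

# Every rating is tested exactly once across the five folds.
total_tested = sum(sizes[4] for sizes in fold_sizes)
print(total_tested)  # 1026
```

The total of 1026 matches the 907 non-boycott ratings plus the 119 boycott ratings counted earlier.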
