# Assessing the value of obscuritist opinions.
In this notebook, I examine the impact of various indices of snobbery on recommender system accuracy.
The background is an insight from my previous EDA suggesting that those who like more obscure movies also tend to differentiate more between canonical and non-canonical movies. This suggests the existence of a 'cooler older sibling', a person whose taste is more defined through broader exposure (cf. Vonnegut's comments in Bluebeard about expressionist art).
This obscuritist perspective (and three other indicators of snobbery) are put to the test against the benchmark SVD++ results via the Surprise package.

<b>Key finding: Obscuritist ratings give greater accuracy ($RMSE = 0.84$) than either the benchmark ($RMSE = 0.92$) or least obscuritist ratings ($RMSE=0.91$). They also give greater diversity, with fewer recommendations to the canon and/or Star Wars.</b>

### Imports and data preparation

In [2]:
import pandas as pd
import numpy as np
from surprise import Dataset, Reader, accuracy, SVDpp, SVD
from surprise.model_selection import train_test_split, GridSearchCV
import random
import io
from collections import defaultdict, Counter

import warnings
warnings.filterwarnings('ignore')

ratings = pd.read_pickle('ratings_for_ML.pkl')

FileNotFoundError: [Errno 2] No such file or directory: 'ratings_for_ML.pkl'

In [3]:
# This section imports the files we will use

#Import the ratings list
ratings = pd.read_pickle('moviesnob_df.pkl') # This file comes from 

#Making the file that we just load above...
# Import the features of each user (will use last four as our test columns)
user_info = pd.read_pickle('moviesnob_by_user_df.pkl').reset_index()
user_info = user_info.drop(columns=['rating_mean','canonical_sum','canon_prop','canonical_mean','canon_pref_stat',
                                   'canon_pref_stat','canon_pref_meandiff'])
ratings = pd.merge(ratings, user_info, how='left', on='userId')
ratings.head()
ratings['None']=1
ratings = ratings.drop(columns=['rating_date','tmdbId','release_date','rating_days_after','rating_year','release_year'])
ratings.to_pickle('ratings_for_ML.pkl')

FileNotFoundError: [Errno 2] No such file or directory: 'moviesnob_df.pkl'

## First run the benchmark sample.

In [None]:
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVDpp()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
print(accuracy.rmse(predictions))



RMSE for benchmark is 0.9237 with 295 recs for Star Wars.

## Recommender accuracy for each of the snobbery indices

A quick look at the datafile first

In [8]:
ratings.head()

Unnamed: 0,userId,movieId,rating,canonical,rating_count,newold_r,statler_waldorf,obscurist,contrariness,None
18220609,186223,5279,3.0,0.0,2,-1.0,0.5,1.0,0.989649,1
6014617,61730,3578,4.0,1.0,2,-1.0,0.0,1.0,0.358702,1
6992185,71796,79132,5.0,1.0,2,1.0,0.5,1.0,1.064744,1
2071540,21207,480,4.0,1.0,2,-1.0,0.0,1.0,0.169804,1
2071541,21207,1544,3.0,0.0,2,-1.0,0.0,1.0,0.169804,1


In [50]:
results = pd.DataFrame(columns=['RMSE_uppersplit','RMSE_lowersplit'])


### Key methodology

We had four indicators of snobbery:
1. Liking obscure things
2. Liking older movies
3. Being contrary to popular opinion
4. Being extreme in one's ratings

We took the tails of each of these distributions and then looked at accuracy in a recommender engine.

In [6]:
saved_params = pd.DataFrame(columns=['indicator','n_factors','n_epochs','lr_all','reg_all'])
fitted_rmse = []


In [53]:
# This cell is my machine learning part

sort_key = ['statler_waldorf','newold_r','obscurist','contrariness']
for item in sort_key:
    # Set up parameter grid
    param_grid = {'n_factors':[50,100,150],'n_epochs':[20,30],  'lr_all':[0.005,0.01],'reg_all':[0.02,0.1]}
    print('Making rating samples')
    # Set up the raw samples
    ratings = ratings.sort_values(by=item, ascending=False)
    ratings_upper = ratings.iloc[0:100000,:]
    ratings_lower = ratings.iloc[-100000:,:]
    reader = Reader(rating_scale=(1, 5)) # Make a reader
    # Now for the upper half (load reader)
    data = Dataset.load_from_df(ratings_upper[['userId', 'movieId', 'rating']], reader)
    raw_ratings = data.raw_ratings #get raw ratings
    random.shuffle(raw_ratings) # shuffle
    threshold = int(.75 * len(raw_ratings))
    A_raw_ratings = raw_ratings[:threshold]
    B_raw_ratings = raw_ratings[threshold:] #split into two samples
    data.raw_ratings = A_raw_ratings #keep sample A in reader
    print('Loaded data, now grid search...')
    gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5) # instantiate the model
    print('Training on {} upper'.format(item))
    gs.fit(data) #fit it!
    params = gs.best_params['rmse'] #get best params
    saved_params = saved_params.append(params, ignore_index=True) #append to df
    print(saved_params.head())
    print('Saved parameters for upper {}'.format(item))
    tuned_algo = gs.best_estimator['rmse'] #get tuned version
    trainset = data.build_full_trainset() #surprise requires the full trainset
    testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
    predictions = algo.test(testset) #make predictions
    rmse_uppersplit = accuracy.rmse(predictions, verbose=False) #get accuracy score
    print('Unbiased accuracy is {}'.format(rmse_uppersplit))
    fitted_rmse.append(rmse_uppersplit) # add that to my results list
    print(fitted_rmse)
    
    ## Start the other half
    data = Dataset.load_from_df(ratings_lower[['userId', 'movieId', 'rating']], reader)
    raw_ratings = data.raw_ratings
    random.shuffle(raw_ratings)
    threshold = int(.75 * len(raw_ratings))
    A_raw_ratings = raw_ratings[:threshold]
    B_raw_ratings = raw_ratings[threshold:]
    data.raw_ratings = A_raw_ratings
    print('Loaded data, now grid search...')
    gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
    print('Training on {} lower'.format(item))
    gs.fit(data)
    params = gs.best_params['rmse']
    saved_params = saved_params.append(params, ignore_index=True)
    print(saved_params.head())
    print('Saved parameters for lower {}'.format(item))
    tuned_algo = gs.best_estimator['rmse']
    trainset = data.build_full_trainset()
    testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
    predictions = algo.test(testset)
    rmse_uppersplit = accuracy.rmse(predictions, verbose=False)
    print('Unbiased accuracy is {}'.format(rmse_uppersplit))
    fitted_rmse.append(rmse_uppersplit)
    print(fitted_rmse)


Making rating samples
Loaded data, now grid search...
Training on statler_waldorf upper
   n_factors  n_epochs  lr_all  reg_all
0       50.0      30.0    0.01      0.1
Saved parameters for upper statler_waldorf
Unbiased accuracy is 1.7905228706312966
[1.791102296848872, 0.5320978691295725, 1.7923675495412288, 0.5327334113664695, 1.7942577634373613, 0.5368018808566758, 1.7905228706312966]
Loaded data, now grid search...
Training on statler_waldorf lower
   n_factors  n_epochs  lr_all  reg_all
0       50.0      30.0   0.010      0.1
1       50.0      30.0   0.005      0.1
Saved parameters for lower statler_waldorf
Unbiased accuracy is 0.6560445873476233
[1.791102296848872, 0.5320978691295725, 1.7923675495412288, 0.5327334113664695, 1.7942577634373613, 0.5368018808566758, 1.7905228706312966, 0.6560445873476233]
Making rating samples
Loaded data, now grid search...
Training on newold_r upper
   n_factors  n_epochs  lr_all  reg_all
0       50.0      30.0   0.010      0.1
1       50.0      3

### On the double-filtered dataset

In [54]:

# Import the features of each user (will use last four as our test columns)
user_info_red = pd.read_pickle('double_filtered.pkl').reset_index()

# Keep only those columns of the ratings where the userID is in the reduced dataset
ratings_redux = ratings[ratings.userId.isin(set(user_info_red.userId))]

Unnamed: 0,userId,rating_count,rating_mean,canonical_sum,canon_prop,canonical_mean,canon_pref_stat,canon_pref_meandiff,newold_r,statler_waldorf,obscurist,contrariness
0,1,16,3.3125,2.0,0.125,3.214286,-0.762255,-0.785714,0.43949,0.0,0.019481,0.573882
1,2,15,3.666667,7.0,0.466667,3.5625,-0.452208,-0.223214,-0.138446,0.0,0.594173,0.453425
2,3,11,3.545455,4.0,0.363636,3.285714,-0.917517,-0.714286,-0.07627,0.090909,-0.69317,0.424874
3,4,736,3.397418,104.0,0.141304,3.343354,-0.312342,-0.382607,-0.013705,0.184783,-0.220126,0.817582
4,5,72,4.263889,31.0,0.430556,4.109756,-0.642231,-0.357986,-0.006196,0.236111,-0.433186,0.480558


In [65]:
# Set up the results objects
saved_params_red = pd.DataFrame(columns=['n_factors','n_epochs','lr_all','reg_all'])
fitted_rmse_red = []


In [None]:
sort_key = ['statler_waldorf','newold_r','obscurist','contrariness']
for item in sort_key:
    param_grid = {'n_factors':[50,100,150],'n_epochs':[20,30],  'lr_all':[0.005,0.01],'reg_all':[0.02,0.1]}
    print('Making rating samples')
    ratings_redux = ratings_redux.sort_values(by=item, ascending=False)
    ratings_upper = ratings_redux.iloc[0:100000,:]
    ratings_lower = ratings_redux.iloc[-100000:,:]
    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(ratings_upper[['userId', 'movieId', 'rating']], reader)
    raw_ratings = data.raw_ratings
    random.shuffle(raw_ratings)
    threshold = int(.75 * len(raw_ratings))
    A_raw_ratings = raw_ratings[:threshold]
    B_raw_ratings = raw_ratings[threshold:]
    data.raw_ratings = A_raw_ratings
    print('Loaded data, now grid search...')
    gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
    print('Training on {} upper'.format(item))
    gs.fit(data)
    params = gs.best_params['rmse']
    saved_params_red = saved_params_red.append(params, ignore_index=True)
    print(saved_params_red.head())
    print('Saved parameters for upper {}'.format(item))
    tuned_algo = gs.best_estimator['rmse']
    trainset = data.build_full_trainset()
    testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
    predictions = algo.test(testset)
    rmse_uppersplit = accuracy.rmse(predictions, verbose=False)
    print('Unbiased accuracy is {}'.format(rmse_uppersplit))
    fitted_rmse_red.append(rmse_uppersplit)
    print(fitted_rmse_red)
    
    ## Start the other half
    data = Dataset.load_from_df(ratings_lower[['userId', 'movieId', 'rating']], reader)
    raw_ratings = data.raw_ratings
    random.shuffle(raw_ratings)
    threshold = int(.75 * len(raw_ratings))
    A_raw_ratings = raw_ratings[:threshold]
    B_raw_ratings = raw_ratings[threshold:]
    data.raw_ratings = A_raw_ratings
    print('Loaded data, now grid search...')
    gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
    print('Training on {} lower'.format(item))
    gs.fit(data)
    params = gs.best_params['rmse']
    saved_params_red = saved_params_red.append(params, ignore_index=True)
    print(saved_params_red.head())
    print('Saved parameters for lower {}'.format(item))
    tuned_algo = gs.best_estimator['rmse']
    trainset = data.build_full_trainset()
    testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
    predictions = algo.test(testset)
    rmse_uppersplit = accuracy.rmse(predictions, verbose=False)
    print('Unbiased accuracy is {}'.format(rmse_uppersplit))
    fitted_rmse_red.append(rmse_uppersplit)
    print(fitted_rmse_red)


In [67]:
ratings.head()

Unnamed: 0,userId,movieId,rating,canonical,rating_count,newold_r,statler_waldorf,obscurist,contrariness,None
24848782,254126,318,0.5,1.0,1,,1.0,,3.924187,1
2852462,29510,318,0.5,1.0,1,,1.0,,3.924187,1
22262974,227861,318,0.5,1.0,1,,1.0,,3.924187,1
8156744,83906,858,0.5,1.0,1,,1.0,,3.83289,1
183972,1869,318,0.5,1.0,2,,1.0,,3.770722,1


# Value counts when looking at contrariness

In [31]:
ratings_lower.rating.value_counts()


3.5    269245
4.0    260882
3.0    250819
2.5     87368
4.5     67699
2.0     46968
1.5     11894
5.0      3195
1.0      1466
0.5       464
Name: rating, dtype: int64

In [32]:
ratings_upper.rating.value_counts()

5.0    529195
4.0    135312
0.5     80485
1.0     73043
3.0     69843
4.5     42454
2.0     30947
3.5     21156
2.5     10561
1.5      7004
Name: rating, dtype: int64

### Some notes on the contours of the recommendations
With 1,000,000 ratings ...
<br>Using least obscure gives 885 canon, .93 RMSE, 14% canon, star wars = 770
<br>Using most obscurant gives 627 canon, .84 RMSE, 13.5% canon, star wars = 484


In [19]:
# This part has some functions to return the names of top recommendations.

def get_top_n(predictions, n=10):
    '''Returns the count of movie recommendations summed across ALL users given the top n for each user
    from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are the movie ID and values are the count for that movie.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    movie_counter = Counter()
    for uid, iid, true_r, est, _ in predictions:
        top_n[str(uid)].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones and update Counter.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
        for mov, _ in top_n[uid]:
            movie_counter[mov]+=1
    return movie_counter

# A borrowed function
def read_item_names():
    """Read the u.item file from MovieLens large dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """
    file_name = 'ml-latest/movies.csv'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        first_line= f.readline()  # skip header line
        for line in f:
            line = line.split(',')
            rid_to_name[line[int(0)]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid


rid_to_name, name_to_rid = read_item_names()

ratings = pd.read_pickle('moviesnob_df.pkl')

# Extract the canonical status
#canonical_status = ratings.groupby('movieId').canonical.min()
canonical_list = ratings[ratings.canonical == 1].groupby('movieId').canonical.min()

uniques_in_sample = ratings_lower.movieId.nunique()


canon_in_sample = ratings_lower[ratings_lower.canonical==1].movieId.nunique()
    # Run this sample through the recommender
top_n = get_top_n(predictions, n=10)

        # Getting the keys for titles
titles = top_n.keys()

top_n_list = list(top_n.keys())
canon_list = [i for i in top_n_list if i in canonical_list]
n_canon = len(canon_list)
canon_perc = n_canon/len(top_n.keys())

        # And the number that is Star Wars
star_wars_tril = top_n[260] + top_n[1196] + top_n[1210]
star_wars = top_n[260]

print(uniques_in_sample)

28569
