# Improving movie recommendations for movie snobs (draft)

## Introduction

This project examines how to improve the precision of recommender systems for movie snobs.

I will go through the following tasks:
1. Collate the 1,001 Movies list
2. Select a dataset that has most of the 1,001 Movies in it
3. Do some exploratory analyses of differences between 1,001 and non-1,001 movies (and The Dekalog and The Dark Knight), potentially by using demographic information
4. Do some baseline recommender modeling
5. Run a baseline model and then a model with 1,001 as a topic/rating.
6. Compare the models (with a 10% holdout set)
7. Try various models until we can predict the difference between 1,001 and non-1,001 movies 

Other ideas:
1. Work out how long it takes for something to be canonical and then use this as a bias adjustment.
2. Use the Rotten Tomatoes API to compare reviews of The Dekalog and The Dark Knight (basing it on the mini-project for the naive Bayes section)

### Setting up the data

There are five relevant datasets on the GroupLens website.

| Name | Movies | Ratings | User data | Release date |
| --- | --- | --- | --- | --- |
| New research | 27K | 20M | None | 2016 |
| Education (small) | 9K | 100K | None | 2018 |
| Education (large) | 58K | 27M | None | 2018 |
| Older (100K) | 1.7K | 100K| age, gender, occupation, zip | 1998 |
| Older (1M) | 4K  | 1M | age, gender, occupation, zip | 2003 |

The latter two include user demographics, which makes them useful for identifying who the raters are. The second and fourth are best if trying to look at changes in movie status across time. But the latter two also store their movie tables in a very different file format, so they need a separate reading step.
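A minimal sketch of how the two older movie tables might be read, assuming the usual layouts of those archives (`ml-1m/movies.dat` with `::` separators, `ml-100k/u.item` with `|` separators and the title in the second column). The helper names are hypothetical and the layouts should be checked against each archive's README before relying on this:

```python
import io
import pandas as pd

def read_ml1m_movies(path_or_buffer):
    """Read an ml-1m style movies file: movieId::title::genres, no header row."""
    # A multi-character separator needs the slower python parsing engine
    return pd.read_csv(path_or_buffer, sep='::', engine='python', header=None,
                       names=['movieId', 'title', 'genres'], encoding='latin-1')

def read_ml100k_movies(path_or_buffer):
    """Read an ml-100k style u.item file, keeping only the id and title columns."""
    return pd.read_csv(path_or_buffer, sep='|', header=None, usecols=[0, 1],
                       names=['movieId', 'title'], encoding='latin-1')

# Made-up sample rows standing in for the real files
sample_1m = io.StringIO("1::Toy Story (1995)::Animation|Children's|Comedy\n")
sample_100k = io.StringIO("1|Toy Story (1995)|01-Jan-1995||http://example.org|0|0|1\n")
print(read_ml1m_movies(sample_1m))
print(read_ml100k_movies(sample_100k))
```

With the real archives extracted, the calls would take paths such as `'ml-1m/movies.dat'` and `'ml-100k/u.item'` instead of the in-memory samples.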

The first thing to work out is what share of the 1,001 Movies each dataset contains. To do this, the 1,001 list needs to be joined to each dataset.


In [3]:
# Basic packages used
import pandas as pd
import requests, zipfile, io
from bs4 import BeautifulSoup

In [1]:
# Importing the ratings data
# First download and extract the files (there's a bunch so use a list and loop)
list_of_urls = ['http://files.grouplens.org/datasets/movielens/ml-20m.zip',
               'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip',
               'http://files.grouplens.org/datasets/movielens/ml-latest.zip',
               'http://files.grouplens.org/datasets/movielens/ml-100k.zip',
               'http://files.grouplens.org/datasets/movielens/ml-1m.zip']
for url in list_of_urls:
    response = requests.get(url)
    # Stop early if a download failed rather than trying to unzip an error page
    response.raise_for_status()
    z = zipfile.ZipFile(io.BytesIO(response.content))
    z.extractall()

In [5]:
# Importing the 1,001 list and converting it to a list
snob_url = 'https://1001films.fandom.com/wiki/The_List'
snob_text= requests.get(snob_url)
soup = BeautifulSoup(snob_text.content, 'html.parser')
basic_list = (soup.body.find_all('b'))
thousand_list = [item.text for item in basic_list]


In [44]:
# Here I get all three files into dataframes
# first, read the correct downloaded csvs (from separate subfolders)
new_movies = pd.read_csv('ml-20m/movies.csv', sep = ',', header = 0)
ed_small_movies = pd.read_csv('ml-latest-small/movies.csv', sep = ',', header = 0)
ed_large_movies = pd.read_csv('ml-latest/movies.csv', sep = ',', header = 0)

# The two older files use a different file format (no movies.csv) and require a different technique
#older_small_movies = pd.read_csv('ml-100k/movies.csv', sep = ',', header = 0)
#older_large_movies = pd.read_csv('ml-1m/movies.csv', sep = ',', header = 0)

# Converting the 1,001 list to a dataframe and dropping the header row
thousandone_movies = pd.DataFrame(thousand_list, columns = ['title']).drop(0)

# The following code can be modified to check they are all dataframes
#print(type(ed_small_movies))
# Looking at the head of all four
#print(thousandone_movies.head())
#print(new_movies.head())
#print(ed_small_movies.head())
#print(ed_large_movies.head())


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


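The hardest wrangling job was adjusting for naming discrepancies between files. I first left-joined each movie set to the 1,001 list to check how many titles matched (via a dropna command). The best match rate, for the large education set, was 60% (602 titles); the new research and small education sets matched 545 and 485 titles respectively.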

In [4]:
# Do a left join of each Grouplens set to the 1,001 set
new_movies_join = thousandone_movies.merge(new_movies, on='title', how='left')
ed_small_join = thousandone_movies.merge(ed_small_movies, on='title', how='left')
ed_large_join = thousandone_movies.merge(ed_large_movies, on='title', how='left')

#print(ed_small_join.head(50))
#print(ed_large_join.head(50))

# Demonstrating how many matched in each file
new_movies_join.dropna().info()
ed_small_join.dropna().info()
ed_large_join.dropna().info()

# Save the large join to CSV to inspect why titles don't match
#new_movies_join.to_csv('NewMoviesChecker.csv')
ed_large_join.to_csv('EdLargeChecker.csv')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 545 entries, 7 to 1175
Data columns (total 3 columns):
title      545 non-null object
movieId    545 non-null float64
genres     545 non-null object
dtypes: float64(1), object(2)
memory usage: 17.0+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 485 entries, 1 to 1221
Data columns (total 3 columns):
title      485 non-null object
movieId    485 non-null float64
genres     485 non-null object
dtypes: float64(1), object(2)
memory usage: 15.2+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 602 entries, 1 to 1222
Data columns (total 3 columns):
title      602 non-null object
movieId    602 non-null float64
genres     602 non-null object
dtypes: float64(1), object(2)
memory usage: 18.8+ KB
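Counting rows with `dropna().info()` works, but a left join repeats a list title whenever it matches more than one row on the right-hand side, so row counts can overstate matches. A hedged alternative is to count distinct matched titles using the `indicator` option of `merge`. The `match_rate` helper and the toy frames below are hypothetical, not part of the notebook:

```python
import pandas as pd

def match_rate(list_df, movies_df, key='title'):
    """Share of distinct list titles with at least one match in movies_df."""
    joined = list_df.merge(movies_df, on=key, how='left', indicator=True)
    # '_merge' is 'both' for matched rows and 'left_only' for unmatched ones
    matched = joined.loc[joined['_merge'] == 'both', key].nunique()
    return matched / list_df[key].nunique()

# Toy data standing in for the 1,001 list and a movies table
toy_list = pd.DataFrame({'title': ['Toy Story (1995)', 'Dekalog (1989)']})
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Toy Story (1995)'],
    'genres': ['Animation', 'Adventure', 'Animation'],
})
rate = match_rate(toy_list, toy_movies)
print(f'{rate:.0%} of list titles matched')
```

In the toy data the duplicated title produces two joined rows but is counted once, so one of the two list titles is reported as matched.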


Then I worked through wrangling the largest of the movie sets with the 1,001 list.

After comparing the first 50 missing values in the large education file, I worked out that most mismatches were due to slight formatting differences: the position of leading articles (A, The, Les and so on), punctuation, capitalization, or the year being off by one. Each of those could be corrected by string methods, which led to 883 matches (88%). Better but not good enough. (In reading up on how to do this via sentence similarity matching, I realized that a tokenizer is quicker.)
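The off-by-one year case is the one that simple string replacement handles least naturally. One possible approach, sketched here under the assumption that titles end in a four-digit year in parentheses, is to split the year off and compare it with a tolerance. The function names and example titles are made up for illustration:

```python
import re

# Title text followed by a four-digit year in parentheses at the end
TITLE_YEAR = re.compile(r'^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$')

def split_title_year(raw):
    """Split 'Title (1995)' into ('Title', 1995); year is None when absent."""
    m = TITLE_YEAR.match(raw)
    if m is None:
        return raw.strip(), None
    return m.group('title'), int(m.group('year'))

def same_movie(a, b, tolerance=1):
    """True when titles agree ignoring case and years differ by at most `tolerance`."""
    title_a, year_a = split_title_year(a)
    title_b, year_b = split_title_year(b)
    if title_a.lower() != title_b.lower():
        return False
    if year_a is None or year_b is None:
        return year_a == year_b
    return abs(year_a - year_b) <= tolerance

print(same_movie('Example Film (1960)', 'Example Film (1959)'))
print(same_movie('Example Film (1960)', 'Example Film (1983)'))
```

A tolerance of one year risks merging a film with a remake released the following year, so any matches made this way would be worth checking by hand.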

In [None]:
# Correcting for capitalization (make everything lowercase)

# Articles must be handled before lowercasing because otherwise trailing a's will be removed
# Correcting for article position in 1,001 (1,001 has capitalized articles at the start)
#thousandone_movies['title_edit'] = thousandone_movies['title'].str.replace('A ', '', 1)
#thousandone_movies['title_edit'] = thousandone_movies['title_edit'].str.replace('The ', '', 1)
#thousandone_movies['title_edit'] = thousandone_movies['title_edit'].str.replace('Les ', '', 1)
#thousandone_movies['title_edit'] = thousandone_movies['title_edit'].str.replace('Die ', '', 1)
#thousandone_movies['title_edit'] = thousandone_movies['title_edit'].str.replace('Das ', '', 1)
#thousandone_movies['title_edit'] = thousandone_movies['title_edit'].str.replace('La ', '', 1)
#thousandone_movies['title_edit'] = thousandone_movies['title_edit'].str.replace('Un ', '', 1)
#print(thousandone_movies.head(50))

# Correcting for article position in Grouplens (where they are put at the end, after a comma)
#ed_large_movies['title_edit'] = ed_large_movies['title'].str.replace(', A ', ' ', 1)
#ed_large_movies['title_edit'] = ed_large_movies['title_edit'].str.replace(', The ', ' ', 1)
#ed_large_movies['title_edit'] = ed_large_movies['title_edit'].str.replace(', Les ', ' ', 1)
#ed_large_movies['title_edit'] = ed_large_movies['title_edit'].str.replace(', Le ', ' ', 1)
#ed_large_movies['title_edit'] = ed_large_movies['title_edit'].str.replace(', Die ', ' ', 1)
#ed_large_movies['title_edit'] = ed_large_movies['title_edit'].str.replace(', Das ', ' ', 1)
#ed_large_movies['title_edit'] = ed_large_movies['title_edit'].str.replace(', La ', ' ', 1)
#ed_large_movies['title_edit'] = ed_large_movies['title_edit'].str.replace(', Un ', ' ', 1)
#print(ed_large_movies.head(50))

# Then remove some other errors
# Removing capitalization

#thousandone_movies['title_edit'] = thousandone_movies['title_edit'].str.lower()
#ed_large_movies['title_edit'] = ed_large_movies['title_edit'].str.lower()

# Correcting for punctuation errors
#thousandone_movies['title_edit'] = thousandone_movies['title_edit'].str.replace(',', '', 1)
#ed_large_movies['title_edit'] = ed_large_movies['title_edit'].str.replace(',', '', 1)
#thousandone_movies['title_edit'] = thousandone_movies['title_edit'].str.replace(':', '', 1)
#ed_large_movies['title_edit'] = ed_large_movies['title_edit'].str.replace(':', '', 1)

# Checking new match number
#ed_large_join = thousandone_movies.merge(ed_large_movies, 
#                                        on='title_edit', how='left')
# And now go through those files again to see what errors exist
#ed_large_join.to_csv('Joined_To_Check.csv')
#ed_large_movies.to_csv('GrouplensLargeTransformed.csv')
# Demonstrating how many matched in small file and large file
#ed_large_join.dropna().info()
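The chain of `str.replace` calls above removes the first occurrence of each article anywhere in the title, not only at the start or end. A tighter version could anchor the patterns with regular expressions. This is a sketch only: the article list, the `normalise_title` name and the assumption that a year in parentheses may follow the trailing article are mine, not taken from the data.

```python
import re
import string

ARTICLES = ('a', 'an', 'the', 'le', 'la', 'les', 'un', 'die', 'das')
ART = '|'.join(ARTICLES)
# ", The" style suffix sitting just before an optional "(year)" at the end
TRAILING = re.compile(r'^(?P<body>.*), (?:' + ART + r')(?P<rest>\s*(?:\(\d{4}\))?)$',
                      re.IGNORECASE)
# Article at the very start of the title, followed by whitespace
LEADING = re.compile(r'^(?:' + ART + r')\s+', re.IGNORECASE)
PUNCT = str.maketrans('', '', string.punctuation)

def normalise_title(title):
    """Drop a leading or trailing article, lower-case, and strip ASCII punctuation."""
    t = title.strip()
    m = TRAILING.match(t)
    if m:
        t = m.group('body') + m.group('rest')
    else:
        t = LEADING.sub('', t, count=1)
    t = t.lower().translate(PUNCT)
    # Collapse any doubled spaces left behind
    return ' '.join(t.split())

print(normalise_title('Godfather, The (1972)'))
print(normalise_title('The Godfather (1972)'))
```

Applied to a dataframe this would be something like `df['title'].map(normalise_title)`. Note that the article list also strips words such as "Die" from English titles; that is harmless as long as both sides of the join are normalised the same way.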


My basic string methods above led to an 88% match rate, but the NLTK package has a tokenizer that does even better. The tokenizer led to a 97% match rate (only 25 movies were mismatched). On inspection, only x had a problem.

In [1]:
# Here's the tokenizer version.
# import the right package
import nltk
# word_tokenize is called by name in my_tokenizer below, so it must be imported explicitly
from nltk import word_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
import string
# Make default stopwords a list of punctuation
stopwords = list(string.punctuation)
# Add a few English words of no use (the default English list contains too many words)
stopwords.append('the')
stopwords.append('and')
print(stopwords)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', 'the', 'and']


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jonathangerber/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


I make the tokenizer below and then apply it to all four datasets (three Grouplens and the 1,001 list). For ease I did the largest data set first, and the other two are in the next code cell.

In [47]:
# a line of test code for my tokenizer
#s = ['la voyage de la ( ) luns a trip to the moon','bread and milk . ,','la Voyage de la luns a trip to the moon']

# Define tokenizer
def my_tokenizer(title_list, stopwords):
    '''This function takes an iterable of title strings and a list of stopwords and returns,
    for each title, a list of lower-case word tokens excluding anything defined in stopwords'''
    # tokenize each title in lower case
    s_token = [word_tokenize(i.lower()) for i in title_list]
    # then filter out the stopwords
    s_filtered = []
    for sentence in s_token:
        s_filt = [w for w in sentence if w not in stopwords]
        s_filtered.append(s_filt)
    # finally sort each title's tokens by first character (the sort is stable,
    # so tokens sharing a first character keep their original order)
    s_sort = [sorted(item, key=lambda token: token[0]) for item in s_filtered]
    return s_sort

# Call my function on two datasets
thousandone_movies['title_tokens'] = my_tokenizer(thousandone_movies['title'], stopwords)
ed_large_movies['title_tokens'] = my_tokenizer(ed_large_movies['title'], stopwords)

# To make matching easy I turn them into a string (I kept the old list version in case I wanted
# to use a matching algorithm)
thousandone_movies['title_string'] = thousandone_movies['title_tokens'].apply(', '.join)
ed_large_movies['title_string'] = ed_large_movies['title_tokens'].apply(', '.join)
# Then I join them
ed_large_join = thousandone_movies.merge(ed_large_movies, 
                                        on='title_string', how='left')
ed_large_join.to_csv('TOkenized_To_Check.csv')
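The idea behind the join key is that sorting the tokens makes word order irrelevant, so a title with its article moved to the end yields the same key. The sketch below illustrates that with a plain regular expression standing in for `word_tokenize` (an assumption, the two tokenizers differ on contractions and similar cases) and a full alphabetical sort. It also shows that sorting on the first character alone leaves tokens with the same initial in their original order, which could give different keys for reordered titles. Whether that affects any titles in this data has not been checked here.

```python
import re

STOPWORDS = {'the', 'and'}

def title_key(title, stopwords=STOPWORDS):
    """Lower-case, split into alphanumeric tokens, drop stopwords, sort fully."""
    tokens = re.findall(r'\w+', title.lower())
    return ', '.join(sorted(t for t in tokens if t not in stopwords))

def sort_by_first_char(tokens):
    """Stable sort on the first character only, as a point of comparison."""
    return sorted(tokens, key=lambda token: token[0])

print(title_key('Godfather, The (1972)'))
print(title_key('The Godfather (1972)'))
# Same tokens in a different order: the full sort agrees, the first-character sort does not
print(sorted(['trip', 'to']) == sorted(['to', 'trip']))
print(sort_by_first_char(['trip', 'to']) == sort_by_first_char(['to', 'trip']))
```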

As can be seen below, the smaller datasets matched at 76% (small education, 764 titles) and 92% (new research, 923 titles). These are well up from the exact-title joins (485 and 545 titles), which suggests the routine is working well and that these sets do not contain all of the 1,001 movies.

In [46]:
# And do the other two as well
new_movies['title_tokens'] = my_tokenizer(new_movies['title'], stopwords)
new_movies['title_string'] = new_movies['title_tokens'].apply(', '.join)

ed_small_movies['title_tokens'] = my_tokenizer(ed_small_movies['title'], stopwords)
ed_small_movies['title_string'] = ed_small_movies['title_tokens'].apply(', '.join)

ed_small_join = thousandone_movies.merge(ed_small_movies, 
                                        on='title_string', how='left')
#ed_small_join.to_csv('TOkenized_To_Check.csv')
new_movies_join = thousandone_movies.merge(new_movies, 
                                        on='title_string', how='left')
#new_movies_join.to_csv('TOkenized_To_Check.csv')
# info() prints its own summary and returns None, which is why "None" follows each summary below
print(ed_small_join.dropna().info())
print(new_movies_join.dropna().info())


<class 'pandas.core.frame.DataFrame'>
Int64Index: 764 entries, 1 to 1221
Data columns (total 7 columns):
title_x           764 non-null object
title_tokens_x    764 non-null object
title_string      764 non-null object
movieId           764 non-null float64
title_y           764 non-null object
genres            764 non-null object
title_tokens_y    764 non-null object
dtypes: float64(1), object(6)
memory usage: 47.8+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 923 entries, 1 to 1177
Data columns (total 7 columns):
title_x           923 non-null object
title_tokens_x    923 non-null object
title_string      923 non-null object
movieId           923 non-null float64
title_y           923 non-null object
genres            923 non-null object
title_tokens_y    923 non-null object
dtypes: float64(1), object(6)
memory usage: 57.7+ KB
None
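The percentages quoted in the text can be reproduced from the row counts in the `info()` outputs. The sketch below assumes a denominator of 1,001 titles, which appears to be what the quoted rates use; the counts are copied from the outputs in this notebook, including the 974 rows for the large education set shown in the final cell.

```python
LIST_SIZE = 1001  # assumed denominator for the match rates

# (rows matched on exact title, rows matched on tokenized title)
matches = {
    'ed_small (ml-latest-small)': (485, 764),
    'new (ml-20m)': (545, 923),
    'ed_large (ml-latest)': (602, 974),
}

rates = {
    name: (round(100 * exact / LIST_SIZE, 1), round(100 * tokenized / LIST_SIZE, 1))
    for name, (exact, tokenized) in matches.items()
}

for name, (exact_pct, tokenized_pct) in rates.items():
    print(f'{name}: {exact_pct}% exact, {tokenized_pct}% tokenized')
```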


#### FROM HERE ON IS UNFINISHED

In [None]:
# Do some exploratory analyses of differences between 1,001 and non-1,001 movies (and The Dekalog and The Dark Knight), potentially by using demographic information
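One possible starting point for this comparison, sketched with made-up ratings: flag each rating by whether its movie is on the 1,001 list, then compare counts and mean ratings per group. The `canon_ids` set is hypothetical; in the notebook it could perhaps be built from the matched `movieId` values in `ed_large_join`.

```python
import pandas as pd

# Made-up ratings standing in for a ratings.csv (userId, movieId, rating)
ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating': [5.0, 3.0, 4.0, 2.0, 4.0],
})

# Hypothetical set of movieIds that matched the 1,001 list
canon_ids = {10}

# Label each rating by group, then summarise
ratings['group'] = ratings['movieId'].isin(canon_ids).map(
    {True: '1,001 list', False: 'other'})
summary = ratings.groupby('group')['rating'].agg(['count', 'mean'])
print(summary)
```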


### 4. Running a basic recommender model


In [None]:
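import turicreate as tc  # this library runs recommender systems

# Assumption: `actions` is the ratings table of the large education set
# (columns userId, movieId, rating, timestamp); the draft did not define it
actions = tc.SFrame.read_csv('ml-latest/ratings.csv')

training_data, validation_data = tc.recommender.util.random_split_by_user(actions, 'userId', 'movieId')
baseline_model = tc.recommender.create(training_data, 'userId', 'movieId')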
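The plan in the introduction calls for a 10% holdout set, which the split above does not specify. As far as I recall, `random_split_by_user` exposes an `item_test_proportion` argument for this, but that should be confirmed in the turicreate documentation. To make the idea concrete without depending on the library, here is a plain-Python sketch of a per-user holdout; the function name and the row format are illustrative only.

```python
import random
from collections import defaultdict

def split_by_user(rows, test_share=0.1, seed=0):
    """Hold out roughly `test_share` of each user's rows.

    `rows` is an iterable of (user, item, rating) tuples.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)
    train, test = [], []
    for user in sorted(by_user):
        user_rows = by_user[user][:]
        rng.shuffle(user_rows)
        # Users with few ratings contribute nothing to the test set
        n_test = int(len(user_rows) * test_share)
        test.extend(user_rows[:n_test])
        train.extend(user_rows[n_test:])
    return train, test

# Made-up data: 3 users with 20 ratings each
rows = [(user, item, 3.0) for user in range(3) for item in range(20)]
train, test = split_by_user(rows)
print(len(train), len(test))
```

Splitting within each user keeps every test user present in the training data, which matters for collaborative models that cannot score unseen users.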

In [None]:
#print(training_data.head())

In [None]:
ratings_model = tc.recommender.create(training_data, 'userId', 'movieId', target='rating')

### 5. Comparing the two models

In [None]:
comparer = tc.recommender.util.compare_models(validation_data, [baseline_model, ratings_model])
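Since the stated goal is better precision, it helps to be explicit about the metric when reading the comparison output. Below is a small sketch of precision at k for a single user, using the common convention of dividing by k. It is a generic definition for illustration, not a description of how turicreate computes its numbers.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that the user actually liked.

    `recommended` is an ordered list of items, `relevant` a set of items.
    """
    if k <= 0:
        return 0.0
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Made-up example: four recommendations, three relevant items
recommended = ['a', 'b', 'c', 'd']
relevant = {'a', 'c', 'x'}
for k in (1, 2, 4):
    print(k, precision_at_k(recommended, relevant, k))
```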

In [None]:
# References for turicreate:
# The basic user guide             https://apple.github.io/turicreate/docs/userguide/recommender/
# The recommender documentation    https://apple.github.io/turicreate/docs/api/turicreate.toolkits.recommender.html

In [48]:

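# Match count for the large education set after the tokenized join
# (info() prints its summary and returns None, hence the trailing "None")
print(ed_large_join.dropna().info())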


<class 'pandas.core.frame.DataFrame'>
Int64Index: 974 entries, 1 to 1222
Data columns (total 7 columns):
title_x           974 non-null object
title_tokens_x    974 non-null object
title_string      974 non-null object
movieId           974 non-null float64
title_y           974 non-null object
genres            974 non-null object
title_tokens_y    974 non-null object
dtypes: float64(1), object(6)
memory usage: 60.9+ KB
None
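For the titles that remain unmatched, the sentence-similarity idea mentioned earlier could be tried with the standard library's `difflib`, which ranks candidate titles by string similarity. This is a sketch under assumptions: the helper name, the cutoff and the example titles are made up, and any suggestions would still need checking by hand.

```python
import difflib

def suggest_matches(unmatched, candidates, n=3, cutoff=0.6):
    """For each unmatched title, list up to `n` similar candidate titles."""
    # Compare in lower case but report candidates in their original form
    lowered = {c.lower(): c for c in candidates}
    suggestions = {}
    for title in unmatched:
        close = difflib.get_close_matches(title.lower(), list(lowered),
                                          n=n, cutoff=cutoff)
        suggestions[title] = [lowered[c] for c in close]
    return suggestions

# Made-up titles for illustration
unmatched = ['Example Film (1960)']
candidates = ['Example Film (1959)', 'Another Movie (1995)']
print(suggest_matches(unmatched, candidates))
```

Comparing every unmatched title against tens of thousands of candidates is slow but feasible for a few dozen leftovers.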
