# Improving movie recommendations for movie snobs (draft)

## Introduction

This project examines how to improve the precision of recommender systems for movie snobs.

I will go through the following tasks:
1. Collate the 1,001 Movies list
2. Select a dataset that has most of the 1,001 Movies in it
3. Do some exploratory analyses of differences between 1,001 and non-1,001 movies (and The Dekalog and The Dark Knight), potentially by using demographic information
4. Do some baseline recommender modeling
5. Run a baseline model and then a model with 1,001 as a topic/rating.
6. Compare the models (with a 10% holdout set)
7. Try various models until we can predict the difference between 1,001 and non-1,001 movies 

Other ideas:
1. Work out how long it takes for something to be canonical and then use this as a bias adjustment.
2. Use the Rotten Tomatoes API to find differences between reviews of The Dekalog and The Dark Knight (based on the mini-project for the naive Bayes section).

### Setting up the data

There are five relevant datasets on the GroupLens website.

| Name | Movies | Ratings | Userdata | Release date |
| --- | --- | --- | --- | --- |
| New research | 27K | 20M | None | 2016 |
| Education (small) | 9K | 100K | None | 2018 |
| Education (large) | 58K | 27M | None | 2018 |
| Older (100K) | 1.7K | 100K| age, gender, occupation, zip | 1998 |
| Older (1M) | 4K  | 1M | age, gender, occupation, zip | 2003 |

The latter two are useful for identifying clusters of people but are a bit small, while the second and fourth are best if trying to identify features of movies themselves.

The 1,001 movie file had 1,222 entries due to additions and deletions across the different editions.


In [31]:
# Basic packages used
import pandas as pd
import numpy as np


In [None]:
#import requests, zipfile, io
# Importing the ratings data
# First download and extract the files (there's a bunch so use a list and loop)
#list_of_urls = ['http://files.grouplens.org/datasets/movielens/ml-latest.zip']
#for url in list_of_urls:
#    ratings_small_file = requests.get(url)
#    z = zipfile.ZipFile(io.BytesIO(ratings_small_file.content))
#    z.extractall()

In [2]:
import requests
from bs4 import BeautifulSoup

# Import the 1,001 list page and convert its bold entries to a list of titles
snob_url = 'https://1001films.fandom.com/wiki/The_List'
snob_text = requests.get(snob_url)
soup = BeautifulSoup(snob_text.content, 'html.parser')
basic_list = soup.body.find_all('b')
thousand_list = [item.text for item in basic_list]

In [3]:
# Here I get all three files into dataframes
# first, read the correct downloaded csvs (from separate subfolders)
ed_large_movies = pd.read_csv('ml-latest/movies.csv', sep = ',', header = 0)

# The two older files have a different compression structure and require a different technique
#older_small_movies = pd.read_csv('ml-100k/movies.csv', sep = ',', header = 0)
#older_large_movies = pd.read_csv('ml-1m/movies.csv', sep = ',', header = 0)

# Converting the 1,001 list to a dataframe and dropping the header row
thousandone_movies = pd.DataFrame(thousand_list, columns = ['title']).drop(0)

# The following code can be modified to check they are all dataframes
print(thousandone_movies.head())
print(ed_large_movies.head())
thousandone_movies.info()


                                               title
1  A Trip to the Moon (Le Voyage Dans La Lune) (1...
2                     The Great Train Robbery (1903)
3                       The Birth of a Nation (1915)
4                                Les Vampires (1915)
5                                 Intolerance (1916)
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1222 entries, 1 to 1222
Data columns 

The hardest wrangling job was reconciling naming discrepancies in titles between files. I first tried exact matching by joining the movie set to the 1,001 list, which gave a match rate of 49%. Some simple string manipulations brought it up to 72%. Last, I realized that a tokenizer would account for many of the remaining problems, and this led to an 80% match rate (974 out of 1,222). This was acceptable for exploratory data analysis.

The nltk package has a tokenizer that helped match most of the titles.
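
The effect of tokenizing on the match rate can be sketched in plain Python. This is a minimal illustration, not the code that produced the figures above: the sample titles and the `normalize` and `match_rate` helpers are hypothetical, and a regular expression stands in for the nltk tokenizer.

```python
import re

# Hypothetical sample titles: MovieLens moves a leading article to the end of the title
snob_titles = ["Les Vampires (1915)", "The Birth of a Nation (1915)", "Intolerance (1916)"]
movielens_titles = ["Vampires, Les (1915)", "Birth of a Nation, The (1915)", "Toy Story (1995)"]

# Words of no use for matching (punctuation is already dropped by the regex below)
STOPWORDS = {"the", "and"}

def normalize(title):
    """Lower-case a title, split it into word tokens, drop stopwords, and sort the rest."""
    tokens = re.findall(r"\w+", title.lower())
    kept = [token for token in tokens if token not in STOPWORDS]
    return " ".join(sorted(kept))

def match_rate(titles, candidates, key=lambda title: title):
    """Share of titles whose key also appears among the candidate keys."""
    candidate_keys = {key(candidate) for candidate in candidates}
    hits = sum(1 for title in titles if key(title) in candidate_keys)
    return hits / len(titles)

exact_rate = match_rate(snob_titles, movielens_titles)
token_rate = match_rate(snob_titles, movielens_titles, key=normalize)
print(f"exact: {exact_rate:.0%}, tokenized: {token_rate:.0%}")
```

Sorting the tokens makes word order irrelevant, which is what lets "Les Vampires" meet "Vampires, Les". The cost is that two different titles built from the same words would collide.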

In [6]:
# Here's the tokenizer version.
# import the right package
#import nltk
# sometimes the line below needed to be included (not sure why)
#from nltk import word_tokenize
#nltk.download('punkt')
#import string
# Make default stopwords a list of punctuation
#stopwords = list(string.punctuation)
# Add a few English words of no use (the default English list contains too many words)
#stopwords.append('the')
#stopwords.append('and')
#print(stopwords)

# a line of test code for my tokenizer
#s = ['la voyage de la ( ) luns a trip to the moon','bread and milk . ,','la Voyage de la luns a trip to the moon']

# Define tokenizer
#def my_tokenizer(title_list, stopwords):
#    '''This function takes a string and a list of stopwords and returns a list of word tokens
#    excluding anything defined in stopwords'''
#    # tokenize a lower-case string
#    s_token = [word_tokenize(i.lower()) for i in title_list]
#    # then filter out the stopwords
#    s_filtered=[]
#    for sentence in s_token:
#        s_filt = [w for w in sentence if not w in stopwords]
#        s_filtered.append(s_filt)
#    # finally sort them
#    s_sort = [sorted(item, key = lambda item:item[0]) for item in s_filtered] 
#    return s_sort

# Call my function on two datasets
#thousandone_movies['title_tokens'] = my_tokenizer(thousandone_movies['title'], stopwords)
#ed_large_movies['title_tokens'] = my_tokenizer(ed_large_movies['title'], stopwords)

# To make matching easy I turn them into a string (I kept the old list version in case I wanted
# to use a matching algorithm)
#thousandone_movies['title_string'] = thousandone_movies['title_tokens'].apply(', '.join)
#ed_large_movies['title_string'] = ed_large_movies['title_tokens'].apply(', '.join)

Then I found a string-matching library called fuzzywuzzy that handles the fuzzy title matching directly.

In [89]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

choicesed = ed_large_movies['title']
titlestomatch = thousandone_movies['title']

testtitles = titlestomatch.iloc[0:6]
testtitle = titlestomatch.iloc[3]
print(testtitles)
def Matcher(title, choices):
    '''Return the closest title in choices and its match score (0 to 100).'''
    # choices is a pandas Series, so extractOne returns (match, score, index)
    title_match, percent_match, _ = process.extractOne(title, choices)
    return title_match, percent_match

# Store the result for one test title as a single row
# (a[0] = b would add a column named 0 rather than a row)
a = pd.DataFrame([Matcher(testtitle, choicesed)],
                 columns=['title_suggestion', 'percent_match'])
print(a)


1    A Trip to the Moon (Le Voyage Dans La Lune) (1...
2                       The Great Train Robbery (1903)
3                         The Birth of a Nation (1915)
4                                  Les Vampires (1915)
5                                   Intolerance (1916)
6    The Cabinet of Dr. Caligari (Das Kabinett des ...
Name: title, dtype: object
       title_suggestion  percent_match
0  Vampires, Les (1915)             95


In [85]:
# Apply the matcher to every test title and collect the results as rows
matches = [Matcher(title, choicesed) for title in testtitles]
b = pd.DataFrame(matches, columns=['suggested_title', 'percent_match'],
                 index=testtitles.index)
print(b)
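
Fuzzy scores only become useful once there is a cut-off below which a suggestion is rejected. The sketch below illustrates that idea with `difflib` from the Python standard library as a stand-in for fuzzywuzzy, so its scores (0 to 1) will not equal fuzzywuzzy's (0 to 100). The `best_match` helper, the `min_score` threshold of 0.6, and the sample titles are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def best_match(title, choices, min_score=0.6):
    """Return (closest choice, similarity), or (None, similarity) when below min_score."""
    scored = [
        (SequenceMatcher(None, title.lower(), choice.lower()).ratio(), choice)
        for choice in choices
    ]
    score, choice = max(scored)
    if score < min_score:
        # Too dissimilar: treat the title as missing from the dataset
        return None, score
    return choice, score

# Hypothetical candidate titles in the MovieLens naming style
choices = ["Vampires, Les (1915)", "Toy Story (1995)", "Intolerance (1916)"]
for title in ["Intolerance (1916)", "Les Vampires (1915)", "Metropolis (1927)"]:
    print(title, "->", best_match(title, choices))
```

Returning `None` for weak matches keeps titles that are genuinely absent from the dataset from being paired with an unrelated film. The right threshold would need checking by hand against a sample of the 1,222 titles.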

### Starting some exploratory data analysis

First, let's review the variables we have for each movie. There are separate tables for movies (movieId, title, genres), movie ratings (userId, movieId, rating, timestamp), tags (userId, movieId, tag, timestamp), and the tag genome (a machine-learned matrix giving the relevance of different keyword aggregates for each movie). These tables can be joined together, but it will be easier to work with each table separately and apply the 1,001 list as a feature to each. The first step is a descriptive analysis of each table.
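
One way to apply the 1,001 list as a feature is to add a boolean flag to each movie row, along with the release year parsed from the title. The sketch below uses plain dictionaries in the shape of the movies table. The `release_year` and `add_features` helpers, the `in_1001` column name, and the list of matched titles are hypothetical, and it assumes the 1,001 titles have already been matched to MovieLens titles.

```python
import re

# MovieLens titles end with the release year in parentheses, e.g. "Toy Story (1995)"
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")

def release_year(title):
    """Return the four-digit year in trailing parentheses, or None if there is none."""
    found = YEAR_PATTERN.search(title)
    return int(found.group(1)) if found else None

def add_features(movies, matched_titles):
    """Return copies of the movie rows with release_year and in_1001 added."""
    matched = set(matched_titles)
    return [
        {**row,
         "release_year": release_year(row["title"]),
         "in_1001": row["title"] in matched}
        for row in movies
    ]

# Hypothetical rows in the shape of the movies table
movies = [
    {"movieId": 1, "title": "Toy Story (1995)", "genres": "Adventure|Animation|Children|Comedy|Fantasy"},
    {"movieId": 2, "title": "Jumanji (1995)", "genres": "Adventure|Children|Fantasy"},
]
featured = add_features(movies, matched_titles=["Toy Story (1995)"])
for row in featured:
    print(row["title"], row["release_year"], row["in_1001"])
```

Because the ratings and tags tables share `movieId` with the movies table, the same flag can be carried over to them with a join on that key.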

In [None]:
# Ratings
# Years

# Year of release to rating band
# Proportion of year of release to rating band
# Number of ratings by year
# Number of ratings since year of release

#Genre
# A simple split of all genres
# Number of tags

#Tags
#Proportion of (Genomic) tags since year of release
#Proportion of (Genomic) tags by number of ratings

# Letter of alphabet in title (don't laugh, most good bands started with S in the 90s)

In [None]:
# Do some exploratory analyses of differences between 1,001 and non-1,001 movies (and The Dekalog and The Dark Knight), potentially by using demographic information


### 4. Running a basic recommender model


In [None]:
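import turicreate as tc  # this library runs recommender systems

# Load the ratings (userId, movieId, rating, timestamp) as an SFrame;
# the path assumes ratings.csv sits next to movies.csv in ml-latest/
actions = tc.SFrame.read_csv('ml-latest/ratings.csv')

training_data, validation_data = tc.recommender.util.random_split_by_user(actions, 'userId', 'movieId')
baseline_model = tc.recommender.create(training_data, 'userId', 'movieId')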

In [None]:
#print(training_data.head())

In [None]:
ratings_model = tc.recommender.create(training_data, 'userId', 'movieId', target='rating')

### 5. Comparing the two models

In [None]:
comparer = tc.recommender.util.compare_models(validation_data, [baseline_model, ratings_model])
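
Since the goal is better precision, it helps to be explicit about what is being measured. The following is a minimal sketch of precision at k for a single user. It is not turicreate's implementation, and the `precision_at_k` helper and the movieIds are hypothetical.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the user's held-out set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Hypothetical movieIds: a ranked recommendation list and one user's held-out movies
recommended = [50, 318, 858, 1, 260]
held_out = {318, 260, 1221}
print(precision_at_k(recommended, held_out, k=5))
```

Averaging this value over all users in the validation set gives one number per model. The same calculation restricted to 1,001 movies would show whether a model serves movie snobs any better.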

References for turicreate:
1. The basic user guide: https://apple.github.io/turicreate/docs/userguide/recommender/
2. The recommender documentation: https://apple.github.io/turicreate/docs/api/turicreate.toolkits.recommender.html

In [48]:

print(ed_large_join.dropna().info())


<class 'pandas.core.frame.DataFrame'>
Int64Index: 974 entries, 1 to 1222
Data columns (total 7 columns):
title_x           974 non-null object
title_tokens_x    974 non-null object
title_string      974 non-null object
movieId           974 non-null float64
title_y           974 non-null object
genres            974 non-null object
title_tokens_y    974 non-null object
dtypes: float64(1), object(6)
memory usage: 60.9+ KB
None
