# Improving movie recommendations for movie snobs

## Introduction

This project examines how to improve the precision of recommender systems for movie snobs.

I will go through the following tasks:
1. Collate the 1,001 Movies list (this will involve some wrangling)
2. Select a dataset that has most of the 1,001 Movies in it
3. Do some baseline modeling
4. Run a baseline model and then a model with 1,001 list membership as a topic/rating.
5. Compare the models (with a 10% holdout set)
6. Try various models until we can predict the difference between The Dark Knight and The Dekalog (good blog post title right there)

Other ideas:
1. Work out how long it takes for a movie to become canonical, then use this as a bias adjustment.
2. Use the IMDb API to find the difference between reviews of The Dekalog and The Dark Knight (base it on the mini-project for the naive Bayes section).


### Setting up the data

There are five relevant datasets on the GroupLens website.

| Name | Movies | Ratings | User data | Release date |
| --- | --- | --- | --- | --- |
| New research | 27K | 20M | None | 2016 |
| Education (small) | 9K | 100K | None | 2018 |
| Education (large) | 58K | 27M | None | 2018 |
| Older (100K) | 1.7K | 100K | age, gender, occupation, zip | 1998 |
| Older (1M) | 4K | 1M | age, gender, occupation, zip | 2003 |

The last two include user demographics, so they are useful for identifying who is doing the rating. The second and fourth are best for looking at changes in a movie's status over time.

The first thing to work out is what percentage of the 1,001 Movies each dataset contains.
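A minimal sketch of that percentage check, using exact title matching on toy data. The helper name `coverage_pct` and both toy datasets are made up for illustration; exact matching will undercount wherever the files spell a title differently.

```python
import pandas as pd


def coverage_pct(list_titles, dataset_titles):
    """Percentage of list_titles that appear verbatim in dataset_titles."""
    wanted = pd.Series(list_titles)
    return 100 * wanted.isin(set(dataset_titles)).mean()


# Toy stand-in for the 1,001 list (titles taken from the list's first rows)
snob_titles = [
    "The Great Train Robbery (1903)",
    "The Birth of a Nation (1915)",
    "Les Vampires (1915)",
    "Intolerance (1916)",
]

# Hypothetical title columns standing in for two MovieLens movie tables
toy_datasets = {
    "small": ["Intolerance (1916)", "Toy Story (1995)"],
    "large": ["Intolerance (1916)", "Les Vampires (1915)", "Jumanji (1995)"],
}

coverage = {name: coverage_pct(snob_titles, titles)
            for name, titles in toy_datasets.items()}
print(coverage)
```

With the real data, the same helper could be called once per dataset on the `title` column of each `movies.csv`.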


In [22]:
import pandas as pd
import requests, zipfile, io
from bs4 import BeautifulSoup
# Importing the ratings data
# First download and extract the files (there are several, so use a list and loop)
list_of_urls = ['http://files.grouplens.org/datasets/movielens/ml-20m.zip',
               'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip',
               'http://files.grouplens.org/datasets/movielens/ml-latest.zip',
               'http://files.grouplens.org/datasets/movielens/ml-100k.zip',
               'http://files.grouplens.org/datasets/movielens/ml-1m.zip']
for url in list_of_urls:
    response = requests.get(url)
    response.raise_for_status()  # stop early if a download failed
    z = zipfile.ZipFile(io.BytesIO(response.content))
    z.extractall()


In [84]:
# Importing the 1,001 list and converting it to what we want
snob_url = 'https://1001films.fandom.com/wiki/The_List'
snob_text = requests.get(snob_url)
soup = BeautifulSoup(snob_text.content, 'html.parser')
# The titles are pulled from the page's <b> (bold) tags
basic_list = soup.body.find_all('b')
thousand_list = [item.text for item in basic_list]

In [88]:
# Here I get all three lists into dataframes
# First, read the correct downloaded CSVs (from separate subfolders)
small_set_movies = pd.read_csv('ml-latest-small/movies.csv', sep=',', header=0)
large_set_movies = pd.read_csv('ml-20m/movies.csv', sep=',', header=0)
# Convert the 1,001 list to a dataframe and drop the header row
thousandone_movies = pd.DataFrame(thousand_list, columns=['title']).drop(0)

# The following code can be modified to check they are all dataframes
#print(type(small_set_movies))
# Looking at head of all three
print(thousandone_movies.head())
print(small_set_movies.head())
print(large_set_movies.head())


                                               title
1  A Trip to the Moon (Le Voyage Dans La Lune) (1...
2                     The Great Train Robbery (1903)
3                       The Birth of a Nation (1915)
4                                Les Vampires (1915)
5                                 Intolerance (1916)
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   movieId                               title  \
0        1                    Toy Sto

The hardest part will probably be adjusting for naming discrepancies between the files. The plan is to join the largest sets to the smallest, check how many titles match, and keep all the unmatched rows as well. Any joined row that ends up with a movieId is a correct match.
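A rough sketch of one way such a join could work, not a finished solution. It assumes two kinds of discrepancy: MovieLens titles often move a leading article to the end ("Birth of a Nation, The (1915)"), and capitalization may differ. The helper `title_key`, the toy rows, and the movieId values are all hypothetical.

```python
import re

import pandas as pd


def title_key(title):
    """Build a join key: lower-cased name with any trailing article
    moved back to the front, plus the four-digit year if one is present."""
    year_match = re.search(r"\((\d{4})\)\s*$", title)
    year = year_match.group(1) if year_match else ""
    name = re.sub(r"\s*\(\d{4}\)\s*$", "", title).strip()
    article = re.match(r"^(.*), (The|A|An)$", name)
    if article:
        name = f"{article.group(2)} {article.group(1)}"
    return f"{name.lower()}|{year}"


# Toy stand-ins: a slice of the 1,001 list and a made-up MovieLens-style table
snob = pd.DataFrame({"title": ["The Birth of a Nation (1915)",
                               "Les Vampires (1915)",
                               "Intolerance (1916)"]})
movies = pd.DataFrame({"movieId": [101, 102, 103],  # hypothetical ids
                       "title": ["Birth of a Nation, The (1915)",
                                 "Intolerance (1916)",
                                 "Toy Story (1995)"]})

snob["key"] = snob["title"].map(title_key)
movies["key"] = movies["title"].map(title_key)

# A left join keeps every list title; indicator=True records which rows matched
joined = snob.merge(movies[["movieId", "key"]], on="key", how="left", indicator=True)
matched = joined[joined["_merge"] == "both"]
unmatched = joined[joined["_merge"] == "left_only"]
print(f"{len(matched)} of {len(joined)} titles matched")
print(unmatched["title"].tolist())
```

Keeping the year in the key avoids merging different films that share a name. Titles with alternate-language names in parentheses would still need separate handling.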

In [None]:
# TODO: joining


## FROM HERE ON IS UNFINISHED

In [None]:
# FROM
# There are three main datasets: one for research (20M ratings), plus a small (9K movies) and a large (58K movies) one for development
# Files in the dataset:
#   genome_score
#   genome_tags
#   links
#   movies
#   ratings
#   tags
# Working out how many of the 1,001 are in the GroupLens sets
# 20M has 27,000 movies
#

### Running a basic model


In [None]:
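import turicreate as tc  # Turi Create provides the recommender toolkit

# Assumption: `actions` is the ratings table of the chosen dataset (the small set here)
actions = tc.SFrame.read_csv('ml-latest-small/ratings.csv')
training_data, validation_data = tc.recommender.util.random_split_by_user(actions, 'userId', 'movieId')
baseline_model = tc.recommender.create(training_data, 'userId', 'movieId')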

In [None]:
print(training_data.head())

In [None]:
ratings_model = tc.recommender.create(training_data, 'userId', 'movieId', target='rating')

### Comparing the two models

In [None]:
comparer = tc.recommender.util.compare_models(validation_data, [baseline_model, ratings_model])
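Since the goal is better precision, it may help to see what a precision-style metric measures. The following is a plain-Python sketch of precision at k for a single user. It is not Turi Create's implementation, and the movieId values are made up.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that the user actually liked.

    recommended: ranked list of item ids, best first
    relevant:    set of item ids the user liked in the held-out data
    """
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)


# Hypothetical ranked recommendations and held-out likes for one user
recommended = [50, 296, 318, 1, 2]
relevant = {296, 318, 858}

p_at_3 = precision_at_k(recommended, relevant, 3)
print(round(p_at_3, 3))
```

This version divides by the number of items actually returned; some definitions divide by k even when fewer than k items are recommended.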

References:
1. The basic user guide: https://apple.github.io/turicreate/docs/userguide/recommender/
2. The recommender documentation: https://apple.github.io/turicreate/docs/api/turicreate.toolkits.recommender.html