
# This notebook cleans and combines the MovieLens data with a canonical movie list.

In [5]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import zipfile, io

# Make the canonical list
# Scrape the "1,001 films" list page and collect the titles in a Python list
snob_url = 'https://1001films.fandom.com/wiki/The_List'
snob_text = requests.get(snob_url)
soup = BeautifulSoup(snob_text.content, 'html.parser')
basic_list = soup.body.find_all('b')  # every bold element in the page body
thousand_list = [item.text for item in basic_list]
# Convert to a DataFrame and drop row 0
thousandone_movies = pd.DataFrame(thousand_list, columns=['title']).drop(0)
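
To see what the bold-tag extraction does without hitting the network, here is a minimal offline sketch. It swaps BeautifulSoup for the standard library's `html.parser`, and the HTML snippet is made up for illustration; the real page layout may differ. The assumption that the first bold element is a heading rather than a film is only a guess at why the cell above drops row 0.

```python
from html.parser import HTMLParser

class BoldTextCollector(HTMLParser):
    """Collect the text found inside each <b>...</b> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <b> tags are currently open
        self._buffer = []    # text pieces of the current bold element
        self.items = []      # finished bold strings, in page order

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.items.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

# Hypothetical stand-in for the wiki page (not the real markup)
sample_html = """
<body>
  <p><b>The List</b></p>
  <ol>
    <li><b><a href="#">The Great Train Robbery (1903)</a></b></li>
    <li><b>Les Vampires (1915)</b> <i>note</i></li>
  </ol>
</body>
"""

collector = BoldTextCollector()
collector.feed(sample_html)
collector.close()

# Assumed: the first bold element is a heading, so skip it
titles = collector.items[1:]
print(titles)
```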


In [7]:
# Get the MovieLens dataset
# Download and extract the archive (the loop lets more archives be added to the list)
list_of_urls = ['http://files.grouplens.org/datasets/movielens/ml-latest.zip']
for url in list_of_urls:
    response = requests.get(url)
    z = zipfile.ZipFile(io.BytesIO(response.content))
    z.extractall()

# Load the movie titles into a DataFrame
ed_large_movies = pd.read_csv('ml-latest/movies.csv', sep=',', header=0)
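
If only one file from the archive is needed, it can be read straight from the in-memory zip instead of extracting everything to disk. The sketch below shows the idea with the standard library only. It builds a tiny stand-in archive so it runs offline; the ids and genres in it are made up, and only the member path `ml-latest/movies.csv` is taken from the cell above.

```python
import csv
import io
import zipfile

# Build a tiny stand-in archive in memory (ids and genres are invented)
rows = [
    ["movieId", "title", "genres"],
    [1, "Birth of a Nation, The (1915)", "Drama"],
    [2, "Police (1916)", "Comedy"],
]
text = io.StringIO()
csv.writer(text).writerows(rows)

archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as zf:
    zf.writestr("ml-latest/movies.csv", text.getvalue())

# Read one member directly from the archive, without extractall()
with zipfile.ZipFile(io.BytesIO(archive_bytes.getvalue())) as zf:
    with zf.open("ml-latest/movies.csv") as raw:
        reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
        movies = list(reader)

print(movies[0]["title"])
```

Titles that contain commas are quoted in the CSV, so a CSV-aware reader is needed rather than a plain `split(',')`.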


##### The data needs several kinds of cleaning.
First, extract the release year from each title string.

In [11]:
# Create columns of movie years in each database
# Make sure the titles don't have trailing spaces
thousandone_movies['title'] = thousandone_movies['title'].str.rstrip()
ed_large_movies['title'] = ed_large_movies['title'].str.rstrip()
# Then slice out the year (it sits in parentheses at the end of each title)
thousandone_movies['year'] = thousandone_movies['title'].str[-5:-1]
ed_large_movies['year'] = ed_large_movies['title'].str[-5:-1]

# Then convert these strings to numbers (there is one title missing a year!)
# Define a conversion function
def ConvertYear(value):
    '''Convert an integer string to an int; return 0 for anything else.'''
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0
# Then apply it to the year columns of both thousandone_movies and ed_large_movies
thousandone_movies['year'] = thousandone_movies['year'].apply(ConvertYear)
ed_large_movies['year'] = ed_large_movies['year'].apply(ConvertYear)
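
The fixed slice relies on every title ending in exactly `(YYYY)`. As an alternative sketch (not what the notebook does), a regular expression can check for that pattern explicitly and fall back to 0 when it is absent. The function name `extract_year` is hypothetical; the sample titles come from the lists in this notebook.

```python
import re

# Four digits in parentheses at the very end of the title (trailing spaces allowed)
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the trailing release year as an int, or 0 if the title has none."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else 0

for sample in ["Intolerance (1916)", "Buffalo '66 (1998)", "Dog Star Man"]:
    print(sample, "->", extract_year(sample))
```

Like `ConvertYear`, this maps a title without a year to 0, so the later year filter behaves the same way for those rows.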


Then define a custom function that compares each canonical title against the MovieLens titles in a three-year window (the release year plus or minus one) and returns the best string match.

In [15]:
!pip install fuzzywuzzy 
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# Specify the matching function (we only need one of the outputs)
def Matcher(title, choices):
    title_match, percent_match, match3 = process.extractOne(title, choices)
    return title_match
# And here's a version that scores with the token-sort ratio (ignores word order)
def Matcher_token(title, choices):
    title_match, percent_match, match3 = process.extractOne(
        title, choices, scorer=fuzz.token_sort_ratio)
    return title_match

# Define a filter that returns the candidate titles released within +/-1 year
def YearFilter(year):
    years = [year-1, year, year+1]
    return ed_large_movies[ed_large_movies.year.isin(years)].title

# Run the token-sort matcher over the filtered candidates for each canonical title
for index, row in thousandone_movies.iterrows():
    # Get the candidate titles for this row's year window
    targets = YearFilter(row.year)
    # Store the best match in a new column
    thousandone_movies.loc[index,'return_match'] = Matcher_token(row.title, targets)
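
The idea behind the year window and the token-sort scorer can be sketched without fuzzywuzzy. The version below uses the standard library's `difflib` as a stand-in, so its scores run from 0 to 1 and will not equal fuzzywuzzy's. The names `token_sort_ratio`, `best_match`, and `catalogue` are hypothetical; the titles are taken from the output shown in this notebook.

```python
import re
from difflib import SequenceMatcher

def token_sort_ratio(a, b):
    """Similarity in [0, 1] after lowercasing, dropping punctuation, and sorting words."""
    def normalise(text):
        return " ".join(sorted(re.findall(r"\w+", text.lower())))
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def best_match(title, year, catalogue):
    """Return (candidate, score) for the best entry within +/-1 year, or None if there is none."""
    window = {year - 1, year, year + 1}
    scored = [(token_sort_ratio(title, cand), cand)
              for cand, cand_year in catalogue if cand_year in window]
    if not scored:
        return None
    score, cand = max(scored)
    return cand, score

catalogue = [
    ("Birth of a Nation, The (1915)", 1915),
    ("Vampires, Les (1915)", 1915),
    ("Police (1916)", 1916),
    ("The Great Train Robbery (1903)", 1903),
]

match, score = best_match("The Birth of a Nation (1915)", 1915, catalogue)
print(match, score)
```

Sorting the words before comparing is what lets "The Birth of a Nation" line up with MovieLens's "Birth of a Nation, The". Restricting candidates to the year window keeps same-named films from other decades out of the comparison and cuts the number of comparisons per title.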






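Now hand-code the mismatches. The fuzzy matcher does not always pick the right film: row 5 below, for example, pairs Intolerance (1916) with Police (1916).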

In [18]:
thousandone_movies.head()

Unnamed: 0,title,year,return_match
1,A Trip to the Moon (Le Voyage Dans La Lune) (1...,1902,"Trip to the Moon, A (Voyage dans la lune, Le) ..."
2,The Great Train Robbery (1903),1903,The Great Train Robbery (1903)
3,The Birth of a Nation (1915),1915,"Birth of a Nation, The (1915)"
4,Les Vampires (1915),1915,"Vampires, Les (1915)"
5,Intolerance (1916),1916,Police (1916)
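
One way to find rows like the last one before hand-coding is to score each pair after the fact and flag the weak ones. The sketch below is an assumption-laden illustration, not the notebook's method: it uses a simple word-overlap (Jaccard) score, and the 0.6 cut-off is an arbitrary value chosen for this example. The pairs are the untruncated rows from the table above.

```python
import re

def token_overlap(a, b):
    """Jaccard overlap of the lowercase word tokens in two titles (0 to 1)."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# (canonical title, matched MovieLens title)
pairs = [
    ("The Great Train Robbery (1903)", "The Great Train Robbery (1903)"),
    ("The Birth of a Nation (1915)", "Birth of a Nation, The (1915)"),
    ("Les Vampires (1915)", "Vampires, Les (1915)"),
    ("Intolerance (1916)", "Police (1916)"),
]

THRESHOLD = 0.6  # assumed cut-off, tune by inspecting real data
needs_review = [title for title, match in pairs
                if token_overlap(title, match) < THRESHOLD]
print(needs_review)
```

A low score only marks a row for review. A correct match to a longer title with a subtitle would also score low, so the flagged rows still need a human decision.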


In [2]:

Intolerance (1916)
Broken Blossoms (1919)
Häxan (1923)
Sunrise (1927)
The Unknown (1927)
A Throw of Dice (Prapancha Pash) (1929)
Tabu (1931)
The Vampire (Vampyr) (1932)
Scarface: The Shame of a Nation (1932)
Midnight Song (Ye Ban Ge Sheng) (1937)
Henry V (1944)
The Battle of San Pietro (1945)
Gun Crazy (1949)
Sunset Blvd. (1950)
Europa '51 (1952)
Tokyo Story (1953)
The Wanton Countess (Senso) (1954)
The Sins of Lola Montes (Lola Montès) (1955)
Pather Panchali (1955)
Ordet (1955)
Hill 24 Doesn't Answer (1955)
Dracula (1958)
Dog Star Man
Blonde Cobra (1963)
Playtime (1967)
Week End (1967)
Viy (1967)
Andrei Rublev (Andrei Rublyov) (1966)
A Touch of Zen (Hsia Nu) (1969)
M*A*S*H (1970)
The Sorrow and the Pity (La Chagrin et la Pitié) (1971)
Ceddo (1977)
Up in Smoke (1978)
Raiders of the Lost Ark (1981)
Yol (1982)
Koyaanisqatsi (1983)
The Naked Gun (1988)
Henry: Portrait of a Serial Killer (1990)
The Actress (Yuen Ling-Yuk) (1992)
Hana-Bi (1997)
Buffalo '66 (1998)
Tetsuo (1989)
A One and a Two (Yi Yi) (2000)
Y Tu Mama Tambien (2001)
Oldboy (2003)
Paranormal Activity (2007)
Precious: Based on the Novel "Push" by Sapphire (2009)
The Favourite (2018)
Vice (2018)


Join the canonical list to the MovieLens movie list

In [4]:
# Add the indicator variable to the canonical list.
thousandone_movies['canonical'] = 1

# Merge the canonical indicator into the movie file, drop the redundant
# canonical columns, and fill the missing indicator values with zeroes
ed_large_movies = (
    pd.merge(ed_large_movies, thousandone_movies,
             left_on='title', right_on='return_match',
             how='outer', suffixes=('', '_canon'))
    .drop(['year_canon', 'return_match', 'title_canon'], axis=1)
    .fillna({'canonical': 0})
)

print(ed_large_movies.head())
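
As a possible variant, the flag can be attached with a left join so that the movie table keeps exactly its original rows. This is a sketch under assumptions, not the cell above: it uses `how='left'` instead of `how='outer'`, the toy frames and ids are made up, and the column names follow the notebook.

```python
import pandas as pd

# Toy stand-ins for the two tables (ids are invented)
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Birth of a Nation, The (1915)", "Vampires, Les (1915)", "Police (1916)"],
})
canon = pd.DataFrame({
    "title": ["The Birth of a Nation (1915)", "Les Vampires (1915)"],
    "return_match": ["Birth of a Nation, The (1915)", "Vampires, Les (1915)"],
})

# One row per matched MovieLens title, carrying the indicator
flags = canon[["return_match"]].drop_duplicates().assign(canonical=1)

merged = movies.merge(flags, left_on="title", right_on="return_match",
                      how="left", validate="many_to_one")
merged = merged.drop(columns="return_match").fillna({"canonical": 0})
merged["canonical"] = merged["canonical"].astype(int)

print(merged)
```

With a left join, canonical titles that found no MovieLens match simply do not appear, whereas the outer join keeps them as extra rows with missing movie fields. Which behaviour is preferable depends on whether those unmatched titles are needed later.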

In [24]:
ed_large_movies[ed_large_movies['title'].str.match('Intolerance')]

Unnamed: 0,movieId,title,genres,year
7132,7243,Intolerance: Love's Struggle Throughout the Ag...,Drama,1916
