<a href="https://colab.research.google.com/github/jpgerber/Recommender-for-movie-snobs/blob/master/0_Movie_Snob_Data_Clean_%26_Wrangle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook cleans and combines the MovieLens data with a canonical movie list.

In [5]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import zipfile, io

# Make the canonical list
# Importing the 1,001 list and converting it to a list
snob_url = 'https://1001films.fandom.com/wiki/The_List'
snob_text = requests.get(snob_url)
soup = BeautifulSoup(snob_text.content, 'html.parser')
basic_list = soup.body.find_all('b')
thousand_list = [item.text for item in basic_list]
thousandone_movies = pd.DataFrame(thousand_list, columns = ['title']).drop(0) # Convert to df


In [7]:
# Get the MovieLens dataset
# Importing the movie data
# First download and extract the files (a list and loop, in case there are several archives)
list_of_urls = ['http://files.grouplens.org/datasets/movielens/ml-latest.zip']
for url in list_of_urls:
    response = requests.get(url)
    response.raise_for_status()  # stop early if the download failed
    z = zipfile.ZipFile(io.BytesIO(response.content))
    z.extractall()

ed_large_movies = pd.read_csv('ml-latest/movies.csv', sep = ',', header = 0) # Make the df


##### There will be lots of different types of cleaning.
First, extract the release year from the title strings.

In [11]:
# Create columns of movie years in each database
# Make sure the titles don't have trailing spaces
thousandone_movies['title'] = thousandone_movies['title'].str.rstrip()
ed_large_movies['title'] = ed_large_movies['title'].str.rstrip()
# Then take the slices (the years are in parentheses at the end of the title)
thousandone_movies['year'] = thousandone_movies['title'].str[-5:-1]
ed_large_movies['year'] = ed_large_movies['title'].str[-5:-1]

# Then convert these strings to numbers (there is one title missing a year!)
# Define a conversion function
def ConvertYear(value):
    '''Convert integer strings to integers and anything else to zero.'''
    try:
        return int(value)
    except (ValueError, TypeError):
        return 0
# Then apply it to the year columns of both thousandone_movies and ed_large_movies
thousandone_movies['year'] = thousandone_movies['year'].apply(ConvertYear)
ed_large_movies['year'] = ed_large_movies['year'].apply(ConvertYear)
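
The slice-and-convert step relies on every title ending in exactly `(YYYY)`. As a possible alternative, here is a minimal sketch that uses a regular expression from the standard library instead of a fixed slice. The names `YEAR_AT_END` and `extract_year` are hypothetical and not part of the notebook; like `ConvertYear`, the function falls back to 0 when no year is found.

```python
import re

# Four digits inside parentheses at the very end of the title,
# allowing for stray trailing whitespace.
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the trailing (YYYY) year as an int, or 0 if there is none."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else 0

print(extract_year("The Great Train Robbery (1903)"))  # 1903
print(extract_year("Dog Star Man"))                    # 0
```

With pandas, the same function could be applied to a title column via `.apply(extract_year)`.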


Then define a matching function that, for each canonical title, searches only the MovieLens titles released within one year of it (three years in total) and returns the best string match.

In [15]:
!pip install fuzzywuzzy 
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# Specify the matching function (we only need one of the outputs)
def Matcher(title, choices):
    title_match, percent_match, match3 = process.extractOne(title, choices)
    return title_match
# And here's a function for using the tokenizer
def Matcher_token(title, choices):
    title_match, percent_match, match3 = process.extractOne(title, choices, 
                                                            scorer=fuzz.token_sort_ratio)
    return title_match

# Define a filter to return targets for +/-1 year only
def YearFilter(year):
    years = [year-1, year, year+1]
    return ed_large_movies[ed_large_movies.year.isin(years)].title

# Run the token-sort matcher over the year-filtered target set
for index, row in thousandone_movies.iterrows():
    # Get the candidate MovieLens titles within one year of this film
    targets = YearFilter(row.year)
    # Store the best-matching title in the new return_match column
    thousandone_movies.loc[index,'return_match'] = Matcher_token(row.title, targets)
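
To illustrate the year-window idea without any dependencies, the sketch below swaps fuzzywuzzy's `token_sort_ratio` for `difflib.SequenceMatcher` from the standard library, so its scores will not be identical to the notebook's. The names `token_sort_key` and `best_match_in_window` and the tiny `catalog` are hypothetical; the titles are ones that appear in this notebook's outputs. It also returns `None` when no title falls inside the window, a case worth handling explicitly.

```python
from difflib import SequenceMatcher

def token_sort_key(title):
    """Lowercase, drop commas, and sort the words so word order is ignored."""
    return " ".join(sorted(title.lower().replace(",", " ").split()))

def best_match_in_window(title, year, catalog, window=1):
    """Return the most similar catalog title within +/- window years, or None."""
    candidates = [t for t, y in catalog if abs(y - year) <= window]
    if not candidates:
        return None
    key = token_sort_key(title)
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, key, token_sort_key(c)).ratio(),
    )

# (title, year) pairs standing in for the MovieLens movie list
catalog = [
    ("Birth of a Nation, The (1915)", 1915),
    ("Vampires, Les (1915)", 1915),
    ("Police (1916)", 1916),
    ("Vice (2008)", 2008),
]

print(best_match_in_window("The Birth of a Nation (1915)", 1915, catalog))
# Birth of a Nation, The (1915)
print(best_match_in_window("Vice (2018)", 2018, catalog))
# None
```

Sorting the tokens is what lets "The Birth of a Nation" line up with "Birth of a Nation, The" despite the moved article.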


Now hand-code the mismatches. The output below, for example, pairs Intolerance (1916) with Police (1916).

In [18]:
thousandone_movies.head()

index,title,year,return_match
1,A Trip to the Moon (Le Voyage Dans La Lune) (1...,1902,"Trip to the Moon, A (Voyage dans la lune, Le) ..."
2,The Great Train Robbery (1903),1903,The Great Train Robbery (1903)
3,The Birth of a Nation (1915),1915,"Birth of a Nation, The (1915)"
4,Les Vampires (1915),1915,"Vampires, Les (1915)"
5,Intolerance (1916),1916,Police (1916)


In [2]:

Hand-matched titles and their MovieLens movieId (None where no match was found):

Intolerance (1916) - 7243
Broken Blossoms (1919) - 6988
Häxan (1923) - 25744
Sunrise (1927) - 8125
The Unknown (1927) - 25762
A Throw of Dice (Prapancha Pash) (1929) - None
Tabu (1931) - 5599
The Vampire (Vampyr) (1932) - 25793
Scarface: The Shame of a Nation (1932) - 25788
Midnight Song (Ye Ban Ge Sheng) (1937) - None
Henry V (1944) - 25901
The Battle of San Pietro (1945) - 80104
Gun Crazy (1949) - 8751
Sunset Blvd. (1950) - 922
Europa '51 (1952) - 25966
Tokyo Story (1953) - 6643
The Wanton Countess (Senso) (1954) - 69911
The Sins of Lola Montes (Lola Montès) (1955) - 8143
Pather Panchali (1955) - 668
Ordet (1955) - 6981
Hill 24 Doesn't Answer (1955) - None
Dracula (1958) - 5649
Dog Star Man - 137579, 137581, 137583, 137585, 137587
Blonde Cobra (1963) - None
Playtime (1967) - 26171
Week End (1967) - 7749
Viy (1967) - 97065
Andrei Rublev (Andrei Rublyov) (1966) - 26150
A Touch of Zen (Hsia Nu) (1969) - 32511
M*A*S*H (1970) - 5060
The Sorrow and the Pity (La Chagrin et la Pitié) (1971) - 32853
Ceddo (1977) - 71973
Up in Smoke (1978) - 1194
Raiders of the Lost Ark (1981) - 1198
Yol (1982) - 6151
Koyaanisqatsi (1983) - 1289
The Naked Gun (1988) - 3868
Henry: Portrait of a Serial Killer (1990) - 2159
The Actress (Yuen Ling-Yuk) (1992) - 114394
Hana-Bi (1997) - 1809
Buffalo '66 (1998) - 1916
Tetsuo (1989) - 4552
A One and a Two (Yi Yi) (2000) - 4334
Y Tu Mama Tambien (2001) - 5225
Oldboy (2003) - 107314
Paranormal Activity (2007) - 71379
Precious: Based on the Novel "Push" by Sapphire (2009) - 72395
The Favourite (2018) - 183837
Vice (2018) - 127323
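
One way to use a hand-coded list like this is as an override table that is consulted before the fuzzy match. The sketch below is an assumption about how that could look, not the notebook's own code. `MANUAL_IDS` and `resolve_movie_ids` are hypothetical names, only a few entries from the list above are copied in, and the `-1` id is a placeholder standing in for a wrong fuzzy match.

```python
# Hand-coded overrides: canonical title -> list of MovieLens movieIds,
# or None when the film is not in MovieLens.
MANUAL_IDS = {
    "Intolerance (1916)": [7243],
    "Broken Blossoms (1919)": [6988],
    "A Throw of Dice (Prapancha Pash) (1929)": None,
    "Dog Star Man": [137579, 137581, 137583, 137585, 137587],
}

def resolve_movie_ids(title, fuzzy_ids):
    """Prefer a hand-coded entry; otherwise fall back to the fuzzy-matched ids."""
    if title in MANUAL_IDS:
        return MANUAL_IDS[title]
    return fuzzy_ids.get(title)

# Placeholder result of the automatic matching step (the id is made up)
fuzzy_ids = {"Intolerance (1916)": [-1]}

print(resolve_movie_ids("Intolerance (1916)", fuzzy_ids))  # [7243]
print(resolve_movie_ids("Dog Star Man", fuzzy_ids))
```

Storing a list of ids per title keeps multi-part entries such as Dog Star Man in the same shape as single films.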

In [101]:
# Look up candidate MovieLens titles for each remaining mismatch by substring

ed_large_movies[ed_large_movies['title'].str.contains('Vice')]

index,movieId,title,genres,year
4465,4559,Vice Versa (1988),Comedy,1988
11199,47044,Miami Vice (2006),Action|Crime|Drama|Thriller,2006
11651,50977,Vice Squad (1982),Action|Thriller,1982
15087,76063,Your Vice is a Locked Room and Only I Have the...,Horror|Mystery|Thriller,1972
17213,86635,Vice (2008),Crime|Film-Noir|Mystery|Thriller,2008
17635,88228,Vice Squad (1953),Crime|Drama,1953
25452,116799,Inherent Vice (2014),Comedy|Crime|Drama|Mystery|Romance,2014
25851,117958,Vice Versa (1948),Comedy,1948
26913,121432,The Man Who Shook the Hand of Vicente Fernande...,Comedy|Drama|Western,2012
27049,121769,Vice Raid (1960),Crime|Drama,1960


Join the canonical list to the MovieLens movie list.

In [4]:
# Add the indicator variable to the canonical list.
thousandone_movies['canonical'] = 1
#print(thousandone_movies.head())

# Add the canonical indicator to the movie file, drop the irrelevant columns,
# and fill the missing indicator values with zeroes
ed_large_movies = (
    pd.merge(ed_large_movies, thousandone_movies, left_on='title',
             right_on='return_match', how='outer', suffixes=('', '_canon'))
    .drop(['year_canon', 'return_match', 'title_canon'], axis=1)
    .fillna({'canonical': 0})
)

print(ed_large_movies.head())

index,movieId,title,genres,year
6877,6988,Broken Blossoms or The Yellow Man and the Girl...,Drama|Romance,1919
10040,33001,Blossoms in the Dust (1941),Drama,1941
13735,68472,Cherry Blossoms (Kirschblüten - Hanami) (2008),Drama|Romance,2008
16397,82330,"Blossoming of Maximo Oliveros, The (Ang pagdad...",Comedy|Drama,2005
20242,98936,"All Blossoms Again: Pedro Costa, Director (Tou...",Documentary,2006
27622,123046,The Tsunami and the Cherry Blossom (2011),Documentary,2011
30198,130000,Broken Blossoms (1936),Drama|Romance,1936
41123,155788,The Bliss of Mrs. Blossom (1968),Comedy|Romance,1968
43771,161908,Plum Blossom (2000),Drama|Romance,2000
48689,172565,Under the Blossoming Cherry Trees (1975),Fantasy|Horror,1975
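
The effect of the merge earlier in this section can be illustrated without pandas: every MovieLens title that appears among the `return_match` values gets `canonical = 1`, and every other title gets 0. The sketch below is a plain-Python stand-in, not a replacement for the merge. `flag_canonical` is a hypothetical name, and the titles are copied from outputs in this notebook.

```python
def flag_canonical(movie_titles, matched_titles):
    """Return one dict per movie with a 0/1 canonical indicator."""
    matched = set(matched_titles)
    return [
        {"title": title, "canonical": int(title in matched)}
        for title in movie_titles
    ]

# Titles standing in for the MovieLens movie list
movie_titles = [
    "Birth of a Nation, The (1915)",
    "Vampires, Les (1915)",
    "Broken Blossoms (1936)",
    "Vice (2008)",
]
# Values standing in for the return_match column of the canonical list
matched_titles = ["Birth of a Nation, The (1915)", "Vampires, Les (1915)"]

for row in flag_canonical(movie_titles, matched_titles):
    print(row["canonical"], row["title"])
```

Because the flag follows `return_match`, a wrong fuzzy match such as Police (1916) would be flagged as canonical too unless the hand-coded corrections are applied first.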

