<a href="https://colab.research.google.com/github/jpgerber/Recommender-for-movie-snobs/blob/master/0_Movie_Snob_Data_Clean_%26_Wrangle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook preps the data files for the movie snob recommender ML.
It first makes a file that adds canonical status to the ratings file.

Then it builds the indicators of snobbery.

Both files are saved to my Google Drive.


In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import zipfile, io

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


#### First stage is to add the canonical indicator to the movie titles file

In [2]:
# Make the canonical list
# Importing the 1,001 list and converting it to a list
snob_url = 'https://1001films.fandom.com/wiki/The_List'
snob_text= requests.get(snob_url)
soup = BeautifulSoup(snob_text.content, 'html.parser')
basic_list = (soup.body.find_all('b'))
thousand_list = [item.text for item in basic_list]
thousandone_movies = pd.DataFrame(thousand_list, columns = ['title']).drop(0) # Convert to df


In [3]:
# Get the MovieLens dataset
# Importing the ratings data
list_of_urls = ['http://files.grouplens.org/datasets/movielens/ml-latest.zip'] # I originally checked several files
for url in list_of_urls:
    ratings_small_file = requests.get(url)
    z = zipfile.ZipFile(io.BytesIO(ratings_small_file.content))
    z.extractall()

gl_movies = pd.read_csv('ml-latest/movies.csv', sep = ',', header = 0) # Make the df


##### There will be some different types of cleaning.
First, extract the year of release from the title strings to make title matching easier.

In [4]:
# Create columns of movie years in each database
# Make sure the titles don't have trailing spaces
thousandone_movies['title'] = thousandone_movies['title'].str.rstrip()
gl_movies['title'] = gl_movies['title'].str.rstrip()
# Then take the slices (the years are in parentheses at the end of the title)
thousandone_movies['year'] = [title[slice(-5,-1)] for title in thousandone_movies['title']]
gl_movies['year'] = [title[slice(-5,-1)] for title in gl_movies['title']]

# Then convert these strings to numbers (there is one title missing a year!)
# Define a conversion function
def ConvertYear(value):
    '''Convert an integer string to an int; return 0 for anything that is not an integer string.'''
    try:
        return int(value)
    except (ValueError, TypeError):
        return 0
# Then apply it to the year columns of both thousandone_movies and gl_movies
thousandone_movies['year'] = thousandone_movies['year'].apply(ConvertYear)
gl_movies['year'] = gl_movies['year'].apply(ConvertYear)
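
The slice above assumes every title ends in `(YYYY)`. As a hedged alternative, here is a minimal sketch that only accepts a four-digit year in parentheses at the end of the title. The name `extract_year` and the sample titles are illustrative assumptions, not part of the original notebook.

```python
import re

# Matches a four-digit year in parentheses at the very end of the title.
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the trailing (YYYY) year as an int, or 0 if there is none."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else 0

print(extract_year("Toy Story (1995)"))   # title with a year
print(extract_year("Dog Star Man"))       # title without a year
```

Like `ConvertYear`, this returns 0 when no year is found, so it could be applied to the `title` column in the same way.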


Next, define a filter that restricts the candidates to a three-year window (the release year plus or minus one) so the string matching runs more efficiently.
Then run the matching.

In [5]:
!pip install fuzzywuzzy 
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# Specify the matching function (we only need one of the outputs)
def Matcher(title, choices):
    title_match, percent_match, match3 = process.extractOne(title, choices)
    return title_match
# And here's a function for using the tokenizer
def Matcher_token(title, choices):
    title_match, percent_match, match3 = process.extractOne(title, choices, 
                                                            scorer=fuzz.token_sort_ratio)
    return title_match

# Define a filter that returns candidate titles released within +/-1 year
def YearFilter(year):
    years = [year-1, year, year+1]
    return gl_movies[gl_movies.year.isin(years)].title

# Run the token-sort matcher over the year-filtered candidates
for index, row in thousandone_movies.iterrows():
    # Get the candidate titles for this movie's year window
    targets = YearFilter(row.year)
    # Store the best match in the new return_match column
    thousandone_movies.loc[index,'return_match'] = Matcher_token(row.title, targets)
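
To illustrate the idea of year-windowed matching without installing anything, here is a rough sketch that uses the standard library's `difflib` as a stand-in for fuzzywuzzy. It is not fuzzywuzzy's `token_sort_ratio` scorer (for example, it does not strip punctuation), and `token_sort_key`, `best_match` and the toy catalogue are hypothetical names and data.

```python
from difflib import SequenceMatcher

def token_sort_key(title):
    """Lower-case the title and sort its words so word order is ignored."""
    return " ".join(sorted(title.lower().split()))

def best_match(title, year, candidates, window=1):
    """Return the most similar candidate title within +/- `window` years.

    `candidates` is a list of (title, year) pairs; returns None when no
    candidate falls inside the year window.
    """
    in_window = [t for t, y in candidates if abs(y - year) <= window]
    if not in_window:
        return None
    key = token_sort_key(title)
    return max(
        in_window,
        key=lambda t: SequenceMatcher(None, key, token_sort_key(t)).ratio(),
    )

# Toy catalogue of (title, year) pairs, for illustration only.
catalog = [
    ("Seven Samurai (Shichinin no samurai) (1954)", 1954),
    ("Rear Window (1954)", 1954),
    ("Seven (a.k.a. Se7en) (1995)", 1995),
]
print(best_match("The Seven Samurai (1954)", 1954, catalog))
```

Returning `None` for an empty window makes the "no candidates in this year range" case explicit, which is worth handling for titles whose year was parsed as 0.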







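The following movies were misidentified by the fuzzy matching and needed re-coding by hand. Each line gives the title and the correct MovieLens movieId (NONE/None means no movieId was assigned):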
Intolerance (1916) - 7243
Broken Blossoms (1919) - 6988
Häxan (1923) - 25744
Sunrise (1927) - 8125
The Unknown (1927) - 25762
A Throw of Dice (Prapancha Pash) (1929) - NONE
Tabu (1931) - 5599
The Vampire (Vampyr) (1932) - 25793
Scarface: The Shame of a Nation (1932) - 25788
Midnight Song (Ye Ban Ge Sheng) (1937) - None
Henry V (1944) - 25901
The Battle of San Pietro (1945) - 80104
Gun Crazy (1949) - 8751
Sunset Blvd. (1950) - 922
Europa '51 (1952) - 25966
Tokyo Story (1953) - 6643
The Wanton Countess (Senso) (1954) - 69911
The Sins of Lola Montes (Lola Montès) (1955) - 8143
Pather Panchali (1955) - 668
Ordet (1955) - 6981
Hill 24 Doesn't Answer (1955) - NONE
Dracula (1958) - 5649
Dog Star Man - 137579, 137581, 137583, 137585, 137587
Blonde Cobra (1963) - None
Playtime (1967) - 26171
Week End (1967) - 7749
Viy (1967) - 97065
Andrei Rublev (Andrei Rublyov) (1966) - 26150
A Touch of Zen (Hsia Nu) (1969) - 32511
M*A*S*H (1970) - 5060
The Sorrow and the Pity (La Chagrin et la Pitié) (1971) - 32853
Ceddo (1977) - 71973
Up in Smoke (1978) - 1194
Raiders of the Lost Ark (1981) - 1198
Yol (1982) - 6151
Koyaanisqatsi (1983) - 1289
The Naked Gun (1988) - 3868
Henry: Portrait of a Serial Killer (1990) - 2159
The Actress (Yuen Ling-Yuk) (1992) - 114394
Hana-Bi (1997) - 1809
Buffalo '66 (1998) - 1916
Tetsuo (1989) - 4552
A One and a Two (Yi Yi) (2000) - 4334
Y Tu Mama Tambien (2001) - 5225
Oldboy (2003) - 107314
Paranormal Activity (2007) - 71379
Precious: Based on the Novel "Push" by Sapphire (2009) - 72395
The Favourite (2018) - 183837
Vice (2018) - 127323

Join the canonical indicators list to the titles file.

In [6]:
# Add the indicator variable to the canonical list.
thousandone_movies['canonical'] = 1
#print(thousandone_movies.head())

# Add the canonical indicator to the movie file, drop the irrelevant columns 
#and fill the missing values with zeroes
gl_movies = pd.merge(gl_movies, thousandone_movies, left_on='title', right_on='return_match', how='outer', 
         suffixes=('', '_canon')).drop(['year_canon', 
                                        'return_match', 'title_canon'], axis=1).fillna({'canonical':0})

print(gl_movies.head())

   movieId                               title  ...  year  canonical
0        1                    Toy Story (1995)  ...  1995        1.0
1        2                      Jumanji (1995)  ...  1995        0.0
2        3             Grumpier Old Men (1995)  ...  1995        0.0
3        4            Waiting to Exhale (1995)  ...  1995        0.0
4        5  Father of the Bride Part II (1995)  ...  1995        0.0

[5 rows x 5 columns]
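
Because several list entries can be fuzzy-matched to the same MovieLens title, a merge like the one above can add duplicate rows. The sketch below shows one way to detect and avoid that on toy data. The frames `movies` and `canon` are made up for illustration, and it uses `how='left'` rather than the notebook's `how='outer'`.

```python
import pandas as pd

movies = pd.DataFrame({"movieId": [1, 2, 3],
                       "title": ["A (1990)", "B (1991)", "C (1992)"]})
# Two list entries were matched to the same catalogue title "A (1990)".
canon = pd.DataFrame({"return_match": ["A (1990)", "A (1990)"],
                      "canonical": [1, 1]})

# A plain merge repeats movie 1 once per matching list entry.
merged = movies.merge(canon, left_on="title", right_on="return_match", how="left")
print(len(movies), len(merged))

# De-duplicating the right-hand side first keeps one row per movie;
# validate="one_to_one" raises an error if either key is not unique.
canon_unique = canon.drop_duplicates(subset="return_match")
clean = (movies.merge(canon_unique, left_on="title", right_on="return_match",
                      how="left", validate="one_to_one")
               .drop(columns="return_match")
               .fillna({"canonical": 0}))
clean["canonical"] = clean["canonical"].astype(int)
print(clean)
```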


In [7]:
# Now add the mismatched ones

handcode_ids = [7243, 6988, 25744, 8125, 25762, 5599, 25793, 25788, 25901, 80104,
                         8751, 922, 25966, 6643, 69911, 8143, 668, 6981, 5649, 137579,
                         137581, 137583, 137585, 137587, 26171, 7749, 97065, 26150, 32511,
                         5060, 32853, 71973, 1194, 1198, 6151, 1289, 3868, 2159, 114394,
                         1809, 1916, 4552, 4334, 5225, 107314, 71379, 72395, 183837, 127323]

# Flag a movie as canonical if it was matched above or its movieId was hand-coded
gl_movies['canonical'] = (gl_movies['movieId'].isin(handcode_ids) |
                          (gl_movies['canonical'] == 1)).astype(int)



In [8]:
gl_movies.head()

   movieId                               title                                       genres  year  canonical
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995          1
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy  1995          0
2        3             Grumpier Old Men (1995)                               Comedy|Romance  1995          0
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance  1995          0
4        5  Father of the Bride Part II (1995)                                       Comedy  1995          0


In [9]:
# Quickly check how many movies are flagged as canonical (roughly 1,200 expected)
gl_movies.canonical.value_counts()

0    56829
1     1270
Name: canonical, dtype: int64

### Now join the canonical indicator to the ratings file 
(movie and ratings file have the same movieId)

In [10]:
# Load the ratings file and convert the rating date
import datetime
ratings = pd.read_csv('ml-latest/ratings.csv', sep = ',', header = 0)
ratings['rating_date'] = ratings['timestamp'].apply(lambda x: datetime.date.fromtimestamp(x))

print(ratings.head())


   userId  movieId  rating   timestamp rating_date
0       1      307     3.5  1256677221  2009-10-27
1       1      481     3.5  1256677456  2009-10-27
2       1     1091     1.5  1256677471  2009-10-27
3       1     1257     4.5  1256677460  2009-10-27
4       1     1449     4.5  1256677264  2009-10-27
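
One caveat on the date conversion: `datetime.date.fromtimestamp` interprets the timestamp in the local timezone of the machine running the notebook, so dates near midnight can differ between machines. A minimal sketch of a timezone-explicit version follows; the helper name `utc_date` is an assumption for illustration.

```python
from datetime import datetime, timezone

def utc_date(ts):
    """Convert a Unix timestamp (in seconds) to a calendar date in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date()

# First timestamp from the ratings output above.
print(utc_date(1256677221))
```

It could be swapped into the `apply` call above if reproducible dates matter.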


In [11]:
# Here's the main merge
rating2 = pd.merge(ratings, gl_movies.drop_duplicates(subset=['movieId']), how='left', on='movieId').drop(
    ['title','genres','timestamp'], axis=1)  # we also dropped the long string columns and some others
rating2.head()

   userId  movieId  rating rating_date  year  canonical
0       1      307     3.5  2009-10-27  1993          1
1       1      481     3.5  2009-10-27  1993          0
2       1     1091     1.5  2009-10-27  1989          0
3       1     1257     4.5  2009-10-27  1985          0
4       1     1449     4.5  2009-10-27  1996          0


In [12]:

# Removing the one suspicious user
rating2 = rating2[rating2.userId != 123100]
rating2.groupby('userId').agg({'rating': 'count'}).sort_values(by='rating', ascending=False)


        rating
userId
117490    9279
134596    8381
212343    7884
242683    7515
111908    6645
...        ...
188125       1
117282       1
127062       1
241836       1


## Saving the initial ratings file to my Google Drive

In [13]:
rating2.to_csv('ratings.zip', compression={'archive_name':'rating2.csv','method':'zip'}, index=False)
# File save
! cp ratings.zip '/content/gdrive/My Drive/'
! ls '/content/gdrive/My Drive/'


'Capstone #2: Final report.gdoc'
'Copy of TEMPLATE_Track companies and contacts.gsheet'
 Data
 foo.txt
'Getting started.pdf'
"Imbalanced classes in Naive Bayes' classifiers.gdoc"
'Notes on three informational interviews.gdoc'
 ratings.zip
'Statistical techniques database.gsheet'


### Reading the RATINGS file back

In [2]:
ratings_df = pd.read_csv('/content/gdrive/My Drive/ratings.zip')
ratings_df.head()

   userId  movieId  rating rating_date  year  canonical
0       1      307     3.5  2009-10-27  1993          1
1       1      481     3.5  2009-10-27  1993          0
2       1     1091     1.5  2009-10-27  1989          0
3       1     1257     4.5  2009-10-27  1985          0
4       1     1449     4.5  2009-10-27  1996          0


# First make a dataframe with all the snobbery indices

### The following cells make four snobbery indices in a new dataframe:
1. Preference for old over new
2. Being extreme in ratings (Statler-Waldorf)
3. Liking obscure things
4. Being contrary to popular opinion

In [3]:
# Old-over-new: per-user correlation between rating and release year
by_user = pd.DataFrame()
by_user['newold_r'] = ratings_df[['rating','year','userId']].groupby('userId').corr().unstack().rating.year
by_user.head()

        newold_r
userId
1       -0.43949
2       0.153888
3        0.07627
4       0.016382
5       0.008811
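
To make the sign of `newold_r` concrete, here is a small sketch on made-up data that computes the same per-user Pearson correlation between rating and release year. The frame `toy` and its values are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2],
    "year":   [1950, 1980, 2010, 1950, 1980, 2010],
    "rating": [5.0, 3.0, 1.0, 2.0, 3.0, 4.0],
})

# One Pearson correlation per user between rating and release year.
newold_r = pd.Series(
    {user: grp["rating"].corr(grp["year"]) for user, grp in toy.groupby("userId")},
    name="newold_r",
)
print(newold_r)
```

In this toy data user 1 rates older films higher, which gives a negative correlation, while user 2 prefers newer films and gets a positive one. A preference for old over new therefore shows up as a negative `newold_r`.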


In [4]:
# Extremeness / Statler-Waldorf: share of a user's ratings that are 0.5, 1 or 5 stars
grumpy = ratings_df.groupby(['userId','rating']).agg({'rating':'count'}).unstack()
grumpy.columns =['halfstar','onestar','one5star','twostar','two5star','threestar','three5star','fourstar','four5star'
               ,'fivestar']
grumpy['extremes'] = np.nansum([grumpy.halfstar, grumpy.onestar, grumpy.fivestar], axis=0) 
grumpy['all_sum'] = np.nansum([grumpy.halfstar, grumpy.onestar, grumpy.one5star, grumpy.twostar,grumpy.two5star,
                               grumpy.threestar, grumpy.three5star, grumpy.fourstar, grumpy.four5star, grumpy.fivestar
                              ], axis=0)
by_user['statler_waldorf'] = grumpy.extremes / grumpy.all_sum
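
The cell above hard-codes ten column names, so it relies on all ten rating values appearing in the data. As a possible alternative, this sketch computes the same share of extreme ratings directly with `isin`. The toy frame and the name `EXTREME` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 1, 1, 1, 2, 2],
    "rating": [0.5, 5.0, 3.0, 3.5, 4.0, 4.0],
})

# Ratings counted as "extreme", matching the half-, one- and five-star columns.
EXTREME = [0.5, 1.0, 5.0]

# The mean of a boolean flag per user is the fraction of extreme ratings.
share = (toy["rating"].isin(EXTREME)
                      .groupby(toy["userId"])
                      .mean()
                      .rename("statler_waldorf"))
print(share)
```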


In [5]:
# Obscurist: per-user correlation between rating and how often the movie was rated
num_ratings_by_movie = ratings_df[['rating','movieId']].groupby('movieId').count()
num_ratings_by_movie.columns = ['times_rated']
# Broadcast the number of ratings of each movie to each rating
ratings_df = pd.merge(ratings_df, num_ratings_by_movie, how='left', on='movieId')
# calculate the correlation by user and add that as new column
by_user['obscurist']=ratings_df[['rating','times_rated','userId']].groupby('userId').corr().unstack().times_rated.rating



In [6]:
# Contrariness

mean_ratings_by_movie = ratings_df[['rating','movieId']].groupby('movieId').mean()
mean_ratings_by_movie.columns = ['movie_mean_rating']
# Join the movie's mean rating to each individual rating
ratings_df = pd.merge(ratings_df, mean_ratings_by_movie, how='left', on='movieId')
# Calculate how far each rating is from the movie's mean rating (the absolute deviation)
ratings_df['rating_deviation'] = abs(ratings_df.rating - ratings_df.movie_mean_rating)
# Get the average deviation for each user as a new column
by_user['contrariness']=ratings_df[['rating_deviation','userId']].groupby('userId').mean()
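
As a worked example of the contrariness index, the sketch below runs the same calculation on made-up ratings, using `transform` in place of the merge to broadcast each movie's mean. The data and variable names are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 10, 20],
    "rating":  [5.0, 1.0, 4.0, 2.0, 3.0, 3.0],
})

# Broadcast each movie's mean rating back onto its individual ratings.
movie_mean = toy.groupby("movieId")["rating"].transform("mean")
# Absolute distance of each rating from the movie's mean.
toy["rating_deviation"] = (toy["rating"] - movie_mean).abs()
# Average that distance per user.
contrariness = toy.groupby("userId")["rating_deviation"].mean()
print(contrariness)
```

Movie 10 averages 4.0 and movie 20 averages 2.0, so user 2, who rates exactly at the averages, scores 0, while users 1 and 3 each sit one star away on both movies and score 1.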


In [16]:
# Now reset the index to get userIds
by_user = by_user.reset_index()

In [17]:
by_user.head()

   userId  newold_r  statler_waldorf  obscurist  contrariness
0       1  -0.43949              0.0  -0.019481      0.573882
1       2  0.153888              0.0  -0.594173      0.453425
2       3   0.07627         0.090909    0.69317      0.424874
3       4  0.016382         0.184783   0.220126      0.817582
4       5  0.008811         0.236111   0.433186      0.480558


In [None]:
### This cell computes the outcome measures of snobbery, but I haven't revised it for this version.

# Constructing the matrix requires a by-user approach. We already have the beginnings but need the means and SDs
#of canonical and non-canonical movies for each user.

# Group by two variables, create the means and SDs, and then unstack and rename.
#double_stack = ms_df.groupby(['userId','canonical'])
# from here
#canonical_mean_sds = double_stack['rating'].agg({'mean' : np.mean, 'sd' : np.std, 'n': 'count'}).unstack()
#canonical_mean_sds.columns = ['canonical_mean', 'noncanon_mean','canonical_sd','noncanon_sd','canon_n','noncanon_n']

# Join the new canonical mean to the main file
#by_user['canonical_mean'] = canonical_mean_sds['canonical_mean']

# Make the t-stat.
#canonical_mean_sds['canon_pref_meandiff']= canonical_mean_sds.canonical_mean - canonical_mean_sds.noncanon_mean
#canonical_mean_sds['canon_pref_sd']= np.sqrt(( (canonical_mean_sds.canonical_sd ** 2 * canonical_mean_sds.canon_n) +
#                                        (canonical_mean_sds.noncanon_sd ** 2 * 
#                                         canonical_mean_sds.noncanon_n))/(canonical_mean_sds.canon_n +
#                                                                          canonical_mean_sds.noncanon_n))
#canonical_mean_sds['canon_pref_stat'] = canonical_mean_sds['canon_pref_meandiff'] / canonical_mean_sds['canon_pref_sd']

#by_user['canon_pref_stat'] = canonical_mean_sds['canon_pref_stat']
#by_user['canon_pref_meandiff'] = canonical_mean_sds['canon_pref_meandiff']

# Replace the infinite t-stat values with the following
#by_user = by_user.replace(to_replace=[-np.inf, np.inf], value=np.nan)
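
For reference, the commented-out statistic above can be written out as follows. This is only a transcription of that code, using subscripts $c$ and $n$ (notation introduced here, not in the notebook) for a user's canonical and non-canonical ratings. It divides the mean difference by an $n$-weighted pooled standard deviation, so it reads more like a standardized mean difference than a conventional t-statistic.

```latex
\[
\text{canon\_pref\_stat}
  = \frac{\bar{r}_{c} - \bar{r}_{n}}
         {\sqrt{\dfrac{s_{c}^{2}\, n_{c} + s_{n}^{2}\, n_{n}}{n_{c} + n_{n}}}}
\]
```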



In [18]:
by_user.to_csv('user_indices.zip', compression={'archive_name':'by_user.csv','method':'zip'}, index=False)
# File save
! cp user_indices.zip '/content/gdrive/My Drive/'
! ls '/content/gdrive/My Drive/'

'Capstone #2: Final report.gdoc'
'Copy of TEMPLATE_Track companies and contacts.gsheet'
 Data
 foo.txt
'Getting started.pdf'
"Imbalanced classes in Naive Bayes' classifiers.gdoc"
'Notes on three informational interviews.gdoc'
 ratings.zip
'Statistical techniques database.gsheet'
 user_indices.zip


### Last bit is to join the user indices and ratings in one file and delete the bits not needed for the ML.

In [19]:
## If you need to load the two files, here it is (don't forget to run imports at top first!)
#ratings_df = pd.read_csv('/content/gdrive/My Drive/ratings.zip')
#by_user = pd.read_csv('/content/gdrive/My Drive/user_indices.zip')



In [25]:
## Making the final file for ML

# Drop unneeded variables from the ratings file
# If going from whole script, then run the line below
#ratings_df = ratings_df.drop(['rating_date','year','times_rated','movie_mean_rating','rating_deviation'], axis=1)
# If reloading the two saved files, then run the line below
ratings_df = ratings_df.drop(['rating_date','year'], axis=1)

# Join the two
ml_df = ratings_df.merge(by_user, how='left', on='userId')
ml_df.head()

# File save
ml_df.to_csv('moviesnob_ml_df.zip', compression={'archive_name':'ml_df.csv','method':'zip'}, index=False)
! cp moviesnob_ml_df.zip '/content/gdrive/My Drive/'
! ls '/content/gdrive/My Drive/'


   userId  movieId  rating  canonical  times_rated  movie_mean_rating  rating_deviation  newold_r  statler_waldorf  obscurist  contrariness
0       1      307     3.5          1         7957           3.971597          0.471597  -0.43949              0.0  -0.019481      0.573882
1       1      481     3.5          0         6036           3.339298          0.160702  -0.43949              0.0  -0.019481      0.573882
2       1     1091     1.5          0         6138           2.806207          1.306207  -0.43949              0.0  -0.019481      0.573882
3       1     1257     4.5          0         5901           3.828588          0.671412  -0.43949              0.0  -0.019481      0.573882
4       1     1449     4.5          0         6866           3.918439          0.581561  -0.43949              0.0  -0.019481      0.573882

