In [1]:
import pandas as pd
import numpy as np

## Preprocessing of IMDB Data
Information taken from https://www.imdb.com/interfaces/. 

The three TSV files read in from IMDB are `title.principals.tsv`, `name.basics.tsv` and `title.basics.tsv`. 

`principals` contains information regarding the principal cast/crew for titles. Columns in this tsv are as follows:
* `tconst` (string) - alphanumeric unique identifier of the title
* `ordering` (integer) – a number to uniquely identify rows for a given title (`tconst`)
* `nconst` (string) - alphanumeric unique identifier of the name/person
* `category` (string) - the category of job that person was in
* `job` (string) - the specific job title if applicable, else '\N'
* `characters` (string) - the name of the character played if applicable, else '\N'

`name.basics` contains detailed information about the individuals associated with a title. Columns in this tsv are as follows:
* `nconst` (string) - alphanumeric unique identifier of the name/person
* `primaryName` (string) – name by which the person is most often credited
* `birthYear` – in YYYY format
* `deathYear` – in YYYY format if applicable, else '\N'
* `primaryProfession` (array of strings) – the top-3 professions of the person
* `knownForTitles` (array of tconsts) – titles the person is known for

`title.basics` contains detailed information regarding titles. Columns in this tsv are as follows: 
* `tconst` (string) - alphanumeric unique identifier of the title
* `titleType` (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
* `primaryTitle` (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
* `originalTitle` (string) - original title, in the original language
* `isAdult` (boolean) - 0: non-adult title; 1: adult title
* `startYear` (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
* `endYear` (YYYY) – TV Series end year. '\N' for all other title types
* `runtimeMinutes` – primary runtime of the title, in minutes
* `genres` (string array) – includes up to three genres associated with the title
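
All three files mark a missing value with the literal string `\N`. As a minimal sketch (the two sample rows are typed in by hand in the `title.basics` layout, not read from the real files), `read_csv` can map that marker to a proper missing value through `na_values`, and the comma-separated `genres` field can then be split into a list:

```python
import io
import pandas as pd

# Two illustrative rows with a subset of the title.basics columns.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\tgenres\n"
    "tt0000574\tmovie\tThe Story of the Kelly Gang\t1906\t\\N\tBiography,Crime,Drama\n"
    "tt0000502\tmovie\tBohemios\t1905\t\\N\t\\N\n"
)

# na_values turns the literal \N marker into a missing value at read time.
sample = pd.read_csv(
    io.StringIO(sample_tsv), sep="\t", dtype="string", na_values=["\\N"]
)

# genres is a comma-separated string; split it into a list per title.
genre_lists = sample["genres"].str.split(",")

print(sample["endYear"].isna().tolist())
print(list(genre_lists.iloc[0]))
```

The cells below do not do this: they keep `\N` as an ordinary string and filter on it explicitly, so this is an alternative rather than a description of the notebook's approach.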

In [66]:
import time

def read_imdb_data(filenames):
    """ Returns list of dataframes given a list of IMDB tsvs. """
    
    all_data = []
    for filename in filenames:
        time_start = time.time()
        raw_data = pd.read_csv(f'./input/{filename}.tsv.gz', sep='\t', dtype='string')
        print(f'Finished reading {filename} after {time.time() - time_start} seconds.')
        all_data.append(raw_data)
    return all_data
    

filenames = ['title.principals', 'name.basics', 'title.basics']
raw_principals, raw_names, raw_titles = read_imdb_data(filenames)

Finished reading title.principals after 188.37308311462402 seconds.
Finished reading name.basics after 64.15010285377502 seconds.
Finished reading title.basics after 58.90655851364136 seconds.
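
The read times above come from parsing every column of every file. One option that may reduce the time and memory needed, sketched here on an in-memory sample rather than the real files, is to request only the columns the later cells keep by passing `usecols` to `read_csv`. The sample rows are illustrative; unused fields are left as `\N`.

```python
import io
import pandas as pd

# Illustrative rows in the name.basics layout; the dropped fields are left as \N.
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm0000001\tFred Astaire\t\\N\t\\N\t\\N\t\\N\n"
    "nm0000002\tLauren Bacall\t\\N\t\\N\t\\N\t\\N\n"
)

# Only the two columns used later are parsed and stored.
names_small = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype="string",
    usecols=["nconst", "primaryName"],
)

print(list(names_small.columns))
```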


In [67]:
def filter_df(col_values, df, acceptable=True):
    """
    Filters a dataframe on the values of one or more columns. If `acceptable` is True, keeps only the rows whose
    value is in the given list; otherwise, removes those rows.

    Parameters:
    col_values (dict): maps a column name to the list of values to keep (or to remove)
    df (pd.DataFrame): dataframe to filter
    acceptable (bool): whether the listed values are the ones to keep (True) or the ones to remove (False)

    Returns:
    Filtered dataframe

    """
    for col_name, accepted_values in col_values.items():
        if acceptable:
            df = df[df[col_name].isin(accepted_values)].reset_index(drop=True)
        else:
            df = df[~df[col_name].isin(accepted_values)].reset_index(drop=True)
    return df
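
To make the two modes of `filter_df` concrete, here is a small sketch that spells out the boolean masks it applies, first in "keep" mode and then in "remove" mode. The toy dataframe is made up for illustration.

```python
import pandas as pd

# Made-up rows; "\\N" is the missing-value marker used by the IMDB files.
toy = pd.DataFrame(
    {
        "titleType": ["movie", "short", "tvMovie", "movie"],
        "startYear": ["1894", "1905", "\\N", "2015"],
    }
)

# acceptable=True: keep only rows whose value is in the list.
kept = toy[toy["titleType"].isin(["movie", "tvMovie"])].reset_index(drop=True)

# acceptable=False: remove rows whose value is in the list.
cleaned = kept[~kept["startYear"].isin(["\\N"])].reset_index(drop=True)

print(kept)
print(cleaned)
```

When the dictionary holds several columns, the masks are applied one after another, so a row survives only if it passes every column's condition.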

Further filtering of `raw_principals`.

In [88]:
# We only want to consider some principal cast/crew of a title, namely the director, composer, actor(s), actress(es)
# We filter the raw_principal data to only contain rows where the cast/crew fit the above roles. 
col_values = {'category': ['director', 'composer', 'actor', 'actress']}
principals = filter_df(col_values, raw_principals)

# Remove unused columns
principals.drop(['ordering', 'job'], axis=1, inplace=True)
principals

Unnamed: 0,tconst,nconst,category,characters
0,tt0000001,nm0005690,director,\N
1,tt0000002,nm0721526,director,\N
2,tt0000002,nm1335271,composer,\N
3,tt0000003,nm0721526,director,\N
4,tt0000003,nm1335271,composer,\N
...,...,...,...,...
23263643,tt9916880,nm0254176,actress,"[""Moody Margaret""]"
23263644,tt9916880,nm0286175,actor,"[""Dad"",""Aerobic Al"",""Nasty Nicola""]"
23263645,tt9916880,nm10535738,actress,"[""Horrid Henry""]"
23263646,tt9916880,nm0996406,director,\N
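
The `characters` values shown above look like JSON arrays stored as text (for example `["Dad","Aerobic Al","Nasty Nicola"]`), with `\N` where no character applies. Assuming that holds for every row, a hypothetical helper such as `parse_characters` below could turn each cell into a Python list. It is a sketch and is not used elsewhere in this notebook.

```python
import json

def parse_characters(raw):
    """Hypothetical helper: convert a `characters` cell into a list of names.

    Assumes the cell is either the marker \\N or a valid JSON array of strings.
    """
    if raw == "\\N":
        return []
    return json.loads(raw)

print(parse_characters('["Dad","Aerobic Al","Nasty Nicola"]'))
print(parse_characters("\\N"))
```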


Further filtering of `raw_names`.

In [69]:
# drop unused columns
names = raw_names[['nconst', 'primaryName']]
names

Unnamed: 0,nconst,primaryName
0,nm0000001,Fred Astaire
1,nm0000002,Lauren Bacall
2,nm0000003,Brigitte Bardot
3,nm0000004,John Belushi
4,nm0000005,Ingmar Bergman
...,...,...
10580188,nm9993714,Romeo del Rosario
10580189,nm9993716,Essias Loberg
10580190,nm9993717,Harikrishnan Rajan
10580191,nm9993718,Aayush Nair


Further filtering of `raw_titles`.

In [70]:
# We only want to consider movie titles
# We filter the raw_titles data to only contain rows where the titleType corresponds to a movie.
col_values = {'titleType': ['movie', 'tvMovie']}
titles = filter_df(col_values, raw_titles)

# drop rows where the start year or the runtime minutes are \N
col_values = {'startYear': [r'\N'], 'runtimeMinutes': [r'\N']}
titles = filter_df(col_values, titles, False)

# Remove unused columns 
titles.drop(['titleType', 'originalTitle', 'endYear'], axis=1, inplace=True)
titles

Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0000009,Miss Jerry,0,1894,45,Romance
1,tt0000502,Bohemios,0,1905,100,\N
2,tt0000574,The Story of the Kelly Gang,0,1906,70,"Biography,Crime,Drama"
3,tt0000679,The Fairylogue and Radio-Plays,0,1908,120,"Adventure,Fantasy"
4,tt0001184,Don Juan de Serrallonga,0,1910,58,"Adventure,Drama"
...,...,...,...,...,...,...
418541,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,0,2015,57,Documentary
418542,tt9916680,De la ilusión al desconcierto: cine colombiano...,0,2007,100,Documentary
418543,tt9916692,Teatroteka: Czlowiek bez twarzy,0,2015,66,Drama
418544,tt9916730,6 Gunn,0,2017,116,\N
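
Because every file was read with `dtype='string'`, `startYear` and `runtimeMinutes` are still text at this point. If numeric comparisons are needed later, one possible conversion is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable (such as a leftover `\N`) into a missing value. The toy rows below are illustrative.

```python
import pandas as pd

# Illustrative rows; one startYear is deliberately left as the \N marker.
toy = pd.DataFrame(
    {
        "startYear": ["1894", "1905", "\\N"],
        "runtimeMinutes": ["45", "100", "58"],
    }
)

# Convert text to numbers; values that cannot be parsed become missing.
for col in ["startYear", "runtimeMinutes"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Numeric comparisons now behave as expected.
long_films = toy[toy["runtimeMinutes"] >= 58]
print(long_films)
```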


### Joining of IMDB Data

Join dataframes to create the `movies` dataframe. Each row describes one title and contains the following columns (the last three hold lists with one entry per cast/crew member):
* `tconst`
* `primaryTitle`
* `isAdult`
* `startYear`
* `runtimeMinutes`
* `genres`
* `primaryName`
* `category`
* `characters`

In [89]:
cast_crew = names.merge(principals, on='nconst')
cast_crew

Unnamed: 0,nconst,primaryName,tconst,category,characters
0,nm0000001,Fred Astaire,tt0025164,actor,"[""Guy Holden""]"
1,nm0000001,Fred Astaire,tt0026942,actor,"[""Huck Haines""]"
2,nm0000001,Fred Astaire,tt0027125,actor,"[""Jerry Travers""]"
3,nm0000001,Fred Astaire,tt0027630,actor,"[""Bake Baker""]"
4,nm0000001,Fred Astaire,tt0028333,actor,"[""Lucky Garnett""]"
...,...,...,...,...,...
23237504,nm9993709,Lu Bevins,tt11772842,actress,"[""Lu""]"
23237505,nm9993709,Lu Bevins,tt11772858,director,\N
23237506,nm9993709,Lu Bevins,tt11772904,director,\N
23237507,nm9993709,Lu Bevins,tt11772940,director,\N
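
`merge` defaults to an inner join, and the result above has fewer rows than `principals` (the last index drops from 23263646 to 23237507), which suggests that some `nconst` values have no entry in `names`. A way to inspect such rows, sketched on made-up data, is a left join with `indicator=True`:

```python
import pandas as pd

# Made-up identifiers; nm3 has no matching row in names_toy.
names_toy = pd.DataFrame(
    {"nconst": ["nm1", "nm2"], "primaryName": ["A. Actor", "D. Director"]}
)
principals_toy = pd.DataFrame(
    {
        "nconst": ["nm1", "nm2", "nm3"],
        "tconst": ["tt1", "tt1", "tt2"],
        "category": ["actor", "director", "composer"],
    }
)

# A left join keeps every principal; _merge records where each row came from.
checked = principals_toy.merge(names_toy, on="nconst", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(unmatched[["nconst", "tconst", "category"]])
```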


In [91]:
grouped_cast_crew = cast_crew.groupby(['tconst'], as_index=False)[['primaryName', 'category', 'characters']].agg(list)
grouped_cast_crew

Unnamed: 0,tconst,primaryName,category,characters
0,tt0000001,[William K.L. Dickson],[director],[\N]
1,tt0000002,"[Émile Reynaud, Gaston Paulin]","[director, composer]","[\N, \N]"
2,tt0000003,"[Émile Reynaud, Gaston Paulin]","[director, composer]","[\N, \N]"
3,tt0000004,"[Émile Reynaud, Gaston Paulin]","[director, composer]","[\N, \N]"
4,tt0000005,"[William K.L. Dickson, Charles Kayser, John Ott]","[director, actor, actor]","[\N, [""Blacksmith""], [""Assistant""]]"
...,...,...,...,...
5211916,tt9916848,"[Burcu Güven, Pelin Akil, Deniz Yorulmazer, Se...","[composer, actress, director, director, actor,...","[\N, [""Zehra""], \N, \N, [""Cetin Ertas""], [""Ali..."
5211917,tt9916850,"[Burcu Güven, Pelin Akil, Deniz Yorulmazer, Se...","[composer, actress, director, director, actor,...","[\N, [""Zehra""], \N, \N, [""Cetin Ertas""], [""Ali..."
5211918,tt9916852,"[Burcu Güven, Pelin Akil, Deniz Yorulmazer, Se...","[composer, actress, director, director, actor,...","[\N, [""Zehra""], \N, \N, [""Cetin Ertas""], [""Ali..."
5211919,tt9916856,"[Johan Planefeldt, Andreas Demmel, Kathrin Knö...","[director, actor, actress, actress, composer, ...","[\N, [""Stephan""], [""Kathi""], [""Sandra""], \N, [..."
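
The aggregation above collapses each title's rows into lists that stay aligned by position: the first name belongs with the first category, and so on. A small sketch using two of the titles shown in the output illustrates this and pairs the entries back up with `zip`:

```python
import pandas as pd

# Rows taken from the first titles in the output above.
toy = pd.DataFrame(
    {
        "tconst": ["tt0000001", "tt0000002", "tt0000002"],
        "primaryName": ["William K.L. Dickson", "Émile Reynaud", "Gaston Paulin"],
        "category": ["director", "director", "composer"],
    }
)

# One row per title; each selected column becomes a list in original row order.
grouped = toy.groupby("tconst", as_index=False)[["primaryName", "category"]].agg(list)

# Entries at the same position describe the same person.
pairs = list(zip(grouped.loc[1, "primaryName"], grouped.loc[1, "category"]))
print(pairs)
```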


In [92]:
movies = titles.merge(grouped_cast_crew, on='tconst')
movies.to_csv('./movies.csv')
movies

Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,genres,primaryName,category,characters
0,tt0000009,Miss Jerry,0,1894,45,Romance,"[Blanche Bayliss, Alexander Black, William Cou...","[actress, director, actor, actor]","[[""Miss Geraldine Holbrook (Miss Jerry)""], \N,..."
1,tt0000502,Bohemios,0,1905,100,\N,"[Ricardo de Baños, Antonio del Pozo, El Mochuelo]","[director, actor, actor]","[\N, \N, \N]"
2,tt0000574,The Story of the Kelly Gang,0,1906,70,"Biography,Crime,Drama","[Bella Cola, Charles Tait, Elizabeth Tait, Joh...","[actress, director, actress, actor, composer, ...","[\N, \N, [""Kate Kelly""], [""School Master""], \N..."
3,tt0000679,The Fairylogue and Radio-Plays,0,1908,120,"Adventure,Fantasy","[L. Frank Baum, Francis Boggs, Frank Burns, Na...","[actor, director, actor, composer, director, a...","[[""The Wizard of Oz Man""], \N, [""His Majesty t..."
4,tt0001184,Don Juan de Serrallonga,0,1910,58,"Adventure,Drama","[Ricardo de Baños, Alberto Marro, Dolores Puch...","[director, director, actress, actor]","[\N, \N, \N, \N]"
...,...,...,...,...,...,...,...,...,...
409726,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,0,2015,57,Documentary,"[Angela Gurgel, Ana Célia de Oliveira, Oldair ...","[director, director, actor]","[\N, \N, [""Rodolpho Teophilo""]]"
409727,tt9916680,De la ilusión al desconcierto: cine colombiano...,0,2007,100,Documentary,[Luis Ospina],[director],[\N]
409728,tt9916692,Teatroteka: Czlowiek bez twarzy,0,2015,66,Drama,"[Zbigniew Zamachowski, Andrzej Bartnikowski, S...","[actor, director, actor, actress, composer, ac...","[[""Authority""], \N, [""Rafal""], [""Monika""], \N,..."
409729,tt9916730,6 Gunn,0,2017,116,\N,"[Sunil Barve, Kiran Gawade, Bhushan Pradhan, A...","[actor, director, actor, actor, actor]","[\N, \N, \N, \N, \N]"
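
One thing to keep in mind when `movies.csv` is read back: CSV has no list type, so the list columns come back as plain strings. A possible fix, shown here as a sketch on a one-row in-memory example, is to pass `ast.literal_eval` as a converter for those columns.

```python
import ast
import io
import pandas as pd

# One illustrative row with list-valued columns.
toy = pd.DataFrame(
    {
        "tconst": ["tt0000002"],
        "primaryName": [["Émile Reynaud", "Gaston Paulin"]],
        "category": [["director", "composer"]],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: the list columns come back as text.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Read with converters: the text is parsed back into Python lists.
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    converters={"primaryName": ast.literal_eval, "category": ast.literal_eval},
)

print(type(as_text.loc[0, "primaryName"]).__name__)
print(restored.loc[0, "primaryName"])
```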


### Webscraping IMDB Reviews

In [11]:
from bs4 import BeautifulSoup
from nltk.stem.snowball import SnowballStemmer
import requests
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
import re

def clean_rating(rating):
    """
    Retrieves rating if rating exists. Else, gives an empty string.

    Parameters:
    rating (list): result of find_all() for the rating bar of one review; empty if the review has no rating

    Returns:
    Rating e.g. "9/10", or '' if there is none

    """
    cleaned_rating = rating[0].text.strip() if len(rating) else ''
    return cleaned_rating
        
def form_url(row):
    """
    Gives the url for the review page of a movie 

    Parameters:
    row (Series): row that contains movie tconst id 

    Returns:
    url of the movie review page

   """
    url = 'https://www.imdb.com/title/' + row.tconst + '/reviews'
    return url

def get_review_html(url):
    """
    Retrieves html of the review page of a movie.

    Parameters:
    url (string): url of the movie reviews page 

    Returns:
    html of the movie review page 

   """
    response = requests.get(url)
    return response.text

def get_review_info(soup):
    """
    Retrieves all review information of the currently visible reviews on a movie review page. Will discard reviews
    where a numerical rating was not included.

    Parameters:
    soup (BeautifulSoup): all html of current page

    Returns:
    List of dictionaries of the form {'username': 'user_101', 'rating': '9/10', 'review': 'It was a good movie.'}
    List will contain a dictionary object for each review that is currently visible on the page.

   """
    # get all review html of a current page in a list (each review is an element)
    content = soup.find_all("div", class_="lister-item-content")
    
    content_list = []
    
    # for each review on the page 
    for c in content:
        # get the rating
        rating = clean_rating(c.find_all("div", class_="ipl-ratings-bar"))
        
        # if there is a rating for this review
        if rating:
            content_dict = {}
            
            # retrieve username and assign in dictionary
            content_dict['username'] = c.find_all("span", class_="display-name-link")[0].text
            
            # assign rating in dictionary
            content_dict['rating'] = rating
            
            # retrieve textual review and assign in dictionary
            content_dict['review'] = c.find_all("div", class_="text show-more__control")[0].text
            
            # append entire review to list
            content_list.append(content_dict)
        
    return content_list
    
def get_reviews(html, tconst):
    """
    Retrieves all review information of a movie. 

    Parameters:
    html (string): html of the initial reviews page of the movie
    tconst (string): tconst id of the movie, used as the key of the returned dictionary

    Returns:
    Dictionary of the form {'tt0002423': [{'username': ..., 'rating': ..., 'review': ...}, ...]}
    where the key of the outer dictionary is the tconst of the title, and the value of the outer dictionary is a list
    of all reviews corresponding to that tconst. Returns None if the initial page has no rated reviews.

    """
    soup = BeautifulSoup(html, 'html.parser')
    
    # get usernames, reviews, ratings on first page
    all_reviews = get_review_info(soup)
    
    # if there are no reviews on the initial page, then return None
    if not all_reviews:
        return None
    
    # if there are more reviews than what's on the initial page, there should be a div with a data-key attribute
    more_reviews = soup.select('div[data-key]')
    
    # get data_ajax_url - only needed (and only looked up) if there are more movie reviews
    data_ajax_url = soup.select('div[data-ajaxurl]')[0]['data-ajaxurl'] if more_reviews else None
    
    # while there are more reviews, update soup
    while len(more_reviews) > 0:
        data_key = more_reviews[0]['data-key']

        # form url that is requested when clicking on "load more reviews"
        url = f'https://www.imdb.com{data_ajax_url}?ref_=undefined&paginationKey={data_key}'
        
        # create updated BeautifulSoup object using html from new url 
        html = get_review_html(url)
        soup = BeautifulSoup(html, 'html.parser')
        
        # get updated usernames, reviews, ratings
        all_reviews.extend(get_review_info(soup))
        
        # check if there are more reviews
        more_reviews = soup.select('div[data-key]')
        
    return {tconst: all_reviews}
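
The parsing logic can be tried without any network access. The sketch below runs the same selectors on a hand-written HTML snippet. The class names and the `data-key` / `data-ajaxurl` attributes come from the functions above, while the snippet itself, the `load-more-data` class and the attribute values are assumptions for illustration. IMDB's real markup may differ or may have changed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a reviews page: one rated review, one unrated review,
# and a pagination div (its class name and attribute values are made up).
sample_html = """
<div class="lister-item-content">
  <div class="ipl-ratings-bar"> 9/10 </div>
  <span class="display-name-link">user_101</span>
  <div class="text show-more__control">It was a good movie.</div>
</div>
<div class="lister-item-content">
  <span class="display-name-link">user_102</span>
  <div class="text show-more__control">No score given.</div>
</div>
<div class="load-more-data" data-key="abc123" data-ajaxurl="/title/tt0000000/reviews/_ajax"></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rated = []
for item in soup.find_all("div", class_="lister-item-content"):
    bar = item.find("div", class_="ipl-ratings-bar")
    if bar is None:
        continue  # reviews without a numerical rating are skipped
    rated.append(
        {
            "username": item.find("span", class_="display-name-link").text,
            "rating": bar.text.strip(),
            "review": item.find("div", class_="text show-more__control").text,
        }
    )

# The pagination key is present only when more reviews can be loaded.
pagination = soup.select("div[data-key]")
next_key = pagination[0]["data-key"] if pagination else None

print(rated)
print(next_key)
```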

In [86]:
# movies = pd.read_csv('./movies.csv', index_col=0)
# resume scraping after row 279460, the last row covered by the previous batch
movies_continue = movies.iloc[279460+1:,:]

In [87]:
review_dict = {}
for index, row in movies_continue.iterrows():
    print(f'Processing movie {index}...', end='\r')
    url = form_url(row)
    html = get_review_html(url)
    movie_reviews = get_reviews(html, row.tconst) 
    if movie_reviews:
        review_dict.update(movie_reviews)

Processing movie 279549...

KeyboardInterrupt: 
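
A loop over hundreds of thousands of pages is likely to hit slow or failed requests, and `requests.get` waits indefinitely unless a `timeout` is passed. As a sketch only, a hypothetical wrapper like `fetch_html` below adds a timeout and a few retries with a growing pause. The name, defaults and retry policy are assumptions, not part of this notebook.

```python
import time
import requests

def fetch_html(url, session, timeout=10, retries=3, backoff=1.0):
    """Hypothetical wrapper: GET a page with a timeout and a few retries.

    `session` is any object with a requests-style .get(url, timeout=...) method,
    for example requests.Session().
    """
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()  # treat HTTP error codes as failures
            return response.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (attempt + 1))  # wait a little longer each time

# Example use (not executed here because it needs network access):
# session = requests.Session()
# html = fetch_html('https://www.imdb.com/title/tt0000009/reviews', session)
```

Reusing one `requests.Session()` across the loop also lets connections be reused between requests.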

In [82]:
review_df = pd.concat({k: pd.DataFrame(v).set_index('username') for k, v in review_dict.items() if v}, axis=0)
review_df.index.names = ['tconst', 'username']
review_df

tconst,username,rating,review
tt1424746,hugowery,9/10,"""Apenas o Fim"" is a very well done movie. The ..."
tt1424746,felipepepe,4/10,The film well is directed and bold for Brazili...
tt1424746,splashmuch,1/10,The plot is about having no plot at all. Point...
tt1424769,jason-762-54425,9/10,Our world is becoming smaller as our cultures ...
tt1424797,FrenchEddieFelson,7/10,J'ai tué ma mère (2009) is the first film of X...
...,...,...,...
tt2145803,manisg,10/10,Another excellent world class movie from Tamil...
tt2145803,saranjoker,8/10,Its my first review for a Tamil Movie and I am...
tt2145803,sribornagain-394-460163,9/10,"""Mouna Guru"" will easily qualify as the most w..."
tt2145803,kamalbeeee,8/10,I have watch second time this movie.. Awesome ...
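
The `rating` column is stored as text such as `9/10`. If a numeric score is needed downstream, one possible conversion (a sketch on three of the ratings shown above; the `score` column name is made up) extracts the part before the slash:

```python
import pandas as pd

toy = pd.DataFrame({"rating": ["9/10", "4/10", "10/10"]})

# Capture the digits before the slash and convert them to integers.
toy["score"] = toy["rating"].str.extract(r"^(\d+)/", expand=False).astype(int)

print(toy)
```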


In [85]:
review_df.to_csv('./reviews/raw_reviews/raw_reviews_279460.csv')

In [152]:
# combining separate review csvs into one csv, all_reviews.csv

import glob

all_reviews_filenames = sorted(glob.glob("./reviews/raw_reviews/*.csv"))

# 'a' appends, so re-running this cell adds the same rows to all_reviews.csv again
with open('./reviews/all_reviews.csv', 'a', encoding='utf-8') as combined_csv_file:
    for i, file in enumerate(all_reviews_filenames):
        print(f'file: {file}')
        with open(file, encoding='utf-8') as f:
            # keep the header row of the first file only
            if i > 0:
                f.readline()
            for line in f:
                combined_csv_file.write(line)

file: ./reviews/raw_reviews\raw_reviews_125796.csv
file: ./reviews/raw_reviews\raw_reviews_154246.csv
file: ./reviews/raw_reviews\raw_reviews_171222.csv
file: ./reviews/raw_reviews\raw_reviews_226566.csv
file: ./reviews/raw_reviews\raw_reviews_241044.csv
file: ./reviews/raw_reviews\raw_reviews_279460.csv
file: ./reviews/raw_reviews\raw_reviews_5933.csv
file: ./reviews/raw_reviews\raw_reviews_81551.csv
file: ./reviews/raw_reviews\raw_reviews_91745.csv


In [153]:
combined_csv_file.close()
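
An alternative to copying lines by hand is to let pandas parse each batch file and concatenate the results, which handles the header rows automatically. The sketch below demonstrates the idea on two tiny files written to a temporary directory. The file names and rows are made up, and the real batch files may be too large to hold in memory comfortably.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two made-up batch files with the same columns as the review csvs.
    pd.DataFrame(
        {"tconst": ["tt1"], "username": ["a"], "rating": ["9/10"], "review": ["Good."]}
    ).to_csv(os.path.join(tmp, "raw_reviews_1.csv"), index=False)
    pd.DataFrame(
        {"tconst": ["tt2"], "username": ["b"], "rating": ["4/10"], "review": ["Slow."]}
    ).to_csv(os.path.join(tmp, "raw_reviews_2.csv"), index=False)

    # Read every batch file and stack them into one dataframe.
    paths = sorted(glob.glob(os.path.join(tmp, "raw_reviews_*.csv")))
    combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

print(combined)
```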