# Grace Techau
## Box Office Revenue & Letterboxd Ratings Project 
### NOTEBOOK 2 
### Cleaning Letterboxd Movie Data - 2017

In [2]:
# Import all required packages
import pandas as pd 
import re 

### Merge raw scraped data files for Letterboxd movies in 2017

The top 25% most popular Letterboxd movies of 2017 were scraped in four parts and saved in separate CSV files. The pages covered by each file are listed below: 
- Pages 1 to 37.5 are captured in 'letterboxd_movie_data_2017_part1.1.csv'
- Pages 37.5 to 42.5 are captured in 'letterboxd_movie_data_2017_part1.2.csv'
- Pages 42.5 to 55.5 are captured in 'letterboxd_movie_data_2017_part1.3.csv'
- Pages 55.5 to 66 are captured in 'letterboxd_movie_data_2017_part1.4.csv'

These CSV files need to be merged into a single Pandas data frame holding the top 25% most popular 2017 Letterboxd movies, so that all the data can be cleaned in one place.

In [5]:
# Read in 2017 scraping part 1 to a pandas data frame 

raw_2017_1 = pd.read_csv("letterboxd_movie_data_2017_raw_1.csv", encoding='utf-8')

print(f"Number of movies in 2017 part 1: {len(raw_2017_1)}")

Number of movies in 2017 part 1: 2622


In [6]:
# Read in 2017 scraping part 2 to a pandas data frame 

raw_2017_2 = pd.read_csv("letterboxd_movie_data_2017_raw_2.csv", encoding='utf-8')

print(f"Number of movies in 2017 part 2: {len(raw_2017_2)}")

Number of movies in 2017 part 2: 380


In [7]:
# Read in 2017 scraping part 3 to a pandas data frame 

raw_2017_3 = pd.read_csv("letterboxd_movie_data_2017_raw_3.csv", encoding='utf-8')

print(f"Number of movies in 2017 part 3: {len(raw_2017_3)}")

Number of movies in 2017 part 3: 944


In [8]:
# Read in 2017 scraping part 4 to a pandas data frame 

raw_2017_4 = pd.read_csv("letterboxd_movie_data_2017_raw_4.csv", encoding='utf-8')

print(f"Number of movies in 2017 part 4: {len(raw_2017_4)}")

Number of movies in 2017 part 4: 864


In [9]:
# Merge the different data frames to one collective 2017 pandas data frame 

raw_movie_data_2017 = pd.concat([raw_2017_1, raw_2017_2, raw_2017_3, raw_2017_4], axis=0, ignore_index=True)

print(f"Total number of movies in 2017: {len(raw_movie_data_2017)}")
print("-"*50)
display(raw_movie_data_2017.head(5))

Total number of movies in 2017: 4810
--------------------------------------------------


| | title | year | number_ratings | average_rating | length | genres |
|---|---|---|---|---|---|---|
| 0 | Get Out | 2017.0 | Weighted average of 4.16 based on 2,706,282 ra... | 4.2 | 104 mins More at IMDB TMDB | Horror, Mystery, Thriller, Horror, The Undead ... |
| 1 | Lady Bird | 2017.0 | Weighted average of 3.85 based on 2,062,281 ra... | 3.8 | 94 mins More at IMDB TMDB | Comedy, Drama, Moving Relationship Stories, Un... |
| 2 | Call Me by Your Name | 2017.0 | Weighted average of 3.85 based on 1,687,099 ra... | 3.9 | 132 mins More at IMDB TMDB | Romance, Drama, Moving Relationship Stories, H... |
| 3 | Baby Driver | 2017.0 | Weighted average of 3.71 based on 1,933,993 ra... | 3.7 | 113 mins More at IMDB TMDB | Crime, Action, Crime, Drugs And Gangsters, Son... |
| 4 | Blade Runner 2049 | 2017.0 | Weighted average of 4.12 based on 1,411,904 ra... | 4.1 | 164 mins More at IMDB TMDB | Science Fiction, Drama, Humanity And The World... |
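
The four read cells and the concat cell follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that, not the code that produced the outputs above. `load_parts` is a hypothetical helper name, and the demo runs on two tiny throwaway files with made-up rows rather than the scraped data.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_parts(paths):
    """Read each CSV part and stack the parts into one data frame."""
    frames = []
    for path in paths:
        frame = pd.read_csv(path, encoding="utf-8")
        print(f"{Path(path).name}: {len(frame)} rows")
        frames.append(frame)
    # ignore_index=True renumbers the rows 0..n-1 across all parts
    return pd.concat(frames, axis=0, ignore_index=True)


# Demo on two throwaway files (made-up rows, not the scraped data)
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for n, titles in enumerate([["A", "B"], ["B", "C"]], start=1):
        path = Path(tmp) / f"demo_part_{n}.csv"
        pd.DataFrame({"title": titles}).to_csv(path, index=False)
        demo_paths.append(path)
    demo = load_parts(demo_paths)

print(f"Total rows: {len(demo)}")
```

In this notebook the call would look like `load_parts([f"letterboxd_movie_data_2017_raw_{n}.csv" for n in range(1, 5)])`, assuming the four files sit in the working directory.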


### Clean complete 2017 Letterboxd movie data set 

In [11]:
# Make a copy of the raw data frame for cleaning
clean_movie_data_2017 = raw_movie_data_2017.copy()

Since the 2017 Letterboxd movie data was scraped in batches and some of those batches ended in the middle of a page, there are duplicates present. I scraped the ending page of each batch as the starting page of the next batch to make sure all movies were scraped. Now those duplicates can be removed. 

In [13]:
duplicates_raw = raw_movie_data_2017[raw_movie_data_2017.duplicated(keep=False)]
print(f"Total duplicate records in raw 2017 data frame: {len(duplicates_raw)}")
print("-"*50)

clean_movie_data_2017 = clean_movie_data_2017.drop_duplicates(keep=False)

duplicates_clean = clean_movie_data_2017[clean_movie_data_2017.duplicated(keep=False)]
print(f"Total duplicate records in clean 2017 data frame: {len(duplicates_clean)}")

Total duplicate records in raw 2017 data frame: 125
--------------------------------------------------
Total duplicate records in clean 2017 data frame: 0
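
One detail of the cell above is worth checking: `drop_duplicates(keep=False)` removes every copy of a duplicated row, so a movie that was scraped twice disappears from the data frame entirely. If the intent is to keep one copy of each overlapping movie, `keep='first'` does that. A minimal sketch on a made-up toy frame (not the scraped data) showing the difference:

```python
import pandas as pd

# Toy frame: "B" appears twice, as if two batches shared a page
toy = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "year": [2017, 2017, 2017, 2017],
})

dropped_all = toy.drop_duplicates(keep=False)    # removes both copies of "B"
kept_first = toy.drop_duplicates(keep="first")   # keeps one copy of "B"

print(list(dropped_all["title"]))  # ['A', 'C']
print(list(kept_first["title"]))   # ['A', 'B', 'C']
```

Switching the notebook to `keep='first'` would change the row counts reported in the later cells, so those outputs would need to be regenerated.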


The analysis for this project uses the rating data from Letterboxd, so movies with no rating available are not relevant and can be dropped. 

In [15]:
no_rating_data = clean_movie_data_2017[(clean_movie_data_2017['average_rating'] == 'No average rating available') & (clean_movie_data_2017['number_ratings'] == 'No number of ratings available')].index

print("# rows where no average rating and no number of ratings was available")
print("-"*50)
display(len(clean_movie_data_2017.loc[no_rating_data]))

# Drop these rows 
clean_movie_data_2017 = clean_movie_data_2017.drop(no_rating_data)

print("\n")
print("length of the data frame after dropping rows with no rating data")
print("-"*50)
print(len(clean_movie_data_2017))

# rows where no average rating and no number of ratings was available
--------------------------------------------------


1433



length of the data frame after dropping rows with no rating data
--------------------------------------------------
3252
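
The filter above drops a row only when both placeholder strings appear together (`&`). A minimal sketch on made-up rows of how `&` and `|` differ, in case rows missing just one of the two values ever need to be found. The toy rows are hypothetical and only the two placeholder strings come from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "average_rating": ["4.2", "No average rating available", "3.1"],
    "number_ratings": [
        "Weighted average of 4.16 based on 10 ratings",
        "No number of ratings available",
        "No number of ratings available",
    ],
})

no_avg = toy["average_rating"] == "No average rating available"
no_count = toy["number_ratings"] == "No number of ratings available"

both_missing = toy[no_avg & no_count]    # rows missing both values
either_missing = toy[no_avg | no_count]  # rows missing at least one value

print(list(both_missing["title"]))    # ['B']
print(list(either_missing["title"]))  # ['B', 'C']
```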


Remove the text from the 'length' column so that only the numeric minutes value remains. \
For example, change '164 mins More at IMDB TMDB' to '164'.

In [17]:
# Keep only the numeric value from the string of text in the length column
clean_movie_data_2017.loc[:, 'length'] = clean_movie_data_2017['length'].str.replace(r'\D', '', regex=True)

print("length column before cleaning")
print("-"*50)
display(raw_movie_data_2017['length'].head(10))
print("\n")
print("length column after cleaning")
print("-"*50)
display(clean_movie_data_2017['length'].head(10))

length column before cleaning
--------------------------------------------------


0    104 mins   More at IMDB TMDB
1     94 mins   More at IMDB TMDB
2    132 mins   More at IMDB TMDB
3    113 mins   More at IMDB TMDB
4    164 mins   More at IMDB TMDB
5    105 mins   More at IMDB TMDB
6    133 mins   More at IMDB TMDB
7    131 mins   More at IMDB TMDB
8    135 mins   More at IMDB TMDB
9    107 mins   More at IMDB TMDB
Name: length, dtype: object



length column after cleaning
--------------------------------------------------


0    104
1     94
2    132
3    113
4    164
5    105
6    133
7    131
8    135
9    107
Name: length, dtype: object
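
Stripping every non-digit works for the format shown above, but it would also glue together any other digits that happen to appear in the string. A slightly stricter alternative is to capture only the digits directly before "min". The sketch below assumes the 'NNN mins ...' format seen in the output, and the third row is a hypothetical case with no runtime.

```python
import pandas as pd

lengths = pd.Series([
    "164 mins   More at IMDB TMDB",
    "94 mins   More at IMDB TMDB",
    "More at IMDB TMDB",  # hypothetical row with no runtime
])

# Capture only the digits directly before "min"; rows without a match become NaN
minutes = lengths.str.extract(r"(\d+)\s*min", expand=False)

# The nullable integer dtype keeps a missing runtime as <NA> instead of 0
minutes = pd.to_numeric(minutes).astype("Int64")

print(minutes)
```

Keeping missing runtimes as `<NA>` rather than 0 means they are skipped by default in calculations such as `mean()`, which may or may not be what the later analysis needs.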

Remove the text from the 'number_ratings' column so that only the number of ratings remains. \
For example, change 'Weighted average of 3.03 based on 288 ratings' to '288'.

In [19]:
# using pandas .str.extract() method to keep only the numerical value after string of text 'based on'

clean_movie_data_2017['number_ratings'] = clean_movie_data_2017['number_ratings'].str.extract(r'based on ([\d,]+)').replace({',': ''}, regex=True)

print("number_ratings column before cleaning")
print("-"*50)
display(raw_movie_data_2017['number_ratings'].head(10))
print("\n")
print("number_ratings column after cleaning")
print("-"*50)
display(clean_movie_data_2017['number_ratings'].head(10))

number_ratings column before cleaning
--------------------------------------------------


0    Weighted average of 4.16 based on 2,706,282 ra...
1    Weighted average of 3.85 based on 2,062,281 ra...
2    Weighted average of 3.85 based on 1,687,099 ra...
3    Weighted average of 3.71 based on 1,933,993 ra...
4    Weighted average of 4.12 based on 1,411,904 ra...
5    Weighted average of 4.12 based on 1,729,448 ra...
6    Weighted average of 3.47 based on 1,688,069 ra...
7    Weighted average of 3.72 based on 1,565,028 ra...
8    Weighted average of 3.43 based on 1,553,979 ra...
9    Weighted average of 3.77 based on 1,295,362 ra...
Name: number_ratings, dtype: object



number_ratings column after cleaning
--------------------------------------------------


0    2706282
1    2062281
2    1687099
3    1933993
4    1411904
5    1729448
6    1688069
7    1565028
8    1553979
9    1295362
Name: number_ratings, dtype: object
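
The extraction pattern can be tried on a couple of sample strings before running it on the full column. This is a small sketch of a Series-based variant. The second string is the example from the text above, and the first is constructed in the same format for illustration.

```python
import pandas as pd

ratings_text = pd.Series([
    "Weighted average of 4.16 based on 2,706,282 ratings",  # constructed example
    "Weighted average of 3.03 based on 288 ratings",
])

counts = (
    ratings_text
    .str.extract(r"based on ([\d,]+)", expand=False)  # expand=False returns a Series
    .str.replace(",", "", regex=False)                # drop thousands separators
    .astype(int)
)

print(counts.tolist())  # [2706282, 288]
```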

Some movies list ten or more genres. For the scope of this project, only the first three unique genres are kept. 

In [21]:
def clean_genres(genre_str):
    genres_list = genre_str.split(',')

    # remove duplicates while preserving order 
    unique_genres = list(dict.fromkeys(genre.strip() for genre in genres_list)) 
    
    # return only the first three unique genres 
    return ', '.join(unique_genres[:3])

clean_movie_data_2017['genres'] = clean_movie_data_2017['genres'].apply(clean_genres)

print("genres column before cleaning")
print("-"*50)
display(raw_movie_data_2017['genres'].head(10))
print("\n")

print("genres column after cleaning")
print("-"*50)
display(clean_movie_data_2017['genres'].head(10))

genres column before cleaning
--------------------------------------------------


0    Horror, Mystery, Thriller, Horror, The Undead ...
1    Comedy, Drama, Moving Relationship Stories, Un...
2    Romance, Drama, Moving Relationship Stories, H...
3    Crime, Action, Crime, Drugs And Gangsters, Son...
4    Science Fiction, Drama, Humanity And The World...
5    Adventure, Animation, Music, Family, Moving Re...
6    Action, Drama, Adventure, Science Fiction, Epi...
7    Adventure, Action, Science Fiction, Epic Heroe...
8    Horror, Horror, The Undead And Monster Classic...
9    Action, Drama, War, War And Historical Adventu...
Name: genres, dtype: object



genres column after cleaning
--------------------------------------------------


0                            Horror, Mystery, Thriller
1           Comedy, Drama, Moving Relationship Stories
2          Romance, Drama, Moving Relationship Stories
3                   Crime, Action, Drugs And Gangsters
4    Science Fiction, Drama, Humanity And The World...
5                          Adventure, Animation, Music
6                             Action, Drama, Adventure
7                   Adventure, Action, Science Fiction
8    Horror, The Undead And Monster Classics, Inten...
9                                   Action, Drama, War
Name: genres, dtype: object
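
`clean_genres` calls `.split()` directly, so it would raise an error on a missing value. That did not happen here, but a defensive variant may be useful for other years. The sketch below is a hypothetical variant (`clean_genres_safe` and `max_genres` are names introduced for illustration) that passes missing values through unchanged and skips empty entries.

```python
import pandas as pd


def clean_genres_safe(genre_str, max_genres=3):
    """Return the first `max_genres` unique genres; pass missing values through."""
    if pd.isna(genre_str):
        return genre_str
    parts = (part.strip() for part in genre_str.split(","))
    # dict.fromkeys removes duplicates while preserving order
    unique = list(dict.fromkeys(part for part in parts if part))
    return ", ".join(unique[:max_genres])


# Made-up input in the same style as the scraped genre strings
print(clean_genres_safe("Crime, Action, Crime, Drugs And Gangsters, Heists"))
# Crime, Action, Drugs And Gangsters
```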

Correct all data types. 

In [23]:
print("data types before cleaning")
print("-"*50)
print(raw_movie_data_2017.dtypes)

# Remove any white space from all columns with object data type 
clean_movie_data_2017['title'] = clean_movie_data_2017['title'].str.strip()
clean_movie_data_2017['number_ratings'] = clean_movie_data_2017['number_ratings'].str.strip()
clean_movie_data_2017['average_rating'] = clean_movie_data_2017['average_rating'].str.strip()
clean_movie_data_2017['length'] = clean_movie_data_2017['length'].str.strip()
clean_movie_data_2017['genres'] = clean_movie_data_2017['genres'].str.strip()

# Convert necessary columns to numeric data type 
clean_movie_data_2017['year'] = clean_movie_data_2017['year'].astype(int)
clean_movie_data_2017['number_ratings'] = clean_movie_data_2017['number_ratings'].astype(int)
clean_movie_data_2017['average_rating'] = clean_movie_data_2017['average_rating'].astype(float)

# There are some rows that have empty strings for length - convert these to 0's
clean_movie_data_2017['length'] = clean_movie_data_2017['length'].replace('', 0)
clean_movie_data_2017['length'] = clean_movie_data_2017['length'].astype(int)

print("\n")
print("data types after cleaning")
print("-"*50)
print(clean_movie_data_2017.dtypes)

data types before cleaning
--------------------------------------------------
title              object
year              float64
number_ratings     object
average_rating     object
length             object
genres             object
dtype: object


data types after cleaning
--------------------------------------------------
title              object
year                int32
number_ratings      int32
average_rating    float64
length              int32
genres             object
dtype: object
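
The `int32` in the output above is most likely a platform effect: `astype(int)` follows the default integer width of the machine it runs on, so the same notebook can report `int64` elsewhere. The values here fit comfortably in either width. If a consistent dtype across machines matters, the width can be named explicitly, as in this minimal sketch on made-up values:

```python
import pandas as pd

counts = pd.Series(["2706282", "288"])

# Naming the width gives the same dtype on every platform
counts_64 = counts.astype("int64")

print(counts_64.dtype)  # int64
```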


### Final Comparison of Clean and Raw Data Frames 

In [25]:
print(f"Length of raw data frame: {len(raw_movie_data_2017)}")
print("-"*50)
display(raw_movie_data_2017.head(5))
print("\n")

print(f"Length of clean data frame: {len(clean_movie_data_2017)}")
print("-"*50)
display(clean_movie_data_2017.head(5))


Length of raw data frame: 4810
--------------------------------------------------


| | title | year | number_ratings | average_rating | length | genres |
|---|---|---|---|---|---|---|
| 0 | Get Out | 2017.0 | Weighted average of 4.16 based on 2,706,282 ra... | 4.2 | 104 mins More at IMDB TMDB | Horror, Mystery, Thriller, Horror, The Undead ... |
| 1 | Lady Bird | 2017.0 | Weighted average of 3.85 based on 2,062,281 ra... | 3.8 | 94 mins More at IMDB TMDB | Comedy, Drama, Moving Relationship Stories, Un... |
| 2 | Call Me by Your Name | 2017.0 | Weighted average of 3.85 based on 1,687,099 ra... | 3.9 | 132 mins More at IMDB TMDB | Romance, Drama, Moving Relationship Stories, H... |
| 3 | Baby Driver | 2017.0 | Weighted average of 3.71 based on 1,933,993 ra... | 3.7 | 113 mins More at IMDB TMDB | Crime, Action, Crime, Drugs And Gangsters, Son... |
| 4 | Blade Runner 2049 | 2017.0 | Weighted average of 4.12 based on 1,411,904 ra... | 4.1 | 164 mins More at IMDB TMDB | Science Fiction, Drama, Humanity And The World... |




Length of clean data frame: 3252
--------------------------------------------------


| | title | year | number_ratings | average_rating | length | genres |
|---|---|---|---|---|---|---|
| 0 | Get Out | 2017 | 2706282 | 4.2 | 104 | Horror, Mystery, Thriller |
| 1 | Lady Bird | 2017 | 2062281 | 3.8 | 94 | Comedy, Drama, Moving Relationship Stories |
| 2 | Call Me by Your Name | 2017 | 1687099 | 3.9 | 132 | Romance, Drama, Moving Relationship Stories |
| 3 | Baby Driver | 2017 | 1933993 | 3.7 | 113 | Crime, Action, Drugs And Gangsters |
| 4 | Blade Runner 2049 | 2017 | 1411904 | 4.1 | 164 | Science Fiction, Drama, Humanity And The World... |


### Save clean complete 2017 Letterboxd movie details for 25% most popular movies to CSV file 

In [43]:
clean_movie_data_2017.to_csv("letterboxd_movie_data_2017_clean.csv", header=True, index=False, encoding='utf-8')