# Grace Techau
## Box Office Revenue & Letterboxd Ratings Project 
### NOTEBOOK 4
### Cleaning Letterboxd Movie Data - 2019

In [2]:
# Import all required packages
import pandas as pd 

### Merge raw scraped data files for Letterboxd movies in 2019

The top 25% most popular Letterboxd movies of 2019 were scraped in three parts and saved in separate CSV files. The pages scraped per file are listed below:
- Pages 1 to 14 are captured in letterboxd_movie_data_2019_raw_1.csv
- Pages 14 to 45 are captured in letterboxd_movie_data_2019_raw_2.csv
- Pages 46 to 75 are captured in letterboxd_movie_data_2019_raw_3.csv

These CSV files need to be merged into a single pandas data frame covering the top 25% most popular 2019 Letterboxd movies, so that all of the data can be cleaned in one place.
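As a minimal sketch of the merge described above, the snippet below concatenates three small in-memory CSVs instead of the real files. The titles, the two-column layout, and the variable names (`part_1`, `merged`, and so on) are made up for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the three raw CSV files (made-up titles and columns)
part_1 = io.StringIO("title,year\nMovie A,2019\nMovie B,2019\n")
part_2 = io.StringIO("title,year\nMovie B,2019\nMovie C,2019\n")
part_3 = io.StringIO("title,year\nMovie D,2019\n")

# Read each part, then stack the rows into one data frame
frames = [pd.read_csv(part) for part in (part_1, part_2, part_3)]
merged = pd.concat(frames, axis=0, ignore_index=True)

print(len(merged))
print(merged.index.tolist())
```

With `ignore_index=True` the merged frame gets a fresh 0..n-1 index rather than repeating each file's own row numbers. Rows that appear in two batches (here "Movie B") are still present after the merge and have to be handled separately.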

In [5]:
# Read in 2019 scraping part 1 to a pandas data frame 

raw_2019_1 = pd.read_csv("letterboxd_movie_data_2019_raw_1.csv", encoding='utf-8')

print(f"Number of movies in 2019 part 1: {len(raw_2019_1)}")

Number of movies in 2019 part 1: 963


In [6]:
# Read in 2019 scraping part 2 to a pandas data frame 

raw_2019_2 = pd.read_csv("letterboxd_movie_data_2019_raw_2.csv", encoding='utf-8')

print(f"Number of movies in 2019 part 2: {len(raw_2019_2)}")

Number of movies in 2019 part 2: 2304


In [7]:
# Read in 2019 scraping part 3 to a pandas data frame 

raw_2019_3 = pd.read_csv("letterboxd_movie_data_2019_raw_3.csv", encoding='utf-8')

print(f"Number of movies in 2019 part 3: {len(raw_2019_3)}")

Number of movies in 2019 part 3: 2160


In [8]:
# Merge the different data frames to one collective 2019 pandas data frame 

raw_movie_data_2019 = pd.concat([raw_2019_1, raw_2019_2, raw_2019_3], axis=0, ignore_index=True)

print(f"Total number of movies in 2019: {len(raw_movie_data_2019)}")
print("-"*50)
display(raw_movie_data_2019.head(5))

Total number of movies in 2019: 5427
--------------------------------------------------


Unnamed: 0,title,year,number_ratings,average_rating,length,genres
0,Parasite,2019.0,"Weighted average of 4.55 based on 3,233,120 ra...",4.6,133 mins More at IMDB TMDB,"Comedy, Thriller, Drama, Humanity And The Worl..."
1,Joker,2019.0,"Weighted average of 3.85 based on 3,064,771 ra...",3.8,122 mins More at IMDB TMDB,"Crime, Drama, Thriller, Intense Violence And S..."
2,Midsommar,2019.0,"Weighted average of 3.77 based on 2,420,557 ra...",3.8,147 mins More at IMDB TMDB,"Mystery, Drama, Horror, Intense Violence And S..."
3,Knives Out,2019.0,"Weighted average of 3.98 based on 2,583,937 ra...",4.0,131 mins More at IMDB TMDB,"Mystery, Comedy, Crime, Thrillers And Murder M..."
4,Once Upon a Time… in Hollywood,2019.0,"Weighted average of 3.76 based on 1,977,706 ra...",3.8,162 mins More at IMDB TMDB,"Drama, Thriller, Comedy, Humanity And The Worl..."


### Clean complete 2019 Letterboxd movie data set 

In [10]:
# Make a copy of the raw data frame for cleaning
clean_movie_data_2019 = raw_movie_data_2019.copy()

Because the 2019 Letterboxd movie data was scraped in batches, and some of those batches ended in the middle of a page, duplicates are present. I scraped the ending page of each batch again as the starting page of the next batch to make sure no movies were missed. Those duplicates can now be removed.

In [12]:
duplicates_raw = raw_movie_data_2019[raw_movie_data_2019.duplicated(keep=False)]
print(f"Total duplicate records in raw 2019 data frame: {len(duplicates_raw)}")
print("-"*50)

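# Note: keep=False drops every copy of a duplicated row, so none of the duplicated records remain;
# keep='first' would retain one copy of each instead.
clean_movie_data_2019 = clean_movie_data_2019.drop_duplicates(keep=False)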

duplicates_clean = clean_movie_data_2019[clean_movie_data_2019.duplicated(keep=False)]
print(f"Total duplicate records in clean 2019 data frame: {len(duplicates_clean)}")

Total duplicate records in raw 2019 data frame: 2
--------------------------------------------------
Total duplicate records in clean 2019 data frame: 0
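The `keep` argument of `drop_duplicates` decides which copies of a duplicated row survive. The sketch below compares `keep=False` with `keep="first"` on a made-up four-row frame; the titles are hypothetical and are not taken from the scraped data.

```python
import pandas as pd

# Toy frame: "Movie B" appears twice, as if it sat on a page scraped in two batches
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie B", "Movie C"],
    "year": [2019, 2019, 2019, 2019],
})

dropped_all = toy.drop_duplicates(keep=False)   # removes both copies of "Movie B"
kept_first = toy.drop_duplicates(keep="first")  # keeps one copy of "Movie B"

print(dropped_all["title"].tolist())
print(kept_first["title"].tolist())
```

If the goal is to keep each movie exactly once, `keep="first"` (the pandas default) is the variant that does so; `keep=False` removes the movie from the data set entirely.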


Some scraped rows captured nothing at all, so any row in which ALL values are null needs to be dropped from the data frame.

In [14]:
null_rows = clean_movie_data_2019.isnull().all(axis=1).sum()

print("# rows where ALL values are null values")
print("-"*50)
print(null_rows)
print("\n")

clean_movie_data_2019.dropna(how='all', inplace=True) 

print("length of the data frame after dropping null values")
print("-"*50)
print(len(clean_movie_data_2019))

# rows where ALL values are null values
--------------------------------------------------
1


length of the data frame after dropping null values
--------------------------------------------------
5424
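To illustrate why `how='all'` is used here rather than the default, the following sketch compares the two modes on a small hypothetical frame. The column values are invented for the example.

```python
import pandas as pd

# Toy frame: row 1 is entirely null, row 2 is only missing its length
toy = pd.DataFrame({
    "title": ["Movie A", None, "Movie C"],
    "length": ["120 mins", None, None],
})

# Count rows where every column is null
all_null_count = int(toy.isnull().all(axis=1).sum())

only_all_null_dropped = toy.dropna(how="all")  # drops row 1 only
any_null_dropped = toy.dropna(how="any")       # drops rows 1 and 2

print(all_null_count)
print(len(only_all_null_dropped), len(any_null_dropped))
```

Using `how="all"` keeps partially filled rows, which may still carry usable rating data, while removing rows where the scraper captured nothing.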


The analysis for this project relies on Letterboxd rating data, so any movies with no rating available are not relevant and can be dropped.

In [16]:
no_rating_data = clean_movie_data_2019[(clean_movie_data_2019['average_rating'] == 'No average rating available') & (clean_movie_data_2019['number_ratings'] == 'No number of ratings available')].index

print("# rows where no average rating and number of ratings were available")
print("-"*50)
display(len(clean_movie_data_2019.loc[no_rating_data]))

# Drop these rows
clean_movie_data_2019 = clean_movie_data_2019.drop(no_rating_data)

print("\n")
print("length of the data frame after dropping rows with no rating data")
print("-"*50)
print(len(clean_movie_data_2019))

# rows where no average rating and number of ratings were available
--------------------------------------------------


1720



length of the data frame after dropping rows with no rating data
--------------------------------------------------
3704
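The filter above requires both placeholder strings to be present (`&`). As a sketch of what that choice means, the toy frame below includes a made-up row ("Movie C") where only one of the two placeholders appears; whether such rows exist in the real data is not something this example establishes.

```python
import pandas as pd

NO_AVG = "No average rating available"
NO_COUNT = "No number of ratings available"

# Hypothetical rows: B is missing both values, C is missing only the average
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "average_rating": ["3.8", NO_AVG, NO_AVG],
    "number_ratings": [
        "Weighted average of 3.85 based on 12 ratings",
        NO_COUNT,
        "Weighted average of 2.50 based on 3 ratings",
    ],
})

both_missing = (toy["average_rating"] == NO_AVG) & (toy["number_ratings"] == NO_COUNT)
either_missing = (toy["average_rating"] == NO_AVG) | (toy["number_ratings"] == NO_COUNT)

rated_and = toy[~both_missing]   # keeps A and C
rated_or = toy[~either_missing]  # keeps A only

print(rated_and["title"].tolist())
print(rated_or["title"].tolist())
```

With `&`, a row like "Movie C" would stay in the frame and its placeholder text would later fail a conversion to `float`, so `|` is the stricter option if partially missing rating data should also be excluded.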


Remove the text from the 'length' column so that only the numerical value in minutes remains. \
For example, '164 mins More at IMDB TMDB' becomes '164'.

In [18]:
# Keep only the numeric value from the string of text in the length column
clean_movie_data_2019.loc[:, 'length'] = clean_movie_data_2019['length'].str.replace(r'\D', '', regex=True)

print("length column before cleaning")
print("-"*50)
display(raw_movie_data_2019['length'].head(10))
print("\n")
print("length column after cleaning")
print("-"*50)
display(clean_movie_data_2019['length'].head(10))

length column before cleaning
--------------------------------------------------


0    133 mins   More at IMDB TMDB
1    122 mins   More at IMDB TMDB
2    147 mins   More at IMDB TMDB
3    131 mins   More at IMDB TMDB
4    162 mins   More at IMDB TMDB
5    135 mins   More at IMDB TMDB
6    181 mins   More at IMDB TMDB
7    108 mins   More at IMDB TMDB
8    116 mins   More at IMDB TMDB
9    109 mins   More at IMDB TMDB
Name: length, dtype: object



length column after cleaning
--------------------------------------------------


0    133
1    122
2    147
3    131
4    162
5    135
6    181
7    108
8    116
9    109
Name: length, dtype: object
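Stripping every non-digit character works for the values shown above, but it turns a value with no runtime into an empty string. As an alternative sketch, `.str.extract()` can capture only the number directly before "mins" and leave a missing value otherwise. The second sample string is an assumed shape for a row without a runtime.

```python
import pandas as pd

# Sample values; the middle one is an assumed example of a row with no runtime
lengths = pd.Series([
    "133 mins   More at IMDB TMDB",
    "More at IMDB TMDB",
    "95 mins   More at IMDB TMDB",
])

# Approach used in this notebook: delete every non-digit character
digits_only = lengths.str.replace(r"\D", "", regex=True)

# Alternative: capture the digits that precede "mins"; no match gives NaN
minutes = lengths.str.extract(r"(\d+)\s*mins", expand=False)

print(digits_only.tolist())
print(minutes.isna().tolist())
```

The extract-based version would also avoid merging digits together if the text ever contained a second number.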

Remove the text from the 'number_ratings' column so that only the numerical value for the number of ratings remains. \
For example, 'Weighted average of 3.03 based on 288 ratings' becomes '288'.

In [20]:
# Use the pandas .str.extract() method to keep only the number that follows the text 'based on',
# then strip the thousands separators

clean_movie_data_2019['number_ratings'] = clean_movie_data_2019['number_ratings'].str.extract(r'based on ([\d,]+)', expand=False).str.replace(',', '', regex=False)

print("number_ratings column before cleaning")
print("-"*50)
display(raw_movie_data_2019['number_ratings'].head(10))
print("\n")
print("number_ratings column after cleaning")
print("-"*50)
display(clean_movie_data_2019['number_ratings'].head(10))

number_ratings column before cleaning
--------------------------------------------------


0    Weighted average of 4.55 based on 3,233,120 ra...
1    Weighted average of 3.85 based on 3,064,771 ra...
2    Weighted average of 3.77 based on 2,420,557 ra...
3    Weighted average of 3.98 based on 2,583,937 ra...
4    Weighted average of 3.76 based on 1,977,706 ra...
5    Weighted average of 4.15 based on 1,674,110 ra...
6    Weighted average of 3.89 based on 2,119,542 ra...
7    Weighted average of 4.03 based on 1,524,453 ra...
8    Weighted average of 3.66 based on 1,486,294 ra...
9    Weighted average of 4.03 based on 1,050,960 ra...
Name: number_ratings, dtype: object



number_ratings column after cleaning
--------------------------------------------------


0    3233120
1    3064771
2    2420557
3    2583937
4    1977706
5    1674110
6    2119542
7    1524453
8    1486294
9    1050960
Name: number_ratings, dtype: object
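To make the pattern easier to follow, here is the same extraction applied to single strings with Python's `re` module. The helper name `parse_rating_count` is hypothetical and is not part of the notebook.

```python
import re

def parse_rating_count(text):
    """Return the rating count that follows 'based on', or None if absent."""
    match = re.search(r"based on ([\d,]+)", text)
    if match is None:
        return None
    # Remove thousands separators before converting to an integer
    return int(match.group(1).replace(",", ""))

print(parse_rating_count("Weighted average of 3.03 based on 288 ratings"))
print(parse_rating_count("Weighted average of 4.55 based on 3,233,120 ratings"))
print(parse_rating_count("No number of ratings available"))
```

The character class `[\d,]+` matches digits and commas together, which is why "3,233,120" is captured as one group before the commas are removed.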

Some movies have many genres (10+). For the scope of this project, only the first three unique genres are kept.

In [22]:
def clean_genres(genre_str):
    genres_list = genre_str.split(',')

    # remove duplicates while preserving order 
    unique_genres = list(dict.fromkeys(genre.strip() for genre in genres_list)) 
    
    # return only the first three unique genres 
    return ', '.join(unique_genres[:3])

clean_movie_data_2019['genres'] = clean_movie_data_2019['genres'].apply(clean_genres)

print("genres column before cleaning")
print("-"*50)
display(raw_movie_data_2019['genres'].head(10))
print("\n")

print("genres column after cleaning")
print("-"*50)
display(clean_movie_data_2019['genres'].head(10))

genres column before cleaning
--------------------------------------------------


0    Comedy, Thriller, Drama, Humanity And The Worl...
1    Crime, Drama, Thriller, Intense Violence And S...
2    Mystery, Drama, Horror, Intense Violence And S...
3    Mystery, Comedy, Crime, Thrillers And Murder M...
4    Drama, Thriller, Comedy, Humanity And The Worl...
5    Romance, History, Drama, Moving Relationship S...
6    Science Fiction, Action, Adventure, Epic Heroe...
7    War, Drama, Comedy, Crude Humor And Satire, Hu...
8    Thriller, Horror, Horror, The Undead And Monst...
9    Horror, Drama, Thriller, Fantasy, Horror, The ...
Name: genres, dtype: object



genres column after cleaning
--------------------------------------------------


0                              Comedy, Thriller, Drama
1                               Crime, Drama, Thriller
2                               Mystery, Drama, Horror
3                               Mystery, Comedy, Crime
4                              Drama, Thriller, Comedy
5                              Romance, History, Drama
6                   Science Fiction, Action, Adventure
7                                   War, Drama, Comedy
8    Thriller, Horror, The Undead And Monster Classics
9                              Horror, Drama, Thriller
Name: genres, dtype: object
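The `clean_genres` function assumes every value is a string. A slightly more defensive variant is sketched below; the name `clean_genres_safe` and the `max_genres` parameter are hypothetical additions, and the missing-value handling is an assumption about data that may not occur in this data set.

```python
import pandas as pd

def clean_genres_safe(genre_str, max_genres=3):
    """Keep the first `max_genres` unique genres, passing missing values through."""
    if pd.isna(genre_str):
        return genre_str
    parts = (part.strip() for part in genre_str.split(","))
    # dict.fromkeys removes duplicates while preserving first-seen order
    unique = list(dict.fromkeys(part for part in parts if part))
    return ", ".join(unique[:max_genres])

print(clean_genres_safe("Thriller, Horror, Horror, The Undead And Monster Classics"))
print(clean_genres_safe(float("nan")))
```

As row 8 in the output above shows, removing the repeated "Horror" lets a theme label ("The Undead And Monster Classics") move into the third slot, so the first three unique entries are not always genres in the strict sense.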

Correct all data types. 

In [24]:
print("data types before cleaning")
print("-"*50)
print(raw_movie_data_2019.dtypes)

# Remove any white space from all columns with object data type 
clean_movie_data_2019['title'] = clean_movie_data_2019['title'].str.strip()
clean_movie_data_2019['number_ratings'] = clean_movie_data_2019['number_ratings'].str.strip()
#clean_movie_data_2019['average_rating'] = clean_movie_data_2019['average_rating'].str.strip()
clean_movie_data_2019['length'] = clean_movie_data_2019['length'].str.strip()
clean_movie_data_2019['genres'] = clean_movie_data_2019['genres'].str.strip()

# Convert necessary columns to numeric data type 
clean_movie_data_2019['year'] = clean_movie_data_2019['year'].astype(int)
clean_movie_data_2019['number_ratings'] = clean_movie_data_2019['number_ratings'].astype(int)
clean_movie_data_2019['average_rating'] = clean_movie_data_2019['average_rating'].astype(float)

# There are some rows that have empty strings for length - convert these to 0's
clean_movie_data_2019['length'] = clean_movie_data_2019['length'].replace('', 0)
clean_movie_data_2019['length'] = clean_movie_data_2019['length'].astype(int)

print("\n")
print("data types after cleaning")
print("-"*50)
print(clean_movie_data_2019.dtypes)

data types before cleaning
--------------------------------------------------
title              object
year              float64
number_ratings     object
average_rating     object
length             object
genres             object
dtype: object


data types after cleaning
--------------------------------------------------
title              object
year                int32
number_ratings      int32
average_rating    float64
length              int32
genres             object
dtype: object
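Replacing empty lengths with 0 keeps the column as a plain integer type, but a 0-minute runtime will pull down any average computed later. As a sketch of an alternative, pandas' nullable `Int64` dtype can hold integers alongside missing values. The sample values below are made up.

```python
import pandas as pd

# Toy length column after text removal; the empty string marks a missing runtime
length = pd.Series(["133", "", "95"])

# Equivalent in effect to treating missing lengths as 0
as_zero = pd.to_numeric(length, errors="coerce").fillna(0).astype(int)

# Alternative: keep missing lengths as <NA> using the nullable integer dtype
as_missing = pd.to_numeric(length, errors="coerce").astype("Int64")

print(as_zero.mean())
print(as_missing.mean())
```

Here the mean is 76.0 when the missing runtime counts as 0 and 114.0 when it is skipped, which shows how much the choice can affect summary statistics.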


### Final Comparison of Clean and Raw Data Frames 

In [26]:
print(f"Length of raw data frame: {len(raw_movie_data_2019)}")
print("-"*50)
display(raw_movie_data_2019.head(5))
print("\n")

print(f"Length of clean data frame: {len(clean_movie_data_2019)}")
print("-"*50)
display(clean_movie_data_2019.head(5))


Length of raw data frame: 5427
--------------------------------------------------


Unnamed: 0,title,year,number_ratings,average_rating,length,genres
0,Parasite,2019.0,"Weighted average of 4.55 based on 3,233,120 ra...",4.6,133 mins More at IMDB TMDB,"Comedy, Thriller, Drama, Humanity And The Worl..."
1,Joker,2019.0,"Weighted average of 3.85 based on 3,064,771 ra...",3.8,122 mins More at IMDB TMDB,"Crime, Drama, Thriller, Intense Violence And S..."
2,Midsommar,2019.0,"Weighted average of 3.77 based on 2,420,557 ra...",3.8,147 mins More at IMDB TMDB,"Mystery, Drama, Horror, Intense Violence And S..."
3,Knives Out,2019.0,"Weighted average of 3.98 based on 2,583,937 ra...",4.0,131 mins More at IMDB TMDB,"Mystery, Comedy, Crime, Thrillers And Murder M..."
4,Once Upon a Time… in Hollywood,2019.0,"Weighted average of 3.76 based on 1,977,706 ra...",3.8,162 mins More at IMDB TMDB,"Drama, Thriller, Comedy, Humanity And The Worl..."




Length of clean data frame: 3704
--------------------------------------------------


Unnamed: 0,title,year,number_ratings,average_rating,length,genres
0,Parasite,2019,3233120,4.6,133,"Comedy, Thriller, Drama"
1,Joker,2019,3064771,3.8,122,"Crime, Drama, Thriller"
2,Midsommar,2019,2420557,3.8,147,"Mystery, Drama, Horror"
3,Knives Out,2019,2583937,4.0,131,"Mystery, Comedy, Crime"
4,Once Upon a Time… in Hollywood,2019,1977706,3.8,162,"Drama, Thriller, Comedy"


### Save the clean 2019 Letterboxd movie data for the top 25% most popular movies to a CSV file

In [28]:
clean_movie_data_2019.to_csv("letterboxd_movie_data_2019_clean.csv", header=True, index=False, encoding='utf-8')
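As a quick sanity check on the save step, the sketch below writes a small made-up frame with `index=False` to an in-memory buffer and reads it back. It uses `io.StringIO` instead of a real file, and the row shown is hypothetical.

```python
import io
import pandas as pd

# Made-up single-row frame standing in for the clean data
clean = pd.DataFrame({"title": ["Movie A"], "year": [2019], "average_rating": [3.8]})

# Write without the index, then read the CSV text back in
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())
```

Because `index=False` leaves the row index out of the file, the reloaded frame has no extra `Unnamed: 0` column.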