# Grace Techau
## Box Office Revenue & Letterboxd Ratings Project 
### Scraping Letterboxd Website 2019 Movies

**Scraping the elements title, year, number_ratings, average_rating, length and genres for the top 25% most popular Letterboxd movies of 2019, with the 'Hide short films' filter applied.**

In [3]:
# import all required packages 
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By 
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time 
import random

In [4]:
# define random scroll function 
def random_scroll(browser, total_wait_time): 
    total_height = browser.execute_script("return document.body.scrollHeight")
    scroll_steps = random.randint(3,10)
    scroll_increment = total_height // scroll_steps
    time_per_step = total_wait_time / scroll_steps
    for step in range(scroll_steps): 
        browser.execute_script(f"window.scrollBy(0, {scroll_increment});")
        random_wait = random.uniform(0.5 * time_per_step, 1.5 * time_per_step)
        time.sleep(random_wait)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
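The timing logic in `random_scroll` can be checked without opening a browser. The sketch below is only an illustration: `plan_scroll` is a hypothetical helper (not part of the notebook) that repeats the same arithmetic and returns the per-step scroll distance and the list of waits. The page height and wait time passed in are made-up values.

```python
import random


def plan_scroll(total_height, total_wait_time, rng=random):
    """Hypothetical helper: mirror the arithmetic used by random_scroll."""
    scroll_steps = rng.randint(3, 10)                 # same range as random_scroll
    scroll_increment = total_height // scroll_steps   # pixels scrolled per step
    time_per_step = total_wait_time / scroll_steps
    # each wait is drawn between 50% and 150% of the nominal time per step
    waits = [
        rng.uniform(0.5 * time_per_step, 1.5 * time_per_step)
        for _ in range(scroll_steps)
    ]
    return scroll_increment, waits


# Assumed example values: a 6000 px page and a nominal wait of 8 seconds.
increment, waits = plan_scroll(total_height=6000, total_wait_time=8.0,
                               rng=random.Random(0))
print(f"{len(waits)} steps of {increment}px, {sum(waits):.2f}s in total")
```

Because every wait is drawn from 0.5x to 1.5x of the nominal step time, the real time spent on a page falls anywhere between half and one and a half times `total_wait_time`, which is worth keeping in mind when estimating how long a batch will take.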

In [5]:
# initialize the Selenium web driver (using Chrome)
chrome_options = Options()
service = Service(ChromeDriverManager().install())
browser = webdriver.Chrome(service=service, options=chrome_options)

### YEAR 2019 - Scrape URL links to individual movie detail pages 

Create a function that applies the 'Hide short films' viewing filter when scraping the individual movie page URLs from the main Letterboxd films listing.

In [8]:
def apply_filters(): 
    try: 
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "section.smenu-wrapper .smenu label"))
        )
            
        filter_button = WebDriverWait(browser, 20).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "section.smenu-wrapper .smenu label"))
        )
        filter_button.click()
    
        time.sleep(random.uniform(1,3))
            
        # Apply the "Hide short films" filter (wait until the link is clickable, not just present)
        hide_short_films_button = WebDriverWait(browser, 20).until(
            EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), 'Hide short films')]"))
        )
        hide_short_films_button.click()
        print("Clicked 'Hide short films' filter")

        time.sleep(random.uniform(4,15))
    
    except Exception as e: 
        print(f"Error applying filters: {e}")

Create a function to scrape the individual movie page URLs from the main Letterboxd films listing for movies from 2019, sorted by popularity.

In [10]:
def scrape_movie_links(): 
    urls_2019 = [] 

    # scrape all the <a> tags with the class 'frame'
    tags = browser.find_elements(By.XPATH, '//a[@class="frame"]')

    # separate the 'href' attribute from each tag - it contains the URL of the individual Letterboxd movie detail page
    for tag in tags: 
        href = tag.get_attribute('href')
        if href:
            urls_2019.append(href)

    return urls_2019

Create a function for scraping multiple pages of the main Letterboxd listing of 2019 films sorted by popularity. The scrape_movie_links function runs on every page, while apply_filters runs only on the first page of a batch, because the filter stays applied to the pages that follow. \
\
For the year 2019 there are 301 pages of movies with the 'Hide short films' filter applied. I scraped only the top quarter of these pages (75) to capture the 25% most popular movies. The 75 pages were scraped in three batches, and each batch was saved to a separate CSV file, as detailed at the bottom of this page. \
\
The three CSV files will be merged during cleaning to capture the full set of the 25% most popular movies from 2019. 
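The page arithmetic above can be checked with a few lines. This is only a sketch: the batch boundaries are copied from the file breakdown given in this notebook, and the variable names are chosen here for illustration.

```python
from collections import Counter

total_pages = 301                      # pages listed for 2019 with the filter applied
top_quarter_pages = total_pages // 4   # integer quarter of the listing

# Batch boundaries as listed in this notebook (inclusive page ranges).
batches = {"raw_1": (1, 14), "raw_2": (14, 45), "raw_3": (46, 75)}

# Count how often each page number is covered by a batch.
page_counts = Counter(
    page
    for start, end in batches.values()
    for page in range(start, end + 1)
)
covered = set(page_counts)
duplicated = sorted(page for page, count in page_counts.items() if count > 1)

print(top_quarter_pages, len(covered), duplicated)
```

With the ranges as listed, page 14 falls in both the first and the second batch, so the movies on that page may appear twice once the files are merged and would need de-duplicating during cleaning.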

In [12]:
def scrape_movie_pages(start_page, end_page): 
    urls_2019 = [] 

    for i in range(start_page, end_page +1): 
        url_2019 = f"https://letterboxd.com/films/popular/year/2019/page/{i}/"

        browser.get(url_2019) 
        browser.maximize_window()

        print(f"Scraping page {i}: {url_2019}")

        time.sleep(random.uniform(3,5))

        # Only apply the filter on the first page of the batch - it stays applied to all pages after
        if i == start_page:
            apply_filters()

        film_urls = scrape_movie_links()
        urls_2019.extend(film_urls)

        total_wait_time = random.uniform(5, 12)
        random_scroll(browser, total_wait_time)

        print(f"Finished scraping page {i}.")

    return urls_2019

## top 25% most popular pages : 75 pages 
#### raw_1 - pages 1 to 14
#### raw_2 - pages 14 to 45 
#### raw_3 - pages 46 to 75

start_page = 46
end_page = 75
urls_2019 = scrape_movie_pages(start_page, end_page)

print("-"*70)
print("Totals of URLS scraped for 2019")
print("-"*70)
print(f"Total # URLs scraped: {len(urls_2019)}")

Scraping page 46: https://letterboxd.com/films/popular/year/2019/page/46/
Clicked 'Hide short films' filter
Finished scraping page 46.
Scraping page 47: https://letterboxd.com/films/popular/year/2019/page/47/
Finished scraping page 47.
Scraping page 48: https://letterboxd.com/films/popular/year/2019/page/48/
Finished scraping page 48.
Scraping page 49: https://letterboxd.com/films/popular/year/2019/page/49/
Finished scraping page 49.
Scraping page 50: https://letterboxd.com/films/popular/year/2019/page/50/
Finished scraping page 50.
Scraping page 51: https://letterboxd.com/films/popular/year/2019/page/51/
Finished scraping page 51.
Scraping page 52: https://letterboxd.com/films/popular/year/2019/page/52/
Finished scraping page 52.
Scraping page 53: https://letterboxd.com/films/popular/year/2019/page/53/
Finished scraping page 53.
Scraping page 54: https://letterboxd.com/films/popular/year/2019/page/54/
Finished scraping page 54.
Scraping page 55: https://letterboxd.com/films/popular/ye

Modify the scraped URLs by appending the path suffix 'genres/'.\
This opens the genres tab of each Letterboxd movie detail page, so the genre data can be scraped correctly.

In [14]:
modifed_urls = [url + 'genres/' for url in urls_2019]
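The list comprehension above relies on every scraped URL ending in a slash. A slightly more defensive variant is sketched below; `with_genres_suffix` is a hypothetical helper introduced only for this illustration, and it produces the same result whether or not the trailing slash is present.

```python
def with_genres_suffix(url):
    """Hypothetical helper: append 'genres/' with exactly one slash in between."""
    return url.rstrip("/") + "/genres/"


# Both spellings of the same detail-page URL give the same result.
print(with_genres_suffix("https://letterboxd.com/film/kee/"))
print(with_genres_suffix("https://letterboxd.com/film/kee"))
```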

### YEAR 2019 - Scrape movie data from each movie's page  

In [16]:
# create a list to store the title, year, average_rating, number_ratings, length, and genres of
# each 2019 movie scraped from Letterboxd
movie_data = []

for url in modifed_urls: 
    browser.get(url)
    browser.maximize_window()
    
    total_wait_time = random.uniform(5, 12)
    random_scroll(browser, total_wait_time)

    
    try:
        #SCRAPE TITLE
        title_element = browser.find_element(By.CSS_SELECTOR,"h1.headline-1.filmtitle span.name.js-widont.prettify")
        titles = title_element.text.strip()

        #SCRAPE YEAR 
        year_element = browser.find_element(By.CSS_SELECTOR, "div.releaseyear a")
        years = year_element.text.strip()
        
        #SCRAPE AVERAGE RATING AND NUMBER OF RATINGS 
        try:
            average_rating_element = browser.find_element(By.CSS_SELECTOR, "span.average-rating a.tooltip.display-rating")
            average_ratings = average_rating_element.text.strip()
            number_ratings = average_rating_element.get_attribute('data-original-title')
        except NoSuchElementException: 
            average_ratings = "No average rating available"
            number_ratings = "No number of ratings available"
            
        #SCRAPE LENGTHS 
        lengths = browser.find_element(By.CSS_SELECTOR, "p.text-link.text-footer").text

        #SCRAPE GENRES 
        try: 
            genre_elements = browser.find_elements(By.CSS_SELECTOR, "div.text-sluglist.capitalize a.text-slug")
            if genre_elements:
                genres = [genre.text.strip() for genre in genre_elements]
            else: 
                genres = ['No genres available']
        except NoSuchElementException:
            genres = ['No genres available']

        # Append the movie's data as a dictionary to the list movie_data
        movie_data.append({
            'title': titles,
            'year' : years, 
            'number_ratings' : number_ratings, 
            'average_rating' : average_ratings, 
            'length' : lengths, 
            'genres' : ", ".join(genres),
        })

    except Exception as e: 
        print(f"Error scraping {url}: {e}")
        movie_data.append({
            'title': None,
            'year' : None, 
            'number_ratings' : None, 
            'average_rating' : None, 
            'length' : None, 
            'genres' : None
        })

    #keep a tracker to know when each URL has been scraped 
    print(f"Finished scraping {url}")
    
# close the browser and end the WebDriver session
browser.quit()

Finished scraping https://letterboxd.com/film/silat-warriors-deed-of-death/genres/
Finished scraping https://letterboxd.com/film/malaal/genres/
Finished scraping https://letterboxd.com/film/the-tracker-2019/genres/
Finished scraping https://letterboxd.com/film/saving-leningrad/genres/
Finished scraping https://letterboxd.com/film/kee/genres/
Finished scraping https://letterboxd.com/film/lourdes-2019/genres/
Finished scraping https://letterboxd.com/film/oresuki-are-you-the-only-one-who-loves-me/genres/
Finished scraping https://letterboxd.com/film/the-wrong-stepmother/genres/
Finished scraping https://letterboxd.com/film/guilt-2019-1/genres/
Finished scraping https://letterboxd.com/film/felix-in-wonderland/genres/
Finished scraping https://letterboxd.com/film/breakpoint-a-counter-history-of-progress/genres/
Finished scraping https://letterboxd.com/film/texas-2/genres/
Finished scraping https://letterboxd.com/film/v1-murder-case/genres/
Finished scraping https://letterboxd.com/film/party
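The `number_ratings` and `length` fields are stored as raw text, so they will need parsing during cleaning. The sketch below shows one way this could be done. The input formats are assumptions, not confirmed by this notebook: a tooltip along the lines of "Weighted average of 3.45 based on 1,234 ratings" and a footer that starts with something like "97 mins". Both helper names are hypothetical.

```python
import re


def parse_number_ratings(tooltip):
    """Pull the ratings count out of the tooltip text (assumed format)."""
    match = re.search(r"based on ([\d,]+)", tooltip or "")
    return int(match.group(1).replace(",", "")) if match else None


def parse_length_minutes(footer_text):
    """Pull the running time in minutes out of the footer text (assumed format)."""
    match = re.search(r"(\d[\d,]*)\s*mins?\b", footer_text or "")
    return int(match.group(1).replace(",", "")) if match else None


# Example strings are assumptions about what the scraped text looks like.
print(parse_number_ratings("Weighted average of 3.45 based on 1,234 ratings"))
print(parse_number_ratings("No number of ratings available"))
print(parse_length_minutes("97 mins   More at IMDb TMDb"))
```

Returning `None` when nothing matches keeps the placeholder strings written by the scraper from turning into misleading numbers.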

### YEAR 2019 - Create a pandas data frame 'movie_data_2019'

In [3]:
movie_data_2019 = pd.DataFrame(movie_data)

display(movie_data_2019)


### Save dataframe to a CSV file for cleaning 
Breakdown of the pages covered by each file when scraping the year 2019. 

| Pages                | File Name                                |
|----------------------|------------------------------------------|
| 1 - 14               | letterboxd_movie_data_2019_raw_1.csv     |
| 14 - 45              | letterboxd_movie_data_2019_raw_2.csv     |
| 46 - 75              | letterboxd_movie_data_2019_raw_3.csv     |
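Since page 14 is listed under both of the first two files, merging the batches may produce repeated rows. Below is a minimal, dependency-free sketch of a merge that drops exact duplicate rows while keeping the original order. `merge_csv_batches` is a hypothetical helper, and the sample rows are invented placeholders rather than scraped data. In pandas the same step would typically be `pd.concat` followed by `drop_duplicates`.

```python
import csv
import io


def merge_csv_batches(file_objects):
    """Hypothetical helper: concatenate CSV batches, skipping exact duplicate rows."""
    seen = set()
    merged = []
    for handle in file_objects:
        for row in csv.DictReader(handle):
            key = tuple(row.values())   # the whole row identifies a duplicate
            if key in seen:
                continue
            seen.add(key)
            merged.append(row)
    return merged


# Placeholder data standing in for two overlapping batch files.
batch_a = io.StringIO("title,year\nFilm A,2019\nFilm B,2019\n")
batch_b = io.StringIO("title,year\nFilm B,2019\nFilm C,2019\n")

merged = merge_csv_batches([batch_a, batch_b])
print([row["title"] for row in merged])
```

Matching on the whole row only catches identical rows. If a movie was re-scraped later and its ratings count had changed, the two rows would differ, so matching on a subset of columns such as title and year may work better in practice.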


In [21]:
movie_data_2019.to_csv("letterboxd_movie_data_2019_raw_3.csv", header=True, index=False, encoding='utf-8')