HOW TO USE: User input is contained in the first Python cell below. Edit starting_urls to hold the URLs of whichever Letterboxd watchlists or lists you want to explore. You can add as many URLs to the list as you like.

Once you have edited starting_urls, hit "Run All" and scroll to the bottom to see the results.

In [1]:
starting_urls = ['https://letterboxd.com/fella/watchlist/', 'https://letterboxd.com/fella/films/rated/4-5/']

# Do not edit below this point.

Import modules

In [2]:
from urllib.request import urlopen, Request
import re
from datetime import datetime
from thefuzz import process, fuzz

## Scraping Revival Hub

Create lists of the current month and the two months that follow, along with their corresponding years. For example, if the current month is November 2024, then months = [11, 12, 1] and years = [2024, 2024, 2025]. We use these below to build the required URLs.

In [3]:
current_month = datetime.now().month
current_year = datetime.now().year

months = [((current_month + i-1) % 12) + 1 for i in range(3)]

years = [current_year for i in range(3)]

if current_month >= 11:
    years[2] = current_year + 1
    if current_month == 12:
        years[1] = current_year + 1

months, years

([1, 2, 3], [2024, 2024, 2024])
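
The year rollover can also be computed arithmetically instead of with nested conditionals. The following is a minimal sketch, not part of the notebook's pipeline; the function name next_months and its count parameter are illustrative assumptions.

```python
def next_months(month, year, count=3):
    """Return `count` (month, year) pairs, starting with the given month."""
    pairs = []
    for offset in range(count):
        # Use a zero-based month index so divmod yields the number of years rolled over.
        extra_years, month_index = divmod(month - 1 + offset, 12)
        pairs.append((month_index + 1, year + extra_years))
    return pairs

print(next_months(11, 2024))  # [(11, 2024), (12, 2024), (1, 2025)]
```

Because the year is derived from the same offset as the month, this version also works unchanged if more than three months are requested.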

We now create the urls and save the html code from those pages.

In [4]:
def make_url(month, year):
    # The calendar URL takes the month as a zero-padded MM-YYYY string, e.g. 01-2024.
    return f'https://www.revivalhubla.com/film-calendar?view=calendar&month={month:02d}-{year}'

urls = [make_url(month, year) for month, year in zip(months, years)]
pages = [urlopen(url) for url in urls]
htmls = [page.read().decode("utf-8") for page in pages]

The movie titles in the html for Revival Hub's calendar page are between h1 html tags, so we find all instances of that pattern and remove the extraneous html from each, leaving a list of strings containing the movie titles and release years.

In [5]:
pattern = "<h1>.*?</h1>"
results = [re.findall(pattern, html) for html in htmls]

def extract_text(line):
    html_chunks = re.findall("<.*?>", line)
    for chunk in html_chunks:
        line = line.replace(chunk, '')
    return line

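def extract_info_link(line):
    # Pull the relative link out of the first anchor tag and make it absolute.
    link_pattern = '<a href=".*?">'
    res = re.findall(link_pattern, line)
    info_link = res[0].replace('<a href="','').replace('">','')
    url = 'https://www.revivalhubla.com' + info_link
    # What follows /film-calendar/ starts with the date, e.g. 2024/1/18/...
    date = url.replace('https://www.revivalhubla.com/film-calendar/','')
    # Zero-pad single-digit months and days so the dates sort correctly as strings.
    for i in range(10):
        si = str(i)
        toreplace = '/' + si + '/'
        replacewith = '/0' + si + '/'
        date = date.replace(toreplace, replacewith)
    # Keep only the YYYY/MM/DD prefix.
    date = date[:10]
    return url, date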

titles = [list(map(extract_text, result)) for result in results]
dates_and_links = [list(map(extract_info_link, result)) for result in results]
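
To make the two helpers concrete, here is a small worked example on a single calendar entry. The markup of the sample line is an assumption (only the title and link path are taken from the results at the bottom of this notebook), and the date handling is a compact variant of extract_info_link, not the function itself.

```python
import re

# Hypothetical calendar entry, shaped like a line matched by "<h1>.*?</h1>".
sample = ('<h1><a href="/film-calendar/2024/1/18/twin-peaks-fire-walk-with-me">'
          'Twin Peaks: Fire Walk with Me - 1992</a></h1>')

# Strip every tag to leave the visible text (what extract_text does).
title = re.sub(r'<.*?>', '', sample)

# Pull out the relative link, then zero-pad the month and day.
path = re.search(r'<a href="(.*?)">', sample).group(1)
year, month, day = path.replace('/film-calendar/', '').split('/')[:3]
date = f'{year}/{int(month):02d}/{int(day):02d}'

print(title)  # Twin Peaks: Fire Walk with Me - 1992
print(date)   # 2024/01/18
```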

## Scraping Letterboxd

If a starting URL points at page n > 1 of a list, we strip page/n/ from it so that scraping starts at page 1 instead.

In [6]:
num_of_urls = len(starting_urls)

for i in range(num_of_urls):
    while 'page/' in starting_urls[i]:
        starting_urls[i] = re.sub('page/.*/', '', starting_urls[i])
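
As an illustration of this step, the sketch below strips the page suffix with a stricter pattern that only matches a trailing page/<number>/ segment. The helper name strip_page_suffix is hypothetical, and the sketch assumes the page segment always comes last in the URL.

```python
import re

def strip_page_suffix(url):
    # Remove a trailing "page/<n>/" segment if present; other URLs pass through unchanged.
    return re.sub(r'page/\d+/?$', '', url)

print(strip_page_suffix('https://letterboxd.com/fella/watchlist/page/3/'))
# https://letterboxd.com/fella/watchlist/
```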

Letterboxd did not accept requests sent with urlopen's default user agent, so we send a browser-style User-Agent header instead.

In [7]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
reqs = [Request(url=starting_url, headers=headers) for starting_url in starting_urls]
first_htmls = [urlopen(req).read().decode("utf-8") for req in reqs]

first_htmls contains only the first page of each list, which holds up to 100 movies. We start by extracting the total number of pages.

In [8]:
pattern_pages = '<div class="pagination">.*?</div> </div>'
result_pages = [re.findall(pattern_pages, first_html) for first_html in first_htmls]

def extract_page_number(text):
    if '&hellip;' in text:
        str_num = text.split('&hellip;')[-1]
    else:
        str_num = text.split()[-1]
    num = int(str_num)
    return num

page_count = [1 for i in range(num_of_urls)]

for i in range(num_of_urls):    
    if result_pages[i]:
        pages_text = extract_text(result_pages[i][0])
        page_count[i] = extract_page_number(pages_text)
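
The two branches of extract_page_number are easier to follow on sample input. The strings below are hypothetical stand-ins for the pagination text left after tag removal; the real text may contain different words around the numbers.

```python
def extract_page_number(text):
    # Same logic as the cell above: the last page number follows the ellipsis
    # entity if there is one; otherwise it is the final whitespace-separated token.
    if '&hellip;' in text:
        str_num = text.split('&hellip;')[-1]
    else:
        str_num = text.split()[-1]
    return int(str_num)  # int() ignores surrounding whitespace

# Long list: the middle pages are collapsed into an ellipsis.
print(extract_page_number('Newer Older 1 2 3 &hellip; 12'))  # 12
# Short list: every page number is shown.
print(extract_page_number('Newer Older 1 2 3'))              # 3
```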

Now that we have page_count, we fetch the HTML of each remaining page and combine it with the first pages.

In [9]:
remaining_urls = []

for j in range(num_of_urls):
    remaining_urls += [starting_urls[j] + 'page/' + str(i) + '/' for i in range(2, page_count[j] + 1)]

remaining_reqs = [Request(url=rem_url, headers=headers) for rem_url in remaining_urls]
remaining_htmls = [urlopen(req).read().decode("utf-8") for req in remaining_reqs]

In [10]:
# Concatenate every page into one string so a single regex pass finds all films.
mega_html = ''.join(first_htmls + remaining_htmls)

We now extract the movie titles from this mega_html string.

In [11]:
letterboxd_title_pattern = 'data-film-slug=".*?"'
slugs = re.findall(letterboxd_title_pattern, mega_html)
all_movies = [slug.replace('data-film-slug="','').replace('"','') for slug in slugs]

## Comparing all_movies and titles

In [12]:
def find_matches(revival_titles, letterboxd_movies):
    matches = []
    for movie in letterboxd_movies:
        new_match = process.extractOne(movie, revival_titles, score_cutoff=86)
        if new_match:
            matches.append(new_match)
    return matches
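
The idea behind find_matches is to score each Letterboxd slug against every Revival Hub title and keep the best title only if it clears a cutoff. The sketch below illustrates that idea with the standard library's difflib as a stand-in scorer. It is not thefuzz: the scores are on a 0 to 1 scale and are not comparable to the cutoff of 86 used above, and the names normalize and best_match, the 0.8 cutoff, and the paris-texas slug are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    # Lowercase and turn every run of non-alphanumeric characters into one space.
    return re.sub(r'[^a-z0-9]+', ' ', text.lower()).strip()

def best_match(slug, choices, cutoff=0.8):
    # Score the slug against each title; return the best title only if it clears the cutoff.
    scored = [(SequenceMatcher(None, normalize(slug), normalize(c)).ratio(), c)
              for c in choices]
    if not scored:
        return None
    score, choice = max(scored)
    return choice if score >= cutoff else None

revival_titles = ['The Seventh Seal - 1957', 'Yi Yi - 2000', 'Frances Ha - 2013']
print(best_match('the-seventh-seal', revival_titles))  # The Seventh Seal - 1957
print(best_match('paris-texas', revival_titles))       # None
```

The cutoff is what keeps unrelated films out of the results: without it, every slug would be paired with whichever title happened to score highest.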

In [13]:
this_month = [x[0] for x in find_matches(titles[0], all_movies)]
next_month = [x[0] for x in find_matches(titles[1], all_movies)]
month_after = [x[0] for x in find_matches(titles[2], all_movies)]

In [14]:
def show_me(month_list, m):
    # Keep every calendar entry for month m whose title was matched.
    matched = set(month_list)
    mylist = [(title, info) for title, info in zip(titles[m], dates_and_links[m]) if title in matched]
    # The dates are zero-padded strings, so sorting them as text is chronological.
    mylist.sort(key=(lambda x: x[1][1]))

    for s, (url, date) in mylist:
        print(date, ' - ', s, ' - ', url)

# RESULTS

### This month

In [15]:
show_me(this_month, 0)

2024/01/18  -  Twin Peaks: Fire Walk with Me - 1992  -  https://www.revivalhubla.com/film-calendar/2024/1/18/twin-peaks-fire-walk-with-me
2024/01/20  -  The Seventh Seal - 1957  -  https://www.revivalhubla.com/film-calendar/2024/1/20/seventh-seal
2024/01/21  -  Yi Yi - 2000  -  https://www.revivalhubla.com/film-calendar/2024/01/21/yi-yi
2024/01/22  -  Stranger Than Paradise - 1984  -  https://www.revivalhubla.com/film-calendar/2024/1/22/stranger-than-paradise
2024/01/25  -  Taipei Story - 1985 / In Our Time - 1982  -  https://www.revivalhubla.com/film-calendar/2024/01/25/taipei-story-in-our-time
2024/01/26  -  Frances Ha - 2013  -  https://www.revivalhubla.com/film-calendar/2024/1/26/frances-ha
2024/01/26  -  Wings of Desire - 1987  -  https://www.revivalhubla.com/film-calendar/2024/01/26/wings-of-desire
2024/01/26  -  In the Mood for Love - 2000 / Punch-Drunk Love - 2002  -  https://www.revivalhubla.com/film-calendar/2024/1/26/mood-punch
2024/01/27  -  In the Mood for Love - 2000 / Pu

In [16]:
show_me(next_month, 1)

2024/02/21  -  The Diving Bell and the Butterfly - 2007  -  https://www.revivalhubla.com/film-calendar/2024/2/21/the-diving-bell-and-the-butterfly


In [17]:
show_me(month_after, 2)