In [1]:
#pip install letterboxdpy
#pip install -U rottentomatoes-python

In [2]:
import pandas as pd
import pickle
import requests
from bs4 import BeautifulSoup
from scrapy import Selector
from pandas import json_normalize

import rottentomatoes as rt
from json import (
  JSONEncoder,
  dumps as json_dumps,
  loads as json_loads,
)

## First try: Webscraping Rotten Tomatoes for Movie Reviews

My function for extracting a movie's reviews from its Rotten Tomatoes page worked, but I realised that Rotten Tomatoes movie links do not follow a consistent pattern.

__For example:__

 - Dune: Part Two (2024) link : https://www.rottentomatoes.com/m/dune_part_two (seems reasonable)
 - Dune (2021) link : https://www.rottentomatoes.com/m/dune_2021 (getting weird)
 - Dune (2000 miniseries, directed by John Harrison) link : https://www.rottentomatoes.com/m/dune
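
To make the inconsistency concrete, here is a minimal sketch that guesses a slug from the title alone and compares it with the slugs in the links above. `naive_rt_slug` is a hypothetical helper written for this illustration; it is not part of any library and not how Rotten Tomatoes builds its URLs.

```python
import re

def naive_rt_slug(title: str) -> str:
    """Hypothetical helper: guess a Rotten Tomatoes slug from the title alone."""
    # Lowercase, then collapse every run of non-alphanumeric characters into "_"
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")

# Slugs copied from the links listed above
actual_slugs = {
    "Dune: Part Two": "dune_part_two",
    "Dune": "dune_2021",  # the 2021 film; plain "dune" points to the 2000 miniseries
}

for title, actual in actual_slugs.items():
    guess = naive_rt_slug(title)
    print(f"{title!r}: guessed {guess!r}, actual {actual!r}, match={guess == actual}")
```

The guess works for Dune: Part Two but lands on the miniseries for Dune (2021), so a title alone is not enough to build the link.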

In [3]:
#first try for webscraping for movie reviews: rotten tomatoes
#note: this rebinds the name rt, which shadows the rottentomatoes module imported as rt above
rt = 'https://www.rottentomatoes.com/m/dune_part_two/reviews?type=top_critics'
req = requests.get(rt)
res = req.content
soup1 = BeautifulSoup(res, 'html.parser')

In [4]:
#function for extracting reviews
def get_comments_rt(url,n=20):
    try:
        response = requests.get(url)
        
        # Check if request was successful
        if response.status_code == 200:
            # Parse HTML
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Find the review table
            review_table = soup.find("div", class_="review_table")
            
            # If review table exists, find all review text elements
            if review_table:
                review_texts = review_table.find_all("p", class_="review-text")
                
                # Extract the text of the first n reviews, or all available reviews if fewer than n
                comments = []
                for i, review_text in enumerate(review_texts):
                    if i == n:
                        break
                    comments.append(review_text.get_text(strip=True))
                
                return comments
            else:
                print("Review table not found.")
                return None
        else:
            print("Failed to retrieve page. Status code:", response.status_code)
            return None
    except Exception as e:
        print("An error occurred:", e)
        return None

In [5]:
#function trial
n = 10
comments = get_comments_rt(rt,n)
if comments:
    print(f"First {n} comments:")
    for i, comment in enumerate(comments, 1):
        print(f"{i}. {comment}")
else:
    print("No comments found.")

First 10 comments:
1. An epic and spectacular sci-fi allegory with mass appeal.
2. You know Villeneuve will get the spectacle right. The question is about the human drama… It almost all connected in Part Two.
3. As in all of these sci-fi epics, there are plenty of scenes in which computer-generated characters drive computer-generated vehicles past computer-generated backdrops but, in "Dune," it feels human, slightly messy and organic.
4. What is really impressive about Part Two, is that despite how complex and like a miasma the plot becomes... the storytelling is so clear.
5. Exceeds expectations in every way—except humanity.
6. The film is all exhilarating buildup leading to an unsatisfactory, and even somewhat perfunctory, payoff.
7. The second Dune instalment is jaw-on-the-floor spectacular. It elegantly weaves together top-tier special effects and arresting cinematography; it layers muscle, sinew and savagery on to the bones of Part One.
8. Lawrence of Arrakis meets Dr. Sandworm, o

In [6]:
len(comments)

10

Unfortunately, there was no way to standardize the links of the movies whose reviews I wanted to extract, so I decided to look elsewhere.

## Second try: Webscraping Letterboxd for Movie Reviews

I chose to try webscraping Letterboxd for reviews, because:

1. Letterboxd is a commonly used platform for movie reviews. It also has a feature that lets users like other users' reviews, so I can pick the most popular reviews for a movie simply by sorting them by likes.
2. Letterboxd uses the TMDb API, which is free and public. I have access to that API as well, so I tried to find a common way to identify movies and their page links on Letterboxd.
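
The sorting idea in point 1 can be sketched as follows. The records, the `likes` key and the numbers are made-up placeholders for illustration, not scraped data, and `top_reviews_by_likes` is a hypothetical helper.

```python
# Made-up placeholder records, not scraped data; "likes" is an assumed field name
sample_reviews = [
    {"text": "review A", "likes": 12},
    {"text": "review B", "likes": 340},
    {"text": "review C", "likes": 57},
]

def top_reviews_by_likes(reviews: list, k: int = 2) -> list:
    """Return the texts of the k most-liked reviews, most liked first."""
    ranked = sorted(reviews, key=lambda r: r["likes"], reverse=True)
    return [r["text"] for r in ranked[:k]]

print(top_reviews_by_likes(sample_reviews))
```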

I thought I had the same link problem with Letterboxd until I ran into this page:
https://letterboxd.com/about/film-data/

The page explains that a URL of the form 'https://letterboxd.com/tmdb/{tmdb_film_id}' redirects to the movie's Letterboxd page. I then decided to write a class that extracts a movie's reviews using its TMDb ID.
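
As a small sketch of that URL format, the helper below builds the redirecting link from a TMDb ID. `letterboxd_tmdb_url` is a hypothetical name introduced only for this example; 27205 is the ID used later in this notebook.

```python
LETTERBOXD_TMDB_URL = "https://letterboxd.com/tmdb/{tmdb_film_id}"

def letterboxd_tmdb_url(tmdb_film_id: int) -> str:
    """Hypothetical helper: build the redirecting Letterboxd URL for a TMDb film ID."""
    # int() rejects values that are not integer-like, such as "abc"
    return LETTERBOXD_TMDB_URL.format(tmdb_film_id=int(tmdb_film_id))

print(letterboxd_tmdb_url(27205))
```

Building the URL is purely local; following the redirect to the film's actual page still needs an HTTP request.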
 

In [7]:
class Scraper:

    def __init__(self, domain: str):
        self.base_url = domain
        self.headers = {
            "referer": domain,
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        }
        self.builder = "lxml"

    def get_parsed_page(self, path: str) -> BeautifulSoup:
        url = self.base_url + path
        try:
            response = requests.get(url, headers=self.headers)
            response.raise_for_status()  # Raises an error for non-200 status codes
        except requests.RequestException as e:
            raise Exception(f"Error connecting to {url}: {e}")

        try:
            dom = BeautifulSoup(response.text, self.builder)
        except Exception as e:
            raise Exception(f"Error parsing response from {url}: {e}")

        if response.status_code != 200:
            message = dom.find("section", {"class": "message"})
            message = message.strong.text if message else None
            messages = json_dumps({
                'code': response.status_code,
                'reason': str(response.reason),
                'url': url,
                'message': message
            }, indent=2)
            raise Exception(messages)
        return dom
            
    def get_link(self) -> str:
        url = self.base_url
        try:
            response = requests.get(url, headers=self.headers)
            response.raise_for_status()  # Raises an error for non-200 status codes
        except requests.RequestException as e:
            raise Exception(f"Error connecting to {url}: {e}")
    
        try:
            dom = BeautifulSoup(response.text, self.builder)
            cont = dom.select_one("head > meta[property='og:url']")
            if cont:
                link = cont['content']
            else:
                raise Exception("Meta tag 'og:url' not found.")
        except Exception as e:
            raise Exception(f"Error parsing response from {url}: {e}")
        return link
    
    def extract_reviews(self, soup: BeautifulSoup, num_reviews: int = 12) -> list:
        reviews = []
        film_details = soup.find_all('li', class_='film-detail')
        for film_detail in film_details:
            spoilers_div = film_detail.find('div', class_='hidden-spoilers expanded-text')
            if spoilers_div:
                review_text = spoilers_div.text.strip()
            else:
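                body_div = film_detail.find('div', class_='body-text -prose collapsible-text')
                if body_div is None:
                    continue  # no review body found; skip rather than raise AttributeError
                review_text = body_div.text.strip()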
            reviews.append(review_text)
            if len(reviews) == num_reviews:  # Break if the desired number of reviews is reached
                break
        return reviews


In [8]:
lttx = 'https://letterboxd.com/tmdb/27205'

In [9]:
def get_reviews_from_link(domain: str, num_reviews: int = 12) -> list:
    scraper = Scraper(domain)
    link = scraper.get_link()
    scraper = Scraper(link)
    path = 'reviews/by/activity/'
    dom = scraper.get_parsed_page(path)
    return scraper.extract_reviews(dom, num_reviews)

In [10]:
reviews = get_reviews_from_link(lttx, num_reviews=10)
display(reviews)

["christopher nolan spent years writing this movie's complex plot and really named the main character dom cobb",
 'fellas, is it gay to go inside\xa0ur bros dreams?',
 'finally watched inception the way christopher nolan intended for it to be seen: only the first 10 minutes and on the big screen in fortnite 😌',
 "Dom Cobb seems like he's never told a joke in his life and has zero friends",
 '"The most important emotional thing about the top spinning at the end is that Cobb is not looking at it. He doesn\'t care." -Christopher Nolan, Wired interview, December 8, 2010.',
 "cillian murphy: no dad i'm giving up on YOUR dream!",
 'arthur and eames: interact\xa0me: Gay',
 'hans zimmer: BWAAAAHHHH BWAAAAAAAAAHHHHHHHHme: I LOVE THIS SONG!!!!!',
 'ilysm (i love you scillian murphy)',
 '"Inception," at its most basic, is two things. It is a heist film dressed in science fiction conventions; and it is a study of a man trying to free himself from a near-suffocating past. "Inception," at its more c

### EUREKA !!!

Here are my attempts and drafts for writing this code:

In [11]:
# a = Scraper(lttx)
# rew_link = a.get_link()
# b = Scraper(rew_link)
# path = 'reviews/by/activity/'
# dom = b.get_parsed_page(path)
# revi = b.extract_reviews(dom)

In [12]:
# def extract_reviews(soup):
#     reviews = []
    
#     film_details = soup.find_all('li', class_='film-detail')
#     for film_detail in film_details:
#         spoilers_div = film_detail.find('div', class_='hidden-spoilers expanded-text')
#         if spoilers_div:
#             review_text = spoilers_div.text.strip()
#         else:
#             review_text = film_detail.find('div', class_='body-text -prose collapsible-text').text.strip()
#         reviews.append(review_text)
#     return reviews



# extracted_reviews = extract_reviews(dom)
# print(extracted_reviews)

In [None]:
# texts = []

# film_details = dom.find_all('li', class_='film-detail')
# for film_detail in film_details:
#     spoilers_div = film_detail.find('div', class_='hidden-spoilers expanded-text')
#     if spoilers_div:
#         text = spoilers_div.text.strip()
#     else:
#         text = film_detail.find('div', class_='body-text -prose collapsible-text').text.strip()
#     texts.append(text)

# display(texts)

In [23]:
bechreq = requests.get("http://bechdeltest.com/api/v1/getAllMovies")
bechreq.text



