# British Board of Film Classification: Web Scraping

__Objective:__ To scrape the BBFC’s website for a list of movies rated in the years 2022/2023. This notebook is separated into two parts:

1. Scrape the [‘recents page’](https://www.bbfc.co.uk/recently-rated) from the BBFC website and store the data in a dataframe.
2. Scrape a list of given titles from the BBFC website and store the data in a separate dataframe.

## Methodology

The script below contains notes and guidance on the methodology and approach utilised to complete each activity.

The BBFC website's terms and conditions were checked first with respect to web scraping. I then checked the website's robots.txt file to determine whether any areas of the site are disallowed for scraping.

1. __Scraping the Recents Page:__
- Created a helper function to scrape the items 'Title', 'Rating Content', 'Synopsis', 'Classified Date', 'Language', 'Director(s)', 'Production Year', 'Release Date', 'Genre', 'Running Minutes' and 'Cast' by either finding each item's corresponding 'h4' tag and its next sibling, or locating the corresponding 'div' element and extracting its text contents.
- The data was then checked for the correct data types. Missing values and empty strings were checked to ensure the data was indeed missing on the BBFC's website. The 'Language' values were then corrected to be standardised. The final cleaned dataframe was then saved as 'bbfc_recently_rated_titles.csv'.

2. __Scraping the list of Given Titles:__
- Created a helper function to iterate through each movie title, fetch the URL for the movie, and then scrape the detailed information from that URL using the scrape_movie_info function. I modified the scrape_movie_info function used in section 1 to work with the individual movie pages instead of the recently-rated page. The items scraped were 'Title', 'URL', 'Type', 'Rating', 'Content', 'Synopsis', 'Classified Date', 'Language', 'Director', 'Production Year', 'Genre', 'Running Minutes', 'Cast', 'UK Age Rating'. The final cleaned dataframe was then saved as 'bbfc_listed_films.csv'.

Attachments and links:
- [Link to BBFC website](https://www.bbfc.co.uk/)
- [Link to BBFC recents page](https://www.bbfc.co.uk/recently-rated)
- [Link to list of titles](https://docs.google.com/spreadsheets/d/1gdNU75_RPG69bsuAFQpZu670sHK9EzTwZWSLUERhMZ4/edit#gid=0)

## Checking the BBFC's Website Terms of Service & Conditions

The BBFC's website terms and conditions ([link here](https://www.bbfc.co.uk/terms-and-conditions)) make no explicit mention of web scraping. However, they do state:

 "No part of the text or graphics (including the BBFC's symbols) on this site may be reproduced or transmitted in any form or by any means, electronic or mechanical or otherwise, including by photocopying, facsimile transmission, recording, re-keying or using any information storage and retrieval system without express written permission from the BBFC."

Web scraping can be considered "reproduction" or "transmission" as it involves downloading data from a website. Additionally, scraping the BBFC website could fall under the clause "using any information storage and retrieval system without express written permission from the BBFC", which can have legal and ethical implications. This analysis went ahead because I obtained permission from the BBFC.

## Checking the BBFC's Website robots.txt File

The robots.txt file is the standard file websites use to communicate with web crawlers and search engine bots, outlining the areas of the site they are allowed or not allowed to access and index. These guidelines are not legally binding; however, it is good practice to respect them.

The content of the robots.txt file for the BBFC's website ([link here](https://www.bbfc.co.uk/robots.txt)) is shown below:

User-agent: * </br>
Disallow: /search </br>
Sitemap: https://www.bbfc.co.uk/sitemap.xml

The * wildcard for User-agent indicates that the following rules apply to all web crawlers. The second line asks bots not to interact with or scrape search result pages. The final line provides the location of the website's sitemap.

Since the 'Disallow' directive covers only the '/search' path, other parts of the website are not explicitly restricted for scraping. Given this, along with written permission from the BBFC, there should be no legal or ethical issues in scraping the BBFC's website. The web scraping activity below has been carried out responsibly to avoid overloading the website's server.
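
The same check can be reproduced programmatically. The sketch below feeds the robots.txt lines quoted above into Python's standard `urllib.robotparser` (no network access is needed) and asks whether a generic crawler may fetch two paths. The `/search?q=example` query string is a made-up illustration.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content as quoted above
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Sitemap: https://www.bbfc.co.uk/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether any crawler ('*') may fetch each URL
recents_allowed = parser.can_fetch('*', 'https://www.bbfc.co.uk/recently-rated')
search_allowed = parser.can_fetch('*', 'https://www.bbfc.co.uk/search?q=example')

print('recently-rated allowed:', recents_allowed)
print('search allowed:', search_allowed)
```

Parsing the quoted text rather than downloading the file keeps the check reproducible; `RobotFileParser.set_url` and `read` could be used instead to fetch the live file.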

# 1. Scrape the BBFC's Recents Page

In [None]:
# Installing the necessary libraries
!pip install requests beautifulsoup4 pandas



In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [None]:
# Identify URL to scrape
url = 'https://www.bbfc.co.uk/recently-rated'
response = requests.get(url)

In [None]:
# Parse the HTML content using beautiful soup
soup = BeautifulSoup(response.text, 'html.parser')

## 1.1 Finding the Relevant Data

The data to be scraped for each recently rated film is outlined below:

'Title', 'Rating Content', 'Synopsis', 'Classified Date', 'Language', 'Director(s)', 'Production Year', 'Release Date', 'Genre', 'Running Minutes' and 'Cast'

The HTML structure of the website was inspected using the web browser's developer tools.

The Python script below follows each movie's link on the main recently-rated page and then extracts the required information from the individual movie's page.
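
A small aside on building each movie's address: the scraping functions in this notebook concatenate `base_url + movie_relative_url`, which works when the `href` begins with `/`. A slightly more defensive sketch uses the standard library's `urljoin`, which also copes with an `href` that is already absolute. The paths below are hypothetical placeholders, not real BBFC links.

```python
from urllib.parse import urljoin

base_url = 'https://www.bbfc.co.uk'

# Hypothetical href values, for illustration only
relative_href = '/release/example-title'
absolute_href = 'https://www.bbfc.co.uk/release/example-title'

# urljoin resolves a relative path against the base and
# leaves an already-absolute URL unchanged
from_relative = urljoin(base_url, relative_href)
from_absolute = urljoin(base_url, absolute_href)

print(from_relative)
print(from_absolute)
```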


In [None]:
# Define the scraping function. Note: This approach does not correctly target the values associated with each heading.

def scrape_movie_info(base_url, movie_relative_url):
    # Fetch the movie page
    movie_url = base_url + movie_relative_url
    response = requests.get(movie_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract required information
    title = soup.find('h2', class_='Type_display2__39wf0').span.text
    rating_content = soup.find('span', class_='RatingColor_color-15__2eUtg').span.text
    synopsis = soup.find('p', class_='Type_base-large__3q3Or').text
    classified_date = soup.find('h4', text='Classified date').find_next_sibling('h4').text
    language = soup.find('h4', text='Language').find_next_sibling('h4').text
    director = soup.find('h4', text='Director(s)').find_next_sibling('h4').text
    production_year = soup.find('h4', text='Production Year').find_next_sibling('h4').text
    release_date = soup.find('h4', text='Release date').find_next_sibling('h4').text
    genre = soup.find('h4', text='Genre(s)').find_next_sibling('h4').text
    running_minutes = soup.find('h4', text='Approx. running minutes').find_next_sibling('h4').text
    cast = soup.find('h4', text='Cast').find_next_sibling('h4').text

    return {
        'Title': title,
        'Rating Content': rating_content,
        'Synopsis': synopsis,
        'Classified Date': classified_date,
        'Language': language,
        'Director': director,
        'Production Year': production_year,
        'Release Date': release_date,
        'Genre': genre,
        'Running Minutes': running_minutes,
        'Cast': cast
    }

This approach of selecting the individual HTML elements does not correctly target the values associated with each heading. This is because the website uses the same h4 class for different items. For example, the h4 elements for Director(s) and Production Year both use class="Type_base__2EnB2". A more appropriate solution is shown below.
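
The heading-then-sibling idea can be illustrated on a small inline HTML fragment. The markup and the helper name `value_after_heading` below are made-up stand-ins for illustration, not the BBFC's actual page or the notebook's function.

```python
from bs4 import BeautifulSoup

# Made-up fragment: every <h4> shares one class, so the class alone
# cannot tell a heading from its value.
html = """
<div>
  <h4 class="item">Director(s)</h4><h4 class="item">Liu Xiaoshi</h4>
  <h4 class="item">Production Year</h4><h4 class="item">2023</h4>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

def value_after_heading(soup, heading):
    """Return the text of the <h4> that follows the <h4> whose text is `heading`."""
    heading_tag = soup.find('h4', string=heading)
    if heading_tag is None:
        return None
    value_tag = heading_tag.find_next_sibling('h4')
    return value_tag.get_text(strip=True) if value_tag else None

director = value_after_heading(soup, 'Director(s)')
year = value_after_heading(soup, 'Production Year')
missing = value_after_heading(soup, 'Cast')  # heading absent from the fragment

print(director, year, missing)
```

Matching on the heading text first, and only then stepping to the sibling, is what separates one field from another when the class names are identical.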

In [None]:
# Define the scraping function.
def scrape_movie_info(base_url, movie_relative_url):
    movie_url = base_url + movie_relative_url
    response = requests.get(movie_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Define a helper function to safely extract text, this helper function is used to find the <h4> tag and find its next sibling <h4> tag which contains the actual data we want.
    # As some titles do not have some information shown on the recently-rated page, this function returns 'None' if the tag is not found.
    def extract_text_by_heading(heading):
        element = soup.find('h4', class_='Type_base__2EnB2 ListItem_Title__tu7eT', string=heading)
        if element:
            sibling = element.find_next_sibling('h4')
            if sibling:
                return sibling.text
        return None

    title = soup.find('h2', class_='Type_display2__39wf0').span.text if soup.find('h2', class_='Type_display2__39wf0') else None

    type_div = soup.find('div', class_='MediaOutline_MediaType__AipEd')
    movie_type = type_div.text if type_div else None

    rating_content_div = soup.find('div', class_='Type_subtitle__3KRKY')
    rating_content = rating_content_div.span.span.text if rating_content_div else None

    synopsis = soup.find('p', class_='Type_base-large__3q3Or').text if soup.find('p', class_='Type_base-large__3q3Or') else None

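    # Guard against a missing heading before looking for its sibling
    classified_date_heading = soup.find('h4', string='Classified date')
    classified_date_div = classified_date_heading.find_next_sibling('h4') if classified_date_heading else None
    classified_date = classified_date_div.text if classified_date_div else None

    distributor_div = soup.find('h4', string='Distributor')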
    distributor = distributor_div.find_next_sibling('h4').text if distributor_div and distributor_div.find_next_sibling('h4') else None

    language_div = soup.find('h4', string='Language')
    language = language_div.find_next_sibling('h4').text if language_div and language_div.find_next_sibling('h4') else None

    director = extract_text_by_heading('Director(s)')
    production_year = extract_text_by_heading('Production Year')
    release_date = extract_text_by_heading('Release date')
    genre = extract_text_by_heading('Genre(s)')
    running_minutes = extract_text_by_heading('Approx. running minutes')
    cast = extract_text_by_heading('Cast')

    return {
        'Title': title,
        'Type': movie_type,
        'Rating Content': rating_content,
        'Synopsis': synopsis,
        'Classified Date': classified_date,
        'Distributor': distributor,
        'Language': language,
        'Director': director,
        'Production Year': production_year,
        'Release Date': release_date,
        'Genre': genre,
        'Running Minutes': running_minutes,
        'Cast': cast
    }


In [None]:
# Select the main page for movie links
base_url = 'https://www.bbfc.co.uk'
main_url = 'https://www.bbfc.co.uk/recently-rated'
response = requests.get(main_url)
soup = BeautifulSoup(response.content, 'html.parser')

movie_data = []
for div in soup.find_all('div', class_='SearchItem_SearchItem__3kbZF'):
    movie_relative_url = div.find('a')['href']
    movie_info = scrape_movie_info(base_url, movie_relative_url)
    movie_data.append(movie_info)
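
To keep the request rate gentle, in line with the aim of not overloading the website's server, one option is to pause between page fetches. The helper below is a hypothetical sketch: the name `throttled` and the one-second default are assumptions, not part of the original notebook. It yields items from any iterable and sleeps between consecutive items.

```python
import time

def throttled(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items one by one, pausing between consecutive items.

    `sleep` is injectable so the pause can be replaced when testing.
    """
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item

# Demonstration with a recording stand-in for time.sleep
recorded_pauses = []
result = list(throttled(['a', 'b', 'c'], delay_seconds=0.5, sleep=recorded_pauses.append))
print(result, recorded_pauses)
```

In the loop above, this could be applied as `for div in throttled(soup.find_all('div', class_='SearchItem_SearchItem__3kbZF')):`, so that each movie page is requested after a short pause.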




In [None]:
# Print the scraped data
display(movie_data)

[{'Title': 'Born To Fly',
  'Type': 'Film',
  'Rating Content': 'moderate bloody images, threat',
  'Synopsis': 'An elite squadron of Chinese Air Force test pilots face danger in the air and their personal demons in this action-packed Mandarin language drama, which features sustained threat and aerial accidents that cause injuries.…Read more',
  'Classified Date': '11/12/2023',
  'Distributor': None,
  'Language': 'English',
  'Director': 'Liu Xiaoshi',
  'Production Year': '2023',
  'Release Date': None,
  'Genre': 'Action, Drama',
  'Running Minutes': '128m',
  'Cast': 'Wang Yibo, Zhou Dongyu, Hu Jun'},
 {'Title': 'Anyone But You',
  'Type': 'Film',
  'Rating Content': 'very strong language',
  'Synopsis': 'Some coarse language, sexual material and rude humour abound in this frantic, good-natured US romantic comedy in which a couple who fell out after a disastrous first date must pretend to be devoted to each other.…Read more',
  'Classified Date': '11/12/2023',
  'Distributor': None

In [None]:
# convert to a Dataframe
recent_df = pd.DataFrame(movie_data)

display(recent_df)

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast
0,Born To Fly,Film,"moderate bloody images, threat",An elite squadron of Chinese Air Force test pi...,11/12/2023,,English,Liu Xiaoshi,2023,,"Action, Drama",128m,"Wang Yibo, Zhou Dongyu, Hu Jun"
1,Anyone But You,Film,very strong language,"Some coarse language, sexual material and rude...",11/12/2023,,English,Will Gluck,2023,26/12/2023,"Comedy, Romance",103m,"Sydney Sweeney, Glen Powell, Mia Artemis"
2,Honor Among Lovers,Film,"sexual harassment, mild sex references","HONOR AMONG LOVERS is a drama, from 1931, in w...",11/12/2023,,English,Dorothy Arzner,1931,,Drama,76m,"Claudette Colbert, Fredric March, Monroe Owsley"
3,Love Me Tonight,Film,"mild innuendo, domestic abuse references, lang...",LOVE ME TONIGHT is a US romantic musical comed...,11/12/2023,,English,Rouben Mamoulian,1932,,"Comedy, Musical, Romance",89m,"Maurice Chevalier, Jeanette MacDonald, Charles..."
4,The Four Of The Apocalypse...,Film,sexual violence,THE FOUR OF THE APOCALYPSE... is a 1975 Italia...,11/12/2023,,English,Lucio Fulci,1975,,"Drama, Western",104m,"Fabio Testi, Lynne Frederick, Michael J. Pollard"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,Fill Your Heart With Ireland - Sharon Horgan 3...,Advert,,,05/12/2023,,English,Jim O’ Hanlon,2023,24/12/2023,Documentary,1m,
86,Stolen Vacation,Film,language,,05/12/2023,,es,Diego Graue,2023,06/12/2023,Comedies,92m,"Ana Claudia Talancon, Bruno Bichir, Vianey Rod..."
87,Blood Coast,TV Show,"language, sexual images, violence",,05/12/2023,"NETFLIX, INC",fr,"Kamel Guemra, Olivier Marchal",2023,06/12/2023,"Thrillers, Action, Dramas",,"Florence Thomassin, Lani Sogoyou, Diouc Koma, ..."
88,Christmas As Usual,Film,"sex references, language",,05/12/2023,,en,Petter Holmsen,2023,06/12/2023,"Romance, Comedies, Dramas",89m,"Ida Ursin-Holm, Marit Adeleide Andreassen, Eri..."


In [None]:
# To save files beyond the lifecycle of Google Colab's current virtual machine instance, save file to Google Drive

# from google.colab import drive
# drive.mount('/content/drive')

## 1.2 Checking the Data: Data Wrangling and Cleaning

The data was first inspected manually by reviewing the first and last 10 rows. Individual entries were also checked. For example, the films 'Blood Vessel' and 'Jigarthanda Doublex (hindi)' have no synopsis in the dataset above, and this was confirmed by viewing each film's page. Similarly, viewing the pages for TV shows indicated that they often have no synopsis or running minutes. Random samples of the data were also examined and validated against the corresponding content on the recently-rated page.

In general:
- Trailers do not usually have a rating content and synopsis.
- TV shows are also likely to have no information provided for synopsis and running minutes.
- Films tend to have most of the information available, but do not have a designated distributor, which is only seen in VOD TV Shows.
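
The pattern in the list above can also be checked numerically. A minimal sketch on made-up rows (not the scraped data), assuming missing values are still real `None`/`NaN` rather than the string 'None', counts the gaps in each column per title type:

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    'Type': ['Film', 'Trailer', 'TV Show', 'Trailer'],
    'Rating Content': ['violence', None, 'threat', None],
    'Running Minutes': [119.0, 1.0, None, 2.0],
})

# Flag missing cells, then total the flags within each title type
missing_by_type = (
    sample.drop(columns='Type')
    .isna()
    .groupby(sample['Type'])
    .sum()
)

print(missing_by_type)
```

Applied to `recent_df` before the text columns are converted to strings, the same pattern would show which title types drive the gaps in each column.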

In [None]:
# Checking for Data Types
recent_df.dtypes

Title              object
Type               object
Rating Content     object
Synopsis           object
Classified Date    object
Distributor        object
Language           object
Director           object
Production Year    object
Release Date       object
Genre              object
Running Minutes    object
Cast               object
dtype: object

In [None]:
# Changing each column to the correct data type

# Convert date columns to datetime.
# errors='coerce' sets the value to NaT (Not a Time) if the conversion fails.
# Note: the scraped dates are day-first (dd/mm/yyyy); without dayfirst=True or an explicit format,
# ambiguous values such as 11/12/2023 are parsed month-first.
recent_df['Classified Date'] = pd.to_datetime(recent_df['Classified Date'], errors='coerce').dt.date
recent_df['Production Year'] = pd.to_datetime(recent_df['Production Year'], format='%Y', errors='coerce').dt.date
recent_df['Release Date'] = pd.to_datetime(recent_df['Release Date'], errors='coerce').dt.date

# Remove the 'm' from 'Running Minutes' and convert to numeric
recent_df['Running Minutes'] = pd.to_numeric(recent_df['Running Minutes'].str.replace('m', '', regex=False), errors='coerce')


# Convert other text columns to string
text_columns = ['Title', 'Type', 'Rating Content', 'Synopsis', 'Distributor', 'Language', 'Director', 'Genre', 'Cast']
recent_df[text_columns] = recent_df[text_columns].astype(str)

# Check the new data types
print(recent_df.dtypes)

Title               object
Type                object
Rating Content      object
Synopsis            object
Classified Date     object
Distributor         object
Language            object
Director            object
Production Year     object
Release Date        object
Genre               object
Running Minutes    float64
Cast                object
dtype: object
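
One point worth checking in the conversion above: the BBFC pages show dates day-first (for example `26/12/2023`), so an ambiguous value such as `11/12/2023` can be read as 12 November when no format is given. A minimal sketch, assuming every date string follows `dd/mm/yyyy`, passes an explicit format so that parsing is unambiguous:

```python
import pandas as pd

# Sample values in the dd/mm/yyyy style seen in the scraped data
raw_dates = pd.Series(['11/12/2023', '26/12/2023', None])

# An explicit format removes the day/month ambiguity;
# errors='coerce' turns missing or malformed values into NaT.
parsed = pd.to_datetime(raw_dates, format='%d/%m/%Y', errors='coerce')

print(parsed)
```

The same `format='%d/%m/%Y'` argument could be added to the 'Classified Date' and 'Release Date' conversions if the day-first assumption holds for every row.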




In [None]:
# Count the number of empty strings in each column of recent_df
empty_string_count = (recent_df == '').sum()

# Display the count of empty strings for each column
print(empty_string_count)

Title               0
Type                0
Rating Content      0
Synopsis           31
Classified Date     0
Distributor         0
Language            0
Director            0
Production Year     0
Release Date        0
Genre               0
Running Minutes     0
Cast                0
dtype: int64


In [None]:
# Filter to view rows where 'Synopsis' is an empty string
empty_synopsis_rows = recent_df[recent_df['Synopsis'] == '']

# Display the rows with empty 'Synopsis'
display(empty_synopsis_rows)

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast
8,Happiness,Episode,"suicide, violence",,2023-11-12,"NETFLIX, INC",ko,"Han Sang-woon, An Gil-ho",2021-01-01,2023-12-12,"Horror, Thrillers, Dramas",,"Jung Woon-sun, Park Hee-von, Park Joo-hee, Kim..."
9,Blippi & Meekah's Game Show!,TV Show,no material likely to offend or harm,,2023-10-12,"NETFLIX, INC",en,,2023-01-01,2023-11-12,Kids,,"Clayton Grimm, Cashae Monya"
10,True Beauty,TV Show,"sexual violence references, nudity, suicide, s...",,2023-10-12,"NETFLIX, INC",ko,"Lee Si-eun, Kim Sang-hyeob",2020-01-01,2023-11-12,"Romance, Comedies",,"Kim Min-gi, Chang Hyae-jin, Hwang In-youp, Mun..."
34,Blood Vessel,Film,"injury detail, violence",,2023-07-12,,ijo,Moses Inwang,2023-01-01,2023-08-12,"Thrillers, Dramas",119.0,"Alex Cyr Budin, Levi Chikere, Achufusi Ekene S..."
35,Barbie: A Touch Of Magic,TV Show,threat,,2023-07-12,"NETFLIX, INC",en,,2023-01-01,2023-08-12,"Kids, Fantasy",,"Kirsten Day, Ritesh Rajan, America Young, Tati..."
36,I Cannot Reach You,TV Show,sexual threat,,2023-07-12,"NETFLIX, INC",ja,,2023-01-01,2023-08-12,"Romance, Dramas",,"Haru Kashiwagi, Tomo Nakai, Shuto Fukushima, L..."
37,Jigarthanda Doublex (hindi),Film,"injury detail, violence",,2023-06-12,,ta,Karthik Subbaraj,2023-01-01,2023-07-12,"Action, Dramas",170.0,"Aravind Akash, Raghava Lawrence, Illavarasu, S..."
38,Jigarthanda Doublex,Film,"injury detail, violence",,2023-06-12,,ta,Karthik Subbaraj,2023-01-01,2023-07-12,"Action, Dramas",170.0,"Aravind Akash, Raghava Lawrence, Shine Tom Cha..."
39,Jigarthanda Doublex (malayalam),Film,"injury detail, violence",,2023-06-12,,ta,Karthik Subbaraj,2023-01-01,2023-07-12,"Action, Dramas",170.0,"Aravind Akash, Raghava Lawrence, Illavarasu, S..."
40,Jigarthanda Doublex (kannada),Film,"injury detail, violence",,2023-06-12,,ta,Karthik Subbaraj,2023-01-01,2023-07-12,"Action, Dramas",170.0,"Aravind Akash, Raghava Lawrence, Shine Tom Cha..."


In [None]:
# Replace empty strings in 'Synopsis' column with 'None'
recent_df['Synopsis'] = recent_df['Synopsis'].replace('', 'None')

After the cleaning steps above:
- Empty numerical cells are represented as NaN, which keeps missing data consistent in numerical columns.
- Missing dates are represented as NaT.
- In the text columns, empty cells are represented by the string 'None' to indicate the absence of information.
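
An alternative to the string placeholder 'None', sketched here on made-up rows, is to keep text gaps as real missing values so that a single `isna()` call covers every column. This is an assumption about what may be convenient for later analysis, not what the notebook does.

```python
import pandas as pd

# Made-up rows: one empty string, one None, one NaN
sample = pd.DataFrame({
    'Synopsis': ['A drama.', '', None],
    'Running Minutes': [128.0, float('nan'), 92.0],
})

# Treat empty strings as missing, then count gaps per column in one pass
cleaned = sample.replace('', pd.NA)
missing_counts = cleaned.isna().sum()

print(missing_counts)
```

With this representation there is no need for a custom check that compares each cell against the string 'None'.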

In [None]:
# Count the number of empty strings in each column of recent_df
empty_string_count = (recent_df == '').sum()

# Display the count of empty strings for each column
print(empty_string_count)

Title              0
Type               0
Rating Content     0
Synopsis           0
Classified Date    0
Distributor        0
Language           0
Director           0
Production Year    0
Release Date       0
Genre              0
Running Minutes    0
Cast               0
dtype: int64


In [None]:
# Check for missing values

# Function to check for NaN, NaT, None, and the string 'None'
def is_missing_or_none(x):
    return pd.isna(x) or x == 'None'

# Apply this function to the DataFrame and count True values
missing_or_none_count = recent_df.applymap(is_missing_or_none).sum()

print(missing_or_none_count)

Title               0
Type                0
Rating Content     16
Synopsis           44
Classified Date     0
Distributor        76
Language            0
Director            8
Production Year     0
Release Date       30
Genre               0
Running Minutes    14
Cast               12
dtype: int64


Random sampling was done to check titles with specific missing information. For example, titles with no rating content, such as 'Inside Out Trl B - Goosebumps' and 'The Jungle Bunch World Tour', were manually confirmed against the website.

To review every title with specific missing information, use the code below and change the column names in the 'columns_to_check' variable.

In [None]:
# Define the list of columns you want to check for missing values
columns_to_check = ['Rating Content', 'Synopsis', 'Distributor', 'Language', 'Release Date', 'Running Minutes', 'Cast']

# Function to check for NaN, None, and the string 'None'
def is_missing_or_none(x):
    return pd.isna(x) or x == 'None'

# Filter and display rows for each column separately
for column in columns_to_check:
    missing_values_in_column = recent_df[recent_df[column].apply(is_missing_or_none)]

    display(f"Rows with missing values in '{column}':\n", missing_values_in_column, "\n")

"Rows with missing values in 'Rating Content':\n"

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast
11,Amazon Music Live With Latto - Season 1: Trailer,Trailer,,,2023-09-12,,English,Micah Bickham,2023-01-01,NaT,"Music, Performance",1.0,Latto
25,Noryang: Deadly Sea,Trailer,,,2023-07-12,,Korean,Kim Han-min,2023-01-01,2023-07-12,"Action, Drama, War",1.0,"Yeo Jin-goo, Kim Jae-young, Kim Yoon-seok"
26,Dhootha - Season 1: Trailer,Trailer,,,2023-07-12,,Telugu,Vikram K Kumar,2023-01-01,NaT,"Drama, Thriller",2.0,"Naga Chaitanya Akkineni, Parvathy Thiruvothu, ..."
30,Inside Out Trl B - Goosebumps,Trailer,,,2023-07-12,,English,not known,2023-01-01,2023-07-12,"Adventure, Animation, Comedy",1.0,"Amy Poehler, Phyllis Smith, Lewis Black"
31,Inside Out 2,Trailer,,,2023-07-12,,English,not known,2023-01-01,2023-07-12,"Adventure, Animation, Comedy",1.0,"Amy Poehler, Phyllis Smith, Lewis Black"
44,Potential,Advert,,,2023-06-12,,English,Vince Renee-Lortie,2023-01-01,2024-04-01,Education,1.0,
51,The Fall Guy,Trailer,,,2023-06-12,,English,David Leitch,2023-01-01,2023-06-12,Action,1.0,"Emily Blunt, Ryan Gosling, Hannah Waddingham"
52,Bob Marley One Love,Trailer,,,2023-06-12,,English,Reinaldo Marcos Green,2023-01-01,2023-06-12,"Drama, Music, Musical",2.0,
55,Aquaman And The Lost Kingdom,Trailer,,,2023-06-12,,English,James Wan,2023-01-01,2023-06-12,"Action, Fantasy",2.0,"Jason Momoa, Ben Affleck, Patrick Wilson"
68,Godzilla X Kong: The New Empire,Trailer,,,2023-05-12,,English,Adam Wingard,2023-01-01,2023-05-12,"Action, Adventure, Fantasy",2.0,"Rebecca Hall, Dan Stevens, Rachel House"


'\n'

"Rows with missing values in 'Synopsis':\n"

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast
8,Happiness,Episode,"suicide, violence",,2023-11-12,"NETFLIX, INC",ko,"Han Sang-woon, An Gil-ho",2021-01-01,2023-12-12,"Horror, Thrillers, Dramas",,"Jung Woon-sun, Park Hee-von, Park Joo-hee, Kim..."
9,Blippi & Meekah's Game Show!,TV Show,no material likely to offend or harm,,2023-10-12,"NETFLIX, INC",en,,2023-01-01,2023-11-12,Kids,,"Clayton Grimm, Cashae Monya"
10,True Beauty,TV Show,"sexual violence references, nudity, suicide, s...",,2023-10-12,"NETFLIX, INC",ko,"Lee Si-eun, Kim Sang-hyeob",2020-01-01,2023-11-12,"Romance, Comedies",,"Kim Min-gi, Chang Hyae-jin, Hwang In-youp, Mun..."
11,Amazon Music Live With Latto - Season 1: Trailer,Trailer,,,2023-09-12,,English,Micah Bickham,2023-01-01,NaT,"Music, Performance",1.0,Latto
25,Noryang: Deadly Sea,Trailer,,,2023-07-12,,Korean,Kim Han-min,2023-01-01,2023-07-12,"Action, Drama, War",1.0,"Yeo Jin-goo, Kim Jae-young, Kim Yoon-seok"
26,Dhootha - Season 1: Trailer,Trailer,,,2023-07-12,,Telugu,Vikram K Kumar,2023-01-01,NaT,"Drama, Thriller",2.0,"Naga Chaitanya Akkineni, Parvathy Thiruvothu, ..."
30,Inside Out Trl B - Goosebumps,Trailer,,,2023-07-12,,English,not known,2023-01-01,2023-07-12,"Adventure, Animation, Comedy",1.0,"Amy Poehler, Phyllis Smith, Lewis Black"
31,Inside Out 2,Trailer,,,2023-07-12,,English,not known,2023-01-01,2023-07-12,"Adventure, Animation, Comedy",1.0,"Amy Poehler, Phyllis Smith, Lewis Black"
34,Blood Vessel,Film,"injury detail, violence",,2023-07-12,,ijo,Moses Inwang,2023-01-01,2023-08-12,"Thrillers, Dramas",119.0,"Alex Cyr Budin, Levi Chikere, Achufusi Ekene S..."
35,Barbie: A Touch Of Magic,TV Show,threat,,2023-07-12,"NETFLIX, INC",en,,2023-01-01,2023-08-12,"Kids, Fantasy",,"Kirsten Day, Ritesh Rajan, America Young, Tati..."


'\n'

"Rows with missing values in 'Distributor':\n"

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast
0,Born To Fly,Film,"moderate bloody images, threat",An elite squadron of Chinese Air Force test pi...,2023-11-12,,English,Liu Xiaoshi,2023-01-01,NaT,"Action, Drama",128.0,"Wang Yibo, Zhou Dongyu, Hu Jun"
1,Anyone But You,Film,very strong language,"Some coarse language, sexual material and rude...",2023-11-12,,English,Will Gluck,2023-01-01,2023-12-26,"Comedy, Romance",103.0,"Sydney Sweeney, Glen Powell, Mia Artemis"
2,Honor Among Lovers,Film,"sexual harassment, mild sex references","HONOR AMONG LOVERS is a drama, from 1931, in w...",2023-11-12,,English,Dorothy Arzner,1931-01-01,NaT,Drama,76.0,"Claudette Colbert, Fredric March, Monroe Owsley"
3,Love Me Tonight,Film,"mild innuendo, domestic abuse references, lang...",LOVE ME TONIGHT is a US romantic musical comed...,2023-11-12,,English,Rouben Mamoulian,1932-01-01,NaT,"Comedy, Musical, Romance",89.0,"Maurice Chevalier, Jeanette MacDonald, Charles..."
4,The Four Of The Apocalypse...,Film,sexual violence,THE FOUR OF THE APOCALYPSE... is a 1975 Italia...,2023-11-12,,English,Lucio Fulci,1975-01-01,NaT,"Drama, Western",104.0,"Fabio Testi, Lynne Frederick, Michael J. Pollard"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
84,The Teachers' Lounge,Film,"infrequent strong language, brief drug references",An idealistic teacher faces a moral dilemma wh...,2023-05-12,,German,Ilker Çatak,2023-01-01,2024-03-22,Drama,99.0,"Leonie Benesch, Leonard Stettnisch, Eva Löbau"
85,Fill Your Heart With Ireland - Sharon Horgan 3...,Advert,,,2023-05-12,,English,Jim O’ Hanlon,2023-01-01,2023-12-24,Documentary,1.0,
86,Stolen Vacation,Film,language,,2023-05-12,,es,Diego Graue,2023-01-01,2023-06-12,Comedies,92.0,"Ana Claudia Talancon, Bruno Bichir, Vianey Rod..."
88,Christmas As Usual,Film,"sex references, language",,2023-05-12,,en,Petter Holmsen,2023-01-01,2023-06-12,"Romance, Comedies, Dramas",89.0,"Ida Ursin-Holm, Marit Adeleide Andreassen, Eri..."


'\n'

"Rows with missing values in 'Language':\n"

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast


'\n'

"Rows with missing values in 'Release Date':\n"

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast
0,Born To Fly,Film,"moderate bloody images, threat",An elite squadron of Chinese Air Force test pi...,2023-11-12,,English,Liu Xiaoshi,2023-01-01,NaT,"Action, Drama",128.0,"Wang Yibo, Zhou Dongyu, Hu Jun"
2,Honor Among Lovers,Film,"sexual harassment, mild sex references","HONOR AMONG LOVERS is a drama, from 1931, in w...",2023-11-12,,English,Dorothy Arzner,1931-01-01,NaT,Drama,76.0,"Claudette Colbert, Fredric March, Monroe Owsley"
3,Love Me Tonight,Film,"mild innuendo, domestic abuse references, lang...",LOVE ME TONIGHT is a US romantic musical comed...,2023-11-12,,English,Rouben Mamoulian,1932-01-01,NaT,"Comedy, Musical, Romance",89.0,"Maurice Chevalier, Jeanette MacDonald, Charles..."
4,The Four Of The Apocalypse...,Film,sexual violence,THE FOUR OF THE APOCALYPSE... is a 1975 Italia...,2023-11-12,,English,Lucio Fulci,1975-01-01,NaT,"Drama, Western",104.0,"Fabio Testi, Lynne Frederick, Michael J. Pollard"
5,I Want Him Dead,Film,"sexual violence, strong violence","I WANT HIM DEAD is a western, from 1968, in wh...",2023-11-12,,English,Paolo Bianchini,1968-01-01,NaT,Western,87.0,"Craig Hill, Lea Massari, José Manuel Martín"
6,The Wrath Of The Wind,Film,"strong violence, injury detail, sexual threat",THE WRATH OF THE WIND is a 1970's Italian west...,2023-11-12,,English,Mario Camus,1970-01-01,NaT,Western,97.0,"Mario Pardo, Carlo Alberto Cortina, Maximo Val..."
7,El Puro,Film,"strong violence, sexual threat",EL PURO is an Italian and Spanish language Wes...,2023-11-12,,Italian,Edoardo Mulargia,1969-01-01,NaT,Western,108.0,"Robert Woods, Maurizio Bonuglia, Rosalba Neri"
11,Amazon Music Live With Latto - Season 1: Trailer,Trailer,,,2023-09-12,,English,Micah Bickham,2023-01-01,NaT,"Music, Performance",1.0,Latto
13,Sumotherhood,Film,"strong language, drug misuse, violence, sex re...",This British crime comedy sees two hapless fri...,2023-08-12,,English,Adam Deacon,2023-01-01,NaT,"Comedy, Crime",97.0,"Adam Deacon, Jennifer Saunders, Danny Sapani"
16,Planet Earth Iii,TV Show,"scenes of animals hunting, very mild threat, i...",PLANET EARTH III is a wildlife documentary ser...,2023-08-12,2 entertain Video Ltd.,English,,2023-01-01,NaT,Documentary,,


'\n'

"Rows with missing values in 'Running Minutes':\n"

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast
8,Happiness,Episode,"suicide, violence",,2023-11-12,"NETFLIX, INC",ko,"Han Sang-woon, An Gil-ho",2021-01-01,2023-12-12,"Horror, Thrillers, Dramas",,"Jung Woon-sun, Park Hee-von, Park Joo-hee, Kim..."
9,Blippi & Meekah's Game Show!,TV Show,no material likely to offend or harm,,2023-10-12,"NETFLIX, INC",en,,2023-01-01,2023-11-12,Kids,,"Clayton Grimm, Cashae Monya"
10,True Beauty,TV Show,"sexual violence references, nudity, suicide, s...",,2023-10-12,"NETFLIX, INC",ko,"Lee Si-eun, Kim Sang-hyeob",2020-01-01,2023-11-12,"Romance, Comedies",,"Kim Min-gi, Chang Hyae-jin, Hwang In-youp, Mun..."
16,Planet Earth Iii,TV Show,"scenes of animals hunting, very mild threat, i...",PLANET EARTH III is a wildlife documentary ser...,2023-08-12,2 entertain Video Ltd.,English,,2023-01-01,NaT,Documentary,,
19,Icons Unearthed : Marvel,TV Show,"infrequent strong language, moderate violence,...",ICONS UNEARTHED : MARVEL is a documentary seri...,2023-08-12,Amazon Media EU S.à r.l.,English,Brian Volk-Weiss,2023-01-01,NaT,"Action, Documentary",,
35,Barbie: A Touch Of Magic,TV Show,threat,,2023-07-12,"NETFLIX, INC",en,,2023-01-01,2023-08-12,"Kids, Fantasy",,"Kirsten Day, Ritesh Rajan, America Young, Tati..."
36,I Cannot Reach You,TV Show,sexual threat,,2023-07-12,"NETFLIX, INC",ja,,2023-01-01,2023-08-12,"Romance, Dramas",,"Haru Kashiwagi, Tomo Nakai, Shuto Fukushima, L..."
56,Hilda,TV Show,"threat, violence",,2023-06-12,"NETFLIX, INC",en,Luke Pearson,2023-01-01,2023-07-12,Kids,,"Daisy Haggard, Oliver Nelson, Bella Ramsey, Am..."
58,High Tides,TV Show,"sex, language, sexual images, drug misuse",,2023-06-12,"NETFLIX, INC",nl-BE,,2023-01-01,2023-07-12,"Romance, Dramas",,"Eliyha Altena, Manouk Pluis, Jef Hellemans, Em..."
59,Analog Squad,TV Show,"suicide, language, sexual images",,2023-06-12,"NETFLIX, INC",th,"Aummaraporn Phandintong, Nithiwat Tharatorn",2023-01-01,2023-07-12,Dramas,,"Naphatrada Karnchanacharoen, Teravat Anuvatudo..."


'\n'

"Rows with missing values in 'Cast':\n"

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast
16,Planet Earth Iii,TV Show,"scenes of animals hunting, very mild threat, i...",PLANET EARTH III is a wildlife documentary ser...,2023-08-12,2 entertain Video Ltd.,English,,2023-01-01,NaT,Documentary,,
19,Icons Unearthed : Marvel,TV Show,"infrequent strong language, moderate violence,...",ICONS UNEARTHED : MARVEL is a documentary seri...,2023-08-12,Amazon Media EU S.à r.l.,English,Brian Volk-Weiss,2023-01-01,NaT,"Action, Documentary",,
32,Dagr,Film,"strong language, threat, violence, bloody images",Sustained scenes of unsettling supernatural th...,2023-07-12,,English,Matthew Butler-Hart,2023-01-01,2024-01-03,Horror,77.0,
44,Potential,Advert,,,2023-06-12,,English,Vince Renee-Lortie,2023-01-01,2024-04-01,Education,1.0,
47,Fried Rice And Latkes,Film,references to racism,FRIED RICE AND LATKES is a short US documentar...,2023-06-12,,English,Matty Neikrug & Varun Chopra,2023-01-01,NaT,Documentary,9.0,
52,Bob Marley One Love,Trailer,,,2023-06-12,,English,Reinaldo Marcos Green,2023-01-01,2023-06-12,"Drama, Music, Musical",2.0,
53,The Rev,Film,"strong sex references, injury detail, violence...",THE REV is a British crime documentary which e...,2023-06-12,,English,Rhys Edwards,2023-01-01,NaT,Documentary,93.0,
64,Making Squid Game: The Challenge,Film,"language, violence",,2023-06-12,,en,,2023-01-01,2023-07-12,Reality Programming,30.0,
71,Lord Of Misrule,Film,"strong violence, threat, gory images",LORD OF MISRULE is a horror drama in which par...,2023-05-12,,English,William Brent Bell,2023-01-01,NaT,"Drama, Horror",104.0,
76,Pearl & Dean Juniors,Advert,,,2023-05-12,,English,Pearl & Dean,2023-01-01,2023-08-12,"Animation, Children, Musical",2.0,


'\n'

### Checking General Assumptions

__1. Trailers do not usually have rating content or a synopsis.__

In [None]:
# Filter rows where 'Type' is 'Trailer'
trailers = recent_df[recent_df['Type'] == 'Trailer']

# Check if all 'Rating Content' or 'Synopsis' in trailers are NaN, None, or 'None'
all_missing = trailers[['Rating Content', 'Synopsis']].applymap(is_missing_or_none).all(axis=1)

print("All 'Rating Content' or 'Synopsis' in 'Trailers' are missing:", all_missing.all())

All 'Rating Content' or 'Synopsis' in 'Trailers' are missing: True


__2. TV shows are also likely to lack a synopsis and running minutes.__

In [None]:
# Filter rows where 'Type' is 'TV Show'
tv_shows = recent_df[recent_df['Type'] == 'TV Show']

# Check if all 'Synopsis', and 'Running Minutes' in TV Shows are NaN, None, or 'None'
all_missing = tv_shows[['Synopsis', 'Running Minutes']].applymap(is_missing_or_none).all(axis=1)

print("All 'Synopsis' and 'Running Minutes' in 'TV Shows' are missing:", all_missing.all())

All 'Synopsis' and 'Running Minutes' in 'TV Shows' are missing: False


In [None]:
# Filter and display rows for TV Shows where not all 'Synopsis' and 'Running Minutes' are missing
tv_shows_with_some_data = tv_shows[~all_missing]
display(tv_shows_with_some_data)

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast
16,Planet Earth Iii,TV Show,"scenes of animals hunting, very mild threat, i...",PLANET EARTH III is a wildlife documentary ser...,2023-08-12,2 entertain Video Ltd.,English,,2023-01-01,NaT,Documentary,,
19,Icons Unearthed : Marvel,TV Show,"infrequent strong language, moderate violence,...",ICONS UNEARTHED : MARVEL is a documentary seri...,2023-08-12,Amazon Media EU S.à r.l.,English,Brian Volk-Weiss,2023-01-01,NaT,"Action, Documentary",,


The corresponding BBFC pages for both of these TV shows were checked to confirm that the data is indeed missing on the website.

__3. Films tend to have most of the information available, but do not have a designated distributor, which is only seen in VOD TV Shows.__


In [None]:
# Filter rows where 'Type' is 'Film'
films = recent_df[recent_df['Type'] == 'Film']

# Check if all 'Distributor' in films are NaN, None, or 'None'
all_missing_distributor = films['Distributor'].apply(is_missing_or_none).all()

display("All 'Distributor' in 'Films' are missing:", all_missing_distributor)

"All 'Distributor' in 'Films' are missing:"

True


Depending on how thorough the analysis needs to be, a report could be produced by manually checking each title with missing information against the BBFC's website to confirm that the information is genuinely absent there. This is feasible because there are only 99 rows.
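
As a starting point for such a report, here is a minimal sketch that lists which columns are missing for each title. The `is_missing` helper is a hypothetical stand-in for the notebook's `is_missing_or_none` function, and the sample rows are toy data for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's is_missing_or_none helper
def is_missing(value):
    """Return True for NaN/None/NaT and for the placeholder strings 'None' and ''."""
    if isinstance(value, str):
        return value.strip() in ('', 'None')
    return pd.isna(value)

def missing_report(frame, id_column='Title'):
    """Build one row per title that has gaps, naming the columns that are missing."""
    records = []
    for _, row in frame.iterrows():
        missing_cols = [col for col in frame.columns if is_missing(row[col])]
        if missing_cols:
            records.append({id_column: row[id_column],
                            'Missing Columns': ', '.join(missing_cols)})
    return pd.DataFrame(records, columns=[id_column, 'Missing Columns'])

# Toy data for illustration only
sample = pd.DataFrame({
    'Title': ['Dagr', 'Hilda'],
    'Synopsis': ['placeholder synopsis', None],
    'Distributor': ['None', 'NETFLIX, INC'],
})
report = missing_report(sample)
print(report)
```

Running the same function on `recent_df` would give a checklist of titles and fields to verify by hand on the website.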

In [None]:
# Display unique values for specific columns
columns_of_interest = ['Type', 'Distributor', 'Language', 'Genre']
for column in columns_of_interest:
    unique_values = recent_df[column].unique()
    print(f"Unique values in '{column}':\n{unique_values}\n")

Unique values in 'Type':
['Film' 'Episode' 'TV Show' 'Trailer' 'Advert']

Unique values in 'Distributor':
['None' 'NETFLIX, INC' '2 entertain Video Ltd.' 'Amazon Media EU S.à r.l.']

Unique values in 'Language':
['English' 'Italian' 'ko' 'en' 'Japanese' 'Turkish' 'Mandarin' 'Spanish'
 'Korean' 'Telugu' 'Tamil' 'German' 'ijo' 'ja' 'ta' 'hi' 'ml' 'Cantonese'
 'Malayalam' 'ar-SA' 'nl-BE' 'th' 'id' 'Danish' 'French' 'Wolof' 'es' 'fr']

Unique values in 'Genre':
['Action, Drama' 'Comedy, Romance' 'Drama' 'Comedy, Musical, Romance'
 'Drama, Western' 'Western' 'Horror, Thrillers, Dramas' 'Kids'
 'Romance, Comedies' 'Music, Performance' 'Comedy, Crime'
 'Action, Drama, Fantasy' 'Documentary' 'Comedy' 'Drama, Thriller'
 'Action, Documentary' 'Action, Thriller' 'Horror, Thriller' 'Crime'
 'Children, Comedy, Romance' 'Action, Drama, War'
 'Action, Crime, Thriller' 'Comedy, Drama' 'Thriller'
 'Adventure, Animation, Comedy' 'Horror' 'Thrillers, Dramas'
 'Kids, Fantasy' 'Romance, Dramas' 'Action, Dr

In [None]:
# If you would like to display unique values for every column
#for column in recent_df.columns:
#    unique_values = recent_df[column].unique()
#    print(f"Unique values in '{column}':\n{unique_values}\n")

### Standardising Values in the Language Column

The 'Language' column mixes full names (e.g. 'English') with language codes (e.g. 'en'). Here I map the codes, which mostly follow ISO 639-1, to their full English names; this can be changed to whatever is most suitable. I have intentionally kept Cantonese, Saudi Arabian Arabic (ar-SA) and Flemish (nl-BE) as separate languages because of differences in dialect, pronunciation, vocabulary and grammar, so that the label is more appropriate to the intended audience viewing the content. Cantonese is also not mutually intelligible with Mandarin, so the two are kept separate.

In [None]:
# Create a dictionary to map language ISO 639-1 codes to full names.
language_map = {
    'en': 'English', 'ko': 'Korean', 'ja': 'Japanese', 'tr': 'Turkish',
    'zh': 'Chinese', 'es': 'Spanish', 'te': 'Telugu', 'ta': 'Tamil',
    'de': 'German', 'ijo': 'Ijo', 'hi': 'Hindi', 'ml': 'Malayalam',
    'ar': 'Arabic', 'nl': 'Dutch', 'th': 'Thai', 'id': 'Indonesian',
    'da': 'Danish', 'fr': 'French', 'it': 'Italian', 'wo': 'Wolof',
    'ar-SA': 'Arabic (Saudi Arabia)', 'nl-BE': 'Dutch (Belgium)',
    'Cantonese': 'Cantonese', 'None': 'None'
}

# Replace the values in 'Language' column using the mapping
recent_df['Language'] = recent_df['Language'].replace(language_map)

# Check the unique values again
print("Corrected Unique values in 'Language':\n", recent_df['Language'].unique())

Corrected Unique values in 'Language':
 ['English' 'Italian' 'Korean' 'Japanese' 'Turkish' 'Mandarin' 'Spanish'
 'Telugu' 'Tamil' 'German' 'Ijo' 'Hindi' 'Malayalam' 'Cantonese'
 'Arabic (Saudi Arabia)' 'Dutch (Belgium)' 'Thai' 'Indonesian' 'Danish'
 'French' 'Wolof']
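
A quick way to confirm that no codes were left unmapped is to look for values that still look like a language code after the replacement. The sketch below uses a small hypothetical map and toy values; the regular expression is my own assumption about what a leftover code looks like (two or three lowercase letters, optionally followed by a region such as `-BE`).

```python
import pandas as pd

# Small hypothetical subset of the mapping, for illustration only
language_map_sample = {'en': 'English', 'ko': 'Korean', 'nl-BE': 'Dutch (Belgium)'}

# Toy values; 'pt' is deliberately absent from the map
languages = pd.Series(['English', 'ko', 'en', 'nl-BE', 'pt'])
mapped = languages.replace(language_map_sample)

# Anything matching xx, xxx or xx-YY is probably still an unmapped code
leftover = mapped[mapped.str.fullmatch(r'[a-z]{2,3}(-[A-Z]{2})?')]
print("Values that still look like codes:", leftover.tolist())
```

An empty list here would suggest that every code in the column has a corresponding entry in the dictionary.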


In [None]:
# Check for duplicates in the 'Title' column
duplicates_in_title = recent_df['Title'].duplicated(keep=False)

# Display rows with duplicate 'Title'. An empty Index:[] indicates no duplicates found.
print("Rows with duplicate 'Title':\n", recent_df[duplicates_in_title])

Rows with duplicate 'Title':
 Empty DataFrame
Columns: [Title, Type, Rating Content, Synopsis, Classified Date, Distributor, Language, Director, Production Year, Release Date, Genre, Running Minutes, Cast]
Index: []


In [None]:
# View Final Dataset
display(recent_df)

Unnamed: 0,Title,Type,Rating Content,Synopsis,Classified Date,Distributor,Language,Director,Production Year,Release Date,Genre,Running Minutes,Cast
0,Born To Fly,Film,"moderate bloody images, threat",An elite squadron of Chinese Air Force test pi...,2023-11-12,,English,Liu Xiaoshi,2023-01-01,NaT,"Action, Drama",128.0,"Wang Yibo, Zhou Dongyu, Hu Jun"
1,Anyone But You,Film,very strong language,"Some coarse language, sexual material and rude...",2023-11-12,,English,Will Gluck,2023-01-01,2023-12-26,"Comedy, Romance",103.0,"Sydney Sweeney, Glen Powell, Mia Artemis"
2,Honor Among Lovers,Film,"sexual harassment, mild sex references","HONOR AMONG LOVERS is a drama, from 1931, in w...",2023-11-12,,English,Dorothy Arzner,1931-01-01,NaT,Drama,76.0,"Claudette Colbert, Fredric March, Monroe Owsley"
3,Love Me Tonight,Film,"mild innuendo, domestic abuse references, lang...",LOVE ME TONIGHT is a US romantic musical comed...,2023-11-12,,English,Rouben Mamoulian,1932-01-01,NaT,"Comedy, Musical, Romance",89.0,"Maurice Chevalier, Jeanette MacDonald, Charles..."
4,The Four Of The Apocalypse...,Film,sexual violence,THE FOUR OF THE APOCALYPSE... is a 1975 Italia...,2023-11-12,,English,Lucio Fulci,1975-01-01,NaT,"Drama, Western",104.0,"Fabio Testi, Lynne Frederick, Michael J. Pollard"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,Fill Your Heart With Ireland - Sharon Horgan 3...,Advert,,,2023-05-12,,English,Jim O’ Hanlon,2023-01-01,2023-12-24,Documentary,1.0,
86,Stolen Vacation,Film,language,,2023-05-12,,Spanish,Diego Graue,2023-01-01,2023-06-12,Comedies,92.0,"Ana Claudia Talancon, Bruno Bichir, Vianey Rod..."
87,Blood Coast,TV Show,"language, sexual images, violence",,2023-05-12,"NETFLIX, INC",French,"Kamel Guemra, Olivier Marchal",2023-01-01,2023-06-12,"Thrillers, Action, Dramas",,"Florence Thomassin, Lani Sogoyou, Diouc Koma, ..."
88,Christmas As Usual,Film,"sex references, language",,2023-05-12,,English,Petter Holmsen,2023-01-01,2023-06-12,"Romance, Comedies, Dramas",89.0,"Ida Ursin-Holm, Marit Adeleide Andreassen, Eri..."


## 1.4 Save to CSV

When saving a DataFrame to CSV, pandas writes NaN and NaT values as empty fields by default. This is preferred for later analysis, as it ensures that missing values are recognised as missing when the CSV is read back into a DataFrame or loaded into other data processing tools. For consistency, the placeholder string 'None' is also replaced with an empty string before the final CSV file is saved below.

In [None]:
# Save the Dataframe as a csv file

# Replace 'None' with an empty string
recent_df.replace('None', '', inplace=True)

# Save the DataFrame to a CSV file
recent_df.to_csv('bbfc_recently_rated_titles.csv', index=False)
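
To illustrate the behaviour described above, the following sketch round-trips a small toy DataFrame through an in-memory CSV. The column values are made up for the example; only the handling of missing values matters.

```python
import io
import numpy as np
import pandas as pd

# Toy data for illustration only
sample = pd.DataFrame({
    'Title': ['Example A', 'Example B'],
    'Running Minutes': [97.0, np.nan],
    'Release Date': pd.to_datetime(['2023-12-26', None]),
    'Distributor': ['None', 'NETFLIX, INC'],
})

# Same clean-up step as in the notebook: 'None' becomes an empty string
sample = sample.replace('None', '')

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the CSV back turns every empty field into NaN/NaT
reloaded = pd.read_csv(io.StringIO(csv_text), parse_dates=['Release Date'])
print(reloaded.isna().sum())
```

After the round trip, the NaN, the NaT and the former 'None' string are all reported as missing, which is the consistent behaviour the saved file relies on.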

# 2. Scraping the list of given titles from the BBFC website

Here we scrape a list of given titles from the BBFC website and store the data in a separate dataframe.

For convenience, and to keep everything contained in this notebook, I created a DataFrame of the movie titles directly in the notebook below. An alternative approach would be to keep the BBFC titles CSV file in the same directory as this notebook and read it in. Changes, such as adding more movie titles, could then be made to that file before re-running the analysis.
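
A minimal sketch of the CSV-based alternative is given here. The in-memory text stands in for a file on disk, and the file name `BBFC_titles.csv` mentioned in the comment is an assumption. Reading the column as strings keeps numeric-looking titles such as "65" and "1976" intact.

```python
import io
import pandas as pd

# Stand-in for a file on disk; in practice a path such as
# 'BBFC_titles.csv' (hypothetical name) would be passed to read_csv instead
csv_source = io.StringIO("Title\nCreed III\n65\n1976\n Infinity Pool \nCreed III\n")

# dtype=str stops titles like "65" from being parsed as numbers
titles_df = pd.read_csv(csv_source, dtype={'Title': str})

# Tidy up stray whitespace and repeated titles before scraping
titles_df['Title'] = titles_df['Title'].str.strip()
titles_df = titles_df.drop_duplicates(subset='Title').reset_index(drop=True)
print(titles_df)
```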

There are 3 possible ways to carry this analysis out:

__1. Manual Collection:__ Since the list of movies is not too large, you could consider manually visiting the BBFC's website and copying the full URLs for each movie.

__2. Search Functionality:__ As the BBFC's website has a search function, you could automate the process of searching for each movie title and scraping the URL from the search results. This would require submitting a query to the search function and parsing the search results page.

__3. API or Database:__ This would only work if the BBFC provides an API or a publicly accessible database that includes the unique identifiers used in each title's URL. This would be the most efficient and reliable way to construct the URLs for each movie title.

While searching online for a BBFC API, I came across [this](https://github.com/Fustra/bbfcapi) GitHub repository by QasimK, last updated in March 2021, which contains a BBFC API that might have been a viable and efficient approach. The API offers various functionalities that can simplify retrieving information about movies from the BBFC's website. However, upon testing and further inspection of the API's documentation, it can only retrieve the movie title and age rating (shown below).

Knowing this, the second approach was taken for this activity.

In [None]:
# Import the necessary libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [None]:
# List of movie titles
movie_titles = [
    "Creed III", "DADA", "Infinity Pool", "Law of Tehran",
    "Dungeons & Dragons: Honour Among Thieves", "LOVE ACCORDING TO DALVA",
    "Champions", "Akhi Fok El Shagara", "Little Eggs: An African Rescue",
    "Selfiee", "Villeneuve Pironi", "Polite Society", "Bakasuran",
    "CHRISTY", "Air", "Naked Lunch", "The Gallery", "Heathers: The Musical",
    "The Age of Innocence", "65", "Single shankarum Smartphone Simranum",
    "SISU", "Winnie the Pooh: Blood and Honey", "All Of The Voices",
    "The Night of the 12th", "How To Blow Up A Pipeline", "Antidote",
    "Say I Do To Me", "Meet Me In The Bathroom", "This Is Endometriosis",
    "Below The Belt", "Tu Jhoothi Main Makkaar", "Fulbari", "CAIRO CONSPIRACY",
    "Iratta", "Mitran Da Naa Chalda", "PRANAYA VILASAM", "LOOK AT ME",
    "Are you there God? it's me, Margaret", "The Super Mario Bros. Movie",
    "1976", "MARLOWE", "LOLA", "The Grass is Greener on the Other Side",
    "Suzume", "Tetris", "Our Rivers...Our Skys", "Wala Ghalta",
    "The Curse of Rosalie", "in The Middle"
]

# Create a DataFrame
df = pd.DataFrame(movie_titles, columns=['Title'])

# Display the DataFrame
display(df)

# Optionally, save this DataFrame to a CSV file
# df.to_csv('/content/BBFC_titles.csv', index=False)

Unnamed: 0,Title
0,Creed III
1,DADA
2,Infinity Pool
3,Law of Tehran
4,Dungeons & Dragons: Honour Among Thieves
5,LOVE ACCORDING TO DALVA
6,Champions
7,Akhi Fok El Shagara
8,Little Eggs: An African Rescue
9,Selfiee


## 2.1 BBFC API

The API is hosted at https://bbfcapi.fustra.uk. By using a simple HTTP request with a movie title as a parameter, you can get the age rating.

The API provides a Python client library (bbfcapi) which you have to install.

In [None]:
# Install the BBFC API Python client library
!pip install bbfcapi[api_sync]



In [None]:
# Import the best_match function from the library
from bbfcapi.api_sync import best_match

In [None]:
# Testing retrieving information about a specific movie
movie_info = best_match("Creed III")
print(movie_info)

title='Creed Iii' age_rating=<AgeRating.AGE_12: '12'>


In [None]:
# Function to get age rating using the BBFC API
def get_age_rating(title):
    try:
        result = best_match(title)
        return result.age_rating.value if result else None
    except Exception as e:
        print(f"Error fetching age rating for {title}: {e}")
        return None

# Retrieve age ratings for each movie title
df['Age Rating'] = df['Title'].apply(get_age_rating)

# Display the DataFrame with age ratings
print(df)

Error fetching age rating for How To Blow Up A Pipeline: 429 Client Error: Too Many Requests for url: https://bbfcapi.fustra.uk/?title=How+To+Blow+Up+A+Pipeline
Error fetching age rating for Say I Do To Me: 429 Client Error: Too Many Requests for url: https://bbfcapi.fustra.uk/?title=Say+I+Do+To+Me
Error fetching age rating for Meet Me In The Bathroom: 429 Client Error: Too Many Requests for url: https://bbfcapi.fustra.uk/?title=Meet+Me+In+The+Bathroom
Error fetching age rating for Below The Belt: 429 Client Error: Too Many Requests for url: https://bbfcapi.fustra.uk/?title=Below+The+Belt
Error fetching age rating for Fulbari: 429 Client Error: Too Many Requests for url: https://bbfcapi.fustra.uk/?title=Fulbari
Error fetching age rating for Iratta: 429 Client Error: Too Many Requests for url: https://bbfcapi.fustra.uk/?title=Iratta
Error fetching age rating for PRANAYA VILASAM: 429 Client Error: Too Many Requests for url: https://bbfcapi.fustra.uk/?title=PRANAYA+VILASAM
Error fetching 

As the errors above show, the API rate-limits repeated requests (HTTP 429), and this method is also not always able to extract the correct age rating for the film titles.
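
If the API were still worth pursuing, one common way to handle 429 responses is to retry with an increasing delay. The sketch below is a generic wrapper and not part of the `bbfcapi` library; the function name, the delays and the fake lookup used for the demonstration are all assumptions, and I have not verified that backing off would satisfy this API's rate limit.

```python
import time

def call_with_backoff(func, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); after a failure wait base_delay * 2**attempt and retry.

    Returns None if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except Exception as exc:
            if attempt == max_attempts - 1:
                print(f"Giving up after {max_attempts} attempts: {exc}")
                return None
            sleep(base_delay * 2 ** attempt)

# Demonstration with a fake lookup that fails twice before succeeding
attempts_made = []

def fake_lookup(title):
    attempts_made.append(title)
    if len(attempts_made) < 3:
        raise RuntimeError("429 Client Error: Too Many Requests")
    return "12"

waits = []  # record the delays instead of actually sleeping
rating = call_with_backoff(fake_lookup, "Creed III", sleep=waits.append)
print(rating, waits)
```

In the notebook, `best_match` could be passed in place of `fake_lookup`, with the default `time.sleep` left in place so the delays are real.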

In [None]:
# Drop the "Age Rating" column from the DataFrame
df.drop('Age Rating', axis=1, inplace=True)

## 2.2 Web Scraping using the Search Functionality

The approach involves sending a search query for each movie and parsing the search results to find the link to the movie's page. The code looks for an anchor ('a') tag whose href attribute (the link on the BBFC's website) points to the page for that title.

The code below sends a search query for each movie title to the BBFC's search page. It then parses the HTML of the search results page to find the first div with the class 'SearchItem_SearchItem__3kbZF'. Within this div, it looks for an 'a' tag with an href attribute and constructs the full URL of each film from it.

In [None]:
import requests
from bs4 import BeautifulSoup

# Function to fetch the movie URL from search results
def fetch_movie_url(title):
    # Format the search query URL
    search_url = f"https://www.bbfc.co.uk/search?q={title.replace(' ', '%20')}"
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the first search result item and extract the URL
    search_item = soup.find('div', class_='SearchItem_SearchItem__3kbZF')
    if search_item:
        link_element = search_item.find('a', href=True)
        if link_element:
            return "https://www.bbfc.co.uk" + link_element['href']
    return "URL not found"

# First three movie titles
movie_titles = ["Creed III", "DADA", "Infinity Pool"]

# Fetch URLs for each movie
for title in movie_titles:
    url = fetch_movie_url(title)
    print(f"{title}: {url}")


Creed III: https://www.bbfc.co.uk/release/creed-iii-q29sbgvjdglvbjpwwc0xmda4ndix
DADA: https://www.bbfc.co.uk/release/dada-q29sbgvjdglvbjpwwc0xmdexmty1
Infinity Pool: https://www.bbfc.co.uk/release/infinity-pool-q29sbgvjdglvbjpwwc0xmdewmjqy
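
The search URL above only replaces spaces, so characters such as '&' or ':' in a title are sent unencoded. A more defensive variant, sketched here, percent-encodes the whole title with the standard library. The helper name `build_search_url` is hypothetical, and I have not verified how the BBFC search treats fully encoded queries compared with the original form.

```python
from urllib.parse import quote

def build_search_url(title):
    """Percent-encode the whole title so '&', ':' and similar characters are safe in a query string."""
    return "https://www.bbfc.co.uk/search?q=" + quote(title, safe='')

print(build_search_url("Dungeons & Dragons: Honour Among Thieves"))
```

Without the encoding, the '&' in a title would be read as the start of a second query parameter, so only the text before it would reach the search.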


In [None]:
# Fetch URLs for each movie and add them to the DataFrame
df['URL'] = df['Title'].apply(fetch_movie_url)

# Display the DataFrame with URLs
display(df)

Unnamed: 0,Title,URL
0,Creed III,https://www.bbfc.co.uk/release/creed-iii-q29sb...
1,DADA,https://www.bbfc.co.uk/release/dada-q29sbgvjdg...
2,Infinity Pool,https://www.bbfc.co.uk/release/infinity-pool-q...
3,Law of Tehran,https://www.bbfc.co.uk/release/law-of-tehran-q...
4,Dungeons & Dragons: Honour Among Thieves,https://www.bbfc.co.uk/release/dungeons-dragon...
5,LOVE ACCORDING TO DALVA,https://www.bbfc.co.uk/release/love-according-...
6,Champions,https://www.bbfc.co.uk/release/champions-q29sb...
7,Akhi Fok El Shagara,https://www.bbfc.co.uk/release/akhi-fok-el-sha...
8,Little Eggs: An African Rescue,https://www.bbfc.co.uk/release/little-eggs-an-...
9,Selfiee,https://www.bbfc.co.uk/release/selfiee-q29sbgv...


When pulling the URLs for the movie titles, a spelling error was found in "Our Rivers...Our Skys", the only title for which no URL was found. The correct spelling for this film is 'Our River...Our Sky'; see the BBFC page for the film [here](https://www.bbfc.co.uk/release/our-river-our-sky-q29sbgvjdglvbjpwwc0xmdexnjc0). The title was corrected and the URLs were fetched again.

In [None]:
# Change the title 'Our Rivers...Our Skys' to 'Our River...Our Sky'
df['Title'] = df['Title'].replace('Our Rivers...Our Skys', 'Our River...Our Sky')

In [None]:
# Fetch URLs for each movie and add them to the DataFrame
df['URL'] = df['Title'].apply(fetch_movie_url)

# Display the DataFrame with URLs
display(df)

Unnamed: 0,Title,URL
0,Creed III,https://www.bbfc.co.uk/release/creed-iii-q29sb...
1,DADA,https://www.bbfc.co.uk/release/dada-q29sbgvjdg...
2,Infinity Pool,https://www.bbfc.co.uk/release/infinity-pool-q...
3,Law of Tehran,https://www.bbfc.co.uk/release/law-of-tehran-q...
4,Dungeons & Dragons: Honour Among Thieves,https://www.bbfc.co.uk/release/dungeons-dragon...
5,LOVE ACCORDING TO DALVA,https://www.bbfc.co.uk/release/love-according-...
6,Champions,https://www.bbfc.co.uk/release/champions-q29sb...
7,Akhi Fok El Shagara,https://www.bbfc.co.uk/release/akhi-fok-el-sha...
8,Little Eggs: An African Rescue,https://www.bbfc.co.uk/release/little-eggs-an-...
9,Selfiee,https://www.bbfc.co.uk/release/selfiee-q29sbgv...


I have chosen to pull as much relevant information as possible. The items scraped for each movie title are:

'Title', 'URL', 'Type', 'Rating Content', 'Synopsis', 'Classified Date', 'Language', 'Director', 'Production Year', 'Genre', 'Running Minutes', 'Cast', 'UK Age Rating'.

In the code below I iterate through each movie title in the DataFrame, fetch the URL for the movie, and then scrape the detailed information from that URL using the scrape_movie_info function. I modified the scrape_movie_info function used above to work with the individual movie pages instead of the recently-rated page. Since the UK age ratings are standardised to "U", "PG", "12A", "12", "15", and "18", I modified the scraping function to check for these specific ratings in the SVG path ID. If the rating is not found there, the function tries to extract it from the aria-label attribute. If neither is found, the UK Age Rating is set to None. The regex is case insensitive so that it matches both lowercase and uppercase variants of these ratings.

The scraped data will be added to the DataFrame as new columns for each movie.
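
The rating extraction can be tried in isolation before running the full scraper. The sketch below applies the same pattern to a few example ID strings; the helper name `rating_from_id` and the sample IDs are hypothetical and only mimic the 'rating-..._svg' form that the scraper expects.

```python
import re

# '12a' is listed before '12' so that the longer rating is tried first
RATING_PATTERN = re.compile(r'rating-(u|pg|12a|12|15|18)', re.IGNORECASE)

def rating_from_id(svg_id):
    """Return the upper-cased rating found in an SVG path ID, or None if there is none."""
    match = RATING_PATTERN.search(svg_id)
    return match.group(1).upper() if match else None

# Hypothetical IDs for illustration only
for example_id in ['rating-12a_svg', 'Rating-PG_svg', 'rating-12_svg', 'logo_svg']:
    print(example_id, '->', rating_from_id(example_id))
```

The order of the alternatives matters: if '12' came before '12a', an ID for a 12A title would be reported as 12.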

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re


# Function to fetch the movie URL from search results
def fetch_movie_url(title):
    search_url = f"https://www.bbfc.co.uk/search?q={title.replace(' ', '%20')}"
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    search_item = soup.find('div', class_='SearchItem_SearchItem__3kbZF')
    if search_item:
        link_element = search_item.find('a', href=True)
        if link_element:
            return "https://www.bbfc.co.uk" + link_element['href']
    return None

# Define the scraping function for movie details
def scrape_movie_info(movie_url):
    response = requests.get(movie_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Helper function to extract text by heading
    def extract_text_by_heading(heading):
        element = soup.find('h4', class_='Type_base__2EnB2 ListItem_Title__tu7eT', string=heading)
        if element:
            sibling = element.find_next_sibling('h4')
            if sibling:
                return sibling.text
        return None

    # Extracting various details
    director = extract_text_by_heading('Director(s)')
    production_year = extract_text_by_heading('Production Year')
    genre = extract_text_by_heading('Genre(s)')
    running_minutes = extract_text_by_heading('Approx. running minutes')
    cast = extract_text_by_heading('Cast')

    # Extracting other details
    type_div = soup.find('div', class_='MediaOutline_MediaType__AipEd')
    movie_type = type_div.text if type_div else None

    rating_content_div = soup.find('div', class_='Type_subtitle__3KRKY')
    rating_content = rating_content_div.span.span.text if rating_content_div else None

    synopsis_p = soup.find('p', class_='Type_base-large__3q3Or')
    synopsis = synopsis_p.text if synopsis_p else None

    # Guard each lookup so a page without the heading returns None instead of raising AttributeError
    classified_date_heading = soup.find('h4', string='Classified date')
    classified_date_div = classified_date_heading.find_next_sibling('h4') if classified_date_heading else None
    classified_date = classified_date_div.text if classified_date_div else None

    language_heading = soup.find('h4', string='Language')
    language_div = language_heading.find_next_sibling('h4') if language_heading else None
    language = language_div.text if language_div else None

    # Extracting UK Age Rating. The function will first try to extract the age rating from the SVG path ID.
    # If it doesn't find it there, it will attempt to extract it from the aria-label. If neither is found, it sets it to 'None'
    # The re.IGNORECASE flag in the regex pattern allows the function to match and correctly extract age ratings regardless of their case (lowercase or uppercase)
    age_rating_div = soup.find('div', class_='MediaOutline_Rating__2rGrd')
    if age_rating_div:
        svg_path = age_rating_div.find('path', id=re.compile(r'rating-(u|pg|12a|12|15|18)_svg', re.IGNORECASE))
        if svg_path and svg_path.get('id'):
            age_rating = re.search(r'rating-(u|pg|12a|12|15|18)', svg_path.get('id'), re.IGNORECASE)
            uk_age_rating = age_rating.group(1).upper() if age_rating else None
        elif 'aria-label' in age_rating_div.attrs:
            aria_label_rating = age_rating_div['aria-label'].replace('Rated ', '').upper()
            uk_age_rating = aria_label_rating if aria_label_rating in ['U', 'PG', '12A', '12', '15', '18'] else None
        else:
            uk_age_rating = None
    else:
        uk_age_rating = None

    return {
        'Type': movie_type,
        'Rating Content': rating_content,
        'Synopsis': synopsis,
        'Classified Date': classified_date,
        'Language': language,
        'Director': director,
        'Production Year': production_year,
        'Genre': genre,
        'Running Minutes': running_minutes,
        'Cast': cast,
        'UK Age Rating': uk_age_rating
    }

# Fetch URLs and details for each movie and add them to the DataFrame
for index, row in df.iterrows():
    url = fetch_movie_url(row['Title'])
    if url:
        movie_info = scrape_movie_info(url)
        for key, value in movie_info.items():
            df.at[index, key] = value

        # Respectful scraping: add a delay between requests
        time.sleep(1)

# Display the DataFrame with additional details
display(df)

Unnamed: 0,Title,URL,Type,Rating Content,Synopsis,Classified Date,Language,Director,Production Year,Genre,Running Minutes,Cast,UK Age Rating
0,Creed III,https://www.bbfc.co.uk/release/creed-iii-q29sb...,Film,"moderate violence, infrequent strong language","In this sports drama sequel, a retired heavywe...",15/03/2023,English,Michael B. Jordan,2022,"Drama, Sport",116m,"Michael B. Jordan, Jonathan Majors, Tessa Thom...",12
1,DADA,https://www.bbfc.co.uk/release/dada-q29sbgvjdg...,Film,"infrequent strong language, domestic abuse, ha...",A young man finds himself raising his newborn ...,09/02/2023,English,Ganesh K Babu,2023,Comedy,133m,"Aishwariyaa Bhaskaran, Aparna Das, Bhagyaraj",12A
2,Infinity Pool,https://www.bbfc.co.uk/release/infinity-pool-q...,Film,"strong bloody violence, gore, sex, nudity","An unsettling tone, as well as graphic violenc...",26/06/2023,English,Brandon Cronenberg,2023,"Crime, Horror, Mystery",113m,"Alexander Skarsgård, Mia Goth, Cleopatra Coleman",18
3,Law of Tehran,https://www.bbfc.co.uk/release/law-of-tehran-q...,Film,"drug misuse, strong threat, suicide references",A policeman goes head to head with a drug deal...,15/02/2023,Farsi,Saeed Roustayi,2019,"Crime, Drama",134m,"Payman Maadi, Navid Mohammadzadeh, Parinaz Iza...",15
4,Dungeons & Dragons: Honour Among Thieves,https://www.bbfc.co.uk/release/dungeons-dragon...,Film,"moderate violence, threat, horror, language",A thief gathers a team of unlikely heroes to e...,27/03/2023,English,"Jonathan Goldstein, John Francis Daley",2022,"Action, Adventure, Comedy",134m,"Chris Pine, Michelle Rodriguez, Hugh Grant",12A
5,LOVE ACCORDING TO DALVA,https://www.bbfc.co.uk/release/love-according-...,Film,"child sex abuse theme, strong language",A young girl struggles to cope in the aftermat...,17/02/2023,French,Emmanuelle Nicot,2022,Drama,88m,"Zelda Samson, Alexis Manenti, Fanta Guirassy",15
6,Champions,https://www.bbfc.co.uk/release/champions-q29sb...,Film,"moderate sex references, discrimination, infre...",A failing basketball coach is court-ordered to...,24/04/2023,English,,2023,Comedy,124m,"Woody Harrelson, Kaitlin Olson, Matt Cook",12
7,Akhi Fok El Shagara,https://www.bbfc.co.uk/release/akhi-fok-el-sha...,Film,drug misuse,A man discovers he has a twin brother in this ...,17/02/2023,Arabic,Mahmoud Karim,2022,Comedy,107m,,12A
8,Little Eggs: An African Rescue,https://www.bbfc.co.uk/release/little-eggs-an-...,Film,"very mild threat, violence, rude humour","When they are stolen and taken to Africa, a ro...",20/02/2023,English,"Gabriel Riva Palacio Alatriste, Rodolfo Riva P...",2021,"Animation, Children, Comedy",89m,"Marcelo Barcelo, Mauricio Barrientos, Bruno Bi...",U
9,Selfiee,https://www.bbfc.co.uk/release/selfiee-q29sbgv...,Film,"infrequent moderate violence, threat, injury d...","In this Hindi language comedy drama, a misunde...",20/02/2023,Hindi,Raj Mehta,2023,"Comedy, Drama",148m,"Akshay Kumar, Nushrratt Bharuccha, Emraan Hashmi",12A


In [None]:
# Check Data Types
df.dtypes

Title              object
URL                object
Type               object
Rating Content     object
Synopsis           object
Classified Date    object
Language           object
Director           object
Production Year    object
Genre              object
Running Minutes    object
Cast               object
UK Age Rating      object
dtype: object

In [None]:
# Convert textual columns to string
text_columns = ['Title', 'URL', 'Type', 'Rating Content', 'Synopsis', 'Language', 'Director', 'Genre', 'Cast']
df[text_columns] = df[text_columns].astype(str)

# Convert date columns to datetime
df['Classified Date'] = pd.to_datetime(df['Classified Date'], format='%d/%m/%Y', errors='coerce')
df['Production Year'] = pd.to_datetime(df['Production Year'], format='%Y', errors='coerce')

# Remove the 'm' from 'Running Minutes' and convert to numeric (integer)
df['Running Minutes'] = df['Running Minutes'].str.replace('m', '', regex=False).astype(int)

# UK Age Rating is a mixed type (numeric and alphanumeric) so we will keep it as string
df['UK Age Rating'] = df['UK Age Rating'].astype(str)

# Display the DataFrame with updated data types
print(df.dtypes)
df.head()

Title                      object
URL                        object
Type                       object
Rating Content             object
Synopsis                   object
Classified Date    datetime64[ns]
Language                   object
Director                   object
Production Year    datetime64[ns]
Genre                      object
Running Minutes             int64
Cast                       object
UK Age Rating              object
dtype: object


Unnamed: 0,Title,URL,Type,Rating Content,Synopsis,Classified Date,Language,Director,Production Year,Genre,Running Minutes,Cast,UK Age Rating
0,Creed III,https://www.bbfc.co.uk/release/creed-iii-q29sb...,Film,"moderate violence, infrequent strong language","In this sports drama sequel, a retired heavywe...",2023-03-15,English,Michael B. Jordan,2022-01-01,"Drama, Sport",116,"Michael B. Jordan, Jonathan Majors, Tessa Thom...",12
1,DADA,https://www.bbfc.co.uk/release/dada-q29sbgvjdg...,Film,"infrequent strong language, domestic abuse, ha...",A young man finds himself raising his newborn ...,2023-02-09,English,Ganesh K Babu,2023-01-01,Comedy,133,"Aishwariyaa Bhaskaran, Aparna Das, Bhagyaraj",12A
2,Infinity Pool,https://www.bbfc.co.uk/release/infinity-pool-q...,Film,"strong bloody violence, gore, sex, nudity","An unsettling tone, as well as graphic violenc...",2023-06-26,English,Brandon Cronenberg,2023-01-01,"Crime, Horror, Mystery",113,"Alexander Skarsgård, Mia Goth, Cleopatra Coleman",18
3,Law of Tehran,https://www.bbfc.co.uk/release/law-of-tehran-q...,Film,"drug misuse, strong threat, suicide references",A policeman goes head to head with a drug deal...,2023-02-15,Farsi,Saeed Roustayi,2019-01-01,"Crime, Drama",134,"Payman Maadi, Navid Mohammadzadeh, Parinaz Iza...",15
4,Dungeons & Dragons: Honour Among Thieves,https://www.bbfc.co.uk/release/dungeons-dragon...,Film,"moderate violence, threat, horror, language",A thief gathers a team of unlikely heroes to e...,2023-03-27,English,"Jonathan Goldstein, John Francis Daley",2022-01-01,"Action, Adventure, Comedy",134,"Chris Pine, Michelle Rodriguez, Hugh Grant",12A
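Two details of the conversion above may be worth revisiting. Parsing 'Production Year' as a datetime attaches a day and month (for example `2022-01-01`) that the BBFC does not actually supply, and `.astype(int)` on 'Running Minutes' would raise an error if any value were blank. The following is a minimal sketch of an alternative, not the approach used in this notebook: it keeps both columns as nullable integers. The `toy` frame is hypothetical illustration data, not real BBFC rows.

```python
import pandas as pd

# Hypothetical toy values in the scraped format (not real BBFC rows)
toy = pd.DataFrame({
    'Production Year': ['2021', '2023', ''],
    'Running Minutes': ['89m', '148m', ''],
})

# Keep the year as a plain integer; unparseable values become <NA>
years = pd.to_numeric(toy['Production Year'], errors='coerce').astype('Int64')

# Strip the trailing 'm', then coerce; blanks become <NA> instead of raising
minutes = pd.to_numeric(
    toy['Running Minutes'].str.replace('m', '', regex=False),
    errors='coerce',
).astype('Int64')

print(years.tolist())
print(minutes.tolist())
```

The nullable `Int64` dtype is used here so that a missing year or runtime does not force the whole column to float.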


In [None]:
# Check for missing values
# Define a list of custom missing value indicators
missing_indicators = ['N/a', 'None', '', ' ']

# Check for rows with any of these missing value indicators
missing_value_rows = df[df.apply(lambda x: x.astype(str).str.strip().isin(missing_indicators)).any(axis=1)]

# Display the rows with missing values
display(missing_value_rows)

Unnamed: 0,Title,URL,Type,Rating Content,Synopsis,Classified Date,Language,Director,Production Year,Genre,Running Minutes,Cast,UK Age Rating
6,Champions,https://www.bbfc.co.uk/release/champions-q29sb...,Film,"moderate sex references, discrimination, infre...",A failing basketball coach is court-ordered to...,2023-04-24,English,,2023-01-01,Comedy,124,"Woody Harrelson, Kaitlin Olson, Matt Cook",12
7,Akhi Fok El Shagara,https://www.bbfc.co.uk/release/akhi-fok-el-sha...,Film,drug misuse,A man discovers he has a twin brother in this ...,2023-02-17,Arabic,Mahmoud Karim,2022-01-01,Comedy,107,,12A
26,Antidote,https://www.bbfc.co.uk/release/antidote-q29sbg...,Film,"drug references, sexual violence references",This documentary concerns the use of ayahuasca...,2023-03-01,English,Marc Silver,2022-01-01,Documentary,79,,15
29,This Is Endometriosis,https://www.bbfc.co.uk/release/this-is-endomet...,Film,"mild bad language, scenes of emotional upset, ...",THIS IS ENDOMETRIOSIS is a British documentary...,2023-03-01,English,Matt Houghton & Georgie Wileman,2023-01-01,Documentary,20,,PG
30,Below The Belt,https://www.bbfc.co.uk/release/below-the-belt-...,Film,"mild medical detail, upsetting scenes, languag...",BELOW THE BELT is a US documentary short about...,2023-03-01,English,Shannon Cohn,2022-01-01,Documentary,50,,PG
40,1976,https://www.bbfc.co.uk/release/1976-q29sbgvjdg...,Film,strong language,A woman is pulled into a possibly dangerous si...,2023-04-12,Spanish,,2022-01-01,Drama,96,"Aline Küppenheim, Nicolás Sepúlveda, Hugo Medina",15
43,The Grass is Greener on the Other Side,https://www.bbfc.co.uk/release/the-grass-is-gr...,Film,infrequent strong language,A man moves from his home in Hong Kong to the ...,2023-03-06,Cantonese,Crystal WONG,2022-01-01,Documentary,73,,12A
44,Suzume,https://www.bbfc.co.uk/release/suzume-q29sbgvj...,Film,"mild threat, bloody images, language, upsettin...",A teenager has the power to stop real world di...,2023-04-03,English,,2022-01-01,Animation,122,,PG


Each of these rows was checked against the BBFC's website, which confirmed that the values are indeed missing from the corresponding film pages.

[Champions](https://www.bbfc.co.uk/release/champions-q29sbgvjdglvbjpwwc0xmda5otk2): Missing Director </br>
[Akhi Fok El Shagara](https://www.bbfc.co.uk/release/akhi-fok-el-shagara-q29sbgvjdglvbjpwwc0xmdexmzc2): Missing Cast </br>
[Antidote](https://www.bbfc.co.uk/release/antidote-q29sbgvjdglvbjpwwc0xmdexmjaw): Missing Cast </br>
[This Is Endometriosis](https://www.bbfc.co.uk/release/this-is-endometriosis-q29sbgvjdglvbjpwwc0xmdexnjey): Missing Cast </br>
[Below The Belt](https://www.bbfc.co.uk/release/below-the-belt-q29sbgvjdglvbjpwwc0xmdexmtm0): Missing Cast </br>
[1976](https://www.bbfc.co.uk/release/1976-q29sbgvjdglvbjpwwc0xmdexnje0): Missing Director </br>
[The Grass is Greener on the Other Side](https://www.bbfc.co.uk/release/the-grass-is-greener-on-the-other-side-q29sbgvjdglvbjpwwc0xmdexnzay): Missing Cast </br>
[Suzume](https://www.bbfc.co.uk/release/suzume-q29sbgvjdglvbjpwwc0xmda4nzu0): Missing Director and Cast
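The list above was compiled by hand. As a rough sketch of how the same summary could be produced programmatically, the hypothetical helper `missing_columns` below reports, for each affected row, which columns hold a missing value. The indicator list and the `toy` frame are assumptions made for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Assumed placeholder strings, compared after stripping and lower-casing
MISSING_INDICATORS = ['n/a', 'none', 'nan', '']

def missing_columns(frame):
    """Map each affected row index to the list of column names that are missing."""
    mask = frame.apply(
        lambda col: col.isna()
        | col.astype(str).str.strip().str.lower().isin(MISSING_INDICATORS)
    )
    return {
        idx: list(mask.columns[row.to_numpy()])
        for idx, row in mask.iterrows()
        if row.any()
    }

# Hypothetical toy rows (not real BBFC data)
toy = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C'],
    'Director': ['Jane Doe', 'None', ''],
    'Cast': ['Actor One', 'Actor Two', None],
})

report = missing_columns(toy)
print(report)
```

A report of this kind still needs to be compared with the website by hand, since it cannot tell whether a value is missing at the source or was lost during scraping.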

In [None]:
# Saving the df DataFrame to a CSV file

# Replace 'None' with an empty string
df.replace('None', '', inplace=True)

# Save to CSV
df.to_csv('bbfc_listed_films.csv', index=False)

# Future Recommendations

Although the code is already largely automated, it could be made more reproducible, depending on what future work requires. I would create user-defined functions for specific actions such as web scraping, data checking and cleaning. Any dataframe could then be passed into these functions to automate the steps required for scraping, checking and cleaning the data.
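As an illustration of this recommendation, here is a minimal sketch of how two of the cleaning steps used in this notebook (blanking out 'None' placeholders and parsing 'Classified Date') might be wrapped in reusable functions and chained with `DataFrame.pipe`. The function names, the placeholder list and the `toy` frame are hypothetical and are not part of the notebook's existing code.

```python
import pandas as pd

def blank_out_placeholders(frame, placeholders=('None', 'N/a')):
    """Return a copy with placeholder strings replaced by empty strings."""
    return frame.replace(list(placeholders), '')

def parse_classified_date(frame, column='Classified Date'):
    """Return a copy with day/month/year strings parsed to datetimes."""
    out = frame.copy()
    out[column] = pd.to_datetime(out[column], format='%d/%m/%Y', errors='coerce')
    return out

def clean(frame):
    """Run every cleaning step in a fixed order on any scraped frame."""
    return frame.pipe(blank_out_placeholders).pipe(parse_classified_date)

# Hypothetical toy rows (not real BBFC data)
toy = pd.DataFrame({
    'Title': ['Film A', 'Film B'],
    'Classified Date': ['20/02/2023', '15/03/2023'],
    'Cast': ['Actor One', 'None'],
})

cleaned = clean(toy)
print(cleaned)
```

Each step returns a new frame rather than modifying its input, so the same `clean` function could be applied to both the recently-rated and the listed-films dataframes without side effects.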