### Using Beautiful Soup to scrape Rotten Tomatoes

In this notebook I use Beautiful Soup to scrape Nicolas Cage's profile on Rotten Tomatoes.

I will then use Pandas to create a dataframe from the scraped data and save it to a .csv file for further visual analysis.

In [1]:
import pandas as pd
import requests
import bs4
import lxml

In [2]:
session = requests.Session()
session.headers["User-Agent"]

'python-requests/2.27.1'

In [3]:
# Use the session created above so its headers are sent with the request
res = session.get("https://www.rottentomatoes.com/celebrity/nicolas_cage")
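
The default `python-requests` User-Agent shown above is sometimes rejected by websites, and a failed request would otherwise only surface later as a missing table. A minimal sketch of a more defensive fetch, assuming a made-up User-Agent string and two hypothetical helpers (`build_session`, `fetch_html`) that are not part of the original notebook:

```python
import requests

URL = "https://www.rottentomatoes.com/celebrity/nicolas_cage"


def build_session(user_agent):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page and raise requests.HTTPError on a 4xx/5xx response."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# The User-Agent value below is a placeholder, not a recommended string.
example_session = build_session("nic-cage-notebook/0.1 (hypothetical example)")
# html = fetch_html(example_session, URL)  # uncomment to perform the request
```

Whether a custom User-Agent is needed or permitted depends on the site, so it is worth checking its terms of use and robots.txt first.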

In [4]:
soup = bs4.BeautifulSoup(res.text,"lxml")

In [5]:
table = soup.find("table", attrs={"data-qa":"celebrity-filmography-movies"})
table_body = table.find("tbody")

Collect selected data from the table rows using list comprehensions

In [31]:
# Collect titles
data_title = [item["data-title"] for item in table.find_all("tr", attrs={"data-title" : True})]

In [32]:
# Collect box office data
# Note: I chose to insert "NULL" because None is converted to 0 when uploading to a SQL client.
data_boxoffice = ["NULL" if item["data-boxoffice"] == "" else item["data-boxoffice"] for item in table.find_all("tr", attrs={"data-boxoffice" : True})]

In [33]:
# Collect release year
data_year = [item["data-year"] for item in table.find_all("tr", attrs={"data-year" : True})]

In [34]:
# Collect Rotten Tomatoes score

# Unlike the box office attribute, which is empty when there is no box office
# figure, the critic and audience score attributes are 0 when a film has no
# reviews. That is ambiguous, because some films have a genuine 0 rating from
# critics or audiences. Additional parsing is therefore needed to tell the two
# cases apart: a row is treated as unscored only if it also contains a
# "no score" span.

data_tomatometer = [
    "NULL"
    if (
        item.find_all("span", attrs={"data-tomatometer": 0})
        and item.find_all("span", attrs={"class": "celebrity-filmography__no-score"})
    )
    else item["data-tomatometer"]
    for item in table.find_all("tr", attrs={"data-tomatometer": True})
]

In [35]:
# Collect audience scores
data_audiencescore = [
    "NULL"
    if (
        item.find_all("span", attrs={"data-audiencescore": 0})
        and item.find_all("span", attrs={"class": "celebrity-filmography__no-score"})
    )
    else item["data-audiencescore"]
    for item in table.find_all("tr", attrs={"data-audiencescore": True})
]
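
The distinction between a genuine 0 and "no score yet" can be tried out on a small inline sample. The sketch below assumes markup modelled on the attributes used in this notebook (the live page may be structured differently), uses a hypothetical helper `score_or_null`, and parses with the built-in `html.parser` so it does not depend on lxml:

```python
import bs4

NO_SCORE_CLASS = "celebrity-filmography__no-score"

# Hypothetical markup for illustration only; titles and values are made up.
SAMPLE_HTML = """
<table data-qa="celebrity-filmography-movies">
  <tbody>
    <tr data-title="Rated Zero" data-tomatometer="0">
      <td><span data-tomatometer="0">0%</span></td>
    </tr>
    <tr data-title="Not Rated" data-tomatometer="0">
      <td><span class="celebrity-filmography__no-score" data-tomatometer="0">No Score Yet</span></td>
    </tr>
    <tr data-title="Well Rated" data-tomatometer="96">
      <td><span data-tomatometer="96">96%</span></td>
    </tr>
  </tbody>
</table>
"""


def score_or_null(row, attr):
    """Return the row's score, or "NULL" if its score span is flagged as unscored."""
    unscored = row.find("span", attrs={attr: True, "class": NO_SCORE_CLASS})
    return "NULL" if unscored is not None else row[attr]


sample_soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_rows = sample_soup.find_all("tr", attrs={"data-title": True})
scores = [(row["data-title"], score_or_null(row, "data-tomatometer")) for row in sample_rows]
print(scores)
```

In this sample only the "Not Rated" row becomes `"NULL"`, while the "Rated Zero" row keeps its `"0"`.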

Create single list from collected data

In [36]:
data = list(zip(data_title, data_boxoffice, data_year, data_tomatometer, data_audiencescore))
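
Because each list above is built with its own attribute filter, the lists are only aligned if every row carries all five attributes, and `zip` silently stops at the shortest input. A minimal sketch of a guard, using a hypothetical helper `zip_equal` that is not part of the original notebook:

```python
def zip_equal(*columns):
    """Zip columns into row tuples, raising ValueError if their lengths differ."""
    lengths = {len(column) for column in columns}
    if len(lengths) > 1:
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return list(zip(*columns))


# Two short columns taken from the notebook's data.
titles = ["Pig", "Mandy"]
years = ["2021", "2018"]
rows = zip_equal(titles, years)
print(rows)
```

On Python 3.10 or later, `zip(..., strict=True)` performs a similar check without a helper.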

View new list

In [26]:
data

[('The Unbearable Weight of Massive Talent', '19490586', '2022', '87', '87'),
 ('Prisoners of the Ghostland', None, '2021', '62', '20'),
 ('Pig', '3138901', '2021', '96', '84'),
 ("Willy's Wonderland", '388722', '2021', '60', '68'),
 ('The Croods: A New Age', '58544525', '2020', '77', '94'),
 ('Jiu Jitsu', None, '2020', '28', '64'),
 ('Color Out of Space', '677283', '2019', '86', '82'),
 ('Grand Isle', None, '2019', '0', '45'),
 ('Primal', None, '2019', '37', '22'),
 ('Kill Chain', None, '2019', None, '31'),
 ('Running With the Devil', None, '2019', '24', '36'),
 ('A Score to Settle', None, '2019', '15', '17'),
 ('Between Worlds', None, '2018', '32', '82'),
 ('Spider-Man: Into the Spider-Verse', '190193195', '2018', '97', '93'),
 ('Becoming Iconic', None, '2018', None, '99'),
 ('Mandy', '1233694', '2018', '90', '67'),
 ('Teen Titans GO! to the Movies', '29553885', '2018', '91', '71'),
 ('211', None, '2018', '4', '9'),
 ('Looking Glass', None, '2018', '21', '10'),
 ('Looking Glass', Non

Create pandas dataframe from list

In [37]:
df = pd.DataFrame(data, columns = ["Title", "Boxoffice", "ReleaseYear", "RTScore", "AudienceScore"])
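
All scraped values are strings, so the dataframe columns have the `object` dtype. If the analysis is done in pandas rather than in a SQL client, one option is to convert the numeric columns with `pd.to_numeric`, which turns placeholders such as `"NULL"` into `NaN` when `errors="coerce"` is set. A small sketch on two sample rows (the `"NULL"` entries assume the placeholder convention used in the cells above):

```python
import pandas as pd

sample = [
    ("Pig", "3138901", "2021", "96", "84"),
    ("Kill Chain", "NULL", "2019", "NULL", "31"),
]
columns = ["Title", "Boxoffice", "ReleaseYear", "RTScore", "AudienceScore"]
sample_df = pd.DataFrame(sample, columns=columns)

# Convert every column except the title; unparseable strings become NaN.
numeric_columns = ["Boxoffice", "ReleaseYear", "RTScore", "AudienceScore"]
for column in numeric_columns:
    sample_df[column] = pd.to_numeric(sample_df[column], errors="coerce")

print(sample_df.dtypes)
```

Columns that contain `NaN` end up as floats, since `NaN` has no integer representation in the default dtypes.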

View new dataframe

In [38]:
df

| Unnamed: 0 | Title | Boxoffice | ReleaseYear | RTScore | AudienceScore |
|---|---|---|---|---|---|
| 0 | The Unbearable Weight of Massive Talent | 19490586 | 2022 | 87 | 87 |
| 1 | Prisoners of the Ghostland | | 2021 | 62 | 20 |
| 2 | Pig | 3138901 | 2021 | 96 | 84 |
| 3 | Willy's Wonderland | 388722 | 2021 | 60 | 68 |
| 4 | The Croods: A New Age | 58544525 | 2020 | 77 | 94 |
| ... | ... | ... | ... | ... | ... |
| 102 | The Cotton Club | | 1984 | 77 | 55 |
| 103 | Birdy | | 1984 | 83 | 84 |
| 104 | Rumble Fish | | 1983 | 75 | 80 |
| 105 | Valley Girl | | 1983 | 83 | 72 |


Save dataframe to .csv file

In [39]:
df.to_csv("NicCageRT_csv.csv")
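
By default `to_csv` also writes the dataframe index as an unlabeled first column, which shows up as an `Unnamed: 0` column when the file is read back with `pd.read_csv`. A short sketch of the difference, using in-memory buffers instead of files and a two-row sample dataframe:

```python
import io

import pandas as pd

sample_df = pd.DataFrame({"Title": ["Pig", "Mandy"], "ReleaseYear": ["2021", "2018"]})

# Default behaviour: the index is written as an extra, unlabeled column.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Passing `index=False` is a matter of preference; keeping the index can be useful if the row order matters downstream.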

Save dataframe to Excel file

In [None]:
# to_excel writes a real .xlsx workbook; it needs an Excel engine such as openpyxl installed
df.to_excel("NicCageRT_xlsx.xlsx")