## Practical Exercise - Rotten Tomatoes

## Set up

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd # used to store the extracted data in a DataFrame

In [2]:
# Define the URL of the site
base_site = "https://editorial.rottentomatoes.com/guide/140-essential-action-movies-to-watch-now/2/"

In [3]:
response = requests.get(base_site)
print(response.status_code)

200


In [4]:
html = response.content

## Choosing a parser

In [5]:
soup = BeautifulSoup(html, 'html.parser')

In [6]:
with open('Rotten_tomatoes_page_2_html_parser.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))

When inspecting the file, we see that the html element is closed at the beginning, which means the page was parsed incorrectly! <br>Let's try another parser.

In [7]:
soup = BeautifulSoup(html, 'lxml')

In [8]:
with open('Rotten_tomatoes_page_2_lxml.parser.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))

At first glance, the file looks fine. An external HTML validator can be used to confirm this.

#### Learning point: Beautiful Soup ranks the lxml parser as the best one.

If a parser is not explicitly stated in the Beautiful Soup constructor, the best one available on the current machine is chosen.
<br>This means that the same piece of code can give different results on different computers.
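
To keep the parser choice reproducible, it can help to name the parser explicitly and fall back in a fixed order. The sketch below is one possible way to do that. `make_soup` and its `preferred` argument are hypothetical names introduced here, not part of Beautiful Soup; the sketch relies on Beautiful Soup raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html, preferred=('lxml', 'html.parser')):
    """Parse html with the first parser in `preferred` that is installed.

    Hypothetical helper: returns the soup together with the parser name used,
    so the choice is always visible instead of being made silently.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError(f'none of the parsers {preferred} is available')

soup_demo, parser_used = make_soup('<p>Hello</p>')
print(parser_used, soup_demo.find('p').text)
```

Printing `parser_used` next to the results makes it easier to spot when two machines parsed the same page with different parsers.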

---
## Extracting the title, year and rating of each movie

In [9]:
# title, year and rating are all inside the <h2> tags
title_raw = soup.find_all('h2')[0:140] # limiting to 140 movie entries

In [10]:
temp = title_raw[0]
temp

<h2><a href="https://www.rottentomatoes.com/m/1018009-running_scared/">Running Scared</a> <span class="subtle start-year">(1986)</span> <span class="icon tiny fresh" title="Fresh"></span> <span class="tMeterScore">60%</span></h2>

In [11]:
temp.find('a').text

'Running Scared'

In [12]:
temp.find('span', {'class':'subtle start-year'}).text.strip('()')

'1986'

In [13]:
temp.find('span', {'class':'tMeterScore'}).text

'60%'

In [14]:
def rt_140_essential_movie_extract(soup):
    movie_list = []
    movie_raw = soup.find_all('h2')[0:140]
    
    for data in movie_raw:
        title = data.find('a').text
        year = data.find('span', {'class':'subtle start-year'}).text.strip('()')
        rating = data.find('span', {'class':'tMeterScore'}).text
        
        movie_list.append({'title': title, 'year': year, 'rating': rating})
    
    return movie_list
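
The function above assumes that every h2 element contains the link and both spans; if one is missing, `find` returns None and the `.text` access fails. A more defensive variant could look like the sketch below. The names `text_or_none` and `extract_movies` and the two-heading sample are made up for illustration, and matching on a single class (`start-year`) instead of the full class string is a deliberate simplification.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, name, class_=None):
    """Return the stripped text of the first matching child, or None if absent."""
    found = parent.find(name, class_=class_) if class_ else parent.find(name)
    return found.get_text(strip=True) if found is not None else None

def extract_movies(soup, limit=140):
    """Collect title, year and rating per <h2>, leaving missing fields as None."""
    rows = []
    for h2 in soup.find_all('h2', limit=limit):
        year = text_or_none(h2, 'span', 'start-year')
        rows.append({
            'title': text_or_none(h2, 'a'),
            'year': year.strip('()') if year else None,
            'rating': text_or_none(h2, 'span', 'tMeterScore'),
        })
    return rows

# Illustrative sample: the second heading has no link and no spans.
sample = """
<h2><a href="#">Running Scared</a> <span class="subtle start-year">(1986)</span> <span class="tMeterScore">60%</span></h2>
<h2>Heading without a link</h2>
"""
rows = extract_movies(BeautifulSoup(sample, 'html.parser'))
print(rows)
```

Rows with None values can then be inspected or dropped afterwards, instead of stopping the whole extraction.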
    

In [15]:
movies = rt_140_essential_movie_extract(soup)
movies

[{'title': 'Running Scared', 'year': '1986', 'rating': '60%'},
 {'title': 'Equilibrium', 'year': '2002', 'rating': '40%'},
 {'title': 'Hero', 'year': '2004', 'rating': '95%'},
 {'title': 'Road House', 'year': '1989', 'rating': '39%'},
 {'title': 'Unstoppable', 'year': '2010', 'rating': '86%'},
 {'title': 'Shaft', 'year': '1971', 'rating': '88%'},
 {'title': 'The Villainess (Ak-Nyeo)', 'year': '2017', 'rating': '85%'},
 {'title': 'Highlander', 'year': '1986', 'rating': '69%'},
 {'title': 'Die Hard 2', 'year': '1990', 'rating': '68%'},
 {'title': 'National Treasure', 'year': '2004', 'rating': '46%'},
 {'title': 'The Protector (Tom yum goong) (Warrior King)',
  'year': '2005',
  'rating': '53%'},
 {'title': 'Revenge', 'year': '2018', 'rating': '93%'},
 {'title': 'El Mariachi', 'year': '1993', 'rating': '93%'},
 {'title': 'A Touch of Zen', 'year': '1969', 'rating': '96%'},
 {'title': 'Top Gun', 'year': '1986', 'rating': '54%'},
 {'title': 'Con Air', 'year': '1997', 'rating': '55%'},
 {'tit

In [16]:
movies = pd.DataFrame(movies) # store the records in a pandas DataFrame
movies

| | title | year | rating |
|---|---|---|---|
| 0 | Running Scared | 1986 | 60% |
| 1 | Equilibrium | 2002 | 40% |
| 2 | Hero | 2004 | 95% |
| 3 | Road House | 1989 | 39% |
| 4 | Unstoppable | 2010 | 86% |
| ... | ... | ... | ... |
| 135 | Lat sau san taam (Hard-Boiled) | 1992 | 94% |
| 136 | The Matrix | 1999 | 88% |
| 137 | Terminator 2: Judgment Day | 1991 | 93% |
| 138 | Die Hard | 1988 | 93% |
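
Note that the year and rating are still stored as strings such as '1986' and '60%'. If numeric values are needed later (for sorting or averaging, say), they could be converted along the lines of this sketch. `percent_to_float` and `clean_record` are hypothetical helper names; in pandas, something like `movies['rating'].str.rstrip('%').astype(float)` should achieve the same for a whole column.

```python
def percent_to_float(text):
    """Turn a string such as '60%' or '61.158%' into a float."""
    return float(text.strip().rstrip('%'))

def clean_record(record):
    """Return a copy of one movie record with numeric year and rating."""
    return {
        **record,
        'year': int(record['year']),
        'rating': percent_to_float(record['rating']),
    }

record = {'title': 'Running Scared', 'year': '1986', 'rating': '60%'}
cleaned = clean_record(record)
print(cleaned)
```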


## Extracting the rest of the information

- Adjusted Scores, hidden in <span></span> tag
- Critics Consensus
- Synopsis
- Cast
- Director

The rest of the information is all inside div tags. We will use the title as a unique identifier, so that we can merge this information into the previously extracted data.

### Individual Scraping
Before combining everything, let's see how scraping works for each piece of information individually.
<br>This is to explore a different way of saving the results.

#### Another way to extract title: title belongs to h2 tag under 'div' tag

In [17]:
title = soup.find_all('h2')[0:140] # keep only the 140 movie entries

In [18]:
title[0]

<h2><a href="https://www.rottentomatoes.com/m/1018009-running_scared/">Running Scared</a> <span class="subtle start-year">(1986)</span> <span class="icon tiny fresh" title="Fresh"></span> <span class="tMeterScore">60%</span></h2>

In [19]:
title[0].find('a').contents[0]

'Running Scared'

In [20]:
title_list = []
for i in range(len(title)):
    try: 
        data = title[i].find('a').contents[0] # fails if find() returns None or the link is empty
    except (AttributeError, IndexError):
        data = title[i] # no usable <a> tag: keep the raw <h2> element for inspection
    title_list.append(data)

In [21]:
title_list

['Running Scared',
 'Equilibrium',
 'Hero',
 'Road House',
 'Unstoppable',
 'Shaft',
 'The Villainess (Ak-Nyeo)',
 'Highlander',
 'Die Hard 2',
 'National Treasure',
 'The Protector (Tom yum goong) (Warrior King)',
 'Revenge',
 'El Mariachi',
 'A Touch of Zen',
 'Top Gun',
 'Con Air',
 'The Expendables 2',
 'The Mummy',
 'Mr. & Mrs. Smith',
 'Rush Hour',
 'The Equalizer',
 'Captain America: Civil War',
 'Air Force One',
 'Bloodsport',
 'Blade',
 'Bad Boys',
 'Die Hard: With a Vengeance',
 'The Running Man',
 'Code of Silence',
 "Shoot 'Em Up",
 'Crank',
 'Machete',
 'Drive',
 'Batman',
 'Under Siege',
 'Independence Day',
 'Bullitt',
 'Wanted',
 'Superman',
 'Ronin',
 'They Live',
 'Cliffhanger',
 "Marvel's The Avengers",
 'Hot Fuzz',
 'The Warriors',
 'Starship Troopers',
 'Elite Squad: The Enemy Within',
 'Point Break',
 'The Long Kiss Goodnight',
 'The Guest',
 'Taken',
 '300',
 'True Lies',
 'Demolition Man',
 'Hardcore Henry',
 'Police Story (Ging chaat goo si) (Police Force)',
 '

A NoneType error here means that find('a') returned None for some h2 element, i.e. that element has no 'a' tag.

In [22]:
for i in range(len(title)):
    print(title[i])

<h2><a href="https://www.rottentomatoes.com/m/1018009-running_scared/">Running Scared</a> <span class="subtle start-year">(1986)</span> <span class="icon tiny fresh" title="Fresh"></span> <span class="tMeterScore">60%</span></h2>
<h2><a href="https://www.rottentomatoes.com/m/equilibrium/">Equilibrium</a> <span class="subtle start-year">(2002)</span> <span class="icon tiny rotten" title="Rotten"></span> <span class="tMeterScore">40%</span></h2>
<h2><a href="https://www.rottentomatoes.com/m/hero/">Hero</a> <span class="subtle start-year">(2004)</span> <span class="icon tiny certified" title="Certified Fresh"></span> <span class="tMeterScore">95%</span></h2>
<h2><a href="https://www.rottentomatoes.com/m/1017666-road_house/">Road House</a> <span class="subtle start-year">(1989)</span> <span class="icon tiny rotten" title="Rotten"></span> <span class="tMeterScore">39%</span></h2>
<h2><a href="https://www.rottentomatoes.com/m/unstoppable-2010/">Unstoppable</a> <span class="subtle start-year"

#### Adjusted Scores

In [23]:
adj_score = soup.find_all('div', {'class':'info countdown-adjusted-score'})

In [24]:
adj_score[0].contents[1].strip() # does it work for all others?

'61.158%'

In [25]:
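adj_score_list = []
for i in range(len(adj_score)):
    try: 
        data = adj_score[i].contents[1].strip()
    except (AttributeError, IndexError, TypeError):
        data = adj_score[i] # unexpected structure: keep the raw element for inspection
    #print(data) #ok
    adj_score_list.append(data)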

#### Critic consensus

In [26]:
critics = soup.find_all('div', {'class':'info critics-consensus'})

In [27]:
critics[0].contents[1].strip()

'Running Scared struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining.'

In [28]:
critics_consensus_list = []
for i in range(len(critics)):
    data = critics[i].contents[1].strip()
    #print(data) #ok
    critics_consensus_list.append(data)

#### Synopsis

In [29]:
synopsis = soup.find_all('div', {'class':'info synopsis'})

In [30]:
text = synopsis[0].get_text()
header = 'Synopsis:'
if text.startswith(header):
    text = text[len(header):].strip()
print(text, type(text))

Distinguished by a sharp, witty dialogue between its two cop protagonists, Ray and Danny (Gregory Hines and Billy Crystal), this... [More] <class 'str'>


In [31]:
header = 'Synopsis:'
synopsis_list = [line.get_text() for line in synopsis]
synopsis_list = [s[len(header):].strip() if s.startswith(header) else s for s in synopsis_list] # 's' avoids shadowing 'synopsis'
print(synopsis_list)

['Distinguished by a sharp, witty dialogue between its two cop protagonists, Ray and Danny (Gregory Hines and Billy Crystal), this... [More]', 'In the nation of Libria, there is always peace among men. The rules of the Librian system are simple. If... [More]', "Hero is two-time Academy Award nominee Zhang Yimou's directorial attempt at exploring the concept of a Chinese hero. During the... [More]", 'Dalton (Swayze) is a true gentleman with a degree in philosophy from NYU. He also has a flip side -... [More]', 'In this action thriller from director Tony Scott, rookie train operator Will (Chris Pine) and grizzled veteran engineer Frank (Denzel... [More]', 'Shaft, a highly successful film, spawned an industry of sequels and imitations. The daughter (Sherri Brewer) of Bumpy Jones (Moses... [More]', 'Since she was a little girl, Sook-hee was raised to be a deadly assassin. She gladly accepts the chance to... [More]', 'Among humans for centuries, an immortal specie existed. Connor MacLeod is
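
The same "remove the label, keep the rest" step is used for both the synopsis and the cast, so it could be factored into one small function. The following is a minimal sketch in plain Python; `strip_header` is a hypothetical name, and the sample strings only imitate the shape of the scraped text.

```python
def strip_header(text, header):
    """Remove a leading label such as 'Synopsis:' and surrounding whitespace."""
    text = text.strip()
    if text.startswith(header):
        return text[len(header):].strip()
    return text  # label not present: return the cleaned text unchanged

samples = ['Synopsis: In the nation of Libria... [More]', '  No label here  ']
cleaned_samples = [strip_header(s, 'Synopsis:') for s in samples]
print(cleaned_samples)
```

Calling it with 'Starring:' instead of 'Synopsis:' would cover the cast in the same way.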

#### Cast

In [32]:
cast = soup.find_all('div', {'class':'info cast'})

In [33]:
text = cast[0].get_text().strip()
#print(text)

header = 'Starring:'
cast_list = [line.get_text().strip() for line in cast]
#print(cast_list)

cast_list = [c[len(header):].strip() if c.startswith(header) else c for c in cast_list] # 'c' avoids shadowing 'cast'
print(cast_list)

['Gregory Hines, Billy Crystal, Jimmy Smits, Steven Bauer', 'Christian Bale, Emily Watson, Taye Diggs, Angus Macfadyen', 'Jet Li, Tony Leung Chiu Wai, Maggie Cheung, Daoming Chen', 'Patrick Swayze, Kelly Lynch, Sam Elliott, Ben Gazzara', 'Denzel Washington, Chris Pine, Rosario Dawson, Ethan Suplee', 'Richard Roundtree, Moses Gunn, Gwen Mitchell, Christopher St. John', 'Ok-bin Kim, Kim Seo-hyeong, Shin Ha-kyun, Bang Sung-jun', 'Christopher Lambert, Sean Connery, Roxanne Hart, Clancy Brown', 'Bruce Willis, Bonnie Bedelia, William Atherton, Reginald VelJohnson', 'Nicolas Cage, Diane Kruger, Justin Bartha, Sean Bean', 'Tony Jaa, Petchtai Wongkamlao, Bongkoj Khongmalai, Bongkoo Kongmalai', 'Matilda Anna Ingrid Lutz, Kevin Janssens, Vincent Colombe, Guillaume Bouchède', 'Carlos Gallardo, Consuelo Gómez, Reinol Martinez, Peter Marquardt', 'Feng Hsu, Chun Shih, Pai Ying, Tien Peng', 'Tom Cruise, Kelly McGillis, Anthony Edwards, Val Kilmer', 'Nicolas Cage, John Cusack, John Malkovich, Steve Bus

#### Director

In [34]:
director = soup.find_all('div', {'class':'info director'})

In [35]:
director[1].find('a', {'class':''}).contents[0]

'Kurt Wimmer'

In [36]:
temp_list = [line.find('a', {'class':''}) for line in director]
director_list = [line.contents[0] if line is not None else line for line in temp_list]
director_list # entries without a director link stay as None

['Peter Hyams',
 'Kurt Wimmer',
 'Zhang Yimou',
 'Rowdy Herrington',
 'Tony Scott',
 'Gordon Parks',
 'Jung Byung-gil',
 'Russell Mulcahy',
 'Renny Harlin',
 'Jon Turteltaub',
 'Prachya Pinkaew',
 'Coralie Fargeat',
 'Robert Rodriguez',
 'King Hu',
 'Tony Scott',
 'Simon West',
 'Simon West',
 'Stephen Sommers',
 'Doug Liman',
 'Brett Ratner',
 'Antoine Fuqua',
 'Anthony Russo',
 'Wolfgang Petersen',
 'Newt Arnold',
 'Stephen Norrington',
 'Michael Bay',
 'John McTiernan',
 'Paul Michael Glaser',
 'Andrew Davis',
 'Michael Davis',
 'Mark Neveldine',
 'Ethan Maniquis',
 'Nicolas Winding Refn',
 'Tim Burton',
 'Andrew Davis',
 'Roland Emmerich',
 'Peter Yates',
 'Timur Bekmambetov',
 'Richard Donner',
 'John Frankenheimer',
 'John Carpenter',
 'Renny Harlin',
 None,
 'Edgar Wright',
 'Walter Hill',
 'Paul Verhoeven',
 'José Padilha',
 'Kathryn Bigelow',
 'Renny Harlin',
 'Adam Wingard',
 'Pierre Morel',
 'Zack Snyder',
 'James Cameron',
 'Marco Brambilla',
 'Ilya Naishuller',
 'Jackie Ch

---
### Combining the results

In [37]:
print(len(title_list))
print(len(adj_score_list))
print(len(critics_consensus_list))
print(len(synopsis_list))
print(len(director_list))

140
140
140
140
140
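
Combining the lists by position only makes sense if they all have the same length, which is what the printed counts check by eye. The check and the combination could also be done in one step, as in this sketch. `combine_columns` is a hypothetical helper, and the two short lists are stand-ins for the real ones.

```python
def combine_columns(**columns):
    """Zip equally long lists into a list of row dictionaries.

    Raises ValueError if the lists differ in length, since rows would
    otherwise be silently misaligned or dropped.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'column lengths differ: {lengths}')
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]

demo_rows = combine_columns(
    title=['Running Scared', 'Equilibrium'],
    director=['Peter Hyams', 'Kurt Wimmer'],
)
print(demo_rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` just like the list built in the loop.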


In [38]:
rest_info = []

for i in range(len(title_list)):
    title = title_list[i]
    adj_score = adj_score_list[i]
    critics_consensus = critics_consensus_list[i]
    synopsis = synopsis_list[i]
    director = director_list[i]
    
    rest_info.append({'title' : title, 
                      'adj_score' : adj_score, 
                      'critics_consensus' : critics_consensus,
                      'synopsis' : synopsis, 
                      'director' : director})
    

In [39]:
rest_info = pd.DataFrame(rest_info)
rest_info

| | title | adj_score | critics_consensus | synopsis | director |
|---|---|---|---|---|---|
| 0 | Running Scared | 61.158% | Running Scared struggles to strike a consisten... | Distinguished by a sharp, witty dialogue betwe... | Peter Hyams |
| 1 | Equilibrium | 41.991% | Equilibrium is a reheated mishmash of other sc... | In the nation of Libria, there is always peace... | Kurt Wimmer |
| 2 | Hero | 100.828% | With death-defying action sequences and epic h... | Hero is two-time Academy Award nominee Zhang Y... | Zhang Yimou |
| 3 | Road House | 41.991% | Whether Road House is simply bad or so bad it'... | Dalton (Swayze) is a true gentleman with a deg... | Rowdy Herrington |
| 4 | Unstoppable | 91.513% | As fast, loud, and relentless as the train at ... | In this action thriller from director Tony Sco... | Tony Scott |
| ... | ... | ... | ... | ... | ... |
| 135 | Lat sau san taam (Hard-Boiled) | 96.035% | Boasting impactful action as well as surprisin... | Yun-Fat portrays a maverick, clarinet-playing ... | John Woo |
| 136 | The Matrix | 94.818% | Thanks to the Wachowskis' imaginative vision, ... | What if virtual reality wasn't just for fun, b... | Lilly Wachowski |
| 137 | Terminator 2: Judgment Day | 99.097% | T2 features thrilling action sequences and eye... | A sequel to the sci-fi action thriller that ma... | James Cameron |
| 138 | Die Hard | 98.72% | Its many imitators (and sequels) have never co... | It's Christmas time in L.A., and there's an em... | John McTiernan |


---
### Merging the two dataframes into one

We merge the 'movies' and 'rest_info' dataframes.
<br>The title column is the key used for merging.
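
Using the title as the merge key assumes that every title occurs only once; if a title is repeated, a merge on that key matches each copy on the left with each copy on the right and produces extra rows. A quick check before merging might look like the sketch below. `duplicated_keys` is a hypothetical helper and the short list is made-up sample data. pandas can also enforce the same expectation through the `validate` argument of `pd.merge` (for example `validate='one_to_one'`).

```python
from collections import Counter

def duplicated_keys(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [key for key, n in counts.items() if n > 1]

sample_titles = ['Running Scared', 'Hero', 'Die Hard', 'Hero']
repeated = duplicated_keys(sample_titles)
print(repeated)
```

An empty result means the key is unique and the merge should pair rows one to one.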

In [41]:
rt_essential_movies = pd.merge(left = movies, right = rest_info, on = 'title', how = 'outer')
rt_essential_movies

| | title | year | rating | adj_score | critics_consensus | synopsis | director |
|---|---|---|---|---|---|---|---|
| 0 | Running Scared | 1986 | 60% | 61.158% | Running Scared struggles to strike a consisten... | Distinguished by a sharp, witty dialogue betwe... | Peter Hyams |
| 1 | Equilibrium | 2002 | 40% | 41.991% | Equilibrium is a reheated mishmash of other sc... | In the nation of Libria, there is always peace... | Kurt Wimmer |
| 2 | Hero | 2004 | 95% | 100.828% | With death-defying action sequences and epic h... | Hero is two-time Academy Award nominee Zhang Y... | Zhang Yimou |
| 3 | Road House | 1989 | 39% | 41.991% | Whether Road House is simply bad or so bad it'... | Dalton (Swayze) is a true gentleman with a deg... | Rowdy Herrington |
| 4 | Unstoppable | 2010 | 86% | 91.513% | As fast, loud, and relentless as the train at ... | In this action thriller from director Tony Sco... | Tony Scott |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 135 | Lat sau san taam (Hard-Boiled) | 1992 | 94% | 96.035% | Boasting impactful action as well as surprisin... | Yun-Fat portrays a maverick, clarinet-playing ... | John Woo |
| 136 | The Matrix | 1999 | 88% | 94.818% | Thanks to the Wachowskis' imaginative vision, ... | What if virtual reality wasn't just for fun, b... | Lilly Wachowski |
| 137 | Terminator 2: Judgment Day | 1991 | 93% | 99.097% | T2 features thrilling action sequences and eye... | A sequel to the sci-fi action thriller that ma... | James Cameron |
| 138 | Die Hard | 1988 | 93% | 98.72% | Its many imitators (and sequels) have never co... | It's Christmas time in L.A., and there's an em... | John McTiernan |


---
### Storing the dataframe in a .csv file


In [43]:
rt_essential_movies.to_csv('rotten_tomatoes_essential_movies.csv', index = False)