# Scraping Rotten Tomatoes
> Rotten Tomatoes is a review aggregation website for movies.

## Set-up

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
base_site = "https://editorial.rottentomatoes.com/guide/140-essential-action-movies-to-watch-now/"

In [3]:
response = requests.get(base_site)
response
# Or response.status_code

<Response [200]>
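The request above succeeded, but a script that runs unattended usually benefits from a timeout and an explicit status check. The following is a minimal sketch; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the raw page bytes, raising on HTTP errors (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into requests.HTTPError
    response.raise_for_status()
    return response.content
```

With such a helper, `html = fetch_html(base_site)` could stand in for the request and the `response.content` lookup.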

In [4]:
html = response.content
html

b'<!DOCTYPE html>\n<html lang="en-US" class="hitim">\n<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">\n    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n    \n    <meta property=\'og.description\' content="From John Wick and Die Hard to Mad Max and Atomic Blonde, these best action movies ever will thrill you and get the adrenaline pumping!" />\n    <meta name=\'description\' content="From John Wick and Die Hard to Mad Max and Atomic Blonde, these best action movies ever will thrill you and get the adrenaline pumping!" />\n    <meta property=\'og:title\' content="140 Essential Action Movies To Watch Now" />\n    <meta property=\'og:type\' content="article" />\n    <meta property=\'og:image\' content="https://prd-rteditorial.s3.us-west-2.amazonaws.com/wp-content/uploads/2019/06/01073141/600Crank.jpg" />\n    <meta property=\'og:url\' content="https://editorial.rottentomatoes.com/guide/140-essential-action-movies-to-

## Choosing a parser

In [5]:
# Note: each project may call for a different parser
soup = BeautifulSoup(html, 'lxml')
soup

# Note: 'lxml' is generally faster and more lenient than the built-in 'html.parser', but it must be installed separately

<!DOCTYPE html>
<html class="hitim" lang="en-US">
<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="From John Wick and Die Hard to Mad Max and Atomic Blonde, these best action movies ever will thrill you and get the adrenaline pumping!" property="og.description"/>
<meta content="From John Wick and Die Hard to Mad Max and Atomic Blonde, these best action movies ever will thrill you and get the adrenaline pumping!" name="description"/>
<meta content="140 Essential Action Movies To Watch Now" property="og:title"/>
<meta content="article" property="og:type"/>
<meta content="https://prd-rteditorial.s3.us-west-2.amazonaws.com/wp-content/uploads/2019/06/01073141/600Crank.jpg" property="og:image"/>
<meta content="https://editorial.rottentomatoes.com/guide/140-essential-action-movies-to-watch-now/" property="og:url"/>
<meta content="175594" name="editorialID"/>
<meta
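Since `lxml` is a separate install, a notebook shared with others may fail at the parsing step. A small sketch of a fallback, assuming a hypothetical helper `make_soup` (not in the original notebook) and relying on the `FeatureNotFound` exception that Beautiful Soup raises when a requested parser is unavailable:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when available, else fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python
        return BeautifulSoup(markup, "html.parser")


demo = make_soup("<p>Hello</p>")
print(demo.find("p").string)
```

The two parsers can build slightly different trees from malformed HTML, so results should be spot-checked if the fallback is ever used.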

## Exporting to a file

In [6]:
with open('Rotten_Tomatoes_Page_LXML_Parser.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))
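The file is opened in binary mode (`'wb'`) because `prettify` returns `bytes` once an encoding is passed. The sketch below illustrates this on a tiny document; the temporary directory and the file name `demo_page.html` are placeholders chosen for the example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")
pretty_bytes = demo.prettify("utf-8")  # bytes, because an encoding was given

path = os.path.join(tempfile.mkdtemp(), "demo_page.html")  # placeholder file name
with open(path, "wb") as file:
    file.write(pretty_bytes)

# Reading the file back as UTF-8 text recovers the original characters
with open(path, encoding="utf-8") as file:
    round_trip = file.read()
```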

## Obtaining the element containing all the data
>We found that all the movie data is contained in `div` elements with the class "col-sm-18 col-full-xs countdown-item-content", and that each `h2` element holds the title, year, and score.

In [7]:
# Store all divs
divs = soup.find_all('div', {'class': 'col-sm-18 col-full-xs countdown-item-content'})
divs

[<div class="col-sm-18 col-full-xs countdown-item-content">
 <div class="row countdown-item-title-bar">
 <div class="col-sm-20 col-full-xs" style="height: 100%;">
 <div class="article_movie_title" style="float: left;">
 <div><h2><a href="https://www.rottentomatoes.com/m/1018009-running_scared">Running Scared</a> <span class="subtle start-year">(1986)</span> <span class="icon tiny rotten" title="Rotten"></span> <span class="tMeterScore">57%</span></h2></div>
 </div>
 </div>
 <div class="col-sm-4 col-full-xs" style="height: 100%;">
 <div class="countdown-index">#140</div>
 </div>
 </div>
 <div class="row countdown-item-details">
 <div class="col-sm-24">
 <div class="info countdown-adjusted-score"><span class="descriptor">Adjusted Score: </span>58275% <span class="glyphicon glyphicon-question-sign" data-html="true" data-original-title="The Adjusted Score comes from a weighted formula (Bayesian) that we use that accounts for variation in the number of reviews per movie." data-placement="to
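Passing the full class string, as in the cell above, matches the `class` attribute exactly, including the order of the class names. A hedged sketch on made-up markup shows how this differs from searching for a single class name, which may be more tolerant if the site reorders its classes:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
sample = """
<div class="col-sm-18 col-full-xs countdown-item-content"><h2>A</h2></div>
<div class="countdown-item-content col-sm-18"><h2>B</h2></div>
<div class="sidebar"><h2>C</h2></div>
"""
sample_soup = BeautifulSoup(sample, "html.parser")

# Exact attribute string: only the first div matches
exact = sample_soup.find_all(
    "div", {"class": "col-sm-18 col-full-xs countdown-item-content"}
)

# Single class name: matches any div that carries this class
by_single = sample_soup.find_all("div", class_="countdown-item-content")

print(len(exact), len(by_single))
```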

## Extracting the title and year of each movie

In [8]:
headings = [div.find("h2") for div in divs]
headings

[<h2><a href="https://www.rottentomatoes.com/m/1018009-running_scared">Running Scared</a> <span class="subtle start-year">(1986)</span> <span class="icon tiny rotten" title="Rotten"></span> <span class="tMeterScore">57%</span></h2>,
 <h2><a href="https://www.rottentomatoes.com/m/equilibrium">Equilibrium</a> <span class="subtle start-year">(2002)</span> <span class="icon tiny rotten" title="Rotten"></span> <span class="tMeterScore">41%</span></h2>,
 <h2><a href="https://www.rottentomatoes.com/m/hero">Hero</a> <span class="subtle start-year">(2002)</span> <span class="icon tiny certified" title="Certified Fresh"></span> <span class="tMeterScore">94%</span></h2>,
 <h2><a href="https://www.rottentomatoes.com/m/1017666-road_house">Road House</a> <span class="subtle start-year">(1989)</span> <span class="icon tiny rotten" title="Rotten"></span> <span class="tMeterScore">40%</span></h2>,
 <h2><a href="https://www.rottentomatoes.com/m/unstoppable-2010">Unstoppable</a> <span class="subtle start

In [9]:
# Let's inspect the first heading
headings[0]

<h2><a href="https://www.rottentomatoes.com/m/1018009-running_scared">Running Scared</a> <span class="subtle start-year">(1986)</span> <span class="icon tiny rotten" title="Rotten"></span> <span class="tMeterScore">57%</span></h2>
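Before pulling individual fields out of a heading like this one, it helps to see why the notebook calls `.string` on the inner tags and not on the `h2` itself. The sketch below reuses the heading shown above: `.string` is `None` whenever a tag has more than one child, while `get_text` concatenates every text node.

```python
from bs4 import BeautifulSoup

heading_html = (
    '<h2><a href="https://www.rottentomatoes.com/m/1018009-running_scared">'
    "Running Scared</a> "
    '<span class="subtle start-year">(1986)</span> '
    '<span class="icon tiny rotten" title="Rotten"></span> '
    '<span class="tMeterScore">57%</span></h2>'
)
heading = BeautifulSoup(heading_html, "html.parser").find("h2")

whole = heading.string                        # None: the h2 has several children
title = heading.find("a").string              # the single string inside the link
flat = heading.get_text(" ", strip=True)      # every text node, joined by spaces

print(whole, title, flat, sep="\n")
```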

## Title

In [10]:
movie_names = [heading.find('a').string for heading in headings]
movie_names

['Running Scared',
 'Equilibrium',
 'Hero',
 'Road House',
 'Unstoppable',
 'Shaft',
 'The Villainess',
 'Highlander',
 'Die Hard 2',
 'National Treasure',
 'The Protector',
 'Revenge',
 'El Mariachi',
 'A Touch of Zen',
 'Top Gun',
 'Con Air',
 'The Expendables 2',
 'The Mummy',
 'Mr. & Mrs. Smith',
 'Rush Hour',
 'The Equalizer',
 'Captain America: Civil War',
 'Air Force One',
 'Bloodsport',
 'Blade',
 'Bad Boys',
 'Die Hard With a Vengeance',
 'The Running Man',
 'Code of Silence',
 "Shoot 'Em Up",
 'Crank',
 'Machete',
 'Drive',
 'Batman',
 'Under Siege',
 'Independence Day',
 'Bullitt',
 'Wanted',
 'Superman: The Movie',
 'Ronin',
 'They Live',
 'Cliffhanger',
 "Marvel's the Avengers",
 'Hot Fuzz',
 'The Warriors',
 'Starship Troopers',
 'Elite Squad 2',
 'Point Break',
 'The Long Kiss Goodnight',
 'The Guest',
 'Taken',
 '300',
 'True Lies',
 'Demolition Man',
 'Hardcore Henry',
 'Police Story',
 'Brotherhood of the Wolf',
 'Kingsman: The Secret Service',
 'The Fifth Element',
 

## Year

In [11]:
years = [year.find('span', class_ = 'subtle start-year').string for year in headings]
years

['(1986)',
 '(2002)',
 '(2002)',
 '(1989)',
 '(2010)',
 '(1971)',
 '(2017)',
 '(1986)',
 '(1990)',
 '(2004)',
 '(2005)',
 '(2017)',
 '(1992)',
 '(1971)',
 '(1986)',
 '(1997)',
 '(2012)',
 '(1999)',
 '(2005)',
 '(1998)',
 '(2014)',
 '(2016)',
 '(1997)',
 '(1988)',
 '(1998)',
 '(1995)',
 '(1995)',
 '(1987)',
 '(1985)',
 '(2007)',
 '(2006)',
 '(2010)',
 '(2011)',
 '(1989)',
 '(1992)',
 '(1996)',
 '(1968)',
 '(2008)',
 '(1978)',
 '(1998)',
 '(1988)',
 '(1993)',
 '(2012)',
 '(2007)',
 '(1979)',
 '(1997)',
 '(2010)',
 '(1991)',
 '(1996)',
 '(2014)',
 '(2008)',
 '(2006)',
 '(1994)',
 '(1993)',
 '(2015)',
 '(1985)',
 '(2001)',
 '(2014)',
 '(1997)',
 '(1986)',
 '(2017)',
 '(1995)',
 '(2004)',
 '(1984)',
 '(2003)',
 '(2004)',
 '(1993)',
 '(1981)',
 '(2000)',
 '(2004)',
 '(2010)',
 '(1992)',
 '(1989)',
 '(2004)',
 '(1986)',
 '(2008)',
 '(2018)',
 '(2017)',
 '(1964)',
 '(1976)',
 '(2017)',
 '(1972)',
 '(2014)',
 '(2003)',
 '(1971)',
 '(2015)',
 '(1990)',
 '(1992)',
 '(1971)',
 '(2014)',
 '(2003)',

## Removing the brackets

In [12]:
years[0][1:-1]

'1986'

In [13]:
# '.strip(chars)' method: removes all leading and trailing characters found in 'chars' from a string.
# The characters to remove are passed as a parameter.
years[0].strip('()')

'1986'
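Note that `strip` treats its argument as a set of characters, not as a substring, and only touches the two ends of the string. The short sketch below illustrates this, and adds a hypothetical `parse_year` helper (not in the original notebook) that uses a regular expression in case a value is not neatly wrapped in brackets.

```python
import re

print("(1986)".strip("()"))     # 1986
print("((1986))".strip("()"))   # 1986
print("(19(86)".strip("()"))    # 19(86 (inner characters are untouched)


def parse_year(text):
    """Return the first four-digit run in text as an int, or None (hypothetical helper)."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None
```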

In [14]:
years = [year.strip('()') for year in years]
years

['1986',
 '2002',
 '2002',
 '1989',
 '2010',
 '1971',
 '2017',
 '1986',
 '1990',
 '2004',
 '2005',
 '2017',
 '1992',
 '1971',
 '1986',
 '1997',
 '2012',
 '1999',
 '2005',
 '1998',
 '2014',
 '2016',
 '1997',
 '1988',
 '1998',
 '1995',
 '1995',
 '1987',
 '1985',
 '2007',
 '2006',
 '2010',
 '2011',
 '1989',
 '1992',
 '1996',
 '1968',
 '2008',
 '1978',
 '1998',
 '1988',
 '1993',
 '2012',
 '2007',
 '1979',
 '1997',
 '2010',
 '1991',
 '1996',
 '2014',
 '2008',
 '2006',
 '1994',
 '1993',
 '2015',
 '1985',
 '2001',
 '2014',
 '1997',
 '1986',
 '2017',
 '1995',
 '2004',
 '1984',
 '2003',
 '2004',
 '1993',
 '1981',
 '2000',
 '2004',
 '2010',
 '1992',
 '1989',
 '2004',
 '1986',
 '2008',
 '2018',
 '2017',
 '1964',
 '1976',
 '2017',
 '1972',
 '2014',
 '2003',
 '1971',
 '2015',
 '1990',
 '1992',
 '1971',
 '2014',
 '2003',
 '1993',
 '2018',
 '2010',
 '1995',
 '2002',
 '2019',
 '2012',
 '2002',
 '2008',
 '1997',
 '1985',
 '2008',
 '2011',
 '2011',
 '1987',
 '1996',
 '1987',
 '2017',
 '2006',
 '2017',
 

In [15]:
# To convert the strings to int.
years = [int(year) for year in years]
years

[1986,
 2002,
 2002,
 1989,
 2010,
 1971,
 2017,
 1986,
 1990,
 2004,
 2005,
 2017,
 1992,
 1971,
 1986,
 1997,
 2012,
 1999,
 2005,
 1998,
 2014,
 2016,
 1997,
 1988,
 1998,
 1995,
 1995,
 1987,
 1985,
 2007,
 2006,
 2010,
 2011,
 1989,
 1992,
 1996,
 1968,
 2008,
 1978,
 1998,
 1988,
 1993,
 2012,
 2007,
 1979,
 1997,
 2010,
 1991,
 1996,
 2014,
 2008,
 2006,
 1994,
 1993,
 2015,
 1985,
 2001,
 2014,
 1997,
 1986,
 2017,
 1995,
 2004,
 1984,
 2003,
 2004,
 1993,
 1981,
 2000,
 2004,
 2010,
 1992,
 1989,
 2004,
 1986,
 2008,
 2018,
 2017,
 1964,
 1976,
 2017,
 1972,
 2014,
 2003,
 1971,
 2015,
 1990,
 1992,
 1971,
 2014,
 2003,
 1993,
 2018,
 2010,
 1995,
 2002,
 2019,
 2012,
 2002,
 2008,
 1997,
 1985,
 2008,
 2011,
 2011,
 1987,
 1996,
 1987,
 2017,
 2006,
 2017,
 1994,
 1989,
 2014,
 1973,
 1985,
 1982,
 2015,
 1984,
 2000,
 2003,
 1994,
 1994,
 1994,
 2014,
 2000,
 1987,
 2007,
 1990,
 1981,
 1995,
 2011,
 2018,
 1981,
 1986,
 1992,
 1999,
 1991,
 1988,
 2015]

## Score

In [16]:
movie_scores = [heading.find('span', class_ = "tMeterScore").string for heading in headings]
movie_scores

['57%',
 '41%',
 '94%',
 '40%',
 '87%',
 '88%',
 '85%',
 '70%',
 '69%',
 '46%',
 '53%',
 '93%',
 '91%',
 '97%',
 '58%',
 '56%',
 '67%',
 '61%',
 '60%',
 '61%',
 '60%',
 '90%',
 '78%',
 '40%',
 '57%',
 '42%',
 '60%',
 '66%',
 '70%',
 '67%',
 '61%',
 '72%',
 '93%',
 '71%',
 '79%',
 '68%',
 '98%',
 '71%',
 '94%',
 '68%',
 '85%',
 '67%',
 '91%',
 '91%',
 '87%',
 '66%',
 '91%',
 '69%',
 '70%',
 '92%',
 '59%',
 '61%',
 '71%',
 '60%',
 '51%',
 '93%',
 '73%',
 '75%',
 '71%',
 '74%',
 '79%',
 '80%',
 '80%',
 '83%',
 '85%',
 '86%',
 '91%',
 '86%',
 '88%',
 '93%',
 '95%',
 '88%',
 '88%',
 '91%',
 '93%',
 '94%',
 '91%',
 '94%',
 '99%',
 '96%',
 '93%',
 '83%',
 '90%',
 '81%',
 '98%',
 '82%',
 '89%',
 '96%',
 '87%',
 '91%',
 '85%',
 '96%',
 '96%',
 '87%',
 '79%',
 '90%',
 '94%',
 '79%',
 '83%',
 '86%',
 '92%',
 '85%',
 '94%',
 '93%',
 '77%',
 '80%',
 '68%',
 '90%',
 '89%',
 '94%',
 '92%',
 '100%',
 '98%',
 '82%',
 '95%',
 '69%',
 '85%',
 '94%',
 '100%',
 '77%',
 '85%',
 '74%',
 '94%',
 '84%',
 '86%'

In [17]:
movie_scores = [movie_score.strip('%') for movie_score in movie_scores]
movie_scores

['57',
 '41',
 '94',
 '40',
 '87',
 '88',
 '85',
 '70',
 '69',
 '46',
 '53',
 '93',
 '91',
 '97',
 '58',
 '56',
 '67',
 '61',
 '60',
 '61',
 '60',
 '90',
 '78',
 '40',
 '57',
 '42',
 '60',
 '66',
 '70',
 '67',
 '61',
 '72',
 '93',
 '71',
 '79',
 '68',
 '98',
 '71',
 '94',
 '68',
 '85',
 '67',
 '91',
 '91',
 '87',
 '66',
 '91',
 '69',
 '70',
 '92',
 '59',
 '61',
 '71',
 '60',
 '51',
 '93',
 '73',
 '75',
 '71',
 '74',
 '79',
 '80',
 '80',
 '83',
 '85',
 '86',
 '91',
 '86',
 '88',
 '93',
 '95',
 '88',
 '88',
 '91',
 '93',
 '94',
 '91',
 '94',
 '99',
 '96',
 '93',
 '83',
 '90',
 '81',
 '98',
 '82',
 '89',
 '96',
 '87',
 '91',
 '85',
 '96',
 '96',
 '87',
 '79',
 '90',
 '94',
 '79',
 '83',
 '86',
 '92',
 '85',
 '94',
 '93',
 '77',
 '80',
 '68',
 '90',
 '89',
 '94',
 '92',
 '100',
 '98',
 '82',
 '95',
 '69',
 '85',
 '94',
 '100',
 '77',
 '85',
 '74',
 '94',
 '84',
 '86',
 '97',
 '82',
 '92',
 '82',
 '94',
 '87',
 '87',
 '97',
 '95',
 '97',
 '94',
 '88',
 '93',
 '94',
 '97']

In [18]:
movie_scores = [int(movie_score) for movie_score in movie_scores]
movie_scores

[57,
 41,
 94,
 40,
 87,
 88,
 85,
 70,
 69,
 46,
 53,
 93,
 91,
 97,
 58,
 56,
 67,
 61,
 60,
 61,
 60,
 90,
 78,
 40,
 57,
 42,
 60,
 66,
 70,
 67,
 61,
 72,
 93,
 71,
 79,
 68,
 98,
 71,
 94,
 68,
 85,
 67,
 91,
 91,
 87,
 66,
 91,
 69,
 70,
 92,
 59,
 61,
 71,
 60,
 51,
 93,
 73,
 75,
 71,
 74,
 79,
 80,
 80,
 83,
 85,
 86,
 91,
 86,
 88,
 93,
 95,
 88,
 88,
 91,
 93,
 94,
 91,
 94,
 99,
 96,
 93,
 83,
 90,
 81,
 98,
 82,
 89,
 96,
 87,
 91,
 85,
 96,
 96,
 87,
 79,
 90,
 94,
 79,
 83,
 86,
 92,
 85,
 94,
 93,
 77,
 80,
 68,
 90,
 89,
 94,
 92,
 100,
 98,
 82,
 95,
 69,
 85,
 94,
 100,
 77,
 85,
 74,
 94,
 84,
 86,
 97,
 82,
 92,
 82,
 94,
 87,
 87,
 97,
 95,
 97,
 94,
 88,
 93,
 94,
 97]

# Extracting the rest of the information

## Critics Consensus

In [19]:
consensus = [div.find('div', class_ = "info critics-consensus") for div in divs]
consensus

[<div class="info critics-consensus"><span class="descriptor">Critics Consensus:</span> <em>Running Scared</em> struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining.</div>,
 <div class="info critics-consensus"><span class="descriptor">Critics Consensus:</span> Equilibrium is a reheated mishmash of other sci-fi movies.</div>,
 <div class="info critics-consensus"><span class="descriptor">Critics Consensus:</span> With death-defying action sequences and epic historic sweep, <em>Hero</em> offers everything a martial arts fan could ask for.</div>,
 <div class="info critics-consensus"><span class="descriptor">Critics Consensus:</span> Whether <em>Road House</em> is simply bad or so bad it's good depends largely on the audience's fondness for Swayze -- and tolerance for violently cheesy action.</div>,
 <div class="info critics-consensus"><span class="descriptor">Critics Consensus:</span> As fast, 

In [20]:
# Let's inspect the text content of those divs
[con.text for con in consensus]

['Critics Consensus: Running Scared struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining.',
 'Critics Consensus: Equilibrium is a reheated mishmash of other sci-fi movies.',
 'Critics Consensus: With death-defying action sequences and epic historic sweep, Hero offers everything a martial arts fan could ask for.',
 "Critics Consensus: Whether Road House is simply bad or so bad it's good depends largely on the audience's fondness for Swayze -- and tolerance for violently cheesy action.",
 "Critics Consensus: As fast, loud, and relentless as the train at the center of the story, Unstoppable is perfect popcorn entertainment -- and director Tony Scott's best movie in years.",
 'Critics Consensus: This is the man that would risk his neck for his brother, man. Can you dig it?',
 'Critics Consensus: The Villainess offers enough pure kinetic thrills to satisfy genre enthusiasts -- and carve out a bl

## Way #1: Text processing
> for removing the repeated 'Critics Consensus:'

In [21]:
common_phrase = 'Critics Consensus: '

In [22]:
common_len = len(common_phrase)
common_len

19

In [23]:
consensus[0].text[common_len:]

'Running Scared struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining.'
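The slice above drops the first `common_len` characters unconditionally, so it would cut into the text of any entry that lacks the prefix. A guarded version can be sketched as follows; `drop_prefix` is a hypothetical name introduced here for illustration.

```python
common_phrase = "Critics Consensus: "


def drop_prefix(text, prefix=common_phrase):
    """Remove prefix only when text actually starts with it."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


print(drop_prefix("Critics Consensus: Equilibrium is a reheated mishmash of other sci-fi movies."))
print(drop_prefix("No prefix here."))
```

On Python 3.9 and later, the built-in `str.removeprefix` behaves the same way.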

In [24]:
consensus_text = [con.text[common_len:] if con.text.startswith(common_phrase) else con.text for con in consensus]
consensus_text

['Running Scared struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining.',
 'Equilibrium is a reheated mishmash of other sci-fi movies.',
 'With death-defying action sequences and epic historic sweep, Hero offers everything a martial arts fan could ask for.',
 "Whether Road House is simply bad or so bad it's good depends largely on the audience's fondness for Swayze -- and tolerance for violently cheesy action.",
 "As fast, loud, and relentless as the train at the center of the story, Unstoppable is perfect popcorn entertainment -- and director Tony Scott's best movie in years.",
 'This is the man that would risk his neck for his brother, man. Can you dig it?',
 'The Villainess offers enough pure kinetic thrills to satisfy genre enthusiasts -- and carve out a bloody niche for itself in modern Korean action cinema.',
 "People hate Highlander because it's cheesy, bombastic, and absurd. And peop

## Way #2: Inspecting the HTML

In [25]:
consensus[0]

<div class="info critics-consensus"><span class="descriptor">Critics Consensus:</span> <em>Running Scared</em> struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining.</div>

In [26]:
consensus[0].contents[3]

' struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining.'

In [27]:
consensus[0].contents[3].strip()

'struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining.'

In [28]:
#consensus_text = [con.contents[1].strip() for con in consensus]
#consensus_text

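- This last method stays closest to the Beautiful Soup fundamentals, but the child index is not fixed: when the title is wrapped in `<em>` (as in the first item), the remaining text sits at `contents[3]` and the title itself is lost, whereas items without `<em>` keep the whole sentence in `contents[1]`. This is why the cell above is commented out and the result of Way #1 is the one used later.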
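One way to keep the tree-based approach without depending on a fixed child index is to locate the `descriptor` span and join everything that follows it. This is a sketch under that assumption; `consensus_body` is a hypothetical helper, tested here only on the two shapes visible in the output above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def consensus_body(div):
    """Return the text after the 'descriptor' span, whatever tags it contains."""
    label = div.find("span", class_="descriptor")
    if label is None:
        return div.get_text(strip=True)
    parts = [
        sib.get_text() if isinstance(sib, Tag) else str(sib)
        for sib in label.next_siblings
    ]
    return "".join(parts).strip()


with_em = BeautifulSoup(
    '<div class="info critics-consensus"><span class="descriptor">Critics Consensus:</span>'
    " <em>Running Scared</em> struggles to strike a consistent balance.</div>",
    "html.parser",
).find("div")
without_em = BeautifulSoup(
    '<div class="info critics-consensus"><span class="descriptor">Critics Consensus:</span>'
    " Equilibrium is a reheated mishmash of other sci-fi movies.</div>",
    "html.parser",
).find("div")

print(consensus_body(with_em))
print(consensus_body(without_em))
```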

## Directors

In [29]:
directors = [div.find("div", class_ = "director") for div in divs]
directors

[<div class="info director">
 <span class="descriptor">Directed By:</span> <a class="" href="//www.rottentomatoes.com/celebrity/peter_hyams">Peter Hyams</a></div>,
 <div class="info director">
 <span class="descriptor">Directed By:</span> <a class="" href="//www.rottentomatoes.com/celebrity/kurt_wimmer">Kurt Wimmer</a></div>,
 <div class="info director">
 <span class="descriptor">Directed By:</span> <a class="" href="//www.rottentomatoes.com/celebrity/zhang_yimou">Zhang Yimou</a></div>,
 <div class="info director">
 <span class="descriptor">Directed By:</span> <a class="" href="//www.rottentomatoes.com/celebrity/rowdy_herrington">Rowdy Herrington</a></div>,
 <div class="info director">
 <span class="descriptor">Directed By:</span> <a class="" href="//www.rottentomatoes.com/celebrity/tony_scott">Tony Scott</a></div>,
 <div class="info director">
 <span class="descriptor">Directed By:</span> <a class="" href="//www.rottentomatoes.com/celebrity/gordon_parks">Gordon Parks</a></div>,
 <div 

In [30]:
directors[0]

<div class="info director">
<span class="descriptor">Directed By:</span> <a class="" href="//www.rottentomatoes.com/celebrity/peter_hyams">Peter Hyams</a></div>

In [31]:
[director.find('a') for director in directors]
# If any element here is None, calling .string on it will raise an AttributeError

[<a class="" href="//www.rottentomatoes.com/celebrity/peter_hyams">Peter Hyams</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/kurt_wimmer">Kurt Wimmer</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/zhang_yimou">Zhang Yimou</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/rowdy_herrington">Rowdy Herrington</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/tony_scott">Tony Scott</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/gordon_parks">Gordon Parks</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/jeong_byeong_gil">Jeong Byeong-gil</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/russell_mulcahy">Russell Mulcahy</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/renny_harlin">Renny Harlin</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/jon_turteltaub">Jon Turteltaub</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/prachya_pinkaew">Prachya Pinkaew</a>,
 <a class="" href="//www.rottent

In [32]:
# To solve the None issue
final_directories = [None if director.find("a") is None else director.find("a").string for director in directors]
final_directories

['Peter Hyams',
 'Kurt Wimmer',
 'Zhang Yimou',
 'Rowdy Herrington',
 'Tony Scott',
 'Gordon Parks',
 'Jeong Byeong-gil',
 'Russell Mulcahy',
 'Renny Harlin',
 'Jon Turteltaub',
 'Prachya Pinkaew',
 'Coralie Fargeat',
 'Robert Rodriguez',
 'King Hu',
 'Tony Scott',
 'Simon West',
 'Simon West',
 'Stephen Sommers',
 'Doug Liman',
 'Brett Ratner',
 'Antoine Fuqua',
 'Anthony Russo',
 'Wolfgang Petersen',
 'Newt Arnold',
 'Stephen Norrington',
 'Michael Bay',
 'John McTiernan',
 'Paul Michael Glaser',
 'Andrew Davis',
 'Michael Davis',
 'Mark Neveldine',
 'Robert Rodriguez',
 'Nicolas Winding Refn',
 'Tim Burton',
 'Andrew Davis',
 'Roland Emmerich',
 'Peter Yates',
 'Timur Bekmambetov',
 'Richard Donner',
 'John Frankenheimer',
 'John Carpenter',
 'Renny Harlin',
 'Joss Whedon',
 'Edgar Wright',
 'Walter Hill',
 'Paul Verhoeven',
 'José Padilha',
 'Kathryn Bigelow',
 'Renny Harlin',
 'Adam Wingard',
 'Pierre Morel',
 'Zack Snyder',
 'James Cameron',
 'Marco Brambilla',
 'Ilya Naishuller'

- We simply propagate the missing links (`None` links) as `None` directors.
- If we removed the whole movie from the list instead, we would lose the other, non-corrupted data about it.
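The same guard can be factored into a small function so that `find("a")` runs only once per element. A minimal sketch, assuming a hypothetical helper `link_text` and made-up markup for the missing-link case:

```python
from bs4 import BeautifulSoup


def link_text(tag):
    """Return the text of the first <a> inside tag, or None if tag or link is missing."""
    if tag is None:
        return None
    link = tag.find("a")
    return None if link is None else link.string


with_link = BeautifulSoup(
    '<div class="info director"><span class="descriptor">Directed By:</span> '
    '<a class="" href="//www.rottentomatoes.com/celebrity/peter_hyams">Peter Hyams</a></div>',
    "html.parser",
).find("div")
# Made-up example of a director div without a link
no_link = BeautifulSoup(
    '<div class="info director"><span class="descriptor">Directed By:</span></div>',
    "html.parser",
).find("div")

print([link_text(d) for d in (with_link, no_link, None)])
```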

## Cast
>The cast of a play or film is all the people who act in it.

In [33]:
cast_info = [div.find("div", class_ = "cast") for div in divs]
cast_info

[<div class="info cast">
 <span class="descriptor">Starring:</span> <a class="" href="//www.rottentomatoes.com/celebrity/gregory_hines">Gregory Hines</a>, <a class="" href="//www.rottentomatoes.com/celebrity/billy_crystal">Billy Crystal</a>, <a class="" href="//www.rottentomatoes.com/celebrity/jimmy_smits">Jimmy Smits</a>, <a class="" href="//www.rottentomatoes.com/celebrity/steven_bauer">Steven Bauer</a></div>,
 <div class="info cast">
 <span class="descriptor">Starring:</span> <a class="" href="//www.rottentomatoes.com/celebrity/christian_bale">Christian Bale</a>, <a class="" href="//www.rottentomatoes.com/celebrity/emily_watson">Emily Watson</a>, <a class="" href="//www.rottentomatoes.com/celebrity/taye_diggs">Taye Diggs</a>, <a class="" href="//www.rottentomatoes.com/celebrity/angus_macfadyen">Angus Macfadyen</a></div>,
 <div class="info cast">
 <span class="descriptor">Starring:</span> <a class="" href="//www.rottentomatoes.com/celebrity/jet_li">Jet Li</a>, <a class="" href="//www

In [34]:
cast_info[0]

<div class="info cast">
<span class="descriptor">Starring:</span> <a class="" href="//www.rottentomatoes.com/celebrity/gregory_hines">Gregory Hines</a>, <a class="" href="//www.rottentomatoes.com/celebrity/billy_crystal">Billy Crystal</a>, <a class="" href="//www.rottentomatoes.com/celebrity/jimmy_smits">Jimmy Smits</a>, <a class="" href="//www.rottentomatoes.com/celebrity/steven_bauer">Steven Bauer</a></div>

In [35]:
# Merge each movie's cast member names into a single string, so that the result is a regular list of strings.
# Try on a single movie
cast_links = cast_info[0].find_all('a')
cast_links

[<a class="" href="//www.rottentomatoes.com/celebrity/gregory_hines">Gregory Hines</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/billy_crystal">Billy Crystal</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/jimmy_smits">Jimmy Smits</a>,
 <a class="" href="//www.rottentomatoes.com/celebrity/steven_bauer">Steven Bauer</a>]

In [36]:
cast_names = [link.string for link in cast_links]
cast_names

['Gregory Hines', 'Billy Crystal', 'Jimmy Smits', 'Steven Bauer']

In [37]:
cast = ", ".join(cast_names)
cast

'Gregory Hines, Billy Crystal, Jimmy Smits, Steven Bauer'

## Using a 'for loop'
>The list comprehension can get a bit messy, thus we will start with a 'for loop'
and then adapt it into a list comprehension.

In [38]:
cast = []

for c in cast_info:
    cast_links = c.find_all('a')
    cast_names = [link.string for link in cast_links]
    result = ", ".join(cast_names)
    
    cast.append(result)
    
cast


['Gregory Hines, Billy Crystal, Jimmy Smits, Steven Bauer',
 'Christian Bale, Emily Watson, Taye Diggs, Angus Macfadyen',
 'Jet Li, Tony Leung Chiu Wai, Maggie Cheung Man-yuk, Donnie Yen',
 'Patrick Swayze, Kelly Lynch, Sam Elliott, Ben Gazzara',
 'Denzel Washington, Chris Pine, Rosario Dawson, Kevin Dunn',
 'Richard Roundtree, Moses Gunn, Christopher St. John, Charles Cioffi',
 'Kim Ok-bin, Shin Ha-kyun, Sung Joon, Kim Seo-hyung',
 'Christopher Lambert, Sean Connery, Roxanne Hart, Clancy Brown',
 'Bruce Willis, Bonnie Bedelia, William Atherton, Reginald VelJohnson',
 'Nicolas Cage, Diane Kruger, Justin Bartha, Sean Bean',
 'Tony Jaa, Johnny Nguyen, Nathan Jones, Petchtai Wongkamlao',
 'Matilda Lutz, Kevin Janssens, Vincent Colombe, Guillaume Bouchède',
 'Carlos Gallardo, Consuelo Gómez, Reinol Martinez, Peter Marquardt',
 'Feng Hsu, Chun Shih, Pai Ying, Roy Chiao',
 'Tom Cruise, Kelly McGillis, Anthony Edwards, Val Kilmer',
 'Nicolas Cage, John Cusack, John Malkovich, Steve Buscemi',


## Nested list comprehension
>[the_desired_list_item for item in sequence]
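The pattern can be illustrated on plain Python lists before applying it to the parsed tags. The names below are a small hand-written sample, not scraped data.

```python
casts = [
    ["Gregory Hines", "Billy Crystal"],
    ["Christian Bale", "Emily Watson"],
]

# Explicit loop
joined_loop = []
for names in casts:
    joined_loop.append(", ".join(names))

# Same result as a comprehension: the expression comes first, the loop second
joined_comp = [", ".join(names) for names in casts]

# Nested: the inner comprehension transforms each name before joining
upper = [", ".join([name.upper() for name in names]) for names in casts]

print(joined_comp)
print(upper)
```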

In [39]:
cast = [", ".join([link.string for link in c.find_all('a')]) for c in cast_info]
cast

['Gregory Hines, Billy Crystal, Jimmy Smits, Steven Bauer',
 'Christian Bale, Emily Watson, Taye Diggs, Angus Macfadyen',
 'Jet Li, Tony Leung Chiu Wai, Maggie Cheung Man-yuk, Donnie Yen',
 'Patrick Swayze, Kelly Lynch, Sam Elliott, Ben Gazzara',
 'Denzel Washington, Chris Pine, Rosario Dawson, Kevin Dunn',
 'Richard Roundtree, Moses Gunn, Christopher St. John, Charles Cioffi',
 'Kim Ok-bin, Shin Ha-kyun, Sung Joon, Kim Seo-hyung',
 'Christopher Lambert, Sean Connery, Roxanne Hart, Clancy Brown',
 'Bruce Willis, Bonnie Bedelia, William Atherton, Reginald VelJohnson',
 'Nicolas Cage, Diane Kruger, Justin Bartha, Sean Bean',
 'Tony Jaa, Johnny Nguyen, Nathan Jones, Petchtai Wongkamlao',
 'Matilda Lutz, Kevin Janssens, Vincent Colombe, Guillaume Bouchède',
 'Carlos Gallardo, Consuelo Gómez, Reinol Martinez, Peter Marquardt',
 'Feng Hsu, Chun Shih, Pai Ying, Roy Chiao',
 'Tom Cruise, Kelly McGillis, Anthony Edwards, Val Kilmer',
 'Nicolas Cage, John Cusack, John Malkovich, Steve Buscemi',


## Adjusted score

In [40]:
# The adjusted scores can be found in a div with class 'info countdown-adjusted-score'
adj_scores = [div.find("div", {"class": "info countdown-adjusted-score"}) for div in divs]
adj_scores

[<div class="info countdown-adjusted-score"><span class="descriptor">Adjusted Score: </span>58275% <span class="glyphicon glyphicon-question-sign" data-html="true" data-original-title="The Adjusted Score comes from a weighted formula (Bayesian) that we use that accounts for variation in the number of reviews per movie." data-placement="top" data-toggle="tooltip" rel="tooltip" title=""></span></div>,
 <div class="info countdown-adjusted-score"><span class="descriptor">Adjusted Score: </span>42446% <span class="glyphicon glyphicon-question-sign" data-html="true" data-original-title="The Adjusted Score comes from a weighted formula (Bayesian) that we use that accounts for variation in the number of reviews per movie." data-placement="top" data-toggle="tooltip" rel="tooltip" title=""></span></div>,
 <div class="info countdown-adjusted-score"><span class="descriptor">Adjusted Score: </span>101752% <span class="glyphicon glyphicon-question-sign" data-html="true" data-original-title="The Adju

In [41]:
# Inspecting an element
adj_scores[0]

<div class="info countdown-adjusted-score"><span class="descriptor">Adjusted Score: </span>58275% <span class="glyphicon glyphicon-question-sign" data-html="true" data-original-title="The Adjusted Score comes from a weighted formula (Bayesian) that we use that accounts for variation in the number of reviews per movie." data-placement="top" data-toggle="tooltip" rel="tooltip" title=""></span></div>

In [42]:
# By inspection we see that the string we are looking for is the second child of the 'div' tag
adj_scores[0].contents[1]  # Note the extra whitespace at the end

'58275% '
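To see why index 1 is the right child, it can help to list every entry of `.contents` with its type. The sketch below does this on a shortened copy of the element above (the tooltip attributes are omitted for brevity).

```python
from bs4 import BeautifulSoup

snippet = (
    '<div class="info countdown-adjusted-score">'
    '<span class="descriptor">Adjusted Score: </span>58275% '
    '<span class="glyphicon glyphicon-question-sign"></span></div>'
)
score_div = BeautifulSoup(snippet, "html.parser").find("div")

# Each child is either a Tag or a NavigableString (a bare piece of text)
for index, child in enumerate(score_div.contents):
    print(index, type(child).__name__, repr(str(child))[:50])

raw = score_div.contents[1]
clean = raw.strip("% ")  # drop the percent sign and the trailing space
print(clean)
```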

In [43]:
# Extracting the string (without '%' sign and extra space)
adj_scores_clean = [score.contents[1].strip('% ') for score in adj_scores]
adj_scores_clean

['58275',
 '42446',
 '101752',
 '43364',
 '93193',
 '91975',
 '90806',
 '73006',
 '72543',
 '51390',
 '55304',
 '100653',
 '96227',
 '98322',
 '63496',
 '59930',
 '72198',
 '65187',
 '67581',
 '63769',
 '67516',
 '117263',
 '80407',
 '40194',
 '62969',
 '45944',
 '64804',
 '68241',
 '70957',
 '73735',
 '64540',
 '79239',
 '102565',
 '77466',
 '80795',
 '71587',
 '101570',
 '79343',
 '101273',
 '71452',
 '88842',
 '71312',
 '105982',
 '99734',
 '90783',
 '70064',
 '91978',
 '72955',
 '70197',
 '95435',
 '65039',
 '71261',
 '74146',
 '60899',
 '59820',
 '93742',
 '76659',
 '84812',
 '74080',
 '79716',
 '105555',
 '81427',
 '84488',
 '87031',
 '88856',
 '94784',
 '93294',
 '91474',
 '88119',
 '103350',
 '99672',
 '67078',
 '93078',
 '97467',
 '92509',
 '104565',
 '91105',
 '125095',
 '104436',
 '99061',
 '127852',
 '83345',
 '102344',
 '86705',
 '105926',
 '92441',
 '91000',
 '98720',
 '91554',
 '104281',
 '92690',
 '102631',
 '128648',
 '101309',
 '83494',
 '97583',
 '127823',
 '85888',


In [44]:
# Converting the strings to numbers
final_adj = [float(score) for score in adj_scores_clean] # Note that this time the scores are float, not int!
final_adj

[58275.0,
 42446.0,
 101752.0,
 43364.0,
 93193.0,
 91975.0,
 90806.0,
 73006.0,
 72543.0,
 51390.0,
 55304.0,
 100653.0,
 96227.0,
 98322.0,
 63496.0,
 59930.0,
 72198.0,
 65187.0,
 67581.0,
 63769.0,
 67516.0,
 117263.0,
 80407.0,
 40194.0,
 62969.0,
 45944.0,
 64804.0,
 68241.0,
 70957.0,
 73735.0,
 64540.0,
 79239.0,
 102565.0,
 77466.0,
 80795.0,
 71587.0,
 101570.0,
 79343.0,
 101273.0,
 71452.0,
 88842.0,
 71312.0,
 105982.0,
 99734.0,
 90783.0,
 70064.0,
 91978.0,
 72955.0,
 70197.0,
 95435.0,
 65039.0,
 71261.0,
 74146.0,
 60899.0,
 59820.0,
 93742.0,
 76659.0,
 84812.0,
 74080.0,
 79716.0,
 105555.0,
 81427.0,
 84488.0,
 87031.0,
 88856.0,
 94784.0,
 93294.0,
 91474.0,
 88119.0,
 103350.0,
 99672.0,
 67078.0,
 93078.0,
 97467.0,
 92509.0,
 104565.0,
 91105.0,
 125095.0,
 104436.0,
 99061.0,
 127852.0,
 83345.0,
 102344.0,
 86705.0,
 105926.0,
 92441.0,
 91000.0,
 98720.0,
 91554.0,
 104281.0,
 92690.0,
 102631.0,
 128648.0,
 101309.0,
 83494.0,
 97583.0,
 127823.0,
 85888.0,


## Synopsis
> In screenwriting, a movie synopsis is a brief summary of a completed screenplay's core concept, major plot points, and main character arcs.

In [45]:
# The synopsis is located inside a 'div' tag with the class 'info synopsis'
synopsis = [div.find('div', class_='synopsis') for div in divs]
synopsis

[<div class="info synopsis"><span class="descriptor">Synopsis:</span> Ray and Danny (Gregory Hines, Billy Crystal) are two Chicago police detectives hot on the trail of drug kingpin Julio...<a class="" data-pageheader="" href="https://www.rottentomatoes.com/m/1018009-running_scared" target="_top"> [More]</a></div>,
 <div class="info synopsis"><span class="descriptor">Synopsis:</span> In a futuristic world, a regime has eliminated war by suppressing emotions: books, art and music are strictly forbidden and...<a class="" data-pageheader="" href="https://www.rottentomatoes.com/m/equilibrium" target="_top"> [More]</a></div>,
 <div class="info synopsis"><span class="descriptor">Synopsis:</span> In this visually arresting martial arts epic set in ancient China, an unnamed fighter (Jet Li) is being honored for...<a class="" data-pageheader="" href="https://www.rottentomatoes.com/m/hero" target="_top"> [More]</a></div>,
 <div class="info synopsis"><span class="descriptor">Synopsis:</span> The 

In [46]:
# Inspecting the element
synopsis[0]

<div class="info synopsis"><span class="descriptor">Synopsis:</span> Ray and Danny (Gregory Hines, Billy Crystal) are two Chicago police detectives hot on the trail of drug kingpin Julio...<a class="" data-pageheader="" href="https://www.rottentomatoes.com/m/1018009-running_scared" target="_top"> [More]</a></div>

In [47]:
# The synopsis text is the second child of the div (index 1),
# between the 'Synopsis:' span and the '[More]' link
synopsis[0].contents[1]

' Ray and Danny (Gregory Hines, Billy Crystal) are two Chicago police detectives hot on the trail of drug kingpin Julio...'

In [48]:
# Extracting the text
synopsis_text = [syn.contents[1] for syn in synopsis]
synopsis_text

[' Ray and Danny (Gregory Hines, Billy Crystal) are two Chicago police detectives hot on the trail of drug kingpin Julio...',
 ' In a futuristic world, a regime has eliminated war by suppressing emotions: books, art and music are strictly forbidden and...',
 ' In this visually arresting martial arts epic set in ancient China, an unnamed fighter (Jet Li) is being honored for...',
 ' The Double Deuce is the meanest, loudest and rowdiest bar south of the Mason-Dixon Line, and Dalton (Patrick Swayze) has...',
 ' When a massive, unmanned locomotive roars out of control, the threat is more ominous than just a derailment. The train...',
 ' John Shaft (Richard Roundtree) is the ultimate in suave black detectives. He first finds himself up against Bumpy (Moses Gunn),...',
 ' Honed from childhood to be an elite assassin, Sook-hee embarks on a rampage of violence and revenge to finally earn...',
 ' When the mystical Russell Nash (Christopher Lambert) kills a man in a sword fight in a New York Cit
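
Indexing `.contents[1]` assumes every synopsis `div` has the same layout, and it keeps the leading space visible in the output above. A slightly more defensive variant might look like the sketch below. The helper name `extract_synopsis`, the `movie` class on the outer `div`, and the sample HTML are illustrative assumptions, not part of the original notebook. The built-in `'html.parser'` is used here only so that the sketch needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the structure seen in the output above;
# the second block has no synopsis at all
sample_html = """
<div class="movie"><div class="info synopsis"><span class="descriptor">Synopsis:</span> A short plot summary...<a href="#"> [More]</a></div></div>
<div class="movie"></div>
"""

def extract_synopsis(div):
    """Return the synopsis text of one movie block, or None if it is missing."""
    syn = div.find('div', class_='synopsis')
    if syn is None or len(syn.contents) < 2:
        return None
    # The text node sits between the 'Synopsis:' span and the '[More]' link
    return str(syn.contents[1]).strip()

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_divs = sample_soup.find_all('div', class_='movie')
sample_text = [extract_synopsis(div) for div in sample_divs]
print(sample_text)
```

Returning `None` for a missing synopsis keeps the list the same length as the other columns, which matters once everything is combined into one table.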

## Representing the data in a structured form

In [49]:
import pandas as pd

In [50]:
movies_info = pd.DataFrame()

movies_info['Movie Title'] = movie_names 
movies_info['Year'] = years
movies_info['Score'] = movie_scores
movies_info['Adjusted Score'] = final_adj
movies_info['Director'] = final_directories
movies_info['Synopsis'] = synopsis_text
movies_info['Cast'] = cast
movies_info['Consensus'] = consensus_text

movies_info

Unnamed: 0,Movie Title,Year,Score,Adjusted Score,Director,Synopsis,Cast,Consensus
0,Running Scared,1986,57,58275.0,Peter Hyams,"Ray and Danny (Gregory Hines, Billy Crystal) ...","Gregory Hines, Billy Crystal, Jimmy Smits, Ste...",Running Scared struggles to strike a consisten...
1,Equilibrium,2002,41,42446.0,Kurt Wimmer,"In a futuristic world, a regime has eliminate...","Christian Bale, Emily Watson, Taye Diggs, Angu...",Equilibrium is a reheated mishmash of other sc...
2,Hero,2002,94,101752.0,Zhang Yimou,In this visually arresting martial arts epic ...,"Jet Li, Tony Leung Chiu Wai, Maggie Cheung Man...",With death-defying action sequences and epic h...
3,Road House,1989,40,43364.0,Rowdy Herrington,"The Double Deuce is the meanest, loudest and ...","Patrick Swayze, Kelly Lynch, Sam Elliott, Ben ...",Whether Road House is simply bad or so bad it'...
4,Unstoppable,2010,87,93193.0,Tony Scott,"When a massive, unmanned locomotive roars out...","Denzel Washington, Chris Pine, Rosario Dawson,...","As fast, loud, and relentless as the train at ..."
...,...,...,...,...,...,...,...,...
135,Hard-Boiled,1992,94,96450.0,John Woo,A cop who loses his partner in a shoot-out wi...,"Chow Yun-Fat, Bowie Lam, Philip Chan, Tony Leu...",Boasting impactful action as well as surprisin...
136,The Matrix,1999,88,94001.0,Andy Wachowski,Neo (Keanu Reeves) believes that Morpheus (La...,"Keanu Reeves, Laurence Fishburne, Carrie-Anne ...","Thanks to the Wachowskis' imaginative vision, ..."
137,Terminator 2: Judgment Day,1991,93,98466.0,James Cameron,"In this sequel set eleven years after ""The Te...","Arnold Schwarzenegger, Linda Hamilton, Edward ...",T2 features thrilling action sequences and eye...
138,Die Hard,1988,94,99190.0,John McTiernan,New York City policeman John McClane (Bruce W...,"Bruce Willis, Alan Rickman, Bonnie Bedelia, Re...",Its many imitators (and sequels) have never co...


In [51]:
pd.set_option('display.max_colwidth', None)
movies_info

Unnamed: 0,Movie Title,Year,Score,Adjusted Score,Director,Synopsis,Cast,Consensus
0,Running Scared,1986,57,58275.0,Peter Hyams,"Ray and Danny (Gregory Hines, Billy Crystal) are two Chicago police detectives hot on the trail of drug kingpin Julio...","Gregory Hines, Billy Crystal, Jimmy Smits, Steven Bauer","Running Scared struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining."
1,Equilibrium,2002,41,42446.0,Kurt Wimmer,"In a futuristic world, a regime has eliminated war by suppressing emotions: books, art and music are strictly forbidden and...","Christian Bale, Emily Watson, Taye Diggs, Angus Macfadyen",Equilibrium is a reheated mishmash of other sci-fi movies.
2,Hero,2002,94,101752.0,Zhang Yimou,"In this visually arresting martial arts epic set in ancient China, an unnamed fighter (Jet Li) is being honored for...","Jet Li, Tony Leung Chiu Wai, Maggie Cheung Man-yuk, Donnie Yen","With death-defying action sequences and epic historic sweep, Hero offers everything a martial arts fan could ask for."
3,Road House,1989,40,43364.0,Rowdy Herrington,"The Double Deuce is the meanest, loudest and rowdiest bar south of the Mason-Dixon Line, and Dalton (Patrick Swayze) has...","Patrick Swayze, Kelly Lynch, Sam Elliott, Ben Gazzara",Whether Road House is simply bad or so bad it's good depends largely on the audience's fondness for Swayze -- and tolerance for violently cheesy action.
4,Unstoppable,2010,87,93193.0,Tony Scott,"When a massive, unmanned locomotive roars out of control, the threat is more ominous than just a derailment. The train...","Denzel Washington, Chris Pine, Rosario Dawson, Kevin Dunn","As fast, loud, and relentless as the train at the center of the story, Unstoppable is perfect popcorn entertainment -- and director Tony Scott's best movie in years."
...,...,...,...,...,...,...,...,...
135,Hard-Boiled,1992,94,96450.0,John Woo,A cop who loses his partner in a shoot-out with gun smugglers goes on a mission to catch them. In...,"Chow Yun-Fat, Bowie Lam, Philip Chan, Tony Leung Chiu Wai","Boasting impactful action as well as surprising emotional resonance, Hard Boiled is a powerful thriller that hits hard in more ways than one."
136,The Matrix,1999,88,94001.0,Andy Wachowski,"Neo (Keanu Reeves) believes that Morpheus (Laurence Fishburne), an elusive figure considered to be the most dangerous man alive, can...","Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving","Thanks to the Wachowskis' imaginative vision, The Matrix is a smartly crafted combination of spectacular action and groundbreaking special effects."
137,Terminator 2: Judgment Day,1991,93,98466.0,James Cameron,"In this sequel set eleven years after ""The Terminator,"" young John Connor (Edward Furlong), the key to civilization's victory over...","Arnold Schwarzenegger, Linda Hamilton, Edward Furlong, Robert Patrick","T2 features thrilling action sequences and eye-popping visual effects, but what takes this sci-fi/ action landmark to the next level is the depth of the human (and cyborg) characters."
138,Die Hard,1988,94,99190.0,John McTiernan,New York City policeman John McClane (Bruce Willis) is visiting his estranged wife (Bonnie Bedelia) and two daughters on Christmas...,"Bruce Willis, Alan Rickman, Bonnie Bedelia, Reginald VelJohnson",Its many imitators (and sequels) have never come close to matching the taut thrills of the definitive holiday action classic.


- The data is now stored in tabular form, so data analysis can be carried out on it.
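
As a rough illustration of the kind of analysis this enables, the sketch below builds a small frame from a dictionary and sorts it by score. The variable names (`sample_titles`, `sample_df`, `top_rated`) are hypothetical; the three rows are taken from the table above. Building from a dictionary is an alternative to the column-by-column assignment used earlier, not a correction of it.

```python
import pandas as pd

# Hypothetical stand-ins for the lists built earlier in the notebook
sample_titles = ['Running Scared', 'Equilibrium', 'Hero']
sample_years = [1986, 2002, 2002]
sample_scores = [57, 41, 94]

# Building from a dict keeps all column definitions in one place
sample_df = pd.DataFrame({
    'Movie Title': sample_titles,
    'Year': sample_years,
    'Score': sample_scores,
})

# Example analysis: highest-scoring titles first
top_rated = sample_df.sort_values('Score', ascending=False).reset_index(drop=True)
print(top_rated)
```

The same `sort_values` call should work on `movies_info` itself, assuming the `Score` column holds numbers rather than strings.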

## Exporting the data to CSV and Excel files

In [52]:
# index=False leaves out the row index; header=True writes the column names
movies_info.to_csv("movies_info.csv", index=False, header=True)

In [53]:
# Writing .xlsx files needs an Excel engine such as openpyxl to be installed
movies_info.to_excel("movies_info.xlsx", index=False, header=True)
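
One way to confirm that an export worked is to read the file back and compare it with the original. The sketch below does this for the CSV case only, using a temporary directory and a small hypothetical frame (`sample_df`) in place of `movies_info`; the two rows are taken from the table above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small frame standing in for movies_info
sample_df = pd.DataFrame({
    'Movie Title': ['Hero', 'Die Hard'],
    'Year': [2002, 1988],
    'Score': [94, 94],
})

# Write to a temporary directory so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'movies_info.csv')
    sample_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Compare column names and values after the round trip
same_columns = list(reloaded.columns) == list(sample_df.columns)
same_titles = reloaded['Movie Title'].tolist() == sample_df['Movie Title'].tolist()
print(same_columns, same_titles)
```

Because the file was written with `index=False`, the reloaded frame has no extra `Unnamed: 0` column.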