## Scraping Rotten Tomatoes Website with BeautifulSoup

In this notebook, we use the Python libraries `requests` and BeautifulSoup to scrape information about 140 essential action movies from the Rotten Tomatoes editorial guide shown below.

<img src='webpage_ss.png' width=600 height=60 />

This notebook is based on the Udemy course "Web Scraping and API Fundamentals in Python": https://www.udemy.com/course/web-scraping-and-api-fundamentals-in-python/

#### Import packages

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
base_url = "https://editorial.rottentomatoes.com/guide/140-essential-action-movies-to-watch-now/"

response = requests.get(base_url)
response

<Response [200]>
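
A `200` status means the request succeeded, but a script that runs unattended may want to fail loudly on any other status. The following is a minimal sketch, not part of the original notebook: `fetch_html` is a hypothetical helper, the `User-Agent` string is a placeholder assumption, and the `session` parameter exists only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    # Use the caller's session if one is given, otherwise create a new one.
    session = session or requests.Session()
    # Placeholder User-Agent (an assumption, not a documented site requirement).
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-notebook)"}
    response = session.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.content
```

With this helper, `html = fetch_html(base_url)` would stand in for the separate `requests.get` and `response.content` steps.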

In [3]:
html = response.content

#### Parse with BeautifulSoup

In [4]:
soup = BeautifulSoup(html, 'lxml')

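# Save the prettified HTML to disk so the page structure can be inspected offline
with open("rotten_tomatoes_page_2_LXML_Parser.html","wb") as file:
    file.write(soup.prettify('utf-8'))

# Each movie entry sits in a div whose class attribute is exactly this string
divs = soup.find_all('div',{'class':"col-sm-18 col-full-xs countdown-item-content"})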
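
Passing a multi-class string to `find_all` matches only elements whose `class` attribute equals that exact string, so the search breaks if the site reorders or changes the layout classes. The sketch below runs on an inline sample rather than the live page and compares it with a CSS selector on the single class `countdown-item-content`. The second sample `div` and its movie title are hypothetical.

```python
from bs4 import BeautifulSoup

# Inline sample: the second div carries the same key class in a different order.
sample_html = """
<div class="col-sm-18 col-full-xs countdown-item-content"><h2><a href="#">Running Scared</a></h2></div>
<div class="countdown-item-content col-sm-18"><h2><a href="#">Hypothetical Movie</a></h2></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string match on the whole class attribute.
exact = sample_soup.find_all(
    "div", {"class": "col-sm-18 col-full-xs countdown-item-content"}
)

# CSS selector: matches any div that has this one class, whatever else it has.
by_class = sample_soup.select("div.countdown-item-content")

print(len(exact), len(by_class))
```

Here the exact-string search finds only the first `div`, while the selector finds both.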

#### Extract title, year, and score

In [5]:
headings = [div.find("h2") for div in divs]
headings[0]

<h2><a href="https://www.rottentomatoes.com/m/1018009-running_scared">Running Scared</a> <span class="subtle start-year">(1986)</span> <span class="icon tiny fresh" title="Fresh"></span> <span class="tMeterScore">63%</span></h2>

In [6]:
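# The <a> tag inside each heading holds the movie title
movie_names = [heading.find('a').string for heading in headings]

# The year is rendered as "(1986)": drop the parentheses, then convert to int
years = [heading.find('span',class_='start-year').string for heading in headings]
years = [year.strip('()') for year in years]
years = [int(year) for year in years]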

In [7]:
# The score is rendered as "63%": drop the percent sign, then convert to int
scores = [heading.find('span',class_='tMeterScore').string for heading in headings]
scores = [s.strip('%') for s in scores]
scores = [int(s) for s in scores]
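
The list comprehensions above assume every heading contains both a year and a score. If a `span` were missing, `find` would return `None` and `.string` would raise an `AttributeError`. A more defensive variant might look like the sketch below. `first_int` is a hypothetical helper, and the second sample heading (a movie without a score) is invented for illustration.

```python
import re

from bs4 import BeautifulSoup


def first_int(tag):
    """Return the first integer found in a tag's text, or None if unavailable."""
    if tag is None:
        return None
    match = re.search(r"\d+", tag.get_text())
    return int(match.group()) if match else None


sample_html = """
<h2><a href="#">Running Scared</a> <span class="subtle start-year">(1986)</span> <span class="tMeterScore">63%</span></h2>
<h2><a href="#">Hypothetical Unrated Movie</a> <span class="subtle start-year">(2001)</span></h2>
"""
sample_headings = BeautifulSoup(sample_html, "html.parser").find_all("h2")

# Missing spans yield None instead of raising.
sample_years = [first_int(h.find("span", class_="start-year")) for h in sample_headings]
sample_scores = [first_int(h.find("span", class_="tMeterScore")) for h in sample_headings]
```

Missing values come back as `None`, which pandas can later treat as missing data instead of the scrape aborting partway through.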

#### Extract critics consensus

In [10]:
consensus = [div.find('div',{'class':'info critics-consensus'}) for div in divs]

In [11]:
consensus[0]

<div class="info critics-consensus"><span class="descriptor">Critics Consensus:</span> <em>Running Scared</em> struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining.</div>

In [12]:
# Every consensus starts with the same label, so slice it off by its length
common_phrase = "Critics Consensus: "
common_len = len(common_phrase)
consensus_txt = [con.text[common_len:] for con in consensus]
consensus_txt[0]

'Running Scared struggles to strike a consistent balance between violent action and humor, but the chemistry between its well-matched leads keeps things entertaining.'
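
Slicing by a fixed length works as long as every entry starts with exactly the same label. As a hedged alternative, the sketch below strips the label only when it is actually present. `consensus_text` is a hypothetical helper, and the sample sentence is shortened from the output above.

```python
from bs4 import BeautifulSoup

sample_html = (
    '<div class="info critics-consensus">'
    '<span class="descriptor">Critics Consensus:</span> '
    "<em>Running Scared</em> struggles to strike a consistent balance."
    "</div>"
)
sample_div = BeautifulSoup(sample_html, "html.parser").find(
    "div", class_="critics-consensus"
)


def consensus_text(div, label="Critics Consensus:"):
    """Return the div's text without the leading label, if the label is there."""
    text = div.get_text().strip()
    if text.startswith(label):
        text = text[len(label):]
    return text.strip()


sample_consensus = consensus_text(sample_div)
```

Text that does not begin with the label is returned unchanged rather than losing its first characters.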

### Saving data as a pandas DataFrame

In [13]:
import pandas as pd

In [14]:
movies_info = pd.DataFrame()

movies_info['Movie Name'] = movie_names
movies_info['Year'] = years
movies_info['Score'] = scores
movies_info['Consensus Critic'] = consensus_txt

movies_info

| | Movie Name | Year | Score | Consensus Critic |
|---|---|---|---|---|
| 0 | Running Scared | 1986 | 63 | Running Scared struggles to strike a consisten... |
| 1 | Equilibrium | 2002 | 40 | Equilibrium is a reheated mishmash of other sc... |
| 2 | Hero | 2002 | 94 | With death-defying action sequences and epic h... |
| 3 | Road House | 1989 | 41 | Whether Road House is simply bad or so bad it'... |
| 4 | Unstoppable | 2010 | 87 | As fast, loud, and relentless as the train at ... |
| ... | ... | ... | ... | ... |
| 135 | Hard-Boiled | 1992 | 92 | Boasting impactful action as well as surprisin... |
| 136 | The Matrix | 1999 | 83 | Thanks to the Wachowskis' imaginative vision, ... |
| 137 | Terminator 2: Judgment Day | 1991 | 91 | T2 features thrilling action sequences and eye... |
| 138 | Die Hard | 1988 | 94 | Its many imitators (and sequels) have never co... |
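
To keep the scraped table beyond the notebook session, it can be written to CSV. The sketch below is an assumption about how that might be done, not a step from the original notebook. It uses a two-row sample and an in-memory buffer in place of a file on disk. Passing `index=False` keeps the row index from being written as an extra unnamed column.

```python
import io

import pandas as pd

# Two-row sample standing in for the full movies_info table.
sample_info = pd.DataFrame({
    "Movie Name": ["Running Scared", "Equilibrium"],
    "Year": [1986, 2002],
    "Score": [63, 40],
})

# Write to an in-memory buffer; a file path would work the same way.
buffer = io.StringIO()
sample_info.to_csv(buffer, index=False)

# Read the CSV back to confirm the columns round-trip unchanged.
buffer.seek(0)
restored = pd.read_csv(buffer)
```

For the real table, `movies_info.to_csv("movies_info.csv", index=False)` would follow the same pattern, with the filename being a placeholder.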
