# Intro to web scraping

The first step in web scraping is to choose a website and download its HTML code.

Real HTML from websites tends to be long and a bit too chaotic for a total beginner. Here we start with a dummy HTML document and learn the basics of extracting information with Beautiful Soup.

In [4]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</html>
"""

In [5]:
html_doc

'\n<!DOCTYPE html>\n<html><head><title>The Dormouse\'s story</title></head>\n<body>\n<p class="title"><b>The Dormouse\'s story</b></p>\n\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,\n<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and\n<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n\n<p class="story">...</p>\n</html>\n'

In [6]:
!pip install --upgrade beautifulsoup4



In [8]:
from bs4 import BeautifulSoup

#### "creating the soup"

In [9]:
# parse the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

In [10]:
soup


<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [11]:
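# careful: without parentheses we only get a reference to the method (see the output below);
# call print(soup.prettify()) to get the indented HTML
soup.prettify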

<bound method Tag.prettify of 
<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
>
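The output above is the representation of the method object rather than formatted HTML, because `prettify` was referenced without parentheses. A minimal sketch of actually calling it follows; `sample_html`, `sample_soup` and `pretty` are made-up names for this illustration, not part of the tutorial's `html_doc`.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (hypothetical, used only for this illustration)
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a string with each tag on its own line, indented by nesting depth
pretty = sample_soup.prettify()
print(pretty)
```

Because `prettify()` returns a plain string, wrapping it in `print()` is what makes the line breaks and indentation visible.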

#### accessing single elements

In [12]:
soup.title

<title>The Dormouse's story</title>

In [13]:
soup.title.string

"The Dormouse's story"

In [15]:
soup.title.parent.name

'head'

In [16]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [19]:
# this method only retrieves the first element of the specified tag
soup.p

<p class="title"><b>The Dormouse's story</b></p>

#### finding all elements of a tag with find_all()

In [20]:
p_tags = soup.find_all("p")

In [21]:
p_tags

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [22]:
for p in p_tags:
    print("New paragraph____________________")
    print(p.get_text())

New paragraph____________________
The Dormouse's story
New paragraph____________________
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
New paragraph____________________
...
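`find_all()` also accepts filters. The sketch below, built on a small made-up document (`sample_html` is an assumption for illustration), shows the `class_` and `limit` arguments.

```python
from bs4 import BeautifulSoup

# Hypothetical mini-document used only for this illustration
sample_html = """
<p class="title">Heading</p>
<p class="story">First story</p>
<p class="story">Second story</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with a trailing underscore, because "class" is a Python keyword)
# keeps only the tags that carry the given CSS class
story_tags = sample_soup.find_all("p", class_="story")
story_texts = [tag.get_text() for tag in story_tags]
print(story_texts)

# limit stops the search after the given number of matches
first_only = sample_soup.find_all("p", limit=1)
print(first_only[0].get_text())
```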


#### Using CSS selectors

https://htmlcheatsheet.com/css/

In [23]:
# select all elements with class="title"
soup.select(".title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [24]:
# select all elements with class="sister"
soup.select(".sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [26]:
# select "all" elements with the id="link2"
soup.select("#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

However, `get_text()` can only be applied to a single element, while `select()` returns a list that may contain several. It is therefore common to iterate through the output of `select()`.

In [33]:
soup.select("p.story")[1]

<p class="story">...</p>

In [35]:
print(soup.select("p.story")[1].get_text())

...


In [36]:
for p in soup.select("p.story"):
    print(p.get_text())

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
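When the goal is to keep the extracted text rather than print it, the same iteration can be written as a list comprehension. This is a small sketch on a made-up snippet; `sample_html`, `names` and `links_by_name` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the "sister" links of the dummy document
sample_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# select() returns a list of tags, so we call get_text() on each one in turn
names = [a.get_text() for a in sample_soup.select("a.sister")]
print(names)

# the same pattern can pair each name with one of the tag's attributes
links_by_name = {a.get_text(): a["href"] for a in sample_soup.select("a.sister")}
print(links_by_name)
```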


### Your turn:

Write code to print the following contents (not including the html tags, only human-readable text): 

1. All the "fun facts". 

2. The names of all the places. 

3. The content (name and fact) of all the cities (only cities, not countries!) 

4. The names (not facts!) of all the cities (not countries!)

In [37]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [40]:
# Create the "soup"
soup = BeautifulSoup(geography,'html.parser')
soup


<!DOCTYPE html>

<html>
<head> Geography</head>
<body>
<div class="city">
<h2>London</h2>
<p>London is the most popular tourist destination in the world.</p>
</div>
<div class="city">
<h2>Paris</h2>
<p>Paris was originally a Roman City called Lutetia.</p>
</div>
<div class="country">
<h2>Spain</h2>
<p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>
</body>
</html>

In [41]:
# 1. All the "fun facts"
for par in soup.select("p"):
    print(par.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


In [42]:
# 2. The names of all the places.
for heading in soup.select("h2"):
    print(heading.get_text())

London
Paris
Spain


In [46]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)
# a notebook cell only displays its last expression, so only the <p> tags show up below
soup.select("div.city h2")

soup.select("div.city p")

[<p>London is the most popular tourist destination in the world.</p>,
 <p>Paris was originally a Roman City called Lutetia.</p>]

In [44]:
for elem in soup.select(".city"):
    print(elem.h2.get_text())
    print(elem.p.get_text())

London
London is the most popular tourist destination in the world.
Paris
Paris was originally a Roman City called Lutetia.


In [47]:
for city in soup.select(".city"):
    print(city.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



In [48]:
# 4. The names (not facts!) of all the cities (not countries!)
for elem in soup.select(".city"):
    print(elem.h2.get_text())

London
Paris
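An alternative for exercise 4 is to let the selector do the filtering: a space between two selectors means "descendant of", so `div.city h2` matches only the headings inside city blocks. The sketch below uses a shortened, made-up version of the geography document.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the geography document
sample_html = """
<div class="city"><h2>London</h2><p>Fact about London.</p></div>
<div class="country"><h2>Spain</h2><p>Fact about Spain.</p></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "div.city h2": every <h2> nested somewhere inside a <div class="city">
city_names = [h2.get_text() for h2 in sample_soup.select("div.city h2")]
print(city_names)
```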


## Use case: IMDb top charts

Let's go to https://www.imdb.com/chart/top, where we'll see the top 250 movies according to IMDb ratings.

Notice how each movie has the following elements:

- Title

- Release Year

- IMDb rating

- Director & main stars (they appear when you hover over the title)

Our objective is going to be to scrape this information and store it in a pandas dataframe.



In [49]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [53]:
# 2. find url and store it in a variable
url_imdb = "https://www.imdb.com/chart/top"

In [88]:
# 3. download html with a get request
# the Accept-Language header tells the server which language we prefer for the page content
headers = {'Accept-Language': 'es-ES'}
response = requests.get(url_imdb, headers = headers)
response.status_code # 200 status code means OK!

200
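Instead of reading the status code by eye, the request can be wrapped in a small helper that fails loudly. This is only a sketch: `fetch_html` and its default values are assumptions made for illustration, while `timeout` and `raise_for_status()` are standard parts of the `requests` library.

```python
import requests


def fetch_html(url, language="en-US", timeout=10):
    """Download a page and return its raw bytes, raising an error if the request failed."""
    headers = {"Accept-Language": language}
    # timeout prevents the call from hanging forever if the server does not answer
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.content


# Possible usage (needs network access, so it is left commented out here):
# html_bytes = fetch_html("https://www.imdb.com/chart/top")
```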

In [89]:
response.content

b'\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n\n        <meta charset="utf-8">\n\n    \n    \n    \n\n    \n    \n    \n\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>Top 250 Movies - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex == \'function\') {\n      uex("ld", "LoadTitle", {wb: 1});\n    }\n</script>\n\n        <link rel="canonical" href="https://www.im

In [82]:
response = requests.get(url_imdb)

In [87]:
response.headers

{'Content-Type': 'text/html;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Server', 'Date': 'Mon, 02 Jan 2023 12:00:34 GMT', 'x-amz-rid': '5F2RAXSWXXEJ7CWSW8Y1', 'Set-Cookie': 'uu=eyJpZCI6InV1NWI1N2E5OWIwZWE2NDRlYThiNzIiLCJwcmVmZXJlbmNlcyI6eyJmaW5kX2luY2x1ZGVfYWR1bHQiOmZhbHNlfX0=; Domain=.imdb.com; Expires=Sat, 20-Jan-2091 15:14:41 GMT; Path=/; Secure, session-id=000-0000000-0000000; Domain=.imdb.com; Expires=Sat, 20-Jan-2091 15:14:41 GMT; Path=/; Secure, session-id-time=2303380833; Domain=.imdb.com; Expires=Sat, 20-Jan-2091 15:14:41 GMT; Path=/; Secure', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Security-Policy': "frame-ancestors 'self' imdb.com *.imdb.com *.media-imdb.com withoutabox.com *.withoutabox.com amazon.com *.amazon.com amazon.co.uk *.amazon.co.uk amazon.de *.amazon.de translate.google.com images.google.com www.google.com www.google.co.uk search.aol.com bing.com www.bing.com", 'Content-Language': 'en-US', 'Strict-Transport-Security': '

In [55]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Top 250 Movies - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/chart/top" rel="canonical"/>
<meta content="http://www.imdb.com/chart/top" property="og:url">
<script>
    if (typeof uet == 'function') {
      uet("bb", "Load

In [60]:
soup.select("td.titleColumn") # all the info about all the movies

[<td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
 <span class="secondaryInfo">(1994)</span>
 </td>,
 <td class="titleColumn">
       2.
       <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
 <span class="secondaryInfo">(1972)</span>
 </td>,
 <td class="titleColumn">
       3.
       <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">The Dark Knight</a>
 <span class="secondaryInfo">(2008)</span>
 </td>,
 <td class="titleColumn">
       4.
       <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">The Godfather Part II</a>
 <span class="secondaryInfo">(1974)</span>
 </td>,
 <td class="titleColumn">
       5.
       <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 Angry Men</a>
 <span class="secondaryInfo">(

In [59]:
soup.select("td.titleColumn")[-1]

<td class="titleColumn">
      250.
      <a href="/title/tt0048021/" title="Jules Dassin (dir.), Jean Servais, Carl Möhner">Rififi</a>
<span class="secondaryInfo">(1955)</span>
</td>

In [62]:
soup.select("td.titleColumn a") # all elements containing movie titles

[<a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>,
 <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>,
 <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">The Dark Knight</a>,
 <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">The Godfather Part II</a>,
 <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 Angry Men</a>,
 <a href="/title/tt0108052/" title="Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes">Schindler's List</a>,
 <a href="/title/tt0167260/" title="Peter Jackson (dir.), Elijah Wood, Viggo Mortensen">The Lord of the Rings: The Return of the King</a>,
 <a href="/title/tt0110912/" title="Quentin Tarantino (dir.), John Travolta, Uma Thurman">Pulp Fiction</a>,
 <a href="/title/tt0120737/" title="Peter Jackson (dir.), Elijah Wood, Ian McKell

In [63]:
# we can use .get_text() to extract the content of the tags we selected
# we'll need to do it to each tag with a for loop: here we do it to the first one
soup.select("td.titleColumn a")[0].get_text()

'The Shawshank Redemption'

In [70]:
# the director and main stars are in the same tag, but as a value of the attribute "title"
# we can access attributes as key-value pairs of dictionaries: using ["key"] to get the value:
soup.select("td.titleColumn a")[0]["title"]

'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'

In [71]:
# instead of ["title"] we could use .get("title"): choose whatever you prefer

soup.select("td.titleColumn a")[0].get("title")

'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'
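The two notations differ when the attribute is missing: square brackets raise a `KeyError`, while `.get()` returns `None` (or a default of your choice). A small sketch on a made-up link without a `title` attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical link that has an href but no title attribute
sample_soup = BeautifulSoup('<a href="/title/tt0000000/">Some movie</a>', "html.parser")
link = sample_soup.select("a")[0]

# .get() returns None, or the fallback value we pass, when the attribute is absent
missing = link.get("title")
fallback = link.get("title", "unknown")
print(missing, fallback)

# square brackets raise a KeyError instead
try:
    link["title"]
    raised = False
except KeyError:
    raised = True
print(raised)
```

When scraping many elements where some may lack the attribute, `.get()` tends to be the safer choice.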

In [73]:
# the years are inside a 'span' tag with the 'secondaryInfo' class
# we also specify the parent tag and its class, which is the same we used before
# the years are inside parentheses, but we'll take care of that later
soup.select("td.titleColumn span.secondaryInfo")[0].get_text()

'(1994)'

In [74]:
soup.select("td.imdbRating strong")[0].get_text()

'9.2'

### Storing information in lists

In [104]:
#initialize empty lists
title = []
dir_stars = []
year = []
rating = []

In [77]:
# define the number of iterations of our for loop 
# by checking how many elements are in the retrieved result set
# (this is equivalent to, but more robust than, hard-coding 250 iterations)
num_iter = len(soup.select("td.titleColumn a"))

In [78]:
num_iter

250

In [None]:
#table = soup.select("tbody > tr")

In [None]:
#table[2].select("td.titleColumn a")

In [105]:
# iterate through the result set and retrieve all the data
for i in range(num_iter):
    title.append(soup.select("td.titleColumn a")[i].get_text())
    dir_stars.append(soup.select("td.titleColumn a")[i]["title"])
    year.append(soup.select("td.titleColumn span.secondaryInfo")[i].get_text())
    rating.append(soup.select("td.imdbRating strong")[i].get_text())
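The loop above re-runs every `select()` on each iteration and relies on the four result sets lining up by index. A possible alternative, hinted at by the commented-out `tbody > tr` selector, is to go row by row so that each movie's fields are read from the same table row. The sketch below runs on a made-up two-row table that imitates the structure shown in the outputs above; `sample_html`, `movies` and the movie data are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table imitating the structure seen in the outputs above
sample_html = """
<table><tbody>
<tr>
  <td class="titleColumn">1.
    <a href="/title/a/" title="Director A (dir.), Star A1, Star A2">Movie A</a>
    <span class="secondaryInfo">(1994)</span></td>
  <td class="imdbRating"><strong>9.2</strong></td>
</tr>
<tr>
  <td class="titleColumn">2.
    <a href="/title/b/" title="Director B (dir.), Star B1, Star B2">Movie B</a>
    <span class="secondaryInfo">(1972)</span></td>
  <td class="imdbRating"><strong>9.1</strong></td>
</tr>
</tbody></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

movies = []
for row in sample_soup.select("tbody > tr"):
    # select_one() returns the first match inside this row (or None if there is none)
    link = row.select_one("td.titleColumn a")
    movies.append({
        "movie_name": link.get_text(),
        "director_stars": link["title"],
        "release_year": row.select_one("span.secondaryInfo").get_text(),
        "rating": row.select_one("td.imdbRating strong").get_text(),
    })

print(movies[0])
```

A list of dictionaries like `movies` can be passed directly to `pd.DataFrame(movies)`, and since the list is rebuilt from scratch every time the cell runs, re-running it does not pile up duplicate entries.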

In [90]:
title


['The Shawshank Redemption',
 'The Godfather',
 'The Dark Knight',
 'The Godfather Part II',
 '12 Angry Men',
 "Schindler's List",
 'The Lord of the Rings: The Return of the King',
 'Pulp Fiction',
 'The Lord of the Rings: The Fellowship of the Ring',
 'The Good, the Bad and the Ugly',
 'Forrest Gump',
 'Fight Club',
 'The Lord of the Rings: The Two Towers',
 'Inception',
 'Star Wars: Episode V - The Empire Strikes Back',
 'The Matrix',
 'Goodfellas',
 "One Flew Over the Cuckoo's Nest",
 'Se7en',
 'Seven Samurai',
 "It's a Wonderful Life",
 'The Silence of the Lambs',
 'City of God',
 'Saving Private Ryan',
 'Life Is Beautiful',
 'Interstellar',
 'The Green Mile',
 'Star Wars: Episode IV - A New Hope',
 'Terminator 2: Judgment Day',
 'Back to the Future',
 'Spirited Away',
 'Psycho',
 'The Pianist',
 'Parasite',
 'Léon: The Professional',
 'The Lion King',
 'Gladiator',
 'American History X',
 'The Departed',
 'The Usual Suspects',
 'The Prestige',
 'Whiplash',
 'Casablanca',
 'Harakir

In [91]:
dir_stars

['Frank Darabont (dir.), Tim Robbins, Morgan Freeman',
 'Francis Ford Coppola (dir.), Marlon Brando, Al Pacino',
 'Christopher Nolan (dir.), Christian Bale, Heath Ledger',
 'Francis Ford Coppola (dir.), Al Pacino, Robert De Niro',
 'Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb',
 'Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes',
 'Peter Jackson (dir.), Elijah Wood, Viggo Mortensen',
 'Quentin Tarantino (dir.), John Travolta, Uma Thurman',
 'Peter Jackson (dir.), Elijah Wood, Ian McKellen',
 'Sergio Leone (dir.), Clint Eastwood, Eli Wallach',
 'Robert Zemeckis (dir.), Tom Hanks, Robin Wright',
 'David Fincher (dir.), Brad Pitt, Edward Norton',
 'Peter Jackson (dir.), Elijah Wood, Ian McKellen',
 'Christopher Nolan (dir.), Leonardo DiCaprio, Joseph Gordon-Levitt',
 'Irvin Kershner (dir.), Mark Hamill, Harrison Ford',
 'Lana Wachowski (dir.), Keanu Reeves, Laurence Fishburne',
 'Martin Scorsese (dir.), Robert De Niro, Ray Liotta',
 'Milos Forman (dir.), Jack Nicholson, Louise Fletch

In [99]:
len(year)

315

In [93]:
rating

['9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.2',
 '9.0',
 '9.0',
 '9.0',
 '8.9',
 '8.9',
 '8.8',
 '8.8',
 '8.8',
 '8.8',
 '8.7',
 '8.7',
 '8.7',
 '8.7',
 '8.7',
 '8.7',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',


### Storing information in pandas DataFrames

If building the DataFrame raises an error, check that all the lists have the same length, for example with `assert len(title) == len(dir_stars) == len(year) == len(rating)`.

In [98]:
len(dir_stars) == len(year)

False
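A `False` here means the lists are out of sync. One common cause is running the appending loop more than once without resetting the lists to empty first. The sketch below, on made-up data, shows how to compare the lengths and what pandas does when they differ; `columns`, `lengths` and `error_message` are illustrative names.

```python
import pandas as pd

# Hypothetical lists: release_year has one stale extra entry
columns = {
    "movie_name": ["Movie A", "Movie B"],
    "release_year": ["(1994)", "(1972)", "(1994)"],
}

# comparing the lengths side by side shows which list is off
lengths = {name: len(values) for name, values in columns.items()}
print(lengths)

# pandas refuses to build a DataFrame from columns of different lengths
try:
    pd.DataFrame(columns)
    error_message = None
except ValueError as error:
    error_message = str(error)
print(error_message)
```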

In [None]:
import pandas as pd

In [101]:
movies_df = pd.DataFrame(
    {"movie_name": title,
     "director_stars": dir_stars,
     "release_year": year,
     "rating": rating
    }
)

In [102]:
movies_df.head()

| | movie_name | director_stars |
|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont (dir.), Tim Robbins, Morgan Fre... |
| 1 | The Godfather | Francis Ford Coppola (dir.), Marlon Brando, Al... |
| 2 | The Dark Knight | Christopher Nolan (dir.), Christian Bale, Heat... |
| 3 | The Godfather Part II | Francis Ford Coppola (dir.), Al Pacino, Robert... |
| 4 | 12 Angry Men | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb |


In [103]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 316 entries, 0 to 315
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   movie_name      316 non-null    object
 1   director_stars  316 non-null    object
dtypes: object(2)
memory usage: 5.1+ KB


#### Challenge: Cleaning the data

An inherent part of web scraping is data cleaning. We managed to get the information we needed, but for it to be useful, we still need some extra steps:

- Take the year out of the parentheses: a regex would work, but string methods such as `str.replace()` might be simpler. This column should also be converted to a numeric data type.

- Split dir_stars into 3 columns, one for each person: "director", "star_1", "star_2". This could have been done by filtering when extracting the data from the html document, but it looks easier afterwards:

    - The "(dir.)" pattern can be totally removed
    - We can split the string at each comma

In [None]:
# your code here
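One possible approach to the challenge, sketched on a made-up two-row DataFrame rather than the scraped data (`sample_df`, `people` and the movie values are assumptions for illustration). It assumes every `director_stars` entry contains exactly three people.

```python
import pandas as pd

# Hypothetical data with the same shape as the scraped columns
sample_df = pd.DataFrame({
    "movie_name": ["Movie A", "Movie B"],
    "director_stars": [
        "Director A (dir.), Star A1, Star A2",
        "Director B (dir.), Star B1, Star B2",
    ],
    "release_year": ["(1994)", "(1972)"],
})

# strip the surrounding parentheses, then convert the strings to integers
sample_df["release_year"] = sample_df["release_year"].str.strip("()").astype(int)

# remove the " (dir.)" marker, then split at each comma into separate columns
people = (
    sample_df["director_stars"]
    .str.replace(" (dir.)", "", regex=False)
    .str.split(", ", expand=True)
)
people.columns = ["director", "star_1", "star_2"]

# swap the combined column for the three new ones
sample_df = pd.concat([sample_df.drop(columns="director_stars"), people], axis=1)
print(sample_df)
```

Passing `regex=False` makes `str.replace()` treat `" (dir.)"` as literal text, so the parentheses and the dot do not need escaping.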