# Intro to web scraping

The first step of web scraping is to identify a website and download its HTML code.

Real HTML from websites tends to be long and too chaotic for a total beginner. Here we start with a dummy HTML document and learn the basics of extracting information with Beautiful Soup.

- You can learn about HTML here: https://www.w3schools.com/html/
- You can use Code Beautify to make your HTML more readable and clean: https://codebeautify.org/htmlviewer

In [2]:
html_doc = """ <!DOCTYPE html><html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></html>"""

In [3]:
html_doc

' <!DOCTYPE html><html><head><title>The Dormouse\'s story</title></head><body><p class="title"><b>The Dormouse\'s story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></html>'

In [4]:
from bs4 import BeautifulSoup

#### "creating the soup"

In [5]:
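# parse the HTML string into a navigable tree; naming the parser explicitly
# avoids a warning and keeps the result the same on every machine
soup = BeautifulSoup(html_doc, "html.parser")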

In [6]:
soup

<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></body></html>

In [7]:
import pprint

In [8]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



#### accessing single elements

We can access HTML tags by appending a dot `.` and the name of the tag to `soup`. If the document contains several instances of that tag, only the first one is returned.

In [9]:
soup.title

<title>The Dormouse's story</title>

In [10]:
soup.title.parent

<head><title>The Dormouse's story</title></head>

In [11]:
soup.html.body.p

<p class="title"><b>The Dormouse's story</b></p>

<b>searching using the find() function</b>

In [12]:
soup.find("p").get_text()

"The Dormouse's story"

In [13]:
# this method only retrieves the first element of the specified tag
soup.p

<p class="title"><b>The Dormouse's story</b></p>
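One detail worth knowing about `find()` and the dot shortcut: when no tag matches, they return `None` rather than raising an error. The sketch below uses a shortened copy of the dummy document and a tag (`h1`) that is assumed not to exist in it, to show one way of guarding against that.

```python
from bs4 import BeautifulSoup

html_doc = """<html><body><p class="title"><b>The Dormouse's story</b></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first match, just like the dot shortcut
first_p = soup.find("p")
same_p = soup.p

# when nothing matches, the result is None instead of an error
missing = soup.find("h1")

# guard before calling get_text(); calling it on None raises AttributeError
heading = missing.get_text() if missing is not None else "(no <h1> found)"
print(first_p.get_text())
print(heading)
```

Checking for `None` first is a small habit that saves a lot of confusing tracebacks once you move to real pages.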

#### finding all elements of a tag with the powerful find_all()

In [14]:
p_tags = soup.find_all("p")
p_tags

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

To get the text from the corresponding HTML code, we can use the `get_text()` method.

In [15]:
for p in p_tags:
    print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...
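In the output above the names are glued to the surrounding words ("wereElsie"), because `get_text()` simply concatenates the text pieces. A possible fix, sketched here on a shortened copy of the story paragraph, is to pass a separator and `strip=True`.

```python
from bs4 import BeautifulSoup

snippet = (
    '<p class="story">their names were'
    '<a href="http://example.com/elsie">Elsie</a>,'
    '<a href="http://example.com/lacie">Lacie</a> and'
    '<a href="http://example.com/tillie">Tillie</a>'
    ';and they lived at the bottom of a well.</p>'
)
soup = BeautifulSoup(snippet, "html.parser")
p = soup.find("p")

# default behaviour: text pieces are joined with nothing in between
plain = p.get_text()

# join the pieces with a space and strip whitespace around each piece
spaced = p.get_text(" ", strip=True)

print(plain)
print(spaced)
```

The separator is inserted between every piece, so punctuation also gets a space in front of it; whether that matters depends on what you do with the text afterwards.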


## Return the 3 names of the sisters

In [16]:
a_tags = soup.find_all('a')

In [17]:
for a in a_tags:
    print(a.get_text())

Elsie
Lacie
Tillie
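Besides the text, the attributes of a tag are often what we want, for example the link target in `href`. A minimal sketch, using only the three links of the dummy document, that collects name and URL with `Tag.get()`:

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# map each sister's name to the URL stored in her <a> tag
links = {}
for a in soup.find_all("a"):
    # .get() returns None if the attribute is missing; a["href"] would raise KeyError
    links[a.get_text()] = a.get("href")

print(links)
```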


## Using CSS selectors
Another way to find content is the `select()` method, which takes a CSS selector.

Let's first learn the syntax of CSS selectors by playing this game: https://flukeout.github.io/

Everyone should reach level 6!

https://www.w3schools.com/css/css_howto.asp

In [18]:
soup.select("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [19]:
for a in soup.select('a'):
    print(a.get_text())

Elsie
Lacie
Tillie


In [20]:
soup.select('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [21]:
soup.select("p")[0]

<p class="title"><b>The Dormouse's story</b></p>

Using CSS selectors, you can search directly by CSS class!

In [22]:
soup.select(".title")

[<p class="title"><b>The Dormouse's story</b></p>]

<b>comparing to find_all()</b>

In [23]:
soup.find_all("a", class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [24]:
soup.select("a.sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
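The two cells above return the same three tags. As a quick sanity check, this sketch (on a shortened copy of the dummy document) compares the `id` attributes found by each approach.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# keyword-argument style: tag name plus class_ (with underscore, since class is a Python keyword)
ids_find_all = [a["id"] for a in soup.find_all("a", class_="sister")]

# CSS-selector style: tag name and class joined with a dot
ids_select = [a["id"] for a in soup.select("a.sister")]

print(ids_find_all)
print(ids_select)
print(ids_find_all == ids_select)
```

Which style to use is mostly a matter of taste; CSS selectors tend to be shorter once the conditions get more nested.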

<b>You can search directly using id attributes</b>

In [25]:
soup.select("#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, can only be applied to a single element, while `select()` returns a list that may contain several elements. It's common to iterate over the output of `select()`.

In [26]:
print(soup.select(".story"))

[<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>, <p class="story">...</p>]


In [27]:
for p in soup.select("p.story"):
    print(p.get_text())

Once upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...
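When only one element is expected, `select_one()` can save the loop: it returns the first match as a single tag (or `None` if there is none). The sketch below, again on a shortened copy of the dummy document, shows it next to a list comprehension over `select()`; the id `link9` is made up to demonstrate the no-match case.

```python
from bs4 import BeautifulSoup

html_doc = """<body>
<p class="story">Once upon a time
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
<p class="story">...</p>
</body>"""
soup = BeautifulSoup(html_doc, "html.parser")

# select_one() gives back a single tag, so get_text() works directly
lacie = soup.select_one("#link2")
print(lacie.get_text())

# select() gives back a list, so we apply get_text() to each item
story_texts = [p.get_text(strip=True) for p in soup.select("p.story")]
print(story_texts)

# no element has this id, so the result is None
nothing = soup.select_one("#link9")
print(nothing)
```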




Write code to print the following contents (not including the html tags, only human-readable text): 

1. All the "fun facts". 

2. The names of all the places. 

3. The content (name and fact) of all the cities (only cities, not countries!) 

4. The names (not facts!) of all the cities (not countries!)

In [28]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [29]:
soup = BeautifulSoup(geography, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  Geography
 </head>
 <body>
  <div class="city">
   <h2>
    London
   </h2>
   <p>
    London is the most popular tourist destination in the world.
   </p>
  </div>
  <div class="city">
   <h2>
    Paris
   </h2>
   <p>
    Paris was originally a Roman City called Lutetia.
   </p>
  </div>
  <div class="country">
   <h2>
    Spain
   </h2>
   <p>
    Spain produces 43,8% of all the world's Olive Oil.
   </p>
  </div>
 </body>
</html>



In [30]:
#All the "fun facts".

fun_facts=soup.find_all("p")
for fact in fun_facts:
    print(fact.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


In [31]:
#The names of all the places.
names=soup.find_all("h2")
for name in names:
    print(name.get_text())

London
Paris
Spain


In [32]:
#The content (name and fact) of all the cities (only cities, not countries!)
cities=soup.find_all(class_="city")
for city in cities:
    print(city.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



In [33]:
#The names (not facts!) of all the cities (not countries!)
cities=soup.select("div.city")
for city in cities:
    print(city.find("h2").get_text())

London
Paris


## Use case: 





In [34]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [35]:
# 2. find url and store it in a variable
url = "https://www.timeout.com/film/best-movies-of-all-time"
# download the page and parse it, naming the parser explicitly
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [36]:
# check that the HTML code looks like it should (first 500 characters only)
print(soup.prettify()[:500])

In [37]:
films=soup.find_all("h3")
for film in films:
    print(film.get_text())
    


1. 2001: A Space Odyssey (1968)
2. The Godfather (1972)
3. Citizen Kane (1941)
4. Jeanne Dielman, 23, Quai du Commerce, 1080 Bruxelles (1975)
5. Raiders of the Lost Ark (1981)
6. La Dolce Vita (1960)
7. Seven Samurai (1954)
8. In the Mood for Love (2000)
9. There Will Be Blood (2007)
10. Singin’ in the Rain (1952)
11. Goodfellas (1990)
12. North by Northwest (1959)
13. Mulholland Drive (2001)
14. Bicycle Thieves (1948)
15. The Dark Knight (2008)
16. City Lights (1931)
17. Grand Illusion (1937)
18. His Girl Friday (1940)
19. The Red Shoes (1948)
20. Vertigo (1958)
21. Beau Travail (1999)
22. The Searchers (1956)
23. Persona (1966)
24. Do the Right Thing (1989)
25. Rashomon (1950)
26. The Rules of the Game (1939)
27. Jaws (1975)
28. Double Indemnity (1944)
29. The 400 Blows (1959)
30. Star Wars (1977)
31. The Passion of Joan of Arc (1928)
32. Once Upon a Time in the West (1968)
33. Alien (1979)
34. Tokyo Story (1951)
35. Pulp Fiction (1994)
36. The Truman Show (1998)
37. Lawrence of Arab

In [38]:
top100=[]
for film in films:
    top100.append(film.get_text())
    
top100

['1.\xa02001: A Space Odyssey (1968)',
 '2.\xa0The Godfather (1972)',
 '3.\xa0Citizen Kane (1941)',
 '4.\xa0Jeanne Dielman, 23, Quai du Commerce, 1080 Bruxelles (1975)',
 '5.\xa0Raiders of the Lost Ark (1981)',
 '6.\xa0La Dolce Vita (1960)',
 '7.\xa0Seven Samurai (1954)',
 '8.\xa0In the Mood for Love (2000)',
 '9.\xa0There Will Be Blood (2007)',
 '10.\xa0Singin’ in the Rain (1952)',
 '11.\xa0Goodfellas (1990)',
 '12.\xa0North by Northwest (1959)',
 '13.\xa0Mulholland Drive (2001)',
 '14.\xa0Bicycle Thieves (1948)',
 '15.\xa0The Dark Knight (2008)',
 '16.\xa0City Lights (1931)',
 '17.\xa0Grand Illusion (1937)',
 '18.\xa0His Girl Friday (1940)',
 '19.\xa0The Red Shoes (1948)',
 '20.\xa0Vertigo (1958)',
 '21.\xa0Beau Travail (1999)',
 '22.\xa0The Searchers (1956)',
 '23.\xa0Persona (1966)',
 '24.\xa0Do the Right Thing (1989)',
 '25.\xa0Rashomon (1950)',
 '26.\xa0The Rules of the Game (1939)',
 '27.\xa0Jaws (1975)',
 '28.\xa0Double Indemnity (1944)',
 '29.\xa0The 400 Blows (1959)',
 '30.\x

In [39]:
# the last <h3> on the page is a promotional heading, not a film, so drop it
top100.pop()

'Check out the best movies of all time as chosen by actors'
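The scraped strings still mix rank, title and year, and the rank is followed by a non-breaking space (`\xa0`). One possible way to split them is a regular expression. The sketch below works on a few strings copied from the output above; `parse_film` and the pattern are illustrative choices, not the only way to do this.

```python
import re

# a few entries copied from the scraped list, plus the non-film heading
raw_titles = [
    "1.\xa02001: A Space Odyssey (1968)",
    "2.\xa0The Godfather (1972)",
    "4.\xa0Jeanne Dielman, 23, Quai du Commerce, 1080 Bruxelles (1975)",
    "Check out the best movies of all time as chosen by actors",
]

# rank, a dot, optional whitespace (\s also matches \xa0), title, year in parentheses
pattern = re.compile(r"^(\d+)\.\s*(.+?)\s*\((\d{4})\)$")

def parse_film(entry):
    """Return rank, title and year as a dict, or None if the entry is not a film."""
    match = pattern.match(entry)
    if match is None:
        return None
    rank, title, year = match.groups()
    return {"rank": int(rank), "title": title, "year": int(year)}

films_clean = [film for film in map(parse_film, raw_titles) if film is not None]
for film in films_clean:
    print(film)
```

Because non-matching entries are skipped, the promotional heading disappears without a separate `pop()`.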

In [40]:
url = 'https://gutenberg.org/ebooks/search/?sort_order=downloads'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [52]:
titles=soup.select('li.booklink span.title')
books=[]
for title in titles:
    books.append(title.get_text())

books

['Frankenstein; Or, The Modern Prometheus',
 'Moby Dick; Or, The Whale',
 'Pride and Prejudice',
 'Romeo and Juliet',
 'Middlemarch',
 'A Room with a View',
 'Little Women; Or, Meg, Jo, Beth, and Amy',
 'The Complete Works of William Shakespeare',
 'The Blue Castle: a novel',
 'The Enchanted April',
 'The Adventures of Ferdinand Count Fathom — Complete',
 'Cranford',
 'The Expedition of Humphry Clinker',
 'The Adventures of Roderick Random',
 'History of Tom Jones, a Foundling',
 'Twenty years after',
 'My Life — Volume 1',
 "Alice's Adventures in Wonderland",
 'The Great Gatsby',
 'The Picture of Dorian Gray',
 "A Doll's House : a play",
 'The Yellow Wallpaper',
 'The Importance of Being Earnest: A Trivial Comedy for Serious People',
 'Metamorphosis',
 'A Modest Proposal\r']

In [43]:
authors=[]
for i in soup.select("span.subtitle"):
    authors.append(i.get_text())
authors

['Mary Wollstonecraft Shelley',
 'Herman Melville',
 'Jane Austen',
 'William Shakespeare',
 'George Eliot',
 'E. M. Forster',
 'Louisa May Alcott',
 'William Shakespeare',
 'L. M. Montgomery',
 'Elizabeth Von Arnim',
 'T. Smollett',
 'Elizabeth Cleghorn Gaskell',
 'T. Smollett',
 'T. Smollett',
 'Henry Fielding',
 'Alexandre Dumas and Auguste Maquet',
 'Richard Wagner',
 'Lewis Carroll',
 'F. Scott Fitzgerald',
 'Oscar Wilde',
 'Henrik Ibsen',
 'Charlotte Perkins Gilman',
 'Oscar Wilde',
 'Franz Kafka',
 'Jonathan Swift']

In [58]:
df=pd.DataFrame({"Books": books, "Authors": authors})
df

| | Books | Authors |
|---|---|---|
| 0 | Frankenstein; Or, The Modern Prometheus | Mary Wollstonecraft Shelley |
| 1 | Moby Dick; Or, The Whale | Herman Melville |
| 2 | Pride and Prejudice | Jane Austen |
| 3 | Romeo and Juliet | William Shakespeare |
| 4 | Middlemarch | George Eliot |
| 5 | A Room with a View | E. M. Forster |
| 6 | Little Women; Or, Meg, Jo, Beth, and Amy | Louisa May Alcott |
| 7 | The Complete Works of William Shakespeare | William Shakespeare |
| 8 | The Blue Castle: a novel | L. M. Montgomery |
| 9 | The Enchanted April | Elizabeth Von Arnim |
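Building the table from two separate lists only works while both lists have the same length and order. If one book had no author, every row after it would be misaligned (or `pd.DataFrame` would raise an error because of the different lengths). A more defensive sketch is to loop over each book item and pick title and author inside it. The HTML below is a made-up miniature that only imitates the `li.booklink`, `span.title` and `span.subtitle` structure used above; the third entry is hypothetical.

```python
from bs4 import BeautifulSoup

# hypothetical miniature of the search results page (structure assumed from the selectors above)
sample = """
<ul>
  <li class="booklink"><span class="title">Pride and Prejudice</span><span class="subtitle">Jane Austen</span></li>
  <li class="booklink"><span class="title">A Modest Proposal </span><span class="subtitle">Jonathan Swift</span></li>
  <li class="booklink"><span class="title">Book without author</span></li>
</ul>
"""
soup = BeautifulSoup(sample, "html.parser")

rows = []
for item in soup.select("li.booklink"):
    # search inside this one item only, so title and author always belong together
    title = item.select_one("span.title")
    author = item.select_one("span.subtitle")
    rows.append({
        # strip=True also removes stray whitespace at the ends of the text
        "Books": title.get_text(strip=True) if title is not None else None,
        "Authors": author.get_text(strip=True) if author is not None else None,
    })

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`; missing authors then show up as empty values instead of shifting the column.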


# Using the requests package

In [70]:
# 3. download html with a get request 
response = requests.get(url)

In [71]:
response.status_code # 200 status code means OK!

200
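Instead of reading the status code by eye, we can let `requests` complain when something went wrong. The helper below is only a sketch: the name `fetch_html` and the 10 second timeout are arbitrary choices for illustration.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its raw HTML, raising an exception on failure."""
    # the timeout stops the call from hanging forever on a slow server
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.content
```

It could then be used as `soup = BeautifulSoup(fetch_html(url), "html.parser")`, wrapped in `try` / `except requests.exceptions.RequestException` if the script should keep going after a failed download.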

### HTTP Response status codes 
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
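Python's standard library already knows the official names of the status codes, which is handy for printing readable messages. A small sketch (the helper name `describe_status` is made up for this example):

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code together with its standard reason phrase."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus only knows registered codes
        return f"{code}: not a standard HTTP status code"
    return f"{code}: {status.phrase}"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```

As a rule of thumb, codes starting with 2 mean success, 3 a redirect, 4 a problem with the request and 5 a problem on the server.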

In [None]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup

#### Building the dataframe

In [None]:
#your code here

### Cleaning the data

In [None]:
# your code here