# Web scraping one page using Beautiful Soup

### Tools for scraping 

+ https://www.crummy.com/software/BeautifulSoup/bs4/doc/  (this is what we will use in lectures)

+ https://scrapy.org/

+ https://selenium-python.readthedocs.io/



## Dormouse HTML Code 


In [1]:
#create the variable

html_doc ="""
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</html>
"""

In [2]:
html_doc

'\n<!DOCTYPE html>\n<html><head><title>The Dormouse\'s story</title></head>\n<body>\n<p class="title"><b>The Dormouse\'s story</b></p>\n\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,\n<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and\n<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n\n<p class="story">...</p>\n</html>\n'

In [4]:
# install first with: conda install -c anaconda beautifulsoup4

# import the needed library: BeautifulSoup
from bs4 import BeautifulSoup



In [6]:
# parse (create) the soup 
soup_mouse=BeautifulSoup(html_doc,'html.parser')

In [None]:
# prettify the soup 
print(soup_mouse.prettify())

## Option 1 - using beautiful soup the "HTML" way  

In [10]:
# using basic tree navigation to access single elements
soup_mouse.title

<title>The Dormouse's story</title>

In [11]:
soup_mouse.title.string

"The Dormouse's story"

In [16]:
soup_mouse


<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [14]:
soup_mouse.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [17]:
soup_mouse.p

<p class="title"><b>The Dormouse's story</b></p>

In [32]:
# find elements of the tag using find_all()
p_tags=soup_mouse.find_all("p")


In [33]:
for p in p_tags: 
    print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [34]:
a_tags=soup_mouse.find_all("a")

In [35]:
a_tags

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [36]:
for atag in a_tags:
    print(atag.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [52]:
soup_mouse.title

<title>The Dormouse's story</title>

In [22]:
soup_mouse.title.parent.string

"The Dormouse's story"

In [28]:
soup_mouse.title.parent

<head><title>The Dormouse's story</title></head>

In [49]:
soup_mouse.title.parent.name

'head'

In [30]:
print(soup_mouse.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [78]:
# count the instances of the word 'were', using regex or using BeautifulSoup
import re
soup_mouse.find_all(string=re.compile("were"))

['Once upon a time there were three little sisters; and their names were\n']

In [105]:
matches = soup_mouse.find_all(string=re.compile("were"))
# find_all(string=...) returns the strings that contain a match,
# so this counts matching strings, not occurrences of the word
print(len(matches))

1


In [106]:
matches

['Once upon a time there were three little sisters; and their names were\n']

In [111]:
#without regex 
soup_mouse.text.count('were')

2
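
The two counts differ because `find_all(string=...)` returns whole text nodes, while counting on the extracted text sees every occurrence. A minimal sketch of that difference, assuming a shortened copy of the Dormouse paragraph (the names `snippet` and `soup_snippet` are introduced here only for illustration):

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc, used only in this sketch
snippet = (
    '<p class="story">Once upon a time there were three little sisters; '
    'and their names were <a href="http://example.com/elsie">Elsie</a></p>'
)
soup_snippet = BeautifulSoup(snippet, "html.parser")

# find_all(string=...) returns each text node that contains a match
nodes = soup_snippet.find_all(string=re.compile("were"))
print(len(nodes))

# re.findall on the extracted text returns every occurrence of the word
occurrences = re.findall(r"\bwere\b", soup_snippet.get_text())
print(len(occurrences))
```

The first count is the number of matching strings; the second is the number of times the word appears.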

## Option 2 - using beautiful soup the "CSS" way

As we will be using CSS selectors, let's first learn their syntax by playing this game: https://flukeout.github.io/

Everyone should reach level 12!

In [None]:
# using select()

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, can only be applied to single elements, while `select()` returns a list that may hold several elements. It's common to iterate through the output of `select()`.
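
A minimal sketch of that pattern, assuming a trimmed copy of the three sister links (the names `html` and `soup_links` are illustrative, not part of the lecture code):

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for html_doc, used only in this sketch
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup_links = BeautifulSoup(html, "html.parser")

# select() returns a list, so loop over it to reach each element
names = [a.get_text() for a in soup_links.select("a.sister")]
print(names)

# select_one() returns the first match (or None), so get_text() works directly
second = soup_links.select_one("#link2").get_text()
print(second)
```

`select_one()` is convenient when only one match is expected, such as with an `id` selector.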

In [72]:
soup_mouse.select('#link2')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [75]:
soup_mouse.select('.sister')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [None]:
soup_mouse

In [None]:
# to be continued!!

Useful links for the lecture:

+ https://www.w3schools.com/cssref/css_selectors.asp
+ https://www.w3schools.com/tags/default.asp
+ https://www.w3schools.com/css/css_syntax.ASP
+ https://www.imdb.com/chart/top/

## Activity 

Write code to extract and print the following contents (not including the html tags, only human-readable text): 

1. All the "fun facts"

2. The names of all the places

3. The content (name and fact) of all the cities (only cities, not countries) 

4. The names (not facts!) of all the cities (not countries)


In [64]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [66]:
soup=BeautifulSoup(geography, 'html.parser')

In [67]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  Geography
 </head>
 <body>
  <div class="city">
   <h2>
    London
   </h2>
   <p>
    London is the most popular tourist destination in the world.
   </p>
  </div>
  <div class="city">
   <h2>
    Paris
   </h2>
   <p>
    Paris was originally a Roman City called Lutetia.
   </p>
  </div>
  <div class="country">
   <h2>
    Spain
   </h2>
   <p>
    Spain produces 43,8% of all the world's Olive Oil.
   </p>
  </div>
 </body>
</html>



example : 
    

**Paris was originally a Roman City called Lutetia**


In [68]:
# 1. All the "fun facts"
for i in soup.find_all("p"):
    print(i.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


example: 

**Paris**

In [69]:
# 2. The names of all the places.
for i in soup.find_all("h2"):
    print(i.get_text())


London
Paris
Spain


example: 
    
**Paris**

**Paris was originally a Roman City called Lutetia.**

In [70]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)
for i in soup.find_all("div",{"class":"city"}):
    print(i.get_text())



London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



In [71]:
# 4. The names (not facts!) of all the cities (not countries!)
for i in soup.find_all("div",{"class":"city"}):
    print(i.h2.get_text())


London
Paris
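
The same activity can also be approached the "CSS" way with `select()`. This is only a sketch of one possible alternative, using a trimmed copy of the activity document so that it runs on its own (the names `geography_html` and `geo_soup` are illustrative):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the activity document, repeated so the sketch is self-contained
geography_html = """
<div class="city"><h2>London</h2><p>London is the most popular tourist destination in the world.</p></div>
<div class="city"><h2>Paris</h2><p>Paris was originally a Roman City called Lutetia.</p></div>
<div class="country"><h2>Spain</h2><p>Spain produces 43,8% of all the world's Olive Oil.</p></div>
"""
geo_soup = BeautifulSoup(geography_html, "html.parser")

# 1. all the fun facts
facts = [p.get_text() for p in geo_soup.select("p")]
# 2. the names of all the places
places = [h2.get_text() for h2 in geo_soup.select("h2")]
# 4. the names of the cities only: h2 tags nested inside div.city
city_names = [h2.get_text() for h2 in geo_soup.select("div.city h2")]

print(places)
print(city_names)
```

The descendant selector `div.city h2` replaces the two-step approach of finding the `div` first and then reading its `h2`.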


## Scraping the IMDB top 250

Let's go to https://www.imdb.com/chart/top, where we'll see the top 250 movies according to IMDb ratings.

Notice how each movie has the following elements:

- Title

- Release Year

- IMDb rating

- Director & main stars (they appear when you hover over the title)

Our objective is to scrape this information and store it in a pandas DataFrame.

In [None]:
# 1. import libraries: BeautifulSoup, requests, pandas
from bs4 import BeautifulSoup
import requests
import pandas as pd

# 2. find the url and store it in a variable
url = "https://www.imdb.com/chart/top"

# 3. download the html with a get request
response = requests.get(url)

In [None]:
#check response status code 
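
# One possible way to check the status is sketched below. It is an offline
# illustration only: the helper name `status_summary` is hypothetical, and the
# responses are built by hand so that nothing is downloaded.

```python
import requests


def status_summary(response):
    """Describe whether a requests.Response signals success."""
    # response.ok is True for status codes below 400
    if response.ok:
        return f"{response.status_code}: success"
    return f"{response.status_code}: request failed"


# Offline stand-ins for what requests.get(url) would return,
# built by hand so the sketch runs without network access
summaries = []
for code in (200, 403):
    fake = requests.Response()
    fake.status_code = code
    summaries.append(status_summary(fake))

print(summaries)
```

# With a real response, `response.status_code` can be inspected in the same way;
# a code of 200 means the page was downloaded successfully.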


In [None]:
#parse and store the contents of the url call


In [None]:
#prettify the soup 


### Query the soup to get movie title, actors, director, year 


In [None]:
# the director and main stars are in the same tag, but as a value of the attribute "title"
# we can access attributes as key-value pairs of dictionaries: using ["key"] to get the value:

# instead of ["title"] we could use .get("title"): choose whatever you prefer
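
# A minimal sketch of both access styles follows. The one-row sample is
# hypothetical: it is shaped like the markup described in these notes,
# and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row sample, assumed to resemble the page's markup
row_html = """
<td class="titleColumn">
  <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
  <span class="secondaryInfo">(1994)</span>
</td>
"""
row = BeautifulSoup(row_html, "html.parser")
link = row.select_one("td.titleColumn a")

movie_title = link.get_text()       # the movie title is the text of the tag
by_key = link["title"]              # dictionary-style access; KeyError if absent
by_get = link.get("title")          # same value; returns None if absent
missing = link.get("data-missing")  # an attribute this tag does not have

print(movie_title)
print(by_key)
print(missing)
```

# The practical difference is error handling: `.get()` is safer when some tags
# may lack the attribute.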

In [None]:
# the years are inside a 'span' tag with the 'secondaryInfo' class
# we also specify the parent tag and its class, which is the same we used before
# the years are inside parentheses, but we'll take care of that later

### Once we have a method working for one movie, we can apply it for all the movies

- loop through the movies
- pick up the title, director, actors and year
- store each of them in a list

For example:

**movie_lst = soup.select("td.titleColumn a")**

**yr_lst = soup.select("td.titleColumn span.secondaryInfo")**
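
The loop can be sketched as follows. The two-row table is a hypothetical sample written to match the selectors above; the live page's markup may differ, and the list names `titles`, `dir_stars` and `years` are assumptions of this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample matching the selectors above
sample = """
<table>
<tr><td class="titleColumn">
  <a title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
  <span class="secondaryInfo">(1994)</span>
</td></tr>
<tr><td class="titleColumn">
  <a title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
  <span class="secondaryInfo">(1972)</span>
</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample, "html.parser")

movie_lst = sample_soup.select("td.titleColumn a")
yr_lst = sample_soup.select("td.titleColumn span.secondaryInfo")

titles, dir_stars, years = [], [], []
# zip pairs each title link with the year span from the same row
for movie, yr in zip(movie_lst, yr_lst):
    titles.append(movie.get_text())
    dir_stars.append(movie["title"])
    years.append(yr.get_text())

print(titles)
print(years)
```

The years still carry their parentheses at this point; that is left for the cleaning step.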

In [None]:
# install tqdm (which provides tqdm.notebook) with: conda install -c conda-forge tqdm


### Cleaning / Wrangling steps for the scraped data 

An inherent part of web scraping is data cleaning. We managed to get the information we needed, but for it to be useful, we still need some extra steps:

- Take the year out of the parentheses: we know we can do that with regex, but string methods such as str.replace() might be simpler to use.

- Split dir_stars into 3 columns, one for each person: "director", "star_1", "star_2". This could have been done by filtering when extracting the data from the html document, but it looks easier afterwards:

    - The "(dir.)" pattern can be removed
    - We can split the string at each comma
    
- Change the data type of the year column to integer.
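
These steps might look like the sketch below. It assumes the raw values have the form described above; the sample rows and the DataFrame name `movies` are illustrative.

```python
import pandas as pd

# Hypothetical scraped values, in the raw form described above
movies = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather"],
    "year": ["(1994)", "(1972)"],
    "dir_stars": [
        "Frank Darabont (dir.), Tim Robbins, Morgan Freeman",
        "Francis Ford Coppola (dir.), Marlon Brando, Al Pacino",
    ],
})

# Strip the parentheses, then convert the year to integer
movies["year"] = (
    movies["year"]
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .astype(int)
)

# Remove the "(dir.)" pattern and split the string at each comma
people = (
    movies["dir_stars"]
    .str.replace(" (dir.)", "", regex=False)
    .str.split(",", expand=True)
)
movies["director"] = people[0].str.strip()
movies["star_1"] = people[1].str.strip()
movies["star_2"] = people[2].str.strip()
movies = movies.drop(columns="dir_stars")

print(movies)
```

Passing `regex=False` keeps the parentheses and the dot from being interpreted as regular-expression syntax.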


### Create data frame from results and preview 
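
A minimal sketch of this step, assuming the scraped values were collected in three lists of equal length (the list names and sample values are hypothetical):

```python
import pandas as pd

# Hypothetical lists, as they might look after the scraping loop
titles = ["The Shawshank Redemption", "The Godfather"]
years = ["(1994)", "(1972)"]
dir_stars = [
    "Frank Darabont (dir.), Tim Robbins, Morgan Freeman",
    "Francis Ford Coppola (dir.), Marlon Brando, Al Pacino",
]

# Each dictionary key becomes a column; each list supplies that column's rows
movies_df = pd.DataFrame({"title": titles, "year": years, "dir_stars": dir_stars})

# head() previews the first rows; shape gives (rows, columns)
print(movies_df.head())
print(movies_df.shape)
```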