## **D3TOP - Tópicos em Ciência de Dados (IFSP Campinas)**
**Prof. Dr. Samuel Martins (@iamsamucoding @samucoding @xavecoding)** <br/>
xavecoding: https://youtube.com/c/xavecoding <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<hr/>

# Web Scraping
Web scraping is the process of **extracting data from websites (or other sources, e.g., PDFs)** using _automated software tools_, also known as **web scrapers** or **crawlers**. In the case of websites, these tools access the pages and _extract data_ from the _HTML code_; the data can then be used for a variety of purposes such as data analysis, research, and content creation.

Web scraping is commonly used in fields such as e-commerce, finance, and marketing, as it allows companies to gather data on competitors, track prices and inventory levels, and monitor customer sentiment. It can also be used by researchers to collect data for academic studies, and by journalists to uncover newsworthy information.

However, it's important to note that web scraping can _raise ethical_ and _legal concerns_, as _some websites explicitly prohibit scraping_ of their data. Additionally, scraping can put a strain on server resources and can potentially cause website downtime or other technical issues. As such, it's important to use web scraping tools responsibly and within the boundaries of the law and ethical considerations.

## Goal
In this notebook, we will scrape the page with the **[top 250 movies on IMDb](https://www.imdb.com/chart/top/?ref_=nv_mv_250)**. We will extract data from the HTML table and save it into a CSV file.

## Required Packages
Beautiful Soup - https://beautiful-soup-4.readthedocs.io/en/latest/

In [1]:
!pip install beautifulsoup4



## Getting the page

In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
# target website
URL = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'

**HTTP Protocol**

<img src='https://d77da31580fbc8944c00-52b01ccbcfe56047120eec75d9cb2cbd.ssl.cf6.rackcdn.com/695bb7d8-d242-47a8-ab92-bc72225323df/what-is-http---teachoo.jpg' width=400/>

Source: https://d77da31580fbc8944c00-52b01ccbcfe56047120eec75d9cb2cbd.ssl.cf6.rackcdn.com/695bb7d8-d242-47a8-ab92-bc72225323df/what-is-http---teachoo.jpg

In [4]:
# get the page
page = requests.get(URL)

In [5]:
type(page)

requests.models.Response

In [6]:
# HTTP success status code - 200
page.status_code

200
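
The status code is how the server reports whether the request succeeded. As a minimal sketch using only the standard library's `http.HTTPStatus` (the helper name `describe_status` is hypothetical and not part of `requests`), this is how a numeric code maps to its meaning:

```python
from http import HTTPStatus

def describe_status(code):
    """Return (reason phrase, is_success) for an HTTP status code."""
    status = HTTPStatus(code)         # raises ValueError for unknown codes
    is_success = 200 <= code < 300    # 2xx codes mean the request succeeded
    return status.phrase, is_success

for code in (200, 404, 500):
    print(code, describe_status(code))
```

In the notebook, a similar check could be done on `page.status_code` before parsing; `requests` also offers `page.raise_for_status()`, which raises an exception for 4xx and 5xx responses.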

In [7]:
# save the obtained HTML page into a file just for inspection
# (assumes the 'out' directory already exists)
with open('out/imdb.html', 'w', encoding='utf-8') as f:
    f.write(page.content.decode())

## Beautiful Soup Basics
Install [**SelectorGadget**](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) from the Chrome Web Store. It is a browser extension for inspecting HTML and finding CSS selectors.

In [8]:
# parsing the html page to a BS object
site = BeautifulSoup(page.content, 'html.parser')

In [9]:
type(site)

bs4.BeautifulSoup

<br/>

`find(tag)`: find the _first_ HTML element with the tag `<tag>`.

https://beautiful-soup-4.readthedocs.io/en/latest/#kinds-of-filters
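
Before running `find` on the real page, its behavior can be sketched on a small HTML snippet. The snippet below is hypothetical and only for illustration; it shows that `find` returns the first match and `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<div>
  <h3>First heading</h3>
  <h3>Second heading</h3>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

first_h3 = soup.find('h3')      # only the first match is returned
missing = soup.find('table')    # no <table> in the snippet, so this is None

print(first_h3.get_text())
print(missing)
```

Because `find` may return `None`, it is worth checking the result before chaining calls such as `.text` on it.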

In [10]:
# find h3
site.find('h3')

<h3>IMDb Charts</h3>

In [11]:
# find tr
site.find('tr')

<tr>
<th></th>
<th>Rank &amp; Title</th>
<th>IMDb Rating</th>
<th>Your Rating</th>
<th></th>
</tr>

In [12]:
first_tr = site.find('tr')
first_tr

<tr>
<th></th>
<th>Rank &amp; Title</th>
<th>IMDb Rating</th>
<th>Your Rating</th>
<th></th>
</tr>

In [13]:
type(first_tr)

bs4.element.Tag

In [14]:
print(first_tr.prettify())

<tr>
 <th>
 </th>
 <th>
  Rank &amp; Title
 </th>
 <th>
  IMDb Rating
 </th>
 <th>
  Your Rating
 </th>
 <th>
 </th>
</tr>



In [15]:
# find the (first) tag with id='suggestion-search'
site.find(id='suggestion-search')

<input aria-autocomplete="list" aria-controls="react-autowhatever-1" aria-label="Search IMDb" autocapitalize="off" autocomplete="off" autocorrect="off" class="imdb-header-search__input searchTypeahead__input react-autosuggest__input" id="suggestion-search" name="q" placeholder="Search IMDb" spellcheck="true" type="text" value=""/>

In [16]:
# find the <input> tag with id='suggestion-search'
site.find('input', attrs={'id': 'suggestion-search'})

<input aria-autocomplete="list" aria-controls="react-autowhatever-1" aria-label="Search IMDb" autocapitalize="off" autocomplete="off" autocorrect="off" class="imdb-header-search__input searchTypeahead__input react-autosuggest__input" id="suggestion-search" name="q" placeholder="Search IMDb" spellcheck="true" type="text" value=""/>

In [17]:
# find the first element with class='titleColumn'
site.find(class_='titleColumn')

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>

In [18]:
# find the first <td> tag with class='titleColumn' (using attrs)
site.find('td', attrs={'class': 'titleColumn'})

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>

In [19]:
# find the first <td> tag with class='titleColumn' (using the class_ keyword)
site.find('td', class_='titleColumn')

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>

<br/>

`find_all(tag)`: find **all** HTML elements with the tag `<tag>`.

https://beautiful-soup-4.readthedocs.io/en/latest/#kinds-of-filters
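
As a minimal sketch on a hypothetical snippet (not the IMDb page), `find_all` returns a list-like result with every match, accepts the same filters as `find`, and returns an empty result when nothing matches:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<ul>
  <li class="movie">A</li>
  <li class="movie">B</li>
  <li class="other">C</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

all_items = soup.find_all('li')                    # every <li>
movie_items = soup.find_all('li', class_='movie')  # only <li class="movie">
first_two = soup.find_all('li', limit=2)           # stop after two matches
no_match = soup.find_all('table')                  # empty result, not None

print(len(all_items), [li.get_text() for li in movie_items])
print(len(first_two), len(no_match))
```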

In [20]:
# find all <h3> tags
site.find_all('h3')

[<h3>IMDb Charts</h3>,
 <h3>You Have Seen</h3>,
 <h3> IMDb Charts</h3>,
 <h3>Top Rated Movies by Genre</h3>,
 <h3>Recently Viewed</h3>]

In [21]:
# find all HTML elements with class='titleColumn'
elements = site.find_all(class_='titleColumn')

print(len(elements))
print(elements[:3])

250
[<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>, <td class="titleColumn">
      2.
      <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">O Poderoso Chefão</a>
<span class="secondaryInfo">(1972)</span>
</td>, <td class="titleColumn">
      3.
      <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">Batman: O Cavaleiro das Trevas</a>
<span class="secondaryInfo">(2008)</span>
</td>]


In [22]:
# find all <td> tags with class='titleColumn' (using the class_ keyword)
td_list = site.find_all('td', class_='titleColumn')

print(len(td_list))
print(td_list[:3])

250
[<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>, <td class="titleColumn">
      2.
      <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">O Poderoso Chefão</a>
<span class="secondaryInfo">(1972)</span>
</td>, <td class="titleColumn">
      3.
      <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">Batman: O Cavaleiro das Trevas</a>
<span class="secondaryInfo">(2008)</span>
</td>]


In [23]:
# find all <td> tags with class='titleColumn' (using attrs)
td_list = site.find_all('td', attrs={'class': 'titleColumn'})

print(len(td_list))
print(td_list[:3])

250
[<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>, <td class="titleColumn">
      2.
      <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">O Poderoso Chefão</a>
<span class="secondaryInfo">(1972)</span>
</td>, <td class="titleColumn">
      3.
      <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">Batman: O Cavaleiro das Trevas</a>
<span class="secondaryInfo">(2008)</span>
</td>]


<br/>

`select(selector)`: search for **all** HTML elements with the CSS selector `selector`.

https://beautiful-soup-4.readthedocs.io/en/latest/#searching-by-css-class
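
The most common selector forms can be sketched on a small, hypothetical snippet: `tag.class` for a tag with a class, `#id` for an id, and `.class` for any element with a class. `select_one` is the single-result counterpart of `select`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<div id="main">
  <p class="note">one</p>
  <p class="note">two</p>
  <span class="note">three</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

by_tag_and_class = soup.select('p.note')  # <p> elements with class 'note'
by_id = soup.select('#main')              # always a list, even for an id
first_note = soup.select_one('.note')     # first match, or None if absent

print([p.get_text() for p in by_tag_and_class])
print(len(by_id), first_note.get_text())
```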

In [24]:
# select all <h3> elements
site.select('h3')

[<h3>IMDb Charts</h3>,
 <h3>You Have Seen</h3>,
 <h3> IMDb Charts</h3>,
 <h3>Top Rated Movies by Genre</h3>,
 <h3>Recently Viewed</h3>]

In [25]:
# select the HTML element with the id 'suggestion-search'
# note that the return value is a list
site.select('#suggestion-search')

[<input aria-autocomplete="list" aria-controls="react-autowhatever-1" aria-label="Search IMDb" autocapitalize="off" autocomplete="off" autocorrect="off" class="imdb-header-search__input searchTypeahead__input react-autosuggest__input" id="suggestion-search" name="q" placeholder="Search IMDb" spellcheck="true" type="text" value=""/>]

In [26]:
# select the <input> tag with the id 'suggestion-search'
# note that the return value is a list
site.select('input#suggestion-search')

[<input aria-autocomplete="list" aria-controls="react-autowhatever-1" aria-label="Search IMDb" autocapitalize="off" autocomplete="off" autocorrect="off" class="imdb-header-search__input searchTypeahead__input react-autosuggest__input" id="suggestion-search" name="q" placeholder="Search IMDb" spellcheck="true" type="text" value=""/>]

In [27]:
# select all HTML elements with class='titleColumn'
elements = site.select('.titleColumn')

print(len(elements))
print(elements[:3])

250
[<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>, <td class="titleColumn">
      2.
      <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">O Poderoso Chefão</a>
<span class="secondaryInfo">(1972)</span>
</td>, <td class="titleColumn">
      3.
      <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">Batman: O Cavaleiro das Trevas</a>
<span class="secondaryInfo">(2008)</span>
</td>]


In [28]:
# select all <td> tags with class='titleColumn'
td_list = site.select('td.titleColumn')

print(len(td_list))
print(td_list[:3])

250
[<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>, <td class="titleColumn">
      2.
      <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">O Poderoso Chefão</a>
<span class="secondaryInfo">(1972)</span>
</td>, <td class="titleColumn">
      3.
      <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">Batman: O Cavaleiro das Trevas</a>
<span class="secondaryInfo">(2008)</span>
</td>]


### More useful Beautiful Soup methods and attributes

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

- `encode_contents`
- `replace_with`
- `unwrap`
- `find_all`
- `find_all_next`
- `find_all_previous`
- `find_next`
- `find_next_sibling`
- `find_next_siblings`
- `find_parent`
- `find_parents`
- `find_previous`
- `find_previous_sibling`
- `find_previous_siblings`
- `get_text`
- `next_sibling`
- `previous_sibling`
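
A few of these can be sketched on a hypothetical one-row table (the class names `a` and `b` are made up for illustration): moving to a sibling, moving up to a parent, and collecting the text of an element.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = '<table><tr><td class="a">1.</td><td class="b">Title</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

first_td = soup.find('td', class_='a')
sibling_td = first_td.find_next_sibling('td')   # next <td> at the same level
row = first_td.find_parent('tr')                # closest enclosing <tr>
row_text = row.get_text(separator=' ', strip=True)

print(sibling_td.get_text())
print(row.name, row_text)
```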

## Scraping the movies
- Inspect the target page to find which HTML element holds the movie data
    + use **SelectorGadget**

### Page HTML overview

<br/>

Get all `<tr>` elements:

In [29]:
movies_tr = site.select('tr')

In [30]:
# number of elements
print(len(movies_tr))

251


The result has _one more element_ than the 250 movies we expected. Let's see which one it is.

In [31]:
movies_tr[:2]

[<tr>
 <th></th>
 <th>Rank &amp; Title</th>
 <th>IMDb Rating</th>
 <th>Your Rating</th>
 <th></th>
 </tr>,
 <tr>
 <td class="posterColumn">
 <span data-value="1" name="rk"></span>
 <span data-value="9.235691969922451" name="ir"></span>
 <span data-value="7.791552E11" name="us"></span>
 <span data-value="2708940" name="nv"></span>
 <span data-value="-1.764308030077549" name="ur"></span>
 <a href="/title/tt0111161/"> <img alt="Um Sonho de Liberdade" height="67" src="https://m.media-amazon.com/images/M/MV5BNDE3ODcxYzMtY2YzZC00NmNlLWJiNDMtZDViZWM2MzIxZDYwXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UX45_CR0,0,45,67_AL_.jpg" width="45"/>
 </a> </td>
 <td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
 <span class="secondaryInfo">(1994)</span>
 </td>
 <td class="ratingColumn imdbRating">
 <strong title="9.2 based on 2,708,940 user ratings">9.2</strong>
 </td>
 <td class="ratingColumn">
 <div class="se

The _extra element_ is the **table header**, which is the _first element_ of the list:

In [32]:
movies_tr[0]

<tr>
<th></th>
<th>Rank &amp; Title</th>
<th>IMDb Rating</th>
<th>Your Rating</th>
<th></th>
</tr>

<br/>

To solve this problem, we can simply ***delete** the first element of the list*:

In [33]:
movies_tr.pop(0)

<tr>
<th></th>
<th>Rank &amp; Title</th>
<th>IMDb Rating</th>
<th>Your Rating</th>
<th></th>
</tr>

In [34]:
len(movies_tr)

250

In [35]:
movies_tr[:2]

[<tr>
 <td class="posterColumn">
 <span data-value="1" name="rk"></span>
 <span data-value="9.235691969922451" name="ir"></span>
 <span data-value="7.791552E11" name="us"></span>
 <span data-value="2708940" name="nv"></span>
 <span data-value="-1.764308030077549" name="ur"></span>
 <a href="/title/tt0111161/"> <img alt="Um Sonho de Liberdade" height="67" src="https://m.media-amazon.com/images/M/MV5BNDE3ODcxYzMtY2YzZC00NmNlLWJiNDMtZDViZWM2MzIxZDYwXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UX45_CR0,0,45,67_AL_.jpg" width="45"/>
 </a> </td>
 <td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
 <span class="secondaryInfo">(1994)</span>
 </td>
 <td class="ratingColumn imdbRating">
 <strong title="9.2 based on 2,708,940 user ratings">9.2</strong>
 </td>
 <td class="ratingColumn">
 <div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111161">
 <div class="boundary">
 <div class="pop

<br/>

Alternatively, we can perform a more advanced **search** with a more specific _CSS selector_: `'TAG1 TAG2'` <br/>
- Selects all HTML elements with TAG2 that are "inside" (descendants of) HTML elements with TAG1
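
The difference between the plain selector and the descendant selector can be sketched on a small, hypothetical table (its structure is an assumption made for illustration, with the header row inside `<thead>`):

```python
from bs4 import BeautifulSoup

# Hypothetical table, used only for illustration.
html = """
<table>
  <thead><tr><th>Rank &amp; Title</th></tr></thead>
  <tbody>
    <tr><td>1. Movie A</td></tr>
    <tr><td>2. Movie B</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

all_rows = soup.select('tr')          # header row + data rows
body_rows = soup.select('tbody tr')   # only rows that are inside <tbody>

print(len(all_rows), len(body_rows))
print(body_rows[0].get_text(strip=True))
```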

In [36]:
# select all <tr> inside <tbody>
movies_tr = site.select('tbody tr')

In [37]:
len(movies_tr)

250

In [38]:
movies_tr[:2]

[<tr>
 <td class="posterColumn">
 <span data-value="1" name="rk"></span>
 <span data-value="9.235691969922451" name="ir"></span>
 <span data-value="7.791552E11" name="us"></span>
 <span data-value="2708940" name="nv"></span>
 <span data-value="-1.764308030077549" name="ur"></span>
 <a href="/title/tt0111161/"> <img alt="Um Sonho de Liberdade" height="67" src="https://m.media-amazon.com/images/M/MV5BNDE3ODcxYzMtY2YzZC00NmNlLWJiNDMtZDViZWM2MzIxZDYwXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UX45_CR0,0,45,67_AL_.jpg" width="45"/>
 </a> </td>
 <td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
 <span class="secondaryInfo">(1994)</span>
 </td>
 <td class="ratingColumn imdbRating">
 <strong title="9.2 based on 2,708,940 user ratings">9.2</strong>
 </td>
 <td class="ratingColumn">
 <div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111161">
 <div class="boundary">
 <div class="pop

In [39]:
for tr in movies_tr[:2]:
    print(tr.prettify())

<tr>
 <td class="posterColumn">
  <span data-value="1" name="rk">
  </span>
  <span data-value="9.235691969922451" name="ir">
  </span>
  <span data-value="7.791552E11" name="us">
  </span>
  <span data-value="2708940" name="nv">
  </span>
  <span data-value="-1.764308030077549" name="ur">
  </span>
  <a href="/title/tt0111161/">
   <img alt="Um Sonho de Liberdade" height="67" src="https://m.media-amazon.com/images/M/MV5BNDE3ODcxYzMtY2YzZC00NmNlLWJiNDMtZDViZWM2MzIxZDYwXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UX45_CR0,0,45,67_AL_.jpg" width="45"/>
  </a>
 </td>
 <td class="titleColumn">
  1.
  <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">
   Um Sonho de Liberdade
  </a>
  <span class="secondaryInfo">
   (1994)
  </span>
 </td>
 <td class="ratingColumn imdbRating">
  <strong title="9.2 based on 2,708,940 user ratings">
   9.2
  </strong>
 </td>
 <td class="ratingColumn">
  <div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111161">
 

### Scraping the movies

In [40]:
print(movies_tr[0].prettify())

<tr>
 <td class="posterColumn">
  <span data-value="1" name="rk">
  </span>
  <span data-value="9.235691969922451" name="ir">
  </span>
  <span data-value="7.791552E11" name="us">
  </span>
  <span data-value="2708940" name="nv">
  </span>
  <span data-value="-1.764308030077549" name="ur">
  </span>
  <a href="/title/tt0111161/">
   <img alt="Um Sonho de Liberdade" height="67" src="https://m.media-amazon.com/images/M/MV5BNDE3ODcxYzMtY2YzZC00NmNlLWJiNDMtZDViZWM2MzIxZDYwXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UX45_CR0,0,45,67_AL_.jpg" width="45"/>
  </a>
 </td>
 <td class="titleColumn">
  1.
  <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">
   Um Sonho de Liberdade
  </a>
  <span class="secondaryInfo">
   (1994)
  </span>
 </td>
 <td class="ratingColumn imdbRating">
  <strong title="9.2 based on 2,708,940 user ratings">
   9.2
  </strong>
 </td>
 <td class="ratingColumn">
  <div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111161">
 

<br/>

Note that the `<td class="titleColumn">` has the movie's **ranking**, a _link_ (`<a>`) with the **title**, and another HTML element (`<span>`) with the movie's **year**.

<pre>
&lt;td class="titleColumn">
  1.
  &lt;a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">
   Um Sonho de Liberdade
  &lt;/a>
  &lt;span class="secondaryInfo">
   (1994)
  &lt;/span>
 &lt;/td>
</pre>

We need to **extract** this information. Let's run some tests with the _first movie_.

In [41]:
first_movie_tr = movies_tr[0]

print(first_movie_tr.prettify())

<tr>
 <td class="posterColumn">
  <span data-value="1" name="rk">
  </span>
  <span data-value="9.235691969922451" name="ir">
  </span>
  <span data-value="7.791552E11" name="us">
  </span>
  <span data-value="2708940" name="nv">
  </span>
  <span data-value="-1.764308030077549" name="ur">
  </span>
  <a href="/title/tt0111161/">
   <img alt="Um Sonho de Liberdade" height="67" src="https://m.media-amazon.com/images/M/MV5BNDE3ODcxYzMtY2YzZC00NmNlLWJiNDMtZDViZWM2MzIxZDYwXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UX45_CR0,0,45,67_AL_.jpg" width="45"/>
  </a>
 </td>
 <td class="titleColumn">
  1.
  <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">
   Um Sonho de Liberdade
  </a>
  <span class="secondaryInfo">
   (1994)
  </span>
 </td>
 <td class="ratingColumn imdbRating">
  <strong title="9.2 based on 2,708,940 user ratings">
   9.2
  </strong>
 </td>
 <td class="ratingColumn">
  <div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111161">
 

In [42]:
# extract the `ranking & title` column
first_movie_tr.find(class_='titleColumn')

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>

In [43]:
title_column = first_movie_tr.find(class_='titleColumn')
title_column

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>

In [44]:
# strip the HTML text
title_column.text

'\n      1.\n      Um Sonho de Liberdade\n(1994)\n'

In [45]:
# extract the ranking, title, and year from the title column <td> by regex
import re

# \s also matches \n; the raw string (r'...') passes the backslashes to the regex engine unchanged
_, ranking_str, title, year_str, _ = re.split(r'\n\s*', title_column.text)

print(ranking_str)
print(title)
print(year_str)

1.
Um Sonho de Liberdade
(1994)


In [46]:
# converting string ranking to integer
ranking = int(float(ranking_str))
ranking

1

In [47]:
# converting string year to integer
year = int(re.sub('[()]', '', year_str))
year

1994
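
As an alternative to splitting the text with a regex, the same three values can be read from the element's text nodes. This is a minimal sketch, assuming the `<td>` always contains exactly three non-empty strings (ranking, title, year) as in the first movie; the HTML is copied from the output above, without the `title` attribute of the link.

```python
from bs4 import BeautifulSoup

# HTML of the first movie's title column, inlined so the sketch runs offline.
html = """<td class="titleColumn">
      1.
      <a href="/title/tt0111161/">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>"""
title_column = BeautifulSoup(html, 'html.parser').find('td', class_='titleColumn')

# stripped_strings yields each text node with surrounding whitespace removed,
# skipping nodes that contain only whitespace.
ranking_str, title, year_str = title_column.stripped_strings

ranking = int(ranking_str.rstrip('.'))   # '1.' -> 1
year = int(year_str.strip('()'))         # '(1994)' -> 1994

print(ranking, title, year)
```

The unpacking raises a `ValueError` if a row does not have exactly three strings, which makes unexpected rows visible instead of silently producing wrong values.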

OK, now we know how to extract the ranking, title, and year of a movie from the HTML.

Let's analyze the HTML of the first movie again. We still need to **extract the *IMDb rating***.

In [48]:
print(first_movie_tr.prettify())

<tr>
 <td class="posterColumn">
  <span data-value="1" name="rk">
  </span>
  <span data-value="9.235691969922451" name="ir">
  </span>
  <span data-value="7.791552E11" name="us">
  </span>
  <span data-value="2708940" name="nv">
  </span>
  <span data-value="-1.764308030077549" name="ur">
  </span>
  <a href="/title/tt0111161/">
   <img alt="Um Sonho de Liberdade" height="67" src="https://m.media-amazon.com/images/M/MV5BNDE3ODcxYzMtY2YzZC00NmNlLWJiNDMtZDViZWM2MzIxZDYwXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UX45_CR0,0,45,67_AL_.jpg" width="45"/>
  </a>
 </td>
 <td class="titleColumn">
  1.
  <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">
   Um Sonho de Liberdade
  </a>
  <span class="secondaryInfo">
   (1994)
  </span>
 </td>
 <td class="ratingColumn imdbRating">
  <strong title="9.2 based on 2,708,940 user ratings">
   9.2
  </strong>
 </td>
 <td class="ratingColumn">
  <div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111161">
 

Note that the **IMDb rating** is in the following table column `<td>`:

<pre>
&lt;td class="ratingColumn imdbRating">
  &lt;strong title="9.2 based on 2,708,940 user ratings">
   9.2
  &lt;/strong>
 &lt;/td>
</pre>

To extract the number inside the `<td>`, we just need to select the HTML element and get its **text**.

In [49]:
first_movie_tr.find(class_='imdbRating')

<td class="ratingColumn imdbRating">
<strong title="9.2 based on 2,708,940 user ratings">9.2</strong>
</td>

In [50]:
rating_str = first_movie_tr.find(class_='imdbRating').text
rating_str

'\n9.2\n'

In [51]:
rating = float(rating_str)
rating

9.2
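
The `<strong>` element also carries a `title` attribute with the number of user ratings. As a minimal sketch (the variable name `n_votes` and the regex are assumptions, based only on the attribute format shown in the output above), attributes can be read with dictionary-style access:

```python
import re
from bs4 import BeautifulSoup

# HTML of the first movie's rating column, inlined so the sketch runs offline.
html = """<td class="ratingColumn imdbRating">
<strong title="9.2 based on 2,708,940 user ratings">9.2</strong>
</td>"""
rating_td = BeautifulSoup(html, 'html.parser').find('td', class_='imdbRating')

rating = float(rating_td.get_text(strip=True))

# Tag attributes are accessed like dictionary keys.
title_attr = rating_td.find('strong')['title']

# Assumed format: '<rating> based on <votes> user ratings'.
match = re.search(r'based on ([\d,]+) user ratings', title_attr)
n_votes = int(match.group(1).replace(',', '')) if match else None

print(rating, n_votes)
```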

<br/>

We're done! We now know how to extract a movie's data from the HTML page. Since all movies follow the same _HTML pattern_, we can create a function that extracts one movie's information and then call it in a loop.


In [52]:
import re

def extract_movie_data(movie_tr):
    title_column = movie_tr.find(class_='titleColumn')
    _, ranking_str, title, year_str, _ = re.split(r'\n\s*', title_column.text)
    
    ranking = int(float(ranking_str))
    year = int(re.sub('[()]', '', year_str))
    
    rating_str = movie_tr.find(class_='imdbRating').text
    rating = float(rating_str)
    
    return ranking, title, year, rating

<br/>

Now, we will store each movie's fields in their corresponding lists.

In [53]:
import pdb

ranking_list = []
title_list = []
year_list = []
rating_list = []

for tr in movies_tr:
    ranking, title, year, rating = extract_movie_data(tr)
    
    ranking_list.append(ranking)
    title_list.append(title)
    year_list.append(year)
    rating_list.append(rating)
    
    # pdb.set_trace()

<br/>

Create a pandas `DataFrame` to save the **scraped data** into a _CSV_.

In [54]:
import pandas as pd

In [55]:
df = pd.DataFrame({
    'ranking': ranking_list,
    'title': title_list,
    'year': year_list,
    'rating': rating_list
})

In [56]:
df.to_csv('out/movies_imbd_250.csv', index=False)
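
For comparison, the same kind of CSV can be produced without pandas using the standard library's `csv` module. This is only a sketch: the rows below are hypothetical placeholders shaped like the tuples returned by `extract_movie_data`, and the output goes to an in-memory buffer instead of a file.

```python
import csv
import io

# Hypothetical rows: (ranking, title, year, rating).
movies = [
    (1, 'Movie A', 1994, 9.2),
    (2, 'Movie B', 1972, 9.1),
]

buffer = io.StringIO()   # in-memory file; a real file would use open(path, 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['ranking', 'title', 'year', 'rating'])   # header row
writer.writerows(movies)                                  # one line per movie

csv_text = buffer.getvalue()
print(csv_text)
```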

# Exercise
- Extend this web scraper to also extract the **movie's synopsis** from each movie's page, adding it as a new column.
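
As a starting point for the exercise, the link in each title column is relative (e.g., `/title/tt0111161/`), so it has to be combined with the site address before it can be requested. The sketch below covers only that step; `BASE_URL` is an assumed constant, and locating the synopsis inside the movie page is left to you.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed base address: the page the table was scraped from.
BASE_URL = 'https://www.imdb.com/chart/top/'

# Simplified title column of the first movie, inlined so the sketch runs offline.
html = '<td class="titleColumn">1. <a href="/title/tt0111161/">Um Sonho de Liberdade</a></td>'
link = BeautifulSoup(html, 'html.parser').find('a')

# urljoin resolves the relative href against the base URL.
movie_url = urljoin(BASE_URL, link['href'])
print(movie_url)
```

Each `movie_url` could then be fetched with `requests.get`, ideally with a short pause between requests so the server is not overloaded.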