## **D3TOP - Tópicos em Ciência de Dados (IFSP Campinas)**
**Prof. Dr. Samuel Martins (@iamsamucoding @samucoding @xavecoding)** <br/>
xavecoding: https://youtube.com/c/xavecoding <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<hr/>

# Web Scraping
Web scraping is the process of **extracting data from websites (or other sources, e.g., PDFs)** using _automated software tools_, also known as **web scrapers** or **crawlers**. In the case of websites, these tools access the pages and _extract data_ from their _HTML code_; the data can then be used for purposes such as data analysis, research, and content creation.

Web scraping is commonly used in fields such as e-commerce, finance, and marketing, as it allows companies to gather data on competitors, track prices and inventory levels, and monitor customer sentiment. It can also be used by researchers to collect data for academic studies, and by journalists to uncover newsworthy information.

However, it's important to note that web scraping can _raise ethical_ and _legal concerns_, as _some websites explicitly prohibit scraping_ of their data. Additionally, scraping can put a strain on server resources and can potentially cause website downtime or other technical issues. As such, it's important to use web scraping tools responsibly and within the boundaries of the law and ethical considerations.

## Goal
In this notebook, we will scrape the page with the **[top 250 movies on IMDb](https://www.imdb.com/chart/top/?ref_=nv_mv_250)**. We will extract data from the HTML table and save it into a CSV file.

## Required Packages
Beautiful Soup - https://beautiful-soup-4.readthedocs.io/en/latest/

In [None]:
!pip install beautifulsoup4

## Getting the page

In [None]:
# target website
URL = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'

**HTTP Protocol**

<img src='https://d77da31580fbc8944c00-52b01ccbcfe56047120eec75d9cb2cbd.ssl.cf6.rackcdn.com/695bb7d8-d242-47a8-ab92-bc72225323df/what-is-http---teachoo.jpg' width=400/>

Source: https://d77da31580fbc8944c00-52b01ccbcfe56047120eec75d9cb2cbd.ssl.cf6.rackcdn.com/695bb7d8-d242-47a8-ab92-bc72225323df/what-is-http---teachoo.jpg

In [None]:
# get the page


In [None]:
# HTTP success code - 200
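
A minimal sketch of the request and the status check, assuming the `requests` library is installed. The helper names `fetch_page` and `is_success`, the header values, and the timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

URL = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'

# Assumption: browser-like headers; a site may answer differently without them.
HEADERS = {'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-US,en;q=0.9'}


def fetch_page(url, headers=HEADERS, timeout=10):
    """Send an HTTP GET request and return the response object."""
    return requests.get(url, headers=headers, timeout=timeout)


def is_success(status_code):
    """HTTP 200 (OK) means the request succeeded."""
    return status_code == requests.codes.ok


# In the notebook (requires network access):
# response = fetch_page(URL)
# print(response.status_code, is_success(response.status_code))
# html = response.text
```

The call itself is left commented out so the cell can be read without network access; `response.text` would hold the HTML used in the next steps.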


In [None]:
# save the obtained HTML page into a file just for inspection
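
One possible way to save the page for inspection, sketched with a stand-in HTML string instead of the real response. The file name `imdb_top250.html` and the temporary directory are assumptions chosen so the example runs anywhere.

```python
import tempfile
from pathlib import Path

# Stand-in for the downloaded page (in the notebook this would be the response text).
html = '<html><body><h3>IMDb Charts</h3></body></html>'

# Assumption: write into a temporary directory; any writable folder works.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / 'imdb_top250.html'

# Write with an explicit encoding so non-ASCII titles survive the round trip.
out_path.write_text(html, encoding='utf-8')

saved = out_path.read_text(encoding='utf-8')
print(out_path, len(saved))
```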


## Beautiful Soup Basics
Install [**SelectorGadget**](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) from the Chrome Web Store. It is a browser extension for HTML inspection.

In [None]:
# parsing the html page to a BS object
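
A small sketch of the parsing step. It uses a tiny inline HTML string as a stand-in for the downloaded page, and Python's built-in `'html.parser'` (an assumption; other parsers such as `lxml` could be used instead).

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page.
html = """
<html>
  <head><title>Top 250</title></head>
  <body><h3>IMDb Charts</h3></body>
</html>
"""

# Parse the HTML text into a navigable tree.
soup = BeautifulSoup(html, 'html.parser')

print(type(soup))
print(soup.title.get_text())
```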


<br/>

`find(tag)`: find the _first_ HTML element with the tag `<tag>`.

https://beautiful-soup-4.readthedocs.io/en/latest/#kinds-of-filters

In [None]:
# find h3


In [None]:
# find tr
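
To illustrate `find(tag)`, here is a sketch on a reduced, made-up page that mimics the structure described in this notebook. The variable names follow the cells below (`first_tr`); the sample HTML is an assumption.

```python
from bs4 import BeautifulSoup

# Reduced sample page (assumption), not the real IMDb HTML.
html = """
<h3>IMDb Charts</h3>
<h3>Another heading</h3>
<table>
  <thead><tr><th>Rank &amp; Title</th><th>IMDb Rating</th></tr></thead>
  <tbody><tr><td class="titleColumn">1. Um Sonho de Liberdade</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the FIRST matching element, as a Tag object.
first_h3 = soup.find('h3')
first_tr = soup.find('tr')

# When nothing matches, find() returns None.
missing = soup.find('video')

print(first_h3.get_text())
print(first_tr.prettify())
```

Note that the first `<tr>` in this sample is the header row, which matters later when collecting the movie rows.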


In [None]:
type(first_tr)

In [None]:
print(first_tr.prettify())

In [None]:
# find the (first) tag with id='suggestion-search'


In [None]:
# find the <input> tag with id='suggestion-search'


In [None]:
# find the first <td> tag with class='titleColumn'
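
The three lookups above can be sketched as follows, again on a made-up sample page. `find` accepts attribute filters as keyword arguments; `class_` has a trailing underscore because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Reduced sample page (assumption).
html = """
<input id="suggestion-search" type="text"/>
<table><tbody><tr>
  <td class="titleColumn">1. <a href="/title/tt0111161/">Um Sonho de Liberdade</a></td>
</tr></tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Any tag with the given id.
by_id = soup.find(id='suggestion-search')

# Restrict the search to <input> tags with that id.
input_tag = soup.find('input', id='suggestion-search')

# First <td> whose class attribute contains 'titleColumn'.
first_td = soup.find('td', class_='titleColumn')

print(by_id.name)
print(first_td.a.get_text())
```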


<br/>

`find_all(tag)`: find **all** HTML elements with the tag `<tag>`.

https://beautiful-soup-4.readthedocs.io/en/latest/#kinds-of-filters

In [None]:
# find all <h3> tags


In [None]:
# find all HTML elements with class='titleColumn'


print(len(elements))
print(elements[:3])

In [None]:
# find all <td> tags with class='titleColumn'


print(len(td_list))
print(td_list[:3])
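
A sketch of the `find_all` calls above on a made-up sample page. The second row is hypothetical data added only so that the lists have more than one element.

```python
from bs4 import BeautifulSoup

# Reduced sample page (assumption); the second movie is made up.
html = """
<h3>IMDb Charts</h3>
<h3>Another heading</h3>
<table><tbody>
  <tr><td class="titleColumn">1. Um Sonho de Liberdade</td></tr>
  <tr><td class="titleColumn">2. Example Movie</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list-like ResultSet with every match.
h3_list = soup.find_all('h3')

# Any element with class='titleColumn', regardless of its tag.
elements = soup.find_all(class_='titleColumn')

# Only <td> tags with class='titleColumn'.
td_list = soup.find_all('td', class_='titleColumn')

print(len(h3_list), len(elements), len(td_list))
```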

<br/>

`select(selector)`: select **all** HTML elements that match the CSS selector `selector`.

https://beautiful-soup-4.readthedocs.io/en/latest/#searching-by-css-class

In [None]:
# select all <h3> elements


In [None]:
# select the HTML element with id='suggestion-search'
# note that the return value is a list


In [None]:
# select the <input> tag with id='suggestion-search'
# note that the return value is a list


In [None]:
# select all HTML elements with class='titleColumn'


print(len(elements))
print(elements[:3])

In [None]:
# select all <td> tags with class='titleColumn'


print(len(td_list))
print(td_list[:3])
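
Beautiful Soup exposes CSS-selector search through `select` (and `select_one` for a single element). The sketch below repeats the lookups above with selectors, on a made-up sample page.

```python
from bs4 import BeautifulSoup

# Reduced sample page (assumption); the second movie is made up.
html = """
<input id="suggestion-search" type="text"/>
<h3>IMDb Charts</h3>
<table><tbody>
  <tr><td class="titleColumn">1. Um Sonho de Liberdade</td></tr>
  <tr><td class="titleColumn">2. Example Movie</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')

h3_list = soup.select('h3')                         # by tag
by_id = soup.select('#suggestion-search')           # by id: still a list
input_list = soup.select('input#suggestion-search') # tag + id
elements = soup.select('.titleColumn')              # by class
td_list = soup.select('td.titleColumn')             # tag + class

# select_one() returns the first match (or None) instead of a list.
first_td = soup.select_one('td.titleColumn')

print(len(by_id), len(td_list))
print(first_td.get_text())
```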

### More BS useful functions

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

- `encode_contents`
- `replace_with`
- `unwrap`
- `find_all`
- `find_all_next`
- `find_all_previous`
- `find_next`
- `find_next_sibling`
- `find_next_siblings`
- `find_parent`
- `find_parents`
- `find_previous`
- `find_previous_sibling`
- `find_previous_siblings`
- `get_text`
- `next_sibling`
- `previous_sibling`
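
A short sketch of a few of these functions (`get_text`, `find_next_sibling`, `find_parent`) on a single made-up table row; the remaining ones follow the same navigation idea.

```python
from bs4 import BeautifulSoup

# One sample row (assumption), shaped like the rows discussed in this notebook.
html = """
<table><tbody><tr>
  <td class="titleColumn">1. <a href="/title/tt0111161/">Um Sonho de Liberdade</a> <span class="secondaryInfo">(1994)</span></td>
  <td class="ratingColumn imdbRating"><strong>9.2</strong></td>
</tr></tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')
title_td = soup.find('td', class_='titleColumn')

# get_text: all text inside the element, pieces stripped and joined by a space.
text = title_td.get_text(' ', strip=True)

# find_next_sibling: next element at the same level matching the filter.
rating_td = title_td.find_next_sibling('td')

# find_parent: closest enclosing element matching the filter.
row = title_td.find_parent('tr')

print(text)                   # 1. Um Sonho de Liberdade (1994)
print(rating_td.get_text(strip=True))
print(row.name)
```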

## Scraping the movies
- Inspect the target page to find which HTML element holds the movie data
    + use **SelectorGadget**

### Page HTML overview

<br/>

Get all `<tr>` elements:

In [None]:
# number of elements
print(len(movies_tr))

The list has _one more element_ than expected. Let's see which one it is.

It is the **table header**, which is the _first element_ of the list:

<br/>

To solve this problem, we can simply **delete** the _first element of the list_:
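
The situation can be reproduced on a made-up table with a header and two rows (the second movie is hypothetical): `find_all('tr')` also returns the header row, which is then dropped with `del`.

```python
from bs4 import BeautifulSoup

# Reduced sample table (assumption).
html = """
<table>
  <thead><tr><th>Rank &amp; Title</th><th>IMDb Rating</th></tr></thead>
  <tbody>
    <tr><td class="titleColumn">1. Um Sonho de Liberdade</td></tr>
    <tr><td class="titleColumn">2. Example Movie</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# All <tr> elements: the header row comes first.
movies_tr = soup.find_all('tr')
n_before = len(movies_tr)
has_header = movies_tr[0].find('th') is not None

# Drop the first element (the header row) in place.
del movies_tr[0]

print(n_before, has_header, len(movies_tr))
```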

In [None]:
len(movies_tr)

In [None]:
movies_tr[:2]

<br/>

Alternatively, we can perform a more advanced **search** with a more specific _CSS selector_: `'TAG1 TAG2'` <br/>
- Selects all `TAG2` elements that are "inside" (descendants of) `TAG1` elements

In [1]:
# select all <tr> inside <tbody>
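
A sketch of the descendant selector on a made-up table (the second movie is hypothetical): `'tbody tr'` skips the header row because that row lives inside `<thead>`.

```python
from bs4 import BeautifulSoup

# Reduced sample table (assumption).
html = """
<table>
  <thead><tr><th>Rank &amp; Title</th><th>IMDb Rating</th></tr></thead>
  <tbody>
    <tr><td class="titleColumn">1. Um Sonho de Liberdade</td></tr>
    <tr><td class="titleColumn">2. Example Movie</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Only the <tr> elements that are descendants of <tbody>.
movies_tr = soup.select('tbody tr')

print(len(movies_tr))
```

This assumes the page source actually contains a `<tbody>` tag, as the sample does.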


In [None]:
len(movies_tr)

In [None]:
movies_tr[:2]

In [None]:
for tr in movies_tr[:2]:
    print(tr.prettify())

### Scraping the movies

In [None]:
print(movies_tr[0].prettify())

<br/>

Note that the `<td class="titleColumn">` has the movie's **ranking**, a _link_ (`<a>`) with the **title**, and another HTML element (`<span>`) with the movie's **year**.

<pre>
&lt;td class="titleColumn">
  1.
  &lt;a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">
   Um Sonho de Liberdade
  &lt;/a>
  &lt;span class="secondaryInfo">
   (1994)
  &lt;/span>
 &lt;/td>
</pre>

We need to **extract** this information. Let's make some tests with the _first movie_.

In [None]:


print(first_movie_tr.prettify())

In [None]:
# extract the `ranking & title` column


In [None]:
# strip the HTML text


In [None]:
# extract the ranking, title, and year from the title column <td> by regex

print(ranking_str)
print(title)
print(year_str)
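
One possible way to carry out these steps, sketched on the row shown above. The regular expression is an assumption about the text layout (`ranking. title (year)`); it is not necessarily the pattern used in class.

```python
import re
from bs4 import BeautifulSoup

# The first movie's row, reduced from the HTML shown above.
html = """
<table><tbody><tr>
  <td class="titleColumn">1. <a href="/title/tt0111161/">Um Sonho de Liberdade</a> <span class="secondaryInfo">(1994)</span></td>
</tr></tbody></table>
"""
first_movie_tr = BeautifulSoup(html, 'html.parser').find('tr')

# Extract the `ranking & title` column and strip its text.
title_td = first_movie_tr.find('td', class_='titleColumn')
text = title_td.get_text(' ', strip=True)

# Groups: ranking digits, title (lazy), 4-digit year in parentheses.
match = re.match(r'(\d+)\.\s+(.+?)\s+\((\d{4})\)$', text)
ranking_str, title, year_str = match.groups()

# Convert the numeric strings to integers.
ranking = int(ranking_str)
year = int(year_str)

print(ranking, title, year)
```

Anchoring the year at the end of the string (`$`) keeps titles that contain parentheses intact.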

In [None]:
# converting string ranking to integer


In [None]:
# converting string year to integer


Ok, now we know how to extract that movie information from the HTML.

Let's analyse the HTML of the first movie again. We still need to **extract the *IMDb rating***.

In [None]:
print(first_movie_tr.prettify())

Note that the **IMDb rating** is in the following table column `<td>`:

<pre>
&lt;td class="ratingColumn imdbRating">
  &lt;strong title="9.2 based on 2,704,868 user ratings">
   9.2
  &lt;/strong>
 &lt;/td>
</pre>

To extract the number inside the `<td>`, we just need to select the HTML element and get its **text**.
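
A sketch of the rating extraction on the column shown above; converting the text to `float` is an added step, assumed here so the value can be used numerically later.

```python
from bs4 import BeautifulSoup

# The rating column of the first movie, as shown above.
html = """
<table><tbody><tr>
  <td class="ratingColumn imdbRating">
    <strong title="9.2 based on 2,704,868 user ratings">9.2</strong>
  </td>
</tr></tbody></table>
"""
first_movie_tr = BeautifulSoup(html, 'html.parser').find('tr')

# class_ matches any one of the element's classes.
rating_td = first_movie_tr.find('td', class_='imdbRating')

# Strip surrounding whitespace and convert to a number.
rating = float(rating_td.get_text(strip=True))

print(rating)
```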

<br/>

We're done! We know how to extract a movie's data from the HTML page. Since all movies follow the same _HTML pattern_, we can create a function to extract a movie's information and then call it in a loop.
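
A possible shape for such a function, sketched on two sample rows (the second one is made up). The name `extract_movie`, the returned dictionary, and the regular expression are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Assumed text layout of the title column: "ranking. title (year)".
TITLE_PATTERN = re.compile(r'(\d+)\.\s+(.+?)\s+\((\d{4})\)$')


def extract_movie(tr):
    """Extract ranking, title, year and rating from one movie <tr>."""
    title_td = tr.find('td', class_='titleColumn')
    rating_td = tr.find('td', class_='imdbRating')

    text = title_td.get_text(' ', strip=True)
    ranking_str, title, year_str = TITLE_PATTERN.match(text).groups()

    return {
        'ranking': int(ranking_str),
        'title': title,
        'year': int(year_str),
        'rating': float(rating_td.get_text(strip=True)),
    }


# Sample rows (assumption); the second movie is hypothetical.
html = """
<table><tbody>
<tr><td class="titleColumn">1. <a href="/title/tt0111161/">Um Sonho de Liberdade</a> <span class="secondaryInfo">(1994)</span></td>
<td class="ratingColumn imdbRating"><strong>9.2</strong></td></tr>
<tr><td class="titleColumn">2. <a href="/title/tt0000000/">Example Movie</a> <span class="secondaryInfo">(2000)</span></td>
<td class="ratingColumn imdbRating"><strong>8.5</strong></td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')

movies = [extract_movie(tr) for tr in soup.select('tbody tr')]
for movie in movies:
    print(movie)
```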


<br/>

Now, we will save each movie's variable in a corresponding list.

In [None]:
ranking_list = []
title_list = []
year_list = []
rating_list = []
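
A sketch of how these lists could be filled. It assumes each movie has already been extracted into a dictionary; the `movies` list below is hypothetical sample data, not scraped output.

```python
# Hypothetical extracted movies (the second entry is made up).
movies = [
    {'ranking': 1, 'title': 'Um Sonho de Liberdade', 'year': 1994, 'rating': 9.2},
    {'ranking': 2, 'title': 'Example Movie', 'year': 2000, 'rating': 8.5},
]

ranking_list = []
title_list = []
year_list = []
rating_list = []

# Append each field to its own list, keeping all lists aligned by position.
for movie in movies:
    ranking_list.append(movie['ranking'])
    title_list.append(movie['title'])
    year_list.append(movie['year'])
    rating_list.append(movie['rating'])

print(ranking_list, title_list, year_list, rating_list)
```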

<br/>

Create a pandas `DataFrame` to save the **scraped data** into a _CSV_.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({
    'ranking': ranking_list,
    'title': title_list,
    'year': year_list,
    'rating': rating_list
})

In [None]:
# assumes the 'out/' directory already exists
df.to_csv('out/movies_imbd_250.csv', index=False)

# Exercise
- Extend this web scraping by also extracting the **movie's synopsis** from each movie's page, adding it as a column.