### Web Scraping with Python

Techniques for downloading and parsing HTML pages.
We will use the urllib and BeautifulSoup libraries to extract information from the web.


-> we can collect data from websites for analysis
-> automate repetitive tasks
-> monitor websites dynamically

### 2 main steps

1. Crawler (Data Collector) - accesses websites, reads the HTML, and downloads the page
2. Parser (Content Analyzer) - reads the downloaded HTML and extracts specific information (titles, links, images, etc.).

In Python, we use BeautifulSoup to parse the HTML and find data in it.
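
The two steps can be kept apart in code, which makes the parser easy to try without touching the network. The sketch below is only an illustration: the function names `crawl` and `parse_titles` and the sample HTML are assumptions, not part of the original notebook.

```python
import urllib.request
from bs4 import BeautifulSoup


def crawl(url, timeout=10):
    """Crawler step: download a page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_titles(html):
    """Parser step: return the text of every <h1> found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [h1.get_text(strip=True) for h1 in soup.find_all("h1")]


# The parser is exercised on a literal string, so nothing is downloaded here.
sample_html = "<html><body><h1>Example Domain</h1></body></html>"
titles = parse_titles(sample_html)
print(titles)
```

Calling `parse_titles(crawl(url))` would chain both steps for a real page.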

#### What lies underneath (HTML, CSS, and HTTP)


In [1]:
import urllib.request

url = "https://example.com"
response = urllib.request.urlopen(url)
html = response.read()

print(html[:500])  # print the first 500 bytes of the HTML (read() returns bytes, not str)

b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n    '
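
The `b'...'` prefix in the output above shows that `read()` returns bytes. A minimal sketch of turning them into text follows, assuming UTF-8 because the page declares `<meta charset="utf-8" />`. The variable `raw` is a shortened, hand-copied stand-in for the downloaded bytes.

```python
# Shortened stand-in for the bytes returned by response.read().
raw = b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n'

# Decode bytes into a str; "\n" escapes then print as real line breaks.
text = raw.decode("utf-8")

print(type(raw).__name__, "->", type(text).__name__)
print(text)
```

BeautifulSoup accepts either bytes or str, so decoding is mainly useful when printing or searching the raw page.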


In [None]:
# BeautifulSoup is a Python library for parsing HTML and XML documents
from bs4 import BeautifulSoup

html = "<html><body><h1>Olá, Mundo!</h1></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text) # print the text inside the <h1> tag

Olá, Mundo!
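
Attribute access such as `soup.h1` returns `None` when the tag does not exist, so calling `.text` on it would raise `AttributeError`. A small sketch of guarding against that case (the fallback string is an arbitrary choice):

```python
from bs4 import BeautifulSoup

doc = "<html><body><h1>Olá, Mundo!</h1></body></html>"
doc_soup = BeautifulSoup(doc, "html.parser")

# There is no <h2> in this document, so the lookup yields None.
h2 = doc_soup.h2
subtitle = h2.text if h2 is not None else "(no subtitle)"

print(doc_soup.h1.text)
print(subtitle)
```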


In [3]:
html = "<html><body><p>Primeiro parágrafo</p><p>Segundo parágrafo</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Extract all paragraphs
paragrafos = soup.find_all('p')

for p in paragrafos:
    print(p.text) # print the text inside the <p> tag


Primeiro parágrafo
Segundo parágrafo
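
For comparison, here is a sketch of three related lookups on the same HTML: `find` for the first match, `find_all` with `limit`, and `select` with a CSS selector. Which one to prefer is a matter of taste; the selector `body > p` is just one possible choice.

```python
from bs4 import BeautifulSoup

para_html = "<html><body><p>Primeiro parágrafo</p><p>Segundo parágrafo</p></body></html>"
para_soup = BeautifulSoup(para_html, "html.parser")

first = para_soup.find("p")                  # first matching tag, or None
limited = para_soup.find_all("p", limit=1)   # list with at most one tag
via_css = para_soup.select("body > p")       # list of tags matching a CSS selector

print(first.text)
print(len(limited))
print([p.text for p in via_css])
```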


In [4]:
html = "<html><body><a href='https://example.com'>Clique aqui</a></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Extract the link text and its URL
link = soup.a.text
url = soup.a['href']

print(f"Texto do link: {link}")
print(f"URL do link: {url}")

Texto do link: Clique aqui
URL do link: https://example.com
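
Real pages usually contain many links, some relative and some without an `href` at all. The sketch below collects them as absolute URLs. The base address and the HTML are made up for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://example.com/blog/"  # hypothetical address of the page
links_html = """
<a href="/about">About</a>
<a href="post-1.html">Post 1</a>
<a href="https://example.org/">External</a>
<a>No href here</a>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

links = []
for a in links_soup.find_all("a"):
    href = a.get("href")  # .get returns None when the attribute is missing
    if href is None:
        continue
    links.append(urljoin(base_url, href))  # resolve relative links

print(links)
```

Using `a["href"]` instead of `a.get("href")` would raise `KeyError` on the last anchor.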


In [5]:
html = "<html><body><ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Extract all list items
itens = [li.text for li in soup.find_all('li')]
print(itens) # print all items in the list

['Item 1', 'Item 2', 'Item 3']


In [6]:
html = "<html><body><div class='destaque'>Conteúdo em destaque</div></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Extract the content of the div with the class 'destaque'
conteudo = soup.find('div', class_='destaque').text
print(conteudo) # print the content of the div with the class 'destaque'

Conteúdo em destaque
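
Two details of class lookups are worth seeing in action: `class_` matches an element that carries several classes, and `find` returns `None` when nothing matches. The HTML below, including the class names `box` and `rodape`, is an invented example.

```python
from bs4 import BeautifulSoup

cls_html = (
    "<div class='box destaque'>Conteúdo em destaque</div>"
    "<div class='box'>Outro conteúdo</div>"
)
cls_soup = BeautifulSoup(cls_html, "html.parser")

hit = cls_soup.find("div", class_="destaque")      # matches one of several classes
missing = cls_soup.find("div", class_="rodape")    # no such class: None
via_css = cls_soup.select_one("div.box.destaque")  # CSS form requiring both classes

print(hit["class"])
print(missing)
print(via_css.text)
```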


In [7]:
html = '''
<html>
    <body>
        <div class="noticia">Notícia 1</div>
        <div class="noticia">Notícia 2</div>
        <div class="noticia">Notícia 3</div>
    </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Extract every div with the class 'noticia'
noticias = [div.text for div in soup.find_all('div', class_='noticia')]
print(noticias) # print all news

['Notícia 1', 'Notícia 2', 'Notícia 3']


In [None]:
html = """
<table>
    <tr><th>Nome</th><th>Idade</th></tr>
    <tr><td>João</td><td>25</td></tr>
    <tr><td>Maria</td><td>30</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Extract the table data
linhas = soup.find_all('tr') # get all rows in the table

for linha in linhas:
    colunas = linha.find_all(['th', 'td']) # get all columns in the row
    dados = [col.text for col in colunas]
    print(dados) # print the data in the table

['Nome', 'Idade']
['João', '25']
['Maria', '30']
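
The rows above are plain lists. One possible next step, sketched here under the assumption that the first row holds the headers, is to turn each data row into a dictionary keyed by column name.

```python
from bs4 import BeautifulSoup

table_html = """
<table>
    <tr><th>Nome</th><th>Idade</th></tr>
    <tr><td>João</td><td>25</td></tr>
    <tr><td>Maria</td><td>30</td></tr>
</table>
"""
table_soup = BeautifulSoup(table_html, "html.parser")

rows = table_soup.find_all("tr")
# Assumption: the first row contains the column headers.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    records.append(dict(zip(header, cells)))

print(records)
```

Values stay strings; converting `Idade` to `int` would be a separate step.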


In [None]:
# Web scraping the site quotes.toscrape.com

# Site with quotes from famous authors
url = "http://quotes.toscrape.com"
html = urllib.request.urlopen(url).read()

# Create the BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Extract the quotes
for quote in soup.find_all('span', class_='text'):
    print(quote.text) # print the quote


“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
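
The download above assumes the request always succeeds. A more defensive variant might set a timeout, send an explicit User-Agent, and handle HTTP errors. This is a sketch only: the helper names `build_request` and `fetch` and the User-Agent string are assumptions, and the final lines build a request without sending it.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="scraping-tutorial/0.1"):
    """Create a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:   # server answered with 4xx/5xx
        print(f"HTTP error {err.code} for {url}")
    except urllib.error.URLError as err:    # DNS failure, refused connection, ...
        print(f"Could not reach {url}: {err.reason}")
    return None


# Build (but do not send) a request to inspect what would go out.
request = build_request("http://quotes.toscrape.com")
print(request.full_url, request.get_header("User-agent"))
```

`HTTPError` is caught first because it is a subclass of `URLError`.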


In [11]:
# Extract the authors
for author in soup.find_all('small', class_='author'):
    print(author.text) # print the author

Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin


In [14]:
# Extract all tags used in the quotes

tags = set()
for tag in soup.find_all('a', class_='tag'):
    tags.add(tag.text)

for tag in tags:
    print(tag) # print each unique tag (set iteration order is arbitrary)

miracles
miracle
love
choices
paraphrased
obvious
humor
edison
deep-thoughts
friendship
life
simile
value
inspirational
reading
books
change
aliteracy
abilities
classic
adulthood
truth
friends
world
misattributed-eleanor-roosevelt
thinking
success
be-yourself
failure
live
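
A `set` has no guaranteed iteration order, so the listing above may come out differently on another run. Sorting gives a stable result. The sketch uses a small hand-picked subset of the tags.

```python
# A few of the tags from the output above, stored in a set.
sample_tags = {"love", "life", "humor", "books"}

# sorted() returns a new list in alphabetical order.
for tag in sorted(sample_tags):
    print(tag)
```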


In [23]:
top_tags = soup.find_all('span', class_='tag-item')
for tag in top_tags:
    print(tag.text) # print the top tags

# turn into a list, stripping the surrounding whitespace

top_tags = [tag.text.strip() for tag in soup.find_all('span', class_='tag-item')]
print(top_tags)



love


inspirational


life


humor


books


reading


friendship


friends


truth


simile

['love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']
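
The blank lines in the first output come from whitespace inside each `tag-item` span. The sketch below reproduces the effect on a hand-written snippet. Its structure (an `<a class="tag">` inside the span) is an assumption about the page, not something copied from it.

```python
from bs4 import BeautifulSoup

# Hand-written snippet imitating one assumed tag-item entry.
tag_html = """
<span class="tag-item">
    <a class="tag" href="/tag/love/">love</a>
</span>
"""
tag_soup = BeautifulSoup(tag_html, "html.parser")
item = tag_soup.find("span", class_="tag-item")

print(repr(item.text))                  # includes the newlines and indentation
print(repr(item.get_text(strip=True)))  # whitespace removed
print(item.a["href"])                   # the link target of the tag
```

`get_text(strip=True)` and `.text.strip()` give the same result here.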


In [28]:
import pandas as pd

# Extract the quotes
quotes = [quote.text for quote in soup.find_all('span', class_='text')]
# Extract the authors
authors = [author.text for author in soup.find_all('small', class_='author')]

# Create a DataFrame
df = pd.DataFrame({'quote': quotes, 'author': authors})
print(df)

                                               quote             author
0  “The world as we have created it is a process ...    Albert Einstein
1  “It is our choices, Harry, that show what we t...       J.K. Rowling
2  “There are only two ways to live your life. On...    Albert Einstein
3  “The person, be it gentleman or lady, who has ...        Jane Austen
4  “Imperfection is beauty, madness is genius and...     Marilyn Monroe
5  “Try not to become a man of success. Rather be...    Albert Einstein
6  “It is better to be hated for what you are tha...         André Gide
7  “I have not failed. I've just found 10,000 way...   Thomas A. Edison
8  “A woman is like a tea bag; you never know how...  Eleanor Roosevelt
9  “A day without sunshine is like, you know, nig...       Steve Martin
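
The two lists above are matched by position only, which works as long as every quote has exactly one author element. An alternative is to walk each quote container and read its fields together. The snippet below is hand-written, and its structure (a `div` with class `quote` wrapping the text, author, and tags) is an assumption about the page layout.

```python
from bs4 import BeautifulSoup

# Hand-written snippet imitating the assumed structure of the page.
page_html = """
<div class="quote">
    <span class="text">“Quote one.”</span>
    <span>by <small class="author">Author A</small></span>
    <div class="tags"><a class="tag">life</a><a class="tag">love</a></div>
</div>
<div class="quote">
    <span class="text">“Quote two.”</span>
    <span>by <small class="author">Author B</small></span>
    <div class="tags"><a class="tag">humor</a></div>
</div>
"""
page_soup = BeautifulSoup(page_html, "html.parser")

quote_records = []
for block in page_soup.find_all("div", class_="quote"):
    quote_records.append({
        "quote": block.find("span", class_="text").get_text(strip=True),
        "author": block.find("small", class_="author").get_text(strip=True),
        "tags": [a.get_text(strip=True) for a in block.find_all("a", class_="tag")],
    })

for record in quote_records:
    print(record)
```

A list of dictionaries like `quote_records` can be passed directly to `pd.DataFrame`.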


In [29]:
# Write the DataFrame to a CSV file
df.to_csv('quotes.csv', index=False)
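
Several quotes contain commas, so the fields must be quoted for the file to stay readable. The sketch below shows the same idea with the standard `csv` module and an in-memory buffer instead of a file, then reads the rows back to check the round trip. The two rows are copied from the output above.

```python
import csv
import io

rows = [
    {"quote": "“Try not to become a man of success. Rather become a man of value.”",
     "author": "Albert Einstein"},
    {"quote": "“A day without sunshine is like, you know, night.”",
     "author": "Steve Martin"},
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["quote", "author"])
writer.writeheader()
writer.writerows(rows)  # fields containing commas are quoted automatically

buffer.seek(0)          # rewind before reading
loaded = list(csv.DictReader(buffer))
print(loaded == rows)
```

When writing to a real file with the `csv` module, opening it with `newline=""` and `encoding="utf-8"` avoids blank lines on Windows and keeps the curly quotes intact.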