<a href="https://colab.research.google.com/github/nadjapereira/python-applications/blob/master/aula6_parte3_tags.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BeautifulSoup - Tags

In [0]:
import requests
from bs4 import BeautifulSoup
from requests.exceptions import ConnectionError

In [0]:
def download_html(url, numero_tentativas=2):
    """Download a page and return its HTML, or None if every attempt fails."""
    print("Downloading page:", url)
    try:
        req = requests.get(url)
        if req.status_code != 200:
            if numero_tentativas > 0:
                print("Download failed. Status code:", req.status_code)
                print("\nRetrying:")
                return download_html(url, numero_tentativas - 1)
            print("Maximum number of attempts exceeded. Status code: {}".format(req.status_code))
            return None
        return req.text
    except ConnectionError as e:
        # Return None explicitly so callers can check for a failed download
        print("Download error:", e)
        return None

In [0]:
html = download_html("http://pythonscraping.com/pages/page3.html")

Downloading page: http://pythonscraping.com/pages/page3.html


In [0]:
bs = BeautifulSoup(html, "html.parser")

In [0]:
bs.find("span")

<span class="excitingNote">Now with super-colorful bell peppers!</span>

In [0]:
bs.find_all(name="span")

[<span class="excitingNote">Now with super-colorful bell peppers!</span>,
 <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>,
 <span class="excitingNote">Also hand-painted by trained monkeys!</span>,
 <span class="excitingNote">Or maybe he's only resting?</span>,
 <span class="excitingNote">Keep your friends guessing!</span>]

Besides the ```name``` parameter used above, the following parameters are available:
- ```attrs={}``` – a dictionary of attributes that the tag must have. It is especially useful when the attribute name is a Python reserved word, such as ```class```.
- ```recursive``` – when True (the default), the search descends into children, grandchildren, and so on, looking for tags that match the parameters. When False, only the direct children are examined.
- ```text``` – searches for matches based on the text content of the tags (recent versions of BeautifulSoup name this parameter ```string```).
- ```limit``` – used with ```find_all``` (also available under the legacy name ```findAll```); returns only the first n matches on the page.
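
As a minimal sketch of these parameters, the snippet below runs ```find_all``` against a small inline HTML string. The markup and the names ```sample_html```, ```sample``` and ```box``` are assumptions made for illustration; they are not part of the page downloaded above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the parameters.
sample_html = """
<div id="box">
  <p class="note">first</p>
  <section><p class="note">second</p></section>
  <p class="plain">third</p>
</div>
"""
sample = BeautifulSoup(sample_html, "html.parser")
box = sample.find("div", attrs={"id": "box"})

# attrs: match on attributes; handy because `class` is a reserved word in Python.
notes = box.find_all("p", attrs={"class": "note"})

# recursive=False: look only at the direct children of `box`.
direct = box.find_all("p", recursive=False)

# limit: stop after the first n matches.
first_only = box.find_all("p", limit=1)

# string (named `text` in older releases): match on the text content.
by_text = box.find_all("p", string="third")

print([p.text for p in notes])       # ['first', 'second']
print([p.text for p in direct])      # ['first', 'third']
print([p.text for p in first_only])  # ['first']
print([p.text for p in by_text])     # ['third']
```

Note that the nested paragraph ("second") is found by the default recursive search but not when ```recursive=False```.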

In [0]:
for item in bs.find_all("span"):
    print("-->", item.text)

--> Now with super-colorful bell peppers!
--> 8 entire dolls per set! Octuple the presents!
--> Also hand-painted by trained monkeys!
--> Or maybe he's only resting?
--> Keep your friends guessing!


## Dealing with children and other descendants

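Children sit exactly one level below their parent tag, while descendants sit at any depth below it. If the loop over the ```giftList``` table in this section used the ```descendants``` attribute instead of ```children```, the tags nested inside each row (img, span, among others) would be returned as well. **It is very important to tell children and descendants apart!**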
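
The difference can be sketched on a one-row table. The markup and the names ```tiny``` and ```table``` below are assumptions chosen for illustration, not the downloaded page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-row table, used only for illustration.
tiny = BeautifulSoup(
    "<table><tr><td>Cost <span>$15.00</span></td></tr></table>",
    "html.parser",
)
table = tiny.find("table")

# .children yields only the nodes directly below <table>.
child_tags = [node.name for node in table.children if isinstance(node, Tag)]

# .descendants walks the whole subtree, including nested tags.
descendant_tags = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_tags)       # ['tr']
print(descendant_tags)  # ['tr', 'td', 'span']
```

Both attributes also yield text nodes, which is why the sketch filters on ```Tag``` before reading ```name```.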

To list the product rows of the ```giftList``` table, iterate over ```children```, which returns an iterator over every direct child of a tag.

In [0]:
bs.find("table", {"id":"giftList"}).children

<list_iterator at 0x188d21950b8>

In [0]:
# Retrieves every row, including the title row
for filho in bs.find("table", {"id":"giftList"}).children:
    print(filho)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


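## Dealing with siblings
To display only the product rows of the table, iterate over the ```next_siblings``` of the first ```tr```. The title row itself is skipped, because a tag is never its own sibling.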
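
As a small sketch of sibling navigation in both directions, assuming a three-row inline table (the markup and the ```header``` and ```last_row``` names are made up for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical three-row table, used only for illustration.
tiny = BeautifulSoup(
    "<table>"
    "<tr id='header'><th>Item</th></tr>"
    "<tr id='gift1'><td>A</td></tr>"
    "<tr id='gift2'><td>B</td></tr>"
    "</table>",
    "html.parser",
)
header = tiny.find("tr", {"id": "header"})

# next_siblings starts after `header`, so the header row itself is skipped.
following = [row["id"] for row in header.next_siblings if isinstance(row, Tag)]

# previous_siblings walks backwards from a given row.
last_row = tiny.find("tr", {"id": "gift2"})
preceding = [row["id"] for row in last_row.previous_siblings if isinstance(row, Tag)]

print(following)  # ['gift1', 'gift2']
print(preceding)  # ['gift1', 'header']
```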

In [0]:
for irmao in bs.find("table", {"id":"giftList"}).tr.next_siblings:
    print(irmao)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

### Accessing the elements and structuring them with Pandas


In [0]:
import pandas as pd

In [0]:
# Collect the table's children: newline strings alternate with the <tr> tags
aux = []

for filho in bs.find("table", {"id":"giftList"}).children:
    aux.append(filho)

In [0]:
# Keep only the odd positions, which hold the <tr> tags (even positions are newlines)
aux_final = []
for i in range(1, len(aux), 2):
    aux_final.append(aux[i])
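
Taking every second element relies on the newline strings alternating perfectly with the ```tr``` tags. A possible alternative, sketched here on an inline table, is to keep only the children that are ```Tag``` objects. The markup and the names ```tiny```, ```table``` and ```rows``` are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with a newline between rows.
tiny = BeautifulSoup(
    "<table id='giftList'>\n<tr><th>Item Title</th></tr>\n"
    "<tr><td>Vegetable Basket</td></tr>\n</table>",
    "html.parser",
)
table = tiny.find("table", {"id": "giftList"})

all_children = list(table.children)
# Keep only real tags, dropping the newline strings between rows.
rows = [child for child in all_children if isinstance(child, Tag)]

print(len(all_children))  # 5
print(len(rows))          # 2
```

On this sample the result matches the odd-index selection, but it would keep working if the whitespace between rows were missing.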

In [0]:
colunas = [th.text.replace('\n', '') for th in aux_final[0].find_all('th')]
print(colunas)

In [0]:
estrutura = {}
# Drop the Image column
for col in colunas[:-1]:
    estrutura[col] = []

In [0]:
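for item in aux_final[1:]:
    # Text of each cell in the row; a separate name avoids overwriting the `aux` list above
    celulas = [td.text.replace('\n', '') for td in item.find_all('td')]
    estrutura['Item Title'].append(celulas[0])
    estrutura['Description'].append(celulas[1])
    # Strip the currency symbol and the thousands separator before converting to float
    preco = celulas[2]
    for c in ("$", ","):
        preco = preco.replace(c, '')
    estrutura['Cost'].append(float(preco))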
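
The price clean-up in the loop above can be isolated and checked on its own. The helper name ```parse_price``` is hypothetical, and the sketch assumes US-style prices with a ```$``` prefix and ```,``` as the thousands separator.

```python
def parse_price(raw):
    """Convert a price string such as '$10,000.52' into a float."""
    cleaned = raw.strip()
    # Remove the currency symbol and the thousands separators.
    for symbol in ("$", ","):
        cleaned = cleaned.replace(symbol, "")
    return float(cleaned)


print(parse_price("$15.00"))          # 15.0
print(parse_price("\n$10,000.52\n"))  # 10000.52
```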

In [0]:
estrutura

In [0]:
df = pd.DataFrame(estrutura)

In [0]:
df

In [0]:
df.describe()