# Web scraper

### Hello World

In [None]:
from bs4 import BeautifulSoup

In [None]:
# read the file inside a context manager so the handle is closed afterwards
with open('hello_world.html') as f:
    raw_html = f.read()

In [None]:
raw_html

In [None]:
html = BeautifulSoup(raw_html, 'html.parser')

In [None]:
for p in html.select('p'):
    print(p.text)

In [None]:
for p in html.select('p'):
    # p.get('id') returns None instead of raising KeyError when a <p> has no id
    if p.get('id') == 'walrus':
        print(p.text)
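The same filter can also be expressed in the CSS selector itself. The sketch below is self-contained: `sample_html` is a hypothetical stand-in for `hello_world.html`, whose real content may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for hello_world.html; the real file may look different.
sample_html = """
<html><body>
  <p id="walrus">I am the walrus</p>
  <p>Just a paragraph without an id</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# 'p#walrus' matches only <p> elements whose id attribute is "walrus",
# so no explicit if-check is needed inside the loop.
walrus_texts = [p.text for p in soup.select('p#walrus')]
print(walrus_texts)
```

Because the selector does the filtering, paragraphs without an `id` are never inspected.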

## Import data from [NASDAQ](https://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&sortname=marketcap&sorttype=1&page=1)

## 1. Import all the required libraries

In [None]:
from time import time, sleep
from random import randint
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline 

## 2. Make the HTTP GET request using the `requests` library
[Documentation](https://requests.readthedocs.io/pt_BR/latest/user/quickstart.html)

In [None]:
URL = 'https://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&sortname=marketcap&sorttype=1&page=1'

In [None]:
response = requests.get(URL)

In [None]:
response.status_code

In [None]:
print(response.text[20000:25000])
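Checking `status_code` by eye works in a notebook, but a request can also hang or fail silently. A minimal sketch of a more defensive fetch follows; `fetch_html` and its `get` parameter are hypothetical names introduced here, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `get` is a hypothetical injection point (defaults to requests.get) so the
    function can be exercised without network access.
    """
    # Without a timeout, requests may wait indefinitely for a response.
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

In the notebook this could be called as `fetch_html(URL)` in place of reading `response.text` directly.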

## 3. Parse the HTML
We will use the `BeautifulSoup` library for this step ([documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)).

1. Right-click the HTML element on the page and select `Inspect`
2. 

In [None]:
page_html = BeautifulSoup(response.text, 'html.parser')

```
<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>
```

<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>

ref: https://www.w3schools.com/html/html_tables.asp
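To make the table structure concrete, here is a small sketch that parses the example table above. It reads the header from the `<th>` cells and the data from the `<td>` cells; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# The example table from above, condensed to one row per line.
table_html = """
<table style="width:100%">
  <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>
"""

sample_table = BeautifulSoup(table_html, 'html.parser').find('table')
sample_rows = sample_table.find_all('tr')

# The first <tr> holds <th> header cells; the remaining rows hold <td> data cells.
header = [th.text.strip() for th in sample_rows[0].find_all('th')]
records = [[td.text.strip() for td in row.find_all('td')] for row in sample_rows[1:]]

print(header)
print(records)
```

Note that calling `find_all('td')` on the header row would return an empty list, because that row contains only `<th>` cells.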

In [None]:
# Select the table that holds all the stocks on page 1
html = page_html.find('table', attrs={'id':'CompanylistResults'})
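`find` returns `None` when nothing matches, so a later `find_all` call would fail with an `AttributeError` if the page does not contain the expected table. A minimal sketch of a guard follows; `find_table` is a hypothetical helper, and `demo_page` is a made-up miniature page used only to exercise it.

```python
from bs4 import BeautifulSoup


def find_table(page, table_id):
    """Return the <table> with the given id, or raise a descriptive error."""
    table = page.find('table', attrs={'id': table_id})
    if table is None:
        # Fail early with a clear message instead of a later AttributeError.
        raise ValueError('no <table> with id={!r} in the page'.format(table_id))
    return table


# Hypothetical miniature page, not the real NASDAQ markup.
demo_page = BeautifulSoup(
    '<table id="CompanylistResults"><tr><td>x</td></tr></table>', 'html.parser'
)
found = find_table(demo_page, 'CompanylistResults')
print(found.get('id'))
```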

Let's collect all the `<tr>` tags

In [None]:
rows = html.find_all('tr')

Let's test the first data row (`rows[0]` is the header row)

In [None]:
row = rows[1]
print(row)

In [None]:
cols = row.find_all('td')
print('number of columns: {}'.format(len(cols)))
print('\n\n')
for i, col in enumerate(cols):
    print('column {} content: {}'.format(i, col))

In other words, we have 7 columns.

Now let's use the following function to strip whitespace:


In [None]:
help(str.strip)

In [None]:
for i, col in enumerate(cols):
    print('column {} content: {}'.format(i, col.text.strip()))

In [None]:
cols = [ele.text.strip() for ele in cols]

In [None]:
# drop the cells that are empty after stripping
[col for col in cols if col]
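The two steps (strip, then drop empties) can be seen in isolation on plain strings. The cell values below are hypothetical and stand in for `col.text` values.

```python
# Hypothetical raw cell texts, padded with the kind of whitespace HTML often carries.
raw_cells = ['\n  Example Corp\t', '  EXM ', '\n\n', '']

# str.strip() removes leading and trailing whitespace only; inner spaces are kept.
stripped = [cell.strip() for cell in raw_cells]

# An empty string is falsy, so this keeps only the cells that still have content.
non_empty = [cell for cell in stripped if cell]

print(stripped)
print(non_empty)
```

One caveat with this approach: if a cell in the middle of a row is empty, dropping it shifts the remaining values one column to the left.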

In [None]:
# now let's chain the whole process and store the result in the list `data`
data = []
html = page_html.find('table', attrs={'id':'CompanylistResults'})
rows = html.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col for col in cols if col])

In [None]:
data

In [None]:
# 1. Load `data` into a pandas DataFrame
acoes = pd.DataFrame(data)

In [None]:
acoes = acoes.dropna(subset=[1])

In [None]:
acoes.columns = ['Name', 'Symbol','Market Cap','Country','IPO Year','Sector','Sub Sector']
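As an alternative sketch, incomplete rows can be filtered out before building the DataFrame, and the column names passed to the constructor. The rows below are hypothetical and only mimic the shape of the scraped `data`; `complete_rows` and `acoes_demo` are illustrative names.

```python
import pandas as pd

columns = ['Name', 'Symbol', 'Market Cap', 'Country', 'IPO Year', 'Sector', 'Sub Sector']

# Hypothetical rows shaped like the scraped `data` list; values are made up.
data = [
    [],  # header row: it has <th> cells, so find_all('td') found nothing
    ['Example Corp', 'EXM', '$1B', 'United States', '2001', 'Technology', 'Software'],
    ['Sample Inc', 'SMP', '$2B', 'United States', '1999', 'Technology', 'Hardware'],
]

# Keep only rows with exactly one value per expected column.
complete_rows = [row for row in data if len(row) == len(columns)]
acoes_demo = pd.DataFrame(complete_rows, columns=columns)

print(acoes_demo.shape)
```

This drops the empty header row without `dropna`, and it also discards any row that lost a cell during cleaning rather than loading it misaligned.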

In [None]:
acoes.info()

In [None]:
acoes.head()

In [None]:
acoes.to_csv('acoes.csv')

In [None]:
# challenge: we only did this for the first page
# ref: https://towardsdatascience.com/web-scraping-for-beginners-beautifulsoup-scrapy-selenium-twitter-api-f5a6d0589ea6
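One possible approach to the challenge is sketched below. It assumes that the `page` query parameter in the URL selects the result page and that pages are numbered from 1; neither is confirmed here. `page_urls`, `scrape_pages`, and the `fetch_rows` callable are hypothetical names. `fetch_rows` stands for the request and parsing steps shown earlier, wrapped in a function that takes a URL and returns a list of rows.

```python
from random import randint
from time import sleep

# Same URL as above, with the page number turned into a placeholder.
BASE_URL = ('https://www.nasdaq.com/screening/companies-by-industry.aspx'
            '?industry=Technology&sortname=marketcap&sorttype=1&page={page}')


def page_urls(last_page):
    """Build the URLs for pages 1..last_page (assumes 1-based page numbers)."""
    return [BASE_URL.format(page=page) for page in range(1, last_page + 1)]


def scrape_pages(urls, fetch_rows, min_wait=1, max_wait=3):
    """Call fetch_rows(url) for each URL, pausing a random time between calls."""
    all_rows = []
    for i, url in enumerate(urls):
        if i > 0:
            # A random pause between requests avoids hitting the server in a burst.
            sleep(randint(min_wait, max_wait))
        all_rows.extend(fetch_rows(url))
    return all_rows
```

The combined list returned by `scrape_pages` can then be loaded into a DataFrame in the same way as the single-page `data`.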