# Web scraper

### Hello World

In [1]:
from bs4 import BeautifulSoup

In [2]:
# read the file inside a context manager so the handle is always closed
with open('hello_world.html') as f:
    raw_html = f.read()

In [3]:
raw_html

'<!DOCTYPE html>\n<html>\n\t<head>\n\t\t  <title>Hello World!</title>\n\t</head>\n\t<body>\n\t\t<p id="eggman"> I am the egg man </p>\n\t\t<p id="walrus"> I am the walrus </p>\n\t</body>\n</html>\n'

In [4]:
html = BeautifulSoup(raw_html, 'html.parser')

In [5]:
for p in html.select('p'):
    print(p.text)

 I am the egg man 
 I am the walrus 


In [6]:
for p in html.select('p'):
    # .get avoids a KeyError for <p> tags that have no id attribute
    if p.get('id') == 'walrus':
        print(p.text)

 I am the walrus 
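
The same lookup can also be written as a CSS selector instead of a loop. The sketch below is only an illustration: it inlines the markup rather than reading `hello_world.html`, and it assumes each `id` value is unique within the page.

```python
from bs4 import BeautifulSoup

# Inline copy of the Hello World markup, so the example runs without the file.
raw_html = """
<html>
  <body>
    <p id="eggman"> I am the egg man </p>
    <p id="walrus"> I am the walrus </p>
  </body>
</html>
"""

html = BeautifulSoup(raw_html, 'html.parser')

# '#walrus' matches the element whose id attribute is "walrus";
# select_one returns the first match, or None when nothing matches.
walrus = html.select_one('#walrus')
walrus_text = walrus.text.strip() if walrus is not None else None
print(walrus_text)
```

Checking for `None` before reading `.text` keeps the code from failing when the element is missing from the page.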


## Importing data from [NASDAQ](https://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&sortname=marketcap&sorttype=1&page=1)

## 1. Import all the required libraries

In [7]:
from time import time, sleep
from random import randint
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline 

## 2. Make the HTTP GET request using the `requests` library
[Documentation](https://requests.readthedocs.io/pt_BR/latest/user/quickstart.html)

In [8]:
URL = 'https://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&sortname=marketcap&sorttype=1&page=1'

In [9]:
response = requests.get(URL)

In [10]:
response.status_code

200
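
A status code of 200 means this request succeeded, but a scraper should not assume that every request will. A minimal sketch of a more defensive request follows. The helper name `fetch_html`, the 10-second timeout, and the optional `session` parameter are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors."""
    if session is None:
        session = requests.Session()
    # The timeout stops the call from hanging forever on a stalled server.
    response = session.get(url, timeout=timeout)
    # raise_for_status raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# page_text = fetch_html(URL)
```

Accepting a session object also makes the helper easy to exercise without network access, since any object with a compatible `get` method can be passed in.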

In [13]:
print(response.text[20000:25000])


            <div class="floatR half">
              <ul>
                <li>
                  <span class="fontS14px"><b>Stock Analysis</b></span>
                  <a href="https://www.nasdaq.com/quotes/analyst-research.aspx" onclick="handleNQClick(this,'nav:quotes:analyst-research')">Analyst Research</a>
                  <a href="https://www.nasdaq.com/quotes/stock-guru-analysis.aspx" onclick="handleNQClick(this,'nav:quotes:guru-analysis')">Guru Analysis</a>
                  <a href="https://www.nasdaq.com/quotes/stock-reports.aspx" onclick="handleNQClick(this,'nav:quotes:stock-reports')">Stock Reports</a>
                  <a href="https://www.nasdaq.com/quotes/business-competitors.aspx" onclick="handleNQClick(this,'nav:quotes:competitors')">Competitors</a>
                </li>
                <li>
                  <span class="fontS14px"><b>Fundamentals</b></span>
                  <a href="https://www.nasdaq.com/quotes/company-financials.aspx" onclick="handleNQC

## 3. Parse the HTML
We will use the `BeautifulSoup` library for this step ([documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)).

1. Right-click the element on the page and select the `Inspect` option
2. 

In [14]:
page_html = BeautifulSoup(response.text, 'html.parser')

```
<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>
```

<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>

ref: https://www.w3schools.com/html/html_tables.asp
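
To connect this markup to the parsing steps that follow, here is a small sketch that reads the example table above with `BeautifulSoup`. It inlines the same table, with each row condensed to one line, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

table_html = """
<table style="width:100%">
  <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>
"""

table = BeautifulSoup(table_html, 'html.parser').find('table')
rows = table.find_all('tr')

# The first <tr> holds the <th> header cells; the others hold <td> data cells.
header = [th.text.strip() for th in rows[0].find_all('th')]
records = [[td.text.strip() for td in row.find_all('td')] for row in rows[1:]]

print(header)
print(records)
```

Each `<tr>` is one row, `<th>` cells carry the column names, and `<td>` cells carry the values, which is the structure the NASDAQ table is read with below.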

In [15]:
# Select the table that lists all the stocks on page 1
html = page_html.find('table', attrs={'id':'CompanylistResults'})

Let's collect all the `<tr>` tags.

In [16]:
rows = html.find_all('tr')

Let's inspect the first data row. `rows[0]` is skipped because it contains no `<td>` cells.

In [17]:
row = rows[1]
print(row)

<tr>
<td><a href="http://www.microsoft.com" rel="nofollow" target="_blank">Microsoft Corporation</a></td>
<td>
<h3>
<a href="https://www.nasdaq.com/symbol/msft">
						            MSFT</a>
</h3>
</td>
<td style="">$987.97B</td>
<td style="display:none"></td>
<td>United States</td>
<td>1986</td>
<td style="width:105px">Computer Software: Prepackaged Software</td>
</tr>


In [18]:
cols = row.find_all('td')
print('number of columns: {}'.format(len(cols)))
print('\n\n')
for i,col in enumerate(cols):
    print('column {} content: {}'.format(i, col))

number of columns: 7



column 0 content: <td><a href="http://www.microsoft.com" rel="nofollow" target="_blank">Microsoft Corporation</a></td>
column 1 content: <td>
<h3>
<a href="https://www.nasdaq.com/symbol/msft">
						            MSFT</a>
</h3>
</td>
column 2 content: <td style="">$987.97B</td>
column 3 content: <td style="display:none"></td>
column 4 content: <td>United States</td>
column 5 content: <td>1986</td>
column 6 content: <td style="width:105px">Computer Software: Prepackaged Software</td>


In other words, each row has 7 columns, one of which is hidden and empty.

Now let's use the following function to strip the surrounding whitespace from each cell.


In [19]:
help(str.strip)

Help on method_descriptor:

strip(...)
    S.strip([chars]) -> str
    
    Return a copy of the string S with leading and trailing
    whitespace removed.
    If chars is given and not None, remove characters in chars instead.



In [20]:
for i,col in enumerate(cols):
    print('column {} content: {}'.format(i, col.text.strip()))

column 0 content: Microsoft Corporation
column 1 content: MSFT
column 2 content: $987.97B
column 3 content: 
column 4 content: United States
column 5 content: 1986
column 6 content: Computer Software: Prepackaged Software


In [21]:
cols = [ele.text.strip() for ele in cols]

In [22]:
# drop the empty strings (such as the hidden column 3)
[col for col in cols if col]

['Microsoft Corporation',
 'MSFT',
 '$987.97B',
 'United States',
 '1986',
 'Computer Software: Prepackaged Software']

In [23]:
# now let's run the whole process in one go and store the result in the list `data`
data = []
html = page_html.find('table', attrs={'id':'CompanylistResults'})
rows = html.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col for col in cols if col])
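
Two side effects of the loop above are worth noting: rows without company data (the header row and the link-only rows) still end up in `data`, and dropping empty cells can shift the remaining values into the wrong column. One possible alternative is sketched below. The function name `parse_company_rows`, the constant `EXPECTED_CELLS`, and the simplified sample markup are assumptions made for illustration; the real page may differ.

```python
from bs4 import BeautifulSoup

EXPECTED_CELLS = 7  # number of <td> cells seen in the Microsoft row above


def parse_company_rows(table):
    """Return one list per company row, keeping empty cells in place."""
    records = []
    for row in table.find_all('tr'):
        cells = [td.text.strip() for td in row.find_all('td')]
        # Rows with a different cell count (header, link-only) are skipped.
        if len(cells) == EXPECTED_CELLS:
            records.append(cells)
    return records


# Simplified, hypothetical stand-in for the real table.
sample = """
<table id="CompanylistResults">
  <tr><th>Name</th><th>Symbol</th></tr>
  <tr><td>Microsoft Corporation</td><td><h3><a> MSFT</a></h3></td>
      <td>$987.97B</td><td style="display:none"></td><td>United States</td>
      <td>1986</td><td>Computer Software: Prepackaged Software</td></tr>
  <tr><td colspan="7">MSFT Stock Quote</td></tr>
</table>
"""

soup = BeautifulSoup(sample, 'html.parser')
table = soup.find('table', attrs={'id': 'CompanylistResults'})
records = parse_company_rows(table)
print(records)
```

Because the empty hidden cell is kept as `''`, every record has the same length and each position always refers to the same column.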

In [24]:
data

[[],
 ['Microsoft Corporation',
  'MSFT',
  '$987.97B',
  'United States',
  '1986',
  'Computer Software: Prepackaged Software'],
 ['MSFT Stock Quote\n\n\r\n\t\t\t\t                MSFT Ratings\n\n\r\n\t\t\t\t                MSFT Stock Report'],
 ['Apple Inc.',
  'AAPL',
  '$874.57B',
  'United States',
  '1980',
  'Computer Manufacturing'],
 ['AAPL Stock Quote\n\n\r\n\t\t\t\t                AAPL Ratings\n\n\r\n\t\t\t\t                AAPL Stock Report'],
 ['Alphabet Inc.',
  'GOOGL',
  '$822.33B',
  'United States',
  'n/a',
  'Computer Software: Programming, Data Processing'],
 ['GOOGL Stock Quote\n\n\r\n\t\t\t\t                GOOGL Ratings\n\n\r\n\t\t\t\t                GOOGL Stock Report'],
 ['Alphabet Inc.',
  'GOOG',
  '$818.5B',
  'United States',
  '2004',
  'Computer Software: Programming, Data Processing'],
 ['GOOG Stock Quote\n\n\r\n\t\t\t\t                GOOG Ratings\n\n\r\n\t\t\t\t                GOOG Stock Report'],
 ['Facebook, Inc.',
  'FB',
  '$533.76B',
  'United S

In [25]:
# 1. Load data into a pandas DataFrame
acoes = pd.DataFrame(data)

In [26]:
acoes = acoes.dropna(subset=[1])
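
To see why `dropna(subset=[1])` keeps only the company rows: when the lists in `data` have different lengths, pandas pads the short ones with missing values, so the empty first row and the single-item link rows have nothing in column 1. A small sketch with shortened sample rows (not the full scraped data):

```python
import pandas as pd

# Shortened rows in the same shape as `data`: empty, company, link-only, ...
sample = [
    [],
    ['Microsoft Corporation', 'MSFT'],
    ['MSFT Stock Quote'],
    ['Apple Inc.', 'AAPL'],
    ['AAPL Stock Quote'],
]

frame = pd.DataFrame(sample)
# Keep only the rows that have a value in column 1 (the symbol).
companies = frame.dropna(subset=[1])

print(companies)
```

The surviving rows keep their original index labels, which is why the index of `acoes` has gaps.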

In [27]:
acoes.columns = ['Name', 'Symbol','Market Cap','Country','IPO Year','Sector','Sub Sector']

In [28]:
acoes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 1 to 99
Data columns (total 7 columns):
Name          50 non-null object
Symbol        50 non-null object
Market Cap    50 non-null object
Country       50 non-null object
IPO Year      50 non-null object
Sector        50 non-null object
Sub Sector    1 non-null object
dtypes: object(7)
memory usage: 3.1+ KB
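
Every column above is stored as `object` (strings), including `Market Cap`. If numeric values are needed, for example for plotting, one possible conversion is sketched below. The helper name `parse_market_cap` and the suffix table are assumptions based on the `$987.97B` format seen in the data; the real site may use other suffixes.

```python
# Multipliers that express every value in billions of dollars (assumed suffixes).
SUFFIX_TO_BILLIONS = {'M': 0.001, 'B': 1.0, 'T': 1000.0}


def parse_market_cap(text):
    """Convert a string such as '$987.97B' to a float in billions, or None."""
    text = text.strip().lstrip('$')
    if not text or text[-1] not in SUFFIX_TO_BILLIONS:
        return None  # e.g. 'n/a' or an unexpected format
    try:
        return float(text[:-1]) * SUFFIX_TO_BILLIONS[text[-1]]
    except ValueError:
        return None


for raw in ['$987.97B', '$818.5B', 'n/a']:
    print(raw, '->', parse_market_cap(raw))
```

Applied to the DataFrame, this could be used as `acoes['Market Cap'].map(parse_market_cap)`.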


In [29]:
acoes.head()

| | Name | Symbol | Market Cap | Country | IPO Year | Sector | Sub Sector |
|---|---|---|---|---|---|---|---|
| 1 | Microsoft Corporation | MSFT | $987.97B | United States | 1986.0 | Computer Software: Prepackaged Software | |
| 3 | Apple Inc. | AAPL | $874.57B | United States | 1980.0 | Computer Manufacturing | |
| 5 | Alphabet Inc. | GOOGL | $822.33B | United States | | Computer Software: Programming, Data Processing | |
| 7 | Alphabet Inc. | GOOG | $818.5B | United States | 2004.0 | Computer Software: Programming, Data Processing | |
| 9 | Facebook, Inc. | FB | $533.76B | United States | 2012.0 | Computer Software: Programming, Data Processing | |


In [30]:
acoes.to_csv('acoes.csv')

In [0]:
# challenge: we only scraped the first page; extend the scraper to the remaining pages
# ref: https://towardsdatascience.com/web-scraping-for-beginners-beautifulsoup-scrapy-selenium-twitter-api-f5a6d0589ea6
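
One way to approach the challenge is sketched below. It assumes that the `page` query parameter in the URL selects the result page, which is suggested by `page=1` in the original URL but not verified here. The function name `scrape_pages` and its parameters are hypothetical; `fetch` and `parse` stand for whatever download and parsing functions are used, so the loop itself can be tried without network access.

```python
from random import uniform
from time import sleep

# Same URL as above, with the page number left as a placeholder (assumption).
BASE_URL = ('https://www.nasdaq.com/screening/companies-by-industry.aspx'
            '?industry=Technology&sortname=marketcap&sorttype=1&page={page}')


def scrape_pages(fetch, parse, first_page=1, last_page=3, max_delay=2.0):
    """Fetch and parse pages first_page..last_page, pausing between requests.

    fetch: callable taking a URL and returning the page HTML as text.
    parse: callable taking that HTML and returning a list of records.
    """
    records = []
    for page in range(first_page, last_page + 1):
        page_text = fetch(BASE_URL.format(page=page))
        records.extend(parse(page_text))
        if page < last_page:
            # Random pause so the requests are not sent in a rapid burst.
            sleep(uniform(0, max_delay))
    return records
```

The number of pages is not known in advance, so `last_page` has to be chosen by hand or read from the pagination links on the first page.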