<h1>Webscraping com Python

<h2>Parte I - Coleta de Dados Básica

<h3> Capítulo 1 - Meu Primeiro Scrapper

In [1]:
from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


In [2]:
from bs4 import BeautifulSoup
html = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(),'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


In [3]:
#usando um parser alternativo
html = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(),'html5lib')
print(bs.head)

<head>
<title>A Useful Page</title>
</head>


In [4]:
#lidando com erros de acesso
from urllib.error import HTTPError, URLError

try:
    #alterar o URL abaixo e testar os casos de erro
    html = urlopen('http://pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)
except URLError as e:
    print('Servidor não encontrado!')
else:
    print('Funcionou!')

Funcionou!


In [5]:
#lidando com erros de tag

try:
    #brincar com badContent para testar os erros de tag
    badContent = bs.NonExistingTag.anotherTag
except AttributeError as e:
    print('Tag não encontrada!\n',e)
else:
    if badContent == None:
        print('Tag não encontrada!')
    else:
        print(badContent)

Tag não encontrada!
 'NoneType' object has no attribute 'anotherTag'




<h3>Capítulo 2 - Parsing de HTML avançado

In [6]:
#Lendo o texto do livro Guerra e Paz que está na página
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')

#Neste site, o nome dos personagens estão em verde e suas falas estão em vermelho. 
#Podemos acessar os atributos de estilização do CSS para filtrar os elementos do texto (cor, negrito, etc.).
#Vamos fazer isso para o nome dos personagens que aparecem ao longo do texto
bs = BeautifulSoup(html.read(),'html.parser')
nameList = bs.findAll('span', {'class':'green'})
for name in nameList:
    print(name.get_text())#a função get_text() remove as tags do HTML

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


In [7]:
#A árvore de tags HTML site funciona como em uma árvore genealógica.
#Temos tags pai, filho,irmão, descendente, etc.
#Vamos começar com os filhos da gift list 
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html,'html.parser')
for child in bs.find('table',{'id':'giftList'}).children:
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


In [8]:
#fazendo o mesmo agora para todos os descendentes, pegamos muito mais informação

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html,'html.parser')
#descomentar para ver

#for descendant in bs.find('table',{'id':'giftList'}).descendants:
#    print(descendant) 

In [9]:
#Lidando com Irmãos na Árvore
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html,'html.parser')
for sibling in bs.find('table',{'id':'giftList'}).tr.next_siblings:
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

In [10]:
#trabalhando com pais (a partir da imagem do site, achar o preço correspondente)
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html,'html.parser')
print(bs.find('img',{'src':'../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())

#Obs. esse movimento de subir a árvore é bem raro em crawlers.


$15.00



<h4>Expressões Regulares

In [11]:
#Usando expressão regular para acessar o diretório de todas as imagens da página.

import re
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html,'html.parser')
re01 = re.compile('\.\.\/img\/gifts/img.*\.jpg')
images = bs.find_all('img', {'src':re01})
for image in images:
    print(image['src'])

#Observação: lembre-se que a barra invertida (\) serve para escapar de caracteres especiais, como @, . e /

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


<h3>Capítulo 3 - Escrevendo Webcrawlers

In [12]:
html = urlopen('https://pt.wikipedia.org/wiki/Sundar_Pichai')
bs = BeautifulSoup(html,'html.parser')

#padrão da expressão regular: começa com /wiki/ e pode conter quaisquer caracteres depois do / com exceção dos dois pontos
re02 = re.compile('^(/wiki/)((?!:).)*$')

for link in bs.find_all('a',href = re02): #procura tudo o que está na tag raiz da página, a tag 'a'
    if 'href' in link.attrs: #dentro dela, encontra todos links externos
        print(link.attrs['href']) #e printa somente aqueles que são artigo do wikipedia

/wiki/Sundar_Pichai
/wiki/Sundar_Pichai
/wiki/12_de_julho#Nascimentos
/wiki/1972
/wiki/Chenai
/wiki/%C3%8Dndia
/wiki/%C3%8Dndia
/wiki/CEO
/wiki/Google
/wiki/Alphabet_Inc.
/wiki/Google
/wiki/Chenai
/wiki/Google
/wiki/Alphabet_Inc.
/wiki/10_de_agosto
/wiki/2015
/wiki/Larry_Page
/wiki/Alphabet_Inc.
/wiki/Dezembro
/wiki/2019
/wiki/Holding
/wiki/Engenharia_metal%C3%BArgica
/wiki/Google_Chrome
/wiki/Chrome_OS
/wiki/Sistema_Operacional
/wiki/Android
/wiki/Alphabet_Inc.
/wiki/Larry_Page
/wiki/Sergey_Brin


In [13]:
import datetime, random

#garantindo um novo caminho de execução toda vez que o programa executa
random.seed(datetime.datetime.now())

#desenvolvendo a função getLinks que pega todos os links no corpo do artigo do Wikipedia
def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{0}'.format(articleUrl))
    bs = BeautifulSoup(html,'html.parser')
    return bs.find('div',{'id':'bodyContent'}).find_all('a',href=re02)

In [14]:
#fazendo um laço para continuar desenrolando os hrefs a partir do Sundar Pichai em um caminho aleatório
links = getLinks('/wiki/Sundar_Pichai')
while len(links)>0:
    newArticle = links[random.randint(0,len(links))].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)

/wiki/YouTube_headquarters_shooting
/wiki/Semi-automatic_pistol
/wiki/Firing_pin
/wiki/Steel
/wiki/Basil_Brooke_(metallurgist)
/wiki/British_House_of_Commons
/wiki/Sark
/wiki/Doi_(identifier)
/wiki/Guidelines_for_the_Definition_of_Managed_Objects
/wiki/ISO/IEC_7811
/wiki/PHIGS
/wiki/JBIG
/wiki/International_Electrotechnical_Commission
/wiki/Serbia
/wiki/Kosovo
/wiki/Euro_currency
/wiki/2_euro_cent_coin
/wiki/Carantania#The_Ducal_Coronation
/wiki/Kingdom_of_Germany
/wiki/Timeline_of_German_history
/wiki/Denmark
/wiki/Proto-Germanic_language
/wiki/Th-fronting
/wiki/Trap-bath_split
/wiki/Cardiff_English
/wiki/Stop_consonant
/wiki/Laryngeal_consonant
/wiki/Voiced_labiodental_flap
/wiki/Chadic_languages
/wiki/Somrai_language
/wiki/Mawa_language_(Chad)
/wiki/Kwang_language
/wiki/Endangered_Languages_Project
/wiki/Endangered_languages
/wiki/Doi_(identifier)
/wiki/PLoS_Biology


IndexError: list index out of range

In [15]:
pages = set()

#criando uma nova getLinks que evita páginas duplicadas
def getLinks2(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{0}'.format(pageUrl))
    bs = BeautifulSoup(html,'html.parser')
    for link in bs.find_all('a',href = re02):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #Encontramos uma página nova!
                newPage = link.attrs['href']
                #printa
                print(newPage)
                #adiciona ao set
                pages.add(newPage)
                #recursão
                getLinks(newPage)

getLinks2('')
#observação, o limite da recursividade (sem tratamento), para essa função em Python sã 1.000 iterações.

/wiki/Wikipedia
/wiki/Free_content
/wiki/Encyclopedia
/wiki/English_language
/wiki/A_Crow_Looked_at_Me
/wiki/Mount_Eerie
/wiki/Phil_Elverum
/wiki/Genevi%C3%A8ve_Castr%C3%A9e
/wiki/Pancreatic_cancer
/wiki/Indie_folk
/wiki/Now_Only
/wiki/Lost_Wisdom_pt._2
/wiki/Mother%27s_Day_(Rugrats)
/wiki/Preening
/wiki/Frances_Gertrude_McGill
/wiki/Arthur_Kopit
/wiki/Oh_Dad,_Poor_Dad,_Mamma%27s_Hung_You_in_the_Closet_and_I%27m_Feelin%27_So_Sad
/wiki/Drama_Desk_Award
/wiki/Outer_Critics_Circle_Award
/wiki/Daimaou_Kosaka
/wiki/Damens_Walker
/wiki/United_States_Public_Health_Service
/wiki/Environmental_Health_Divisions
/wiki/Cincinnati
/wiki/United_States_Environmental_Protection_Agency


KeyboardInterrupt: 

In [22]:
#Modificando o Crawler ligeiramente para pegar elementos específicos da página

pages = set()

def getLinks3(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html,'html.parser')
    try:
        #printa o título
        print(bs.h1.get_text())
        #printa o parágrafo
        print(bs.find(id='mw-content-text').find_all('p')[0])
        #printa o link de editar
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError:
        print('Está faltando algo nesta página... Continuando!')
    for link in bs.find_all('a',href=re02):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

In [24]:
getLinks3('')

Main Page
<p><i><b><a href="/wiki/A_Crow_Looked_at_Me" title="A Crow Looked at Me">A Crow Looked at Me</a></b></i> is the eighth studio album by <a href="/wiki/Mount_Eerie" title="Mount Eerie">Mount Eerie</a>, a solo project of the American musician <a href="/wiki/Phil_Elverum" title="Phil Elverum">Phil Elverum</a>. Released in 2017, it was composed in the aftermath of his 35-year-old wife, <a href="/wiki/Genevi%C3%A8ve_Castr%C3%A9e" title="Geneviève Castrée">Geneviève Castrée</a>'s, diagnosis of <a href="/wiki/Pancreatic_cancer" title="Pancreatic cancer">pancreatic cancer</a> in 2015 and her death in July 2016. Elverum wrote and recorded the songs over a six-week period in the room where she died, mostly using her instruments. The lyrics are presented in a diary-like form and sung in a raw, intimate style. They bluntly detail Castrée's illness and death, Elverum's grief, and his relationship with their infant child. It is soundtracked by sparse <a href="/wiki/Indie_folk" title="Indie 

KeyboardInterrupt: 

In [2]:
#Exemplo de URL Parse para entender a célula a seguir
from urllib.parse import urlparse
o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
o

ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')

In [2]:
#Crawleando para sites além de um domínio específico
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())


#Obtém uma lista dos links internos da página
def getInternalLinks(bs,includeUrl):
    #gera a string http do site
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme,
                                 urlparse(includeUrl).netloc)
    internalLinks = []
    
    #expressão regular para todos os links que começam com /
    re_barra = re.compile('^(/|.*' + includeUrl + ')')
    
    for link in bs.find_all('a',href = re_barra):
        if link.attrs['href'] is not None:
            if(link.attrs['href'].startswith('/')):
              internalLinks.append(includeUrl+link.attrs['href'])
            else:
                internalLinks.append(link.attrs['href'])
    return internalLinks

#Obtém uma lista dos links internos da página
def getExternalLinks(bs,excludeUrl):
    externalLinks = []
    
    #expressão regular para excluir o próprio URL inicial
    re_exclude = re.compile('^(http|www*)((?!' + excludeUrl + ').)*$')
    
    for link in bs.find_all('a',href = re_exclude):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
              externalLinks.append(link.attrs['href'])
    return externalLinks

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bs = BeautifulSoup(html,'html.parser')
    externalLinks = getExternalLinks(bs,urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print('Não encontramos nenhum link externo =/ Buscando novamente em outra página...')
        domain = '{}://{}'.format(urlparse(startingPage).scheme,
                                 urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs,domain)
        return getRandomExternalLink(internalLinks[random.randint(0,len(internalLinks)-1)])
    else:
        return externalLinks[random.randint(0,len(externalLinks)-1)]
    
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print('O link externo aleatório é: ', externalLink)
    followExternalOnly(externalLink)
    
followExternalOnly('https://www.oreilly.com/')
#Obs. pelos testes, percebi que vários websites estão proibindo o acesso de crawlers

O link externo aleatório é:  https://learning.oreilly.com/search/?query=author%3A%22Kelsey%20Hightower%22&extended_publisher_data=true&highlight=true&include_assessments=false&include_case_studies=true&include_courses=true&include_playlists=true&include_collections=true&include_notebooks=true&include_sandboxes=true&include_scenarios=true&is_academic_institution_account=false&source=user&sort=date_added&facet_json=true&json_facets=true&page=0&include_facets=false
O link externo aleatório é:  https://www.oreilly.com/terms/
O link externo aleatório é:  https://www.copyright.gov/title17/92chap1.html#107


HTTPError: HTTP Error 403: Forbidden

<h2>Capítulo 4 - Modelos de Webcrawling

Não devemos basear nossos modelos nas informações disponibilizadas pelo primeiro site que encontrarmos com informações desejadas. Devemos refletir, em primeiro lugar, sobre quais dados são necessários para o modelo que estamos implementando e pensar na escalabilidade levando em consideração que cada página estará estruturada de uma maneira.

<h3>4.1. Lidando com diferentes layouts de site

In [22]:
#Implementando o classe Content do nosso crawler que coleta notícias e posts de blog
from bs4 import BeautifulSoup
import requests

class Content:
    
    def __init__(self,url,title,body):
        self.url = url
        self.title = title
        self.body = body
    
#a partir de um request retorna um texto parseado no formato BS
def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text,'html.parser')

#Retorna um objeto do tipo content a partir de um URL do NY Times
#Obs. Note que nesse scraper eu pego a notícia linha por linha
def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find('h1').text
    lines = bs.select('div.StoryBodyCompanionColumn div p')
    body = '\n'.join([line.text for line in lines])
    return Content(url,title,body)

# Retorna um objeto do tipo content a partir de um URL do Brookings
#Obs. Neste aqui, eu pego o texto parágrafo por parágrafo
def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find('h1').text
    paragraphs = bs.find_all('div',{'class':'post-body'})
    body = '\n'.join([paragraph.text for paragraph in paragraphs])
    return Content(url,title,body)

#Apresentando a notícia para o usuário
def showPost(content):
    print('Título: ',content.title)
    print('URL: ', content.url)
    print('\n', content.body)

In [16]:
#Testando NY Times
conteudo = scrapeNYTimes('https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html')
showPost(conteudo)

Título:  The Men Who Want to Live Forever
URL:  https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html

 Would you like to live forever? Some billionaires, already invincible in every other way, have decided that they also deserve not to die. Today several biotech companies, fueled by Silicon Valley fortunes, are devoted to “life extension” — or as some put it, to solving “the problem of death.”
It’s a cause championed by the tech billionaire Peter Thiel, the TED Talk darling Aubrey de Gray, Google’s billion-dollar Calico longevity lab and investment by Amazon’s Jeff Bezos. The National Academy of Medicine, an independent group, recently dedicated funding to “end aging forever.”
As the longevity entrepreneur Arram Sabeti told The New Yorker: “The proposition that we can live forever is obvious. It doesn’t violate the laws of physics, so we can achieve it.” Of all the slightly creepy aspects to this trend, the strangest is the least noticed: The people publicl

In [23]:
#Testando Brookings
url = 'https://www.brookings.edu/blog/up-front/2021/05/12/tax-carbon-and-consumption-not-middle-class-income/'
conteudo = scrapeBrookings(url)
showPost(conteudo)

Título:  Tax carbon and consumption, not middle class income
URL:  https://www.brookings.edu/blog/up-front/2021/05/12/tax-carbon-and-consumption-not-middle-class-income/

 
On April 22nd President Biden kicked off a virtual Leadership Summit on Climate by declaring that the US will cut its global warming emissions in half by the end of the decade. This goal, as well as the summit, which included 40 world leaders, affirmed President Biden’s commitment to combatting climate change during his tenure. Another proposal at the core of Biden’s agenda is strengthening America’s middle class. In  A New Contract with the Middle Class, Richard Reeves and Isabel Sawhill argue for removing almost all income tax for the middle class, and recovering part of that lost revenue with a carbon tax and a progressive value-added tax (VAT). This policy would give the middle class a much-needed income boost while helping us reach our goal of reduced emissions. 	






Ariel Gelrud Shiro

					Research Assista

Vamos modificar agora um pouquinho para o nosso projeto para separar aquilo que é do site daquilo que é da página.

In [31]:
class Content:
    
    def __init__(self,url,title,body):
        self.url = url
        self.title = title
        self.body = body
    
    def print(self):
        #Função para exibir o conteúdo formatado
        print('Título: ', self.title)
        print('URL: ', self.url)
        print('Corpo da notícia:\n', self.body)

class Website:
    
    def __init__(self,name,url,titleTag,bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag

In [28]:
import requests
from bs4 import BeautifulSoup

class Crawler:
    
    '''Pega o conteúdo de uma página específica'''
    def getPage(self,url):
        try:
            req=requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text,'html.parser')
    
    '''Função usada para obter uma string de conteúdo de um objeto BS e um seletor'''
    def safeGet(self, pageObj, selector):
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join(
                [elem.get_text() for elem in selectedElems])
        return ''
    
    '''Faz o parsing a partir de um URL'''
    def parse(self, site, url):
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()

In [32]:
#Testando o Crawler para notícias em diversos sites
crawler = Crawler()

siteData = [['O\'Reilly Media', 'http://oreilly.com','h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com', 'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu','h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com','h1', 'div.StoryBodyCompanionColumn div p']]
websites = []
for info in siteData:
    websites.append(Website(info[0], info[1], info[2], info[3]))

crawler.parse(websites[0], 'http://shop.oreilly.com/product/0636920028154.do')
crawler.parse(websites[1], 'http://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(websites[2], 'https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/28/business/energy-environment/oil-boom.html')

Título:  Idea to Retire: Old methods of policy education
Idea to Retire: Old methods of policy education
URL:  https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/
Corpo da notícia:
 
Public policy and public affairs schools aim to train competent creators and implementers of government policy. While drawing on the principles that gird our economic and political systems to provide a well-rounded education, like law schools and business schools, policy schools provide professional training. They are quite distinct from graduate programs in political science or economics which aim to train the next generation of academics. As professional training programs, they add value by imparting both the skills which are relevant to current employers, and skills which we know will be relevant as organizations and societies evolve. 
The relevance of the skills that policy programs impart to address problems of today and tomorrow bears further discussion.

<h3>4.2.Rastreando um site por meio de pesquisa

In [2]:
import requests
from bs4 import BeautifulSoup

#Classe do Conteúdo
class Content:
    
    def __init__(self,topic,url,title,body):
        self.topic = topic
        self.url = url
        self.title = title
        self.body = body
        
    def print(self):
        print('Encontrei um novo artigo com o tópico:',self.topic)
        print('Título:',self.title)
        print('Corpo:',self.body)
        print('URL:',self.url)

#Classe do Website
class Website:
    def __init__(self, name, url, searchUrl, resultListing, resultUrl, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.searchUrl = searchUrl
        self.resultListing = resultListing
        self.resultUrl = resultUrl
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag

#Classe do Crawler
class Crawler:
    
    '''Pega o conteúdo de uma página específica'''
    def getPage(self,url):
        try:
            req=requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text,'html.parser')
    
    '''Função usada para obter uma string de conteúdo de um objeto BS e um seletor'''
    def safeGet(self, pageObj, selector):
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join(
                [elem.get_text() for elem in selectedElems])
        return ''
    
    '''Busca um tópico em dado site e retorna todas as páginas encontradas'''
    def search(self, topic, site):
        bs = self.getPage(site.searchUrl + topic)
        searchResults = bs.select(site.resultListing)
        for result in searchResults:
            url = result.select(site.resultUrl)[0].attrs['href']
            # Check to see whether it's a relative or an absolute URL
            if(site.absoluteUrl):
                bs = self.getPage(url)
            else:
                bs = self.getPage(site.url + url)
            if bs is None:
                print('Algo deu errado com o URL da página. Skipping!')
                return
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(topic, title, body, url)
                content.print()

In [3]:
#Testando o nosso crawler
crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com', 'https://ssearch.oreilly.com/?q=',
    'article.product-result', 'p.title a', True, 'h1',
    'section#product-description'],
    ['Reuters', 'http://reuters.com', 'http://www.reuters.com/search/news?blob=',
    'div.search-result-content', 'h3.search-result-title a', False, 'h1',
    'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu',
    'https://www.brookings.edu/search/?s=', 'div.list-content article',
    'h4.title a', True, 'h1', 'div.post-body']
]
sites = []
for row in siteData:
    sites.append(Website(row[0], row[1], row[2],
                         row[3], row[4], row[5], row[6], row[7]))

topics = ['python', 'data science']
for topic in topics:
    print('GETTING INFO ABOUT: ' + topic)
    for targetSite in sites:
        crawler.search(topic, targetSite)

GETTING INFO ABOUT: python
Encontrei um novo artigo com o tópico: python
Título: 
According to President Obama’s Council of Economic Advisers (CEA), approximately 3.1 million jobs will be rendered obsolete or permanently altered as a consequence of artificial intelligence technologies. Artificial intelligence (AI) will, for the foreseeable future, have a significant disruptive impact on jobs. That said, this disruption can create new opportunities if policymakers choose to harness them—including some with the potential to help address long-standing social inequities. Investing in quality training programs that deliver premium skills, such as computational analysis and cognitive thinking, provides a real opportunity to leverage AI’s disruptive power.







Makada Henry-Nickie

					Fellow - Governance Studies 

 Twitter
mhnickie





AI’s disruption presents a clear challenge: competition to traditional skilled workers arising from the cross-relevance of data scientists and code engine

IndexError: list index out of range