<a href="https://colab.research.google.com/github/jacklmg75/data-extraction/blob/main/1_3_Scraping_via_XPATH.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fetching HTML pages and extracting data with XPath

In [None]:
# https://docs.python-guide.org/scenarios/scrape/

import requests

## Making a direct request to the site

In [None]:
url = "https://www.nytimes.com/section/science"

html = requests.get(url)  # shorthand for requests.request(method='get', url=url)

In [None]:
# https://developer.mozilla.org/pt-BR/docs/Web/HTTP/Status

html

<Response [200]>
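
The `<Response [200]>` shown above means the request succeeded. As a minimal sketch of how a status code could be interpreted before any parsing, the helpers below use only the standard library's `http.HTTPStatus`. The names `describe_status` and `is_success` are illustrative and are not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the status code followed by its standard reason phrase."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes that are not registered in HTTPStatus end up here.
        return f"{code} (unknown status)"
    return f"{code} {status.phrase}"


def is_success(code: int) -> bool:
    """True for the 2xx range, which signals a successful request."""
    return 200 <= code < 300


print(describe_status(200), is_success(200))
print(describe_status(404), is_success(404))
```

With the `html` response object from this notebook, the numeric code is available as `html.status_code`, and `html.raise_for_status()` raises an exception for 4xx and 5xx responses.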

In [None]:
print(html.headers)

{'Connection': 'close', 'Content-Length': '119056', 'Server': 'nginx', 'Content-Type': 'text/html; charset=utf-8', 'x-cloud-trace-context': '92be7302ff6246c5d03e45d5cfe4e482/2660259731275138863', 'x-b3-traceid': '03449129083d4709937e8131dc6f3efe', 'x-nyt-data-last-modified': 'Sun, 03 Jul 2022 21:53:28 GMT', 'Last-Modified': 'Sun, 03 Jul 2022 21:53:28 GMT', 'X-PageType': 'vi-collection', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'Content-Encoding': 'gzip', 'cache-control': 's-maxage=600,no-cache', 'x-nyt-route': 'vi-collection', 'X-Origin-Time': '2022-07-03 21:54:19 UTC', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 03 Jul 2022 22:02:28 GMT', 'Age': '540', 'X-Served-By': 'cache-lga21931-LGA, cache-sea4431-SEA', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '1, 1', 'X-Timer': 'S1656885748.434009,VS0,VE2', 'Vary': 'Accept-Encoding, Fastly-SSL', 'Set-Cookie': 'nyt-a=ryPi4tNeAx5NaEuzEKLhpc; Expires=Mon, 03 Jul 2023 22:02:28 GMT; Path=/; Domain=.nytimes.com; SameSite=none
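
Among the headers printed above, `Content-Type` carries both the media type and the character encoding of the body. The sketch below shows one way to split that value using the standard library. `parse_content_type` is a hypothetical helper name, and the header value is copied from the output above.

```python
from email.message import EmailMessage
from typing import Optional, Tuple


def parse_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header into (media type, charset or None)."""
    msg = EmailMessage()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=utf-8")
print(media_type, charset)
```

`requests` performs a similar lookup itself and exposes the result as `html.encoding`.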

In [None]:
html.content



## Exercise:

Replace the URL in this notebook with a link to another HTML page and run the cells again.

## Reading a saved HTML file and extracting the desired fields

In [None]:
import pandas as pd
url = 'https://www.dropbox.com/s/w6a1ygfu6bt1wl1/nyt-html.csv?dl=1'  # HTML file saved on Dropbox in CSV format.
                                                                     # Note: the "dl=1" parameter in the URL requests a
                                                                     # direct download instead of the file preview ("dl=0").
df = pd.read_csv(url, header=None)  # read the file as a pandas DataFrame
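
The comment about `dl=1` can be turned into a small helper. This is only a sketch: `force_direct_download` is a hypothetical name, and it assumes the share link uses the `dl` query parameter as described above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_direct_download(share_url: str) -> str:
    """Return the same URL with the query parameter dl set to 1."""
    parts = urlsplit(share_url)
    query = dict(parse_qsl(parts.query))
    query["dl"] = "1"  # overwrite dl=0 (preview) or add the parameter if missing
    return urlunsplit(parts._replace(query=urlencode(query)))


preview_url = "https://www.dropbox.com/s/w6a1ygfu6bt1wl1/nyt-html.csv?dl=0"
print(force_direct_download(preview_url))
```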

In [None]:
df.columns

Int64Index([0, 1, 2], dtype='int64')

In [None]:
df.head()

Unnamed: 0,0,1,2
0,http://www.nytimes.com/pages/science/index.html,"statusCode=200,size=154341",<!DOCTYPE html>\n<!--[if (gt IE 9)|!(IE)]> <!-...


### Raw HTML code

In [None]:
html_content = df[2].values[0]
html_content



<img src="https://www.dropbox.com/s/g8zaj32mo4d5192/exemplo-codigo-html.PNG?dl=1" alt="Drawing" width="1000"/>

In [None]:
from lxml import etree

myparser = etree.HTMLParser(encoding="utf-8")

### Locating the fields of interest
In this case we want the stories, which are the articles on the site.

> We therefore extract the **div** tags whose class contains **"story-body"**.
>
> This gives us the news blocks on the page.

<img src="https://www.dropbox.com/s/62p43dm215rfbq8/div-story-body.PNG?dl=1" alt="Drawing" width="1000"/>

In [None]:
tree = etree.HTML(html_content, parser=myparser) # Note that we parse the content read from the file and stored in html_content.
                                                 # It could instead be the content obtained from the direct request at the start of this notebook (html.content).
                                                 # However, always pay attention to the HTML structure of each site.
                                                 # The same site can change its layout over time, and the extraction rules must be adapted to keep working.

articles = tree.xpath('//div[contains(@class,"story-body")]')
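
One detail about `contains(@class, "story-body")`: it is a substring test, so it also matches class names that merely include that text. The sketch below uses a small made-up HTML fragment (not the NYT page) to compare it with a stricter whole-token test built from standard XPath 1.0 functions.

```python
from lxml import etree

# Made-up fragment for illustration only.
sample = """
<html><body>
  <div class="story-body">A</div>
  <div class="story-body featured">B</div>
  <div class="story-body-extra">C</div>
</body></html>
"""
sample_tree = etree.HTML(sample)

# Substring match: any class attribute containing the text "story-body".
loose = sample_tree.xpath('//div[contains(@class,"story-body")]')

# Token match: "story-body" must appear as a whole, space-separated class name.
strict = sample_tree.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " story-body ")]'
)

loose_texts = [div.text for div in loose]
strict_texts = [div.text for div in strict]
print(loose_texts)
print(strict_texts)
```

The substring form is shorter and works for the page used here. The token form is worth considering when a site uses similar class names for different blocks.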

In [None]:
articles[:3]

[<Element div at 0x7f26eff92d70>,
 <Element div at 0x7f26eff92eb0>,
 <Element div at 0x7f26eff92e10>]

In [None]:
find_text = etree.XPath("string()")  # returns the text of an element, including the text of its descendants

find_text(articles[0])

'\n                                                    Trilobites\n                                                \n                            Before You Flush Your Contact Lenses, You Might Want to Know This\n                        \n                                                    \n                                \n                                    \n                                \n                            \n                                                Flushing disposable contacts down the toilet or washing them down the drain may contribute to the problem of microplastic pollution, researchers said.\n                        \n                            \n                                                            \n                                                        By VERONIQUE GREENWOOD\n                        \n                    '

In [None]:
' '.join(find_text(articles[0]).split())

'Trilobites Before You Flush Your Contact Lenses, You Might Want to Know This Flushing disposable contacts down the toilet or washing them down the drain may contribute to the problem of microplastic pollution, researchers said. By VERONIQUE GREENWOOD'
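
The `' '.join(text.split())` idiom collapses every run of whitespace into a single space. XPath offers a comparable built-in, `normalize-space()`. The sketch below compares the two on a small made-up fragment. Keep in mind that `normalize-space()` only treats spaces, tabs, and line breaks as whitespace, so results may differ on text containing other Unicode space characters.

```python
from lxml import etree

find_text = etree.XPath("string()")
find_clean_text = etree.XPath("normalize-space()")

# Made-up fragment with the same kind of indentation noise as the real page.
sample = etree.HTML("<div>\n   Trilobites\n   <h2>  Before You Flush  </h2>\n</div>")
div = sample.xpath("//div")[0]

python_clean = " ".join(find_text(div).split())  # cleanup done in Python
xpath_clean = find_clean_text(div)               # cleanup done by XPath

print(python_clean)
print(xpath_clean)
```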

<img src="https://www.dropbox.com/s/mbxbmizsquaxhwl/h2.PNG?dl=1" alt="Drawing" width="1000"/>

In [None]:
article = articles[0]

headline = article.xpath('.//h2')[0]

find_text(headline)

'\n                            Before You Flush Your Contact Lenses, You Might Want to Know This\n                        '

In [None]:
find_text(headline).strip()  # headline is already the <h2> element; strip() removes the surrounding whitespace

'Before You Flush Your Contact Lenses, You Might Want to Know This'

<img src="https://www.dropbox.com/s/w1v0qi1srcdrhp7/summary.PNG?dl=1" alt="Drawing" width="650"/>

In [None]:
summary = article.xpath('.//p[contains(@class,"summary")]')[0]
find_text(summary).strip()

'Flushing disposable contacts down the toilet or washing them down the drain may contribute to the problem of microplastic pollution, researchers said.'

<img src="https://www.dropbox.com/s/ftocpfeqk0rpdwx/href.PNG?dl=1" alt="Drawing" width="800"/>

In [None]:
link = article.xpath('.//a/@href')[0]
link.strip()

'https://www.nytimes.com/2018/08/19/science/contact-lenses-pollution.html'

<img src="https://www.dropbox.com/s/vkg6bqo4wlszfc8/byline.PNG?dl=1" alt="Drawing" width="800"/>

In [None]:
autor = article.xpath('.//p[contains(@class,"byline")]')[0]
find_text(autor).strip()

'By VERONIQUE GREENWOOD'
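
Each lookup above takes `[0]` from the XPath result, which raises `IndexError` when a block has no matching element. A defensive variant could look like the sketch below. `first_text` is a hypothetical helper, and the HTML fragment is made up for the example.

```python
from lxml import etree

find_text = etree.XPath("string()")


def first_text(node, xpath_expr, default=None):
    """Return the cleaned text of the first XPath match, or default if there is none."""
    matches = node.xpath(xpath_expr)
    if not matches:
        return default
    first = matches[0]
    # Attribute queries such as .//a/@href already return strings;
    # elements are converted to their text content first.
    text = first if isinstance(first, str) else find_text(first)
    return " ".join(text.split())


# Made-up block with a headline and link but no summary.
sample = etree.HTML('<div class="story-body"><h2><a href="/a.html"> Title </a></h2></div>')
block = sample.xpath('//div[contains(@class,"story-body")]')[0]

print(first_text(block, './/h2'))
print(first_text(block, './/a/@href'))
print(first_text(block, './/p[contains(@class,"summary")]'))
```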

## Extracting all articles from the page

In [None]:
articles = tree.xpath('//div[contains(@class,"story-body")]')

find_text = etree.XPath("string()")  # returns the text of an element, including the text of its descendants

extracted_articles = []

for article in articles[:3]:  # only the first three articles here; remove the slice to process them all

    headline = article.xpath('.//h2')[0]
    headline = find_text(headline).strip()
    print("headline:", headline)

    summary = article.xpath('.//p[contains(@class,"summary")]')[0]
    summary = find_text(summary).strip()
    print("summary:", summary)

    # Not every block has a link: store None rather than an empty list when it is missing
    links = article.xpath('.//a/@href')
    link = links[0].strip() if links else None
    if link:
        print("link:", link)

    autor = article.xpath('.//p[contains(@class,"byline")]')[0]
    autor = find_text(autor).strip()
    print("autor:", autor)
    print("--------------------------")

    extracted_articles.append(dict(headline=headline,
                                   summary=summary,
                                   link=link,
                                   autor=autor)
    )

headline: Before You Flush Your Contact Lenses, You Might Want to Know This
summary: Flushing disposable contacts down the toilet or washing them down the drain may contribute to the problem of microplastic pollution, researchers said.
link: https://www.nytimes.com/2018/08/19/science/contact-lenses-pollution.html
autor: By VERONIQUE GREENWOOD
--------------------------
headline: Hundreds of Reindeer Died by Lightning. Their Carcasses Became a Laboratory.
summary: “From death comes life,” said researchers who studied how decomposing bodies, with the help of scavengers, might alter plant diversity across a broad landscape.
link: https://www.nytimes.com/2018/08/17/science/reindeer-carcasses-lightning.html
autor: By STEPH YIN
--------------------------
headline: Archaeologists Find 3,200-Year-Old Cheese in an Egyptian Tomb
summary: The cheese was found in a tomb that had been thought lost to shifting sands until it was rediscovered in 2010.
link: https://www.nytimes.com/2018/08/16/science/

In [None]:
extracted_articles

[{'autor': 'By VERONIQUE GREENWOOD',
  'headline': 'Before You Flush Your Contact Lenses, You Might Want to Know This',
  'link': 'https://www.nytimes.com/2018/08/19/science/contact-lenses-pollution.html',
  'summary': 'Flushing disposable contacts down the toilet or washing them down the drain may contribute to the problem of microplastic pollution, researchers said.'},
 {'autor': 'By STEPH YIN',
  'headline': 'Hundreds of Reindeer Died by Lightning. Their Carcasses Became a Laboratory.',
  'link': 'https://www.nytimes.com/2018/08/17/science/reindeer-carcasses-lightning.html',
  'summary': '“From death comes life,” said researchers who studied how decomposing bodies, with the help of scavengers, might alter plant diversity across a broad landscape.'},
 {'autor': 'By NIRAJ CHOKSHI',
  'headline': 'Archaeologists Find 3,200-Year-Old Cheese in an Egyptian Tomb',
  'link': 'https://www.nytimes.com/2018/08/16/science/oldest-cheese-ever-egypt-tomb.html',
  'summary': 'The cheese was found i

## Creating a dataset

In [None]:
import pandas as pd

df = pd.DataFrame(extracted_articles)  # a list of dicts becomes one row per dict
df.shape

(3, 4)

In [None]:
df

|   | headline | summary | link | autor |
|---|---|---|---|---|
| 0 | Before You Flush Your Contact Lenses, You Migh... | Flushing disposable contacts down the toilet o... | https://www.nytimes.com/2018/08/19/science/con... | By VERONIQUE GREENWOOD |
| 1 | Hundreds of Reindeer Died by Lightning. Their ... | “From death comes life,” said researchers who ... | https://www.nytimes.com/2018/08/17/science/rei... | By STEPH YIN |
| 2 | Archaeologists Find 3,200-Year-Old Cheese in a... | The cheese was found in a tomb that had been t... | https://www.nytimes.com/2018/08/16/science/old... | By NIRAJ CHOKSHI |
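
Once the records are in a DataFrame, a natural next step is to persist them. The sketch below round-trips a small DataFrame through CSV. The records are placeholders with the same keys used in this notebook, and an in-memory buffer stands in for a file path such as `"articles.csv"` (an assumed name).

```python
import io

import pandas as pd

# Placeholder records, not real articles.
records = [
    {"headline": "Headline A", "summary": "Summary A",
     "link": "https://example.com/a", "autor": "By AUTHOR A"},
    {"headline": "Headline B", "summary": "Summary B",
     "link": None, "autor": "By AUTHOR B"},
]
df_sample = pd.DataFrame(records)

buffer = io.StringIO()                 # stands in for a file on disk
df_sample.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.shape)
print(list(restored.columns))
```

A missing link written as `None` comes back as `NaN` after reading, so downstream code should check it with `pd.isna` rather than comparing to `None`.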
