<a href="https://colab.research.google.com/github/leonardodrigo/datasets/blob/master/webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!pip install requests
!pip install beautifulsoup4

In [0]:
import requests
from bs4 import BeautifulSoup

In [0]:
url = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
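
The request above does not check whether the download succeeded. A minimal sketch of a more defensive fetch follows. `fetch_html` and its `get` parameter are hypothetical names introduced here for illustration, and the 10-second timeout is an arbitrary assumption. The `timeout` argument and `raise_for_status()` are standard `requests` features.

```python
def fetch_html(page_url, get=None, timeout=10):
    """Return the body of page_url as text, raising on HTTP errors.

    `get` is a hypothetical hook (not part of the original notebook) that
    accepts any function with the same call shape as requests.get.
    """
    if get is None:
        import requests  # imported lazily so the hook works without it
        get = requests.get
    response = get(page_url, timeout=timeout)
    response.raise_for_status()  # raises an HTTPError on 4xx/5xx responses
    return response.text
```

The returned string could then be passed to `BeautifulSoup(html, 'html.parser')` in place of the raw response text.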

In [0]:
# Print the first 500 characters of the response body
print(url.text[0:500])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page


In [0]:
soup = BeautifulSoup(url.text, 'html.parser')

In [0]:
# Count the records: each entry on the page is a <span class="short-desc"> element
results = soup.find_all('span', attrs={'class':'short-desc'})
len(results)

180

In [0]:
# Inspect the last record
results[-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

**Extracting the date**

In [0]:
# We want the date between the <strong> tags
results[0].find('strong')

<strong>Jan. 21 </strong>

In [0]:
results[0].find('strong').text

'Jan. 21\xa0'

In [0]:
# Drop the trailing non-breaking space (\xa0), then append the year
results[0].find('strong').text[0:-1] + ', 2017'

'Jan. 21, 2017'
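
Fixed-position slicing depends on the exact length and trailing character of the date text. A sketch of an alternative relies on `str.strip()`, which also removes the non-breaking space `\xa0`. `clean_date` is a hypothetical helper name, and the default year of 2017 is taken from the cells above.

```python
def clean_date(raw_text, year=2017):
    """Strip surrounding whitespace (including \\xa0) and append the year."""
    return f"{raw_text.strip()}, {year}"

print(clean_date('Jan. 21\xa0'))  # Jan. 21, 2017
```

Because nothing is sliced by position, the same helper handles shorter dates such as `'Feb. 3\xa0'`.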

**Extracting Trump's statement**

In [0]:
results[0].contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [0]:
# Take the second element of the list (the second child node)
results[0].contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [0]:
# Drop the opening quote, the closing quote, and the trailing space
results[0].contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

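The slice above assumes exactly one opening quote at the start and a closing quote plus one space at the end. A sketch of a variant that strips by character rather than by position follows. `clean_quote` is a hypothetical name, and it assumes the page uses the curly quotes U+201C and U+201D, as in the output shown.

```python
def clean_quote(raw_text):
    """Remove surrounding whitespace, then any leading/trailing curly quotes."""
    return raw_text.strip().strip('\u201c\u201d')

sample = "\u201cI wasn't a fan of Iraq. I didn't want to go into Iraq.\u201d "
print(clean_quote(sample))
```

One trade-off: a statement that itself ends with a nested curly quote would lose that inner quote mark too.
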
**Extracting the explanation**

In [0]:
results[0].find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [0]:
# Drop the surrounding parentheses
results[0].find('a').text[1:-1]

'He was for an invasion before he was against it.'

**Extracting the URL**

In [0]:
results[0].find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

**Building the dataset**

In [0]:
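records = []
for result in results:
    # Date: drop the trailing non-breaking space and append the year
    date = result.find('strong').text[0:-1] + ', 2017'
    # Statement: second child node, without the quotes and trailing space
    lie = result.contents[1][1:-2]
    # Explanation: link text without the surrounding parentheses
    explanation = result.find('a').text[1:-1]
    # Source link; named href so it does not overwrite the response stored in `url`
    href = result.find('a')['href']

    records.append((date, lie, explanation, href))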

In [0]:
len(records)

180

In [0]:
import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])

In [0]:
df['date'] = pd.to_datetime(df['date'])
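
`pd.to_datetime` infers the date format on its own. To make the parsing step explicit, here is a standard-library sketch using `datetime.strptime`. The two formats are assumptions based on the `'Jan. 21, 2017'` strings built above plus full month names. Any other spelling (for example `'Sept.'`) would need an extra entry. `parse_date` and `DATE_FORMATS` are hypothetical names, and month-name matching assumes an English locale.

```python
from datetime import datetime

# Assumed formats: abbreviated month with a period ('Jan. 21, 2017')
# and full month name ('March 1, 2017').
DATE_FORMATS = ('%b. %d, %Y', '%B %d, %Y')

def parse_date(text):
    """Try each assumed format in turn and return a datetime.date."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f'Unrecognized date format: {text!r}')

print(parse_date('Jan. 21, 2017'))  # 2017-01-21
```

Raising on an unknown format, rather than returning `None`, makes a mis-scraped date visible immediately.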

In [0]:
df.head()

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


**Exporting the dataset to CSV**

In [0]:
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
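
To confirm that an export like this survives a round trip, the file can be written and read back. The sketch below uses the standard-library `csv` module instead of pandas, with a single made-up sample row and a temporary file. `write_records`, `read_records`, the sample data, and the file name are all hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical sample row in the same (date, lie, explanation, url) layout
sample_records = [
    ('2017-01-21', "I wasn't a fan of Iraq.",
     'He was for an invasion before he was against it.',
     'https://example.com/source'),
]

def write_records(path, records):
    """Write a header row followed by the record tuples as UTF-8 CSV."""
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['date', 'lie', 'explanation', 'url'])
        writer.writerows(records)

def read_records(path):
    """Read the CSV back as a list of dicts keyed by the header row."""
    with open(path, newline='', encoding='utf-8') as handle:
        return list(csv.DictReader(handle))

path = os.path.join(tempfile.mkdtemp(), 'trump_lies_sample.csv')
write_records(path, sample_records)
rows = read_records(path)
print(len(rows), rows[0]['date'])  # 1 2017-01-21
```

Opening the file with `newline=''` lets the `csv` module handle line endings itself, which matters when a quoted field contains a line break.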