### <center> 16 Line Web Scraper</center>

Source: http://www.dataschool.io/python-web-scraping-of-president-trumps-lies/

In [1]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
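The request above assumes the page loads successfully. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text (hypothetical helper, not from the tutorial)."""
    # Without a timeout, requests.get can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
```

Failing early on a bad response avoids parsing an error page and silently getting zero results later.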

In [2]:
print(r.text[0:500])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--> <html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default  limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"> <!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion p


In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

In [4]:
#find all the records
results = soup.find_all('span', attrs={'class':'short-desc'})

In [5]:
len(results)

116

In [6]:
results[0:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [7]:
results[-1]

<span class="short-desc"><strong>July 19 </strong>“But the F.B.I. person really reports directly to the president of the United States, which is interesting.” <span class="short-truth"><a href="https://www.usatoday.com/story/news/politics/onpolitics/2017/07/20/fbi-director-reports-justice-department-not-president/495094001/" target="_blank">(He reports directly to the attorney general.)</a></span></span>

In [8]:
first_result = results[0]
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [9]:
first_result.find('strong')

<strong>Jan. 21 </strong>

In [10]:
first_result.find('strong').text
# \xa0 is the escape sequence for a non-breaking space

'Jan. 21\xa0'

In [11]:
first_result.find('strong').text[0:-1]

'Jan. 21'

In [13]:
first_result.find('strong').text[0:-1] + ', 2017'

'Jan. 21, 2017'
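Slicing with `[0:-1]` works here because every date ends in exactly one non-breaking space. As a hedged alternative, `str.strip()` removes leading and trailing whitespace of any length, including `\xa0`; the variable names below are illustrative only.

```python
# The raw text of the <strong> tag, as shown in the output above.
raw_date = 'Jan. 21\xa0'

# str.strip() treats the non-breaking space as whitespace in Python 3.
date_text = raw_date.strip() + ', 2017'
print(date_text)  # Jan. 21, 2017
```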

#### Extracting the lie

In [15]:
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [16]:
first_result.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [17]:
first_result.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [18]:
first_result.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

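The slice `[1:-2]` depends on the string having one opening quote, one closing quote, and one trailing space. A small sketch that strips those characters by name instead of by position, assuming the quotes are always the curly pair shown above:

```python
# The second child of the tag, as shown in the output above.
raw_lie = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

# First drop surrounding whitespace, then drop curly quotes from both ends.
lie_text = raw_lie.strip().strip('“”')
print(lie_text)
```

This variant still returns the right text if a statement happens to have no trailing space.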
#### Extracting the Explanation

In [19]:
first_result.contents[2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

In [20]:
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [21]:
first_result.find('a').text[1:-1]

'He was for an invasion before he was against it.'

#### Extracting the URL

In [22]:
first_result.find('a')

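<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>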

In [24]:
first_result.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

#### Recap: Beautiful Soup methods and attributes

find(): searches for the first matching tag, and returns a Tag object<br>
find_all(): searches for all matching tags, and returns a ResultSet object (which you can treat like a list of Tags)

You can extract information from a Tag object (such as first_result) using these two attributes:

text: extracts the text of a Tag, and returns a string<br>
contents: extracts the children of a Tag, and returns a list of Tags and strings
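The recap can be tried on a small inline snippet without fetching the real page. The HTML below is made up for illustration (the quotes, notes, and `example.com` URLs are placeholders), but it follows the same structure as the records on the page.

```python
from bs4 import BeautifulSoup

# Placeholder HTML with the same shape as the real records.
html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>'
    '“Quote one.” '
    '<span class="short-truth"><a href="https://example.com/a">(Note one.)</a></span></span>'
    '<span class="short-desc"><strong>Jan. 23\xa0</strong>'
    '“Quote two.” '
    '<span class="short-truth"><a href="https://example.com/b">(Note two.)</a></span></span>'
)
demo_soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching Tag; find_all() returns every match.
first = demo_soup.find('span', attrs={'class': 'short-desc'})
all_spans = demo_soup.find_all('span', attrs={'class': 'short-desc'})

# .text gives a string; .contents gives the direct children (Tags and strings).
date_text = first.find('strong').text
children = first.contents
link_url = first.find('a')['href']

print(len(all_spans), len(children), link_url)
```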

### Building a Dataset

Now that we've figured out how to extract the four components of first_result, we can create a loop to repeat this process on all 116 results. We'll store the output in a list of tuples called records:



In [30]:
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

In [31]:
len(records)

116

In [32]:
records[0:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]
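The loop above assumes every record contains a `<strong>` tag, a text node, and a link. If the page markup changes, `find()` returns `None` and the loop raises an error. The following is a sketch of a guarded version; the helper `parse_result` and the sample HTML are hypothetical, and the slicing is the same as in the tutorial.

```python
from bs4 import BeautifulSoup


def parse_result(result):
    """Return (date, lie, explanation, url), or None if a piece is missing."""
    strong = result.find('strong')
    link = result.find('a')
    if strong is None or link is None or len(result.contents) < 2:
        return None
    date = strong.text[0:-1] + ', 2017'
    lie = str(result.contents[1])[1:-2]
    explanation = link.text[1:-1]
    url = link.get('href')  # .get() returns None instead of raising KeyError
    return (date, lie, explanation, url)


# Placeholder HTML: one well-formed record and one without a link.
sample_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>“Quote one.” '
    '<span class="short-truth"><a href="https://example.com/a">(Note one.)</a></span></span>'
    '<span class="short-desc"><strong>Feb. 1\xa0</strong>“No link here.” </span>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

sample_records = []
for result in sample_soup.find_all('span', attrs={'class': 'short-desc'}):
    record = parse_result(result)
    if record is not None:  # skip malformed entries instead of crashing
        sample_records.append(record)

print(sample_records)
```

Comparing `len(sample_records)` with the number of matched spans is a quick way to see whether anything was skipped.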

#### Applying a tabular data structure

In [37]:
import pandas as pd
df = pd.DataFrame(records, columns=['date','lie', 'explanation','url'])
df.head()

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [38]:
df.tail()

| | date | lie | explanation | url |
|---|---|---|---|---|
| 111 | July 6, 2017 | As a result of this insistence, billions of do... | NATO countries agreed to meet defense spending... | http://www.nbcnews.com/politics/donald-trump/f... |
| 112 | July 17, 2017 | We’ve signed more bills — and I’m talking abou... | Clinton, Carter, Truman, and F.D.R. had signed... | https://www.nytimes.com/2017/07/17/us/politics... |
| 113 | July 19, 2017 | Um, the Russian investigation — it’s not an in... | It is. | http://time.com/4823514/donald-trump-investiga... |
| 114 | July 19, 2017 | I heard that Harry Truman was first, and then ... | Presidents Clinton, Carter, Truman, and F.D.R.... | https://www.nytimes.com/2017/07/17/us/politics... |
| 115 | July 19, 2017 | But the F.B.I. person really reports directly ... | He reports directly to the attorney general. | https://www.usatoday.com/story/news/politics/o... |


Note that "January" is abbreviated while "July" is not. It's best to format your data consistently, so we'll convert the date column to pandas' datetime format:

In [41]:
df['date'] = pd.to_datetime(df['date'])
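How `pd.to_datetime` copes with a column that mixes abbreviated and full month names can depend on the pandas version; newer releases may infer one format from the first value and then warn or fail on values that don't match it. As a version-independent sketch, the strings can be parsed with the standard library first. The helper name `parse_nyt_date` is hypothetical, and the code assumes English month names in the current locale.

```python
from datetime import datetime


def parse_nyt_date(text):
    """Parse strings such as 'Jan. 21, 2017' or 'July 19, 2017' (hypothetical helper)."""
    cleaned = text.replace('.', '')  # 'Jan. 21, 2017' -> 'Jan 21, 2017'
    # Try the abbreviated month name first, then the full month name.
    for fmt in ('%b %d, %Y', '%B %d, %Y'):
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    raise ValueError(f'Unrecognized date format: {text!r}')


print(parse_nyt_date('Jan. 21, 2017').date())
print(parse_nyt_date('July 19, 2017').date())
```

The resulting `datetime` objects can be placed in the `date` column directly, after which pandas stores them as datetimes.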

In [47]:
df.head(1)

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |


In [48]:
df.tail(1)

| | date | lie | explanation | url |
|---|---|---|---|---|
| 115 | 2017-07-19 | But the F.B.I. person really reports directly ... | He reports directly to the attorney general. | https://www.usatoday.com/story/news/politics/o... |


#### Exporting the dataset to CSV file

In [52]:
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')

In [56]:
df = pd.read_csv('trump_lies.csv',parse_dates=['date'], encoding='utf-8')
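To see why `index=False` and `parse_dates` matter, here is a small round trip that uses an in-memory buffer instead of a file. The two-row DataFrame is placeholder data, not the scraped dataset.

```python
import io

import pandas as pd

# Placeholder data with the same date column type as the real dataset.
sample = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-21', '2017-07-19']),
    'lie': ['first quote', 'second quote'],
})

# index=False keeps the row index out of the CSV, so no extra unnamed
# column appears when the file is read back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# CSV stores dates as plain text; parse_dates converts them back to datetimes.
restored = pd.read_csv(buffer, parse_dates=['date'])
print(restored.dtypes)
```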

#### Summary: 16 lines of Python code

In [65]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span',attrs={'class':'short-desc'})

records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

import pandas as pd
df = pd.DataFrame(records, columns=['date','lie','explanation','url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')