# Reading the web page into Python


The first thing we need to do is to read the HTML for this article into Python, which we'll do using the requests library. (If you don't have it, you can pip install requests from the command line.)

In [1]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

The code above fetches our web page from the URL, and stores the result in a "response" object called r. That response object has a text attribute, which contains the same HTML code we saw when viewing the source from our web browser:
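
A request can fail or hang, so it can help to wrap the fetch in a small function. The following is a minimal sketch, not part of the original walkthrough: the name fetch_html and the 10-second default timeout are assumptions made for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML at `url` as a string (hypothetical helper)."""
    # The timeout (in seconds) keeps the call from waiting indefinitely.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on a 4xx or 5xx response.
    response.raise_for_status()
    return response.text
```

With this helper, fetch_html(url) would return the same string as r.text above, or raise an exception if the server reported an error.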

In [2]:
# print the first 500 characters of the HTML
print(r.text[0:500])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page


# Parsing the HTML using Beautiful Soup

We're going to parse the HTML using the Beautiful Soup 4 library, which is a popular Python library for web scraping. (If you don't have it, you can pip install beautifulsoup4 from the command line.)

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

The code above parses the HTML (stored in r.text) into a special object called soup that the Beautiful Soup library understands. In other words, Beautiful Soup is reading the HTML and making sense of its structure.

(Note that html.parser is the parser included with the Python standard library, though Beautiful Soup can also use other parsers. See "Differences between parsers" in the Beautiful Soup documentation to learn more.)
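
To see what "making sense of the structure" means without fetching anything, here is a small sketch that parses a made-up HTML string. The sample_html content and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document (not the article's HTML).
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello, <b>world</b>!</p></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached by name; .text joins all of the text nested inside a tag.
title_text = sample_soup.title.text
paragraph_text = sample_soup.p.text
print(title_text)      # Demo
print(paragraph_text)  # Hello, world!
```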

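You might have noticed that each record has the following format:

<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>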

There's an outer <span> tag, and then nested within it is a <strong> tag plus another <span> tag, which itself contains an <a> tag. All of these tags affect the formatting of the text. And because the New York Times wants each record to appear in a consistent way in your web browser, we know that each record will be tagged in a consistent way in the HTML. This is the pattern that allows us to build our dataset!

Let's ask Beautiful Soup to find all of the records:

In [4]:
results = soup.find_all('span', attrs={'class':'short-desc'})

This code searches the soup object for all <span> tags with the attribute class="short-desc". It returns a special Beautiful Soup object (called a "ResultSet") containing the search results.
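
Beautiful Soup also accepts a class_ keyword argument as a shorthand for the attrs dictionary. The sketch below compares the two forms on a made-up snippet that imitates the record format; the sample HTML and example.com URLs are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Two made-up records in the same shape as the article's records.
sample_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>first '
    '<span class="short-truth"><a href="https://example.com/a">(one)</a></span></span>'
    '<span class="short-desc"><strong>Jan. 23\xa0</strong>second '
    '<span class="short-truth"><a href="https://example.com/b">(two)</a></span></span>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The attrs dictionary form used above.
by_attrs = sample_soup.find_all("span", attrs={"class": "short-desc"})
# The keyword form; the trailing underscore avoids Python's reserved word "class".
by_keyword = sample_soup.find_all("span", class_="short-desc")

# Both find the two outer spans and skip the nested "short-truth" spans.
print(len(by_attrs), len(by_keyword))  # 2 2
```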

results acts like a Python list, so we can check its length:

In [5]:
len(results)

180

There are 180 results, which seems reasonable given the length of the article. (If this number did not seem reasonable, we would examine the HTML further to determine if our assumptions about the patterns in the HTML were incorrect.)

We can also slice the object like a list, in order to examine the first three results:

In [6]:
results[0:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

We can also check the last result in the object:

In [7]:
results[-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

Looks good!

We have now collected all 180 of the records, but we still need to separate each record into its four components (date, lie, explanation, and URL) in order to give the dataset some structure.

# 1. Extracting the date

Web scraping is often an iterative process, in which you experiment with your code until it works exactly as you desire. To simplify the experimentation, we'll start by only working with the first record in the results object, and then later on we'll modify our code to use a loop:

In [8]:
first_result = results[0]
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

Although first_result may look like a Python string, you'll notice that there are no quote marks around it. Instead, it's another special Beautiful Soup object (called a "Tag") that has specific methods and attributes.

In order to locate the date, we can use its find() method to find a single tag that matches a specific pattern, in contrast to the find_all() method we used above to find all tags that match a pattern:

In [9]:
first_result.find('strong')

<strong>Jan. 21 </strong>

This code searches first_result for the first instance of a <strong> tag, and again returns a Beautiful Soup "Tag" object (not a string).

Since we want to extract the text between the opening and closing tags, we can access its text attribute, which does in fact return a regular Python string:

In [10]:
first_result.find('strong').text

'Jan. 21\xa0'

What is \xa0? You don't actually need to know this, but it's an "escape sequence" that represents the non-breaking space character (U+00A0), which appears as &nbsp; in the HTML source we saw earlier.

However, you do need to know that an escape sequence represents a single character within a string. Let's slice it off from the end of the string:

In [11]:
first_result.find('strong').text[0:-1]

'Jan. 21'
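
As a quick check of the point about escape sequences, the sketch below works on a plain string rather than on the soup. It also shows rstrip() as one possible alternative to slicing, on the assumption that any trailing whitespace should be removed; that is a variation, not what the walkthrough itself does.

```python
raw_date = "Jan. 21\xa0"

# The four typed characters \xa0 stand for a single character in the string.
print(len("\xa0"))    # 1
print(len(raw_date))  # 8

# Slicing removes exactly one trailing character, whatever it is.
sliced = raw_date[0:-1]
# rstrip() removes all trailing whitespace; Python treats \xa0 as whitespace.
stripped = raw_date.rstrip()
print(sliced == stripped)  # True
```

The slice assumes every date ends with exactly one extra character, while rstrip() would also cope with dates that have none or several.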


# 2. Extracting the lie
Let's take another look at first_result:

In [12]:
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

Our goal is to extract the two sentences about Iraq. Unfortunately, there isn't a pair of opening and closing tags that starts immediately before the lie and ends immediately after the lie. Therefore, we're going to have to use a different technique:

In [13]:

first_result.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

The first_result "Tag" has a contents attribute, which returns a Python list containing its "children". What are children? They are the Tags and strings that are nested within a Tag.
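
To make the distinction between Tags and strings concrete, here is a small sketch that inspects the type of each child. The sample HTML is made up to imitate one record, and the names sample_tag and child_types are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# One made-up record in the same shape as first_result.
sample_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>'
    '“A sample statement.” '
    '<span class="short-truth">'
    '<a href="https://example.com/source">(A sample explanation.)</a>'
    '</span></span>'
)
sample_tag = BeautifulSoup(sample_html, "html.parser").find("span")

# Each child is either a Tag or a NavigableString (a subclass of str).
child_types = [type(child).__name__ for child in sample_tag.contents]
print(child_types)  # ['Tag', 'NavigableString', 'Tag']
```

Because NavigableString is a subclass of str, the middle child supports ordinary string operations such as slicing.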

We can index this list to extract the second element:

In [14]:

first_result.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [15]:
first_result.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

# 3. Extracting the explanation
Based upon what you've seen already, you might have figured out that we have at least two options for how we extract the third component of the record, which is the writer's explanation of why the President's statement was a lie.

The first option is to index into the contents attribute, like we did when extracting the lie:

In [16]:
first_result.contents[2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

The second option is to search for the surrounding tag, like we did when extracting the date:

In [17]:
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>


Either way, we can access the text attribute and then slice off the opening and closing parentheses:

In [18]:

first_result.find('a').text[1:-1]

'He was for an invasion before he was against it.'
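
One thing to keep in mind is that find() returns None when nothing matches, so calling .text on the result would raise an AttributeError. The sketch below shows one way to guard against that; the function name explanation_of and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Two made-up records: one with a link and one without.
with_link = BeautifulSoup(
    '<span class="short-desc">text '
    '<a href="https://example.com/x">(Explained.)</a></span>',
    "html.parser",
).span
without_link = BeautifulSoup(
    '<span class="short-desc">text only</span>', "html.parser"
).span


def explanation_of(record):
    """Return the link text without its parentheses, or None if no <a> tag."""
    link = record.find("a")
    if link is None:
        return None
    return link.text[1:-1]


print(explanation_of(with_link))     # Explained.
print(explanation_of(without_link))  # None
```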

# 4. Extracting the URL
Finally, we want to extract the URL of the article that substantiates the writer's claim that the President was lying.

Let's examine the <a> tag within first_result:

In [19]:
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

So far in this tutorial, we have been extracting text that is between tags. In this case, the text we want to extract is located within the tag itself. Specifically, we want to access the value of the href attribute within the <a> tag.

Beautiful Soup treats tag attributes and their values like key-value pairs in a dictionary: you put the attribute name in brackets (like a dictionary key), and you get back the attribute's value:

In [20]:
first_result.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
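
The dictionary analogy extends a little further: a Tag also has a get() method and an attrs dictionary. The following sketch uses a made-up link to show how bracket access and get() differ when an attribute is missing.

```python
from bs4 import BeautifulSoup

# A made-up link for illustration.
link = BeautifulSoup(
    '<a href="https://example.com/source" target="_blank">(text)</a>',
    "html.parser",
).a

href = link["href"]          # bracket access, as used above
missing = link.get("title")  # get() returns None for a missing attribute
all_attrs = link.attrs       # every attribute as a dictionary

# Bracket access raises KeyError when the attribute does not exist.
try:
    link["title"]
    raised = False
except KeyError:
    raised = True

print(href)     # https://example.com/source
print(missing)  # None
print(raised)   # True
```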

# Building the dataset
Now that we've figured out how to extract the four components of first_result, we can create a loop to repeat this process on all 180 results. We'll store the output in a list of tuples called records:

In [21]:

records = []
for result in results:
    date = result.find('strong').text[0:-1]
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))
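
To try the loop without fetching the article, the sketch below runs the same four extraction steps on two made-up records. The sample HTML, the example.com URLs, and the names sample_results and sample_records are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Two made-up records in the same shape as the article's records.
sample_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>'
    '“First sample statement.” <span class="short-truth">'
    '<a href="https://example.com/one">(First explanation.)</a></span></span>'
    '<span class="short-desc"><strong>Jan. 23\xa0</strong>'
    '“Second sample statement.” <span class="short-truth">'
    '<a href="https://example.com/two">(Second explanation.)</a></span></span>'
)
sample_results = BeautifulSoup(sample_html, "html.parser").find_all(
    "span", attrs={"class": "short-desc"}
)

sample_records = []
for result in sample_results:
    date = result.find("strong").text[0:-1]    # drop the trailing \xa0
    lie = result.contents[1][1:-2]             # drop the quotes and the space
    explanation = result.find("a").text[1:-1]  # drop the parentheses
    url = result.find("a")["href"]
    sample_records.append((date, lie, explanation, url))

print(sample_records[0])
```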

Since there were 180 results, we should have 180 records:

In [22]:
len(records)

180

In [23]:
# Spot Check

records[0:3]

[('Jan. 21',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]

# Applying a tabular data structure
The last major step in this process is to apply a tabular data structure to our existing structure (which is a list of tuples). We're going to do this using the pandas library, a widely used Python library for data analysis and manipulation. (If you don't have it, see the installation instructions in the pandas documentation.)

The primary data structure in pandas is the "DataFrame", which is suitable for tabular data with columns of different types, similar to an Excel spreadsheet or SQL table. We can convert our list of tuples into a DataFrame by passing it to the DataFrame constructor and specifying the desired column names:

In [24]:
import pandas as pd
df = pd.DataFrame(records, columns=['Date', 'Lie', 'Explanation', 'Url'])

In [25]:
df.head()

|   | Date | Lie | Explanation | Url |
|---|---|---|---|---|
| 0 | Jan. 21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
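
As a small sanity check on the conversion, the sketch below builds a DataFrame from two made-up tuples and confirms its shape and column names. The sample data and the name sample_df are assumptions for illustration.

```python
import pandas as pd

# Two made-up records in the same (date, lie, explanation, url) order.
sample_records = [
    ("Jan. 21", "First sample statement.", "First explanation.", "https://example.com/one"),
    ("Jan. 23", "Second sample statement.", "Second explanation.", "https://example.com/two"),
]
sample_df = pd.DataFrame(
    sample_records, columns=["Date", "Lie", "Explanation", "Url"]
)

# Each tuple becomes a row, and each tuple position becomes a column.
print(sample_df.shape)          # (2, 4)
print(list(sample_df.columns))  # ['Date', 'Lie', 'Explanation', 'Url']
```

Running the same checks on df should report 180 rows and 4 columns if every record was parsed.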
