# Web scraping - First steps

    We'll examine the New York Times article "Trump’s Lies".
    When converting it into a dataset, you can think of each lie as a "record" with four fields:
        The date of the lie.
        The lie itself (as a quotation).
        The writer's brief explanation of why it was a lie.
        The URL of an article that substantiates the claim that it was a lie.

Video that helped me: https://www.youtube.com/playlist?list=PL5-da3qGB5IDbOi0g5WFh1YPDNzXw4LNL

## Agenda
    1 - Reading the web page into Python
    2 - Parsing the HTML using Beautiful Soup
    3 - Extracting the components 
        3.1 Extracting the date
        3.2 Extracting the lie
        3.3 Extracting the explanation
        3.4 Extracting the URL
    4 - Recap: Beautiful Soup methods and attributes
    5 - Building the dataset
    6 - Applying a tabular data structure
    7 - Exporting the dataset to a CSV file
    8 - Summary: 16 lines of Python code

### 1 - Reading the web page into Python
    The first thing we need to do is to read the HTML into Python, which we'll
    do using the requests library.

In [1]:
import requests

r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

In [3]:
# print the first 500 characters of the HTML

print(r.text[0:500])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page
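The request above assumes the download succeeds. As a minimal sketch (the helper name `fetch_html` and the 10-second default are assumptions for illustration, not part of the original tutorial), the call can be wrapped so that a failed or stalled request raises an error instead of silently handing an error page to the parser:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, raising if the request fails."""
    # timeout stops the call from hanging indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(...)` with the article URL would then return the same string as `r.text`, provided the server responds with a success status.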


### 2 - Parsing the HTML using Beautiful Soup
    We're going to parse the HTML into a BeautifulSoup() object called soup.
    This object has attributes and methods we'll use to scrape the web page.
    
    In other words, BeautifulSoup is reading the HTML and making sense of its structure.
    

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')

Each record we want has the following format:
    
    This is the pattern that allows us to build our dataset:
    
    <span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>
    
      

In [5]:
# Find all the records
'''
    This collects every tag that matches the opening part of the pattern: <span class="short-desc">
'''

results = soup.find_all('span', attrs={'class':'short-desc'})

In [6]:
# Note: results behaves like a Python list
''' 
    A length of 180 means there are 180 Tag objects in it:
    
    180 <span> tags with the attribute class="short-desc"
'''

len(results)

180
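Because the live page can change, it may help to try the same search on a small offline sample. The sketch below uses a made-up two-record snippet (the claims, rebuttals, and example.com URLs are placeholders) that follows the pattern described above:

```python
from bs4 import BeautifulSoup

# Hypothetical two-record sample that follows the same pattern as the article
sample_html = '''
<div>
<span class="short-desc"><strong>Jan. 21&nbsp;</strong>“First claim.” <span class="short-truth"><a href="https://example.com/1">(First rebuttal.)</a></span></span>
<span class="short-desc"><strong>Jan. 23&nbsp;</strong>“Second claim.” <span class="short-truth"><a href="https://example.com/2">(Second rebuttal.)</a></span></span>
</div>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_results = sample_soup.find_all('span', attrs={'class': 'short-desc'})

# Only the outer spans match; the nested short-truth spans have a different class
print(len(sample_results))  # 2
```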

In [7]:
results[0:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [8]:
# Notice the similar structure

for content in results[0:3]:
    print(content)
    print() # space   

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

<span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>

<span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_bla

In [9]:
# The last record

results[-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

### 3 - Extracting the components 
    
    We have now collected all 180 of the records, but we still need to separate
    each record into its four components (date, lie, explanation, and URL) in
    order to give the dataset some structure.

#### Extracting the date

    Although first_result may look like a Python string, you'll notice that there 
    are no quote marks around it. Instead, it's another special BeautifulSoup() 
    object (called a "Tag") that has specific methods and attributes.

    In order to locate the date, we can use its .find() method to find a single tag
    that matches a specific pattern, in contrast to the .find_all() method we used 
    above to find all tags that match a pattern.

In [10]:
first_result = results[0]
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [24]:
'''
    From the components above, we are getting this part: 
    
    <span class="short-desc">
    
    
    
    \(*O*)/    <strong>Jan. 21 </strong>       \(*O*)/
    
    
    
    “I wasn't a fan of Iraq.
    I didn't want to go into Iraq.” <span class="short-truth">
    <a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he
    -supported-invading-iraq-on-the" target="_blank">(He was for an invasion before
    he was against it.)</a></span></span>
'''

first_result.find('strong')

<strong>Jan. 21 </strong>

In [11]:
'''
    Since we want to extract the text between the opening and closing tags, 
    we can access its text attribute (with .text), which does in fact return 
    a regular Python string
'''

first_result.find('strong').text

'Jan. 21\xa0'

What is \xa0? You don't actually need to know this, but it's an "escape sequence" that represents the non-breaking space character, which is written as "&nbsp;" in the HTML source.

However, you do need to know that an **escape sequence represents a single character** within a string. Let's slice it off from the end of the string:

In [13]:
first_result.find('strong').text[0:-1]

'Jan. 21'
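To see that the escape sequence really counts as one character, here is a small sketch using the string returned above. The `replace()` call is an alternative to slicing (not the tutorial's own approach) that does not depend on the character being last:

```python
date_text = 'Jan. 21\xa0'  # the value returned by .text above

print(len(date_text))           # 8: seven visible characters plus one non-breaking space
print(date_text[-1] == '\xa0')  # True

# Alternative to slicing: remove the non-breaking space wherever it appears
cleaned = date_text.replace('\xa0', '')
print(cleaned)                  # Jan. 21
```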

In [14]:
'''
    Finally, we're going to add the year, since we don't want our dataset to include ambiguous dates:
'''
# Keep this expression in mind. We'll use it later, since it returns
# one of the components we want to put in the dataset.

first_result.find('strong').text[0:-1] + ', 2017'

'Jan. 21, 2017'

#### Extracting the lie
    Our goal is to extract the two sentences about Iraq. Unfortunately, 
    there isn't a pair of opening and closing tags that starts immediately 
    before them and ends immediately after them. Therefore, we're going
    to have to use a different technique.

In [15]:
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [16]:
'''
    The first_result Tag has a "contents" attribute, which returns a Python list 
    containing its "children" (Tags and strings that are nested within a Tag)

    We can slice this list to extract the second element:
'''

first_result.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [21]:
# enumerate() supplies the index, so no manual counter is needed
for nth, child in enumerate(first_result.contents):
    print('CHILD INDEX [', nth, ']')
    print(child)
    print('\n\n')

CHILD INDEX [ 0 ]
<strong>Jan. 21 </strong>



CHILD INDEX [ 1 ]
“I wasn't a fan of Iraq. I didn't want to go into Iraq.” 



CHILD INDEX [ 2 ]
<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>





In [18]:
'''
    Since we want the element at index 1, we can just index into the list:
'''

first_result.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [23]:
'''
    And now we slice off the opening quotation mark, the closing quotation mark,
    and the trailing space that follows it:
'''
first_result.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."
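The fixed slice works because this string starts with one curly quote and ends with a curly quote plus a space. As a hedged alternative (an assumption about what other records might look like, not something the original tutorial does), stripping by character does not depend on exact positions:

```python
raw_lie = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

# Remove surrounding whitespace first, then any curly quotes at either end
lie = raw_lie.strip().strip('“”')
print(lie)
```

For this record the result is identical to `raw_lie[1:-2]`; the difference only matters if a record lacks the trailing space or a quotation mark.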

#### Extracting the explanation

    Based upon what you've seen already, you might have figured out
    that we have at least two options for how we extract the third 
    component of the record, which is the writer's explanation of why
    the President's statement was a lie.

In [24]:
# the first option is to slice the contents attribute, like we did just now when extracting the lie

first_result.contents[2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

In [26]:
# The second option is to search for the surrounding tag, like we did when extracting the date

first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [27]:
# Either way, we can access the text attribute and then slice off the opening and closing parentheses

first_result.contents[2].text[1:-1]

'He was for an invasion before he was against it.'

In [29]:
# The same result, using the second option
first_result.find('a').text[1:-1]

'He was for an invasion before he was against it.'

#### Extracting the URL
    Finally, we want to extract the URL of the article 
    that substantiates the writer's claim that the President was lying.

In [30]:
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [31]:
'''
    So far in this tutorial, we have been extracting text that is between tags. 
    In this case, the text we want to extract is located within the tag itself. 
    Specifically, we want to access the value of the href attribute within the <a> tag.

    Beautiful Soup treats tag attributes and their values like key-value pairs in a dictionary: 
    you put the attribute name in brackets (like a dictionary key), and you get back the attribute's value:
'''

first_result.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
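The dictionary comparison extends a little further. In this sketch (the snippet and its example.com URL are made up for illustration), `.attrs` exposes all attributes of a tag as a dictionary, and `.get()` returns None for a missing attribute where bracket access would raise a KeyError:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one link with an href, one without
snippet = '<p><a href="https://example.com/source" target="_blank">(With link.)</a><a>(No link.)</a></p>'
links = BeautifulSoup(snippet, 'html.parser').find_all('a')

print(links[0]['href'])      # https://example.com/source
print(links[0].attrs)        # {'href': 'https://example.com/source', 'target': '_blank'}
print(links[1].get('href'))  # None, whereas links[1]['href'] would raise KeyError
```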

## 4 - Recap: Beautiful Soup methods and attributes
    Before we finish building the dataset, let's summarize a few ways 
    you can interact with Beautiful Soup objects.

You can apply these **two methods** to either the initial soup object or a Tag object (such as first_result):

        .find(): searches for the first matching (html) tag, and returns a Tag object
    
        .find_all(): searches for all matching (html) tags, and returns a ResultSet object 
        (which you can treat like a list of Tags)
    
You can extract information from a Tag object (such as first_result) using these **two attributes**:

        .text: extracts the text of a Tag, and returns a string
    
        .contents: extracts the children of a Tag, and returns a list of Tags and strings
    
    
It's important to keep track of whether you are interacting with a Tag, ResultSet, list, or string, because that affects which methods and attributes you can access.

And of course, there are many more methods and attributes available to you, which are described in the Beautiful Soup documentation.
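As a quick way to keep track of which kind of object you are holding, the following sketch checks the types on a made-up snippet. Note that the strings inside `.contents` are NavigableString objects, which are a subclass of Python's str:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, ResultSet, Tag

# Hypothetical record, shortened for illustration
snippet = '<span class="short-desc"><strong>Jan. 21</strong>“A claim.” <a href="https://example.com">(A rebuttal.)</a></span>'
soup_demo = BeautifulSoup(snippet, 'html.parser')

tag = soup_demo.find('span')
print(isinstance(tag, Tag))                             # True
print(isinstance(soup_demo.find_all('a'), ResultSet))   # True
print(isinstance(tag.text, str))                        # True
print([type(child).__name__ for child in tag.contents])
# ['Tag', 'NavigableString', 'Tag']
```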

## 5 - Building the dataset
    Now that we've figured out how to extract the four components of first_result
    (    
         first_result.find('strong').text[0:-1] + ', 2017'
         
         first_result.contents[1][1:-2]
         
         first_result.find('a').text[1:-1]
         
         first_result.find('a')['href']  
     )     
    we can create a loop to repeat this process on all 180 results. We'll store the
    output in a list of tuples called records:

In [60]:
'''
    For each result in results:
        
        Extract the four components (date, lie, explanation, url) into variables
        
        Then append them to the records list as a single tuple
'''


records = []

for result in results:
    
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    
    records.append((date, lie, explanation, url))

In [64]:
# As we did at the beginning, check the length:

len(records)

180
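The loop above assumes every result matches the pattern exactly. If a `<strong>` or `<a>` tag were missing, `.find()` would return None and the `.text` access would raise an AttributeError. A more defensive variant might look like the sketch below; the function name `parse_record`, the sample HTML, and the choice to skip malformed entries are all assumptions for illustration:

```python
from bs4 import BeautifulSoup


def parse_record(result, year='2017'):
    """Return (date, lie, explanation, url), or None if the pattern doesn't match."""
    strong = result.find('strong')
    link = result.find('a')
    if strong is None or link is None or len(result.contents) < 2:
        return None
    date = strong.text.replace('\xa0', '').strip() + ', ' + year
    lie = str(result.contents[1]).strip().strip('“”')
    explanation = link.text.strip('()')
    url = link.get('href')
    return (date, lie, explanation, url)


# Hypothetical sample: one well-formed record and one malformed entry
sample_html = '''
<span class="short-desc"><strong>Jan. 21&nbsp;</strong>“A claim.” <span class="short-truth"><a href="https://example.com/1">(A rebuttal.)</a></span></span>
<span class="short-desc">Malformed entry with no date or link</span>
'''
sample_results = BeautifulSoup(sample_html, 'html.parser').find_all('span', class_='short-desc')

parsed = [parse_record(result) for result in sample_results]
sample_records = [record for record in parsed if record is not None]
print(sample_records)
# [('Jan. 21, 2017', 'A claim.', 'A rebuttal.', 'https://example.com/1')]
```

One trade-off: `strip('()')` removes every parenthesis at either end, so an explanation that legitimately ends with a parenthesis would lose it. The fixed slice `[1:-1]` used in the tutorial avoids that at the cost of assuming the parentheses are always present.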

In [62]:
records[0:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]

In [63]:
for content in records[0:3]:
    print(content)
    print('\n\n')

('Jan. 21, 2017', "I wasn't a fan of Iraq. I didn't want to go into Iraq.", 'He was for an invasion before he was against it.', 'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the')



('Jan. 21, 2017', 'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.', 'Trump was on the cover 11 times and Nixon appeared 55 times.', 'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/')



('Jan. 23, 2017', 'Between 3 million and 5 million illegal votes caused me to lose the popular vote.', "There's no evidence of illegal voting.", 'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')





## 6 - Applying a tabular data structure
        The last major step in this process is to apply a tabular data structure 
        to our existing structure (which is a list of tuples). 

In [71]:
# We can convert our list of tuples into a DataFrame by passing it to the DataFrame constructor 
# and specifying the desired column names

import pandas as pd

df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])

# First 5 rows of the DataFrame
df.head()

|   | date | lie | explanation | url |
|---|------|-----|-------------|-----|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [72]:
df.dtypes

date           object
lie            object
explanation    object
url            object
dtype: object

In [73]:
# Let's convert the date column from object to a datetime format

df['date'] = pd.to_datetime(df['date'])

df.dtypes

date           datetime64[ns]
lie                    object
explanation            object
url                    object
dtype: object
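In the cell above, pandas works out the date format on its own. As a sketch of what that parse amounts to for a single value, here is the standard library equivalent with an explicit format string. The format `'%b. %d, %Y'` is an assumption that fits values such as 'Jan. 21, 2017'; month spellings that are not a three-letter abbreviation followed by a period would not match it:

```python
from datetime import datetime

# Assumed format: abbreviated month with a period, e.g. 'Jan. 21, 2017'
DATE_FORMAT = '%b. %d, %Y'

parsed = datetime.strptime('Jan. 21, 2017', DATE_FORMAT)
print(parsed.date())  # 2017-01-21
```

`pd.to_datetime` accepts the same kind of string through its `format` parameter, which makes the expected layout explicit instead of inferred.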

In [79]:
# Notice that the date column is now displayed in YYYY-MM-DD format

df.head()

|   | date | lie | explanation | url |
|---|------|-----|-------------|-----|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


### 7 - Exporting the dataset to a CSV file

In [80]:
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')

In [85]:
# Read the CSV file back in to check the export
trump_lies = pd.read_csv('trump_lies.csv')

trump_lies.head()

|   | date | lie | explanation | url |
|---|------|-----|-------------|-----|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
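One thing the round trip does not preserve is the datetime type: a CSV file stores only text, so the date column comes back as strings unless you ask pandas to parse it. The sketch below shows this with a small made-up DataFrame and an in-memory buffer standing in for the file:

```python
import io

import pandas as pd

# Hypothetical two-row frame with a datetime column
df_demo = pd.DataFrame({'date': pd.to_datetime(['2017-01-21', '2017-01-23']),
                        'lie': ['First claim.', 'Second claim.']})

# With no path given, to_csv() returns the CSV as a string
csv_text = df_demo.to_csv(index=False)

plain = pd.read_csv(io.StringIO(csv_text))
typed = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(pd.api.types.is_datetime64_any_dtype(plain['date']))  # False: dates came back as text
print(pd.api.types.is_datetime64_any_dtype(typed['date']))  # True
```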


### 8 - Summary: 16 lines of Python code
    Here are the 16 lines of code that we used to scrape the web page, extract the relevant data,
    convert it into a tabular dataset, and export it to a CSV file:

In [None]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class':'short-desc'})

records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')

## Web Scraping Advice
    Web scraping works best with static, well-structured web pages. 
    Dynamic or interactive content on a web page is often not accessible through the HTML source, 
    which makes scraping it much harder!
    
    Web scraping is a "fragile" approach for building a dataset.
    The HTML on a page you are scraping can change at any time, which may cause your scraper to stop working.
    
    If you can download the data you need from a website, or if the website provides an API with data access,
    those approaches are preferable to scraping since they are easier to implement and less likely to break.
    
    If you are scraping a lot of pages from the same website (in rapid succession),
    it's best to insert delays in your code so that you don't overwhelm the website with requests.
    If the website decides you are causing a problem, they can block your IP address
    (which may affect everyone in your building!)
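One simple way to insert delays is to pause between consecutive pages. This is only a sketch: the helper name `throttled`, the delay values, and the example.com URLs are assumptions, and a real scraper would make a request where the print call is:

```python
import time


def throttled(items, delay_seconds=1.0):
    """Yield each item, sleeping between consecutive items (not before the first)."""
    for index, item in enumerate(items):
        if index > 0:
            time.sleep(delay_seconds)
        yield item


# Hypothetical page list; replace the print with a real request in practice
urls = ['https://example.com/page1', 'https://example.com/page2']
for url in throttled(urls, delay_seconds=0.1):
    print('fetching', url)
```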
    
    Before scraping a website, you should review its robots.txt file
    (part of the Robots exclusion standard) to check whether you are "allowed" 
    to scrape it. (For nytimes.com, the file is at https://www.nytimes.com/robots.txt.)
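Python's standard library can read robots.txt rules for you. The sketch below feeds the parser a made-up set of rules inline so that it runs offline; the rules and example.com URLs are placeholders, not the contents of any real site's file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so the example runs offline
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
    'Crawl-delay: 2',
]

parser = RobotFileParser()
parser.parse(robots_lines)

print(parser.can_fetch('*', 'https://example.com/articles/1'))    # True
print(parser.can_fetch('*', 'https://example.com/private/data'))  # False
print(parser.crawl_delay('*'))                                     # 2
```

Against a real site you would instead call `parser.set_url(...)` with the robots.txt address and then `parser.read()` to download the rules before calling `can_fetch()`.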

### Alternative syntax for Beautiful Soup
    It's worth noting that Beautiful Soup actually offers multiple ways to express the same command.
    For example, you can search for a tag by accessing it like an attribute.
In [86]:
# Search for a tag by name
first_result.find('strong')

<strong>Jan. 21 </strong>

In [87]:
# Shorter alternative: access it like an attribute
first_result.strong

<strong>Jan. 21 </strong>

In [92]:
# You can also search for multiple tags a few different ways:
# search for multiple tags by name and attribute

results = soup.find_all('span', attrs={'class':'short-desc'})
results[0]

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [93]:
# shorter alternative: if you don't specify a method, it's assumed to be find_all()
results = soup('span', attrs={'class':'short-desc'})
results[0]

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [94]:
# even shorter alternative: you can specify the attribute as if it's a parameter
results = soup('span', class_='short-desc')
results[0]

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>
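Another alternative, not used in this tutorial, is CSS selector syntax through `.select()` and `.select_one()`. The sketch below compares it with `.find_all()` on a made-up one-record snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical record, shortened for illustration
snippet = '''
<span class="short-desc"><strong>Jan. 21</strong>“A claim.” <span class="short-truth"><a href="https://example.com">(A rebuttal.)</a></span></span>
'''
demo = BeautifulSoup(snippet, 'html.parser')

by_find_all = demo.find_all('span', class_='short-desc')
by_selector = demo.select('span.short-desc')
print(len(by_selector))                     # 1
print(by_find_all[0] is by_selector[0])     # True: both point at the same Tag

# select_one() is the selector counterpart of find()
print(demo.select_one('span.short-desc a')['href'])  # https://example.com
```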

### Some useful links:

Web Scraping 101 with Python: http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/


More web scraping with Python (and a map): http://www.gregreda.com/2013/04/29/more-web-scraping-with-python/


Introduction to Web Scraping: https://stanford.edu/~vbauer/teaching/scraping.html


Scrapy: https://scrapy.org/


How a Math Genius Hacked OkCupid to Find True Love: https://www.wired.com/2014/01/how-to-hack-okcupid/


How Netflix Reverse-Engineered Hollywood: https://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/?single_page=true