# Web scraping Trump lies from an [article](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html)

## Import libraries

In [22]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime

## Fetch HTML response for the article

In [5]:
# Fetch HTML response for the article
response = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
# Check whether the request succeeded; otherwise print a relevant message.
if not response.ok:
    print('Cannot read article. Please check the url of the article.')
else:
    print(len(response.text))

249378


## Create BeautifulSoup object from the response

In [6]:
# Create BeautifulSoup object from the response
soup_obj = BeautifulSoup(response.text, 'html.parser')

## Find all spans with class 'short-desc'

In [7]:
# Find all spans with class 'short-desc'
lies = soup_obj.find_all('span', attrs={'class':'short-desc'})
# Display count of lies
print(len(lies))

180


The article lists a total of 180 lies told by Trump.


Next, let's extract the lie itself, its date, the correcting statement (the "truth"), and the supporting link for each entry.

In [8]:
# Analyse the structure by looking at the first two items in the lies list
print(lies[0])

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>


In [9]:
print(lies[1])

<span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>


As the two items above show, each entry has the following structure:
+ _Date_: the text inside the `<strong>` tag
+ _Lie_: the second child of the "short-desc" span (a plain text node)
+ _Truth_: the text of the `<a>` tag
+ _Truth link_: the `href` attribute of the `<a>` tag
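
The structure listed above can be checked offline against the first item's markup. The following is only an illustrative sketch: it re-parses a copy of the HTML printed earlier instead of the live page, and the names `sample_html`, `item` and `children` are introduced here for the example.

```python
from bs4 import BeautifulSoup

# Copy of the first item's markup as printed above (not fetched from the live page)
sample_html = (
    '<span class="short-desc"><strong>Jan. 21 </strong>'
    '“I wasn\'t a fan of Iraq. I didn\'t want to go into Iraq.” '
    '<span class="short-truth">'
    '<a href="https://www.buzzfeed.com/andrewkaczynski/'
    'in-2002-donald-trump-said-he-supported-invading-iraq-on-the" '
    'target="_blank">(He was for an invasion before he was against it.)</a>'
    '</span></span>'
)

item = BeautifulSoup(sample_html, 'html.parser').find('span', class_='short-desc')

# .contents lists the direct children: the <strong> tag, a text node, the inner <span>
children = item.contents
print([type(child).__name__ for child in children])

print(repr(item.find('strong').text))  # date, with a trailing space
print(repr(str(children[1])))          # the lie, still wrapped in curly quotes
print(item.find('a').text)             # the truth, wrapped in parentheses
print(item.find('a')['href'])          # the truth link
```

Because the lie is a bare text node rather than a tag, it is reached by position (`contents[1]`) instead of by `find`.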

## Fetch required information from the lies list

First, fetch the information from the first item in the list.  
Then we can repeat the same process for every item in the list.

In [11]:
# Fetch required information from the first item in the lies list

lie_item = lies[0]

date_ = lie_item.find('strong').text
lie = lie_item.contents[1]
truth = lie_item.find('a').text
truth_link = lie_item.find('a')['href']

print(date_, lie, truth, truth_link, sep='\n')

Jan. 21 
“I wasn't a fan of Iraq. I didn't want to go into Iraq.” 
(He was for an invasion before he was against it.)
https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the


In [72]:
def parse_date(date_text, year=2017):
    '''
    Parse date text such as 'Jan. 21' into a date. The year is assumed to be 2017.

    Arguments:
        date_text: date string from the article, e.g. 'Jan. 21', 'March 3' or 'Sept. 3'
        year: [Optional] Year of the date to be parsed. Defaults to 2017
    
    Returns:
        datetime.date object

    Raises:
        ValueError: if the month cannot be recognised
    '''
    try:
        # Abbreviated month followed by a period, e.g. 'Jan. 21'
        date_ = datetime.datetime.strptime(date_text, '%b. %d')
    except ValueError:
        try:
            # Full month name, e.g. 'March 3'
            date_ = datetime.datetime.strptime(date_text, '%B %d')
        except ValueError:
            # Abbreviations that strptime does not recognise, e.g. 'Sept. 3'
            m, d = date_text.split()
            if m.rstrip('.') != 'Sept':
                raise ValueError('Unrecognised date text: %r' % date_text)
            return datetime.date(year, 9, int(d))
    return datetime.date(year, date_.month, date_.day)

In [73]:
# Test the function for parsing date
parse_date('Jan. 21')

datetime.date(2017, 1, 21)

In [74]:
parse_date('March 3')

datetime.date(2017, 3, 3)

In [75]:
parse_date('Sept. 3')

datetime.date(2017, 9, 3)
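
One possible alternative to trying several `strptime` formats is a small lookup table keyed on the first three letters of the month. This is a sketch rather than the notebook's method: `MONTHS` and `parse_date_lookup` are hypothetical names, and the code assumes every date in the article is an English month name or abbreviation followed by a day number.

```python
import datetime

# Hypothetical lookup table: three-letter month prefix -> month number
MONTHS = {
    abbr: number
    for number, abbr in enumerate(
        ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
         'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
        start=1,
    )
}


def parse_date_lookup(date_text, year=2017):
    """Parse 'Jan. 21', 'March 3' or 'Sept. 3' into a datetime.date."""
    month_text, day_text = date_text.split()
    # Drop a trailing period, then keep only the first three letters
    key = month_text.rstrip('.')[:3].lower()
    if key not in MONTHS:
        raise ValueError('Unrecognised month in %r' % date_text)
    return datetime.date(year, MONTHS[key], int(day_text))


for text in ('Jan. 21', 'March 3', 'Sept. 3'):
    print(text, '->', parse_date_lookup(text))
```

The trade-off is that any word starting with a valid three-letter prefix is accepted, so this version is more permissive than `strptime`.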

In [68]:
# Define a function to fetch all info from one item

def fetch_info(lie_item):
    '''
    Extracts date, lie, truth, truth_link from the given lie item.

    Arguments:
        lie_item: one element (a BeautifulSoup tag) from the "lies" list

    Returns:
        date, lie, truth, truth_link as a tuple
    '''
    date_ = parse_date(lie_item.find('strong').text.strip())
    lie = lie_item.contents[1].strip().strip('“').strip('”')
    truth = lie_item.find('a').text.strip('(').strip(')')
    truth_link = lie_item.find('a')['href']

    return date_, lie, truth, truth_link
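
A note on the cleaning step: `str.strip` with an argument removes every leading and trailing character found in that argument, not just one. The sketch below illustrates the difference using `str.removeprefix` and `str.removesuffix`, which assumes Python 3.9 or later. The helper names `clean_lie` and `clean_truth` are hypothetical, and the last input string is made up for the demonstration rather than taken from the article.

```python
def clean_lie(text):
    # Trim surrounding whitespace first, then any leading/trailing curly quotes
    return text.strip().strip('“”')


def clean_truth(text):
    # removeprefix/removesuffix drop at most one parenthesis on each side
    return text.strip().removeprefix('(').removesuffix(')')


print(clean_lie("“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "))
print(clean_truth('(He was for an invasion before he was against it.)'))

# Made-up input with nested parentheses, to show where the two approaches differ
nested = '(See the report (page 3))'
print(clean_truth(nested))               # keeps the inner closing parenthesis
print(nested.strip('(').strip(')'))      # strips both closing parentheses
```

For the two items inspected so far, both approaches give the same result; the difference only matters if a truth statement ends with its own parenthesis.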

In [69]:
# Use the fetch_info function to extract info from the first item.
date_, lie, truth, truth_link = fetch_info(lies[0])
print(date_, lie, truth, truth_link, sep='\n')

2017-01-21
I wasn't a fan of Iraq. I didn't want to go into Iraq.
He was for an invasion before he was against it.
https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the


In [70]:
# Use the fetch_info function to extract info from the second item.
date_, lie, truth, truth_link = fetch_info(lies[1])
print(date_, lie, truth, truth_link, sep='\n')

2017-01-21
A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.
Trump was on the cover 11 times and Nixon appeared 55 times.
http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/


So the function works for the first two items.  
Let's load this information into a pandas DataFrame.

In [76]:
# Loop through all lie items and store the results in a list of tuples
items = []
for lie_item in lies:
    items.append(fetch_info(lie_item))
print(len(items))

180
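
All 180 items parsed cleanly here, but a single entry with a different structure (a missing link, say) would stop the loop with an exception. As a hedged sketch, a more defensive loop could record failures instead of aborting. `collect_rows` is a hypothetical helper, and the toy parser in the usage lines stands in for `fetch_info` so the example runs on its own.

```python
def collect_rows(raw_items, parse):
    """Apply `parse` to each item; return (rows, failures).

    `failures` holds (index, error message) pairs for items that could not be parsed.
    """
    rows, failures = [], []
    for index, raw in enumerate(raw_items):
        try:
            rows.append(parse(raw))
        except (AttributeError, IndexError, KeyError, TypeError, ValueError) as exc:
            # Remember which item failed and why, then carry on with the rest
            failures.append((index, repr(exc)))
    return rows, failures


# Toy stand-in for fetch_info: int() fails on 'x' with a ValueError
rows, failures = collect_rows(['1', 'x', '3'], int)
print(rows)
print(len(failures), 'item(s) failed at index', [index for index, _ in failures])
```

In the notebook this would be called as `collect_rows(lies, fetch_info)`, and an empty `failures` list would confirm that every item matched the expected structure.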


In [77]:
# Build a dataframe from the list of tuples

df = pd.DataFrame(items, columns=['Date', 'Lie', 'Truth', 'Truth_Link'])

In [78]:
df.head()

| | Date | Lie | Truth | Truth_Link |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [79]:
df.shape

(180, 4)