# Web Scraping

It is the process of extracting information from a webpage by taking advantage of patterns in the webpage's underlying code.
How does the webpage present information?
E.g. date; lie; why it's a lie; url -> consistent formatting

# 3 basic facts about html to understand:
1. html consists of tags
2. html tags can have attributes in the opening tag
3. html tags can be nested

# Summary Code

In [1]:
# import requests
# r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

# from bs4 import BeautifulSoup
# soup = BeautifulSoup(r.text, 'html.parser')
# results = soup.find_all('span', attrs={'class':'short-desc'})

# records = []
# for result in results:
#     date = result.find('strong').text[:-1] + ', 2017'
#     lie = result.contents[1][1:-2]
#     explanation = result.find('a').text[1:-1]
#     url = first_result.find('a')['href']
#     records.append((date, lie, explanation, url))
    
# import pandas as pd
# df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
# df['date'] = pd.to_datetime(df['date'])
# df.to_csv('trump_lies.csv', index=False, encoding='utf-8')

# Code Explanation

In [2]:
# request the html code of a webpage. check the page's robots.txt to ensure that it allows scraping.

import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

In [3]:
# print first 500 characters of the html code

print(r.text[0:500])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page


In [4]:
# parse the HTML using Beautiful Soup - popular web scraping library

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')

In [5]:
# find all span tags with the attribute short-desc. results is a python list

results = soup.find_all('span', attrs={'class': 'short-desc'})

In [6]:
len(results)

116

In [7]:
# check whether results match the text in the article

results[0:3]
results[-1]

<span class="short-desc"><strong>July 19 </strong>“But the F.B.I. person really reports directly to the president of the United States, which is interesting.” <span class="short-truth"><a href="https://www.usatoday.com/story/news/politics/onpolitics/2017/07/20/fbi-director-reports-justice-department-not-president/495094001/" target="_blank">(He reports directly to the attorney general.)</a></span></span>

In [8]:
# separate these results into the 4 components: date; lie; why it's a lie; url
# this is to create more structure. it is an iterative process

first_result = results[0]
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [9]:
# find method is used to find the first match in the text; it is then returned as a soup object, not string.

first_result.find('strong')

<strong>Jan. 21 </strong>

In [10]:
# the text attributes returns a regular python string

# An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal,
# but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly
# this is the \xa0 in the text below

first_result.find('strong').text

'Jan. 21\xa0'

In [11]:
# slice off the escape sequence. it is a single character so use -1.

first_result.find('strong').text[:-1]

'Jan. 21'

In [12]:
# add the year so the dates in the dataset is not ambiguous

first_result.find('strong').text[:-1] + ', 2017'

'Jan. 21, 2017'

In [13]:
# extracting the lie

first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [14]:
# there is no tag that starts and ends immediately before and after the lie
# have to use the contents attribute which extracts the children in the tag (returns tags and strings in a python list)
# slice the list to extract the second element

first_result.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [15]:
# slice off the curly quotation marks

first_result.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

In [16]:
# extracting the explanation

first_result.find('a').text[1:-1]

'He was for an invasion before he was against it.'

In [17]:
# extracting the url
# this involves extracting the href attribute within the 'a' tag
# beautifulsoup stores attributes of a tag in a dictionary -> key: attribute name, value: attribute value

first_result.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

In [18]:
# building the dataset. store the results in a tuple.

records = []
for result in results:
    date = result.find('strong').text[:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = first_result.find('a')['href']
    records.append((date, lie, explanation, url))

In [19]:
# check that there are 116 records according to the results

len(records)

116

In [20]:
# view the first few records

records[:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the')]

In [21]:
# applying a tabular data structure using pandas

import pandas as pd

df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])

In [22]:
df.head()

Unnamed: 0,date,lie,explanation,url
0,"Jan. 21, 2017",I wasn't a fan of Iraq. I didn't want to go in...,He was for an invasion before he was against it.,https://www.buzzfeed.com/andrewkaczynski/in-20...
1,"Jan. 21, 2017",A reporter for Time magazine — and I have been...,Trump was on the cover 11 times and Nixon appe...,https://www.buzzfeed.com/andrewkaczynski/in-20...
2,"Jan. 23, 2017",Between 3 million and 5 million illegal votes ...,There's no evidence of illegal voting.,https://www.buzzfeed.com/andrewkaczynski/in-20...
3,"Jan. 25, 2017","Now, the audience was the biggest ever. But th...",Official aerial photos show Obama's 2009 inaug...,https://www.buzzfeed.com/andrewkaczynski/in-20...
4,"Jan. 25, 2017",Take a look at the Pew reports (which show vot...,The report never mentioned voter fraud.,https://www.buzzfeed.com/andrewkaczynski/in-20...


In [23]:
df.tail()

Unnamed: 0,date,lie,explanation,url
111,"July 6, 2017","As a result of this insistence, billions of do...",NATO countries agreed to meet defense spending...,https://www.buzzfeed.com/andrewkaczynski/in-20...
112,"July 17, 2017",We’ve signed more bills — and I’m talking abou...,"Clinton, Carter, Truman, and F.D.R. had signed...",https://www.buzzfeed.com/andrewkaczynski/in-20...
113,"July 19, 2017","Um, the Russian investigation — it’s not an in...",It is.,https://www.buzzfeed.com/andrewkaczynski/in-20...
114,"July 19, 2017","I heard that Harry Truman was first, and then ...","Presidents Clinton, Carter, Truman, and F.D.R....",https://www.buzzfeed.com/andrewkaczynski/in-20...
115,"July 19, 2017",But the F.B.I. person really reports directly ...,He reports directly to the attorney general.,https://www.buzzfeed.com/andrewkaczynski/in-20...


In [24]:
# Jan is abbreviated but July is not. Make sure that the format is consistent.

df['date'] = pd.to_datetime(df['date'])

In [25]:
df.head()

Unnamed: 0,date,lie,explanation,url
0,2017-01-21,I wasn't a fan of Iraq. I didn't want to go in...,He was for an invasion before he was against it.,https://www.buzzfeed.com/andrewkaczynski/in-20...
1,2017-01-21,A reporter for Time magazine — and I have been...,Trump was on the cover 11 times and Nixon appe...,https://www.buzzfeed.com/andrewkaczynski/in-20...
2,2017-01-23,Between 3 million and 5 million illegal votes ...,There's no evidence of illegal voting.,https://www.buzzfeed.com/andrewkaczynski/in-20...
3,2017-01-25,"Now, the audience was the biggest ever. But th...",Official aerial photos show Obama's 2009 inaug...,https://www.buzzfeed.com/andrewkaczynski/in-20...
4,2017-01-25,Take a look at the Pew reports (which show vot...,The report never mentioned voter fraud.,https://www.buzzfeed.com/andrewkaczynski/in-20...


In [26]:
df.tail()

Unnamed: 0,date,lie,explanation,url
111,2017-07-06,"As a result of this insistence, billions of do...",NATO countries agreed to meet defense spending...,https://www.buzzfeed.com/andrewkaczynski/in-20...
112,2017-07-17,We’ve signed more bills — and I’m talking abou...,"Clinton, Carter, Truman, and F.D.R. had signed...",https://www.buzzfeed.com/andrewkaczynski/in-20...
113,2017-07-19,"Um, the Russian investigation — it’s not an in...",It is.,https://www.buzzfeed.com/andrewkaczynski/in-20...
114,2017-07-19,"I heard that Harry Truman was first, and then ...","Presidents Clinton, Carter, Truman, and F.D.R....",https://www.buzzfeed.com/andrewkaczynski/in-20...
115,2017-07-19,But the F.B.I. person really reports directly ...,He reports directly to the attorney general.,https://www.buzzfeed.com/andrewkaczynski/in-20...


In [27]:
# Exporting the dataset to a CSV file

df.to_csv('trump_lies.csv', index=False, encoding='utf-8')

In [28]:
# Reading from a CSV file

data = pd.read_csv('trump_lies.csv', parse_dates=['date'], encoding='utf-8')
data.head()

Unnamed: 0,date,lie,explanation,url
0,2017-01-21,I wasn't a fan of Iraq. I didn't want to go in...,He was for an invasion before he was against it.,https://www.buzzfeed.com/andrewkaczynski/in-20...
1,2017-01-21,A reporter for Time magazine — and I have been...,Trump was on the cover 11 times and Nixon appe...,https://www.buzzfeed.com/andrewkaczynski/in-20...
2,2017-01-23,Between 3 million and 5 million illegal votes ...,There's no evidence of illegal voting.,https://www.buzzfeed.com/andrewkaczynski/in-20...
3,2017-01-25,"Now, the audience was the biggest ever. But th...",Official aerial photos show Obama's 2009 inaug...,https://www.buzzfeed.com/andrewkaczynski/in-20...
4,2017-01-25,Take a look at the Pew reports (which show vot...,The report never mentioned voter fraud.,https://www.buzzfeed.com/andrewkaczynski/in-20...


# Alternative syntax for beautiful soup:

### search for a tag
first_result.find('strong')

first_result.strong

### search for multiple tags
results = soup.find_all('span', attrs={'class':'short-desc})

results = soup('span', attrs={'class':'short-desc})  -> beautifulsoup can figure out the method

results = soup('span', class_='short-desc')  -> the class key is treated like a parameter