# Web Scraping with Python Workshop


We will use the Python package Beautiful Soup to scrape headlines from the NY Times. By scraping the headlines, we will examine how to search for metadata hidden within HTML tags and how to strip the HTML tags away from the scraped data.

After scraping a static webpage, we will crawl a similar webpage and use the crawler to "click" on links embedded within the webpage.

We will then store the data in a pandas DataFrame and show how to write it to a CSV file.

Feel free to send questions to CDSS_executives@columbia.edu

Import the Beautiful Soup package as well as urllib, a standard-library package for working with URLs.

In [4]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

Find the search URL that you would like to use.

In [5]:
search_url = 'http://nytimes.com'

Form the Beautiful Soup query around the URL. For more information, visit the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. To try out different URLs, start by modifying the "search_url" variable above.

In [6]:
soup = BeautifulSoup(urlopen(search_url).read(), 'html.parser')
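
Some sites may refuse requests that carry urllib's default user agent, or may be slow to respond, so it can help to wrap the fetch in a small helper. The following is a minimal sketch and not part of the original workshop: `build_request`, `fetch_html`, the `workshop-scraper/0.1` user-agent string, and the 10-second timeout are all illustrative assumptions.

```python
from urllib.request import Request, urlopen

# Hypothetical user-agent string; pick one that identifies your own script.
USER_AGENT = 'workshop-scraper/0.1'


def build_request(url):
    """Build a request that sends an explicit User-Agent header."""
    return Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (assumed timeout: 10 s)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Usage (needs network access):
# soup = BeautifulSoup(fetch_html(search_url), 'html.parser')
```

The bytes returned by `fetch_html` can be handed to `BeautifulSoup` exactly like the result of `urlopen(search_url).read()`.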

Beautiful Soup produces essentially a parsed copy of the HTML file served by the website. So let's print it to see what we're working with!

In [7]:
print(soup)

<!DOCTYPE html>

<!--[if (gt IE 9)|!(IE)]> <!--> <html class="no-js edition-domestic app-homepage" itemscope="" lang="en" xmlns:og="http://opengraphprotocol.org/schema/"> <!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 edition-domestic app-homepage" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 edition-domestic app-homepage" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 edition-domestic app-homepage" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<head>
<title>The New York Times - Breaking News, World News &amp; Multimedia</title>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<link href="https://static01.nyt.com/favicon.ico" rel="shortcut icon"/>
<link href="https://static01.nyt.com/images/icons/ios-ipad-144x144.png" rel="apple-touch-icon-precomposed" sizes="144×144

It doesn't look too nice! We need to find a way to parse through it. Let's start by looking at the HTML identifiers near the information that we want -- the headlines. Let's command+F our printed soup output for the context of one of the headlines that we want.

```
<h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/01/arts/beyonce-pregnant-twins.html">Beyoncé Announces She Is Pregnant With Twins</a></h2>
<p class="byline">By JOE COSCARELLI <time class="timestamp" data-eastern-timestamp="2:49 PM" data-utc-timestamp="1485978561" datetime="2017-02-01">2:49 PM ET</time></p>
<p class="summary">
        The pop star shared an Instagram post in which she said her family with the rapper Jay Z “will be growing by two.”    </p>
```

This headline is denoted by the tag "<h2>" and the class "story-heading". Further command+F searches through the soup output confirm that "story-heading" is used to denote the headline for all of the stories on the homepage.

In [10]:
# find_all is the current name for the legacy findAll alias
soup2 = soup.find_all('h2', {'class': 'story-heading'})
print(soup2)

[<h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/01/us/politics/neil-gorsuch-supreme-court-trump.html">If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’</a></h2>, <h2 class="story-heading"><a href="https://www.nytimes.com/2017/01/31/us/politics/neil-gorsuch-supreme-court-nominee.html">Our Supreme Court Reporter on Gorsuch’s Record</a></h2>, <h2 class="story-heading">
                                    ‘The Very Best Judge in the Country’                            </h2>, <h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/01/us/politics/rex-tillerson-secretary-of-state-confirmed.html">Tillerson Confirmed as Secretary of State</a></h2>, <h2 class="story-heading"><i class="icon"></i><a href="https://www.nytimes.com/interactive/2017/02/01/us/politics/tillerson-confirmation-vote-live.html">A Breakdown of the Senate Vote on Tillerson</a> </h2>, <h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/01/us/politics/trump-cabinet-nomi
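
To see the tag-and-class filter in isolation, it can be tried on a tiny piece of HTML instead of the whole homepage. The sketch below reuses the Beyoncé headline markup shown earlier; the `other-heading` line and the name `snippet_soup` are made up for illustration.

```python
from bs4 import BeautifulSoup

# A small snippet modeled on the headline markup shown earlier.
# The "other-heading" element is hypothetical, added to show what gets skipped.
html = '''
<h2 class="story-heading"><a href="https://www.nytimes.com/2017/02/01/arts/beyonce-pregnant-twins.html">Beyoncé Announces She Is Pregnant With Twins</a></h2>
<p class="byline">By JOE COSCARELLI</p>
<h2 class="other-heading">Not a story headline</h2>
'''

snippet_soup = BeautifulSoup(html, 'html.parser')

# Only <h2> tags whose class is "story-heading" are returned.
matches = snippet_soup.find_all('h2', {'class': 'story-heading'})

for tag in matches:
    link = tag.find('a')  # the <a> element nested inside the heading
    print(tag.get_text(strip=True))
    print(link['href'])
```

Each match is a tag object, so both its text and the attributes of nested tags (such as `href`) stay available after filtering.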

Now we just need to strip away the HTML tags to get the text that we want. Since not every element in soup2 necessarily has text, we first check for empty elements before getting the text.

In [14]:
lines = []
for line in soup2:
    if line:
        lines.append(line.text)
        print(line.text)

If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’
Our Supreme Court Reporter on Gorsuch’s RecordOur Supreme Court Reporter on Gorsuch’s Record

                                    ‘The Very Best Judge in the Country’                            
                                    ‘The Very Best Judge in the Country’                            
Tillerson Confirmed as Secretary of StateTillerson Confirmed as Secretary of State
A Breakdown of the Senate Vote on Tillerson A Breakdown of the Senate Vote on Tillerson 
G.O.P. Muscles Nominees Past Democrats’ RoadblocksG.O.P. Muscles Nominees Past Democrats’ Roadblocks
2 G.O.P. Senators to Oppose DeVos as Education Secretary2 G.O.P. Senators to Oppose DeVos as Education Secretary
Army Secretary Pick Could Trade One Conflict for Another Army Secretary Pick Could Trade One Conflict for Another 
World’s Autocrats See Trump as an OpportunityWorld’s Autocrats See Trump as an

In [18]:
print(lines)


['If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’', 'Our Supreme Court Reporter on Gorsuch’s RecordOur Supreme Court Reporter on Gorsuch’s Record', '\n                                    ‘The Very Best Judge in the Country’                            \n                                    ‘The Very Best Judge in the Country’                            ', 'Tillerson Confirmed as Secretary of StateTillerson Confirmed as Secretary of State', 'A Breakdown of the Senate Vote on Tillerson A Breakdown of the Senate Vote on Tillerson ', 'G.O.P. Muscles Nominees Past Democrats’ RoadblocksG.O.P. Muscles Nominees Past Democrats’ Roadblocks', '2 G.O.P. Senators to Oppose DeVos as Education Secretary2 G.O.P. Senators to Oppose DeVos as Education Secretary', 'Army Secretary Pick Could Trade One Conflict for Another Army Secretary Pick Could Trade One Conflict for Another ', 'World’s Autocrats See Trump as an OpportunityWorld

How many headlines did we scrape? Does this number seem reasonable? Let's do some visual inspection of the data that we scraped.

In [None]:
print(len(lines))

Now that we've confirmed that our data looks like headlines, let's strip away all the \n characters, trailing or leading whitespace, and other characters that we don't want. We will do this using the str.strip() method.

In [46]:
headlines = []
for line in lines:
    line = line.strip()
    headlines.append(line)
print(headlines)

['If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’', 'Our Supreme Court Reporter on Gorsuch’s RecordOur Supreme Court Reporter on Gorsuch’s Record', '‘The Very Best Judge in the Country’                            \n                                    ‘The Very Best Judge in the Country’', 'Tillerson Confirmed as Secretary of StateTillerson Confirmed as Secretary of State', 'A Breakdown of the Senate Vote on Tillerson A Breakdown of the Senate Vote on Tillerson', 'G.O.P. Muscles Nominees Past Democrats’ RoadblocksG.O.P. Muscles Nominees Past Democrats’ Roadblocks', '2 G.O.P. Senators to Oppose DeVos as Education Secretary2 G.O.P. Senators to Oppose DeVos as Education Secretary', 'Army Secretary Pick Could Trade One Conflict for Another Army Secretary Pick Could Trade One Conflict for Another', 'World’s Autocrats See Trump as an OpportunityWorld’s Autocrats See Trump as an Opportunity', 'Trump Golf Resort Ordere

Hmmmm...what seems to be going on here? The newline characters are still there! This is because they're in the middle of the strings, and strip() only removes whitespace characters from the ends of a string. Since certain headlines appear to be copied before and after a series of \n's, let's split each line on '\n' with str.split(), strip the leading and trailing whitespace from the first element of the resulting list, and use this processed string as our headline.

In [63]:
for line_index, headline in enumerate(headlines):
    # keep only the text before the first newline, then trim whitespace
    headlines[line_index] = headline.split('\n')[0].strip()
print(headlines)

['If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’', 'Our Supreme Court Reporter on Gorsuch’s RecordOur Supreme Court Reporter on Gorsuch’s Record', '‘The Very Best Judge in the Country’', 'Tillerson Confirmed as Secretary of StateTillerson Confirmed as Secretary of State', 'A Breakdown of the Senate Vote on Tillerson A Breakdown of the Senate Vote on Tillerson', 'G.O.P. Muscles Nominees Past Democrats’ RoadblocksG.O.P. Muscles Nominees Past Democrats’ Roadblocks', '2 G.O.P. Senators to Oppose DeVos as Education Secretary2 G.O.P. Senators to Oppose DeVos as Education Secretary', 'Army Secretary Pick Could Trade One Conflict for Another Army Secretary Pick Could Trade One Conflict for Another', 'World’s Autocrats See Trump as an OpportunityWorld’s Autocrats See Trump as an Opportunity', 'Trump Golf Resort Ordered to Pay $5.7 Million 6:17 PM ETTrump Golf Resort Ordered to Pay $5.7 Million 6:17 PM ET', 'Hormone Bl
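
The output above still shows many headlines repeated back to back, for example 'Tillerson Confirmed as Secretary of StateTillerson Confirmed as Secretary of State'. The workshop does not address this. One possible cleanup, sketched here under the assumption that a duplicated headline is an exact copy of itself (optionally separated by whitespace), is to compare the two halves of the string. `undouble` is a hypothetical helper name.

```python
def undouble(text):
    """Return one copy of text if it is the same string repeated twice."""
    text = text.strip()
    middle = len(text) // 2
    first, second = text[:middle].strip(), text[middle:].strip()
    # Only collapse when both halves are non-empty and identical.
    if first and first == second:
        return first
    return text


# Sample strings copied from the output above.
samples = [
    'Tillerson Confirmed as Secretary of StateTillerson Confirmed as Secretary of State',
    'A Breakdown of the Senate Vote on Tillerson A Breakdown of the Senate Vote on Tillerson',
    '‘The Very Best Judge in the Country’',
]
cleaned = [undouble(s) for s in samples]
print(cleaned)
```

A headline that legitimately repeats itself (say, "Bye Bye") would also be collapsed by this check, so it is worth inspecting the results before relying on them.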

Now let's store the data in a useful way. How about a dataframe? 

In [68]:
import pandas as pd
df = pd.DataFrame(headlines)
print(df)

                                                     0
0    If Court Pick Is Blocked, Trump Advises Senate...
1    Our Supreme Court Reporter on Gorsuch’s Record...
2                 ‘The Very Best Judge in the Country’
3    Tillerson Confirmed as Secretary of StateTille...
4    A Breakdown of the Senate Vote on Tillerson A ...
5    G.O.P. Muscles Nominees Past Democrats’ Roadbl...
6    2 G.O.P. Senators to Oppose DeVos as Education...
7    Army Secretary Pick Could Trade One Conflict f...
8    World’s Autocrats See Trump as an OpportunityW...
9    Trump Golf Resort Ordered to Pay $5.7 Million ...
10   Hormone Blockers Can Help if Prostate Cancer R...
11   New Zealand Is ‘the Future,’ Thiel Said in Pus...
12   Who Should Keep Tabs on Money? The Depp Conund...
13                                     Ten Meter Tower
14   Would you jump? Or would you chicken out?Would...
15   Widow of Orlando Gunman Knew of His Plans, U.S...
16   Fed Says Economic Outlook Continues to Improve...
17   Myste
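
The DataFrame above has a single column labeled 0, because a plain list carries no column name. A small sketch of one way to name the column, using two headlines from the output above; `sample_headlines` and `named_df` are illustrative names.

```python
import pandas as pd

# A few headlines taken from the output above.
sample_headlines = [
    '‘The Very Best Judge in the Country’',
    'Ten Meter Tower',
]

# Passing a dict gives the column a readable name instead of 0.
named_df = pd.DataFrame({'headline': sample_headlines})
print(named_df)
```

With a named column, later steps can refer to `named_df['headline']` rather than `df[0]`.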

That was simple, right? We can also easily write this data to a CSV file from pandas.

In [71]:
df.to_csv('headlines.csv')
# make sure that it's there...
# a leading "!" runs a shell command from an IPython notebook cell
!ls
!head headlines.csv

Books Spring 2017
IMG_1189.JPG
IMG_1190.JPG
IMG_1194.JPG
IMG_1198.JPG
SGaliotto_Resume.pdf
Screen Shot 2017-01-20 at 11.14.53 AM.png
Screen Shot 2017-01-23 at 10.07.18 PM.png
Screen Shot 2017-01-24 at 8.13.33 PM.png
Screen Shot 2017-01-24 at 8.13.37 PM.png
Screen Shot 2017-01-25 at 11.40.33 AM.png
Screen Shot 2017-01-30 at 6.57.28 PM.png
WV_WARN_Notices_3-1-11_to_1-3-17 (1).pdf
Web Scraping in Python.ipynb
Women in Data: Directions to Lerner Hall.png
[David_Kreps]_Notes_On_The_Theory_Of_Choice_(Under(BookFi).pdf
headlines.csv
misc
tabula-WV_WARN_Notices_3-1-11_to_1-3-17 (1).csv
test.class
test.java
tw.txt
webs.py
,0
0,"If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’"
1,Our Supreme Court Reporter on Gorsuch’s RecordOur Supreme Court Reporter on Gorsuch’s Record
2,‘The Very Best Judge in the Country’
3,Tillerson Confirmed as Secretary of StateTillerson Confirmed as Secretary of State
4,A Br
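
The CSV above starts with an unnamed index column (the leading `,0` header and the row numbers). If that column is not wanted, `to_csv` accepts `index=False`. The sketch below writes to an in-memory buffer so it does not touch the disk; the column name `headline` and the variable names are assumptions for illustration.

```python
import io

import pandas as pd

csv_df = pd.DataFrame({'headline': [
    'If Court Pick Is Blocked, Trump Advises Senate to ‘Go Nuclear’',
    'Ten Meter Tower',
]})

# Write to an in-memory buffer here; pass a filename to write to disk instead.
buffer = io.StringIO()
csv_df.to_csv(buffer, index=False)  # index=False drops the row-number column
csv_text = buffer.getvalue()
print(csv_text)

# Read it back to confirm the round trip preserves the headlines.
round_trip = pd.read_csv(io.StringIO(csv_text))
```

Headlines that contain commas are quoted automatically in the CSV, so they survive the round trip as single values.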

Now let's crawl (well, not quite a full crawl)! Let's try to store some text from Reuters articles. Scraping article text has debatable legality, so do this at your own risk!

In [79]:
reuters_url = 'http://www.reuters.com/'
crawl_soup = BeautifulSoup(urlopen(reuters_url).read(), 'html.parser')
print(crawl_soup)

<!--[if !IE]> This has been served from cache <![endif]-->
<!--[if !IE]> Request served from apache server: produs--i-858f6a1d <![endif]-->
<!--[if !IE]> Cached on Thu, 02 Feb 2017 01:23:08 GMT and will expire on Thu, 02 Feb 2017 01:28:06 GMT <![endif]-->
<!--[if !IE]> token: eb400426-8e7c-4b5f-8d8a-6495169b90c6 <![endif]-->
<!--[if !IE]> Prepopulated from the cache-server <![endif]-->
<!--[if !IE]> App Server /produs--i-b3e1bd20/ <![endif]-->
<!DOCTYPE doctype html>
<html><head>
<title>Business &amp; Financial News, Breaking US &amp; International News | Reuters</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"><meta charset="utf-8"><meta content="on" http-equiv="x-dns-prefetch-control"><link href="//s1.reutersmedia.net" rel="dns-prefetch"/><link href="//s2.reutersmedia.net" rel="dns-prefetch"/><link href="//s3.reutersmedia.net" rel="dns-prefetch"/><link href="//s4.reutersmedia.net" rel="dns-prefetch"/><link href="//static.reuters.com" rel="dns-prefetch"/><link href="//www.

As with the NYT, Reuters stories are mostly identified by a class, here "story-title", but this time we want the URLs! The URL sits next to the story-title, in the href attribute of the link, so we can pull it out once we find the story-title elements.

In [124]:
crawl_soup1 = crawl_soup.find_all(True, {'class': 'story-title'})
#print(crawl_soup1)

# now get the href, note that it always starts with <a href="....
# remember that elements can be empty

r_headlines = []
# hacky way when soup tags don't work:
# split the tag's HTML on the double-quote character (")
for line in crawl_soup1:
    if len(line) > 1:
        for piece in str(line).split('"'):
            if 'article' in piece:
                r_headlines.append(piece)

print(r_headlines)

['/article/us-usa-trump-extremists-program-exclusiv-idUSKBN15G5VO', '/article/us-usa-congress-tillerson-idUSKBN15G5I7', '/article/us-usa-trump-commando-idUSKBN15G5RX', '/article/us-usa-trump-iran-idUSKBN15G5ED', '/article/us-israel-palestinians-kushner-idUSKBN15G4W2', '/article/us-usa-trump-meeting-idUSKBN15G5JA', '/article/us-britain-eu-article-idUSKBN15G4CK', '/article/us-romania-government-corruption-idUSKBN15F29F', '/article/us-delaware-prison-idUSKBN15G5FB', '/article/us-usa-court-gorsuch-business-idUSKBN15G5PZ', '/article/us-people-johnnydepp-idUSKBN15G59I', '/article/us-usa-fed-idUSKBN15G5D5', '/article/us-facebook-results-idUSKBN15G5MR', '/article/us-usa-economy-idUSKBN15G4WH', '/article/us-global-markets-idUSKBN15H034', '/article/us-britain-boe-idUSKBN15H005', '/article/us-trump-usa-tax-trade-idUSKBN15G5EH', '/article/us-usa-yemen-raid-idUSKBN15H02K', '/article/us-canada-mosque-shooter-idUSKBN15G5SI', '/article/us-usa-trump-mexico-idUSKBN15G5W0', '/article/us-tesla-namechange-
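
The paths collected above are relative, so they have to be combined with the site's base URL before the crawler can "click" on them. A minimal sketch using the standard library's `urljoin`; the variable names `relative_paths` and `article_urls` are illustrative.

```python
from urllib.parse import urljoin

reuters_url = 'http://www.reuters.com/'

# Two of the relative paths collected above.
relative_paths = [
    '/article/us-usa-fed-idUSKBN15G5D5',
    '/article/us-facebook-results-idUSKBN15G5MR',
]

# urljoin resolves each path against the site's base URL.
article_urls = [urljoin(reuters_url, path) for path in relative_paths]

for url in article_urls:
    print(url)
```

Each resulting URL could then be passed to `urlopen` and `BeautifulSoup` in the same way as the homepage.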