# All material ©2019, Alex Siegman


---

### There is a LOT of useful information on the internet, and as data scientists you'll often need access to that information. 

### Unfortunately, that information is rarely contained neatly in CSVs or even in tabular form. Rather, you have to really work to get what you need. 

### Lucky for us, there are some useful tools for "scraping" the web – in particular, one called BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [1]:
import time
import re
import csv
import requests
from bs4 import BeautifulSoup
!pip install lxml



### Before we delve in, here's an example of the power of BeautifulSoup:

In [2]:
# the file attached is a simple csv containing 100 unique URLs from WSJ.com
# the script in this cell allows us to find the word count for each article (stored in the article metadata) via the URL

url_list = [] # create an empty list called 'url_list' where we will store all of the URLs in question

word_count_list = [] # create an empty list called 'word_count_list' where we will store the word counts associated  
                     # with each URL in our 'url_list'

with open("URLS_for_WordCount.csv", newline='') as csvfile:
          # note that you will have to navigate to wherever it is you have stored your csv as a pathname
        
    reader = csv.DictReader(csvfile) # this maps the information in each row to a dictionary keyed by column name 
                                     # (an OrderedDict in Python 3.6 and 3.7); see https://docs.python.org/3/library/csv.html
    
    for row in reader: # for every row in our csv, aka, for every dictionary entry (which is composed of our URLs)...
        
        # NB: you can use "print(row)" here to see what our ordered dictionary looks like 
        
        for k, v in row.items(): # for every key, value pair in our ordered dictionary...
            
            # NB: again, you can use "print(k)" or "print(v)" here to see what our key, value pairs look like 

            url_list.append(str(v)) # add the URL to our "url_list"
            
            r = requests.get(v) # for more on the requests library check out this tutorial from RealPython: 
                                # https://realpython.com/python-requests/
            
            soup = BeautifulSoup(r.text,'lxml') # we are going to turn that page's HTML into 'soup', aka, we are going to be 
                                                # able to see its metadata. Naming the parser explicitly ('lxml', installed 
                                                # above) avoids ambiguity. For more on BeautifulSoup, check out: 
                                                # https://www.crummy.com/software/BeautifulSoup/bs4/doc/
            
            wc1 = str(soup.find("meta", property="article:word_count")) # we want to find the word_count associated 
                                                                        # with each URL, found in the HTML that we 
                                                                        # have just "souped"
            
            wc2 = re.search(r'\d+',wc1).group(0) # we use regular expressions (note the raw string, r'...') to find the
            # first number in the associated metadata, and store that. For more on regex see this great tutorial (not from me): https://regexr.com/
        
            word_count_list.append(wc2) # finally, we add (append) our word count to our "word_count_list"

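        break # stop after the first row so this demo makes only one request; remove this line to process every URL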
            
print(word_count_list) # just to make sure everything works as planned
print(url_list) # again, just to make sure everything works as planned      

# the code below will create a new csv, called "URL_for_WordCount_with_WordCounts.csv", in our current directory
# for more on csv.writer check out: https://docs.python.org/3/library/csv.html

""" 

myData = url_list,word_count_list 
myFile = open('URL_for_WordCount_with_WordCounts.csv', 'w')  
with myFile:  
   writer = csv.writer(myFile)
   writer.writerows(myData)
   
"""

['397']
['https://www.wsj.com/articles/yellen-u-s-financial-system-is-safer-and-sounder-than-before-crisis-1498586028']


" \n\nmyData = url_list,word_count_list \nmyFile = open('URL_for_WordCount_with_WordCounts.csv', 'w')  \nwith myFile:  \n   writer = csv.writer(myFile)\n   writer.writerows(myData)\n   \n"
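
### The commented-out writer above passes the two lists straight to `writerows`, which would produce one row of URLs and one row of word counts. If you'd rather have one row per URL, a minimal sketch is to pair the lists with `zip`. The sample values, the header names, and the `write_word_counts` helper are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

def write_word_counts(handle, urls, counts):
    """Write a header row plus one (url, word_count) row per article to an open file handle."""
    writer = csv.writer(handle)
    writer.writerow(["url", "word_count"])
    writer.writerows(zip(urls, counts))  # zip pairs each URL with its word count

# illustrative values; in the notebook these would be url_list and word_count_list
sample_urls = ["https://example.com/a", "https://example.com/b"]
sample_counts = ["397", "1024"]

buffer = io.StringIO()  # stands in for open("some_file.csv", "w", newline="")
write_word_counts(buffer, sample_urls, sample_counts)
csv_text = buffer.getvalue()
print(csv_text)
```

### When writing to a real file, open it with `newline=''` as the csv documentation recommends, so that no extra blank lines appear on Windows.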

### Now, back to BeautifulSoup basics:


In [3]:
url = "https://www.nytimes.com/" # let's scrape the NYT homepage

r = requests.get(url) # the requests library is the easiest way to call to a URL; here we are using a GET command

soup = BeautifulSoup(r.text,'lxml') # we are going to take the result of that GET command and pass it through bs4

print(soup.prettify()) # 'prettify' does exactly what you'd think – it prettifies the output of the print statement

<!DOCTYPE html>
<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <title data-rh="true">
   The New York Times - Breaking News, World News &amp; Multimedia
  </title>
  <meta content="en-US" data-rh="true" itemprop="inLanguage"/>
  <meta content="noarchive,noodp,noydir" data-rh="true" name="robots"/>
  <meta content="The New York Times" data-rh="true" name="application-name"/>
  <meta content="https://www.nytimes.com" data-rh="true" name="msapplication-starturl"/>
  <meta content="name=Search;action-uri=https://www.nytimes.com/search/?src=iepin;icon-uri=https://static01.nyt.com/images/icons/search.ico" data-rh="true" name="msapplication-task"/>
  <meta content="name=Most Popular;action-uri=https://www.nytimes.com/gst/mostpopular.html?src=iepin;icon-uri=https://static01.nyt.com/images/icons/mostpopular.ico" data-rh="true" name="msapplication-task"/>
  <meta content="name=Video;action-uri=https://video.nytimes.com/?src=iepin;icon-uri=https://static01.nyt.com/imag

### What you're seeing above is the HTML for the NYT homepage. Let's continue with a few basics:

In [4]:
soup.title # let's find the title of the page

<title data-rh="true">The New York Times - Breaking News, World News &amp; Multimedia</title>

In [5]:
soup.title.string # get a string version of the title 

# note that the HTML entity &amp; in the source has been decoded to a plain '&'

'The New York Times - Breaking News, World News & Multimedia'

In [6]:
soup.title.parent.name # find the parent of the title 
                       # this is exceptionally helpful when you're trying to parse an HTML tree

'head'

In [7]:
soup.p # get the first <p> tag in the HTML

<p class="css-gs67ux e1n8kpyg0">Shutting down 8chan.</p>

In [8]:
soup.p['class'] # get the class of that <p> tag

['css-gs67ux', 'e1n8kpyg0']

In [9]:
soup.find_all('a') # find all 'a' tags on the page

[<a class="css-1rn5q1r" href="#site-content">Skip to content</a>,
 <a class="css-1rn5q1r" href="#site-index">Skip to site index</a>,
 <a aria-label="New York Times Logo. Click to visit the homepage" class="css-nhjhh0 e1huz5gh1" href="/"><svg class="" fill="#000" viewbox="0 0 184 25" xmlns="http://www.w3.org/2000/svg"><path d="M13.8 2.9c0-2-1.9-2.5-3.4-2.5v.3c.9 0 1.6.3 1.6 1 0 .4-.3 1-1.2 1-.7 0-2.2-.4-3.3-.8C6.2 1.4 5 1 4 1 2 1 .6 2.5.6 4.2c0 1.5 1.1 2 1.5 2.2l.1-.2c-.2-.2-.5-.4-.5-1 0-.4.4-1.1 1.4-1.1.9 0 2.1.4 3.7.9 1.4.4 2.9.7 3.7.8v3.1L9 10.2v.1l1.5 1.3v4.3c-.8.5-1.7.6-2.5.6-1.5 0-2.8-.4-3.9-1.6l4.1-2V6l-5 2.2C3.6 6.9 4.7 6 5.8 5.4l-.1-.3c-3 .8-5.7 3.6-5.7 7 0 4 3.3 7 7 7 4 0 6.6-3.2 6.6-6.5h-.2c-.6 1.3-1.5 2.5-2.6 3.1v-4.1l1.6-1.3v-.1l-1.6-1.3V5.8c1.5 0 3-1 3-2.9zm-8.7 11l-1.2.6c-.7-.9-1.1-2.1-1.1-3.8 0-.7 0-1.5.2-2.1l2.1-.9v6.2zm10.6 2.3l-1.3 1 .2.2.6-.5 2.2 2 3-2-.1-.2-.8.5-1-1V9.4l.8-.6 1.7 1.4v6.1c0 3.8-.8 4.4-2.5 5v.3c2.8.1 5.4-.8 5.4-5.7V9.3l.9-.7-.2-.2-.8.6-2.5-2.1L18.5 9V

In [10]:
for link in soup.find_all('a'): # find all 'a' on the page
    print(link.get('href')) # get the associated href (hyperlink) for each instance 

#site-content
#site-index
/
/
https://www.nytimes.com/es/
https://cn.nytimes.com
https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi
https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi
https://www.nytimes.com/section/todayspaper
/
/
https://www.nytimes.com/section/world
https://www.nytimes.com/section/us
https://www.nytimes.com/section/politics
https://www.nytimes.com/section/nyregion
https://www.nytimes.com/section/business
https://www.nytimes.com/section/opinion
https://www.nytimes.com/section/technology
https://www.nytimes.com/section/science
https://www.nytimes.com/section/health
https://www.nytimes.com/section/sports
https://www.nytimes.com/section/arts
https://www.nytimes.com/section/books
https://www.nytimes.com/section/style
https://www.nytimes.com/section/food
https://www.nytimes.com/section/travel
https://www.nytimes.com/section/magazine
https://www.nytimes.com/section/t-magazine
https://www.nytimes.com/section/realestate
https://
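
### Several of the hrefs printed above are relative (`/`, `#site-content`). A minimal sketch of turning them into absolute URLs with the standard library's `urljoin` follows. The `hrefs` list is a hand-copied sample of the output plus a `None` (which is what `link.get('href')` returns for an `<a>` tag without an href), and `absolute_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin

base_url = "https://www.nytimes.com/"
hrefs = ["#site-content", "/", "https://cn.nytimes.com", "/section/world", None]

def absolute_links(base, links):
    """Resolve each href against the page URL, skipping missing (None) or empty hrefs."""
    return [urljoin(base, href) for href in links if href]

resolved = absolute_links(base_url, hrefs)
for link in resolved:
    print(link)
```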

## It's important to know that BeautifulSoup transforms HTML into a tree of Python objects. The most important objects to know are: 

1. Tag
2. NavigableString
3. BeautifulSoup

### A tag corresponds to an XML or HTML tag in the original document. For instance:

In [11]:
tag = soup.p 
tag.name

'p'

In [12]:
tag.attrs # you can easily access a tag's attributes

{'class': ['css-gs67ux', 'e1n8kpyg0']}

In [13]:
tag['class'] # or, you can search for a corresponding value as you would in a dictionary 

['css-gs67ux', 'e1n8kpyg0']

### A string corresponds to a bit of text within a tag. BeautifulSoup stores that text in a NavigableString object.

In [14]:
tag.string

'Shutting down 8chan.'

### The BeautifulSoup object represents the document as a whole.

In [15]:
soup.name

'[document]'
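
### To see all three object types side by side without hitting the network, here is a small sketch that parses an inline snippet. The HTML string and the `demo_` names are made up for illustration, and it uses Python's built-in `html.parser` so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html_doc = '<html><body><p class="lede intro">Hello, soup.</p></body></html>'
demo_soup = BeautifulSoup(html_doc, "html.parser")

demo_tag = demo_soup.p          # a Tag
demo_text = demo_tag.string     # a NavigableString

print(type(demo_soup).__name__)  # the object for the document as a whole
print(type(demo_tag).__name__)   # the object for one element
print(type(demo_text).__name__)  # the object for the text inside that element
print(demo_tag["class"])         # multi-valued attributes such as class come back as a list
```

### A NavigableString behaves like an ordinary Python string, so you can call `str()` on it when you want a plain copy that no longer references the parse tree.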

## Navigating the Tree

### The easiest way to navigate the parse tree is to call out the tag you want. 

In [16]:
soup.head # let's just call out for the 'head' tag

<head>
<title data-rh="true">The New York Times - Breaking News, World News &amp; Multimedia</title>
<meta content="en-US" data-rh="true" itemprop="inLanguage"/><meta content="noarchive,noodp,noydir" data-rh="true" name="robots"/><meta content="The New York Times" data-rh="true" name="application-name"/><meta content="https://www.nytimes.com" data-rh="true" name="msapplication-starturl"/><meta content="name=Search;action-uri=https://www.nytimes.com/search/?src=iepin;icon-uri=https://static01.nyt.com/images/icons/search.ico" data-rh="true" name="msapplication-task"/><meta content="name=Most Popular;action-uri=https://www.nytimes.com/gst/mostpopular.html?src=iepin;icon-uri=https://static01.nyt.com/images/icons/mostpopular.ico" data-rh="true" name="msapplication-task"/><meta content="name=Video;action-uri=https://video.nytimes.com/?src=iepin;icon-uri=https://static01.nyt.com/images/icons/video.ico" data-rh="true" name="msapplication-task"/><meta content="name=Homepage;action-uri=https://w

In [17]:
soup.title # or the 'title' tag

<title data-rh="true">The New York Times - Breaking News, World News &amp; Multimedia</title>

### You can, of course, delve deeper into the parse tree.

In [18]:
soup.body.p # get the first <p> tag beneath the <body> tag

<p class="css-gs67ux e1n8kpyg0">Shutting down 8chan.</p>

In [19]:
# note that using a tag name as an attribute gets you only the first tag by that name

soup.p

<p class="css-gs67ux e1n8kpyg0">Shutting down 8chan.</p>

In [20]:
# to find all the tags, use something like find_all()

soup.find_all('p')

[<p class="css-gs67ux e1n8kpyg0">Shutting down 8chan.</p>,
 <p class="css-gs67ux e1n8kpyg0">Will the Green New Deal inspire voters and affect the 2020 race?</p>,
 <p class="css-gs67ux e1n8kpyg0">Carl Hulse and De’Shawn Charles Winslow discuss their new books.</p>,
 <p class="css-1pfq5u e1n8kpyg0">More than half a century after Lyndon Johnson fell short on his attempt at gun control, new battles bring fresh roadblocks and dispute.</p>,
 <p class="css-1pfq5u e1n8kpyg0">A trade war spills into the realm of currency, with no end in sight.</p>,
 <p class="css-1pfq5u e1n8kpyg0">For many Asian couples, the Greek island of Santorini has become the ultimate destination for pre-wedding photographs.</p>,
 <p class="css-1pfq5u e1n8kpyg0">He’s no foe of bigotry. He’s an agent of it.</p>,
 <p class="css-1pfq5u e1n8kpyg0">The consequences ricochet around the world and embolden our adversaries.</p>,
 <p class="css-1pfq5u e1n8kpyg0">How several former vegans and vegetarians across the country came to s

### As alluded to earlier, it's helpful to be able to navigate the tree step-by-step. A tag's children are available in a list called .contents

In [21]:
head_tag = soup.head
head_tag.contents

['\n',
 <title data-rh="true">The New York Times - Breaking News, World News &amp; Multimedia</title>,
 '\n',
 <meta content="en-US" data-rh="true" itemprop="inLanguage"/>,
 <meta content="noarchive,noodp,noydir" data-rh="true" name="robots"/>,
 <meta content="The New York Times" data-rh="true" name="application-name"/>,
 <meta content="https://www.nytimes.com" data-rh="true" name="msapplication-starturl"/>,
 <meta content="name=Search;action-uri=https://www.nytimes.com/search/?src=iepin;icon-uri=https://static01.nyt.com/images/icons/search.ico" data-rh="true" name="msapplication-task"/>,
 <meta content="name=Most Popular;action-uri=https://www.nytimes.com/gst/mostpopular.html?src=iepin;icon-uri=https://static01.nyt.com/images/icons/mostpopular.ico" data-rh="true" name="msapplication-task"/>,
 <meta content="name=Video;action-uri=https://video.nytimes.com/?src=iepin;icon-uri=https://static01.nyt.com/images/icons/video.ico" data-rh="true" name="msapplication-task"/>,
 <meta content="nam

### You can also iterate over a tag's children with the .children generator

In [22]:
for child in head_tag.children:
    print(child)



<title data-rh="true">The New York Times - Breaking News, World News &amp; Multimedia</title>


<meta content="en-US" data-rh="true" itemprop="inLanguage"/>
<meta content="noarchive,noodp,noydir" data-rh="true" name="robots"/>
<meta content="The New York Times" data-rh="true" name="application-name"/>
<meta content="https://www.nytimes.com" data-rh="true" name="msapplication-starturl"/>
<meta content="name=Search;action-uri=https://www.nytimes.com/search/?src=iepin;icon-uri=https://static01.nyt.com/images/icons/search.ico" data-rh="true" name="msapplication-task"/>
<meta content="name=Most Popular;action-uri=https://www.nytimes.com/gst/mostpopular.html?src=iepin;icon-uri=https://static01.nyt.com/images/icons/mostpopular.ico" data-rh="true" name="msapplication-task"/>
<meta content="name=Video;action-uri=https://video.nytimes.com/?src=iepin;icon-uri=https://static01.nyt.com/images/icons/video.ico" data-rh="true" name="msapplication-task"/>
<meta content="name=Homepage;action-uri=https
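
### Both `.contents` and `.children` only look one level down. A small sketch on an inline snippet (made up for illustration) contrasts them with `.descendants`, which walks the whole subtree, including the text inside each tag.

```python
from bs4 import BeautifulSoup, Tag

html_doc = "<html><head><title>Demo page</title><meta name='robots'/></head><body></body></html>"
demo_soup = BeautifulSoup(html_doc, "html.parser")
demo_head = demo_soup.head

# direct children only: the <title> and <meta> tags
direct = [child.name for child in demo_head.children]

# every node beneath <head>: tags by name, text nodes as plain strings
nested = [node.name if isinstance(node, Tag) else str(node)
          for node in demo_head.descendants]

print(direct)
print(nested)
```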

## Filters

In [23]:
soup.find_all('a') # simply pass in the string for the tag you're searching for

[<a class="css-1rn5q1r" href="#site-content">Skip to content</a>,
 <a class="css-1rn5q1r" href="#site-index">Skip to site index</a>,
 <a aria-label="New York Times Logo. Click to visit the homepage" class="css-nhjhh0 e1huz5gh1" href="/"><svg class="" fill="#000" viewbox="0 0 184 25" xmlns="http://www.w3.org/2000/svg"><path d="M13.8 2.9c0-2-1.9-2.5-3.4-2.5v.3c.9 0 1.6.3 1.6 1 0 .4-.3 1-1.2 1-.7 0-2.2-.4-3.3-.8C6.2 1.4 5 1 4 1 2 1 .6 2.5.6 4.2c0 1.5 1.1 2 1.5 2.2l.1-.2c-.2-.2-.5-.4-.5-1 0-.4.4-1.1 1.4-1.1.9 0 2.1.4 3.7.9 1.4.4 2.9.7 3.7.8v3.1L9 10.2v.1l1.5 1.3v4.3c-.8.5-1.7.6-2.5.6-1.5 0-2.8-.4-3.9-1.6l4.1-2V6l-5 2.2C3.6 6.9 4.7 6 5.8 5.4l-.1-.3c-3 .8-5.7 3.6-5.7 7 0 4 3.3 7 7 7 4 0 6.6-3.2 6.6-6.5h-.2c-.6 1.3-1.5 2.5-2.6 3.1v-4.1l1.6-1.3v-.1l-1.6-1.3V5.8c1.5 0 3-1 3-2.9zm-8.7 11l-1.2.6c-.7-.9-1.1-2.1-1.1-3.8 0-.7 0-1.5.2-2.1l2.1-.9v6.2zm10.6 2.3l-1.3 1 .2.2.6-.5 2.2 2 3-2-.1-.2-.8.5-1-1V9.4l.8-.6 1.7 1.4v6.1c0 3.8-.8 4.4-2.5 5v.3c2.8.1 5.4-.8 5.4-5.7V9.3l.9-.7-.2-.2-.8.6-2.5-2.1L18.5 9V

In [24]:
import re # you can pass in regular expressions, too

for tag in soup.find_all(re.compile("p")): # find all tags whose names contain the letter 'p' (use "^p" for names that start with 'p')
    print(tag.name)

script
script
script
script
script
script
script
span
span
path
path
path
path
span
path
p
p
p
span
span
span
figcaption
span
p
span
span
span
span
span
span
span
span
span
figcaption
span
span
span
figcaption
span
p
span
span
span
span
p
span
figcaption
span
span
path
p
span
p
span
span
p
span
span
p
span
span
p
span
span
p
path
path
path
polygon
path
polygon
polygon
polygon
path
path
path
polygon
path
path
path
span
span
span
span
script
script
script
script
script
script
script
script
script
noscript
script
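
### bs4 applies a regular expression filter with `search()`, so an unanchored pattern matches anywhere in the tag name. A minimal sketch on an inline snippet (made up for illustration) shows the difference an anchor makes.

```python
import re
from bs4 import BeautifulSoup

html_doc = "<div><p>one</p><span>two</span><pre>three</pre><script></script></div>"
demo_soup = BeautifulSoup(html_doc, "html.parser")

# unanchored: any tag name that contains a 'p'
contains_p = [t.name for t in demo_soup.find_all(re.compile("p"))]

# anchored with ^: only tag names that start with 'p'
starts_with_p = [t.name for t in demo_soup.find_all(re.compile("^p"))]

print(contains_p)
print(starts_with_p)
```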


In [25]:
for tag in soup.find_all(re.compile("t")): # find all the tags whose names contain the letter 't'
    print(tag.name)

html
title
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
script
script
script
script
script
script
script
style
style
section
button
rect
rect
rect
button
rect
rect
rect
button
path
path
button
button
path
section
path
path
button
rect
rect
rect
section
article
article
article
section
article
figcaption
section
article
article
article
article
article
article
article
figcaption
section
article
figcaption
section
article
article
article
figcaption
section
section
path
article
article
article
article
article
article
article
article
article
article
article
section
article
article
article
path
button
path
section
section
section
section
section
path
rect
rect
path
path
path
path
rect
rect
path
path
path
footer
meta
meta
meta
meta
meta
meta
meta
script
script
script
script
script
script
script
script
script
noscript
script


In [26]:
soup.find_all(["a","b"]) # if you pass a list, bs4 will match against any item in that list 

[<a class="css-1rn5q1r" href="#site-content">Skip to content</a>,
 <a class="css-1rn5q1r" href="#site-index">Skip to site index</a>,
 <a aria-label="New York Times Logo. Click to visit the homepage" class="css-nhjhh0 e1huz5gh1" href="/"><svg class="" fill="#000" viewbox="0 0 184 25" xmlns="http://www.w3.org/2000/svg"><path d="M13.8 2.9c0-2-1.9-2.5-3.4-2.5v.3c.9 0 1.6.3 1.6 1 0 .4-.3 1-1.2 1-.7 0-2.2-.4-3.3-.8C6.2 1.4 5 1 4 1 2 1 .6 2.5.6 4.2c0 1.5 1.1 2 1.5 2.2l.1-.2c-.2-.2-.5-.4-.5-1 0-.4.4-1.1 1.4-1.1.9 0 2.1.4 3.7.9 1.4.4 2.9.7 3.7.8v3.1L9 10.2v.1l1.5 1.3v4.3c-.8.5-1.7.6-2.5.6-1.5 0-2.8-.4-3.9-1.6l4.1-2V6l-5 2.2C3.6 6.9 4.7 6 5.8 5.4l-.1-.3c-3 .8-5.7 3.6-5.7 7 0 4 3.3 7 7 7 4 0 6.6-3.2 6.6-6.5h-.2c-.6 1.3-1.5 2.5-2.6 3.1v-4.1l1.6-1.3v-.1l-1.6-1.3V5.8c1.5 0 3-1 3-2.9zm-8.7 11l-1.2.6c-.7-.9-1.1-2.1-1.1-3.8 0-.7 0-1.5.2-2.1l2.1-.9v6.2zm10.6 2.3l-1.3 1 .2.2.6-.5 2.2 2 3-2-.1-.2-.8.5-1-1V9.4l.8-.6 1.7 1.4v6.1c0 3.8-.8 4.4-2.5 5v.3c2.8.1 5.4-.8 5.4-5.7V9.3l.9-.7-.2-.2-.8.6-2.5-2.1L18.5 9V

## Filtering by CSS Class

In [27]:
soup.find_all("a", class_="sister")

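# note the class_, since class is a reserved word in Python
# "sister" is a class from the bs4 documentation's example; nothing on this page uses it, so the result is empty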

[]

In [28]:
soup.find_all(class_=re.compile("ad"))

[<div class="ad dfp-ad-top-wrapper" style="text-align:center;height:100%;display:block"><div class="place-ad" data-position="top" id="dfp-ad-top"></div></div>,
 <div class="place-ad" data-position="top" id="dfp-ad-top"></div>,
 <div class="ad dfp-ad-mid1-wrapper" style="text-align:center;height:100%;display:block"><div class="" data-position="mid1" id="dfp-ad-mid1"></div></div>]
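
### One thing to watch with a pattern like `re.compile("ad")` is that it matches any class containing those two letters. A small sketch on an inline snippet (the class names and hrefs are made up for illustration) compares a plain string, an unanchored pattern, and an anchored one.

```python
import re
from bs4 import BeautifulSoup

html_doc = (
    '<div class="ad top-wrapper"><a class="headline lead" href="/a">A</a></div>'
    '<a class="headline" href="/b">B</a>'
    '<a class="byline" href="/c">C</a>'
)
demo_soup = BeautifulSoup(html_doc, "html.parser")

# a plain string matches any tag that has that class among its classes
headline_hrefs = [a["href"] for a in demo_soup.find_all("a", class_="headline")]

# an unanchored pattern also matches "headline" and "lead", which contain "ad"
ad_like = [t.name for t in demo_soup.find_all(class_=re.compile("ad"))]

# anchoring the pattern limits the match to the class "ad" itself
ad_exact = [t.name for t in demo_soup.find_all(class_=re.compile(r"^ad$"))]

print(headline_hrefs)
print(ad_like)
print(ad_exact)
```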

## Encoding

### Last but not least, it's important to remember that any HTML or XML document is written in a specific encoding (ASCII, UTF-8, etc.). bs4 converts the document to Unicode, and sometimes it guesses the original encoding incorrectly. 

### If you know the encoding ahead of time, specify it when you originally pass it in. For instance: 

In [29]:
# soup = BeautifulSoup(html, from_encoding="iso-8859-8")
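
### Here is a runnable sketch of that idea, assuming the markup arrives as bytes (`from_encoding` only applies to bytes, not to text that has already been decoded). The Hebrew sample word and the `demo_soup` name are illustrative.

```python
from bs4 import BeautifulSoup

# a short Hebrew word, written with escapes and then encoded to ISO-8859-8 bytes
expected_text = "\u05e9\u05dc\u05d5\u05dd"
markup = ("<h1>" + expected_text + "</h1>").encode("iso-8859-8")

# tell bs4 which encoding the bytes use instead of letting it guess
demo_soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
heading_text = demo_soup.h1.string

print(heading_text)
print(demo_soup.original_encoding)  # the encoding bs4 ended up using
```

### With `requests`, that means passing `r.content` (bytes) rather than `r.text` when you want bs4 to handle the decoding.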

---

## As you can see, there is a lot that can be done with BeautifulSoup! Just like anything else, the key is to know what can be done. Then, refer to the documentation. 

## Next week we'll take our BeautifulSoup skills and marry them with some Natural Language Processing and text mining capabilities.