# All material ©2019, Alex Siegman


---

### There is a LOT of useful information on the internet, and as data scientists you'll often need access to that information.

### Unfortunately, that information is rarely contained neatly in CSVs or even in tabular form. Rather, you have to really work to get what you need.

### Lucky for us, there are some useful tools for "scraping" the web – in particular, one called BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [None]:
import time
import re
import csv
import requests
from bs4 import BeautifulSoup
!pip install lxml

### Before we delve in, here's an example of the power of BeautifulSoup:

In [None]:
# the file attached is a simple csv containing 100 unique URLs from WSJ.com
# the script in this cell allows us to find the word count for each article (stored in the article metadata) via the URL

url_list = [] # create an empty list called 'url_list' where we will store all of the URLs in question

word_count_list = [] # create an empty list called 'word_count_list' where we will store the word counts associated  
                     # with each URL in our 'url_list'

with open("URLS_for_WordCount.csv", newline='') as csvfile:
    # note: replace the filename above with the path to wherever you have stored your csv
        
    reader = csv.DictReader(csvfile) # this maps each row to a dictionary keyed by the csv's header row
                                     # (an OrderedDict in Python 3.6 and 3.7, a plain dict from 3.8 on)
                                     # for more on DictReader see https://docs.python.org/3/library/csv.html
    
    for row in reader: # for every row in our csv, aka, for every dictionary entry (which is composed of our URLs)...
        
        # NB: you can use "print(row)" here to see what our ordered dictionary looks like 
        
        for k, v in row.items(): # for every key, value pair in our ordered dictionary...
            
            # NB: again, you can use "print(k)" or "print(v)" here to see what our key, value pairs look like 

            url_list.append(str(v)) # add the URL to our "url_list"
            
            r = requests.get(v) # for more on the requests library check out this tutorial from RealPython: 
                                # https://realpython.com/python-requests/
            
            soup = BeautifulSoup(r.text,'html') # we are going to turn the HTML at that URL into 'soup', aka, we are 
                                                # going to be able to see its metadata. For more on BeautifulSoup, 
                                                # check out: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
            
            wc1 = str(soup.find("meta", property="article:word_count")) # we want to find the word_count associated 
                                                                        # with each URL, found in the HTML that we 
                                                                        # have just "souped"
            
            wc2 = re.search(r'\d+',wc1).group(0) # we use a regular expression (written as a raw string, so that
            # '\d' is not treated as a string escape) to find the first number in the associated metadata, and
            # store that. For more on regex see this great tutorial (not from me): https://regexr.com/
        
            word_count_list.append(wc2) # finally, we add (append) our word count to our "word_count_list"

        break # stop after the first row; remove this line to process every row in the csv
            
print(word_count_list) # just to make sure everything works as planned
print(url_list) # again, just to make sure everything works as planned      

# the code below will create a new csv, called "URL_for_WordCount_with_WordCounts.csv", in our current directory
# for more on csv.writer check out: https://docs.python.org/3/library/csv.html

""" 

myData = url_list,word_count_list 
myFile = open('URL_for_WordCount_with_WordCounts.csv', 'w')  
with myFile:  
   writer = csv.writer(myFile)
   writer.writerows(myData)
   
"""

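The metadata lookup in the cell above can also be tried without any network access. The following is a minimal sketch, assuming a made-up page (`sample_html` and its word count of 842 are hypothetical stand-ins for `r.text`). It reads the tag's `content` attribute directly instead of using a regular expression, and it guards against pages where the tag is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text; a real article page would be much longer.
sample_html = """
<html><head>
<title>Sample article</title>
<meta property="article:word_count" content="842">
</head><body><p>Article text goes here.</p></body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keyword arguments that find() does not reserve are treated as attribute filters.
meta_tag = soup.find("meta", property="article:word_count")

# find() returns None when nothing matches, so check before reading the attribute.
if meta_tag is not None and meta_tag.get("content", "").isdigit():
    word_count = int(meta_tag["content"])
else:
    word_count = None

print(word_count)
```

Storing `None` for pages without the tag keeps one bad URL from stopping the whole loop.
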
### Now, back to BeautifulSoup basics:


In [None]:
 # let's scrape the NYT homepage

 # the requests library is the easiest way to call to a URL; here we are using a GET command

 # we are going to take the result of that GET command and pass it through bs4

 # 'prettify' does exactly what you'd think – it prettifies the output of the print statement
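
The steps described in the comments above might look like the sketch below. To keep it runnable offline, it parses a small hypothetical HTML string (`html` is an assumption, not the NYT homepage); the commented lines show how the markup could be fetched from a live page instead.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page. For a live page you would
# fetch the markup first, for example:
#   import requests
#   r = requests.get("https://www.nytimes.com", timeout=10)
#   html = r.text
html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"

# Pass the markup through bs4 using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parse tree as an indented string, one tag per line.
pretty = soup.prettify()
print(pretty)
```
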

### What you're seeing above is the HTML for the NYT homepage. Let's continue with a few basics:

In [None]:
 # let's find the title of the page

In [None]:
 # get a string version of the title 

# note that there are some encoding issues here

In [None]:
 # find the parent of the title 
                       # this is exceptionally helpful when you're trying to parse an HTML tree
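
A minimal sketch of the three title lookups described above, assuming a small hypothetical page (`html` is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a real page.
html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title                # the <title> Tag object
title_text = soup.title.string        # the text inside the tag
parent_name = soup.title.parent.name  # the name of the enclosing tag

print(title_tag)
print(title_text)
print(parent_name)
```
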

In [None]:
 # get the first <p> tag in the HTML

In [None]:
 # get the class of that <p> tag
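
Here is one way the `<p>` lookups above could be written, again on hypothetical markup. Note that `class` can hold several values, so bs4 returns it as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the first <p> has two CSS classes.
html = '<body><p class="lead intro">First</p><p>Second</p></body>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                    # only the first <p> in the document
first_p_classes = first_p["class"]  # multi-valued attribute, returned as a list

print(first_p)
print(first_p_classes)
```

Using `first_p.get("class")` instead would return `None` rather than raise a `KeyError` when the attribute is absent.
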

In [None]:
 # find all 'a' tags on the page

In [None]:
 # find all 'a' tags on the page
     # and get the associated href (hyperlink) for each one
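
A short sketch of collecting hyperlinks, assuming a hypothetical snippet with three anchors, one of which has no `href`:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the last anchor deliberately has no href.
html = """
<body>
<a href="https://example.com/a">First</a>
<a href="/b">Second</a>
<a name="top">No link here</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag, in document order.
anchors = soup.find_all("a")

# .get() returns None for a missing attribute instead of raising KeyError.
all_hrefs = [a.get("href") for a in anchors]

# href=True keeps only the anchors that actually carry an href attribute.
real_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(all_hrefs)
print(real_hrefs)
```
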

## It's important to know that BeautifulSoup transforms HTML into a tree of Python objects. The most important objects to know are: 

1. Tag
2. NavigableString
3. BeautifulSoup

### A tag corresponds to an XML or HTML tag in the original document. For instance:

In [None]:
 # you can easily access a tag's attributes

In [None]:
 # or, you can search for a corresponding value as you would in a dictionary 
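
To make the Tag object concrete, here is a minimal sketch on a hypothetical one-tag document. It shows the tag's name, its attribute dictionary, and dictionary-style lookups.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document used only for illustration.
soup = BeautifulSoup('<b id="headline" class="bold">Breaking</b>', "html.parser")
tag = soup.b

tag_name = tag.name           # the tag's name as a string
tag_attrs = tag.attrs         # all attributes as a dictionary
tag_id = tag["id"]            # dictionary-style lookup; raises KeyError if missing
missing = tag.get("missing")  # .get() returns None if the attribute is missing

print(tag_name, tag_attrs, tag_id, missing)
```
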

### A string corresponds to a bit of text within a tag. BeautifulSoup uses the NavigableString class to hold that text.

### The BeautifulSoup object represents the document as a whole.
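
The last two object types can be inspected the same way. This sketch assumes a tiny hypothetical document:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup.
soup = BeautifulSoup("<p>Hello, <b>world</b></p>", "html.parser")

# The text inside a tag is a NavigableString, which is a subclass of str.
text = soup.b.string
is_navigable = isinstance(text, NavigableString)
plain_text = str(text)  # convert to an ordinary Python string

# The BeautifulSoup object stands for the whole document and has a special name.
document_name = soup.name

print(is_navigable, plain_text, document_name)
```

Converting to `str` is useful when you want to keep the text around without keeping a reference to the whole parse tree.
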

## Navigating the Tree

### The easiest way to navigate the parse tree is to call out the tag you want. 

In [None]:
 # let's just call out for the 'head' tag

In [None]:
 # or the 'title' tag

### You can, of course, delve deeper into the parse tree.

In [None]:
 # get the first <p> tag beneath the <body> tag

In [None]:
# note that using a tag name as an attribute gets you only the first tag by that name



In [None]:
# to find all the tags, use something like find_all()
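
The navigation steps above can be sketched as follows, assuming a hypothetical page with two paragraphs and two links:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = """
<html><head><title>Sample page</title></head>
<body><p>First paragraph with a <a href="/one">link</a>.</p>
<p>Second paragraph with <a href="/two">another link</a>.</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

head_tag = soup.head         # the <head> tag
title_tag = soup.title       # the <title> tag
first_body_p = soup.body.p   # the first <p> beneath <body>
first_link = soup.a          # attribute access returns only the first <a>
all_links = soup.find_all("a")  # find_all() returns every <a>

print(first_body_p.get_text())
print(first_link["href"])
print(len(all_links))
```
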



### As alluded to earlier, it's helpful to be able to navigate the tree step-by-step. A tag's children are available in a list called .contents

### You can also iterate over a tag's children with the .children generator
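
A minimal sketch of both approaches, assuming a hypothetical list with two items. The markup has no whitespace between tags; in real pages, whitespace between tags shows up as extra string children.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with no whitespace between the tags.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

# .contents is a list of the tag's direct children.
children_list = ul.contents

# .children yields the same direct children one at a time.
child_texts = [child.string for child in ul.children]

print(children_list)
print(child_texts)
```
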

## Filters

In [None]:
 # simply pass in the string for the tag you're searching for

In [None]:
import re # you can pass in regular expressions, too

 # find all tags whose names start with 'p'
    

In [None]:
 # find all the tags whose names contain the letter 't'


In [None]:
 # if you pass a list, bs4 will match against any item in that list 
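
The four filter types described above (a string, two regular expressions, and a list) might look like this on a hypothetical page:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup containing a handful of different tags.
html = (
    "<html><head><title>T</title></head>"
    '<body><p>para</p><pre>code</pre><b>bold</b><a href="#">link</a></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# A string matches tags with exactly that name.
b_tags = [t.name for t in soup.find_all("b")]

# A regular expression is matched against each tag name.
starts_with_p = [t.name for t in soup.find_all(re.compile("^p"))]
contains_t = [t.name for t in soup.find_all(re.compile("t"))]

# A list matches any tag whose name is in the list.
a_or_b = [t.name for t in soup.find_all(["a", "b"])]

print(b_tags, starts_with_p, contains_t, a_or_b)
```

Results come back in document order, which is why `<b>` appears before `<a>` in the last list.
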

## Filtering by CSS Class

In [None]:


# note the class_, since class is a reserved word in Python
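
A short sketch of filtering by CSS class, assuming hypothetical markup in which one paragraph carries two classes:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = '<p class="story">One</p><p class="story lead">Two</p><p>Three</p>'
soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag if any one of its CSS classes equals the given string.
story_paragraphs = soup.find_all("p", class_="story")
story_texts = [p.get_text() for p in story_paragraphs]

lead_texts = [p.get_text() for p in soup.find_all("p", class_="lead")]

print(story_texts)
print(lead_texts)
```
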

## Encoding

### Last but not least, it's important to remember that any HTML or XML is written in a specific encoding (ASCII, UTF-8, etc.). bs4 converts the document to Unicode, and sometimes it guesses the original encoding incorrectly. 

### If you know the encoding ahead of time, specify it when you first pass the document to BeautifulSoup. For instance: 

In [None]:
# soup = BeautifulSoup(html, from_encoding="iso-8859-8")
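
To see `from_encoding` in action, here is a minimal sketch. It builds a hypothetical byte string in a known encoding (ISO-8859-1 is chosen only for illustration) and tells bs4 which encoding to use. Note that `from_encoding` applies to bytes; a Python `str` is already Unicode.

```python
from bs4 import BeautifulSoup

# Hypothetical document: encode a known string so we control the encoding.
text = "caf\u00e9"  # "café"
markup = f"<p>{text}</p>".encode("iso-8859-1")  # raw bytes, as read from a file or socket

# Tell bs4 the encoding up front instead of letting it guess.
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-1")

decoded = soup.p.string
print(decoded)
print(soup.original_encoding)  # the encoding bs4 ended up using
```
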

---

## As you can see, there is a lot that can be done with BeautifulSoup! Just like anything else, the key is to know what can be done. Then, refer to the documentation. 

## Next week we'll take our BeautifulSoup skills and marry them with some Natural Language Processing and text mining capabilities.