# Python PDF Harvester in $\le$ 25 LOC!

<br><br>

Python is an excellent programming language for web-related tasks. 

In the following demo, we develop a utility that performs the following steps:     
<br>
1. **Capture the underlying HTML for a given webpage of interest** (henceforth the *wpoi*)   
<br>   
2. **Parse the HTML, and extract all references to PDF documents with associated URLs**     
<br>      
3. **For each PDF URL reference, retrieve the source document and write the contents to file**    
<br>    

    

## Step I: Capture Underlying HTML

In order to get the demo to run correctly, you'll need to provide details of our (CNA's) proxy server when requesting content from a wpoi. 

<br>
**Note that this step is required only because we're accessing the internet from within the Enterprise. If you repeat the demo at home, skip the proxy-server initialization step entirely.**
<br>  


This is necessary because every internet-bound web request made throughout the Enterprise is first routed through an intermediate proxy server prior to being passed along to the internet at large. 

**Case in point:**

Follow the link to *WhatsMyIP.org*: [http://www.whatsmyip.org/](http://www.whatsmyip.org/) 
<br>  
Everyone here will have the same IP: **159.10.134.170**. 

This is the public-facing IP address of our proxy server. 

As a security measure, in order to make internet requests, users need to provide their username and password, along with the address and port of our proxy server. 

Note that your authentication details are not "sent out" to the internet: They're used only so that the proxy server can verify that the request is originating from an authorized user. 
        
This authentication happens behind the scenes automatically with every *browser-based* internet submission, but when submitting web requests from a programming language, this step needs to be performed manually. 
     
To illustrate, from Internet Explorer, go to *Tools > Internet Options > Connections > LAN Settings*. You'll find:
    
*  Address: **proxy.cna.com**
*  Port: **8080**

The proxy address and port are used in conjunction with your CID and password whenever you perform a web search. 


Our first step will be to create a dictionary with two entries: one for **http** and another for **https** requests. The format is as follows:


```
http://<CID>:<password>@proxy.cna.com:8080
https://<CID>:<password>@proxy.cna.com:8080
```
<br>  
For example, if a user's CID is **`CCC3313`** with password **`P@ssword9999`**, their `proxies` dict would look like:


```python
proxies = {'http': 'http://CCC3313:P@ssword9999@proxy.cna.com:8080',
           'https': 'https://CCC3313:P@ssword9999@proxy.cna.com:8080'}
```

A few things to note:

* In Python parlance, `http` and `https` are called dictionary "keys", and the associated proxy strings are called "values". Dictionaries consist of *key-value* pairs.    
<br>   

* In the `proxies` dict, the keys and values are surrounded by quotes (either single or double; Python does not make a distinction). Both the keys and values are string datatypes in the `proxies` dict.     
<br>

* The first key-value pair is separated from the second by a comma; the same convention applies to dicts containing more than two items. Keys are separated from values by a colon, and key-value pairs are separated from each other by commas.    
<br>
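One caveat with the format above, noted here as a precaution rather than a documented requirement of the proxy: characters such as `@`, `:` or `/` inside a password are reserved within a URL. Percent-encoding the credentials before building the proxy strings removes any ambiguity about where the password ends and the host begins. The helper name `build_proxies` and its default arguments are hypothetical, introduced only for illustration:

```python
from urllib.parse import quote


def build_proxies(cid, password, host="proxy.cna.com", port=8080):
    """Return a proxies dict with percent-encoded credentials (hypothetical helper)."""
    # safe="" forces every reserved character, including '@' and ':', to be encoded
    user = quote(cid, safe="")
    pw = quote(password, safe="")
    return {
        scheme: "{}://{}:{}@{}:{}".format(scheme, user, pw, host, port)
        for scheme in ("http", "https")
    }


proxies = build_proxies("CCC3313", "P@ssword9999")
print(proxies["http"])   # http://CCC3313:P%40ssword9999@proxy.cna.com:8080
```

If your password contains only letters and digits, the encoded and unencoded forms are identical, so the helper is harmless in that case.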


The library that facilitates communication between Python and the wpoi is [`requests`](http://docs.python-requests.org/en/master/), a third-party package (install it with `pip install requests` if it isn't already available). It exposes a simple, intuitive interface. To capture the wpoi's underlying HTML, we need only call:

```
requests.get('URL', proxies=proxies).text
```

Where `URL` is a string representing the URL of interest. **`requests.get`** returns a response object, and by accessing its `text` attribute, we're requesting that the wpoi's content be returned as plain text to allow for parsing with regular expressions in the next step. 

<br><br>
What follows is the Python code that corresponds to **`Step I`** of our PDF Harvester demo:   


In [None]:
# ===================================================================
# PDF Harvester I of III: Retrieve HTML as plain text               |
# ===================================================================
import requests
from pprint import pprint

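# `proxies = None` sends requests directly (e.g. when running at home) =>
proxies = None

# *** inside the Enterprise: uncomment and replace proxy string with your authentication details ***
# proxies = {
#         'http' :"http://CCC3313:P@ssword9999@proxy.cna.com:8080",
#         'https':"https://CCC3313:P@ssword9999@proxy.cna.com:8080"
#         }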


# specify URL for sample wpoi =>
URL = "https://en.wikipedia.org/wiki/Loss_reserving"

# instruct request object to return HTML as plain text =>
html = requests.get(URL, proxies=proxies).text

# print raw html =>
pprint(html, width=100)


<br><br>   
We've captured the HTML. Next we need to identify and extract references to PDF documents in the form of URLs. For this step, we'll leverage the power of regular expressions, available in the Python Standard Library within the **`re`** module.    
<br><br>      



## Step II: Extract PDF URLs from HTML

A **Regular Expression** is a sequence of characters that defines a search pattern.

Example from SQL:

```sql
SELECT * FROM TABLE WHERE FIELD LIKE '%FIRE%';
```

This query returns all records from **TABLE** where **FIELD** contains
'FIRE', with any number of leading or trailing characters (the mixed-case examples assume a case-insensitive comparison):

* BLD_FIRE
* CONT_FIRE
* BINT_FIRE
* “Fire on the Mountain”
* “Jump in the Fire”


*`LIKE`* is useful for capturing string literals from a larger corpus of text, but cannot be used to match instances in which the target string follows a fixed format but with variable content:

```
"Match 3 characters followed by 7 digits followed by 3 vowels"
```
<br>    


Put simply, Regular Expressions use special symbols to represent collections of characters.
A few of the most common symbols are:

*  **\d** matches any decimal digit [0-9]   
<br>     
*  **\D** matches any non-digit character [^0-9]    
<br>    
*  **\s** matches any whitespace character [ \t\n\r\f\v]    
<br>    
*  **\S** matches any non-whitespace character [^ \t\n\r\f\v]   
<br>    
*  **\w** matches any word character (alphanumeric or underscore) [a-zA-Z0-9_]   
<br>    
*  **\W** matches any non-word character [^a-zA-Z0-9_]   
<br>    
*  **+** matches 1 or more of the preceding symbol or literal
<br>    
*  **\*** matches 0 or more of the preceding symbol or literal   
<br>    
*  **^** matches at beginning of line (in Python, at the beginning of the string unless `re.MULTILINE` is set)   
<br>
*  **$** matches at end of line (in Python, at the end of the string unless `re.MULTILINE` is set)     
<br>
*  **?** matches 0 or 1 of the preceding symbol or literal; affixed to a quantifier such as `+` or `*`, it changes the match type from greedy to non-greedy.     
<br>    
<br>     
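The fixed-format request from earlier ("3 characters followed by 7 digits followed by 3 vowels") can be sketched with these symbols. Two pieces of notation used here are not covered in this demo and are included as assumptions about how one might write it: `{n}` repeats the preceding symbol exactly `n` times, and `[aeiou]` matches any single character listed between the brackets. "Characters" is interpreted as word characters and "vowels" as lowercase vowels.

```python
import re

# \w{3}      -> exactly 3 word characters
# \d{7}      -> exactly 7 decimal digits
# [aeiou]{3} -> exactly 3 lowercase vowels
pattern = re.compile(r"\w{3}\d{7}[aeiou]{3}")

candidates = ["ABC1234567aei", "ABC123456aei", "ABC1234567xyz"]

# fullmatch succeeds only when the entire string fits the pattern
matches = [c for c in candidates if pattern.fullmatch(c)]
print(matches)   # ['ABC1234567aei']
```

The second candidate has only six digits and the third ends in consonants, so neither fits the format.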



The quantifiers `+`, `*` and `?` are used in conjunction with other character classes:

*  **\d+** will match 1 or more consecutive decimal digits [0-9]    
<br>   
*  **\w\*** will match 0 or more word characters [a-zA-Z0-9_]      
<br>
*  [Pythex](http://pythex.org/) example with `<Alpha><Beta><Gamma><Delta>` and `^<Alpha.*?>`
<br>
<br>
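The Pythex example can also be reproduced directly in Python. This short sketch compares the greedy pattern `^<Alpha.*>` with its non-greedy counterpart `^<Alpha.*?>` on the same input string:

```python
import re

text = "<Alpha><Beta><Gamma><Delta>"

# greedy: `.*` grabs as much as possible, so the match runs to the LAST '>'
greedy = re.match(r"^<Alpha.*>", text).group()

# non-greedy: `.*?` grabs as little as possible, so the match stops at the FIRST '>'
lazy = re.match(r"^<Alpha.*?>", text).group()

print(greedy)   # <Alpha><Beta><Gamma><Delta>
print(lazy)     # <Alpha>
```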


### Lookahead Assertion



Another regular expression utility we'll leverage is the *lookahead assertion*:

*  **(?=char_sequence)** 	


The *lookahead* assertion matches without consuming. It can be used to tell a regular expression where to start matching, then use additional symbols to capture content.    
<br>     
   
Consider the Beatles discography:    

```
Please Please Me
With the Beatles
A Hard Day's Night
Beatles For Sale
Help!
Rubber Soul
Revolver
Sgt. Pepper's Lonely Hearts Club Band
Magical Mystery Tour
The White Album
Yellow Submarine
Abbey Road
Let It Be
```

In order to identify album names starting with **`R`**, we'd use the following lookahead assertion, along with `^` and `.+`:

```
^(?=R).+
```

This regular expression highlights any album names beginning with `"R"`.   
<br>
To *extract* the matching album names, we need to surround the text of interest with parens. This is called a *regular expression capture group*. The above regular expression is modified only by the inclusion of parens surrounding **`.+`**:

```
^(?=R)(.+)

```
   
<br>  
[Verify using Pythex](http://pythex.org/)   
<br>        
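The same check can be run in Python. One detail worth flagging: when the whole discography is held in a single string, `^` matches only at the very start of that string unless the `re.MULTILINE` flag is passed, so this sketch supplies that flag to get line-by-line matching similar to what Pythex offers through its multiline option.

```python
import re

albums = "\n".join([
    "Please Please Me", "With the Beatles", "A Hard Day's Night",
    "Beatles For Sale", "Help!", "Rubber Soul", "Revolver",
    "Sgt. Pepper's Lonely Hearts Club Band", "Magical Mystery Tour",
    "The White Album", "Yellow Submarine", "Abbey Road", "Let It Be",
])

# with re.MULTILINE, `^` matches at the start of every line
starts_with_r = re.findall(r"^(?=R)(.+)", albums, flags=re.MULTILINE)

# without the flag, `^` matches only at the start of the string ("Please...")
without_flag = re.findall(r"^(?=R)(.+)", albums)

print(starts_with_r)   # ['Rubber Soul', 'Revolver']
print(without_flag)    # []
```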
       
    
 
We now have everything we need to extract PDF URLs from the HTML captured in Step I.


#### Identifying PDF URLs
 
We can take advantage of a few insights to help streamline the compilation of our regular expression. 

We'll use the following link as a reference: [https://en.wikipedia.org/wiki/Loss_reserving](https://en.wikipedia.org/wiki/Loss_reserving).

<br>   
1.  From any browser, pressing **Ctrl + U** will display the underlying HTML for the current webpage. This can be useful for inspecting the underlying HTML and searching for text patterns.       
<br>             
2.  Valid PDF URLs will in all cases be the value of an **href** attribute.   
<br>      
3.  Valid PDF URLs will in all cases be preceded by **http** or **https**.    
<br>   
4.  Valid PDF URLs will in all cases be followed by a trailing **>**.        
<br>       
5.  Valid PDF URLs cannot contain whitespace.        

<br><br>




The following snippet is characteristic of the HTML the harvester will be required to parse:


```html
<html>
<span class="reference-text"><a class="external" href="http://Public/GuelahPapyrus.pdf">
    <a><span><li>< href="http://ocw.mit.edu/courses/StatisticalModeling.pdf"></a></span></li>
    <a href="#cite_ref-3">^</a></b></span> <span class="reference-text">
    <a class="internal" href="https://The/Watchful/Horsemasters.pdf">
        <a><i>href="http://arxiv.org/pdf/1311.1704v3.png"><i>Scalable Recommendation</i></a>
        <a href="#cite_ref-Papoulis_5-0"><sup><i><b>a</b></i></sup>
<href>masterdownload%255C2009519932327055475115776.pdf&amp;rft_id=info%3Adoi%2F10.1016%2Fj</href>
</html>
```

From this HTML, using only what we've covered so far, our regular expression needs to be able to extract:

*  **`http://Public/GuelahPapyrus.pdf`**        
<br>   
*  **`http://ocw.mit.edu/courses/StatisticalModeling.pdf`**        
<br>
*  **`https://The/Watchful/Horsemasters.pdf`**    
<br>
<br><br>      

Load the sample HTML in [Pythex](http://pythex.org/), and let's create a regular expression to extract the URLs!
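For readers following along without Pythex, here is one candidate pattern run against the sample HTML with `re.findall`. It is a sketch rather than the only possible answer. Note that the dot before `pdf` is escaped (`\.`) so that it matches a literal period; left unescaped, `.` matches any character and the `.png` line would yield the partial URL `http://arxiv.org/pdf`.

```python
import re

sample_html = """<html>
<span class="reference-text"><a class="external" href="http://Public/GuelahPapyrus.pdf">
    <a><span><li>< href="http://ocw.mit.edu/courses/StatisticalModeling.pdf"></a></span></li>
    <a href="#cite_ref-3">^</a></b></span> <span class="reference-text">
    <a class="internal" href="https://The/Watchful/Horsemasters.pdf">
        <a><i>href="http://arxiv.org/pdf/1311.1704v3.png"><i>Scalable Recommendation</i></a>
        <a href="#cite_ref-Papoulis_5-0"><sup><i><b>a</b></i></sup>
<href>masterdownload%255C2009519932327055475115776.pdf&amp;rft_id=info%3Adoi%2F10.1016%2Fj</href>
</html>
"""

# (?=href=)            start matching where `href=` begins, without consuming it
# .*                   skip ahead on the same line
# (https?://\S+\.pdf)  capture http or https, non-whitespace characters, literal ".pdf"
# .*?>                 require a trailing '>' after the URL
pattern = r"(?=href=).*(https?://\S+\.pdf).*?>"

pdf_urls = re.findall(pattern, sample_html)
for url in pdf_urls:
    print(url)
```

This prints the three target URLs and skips the `.png` link, the `#cite_ref` anchors, and the `<href>` element that has no `href=` attribute.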



We now extend our Step I code to include logic to extract PDF URLs from retrieved HTML. The one additional library leveraged in Step II is `re`, which is Python's regular expression module. 


In [None]:
# ===================================================================
# PDF Harvester II of III: Extract PDF URLs from HTML               |
# ===================================================================
import requests
import re

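# `proxies = None` sends requests directly (e.g. when running at home) =>
proxies = None

# *** inside the Enterprise: uncomment and replace proxy string with your authentication details ***
# proxies = {
#         'http' :"http://CCC3313:P@ssword9999@proxy.cna.com:8080",
#         'https':"https://CCC3313:P@ssword9999@proxy.cna.com:8080"
#         }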


# specify URL for webpage of interest =>
URL = "https://en.wikipedia.org/wiki/Loss_reserving"

# instruct request object to return HTML as plain text =>
html = requests.get(URL, proxies=proxies).text

# search html and compile PDF URLs (the dot before `pdf` is escaped
# so that it matches a literal period, not any character) =>
pdf_urls = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)

# display content of pdf_urls list => 
print(pdf_urls)



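Note that the regular expression is preceded with an **`r`** when passed to `re.findall`. This marks the string literal as a raw string, so Python leaves backslashes untouched rather than treating them as the start of an escape sequence, and sequences such as `\S` reach the regular expression engine intact.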

`re.findall` returns a list of all matches extracted from the source text. In our case, it returns a list of URLs.

Finally, we need to retrieve the content associated with a given PDF and write it to file locally. 
We introduce another module from the Python Standard Library, `os.path`, which facilitates the partitioning of absolute filepaths into components in order to retain filenames when saving to disk.    
For example, consider the following well-formed URL:


```
"http://Statistical_Modeling/Fall_2017/Lectures/Lecture11.pdf"
```

To capture `Lecture11.pdf`, we pass the absolute URL to `os.path.split`, which returns
a tuple with everything preceding the filename as the first element, and the filename and extension as the second element:

```python
>>> import os.path
>>> url = "http://Statistical_Modeling/Fall_2017/Lectures/Lecture11.pdf"
>>> os.path.split(url)
('http://Statistical_Modeling/Fall_2017/Lectures', 'Lecture11.pdf')
```

Therefore, we can capture the filename and extension by calling `os.path.split(url)`, and using Python's index notation to specify the element at the second position in the tuple, **`os.path.split(url)[1]`**.
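`os.path.split` works for the URLs produced by our regular expression, which always end in `.pdf`. If URLs come from elsewhere and carry a query string or fragment, a slightly more defensive variant is to parse the URL first and take the last segment of its path. The helper name `filename_from_url` below is hypothetical and shown only as an alternative sketch:

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of `url`, ignoring any query string or fragment."""
    # urlparse separates the path from the query/fragment; posixpath always
    # splits on '/', which is the separator URLs use on every operating system
    return posixpath.basename(urlparse(url).path)


url = "http://Statistical_Modeling/Fall_2017/Lectures/Lecture11.pdf"
print(filename_from_url(url))                   # Lecture11.pdf
print(filename_from_url(url + "?download=1"))   # Lecture11.pdf
```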


## Step III: Capture PDF Content and Write to File

This step differs from the initial HTML retrieval in that we need to request the content as bytes, not text. By calling `requests.get(url, proxies=proxies).content`, we're accessing the raw bytes that comprise the PDF, then writing those bytes directly to file. 


In [None]:
# ===================================================================
# PDF Harvester III of III: Write PDF(s) to file                    |
# ===================================================================
import requests
import re
import os
import os.path

# `proxies = None` sends requests directly (e.g. when running at home) =>
proxies = None

# *** inside the Enterprise: uncomment and replace proxy string with your authentication details ***
# proxies = {
#         'http' :"http://CCC3313:P@ssword9999@proxy.cna.com:8080",
#         'https':"https://CCC3313:P@ssword9999@proxy.cna.com:8080"
#         }


# specify URL for webpage of interest (leave exactly one line uncommented) =>
URL = "https://en.wikipedia.org/wiki/Loss_reserving"              #1 PDFs
# URL = "https://en.wikipedia.org/wiki/Law_of_total_variance"     #2 PDFs
# URL = "https://en.wikipedia.org/wiki/Kernel_density_estimation" #3 PDFs
# URL = "https://en.wikipedia.org/wiki/Logistic_regression"       #4 PDFs
# URL = "https://en.wikipedia.org/wiki/Exponential_distribution"  #6 PDFs
# URL = "https://en.wikipedia.org/wiki/Gamma_distribution"        #8 PDFs
# URL = "https://en.wikipedia.org/wiki/Naive_Bayes_classifier"    #8 PDFs


# instruct requests object to return HTML as plain text =>
html = requests.get(URL, proxies=proxies).text

# search html and compile PDF URLs (the dot before `pdf` is escaped
# so that it matches a literal period, not any character) =>
pdf_urls = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)

# display extracted PDF URLs =>
for i in pdf_urls: print(i)


# set working directory to desired location =>
os.chdir("C:\\Users\\cac9159\\Downloads\\")

# request PDF content and write to file for all entries
# in pdf_urls =>
for pdf in pdf_urls:
    
    # get filename from url =>
    pdfname = os.path.split(pdf)[1]
    
    print("Saving {}...".format(pdfname))
    
    try:
        # request the PDF as raw bytes (`content`) inside the try block so
        # that broken links and SSL errors are caught as well =>
        r = requests.get(pdf, proxies=proxies).content
        
        # write content of r to file, using same name as pdf =>
        with open(pdfname, "wb") as f:
            f.write(r)
            
    except Exception:
        print("Unable to download {}.".format(pdfname))
        continue
        
print("\nProcessing complete!")


Notice the `try-except` block: This is how exception handling is implemented in Python. Only errors raised inside the `try` body are caught. Situations that prevent the PDF from being downloaded, such as empty redirects, broken links or an invalid server-side SSL configuration, surface from the `requests.get` call, so that call needs to sit inside the `try` body alongside `with open(pdfname, "wb")...` for those failures to be handled. 
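A related failure mode does not raise an exception at all: a server can answer a broken link with an HTML error page, which would then be saved under a `.pdf` name. One inexpensive safeguard, sketched here under the assumption that a well-formed PDF begins with the bytes `%PDF-`, is to inspect the downloaded bytes before writing them. The helper name `save_if_pdf` is hypothetical, and the demonstration uses in-memory bytes in place of a real download.

```python
import os
import tempfile


def save_if_pdf(data, path):
    """Write `data` to `path` only if it starts with the PDF signature."""
    if not data.startswith(b"%PDF-"):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True


# demonstration with in-memory bytes instead of downloaded content
workdir = tempfile.mkdtemp()
good = save_if_pdf(b"%PDF-1.4\n...", os.path.join(workdir, "good.pdf"))
bad = save_if_pdf(b"<html>Not Found</html>", os.path.join(workdir, "bad.pdf"))
print(good, bad)   # True False
```

In the harvester loop, the bytes returned by `requests.get(pdf, proxies=proxies).content` would take the place of the literal byte strings used here.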

Finally, we present the PDF Harvester with the commands collected into a function and with comments stripped away. In CPython, **names in local scope are generally looked up faster than names in global scope**, so encapsulating logic within functions can carry a modest performance benefit in addition to being a best practice. For a network-bound utility like this one, the gain is likely small next to the time spent downloading. 


In [None]:
# ===================================================================
# PDF Harvester: Functional Representation                          |
# ===================================================================
import requests
import re
import os
import os.path

proxies = None
# proxies = {'http' :"http://CCC3313:P@ssword9999@proxy.cna.com:8080",
#            'https':"https://CCC3313:P@ssword9999@proxy.cna.com:8080"}


def pdf_harvester(url, proxies, loc=None):
    """
    Retrieve url's html and extract references to PDFs.
    Download PDFs, writing to `loc`. If `loc` is None, 
    save to current working directory.
    """
    print("Harvesting PDFs from => {}\n".format(url))
    os.chdir(os.getcwd() if loc is None else loc)
    html     = requests.get(url, proxies=proxies).text
    pdf_urls = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)
    
    for pdf in pdf_urls:
    
        pdfname = os.path.split(pdf)[1]
        
        try:
            print("Downloading {}...".format(pdfname))
            r = requests.get(pdf, proxies=proxies).content
            with open(pdfname, "wb") as f:
                f.write(r)
                
        except Exception:
            print("Unable to download {}.".format(pdfname))
            continue
            
    print("\nProcessing complete.")
    


    
# example calling `pdf_harvester` =>
URL = "https://en.wikipedia.org/wiki/Poisson_point_process"
pdf_harvester(URL, proxies, loc="C:\\Users\\cac9159\\Downloads\\")


## PDF Harvester Potential Future Enhancements

1. Include an additional function argument that specifies what type of file to search for and extract. Any 
   file linked from a webpage is referenced through an `href` attribute in the HTML markup. Therefore, the `pdf_harvester` can be modified to download any type of file.    
<br>    
2. Leverage the Standard Library's `argparse` module to convert the `pdf_harvester` into a command line script that accepts and parses command line arguments. Simply pass the script name along with the URL of interest, and `pdf_harvester` will download and save PDFs without ever having to open the source file in a graphical interface.     
<br>     
3.  Note that `pdf_harvester` isn't limited to just downloading files: Once the HTML is retrieved, it can be parsed to extract any content you desire. For example, imagine retrieving financial information for a particular stock during trading hours from Yahoo Finance. Simply pass the URL of interest to a modified `pdf_harvester`, and search the retrieved HTML for the stock price at the moment of retrieval. Then, once you've identified a consistent text pattern in the vicinity of the stock quote markup, you can retrieve the HTML on a periodic basis for whatever collection of stocks you're interested in tracking. `pdf_harvester` would be much simpler for this enhancement than the original implementation: You would only need to remove all commands from the beginning of the `for` loop to the end of the function.
<br>
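As a starting point for the first two enhancements, the sketch below parameterizes the file extension and wires up a small `argparse` interface. The names `extract_urls` and `build_parser`, along with the `--ext` and `--loc` options, are hypothetical choices made for illustration. The pattern mirrors the harvester's regular expression with the dot escaped, and the download step is left out to keep the example short.

```python
import argparse
import re


def extract_urls(html, ext="pdf"):
    """Return URLs in `html` that end with the given file extension."""
    # re.escape guards against extensions containing regex metacharacters
    pattern = r"(?=href=).*(https?://\S+\." + re.escape(ext) + r").*?>"
    return re.findall(pattern, html)


def build_parser():
    """Command line interface: a positional URL plus optional settings."""
    parser = argparse.ArgumentParser(description="Download linked files from a webpage.")
    parser.add_argument("url", help="webpage of interest")
    parser.add_argument("--ext", default="pdf", help="file extension to harvest")
    parser.add_argument("--loc", default=None, help="directory to save files in")
    return parser


# simulate `python harvester.py <url> --ext png` by passing the arguments as a list
args = build_parser().parse_args(
    ["https://en.wikipedia.org/wiki/Loss_reserving", "--ext", "png"]
)
sample = '<a href="http://arxiv.org/pdf/1311.1704v3.png"><i>Scalable Recommendation</i></a>'
print(args.url, args.ext, extract_urls(sample, ext=args.ext))
```

In a real script, `parse_args()` would be called without arguments so that it reads from the command line, and the parsed values would be handed to `pdf_harvester`.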

