# A. Overview
The general idea of this program is to run a Ctrl+F search, with multiple search terms, across all the PDFs linked from a webpage. If you're looking through notice-and-comment documents (like [these](https://www.copyright.gov/1201/2018/comments-031418/)) or any other webpage that links to PDFs, it can generally identify all the PDFs on the page and then search each of them for the terms you specify.

***It's important to note that in order for a PDF to be readable by the program, the PDF has to be stored such that when a mouse hovers over the PDF's link on Google Chrome, the url that appears in the bottom left corner of the browser ends in ".pdf".***
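
As a rough illustration of that requirement, the sketch below collects only the links whose `href` ends in `.pdf`. It uses the standard library's `html.parser` as a stand-in for the BeautifulSoup selector used in the code further down; the sample HTML and the `PdfLinkCollector` name are made up for the example.

```python
from html.parser import HTMLParser

class PdfLinkCollector(HTMLParser):
    """Collect href values of <a> tags that end in '.pdf' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.endswith(".pdf"):
            self.pdf_links.append(href)

# Made-up page: one direct PDF link, and one link that might serve a PDF
# but whose URL does not end in ".pdf", so it is not picked up
sample_html = """
<a href="https://example.com/comments/report.pdf">Report</a>
<a href="https://example.com/download?id=42">Another document</a>
"""

collector = PdfLinkCollector()
collector.feed(sample_html)
print(collector.pdf_links)
```

Only the first link is collected; the second would be missed even if it served a PDF.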

---



# B. How to Use

1.   Scroll down to the User Inputs in section E, and enter a URL (where the PDFs of interest are located) in the user inputs field on the right side of the page.

2.   Enter keyword(s) or phrase(s) for which you wish to search in a pipe-delimited ("|") format (or other selected delimiter). 
> e.g. "word1|phrase of interest 1|word2"

    There's no limit to how many terms you can search for, so it can pay off to be creative and comprehensive with search terms, such as including both "FCC" and "Federal Communications Commission".

3. On the toolbar at the top of the page, click Runtime -> Run All (or press Ctrl + F9)

4. You can scroll to the bottom of the page to get updates on what percentage of PDFs have been read, and when the program is complete, it will prompt a download of a CSV file that lists all references to the keywords with a bit of the surrounding text, and links to the relevant PDFs.
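
To give a sense of the output, here is a minimal sketch of one CSV row: the keyword, up to 75 characters of text on either side of the start of the hit, and the PDF's URL. The sample text and URL are invented, and the standard `csv` module stands in for the pandas export used in the code below.

```python
import csv
import io

# Invented sample text and URL, for illustration only
text = "The petitioner notes that the FCC has previously addressed this question in a separate proceeding."
keyword = "FCC"
pdf_url = "https://example.com/comments/sample.pdf"

# Case-insensitive lookup of the first hit
position = text.lower().find(keyword.lower())

# Keep up to 75 characters on each side of the start of the hit
reference = text[max(0, position - 75):position + 75]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Keyword", "Reference", "File"])
writer.writerow([keyword, reference, pdf_url])
print(buffer.getvalue())
```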

---






## C. Keyword Tips
The pipe may be an unfamiliar delimiter, but using it means that commas and semicolons can be part of a search term in the occasional cases where they matter.
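
A small sketch of why the delimiter choice matters, using a made-up keyword entry in which one phrase contains a comma:

```python
# Hypothetical keyword entry containing a comma inside one phrase
entry = "notice, and comment|FCC|Federal Communications Commission"

by_pipe = entry.split("|")   # the comma stays inside the first phrase
by_comma = entry.split(",")  # a comma delimiter would cut that phrase in two

print(by_pipe)
print(by_comma)
```

Splitting on the pipe yields three intact search terms, while splitting on the comma breaks the first phrase apart and leaves the pipes inside the second piece.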

#### Note that spacing matters: 

A keyword entry of "cat|keyword2" would flag instances of cat, category and concatenate.

If an extra space were inserted at the beginning, a keyword entry of " cat|keyword2" would return cat and category but **not** concatenate since there isn't a space before cat in concatenate.
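
The effect of the leading space can be checked with a short sketch. The `count_hits` helper and the sample sentence are made up for illustration; the search is a literal, case-insensitive match in the spirit of Ctrl+F.

```python
import re

def count_hits(keyword, text):
    """Count case-insensitive, literal occurrences of keyword in text."""
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))

sample = "The cat fits one category; we concatenate the rest."

print(count_hits("cat", sample))   # 3: cat, category, concatenate
print(count_hits(" cat", sample))  # 2: cat, category
```

Note that a keyword with a leading space will also miss a match at the very start of the text or right after a line break, since no space precedes it there.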


---


## D. Important Caveats
1. The false-negative rate should be about the same as Ctrl+F. If opening each of these documents and searching it with Ctrl+F would be precise enough for your purposes, this should do about as well, only faster and with less effort.

2. There will be some false positives. The program reads the text stored in each PDF, and for scanned documents that text usually comes from character recognition of sometimes fuzzy scans, so it will occasionally return odd things like blanks, bizarre characters, or visually similar but unwanted results (e.g. the dots in i's sometimes seem to be read as periods). The idea isn't for the program to read for you but to identify, at a high rate, portions of documents that could be of interest, so hopefully it can do that in spite of these irregularities.


3. Because this runs on Google's cloud, it can be slightly slow, especially when downloading a package at the start of each session. If you have Python set up on your computer, it should be straightforward to download this code (File->Download). It should run significantly faster locally, or at the very least without the time cost of installing packages (as opposed to importing them) at the start of every session.

4. The tool has worked well on government pages, Google search, Google Scholar, etc. There are, however, occasionally PDFs which the program can't open (PDF urls with spaces throw it off, for example), which is frustrating. The output at the bottom of the page will tell you which files were not read. As time permits development may iron out these issues.
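
One possible workaround for URLs containing spaces, which is not part of the program as written and has not been checked against the files that failed, is to percent-encode the URL path before requesting it. The sketch below uses `urllib.parse` from the standard library; the `encode_spaces` helper and the example URL are hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_spaces(pdf_url):
    """Percent-encode unsafe characters (such as spaces) in the path of a URL."""
    parts = urlsplit(pdf_url)
    # Keep '/' and existing '%' escapes intact so already-encoded URLs are not double-encoded
    safe_path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, safe_path, parts.query, parts.fragment))

encoded = encode_spaces("https://example.com/docs/Final Comment Letter.pdf")
print(encoded)
```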

The search is NOT case-sensitive.

If there is interest in a version that could run through a local folder rather than a webpage, that would be feasible too.

---

# E. User inputs
Insert the url to be searched for PDFs and then a list of keywords or phrases separated by pipes ("|", usually on the backslash key). Other delimiters can be selected if desired.


In [None]:
url = "https://www.copyright.gov/1201/2018/comments-031418/" #@param {type:"string"}
delimiter = "|" #@param ["|", ",", ";"]
Keywords = "FCC|Federal Communications Commission|FAA|Environmental Protection Agency" #@param {type:"string"}

# Code

In [None]:
# This part takes a bit (maybe a minute), but needs to run at the start of each session; once you have done one run, however, it won't take that long (1-2 seconds) for each subsequent run in the same session
!pip install pdfminer

In [None]:
import io
import os
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup
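try:
    from google.colab import files  # only available inside Google Colab
except ImportError:
    class files:  # minimal stand-in so files.download(...) is a harmless no-op outside Colab
        @staticmethod
        def download(filename):
            print('Not running in Colab; the CSV was saved as ' + filename)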
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
import time
from urllib.request import urlopen
from urllib.parse import urljoin

In [None]:
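def pull_keywords(keywords, text, filename):
    """Return one row per keyword hit in `text`, with up to 75 characters of context on each side."""
    rows = []
    for keyword in keywords:
        if not keyword:
            continue  # an empty keyword would match at every position
        # re.escape gives a literal (Ctrl+F-style) match; IGNORECASE keeps the search case-insensitive
        res = [m.start() for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE)]
        if len(res)>0:
            print(keyword +' found '+str(len(res))+' time(s) in '+str(filename))
        for i in res:
            # max(0, ...) stops a negative start index from wrapping around for hits near the start of the text
            reference = text[max(0, i-75):i+75].replace("\n"," ")
            rows.append({'Keyword':keyword,'Reference':reference,'File':filename})
    # Build the frame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=['Keyword','Reference','File'])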

In [None]:
def pdf_to_text(in_file):
    output_string = io.StringIO()
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
    text = output_string.getvalue()   
    return text

In [None]:
def pull_pdfs(url, keywords, outfile_name):
    print('Searching for PDFs at '+url)
    print('Searching for these terms: ', keywords)
    
    pdf_url_list = []
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"}
    response = requests.get(url,headers=headers)
#--Collecting PDFs on provided URL (2 versions)
    ## A - Handle Google search results since they work differently
    if url.startswith('https://www.google.com/search'):
        soup = BeautifulSoup(response.content, "html.parser")
        for link in soup.select('a[href*=".pdf"]'):
            i = re.split(":(?=http)",link["href"].replace("/url?q=",""))[0]
            out = i[:i.find('.pdf')+4]
            if out[-4:]== '.pdf':
                pdf_url_list.append(out)
            
    ## B - Address all non-Google search URLs
    else:
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.select("a[href$='.pdf']"):
            # Resolve relative hrefs against the page URL so that they are not skipped
            pdf_url_list.append(urljoin(url, link['href']))

#--------Resume after Google split
    if len(pdf_url_list) == 0:
        return ('No PDFs were able to be found on this site.')

    denom = len(pdf_url_list)
    print('Number of PDFs found at this site: ', denom)
    count = 1
    frames = []  # one DataFrame of hits per PDF that had any
    not_read = []

    for pdf_url in pdf_url_list:
        try:
            pdf_bytes = requests.get(pdf_url).content  # avoid shadowing the built-in open()
            pdf = io.BytesIO(pdf_bytes)
            txt = pdf_to_text(pdf)
            results = pull_keywords(keywords=keywords, text = txt, filename = pdf_url)
            if not results.empty:
                frames.append(results)
        except Exception:
            print("This file was not read; that could be because it's too big or because there is a login step. ", pdf_url)
            not_read.append(pdf_url)

        print(str(round((count/denom)*100,1))+ '% of PDFs have been processed.')
        count += 1
    # DataFrame.append was removed in pandas 2.0, so combine the per-PDF results with pd.concat
    if frames:
        out_df = pd.concat(frames, ignore_index=True)
    else:
        out_df = pd.DataFrame(columns=['Keyword','Reference','File'])
    out_df.to_csv(outfile_name, index=False)

    try:
        files.download(outfile_name)
    except ModuleNotFoundError:
        pass
    
    if len(not_read)>0:
        print('The program failed to read these PDFs:')
        for i in not_read:
            print(i)
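
Links on a page are often relative (for example `docs/file.pdf`) rather than full URLs. As a minimal sketch, this is how `urljoin`, imported above, resolves such links against the page URL; the `hrefs` values are hypothetical.

```python
from urllib.parse import urljoin

page_url = "https://www.copyright.gov/1201/2018/comments-031418/"

# Hypothetical href values as they might appear in a page's <a> tags
hrefs = [
    "class1/comment.pdf",                      # relative to the page
    "/1201/docs/overview.pdf",                 # relative to the site root
    "https://example.com/files/external.pdf",  # already absolute
]

resolved = [urljoin(page_url, href) for href in hrefs]
for pdf_url in resolved:
    print(pdf_url)
```

Absolute links pass through unchanged, so the same call can be applied to every link found.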

In [None]:
# Structuring input parameters a bit
if len(url) == 0:
    print('No URL provided; setting to test URL')
    url = 'https://www.copyright.gov/1201/2014/petitions/'

# Remove surrounding whitespace and any quotation marks pasted in around the URL
url = url.strip().strip("'\"")

if len(delimiter) == 0:
    delimiter = "|"

if len(Keywords) == 0:
    print('No keywords detected; setting to test keywords')
    Keywords = ['marginal cost', 'endogenous growth']
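elif not isinstance(Keywords, list):
    # Drop empty entries (e.g. from a trailing delimiter); an empty keyword would match everywhere
    Keywords = [k for k in Keywords.split(delimiter) if k.strip()]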
else:
    print('User input of Keywords accepted')

timestr = time.strftime("%Y%m%d-%H%M%S")
Output = 'url_pdf_index'+timestr+'.csv'
pull_pdfs(url, keywords = Keywords, outfile_name=Output)