# web scraping part 2

In this second part we try to download a list of PDF from web page of Procter & Gamble 
virtual patent marking

https://www.pg.com/patents/brands.shtml

If we inspect the page near a PDF icon we find the following html:

<a href="/patents/pdf/Scope.pdf" target="_blank"> <span style="float:right"><img src="https://res.cloudinary.com/mtree/image/upload/v1/PG.com/patents/images/file_extension_pdf.png" alt=""></span></a>


In this project module  PyPDF2 should be installed



In [None]:
import requests
#import urllib.request
import time
from bs4 import BeautifulSoup

url = "https://www.pg.com/patents/brands.shtml"
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.find_all('a'):
    print(link)


##### In reality we need only the PDF link; 

so we check href content

We use function get() that return values of html attributes, as href


In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

##### Next step: 

select only .pdf files and add url root

(note all pdf URLs are relative not absolute link)

In [None]:

archive_url= "https://www.pg.com"

pdf_links = [archive_url + link['href'] for link in soup.find_all('a') if link['href'].endswith('pdf')] 

pdf_links


##### Files download:

In this step we iterate the list pdf_links

to download all the files


In [None]:

# this could also be called as a function
# like def download_pdf(pdf_links)

for link in pdf_links: 
  
        '''iterate through all links in and download them one by one'''
    
        # this is to get relative path only links
        if link[:4]!='http':
            link = archive_url + link
          
        # obtain filename by splitting url and getting  
        # last string 
        file_name = link.split('/')[-1]    
  
        print ("Downloading file:%s"%file_name) 
          
        # create response object 
        r = requests.get(link, stream = True) 
          
        # download started 
        with open(file_name, 'wb') as f: 
            for chunk in r.iter_content(chunk_size = 1024*1024): 
                if chunk: 
                    f.write(chunk) 
          
        print ("%s downloaded!\n"%file_name) 
  
print ("All file downloaded!")




Now we have a batch of PDF files containing lists of patent numbers and,

in the best cases, a product name. Still they are PDFs, thus of very small utility.

It is possible to use PyPDF2 to transform them into text




In [None]:


import PyPDF2
import os

pdfDir = ""
txtDir = ""


if pdfDir == "": pdfDir = os.getcwd() + "\\" #if no pdfDir passed in 
if txtDir == "": txtDir = os.getcwd() + "\\" #if no txtDir passed in     
    
    
for pdf_to_read in os.listdir(pdfDir): #iterate through pdfs in pdf directory
        fileExtension = pdf_to_read.split(".")[-1]   # -1 takes always last part
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf_to_read 
            textFilename = txtDir + pdf_to_read + ".txt"
            textFile = open(textFilename, "w") #make text file
            
            pdf = PyPDF2.PdfFileReader(open(pdfFilename, "rb"))
            for page in pdf.pages:
                textFile.write(page.extractText()) #write text to text file
                
            textFile.close()
   


NOTE: most of the described PDF have tables in it;

when converted into TXT still are not handy data.

So we need a solution for creating a dataframe from those data:

#### Easier PDF tables reading:

Tabula is a module allowing to load pdf tables as JSON or even Pandas DFs

pip install tabula-py

data= read_pdf(pdfFilename,output_format=(default DF) ) 


NOTE: pip install tabula  will install tabula-1.0.5 and return error 
when importin read_pdf

In [3]:
from tabula import read_pdf
import pandas as pd
import os

pdfDir = ""
txtDir = ""

if pdfDir == "": pdfDir = os.getcwd() + "\\"  # if no pdfDir passed in
if txtDir == "": txtDir = os.getcwd() + "\\"  # if no txtDir passed in

for pdf_to_read in os.listdir(pdfDir):  # iterate through pdfs in pdf directory
        fileExtension = pdf_to_read.split(".")[-1]  # -1 takes always last part
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf_to_read
            textFilename = txtDir + pdf_to_read + ".txt"
            textFile = open(textFilename, "w")  # make text file

            df= read_pdf(pdfFilename)    # loads dataframe with data
            
            try: 
                df.to_csv(textFilename)      # writes to csv
            except:    
                print(pdf_to_read + ' has a problem!')
                
            textFile.close()


Braun_eff_9_May_2017.pdf has a problem!
Charmin.pdf has a problem!
Cheer_eff_2018_July_11.pdf has a problem!
Dreft_eff_2018_July_11.pdf has a problem!
Era_eff_2018_July_11.pdf has a problem!
Ivory_Snow_eff_2018_July_11.pdf has a problem!
Scope.pdf has a problem!


Other scraping libraries for more advanced use

### Selenium
Selenium Python bindings provides a simple API to write functional/acceptance tests using Selenium WebDriver.

### scrapy

