# Web Scraping with XPath and Python workshop
We will use XPath Helper in Google Chrome to select links from a webpage, and then use those links in Python to download the files they point to.

First, we will discuss XPath and XPath Helper. Follow this link: https://github.com/kaylaabner/WebScrapingWorkshop/blob/main/XPath_Tutorial.md.

You need to add XPath Helper to your Chromium-based browser (Google Chrome, Brave): https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl

In [27]:
import requests
import random
import time
from lxml import html

In [12]:
r = requests.get('https://library.udel.edu/special/findaids/view?docId=ead/mss0109.xml') 

In [14]:
print(r.text)  # the HTML of the page, as a decoded string

<!DOCTYPE html>
<html lang="en">
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <meta charset="UTF-8">
      <meta name="viewport" content="width=device-width">
      <title>George S. Messersmith papers | Manuscript and Archival Collection Finding Aids</title>
      <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">
      <link rel="stylesheet" href="css/udlibrary/style.css" type="text/css">
      <link rel="shortcut icon" href="icons/udlibrary/favicon.ico"><script type="text/javascript" src="script/udlibrary/sitefunctions.js"></script><script>
           window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
           ga('create', 'UA-3642042-3', 'auto');
           ga('send', 'pageview');
        </script><script async="async" src="https://www.google-analytics.com/analytics.js"></script><script src="https://code.jquery.com/jquery-1.10.2.js"></script><script>
       

In [13]:
print(r.content)  # the raw response as bytes; this is what gets written to disk when downloading files

b'<!DOCTYPE html>\n<html lang="en">\n   <head>\n      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n      <meta charset="UTF-8">\n      <meta name="viewport" content="width=device-width">\n      <title>George S. Messersmith papers | Manuscript and Archival Collection Finding Aids</title>\n      <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">\n      <link rel="stylesheet" href="css/udlibrary/style.css" type="text/css">\n      <link rel="shortcut icon" href="icons/udlibrary/favicon.ico"><script type="text/javascript" src="script/udlibrary/sitefunctions.js"></script><script>\n           window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;\n           ga(\'create\', \'UA-3642042-3\', \'auto\');\n           ga(\'send\', \'pageview\');\n        </script><script async="async" src="https://www.google-analytics.com/analytics.js"></script><script src="https://code.jquery.com/jquery-1.10.2.js">
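
The relationship between `r.text` and `r.content` can be illustrated without a network request. The snippet below is a minimal sketch: `raw` is a made-up stand-in for `r.content`, and decoding it by hand approximates what `requests` does to produce `r.text`, assuming the page is UTF-8 as the `charset` declaration in the output above states.

```python
# `raw` is a hypothetical stand-in for r.content (bytes as sent by the server).
raw = '<title>Café papers</title>'.encode('utf-8')

# Decoding the bytes gives a str, comparable to r.text.
decoded = raw.decode('utf-8')

print(type(raw).__name__, len(raw))          # length counts bytes
print(type(decoded).__name__, len(decoded))  # length counts characters
```

The two lengths differ because the accented character takes two bytes in UTF-8. Files such as PDFs are not text at all, which is why downloads are written from `r.content`.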

In [15]:
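# To save a file, first request the URL of the file itself, then write the
# response bytes to disk in binary mode ('wb').
# Note: r must hold the response for the file; if r is still the HTML page
# requested above, the saved "PDF" will contain HTML.

with open('30406.pdf', 'wb') as f:
    f.write(r.content)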
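
One way to guard against saving a web page under a `.pdf` name is to check the first bytes of the response before writing. The sketch below relies on the fact that PDF files begin with `%PDF-`; the helper name `save_if_pdf` and the sample bytes are made up for illustration and are not part of the workshop code.

```python
import os
import tempfile

def save_if_pdf(content, filename):
    """Write content to filename only if it starts with the PDF signature."""
    if not content.startswith(b'%PDF-'):
        return False
    with open(filename, 'wb') as f:
        f.write(content)
    return True

# Demonstration with made-up bytes instead of a real response.
folder = tempfile.mkdtemp()
html_bytes = b'<!DOCTYPE html><html></html>'
pdf_bytes = b'%PDF-1.4 (placeholder, not a complete PDF)'

print(save_if_pdf(html_bytes, os.path.join(folder, 'page.pdf')))  # False
print(save_if_pdf(pdf_bytes, os.path.join(folder, 'real.pdf')))   # True
```

In the notebook this would be called as `save_if_pdf(r.content, '30406.pdf')` after requesting the file's URL.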

In [None]:
# list of urls

urls = [
    'https://udspace.udel.edu/bitstream/handle/19716/5974/mss0109_0001-00.pdf',
    'https://udspace.udel.edu/bitstream/handle/19716/5975/mss0109_0002-00.pdf',
    'https://udspace.udel.edu/bitstream/handle/19716/5976/mss0109_0003-00.pdf',
    'https://udspace.udel.edu/bitstream/handle/19716/5977/mss0109_0004-00.pdf',
    'https://udspace.udel.edu/bitstream/handle/19716/5978/mss0109_0005-00.pdf',
    'https://udspace.udel.edu/bitstream/handle/19716/5979/mss0109_0006-00.pdf',
]

In [None]:
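# Save each file under a name taken from the last 19 characters of its URL
# (for these links, that is the file name, e.g. mss0109_0001-00.pdf)

for link in urls:
    r = requests.get(link)
    filename = link[-19:]  # already a string, so no str() call is needed
    print(filename)
    with open(filename, 'wb') as f:
        f.write(r.content)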
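
Slicing the last 19 characters works because every file name in this collection has the same length. As a hedged alternative for links whose names vary in length, the sketch below takes the last segment of the URL path using only the standard library; `filename_from_url` is a hypothetical helper name, not part of the workshop code.

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)

url = 'https://udspace.udel.edu/bitstream/handle/19716/5974/mss0109_0001-00.pdf'
print(filename_from_url(url))  # mss0109_0001-00.pdf
print(url[-19:])               # same result for this particular URL
```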

In [None]:
# Open a text file of URLs (one URL per line)
urls2 = open('C:\\Users\\kabner\\OneDrive - University of Delaware - o365\\Documents\\Collections as Data\\Messersmith\\messersmith_links.txt', 'r')

In [None]:
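# Read the file in as a list of URLs (one string per line, newline included)
urls3 = urls2.readlines()
urls2.close()  # the file is no longer needed once its lines are in the list
urls3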

In [None]:
#remove newlines at end of links from text file

clean = [link.strip() for link in urls3]
print(clean) 
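
The open, read, and strip steps can also be combined into one function. This is a minimal sketch rather than the workshop's method: `read_links` is a hypothetical name, the `with` block closes the file automatically, blank lines are skipped, and the demonstration uses a temporary file with made-up links so that it runs anywhere.

```python
import os
import tempfile

def read_links(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

# Demonstration with a temporary file containing made-up links.
demo_path = os.path.join(tempfile.mkdtemp(), 'links.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('https://example.org/one.pdf\n\nhttps://example.org/two.pdf\n')

print(read_links(demo_path))
```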

In [None]:
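# Download each file, pausing between requests to avoid overloading the server

for link in clean:
    r = requests.get(link)
    filename = link[-19:]
    print(filename)
    with open(filename, 'wb') as f:
        f.write(r.content)
    time.sleep(15)  # wait 15 seconds once the file has been written and closed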
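
The notebook imports `random` but the loop above always waits a fixed 15 seconds. One possible use of `random`, shown here as a sketch under that assumption, is to vary the pause between requests. The helper name `polite_pause` and its default bounds are made up for illustration.

```python
import random
import time

def polite_pause(low=10.0, high=20.0):
    """Sleep for a random number of seconds between low and high; return it."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Tiny bounds so the demonstration finishes quickly.
waited = polite_pause(0.0, 0.05)
print(f'waited {waited:.3f} seconds')
```

In the download loop, `polite_pause()` would take the place of `time.sleep(15)`.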

In [25]:
#using lxml to use xpath in Python
 
# Request the page
page = requests.get('https://library.udel.edu/special/findaids/view?docId=ead/mss0109.xml;tab=content')
 
# Parse the page
# (We pass page.content rather than page.text so that
# lxml receives the raw bytes and can work out the
# character encoding itself.)
tree = html.fromstring(page.content)
 
# Get element using XPath
links = tree.xpath("//a[@class='extlink']/@href")
type(links)

working_links = links[:10]  # keep only the first 10 links
working_links

['https://udspace.udel.edu/bitstream/handle/19716/5974/mss0109_0001-00.pdf',
 'https://udspace.udel.edu/bitstream/handle/19716/5975/mss0109_0002-00.pdf',
 'https://udspace.udel.edu/bitstream/handle/19716/5976/mss0109_0003-00.pdf',
 'https://udspace.udel.edu/bitstream/handle/19716/5977/mss0109_0004-00.pdf',
 'https://udspace.udel.edu/bitstream/handle/19716/5978/mss0109_0005-00.pdf',
 'https://udspace.udel.edu/bitstream/handle/19716/5979/mss0109_0006-00.pdf',
 'https://udspace.udel.edu/bitstream/handle/19716/5980/mss0109_0007-00.pdf',
 'https://udspace.udel.edu/bitstream/handle/19716/5981/mss0109_0008-00.pdf',
 'https://udspace.udel.edu/bitstream/handle/19716/5982/mss0109_0009-00.pdf',
 'https://udspace.udel.edu/bitstream/handle/19716/5983/mss0109_0010-00.pdf']
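
To see what the XPath expression selects without requesting the live page, the same query can be run on a small inline document. The HTML below is invented for illustration (the `example.org` links and the `other` class are assumptions); only the expression `//a[@class='extlink']/@href` comes from the cell above.

```python
from lxml import html

# A made-up page fragment with two 'extlink' anchors and one other anchor.
snippet = b"""
<html><body>
  <a class="extlink" href="https://example.org/files/a.pdf">A</a>
  <a class="other" href="https://example.org/page.html">Page</a>
  <a class="extlink" href="https://example.org/files/b.pdf">B</a>
</body></html>
"""

tree = html.fromstring(snippet)

# /@href returns the attribute values; /text() returns the link text.
links = tree.xpath("//a[@class='extlink']/@href")
texts = tree.xpath("//a[@class='extlink']/text()")

print(links)
print(texts)
```

Only the anchors whose `class` attribute is exactly `extlink` are matched, so the link to `page.html` is left out.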

In [26]:
# Use the list of links to retrieve the PDFs

for link in working_links:
    r = requests.get(link)
    filename = link[-19:]
    print(filename)
    with open(filename, 'wb') as f:
        f.write(r.content)

mss0109_0001-00.pdf
mss0109_0002-00.pdf
mss0109_0003-00.pdf
mss0109_0004-00.pdf
mss0109_0005-00.pdf
mss0109_0006-00.pdf
mss0109_0007-00.pdf
mss0109_0008-00.pdf
mss0109_0009-00.pdf
mss0109_0010-00.pdf
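
When a download loop is run more than once, it may be worth skipping files that are already on disk. The following is a sketch under that assumption: `not_yet_downloaded` is a hypothetical helper that reuses the last-19-characters naming from the loop above, and the demonstration runs in a temporary folder with an empty placeholder file rather than a real download.

```python
import os
import tempfile

def not_yet_downloaded(links, folder='.'):
    """Return the links whose file name (last 19 characters) is not in folder."""
    return [link for link in links
            if not os.path.exists(os.path.join(folder, link[-19:]))]

# Demonstration in a temporary folder with one file already "downloaded".
folder = tempfile.mkdtemp()
demo_links = [
    'https://udspace.udel.edu/bitstream/handle/19716/5974/mss0109_0001-00.pdf',
    'https://udspace.udel.edu/bitstream/handle/19716/5975/mss0109_0002-00.pdf',
]
open(os.path.join(folder, 'mss0109_0001-00.pdf'), 'wb').close()

print(not_yet_downloaded(demo_links, folder))
```

Looping over `not_yet_downloaded(working_links)` instead of `working_links` would then request only the missing files.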
