# Web Scraping with XPath and Python Workshop
We will use XPath Helper in Google Chrome to select links from a webpage, then use those links to download files from that webpage.

First, we will discuss XPath and XPath Helper. Go to the [XPath tutorial here](https://github.com/kaylaabner/WebScrapingWorkshop/blob/main/XPath_Tutorial.md).

You need to add [XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl) to your Chromium-based browser (Google Chrome, Brave).

Next, we will use a few Python libraries to pull content from a website, using our knowledge of XPath to select exactly what we want on the page.

In [None]:
import requests # to interact with websites through Python
from lxml import html # to use XPath in Python
import pandas as pd # a data science package for handling structured data
import random
import time
import os
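
Before touching a live site, it may help to see how `lxml` applies an XPath expression to HTML. The snippet below is a minimal sketch that runs offline; the HTML string and the `example.org` URLs are made up for illustration and do not come from the workshop site.

```python
from lxml import html

# A made-up HTML snippet standing in for a real page (an assumption for illustration).
sample_html = """
<html><body>
  <a class="extlink" href="https://example.org/files/a.pdf">File A</a>
  <a class="other" href="https://example.org/about">About</a>
  <a class="extlink" href="https://example.org/files/b.pdf">File B</a>
</body></html>
"""

# Parse the HTML into a tree we can query
sample_tree = html.fromstring(sample_html)

# //a selects every <a> element; [@class='extlink'] keeps only those whose
# class attribute equals 'extlink'; /@href returns the href attribute value.
sample_links = sample_tree.xpath("//a[@class='extlink']/@href")
print(sample_links)
```

The same expression you test in XPath Helper can be passed to `.xpath()` in this way.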

In [None]:
# we will use the requests library. requests is a good entry to web scraping.
# we will practice downloading a single file from the library's digital collections. 

r = requests.get('https://udspace.udel.edu/bitstream/handle/19716/5974/mss0109_0001-00.pdf') 

In [None]:
print(r.text) # the response decoded as text. for an HTML page this is the page's HTML; for a PDF it will look garbled

In [None]:
print(r.content) #to retrieve the content in bytes, used for downloading files

In [None]:
#use r.content to tell Python you want the file itself, and not the HTML from the page.
#where this code says 'kabner', CHANGE IT to your NetID or name.

os.makedirs('kabner', exist_ok=True) # create your own directory so we don't overwrite each other's files. exist_ok=True lets you rerun this cell without an error.

with open('kabner/30406.pdf', 'wb') as f:
    f.write(r.content)
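
If a download goes wrong, a server may return an HTML error page instead of the file, and the code above would save it under a `.pdf` name anyway. One possible safeguard, sketched here with a hypothetical helper called `looks_like_pdf`, is to check that the bytes begin with the `%PDF-` marker that PDF files start with. The byte strings below are made up for illustration.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker."""
    return data.startswith(b"%PDF-")

# Made-up byte strings standing in for r.content
print(looks_like_pdf(b"%PDF-1.7 rest of the file"))   # a PDF-like header
print(looks_like_pdf(b"<!DOCTYPE html><html></html>"))  # an HTML page
```

In the cell above, you could call `looks_like_pdf(r.content)` before writing the file. The `requests` library also provides `r.raise_for_status()`, which raises an exception when the server responds with an error status code.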

In [None]:
# now let's try to download a few PDFs from this collection.
# let's make a list of a few URLs we want to pull from. note that the URL is the exact location of the file itself. 
urls = ['https://udspace.udel.edu/bitstream/handle/19716/5974/mss0109_0001-00.pdf', 'https://udspace.udel.edu/bitstream/handle/19716/5975/mss0109_0002-00.pdf', 'https://udspace.udel.edu/bitstream/handle/19716/5976/mss0109_0003-00.pdf', 'https://udspace.udel.edu/bitstream/handle/19716/5977/mss0109_0004-00.pdf', 'https://udspace.udel.edu/bitstream/handle/19716/5978/mss0109_0005-00.pdf', 'https://udspace.udel.edu/bitstream/handle/19716/5979/mss0109_0006-00.pdf']

In [None]:
#now, we will create files named after the last 19 characters of the url so we can tell them apart.
# CHANGE where it says 'kabner' to the name of your directory. 

for link in urls: # loop over list of URLs
    r = requests.get(link) # tell requests to visit the site
    filename = link[-19:] # the filename is the last 19 chars of the URL
    print(filename) # print the filename so we know it's working
    with open('kabner/' + filename, 'wb') as f: # open a file named after the last 19 chars of the URL
        f.write(r.content) # write the content of the page to the file, in this case, a PDF
    time.sleep(5) # give the server a break once the file is closed. also keeps you from getting booted on certain sites.
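
Slicing the last 19 characters works because every filename in this collection has the same length. For URLs whose filenames vary in length, one alternative is to take everything after the final slash in the URL's path. The helper name `filename_from_url` below is hypothetical, and this is only a sketch using the standard library.

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url: str) -> str:
    """Return the last segment of the URL's path, ignoring any query string."""
    path = urlparse(url).path          # e.g. '/bitstream/handle/19716/5974/mss0109_0001-00.pdf'
    return posixpath.basename(path)    # the part after the final '/'

url = 'https://udspace.udel.edu/bitstream/handle/19716/5974/mss0109_0001-00.pdf'
print(filename_from_url(url))
```

For the URLs in this workshop the result matches `link[-19:]`, so either approach can be used in the loop.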

## Using XPath Helper to Select Links

Go back to the [finding aid for the collection](https://library.udel.edu/special/findaids/view?docId=ead/mss0109.xml;tab=content). Let's use XPath Helper to figure out how to select the links on this page, so we can loop over them in Python, and download all the PDFs. 

In [None]:
# now we will use our knowledge of XPath to select specific elements on the webpage.
# I want to select a list of links so we can loop over them to download the PDFs. 
 
# Request the page
page = requests.get('https://library.udel.edu/special/findaids/view?docId=ead/mss0109.xml;tab=content')
 
# Parsing the page
# (We need to use page.content rather than
# page.text because html.fromstring implicitly
# expects bytes as input.)
tree = html.fromstring(page.content) # this is from our lxml package. 
 
# Get element using XPath
links = tree.xpath("//a[@class='extlink']/@href") 
print(type(links)) # a notebook only displays the last expression in a cell, so we print the type here

working_links = links[:10] #we just want to select some of the links so as to not overwhelm the server.
working_links
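
An `@href` value is returned exactly as it is written in the page, so it may be a relative path rather than a full URL, and `requests.get` needs a full URL. Whether that applies to this finding aid is not something this notebook confirms, so treat the following as a sketch of one way to handle it with `urljoin` from the standard library. The relative path in `hrefs` is made up for illustration.

```python
from urllib.parse import urljoin

# The page the links were scraped from
base_url = 'https://library.udel.edu/special/findaids/view?docId=ead/mss0109.xml;tab=content'

# One made-up relative link and one absolute link from earlier in the workshop
hrefs = [
    '/files/example-0001.pdf',
    'https://udspace.udel.edu/bitstream/handle/19716/5974/mss0109_0001-00.pdf',
]

# urljoin resolves relative links against the base and leaves absolute links unchanged
absolute_links = [urljoin(base_url, href) for href in hrefs]
print(absolute_links)
```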

In [None]:
#to use the list of links to retrieve PDFs
# remember to CHANGE 'kabner' to the name of your directory. 

for link in working_links:
    r = requests.get(link)
    filename = link[-19:] # last 19 chars of the URL
    print(filename)
    with open('kabner/' + filename, 'wb') as f:
        f.write(r.content)
    time.sleep(5) # pause between requests, after the file is closed
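
The loops in this notebook pause for a fixed number of seconds. A common variation is to pause for a random interval so requests are not perfectly regular. The sketch below uses the `random` module imported at the top of the notebook; the helper name `polite_delay` and the 3 to 7 second range are assumptions chosen for illustration.

```python
import random

def polite_delay(min_seconds: float = 3.0, max_seconds: float = 7.0) -> float:
    """Pick a random pause length between min_seconds and max_seconds."""
    return random.uniform(min_seconds, max_seconds)

delay = polite_delay()
print(f"would sleep for {delay:.1f} seconds")
```

Inside a download loop, `time.sleep(polite_delay())` could take the place of `time.sleep(5)`.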

## Next Steps: Reading in a List of URLs

These instructions will allow you to create a text file of URLs using XPath Helper, and read that file in as a list so you can loop over it. This is a good option if you're having trouble parsing the HTML directly from the page. You can use XPath Helper to select all the links on a page, and just copy/paste them into a text file, then read them into Python.

In [None]:
# to read in a text file of URLs as a list so we can loop over it
with open('path/to/urls.txt', 'r') as urls2: # 'with' closes the file for us when we're done
    urls3 = urls2.readlines() # one list item per line of the file
urls3

In [None]:
# my text file has a newline at the end of each URL, which would otherwise become part of the URL we request.
# remove the newlines (and any other surrounding whitespace) with this list comprehension.

clean = [link.strip() for link in urls3]
print(clean) 
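
Reading and cleaning can also be combined into a single step that skips blank lines, which often appear when URLs are pasted into a text file. This is a sketch only: the function name `read_url_list` is hypothetical, and the demo writes made-up `example.org` URLs to a temporary file so it can run without your own `urls.txt`.

```python
import os
import tempfile

def read_url_list(path):
    """Read one URL per line, stripping whitespace and skipping blank lines."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

# Demo: write a small made-up file, then read it back
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'urls.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('https://example.org/a.pdf\n\nhttps://example.org/b.pdf\n')
    demo_urls = read_url_list(demo_path)

print(demo_urls)
```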

In [None]:
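for link in clean:
    r = requests.get(link)
    filename = link[-19:] # last 19 chars of the URL
    print(filename)
    with open(filename, 'wb') as f: # saves into the current working directory
        f.write(r.content)
    time.sleep(15) # pause between requests, after the file is closed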

## Next Steps: Creating a CSV from Wine Spectator Data

Here, we scrape data from the website with requests and lxml, then use pandas (a Python library for data curation and analysis) to combine it into a table and write it to a CSV file.

In [None]:
#winespectator data to csv

wine_page = requests.get('https://top100.winespectator.com/lists/')
tree = html.fromstring(wine_page.content)
 
# Get data from elements using XPath
winery = tree.xpath("//span[@class = 'sort-text']/text()") 
vintage = tree.xpath("//td[@class = 'vintage']/text()")
score = tree.xpath("//td[@class = 'score']/text()")

dataset = pd.DataFrame(list(zip(winery, vintage, score))) #combine our lists of data into a pandas dataframe
dataset.to_csv('output.csv', sep=',', header=['Winery', 'Vintage', 'Score'], index=False)       
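
One thing to watch for with `zip`: it stops at the shortest list, so if one XPath expression matches fewer elements than the others, rows are dropped silently and the columns can fall out of alignment. A minimal sketch of a guard is shown below. The helper name `rows_from_columns` is hypothetical, and the sample values are placeholders, not real Wine Spectator data.

```python
def rows_from_columns(**columns):
    """Zip named columns into rows, raising an error if their lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return list(zip(*columns.values()))

# Placeholder values standing in for the scraped lists
rows = rows_from_columns(
    winery=['Winery A', 'Winery B'],
    vintage=['2018', '2019'],
    score=['95', '94'],
)
print(rows)
```

With the scraped lists, `pd.DataFrame(rows_from_columns(winery=winery, vintage=vintage, score=score))` would build the same dataframe as above, but fail loudly if the page structure produced uneven columns.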