In this notebook, we walk you through [potbot.py](potbot.py), a small Python script that loads the data from [the Atlas of Roman Pottery](http://potsherd.net) and writes the results to a file. Then, using a regular expression, we'll tidy that txt file into a list of URLs linking directly to the images. This in turn will be fed to a `wget` command that will download the images.

In [1]:
# first we load the modules we'll need
from bs4 import BeautifulSoup  # for parsing html
import csv  # for writing to csv
import requests # for loading in html pages on the web
import re # for regular expression searches

In [2]:
# then we download our target url and pass it to the beautiful soup module, which knows how to parse html

r = requests.get('http://potsherd.net/atlas/Ware/')
soup = BeautifulSoup(r.content, "html5lib")

In [3]:
# Next, we tell the program to create a new txt file for us to use as our output container:

f = csv.writer(open("urls.txt", "w"))


Then comes the magic. We tell python, via BeautifulSoup, to look in the html for a table (having studied the html of our target site, we know that the links to the individual ware pages are organized using html table tags) and then grab the information contained in the a tags.

In [4]:
trs = soup.find_all('tr')
for tr in trs:
    for link in tr.find_all('a', href=re.compile('^Ware')):
        fulllink = link.get('href')
        print(fulllink)  # print in terminal to verify results
        f.writerow(["http://potsherd.net/atlas/" + fulllink])

Ware/AHGW
Ware/AOMO
Ware/ARGO
Ware/B4
Ware/BB1
Ware/BB2
Ware/C189
Ware/CGBL
Ware/CGMW
Ware/CGCC
Ware/CGSF
Ware/CGGL
Ware/CGTS
Ware/COLC
Ware/COMO
Ware/COTS
Ware/CBMO
Ware/CRAM
Ware/EPON
Ware/DALES
Ware/DERBY
Ware/DR1
Ware/DR2-4
Ware/DR20
Ware/SALA
Ware/EWARE
Ware/EGTS
Ware/TNEG
Ware/EIMO
Ware/G12
Ware/GAUL
Ware/MARM
Ware/HARS
Ware/H70
Ware/ITMO
Ware/ITTS
Ware/HOFA
Ware/MAYN
Ware/LRGR
Ware/LIMO
Ware/L555
Ware/LEST
Ware/KOLN
Ware/LYON
Ware/MHMO
Ware/MRCA
Ware/NVCC
Ware/NVMO
Ware/NFMO
Ware/NFCC
Ware/NACA
Ware/NARS
Ware/NGGW
Ware/NKSH
Ware/OXRS
Ware/OXMO
Ware/PAS1
Ware/PRSW
Ware/PRW1
Ware/PRW2
Ware/PRW3
Ware/PORD
Ware/RHOD
Ware/RVMO
Ware/R527
Ware/RBBB1
Ware/RBMO
Ware/SAVG
Ware/SVW
Ware/SOMO
Ware/SDBB
Ware/SGTS
Ware/MOTS
Ware/SGCC
Ware/LRSH
Ware/SEGL
Ware/SPAN
Ware/TN
Ware/TR
Ware/LOND
Ware/MOSL
Ware/VRMO
Ware/VRW
Ware/WPMO
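
One thing to watch for: the file handle we passed to `csv.writer` is never closed, so Python may keep the rows in a buffer and `urls.txt` can look empty on disk for a while. A minimal sketch of one way around that, assuming you collect the links in a list first, is to write them inside a `with` block so the file is flushed and closed as soon as writing finishes. The `write_urls` helper and the `urls_sample.txt` filename are hypothetical, made up for this illustration; they are not part of potbot.py.

```python
import csv

BASE = "http://potsherd.net/atlas/"

def write_urls(hrefs, path="urls.txt"):
    """Write one full URL per line, closing the file when done."""
    # newline="" is what the csv documentation recommends for files it writes to
    with open(path, "w", newline="") as outfile:
        writer = csv.writer(outfile, lineterminator="\n")
        for href in hrefs:
            writer.writerow([BASE + href])
    return path

# sample hrefs copied from the first lines of the output above
sample = ["Ware/AHGW", "Ware/AOMO", "Ware/ARGO"]
write_urls(sample, "urls_sample.txt")

# read the file back to confirm the rows really are on disk
with open("urls_sample.txt") as check:
    print(check.read())
```

Because the `with` block closes the file, the rows are on disk before anything else (such as wget) tries to read them.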


Right-click on the Jupyter logo at the top left of this page, and open it in a new tab. See the `urls.txt` file you created? Have a look at it. It's a list of webpages, one for each ware. In the code block above we looked at the links in _every_ table row on the page, but we used a regular expression search to keep _only_ the ones that lead to the page of a particular ware.
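
To see what that filter does on its own, here is a small sketch that applies the same `^Ware` pattern to a handful of `href` values. The sample list is an assumption made for illustration (it mixes a few `Ware/` links with other paths that should be rejected), and it uses `re` directly rather than going through BeautifulSoup.

```python
import re

ware_link = re.compile('^Ware')  # same pattern as in the scraping cell

# illustrative href values; only the ones starting with "Ware" should survive
hrefs = ["Ware/AHGW", "Class/COAR", "Source/BRIT", "Ware/BB1", "../index.html"]

kept = [h for h in hrefs if ware_link.search(h)]
print(kept)
```

The `^` anchors the pattern to the start of the string, so a path that merely contains "Ware" somewhere later on would not be kept.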

And now we'll build a `wget` command that is respectful of the Atlas' bandwidth and server resources. Our command tells wget to read its URLs from the input file `urls.txt` (`-i`), to wait two seconds between requests (`-w 2`), to limit the download rate to 200 KB/s (`--limit-rate=200k`), and to keep only image files (`-A jpeg,jpg,bmp,gif,png`).
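
If it helps to see the pieces of that command separately, the sketch below spells them out as a Python list and prints the resulting command line. It only builds and prints the command; it does not run wget. The `build_wget_command` helper is hypothetical, written here for illustration.

```python
import shlex

def build_wget_command(url_file="urls.txt", wait_seconds=2, rate="200k"):
    """Assemble the wget call described above as a list of arguments."""
    return [
        "wget",
        "-w", str(wait_seconds),        # pause between requests, in seconds
        f"--limit-rate={rate}",         # cap the download speed
        "-A", "jpeg,jpg,bmp,gif,png",   # accept only these file suffixes
        "-i", url_file,                 # read the URLs from this file
    ]

cmd = build_wget_command()
print(shlex.join(cmd))
```

Raising `wait_seconds` or lowering `rate` makes the download gentler on the server at the cost of taking longer.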

Now, wget crawls the directory structure.

When you run this command, you can stop it early by hitting the square stop button in the command ribbon at the top of the page.



In [19]:
!wget -w 2 --limit-rate=200k -A jpeg,jpg,bmp,gif,png -i urls.txt

--2018-06-19 13:59:31--  http://potsherd.net/atlas/Ware/AHGW
Resolving potsherd.net (potsherd.net)... 79.170.44.157
Connecting to potsherd.net (potsherd.net)|79.170.44.157|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘AHGW.tmp’

AHGW.tmp                [   <=>              ]  16.92K  22.0KB/s    in 0.8s    

2018-06-19 13:59:33 (22.0 KB/s) - ‘AHGW.tmp’ saved [17330]

--2018-06-19 13:59:35--  http://potsherd.net/atlas/Class/COAR
Reusing existing connection to potsherd.net:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘COAR.tmp’

COAR.tmp                [   <=>              ]  18.17K  23.7KB/s    in 0.8s    

2018-06-19 13:59:36 (23.7 KB/s) - ‘COAR.tmp’ saved [18605]

--2018-06-19 13:59:38--  http://potsherd.net/atlas/Source/BRIT
Reusing existing connection to potsherd.net:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘BRIT.tmp’

B