Objective: To scrape emigrant letters from the Irish Emigrant Database 

Source: http://www.dippam.ac.uk/ied/

Search Criteria

Letters (Emigrant)
Use Dates: Toggle On
Exact Match: Toggle Off
Start Date: 1 Jan 1750
End Date: 31 Dec 1913

Results: 3,078 results, 31-to-a-page, 94 pages. Each result links to a record that contains the letter text along with metadata. 

Steps:

1. Get results
2. Extract list of URLs
3. For one URL (example: http://www.dippam.ac.uk/ied/records/21120)
    1. Read HTML
    2. Extract text inside the pre class="transcript" tag
    3. Save to text file
    4. Extract info inside the table id="metadata" tag
    5. Append to csv file
4. Do 3A - 3E (substeps 1-5 of step 3) for all URLs


## 1. Get results 

Code in cells 1-5 adapted from: https://towardsdatascience.com/data-science-skills-web-scraping-javascript-using-python-97a29738353f

In [1]:
# import libraries
import ssl
import urllib.request

import requests
from bs4 import BeautifulSoup

In [2]:
# use the search form to generate the desired URL
# modify the URL so that all 3078 results appear on one page
# See "20230328_GetData.png" for how to find search results path
urlpage = 'https://www.dippam.ac.uk/ied/results?search%5Bper_page%5D=3078&search%5Bpage%5D=1&search%5Btotal_pages%5D=1&search%5Bview%5D=list&search%5Bqclean%5D=&search%5Bq%5D=%22%22&search%5Bdateenabled%5D=on&search%5Bstart%5D%5Bd%5D=1&search%5Bstart%5D%5Bm%5D=01&search%5Bstart%5D%5By%5D=1750&search%5Bend%5D%5Bd%5D=31&search%5Bend%5D%5Bm%5D=12&search%5Bend%5D%5By%5D=1913&search%5Bcat%5D%5B24%5D=on&search%5Bsort%5D=timestamp&search%5Bsort_dir%5D=asc'
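
The long query string above can also be assembled from a dictionary, which makes the search criteria easier to read and change. This is a minimal sketch, assuming the parameter names are exactly those visible in the URL above; `build_results_url` is a hypothetical helper, not part of the site's API.

```python
from urllib.parse import urlencode

BASE = "https://www.dippam.ac.uk/ied/results"

def build_results_url(per_page=3078):
    """Assemble the search URL from readable key/value pairs."""
    params = {
        "search[per_page]": per_page,   # all results on one page
        "search[page]": 1,
        "search[total_pages]": 1,
        "search[view]": "list",
        "search[qclean]": "",
        "search[q]": '""',              # empty quoted query
        "search[dateenabled]": "on",    # "Use Dates" toggle
        "search[start][d]": 1,
        "search[start][m]": "01",
        "search[start][y]": 1750,
        "search[end][d]": 31,
        "search[end][m]": 12,
        "search[end][y]": 1913,
        "search[cat][24]": "on",        # presumably the "Letters (Emigrant)" category (assumption)
        "search[sort]": "timestamp",
        "search[sort_dir]": "asc",
    }
    # urlencode percent-encodes the brackets and quotes
    return BASE + "?" + urlencode(params)

urlpage_built = build_results_url()
```

Changing `per_page` or the date values then only requires editing one entry rather than the encoded string.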

In [3]:
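# disable HTTPS certificate verification for this session
# (presumably a workaround for certificate errors; this is insecure, so use it only with a trusted site)
ssl._create_default_https_context = ssl._create_unverified_context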

In [4]:
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
print(page)

<http.client.HTTPResponse object at 0x7fceeda9c130>


In [5]:
#parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')

## 2. Extract list of URLs

In [6]:
links = soup.find_all("a")
urls = []
for link in links:
    url = link.get('href')
    urls.append(url)
print(len(urls))
urls.sort()
urls[0:5]

3059


['https://www.dippam.ac.uk/ied/records/20481',
 'https://www.dippam.ac.uk/ied/records/20487',
 'https://www.dippam.ac.uk/ied/records/20514',
 'https://www.dippam.ac.uk/ied/records/20519',
 'https://www.dippam.ac.uk/ied/records/20522']

*To do: I am missing URLs for 19 records. Crosscheck URLs with metadata to identify which ones are missing.*
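
One possible starting point for this to-do: `soup.find_all("a")` returns every link on the page, so the count of 3,059 may include links that are not records, or duplicates (an assumption that has not been checked here). The sketch below keeps only hrefs that match the record URL pattern seen in the output above, drops duplicates, and sorts by record number. `record_urls` is a hypothetical helper and the sample hrefs are made up for illustration.

```python
import re

# pattern taken from the record URLs printed above
RECORD_RE = re.compile(r"^https://www\.dippam\.ac\.uk/ied/records/(\d+)$")

def record_urls(hrefs):
    """Keep only record links, drop duplicates, sort by numeric record id."""
    found = {}
    for href in hrefs:
        if not href:                 # <a> tags without an href yield None
            continue
        match = RECORD_RE.match(href)
        if match:
            found[int(match.group(1))] = href
    return [found[key] for key in sorted(found)]

# made-up sample: a navigation link, a missing href, and a duplicate
sample_hrefs = [
    "https://www.dippam.ac.uk/ied/",
    "https://www.dippam.ac.uk/ied/records/20487",
    None,
    "https://www.dippam.ac.uk/ied/records/20481",
    "https://www.dippam.ac.uk/ied/records/20487",
]
filtered = record_urls(sample_hrefs)
```

In the notebook this could be called as `record_urls(link.get("href") for link in soup.find_all("a"))`, and the length of the result compared with the 3,078 results reported by the site.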

## 3. For one URL...

### A. Read HTML

In [7]:
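# work on a single record first; `url` would otherwise still hold the last link from the loop in cell 6
url = urls[0]
record = urllib.request.urlopen(url)
recordSoup = BeautifulSoup(record, 'html.parser')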

### B. Extract text inside the pre class="transcript" tag

In [8]:
transcript = recordSoup.find("pre", {"class": "transcript"})
text = transcript.contents[0]

### C. Save to text file

In [9]:
# characters 37 onward are the record number (the part after ".../ied/records/")
filename = url[37:]
with open(filename + '.txt', 'w') as f:
    f.write(text)
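
The slice `url[37:]` works only because every record URL shares the same 37-character prefix. A slightly more robust sketch takes whatever follows the last slash instead. `record_id` and `save_transcript` are hypothetical helper names, and UTF-8 is an assumed output encoding.

```python
import os
import tempfile

def record_id(url):
    """Return the part of the URL after the final slash, e.g. '20481'."""
    return url.rstrip("/").rsplit("/", 1)[-1]

def save_transcript(url, text, out_dir="."):
    """Write the transcript to <out_dir>/<record id>.txt and return the path."""
    path = os.path.join(out_dir, record_id(url) + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path

# demonstration with a made-up transcript in a temporary directory
example_url = "https://www.dippam.ac.uk/ied/records/20481"
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_transcript(example_url, "Dear Brother", out_dir=tmp)
    saved_name = os.path.basename(saved_path)
    with open(saved_path, encoding="utf-8") as f:
        saved_text = f.read()
```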

### D. Extract metadata (that is, info inside td tags that have no class)

In [10]:
tableData = recordSoup.find_all("td", {"class": ""})

In [11]:
csv_row = []
for item in tableData:
    csv_row.append(item.get_text(strip=True))

### E. Append data to csv file

In [12]:
import csv

In [13]:
with open("20230514_AM_ied.csv", "w+", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(csv_row)

## 4. Do 3A - 3E for all URLs

In [14]:
for url in urls:
    
    print(url)
    
    try:
    
        #Read HTML
        record = urllib.request.urlopen(url)
        recordSoup = BeautifulSoup(record, 'html.parser')
    
        #Extract text inside the pre class="transcript" tag
        transcript = recordSoup.find("pre", {"class": "transcript"})
        text = transcript.contents[0]
    
        #Save to text file
        filename = url[37:]
        f = open(filename + '.txt', 'w')
        f.write(text)
        f.close()
    
        #Extract metadata (td tags that have no class)
        tableData = recordSoup.find_all("td", {"class": ""})
        csv_row = []
        for item in tableData:
            csv_row.append(item.get_text(strip=True))
    
        #Append data to csv file
        with open("20230514_AM_ied.csv", "w+", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(csv_row)
            
    except requests.exceptions.Timeout:
      print("Timeout occurred")

https://www.dippam.ac.uk/ied/records/20481
https://www.dippam.ac.uk/ied/records/20487
https://www.dippam.ac.uk/ied/records/20514
https://www.dippam.ac.uk/ied/records/20519
https://www.dippam.ac.uk/ied/records/20522
https://www.dippam.ac.uk/ied/records/20529
https://www.dippam.ac.uk/ied/records/20530
https://www.dippam.ac.uk/ied/records/20538
https://www.dippam.ac.uk/ied/records/20563
https://www.dippam.ac.uk/ied/records/20568
https://www.dippam.ac.uk/ied/records/20580
https://www.dippam.ac.uk/ied/records/20590
https://www.dippam.ac.uk/ied/records/20600
https://www.dippam.ac.uk/ied/records/20623
https://www.dippam.ac.uk/ied/records/20632
https://www.dippam.ac.uk/ied/records/20651
https://www.dippam.ac.uk/ied/records/20656
https://www.dippam.ac.uk/ied/records/20675
https://www.dippam.ac.uk/ied/records/20678
https://www.dippam.ac.uk/ied/records/20695
https://www.dippam.ac.uk/ied/records/20706
https://www.dippam.ac.uk/ied/records/20724
https://www.dippam.ac.uk/ied/records/20743
https://www

https://www.dippam.ac.uk/ied/records/22415
https://www.dippam.ac.uk/ied/records/22435
https://www.dippam.ac.uk/ied/records/22438
https://www.dippam.ac.uk/ied/records/22454
https://www.dippam.ac.uk/ied/records/22465
https://www.dippam.ac.uk/ied/records/22467
https://www.dippam.ac.uk/ied/records/22477
https://www.dippam.ac.uk/ied/records/22483
https://www.dippam.ac.uk/ied/records/22487
https://www.dippam.ac.uk/ied/records/22489
https://www.dippam.ac.uk/ied/records/22491
https://www.dippam.ac.uk/ied/records/22499
https://www.dippam.ac.uk/ied/records/22510
https://www.dippam.ac.uk/ied/records/22520
https://www.dippam.ac.uk/ied/records/22550
https://www.dippam.ac.uk/ied/records/22590
https://www.dippam.ac.uk/ied/records/22612
https://www.dippam.ac.uk/ied/records/22630
https://www.dippam.ac.uk/ied/records/22667
https://www.dippam.ac.uk/ied/records/22668
https://www.dippam.ac.uk/ied/records/22713
https://www.dippam.ac.uk/ied/records/22724
https://www.dippam.ac.uk/ied/records/22725
https://www

IndexError: list index out of range
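
`transcript.contents[0]` raises this `IndexError` when the `pre` tag exists but has no content. Below is a minimal sketch of a guard, assuming that a record with a missing or empty transcript should simply be skipped. `transcript_text` is a hypothetical helper that takes the result of `recordSoup.find("pre", {"class": "transcript"})`; the stand-in objects exist only so the sketch runs without a network connection.

```python
from types import SimpleNamespace

def transcript_text(pre_tag):
    """Return the transcript text, or None if the tag is missing or empty."""
    if pre_tag is None:          # find() returns None when no tag matches
        return None
    if not pre_tag.contents:     # an empty <pre></pre> has contents == []
        return None
    return str(pre_tag.contents[0])

# stand-ins for BeautifulSoup tags: only the .contents attribute is used
empty_tag = SimpleNamespace(contents=[])
full_tag = SimpleNamespace(contents=["Dear Brother"])

empty_result = transcript_text(empty_tag)
full_result = transcript_text(full_tag)
missing_result = transcript_text(None)
```

Inside the loop, a `None` result could be reported with `print(url + " - no transcript")` followed by `continue`, so one empty record no longer stops the run.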

In [21]:
# skip urls[380], the record that raised the IndexError above, and resume from the next one
urls[380:]
urls2 = urls[381:]
len(urls2)

2678
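
Slicing past the failed index works, but the offset has to be worked out by hand after each crash. An alternative sketch, assuming each saved transcript is named `<record id>.txt` as in step C: skip any URL whose text file already exists, so the same loop can simply be rerun. `pending_urls` is a hypothetical helper, and the demonstration uses a temporary directory with a made-up file.

```python
import os
import tempfile

def pending_urls(urls, out_dir="."):
    """Return the URLs that do not yet have a saved <record id>.txt file."""
    todo = []
    for url in urls:
        name = url.rsplit("/", 1)[-1] + ".txt"
        if not os.path.exists(os.path.join(out_dir, name)):
            todo.append(url)
    return todo

demo_urls = [
    "https://www.dippam.ac.uk/ied/records/20481",
    "https://www.dippam.ac.uk/ied/records/20487",
]
with tempfile.TemporaryDirectory() as tmp:
    # pretend the first record was saved before the crash
    with open(os.path.join(tmp, "20481.txt"), "w") as f:
        f.write("already saved")
    remaining = pending_urls(demo_urls, out_dir=tmp)
```

A record that fails before its file is written will be retried on every rerun, so this pairs best with a guard against empty transcripts.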

In [22]:
for url in urls2:
    
    print(url)
    
    try:
    
        #Read HTML
        record = urllib.request.urlopen(url)
        recordSoup = BeautifulSoup(record, 'html.parser')
    
        #Extract text inside the pre class="transcript" tag
        transcript = recordSoup.find("pre", {"class": "transcript"})
        text = transcript.contents[0]
    
        #Save to text file
        filename = url[37:]
        f = open(filename + '.txt', 'w')
        f.write(text)
        f.close()
    
        #Extract metadata (td tags that have no class)
        tableData = recordSoup.find_all("td", {"class": ""})
        csv_row = []
        for item in tableData:
            csv_row.append(item.get_text(strip=True))
    
        #Append data to csv file
        with open("20230514_AM_ied.csv", "w+", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(csv_row)
            
    except requests.exceptions.Timeout:
      print("Timeout occurred")

https://www.dippam.ac.uk/ied/records/24553
https://www.dippam.ac.uk/ied/records/24567
https://www.dippam.ac.uk/ied/records/24570
https://www.dippam.ac.uk/ied/records/24576
https://www.dippam.ac.uk/ied/records/24608
https://www.dippam.ac.uk/ied/records/24618
https://www.dippam.ac.uk/ied/records/24635
https://www.dippam.ac.uk/ied/records/24644
https://www.dippam.ac.uk/ied/records/24656
https://www.dippam.ac.uk/ied/records/24659
https://www.dippam.ac.uk/ied/records/24668
https://www.dippam.ac.uk/ied/records/24706
https://www.dippam.ac.uk/ied/records/24715
https://www.dippam.ac.uk/ied/records/24716
https://www.dippam.ac.uk/ied/records/24721
https://www.dippam.ac.uk/ied/records/24727
https://www.dippam.ac.uk/ied/records/24728
https://www.dippam.ac.uk/ied/records/24731
https://www.dippam.ac.uk/ied/records/24732
https://www.dippam.ac.uk/ied/records/24734
https://www.dippam.ac.uk/ied/records/24743
https://www.dippam.ac.uk/ied/records/24747
https://www.dippam.ac.uk/ied/records/24749
https://www

IndexError: list index out of range

In [32]:
# skip urls2[166], the record that raised the IndexError above, and resume from the next one
urls2[166:]
urls3 = urls2[167:]
len(urls3)

2511

In [33]:
for url in urls3:
    
    print(url)
    
    try:
    
        #Read HTML
        record = urllib.request.urlopen(url)
        recordSoup = BeautifulSoup(record, 'html.parser')
    
        #Extract text inside the pre class="transcript" tag
        transcript = recordSoup.find("pre", {"class": "transcript"})
        text = transcript.contents[0]
    
        #Save to text file
        filename = url[37:]
        f = open(filename + '.txt', 'w')
        f.write(text)
        f.close()
    
        #Extract metadata (td tags that have no class)
        tableData = recordSoup.find_all("td", {"class": ""})
        csv_row = []
        for item in tableData:
            csv_row.append(item.get_text(strip=True))
    
        #Append data to csv file
        with open("20230514_AM_ied.csv", "w+", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(csv_row)
            
    except requests.exceptions.Timeout:
      print("Timeout occurred")

https://www.dippam.ac.uk/ied/records/26352
https://www.dippam.ac.uk/ied/records/26353
https://www.dippam.ac.uk/ied/records/26364
https://www.dippam.ac.uk/ied/records/26366
https://www.dippam.ac.uk/ied/records/26377
https://www.dippam.ac.uk/ied/records/26381
https://www.dippam.ac.uk/ied/records/26384
https://www.dippam.ac.uk/ied/records/26409
https://www.dippam.ac.uk/ied/records/26431
https://www.dippam.ac.uk/ied/records/26432
https://www.dippam.ac.uk/ied/records/26439
https://www.dippam.ac.uk/ied/records/26447
https://www.dippam.ac.uk/ied/records/26450
https://www.dippam.ac.uk/ied/records/26469
https://www.dippam.ac.uk/ied/records/26471
https://www.dippam.ac.uk/ied/records/26478
https://www.dippam.ac.uk/ied/records/26480
https://www.dippam.ac.uk/ied/records/26486
https://www.dippam.ac.uk/ied/records/26491
https://www.dippam.ac.uk/ied/records/26505
https://www.dippam.ac.uk/ied/records/26511
https://www.dippam.ac.uk/ied/records/26516
https://www.dippam.ac.uk/ied/records/26550
https://www

https://www.dippam.ac.uk/ied/records/28224
https://www.dippam.ac.uk/ied/records/28232
https://www.dippam.ac.uk/ied/records/28235
https://www.dippam.ac.uk/ied/records/28251
https://www.dippam.ac.uk/ied/records/28264
https://www.dippam.ac.uk/ied/records/28265
https://www.dippam.ac.uk/ied/records/28269
https://www.dippam.ac.uk/ied/records/28273
https://www.dippam.ac.uk/ied/records/28277
https://www.dippam.ac.uk/ied/records/28311
https://www.dippam.ac.uk/ied/records/28321
https://www.dippam.ac.uk/ied/records/28322
https://www.dippam.ac.uk/ied/records/28324
https://www.dippam.ac.uk/ied/records/28328
https://www.dippam.ac.uk/ied/records/28333
https://www.dippam.ac.uk/ied/records/28347
https://www.dippam.ac.uk/ied/records/28374
https://www.dippam.ac.uk/ied/records/28384
https://www.dippam.ac.uk/ied/records/28406
https://www.dippam.ac.uk/ied/records/28412
https://www.dippam.ac.uk/ied/records/28414
https://www.dippam.ac.uk/ied/records/28419
https://www.dippam.ac.uk/ied/records/28423
https://www

https://www.dippam.ac.uk/ied/records/30292
https://www.dippam.ac.uk/ied/records/30300
https://www.dippam.ac.uk/ied/records/30307
https://www.dippam.ac.uk/ied/records/30308
https://www.dippam.ac.uk/ied/records/30346
https://www.dippam.ac.uk/ied/records/30348
https://www.dippam.ac.uk/ied/records/30350
https://www.dippam.ac.uk/ied/records/30380
https://www.dippam.ac.uk/ied/records/30403
https://www.dippam.ac.uk/ied/records/30439
https://www.dippam.ac.uk/ied/records/30440
https://www.dippam.ac.uk/ied/records/30454
https://www.dippam.ac.uk/ied/records/30461
https://www.dippam.ac.uk/ied/records/30466
https://www.dippam.ac.uk/ied/records/30473
https://www.dippam.ac.uk/ied/records/30474
https://www.dippam.ac.uk/ied/records/30532
https://www.dippam.ac.uk/ied/records/30539
https://www.dippam.ac.uk/ied/records/30543
https://www.dippam.ac.uk/ied/records/30546
https://www.dippam.ac.uk/ied/records/30550
https://www.dippam.ac.uk/ied/records/30559
https://www.dippam.ac.uk/ied/records/30560
https://www

https://www.dippam.ac.uk/ied/records/32241
https://www.dippam.ac.uk/ied/records/32271
https://www.dippam.ac.uk/ied/records/32303
https://www.dippam.ac.uk/ied/records/32310
https://www.dippam.ac.uk/ied/records/32313
https://www.dippam.ac.uk/ied/records/32324
https://www.dippam.ac.uk/ied/records/32345
https://www.dippam.ac.uk/ied/records/32349
https://www.dippam.ac.uk/ied/records/32350
https://www.dippam.ac.uk/ied/records/32357
https://www.dippam.ac.uk/ied/records/32358
https://www.dippam.ac.uk/ied/records/32359
https://www.dippam.ac.uk/ied/records/32386
https://www.dippam.ac.uk/ied/records/32393
https://www.dippam.ac.uk/ied/records/32394
https://www.dippam.ac.uk/ied/records/32412
https://www.dippam.ac.uk/ied/records/32418
https://www.dippam.ac.uk/ied/records/32432
https://www.dippam.ac.uk/ied/records/32433
https://www.dippam.ac.uk/ied/records/32446
https://www.dippam.ac.uk/ied/records/32447
https://www.dippam.ac.uk/ied/records/32454
https://www.dippam.ac.uk/ied/records/32475
https://www

https://www.dippam.ac.uk/ied/records/33987
https://www.dippam.ac.uk/ied/records/34014
https://www.dippam.ac.uk/ied/records/34016
https://www.dippam.ac.uk/ied/records/34032
https://www.dippam.ac.uk/ied/records/34050
https://www.dippam.ac.uk/ied/records/34088
https://www.dippam.ac.uk/ied/records/34111
https://www.dippam.ac.uk/ied/records/34123
https://www.dippam.ac.uk/ied/records/34126
https://www.dippam.ac.uk/ied/records/34127
https://www.dippam.ac.uk/ied/records/34146
https://www.dippam.ac.uk/ied/records/34150
https://www.dippam.ac.uk/ied/records/34156
https://www.dippam.ac.uk/ied/records/34169
https://www.dippam.ac.uk/ied/records/34175
https://www.dippam.ac.uk/ied/records/34180
https://www.dippam.ac.uk/ied/records/34195
https://www.dippam.ac.uk/ied/records/34207
https://www.dippam.ac.uk/ied/records/34210
https://www.dippam.ac.uk/ied/records/34216
https://www.dippam.ac.uk/ied/records/34228
https://www.dippam.ac.uk/ied/records/34232
https://www.dippam.ac.uk/ied/records/34234
https://www

https://www.dippam.ac.uk/ied/records/36197
https://www.dippam.ac.uk/ied/records/36200
https://www.dippam.ac.uk/ied/records/36219
https://www.dippam.ac.uk/ied/records/36260
https://www.dippam.ac.uk/ied/records/36263
https://www.dippam.ac.uk/ied/records/36277
https://www.dippam.ac.uk/ied/records/36281
https://www.dippam.ac.uk/ied/records/36296
https://www.dippam.ac.uk/ied/records/36306
https://www.dippam.ac.uk/ied/records/36307
https://www.dippam.ac.uk/ied/records/36313
https://www.dippam.ac.uk/ied/records/36323
https://www.dippam.ac.uk/ied/records/36326
https://www.dippam.ac.uk/ied/records/36332
https://www.dippam.ac.uk/ied/records/36361
https://www.dippam.ac.uk/ied/records/36363
https://www.dippam.ac.uk/ied/records/36364
https://www.dippam.ac.uk/ied/records/36365
https://www.dippam.ac.uk/ied/records/36378
https://www.dippam.ac.uk/ied/records/36390


URLError: <urlopen error [Errno 60] Operation timed out>
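
This timeout surfaced as a `urllib.error.URLError`, which the `except requests.exceptions.Timeout` clause does not catch because the request is made with `urllib`, not `requests`. Below is a sketch of catching the `urllib` error and retrying. The timeout, retry count, and pause are assumed values, `fetch_with_retry` is a hypothetical helper, and `opener` is injectable only so the logic can be exercised without a network connection.

```python
import io
import socket
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, opener=urllib.request.urlopen,
                     retries=3, pause=5.0, timeout=30):
    """Return the response body as bytes, retrying on URLError or timeout."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            last_error = exc
            print(f"{url}: attempt {attempt} failed ({exc})")
            if attempt < retries:
                time.sleep(pause)   # wait before trying again
    raise last_error                # give up after the final attempt

# stand-in opener that fails twice, then succeeds (no network needed)
calls = {"n": 0}

def flaky_opener(url, timeout=None):
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.URLError("Operation timed out")
    return io.BytesIO(b"<html></html>")

body = fetch_with_retry("https://www.dippam.ac.uk/ied/records/36390",
                        opener=flaky_opener, pause=0)
```

The returned bytes can be passed straight to `BeautifulSoup(body, 'html.parser')`.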

In [46]:
# resume at the record that timed out (it is retried, not skipped)
urls4 = urls3[974:]
len(urls4)

1537

In [47]:
for url in urls4:
    
    print(url)
    
    try:
    
        #Read HTML
        record = urllib.request.urlopen(url)
        recordSoup = BeautifulSoup(record, 'html.parser')
    
        #Extract text inside the pre class="transcript" tag
        transcript = recordSoup.find("pre", {"class": "transcript"})
        text = transcript.contents[0]
    
        #Save to text file
        filename = url[37:]
        f = open(filename + '.txt', 'w')
        f.write(text)
        f.close()
    
        #Extract metadata (td tags that have no class)
        tableData = recordSoup.find_all("td", {"class": ""})
        csv_row = []
        for item in tableData:
            csv_row.append(item.get_text(strip=True))
    
        #Append data to csv file
        with open("20230514_AM_ied.csv", "w+", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(csv_row)
            
    except requests.exceptions.Timeout:
      print("Timeout occurred")

https://www.dippam.ac.uk/ied/records/36390
https://www.dippam.ac.uk/ied/records/36393
https://www.dippam.ac.uk/ied/records/36394
https://www.dippam.ac.uk/ied/records/36426
https://www.dippam.ac.uk/ied/records/36427
https://www.dippam.ac.uk/ied/records/36444
https://www.dippam.ac.uk/ied/records/36457
https://www.dippam.ac.uk/ied/records/36462
https://www.dippam.ac.uk/ied/records/36478
https://www.dippam.ac.uk/ied/records/36493
https://www.dippam.ac.uk/ied/records/36501
https://www.dippam.ac.uk/ied/records/36509
https://www.dippam.ac.uk/ied/records/36523
https://www.dippam.ac.uk/ied/records/36525
https://www.dippam.ac.uk/ied/records/36529
https://www.dippam.ac.uk/ied/records/36530
https://www.dippam.ac.uk/ied/records/36543
https://www.dippam.ac.uk/ied/records/36548
https://www.dippam.ac.uk/ied/records/36553
https://www.dippam.ac.uk/ied/records/36555
https://www.dippam.ac.uk/ied/records/36563
https://www.dippam.ac.uk/ied/records/36588
https://www.dippam.ac.uk/ied/records/36593
https://www

https://www.dippam.ac.uk/ied/records/38308
https://www.dippam.ac.uk/ied/records/38309
https://www.dippam.ac.uk/ied/records/38318
https://www.dippam.ac.uk/ied/records/38335
https://www.dippam.ac.uk/ied/records/38365
https://www.dippam.ac.uk/ied/records/38369
https://www.dippam.ac.uk/ied/records/38374
https://www.dippam.ac.uk/ied/records/38376
https://www.dippam.ac.uk/ied/records/38378
https://www.dippam.ac.uk/ied/records/38385
https://www.dippam.ac.uk/ied/records/38395
https://www.dippam.ac.uk/ied/records/38396
https://www.dippam.ac.uk/ied/records/38406
https://www.dippam.ac.uk/ied/records/38437
https://www.dippam.ac.uk/ied/records/38452
https://www.dippam.ac.uk/ied/records/38455
https://www.dippam.ac.uk/ied/records/38457
https://www.dippam.ac.uk/ied/records/38466
https://www.dippam.ac.uk/ied/records/38470
https://www.dippam.ac.uk/ied/records/38473
https://www.dippam.ac.uk/ied/records/38474
https://www.dippam.ac.uk/ied/records/38493
https://www.dippam.ac.uk/ied/records/38497
https://www

https://www.dippam.ac.uk/ied/records/40660
https://www.dippam.ac.uk/ied/records/40663
https://www.dippam.ac.uk/ied/records/40673
https://www.dippam.ac.uk/ied/records/40685
https://www.dippam.ac.uk/ied/records/40686
https://www.dippam.ac.uk/ied/records/40692
https://www.dippam.ac.uk/ied/records/40708
https://www.dippam.ac.uk/ied/records/40727
https://www.dippam.ac.uk/ied/records/40751
https://www.dippam.ac.uk/ied/records/40766
https://www.dippam.ac.uk/ied/records/40782
https://www.dippam.ac.uk/ied/records/40784
https://www.dippam.ac.uk/ied/records/40799
https://www.dippam.ac.uk/ied/records/40803
https://www.dippam.ac.uk/ied/records/40809
https://www.dippam.ac.uk/ied/records/40815
https://www.dippam.ac.uk/ied/records/40817
https://www.dippam.ac.uk/ied/records/40831
https://www.dippam.ac.uk/ied/records/40835
https://www.dippam.ac.uk/ied/records/40841
https://www.dippam.ac.uk/ied/records/40847
https://www.dippam.ac.uk/ied/records/40851
https://www.dippam.ac.uk/ied/records/40860
https://www

https://www.dippam.ac.uk/ied/records/42659
https://www.dippam.ac.uk/ied/records/42664
https://www.dippam.ac.uk/ied/records/42674
https://www.dippam.ac.uk/ied/records/42679
https://www.dippam.ac.uk/ied/records/42682
https://www.dippam.ac.uk/ied/records/42699
https://www.dippam.ac.uk/ied/records/42704
https://www.dippam.ac.uk/ied/records/42708
https://www.dippam.ac.uk/ied/records/42719
https://www.dippam.ac.uk/ied/records/42726
https://www.dippam.ac.uk/ied/records/42774
https://www.dippam.ac.uk/ied/records/42775
https://www.dippam.ac.uk/ied/records/42782
https://www.dippam.ac.uk/ied/records/42786
https://www.dippam.ac.uk/ied/records/42807
https://www.dippam.ac.uk/ied/records/42816
https://www.dippam.ac.uk/ied/records/42821
https://www.dippam.ac.uk/ied/records/42826
https://www.dippam.ac.uk/ied/records/42844
https://www.dippam.ac.uk/ied/records/42856
https://www.dippam.ac.uk/ied/records/42905
https://www.dippam.ac.uk/ied/records/42908
https://www.dippam.ac.uk/ied/records/42909
https://www

https://www.dippam.ac.uk/ied/records/44925
https://www.dippam.ac.uk/ied/records/44930
https://www.dippam.ac.uk/ied/records/44933
https://www.dippam.ac.uk/ied/records/44937
https://www.dippam.ac.uk/ied/records/44943
https://www.dippam.ac.uk/ied/records/44964
https://www.dippam.ac.uk/ied/records/44977
https://www.dippam.ac.uk/ied/records/45009
https://www.dippam.ac.uk/ied/records/45017
https://www.dippam.ac.uk/ied/records/45027
https://www.dippam.ac.uk/ied/records/45028
https://www.dippam.ac.uk/ied/records/45029
https://www.dippam.ac.uk/ied/records/45048
https://www.dippam.ac.uk/ied/records/45066
https://www.dippam.ac.uk/ied/records/45068
https://www.dippam.ac.uk/ied/records/45069
https://www.dippam.ac.uk/ied/records/45086
https://www.dippam.ac.uk/ied/records/45090
https://www.dippam.ac.uk/ied/records/45091
https://www.dippam.ac.uk/ied/records/45099
https://www.dippam.ac.uk/ied/records/45105
https://www.dippam.ac.uk/ied/records/45110
https://www.dippam.ac.uk/ied/records/45111
https://www

https://www.dippam.ac.uk/ied/records/46976
https://www.dippam.ac.uk/ied/records/46983
https://www.dippam.ac.uk/ied/records/46987
https://www.dippam.ac.uk/ied/records/46993
https://www.dippam.ac.uk/ied/records/47004
https://www.dippam.ac.uk/ied/records/47016
https://www.dippam.ac.uk/ied/records/47025
https://www.dippam.ac.uk/ied/records/47030
https://www.dippam.ac.uk/ied/records/47036
https://www.dippam.ac.uk/ied/records/47040
https://www.dippam.ac.uk/ied/records/47041
https://www.dippam.ac.uk/ied/records/47047
https://www.dippam.ac.uk/ied/records/47048
https://www.dippam.ac.uk/ied/records/47060
https://www.dippam.ac.uk/ied/records/47064
https://www.dippam.ac.uk/ied/records/47066
https://www.dippam.ac.uk/ied/records/47068
https://www.dippam.ac.uk/ied/records/47070
https://www.dippam.ac.uk/ied/records/47079
https://www.dippam.ac.uk/ied/records/47081
https://www.dippam.ac.uk/ied/records/47093
https://www.dippam.ac.uk/ied/records/47109
https://www.dippam.ac.uk/ied/records/47110
https://www

https://www.dippam.ac.uk/ied/records/49122
https://www.dippam.ac.uk/ied/records/49126
https://www.dippam.ac.uk/ied/records/49137
https://www.dippam.ac.uk/ied/records/49166
https://www.dippam.ac.uk/ied/records/49169
https://www.dippam.ac.uk/ied/records/49202
https://www.dippam.ac.uk/ied/records/49205
https://www.dippam.ac.uk/ied/records/49216
https://www.dippam.ac.uk/ied/records/49217
https://www.dippam.ac.uk/ied/records/49218
https://www.dippam.ac.uk/ied/records/49221
https://www.dippam.ac.uk/ied/records/49230
https://www.dippam.ac.uk/ied/records/49253
https://www.dippam.ac.uk/ied/records/49262
https://www.dippam.ac.uk/ied/records/49278
https://www.dippam.ac.uk/ied/records/49279
https://www.dippam.ac.uk/ied/records/49280
https://www.dippam.ac.uk/ied/records/49300
https://www.dippam.ac.uk/ied/records/49303
https://www.dippam.ac.uk/ied/records/49317
https://www.dippam.ac.uk/ied/records/49319
https://www.dippam.ac.uk/ied/records/49328
https://www.dippam.ac.uk/ied/records/49348
https://www

https://www.dippam.ac.uk/ied/records/51277
https://www.dippam.ac.uk/ied/records/51282
https://www.dippam.ac.uk/ied/records/51284
https://www.dippam.ac.uk/ied/records/51291
https://www.dippam.ac.uk/ied/records/51301
https://www.dippam.ac.uk/ied/records/51311
https://www.dippam.ac.uk/ied/records/51315
https://www.dippam.ac.uk/ied/records/51337
https://www.dippam.ac.uk/ied/records/51343
https://www.dippam.ac.uk/ied/records/51347
https://www.dippam.ac.uk/ied/records/51358
https://www.dippam.ac.uk/ied/records/51372
https://www.dippam.ac.uk/ied/records/51397
https://www.dippam.ac.uk/ied/records/51406
https://www.dippam.ac.uk/ied/records/51415
https://www.dippam.ac.uk/ied/records/51432
https://www.dippam.ac.uk/ied/records/51437
https://www.dippam.ac.uk/ied/records/51438
https://www.dippam.ac.uk/ied/records/51439
https://www.dippam.ac.uk/ied/records/51492
https://www.dippam.ac.uk/ied/records/51496
https://www.dippam.ac.uk/ied/records/51500
https://www.dippam.ac.uk/ied/records/51508
https://www

https://www.dippam.ac.uk/ied/records/53586
https://www.dippam.ac.uk/ied/records/53602
https://www.dippam.ac.uk/ied/records/53611
https://www.dippam.ac.uk/ied/records/53614
https://www.dippam.ac.uk/ied/records/53616
https://www.dippam.ac.uk/ied/records/53624
https://www.dippam.ac.uk/ied/records/53627
https://www.dippam.ac.uk/ied/records/53643
https://www.dippam.ac.uk/ied/records/53653


## Check dataframe

In [14]:
import pandas as pd
df = pd.read_csv("20230514_AM_ied.csv")

In [15]:
df.shape

(0, 6)

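Something went wrong with the creation of the CSV: the dataframe has no rows. The cause appears to be the "w+" mode in the "with open" statement. "w+" truncates the file every time it is opened, so each pass through the loop overwrote the previous row and only the last row survived; `pd.read_csv` then read that single row as the header, which gives 0 rows and 6 columns. I am going to rerun the loop above without the parts that extract and save the text files, keeping only the parts that extract the metadata and save it to a csv.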
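
The effect of the file mode can be reproduced in isolation. Opening a file with "w" or "w+" empties it on every open, whereas "a" adds to the end. The sketch below is self-contained, uses a temporary directory and made-up rows, and reopens the file for each row just as the loop does.

```python
import csv
import os
import tempfile

def write_rows(path, rows, mode):
    """Reopen the file for every row, as the scraping loop does."""
    for row in rows:
        with open(path, mode, newline="") as f:
            csv.writer(f).writerow(row)

def count_rows(path):
    """Count the rows actually present in the CSV file."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f))

rows = [["a", "1"], ["b", "2"], ["c", "3"]]
with tempfile.TemporaryDirectory() as tmp:
    w_path = os.path.join(tmp, "w.csv")
    a_path = os.path.join(tmp, "a.csv")
    write_rows(w_path, rows, "w+")   # each open truncates the file
    write_rows(a_path, rows, "a")    # each open appends to the file
    w_count = count_rows(w_path)
    a_count = count_rows(a_path)

print(w_count, a_count)  # prints: 1 3
```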

In [16]:
#First, I need the urls minus the two empty ones that caused problems above.
urls = [e for e in urls if e not in ('https://www.dippam.ac.uk/ied/records/24544', 
                                     'https://www.dippam.ac.uk/ied/records/26350')]

len(urls)

3057

In [17]:
#Now I want to do a subset of those to test the code. Let's say 5.
urlsTest = urls[:5]
len(urlsTest)

5

In [18]:
for url in urlsTest:
    
    print(url)
    
    try:
    
        #Read HTML
        record = urllib.request.urlopen(url)
        recordSoup = BeautifulSoup(record, 'html.parser')
    
        #Extract metadata (td tags that have no class)
        tableData = recordSoup.find_all("td", {"class": ""})
        csv_row = []
        for item in tableData:
            csv_row.append(item.get_text(strip=True))
    
        #Append data to csv file
        with open("20230514_AM_ied.csv", "a+", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(csv_row)
            
    except Exception as e:
        # report the failure and continue with the next URL
        print(url + "-failed:", e)

https://www.dippam.ac.uk/ied/records/20481
https://www.dippam.ac.uk/ied/records/20487
https://www.dippam.ac.uk/ied/records/20514
https://www.dippam.ac.uk/ied/records/20519
https://www.dippam.ac.uk/ied/records/20522


In [19]:
import pandas as pd
df = pd.read_csv("20230514_AM_ied.csv")
df.shape

(5, 6)

Now remove that file and start over. Otherwise the "a+" mode will append to the existing CSV, and the rows for these records will be duplicated.

In [21]:
import os

os.remove("20230514_AM_ied.csv")
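
Deleting the file before each run can be avoided by opening the CSV once, in "w" mode, outside the loop: every run then starts from an empty file and all rows go through the same writer. This is a minimal sketch with made-up rows; `write_all` is a hypothetical helper, and UTF-8 is an assumed encoding.

```python
import csv
import os
import tempfile

def write_all(path, rows):
    """Open the file once in write mode, then write every row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)

demo_rows = [["a", "1"], ["b", "2"], ["c", "3"]]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    write_all(demo_path, demo_rows)
    write_all(demo_path, demo_rows)   # a second run replaces, not duplicates
    with open(demo_path, newline="", encoding="utf-8") as f:
        rows_on_disk = list(csv.reader(f))
```

`rows` can be any iterable, including a generator that yields one `csv_row` per URL. Because no header row is written, `pd.read_csv(path, header=None)` keeps the first record as data rather than treating it as column names.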

Now try with all the files.

In [22]:
for url in urls:
    
    print(url)
    
    try:
    
        #Read HTML
        record = urllib.request.urlopen(url)
        recordSoup = BeautifulSoup(record, 'html.parser')
    
        #Extract metadata (td tags that have no class)
        tableData = recordSoup.find_all("td", {"class": ""})
        csv_row = []
        for item in tableData:
            csv_row.append(item.get_text(strip=True))
    
        #Append data to csv file
        with open("20230514_AM_ied.csv", "a+", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(csv_row)
            
    except Exception as e:
        # report the failure and continue with the next URL
        print(url + "-failed:", e)

https://www.dippam.ac.uk/ied/records/20481
https://www.dippam.ac.uk/ied/records/20487
https://www.dippam.ac.uk/ied/records/20514
https://www.dippam.ac.uk/ied/records/20519
https://www.dippam.ac.uk/ied/records/20522
https://www.dippam.ac.uk/ied/records/20529
https://www.dippam.ac.uk/ied/records/20530
https://www.dippam.ac.uk/ied/records/20538
https://www.dippam.ac.uk/ied/records/20563
https://www.dippam.ac.uk/ied/records/20568
https://www.dippam.ac.uk/ied/records/20580
https://www.dippam.ac.uk/ied/records/20590
https://www.dippam.ac.uk/ied/records/20600
https://www.dippam.ac.uk/ied/records/20623
https://www.dippam.ac.uk/ied/records/20632
https://www.dippam.ac.uk/ied/records/20651
https://www.dippam.ac.uk/ied/records/20656
https://www.dippam.ac.uk/ied/records/20675
https://www.dippam.ac.uk/ied/records/20678
https://www.dippam.ac.uk/ied/records/20695
https://www.dippam.ac.uk/ied/records/20706
https://www.dippam.ac.uk/ied/records/20724
https://www.dippam.ac.uk/ied/records/20743
https://www

https://www.dippam.ac.uk/ied/records/22415
https://www.dippam.ac.uk/ied/records/22435
https://www.dippam.ac.uk/ied/records/22438
https://www.dippam.ac.uk/ied/records/22454
https://www.dippam.ac.uk/ied/records/22465
https://www.dippam.ac.uk/ied/records/22467
https://www.dippam.ac.uk/ied/records/22477
https://www.dippam.ac.uk/ied/records/22483
https://www.dippam.ac.uk/ied/records/22487
https://www.dippam.ac.uk/ied/records/22489
https://www.dippam.ac.uk/ied/records/22491
https://www.dippam.ac.uk/ied/records/22499
https://www.dippam.ac.uk/ied/records/22510
https://www.dippam.ac.uk/ied/records/22520
https://www.dippam.ac.uk/ied/records/22550
https://www.dippam.ac.uk/ied/records/22590
https://www.dippam.ac.uk/ied/records/22612
https://www.dippam.ac.uk/ied/records/22630
https://www.dippam.ac.uk/ied/records/22667
https://www.dippam.ac.uk/ied/records/22668
https://www.dippam.ac.uk/ied/records/22713
https://www.dippam.ac.uk/ied/records/22724
https://www.dippam.ac.uk/ied/records/22725
https://www

https://www.dippam.ac.uk/ied/records/24570
https://www.dippam.ac.uk/ied/records/24576
https://www.dippam.ac.uk/ied/records/24608
https://www.dippam.ac.uk/ied/records/24618
https://www.dippam.ac.uk/ied/records/24635
https://www.dippam.ac.uk/ied/records/24644
https://www.dippam.ac.uk/ied/records/24656
https://www.dippam.ac.uk/ied/records/24659
https://www.dippam.ac.uk/ied/records/24668
https://www.dippam.ac.uk/ied/records/24706
https://www.dippam.ac.uk/ied/records/24715
https://www.dippam.ac.uk/ied/records/24716
https://www.dippam.ac.uk/ied/records/24721
https://www.dippam.ac.uk/ied/records/24727
https://www.dippam.ac.uk/ied/records/24728
https://www.dippam.ac.uk/ied/records/24731
https://www.dippam.ac.uk/ied/records/24732
https://www.dippam.ac.uk/ied/records/24734
https://www.dippam.ac.uk/ied/records/24743
https://www.dippam.ac.uk/ied/records/24747
https://www.dippam.ac.uk/ied/records/24749
https://www.dippam.ac.uk/ied/records/24753
https://www.dippam.ac.uk/ied/records/24761
https://www

https://www.dippam.ac.uk/ied/records/26608
https://www.dippam.ac.uk/ied/records/26633
https://www.dippam.ac.uk/ied/records/26634
https://www.dippam.ac.uk/ied/records/26644
https://www.dippam.ac.uk/ied/records/26647
https://www.dippam.ac.uk/ied/records/26652
https://www.dippam.ac.uk/ied/records/26673
https://www.dippam.ac.uk/ied/records/26678
https://www.dippam.ac.uk/ied/records/26680
https://www.dippam.ac.uk/ied/records/26711
https://www.dippam.ac.uk/ied/records/26718
https://www.dippam.ac.uk/ied/records/26759
https://www.dippam.ac.uk/ied/records/26760
https://www.dippam.ac.uk/ied/records/26772
https://www.dippam.ac.uk/ied/records/26775
https://www.dippam.ac.uk/ied/records/26782
https://www.dippam.ac.uk/ied/records/26785
https://www.dippam.ac.uk/ied/records/26789
https://www.dippam.ac.uk/ied/records/26793
https://www.dippam.ac.uk/ied/records/26796
https://www.dippam.ac.uk/ied/records/26800
https://www.dippam.ac.uk/ied/records/26809
https://www.dippam.ac.uk/ied/records/26829
https://www

https://www.dippam.ac.uk/ied/records/28516
https://www.dippam.ac.uk/ied/records/28519
https://www.dippam.ac.uk/ied/records/28521
https://www.dippam.ac.uk/ied/records/28532
https://www.dippam.ac.uk/ied/records/28535
https://www.dippam.ac.uk/ied/records/28545
https://www.dippam.ac.uk/ied/records/28556
https://www.dippam.ac.uk/ied/records/28557
https://www.dippam.ac.uk/ied/records/28558
https://www.dippam.ac.uk/ied/records/28565
https://www.dippam.ac.uk/ied/records/28566
https://www.dippam.ac.uk/ied/records/28587
https://www.dippam.ac.uk/ied/records/28600
https://www.dippam.ac.uk/ied/records/28609
https://www.dippam.ac.uk/ied/records/28612
https://www.dippam.ac.uk/ied/records/28616
https://www.dippam.ac.uk/ied/records/28628
https://www.dippam.ac.uk/ied/records/28633
https://www.dippam.ac.uk/ied/records/28655
https://www.dippam.ac.uk/ied/records/28662
https://www.dippam.ac.uk/ied/records/28676
https://www.dippam.ac.uk/ied/records/28682
https://www.dippam.ac.uk/ied/records/28683
https://www

https://www.dippam.ac.uk/ied/records/30597
https://www.dippam.ac.uk/ied/records/30608
https://www.dippam.ac.uk/ied/records/30620
https://www.dippam.ac.uk/ied/records/30623
https://www.dippam.ac.uk/ied/records/30636
https://www.dippam.ac.uk/ied/records/30643
https://www.dippam.ac.uk/ied/records/30649
https://www.dippam.ac.uk/ied/records/30656
https://www.dippam.ac.uk/ied/records/30672
https://www.dippam.ac.uk/ied/records/30700
https://www.dippam.ac.uk/ied/records/30715
https://www.dippam.ac.uk/ied/records/30728
https://www.dippam.ac.uk/ied/records/30744
https://www.dippam.ac.uk/ied/records/30765
https://www.dippam.ac.uk/ied/records/30767
https://www.dippam.ac.uk/ied/records/30800
https://www.dippam.ac.uk/ied/records/30808
https://www.dippam.ac.uk/ied/records/30818
https://www.dippam.ac.uk/ied/records/30822
https://www.dippam.ac.uk/ied/records/30826
https://www.dippam.ac.uk/ied/records/30836
https://www.dippam.ac.uk/ied/records/30855
https://www.dippam.ac.uk/ied/records/30858
https://www

https://www.dippam.ac.uk/ied/records/32516
https://www.dippam.ac.uk/ied/records/32535
https://www.dippam.ac.uk/ied/records/32543
https://www.dippam.ac.uk/ied/records/32545
https://www.dippam.ac.uk/ied/records/32562
https://www.dippam.ac.uk/ied/records/32585
https://www.dippam.ac.uk/ied/records/32596
https://www.dippam.ac.uk/ied/records/32606
https://www.dippam.ac.uk/ied/records/32645
https://www.dippam.ac.uk/ied/records/32651
https://www.dippam.ac.uk/ied/records/32652
https://www.dippam.ac.uk/ied/records/32673
https://www.dippam.ac.uk/ied/records/32701
https://www.dippam.ac.uk/ied/records/32709
https://www.dippam.ac.uk/ied/records/32711
https://www.dippam.ac.uk/ied/records/32730
https://www.dippam.ac.uk/ied/records/32731
https://www.dippam.ac.uk/ied/records/32740
https://www.dippam.ac.uk/ied/records/32749
https://www.dippam.ac.uk/ied/records/32751
https://www.dippam.ac.uk/ied/records/32755
https://www.dippam.ac.uk/ied/records/32756
https://www.dippam.ac.uk/ied/records/32762
https://www

https://www.dippam.ac.uk/ied/records/34281
https://www.dippam.ac.uk/ied/records/34303
https://www.dippam.ac.uk/ied/records/34313
https://www.dippam.ac.uk/ied/records/34314
https://www.dippam.ac.uk/ied/records/34318
https://www.dippam.ac.uk/ied/records/34323
https://www.dippam.ac.uk/ied/records/34383
https://www.dippam.ac.uk/ied/records/34400
https://www.dippam.ac.uk/ied/records/34409
https://www.dippam.ac.uk/ied/records/34413
https://www.dippam.ac.uk/ied/records/34420
https://www.dippam.ac.uk/ied/records/34459
https://www.dippam.ac.uk/ied/records/34473
https://www.dippam.ac.uk/ied/records/34488
https://www.dippam.ac.uk/ied/records/34495
https://www.dippam.ac.uk/ied/records/34502
https://www.dippam.ac.uk/ied/records/34516
https://www.dippam.ac.uk/ied/records/34536
https://www.dippam.ac.uk/ied/records/34539
https://www.dippam.ac.uk/ied/records/34564
https://www.dippam.ac.uk/ied/records/34574
https://www.dippam.ac.uk/ied/records/34575
https://www.dippam.ac.uk/ied/records/34577
https://www

https://www.dippam.ac.uk/ied/records/36462
https://www.dippam.ac.uk/ied/records/36478
https://www.dippam.ac.uk/ied/records/36493
https://www.dippam.ac.uk/ied/records/36501
https://www.dippam.ac.uk/ied/records/36509
https://www.dippam.ac.uk/ied/records/36523
https://www.dippam.ac.uk/ied/records/36525
https://www.dippam.ac.uk/ied/records/36529
https://www.dippam.ac.uk/ied/records/36530
https://www.dippam.ac.uk/ied/records/36543
https://www.dippam.ac.uk/ied/records/36548
https://www.dippam.ac.uk/ied/records/36553
https://www.dippam.ac.uk/ied/records/36555
https://www.dippam.ac.uk/ied/records/36563
https://www.dippam.ac.uk/ied/records/36588
https://www.dippam.ac.uk/ied/records/36593
https://www.dippam.ac.uk/ied/records/36608
https://www.dippam.ac.uk/ied/records/36618
https://www.dippam.ac.uk/ied/records/36625
https://www.dippam.ac.uk/ied/records/36630
https://www.dippam.ac.uk/ied/records/36656
https://www.dippam.ac.uk/ied/records/36684
https://www.dippam.ac.uk/ied/records/36685
https://www

https://www.dippam.ac.uk/ied/records/38374
https://www.dippam.ac.uk/ied/records/38376
https://www.dippam.ac.uk/ied/records/38378
https://www.dippam.ac.uk/ied/records/38385
https://www.dippam.ac.uk/ied/records/38395
https://www.dippam.ac.uk/ied/records/38396
https://www.dippam.ac.uk/ied/records/38406
https://www.dippam.ac.uk/ied/records/38437
https://www.dippam.ac.uk/ied/records/38452
https://www.dippam.ac.uk/ied/records/38455
https://www.dippam.ac.uk/ied/records/38457
https://www.dippam.ac.uk/ied/records/38466
https://www.dippam.ac.uk/ied/records/38470
https://www.dippam.ac.uk/ied/records/38473
https://www.dippam.ac.uk/ied/records/38474
https://www.dippam.ac.uk/ied/records/38493
https://www.dippam.ac.uk/ied/records/38497
https://www.dippam.ac.uk/ied/records/38502
https://www.dippam.ac.uk/ied/records/38503
https://www.dippam.ac.uk/ied/records/38514
https://www.dippam.ac.uk/ied/records/38528
https://www.dippam.ac.uk/ied/records/38529
https://www.dippam.ac.uk/ied/records/38542
https://www

https://www.dippam.ac.uk/ied/records/40708
https://www.dippam.ac.uk/ied/records/40727
https://www.dippam.ac.uk/ied/records/40751
https://www.dippam.ac.uk/ied/records/40766
https://www.dippam.ac.uk/ied/records/40782
https://www.dippam.ac.uk/ied/records/40784
https://www.dippam.ac.uk/ied/records/40799
https://www.dippam.ac.uk/ied/records/40803
https://www.dippam.ac.uk/ied/records/40809
https://www.dippam.ac.uk/ied/records/40815
https://www.dippam.ac.uk/ied/records/40817
https://www.dippam.ac.uk/ied/records/40831
https://www.dippam.ac.uk/ied/records/40835
https://www.dippam.ac.uk/ied/records/40841
https://www.dippam.ac.uk/ied/records/40847
https://www.dippam.ac.uk/ied/records/40851
https://www.dippam.ac.uk/ied/records/40860
https://www.dippam.ac.uk/ied/records/40864
https://www.dippam.ac.uk/ied/records/40876
https://www.dippam.ac.uk/ied/records/40878
https://www.dippam.ac.uk/ied/records/40881
https://www.dippam.ac.uk/ied/records/40882
https://www.dippam.ac.uk/ied/records/40883
https://www

https://www.dippam.ac.uk/ied/records/42704
https://www.dippam.ac.uk/ied/records/42708
https://www.dippam.ac.uk/ied/records/42719
https://www.dippam.ac.uk/ied/records/42726
https://www.dippam.ac.uk/ied/records/42774
https://www.dippam.ac.uk/ied/records/42775
https://www.dippam.ac.uk/ied/records/42782
https://www.dippam.ac.uk/ied/records/42786
https://www.dippam.ac.uk/ied/records/42807
https://www.dippam.ac.uk/ied/records/42816
https://www.dippam.ac.uk/ied/records/42821
https://www.dippam.ac.uk/ied/records/42826
https://www.dippam.ac.uk/ied/records/42844
https://www.dippam.ac.uk/ied/records/42856
https://www.dippam.ac.uk/ied/records/42905
https://www.dippam.ac.uk/ied/records/42908
https://www.dippam.ac.uk/ied/records/42909
https://www.dippam.ac.uk/ied/records/42913
https://www.dippam.ac.uk/ied/records/42926
https://www.dippam.ac.uk/ied/records/42964
https://www.dippam.ac.uk/ied/records/42992
https://www.dippam.ac.uk/ied/records/43005
https://www.dippam.ac.uk/ied/records/43021
https://www

https://www.dippam.ac.uk/ied/records/44977
https://www.dippam.ac.uk/ied/records/45009
https://www.dippam.ac.uk/ied/records/45017
https://www.dippam.ac.uk/ied/records/45027
https://www.dippam.ac.uk/ied/records/45028
https://www.dippam.ac.uk/ied/records/45029
https://www.dippam.ac.uk/ied/records/45048
https://www.dippam.ac.uk/ied/records/45066
https://www.dippam.ac.uk/ied/records/45068
https://www.dippam.ac.uk/ied/records/45069
https://www.dippam.ac.uk/ied/records/45086
https://www.dippam.ac.uk/ied/records/45090
https://www.dippam.ac.uk/ied/records/45091
https://www.dippam.ac.uk/ied/records/45099
https://www.dippam.ac.uk/ied/records/45105
https://www.dippam.ac.uk/ied/records/45110
https://www.dippam.ac.uk/ied/records/45111
https://www.dippam.ac.uk/ied/records/45124
https://www.dippam.ac.uk/ied/records/45136
https://www.dippam.ac.uk/ied/records/45143
https://www.dippam.ac.uk/ied/records/45146
https://www.dippam.ac.uk/ied/records/45159
https://www.dippam.ac.uk/ied/records/45173
https://www

https://www.dippam.ac.uk/ied/records/47025
https://www.dippam.ac.uk/ied/records/47030
https://www.dippam.ac.uk/ied/records/47036
https://www.dippam.ac.uk/ied/records/47040
https://www.dippam.ac.uk/ied/records/47041
https://www.dippam.ac.uk/ied/records/47047
https://www.dippam.ac.uk/ied/records/47048
https://www.dippam.ac.uk/ied/records/47060
https://www.dippam.ac.uk/ied/records/47064
https://www.dippam.ac.uk/ied/records/47066
https://www.dippam.ac.uk/ied/records/47068
https://www.dippam.ac.uk/ied/records/47070
https://www.dippam.ac.uk/ied/records/47079
https://www.dippam.ac.uk/ied/records/47081
https://www.dippam.ac.uk/ied/records/47093
https://www.dippam.ac.uk/ied/records/47109
https://www.dippam.ac.uk/ied/records/47110
https://www.dippam.ac.uk/ied/records/47125
https://www.dippam.ac.uk/ied/records/47128
https://www.dippam.ac.uk/ied/records/47132
https://www.dippam.ac.uk/ied/records/47136
https://www.dippam.ac.uk/ied/records/47140
https://www.dippam.ac.uk/ied/records/47157
https://www

https://www.dippam.ac.uk/ied/records/49205
https://www.dippam.ac.uk/ied/records/49216
https://www.dippam.ac.uk/ied/records/49217
https://www.dippam.ac.uk/ied/records/49218
https://www.dippam.ac.uk/ied/records/49221
https://www.dippam.ac.uk/ied/records/49230
https://www.dippam.ac.uk/ied/records/49253
https://www.dippam.ac.uk/ied/records/49262
https://www.dippam.ac.uk/ied/records/49278
https://www.dippam.ac.uk/ied/records/49279
https://www.dippam.ac.uk/ied/records/49280
https://www.dippam.ac.uk/ied/records/49300
https://www.dippam.ac.uk/ied/records/49303
https://www.dippam.ac.uk/ied/records/49317
https://www.dippam.ac.uk/ied/records/49319
https://www.dippam.ac.uk/ied/records/49328
https://www.dippam.ac.uk/ied/records/49348
https://www.dippam.ac.uk/ied/records/49350
https://www.dippam.ac.uk/ied/records/49367
https://www.dippam.ac.uk/ied/records/49380
https://www.dippam.ac.uk/ied/records/49415
https://www.dippam.ac.uk/ied/records/49451
https://www.dippam.ac.uk/ied/records/49457
https://www

https://www.dippam.ac.uk/ied/records/51315
https://www.dippam.ac.uk/ied/records/51337
https://www.dippam.ac.uk/ied/records/51343
https://www.dippam.ac.uk/ied/records/51347
https://www.dippam.ac.uk/ied/records/51358
https://www.dippam.ac.uk/ied/records/51372
https://www.dippam.ac.uk/ied/records/51397
https://www.dippam.ac.uk/ied/records/51406
https://www.dippam.ac.uk/ied/records/51415
https://www.dippam.ac.uk/ied/records/51432
https://www.dippam.ac.uk/ied/records/51437
https://www.dippam.ac.uk/ied/records/51438
https://www.dippam.ac.uk/ied/records/51439
https://www.dippam.ac.uk/ied/records/51492
https://www.dippam.ac.uk/ied/records/51496
https://www.dippam.ac.uk/ied/records/51500
https://www.dippam.ac.uk/ied/records/51508
https://www.dippam.ac.uk/ied/records/51524
https://www.dippam.ac.uk/ied/records/51533
https://www.dippam.ac.uk/ied/records/51538
https://www.dippam.ac.uk/ied/records/51614
https://www.dippam.ac.uk/ied/records/51641
https://www.dippam.ac.uk/ied/records/51673
https://www

https://www.dippam.ac.uk/ied/records/53627
https://www.dippam.ac.uk/ied/records/53643
https://www.dippam.ac.uk/ied/records/53653


In [40]:
df = pd.read_csv("20230514_AM_ied.csv")
df.shape

(3054, 6)

A search for the term "-failed" shows that metadata extraction was unsuccessful for the following records:
    
<ul>
    <li>https://www.dippam.ac.uk/ied/records/29950</li>
    <li>https://www.dippam.ac.uk/ied/records/37031</li>
    <li>https://www.dippam.ac.uk/ied/records/29950</li>
</ul>

In [41]:
df.columns

Index(['300090', '04-12-1896', 'Letters (Emigrants)',
       'Public Record Office, Northern Ireland',
       'Edward Stanley, Katawa, Canada to Joshua Peel, Armagh; PRONI D889/7/1; CMSIED 300090',
       '20481'],
      dtype='object')
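
The column labels above are the values of the first record (docid 20481), because `pd.read_csv` treats the first line of a file as the header by default. The following is a minimal sketch, assuming the scraped CSV has no header row, of reading it with `header=None` and explicit `names` so that the first record stays in the data. The in-memory sample and its second row are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical sample in the same shape as the scraped CSV (no header row).
# The first row is the record shown in the df.columns output; the second is invented.
sample = io.StringIO(
    '300090,04-12-1896,Letters (Emigrants),"Public Record Office, Northern Ireland",'
    '"Edward Stanley, Katawa, Canada to Joshua Peel, Armagh; PRONI D889/7/1; CMSIED 300090",20481\n'
    '1,01-01-1900,Letters (Emigrants),"Example Archive","Hypothetical second record",2\n'
)

column_names = ['idIED', 'mm-dd-yyyy', 'doctype', 'source', 'description', 'docid']

# header=None tells pandas that the first line is data, not column labels
df_sample = pd.read_csv(sample, header=None, names=column_names)
print(df_sample.shape)
print(df_sample.loc[0, 'docid'])
```

Read this way, the frame has one more row than a default `read_csv` call would give, and no separate renaming step is needed.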

In [42]:
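# Note: the date values are day-month-year (e.g. 13-02-1873), although the column is labelled 'mm-dd-yyyy'
df.columns = ['idIED', 'mm-dd-yyyy', 'doctype', 'source', 'description', 'docid']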
df

Unnamed: 0,idIED,mm-dd-yyyy,doctype,source,description,docid
0,9501251,12-01-1891,Letters (Emigrants),"Public Record Office, Northern Ireland","From, Brooklyn, N.Y., to ""Dear James"" [no addr...",20487
1,300018,01-06-1822,Letters (Emigrants),Ulster-American Folk Park.,"James Kelly, Desertmartin to John Kelly, Penns...",20514
2,9408355,01-10-1842,Letters (Emigrants),"Public Record Office, Northern Ireland","Alexander McCloy, Pennsylvania, to Cousin, [Ir...",20519
3,9003061,10-01-1896,Letters (Emigrants),"Public Record Office, Northern Ireland","George Kirkpatrick, Toronto, to Rev. Alex. Kir...",20522
4,9011027,13-02-1873,Letters (Emigrants),"Public Record Office, Northern Ireland","William Porter, U.S.A. to Robert Porter, Irela...",20529
...,...,...,...,...,...,...
3049,9907213,14-10-1867,Letters (Emigrants),"Public Record Office, Northern Ireland","John Boys, Canada, to Mrs & J.W. Stavely, [?];...",53616
3050,9504113,06-03-1875,Letters (Emigrants),"Public Record Office, Northern Ireland","Rowland Redmond, Charleston, to ""My Dear Willi...",53624
3051,9810029,25-01-1898,Letters (Emigrants),"Public Record Office, Northern Ireland","Mrs. M. Mulcahy, Massachusetts, to ""Dear Amey ...",53627
3052,200910008,22-06-1873,Letters (Emigrants),Mellon Centre for Migration Studies,"Letter from John Ferguson, Philadelphia to fri...",53643


In [43]:
# Create rows for missing items
df2 = pd.DataFrame({'idIED': [9306005, 9311009, 9306005],
                   'mm-dd-yyyy': ['07-08-1887', '09-11-1890', '07-08-1887'],
                   'doctype': ['Letters (Emigrants)', 'Letters (Emigrants)', 'Letters (Emigrants)'],
                   'source': ['Public Record Office, Northern Ireland', 'Public Record Office, Northern Ireland', 'Public Record Office, Northern Ireland'],
                    'description': ['John B. Cherry, Spencers Bridge, Canada, to R.R. Cherry.; PRONI D 2166/1/3; CMSIED 9306005',
'John Hall, Pennslyvania to Thomas Black, Chicago.; PRONI D 2041/13; CMSIED 9311009', 'John B. Cherry, Spencers Bridge, Canada, to R.R. Cherry.; PRONI D 2166/1/3; CMSIED 9306005'],
                   'docid': [29950,37031,29950]})

# Add these to the dataframe (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, df2], ignore_index=True)

#View
df


Unnamed: 0,idIED,mm-dd-yyyy,doctype,source,description,docid
0,9501251,12-01-1891,Letters (Emigrants),"Public Record Office, Northern Ireland","From, Brooklyn, N.Y., to ""Dear James"" [no addr...",20487
1,300018,01-06-1822,Letters (Emigrants),Ulster-American Folk Park.,"James Kelly, Desertmartin to John Kelly, Penns...",20514
2,9408355,01-10-1842,Letters (Emigrants),"Public Record Office, Northern Ireland","Alexander McCloy, Pennsylvania, to Cousin, [Ir...",20519
3,9003061,10-01-1896,Letters (Emigrants),"Public Record Office, Northern Ireland","George Kirkpatrick, Toronto, to Rev. Alex. Kir...",20522
4,9011027,13-02-1873,Letters (Emigrants),"Public Record Office, Northern Ireland","William Porter, U.S.A. to Robert Porter, Irela...",20529
...,...,...,...,...,...,...
3052,200910008,22-06-1873,Letters (Emigrants),Mellon Centre for Migration Studies,"Letter from John Ferguson, Philadelphia to fri...",53643
3053,201306101,05-12-1909,Letters (Emigrants),Mellon Centre for Migration Studies,"John Donnelly, S.S. Caledonia to Ned [Edward D...",53653
3054,9306005,07-08-1887,Letters (Emigrants),"Public Record Office, Northern Ireland","John B. Cherry, Spencers Bridge, Canada, to R....",29950
3055,9311009,09-11-1890,Letters (Emigrants),"Public Record Office, Northern Ireland","John Hall, Pennslyvania to Thomas Black, Chica...",37031
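
Record 29950 is listed twice among the failed records, so the rows added by hand contain a repeated `docid`. The following is a minimal sketch, assuming `docid` should be unique per record, of how repeats could be spotted and dropped. The small `demo` frame is a stand-in that reuses only the identifiers shown above.

```python
import pandas as pd

# Stand-in frame: identifiers copied from the manually created rows
demo = pd.DataFrame({'idIED': [9306005, 9311009, 9306005],
                     'docid': [29950, 37031, 29950]})

# Rows whose docid has already appeared earlier in the frame
dupes = demo[demo.duplicated(subset='docid', keep='first')]

# Keep the first occurrence of each docid and renumber the index
deduped = demo.drop_duplicates(subset='docid', keep='first').reset_index(drop=True)

print(len(demo), len(deduped))
```

Running the same check on the full dataframe before saving would show whether any other records were captured more than once.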


In [45]:
df.dtypes

idIED           int64
mm-dd-yyyy     object
doctype        object
source         object
description    object
docid           int64
dtype: object

In [47]:
# The values are day-month-year (e.g. 13-02-1873), despite the column name
df[['docday', 'docmonth', 'docyear']] = df['mm-dd-yyyy'].str.split('-', n=2, expand=True)

In [48]:
df

Unnamed: 0,idIED,mm-dd-yyyy,doctype,source,description,docid,docday,docmonth,docyear
0,9501251,12-01-1891,Letters (Emigrants),"Public Record Office, Northern Ireland","From, Brooklyn, N.Y., to ""Dear James"" [no addr...",20487,12,01,1891
1,300018,01-06-1822,Letters (Emigrants),Ulster-American Folk Park.,"James Kelly, Desertmartin to John Kelly, Penns...",20514,01,06,1822
2,9408355,01-10-1842,Letters (Emigrants),"Public Record Office, Northern Ireland","Alexander McCloy, Pennsylvania, to Cousin, [Ir...",20519,01,10,1842
3,9003061,10-01-1896,Letters (Emigrants),"Public Record Office, Northern Ireland","George Kirkpatrick, Toronto, to Rev. Alex. Kir...",20522,10,01,1896
4,9011027,13-02-1873,Letters (Emigrants),"Public Record Office, Northern Ireland","William Porter, U.S.A. to Robert Porter, Irela...",20529,13,02,1873
...,...,...,...,...,...,...,...,...,...
3052,200910008,22-06-1873,Letters (Emigrants),Mellon Centre for Migration Studies,"Letter from John Ferguson, Philadelphia to fri...",53643,22,06,1873
3053,201306101,05-12-1909,Letters (Emigrants),Mellon Centre for Migration Studies,"John Donnelly, S.S. Caledonia to Ned [Edward D...",53653,05,12,1909
3054,9306005,07-08-1887,Letters (Emigrants),"Public Record Office, Northern Ireland","John B. Cherry, Spencers Bridge, Canada, to R....",29950,07,08,1887
3055,9311009,09-11-1890,Letters (Emigrants),"Public Record Office, Northern Ireland","John Hall, Pennslyvania to Thomas Black, Chica...",37031,09,11,1890
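
Values such as 13-02-1873 and 22-06-1873 cannot have a month of 13 or 22, which suggests the dates are stored day first. The following is a minimal sketch, assuming every date follows the day-month-year pattern, of parsing the strings with `pd.to_datetime` and deriving numeric parts from the result. The `dates` series is a small sample taken from the rows shown above.

```python
import pandas as pd

# Sample date strings copied from the table output
dates = pd.Series(['12-01-1891', '13-02-1873', '22-06-1873'])

# Parse as day-month-year; strings that do not match become NaT instead of raising
parsed = pd.to_datetime(dates, format='%d-%m-%Y', errors='coerce')

parts = pd.DataFrame({
    'docday': parsed.dt.day,
    'docmonth': parsed.dt.month,
    'docyear': parsed.dt.year,
})
print(parts)
print('unparsed dates:', parsed.isna().sum())
```

Parsing also validates the data: any string that does not match the pattern shows up as `NaT`, whereas a plain string split passes it through silently.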


More work is needed on this CSV before it can be joined with the NAILDOH metadata; for now, save it to a file.

In [50]:
# Note: this overwrites the comma-separated file read in cell 40 with tab-separated values
df.to_csv('20230514_AM_ied.csv', sep='\t', encoding='utf-8')
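
A file written with `sep='\t'` has to be read back with the same separator, and the default `index=True` adds an extra unnamed column on reload. The following is a minimal sketch, under the assumption that the row index does not need to be kept, of a round trip that writes and then re-reads a tab-separated file. The file name `ied_metadata_demo.tsv` and the one-row `demo` frame are hypothetical.

```python
import os
import tempfile
import pandas as pd

# One-row stand-in for the metadata dataframe
demo = pd.DataFrame({'idIED': [9306005], 'docid': [29950]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output name; a separate file leaves the scraped CSV untouched
    path = os.path.join(tmp, 'ied_metadata_demo.tsv')

    # index=False avoids writing the row index as an extra column
    demo.to_csv(path, sep='\t', encoding='utf-8', index=False)

    # The separator must match the one used when writing
    restored = pd.read_csv(path, sep='\t')

print(restored.shape)
```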