# Introduction to Python #1 - Working with Text Data

Welcome to the June '25 Hack & Yack, part of the development of a new [Computing for Cultural Heritage](https://blogs.bl.uk/digital-scholarship/2021/09/computing-for-cultural-heritage-trial-outcomes-and-final-report.html) programme. The aim of this programme, which we're currently seeking funding for, is to teach staff the fundamentals of programming in a cultural heritage context. In this session we'll introduce some fundamentals of the Python programming language by extracting structured information from unstructured text about voyages of ships in the India Office Records. If you get far enough with the session you'll clean some data for, and be credited on, a new dataset uploaded to the research repository!

This session had a small pre-work notebook to work through. This matches the planned format for the new course. The notebook covered using Jupyter notebooks (the format of this web page) and some basic Python data types. If you didn't have time to work through it beforehand, we recommend doing it now. It should take 15 minutes or so. If you've already finished it, you might want to keep it open as a handy reference for this session.

Format:
- Introduction (10 mins)
- The task: extracting structured information from unstructured text data (5 mins)
- Looking at the data (10 mins)
- Defining our outputs (15 mins)
- Break (5 mins)
- Working through exercises (45 mins)
- Debrief and processing data for the repository (30 mins)

Teachers
- Harry Lloyd (host)
- Jez Cope (online)

If you have questions in the room, I'll try to help you out. If you're online, drop Jez a message!

### Learning Objectives

Covered in the pre-work (as well as today)
- Writing and running Python code in a JupyterLab notebook
- Python variables and how to create them
- Creating and interacting with Python data types and data structures
    - String and integer data types
    - Lists
    - Dictionaries

Covered in this notebook
- Navigating the JupyterLab file system
- How to convert your approach to solving a problem into code
- The snake_case and PascalCase naming conventions
- How to iterate over lists of things using `for` loops
- Using regular expressions to find matching strings of characters in text
- How to import and export data from the filesystem
- The basic structure of JSON as a data storage format

## Today's Task

> Convert unstructured text data about the histories of East India Company ships into a structured format, giving readers data that is easier to process.

### The Dataset
The dataset is a file of ship authority records at `data\raw\clean_ship_sample.csv`, mostly based on entries from Anthony Farrington's *Catalogue of East India Company Ships' Journals and Logs*, which was keyed in the early 2000s and later imported to IAMS. The entries are formatted fairly consistently in the *Catalogue*, and that consistency was replicated in the keying, which we'll take advantage of today. The raw dataset used to produce this sample is `data\raw\IAMS_pre_cyber_export_Corporation_authority.xlsx`, taken from `ACT_Metadata\IAMS\IAMS Oct 2023 authority listings` and filtered by Alex Hailey for a ship-related subset of the CorporationAdditionalQualifiers column.

We're using a subset of columns from the full files: RecordID, ShipName, DateRange, History.

### Data Dictionary 

<u>RecordID</u>  
The IAMS Record ID for this Corporate Authority record.

<u>ShipName</u>  
The ship's name, unmodified from the CorporationCorporateName column in the raw IAMS dataset.

<u>History</u>    
Text about the history of the ship. Usually split into (1) information like contract type, size, builder, owner, and (2) details of voyages, which can be multiple. Voyages are numbered, and typically record the years of voyage with destination, captain (if known), and stops. Here's an example:
>Chartered ship, 32/35 crew, 450 tons. Principal Managing Owner: William Bawtree. Voyages: (1) 1818/9 Bengal. Capt Lucas Percival. Downs 27 May 1819 - 30 Sep Bengal - 29 Dec Narsipur - 3 Jan 1820 Madras - 22 Mar St Helena - 13 May East India Dock. (2) 1822/3 Bengal. Capt Lucas Percival. Downs 25 May 1829 - 21 Sep Hugli - 14 Oct Calcutta - 29 Jan 1824 Saugor - 2 Apr St Helena - 17 Jun East India Dock.

  
<u>DateRange</u>  
The range of years during which the ship was active. These are present for the sample, but most entries in the full dataset have individual start and end dates expressed as '-9999', and date ranges as 'Undetermined'. There's a separate project to update dates in the authority files. Processing the data below could produce information helpful in updating dates using the unique authority IDs, but we will focus on extracting voyage data.
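
The two-part structure of `History` can be sketched in a few lines. This is a minimal illustration using a shortened version of the example above. The variable names are introduced here for illustration, and the sketch assumes `Voyages: ` appears exactly once in the text, which won't hold for every record in the full dataset.

```python
import re

# Shortened version of the History example above (routes cut to two stops each)
history = (
    "Chartered ship, 32/35 crew, 450 tons. Principal Managing Owner: William Bawtree. "
    "Voyages: (1) 1818/9 Bengal. Capt Lucas Percival. Downs 27 May 1819 - 30 Sep Bengal. "
    "(2) 1822/3 Bengal. Capt Lucas Percival. Downs 25 May 1829 - 21 Sep Hugli."
)

# Part 1 is everything before "Voyages: ", part 2 is the numbered voyages
ship_info, voyage_text = history.split("Voyages: ", maxsplit=1)
ship_info = ship_info.strip()

# Capture the number inside each "(n)" marker, then split the text on those markers
voyage_numbers = [int(n) for n in re.findall(r"\((\d{1,2})\)", voyage_text)]
voyage_strings = [v.strip() for v in re.split(r"\(\d{1,2}\) ", voyage_text) if v.strip()]

# Pair each voyage number with its text
voyages = dict(zip(voyage_numbers, voyage_strings))

for number, text in voyages.items():
    print(number, "->", text)
```

Each value in `voyages` is still free text at this point; breaking it into duration, destination, captain and stops is the next step.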

## What you need to do

- 
- Check your output against the answer using `show_mismatches`
  
Iterate over the CSV, creating a dictionary for each ship that records basic information (ship name, ID, info) and the voyage data extracted from the free-text `History` column, with voyages recorded as a list of lists containing tuples that record date and location. Individual ship dictionaries are then added to a main dictionary, with the ID as the key.

This one has basic error handling to record problem records and then continue running. (My first time using try / except / else and it probably shows!)

With a sample dataset of 97 records, 40 had to be passed over. The full dataset consists of c. 1,500 entries.

![book_cover](book_cover_voyage_text.png "Book Cover of Farrington's")

In [1]:
# IMPORT STATEMENTS
import json
import re
import pandas as pd
import requests

### Most basic workflow with sanitised data

This version has no conditional checks to make the logic more flexible, so it only handles records that follow the standard format.

It can also be used to find just the records that are parsable with the basic logic.

In [3]:
def teaching_parse(ship_id, row, date_place_regex="default", place_date_regex="default"):
    if date_place_regex == "default":
        date_place_regex = re.compile(r"(?P<Date>\d{1,2} \w{3}( \d{4})?) (?P<Location>\b[\w\s]*\b)")
    if place_date_regex == "default":
        place_date_regex = re.compile(r"(?P<Location>\b[\w\s]*\b) (?P<Date>\d{1,2} \w{3} \d{4})")
    
    ship_info = {
        "name": row["CorporateName"],
        "dates": row["DateRange"],
        "info": "",
        "voyages": {},
        "raw_history": row["History"]
    }

    voyages = {}
    info, voyage_string = row["History"].split("Voyages: ")
    ship_info["info"] = info.strip()

    voyage_numbers = re.findall(r"(?<=\()\d{1,2}(?=\))", voyage_string)  # This finds any number in round brackets `(i)`, and keeps the number
    raw_voyages = re.split(r"\(\d{1,2}\) ", voyage_string)[1:]  # First item in list is empty string due to split around first bracketed voyage number (1) 
    for i, rv in zip(voyage_numbers, raw_voyages):
        voyage = {
            "voyage_number": int(i),
            "duration": "",
            "destination": "",
            "captain": "",
            "route": []
        }

        duration_dest, captain, route_str = rv.split(". ")[:3]
        duration, destination = duration_dest.split(" ")[:2]

        voyage["captain"] = captain
        voyage["destination"] = destination
        voyage["duration"] = duration

        raw_stops = route_str.split(" - ")
        stops = []
        
        start = place_date_regex.search(raw_stops[0])
        start_location, start_date = start.group("Location"), start.group("Date")
        
        stops.append({start_date: start_location})

        for stop in raw_stops[1:]:
            match = date_place_regex.search(stop)
            loc, date = match.group("Location"), match.group("Date")
            stops.append({date: loc})

        voyage["route"] = stops

        voyages[int(i)] = voyage

    ship_info["voyages"] = voyages
    
    return ship_info
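
To see what the two default patterns capture, here is a small sketch that runs them outside `teaching_parse`. The route comes from the `History` example in the data dictionary. `apostrophe_regex` is a name introduced here for illustration: it adds `'` to the location character class to show why a place name containing an apostrophe is cut short by the default pattern.

```python
import re

# The two default patterns from teaching_parse
place_date_regex = re.compile(r"(?P<Location>\b[\w\s]*\b) (?P<Date>\d{1,2} \w{3} \d{4})")
date_place_regex = re.compile(r"(?P<Date>\d{1,2} \w{3}( \d{4})?) (?P<Location>\b[\w\s]*\b)")

route = "Downs 27 May 1819 - 30 Sep Bengal - 29 Dec Narsipur - 3 Jan 1820 Madras"
raw_stops = route.split(" - ")

# The first stop is written "place date", every later stop is "date place"
first = place_date_regex.search(raw_stops[0])
stops = [(first.group("Date"), first.group("Location"))]
for stop in raw_stops[1:]:
    match = date_place_regex.search(stop)
    stops.append((match.group("Date"), match.group("Location")))

print(stops)

# \w and \s do not match an apostrophe, so the default pattern stops at "St Augustine"
tricky_stop = "16 Jul St Augustine's Bay"
default_location = date_place_regex.search(tricky_stop).group("Location")

# Illustrative variant: allow apostrophes inside the location
apostrophe_regex = re.compile(r"(?P<Date>\d{1,2} \w{3}( \d{4})?) (?P<Location>\b[\w\s']*\b)")
fixed_location = apostrophe_regex.search(tricky_stop).group("Location")

print(default_location, "|", fixed_location)
```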

In [4]:
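def to_json(ships, fp):
    # Merge the list of single-ship dictionaries into one dictionary keyed by ship ID
    ship_dict = {}
    for s in ships:
        ship_dict |= s

    # Use a separate name for the file handle so it doesn't shadow the file path
    with open(fp, "w") as f:
        json.dump(ship_dict, f, indent="\t")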

In [5]:
def from_json(fp):
    with open(fp, "r") as f:
        ship_dict = json.load(f)

    return [{k:v} for k,v in ship_dict.items()]
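
One thing to be aware of with this round trip: JSON object keys are always strings, so integer dictionary keys (such as the voyage numbers used by `teaching_parse`) come back as strings after loading. A minimal sketch using only the standard library follows; the `voyages` value is a cut-down illustration, not a full record.

```python
import json

# Cut-down illustration of a voyages dictionary keyed by integer voyage number
voyages = {1: {"voyage_number": 1, "destination": "Bombay"}}

# Writing to JSON and reading back turns the integer key 1 into the string "1"
round_tripped = json.loads(json.dumps(voyages))

print(list(voyages))
print(list(round_tripped))

# If integer keys are needed again, convert them back explicitly
restored = {int(k): v for k, v in round_tripped.items()}
```

This matters when comparing freshly parsed data with data loaded from a file: the two will not compare equal unless the keys are converted first.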

#### Identifying clean ships to work with

In [42]:
clean_ships = []

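# ships_df is the full dataset, loaded in the "Processing the live data" section
for ship_id, row in ships_df.iterrows():
    try:
        teaching_parse(ship_id, row)
        clean_ships.append(ship_id)
    except Exception:  # Any parsing error means the record doesn't fit the basic format
        continue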

In [43]:
len(clean_ships)

728

In [44]:
# ships_df.loc[pd.Index(clean_ships)].query("DateRange != 'Unspecified'").iloc[:20].to_csv("../data/raw/ships_sample.csv", encoding="utf8")

#### Working with the clean ships

In [10]:
sample_df = pd.read_csv("../data/raw/clean_ships_sample.csv", index_col=0, encoding="utf8")

In [11]:
ship_voyages = []

for ship_id, row in sample_df.iterrows():
    ship_info = teaching_parse(ship_id, row, date_place_regex=re.compile(r"(?P<Date>\d{1,2} \w{3}( \d{4})?) (?P<Location>\b[\w\s']*\b)"))
    ship_voyages.append({ship_id: ship_info})

In [12]:
to_json(ship_voyages, "../data/processed/ships_1.json")

In [13]:
ship_voyages[0]

{'045-001114649': {'name': 'Boscawen',
  'dates': '1748-1765',
  'info': 'Rated at 499 tons, 26 guns, 99 crew. Principal Managing Owner: 4 Richard Crabb.',
  'voyages': {1: {'voyage_number': 1,
    'duration': '1748/9',
    'destination': 'Bombay',
    'captain': 'Capt Benjamin Braund',
    'route': [{'26 Mar 1749': 'Downs'},
     {'5 Jul': 'Johanna'},
     {'2 Aug': 'Bombay'},
     {'22 Sep': 'Surat'},
     {'17 Nov': 'Bandar Abbas'},
     {'23 Dec': 'Bombay'},
     {'11 Feb 1750': 'Mangalore'},
     {'17 Feb': 'Tellicherry'},
     {'19 Mar': 'Socotra'},
     {'29 Mar': 'Mokha'},
     {'27 Aug': 'Bombay'},
     {'16 Jan 1751': 'Cape'},
     {'17 Feb': 'St Helena'},
     {'4 Jun': 'Gravesend'}]},
   2: {'voyage_number': 2,
    'duration': '1752/3',
    'destination': 'Madras',
    'captain': 'Capt Benjamin Braund',
    'route': [{'27 Dec 1752': 'Downs'},
     {'15 Mar 1753': 'Cape'},
     {'24 Jun': 'Madras'},
     {'9 Sep': 'Whampoa'},
     {'26 Dec': 'Second Bar'},
     {'7 May': 'St

In [33]:
def show_mismatches(s1: list[dict], s2: list[dict]):
    s1_dict = {}
    s2_dict = {}

    for ship_id in s1:
        s1_dict |= ship_id
    for ship_id in s2:
        s2_dict |= ship_id
    if s1_dict == s2_dict:
        print("Ships are identical")
        return None

    # Join outside the f-string: a backslash inside an f-string expression is a SyntaxError before Python 3.12
    s1_only = s1_dict.keys() - s2_dict.keys()
    if s1_only:
        print("Ship IDs only present in first set: " + "\n".join(s1_only))
    s2_only = s2_dict.keys() - s1_dict.keys()
    if s2_only:
        print("Ship IDs only present in second set: " + "\n".join(s2_only))

    print("Differences in information for Ship IDs present in both sets:")
    common_ship_ids = s1_dict.keys() & s2_dict.keys()
    for ship_id in common_ship_ids:  # TODO refactor to check if value is dict/list then recur/use repeatable fn to compare, rather than these nested lists
        s1_ship_info = s1_dict[ship_id]
        s2_ship_info = s2_dict[ship_id]

        if s1_ship_info == s2_ship_info:
            continue

        if s1_ship_info.keys() ^ s2_ship_info.keys():
            print(f"{ship_id} has keys {', '.join(s1_ship_info.keys())} in set 1 and {', '.join(s2_ship_info.keys())} in set 2")
            print(f"The differing keys are {s1_ship_info.keys() ^ s2_ship_info.keys()}")

        common_ship_info = s1_ship_info.keys() & s2_ship_info.keys()
        for ship_key in common_ship_info:
            s1_value, s2_value = s1_ship_info[ship_key], s2_ship_info[ship_key]
            if ship_key != "voyages" and (s1_value != s2_value):
                print(f"Ship {ship_id} has {ship_key}: {s1_value} in set 1 and {ship_key}: {s2_value} in set 2")

            elif ship_key == "voyages" and (s1_value != s2_value):
                s1_voyages, s2_voyages = s1_value, s2_value
                print(f"\n**Ship {ship_id}**")
                if s1_voyages.keys() ^ s2_voyages.keys():
                    # Voyage numbers are ints before a JSON round trip and strings after, so convert before joining
                    print(
                        f"{ship_id} has voyages {', '.join(map(str, s1_voyages))} in set 1 and {', '.join(map(str, s2_voyages))} in set 2")
                    print(f"The differing voyages are {s1_voyages.keys() ^ s2_voyages.keys()}")

                common_voyages = sorted(list(s1_voyages.keys() & s2_voyages.keys()))
                for voyage_id in common_voyages:
                    v1, v2 = s1_voyages[voyage_id], s2_voyages[voyage_id]
                    if v1 != v2:
                        print(f"\nVoyage {voyage_id}")
                        if v1.keys() ^ v2.keys():
                            print(f"{voyage_id} has keys {', '.join(v1.keys())} in set 1 and {', '.join(v2.keys())} in set 2")
                            print(f"The differing keys are {v1.keys() ^ v2.keys()}")

                        common_voyage_info = v1.keys() & v2.keys()
                        for v_key in common_voyage_info:
                            v1_value, v2_value = v1[v_key], v2[v_key]
                            if v_key != "route" and (v1_value != v2_value):
                                print(f"Voyage {voyage_id} has {v_key}: {v1_value} in set 1 and {v_key}: {v2_value} in set 2")

                            elif v_key == "route" and (v1_value != v2_value):
                                s1_stops, s2_stops = v1_value, v2_value
                                for stop in s1_stops:
                                    if stop not in s2_stops:
                                        print(f"Stop {stop} in set 1 but not in set 2\n")
                                for stop in s2_stops:
                                    if stop not in s1_stops:
                                        print(f"Stop {stop} in set 2 but not in set 1\n")
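
The TODO in `show_mismatches` suggests replacing the nested loops with a recursive comparison. One possible shape for that is sketched below. `find_differences` and `MISSING` are hypothetical names, and the sketch compares lists position by position, which is a different choice from the membership check used for stops above.

```python
MISSING = "<missing>"  # Placeholder reported when a key exists on only one side


def find_differences(a, b, path="root"):
    """Return a list of (path, value_in_a, value_in_b) for every place a and b differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        differences = []
        # Visit every key from either side; sort as strings so mixed key types don't fail
        for key in sorted(a.keys() | b.keys(), key=str):
            differences += find_differences(
                a.get(key, MISSING), b.get(key, MISSING), f"{path}.{key}"
            )
        return differences

    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        differences = []
        for i, (item_a, item_b) in enumerate(zip(a, b)):
            differences += find_differences(item_a, item_b, f"{path}[{i}]")
        return differences

    # Plain values, lists of different lengths, or mismatched types
    return [] if a == b else [(path, a, b)]


ship_a = {"voyages": {"1": {"route": [{"16 Jul": "St Augustine"}]}}}
ship_b = {"voyages": {"1": {"route": [{"16 Jul": "St Augustine's Bay"}]}}}

for difference in find_differences(ship_a, ship_b):
    print(difference)
```

Because the same function handles every level of nesting, it keeps working if the structure of a ship record changes, at the cost of less tailored messages than the hand-written version.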

In [34]:
s1, s2 = from_json("../data/processed/ships_0.json"), from_json("../data/processed/ships_1.json") 

In [35]:
show_mismatches(s1, s2)

Differences in information for Ship IDs present in both sets:

**Ship 045-001114707**

Voyage 1
Stop {'6 Jan 1715': 'Cox'} in set 1 but not in set 2

Stop {'6 Jan 1715': "Cox's Island"} in set 2 but not in set 1


Voyage 2
Stop {'29 Jan 1718': 'Cox'} in set 1 but not in set 2

Stop {'29 Jan 1718': "Cox's Island"} in set 2 but not in set 1


**Ship 045-001114683**

Voyage 1
Stop {'16 Jul': 'St Augustine'} in set 1 but not in set 2

Stop {'16 Jul': "St Augustine's Bay"} in set 2 but not in set 1


**Ship 045-001114757**

Voyage 1
Stop {'8 Jan': '1763'} in set 1 but not in set 2

Stop {'8 Jan': "1763 'Scindy Road"} in set 2 but not in set 1


**Ship 045-001114938**

Voyage 3
Stop {'17 Aug': 'St Augustine'} in set 1 but not in set 2

Stop {'17 Aug': "St Augustine's Bay"} in set 2 but not in set 1



## Processing the live data

The complete data set is much messier. I've extended the logic of the algorithm for processing text data to handle the messiness. You can read it below.

In [2]:
ships_df = pd.read_csv("../data/raw/ships.csv", index_col="RecordID", encoding="utf8")

In [6]:
for i in range(0, len(ships_df), 50):
    ships_df.iloc[i:i+50].to_csv(f"../data/interim/extension/ships_subset_{i // 50}.csv", encoding="utf8")

In [None]:
place_date_regex = re.compile(r"(?P<Location>[a-zA-Z\s']*\b)? ?(?P<Date>(\d{1,2}\s)?\w{3}(\s\d{4})?)?")
date_place_regex = re.compile(r"(?P<Date>(\d{1,2}\s)?\w{3}(\s\d{4})?)? ?(?P<Location>\b[a-zA-Z\s'-]*\b)")
duration_dest_regex = re.compile(r"(?P<Duration>\b[\d/-]*\b) ?(?P<Destination>[\s\w,&--'\(\)]*)?.?$")

ship_voyages = []
voyage_part_parse_failures = []
dur_date_failures = []
date_place_failures = []
place_date_failures = []

for ship_id, row in ships_df.iterrows():
    ship_info = {
        "name": row["CorporateName"],
        "dates": row["DateRange"],
        "info": "",
        "voyages": [],
        "raw_history": row["History"]
    }

    voyages = []
    if not isinstance(row["History"], str):
        ship_info["info"] = "No history recorded"
        ship_voyages.append({ship_id: ship_info})
        continue

    if "Voyages: " in row["History"]:
        info, voyage_string = row["History"].split("Voyages: ")
        ship_info["info"] = info.strip()
    else:  # No voyage information
        ship_info["info"] = row["History"]
        ship_voyages.append({ship_id: ship_info})
        continue
    
    
    raw_voyages = [x.strip() for x in re.split(r"\(\d{1,2}\) ", voyage_string) if x]  # First item in list is empty string due to split around first bracketed voyage number (1) 
    for rv in raw_voyages:
        voyage = {
            "duration": "",
            "start_date": "",
            "end_date": "",
            "destination": "",
            "captain": "",
            "route": [],
            "parse_failure": False
        }

        voyage_parts = [x.strip() for x in rv.split(".") if x]
        # Reset per-voyage values so nothing carries over from the previous voyage
        captain, route_str = "", ""
        try:
            if ("Capt" in rv or "Master" in rv) and "-" in rv:
                duration_dest, captain, route_str = voyage_parts[:3]
            elif ("Capt" in rv or "Master" in rv) and "-" not in rv:
                duration_dest, captain = voyage_parts[:2]
            elif "-" in rv:
                duration_dest, route_str = voyage_parts[:2]
            elif len(voyage_parts) == 2 and "-" not in rv:
                duration_dest, route_str = voyage_parts
            elif "-" not in rv:
                duration_dest = rv
        except ValueError:
            voyage_part_parse_failures.append((ship_id, rv))
            voyage["route"].append(rv)
            voyage["parse_failure"] = True
            voyages.append(voyage)
            continue

        try:
            dd_match = duration_dest_regex.match(duration_dest)
            duration, destination = dd_match.group("Duration"), dd_match.group("Destination")
        except AttributeError as e:
            dur_date_failures.append((ship_id, duration_dest))
            voyage["route"].append(rv)
            voyage["parse_failure"] = True
            voyages.append(voyage)
            continue

        voyage["captain"] = captain
        voyage["duration"] = duration
        voyage["destination"] = destination

        raw_stops = route_str.split(" - ")
        stops = []

        try:
            start = place_date_regex.search(raw_stops[0])
            start_location, start_date = start.group("Location"), start.group("Date")
            if start_location:
                start_location = start_location.strip()
        except AttributeError:
            # The first stop could not be parsed: keep its raw text and flag the voyage
            start_location, start_date = raw_stops[0], "Unparsed stop"
            voyage["parse_failure"] = True
            place_date_failures.append((ship_id, raw_stops[0]))
            
        voyage["start_date"] = start_date
        
        stops.append({start_date: start_location})

        for stop in raw_stops[1:]:
            dp_match = date_place_regex.match(stop)
            if dp_match:
                loc, date = dp_match.group("Location").strip(), dp_match.group("Date")
                stops.append({date: loc})
            elif re.search(r"\d", stop):  # No date/place match but the stop contains digits, so try place/date format
                pd_match = place_date_regex.match(stop)
                pd_loc, pd_date = (pd_match.group("Location") or "").strip(), pd_match.group("Date")
                if pd_date:
                    loc, date = pd_loc, pd_date
                    stops.append({date: loc})
                else:
                    date_place_failures.append((ship_id, stop))
                    stops.append({"unable_to_date": stop})
                    voyage["parse_failure"] = True                       
            else:
                date_place_failures.append((ship_id, stop))
                stops.append({"unable_to_date": stop})
                voyage["parse_failure"] = True    

        if len(voyage_parts) > 3:
            stops.extend({"Additional voyage": p} for p in voyage_parts[3:])

        voyage["route"] = stops
        voyage["end_date"] = next(iter(stops[-1]))

        voyages.append(voyage)

    ship_info["voyages"] = voyages

    ship_voyages.append({ship_id: ship_info})

In [None]:
len(ship_voyages), len(voyage_part_parse_failures), len(dur_date_failures), len(date_place_failures), len(place_date_failures)

In [None]:
ships_df.tail()

In [None]:
ships_df.head(10)

In [None]:
date_place_failures

Here are some of the inconsistencies in the voyage data that need handling when parsing it with code.

Types of `History` string:
 - Ship info and voyage info. Start with ship info then `Voyages:` and voyage info
 - Only voyage info, string starts with `Voyages:` and has voyage info only

The voyages part is typically a series of short texts describing individual voyages, separated by voyage numbers in round brackets, e.g. `(1)`.

Types of individual voyage string:
- Years duration and a destination, then a captain, then text describing the stops on the voyage.

Types of voyage string inconsistency:
- No captain, just duration/destination then stops
- No stops, just duration/destination then captain
- No destination, just duration then captain/stops
- No captain or stops
- Poorly formatted: misplaced `.`, `-`
- Journey variation: wrecked, didn't return

Currently, all `voyage_part_parse_failures` are due to a missing `.` between parts of the voyage.

The duration/destination can also vary:
- Unhandled characters in the duration/destination text
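
A quick way to see which of these shapes a voyage string has is to check for the same markers the processing loop relies on: `Capt` or `Master` for a captain, and `-` for stops. The helper below is a hypothetical sketch, not part of the processing code, and the shorter example strings are cut down by hand from the example record for illustration.

```python
def describe_voyage_string(raw_voyage: str) -> dict:
    """Report which parts of a voyage string appear to be present."""
    parts = [p.strip() for p in raw_voyage.split(".") if p.strip()]
    return {
        "has_captain": "Capt" in raw_voyage or "Master" in raw_voyage,
        "has_stops": "-" in raw_voyage,
        "n_parts": len(parts),  # Number of non-empty pieces between full stops
    }


examples = [
    "1818/9 Bengal. Capt Lucas Percival. Downs 27 May 1819 - 30 Sep Bengal.",
    "1818/9 Bengal. Capt Lucas Percival.",
    "1818/9 Bengal. Downs 27 May 1819 - 30 Sep Bengal.",
    "1818/9 Bengal.",
]

descriptions = [describe_voyage_string(e) for e in examples]
for example, description in zip(examples, descriptions):
    print(description, "<-", example)
```

Checks like these are crude: a duration written with a hyphen would count as having stops, which is one reason the full dataset needs the extra handling described above.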

## Alex's questions

### Next steps proposals / questions

Run over the entire dataset. Take the IDs of problematic records and remove from dataset, then re-run on that to get an initial output and a set of problematic records for further examination.

Examine initial output:

*   Close reading to flag any errors
*   Write code to flag voyage lists with len < 3
*   Write code to flag voyage steps which look like this ('23 Sep', '1685.')
*   Subsequent data cleaning / refining code
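
The two flagging ideas above could be sketched as follows, assuming each voyage is a list of `(date, place)` tuples as in the step example. `flag_voyage` and `YEAR_ONLY` are hypothetical names, and the threshold of 3 comes straight from the list above.

```python
import re

# A "place" that is really a stray year, with or without a trailing full stop
YEAR_ONLY = re.compile(r"\d{4}\.?")


def flag_voyage(steps):
    """Return a list of reasons why a voyage (a list of (date, place) tuples) looks suspect."""
    reasons = []
    if len(steps) < 3:
        reasons.append("fewer than 3 steps")
    for date, place in steps:
        if YEAR_ONLY.fullmatch(place.strip()):
            reasons.append(f"year parsed as place: {(date, place)}")
    return reasons


suspect = [("23 Sep", "1685.")]
plausible = [("27 May 1819", "Downs"), ("30 Sep", "Bengal"), ("13 May", "East India Dock")]

print(flag_voyage(suspect))
print(flag_voyage(plausible))
```

An empty list means neither check fired, which is not the same as the voyage being correct; close reading is still needed.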


Problem entries:

*   Close reading to see what the problems might be and how to approach
*   Possibly use a more granular approach. Split by '-' then look more closely?
*   Run against a gazetteer or use NER to identify place names and find the closest dates?

Other questions:

*  Are nested dictionaries the best way of structuring this output? Would JSON be better?

*  Ultimately I would like to take the voyage steps, geolocate the places and tidy the dates so that they can be queried (which ships pass place X within timespan Y) and plotted, although there's a lot of data cleaning to do before then.
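
As a starting point for tidying the dates, the sketch below carries the most recent year forward to stops that only give a day and month, and returns ISO-format dates that sort and compare correctly as strings. It assumes a year is stated whenever it changes (as in the example route) and that stops are `(date, place)` tuples. `tidy_dates` and `MONTHS` are hypothetical names.

```python
from datetime import date

# Map three-letter month abbreviations to month numbers (avoids locale-dependent parsing)
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def tidy_dates(stops):
    """Convert (date_text, place) stops to (yyyy-mm-dd, place), carrying the year forward."""
    tidied = []
    current_year = None
    for date_text, place in stops:
        parts = date_text.split()
        if len(parts) == 3:  # e.g. "27 May 1819": a new year is stated
            current_year = int(parts[2])
        if current_year is None:
            raise ValueError(f"No year known yet for stop {date_text!r}")
        day, month = int(parts[0]), MONTHS[parts[1]]
        tidied.append((date(current_year, month, day).isoformat(), place))
    return tidied


stops = [
    ("27 May 1819", "Downs"),
    ("30 Sep", "Bengal"),
    ("29 Dec", "Narsipur"),
    ("3 Jan 1820", "Madras"),
    ("22 Mar", "St Helena"),
]
tidied = tidy_dates(stops)

# Which places were visited between two dates?
visited = [place for day, place in tidied if "1819-09-01" <= day <= "1819-12-31"]
print(tidied)
print(visited)
```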


## Can I just use an LLM?

Yes, and they can produce good results. I haven't suggested them at the start because this tutorial is about how to write Python, not how to prompt an LLM. Using LLMs for work tasks also raises a range of ethical considerations. Read the BL's AI Principles and explore a framework like the Library of Congress' Labs [AI Planning Framework](https://libraryofcongress.github.io/labs-ai-framework/) to help you understand the benefits and risks of carrying out this work at scale.

Let's explore using an LLM as extra credit now you've done the bulk of your learning. LLMs are quite good at extracting structured data from unstructured text [references]. At the time of writing my impression is that Anthropic have the best governance processes, so open https://claude.ai and sign up for an account (~1 min). Then you can start putting in sections of the text and trying to get Claude to extract the data in a format similar to that above. Finding the right prompt is important, and is one of the skills needed to fruitfully interact with language models. Experiment yourself or make use of the one below, which I've adapted from [Matt Miller](https://thisismattmiller.com/post/using-gpt-on-library-collections/).

--- 

You are a helpful assistant that is extracting data from ship voyage information. You only answer using the text given to you. You do not make-up additional information, the answer has to be contained in the text provided to you. Each voyage is a string of text. 
You will structure your answer in valid JSON, extract the date in the format yyyy-mm-dd and the location the ship visited using the JSON keys dateVisited and location.

If the following text contains multiple voyages, extract each one into an array of 
valid JSON dictionaries. Each dictionary represents one of the entries:

Downs 27 May 1819 - 30 Sep Bengal - 29 Dec Narsipur - 3 Jan 1820 Madras - 22 Mar St Helena - 13 May East India Dock

---
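
Whatever the model returns, it is worth checking before using it. The sketch below assumes the reply is a JSON array using the `dateVisited` and `location` keys requested in the prompt. `check_response` is a hypothetical helper, and `sample_response` is hand-written for illustration, not real model output.

```python
import json
from datetime import date


def check_response(response_text, source_text):
    """Parse a reply and check its keys, date format, and that each location is in the source."""
    stops = json.loads(response_text)  # Raises ValueError if the reply isn't valid JSON
    if not isinstance(stops, list):
        raise ValueError("Expected a JSON array of stops")
    for stop in stops:
        if set(stop) != {"dateVisited", "location"}:
            raise ValueError(f"Unexpected keys: {sorted(stop)}")
        date.fromisoformat(stop["dateVisited"])  # Raises ValueError if this isn't a valid ISO date
        if stop["location"] not in source_text:
            raise ValueError(f"Location not found in source text: {stop['location']}")
    return stops


source_text = "Downs 27 May 1819 - 30 Sep Bengal - 29 Dec Narsipur"

# Hand-written illustration of a well-formed reply, not real model output
sample_response = """[
    {"dateVisited": "1819-05-27", "location": "Downs"},
    {"dateVisited": "1819-09-30", "location": "Bengal"}
]"""

checked = check_response(sample_response, source_text)
print(checked)
```

The location check is deliberately strict: it only confirms that the text of each location appears somewhere in the source, so it catches invented places but not a date paired with the wrong stop.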