# Introduction to Python #1 - Working with Text Data

Welcome to the June '25 Hack & Yack, part of the development of a new [Computing for Cultural Heritage](https://blogs.bl.uk/digital-scholarship/2021/09/computing-for-cultural-heritage-trial-outcomes-and-final-report.html) course. The aim of this programme, which we're currently seeking funding for, is to teach staff the fundamentals of programming in a cultural heritage context. In this session we'll introduce some fundamentals of the Python programming language by extracting structured information from unstructured text about voyages of ships in the India Office Records.

This session had a small pre-work notebook to work through, which matches the planned format for the new course. The pre-work covered using Jupyter notebooks (the format of this web page) and some basic Python data types. If you didn't have time to work through this before the session you'll have time once we've gone through the introduction. It should take 15-20 minutes or so. If you've already finished it you might want to have it open as a handy reference for this session.

Format:
- Introduction (10 mins)
- The task: extracting structured information from unstructured text data (5 mins)
- Looking at the data (10 mins)
- Defining our outputs (15 mins)
- Break (5 mins)
- Working through exercises (60 mins)
- Debrief and looking at the extension tasks (15 mins)

Teachers
- Harry Lloyd (host)
- Jez Cope (online)
- Saira Akhter (onsite - St P)

If you have questions in the room, ask in person and Saira and I will help you out. If you're online, Jez (or one of us in the room) will be able to assist.

## Learning Objectives

Covered in the pre-work (as well as today)
- Writing and running python code in a JupyterLab notebook
- Python variables and how to create them
- Creating and interacting with Python data types and data structures
    - String and integer data types
    - Lists
    - Dictionaries

Covered in this notebook
- Navigating the JupyterLab file system
- How to convert your approach to solving a problem into code
- Creating key/value pairs in dictionaries
- Using regular expressions to find matching strings of characters in text
- How to iterate over lists of things using for loops
- Exporting data to the filesystem as JSON

### Learning Resources

There are lots of learning resources for Python. If you get stuck, here are a few you can consult. I'd like you to try using the documentation first, before you ask a language model or google, as reading the documentation is the best way to understand how the language works.
- Python documentation
    - e.g. for [lists](https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range), [dictionaries](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict), and [for loops](https://docs.python.org/3/reference/compound_stmts.html#the-for-statement)
    - The documentation is the source of truth for programming languages, and the more time you spend in them the easier it will get to understand how to do what you need to do, or why it's not working
- Asking your favourite language model
    - Good for 'explain why this code isn't working' questions, though it can be harder to understand what went wrong if you just copy the corrected code. Be very careful asking them to write code for you: if you don't understand it before you use it, it could be doing anything.
    - There's an extension section on using LLMs to actually carry out the task we're doing today, but I've left that for the end because it wouldn't teach you anything about using Python!
- Googling the question
    - Often helpful for very niche things, and people have written some good explainers, but can make it harder to understand why your code isn't working

## Today's Task

> Convert unstructured text data about the histories of East India Company ships into a structured format that makes using the data easier for readers.

### The Dataset
A file of ship authority records at `data\raw\clean_ships_sample.csv`, mostly based on entries from Anthony Farrington's *Catalogue of East India Company Ships' Journals and Logs*, which was keyed in the early 2000s and later imported to IAMS. The entries are formatted pretty consistently in the *Catalogue*, and the consistency was replicated in the keying, which we'll take advantage of today. The raw dataset used to produce this sample is `data\raw\IAMS_pre_cyber_export_Corporation_authority.xlsx`, taken from `ACT_Metadata\IAMS\IAMS Oct 2023 authority listings`, filtered by Alex Hailey for a ship-related subset of the CorporationAdditionalQualifiers column.

We're using a subset of columns from the full files: RecordID, ShipName, DateRange, History.

![book_cover](book_cover_voyage_text.png "Book Cover of Farrington's")

### Investigate the data 

An important part of working with data using a programming language is familiarity with the data. While you can run a processing step over a whole dataset in seconds, you still need to look at the data to understand it. In the file explorer on the left, navigate to `data\raw`. There you'll see `clean_ships_sample.csv`. Double click to open it, then familiarise yourself with the columns using the data dictionary below. Copy cells from the History column into any text editor to read the complete text. An example is included in the Data Dictionary below.

### Data Dictionary 

<u>RecordID</u>  
The IAMS Record ID for this Corporate Authority record.

<u>ShipName</u>  
The ship's name, unmodified from the CorporationCorporateName column in the raw IAMS dataset.

<u>History</u>    
Text about the history of the ship. Usually split into (1) information like contract type, size, builder, owner, and (2) details of voyages, which can be multiple. Voyages are numbered, and typically record the years of voyage with destination, captain (if known), and stops. Here's an example:
>Chartered ship, 32/35 crew, 450 tons. Principal Managing Owner: William Bawtree. Voyages: (1) 1818/9 Bengal. Capt Lucas Percival. Downs 27 May 1819 - 30 Sep Bengal - 29 Dec Narsipur - 3 Jan 1820 Madras - 22 Mar St Helena - 13 May East India Dock. (2) 1822/3 Bengal. Capt Lucas Percival. Downs 25 May 1829 - 21 Sep Hugli - 14 Oct Calcutta - 29 Jan 1824 Saugor - 2 Apr St Helena - 17 Jun East India Dock.

  
<u>DateRange</u>  
The range of years during which the ship was active. These are present for the sample, but most entries in the full dataset have individual start and end dates expressed as '-9999', and date ranges as 'Undetermined'. There's a separate project to update dates in the authority files. Processing the data below could produce information helpful in updating dates using the unique authority IDs, but we will focus on extracting voyage data.

### Task outputs

Imagine that this project started following a few queries from readers: "I'd like to be able to work with the East India Company ship records. Do you have them in a more structured format so I can carry out some geographic analysis on them?"

**5 minute exercise**  
We're going to structure our output in a dictionary, which is a collection of key/value pairs. They were covered in the pre-work, so you'll have a chance to read a little about them if you haven't yet.
- Turn to the person next to you (we'll pair you up online).
- In pairs, looking at the raw data, replace the filler values for FieldName and field_value with some of the pieces of structured information you think it might be useful to provide readers
    - What data do you think will be important to readers? (In real life you'd ask the reader, but make some educated guesses for now)
    - The History column is full of information, how might you break it down?
- You can add new lines by adding more key: value pairs. You need to add a `,` at the end of every line so the code knows you've moved on to another key/value pair.

In [None]:
{
    "FieldName": "field_value",
    "FieldName": "field_value",
}

#### Answer

Below is the generic data structure we're going to use. Coming up with this structure requires a thorough investigation of the data and, along with cleaning the data, is one of the most time-consuming parts of the process.

Note we've used `ship_id` as the top-level unique identifier. This is the key that users use to access information about each ship in the dataset, and we'll map it to RecordID in the csv. The value for each `ship_id` key is another dictionary, and this is what contains all the information about that ship. We'll include name, date range, the raw History text, and processed History text. `processed_history` is split into `ship_info` (everything before the voyage information), and the list of `voyages`. Each voyage contains some information about the voyage, and then a list of stops. This is the kind of structured granular data that will be useful to people trying to apply computational or geographic methods to this data.

```json
{
    "<ship_id>": {
        "name": "<ship_name>",
        "date_range": "<ship_dates>",
        "raw_history": "<raw_history>",
        "processed_history": {
            "ship_info": "<ship_info>",
            "voyages": [
                {
                    "voyage_number": "<voyage_number>",
                    "duration": "<voyage_duration>",
                    "destination": "<destination>",
                    "captain": "<captain>",
                    "route": [
                        "<location_date_string>",
                        "<location_date_string>",
                        ...
                        "<location_date_string>"
                    ]
                },
                ...  # Further voyages
            ]
        }
    },
    ...  # Further ships
}
```
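
To see how this nesting behaves in practice, here is a minimal sketch that builds a cut-down version of the structure for one ship and reads values back out. The values are illustrative, borrowed from the sample data and trimmed to two stops, and `example_ships` is a name made up for this example.

```python
# A cut-down, illustrative version of the target structure for one ship.
example_ships = {
    "045-001114649": {
        "name": "Boscawen",
        "date_range": "1748-1765",
        "processed_history": {
            "ship_info": "Rated at 499 tons, 26 guns, 99 crew.",
            "voyages": [
                {
                    "voyage_number": "(1)",
                    "duration": "1748/9",
                    "destination": "Bombay",
                    "captain": "Capt Benjamin Braund",
                    "route": ["Downs 26 Mar 1749", "5 Jul Johanna"],
                }
            ],
        },
    }
}

# Each pair of square brackets steps one level down into the structure.
ship = example_ships["045-001114649"]
first_voyage = ship["processed_history"]["voyages"][0]  # voyages is a list, so index by position

print(ship["name"])                 # Boscawen
print(first_voyage["destination"])  # Bombay
print(first_voyage["route"][-1])    # 5 Jul Johanna
```

Dictionaries are accessed by key and lists by position, which is why the lookup mixes strings like `"voyages"` with numbers like `0`.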

## The task

### Steps
- Work out what information from one row of the csv you need to complete each of the \<values\> in the dictionary above
- Import the data (we'll handle this for you this time)
- Write code to map the data for the first ship to the output format
- Do this for all ships

### Import some Python packages

Python is an amazing language for many reasons, one of which is the range of _packages_ available. These packages extend Python's core functionality, and in this case include some special functions written for Computing for Cultural Heritage. We won't go further into packages today, we'll just run these import cells together.

In [None]:
import sys
!{sys.executable} -m pip install -e ../ --no-deps

---
Now restart the kernel: Toolbar > Kernel > Restart Kernel  
Then move on to the cell below, you don't need to re-run the cell above.

---

In [None]:
# IMPORT STATEMENTS
import json
import re

from cfch.dataset import import_data

### Working with the imported data

For this exercise we manage importing the data for you.
- Run the cell below, then create a new cell below it (click left of the cell to enter command mode and press `b`)
- In the new cell run each of the four variables (`ship_ids`, `names`, `histories`, `date_ranges`) one by one.
- Referring to the pre-work if you need to, what data type is each of these variables? How do you access individual values from them?

In [None]:
ship_ids, names, histories, date_ranges = import_data("../data/raw/clean_ships_sample.csv")
# Create a new cell below, and add your code

These four lists are your _input data_, and we will refer back to them repeatedly. The zeroth item in each list corresponds to the first row of the input csv. The first item in each corresponds to the second row and so on.
- Print the zeroth item in each list, and confirm it matches up to the input csv

In [None]:
# Your code

#### Starting the output dictionary

You can add single key/value pairs to a dictionary like this:

In [None]:
simple_dict = {}
simple_dict["my_key"] = "my_value"
simple_dict["my_other_key"] = "my_other_value"
# Dictionary values can be dictionaries themselves
# This creates the nested data structure in the Task output above
simple_dict["my_id_key"] = {"sub_dict_key": "sub_dict_value", "other_key": "other_value"}
print(simple_dict)

- Create a dictionary using the first (zeroth) item from each of `ship_ids`, `names`, `date_ranges`, and `histories`
    - You can add the key/value pairs one by one as above, or write out the dictionary all at once as in the pre-work
- The dictionary should match our target output as closely as possible, though we're not including any `processed_history` data yet.

In [None]:
# Your code

##### Answer

In [None]:
zeroth_dict = {ship_ids[0]: {"name": names[0], "date_range": date_ranges[0], "raw_history": histories[0]}}

```Python
{'045-001114649': {'name': 'Boscawen',
  'date_range': '1748-1765',
  'raw_history': 'Rated at 499 tons, 26 guns, 99 crew. Principal Managing Owner: 4 Richard Crabb. Voyages: (1) 1748/9 Bombay. Capt Benjamin Braund. Downs 26 Mar 1749 - 5 Jul Johanna - 2 Aug Bombay - 22 Sep Surat - 17 Nov Bandar Abbas - 23 Dec Bombay - 11 Feb 1750 Mangalore - 17 Feb Tellicherry - 19 Mar Socotra - 29 Mar Mokha - 27 Aug Bombay - 16 Jan 1751 Cape - 17 Feb St Helena - 4 Jun Gravesend. (2) 1752/3 Madras and China. Capt Benjamin Braund. Downs 27 Dec 1752 - 15 Mar 1753 Cape - 24 Jun Madras - 9 Sep Whampoa - 26 Dec Second Bar - 7 May St Helena - 22 Jul Erith. (3) 1756/7 Madras, Bengal and China. Capt Benjamin Braund. Downs 30 Jan 1757 - 16 Jul Madagascar - 5 Sep Madras - 28 Nov Balasore - 20 Dec Calcutta - 29 Mar 1758 Ingeli - 5 Jul Madras - 21 Oct Whampoa 4 Feb 1759 - 15 May off St Helena, unable to call because of French ships - 7 Jun San Salvador - 27 Dec Cork - 2 Feb Plymouth - 27 Mar Erith. (4) 1760/1 Bombay and Bengal. Capt Benjamin Braund. Portsmouth 26 May 1761 - 16 Aug Rio de Janeiro - 16 Dec Anjengo - 25 Dec Tellicherry - 9 Jan 1762 Goa - 19 Feb Bombay - 2 Mar Surat - 12 May Mokha - 14 Jun Jeddah - 22 Aug Mokha - 15 Sep Surat - 14 Oct Bombay - 19 Nov Cannanore - 22 Nov Calicut - 25 Nov Cochin - 30 Nov Anjengo - 1 Feb 1763 Kedgeree - 1 Apr Ingeli - 8 Jun Calcutta - 10 Feb 1764 Culpee - 9 Mar Barrabulla - 26 Jun Mauritius - 29 Dec Bourbon - 21 Jan 1765 Cape - 27 Feb St Helena - 31 May Blackwall.'}}
```

#### Processing the History column

With that, you've developed the logic to process the RecordID, ShipName, DateRange, and unmodified History columns in the csv. That leads us on to processing the History column to extract some of the information within it into a more structured format, which will be the bulk of the rest of the exercise.
- Print out the first item from the `histories` list.
- Using your thinking from earlier, and the output format JSON specification, what are the two sections you need to break the History string into?
- Try some string methods to split the string up. Refer to the pre-work if you need a clue on what string method to use.

In [None]:
# Your code

##### Answer

We need to extract `ship_info` (everything before the voyages) and the `voyages` information. The easiest way to do this is to split around the "Voyages:" part of the string, so your split string should look something like:
```Python
['New Company, 46 crew. ',
 '(1) 1700/1 Mokha. Capt John Evans. Downs 24 Oct 1700 - 17 Nov Madeira - 5 Feb 1701 Cape - 20 Mar Madagascar - 15 Apr Johanna - 27 May Mokha 25 Aug - 9 Oct Socotra - 29 Dec Madagascar - 12 Mar 1702 Cape - 19 Apr St Helena - 27 Apr Ascension - 7 Jul Spithead.']
```
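
As a rough sketch of this split, the snippet below uses a shortened, made-up History string. `example_history` is a placeholder name, and `str.partition()` is shown as an optional, more forgiving alternative rather than part of the exercise: unlike `.split()`, it always returns three parts, so it won't fail to unpack if a record has no "Voyages:" section.

```python
# A shortened, made-up History string in the same shape as the real data.
example_history = "Chartered ship, 450 tons. Voyages: (1) 1818/9 Bengal. Capt Lucas Percival. Downs 27 May 1819 - 30 Sep Bengal."

# .split() returns a list of the pieces either side of the separator.
parts = example_history.split("Voyages: ")
ship_info = parts[0]
voyage_string = parts[1]
print(repr(ship_info))  # repr() shows the trailing space left behind by the split
print(voyage_string)

# .partition() always returns exactly three items: before, separator, after.
# If the separator is missing, the last two are empty strings.
before, separator, after = "No voyages recorded.".partition("Voyages: ")
print(repr(separator))  # an empty string tells us the marker was not found
```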

#### Beginning to deconstruct a voyage

Now you've split the History string into the parts that will make up the `ship_info` and `voyages` segments. `ship_info` was relatively straightforward, being just the part of the string before "Voyages:". The `voyages` section contains even more structured information.
- Assign the ship_info and voyages strings to variables so we can access them without having to repeatedly call `.split()`.
- The voyages string can contain multiple voyages. How can you split the string up to create individual voyages?
- Once you have an individual voyage, refer back to the output plan, what parts do you extract? How can you split the string up further?
- Finally, we are going to need to extract the individual stops for a voyage. How can you split this part of a voyage into its component parts?

In [None]:
# Your code

#### Regular Expressions

Now that you have got your head around how to split the string, we are going to introduce ways to do this cleverly, so we can retain some important information. A _regular expression_ is a pattern that computers can use to find things in strings. At the start of this program we imported a package called `re`, which is a standard Python library for working with regular expressions.

To use regular expressions, we create a string called a _pattern_ that uses a special syntax to indicate the patterns of characters we want to match in our target string. We are going to use regular expressions to find the voyage numbers (integers in round brackets), to split up the voyages (around the integers in round brackets), and later on to pull the stop date and stop location from each stop in a voyage.

The regular expression to match the voyage numbers is `r"\(\d{1,2}\)"`. Let's pick this apart:
- Include the `r` before the first `"` when writing regular expressions, you can read more in the [documentation](https://docs.python.org/3/howto/regex.html#the-backslash-plague).
- We're looking for sets of characters that look like `(1)`, or `(12)`, so we first need to match a `(`. We need to _escape_ a `(` using a backslash, because both `(` and `)` are special characters in regular expressions, and we need to tell the pattern we're looking for an actual `(`, not trying to use its special function.
- Then we need the digit that represents the voyage number. This could be a one or two digit number. Python makes available a shortcut `\d` to mean any digit.
- The `{1,2}` curly braces after `\d` indicate we want the digit to repeat one or two times (e.g. `3` or `14`)
- Finally we need the final round bracket (escaped again with a `\`), so `\)`
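
To check your understanding of the pieces above, here is a small sketch that tests the pattern against a few made-up strings. It uses `re.fullmatch()`, which only succeeds if the whole string fits the pattern; the test strings are invented for illustration.

```python
import re

pattern = r"\(\d{1,2}\)"

# re.fullmatch() returns a match object if the entire string fits the pattern, otherwise None.
print(re.fullmatch(pattern, "(1)") is not None)    # True: one digit in round brackets
print(re.fullmatch(pattern, "(12)") is not None)   # True: two digits
print(re.fullmatch(pattern, "(123)") is not None)  # False: {1,2} allows at most two digits
print(re.fullmatch(pattern, "12") is not None)     # False: the brackets are required
print(re.fullmatch(pattern, "(a)") is not None)    # False: \d only matches digits
```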

First let's find all the voyage numbers in the voyage part of your History string. The function is `re.findall()`, and it takes two arguments, first the pattern (copy it from the cell above), then the string to match the pattern against.
- Try running `re.findall()`, adding in the two arguments, and seeing how it finds all the numbers in round brackets.

In [None]:
# Your code

The second way we'll use `re` is to split up the different voyages. This is very similar to the `findall` above, but this time you're going to use `re.split()` with the same arguments.
- Try `re.split()` in the same way as `re.findall()`
- The result will have an empty string at the start, try indexing the result to exclude this. To drop the first item from a list you can index using `[1:]`

In [None]:
# Your code

##### Answers

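```Python
# voyage_string is the voyages part of your History string (everything after "Voyages: ")
re.findall(r"\(\d{1,2}\)", voyage_string)
re.split(r"\(\d{1,2}\)", voyage_string)[1:]  # [1:] drops the empty string at the start
```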
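
The sketch below runs both functions on a short made-up string so you can see the shape of the results, including the empty string at the start of the split. `toy_voyages` is a placeholder; your real input is the voyages part of a History string.

```python
import re

toy_voyages = "(1) 1700/1 Mokha. (2) 1703/4 Bengal. (3) 1706/7 China."

numbers = re.findall(r"\(\d{1,2}\)", toy_voyages)
pieces = re.split(r"\(\d{1,2}\)", toy_voyages)

print(numbers)     # ['(1)', '(2)', '(3)']
print(pieces)      # ['', ' 1700/1 Mokha. ', ' 1703/4 Bengal. ', ' 1706/7 China.']
print(pieces[1:])  # the same list without the leading empty string
```

Each piece keeps the space that followed its voyage number. The string `.strip()` method removes it, or you can include the space in the pattern.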

#### Deconstructing the voyages

Assign the split voyages to a variable, if you haven't already, and let's work with one of them. We're getting closer to extracting the fields we set out in the output specification: `voyage_number`, `duration`, `destination`, `captain`, `route`. We can get `voyage_number` using the `re.findall()` call earlier. `duration` should be the years the voyage took place over. `destination` should be the port/country the ship sailed to. `captain` is the named captain of the ship for that voyage, and `route` should be a list of the stops the ship made. For now each stop can be a string containing the date and location of the stop.
- Print out one of the strings representing a complete voyage
- Identify where in the string each of the pieces of information above lies.
- You now have the skills using the string `.split()` method to split out all of these pieces of information, though it might require two splits for some of them.
    - You could use `re.split()` for this but the patterns are consistent and simple enough that you shouldn't need to.
- Create a new dictionary (we can merge this into the main dictionary with `name` etc. later) from the deconstructed voyage string with keys for `duration`, `destination`, `captain`, `route`. We'll add `voyage_number` shortly.

In [None]:
# Your code

##### Answer

Your dictionary should look something like the one below. This one has been _pretty printed_ for readability, but the keys should be the same, and the values the same type. Note we haven't split the stops of the route into their dates and locations. This is one of the extension tasks.
```Python
{
    'duration': '1748/9',
    'destination': 'Bombay',
    'captain': 'Capt Benjamin Braund',
    'route': [
        'Downs 26 Mar 1749',
        '5 Jul Johanna',
        '2 Aug Bombay',
        '22 Sep Surat',
        '17 Nov Bandar Abbas',
        '23 Dec Bombay',
        '11 Feb 1750 Mangalore',
        '17 Feb Tellicherry',
        '19 Mar Socotra',
        '29 Mar Mokha',
        '27 Aug Bombay',
        '16 Jan 1751 Cape',
        '17 Feb St Helena',
        '4 Jun Gravesend'
    ]
}
```
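
If you want a head start on the extension task of splitting stops into dates and locations, here is one possible sketch using a regular expression. The pattern and the `split_stop` function are illustrative assumptions, not the course's reference solution: the pattern assumes dates look like `26 Mar` or `26 Mar 1749`, and treats whatever is left of the string as the location.

```python
import re

# Assumed date shape: one or two digits, a three-letter month, and optionally a four-digit year.
DATE_PATTERN = r"\d{1,2} [A-Z][a-z]{2}(?: \d{4})?"

def split_stop(stop):
    """Return (date, location) for one stop string; date is None if no date is found."""
    match = re.search(DATE_PATTERN, stop)
    if match is None:
        return None, stop.strip()
    date = match.group()
    # The location is whatever remains once the date is removed.
    location = (stop[:match.start()] + stop[match.end():]).strip()
    return date, location

print(split_stop("Downs 26 Mar 1749"))      # ('26 Mar 1749', 'Downs')
print(split_stop("5 Jul Johanna"))          # ('5 Jul', 'Johanna')
print(split_stop("11 Feb 1750 Mangalore"))  # ('11 Feb 1750', 'Mangalore')
```

This only picks up the first date in a stop. Stops with two dates, such as `21 Oct Whampoa 4 Feb 1759` in the sample, would need extra handling.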

### Creating an example output dictionary

You can now combine all the information we've gathered into a mock up of the final dictionary we'll produce for each `ship_id`. 
- Create a dictionary and assign the `ship_info` you extracted earlier and the `voyage` dictionary you've just created to keys of `ship_info` and `voyages`
    - Paying close attention to the output specification, how should the `voyages` key be formatted? A hint being that we expect there might be multiple voyages for the same ship.
- Assign this dictionary to a `processed_history` key in the dictionary you began in _Starting the output dictionary_

In [None]:
# Your code

##### Answer

Your output should look something like this. I've moved the `raw_history` key to the bottom as it clogs the view up otherwise.

```Python
{
    '045-001114649': {
        'name': 'Boscawen',
        'date_range': '1748-1765',
        'processed_history': {'ship_info': 'Rated at 499 tons, 26 guns, 99 crew. Principal Managing Owner: 4 Richard Crabb. ',
        'voyages': [{'duration': '1748/9',
         'destination': 'Bombay',
         'captain': 'Capt Benjamin Braund',
         'route': ['Downs 26 Mar 1749',
          '5 Jul Johanna',
          '2 Aug Bombay',
          '22 Sep Surat',
          '17 Nov Bandar Abbas',
          '23 Dec Bombay',
          '11 Feb 1750 Mangalore',
          '17 Feb Tellicherry',
          '19 Mar Socotra',
          '29 Mar Mokha',
          '27 Aug Bombay',
          '16 Jan 1751 Cape',
          '17 Feb St Helena',
          '4 Jun Gravesend']}]},
        'raw_history': 'Rated at 499 tons, 26 guns, 99 crew. Principal Managing Owner: 4 Richard Crabb. Voyages: (1) 1748/9 Bombay. Capt Benjamin Braund. Downs 26 Mar 1749 - 5 Jul Johanna - 2 Aug Bombay - 22 Sep Surat - 17 Nov Bandar Abbas - 23 Dec Bombay - 11 Feb 1750 Mangalore - 17 Feb Tellicherry - 19 Mar Socotra - 29 Mar Mokha - 27 Aug Bombay - 16 Jan 1751 Cape - 17 Feb St Helena - 4 Jun Gravesend. (2) 1752/3 Madras and China. Capt Benjamin Braund. Downs 27 Dec 1752 - 15 Mar 1753 Cape - 24 Jun Madras - 9 Sep Whampoa - 26 Dec Second Bar - 7 May St Helena - 22 Jul Erith. (3) 1756/7 Madras, Bengal and China. Capt Benjamin Braund. Downs 30 Jan 1757 - 16 Jul Madagascar - 5 Sep Madras - 28 Nov Balasore - 20 Dec Calcutta - 29 Mar 1758 Ingeli - 5 Jul Madras - 21 Oct Whampoa 4 Feb 1759 - 15 May off St Helena, unable to call because of French ships - 7 Jun San Salvador - 27 Dec Cork - 2 Feb Plymouth - 27 Mar Erith. (4) 1760/1 Bombay and Bengal. Capt Benjamin Braund. Portsmouth 26 May 1761 - 16 Aug Rio de Janeiro - 16 Dec Anjengo - 25 Dec Tellicherry - 9 Jan 1762 Goa - 19 Feb Bombay - 2 Mar Surat - 12 May Mokha - 14 Jun Jeddah - 22 Aug Mokha - 15 Sep Surat - 14 Oct Bombay - 19 Nov Cannanore - 22 Nov Calicut - 25 Nov Cochin - 30 Nov Anjengo - 1 Feb 1763 Kedgeree - 1 Apr Ingeli - 8 Jun Calcutta - 10 Feb 1764 Culpee - 9 Mar Barrabulla - 26 Jun Mauritius - 29 Dec Bourbon - 21 Jan 1765 Cape - 27 Feb St Helena - 31 May Blackwall.'
    }
}
```

### Iterating over the data using a `for` loop

#### Experimenting with loops

Congratulations! You've now worked your way down through the information contained in a ship's entry in the csv and worked out how to extract all the information we need. The final thing you need to do is iterate over all the information from the csv, each row/ship, and process the information for all of them.

We do this using a `for` loop. A `for` loop is a way of iterating over something and doing an operation on each thing. _For_ each item in an iterable list of things, do something with that item.

A for loop has a few components:
- `for`
- a label used to iterate over an 'iterable'
- `in`
- an 'iterable', something like a list that you can iterate through the items of
- a colon
- an indented block, where you do something with each item of the iterable in turn

In [None]:
for i in [0,1,2,3]:
    print(i)

In [None]:
# the label you use to iterate over the list can be anything
# it _should_ be something obvious for the thing you're iterating over
for name in names:
    print(name)

In [None]:
# You can do any kind of work inside the indented block during your for loop
for history in histories:
    print(history.split("Voyages: ")[0])

- Iterate through the date_ranges list, and print each item split about the hyphen

In [None]:
# Your code

To allow us to work with all the information we need at the same time we're going to introduce one more function that's handy for `for` loops, which is the `zip()` function. `zip()` lets us iterate over a few iterables at the same time.

In [None]:
for ship_id, name, date_range, history in zip(ship_ids, names, date_ranges, histories):
    # We can now do something with each item from each list at the same time
    print({ship_id: {"name": name, "date_range": date_range, "raw_history": history}})
    break  # break takes us out of the for loop after just one iteration, just to avoid printing masses of text below

This also allows us to iterate over voyage_numbers and voyage strings at the same time, which allows us to add voyage_number to our voyage dict. Remember this for the final part of the exercise.

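```Python
# voyage_string is the voyages part of a History string
voyage_numbers = re.findall(r"\(\d{1,2}\)", voyage_string)
voyages = re.split(r"\(\d{1,2}\)", voyage_string)[1:]  # [1:] drops the leading empty string
for number, voyage in zip(voyage_numbers, voyages):
    # Some processing code
    voyage_dict = {
        "voyage_number": number,
        "duration": ...,  # <voyage_duration>, from your processing code
        # ... the remaining keys
    }
```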
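
One thing to watch when pairing these two lists: `zip()` matches items by position and stops at the end of the shortest input. The toy sketch below (made-up values) shows how the leading empty string from `re.split()` would shift every pairing by one if it isn't dropped first.

```python
numbers = ["(1)", "(2)"]
pieces = ["", " 1700/1 Mokha. ", " 1703/4 Bengal."]  # the shape re.split() returns, empty string first

# Without dropping the empty string, each number is paired with the wrong piece.
misaligned = list(zip(numbers, pieces))
print(misaligned)
# [('(1)', ''), ('(2)', ' 1700/1 Mokha. ')]

# Dropping the first item lines the two lists up again.
aligned = list(zip(numbers, pieces[1:]))
print(aligned)
# [('(1)', ' 1700/1 Mokha. '), ('(2)', ' 1703/4 Bengal.')]
```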

#### Iterating over the input data

First we're going to run a for loop that works with the data in each of the lists without processing the History text any further, just to get used to how `for` loops work.
- Using the `for` loop scaffold below, iterate over all four lists, assigning the data to a simple version of the output dictionary
    - Which three key/value pairs will you be creating for this simple version of the dictionary?
- At the end of the loop, assign your output dictionary to `processed_ships_data` (defined as a blank dictionary at the top of the next cell) with the `ship_id` as its key

In [None]:
processed_ships_data = {}

In [None]:
for ship_id, name, date_range, history in zip(ship_ids, names, date_ranges, histories):
    # Create a ship data dictionary holding three of the output variables

    # At the end of each loop, assign your ship data dictionary as a key/value pair to `processed_ships_data` using the ship_id

##### Answer

Your `for` loop should look like this:
```Python
for ship_id, name, date_range, history in zip(ship_ids, names, date_ranges, histories):
    ship_data = {
        "name": name,
        "date_range": date_range,
        "raw_history": history
    }

    processed_ships_data[ship_id] = ship_data
```

#### Final Exercise: Processing History text

This is the final exercise of the task, and brings everything we've learned together. Give it your best shot, work in pairs, and ask the instructors to help you through the logic if you need it! 

Now we'll add in the processing steps you figured out above to convert History entries into structured data.

- Using the `for` loop scaffold below, iterate over all four lists, assigning the data to a simple version of the output dictionary as you did before
- Add processing steps for `history` to get the key/value pairs you worked out for `processed_history` above
- Assign these to the output dictionary within the loop
    - The big test here is needing to iterate over multiple voyages _for each ship_, this requires a second `for` loop inside the first `for` loop
- At the end of the loop, assign your output dictionary to `processed_ships_data` (defined as a blank dictionary at the top of the next cell) with the `ship_id` as its key

In [None]:
for ship_id, name, date_range, history in zip(ship_ids, names, date_ranges, histories):
    # Re-create the same dictionary you used for the more simple for loop just above

    # Create a dict that will hold the values for the processed_history

    # Split the ship_info and voyages out of history and assign ship_info to the processed_history dict

    # Create a list to hold the different voyages
    # re.findall the voyage numbers and re.split the individual voyages

    # Use a _nested_ for loop and zip() to iterate over the voyage numbers and individual voyages at the same time
    # For each voyage create a dict for the voyage that holds the voyage_number, duration, destination, captain, route
        # Use the splitting logic you developed earlier to pull out those pieces of information from each voyage string
        # Assign all the pieces of information to the right key in the dict for this voyage
        # Append the dict for this voyage to the overall voyages list

    # Assign the voyages list to the processed_history dict using 'voyages' as the key
    # Assign the processed_history dict to the ship data dict

    # At the end of each loop, assign your complete ship data dictionary to the processed_ships_data dict using the ship_id as the key

##### Answer

In [None]:
processed_ships_data = {}

In [None]:
for ship_id, name, date_range, history in zip(ship_ids, names, date_ranges, histories):
    ship_data = {
        "name": name,
        "date_range": date_range,
        "raw_history": history,
        "processed_history": {}  # Blank placeholder, not necessary, but indicates our intentions
    }

    voyages = []  # A list, matching the "voyages" key in the output specification
    ship_info, voyage_string = history.split("Voyages: ")
    processed_history = {}  # The dict we'll eventually assign to ship_data["processed_history"]
    processed_history["ship_info"] = ship_info

    voyage_numbers = re.findall(r"\(\d{1,2}\)", voyage_string)  # This finds any number in round brackets `(i)`
    raw_voyages = re.split(r"\(\d{1,2}\) ", voyage_string)[1:]  # First item in list is empty string due to split around first bracketed voyage number (1)

    for i, rv in zip(voyage_numbers, raw_voyages):
        voyage = {
            "voyage_number": i,
            "duration": "",
            "destination": "",
            "captain": "",
            "route": []
        }

        duration_dest, captain, route_str = rv.split(". ")[:3]
        # Split on the first space only, so destinations like "Madras and China" stay whole
        duration, destination = duration_dest.split(" ", 1)

        voyage["captain"] = captain
        voyage["destination"] = destination
        voyage["duration"] = duration
        voyage["route"] = route_str.split(" - ")

        voyages.append(voyage)

    processed_history["voyages"] = voyages

    ship_data["processed_history"] = processed_history

    # the ship_data dict is now complete, and we can assign it to processed_ships_data
    processed_ships_data[ship_id] = ship_data

### End of main task!

That's it! You're done. If you've made it this far during the Hack & Yack then huge congratulations! That was a lot of material to cover and you should be very proud of yourself. If you're working through this at a later date then still very well done, you've learnt a lot getting to this point <3.

You have covered the following learning objectives today and in the pre-work:

Covered in the pre-work (as well as today)
- Writing and running python code in a JupyterLab notebook
- Python variables and how to create them
- Creating and interacting with Python data types and data structures
    - String and integer data types
    - Lists
    - Dictionaries

Covered in this notebook
- Navigating the JupyterLab file system
- How to convert your approach to solving a problem into code
- Creating key/value pairs in dictionaries
- Using regular expressions to find matching strings of characters in text
- How to iterate over lists of things using for loops
- Exporting data to the filesystem as JSON

You can see a nicely printed example of what the output dictionary should look like below.

##### Answer

```json
{
	"045-001114649": {
		"name": "Boscawen",
		"dates": "1748-1765",
		"info": "Rated at 499 tons, 26 guns, 99 crew. Principal Managing Owner: 4 Richard Crabb.",
		"voyages": [
			{
				"voyage_number": "(1)",
				"duration": "1748/9",
				"destination": "Bombay",
				"captain": "Capt Benjamin Braund",
				"route": [
					{
						"26 Mar 1749": "Downs"
					},
					{
						"5 Jul": "Johanna"
					},
					{
						"2 Aug": "Bombay"
					},
					{
						"22 Sep": "Surat"
					},
					{
						"17 Nov": "Bandar Abbas"
					},
					{
						"23 Dec": "Bombay"
					},
					{
						"11 Feb 1750": "Mangalore"
					},
					{
						"17 Feb": "Tellicherry"
					},
					{
						"19 Mar": "Socotra"
					},
					{
						"29 Mar": "Mokha"
					},
					{
						"27 Aug": "Bombay"
					},
					{
						"16 Jan 1751": "Cape"
					},
					{
						"17 Feb": "St Helena"
					},
					{
						"4 Jun": "Gravesend"
					}
				]
			}
		]
	}
}
```

### Exporting to the filesystem

To save your dictionary to the filesystem you can use a format called JSON. Pass your ship dictionary to the `json.dump()` function below and run the cell. You should then see your output in the `data/processed` folder!

In [None]:
with open("../data/processed/ships_data.json", "w") as f:
    json.dump(your_ship_dict, f, indent="\t")  # replace your_ship_dict with the variable you assigned your output to
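
One way to check that an export worked is to read the file straight back in with `json.load()` and compare it with the dictionary you started with. The sketch below is only an illustration, not part of the exercise: it writes to a temporary folder rather than `data/processed`, and `example_ships` is a cut-down stand-in for your own dictionary.

```python
import json
import tempfile
from pathlib import Path

# Stand-in data for illustration; use your own ship dictionary in practice
example_ships = {"045-001114649": {"name": "Boscawen", "date_range": "1748-1765"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "ships_data.json"

    # Write the dictionary out as tab-indented JSON
    with open(out_path, "w", encoding="utf8") as f:
        json.dump(example_ships, f, indent="\t")

    # Read the same file back into a new dictionary
    with open(out_path, encoding="utf8") as f:
        reloaded = json.load(f)

print(reloaded == example_ships)
```

One thing to be aware of: JSON object keys are always strings, so a dictionary with integer keys comes back with string keys after a round trip.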

## Extensions

If you've made it this far, there are a few extensions you can choose between for the rest of your time.

### Extension - Using regexes to parse the stops

We can parse the stops on each voyage further into the location and the date of the stop. This requires some more complicated regular expressions to handle the possible variation in these strings. The [Python regex HOWTO](https://docs.python.org/3/howto/regex.html#) is a really good reference for this.

In [None]:
date_place_regex = re.compile(r"(?P<Date>\d{1,2} \w{3}( \d{4})?) (?P<Location>\b[\w\s]*\b)")
place_date_regex = re.compile(r"(?P<Location>\b[\w\s]*\b) (?P<Date>\d{1,2} \w{3} \d{4})")

These regular expressions use a few more features of Python's regular expression syntax.
- [Groups](https://docs.python.org/3/howto/regex.html#grouping) and [named groups](https://docs.python.org/3/howto/regex.html#non-capturing-and-named-groups)
- More metacharacters for [matching things](https://docs.python.org/3/howto/regex.html#matching-characters)
    - The `\w` metacharacter to mean any alphanumeric character or underscore (just like `\d` means any digit)
    - The `\b` metacharacter to mean a word boundary
    - The `\s` metacharacter to mean any whitespace character
- More metacharacters for [repeating things](https://docs.python.org/3/howto/regex.html#repeating-things)
    - The `*` metacharacter to mean repeat zero or more times, matching as many as possible
    - The `?` metacharacter to mean something is optional

- Use `date_place_regex.search()`/`place_date_regex.search()` to extract the date and place components of a voyage stop
- Refer to the HOWTO for guidance
- Add the date/place finding logic to your complete for loop from above
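
As a starting point, here is a minimal sketch of how named groups can be read from a match on a single stop. The helper name `parse_stop` is made up for this example, and trying the place/date pattern before the date/place pattern is an assumption on my part: tried the other way round, a stop like `Downs 27 May 1819` can partly match the date/place pattern, with the year ending up as the location.

```python
import re

date_place_regex = re.compile(r"(?P<Date>\d{1,2} \w{3}( \d{4})?) (?P<Location>\b[\w\s]*\b)")
place_date_regex = re.compile(r"(?P<Location>\b[\w\s]*\b) (?P<Date>\d{1,2} \w{3} \d{4})")


def parse_stop(stop):
    """Return {date: location} for one stop, or None if neither pattern matches."""
    # Try place/date first (assumption, see above), then fall back to date/place
    match = place_date_regex.search(stop) or date_place_regex.search(stop)
    if match is None:
        return None
    # Named groups are read with .group("GroupName")
    return {match.group("Date"): match.group("Location")}


for stop in ["Downs 27 May 1819", "30 Sep Bengal", "22 Mar St Helena"]:
    print(parse_stop(stop))
```

The function handles one stop at a time; fitting it into your for loop over the route is still the exercise.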

In [None]:
# Your code

### Extension - Processing the real data

The complete data set is much messier. Here are some of the inconsistencies in the voyage data that need handling if you parse it with code.

Types of `History` string:
 - Ship info and voyage info. Start with ship info then `Voyages:` and voyage info
 - Only voyage info, string starts with `Voyages:` and has voyage info only

The voyages part is typically a series of short voyage descriptions separated by voyage numbers in round brackets, e.g. `(1)`. There are variations.

Types of individual voyage string:
- A duration in years and a destination, then a captain, then text describing the stops on the voyage.

Types of voyage string inconsistency:
- No captain, just duration/destination then stops
- No stops, just duration/destination then captain
- No destination, just duration then captain/stops
- No captain or stops
- Poorly formatted: misplaced `.`, `-`
- Journey variation: wrecked, didn't return

At present, all `voyage_part_parse_failures` are due to a missing `.` between parts of the voyage.
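
To see why a missing `.` matters, here is a small sketch of the failure mode. Both example strings are made up, loosely modelled on the Boscawen record, and `split_voyage_parts` is a hypothetical helper rather than part of the parser below. Unpacking into three names raises a `ValueError` when the split produces fewer than three parts.

```python
def split_voyage_parts(raw_voyage):
    """Return (duration_dest, captain, route_str), or None if a part is missing."""
    parts = [p.strip() for p in raw_voyage.split(". ") if p.strip()]
    try:
        duration_dest, captain, route_str = parts[:3]
    except ValueError:
        # Fewer than three parts, e.g. because a "." separator is missing
        return None
    return duration_dest, captain, route_str


well_formed = "1748/9 Bombay. Capt Benjamin Braund. Downs 26 Mar 1749 - 5 Jul Johanna"
missing_dot = "1748/9 Bombay Capt Benjamin Braund. Downs 26 Mar 1749 - 5 Jul Johanna"

print(split_voyage_parts(well_formed))
print(split_voyage_parts(missing_dot))
```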

The duration/destination can also vary:
- Unhandled characters in the duration/destination text

I've extended the logic of the algorithm for processing text data to handle the messiness. You can read it below.

##### Complete parser

```Python
import re


def complete_parse(ships_df):
    """
    Parser developed to handle the majority of data formats found in the History column of the ships dataset
    """
    place_date_regex = re.compile(r"(?P<Location>[a-zA-Z\s']*\b)? ?(?P<Date>(\d{1,2}\s)?\w{3}(\s\d{4})?)?")
    date_place_regex = re.compile(r"(?P<Date>(\d{1,2}\s)?\w{3}(\s\d{4})?)? ?(?P<Location>\b[a-zA-Z\s'-]*\b)")
    duration_dest_regex = re.compile(r"(?P<Duration>\b[\d/-]*\b) ?(?P<Destination>[\s\w,&--'\(\)]*)?.?$")
    
    ship_voyages = []
    voyage_part_parse_failures = []
    dur_date_failures = []
    date_place_failures = []
    place_date_failures = []
    
    for ship_id, row in ships_df.iterrows():
        ship_info = {
            "name": row["CorporateName"],
            "dates": row["DateRange"],
            "info": "",
            "voyages": [],
            "raw_history": row["History"]
        }
    
        voyages = []
        if not isinstance(row["History"], str):  # e.g. NaN where pandas found no History value
            ship_info["info"] = "No history recorded"
            ship_voyages.append({ship_id: ship_info})
            continue
    
        if "Voyages: " in row["History"]:
            info, voyage_string = row["History"].split("Voyages: ")
            ship_info["info"] = info.strip()
        else:  # No voyage information
            ship_info["info"] = row["History"]
            ship_voyages.append({ship_id: ship_info})
            continue
        
        
        raw_voyages = [x.strip() for x in re.split(r"\(\d{1,2}\) ", voyage_string) if x]  # `if x` drops the empty string that re.split leaves before the first bracketed voyage number (1)
        for rv in raw_voyages:
            voyage = {
                "duration": "",
                "start_date": "",
                "end_date": "",
                "destination": "",
                "captain": "",
                "route": [],
                "parse_failure": False
            }
    
            voyage_parts = [x.strip() for x in rv.split(".") if x]
            # Reset for every voyage so values from the previous voyage are never reused
            captain, route_str = "", ""
            try:
                if ("Capt" in rv or "Master" in rv) and "-" in rv:
                    duration_dest, captain, route_str = voyage_parts[:3]
                elif ("Capt" in rv or "Master" in rv) and "-" not in rv:
                    duration_dest, captain = voyage_parts[:2]
                elif "-" in rv:
                    duration_dest, route_str = voyage_parts[:2]
                elif len(voyage_parts) == 2 and "-" not in rv:
                    duration_dest, route_str = voyage_parts
                elif "-" not in rv:
                    duration_dest = rv
            except ValueError:
                voyage_part_parse_failures.append((ship_id, rv))
                voyage["route"].append(rv)
                voyage["parse_failure"] = True
                voyages.append(voyage)
                continue
    
            try:
                dd_match = duration_dest_regex.match(duration_dest)
                duration, destination = dd_match.group("Duration"), dd_match.group("Destination")
            except AttributeError as e:
                dur_date_failures.append((ship_id, duration_dest))
                voyage["route"].append(rv)
                voyage["parse_failure"] = True
                voyages.append(voyage)
                continue
    
            voyage["captain"] = captain
            voyage["duration"] = duration
            voyage["destination"] = destination
    
            raw_stops = route_str.split(" - ")
            stops = []
    
            start_location, start_date = "", ""
            try:
                start = place_date_regex.search(raw_stops[0])
                start_location, start_date = start.group("Location"), start.group("Date")
                if start_location:
                    start_location = start_location.strip()
            except AttributeError:
                # `stop` is not defined yet at this point, so record the first raw stop instead
                stops.append({"Unparsed stop": raw_stops[0]})
                voyage["parse_failure"] = True
                place_date_failures.append((ship_id, raw_stops[0]))
            else:
                stops.append({start_date: start_location})

            voyage["start_date"] = start_date
    
            for stop in raw_stops[1:]:
                dp_match = date_place_regex.match(stop)
                if dp_match:
                    loc, date = dp_match.group("Location").strip(), dp_match.group("Date")
                    stops.append({date: loc})
                elif re.search(r"\d", stop):  # No date/place match; a digit suggests place/date format
                    pd_match = place_date_regex.match(stop)
                    pd_loc, pd_date = (pd_match.group("Location") or "").strip(), pd_match.group("Date")
                    if pd_date:
                        loc, date = pd_loc, pd_date
                        stops.append({date: loc})
                    else:
                        date_place_failures.append((ship_id, stop))
                        stops.append({"unable_to_date": stop})
                        voyage["parse_failure"] = True                       
                else:
                    date_place_failures.append((ship_id, stop))
                    stops.append({"unable_to_date": stop})
                    voyage["parse_failure"] = True    
    
            if len(voyage_parts) > 3:
                stops.extend({"Additional voyage": p} for p in voyage_parts[3:])
                
            voyage["route"] = stops
            voyage["end_date"] = next(iter(stops[-1]))  # The key of the last stop
    
            voyages.append(voyage)
    
        ship_info["voyages"] = voyages
    
        ship_voyages.append({ship_id: ship_info})

    return ship_voyages, voyage_part_parse_failures, dur_date_failures, date_place_failures, place_date_failures
```


#### Using the Complete Parser

In [None]:
import pandas as pd
from cfch.dataset import complete_parse
ships_df = pd.read_csv("../data/raw/ships.csv", index_col="RecordID", encoding="utf8")
complete_parse(ships_df)

### Extension - Can I just use an LLM?

Yes, and they can produce good results. I didn't suggest them at the start because this tutorial is about how to write Python, not how to prompt an LLM. Using LLMs for work tasks also raises a range of ethical considerations. Read the BL's AI Principles and explore a framework like the Library of Congress Labs [AI Planning Framework](https://libraryofcongress.github.io/labs-ai-framework/) to help you understand the benefits and risks of carrying out this work at scale.

Let's explore using an LLM as extra credit now you've done the bulk of your learning. LLMs are quite good at extracting structured data from unstructured text [references]. At the time of writing my impression is that Anthropic have the best governance processes, so open https://claude.ai and sign up for an account (~1 min). Then you can start putting in sections of the text and trying to get Claude to extract the data in a format similar to that above. Finding the right prompt is important, and is one of the skills needed to fruitfully interact with language models. Experiment yourself or make use of the one below, which I've adapted from [Matt Miller](https://thisismattmiller.com/post/using-gpt-on-library-collections/).

--- 

You are a helpful assistant that is extracting data from ship voyage information. You only answer using the text given to you. You do not make up additional information; the answer has to be contained in the text provided to you. Each voyage is a string of text.
You will structure your answer in valid JSON, extract the date in the format yyyy-mm-dd and the location the ship visited using the JSON keys dateVisited and location.

If the following text contains multiple voyages, extract each one into an array of valid JSON dictionaries. Each dictionary represents one of the entries:

Downs 27 May 1819 - 30 Sep Bengal - 29 Dec Narsipur - 3 Jan 1820 Madras - 22 Mar St Helena - 13 May East India Dock

---
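
Whatever the model returns, it is worth checking with code before you rely on it. The sketch below shows one possible check, assuming the response is a JSON array using the `dateVisited` and `location` keys requested in the prompt. The `example_response` string is hand-written for illustration and is not real model output, and `validate_stops` is a made-up helper name.

```python
import json
from datetime import date

REQUIRED_KEYS = {"dateVisited", "location"}


def validate_stops(response_text):
    """Parse an LLM response and check that every stop has the expected shape."""
    stops = json.loads(response_text)  # Raises json.JSONDecodeError (a ValueError) on invalid JSON
    for stop in stops:
        missing = REQUIRED_KEYS - set(stop)
        if missing:
            raise ValueError(f"Stop {stop} is missing keys: {sorted(missing)}")
        # Raises ValueError if this is not a valid ISO 8601 date such as 1819-05-27
        date.fromisoformat(stop["dateVisited"])
    return stops


# Hand-written stand-in for a model response (an assumption, not real output)
example_response = (
    '[{"dateVisited": "1819-05-27", "location": "Downs"},'
    ' {"dateVisited": "1819-09-30", "location": "Bengal"}]'
)
checked = validate_stops(example_response)
print(len(checked), "stops passed the checks")
```

A check like this only confirms the structure of the answer. It cannot tell you whether the model read the dates and places correctly, so comparing a sample against the source text by eye is still needed.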