Road map:

[ ] Create a prompt to annotate automatically parts of the text that are navigation instruction from Dan Cruickshank's book.
[ ] Baseline on my eval.
[ ] Feed the instructions to the model to fine tune it.
[ ] Run it on held out dataset.
# Extra things for the prompt:
[ ] Later extract not just nav instructions but all kinds of tags, this could train a bunch of special or interesting tasks like: "Identify narration", "Identify Historical Places", "Identify Historical Events tied to places"
and train them jointly.


In [44]:
NAV_TYPES = (
    """
T1. Navigation instructions always imply to walk one way or another with phrases.

One example of T1
```
One of the more charming sections stands round the corner from Tooley Street, in St Thomas Street and Crucifix Lane.
```
Another example is:
```
Fitting, then, that at the time the neighbouring 191 Bermondsey Street was the rectory for St Mary Magdalen church, which stands a little down the road.
```

""",
    """
T2. Navigation instructions are of different lengths and range over several
sentences. Make sure you capture the entirity of these sentence within the `Nav
Tags`.
""",
    """
T4. Often the instructions mix North, South East, West along with the other Types. These are valid and should be included in the `Nav Tags`. 

An example is:
```
From the south-western end of Shad Thames, go west along Tooley Street and cross Tower Bridge Road.
```
""",
    """
T5. Often the instructions include traveling across other streets, buildings or
junctions. These should be included in the `Nav Tags`. An example is:
```
Leave the churchyard by the gate on Abbey Street, which runs perpendicular to Bermondsey Street.
```
""",
    """

T3. *If and ONLY if* you find a valid navigation in T1,2,4,5,6 above, you may have a visual markers or features specified with the navigation instructions that can help a user but are useless in of themselves. Include these in those `Nav Tags`.

So an example of combination of T2 and T3:
```
The route to take, though, is the thoroughfare that intersects with Wheler
Street.  This is Quaker Street.  On the north side is a row of gabled former
railway warehouses dating from the late nineteenth century and now gutted and
being converted into an economy hotel; on the south side is an early
twentieth-century block of industrial dwellings, portions of the former Truman’s
Brewery, and a large interwar public-housing block named Wheler House.
```
The `gabled former railway warehouses dating from the late nineteenth century and now gutted and
being converted into an economy hotel` is the snippet that is a visual marker and should be included in the `Nav Tag`.

""",
)
START_TAG = "[NAV]"
END_TAG = "[/NAV]"
NUM_TYPES = len(NAV_TYPES)
NL = "\n"
LANGUAGE = "English"
REASON_START_TAG = "[REASON]"
REASON_END_TAG = "[/REASON]"
NUM_INSTRUCTIONS = 9

PROMPT = f"""
You are an expert at annotating {LANGUAGE} natural language text I give you with tags per my instructions below. 


# Context
I will give you some text from a Walking tour book referring to a specific route through the city of London. 
The task is to annotate parts of the text that describe specific navigation instructions.


# Instructions:
I want you to follow these instructions to do this:
1. Place a start and end tag {START_TAG}, {END_TAG} delimiting the navigation instruction. lets call these `Nav Tags` for future reference.

2.  Below are the Types of navigation tags that you need to use to annotate the text I provide.  The types are defined between the `----` delims for clarity:
----
{NL.join(NAV_TYPES)}
----

Lets call them `T1`, `T2`... `TN` for N navigation types - above we have N={NUM_TYPES}  -  I will use these to refer in the examples below.

3. The types of navigation instructions are not mutually exclusive. You may find multiple types in a single navigation instruction.

4. If the text only talks about a place but not how to get there or what to do there, it is not a navigation instruction and should not be annotated.

5. There are likely several of `Nav Tags` in the text so try your best to find them all.

6. Output the EXACT text that is between the `Nav Tags` including any punctuation, capitalization, and line breaks. DO NOT SUMMARIZE IT.

7. You MUST include a reason for each `Nav Tag` in the response be one of the {NUM_TYPES} types of navigation instructions and that the text between the `Nav Tags` satisfies that criteria.

8. Make sure each NAV tag delimited text is followed by its individual REASON tag. So if you output, say, N {START_TAG}...{END_TAG} pairs you should have N {REASON_START_TAG}...{REASON_END_TAG} pairs as well.

9. When you phrase the reason text don't mention T1, T2.. T{NUM_TYPES} instead just expand those reasons inline.

Make sure 200% that you follow **all** the {NUM_INSTRUCTIONS} instructions above to the letter. 


# Total examples of input and output to be given to you:

Here are the kind of input `Example text` the the Output I will expect from you.
Example text 1:
```
From the south-western end of Shad Thames, go west along Tooley Street and cross
Tower Bridge Road. This portion of the walk takes us through the mercantile hub
of Victorian Bermondsey – from London Bridge station, via the warehouses of
Bermondsey Street district, to the centre of London’s leather industry. Our
first port of call is the doleful wastes of Potters Field Park, now a windswept
and unlovely public space including a somewhat trampled lawn and ‘amphitheatre’.

```
Output:
```
{START_TAG}From the south-western end of Shad Thames, go west along Tooley Street and cross
Tower Bridge Road.{END_TAG}  
{REASON_START_TAG} The annotation was done  because it seems to have directions (N, S, E, W) and T5 as well where "cross Tower Bridge" follows the pattern of crossing a street. {REASON_END_TAG}
{START_TAG} Our first port of call is the doleful wastes of Potters Field Park, {END_TAG} 
{REASON_START_TAG}  The  annotation was done because it has an implicit walk direction to follow and stop at Potters field. {REASON_END_TAG}
```
Note how the each NAV tag was followed by its REASON tag.
Note that the second sentence in the Example text: above:
    > This portion of the walk takes us through the mercantile hub
      of Victorian Bermondsey – from London Bridge station, via the 
      warehouses of Bermondsey Street district, to the centre of London’s leather industry.

does not follow type T3 close enough and thus does not have the `Nav Tags`. 



Example text 2:
```
Walk south-west along More London Place, a geometrical and not unpleasing sliver
of a passage, which leads back to Tooley Street and the remains of a more
vigorous and muscular world. Tooley Street was a great mercantile thoroughfare
in the nineteenth century, lined with ware-houses, offices and railway
structures related to London Bridge station which sits, at high level,
immediately to its south.
```
Output:
```
{START_TAG}Walk south-west along More London Place, a geometrical and not unpleasing sliver
of a passage, which leads back to Tooley Street and the remains of a more
vigorous and muscular world.{END_TAG}
{REASON_START_TAG} This was annotated because it asks you to walk south-west towards tooly steet and 
 there is also a visual feature: `unpleasing sliver of a passage`  to guide you there{REASON_END_TAG}
```
Note that sentence 2 in the Example text is not telling you how to get to Tooley Street but instead is descibing something about offices
and railway structures and the London Bridge station; Even though it says "immediately to its south." its something to 
be seen but not a navigation instruction.

# Examples that will have no valid `Nav Tags`:

Example text 1:
```
When faced with a bomb packed with explosives, the relatively thin brickwork is
horribly vulnerable and easily penetrated; on the night of 25 October 1940,
during the Blitz, a bomb crashed through the roof a little to the east of this
spot, at the intersection of Tanner Street and Druid Street. Seventy-seven of
the people sheltering inside were killed.
```

Example text 2: 
```
Samuel Beazley, born 1786, was one of the most famed and productive individuals
in London theatre, writing nearly a hundred plays and designing and enlarging a
number of theatres, including two notable structures of the 1830s: the long-lost
neoclassical City of London Theatre in Norton Folgate, Spitalfields, and the
cast-iron Ionic colonnade that still embellishes the Drury Lane Theatre in
Covent Garden.
```

In BOTH the above examples the Output is empty because they match no navigation Types in the text satisfying 
the T1-T{NUM_TYPES} types mentioned. SKIP TEXT SNIPPETS and don't add delimiters to them.
"""

In [45]:
print(PROMPT)


You are an expert at annotating English natural language text I give you with tags per my instructions below. 


# Context
I will give you some text from a Walking tour book referring to a specific route through the city of London. 
The task is to annotate parts of the text that describe specific navigation instructions.


# Instructions:
I want you to follow these instructions to do this:
1. Place a start and end tag [NAV], [/NAV] delimiting the navigation instruction. lets call these `Nav Tags` for future reference.

2.  Below are the Types of navigation tags that you need to use to annotate the text I provide.  The types are defined between the `----` delims for clarity:
----

T1. Navigation instructions always imply to walk one way or another with phrases.

One example of T1
```
One of the more charming sections stands round the corner from Tooley Street, in St Thomas Street and Crucifix Lane.
```
Another example is:
```
Fitting, then, that at the time the neighbouring 191 Bermond

In [4]:
import dotenv

dotenv.load_dotenv("../.env")

True

In [52]:
dotenv

<module 'dotenv' from '/root/miniconda3/envs/mlx_week7/lib/python3.11/site-packages/dotenv/__init__.py'>

# Estimate Number of tokens

In [8]:
from nltk.tokenize import sent_tokenize

In [16]:
import nltk

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [24]:
def grab_and_clean_chapter(chapter_file):
    with open(chapter_file) as f:
        chapter = f.read()
    chapter = chapter.replace("\n", " ")
    return sent_tokenize(chapter)

In [26]:
regent_canal_text = grab_and_clean_chapter("./full_text_regents_canal.txt")

In [27]:
spitalfields_text = grab_and_clean_chapter("./full_text_spitalfields.txt")

In [28]:
greenwich_text_sentences = grab_and_clean_chapter("./full_text_greenwich.txt")

In [29]:
bermondsy_text = grab_and_clean_chapter("./full_text_bermondsy.txt")

In [20]:
from langchain_text_splitters import CharacterTextSplitter

# remove newline from within the sentences

In [40]:
# Lets batch the sentences into groups of 20 each.
train_chunks = []
test_chunks = []


def chunker(sentences, chunk_size=10, overlap=2):
    # Split the sentences into chunks of contiguous text with some overlap

    return [
        (" ".join(sentences[max(0, i - overlap) : i + chunk_size]))
        for i in range(0, len(regent_canal_text), chunk_size)
    ]

In [37]:
regent_canal_text

['Our route starts at the south side of Victoria Park, at Bonner Hall Bridge – best accessed by getting the Tube to Bethnal Green, a ten-minute walk away.',
 'The first portion of this route offers clues about the nature of east London before the arrival of the Regent’s Canal, taking in several buildings that predate the waterways, and whose fates were inexorably changed by its arrival.',
 'Before moving on, take a moment to reflect on Victoria Park.',
 'In the eighteenth century, this was all open pasture, interspersed with the odd brick kiln and market garden.',
 'The one notable feature was Bonner Hall, so called after the sixteenth-century bishop of London Edmund Bonner.',
 'All this was to change in the nineteenth century.',
 'As London expanded, calls for public parks grew; in 1840, Queen Victoria was presented with a petition signed by 30,000 residents.',
 'The Crown estate purchased 218 acres in the area and, over the next few years, converted it into Victoria Park.',
 'The par

In [41]:
train_chunks = (
    chunker(regent_canal_text)
    + chunker(spitalfields_text)
    + chunker(greenwich_text_sentences)
)
test_chunks = chunker(bermondsy_text)

In [42]:
len(train_chunks), len(test_chunks)

(63, 21)

In [10]:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT 4 tokenizer
len(enc.encode(all_text))

14127

In [None]:
len(enc.encode(PROMPT))

# Lets create the dataset!

In [15]:
# Chunk the text by paras
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2200, chunk_overlap=0)
chunks = text_splitter.split_text(all_text)
print(len(chunks))  # As expected bu visual inspection.
chunks[0:3]

35


['Our route starts at the south side of Victoria Park, at Bonner Hall Bridge –\nbest accessed by getting the Tube to Bethnal Green, a ten-minute walk away. The\nfirst portion of this route offers clues about the nature of east London before\nthe arrival of the Regent’s Canal, taking in several buildings that predate the\nwaterways, and whose fates were inexorably changed by its arrival. Before moving\non, take a moment to reflect on Victoria Park. In the eighteenth century, this\nwas all open pasture, interspersed with the odd brick kiln and market garden. The one notable feature\nwas Bonner Hall, so called after the sixteenth-century bishop of London Edmund\nBonner. All this was to change in the nineteenth century. As London expanded,\ncalls for public parks grew; in 1840, Queen Victoria was presented with a\npetition signed by 30,000 residents. The Crown estate purchased 218 acres in the\narea and, over the next few years, converted it into Victoria Park. The park\nshares a family re

In [14]:
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI


# model = "gpt-4"
def from_chunk(chunk, prompt, model="gpt-3.5-turbo-0125"):
    chat = ChatOpenAI(model=model, temperature=0.1)
    return chat.invoke(
        [
            SystemMessage(content=prompt),
            HumanMessage(
                content=f"Make sure 200% you follow **all** the {NUM_INSTRUCTIONS} instructions above. Add `Nav tags` and the `Reason tags` to the following text using the instructions given: \n\n"
                + chunk,
            ),
        ]
    )

## GPT4 annotations

In [18]:
from tqdm import tqdm

In [None]:
four_annots = []

In [None]:
for chunk_4 in tqdm(chunks):
    four_annots.append(from_chunk(chunk_4, PROMPT, model="gpt-4"))

In [None]:
len(four_annots)

### Test annotations for GPT 4

In [16]:
test_chunks = text_splitter.split_text(all_test_text)

In [None]:
four_test_annotations = []

In [22]:
for test_chunk in tqdm(test_chunks[1:]):
    four_test_annotations.append(from_chunk(test_chunk, PROMPT, model="gpt-4"))

100%|██████████| 28/28 [05:54<00:00, 12.65s/it]


### Serialize the test annotations.


In [23]:
import json
from langchain_core.messages import AIMessage


def serialize_raw_annotations(
    annotations: list[AIMessage], data: list[str], filename: str
):
    # Save the gpt-four annotations
    serialized_data_gpt_four = []

    with open(filename, "w") as f:
        for i, (four_annot, chunk) in enumerate(zip(annotations, data)):
            serialized_data_gpt_four.append(
                {
                    "chunk": chunk,
                    "navs": four_annot.content,
                }
            )
            # print('\n\n--')
        json.dump(serialized_data_gpt_four, f, indent=2, ensure_ascii=False)

In [24]:
serialize_raw_annotations(
    four_test_annotations, test_chunks, "gpt_four_annotations_test.json"
)

In [25]:
json.load(open("gpt_four_annotations.json"))

[{'chunk': 'Our route starts at the south side of Victoria Park, at Bonner Hall Bridge – best accessed by getting the Tube to Bethnal Green, a ten-minute walk away. The first portion of this route offers clues about the nature of east London before the arrival of the Regent’s Canal, taking in several buildings that predate the waterways, and whose fates were inexorably changed by its arrival. Before moving on, take a moment to reflect on Victoria Park. In the eighteenth century, this was all open pasture, interspersed with the odd brick kiln and market garden. The one notable feature was Bonner Hall, so called after the sixteenth-century bishop of London Edmund Bonner. All this was to change in the nineteenth century. As London expanded, calls for public parks grew; in 1840, Queen Victoria was presented with a petition signed by 30,000 residents. The Crown estate purchased 218 acres in the area and, over the next few years, converted it into Victoria Park. The park shares a family rese

In [21]:
print(four_test_annotations[0].content)

[NAV]This walk starts at one of the most famous landmarks in Britain: Tower
Bridge. [/NAV]
[REASON]This was annotated because it gives an implicit direction to start the walk at Tower Bridge.[/REASON]
[NAV]ascend from Tower Hill tube and
stroll south across the bridge.[/NAV]
[REASON]This was annotated because it gives explicit directions to ascend from Tower Hill tube and stroll south across the bridge.[/REASON]
[NAV]There
was, for practical reasons, only one site possible – a strip of land just east
of the Tower of London – [/NAV]
[REASON]This was annotated because it gives an implicit direction to a specific location east of the Tower of London.[/REASON]


## GPT 3.5 Annotations

In [None]:
three_5_annots = []

In [None]:
for chunk_3_5 in tqdm(chunks):
    three_5_annots.append(from_chunk(chunk_3_5, PROMPT, model="gpt-3.5-turbo-0125"))

In [49]:
len(test_chunks)

29

## Compare to GPT4 to GPT3 annots

In [None]:
comp_idx = 0

In [None]:
print(three_5_annots[comp_idx].content)

In [None]:
import re


def pp(content):
    nav_splits = re.findall(rf"{START_TAG}(.*?){END_TAG}", content, flags=re.DOTALL)
    reason_splits = re.findall(
        rf"{REASON_START_TAG}(.*?){REASON_END_TAG}", content, flags=re.DOTALL
    )
    assert len(nav_splits) == len(
        reason_splits
    ), f"Nav splits {len(nav_splits)} and reason splits {len(reason_splits)} are not equal"
    print(f"{len(nav_splits)} Nav splits and {len(reason_splits)} Reason splits")
    # print(nav_splits)
    # print("\n\n")
    # print(reason_splits)
    # print("...\n\n")
    for i, (nav_split, reason_split) in enumerate(zip(nav_splits, reason_splits)):
        if nav_split:
            print(f"Nav {i}: {nav_split.strip()}")
            print(f"Reason {i}: {reason_split.strip()}")
            print("\n")


def pp_gt_four(content):
    nav_splits = re.findall(rf"{START_TAG}(.*?){END_TAG}", content, flags=re.DOTALL)
    for nav_split in nav_splits:
        if nav_split:
            print(f"Nav: {nav_split.strip()}\n")
    reason_splits = re.findall(
        rf"{REASON_START_TAG}(.*?){REASON_END_TAG}", content, flags=re.DOTALL
    )
    for reason_split in reason_splits:
        if reason_split:
            print(f"Reason: {reason_split.strip()}")

In [None]:
comp_idx = 9
pp(three_5_annots[comp_idx].content)

In [None]:
pp_gt_four(four_annots[comp_idx].content)
summary = """
--
for comp_idx = 0
GPT4 gets 6 navigations all of which are valid.
GPT3.5 gets 4 and 3/4 look valid 

GPT 4 
Precision = 6/6 = 1
GPT 3.5
Precision = 3/4


--
for comp_idx = 1    
GPT 4 retrievs 4 and gets all 4 seem valid.
GPT 3.5 gets 3 and all are valid, it gets one additional that GPT4 misses

GPT 4
Precision = 4/4 = 1
GPT 3.5
Precision = 3/3 = 1

--
for comp_idx = 2
Both retrieve a valid nav.

Precision = 1/1 = 1 for both

--
for comp_idx = 3
GPT 4 gets 4 navs all are valid. NOTE: the reason for the fourth one(the bridge) is a bit off ( but still valid) because it mentions a bridge but not the direction to get there. 
GPT3 gets 3 navs and all are valid.

GPT 4:
Precision = 4/4 = 1
GPT 3.5:
Precision = 3/3 = 1

--
for comp_idx = 4
GPT 4 gets 6 navs all are valid.
GPT3.5 gets 4 navs. 2 navs talk about a place that existed in the past but not now. 

GPT 4:
Precision = 6/6 = 1
GPT 3.5:
Precision = 2/4 = 0.5

NOTE: GPT4 does not point out visual markers as much as GPT3.5 does. 4 always says T1 and no other mixed type.

--
for comp_idx = 5
GPT 4 gets 7 navs all are valid. 
GPT 3.5 gets 5 navs and all are valid. 

GPt 4:
Precision = 7/7 = 1
GPT 3.5:
Precision = 5/5 = 1

--
for comp_idx = 6
GPT 4 gets 4 navs and all are valid. It combines the first two navs, which is as it should be, 3.5 splits them up.
GPT 3.5 gets 5 navs and all are valid.

GPT 4:
Precision = 4/4 = 1
GPT 3.5:
Precision = 5/5 = 1

--
for comp_idx = 7
GPT 4 selects only 3 navs and all of them adre valid. NOTE It now grabs visual markers as well.
GPT 3.5 select 7 navs from the text and only 2 were valid with the others either not having a navigation instrcution or was a object in the distant past. 

GPT 4:
Precision = 3/3 = 1
GPT 3.5:
Precision = 2/7 = 0.2857

--
for comp_idx = 8
GPT 4 selects no navs.
GPT 3.5 selects one navigation but it is partially valid as it has no nav instructions and mentions a bridge. 

Precision is 1 for GPT 4 and 0 GPT 3.5

NOTE at this point 4 beings to use other Types of navigation instructions and not just T1.

--
for comp_idx = 9
GPT 4 selects 2 navs and both are valid.
GPT 3.5 gets 3 navs and all are valid. it gets one additional nav that GPT4 misses.

Precision is 1 for both GPT 4 and GPT 3.5
-----------------------------------------

Average Precision for GPT 4     = 1 + 1 + 1 + 1 + 1 + 1 + 1      + 1 + 1 + 1 = 10/10 = 1
Average Precision for GPT 3.5   =.5 + 1 + 1 +.5 + 1 + 1 + 0.2857 + 1 + 0 + 1 = 7.2857/10 = 0.72857

Total number of navs spotted by GPT  4 = 6+4+1+4+6+7+4+3+0+2 = 37  but all are valid giving a near perfect recall
Total number of navs spotted by GPT3.5 = 4+3+1+3+4+5+5+7+1+3 = 36  but the valud are 3+3+1+3+2+5+5+1+0+3 = 26 with a recall of 26/37 = 0.7027
"""

In [None]:
print(chunks[comp_idx])

In [None]:
print(chunks[comp_idx])

In [None]:
with open("./full_text_bermondsy.txt") as f:
    test_text = f.read()

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2200, chunk_overlap=0)
test_chunks = text_splitter.split_text(test_text)
print(len(test_chunks))

## Serialize the test dataset


In [None]:
import json

test_dataset = []

for chunk in test_chunks:
    test_dataset.append({"chunk": chunk, "navs": chunk["navs"]})
json.dump(test_dataset, open("test_dataset.json", "w"))

## Serialize the dataset

In [None]:
# Save the gpt-four annotations
import json

serialized_data_gpt_four = []

with open("./gpt_four_annotations.txt", "w") as f:
    for i, (four_annot, chunk) in enumerate(zip(four_annots, chunks)):
        serialized_data_gpt_four.append(
            {
                "chunk": chunk,
                "navs": four_annot.content,
            }
        )
        # print('\n\n--')
    json.dump(serialized_data_gpt_four, f, indent=2, ensure_ascii=False)

In [None]:
print(four_annots[-1].content)

In [None]:
from datasets import Dataset

walking_tour_dataset = Dataset.from_list(serialized_data_gpt_four)

In [None]:
# Remove reason and only get nav. OR just try expand the dataset.
walking_tour_dataset