Road map:

[ ] Create a prompt to automatically annotate the parts of the text that are navigation instructions in Dan Cruickshank's book.
[ ] Baseline on my eval.
[ ] Feed the instructions to the model to fine-tune it.
[ ] Run it on a held-out dataset.
# Extra things for the prompt:
[ ] Later, extract not just nav instructions but all kinds of tags. These could be used to jointly train a bunch of special or interesting tasks, like "Identify narration", "Identify Historical Places" and "Identify Historical Events tied to places".


In [137]:
NAV_TYPES = (
    """
T1. Navigation instructions always imply to walk one way or another with phrases.

One example of T1
```
One of the more charming sections stands round the corner from Tooley Street, in St Thomas Street and Crucifix Lane.
```
Another example is:
```
Fitting, then, that at the time the neighbouring 191 Bermondsey Street was the rectory for St Mary Magdalen church, which stands a little down the road.
```

""",

    """
T2. Navigation instructions are of different lengths and can range over several
sentences. Make sure you capture the entirety of these sentences within the `Nav
Tags`.
""",
   
"""
T4. Often the instructions mix North, South, East and West along with the other types. These are valid and should be included in the `Nav Tags`.

An example is:
```
From the south-western end of Shad Thames, go west along Tooley Street and cross Tower Bridge Road.
```
""",
    
"""
T5. Often the instructions include traveling across other streets, buildings or
junctions. These should be included in the `Nav Tags`. An example is:
```
Leave the churchyard by the gate on Abbey Street, which runs perpendicular to Bermondsey Street.
```
"""
,

"""

T3. *If and ONLY if* you find a valid navigation instruction of type T1, T2, T4 or T5 above, it may come with visual markers or features that can help a user but are useless in and of themselves. Include these in the `Nav Tags`.

An example of a combination of T2 and T3:
```
The route to take, though, is the thoroughfare that intersects with Wheler
Street.  This is Quaker Street.  On the north side is a row of gabled former
railway warehouses dating from the late nineteenth century and now gutted and
being converted into an economy hotel; on the south side is an early
twentieth-century block of industrial dwellings, portions of the former Truman’s
Brewery, and a large interwar public-housing block named Wheler House.
```
The `gabled former railway warehouses dating from the late nineteenth century and now gutted and
being converted into an economy hotel` is the snippet that is a visual marker and should be included in the `Nav Tag`.

""",
)
START_TAG = "<NAV>"
END_TAG = "</NAV>"
NUM_TYPES = len(NAV_TYPES)
NL = "\n"
LANGUAGE = "English"
REASON_START_TAG = "<REASON>"
REASON_END_TAG = "</REASON>"
NUM_INSTRUCTIONS = 7

PROMPT = f"""
You are an expert at annotating {LANGUAGE} natural language text I give you with tags per my instructions below. 


# Context
I will give you some text from a Walking tour book referring to a specific route through the city of London. 
The task is to annotate parts of the text that describe specific navigation instructions.


# Instructions:
I want you to follow these instructions to do this:
1. Place a start and end tag {START_TAG}, {END_TAG} delimiting the navigation instruction. lets call these `Nav Tags` for future reference.

2.  Below are the Types of navigation tags that you need to use to annotate the text I provide.  The types are defined between the `----` delims for clarity:
----
{NL.join(NAV_TYPES)}
----

Let's call them `T1`, `T2`... `TN` for N navigation types (above we have N={NUM_TYPES}). I will use these names in the examples below.

3. The types of navigation instructions are not mutually exclusive. You may find multiple types in a single navigation instruction.

4. If the text only talks about a place but not how to get there or what to do there, it is not a navigation instruction and should not be annotated.

5. There are likely several `Nav Tags` in the text, so try your best to find them all.

6. Output the EXACT text that is between the `Nav Tags`, including any punctuation, capitalization, and line breaks. DO NOT SUMMARIZE IT.

7. You MUST include a reason for each `Nav Tag` in the response: name which of the {NUM_TYPES} types of navigation instructions it matches and explain why the text between the `Nav Tags` satisfies that criterion.

Make sure 200% that you follow **all** the {NUM_INSTRUCTIONS} instructions above to the letter. 


# Total examples of input and output to be given to you:

Here is the kind of input `Example text` and the Output I will expect from you.
Example text 1:
```
From the south-western end of Shad Thames, go west along Tooley Street and cross
Tower Bridge Road. This portion of the walk takes us through the mercantile hub
of Victorian Bermondsey – from London Bridge station, via the warehouses of
Bermondsey Street district, to the centre of London’s leather industry. Our
first port of call is the doleful wastes of Potters Field Park, now a windswept
and unlovely public space including a somewhat trampled lawn and ‘amphitheatre’.

```
Output:
```
{START_TAG}From the south-western end of Shad Thames, go west along Tooley Street and cross
Tower Bridge Road.{END_TAG}  
{START_TAG} Our first port of call is the doleful wastes of Potters Field Park, {END_TAG} 
{REASON_START_TAG} The first annotation was made because it most closely follows T4, as it has compass directions (N, S, E, W), and T5 as well, since "cross Tower Bridge Road" follows the pattern of crossing a street.
The second annotation was made because it most closely follows T1: it's an implicit walk and stop at Potters Field Park. {REASON_END_TAG}
```
Note that the second sentence in the Example text above:
    > This portion of the walk takes us through the mercantile hub
      of Victorian Bermondsey – from London Bridge station, via the 
      warehouses of Bermondsey Street district, to the centre of London’s leather industry.

does not follow type T3 closely enough and thus does not have the `Nav Tags`.


Example text 2:
```
Walk south-west along More London Place, a geometrical and not unpleasing sliver
of a passage, which leads back to Tooley Street and the remains of a more
vigorous and muscular world. Tooley Street was a great mercantile thoroughfare
in the nineteenth century, lined with ware-houses, offices and railway
structures related to London Bridge station which sits, at high level,
immediately to its south.
```
Output:
```
{START_TAG}Walk south-west along More London Place, a geometrical and not unpleasing sliver
of a passage, which leads back to Tooley Street and the remains of a more
vigorous and muscular world.{END_TAG}
{REASON_START_TAG} This was annotated because it follows not just T5 but also T3
since there is a visual feature `unpleasing sliver of a passage` *along* with
the instructions. {REASON_END_TAG}
```
Note that sentence 2 in the Example text is not telling you how to get to Tooley Street but is instead describing offices, railway structures and London Bridge station.
Even though it says "immediately to its south.", it's something to be seen, not a navigation instruction.

# Examples that will have no valid `Nav Tags`:

Example text 1:
```
When faced with a bomb packed with explosives, the relatively thin brickwork is
horribly vulnerable and easily penetrated; on the night of 25 October 1940,
during the Blitz, a bomb crashed through the roof a little to the east of this
spot, at the intersection of Tanner Street and Druid Street. Seventy-seven of
the people sheltering inside were killed.
```

Example text 2: 
```
Samuel Beazley, born 1786, was one of the most famed and productive individuals
in London theatre, writing nearly a hundred plays and designing and enlarging a
number of theatres, including two notable structures of the 1830s: the long-lost
neoclassical City of London Theatre in Norton Folgate, Spitalfields, and the
cast-iron Ionic colonnade that still embellishes the Drury Lane Theatre in
Covent Garden.
```

In BOTH the above examples the Output is empty because the text matches none of the
T1-T{NUM_TYPES} navigation types mentioned. SKIP SUCH TEXT SNIPPETS and don't add delimiters to them.
"""
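
Instruction 6 asks for the EXACT text between the `Nav Tags`, which is easy to check mechanically once a response comes back. Below is a minimal sketch of such a check. `normalize_ws` and `non_verbatim_spans` are hypothetical helper names (not part of this notebook), and the tags are assumed to be the `<NAV>`/`</NAV>` values of `START_TAG`/`END_TAG`. Whitespace is collapsed before comparing, so a span that differs only in line wrapping still counts as verbatim.

```python
import re

# Assumption: the tags match START_TAG / END_TAG defined above.
NAV_RE = re.compile(r"<NAV>(.*?)</NAV>", flags=re.DOTALL)


def normalize_ws(text: str) -> str:
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def non_verbatim_spans(chunk: str, annotation: str) -> list[str]:
    """Return the tagged spans that do not occur word for word in `chunk`."""
    source = normalize_ws(chunk)
    spans = [normalize_ws(span) for span in NAV_RE.findall(annotation)]
    return [span for span in spans if span not in source]


# Toy data for illustration only.
chunk = "Walk south-west along More London Place. Tooley Street was a great thoroughfare."
verbatim = "<NAV>Walk south-west along\nMore London Place.</NAV>"
paraphrased = "<NAV>Walk southwest on More London Place.</NAV>"

print(non_verbatim_spans(chunk, verbatim))     # []
print(non_verbatim_spans(chunk, paraphrased))  # ['Walk southwest on More London Place.']
```

An empty list means every tagged span was copied from the chunk; anything else is a span the model summarized or reworded.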

In [216]:
print(PROMPT)


You are an expert at annotating English natural language text I give you with tags per my instructions below. 


# Context
I will give you some text from a Walking tour book referring to a specific route through the city of London. 
The task is to annotate parts of the text that describe specific navigation instructions.


# Instructions:
I want you to follow these instructions to do this:
1. Place a start and end tag <NAV>, </NAV> delimiting the navigation instruction. lets call these `Nav Tags` for future reference.

2.  Below are the Types of navigation tags that you need to use to annotate the text I provide.  The types are defined between the `----` delims for clarity:
----

T1. Navigation instructions always imply to walk one way or another with phrases.

One example of T1
```
One of the more charming sections stands round the corner from Tooley Street, in St Thomas Street and Crucifix Lane.
```
Another example is:
```
Fitting, then, that at the time the neighbouring 191 Bermond

In [6]:
import dotenv

dotenv.load_dotenv("../.env")

True

In [None]:
dotenv

# Estimate Number of tokens

In [84]:
with open("./full_text_regents_canal.txt") as f:
    regent_canal_text = f.read()
from langchain_text_splitters import CharacterTextSplitter

In [85]:
len(regent_canal_text.split())

5095

In [83]:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT 4 tokenizer
len(enc.encode(regent_canal_text))

6871
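
The two counts above give a rough tokens-per-word ratio for this text, and the same numbers help estimate how many input tokens a full annotation pass sends. A small sketch of that arithmetic follows. `estimate_input_tokens` is a hypothetical helper, and the prompt size passed to it is a placeholder, since the prompt itself is not tokenized anywhere above. The short per-chunk reminder in the human message is ignored.

```python
WORDS = 5095   # len(regent_canal_text.split()) from the cell above
TOKENS = 6871  # len(enc.encode(regent_canal_text)) from the cell above


def tokens_per_word(tokens: int, words: int) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    return tokens / words


def estimate_input_tokens(text_tokens: int, prompt_tokens: int, n_chunks: int) -> int:
    """The system prompt is re-sent with every chunk, so it counts n_chunks times."""
    return text_tokens + prompt_tokens * n_chunks


print(f"{tokens_per_word(TOKENS, WORDS):.2f} tokens per word")  # 1.35 tokens per word

# prompt_tokens=1500 is a made-up placeholder, not a measured value.
print(estimate_input_tokens(TOKENS, prompt_tokens=1500, n_chunks=10))  # 21871
```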

# Let's create the dataset!

In [99]:
# Chunk the text by paras
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=0)
chunks = text_splitter.split_text(regent_canal_text)
print(len(chunks))  # As expected by visual inspection.
chunks

10


['Our route starts at the south side of Victoria Park, at Bonner Hall Bridge –\nbest accessed by getting the Tube to Bethnal Green, a ten-minute walk away. The\nfirst portion of this route offers clues about the nature of east London before\nthe arrival of the Regent’s Canal, taking in several buildings that predate the\nwaterways, and whose fates were inexorably changed by its arrival. Before moving\non, take a moment to reflect on Victoria Park. In the eighteenth century, this\nwas all open pasture, interspersed with the odd brick kiln and market garden. The one notable feature\nwas Bonner Hall, so called after the sixteenth-century bishop of London Edmund\nBonner. All this was to change in the nineteenth century. As London expanded,\ncalls for public parks grew; in 1840, Queen Victoria was presented with a\npetition signed by 30,000 residents. The Crown estate purchased 218 acres in the\narea and, over the next few years, converted it into Victoria Park. The park\nshares a family re
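
Visual inspection can be backed up with a quick sanity check on the chunks. The sketch below is an assumption-laden helper (`check_chunks` is a hypothetical name): it assumes `chunk_overlap=0` and that the splitter only ever breaks on whitespace, which would not hold if a single word were longer than `chunk_size`.

```python
def check_chunks(text: str, chunks: list[str], chunk_size: int) -> None:
    """Raise ValueError if a chunk is too long or the chunks lose or reorder words."""
    too_long = [i for i, chunk in enumerate(chunks) if len(chunk) > chunk_size]
    if too_long:
        raise ValueError(f"chunks longer than {chunk_size} characters: {too_long}")

    # Compare word sequences so that stripped or re-wrapped whitespace is ignored.
    original_words = text.split()
    chunk_words = [word for chunk in chunks for word in chunk.split()]
    if chunk_words != original_words:
        raise ValueError("chunks do not reproduce the text word for word")


# Toy data for illustration only.
check_chunks("one two three four five six", ["one two three", "four five six"], chunk_size=13)
print("chunks look consistent")
```

In the notebook this would be called as `check_chunks(regent_canal_text, chunks, chunk_size=4000)`.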

In [102]:
from langchain_openai import ChatOpenAI




In [148]:
from langchain_core.messages import HumanMessage, SystemMessage

# model = "gpt-4"
def from_chunk(chunk, prompt, model="gpt-3.5-turbo-0125"):
    chat = ChatOpenAI(model=model, temperature=0.1)
    return chat.invoke(
        [
            SystemMessage(content=prompt),
            HumanMessage(
                content=f"Make sure 200% you follow **all** the {NUM_INSTRUCTIONS} instructions above. Add `Nav tags` to the following text following the instructions given: \n\n"
                + chunk,
            ),
        ]
    )

In [None]:
chunks[0]

In [146]:
four_annots = []

In [147]:
from tqdm import tqdm
for chunk_4 in tqdm(chunks):
    four_annots.append(from_chunk(chunk_4, PROMPT, model="gpt-4"))

100%|██████████| 9/9 [02:26<00:00, 16.30s/it]


In [161]:
# Save the GPT-4 annotations.
# The file is opened in append mode, so re-running this cell adds duplicates.
with open("./gpt_four_annotations.txt", "a") as f:
    for four_annot in four_annots:
        f.write(four_annot.content + "\n\n")
        print(four_annot.content)
        print("\n\n")

<NAV>Our route starts at the south side of Victoria Park, at Bonner Hall Bridge –
best accessed by getting the Tube to Bethnal Green, a ten-minute walk away.</NAV>
<NAV>Before moving
on, take a moment to reflect on Victoria Park.</NAV>
<NAV>In particular the Pavilion Café, a few hundred metres to the east of
Bonner Hall Bridge, was a grisly affair, best avoided unless one wanted to
experience the stark reality of East End cuisine.</NAV>
<NAV>Now, like many in east London,
it has transformed into one of the best cafés mentioned in this book. It makes a
splendid place to begin the walk if you’re in need of a bite to eat or drink.</NAV>
<NAV>For those ready to plunge into Georgian industrial architecture, however, it is
best to walk to the western end of the park and through the canal gate on to the
towpath.</NAV>
<NAV>When one crosses under
the bridge carrying Cambridge Heath Road, everything changes.</NAV>

<REASON> The first annotation was done because it most closely follows T1 as it 
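
For the fine-tuning step on the road map it may be handier to keep each chunk next to its annotation rather than in one flat text file. A minimal sketch, assuming a JSON Lines layout; the function name `save_annotations_jsonl` and the record keys `chunk_id`, `text` and `annotation` are illustrative choices, not something the notebook or any fine-tuning API prescribes.

```python
import json


def save_annotations_jsonl(chunks: list[str], annotations: list[str], path: str) -> None:
    """Write one JSON record per chunk, pairing the source text with its annotation."""
    if len(chunks) != len(annotations):
        raise ValueError(f"{len(chunks)} chunks but {len(annotations)} annotations")
    # "w" overwrites the file, so re-running does not duplicate records.
    with open(path, "w", encoding="utf-8") as f:
        for i, (chunk, annotation) in enumerate(zip(chunks, annotations)):
            record = {"chunk_id": i, "text": chunk, "annotation": annotation}
            # ensure_ascii=False keeps characters such as the en dash and accents readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

With the variables above this would be called as `save_annotations_jsonl(chunks, [a.content for a in four_annots], "./gpt_four_annotations.jsonl")`, where the output file name is again only an example. The length check matters here, because a run that covers fewer chunks than `len(chunks)` would otherwise pair texts and annotations silently out of step.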

In [166]:
print(four_annots[1].content)

<NAV>And looming over all, to the west of the now sadly decaying villa, are the tall and
stark skeletal structures of a pair of nineteenth-century gasholders. It’s well
worth clambering off the canal here to take a closer look at this threatened
piece of industrial and once ornamental London. Go up the stairs to the east of
the bridges, cross the Cambridge Heath Road/Mare Street bridge to the south side
of the canal and head down the grandly named Corbridge Crescent towards the
villa.</NAV> 
<NAV>A narrow
alley – Grove Passage – lurks below the crushing form of the railway viaduct,
and Corbridge Crescent suddenly exposes its early granite setts which, during my
last visit at twilight, glistened with moisture and led to a strange and exotic
world that unfolded around and in front of the Regency villa.</NAV> 
<NAV>This marks the end of Corbridge Crescent, which now leads to The Oval – a small
piece of neoclassical town planning that incorporated an elongated circus,
somewhat in the Nash 

# Compare GPT-4 to GPT-3.5 annotations

In [155]:
three_5_annots = []

annot_3_5 = from_chunk(chunks[0], PROMPT)

In [156]:
three_5_annots.append(annot_3_5)

In [157]:

for chunk_3_5 in tqdm(chunks[1:]):
    three_5_annots.append(from_chunk(chunk_3_5, PROMPT, model="gpt-3.5-turbo-0125"))

100%|██████████| 9/9 [00:53<00:00,  5.89s/it]


In [175]:
comp_idx = 5

In [200]:
print(three_5_annots[comp_idx].content)

<NAV>Back on the north side of the canal, opposite the museum store, is the Barge House
restaurant, created within the ground floor of a recent building that fronts on
to De Beauvoir Crescent.</NAV>
<REASON> This was annotated because it follows T1 as it implies walking to the Barge House restaurant on the north side of the canal. </REASON>

<NAV>But immediately to the west of it is something that
seems much less but is, in fact, so much more. This is the Towpath Café,
composed of just a couple of holes in the basement wall of the large and bulky
industrial building that rises along the canal.</NAV>
<REASON> This was annotated because it follows T1 as it implies walking to the Towpath Café immediately to the west of the Barge House restaurant. </REASON>

<NAV>It’s the perfect place for a canal-side
walker to grab a coffee, a freshly-pressed fruit juice, a sandwich or cake and –
if the weather is right – a wonderful place to sit.</NAV>
<REASON> This was annotated because it follows T1 a

In [214]:
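import re


def pp(content):
    # With a capturing group, re.split keeps the captured text in the result:
    # odd indices hold the text inside the tags, even indices the text between them.
    nav_splits = re.split(r"<NAV>(.*?)</NAV>", content, flags=re.DOTALL)
    reason_splits = re.split(r"<REASON>(.*?)</REASON>", content, flags=re.DOTALL)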
    assert len(nav_splits) == len(reason_splits), f"Nav splits {len(nav_splits)} and reason splits {len(reason_splits)} are not equal"
    print(f"{len(nav_splits)} Nav splits and {len(reason_splits)} Reason splits")
    for i, (nav_split, reason_split) in enumerate(zip(nav_splits, reason_splits)):
        if nav_split:
            print(f"Nav {i}: {nav_split.strip()}")
            print(f"Reason {i}: {reason_split.strip()}")
            print("\n")

In [215]:

pp(three_5_annots[comp_idx].content)


11 Nav splits and 11 Reason splits
Nav 1: Back on the north side of the canal, opposite the museum store, is the Barge House
restaurant, created within the ground floor of a recent building that fronts on
to De Beauvoir Crescent.
Reason 1: This was annotated because it follows T1 as it implies walking to the Barge House restaurant on the north side of the canal.


Nav 2: <REASON> This was annotated because it follows T1 as it implies walking to the Barge House restaurant on the north side of the canal. </REASON>
Reason 2: <NAV>But immediately to the west of it is something that
seems much less but is, in fact, so much more. This is the Towpath Café,
composed of just a couple of holes in the basement wall of the large and bulky
industrial building that rises along the canal.</NAV>


Nav 3: But immediately to the west of it is something that
seems much less but is, in fact, so much more. This is the Towpath Café,
composed of just a couple of holes in the basement wall of the large and bu
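
The output above shows `Nav 2` holding a `<REASON>` block and `Reason 2` holding a `<NAV>` block. That is because `re.split` with a capturing group returns the text between the tags as well as the text inside them. A possible alternative, sketched here with hypothetical names `parse_annotation` and `pp_tags`, is to collect only the captured text with `re.findall`. Since the GPT-4 responses list all `<NAV>` spans first and then a single `<REASON>` block, spans and reasons are only printed as pairs when their counts match.

```python
import re

NAV_RE = re.compile(r"<NAV>(.*?)</NAV>", flags=re.DOTALL)
REASON_RE = re.compile(r"<REASON>(.*?)</REASON>", flags=re.DOTALL)


def parse_annotation(content: str) -> tuple[list[str], list[str]]:
    """Return the text inside every <NAV> tag and inside every <REASON> tag."""
    navs = [span.strip() for span in NAV_RE.findall(content)]
    reasons = [span.strip() for span in REASON_RE.findall(content)]
    return navs, reasons


def pp_tags(content: str) -> None:
    """Print spans and reasons, paired only when there is one reason per span."""
    navs, reasons = parse_annotation(content)
    print(f"{len(navs)} Nav spans and {len(reasons)} Reason spans")
    if len(navs) == len(reasons):
        for i, (nav, reason) in enumerate(zip(navs, reasons), start=1):
            print(f"Nav {i}: {nav}\nReason {i}: {reason}\n")
    else:
        for i, nav in enumerate(navs, start=1):
            print(f"Nav {i}: {nav}")
        for i, reason in enumerate(reasons, start=1):
            print(f"Reason {i}: {reason}")


# Toy response in the interleaved style, for illustration only.
sample = "<NAV>Go west.</NAV>\n<REASON> T4 </REASON>\n\n<NAV>Cross the road.</NAV>\n<REASON> T5 </REASON>"
pp_tags(sample)
```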

In [188]:
pp(four_annots[comp_idx].content)

Nav 0: <NAV>Back on the north side of the canal, opposite the museum store, is the Barge House
restaurant, created within the ground floor of a recent building that fronts on
to De Beauvoir Crescent.</NAV>
<NAV>But immediately to the west of it is something that
seems much less but is, in fact, so much more. This is the Towpath Café,
composed of just a couple of holes in the basement wall of the large and bulky
industrial building that rises along the canal.</NAV>
<NAV>On the next portion of the walk, the towpath becomes wider, the feats of
engineering more ambitious.</NAV>
<NAV>The first sign of the increase in scale
is the brick-arched Whitmore Road bridge.</NAV>
<NAV>The apparently wider girth of the canal after the bridge is, in fact,
an illusion. The increased sense of space is suggested by the buildings, which
are generally lower and more distant because of the wider path.</NAV>
<NAV>This sense of
scale and seriousness is reinforced by the mighty Sturt’s Lock a few hundred
metres

In [191]:
three_5_annots[comp_idx].content

'<NAV>Back on the north side of the canal, opposite the museum store, is the Barge House\nrestaurant, created within the ground floor of a recent building that fronts on\nto De Beauvoir Crescent.</NAV>\n<REASON> This was annotated because it follows T1 as it implies walking to the Barge House restaurant on the north side of the canal. </REASON>\n\n<NAV>But immediately to the west of it is something that\nseems much less but is, in fact, so much more. This is the Towpath Café,\ncomposed of just a couple of holes in the basement wall of the large and bulky\nindustrial building that rises along the canal.</NAV>\n<REASON> This was annotated because it follows T1 as it implies walking to the Towpath Café immediately to the west of the Barge House restaurant. </REASON>\n\n<NAV>It’s the perfect place for a canal-side\nwalker to grab a coffee, a freshly-pressed fruit juice, a sandwich or cake and –\nif the weather is right – a wonderful place to sit.</NAV>\n<REASON> This was annotated because 

In [217]:
print(chunks[comp_idx])

Back on the north side of the canal, opposite the museum store, is the Barge House
restaurant, created within the ground floor of a recent building that fronts on
to De Beauvoir Crescent. But immediately to the west of it is something that
seems much less but is, in fact, so much more. This is the Towpath Café,
composed of just a couple of holes in the basement wall of the large and bulky
industrial building that rises along the canal. Outside each slit-like hole is
an array of simple chairs, a few tables and perhaps a bowl or two of water to
quench the thirst of passing dogs. It’s the perfect place for a canal-side
walker to grab a coffee, a freshly-pressed fruit juice, a sandwich or cake and –
if the weather is right – a wonderful place to sit. When I first found Towpath I
was hungry and thirsty from a long walk and it really was like coming upon an
oasis. During the past few years it has expanded, and now colonises a wide
stretch of canal-side structures, with lots of exterior seati
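
Reading the two models' outputs side by side works for one chunk, but a number per chunk makes the comparison easier to scan. The sketch below computes a simple exact-match agreement between two responses. `nav_span_set` and `span_agreement` are hypothetical helpers, and Jaccard overlap on whole spans is just one possible measure: two spans that overlap partially count as a disagreement.

```python
import re

NAV_RE = re.compile(r"<NAV>(.*?)</NAV>", flags=re.DOTALL)


def nav_span_set(content: str) -> set[str]:
    """Tagged spans with whitespace collapsed, so line wrapping does not matter."""
    return {" ".join(span.split()) for span in NAV_RE.findall(content)}


def span_agreement(content_a: str, content_b: str) -> dict:
    """Compare the spans tagged in two responses to the same chunk."""
    a, b = nav_span_set(content_a), nav_span_set(content_b)
    union = a | b
    return {
        "shared": a & b,
        "only_a": a - b,
        "only_b": b - a,
        # Two responses with no spans at all are treated as full agreement.
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }


# Toy responses for illustration only.
resp_a = "<NAV>Go west along\nTooley Street.</NAV><NAV>Grab a coffee.</NAV>"
resp_b = "<NAV>Go west along Tooley Street.</NAV><NAV>Cross the bridge.</NAV>"
result = span_agreement(resp_a, resp_b)
print(sorted(result["shared"]))  # ['Go west along Tooley Street.']
```

For the chunk shown above this would be `span_agreement(three_5_annots[comp_idx].content, four_annots[comp_idx].content)`.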