<a href="https://colab.research.google.com/github/kasparvonbeelen/uga-llm-workshop/blob/main/2_Poking_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using (open-source) LLMs for analysing humanities data

In this notebook, we explore applications of generative AI for processing and analysing historical newspapers.

## LLMs as Research Assistants

The overarching question is how to merge information in historical data with the predictive abilities of language models.

- Language Modelling: the previous notebook investigated how LMs 'absorb' historical knowledge through continued pre-training on historical data with a language modelling task.

- Instruction Following: what sets the current generation of LMs (ChatGPT, Llama, Claude, etc.) apart from the previous one is their ability to follow instructions, often based on a few guiding examples. This is few-shot learning: "I tell you what to do based on a selection of examples with correct answers."

This notebook explores the latter approach, where LLMs are used to "analyze" content in meaningful and often complex ways.
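
As a minimal sketch of few-shot prompting, the snippet below places a couple of worked examples in a chat-style message list before the actual query. The helper name `build_few_shot_messages`, the headlines and the labels are made up for illustration; they are not part of the workshop data.

```python
def build_few_shot_messages(instruction, examples, query):
    """Build a chat message list with worked examples before the real query."""
    messages = [{"role": "system", "content": instruction}]
    for text, answer in examples:
        # each example is a user turn followed by the 'correct' assistant turn
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": answer})
    # the real query comes last, so the model continues the pattern
    messages.append({"role": "user", "content": query})
    return messages


# hypothetical examples, invented for this sketch
examples = [
    ("FATAL ACCIDENT ON THE RAILWAY.", "accident"),
    ("BUILDERS' TENDERS.", "other"),
]
few_shot = build_few_shot_messages(
    "Label each headline as 'accident' or 'other'.",
    examples,
    "EXPLOSION AT A COLLIERY.",
)
for message in few_shot:
    print(message["role"], "->", message["content"])
```

The resulting list has the same role/content structure as the messages used in the prompting sections further down.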



## RAG (or elevated copy-pasting)

The "I tell you what to do" part is usually referred to as a "prompt", which we ask the LM to "complete".

**Q: How can we inject historical data in this process?**

**A: Retrieval Augmented Generation**: We ask the LLM to answer questions based on historical content we "copy-paste" into our prompt.

We focus on the "generation" and less on the "retrieval". We want to keep things simple (at the technical level)!
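
To make the "copy-paste" idea concrete, here is a toy sketch of the two steps: a deliberately naive keyword retrieval, followed by pasting the hits into a prompt. The functions `retrieve` and `build_prompt` and the three shortened documents are assumptions for illustration only.

```python
def retrieve(documents, keyword, k=2):
    """Naive retrieval: return up to k documents that contain the keyword."""
    hits = [doc for doc in documents if keyword.lower() in doc.lower()]
    return hits[:k]


def build_prompt(question, passages):
    """'Copy-paste' the retrieved passages into the prompt, fenced by ###."""
    context = "\n\n".join(f"###{passage}###" for passage in passages)
    return f"{question}\n\n{context}"


# toy documents, shortened for this sketch
documents = [
    "A melancholy accident happened at a gravel pit on Tuesday last.",
    "BUILDERS' TENDERS. For residence near Kettering.",
    "THE GUN TRADE. A meeting of operatives was held in Birmingham.",
]
passages = retrieve(documents, "accident")
prompt = build_prompt("Who was involved in the accident?", passages)
print(prompt)
```

A real pipeline would typically swap the keyword match for a search index or embedding similarity; the generation step stays the same.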

We will largely be playing with an open-source model, Llama-3, to get a feeling for how LLMs change the way we can interrogate historical data.

## Open Source LLMs

Many of the popular, commercial models are "closed" in the sense that even though you can interact with them, they remain "hidden", i.e. you cannot "download" them or freely run and manipulate them on your own computer system.


### Why Open-source?

- **Privacy:** You might not want to share your data (and ideas) with companies such as OpenAI;
- **Cost:** Leaving aside the privacy caveat above, using open-source models might reduce costs, for example if you want to apply a prompt to 10k newspaper articles;
- **Transparency:** Be mindful that there are different gradations of openness and transparency. Even when you can access the model weights, you might remain in the dark about the training data and other factors;
- **Flexibility:** Even though some providers allow you to train or fine-tune closed models on your data (ties in with privacy), open-source models still give you more freedom and wiggle room to build new models and applications.

### Technical note

We will be relying on the Hugging Face `InferenceClient` for accessing LLMs. These are freely accessible, but rate limits apply! If you want to run a 'local' version (we're still on Colab, but the code should also work on your own computer), uncomment the code below (where indicated) and make sure you are using a [GPU](https://cloud.google.com/gpu). To select a GPU on Colab, go to **`Runtime`**, select **`Change runtime type`**, then select `T4 GPU` (or any other available GPU).



This notebook is inspired by: https://huggingface.co/learn/cookbook/structured_generation

In [None]:
# install transformers and the other libraries
#!pip install -q -U "transformers==4.40.0" pydantic accelerate outlines datasets bitsandbytes

## The Hugging Face Hub

In the examples below, we will experiment with `Llama-3-8B-Instruct`, a model from the Llama 3 series of open-source LLMs created by Meta. To use Llama 3 you need to:

- Create an account on Hugging Face: https://huggingface.co/
- Go to the Llama-3-8B model page and accept the terms of use (you should get a reply swiftly): https://huggingface.co/meta-llama/Meta-Llama-3-8B
- Create a user access token with at least read access: https://huggingface.co/docs/hub/en/security-tokens
- Run the code cell below to log into the Hugging Face hub. Copy-paste the access token.
- Reply `n` to the question 'Add token as git credential? (Y/n)'

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Preparing model and data

### Import libraries

In [None]:
import warnings
warnings.filterwarnings('ignore') # disable warnings

In [None]:
import transformers
from huggingface_hub import InferenceClient
#from datasets import Dataset
from tqdm import tqdm
import pandas as pd
import torch
import json
pd.set_option("display.max_colwidth", 100)

### Load model

In [None]:
# choose an LLM
repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# instantiate the inference client
llm_client = InferenceClient(model=repo_id, timeout=120)

In [None]:
# # use this cell if you can access an A100 or L4 GPU
# # define the model, we use the instruct variant
# checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
# device = 'cuda' # make sure you use a GPU

# # instantiate a text generation pipeline
# pipeline = transformers.pipeline(
#     "text-generation",
#     model=checkpoint,
#     model_kwargs={"torch_dtype": torch.bfloat16},
#     device="cuda",
# )

# # stop tokens: end generation at the end-of-sequence or end-of-turn token
# terminators = [
#     pipeline.tokenizer.eos_token_id,
#     pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
# ]

In [None]:
# # use this cell if you can only access a T4 GPU
# from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
# # define the model, we use the instruct variant
# checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
# device = 'cuda' # make sure you use a GPU if available

# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

# tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# #tokenizer.pad_token = tokenizer.eos_token
# model = AutoModelForCausalLM.from_pretrained(checkpoint,
#                                              quantization_config=bnb_config,
#                                              device_map='auto')

# pipeline = transformers.pipeline(
#     "text-generation",
#     model=model,
#     tokenizer= tokenizer,
# )


# # stop tokens: end generation at the end-of-sequence or end-of-turn token
# terminators = [
#     pipeline.tokenizer.eos_token_id,
#     pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
# ]

### Download data

We will be experimenting with a small set of 10k British newspaper articles provided by the ["Living with Machines"](https://livingwithmachines.ac.uk/public-domain-newspaper-titles-in-living-with-machines/) project.

In [None]:
# download a sample of 10.000 newspaper articles
!gdown 1cewugDdehGn-wPP9B4kTBg_Wmu-14aHq
# unzip the downloaded sample
!unzip -o 0002247.csv.zip
!rm -r __MACOSX

Downloading...
From (original): https://drive.google.com/uc?id=1cewugDdehGn-wPP9B4kTBg_Wmu-14aHq
From (redirected): https://drive.google.com/uc?id=1cewugDdehGn-wPP9B4kTBg_Wmu-14aHq&confirm=t&uuid=2876a48a-5c29-4e0a-9aec-3abb29437b3c
To: /content/0002247.csv.zip
100% 82.8M/82.8M [00:01<00:00, 56.7MB/s]
Archive:  0002247.csv.zip
  inflating: 0002247.csv             
  inflating: __MACOSX/._0002247.csv  


In [None]:
df = pd.read_csv('0002247.csv')
df.head(3)

Unnamed: 0.1,Unnamed: 0,title,item_type,ocr_quality_mean,date,content
0,0002247/1869/0320/0002247_18690320_art0001_metadata.xml,,ARTICLE,0.8326,1869-03-20,"new EDIN BORO\nthe fa3hionable Oliva an 4 Brown\nis, and fa,:ed 30s ,\n\nlew GLADSTONB\n\nelegan..."
1,0002247/1869/0320/0002247_18690320_art0023_metadata.xml,SINGULA_R DisT'URBANCES.,ARTICLE,0.9487,1869-03-20,"SINGULA_R DisT'URBANCES.\n\nFor upwards of 12 months the Scotch church in\nMidvale-road, St. Hel..."
2,0002247/1869/0320/0002247_18690320_art0065_metadata.xml,BUILDERS' TENDERS.,ARTICLE,0.9288,1869-03-20,"BUILDERS' TENDERS.\n\nFor residence near Kettering. Mr. R. W. Johnson, architect—.\nBarlow & But..."


In [None]:
df.shape

(35999, 6)

### Issues with the data

- Segmentation: ["what is an article"](https://docs.google.com/document/d/1fbfWaDx6P-VV09j7pC__Rez_WVJOnnhq4RS-yiQ-XBM/edit?usp=sharing)
- OCR Quality: 6ibb*riSH?

In [None]:
print(df.iloc[0].content)

new EDIN BORO
the fa3hionable Oliva an 4 Brown
is, and fa,:ed 30s ,

lew GLADSTONB

elegant and anterior Garment, cut
s extant, and in a large variety of
ded, and Seams 355.



In [None]:
print(df.iloc[10].content)

BIRMINGHAM.

THE GUN TRADE.—A meeting of operatives engaged
in the small arms manufacture, and who tire resident
in the Small Heath district of Birmingham, was held on
Tuesday nicht, " to consider what steps shall be taken
to reduce the Government small arms factory at En-
field, and thus bring the trade back to Birmingham. "
Mr. E. Frazer presided. The chairman complained in
gcneral terms of the circumstance of the Government
having become manufacturers ; he suggested that rather
than the present system should continue it would be
desirable that all Government work should be thrown
open to general competition. A Mr. Monk next ad-
'dressed the meeting. He alluded to the recent deputation
to London to hold conference with Mr. Bright and other
local members of the Legislature, and said that the
continuance of the Enfield minnfactory threatened the
extinction of the small arms trade in Birmingham,
where it has flourished for a period of nearly 200 years.
He counselled all present to give 

### Process data

To facilitate the analysis, we divide the newspaper articles into smaller, hopefully more meaningful chunks of up to 250 words. With the default settings of the function below, a new chunk starts every 50 words, so consecutive chunks overlap by up to 200 words.

- We split by proxy-article (i.e. double hard returns);
- we remove single hard returns within the proxy-articles.

In [None]:
def get_chunks(text: str, size: int = 250, step: int = 50) -> list:
  """divide a text into overlapping chunks of words
  Arguments:
    text (str): input text
    size (int): maximum number of words in each chunk
    step (int): number of words between the starts of consecutive chunks
  Returns a list of strings
  """
  words = text.split()
  return [' '.join(words[i:i+size]) for i in range(0,len(words),step)]
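
To see what the sliding window does, here is a toy illustration with small numbers. The function `sliding_chunks` is a stand-alone copy of the same logic, renamed so that this sketch runs on its own; the ten-word text is invented.

```python
def sliding_chunks(text, size=250, step=50):
    """Same windowing logic as above: up to `size` words, a new chunk every `step` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


# ten dummy words: w0 w1 ... w9
toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = sliding_chunks(toy_text, size=4, step=2)
for chunk in toy_chunks:
    print(chunk)
# w0 w1 w2 w3
# w2 w3 w4 w5
# w4 w5 w6 w7
# w6 w7 w8 w9
# w8 w9
```

Consecutive chunks share `size - step` words, and the last chunks get shorter because the window runs off the end of the text.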

In [None]:
# split by proxy-article
df['elements'] = df.content.apply(lambda x: [' '.join(ch.split('\n')) for ch in x.split('\n\n')])
# reshape the dataframe so that each row holds
# one proxy-article instead of the whole text
df_by_element = df.explode('elements')
# apply chunking to each proxy-article
df_by_element['chunks'] = df_by_element.elements.apply(lambda x: get_chunks(x))
df_chunks = df_by_element.explode('chunks')
df_chunks.reset_index(drop=True, inplace=True)
df_chunks.shape # wow that's a lot of chunks ;-)

(819814, 8)

In [None]:
len(df_by_element.iloc[110].elements),len(df_by_element.iloc[110].chunks)

(2579, 9)

## Prompting

LLMs generate text from an input, usually referred to as a 'prompt': a piece of text we want the model to use as a starting point for predicting new tokens.

When 'chatting' with an LLM we usually provide the model with (at least) two messages: a system and a user prompt or message.

**System message**:

- **Generic instructions on behaviour**: specify how the model should behave (e.g. be helpful, respectful, neutral) or the role it should play (e.g., a teacher, assistant, or advisor).
- **Constraints**: Specific instructions on what the model should avoid or how it should generate responses.
- **Context**: Background information or context that remains constant throughout the session to ensure consistency.

**User message**:

- **Query**: specifies input from the user, such as a question, instruction, or request that the model needs to respond to.
- **Dynamic**: changes with each interaction, reflecting the user's immediate needs, questions, or instructions.

The Hugging Face chat prompt template accepts messages as a list of dictionaries.

```python
messages = [
 {
    "role" : "system",
    "content": "<system prompt here>"
 },
 {
    "role" : "user",
    "content": "<user prompt here>"
 }
]
```
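
Under the hood, a chat template flattens these messages into a single string that the model completes. The sketch below imitates the Llama 3 layout by hand, purely for illustration: the function `render_llama3_prompt` is hypothetical, and the special tokens follow the published Llama 3 prompt format to the best of our knowledge. In practice the tokenizer's `apply_chat_template` method is the authoritative source.

```python
def render_llama3_prompt(messages):
    """Hand-rolled approximation of the Llama 3 chat layout (illustration only)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # each turn starts with a role header and ends with an end-of-turn token
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content'].strip()}<|eot_id|>"
    # open an assistant turn so the model knows it should answer next
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


demo_messages = [
    {"role": "system", "content": "You are a helpful AI."},
    {"role": "user", "content": "Summarize the article in one sentence."},
]
rendered = render_llama3_prompt(demo_messages)
print(rendered)
```

This also explains the `<|eot_id|>` terminator in the local-model cells above: it marks the end of a turn, so generation should stop there.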

## RAG by hand

Define a message by articulating a system and user prompt.

In [None]:
messages = [
    {
        "role": "system",
        "content": """
          You are a helpful AI that will assist me with analysing and reading newspaper articles.
          Read the newspaper article attentively and provide a short description of principal characters.
          Each newspaper article is enclosed with triple hashtags (i.e. ###).
          Don't make things up! If the information is not in the article then reply 'I don't know'
          """
              },

    {
        "role": "user",
        "content": f"""
                  ###POOR T,i,ENIPAT A 1„k CT  The Poor Law Coirdnissioti(rs have issued a ei; cular,
                  dated the 20th instant, stating that they have consulted the Attorney and
                  Solicitor-General on the construction of the late Removal Act, and give as the
                  result:— I. " That the proviso to the Ist section of the 9 and 10 Vict., c. 66,
                  which sets forth the exceptions to the principal enactments that are to be
                  excluded in the computation of time, is net retrospective in its operation, so
                  as to apply to cases where the five years\' residence was complete before the statute.
                  2. " That an interval between the completion of the five years residence and the
                  application for the warrant of removal filled up by one of the exceptions contained
                  in the proviso will not p event the operation of the statute in restraining the
                  removal of the pauper whu had resided for the specified time. 3. " That orders
                  of removal obtained previous to th• passing of the Act, but not then executed
                  by the removal of the paupers,###"""
              }
  ]

In [None]:
messages

[{'role': 'system',
  'content': "\n          You are a helpful AI that will assist me with analysing and reading newspaper articles.\n          Read the newspaper article attentively and provide a short description of principal characters.\n          Each newspaper article is enclosed with triple hashtags (i.e. ###).\n          Don't make things up! If the information is not in the article then reply 'I don't know'\n          "},
 {'role': 'user',
  'content': '\n                  ###POOR T,i,ENIPAT A 1„k CT  The Poor Law Coirdnissioti(rs have issued a ei; cular,\n                  dated the 20th instant, stating that they have consulted the Attorney and\n                  Solicitor-General on the construction of the late Removal Act, and give as the\n                  result:— I. " That the proviso to the Ist section of the 9 and 10 Vict., c. 66,\n                  which sets forth the exceptions to the principal enactments that are to be\n                  excluded in the computatio

In [None]:
#help(llm_client.chat_completion)

In [None]:
# # uncomment this code if you want to work locally, and comment out the other function
# def get_completion(messages: list, temperature=.1, top_p=.1) -> str:
#   """get completion for given system and user prompt
#     Arguments:
#     messages (list): a list containing a system and user message as
#       python dictionaries with keys 'role' and 'content'
#     temperature (float): regulates the creativity of the text generation
#     top_p (float): cumulative probability included in the
#       generation process
#   """
#   prompt = pipeline.tokenizer.apply_chat_template(
#         messages,
#         tokenize=False,
#         add_generation_prompt=True
#       )

#   outputs = pipeline(
#     prompt,
#     max_new_tokens=256,
#     eos_token_id=terminators,
#     do_sample=True,
#     temperature=temperature,
#     top_p=top_p,
#       )
#   return outputs[0]["generated_text"][len(prompt):]

# use this function if you are working with the llm_client (the default)
def get_completion(messages: list, temperature=.0, top_p=.1):
    """get completion for given system and user prompt
      Arguments:
        messages (list): a list containing a system and user message as
          python dictionaries with keys 'role' and 'content'
        temperature (float): regulates the creativity of the text generation
        top_p (float): cumulative probability included in the
          generation process
    """
    outputs = llm_client.chat_completion(
        messages=messages,
        max_tokens=1024,
        temperature=temperature,
        top_p=top_p
        )
    return outputs.choices[0].message.content

In [None]:
print(get_completion(messages))

Based on the newspaper article, the principal characters mentioned are:

1. The Poor Law Commissioners: They are the ones who issued the circular and consulted with the Attorney and Solicitor-General on the construction of the Removal Act.
2. The Attorney-General: He was consulted by the Poor Law Commissioners on the construction of the Removal Act.
3. The Solicitor-General: He was also consulted by the Poor Law Commissioners on the construction of the Removal Act.
4. Paupers: They are the individuals who are the subject of the Removal Act and are being discussed in the circular.

Note: There are no specific individuals mentioned in the article, only these groups of people.
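
Because the hosted endpoint is rate-limited (see the technical note above), a call can occasionally fail. One possible safeguard, sketched here under the assumption that simply waiting and retrying is acceptable, is a small wrapper with exponential backoff. The name `with_retries` and the `flaky_completion` stand-in are hypothetical.

```python
import time


def with_retries(fn, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as error:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            wait = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({error}); retrying in {wait:.2f}s")
            time.sleep(wait)


# stand-in for a rate-limited call: fails twice, then succeeds
calls = {"n": 0}

def flaky_completion(messages):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limit exceeded")
    return "ok"


print(with_retries(flaky_completion, [], base_delay=0.01))
```

In the notebook this would be used as `with_retries(get_completion, messages)`.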


## Exercise

- Change the system message and ask the model to reply in medieval French.
- Change the user message and ask the model to summarize the article and condense it to one sentence.

In [None]:
# Enter code here

#### Solution

In [None]:
messages = [
    {"role": "system", "content": """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper article attentively and extract the required information.
    Each newspaper article is enclosed with triple hashtags (i.e. ###).
    Don't make things up! If the information is not in the article then reply 'I don't know'
          Answer in medieval French!"""
          },
    {"role": "user", "content": f"""Summarize the article in one sentence.
    ###{df.iloc[0].content}###"""}
]

print(get_completion(messages))


Hear ye, hear ye! I, a humble AI, shall extract the principal characters portrayed in this newspaper article.

Verily, I find none. This article appears to be a discussion of the Poor Law Commission's interpretation of the Removal Act, and does not mention any specific individuals. The article is a treatise on the law, outlining the Commission's views on the construction of the Act and its application to various scenarios.

Thus, I must reply: "Je ne sais pas" (I don't know). There are no principal characters to describe.


## Applying text generation to historical documents


### Example 1: Summarize

Let's imagine we wish to know what happened in January 1899 but don't have time to read all the newspaper issues. Luckily, LLMs excel at summarization!

We select all the chunks from January 1899 and save them in a new dataframe. For the purposes of this exercise, we just take a random sample of 10 chunks; otherwise it would take too long to run everything through the model.
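
The filter in the cell below relies on `year` and `month` columns, which the cells shown so far do not create. One way to derive them, assuming the `date` column holds ISO-formatted strings as in the preview of the dataframe above, is sketched here on a toy frame; on the real data the same three lines would be applied to `df_chunks`.

```python
import pandas as pd

# toy frame standing in for df_chunks (values invented for this sketch)
toy = pd.DataFrame({
    "date": ["1899-01-14", "1869-03-20"],
    "chunks": ["first chunk", "second chunk"],
})

# parse the date strings once, then derive the two helper columns
parsed = pd.to_datetime(toy["date"], errors="coerce")
toy["year"] = parsed.dt.year
toy["month"] = parsed.dt.month

print(toy[(toy.year == 1899) & (toy.month == 1)])
```

With `errors="coerce"`, unparseable dates become missing values instead of raising an error, so such rows simply drop out of the filter.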

In [None]:
df_small = df_chunks[
            (df_chunks.year==1899) & (df_chunks.month==1) # select articles from January 1899
                  ].sample(10, random_state=1984).reset_index(drop=True) # we sample a few to keep things simple
df_small.shape

(10, 15)

Run the cell below to load the `apply_completions` function.

In [None]:
def apply_completions(item: pd.Series,
                      system_message: str,
                      user_message: str,
                      text_column: str = 'chunks') -> str:
  """
  Apply a system and user prompt to the text of one dataframe row
  and return the model's completion.
  Arguments:
    item (pd.Series): row from a pandas DataFrame
    system_message (str): system prompt, specifies how the model
      should behave
    user_message (str): user prompt, gives instructions on how to
      process each historical document. The document itself is appended
      from the column named by the 'text_column' argument
    text_column (str): name of the text column
  """
  messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
      ]
  messages[1]['content'] += f"\n\n###{item[text_column]}###"
  return  get_completion(messages)

We apply the prompt to the text chunks in our dataframe.

In [None]:
tqdm.pandas() # use tqdm to view progress

system_message = """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper article attentively and extract the required information.
    Each newspaper article is enclosed with triple hashtags (i.e. ###).
    Don't make things up! If the information is not in the article then reply 'I don't know'
    """
user_message = "Summarize the article in one sentence."

df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)

100%|██████████| 10/10 [00:15<00:00,  1.50s/it]


In [None]:
# print the first summary
df_small['completion'][0]

'The article is a brief review of an event, praising the performance of the songs and the duties of Bro. John Rennard as the D.C. (presumably a leader or emcee) for adding to the successful enjoyment of the evening.'

### Example 2: Analyse information about accidents in the news

In this example we make things a little more complicated.

First we retrieve a set of documents based on their date of publication and their content. Then we use an LLM to ask specific questions about these documents (a baby RAG pipeline, in the sense that we first retrieve and then generate a response to our query).

How did accidents in the news change over time? Who is blamed for the accident?




In the first step we simply use a regular expression to find reports about accidents.

In [None]:
import re
pattern = re.compile(r'\baccidents?\b', re.I) # compile a regex
pattern.findall('accidents accident AccIdent accidental') # test the regex on a few examples

['accidents', 'accident', 'AccIdent']

In [None]:
tqdm.pandas()
df_chunks['matches'] = df_chunks.chunks.progress_apply(lambda x: bool(pattern.findall(x)))

100%|██████████| 336876/336876 [00:11<00:00, 30088.47it/s]


Then we retrieve a small sample of accident reports published between 1810 and 1820.

In [None]:
accident_1810s = df_chunks[
                    (df_chunks.year.between(1810,1820)) & (df_chunks['matches'] == True)
                      ].sample(n=10, random_state=1984)

print(accident_1810s.shape)

(10, 16)


You can use `.value_counts()` to compute the total number of chunks mentioning 'accident' at least once.

In [None]:
(df_chunks['matches'] == True).value_counts()

Unnamed: 0_level_0,count
matches,Unnamed: 1_level_1
False,329628
True,7248


In [None]:
system_message = """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper article attentively and extract the required information.
    Each newspaper article is enclosed with triple hashtags (i.e. ###).
    Don't make things up! If the information is not in the article then reply 'I don't know'
    Focus on the answer and do not add any unnecessary texts."""
user_message = """Does the article talk about an accident?
If yes, who is blamed for causing the accident? Is the accident caused by human error or a fault of the machine?
If not, answer 'No accident mentioned' """

accident_1810s['completion'] =  accident_1810s.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


100%|██████████| 10/10 [00:13<00:00,  1.33s/it]


In [None]:
accident_1810s['completion']

Unnamed: 0,completion
3806,No accident mentioned.
3806,No accident mentioned.
3130,No accident mentioned.
4491,No accident mentioned.
3130,No accident mentioned.
966,"Yes, the article talks about an accident.\n\nThe accidents mentioned are:\n\n1. An accident at R..."
3806,"Yes, the article talks about an accident.\n\nThe accident is caused by human error, specifically..."
966,"Yes, the article talks about an accident.\n\nThe accidents are caused by a bank falling in at th..."
966,"Yes, the article talks about an accident.\n\nThe accident is caused by a machine (the bank shoot..."
4491,No accident mentioned.


In [None]:
accident_1810s[['chunks','completion']].iloc[8].values

array(["the other four much bruised and cut, six of the eight are in such a state, that their recovery is not etpected. After a length of time, four more were dug up quite dead, and their bodies removed to the canteen for the Coroner's Inquest. A melancholy accident happened at Lisna, in the vicinity of Killileagh, on Tuesday last. As some men were working at a gravel pit for the purpose of repairing roads, the bank shot down and killed two men on the spot ; two more were so desperately bruised by the stuf falling on them, that their recovery is yet very doubtful. Each of the suXerers have lel t a wife and a large helpless family to deplore their loss. A young Lady of the name of LEACH, residing in the neighbourhood of Mary-la-bonne, was shockingly burnt on Thursday night last, by her clothes taking fire by the candle, whilst reading in bed : she lingere&, till the next day, and then expired in the greatest agony.—A Coroner's Jury sat on the body, and brought in a verdict of Accidental

### Example 3: Structured Generation

Working with these verbose responses is often difficult, especially at scale. Fortunately, we can ask the LLM to respond in a **structured fashion** that we can process more easily.

Let's have a look at extracting **biographical information** from newspaper articles.

Newspapers contain a lot of biographical information, one could say biography appears as a microgenre in the press. For example, in accident reports we do get some background about the people involved, implicitly (gender) or explicitly (professions or age).

Below we use a language model to extract such information from newspaper reports and return it in a predefined format that allows us to analyse newspapers as structured data.

Put differently, we use LLMs to extract information, similar to automatic annotation, and convert text to JSON format, which is easier to parse with Python.\*

\* You could also use XML if you are more comfortable with this format.

In [None]:
df_small = df_chunks[
                    (df_chunks['matches'] == True)
                      ].sample(n=10, random_state=1984)

In [None]:
# df_small['chunks'].iloc[7]

We rewrite the system prompt and give it a few more instructions on how to respond to our queries.

In [None]:
system_message = """You are a helpful AI that will assist me with analysing source documents in the form of historical newspaper articles.
    Read the newspaper articles attentively and extract structured information formatted as a list of Python dictionaries.
    Provide all relevant short source snippets from the documents on which you directly based your answer.
    Keep the source snippet short to just a few words and not complete sentences.
    The snippet MUST be extracted from the source, with spelling and wording identical to the source.
    This list of JSON blobs should begin with a "START" tag and end with an "END" tag.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! If you don't know the answer, simply return no value"""


user_message = """
If the article describes a historical accident, extract biographical information about the individuals involved in the accidents.
Return a list of Python dictionaries, one for each individual, which record important personal attributes such as gender, age and profession, and others that are relevant.
Each attribute is a key in a dictionary.
Record personal attributes as dictionaries as shown in the example below.
Also add one key with "outcome" that records what happened to the person ("drowned", "survived", "injured")
Add a confidence score as a float between 0 and 1 for each snippet extracted.
Under "source_snippets" collect text fragments that record what happened to the person involved.

START
[
  {
  "name" : { "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  "gender" : { "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  "profession" :{ "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  ... other attributes ...,
  "outcome" : { "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  "summary": { "value" :summary, "confidence" : your_confidence_score }
  },
...]
END
"""



In [None]:
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message + f'\n\n###{df_small["chunks"].iloc[4]}###'}
      ]
print(get_completion(messages))

START
[
  {
    "name" : {"value": "The Rev. G. M. Gordon", "source": "who as killed in the sortie from Candahar", "confidence": 0.9},
    "gender" : {"value": "male", "source": "", "confidence": 0.8},
    "profession" : {"value": "clergyman", "source": "The Rev. G. M. Gordon", "confidence": 0.9},
    "outcome" : {"value": "killed", "source": "who as killed in the sortie from Candahar", "confidence": 0.9},
    "source_snippets" : ["who as killed in the sortie from Candahar"],
    "summary" : {"value": "The Rev. G. M. Gordon was killed in the sortie from Candahar", "confidence": 0.9}
  },
  {
    "name" : {"value": "Duke of Connaught", "source": "The Duke of Connaught", "confidence": 0.9},
    "gender" : {"value": "male", "source": "", "confidence": 0.8},
    "profession" : {"value": "royal", "source": "The Duke of Connaught", "confidence": 0.9},
    "outcome" : {"value": "injured", "source": "Beyond the shaking, however, the Duke was little the worse for the mishap", "confidence": 0.8}

In [None]:
df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


100%|██████████| 10/10 [00:44<00:00,  4.41s/it]


In [None]:
print(df_small['completion'])

9969    START\n[\n  {\n    "name" : {"value" : "James Turner", "source" : "Subsequently a man who, it is...
3526    START\n[\n  {\n    "name" : {"value": "James Heng", "source": "ACCIDENT AT WOOLWICH DOCKYARD;— a...
8563    START\n[\n  {\n    "name" : {"value": "Edward Ball", "source": "that of a young man named Edward...
7666    START\n[\n  {\n    "name" : {"value": "unknown", "source": "A frightful accident occurred on Sat...
9068    START\n[\n  {\n    "name" : {"value": "The Rev. G. M. Gordon", "source": "who as killed in the s...
3438                                                                                             END\n\n###
7306    START\n[\n  {\n    "name" : {"value": "Miles Barnes", "source": "the body of Miles Barnes", "con...
3077    START\n[\n  {\n    "name" : {"value": "Sareh Bunyan", "source": "On Friday an inquest was held i...
5963    START\n[\n  {\n    "name" : {"value": "Samuel Birtwistle", "source": "A youth named Samuel Birtw...
5177    END\n\nSTART\n[\n  {

To convert the response to a Python data type, we use the `eval_completion` function.

In [None]:
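def eval_completion(completion: str) -> list:
  """Convert the completion string to a Python list.
  Argument:
      completion (str): structured generation by LLM
  Returns an empty list if the completion cannot be parsed.
  """
  try:
    # keep the text after the last START marker and drop a trailing END marker
    # (removesuffix removes the literal suffix; rstrip('END') would strip any trailing E, N or D)
    body = completion.split('START')[-1].strip().removesuffix('END').strip()
    # caution: eval executes arbitrary code, so only use it on output you trust
    return eval(body)
  except Exception as e:
    print(e)
    return []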

df_small['completion_eval'] = df_small['completion'].apply(eval_completion)

unterminated string literal (detected at line 38) (<string>, line 38)
name 'END' is not defined
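
Because `eval` runs whatever text the model returns, a stricter parser may be preferable outside a workshop setting. The following is a minimal sketch, not part of the original pipeline: `parse_completion` is a hypothetical helper that tries `json.loads` first and falls back to `ast.literal_eval`, which only accepts Python literals. It assumes Python 3.9+ for `str.removesuffix` and the same START/END markers as above.

```python
import ast
import json


def parse_completion(completion: str) -> list:
    """Parse the list between the last START marker and a trailing END marker.

    Returns an empty list when the text is truncated or otherwise malformed.
    """
    # keep only the text after the last START marker
    body = completion.split("START")[-1].strip()
    # drop the literal "END" suffix if it is present
    body = body.removesuffix("END").strip()
    # try strict JSON first, then Python literals (handles None and single quotes)
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(body)
        except (ValueError, SyntaxError, TypeError):
            continue
        # wrap a single record so the return type is always a list
        return parsed if isinstance(parsed, list) else [parsed]
    return []


example = 'START\n[\n  {"name": {"value": "James Turner", "confidence": 0.9}}\n]\nEND'
print(parse_completion(example))
```

Neither parser can execute code, so a malformed or malicious completion simply yields an empty list. Truncated generations still come back empty; recovering partial records would need extra logic.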


Let's take a closer look at some examples.

In [None]:
df_small['completion_eval']

Unnamed: 0,completion_eval
9969,"[{'name': {'value': 'James Turner', 'source': 'Subsequently a man who, it is alleged, first rais..."
3526,"[{'name': {'value': 'James Heng', 'source': 'ACCIDENT AT WOOLWICH DOCKYARD;— accident occurred y..."
8563,"[{'name': {'value': 'Edward Ball', 'source': 'that of a young man named Edward Ball', 'confidenc..."
7666,"[{'name': {'value': 'unknown', 'source': 'A frightful accident occurred on Saturday week at the ..."
9068,[]
3438,[]
7306,"[{'name': {'value': 'Miles Barnes', 'source': 'the body of Miles Barnes', 'confidence': 1.0}, 'g..."
3077,"[{'name': {'value': 'Sareh Bunyan', 'source': 'On Friday an inquest was held in New Gravel-lane ..."
5963,"[{'name': {'value': 'Samuel Birtwistle', 'source': 'A youth named Samuel Birtwistle', 'confidenc..."
5177,"[{'name': {'value': None, 'source': None, 'confidence': 0.0}, 'gender': {'value': None, 'source'..."


In [None]:
df_small['completion_eval'].iloc[4]

[]

Lastly, we can take a closer look at how the language model processes the text by highlighting the fragments on which it based its answers. This can help us with:
- creating automatic pre-annotation
- figuring out how the pipeline could be improved
- close-reading large amounts of text

In [None]:
import re

row = df_small.iloc[1]
html_output = row['chunks']
for p_dict in row['completion_eval']:
  for attr, attr_dict in p_dict.items():
    try:
      if isinstance(attr_dict, dict):
        # only highlight fragments the model is reasonably confident about
        if attr_dict.get('confidence',.0) > .5 and attr_dict.get("source",None):
          # escape the source so punctuation is matched literally, not as a regex
          html_output = re.sub(re.escape(str(attr_dict['source'])),
                   lambda m: f'<span style="background-color: yellow;">{m.group(0)}</span>', html_output)
    except Exception as e:
      print(e,attr_dict)
      continue

In [None]:
from IPython.display import HTML
HTML(html_output)

### Example 4: OCR correction

Lastly, let's use an LLM to help with a longstanding problem in digital humanities: improving OCR quality.

In [None]:
df_small_bad_ocr = df_chunks.sort_values('ocrquality', ascending=True)[:1000].sample(n=10)

In [None]:
system_message = "You are an helpful AI and provide truthful correction of historical text."

user_message = """Transcribe the text and correct typos and errors in the text caused by bad optical character recognition (OCR).
Do not add any information that is not in the original text!"""

df_small_bad_ocr['completion'] = df_small_bad_ocr.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


100%|██████████| 10/10 [01:16<00:00,  7.61s/it]


In [None]:
df_small_bad_ocr.iloc[4]['chunks']

"from Harrow; Mr. and Miss Afansergh, at Gould's Hotel, Jermyn-street, from Paris; Captain Patterson, at ditto, from Bath ; Samuel Peach, Esq., at Burlington Hotel, Old Burlington and Cork.:streets, from Warwickshire ; J. Cano, Esq. at ditto, from Ireland. CUANGEL—The Earl and Countess of Sefton and the Ladies Molyneux, for their seat, Stoke Farm, Berke ; the Countess of Mansfield, for her Villa at Twickenham ; Lord Lynedock, from his seat, Cosgrove Priory, for the Marquis of Anglesey's, Beau Desert, Staffordshire; Lord Meneaster, from Melton Mowbray, for his seat, in Cumberland ; Augustus Atkyns, Esq. from Farnborough Hill, near Bagshot, to Arreton Vicarage, near Newport, Isle of Wight; the Rev. Dr. Wynne, from Warne's Hotel; Mrs. Fletcher and family, from ditto; Mrs. Walker and family, from ditto, for North Wales; the Countess of Golflird, from the St. George's Hotel, Albemarle-street, for her seat, Brompton Park, Huntingdon. The Marchioness Dowager of Salisbury's Convex.sazione, on 

In [None]:
print(df_small_bad_ocr.iloc[4]['completion'])

Here is the transcribed text with corrections for typos and OCR errors:

###From Harrow; Mr. and Miss Afansiegh, at Gould's Hotel, Jermyn Street, from Paris; Captain Patterson, at ditto, from Bath; Samuel Peach, Esq., at Burlington Hotel, Old Burlington Street and Cork Street, from Warwickshire; J. Cano, Esq., at ditto, from Ireland.

CUANGEL—The Earl and Countess of Sefton and the Ladies Molyneux, for their seat, Stoke Farm, Berkshire; the Countess of Mansfield, for her villa at Twickenham; Lord Lynedoch, from his seat, Cosgrove Priory, for the Marquis of Anglesey's, Beau Desert, Staffordshire; Lord Manners, from Melton Mowbray, for his seat, in Cumberland; Augustus Atkyns, Esq., from Farnborough Hill, near Bagshot, to Arreton Vicarage, near Newport, Isle of Wight; the Rev. Dr. Wynne, from Warne's Hotel; Mrs. Fletcher and family, from ditto; Mrs. Walker and family, from ditto, for North Wales; the Countess of Gough, from the St. George's Hotel, Albemarle Street, for her seat, Brompton

In [None]:
df_small_bad_ocr.iloc[3]['chunks']

"to the south of well-searched tracts, and has been approache I by vessels that have returned without loss, has never yet been explored. _ _ Supported by the advice of those experienced Arctic seamen, in, whom she has every reason to confide, Lady Franklin makes this last effort to clear away the mystery that shrouds tho' fate of her husband and his crews, and possibly to reszue from their iusulatedley abode among the Esquiniaux some of his younger companions, who may still be prolonging a dreary existence. • On such an occasion we, whose names are hereunto sub. scribed, feel confident that this our appeal will not remain unanswered by the British people, who will, we doubt not, tender to the widow of the illustrious navigator that :sympathy which his fame and her devotion must call t ,rth, and will aid her .in • carrying out' an enterprise' involving, as we believe, the honour of the nation. We earnestly, therefore, entreat our countrymen to unite with us in contributing to this noble

In [None]:
df_small_bad_ocr.iloc[3]['completion']

'Here is the transcribed and corrected text:\n\n###To the south of well-searched tracts, and has been approached by vessels that have returned without loss, has never yet been explored.\n\nSupported by the advice of those experienced Arctic seamen, in whom she has every reason to confide, Lady Franklin makes this last effort to clear away the mystery that shrouds the fate of her husband and his crew, and possibly to rescue from their isolated abode among the Esquimaux some of his younger companions, who may still be prolonging a dreary existence.\n\n• On such an occasion we, whose names are hereunto subscribed, feel confident that this our appeal will not remain unanswered by the British people, who will, we doubt not, tender to the widow of the illustrious navigator that sympathy which his fame and her devotion must call forth, and will aid her in carrying out an enterprise involving, as we believe, the honour of the nation.\n\nWe earnestly, therefore, entreat our countrymen to unite 
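
The prompt asks the model not to add information, so it can be useful to check how far each correction drifts from its source. The sketch below is one rough way to do that with the standard library. `strip_preamble` and `similarity` are hypothetical helpers, and the `###` marker is only assumed because it appears in the two completions shown here; other runs or models may format their answers differently.

```python
from difflib import SequenceMatcher


def strip_preamble(completion: str, marker: str = "###") -> str:
    """Drop the introductory sentence that precedes the marker, if present."""
    _, found, rest = completion.partition(marker)
    return rest.strip() if found else completion.strip()


def similarity(original: str, corrected: str) -> float:
    """Return a character-level similarity between 0 and 1 (case-insensitive)."""
    return SequenceMatcher(None, original.lower(), corrected.lower()).ratio()


raw = "the mystery that shrouds tho' fate of her husband and his crews"
completion = ("Here is the transcribed and corrected text:\n\n"
              "###the mystery that shrouds the fate of her husband and his crew")
clean = strip_preamble(completion)
print(clean)
print(round(similarity(raw, clean), 2))
```

A score close to 1 suggests light-touch correction, while a low score flags rows where the model may have rewritten or truncated the text and which deserve a manual look. The score says nothing about whether a particular change, such as a corrected name, is historically right.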

In [None]:
df_small_bad_ocr.to_csv('newspaper_ocr_corrected.csv')

## Exercise

Experiment with your own system and user messages! Have fun :-)

In [None]:
# enter code here

# What if things don't work?

- Use a larger model (see the example below, which uses the OpenAI API)
- Fine-tune a model on real or synthetic data; see [this example](https://huggingface.co/blog/mlabonne/sft-llama3)

In [None]:
!pip install openai

In [None]:
import openai

In [None]:
df_small_bad_ocr.iloc[4]['chunks']

"from Harrow; Mr. and Miss Afansergh, at Gould's Hotel, Jermyn-street, from Paris; Captain Patterson, at ditto, from Bath ; Samuel Peach, Esq., at Burlington Hotel, Old Burlington and Cork.:streets, from Warwickshire ; J. Cano, Esq. at ditto, from Ireland. CUANGEL—The Earl and Countess of Sefton and the Ladies Molyneux, for their seat, Stoke Farm, Berke ; the Countess of Mansfield, for her Villa at Twickenham ; Lord Lynedock, from his seat, Cosgrove Priory, for the Marquis of Anglesey's, Beau Desert, Staffordshire; Lord Meneaster, from Melton Mowbray, for his seat, in Cumberland ; Augustus Atkyns, Esq. from Farnborough Hill, near Bagshot, to Arreton Vicarage, near Newport, Isle of Wight; the Rev. Dr. Wynne, from Warne's Hotel; Mrs. Fletcher and family, from ditto; Mrs. Walker and family, from ditto, for North Wales; the Countess of Golflird, from the St. George's Hotel, Albemarle-street, for her seat, Brompton Park, Huntingdon. The Marchioness Dowager of Salisbury's Convex.sazione, on 

In [None]:
from openai import OpenAI
client = OpenAI(api_key='sk-...')
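
To avoid pasting a secret key into a shared notebook, the key can be read from an environment variable instead. This is a small sketch under that assumption: `read_api_key` is a hypothetical helper, and `OPENAI_API_KEY` is the variable name the `openai` client also looks for when no `api_key` argument is given.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment variable, or raise if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Possible usage with the client created above:
# client = OpenAI(api_key=read_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the API later on.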

In [None]:
completion = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {"role": "system", "content": "You are a helpful assistant. Correct the text below."},
    {"role": "user", "content": df_small_bad_ocr.iloc[4]['chunks']}
  ]
)



In [None]:
print(completion.choices[0].message.content)

From Harrow; Mr. and Miss Afansergh, at Gould's Hotel, Jermyn Street, from Paris; Captain Patterson, at the same, from Bath; Samuel Peach, Esq., at Burlington Hotel, Old Burlington and Cork Streets, from Warwickshire; J. Cano, Esq., at the same, from Ireland.

CHANGES—The Earl and Countess of Sefton and the Ladies Molyneux, for their seat, Stoke Farm, Berkshire; the Countess of Mansfield, for her villa at Twickenham; Lord Lynedoch, from his seat, Cosgrove Priory, for the Marquis of Anglesey's Beau Desert, Staffordshire; Lord Meneaster, from Melton Mowbray, for his seat in Cumberland; Augustus Atkyns, Esq., from Farnborough Hill, near Bagshot, to Arreton Vicarage, near Newport, Isle of Wight; the Rev. Dr. Wynne, from Warne's Hotel; Mrs. Fletcher and family, from the same; Mrs. Walker and family, from the same, for North Wales; the Countess of Gifford, from the St. George's Hotel, Albemarle Street, for her seat, Brompton Park, Huntingdon.

The Marchioness Dowager of Salisbury's conversaz

# Fin.