<a href="https://colab.research.google.com/github/kasparvonbeelen/UIBK-DH-LLM-Workshop/blob/dev/LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using open-source LLMs for analysing humanities data


Whereas the previous examples focussed on interpreting specific model predictions, the notebook below explores the use of generative AI for processing and analysing historical newspapers.

Major hurdles to working with LLMs are cost and/or infrastructure. Unlike GPT-2 or BERT, recent LLMs can be difficult to run locally, and using commercial APIs can be expensive.

## Why Open-source?

- **privacy:** you might not want to share your data (and ideas) with companies such as OpenAI;
- **cost:** setting aside the caveat above, using open-source models might reduce costs if you want to apply, for example, a prompt to 10k newspaper articles;
- **transparency:** but be mindful that there are different gradations of open source: even when you are able to access the model weights, you might remain in the dark about training data and other factors;
- **flexibility:** while some closed models allow fine-tuning on your own data (which ties in with privacy), open-source models still give you more freedom and wiggle room to build other models and new applications.

## Why Large Language Models

Computational analysis of text is often described as 'distant reading': reading without reading, relying on quantification and measurement to study large collections of text. LLMs can support this in several ways:

- Summarization and "baby-RAG"
- Speed up annotation and/or information extraction via structured generation

## Goals of this Session

This notebook covers a few practical as well as theoretical aspects of working with LLMs in the context of humanities research. The goal is to start a discussion on:
  - where to find and how to deploy open-source LLMs?
  - what tasks would make sense? which models work well for a selected task?
  - how to evaluate outcomes and performance?


We want to keep things simple!

We will mainly be playing around with Llama-3 to get a feeling for how this might change the way we approach data processing, as well as the type of research questions we'd like to tackle.


## Technical note

We will be relying on the Hugging Face `InferenceClient` for accessing LLMs. These are freely accessible, but rate limits apply! If you want to deploy a 'local' version (we're still on Colab, but the code should also work on your computer), uncomment the code below (where indicated) and make sure you are using a [GPU](https://cloud.google.com/gpu). To select a GPU on Colab, go to **`Runtime`** and select **`Change runtime type`**, then select `T4 GPU` (or any other available GPU).



This notebook is inspired by: https://huggingface.co/learn/cookbook/structured_generation

In [28]:
# install transformers and other libraries
!pip install -q -U "transformers==4.40.0" pydantic accelerate outlines datasets

## The Hugging Face Hub

In the example below, we will experiment with `Llama-3-8B-Instruct`, a model from a recent series of open-source LLMs created by Meta. To use Llama 3 you need to:

- Make an account on Hugging Face https://huggingface.co/
- Go to the Llama-3-8B page and sign the terms of use (you should get a reply swiftly): https://huggingface.co/meta-llama/Meta-Llama-3-8B
- Create a user access token with read access: https://huggingface.co/docs/hub/en/security-tokens
- Run the code cell below to log into the Hugging Face hub. Copy-paste the access token
- Reply `n` to the question 'Add token as git credential? (Y/n)'

In [29]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).

## Preparing model and data

### Import libraries

In [30]:
import warnings
warnings.filterwarnings('ignore') # disable warnings

In [31]:
import transformers
from huggingface_hub import InferenceClient
from datasets import Dataset
from tqdm import tqdm
import pandas as pd
import torch
import json
pd.set_option("display.max_colwidth", 100)

### Load model

In [43]:
# choose an LLM
repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# instantiate the inference client
llm_client = InferenceClient(model=repo_id, timeout=120)

In [44]:
# # define the model, we use the instruct variant
# checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
# device = 'cuda' # make sure you use a GPU

# # instantiate a text generation pipeline
# pipeline = transformers.pipeline(
#     "text-generation",
#     model=checkpoint,
#     model_kwargs={"torch_dtype": torch.bfloat16},
#     device="cuda",
# )

# # some fluff to improve the generation
# terminators = [
#     pipeline.tokenizer.eos_token_id,
#     pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
# ]

### Download data

In [7]:
# download a sample of 10,000 newspaper articles
!wget -q --show-progress https://github.com/kasparvonbeelen/lancaster-newspaper-workshop/raw/wc/data/sample_lwm_hmd_mt90_10000.csv.zip
# unzip the downloaded sample
!unzip -o sample_lwm_hmd_mt90_10000.csv.zip
!rm -r __MACOSX

Archive:  sample_lwm_hmd_mt90_10000.csv.zip
  inflating: sample_lwm_hmd_mt90_10000.csv  
  inflating: __MACOSX/._sample_lwm_hmd_mt90_10000.csv  


In [8]:
df = pd.read_csv('sample_lwm_hmd_mt90_10000.csv')
df.head(3)

Unnamed: 0,NLP,issue,art_num,title,collection,full_date,year,month,day,location,word_count,ocrquality,text,decade
0,2194,1026,art0021,The Sun.,British Library Heritage Made Digital Newspapers,1846-10-26,1846,10,26,"London, England",539,0.9705,"POOR T,i,ENIPAT A 1„k CT The Poor Law Coirdnissioti(rs have issued a ei; cular, dated the 20th ...",1840
1,2645,925,art0004,The Press.,British Library Heritage Made Digital Newspapers,1858-09-25,1858,9,25,"London, England",2263,0.9663,"THE PRESS, SEPTEMBER 25, 1858. in managing their own business and dealing with matters of local...",1850
2,2194,323,art0006,The Sun.,British Library Heritage Made Digital Newspapers,1840-03-23,1840,3,23,"London, England",795,0.9351,"PUBLICATIONS. This day is published, in post Bvo., with Woodcuts and Twelve coloured Plates, pr...",1840


In [9]:
df.shape

(10000, 14)

### Process data

In [10]:
def get_chunks(text: str, size: int = 250, step: int = 50) -> list:
  """divide a text into overlapping chunks of at most `size` words
  Arguments:
    text (str): input text
    size (int): maximum number of whitespace-separated words in each chunk
    step (int): number of words between the starts of consecutive chunks
  Returns a list of strings
  """
  words = text.split()
  return [' '.join(words[i:i + size]) for i in range(0, len(words), step)]

In [11]:
# apply chunking to text
df['chunks'] = df.text.apply(get_chunks)

In [12]:
len(df.text[0]),len(df['chunks'][0])

(2969, 11)
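The second number follows from the window arithmetic: `get_chunks` starts a new chunk every `step` words, so a text of `n` words yields `ceil(n / step)` chunks, and neighbouring chunks overlap by `size - step` words. A minimal sketch on a toy text (the word list is made up for illustration, and `get_chunks` is repeated only so that the block runs on its own):

```python
import math

def get_chunks(text: str, size: int = 250, step: int = 50) -> list:
    """Same sliding-window logic as in the cell above."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), step)]

# toy document of 12 "words": w0 ... w11 (illustrative only)
toy_text = ' '.join(f"w{i}" for i in range(12))
toy_chunks = get_chunks(toy_text, size=5, step=2)

# one chunk starts every `step` words
n_expected = math.ceil(12 / 2)
# neighbouring chunks share `size - step` words
shared = set(toy_chunks[0].split()) & set(toy_chunks[1].split())

print(len(toy_chunks), n_expected, sorted(shared))
```

Assuming the `word_count` column (539 for the first article) roughly agrees with a whitespace split, ceil(539 / 50) = 11 matches the output above. With the defaults, each word can appear in up to five chunks, and the last chunks of every article are shorter than `size`.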

In [13]:
# reorder the dataframe
# with one chunk in each row
# instead of the whole text
df_chunks = df.explode('chunks')
df_chunks.shape

(336876, 15)

## Prompting

LLMs generate text from an input, usually referred to as a 'prompt': a piece of text we use as a starting point for predicting new tokens.

When 'chatting' with an LLM we usually provide the model with (at least) two messages: a system and a user prompt.

**System message**:
- **Generic instructions on behavior**: specify how the model should behave (e.g. be helpful, respectful, neutral) or the role it should play (e.g., a teacher, assistant, or advisor).
- **Constraints**: Specific instructions on what the model should avoid or how it should generate responses.
- **Context**: Background information or context that remains constant throughout the session to ensure consistency in responses.

**User message**:

- **Query**: specifies input from the user, such as a question, instruction, or request that the model needs to respond to.
- **Dynamic**: changes with each interaction, reflecting the user's immediate needs, questions, or instructions.

The Hugging Face chat prompt template allows messages as lists of dictionaries.

```python
messages = [
  {
    "role" : "system",
    "content": "<system prompt here>"
  },
  {
    "role" : "user",
    "content": "<user prompt here>"
  }
]
```

Define a message by articulating a system and user prompt.
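Since the same two-message structure recurs throughout the notebook, it can be convenient to build it with a small helper. The sketch below is one possible way to do so; `build_messages` and the placeholder prompts are hypothetical and not part of the notebook or of any library:

```python
def build_messages(system_prompt: str, user_prompt: str, article: str) -> list:
    """Return a system + user message pair, with the article wrapped in ###."""
    return [
        {"role": "system", "content": system_prompt.strip()},
        {"role": "user", "content": f"{user_prompt.strip()}\n\n###{article}###"},
    ]

# placeholder prompts and text, for illustration only
example = build_messages(
    "You are a careful reader of historical newspapers.",
    "Summarize the article in one sentence.",
    "Some article text.",
)
print(example[1]["content"])
```

Keeping the delimiters in one place makes it less likely that the system prompt announces `###` while the user message forgets to add them.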

In [14]:
df.iloc[0].text

'POOR T,i,ENIPAT A 1„k CT  The Poor Law Coirdnissioti(rs have issued a ei; cular, dated the 20th instant, stating that they have consulted the Attorney and Solicitor-General on the construction of the late Removal Act, and give as the result:— I. " That the proviso to the Ist section of the 9 and 10 Vict., c. 66, which sets forth the exceptions to the principal enactments that are to be excluded in the computation of time, is net retrospective in its operation, so as to apply to cases where the five years\' residence was complete before the statute. 2. " That an interval between the completion of the five years residence and the application for the warrant of removal filled up by one of the exceptions contained in the proviso will not p event the operation of the statute in restraining the removal of the pauper whu had resided for the specified time. 3. " That orders of removal obtained previous to th• passing of the Act, but not then executed by the removal of the paupers, cannot now 

In [15]:
messages = [
    {
        "role": "system",
        "content": """
          You are an helpful AI that will assist me with analysing and reading newspaper articles.
          Read the newspaper article attentively and extract the required information.
          Each newspaper article will be enclosed with triple hash tags (i.e. ###).
          Don't make thigs up! If the information is not in the article then just say 'I don't know'"""
              },

    {
        "role": "user",
        "content": f"""Provide a short description of principal characters portrayed newspaper article?

                  ###POOR T,i,ENIPAT A 1„k CT  The Poor Law Coirdnissioti(rs have issued a ei; cular,
                  dated the 20th instant, stating that they have consulted the Attorney and
                  Solicitor-General on the construction of the late Removal Act, and give as the
                  result:— I. " That the proviso to the Ist section of the 9 and 10 Vict., c. 66,
                  which sets forth the exceptions to the principal enactments that are to be
                  excluded in the computation of time, is net retrospective in its operation, so
                  as to apply to cases where the five years\' residence was complete before the statute.
                  2. " That an interval between the completion of the five years residence and the
                  application for the warrant of removal filled up by one of the exceptions contained
                  in the proviso will not p event the operation of the statute in restraining the
                  removal of the pauper whu had resided for the specified time. 3. " That orders
                  of removal obtained previous to th• passing of the Act, but not then executed
                  by the removal of the paupers,###"""
              }
  ]

In [16]:
messages

[{'role': 'system',
  'content': "\n          You are an helpful AI that will assist me with analysing and reading newspaper articles.\n          Read the newspaper article attentively and extract the required information.\n          Each newspaper article will be enclosed with triple hash tags (i.e. ###).\n          Don't make thigs up! If the information is not in the article then just say 'I don't know'"},
 {'role': 'user',
  'content': 'Provide a short description of principal characters portrayed newspaper article?\n\n                  ###POOR T,i,ENIPAT A 1„k CT  The Poor Law Coirdnissioti(rs have issued a ei; cular, \n                  dated the 20th instant, stating that they have consulted the Attorney and \n                  Solicitor-General on the construction of the late Removal Act, and give as the \n                  result:— I. " That the proviso to the Ist section of the 9 and 10 Vict., c. 66, \n                  which sets forth the exceptions to the principal enactme

In [34]:
#help(llm_client.chat_completion)

In [48]:
# # uncomment this code if you want to work locally, comment the other function
# def get_completion(messages: list, temperature=.1, top_p=.1) -> str:
#   """get completion for given system and user prompt
#     Arguments:
#     messages (list): a list containing a system and user message as
#       python dictionaries with keys 'role' and 'content'
#     temperature (float): regulate creativity of the text generation
#     top_p (float): cumulative probability included in the
#       generation process
#   """
#   prompt = pipeline.tokenizer.apply_chat_template(
#         messages,
#         tokenize=False,
#         add_generation_prompt=True
#       )

#   outputs = pipeline(
#     prompt,
#     max_new_tokens=256,
#     eos_token_id=terminators,
#     do_sample=True,
#     temperature=temperature,
#     top_p=top_p,
#       )
#   return outputs[0]["generated_text"][len(prompt):]


def get_completion(messages: list, temperature=.1, top_p=.1) -> str:
    """get completion for given system and user prompt
      Arguments:
        messages (list): a list containing a system and user message as
          python dictionaries with keys 'role' and 'content'
        temperature (float): regulate creativity of the text generation
        top_p (float): cumulative probability included in the
          generation process
    """
    outputs = llm_client.chat_completion(
        messages=messages,
        max_tokens=1024,
        temperature=temperature,
        top_p=top_p
        )
    return outputs.choices[0].message.content
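The docstring mentions `temperature` and `top_p` only briefly. The toy sketch below illustrates the general idea behind both parameters on a made-up next-token distribution: temperature rescales the scores before the softmax, and top-p (nucleus) filtering keeps only the most likely tokens up to a cumulative probability. The function name, tokens and scores are invented for illustration; real inference backends work on the full vocabulary and may differ in detail.

```python
import math

def next_token_distribution(logits: dict, temperature: float = 1.0, top_p: float = 1.0) -> dict:
    """Toy illustration: temperature-scaled softmax followed by top-p filtering."""
    # dividing by a temperature below 1 sharpens the distribution
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exp = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}

    # keep the smallest set of most likely tokens whose cumulative probability reaches top_p
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # renormalise the surviving tokens
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}

# made-up scores for four candidate tokens
toy_logits = {"accident": 2.0, "incident": 1.5, "banquet": 0.5, "debate": 0.1}
full = next_token_distribution(toy_logits, temperature=1.0, top_p=1.0)
sharp = next_token_distribution(toy_logits, temperature=0.1, top_p=0.1)
print(full)
print(sharp)
```

In this toy case the low values used as defaults in `get_completion` collapse the distribution onto the single most likely token, which suggests why the generated answers vary little between runs.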

In [19]:
print(get_completion(messages))

Based on the newspaper article, the principal characters mentioned are:

1. The Poor Law Commissioners: They are the ones who have issued a circular stating their consultation with the Attorney and Solicitor-General on the construction of the late Removal Act.
2. The Attorney-General: He is mentioned as one of the officials consulted by the Poor Law Commissioners on the construction of the Removal Act.
3. The Solicitor-General: He is also mentioned as one of the officials consulted by the Poor Law Commissioners on the construction of the Removal Act.
4. Paupers: They are the individuals who are the subject of the Removal Act, which deals with the removal of poor people from one place to another.

Note that there are no specific individuals mentioned by name in the article.


## Exercise

- Change the system message and ask the model to reply in medieval French.
- Change the user message and ask the model to summarize the article and condense it to one sentence.

In [89]:
# Enter code here

#### Solution

In [None]:
messages = [
    {"role": "system", "content": """
    You are an helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make thigs up! Answer in medieval French!"""},
    {"role": "user", "content": f"""Provide a short description of principal characters portrayed newspaper article?
    ###{df.iloc[0].text}###"""}
]

print(get_completion(messages))


Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Hear ye, hear ye! I shall extract the principal characters from this most singular newspaper article.

* Le défunto, or the deceased, is the mate of the steam tug Earl of Glamorgan, who met a watery grave in the Severn a few days prior to the events described in the article.
* Le frère, or the brother, of the deceased, who repudiated the expense of the more expensive coffin and refused to relinquish the body until his claims were settled.
* Le fossoyeur, or the undertaker, who received the order to prepare a parish coffin, but instead provided a more expensive one at the behest of the authorities. He later appealed to the coroner, who was powerless to intervene.

Mayhap these characters shall play a part in the unfolding drama, as the article hints at a "scene" that may yet ensue.


In [None]:
messages = [
    {"role": "system", "content": """
    You are an helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make thigs up! If the information is not in the article then just say 'Dunno'"""},
    {"role": "user", "content": f"""Summarize the article content in one sentence.
    ###{df.iloc[0].text}###"""}
]

print(get_completion(messages))

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


A body believed to be the mate of the steam tug Earl of Glamorgan, who drowned in the Severn, was initially intended for a parish coffin but was instead given a more expensive one, leading to a dispute over who should pay for the funeral.


## Applying text generation to historical documents


### Example 1: Summarize Summarizations

What happened in January 1899?

In [90]:
df_small = df_chunks[
            (df_chunks.year==1899) & (df_chunks.month==1) # select articles from January 1899
                  ].sample(20, random_state=1984).reset_index(drop=True) # we sample a few to keep things simple
df_small.shape

(20, 15)

In [57]:

def apply_completions(item: pd.Series,
                      system_message: str,
                      user_message: str,
                      text_column: str = 'chunks') -> str:
  """
  Function that applies `get_completion` to one row of a dataframe
  Arguments:
    item (pd.Series): row from a pandas DataFrame
    system_message (str): system prompt, specifies how the system
      should behave
    user_message (str): user prompt, gives instructions on how to
      process each historical document. The document itself will be
      appended from the 'text_column' argument
    text_column (str): name of the text column
  """
  messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
      ]
  messages[1]['content'] += f"\n\n###{item[text_column]}###"
  return  get_completion(messages)

In [92]:
tqdm.pandas() # use tqdm to view progress

system_message = """You are an helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make thigs up! If the information is not in the article then just say 'I don't know''"""
user_message = "Summarize the article content in one sentence."

df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)

100%|██████████| 20/20 [00:20<00:00,  1.02s/it]
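Because the hosted models are rate limited (see the technical note), a loop over many rows may fail halfway. One common remedy is to retry failed calls with exponential backoff. The wrapper below is a generic sketch: `with_retries` and `flaky_completion` are hypothetical names, and catching every `Exception` is a simplification that should be narrowed to the error type the client actually raises.

```python
import time

def with_retries(func, max_attempts: int = 3, base_delay: float = 1.0):
    """Wrap `func` so that failed calls are retried with exponential backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                # wait base_delay, 2 * base_delay, 4 * base_delay, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    return wrapper

# stand-in for a flaky API call: fails twice, then succeeds (illustrative only)
calls = {"n": 0}

def flaky_completion(messages: list) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

safe_completion = with_retries(flaky_completion, base_delay=0.01)
result = safe_completion([])
print(result, calls["n"])
```

In the notebook, `get_completion` could be wrapped in the same way before it is called for each row.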


In [93]:
# get the summaries
df_small['completion'][0]

'Here is a summary of the article content in one sentence:\n\nThe Londesborough Lodge of Freemasons held a successful installation ceremony, where Bro. Charles Nicholson was installed as the new Worshipful Master, and was followed by a banquet at the Station Hotel, where the brethren enjoyed good food, company, and Masonic toasts.'

In [94]:
summaries = '\n'.join([f"###{c}###" for c in df_small['completion']])

In [95]:
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": f"""Based on the article summaries within ### below, what are the most important events? Be concise\n{summaries}"""}
      ]

In [96]:
print(get_completion(messages))

Based on the article summaries, the most important events are:

* The Czar of Russia's proposal to limit armaments
* The war between America and Spain
* The Fashoda incident
* The controversy surrounding the start of the 20th century
* The loss of the gas works in Dorchester
* The government's vaccination bill

These events are significant because they relate to international relations, conflicts, and global issues that had a significant impact on the world at the time.
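The cells above follow a two-step pattern: first summarize each chunk on its own, then ask a single question over the joined summaries. The sketch below restates that pattern as one function. `summarize_then_reduce` is a hypothetical name, and `fake_complete` is a dummy stand-in for `get_completion` so that the block runs without a model.

```python
from typing import Callable, List

def summarize_then_reduce(texts: List[str],
                          complete: Callable[[list], str],
                          system_message: str,
                          map_prompt: str,
                          reduce_prompt: str) -> str:
    """Summarize each text separately, then ask one question over all summaries."""
    summaries = []
    for text in texts:
        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"{map_prompt}\n\n###{text}###"},
        ]
        summaries.append(complete(messages))
    # wrap every summary in ### so the model can tell them apart
    joined = '\n'.join(f"###{s}###" for s in summaries)
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"{reduce_prompt}\n{joined}"},
    ]
    return complete(messages)

# dummy completion function that records the user prompts it receives
seen = []

def fake_complete(messages: list) -> str:
    seen.append(messages[-1]["content"])
    return f"summary of {len(messages[-1]['content'])} characters"

answer = summarize_then_reduce(
    ["first text", "second text"],
    fake_complete,
    system_message="You are a helpful assistant.",
    map_prompt="Summarize the article content in one sentence.",
    reduce_prompt="What are the most important events? Be concise",
)
print(len(seen), answer)
```

The final call only sees the summaries, so anything the first step omits or gets wrong is carried into the overall answer.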


### Example 2: Condense information about accidents

How did accidents in the news change over time?

An analysis using a very simple pipeline combining summarization and RAG.


In [21]:
import re
pattern = re.compile(r'\baccidents?\b', re.I)
pattern.findall('accidents accident AccIdent accidental')

['accidents', 'accident', 'AccIdent']
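The `\b` word boundaries explain why `accidental` is skipped in the output above. The sketch below checks a few more cases, using `pattern.search`, which stops at the first hit and is sufficient when only a yes/no flag is needed. The example strings are illustrative; two of them echo wording from the sample chunk shown further down, including the OCR misspelling `aceident`, which an exact pattern cannot catch.

```python
import re

pattern = re.compile(r'\baccidents?\b', re.I)

# text -> whether we expect the pattern to match
examples = {
    "ACCIDENT AT THE LION BREWERY": True,   # upper case is matched through re.I
    "two railway accidents": True,          # optional plural 's'
    "an accidental discovery": False,       # \b rejects longer words
    "The aceident was caused": False,       # OCR misspelling is missed
}
results = {text: bool(pattern.search(text)) for text in examples}
print(results)
```

Keyword filtering on OCR text therefore tends to undercount; how much depends on the OCR quality of each title.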

In [22]:
tqdm.pandas()
df_chunks['matches'] = df_chunks.chunks.progress_apply(lambda x: bool(pattern.findall(x)))

100%|██████████| 336876/336876 [00:13<00:00, 24399.35it/s]


In [99]:
accident_1810s = df_chunks[
                    (df_chunks.year.between(1810,1820)) & (df_chunks['matches'] == True)
                      ].sample(n=10, random_state=1984)

accident_1890s = df_chunks[
                    (df_chunks.year.between(1890,1900)) & (df_chunks['matches'] == True)
                      ].sample(n=10, random_state=1984)
print(accident_1810s.shape,accident_1890s.shape)

(10, 16) (10, 16)


In [100]:
(df_chunks['matches'] == True).value_counts()

matches
False    329628
True       7248
Name: count, dtype: int64

In [101]:
accident_1890s.iloc[3].chunks

'please acknowledge receipt of same to the Hon. Secretary, 19, South Scarborough Street, West Hartlepool. ACCIDENT AT THE LION BREWERY.—Last night, about eight o\'clock, an accident of a rather serious nature occurred at the above building to two young men named John Hart and John Gates. The former resides in Westmoreland Street, and Gates is employed by Messrs Bland Brothers as a plumber. The aceident was caused by the gas engine exploding. Dr. Young was called in and it was found that both were badly burnt. WEST HARTLEPOOL PARLIAMENTARY AND LITERARY DEBATING SOCIETY. —Last night the first debate under the auspices of the above society was held in Mr Rowe\'s restaurant, Lynn Street, the subject for discussion being "Should there be a legal eight hours day?" Mr Withy occupied the chair. The affirmative was taken by Mr King, and the negative by Mr Bryden. Messrs Rafter, Looney, Mason, and Tarn also took part in the debate, which was adjourned for a week. There was a good attendance. MAS

In [107]:
system_message = """You are an helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make thigs up! If the information is not in the article then just say 'I don't know'.
    Focus on the answer do not add any unnecessary introductory texts."""
user_message = """Does the article talk about an accident?
If yes summarize the article content in one sentence.
If not, answer 'no accident mentioned' """

accident_1810s['completion'] =  accident_1810s.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)
accident_1890s['completion'] =  accident_1890s.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)

100%|██████████| 10/10 [00:07<00:00,  1.31it/s]
100%|██████████| 10/10 [00:21<00:00,  2.13s/it]


In [108]:
accident_1890s[['chunks','completion']].iloc[3].values

array(['please acknowledge receipt of same to the Hon. Secretary, 19, South Scarborough Street, West Hartlepool. ACCIDENT AT THE LION BREWERY.—Last night, about eight o\'clock, an accident of a rather serious nature occurred at the above building to two young men named John Hart and John Gates. The former resides in Westmoreland Street, and Gates is employed by Messrs Bland Brothers as a plumber. The aceident was caused by the gas engine exploding. Dr. Young was called in and it was found that both were badly burnt. WEST HARTLEPOOL PARLIAMENTARY AND LITERARY DEBATING SOCIETY. —Last night the first debate under the auspices of the above society was held in Mr Rowe\'s restaurant, Lynn Street, the subject for discussion being "Should there be a legal eight hours day?" Mr Withy occupied the chair. The affirmative was taken by Mr King, and the negative by Mr Bryden. Messrs Rafter, Looney, Mason, and Tarn also took part in the debate, which was adjourned for a week. There was a good attendan

In [119]:
# note: this joins the raw chunks (as the printed output shows);
# use the `completion` column instead to pass the one-sentence summaries
summaries_1810s = '\n'.join([f"###{c}###" for c in accident_1810s.chunks if not c.lower().startswith('no')])
summaries_1810s = f"\n```\nSummaries for 1810s:\n\n{summaries_1810s}\n```"
summaries_1890s = '\n'.join([f"###{c}###" for c in accident_1890s.chunks if not c.lower().startswith('no')])
summaries_1890s = f"\n```\nSummaries for 1890s:\n\n{summaries_1890s}\n```"
print(summaries_1890s)


```
Summaries for 1890s:

###it. On their present form there is no team in the League to touch Nelson, and barring accidents—like last season for instance— the championship looks a gift for them." It may be stated that in the two matches NeNOll have scored 379 runs for the loss of only L6tie wickets, giving them the wonderful average of 42 runs per wicket. On the other hand Burnley have last twenty wickets for NO runs which works out an average of 9 runs per wicket. Comment is unnecessary. . I have nothing but congratulations for the Nelson team, whose fielding, with one exception, was as near perfection as possible. Those two catches in the slips by Hartley opened the eyes of the Turfites, and reminded them of Joe Allen in his pahniest days. Then the bowling of Shacklock and Cuttell was really splendid, and at one time it seemed probable that Burnley would not reach 40. Muschainp, too, behind the wickets was all there, as "John Henry" found to his cost. Bower, by his score of 57 not 

In [177]:
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": f"""
        Below we provide articles from two different decades. First from 1810s and then from the 1890s.
        Each of the decades is enclosed within ```.
        Each summary is enclosed withing ###
        Answer the following question concisely: what the principal differences between the two decades?
        \n{summaries_1810s+summaries_1890s}"""}
      ]
print(get_completion(messages))

Here is the analysis of the two decades:

**1810s:**

* The decade is characterized by a focus on accidents and disasters, with several articles reporting on tragic events such as shipwrecks, explosions, and fires.
* There is a sense of community and social responsibility, with articles highlighting the importance of helping those in need, such as widows and orphans.
* The language used is formal and descriptive, with a focus on providing detailed accounts of the events being reported.
* There is a sense of optimism and hope, with articles highlighting the progress being made in various fields, such as engineering and architecture.

**1890s:**

* The decade is characterized by a focus on technology and innovation, with articles reporting on new inventions and discoveries, such as the development of the automobile and the use of photography.
* There is a sense of excitement and wonder, with articles highlighting the potential benefits and implications of these new technologies.
* The la

### Example 3: Structured Generation

Biography as microgenre

In [37]:
df_small = df_chunks[
                    (df_chunks['matches'] == True)
                      ].sample(n=10, random_state=1984)

In [38]:
df_small['chunks'].iloc[7]

'most necessary precaution, in the case of estates so well peopled with game, that the right to bag black, brown, and grey feathers, frequently brings the owner a few hundreds of annual rent. ANOTHER STEAM BOILER EXPLOSION.—An accident occurred on Friday evening to the Antelope steamer, when opposite to Moville, on her way to Glasgow. It appeared that one of the lower plates of the boiler gave way, and the water gushing out suddenly into the hold destroyed, we regret to say, eighteen head of cattle, and slightly scalded four or fire others. No person on board received any injury, and the vessel, which was towed in this morning by the Isabella, will, we understand, be able to proceed on her voyage this evening.—Derry Sentinel of Saturday. THE GzEsARTAN OPERATION.—On Friday an inquest was held in New Gravel-lane upon the body of Sareh Bunyan, aged 43 years. The deceased it appeared had been subject to fits, and on the morning of Tuesday was taken severely ill. Mr. Goulbern, a surgeon, wa

In [39]:
#  """You are an helpful AI that will assist me with analysing source documents in the form of historical newspaper articles.
#     Read the newspaper articles attentively and extract structured information formatted as a list of JSON blobs.
#     Provide all relevant short source snippets from the documents on which you directly based your answer, and a confidence score as a float between 0 and 1.
#     Keep the source snippet short to just a few words and not complete sentences.
#     The snippet MUST be extracted from the soutce, with spelling and wording identical to the source.
#     Each newspaper article will be enclosed with triple hash tags (i.e. ###).
#     Your answer should be built as follows, it must contain the "Answer:" and "End of answer." sequences.
#     Don't make thigs up! If you don't know the answer, simply return no value"""

In [65]:
system_message = """You are an helpful AI that will assist me with analysing source documents in the form of historical newspaper articles.
    Read the newspaper articles attentively and extract structured information formatted as a list of Python dictionaries.
    Provide all relevant short source snippets from the documents on which you directly based your answer.
    Keep the source snippet short to just a few words and not complete sentences.
    The snippet MUST be extracted from the soutce, with spelling and wording identical to the source.
    This list of JSON blobs should begin with a "START" tag and end with a "END" tag.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make thigs up! If you don't know the answer, simply return no value"""


user_message = """
If the article describes a historical accident, extract biographical information about the individuals involved in the accidents.
Return a list of Python dictionaries for each individual which records important personal attributes such gender, age and profession, and others that are relevant.
Each attribute is a key in a dictionary.
Record personal attribures as dictionaries as shown in the example below.
Also add one key with "outcome" that records what happened to person ("drowned", "survived", "injured")
Add a confidence score as a float between 0 and 1 for each snippet extracted.
Under "source_snippets" collect text fragments that record what happened to person involved.

START
[
  {
  "name" : { "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  "gender" : { "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  "profession" :{ "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  ... other attributes ...,
  "outcome" : { "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  "summary": { "value" :summary, "confidence" : your_confidence_score }
  },
...]
END
"""



In [66]:
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message + f'\n\n###{df_small["chunks"].iloc[4]}###'}
      ]
print(get_completion(messages))

START
[
  {
    "name" : { "value": "The Rev. G. M. Gordon", "source": "who as killed in the sortie from Candahar", "confidence": 0.9 },
    "gender" : { "value": "male", "source": "", "confidence": 0.8 },
    "profession" : { "value": "clergyman", "source": "The Rev. G. M. Gordon", "confidence": 0.9 },
    "outcome" : { "value": "killed", "source": "who as killed in the sortie from Candahar", "confidence": 0.9 },
    "source_snippets" : ["who as killed in the sortie from Candahar"],
    "summary" : { "value": "The Rev. G. M. Gordon was killed in the sortie from Candahar", "confidence": 0.9 }
  },
  {
    "name" : { "value": "Duke of Connaught", "source": "The Duke of Connaught", "confidence": 0.9 },
    "gender" : { "value": "male", "source": "", "confidence": 0.8 },
    "profession" : { "value": "royal", "source": "The Duke of Connaught", "confidence": 0.9 },
    "outcome" : { "value": "injured", "source": "Beyond the shaking, however, the Duke was little the worse for the mishap", "

In [67]:
df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


100%|██████████| 10/10 [00:49<00:00,  4.94s/it]


In [68]:
print(df_small['completion'])

9969    START\n[\n  {\n    "name" : {"value" : "James Turner", "source" : "Subsequently a man who, it is...
3526    START\n[\n  {\n    "name" : { "value": "James Heng", "source": "ACCIDENT AT WOOLWICH DOCKYARD;— ...
8563    START\n[\n  {\n    "name" : { "value": "Edward Ball", "source": "the collision of the ferry stea...
7666    START\n[\n  {\n    "name" : {"value": "unknown", "source": "A frightful accident occurred on Sat...
9068    START\n[\n  {\n    "name" : { "value": "The Rev. G. M. Gordon", "source": "who as killed in the ...
3438                                                                                             END\n\n###
7306    START\n[\n  {\n    "name" : { "value": "Miles Barnes", "source": "the body of Miles Barnes", "co...
3077    START\n[\n  {\n    "name" : { "value": "Sareh Bunyan", "source": "On Friday an inquest was held ...
5963    START\n[\n  {\n    "name" : { "value": "Samuel Birtwistle", "source": "A youth named Samuel Birt...
5177    END\n\nSTART\n[\n  {

In [70]:
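def eval_completion(completion):
  # Keep the text after the last START tag and drop a trailing END tag.
  # Caution: eval() runs whatever the model returned as Python code,
  # so only use it on output you trust.
  try:
    return eval(completion.split('START')[-1].strip().removesuffix('END').strip())
  except Exception as e:
    # Truncated or malformed completions end up here and yield an empty list.
    print(e)
    return []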

df_small['completion_eval'] = df_small['completion'].apply(eval_completion)

unterminated string literal (detected at line 40) (<string>, line 40)
name 'END' is not defined
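
The two errors above come from a truncated completion and from a completion without a `START` tag. As a minimal sketch of a stricter alternative, the hypothetical helper `parse_completion` below (not part of the original notebook) only accepts a complete `START ... END` block and parses it with `ast.literal_eval` or `json.loads`, neither of which executes code. It assumes the model follows the tag convention requested in the system message.

```python
import ast
import json
import re

# Hypothetical helper: matches a bracketed list between a START and an END tag.
BLOCK_PATTERN = re.compile(r"START\s*(\[.*?\])\s*END", re.DOTALL)


def parse_completion(completion):
    """Return the list found between START and END, or [] if there is none."""
    match = BLOCK_PATTERN.search(completion)
    if match is None:
        # No complete START ... END block, e.g. a truncated generation.
        return []
    block = match.group(1)
    # Try Python literal syntax first, then JSON (for true/false/null).
    for parser in (ast.literal_eval, json.loads):
        try:
            parsed = parser(block)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(parsed, list):
            return parsed
    return []
```

It could be applied in the same way as `eval_completion`, for example with `df_small['completion'].apply(parse_completion)`. Unlike the version above, it silently returns an empty list for incomplete output, so you may want to count the empty rows afterwards.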


In [71]:
df_small['completion_eval']

9969    [{'name': {'value': 'James Turner', 'source': 'Subsequently a man who, it is alleged, first rais...
3526    [{'name': {'value': 'James Heng', 'source': 'ACCIDENT AT WOOLWICH DOCKYARD;— accident occurred y...
8563    [{'name': {'value': 'Edward Ball', 'source': 'the collision of the ferry steamer Crocus with the...
7666    [{'name': {'value': 'unknown', 'source': 'A frightful accident occurred on Saturday week at the ...
9068                                                                                                     []
3438                                                                                                     []
7306    [{'name': {'value': 'Miles Barnes', 'source': 'the body of Miles Barnes', 'confidence': 1.0}, 'g...
3077    [{'name': {'value': 'Sareh Bunyan', 'source': 'On Friday an inquest was held in New Gravel-lane ...
5963    [{'name': {'value': 'Samuel Birtwistle', 'source': 'A youth named Samuel Birtwistle', 'confidenc...
5177    [{'name': {'value': 
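
Nested lists of dictionaries are awkward to analyse further. The sketch below shows one way to flatten them into one row per person. The names `flatten_person` and `flatten_completions` are made up for illustration, and the code assumes the `{"value": ..., "source": ..., "confidence": ...}` layout requested in the prompt.

```python
def flatten_person(person):
    """Turn one nested person dictionary into a flat {column: value} dictionary."""
    flat = {}
    for attribute, content in person.items():
        if isinstance(content, dict):
            flat[attribute] = content.get("value")
            flat[f"{attribute}_confidence"] = content.get("confidence")
        else:
            # e.g. "source_snippets", which is a plain list of strings
            flat[attribute] = content
    return flat


def flatten_completions(completions):
    """Flatten {article_id: [person, ...]} into a list with one row per person."""
    rows = []
    for article_id, persons in completions.items():
        for person in persons:
            if isinstance(person, dict):
                rows.append({"article_id": article_id, **flatten_person(person)})
    return rows
```

Assuming `pandas` is imported as `pd`, something like `pd.DataFrame(flatten_completions(df_small['completion_eval'].to_dict()))` should then give a table with columns such as `name`, `name_confidence` and `outcome`. Articles with an empty list simply contribute no rows.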

In [73]:
df_small['completion_eval'].iloc[7]

[{'name': {'value': 'Sareh Bunyan',
   'source': 'On Friday an inquest was held in New Gravel-lane upon the body of Sareh Bunyan, aged 43 years.',
   'confidence': 1.0},
  'gender': {'value': 'female', 'source': '', 'confidence': 1.0},
  'profession': {'value': 'none', 'source': '', 'confidence': 1.0},
  'age': {'value': 43,
   'source': 'On Friday an inquest was held in New Gravel-lane upon the body of Sareh Bunyan, aged 43 years.',
   'confidence': 1.0},
  'outcome': {'value': 'died',
   'source': 'the poor woman dying in a few hours after being attacked.',
   'confidence': 1.0},
  'source_snippets': ['On Friday an inquest was held in New Gravel-lane upon the body of Sareh Bunyan, aged 43 years.',
   'the poor woman dying in a few hours after being attacked.'],
  'summary': {'value': 'Sareh Bunyan, 43, died after being attacked by apoplexy and undergoing a Cmsarian operation.',
   'confidence': 1.0}}]
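
The system message demands that every snippet is copied verbatim from the article, but nothing guarantees the model complies. A simple check is to test whether each snippet really occurs in the text. The helper `check_snippets` below is a hypothetical sketch and uses exact string matching, so even a small difference in spelling counts as a miss.

```python
def check_snippets(person, article_text):
    """Map each attribute to True/False: does its snippet occur verbatim in the article?"""
    report = {}
    for attribute, content in person.items():
        # Skip plain lists (e.g. "source_snippets") and attributes with an empty source.
        if isinstance(content, dict) and content.get("source"):
            report[attribute] = str(content["source"]) in article_text
    return report


# Toy example: the second snippet is deliberately altered to show a failed check.
article = "On Friday an inquest was held in New Gravel-lane upon the body of Sareh Bunyan, aged 43 years."
person = {
    "name": {"value": "Sareh Bunyan", "source": "the body of Sareh Bunyan", "confidence": 1.0},
    "gender": {"value": "female", "source": "", "confidence": 1.0},
    "outcome": {"value": "died", "source": "the poor woman died", "confidence": 1.0},
}
print(check_snippets(person, article))
```

Attributes that fail the check are good candidates for manual inspection, regardless of the confidence score the model assigned to them.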

In [110]:
row = df_small.iloc[7]
html_output = row['chunks']
for p_dict in row['completion_eval']:
  for attr, attr_dict in p_dict.items():
    try:
      if isinstance(attr_dict, dict):
        # Highlight only snippets with a confidence above 0.5 and a non-empty source.
        if attr_dict.get('confidence', .0) > .5 and attr_dict.get("source", None):
          source = str(attr_dict['source'])
          # Plain string replacement: the snippet is literal text, not a regular expression.
          html_output = html_output.replace(source,
                   f'<span style="background-color: yellow;">{source}</span>')
    except Exception as e:
      print(e, attr_dict)
      continue

In [111]:
from IPython.display import HTML
HTML(html_output)

### Example 4: OCR correction

In [143]:
df_small_bad_ocr = df_chunks.sort_values('ocrquality', ascending=True)[:1000].sample(n=10)

In [144]:
system_message = "You are a helpful AI that provides truthful corrections of historical text."

user_message = """Transcribe the text and correct typos and errors in the text caused by bad optical character recognition (OCR).
Do not add any information that is not in the original text!"""

df_small_bad_ocr['completion'] = df_small_bad_ocr.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


100%|██████████| 10/10 [01:24<00:00,  8.44s/it]


In [145]:
df_small_bad_ocr.iloc[0]['chunks']

"Gladstone, W. E. Pigott, Francis Beaumont,Somerset Glyn, G. C. Pilkington, James Bellew, R. 31. Glyn, G. G. Pollard Urquhart W Berkeley, 11. Goldsmid, Sir F. Ponsonby, A. Berkeley, P. Gordon, C. W. Pritchard, John Bethell, Sir R. Graham, Sir J. R. G. Proby, Lord. G. L. Biddulph,'_Col. R.3l' Greene, Capt. J. Pryse, Capt. E. L. Biggs, John Greenwood, J. Puller,C. W. Black, Adam Grey, Rt. 'ln. Sir G. Ricardo, 0. Blencowe, J. G. Grosvenor, Earl Roebuck, J. A. Bonham-Carter, J. Gurdon, B. Rothschild,Baron L. Bouverie, E. P. Gurney, Samuel Rothschild,Baron 31 Bouverie, P. P. Hadfield, Geo. Roupell, W. Bright, J. Hanbury, R., jun. Russell, Lord J. Bruce, H. A. Handley, John Russell, Hastings Buchanan, W. Hankey, Thomson Russell, A. J. E. Buckley, Maj.-Gen. Hartington,Marquis Russell, Sir Wm. Bulkeley, Sir R., Bt. Hayter, Sir Wm. St. Aubyn, J. Buller, J. W. Headlam, Thos. E Salomons,Alderman Buller, Sir A. Henley, Lord Scholefie,d, W. Bury, Viscount Hervey, Lord A Scott, Sir W. Butler, C. S. 

In [146]:
print(df_small_bad_ocr.iloc[0]['completion'])

Here is the transcribed text with corrections for typos and OCR errors:

###Gladstone, W. E. Pigott, Francis Beaumont, Somerset Glyn, G. C. Pilkington, James Bellew, R. Glyn, G. G. Pollard Urquhart, W. Berkeley, P. Goldsmid, Sir F. Ponsonby, A. Berkeley, C. W. Gordon, C. W. Pritchard, John Bethell, Sir R. Graham, Sir J. R. Proby, Lord G. L. Biddulph, Colonel R. Greene, Captain J. Pryse, Captain E. L. Biggs, John Greenwood, J. Puller, C. W. Black, Adam Grey, Right Hon. Sir G. Ricardo, O. Blencowe, J. G. Grosvenor, Earl Roebuck, J. A. Bonham-Carter, J. Gurdon, B. Rothschild, Baron L. Bouverie, E. P. Gurney, Samuel Rothschild, Baron 31 Bouverie, P. P. Hadfield, George Roupell, W. Bright, J. Hanbury, R., jun. Russell, Lord J. Bruce, H. A. Handley, John Russell, Hastings Buchanan, W. Hankey, Thomson Russell, A. J. E. Buckley, Major-General Hartington, Marquis Russell, Sir Wm. Bulkeley, Sir R., Bt. Hayter, Sir Wm. St. Aubyn, J. Buller, J. W. Headlam, Thomas E. Salomons, Alderman Buller, Sir 

In [147]:
df_small_bad_ocr.iloc[3]['chunks']

'portion of the chain taken with it £2O more. The Italian who gave immediate information at the II division station-house in Leman-strAt, Whte ilhiaialpteill,:ezwhlatosdn,b‘a-eeir(l,dyn for time,n and sad rhe l fora few a 1 ipc,umrisy; baenedn in all the capitals of Europe, and had not been robbed before. THE " ImPßovEmElir ttv:nt NP.K. —`\'lie alterations in Hyde Park in the way of artificlll do not seem to have given satisfaction generally. A deputation, representing the inhabitants of Park-side, Knightsbridge, has waited, by appointment, on the Right Hon. A. 11. Layard, at the Office of Works, to draw his attention to the embankment and plantation which have been recently formed in Hyde Park, in the immediate rear of their houses. It was explained that the mound, together with the trees and shrubs, were a very great annoyance to the oecu:-.qers, as they abut out light and air from the back rooms, and quite bid the view of the park, hitherto enjoyed, besides rendasiag the houses damp

In [148]:
df_small_bad_ocr.iloc[3]['completion']

'Here is the transcribed and corrected text:\n\n###A portion of the chain taken with it £20 more. The Italian who gave immediate information at the II division station-house in Leman Street, Whitechapel, said that he had been robbed before. THE "Improvement" in Hyde Park in the way of artificial works does not seem to have given satisfaction generally. A deputation, representing the inhabitants of Park-side, Knightsbridge, has waited, by appointment, on the Right Hon. A. H. Layard, at the Office of Works, to draw his attention to the embankment and plantation which have been recently formed in Hyde Park, in the immediate rear of their houses. It was explained that the mound, together with the trees and shrubs, were a very great annoyance to the occupants, as they abut out light and air from the back rooms, and quite block the view of the park, hitherto enjoyed, besides rendering the houses damp and unhealthy. Mr. C. Mercier pointed out that the growth of the saplings which had been pla
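
The completions above start with an introductory sentence and a `###` marker, and the model sometimes rewrites heavily garbled passages rather freely. The sketch below strips the wrapper and computes a rough character-level similarity between the original and the corrected text with `difflib` from the standard library. The function names `strip_wrapper` and `similarity` are made up for illustration, and a similarity score is only a crude signal, not an evaluation of correctness.

```python
import difflib


def strip_wrapper(completion):
    """Keep the text after the first '###' marker (if any) and drop trailing markers."""
    _, marker, rest = completion.partition("###")
    text = rest if marker else completion
    return text.strip().strip("#").strip()


def similarity(original, corrected):
    """Character-level similarity between 0 and 1 (1.0 means identical)."""
    matcher = difflib.SequenceMatcher(None, original, corrected, autojunk=False)
    return matcher.ratio()


# Toy example based on fragments of the output above.
completion = "Here is the transcribed and corrected text:\n\n###A portion of the chain###"
print(strip_wrapper(completion))
print(similarity("Leman-strAt", "Leman Street"))
```

Rows with a low similarity are worth reading closely: either the OCR was very poor, or the model dropped or invented text despite the instruction not to add information.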

In [None]:
df_small_bad_ocr.to_csv('newspaper_ocr_corrected.csv')

## Exercise

Experiment with your own system and user messages! Have fun :-)

In [None]:
# enter code here

# Fin.