<a href="https://colab.research.google.com/github/kasparvonbeelen/UIBK-DH-LLM-Workshop/blob/dev/LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using open-source LLMs for analysing humanities data


Whereas the previous examples focussed on interpreting specific model predictions, the notebook below explores the use of generative AI for processing and analysing historical newspapers.

Major hurdles to working with LLMs are cost and/or infrastructure. Unlike GPT-2 or BERT, larger LLMs can be difficult to run locally, and commercial APIs can be expensive.

## Why Open-source?

- **privacy:** you might not want to share your data (and ideas) with companies such as OpenAI;
- **cost:** leaving aside the caveat above, open-source models might reduce costs if you want to apply a prompt to, for example, 10k newspaper articles;
- **transparency:** be mindful, though, that there are different gradations of open source: even when you are able to access the model weights, you might remain in the dark about training data and other factors;
- **flexibility:** while some closed models allow fine-tuning on your own data (which ties in with privacy), open-source models still give you more freedom and wiggle room to build other models and new applications.

## Why Large Language Models

The analysis of large text collections is often described as 'distant reading': reading without reading, relying on quantification and measurement to study large collections of text. LLMs can support this kind of work in at least two ways:

- Summarization and "baby-RAG"
- Speed up annotation and/or information extraction via structured generation

## Goals of this Session

This notebook covers a few practical as well as theoretical aspects of working with LLMs in the context of humanities research. The goal is to start a discussion on:
  - where to find and how to deploy open-source LLMs?
  - what tasks would make sense? which models work well for a selected task?
  - how to evaluate outcomes and performance?


We want to keep things simple!

We will mainly be playing around with Llama-3 to get a feeling for how this might change the way we approach data processing as well as the types of research questions we'd like to tackle.


## Technical note

We will rely on the Hugging Face `InferenceClient` for accessing LLMs. These hosted models are freely accessible, but rate limits apply! If you want to deploy a 'local' version (we're still on Colab, but the code should also work on your computer), uncomment the code below (where indicated) and make sure you are using a [GPU](https://cloud.google.com/gpu). To select a GPU on Colab, go to **`Runtime`** and select **`Change runtime type`**, then select `T4 GPU` (or any other available GPU).



This notebook is inspired by: https://huggingface.co/learn/cookbook/structured_generation

In [2]:
# install the transformers library and other dependencies
!pip install -q -U "transformers==4.40.0" pydantic accelerate outlines datasets


## The Hugging Face Hub

In the example below, we will experiment with `Llama-3-8B-Instruct`, part of Llama 3, a recent series of open-source LLMs created by Meta. To use Llama 3 you need to:

- Make an account on Hugging Face: https://huggingface.co/
- Go to the Llama-3-8B model page and sign the terms of use (you should get a reply swiftly): https://huggingface.co/meta-llama/Meta-Llama-3-8B
- Create a user access token with read access: https://huggingface.co/docs/hub/en/security-tokens
- Run the code cell below to log into the Hugging Face Hub and paste your access token when prompted
- Reply `n` to the question 'Add token as git credential? (Y/n)'

In [3]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Preparing model and data

### Import libraries

In [4]:
import warnings
warnings.filterwarnings('ignore') # disable warnings

In [5]:
import transformers
from huggingface_hub import InferenceClient
from datasets import Dataset
from tqdm import tqdm
import pandas as pd
import torch
import json
pd.set_option("display.max_colwidth", 100)

### Load model

In [6]:
# choose an LLM
repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# instantiate the inference client
llm_client = InferenceClient(model=repo_id, timeout=120)

In [7]:
# # define the model, we use the instruct variant
# checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
# device = 'cuda' # make sure you use a GPU

# # instantiate a text generation pipeline
# pipeline = transformers.pipeline(
#     "text-generation",
#     model=checkpoint,
#     model_kwargs={"torch_dtype": torch.bfloat16},
#     device=device,
# )

# # stop generating at the end-of-sequence or the end-of-turn token
# terminators = [
#     pipeline.tokenizer.eos_token_id,
#     pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
# ]

### Download data

In [8]:
# download a sample of 10,000 newspaper articles
!wget -q --show-progress https://github.com/kasparvonbeelen/lancaster-newspaper-workshop/raw/wc/data/sample_lwm_hmd_mt90_10000.csv.zip
# unzip the downloaded sample
!unzip -o sample_lwm_hmd_mt90_10000.csv.zip
!rm -r __MACOSX

Archive:  sample_lwm_hmd_mt90_10000.csv.zip
  inflating: sample_lwm_hmd_mt90_10000.csv  
  inflating: __MACOSX/._sample_lwm_hmd_mt90_10000.csv  


In [9]:
df = pd.read_csv('sample_lwm_hmd_mt90_10000.csv')
df.head(3)

Unnamed: 0,NLP,issue,art_num,title,collection,full_date,year,month,day,location,word_count,ocrquality,text,decade
0,2194,1026,art0021,The Sun.,British Library Heritage Made Digital Newspapers,1846-10-26,1846,10,26,"London, England",539,0.9705,"POOR T,i,ENIPAT A 1„k CT The Poor Law Coirdnissioti(rs have issued a ei; cular, dated the 20th ...",1840
1,2645,925,art0004,The Press.,British Library Heritage Made Digital Newspapers,1858-09-25,1858,9,25,"London, England",2263,0.9663,"THE PRESS, SEPTEMBER 25, 1858. in managing their own business and dealing with matters of local...",1850
2,2194,323,art0006,The Sun.,British Library Heritage Made Digital Newspapers,1840-03-23,1840,3,23,"London, England",795,0.9351,"PUBLICATIONS. This day is published, in post Bvo., with Woodcuts and Twelve coloured Plates, pr...",1840


In [10]:
df.shape

(10000, 14)

### Process data

In [80]:
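def get_chunks(text: str, size: int = 250, step: int = 50) -> list:
  """Divide a text into overlapping chunks of words.
  Arguments:
    text (str): input text
    size (int): maximum number of whitespace-separated words in each chunk
    step (int): number of words between the start of consecutive chunks;
      chunks overlap when step is smaller than size
  Returns a list of strings
  """
  words = text.split()
  # slide a window of `size` words over the text, moving `step` words at a time
  return [' '.join(words[i:i+size]) for i in range(0, len(words), step)]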

In [81]:
# apply chunking to text
df['chunks'] = df.text.apply(get_chunks)

In [82]:
len(df.text[0]),len(df['chunks'][0])

(2969, 11)
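
Because `step` is smaller than `size`, consecutive chunks overlap: each window starts 50 words after the previous one but spans up to 250 words. The toy example below is a sketch with much smaller, illustrative values (`size=4`, `step=2`) to make the overlap visible. `sliding_chunks` is a stand-in name that mirrors the logic of `get_chunks`; it is not part of the notebook.

```python
import math

def sliding_chunks(text: str, size: int = 4, step: int = 2) -> list:
    """Same windowing logic as get_chunks, with small illustrative defaults."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), step)]

toy = "one two three four five six seven"
chunks = sliding_chunks(toy)
for chunk in chunks:
    print(chunk)

# one window per start position 0, step, 2*step, ...
expected_count = math.ceil(len(toy.split()) / 2)
print(len(chunks), expected_count)
```

With the notebook's defaults, and assuming the `word_count` column (539 for the first row) roughly matches the whitespace split, the first article yields ceil(539 / 50) = 11 chunks, in line with the output above. The last few chunks are shorter than 250 words.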

In [83]:
# reorder the dataframe
# with one chunk in each row
# instead of the whole text
df_chunks = df.explode('chunks')
df_chunks.shape

(336876, 15)

## Prompting

LLMs generate text from an input, usually referred to as a 'prompt': a piece of text we use as a starting point for predicting new tokens.

When 'chatting' with an LLM we usually provide the model with (at least) two messages: a system and a user prompt.

**System message**:
- **Generic instructions on behavior**: specify how the model should behave (e.g. be helpful, respectful, neutral) or the role it should play (e.g., a teacher, assistant, or advisor).
- **Constraints**: Specific instructions on what the model should avoid or how it should generate responses.
- **Context**: Background information or context that remains constant throughout the session to ensure consistency in responses.

**User message**:

- **Query**: specifies input from the user, such as a question, instruction, or request that the model needs to respond to.
- **Dynamic**: changes with each interaction, reflecting the user's immediate needs, questions, or instructions.

The Hugging Face chat template expects messages as a list of dictionaries.

```python
messages = [
  {
    "role" : "system",
    "content": "<system prompt here>"
  },
  {
    "role" : "user",
    "content": "<user prompt here>"
  }
]
```
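
Since the same structure is reused for every article, it can help to build it with a small function. The sketch below is one possible way to do this; `build_messages` is a hypothetical helper name, and wrapping the article in `###` is simply a delimiter convention chosen for illustration.

```python
def build_messages(system_prompt: str, user_prompt: str, article: str) -> list:
    """Return a system/user message pair with the article wrapped in ###."""
    return [
        {"role": "system", "content": system_prompt.strip()},
        {"role": "user", "content": f"{user_prompt.strip()}\n\n###{article}###"},
    ]

example = build_messages(
    "You are a helpful assistant for reading newspaper articles.",
    "Summarize the article in one sentence.",
    "ACCIDENT AT THE LION BREWERY. ...",  # placeholder, not a full article
)
print(example[1]["content"])
```

Keeping the instruction and the document separate in this way makes it easy to swap in a different article without retyping the prompt.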

Define a message by articulating a system and user prompt.

In [84]:
df.iloc[0].text

'POOR T,i,ENIPAT A 1„k CT  The Poor Law Coirdnissioti(rs have issued a ei; cular, dated the 20th instant, stating that they have consulted the Attorney and Solicitor-General on the construction of the late Removal Act, and give as the result:— I. " That the proviso to the Ist section of the 9 and 10 Vict., c. 66, which sets forth the exceptions to the principal enactments that are to be excluded in the computation of time, is net retrospective in its operation, so as to apply to cases where the five years\' residence was complete before the statute. 2. " That an interval between the completion of the five years residence and the application for the warrant of removal filled up by one of the exceptions contained in the proviso will not p event the operation of the statute in restraining the removal of the pauper whu had resided for the specified time. 3. " That orders of removal obtained previous to th• passing of the Act, but not then executed by the removal of the paupers, cannot now 

In [85]:
messages = [
    {
        "role": "system",
        "content": """
          You are a helpful AI that will assist me with analysing and reading newspaper articles.
          Read the newspaper article attentively and extract the required information.
          Each newspaper article will be enclosed with triple hash tags (i.e. ###).
          Don't make things up! If the information is not in the article then just say 'I don't know'"""
              },

    {
        "role": "user",
        "content": f"""Provide a short description of the principal characters portrayed in the newspaper article.

                  ###POOR T,i,ENIPAT A 1„k CT  The Poor Law Coirdnissioti(rs have issued a ei; cular,
                  dated the 20th instant, stating that they have consulted the Attorney and
                  Solicitor-General on the construction of the late Removal Act, and give as the
                  result:— I. " That the proviso to the Ist section of the 9 and 10 Vict., c. 66,
                  which sets forth the exceptions to the principal enactments that are to be
                  excluded in the computation of time, is net retrospective in its operation, so
                  as to apply to cases where the five years\' residence was complete before the statute.
                  2. " That an interval between the completion of the five years residence and the
                  application for the warrant of removal filled up by one of the exceptions contained
                  in the proviso will not p event the operation of the statute in restraining the
                  removal of the pauper whu had resided for the specified time. 3. " That orders
                  of removal obtained previous to th• passing of the Act, but not then executed
                  by the removal of the paupers,###"""
              }
  ]

In [86]:
messages

[{'role': 'system',
  'content': "\n          You are a helpful AI that will assist me with analysing and reading newspaper articles.\n          Read the newspaper article attentively and extract the required information.\n          Each newspaper article will be enclosed with triple hash tags (i.e. ###).\n          Don't make things up! If the information is not in the article then just say 'I don't know'"},
 {'role': 'user',
  'content': 'Provide a short description of the principal characters portrayed in the newspaper article.\n\n                  ###POOR T,i,ENIPAT A 1„k CT  The Poor Law Coirdnissioti(rs have issued a ei; cular, \n                  dated the 20th instant, stating that they have consulted the Attorney and \n                  Solicitor-General on the construction of the late Removal Act, and give as the \n                  result:— I. " That the proviso to the Ist section of the 9 and 10 Vict., c. 66, \n                  which sets forth the exceptions to the principal enactme

In [87]:
# # uncomment this code if you want to work locally, and comment out the other function
# def get_completion(messages: list, temperature=.1, top_p=.1) -> str:
#   """get completion for given system and user prompt
#     Arguments:
#     messages (list): a list containing a system and user message as
#       python dictionaries with keys 'role' and 'content'
#     temperature (float): controls the randomness of the text generation
#     top_p (float): cumulative probability mass of the tokens
#       considered at each generation step
#   """
#   prompt = pipeline.tokenizer.apply_chat_template(
#         messages,
#         tokenize=False,
#         add_generation_prompt=True
#       )

#   outputs = pipeline(
#     prompt,
#     max_new_tokens=256,
#     eos_token_id=terminators,
#     do_sample=True,
#     temperature=temperature,
#     top_p=top_p,
#       )
#   return outputs[0]["generated_text"][len(prompt):]


def get_completion(messages: list, temperature=.1, top_p=.1) -> str:
    """get completion for given system and user prompt
      Arguments:
        messages (list): a list containing a system and user message as
          python dictionaries with keys 'role' and 'content'
        temperature (float): controls the randomness of the text generation
        top_p (float): cumulative probability mass of the tokens
          considered at each generation step
    """
    outputs = llm_client.chat_completion(
        messages=messages,
        max_tokens=256,
        temperature=temperature,
        top_p=top_p
        )
    return outputs.choices[0].message.content
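
To give an intuition for `temperature` and `top_p`, the sketch below applies both to a toy next-token distribution. It is a simplified picture of the two sampling parameters, not the inference server's actual implementation; the token scores are made up, and `apply_temperature` and `nucleus` are illustrative names.

```python
import math

def apply_temperature(logits: dict, temperature: float) -> dict:
    """Softmax over logits divided by the temperature."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - top) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

def nucleus(probs: dict, top_p: float) -> dict:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalise

toy_logits = {"accident": 2.0, "incident": 1.5, "banquet": 0.2}  # made-up scores
sharp = apply_temperature(toy_logits, temperature=0.1)
flat = apply_temperature(toy_logits, temperature=1.0)
print(max(sharp.values()) > max(flat.values()))  # low temperature sharpens the distribution
print(list(nucleus(flat, top_p=0.1)))            # a small top_p keeps only the top token
```

With the defaults used in `get_completion` (both set to `.1`), generation is therefore close to deterministic, which suits extraction and summarization tasks.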

In [88]:
print(get_completion(messages))

Based on the newspaper article, the principal characters mentioned are:

1. The Poor Law Commissioners: They are the ones who have issued a circular stating their consultation with the Attorney and Solicitor-General on the construction of the late Removal Act.
2. The Attorney-General: He is mentioned as one of the officials consulted by the Poor Law Commissioners on the construction of the Removal Act.
3. The Solicitor-General: He is also mentioned as one of the officials consulted by the Poor Law Commissioners on the construction of the Removal Act.
4. Paupers: They are the individuals who are the subject of the Removal Act, which deals with the removal of poor people from one place to another.

Note that there are no specific individuals mentioned by name in the article.


## Exercise

- Change the system message and ask the model to reply in medieval French.
- Change the user message and ask the model to summarize the article and condense it to one sentence.

In [89]:
# Enter code here

#### Solution

In [None]:
messages = [
    {"role": "system", "content": """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! Answer in medieval French!"""},
    {"role": "user", "content": f"""Provide a short description of the principal characters portrayed in the newspaper article.
    ###{df.iloc[0].text}###"""}
]

print(get_completion(messages))



Hear ye, hear ye! I shall extract the principal characters from this most singular newspaper article.

* Le défunto, or the deceased, is the mate of the steam tug Earl of Glamorgan, who met a watery grave in the Severn a few days prior to the events described in the article.
* Le frère, or the brother, of the deceased, who repudiated the expense of the more expensive coffin and refused to relinquish the body until his claims were settled.
* Le fossoyeur, or the undertaker, who received the order to prepare a parish coffin, but instead provided a more expensive one at the behest of the authorities. He later appealed to the coroner, who was powerless to intervene.

Mayhap these characters shall play a part in the unfolding drama, as the article hints at a "scene" that may yet ensue.


In [None]:
messages = [
    {"role": "system", "content": """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! If the information is not in the article then just say 'Dunno'"""},
    {"role": "user", "content": f"""Summarize the article content in one sentence.
    ###{df.iloc[0].text}###"""}
]

print(get_completion(messages))


A body believed to be the mate of the steam tug Earl of Glamorgan, who drowned in the Severn, was initially intended for a parish coffin but was instead given a more expensive one, leading to a dispute over who should pay for the funeral.


## Applying text generation to historical documents


### Example 1: Summarize Summarizations

What happened in January 1899?

In [90]:
df_small = df_chunks[
            (df_chunks.year==1899) & (df_chunks.month==1) # select articles from January 1899
                  ].sample(20, random_state=1984).reset_index(drop=True) # we sample a few to keep things simple
df_small.shape

(20, 15)

In [91]:

def apply_completions(item: pd.Series,
                      system_message: str,
                      user_message: str,
                      text_column: str = 'text') -> str:
  """
  Apply a system and user prompt to one row of a DataFrame.
  Arguments:
    item (pd.Series): row from a pandas DataFrame
    system_message (str): system prompt, specifies how the model
      should behave
    user_message (str): user prompt, gives instructions on how to
      process each historical document; the document itself is appended
      from the column named by the 'text_column' argument
    text_column (str): name of the text column
  Returns the model's completion as a string
  """
  messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
      ]
  messages[1]['content'] += f"\n\n###{item[text_column]}###"
  return  get_completion(messages)

In [92]:
tqdm.pandas() # use tqdm to view progress

system_message = """You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! If the information is not in the article then just say 'I don't know'"""
user_message = "Summarize the article content in one sentence."

df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)

100%|██████████| 20/20 [00:20<00:00,  1.02s/it]
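
When a prompt is applied row by row, the rate limits mentioned in the technical note can interrupt the loop halfway through. One common remedy is to retry failed calls with exponential backoff. The sketch below shows the idea with a fake completion function; `with_retries` and `flaky_completion` are hypothetical names, and catching every `Exception` is a simplification that should be narrowed to the client's actual error type.

```python
import time

def with_retries(func, max_attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Wrap func so that failed calls are retried with exponential backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception:  # narrow this to the client's error type in practice
                if attempt == max_attempts:
                    raise
                sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    return wrapper

# stand-in for get_completion that fails twice before succeeding
calls = {"n": 0}
def flaky_completion(messages):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limit reached")
    return "summary"

delays = []  # record the waits instead of actually sleeping
safe_completion = with_retries(flaky_completion, sleep=delays.append)
result = safe_completion([{"role": "user", "content": "..."}])
print(result, delays)
```

In the notebook, wrapping the existing function with `get_completion = with_retries(get_completion)` would apply the same behaviour to every row.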


In [93]:
# get the summaries
df_small['completion'][0]

'Here is a summary of the article content in one sentence:\n\nThe Londesborough Lodge of Freemasons held a successful installation ceremony, where Bro. Charles Nicholson was installed as the new Worshipful Master, and was followed by a banquet at the Station Hotel, where the brethren enjoyed good food, company, and Masonic toasts.'
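
The completion above starts with an introductory line ("Here is a summary...") that adds nothing to the summary itself. A small post-processing step can remove it. The helper below is a sketch that assumes the preamble is a first paragraph ending in a colon; `strip_preamble` is an illustrative name, and the example string is a shortened version of the completion above.

```python
def strip_preamble(completion: str) -> str:
    """Drop a leading 'Here is ...:' paragraph if the model added one."""
    head, separator, tail = completion.partition("\n\n")
    if separator and head.rstrip().endswith(":"):
        return tail.strip()
    return completion.strip()

raw = ("Here is a summary of the article content in one sentence:\n\n"
       "The Londesborough Lodge of Freemasons held a successful installation ceremony.")
print(strip_preamble(raw))
print(strip_preamble("No preamble here."))
```

An alternative is to ask the model not to add introductory text in the system message, as Example 2 does.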

In [94]:
summaries = '\n'.join([f"###{c}###" for c in df_small['completion']])

In [95]:
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": f"""Based on the article summaries within ### below, what are the most important events? Be concise\n{summaries}"""}
      ]

In [96]:
print(get_completion(messages))

Based on the article summaries, the most important events are:

* The Czar of Russia's proposal to limit armaments
* The war between America and Spain
* The Fashoda incident
* The controversy surrounding the start of the 20th century
* The loss of the gas works in Dorchester
* The government's vaccination bill

These events are significant because they relate to international relations, conflicts, and global issues that had a significant impact on the world at the time.
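
Example 1 follows a two-step pattern: summarize each document, then ask one question over all the summaries. The sketch below condenses that pattern into a single function. It takes the completion function as an argument so it can be tried without a model; `summarize_then_aggregate` and `fake_complete` are illustrative names, and the system message is left out for brevity.

```python
def summarize_then_aggregate(texts, complete, map_prompt, reduce_prompt):
    """Summarise each text, then ask one question over all summaries.

    `complete` is any callable that takes a list of chat messages and
    returns a string (for instance the notebook's get_completion).
    """
    summaries = []
    for text in texts:
        messages = [{"role": "user", "content": f"{map_prompt}\n\n###{text}###"}]
        summaries.append(complete(messages))
    joined = "\n".join(f"###{s}###" for s in summaries)
    final = complete([{"role": "user", "content": f"{reduce_prompt}\n{joined}"}])
    return summaries, final

# fake model for illustration: numbers its replies instead of generating text
calls = []
def fake_complete(messages):
    calls.append(messages)
    return f"reply {len(calls)}"

summaries, final = summarize_then_aggregate(
    ["first article", "second article"],
    fake_complete,
    map_prompt="Summarize the article content in one sentence.",
    reduce_prompt="Based on the article summaries within ### below, what are the most important events?",
)
print(summaries, final)
```

Because the second step only sees the summaries, anything the first step omits or gets wrong carries over into the final answer, which is worth keeping in mind when evaluating the result.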


### Example 2: Condense information about accidents

How did accidents in the news change over time?

An analysis using a very simple pipeline combining summarization and RAG.


In [97]:
import re
pattern = re.compile(r'\baccidents?\b', re.I)
pattern.findall('accidents accident AccIdent accidental')

['accidents', 'accident', 'AccIdent']
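
A chunk that matches the pattern may still contain several unrelated news items, as the West Hartlepool example further down shows. One way to inspect what a match refers to is to look at a few characters on either side of it. The sketch below does this with the same pattern; `keyword_in_context` and the `window` size are illustrative choices, not part of the notebook.

```python
import re

pattern = re.compile(r'\baccidents?\b', re.I)

def keyword_in_context(text: str, window: int = 40) -> list:
    """Return each match with `window` characters of context on both sides."""
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(text))
        snippets.append(text[start:end])
    return snippets

sample = "Last night an accident of a rather serious nature occurred. It was accidental."
snippets = keyword_in_context(sample, window=15)
print(snippets)
```

Only one snippet is returned here because the word boundary `\b` keeps "accidental" from matching.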

In [98]:
tqdm.pandas()
df_chunks['matches'] = df_chunks.chunks.progress_apply(lambda x: bool(pattern.findall(x)))

100%|██████████| 336876/336876 [00:08<00:00, 38244.18it/s]


In [99]:
accident_1810s = df_chunks[
                    (df_chunks.year.between(1810,1820)) & (df_chunks['matches'] == True)
                      ].sample(n=10, random_state=1984)

accident_1890s = df_chunks[
                    (df_chunks.year.between(1890,1900)) & (df_chunks['matches'] == True)
                      ].sample(n=10, random_state=1984)
print(accident_1810s.shape,accident_1890s.shape)

(10, 16) (10, 16)


In [100]:
(df_chunks['matches'] == True).value_counts()

matches
False    329628
True       7248
Name: count, dtype: int64

In [101]:
accident_1890s.iloc[3].chunks

'please acknowledge receipt of same to the Hon. Secretary, 19, South Scarborough Street, West Hartlepool. ACCIDENT AT THE LION BREWERY.—Last night, about eight o\'clock, an accident of a rather serious nature occurred at the above building to two young men named John Hart and John Gates. The former resides in Westmoreland Street, and Gates is employed by Messrs Bland Brothers as a plumber. The aceident was caused by the gas engine exploding. Dr. Young was called in and it was found that both were badly burnt. WEST HARTLEPOOL PARLIAMENTARY AND LITERARY DEBATING SOCIETY. —Last night the first debate under the auspices of the above society was held in Mr Rowe\'s restaurant, Lynn Street, the subject for discussion being "Should there be a legal eight hours day?" Mr Withy occupied the chair. The affirmative was taken by Mr King, and the negative by Mr Bryden. Messrs Rafter, Looney, Mason, and Tarn also took part in the debate, which was adjourned for a week. There was a good attendance. MAS

In [107]:
system_message = """You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! If the information is not in the article then just say 'I don't know'.
    Focus on the answer; do not add any unnecessary introductory text."""
user_message = """Does the article talk about an accident?
If yes, summarize the article content in one sentence.
If not, answer 'no accident mentioned' """

accident_1810s['completion'] =  accident_1810s.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)
accident_1890s['completion'] =  accident_1890s.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)

100%|██████████| 10/10 [00:07<00:00,  1.31it/s]
100%|██████████| 10/10 [00:21<00:00,  2.13s/it]


In [108]:
accident_1890s[['chunks','completion']].iloc[3].values

array(['please acknowledge receipt of same to the Hon. Secretary, 19, South Scarborough Street, West Hartlepool. ACCIDENT AT THE LION BREWERY.—Last night, about eight o\'clock, an accident of a rather serious nature occurred at the above building to two young men named John Hart and John Gates. The former resides in Westmoreland Street, and Gates is employed by Messrs Bland Brothers as a plumber. The aceident was caused by the gas engine exploding. Dr. Young was called in and it was found that both were badly burnt. WEST HARTLEPOOL PARLIAMENTARY AND LITERARY DEBATING SOCIETY. —Last night the first debate under the auspices of the above society was held in Mr Rowe\'s restaurant, Lynn Street, the subject for discussion being "Should there be a legal eight hours day?" Mr Withy occupied the chair. The affirmative was taken by Mr King, and the negative by Mr Bryden. Messrs Rafter, Looney, Mason, and Tarn also took part in the debate, which was adjourned for a week. There was a good attendan

In [119]:
# wrap each item in ### and each decade in a labelled ``` block
# note: this iterates over the raw `chunks` column; iterate over the `completion` column to pass the model's summaries instead
summaries_1810s = '\n'.join([f"###{c}###" for c in accident_1810s.chunks if not c.lower().startswith('no')])
summaries_1810s = f"\n```\nSummaries for 1810s:\n\n{summaries_1810s}\n```"
summaries_1890s = '\n'.join([f"###{c}###" for c in accident_1890s.chunks if not c.lower().startswith('no')])
summaries_1890s = f"\n```\nSummaries for 1890s:\n\n{summaries_1890s}\n```"
print(summaries_1890s)


```
Summaries for 1890s:

###it. On their present form there is no team in the League to touch Nelson, and barring accidents—like last season for instance— the championship looks a gift for them." It may be stated that in the two matches NeNOll have scored 379 runs for the loss of only L6tie wickets, giving them the wonderful average of 42 runs per wicket. On the other hand Burnley have last twenty wickets for NO runs which works out an average of 9 runs per wicket. Comment is unnecessary. . I have nothing but congratulations for the Nelson team, whose fielding, with one exception, was as near perfection as possible. Those two catches in the slips by Hartley opened the eyes of the Turfites, and reminded them of Joe Allen in his pahniest days. Then the bowling of Shacklock and Cuttell was really splendid, and at one time it seemed probable that Burnley would not reach 40. Muschainp, too, behind the wickets was all there, as "John Henry" found to his cost. Bower, by his score of 57 not 
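
The cell above joins the raw `chunks`, although the prompt that follows speaks of summaries, and the `startswith('no')` filter only makes sense for model replies such as 'no accident mentioned'. If the intent is to pass the one-sentence completions instead, a helper along these lines could build the block for each decade. `join_summaries` and the demo strings are illustrative and not part of the notebook.

```python
def join_summaries(completions, label: str, skip_prefix: str = "no accident") -> str:
    """Wrap each kept completion in ### and the whole decade in a labelled fenced block."""
    fence = "`" * 3  # three backticks, the decade delimiter used in the prompt
    kept = [c.strip() for c in completions
            if not c.strip().lower().startswith(skip_prefix)]
    body = "\n".join(f"###{c}###" for c in kept)
    return f"\n{fence}\nSummaries for {label}:\n\n{body}\n{fence}"

# made-up completions standing in for accident_1890s['completion']
demo = ["A gas engine exploded at the Lion Brewery, burning two men.",
        "no accident mentioned",
        "No accident mentioned."]
out = join_summaries(demo, label="1890s")
print(out)
```

Passing the label as an argument also avoids copy-paste slips between decades.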

In [120]:
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": f"""
        Below we provide articles from two different decades: first from the 1810s and then from the 1890s.
        Each of the decades is enclosed within ```.
        Each summary is enclosed within ###.
        Answer the following question concisely: what are the principal differences between the two decades?
        \n{summaries_1810s+summaries_1890s}"""}
      ]
print(get_completion(messages))

Here are the summaries for the two decades:

**1810s:**

The summaries for the 1810s are mostly about accidents and incidents, such as a boat accident, a murder, and a fire. There are also reports of crimes, including a man who was found guilty of murder and was sentenced to death. Additionally, there are reports of social events, such as a debate society meeting and a Masonic installation.

**1890s:**

The summaries for the 1890s are also about accidents and incidents, such as a crane accident, a fire, and a cycling accident. There are also reports of crimes, including a man who was found guilty of murder and was sentenced to death. Additionally, there are reports of social events, such as a lecture on ancient castles and a debate society meeting.

The principal differences between the two decades are:

* The tone of the reports is more formal and serious in the 1810s, while in the 1890s, there is a more lighthearted and humorous tone.
* The types of accidents and incidents reported a

### Example 3: Structured Generation

Biography as microgenre

In [None]:
df_small = df.sample(10, random_state=1984).reset_index(drop=True)

In [None]:
system_message = """You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract structured information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up!"""


user_message = """Who are the characters portrayed in the article?
    Extract biographical information from a newspaper article.
    For each identified person return a nested Python dictionary with the key equal to the name of the individual.
    The values consist of dictionaries that record specific attributes such as age, gender, nationality, profession, place of birth etc.
    The format has to be a Python dictionary, do not add extra text!"""

In [None]:
df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)



In [None]:
df_small['completion'][5]

"Here is the extracted information in a Python dictionary format:\n\n{\n    'Rev. Mr. Rhodes': {\n        'profession': 'clergy',\n        'nationality': 'British'\n    },\n    'Mr. John Haling': {\n        'profession': 'Offlay Arnie',\n        'nationality': 'British'\n    },\n    'Mr. William Steele': {\n        'profession': 'Flitch of Bacon',\n        'nationality': 'British'\n    },\n    'Mr. Reeves': {\n        'profession': 'host',\n        'nationality': 'British'\n    },\n    'Mrs. Reeves': {\n        'profession': 'hostess',\n        'nationality': 'British'\n    },\n    'Mr. Warburton': {\n        'profession':'surgeon',\n        'nationality': 'British'\n    },\n    'Mr. Palmer': {\n        'profession':'surgeon',\n        'nationality': 'British'\n    }\n}"

In [None]:
import ast  # literal_eval only parses Python literals, unlike eval, which executes arbitrary code

ast.literal_eval(df_small['completion'][5].split('format:\n\n')[-1].strip())

{'Rev. Mr. Rhodes': {'profession': 'clergy', 'nationality': 'British'},
 'Mr. John Haling': {'profession': 'Offlay Arnie', 'nationality': 'British'},
 'Mr. William Steele': {'profession': 'Flitch of Bacon',
  'nationality': 'British'},
 'Mr. Reeves': {'profession': 'host', 'nationality': 'British'},
 'Mrs. Reeves': {'profession': 'hostess', 'nationality': 'British'},
 'Mr. Warburton': {'profession': 'surgeon', 'nationality': 'British'},
 'Mr. Palmer': {'profession': 'surgeon', 'nationality': 'British'}}

In [None]:
ast.literal_eval(df_small['completion'][4].split('format:\n\n')[-1].strip())

{'Robert Thompson': {'age': None,
  'gender': None,
  'nationality': None,
  'profession': 'S.P.C.C. Inspector',
  'place_of_birth': 'Aycliffe'},
 'Mr. J. T. Proud': {'age': None,
  'gender': None,
  'nationality': None,
  'profession': 'S.P.C.C. Inspector',
  'place_of_birth': None}}

In [None]:
ast.literal_eval(df_small['completion'][2].split('format:\n\n')[-1].strip())

{'J. K. Donald': {'name': 'J. K. Donald',
  'profession': 'Watchmaker and Jeweller'},
 'W. Neville': {'name': 'W. Neville', 'profession': 'Watchmaker and Jeweller'}}
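
The cells above call `eval` on the model output. This executes whatever the model returned and raises an exception when the completion is not valid Python. A more defensive variant, sketched below, relies on `ast.literal_eval`, which only accepts literals. The helper name `parse_completion` and the strategy of cutting from the first opening bracket to the last closing bracket are assumptions made for illustration; they are not part of the original notebook.

```python
import ast

def parse_completion(completion):
    """Return the Python literal embedded in an LLM completion, or None.

    Hypothetical helper: assumes the completion contains a single dict or
    list literal, possibly surrounded by explanatory text.
    """
    # locate the first opening bracket and the last closing bracket
    starts = [i for i in (completion.find('{'), completion.find('[')) if i != -1]
    end = max(completion.rfind('}'), completion.rfind(']'))
    if not starts or end == -1:
        return None
    candidate = completion[min(starts):end + 1]
    try:
        # literal_eval only accepts literals (dicts, lists, strings, numbers, None)
        return ast.literal_eval(candidate)
    except (ValueError, SyntaxError, TypeError):
        # the model returned something that is not a valid Python literal
        return None

example = ("Here is the extracted information in a Python dictionary format:\n\n"
           "{'Mr. Palmer': {'profession': 'surgeon', 'nationality': 'British'}}")
parsed = parse_completion(example)
```

Applied to a whole column, for example `df_small['completion'].apply(parse_completion)`, rows that could not be parsed would come back as `None` and can be inspected separately.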

### Example 3: OCR correction

In [None]:
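# select the five rows with the lowest `ocrquality` values; copy so a new column can be added safely
df_small_bad_ocr = df.sort_values('ocrquality').head(5).copy()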

In [None]:
user_message = """Transcribe the text and correct typos and errors in the text caused by bad optical character recognition (OCR).
Do not add any information that is not in the original text!"""

df_small_bad_ocr['completion'] = df_small_bad_ocr.progress_apply(apply_completions, system_message=system_message, user_message=user_message, axis=1)


  0%|          | 0/5 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 40%|████      | 2/5 [00:03<00:05,  1.80s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 60%|██████    | 3/5 [00:07<00:05,  2.81s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 80%|████████  | 4/5 [00:12<00:03,  3.40s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
100%|██████████| 5/5 [00:15<00:00,  3.52s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
100%|██████████| 5/5 [00:33<00:00,  6.64s/it]


In [None]:
print(df_small_bad_ocr.iloc[0]['text'])

4.9,oUMPLlAZill—dorr Walnut. annininiiire ‘.7 in nacallent condition. a Ibn. 1.41:.ND; Rigiten porkatzili tu!l drawn . 


In [None]:
print(df_small_bad_ocr.iloc[0]['completion'])

I'll transcribe the text and correct any typos or OCR errors. Here is the corrected text:

###4.9, Upland Plum Lane - Door Walnut, in excellent condition, $141,000; Righten porkatzili to view. ###


In [None]:
print(df_small_bad_ocr.iloc[4]['text'])

MaEWAN & WALLACH.  itulilliSUA KILL). EMITS AGICtiTa LAN' VALUERS - v 34, H AMILTUN squARE, 1.7b-lopti 1:4 ul6  AkA;ll  BAND AND SMITH ESTATE AlitE,NTd, SUBNEWILS V AlAi MS, 71. LORD-STILLET LIVFX.POOI4 Div= th4truusa Qrle LibrelWaoots LIMP POOL sod DlValot. Wine, r Idiom i4colli*Mor4di. Tolpphoos WIT Biak. 391 NEVI OliarrialieliOAD, Rock Parry.— &ask etabia or akau Yard 8.401 3s. 6d  WEls. KIRBY A/CD HOYLAXE 7ar - OE LET OR SOLD AMY TO W. F. B"v-"• ESTArE AGENT AND VALUER 3, GRANGE ROAD, wEer ILEXBY Telopkrme Hoyaak• 89. gry1684.1.7  Itia•bllabed 1/0. QU FAN AND FOSTER  INSTATE AGENTS & SAMBAS. 2 8013TH STICSAT,LIVSEPOOL Warsaw " ti4c.o4 TellebOute. Zink 4ii6 1177:1  J0H.30  


In [None]:
print(df_small_bad_ocr.iloc[4]['completion'])

Here is the transcribed text with corrections for typos and OCR errors:

###Maewan & Wallach. It is said that Maewan & Wallach will kill). Emits a Gigantic Lan' Valuers - 34, Hamilton Square, 1.7b-loft, 1:4 ul6, Akall Band and Smith Estate Alite, Nt, Subnewils Val, 71. Lord-Stillet Livpool. Pool Div= the true use of the Quadrille Librel Waouts Limp Pool and Divalot. Wine, or Idiom icoll*Mor4di. Topphoos Wit Biak. 391 Neville Road, Rock Ferry.— Ask etabia or akau Yard 8.401 3s. 6d. Wells. Kirby & Co. Hoylake - To Let or Sold Amy to W. F. B"v-"• Estate Agent and Valuer 3, Grange Road, Weaver Ilexby. Telephone Hoylake 89. Gry1684.1.7. It is said that 1/0. Qu Fan and Foster Instate Agents & Sambas. 2 8013th Sticsat, Livsepool. Warsaw " ti


In [None]:
df_small_bad_ocr.to_csv('newspaper_ocr_corrected.csv')
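
The completions above wrap the corrected text in `###` markers, and the first one contains `$141,000`, a string that does not occur in the OCR input. One rough way to flag such cases is to measure how far a correction drifts from its source. The sketch below uses `difflib` from the standard library; the helper names `extract_correction` and `similarity` are hypothetical, and a low ratio is only a hint that a completion deserves a closer look, not a measure of correction quality.

```python
import difflib
import re

def extract_correction(completion):
    """Return the text after the first ### marker, up to the next marker or the end.

    Falls back to the whole completion when no marker is present.
    """
    match = re.search(r'###(.*?)(?:###|\Z)', completion, flags=re.S)
    return (match.group(1) if match else completion).strip()

def similarity(original, corrected):
    """Character-level similarity ratio between 0 (no overlap) and 1 (identical)."""
    return difflib.SequenceMatcher(None, original, corrected).ratio()

original = "in nacallent condition"
completion = "Here is the corrected text:\n\n###in excellent condition###"
corrected = extract_correction(completion)
score = similarity(original, corrected)
```

Computing `score` for every row and sorting by it would bring the most heavily rewritten articles to the top for manual inspection.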

## Combining document filtering and targeted prompting

Below, we combine many of the things we covered in the previous notebook. Instead of running an LLM on all documents, we use a regular expression to select a relevant subset of newspaper articles and then use the LLM to extract structured information from that subset only.

In [None]:
import re
pattern = re.compile(r'\baccidents?\b', re.I)  # match "accident" or "accidents", case-insensitive
# keep only rows whose text matches the regex; copy so a new column can be added safely
df_kw_sample = df[df['text'].apply(lambda text: bool(pattern.search(text)))].copy()

# define the user message; we retain the system message from the previous examples
user_message = """Does the newspaper article describe a historical accident? If not, return an empty Python list.
If it does describe an accident, extract information on the people involved in the accident.
Return a list of Python dictionaries. For each dictionary, the key is equal to the name of the person.
The values list characteristics of this person such as gender, age and occupation.
Only return the Python list and no additional text!
"""

# apply messages
df_kw_sample['completion'] = df_kw_sample.progress_apply(apply_completions, user_message=user_message, system_message=system_message, axis=1)
# save outputs
df_kw_sample.to_csv('accidents.csv')

  0%|          | 0/3 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 67%|██████▋   | 2/3 [00:03<00:01,  1.79s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
100%|██████████| 3/3 [00:03<00:00,  1.14s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
100%|██████████| 3/3 [00:05<00:00,  1.91s/it]


In [None]:
df_kw_sample['completion']

51    [\n    {"Chadder": {"gender": "male", "age": "...
79                                                   []
80    [\n    {'Postman': {'gender':'male', 'age': 'u...
Name: completion, dtype: object

In [None]:
eval(df_kw_sample.iloc[0]['completion'])

[{'Chadder': {'gender': 'male',
   'age': 'unknown',
   'occupation': 'naval reserves'}},
 {'James Edmund Flood': {'gender': 'male',
   'age': '18',
   'occupation': 'unknown'}}]
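
The parsed output is a list of single-key dictionaries, which is awkward to count or filter. A minimal sketch of flattening it into one record per person follows. The function name `flatten_people` and the `article_id` field are assumptions for illustration, and the guard against non-dictionary values reflects the fact that the model does not always follow the requested format.

```python
def flatten_people(people, article_id):
    """Turn [{name: {attribute: value}}, ...] into one flat record per person."""
    records = []
    for person in people:
        for name, attributes in person.items():
            # skip entries where the model did not return a dictionary of attributes
            if isinstance(attributes, dict):
                records.append({'article_id': article_id, 'name': name, **attributes})
    return records

# parsed completion for the first matching article (row 51 above)
parsed = [
    {'Chadder': {'gender': 'male', 'age': 'unknown', 'occupation': 'naval reserves'}},
    {'James Edmund Flood': {'gender': 'male', 'age': '18', 'occupation': 'unknown'}},
]
records = flatten_people(parsed, article_id=51)
```

Passing the collected records to `pd.DataFrame(records)` would give a table with one row per person and one column per attribute.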

## Exercise

Experiment with your own system and user messages! Have fun :-)

In [None]:
# enter code here

# Fin.