<a href="https://colab.research.google.com/github/kasparvonbeelen/UIBK-DH-LLM-Workshop/blob/dev/LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Epilogue
## Using open source LLMs for analysing historical documents

Make sure you are using a [GPU](https://cloud.google.com/gpu) when running the code below.

Go to **`Runtime`** and select **`Change runtime type`**, then select `T4 GPU` (or any other available GPU).
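
A quick way to confirm that the runtime actually exposes a GPU is to ask PyTorch. The sketch below is an illustration and not part of the original workshop code: the helper name `describe_device` is hypothetical, and it assumes PyTorch is preinstalled, as it normally is on Colab.

```python
def describe_device() -> str:
    """Return a short description of the accelerator PyTorch can see."""
    try:
        import torch  # usually preinstalled on Colab; may be missing elsewhere
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # name of the first visible CUDA device
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU found, change the runtime type before loading the model"

print(describe_device())
```

If the message reports that no GPU was found, change the runtime type before continuing, since loading an 8B-parameter model on CPU is impractically slow.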

In [None]:
# install the transformers and datasets libraries
!pip install -q -U "transformers==4.40.0" datasets

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.31.0, but you have requests 2.32.3 which is

In [None]:
import warnings
warnings.filterwarnings('ignore') # disable warning

In [None]:
import transformers
from datasets import Dataset
from tqdm import tqdm
import pandas as pd
import torch

In [None]:
device = 'cuda' # make sure you use a GPU

In [None]:
# load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/kasparvonbeelen/lancaster-newspaper-workshop/wc/data/subsample500mixedocr-selected_mitch.csv')
df.head(3)

Unnamed: 0,publication_code,issue_id,item_id,newspaper_title,data_provider,date,year,month,day,location,word_count,ocrquality,political_leaning_label,price_label,text
0,2249,624,art0017,The Bee-Hive.,British Library Heritage Made Digital Newspapers,1871-06-24,1871,6,24,"London, England",271,0.9098,liberal,1d,"THE TICHBORNE CASE. On Tuesday, before the So..."
1,2250,908,art0002,"The Industrial Review, Social and Political.",British Library Heritage Made Digital Newspapers,1877-09-08,1877,9,8,"London, England",2791,0.9841,liberal,2d,THE CLERGY AND TRADE UNIONS. LETTER FROM REV....
2,2250,406,art0024,"The Industrial Review, Social and Political.",British Library Heritage Made Digital Newspapers,1878-04-06,1878,4,6,"London, England",304,0.987,liberal,2d,INDUSTRIAL REVIEW OUR LEGISLATORS. THE unrul...


In [None]:
# for the purposes of this exercise, we remove both very short and long documents from the dataset
df = df[df.word_count.between(10,250)].reset_index()
df.shape

(212, 16)

## The Hugging Face Hub

In the example below, we will experiment with Llama-3-8B, a model from Llama 3, a recent series of open-source LLMs created by Meta. To use Llama 3 you need to:

- Make an account on Hugging Face: https://huggingface.co/
- Go to the Llama-3-8B model page and sign the terms of use (you should get a reply swiftly): https://huggingface.co/meta-llama/Meta-Llama-3-8B
- Create a user access token with read access: https://huggingface.co/docs/hub/en/security-tokens
- Run the code cell below to log into the Hugging Face Hub and copy-paste the access token when prompted.
- Reply `n` to the question 'Add token as git credential? (Y/n)'.
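
As an alternative to the interactive prompt, the token can be read from an environment variable and passed to `huggingface_hub.login`. This is a hedged sketch, not the workshop's method: the helper `login_from_env` is hypothetical, and it assumes you have stored the token in a variable named `HF_TOKEN` yourself.

```python
import os

def login_from_env(env=os.environ, var_name: str = "HF_TOKEN") -> bool:
    """Log in to the Hugging Face Hub if a token is found in `env`.

    Returns True when a login was attempted, False when no token was set.
    """
    token = env.get(var_name)
    if not token:
        return False
    # imported lazily so the function can be defined without the package
    from huggingface_hub import login
    # equivalent to answering 'n' to the git credential question
    login(token=token, add_to_git_credential=False)
    return True

# with an empty environment no token is found, so nothing happens
print(login_from_env(env={}))
```

Calling `login_from_env()` without arguments uses the real environment. Keeping the token out of the notebook itself avoids accidentally sharing it along with the code.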

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Load the LLM model

In [None]:
# define the model, we use the instruct variant
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"

# instantiate a text generation pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=checkpoint,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

# stop generating at the end-of-sequence token or at Llama 3's end-of-turn token (<|eot_id|>)
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Prompting

- System message: describes how you want the LLM to work, i.e. the behaviour you want it to exhibit.
- User message: the content you want the LLM to process or act on.

```python
messages = [
  {
    "role" : "system",
    "content": "<system prompt here>"
  },
  {
    "role" : "user",
    "content": "<user prompt here>"
  }
]
```
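
Before the model sees them, the chat messages are flattened into a single prompt string in which special tokens mark each role. The sketch below approximates the Llama 3 layout in plain Python to make that concrete. The helper `render_llama3_prompt` is hypothetical and only illustrative: the authoritative template ships with the model's tokenizer and is applied through its `apply_chat_template` method.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate how Llama 3 chat messages become one prompt string."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # each turn starts with a role header and ends with an end-of-turn token
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += message["content"].strip() + "<|eot_id|>"
    if add_generation_prompt:
        # open an assistant turn so the model answers instead of continuing the user text
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "###Some article text###"},
]
print(render_llama3_prompt(example))
```

Seeing the flattened string also explains why `<|eot_id|>` matters as a stop token: it is the marker the model emits when its own turn is finished.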

Define a message by articulating a system and user prompt.

In [None]:
messages = [
    {
        "role": "system",
        "content": """
          You are a helpful AI that will assist me with analysing and reading newspaper articles.
          Read the newspaper articles attentively and extract the required information.
          Each newspaper article will be enclosed with triple hash tags (i.e. ###).
          Don't make things up! If the information is not in the article then just say 'Dunno'"""
    },
    {
        "role": "user",
        "content": f"""Provide a short description of the principal characters portrayed in the newspaper article.
                  ###{df.iloc[0].text}###"""
    }
]

In [None]:
messages

[{'role': 'system',
  'content': "\n          You are a helpful AI that will assist me with analysing and reading newspaper articles.\n          Read the newspaper articles attentively and extract the required information.\n          Each newspaper article will be enclosed with triple hash tags (i.e. ###).\n          Don't make things up! If the information is not in the article then just say 'Dunno'"},
 {'role': 'user',
  'content': 'Provide a short description of the principal characters portrayed in the newspaper article.\n                  ###WOOLISTON.  As UNDISTAICLIC IN DIYVICULTY.—A somewhat singular ease arose at Woolaston, on Monday, in respect to which there were some fears that the interment of the unfortunate mate of the steam tug. Earl of Glamorgan, who was drowned in the Severn a few days ago, would not be allowed to take place without a "scene." It appeared that the parish undertaker had received an order to make a parish coffin, at that time the body not having been recognised. B

In [None]:
def get_completion(messages: list, temperature: float = .1, top_p: float = .1) -> str:
  """Get a completion for a given system and user prompt.
    Arguments:
    messages (list): a list containing a system and a user message as
      Python dictionaries with keys 'role' and 'content'
    temperature (float): regulates the randomness ("creativity") of the text generation
    top_p (float): cumulative probability mass of the most likely tokens
      that are considered during sampling
    Returns:
    str: the generated text, without the prompt
  """
  prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
      )

  outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=temperature,
    top_p=top_p,
      )
  return outputs[0]["generated_text"][len(prompt):]
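
To build some intuition for `temperature` and `top_p`, the following toy example applies both to a made-up distribution over three candidate tokens. It is a conceptual sketch in plain Python, not the code that `transformers` runs internally; the function name `adjust_distribution` and the logit values are invented for illustration.

```python
import math

def adjust_distribution(logits: dict, temperature: float = 0.1, top_p: float = 0.1) -> dict:
    """Apply temperature scaling and nucleus (top-p) filtering to toy logits."""
    # temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = {tok: value / temperature for tok, value in logits.items()}
    max_value = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - max_value) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}

    # keep the smallest set of most likely tokens whose mass reaches top_p
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    # renormalise the surviving tokens so they sum to one
    kept_total = sum(kept.values())
    return {tok: prob / kept_total for tok, prob in kept.items()}

toy_logits = {"coroner": 2.0, "undertaker": 1.5, "brother": 0.5}  # made-up values
print(adjust_distribution(toy_logits, temperature=0.1, top_p=0.1))
print(adjust_distribution(toy_logits, temperature=1.0, top_p=0.95))
```

With the defaults used in `get_completion` (both `0.1`), only the single most likely token survives in this toy case, so sampling behaves almost like greedy decoding and repeated runs give very similar answers. Raising either value lets less likely tokens back in.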

In [None]:
print(get_completion(messages))

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Here is a short description of the principal characters portrayed in the newspaper article:

* The unfortunate mate of the steam tug Earl of Glamorgan, who was drowned in the Severn a few days ago (deceased)
* The brother of the deceased
* The parish undertaker
* The coroner


## Exercise

- Change the system message and ask the model to reply in medieval French.
- Change the user message and ask the model to summarize the article and condense it to one sentence.

In [None]:
# Enter code here

#### Solution

In [None]:
messages = [
    {"role": "system", "content": """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! Answer in medieval French!"""},
    {"role": "user", "content": f"""Provide a short description of the principal characters portrayed in the newspaper article.
    ###{df.iloc[0].text}###"""}
]

print(get_completion(messages))


Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Hear ye, hear ye! I shall extract the principal characters from this most singular newspaper article.

* Le défunto, or the deceased, is the mate of the steam tug Earl of Glamorgan, who met a watery grave in the Severn a few days prior to the events described in the article.
* Le frère, or the brother, of the deceased, who repudiated the expense of the more expensive coffin and refused to relinquish the body until his claims were settled.
* Le fossoyeur, or the undertaker, who received the order to prepare a parish coffin, but instead provided a more expensive one at the behest of the authorities. He later appealed to the coroner, who was powerless to intervene.

Mayhap these characters shall play a part in the unfolding drama, as the article hints at a "scene" that may yet ensue.


In [None]:
messages = [
    {"role": "system", "content": """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! If the information is not in the article then just say 'Dunno'"""},
    {"role": "user", "content": f"""Summarize the article content in one sentence.
    ###{df.iloc[0].text}###"""}
]

print(get_completion(messages))

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


A body believed to be the mate of the steam tug Earl of Glamorgan, who drowned in the Severn, was initially intended for a parish coffin but was instead given a more expensive one, leading to a dispute over who should pay for the funeral.


## Applying text generation to historical documents


### Example 1: Summarization

In [None]:
df_small = df.sample(5, random_state=1984).reset_index(drop=True)
df_small

Unnamed: 0,index,publication_code,issue_id,item_id,newspaper_title,data_provider,date,year,month,day,location,word_count,ocrquality,political_leaning_label,price_label,text
0,209,3077,512,art0058,"Nelson Chronicle, Colne Observer, and Clithero...",British Library Living with Machines Project,1899-05-12,1899,5,12,"Nelson, Lancashire, England",160,0.9748,liberal,1d,"NEW FASHIONS. BELL (V, SON, 28 & 30, MANCHES..."
1,352,3040,1210,art0150,The Birkenhead News and Wirral General Adverti...,British Library Living with Machines Project,1910-12-10,1910,12,10,"Birkenhead, Merseyside, England",105,0.8768,conservative,½ d<SEP>1d,LOANS To THE COUNCIL. At a meeting of the Liv...
2,227,3089,426,art0036,Glasgow Courier.,British Library Living with Machines Project,1855-04-26,1855,4,26,"Glasgow, Strathclyde, Scotland",97,0.9647,conservative,3 ½ d<SEP>4 ½ d,J. K. DONALD & W. NEVILLE B- • - • ---EG to i...
3,293,2974,312,art0014,"The Stourbridge Observer, Cradley Heath, Hales...",British Library Living with Machines Project,1887-03-12,1887,3,12,"Stourbridge, West Midlands, England",13,0.7369,,,"TOBACCOS, AT TEX OLE ESTABLISH EI) S ITO FIE..."
4,449,3074,818,art0103,The North-Eastern Weekly Gazette.,British Library Living with Machines Project,1894-08-18,1894,8,18,"Stockton-on-Tees, Cleveland, England",139,0.8834,liberal,½ d,IT OF CHILDREN YOLIFFE. ice Court on Tuesday...


In [None]:

def apply_completions(item: pd.Series,
                      system_message: str,
                      user_message: str,
                      text_column: str = 'text') -> str:
  """
  Apply a system and user prompt to a single document and return the completion.
  Arguments:
    item (pd.Series): row from a pandas DataFrame
    system_message (str): system prompt, specifies how the system
      should behave
    user_message (str): user prompt, gives instructions on how to
      process each historical document. The document itself will be appended
      from the column named by the 'text_column' argument
    text_column (str): name of the text column
  """
  messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
      ]
  messages[1]['content'] += f"\n\n###{item[text_column]}###"
  return  get_completion(messages)
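
Because every call to the model takes a few seconds of GPU time, it can be useful to preview the exact messages for one document first. The sketch below mirrors the message-building step without calling the model; `build_messages` is a hypothetical helper and the row content is an invented stand-in for a DataFrame row.

```python
def build_messages(system_message: str, user_message: str, text: str) -> list:
    """Build the chat messages for one document without calling the model."""
    return [
        {"role": "system", "content": system_message},
        # the article is appended to the instruction, wrapped in ### delimiters
        {"role": "user", "content": f"{user_message}\n\n###{text}###"},
    ]

row = {"text": "LOANS TO THE COUNCIL. At a meeting ..."}  # invented stand-in for one row
preview = build_messages(
    "You are a helpful AI.",
    "Summarize the article content in one sentence.",
    row["text"],
)
print(preview[1]["content"])
```

Checking the preview makes it easy to spot a missing delimiter or an instruction that runs straight into the article text.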

In [None]:
tqdm.pandas() # use tqdm to view progress

system_message = """You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! If the information is not in the article then just say 'Dunno'"""
user_message = "Summarize the article content in one sentence."

df_small['completion'] = df_small.progress_apply(apply_completions, system_message=system_message, user_message=user_message, axis=1)

  0%|          | 0/5 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 40%|████      | 2/5 [00:04<00:06,  2.13s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 60%|██████    | 3/5 [00:08<00:05,  2.84s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 80%|████████  | 4/5 [00:11<00:03,  3.22s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
100%|██████████| 5/5 [00:14<00:00,  3.00s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
100%|██████████| 5/5 [00:18<00:00,  3.70s/it]


In [None]:
# get the summaries
df_small['completion']

"Here is the extracted information in a Python dictionary format:\n\n{\n    'Mr. Dart': {\n        'name': 'Mr. Dart',\n        'profession': 'Council Member',\n        'nationality': 'British',\n        'place_of_birth': 'Unknown'\n    }\n}"

### Example 2: Biography as microgenre

In [None]:
df_small = df.sample(10, random_state=1984).reset_index(drop=True)

In [None]:
system_message = """You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract structured information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up!"""


user_message = """Who are the characters portrayed in the article?
    Extract biographical information from the newspaper article.
    For each identified person return a nested Python dictionary with the key equal to the name of the individual.
    The values consist of dictionaries that record specific attributes such as age, gender, nationality, profession, place of birth etc.
    The format has to be a Python dictionary, do not add extra text!"""

In [None]:
df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


  0%|          | 0/10 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 20%|██        | 2/10 [00:04<00:17,  2.15s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 30%|███       | 3/10 [00:08<00:20,  2.86s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 40%|████      | 4/10 [00:12<00:20,  3.44s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 50%|█████     | 5/10 [00:15<00:15,  3.13s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 60%|██████    | 6/10 [00:22<00:18,  4.68s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 70%|███████   | 7/10 [00:36<00:22,  7.53s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 80%|████████  | 8/10 [00:41<00:13,  6.63s/it]Setting `pad_token_i

In [None]:
df_small['completion'][5]

"Here is the extracted information in a Python dictionary format:\n\n{\n    'Rev. Mr. Rhodes': {\n        'profession': 'clergy',\n        'nationality': 'British'\n    },\n    'Mr. John Haling': {\n        'profession': 'Offlay Arnie',\n        'nationality': 'British'\n    },\n    'Mr. William Steele': {\n        'profession': 'Flitch of Bacon',\n        'nationality': 'British'\n    },\n    'Mr. Reeves': {\n        'profession': 'host',\n        'nationality': 'British'\n    },\n    'Mrs. Reeves': {\n        'profession': 'hostess',\n        'nationality': 'British'\n    },\n    'Mr. Warburton': {\n        'profession':'surgeon',\n        'nationality': 'British'\n    },\n    'Mr. Palmer': {\n        'profession':'surgeon',\n        'nationality': 'British'\n    }\n}"

In [None]:
eval(df_small['completion'][5].split('format:\n\n')[-1].strip())

{'Rev. Mr. Rhodes': {'profession': 'clergy', 'nationality': 'British'},
 'Mr. John Haling': {'profession': 'Offlay Arnie', 'nationality': 'British'},
 'Mr. William Steele': {'profession': 'Flitch of Bacon',
  'nationality': 'British'},
 'Mr. Reeves': {'profession': 'host', 'nationality': 'British'},
 'Mrs. Reeves': {'profession': 'hostess', 'nationality': 'British'},
 'Mr. Warburton': {'profession': 'surgeon', 'nationality': 'British'},
 'Mr. Palmer': {'profession': 'surgeon', 'nationality': 'British'}}

In [None]:
eval(df_small['completion'][4].split('format:\n\n')[-1].strip())

{'Robert Thompson': {'age': None,
  'gender': None,
  'nationality': None,
  'profession': 'S.P.C.C. Inspector',
  'place_of_birth': 'Aycliffe'},
 'Mr. J. T. Proud': {'age': None,
  'gender': None,
  'nationality': None,
  'profession': 'S.P.C.C. Inspector',
  'place_of_birth': None}}

In [None]:
eval(df_small['completion'][2].split('format:\n\n')[-1].strip())

{'J. K. Donald': {'name': 'J. K. Donald',
  'profession': 'Watchmaker and Jeweller'},
 'W. Neville': {'name': 'W. Neville', 'profession': 'Watchmaker and Jeweller'}}
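
Calling `eval` on model output executes whatever the model happened to write, and it fails as soon as the preamble differs from the expected `format:\n\n` marker. A more defensive option, sketched below, is `ast.literal_eval`, which only accepts Python literals. The helper name `parse_completion` and the sample strings are hypothetical; treat this as one possible approach and not as the notebook's own method.

```python
import ast

def parse_completion(completion: str):
    """Parse the first Python literal (dict or list) found in a completion.

    Returns None when nothing parseable is found.
    """
    # locate the outermost brackets so that any chatty preamble is ignored
    starts = [i for i in (completion.find("{"), completion.find("[")) if i != -1]
    if not starts:
        return None
    start = min(starts)
    closing = "}" if completion[start] == "{" else "]"
    end = completion.rfind(closing)
    if end < start:
        return None
    try:
        # literal_eval only accepts literals, so it cannot execute arbitrary code
        return ast.literal_eval(completion[start:end + 1])
    except (ValueError, SyntaxError, TypeError):
        return None

sample = "Here is the extracted information in a Python dictionary format:\n\n{'Mr. Reeves': {'profession': 'host'}}"
print(parse_completion(sample))
print(parse_completion("Dunno"))
```

Returning `None` instead of raising makes it straightforward to apply the parser to a whole column and inspect the rows that failed afterwards.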

### Example 3: OCR correction

In [None]:
df_small_bad_ocr = df.sort_values('ocrquality')[:5].copy() # copy, so that a new column can be added safely

In [None]:
user_message = """Transcribe the text and correct typos and errors in the text caused by bad optical character recognition (OCR).
Do not add any information that is not in the original text!"""

df_small_bad_ocr['completion'] = df_small_bad_ocr.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


  0%|          | 0/5 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 40%|████      | 2/5 [00:03<00:05,  1.80s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 60%|██████    | 3/5 [00:07<00:05,  2.81s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 80%|████████  | 4/5 [00:12<00:03,  3.40s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
100%|██████████| 5/5 [00:15<00:00,  3.52s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
100%|██████████| 5/5 [00:33<00:00,  6.64s/it]


In [None]:
print(df_small_bad_ocr.iloc[0]['text'])

4.9,oUMPLlAZill—dorr Walnut. annininiiire ‘.7 in nacallent condition. a Ibn. 1.41:.ND; Rigiten porkatzili tu!l drawn . 


In [None]:
print(df_small_bad_ocr.iloc[0]['completion'])

I'll transcribe the text and correct any typos or OCR errors. Here is the corrected text:

###4.9, Upland Plum Lane - Door Walnut, in excellent condition, $141,000; Righten porkatzili to view. ###


In [None]:
print(df_small_bad_ocr.iloc[4]['text'])

MaEWAN & WALLACH.  itulilliSUA KILL). EMITS AGICtiTa LAN' VALUERS - v 34, H AMILTUN squARE, 1.7b-lopti 1:4 ul6  AkA;ll  BAND AND SMITH ESTATE AlitE,NTd, SUBNEWILS V AlAi MS, 71. LORD-STILLET LIVFX.POOI4 Div= th4truusa Qrle LibrelWaoots LIMP POOL sod DlValot. Wine, r Idiom i4colli*Mor4di. Tolpphoos WIT Biak. 391 NEVI OliarrialieliOAD, Rock Parry.— &ask etabia or akau Yard 8.401 3s. 6d  WEls. KIRBY A/CD HOYLAXE 7ar - OE LET OR SOLD AMY TO W. F. B"v-"• ESTArE AGENT AND VALUER 3, GRANGE ROAD, wEer ILEXBY Telopkrme Hoyaak• 89. gry1684.1.7  Itia•bllabed 1/0. QU FAN AND FOSTER  INSTATE AGENTS & SAMBAS. 2 8013TH STICSAT,LIVSEPOOL Warsaw " ti4c.o4 TellebOute. Zink 4ii6 1177:1  J0H.30  


In [None]:
print(df_small_bad_ocr.iloc[4]['completion'])

Here is the transcribed text with corrections for typos and OCR errors:

###Maewan & Wallach. It is said that Maewan & Wallach will kill). Emits a Gigantic Lan' Valuers - 34, Hamilton Square, 1.7b-loft, 1:4 ul6, Akall Band and Smith Estate Alite, Nt, Subnewils Val, 71. Lord-Stillet Livpool. Pool Div= the true use of the Quadrille Librel Waouts Limp Pool and Divalot. Wine, or Idiom icoll*Mor4di. Topphoos Wit Biak. 391 Neville Road, Rock Ferry.— Ask etabia or akau Yard 8.401 3s. 6d. Wells. Kirby & Co. Hoylake - To Let or Sold Amy to W. F. B"v-"• Estate Agent and Valuer 3, Grange Road, Weaver Ilexby. Telephone Hoylake 89. Gry1684.1.7. It is said that 1/0. Qu Fan and Foster Instate Agents & Sambas. 2 8013th Sticsat, Livsepool. Warsaw " ti


In [None]:
df_small_bad_ocr.to_csv('newspaper_ocr_corrected.csv')
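
Corrections of very noisy OCR can drift far from the source, and long outputs may be cut off by the `max_new_tokens=256` limit set in `get_completion`. A rough sanity check is to compare each correction with its original. The sketch below uses `difflib` from the standard library; the helper `correction_report` and the toy strings are invented for illustration, and any thresholds you derive from these numbers are a judgment call.

```python
from difflib import SequenceMatcher

def correction_report(original: str, corrected: str) -> dict:
    """Compare an OCR text with its proposed correction."""
    # strip the ### delimiters that the model sometimes echoes back
    cleaned = corrected.replace("###", "").strip()
    return {
        # 1.0 means identical text, values near 0 mean almost nothing in common
        "similarity": round(SequenceMatcher(None, original, cleaned).ratio(), 2),
        # well below 1 hints at truncation, well above 1 at added material
        "length_ratio": round(len(cleaned) / max(len(original), 1), 2),
    }

original = "annininiiire in nacallent condition"      # toy OCR input
corrected = "###furniture in excellent condition###"  # toy model output
print(correction_report(original, corrected))
```

Rows with a low similarity or an unusual length ratio are good candidates for manual inspection before the corrected text is used downstream.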

## Combining document filtering and targeted prompting

Below, we combine many of the things we covered in the previous notebook. Instead of running an LLM on all the documents, we use regular expressions to select a relevant subset of newspaper articles and then use the LLM to extract structured information.
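
To see what such a keyword filter selects, the toy example below runs a pattern for "accident" or "accidents" over a few invented snippets. The snippets are made up for illustration; only the idea of a whole-word, case-insensitive match is taken from this section.

```python
import re

# matches 'accident' and 'accidents' as whole words, ignoring case
pattern = re.compile(r"\baccidents?\b", re.IGNORECASE)

# invented snippets, used only to show which texts would be selected
snippets = [
    "FATAL ACCIDENT at the colliery on Monday.",
    "Two accidents were reported last week.",
    "He met her accidentally at the station.",
    "NEW FASHIONS for the spring season.",
]
selected = [s for s in snippets if pattern.search(s)]
print(selected)
```

The word boundaries (`\b`) keep "accidentally" out of the selection, which matters because every false positive costs an extra LLM call.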

In [None]:
import re
pattern = re.compile(r'\baccidents?\b', re.I) # compile a case-insensitive regex
df_kw_sample = df[df.text.apply(lambda text: bool(pattern.search(text)))].copy() # keep only rows that match the regex

# define the user message; we retain the system message from the previous examples
user_message = """Does the newspaper article describe a historical accident? If not, return an empty Python list.
If it does describe an accident, extract information on the people involved in the accident.
Return a list of Python dictionaries. For each dictionary the key is equal to the name of the person.
The values list characteristics of this person such as gender, age and occupation.
Only return the Python list and no additional text!
"""

# apply messages
df_kw_sample['completion'] = df_kw_sample.progress_apply(apply_completions, user_message=user_message, system_message=system_message, axis=1)
# save outputs
df_kw_sample.to_csv('accidents.csv')

  0%|          | 0/3 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
 67%|██████▋   | 2/3 [00:03<00:01,  1.79s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
100%|██████████| 3/3 [00:03<00:00,  1.14s/it]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
100%|██████████| 3/3 [00:05<00:00,  1.91s/it]


In [None]:
df_kw_sample['completion']

51    [\n    {"Chadder": {"gender": "male", "age": "...
79                                                   []
80    [\n    {'Postman': {'gender':'male', 'age': 'u...
Name: completion, dtype: object

In [None]:
eval(df_kw_sample.iloc[0]['completion'])

[{'Chadder': {'gender': 'male',
   'age': 'unknown',
   'occupation': 'naval reserves'}},
 {'James Edmund Flood': {'gender': 'male',
   'age': '18',
   'occupation': 'unknown'}}]

## Exercise

Experiment with your own system and user message! Have fun :-)

In [None]:
# enter code here

# Fin.