<a href="https://colab.research.google.com/github/kasparvonbeelen/data-culture-newspapers/blob/llms/2_Poking_LLMs_with_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using open-source LLMs for analysing humanities data

In this notebook, we explore applications of generative AI for processing and analysing historical newspapers.

Instead of investigating how the model works, we focus on what we can do with the outputs.

Major hurdles to working with LLMs are cost and infrastructure. Unlike smaller models such as GPT-2 or BERT, LLMs can be difficult to run on your own hardware, and commercial APIs can be expensive.

## Why Open-source?

- **Privacy:** You might not want to share your data (and ideas) with companies such as OpenAI;
- **Cost:** Leaving aside the caveat above, open-source models can reduce costs if you want to apply a prompt to, for example, 10k newspaper articles;
- **Transparency:** Be mindful that there are different gradations of openness and transparency. Even when you can access the model weights, you might remain in the dark about training data and other factors;
- **Flexibility:** Even though some providers allow you to train or fine-tune closed models on your data (ties in with privacy), open-source models still give you more freedom and wiggle room to build new models and applications.

## Goals of this Session

This notebook covers a few practical and theoretical aspects of working with LLMs in the context of humanities research. The goal is to start a discussion on:
 - Where to find and how to deploy open-source LLMs?
 - What tasks would make sense? Which models work well for a selected task?
 - How to evaluate outcomes and performance? How large should the language model be?

We want to keep things simple!

We will be playing with Llama-3 to get a feeling for how this changes the way we process and interrogate data.


## Technical note

We will be relying on the Hugging Face `InferenceClient` for accessing LLMs. These are freely accessible, but rate limits apply! If you want to deploy a 'local' version (we're still on Colab, but the code should also work on your computer), uncomment the code below (where indicated) and make sure you are using a [GPU](https://cloud.google.com/gpu). To select a GPU on Colab, go to **`Runtime`** and select **`Change runtime type`**, then select `T4 GPU` (or any other GPU available).



This notebook is inspired by: https://huggingface.co/learn/cookbook/structured_generation

In [None]:
# install transformers and other libraries
!pip install -q -U "transformers==4.40.0" pydantic accelerate outlines datasets

## The Hugging Face Hub

In the examples below, we will experiment with `Llama-3-8B-Instruct`, a model from Llama 3, a recent series of open-source LLMs created by Meta. To use Llama 3 you need to:

- Make an account on Hugging Face https://huggingface.co/
- Go to the Llama-3-8B model page and sign the terms of use (you should get a reply swiftly): https://huggingface.co/meta-llama/Meta-Llama-3-8B
- Create a user access token with at least read access: https://huggingface.co/docs/hub/en/security-tokens
- Run the code cell below to log into the Hugging Face hub. Copy-paste the access token.
- Reply `n` to the question 'Add token as git credential? (Y/n)'

In [None]:
!huggingface-cli login

## Preparing model and data

### Import libraries

In [None]:
import warnings
warnings.filterwarnings('ignore') # disable warnings

In [None]:
import transformers
from huggingface_hub import InferenceClient
from datasets import Dataset
from tqdm import tqdm
import pandas as pd
import torch
import json
pd.set_option("display.max_colwidth", 100)

### Load model

In [None]:
# choose an LLM
repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# instantiate the inference client
llm_client = InferenceClient(model=repo_id, timeout=120)

In [None]:
# # define the model, we use the instruct variant
# checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
# device = 'cuda' # make sure you use a GPU

# # instantiate a text generation pipeline
# pipeline = transformers.pipeline(
#     "text-generation",
#     model=checkpoint,
#     model_kwargs={"torch_dtype": torch.bfloat16},
#     device="cuda",
# )

# # some fluff to improve the generation
# terminators = [
#     pipeline.tokenizer.eos_token_id,
#     pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
# ]

### Download data

We will be experimenting with a small set of 10k British newspaper articles provided by the ["Heritage Made Digital"](https://blogs.bl.uk/thenewsroom/2019/01/heritage-made-digital-the-newspapers.html) project. The data was kindly prepared and provided by my colleague [Nilo Pedrazzini](https://www.linkedin.com/in/nilopedrazzini).

In [None]:
# download a sample of 10,000 newspaper articles
!wget -q --show-progress https://github.com/kasparvonbeelen/lancaster-newspaper-workshop/raw/wc/data/sample_lwm_hmd_mt90_10000.csv.zip
# unzip the downloaded sample
!unzip -o sample_lwm_hmd_mt90_10000.csv.zip
!rm -r __MACOSX

In [None]:
df = pd.read_csv('sample_lwm_hmd_mt90_10000.csv')
df.head(3)

In [None]:
df.shape

### Process data

To facilitate the analysis, we divide the newspaper articles into smaller chunks of up to 250 words using a sliding window that moves forward 50 words at a time, so consecutive chunks overlap by up to 200 words.

In [None]:
def get_chunks(text: str, size: int = 250, step: int = 50) -> list:
  """divide a text into overlapping chunks of similar size
  Arguments:
    text (str): input text
    size (int): maximum number of words in each chunk
    step (int): number of words the window moves forward for each new chunk
  Returns a list of strings
  """
  # split on whitespace, then slide a window of `size` words in steps of `step`
  words = text.split()
  return [' '.join(words[i:i+size]) for i in range(0,len(words),step)]
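
To see how `size` and `step` interact, here is a minimal sketch on a toy word list. It re-implements the same sliding-window logic under the hypothetical name `sliding_chunks` so that it can be run on its own; the small numbers are chosen purely for illustration.

```python
def sliding_chunks(text: str, size: int, step: int) -> list:
    """Stand-alone copy of the sliding-window logic (hypothetical helper)."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), step)]

# toy input: ten placeholder words "w0 w1 ... w9"
toy_text = ' '.join(f"w{i}" for i in range(10))
toy_chunks = sliding_chunks(toy_text, size=4, step=2)

for chunk in toy_chunks:
    print(chunk)
```

Consecutive chunks share `size - step` words, and the last chunks are shorter than `size` because the window runs past the end of the text.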

We save the chunks in a new column.

In [None]:
# apply chunking to text
df['chunks'] = df.text.apply(get_chunks)

In [None]:
len(df.text[0]),len(df['chunks'][0])

Next, we reshape the dataframe: each chunk of up to 250 words becomes a new row (this increases the number of rows quite a bit, as you may observe).

In [None]:
# reorder the dataframe
# with one chunk in each row
# instead of the whole text
df_chunks = df.explode('chunks')
df_chunks.shape

## Prompting

LLMs generate text from an input, usually referred to as a 'prompt': a piece of text we would like the model to use as a starting point for predicting new tokens.

When 'chatting' with an LLM we usually provide the model with (at least) two messages: a system and a user prompt or message.

**System message**:

- **Generic instructions on behaviour**: specify how the model should behave (e.g. be helpful, respectful, neutral) or the role it should play (e.g., a teacher, assistant, or advisor).
- **Constraints**: Specific instructions on what the model should avoid or how it should generate responses.
- **Context**: Background information or context that remains constant throughout the session to ensure consistency.

**User message**:

- **Query**: specifies input from the user, such as a question, instruction, or request that the model needs to respond to.
- **Dynamic**: changes with each interaction, reflecting the user's immediate needs, questions, or instructions.

The Hugging Face chat prompt template allows messages as lists of dictionaries.

```python
messages = [
 {
    "role" : "system",
    "content": "<system prompt here>"
 },
 {
    "role" : "user",
    "content": "<user prompt here>"
 }
]
```
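
As a minimal sketch of this format, the snippet below wraps two strings in the list-of-dictionaries structure. The helper name `build_messages` and the example prompts are assumptions made for illustration; they are not part of any library.

```python
def build_messages(system_prompt: str, user_prompt: str) -> list:
    """Return a two-message chat in the list-of-dictionaries format (hypothetical helper)."""
    return [
        {"role": "system", "content": system_prompt.strip()},
        {"role": "user", "content": user_prompt.strip()},
    ]

example = build_messages(
    "You are a helpful assistant for reading newspaper articles.",
    "Summarize the article in one sentence.",
)
print([m["role"] for m in example])
```

Keeping the system prompt fixed and swapping only the user prompt makes it easier to compare how different instructions change the output.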

Define a message by articulating a system and user prompt.

In [None]:
messages = [
    {
        "role": "system",
        "content": """
          You are a helpful AI that will assist me with analysing and reading newspaper articles.
          Read the newspaper article attentively and extract the required information.
          Each newspaper article is enclosed with triple hashtags (i.e. ###).
          Don't make things up! If the information is not in the article then reply 'I don't know'
          """
              },

    {
        "role": "user",
        "content": f"""Provide a short description of the principal characters portrayed in the newspaper article.

                  ###POOR T,i,ENIPAT A 1„k CT  The Poor Law Coirdnissioti(rs have issued a ei; cular,
                  dated the 20th instant, stating that they have consulted the Attorney and
                  Solicitor-General on the construction of the late Removal Act, and give as the
                  result:— I. " That the proviso to the Ist section of the 9 and 10 Vict., c. 66,
                  which sets forth the exceptions to the principal enactments that are to be
                  excluded in the computation of time, is net retrospective in its operation, so
                  as to apply to cases where the five years\' residence was complete before the statute.
                  2. " That an interval between the completion of the five years residence and the
                  application for the warrant of removal filled up by one of the exceptions contained
                  in the proviso will not p event the operation of the statute in restraining the
                  removal of the pauper whu had resided for the specified time. 3. " That orders
                  of removal obtained previous to th• passing of the Act, but not then executed
                  by the removal of the paupers,###"""
              }
  ]

In [None]:
messages

In [None]:
#help(llm_client.chat_completion)

In [None]:
# # uncomment this code if you want to work locally, comment the other function
# def get_completion(messages: list, temperature=.1, top_p=.1) -> str:
#   """get completion for given system and user prompt
#     Arguments:
#     messages (list): a list containing a system and user message as
#       python dictionaries with keys 'role' and 'content'
#     temperature (float): regulates the randomness of the text generation
#     top_p (float): cumulative probability mass of the tokens considered
#       in the generation process
#   """
#   prompt = pipeline.tokenizer.apply_chat_template(
#         messages,
#         tokenize=False,
#         add_generation_prompt=True
#       )

#   outputs = pipeline(
#     prompt,
#     max_new_tokens=256,
#     eos_token_id=terminators,
#     do_sample=True,
#     temperature=temperature,
#     top_p=top_p,
#       )
#   return outputs[0]["generated_text"][len(prompt):]


def get_completion(messages: list, temperature=.1, top_p=.1):
    """get completion for given system and user prompt
      Arguments:
        messages (list): a list containing a system and user message as
          python dictionaries with keys 'role' and 'content'
        temperature (float): regulates the randomness of the text generation
        top_p (float): cumulative probability mass of the tokens considered
          in the generation process
    """
    outputs = llm_client.chat_completion(
        messages=messages,
        max_tokens=1024,
        temperature=temperature,
        top_p=top_p
        )
    return outputs.choices[0].message.content

In [None]:
print(get_completion(messages))
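
Because rate limits apply to the free inference endpoint, a request may occasionally fail. One common way to cope is to retry with an increasing delay. The sketch below is an assumption about how this could be done, not part of the Hugging Face API: `with_retries` is a hypothetical helper, and `flaky_call` only simulates a failing request so the example runs without network access.

```python
import time

def with_retries(func, max_attempts: int = 3, base_delay: float = 1.0):
    """Call func(); on failure wait and retry, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as error:  # assumption: any exception is worth retrying
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay}s")
            time.sleep(delay)

# stand-in for a flaky API call: fails twice, then succeeds
calls = {"count": 0}

def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limit reached")
    return "ok"

result = with_retries(flaky_call, base_delay=0.01)
print(result)
```

In this notebook the helper could wrap the request as `with_retries(lambda: get_completion(messages))`.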

## Exercise

- Change the system message and ask the model to reply in medieval French.
- Change the user message and ask the model to summarize the article and condense it to one sentence.

In [None]:
# Enter code here

#### Solution

In [None]:
messages = [
    {"role": "system", "content": """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper article attentively and extract the required information.
    Each newspaper article is enclosed with triple hashtags (i.e. ###).
    Don't make things up! If the information is not in the article then reply 'I don't know'
          Answer in medieval French!"""
          },
    {"role": "user", "content": f"""Provide a short description of the principal characters portrayed in the newspaper article.
    ###{df.iloc[0].text}###"""}
]

print(get_completion(messages))


## Applying text generation to historical documents


### Example 1: Summarize Summaries

Let's imagine we want to know what happened in January 1899 but don't have time to read all the newspaper issues. Luckily, LLMs excel at summarization!

We select all the articles from January 1899 and save them in a new dataframe. For the purposes of this exercise, we just take a random sample of 20 chunks; otherwise it would take too long to run everything through the model.

In [None]:
df_small = df_chunks[
            (df_chunks.year==1899) & (df_chunks.month==1) # select articles from January 1899
                  ].sample(20, random_state=1984).reset_index(drop=True) # we sample a few to keep things simple
df_small.shape

Run the cell below to load the `apply_completions` function.

In [None]:
def apply_completions(item: pd.Series,
                      system_message: str,
                      user_message: str,
                      text_column: str = 'chunks') -> str:
  """
  Build the messages for one dataframe row and return the model's completion
  Arguments:
    item (pd.Series): row from a pandas DataFrame
    system_message (str): system prompt, specifies how the model
      should behave
    user_message (str): user prompt, gives instructions on how to
      process each historical document. The document itself is appended
      from the 'text_column' argument
    text_column (str): name of the text column
  """
  messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
      ]
  messages[1]['content'] += f"\n\n###{item[text_column]}###"
  return  get_completion(messages)

We apply the prompt to the text chunks in our dataframe.

In [None]:
tqdm.pandas() # use tqdm to view progress

system_message = """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper article attentively and extract the required information.
    Each newspaper article is enclosed with triple hashtags (i.e. ###).
    Don't make things up! If the information is not in the article then reply 'I don't know'
    """
user_message = "Summarize the article in one sentence."

df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)

In [None]:
# print the first summary
df_small['completion'][0]

Of course, we can condense information even more by summarizing the summaries!

In [None]:
# create a new string from the summaries with each between triple hashtags
summaries = '\n'.join([f"###{c}###" for c in df_small['completion']])

In [None]:
# create a new user message
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": f"""Based on the article summaries within ### below, what are the most important events? Be concise\n{summaries}"""}
      ]

In [None]:
print(get_completion(messages))
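
With 20 short summaries everything fits into one prompt. With many more, the combined text may exceed the model's context window, and one option is to summarize in batches first. The sketch below groups texts by a word budget; `batch_by_words` is a hypothetical helper, and counting words is only a crude stand-in for counting tokens.

```python
def batch_by_words(texts: list, max_words: int = 500) -> list:
    """Group texts into batches whose total word count stays within max_words."""
    batches, current, current_words = [], [], 0
    for text in texts:
        n_words = len(text.split())
        # start a new batch when adding this text would exceed the budget
        if current and current_words + n_words > max_words:
            batches.append(current)
            current, current_words = [], 0
        current.append(text)
        current_words += n_words
    if current:
        batches.append(current)
    return batches

# toy data, invented for illustration
toy_summaries = ["one two three", "four five", "six seven eight nine", "ten"]
batches = batch_by_words(toy_summaries, max_words=5)
print(batches)
```

Each batch could then be joined with `###` delimiters and passed to `get_completion`, and the batch-level answers summarized once more. Note that a single text longer than `max_words` still ends up in a batch of its own.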

### Example 2: Condense information about accidents

How did accidents in the news change over time? In this example, we analyse accident reports with a simple pipeline that combines summarisation and baby-RAG (in the sense that we first retrieve and then generate a response to our query).


In the first step, we simply use a regular expression to find reports about accidents.

In [None]:
import re
pattern = re.compile(r'\baccidents?\b', re.I) # compile a regex
pattern.findall('accidents accident AccIdent accidental') # test the regex on a few example

In [None]:
tqdm.pandas()
df_chunks['matches'] = df_chunks.chunks.progress_apply(lambda x: bool(pattern.findall(x)))

Then we retrieve a small sample of accident reports by decade (for the 1810s and 1890s).

In [None]:
# .between() includes both endpoints, so a decade runs from xx10 to xx19
accident_1810s = df_chunks[
                    (df_chunks.year.between(1810,1819)) & (df_chunks['matches'] == True)
                      ].sample(n=10, random_state=1984)

accident_1890s = df_chunks[
                    (df_chunks.year.between(1890,1899)) & (df_chunks['matches'] == True)
                      ].sample(n=10, random_state=1984)
print(accident_1810s.shape,accident_1890s.shape)

You can use `.value_counts()` to compute the total number of chunks mentioning 'accident' at least once.

In [None]:
(df_chunks['matches'] == True).value_counts()

In [None]:
accident_1890s.iloc[3].chunks

In [None]:
system_message = """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper article attentively and extract the required information.
    Each newspaper article is enclosed with triple hashtags (i.e. ###).
    Don't make things up! If the information is not in the article then reply 'I don't know'
    Focus on the answer and do not add any unnecessary texts."""
user_message = """Does the article talk about an accident?
If yes summarize the article content in one sentence.
If not, answer 'No accident mentioned' """

accident_1810s['completion'] =  accident_1810s.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)
accident_1890s['completion'] =  accident_1890s.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)

In [None]:
accident_1890s[['chunks','completion']].iloc[3].values

Lastly, we can group the summaries by decade and ask the LLM to identify the principal differences and similarities. Instead of quantifying, we progressively condense information as a method for distant reading.

Obviously, more targeted prompts could help: for example, we could ask the LLM to focus on the machines mentioned in the summaries or the gender of the victims.

In [None]:
# keep the summaries (completions) and skip 'No accident mentioned' replies
summaries_1810s = '\n'.join([f"###{c}###" for c in accident_1810s.completion if not c.lower().startswith('no')])
summaries_1810s = f"\n```\nSummaries for 1810s:\n\n{summaries_1810s}\n```"
summaries_1890s = '\n'.join([f"###{c}###" for c in accident_1890s.completion if not c.lower().startswith('no')])
summaries_1890s = f"\n```\nSummaries for 1890s:\n\n{summaries_1890s}\n```"
print(summaries_1890s)

In [None]:
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": f"""
        Below we provide article summaries from two different decades: first from the 1810s and then from the 1890s.
        Each of the decades is enclosed within ```.
        Each summary is enclosed within ###.
        Answer the following question concisely: what are the principal differences between the two decades?
        \n{summaries_1810s+summaries_1890s}"""}
      ]
print(get_completion(messages))

### Example 3: Structured Generation

Newspapers contain a lot of biographical information; one could say that biography appears as a microgenre in the press. For example, accident reports give us some background about the people involved, implicitly (gender) or explicitly (profession or age).

Below we use a language model to extract such information from newspaper reports and return it in a predefined format that allows us to analyse newspapers as structured data.

Put differently, we use LLMs to extract information, much like automatic annotation, and to convert text to a tabular format.

In [None]:
df_small = df_chunks[
                    (df_chunks['matches'] == True)
                      ].sample(n=10, random_state=1984)

In [None]:
# df_small['chunks'].iloc[7]

We rewrite the system prompt and give it a few more instructions on how to respond to our queries.

In [None]:
system_message = """You are a helpful AI that will assist me with analysing source documents in the form of historical newspaper articles.
    Read the newspaper articles attentively and extract structured information formatted as a list of Python dictionaries.
    Provide all relevant short source snippets from the documents on which you directly based your answer.
    Keep the source snippet short to just a few words and not complete sentences.
    The snippet MUST be extracted from the source, with spelling and wording identical to the source.
    This list of dictionaries should begin with a "START" tag and end with an "END" tag.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! If you don't know the answer, simply return no value"""


user_message = """
If the article describes a historical accident, extract biographical information about the individuals involved in the accidents.
Return a list of Python dictionaries, one for each individual, which records important personal attributes such as gender, age and profession, and others that are relevant.
Each attribute is a key in a dictionary.
Record personal attributes as dictionaries as shown in the example below.
Also add one key with "outcome" that records what happened to the person ("drowned", "survived", "injured")
Add a confidence score as a float between 0 and 1 for each snippet extracted.
Under "source" collect text fragments that record what happened to the person involved.

START
[
  {
  "name" : { "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  "gender" : { "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  "profession" :{ "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  ... other attributes ...,
  "outcome" : { "value": answer,"source": source_snippet, "confidence": your_confidence_score },
  "summary": { "value" :summary, "confidence" : your_confidence_score }
  },
...]
END
"""



In [None]:
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message + f'\n\n###{df_small["chunks"].iloc[4]}###'}
      ]
print(get_completion(messages))

In [None]:
df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


In [None]:
print(df_small['completion'])

To convert the response to a Python data type, we use the `eval_completion` function.

In [None]:
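import ast

def eval_completion(completion: str) -> list:
  """Convert the completion string to a Python list
  Argument:
      completion (str): structured generation by LLM
  """
  try:
    # keep the text after the last START tag and drop a trailing END tag
    body = completion.split('START')[-1].strip().removesuffix('END').strip()
    # literal_eval only parses Python literals and never executes model output as code
    return ast.literal_eval(body)
  except Exception as e:
    print(e)
    return []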

df_small['completion_eval'] = df_small['completion'].apply(eval_completion)
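
The system prompt asks for snippets that are identical to the source, but the model does not always comply. A quick sanity check is to test whether each snippet occurs verbatim in the text. The sketch below assumes records shaped like the example in the user prompt; `check_snippets` is a hypothetical helper and the toy text and records are invented for illustration.

```python
def check_snippets(records: list, text: str) -> list:
    """List (attribute, snippet, found) for every source snippet in records."""
    report = []
    for record in records:
        for attribute, entry in record.items():
            # only entries shaped like {"value": ..., "source": ...} carry a snippet
            if isinstance(entry, dict) and entry.get("source"):
                snippet = str(entry["source"])
                report.append((attribute, snippet, snippet in text))
    return report

# invented example, not taken from the dataset
toy_text = "John Smith, a miner, was killed by a fall of coal."
toy_records = [{
    "name": {"value": "John Smith", "source": "John Smith", "confidence": 0.9},
    "profession": {"value": "miner", "source": "a coal miner", "confidence": 0.7},
}]
report = check_snippets(toy_records, toy_text)
print(report)
```

Snippets flagged as not found are candidates for closer inspection: they may be paraphrased, silently OCR-corrected, or made up.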

Let's take a closer look at some examples.

In [None]:
df_small['completion_eval']

In [None]:
df_small['completion_eval'].iloc[7]

Lastly, we can take a closer look at how the language model processes the text by highlighting the fragments on which it based its answers. This can help us with
- creating automatic pre-annotation
- figuring out how the pipeline could be improved
- close-reading large amounts of text

In [None]:
row = df_small.iloc[7]
html_output = row['chunks']
for p_dict in row['completion_eval']:
  for attr, attr_dict in p_dict.items():
    try:
      if isinstance(attr_dict, dict):
        if attr_dict.get('confidence',.0) > .5 and attr_dict.get("source",None):
          # plain string replacement, so regex metacharacters in a snippet are harmless
          source = str(attr_dict['source'])
          html_output = html_output.replace(source,
                   f'<span style="background-color: yellow;">{source}</span>')
    except Exception as e:
      print(e,attr_dict)
      continue

In [None]:
from IPython.display import HTML
HTML(html_output)
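
OCR text can contain characters such as `<` or `&` that a browser would interpret as markup. The following sketch escapes the text before highlighting; `highlight` is a hypothetical helper and the example sentence is invented.

```python
import html

def highlight(text: str, snippets: list) -> str:
    """Escape text for HTML and wrap each snippet occurrence in a yellow span."""
    escaped = html.escape(text)
    for snippet in snippets:
        # escape the snippet the same way so it still matches the escaped text
        target = html.escape(snippet)
        if target:
            escaped = escaped.replace(
                target, f'<span style="background-color: yellow;">{target}</span>'
            )
    return escaped

marked = highlight(
    "Mr. Jones & Co. <late of Leeds> lost a cart.",
    ["Jones & Co.", "a cart"],
)
print(marked)
```

One limitation of this simple approach: a later snippet that happens to match part of the inserted markup (for instance the word "yellow") would corrupt the output.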

### Example 4: OCR correction

Lastly, let's use an LLM to help us with a longstanding problem in digital humanities: improving OCR quality.

In [None]:
df_small_bad_ocr = df_chunks.sort_values('ocrquality', ascending=True)[:1000].sample(n=10)

In [None]:
system_message = "You are a helpful AI that provides truthful corrections of historical text."

user_message = """Transcribe the text and correct typos and errors in the text caused by bad optical character recognition (OCR).
Do not add any information that is not in the original text!"""

df_small_bad_ocr['completion'] = df_small_bad_ocr.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


In [None]:
df_small_bad_ocr.iloc[0]['chunks']

In [None]:
print(df_small_bad_ocr.iloc[0]['completion'])

In [None]:
df_small_bad_ocr.iloc[3]['chunks']

In [None]:
df_small_bad_ocr.iloc[3]['completion']
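
Reading the pairs side by side does not scale, so a rough numerical check can help flag cases where the model rewrote far more than a few characters. The sketch below uses the standard library's `difflib` to score how much the word sequence changed; `change_score` is a hypothetical helper, and `fixed_text` is a hand-made guess at a correction, not model output.

```python
import difflib

def change_score(original: str, corrected: str) -> float:
    """Word-level dissimilarity: 0 means identical, 1 means nothing in common."""
    matcher = difflib.SequenceMatcher(
        None, original.split(), corrected.split(), autojunk=False
    )
    return 1 - matcher.ratio()

ocr_text = "The Poor Law Coirdnissioti(rs have issued a ei; cular"
fixed_text = "The Poor Law Commissioners have issued a circular"  # hand-made guess
score = change_score(ocr_text, fixed_text)
print(round(score, 2))
```

A very high score does not prove the correction is wrong, since badly garbled text needs many changes, but it is a cheap signal for selecting cases to inspect manually.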

In [None]:
df_small_bad_ocr.to_csv('newspaper_ocr_corrected.csv')

## Exercise

Experiment with your own system and user message! Have fun :-)

In [None]:
# enter code here

# Fin.