<a href="https://colab.research.google.com/github/ieg-dhr/Notebooks4Historical_Newspapers/blob/main/Gemini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Researching German Historical Newspapers using the Gemini Pro Model
## Example: Article Extraction

*Notebook created by Sarah Oberbichler (oberbichler@ieg-mainz.de)*

This notebook shows how LLMs can be used to support research with historical newspapers. In this example, the Gemini Pro model is used to extract articles on earthquakes from OCR'd historical newspaper pages.

Article segmentation for historical newspapers can be based on layout information and graphical elements (image) as well as on textual context (data). While the former is very challenging due to the changing and complex layouts of historical newspapers, the latter seems to be especially promising for topic-specific corpus building. Qualitative research relies on correctly separated articles. An article, in this context, is defined as a coherent text covering a specific topic, no more and no less.



### 1.   Query the German Historical Newspaper Portal

German historical newspapers from the German Digital Library can be accessed via the DDB API. This API is open access and allows querying the historical newspapers available in the German Newspaper Portal ([Deutsches Zeitungsportal](https://www.deutsche-digitale-bibliothek.de/newspaper)). Instructions provided by the German Newspaper Portal (written by Karl Krägelin) can be found [here](https://deepnote.com/app/karl-kragelin-b83c/Zeitungsportal-API-d9224dda-8e26-4b35-a6d7-40e9507b1151).

Python > 3.8 is required

In [None]:
# @markdown ####  Launch this cell and get access to the API of the Newspaper Portal from the German Digital Library
!pip install ddbapi

In [None]:
# @markdown ####  Import the necessary packages
import pandas as pd
from ddbapi import zp_issues, zp_pages, list_column, filter

In [None]:

# @markdown ### Possible kwargs for the functions are:
# @markdown - language: Use ISO codes, currently ger, eng, fre, spa
# @markdown - place_of_distribution: Search inside "Verbreitungsort" (place of distribution); use a list for multiple search words
# @markdown - publication_date: Get newspapers by publication date
# @markdown - zdb_id: Search by ZDB-ID
# @markdown - provider: Search by data provider
# @markdown - paper_title: Search inside the title of the newspaper
# @markdown - plainpagefulltext: Search inside the OCR text
# Get the data
df = zp_pages(
    publication_date='[1909-01-01T12:00:00Z TO 1912-01-01T12:00:00Z]',
    plainpagefulltext=["Erdstoß"],
    paper_title='Kölnische Zeitung'
    )
df.head()
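
The `publication_date` value in the query above is a bracketed `[start TO end]` range of UTC timestamps (the syntax known from Solr range queries). As a minimal sketch, such a string could be assembled from Python dates instead of being typed by hand. The helper name `date_range` is hypothetical and not part of `ddbapi`, and the fixed noon timestamp is an assumption that simply mirrors the query above.

```python
from datetime import date


def date_range(start: date, end: date) -> str:
    """Build a '[start TO end]' range string like the one used in the query.

    Hypothetical helper, not part of ddbapi. The fixed 12:00:00Z time
    only mirrors the timestamps in the query above.
    """
    if start > end:
        raise ValueError("start must not be after end")
    return f"[{start.isoformat()}T12:00:00Z TO {end.isoformat()}T12:00:00Z]"


print(date_range(date(1909, 1, 1), date(1912, 1, 1)))
# [1909-01-01T12:00:00Z TO 1912-01-01T12:00:00Z]
```

The returned string could then be passed as `publication_date=` to `zp_pages`.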

In [None]:
# @markdown #### Save the results as Excel file
df.to_excel('name.xlsx', index=False)


In [None]:
# @markdown #### We can narrow down the text surrounding the keyword in order to reduce the input tokens for the model. Choose the size of the context window here:

context_window = 5000 # @param {type:"number"}
def extract_context(keyword, text, window_size=context_window):
    index = text.find(keyword)
    if index == -1:
        return "Keyword not found in text."

    start_index = max(0, index - window_size)
    end_index = min(len(text), index + len(keyword) + window_size)

    context = text[start_index:end_index]

    return context


# Extract context for each row
contexts = []
for index, row in df.iterrows():
    text = row['plainpagefulltext']
    keyword = "rdbeben"  # Substring of both "Erdbeben" and "erdbeben"; change as needed
    context = extract_context(keyword, text)
    contexts.append(context)

# Add the context to the dataframe
df['context'] = contexts

# Print the dataframe with context
df.head()
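
Note that the query above searches for "Erdstoß" while the context is cut around "rdbeben", so a page can match the query without containing the context keyword. In that case the `context` column holds the placeholder string rather than newspaper text. The following is a minimal sketch of one possible variant, not the notebook's method: it accepts several keywords, ignores case, and returns `None` when nothing matches so that such rows can be filtered out. The function name `extract_context_any` is hypothetical.

```python
import re
from typing import Optional, Sequence


def extract_context_any(keywords: Sequence[str], text: str,
                        window_size: int = 5000) -> Optional[str]:
    """Return the text around the first match of any keyword, or None."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Escape the keywords so they are matched literally, not as regex syntax
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    start = max(0, match.start() - window_size)
    end = min(len(text), match.end() + window_size)
    return text[start:end]


sample = "AAAAA Ein Erdstoß in Messina. BBBBB"
print(extract_context_any(["rdbeben", "Erdstoß"], sample, window_size=4))
```

Rows without a match could then be dropped with `df = df.dropna(subset=['context'])` before they are sent to the model.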

In [None]:
# @markdown #### Save the results as Excel file
df.to_excel('name.xlsx', index=False)


### Setting up the requirements for the Gemini model

Gemini is a family of generative AI models that helps generate content and solve problems. These models are designed and trained to handle both text and images as input.

In [None]:
!pip install -q -U google-generativeai

In [None]:
import google.generativeai as genai

In [None]:
# @markdown ##### Get an API key at https://aistudio.google.com/app/prompts/new_chat, activate the pay-as-you-go mode, and add the key to the secrets of this Colab notebook (sidebar). Name the secret GOOGLE_API_KEY and enter the key as its value.
from google.colab import userdata

# Read the key from the Colab secrets and pass it to the Gemini client
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
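
The cell above only works inside Colab, because `google.colab` is not available elsewhere. As a hedged sketch for running the notebook locally, the key could be read from an environment variable when Colab secrets are not available. The helper name `load_api_key` is hypothetical, and the fallback to an environment variable of the same name is an assumption, not part of the original setup.

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Read the key from Colab secrets if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    key = None
    if userdata is not None:
        try:
            key = userdata.get(name)
        except Exception:  # secret missing or notebook access not granted
            key = None

    key = key or os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found under the name {name!r}.")
    return key


# Usage (requires a stored key):
# genai.configure(api_key=load_api_key())
```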

In [None]:
# @markdown ##### Set up the model. A list of the available Gemini models can be found here: https://ai.google.dev/gemini-api/docs/models/gemini. The safety settings can be set to block nothing so that your data is processed without restrictions.
model = genai.GenerativeModel('gemini-1.5-pro')
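
The model above is created with the default safety settings. As a sketch of how the settings mentioned in the cell could be relaxed, the following builds one entry per harm category with the threshold `BLOCK_NONE` and passes the list via the `safety_settings` argument. The helper names `build_safety_settings` and `build_model` are hypothetical, and whether `BLOCK_NONE` is permitted may depend on the account, so treat this as an assumption to verify against the Gemini documentation.

```python
# The four adjustable harm categories, written as plain strings so the
# settings can be built without importing the SDK.
HARM_CATEGORIES = [
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
]


def build_safety_settings(threshold: str = "BLOCK_NONE") -> list:
    """Return one {'category', 'threshold'} entry per harm category."""
    return [{"category": c, "threshold": threshold} for c in HARM_CATEGORIES]


def build_model(name: str = "gemini-1.5-pro"):
    """Create the model with relaxed settings (needs a configured API key)."""
    import google.generativeai as genai  # imported lazily; requires the SDK
    return genai.GenerativeModel(name, safety_settings=build_safety_settings())


print(build_safety_settings()[0])
```

Calling `model = build_model()` would then replace the plain `genai.GenerativeModel('gemini-1.5-pro')` call.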

### Extract Articles

To extract articles on earthquakes, it is essential to formulate a precise prompt specifying that the articles are to be extracted in their original form, without translations or corrections. A guide to writing effective prompts can be found [here](https://support.google.com/a/users/answer/14200040?hl=en).

Depending on the size of the dataframe, the extraction can take a while to run.

In [None]:
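# Limit the run to the first 20 pages; .copy() avoids a pandas
# SettingWithCopyWarning when a new column is added later
df = df[:20].copy()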

In [None]:
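def separate_articles(newspaper_page):
    # Prompt (German): "Please separate only reports on earthquakes in their
    # unchanged German original form, no changes, additions or summaries",
    # followed by the page text
    prompt = (
        "Bitte separiere nur Berichte zu Erdbeben in ihrer ungeänderten "
        "deutschen Originalform, keine Änderungen, Zusätze oder Zusammenfassungen"
        f"\n\n{newspaper_page}\n\n---\n"
    )

    # Send the prompt to the model and return the generated text
    response = model.generate_content(prompt)
    return response.text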

# Create an empty list to store the separated articles
separated_articles = []

# Loop through each row in the dataframe
for index, row in df.iterrows():
    # Extract the text of the newspaper page from the current row
    newspaper_page = row['context']

    # Separate articles for the current newspaper page
    articles = separate_articles(newspaper_page)

    # Append the articles to the list
    separated_articles.append(articles)

# Add the list of separated articles as a new column 'article' in the dataframe
df['article'] = separated_articles

# Print the modified dataframe
df
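
The loop above sends one request per row, so a single failing request (for example a rate limit or a blocked response) stops the whole run. A minimal sketch of a more forgiving variant wraps the call in a retry helper that returns `None` after repeated failures. The name `call_with_retries` and the retry and wait values are assumptions for illustration. The helper takes the generating function as an argument so that it can be tried out without an API key.

```python
import time
from typing import Callable, Optional


def call_with_retries(generate: Callable[[str], str], page: str,
                      retries: int = 3,
                      wait_seconds: float = 2.0) -> Optional[str]:
    """Call generate(page); retry on errors and return None if all attempts fail."""
    for attempt in range(1, retries + 1):
        try:
            return generate(page)
        except Exception as error:  # broad on purpose: API, network, blocked responses
            print(f"Attempt {attempt}/{retries} failed: {error}")
            if attempt < retries:
                time.sleep(wait_seconds * attempt)  # wait a little longer each time
    return None


# Usage with the function defined above (requires the configured model):
# df['article'] = [call_with_retries(separate_articles, page) for page in df['context']]
```

Rows for which the result is `None` can be inspected or re-run later without repeating the successful requests.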


In [None]:
df.to_excel('name_2.xlsx', index=False)
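
Since the prompt asks for articles in their unchanged original form, it can be useful to check whether the model complied. The following is a rough sketch, not part of the original workflow: it measures which share of the non-empty lines of an extracted article occurs verbatim in the source text after whitespace has been normalized. The function names `normalize` and `verbatim_share` are hypothetical, and a line-based comparison is only one possible heuristic.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs so line breaks do not affect the comparison."""
    return re.sub(r"\s+", " ", text).strip()


def verbatim_share(article: str, source: str) -> float:
    """Return the fraction of non-empty article lines found verbatim in source."""
    source_norm = normalize(source)
    lines = [normalize(line) for line in article.splitlines() if line.strip()]
    if not lines:
        return 0.0
    found = sum(1 for line in lines if line in source_norm)
    return found / len(lines)


source = "Messina, 3. Jan. Ein heftiger Erdstoß\nwurde heute verspürt. Berlin. Der Reichstag tagte."
article = "Messina, 3. Jan. Ein heftiger Erdstoß wurde heute verspürt."
print(verbatim_share(article, source))
```

Applied to the dataframe, the check could look like `df['verbatim'] = [verbatim_share(a, c) for a, c in zip(df['article'], df['context'])]`. Low values point to rows where the model may have corrected the OCR, summarized, or added text.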