# Researching German Historical Newspapers with the Llama AI Model
## Example: OCR Post-Correction

*Notebook created by Sarah Oberbichler (oberbichler@ieg-mainz.de)*

This notebook shows how LLMs can be used to support research with historical newspapers. In this example, the Llama 3 model is used to correct the OCR text of previously OCR'd historical newspaper pages.

OCR quality has been a long-standing issue in digitization efforts. Historical newspapers are particularly affected due to their complexity, historical fonts, and degradation. In addition, OCR technology has had difficulty dealing with historical scripts.


### 1.   Query the German Historical Newspaper Portal

German historical newspapers from the German Digital Library can be accessed via the DDB API. This API is open access and makes it possible to query the historical newspapers available in the German Newspaper Portal ([Deutsches Zeitungsportal](https://www.deutsche-digitale-bibliothek.de/newspaper)). Instructions provided by the German Newspaper Portal can be found [here](https://deepnote.com/app/karl-kragelin-b83c/Zeitungsportal-API-d9224dda-8e26-4b35-a6d7-40e9507b1151).

In [1]:
# @markdown #####  Launch this cell and get access to the API of the Newspaper Portal from the German Digital Library
!pip install ddbapi

Collecting ddbapi
  Downloading ddbapi-0.1.2.tar.gz (5.2 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ddbapi
  Building wheel for ddbapi (setup.py) ... done
  Created wheel for ddbapi: filename=ddbapi-0.1.2-py3-none-any.whl size=5384 sha256=9421197e2ed131572c49a4880b7f43643e16fad5cdfed4b38a9116a80ed19751
  Stored in directory: /root/.cache/pip/wheels/0a/93/7e/69ec8f7396174c1532d0f9c5b9a343c6df0353071db93e4b2b
Successfully built ddbapi
Installing collected packages: ddbapi
Successfully installed ddbapi-0.1.2


In [2]:
# @markdown ####  Import the necessary packages
import pandas as pd
from ddbapi import zp_issues, zp_pages, list_column, filter

In [3]:
# @markdown ### Possible kwargs for the functions are:
# @markdown - language: Use ISO Codes, currently ger, eng, fre, spa
# @markdown - place_of_distribution: Search inside "Verbreitungsort"
# @markdown - use a list for multiple search-words
# @markdown - publication_date: Get newspapers by publication date.
# @markdown - zdb_id: Search by ZDB-ID
# @markdown - provider: Search by Data Provider
# @markdown - paper_title: Search inside the title of the Newspaper
# @markdown - plainpagefulltext: Search inside the OCR full text
# Get the data
df = zp_pages(
    publication_date='[1906-01-01T12:00:00Z TO 1906-12-31T12:00:00Z]',
    plainpagefulltext=["Rückwanderer*"],
    #paper_title='Deutsche allgemeine Zeitung'
    )

df.head()

https://api.deutsche-digitale-bibliothek.de/search/index/newspaper-issues/select?rows=1000&sort=id+ASC&q=type%3Apage+AND+publication_date%3A%22%5B1906-01-01T12%3A00%3A00Z%5C+TO%5C+1906-12-31T12%3A00%3A00Z%5D%22+AND+%28plainpagefulltext%3AR%C3%BCckwanderer%2A%29&cursorMark=%2A
Got 215 items.


Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext
0,2452IP73L263T7EUP326MLDLWGMWA6PL-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-09-11 12:00:00,[Hamburg],[ger],d13db2eb-59a1-494f-b6d1-4ec4f9ac647a,[/data/altos/24/52/2452IP73L263T7EUP326MLDLWGM...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Aweite Beilage z«« Hamöuvgev Fvemden-Blatt Ne....
1,26ZENHX64GQFWANFKPXWMMDCLX7D5AV6-FILE_0009_DDB...,9,Schwäbischer Merkur : mit Schwäbischer Kronik ...,VNHXUCEEKHOUSYH4NVOUBHJGSRMOGK7J,Württembergische Landesbibliothek,2751625-8,1906-01-26 12:00:00,[Stuttgart],[ger],71d0491f-6a13-40e8-9c7f-14dcf47f719e,[/data/altos/26/ZE/26ZENHX64GQFWANFKPXWMMDCLX7...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,«r. 4S. SVMVE MMW. ANMllM. Würdigung dteJtm.ql...
2,2ETGNYUAPVHLLE4SVVD66TK345MM4ACR-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-11-13 12:00:00,[Hamburg],[ger],0f18037d-34a3-4660-a20d-13baea8d26f9,[/data/altos/2E/TG/2ETGNYUAPVHLLE4SVVD66TK345M...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Zweite Beilage zum Hamburger Frem-en-Blatt Nr....
3,2KPSZQ2ZEXLY5EKLT36ZNXUNVXJPGUCD-ALTO6659890_D...,1,Aachener Anzeiger : politisches Tageblatt : be...,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2975858-0,1906-01-04 12:00:00,"[Aachen, Regierungsbezirk Aachen]",[ger],51e77301-8216-489b-a90c-1a1972c57efe,[/data/altos/2K/PS/2KPSZQ2ZEXLY5EKLT36ZNXUNVXJ...,ALTO6659890_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,90 nd und Afrika berschrift und Herz erk “ . W...
4,2MWORD7UEUFHSGZOQZ6TQYPEIPKQGG73-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-09-04 12:00:00,[Hamburg],[ger],b789cbd1-4d59-4575-aac6-803023299945,[/data/altos/2M/WO/2MWORD7UEUFHSGZOQZ6TQYPEIPK...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Zweite Beilage zum Hamburger Aremden-GLatt Nr....
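
The `publication_date` value in the query above is a bracketed range of two timestamps. As a minimal sketch, the string can be built from Python dates instead of being typed by hand. The helper name `ddb_date_range` is hypothetical, and the noon-UTC timestamp is an assumption copied from the query string used above.

```python
from datetime import date

def ddb_date_range(start: date, end: date) -> str:
    """Format a date range in the style of the publication_date query above."""
    # Assumption: timestamps use 12:00:00Z, as in the original query string.
    return f"[{start.isoformat()}T12:00:00Z TO {end.isoformat()}T12:00:00Z]"

# Reproduce the range used for the year 1906
year_1906 = ddb_date_range(date(1906, 1, 1), date(1906, 12, 31))
print(year_1906)
```

The resulting string could then be passed as `publication_date` to `zp_pages`, in the same way as the hand-written value.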


In [4]:
# @markdown #### We can narrow down the text surrounding the keyword in order to reduce the input tokens for the model. Choose the size of the context window here:
context_window = 2000 # @param {type:"number"}
def extract_context(keyword, text, window_size=context_window):
    index = text.find(keyword)
    if index == -1:
        return "Keyword not found in text."

    start_index = max(0, index - window_size)
    end_index = min(len(text), index + len(keyword) + window_size)

    context = text[start_index:end_index]

    return context

# Extract context for each row
contexts = []
for index, row in df.iterrows():
    text = row['plainpagefulltext']
    keyword = "ückwanderer"  # You can modify this
    context = extract_context(keyword, text)
    contexts.append(context)

# Add the context to the dataframe
df['context'] = contexts

# Print the dataframe with context

df.head()

Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext,context
0,2452IP73L263T7EUP326MLDLWGMWA6PL-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-09-11 12:00:00,[Hamburg],[ger],d13db2eb-59a1-494f-b6d1-4ec4f9ac647a,[/data/altos/24/52/2452IP73L263T7EUP326MLDLWGM...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Aweite Beilage z«« Hamöuvgev Fvemden-Blatt Ne....,m 2 Uhr 27 Minuten nachmittags. Die Gesamtstär...
1,26ZENHX64GQFWANFKPXWMMDCLX7D5AV6-FILE_0009_DDB...,9,Schwäbischer Merkur : mit Schwäbischer Kronik ...,VNHXUCEEKHOUSYH4NVOUBHJGSRMOGK7J,Württembergische Landesbibliothek,2751625-8,1906-01-26 12:00:00,[Stuttgart],[ger],71d0491f-6a13-40e8-9c7f-14dcf47f719e,[/data/altos/26/ZE/26ZENHX64GQFWANFKPXWMMDCLX7...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,«r. 4S. SVMVE MMW. ANMllM. Würdigung dteJtm.ql...,nd dr'irsie» diesem für die Autorität de- Sult...
2,2ETGNYUAPVHLLE4SVVD66TK345MM4ACR-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-11-13 12:00:00,[Hamburg],[ger],0f18037d-34a3-4660-a20d-13baea8d26f9,[/data/altos/2E/TG/2ETGNYUAPVHLLE4SVVD66TK345M...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Zweite Beilage zum Hamburger Frem-en-Blatt Nr....,"m Hause Paralleelstraße 12 zum Ausvruch. Da, u..."
3,2KPSZQ2ZEXLY5EKLT36ZNXUNVXJPGUCD-ALTO6659890_D...,1,Aachener Anzeiger : politisches Tageblatt : be...,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2975858-0,1906-01-04 12:00:00,"[Aachen, Regierungsbezirk Aachen]",[ger],51e77301-8216-489b-a90c-1a1972c57efe,[/data/altos/2K/PS/2KPSZQ2ZEXLY5EKLT36ZNXUNVXJ...,ALTO6659890_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,90 nd und Afrika berschrift und Herz erk “ . W...,"lig mittellosen Arbeitern und Handwerkern , di..."
4,2MWORD7UEUFHSGZOQZ6TQYPEIPKQGG73-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-09-04 12:00:00,[Hamburg],[ger],b789cbd1-4d59-4575-aac6-803023299945,[/data/altos/2M/WO/2MWORD7UEUFHSGZOQZ6TQYPEIPK...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Zweite Beilage zum Hamburger Aremden-GLatt Nr....,"ranken hauses, Herr Pros. Dr. Lenhartz, hat am..."
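
The `extract_context` function above returns the window around the first, case-sensitive occurrence of the keyword only. If a page mentions the keyword several times, a variant that returns one snippet per occurrence may be useful. The following is a sketch, not part of the original notebook; the function name `extract_all_contexts` and the sample sentence are made up for illustration.

```python
import re

def extract_all_contexts(keyword, text, window_size=2000):
    """Return one context snippet per case-insensitive occurrence of keyword."""
    snippets = []
    # re.escape treats the keyword as literal text, not as a regex pattern
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(0, match.start() - window_size)
        end = min(len(text), match.end() + window_size)
        snippets.append(text[start:end])
    return snippets

# Illustrative sample text (not taken from the newspaper data)
sample = "Die Rückwanderer kamen an. Viele RÜCKWANDERER blieben."
snippets = extract_all_contexts("rückwanderer", sample, window_size=5)
print(len(snippets))
```

Overlapping windows are returned as separate snippets, so closely spaced occurrences will share text; whether to merge them depends on how the snippets are used afterwards.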


In [5]:
# @markdown #### Save the results as an Excel file
df.to_excel('newspaper_rückkehrer.xlsx', index=False)

## Setting up the requirements for the Llama model

Llama 3 is a family of models developed by Meta. Llama 3 instruction-tuned models are fine-tuned and optimized for dialogue/chat use cases and outperform many of the available open-source chat models on common benchmarks.

In [6]:
!pip install replicate


Collecting replicate
  Downloading replicate-0.26.0-py3-none-any.whl (40 kB)
Collecting httpx<1,>=0.21.0 (from replicate)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.21.0->replicate)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.21.0->replicate)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, replicate
Successfully installed h11-0.14.0 ht

In [7]:
# @markdown ##### Get an API key at https://replicate.com/, activate billing, and save your key in a .env file. To do so, take the following steps:
# @markdown - Open a text editor such as Notepad and write REPLICATE_API_TOKEN = "your key"
# @markdown - Choose Save and change the file type to 'All files'
# @markdown - Enter .env as the file name
# @markdown - Click Save. The file is now a .env file.


!pip install python-dotenv

import os
import dotenv

# Mount Google Drive so that the .env file stored there can be read
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
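# @markdown Upload the .env file to drive/MyDrive, then load it from there
dotenv.load_dotenv('/content/drive/MyDrive/.env')

# Confirm that the token was loaded without echoing the secret itself
os.getenv('REPLICATE_API_TOKEN') is not None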
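
If the `.env` file is missing or the variable name is misspelled, the problem otherwise only surfaces later, when the model is called. A small check that fails early can make this easier to spot. This is a sketch under assumptions: the helper `require_env` and the variable `EXAMPLE_TOKEN` are hypothetical and only serve as a demonstration.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file and its path.")
    return value

# Demonstration with a placeholder variable (not a real API token)
os.environ.setdefault("EXAMPLE_TOKEN", "dummy-value")
print(len(require_env("EXAMPLE_TOKEN")) > 0)
```

In the notebook, the same helper could be called with `"REPLICATE_API_TOKEN"` right after `dotenv.load_dotenv(...)`.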

## Run the model for OCR post-correction

To run OCR post-correction, it is essential to formulate a precise prompt. For example, the prompt needs to specify that the whole text should be corrected and that summaries and any other additions must be avoided. A guide on how to write effective prompts can be found [here](https://support.google.com/a/users/answer/14200040?hl=en).

Depending on the size of the dataframe, the correction can take a while to run.
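
One way to keep such constraints explicit is to assemble the prompt from a list of separate instructions. The sketch below is an illustration, not the prompt used in this notebook: the function name `build_correction_prompt` is hypothetical, and the last instruction (keeping historical spelling) is an added assumption that may or may not suit a given research question.

```python
def build_correction_prompt(ocr_text):
    """Assemble a German correction prompt with explicit output constraints."""
    instructions = [
        # Correct the OCR errors in the following German text.
        "Korrigiere die OCR-Fehler im folgenden deutschen Text.",
        # Output the entire corrected text.
        "Gib den gesamten korrigierten Text aus.",
        # Do not summarize and do not add anything.
        "Fasse nichts zusammen und füge nichts hinzu.",
        # Assumption: keep the historical spelling.
        "Behalte die historische Schreibweise bei.",
    ]
    return " ".join(instructions) + "\n\n" + ocr_text.strip() + "\n\n---"

prompt = build_correction_prompt("  Aweite Beilage z«« Hamöuvgev Fvemden-Blatt  ")
print(prompt)
```

Keeping each constraint on its own line makes it easier to test the effect of adding or removing a single instruction.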

In [9]:
# Work on a copy of the first 10 rows to limit runtime and cost
df = df[:10].copy()

In [10]:
import json
import replicate

def OCR_correction(newspaper_page):
    # Build the model input. The German prompt reads: "Correct the OCR errors
    # of the entire German text and print the entire corrected text."
    model_input = {
    "prompt": f"Korrigiere OCR Fehler des gesamten deutschen Textes und drucke den gesamten korrigierten Text \n\n{newspaper_page}\n\n---\n\n",
    "prompt_template": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an OCR correction expert. Please don't ask for feedback or questions <|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "max_new_tokens": 8000,
    }

    # Initialize an empty string to collect the response
    text = ""

    # Generate the response using the Llama 3 model
    for event in replicate.stream(
        "meta/meta-llama-3-70b-instruct",
        input=model_input
    ):
        if event:
            text += str(event)
        else:
            print("Received empty event data")

    # Return the corrected text
    return text

# Assuming `df` is your dataframe
# Create an empty list to store the corrected texts
post_OCR = []

# Loop through each row in the dataframe
for index, row in df.iterrows():
    # Extract the context snippet of the newspaper page from the current row
    newspaper_page = row['context']

    # Call the model only if the snippet is not empty
    if newspaper_page.strip():
        text = OCR_correction(newspaper_page)
        post_OCR.append(text)
    else:
        print("Skipping empty newspaper page")
        # Append a placeholder so the list stays aligned with the dataframe rows
        post_OCR.append("")

# Add the corrected texts as a new column 'article_corrected' in the dataframe
df['article_corrected'] = post_OCR

# Print the modified dataframe
df


Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext,context,article_corrected
0,2452IP73L263T7EUP326MLDLWGMWA6PL-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-09-11 12:00:00,[Hamburg],[ger],d13db2eb-59a1-494f-b6d1-4ec4f9ac647a,[/data/altos/24/52/2452IP73L263T7EUP326MLDLWGM...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Aweite Beilage z«« Hamöuvgev Fvemden-Blatt Ne....,m 2 Uhr 27 Minuten nachmittags. Die Gesamtstär...,Here is the corrected text:\n\nm 2 Uhr 27 Minu...
1,26ZENHX64GQFWANFKPXWMMDCLX7D5AV6-FILE_0009_DDB...,9,Schwäbischer Merkur : mit Schwäbischer Kronik ...,VNHXUCEEKHOUSYH4NVOUBHJGSRMOGK7J,Württembergische Landesbibliothek,2751625-8,1906-01-26 12:00:00,[Stuttgart],[ger],71d0491f-6a13-40e8-9c7f-14dcf47f719e,[/data/altos/26/ZE/26ZENHX64GQFWANFKPXWMMDCLX7...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,«r. 4S. SVMVE MMW. ANMllM. Würdigung dteJtm.ql...,nd dr'irsie» diesem für die Autorität de- Sult...,Here is the corrected text:\n\nUnd dir ist es ...
2,2ETGNYUAPVHLLE4SVVD66TK345MM4ACR-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-11-13 12:00:00,[Hamburg],[ger],0f18037d-34a3-4660-a20d-13baea8d26f9,[/data/altos/2E/TG/2ETGNYUAPVHLLE4SVVD66TK345M...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Zweite Beilage zum Hamburger Frem-en-Blatt Nr....,"m Hause Paralleelstraße 12 zum Ausvruch. Da, u...",Here is the corrected text:\n\nIm Hause Parall...
3,2KPSZQ2ZEXLY5EKLT36ZNXUNVXJPGUCD-ALTO6659890_D...,1,Aachener Anzeiger : politisches Tageblatt : be...,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2975858-0,1906-01-04 12:00:00,"[Aachen, Regierungsbezirk Aachen]",[ger],51e77301-8216-489b-a90c-1a1972c57efe,[/data/altos/2K/PS/2KPSZQ2ZEXLY5EKLT36ZNXUNVXJ...,ALTO6659890_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,90 nd und Afrika berschrift und Herz erk “ . W...,"lig mittellosen Arbeitern und Handwerkern , di...",Here is the corrected text:\n\nLig mittellosen...
4,2MWORD7UEUFHSGZOQZ6TQYPEIPKQGG73-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-09-04 12:00:00,[Hamburg],[ger],b789cbd1-4d59-4575-aac6-803023299945,[/data/altos/2M/WO/2MWORD7UEUFHSGZOQZ6TQYPEIPK...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Zweite Beilage zum Hamburger Aremden-GLatt Nr....,"ranken hauses, Herr Pros. Dr. Lenhartz, hat am...",Here is the corrected text:\n\nRanglistenhause...
5,2NPFSEZPOTZEGAQS25MUACOX5LQCFZGU-FILE_0002_DDB...,2,Hamburger Echo ; [...] ; Abend-Ausgabe,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3060377-8,1906-09-04 12:00:00,[Hamburg],[ger],f078d8fd-4c88-4ef6-b520-a3611087b995,[/data/altos/2N/PF/2NPFSEZPOTZEGAQS25MUACOX5LQ...,FILE_0002_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Reichen"" ihr Vermögen höchst unpraktisch festg...","die berühmte „Ouvertüre 1812"" von Tschai kowsk...","Here is the corrected text:\n\nDie berühmte ""O..."
6,2R4XMGQPTR3CPNSIQAO3DWCJBS7PSPPH-ALTO953185_DD...,3,Dortmunder Zeitung. 1874-1939,4EV676FQPACNVNHFEJHGKUY55BXC3QMB,Westfälische Wilhelms-Universität Münster Univ...,2941861-6,1906-04-28 12:00:00,[Dortmund],[ger],7d09f4f2-c3a9-4140-a859-770f92b93473,[/data/altos/2R/4X/2R4XMGQPTR3CPNSIQAO3DWCJBS7...,ALTO953185_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,* * * * * S a . * DEES * S9 # SS m Dage Spss —...,thers mit 335 000 Dollars . Die höchst versich...,Here is the corrected text:\n\nThere mit 335 0...
7,2YYFQNDWZ7HM4I7MFB54EECZ5STTDX2G-ALTO7267289_D...,8,Badische Schulzeitung : Vereinsbl. d. Badische...,INLVDM4I3AMZLTG6AE6C5GZRJKGOF75K,Badische Landesbibliothek,3108888-0,1906-09-01 12:00:00,,[ger],34278dcb-93b2-4d75-9c23-2a183a33861c,[/data/altos/2Y/YF/2YYFQNDWZ7HM4I7MFB54EECZ5ST...,ALTO7267289_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,604 Die Frage der Volksbildung in ihren versch...,"kreises , Bereicherung seines Geisteslebens , ...",Here is the corrected text:\n\nKorrigierte OCR...
8,32ZFXGXIXPZ4WFNTDKNAEPTAKMYFWGUE-FILE_0002_DDB...,2,Hamburger Echo ; [...] ; Abend-Ausgabe,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3060377-8,1906-05-22 12:00:00,[Hamburg],[ger],35765f3f-5d34-4004-bd9d-9620f50df70c,[/data/altos/32/ZF/32ZFXGXIXPZ4WFNTDKNAEPTAKMY...,FILE_0002_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Religionsgemeinschaften auf die Gestaltung des...,me ber Pflanze dient. Der Ausstellungsausschuß...,"Here is the corrected text:\n\n""Mein Bericht ü..."
9,3CI67FXNE3HNEOMBMO5D2WQDU23VAXEB-ALTO5776014_D...,9,Rhein- und Ruhrzeitung : Tageszeitung für das ...,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2971362-6,1906-10-11 12:00:00,"[Duisburg, Mülheim an der Ruhr, Ruhrort, Oberh...",[ger],36e036ca-9d94-4620-85b6-1b0efbf22fb6,[/data/altos/3C/I6/3CI67FXNE3HNEOMBMO5D2WQDU23...,ALTO5776014_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Nr . 2337 . Duisburg a . Rhein . Donnerstag , ...",ramon tanen Herrschaft für unser kulturelles u...,Here is the corrected text:\n\nRamon Tanen Her...
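
Because a language model can also shorten, summarize, or rewrite a text instead of only correcting it, it may help to compare each corrected text with its input before relying on it. The following sketch uses `difflib` from the Python standard library; the function name `correction_report` is hypothetical, and no thresholds are suggested here because suitable values depend on the material.

```python
from difflib import SequenceMatcher

def correction_report(original, corrected):
    """Compare an OCR snippet with its corrected version."""
    # Similarity of 1.0 means identical strings; low values suggest heavy rewriting
    ratio = SequenceMatcher(None, original, corrected, autojunk=False).ratio()
    # A length ratio far below 1.0 may indicate a truncated or summarized output
    length_ratio = len(corrected) / len(original) if original else 0.0
    return {"similarity": round(ratio, 3), "length_ratio": round(length_ratio, 3)}

report = correction_report("Hamöuvgev Fvemden-Blatt", "Hamburger Fremden-Blatt")
print(report)
```

Applied row by row to the `context` and `article_corrected` columns, such a report could be used to flag rows for manual inspection.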


In [12]:
# Drop the model's preamble ("Here is the corrected text:") by keeping only what follows the first blank line
df['article_corrected'] = df['article_corrected'].apply(lambda x: x.split('\n\n', 1)[1].lstrip() if isinstance(x, str) and '\n\n' in x else x)

df.to_excel('article_corrected.xlsx', index=False)
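
Splitting on the first blank line also removes the first paragraph of any answer that does not start with a preamble. A more conservative alternative, sketched here under the assumption that the preamble always reads "Here is the corrected text" as in the output above, removes only that phrase. The names `PREAMBLE` and `strip_preamble` are hypothetical.

```python
import re

# Assumption: the model's preamble matches the wording seen in the output above
PREAMBLE = re.compile(r"^\s*here is the corrected text:?\s*", flags=re.IGNORECASE)

def strip_preamble(text):
    """Remove a leading 'Here is the corrected text:' line, if present."""
    if not isinstance(text, str):
        return text
    return PREAMBLE.sub("", text, count=1)

print(strip_preamble("Here is the corrected text:\n\nIm Hause Parallelstraße 12"))
```

It could be applied with `df['article_corrected'].apply(strip_preamble)`; texts without the preamble are returned unchanged.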