# Debug exploration of ANP (KB) and WW2 (SwissInfo) Radio Bulletins

This notebook aims to perform a first rought exploration of the the data at hand, for the KB and SwissInfo radio bulletins.



### Imports

In [1]:
import os
import requests
from bs4 import BeautifulSoup
import json
import random
import time

## KB ANP - Fetching some sample data through the API

- link to doc file: https://docs.google.com/document/d/1uLPaOZLdH5fntdu9vjL1VEwelKMaKQan/edit#heading=h.26in1rg
- listing sets: http://services.kb.nl/mdo/oai?verb=ListSets
- listing the contents of ANP set: http://services.kb.nl/mdo/oai?verb=ListIdentifiers&set=anp&metadataPrefix=didl
- listing the contents of one ANP record: http://services.kb.nl/mdo/oai?verb=GetRecord&identifier=anp:anp:1937:10:01:1:mpeg21&metadataPrefix=didl
- Obtaining the image for this one ANP record: http://resolver.kb.nl/resolve?urn=anp:1937:10:01:1:mpeg21:image
- Obtaining the OCR for this one ANP record (ALTO): http://resolver.kb.nl/resolve?urn=anp:1937:10:01:1:mpeg21:alto
- Obtainign the plain text ocr: https://resolver.kb.nl/resolve?urn=anp:1937:10:01:1:mpeg21:ocr

In [None]:
anp_set_url = "http://services.kb.nl/mdo/oai?verb=ListIdentifiers&set=anp&metadataPrefix=didl"

resp_anp_set_first_p = requests.get(anp_set_url)
contents_anp_set_first_p = BeautifulSoup(resp_anp_set_first_p.content, "xml")

contents_anp_set_first_p

<?xml version="1.0" encoding="utf-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/          http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2025-01-16T14:10:57.620Z</responseDate>
<request metadataPrefix="didl" set="anp" verb="ListIdentifiers">http://services.kb.nl/mdo/oai	</request>
<ListIdentifiers>
<header>
<identifier>anp:anp:1937:10:01:1:mpeg21</identifier>
<datestamp>2008-09-24T09:09:09.050Z</datestamp>
</header>
<header>
<identifier>anp:anp:1937:10:01:2:mpeg21</identifier>
<datestamp>2008-09-24T09:09:09.160Z</datestamp>
</header>
<header>
<identifier>anp:anp:1937:10:01:3:mpeg21</identifier>
<datestamp>2008-09-24T09:09:09.191Z</datestamp>
</header>
<header>
<identifier>anp:anp:1937:10:02:1:mpeg21</identifier>
<datestamp>2008-09-24T09:09:09.191Z</datestamp>
</header>
<header>
<identifier>anp:anp:1937:10:02:2:mpeg21</identifier>
<datestamp>2008-09

#### Listing all identifiers in the ANP Collection

In [5]:
anp_set_url = "http://services.kb.nl/mdo/oai?verb=ListIdentifiers&set=anp&metadataPrefix=didl"

In [3]:
anp_identifiers = []

In [None]:
r = requests.post(url=anp_set_url, headers={'Connection':'close'})

In [None]:
resump_token = "anp!2008-09-24T09:13:24.099Z!!didl!2322875"

In [7]:
resump_token = ""
it_idx = 0
while resump_token is not None:
    if resump_token!="":
        anp_set_url_with_token = f"{anp_set_url}&resumptionToken={resump_token}"
    else:
        anp_set_url_with_token = anp_set_url

    print(f"-Iteration n°{it_idx}: sending request")
    with requests.Session() as s:
        time.sleep(0.01)
        #s.get('http://google.com')
        resp_anp_set = s.get(anp_set_url_with_token)
    print(f"     Finished request, extracting contents")
    contents_anp_set = BeautifulSoup(resp_anp_set.content, "xml")
    
    len_before = len(anp_identifiers)
    anp_identifiers.extend(contents_anp_set.find_all("identifier"))
    print(f"    Iter n°{it_idx}: anp_identifiers went from {len_before} to {len(anp_identifiers)} identifiers")
    if contents_anp_set.resumptionToken is not None:
        resump_token = contents_anp_set.resumptionToken.get_text()
        print(f"    Iter n°{it_idx}: resump_token = {resump_token}")
    else:
        resump_token = None
        print(f"    Iter n°{it_idx}: resump_token = {resump_token}, Stopping now the iterations")
    it_idx += 1


-Iteration n°0: sending request
     Finished request, extracting contents
    Iter n°0: anp_identifiers went from 0 to 800 identifiers
    Iter n°0: resump_token = anp!2008-09-24T09:09:16.332Z!!didl!2317275
-Iteration n°1: sending request
     Finished request, extracting contents
    Iter n°1: anp_identifiers went from 800 to 1600 identifiers
    Iter n°1: resump_token = anp!2008-09-24T09:09:23.363Z!!didl!2318075
-Iteration n°2: sending request
     Finished request, extracting contents
    Iter n°2: anp_identifiers went from 1600 to 2400 identifiers
    Iter n°2: resump_token = anp!2008-09-24T09:11:12.895Z!!didl!2318875
-Iteration n°3: sending request
     Finished request, extracting contents
    Iter n°3: anp_identifiers went from 2400 to 3200 identifiers
    Iter n°3: resump_token = anp!2008-09-24T09:11:20.239Z!!didl!2319675
-Iteration n°4: sending request
     Finished request, extracting contents
    Iter n°4: anp_identifiers went from 3200 to 4000 identifiers
    Iter n°4: res

ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

In [7]:
def fetch_identifiers_from_API(anp_set_url, it_idx = 0, resump_token = "", other_identifiers_path = None):

    if other_identifiers_path is not None and resump_token!='':
        with open(other_identifiers_path, 'r') as f_in:
            anp_identifiers = json.load(f_in)

        anp_identifiers = list(set(anp_identifiers))
        print(f"Starting sending requests, with resump_token={resump_token} and {len(anp_identifiers)} already fetched.")
    else:
        anp_identifiers = []
        print(f"Starting sending requests, from scratch.")
    
    new_identifiers = []
    
    while resump_token is not None:
        if resump_token!="":
            anp_set_url_with_token = f"{anp_set_url}&resumptionToken={resump_token}"
        else:
            anp_set_url_with_token = anp_set_url

        print(f"-Iteration n°{it_idx}: sending request")
        try:
            with requests.Session() as s:
                time.sleep(0.05)
                #s.get('http://google.com')
                resp_anp_set = s.get(anp_set_url_with_token)
            print(f"     Finished request, extracting contents")
            contents_anp_set = BeautifulSoup(resp_anp_set.content, "xml")
        
            len_before = len(new_identifiers)
            new_identifiers.extend(contents_anp_set.find_all("identifier"))
            print(f"    Iter n°{it_idx}: new_identifiers went from {len_before} to {len(new_identifiers)} identifiers")
            if contents_anp_set.resumptionToken is not None:
                resump_token = contents_anp_set.resumptionToken.get_text()
                print(f"    Iter n°{it_idx}: resump_token = {resump_token}")
            else:
                resump_token = None
                print(f"    Iter n°{it_idx}: resump_token = {resump_token}, Stopping now the iterations")
            it_idx += 1

        except Exception as e:
            print(f"Exception occurred, returning the current list of {len(new_identifiers)} identifiers; {e}")

        text_identifiers = [i.get_text() for i in new_identifiers]

        anp_identifiers.extend(text_identifiers)
        # writing the current results to ensure they are not lost
        if other_identifiers_path is not None:
            with open(other_identifiers_path, 'w') as f_out:
                json.dump(anp_identifiers, f_out)
                
    return anp_identifiers


I keep getting blocked from the APi, so could only fetch 1600 identifiers.
Will sample some from these already, but hard to know how many elements there are in the ANP collection

In [5]:
# show first five and last 5 fetched identifiers
anp_identifiers[:5], anp_identifiers[-5:]

([], [])

In [9]:
len(anp_identifiers)

6400

All fetched identifiers are more or less from the same period, but we can fetch randomly 3-5 of them.

#### Write the fetched identifiers to a txt file

In [3]:
kb_radio_sample_path = "/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/KB/radio"
kb_radio_ids_file = os.path.join(kb_radio_sample_path, "anp_identifiers_2.json")

In [None]:
anp_identifiers = fetch_identifiers_from_API(anp_set_url, other_identifiers_path=kb_radio_ids_file)

Starting sending requests, from scratch.
-Iteration n°0: sending request
     Finished request, extracting contents
    Iter n°0: anp_identifiers went from 0 to 800 identifiers
    Iter n°0: resump_token = anp!2008-09-24T09:09:16.332Z!!didl!2317275
-Iteration n°1: sending request
     Finished request, extracting contents
    Iter n°1: anp_identifiers went from 800 to 1600 identifiers
    Iter n°1: resump_token = anp!2008-09-24T09:09:23.363Z!!didl!2318075
-Iteration n°2: sending request
     Finished request, extracting contents
    Iter n°2: anp_identifiers went from 1600 to 2400 identifiers
    Iter n°2: resump_token = anp!2008-09-24T09:11:12.895Z!!didl!2318875
-Iteration n°3: sending request
Exception occurred, returning the current list of 2400 identifiers; ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
-Iteration n°3: sending request
     Finished request, extracting contents
    Iter n°3: anp_identifiers went from 2400 to 3200 identifiers
    Iter 

In [None]:
anp_identifiers = fetch_identifiers_from_API(anp_set_url, it_idx=10, resump_token="anp!2008-09-24T09:14:29.771Z!!didl!2324475", other_identifiers_path=kb_radio_ids_file)

Starting sending requests, with resump_token=anp!2008-09-24T09:14:29.771Z!!didl!2324475 and 26400 already fetched.
-Iteration n°10: sending request


In [None]:
# save the currently fetched identifiers to disk
text_identifiers = [i.get_text() for i in anp_identifiers]

with open(kb_radio_ids_file, 'w') as f_out:
    json.dump(text_identifiers, f_out)

Read back the data that was saved

In [5]:
with open(kb_radio_ids_file, 'r') as f_in:
    identifiers_data = json.load(f_in)

len(identifiers_data), identifiers_data[:5], identifiers_data[-5:]

(1600,
 ['anp:anp:1937:10:01:1:mpeg21',
  'anp:anp:1937:10:01:2:mpeg21',
  'anp:anp:1937:10:01:3:mpeg21',
  'anp:anp:1937:10:02:1:mpeg21',
  'anp:anp:1937:10:02:2:mpeg21'],
 ['anp:anp:1937:07:09:5:mpeg21',
  'anp:anp:1937:07:09:4:mpeg21',
  'anp:anp:1937:07:10:2:mpeg21',
  'anp:anp:1937:07:10:1:mpeg21',
  'anp:anp:1937:07:10:3:mpeg21'])

### Sample a few examples and save the metadata, images, ALTO and OCR plain text to disk to start working

In [6]:
sample_ids = random.sample(identifiers_data, 5)
sample_ids

['anp:anp:1937:03:01:5:mpeg21',
 'anp:anp:1937:03:25:4:mpeg21',
 'anp:anp:1937:03:02:3:mpeg21',
 'anp:anp:1937:12:31:5:mpeg21',
 'anp:anp:1937:11:28:1:mpeg21']

In [7]:
# define the URIs to fetch the required information with the identifier
metadata_uri = "http://services.kb.nl/mdo/oai?verb=GetRecord&identifier={id}&metadataPrefix=didl"
image_uri = "http://resolver.kb.nl/resolve?urn={id}:image"
alto_uri = "http://resolver.kb.nl/resolve?urn={id}:alto"
ocr_uri = "http://resolver.kb.nl/resolve?urn={id}:ocr"

In [8]:
def send_request(identifier, idx, uri, data_elem):

    print(f"{identifier} ({idx+1}) - sending the resquest for {data_elem}, uri={uri}")
    resp = requests.get(uri)
    if resp.status_code == 200:
        if data_elem == 'image':
            return resp.content
        else:
            return BeautifulSoup(resp.content, "xml")
    else:
        print(f"{identifier} ({idx+1}) - Status code for request for {data_elem} was {resp.status_code} (uri =  {uri}).")
        return None

In [9]:
fetched_data = {}

In [10]:
# remove the first "anp:" for some requests
sample_ids[0][4:]

'anp:1937:03:01:5:mpeg21'

#### Query the API for each of the sampled items

In [46]:
def fetch_data_for_identifiers(ids):
    fetched_data = {}
    for idx, ident in enumerate(ids):
        print(f"Fetching the data for identifier {ident} ({idx+1}/{len(sample_ids)})")
        # for images, ocr, and alto, the firs "anp:" should be removed
        fetched_for_id = {
            'metadata': send_request(ident, idx, metadata_uri.format(id=ident), 'metadata'),
            'image': send_request(ident, idx, image_uri.format(id=ident[4:]), 'image'),
            'alto': send_request(ident, idx, alto_uri.format(id=ident[4:]), 'alto'),
            'ocr': send_request(ident, idx, ocr_uri.format(id=ident[4:]), 'ocr'),
        }
        print(f"Finished fetching the data for identifier {ident} ({idx+1}/{len(sample_ids)})")
        fetched_data[ident] = fetched_for_id

    return fetched_data

In [11]:
for idx, ident in enumerate(sample_ids):
    print(f"Fetching the data for identifier {ident} ({idx+1}/{len(sample_ids)})")
    # for images, ocr, and alto, the firs "anp:" should be removed
    fetched_for_id = {
        'metadata': send_request(ident, idx, metadata_uri.format(id=ident), 'metadata'),
        'image': send_request(ident, idx, image_uri.format(id=ident[4:]), 'image'),
        'alto': send_request(ident, idx, alto_uri.format(id=ident[4:]), 'alto'),
        'ocr': send_request(ident, idx, ocr_uri.format(id=ident[4:]), 'ocr'),
    }
    print(f"Finished fetching the data for identifier {ident} ({idx+1}/{len(sample_ids)})")
    fetched_data[ident] = fetched_for_id

Fetching the data for identifier anp:anp:1937:03:01:5:mpeg21 (1/5)
anp:anp:1937:03:01:5:mpeg21 (1) - sending the resquest for metadata, uri=http://services.kb.nl/mdo/oai?verb=GetRecord&identifier=anp:anp:1937:03:01:5:mpeg21&metadataPrefix=didl
anp:anp:1937:03:01:5:mpeg21 (1) - sending the resquest for image, uri=http://resolver.kb.nl/resolve?urn=anp:1937:03:01:5:mpeg21:image
anp:anp:1937:03:01:5:mpeg21 (1) - sending the resquest for alto, uri=http://resolver.kb.nl/resolve?urn=anp:1937:03:01:5:mpeg21:alto
anp:anp:1937:03:01:5:mpeg21 (1) - sending the resquest for ocr, uri=http://resolver.kb.nl/resolve?urn=anp:1937:03:01:5:mpeg21:ocr
Finished fetching the data for identifier anp:anp:1937:03:01:5:mpeg21 (1/5)
Fetching the data for identifier anp:anp:1937:03:25:4:mpeg21 (2/5)
anp:anp:1937:03:25:4:mpeg21 (2) - sending the resquest for metadata, uri=http://services.kb.nl/mdo/oai?verb=GetRecord&identifier=anp:anp:1937:03:25:4:mpeg21&metadataPrefix=didl
anp:anp:1937:03:25:4:mpeg21 (2) - sendin

In [13]:
fetched_data['anp:anp:1937:03:01:5:mpeg21']['alto']

<?xml version="1.0" encoding="utf-8"?>
<alto xmlns:xlink="http://www.w3.org/TR/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.ccs-gmbh.com/metae/alto-1-2.xsd">
<Description><MeasurementUnit>pixel</MeasurementUnit></Description>
<Layout>
<Page HEIGHT="1958" ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="1240">
<PrintSpace HEIGHT="1958" HPOS="0" VPOS="0" WIDTH="1240">
<TextBlock HEIGHT="38" HPOS="156" ID="TextBlock1" VPOS="61" WIDTH="960">
<TextLine HEIGHT="31" HPOS="165" ID="TextLine1" VPOS="66" WIDTH="943">
<String CONTENT="RADICBFIICHTGEVING" HEIGHT="20" HPOS="165" ID="String1" VPOS="73" WIDTH="269"/>
<SP HPOS="434" ID="SP1" VPOS="92" WIDTH="20"/>
<String CONTENT="lUITTh" HEIGHT="18" HPOS="454" ID="String2" VPOS="72" WIDTH="94"/>
<String CONTENT="'" HEIGHT="2" HPOS="547" ID="String3" VPOS="72" WIDTH="2"/>
<SP HPOS="548" ID="SP2" VPOS="90" WIDTH="8"/>
<String CONTENT="AND" HEIGHT="19" HPOS="555" ID="String4" VPOS="71" WIDTH="45"/>
<SP H

In [23]:
fetched_data['anp:anp:1937:03:01:5:mpeg21']['ocr']

<?xml version="1.0" encoding="utf-8"?>
<text>
<p>RADICBFIICHTGEVING lUITTh' AND van 1 yaart $7. tweede uitzending </p>
<p>In de Giornale ditalia komt Virginio Gayda terug op eer artikel, dat hy onlangs over de kwestie van de restauratie dec Habsburgers in OoStenryk hepft geschreven. Hy bevestigt nadrukkelyk, dal genoemd artikel-  de houding weergeeft, welke Italie tegenover de binnenlandsche politiek van Oostenryk zal aannemen en die berust op het principe van de politieke onafhamkelykheid en onschendbaarheid van O-stenryk mer erkenning van het onvervreemdbaar karakter van Duitsche-gatie, z^. als in de Buitsch-Jostenryksche overeenkomst v -' 1 Juli is vastgesteld. Wy bevestigen, aldus Gayda, dat de restauratie, onafhan- kelyk van de gezichtspunten xx ten aanzien van hi dynastie der Hahsburgers, niet actueel en dus gevasrlyk is. Hii.' a r_ oegt de schryver toe, dat niets in de politieke daden, noch selhs in de journalistieke-  litteratuur XKXRRxiE de meening zou kunne a wettigen, dat It

In [14]:
fetched_data['anp:anp:1937:03:01:5:mpeg21']['metadata']

<?xml version="1.0" encoding="utf-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/          http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2025-01-17T15:17:50.079Z</responseDate>
<request metadataPrefix="didl" verb="GetRecord">http://services.kb.nl/mdo/oai	</request>
<GetRecord>
<record>
<header>
<identifier>anp:anp:1937:03:01:5:mpeg21</identifier>
<datestamp>2008-09-24T09:09:18.300Z</datestamp>
<setSpec>anp</setSpec>
</header>
<metadata>
<didl:DIDL xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dcx="http://krait.kb.nl/coop/tel/handbook/telterms.html" xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:srw_dc="info:srw/schema/1/dc-v1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><didl:Item dc:identifier="anp:1937:03:01:5:mpeg21"><didl:Component dc:ide

In [22]:
fetched_data

{'anp:anp:1937:03:01:5:mpeg21': {'metadata': <?xml version="1.0" encoding="utf-8"?>
  <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/          http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2025-01-17T15:17:50.079Z</responseDate>
  <request metadataPrefix="didl" verb="GetRecord">http://services.kb.nl/mdo/oai	</request>
  <GetRecord>
  <record>
  <header>
  <identifier>anp:anp:1937:03:01:5:mpeg21</identifier>
  <datestamp>2008-09-24T09:09:18.300Z</datestamp>
  <setSpec>anp</setSpec>
  </header>
  <metadata>
  <didl:DIDL xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dcx="http://krait.kb.nl/coop/tel/handbook/telterms.html" xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:srw_dc="info:srw/schema/1/dc-v1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><did

In [44]:
kb_radio_test_img_file_2 = os.path.join(kb_radio_sample_path, "test_img_2.jpg")

with open(kb_radio_test_img_file_2, 'wb') as handler:
    handler.write(fetched_data['anp:anp:1937:12:10:1:mpeg21']['image'])

#### Saving all queried data to disk in the file-structure that we expect

In [21]:
fetched_data['anp:anp:1937:03:01:5:mpeg21']['metadata'].find_all('isPartOf')[-1].get_text()

'ANP Nieuwsberichten 1937-1989'

In [37]:
def extract_path(record_metadata, base_path = kb_radio_sample_path):

    collection = record_metadata.find_all('isPartOf')[-1].get_text()
    date_str = record_metadata.date.get_text()
    date_as_path = date_str.replace('-', '/')

    title = record_metadata.title.get_text()
    vol_number = record_metadata.volgnummer.get_text()

    record_id = record_metadata.recordIdentifier.get_text().replace(":", "_")

    return record_id, os.path.join(base_path, collection.replace(' ', '_'), date_as_path, vol_number, record_id)



In [49]:
def write_fetched_Data_to_disk(fetched):
    for ident, radio_data in fetched.items():
    
        print(f"Writing all the data for {ident} to disk.")
        # create a path for the sample from the metadata
        record_id, full_sample_path = extract_path(radio_data['metadata'])
        print(f" {ident} - sample path = {full_sample_path}")

        os.makedirs(full_sample_path, exist_ok=True)
        # define each path from it
        full_metadata_path = os.path.join(full_sample_path, f"{record_id}_metadata.xml")
        full_img_path = os.path.join(full_sample_path, f"{record_id}.jpg")
        full_alto_path = os.path.join(full_sample_path, f"{record_id}_alto.xml")
        full_ocr_path = os.path.join(full_sample_path, f"{record_id}_ocr_text.xml")

        # write all extracted data to file
        with open(full_metadata_path, 'w', encoding='utf-8') as handler:
            handler.write(str(radio_data['metadata']))
        with open(full_img_path, 'wb') as handler:
            handler.write(radio_data['image'])
        with open(full_alto_path, 'w', encoding='utf-8') as handler:
            handler.write(str(radio_data['alto']))
        with open(full_ocr_path, 'w', encoding='utf-8') as handler:
            handler.write(str(radio_data['ocr']))

In [39]:
for ident, radio_data in fetched_data.items():
    
    print(f"Writing all the data for {ident} to disk.")
    # create a path for the sample from the metadata
    record_id, full_sample_path = extract_path(radio_data['metadata'])
    print(f" {ident} - sample path = {full_sample_path}")

    os.makedirs(full_sample_path, exist_ok=True)
    # define each path from it
    full_metadata_path = os.path.join(full_sample_path, f"{record_id}_metadata.xml")
    full_img_path = os.path.join(full_sample_path, f"{record_id}.jpg")
    full_alto_path = os.path.join(full_sample_path, f"{record_id}_alto.xml")
    full_ocr_path = os.path.join(full_sample_path, f"{record_id}_ocr_text.xml")

    # write all extracted data to file
    with open(full_metadata_path, 'w', encoding='utf-8') as handler:
        handler.write(str(radio_data['metadata']))
    with open(full_img_path, 'wb') as handler:
        handler.write(radio_data['image'])
    with open(full_alto_path, 'w', encoding='utf-8') as handler:
        handler.write(str(radio_data['alto']))
    with open(full_ocr_path, 'w', encoding='utf-8') as handler:
        handler.write(str(radio_data['ocr']))

Writing all the data for anp:anp:1937:03:01:5:mpeg21 to disk.
 anp:anp:1937:03:01:5:mpeg21 - sample path = /home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/KB/radio/ANP_Nieuwsberichten_1937-1989/1937/03/01/5/anp_1937_03_01_5
Writing all the data for anp:anp:1937:03:25:4:mpeg21 to disk.
 anp:anp:1937:03:25:4:mpeg21 - sample path = /home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/KB/radio/ANP_Nieuwsberichten_1937-1989/1937/03/25/4/anp_1937_03_25_4
Writing all the data for anp:anp:1937:03:02:3:mpeg21 to disk.
 anp:anp:1937:03:02:3:mpeg21 - sample path = /home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/KB/radio/ANP_Nieuwsberichten_1937-1989/1937/03/02/3/anp_1937_03_02_3
Writing all the data for anp:anp:1937:12:31:5:mpeg21 to disk.
 anp:anp:1937:12:31:5:mpeg21 - sample path = /home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/KB/radio/ANP_Nieuwsberichten_1937-1989/1937/12/31/5/anp_1937_12_31_5


#### Repeat approach for all the samples of some days

In order to know if all the elements of a given day are actually from the same bulletin or not, we need to download all the samples of a given day.

In [44]:
day_of_interest = ':'.join(sample_ids[0].split(':')[2:5])
sample_ids[0], day_of_interest

('anp:anp:1937:03:01:5:mpeg21', '1937:03:01')

In [45]:
same_day_ids = [i for i in identifiers_data if day_of_interest in i and i!= sample_ids[0]]
len(same_day_ids), same_day_ids

(6,
 ['anp:anp:1937:03:01:1:mpeg21',
  'anp:anp:1937:03:01:4:mpeg21',
  'anp:anp:1937:03:01:3:mpeg21',
  'anp:anp:1937:03:01:2:mpeg21',
  'anp:anp:1937:03:01:6:mpeg21',
  'anp:anp:1937:03:01:7:mpeg21'])

In [48]:
same_day_fetched = fetch_data_for_identifiers(same_day_ids)

Fetching the data for identifier anp:anp:1937:03:01:1:mpeg21 (1/5)
anp:anp:1937:03:01:1:mpeg21 (1) - sending the resquest for metadata, uri=http://services.kb.nl/mdo/oai?verb=GetRecord&identifier=anp:anp:1937:03:01:1:mpeg21&metadataPrefix=didl
anp:anp:1937:03:01:1:mpeg21 (1) - sending the resquest for image, uri=http://resolver.kb.nl/resolve?urn=anp:1937:03:01:1:mpeg21:image
anp:anp:1937:03:01:1:mpeg21 (1) - sending the resquest for alto, uri=http://resolver.kb.nl/resolve?urn=anp:1937:03:01:1:mpeg21:alto
anp:anp:1937:03:01:1:mpeg21 (1) - sending the resquest for ocr, uri=http://resolver.kb.nl/resolve?urn=anp:1937:03:01:1:mpeg21:ocr
Finished fetching the data for identifier anp:anp:1937:03:01:1:mpeg21 (1/5)
Fetching the data for identifier anp:anp:1937:03:01:4:mpeg21 (2/5)
anp:anp:1937:03:01:4:mpeg21 (2) - sending the resquest for metadata, uri=http://services.kb.nl/mdo/oai?verb=GetRecord&identifier=anp:anp:1937:03:01:4:mpeg21&metadataPrefix=didl
anp:anp:1937:03:01:4:mpeg21 (2) - sendin

In [50]:
write_fetched_Data_to_disk(same_day_fetched)

Writing all the data for anp:anp:1937:03:01:1:mpeg21 to disk.
 anp:anp:1937:03:01:1:mpeg21 - sample path = /home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/KB/radio/ANP_Nieuwsberichten_1937-1989/1937/03/01/1/anp_1937_03_01_1
Writing all the data for anp:anp:1937:03:01:4:mpeg21 to disk.
 anp:anp:1937:03:01:4:mpeg21 - sample path = /home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/KB/radio/ANP_Nieuwsberichten_1937-1989/1937/03/01/4/anp_1937_03_01_4
Writing all the data for anp:anp:1937:03:01:3:mpeg21 to disk.
 anp:anp:1937:03:01:3:mpeg21 - sample path = /home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/KB/radio/ANP_Nieuwsberichten_1937-1989/1937/03/01/3/anp_1937_03_01_3
Writing all the data for anp:anp:1937:03:01:2:mpeg21 to disk.
 anp:anp:1937:03:01:2:mpeg21 - sample path = /home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/KB/radio/ANP_Nieuwsberichten_1937-1989/1937/03/01/2/anp_1937_03_01_2
