# Dev notebook for patching code

Related to issue [#117](https://github.com/impresso/impresso-text-acquisition/issues/117)

This notebook contains the code used to apply some of the simpler patches required on the canonical data, namely patches n°1 and n°6:
- n°1: Add a property `iiif_img_base_uri` at the top level of every page JSON for a given set of journals, holding the base URI of the IIIF Image API for that page.
    - This patch concerns the journals `FedGazDe`, `FedGazFr` and `NZZ`.
- n°6: Add a property `iiif_manifest_uri` at the top level of every issue JSON for a given set of journals, holding the URI of that issue's manifest in the IIIF Presentation API.
    - This patch concerns the journals `arbeitgeber` and `handelsztg`.

The results of these patches are logged and documented in the manifest files created alongside them, which are stored on S3 as well as in the `impresso-data-release` GitHub repository.

## Imports

In [1]:
import os
import boto3
import json
import logging
import jsonlines
from impresso_commons.utils import s3
from impresso_commons.path.path_s3 import fetch_files, list_files, list_newspapers
from impresso_commons.utils.s3 import fixed_s3fs_glob
from impresso_commons.versioning.data_manifest import DataManifest
from text_importer.importers.core import upload_issues, upload_pages
from smart_open import open as smart_open_function
from impresso_commons.versioning.helpers import counts_for_canonical_issue
import dask.bag as db
from typing import Any, Callable
import git
from text_importer.utils import init_logger
import copy
from dask.distributed import Client



In [2]:
IMPRESSO_STORAGEOPT = s3.get_storage_options()

In [3]:
logger = logging.getLogger()

## Functions

In [4]:
def add_property(object_dict: dict[str, Any], prop_name: str, prop_function: Callable[[str], str], function_input: str):
    object_dict[prop_name] = prop_function(function_input)
    logger.debug("%s -> Added property %s: %s", object_dict['id'], prop_name, object_dict[prop_name])
    return object_dict
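
As a quick illustration of how `add_property` is meant to be used, the sketch below applies it to a hand-made page dict. The function body mirrors the cell above; the page ID and the `example_uri` builder are made up for this example and are not part of the patch.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def add_property(
    object_dict: dict[str, Any],
    prop_name: str,
    prop_function: Callable[[str], str],
    function_input: str,
) -> dict[str, Any]:
    # same logic as the notebook function: compute the value and store it in place
    object_dict[prop_name] = prop_function(function_input)
    logger.debug("%s -> Added property %s: %s", object_dict["id"], prop_name, object_dict[prop_name])
    return object_dict


def example_uri(page_id: str) -> str:
    # hypothetical URI builder, used only for this example
    return f"https://example.org/iiif/{page_id}"


page = {"id": "NZZ-1900-01-02-a-p0001"}  # made-up canonical page ID
patched = add_property(page, "iiif_img_base_uri", example_uri, page["id"])

print(patched["iiif_img_base_uri"])
# the dict is modified in place, so `page` and `patched` are the same object
print(patched is page)
```

Since the input dict is mutated, callers that need the unpatched version should copy it before calling `add_property`.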

In [16]:
def write_error(
    thing_id: str,
    origin_function: str,
    error: Exception, 
    failed_log: str
) -> None:
    """Write the given error of a failed operation to the `failed_log` file.

    Args:
        thing_id (str): ID of the object for which the error occurred.
        origin_function (str): Name of the function in which the error occurred.
        error (Exception): Error that occurred and should be logged.
        failed_log (str): Path to log file for failed operations.
    """
    note = (
        f"Error in {origin_function} for {thing_id}: {error}"
    )

    logger.exception(note)

    with open(failed_log, "a+") as f:
        f.write(note + "\n")

In [17]:
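def write_jsonlines_file(
    filepath: str,
    contents: list[dict[str, Any]],
    content_type: str,
    failed_log: str | None = None
) -> None:
    """Append the given contents to a JSON-lines file, one JSON document per line.

    Args:
        filepath (str): Path of the output file; its parent directory is created if needed.
        contents (list[dict[str, Any]]): Documents (issues or pages) to write.
        content_type (str): Label used in the log messages, e.g. 'issues' or 'pages'.
        failed_log (str | None): Path to the log file for failed writes, if any.
    """
    os.makedirs(os.path.dirname(filepath), exist_ok=True)

    try:
        # the file is opened in append mode: existing contents are kept
        with smart_open_function(filepath, 'ab') as fout:
            writer = jsonlines.Writer(fout)
            writer.write_all(contents)
            logger.info(f'Written {len(contents)} {content_type} to {filepath}')
            writer.close()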
    except Exception as e:
        logger.error(f"Error for {filepath}")
        logger.exception(e)
        if failed_log is not None:
            write_error(os.path.basename(filepath), 'write_jsonlines_file()', e, failed_log)
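
To make the file format concrete, here is a minimal standard-library sketch of appending to and reading back a bz2-compressed JSON-lines file. It swaps `smart_open` and `jsonlines` for `bz2` and `json`, so it is a stand-in for the function above rather than a copy of it; the helper names and the sample IDs are assumptions made for the example.

```python
import bz2
import json
import os
import tempfile


def append_jsonl_bz2(filepath: str, records: list[dict]) -> None:
    """Append `records` to a bz2-compressed JSON-lines file, one JSON object per line."""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with bz2.open(filepath, "at", encoding="utf-8") as fout:
        for record in records:
            fout.write(json.dumps(record) + "\n")


def read_jsonl_bz2(filepath: str) -> list[dict]:
    """Read back every record; the bz2 module also reads files made of several appended streams."""
    with bz2.open(filepath, "rt", encoding="utf-8") as fin:
        return [json.loads(line) for line in fin if line.strip()]


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "GDL", "GDL-1900-issues.jsonl.bz2")

append_jsonl_bz2(path, [{"id": "GDL-1900-01-02-a"}])
append_jsonl_bz2(path, [{"id": "GDL-1900-01-03-a"}])  # second call appends, it does not overwrite

records = read_jsonl_bz2(path)
print([r["id"] for r in records])
```

One consequence of append mode worth keeping in mind: if the notebook's `'ab'` mode behaves like the sketch, re-running a patch cell without clearing the temporary directory would add the same documents a second time.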

In [68]:
def write_upload_issues(
    key: tuple[str, str],
    issues: list[dict[str, Any]],
    output_dir: str,
    bucket_name: str,
    failed_log: str | None = None
) -> tuple[bool, str]:
    """Compress issues for a Journal-year in a json file and upload them to s3.

    The compressed ``.bz2`` output file is a JSON-line file, where each line
    corresponds to an individual issue document in the canonical format.

    Args:
        key (tuple[str, str]): Newspaper ID and year of input issues, e.g. `('GDL', '1900')`.
        issues (list[dict[str, Any]]): A list of issues as dicts.
        output_dir (str): Local output directory.
        bucket_name (str): Name of S3 bucket where to upload the file.
        failed_log (str | None): Path to the log file for failed writes, if any.

    Returns:
        tuple[bool, str]: Return value of `upload_issues`, unpacked below as the
            success flag and the path to the compressed `.bz2` file.
    """
    newspaper, year = key
    filename = f'{newspaper}-{year}-issues.jsonl.bz2'
    filepath = os.path.join(output_dir, newspaper, filename)
    logger.info(f'Compressing {len(issues)} JSON files into {filepath}')

    write_jsonlines_file(filepath, issues, 'issues', failed_log)

    return upload_issues('-'.join(key), filepath, bucket_name)

In [65]:
def write_upload_pages(
    key: str,
    pages: list[dict[str, Any]],
    output_dir: str,
    bucket_name: str,
    failed_log: str | None = None
) -> tuple[str, tuple[bool, str]]:
    """Compress pages for a given edition in a json file and upload them to s3.

    The compressed ``.bz2`` output file is a JSON-line file, where each line
    corresponds to an individual page document in the canonical format.

    Args:
        key (str): Canonical ID of the newspaper issue (e.g. GDL-1900-01-02-a).
        pages (list[dict[str, Any]]): The list of pages for the provided key.
        output_dir (str): Local output directory.
        bucket_name (str): Name of S3 bucket where to upload the file.
        failed_log (str | None): Path to the log file for failed writes, if any.

    Returns:
        tuple[str, tuple[bool, str]]: The issue's canonical ID and the return value
            of `upload_pages`, unpacked below as the success flag and the path to
            the compressed `.bz2` file.
    """
    newspaper, year, month, day, edition = key.split('-')
    filename = f'{key}-pages.jsonl.bz2'
    filepath = os.path.join(output_dir, newspaper, f'{newspaper}-{year}', filename)
    logger.info(f'Compressing {len(pages)} JSON files into {filepath}')

    write_jsonlines_file(filepath, pages, 'pages', failed_log)

    return key, (upload_pages(key, filepath, bucket_name))
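
The local file layout used by the two functions above can be summarised with a small path-only sketch. The helper names `issues_filepath` and `pages_filepath` and the `out` directory are hypothetical; only the naming scheme is taken from the notebook code.

```python
import os


def issues_filepath(title: str, year: str, output_dir: str) -> str:
    # <output_dir>/<title>/<title>-<year>-issues.jsonl.bz2
    return os.path.join(output_dir, title, f"{title}-{year}-issues.jsonl.bz2")


def pages_filepath(issue_id: str, output_dir: str) -> str:
    # canonical issue IDs have five hyphen-separated parts: title, year, month, day, edition
    title, year, month, day, edition = issue_id.split("-")
    # <output_dir>/<title>/<title>-<year>/<issue_id>-pages.jsonl.bz2
    return os.path.join(output_dir, title, f"{title}-{year}", f"{issue_id}-pages.jsonl.bz2")


out = "out"  # hypothetical local directory
print(issues_filepath("GDL", "1900", out))
print(pages_filepath("GDL-1900-01-02-a", out))
```

Note that the five-way unpacking assumes the newspaper title itself contains no hyphen; an ID with more or fewer parts raises a `ValueError`.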

In [29]:
# adapted from https://github.com/impresso/impresso-data-sanitycheck/blob/master/sanity_check/contents/stats.py#L241
def canonical_stats_from_issue_bag(fetched_issues: db.core.Bag) -> list[dict[str, Any]]:
    """Compute the number of issues, pages, content items and images per newspaper and year.

    :param db.core.Bag fetched_issues: Bag of canonical issues (as dicts) fetched from s3.
    :return: One dict per newspaper-year with keys `np_id`, `year`, `issues`, `pages`,
        `content_items_out` and `images`.
    :rtype: list[dict[str, Any]]

    """
    pages_count_df = (
        fetched_issues.map(
            lambda i: {
                "np_id": i["id"].split('-')[0], 
                "year":i["id"].split('-')[1], 
                "id": i['id'], 
                "issue_id": i['id'], 
                "n_pages": len(set(i['pp'])),
                "n_content_items": len(i['i']),
                "n_images": len([item for item in i['i'] if item['m']['tp']=='image'])
            }
        )
        .to_dataframe(meta={'np_id': str, 'year': str, 
                            'id': str, 'issue_id': str, 
                            "n_pages": int, 'n_images': int,
                            'n_content_items': int})
        .set_index('id')
        .persist()
    )

    # sum the counts for all values collected
    aggregated_df = (pages_count_df
            .groupby(by=['np_id', 'year'])
            .agg({"n_pages": sum, 'issue_id': 'count', 'n_content_items': sum, 'n_images': sum})
            .rename(columns={'issue_id': 'issues', 'n_pages': 'pages', 
                             'n_content_items': 'content_items_out', 'n_images':'images'})
            .reset_index()
    )

    # return as a list of dicts
    return aggregated_df.to_bag(format='dict').compute()
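
For readers who want to check the aggregation on a small sample without a dask cluster, the same counts can be sketched in plain Python. The function name `stats_from_issues` and the toy issues are assumptions for this example; the field names (`pp`, `i`, `m`, `tp`) and the output keys follow the function above.

```python
from collections import defaultdict
from typing import Any


def stats_from_issues(issues: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Aggregate issue-level counts per (newspaper, year), without dask."""
    totals: dict[tuple[str, str], dict[str, int]] = defaultdict(
        lambda: {"issues": 0, "pages": 0, "content_items_out": 0, "images": 0}
    )
    for issue in issues:
        np_id, year = issue["id"].split("-")[:2]
        counts = totals[(np_id, year)]
        counts["issues"] += 1
        counts["pages"] += len(set(issue["pp"]))  # unique page IDs of the issue
        counts["content_items_out"] += len(issue["i"])
        counts["images"] += sum(1 for item in issue["i"] if item["m"]["tp"] == "image")
    return [
        {"np_id": np_id, "year": year, **counts}
        for (np_id, year), counts in totals.items()
    ]


# two made-up issues of the same title and year
toy_issues = [
    {"id": "NZZ-1900-01-02-a", "pp": ["p1", "p2"],
     "i": [{"m": {"tp": "article"}}, {"m": {"tp": "image"}}]},
    {"id": "NZZ-1900-01-03-a", "pp": ["p1"],
     "i": [{"m": {"tp": "article"}}]},
]

stats = stats_from_issues(toy_issues)
print(stats)
```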

In [63]:
def process_pages_of_issue(
    key: str, 
    pages: list[dict[str, Any]],
    manifest: DataManifest,
    issue_stats: list[dict],
    failed_log: str | None = None 
) -> tuple[str, tuple[tuple[bool, str], DataManifest]]:
    """Add the title-year statistics to the manifest if missing, then write and upload the pages."""
    newspaper, year, month, day, edition = key.split('-')

    if not manifest.has_title_year_key(newspaper, year):
        current_stats = [d for d in issue_stats if d['np_id']==newspaper and d['year']==year][0]
        # reduce the number of stats to consider at each step
        issue_stats.remove(current_stats)
        # remove unwanted keys from the dict
        del current_stats['np_id']
        del current_stats['year']
        success = manifest.replace_by_title_year(newspaper, year, current_stats)
        if not success:
            logger.warning("Problem encountered when trying to add %s for %s-%s", current_stats, newspaper, year)

    # `upload_result` is the (success, filepath) pair returned by `upload_pages`
    key, upload_result = write_upload_pages(key, pages, manifest.temp_dir, manifest.output_bucket_name, failed_log)

    return key, (upload_result, manifest)


# SWA - Patch 6

The patch consists of adding a new `iiif_manifest_uri` property that points to the issue's manifest in the IIIF Presentation API.

In [6]:
# initialize values for patch
SWA_TITLES = ['arbeitgeber', 'handelsztg']
SWA_IIIF_BASE_URI = 'https://ub-iiifpresentation.ub.unibas.ch/impresso_sb'
PROP_NAME = 'iiif_manifest_uri'

error_log = '/home/piconti/impresso-text-acquisition/text_importer/data/patch_logs/patch_6_swa_errors.log'

init_logger(logger, logging.INFO, '/home/piconti/impresso-text-acquisition/text_importer/data/patch_logs/patch_6_swa.log')
logger.info("Patching titles %s: adding %s property at issue level", SWA_TITLES, PROP_NAME)

In [7]:
# define patch function
def swa_manifest_uri(issue_id: str, swa_iiif: str = SWA_IIIF_BASE_URI) -> str:
    """
    https://ub-iiifpresentation.ub.unibas.ch/impresso_sb/[issue canonical ID]-issue/manifest
    """
    return os.path.join(swa_iiif, '-'.join([issue_id, 'issue']), 'manifest')
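
As a worked example of the URI template in the docstring above, the sketch below builds the manifest URI with an explicit `/` join instead of `os.path.join`, so the result does not depend on the operating system's path separator. This is a variant written for illustration, and the issue ID is made up.

```python
SWA_IIIF_BASE_URI = "https://ub-iiifpresentation.ub.unibas.ch/impresso_sb"


def swa_manifest_uri(issue_id: str, swa_iiif: str = SWA_IIIF_BASE_URI) -> str:
    # "/" is used explicitly so the result is the same on every platform
    return "/".join([swa_iiif.rstrip("/"), f"{issue_id}-issue", "manifest"])


issue_id = "arbeitgeber-1907-01-05-a"  # made-up issue ID, for illustration only
print(swa_manifest_uri(issue_id))
```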

In [8]:
# initialise manifest to keep track of updates
canonical_repo = git.Repo('/home/piconti/impresso-text-acquisition')
s3_input_bucket = 'canonical-data'
s3_output_bucket = 'canonical-staging'
# previous manifest is not in the output bucket --> provide it as argument
previous_manifest_path = 's3://canonical-data/canonical_v0-0-1.json' 
temp_dir = '/scratch/piconti/impresso/patches_temp'
patched_fields=[PROP_NAME]
schema_path = '/home/piconti/impresso-text-acquisition/text_importer/impresso-schemas/json/versioning/manifest.schema.json'

swa_patch_6_manifest = DataManifest(
    data_stage = 'canonical',
    s3_output_bucket = s3_output_bucket,
    s3_input_bucket = s3_input_bucket,
    git_repo = canonical_repo,
    temp_dir = temp_dir,
    patched_fields=patched_fields,
    previous_mft_path = previous_manifest_path
)

Perform the patch, track the updates and upload the results.

In [9]:
# download the issues of interest for this patch
# NOTE: `fetch_issues` is not among the imports above and must be in scope for this cell to run
swa_issues = fetch_issues('canonical-data', True, SWA_TITLES)

# patch them keeping track of the data that's been modified
yearly_patched_issues = {}

for issue in swa_issues:
    # key is title-year
    title, year = issue['id'].split('-')[:2]
    key = '-'.join([title, year])
    if key in yearly_patched_issues:
        yearly_patched_issues[key].append(add_property(issue, PROP_NAME, swa_manifest_uri, issue['id']))
    else:
        yearly_patched_issues[key] = [add_property(issue, PROP_NAME, swa_manifest_uri, issue['id'])]
    
    swa_patch_6_manifest.add_by_title_year(title, year, counts_for_canonical_issue(issue))

# write and upload the updated issues to s3
for key, issues in yearly_patched_issues.items():
    write_upload_issues(key.split('-'), issues, temp_dir, s3_output_bucket, error_log)

# finalize the manifest and export it
note = f"Patching titles {SWA_TITLES}: adding {PROP_NAME} property at issue level"
swa_patch_6_manifest.append_to_notes(note)
swa_patch_6_manifest.compute(export_to_git_and_s3 = False)
swa_patch_6_manifest.validate_and_export_manifest(path_to_schema=schema_path, push_to_git=True)
    

Fetching list of newspapers from canonical-data
canonical-data contains 94 newspapers
canonical-data contains 130 .bz2 files with issues for the provided newspapers ['arbeitgeber', 'handelsztg']
Fetching issue ids from 130 .bz2 files (compute=True)
arbeitgeber 1907
arbeitgeber 1908
arbeitgeber 1909
arbeitgeber 1910
arbeitgeber 1911
arbeitgeber 1912
arbeitgeber 1913
arbeitgeber 1914
arbeitgeber 1915
arbeitgeber 1916
arbeitgeber 1917
arbeitgeber 1918
arbeitgeber 1919
arbeitgeber 1924
arbeitgeber 1925
arbeitgeber 1926
arbeitgeber 1927
arbeitgeber 1928
arbeitgeber 1929
arbeitgeber 1930
arbeitgeber 1931
arbeitgeber 1932
arbeitgeber 1933
arbeitgeber 1934
arbeitgeber 1935
arbeitgeber 1936
arbeitgeber 1937
arbeitgeber 1938
arbeitgeber 1939
arbeitgeber 1940
arbeitgeber 1941
arbeitgeber 1942
arbeitgeber 1943
arbeitgeber 1944
arbeitgeber 1945
arbeitgeber 1946
arbeitgeber 1947
arbeitgeber 1948
arbeitgeber 1949
arbeitgeber 1950
arbeitgeber 1951
arbeitgeber 1952
arbeitgeber 1953
arbeitgeber 1954
arb

True
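
The grouping step of the cell above (one list of issues per title-year key) can also be written with a `defaultdict`, which removes the `if`/`else` around the first insertion. This is a small self-contained sketch with made-up issue IDs; it leaves out the patching and the manifest bookkeeping.

```python
from collections import defaultdict

# made-up issues, only the `id` field matters for the grouping
issues = [
    {"id": "arbeitgeber-1907-01-05-a"},
    {"id": "arbeitgeber-1907-01-12-a"},
    {"id": "handelsztg-1920-03-01-a"},
]

yearly_issues: dict[str, list[dict]] = defaultdict(list)
for issue in issues:
    # key is title-year, as in the patch cell
    title, year = issue["id"].split("-")[:2]
    yearly_issues[f"{title}-{year}"].append(issue)

for key, grouped in yearly_issues.items():
    print(key, len(grouped))
```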

# FedGaz + NZZ – Patch 1

The patch consists of adding a new `iiif_img_base_uri` property that holds the base URI of the IIIF Image API for the given page.

In [10]:
client = Client(n_workers=16, threads_per_worker=2)
client

Client: Cluster object (distributed.LocalCluster), status: running, using processes: True
Dashboard: http://127.0.0.1:8787/status
Workers: 16, total threads: 32 (2 per worker), total memory: 251.79 GiB (15.74 GiB per worker)
Scheduler comm: tcp://127.0.0.1:40417

In [10]:
# initialize values for patch
UZH_TITLES = ['FedGazDe', 'FedGazFr', 'NZZ']
IMPRESSO_IIIF_BASE_URI = "https://impresso-project.ch/api/proxy/iiif/"
PROP_NAME = 'iiif_img_base_uri'

error_log = '/home/piconti/impresso-text-acquisition/text_importer/data/patch_logs/patch_1_fedgaz_errors.log'

init_logger(logger, logging.INFO, '/home/piconti/impresso-text-acquisition/text_importer/data/patch_logs/patch_1_fedgaz_nzz.log')
logger.info("Patching titles %s: adding %s property at page level", UZH_TITLES, PROP_NAME)

In [11]:
# define patch function
def uzh_image_base_uri(page_id: str, impresso_iiif: str = IMPRESSO_IIIF_BASE_URI) -> str:
    """
    https://impresso-project.ch/api/proxy/iiif/[page canonical ID]
    """
    return os.path.join(impresso_iiif, page_id)
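
The function above relies on the base URI ending with a slash. The sketch below shows why this matters if the join is done with `urllib.parse.urljoin` from the standard library, used here as an alternative to `os.path.join`: without the trailing slash, the last path segment of the base is replaced rather than extended. The page ID is made up for the example.

```python
from urllib.parse import urljoin

IMPRESSO_IIIF_BASE_URI = "https://impresso-project.ch/api/proxy/iiif/"


def uzh_image_base_uri(page_id: str, impresso_iiif: str = IMPRESSO_IIIF_BASE_URI) -> str:
    # urljoin resolves `page_id` relative to the base URI
    return urljoin(impresso_iiif, page_id)


page_id = "NZZ-1900-01-02-a-p0001"  # made-up page ID, for illustration only

with_slash = uzh_image_base_uri(page_id)
# same base without its trailing slash: the final "iiif" segment gets replaced
without_slash = uzh_image_base_uri(page_id, IMPRESSO_IIIF_BASE_URI.rstrip("/"))

print(with_slash)
print(without_slash)
```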

In [12]:
# initialise manifest to keep track of updates
canonical_repo = git.Repo('/home/piconti/impresso-text-acquisition')
s3_input_bucket = 'canonical-data'
s3_output_bucket = 'canonical-sandbox' #'canonical-staging'
# previous manifest is not in the output bucket --> provide it as argument
previous_manifest_path = 's3://canonical-staging/canonical_v0-0-2.json' 
temp_dir = '/scratch/piconti/impresso/patches_temp'
patched_fields=[PROP_NAME]
schema_path = '/home/piconti/impresso-text-acquisition/text_importer/impresso-schemas/json/versioning/manifest.schema.json'

nzz_patch_1_manifest = DataManifest(
    data_stage = 'canonical',
    s3_output_bucket = s3_output_bucket,
    s3_input_bucket = s3_input_bucket,
    git_repo = canonical_repo,
    temp_dir = temp_dir,
    patched_fields=patched_fields,
    previous_mft_path = previous_manifest_path
)

Perform the patch, track the updates and upload the results.

In [13]:
# download the issues and pages of interest for this patch (kept as dask bags)
uzh_issues, uzh_pages = fetch_files('canonical-data', False, 'both', UZH_TITLES)

# compute the per title-year statistics from the issues
logger.info("computing the canonical statistics on the issues...")
stats_from_issues = canonical_stats_from_issue_bag(uzh_issues)

issue_stats = copy.deepcopy(stats_from_issues)

# patch the pages and write them back to s3.
uzh_patched_pages = (
    uzh_pages
        .map(lambda p: add_property(p, PROP_NAME, uzh_image_base_uri, p['id']))
        .groupby(lambda p: '-'.join(p['id'].split('-')[:-1]))
        .starmap(
            write_upload_pages,
            output_dir=temp_dir,
            bucket_name=s3_output_bucket,
            failed_log=error_log
        )
).compute()

# fill in the manifest statistics and prepare issues to be uploaded to their new s3 bucket.
issues_with_patched_pages = {}
for issue_id, (success, path) in uzh_patched_pages:
    title, year, month, day, edition = issue_id.split('-')
    # issues are grouped by title-year, matching the `<title>-<year>-issues.jsonl.bz2` files
    key = '-'.join([title, year])

    if not success:
        logger.warning("The pages for issue %s were not correctly uploaded", issue_id)
        continue

    # the statistics are added to the manifest only once per title-year
    if not nzz_patch_1_manifest.has_title_year_key(title, year):
        current_stats = [d for d in issue_stats if d['np_id']==title and d['year']==year][0]
        # reduce the number of stats to consider at each step
        issue_stats.remove(current_stats)
        # remove unwanted keys from the dict
        del current_stats['np_id']
        del current_stats['year']

        add_ok = nzz_patch_1_manifest.replace_by_title_year(title, year, current_stats)
        if not add_ok:
            logger.warning("Problem encountered when trying to add %s for %s-%s", current_stats, title, year)
            continue

    specific_issue = [i for i in uzh_issues if i['id']==issue_id]
    assert len(specific_issue) == 1, f"Expected exactly one issue with id {issue_id}, found {len(specific_issue)}"
    # every issue whose pages were patched and uploaded can be copied to the new bucket
    issues_with_patched_pages.setdefault(key, []).extend(specific_issue)

# write and upload the issues to the new s3 bucket
for key, issues in issues_with_patched_pages.items():
    success, issue_path = write_upload_issues(key.split('-'), issues, temp_dir, s3_output_bucket, error_log)
    if not success:
        logger.warning("The copy of issues %s had a problem", key)

# finalize the manifest and export it
note = f"Patching titles {UZH_TITLES}: adding {PROP_NAME} property at page level"
nzz_patch_1_manifest.append_to_notes(note)
nzz_patch_1_manifest.compute(export_to_git_and_s3 = False)
nzz_patch_1_manifest.validate_and_export_manifest(path_to_schema=schema_path, push_to_git=True)

Fetching list of newspapers from canonical-data
canonical-data contains 94 newspapers
canonical-data contains 473 .bz2 issue files for the provided newspapers ['FedGazDe', 'FedGazFr', 'NZZ']
canonical-data contains 128930 .bz2 page files for the provided newspapers ['FedGazDe', 'FedGazFr', 'NZZ']
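
The page pipeline in the cell above chains `groupby` and `starmap` on a dask bag: `groupby` yields `(key, list_of_items)` pairs, and `starmap` unpacks each pair into the positional arguments of the target function, passing any extra keyword arguments along. The plain-Python sketch below mimics these semantics on three made-up page IDs; `handle_group` is a stand-in for `write_upload_pages` and performs no I/O.

```python
from collections import defaultdict
from functools import partial
from itertools import starmap


def issue_id_of(page_id: str) -> str:
    # drop the last hyphen-separated part (the page suffix) to get the issue ID
    return "-".join(page_id.split("-")[:-1])


def handle_group(key: str, pages: list[dict], output_dir: str) -> tuple[str, tuple[int, str]]:
    # stand-in for write_upload_pages: only reports what it received
    return key, (len(pages), output_dir)


pages = [
    {"id": "NZZ-1900-01-02-a-p0001"},
    {"id": "NZZ-1900-01-02-a-p0002"},
    {"id": "NZZ-1900-01-03-a-p0001"},
]

# equivalent of `.groupby(...)`: one (issue_id, pages) pair per issue
groups: dict[str, list[dict]] = defaultdict(list)
for page in pages:
    groups[issue_id_of(page["id"])].append(page)

# equivalent of `.starmap(func, output_dir=...)`: each pair is unpacked into (key, pages)
results = list(starmap(partial(handle_group, output_dir="out"), groups.items()))
print(results)
```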
