# Compare number of local issues against issues on s3

This notebook was created to address https://github.com/impresso/impresso-data-sanitycheck/issues/9

The `venv` from which this notebook runs gives you access to the following libs:
- https://github.com/impresso/impresso-data-sanitycheck
- impresso-pycommons
- impresso-text-importer

Code is reused as much as possible from the respective importers. Other submodules (`sanity_check.contents.s3_data`, `sanity_check.contents.checks`, `sanity_check.contents.stats` and `sanity_check.contents.sync`) also contain quite a bit of code to load s3 data into dataframes for various purposes; see e.g. https://github.com/impresso/impresso-data-sanitycheck/blob/5c47b1d8360570f749909d07e64c0289057c243f/sanity_check/contents/stats.py#L266-L286.

## Imports

In [1]:
import os 
from datetime import datetime

from impresso_commons.path.path_fs import detect_issues
from text_importer.importers.rero.detect import detect_issues as rero_detect_issues
from text_importer.importers.lux.detect import detect_issues as lux_detect_issues
from text_importer.importers.bnf.detect import detect_issues as bnf_detect_issues
from text_importer.importers.bnf_en.detect import detect_issues as bnfen_detect_issues, dir2issue, BnfEnIssueDir
from text_importer.importers.bl.detect import detect_issues as bl_detect_issues
from text_importer.importers.swa.detect import detect_issues as swa_detect_issues



In [2]:
from dask.distributed import Client

from dask import bag as db
from dask import dataframe as dd
from dask import array as da

In [3]:
from sanity_check.contents.s3_data import fetch_issue_ids, fetch_issues

## Local dask cluster

In [4]:
dask_client = Client()

The machine where this notebook runs has 48 cores and quite a bit of RAM (about 270 GB is available to the dask workers), so you can make use of that if needed.

In [5]:
dask_client

- Client: Scheduler: tcp://127.0.0.1:41089, Dashboard: http://127.0.0.1:8787/status
- Cluster: Workers: 8, Cores: 48, Memory: 270.38 GB


## Data

This is where the original (input) data for the canonical ingestion is located:

```
/mnt/project_impresso/original/BNL
/mnt/project_impresso/original/RERO
/mnt/project_impresso/original/RERO2
/mnt/project_impresso/original/RERO3
/mnt/impresso_syno/01_GDL
/mnt/impresso_syno/02_GDL
/mnt/impresso_syno/01_JDG
/mnt/impresso_syno/02_JDG
```

The comparison should work as follows:

- it should be a script that takes as input a list of paths (pointing to EPFL's NAS, where the raw OCR data is stored) plus an s3 bucket path (where the ingested canonical data is stored);
- the appropriate detect function is applied to each base path;
- the resulting issues are then grouped to produce counts by newspaper/year;
- the canonical data is then read from s3 to produce similar counts by newspaper/year (trivial, because the data is already packaged this way);
- finally, the two sets of counts are combined and written to e.g. CSV, and only the cases where the difference exceeds a user-specified threshold are flagged (printed).
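
The steps above can be sketched without dask or S3 access, using only the standard library. This is a minimal illustration, not the notebook's implementation: the issue ids are made up, and `count_by_journal_year` and `flag_mismatches` are hypothetical helper names that do not exist in `sanity_check`.

```python
from collections import Counter


def count_by_journal_year(issue_ids):
    """Count issues per (journal, year) from ids like 'GDL-1900-01-02-a'."""
    counts = Counter()
    for issue_id in issue_ids:
        journal, year, _month, _day, _edition = issue_id.split("-")
        counts[(journal, int(year))] += 1
    return counts


def flag_mismatches(local_counts, s3_counts, threshold=1.0):
    """Return rows whose s3/local coverage is below `threshold` or undefined."""
    rows = []
    for key in sorted(set(local_counts) | set(s3_counts)):
        n_local = local_counts.get(key)
        n_s3 = s3_counts.get(key)
        # Coverage is undefined when one side has no issues at all.
        coverage = n_s3 / n_local if n_local and n_s3 else None
        if coverage is None or coverage < threshold:
            rows.append((*key, n_local, n_s3, coverage))
    return rows


# Made-up ids, for illustration only.
local_ids = ["GDL-1900-01-01-a", "GDL-1900-01-02-a", "JDG-1901-05-03-a"]
s3_ids = ["GDL-1900-01-01-a"]

mismatches = flag_mismatches(
    count_by_journal_year(local_ids), count_by_journal_year(s3_ids)
)
for row in mismatches:
    print(row)
```

As in the notebook's own filter, a journal/year that is present on only one side has no defined coverage and is always flagged.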


In [23]:

def bnfen_custom_detect_issues(base_dir: str, access_rights: str = None):
    """Detect newspaper issues to import within the filesystem.
    This function expects the directory structure that BNF-EN used to
    organize the dump of Mets/Alto OCR data.
    :param str base_dir: Path to the base directory of newspaper data.
    :param str access_rights: Not used for this importer, but the argument is kept for consistency.
    :return: List of `BnfEnIssueDir` instances, to be imported.
    """
    
    dir_path, dirs, files = next(os.walk(base_dir))
    journal_dirs = [os.path.join(dir_path, _dir) for _dir in dirs]
    issue_dirs = [
        os.path.join(journal, _dir)
        for journal in journal_dirs
        for _dir in os.listdir(journal)
        ]
    
    issue_dirs = [bnfen_dir2issue(_dir, None) for _dir in issue_dirs]
    
    issue_dirs = [i for i in issue_dirs if i is not None]
    
    return issue_dirs

def bnfen_dir2issue(path: str, access_rights: dict):
    """Create a `BnfEnIssueDir` object from a directory path.
    .. note ::
        This function is called internally by :func:`bnfen_custom_detect_issues`
    :param str path: Path of issue.
    :param dict access_rights: Not used; rights are hard-coded to "open-public".
    :return: New ``BnfEnIssueDir`` object
    """
    journal, issue = path.split('/')[-2:]

    # The directory name starts with `<date>_<edition>`; only the date is used.
    date, _ = issue.split('_')[:2]
    date = datetime.strptime(date, '%Y%m%d').date()
    journal = journal.lower().replace('-', '').strip()
    edition = 'X'  # placeholder edition, identical for every detected issue

    return BnfEnIssueDir(journal=journal, date=date, edition=edition, path=path,
                         rights="open-public", ark_link="IIIF_LINK")

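def detect_issues_from_dirs(local_dirs: list):
    """
    Wrapper to detect issues for various sources and file structures.

    The source is chosen by substring match on the lowercased path, so the more
    specific keys (e.g. 'bnf-en') are tested before the more general ones ('bnf').
    """

    issues = []
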
    for path in local_dirs:
        path_lower = path.lower()
        if 'rero2' in path_lower:
            issues += rero_detect_issues(path, "/mnt/project_impresso/original/RERO2/rero2_access_rights.json")
        elif 'rero3' in path_lower:
            issues += rero_detect_issues(path, "/mnt/project_impresso/original/RERO3/access_rights.json")
        elif 'bnl' in path_lower:
            issues += lux_detect_issues(path)
        elif 'bnf-en' in path_lower:
            issues += bnfen_custom_detect_issues(path)
        elif 'bnf' in path_lower:
            issues += bnf_detect_issues(path, "/mnt/project_impresso/original/BNF/access_rights.json")
        elif 'bl' in path_lower:
            issues += bl_detect_issues(path, access_rights=None, tmp_dir='tmp_bnl_uncompressed')
        elif 'swa' in path_lower:
            issues += swa_detect_issues(path, "/mnt/project_impresso/original/SWA/access_rights.json")
        else:
            issues += detect_issues(path)
            

    return issues 


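def canonical_issue_meta_from_id(issue_id):
    """Extract journal and year from a canonical issue id (`journal-YYYY-MM-DD-edition`)."""
    journal, year, month, day, edition = issue_id.split('-')
    # Cast the year to int so it matches the integer years of the local issues.
    meta = {"journal": journal, "year": int(year), "issue_id": issue_id}

    return meta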

def canonical_issue_name(issues):
    """
    Create canonical issue ids from a list of `IssueDir` objects.

    Returns one `[journal, year, issue_id]` row per issue.
    """
    ret = []
    
    for issue in issues:
        issue_id = "-".join(
                        [
                            issue.journal,
                            str(issue.date.year),
                            str(issue.date.month).zfill(2),
                            str(issue.date.day).zfill(2),
                            issue.edition
                        ]
                    )
        
        ret.append([issue.journal, issue.date.year, issue_id])
        
    return ret
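
The two helpers above rely on the canonical issue id layout `journal-YYYY-MM-DD-edition`. The sketch below round-trips one id with plain Python. `IssueStub` is a hypothetical stand-in for the importers' `IssueDir` objects that carries only the three attributes used here, `build_issue_id` and `parse_issue_id` are illustrative names, and the values are made up.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in for an IssueDir: only the attributes used below.
IssueStub = namedtuple("IssueStub", ["journal", "date", "edition"])


def build_issue_id(issue):
    """Format an issue as 'journal-YYYY-MM-DD-edition'."""
    d = issue.date
    return f"{issue.journal}-{d.year}-{d.month:02d}-{d.day:02d}-{issue.edition}"


def parse_issue_id(issue_id):
    """Split from the right so the four trailing parts are always the date and edition."""
    journal, year, _month, _day, _edition = issue_id.rsplit("-", 4)
    # The year is converted to int so it compares equal to `date.year`.
    return {"journal": journal, "year": int(year), "issue_id": issue_id}


issue = IssueStub(journal="GDL", date=date(1900, 1, 2), edition="a")
issue_id = build_issue_id(issue)
meta = parse_issue_id(issue_id)
print(issue_id, meta)
```

Keeping the year as an integer on both sides matters because the local and s3 counts are later joined on the `(journal, year)` pair.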

In [50]:
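def aggr_by_year_journal(df):
    """Count the rows (issues) per journal and year."""
    return df.groupby(['journal', 'year']).count()

def filter_sources_with_mismatch(df, thres=1.0):
    """Keep journal/years whose s3 coverage is below `thres` or undefined.

    Note: adds a `coverage` column to `df` in place.
    """
    df['coverage'] = df['n_issues_s3'] / df['n_issues_local']

    return df[(df.coverage < thres) | (df.coverage.isna())]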

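def run_issue_comparison(s3_bucket: str, local_dirs: list, f_out_full: str = None, f_out_mismatch: str = None):
    """Compare issue counts per journal/year between local dirs and an s3 bucket.

    Writes the full overview and the mismatches to CSV if file names are given,
    and returns the dataframe of mismatches.
    """
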
    issues_local = db.from_sequence(local_dirs, partition_size=1) \
                    .map_partitions(detect_issues_from_dirs) \
                    .compute()

    df_local = db.from_sequence(issues_local) \
                .map_partitions(canonical_issue_name) \
                .to_dataframe(meta={'journal': str, 'year': int, 'issue_id': str}) \
                .compute()


    df_n_issues_local = aggr_by_year_journal(df_local) \
                            .rename(columns={"issue_id": "n_issues_local"})


    s3_canonical_issue_ids = fetch_issue_ids(bucket_name=s3_bucket)

    df_s3 = db.from_sequence(s3_canonical_issue_ids) \
        .map(canonical_issue_meta_from_id) \
        .to_dataframe(meta={'journal': str, 'year': int, 'issue_id': str}) \
        .compute()


    df_n_issues_s3 = aggr_by_year_journal(df_s3) \
                            .rename(columns={"issue_id": "n_issues_s3"})

    df_comb = df_n_issues_local.merge(df_n_issues_s3, how="outer", left_index=True, right_index=True).reset_index()

    df_err = filter_sources_with_mismatch(df_comb)
    
    if f_out_full:
        df_comb.to_csv(f_out_full)
    if f_out_mismatch:
        df_err.to_csv(f_out_mismatch)

    return df_err


s3_bucket='s3://canonical-data'

orig_resources = [
    "/mnt/project_impresso/original/BNL",
    "/mnt/project_impresso/original/RERO",
    "/mnt/project_impresso/original/RERO2",
    "/mnt/project_impresso/original/RERO3",
    "/mnt/impresso_syno",
    "/mnt/project_impresso/original/BNF",
    "/mnt/project_impresso/original/BNF-EN",
    "/mnt/project_impresso/original/BL",
    "/mnt/project_impresso/original/SWA"  
]
    

f_out_mismatch = 'data_ingestion_issue_mismatch.csv'
f_out_full = 'data_ingestion_issue_overview.csv'

df_err = run_issue_comparison(s3_bucket=s3_bucket, local_dirs=orig_resources, f_out_mismatch=f_out_mismatch, f_out_full=f_out_full)
df_err

Fetching list of newspapers from s3://canonical-data
canonical-data contains 93 newspapers
s3://canonical-data contains 3882 .bz2 files with issues
Fetching issue ids from 3882 .bz2 files (compute=False)


| | journal | year | n_issues_local | n_issues_s3 | coverage |
|---|---|---|---|---|---|
| 127 | EVT | 2002 | 1.0 | NaN | NaN |
| 140 | EXP | 1779 | 2.0 | NaN | NaN |
| 167 | EXP | 1823 | 100.0 | 50.0 | 0.500000 |
| 168 | EXP | 1824 | 50.0 | 49.0 | 0.980000 |
| 349 | EXP | 2005 | 303.0 | 302.0 | 0.996700 |
| ... | ... | ... | ... | ... | ... |
| 3790 | oeuvre | 1937 | 364.0 | 363.0 | 0.997253 |
| 3791 | oeuvre | 1938 | 363.0 | 362.0 | 0.997245 |
| 3792 | oeuvre | 1939 | 364.0 | 362.0 | 0.994505 |
| 3793 | oeuvre | 1940 | 336.0 | 335.0 | 0.997024 |
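
When reading the mismatch report, it can help to separate journal/years that have no issues on s3 at all (no `n_issues_s3` value) from those that are only partially ingested. The sketch below parses three of the rows shown above with the standard `csv` module, with the index column omitted. The split into two groups and the count of missing issues are illustrative additions; the notebook itself does not compute them.

```python
import csv
import io

# Three rows copied from the output above, without the index column.
report = """journal,year,n_issues_local,n_issues_s3,coverage
EVT,2002,1.0,,
EXP,1823,100.0,50.0,0.500000
EXP,1824,50.0,49.0,0.980000
"""

absent, partial = [], []
for row in csv.DictReader(io.StringIO(report)):
    n_local = int(float(row["n_issues_local"]))
    if row["n_issues_s3"] == "":
        # No canonical issues at all for this journal/year.
        absent.append((row["journal"], int(row["year"]), n_local))
    else:
        n_s3 = int(float(row["n_issues_s3"]))
        # Number of local issues that have no counterpart on s3.
        partial.append((row["journal"], int(row["year"]), n_local - n_s3))

print("absent from s3 (journal, year, local issues):", absent)
print("partially ingested (journal, year, missing issues):", partial)
```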
