**Desc**:

Verify the document counts that Maud computed on the release backup data:

> siclemat and @matteo, as you might have seen, I pushed the manifests and script to the only existing branch in impresso-data-release. I put both folders, for the last and the current release, in this branch. Once you’ve validated things we could merge.
 I re-executed rclone on all components. Here are a couple of counts on the backup (NAS):
- number of content items in issues (not useful I know, that was a mistake, but since I have the count…) : 47,816,371
- number of issues: 603,864
- number of topic assignments: 42,394,381

## Imports

In [1]:
import json
import os
import sys

sys.path.append("../")  # make the local sanity_check package importable

import pandas as pd
from dask import bag as db
from dask.distributed import Client
from dask_k8 import DaskCluster
from impresso_commons.utils.kube import (make_scheduler_configuration,
                                         make_worker_configuration)
from impresso_commons.utils.s3 import IMPRESSO_STORAGEOPT, fixed_s3fs_glob
from impresso_commons.utils.s3 import alternative_read_text
from sanity_check.contents.s3_data import list_files_rebuilt, list_pages

## Functions

In [7]:
S3_CANONICAL_DATA_BUCKET = "s3://original-canonical-release"
S3_REBUILT_DATA_BUCKET = "s3://canonical-rebuilt-release"

In [13]:
from typing import Iterable, Optional
from sanity_check.contents.s3_data import list_newspapers

def list_issues(bucket_name: str = S3_CANONICAL_DATA_BUCKET, newspapers: Optional[Iterable[str]] = None):
    """
    List the issue archives (.bz2 files) of the given newspapers in an s3 bucket.

    When no newspapers are given, all newspapers found in the bucket are used.
    """
    # `None` replaces the mutable default argument `[]` of the earlier version.
    if not newspapers:
        newspapers = list_newspapers(bucket_name) if bucket_name else list_newspapers()
    print(f'Issues for these newspapers will be listed: {newspapers}')
    issue_files = [
        file
        for newspaper in newspapers
        for file in fixed_s3fs_glob(os.path.join(bucket_name, f"{newspaper}/issues/*"))
    ]
    print(f"{bucket_name} contains {len(issue_files)} .bz2 files with issues")
    return issue_files
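
To make the listing step easier to follow, here is a minimal offline sketch of the glob patterns that `list_issues` builds, one per newspaper. The helper name `issue_glob_patterns` is hypothetical and not part of `impresso_commons`; it only mirrors the `<bucket>/<newspaper>/issues/*` layout used above and does not contact S3.

```python
import posixpath
from typing import Iterable, List


def issue_glob_patterns(bucket_name: str, newspapers: Iterable[str]) -> List[str]:
    """Build one glob pattern per newspaper (hypothetical helper for illustration)."""
    return [
        # posixpath keeps forward slashes regardless of the local operating system
        posixpath.join(bucket_name, newspaper, "issues", "*")
        for newspaper in sorted(newspapers)
    ]


patterns = issue_glob_patterns("s3://original-canonical-release", {"GDL", "JDG"})
print(patterns)
```

Sorting the newspaper ids is a design choice of this sketch: `list_newspapers` appears to return a set, so sorting gives a stable order across runs.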

In [5]:
def fetch_issues(bucket_name=S3_CANONICAL_DATA_BUCKET, newspapers=None, compute=True):
    """
    Fetch issue JSON docs from an s3 bucket with impresso canonical data.

    Returns a list of issue dicts if `compute` is True, otherwise a lazy dask bag.
    """
    # list_issues falls back to all newspapers in the bucket when none are given
    issue_files = list_issues(bucket_name, newspapers)

    print(
        f"Fetching issue ids from {len(issue_files)} .bz2 files "
        f"(compute={compute})"
    )
    # Each line of each .bz2 file is parsed as one JSON document
    issue_bag = db.read_text(issue_files, storage_options=IMPRESSO_STORAGEOPT).map(
        json.loads
    )

    return issue_bag.compute() if compute else issue_bag
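
As a rough local analogue of what the bag does, the sketch below counts JSON documents stored one per line in bz2-compressed files, without dask or S3. The function `count_json_lines`, the file name, and the toy issue ids are assumptions made up for illustration.

```python
import bz2
import json
import os
import tempfile


def count_json_lines(paths):
    """Count JSON documents stored one per line in bz2-compressed files."""
    total = 0
    for path in paths:
        with bz2.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    json.loads(line)  # fail loudly on malformed lines
                    total += 1
    return total


# Hypothetical toy data standing in for one issue archive.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "GDL-1900-issues.jsonl.bz2")
    with bz2.open(path, "wt", encoding="utf-8") as handle:
        for issue_id in ("GDL-1900-01-01-a", "GDL-1900-01-02-a"):
            handle.write(json.dumps({"id": issue_id}) + "\n")
    n_docs = count_json_lines([path])

print(n_docs)
```

This is handy for spot-checking a single archive copied from the backup before running the full distributed count.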

In [6]:
def start_cluster(n_workers: int = 10, worker_memory: str = '1G', blocking: bool = False):
    """Create a dask cluster on kubernetes and return it with a connected client."""
    cluster = DaskCluster(
        namespace="dhlab",
        cluster_id="impresso-sanitycheck",
        scheduler_pod_spec=make_scheduler_configuration(),
        worker_pod_spec=make_worker_configuration(
            docker_image="ic-registry.epfl.ch/dhlab/impresso_pycommons:v1",
            memory=worker_memory
        )
    )
    cluster.create()
    # Pass the caller's choice through instead of hard-coding blocking=False
    cluster.scale(n_workers, blocking=blocking)
    return cluster, cluster.make_dask_client()

## dask k8 cluster

In [9]:
dask_cluster, dask_client = start_cluster(n_workers=100, worker_memory='2G')

Scheduler: tcp://10.90.47.13:19713
Dashboard: http://10.90.47.13:8211


In [11]:
dask_client

Client: Scheduler: tcp://10.90.47.13:19713, Dashboard: http://10.90.47.13:8787/status
Cluster: Workers: 100, Cores: 100, Memory: 200.00 GB


## Verify canonical counts

In [18]:
release_newspapers = list_newspapers(S3_CANONICAL_DATA_BUCKET)

Fetching list of newspapers from s3://original-canonical-release
original-canonical-release contains 78 newspapers


In [19]:
canonical_issue_files = list_issues(newspapers=release_newspapers)

Issues for these newspapers will be listed: {'LLS', 'luxwort', 'buergerbeamten', 'waechtersauer', 'onsjongen', 'tageblatt', 'LBP', 'luxland', 'obermosel', 'LCE', 'luxzeit1858', 'dunioun', 'deletz1893', 'LLE', 'WHD', 'JDF', 'JDV', 'BDC', 'OIZ', 'JDG', 'DVF', 'lunion', 'EZR', 'GDL', 'LNF', 'arbeitgeber', 'CDV', 'GAV', 'SGZ', 'indeplux', 'NTS', 'demitock', 'SMZ', 'DTT', 'LCR', 'luxembourg1935', 'SRT', 'MGS', 'HRV', 'waeschfra', 'FZG', 'DFS', 'LSE', 'BNN', 'LSR', 'ZBT', 'LES', 'actionfem', 'avenirgdl', 'NZG', 'CON', 'NZZ', 'LCS', 'SAX', 'armeteufel', 'kommmit', 'landwortbild', 'EDA', 'handelsztg', 'diekwochen', 'IMP', 'LVE', 'FCT', 'schmiede', 'luxzeit1844', 'BLB', 'EXP', 'DLE', 'LTF', 'gazgrdlux', 'courriergdl', 'SDT', 'GAZ', 'LCG', 'VDR', 'volkfreu1869', 'LAB', 'VHT'}
s3://original-canonical-release contains 3101 .bz2 files with issues


In [20]:
len(canonical_issue_files)

3101

In [23]:
canonical_issue_bag = fetch_issues(S3_CANONICAL_DATA_BUCKET, compute=False, newspapers=release_newspapers)

Issues for these newspapers will be listed: {'LLS', 'luxwort', 'buergerbeamten', 'waechtersauer', 'onsjongen', 'tageblatt', 'LBP', 'luxland', 'obermosel', 'LCE', 'luxzeit1858', 'dunioun', 'deletz1893', 'LLE', 'WHD', 'JDF', 'JDV', 'BDC', 'OIZ', 'JDG', 'DVF', 'lunion', 'EZR', 'GDL', 'LNF', 'arbeitgeber', 'CDV', 'GAV', 'SGZ', 'indeplux', 'NTS', 'demitock', 'SMZ', 'DTT', 'LCR', 'luxembourg1935', 'SRT', 'MGS', 'HRV', 'waeschfra', 'FZG', 'DFS', 'LSE', 'BNN', 'LSR', 'ZBT', 'LES', 'actionfem', 'avenirgdl', 'NZG', 'CON', 'NZZ', 'LCS', 'SAX', 'armeteufel', 'kommmit', 'landwortbild', 'EDA', 'handelsztg', 'diekwochen', 'IMP', 'LVE', 'FCT', 'schmiede', 'luxzeit1844', 'BLB', 'EXP', 'DLE', 'LTF', 'gazgrdlux', 'courriergdl', 'SDT', 'GAZ', 'LCG', 'VDR', 'volkfreu1869', 'LAB', 'VHT'}
s3://original-canonical-release contains 3101 .bz2 files with issues
Fetching issue ids from 3101 .bz2 files (compute=False)


In [24]:
canonical_issue_bag.count().compute()

603864
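
The issue count matches the reported 603,864. The reported number of content items (47,816,371) is not checked in this notebook; a minimal sketch of how it could be derived follows. It assumes that each issue document lists its content items under an `"i"` key, which should be confirmed against the canonical schema before relying on it. The toy issues and ids are invented for illustration.

```python
from typing import Dict, Iterable


def count_content_items(issue: Dict) -> int:
    """Return the number of content items of one issue document.

    Assumption: content items are listed under the "i" key.
    """
    return len(issue.get("i", []))


def total_content_items(issues: Iterable[Dict]) -> int:
    """Sum the content item counts over many issue documents."""
    return sum(count_content_items(issue) for issue in issues)


# Hypothetical toy issues, not real data.
toy_issues = [
    {"id": "GDL-1900-01-01-a", "i": [{"m": {"id": "GDL-1900-01-01-a-i0001"}},
                                     {"m": {"id": "GDL-1900-01-01-a-i0002"}}]},
    {"id": "GDL-1900-01-02-a", "i": [{"m": {"id": "GDL-1900-01-02-a-i0001"}}]},
    {"id": "GDL-1900-01-03-a"},  # an issue without any listed content item
]
n_items = total_content_items(toy_issues)
print(n_items)
```

On the cluster, the same per-issue function could be applied lazily, for example with `canonical_issue_bag.map(count_content_items).sum().compute()`.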

## Verify topic assignment counts

In [26]:
bucket_name = "s3://processed-canonical-data/topics/v2.0/"

In [27]:
topic_assign_files = fixed_s3fs_glob(os.path.join(bucket_name, '*.bz2'))

In [28]:
topic_assign_files

['s3://processed-canonical-data/topics/v2.0/tm-de-all-v2.0/topic_model_topic_assignment.jsonl.bz2',
 's3://processed-canonical-data/topics/v2.0/tm-de-all-v2.0/topic_model_topic_description.jsonl.bz2',
 's3://processed-canonical-data/topics/v2.0/tm-fr-all-v2.0/topic_model_topic_assignment.jsonl.bz2',
 's3://processed-canonical-data/topics/v2.0/tm-fr-all-v2.0/topic_model_topic_description.jsonl.bz2',
 's3://processed-canonical-data/topics/v2.0/tm-lb-all-v2.0/topic_model_topic_assignment.jsonl.bz2',
 's3://processed-canonical-data/topics/v2.0/tm-lb-all-v2.0/topic_model_topic_description.jsonl.bz2']

In [29]:
topic_assign_bag = db.read_text(topic_assign_files, storage_options=IMPRESSO_STORAGEOPT).map(json.loads)

In [35]:
topic_assign_bag.count().compute()

42394681
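
This count is 300 higher than the 42,394,381 topic assignments reported on the backup. The file list above also contains the three `topic_model_topic_description` files, so one possible (unverified) explanation is that their lines were counted as well. The sketch below restricts the list to assignment files and computes the difference; the variable names are illustrative.

```python
# File list as printed above.
topic_files = [
    "s3://processed-canonical-data/topics/v2.0/tm-de-all-v2.0/topic_model_topic_assignment.jsonl.bz2",
    "s3://processed-canonical-data/topics/v2.0/tm-de-all-v2.0/topic_model_topic_description.jsonl.bz2",
    "s3://processed-canonical-data/topics/v2.0/tm-fr-all-v2.0/topic_model_topic_assignment.jsonl.bz2",
    "s3://processed-canonical-data/topics/v2.0/tm-fr-all-v2.0/topic_model_topic_description.jsonl.bz2",
    "s3://processed-canonical-data/topics/v2.0/tm-lb-all-v2.0/topic_model_topic_assignment.jsonl.bz2",
    "s3://processed-canonical-data/topics/v2.0/tm-lb-all-v2.0/topic_model_topic_description.jsonl.bz2",
]

# Keep only the files that hold topic assignments.
assignment_files = [
    path for path in topic_files
    if path.endswith("topic_model_topic_assignment.jsonl.bz2")
]

computed_count = 42394681  # count obtained in this notebook
reported_count = 42394381  # count reported on the backup (NAS)
difference = computed_count - reported_count

print(len(assignment_files), difference)
```

Re-running the count on `assignment_files` only would show whether the description files account for the gap.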

## Release resources

In [37]:
dask_cluster.close()
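
If a cell fails before this point, the cluster keeps running. A small context manager is one way to guarantee the release step; the sketch below is an assumption-laden illustration in which `managed_cluster` is a hypothetical helper and `FakeCluster` is a stand-in used only to keep the example runnable without kubernetes.

```python
from contextlib import contextmanager


@contextmanager
def managed_cluster(start):
    """Start a cluster via `start()` and always close it on exit.

    `start` is expected to return a (cluster, client) pair, like start_cluster.
    """
    cluster, client = start()
    try:
        yield cluster, client
    finally:
        cluster.close()


class FakeCluster:
    """Stand-in for a real cluster object, used only for this sketch."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


fake = FakeCluster()
try:
    with managed_cluster(lambda: (fake, "fake-client")) as (cluster, client):
        raise RuntimeError("simulated failure during a count")
except RuntimeError:
    pass

print(fake.closed)
```

With the real helper, the call would look like `with managed_cluster(lambda: start_cluster(n_workers=100, worker_memory='2G')) as (dask_cluster, dask_client): ...`.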