# Boilerplate data preparation and computation

When detecting text reuse with passim, one can start by identifying and removing boilerplate, so that superfluous data is excluded from the actual processing.

This notebook contains the code to perform various tasks relating to this:
- Preparing the input data directory, only containing data that should be part of boilerplate detection
- Light postprocessing of the boilerplate output and preparation of the actual text-reuse detection

### Imports

In [25]:
import os
import shutil
from tqdm import tqdm
import jq
import json
from typing import Any
from smart_open import open as smart_open_function
import jsonlines
import pickle

import dask.bag as db
import dask.dataframe as dd
import pandas as pd

from impresso_commons.utils.s3 import get_s3_resource, get_s3_client
from impresso_commons.path.path_s3 import _list_bucket_paginator
from impresso_commons.utils.s3 import IMPRESSO_STORAGEOPT

### Code

In [4]:
text_reuse_dir = '/scratch/piconti/impresso/text_reuse'
all_rebuilt_data_path = "rebuilt_data"
bp_rebuilt_data_path = "rebuilt_data_for_bp"

input_dir = os.path.join(text_reuse_dir, all_rebuilt_data_path)
output_dir = os.path.join(text_reuse_dir, bp_rebuilt_data_path)

In [5]:
titles_no_bp = [
    "FedGazDe", "FedGazFr", "NZZ", "handelsztg", "arbeitgeber", "ACI", "AV", "Bombe", "Cancoire", "Castigat", "Charivari", "CharivariCH", "CL", 
    "Croquis", "EM", "esta", "FAM", "FAMDE", "FAN", "FAV1", "Fronde", "GAVi", "Grelot", "Griffe", "Guepe1851", "Guepe1887", 
    "JH", "JV", "JVE", "JY2", "MB", "ME", "MESSAGER", "Moniteur", "NS", "NV", "NV1", "NV2", "OBS", "ouistiti", "pages", "PAT", "PDL", "PJ", "PS", "RLA", "TouSuIl", "VVS", "VVS1"
]

## 1. Prepare the copied Rebuilt data for Boilerplate detection with passim

Before text-reuse detection with passim can be run, one must first run the tool in boilerplate mode to identify segments of text that should be excluded from the text-reuse search.

However, not all of the corpus should undergo boilerplate detection: titles without any article-level segmentation should be left out.

This section copies the relevant data (titles that have article-level segmentation) to a new directory, where all files sit together instead of being separated by title.

In [None]:
copied, not_copied = [], []
os.makedirs(output_dir, exist_ok=True)  # make sure the flat target directory exists
for path, subdirs, files in os.walk(input_dir):
    # only leaf directories (one per title) contain the data files
    if len(subdirs) == 0:
        title = os.path.basename(path)
        if title in titles_no_bp:
            not_copied.extend(files)
            print(f"Not copying {title} files since it has no article segmentation.")
        else:
            print(f"Copying {title} files...")
            for file in tqdm(files):
                src_path = os.path.join(path, file)
                dest_path = os.path.join(output_dir, file)
                shutil.copy(src_path, dest_path)
                copied.append(file)

In [None]:
len(copied), len(not_copied), len(copied) + len(not_copied)
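
Because every file lands in a single flat directory, two source files sharing a name would silently overwrite each other (`shutil.copy` replaces an existing destination). Below is a minimal sketch of a pre-copy check, assuming the same per-title layout as `input_dir`. The helper name `find_name_collisions` is hypothetical and not part of the original notebook.

```python
import os
from collections import defaultdict


def find_name_collisions(root_dir: str, skipped_titles: list[str]) -> dict[str, list[str]]:
    """Map each filename found in more than one leaf directory to its source paths."""
    seen = defaultdict(list)
    for path, subdirs, files in os.walk(root_dir):
        # Only leaf directories (one per title) hold the data files;
        # titles that are not copied cannot cause a collision.
        if subdirs or os.path.basename(path) in skipped_titles:
            continue
        for name in files:
            seen[name].append(os.path.join(path, name))
    return {name: paths for name, paths in seen.items() if len(paths) > 1}
```

Calling `find_name_collisions(input_dir, titles_no_bp)` before the copy and getting an empty dict back would indicate that flattening loses no file.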

## 2. Upload the current boilerplate output to s3

For traceability, upload all the contents of the `out.json` directory to S3 under a boilerplate partition.

In [None]:
s3_staging_bucket = "41-processed-data-staging"
partition = "text-reuse/text-reuse_v1-0-0/boilerplate/out.json"
out_jsons_dir = os.path.join(text_reuse_dir, "passim_bp_output/out.json")

In [None]:
s3 = get_s3_resource()
bucket = s3.Bucket(s3_staging_bucket)

In [None]:
for filename in tqdm(os.listdir(out_jsons_dir)):
    if filename.endswith('json'):
        filepath = os.path.join(out_jsons_dir, filename)
        key_name = os.path.join(partition, filename)
        bucket.upload_file(filepath, key_name)
        #print("Uploaded %s to s3: %s", filepath, key_name)
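
The loop above derives each S3 key by joining the partition and the filename. The sketch below expresses that mapping as a pure function, so the planned keys can be reviewed before anything is uploaded. `plan_uploads` is a hypothetical helper, not part of the original notebook. It uses `posixpath.join` because S3 keys always use forward slashes, whereas `os.path.join` depends on the local platform.

```python
import os
import posixpath


def plan_uploads(local_dir: str, partition: str, suffix: str = "json") -> list[tuple[str, str]]:
    """Return (local filepath, S3 key) pairs for the files in local_dir ending with suffix."""
    plan = []
    for filename in sorted(os.listdir(local_dir)):
        if filename.endswith(suffix):
            filepath = os.path.join(local_dir, filename)
            key_name = posixpath.join(partition, filename)
            plan.append((filepath, key_name))
    return plan
```

Each pair could then be passed to `bucket.upload_file(filepath, key_name)`, as in the cell above.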

## 3. Create the bp.pkl file from the boilerplate output

#### First check the contents of some resulting jsons

In [None]:
out_jsons_dir = os.path.join(text_reuse_dir, "passim_bp_output/out.json")

os.listdir(out_jsons_dir)

In [None]:
def load_json(f_path: str) -> list[dict]:
    """Load a JSON Lines file (one JSON object per line) into a list of dicts."""
    lines = []
    with open(f_path, mode="r", encoding='utf-8') as f_in:
        for line in f_in:
            lines.append(json.loads(line))
    return lines

#### Actually create the bp.pkl dataframe

From looking at examples the following heuristics have been devised:
- Only entries with the field `"src"` are actually boilerplate.
  - In the format: `{"id": "BDC-1839-03-18-a-i0005_1658_1952", "src": {"id": "BDC-1839-03-16-a-i0004", "start": [...]`
- For each entry with the field `"src"`:
  - Two text passages are linked: the value of `id` and the value of `src.id`
  - The value of `id` will often have additional values appended after the usual CI id ("_1658_1952" here). These should be removed.
  - Both ids (fields `id` and `src.id`) should be included in the bp.pkl output as separate rows.
- All rows should also have a column `"is_boilerplate"` set to `True`.

The actual processing for this step of creating the bp.pkl dataframe has been moved to the `postprocess_boilerpalte.py` script.
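
The heuristics listed above can be sketched as follows. This is only an illustration and not the code of the script: the function name `boilerplate_rows` is hypothetical, and the regular expression assumes that the appended offsets are always underscore-separated digits, as in the `_1658_1952` example.

```python
import re

# Assumption: the values appended to an id are underscore-separated digits,
# as in "_1658_1952" from the example entry.
OFFSET_SUFFIX = re.compile(r"(?:_\d+)+$")


def boilerplate_rows(entries: list[dict]) -> list[dict]:
    """Turn passim boilerplate entries into rows with `id` and `is_boilerplate`."""
    rows = []
    for entry in entries:
        src = entry.get("src")
        if not src:
            # Entries without a "src" field are not boilerplate.
            continue
        # Both linked passages become separate rows; stripping the suffix
        # is a no-op for ids that do not carry one.
        for raw_id in (entry["id"], src["id"]):
            rows.append({"id": OFFSET_SUFFIX.sub("", raw_id), "is_boilerplate": True})
    return rows
```

Passing the result to `pd.DataFrame(...)` would give a dataframe with the two columns described above, for instance `pd.DataFrame(boilerplate_rows(load_json(path)))` for one output file.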

#### Upload the resulting file to S3

The upload of this dataframe is now also handled by the script.

In [None]:
pb_pkl_out_filepath = os.path.join(text_reuse_dir, "bp.pkl")
s3_staging_bucket = "41-processed-data-staging"
pb_partition = "text-reuse/text-reuse_v1-0-0/boilerplate/"

s3 = get_s3_resource()
bucket = s3.Bucket(s3_staging_bucket)

In [None]:
pb_pkl_out_filepath = os.path.join(text_reuse_dir, "bp.pkl")
bp_key_name = os.path.join(pb_partition, "bp.pkl")
bucket.upload_file(pb_pkl_out_filepath, bp_key_name)

#### Debug: Filtering the duplicates and uploading updated pkl to s3

The first iteration of the bp.pkl dataframe was created without filtering out duplicated ids.
This caused issues during the filtering, partly because of the large size of the dataframe.
The code that filters out duplicated ids has since been added to the script.

Note: calling `compute()` after filtering returns a `pd.DataFrame`, not a dask dataframe.
Both options were tried; the decision was to keep it as a dask dataframe.
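
To make the counts in the following cells concrete, here is a plain-Python illustration on toy ids (no dask involved) of what the `value_counts` filter and `drop_duplicates(subset=['id'])` compute. The helper name `duplicate_summary` is made up for this example.

```python
from collections import Counter


def duplicate_summary(ids: list[str]) -> tuple[int, int, list[str]]:
    """Return (# ids occurring once, # ids occurring more than once, deduplicated ids)."""
    counts = Counter(ids)
    not_dupl = sum(1 for n in counts.values() if n == 1)
    dupl = sum(1 for n in counts.values() if n > 1)
    # dict.fromkeys keeps the first occurrence of each id, in order,
    # which mirrors the default keep="first" of drop_duplicates in pandas.
    deduplicated = list(dict.fromkeys(ids))
    return not_dupl, dupl, deduplicated
```

Here `not_dupl + dupl` is the number of distinct ids, which is also the number of rows expected after deduplication.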

In [None]:
s3_staging_bucket = "" # todo fill in
pb_partition = "text-reuse/text-reuse_v1-0-0/boilerplate/"
bp_dd_filename = '' # todo fill in with desired partition subpath
dd_s3_path = os.path.join("s3://", s3_staging_bucket, pb_partition, bp_dd_filename)
dd_s3_path

In [None]:
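# loading the full df: read the pickle with pandas, then convert it to a dask dataframe
# (a pandas DataFrame has no `repartition` method)
bp_df = dd.from_pandas(
    pd.read_pickle(dd_s3_path, storage_options=IMPRESSO_STORAGEOPT), npartitions=2082
).drop(columns=["is_boilerplate"])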

In [None]:
bp_df_filter = bp_df['id'].value_counts().map(lambda x: x > 1)

In [None]:
bp_full_count = bp_df.count().compute()
not_dupl = bp_df_filter[~bp_df_filter].count().compute()
dupl = bp_df_filter[bp_df_filter].count().compute()

bp_full_count, not_dupl, dupl

In [None]:
filtered_dup_bp = bp_df.drop_duplicates(subset=['id']).compute()

In [None]:
filtered_dup_bp.head(), filtered_dup_bp.count()

In [None]:
text_reuse_dir = '/scratch/piconti/impresso/text_reuse'
bp_out_filename = None # TODO fill in
filtered_bp_pkl_out_filepath = os.path.join(text_reuse_dir, bp_out_filename)
filtered_bp_pkl_out_filepath

In [None]:
with open(filtered_bp_pkl_out_filepath, 'wb') as handle:
    pickle.dump(filtered_dup_bp, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
s3_staging_bucket = "41-processed-data-staging"
pb_partition = "text-reuse/text-reuse_v1-0-0/boilerplate/"

s3 = get_s3_resource()
bucket = s3.Bucket(s3_staging_bucket)

In [None]:
bp_key_name = os.path.join(pb_partition, bp_out_filename)
bucket.upload_file(filtered_bp_pkl_out_filepath, bp_key_name)

#### Sanity checking / loading existing bp.pkl file to check for specific ids

In [None]:
s3_staging_bucket = "" # todo fill in
pb_partition = "text-reuse/text-reuse_v1-0-0/boilerplate/"
bp_dd_filename = '' # todo fill in with desired partition subpath
dd_s3_path = os.path.join("s3://", s3_staging_bucket, pb_partition, bp_dd_filename)
dd_s3_path

In [None]:
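# loading the full df: read the pickle with pandas, then convert it to a dask dataframe
# (a pandas DataFrame has no `repartition` method)
bp_df = dd.from_pandas(
    pd.read_pickle(dd_s3_path, storage_options=IMPRESSO_STORAGEOPT), npartitions=2082
).drop(columns=["is_boilerplate"])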

In [None]:
bp_df[bp_df.id.str.contains('legaulois-1924-07-27-a')].head(100, npartitions=-1)

## 4. Prepare the copied rebuilt data (without boilerplate) for actual text-reuse detection with passim

As in step 1, the data should not be separated by title for the passim text-reuse detection.
As a result, the data downloaded from S3 that has not yet been copied to a flat directory (that is, all titles that did not undergo boilerplate detection) should be copied there as well.

In particular, inside the `/scratch/piconti/impresso/text_reuse` directory:
- `rebuilt_data` is the target directory where all jsonl files should be, both the ones resulting from boilerplate filtering and the ones to copy.
- `rebuilt_data_to_mv` currently contains most of the data (organized per title) which did NOT undergo boilerplate detection/filtering. It should be copied/flattened into `rebuilt_data` along with the missing data.

In [10]:
text_reuse_dir = '/scratch/piconti/impresso/text_reuse'
all_rebuilt_data_path = "rebuilt_data"
separated_rebuilt_data_path = "rebuilt_data_to_mv"

output_dir = os.path.join(text_reuse_dir, all_rebuilt_data_path)
input_dir = os.path.join(text_reuse_dir, separated_rebuilt_data_path)

In [12]:
titles_to_copy = os.listdir(input_dir)
titles_to_upload = []
copied = []

for title in titles_no_bp:
    print(f"--------------")
    if title not in titles_to_copy:
        titles_to_upload.append(title)
        print(f"Missing {title}! Will be uploaded from S3")
    else:
        print(f"Copying {title} from {input_dir} to {output_dir}")
        full_title_dir = os.path.join(input_dir, title)
        for file in tqdm(os.listdir(full_title_dir)):
            src_path = os.path.join(full_title_dir, file)
            dest_path = os.path.join(output_dir, file)
            shutil.copy(src_path, dest_path)
            copied.append(file)

--------------
Copying FedGazDe from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 151/151 [00:14<00:00, 10.47it/s]


--------------
Copying FedGazFr from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 151/151 [00:15<00:00,  9.99it/s]


--------------
Copying NZZ from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 171/171 [01:01<00:00,  2.79it/s]


--------------
Missing handelsztg! Will be uploaded from S3
--------------
Missing arbeitgeber! Will be uploaded from S3
--------------
Copying ACI from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 1/1 [00:00<00:00, 39.08it/s]


--------------
Copying AV from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 3/3 [00:00<00:00, 103.34it/s]


--------------
Missing Bombe! Will be uploaded from S3
--------------
Copying Cancoire from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 1/1 [00:00<00:00, 83.37it/s]


--------------
Copying Castigat from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 2/2 [00:00<00:00, 164.30it/s]


--------------
Copying Charivari from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 3/3 [00:00<00:00, 89.62it/s]


--------------
Copying CharivariCH from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 3/3 [00:00<00:00, 295.61it/s]


--------------
Copying CL from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 5/5 [00:00<00:00, 100.62it/s]


--------------
Copying Croquis from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 4/4 [00:00<00:00, 1051.86it/s]


--------------
Copying EM from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 5/5 [00:00<00:00, 82.08it/s]


--------------
Copying esta from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 34/34 [00:02<00:00, 14.07it/s]


--------------
Copying FAM from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 1/1 [00:00<00:00, 72.64it/s]


--------------
Copying FAMDE from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 1/1 [00:00<00:00, 60.08it/s]


--------------
Copying FAN from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 15/15 [00:00<00:00, 111.47it/s]


--------------
Copying FAV1 from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 15/15 [00:00<00:00, 244.97it/s]


--------------
Copying Fronde from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 1/1 [00:00<00:00, 74.60it/s]


--------------
Copying GAVi from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 2/2 [00:00<00:00, 106.71it/s]


--------------
Copying Grelot from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 4/4 [00:00<00:00, 246.66it/s]


--------------
Copying Griffe from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 3/3 [00:00<00:00, 95.59it/s]


--------------
Copying Guepe1851 from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 4/4 [00:00<00:00, 208.22it/s]


--------------
Copying Guepe1887 from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 3/3 [00:00<00:00, 201.19it/s]


--------------
Copying JH from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 45/45 [00:00<00:00, 71.42it/s]


--------------
Copying JV from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 18/18 [00:00<00:00, 31.28it/s]


--------------
Copying JVE from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 4/4 [00:00<00:00, 39.46it/s]


--------------
Copying JY2 from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 2/2 [00:00<00:00, 82.26it/s]


--------------
Copying MB from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 52/52 [00:00<00:00, 303.44it/s]


--------------
Copying ME from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 16/16 [00:00<00:00, 101.59it/s]


--------------
Copying MESSAGER from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 3/3 [00:00<00:00, 136.68it/s]


--------------
Copying Moniteur from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 2/2 [00:00<00:00, 624.01it/s]


--------------
Copying NS from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 22/22 [00:00<00:00, 136.62it/s]


--------------
Copying NV from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 2/2 [00:00<00:00, 72.60it/s]


--------------
Copying NV1 from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 32/32 [00:01<00:00, 24.99it/s]


--------------
Copying NV2 from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 36/36 [00:18<00:00,  1.91it/s]


--------------
Copying OBS from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 1/1 [00:00<00:00,  1.49it/s]


--------------
Copying ouistiti from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 1/1 [00:00<00:00,  8.85it/s]


--------------
Missing pages! Will be uploaded from S3
--------------
Copying PAT from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 4/4 [00:00<00:00, 69.46it/s]


--------------
Copying PDL from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 2/2 [00:00<00:00,  4.06it/s]


--------------
Copying PJ from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 2/2 [00:00<00:00, 56.79it/s]


--------------
Copying PS from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 2/2 [00:00<00:00, 548.35it/s]


--------------
Copying RLA from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 8/8 [00:01<00:00,  4.49it/s]


--------------
Copying TouSuIl from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 17/17 [00:09<00:00,  1.75it/s]


--------------
Copying VVS from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 4/4 [00:00<00:00, 15.13it/s]


--------------
Copying VVS1 from /scratch/piconti/impresso/text_reuse/rebuilt_data_to_mv to /scratch/piconti/impresso/text_reuse/rebuilt_data


100%|██████████| 2/2 [00:00<00:00,  3.20it/s]


In [14]:
# sanity check that all files that need to be there are indeed.
target = os.listdir(output_dir)
missing = [f for path, dir, files in os.walk(input_dir) for f in files if f not in target and f.split('-')[0] in titles_no_bp]
missing

[]

The titles that are still missing are downloaded from S3.

In [29]:
s3_input_bucket = "31-passim-rebuilt-staging"
s3_input_partition = "passim/"

In [32]:
# define the accept condition for a key: the title is in titles_to_upload
title_to_upload_key = lambda k: k.split('/')[-1].split('-')[0] in titles_to_upload

keys = _list_bucket_paginator(s3_input_bucket, prefix=s3_input_partition, accept_key=title_to_upload_key)
keys

['passim/Bombe/Bombe-1889.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1907.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1908.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1909.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1910.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1911.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1912.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1913.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1914.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1915.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1916.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1917.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1918.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1919.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1924.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1925.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1926.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1927.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1928.jsonl.bz2',
 'passim/arbeitgeber/arbeitgeber-1929.jsonl.bz2',
 'passim/arb
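
The `accept_key` predicate relies on the filename convention `<title>-<year>.jsonl.bz2` visible in the listing above. Below is a sketch of the same test written as named functions, which makes it easier to check on a single key. `title_of_key` and `make_title_filter` are hypothetical names, not part of `impresso_commons`.

```python
import posixpath


def title_of_key(key: str) -> str:
    """Extract the title from a key such as 'passim/Bombe/Bombe-1889.jsonl.bz2'."""
    return posixpath.basename(key).split("-")[0]


def make_title_filter(titles: list[str]):
    """Build a predicate that accepts only keys whose title is in `titles`."""
    wanted = set(titles)
    return lambda key: title_of_key(key) in wanted
```

The predicate returned by `make_title_filter(titles_to_upload)` could be passed as `accept_key` in place of the lambda.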

In [36]:
for key in keys:
    key_filename = os.path.split(key)[-1]
    local_target_path = os.path.join(output_dir, key_filename)
    print(f"downloading {key} to {local_target_path}")
    get_s3_client().download_file(s3_input_bucket, key, local_target_path)

downloading passim/Bombe/Bombe-1889.jsonl.bz2 to /scratch/piconti/impresso/text_reuse/rebuilt_data/Bombe-1889.jsonl.bz2
downloading passim/arbeitgeber/arbeitgeber-1907.jsonl.bz2 to /scratch/piconti/impresso/text_reuse/rebuilt_data/arbeitgeber-1907.jsonl.bz2
downloading passim/arbeitgeber/arbeitgeber-1908.jsonl.bz2 to /scratch/piconti/impresso/text_reuse/rebuilt_data/arbeitgeber-1908.jsonl.bz2
downloading passim/arbeitgeber/arbeitgeber-1909.jsonl.bz2 to /scratch/piconti/impresso/text_reuse/rebuilt_data/arbeitgeber-1909.jsonl.bz2
downloading passim/arbeitgeber/arbeitgeber-1910.jsonl.bz2 to /scratch/piconti/impresso/text_reuse/rebuilt_data/arbeitgeber-1910.jsonl.bz2
downloading passim/arbeitgeber/arbeitgeber-1911.jsonl.bz2 to /scratch/piconti/impresso/text_reuse/rebuilt_data/arbeitgeber-1911.jsonl.bz2
downloading passim/arbeitgeber/arbeitgeber-1912.jsonl.bz2 to /scratch/piconti/impresso/text_reuse/rebuilt_data/arbeitgeber-1912.jsonl.bz2
downloading passim/arbeitgeber/arbeitgeber-1913.json

In [41]:
# sanity check that all files that need to be there are indeed.
target = os.listdir(output_dir)
missing = [t for t in titles_no_bp if not any([t in f for f in target])]
missing

['pages']
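
Note that `t in f` is a substring test, so a title such as `FAM` is also matched by `FAMDE-...` files, and a genuinely missing title could go unnoticed. A stricter variant is sketched below, under the assumption that every filename starts with `<title>-`. `missing_titles` is a hypothetical helper.

```python
def missing_titles(titles: list[str], filenames: list[str]) -> list[str]:
    """Return the titles for which no file named '<title>-...' is present."""
    present = {name.split("-")[0] for name in filenames}
    return [title for title in titles if title not in present]
```

It could be called as `missing_titles(titles_no_bp, os.listdir(output_dir))`.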