# Boilerplate data preparation and computation

When detecting text reuse with passim, one can start by identifying and removing boilerplate, so that superfluous data is excluded from the processing.

This notebook contains the code to perform various tasks relating to this:
- Preparing the input data directory, which should only contain the data that is to be part of boilerplate detection
- Light postprocessing of the boilerplate output and preparation of the actual text-reuse detection

### Imports

In [None]:
import os
import shutil
from tqdm import tqdm
import jq
import json
from typing import Any
from smart_open import open as smart_open_function
import jsonlines
import pickle

import dask.bag as db
import dask.dataframe as dd
import pandas as pd

from impresso_commons.utils.s3 import get_s3_resource
from impresso_commons.utils.s3 import IMPRESSO_STORAGEOPT

### Code

In [None]:
text_reuse_dir = '/scratch/piconti/impresso/text_reuse'
all_rebuilt_data_path = "rebuilt_data"
bp_rebuilt_data_path = "rebuilt_data_for_bp"

input_dir = os.path.join(text_reuse_dir, all_rebuilt_data_path)
output_dir = os.path.join(text_reuse_dir, bp_rebuilt_data_path)

## 1. Prepare the copied Rebuilt data for Boilerplate detection with passim

Before text-reuse detection with passim can be run, one must first run the tool in boilerplate mode to identify segments of text that should be excluded from the text-reuse search.

However, not all of the corpus should be subjected to boilerplate detection: titles without any article-level segmentation should be excluded.

This part of the notebook copies the relevant data (titles that have article-level segmentation) to a new directory, where all files are stored together rather than separated by title.

In [None]:
titles_no_bp = [
    "FedGazDe", "FedGazFr", "NZZ", "handelsztg", "arbeitgeber", "ACI", "AV", "Bombe", "Cancoire", "Castigat", "Charivari", "CharivariCH", "CL", 
    "Croquis", "EM", "esta", "FAM", "FAMDE", "FAN", "FAV1", "Fronde", "GAVi", "Grelot", "Griffe", "Guepe1851", "Guepe1887", 
    "JH", "JV", "JVE", "JY2", "MB", "ME", "MESSAGER", "Moniteur", "NS", "NV", "NV1", "NV2", "OBS", "ouistiti", "pages", "PAT", "PDL", "PJ", "PS", "RLA", "TouSuIl", "VVS", "VVS1"
]

In [None]:
copied, not_copied = [], []
# Make sure the destination directory exists before copying into it.
os.makedirs(output_dir, exist_ok=True)
for path, dirs, files in os.walk(input_dir):
    # Only leaf directories (no sub-directories) hold the rebuilt files.
    if len(dirs) == 0:
        title = os.path.basename(path)
        if title in titles_no_bp:
            not_copied.extend(files)
            print(f"Not copying {title} files since it has no article segmentation.")
        else:
            print(f"Copying {title} files...")
            for file in tqdm(files):
                src_path = os.path.join(path, file)
                dest_path = os.path.join(output_dir, file)
                shutil.copy(src_path, dest_path)
                copied.append(file)

In [None]:
len(copied), len(not_copied), len(copied) + len(not_copied)

## 2. Upload the current boilerplate output to S3

For traceability, upload all the contents of the `out.json` directory to S3 under a boilerplate partition.

In [None]:
s3_staging_bucket = "41-processed-data-staging"
partition = "text-reuse/text-reuse_v1-0-0/boilerplate/out.json"
out_jsons_dir = os.path.join(text_reuse_dir, "passim_bp_output/out.json")

In [None]:
s3 = get_s3_resource()
bucket = s3.Bucket(s3_staging_bucket)

In [None]:
for filename in tqdm(os.listdir(out_jsons_dir)):
    if filename.endswith('json'):
        filepath = os.path.join(out_jsons_dir, filename)
        key_name = os.path.join(partition, filename)
        bucket.upload_file(filepath, key_name)
        #print("Uploaded %s to s3: %s", filepath, key_name)

## 3. Create the bp.pkl file from the boilerplate output

#### First check the contents of some of the resulting JSON files

In [None]:
out_jsons_dir = os.path.join(text_reuse_dir, "passim_bp_output/out.json")

os.listdir(out_jsons_dir)

In [None]:
def load_json(f_path: str) -> list[dict]:
    """Read a JSON Lines file and return one dict per line."""
    lines = []
    with open(f_path, mode="r", encoding='utf-8') as f_in:
        for line in f_in:
            lines.append(json.loads(line))
    return lines
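
A quick way to get a feel for the output is to count how many records carry a `src` field. The sketch below is only illustrative: `summarize_records` is a hypothetical helper, and the sample lines are made up, modelled on the record format seen in the output rather than taken from an actual file.

```python
import json
from collections import Counter


def summarize_records(lines):
    """Count JSON Lines records with and without a "src" field."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts["with_src" if "src" in record else "without_src"] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_lines = [
    '{"id": "BDC-1839-03-18-a-i0005_1658_1952", "src": {"id": "BDC-1839-03-16-a-i0004"}}',
    '{"id": "BDC-1839-03-16-a-i0004_0_300"}',
    "",
]
summary = summarize_records(sample_lines)
```

The same function accepts an open file handle, since iterating over a text file yields its lines.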

#### Actually create the bp.pkl dataframe

From looking at examples, the following heuristics have been devised:
- Only entries with the field `"src"` are actually boilerplate.
  - In the format: `{"id": "BDC-1839-03-18-a-i0005_1658_1952", "src": {"id": "BDC-1839-03-16-a-i0004", "start": [...]`
- For each entry with the field `"src"`:
  - Two text passages are linked: the value of `id` and the value of `src.id`
  - The value of `id` will often have additional values appended after the usual CI id ("_1658_1952" here). These should be removed.
  - Both ids (fields `id` and `src.id`) should be included in the bp.pkl output as separate rows
- All rows should also have a column `"is_boilerplate"` set to `True`.

The actual processing for this step of creating the bp.pkl dataframe has been moved to the `postprocess_boilerplate.py` script.
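
The heuristics above can be sketched roughly as follows. This is not the code of the script, only a minimal illustration: `strip_offsets` and `boilerplate_rows` are hypothetical names, and the regular expression assumes that a CI id ends in `-i` followed by digits, with any appended values being `_<number>` groups.

```python
import re

# Assumption: a CI id ends with "-i<digits>"; anything of the form
# "_<digits>" after that is an appended value to be removed.
_SUFFIX_RE = re.compile(r"^(.*-i\d+)(?:_\d+)+$")


def strip_offsets(passage_id: str) -> str:
    """Return the id without trailing "_<number>" groups, if any."""
    match = _SUFFIX_RE.match(passage_id)
    return match.group(1) if match else passage_id


def boilerplate_rows(records):
    """Turn boilerplate output records into bp.pkl-style rows."""
    rows = []
    for record in records:
        src = record.get("src")
        if not src:
            continue  # entries without "src" are not boilerplate
        # Both linked passages become separate rows.
        for raw_id in (record["id"], src["id"]):
            rows.append({"id": strip_offsets(raw_id), "is_boilerplate": True})
    return rows


sample = [
    {"id": "BDC-1839-03-18-a-i0005_1658_1952", "src": {"id": "BDC-1839-03-16-a-i0004"}},
    {"id": "BDC-1839-03-20-a-i0001_0_120"},  # made-up entry without "src"
]
rows = boilerplate_rows(sample)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)` to obtain a dataframe with the columns `id` and `is_boilerplate`.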

#### Upload the resulting file to S3

The upload of this dataframe has also been moved to the script.

In [None]:
pb_pkl_out_filepath = os.path.join(text_reuse_dir, "bp.pkl")
s3_staging_bucket = "41-processed-data-staging"
pb_partition = "text-reuse/text-reuse_v1-0-0/boilerplate/"

s3 = get_s3_resource()
bucket = s3.Bucket(s3_staging_bucket)

In [None]:
pb_pkl_out_filepath = os.path.join(text_reuse_dir, "bp.pkl")
bp_key_name = os.path.join(pb_partition, "bp.pkl")
bucket.upload_file(pb_pkl_out_filepath, bp_key_name)

#### Debug: Filtering the duplicates and uploading updated pkl to s3

The first iteration of the bp.pkl dataframe was created without filtering out the duplicated ids.
This caused some issues during the filtering, partly because of the large size of the dataframe.
The code filtering out duplicated ids has now also been added to the script.

Note: Calling `compute()` after filtering returns a `pd.DataFrame` rather than a dask dataframe.
Both options were tried to see which works best; the decision was to keep it as a dask dataframe.

In [None]:
s3_staging_bucket = "" # todo fill in
pb_partition = "text-reuse/text-reuse_v1-0-0/boilerplate/"
bp_dd_filename = '' # todo fill in with desired partition subpath
dd_s3_path = os.path.join("s3://", s3_staging_bucket, pb_partition, bp_dd_filename)
dd_s3_path

In [None]:
# loading the full df; the .repartition() call below assumes the pickled object is a dask dataframe
bp_df = pd.read_pickle(dd_s3_path, storage_options=IMPRESSO_STORAGEOPT).repartition(npartitions=2082).drop(columns=["is_boilerplate"])

In [None]:
bp_df_filter = bp_df['id'].value_counts().map(lambda x: x > 1)

In [None]:
bp_full_count = bp_df.count().compute()
not_dupl = bp_df_filter[~bp_df_filter].count().compute()
dupl = bp_df_filter[bp_df_filter].count().compute()

bp_full_count, not_dupl, dupl

In [None]:
filtered_dup_bp = bp_df.drop_duplicates(subset=['id']).compute()

In [None]:
filtered_dup_bp.head(), filtered_dup_bp.count()

In [None]:
text_reuse_dir = '/scratch/piconti/impresso/text_reuse'
bp_out_filename = None # TODO fill in
filtered_bp_pkl_out_filepath = os.path.join(text_reuse_dir, bp_out_filename)
filtered_bp_pkl_out_filepath

In [None]:
with open(filtered_bp_pkl_out_filepath, 'wb') as handle:
    pickle.dump(filtered_dup_bp, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
s3_staging_bucket = "41-processed-data-staging"
pb_partition = "text-reuse/text-reuse_v1-0-0/boilerplate/"

s3 = get_s3_resource()
bucket = s3.Bucket(s3_staging_bucket)

In [None]:
bp_key_name = os.path.join(pb_partition, bp_out_filename)
bucket.upload_file(filtered_bp_pkl_out_filepath, bp_key_name)

#### Sanity checking / loading existing bp.pkl file to check for specific ids

In [None]:
s3_staging_bucket = "" # todo fill in
pb_partition = "text-reuse/text-reuse_v1-0-0/boilerplate/"
bp_dd_filename = '' # todo fill in with desired partition subpath
dd_s3_path = os.path.join("s3://", s3_staging_bucket, pb_partition, bp_dd_filename)
dd_s3_path

In [None]:
# loading the full df; the .repartition() call below assumes the pickled object is a dask dataframe
bp_df = pd.read_pickle(dd_s3_path, storage_options=IMPRESSO_STORAGEOPT).repartition(npartitions=2082).drop(columns=["is_boilerplate"])

In [None]:
bp_df[bp_df.id.str.contains('legaulois-1924-07-27-a')].head(100, npartitions=-1)