**Summary**

In this notebook we explain how to load `passim`'s output into a `pandas` `DataFrame`, a data structure that comes in very handy for analyzing and plotting data.

Please note that if the **size of your data** grows considerably (e.g. the number of detected text reuse clusters goes from thousands to millions), we recommend using [`dask` DataFrames](https://docs.dask.org/en/latest/dataframe.html), which can, for example, distribute computation and memory usage over a cluster of several machines.

## Imports

In [None]:
import os
import pandas as pd

In [None]:
# display the installed pandas version
pd.__version__

## Configuration

**NB**: If you specified a different output folder when running `passim`, you'll need to change the variables below.

In [None]:
impresso_passim_output_path = os.path.join('.', 'impresso/passim-output/out.json/')

In [None]:
# TODO: add data to GH; for now it will break
eebo_passim_output_path = os.path.join('.', 'eebo/passim-output/out.json/')

## Utility functions

Some very general functions that should be usable with any `passim` JSON output.

In [None]:
def read_passim_json(output_dir: str) -> pd.DataFrame:
    """
    A simple function that reads passim's JSON output
    into a pandas DataFrame.
    """

    # detect all JSON files filtering on file extension
    files = [
        os.path.join(output_dir, f)
        for f in os.listdir(output_dir)
        if f.endswith('.json')
    ]
    
    print(f'{len(files)} files detected in folder {output_dir}')
    
    # read each JSON file into a temporary pandas dataframe
    # thus creating a list of dataframes
    dfs = [
        pd.read_json(file, lines=True)
        for file in files
    ]
    
    # concatenate all temporary dataframes into a single one
    # and set the column `uid` as the dataframe index
    df = pd.concat(dfs).set_index('uid')
    
    n_clusters = df.cluster.nunique()
    n_passages = df.shape[0]
    print(f'Passim data contain {n_passages} text reuse passages, grouped into {n_clusters} clusters')
    return df

def passages2clusters(passages_df: pd.DataFrame) -> pd.DataFrame:
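    """
    Aggregate passim's output at the cluster level: the result has
    one row per cluster, with the number of passages it contains
    stored in the column `cluster_size`.
    """
    # count the passages in each cluster, then give the column a clearer name
    clusters_df = passages_df.groupby('cluster').agg({'size': 'count'})
    clusters_df = clusters_df.rename(columns={'size': 'cluster_size'})
    return clusters_df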
    

## Read in `passim`'s output for *impresso*

First things first: we need to read in `passim`'s JSON output. Each JSON document represents a *text reuse passage* (not a cluster!), and the data is split over several smallish JSON files contained in `passim`'s output directory.
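
To make this layout concrete, here is a minimal sketch on toy data. It assumes a line-delimited layout (one JSON document per line), which is what `read_passim_json` expects when it passes `lines=True`. The file names (`part-00000.json`, ...) and the `text` values are made up for illustration; only the `uid`, `cluster` and `size` fields mirror the ones used in this notebook.

```python
import json
import os
import tempfile

import pandas as pd

# Toy passages (hypothetical values): three passages belonging to two clusters
toy_passages = [
    {"uid": "p1", "cluster": 1, "size": 2, "text": "the quick brown fox"},
    {"uid": "p2", "cluster": 1, "size": 2, "text": "the quick brown fox"},
    {"uid": "p3", "cluster": 2, "size": 1, "text": "a lazy dog"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # write the passages to two files, one JSON document per line
    for i, chunk in enumerate([toy_passages[:2], toy_passages[2:]]):
        path = os.path.join(tmp_dir, f"part-{i:05d}.json")  # hypothetical name
        with open(path, "w", encoding="utf-8") as f:
            for record in chunk:
                f.write(json.dumps(record) + "\n")

    # read each file into its own DataFrame, as read_passim_json does
    parts = [
        pd.read_json(os.path.join(tmp_dir, name), lines=True)
        for name in sorted(os.listdir(tmp_dir))
    ]

# one row per passage, indexed by passage ID
toy_df = pd.concat(parts).set_index("uid")
print(toy_df)
```

Each row of `toy_df` is a passage, so the number of rows is the number of passages, while the number of clusters is the number of distinct values in the `cluster` column.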

In [None]:
impresso_tr_passages = read_passim_json(impresso_passim_output_path)

In [None]:
# eebo_tr_passages = read_passim_json(eebo_passim_output_path)

## Reshaping data: from passages to clusters

Since `passim`'s JSON represents text reuse passages, in order to do some analysis on text reuse clusters we need to **reshape** our data. This is done by calling the `passages2clusters` function which will transform a DataFrame of passages into a DataFrame of clusters. 

Here, for the sake of simplicity, each cluster has two bits of information (columns):
1. a cluster ID (the one assigned by `passim`)
2. its size, namely the number of similar/repeated passages that a cluster contains. 
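
As a rough illustration of this reshaping, the sketch below applies the same group-and-count step to a hand-made DataFrame. The passage IDs and cluster IDs are invented, and the `size` column is only assumed to be present in every passage record, as it is in the data used here.

```python
import pandas as pd

# Toy passages (hypothetical values): five passages in two clusters
toy_passages = pd.DataFrame(
    {
        "uid": ["a", "b", "c", "d", "e"],
        "cluster": [10, 10, 10, 20, 20],
        "size": [3, 3, 3, 2, 2],
    }
).set_index("uid")

# group the passages by cluster ID and count the rows in each group
toy_clusters = (
    toy_passages.groupby("cluster")
    .agg({"size": "count"})
    .rename(columns={"size": "cluster_size"})
)
print(toy_clusters)
```

The result is indexed by cluster ID and has a single `cluster_size` column, so five passage rows collapse into two cluster rows.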

In [None]:
impresso_tr_clusters = passages2clusters(impresso_tr_passages)

In [None]:
impresso_tr_clusters.shape[0]

## Plotting the distribution of cluster sizes

In [None]:
impresso_tr_clusters.cluster_size.value_counts()

In [None]:
impresso_tr_clusters.cluster_size.describe()

In [None]:
impresso_tr_clusters.quantile(.9)
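
The calls above can be tried out on a small, made-up series of cluster sizes. The sketch below is only illustrative: the values are invented, and `sort_index()` is added so that the counts are ordered by cluster size rather than by frequency, which tends to be easier to read as a distribution.

```python
import pandas as pd

# Hypothetical cluster sizes: many small clusters and one large one
toy_sizes = pd.Series([2, 2, 2, 2, 3, 3, 5, 12], name="cluster_size")

# how many clusters there are for each size, ordered by size
size_counts = toy_sizes.value_counts().sort_index()
print(size_counts)

# the size below which 90% of the clusters fall (linear interpolation)
q90 = toy_sizes.quantile(0.9)
print(q90)
```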

In [None]:
%matplotlib inline
ax = impresso_tr_clusters.cluster_size.value_counts().sort_index().plot(
    kind='bar',
    log=False,
    grid=True,
    figsize=(10, 8),
    xlabel='Cluster size',
    ylabel='Frequency',
    title='Distribution of text reuse cluster sizes'
)

In [None]:
%matplotlib inline
ax = impresso_tr_clusters.cluster_size.value_counts().sort_index().plot(
    kind='bar',
    log=True,
    grid=True,
    figsize=(10, 8),
    xlabel='Cluster size',
    ylabel='Frequency',
    title='Distribution of text reuse cluster sizes (plotted on a logarithmic scale)'
)