# Generating Fake Books for Training

It is difficult to infer anthology contents and other whole/part relationships at large scale, so this notebook generates fake books that mimic those relationships for use as training data.

Specifically, it generates:

- Input: A multi-volume work; Output: A fake single volume of the work
- Input: A long single-volume work; Output: Multiple fake volumes for parts of the work
- Input: Multiple works by the same author; Output: A 'works'-style anthology

In [2]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
from compare_tools import fakebook
from htrc_features import Volume
meta = pd.read_csv('../../sampling/test_dataset.csv.gz', low_memory=False)
ground_truth = []
meta.page_count.quantile([.1,.2,.3,.4,.5,.6,.7,.75, .8, .85, .9, .95])

0.10     56.0
0.20    124.0
0.30    192.0
0.40    248.0
0.50    302.0
0.60    352.0
0.70    410.0
0.75    444.0
0.80    488.0
0.85    538.0
0.90    612.0
0.95    742.0
Name: page_count, dtype: float64

## Combining Books

Anthology criteria: multiple works by one author that are different from each other and relatively short.

Combined-volume criteria: multiple works that look to be parts of a sequential set.

### Anthologies

Choose one book and patch it with other books where the title seems notably different.
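
The notebook does not show how `fakebook.anthology_sample` decides that titles are "notably different". As a rough illustration only, the sketch below uses a string-similarity ratio from the standard library; the helper names and the 0.5 threshold are hypothetical and are not taken from `compare_tools`.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two lower-cased titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_distinct_titles(titles, max_similarity=0.5):
    """Greedily keep titles that differ from every title kept so far.

    Hypothetical helper for illustration; the 0.5 threshold is arbitrary.
    """
    kept = []
    for title in titles:
        if all(title_similarity(title, k) <= max_similarity for k in kept):
            kept.append(title)
    return kept


titles = [
    "Peter Pan",
    "Peter Pan.",           # near-duplicate of the first title
    "The little minister",
    "Peter Pan and Wendy",  # shares most of its text with the first title
    "Margaret Ogilvy",
]
distinct = pick_distinct_titles(titles)
print(distinct)
```

Near-duplicate titles are dropped, so the titles that remain are more likely to be separate works than reprints of the same one.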

In [3]:
pool = meta[meta.page_count < meta.page_count.quantile(.4)]
pool = pool.drop_duplicates(['author', 'title'])
acounts = pool.groupby('author').title.count()
pool = pool[pool.author.isin(acounts[acounts > 2].index)]
pool.shape

(39124, 27)
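
To make the filter above easier to follow, here is the same sequence of steps on a small made-up table. The data and the 0.9 quantile are illustrative assumptions; only the column names `author`, `title`, and `page_count` come from the notebook.

```python
import pandas as pd

# Toy metadata: author A has three distinct short titles (one duplicated),
# author B has two, and author C has a single very long book.
toy = pd.DataFrame({
    'author': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'title': ['t1', 't1', 't2', 't3', 't4', 't5', 't6'],
    'page_count': [50, 52, 60, 70, 40, 45, 900],
})

# 1. Keep only relatively short books (below a page-count quantile).
toy_pool = toy[toy.page_count < toy.page_count.quantile(.9)]
# 2. Keep one row per author/title pair.
toy_pool = toy_pool.drop_duplicates(['author', 'title'])
# 3. Keep authors with more than two distinct titles left.
title_counts = toy_pool.groupby('author').title.count()
toy_pool = toy_pool[toy_pool.author.isin(title_counts[title_counts > 2].index)]
print(toy_pool)
```

Only author A survives: C is removed by the length cut-off, and B has too few distinct titles to build an anthology from.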

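For each author, sample their books, combine the sample into a fake anthology, and record the resulting whole/part relationships as ground truth.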

In [None]:
i = 0
for groupname, subset in pool.groupby('author'):
    to_combine = fakebook.anthology_sample(subset)
    if len(to_combine) > 1:
        try:
            volmeta, tl = fakebook.combine_books(to_combine)
        except KeyboardInterrupt:
            raise
        except Exception:  # skip author groups whose books fail to combine
            continue
        fakebook.save_fake_vol(volmeta, tl, '/data/saddl/fakebooks/')

        for source_htid in volmeta['source_htids']:
            ground_truth.append(dict(left=volmeta['id'], right=source_htid, judgment='CONTAINS', notes='fake anthology'))
            ground_truth.append(dict(left=source_htid, right=volmeta['id'], judgment='PARTOF', notes='fake anthology'))
    i += 1
    if i % 100 == 0:
        print(i, groupname)

10 Acker, Geraldine.
20 Adams, R. L. 1883-1957,
30 Agard, Walter Raymond, 1894-
40 Aijazuddin, F. S.
50 Alden, W. L. 1837-1908.
60 Alford, Henry, 1810-1871.
70 Allen, Grant, 1848-1899
80 Alvarez Quintero, Serafín, 1871-1938.
90 American Crafts Council. Museum of Contemporary Crafts.
100 American Social Hygiene Association.
110 Andrews, William Loring, 1837-1920.
120 Arizona
130 Arnot, Robert Page, 1890-1986
140 Ashby, Thomas, 1874-1931.
150 Aston, George, Sir, 1861-1938.
160 Auslander, Joseph, 1897-1965.
170 Aśvaghoṣa.
180 Bailey, H. W. 1899-1996.
190 Baker, M. N. 1864-
200 Balfour, Arthur James Balfour, Earl of, 1848-1930.
210 Banda, H. Kamuzu d. 1997
220 Bantock, G. H. 1914-
230 Barnard, James Lynn, 1867-
240 Barrie, J. M. 1860-1937
250 Bascom, William Russell, 1912-1981.
260 Bayley, F. W. N. 1808-1853.
270 Becker, Ernest.
280 Behan, Brendan.
290 Bell, Robert R., 1924-
300 Bengough, J. W. 1851-1923.
310 Bentley, M. R.,
320 Berrill, N. J. 1903-1996.
330 Betts, Craven Langstroth, 1853-

### Multi-volume sets

In [None]:
pool = meta[meta.page_count < meta.page_count.quantile(.6)]
pool = pool[~pool.description.isna()]
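# NOTE: clean_description is not defined in this notebook; it must be defined or imported before this cell runs
pool = pool[clean_description(pool.description).str.contains(r'^v\.\d\d?$')]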
# Filter to author/title pairs that have more than one volume
pool = pool.groupby(['author', 'title']).filter(lambda x: x.description.unique().shape[0] > 1)
pool = pool.copy()
pool['descint'] = clean_description(pool.description).str.replace('v.', '', regex=False).astype(int)
# Filter further, to author/title pairs that have consecutively numbered volumes
def has_consecutive_v(x):
    sorted_v_ints = x.descint.sort_values()
    # True wherever a volume number is exactly one more than the previous one
    follows_previous = ((sorted_v_ints - 1) == sorted_v_ints.shift(1))
    return follows_previous.any()
pool = pool.groupby(['author', 'title']).filter(has_consecutive_v)
pool.shape
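
The consecutive-volume check can be tried on a toy table. The sketch below uses `diff()`, which is an equivalent way of writing the comparison in `has_consecutive_v`; the data is made up, and only the column names come from the notebook.

```python
import pandas as pd


def has_consecutive_v(group: pd.DataFrame) -> bool:
    """True if any two volume numbers in the group differ by exactly 1."""
    sorted_v_ints = group.descint.sort_values()
    return bool((sorted_v_ints.diff() == 1).any())


toy = pd.DataFrame({
    'author': ['A', 'A', 'A', 'B', 'B'],
    'title': ['x', 'x', 'x', 'y', 'y'],
    'descint': [1, 2, 5, 1, 3],  # A has v.1 and v.2; B only has v.1 and v.3
})
kept = toy.groupby(['author', 'title']).filter(has_consecutive_v)
print(kept)
```

Note that `filter` keeps or drops whole groups: author A's v.5 stays in the pool even though it is not adjacent to another volume, so picking the actual consecutive runs is left to a later step.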

In [None]:
i = 0
for groupname, subset in pool.groupby(['author', 'title']):
    # Keep one randomly chosen copy of each volume number
    smaller_subset = subset.copy().groupby('descint').apply(lambda x: x.sample(1))
    for to_combine in fakebook.consecutive_vol_samples(smaller_subset):
        if len(to_combine) > 1:
            try:
                volmeta, tl = fakebook.combine_books(to_combine, style='multivol')
            except KeyboardInterrupt:
                raise
            except Exception:  # skip volume runs that fail to combine
                continue
            fakebook.save_fake_vol(volmeta, tl, '/data/saddl/fakebooks/')
            for source_htid in volmeta['source_htids']:
                ground_truth.append(dict(left=volmeta['id'], right=source_htid, judgment='CONTAINS', notes='fake multivol'))
                ground_truth.append(dict(left=source_htid, right=volmeta['id'], judgment='PARTOF', notes='fake multivol'))
    i += 1
    if i % 10 == 0:
        print(i, groupname)

In [None]:
df = pd.DataFrame(ground_truth)
df.to_parquet('/data/saddl/fakebooks/fakebook_gt.parquet')
df.sample(10)
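
Because every `CONTAINS` record is written together with a mirrored `PARTOF` record, a quick consistency check on the ground truth is possible. The following is a minimal sketch on made-up records; the ids `fake.001`, `src.a`, and `src.b` are hypothetical, and only the field names come from the cells above.

```python
import pandas as pd

# Made-up ground-truth records in the same shape as the ones appended above.
gt = pd.DataFrame([
    dict(left='fake.001', right='src.a', judgment='CONTAINS', notes='fake anthology'),
    dict(left='src.a', right='fake.001', judgment='PARTOF', notes='fake anthology'),
    dict(left='fake.001', right='src.b', judgment='CONTAINS', notes='fake anthology'),
    dict(left='src.b', right='fake.001', judgment='PARTOF', notes='fake anthology'),
])

contains = gt[gt.judgment == 'CONTAINS']
partof = gt[gt.judgment == 'PARTOF']

# (whole, part) pairs from each direction; PARTOF rows are flipped to match.
contains_pairs = set(zip(contains.left, contains.right))
partof_pairs_flipped = set(zip(partof.right, partof.left))

# Any pair present in only one direction indicates a missing mirror record.
unmatched = contains_pairs ^ partof_pairs_flipped
print(len(unmatched), 'unmatched pairs')
```

An empty `unmatched` set means each whole/part relationship is recorded in both directions.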