# Generating Fake Books for Training

Anthology contents and other whole/part relationships are difficult to infer at scale. This notebook generates fake books from known source volumes to mimic those relationships, so the ground truth is known by construction.

Specifically, it generates:

- Input: a multi-volume work; Output: a single fake volume combining those volumes
- Input: a long single-volume work; Output: multiple fake volumes, one per part of the work (not yet implemented)
- Input: multiple works by the same author; Output: a 'works'-style anthology

#### Workflow information
As of Apr 2020, the process for running this is:
- Generate the files and ground truth by running this notebook
- Convert the `fake` files to Vector_files, using `vectorization.py`
- Concatenate the Vector_files with the real data, using the concatenate script in compare_tools

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
from compare_tools import fakebook
from compare_tools.hathimeta import clean_description
from htrc_features import Volume
meta = pd.read_csv('../../sampling/test_dataset.csv.gz', low_memory=False)
ground_truth = []
meta.page_count.quantile([.1,.2,.3,.4,.5,.6,.7,.75, .8, .85, .9, .95])

0.10     56.0
0.20    124.0
0.30    192.0
0.40    248.0
0.50    302.0
0.60    352.0
0.70    410.0
0.75    444.0
0.80    488.0
0.85    538.0
0.90    612.0
0.95    742.0
Name: page_count, dtype: float64

## Combining Books

Anthology criteria: multiple works by one author that are different from each other and relatively short.

Combined-volumes criteria: multiple works that appear to be parts of a sequential set.

### Anthologies

Choose one book and patch it with other books where the title seems notably different.

In [2]:
pool = meta[meta.page_count < meta.page_count.quantile(.4)]
pool = pool.drop_duplicates(['author', 'title'])
acounts = pool.groupby('author').title.count()
pool = pool[pool.author.isin(acounts[acounts > 2].index)]
pool.shape

(39124, 27)
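
The pool filter above can be checked on a small example. The following is a minimal sketch on invented toy data; the helper name `anthology_pool` and its `quantile` parameter are illustrative assumptions, not part of `compare_tools`. It applies the same three steps: keep short volumes, drop duplicate author/title pairs, and keep authors with more than two distinct titles.

```python
import pandas as pd

# Toy metadata: column names mirror the notebook, the rows are invented.
toy = pd.DataFrame({
    'author': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'title': ['t1', 't2', 't3', 't3', 'u1', 'u2', 'w1', 'w2', 'w3'],
    'page_count': [50, 60, 70, 70, 40, 45, 55, 65, 900],
})

def anthology_pool(meta, quantile=0.4):
    """Hypothetical helper: short volumes by authors with more than two titles."""
    # Keep volumes shorter than the chosen page-count quantile
    short = meta[meta.page_count < meta.page_count.quantile(quantile)]
    # Count each author/title pair once
    short = short.drop_duplicates(['author', 'title'])
    # Keep only authors who still have more than two distinct titles
    counts = short.groupby('author').title.count()
    return short[short.author.isin(counts[counts > 2].index)]

# A high quantile is used here only because the toy table is tiny.
result = anthology_pool(toy, quantile=0.9)
```

Only author `A` survives in this toy table: `B` has two titles, and `C` loses its 900-page volume to the length cutoff, leaving two.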

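For each author in the pool, sample their books with `fakebook.anthology_sample`, combine the sample into one fake anthology, and record the whole/part ground truth.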

In [3]:
i = 0
for groupname, subset in pool.groupby('author'):
    to_combine = fakebook.anthology_sample(subset)
    if len(to_combine) > 1:
        try:
            volmeta, tl = fakebook.combine_books(to_combine)
        except Exception:
            # Skip samples that cannot be combined; KeyboardInterrupt still propagates
            continue
        fakebook.save_fake_vol(volmeta, tl, '/data/saddl/fakebooks/')

        for source_htid in volmeta['source_htids']:
            ground_truth.append(dict(left=volmeta['id'], right=source_htid, judgment='CONTAINS', notes='fake anthology'))
            ground_truth.append(dict(left=source_htid, right=volmeta['id'], judgment='PARTOF', notes='fake anthology'))
    i += 1
    if i % 100 == 0:
        print(i, groupname)

100 American Society for Testing and Materials.
200 Balfour, Clara Lucas, 1808-1878.
300 Bennett, Edward H. 1874-1954.
400 Bord, Janet, 1945-
500 Brooks, William Keith, 1848-1908.
600 Calhoun, John C. 1782-1850.
700 Cherrington, Ernest Hurst, 1877-1950.
800 Conference of Local and Regional Authorities of Europe.
900 Currimbhoy, Asif, 1928-
1000 Dixwell, George Basil.
1100 Eells, Myron, 1843-1907
1200 Field, Frank, 1942-
1300 Fullerton, Georgiana, Lady, 1812-1885.
1400 Gonzalez-Mena, Janet.
1500 Gunning, Mrs. 1740?-1800.
1600 Haviland, Virginia, 1911-
1700 Holmes, Edmond Gore Alexander, 1850-1936.
1800 India. Committee on Plan Projects. Minor Irrigation Team.
1900 Johnson, Lionel, 1867-1902.
2000 Ker, W. P. 1855-1923.
2100 Lang, Jeanie.
2200 London, Barbara, 1936-
2300 Mao, Zedong, 1893-1976.
2400 McKinley, Albert E. 1870-1936.
2500 Montale, Eugenio, 1896-1981.
2600 National Research Council (U.S.). Committee on Oceanography.
2700 Nystrom, John W. 1824-1885.
2800 Paton, Lewis Bayles, 18

### Multi-volume sets

In [4]:
pool = meta[meta.page_count < meta.page_count.quantile(.6)]
pool = pool[~pool.description.isna()]
# Keep only descriptions of the form 'v.N' or 'v.NN' (raw string avoids invalid-escape warnings)
pool = pool[clean_description(pool.description).str.contains(r'^v\.\d\d?$')]
# Filter to author/title pairs that have more than one volume
pool = pool.groupby(['author', 'title']).filter(lambda x: x.description.unique().shape[0] > 1)
pool = pool.copy()
# regex=False strips the literal prefix 'v.' rather than 'v' plus any character
pool['descint'] = clean_description(pool.description).str.replace('v.', '', regex=False).astype(int)
# Filter further, to author/title pairs that have consecutively numbered volumes
def has_consecutive_v(x):
    # True if any volume number in the group directly follows another one
    sorted_v_ints = x.descint.sort_values()
    follows_previous = (sorted_v_ints - 1) == sorted_v_ints.shift(1)
    return follows_previous.any()
pool = pool.groupby(['author', 'title']).filter(has_consecutive_v)
pool.shape

(5051, 28)
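
To make the consecutive-volume filter concrete, here is a minimal sketch on invented toy data. It assumes the descriptions are already cleaned to the `v.N` form, so it does not call `clean_description`.

```python
import pandas as pd

# Toy sets: 'set one' has volumes 1, 2, 5; 'set two' has volumes 1 and 3.
toy = pd.DataFrame({
    'author': ['A'] * 3 + ['B'] * 2,
    'title': ['set one'] * 3 + ['set two'] * 2,
    'description': ['v.1', 'v.2', 'v.5', 'v.1', 'v.3'],
})
# Parse 'v.N' into an integer; regex=False treats 'v.' as a literal prefix
toy['descint'] = toy.description.str.replace('v.', '', regex=False).astype(int)

def has_consecutive_v(group):
    """True if any two volume numbers in the group differ by exactly one."""
    vols = group.descint.sort_values()
    return ((vols - 1) == vols.shift(1)).any()

# filter() keeps or drops whole groups based on the boolean returned per group
kept = toy.groupby(['author', 'title']).filter(has_consecutive_v)
```

Note that the filter keeps the entire group once any adjacent pair exists, so `v.5` of 'set one' remains in `kept` even though it has no neighbor. Selecting the actual runs is presumably left to `fakebook.consecutive_vol_samples` in the next cell.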

In [5]:
i = 0
for groupname, subset in pool.groupby(['author', 'title']):
    # Keep one randomly chosen copy of each volume number
    smaller_subset = subset.copy().groupby('descint').apply(lambda x: x.sample(1))
    for to_combine in fakebook.consecutive_vol_samples(smaller_subset):
        if len(to_combine) > 1:
            try:
                volmeta, tl = fakebook.combine_books(to_combine, style='multivol')
            except Exception:
                # Skip samples that cannot be combined; KeyboardInterrupt still propagates
                continue
            fakebook.save_fake_vol(volmeta, tl, '/data/saddl/fakebooks/')
            for source_htid in volmeta['source_htids']:
                ground_truth.append(dict(left=volmeta['id'], right=source_htid, judgment='CONTAINS', notes='fake multivol'))
                ground_truth.append(dict(left=source_htid, right=volmeta['id'], judgment='PARTOF', notes='fake multivol'))
    i += 1
    if i % 100 == 0:
        print(i, groupname)

100 ('Bond, Francis, d. 1918.', 'Wood carvings in English churches, by Francis Bond.')
200 ('Casanova, Giacomo, 1725-1798.', 'The memoirs of Jacques Casanova de Seingalt. / Complete in twelve volumes as translated into English by Arthur Machen with an introd. by Arthur Symons, a new pref. by the translator and twelve drawings by Rockwell Kent.')
300 ('Edwards, Amelia Ann Blanford, 1831-1892.', "Barbara's history / Amelia B. Edwards ...")
400 ('Government Oriental Manuscripts Library (Tamil Nadu, India)', 'A descriptive catalogue of the Islamic manuscripts in the Government Oriental manuscripts library, Madras, by Vidyasagara Vidyavacaspati P. P. Subrahmanya Sastri. (Prepared under the orders of the government of Madras)')
500 ('Hyams, Edward, 1910-1975.', 'Ornamental shrubs for temperate zone gardens.')
600 ('Knox, Vicesimus, 1752-1821.', 'Essays, moral and literary / Vicesimus Knox.')
700 ('Matematicheskiĭ institut im. V.A. Steklova.', 'Mathematics, its content, methods and meaning. E

In [6]:
df = pd.DataFrame(ground_truth)
df.to_parquet('/data/saddl/fakebooks/fakebook_gt.parquet')
df.sample(10)

| | left | right | judgment | notes |
|---|---|---|---|---|
| 8308 | fake.bf8854 | mdp.39015003501320 | CONTAINS | fake anthology |
| 31555 | mdp.39015081108246 | fake.548187 | PARTOF | fake anthology |
| 10058 | fake.ebb499 | coo.31924000520464 | CONTAINS | fake anthology |
| 15882 | fake.7484fa | loc.ark:/13960/t8jd5bz53 | CONTAINS | fake anthology |
| 29820 | fake.5255cc | uc1.c055928877 | CONTAINS | fake anthology |
| 27970 | fake.cbf30a | mdp.39015033362438 | CONTAINS | fake anthology |
| 26526 | fake.68107d | umn.319510007805824 | CONTAINS | fake anthology |
| 32277 | umn.31951002005585i | fake.e2f490 | PARTOF | fake anthology |
| 30788 | fake.994d91 | loc.ark:/13960/t9086xs0s | CONTAINS | fake anthology |
| 38354 | fake.3a22be | umn.31951d00699948u | CONTAINS | fake multivol |
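
Each fake volume contributes mirrored rows: a `CONTAINS` row from the fake volume to a source and a `PARTOF` row in the other direction. The sketch below shows one way to build those rows and to sanity-check that the ground truth is symmetric. The helper names `relation_pairs` and `is_symmetric` and the identifiers are hypothetical, introduced here only for illustration.

```python
import pandas as pd

def relation_pairs(fake_id, source_htids, notes):
    """Build mirrored CONTAINS / PARTOF rows for one fake volume."""
    rows = []
    for htid in source_htids:
        rows.append(dict(left=fake_id, right=htid, judgment='CONTAINS', notes=notes))
        rows.append(dict(left=htid, right=fake_id, judgment='PARTOF', notes=notes))
    return rows

def is_symmetric(gt):
    """Check that every CONTAINS row has a PARTOF row with the sides swapped."""
    contains = gt[gt.judgment == 'CONTAINS'][['left', 'right']]
    # Select right-then-left so the tuples line up with the CONTAINS orientation
    partof = gt[gt.judgment == 'PARTOF'][['right', 'left']]
    return set(map(tuple, contains.to_numpy())) == set(map(tuple, partof.to_numpy()))

# Invented identifiers, for illustration only
gt = pd.DataFrame(relation_pairs('fake.000000', ['src.1', 'src.2'], 'fake anthology'))
symmetric = is_symmetric(gt)
```

A check like this could be run on `df` before writing the parquet file, since a missing mirror row would indicate that one of the two `append` calls was skipped.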


In [14]:
df.left[df.left.str.startswith('fake')].drop_duplicates().to_csv('/data/saddl/fakebooks/fake-htids.csv', header=False, index=False)