# Generating Fake Books for Training

Anthology contents and other whole/part relationships are difficult to infer at large scales, so this notebook generates fake books that mimic those relationships and records the ground truth for each one.

Specifically, it generates:

- Input: Multi-volume works; Output: Fake single volume of work
- Input: A long single volume work; Output: Multiple fake volumes for parts of work (not yet implemented)
- Input: Multiple works by the same author; Output: A 'works'-style anthology

#### Workflow information
As of Apr 2020, the process for running this is:
- Generate the files and ground truth by running this notebook
- Convert the `fake` files to Vector_files, using `vectorization.py`
- Concatenate the Vector_files with the real data, using the concatenate script in compare_tools

In [1]:
%load_ext autoreload
%autoreload 2

In [9]:
from compare_tools.configuration import config
import pandas as pd
import numpy as np
import os
from compare_tools import fakebook
from compare_tools.hathimeta import clean_description, HathiMeta
from htrc_features import Volume
# Set test = True to use the testset parameters instead of the full set
test = False
config.update(config['test' if test else 'full'])
fakebook_root = '/data/saddl/fakebooks_{}set/'.format('test' if test else 'full')
os.makedirs(fakebook_root, exist_ok=True)
config['metadb_path']

'/data/saddl/meta.db'

In [26]:
hathimeta = HathiMeta(config['metadb_path'])
meta = hathimeta.get_fields(fields=['htid', 'author', 'title', 'description', 'page_count'])
meta.page_count = pd.to_numeric(meta.page_count)
ground_truth = []
meta.page_count.quantile([.1,.25,.4,.5,.75, .9, .95, .99])

0.10      44.0
0.25     120.0
0.40     214.0
0.50     278.0
0.75     486.0
0.90     746.0
0.95     934.0
0.99    1368.0
Name: page_count, dtype: float64

## Combining Books

Anthology criteria - multiple works by an author that are different and relatively short.

Combined Volumes criteria - multiple works that look to be parts of a sequential set.

### Anthologies

Choose one book and patch it with other books where the title seems notably different.
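
A minimal sketch of the "notably different title" idea, assuming difference is measured with a string-similarity ratio. This is not how `fakebook.anthology_sample` is implemented; `sample_distinct_titles`, `max_similarity`, and `max_books` are hypothetical names used only for illustration.

```python
import difflib
import random


def sample_distinct_titles(titles, max_similarity=0.6, max_books=4, seed=None):
    """Pick titles in random order, skipping any that resemble one already chosen."""
    rng = random.Random(seed)
    candidates = list(titles)
    rng.shuffle(candidates)
    chosen = []
    for title in candidates:
        # Compare case-insensitively against every title chosen so far
        too_similar = any(
            difflib.SequenceMatcher(None, title.lower(), other.lower()).ratio()
            > max_similarity
            for other in chosen
        )
        if not too_similar:
            chosen.append(title)
        if len(chosen) == max_books:
            break
    return chosen


# Hypothetical titles by a single author
titles = ["Poems", "Poems.", "Collected letters", "A history of Rome", "Essays on art"]
picked = sample_distinct_titles(titles, seed=0)
```

With these inputs, "Poems" and "Poems." can never both be picked, since they differ only by punctuation.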

In [46]:
# Create a pool of shorter books by authors with >2 books
pool = meta[meta.page_count < meta.page_count.quantile(.4)]
pool = pool.drop_duplicates(['author', 'title'])
acounts = pool.groupby('author').title.count()
pool = pool[pool.author.isin(acounts[acounts > 2].index)]
nauthors = pool.author.unique().shape[0]
print("Initial pool size is", pool.shape[0], "with n unique authors:", nauthors)
# Trim the pool to at most 20,000 randomly chosen authors when the input data is extremely large.
authors = pool.author.unique()
np.random.shuffle(authors)
pool = pool[pool.author.isin(authors[:20000])]
print("Final pool size is", pool.shape[0], "with n unique authors:", pool.author.unique().shape[0])

Initial pool size is 757862 with n unique authors: 106986
Final pool size is 143018 with n unique authors: 20000
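
The shuffle above is unseeded, so the trimmed pool changes from run to run. Here is a small sketch of the same filter-then-trim steps on toy data, using a seeded generator so the selection can be reproduced. The seed value and the toy frame are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Toy metadata: author C has a duplicated title, author B has only two books
toy = pd.DataFrame({
    "author": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "title": ["t1", "t2", "t3", "t1", "t2", "t1", "t2", "t3", "t3"],
})
toy = toy.drop_duplicates(["author", "title"])
counts = toy.groupby("author").title.count()
toy = toy[toy.author.isin(counts[counts > 2].index)]  # keep authors with >2 books

# Seeded generator (the seed is an arbitrary choice); sorting first makes the
# result independent of the row order in the input
rng = np.random.default_rng(42)
authors = rng.permutation(sorted(toy.author.unique()))
trimmed = toy[toy.author.isin(authors[:1])]  # keep one author, as a stand-in for 20000
```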


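For each author in the pool, draw two samples of their books, combine each sample into a fake anthology, and record the ground truth. The second anthology is only built when the two samples share one or two books, in which case the pair is also marked as overlapping.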

In [None]:
i = 0
for groupname, subset in pool.groupby('author'):
    to_combine1 = fakebook.anthology_sample(subset)
    to_combine2 = fakebook.anthology_sample(subset)
    
    new_ids = []
    # Check if there's an OVERLAP relationship
    overlap = set(to_combine1).intersection(to_combine2)
    if (len(overlap) < 1) or (len(overlap) > 2):
        # Don't bother making the second fake doc
        to_combine2 = []
    
    for to_combine in [to_combine1, to_combine2]:
        if len(to_combine) > 1:
            try:
                volmeta, tl = fakebook.combine_books(to_combine)
            except KeyboardInterrupt:
                raise
            except Exception:
                # Skip any sample that fails to combine
                continue

            new_ids.append(volmeta['id'])
            fakebook.save_fake_vol(volmeta, tl, fakebook_root)
            for source_htid in volmeta['source_htids']:
                ground_truth.append(dict(left=volmeta['id'], right=source_htid, judgment='CONTAINS', notes='fake anthology'))
                ground_truth.append(dict(left=source_htid, right=volmeta['id'], judgment='PARTOF', notes='fake anthology'))
        i += 1
        if i % 100 == 0:
            print(i, groupname)
            
    if len(new_ids) == 2:
        ground_truth.append(dict(left=new_ids[1], right=new_ids[0], judgment='OVERLAPS', notes='fake overlap'))
        ground_truth.append(dict(left=new_ids[0], right=new_ids[1], judgment='OVERLAPS', notes='fake overlap'))

100 Abraham, Ralph.
200 Adams, Paul W.
300 Agg, John, 1783-1855.
400 Aït-Sahalia, Yacine
500 Alderman, Geoffrey.
600 Allen, Daniel B.
700 Alvarez Bravo, Manuel, 1902-2002.
800 American College of Preventive Medicine.
900 American Meat Institute Foundation.
1000 Amherst college. Class of 1883.
1100 Anderson, Michael, 1954-
1200 Anquandah, James.
1300 Ardell, Donald B.
1400 Arnold, J. E. M.
1500 Asia Society.
1600 Atkinson, J. C. 1814-1900.
1700 Australia
1800 Babcock, Harold Lester, 1886-1953.
1900 Bailey, H. W. 1899-1996.
2000 Baker, Will, 1935-2005.
2100 Baltimore (Md.)
2200 Barker, Ernest, Sir, 1874-1960.
2300 Barrass, Edward, 1821-1898.
2400 Barton, Brigid S.
2500 Baudelaire, Charles, 1821-1867.
2600 Beard, James, 1903-1985.
2700 Beesly, Edward Spencer, 1831-1915.
2800 Ben-Yami, M.
2900 Bennis, Phyllis, 1951-
3000 Berke, Harry L.
3100 Bethke, Eunice.
3200 Bienvenu, Millard J.
3300 Birney, Earle, 1904-1995.
3400 Blaine, Robert Gordon.
3500 Blest, A. D.
3600 Bochner, Mel, 1940-
3700 
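
To make the shape of the ground truth concrete, here is a sketch of the records produced for one pair of fake anthologies. The helper names `containment_records` and `overlap_records` and all identifiers are hypothetical; the record fields and judgments follow the loop above.

```python
def containment_records(fake_id, source_htids, notes):
    """Build the CONTAINS/PARTOF pair for every source volume of a fake book."""
    records = []
    for htid in source_htids:
        records.append(dict(left=fake_id, right=htid, judgment='CONTAINS', notes=notes))
        records.append(dict(left=htid, right=fake_id, judgment='PARTOF', notes=notes))
    return records


def overlap_records(id_a, id_b, notes='fake overlap'):
    """Build the symmetric OVERLAPS pair for two fake books."""
    return [
        dict(left=id_a, right=id_b, judgment='OVERLAPS', notes=notes),
        dict(left=id_b, right=id_a, judgment='OVERLAPS', notes=notes),
    ]


# Hypothetical samples for one author; they share exactly one book
first = ['vol.a', 'vol.b', 'vol.c']
second = ['vol.c', 'vol.d']
shared = set(first).intersection(second)

records = containment_records('fake.0001', first, 'fake anthology')
if 1 <= len(shared) <= 2:
    # Only build the second fake book when the overlap is one or two volumes
    records += containment_records('fake.0002', second, 'fake anthology')
    records += overlap_records('fake.0001', 'fake.0002')
```

Every relationship is stored in both directions, so a classifier sees each pair as both (left, right) and (right, left).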

### Multi-volume sets

In [50]:
pool = meta[meta.page_count < meta.page_count.quantile(.6)]
pool = pool[~pool.description.isna()]
pool = pool[clean_description(pool.description).str.contains(r'^v\.\d\d?$')]
# Filter to author/title pairs that have more than one volume
pool = pool.groupby(['author', 'title']).filter(lambda x: x.description.unique().shape[0] > 1)
pool = pool.copy()
pool['descint'] = clean_description(pool.description).str.replace('v.', '', regex=False).astype(int)
# Filter further, to author/title pairs that have consecutively numbered volumes
def has_consecutive_v(x):
    """Return True if the group contains at least one pair of consecutive volume numbers."""
    sorted_v_ints = x.descint.sort_values()
    follows_previous = ((sorted_v_ints - 1) == sorted_v_ints.shift(1))
    return follows_previous.any()
pool = pool.groupby(['author', 'title']).filter(has_consecutive_v)
pool.shape

(113382, 6)
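
A quick check of the consecutive-volume filter on toy data. The frame below is made up for illustration; the function is the same shift-and-compare test used above, wrapped in `bool()` for clarity.

```python
import pandas as pd


def has_consecutive_v(group):
    """True if any volume number is exactly one more than the previous one."""
    sorted_v = group.descint.sort_values()
    return bool(((sorted_v - 1) == sorted_v.shift(1)).any())


toy = pd.DataFrame({
    "author": ["A", "A", "A", "B", "B"],
    "title": ["Works", "Works", "Works", "Letters", "Letters"],
    "descint": [1, 2, 5, 1, 3],  # A has v.1 and v.2; B only has v.1 and v.3
})
kept = toy.groupby(["author", "title"]).filter(has_consecutive_v)
```

Only the "Works" group survives, because it has v.1 followed by v.2. The "Letters" group is dropped since v.1 and v.3 are not adjacent.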

In [None]:
i = 0
early_stop = 20000
for groupname, subset in pool.groupby(['author', 'title']):
    smaller_subset = subset.copy().groupby('descint').apply(lambda x: x.sample(1))
    for to_combine in fakebook.consecutive_vol_samples(smaller_subset):
        if len(to_combine) > 1:
            try:
                volmeta, tl = fakebook.combine_books(to_combine, style='multivol')
            except KeyboardInterrupt:
                raise
            except Exception:
                # Skip any sample that fails to combine
                continue
            fakebook.save_fake_vol(volmeta, tl, fakebook_root)
            for source_htid in volmeta['source_htids']:
                ground_truth.append(dict(left=volmeta['id'], right=source_htid, judgment='CONTAINS', notes='fake multivol'))
                ground_truth.append(dict(left=source_htid, right=volmeta['id'], judgment='PARTOF', notes='fake multivol'))
    i += 1
    if i % 100 == 0:
        print(i, groupname)
    if i >= early_stop:
        break

100 ('Adams, Ansel, 1902-1984', 'Basic photo.')
200 ('Aikin, Lucy, 1781-1864.', 'The life of Joseph Addison. By Lucy Aikin ...')
300 ('Aldrich, Thomas Bailey, 1836-1907.', 'The poems of Thomas Bailey Aldrich.')
400 ('Allen, William, 1532-1594.', 'A true, sincere and modest defence of English Catholics that suffer for their faith both at home and abroad, against a false, seditious and slanderous libel, entitled: "The execution of justice in England" / by William Allen ; with a preface by his eminence the Cardinal Archbishop of Westminister.')
500 ('American Institute of Chemical Engineers.', 'Nuclear engineering.')
600 ('American Sociological Association', 'Publication of the American Sociological Society')
700 ('Anjou, Gustave, 1863-1942', "Ulster County, N.Y. probate records in the office of the surrogate, and in the county clerk's office at Kingston, N.Y. A careful abstract and    translation of the Dutch and English wills, letters of administration after      intestates, and invento

In [None]:
df = pd.DataFrame(ground_truth)
df.to_parquet(fakebook_root + 'fakebook_gt.parquet')
df.sample(10)

In [None]:
to_save = df.left[df.left.str.startswith('fake')].drop_duplicates()
to_save.name = 'htid'
to_save.to_csv(fakebook_root+'fake-htids.csv', header=True, index=False)

In [58]:
# Write the ground truth as JSON Lines (one record per line) for the stat-crunching step
import json
import pandas as pd
df = pd.read_parquet(fakebook_root + 'fakebook_gt.parquet')
with open(fakebook_root + 'stat_input.json', mode='w') as f:
    for record in df.sort_values('left').to_dict(orient='records'):
        json.dump(record, f)
        f.write('\n')
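
Despite the `.json` extension, the file written above holds one JSON object per line rather than a single JSON document, so it has to be read line by line. A small round-trip sketch, using a temporary directory and made-up records in place of `fakebook_root` and the real ground truth:

```python
import json
import os
import tempfile

# Made-up records with the same fields as the ground truth
records = [
    dict(left='fake.0001', right='vol.a', judgment='CONTAINS', notes='fake anthology'),
    dict(left='vol.a', right='fake.0001', judgment='PARTOF', notes='fake anthology'),
]

path = os.path.join(tempfile.mkdtemp(), 'stat_input.json')
with open(path, mode='w') as f:
    for record in records:
        json.dump(record, f)
        f.write('\n')

# Read back: parse each non-empty line as its own JSON object
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]
```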

## Next up

Two things remain to be done: vector_file versions of the fake books need to be created (the list of ids is in `fake-htids.csv`), and the classifier input files need to be crunched. Code for these actions is in `../workflow.md`. Remember that the vectorization settings should be consistent with those used for the 'real' files; currently I'm using 300d gigaword GloVe with 5k chunks.

Parallelized:
```
mkdir -p /tmp/fake
seq 1 20 | parallel --eta -n1 -j20 python vectorization.py --outdir /tmp/fake --no-srp -g 50 --chunksize 5000 {} 20 /data/saddl/fakebooks_testset/fake-htids.csv
python concatenate-vector_files.py --build-cache --mode w /data/vectorfiles/fake_testset.bin /tmp/fake/*bin
```