<a href="https://colab.research.google.com/github/massivetexts/compare-tools/blob/master/scripts/GoodReadsBookAlignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook cross-references HathiTrust and Goodreads (via the UCSD Book Graph dataset) to find 'similar to' relationships between books. These relationships can be used to train on a different type of contextual relationship.

In [None]:
import pandas as pd

In [None]:
#@title Download Dataset Files - HathiFiles and UCSD Book Data
!pip install gdown
!gdown https://drive.google.com/uc?id=1LXpK1UfqtP89H1tYy0pBGHjYk8IhigUK
!gdown https://drive.google.com/uc?id=19cdwyXwfXx_HDIgxXaHzH0mrx8nMyLvC
!wget -O hathifiles.tsv.gz https://www.hathitrust.org/filebrowser/download/291721

Downloading...
From: https://drive.google.com/uc?id=1LXpK1UfqtP89H1tYy0pBGHjYk8IhigUK
To: /content/goodreads_books.json.gz
2.08GB [00:26, 77.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=19cdwyXwfXx_HDIgxXaHzH0mrx8nMyLvC
To: /content/goodreads_book_authors.json.gz
17.9MB [00:00, 57.0MB/s]
--2020-07-07 17:32:34--  https://www.hathitrust.org/filebrowser/download/291721
Resolving www.hathitrust.org (www.hathitrust.org)... 134.68.125.197
Connecting to www.hathitrust.org (www.hathitrust.org)|134.68.125.197|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1045306423 (997M) [application/octet-stream]
Saving to: ‘hathifiles.tsv.gz’


2020-07-07 17:33:28 (19.6 MB/s) - ‘hathifiles.tsv.gz’ saved [1045306423/1045306423]



In [None]:
# Mount Google Drive, for saving derived data.
from google.colab import drive
drive.mount('/content/drive/')
GRANT_FOLDER = '/content/drive/My Drive/Grants/IMLS Grant/Data/' #@param {type:'string'}

Mounted at /content/drive/


## Read HathiTrust Metadata

Because the Hathifiles are large, create a chunk iterator that reads only part of the full dataset at a time.

In [None]:
headings = ['htid', 'access', 'rights', 'ht_bib_key', 'description', 'source', 
            'source_bib_num', 'oclc_num', 'isbn', 'issn', 'lccn', 'title', 
            'imprint', 'rights_reason_code', 'rights_timestamp', 'us_gov_doc_flag', 
            'rights_date_used', 'pub_place', 'lang', 'bib_fmt', 'collection_code', 
            'content_provider_code', 'responsible_entity_code', 
            'digitization_agent_code', 'access_profile_code', 'author']
# Dataset is large, so use an iterator
htreader = pd.read_csv('hathifiles.tsv.gz', sep='\t', compression='gzip',
                     chunksize=100000, names=headings,
                     usecols=['htid', 'author', 'title', 'rights_date_used',
                               'description'])
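# Peek at the first chunk only. This advances htreader, so a later loop over
# the same reader resumes at the second chunk.
for htmeta in htreader:
    example = htmeta.sample(4)
    break
example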

Unnamed: 0,htid,description,title,rights_date_used,author
20236,mdp.39015004074269,,Lonely crusade.,1973.0,"Himes, Chester B., 1909-1984."
79767,mdp.39015022356102,v.206 1956,Transactions.,1956.0,Metallurgical Society of AIME.
51026,uc1.b4952868,,Every child a wanted child : Clarence James Ga...,1978.0,"Williams, Doone."
40573,mdp.39015002706300,,Popes Shakespeare-Ausgabe als Spiegel seiner K...,1975.0,"Kowalk, Wolfgang."
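
The reader returned by `read_csv(..., chunksize=...)` is consumed as it is iterated, so peeking at the first chunk advances it. Below is a plain-Python sketch of the same single-pass behavior. The `read_chunks` helper and the rows are made up for illustration and are not part of the notebook.

```python
import csv
import io
from itertools import islice

def read_chunks(handle, chunk_size):
    """Yield lists of up to chunk_size rows from a tab-separated file handle."""
    reader = csv.reader(handle, delimiter='\t')
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

# Five made-up rows standing in for the Hathifiles
data = io.StringIO(''.join(f'id{i}\ttitle {i}\n' for i in range(5)))
chunks = read_chunks(data, chunk_size=2)

first = next(chunks)      # peek at the first chunk
remaining = list(chunks)  # resumes after the chunk that was already consumed

print(first)
print([len(chunk) for chunk in remaining])
```

To cover every row after a peek like this, create the reader again instead of reusing the partly consumed one.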


## Loading UCSD data
- read dataset file content (needs decompression)
- parse json from content
- loop through the books, and for each book, see if there is a match in our HathiTrust metadata DataFrame

### Load Author Data and Cross-reference with HT

First, we want to find the Goodreads authors who may also appear in the HathiTrust. Those who do not can be ignored.

This is a tricky alignment, because the HathiTrust author data is somewhat messy in its formatting.

In [None]:
authors = pd.read_json('goodreads_book_authors.json.gz', compression='gzip', lines=True)
authors.sample(2)

Unnamed: 0,average_rating,author_id,text_reviews_count,name,ratings_count
156152,3.85,43127,738,Allan Bloom,10294
741264,5.0,7814838,2,Doc Underwood,4


In [None]:
# Create a new "lastname, firstname" column in the UCSD authors dataset
authors['new_name'] = authors.name.str.replace(r'^(.*) ([A-Z].*)$', r'\2, \1', regex=True)
authors.sample(3)

Unnamed: 0,average_rating,author_id,text_reviews_count,name,ratings_count,new_name
790587,2.0,3438740,1,Hector Giacomelli,1,"Giacomelli, Hector"
65855,3.5,15581717,3,Gianluca Pirozzi,4,"Pirozzi, Gianluca"
343714,3.12,3078325,1,Victor Alvarez,8,"Alvarez, Victor"
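
To see what the reordering pattern does on less regular names, here is a small sketch that applies the same regular expression with the standard `re` module. The helper name `to_last_first` is made up for illustration. The greedy `(.*)` means that only the last capitalized word is treated as the surname.

```python
import re

# Same pattern as the notebook cell, applied with the standard library
NAME_PATTERN = re.compile(r'^(.*) ([A-Z].*)$')

def to_last_first(name):
    """Reorder 'First Last' as 'Last, First'; non-matching names are returned unchanged."""
    return NAME_PATTERN.sub(r'\2, \1', name)

for name in ['Allan Bloom', 'Chester B. Himes', 'Ursula K. Le Guin', 'Voltaire']:
    print(name, '->', to_last_first(name))
```

Multi-word surnames such as "Le Guin" are split, and single-word names pass through untouched, so some authors will be missed by an exact match against the HathiTrust names.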


In [None]:
# Collect all the unique authors and unique book titles in the HathiTrust without
# loading it all into memory

import numpy as np
all_ht_authors = []
all_ht_titles = []
for i, htmeta in enumerate(htreader):
    print(i, end=',')
    all_ht_authors.append(htmeta.author.unique())
    all_ht_titles.append(htmeta.title.unique())
print()
all_ht_authors = pd.Series(np.concatenate(all_ht_authors)).drop_duplicates().fillna('')
all_ht_titles = pd.Series(np.concatenate(all_ht_titles)).drop_duplicates().fillna('')
all_ht_authors.shape, all_ht_titles.shape

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,

28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,

124,125,126,127,128,129,130,131,132,133,134,135,136,

137,138,139,140,

141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,


((3314666,), (8557627,))

In [None]:
# Reformat the author and book title information from the HathiTrust metadata.
# This won't be perfect, but we don't need completeness, so the few places where
# it messes up should be fine.
!pip install -q git+https://github.com/massivetexts/compare-tools
import re
from compare_tools.hathimeta import clean_title

def clean_author(x):
    x = re.sub(r'-?\d\d\d\d', '', str(x))  # Drop four-digit years, e.g. birth/death dates
    x = x.strip().strip('.')
    x = x.split(',')[:2]                   # Keep only "lastname, firstname"
    x = ",".join(x)
    return x

def simple_title(x):
    return (x.apply(clean_title)   # Run the compare_tools clean_title code
             .str.lower()          # Lowercase
             .str.split(r':|/|\\') # Split on :, /, \
             .apply(lambda x:x[0]) # Keep first string of split
             .str.replace(r'\W', '', regex=True) # Remove non-word chars
             .apply(lambda x: x[:35]) # Truncate to first 35 chars
    )
all_ht_authors = all_ht_authors.apply(clean_author)
clean_ht_titles = simple_title(all_ht_titles)

  Building wheel for Compare-Tools (setup.py) ... done


In [None]:
clean_ht_titles.sample(3)

2174967    forandagainstthestate
3647964              feudafrique
2559463     stilidiemancipazione
dtype: object
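
As a rough illustration of the normalization, the sketch below applies the same steps to single strings using only the standard library. It skips the `clean_title` step from compare_tools, whose behavior is not shown here, so results on real titles may differ from the notebook's. The subtitle in the third example is made up.

```python
import re

def clean_author(raw):
    """Drop four-digit years and trailing punctuation, keep 'lastname, firstname'."""
    text = re.sub(r'-?\d\d\d\d', '', str(raw))
    text = text.strip().strip('.')
    return ','.join(text.split(',')[:2])

def simple_title(raw, max_len=35):
    """Lowercase, keep the part before ':', '/' or a backslash, drop non-word characters, truncate."""
    head = re.split(r':|/|\\', raw.lower())[0]
    return re.sub(r'\W', '', head)[:max_len]

print(clean_author('Himes, Chester B., 1909-1984.'))
print(simple_title('Lonely crusade.'))
print(simple_title('For and against the state : an illustrative subtitle'))
print(simple_title('The complete short stories of Marcel Proust'))
```

The 35-character cap explains truncated keys such as `thecompleteshortstoriesofmarcelprou`.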

Here, I truncate the `authors` data, overwriting the original variable to save memory. There is no reason to hold the entire original dataset in memory any more.

In [None]:
# Cross reference the HT Meta and the Goodreads data to find authors in both
print("Checking Author Intersection")
a = set(all_ht_authors)
b = set(authors.new_name)
overlap = b.intersection(a)
print("# of authors seen in both datasets:", len(overlap))

# Truncate Authors table
print("Pre-size:", authors.shape[0])
authors = authors[authors.new_name.isin(overlap)]
print("Post-size:", authors.shape[0])

Checking Author Intersection
# of authors seen in both datasets: 120407
Pre-size: 829529
Post-size: 120927


## Load Book information for authors that may be in the HathiTrust

First, the book data needs to be joined with authors, which has already been truncated to authors that we can find in the HathiTrust.

Then, we derive a cleaned title and keep only the rows where it matches one of the unique cleaned titles in the HathiTrust. The result is saved to a gzipped CSV file in Google Drive called `merge_books.csv.gz`.

In [None]:
import os
cols_to_keep = ['isbn', 'popular_shelves','similar_books', 'average_rating', 'link',
                'publication_year', 'book_id', 'title', 'title_without_series',
                'work_id', 'author_id', 'author', 'author_formatted', 'simple_title',
                'edition_information']

bookreader = pd.read_json('goodreads_books.json.gz', compression='gzip',
                          lines=True, chunksize=100000)

outf = os.path.join(GRANT_FOLDER, 'merge_books.csv.gz')

for i, books in enumerate(bookreader):
    print(i, end=', ')
    # Extract the first author
    books['first_author_id'] = books.authors.apply(lambda x: x[0]['author_id'] if len(x) > 0 else -1).astype(int)

    # Do an inner join with authors. This will shrink the dataset
    merged = books.merge(authors[['name', 'new_name', 'author_id']], 
                        how='inner', left_on='first_author_id', 
                        right_on='author_id')
    # Clean the titles and only keep rows that have a matching title in the HT
    merged['simple_title'] = simple_title(merged['title'].fillna(''))
    merged = merged[merged.simple_title.isin(clean_ht_titles)]
    merged = merged.rename(columns={'name': 'author', 'new_name': 'author_formatted'})

    # Append this chunk to the output CSV on Drive. Note that to_csv writes
    # the header row again with every append.
    merged[cols_to_keep].to_csv(outf, mode='a' if i > 0 else 'w', compression='gzip')

print()

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 
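
When a file is built up chunk by chunk, it is worth checking whether the header row is written once or with every chunk. The sketch below shows the write-once pattern with the standard `csv` and `gzip` modules. The file name and rows are made up for illustration. With pandas, passing `header=(i == 0)` to `to_csv` should have the same effect.

```python
import csv
import gzip
import os
import tempfile

# Two made-up chunks standing in for the filtered Goodreads rows
chunks = [
    [{'book_id': 1, 'simple_title': 'lonelycrusade'}],
    [{'book_id': 2, 'simple_title': 'feudafrique'}],
]
path = os.path.join(tempfile.mkdtemp(), 'merge_books_demo.csv.gz')

for i, rows in enumerate(chunks):
    # Overwrite on the first chunk, append afterwards
    with gzip.open(path, 'wt' if i == 0 else 'at', newline='') as fout:
        writer = csv.DictWriter(fout, fieldnames=['book_id', 'simple_title'])
        if i == 0:
            writer.writeheader()  # header only once
        writer.writerows(rows)

with gzip.open(path, 'rt', newline='') as fin:
    rows_back = list(csv.DictReader(fin))

print(rows_back)
```

Appending to a gzip file produces several compressed members in one file, which `gzip.open` reads back as a single stream.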


In [None]:
## Get the book title overlap
clean_gr_titles = set(np.concatenate([df.simple_title.unique() for df in pd.read_csv(outf, compression='gzip', chunksize=100000)]))
cleaned_title_overlap = clean_gr_titles.intersection(clean_ht_titles)
len(cleaned_title_overlap)

112795

In [None]:
nbooks = sum([df.shape[0] for df in pd.read_csv(outf, compression='gzip', chunksize=100000)])
nbooks

272407

In [None]:
htreader = pd.read_csv('hathifiles.tsv.gz', sep='\t', compression='gzip',
                     chunksize=250000, names=headings,
                     usecols=['htid', 'author', 'title', 'rights_date_used',
                               'description'])
outht = os.path.join(GRANT_FOLDER, 'ht_overlap.csv.gz')
for i, htmeta in enumerate(htreader):
    htmeta['clean_author'] = htmeta.author.fillna('').apply(clean_author)
    # Copy after filtering so the next column assignment does not write to a slice
    htmeta = htmeta[htmeta.clean_author.isin(overlap)].copy()
    htmeta['simple_title'] = simple_title(htmeta.title.fillna(''))
    htmeta = htmeta[htmeta.simple_title.isin(cleaned_title_overlap)]
    print(i, htmeta.shape[0])
    htmeta.to_csv(outht, mode='a' if i > 0 else 'w', compression='gzip')

0 5028
1 5477
2 10739
3 9488
4 9082
5 10084
6 6286
7 5274
8 5658
9 9218
10 7678


11 8259
12 6186
13 8883
14 6838
15 7957
16 3782
17 3312
18 5181
19 2497
20 1577
21 5426
22 7763
23 8030
24 5954
25 5683
26 11217
27 18635
28 12001
29 5801
30 7462
31 4105
32 4840
33 4221
34 5886
35 17028
36 2714
37 6321
38 10043
39 3588
40 7810
41 4427
42 6287
43 5786
44 5308
45 2798
46 2299
47 6398
48 7777
49 5424
50 3074
51 1439
52 2616
53 4327
54 4912
55 7768


56 7141
57 3900


58 4154
59 5965
60 5026
61 4287
62 3204
63 3339
64 2615
65 9543
66 5223
67 3666
68 3415


At this point, we have the Goodreads book data with titles and authors that may be in HT, and vice versa. Those separate filters were the computationally cheaper step.

Now that we have smaller datasets, we can align the records where the title and author match *together*.
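
A toy illustration of why the joint match matters, with made-up records: an author and a title can each appear in both datasets without belonging to the same book. Joining on a combined author and title key keeps only true co-occurrences.

```python
# Made-up (author, simple_title) pairs for illustration only
gr_books = [
    ('Himes, Chester B', 'lonelycrusade'),
    ('Williams, Charles', 'mississippi'),
]
ht_books = [
    ('Himes, Chester B', 'lonelycrusade'),
    ('Walton, Anthony', 'mississippi'),
    ('Williams, Charles', 'theplaceofthelion'),
]

def make_code(author, title):
    """Combine author and title into a single lookup key."""
    return author + '__' + title

ht_authors = {author for author, _ in ht_books}
ht_titles = {title for _, title in ht_books}

# Separate filters: author seen in HT and title seen in HT
separate = [book for book in gr_books if book[0] in ht_authors and book[1] in ht_titles]

# Joint filter: the author and title must occur together
ht_codes = {make_code(author, title) for author, title in ht_books}
joint = [book for book in gr_books if make_code(*book) in ht_codes]

print(len(separate), len(joint))
```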

In [None]:
sum([df.shape[0] for df in pd.read_csv(outht, compression='gzip', chunksize=100000)])

425198

In [None]:
htdf = pd.read_csv(outht, compression='gzip')
gtdf = pd.read_csv(outf, compression='gzip')

# Combine author + title into a single string, for simplicity
gtdf['code'] = gtdf['author_formatted'] + '__' + gtdf['simple_title']
htdf['code'] = htdf['clean_author'] + '__' + htdf['simple_title']

# Dictionaries are fast for lookups
book_id_code_ref = gtdf.set_index('book_id').code.to_dict()
htid_code_ref = htdf.set_index('htid').code.to_dict()

# Do an inner join, to only keep where the codes overlap
gtdf = gtdf.merge(htdf['code'].drop_duplicates(),
                  how='inner', on='code')

htdf = htdf.merge(gtdf['code'].drop_duplicates(),
                  how='inner', on='code')

# Overwrite earlier files
gtdf.to_csv(outf, mode='w', compression='gzip')
htdf.to_csv(outht, mode='w', compression='gzip')

In [None]:
import json
# Parse string of similar works into an actual list, and filter
# to remove book_ids that are no longer in the dataset
gtdf.book_id = gtdf.book_id.astype(int)
unique_bookids = set(gtdf.book_id)
def parse_list(x):
    if x == "[]":
        return []
    else:
        l = x[1:-1].split(', ')
        return [int(y[1:-1]) for y in l]

def filter_bookids(x):
    m = set(x).intersection(unique_bookids)
    return list(m)

gtdf.similar_books = gtdf.similar_books.apply(parse_list).apply(filter_bookids)
# Drop any rows without recommendations
gtdf = gtdf[gtdf['similar_books'].apply(lambda x:len(x)) > 0]
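
An alternative way to parse the stringified lists is `ast.literal_eval`, sketched below. This assumes the column was written as a Python list literal of quoted ids, which is what the slicing in `parse_list` implies. The helper name `parse_id_list` and the sample string are made up.

```python
import ast

def parse_id_list(text):
    """Parse a string such as "['12', '34']" into a list of ints."""
    values = ast.literal_eval(text)
    return [int(value) for value in values]

# Ids still present after the inner join (made-up subset)
known_ids = {143226, 1732118}

raw = "['143226', '999', '1732118']"
parsed = parse_id_list(raw)
kept = [book_id for book_id in parsed if book_id in known_ids]

print(parsed)
print(kept)
```

Unlike the set intersection in `filter_bookids`, the list comprehension keeps the original order of the recommendations.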

In [42]:
# Combine lists of similar_books, parse to use author+title codes. Also combine book_ids
def concat(group):
    sim_books = [book_id for ids in group.similar_books.tolist() for book_id in ids]
    sim_books = list(set(sim_books))
    sim_book_codes = [book_id_code_ref[str(book_id)] for book_id in sim_books]
    return pd.Series({'similar_books':sim_book_codes, 'book_ids': group.book_id.tolist()})

by_code = gtdf.reset_index().groupby(['code'])[['book_id', 'similar_books']].apply(concat)

In [43]:
htid_by_code = htdf.groupby('code')['htid'].apply(lambda x: x.tolist())
by_code = by_code.merge(htid_by_code, left_index=True, right_index=True)
by_code['similar_htids'] = by_code.similar_books.apply(lambda x: list(set([l for z in [htid_by_code[y] for y in x] for l in z])))
by_code.sample(10)

Unnamed: 0_level_0,similar_books,book_ids,htid,similar_htids
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Williams, Charles__theplaceofthelion","[Carpenter, Humphrey__theinklings]","[143226, 1732118]","[mdp.39015066681407, uc1.$b391206, mdp.3901500...","[mdp.39015002385162, mdp.39015002385006]"
"Walton, Anthony__mississippi","[Barfield, Owen__historyinenglishwords, Lerer,...",[178784],"[mdp.39015037261107, uva.x002712367]","[mdp.39076006122779, inu.30000037449935, mdp.3..."
"Tindall, Blair__mozartinthejungle","[Piston, Walter__harmony]","[34684275, 24998, 24752265, 19793474, 1473248]",[mdp.39015061459205],[uc1.b4325083]
"Mayer, Mercer__justgrandpaandme","[Piven, Hanoch__mydogisassmellyasdirtysocks, D...",[633709],[pst.000032698923],"[pst.000033005577, pst.000061597174]"
"Efremov, Ivan Antonovich__andromedaaspaceagetale","[Tolstoy, Aleksey Nikolayevich__aelita, Harris...",[26823107],"[mdp.39015046367887, mdp.39015038020460, uc1.$...","[pst.000000948081, uc1.b3462022, uiug.30112093..."
"Fiell, Charlotte__1000chairs","[Macaulay, David__buildingbig]",[1083029],[mdp.39015056309894],[mdp.39015049724571]
"Hyde, Catherine Ryan__electricgod","[Howatch, Susan__thewonderworker, Tiffany, Car...","[217450, 16124215]",[mdp.49015003325322],"[uva.x006112947, inu.30000103141119, mdp.39015..."
"Proust, Marcel__thecompleteshortstoriesofmarcelprou","[Beckett, Samuel__proust, Deleuze, Gilles__pro...","[1770405, 28394]","[mdp.39015050755985, mdp.39015070703452]","[mdp.39015005323111, mdp.39015008690995, mdp.3..."
"Duras, Marguerite__lapluiedété","[Cendrars, Blaise__gold, D'Aguiar, Fred__feedi...",[1147388],[mdp.39015017018352],"[mdp.39015008513734, uc1.b3757121, mdp.3901506..."
"Newman, John Henry__apologiaprovitasua","[Belloc, Hilaire__thegreatheresies, Adam, Karl...","[43958, 1432861, 982471, 20613395, 18187394, 3...","[uc2.ark:/13960/t39z92w3d, mdp.39015008498209,...","[pst.000023794283, uc1.$b51909, uva.x000531104..."


## Save Data

This data fits better as JSON, because of the lists.

In [44]:
import gzip
simspath = os.path.join(GRANT_FOLDER, 'good_reads_sims.json.gz')
with gzip.GzipFile(simspath, 'w') as fout:
    fout.write(by_code.reset_index().to_json(orient='records', lines=True).encode('utf-8'))
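
A small sketch of why line-delimited JSON suits list-valued columns: the lists come back as lists after a round trip, with no string parsing needed. It uses only the standard library, a temporary file, and one record taken from the sample output above.

```python
import gzip
import json
import os
import tempfile

records = [{
    'code': 'Fiell, Charlotte__1000chairs',
    'htid': ['mdp.39015056309894'],
    'similar_htids': ['mdp.39015049724571'],
}]
path = os.path.join(tempfile.mkdtemp(), 'sims_demo.json.gz')

# Write one JSON object per line
with gzip.open(path, 'wt', encoding='utf-8') as fout:
    for record in records:
        fout.write(json.dumps(record) + '\n')

# Read the lines back into dictionaries
with gzip.open(path, 'rt', encoding='utf-8') as fin:
    loaded = [json.loads(line) for line in fin]

print(type(loaded[0]['htid']).__name__)
```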

Potentially also useful: the data exploded into all left/right pairs of HTIDs.

In [75]:
import gzip
import json
with gzip.open('pairwise_gr_stats.json.gz', mode='w') as f:
    # One JSON record per line, so no tab-separated header row is written
    for i, (ind, row) in enumerate(by_code.iterrows()):
        try:
            pairs = [(htid, htid2) for htid in row['htid'] for htid2 in row['similar_htids']]
            for pair in pairs:
                out = json.dumps(dict(left=pair[0], right=pair[1], relationship='GRSIM')) +'\n'
                f.write(out.encode('utf-8'))
        except Exception as err:
            # Report the code of the row that failed, rather than a possibly undefined pair
            print('Error with', ind, err)
        if i % 1000 == 0:
            print(i)

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
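
The pairing step is a Cartesian product of a work's own HTIDs with the HTIDs of its similar works. A minimal sketch with `itertools.product`, using made-up identifiers:

```python
import json
from itertools import product

# One made-up row: two scans of a work, three scans of similar works
row = {'htid': ['a.1', 'a.2'], 'similar_htids': ['b.1', 'b.2', 'b.3']}

pairs = list(product(row['htid'], row['similar_htids']))
lines = [
    json.dumps({'left': left, 'right': right, 'relationship': 'GRSIM'})
    for left, right in pairs
]

print(len(pairs))
print(lines[0])
```

The number of pairs grows with the product of the two list lengths, so works with many scans contribute disproportionately many rows.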


If trimming the set downstream, sample then sort by left, rather than leaving unordered.
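
One possible reading of that note, sketched with made-up pairs: draw a random subset first, then sort the subset by its left-hand id so rows sharing a left volume sit next to each other.

```python
import random

# Made-up (left, right) pairs for illustration
pairs = [('c', 'x'), ('a', 'y'), ('b', 'z'), ('a', 'x'), ('d', 'w')]

rng = random.Random(0)             # fixed seed so the subset is reproducible
subset = rng.sample(pairs, k=3)    # sample without replacement
subset.sort(key=lambda pair: pair[0])  # then order by the left id

print(subset)
```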

#### Workspace

Reloading the data.

In [45]:
import os
import pandas as pd
simspath = os.path.join(GRANT_FOLDER, 'good_reads_sims.json.gz')
df = pd.read_json(simspath, compression='gzip', lines=True)

In [46]:
# These are the HTIDs that need to be processed for training
unique_htids = set([htid for htids in df.htid for htid in htids] + [htid for htids in df.similar_htids for htid in htids])
len(unique_htids)

91655

In [47]:
pd.Series(list(unique_htids)).sample(frac=1).to_csv(os.path.join(GRANT_FOLDER, 'goodreads_htids.csv'), index=False)