<a href="https://colab.research.google.com/github/jmgold/DEI-Collections/blob/main/Flexible_holdings_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Jeremy Goldstein
Minuteman Library Network

This script compares an Excel or CSV file of local holdings against a curated booklist to determine how much of the list you own and which titles are missing. Matching is based on fuzzy comparison of titles and authors.

# Step 1: Configure Python/Colab Environment


nmslib is not included with Colab by default and must be installed.

If you are using files saved to Google Drive, the drive must first be mounted to give Colab access.

In [1]:
!pip install nmslib



Import the Python libraries used in the script.

In [2]:
import pandas as pd
import io
import numpy as np
import os
import re
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer
%load_ext google.colab.data_table
from google.colab import files

Mount Google Drive if the DEI list file is stored there.
You can also mount the drive manually from the file directory menu along the left-hand side of the screen.
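
The mount step is described here without a code cell. A minimal sketch of what such a cell might look like, assuming the standard `google.colab.drive` helper and the conventional `/content/drive` mount point. The `mount_drive` wrapper is a hypothetical name, not part of the original notebook, and it returns `None` when run outside Colab:

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; do nothing elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return None
    drive.mount(mount_point)  # asks for authorization the first time
    return mount_point

mounted_at = mount_drive()
print(mounted_at)
```

Mounting is only needed when reading files from Drive; the upload cells in Step 2 work without it.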

# Step 2: Import data
The script loads tabular data into two dataframes using the pandas library for Python.  

Upload either a CSV or Excel file containing the booklist.  

The file must contain at least two columns with the headers author and title (capitalization does not matter).
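
Once a file has been read into a dataframe, the header requirement could be checked before the clean-up step runs. The following is a sketch only; `REQUIRED_COLUMNS` and `missing_required_columns` are illustrative names that do not appear in the notebook, and the example row is taken from the booklist preview:

```python
import pandas as pd

REQUIRED_COLUMNS = {"author", "title"}

def missing_required_columns(df):
    """Return the required headers absent from df, ignoring capitalization."""
    present = {str(col).lower() for col in df.columns}
    return sorted(REQUIRED_COLUMNS - present)

example_df = pd.DataFrame({
    "Title": ["Woven in Moonlight"],
    "AUTHOR": ["Ibañez, Isabel"],
    "isbn": [9781624148019],
})
print(missing_required_columns(example_df))
print(missing_required_columns(example_df.drop(columns=["AUTHOR"])))
```

An empty list means both headers were found; anything else names the headers that need to be added or renamed in the source file.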

In [3]:
uploaded_list = files.upload()

for fn in uploaded_list.keys():
  file_name=fn
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded_list[fn])))

Saving Teen Latinx Titles.xlsx to Teen Latinx Titles (1).xlsx
User uploaded file "Teen Latinx Titles.xlsx" with length 27814 bytes


Run the cell that matches the uploaded file type, CSV or Excel.

In [None]:
#use if csv
booklist_df = pd.read_csv(io.BytesIO(uploaded_list[file_name]))

In [4]:
#use if excel
booklist_df = pd.read_excel(io.BytesIO(uploaded_list[file_name]))
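
As an alternative to choosing between the two cells above, the reader could be picked from the file extension. This is a sketch under that assumption; `read_uploaded_table` is a hypothetical helper, and reading Excel files still depends on an Excel engine such as openpyxl being installed:

```python
import io
import os
import pandas as pd

def read_uploaded_table(file_name, content):
    """Read uploaded bytes as CSV or Excel, chosen by file extension."""
    extension = os.path.splitext(file_name)[1].lower()
    buffer = io.BytesIO(content)
    if extension == ".csv":
        return pd.read_csv(buffer)
    if extension in (".xlsx", ".xls"):
        return pd.read_excel(buffer)
    raise ValueError(f"Unsupported file type: {extension!r}")

# Illustrative bytes standing in for an uploaded CSV file.
sample_bytes = b'title,author\nWith the Fire on High,"Acevedo, Elizabeth"\n'
sample_df = read_uploaded_table("sample.csv", sample_bytes)
print(sample_df)
```

In the notebook this could be called as `read_uploaded_table(file_name, uploaded_list[file_name])`.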

Clean up the data in the table and add a BooklistMatch column for use later.

In [5]:
#change null to blank
booklist_df=booklist_df.fillna('')
#change headers to lower case
booklist_df.columns = [x.lower() for x in booklist_df.columns]
#save list of column headers for use at the end
booklist_headers = booklist_df.columns.tolist()
#create match point column
booklist_df['BooklistMatch']=booklist_df['author']+booklist_df['title']

Preview the contents of this dataframe.  
You can use the Filter button to further explore the data.

In [6]:
#Preview booklist dataframe
booklist_df

Unnamed: 0,title,author,isbn,format,publisher,pubdate,unnamed: 6,BooklistMatch
0,¿sabes Quién Es Zapata? / Do You Know Who Zapa...,"Leyva, Amaranta",9786073183574,Paperback,Alfaguara Infantil,2020-06-23,12.95,"Leyva, Amaranta¿sabes Quién Es Zapata? / Do Yo..."
1,21: The Story of Roberto Clemente,"Santiago, Wilfred",9781606997758,Paperback,Fantagraphics Books,2014-09-21,19.99,"Santiago, Wilfred21: The Story of Roberto Clem..."
2,A Cuban Girl's Guide to Tea and Tomorrow,"Namey, Laura Taylor",9781534471245,Hardcover,Atheneum Books for Young Readers,2020-11-10,18.99,"Namey, Laura TaylorA Cuban Girl's Guide to Tea..."
3,Absolute Carnage: Miles Morales,"Ahmed, Saladin",9781302920142,Paperback,Marvel,2020-01-28,15.99,"Ahmed, SaladinAbsolute Carnage: Miles Morales"
4,Albert Pujols (Beisbol! Latino Heroes of Major...,"Leventhal, Josh",9781680720495,Hardcover,Black Rabbit Books/Bolt,2017-09-01,32.80,"Leventhal, JoshAlbert Pujols (Beisbol! Latino ..."
...,...,...,...,...,...,...,...,...
224,With the Fire on High,"Acevedo, Elizabeth",9780062662835,Hardcover,Quill Tree Books,2019-05-07,17.99,"Acevedo, ElizabethWith the Fire on High"
225,Woven in Moonlight,"Ibañez, Isabel",9781624148019,Hardcover,Page Street Kids,2020-01-07,18.99,"Ibañez, IsabelWoven in Moonlight"
226,Written in Starlight,"Ibañez, Isabel",9781645671329,Hardcover,Page Street Kids,2021-01-26,18.99,"Ibañez, IsabelWritten in Starlight"
227,Yaqui Delgado Wants to Kick Your Ass,"Medina, Meg",9780763671648,Paperback,Candlewick Press (MA),2014-08-26,8.99,"Medina, MegYaqui Delgado Wants to Kick Your Ass"


Upload an Excel or CSV file containing the titles in your holdings.

The file must contain at least two columns with the headers author and title (capitalization does not matter).

In [11]:
uploaded = files.upload()

for fn in uploaded.keys():
  file_name=fn
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving ntny holdings.csv to ntny holdings (1).csv
User uploaded file "ntny holdings.csv" with length 1072347 bytes


Run the cell that matches the uploaded file type, CSV or Excel.

In [12]:
#use if csv
holdings_df = pd.read_csv(io.BytesIO(uploaded[file_name]))

In [None]:
#use if excel
holdings_df = pd.read_excel(io.BytesIO(uploaded[file_name]))

In [13]:
#change null to blank
holdings_df=holdings_df.fillna('')
#change headers to lower case
holdings_df.columns = [x.lower() for x in holdings_df.columns]
#save list of column headers for use at the end
holdings_headers = holdings_df.columns.tolist()
#Create MatchPoint field
holdings_df['HoldingsMatch']=holdings_df['author']+holdings_df['title']

Preview holdings data.

**Note:** If the file contains more than 20,000 rows, the preview will be displayed differently from the prior preview and will not include the browse features.

In [14]:
#Preview data
holdings_df

Unnamed: 0,title,author,record #(biblio),008 date one,unnamed: 4,HoldingsMatch
0,The Illyrian adventure / Lloyd Alexander.,"Alexander, Lloyd.",b10366945,1986,,"Alexander, Lloyd.The Illyrian adventure / Lloy..."
1,Centaur aisle / Piers Anthony.,"Anthony, Piers, author.",b1041518x,1982,,"Anthony, Piers, author.Centaur aisle / Piers A..."
2,The fighting ground / by Avi.,"Avi, 1937-",b10464396,1987,,"Avi, 1937-The fighting ground / by Avi."
3,The true confessions of Charlotte Doyle / Avi ...,"Avi, 1937-",b10464724,1990,,"Avi, 1937-The true confessions of Charlotte Do..."
4,Charley Skedaddle / Patricia Beatty.,"Beatty, Patricia, 1922-1991",b10548464,1987,,"Beatty, Patricia, 1922-1991Charley Skedaddle /..."
...,...,...,...,...,...,...
10311,"Chainsaw man. 2, Chainsaw vs. Bat / story & ar...","Fujimoto, Tatsuki, author, artist.",b40418327,2021,,"Fujimoto, Tatsuki, author, artist.Chainsaw man..."
10312,Trapped! : cages of mind and body / edited by ...,,b40418364,1999,,Trapped! : cages of mind and body / edited by ...
10313,Elizabeth Webster and the chamber of stolen gh...,"Lashner, William, author.",b40419472,2021,,"Lashner, William, author.Elizabeth Webster and..."
10314,Moriarty the Patriot. 5 : based on the works o...,"Takeuchi, Ryosuke, author.",b40434151,2021,,"Takeuchi, Ryosuke, author.Moriarty the Patriot..."



# Step 3: Calculate Matches Dataframe
For each holdings row, find the closest booklist entry and calculate a match confidence value for the pair.

The matching algorithm is taken from [Fuzzy Matching at Scale by Josh Taylor](https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536) (viewed 11/24/2021).
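
To make the idea concrete, here is a small sketch of the same scoring approach without the nmslib index: strings are split into character trigrams, weighted with TF-IDF, and compared by dot product. Because `TfidfVectorizer` L2-normalizes each row by default, that dot product is the cosine similarity of the two vectors. The `char_ngrams` helper is a simplified stand-in for the notebook's `ngrams` function, and the sample strings are illustrative, modelled on the previews above:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def char_ngrams(text, n=3):
    """Lower-case, collapse whitespace, pad with spaces, split into n-grams."""
    text = re.sub(r"\s+", " ", str(text).lower()).strip()
    text = f" {text} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

booklist = [
    "Medina, MegYaqui Delgado Wants to Kick Your Ass",
    "Acevedo, ElizabethWith the Fire on High",
]
holdings = [
    "Acevedo, Elizabeth, author.With the fire on high / Elizabeth Acevedo.",
    "Avi, 1937-The fighting ground / by Avi.",
]

# Learn the trigram vocabulary from the booklist, then project holdings onto it.
vectorizer = TfidfVectorizer(analyzer=char_ngrams)
booklist_vecs = vectorizer.fit_transform(booklist)
holdings_vecs = vectorizer.transform(holdings)

# Rows are holdings entries, columns are booklist entries.
scores = (holdings_vecs @ booklist_vecs.T).toarray()
best = scores.argmax(axis=1)
for i, j in enumerate(best):
    print(f"{holdings[i][:30]!r} -> {booklist[j][:30]!r} ({scores[i, j]:.3f})")
```

This brute-force comparison scores every pair, which is fine for a handful of strings. The notebook instead builds an nmslib index so that the nearest booklist entry can be looked up for each holdings row without materializing the full score matrix.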

In [15]:
def ngrams(string, n=3):
    """Takes an input string, cleans it and converts to ngrams. """
    string = str(string)
    string = string.lower() # lower case
    #string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    chars_to_remove = [")","(",".","|","[","]","{","}","'","-"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']' #remove punc, brackets etc...
    string = re.sub(rx, '', string)
    string = re.sub(' +',' ',string).strip() # get rid of multiple spaces and replace with a single
    string = ' '+ string +' ' # pad names for ngrams...
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]


###FIRST TIME RUN - takes about 5 minutes... used to build the matching table
##### Create a list of items to match here:
#change from original doing cast instead of unique...unsure if can do both
booklist_match = list(booklist_df["BooklistMatch"].unique()) #unique author+title strings from the booklist
#Building the TFIDF off the clean dataset - takes about 5 min
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(booklist_match)

##### Create a list of messy items to match here:
holdings_match = list(holdings_df["HoldingsMatch"].unique()) #unique author+title strings from the holdings

#Creation of vectors for the messy names

# #FOR LOADING ONLY - only required if items have been saved previously
# vectorizer = pickle.load(open("Data/vectorizer.pkl","rb"))
# tf_idf_matrix = pickle.load(open("Data/Comp_tfidf.pkl","rb"))
# org_names = pickle.load(open("Data/Comp_names.pkl","rb"))

messy_tf_idf_matrix = vectorizer.transform(holdings_match)

# use the booklist TF-IDF matrix as the data to index
data_matrix = tf_idf_matrix

# Index parameters carried over from the source article
# M and efC are defined here but are not passed to the index created below
M = 80
efC = 1000

num_threads = 4 # adjust for the number of threads
# Initialize the library, specify the space, the type of the vector and add data points 
index = nmslib.init(method='simple_invindx', space='negdotprod_sparse_fast', data_type=nmslib.DataType.SPARSE_VECTOR) 

index.addDataPointBatch(data_matrix)
# Create an index
index.createIndex() 


# Number of neighbors to return for each holdings entry
K=1
query_matrix = messy_tf_idf_matrix
query_qty = query_matrix.shape[0]
nbrs = index.knnQueryBatch(query_matrix, k = K, num_threads = num_threads)

In [16]:
mts =[]
for i in range(len(nbrs)):
  original_nm = holdings_match[i]
  try:
    matched_nm   = booklist_match[nbrs[i][0][0]]
    conf         = nbrs[i][1][0]
  except IndexError: # no neighbor was returned for this entry
    matched_nm   = "no match found"
    conf         = None
  mts.append([original_nm,matched_nm,conf])

mts = pd.DataFrame(mts,columns=['holdings_match','booklist_match','conf'])
#change negative values to positive for ease of reading
mts['conf'] = mts['conf'].abs()

# Step 4: Determine matches

In the table below you can browse the results to see the match confidence scores assigned to each pair.

Look at the values in the 'conf' column to find the point at which you feel the entries are matched correctly. We generally recommend starting with values between .5 and .7. To help with this, the preview below is limited to that range of values.
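
When several cutoffs look plausible, it can help to see how many pairs each one would keep. The following sketch assumes a `conf` column like the one in `mts`; the `count_at_cutoffs` helper and the sample scores are made up for illustration:

```python
import pandas as pd

def count_at_cutoffs(conf, cutoffs=(0.5, 0.55, 0.6, 0.65, 0.7)):
    """Count how many scores are at or above each candidate cutoff."""
    return {cutoff: int((conf >= cutoff).sum()) for cutoff in cutoffs}

# Made-up scores; the missing value stands in for "no match found".
example_conf = pd.Series([0.31, 0.52, 0.58, 0.66, 0.71, 0.93, None], dtype="float64")
counts = count_at_cutoffs(example_conf)
for cutoff, kept in counts.items():
    print(f"conf >= {cutoff}: {kept} matched pairs")
```

Missing scores never satisfy the comparison, so unmatched rows are excluded from every count. In the notebook the call would be `count_at_cutoffs(mts['conf'])`.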

In [17]:
#.66 used as default match confidence value based on initial testing, will be overwritten in step below
match_conf = .66

mts = mts.sort_values(by=['conf'])
mts[mts['conf'].between(.5,.7)]

Unnamed: 0,holdings_match,booklist_match,conf
7543,"de la Peña, Matt, author.Superman : dawnbreake...","de la Peña, MattWe Were Here",0.500979
8385,"Lund, Natalie, author.We speak in storms / Nat...","Lund, NatalieThe Sky Above Us",0.507014
2793,"Bendis, Brian Michael.Ultimate comics Spider-M...","Bendis, Brian MichaelMiles Morales: Ultimate S...",0.512222
6998,"Albertalli, Becky, author.Leah on the offbeat ...","Albertalli, BeckyWhat If It's Us",0.517369
9479,"Thomas, Aiden, author.Lost in the Never Woods ...","Thomas, AidenCemetery Boys",0.518868
998,"Bendis, Brian Michael.Secrets & lies / writer,...","Bendis, Brian MichaelSpider-Man: Miles Morales...",0.520487
3388,"Bendis, Brian Michael.Spider-Men / Brian Micha...","Bendis, Brian MichaelSpider-Men: Worlds Collide",0.522696
7784,"Reynolds, Justin A., author.Opposite of always...","Reynolds, Justin AEarly Departures Easy Reads",0.523925
9534,"Albertalli, Becky, author.Kate in waiting / Be...","Albertalli, BeckyHere's to Us - Street Smart",0.530599
9839,"Mejia, Tehlor Kay, author.Paola Santiago and t...","Mejia, Tehlor KayMiss Meteor",0.530789


Enter the confidence value you wish to use for determining correct matches. Any value greater than or equal to the number you enter will be considered a match.

In [18]:
match_conf = float(input("Select confidence value for determining correct matches: "))

Select confidence value for determining correct matches: .559
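
The `float(input(...))` call raises an error if the entry is not a number. A sketch of a more forgiving variant follows, assuming scores lie between 0 and 1 as in the preview and reusing the notebook's .66 default as the fallback; `parse_confidence` is a hypothetical helper name:

```python
def parse_confidence(text, default=0.66):
    """Convert user input to a cutoff in [0, 1]; return default if invalid."""
    try:
        value = float(text.strip())
    except ValueError:
        return default
    if not 0.0 <= value <= 1.0:
        return default  # also rejects nan and inf
    return value

for entry in (".559", "abc", "1.5"):
    print(repr(entry), "->", parse_confidence(entry))
```

It could be used as `match_conf = parse_confidence(input("Select confidence value for determining correct matches"))`.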


**Percentage of Titles from list that are in your collection**

In [19]:
pct_held = (len(mts.loc[mts['conf'] >= match_conf]) / len(booklist_df)) * 100
print('You own '+str(round(pct_held, 2))+'% of the titles in the booklist')

You own 46.72% of the titles in the booklist
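
Note that the calculation above counts matched holdings entries, so a booklist title matched by several distinct holdings entries (for example, different editions) is counted more than once. A sketch of an alternative that counts distinct booklist entries instead; `pct_of_booklist_held` and the sample data are illustrative and not part of the original notebook:

```python
import pandas as pd

def pct_of_booklist_held(mts, booklist_size, match_conf):
    """Percentage of booklist entries with at least one holdings match."""
    matched = mts.loc[mts["conf"] >= match_conf, "booklist_match"]
    return matched.nunique() / booklist_size * 100

# Two holdings entries match "Title A"; it should be counted once.
example_mts = pd.DataFrame({
    "holdings_match": ["hardcover edition", "paperback edition", "another book", "unrelated"],
    "booklist_match": ["Title A", "Title A", "Title B", "Title C"],
    "conf": [0.90, 0.80, 0.70, 0.20],
})
pct_example = pct_of_booklist_held(example_mts, booklist_size=4, match_conf=0.559)
print(pct_example)
```

In this example, counting rows would report three matches out of four titles, while counting distinct booklist entries reports two out of four.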


Merge the holdings dataframe with the instances where a match has been found

In [20]:
found_results = holdings_df.reset_index().merge(mts.loc[mts['conf'] >= match_conf], left_on='HoldingsMatch', right_on='holdings_match').set_index('index')
#found_results = found_results.rename(columns=str.capitalize)

**Titles Found in Your Collection**

In [21]:
found_results[holdings_headers]

index,title,author,record #(biblio),008 date one,unnamed: 4
370,Before we were free / Julia Alvarez.,"Alvarez, Julia.",b21014814,2002,
902,Behind the eyes / Francisco X. Stork.,"Stork, Francisco X.",b24103214,2006,
1528,Mexican whiteboy / Matt de la Peña.,"de la Peña, Matt.",b2603721x,2008,
1675,Marcelo in the real world / Francisco X. Stork.,"Stork, Francisco X.",b26381941,2009,
2449,Marcelo in the real world / Francisco X. Stork.,"Stork, Francisco X.",b28911337,2011,
...,...,...,...,...,...
10105,Summer in the city of roses / Michelle Ruiz Keil.,"Keil, Michelle Ruiz, author.",b40185813,2021,
10113,How Moon Fuentez fell in love with the univers...,"Vasquez Gilliland, Raquel, author.",b40185916,2021,
10126,When we make it / Elisabet Velasquez.,"Velasquez, Elisabet, author.",b4018612x,2021,
10137,Our way back to always / Nina Moreno.,"Moreno, Nina (Young adult fiction writer), aut...",b40186295,2021,


Download results to an Excel file

In [None]:
found_results[holdings_headers].to_excel("found_results.xlsx")
files.download('/content/found_results.xlsx')

Create a dataframe named missing that contains the titles from the booklist that were not matched to your holdings.

In [22]:
booklist_found = booklist_df.reset_index().merge(mts.loc[mts['conf'] >= match_conf], left_on='BooklistMatch', right_on='booklist_match').set_index('index')
common = booklist_df.merge(booklist_found, how='outer', left_index=True, right_index=True)
common = common[common[['conf']].notna().all(axis=1)]
missing = booklist_df.merge(common, how='outer', left_index=True, right_index=True)
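
The merge chain above, together with the null filter in the next cell, keeps the booklist rows that have no match at or above the cutoff. A shorter way to express that selection is sketched below using `isin`; it is intended to select the same rows but is not the notebook's own method, and `titles_not_held` and the sample data are hypothetical:

```python
import pandas as pd

def titles_not_held(booklist_df, mts, match_conf):
    """Return booklist rows whose match key has no match at or above match_conf."""
    matched_keys = set(mts.loc[mts["conf"] >= match_conf, "booklist_match"])
    return booklist_df[~booklist_df["BooklistMatch"].isin(matched_keys)]

example_booklist = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C"],
    "author": ["Author A", "Author B", "Author C"],
})
example_booklist["BooklistMatch"] = example_booklist["author"] + example_booklist["title"]

# Title A is matched confidently, Title B only weakly, Title C not at all.
example_matches = pd.DataFrame({
    "booklist_match": ["Author ATitle A", "Author BTitle B"],
    "conf": [0.90, 0.30],
})
missing_example = titles_not_held(example_booklist, example_matches, match_conf=0.559)
print(missing_example["title"].tolist())
```

Selecting `[booklist_headers]` from the result would drop the helper column, as the notebook does before display.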

**Titles Not in Your Collection**

In [23]:
missing = missing[missing['conf'].isnull()][booklist_headers]
missing

Unnamed: 0,title,author,isbn,format,publisher,pubdate,unnamed: 6
0,¿sabes Quién Es Zapata? / Do You Know Who Zapa...,"Leyva, Amaranta",9786073183574,Paperback,Alfaguara Infantil,2020-06-23,12.95
3,Absolute Carnage: Miles Morales,"Ahmed, Saladin",9781302920142,Paperback,Marvel,2020-01-28,15.99
4,Albert Pujols (Beisbol! Latino Heroes of Major...,"Leventhal, Josh",9781680720495,Hardcover,Black Rabbit Books/Bolt,2017-09-01,32.80
5,Alexandria Ocasio-Cortez: Political Headliner ...,"Leigh, Anna",9781541588875,Paperback,Lerner Publications (Tm),2020-01-01,10.99
6,Alicia Alonso: Prima Ballerina,"Bernier-Grand, Carmen",9781477810743,Paperback,Two Lions,2019-01-15,9.99
...,...,...,...,...,...,...,...
207,Trish Trash #3 (Trish Trash Graphic Novels #3),"Abel, Jessica",9781545800164,Hardcover,Super Genius,2018-12-04,15.99
208,Undocumented: A Worker's Fight,"Tonatiuh, Duncan",9781419728549,Hardcover,Abrams Comicarts,2018-08-07,19.99
209,Unearthed: A Jessica Cruz Story,"Rivera, Lilliam",9781779500519,Paperback,DC Comics,2021-09-14,16.99
217,Welcome to Wanderland,"Ball, Jackie",9781684154722,Paperback,Boom Box,2019-09-24,14.99


Download missing titles to an Excel file

In [None]:
missing.to_excel("missing_results.xlsx")
files.download('/content/missing_results.xlsx')