<a href="https://colab.research.google.com/github/jmgold/DEI-Collections/blob/main/Flexible_holdings_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Jeremy Goldstein
Minuteman Library Network

This Python script compares an Excel or CSV file of local holdings against a curated booklist to find how much of that list you own and which titles are missing, using fuzzy matching on titles and authors.

# Instructions

**Prerequisite**

You will need two files, each containing at least title and author information.  Each file can be a CSV or Excel file (either .xls or .xlsx is fine).  Titles and authors must each be in their own column, with headers of 'title' and 'author' (capitalization does not matter).  The files can contain as many additional columns as you wish, so long as each one has its own column header.
The first file is the list of titles you wish to check against your current holdings.  The second is a list of your current holdings.
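
As a quick illustration of the header requirement, the following minimal sketch checks a list of column headers for the two required names, ignoring capitalization. The function name `missing_required_columns` is hypothetical and is not part of the script.

```python
REQUIRED_COLUMNS = ("title", "author")

def missing_required_columns(headers):
    """Return the required column names that are absent from headers (case-insensitive)."""
    present = {str(h).lower() for h in headers}
    return [name for name in REQUIRED_COLUMNS if name not in present]

print(missing_required_columns(["Title", "AUTHOR", "ISBN"]))  # []
print(missing_required_columns(["Title", "ISBN"]))  # ['author']
```

An empty list means the file has both required headers.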

**Running the script**

Each code block can be run one at a time by clicking the play icon that appears when you hover over the [] marker.  Once that block finishes running a green check will appear to the left of the block and any output from that portion of the script will appear beneath it, along with any errors that may be encountered.

You may also use the items found in the Runtime drop-down menu to run the entire script.  Be on the lookout for points where user input is required before the script will continue past a particular code block.

You may reset the output by going to the Edit menu and selecting Clear all outputs.  You can also review files that have been uploaded or created as part of this script using the folder icon in the left-hand navigation menu.

# Step 1: Configure Python/Colab Environment


nmslib is not included by default with Colab and must be installed.

In [1]:
!pip install nmslib



Import the Python libraries that will be used within the script.

In [2]:
import pandas as pd
import io
import numpy as np
import os
import re
#import time
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix # may not be required 
from scipy.sparse import rand # may not be required
%load_ext google.colab.data_table
from google.colab import files

# Step 2: Import data
The script loads tabular data into two dataframes using the pandas library for Python.

Upload either a CSV or Excel file with the list of suggested titles you wish to match your holdings against.

Must contain at least 2 columns with the headers author and title (capitalization does not matter).

In [3]:
uploaded_list = files.upload()

for fn in uploaded_list.keys():
  file_name=fn
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded_list[fn])))

Saving Teen Latinx Titles.xlsx to Teen Latinx Titles (1).xlsx
User uploaded file "Teen Latinx Titles.xlsx" with length 27814 bytes


Load the uploaded file into a dataframe, or report an error if the file is in an incorrect format.

In [4]:
if file_name.endswith('.csv'):
  booklist_df = pd.read_csv(io.BytesIO(uploaded_list[file_name]))
elif file_name.endswith('.xls'):
  booklist_df = pd.read_excel(io.BytesIO(uploaded_list[file_name]))
elif file_name.endswith('.xlsx'):
  booklist_df = pd.read_excel(io.BytesIO(uploaded_list[file_name]))
else:
  raise ValueError("error file is not .csv or excel")
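
The checks above compare the file extension exactly as typed, so a file named `HOLDINGS.CSV` would fall through to the error branch. A minimal sketch of a case-insensitive alternative follows; `detect_format` is a hypothetical helper and is not part of the script.

```python
import os

def detect_format(file_name):
    """Classify a file name as 'csv' or 'excel' from its extension, ignoring case."""
    extension = os.path.splitext(file_name)[1].lower()
    if extension == ".csv":
        return "csv"
    if extension in (".xls", ".xlsx"):
        return "excel"
    raise ValueError("File is not .csv or Excel: " + file_name)

print(detect_format("Teen Latinx Titles.xlsx"))  # excel
print(detect_format("HOLDINGS.CSV"))  # csv
```

The returned label could then be used to choose between `pd.read_csv` and `pd.read_excel`.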

Clean up the data in the table and add a BooklistMatch column for use later.

In [5]:
#change null to blank
booklist_df=booklist_df.fillna('')
#change headers to lower case
booklist_df.columns = [x.lower() for x in booklist_df.columns]
#save list of column headers for use at the end
booklist_headers = booklist_df.columns.tolist()
#create match point column
booklist_df['BooklistMatch']=booklist_df['author']+booklist_df['title']

Preview the contents of this dataframe.  
You can use the Filter button to further explore the data.

In [6]:
#Preview booklist dataframe
booklist_df

Unnamed: 0,title,author,isbn,format,publisher,pubdate,unnamed: 6,BooklistMatch
0,¿sabes Quién Es Zapata? / Do You Know Who Zapa...,"Leyva, Amaranta",9786073183574,Paperback,Alfaguara Infantil,2020-06-23,12.95,"Leyva, Amaranta¿sabes Quién Es Zapata? / Do Yo..."
1,21: The Story of Roberto Clemente,"Santiago, Wilfred",9781606997758,Paperback,Fantagraphics Books,2014-09-21,19.99,"Santiago, Wilfred21: The Story of Roberto Clem..."
2,A Cuban Girl's Guide to Tea and Tomorrow,"Namey, Laura Taylor",9781534471245,Hardcover,Atheneum Books for Young Readers,2020-11-10,18.99,"Namey, Laura TaylorA Cuban Girl's Guide to Tea..."
3,Absolute Carnage: Miles Morales,"Ahmed, Saladin",9781302920142,Paperback,Marvel,2020-01-28,15.99,"Ahmed, SaladinAbsolute Carnage: Miles Morales"
4,Albert Pujols (Beisbol! Latino Heroes of Major...,"Leventhal, Josh",9781680720495,Hardcover,Black Rabbit Books/Bolt,2017-09-01,32.80,"Leventhal, JoshAlbert Pujols (Beisbol! Latino ..."
...,...,...,...,...,...,...,...,...
224,With the Fire on High,"Acevedo, Elizabeth",9780062662835,Hardcover,Quill Tree Books,2019-05-07,17.99,"Acevedo, ElizabethWith the Fire on High"
225,Woven in Moonlight,"Ibañez, Isabel",9781624148019,Hardcover,Page Street Kids,2020-01-07,18.99,"Ibañez, IsabelWoven in Moonlight"
226,Written in Starlight,"Ibañez, Isabel",9781645671329,Hardcover,Page Street Kids,2021-01-26,18.99,"Ibañez, IsabelWritten in Starlight"
227,Yaqui Delgado Wants to Kick Your Ass,"Medina, Meg",9780763671648,Paperback,Candlewick Press (MA),2014-08-26,8.99,"Medina, MegYaqui Delgado Wants to Kick Your Ass"


Upload an Excel or CSV file containing the titles in your holdings.

Must contain at least 2 columns with the headers author and title (capitalization does not matter)

In [7]:
uploaded_holdings = files.upload()

for fn in uploaded_holdings.keys():
  file_name=fn
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded_holdings[fn])))

Saving test holdings.xlsx to test holdings.xlsx
User uploaded file "test holdings.xlsx" with length 2118564 bytes


Load the uploaded file into a dataframe, or report an error if the file is in an incorrect format.

In [8]:
if file_name.endswith('.csv'):
  holdings_df = pd.read_csv(io.BytesIO(uploaded_holdings[file_name]))
elif file_name.endswith('.xls'):
  holdings_df = pd.read_excel(io.BytesIO(uploaded_holdings[file_name]))
elif file_name.endswith('.xlsx'):
  holdings_df = pd.read_excel(io.BytesIO(uploaded_holdings[file_name]))
else:
  raise ValueError("error file is not .csv or excel")

Clean up the data in the table and add a HoldingsMatch column for use later.

In [9]:
#remove null values
holdings_df=holdings_df.fillna('')
#change headers to lower case
holdings_df.columns = [x.lower() for x in holdings_df.columns]
#save list of column headers for use at the end
holdings_headers = holdings_df.columns.tolist()
#Create MatchPoint field
holdings_df['HoldingsMatch']=holdings_df['author']+holdings_df['title']

Preview holdings data.

**Note:** If the file contains more than 20,000 rows, the preview will be displayed differently from the prior preview and will not include the browse features.

In [10]:
#Preview data
holdings_df



Unnamed: 0,record num,title,author,publish year,HoldingsMatch
0,b16093768,Cracking the AP. U.S. history exam / the Princ...,,©1997-2019.,Cracking the AP. U.S. history exam / the Princ...
1,b18109196,Cracking the AP. Biology exam / the Princeton ...,,c1997-2019.,Cracking the AP. Biology exam / the Princeton ...
2,b18783983,Cracking the AP. European history / the Prince...,,c1999-2019.,Cracking the AP. European history / the Prince...
3,b21114870,Cracking the SSAT/ISEE / the Princeton Review.,,©1997-2019.,Cracking the SSAT/ISEE / the Princeton Review.
4,b22154358,Cracking the AP. World history exam / the Prin...,,c2004-2019.,Cracking the AP. World history exam / the Prin...
...,...,...,...,...,...
30167,b40498189,Designated survivor. Season 1 / produced by An...,,,Designated survivor. Season 1 / produced by An...
30168,b40498190,Designated survivor. Season 2 / produced by An...,,,Designated survivor. Season 2 / produced by An...
30169,b40498207,Designated survivor. Season 3 / produced by An...,,,Designated survivor. Season 3 / produced by An...
30170,b40504980,The diary of a young girl : the definitive edi...,"Frank, Anne, 1929-1945",,"Frank, Anne, 1929-1945The diary of a young gir..."



# Step 3: Calculate Matches Dataframe
Compare the two dataframes and calculate a match confidence value for each holdings entry against its closest booklist entry.

Matching algorithm is taken from [Fuzzy Matching at Scale by Josh Taylor](https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536) (viewed 11/24/2021)
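
To give a feel for the idea behind the matching step, here is a minimal sketch that splits two strings into character trigrams and compares them with cosine similarity. It is a simplification: it uses raw trigram counts rather than TF-IDF weights, and a direct pairwise comparison rather than an nmslib index. The names `trigrams` and `cosine` are hypothetical, and the sample strings are illustrative.

```python
import math
import re
from collections import Counter

def trigrams(text, n=3):
    """Lower-case, drop non-ASCII and some punctuation, pad, and split into character n-grams."""
    text = str(text).lower()
    text = text.encode("ascii", errors="ignore").decode()
    text = re.sub(r"[)(.|\[\]{}'-]", "", text)
    text = " " + re.sub(r" +", " ", text).strip() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cosine(a, b):
    """Cosine similarity between the trigram count vectors of two strings."""
    ca, cb = Counter(trigrams(a)), Counter(trigrams(b))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

booklist_key = "Albertalli, BeckyWhat If It's Us"
same_work = cosine(booklist_key, "Albertalli, Becky, author.What if it's us")
different_work = cosine(booklist_key, "Frank, Anne, 1929-1945The diary of a young girl")
print(same_work, different_work)
```

The first score should come out well above the second, which is the property the confidence threshold in Step 4 relies on.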

In [11]:
def ngrams(string, n=3):
    """Takes an input string, cleans it and converts to ngrams. """
    string = str(string)
    string = string.lower() # lower case
    #string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    chars_to_remove = [")","(",".","|","[","]","{","}","'","-"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']' #remove punc, brackets etc...
    string = re.sub(rx, '', string)
    string = re.sub(' +',' ',string).strip() # get rid of multiple spaces and replace with a single
    string = ' '+ string +' ' # pad names for ngrams...
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]


##### Create the list of reference strings to match against (the booklist):
booklist_match = list(booklist_df["BooklistMatch"].unique()) #unique author+title strings from the booklist
#Fit the TF-IDF vectorizer on the booklist strings, using character trigrams as tokens
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(booklist_match)

##### Create the list of strings to be matched (the holdings):
holdings_match = list(holdings_df["HoldingsMatch"].unique()) #unique author+title strings from the holdings

#Creation of vectors for the holdings strings

# #FOR LOADING ONLY - only required if items have been saved previously
# vectorizer = pickle.load(open("Data/vectorizer.pkl","rb"))
# tf_idf_matrix = pickle.load(open("Data/Comp_tfidf.pkl","rb"))
# org_names = pickle.load(open("Data/Comp_names.pkl","rb"))

messy_tf_idf_matrix = vectorizer.transform(holdings_match)

# matrix to index: the booklist TF-IDF vectors
data_matrix = tf_idf_matrix#[0:1000000]

# Index parameters carried over from the source article
# They are defined here but never passed to the index below
M = 80
efC = 1000

num_threads = 4 # adjust for the number of threads
# Initialize the library, specify the space and the type of the vector, and add data points 
index = nmslib.init(method='simple_invindx', space='negdotprod_sparse_fast', data_type=nmslib.DataType.SPARSE_VECTOR) 

index.addDataPointBatch(data_matrix)
# Create an index
index.createIndex() 


# Query settings
num_threads = 4
K=1 # number of neighbors to return for each holdings string
query_matrix = messy_tf_idf_matrix
query_qty = query_matrix.shape[0]
nbrs = index.knnQueryBatch(query_matrix, k = K, num_threads = num_threads)

In [12]:
mts =[]
for i in range(len(nbrs)):
  original_nm = holdings_match[i]
  try:
    matched_nm   = booklist_match[nbrs[i][0][0]]
    conf         = nbrs[i][1][0]
  except IndexError: # no neighbor was returned for this row
    matched_nm   = "no match found"
    conf         = None
  mts.append([original_nm,matched_nm,conf])

mts = pd.DataFrame(mts,columns=['holdings_match','booklist_match','conf'])
#change negative values to positive for ease of reading
mts['conf'] = mts['conf'].abs()

# Step 4: Determine matches

In the table below you can browse the results to see the match confidence scores assigned to each pair.

Look at the values in the 'conf' column to find the point at which you feel the entries are matched correctly.  We generally recommend looking at values between 0.5 and 0.7 as a starting point.  To help, the preview below is limited to that range of values.

In [13]:
#.66 used as default match confidence value based on initial testing, will be overwritten in step below
match_conf = .66

mts = mts.sort_values(by=['conf'])
mts[mts['conf'].between(.5,.7)]

Unnamed: 0,holdings_match,booklist_match,conf
7173,"de la Peña, Matt, author.Superman : dawnbreake...","de la Peña, MattWe Were Here",0.500979
13255,"Roanhorse, Rebecca, author.Resistance reborn /...","Roanhorse, RebeccaPhoenix Song: Echo - Street ...",0.505537
21735,"Allende, Isabel, author.City of the beasts / I...","Allende, IsabelCity of the Beasts (Memories of...",0.50619
8299,"Roanhorse, Rebecca, author.Storm of locusts / ...","Roanhorse, RebeccaPhoenix Song: Echo - Street ...",0.511019
3088,"Roanhorse, Rebecca, author.Trail of lightning ...","Roanhorse, RebeccaPhoenix Song: Echo - Street ...",0.513158
6560,"Sotomayor, Sonia, 1954- author.The beloved wor...","Mendoza, SylviaSonia Sotomayor: A Biography",0.515296
2001,"Albertalli, Becky, author.Leah on the offbeat ...","Albertalli, BeckyWhat If It's Us",0.517369
9018,"Abel, Jessica, author, illustrator.Trish Trash...","Abel, JessicaTrish Trash #1: Rollergirl of Mar...",0.517502
24084,"Thomas, Aiden, author.Lost in the Never Woods ...","Thomas, AidenCemetery Boys",0.518868
14284,"Roanhorse, Rebecca, author.Race to the sun / R...","Roanhorse, RebeccaPhoenix Song: Echo - Street ...",0.520078


Enter the confidence value you wish to use for determining correct matches.  Any value greater than or equal to the number you enter will be considered a match.

In [14]:
match_conf = float(input("Select confidence value for determining correct matches"))

Select confidence value for determining correct matches.559
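
Because `TfidfVectorizer` normalizes each vector to unit length by default, the 'conf' values, once made positive, should fall between 0 and 1. A minimal sketch of checking the entered value against that range follows; `parse_confidence` is a hypothetical helper and is not part of the script.

```python
def parse_confidence(text):
    """Convert user input to a float and check that it lies between 0 and 1."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise ValueError("Confidence value must be between 0 and 1, got " + str(text))
    return value

print(parse_confidence(".559"))  # 0.559
```

A value such as `55.9`, entered as a percentage by mistake, would raise an error instead of silently matching nothing.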


**Percentage of Titles from list that are in your collection**

In [15]:
pct_held = (len(mts.loc[mts['conf'] >= match_conf]) / len(booklist_df)) * 100
print('You own '+str(round(pct_held, 2))+'% of the titles in the booklist')

You own 43.23% of the titles in the booklist
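
Note that the calculation above counts rows of `mts`, which holds one row per distinct holdings string. If several holdings records (for example, different editions) match the same booklist entry, each one is counted. The following minimal sketch, using plain Python and made-up data, contrasts that with counting each booklist entry once; in the notebook, calling `nunique()` on the `booklist_match` column of the filtered `mts` would presumably play the same role.

```python
# Made-up (holdings_match, booklist_match, conf) rows, for illustration only
matches = [
    ("holdings record A", "booklist title 1", 0.91),
    ("holdings record B", "booklist title 1", 0.74),
    ("holdings record C", "booklist title 2", 0.68),
    ("holdings record D", "booklist title 3", 0.41),
]
booklist_size = 4
match_conf = 0.66

# Keep only the rows at or above the chosen confidence value
accepted = [row for row in matches if row[2] >= match_conf]

# Count every accepted holdings row versus every distinct booklist entry
pct_by_rows = len(accepted) / booklist_size * 100
pct_by_titles = len({row[1] for row in accepted}) / booklist_size * 100

print(pct_by_rows)    # 75.0
print(pct_by_titles)  # 50.0
```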


Merge the holdings dataframe with the instances where a match has been found

In [16]:
found_results = holdings_df.reset_index().merge(mts.loc[mts['conf'] >= match_conf], left_on='HoldingsMatch', right_on='holdings_match').set_index('index')
#found_results = found_results.rename(columns=str.capitalize)

**Titles Found in Your Collection**

In [17]:
found_results[holdings_headers]

index,record num,title,author,publish year
1282,b37139939,The disturbed girl's dictionary / NoNieqa Ramos.,"Ramos, NoNieqa, author.",
1559,b37167741,The poet X / a novel by Elizabeth Acevedo.,"Acevedo, Elizabeth, author.",
3063,b37340372,Photographic : the life of Graciela Iturbide /...,"Quintero, Isabel, author.",
3064,b37340396,"America : fast and fuertona / Gabby Rivera, wr...","Rivera, Gabby, author.",
4951,b37557130,What if it's us / Becky Albertalli & Adam Silv...,"Albertalli, Becky, author.",
...,...,...,...,...
28488,b40186295,Our way back to always / Nina Moreno.,"Moreno, Nina (Young adult fiction writer), aut...",
28516,b40186908,Living beyond borders : growing up Mexican in ...,,
29206,b40232542,Like a love song / Gabriela Martins.,"Martins, Gabriela, author.",
29373,b40248847,"Clockwork curandera. Volume 1, The witch owl p...","Bowles, David (David O.), author.",


Download results to an Excel file

In [18]:
found_results[holdings_headers].to_excel("found_results.xlsx")
files.download('/content/found_results.xlsx')


Create missing dataframe containing the titles from the booklist that were not matched to your holdings

In [19]:
booklist_found = booklist_df.reset_index().merge(mts.loc[mts['conf'] >= match_conf], left_on='BooklistMatch', right_on='booklist_match').set_index('index')
common = booklist_df.merge(booklist_found, how='outer', left_index=True, right_index=True)
common = common[common[['conf']].notna().all(axis=1)]
missing = booklist_df.merge(common, how='outer', left_index=True, right_index=True)
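
The idea behind this step can be pictured more simply: a booklist entry is missing when its match key does not appear among the accepted matches. The following is a minimal sketch with plain Python and made-up data; in pandas, `Series.isin` would presumably do the equivalent filtering.

```python
# Made-up match keys, for illustration only
booklist_keys = ["Author OneTitle A", "Author TwoTitle B", "Author ThreeTitle C"]
accepted_booklist_matches = {"Author TwoTitle B"}

# A booklist entry is missing if no accepted match points to it
missing_keys = [key for key in booklist_keys if key not in accepted_booklist_matches]
print(missing_keys)  # ['Author OneTitle A', 'Author ThreeTitle C']
```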

**Titles Not in Your Collection**

In [20]:
missing = missing[missing['conf'].isnull()][booklist_headers]
missing

Unnamed: 0,title,author,isbn,format,publisher,pubdate,unnamed: 6
0,¿sabes Quién Es Zapata? / Do You Know Who Zapa...,"Leyva, Amaranta",9786073183574,Paperback,Alfaguara Infantil,2020-06-23,12.95
1,21: The Story of Roberto Clemente,"Santiago, Wilfred",9781606997758,Paperback,Fantagraphics Books,2014-09-21,19.99
3,Absolute Carnage: Miles Morales,"Ahmed, Saladin",9781302920142,Paperback,Marvel,2020-01-28,15.99
4,Albert Pujols (Beisbol! Latino Heroes of Major...,"Leventhal, Josh",9781680720495,Hardcover,Black Rabbit Books/Bolt,2017-09-01,32.80
5,Alexandria Ocasio-Cortez: Political Headliner ...,"Leigh, Anna",9781541588875,Paperback,Lerner Publications (Tm),2020-01-01,10.99
...,...,...,...,...,...,...,...
217,Welcome to Wanderland,"Ball, Jackie",9781684154722,Paperback,Boom Box,2019-09-24,14.99
219,"When Villains Rise, 3 (Market of Monsters #3)","Schaeffer, Rebecca",9781328863560,Hardcover,Clarion Books,2020-09-08,17.99
222,Where We Go from Here,"Rocha, Lucas",9781338556247,Hardcover,Push,2020-06-02,18.99
223,Whistle: A New Gotham City Hero,"Lockhart, E",9781401293222,Paperback,DC Comics,2021-09-07,16.99


Download missing titles to an Excel file

In [21]:
missing.to_excel("missing_results.xlsx")
files.download('/content/missing_results.xlsx')
