<a href="https://colab.research.google.com/github/jmgold/DEI-Collections/blob/main/DEI_Holdings_Comparison_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Jeremy Goldstein
Minuteman Library Network

This script compares an Excel file of local holdings against a curated list of diverse titles to show how much of that list you own and which titles are missing.

# Step 1: Configure Python/Colab Environment


nmslib is not included with Colab by default and must be installed.

If you are using files saved to Google Drive, the drive must first be mounted to give Colab access.

In [None]:
!pip install nmslib

Import the Python libraries used within the script.

In [2]:
import pandas as pd
import io
import numpy as np
import os
import re
#import time
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix # may not be required 
from scipy.sparse import rand # may not be required
%load_ext google.colab.data_table

Mount Google Drive in order to access the DEI list file.
You can also mount the drive manually via the file directory menu along the left-hand side of the screen.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Step 2: Import data
The script loads tabular data into two dataframes using the pandas library for Python.  

In [4]:
#load DEI List.xlsx into a dataframe
cols_to_use = ['genre','author','title','year']
booklist_df = pd.read_excel('/content/drive/MyDrive/Python/DEI List.xlsx', usecols=lambda x: x.lower() in cols_to_use)

Clean up the data in the table and add a BooklistMatch column for use later.

In [5]:
#change null to blank
booklist_df=booklist_df.fillna('')
#create match point column
booklist_df['BooklistMatch']=booklist_df['Author']+booklist_df['Title']
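
A small sketch of why the nulls are blanked before the match column is built: in pandas, concatenating a string with a missing value yields a missing value, so an anthology with no author would otherwise end up with no match key at all. The two-row dataframe and the names `demo_df`, `raw_key` and `clean_key` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row booklist; the second row has no author, like the anthologies
demo_df = pd.DataFrame({
    "Author": ["Charlie Adhara", None],
    "Title": ["The Wolf at Bay", "Transcendent 4"],
})

# Without fillna, the missing author propagates into the concatenated key
raw_key = demo_df["Author"] + demo_df["Title"]

# With fillna(''), the key falls back to the title alone
clean_df = demo_df.fillna("")
clean_key = clean_df["Author"] + clean_df["Title"]

print(raw_key.isna().tolist())
print(clean_key.tolist())
```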

Preview the contents of this dataframe.  
You can use the Filter button to further explore the data.

In [6]:
#Preview booklist dataframe
booklist_df

Unnamed: 0,Genre,Author,Title,Year,BooklistMatch
0,Romance/Erotic Romance,Charlie Adhara,The Wolf at the Door,2018,Charlie AdharaThe Wolf at the Door
1,Romance/Erotic Romance,Charlie Adhara,The Wolf at Bay,2018,Charlie AdharaThe Wolf at Bay
2,Romance/Erotic Romance,Charlie Adhara,Thrown to the Wolves,2019,Charlie AdharaThrown to the Wolves
3,Romance/Erotic Romance,Charlie Adhara,Wolf in Sheep’s Clothing,2020,Charlie AdharaWolf in Sheep’s Clothing
4,Romance/Erotic Romance,Brea Alepoú,His Bewildered Mate,2019,Brea AlepoúHis Bewildered Mate
...,...,...,...,...,...
4011,Fiction Anthologies,,Shades Of Black: Crime And Mystery Stories By ...,2004,Shades Of Black: Crime And Mystery Stories By ...
4012,Fiction Anthologies,,Slay: Stories of the Vampire Noire,2020,Slay: Stories of the Vampire Noire
4013,Fiction Anthologies,,Transcendent 3: The Year’s Best Transgender Sp...,2018,Transcendent 3: The Year’s Best Transgender Sp...
4014,Fiction Anthologies,,Transcendent 4,2019,Transcendent 4


Upload an Excel file containing the titles in your holdings.

The script currently assumes this file contains 4 columns: Record Number, Author, Title and Publication Year, with the column headers 'record num', 'author', 'title' and 'publish year'.

In [7]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  file_name=fn
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving test holdings.xlsx to test holdings (3).xlsx
User uploaded file "test holdings.xlsx" with length 2118564 bytes


In [None]:
#use if csv
#holdings_df = pd.read_csv(io.BytesIO(uploaded[file_name]))

In [8]:
#use if excel
cols_to_use = ['record num','author','title','publish year']
holdings_df = pd.read_excel(io.BytesIO(uploaded[file_name]), usecols=lambda x: x.lower() in cols_to_use)
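
The CSV and Excel cells above could be folded into one helper that picks the reader from the file extension. This is only a sketch: `read_holdings` is a hypothetical name, and the in-memory CSV stands in for an uploaded file.

```python
import io
import pandas as pd

def read_holdings(file_name, data, cols_to_use):
    """Read uploaded bytes as CSV or Excel, keeping only the expected columns."""
    # Case-insensitive column filter, same idea as the usecols lambda above
    keep = lambda col: col.lower() in cols_to_use
    buffer = io.BytesIO(data)
    if file_name.lower().endswith(".csv"):
        return pd.read_csv(buffer, usecols=keep)
    return pd.read_excel(buffer, usecols=keep)

# Tiny in-memory CSV standing in for an uploaded file (made-up data)
sample = b"Record Num,Author,Title,Publish Year,Location\nb1,Ann Author,A Title,2020,Main\n"
cols = ['record num', 'author', 'title', 'publish year']
demo_holdings = read_holdings("sample.csv", sample, cols)
print(list(demo_holdings.columns))
```

In the notebook this would be called as `read_holdings(file_name, uploaded[file_name], cols_to_use)`. Note that the filter keeps the headers as they appear in the file, while the following cells refer to lowercase 'author' and 'title'.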

In [9]:
#remove null values
holdings_df=holdings_df.fillna('')
#Create MatchPoint field
holdings_df['HoldingsMatch']=holdings_df['author']+holdings_df['title']
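
Catalog records often store authors as "Last, First, author." and append a statement of responsibility to the title, while the booklist uses "First Last" and a bare title. The notebook matches the raw strings; the following is an optional sketch of a normalization step that is not part of the original script. The function names are hypothetical, and any confidence thresholds would need to be re-checked if something like this were applied.

```python
def normalize_author(author):
    """Turn 'Li, Lillian, author.' into 'Lillian Li'; leave other forms unchanged."""
    parts = [p.strip() for p in author.split(",")]
    if len(parts) < 2:
        return author.strip()
    last, first = parts[0], parts[1]
    return f"{first} {last}".strip()

def normalize_title(title):
    """Drop the ' / statement of responsibility' tail from a catalog title."""
    return title.split(" / ")[0].strip()

print(normalize_author("Li, Lillian, author."))
print(normalize_title("The boat people / Sharon Bala."))
```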

Preview holdings data.

**Note:** If the file contains more than 20,000 rows, the preview will be displayed differently from the prior preview and will not include the browse features.

In [10]:
#Preview data
holdings_df



Unnamed: 0,record num,title,author,publish year,HoldingsMatch
0,b16093768,Cracking the AP. U.S. history exam / the Princ...,,©1997-2019.,Cracking the AP. U.S. history exam / the Princ...
1,b18109196,Cracking the AP. Biology exam / the Princeton ...,,c1997-2019.,Cracking the AP. Biology exam / the Princeton ...
2,b18783983,Cracking the AP. European history / the Prince...,,c1999-2019.,Cracking the AP. European history / the Prince...
3,b21114870,Cracking the SSAT/ISEE / the Princeton Review.,,©1997-2019.,Cracking the SSAT/ISEE / the Princeton Review.
4,b22154358,Cracking the AP. World history exam / the Prin...,,c2004-2019.,Cracking the AP. World history exam / the Prin...
...,...,...,...,...,...
30167,b40498189,Designated survivor. Season 1 / produced by An...,,,Designated survivor. Season 1 / produced by An...
30168,b40498190,Designated survivor. Season 2 / produced by An...,,,Designated survivor. Season 2 / produced by An...
30169,b40498207,Designated survivor. Season 3 / produced by An...,,,Designated survivor. Season 3 / produced by An...
30170,b40504980,The diary of a young girl : the definitive edi...,"Frank, Anne, 1929-1945",,"Frank, Anne, 1929-1945The diary of a young gir..."



# Step 3: Calculate Matches Dataframe
Compare the two dataframes and calculate a match confidence value for each pair of rows.

Matching algorithm is taken from [Fuzzy Matching at Scale by Josh Taylor](https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536) (viewed 11/24/2021)
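
To give a feel for the idea, here is a simplified sketch using only the standard library: each string is split into overlapping character trigrams and two strings are compared by the cosine similarity of their trigram counts. It omits the TF-IDF weighting and the punctuation cleanup used in the notebook, and the names `trigrams` and `cosine_similarity` are illustrative only.

```python
import math
from collections import Counter

def trigrams(text, n=3):
    """Lower-case, collapse whitespace, pad with spaces, and split into character n-grams."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def cosine_similarity(a, b):
    """Cosine similarity between the n-gram count vectors of two strings."""
    counts_a, counts_b = Counter(trigrams(a)), Counter(trigrams(b))
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

booklist_key = "Sharon BalaThe Boat People"
same = cosine_similarity(booklist_key, "sharon balathe boat people")
near = cosine_similarity("Bala, Sharon, author.The boat people / Sharon Bala.", booklist_key)
far = cosine_similarity("Cracking the SSAT/ISEE / the Princeton Review.", booklist_key)
print(round(same, 3), round(near, 3), round(far, 3))
```

A catalog record for the same book scores well above an unrelated record, but below a perfect score because of the extra cataloging text.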

In [11]:
def ngrams(string, n=3):
    """Takes an input string, cleans it and converts to ngrams. """
    string = str(string)
    string = string.lower() # lower case
    #string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    chars_to_remove = [")","(",".","|","[","]","{","}","'","-"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']' #remove punc, brackets etc...
    string = re.sub(rx, '', string)
    string = re.sub(' +',' ',string).strip() # get rid of multiple spaces and replace with a single
    string = ' '+ string +' ' # pad names for ngrams...
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]


# FIRST TIME RUN - takes about 5 minutes... used to build the matching table
# Clean list to match against: the unique booklist match keys
booklist_match = list(booklist_df["BooklistMatch"].unique())
# Fit the TF-IDF vectorizer on the booklist, using the ngrams function as the analyzer
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(booklist_match)

# Messy list to match: the unique holdings match keys
holdings_match = list(holdings_df["HoldingsMatch"].unique())

# FOR LOADING ONLY - only required if items have been saved previously (needs `import pickle`)
# vectorizer = pickle.load(open("Data/vectorizer.pkl","rb"))
# tf_idf_matrix = pickle.load(open("Data/Comp_tfidf.pkl","rb"))
# org_names = pickle.load(open("Data/Comp_names.pkl","rb"))

# Vectorize the holdings with the vectorizer fitted on the booklist
messy_tf_idf_matrix = vectorizer.transform(holdings_match)

# The booklist vectors are the data to index
data_matrix = tf_idf_matrix

num_threads = 4 # adjust for the number of threads
# Initialize the library, specify the space and the type of the vector
index = nmslib.init(method='simple_invindx', space='negdotprod_sparse_fast', data_type=nmslib.DataType.SPARSE_VECTOR)

# Add the data points and create the index
index.addDataPointBatch(data_matrix)
index.createIndex()

# Query the index for the nearest booklist entry (K=1) for each holdings entry
K=1
query_matrix = messy_tf_idf_matrix
nbrs = index.knnQueryBatch(query_matrix, k = K, num_threads = num_threads)
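
For a handful of rows, the nearest-neighbour lookup can be cross-checked by brute force. `TfidfVectorizer` L2-normalizes each row by default, so the dot product of a holdings vector and a booklist vector is their cosine similarity; the `negdotprod` space reports that value negated, which is why the scores are converted to positive numbers later. The sketch below uses made-up lists and a simplified `char_ngrams` analyzer rather than the notebook's `ngrams` function, and it builds a dense matrix, so it is only suitable for small samples.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def char_ngrams(text, n=3):
    """Simplified analyzer: lower-case, pad with spaces, split into character trigrams."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Made-up sample data in the shape of the two match lists
clean = ["Sharon BalaThe Boat People", "Tayari JonesAn American Marriage"]
messy = ["Jones, Tayari, author.An American marriage : a novel / by Tayari Jones.",
         "Bala, Sharon, author.The boat people / Sharon Bala."]

vec = TfidfVectorizer(min_df=1, analyzer=char_ngrams)
clean_matrix = vec.fit_transform(clean)
messy_matrix = vec.transform(messy)

# Rows are L2-normalized, so the dot product is the cosine similarity
similarity = (messy_matrix @ clean_matrix.T).toarray()
best = similarity.argmax(axis=1)   # index of the closest booklist entry per holdings row
scores = similarity.max(axis=1)    # its similarity score
for text, idx, score in zip(messy, best, scores):
    print(f"{score:.3f}  {text[:30]!r} -> {clean[idx]!r}")
```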

In [12]:
mts =[]
for i in range(len(nbrs)):
  original_nm = holdings_match[i]
  try:
    matched_nm   = booklist_match[nbrs[i][0][0]]
    conf         = nbrs[i][1][0]
  except IndexError:  # no neighbor was returned for this row
    matched_nm   = "no match found"
    conf         = None
  mts.append([original_nm,matched_nm,conf])

mts = pd.DataFrame(mts,columns=['holdings_match','booklist_match','conf'])
#change negative values to positive for ease of reading
mts['conf'] = mts['conf'].abs()

# Step 4: Determine matches

In the table below you can browse the results to see the match confidence scores assigned to each pair.

Look at the values of the 'conf' column to find the point at which you feel the entries are matched correctly.  Generally we recommend looking at the values between 0.5 and 0.7 as a starting point.  To help with this, the preview below is limited to that range of values.

In [13]:
#.66 used as default match confidence value based on initial testing, will be overwritten in step below
match_conf = .66

mts = mts.sort_values(by=['conf'])
mts[mts['conf'].between(.5,.7)]

Unnamed: 0,holdings_match,booklist_match,conf
19632,"Ross, Rebecca (Rebecca J.), author.Sisters of ...",Rebecca RoanhorseBlack Sun,0.501151
987,"Gentill, Sulari, author.Paving the new road / ...",Sulari GentillWhere There’s a Will,0.502552
19080,"Sylvester, Natalia, author.Running / Natalia S...",Natalia SylvesterEveryone Knows You Go Home,0.502972
19067,"Klune, TJ, author.The extraordinaries / T.J. K...",T.J. KluneRavensong,0.503066
7966,"Rowley, Steven, 1971- author.The editor : a no...",Steven RowleyThe Guncle,0.505493
...,...,...,...
25977,"Abrams, Stacey, author.While justice sleeps : ...",Stacey AbramsWhile Justice Sleeps,0.698780
14478,"Chien, Vivien, author.Wonton terror / Vivien C...",Vivien ChienWonton Terror,0.699151
25079,"Hough, Lauren, 1977- author.Leaving isn't the ...",Lauren HoughLeaving Isn’t The Hardest Thing,0.699217
24861,"Liu, Marjorie M., author.The Tangleroot palace...",Marjorie LiuTangleroot Palace: Stories,0.699284
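
One way to judge a cutoff is to count how many rows would be accepted at several candidate values. The sketch below uses made-up scores in place of the 'conf' column; `demo_mts` and `matches_per_threshold` are hypothetical names.

```python
import pandas as pd

# Hypothetical scores standing in for the 'conf' column of mts
demo_mts = pd.DataFrame({"conf": [0.42, 0.51, 0.58, 0.61, 0.66, 0.70, 0.74, 0.88]})

def matches_per_threshold(scores, thresholds):
    """Return {threshold: number of rows with conf >= threshold}."""
    return {t: int((scores >= t).sum()) for t in thresholds}

counts = matches_per_threshold(demo_mts["conf"], [0.5, 0.55, 0.6, 0.65, 0.7])
for threshold, n in counts.items():
    print(f"conf >= {threshold:.2f}: {n} rows")
```

A sharp drop between two neighbouring thresholds suggests where to inspect individual pairs by eye.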


Enter the confidence value you wish to use for determining correct matches.  Any value greater than or equal to the number you enter will be considered a match.

In [14]:
match_conf = float(input("Select confidence value for determining correct matches"))

Select confidence value for determining correct matches.6093
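
The input cell raises an error if the entry is blank or not a number. A defensive variant might look like the sketch below, which falls back to the 0.66 default from the earlier cell; `parse_confidence` is a hypothetical helper, and the 0 to 1 range check assumes the scores are cosine similarities.

```python
def parse_confidence(text, default=0.66):
    """Convert typed input to a float in [0, 1]; fall back to `default` otherwise."""
    try:
        value = float(text.strip())
    except ValueError:
        return default
    if not 0.0 <= value <= 1.0:
        return default
    return value

print(parse_confidence(".6093"))
print(parse_confidence("not a number"))
```

In the notebook it would wrap the prompt, for example `match_conf = parse_confidence(input("Select confidence value for determining correct matches"))`.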


**Percentage of Titles from list that are in your collection**

In [15]:
pct_held = (len(mts.loc[mts['conf'] >= match_conf]) / len(booklist_df)) * 100
print('You own '+str(round(pct_held, 2))+'% of the titles in the booklist')

You own 35.16% of the titles in the booklist
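
Note that the calculation above counts matched holdings rows, so a title held in several editions is counted once per edition. If the share of distinct booklist titles is wanted instead, one possible variant counts unique values of 'booklist_match'. The data below is made up, and `demo_booklist`, `demo_mts`, `pct_rows` and `pct_titles` are illustrative names.

```python
import pandas as pd

# Hypothetical data: two holdings records (print and large print) match the same title
demo_booklist = pd.DataFrame({"BooklistMatch": ["A", "B", "C", "D"]})
demo_mts = pd.DataFrame({
    "holdings_match": ["a print", "a large print", "b", "x"],
    "booklist_match": ["A", "A", "B", "C"],
    "conf": [0.80, 0.75, 0.70, 0.30],
})
match_conf = 0.66  # assumed threshold for this example only

matched = demo_mts.loc[demo_mts["conf"] >= match_conf]
# Row-based percentage, as in the cell above
pct_rows = len(matched) / len(demo_booklist) * 100
# Title-based percentage: each booklist entry counts at most once
pct_titles = matched["booklist_match"].nunique() / demo_booklist["BooklistMatch"].nunique() * 100
print(pct_rows, pct_titles)
```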


Merge the holdings dataframe with the instances where a match has been found

In [16]:
found_results = holdings_df.reset_index().merge(mts.loc[mts['conf'] >= match_conf], left_on='HoldingsMatch', right_on='holdings_match').set_index('index')
found_results = found_results.rename(columns=str.capitalize)

**Titles Found in Your Collection**

In [17]:
found_results

index,Record num,Title,Author,Publish year,Holdingsmatch,Holdings_match,Booklist_match,Conf
62,b36503939,Number one Chinese restaurant : a novel / Lill...,"Li, Lillian, author.",,"Li, Lillian, author.Number one Chinese restaur...","Li, Lillian, author.Number one Chinese restaur...",Lillian LiNumber One Chinese Restaurant,0.736522
102,b36545041,An American marriage : a novel / by Tayari Jones.,"Jones, Tayari, author.",,"Jones, Tayari, author.An American marriage : a...","Jones, Tayari, author.An American marriage : a...",Tayari JonesAn American Marriage,0.737505
109,b36553505,The widows of Malabar Hill / Sujata Massey.,"Massey, Sujata, author.",,"Massey, Sujata, author.The widows of Malabar H...","Massey, Sujata, author.The widows of Malabar H...",Sujata MasseyThe Widows of Malabar Hill,0.811325
166,b36610409,Racism without racists : color-blind racism an...,"Bonilla-Silva, Eduardo, 1962- author.",,"Bonilla-Silva, Eduardo, 1962- author.Racism wi...","Bonilla-Silva, Eduardo, 1962- author.Racism wi...",Eduardo Bonilla-SilvaRacism Without Racists: C...,0.879455
212,b36638249,The boat people / Sharon Bala.,"Bala, Sharon, author.",,"Bala, Sharon, author.The boat people / Sharon ...","Bala, Sharon, author.The boat people / Sharon ...",Sharon BalaThe Boat People,0.739590
...,...,...,...,...,...,...,...,...
29730,b40297676,Grievers / Adrienne Maree Brown.,"Brown, Adrienne M.",,"Brown, Adrienne M.Grievers / Adrienne Maree Br...","Brown, Adrienne M.Grievers / Adrienne Maree Br...",adrienne maree brownGrievers,0.770663
29810,b40309368,Pretty little lion / Suleikha Snyder.,"Snyder, Suleikha, author.",,"Snyder, Suleikha, author.Pretty little lion / ...","Snyder, Suleikha, author.Pretty little lion / ...",Suleikha SnyderPretty Little Lion,0.836295
29881,b40326640,Destroyer of light / Jennifer Marie Brissett.,"Brissett, Jennifer Marie, author.",,"Brissett, Jennifer Marie, author.Destroyer of ...","Brissett, Jennifer Marie, author.Destroyer of ...",Jennifer Marie BrissettDestroyer of Light,0.793499
30105,b40392168,Dear diaspora / Susan Nguyen.,"Nguyen, Susan, 1992- author.",,"Nguyen, Susan, 1992- author.Dear diaspora / Su...","Nguyen, Susan, 1992- author.Dear diaspora / Su...",Susan NguyenDear Diaspora,0.776614


Download results to an Excel file

In [18]:
found_results.to_excel("found_results.xlsx")
files.download('/content/found_results.xlsx')


Create a missing dataframe containing the titles from the booklist that were not matched to your holdings.

In [19]:
booklist_found = booklist_df.reset_index().merge(mts.loc[mts['conf'] >= match_conf], left_on='BooklistMatch', right_on='booklist_match').set_index('index')
common = booklist_df.merge(booklist_found, how='outer', left_index=True, right_index=True)
common = common[common[['conf']].notna().all(axis=1)]
missing = booklist_df.merge(common, how='outer', left_index=True, right_index=True)
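
The same set of unmatched titles can also be obtained without the chain of merges, by checking which booklist match keys never appear among the accepted matches. This is an alternative sketch on made-up data, not the approach used in the cell above; `demo_booklist`, `demo_mts`, `threshold` and `demo_missing` are hypothetical names.

```python
import pandas as pd

# Made-up booklist and match results
demo_booklist = pd.DataFrame({
    "Title": ["One", "Two", "Three"],
    "BooklistMatch": ["A. AuthorOne", "B. AuthorTwo", "C. AuthorThree"],
})
demo_mts = pd.DataFrame({
    "booklist_match": ["A. AuthorOne", "C. AuthorThree", "B. AuthorTwo"],
    "conf": [0.81, 0.72, 0.40],
})
threshold = 0.66  # assumed cutoff for this example only

# Keys of booklist entries with at least one accepted match
found_keys = demo_mts.loc[demo_mts["conf"] >= threshold, "booklist_match"]
# Booklist rows whose key is not among the accepted matches
demo_missing = demo_booklist[~demo_booklist["BooklistMatch"].isin(found_keys)]
print(demo_missing["Title"].tolist())
```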

**Titles Not in Your Collection**

In [20]:
missing[missing['conf'].isnull()][['Genre','Author','Title','Year']]

Unnamed: 0,Genre,Author,Title,Year
0,Romance/Erotic Romance,Charlie Adhara,The Wolf at the Door,2018
1,Romance/Erotic Romance,Charlie Adhara,The Wolf at Bay,2018
2,Romance/Erotic Romance,Charlie Adhara,Thrown to the Wolves,2019
3,Romance/Erotic Romance,Charlie Adhara,Wolf in Sheep’s Clothing,2020
4,Romance/Erotic Romance,Brea Alepoú,His Bewildered Mate,2019
...,...,...,...,...
4011,Fiction Anthologies,,Shades Of Black: Crime And Mystery Stories By ...,2004
4012,Fiction Anthologies,,Slay: Stories of the Vampire Noire,2020
4013,Fiction Anthologies,,Transcendent 3: The Year’s Best Transgender Sp...,2018
4014,Fiction Anthologies,,Transcendent 4,2019


Download missing titles to an Excel file

In [21]:
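#export only the unmatched titles; the missing dataframe also contains the matched rows
missing[missing['conf'].isnull()][['Genre','Author','Title','Year']].to_excel("missing_results.xlsx")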
files.download('/content/missing_results.xlsx')
