The first step in any data science project is cleaning the data. Web of Science records are about as good as bibliographic data gets, but good is not yet perfect. In particular, WoS records have a problem with ambiguous researcher names. We know that 'John Smith', 'John L Smith', and 'John L. Smith' are probably all the same person, but a computer sees three different strings.

In general, disambiguation is a challenge in any data science project where entities are identified by non-unique identifiers, like ordinary names, or where errors enter the data stream through a manual entry point. I looked at some Python libraries that handle disambiguation, but they didn't seem precisely suited to this project: I absolutely need to avoid improperly combining records for different scientists, the data is small enough that I can clean it by hand (with a little assistance), and the ambiguities follow some pretty regular patterns.

The most common problems are:
* A scientist is inconsistent in use of a middle initial
* A scientist is referred to by a nickname
* A scientist's first name is misspelled in one instance 

I'm reusing some older code, so this is not as elegant as I'd like. But the best code is code that runs.

In [117]:
import metaknowledge as mk
import pandas as pd
import os

Change into the directory called RCs, a child of the project directory that stores all the record collections we downloaded, and print the path before and after to confirm. For reasons of confidentiality, I won't share the actual data, but I've included a test collection of papers by 2017 Nobel Laureate in Physics Kip Thorne (savedrecsKP.txt), who is inconsistent about using his middle initial 'S'. Kip Thorne has 15 papers, and my dataset for this project is 1628 papers.

In [118]:
print(os.getcwd())

path = 'RCs'
os.chdir(path)
print(os.getcwd())

/home/michael/OneDrive/ds/metis/lol18_ids4/project
/home/michael/OneDrive/ds/metis/lol18_ids4/project/RCs


We initialize our record collection, then create and check a dataframe called people that lists all our authors and how many papers they've published.

In [119]:
RC = mk.RecordCollection(".")
people = mk.CollectionWithIDs.rankedSeries(RC, tag='AF', giveCounts=True)

people = pd.DataFrame.from_dict(people, orient="columns")
people= people.sort_values('entry')
people['names']=people['entry']
people.set_index(['names'], inplace=True)
people.drop(['rank'], axis=1, inplace=True)
people.head()

|names              |entry              |count|
|-------------------|-------------------|-----|
|ADERMANN, K        |ADERMANN, K        |1    |
|ALBERS, J          |ALBERS, J          |1    |
|AUSTERMANN, S      |AUSTERMANN, S      |1    |
|Aagaard, Kjersti   |Aagaard, Kjersti   |1    |
|Aaron, Holly L.    |Aaron, Holly L.    |1    |


This function is at the heart of disambiguation. Last names are usually correct, with errors concentrated in the first names. So we transform a string of the form "Lastname, Firstname" into "Lastname, F", and if two transformed names are identical, we have a possible ambiguity.

This leads to false positives with common last names, so we need to manually check the process.

In [120]:
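def LastnameF(fullname):
    # "Smith, Jonathan L." -> "Smith, J"
    name = fullname.split(' ')
    try:
        name = name[0] + ' ' + name[1][0]
    except IndexError:
        # no second token, or an empty one (e.g. after a double space)
        name = name[0]
    return name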
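
A quick check of what the function returns, on a few made-up names. The function is restated here under the hypothetical name `lastname_f` so the snippet runs on its own; treat it as a sketch of the idea rather than part of the pipeline.

```python
def lastname_f(fullname):
    """Restatement of LastnameF: keep the surname and the first letter of the next token."""
    parts = fullname.split(' ')
    try:
        return parts[0] + ' ' + parts[1][0]
    except IndexError:
        return parts[0]

# Three spellings of (probably) the same person collapse to a single key.
variants = ['Smith, Jon', 'Smith, Jon L.', 'Smith, Jonathan L.']
keys = {lastname_f(v) for v in variants}
print(keys)

# A different person with the same surname and initial gets the same key,
# which is exactly the kind of false positive that needs a manual check.
print(lastname_f('Smith, Jennifer'))

# A name with no first-name token falls back to the first token alone.
print(lastname_f('Thorne'))
```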

The code looks hacky, but it's actually fairly clever. Once sorted, the people dataframe lists names alphabetically, so variants of the same name sit next to each other, i.e. in the form

* Smith, Jon
* Smith, Jon L.
* Smith, Jonathan L.

Generally academics use their full name and initial, i.e. the longest version of their name, and we should try to do the same. Starting at the bottom of our dataframe (row i), we check whether its LastnameF form matches the LastnameF form of the next row up (row k). If it does, we suggest that the name in row k be changed to the name in row i, decrement k, and check again. When the two no longer match, row k becomes the new row i.

This outputs a file called disamhelp.csv, which indicates possibly ambiguous names and their suggested replacement.

In [121]:
people['suggested'] = ''

i = len(people) - 1

while i > 0:
    k = i - 1
    # k >= 0 keeps iloc from wrapping around to the last row via a negative index
    while k >= 0 and LastnameF(people.iloc[i, 0]) == LastnameF(people.iloc[k, 0]):
        people.iloc[i, 2] = people.iloc[i, 0]
        people.iloc[k, 2] = people.iloc[i, 0]
        k = k - 1
    i = k

people.to_csv('disamhelp.csv', index=False)
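
The same grouping idea can be sketched without pandas, which makes the logic easier to see. This is an alternative formulation using `itertools.groupby` on a sorted list, not the loop above; the names `lastname_f` and `suggest_replacements` and the sample data are made up for illustration.

```python
from itertools import groupby

def lastname_f(fullname):
    """Restatement of LastnameF: surname plus first letter of the next token."""
    parts = fullname.split(' ')
    try:
        return parts[0] + ' ' + parts[1][0]
    except IndexError:
        return parts[0]

def suggest_replacements(names):
    """For every group of names sharing a key, suggest the last name in sort order."""
    suggestions = {}
    # Sorting puts names with the same key next to each other, as groupby requires.
    for _, group in groupby(sorted(names), key=lastname_f):
        members = list(group)
        if len(members) > 1:
            for name in members:
                suggestions[name] = members[-1]
    return suggestions

names = ['Smith, Jon L.', 'Thorne, Kip S.', 'Smith, Jonathan L.',
         'Smith, Jennifer', 'Smith, Jon']
suggestions = suggest_replacements(names)
for name in sorted(names):
    print(name, '->', suggestions.get(name, ''))
```

Note that 'Smith, Jennifer' is also mapped to 'Smith, Jonathan L.' here. The key cannot tell the two apart, which is why the output still has to be reviewed by hand.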

Now it's easy enough to go through disamhelp.csv and see which names are genuinely ambiguous. It takes about 15 minutes to check the file by hand. At the end, you should have a spreadsheet where the first column is an ambiguous name, and the second column is the replacement.

Something that looks like this:

|entry              |count|suggested          |
|-------------------|-----|-------------------|
|Smith, Jennifer    |7    |                   |
|Smith, Jon         |1    |Smith, Jonathan L. |
|Smith, Jon L.      |3    |Smith, Jonathan L. |
|Smith, Jonathan L. |15   |Smith, Jonathan L. |

Should become something that looks like the table below. You have to delete the row with Jennifer, and everybody else with a correct name, because otherwise you'll disambiguate them to the empty string.

|entry              |suggested          |
|-------------------|-------------------|
|Smith, Jon         |Smith, Jonathan L. |
|Smith, Jon L.      |Smith, Jonathan L. |
|Smith, Jonathan L. |Smith, Jonathan L. |

Save the new spreadsheet as DISAMBIG.csv
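
Once the hand check is done, dropping the rows with a blank suggestion and the count column is mechanical. A minimal sketch using the standard `csv` module, assuming the reviewed file keeps the `entry`, `count`, `suggested` header; the helper name `keep_flagged` and the inline sample text are hypothetical. It does not replace the manual review, it only strips rows you left blank.

```python
import csv
import io

# Stand-in for the reviewed disamhelp.csv; in practice you would open the file instead.
reviewed_text = '''entry,count,suggested
"Smith, Jennifer",7,
"Smith, Jon",1,"Smith, Jonathan L."
"Smith, Jon L.",3,"Smith, Jonathan L."
"Smith, Jonathan L.",15,"Smith, Jonathan L."
'''

def keep_flagged(csv_text):
    """Return (entry, suggested) pairs for rows whose suggestion is not blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row['entry'], row['suggested'])
            for row in reader if row['suggested'].strip()]

rows = keep_flagged(reviewed_text)

# Write two columns with no header row, matching a reader that uses header=None.
out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```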

Metaknowledge is great, but its one flaw is that objects stored as a Metaknowledge record are immutable. This means that to disambiguate, we need to export the record collection as a text file (ARECORD.TXT), reload it, and go through it line by line; if any line matches one of the names to be disambiguated, we replace it. Then we save the new file as BRECORD.txt, which we'll use as our clean dataset going forward.

In [122]:
mk.RecordCollection.writeFile(RC, "ARECORD.TXT")
with open("ARECORD.TXT", "r", encoding="utf8") as file:
    data = file.readlines()

newdata = []   # cleaned lines, written out in one go at the end
count = 0      # name-only lines replaced
afcount = 0    # 'AF' lines replaced

# DISAMBIG.csv: column 0 is the ambiguous name, column 1 is its replacement
df = pd.read_csv('DISAMBIG.csv', header=None)
disambig = dict(zip(df[0], df[1]))

for line in data:
    stripped = line.strip()
    if stripped in disambig:
        # a line holding only a name (continuation of an author field)
        newdata.append(line.replace(stripped, disambig[stripped]))
        count += 1
    elif line.startswith('AF') and line[2:].strip() in disambig:
        # first author of a record, prefixed with the 'AF' tag
        newdata.append('AF ' + disambig[line[2:].strip()] + '\n')
        afcount += 1
    else:
        # every other line, including 'WC' lines, passes through unchanged
        newdata.append(line)

os.chdir('..')

with open("BRECORD.txt", "w", encoding="utf8") as fh:
    fh.writelines(newdata)

RC = mk.RecordCollection("BRECORD.txt")
print('number of papers in sample')
print(len(RC))

number of papers in sample
1692
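
To see the line-matching logic in isolation, here is a minimal sketch that runs on a few made-up lines laid out like a tagged WoS export (a two-letter tag on the first line of a field, indented continuation lines after it). The function name `replace_author_lines`, the sample lines, and the mapping are all invented for illustration.

```python
def replace_author_lines(lines, mapping):
    """Swap author names found in `mapping`, leaving every other line untouched."""
    out = []
    for line in lines:
        body = line.strip()
        if body in mapping:
            # Name-only continuation line: keep the indentation, swap the name.
            out.append(line.replace(body, mapping[body]))
        elif line.startswith('AF ') and line[3:].strip() in mapping:
            # First author line: rebuild it with the tag in front.
            out.append('AF ' + mapping[line[3:].strip()] + '\n')
        else:
            out.append(line)
    return out

sample = [
    'PT J\n',
    'AF Smith, Jon\n',
    '   Thorne, Kip\n',
    'TI A made-up title\n',
]
mapping = {
    'Smith, Jon': 'Smith, Jonathan L.',
    'Thorne, Kip': 'Thorne, Kip S.',
}

cleaned = replace_author_lines(sample, mapping)
print(''.join(cleaned), end='')
```

Matching on the whole stripped line, rather than searching for the name anywhere in the line, is what keeps 'Smith, Jon' from being rewritten inside 'Smith, Jon L.'.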


Finally, while we have a few thousand authors in our dataset, we only really care about a handful of them, which I've saved in a file Tags.csv, in the parent directory of RCs.  

However, we need to check that we use the same names in Tags.csv as we do in the disambiguated dataset. This last cell reuses the disambiguation code to check that our tagged authors correspond to the right people in the dataset, and outputs a TagHelper.csv file that you can use to check everything is okay. I've also printed the lines which might cause a problem.

This is not great code, but it works.

In [123]:
people = mk.CollectionWithIDs.rankedSeries(RC, tag='AF', giveCounts=True)
people = pd.DataFrame.from_dict(people, orient="columns")
people= people.sort_values('entry')
people['names']=people['entry']
people.set_index(['names'], inplace=True)
people.drop(['rank'], axis=1, inplace=True)

import csv

Tags = pd.read_csv('Tags.csv', encoding="latin-1")

taghelper = []

people['shortened']=people.entry.apply(LastnameF)
people = people.set_index(['shortened'])

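for index, row in Tags.iterrows():
    member = LastnameF(row['Name'])
    try:
        # several dataset names share this key: .loc returns a Series
        aliases = people.loc[member, 'entry'].tolist()
    except AttributeError:
        # exactly one match: .loc returns a plain string, which has no .tolist()
        aliases = people.loc[member, 'entry']
    except KeyError:
        # no dataset name shares this key
        aliases = 'IS NOT FOUND'
    taghelper.append([row['Name']] + [' may be '] + [str(aliases)])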

for line in taghelper :
    if line[0] != line[2] :
        line.append('Check')
        print(line)

outputFile = open('TagHelper.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
for line in taghelper:
    outputWriter.writerow(line)
outputFile.close()

['Li, Baoxin', ' may be ', "['Li, Baoxin', 'Li, Bing']", 'Check']
['Maser, Joe', ' may be ', 'IS NOT FOUND', 'Check']
['Rueter, John', ' may be ', 'IS NOT FOUND', 'Check']
['Schulz, Cristof', ' may be ', "['Schulz, Christian', 'Schulz, Christof']", 'Check']
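
The lookup behind this check can also be sketched with a plain dictionary, which avoids the Series-versus-string special cases. This is a simplified stand-in for the cell above, not the same code; `lastname_f`, `find_aliases`, and the toy name lists are assumptions for illustration.

```python
from collections import defaultdict

def lastname_f(fullname):
    """Restatement of LastnameF: surname plus first letter of the next token."""
    parts = fullname.split(' ')
    try:
        return parts[0] + ' ' + parts[1][0]
    except IndexError:
        return parts[0]

def find_aliases(tagged, dataset_names):
    """Map each tagged name to the dataset names that share its key."""
    by_key = defaultdict(list)
    for name in sorted(dataset_names):
        by_key[lastname_f(name)].append(name)
    # .get avoids creating empty entries for keys that are not in the dataset.
    return {name: by_key.get(lastname_f(name), []) for name in tagged}

dataset_names = ['Li, Baoxin', 'Li, Bing', 'Schulz, Christian',
                 'Schulz, Christof', 'Thorne, Kip S.']
tagged = ['Li, Baoxin', 'Schulz, Cristof', 'Maser, Joe', 'Thorne, Kip S.']

report = find_aliases(tagged, dataset_names)
for name, aliases in report.items():
    # Anything other than a single exact match deserves a look.
    if aliases != [name]:
        print(name, 'needs checking:', aliases or 'not found')
```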
