# <center>Co-working networks for assisted names remapping</center>
## <center>Species occurrence dataset cleaning</center>
#### <center>Author: Pedro C. de Siracusa</center>
#### <center>Date: Oct 9, 2017</center>

Because collectors' co-working network models are built from collector identifiers, their quality depends directly on the quality of the fields holding authorship information, such as `recordedBy`. For records authored by more than one collector, the recommended practice is to include all names in a single string, separated by the same delimiter across all records. However, as collector authorship information has traditionally been of little interest to biodiversity data users, this field is often overlooked by dataset curators. It is common practice to include only the nominal collector (or first collector) in the collection authorship field, recording secondary collectors as 'et al.' or omitting them altogether. Such information loss may hide important relationships that should have been included in the network models. In this notebook I'll demonstrate how co-working networks can assist in remapping names. By comparing candidate identities (names that are sufficiently similar) in terms of their neighborhood connectivity, I will try to identify the most important heteronymous entities in the dataset. This information will then be stored in a **Names Map** for later use.

Three important measures of quality for the collection authorship field are:
1. Consistent use of delimiters across all records; the delimiter should not be used for anything other than separating names of distinct entities.
2. A consistent entity naming standard, for example last name separated from initials by a comma. This avoids heteronymous entities, that is, entities with non-unique identifiers (several name variants).
3. Information completeness, meaning that all collectors responsible for a record should ideally be included in the authorship field.
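
As a rough illustration of how the first and third measures could be screened automatically, the sketch below flags names strings that contain 'et al.' or a delimiter other than `';'`. The function name `flag_authorship_issues` and the regular expressions are assumptions made for this example; they are not part of the notebook's cleaning module.

```python
import re

# Hypothetical heuristics (assumptions for illustration only)
ET_AL = re.compile(r"\bet\.?\s*al\b", re.IGNORECASE)
ALT_DELIMS = re.compile(r"\s[/&|]\s")

def flag_authorship_issues(names_string, delim=";"):
    """Return a sorted list of labels describing possible quality issues."""
    issues = set()
    if ET_AL.search(names_string):
        issues.add("incomplete (et al.)")
    if ALT_DELIMS.search(names_string):
        issues.add("alternative delimiter")
    # An empty part means a leading, trailing or doubled delimiter
    if any(part.strip() == "" for part in names_string.split(delim)):
        issues.add("empty name")
    return sorted(issues)

for s in ["Eiten, G; Eiten, LT", "Ratter, JA; et al.", "Sônia / Josefina"]:
    print(s, "->", flag_authorship_issues(s))
```

Such flags only point at records worth a manual look; they cannot tell which collectors are missing or how a malformed string should be split.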

Although homonymous entities (different entities holding the same name) are also possible issues for the network model, these cases are harder to detect than heteronymous entities. One possible way is to detect anomalies in the entity's activity records, as demonstrated for `'kuhlmann,M'` in notebook 7.

In [1]:
import numpy as np
import pandas as pd
import networkx as nx

import matplotlib.pyplot as plt

import modules.cleaning.names as nc

%matplotlib inline

In [2]:
dsetPath = '/home/pedro/datasets/ub_herbarium/occurrence.txt'

cols = [ 'recordedBy','scientificName','taxonRank', 'collectionCode','eventDate']
occs = pd.read_csv(dsetPath, sep='\t', usecols=cols)
occs=occs[occs['recordedBy'].notnull()]

In [3]:
occs.head()

| | collectionCode | recordedBy | eventDate | scientificName | taxonRank |
|---|---|---|---|---|---|
| 0 | UB | Irwin, HS | 1972-01-16T01:00Z | Annona monticola Mart. | SPECIES |
| 1 | UB | Ratter, JA; et al. | 1976-06-30T01:00Z | Myracrodruon urundeuva Allem. | SPECIES |
| 2 | UB | Heringer, EP | 1954-06-05T01:00Z | Myracrodruon urundeuva Allem. | SPECIES |
| 3 | UB | Coelho, JP | 1964-10-15T01:00Z | Myracrodruon urundeuva Allem. | SPECIES |
| 4 | UB | Eiten, G; Eiten, LT | 1963-08-17T01:00Z | Myracrodruon urundeuva Allem. | SPECIES |


## 1. Atomizing names

In the names atomization step our task is to retrieve the names of individual entities from names strings. Even in datasets with a high-quality collection authorship field, some records may still be inconsistently formatted. One possible approach to detecting such inconsistencies is to first apply the 'standard' atomization function to every record, so that most records are atomized correctly. Next we manually inspect very short and very long atomized names and decide which of the original strings should have been split differently. Finally we map different atomization functions to particular cases, or correct some of them manually. The functions in the cell below were refactored from previous notebooks for this approach. The `atomizeNames` function takes the atomization operation to apply, plus an extra argument `replaces` holding the names strings to be rewritten before atomization.

In [4]:
from collections import Counter

def getNamesList( col, with_counts=False, orderBy=None ):
    """
    Gets a list of names from an atomized names column.
    
    Parameters
    ----------
    
    with_counts : bool
        If set to True the result includes the number of records by each collector.
    
    orderBy : str
        If some rule is specified, the resulting list is sorted. Rules can be either
        to sort alphabetically ('alphabetic') or by the number of records by each
        collector ('counts').   
    """
    
    if orderBy not in [ None, "alphabetic", "counts"]:
        raise ValueError("Invalid argument for 'orderBy': {}".format(orderBy))
    
    if with_counts or orderBy=="counts":
        l = [ (n,c) for (n,c) in Counter( n for nlst in col for n in nlst ).items() ] 
        if orderBy=="alphabetic":
            return sorted( l, key=lambda x: x[0] )
        elif orderBy=="counts":
            sorted_l = sorted( l, key=lambda x: x[1], reverse=True )
            if with_counts:
                return sorted_l
            else:
                return [ n for (n,c) in sorted_l ]
        else:
            return l
                
    else:
        if orderBy=="alphabetic":
            return sorted(list(set( n for nlst in col for n in nlst )))
        else:
            return list(set( n for nlst in col for n in nlst ))
        
        
def getRecordsBy(df, colName, name):
    return df.loc[df[colName].apply( lambda x: name in x )]


def atomizeNames( col, operation=None, replaces=None ):
    """
    Applies an atomization operation on a names column, which must be a pandas Series. 
    The atomized names at each row are stored as a list.
    
    Parameters
    ----------
    col : pandas.Series
        Names column to be atomized.
        
    operation : function
        The atomization operation to be applied to the names column
        
    replaces:
        A list of 2-tuples (srclst, tgt), where srclst is a list of names to be replaced by tgt.
        The element tgt can be either a string or a function which results in a string.
        
    Returns
    -------
        A pandas Series with lists of atomized names.
    """
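    if operation is None:
        raise ValueError("an atomization operation must be provided")

    if replaces is not None:
        # Flatten (srclst, tgt) pairs into {source string: replacement};
        # callable targets are applied to each source string.
        replacesDict = dict( (src, tgt(src)) if callable(tgt) else (src,tgt) for (srclst, tgt) in replaces for src in srclst )
        col = col.replace( replacesDict )
        
    col_atomized = col.apply( operation )
    return col_atomized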

Now I define a default atomizing function, which retrieves names from strings assuming they're separated by a `';'` delimiter. Then I add a new column `recordedBy_atomized` to my original data frame, containing atomized collectors names.

In [5]:
atomizingFunction = lambda x: nc.namesFromString(x,delim=';') 
atomized_recordedBy = atomizeNames(occs['recordedBy'],operation=atomizingFunction)
occs['recordedBy_atomized'] = atomized_recordedBy
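
The implementation of `nc.namesFromString` is not shown in this notebook. A minimal stand-in, assuming it simply splits on the delimiter and strips whitespace, could look like the sketch below; the name `names_from_string` is hypothetical and the real function may behave differently.

```python
def names_from_string(names_string, delim=";"):
    """Split a names string on `delim`, strip whitespace and drop empty parts."""
    parts = (part.strip() for part in names_string.split(delim))
    return [part for part in parts if part]

print(names_from_string("Eiten, G; Eiten, LT"))
print(names_from_string("Yushun.; K."))
```

The second call shows why a misused `';'` matters: a single collector ends up as two separate entities.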

Let's check some candidates to be reatomized. For short names it is easier to inspect if we get the original names string associated with each atomized name. First I'll inspect the 100 shortest names and then the 100 longest names.

In [6]:
l = sorted( getNamesList(occs['recordedBy_atomized']) , key=lambda x: len(x))[:100]
l = [ (n, getRecordsBy(occs, 'recordedBy_atomized', n)['recordedBy'].iloc[0]) for n in l ]
print("Shortest names")
print("name\tin\tnamestring")
print("==========================")
print(''.join( "{}\tin\t{}\n".format(n,nstr) for n,nstr in l ))

Shortest names
name	in	namestring
O	in	Oliveira, RC; Moura, CO; Cardoso, AGT; Sonsin, J; Cordeiro, AOO; Million, JL; Antunes, LLC; O
S	in	Martins, DS; Câmara, PEAS; Amorim, PRF; Costa, DP; Faria, JEQ; Carvalho, AM; Gonzaga, RMO; S
?	in	?
R	in	Farias, R; Carvalho, AM; Carvalho, JA; Fonsêca, LM; Proença, CEB; Potzernheim, ML; R
F	in	F
.	in	Faria, JEQ; Carvalho-Silva, M; Câmara, PEAS; .; Soares, AER; Teixeira Júnior, AQ; Benedete
P	in	Sasaki, D; Pedroga, JA; Corrêa, TR; P; Piva, JH
Si	in	Ratter, JA; Bridgewater, S; Cardoso, E; Lima, V; Munhoz, CBR; Oliveira, NR; Ribeiro, JF; Si
RR	in	Irwin, HS; Souza, R; Santos; RR
FO	in	FO
E.	in	Hällström; E.
M.	in	Hatschbach, G; M.
K.	in	Yushun.; K.
Nu	in	Faria, JEQ; Campos, LZO; Ibrahim, M; Martins, RC; Caires, CS; Meneguzzo, TEC; Souza, LF; Nu
Fl	in	Lucas, EJ; Mazine-Capelo, FF; Kollmann, L; Brummitt, NA; Campos, OR; Fl
Cid	in	Cid; Ramos; Mota; Rosas
DAF	in	DAF
Bid	in	Guedes, ML; Bid; Carla
Ono	in	Kirkbride Junior, JH; Ono; E.K.M; et al.
PKL	in	PKL; E

In [7]:
l = sorted( getNamesList(occs['recordedBy_atomized']) , key=lambda x: len(x), reverse=True)[:100]
print("Longest names\n=============")
print('\n'.join(l))

Longest names
Organografia e Sistemática das Fanerofíticas (A) - UnB
Estudantes da Faculdade de Agronomia Eliseu Maciel
Equipe da CHESF CT nº 1.92.2004.0150.00/PETCON
Morfologia e Taxonomia de Fanerógamas - UnB
Taxonomy Class of Universidade de Brasília
Taxonomy Class of Universidade de Brasíl
Equipe do Jardim Botânico de Brasília
Curso int. Flora Páramo de Chirripó
Stud. biol. Rheno-Trai. in itinere
Turma de Etnobotânica do Cerrado
Romero-Silva M.G. (D. Mariinha)
Projeto Flora "Pedra do Cavalo"
III Reunião de Bot. Peninsular
Ecologia Vegetal Polo Noroeste
Souza e Silva (D. Cezarina), C
Alunos de Sistemática Vegetal
Turma de Vegetação do Cerrado
Sr. Air, Sr. Milton, Rodrigo
Universidade de Brasília-UnB
Disciplina Taxonomia Vegetal
Glenfield Vet. Res. Station
Study Biological Rheno-Trai
Perth City Council Gardener
Turma de Botânica de Campo
Pessoal do Herbário da UnB
Pessoal do Horto Florestal
Pessoal do Jardim Botânico
Projeto Biodiversidade BP
Turma de Mestrado da UFBA
Disciplina Taxo

As the UB dataset has a high-quality collection authorship field, I found only a few inconsistencies, which are nevertheless sufficient to show how this validation approach works. For lower-quality datasets, though, this validation would be more time-consuming. As an example, let's take a look at the names string 'Yushun.; K.', which I presume should have been recorded as 'Yushun, K'. As the `';'` separator was misused to separate the last name from the initial, our default atomization function creates two entities from this string:

In [8]:
get_record_by_yushun = lambda x: x[x['recordedBy'].apply(lambda n: 'Yushun' in n)]
get_record_by_yushun(occs)['recordedBy_atomized'].values

array([['Yushun.', 'K.']], dtype=object)

Below I define the replacement list and re-atomize the `recordedBy` column, now using the `replaces` argument.

In [9]:
rep = [ 
    (['Sr. Air, Sr. Milton, Rodrigo'], "Sr. Air; Sr. Milton; Rodrigo"),
    (['Sônia / Josefina'], "Sônia; Josefina"),
    (['Hatschbach, G; M.'], "Hatschbach, G; Hatschbach, M"),
    (['Irwin, HS; Souza, R; Santos; RR'], "Irwin, HS; Souza, R; Santos, RR"),
    (['Kirkbride Junior, JH; Ono; E.K.M; et al.'], "Kirkbride Junior, JH; Ono, E.K.M; et al."),
    (['Carboni, M; Faraco, AG; Soares; P.G.; Sampaio, D; Breier, TB'], "Carboni, M; Faraco, AG; Soares, P.G.; Sampaio, D; Breier, TB"),
    (['Silva; D.R.; Colvéquia; L.P.T'], "Silva, D.R.; Colvéquia, L.P.T"),
    (['Quintiliano; F.J.; Colvéquia; L.P.T; Silva; D.R.'], "Quintiliano, F.J.; Colvéquia, L.P.T; Silva, D.R."),
    
    (['Yushun.; K.', 
      'Barbosa; M.G.',
      'Hällström; E.',
      'Bueno; S.B.'], 
            lambda x: x.replace(';',',')
    ),   
]

atomizingFunction = lambda x: nc.namesFromString(x,delim=';') 
atomized_recordedBy = atomizeNames(occs['recordedBy'],operation=atomizingFunction, replaces=rep)
occs['recordedBy_atomized'] = atomized_recordedBy

Now I have recreated the `recordedBy_atomized` field with the exceptions defined above. Note that 'Yushun., K.' is now included as a single entity in the `recordedBy_atomized` field.

In [10]:
get_record_by_yushun(occs)['recordedBy_atomized'].values

array([['Yushun., K.']], dtype=object)
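
To make the `replaces` mechanism easier to follow, the sketch below expands a small replacement list into the flat dictionary that `atomizeNames` builds internally, using plain Python only. The two entries are taken from the list above; the variable names are illustrative.

```python
replaces = [
    (["Sônia / Josefina"], "Sônia; Josefina"),
    (["Yushun.; K.", "Bueno; S.B."], lambda s: s.replace(";", ",")),
]

# Same expansion as in atomizeNames: callable targets are applied to each source
replaces_dict = {
    src: (tgt(src) if callable(tgt) else tgt)
    for srclst, tgt in replaces
    for src in srclst
}

for src, tgt in replaces_dict.items():
    print(repr(src), "->", repr(tgt))
```

Because the dictionary is passed to `Series.replace`, only cells whose whole value equals a key are rewritten, so each exception has to be listed with the exact original string.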

Next I'll create the names map, which deals with the issue of heteronymous entities.

## 2. Building names map and index

Since the last notebook I have updated some of the `NamesMap` class methods and added new ones. The names map object can now be saved to a *.json* file, which can be shared with other potential users of the dataset. This simplifies the names cleaning and validation process and allows it to be performed collaboratively.

In [11]:
import json
from copy import deepcopy

class NamesMap:
    _map=None
    _normalizationFunction=None
    _remappingDict=None
    
    def __init__(self, names, normalizationFunction, *args, **kwargs):  
        self._normalizationFunction = normalizationFunction    
        if 'jsondata' in kwargs:
            self._map = kwargs['jsondata']['_map']
            # '_remappingDict' is missing from files written before any remap was defined
            self._remappingDict = kwargs['jsondata'].get('_remappingDict')
        else:
            self._map = dict( (n, self._normalizationFunction(n)) for n in names )
        return
    
    def clearMap(self):
        """
        Resets the map to an empty dict
        """
        self._map={}
        return
    
    def remap(self, remappingDict, fromScratch=False, preventOverwriting=True):
        """
        Updates the remapping dictionary.
        
        Parameters
        ----------
        remappingDict : dict
            Remaps values in the map attribute.
        
        fromScratch : bool
            If set to True the remapping dict becomes the one passed in. All other previous
            remaps are discarded.
        
        preventOverwriting : bool
            If set to True (default), then a key cannot be remapped if it already exists
            in the remapping dict.
        """
        remappingDict = deepcopy(remappingDict)
        if fromScratch:
            self._remappingDict = None
            
        if self._remappingDict is None:
            self._remappingDict = remappingDict
        
        else:
            if preventOverwriting:
                existantKeys = self._remappingDict.keys()
                for k in remappingDict.keys():
                    if k in existantKeys:
                        raise ValueError("Cannot overwrite key '{}' in the remapping dict.".format(k))
                    
            self._remappingDict.update(remappingDict)
        return    
    
    def remove_fromRemap(self, key):
        """
        Removes element from the remapping dict
        
        Parameters
        ----------
        Key : dict key
          Key of the element to be removed
          
        Returns
        -------
        The value associated to the removed key
        """
        return self._remappingDict.pop(key)

    def remap_fromJson(self, filepath, fromScratch=True):
        with open(filepath, 'r') as f:
            data = json.load(f)
            remappingDict = data['_remappingDict']
            self.remap(remappingDict,fromScratch)
            return
    
    def setNormalizationFunc(self,normalizationFunction):
        self._normalizationFunction = normalizationFunction
        return
    
    def insertNames(self, names, normalizationFunction=None, rebuild=False):
        # if rebuild is true, the entire map (but not the remapping dict) is rebuilt from scratch
        if rebuild==True:
            self.clearMap()
        if normalizationFunction is None: 
            if self._normalizationFunction is None:
                raise ValueError("a normalization function must be defined")
            normalizationFunction = self._normalizationFunction
        self._map.update( dict( (n,normalizationFunction(n)) for n in names ) )
        return
    
    def getMap(self, remap=True):
        """
        Returns a COPY of the names dictionary
        If remap is set to True the remapping dictionary
        is used to remap some names.
        """
        res = deepcopy(self._map)
        if remap and self._remappingDict is not None:
            getNamesPrimitives = lambda n: ( name for name,norm in self._map.items() if norm == n )
            for n,t in ( (n,t) for s,t in self._remappingDict.items() for n in getNamesPrimitives(s) ):
                res[n]=t
        return res
    
    def getNormalizedNames(self, remap=True):
        return sorted(list(set(self.getMap(remap=remap).values())))
    
    def getNamePrimitives(self, n):
        nmap = self.getMap()
        return [ name for name,norm in nmap.items() if norm == n ]
    
    def write_toJson(self, filename="names_map.json"):
        json_dict = dict( (k,v) for (k,v) in vars(self).items() if k!='_normalizationFunction')
        with open(filename, 'w') as output_file:
            json.dump( json_dict, output_file, sort_keys=True, indent=4, ensure_ascii=False)
        return
    
    def reportRemappingInconsistencies(self, returnFormatted=True):
        """
        Reports inconsistencies in the remapping dictionary. Possible inconsistencies
        classes are:
          1. Keys that also appear as values;
        """
        inconsistencies_dict = {'keys_in_vals':('These names appear both as keys and values:', [])}
        remapping = self._remappingDict or {}  # no remaps defined yet -> nothing to report
        dictKeys = remapping.keys()
        dictVals = set(remapping.values())
        
        # Keys_in_vals inconsistency
        for k in dictKeys:
            if k in dictVals:
                inconsistencies_dict['keys_in_vals'][1].append(k)
        
        if sum( len(vals) for k,(desc,vals) in inconsistencies_dict.items() )==0:
            return None
        
        else:
            if returnFormatted:
                resStr = ""
                for k,(desc, vals) in inconsistencies_dict.items():
                    if len(vals)!=0:
                        resStr += "{}\n{}\n".format(desc, ''.join("=" for i in range(len(desc))) )
                        resStr += ''.join( "  {}\n".format(v) for v in vals )
                        resStr += "{}\n".format(''.join("-" for i in range(len(desc))) ) 
                return resStr

            else:
                return inconsistencies_dict
            
            
            
            
def read_namesMap(filepath, fileType='json', *args, **kwargs):
    """
    Reads a names map from a file and returns a NamesMap instance.
    Currently only json files are supported.
    
    Note
    ----
    The normalization function cannot be stored in JSON, and therefore it 
    must be passed as an optional keyword argument 'normalizationFunction'. If no 
    normalization function is passed new names cannot be inserted into the map,
    although remapping can still be done.
    """
    if fileType=='json':
        with open(filepath, 'r') as f:
            data=json.load(f)
            nm = NamesMap(names=None, normalizationFunction=kwargs.get('normalizationFunction', None), jsondata=data)
        return nm
    else:
        raise ValueError("Unsupported file type '{}'.".format(fileType))
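
The two-level lookup performed by `getMap(remap=True)` (raw name → normalized name → remapped name) can be sketched with plain dictionaries. The function `normalize_sketch` below is only a hypothetical stand-in for `nc.normalize`, assuming it lowercases, strips accents and removes spaces and periods; the actual module may apply other rules.

```python
import unicodedata

def normalize_sketch(name):
    """Hypothetical stand-in for nc.normalize (assumed behavior)."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.lower().replace(" ", "").replace(".", "")

raw_names = ["Kirkbride Junior, JH", "Kirkbride Junior, J", "Câmara, PEAS"]
names_map = {n: normalize_sketch(n) for n in raw_names}

# Remapping acts on normalized names, not on the raw strings
remapping = {"kirkbridejunior,j": "kirkbridejunior,jh"}
resolved = {raw: remapping.get(norm, norm) for raw, norm in names_map.items()}

# Keys that are also values form chains, which a single remapping pass does not follow
chained = sorted(set(remapping) & set(remapping.values()))

print(resolved)
print(chained)
```

The `chained` check mirrors the 'keys in values' inconsistency reported by `reportRemappingInconsistencies`: since remaps are applied in one pass, a name remapped to another key would stop halfway.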

Let's first generate the names map from the collectors names list and then the names index from the map. I'll use the `normalize` function from my cleaning module `nc`.

In [12]:
nl = getNamesList(occs['recordedBy_atomized'], orderBy="alphabetic")
nm = NamesMap(nl, nc.normalize)

In [13]:
ni = nc.getNamesIndexes(occs, 'recordedBy_atomized',namesMap=nm.getMap())

## 3. Assembling the network

Now I use the `CoworkingNetwork` class to create my network model. It will later be used to help find heteronymous collectors through a node neighborhood connectivity metric. This is a cyclic process: remapped nodes can be used to rebuild the network model, which in turn can suggest new possible associations.

In [14]:
import networkx
import itertools
from collections import Counter

class CoworkingNetwork(networkx.Graph):
    """
    Class for coworking networks. Extends networkx Graph class.
    
    Parameters
    ----------
    namesSets : iterable
        An iterable of iterables containing names used to compose cliques 
        in the network.

    weighted : bool
        If set to True the resulting network will have weighted edges. Default False.
        
    namesMap : NamesMap
        A NamesMap object for normalizing nodes names.
        
    Examples
    --------
    >>> namesSets = [ ['a','b','c'], ['d','e'], ['a','c'] ]
    >>> CoworkingNetwork( namesSets, weighted=True).edges(data=True)
    [('b', 'a', {'weight': 1}),
     ('b', 'c', {'weight': 1}),
     ('a', 'c', {'weight': 2}),
     ('e', 'd', {'weight': 1})]
    
    >>> CoworkingNetwork( namesSets ).edges(data=True)
    [('b', 'a', {}), 
     ('b', 'c', {}), 
     ('a', 'c', {}), 
     ('e', 'd', {})]
    """
    def __init__(self, namesSets, weighted=False, namesMap=None):
        super().__init__()
        
        if namesMap:
            nmap = namesMap.getMap()
            namesSets = [ [ nmap[n] for n in nset ] for nset in namesSets ]
            
        cliques = map( lambda n: itertools.combinations(n,r=2), namesSets )
        edges = [ e for edges in cliques for e in edges ]
        self.add_edges_from(edges)
        
        if weighted:
            edges_weights = Counter(edges)

            for (u,v),w in edges_weights.items():
                try:
                    self[u][v]['weight'] += w
                except KeyError:
                    # first time this (undirected) edge is seen: initialize its weight
                    self[u][v]['weight'] = w
        
        return
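
The clique construction and edge weighting can be reproduced without networkx, which may help when checking the weights by hand. The sketch below reuses the `namesSets` from the docstring example; representing each edge as a `frozenset` is a choice made for this illustration so that `('a','c')` and `('c','a')` count as the same undirected edge.

```python
import itertools
from collections import Counter

names_sets = [["a", "b", "c"], ["d", "e"], ["a", "c"]]

# Every pair of names in a record becomes one undirected edge;
# the weight is the number of records in which the pair co-occurs.
edge_weights = Counter(
    frozenset(pair)
    for names in names_sets
    for pair in itertools.combinations(names, 2)
)

for edge, weight in sorted((tuple(sorted(e)), w) for e, w in edge_weights.items()):
    print(edge, weight)
```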

In [15]:
m = nm.getMap()
ni = nc.getNamesIndexes(occs, 'recordedBy_atomized',namesMap=nm.getMap())
G = CoworkingNetwork( occs['recordedBy_atomized'], weighted=True, namesMap=nm )
G.remove_node('etal')
nx.write_gexf(G,'graph.gexf')

In [16]:
# keyword arguments work with both the networkx 1.x and 2.x argument orders
nx.set_node_attributes(G, name='n_records', values=dict( (n, len(ni[n])) for n in G.nodes() ))

## 4. Looking for heteronymous entities in the network model

### Likelihood of identity for collectors

This is my first attempt to design a metric for finding potential heteronymous entities (entities with multiple name variations) based on their neighborhood connectivity. It consists of two main steps:
1. Find candidate names that may refer to the same entity using a **name sequence matching algorithm**;
2. For close matches, calculate the **likelihood** that a pair of nodes is the same entity, based on their neighbors. A first, simplistic likelihood function is defined below:

$$likelihood(n_h, n_l) = \begin{cases} \frac{1}{k_l} \sum_{i=1}^{k_l} g(v_i) & \text{if } n_l \notin S_h \\
0 & \text{if } n_l \in S_h \end{cases}, \quad 
g(x) = \begin{cases} 1 & \text{if } x \in S_h \\ 0 & \text{if } x \notin S_h \end{cases}, \qquad \textit{where}$$

* $n_h$ is the higher-degree node;
* $n_l$ is the lower-degree node;
* $k_h$ is the degree of node $n_h$;
* $k_l$ is the degree of node $n_l$;
* $v_i$ are neighbors of node $n_l$;
* $S_h$ is the set of neighbors of node $n_h$.

Intuitively, this likelihood metric is described by the following rules:
1. The likelihood that two names refer to the same entity depends on the fraction of neighbors they share.
2. If two candidate names both appear in the same record, they are neighbors in the coworking network. This indicates that the entities holding these names have in fact co-authored at least one record, so they are discarded as candidates for being a single entity and the likelihood score is set to zero.
3. A lower-degree candidate sharing a high percentage of its neighbors with a higher-degree candidate is more likely to be the same entity (provided they have not co-authored any records).
4. Isolated nodes (with zero degree) are automatically excluded from candidate identities.
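
A toy example of these rules, using a plain dictionary of neighbor sets instead of the network object. All node names and neighborhoods below are hypothetical, and `likelihood_sketch` is only an illustration of the formula, not the function used in the rest of the notebook.

```python
def likelihood_sketch(adj, n1, n2):
    """adj maps each node to the set of its neighbors."""
    # The higher-degree node provides the reference neighborhood S_h
    n_h, n_l = (n1, n2) if len(adj[n1]) > len(adj[n2]) else (n2, n1)
    S_h = adj[n_h]
    k_l = len(adj[n_l])
    if n_l in S_h or k_l == 0:
        return 0.0  # rules 2 and 4: co-authors and isolated nodes
    return len(adj[n_l] & S_h) / k_l  # rule 1: fraction of shared neighbors

adj = {
    "full": {"a", "b", "c", "d", "coauthor"},
    "typo": {"a", "b"},          # all neighbors shared with "full"
    "partial": {"a", "z"},       # half of the neighbors shared
    "coauthor": {"a", "full"},   # appears in a record together with "full"
    "isolated": set(),
}

for other in ["typo", "partial", "coauthor", "isolated"]:
    print(other, likelihood_sketch(adj, "full", other))
```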

I had initially considered using the [*cosine similarity*](https://en.wikipedia.org/wiki/Cosine_similarity) metric for comparing node neighborhoods, but it turned out to disregard many relevant identities where the higher-degree node had a considerably higher degree than the lower-degree node. Many relevant identities have very different degrees, as in most cases they derive from a typo made when recording the collectors' names. Below I define the likelihood function:

In [17]:
def likelihoodOfIdentity( G, n1, n2 ):
    if len(G[n1]) > len(G[n2]):
        n_h,n_l = n1,n2
    else:
        n_h,n_l = n2,n1
    
    S = set(G.neighbors(n_h))  # a set, since neighbors() returns a one-shot iterator in networkx >= 2.0
    k_h = G.degree(n_h)
    k_l = G.degree(n_l)
    
    if n_l in S or k_l==0:
        return 0
    else:
        return sum( 1 if v in S else 0 for v in G.neighbors(n_l) )/k_l
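
To make the remark about cosine similarity concrete, the sketch below compares both measures for a hypothetical pair of nodes with very different degrees: a well-connected node with 100 neighbors and a probable typo with 2 neighbors, both of which are shared. For binary neighborhood vectors the cosine similarity reduces to $|A \cap B| / \sqrt{|A| \cdot |B|}$.

```python
import math

def cosine_similarity_sets(a, b):
    """Cosine similarity of two neighborhoods seen as binary vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b))

S_h = set(range(100))   # neighbors of the higher-degree node (hypothetical)
S_l = {0, 1}            # neighbors of the lower-degree node, all shared

cosine = cosine_similarity_sets(S_h, S_l)
shared_fraction = len(S_l & S_h) / len(S_l)

print(round(cosine, 3), shared_fraction)
```

The cosine score is pulled down by the size of the larger neighborhood, whereas the likelihood only looks at the lower-degree node's neighbors and reaches its maximum here.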

Now let's run the likelihood function on the candidate pairs selected by the sequence matching algorithm. This piece of code takes a while to run, as each name must be compared against every other name, for a total of $n \cdot (n-1)$ comparisons by the sequence matching algorithm. The result is a list of possible mappings, which must be inspected before being used to remap the NamesMap. This list is composed of 3-tuples $(n1,n2,S)$, where the name in $n1$ remaps to that in $n2$ and $S$ is the likelihood score, computed using the function defined in the cell above. Note that $n1$ and $n2$ are themselves tuples $(name,num_{records})$, ordered so that $num_{records\_n1}$ is never greater than $num_{records\_n2}$. Only pairs with a score of at least `similarity_threshold` are kept.

In [18]:
import difflib as dfl
names = [ n for n,d in sorted( G.nodes(data=True), key=lambda x: x[1]['n_records'], reverse=True ) ]
n_records_dict = dict( (n, d['n_records']) for n,d in G.nodes(data=True) )

l=[]
similarity_threshold=0.1
for n1 in names:
    for n2 in [ n2 for n2 in dfl.get_close_matches(n1, names) if n2!=n1 ]:
        sim = likelihoodOfIdentity(G,n1,n2)
        n1_num_records = n_records_dict[n1]
        n2_num_records = n_records_dict[n2]
        if sim >= similarity_threshold:
            if n2_num_records > n1_num_records:
                l += [ ((n1, n1_num_records), (n2, n2_num_records), sim) ]
            else:
                l += [ ((n2, n2_num_records), (n1, n1_num_records), sim) ]
                
l = list(set(l))
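
The candidate selection relies on `difflib.get_close_matches`, which by default returns at most `n=3` matches with a similarity ratio of at least `cutoff=0.6`. The small example below, with a hypothetical list of normalized names, shows how the loop above obtains candidates for a single name.

```python
import difflib

# Hypothetical normalized names, in the style of the node labels used here
names = ["souza,mgm", "souza,mg", "sousa,rv", "irwin,hs"]

name = "souza,mg"
# Defaults: n=3 best matches, cutoff=0.6
matches = difflib.get_close_matches(name, names)
# The name itself is in the list, so it is dropped afterwards
candidates = [m for m in matches if m != name]

print(candidates)
```

Since the queried name is always its own best match, it takes one of the three default slots, leaving at most two other candidates per name. Passing a larger `n` or a different `cutoff` would widen or narrow the candidate set.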

Let's check the list ordered by likelihood score followed by n2's num of records and n1's num of records. I'll only output the first 20 items.

In [19]:
sorted(l, key=lambda x: (x[2], x[1][1], x[0][1]),reverse=True)[:20]

[(('souza,mg', 5), ('souza,mgm', 4229), 1.0),
 (('munhoz,ca', 2), ('munhoz,cbr', 3932), 1.0),
 (('santos,rrb', 2), ('santos,rr', 3879), 1.0),
 (('harley,gm', 3), ('harley,rm', 3016), 1.0),
 (('belem,pr', 19), ('belem,rp', 2745), 1.0),
 (('kirkbridejunior,j', 13), ('kirkbridejunior,jh', 2423), 1.0),
 (('amorim,pr', 80), ('amorim,prf', 2231), 1.0),
 (('amorim,p', 1), ('amorim,prf', 2231), 1.0),
 (('pires,jn', 1), ('pires,jm', 1785), 1.0),
 (('fonseca,s', 57), ('fonseca,sf', 1779), 1.0),
 (('fonseca,fg', 1), ('fonseca,sf', 1779), 1.0),
 (('melo,trb', 243), ('mello,trb', 1529), 1.0),
 (('mendonca,rr', 1), ('mendonca,rc', 1441), 1.0),
 (('mendonca,fca', 1), ('mendonca,rc', 1441), 1.0),
 (('carvalho,avm', 2), ('carvalho,am', 1429), 1.0),
 (('onishi', 4), ('onishi,e', 1352), 1.0),
 (('onishi,gl', 1), ('onishi,e', 1352), 1.0),
 (('philcox', 1), ('philcox,d', 1322), 1.0),
 (('souza,rv', 67), ('sousa,rv', 1291), 1.0),
 (('villar,ts', 4), ('villarroel,d', 1219), 1.0)]
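
One possible way to prepare this list for manual review is to turn it into a draft dictionary and set aside names that point to more than one target. The helper `draft_remapping` below is a hypothetical convenience, not part of the notebook's workflow; the tuples are copied from the output above.

```python
from collections import defaultdict

candidates = [
    (("souza,mg", 5), ("souza,mgm", 4229), 1.0),
    (("amorim,pr", 80), ("amorim,prf", 2231), 1.0),
    (("amorim,p", 1), ("amorim,prf", 2231), 1.0),
    (("villar,ts", 4), ("villarroel,d", 1219), 1.0),
]

def draft_remapping(candidates, min_score=1.0):
    """Return (draft, ambiguous): unique suggestions and names with several targets."""
    targets = defaultdict(set)
    for (src, _), (tgt, _), score in candidates:
        if score >= min_score:
            targets[src].add(tgt)
    draft = {s: next(iter(t)) for s, t in targets.items() if len(t) == 1}
    ambiguous = {s: sorted(t) for s, t in targets.items() if len(t) > 1}
    return draft, ambiguous

draft, ambiguous = draft_remapping(candidates)
print(draft)
print(ambiguous)
```

Every entry of the draft still needs a human decision: a pair such as `'villar,ts'` and `'villarroel,d'` scores 1.0 but may well refer to two different people.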

The user can then inspect candidate identities and build a remapping dictionary:

In [20]:
remap = {
    'xavier,s': 'xavier,scs',
    'kinoshita,ls': 'kinoshitagouvea,ls',
    'teodoro,d': 'teodoro,daa',
    'leoni,l': 'leoni,ls',
    'krieger,l': 'krieger,pl',
    'dutilh,j': 'dutilh,jha',
    'arzolla,fardp': 'arzolla,farp',
    'shepherd,g': 'shepherd,gj',
    'caliari,cp': 'calliari,cp',
    'parente,hmv': 'parente,hvm',
    'parente,hm': 'parente,hvm',
    'ledoux': 'ledoux,p',
    'montefusco,n': 'montefuso,neg',
    'bovini,m': 'bovini,mg',
    'kuhlmann,m': 'kuhlmann,mp',
    'cordovil,s': 'cordovil,sp',
    'salas,r': 'salas,rm',
    'kallunki,j': 'kallunki,ja',
    'ianhez,m': 'ianhez,ml',
    'somavilla,n': 'somavilla,nsd',
    'welle,bjh': 'terwelle,bjh', 
    'giulietti,an': 'giulietti,aml',
    'barboza,e': 'barbosa,e', 
    'juchum,f': 'juchum,fs',
    'bianchetti,l': 'bianchetti,lb',
    'melo,mrf': 'melo,mmrf',
    'rosario,c': 'rosario,cs',
    'damasceno,ga': 'damascenojunior,ga',
    'custodio,a': 'custodiofilho,a',
    'salimena,f': 'salimena,frg',
    'chen,c': 'chen,ch',
    'rocha,d': 'rocha,ds',
    'matos,m': 'matos,mq',
    'assumpcao,s': 'assuncao,s',
    'santos,hg': 'santos,hgp',
    'bonomo': 'bonomo,vs',
    'fonseca,j': 'fonseca,js',
    'anderson,w': 'anderson,wr',
    'kollmann,ljc': 'kollmann', 
    'kollmann,rl': 'kollmann',
    'pabst,g': 'pabst,gfj',
    'borges,l': 'borges,lm',
    'baungarten,j': 'baumgarten,j',
    'ferraz,nm': 'ferraz,nms',
    'tressens,s': 'tressens,sg',
    'moreira,m': 'moreira,mv',
    'gifford,d': 'gifford,dr',
    'rosa,n': 'rosa,na',
    'giulietti,am': 'giulietti,aml',
    'taxonomyclassofuniversidadedebrasil': 'taxonomyclassofuniversidadedebrasilia',
    'santos,b': 'santos,br',
    'schettino': 'schettino,vm',
    'mattos,lam': 'mattossilva,la',
    'gomes,vl': 'klein,vlg',
    'amorim,a': 'amorim,ah',
    'jardim,a': 'jardim,ab',
    'lima,jc': 'lima,jcm',
    'barros,m': 'barros,mag',
    'coelho,d': 'coelho,df',
    'forster,w': 'foster,w',
    'kozovits,a': 'kozovits,ar',
    'occhioni,p': 'ochioni,p',
    'brugger,m': 'brugger,mc',
    'moreira,h': 'moreira,hjc',
    'egler,w': 'egler,wa',
    'pott,v': 'pott,vj',
    'lima,am': 'lima,amb',
    'graziela': 'graziella,m',
    'guimaraes,p': 'guimaraes,pjf',
    'bueno,n': 'bueno,nc',
    'werneck,w': 'werneck,wl',
    'senna,l': 'senna,lr',
    'ribeira,sc': 'ribeiro,sc',
    'lin,h': 'lin,hh',
    'james,t': 'james,ta',
    'salimena,fr': 'salimena,frg',
    'mori,s': 'mori,sa',
    'silva,jc': 'silva,jcs',
    'ramos': 'ramos,r',
    'santana,s': 'santana,sc',
    'amaral,a': 'amaral,ag',
    'correa,c': 'correia,c',
    'grear,jw': 'grearjunior,jw',
    'harley,r': 'harley,rm',
    'oliveira,p': 'oliveira,pp',
    'gracianoribeiro,d': 'ribeiro,dg',
    'landrum,l': 'landrum,lr',
    'oliveira,n': 'oliveira,nro',
    'stannard,b': 'stannard,bl',
    'amorim,am': 'amorim,ama',
    'gomesklein,vl': 'klein,vlg',
    'lorenco,jlm': 'lourenco,jml',
    'lourenco,jlm': 'lourenco,jml',
    'rosini,j': 'rossini,j',
    'mattos,la': 'mattossilva,la',
    'lima,v': 'lima,vp',
    'gomes,l': 'gomes,lcj',
    'pires,f': 'pires,fs',
    'king,lrm': 'king,rm',
    'santos,j': 'santos,jal',
    'prado,al': 'prado,ajl', 
    'judziewicz,e': 'judziewicz,ej',
    'buzzi,f': 'bucci,ffb',
    'walter': 'walter,bmt',
    'klein,r': 'klein,rm',
    'klein': 'klein,rm',
    'fazza,fa': 'fazza,lfa',
    'noris,d': 'norris,d',
    'mota,cd': 'mota,cda',
    'silva,pe': 'silva,pen',
    'edsonchaves,b': 'chaves,be', 
    'santana,me': 'santanna,me',
    'sanaiot': 'sanaiotti,tm',
    'marques,mc': 'marques,mcm',
    'bresolin': 'bresolin,a',
    'wurdack,j': 'wurdack,jj',
    'borgato,d': 'borgatto,df',
    'paixao,j': 'paixao,jl',
    'ferreira,j': 'ferreirapaixao,j',
    'niclughada,em': 'lughadha,en',
    'rocha,w': 'rocha,wd',
    'cruz,t': 'cruz,ta',
    'araujo,d': 'araujo,da',
    'carvalho,amv': 'carvalho,avm',
    'deslacio,f': 'delascio,jf',
    'fagg,wc': 'fagg,cw',
    'haas,h': 'hass,jh',
    'faria,ca': 'farias,ca',
    'fonseca,s': 'fonseca,sc',
    'zolma,fj': 'zelma,fj', 
    'sidnei,rm': 'sidney',
    'sidney,gf': 'sidney',
    'gatti,g': 'gatti,j',
    'sousa,rtc': 'souza,rtc',
    'vicente,j': 'vicente,jc',
    'hind,p': 'hind,pd',
    'dunaiski,a': 'dunaiskijunior,a',
    'souza,mg': 'souza,mgm',
    'cardoso,cf': 'cardoso,cfr',
    'fernandes,gd': 'fernandes,gdf',
    'meireles,ml': 'meirelles,ml',
    'siqueira,g': 'siqueira,gs',
    'capelo,fa': 'capelo,ffm',
    'mota,da': 'mota,cda',
    'soares,lgs': 'soareslima,gs',
    'aguiar,am': 'aguiar,an',
    'wagner,nl': 'wagner,hl',
    'anderson,l': 'anderson,le',
    'veloso,lj': 'veloso,lm',
    'smith,gl': 'smith,gm',
    'abreu,lc': 'abreu,lcr',
    'abreu,m': 'abreu,mc',
    'abreu,ms':'abreu,mc',
    'aguiar,ac': 'aguiar,aca',
    'allem,a':'allem,ac',
    'gawryszewski,fm': 'grawryszewski,fm',
    'coradin,l': 'coradin,lc',
    'jennings,': 'jennings,lvs',
    'kolmann,l': 'kollmann,l',
    'vera,l': 'veralucia',
    'clemente,c': 'clemente,cm',
    'bertolda,j': 'bertoldo,j',
    'smith,g': 'smith,gm',
    'vaz,a': 'vaz,amsf',
    'sena,l': 'senna,lr',
    'sanaiotti,t': 'sanaiotti,tm',
    'klein,vl': 'klein,vlg',
    'casto,ws': 'castro,ws',
    'dias,ej': 'dias,jb',
    'torres,dc': 'torres,dsc',
    'landim,m': 'landim,mf',
    'silva,lh': 'soares-silva,lh', 
    'soaressilva,lh': 'soares-silva,lh',
    'silva,lhs': 'soares-silva,lh',
    'oliveira,ma': 'oliveira,ms',
    'borges,r': 'borges,rax',
    'oliveira,s': 'oliveira,scc',
    'lage,jl': 'hage,jl',
    'maas,h': 'maas,pjm',
    'cardoso,e': 'cardoso,es',
    'proenca,c': 'proenca,ceb',
    'noleto,l': 'noletto,lg',
    'rudall,p': 'ruddall,p',
    'chiea,sac': 'chiea,sc',
    'cielofilho,r': 'cielo-filho,r',
    'filho,rc': 'cielo-filho,r',
    'cid,ca': 'cid,cac',
    'nascimento,e': 'nascimento,ea',
    'jardim,j': 'jardim,jg',
    'villaroel,d': 'villarroel,d',
    'wagner': 'wagner,hl',
    'dias,bj': 'dias,jb',
    'flores': 'flores,tb',
    'lucas,e': 'lucas,ej',
    'morbeck': 'morbeck,a',
    'castro,r': 'castro,ra',
    'passon,l': 'passon,lm',
    'simpson,pl': 'simpson-junior,pl', 
    'simpsonjunior,pl': 'simpson-junior,pl',
    'paulasouza,j': 'souza,jp',
    'coveny,r': 'coveny,rg',
    'crosby': 'crosby,mr',
    'souza,cv': 'souza,vc',
    'moreira,al': 'moreira,alc',
    'nobs,ma': 'noles,ma',
    'kuehn,e': 'kuhn,e',
    'davidsen,c': 'davidson,c',
    'estabrook': 'estabrook,gf',
    'sousa,tc': 'souza,tc',
    'verwimp': 'verwimp,i',
    'campos,jmf': 'campos,jmp',
    'silva,lm': 'silva,lam',
    'smith,l': 'smith,lb',
    'yamomoto,m': 'yamamoto,m',
    'verveloet,rr': 'vervloet,rr',
    'rocha,rm': 'rocha,rn',
    'oliveira,nr': 'oliveira,nro',
    'haas,jh': 'hass,jh',
    'whalen,a': 'whalen,aj',
    'sa,spp': 'sa,spps',
    'bensusan,n': 'bensusan,nr',
    'borgato,df': 'borgatto,df',
    'mendes,jn': 'mendes,jm',
    'fontella,j': 'fontella,jp',
    'staggemeier,vg': 'staggmeier,vg',
    'campos,mtv': 'campos,mtva',
    'benton,f': 'benton,fp',
    'marchioni,jm': 'marchiori,jn',
    'juchum': 'juchum,fs',
    'peocopio,lc': 'procopio,lc',
    'romeroc': 'romero,c', 
    'marques,c': 'marques,cf',
    'cardoso,ef': 'cardoso,f',
    'pereira,ba': 'pereira,bas',
    'carvalho,sl': 'carvalho-leite,sl', 
    'carvalholeite,sl': 'carvalho-leite,sl',
    'benedete': 'benedete,al',
    'harley,gm': 'harley,rm',
    'diasmelo,r': 'dias-melo,r', 
    'melo,rd': 'dias-melo,r',
    'schiesinki': 'schiesinski,d',
    'porto,jr': 'porto,jlr',
    'argent,g': 'argent,gcg',
    'argentgcgin': 'argent,gcg',
    'rodri': 'rodrig',
    'araujo,g': 'araujo,gm',
    'carvalho,am': 'carvalho,avm',
    'careno,s': 'carreno,s',
    'mazine,f': 'mazine-capelo,ff', 
    'mazinecapelo,ff': 'mazine-capelo,ff',
    'lopes,i': 'lopes,isn',
    'irvine,gc': 'irwine,cg',
    'taroda,n': 'tarroda,n',
    'isejima': 'isejima,em',
    'dario,f': 'dario,fr',
    'onishi': 'onishi,e',
    'reitz': 'reitz,pr',
    'reitz,r': 'reitz,pr',
    'devogel,ef': 'vogel,hvf',
    'borges,j': 'borges,jwm',
    'brade': 'brade,ac',
    'nicacio': 'nicacio,jn',
    'anapaula': 'ana-paula', 
    'paula,a': 'ana-paula',
    'dell,d': 'odell,d',
    'rondon': 'rondon,c',
    'santos,h': 'santos,hcf',
    'furla': 'furlan,a',
    'grasser,g': 'grasser,ga',
    'pedrosa,ma': 'pedroso,ma',
    'pedrosa,n': 'pedrosa,ns',
    'zoccoli,d': 'zoccoli,dm',
    'arroyo,mtk': 'kallin-arroyo,mt', 
    'kallinarroyo,mt': 'kallin-arroyo,mt', 
    'brito,ic': 'britto,ic',
    'degrande,da': 'grande,da',
    'sales,sc': 'salles,sc',
    'souza,r': 'souza,rr',
    'guedes,j': 'guedes,jc',
    'herlan,j': 'herlanio,j',
    'nascimento,a': 'nascimento,ae',
    'siva,ma': 'silva,ma',
    'bucci,f': 'bucci,ffb',
    'santana,bdi': 'santana,bid',
    'giordano,lc': 'giordano,lcs',
    'meyer': 'meyer,fs',
    'franca': 'franca,f', 
    'koekemoer,m': 'koekomoer,m',
    'souza,rt': 'souza,rtc',
    'pereira,t': 'pereira,ta',
    'jose,m': 'jose,maria',
    'carmo,j': 'carmo,jj',
    'fernandes,a': 'fernandes,ag',
    'moraes,plr': 'moraes,prl',
    'maia,w': 'maia,wd',
    'martins,ca': 'martins,can',
    'polite,l': 'politi,l',
    'almeida,f': 'almeida,fc',
    'borges,jw': 'borges,jwm',
    'kuhlman,m': 'kuhlmann,mp',
    'silva,mb': 'silva,mib', 
    'sousa,rv': 'souza,rv',
    'koczichi': 'koczicki,c',
    'leite,jr': 'leite,jrs',
    'silva,pit': 'tanno-silva,pi',
    'tannosilva,pi': 'tanno-silva,pi',
    'mendonca,r': 'mendonca,rc',
    'mendonca,rr': 'mendonca,rc',
    'schumke,j': 'schunke,j',
    'abe,lb': 'abe,lm',
    'stieber,m': 'stieber,mt',
    'sieber,m': 'stieber,mt',
    'chagasesilva,fc': 'chagas-e-silva,fc', 
    'chagasesilva,f': 'chagas-e-silva,fc',
    'silva,fc': 'chagas-e-silva,fc',
    'prance,g': 'prance,gt',
    'melo,trb': 'mello,trb', 
    'sena,pac': 'senna,pac',
    'pereira,l': 'pereira,la',
    'caneiro,j': 'carneiro,j',
    'munhoz,ca': 'munhoz,cbr',
    'jimenez': 'jimenez,ja',
    'castellanos': 'castellanos,a',
    'cristobal,l': 'cristobal,cl', 
    'sousa,ng': 'souza,ng',
    'westra,lyt': 'westra,lyth',
    'luna,ta': 'luna,ti',
    'pilger': 'pilges',
    'silva,mi': 'silva,mib',
    'mitzi': 'mitzi,g',
    'santos,rr': 'santos,rrb',
    'morales,r': 'morales,rav',
    'siva,ja': 'silva,ja', 
    'vitti,f': 'vitti,fx',
    'oliveira,fac': 'oliveira,fcao',
    'landrum,s': 'landrum,ss',
    'rodrigues,wa': 'rodrigues,wm',
    'constable,ef': 'constable,ej',
    'armando,m': 'armando,ms',
    'falcao,ji': 'falcao,jia',
    'maroccolo,jf': 'marocollo,jf',
    'maxwell,h': 'maxwell,hh',
    'zaruchi': 'zarucchi,j', 
    'santos,mcv': 'vilela-santos,mc', 
    'vilelasantos,mc': 'vilela-santos,mc',
    'silva,eb': 'silva,ebm',
    'jumbo,s': 'jimbo,s', 
    'medeiros,l': 'medeiros,lb',
    'johnson,l': 'johnson,las',
    'franciso,em': 'francisco,em', 
    'melo': 'melo,e',
    'nogueira,lm': 'nogueira,lmg',
    'aquino,f': 'aquino,fg',
    'cairus,rjr': 'cairos,rjr',
    'santos,aj': 'santos,ajv',
    'cerqueira,ls': 'cerqueira,lsc',
    'fonseca,fj': 'fonseca,js',
    'fonseca,l': 'fonseca,lm',
    'vilar,ts': 'villar,ts',
    'colleta,gd': 'colletta,gd',
    'barker,r': 'barker,rm',
    'fragg,c': 'fagg,cw', 
    'fagg,c': 'fagg,cw',
    'pennigton,td': 'pennington,td',
    'rivera,v': 'rivera,vl',
    'smya,s': 'sumya,s',
    'versiane,af': 'versiane,afa',
    'castelo,aj': 'castro,aj',
    'richards': 'richardspwin',
    'richards,m': 'richardspwin',
    'richards,pw': 'richardspwin',
    'pinheiro,s': 'pinheiros,s',
    'pinheiro,em': 'pinheiros,em',
    'lima,e': 'lima,es',
    'hind,dj': 'hind,djn',
    'hind,n': 'hind,djn', 
    'soares,g': 'soares,gf',
    'ferreira,map': 'pereira,map',
    'dobereiner': 'dobereiner,j',
    'scariot,a': 'scariot,ao',
    'monteiro,r': 'monteiro,rn',
    'leal,cg': 'leal,g',
    'garcia,mcm': 'garcia,mgm',
    'lopes,wdp': 'lopes,wp',
    'oliveira,m': 'oliveira,ms',
    'vinha,s': 'vinha,sg',
    'santos,g': 'santos,gb',
    'filgueira,ts': 'filgueiras,ts',
    'stutte,jg': 'stutts,jg',
    'stutte,j': 'stutts,jg',
    'stutts,j': 'stutts,jg',
    'shutts,jg': 'stutts,jg',
    'garcia,pb': 'garcia,pbc',
    'joaovicente': 'vicente,jc',
    'mariz,g': 'mariza,g',
    'fierros,af': 'freire-fierros,a', 
    'freirefierros,a': 'freire-fierros,a',
    'rodrigues,ce': 'rodrigues-junior,ce', 
    'rodriguesjunior,ce': 'rodrigues-junior,ce',
    'kirkbridejunior,j': 'kirkbride-junior,jh', 
    'kirkbridejunior,jh': 'kirkbride-junior,jh',
    'kirkbride,jh': 'kirkbride-junior,jh',
    'souza,rs': 'sousa,rs',
    'belizario,m': 'belisario,m',
    'gibles,p': 'gibbs,pe', 
    'black': 'black,ga',
    'black,g': 'black,ga',
    'black,ca': 'black,ga',
    'szechy,mtm': 'sechy,mts', 
    'ladeira,j': 'ladeira,jl',
    'kauseilmari': 'kause,i',
    'rodrigues,w': 'rodrigues,wm',
    'marines,g': 'marinis,g',
    'maranis,g': 'marinis,g',
    'bromley,g': 'bromley,gl',
    'barbosa,ea': 'barbosa,e', 
    'leitaofilho,h': 'leitao-filho,h',
    'leitao,hf': 'leitao-filho,h',
    'leitaofilho,hf': 'leitao-filho,h',
    'nilsson,s': 'nilson,s', 
    'mirizawa,m': 'kirizawa,m',
    'gomes,v': 'klein,vlg',
    'leite,jra': 'leite,jrs',
    'philcox': 'philcox,d',
    'dutil,jh': 'dutilh,jha',
    'arraes,mgm': 'arrais,mgm',
    'soderstrom': 'soderstrom,tr',
    'raulino,t': 'raulino,taf',
    'cisnero,la': 'cisneros,la',
    'santos,f': 'santos,gf',
    'santos,fm': 'santos,fam',
    'santos,ffm': 'santos,fam',
    'wasshusen,dc': 'wasshausen,dc', 
    'fereira,a': 'ferreira,a',
    'freitas,g': 'freitas,gs',
    'oliveira,fca': 'oliveira,fcao',
    'cervi,ac': 'cervi,ca',
    'wilsonbrowne,g': 'browne,gw', 
    'ribeiro,mm': 'ribeiro,mmv',
    'boom,b': 'boom,bm',
    'amorim,pr': 'amorim,prf',
    'amorim,p': 'amorim,prf',
    'lannasobrinho,jp': 'lana-sobrinho,jp',
    'lanasobrinho,jp': 'lana-sobrinho,jp',
    'sobrinho,jpl': 'lana-sobrinho,jp',
    'pires,jm': 'pires,jn',
    'pires,jf': 'pires,jn',
    'bomley,gl': 'bromley,gl',
    'dusi,rl': 'dusi,rlm',
    'assuncao,pacl': 'assuncao,pac',
    'assunsao,paci': 'assuncao,pac',
    'assuncao,pacs': 'assuncao,pac',
    'assuncao,pa': 'assuncao,pac',
    'hathome,w': 'hawthorne,w',
    'sendullsky,t': 'sendulsky,t',
    'caixetadedeus,w': 'caixeta,w',
    'alavarenga,d': 'alvarenga,d',
    'fontella,pj': 'fontella,jp',
    'andrade,p': 'andrade,pm',
    'rizzini': 'rizzini,ct',
    'barboza,ma': 'barbosa,ma'
}
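
A dictionary of this size is easier to review when it is grouped by target name. The following is a minimal sketch, not part of the notebook's library: `group_variants` is a hypothetical helper that inverts a `{variant: canonical}` dict, and `sample_remap` holds only a few entries copied from the dictionary above.

```python
from collections import defaultdict

def group_variants(remap):
    """Invert a {variant: canonical} dict into {canonical: sorted variants}."""
    groups = defaultdict(list)
    for variant, canonical in remap.items():
        groups[canonical].append(variant)
    return {canonical: sorted(variants) for canonical, variants in groups.items()}

# A few entries copied from the remapping dictionary above.
sample_remap = {
    'stutte,jg': 'stutts,jg',
    'stutte,j': 'stutts,jg',
    'stutts,j': 'stutts,jg',
    'shutts,jg': 'stutts,jg',
    'black': 'black,ga',
    'black,g': 'black,ga',
    'black,ca': 'black,ga',
}

variant_groups = group_variants(sample_remap)
for canonical, variants in sorted(variant_groups.items()):
    print(canonical, '<-', ', '.join(variants))
```

Listing all variants next to their target makes it easier to spot a variant that was assigned to the wrong entity.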

## 5. Rebuilding the network model with remapped names

Now let's use the remapping dictionary above to remap our NamesMap:

In [21]:
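# Apply the manual remapping dictionary to the names map
nm.remap(remap,fromScratch=True)
# Map each collector id to its records, then rebuild the weighted network
ni = nc.getNamesIndexes(occs, 'recordedBy_atomized',namesMap=nm.getMap())
G = CoworkingNetwork( occs['recordedBy_atomized'], weighted=True, namesMap=nm )
G.remove_node('etal')
# Store record counts as a node attribute
# (networkx 1.x argument order; networkx >= 2.0 expects (G, values, name))
nx.set_node_attributes(G, 'n_records', dict( (n, len(ni[n])) for n in G.nodes() ))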
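
To illustrate why remapping changes the network, here is a minimal sketch using only the standard library. It does not reproduce the `NamesMap` or `CoworkingNetwork` classes: `apply_remap` and `coworking_edges` are hypothetical helpers, and the two records are made-up toy data (only the name spellings are taken from this dataset's names map).

```python
from collections import Counter
from itertools import combinations

def apply_remap(names_map, remap):
    """Return a copy of names_map with normalized ids replaced via remap."""
    return {raw: remap.get(norm, norm) for raw, norm in names_map.items()}

def coworking_edges(records, names_map):
    """Count how often each pair of collector ids shares a record."""
    edges = Counter()
    for collectors in records:
        ids = sorted({names_map[name] for name in collectors})
        edges.update(combinations(ids, 2))
    return edges

# Hypothetical toy records; raw name -> normalized id before any remapping.
names_map = {'Abe, LB': 'abe,lb', 'Abe, LM': 'abe,lm', 'Abreu, MC': 'abreu,mc'}
records = [['Abe, LB', 'Abreu, MC'], ['Abe, LM', 'Abreu, MC']]

before = coworking_edges(records, names_map)
after = coworking_edges(records, apply_remap(names_map, {'abe,lb': 'abe,lm'}))
print(before)  # two separate edges, one per spelling
print(after)   # a single edge with weight 2
```

Before remapping, the two spellings are separate nodes with one tie each. After remapping they collapse into a single node whose tie has weight 2.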

We can get a report of inconsistencies in the names map:

In [22]:
print(nm.reportRemappingInconsistencies(returnFormatted=True))

None


The names map reports an inconsistency if, for example, a name appears both as a key and as a value in the remapping dict. To demonstrate, let's add `'abe,lm'`, which is already a value in the dict, as a key mapping to `'abe,lb'`.

In [23]:
nm.remap({'abe,lm':'abe,lb'})

And now ask for a new report:

In [24]:
print(nm.reportRemappingInconsistencies(returnFormatted=True))

These names appear both as keys and values:
  abe,lb
  abe,lm
-------------------------------------------
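
The same kind of check can be sketched on a plain dict. This is only an illustration under the assumption that the report looks for names used on both sides of the mapping; `keys_also_values` is a hypothetical helper, not the implementation behind `reportRemappingInconsistencies`.

```python
def keys_also_values(remap):
    """Return the sorted names that occur both as a key and as a value."""
    return sorted(set(remap) & set(remap.values()))

remap_ok = {'abe,lb': 'abe,lm', 'abreu,m': 'abreu,mc'}
# Adding the reverse entry makes both names appear on both sides.
remap_cyclic = {**remap_ok, 'abe,lm': 'abe,lb'}

print(keys_also_values(remap_ok))      # []
print(keys_also_values(remap_cyclic))  # ['abe,lb', 'abe,lm']
```

A name on both sides means the mapping forms a chain or a cycle, so the final identity of that name depends on how many times the remapping is applied.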



This can be fixed by removing the `'abe,lm'` key.

In [25]:
nm.remove_fromRemap('abe,lm')

'abe,lb'

In [26]:
print(nm.reportRemappingInconsistencies(returnFormatted=True))

None


But `'abe,lb'` is still a key to `'abe,lm'`:

In [27]:
nm._remappingDict['abe,lb']

'abe,lm'

Then we can finally save the names map in a *.json* file for later use:

In [28]:
nm.write_toJson('namesMap.json')

Retrieving the map is as simple as:

In [29]:
nm2 = read_namesMap('./namesMap.json')

In [30]:
print("Names map just read\n===================")
print('\n'.join( [k+": "+v for k,v in nm2.getMap().items()][:20] ))
print("...")

Names map just read
===================
.: 
1980 Sino-Amer Exped.: sinoamerexped
?: 
A.J.N.V.: ajnv
A.M.: am
Abbas, B: abbas,b
Abdala, GC: abdala,gc
Abdo, MSA: abdo,msa
Abdon: abdon
Abe, LB: abe,lm
Abe, LM: abe,lm
Abrahim, MA: abrahim,ma
Abreu, CG: abreu,cg
Abreu, GX: abreu,gx
Abreu, I: abreu,i
Abreu, LC: abreu,lcr
Abreu, LCR: abreu,lcr
Abreu, M: abreu,mc
Abreu, MC: abreu,mc
Abreu, MS: abreu,mc
...


Note, though, that the normalization function cannot be stored in the *.json* file, so it must be set again on the names map copy we've just read. New names cannot be inserted into the map until the normalization function is defined:

In [31]:
print(nm2._normalizationFunction)

None


In [32]:
nm2.insertNames(['Pedro C. de Siracusa'])

ValueError: a normalization function must be defined

In [33]:
nm2.setNormalizationFunc( nc.normalize )

In [34]:
nm2.insertNames(['Pedro C. de Siracusa'])
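
Why the function has to be set again can be shown with the standard library alone. In this minimal sketch, `normalize` is a hypothetical stand-in for `nc.normalize` (it only lowercases and drops spaces and periods), and the map is a plain dict rather than a `NamesMap` object.

```python
import json

def normalize(name):
    """Hypothetical stand-in for nc.normalize: lowercase, drop spaces and periods."""
    return name.lower().replace(' ', '').replace('.', '')

names_map = {'Abe, LB': 'abe,lm', 'Abreu, MC': 'abreu,mc'}

# Plain strings round-trip through JSON unchanged.
restored = json.loads(json.dumps(names_map))

# A function object is not JSON-serializable, so it cannot travel with the map.
try:
    json.dumps({'map': names_map, 'normalize': normalize})
    function_serialized = True
except TypeError:
    function_serialized = False

# After loading, the function must be supplied again before adding names.
restored['A.M.'] = normalize('A.M.')
print(function_serialized, restored['A.M.'])
```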

After setting the normalization function, I could insert a new collector name. An alternative that would also store the normalization function is to serialize the names map with the pickle library. The advantage of *.json*, however, is readability: one could even edit the map in a simple text editor and reload it in a Python session. And that's it for this notebook.

In [35]:
%%bash
rm namesMap.json