# <center> Dealing with name inconsistencies </center>
#### <center> Author: Pedro Correia de Siracusa </center>
#### <center> Date: Aug 27, 2017</center>


As defined by the [DarwinCore](http://rs.tdwg.org/dwc/) standard, the names of the collectors and identifiers responsible for individual occurrences must be given in the [`recordedBy`](http://rs.tdwg.org/dwc/terms/#recordedBy) and [`identifiedBy`](http://rs.tdwg.org/dwc/terms/#identifiedBy) columns, respectively. Although there are recommended best practices for formatting collector and identifier names in these fields, the way names are actually typed varies considerably across datasets. This leads to two basic types of **name inconsistencies**:

1. Multiple different names might actually refer to the same entity;
2. Multiple entities might end up being referred to by the same name.

For the purpose of building network models with collectors and identifiers as entities, we must make an effort to minimize name inconsistencies, as they could lead to misleading interpretations of the model. In this notebook I propose a routine for dealing with such inconsistencies that includes (i) getting atomic names from a names string, (ii) getting a normal form for each name and (iii) mapping name variants to global "correct" normalized versions, which should uniquely identify each entity.

### Table of Contents

* [1. Getting atomic names](#1.-Getting-atomic-names)
* [2. Names normalization](#2.-Names-normalization)
* [3. Name variants mapping](#3.-Name-variants-mapping)

---

In [1]:
import pandas as pd
import numpy as np

In this notebook I will use the UB herbarium (at University of Brasília) dataset.

In [2]:
colsList = ['scientificName', 'taxonRank', 'family', 
            'stateProvince', 'locality', 'municipality', 
            'recordedBy', 'identifiedBy',
            'eventDate']

occs_datafile = './0077202-160910150852091/occurrence.txt'
occs = pd.read_table(occs_datafile, usecols=colsList)

---

## 1. Getting atomic names

The first step in retrieving collector or identifier names from their fields is to correctly separate individual names from a names string. Although the DarwinCore standard recommends using the pipe character ("`|`") as the delimiter when inserting multiple names in a field, not all herbarium datasets follow this best practice. The UB herbarium, for example, uses a semicolon ("`;`") as the delimiter:

In [3]:
occs['recordedBy'][0]

'Anderson, WR; Arroyo, MTK; Hill, SR; Santos, RR; Souza, R'

Multiple delimiters might also be used within the same dataset, although this does not appear to be the case in the UB dataset. The function `namesFromString` below allows using different delimiters to split a **names string** into a list of names. Usually the primary collector is listed first in an occurrence record, and therefore, when duplicates are removed, the function allows names in the list to be kept in the same order as they appeared in the string.

In [4]:
import re

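def namesFromString( namesStr, splitOn=';', unique=False, preserveOrder=False ):
    # A single delimiter is treated the same way as a list with one delimiter
    if isinstance(splitOn, str):
        splitOn = [splitOn]
    
    # Escape each delimiter so that regex metacharacters (e.g. '|') are taken literally;
    # alphabetic delimiters (e.g. 'and') only match as whole words, never inside a name
    patterns = [ r'\b'+re.escape(d)+r'\b' if d.isalpha() else re.escape(d) for d in splitOn ]
    namesSplit = re.split( '|'.join(patterns), namesStr )
    
    # Strip surrounding whitespace and drop empty names
    namesList = [ n for n in [ name.strip() for name in namesSplit ] if n!='' ]
    
    if unique:
        # Without preserveOrder the resulting order is arbitrary
        if not preserveOrder: return list(set(namesList))
        # Keep only the first occurrence of each name, in the original order
        seen = set()
        unique_namesList = []
        for n in namesList:
            if n in seen: continue
            seen.add(n)
            unique_namesList.append(n)
        return unique_namesList
    
    return namesList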
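
The order-preserving removal of duplicates in `namesFromString` can also be written more compactly. The sketch below is an alternative and not the notebook's implementation; it assumes Python 3.7 or later, where dictionaries are guaranteed to keep insertion order, and the helper name `unique_in_order` is hypothetical.

```python
def unique_in_order(names):
    """Drop repeated names while keeping the first occurrence of each."""
    # dict.fromkeys keeps keys in insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(names))

names = ['Anderson, WR', 'Hill, SR', 'Anderson, WR', 'Souza, R']
print(unique_in_order(names))
# ['Anderson, WR', 'Hill, SR', 'Souza, R']
```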

In the example below I atomize a names string with three different delimiters ("`;`", "`&`" and "`and`"):

In [5]:
nameString = "Andrew, B.C.; JULIA, FG; Johnson, WED & William P. and Junior, ED; Johnson, W."
namesFromString( nameString, splitOn=[';','&','and'] )

['Andrew, B.C.',
 'JULIA, FG',
 'Johnson, WED',
 'William P.',
 'Junior, ED',
 'Johnson, W.']
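
Alphabetic delimiters such as `and` need some care when they are used in a regular expression, since a bare `and` pattern also matches inside a name. The sketch below illustrates the difference with a made-up names string; the helper `split_names` is hypothetical and only meant to show one possible way of escaping delimiters and restricting alphabetic ones to whole words.

```python
import re

def split_names(names_str, delimiters):
    """Split on any delimiter; alphabetic delimiters only match as whole words."""
    patterns = []
    for d in delimiters:
        p = re.escape(d)              # take regex metacharacters literally
        if d.isalpha():
            p = r'\b' + p + r'\b'     # e.g. 'and' must stand alone as a word
        patterns.append(p)
    parts = re.split('|'.join(patterns), names_str)
    return [p.strip() for p in parts if p.strip()]

example = "Fernandes, A and Miranda, B & Souza, R"

# A bare pattern cuts through 'Fernandes' and 'Miranda'
print(re.split('and', example))
# ['Fern', 'es, A ', ' Mir', 'a, B & Souza, R']

print(split_names(example, [';', '&', 'and']))
# ['Fernandes, A', 'Miranda, B', 'Souza, R']
```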

---

## 2. Names normalization

Below I define a normalization routine, which converts a name to lowercase, removes periods, decomposes it according to a [Unicode normalization](https://en.wikipedia.org/wiki/Unicode_equivalence) form (NFKD by default) and then keeps only ASCII letters within each comma-separated part, which strips accents, spaces and digits.

In [6]:
import unicodedata, string

def normalize(name, normalizationForm='NFKD'):
    name = name.lower() # to lowercase
    name = name.replace('.','') # remove periods
    name_ls = tuple( part.strip() for part in name.split(',') ) # split on commas and strip each part

    # decompose accented characters and keep only ASCII letters (drops accents, spaces and digits)
    toAscii = lambda s: ''.join( x for x in unicodedata.normalize(normalizationForm, s) if x in string.ascii_letters )
    name_ls = tuple( toAscii(part) for part in name_ls )
    
    return ','.join(name_ls)

In [7]:
[ normalize(n) for n in namesFromString(nameString, splitOn=[';','&','and']) ]

['andrew,bc', 'julia,fg', 'johnson,wed', 'williamp', 'junior,ed', 'johnson,w']
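
To see why the NFKD form helps with accents, the short sketch below decomposes a single accented name and then filters out everything that is not an ASCII letter. It only uses the standard library; the variable names are chosen for illustration.

```python
import string
import unicodedata

name = 'Proen\u00e7a'  # 'Proença', with a precomposed c-cedilla
decomposed = unicodedata.normalize('NFKD', name)

# NFKD splits the c-cedilla into 'c' plus a combining cedilla (U+0327)
print(len(name), len(decomposed))
# 7 8

# Keeping only ASCII letters then discards the combining mark
ascii_only = ''.join(ch for ch in decomposed if ch in string.ascii_letters)
print(ascii_only)
# Proenca
```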

Note that this first naive normalization routine assumes that names are standardized in the form `<Last_name>, <initials>`, which will not always be the case for other datasets. In the example above, the last name in 'William P.' is not separated from the initial by a comma, so the two are merged into 'williamp'. A slightly finer normalization approach could, for example, also use blank spaces as separators, in addition to commas. However, it would still incorrectly normalize names like "Santa Rosa, MF", where the last name is composite. In short, no single normalization routine can handle every case. We nevertheless need each name to be unambiguously assigned to a unique entity. One possible way to deal with this is name variants mapping.
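
The space-based refinement mentioned above could look like the sketch below. `normalize_with_spaces` is a hypothetical helper, not part of the notebook; it assumes that both commas and blank spaces separate name parts, which is exactly the assumption that fails for composite last names.

```python
import re
import string
import unicodedata

def normalize_with_spaces(name, form='NFKD'):
    """Like the naive routine, but splits on commas and on blank spaces."""
    name = name.lower().replace('.', '')
    parts = [p for p in re.split(r'[,\s]+', name) if p]
    cleaned = [
        ''.join(ch for ch in unicodedata.normalize(form, p) if ch in string.ascii_letters)
        for p in parts
    ]
    return ','.join(c for c in cleaned if c)

print(normalize_with_spaces('William P.'))
# william,p

# The composite last name is wrongly split into two parts
print(normalize_with_spaces('Santa Rosa, MF'))
# santa,rosa,mf
```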

---

## 3. Name variants mapping

Ideally, for each entity represented in the model we want to have a unique "global" normalized name. However we don't want to simply overwrite the original non-normalized forms of the names in the dataset, as they could be useful in subsequent steps of the analysis. Instead I will use a **name index** (or **name map**), which maps non-normalized names to their normalized forms given by our normalization routine. 

A first benefit of this approach is that we can dynamically modify the index and use different normalization routines depending on the name format in each case. For example, we could easily define a new routine specifically for names without commas and update the index accordingly, just for names with that particular structure. We could also map names manually, in case a human validator asserts that two name variants in fact refer to the same entity.
Second, from a computational perspective, using an index avoids recomputing the normalization function several times for the same name.
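
As an illustration of the first benefit, the sketch below re-maps only those index entries that match a given format. The helper `update_index`, the routine `no_comma_routine` and the two-entry index are all hypothetical; the routine assumes that names without a comma have the form `<Last_name> <initials>`.

```python
def update_index(index, routine, applies_to):
    """Re-map only those raw names for which applies_to(raw) is true."""
    for raw in index:
        if applies_to(raw):
            index[raw] = routine(raw)
    return index

index = {'Johnson, W.': 'johnson,w', 'William P.': 'williamp'}

# Hypothetical routine for names without a comma: lowercase, drop periods,
# and join the whitespace-separated parts with commas
no_comma_routine = lambda raw: ','.join(raw.lower().replace('.', '').split())

update_index(index, no_comma_routine, applies_to=lambda raw: ',' not in raw)
print(index)
# {'Johnson, W.': 'johnson,w', 'William P.': 'william,p'}
```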

Let's go ahead and build a name index:

In [8]:
nameIndex = dict( (n,normalize(n)) for n in namesFromString(nameString, splitOn=[';','&','and']) )
nameIndex

{'Andrew, B.C.': 'andrew,bc',
 'JULIA, FG': 'julia,fg',
 'Johnson, W.': 'johnson,w',
 'Johnson, WED': 'johnson,wed',
 'Junior, ED': 'junior,ed',
 'William P.': 'williamp'}

For example, we could update this index if we knew that "William P." should have been mapped to "william,p" instead:

In [9]:
nameIndex['William P.']='william,p'
nameIndex

{'Andrew, B.C.': 'andrew,bc',
 'JULIA, FG': 'julia,fg',
 'Johnson, W.': 'johnson,w',
 'Johnson, WED': 'johnson,wed',
 'Junior, ED': 'junior,ed',
 'William P.': 'william,p'}

Similarly, "Johnson, W." might be known to refer to the same entity as "Johnson, WED":

In [10]:
nameIndex['Johnson, W.']=nameIndex['Johnson, WED']
nameIndex

{'Andrew, B.C.': 'andrew,bc',
 'JULIA, FG': 'julia,fg',
 'Johnson, W.': 'johnson,wed',
 'Johnson, WED': 'johnson,wed',
 'Junior, ED': 'junior,ed',
 'William P.': 'william,p'}
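
Once several variants share a normalized form, it can be useful to check which raw names were merged into each entity. The sketch below inverts a name index for that purpose. It redefines the final state of the index shown above so that it runs on its own, and the helper name `invert_index` is hypothetical.

```python
from collections import defaultdict

name_index = {'Andrew, B.C.': 'andrew,bc',
              'JULIA, FG': 'julia,fg',
              'Johnson, W.': 'johnson,wed',
              'Johnson, WED': 'johnson,wed',
              'Junior, ED': 'junior,ed',
              'William P.': 'william,p'}

def invert_index(index):
    """Group raw name variants by the normalized name they map to."""
    variants = defaultdict(list)
    for raw, norm in index.items():
        variants[norm].append(raw)
    return dict(variants)

variants = invert_index(name_index)
print(sorted(variants['johnson,wed']))
# ['Johnson, W.', 'Johnson, WED']
```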

Now let's build a map with the collector names in the UB dataset:

In [11]:
collectors = sorted(list(set( c for cols in occs['recordedBy'].apply( lambda x: namesFromString(str(x)) ) for c in cols)))
collectorsMap = dict( (n,normalize(n)) for n in collectors )
list(collectorsMap.items())[:25]

[('.', ''),
 ('1980 Sino-Amer Exped.', 'sinoamerexped'),
 ('?', ''),
 ('A.J.N.V.', 'ajnv'),
 ('A.M.', 'am'),
 ('Abbas, B', 'abbas,b'),
 ('Abdala, GC', 'abdala,gc'),
 ('Abdo, MSA', 'abdo,msa'),
 ('Abdon', 'abdon'),
 ('Abe, LB', 'abe,lb'),
 ('Abe, LM', 'abe,lm'),
 ('Abrahim, MA', 'abrahim,ma'),
 ('Abreu, CG', 'abreu,cg'),
 ('Abreu, GX', 'abreu,gx'),
 ('Abreu, I', 'abreu,i'),
 ('Abreu, LC', 'abreu,lc'),
 ('Abreu, LCR', 'abreu,lcr'),
 ('Abreu, M', 'abreu,m'),
 ('Abreu, MC', 'abreu,mc'),
 ('Abreu, MS', 'abreu,ms'),
 ('Abreu, NL', 'abreu,nl'),
 ('Abreu, NR', 'abreu,nr'),
 ('Abreu, TLS', 'abreu,tls'),
 ('Accioly', 'accioly'),
 ('Accorsi, WR', 'accorsi,wr')]
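
The first entries above show that placeholder strings such as '.' and '?' normalize to an empty string, so unrelated records would end up under the same empty key. A minimal sketch for flagging such entries follows; it uses a small hand-copied excerpt of the map, and the helper name `empty_normalized` is hypothetical.

```python
# Excerpt copied from the output above
collectors_map_excerpt = {'.': '',
                          '1980 Sino-Amer Exped.': 'sinoamerexped',
                          '?': '',
                          'A.M.': 'am',
                          'Abbas, B': 'abbas,b'}

def empty_normalized(names_map):
    """Return the raw names whose normalized form is an empty string."""
    return sorted(raw for raw, norm in names_map.items() if norm == '')

print(empty_normalized(collectors_map_excerpt))
# ['.', '?']
```

Entries flagged this way could then be reviewed and mapped manually, in the same way as the manual updates to `nameIndex` shown earlier.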

Now I will define a function which takes a dataframe and returns a dictionary with the IDs (index labels) of the rows where each normalized name appears. The function also takes a names map as an argument, with the associative rules for names.

In [12]:
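def get_names_indices( df, names_col, names_map ):
    # Split each names string into a list of atomic names (missing values become the string 'nan')
    names_lists = df[names_col].astype(str).apply( namesFromString )
    
    # With a names map, start with one (possibly empty) list per normalized name
    if names_map is not None:
        names_indices = dict( (name,[]) for name in names_map.values() )
    else:
        names_indices = dict()
    
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i,names in names_lists.items():
        for name in names:
            # Use the mapped name as key if a map was given, otherwise the raw name
            key = names_map[name] if names_map is not None else name
            names_indices.setdefault(key, []).append(i)
            
    return names_indices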

In [13]:
d = get_names_indices(occs, 'recordedBy', collectorsMap)

Now we can access records by an entity represented by a normalized name using the notation below:

In [14]:
occs.loc[ d['munhoz,cbr'] ].iloc[-10:-4]

| | recordedBy | eventDate | stateProvince | municipality | locality | identifiedBy | scientificName | family | taxonRank |
|---|---|---|---|---|---|---|---|---|---|
| 184334 | Proença, CEB; Munhoz, CBR | 1994-03-24T01:00Z | Distrito Federal | | Parque do Guará | Proença, CEB | Utricularia L. | Lentibulariaceae | GENUS |
| 184336 | Proença, CEB; Munhoz, CBR | 1994-03-24T01:00Z | Distrito Federal | | Parque do Guará | Proença, CEB | Irlbachia speciosa (Cham. & Schltdl.) P.J.M. Maas | Gentianaceae | SPECIES |
| 184337 | Proença, CEB; Munhoz, CBR | 1994-03-24T01:00Z | Distrito Federal | | Parque do Guará | Proença, CEB | Sipanea hispida Benth. ex Wernham | Rubiaceae | SPECIES |
| 185576 | Munhoz, C.B.R. | 1994-03-25T01:00Z | Distrito Federal | Brasília | Reserva Ecológica do IBGE. Borda da Mata do Mo... | Munhoz, C.B.R. | Ossaea warmingiana Cogn. | Melastomataceae | SPECIES |
| 185578 | Munhoz, C.B.R.; Proença, C.E.B.; Walter, B.M.T. | 1994-05-21T02:00Z | Goiás | Teresina de Goiás | 48 Km de Alto Paraíso para Teresina de Goiás | Munhoz, C.B.R. | Microlicia psammophila Wurdack | Melastomataceae | SPECIES |
| 185580 | Munhoz, C.B.R.; Proença, C.E.B.; Walter, B.M.T. | 1994-05-21T02:00Z | Goiás | Alto Paraíso de Goiás | 07 Km de Alto Paraíso para Teresina de Goiás | Munhoz, C.B.R. | Hyptis pachyphylla Epling | Lamiaceae | SPECIES |


Note that occurrences recorded by both "Munhoz, CBR" and "Munhoz, C.B.R." were returned.

## Extra: Collectors as categorical features

We can use the routines presented above to build a sparse matrix with dummy variables for collector names. Dummy variables are useful for dealing with categorical variables (here, collector names are treated as categorical variables). Some machine learning algorithms require categorical features to be in the form of dummy variables.

In [15]:
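def getNamesDummies(df, names_col, names_map):
    index = df.index
    names_indices = get_names_indices(df, names_col, names_map)
    
    sparse_dict = dict()    
    for key,indices in names_indices.items():
        # np.int was removed in NumPy 1.24; the builtin int is equivalent
        s = pd.Series(data=0, index=index, dtype=int)
        # names_indices stores index labels, so assign by label rather than by position
        s.loc[indices] = 1
        # Series.to_sparse() was removed in pandas 1.0; cast to a sparse dtype instead
        sparse_dict[ key ] = s.astype(pd.SparseDtype(int, fill_value=0))
        
    return pd.DataFrame.from_dict(sparse_dict)   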

In [16]:
dummy_matrix = getNamesDummies(occs, 'recordedBy', collectorsMap)
dummy_matrix

Unnamed: 0,Unnamed: 1,"abbas,b","abdala,gc","abdo,msa",abdon,"abe,lb","abe,lm","abrahim,ma","abreu,cg","abreu,gx",...,"zohary,d","zohary,j","zohary,m","zolma,fj","zorzetti,j","zorzi,vg","zuani,lv","zuchiwschi,e","zukowski,w","zuloaga,fo"
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
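
For small datasets, a similar indicator matrix can be sketched directly with pandas string methods. The example below is an alternative under stated assumptions, not the notebook's routine: it uses a toy dataframe and a hand-written names map (both hypothetical), requires pandas 0.25 or later for `Series.explode`, and produces a dense rather than a sparse result.

```python
import pandas as pd

# Toy data and a hand-written names map (both hypothetical)
toy = pd.DataFrame({'recordedBy': ['Munhoz, CBR; Abreu, M',
                                   'Munhoz, C.B.R.',
                                   'Abreu, M']})
toy_map = {'Munhoz, CBR': 'munhoz,cbr',
           'Munhoz, C.B.R.': 'munhoz,cbr',
           'Abreu, M': 'abreu,m'}

# One row per (occurrence, collector) pair, mapped to normalized names
names = toy['recordedBy'].str.split(';').explode().str.strip().map(toy_map)

# One indicator column per entity, collapsed back to one row per occurrence
dummies = pd.get_dummies(names, dtype=int).groupby(level=0).max()
print(dummies)
```

Here rows 0 and 1 both get a 1 in the `munhoz,cbr` column, since the two spelling variants map to the same entity.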
