# <center>Collectors-identifiers Networks</center>

#### <center> Author: Pedro Correia de Siracusa </center>
#### <center> Date: Aug 27, 2017 </center>

Biodiversity occurrence data consists of records of specimens (individuals) collected in the field by one or more *collectors*. The process of formally assigning a taxonomic identity (*e.g.* the scientific name of the species) to each specimen is performed later by one or more *identifiers*, usually botanical specialists. Occasionally the person who collects a specimen in the field is the same one who later identifies it, a behavior characteristic of great botanical specialists (personal communication from herbarium specialists), who are usually both prolific collectors and prolific identifiers.

In this notebook I'll build a network model to try to identify great **collectors**, great **identifiers** and great **botanical specialists** (who are both great collectors and identifiers) in the University of Brasília Herbarium (UB) occurrences dataset. The importance of the person who assigned a taxonomic identity to a specimen may serve as a proxy for the quality of the identification.

---

In [1]:
import pandas as pd
import networkx as nx
import mpld3

import matplotlib.pyplot as plt

from mymodules.functions import namesFromString,normalize
from mpld3 import plugins

%matplotlib inline

In [2]:
colsList = ['scientificName', 'taxonRank', 'family', 
            'stateProvince', 'locality', 'municipality', 
            'recordedBy', 'identifiedBy',
            'eventDate']

occs = pd.read_csv('./0077202-160910150852091/occurrence.txt', sep='\t', usecols=colsList)

Before building the model, let's explore the `identifiedBy` column of the occurrences dataset, where the names of the identifiers are recorded.

Who are the top 20 identifiers?

In [3]:
occs['identifiedBy'].value_counts().head(20)

Incógnito             10645
Proença, CEB           7977
Souza, MGM             3960
Delprete, PG           3832
Faria, JEQ             3273
Leite, ALTA            2796
Barneby, R             2515
Castelo-Branco, CW     2384
Câmara, PEAS           2294
Wurdack, JJ            1776
Sousa, RV              1381
Caires, CS             1354
Oliveira, RC           1309
Soares-Silva, LH       1237
Peralta, DF            1105
Munhoz, CBR            1053
Rosa, PO               1026
Barroso, GM             993
Carvalho-Silva, M       943
Simon, MF               888
Name: identifiedBy, dtype: int64

How many of the records were identified by more than one person? How many have no identifier information at all? And how does this compare with the collector (`recordedBy`) information?

In [4]:
occs_num_of_identifiers = occs['identifiedBy'].apply(lambda x: len(str(x).split(';')))
occs_num_of_collectors = occs['recordedBy'].apply(lambda x: len(str(x).split(';')))

occs_multipleIdentifiers = occs[occs_num_of_identifiers>1]
occs_multipleCollectors = occs[occs_num_of_collectors>1]

occs_noIdentifier = occs[ occs['identifiedBy'].isnull() ]
occs_incognitoIdentifier = occs[ occs['identifiedBy']=="Incógnito" ]
occs_noCollector = occs[occs['recordedBy'].isnull()]

In [5]:
print('''======================================
-----------------------------------------
Occurrences with more than one collector:
\tTotal count: {};
\tPercentage: {:.2%};

-------------------------------------
Occurrences with no collector (null):
\tTotal count: {};
\tPercentage: {:.3%};

------------------------------------------
Occurrences with more than one identifier:
\tTotal count: {};
\tPercentage: {:.2%};

--------------------------------------
Occurrences with no identifier (null):
\tTotal count: {};
\tPercentage: {:.2%};

--------------------------------------
Occurrences with incognito identifier:
\tTotal count: {};
\tPercentage: {:.2%};


'''.format(
    occs_multipleCollectors.shape[0],
    occs_multipleCollectors.shape[0]/occs.shape[0],
    occs_noCollector.shape[0],
    occs_noCollector.shape[0]/occs.shape[0],
    occs_multipleIdentifiers.shape[0],
    occs_multipleIdentifiers.shape[0]/occs.shape[0],
    occs_noIdentifier.shape[0],
    occs_noIdentifier.shape[0]/occs.shape[0],
    occs_incognitoIdentifier.shape[0],
    occs_incognitoIdentifier.shape[0]/occs.shape[0])
)

-----------------------------------------
Occurrences with more than one collector:
	Total count: 73665;
	Percentage: 39.64%;

-------------------------------------
Occurrences with no collector (null):
	Total count: 9;
	Percentage: 0.005%;

------------------------------------------
Occurrences with more than one identifier:
	Total count: 3238;
	Percentage: 1.74%;

--------------------------------------
Occurrences with no identifier (null):
	Total count: 37473;
	Percentage: 20.17%;

--------------------------------------
Occurrences with incognito identifier:
	Total count: 10645;
	Percentage: 5.73%;





More than 20% of the records have no associated identifier! On top of that, almost 6% of all records are attributed to "Incógnito", i.e. an unknown identifier.
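
The percentages printed above are all relative to the full dataset. As a quick arithmetic check, using only those printed shares (the variable names are illustrative), we can see what they imply when combined and when restricted to records whose `identifiedBy` field is filled in:

```python
# Shares of all records, as printed in the summary above.
share_null = 0.2017       # identifiedBy is null
share_incognito = 0.0573  # identifiedBy == "Incógnito"

# Share of all records lacking a named identifier (null or "Incógnito").
share_unnamed = share_null + share_incognito

# Share of "Incógnito" among records with a non-null identifiedBy value.
share_incognito_given_filled = share_incognito / (1 - share_null)

print(f"No named identifier: {share_unnamed:.2%}")
print(f"Incógnito among filled-in records: {share_incognito_given_filled:.2%}")
```

So roughly a quarter of the records carry no usable identifier name, and "Incógnito" makes up about 7% of the records that do have something in `identifiedBy`.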

---

## Building the network model

In the **collectors-identifiers network** model any entity (node) can be both an identifier and a collector. Each specimen recorded by one or more collectors is sent to one or more identifiers to be identified. To model this relationship I will use a weighted **directed graph**, in which links run from a collector to an identifier and the weight of a link counts the records shared by that pair. The number of specimens collected by an entity is then reflected by the node's weighted *out-degree*, whereas the number of specimens identified by it is reflected by the node's weighted *in-degree*. If an entity both collects and identifies a specimen a self-link is formed, which counts as 1 out-link plus 1 in-link. Unlike collectors co-working networks, this collectors-identifiers network does not express collaborations between collectors in the field.
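
To make the model concrete, here is a minimal sketch on three made-up records, using only the standard library rather than *networkx*. The names and records are hypothetical; the point is only to show how edge weights and weighted degrees come out of collector-identifier pairs:

```python
from collections import Counter

# Hypothetical toy records: (collectors, identifiers) for each specimen.
records = [
    (["ana", "bruno"], ["carla"]),
    (["ana"], ["carla"]),
    (["carla"], ["carla"]),  # carla collects and identifies: self-link
]

# One edge collector -> identifier per pair in a record;
# the edge weight is the number of records in which the pair appears.
edge_weights = Counter(
    (col, idr) for cols, idrs in records for col in cols for idr in idrs
)

# Weighted degrees: out-degree sums collection links, in-degree identification links.
out_degree = Counter()
in_degree = Counter()
for (col, idr), w in edge_weights.items():
    out_degree[col] += w
    in_degree[idr] += w

print(dict(edge_weights))
print(dict(out_degree))
print(dict(in_degree))
```

Note that in this toy case `carla` identified three specimens but ends up with a weighted in-degree of 4, because the record with two collectors contributes one unit per collector. Degrees in this kind of graph therefore count collector-identifier pairs, which matches specimen counts only for records with a single collector and a single identifier.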

I will build the graph with the entities' normalized names, using a `namesMap`.

In [6]:
collectors = set( n for ns in occs['recordedBy'].apply( lambda x: namesFromString(str(x)) ) for n in ns )
identifiers = set( n for ns in occs['identifiedBy'].apply( lambda x: namesFromString(str(x)) ) for n in ns )

namesMap = dict( (n,normalize(n)) for n in collectors.union(identifiers) )
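
`namesFromString` and `normalize` come from the author's own `mymodules` package, which is not shown in this notebook. The sketch below is therefore only a guess at what such helpers might do: the function names are hypothetical stand-ins, and the behavior (splitting on `;`, stripping accents, lowercasing, dropping punctuation) is an assumption suggested by the node ids `incognito` and `etal` that are removed from the graph later on.

```python
import unicodedata

def split_names_sketch(field, unique=False):
    """Split a ';'-delimited names field into a list of trimmed names."""
    names = [part.strip() for part in field.split(";") if part.strip()]
    if unique:
        # dict.fromkeys drops repeats while keeping first-seen order.
        names = list(dict.fromkeys(names))
    return names

def normalize_sketch(name):
    """Lowercase, strip accents, and drop everything but letters and digits."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in no_accents.lower() if ch.isalnum())

names = split_names_sketch("Proença, CEB; Faria, JEQ; Faria, JEQ", unique=True)
names_map = {n: normalize_sketch(n) for n in names}
print(names_map)
```

Whatever the real implementation looks like, the purpose of the map is the same: spelling variants of one person's name should collapse onto a single node id.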

In [7]:
from collections import Counter

G = nx.DiGraph(name="Collectors-identifiers network")

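# Keep only records that have both a collector and an identifier
occs_filteredNulls = occs[ occs['identifiedBy'].notnull() & occs['recordedBy'].notnull() ]
idrs_cols = ( i for i in occs_filteredNulls[['identifiedBy','recordedBy']].values )
# One directed edge (collector -> identifier) per collector-identifier pair in each record
edges = ( (namesMap[col],namesMap[idr]) for idrs,cols in idrs_cols for idr in namesFromString(idrs, unique=True) for col in namesFromString(cols, unique=True) )

# Edge weight = number of records in which the pair appears
edges_weighted = ( (u,v,w) for (u,v),w in Counter(edges).items() )

G.add_weighted_edges_from(edges_weighted)
# Drop placeholder entities that do not correspond to actual people
G.remove_nodes_from(("incognito","etal"))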

Let's check some info about the graph:

In [8]:
print(nx.info(G))

Name: Collectors-identifiers network
Type: DiGraph
Number of nodes: 7745
Number of edges: 49642
Average in degree:   6.4096
Average out degree:   6.4096


As this is a relatively large graph I will not plot it with *networkx* right now, but you can still download it [here](graphs/n6_graph.gexf) and visualize it with *Gephi*.

In [9]:
nodes_df = pd.DataFrame(
    [ (n, G.in_degree(n, weight='weight'), G.out_degree(n,weight='weight'), G.degree(n, weight='weight')) 
      for n in G.nodes() ])

nodes_df.columns = ['Node_id', 'In-degree', 'Out-degree', 'Degree'] #Weighted in/out degrees
nodes_df.set_index('Node_id', inplace=True)

The simplest way to classify entities in this network in terms of their contribution as collectors and identifiers is to use the linear decision boundary $y = x$. In other words, if an entity's number of collections (out-degree) is greater than its number of identifications (in-degree) it is considered a collector; otherwise, ties included, it is considered an identifier. I will add an attribute `is_identifier` to each node holding this information.

In [10]:
is_identifier = dict( (i, 1 if o['In-degree']>=o['Out-degree'] else 0) for i,o in nodes_df.iterrows() )
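# Keyword arguments keep this call valid across networkx versions (the positional order changed in 2.0)
nx.set_node_attributes(G, name='is_identifier', values=is_identifier)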

*Figure 1* shows each entity positioned according to its in- and out-degrees.

In [11]:
points = [ (i,o) for i,o,k in nodes_df.values ]
labels = [ n for n in nodes_df.index ]
point_sizes = [ k for i,o,k in nodes_df.values ]

spline = lambda x: x/40

fig = plt.figure(figsize=(10,8))
scatter = plt.scatter([x for x,y in points], [y for x,y in points], s=[ spline(s) for s in point_sizes], alpha=0.4)
plt.plot([0,20000],[0,20000], ls='--', c='r', lw=1.2)

plt.title('Degrees of nodes in the Collectors-and-identifiers Network', size=16)
plt.xlabel('In-degree', size=14)
plt.ylabel('Out-degree', size=14)

plugins.connect(fig, plugins.PointLabelTooltip(scatter, labels))
mpld3.display()

**Figure 1. Entity degrees in the collectors-identifiers network. Nodes are positioned according to their in- and out-degrees. Node sizes represent their total degree (in-degree + out-degree). The red dashed line is the decision boundary $y=x$ of this simple classifier.**

### Some considerations and possible issues with this naive approach:

* Entities might **contribute differently** as collectors or as identifiers to different herbaria. Therefore by using data from a single herbarium we may end up with a "biased sample" of an entity's overall activity. Using this limited information to predict if an entity is a great specialist is therefore potentially misleading;

* Is it common for people from other institutions to collaborate with a herbarium by both depositing and identifying specimens, or is it more common for them to do only one of the two? I suspect the latter. If so, external collaborators should usually lie close to one of the axes in Figure 1 (high in-degree and low out-degree, or vice versa);

* Entities with very low out-degree but considerably large in-degree might actually be specialists, as they are asked by other institutions to contribute species identifications;

* Entities with considerably high out-degree but very low in-degree might be important external collectors, or people internal to the institute who take part in most collection trips but are not specialists, such as drivers or field assistants (*"mateiros"*). In fact *Mendes, VC* (approx. 1600 out-degree; 0 in-degree) is the main Kombi driver at the Biology Institute of UnB. Unlike *[Irwin, HS](https://www.nybg.org/library/finding_guide/archv/irwin_rg4b.html)*, he is a central node in the collectors co-working network, meaning that he has done a lot of fieldwork with researchers from the institute. *Irwin* is an external collector;

* Maybe entities that form the 'kernel' of the herbarium tend to perform both collection and identification activities;

* Figure 1 could actually have been built simply by counting the identifications and collections of each entity, without building a network model at all. However, the network model allows us to use the relationships between entities. An algorithm similar to **PageRank** (a variant of eigenvector centrality) could result in a better classification of great specialists. Under such a scheme the importance of an identifier depends on how many of the specimens they identified were collected by important collectors (or by other important identifiers who also collect): if an entity identifies specimens from other important entities, then it is important too. This is the case of *Barneby, R*. Despite a zero out-degree (no specimens collected in the UB dataset), Barneby is probably a great specialist, being the entity who identified the most specimens collected by *Irwin, HS*, and is probably an external collaborator as well.
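
As a rough illustration of the PageRank idea in the last point, here is a small power-iteration sketch in plain Python. It is not part of the analysis above: the function name, the toy edges and the damping factor of 0.85 are all assumptions made for the example. On the real graph one would more likely try *networkx*'s own `nx.pagerank`, which accepts a `weight` argument.

```python
def pagerank_sketch(edge_weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on a dict {(source, target): weight}."""
    nodes = sorted({n for edge in edge_weights for n in edge})
    out_weight = {n: 0.0 for n in nodes}
    for (u, _), w in edge_weights.items():
        out_weight[u] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        new_rank = {
            n: (1 - damping) / len(nodes) + damping * dangling / len(nodes)
            for n in nodes
        }
        # Each node passes its rank along its out-links, in proportion to weight.
        for (u, v), w in edge_weights.items():
            new_rank[v] += damping * rank[u] * w / out_weight[u]
        rank = new_rank
    return rank

# Hypothetical toy network: edges point from collector to identifier.
toy_edges = {
    ("collector_a", "identifier_x"): 30,
    ("collector_b", "identifier_x"): 20,
    ("collector_c", "identifier_y"): 5,
    ("identifier_y", "identifier_x"): 10,  # y also collects; x identifies for y
}
ranks = pagerank_sketch(toy_edges)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

Because links run from collector to identifier, rank flows towards identifiers, and an identifier who receives specimens from well-ranked entities (here `identifier_x`) scores higher than one who only receives specimens from a peripheral collector.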



The network below is the result of keeping only edges with weight greater than or equal to 50.

In [12]:
from IPython.display import IFrame
IFrame('../networks/n6_network/index.html', width=900, height=900)
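
The notebook does not show how the embedded view was generated. As a sketch of the thresholding step only, on made-up edges stored in a plain dictionary (with *networkx* the same test would be applied while iterating over `G.edges(data=True)`):

```python
# Hypothetical weighted edges {(collector, identifier): number of records}.
edge_weights = {
    ("ana", "carla"): 120,
    ("bruno", "carla"): 50,
    ("ana", "dario"): 49,
    ("elisa", "dario"): 3,
}

MIN_WEIGHT = 50  # threshold stated in the text

# Keep only edges at or above the threshold.
strong_edges = {e: w for e, w in edge_weights.items() if w >= MIN_WEIGHT}

# Nodes that still take part in at least one retained edge.
strong_nodes = sorted({n for edge in strong_edges for n in edge})

print(strong_edges)
print(strong_nodes)
```

Nodes whose links all fall below the threshold disappear from the filtered view, which is why it is much sparser than the full graph.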

In [13]:
nx.write_gexf(G, './graphs/n6_graph.gexf')