<a href="https://colab.research.google.com/github/julianflowers/herbivores_ghg/blob/master/taxonerd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting taxa information from texts

In this notebook we walk through how to process texts to extract taxon information using the Python package `taxonerd` (https://github.com/nleguillarme/taxonerd), which is described in https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.13778.

The package uses language models trained on a large body of ecological texts, together with taxonomic services like `taxref`, `gbif_backbone` and `ncbilite`, to identify mentions of taxa (common and scientific names) in bodies of text.

## Getting started

As a first step we need to download and install the package and language models.

In [None]:
!pip install taxonerd

In [2]:
!pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.3.0/en_ner_eco_md-1.0.0.tar.gz
!pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.3.0/en_ner_eco_biobert-1.0.0.tar.gz

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting https://github.com/nleguillarme/taxonerd/releases/download/v1.3.0/en_ner_eco_md-1.0.0.tar.gz
  Downloading https://github.com/nleguillarme/taxonerd/releases/download/v1.3.0/en_ner_eco_md-1.0.0.tar.gz (123.3 MB)
     |████████████████████████████████| 123.3 MB 25 kB/s 
Building wheels for collected packages: en-ner-eco-md
  Building wheel for en-ner-eco-md (setup.py) ... done
  Created wheel for en-ner-eco-md: filename=en_ner_eco_md-1.0.0-py3-none-any.whl size=123795866 sha256=4c8ba70e7cfc59086f52ca94870726f50d92c5f965d955cfdf6eb5eba7ab4e1e
  Stored in directory: /root/.cache/pip/wheels/ba/09/a6/e7bd2d27bd1f135a69ba8a41fcf35611601000747586e52b9f
Successfully built en-ner-eco-md
Installing collected packages: en-ner-eco-md
Successfully installed en-ner-eco-md-1.0.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Co
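After installation it can be worth confirming that the package and both models are importable before going further. The sketch below is one way to do that with the standard library; it assumes the importable module names match the wheel names shown in the install log (`en_ner_eco_md`, `en_ner_eco_biobert`), and `check_installed` is a helper name made up for this example.

```python
from importlib.util import find_spec


def check_installed(names=("taxonerd", "en_ner_eco_md", "en_ner_eco_biobert")):
    """Map each top-level module name to True if Python can locate it."""
    # Module names are an assumption based on the wheel file names.
    return {name: find_spec(name) is not None for name in names}


for name, ok in check_installed().items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

If anything is reported as missing, re-running the corresponding `pip install` cell is the first thing to try.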

Then we can import the main class, `TaxoNERD`, into our Colab Python environment.

In [3]:
from taxonerd import TaxoNERD

Then we initialise our named entity recognition (NER) object.

Note - setting `prefer_gpu=True` speeds things up when a GPU is available. A GPU (graphics processing unit) can run very many calculations in parallel, which suits the kinds of calculations needed for NLP and big data analysis, so it is usually much faster for this work than a general-purpose processor (CPU).

Note also that adding `with_linking` links each mention to an entry in the chosen taxonomy and returns a score for the match (higher is better), so you can judge how reliable the annotation is.
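Whether `prefer_gpu=True` makes any difference depends on the runtime actually having a GPU attached. The following is a rough check using only the standard library; it assumes an NVIDIA GPU whose driver provides the `nvidia-smi` command, and `gpu_available` is a helper name invented for this sketch, not part of `taxonerd`.

```python
import shutil
import subprocess


def gpu_available():
    """Return True if `nvidia-smi` is on the PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU detected" if gpu_available() else "No GPU detected")
```

In Colab, a GPU runtime can be selected under Runtime > Change runtime type.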

In [4]:
# with_linking can be "gbif_backbone" or "taxref"; leave it out to skip entity linking
ner = TaxoNERD(model="en_ner_eco_biobert", prefer_gpu=True, with_linking="taxref", with_abbrev=True)


Once the ner is initialised we can check texts...

In [5]:
out = ner.find_in_text("Only a few herbivory change the profile of volatile organic chemicals (VOCs) studies have demonstrated that drought stress has a negative effect emitted by the host plant; (3) parasitoids avoid aphid hosts feeding on aphid parasitism success [39,42]. on plants under drought stress and root herbivory. The system Root herbivores can affect plant growth [43–45], reproduction comprised Brassica oleracea as the host plant; the belowground [46,47], density [46,48] and nutrient status [49,50] and thus may herbivore was the cabbage root fly Delia radicum; the aboveground strongly affect the quality and quantity of resources available to herbivores were the generalist aphid Myzus persicae, and the foliar herbivores [36]. This can have differential effects on foliar specialist aphid Brevicoryne brassicae; and at the third trophic level insects and their associated natural enemies. Root herbivory had a the parasitoids Aphidius colemani and Diaeretialla rapae were used. negative impact on the performance of aphids and other insects due to the increased levels of defence compounds in several studies Results [6,19,36,51–57] and/or a decrease in nitrogen concentration [19] and leaf water content [58]. Root herbivores can be responsible for Parasitism Performance and Percentage Parasitism the change in growth and development of foliar herbivores a) Percentage parasitism. Percentage parasitism was signif- through plant mediated changes and thus may have indirect icantly affected by the interaction between drought stress, De. impact on parasitoid fitness [19,59,60] and the impact can also be radicum and parasitoid species (F1, 72 = 7.50; P,0.01). seen on the fourth trophic level [19]. The negative impact of high Drought stress (F1, 72 = 121.39; P,0.001) and the presence drought stress on aphid performance and abundance can be of De. 
radicum (F1, 72 = 10.27; P,0.01) had a negative impact exacerbated under root herbivory [36,39,61] and thus we predict on percentage parasitism by both parasitoid species com- that natural enemies may avoid these plants due to the low quality pared with well watered plants, but their effects were greater of their aphid hosts. for the specialist parasitoid species (D. rapae) than for the Multitrophic interactions frequently involve complex plant generalist parasitoid species (A. colemani, Figure 1a). Drought defences [10,14,62] involving the release of volatile organic stress partially reversed the negative effect of De. radicum on compounds (VOCs) following herbivore attack that enhance the parasitism by A. colemani (Figure 1a; Tukey’s HSD, P,0.05). effectiveness of natural enemies [63–67]. In response to insect Parasitism by D. rapae followed the same pattern, but the herbivory, plants release VOCs which can be used by natural difference between drought stressed plants with or without enemies of the insect herbivores to find their hosts [59]. The plant De. radicum was not significant (Figure 1a). VOC emissions induced by foliar herbivores can be influenced by b) Sex ratio. Sex ratio was significantly affected by the root herbivores [59] and drought stress [68]. These studies showed interaction between De. radicum treatment and parasitoid compound specific responses for natural enemies under biotic and species (F1, 75 = 7.35; P,0.01). The main effects of drought abiotic stresses. Therefore, plant VOC emissions are influenced by stress (F 1, 75 = 19.65; P,0.001) and De. radicum (F1, biotic and abiotic stresses [68–73]. These plants may become less 75 = 215.93; P,0.001) were also significant for the sex ratio attractive to foraging parasitoids [74,75] and thus may interfere of both parasitoid species. The proportion of males of both directly with herbivore-parasitoid interactions [59]. 
species was significantly greater on drought stressed plants The behaviour and performance of natural enemies can be with De. radicum compared with well watered treatments influenced by their host, host diet, environmental factors (Figure 1b). Delia radicum increased the proportion of male D. (including water stress) and the presence of other herbivores such rapae on both drought stressed plants and well watered plants as root feeders [26,42,59,79–81]. Parasitoid development has been compared with plants that were not infested with root linked with the quality of internal environment of their hosts [59]. herbivore (Tukey’s HSD, P,0.05). Delia radicum feeding did For example, phytotoxin concentration can increase under not affect the sex ratio of A. colemani under either the drought drought stress [25] and root herbivory [59] and these toxins are or well watered treatments (Tukey’s HSD, P,0.05). repeatedly consumed by insect herbivores [59]. These phytotoxins often accumulate in the fat body and hemolymph of insect c")

In [6]:
out

| | offsets | text | entity |
|---|---|---|---|
| T0 | LIVB 194 199 | aphid | [(TAXREF:215210, Melon aphid, 0.796825647354126)] |
| T1 | LIVB 217 222 | aphid | [(TAXREF:215210, Melon aphid, 0.796825647354126)] |
| T2 | LIVB 385 402 | Brassica oleracea | [(TAXREF:86406, Brassica oleracea, 1.0)] |
| T3 | LIVB 539 552 | Delia radicum | [(TAXREF:26886, Delia radicum, 1.0)] |
| T4 | LIVB 668 673 | aphid | [(TAXREF:215210, Melon aphid, 0.796825647354126)] |
| T5 | LIVB 674 688 | Myzus persicae | [(TAXREF:52046, Myzus persicae, 1.0)] |
| T6 | LIVB 778 783 | aphid | [(TAXREF:215210, Melon aphid, 0.796825647354126)] |
| T7 | LIVB 784 805 | Brevicoryne brassicae | [(TAXREF:52043, Brevicoryne brassicae, 1.0)] |
| T8 | LIVB 921 938 | Aphidius colemani | [(TAXREF:228414, Aphidius colemani, 1.0)] |
| T9 | LIVB 943 961 | Diaeretialla rapae | [(TAXREF:228408, Diaeretiella rapae, 0.8140425... |
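Each value in the `offsets` column packs a label and two character positions into one string. In the output above the difference between the two numbers equals the length of the mention (for example 552 - 539 = 13 for "Delia radicum"), which suggests a start-inclusive, end-exclusive span. The sketch below splits such a string and recovers a mention by slicing; `parse_offsets` and the short example sentence are made up for illustration.

```python
def parse_offsets(offsets):
    """Split a string like 'LIVB 539 552' into (label, start, end)."""
    label, start, end = offsets.split()
    return label, int(start), int(end)


label, start, end = parse_offsets("LIVB 539 552")
print(label, end - start)  # label and span length in characters

# Hypothetical short text, used only to show that slicing recovers the mention
text = "The cabbage root fly Delia radicum feeds on Brassica oleracea."
mention = "Delia radicum"
begin = text.find(mention)
offsets = f"LIVB {begin} {begin + len(mention)}"
_, s, e = parse_offsets(offsets)
print(text[s:e])
```

Splitting the column this way makes it easier to sort mentions by position or to pull out the surrounding context from the original text.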


Let's upload some files...

In [7]:
from google.colab import files

uploaded = files.upload()

Saving sustainability-12-02425-v2.pdf to sustainability-12-02425-v2.pdf
Saving agronomy-11-01421-v2.pdf to agronomy-11-01421-v2.pdf
Saving Testing%20DayCent%20and%20DNDC%20model%20simulations%20of%20N2O%20fluxes%20and%20assessing%20the%20impacts%20of%20climate%20change%20on%20the%20gas%20flux%20and%20biomass%20production%20from%20a%20humid%20pasture.pdf to Testing%20DayCent%20and%20DNDC%20model%20simulations%20of%20N2O%20fluxes%20and%20assessing%20the%20impacts%20of%20climate%20change%20on%20the%20gas%20flux%20and%20biomass%20production%20from%20a%20humid%20pasture.pdf
Saving s42452-020-03538-9.pdf to s42452-020-03538-9.pdf
Saving 9995c9d0bab3b678e4835d05490b4c993120.pdf to 9995c9d0bab3b678e4835d05490b4c993120.pdf


We'll use `ann` as the output directory for the annotation files that the NER writes as it works through each document in the `example` input directory. It will take a few minutes to convert each PDF to text and run the annotation. Time for coffee...or a butterfly count.
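The `find_in_corpus` call in the next cell reads its input from a directory named `example`, so the uploaded PDFs need to be placed there first. The cell that did this is not shown in the notebook; the following is a minimal sketch of one way to do it, assuming the uploads were saved to the current working directory. `move_pdfs` is a hypothetical helper name.

```python
from pathlib import Path
import shutil


def move_pdfs(src_dir=".", dest_dir="example"):
    """Move every PDF in src_dir into dest_dir and return the file names moved."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the input directory if needed
    moved = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        shutil.move(str(pdf), str(dest / pdf.name))
        moved.append(pdf.name)
    return moved
```

Calling `move_pdfs()` with the defaults moves the PDFs from the working directory into `example`.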

In [9]:
ner.find_in_corpus("example", "ann")

{'9995c9d0bab3b678e4835d05490b4c993120.txt': 'ann/9995c9d0bab3b678e4835d05490b4c993120.ann',
 'Testing%20DayCent%20and%20DNDC%20model%20simulations%20of%20N2O%20fluxes%20and%20assessing%20the%20impacts%20of%20climate%20change%20on%20the%20gas%20flux%20and%20biomass%20production%20from%20a%20humid%20pasture.txt': 'ann/Testing%20DayCent%20and%20DNDC%20model%20simulations%20of%20N2O%20fluxes%20and%20assessing%20the%20impacts%20of%20climate%20change%20on%20the%20gas%20flux%20and%20biomass%20production%20from%20a%20humid%20pasture.ann',
 'agronomy-11-01421-v2.txt': 'ann/agronomy-11-01421-v2.ann',
 's42452-020-03538-9.txt': 'ann/s42452-020-03538-9.ann',
 'sustainability-12-02425-v2.txt': 'ann/sustainability-12-02425-v2.ann'}

These `.ann` files are plain text files, so they can be downloaded, combined and analysed in Excel (or R or Python).
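To combine the annotation files in Python, something like the following could be used. It is a sketch under an assumption that should be checked against one of the files: that each entity is stored on its own line as three tab-separated fields (an ID such as `T0`, a field such as `LIVB 194 199`, and the mention text), mirroring the columns shown earlier. Lines that do not fit this layout are skipped. `read_ann` and `combine_ann` are names invented for this example.

```python
import csv
from pathlib import Path


def read_ann(path):
    """Read entity lines (IDs starting with 'T') from one .ann file."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Assumed layout: ID <tab> "LABEL start end" <tab> mention text
            if len(fields) != 3 or not fields[0].startswith("T"):
                continue
            parts = fields[1].split()
            rows.append({
                "id": fields[0],
                "label": parts[0],
                "start": int(parts[1]),
                "end": int(parts[-1]),
                "text": fields[2],
            })
    return rows


def combine_ann(ann_dir, out_csv):
    """Write the entities from every .ann file in ann_dir to one CSV file."""
    fieldnames = ["file", "id", "label", "start", "end", "text"]
    count = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        for path in sorted(Path(ann_dir).glob("*.ann")):
            for row in read_ann(path):
                writer.writerow({"file": path.name, **row})
                count += 1
    return count
```

For example, `combine_ann("ann", "taxa_mentions.csv")` would produce a single CSV (the output file name is arbitrary) that opens directly in Excel or R.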

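We can also annotate a single file with `find_in_file`, which returns the annotations directly as a table.

In [13]: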
ner.find_in_file("example/9995c9d0bab3b678e4835d05490b4c993120.pdf")

| | offsets | text | entity |
|---|---|---|---|
| T0 | LIVB 230 240 | Douglas G. | [(TAXREF:189155, Douglasia, 0.7564775943756104)] |
| T1 | LIVB 6711 6716 | lupin | [(TAXREF:194315, Lupinus, 0.7949097156524658)] |
| T2 | LIVB 11065 11069 | corn | [(TAXREF:108029, Corn Mint, 0.7146876454353333)] |
| T3 | LIVB 11071 11087 | Zea mays L.]-pea | [(TAXREF:130621, Zea mays, 0.7660806775093079)] |
| T4 | LIVB 11089 11113 | Pisum sativum L.]/barley | [(TAXREF:113778, Pisum sativum, 0.746616005897... |
| T5 | LIVB 12140 12145 | wheat | [(TAXREF:141978, Pasta Wheat, 0.849183201789856)] |
| T6 | LIVB 12177 12181 | Corn | [(TAXREF:108029, Corn Mint, 0.7146876454353333)] |
| T7 | LIVB 12194 12199 | wheat | [(TAXREF:141978, Pasta Wheat, 0.849183201789856)] |
| T8 | LIVB 12980 13000 | Helianthus annuus L. | [(TAXREF:101027, Helianthus annuus, 0.93454897... |
| T9 | LIVB 14203 14208 | wheat | [(TAXREF:141978, Pasta Wheat, 0.849183201789856)] |
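Not every link in this output looks right: "Douglas G." reads like part of a person's name, and "corn" is linked to Corn Mint. The match score gives one way to screen the links. The sketch below applies a cut-off to a few rows copied from the output above; the cut-off value of 0.8 and the helper name `filter_links` are hypothetical choices made only for illustration.

```python
# Rows copied from the output above: (mention, linked name, score)
links = [
    ("Douglas G.", "Douglasia", 0.7564775943756104),
    ("lupin", "Lupinus", 0.7949097156524658),
    ("corn", "Corn Mint", 0.7146876454353333),
    ("Zea mays L.]-pea", "Zea mays", 0.7660806775093079),
    ("wheat", "Pasta Wheat", 0.849183201789856),
]


def filter_links(rows, min_score):
    """Keep only links whose score is at least min_score."""
    return [row for row in rows if row[2] >= min_score]


MIN_SCORE = 0.8  # hypothetical cut-off, chosen only for illustration
for mention, name, score in filter_links(links, MIN_SCORE):
    print(f"{mention} -> {name} ({score:.2f})")
# prints: wheat -> Pasta Wheat (0.85)
```

With this cut-off the "Douglas G." and "corn" links are dropped, but so are the plausible "lupin" and "Zea mays" links, so any threshold is a trade-off and the remaining links still deserve a manual check.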
