# Handbook for natural language analysis of biological function statements

# Using the Jupyter notebook

A Jupyter notebook contains "cells" that can be "run" by pressing `shift`+`enter`.
You can run all cells by going to `Run` -> `Run All Cells`.
It's important that you run the [first code cell](#load), which loads the packages and functions, at the start of each session; otherwise the rest of the code won't work.
To analyse a sentence of your own, you need to call the `nlp` function, pass it a string (the sentence), and assign this to a variable.
You can then pass this variable to the `visualise` function, which will display some information.
(The easiest approach is to copy a cell that analyses a sentence, paste this into a new cell, and replace the text within the quotes with your own sentence.)

# Dependency grammar

Dependency grammar represents grammatical relations as a tree structure. Each sentence has a single word as root (typically the finite verb). Every other word depends on another word via a *dependency relation*. Each relation comprises a *head* (or "governor" in some frameworks) and a *dependent* (or "modifier" or "child", the latter used by spaCy).

## Depicting dependency relations

Dependency relations are directional&mdash;represented by arrows&mdash;such that the arrow's tail emerges from the head and the arrow's head goes into the dependent. The dependent complements or modifies the head. The dependent, in turn, might itself be the head of another dependency relation. By tracing through these dependency relations (i.e. traversing multiple arrows), we gain insight into how the different parts of a sentence relate to one another. (The logic is that the dependency relations can be transitive: if A modifies B and B modifies C then there exists some relation of modification between A and C.) 
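As a rough illustration of tracing through several relations, the sketch below stores each word's head in a plain dictionary and walks from a word up to the root. The `HEADS` table is a hand-written parse of "The functionally relevant gene" (an assumption for illustration, not output from the model), and `path_to_root` is a hypothetical helper name.

```python
# Hand-written parse of "The functionally relevant gene" (illustrative only).
# Each word maps to its head; the root has no head (None).
HEADS = {
    "The": "gene",
    "functionally": "relevant",
    "relevant": "gene",
    "gene": None,
}

def path_to_root(word, heads):
    """Follow head links from `word` until the root is reached."""
    path = [word]
    while heads[path[-1]] is not None:
        path.append(heads[path[-1]])
    return path

# The path is listed from dependent to head, i.e. against the direction
# of the arrows drawn by the visualiser.
print(" -> ".join(path_to_root("functionally", HEADS)))
# prints: functionally -> relevant -> gene
```

Keying the table by word only works when no word is repeated in the sentence; a real parse would key by token position instead.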

## Function and dependency relations

We want to understand the use of biological function. We thus centre our dependency grammar analysis around the function use ("function" or a derivative such as "functional") within the sentence. We build up the dependency relations starting from the function use (as opposed to a more traditional dependency grammar analysis, which would normally start with the root of the sentence). The general idea is to identify words that have dependency relations with the function use&mdash;words that either modify/complement or are modified/complemented by the function use&mdash;and from among these words identify the key semantic components for understanding the sense of function (*ITEM* and *EFFECT*). 

## More information

For more information, see [spaCy's documentation](https://spacy.io/usage/linguistic-features#dependency-parse) on its dependency parsing, the [Universal Dependencies website](https://universaldependencies.org/), or this recent [review](https://doi.org/10.1146/annurev-linguistics-011718-011842) on dependency grammar.

# Load packages and functions
<a id='load'></a>

In [28]:
import spacy
import scispacy
from process_sentence import visualise
nlp = spacy.load("en_core_sci_scibert")

# Syntactic forms of function

We will refer to "function" and its derivative forms (e.g. "functional") as *function uses*.

Here is a (possibly non-exhaustive) list of the different syntactic forms of function:

- function (noun)

- functions (noun, plural)

- functions (verb)

- function (verb, plural)

- functional (adjective)

- functionality (noun)

- functioning (gerund, acts as a noun)

- functioning (present participle, acts as an adjective)

- functioned (past participle, simple past)

- functionally (adverb)
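The forms listed above can be located in raw text with a simple pattern, which may be handy for finding candidate sentences before parsing them. This is a minimal sketch using only the standard library; `FUNCTION_USE` and `find_function_uses` are hypothetical names, and the pattern covers only the suffixes in the list above.

```python
import re

# "function" followed by one of the suffixes from the list above.
# The leading \b stops matches inside words such as "dysfunction".
FUNCTION_USE = re.compile(r"\bfunction(s|al|ality|ing|ed|ally)?\b", re.IGNORECASE)

def find_function_uses(sentence):
    """Return every function use found in `sentence`, in order of appearance."""
    return [match.group(0) for match in FUNCTION_USE.finditer(sentence)]

print(find_function_uses("The functional gene functions to regulate expression"))
# prints: ['functional', 'functions']
```

String matching cannot tell the noun "functions" from the verb "functions"; that distinction needs the part-of-speech tags and dependency relations from the parser.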

## Examples of different syntactic forms of function

We will go through some examples of the different syntactic forms using spaCy and its dependency parsing visualiser, which will serve as a simple introduction to sentence parsing.

### function (noun)

In [29]:
noun = nlp("The gene has a function")
visualise(noun)

In the example above, "function" is a [direct object](https://universaldependencies.org/it/dep/obj.html) of the finite, transitive verb "has".
(The finite verb is the verb that shows subject-agreement and is marked for tense. A transitive verb is one that takes an object.)
In this dependency relation, "function" is the dependent and "has" is the head.
"Has" is also the root of the sentence (it is the only word that has no arrows going into it).
Note that in addition to its dependency relation with "has", "function" has another dependency relation with the [determiner](https://universaldependencies.org/u/dep/det.html) "a".
In this case, however, "function" is the head of the dependency relation and "a" is the dependent. 
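The relations just described can be written down as a small table, which makes the "no incoming arrow" test for the root explicit. The table below is a hand-written parse of "The gene has a function" rather than model output, the label names (`det`, `nsubj`, `dobj`, `ROOT`) are assumptions about what the visualiser shows, and both helper functions are hypothetical. The root's head is stored as `None` here for simplicity, whereas in spaCy the root token's `head` is the token itself.

```python
# Hand-written parse of "The gene has a function" (illustrative, not model output).
# Each entry is (word, dependency label, head); the root's head is None.
PARSE = [
    ("The", "det", "gene"),
    ("gene", "nsubj", "has"),
    ("has", "ROOT", None),
    ("a", "det", "function"),
    ("function", "dobj", "has"),
]

def find_root(parse):
    """Return the only word with no head, i.e. no incoming arrow."""
    roots = [word for word, _, head in parse if head is None]
    assert len(roots) == 1, "a well-formed parse has exactly one root"
    return roots[0]

def dependents_of(word, parse):
    """Return the (dependent, label) pairs whose head is `word`."""
    return [(dep, label) for dep, label, head in parse if head == word]

print(find_root(PARSE))                  # has
print(dependents_of("has", PARSE))       # [('gene', 'nsubj'), ('function', 'dobj')]
print(dependents_of("function", PARSE))  # [('a', 'det')]
```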

### functions (noun, plural)

In [30]:
noun_plural = nlp("The functions of the gene")
visualise(noun_plural)

### functions (verb)

In [31]:
verb = nlp("The gene functions to")
visualise(verb)

### function (verb, plural)

In [32]:
verb_plural = nlp("The genes function")
visualise(verb_plural)

### functional (adjective)

In [33]:
adjective = nlp("The functional gene")
visualise(adjective)

### functionality (noun)

In [34]:
noun_functionality = nlp("The functionality of the gene")
visualise(noun_functionality)

### functioning (gerund, -ing form of the verb that acts as a noun)

In [35]:
gerund = nlp("functioning is important for genes")
visualise(gerund)

There are two things worth noting here.
First, although "functioning" is marked as a verb, its dependency relation with the root "important" is that of [nominal subject](https://universaldependencies.org/u/dep/nsubj.html), indicating that "functioning" is being used as a noun.
Second, this is a case where the root of the sentence is not the finite verb (due to the particular way that Universal Dependencies chooses to handle [copulas](https://universaldependencies.org/fo/dep/cop.html)).

### functioning (present participle, acts as an adjective or to form tense)

As an adjective

In [36]:
pp_adjective = nlp("the functioning gene")
visualise(pp_adjective)

To form verb tense

In [37]:
pp_tense = nlp("the gene was functioning")
visualise(pp_tense)

### functioned (past participle, simple past)

In [38]:
past = nlp("the gene functioned")
visualise(past)

### functionally (adverb)

Modifying an adjective

In [39]:
adverb = nlp("the functionally relevant gene")
visualise(adverb)

Modifying a verb

In [40]:
adverb_verb = nlp("the gene is functionally transcribed")
visualise(adverb_verb)

# Syntactic and semantic considerations

There is no straightforward mapping from syntactic form to meaning (semantics). We simply cannot make a blanket statement like "functional is always used in sense x, functionality in sense y", etc. In fact, it is quite the opposite. On one hand, identical syntactic forms can mean different things depending on context. For example, the statement "the gene is functional" could mean that the gene has a specific function (e.g. "the gene is functional because its transcript regulates x in system y") or it could mean that the gene is working as expected (e.g. "the gene is functional when it produces a working transcript"). On the other hand, different syntactic forms can mean the same thing depending on context (e.g. "gene function" and "the functional gene" could both mean that the gene is working as expected).

## Syntactic variation and dependency relations

A wide variety of syntactic forms can map onto the same (or very similar) relation between parts of speech. Consider the following examples:

- "The gene is functional"
- "The functional gene"
- "The gene that is functioning"
- "Gene function"
- "The functionality of the gene"
- "The functionally relevant gene"

These examples all have different syntactic structures and dependency relations (see below).
Yet all of these different constructions communicate the same (or very similar) association between "gene" and "function*" (though not necessarily the same meaning, depending on context).
We need a method for identifying this association that can cut through these syntactic variations.

For this we can utilise dependency relations.
In all the examples above, one can follow a dependency between "gene" and "function*".
The specific dependencies vary. Sometimes the dependency is direct and sometimes one must follow multiple dependency relations (e.g. functionally -> relevant and relevant -> gene). Sometimes "gene" is the head and "function*" the dependent (e.g. "functional gene", where "functional" is an adjectival modifier of "gene"), whereas sometimes "function*" is the head and "gene" the dependent (e.g. "gene function", where "function" is the head of a compound noun and is modified by "gene"). A dependency relation can be established nonetheless.

In [41]:
# dependency parsing of the syntactic variation described above
_ = nlp("The gene is functional")
visualise(_)
_ = nlp("The functional gene")
visualise(_)
_ = nlp("The gene that is functioning")
visualise(_)
_ = nlp("Gene function")
visualise(_)
_ = nlp("The functionality of the gene")
visualise(_)
_ = nlp("The functionally relevant gene")
visualise(_)
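The idea of following a dependency between "gene" and "function*", whichever word is the head, can be sketched as a breadth-first search that ignores arrow direction. The head tables below are hand-written for three of the phrases (assumptions for illustration, not output from the model), and `dependency_path` is a hypothetical helper.

```python
from collections import deque

def dependency_path(start, goal, heads):
    """Shortest chain of dependency relations between two words.

    `heads` maps each word to its head (None for the root). Relations are
    followed in either direction, head to dependent or dependent to head.
    """
    neighbours = {word: set() for word in heads}
    for word, head in heads.items():
        if head is not None:
            neighbours[word].add(head)
            neighbours[head].add(word)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbours[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two words are not connected

# Hand-written head tables (illustrative assumptions).
functional_gene = {"The": "gene", "functional": "gene", "gene": None}
gene_function = {"Gene": "function", "function": None}
functionally_relevant = {
    "The": "gene", "functionally": "relevant", "relevant": "gene", "gene": None,
}

print(dependency_path("functional", "gene", functional_gene))
print(dependency_path("function", "Gene", gene_function))
print(dependency_path("functionally", "gene", functionally_relevant))
```

The first two paths have a single step (a direct dependency) and the third has two, matching the variation described above.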

# Some general guidelines

- Each example should be based around the function use. If sentences contain multiple uses of function, then the flowchart should be repeated for each function use in the sentence.
- For some sentences, there might be multiple *ITEM*s. If so, either include both in the *ITEM* or repeat the remaining steps of the flowchart separately for each *ITEM*. For example, in the sentence "mRNA and tRNA function in synthesis of proteins through translation" one might treat "mRNA and tRNA" as the *ITEM*, unpacking as "The function of mRNA and tRNA is to synthesise proteins" or one might separate them and unpack as "The function of mRNA is to synthesise proteins" and "The function of tRNA is to synthesise proteins".
- For some sentences, an *ITEM* might have multiple *EFFECT*s. Like the case with multiple *ITEM*s, one can either subsume all of the effects under *EFFECT* or one can unpack separately for each *EFFECT*. For example, in the sentence "RNA has functions in regulation of gene expression and in synthesis of proteins" one might unpack as "The functions of RNA are to regulate gene expression and to synthesise proteins" or one might unpack separately as "The function of RNA is to regulate gene expression" and "The function of RNA is to synthesise proteins".
- One can use the natural language analysis as a tool to help identify *SYSTEM*, but we do not provide specific instructions here (instead focusing on *ITEM* and *EFFECT*). *SYSTEM* will typically complement or modify *EFFECT*.

# Identify *ITEM*

We focus here on the "Identify *ITEM*" decision point in the flowchart (Figure 1).

- Identify the function use ("function", "functional", etc.)
- Starting from the function use, can you follow a dependency relation arrow (or multiple relations) to a nominal (noun) dependent (or nominal phrase)[<sup>1</sup>](#fn1)?
- Is this dependent a candidate for a biological *ITEM*?[<sup>2</sup>](#fn2),[<sup>3</sup>](#fn3)
- If not, can you follow any other dependency relations to find a nominal dependent that fits the description of a biological *ITEM*?[<sup>4</sup>](#fn4)
- If you still haven't identified the *ITEM*, can you follow any dependency relations from the function use to an adjectival modifier that can be converted into a noun that is a candidate for *ITEM*?

<span id="fn1">1:</span> One should generally include compound nouns in the *ITEM* (e.g. "nucleotide sequences"). A useful heuristic for whether you should include adjectival modifiers is whether the term is being used in a technical sense (see Table 1 in the manuscript where we talk about technical use of phrases involving function). The example below of "ribonucleic acids" is clear-cut (it is so technical that it has an acronym, RNA, that is known by the general public). Given the examples of "small proteins" or "novel genes" below, one might need to search the paper for context. A useful heuristic is that it is being used in a technical sense if the terms are constantly used together. For example, one would want the *ITEM* to be "small proteins" if the authors have focused on a class of proteins that they refer to using this terminology; one would not want the *ITEM* to be "small proteins" if the authors are generally talking about proteins, but in this case, they happened to describe the proteins as little.

<span id="fn2">2:</span> The following is copied from Table 1 in the manuscript: An *ITEM* is something that can be a character or trait (or produced by one) of a Darwinian individual. Obvious candidates are concrete nouns, which refer to a physical item (e.g. heart), but it also includes abstract nouns if they refer to a concrete concept (e.g. "gene" as used in population genetics, "boldness" as a behavioural phenotype, etc.). Importantly, there must be a dependency between function and the ITEM; it is not sufficient that a suitable ITEM simply appear in the sentence (e.g. "protein" is the ITEM in "protein function aids muscle development" but "muscle" is the ITEM in "protein aids in muscle function").

<span id="fn3">3:</span> Sometimes the nominal dependent might be a pronoun. In this case, find its referent (the thing to which it refers). For example, in "the gene is transcribed, and it plays a functional role in expression", there is a dependency between "functional" and "it", where "it" refers to "gene". The *ITEM* is thus "gene". (You should also be able to trace a path to the *ITEM*, but it is typically quicker to find the referent.)

<span id="fn4">4:</span> This might mean following further downstream dependencies of the nominal dependent that was not a candidate for the *ITEM*, or it might mean starting again from the function use and trying to identify new dependency relations. It is important to treat this workflow as a guide, not a failsafe rule that one can blindly follow. The purpose is to restrict the *ITEM* to those parts of speech that are connected to the function use. Just because you can follow some obscure path to a candidate for *ITEM* does not imply that that candidate must be the *ITEM*. Use caution when following long paths that go through previously identified nominal dependents, as it might simply be the case that there is no meaningful connection between the function use and an *ITEM*.
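The search for a nominal dependent can be sketched as a breadth-first walk outwards from the function use, listing nominals nearest first. The two parses below are hand-written versions of the sentences from footnote 2 (assumptions for illustration, not model output), the part-of-speech tags are assumed, and `nominal_candidates` is a hypothetical helper. The code only ranks candidates; deciding whether a candidate is a biological *ITEM* remains a judgement call.

```python
from collections import deque

NOMINAL_TAGS = {"NOUN", "PROPN", "PRON"}

def nominal_candidates(function_use, parse):
    """List nominals reachable from the function use, nearest first.

    `parse` maps each word to a (part of speech, head) pair; the root's head
    is None. Returns (word, number of relations followed) pairs.
    """
    neighbours = {word: [] for word in parse}
    for word, (_, head) in parse.items():
        if head is not None:
            neighbours[word].append(head)
            neighbours[head].append(word)
    found = []
    seen = {function_use}
    queue = deque([(function_use, 0)])
    while queue:
        word, distance = queue.popleft()
        if word != function_use and parse[word][0] in NOMINAL_TAGS:
            found.append((word, distance))
        for nxt in neighbours[word]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return found

# Hand-written parses (illustrative assumptions).
sentence_a = {  # "protein function aids muscle development"
    "protein": ("NOUN", "function"),
    "function": ("NOUN", "aids"),
    "aids": ("VERB", None),
    "muscle": ("NOUN", "development"),
    "development": ("NOUN", "aids"),
}
sentence_b = {  # "protein aids in muscle function"
    "protein": ("NOUN", "aids"),
    "aids": ("VERB", None),
    "in": ("ADP", "function"),
    "muscle": ("NOUN", "function"),
    "function": ("NOUN", "aids"),
}

print(nominal_candidates("function", sentence_a)[0])  # ('protein', 1)
print(nominal_candidates("function", sentence_b)[0])  # ('muscle', 1)
```

Candidates further down the list sit at the end of longer paths, which is where the caution in footnote 4 applies.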

Examples

In [42]:
# modified from example 31 of Keeling et al (2019)
item_1 = nlp("Functional ribonucleic acids and proteins can arise out of random nucleotide sequences.")
visualise(item_1)

There is a direct dependency between "functional" and the noun "acids" (which is itself modified by "ribonucleic"). Ribonucleic acid (RNA) is a candidate for a biological *ITEM*. In addition, there is a dependency between "functional" and "proteins" (functional -> acids -> proteins). Proteins are also a candidate for *ITEM*. So in this case there are two *ITEM*s (RNA and proteins).

In [43]:
# modified from example 32 of Keeling et al (2019) 
item_2 = nlp("We understand little about how these genes come into being and which functions they are selected for.")
visualise(item_2)

Here we can follow a dependency relation from "functions" to "genes" (functions -> come -> genes). There is another path from "functions" to "they" (functions -> selected -> they), but "they" refers to "genes". Either way, we can identify the *ITEM* as "genes".

In [44]:
# modified from example 14 of Keeling et al (2019)
item_3 = nlp("Addressing how novel genes acquired adaptive functions is of fundamental importance.")
visualise(item_3)

We can follow a path from "functions" through the finite verb "acquired" to its subject "genes". The *ITEM* would either be "genes" or "novel genes" (the latter if "novel genes" tends to be used in a technical sense).

In [45]:
# modified from example 3 of Keeling et al (2019)
item_5a = nlp("We apply bioinformatics algorithms to infer molecular function of putative small proteins")
visualise(item_5a)

There is a direct dependency between "function" and "proteins". The *ITEM* is either "proteins" or "small proteins" (the latter if "small proteins" tends to be used in a technical sense).

Note the dependency between the adjective "molecular" and "function". Although "molecule" is not the *ITEM* in this case, if we had a sentence with a structure similar to the one below...

In [46]:
item_5b = nlp("Molecular function is important")
visualise(item_5b)

...then we would use word conversion to change the adjective "molecular" into the noun "molecule", which would be the *ITEM*.

# Identify *EFFECT*

We focus here on the "Identify *EFFECT*" decision point in the flowchart (Figure 1).

The *EFFECT* is something that the *ITEM* does or has done to it.
If the *ITEM* is the active subject, then *EFFECT* will be something that *ITEM* does.
If the *ITEM* is the passive subject or an object, then *EFFECT* will be something done to the *ITEM*.
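These three cases can be summarised as a lookup on the dependency label attached to the *ITEM*. This is only a sketch: the labels `nsubj`, `nsubjpass` and `dobj` are assumed to be the ones the parser assigns, and `effect_direction` is a hypothetical helper.

```python
# Dependency labels assumed here; the labels shown by the visualiser
# may differ between models.
ROLE_BY_LABEL = {
    "nsubj": "active subject",
    "nsubjpass": "passive subject",
    "dobj": "object",
}

def effect_direction(item_label):
    """Say whether EFFECT is done by the ITEM or done to it."""
    role = ROLE_BY_LABEL.get(item_label)
    if role is None:
        return "unclear: inspect the parse by hand"
    if role == "active subject":
        return "EFFECT is something the ITEM does"
    return "EFFECT is something done to the ITEM"

print(effect_direction("nsubj"))
print(effect_direction("nsubjpass"))
print(effect_direction("dobj"))
```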

*ITEM* as active subject

In [47]:
effect_1a = nlp("The gene functions to produce a transcript")
visualise(effect_1a)

*ITEM* as passive subject (note the [nsubjpass](https://universaldependencies.org/docs/en/dep/nsubjpass.html) instead of [nsubj](https://universaldependencies.org/docs/en/dep/nsubj.html))

In [48]:
effect_1b = nlp("The functional gene is transcribed")
visualise(effect_1b)

*ITEM* as object

In [49]:
effect_1c = nlp("RNA polymerase transcribes the functional gene")
visualise(effect_1c)

Despite the different paths, in all examples one can follow a dependency between the *ITEM* (gene) and the verb of the sentence (to produce, is transcribed, transcribes). The general process is as follows:

- Start from *ITEM*, which you should have already identified.
- From *ITEM*, follow a dependency relation to the nearest verb[<sup>1</sup>](#fn1)
- Is that verb (or verb clause[<sup>2</sup>](#fn2)) a candidate for *EFFECT*[<sup>3</sup>](#fn3)?
- If not, can you follow any other dependency relations to find a verb (or verb clause) that fits the description of *EFFECT*?[<sup>4</sup>](#fn4)
- If you still cannot identify an *EFFECT*, can any of the dependencies you have encountered be converted to a verb that fits the description of *EFFECT*?[<sup>5</sup>](#fn5)

<span id="fn1">1:</span> If the root of the sentence is not a verb (e.g. when the sentence contains a copula, [cop](https://universaldependencies.org/u/dep/cop.html)) and the copula is the nearest verb, then you should navigate to/from the root instead. See below for an example (`effect_2a`).

<span id="fn2">2:</span> For example, if the verb is transitive then *EFFECT* will include the verb + object (e.g. for "the function of the heart is to pump blood", *EFFECT* is "pump blood"). But there are many other ways that a verb can be modified (e.g. [advcl](https://universaldependencies.org/u/dep/advcl.html), [xcomp](https://universaldependencies.org/u/dep/xcomp.html), [nmod](https://universaldependencies.org/en/dep/nmod.html), [cc](https://universaldependencies.org/u/dep/cc.html), [conj](https://universaldependencies.org/u/dep/conj.html)). Depending on context, *EFFECT* may include some of these modifiers (there are some examples in the notebook with analysed instances from Keeling et al. (2019)).

<span id="fn3">3:</span> The following is copied from Table 1 in the manuscript: *EFFECT* is what the *ITEM* does (or, less commonly, has done to it). *EFFECT* is a verb or verb phrase, or it is a word or phrase that can be converted into a verb (e.g. "contribute to transcription" could be converted to the verb "transcribe"). *EFFECT* must be a concrete action or effect of the *ITEM* (e.g. if *ITEM* is "gene" then "is transcribed" is acceptable but "benefits health" is not).

<span id="fn4">4:</span> This might mean following further downstream dependencies of the verb that was not a candidate for the *EFFECT*, or it might mean starting again from the *ITEM* and trying to identify new dependency relations. It is important to treat this workflow as a guide, not a failsafe rule that one can blindly follow. The purpose is to restrict the *EFFECT* to those parts of speech that represent something done by (or done to) the *ITEM*. Use caution when following long paths, and run a sanity check by reading the full sentence and asking whether it seems correct to say that *EFFECT* is connected to the *ITEM*.

<span id="fn5">5:</span> For example, in "The function of the gene is transcription", the noun "transcription" could be converted to the verb "\[to be] transcribed".
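The first steps of this process can be sketched as a breadth-first walk from the *ITEM* to the nearest verb, keeping the verb's direct object as described in footnote 2. The parse is a hand-written version of "The gene functions to produce a transcript" (not model output), the tags and labels are assumptions, and `find_effect` is a hypothetical helper. Skipping the function use itself when it is the nearest verb is a simplification made for this sketch.

```python
from collections import deque

def find_effect(item, parse):
    """Return the nearest non-function verb to `item`, plus its direct object.

    `parse` maps each word to a (part of speech, dependency label, head)
    triple; the root's head is None.
    """
    neighbours = {word: [] for word in parse}
    for word, (_, _, head) in parse.items():
        if head is not None:
            neighbours[word].append(head)
            neighbours[head].append(word)
    seen = {item}
    queue = deque([item])
    while queue:
        word = queue.popleft()
        is_verb = parse[word][0] == "VERB"
        if is_verb and not word.lower().startswith("function"):
            objects = [w for w, (_, label, head) in parse.items()
                       if head == word and label == "dobj"]
            return " ".join([word] + objects)
        for nxt in neighbours[word]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # no candidate verb is connected to the ITEM

# Hand-written parse (illustrative assumption).
parse = {
    "The": ("DET", "det", "gene"),
    "gene": ("NOUN", "nsubj", "functions"),
    "functions": ("VERB", "ROOT", None),
    "to": ("PART", "mark", "produce"),
    "produce": ("VERB", "xcomp", "functions"),
    "a": ("DET", "det", "transcript"),
    "transcript": ("NOUN", "dobj", "produce"),
}

print(find_effect("gene", parse))  # produce transcript
```

The returned phrase is only a candidate; whether it describes a concrete action or effect of the *ITEM* (footnote 3) still has to be judged by reading the sentence.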

Examples

In [50]:
effect_2a = nlp("The transcript is functional because it regulates expression in the liver")
visualise(effect_2a)

We use this example to highlight that one needs to make a slight modification when the root is not a verb. In Universal Dependencies, [cop](https://universaldependencies.org/u/dep/cop.html) is not considered the head of a dependency relation, which means it cannot be the root; in such cases, the root will often be an adjective or noun.

The *ITEM* is "transcript" and the *EFFECT* is "regulates expression" via transcript -> functional -> regulates -> expression. Note that the dependency paths go through the root "functional" not the verb "is" due to the particular way that these sentences are parsed.

In [51]:
# modified from example 21 of Keeling et al (2019)
effect_2b = nlp("MDF1 functions in two important molecular pathways, mating and fermentation, and mediates the crosstalk between reproduction and vegetative growth.")
visualise(effect_2b)

The *ITEM* here is "MDF1". The nearest verb is "functions". From "functions" there is a dependency with the verb "mediates" (via [conj](https://universaldependencies.org/u/dep/conj.html)). In this case, it would make sense for the entire clause of "mediates the crosstalk between reproduction and vegetative growth" to be *EFFECT*.

We might wonder whether "in two important molecular pathways, mating and fermentation" could be part of *EFFECT*.
It doesn't make sense in this particular case, as "mediates the crosstalk between reproduction and vegetative growth" gives a specific action of MDF1 in the "molecular pathways, mating and fermentation" ("reproduction" being part of "mating", and "vegetative growth" presumably being driven, at least in part, by the energy generated via "fermentation").
But we can alter this sentence slightly to highlight a situation in which one might need to use conversion to form a verb for *EFFECT*.

In [52]:
effect_2c = nlp("MDF1 functions in fermentation")
visualise(effect_2c)

<a id='marginal'></a>
This is perhaps a marginal case, but it seems somewhat reasonable to assign *EFFECT* as "to ferment", or more accurately in a biological sense "to play a role in the fermentation process". MDF1 does something meaningful in a specific pathway (even if we don't know what exactly), which might be deemed sufficient for an *EFFECT*[<sup>1</sup>](#fn1). From the guidelines above, we could obtain this interpretation by following the dependency relations of MDF1 -> functions -> fermentation (convert -> to ferment).

<span id="fn1">1:</span> Our intention is to provide a general framework for semantic interpretation, not to prescribe the specific methodology that one chooses to interpret marginal cases like this. The important thing is to be clear on the how and why of an interpretation and then to apply it consistently.

# Identifying variables (*ITEM*, *EFFECT*, etc.) elsewhere in a manuscript

When using dependency grammar, one analyses single sentences. Oftentimes, it will not be possible to identify variables such as *EFFECT* or *SYSTEM* within a single sentence. By reading further afield of the sentence (e.g. adjacent sentences or elsewhere in the manuscript), one might be able to identify these variables. This might be done in a less-structured *ad hoc* manner, but one might still use our natural language analysis framework by combining analysis of different sentences. We show an example below.

In [53]:
# example 19 from Keeling et al (2019)
instance_19 = nlp("Moreover, our analyses of the S. cerevisiae sORFs establish that sORFs are conserved across eukaryotes and have important biological functions.")
visualise(instance_19)

Here the *ITEM* is "sORFs" (functions -> have -> conserved -> sORFs). We cannot, however, identify an *EFFECT* ("are conserved" does not describe an action or effect of "sORFs", and none of the "important biological functions" are mentioned). But in the discussion of [the paper](https://genome.cshlp.org/content/16/3/365), one can find the following sentence, which we analyse below: "Although relatively little is known about sORF functions, they have been implicated in key cellular processes including transport, intermediary metabolism, chromosome segregation, genome stability, and other functions."

In [54]:
discussion = nlp("Although relatively little is known about sORF functions, they have been implicated in key cellular processes including transport, intermediary metabolism, chromosome segregation, genome stability, and other functions.")
visualise(discussion)

We can straightforwardly identify "sORF" as *ITEM*.

If we interpret *EFFECT* in the manner described [earlier](#marginal), then we can get the *EFFECT*s as "play a role in transport" (sORF -> functions -> known -> implicated -> processes -> transport), "play a role in intermediary metabolism" (sORF -> functions -> known -> implicated -> processes -> transport -> metabolism), "play a role in chromosome segregation" (sORF -> functions -> known -> implicated -> processes -> transport -> segregation), and "play a role in genome stability" (sORF -> functions -> known -> implicated -> processes -> transport -> segregation -> stability).
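The paths above share the prefix that ends in "transport" and then fan out, so the listed processes can be gathered by collecting everything below "transport" in the tree. The sketch below uses only the links read off those paths, assuming each step runs from head to dependent; `CHILDREN` and `collect_descendants` are hypothetical names, and the modifiers ("intermediary", "chromosome", "genome") are left out for brevity.

```python
# Head-to-dependent links read off the paths above; other words are omitted.
CHILDREN = {
    "processes": ["transport"],
    "transport": ["metabolism", "segregation"],
    "segregation": ["stability"],
}

def collect_descendants(word, children):
    """Return `word` followed by everything reachable below it, depth first."""
    collected = [word]
    for child in children.get(word, []):
        collected.extend(collect_descendants(child, children))
    return collected

nouns = collect_descendants("transport", CHILDREN)
effects = [f"play a role in {noun}" for noun in nouns]
print(effects)
```

This yields one candidate *EFFECT* per process, which would then be completed by hand with the modifiers of each noun.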

For an additional example of analysing across multiple sentences, see instance 24 in the notebook that analyses the examples from Keeling et al. (2019).