# Using the Jupyter notebook

A Jupyter notebook contains "cells" that can be "run" by pressing `shift`+`enter`.
You can run all cells by going to `Run` -> `Run All Cells`.
It's important that you run the [first code cell](#load) at the start of each session: it loads the packages and functions, and the rest of the code won't work without it.
To analyse a sentence of your own, you need to call the `nlp` function, pass it a string (the sentence), and assign the result to a variable.
You can then pass this variable to the `visualise` function, which will display some information.
(The easiest approach is to copy a cell that analyses a sentence, paste this into a new cell, and replace the text within the quotes with your own sentence.)

# Dependency grammar

Dependency grammar represents grammatical relations as a tree structure. Each sentence has a single word as root (typically the finite verb). All other words are dependent upon another word via a *dependency relation*. These relations comprise a *head* (or "governor" in some frameworks) and a *dependent* (or "modifier" or "child", the latter used by spaCy). 

## Depicting dependency relations

Dependency relations are directional&mdash;represented by arrows&mdash;such that the arrow's tail emerges from the head and the arrow's head goes into the dependent. The dependent complements or modifies the head. The dependent, in turn, might itself be the head of another dependency relation. By tracing through these dependency relations (i.e. traversing multiple arrows), we gain insight into how the different parts of a sentence relate to one another. (The logic is that the dependency relations can be transitive: if A modifies B and B modifies C then there exists some relation of modification between A and C.) 
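The tracing described above can be sketched in plain Python, without the parser. This is a minimal illustration only: the relations for "The functionally relevant gene" are written by hand, and the names `HEAD_OF` and `chain_to_root` are hypothetical. The printed chain lists words in the order they are visited (dependent first, then its head), which is the reverse of the direction in which the arrows are drawn.

```python
# Hand-written dependency relations for "The functionally relevant gene".
# Each dependent maps to (head, relation label). These labels are an
# illustrative assumption, not the output of a parser.
HEAD_OF = {
    "The": ("gene", "det"),
    "functionally": ("relevant", "advmod"),
    "relevant": ("gene", "amod"),
}


def chain_to_root(word):
    """Follow arrows backwards, from dependent to head, until the root."""
    chain = [word]
    while chain[-1] in HEAD_OF:
        head, _label = HEAD_OF[chain[-1]]
        chain.append(head)
    return chain


# "functionally" modifies "relevant", which in turn modifies "gene".
print(" -> ".join(chain_to_root("functionally")))
# prints: functionally -> relevant -> gene
```

The word at the end of the chain ("gene") has no head of its own, so it is the root of this phrase.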

## Function and dependency relations

We want to understand the use of biological function. We thus centre our dependency grammar analysis around the function use ("function" or a derivative such as "functional") within the sentence. We build up the dependency relations starting from the function use (as opposed to a more traditional dependency grammar analysis, which would normally start with the root of the sentence). The general idea is to identify words that have dependency relations with the function use&mdash;words that either modify/complement or are modified/complemented by the function use&mdash;and from among these words identify the key semantic components for understanding the sense of function (*ITEM* and *EFFECT*). 

## More information

For more information, see [spaCy's documentation](https://spacy.io/usage/linguistic-features#dependency-parse) on its dependency parsing, the [Universal Dependencies website](https://universaldependencies.org/), or this recent [review](https://doi.org/10.1146/annurev-linguistics-011718-011842) on dependency grammar.

# Load packages and functions
<a id='load'></a>

In [3]:
import spacy
import scispacy
from process_sentence import visualise
nlp = spacy.load("en_core_sci_scibert")

# Syntactic forms of function

We will refer to "function" and its derivative forms (e.g. "functional") as *function uses*.

Here is a (possibly non-exhaustive) list of the different syntactic forms of function:

- function (noun)

- functions (noun, plural)

- functions (verb)

- function (verb, plural)

- functional (adjective)

- functionality (noun)

- functioning (gerund, acts as a noun)

- functioning (present participle, acts as an adjective)

- functioned (past participle, simple past)

- functionally (adverb)
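As a rough sketch, the surface forms listed above can be spotted with a regular expression before any parsing is done. The pattern and the helper name `find_function_uses` are assumptions made for illustration. Matching on spelling alone cannot tell the noun "functions" from the verb "functions", which is why the dependency parse is still needed.

```python
import re

# Matches exactly the forms in the list above; "dysfunction" and similar
# words are deliberately not matched because the whole word must fit.
FUNCTION_USE = re.compile(r"function(s|al|ality|ing|ed|ally)?", re.IGNORECASE)


def find_function_uses(sentence):
    """Return (position, word) pairs for every function use in the sentence."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [(i, w) for i, w in enumerate(words) if FUNCTION_USE.fullmatch(w)]


print(find_function_uses("The functionally relevant gene functions in the liver"))
# prints: [(1, 'functionally'), (4, 'functions')]
```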

## Examples of different syntactic forms of function

We will go through some examples of the different syntactic forms using spaCy and its dependency parsing visualiser, which will serve as a simple introduction to sentence parsing.

### function (noun)

In [8]:
noun = nlp("The gene has a function")
visualise(noun)

In the example above, "function" is a [direct object](https://universaldependencies.org/it/dep/obj.html) of the finite, transitive verb "has".
(The finite verb is the verb that shows subject-agreement and is marked for tense. A transitive verb is one that takes an object.)
In this dependency relation, "function" is the dependent and "has" is the head.
"Has" is also the root of the sentence (it is the only word that has no arrows going into it).
Note that in addition to its dependency relation with "has", "function" has another dependency relation with the [determiner](https://universaldependencies.org/u/dep/det.html) "a".
In this case, however, "function" is the head of the dependency relation and "a" is the dependent. 
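The roles described above can be made concrete with a small, parser-free sketch. The relations below are written by hand for "The gene has a function" (using spaCy-style labels such as `dobj`, which is an assumption about the labels the visualiser shows), and `find_root` and `roles_of` are hypothetical helper names.

```python
# (head, relation, dependent) triples written by hand for
# "The gene has a function"; not produced by a parser.
RELATIONS = [
    ("gene", "det", "The"),
    ("has", "nsubj", "gene"),
    ("has", "dobj", "function"),
    ("function", "det", "a"),
]


def find_root(relations):
    """The root is the only word that is a head but never a dependent."""
    heads = {head for head, _rel, _dep in relations}
    dependents = {dep for _head, _rel, dep in relations}
    (root,) = heads - dependents
    return root


def roles_of(word, relations):
    """Split a word's relations into those where it is dependent or head."""
    as_dependent = [(rel, head) for head, rel, dep in relations if dep == word]
    as_head = [(rel, dep) for head, rel, dep in relations if head == word]
    return as_dependent, as_head


print(find_root(RELATIONS))             # prints: has
print(roles_of("function", RELATIONS))  # prints: ([('dobj', 'has')], [('det', 'a')])
```

Words are used as identifiers here, which only works because no word is repeated in this sentence.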

### functions (noun, plural)

In [10]:
noun_plural = nlp("The functions of the gene")
visualise(noun_plural)

### functions (verb)

In [11]:
verb = nlp("The gene functions")
visualise(verb)

### function (verb, plural)

In [14]:
verb_plural = nlp("The genes function")
visualise(verb_plural)

### functional (adjective)

In [15]:
adjective = nlp("The functional gene")
visualise(adjective)

### functionality (noun)

In [16]:
noun = nlp("The functionality of the gene")
visualise(noun)

### functioning (gerund, -ing form of the verb that acts as a noun)

In [17]:
gerund = nlp("functioning is important for genes")
visualise(gerund)

There are two things worth noting here.
First, note that although "functioning" is marked as a verb, its dependency relation with the root "important" is that of [nominal subject](https://universaldependencies.org/u/dep/nsubj.html), indicating that "functioning" is being used as a noun.
Second, this is a case where the root of the sentence is not the finite verb (due to the particular way that Universal Dependencies chooses to handle [copulas](https://universaldependencies.org/fo/dep/cop.html)).

### functioning (present participle, acts as an adjective or to form tense)

As an adjective:

In [18]:
pp_adjective = nlp("the functioning gene")
visualise(pp_adjective)

To form a verb tense:

In [19]:
pp_tense = nlp("the gene was functioning")
visualise(pp_tense)

### functioned (past participle, simple past)

In [20]:
past = nlp("the gene functioned")
visualise(past)

### functionally (adverb)

In [26]:
adverb = nlp("the functionally relevant gene")
visualise(adverb)

In this example, "functionally" modifies an adjective. It could also modify a verb in an example like:

In [27]:
adverb_verb = nlp("the gene is functionally transcribed")
visualise(adverb_verb)

# Syntactic and semantic considerations

There is no simple method for mapping from syntactic form to meaning (semantics). It is not the case that we can make a blanket statement like "functional is always used in sense x, functionality in sense y", etc. In fact, it is quite the opposite. On one hand, identical syntactic forms can mean different things depending on context. For example, the statement "the gene is functional" could mean that the gene has a specific function (e.g. "the gene is functional because its transcript regulates x in system y") or it could mean that the gene is working as expected (e.g. "the gene is functional when it produces a working transcript"). On the other hand, different syntactic forms can mean the same thing depending on context (e.g. "gene function" and "the functional gene" in the con

## Syntactic variation and dependency relations

A wide variety of syntactic forms can map onto the same (or very similar) relation between parts of speech. Consider the following examples:

- "The gene is functional"
- "The functional gene"
- "The gene that is functioning"
- "Gene function"
- "The functionality of the gene"
- "The functionally relevant gene"

These examples all have different syntactic structures and dependency relations (see below).
Yet all of these different constructions communicate the same (or very similar) association between "gene" and "function*" (though not necessarily the same meaning).
We need a method for identifying this association that can cut through these syntactic variations.

For this we can utilise dependency relations.
In all the examples above, one can follow a dependency between "gene" and "function*".
There are variations in the specific dependencies&mdash;sometimes the dependency is direct and sometimes one must follow multiple dependency relations (e.g. functionally -> relevant and relevant -> gene), sometimes "gene" is the head and "function*" the dependent (e.g. "functional gene") whereas sometimes "function*" is the head and "gene" the dependent ("gene is functional")&mdash;but a dependency relation can always be established nonetheless.

In [33]:
# Dependency parses of the syntactic variations described above
sentences = [
    "The gene is functional",
    "The functional gene",
    "The gene that is functioning",
    "Gene function",
    "The functionality of the gene",
    "The functionally relevant gene",
]
# Parse and display each sentence in turn
for sentence in sentences:
    visualise(nlp(sentence))
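The claim that a dependency can always be followed between "gene" and "function*", whichever word is the head, can be sketched as a search that ignores arrow direction. The parses below are written by hand for three of the examples (they are assumptions, not parser output), and `PARSES` and `find_path` are hypothetical names.

```python
from collections import deque

# sentence -> (function use, [(head, dependent), ...]); hand-written parses.
PARSES = {
    "The functional gene": (
        "functional", [("gene", "functional"), ("gene", "The")]),
    "The gene is functional": (
        "functional", [("functional", "gene"), ("functional", "is"), ("gene", "The")]),
    "The functionally relevant gene": (
        "functionally", [("gene", "relevant"), ("relevant", "functionally"), ("gene", "The")]),
}


def find_path(edges, start, goal):
    """Breadth-first search over the relations, ignoring their direction."""
    neighbours = {}
    for head, dependent in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for word in neighbours.get(path[-1], []):
            if word not in seen:
                seen.add(word)
                queue.append(path + [word])
    return None  # no chain of relations links the two words


for sentence, (function_use, edges) in PARSES.items():
    print(sentence, "|", " -> ".join(find_path(edges, function_use, "gene")))
```

In the first two sentences the path is a single relation (with the head and dependent roles swapped between them), whereas the third requires two relations via "relevant".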

# Some general guidelines

- Each example should be based around the function use. If sentences contain multiple uses of function, then the flowchart should be repeated for each function use in the sentence.
- For some sentences, there might be multiple *ITEM*s. If so, either include both in the *ITEM* or repeat the remaining steps of the flowchart separately for each *ITEM*. For example, in the sentence "mRNA and tRNA function in synthesis of proteins through translation" one might treat "mRNA and tRNA" as the *ITEM*, unpacking as "The function of mRNA and tRNA is to synthesise proteins" or one might separate them and unpack as "The function of mRNA is to synthesise proteins" and "The function of tRNA is to synthesise proteins".
- For some sentences, an *ITEM* might have multiple *EFFECT*s. Like the case with multiple *ITEM*s, one can either subsume all of the effects under *EFFECT* or one can unpack separately for each *EFFECT*. For example, in the sentence "RNA has functions in regulation of gene expression and in synthesis of proteins" one might unpack as "The functions of RNA are to regulate gene expression and to synthesise proteins" or one might unpack separately as "The function of RNA is to regulate gene expression" and "The function of RNA is to synthesise proteins".
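The two unpacking options in the guidelines above (subsume everything in one sentence, or unpack separately) can be illustrated with a small template function. This is a sketch only: `unpack` and its parameters are hypothetical, and the *ITEM*s and *EFFECT*s are assumed to have been identified by hand already.

```python
def unpack(items, effects, separate=True):
    """Unpack ITEMs and EFFECTs into 'The function of x is to y' sentences."""
    if separate:
        # One sentence per ITEM/EFFECT pair.
        return [f"The function of {item} is to {effect}"
                for item in items for effect in effects]
    # Subsume all ITEMs and all EFFECTs in a single sentence.
    joined_items = " and ".join(items)
    joined_effects = " and to ".join(effects)
    noun, verb = ("functions", "are") if len(effects) > 1 else ("function", "is")
    return [f"The {noun} of {joined_items} {verb} to {joined_effects}"]


print(unpack(["mRNA", "tRNA"], ["synthesise proteins"], separate=False))
print(unpack(["mRNA", "tRNA"], ["synthesise proteins"]))
print(unpack(["RNA"], ["regulate gene expression", "synthesise proteins"], separate=False))
```

The three calls reproduce the unpacked sentences given in the two examples above.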

# Identify *ITEM*

We focus here on the "Identify *ITEM*" decision point in the flowchart (Figure 1).

- Identify the function use ("function", "functional", etc.)
- Starting from the function use, can you follow a dependency relation arrow (or multiple relations) to a nominal (noun) dependent?
- Is this dependent a candidate for a biological *ITEM*?[<sup>1</sup>](#fn1)
- If not, can you follow any other dependency relations to find a nominal dependent that fits the description of a biological *ITEM*?[<sup>2</sup>](#fn2)
- If you still haven't identified the *ITEM*, can you follow any dependency relations from the function use to an adjectival modifier that can be converted into a noun that is a candidate for *ITEM*?

<span id="fn1">1:</span> See Table 1 in the manuscript for a description of the *ITEM*

<span id="fn2">2:</span> This might mean following further downstream dependencies of the nominal dependent that was not a candidate for the *ITEM*, or it might mean starting again from the function use and trying to identify new dependency relations.
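The steps above can be sketched as a search over a hand-written parse. Everything here is an assumption made for illustration: the `TOKENS` table (part-of-speech tags and heads for the start of a sentence), the `is_candidate` check (which stands in for the human judgement made with Table 1), and the `noun_for_adjective` lookup (which stands in for converting an adjectival modifier into a noun).

```python
from collections import deque

# word -> (part of speech, head); a head of None marks the root.
# Hand-written for "Functional ribonucleic acids and proteins can arise".
TOKENS = {
    "Functional": ("ADJ", "acids"),
    "ribonucleic": ("ADJ", "acids"),
    "acids": ("NOUN", "arise"),
    "and": ("CCONJ", "acids"),
    "proteins": ("NOUN", "acids"),
    "can": ("AUX", "arise"),
    "arise": ("VERB", None),
}


def words_by_distance(tokens, start):
    """List the other words, nearest first, following relations either way."""
    neighbours = {word: set() for word in tokens}
    for word, (_pos, head) in tokens.items():
        if head is not None:
            neighbours[word].add(head)
            neighbours[head].add(word)
    seen, queue, order = {start}, deque([start]), []
    while queue:
        word = queue.popleft()
        order.append(word)
        for nxt in sorted(neighbours[word] - seen):
            seen.add(nxt)
            queue.append(nxt)
    return order[1:]  # drop the function use itself


def identify_items(tokens, function_use, is_candidate, noun_for_adjective):
    """Prefer nominal words; fall back to convertible adjectives."""
    reachable = words_by_distance(tokens, function_use)
    items = [w for w in reachable if tokens[w][0] == "NOUN" and is_candidate(w)]
    if items:
        return items
    # Last resort: an adjective that can be converted into a noun.
    return [noun_for_adjective[w] for w in reachable
            if tokens[w][0] == "ADJ" and w in noun_for_adjective]


biological = {"acids", "proteins"}  # stand-in for the judgement from Table 1
print(identify_items(TOKENS, "Functional", biological.__contains__, {}))
# prints: ['acids', 'proteins']
```

The adjective fallback only fires when no nominal candidate is found, which mirrors the order of the steps above.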


Examples

In [42]:
item_1 = nlp("Functional ribonucleic acids and proteins can arise out of random nucleotide sequences.")
visualise(item_1)

There is a direct dependency between "functional" and the noun "acids" (which is itself modified by "ribonucleic"). Ribonucleic acid (RNA) is a candidate for a biological *ITEM*. In addition, there is a dependency between "functional" and "proteins" (functional -> acids -> proteins). Proteins are also a candidate for *ITEM*. So in this case there are two *ITEM*s (RNA and proteins).

In [43]:
item_2 = nlp("We understand little about how these genes come into being and which functions they are selected for.")
visualise(item_2)

Here we can follow a dependency relation from "functions" to "genes" (functions -> come -> genes). There is another path from "functions" to "they" (functions -> selected -> they), but "they" refers back to "genes". Either way, we can identify the *ITEM* as "genes".

In [44]:
item_3 = nlp("We apply bioinformatics algorithms to infer molecular function of putative small proteins")
visualise(item_3)

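Note that one could technically convert the adjectival modifier "molecular" into the noun "molecule" and obtain a candidate *ITEM*. That is not the right answer here: converting an adjectival modifier is a last resort, used only when no nominal dependent fits, and in this sentence the nominal dependent "proteins" can be reached from "function".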

 
 
- I think 2.1-2.7 can be combined as some version of "is there a dependency relation (either direct or by traversing arrows) between the function use and an ITEM". I can use the specific instances (and the toy examples of these) to highlight that sometimes the arrows go in different directions.
   + Need to fix "functionality of the gene".
   + Need to specify that the ITEM could be an adjective that is converted into a noun.
   + Might need to go via verbs if function is in the predicate (e.g. if ITEM should be the subject).
   + Verb descriptions are too specific (e.g. you can get `nsubjpass`, what if function is in the infinitive form?, etc.).






- 3. Identify *EFFECT*[<sup>8</sup>](#fn8)

   - 3.1 Did you previously identify function as an adjective, noun, gerund/participle or adverb modifying an adjective? If yes, go to step 3.2; if no (function is a verb or an adverb that modifies a verb), go to step 3.4.
   - 3.2 Follow the dependency arrow (head to tail) from the head noun (either function or the head to which function is a dependent if function is not the head noun) to the finite verb. Is the dependency via `dobj`[<sup>9</sup>](#fn9)? If yes, go to step 5.1; if no, go to step 3.3.
   - 3.3 Is the verb that you identified by following the `nsubj` relationship transitive (i.e. it has a dependent via relation `dobj`)? If yes, record the finite verb and the phrase containing its `dobj` as *y* and go to step 4.1. If no, go to step 3.4.
   - 3.4 Starting from the finite verb[<sup>10</sup>](#fn10), can you follow another dependency arrow (tail to head this time) via dependency relation `xcomp` to an infinitive, transitive verb? If yes, record the verb and the phrase containing its `dobj` as *y* and go to step 4.1; if no, go to step 5.1.
- 4. Candidates for biological activity/role/advantage/selected effect
   - 4.1 Can it be made to fit the following form[<sup>11</sup>](#fn11): the function of *x* is to *y* such that doing *y* in the past caused *x* to be selected for or maintained in a population (relative to an actual or counterfactual historical alternative or set of alternatives to *x*)? If yes, classify as **function as selected effect** and EXIT; if no, go to 4.2.
   - 4.2 Can you identify a complex containing system *z* of the biological item *x*? There are a few clues that can help in identifying *z*. Can you identify the answer to the question "how does *x* do *y*?" or "for what is *x* doing *y* used?"? Syntactically, you can look for an `acl` dependent of the direct object of the verb (the verb and its direct object together comprise *y*). You can also look for an `advcl` modifier of the verb. (Both of these provide extra information about *y*.) If yes, go to step 4.3; if no, go to 4.5.
   - 4.3. Can it be unpacked into the following form: the function of *x* is to *y* in *z*, where *x* is a component of a complex system *z*, *z* is a Darwinian individual *i* (or a complex system within *i*), and *x* doing *y* in *z* is advantageous for *i* (relative to an explicitly-identified or implied alternative to *x*)? If yes, classify as **function as biological advantage** and EXIT. If no, go to step 4.4.
   - 4.4. Can it be unpacked into the following form: the function of *x* is to *y* in *z*, where *x* is a component of a complex system *z*, and *z* is a Darwinian individual *i* (or a complex system within *i*)? If yes, classify as **function as biological role** and EXIT. If no, go to step 4.5.
   - 4.5. Can it be unpacked into the following form: the function of *x* is to *y*, where *x* is a component (or subcomponent) of a Darwinian individual *i*? If yes, classify as **function as biological activity** and EXIT. If no, classify as **unidentifiable meaning: effect specified** and EXIT.
- 5. Candidates for function as performing or producing activity/role/advantage/selected effects
   - 5.1. Substitute "*x* function" with "how well *x* performs (or works)", or "*x* works", into the raw sentence (not the unpacked sentence). If there is loss of meaning or ambiguity, classify as **unidentifiable meaning: effect unspecified**; if there is no loss of meaning and no ambiguity, go to step 5.2.
   - 5.2. Does the sentence's language indicate that the biological role has been shaped by selection? If so, and if *x* is a component of a Darwinian individual *i*, classify as **function as performing selected effect** and EXIT. If not, go to step 5.3.
   - 5.3. Can you identify *z* (complex containing system) in the unpacked sentence (see step 4.2)? If yes, go to step 5.4; if not, and if *x* is a component of a Darwinian individual *i*, classify as **function as producing biological activity** and EXIT.
   - 5.4. Is the performance of the biological role associated with how well the complex system (or organism containing the complex system) performs (e.g. "Liver functioning affects human health")? If yes, and if *x* is a component of a Darwinian individual *i* and *z* is a Darwinian individual *i* or a complex system within *i*, classify as **function as performing biological advantage**. If no (e.g. "Liver functioning is important for the hepatic system"), and if *x* is a component of a Darwinian individual *i* and *z* is an *i* or a complex system within *i*, classify as **function as producing biological role**. EXIT.
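The cascade in steps 4.1 to 4.5 above can be written down as a small decision function. This is only a sketch of the control flow: each yes/no answer still has to be supplied by the annotator, and the names `Step4Answers` and `classify_step4` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step4Answers:
    """Annotator's yes/no answers to the questions in steps 4.1 to 4.5."""
    selected_effect_form: bool  # 4.1: fits the selected-effect form?
    system_identified: bool     # 4.2: can a containing system z be identified?
    advantage_form: bool        # 4.3: x doing y in z is advantageous for i?
    role_form: bool             # 4.4: x is a component of z, and z is (in) i?
    activity_form: bool         # 4.5: x is a component of i?


def classify_step4(answers):
    """Walk through steps 4.1 to 4.5 and return the classification."""
    if answers.selected_effect_form:          # 4.1
        return "function as selected effect"
    if answers.system_identified:             # 4.2 yes: try 4.3, then 4.4
        if answers.advantage_form:            # 4.3
            return "function as biological advantage"
        if answers.role_form:                 # 4.4
            return "function as biological role"
    if answers.activity_form:                 # 4.5 (from 4.2 no or 4.4 no)
        return "function as biological activity"
    return "unidentifiable meaning: effect specified"


example = Step4Answers(False, True, False, True, True)
print(classify_step4(example))  # prints: function as biological role
```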
      
<span id="fn8">fn 8:</span> Some notes about *y* (mainly for my own working out). We either unpack as "The function of *x* is to *y* in *z*" or "*x* functions to *y* in *z*". In both cases, it seems clear that *y* should be a verb or an adjective formed from a verb (present participle). An example of the former is "The gene functions to regulate..." (where *y* is the infinitive verb "regulate"). Another is "The functional gene regulates..." or "Protein function regulates..." (where *y* is the finite verb "regulates"). An example of the latter is "The transcribing gene functions to..." (where *y* is the present participle "transcribing", which modifies *x* (gene)). spaCy classifies present participles as `VERB`, so I think this makes it fairly simple. If function is an `nsubj` or an `amod` of an `nsubj`, then I think we want either the finite verb (e.g. "the functional gene **regulates**") or the infinitive verb via `xcomp` (e.g. "the functional gene acts **to regulate**"). If function is the verb, then I think we want the `xcomp` dependent (e.g. "the gene functions **to regulate**") or the verb acting as an adjective that modifies *x* (e.g. "the **transcribing** gene functions to"). There are no doubt some other cases but I think these are the main ones (and I probably just need to get Stefan to compile a list of exceptions). A question here is whether the `dobj` of the verb that we assign to *y* is part of *y* or whether it is the containing system *z*. Take, for example, "the gene functions to regulate the liver". Is *y* "regulate" and *z* "the liver", or is *y* "regulate the liver"? An example that I used in the definition, "The function of the genetic sequence is to produce an siRNA transcript, which in turn acts to degrade mRNA.", indicates that the `dobj` is part of *y*.
The function of the genetic sequence is not simply to "produce" but to "produce an siRNA transcript"; likewise, the function of the gene is not simply to "regulate" but to "regulate the liver". This seems right. So I can just extend *y* to include the `dobj` of the verb. (This doesn't resolve how to identify *z*, though. I think this will be based around `acl` modifiers of the `dobj` or `advcl` modifiers of the verb, as these provide extra information about *y*, which might include an answer to "how/where does *x* do *y*?" and identify *z*.)
   
<span id="fn9">fn 9:</span> Since *x* is the subject, *y* needs to be something that *x* does (as opposed to something that *x* has done to it). If *x* is related to the finite verb by `dobj`, it implies that *x* receives the action *y* rather than performs it. Thus, I am working on the hypothesis that when *x* is a `dobj`, *y* cannot be identified.
   
<span id="fn10">fn 10:</span> To clarify, this is the finite verb of the independent clause in which function is present (not necessarily of the sentence). If function is in the nominal subject (or a modifier thereof), then this is the finite verb that was identified in step 3.1. Otherwise, function should be a verb or an adverb that modifies a verb.

<span id="fn11">fn 11:</span> Selected effect is much less clear than biological effect/role/advantage and we will need to come up with some guidelines that are more field-specific and technical than grammatical.
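Step 5 of the flowchart can be sketched in the same spirit as a decision function over the annotator's answers. This is one possible reading, with assumptions stated plainly: the function name and parameters are hypothetical, the selection test in step 5.2 is treated as requiring both conditions, and `None` is returned wherever the flowchart does not name an outcome.

```python
def classify_step5(*, substitution_preserves_meaning, shaped_by_selection,
                   x_in_individual, system_identified,
                   z_in_individual=False, tied_to_system_performance=False):
    """Walk through steps 5.1 to 5.4; return a classification or None."""
    # 5.1: substituting "how well x performs" must not lose meaning.
    if not substitution_preserves_meaning:
        return "unidentifiable meaning: effect unspecified"
    # 5.2: the role is shaped by selection and x is a component of i.
    if shaped_by_selection and x_in_individual:
        return "function as performing selected effect"
    # 5.3: without a containing system z, only activity can be assigned.
    if not system_identified:
        if x_in_individual:
            return "function as producing biological activity"
        return None  # the flowchart gives no outcome for this case
    # 5.4: both outcomes require x to be in i and z to be i or within i.
    if not (x_in_individual and z_in_individual):
        return None  # the flowchart gives no outcome for this case
    if tied_to_system_performance:
        return "function as performing biological advantage"
    return "function as producing biological role"


# "Liver functioning affects human health" (the example in step 5.4)
print(classify_step5(substitution_preserves_meaning=True,
                     shaped_by_selection=False, x_in_individual=True,
                     system_identified=True, z_in_individual=True,
                     tied_to_system_performance=True))
# prints: function as performing biological advantage
```

The `None` branches mark gaps in the flowchart rather than a classification, so they may be a useful place to decide what the annotator should record.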