A Jupyter notebook contains "cells" that can be "run" by pressing `shift`+`enter`.
You can run all cells by going to `Run` -> `Run All Cells`.
It's important that you run the first code cell (immediately below) before each session, as otherwise the rest of the code won't work (because this cell imports packages and functions).
To set up a sentence, you need to call the `nlp` function, pass it a string (the sentence), and assign the result to a variable (see the third cell for an example).
You can then pass this variable to the `visualise` function, which will display some information.
In addition, `visualise` can take a second argument that alters the size of the dependency graph (the default is 100; a smaller number will squish the graph and a larger number will widen it).

In [1]:
import spacy
import scispacy
from process_sentence import visualise
nlp = spacy.load("en_core_sci_lg")

In [2]:
# Here's an example of how to use it
sentence = nlp("The function of the protein is to regulate the liver.")
visualise(sentence)
# you can alter the size of the dependency graph by passing a second parameter to visualise
sentence2 = nlp("This is a longer sentence that illustrates why you might want to reduce the size of the dependency graph.")
visualise(sentence2, 80) # to make it smaller if you want to fit a long sentence on the screen without having to scroll

TOKEN               DEPENDENCY RELATION           HEAD                DEPENDENCIES                            

The                 determiner                    function            []
function            nominal subject               is                  [The, protein]
of                  case marking                  protein             []
the                 determiner                    protein             []
protein             modifier of nominal           function            [of, the]
is                  None                          is                  [function, regulate, .]
to                  marker                        regulate            []
regulate            open clausal complement       is                  [to, liver]
the                 determiner                    liver               []
liver               direct object                 regulate            [the]
.                   punctuation                   is                  []


TOKEN               DEPENDENCY RELATION           HEAD                DEPENDENCIES                            

This                nominal subject               is                  []
is                  None                          is                  [This, sentence, .]
a                   determiner                    sentence            []
longer              adjectival modifier           sentence            []
sentence            attribute                     is                  [a, longer, illustrates]
that                nominal subject               illustrates         []
illustrates         relative clause modifier      sentence            [that, want]
why                 adverbial modifier            want                []
you                 nominal subject               want                []
might               auxiliary                     want                []
want                clausal complement            illustrates         [why, you, might, reduce]
to           

# Flowchart
The root/free morpheme is "function".
The verb function derives from the noun function; however, you can unpack into either the noun ("the function of *x* is to *y*") or the verb form ("*x* functions to *y*").

To simplify comparison of syntactically diverse constructions, we will use word conversion (specifically, morphological derivation and inflection) to convert derivatives of "function" to its root form. We will use morphological derivation to convert between parts of speech (e.g. adjective -> noun) and inflection to convert between number, tense and aspect (e.g. the plural "functions" to "function"). The key consideration during conversion is that we maintain (and convert if necessary) the dependencies between "function" and other parts of the sentence.
What I will describe here is a method for converting these derivatives of the word "function" to its base form (the verb or noun "function") while keeping track of, and where necessary converting the form of, the dependencies of "function". The idea is to rework the sentence such that semantic relations can be cast into a standard form.

Let's start with a (possibly non-exhaustive) list of derivative forms of "function":

- function (noun)
- functions (noun, plural)
- function (verb)
- functions (verb, third-person singular present)
- functional (adj)
- functionality (noun)
- functioning (gerund, acts as a noun)
- functioning (present participle, acts as an adj (maybe also an adv?))
- functioned (past participle, simple past)
- functionally (adverb)

(There's also the distinction between tense/aspect, but ultimately I think what matters for my purposes is whether it acts as a noun, adjective, etc. For example, I am not going to distinguish between gerunds and present participles.)
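
As a rough sketch of the conversion target, the list above can be written as a lookup table from surface form to the coarse role(s) it can play. The names `FORMS` and `roles_for` are hypothetical and introduced only for illustration; forms with more than one possible role would still need the dependency parse to disambiguate them.

```python
# Coarse grammatical roles for each surface form in the list above.
# Tense and aspect are deliberately ignored, as described in the note above.
# FORMS and roles_for are hypothetical names used only for this sketch.
FORMS = {
    "function": {"noun", "verb"},
    "functions": {"noun", "verb"},
    "functional": {"adj"},
    "functionality": {"noun"},
    "functioning": {"noun", "adj"},  # gerund or present participle
    "functioned": {"verb"},          # past participle or simple past
    "functionally": {"adv"},
}


def roles_for(word):
    """Return the possible coarse roles of a derivative of "function".

    An empty set means the word is not one of the listed derivatives.
    """
    return FORMS.get(word.lower(), set())


print(roles_for("functioning"))
```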

If I refer to function without quotes then I am referring to all the derivatives of function that we might wish to analyze. Note that a word (word1) is a dependency of another word (word2) if the dependency arrow originates (tail) at word2 and goes into (head) word1 (i.e. word2 -> word1).
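
The head/dependent relation described above can be sketched in plain Python. This is an illustration only: the `Token` class and both helper functions are hypothetical stand-ins rather than spaCy objects. The heads are copied from the first example table in this notebook, and the short relation labels (`det`, `nsubj`, and so on) are my reading of the descriptions printed in that table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    dep: str   # dependency relation label (assumed short form)
    head: int  # index of the head token; the root points at itself


# "The function of the protein is to regulate the liver."
SENTENCE = [
    Token("The", "det", 1),
    Token("function", "nsubj", 5),
    Token("of", "case", 4),
    Token("the", "det", 4),
    Token("protein", "nmod", 1),
    Token("is", "ROOT", 5),
    Token("to", "mark", 7),
    Token("regulate", "xcomp", 5),
    Token("the", "det", 9),
    Token("liver", "dobj", 7),
    Token(".", "punct", 5),
]


def children_of(tokens, i):
    """Texts of the direct dependents of token i (arrows whose tail is at i)."""
    return [t.text for j, t in enumerate(tokens) if t.head == i and j != i]


def is_dependent(tokens, word1, word2):
    """True if word1 is a direct dependent of word2 (word2 -> word1)."""
    return tokens[word1].head == word2 and word1 != word2


print(children_of(SENTENCE, 1))      # dependents of "function"
print(is_dependent(SENTENCE, 4, 1))  # is "protein" a dependent of "function"?
```

The lists returned by `children_of` correspond to the DEPENDENCIES column of the table.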

- 1. Non-biological or technical senses of function
   - 1.1 Is function used in a biological context? If yes, go to step 1.2; if no, classify as one of the non-biological senses of function[<sup>1</sup>](#fn1) (mathematical, non-biological role or purpose, performing non-biological role or purpose, relational, social event, or programming) and EXIT.
   - 1.2 Is function used in a technical sense?[<sup>2</sup>](#fn2) If yes, classify as **technical use**; if no, go to step 2.1.
- 2. Identify the biological item (*x*)
   - 2.1 Is function an adjective that modifies a noun (i.e. its dependency is [amod](https://universaldependencies.org/u/dep/amod.html); e.g. functional gene), a noun with an [nmod](https://universaldependencies.org/u/dep/nmod.html) dependency (e.g. functionality of the gene), a gerund or present/past participle (functioning/functioned, marked as VERB in spaCy) acting as a noun or adjective, or an adverb with an [advmod](https://universaldependencies.org/u/dep/advmod.html) dependency that modifies an adjective (e.g. functionally important gene)? If yes, go to step 2.3; if no, go to step 2.2.
   - 2.2 Is function a noun[<sup>3</sup>](#fn3)? If yes, go to step 2.4; if no, go to step 2.6.
   - 2.3 Follow the dependency arrow (head to tail) from function to its head noun. (In some cases, you may need to traverse multiple arrows.) Make a note of the head noun and go to step 2.5.
   - 2.4 Can you identify a direct nominal dependent of function (follow a dependency arrow from tail to head starting from function)? If yes, make a note of this noun and go to step 2.5[<sup>5</sup>](#fn5).
   - 2.5 Start from the previously identified noun. Is it a concrete noun that satisfies the "biological item condition"[<sup>6</sup>](#fn6),[<sup>7</sup>](#fn7)? If not, follow the dependency arrows from tail to head[<sup>4</sup>](#fn4) to see whether you can identify a noun that does. If you find such a noun, record it (or its noun phrase) as *x* and go to step 3.1; otherwise, classify as **unidentifiable item** and EXIT.
   - 2.6 Is function a (finite) verb? If yes, identify the direct subject of the verb (via `nsubj` dependency) and go to 2.5; if no, go to 2.7.
   - 2.7 Is function an adverb that modifies a verb (e.g. functionally regulates)? If yes, first identify the verb of which function is a dependent (via `advmod` dependency), then identify the direct subject of this verb (via `nsubj` dependency), and go to 2.5; if no, classify as **unidentifiable item** and EXIT.
- 3. Identify the biological effect (*y*)[<sup>8</sup>](#fn8)
   - 3.1 Did you previously identify function as an adjective, noun, gerund/participle or adverb modifying an adjective? If yes, go to step 3.2; if no (function is a verb or an adverb that modifies a verb), go to step 3.4.
   - 3.2 Follow the dependency arrow (head to tail) from the head noun (either function or the head to which function is a dependent if function is not the head noun) to the finite verb. Is the dependency via `dobj`[<sup>9</sup>](#fn9)? If yes, go to step 5.1; if no, go to step 3.3.
   - 3.3 Is the verb that you identified by following the `nsubj` relationship transitive (i.e. it has a dependent via relation `dobj`)? If yes, record the finite verb and the phrase containing its `dobj` as *y* and go to step 4.1. If no, go to step 3.4.
   - 3.4 Starting from the finite verb[<sup>10</sup>](#fn10), can you follow another dependency arrow (tail to head this time) via dependency relation `xcomp` to an infinitive, transitive verb? If yes, record the verb and the phrase containing its `dobj` as *y* and go to step 4.1; if no, go to step 5.1.
- 4. Candidates for biological activity/role/advantage/selected effect
   - 4.1 Can it be made to fit the following form[<sup>11</sup>](#fn11): the function of *x* is to *y* such that doing *y* in the past caused *x* to be selected for or maintained in a population (relative to an actual or counterfactual historical alternative or set of alternatives to *x*)? If yes, classify as **function as selected effect** and EXIT; if no, go to 4.2.
   - 4.2 Can you identify a complex containing system *z* of the biological item *x*? There are a few clues that can help in identifying *z*. Can you identify the answer to the question "how does *x* do *y*?" or "for what is *x* doing *y* used?"? Syntactically, you can look for an `acl` dependent of the direct object (the verb and its direct object together comprise *y*). You can also look for an `advcl` modifier of the verb. (Both of these provide extra information about *y*.) If yes, go to step 4.3; if no, go to 4.5.
   - 4.3. Can it be unpacked into the following form: the function of *x* is to *y* in *z*, where *x* is a component of a complex system *z*, *z* is a Darwinian individual *i* (or a complex system within *i*), and *x* doing *y* in *z* is advantageous for *i* (relative to an explicitly-identified or implied alternative to *x*)? If yes, classify as **function as biological advantage** and EXIT. If no, go to step 4.4.
   - 4.4. Can it be unpacked into the following form: the function of *x* is to *y* in *z*, where *x* is a component of a complex system *z*, and *z* is a Darwinian individual *i* (or a complex system within *i*)? If yes, classify as **function as biological role** and EXIT. If no, go to step 4.5.
   - 4.5. Can it be unpacked into the following form: the function of *x* is to *y*, where *x* is a component (or subcomponent) of a Darwinian individual *i*? If yes, classify as **function as biological activity** and EXIT. If no, classify as **unidentifiable meaning: effect specified** and EXIT.
- 5. Candidates for function as performing or producing activity/role/advantage/selected effects
   - 5.1. Substitute "*x* function" with "how well *x* performs (or works)", or "*x* works", into the raw sentence (not the unpacked sentence). If there is loss of meaning or ambiguity, classify as **unidentifiable meaning: effect unspecified**; if there is no loss of meaning and no ambiguity, go to step 5.2.
   - 5.2. Does the sentence's language indicate that the biological role has been shaped by selection? If so, and if *x* is a component of a Darwinian individual *i*, classify as **function as performing selected effect** and EXIT. If not, go to step 5.3.
   - 5.3. Can you identify *z* (complex containing system) in the unpacked sentence (see step 4.2)? If yes, go to step 5.4; if not, and if *x* is a component of a Darwinian individual *i*, classify as **function as producing biological activity** and EXIT.
   - 5.4. Is the performance of the biological role associated with how well the complex system (or organism containing the complex system) performs (e.g. "Liver functioning affects human health")? If yes, and if *x* is a component of a Darwinian individual *i* and *z* is a Darwinian individual *i* or a complex system within *i*, classify as **function as performing biological advantage**. If no (e.g. "Liver functioning is important for the hepatic system"), and if *x* is a component of a Darwinian individual *i* and *z* is an *i* or a complex system within *i*, classify as **function as producing biological role**. EXIT.
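
The branching in steps 4.1 to 4.5 can be sketched as a small decision function. This is only a sketch: every question in step 4 is a judgement made by the annotator, so the function takes those answers as booleans rather than computing them. The function name and parameter names are hypothetical.

```python
def classify_step4(*, selected_effect, has_z, z_qualifies,
                   advantageous, x_in_individual):
    """Walk steps 4.1-4.5 given the annotator's yes/no answers.

    selected_effect: doing y caused x to be selected for or maintained (4.1)
    has_z:           a complex containing system z was identified (4.2)
    z_qualifies:     x is a component of z, and z is a Darwinian individual i
                     or a complex system within i (4.3 and 4.4)
    advantageous:    x doing y in z is advantageous for i (4.3)
    x_in_individual: x is a component (or subcomponent) of i (4.5)
    """
    if selected_effect:                      # step 4.1
        return "function as selected effect"
    if has_z:                                # step 4.2
        if z_qualifies and advantageous:     # step 4.3
            return "function as biological advantage"
        if z_qualifies:                      # step 4.4
            return "function as biological role"
    if x_in_individual:                      # step 4.5
        return "function as biological activity"
    return "unidentifiable meaning: effect specified"


print(classify_step4(selected_effect=False, has_z=True, z_qualifies=True,
                     advantageous=False, x_in_individual=True))
```

Keeping the judgements as explicit inputs makes it easy to record which answer led to a given classification.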

<span id="fn1">fn 1:</span> I am not going to provide guidelines on how to classify the non-biological senses: (i) they are less ambiguous (for humans at least); (ii) if one uses a sufficiently narrow corpus, these will be rarely encountered. The two exceptions, which would need to be dealt with in a fully-automated system, are function as programming routine/method and function as mathematical relation, as these will be encountered in the biological literature.
    
<span id="fn2">fn 2:</span> This classification involves uses such as "functional ecology", "functional biology", "functional connectivity", etc.---it's clearly wrong to unpack these as "ecology functions to do..." and so on. They have effectively become compound nouns with their own meaning (influenced heavily by historical and technical considerations). Clues that you are dealing with a technical use include: repeated use of the same phrase within the same paper (although this is a poor indicator for sentence classification, as the annotator will be dealing with isolated sentences); the phrase being either defined in the paper or used as undefined jargon that the reader is expected to know; and the existence of definitions or articles on Wikipedia or in Google search results (obviously vague but perhaps the most useful guideline, as this is a very good indicator that a phrase refers to a technical concept within a particular subfield).
    
<span id="fn3">fn 3:</span> Function might be used as a noun by itself (e.g. "The function of the gene...") or it might be the head of a nominal phrase (e.g. "Protein function is..."). (I do not believe it can be used as a nominal modifier in a nominal phrase and not be the head -- i.e. I don't think that it can play the grammatical role that "protein" plays in "protein function is important" -- but let me know if you encounter an example of this.) The noun (or nominal phrase) might be the subject (dependency `nsubj`) (e.g. "Protein function is...") but it also might be in the predicate (in the traditional grammar sense where you only have subject and predicate) (e.g. an object in "The promoter regulates the functional response...").

<span id="fn4">fn 4:</span> More specifically, can you identify a concrete (see [<sup>6</sup>](#fn6)) nominal dependent of function? This might be a subject complement (via [nmod](https://universaldependencies.org/u/dep/nmod.html)) or a compound noun (via [compound](http://universaldependencies.org/docs/u/dep/compound.html)). It might be a direct dependency (i.e. follow one dependency arrow from function to its nominal dependent) or indirect (transitively follow multiple arrows from function to its nominal dependent, in which case it might pass through other parts of speech, such as a verb (e.g. `acl`, in which a verb phrase acts like an adjective)).  

<span id="fn5">fn 5:</span> Note that there might be edge cases here where we get a phrase like "liver protein function", in which case function has two `compound` dependencies (and instead of just "protein" we want "liver protein"). Likewise, we might find something like "important protein function", in which "important" is an `amod` dependent of function; we do not want to keep the modifier in this case, but there might be some instances in which we do.
    
<span id="fn6">fn 6:</span> In order to be a biological item *x*, the noun must be something that can qualify as a "character" or "trait" (variant of a character) of a Darwinian individual. This includes obvious concrete nouns like "heart", which refer to a physical item, but also includes nouns that cannot be identified by the senses but refer to a concrete concept (e.g. "gene", in the sense of "beanbag genetics", is used to refer to the concept of an atomic unit of DNA that has a well-defined phenotypic effect---this is a sufficiently concrete concept to qualify as a character even if it does not exactly map onto a physical piece of DNA). As a heuristic, if it makes sense to ask "The function of *x* is..." then *x* is a candidate for a biological item.
    
<span id="fn7">fn 7:</span> Once you have identified the concrete head noun, follow its dependency arrow(s) from tail to head (if they exist) for the purposes of building up the complete noun phrase of *x*. Identify modifiers (e.g. `compound`, `nmod`) that more completely describe the item. For example, in "the gene's DNA sequence", sequence is the head noun, DNA is a `compound` dependent of sequence, and gene is an `nmod:poss` dependent of sequence. Here *x* is best identified as "gene's DNA sequence". Likewise, in "the DNA sequence of the gene", DNA is again a `compound` dependent and gene is an `nmod` dependent, so again we identify *x* as "DNA sequence of the gene" (note that I consider "gene's DNA sequence" and "DNA sequence of the gene" to be semantically identical).
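
A minimal sketch of building up the noun phrase for *x*, using the "the DNA sequence of the gene" example. Tokens are plain `(text, relation, head index)` tuples rather than spaCy objects. The attachment of "of" and the second "the" to "gene" is my assumption about the parse, and the helper names are hypothetical. The possessive example ("the gene's DNA sequence") would need extra handling for the clitic and its determiner, which this sketch does not attempt.

```python
# Relations that contribute to the noun phrase of x (from the note above).
MODIFIER_DEPS = {"compound", "nmod", "nmod:poss"}

# "the DNA sequence of the gene": (text, relation, index of head).
# The head noun "sequence" points at itself for the purposes of this sketch.
PHRASE = [
    ("the", "det", 2),
    ("DNA", "compound", 2),
    ("sequence", "ROOT", 2),
    ("of", "case", 5),
    ("the", "det", 5),
    ("gene", "nmod", 2),
]


def subtree(tokens, i):
    """Indices of token i and of everything below it in the parse."""
    found = [i]
    for j, (_, _, head) in enumerate(tokens):
        if head == i and j != i:
            found.extend(subtree(tokens, j))
    return found


def noun_phrase(tokens, head):
    """Head noun plus the full subtree of each qualifying modifier."""
    keep = [head]
    for j, (_, dep, h) in enumerate(tokens):
        if h == head and j != head and dep in MODIFIER_DEPS:
            keep.extend(subtree(tokens, j))
    # Restore sentence order before joining.
    return " ".join(tokens[k][0] for k in sorted(keep))


print(noun_phrase(PHRASE, 2))
```

The determiner of the head noun is dropped because `det` is not in `MODIFIER_DEPS`, while the determiner inside the `nmod` subtree is kept.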
    
<span id="fn8">fn 8:</span> Some notes about *y* (mainly for my own working out). We either unpack as "The function of *x* is to *y* in *z*" or "*x* functions to *y* in *z*". In both cases, it seems clear that *y* should be a verb or be an adjective formed from the verb (present participle). An example of the former is "The gene functions to regulate..." (where *y* is the infinitive verb "regulate"). Another is "The functional gene regulates..." or "Protein function regulates..." (where *y* is the finite verb "regulates"). An example of the latter is "The transcribing gene functions to..." (where *y* is the present participle "transcribing", which modifies *x* (gene)). spaCy classifies present participles as `VERB`, so I think this makes it fairly simple. If function is an `nsubj` or an `amod` of an `nsubj`, then I think we want either the finite verb (e.g. "the functional gene **regulates**") or the infinitive verb via `xcomp` (e.g. "the functional gene acts **to regulate**"). If function is the verb, then I think we want the `xcomp` dependent (e.g. "the gene functions **to regulate**") or the verb acting as an adjective that modifies *x* (e.g. "the **transcribing** gene functions to"). There are no doubt some other cases but I think these are the main ones (and I probably just need to get Stefan to compile a list of exceptions). A question here is whether the `dobj` of the verb that we assign to *y* is part of *y* or whether it is the containing system *z*. Take, for example, "the gene functions to regulate the liver". Is *y* "regulate" and *z* "the liver", or is *y* "regulate the liver"? An example that I used in the definition, "The function of the genetic sequence is to produce an siRNA transcript, which in turn acts to degrade mRNA.", indicates that the `dobj` is part of *y*.
The function of the genetic sequence is not simply to "produce" but to "produce an siRNA transcript"; likewise, the function of the protein is not simply to "regulate" but to "regulate the liver". This seems right. So I can just extend *y* to include the `dobj` of the verb (doesn't resolve how to identify *z* though---I think this will be based around `acl` modifiers of the `dobj` or `advcl` modifiers of the verb, as these provide extra information about *y*, which might include an answer to "how/where does *x* do *y*?" and identify *z*).
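
Extending *y* to include the `dobj` can be sketched as follows, covering only steps 3.3 and 3.4 of the flowchart. Tokens are plain `(text, relation, head index)` tuples standing in for a spaCy parse; the heads come from the first example table in this notebook, the short relation labels are assumed, and all helper names are hypothetical. Identifying *z* is not attempted here.

```python
# "The function of the protein is to regulate the liver."
SENTENCE = [
    ("The", "det", 1), ("function", "nsubj", 5), ("of", "case", 4),
    ("the", "det", 4), ("protein", "nmod", 1), ("is", "ROOT", 5),
    ("to", "mark", 7), ("regulate", "xcomp", 5), ("the", "det", 9),
    ("liver", "dobj", 7), (".", "punct", 5),
]


def dependents(tokens, i, dep):
    """Indices of the direct dependents of token i with relation dep."""
    return [j for j, (_, d, h) in enumerate(tokens)
            if h == i and j != i and d == dep]


def subtree(tokens, i):
    """Indices of token i and of everything below it in the parse."""
    found = [i]
    for j, (_, _, head) in enumerate(tokens):
        if head == i and j != i:
            found.extend(subtree(tokens, j))
    return found


def verb_with_object(tokens, verb):
    """The verb plus the phrase containing its dobj, or None if intransitive."""
    objects = dependents(tokens, verb, "dobj")
    if not objects:
        return None
    keep = [verb] + subtree(tokens, objects[0])
    return " ".join(tokens[k][0] for k in sorted(keep))


def find_y(tokens, finite_verb):
    """Step 3.3, then step 3.4; None means the flowchart moves to step 5.1."""
    y = verb_with_object(tokens, finite_verb)               # step 3.3
    if y is not None:
        return y
    for comp in dependents(tokens, finite_verb, "xcomp"):   # step 3.4
        y = verb_with_object(tokens, comp)
        if y is not None:
            return y
    return None


print(find_y(SENTENCE, 5))  # finite verb is "is" (index 5)
```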
   
<span id="fn9">fn 9:</span> Since *x* is the subject, *y* needs to be something that *x* does (as opposed to something that *x* has done to it). If *x* is related to the finite verb by `dobj`, it implies that *x* receives the action *y* rather than performs it. Thus, I am working on the hypothesis that when *x* is a `dobj`, *y* cannot be identified.
   
<span id="fn10">fn 10:</span> To clarify, this is the finite verb of the independent clause in which function is present (not necessarily of the sentence). If function is in the nominal subject (or a modifier thereof), then this is the finite verb that was identified in step 3.2. Otherwise, function should be a verb or an adverb that modifies a verb.

<span id="fn11">fn 11:</span> Selected effect is much less clear than biological effect/role/advantage and we will need to come up with some guidelines that are more field-specific and technical than grammatical.

In [4]:
# you can enter in your own examples below (probably best to use one cell per example)
# make sure you save before shutting down (file -> shut down) then you can use ctrl-C on the terminal to stop the docker container