A Jupyter notebook contains "cells" that can be "run" by pressing `shift`+`enter`.
You can run all cells by going to `Run` -> `Run All Cells`.
It's important that you run the first code cell (immediately below) before each session, as otherwise the rest of the code won't work (because this cell imports packages and functions).
To set up a sentence, call the `nlp` function on a string (the sentence) and assign the result to a variable (see the third cell for an example).
You can then pass this variable to the `visualise` function, which will display some information.
In addition, `visualise` can take a second argument that alters the size (the default is 100; a smaller number squishes the dependency graph and a larger number widens it).

In [1]:
# load packages and functions
import spacy
import scispacy
from process_sentence import visualise
nlp = spacy.load("en_core_sci_scibert")

# Flowchart
The root/free morpheme is "function".
The verb function derives from the noun function; however, you can unpack into either the noun ("the function of *x* is to *y*") or the verb form ("*x* functions to *y*").

To simplify comparison of syntactically diverse constructions, we will convert derivations of "function" to its root form using morphological derivation and inflection. We will use morphological derivation to convert between parts of speech (e.g. adjective -> noun) and inflection to convert between number, tense and aspect (e.g. the plural "functions" to "function"). The key consideration during conversion is that we maintain (and convert if necessary) the dependencies between "function" and other parts of the sentence.
What I will describe here is a method for converting these derivations of the word function to its base form (the verb or noun "function"), while at the same time, keeping track of and converting the form (where necessary) of the dependencies of "function". The idea is to rework the sentence such that semantic relations can be cast into a standard form.

Let's start with a (possibly non-exhaustive) list of derivative forms of "function":

- function (noun)
- functions (noun, plural)
- function (verb)
- functions (verb, third-person singular)
- functional (adj)
- functionality (noun)
- functioning (gerund, acts as a noun)
- functioning (present participle, acts as an adj (maybe also an adv?))
- functioned (past participle, simple past)
- functionally (adverb)
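
The list above can be written down as a small lookup table, which makes the ambiguous surface forms easy to spot. This is only a sketch: the names `DERIVATIVE_FORMS`, `readings` and `is_ambiguous` are hypothetical, and the labels are informal descriptions rather than tags produced by spaCy.

```python
# Hypothetical lookup table: surface form -> possible readings of "function".
# The labels are informal descriptions, not tags from any tagger.
DERIVATIVE_FORMS = {
    "function": ["noun, singular", "verb, base form"],
    "functions": ["noun, plural", "verb, third-person singular"],
    "functional": ["adjective"],
    "functionality": ["noun"],
    "functioning": ["gerund", "present participle"],
    "functioned": ["past participle", "simple past"],
    "functionally": ["adverb"],
}


def readings(word):
    """Return the possible readings of a derivative of "function" (empty if unknown)."""
    return DERIVATIVE_FORMS.get(word.lower(), [])


def is_ambiguous(word):
    """True when the surface form alone does not determine the reading."""
    return len(readings(word)) > 1


print(readings("functioning"))
print(is_ambiguous("functionally"))
```

Forms with more than one reading are the ones where the dependency parse (rather than the spelling) has to decide which branch of the flowchart applies.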

(the below can be used in the paper)

If we focus on function used as a noun, we have the singular "function", the plural "functions", as well as an additional form "functionality". But "function(s)" might also be used as a verb ("the gene functions to transcribe"). We also encounter difficult cases such as the gerund form "functioning", which is the -ing form of the verb "function" that acts as a noun ("functioning is important for genes")! To complicate matters even further, "functioning" might be a present participle instead of a gerund, in which case it either acts as an adjective ("the functioning gene") or forms a verb tense ("the gene was functioning"). Suffice it to say, when dealing with real-world sentences, the sheer variety of syntactic forms presents a challenge for semantic interpretation of function (i.e. the sense in which function is being used).

If I refer to function without quotes then I am referring to all the derivatives of function that we might wish to analyze. Note that a word (word1) is a dependency of another word (word2) if the dependency arrow originates (tail) at word2 and goes into (head) word1 (i.e. word2 -> word1).
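
To make the arrow convention concrete, here is a minimal sketch that stores a parse as a list of tokens, each pointing at the word its arrow comes from. The `Token` type, the helper names and the toy parse of "functional gene" are all assumptions written by hand for illustration; they are not output from the `nlp` pipeline.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    head: int  # index of the word the arrow originates from (word2); the root points to itself
    dep: str   # label on the arrow


# Hand-written toy parse of "functional gene" (not parser output).
PARSE = [
    Token("functional", 1, "amod"),
    Token("gene", 1, "ROOT"),
]


def dependents(parse, i):
    """Indices of the words whose arrow originates at word i (i -> dependent)."""
    return [j for j, tok in enumerate(parse) if tok.head == i and j != i]


def is_dependency_of(parse, word1, word2):
    """True if word1 is a dependency of word2, i.e. the arrow runs word2 -> word1."""
    return word1 != word2 and parse[word1].head == word2


print(is_dependency_of(PARSE, 0, 1))  # is "functional" a dependency of "gene"?
```

In this representation, "following an arrow from head to tail" means moving from a token to `parse[i].head`, and "tail to head" means moving to one of `dependents(parse, i)`.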

- 1. Non-biological or technical senses of function
   - 1.1 Is function used in a biological context? If yes, go to step 1.2; if no, classify as either a mathematical, non-biological role or purpose, performing non-biological role or purpose, relational, social event, or programming sense of function[<sup>1</sup>](#fn1) and EXIT.
   - 1.2 Is function used in a technical sense?[<sup>2</sup>](#fn2) If yes, classify as **technical use**; if no, go to step 2.1.
- 2. Identify the biological item (*x*)
   - 2.1 Is function an adjective that modifies a noun (i.e. its dependency is [amod](https://universaldependencies.org/u/dep/amod.html); e.g. functional gene), a noun with a [nmod](https://universaldependencies.org/u/dep/nmod.html) dependency (e.g. functionality of the gene) (WRONG EXAMPLE, THIS SHOULD BE 2.2), a gerund or present/past participle (functioning/functioned, marked as VERB in spaCy) acting as a noun or adjective (or an adverb modifying an adjective), or an adverb with an [advmod](https://universaldependencies.org/u/dep/advmod.html) dependency that modifies an adjective (e.g. functionally important gene)? If yes, go to step 2.3; if no, go to step 2.2.
   - 2.2 Is function a noun[<sup>3</sup>](#fn3)? If yes, go to step 2.4; if no, go to step 2.6.
   - 2.3 Follow the dependency arrow (head to tail) from function to its head noun. (In some cases, you may need to traverse multiple arrows.) Make a note of the head noun and go to step 2.5.
   - 2.4 Can you identify a direct nominal dependent of function (follow a dependency arrow from tail to head starting from function)? If yes, make a note of this noun and go to step 2.5[<sup>5</sup>](#fn5).
   - 2.5 Start from the previously identified noun. Is it a concrete noun that satisfies the "biological item condition"[<sup>6</sup>](#fn6),[<sup>7</sup>](#fn7)? If not, follow the dependency arrows from tail to head[<sup>4</sup>](#fn4) to see if you can identify a noun that does. If yes, record this noun (or noun phrase) as *x* and go to step 3.1; if no, classify as **unidentifiable item** and EXIT.
   - 2.6 Is function a (finite) verb? If yes, identify the direct subject of the verb (via `nsubj` dependency) and go to 2.5; if no, go to 2.7.
   - 2.7 Is function an adverb that modifies a verb (e.g. functionally regulates)? If yes, first identify the verb of which function is a dependent (via `advmod` dependency), then identify the direct subject of this verb (via `nsubj` dependency), and go to 2.5; if no, classify **unidentifiable item** and EXIT.
- 3. Identify the biological effect (*y*)[<sup>8</sup>](#fn8)
   - 3.1 Did you previously identify function as an adjective, noun, gerund/participle or adverb modifying an adjective? If yes, go to step 3.2; if no (function is a verb or an adverb that modifies a verb), go to step 3.4.
   - 3.2 Follow the dependency arrow (head to tail) from the head noun (either function or the head of which function is a dependent if function is not the head noun) to the finite verb. Is the dependency via `dobj`[<sup>9</sup>](#fn9)? If yes, go to step 5.1; if no, go to step 3.3.
   - 3.3 Is the verb that you identified by following the `nsubj` relationship transitive (i.e. it has a dependent via relation `dobj`)? If yes, record the finite verb and the phrase containing its `dobj` as *y* and go to step 4.1. If no, go to step 3.4.  
   - 3.4 Starting from the finite verb[<sup>10</sup>](#fn10), can you follow another dependency arrow (tail to head this time) via dependency relation `xcomp` to an infinitive, transitive verb? If yes, record the verb and the phrase containing its `dobj` as *y* and go to step 4.1; if no, go to step 5.1.
- 4. Candidates for biological activity/role/advantage/selected effect
   - 4.1 Can it be made to fit the following form[<sup>11</sup>](#fn11): the function of *x* is to *y* such that doing *y* in the past caused *x* to be selected for or maintained in a population (relative to an actual or counterfactual historical alternative or set of alternatives to *x*)? If yes, classify as **function as selected effect** and EXIT; if no, go to 4.2.
   - 4.2 Can you identify a complex containing system *z* of the biological item *x*? There are a few clues that can help in identifying *z*. Can you identify the answer to the question "how does *x* do *y*?" or "for what is *x* doing *y* used?"? Syntactically, you can look for an `acl` dependent of the direct object of the verb (the verb and its direct object together comprise *y*). You can also look for an `advcl` modifier of the verb. (Both of these provide extra information about *y*.) If yes, go to step 4.3; if no, go to 4.5.
   - 4.3. Can it be unpacked into the following form: the function of *x* is to *y* in *z*, where *x* is a component of a complex system *z*, *z* is a Darwinian individual *i* (or a complex system within *i*), and *x* doing *y* in *z* is advantageous for *i* (relative to an explicitly-identified or implied alternative to *x*)? If yes, classify as **function as biological advantage** and EXIT. If no, go to step 4.4.
   - 4.4. Can it be unpacked into the following form: the function of *x* is to *y* in *z*, where *x* is a component of a complex system *z*, and *z* is a Darwinian individual *i* (or a complex system within *i*)? If yes, classify as **function as biological role** and EXIT. If no, go to step 4.5.
   - 4.5. Can it be unpacked into the following form: the function of *x* is to *y*, where *x* is a component (or subcomponent) of a Darwinian individual *i*? If yes, classify as **function as biological activity** and EXIT. If no, classify as **unidentifiable meaning: effect specified** and EXIT.
- 5. Candidates for function as performing or producing activity/role/advantage/selected effects
   - 5.1. Substitute "*x* function" with "how well *x* performs (or works)", or "*x* works", into the raw sentence (not the unpacked sentence). If there is loss of meaning or ambiguity, classify as **unidentifiable meaning: effect unspecified**; if there is no loss of meaning and no ambiguity, go to step 5.2.
   - 5.2. Does the sentence's language indicate that the biological role has been shaped by selection? If so, and if *x* is a component of a Darwinian individual *i*, classify as **function as performing selected effect** and EXIT. If not, go to step 5.3.
   - 5.3. Can you identify *z* (complex containing system) in the unpacked sentence (see step 4.2)? If yes, go to step 5.4; if not, and if *x* is a component of a Darwinian individual *i*, classify as **function as producing biological activity** and EXIT.
   - 5.4. Is the performance of the biological role associated with how well the complex system (or organism containing the complex system) performs (e.g. "Liver functioning affects human health")? If yes, and if *x* is a component of a Darwinian individual *i* and *z* is a Darwinian individual *i* or a complex system within *i*, classify as **function as performing biological advantage**. If no (e.g. "Liver functioning is important for the hepatic system"), and if *x* is a component of a Darwinian individual *i* and *z* is an *i* or a complex system within *i*, classify as **function as producing biological role**. EXIT.
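
The cascade in step 4 can be read as a short chain of yes/no questions. The sketch below assumes the annotator has already answered each question; the function name `classify_step_4`, its boolean parameters and the shortened label strings are hypothetical and only mirror steps 4.1 to 4.5 as written above.

```python
def classify_step_4(selected_effect, has_z, z_within_individual,
                    advantageous, x_in_individual):
    """Walk steps 4.1-4.5 given the annotator's yes/no answers (all booleans).

    selected_effect     -- 4.1: doing y caused x to be selected for or maintained
    has_z               -- 4.2: a containing system z can be identified
    z_within_individual -- 4.3/4.4: z is a Darwinian individual i or a system within i
    advantageous        -- 4.3: x doing y in z is advantageous for i
    x_in_individual     -- 4.5: x is a (sub)component of a Darwinian individual i
    """
    if selected_effect:                    # step 4.1
        return "selected effect"
    if has_z and z_within_individual:      # step 4.2, then 4.3 or 4.4
        if advantageous:                   # step 4.3
            return "biological advantage"
        return "biological role"           # step 4.4
    if x_in_individual:                    # step 4.5
        return "biological activity"
    return "unidentifiable meaning: effect specified"


# Example: z was identified but no advantage for i is stated.
print(classify_step_4(False, True, True, False, True))
```

Writing the steps this way makes the fall-through explicit: a sentence only reaches **biological activity** when neither a selected effect nor a usable containing system *z* was found.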

<span id="fn1">fn 1:</span> I am not going to provide guidelines on how to classify the non-biological senses because (i) they are less ambiguous (for humans at least), and (ii) if one uses a sufficiently narrow corpus, these will be rarely encountered. The two exceptions, which would need to be dealt with in a fully-automated system, are function as programming routine/method and function as mathematical relation, as these will be encountered in the biological literature.
    
<span id="fn2">fn 2:</span> This classification involves uses such as "functional ecology", "functional biology", "functional connectivity", etc.---it's clearly wrong to unpack these as "ecology functions to do..." and so on. They have effectively become compound nouns with their own meaning (influenced heavily by historical and technical considerations). Clues that you are dealing with a technical use include: repeated use of the same phrase within the same paper (although this is a poor indicator for sentence classification, as the annotator will be dealing with isolated sentences); the phrase being defined in the paper (or used as undefined jargon that the reader is expected to know); and the existence of definitions or articles found via Wikipedia or Google searches (obviously vague but perhaps the most useful guideline, as this is a very good indicator that a phrase is used to refer to a technical concept within a particular subfield).
    
<span id="fn3">fn 3:</span> Function might be used as a noun by itself (e.g. "The function of the gene...") or it might be the head of a nominal phrase (e.g. "Protein function is..."). (I do not believe it can be used as a nominal modifier in a nominal phrase and not be the head -- i.e. I don't think that it can play the grammatical role that "protein" plays in "protein function is important" -- but let me know if you encounter an example of this.) The noun (or nominal phrase) might be the subject (dependency `nsubj`) (e.g. "Protein function is...") but it also might be in the predicate (in the traditional grammar sense where you only have subject and predicate) (e.g. an object in "The promoter regulates the functional response...").

<span id="fn4">fn 4:</span> More specifically, can you identify a concrete (see [<sup>6</sup>](#fn6)) nominal dependent of function? This might be a subject complement (via [nmod](https://universaldependencies.org/u/dep/nmod.html)) or a compound noun (via [compound](http://universaldependencies.org/docs/u/dep/compound.html)). It might be a direct dependency (i.e. follow one dependency arrow from function to its nominal dependent) or indirect (transitively follow multiple arrows from function to its nominal dependent, in which case it might pass through other parts of speech, such as a verb (e.g. `acl`, in which a verb phrase acts like an adjective)).  

<span id="fn5">fn 5:</span> Note that there might be edge cases here where we get a phrase like "liver protein function", in which case function has two `compound` dependencies (and instead of just "protein" we want "liver protein"). Likewise, we might find something like "important protein function", in which important is an `amod` dependent of function and there might be some cases that we want to hang on to these (not in this case, but there might be some instances in which this is the case).
    
<span id="fn6">fn 6:</span> In order to be a biological item *x*, the noun must be something that can qualify as a "character" or "trait" (variant of a character) of a Darwinian individual. This includes obvious concrete nouns like "heart", which refer to a physical item, but also includes nouns that cannot be identified by the senses but refer to a concrete concept (e.g. "gene", in the sense of "beanbag genetics", is used to refer to the concept of an atomic unit of DNA that has a well-defined phenotypic effect---this is a sufficiently concrete concept to qualify as a character even if it does not exactly map onto a physical piece of DNA). As a heuristic, if it makes sense to ask "The function of *x* is..." then *x* is a candidate for a biological item.
    
<span id="fn7">fn 7:</span> Once you have identified the concrete head noun, follow its dependency arrow(s) from tail to head (if they exist) for the purposes of building up the complete noun phrase of *x*. Identify modifiers (e.g. `compound`, `nmod`) that more completely describe the item. For example, in "the gene's DNA sequence" sequence is the head noun, DNA is a `compound` dependent of sequence, and gene is a `nmod:poss` dependent of sequence. Here *x* is best identified as "gene's DNA sequence". Likewise, in "the DNA sequence of the gene", DNA is again a `compound` dependent and gene is an `nmod` dependent, so again we identify *x* as "DNA sequence of the gene" (note that I consider "gene's DNA sequence" and "DNA sequence of the gene" to be semantically identical).
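
The phrase-building rule in fn 7 can be sketched as "keep the head noun plus the full subtree of each `compound`, `nmod` or `nmod:poss` dependent". Everything here is an assumption for illustration: the `Token` type, the helper names `subtree` and `item_phrase`, and the hand-written parse of "the DNA sequence of the gene" (which follows the relations named in fn 7 but is not parser output).

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    head: int  # index of the governing word; the root points to itself
    dep: str   # dependency label


# Hand-written toy parse of "the DNA sequence of the gene" (not parser output).
PARSE = [
    Token("the", 2, "det"),
    Token("DNA", 2, "compound"),
    Token("sequence", 2, "ROOT"),
    Token("of", 5, "case"),
    Token("the", 5, "det"),
    Token("gene", 2, "nmod"),
]

# Relations treated as "modifiers that more completely describe the item".
MODIFIER_DEPS = {"compound", "nmod", "nmod:poss"}


def subtree(parse, i):
    """Index i plus the indices of every word below it in the tree."""
    found = [i]
    for j, tok in enumerate(parse):
        if tok.head == i and j != i:
            found.extend(subtree(parse, j))
    return found


def item_phrase(parse, head):
    """Join the head noun with the subtrees of its descriptive modifiers."""
    keep = [head]
    for j, tok in enumerate(parse):
        if tok.head == head and j != head and tok.dep in MODIFIER_DEPS:
            keep.extend(subtree(parse, j))
    return " ".join(parse[k].text for k in sorted(keep))


print(item_phrase(PARSE, 2))
```

The determiner attached directly to the head noun is dropped because `det` is not in `MODIFIER_DEPS`, while the determiner inside the `nmod` subtree is kept, which matches the way *x* is written in the footnote.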
    
<span id="fn8">fn 8:</span> Some notes about *y* (mainly for my own working out). We either unpack as "The function of *x* is to *y* in *z*" or "*x* functions to *y* in *z*". In both cases, it seems clear that *y* should be a verb or an adjective formed from the verb (present participle). An example of the former is "The gene functions to regulate..." (where *y* is the infinitive verb "regulate"). Another is "The functional gene regulates..." or "Protein function regulates..." (where *y* is the finite verb "regulates"). An example of the latter is "The transcribing gene functions to..." (where *y* is the present participle "transcribing", which modifies *x* (gene)). spaCy classifies present participles as `VERB`, so I think this makes it fairly simple. If function is an `nsubj` or an `amod` of an `nsubj`, then I think we want either the finite verb (e.g. "the functional gene **regulates**") or the infinitive verb via `xcomp` (e.g. "the functional gene acts **to regulate**"). If function is the verb, then I think we want the `xcomp` dependent (e.g. "the gene functions **to regulate**") or the verb acting as an adjective that modifies *x* (e.g. "the **transcribing** gene functions to"). There are no doubt some other cases but I think these are the main ones (and I probably just need to get Stefan to compile a list of exceptions). A question here is whether the `dobj` of the verb that we assign to *y* is part of *y* or whether it is the containing system *z*. Take, for example, "the gene functions to regulate the liver". Is *y* "regulate" and *z* "the liver", or is *y* "regulate the liver"? Using an example that I used in the definition, "The function of the genetic sequence is to produce an siRNA transcript, which in turn acts to degrade mRNA." indicates that the `dobj` is part of *y*.
The function of the genetic sequence is not simply to "produce" but to "produce an siRNA transcript"; likewise, the function of the protein is not simply to "regulate" but to "regulate the liver". This seems right. So I can just extend *y* to include the `dobj` of the verb (doesn't resolve how to identify *z* though---I think this will be based around `acl` modifiers of the `dobj` or `advcl` modifiers of the verb, as these provide extra information about *y*, which might include an answer to "how/where does *x* do *y*?" and identify *z*).
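
The conclusion of fn 8 (that *y* is the `xcomp` verb together with the phrase containing its `dobj`) can be sketched on the example "the gene functions to regulate the liver". The `Token` type, the helpers `subtree` and `find_effect`, and both hand-written parses are assumptions for illustration, not output from the `nlp` pipeline.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    head: int  # index of the governing word; the root points to itself
    dep: str   # dependency label


# Hand-written toy parse of "the gene functions to regulate the liver".
PARSE = [
    Token("the", 1, "det"),
    Token("gene", 2, "nsubj"),
    Token("functions", 2, "ROOT"),
    Token("to", 4, "mark"),
    Token("regulate", 2, "xcomp"),
    Token("the", 6, "det"),
    Token("liver", 4, "dobj"),
]

# Hand-written toy parse of "the gene functions" (no xcomp, so no y).
NO_EFFECT = [
    Token("the", 1, "det"),
    Token("gene", 2, "nsubj"),
    Token("functions", 2, "ROOT"),
]


def subtree(parse, i):
    """Index i plus the indices of every word below it in the tree."""
    found = [i]
    for j, tok in enumerate(parse):
        if tok.head == i and j != i:
            found.extend(subtree(parse, j))
    return found


def find_effect(parse, verb):
    """Return the xcomp verb plus its dobj phrase as y, or None if there is none."""
    for j, tok in enumerate(parse):
        if tok.head == verb and tok.dep == "xcomp":
            for k, obj in enumerate(parse):
                if obj.head == j and obj.dep == "dobj":
                    span = sorted([j] + subtree(parse, k))
                    return " ".join(parse[m].text for m in span)
    return None


print(find_effect(PARSE, 2))
print(find_effect(NO_EFFECT, 2))
```

Returning `None` corresponds to the "if no, go to step 5.1" branch of step 3.4; identifying *z* from `acl` or `advcl` modifiers is deliberately left out of this sketch.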
   
<span id="fn9">fn 9:</span> Since *x* is the subject, *y* needs to be something that *x* does (as opposed to something that *x* has done to it). If *x* is related to the finite verb by `dobj`, it implies that it receives the action *y* rather than does the action *y*. Thus, I am working on the hypothesis that when *x* is a `dobj`, *y* cannot be identified.
   
<span id="fn10">fn 10:</span> To clarify, this is the finite verb of the independent clause in which function is present (not necessarily of the sentence). If function is in the nominal subject (or a modifier thereof), then this is the finite verb that was identified in step 3.1. Otherwise, function should be a verb or an adverb that modifies a verb.

<span id="fn11">fn 11:</span> Selected effect is much less clear than biological effect/role/advantage and we will need to come up with some guidelines that are more field-specific and technical than grammatical.

In [2]:
# instance 1
test = nlp("We review here the identification, possible origins, evolutionary trends, and functions of orphans with an emphasis on their role in plant biology.")
visualise(test,100)
"""
1.1
1.2
2.1
2.2
2.4
2.5 (no nominal dependent)
unidentifiable item, EXIT

Problems? Yes
We miss the connection between functions and orphans.
If this were a simpler structure, then we'd get an nmod connection between functions (tail) and orphans (head).
To obtain this, we need to go from functions -> identification (via conj) and then from identification -> orphans (via nmod)

Possible alteration to flowchart:
2.5 could be rewritten to focus on finding an nmod connection (will it always be an nmod?) from function; however, this might be indirect if dealing with a conjunction (and perhaps other things?)
Might be possible to rewrite as allowing ANY dependency connection from function (noun) to another noun via nmod SO LONG AS the final connection is tail to head via nmod

How would this change the outcome?
2.5 gives "orphans"
3.1 yes
3.2 yes via dobj (note that it passes via the intermediate noun (identification))
5.1 unidentifiable meaning: effect unspecified

The latter looks good. Here the use of function indicates that orphans has functions but it gives us no clue as to what sort of functions we are talking about.
"""
test = nlp("functions of orphans are important")
visualise(test)
# just to show that if the sentence is syntactically more simple then we get an nmod connection from functions to orphan as expected

# note after going through with Stefan this has similarities to some of his examples with lists or things separated by and
# has to do with how UD assigns the first thing in a list (or whatever) as the head so you need to work back through the first one
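
The possible alteration described for instance 1 (work back through the first item of the list, then look for an `nmod`) can be sketched as follows. The `Token` type, the function name `nmod_via_conj` and the two hand-written parses are assumptions for illustration; the first parse mirrors the attachment described in the notes above (orphans attached to the first conjunct) rather than actual parser output.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    head: int  # index of the governing word; the root points to itself
    dep: str   # dependency label


# Hand-written toy parse of "identification and functions of orphans",
# with "orphans" attached to the first conjunct as in the notes above.
COORDINATED = [
    Token("identification", 0, "ROOT"),
    Token("and", 0, "cc"),
    Token("functions", 0, "conj"),
    Token("of", 4, "case"),
    Token("orphans", 0, "nmod"),
]

# Hand-written toy parse of the simpler "functions of orphans".
SIMPLE = [
    Token("functions", 0, "ROOT"),
    Token("of", 2, "case"),
    Token("orphans", 0, "nmod"),
]


def nmod_via_conj(parse, i):
    """Find an nmod dependent of word i, climbing conj arrows when i has none."""
    seen = set()
    while i not in seen:
        seen.add(i)
        for j, tok in enumerate(parse):
            if tok.head == i and j != i and tok.dep == "nmod":
                return j          # final connection is tail to head via nmod
        if parse[i].dep != "conj":
            return None           # nothing left to climb through
        i = parse[i].head         # work back towards the first conjunct
    return None


print(COORDINATED[nmod_via_conj(COORDINATED, 2)].text)
```

In the simple parse the `nmod` is found directly; in the coordinated parse it is only found after one `conj` step, which is the behaviour the proposed rewrite of 2.5 asks for.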

In [3]:
# instance 2
test = nlp("a considerable number of small proteins whose functions are yet to be characterized")
visualise(test)
"""
1.1
1.2
2.1
2.2 yes noun
2.4

Problem: yes

At 2.4, we can follow functions to whose (a determiner) via nmod:poss.
Clearly we want to replace "whose" with the noun phrase "small proteins", but there's no way to handle this currently.
This is actually a bit strange since this is a more general phenomenon than this example.
I thought I had added this.

Possible alteration to flowchart:

If a dependency links to a determiner, move to the reference (in this case whose refers to proteins, so we should connect functions with proteins)

How would this change the flowchart?

2.4 -> 2.5 proteins
2.5 yes (x is proteins)
3.1 yes noun
3.2 not via dobj (this is a weird one, the finite verb is "are" and to get there you need to start from function not proteins
(so maybe this is another alteration that needs to be made if going through a determiner) but you also need to go functions to yet (via nsubj) and then to are via cop
for now just going with us having got to "are" from "functions")
3.3 no
3.4 no
5.1 unidentifiable meaning: effect unspecified (this is basically the same outcome as #1, which seems correct)

"""


In [4]:
# instance 3
test = nlp("we apply a collection of structural bioinformatics algorithms to infer molecular function of putative small proteins")
visualise(test)
"""
1.1
1.2
2.1
2.2
2.4 yes (proteins)
2.5 yes
3.1 yes
3.2 


"""


In [5]:
#34
test = nlp("Finally, we show that most young mammalian genes are preferentially expressed in testis, suggesting that sexual selection plays an important role in the emergence of new functional genes.")
visualise(test,100)
# this one is a bit interesting because getting to the verb via 3.3 is via two nmods but the flowchart indicates there should be an nsubj?
# no xcomp but there is a ccomp
"""
1.1
1.2
2.1
2.3 genes
2.5 yes (genes is x)
3.1 yes
3.2 ("plays" is finite verb - note that have to traverse two arrows).
It's not via dobj but clearly it should be. dobj goes to "role" instead
Maybe need to change this to whether the finite verb has a dobj.
Or maybe we can add nmod as I think it's being used as a clausal predicate
in a case like this.
(as written the flowchart would go via 3.3 but I'm going to go to 5.1)
5.1 unidentifiable meaning: effect unspecified (maybe we shouldn't call this unidentifiable
meaning---I think effect unspecified is sufficient)
"""


In [6]:
# 32
test = nlp("However, we still understand relatively little about how these genes come into being and which functions they are selected for.")
visualise(test,100)
"""
1.1
1.2
2.1
2.2
2.4 no
2.5 no (maybe we can go functions -> selected -> they?)


"""
# (note here that we haven't accounted for the case where we might end up with a reference to a noun, e.g. a pronoun like they which refers to a noun elsewhere in the sentence...the case I came across was nsubjpass which I had to match to the nsubj)


In [7]:
# 14
test = nlp("Addressing how novel genes emerged and acquired novel and adaptive functions is of fundamental importance.")
visualise(test)
"""
this is an interesting case, as it looks like I want this to go into the 2.1 box (but it currently goes to 2.2)
if it were 2.1
"""


In [8]:
#5
test = nlp("These results strongly indicate that many small proteins adopt three-dimensional structures and are fully functional, playing important roles in transcriptional regulation, cell signaling and metabolism.")
visualise(test)

In [9]:
test = nlp("A subsequent structure-based function annotation of small protein models exposes 178,745 putative protein-protein interactions with the remaining gene products in the mouse proteome, 1,100 potential binding sites for small organic molecules and 987 metal-binding signatures.")
visualise(test)

In [10]:
#5
test = nlp("These results strongly indicate that many small proteins adopt three-dimensional structures and are fully functional, playing important roles in transcriptional regulation, cell signaling and metabolism.")
visualise(test)
test = nlp("These results strongly indicate that many small proteins are robust and functional, playing important roles in transcriptional regulation, cell signaling and metabolism.")
visualise(test)
test = nlp("Small proteins are robust and functional, playing important roles in transcriptional regulation, cell signaling and metabolism.")
visualise(test)


test = nlp("Proteins adopt functional conformations")
visualise(test)
test = nlp("Proteins adopt conformations and are functional")
visualise(test)
test = nlp("Proteins adopt conformations that are functional")
visualise(test)

In [11]:
test = nlp("These results strongly indicate that many small proteins are robust and functional, playing important roles in transcriptional regulation, cell signaling and metabolism.")
visualise(test)
"""
1.1
1.2
2.1
2.3 (functional -> robust(are) -> proteins)...but note that you can't currently get here. This is a similar problem to the general dobj one
2.5 yes
3.1 
3.2
3.3 yes (but not via dobj); y could be "play important roles in transcriptional regulation, cell signaling and metabolism"
4.1 
will either be biological role or activity depending on whether one decides that /z/ is implicit
(I'd say biological role, as one can work out the containing system from each function if one looks elsewhere in the paper)
"""


In [12]:
test = nlp("Small proteins are robust and functional, playing important roles in transcriptional regulation within a cell.")
visualise(test)
test = nlp("Small proteins are robust and functional, regulating transcription within a cell")
visualise(test)
test = nlp("Proteins function to regulate transcription and this is done in a cell")
visualise(test)

In [13]:
test = nlp("Proteins adopt conformations and are functional")
visualise(test)
test = nlp("Proteins adopt conformations that are functional")
visualise(test)
test = nlp("Proteins adopt functional conformations")
visualise(test)
test = nlp("Proteins are functional if they adopt conformations")
visualise(test)
test = nlp("If proteins adopt conformations then they are functional")
visualise(test)

In [14]:
test = nlp("In this trial we consider renal performance and function.")
visualise(test)

In [15]:
#13
test = nlp("We propose the existence of several functional classes of smORFs, ranging from inert DNA sequences to transcribed and translated cis-regulators of translation and peptides with a propensity to function as regulators of membrane-associated proteins, or as components of ancient protein complexes in the cytoplasm.")
visualise(test)
"""
1.1
1.2
2.1
2.3 (classes)
2.5 no -> x is smORFs
3.1 
3.2

this is a very interesting case, as it's not really that clear which verb we "want" to get to from classes.
As currently written, I think we go classes -> existence -> propose and conclude that function is used in the
predicate form and thus we go to 5.1.
But I think what we want is to go classes -> ranging -> sequences (and so on)
So perhaps we give precedence to dependents of x (or the head of the nominal phrase with function) such as clausal modifiers (e.g. acl)?
This would be from tail to head (since we're looking for dependents of)
If that goes nowhere, then start again and follow from head to tail?

The bigger picture issue here is where we have function in the predicate (traditional sense) but in which the predicate contains x and another verb clause that is y
for example
(these examples below basically highlight how arbitrary the directions of arrows can be--there's simply no way to get a consistent approach here)
"""
test = nlp("We propose the existence of several functional components of proteins, such as binding molecules to the active site")
visualise(test)
test = nlp("We propose the existence of several components of functional proteins, such as binding molecules to the active site")
visualise(test)
test = nlp("We propose the existence of several components of functional proteins, including binding molecules to the active site")
visualise(test)

In [16]:
# 14
test = nlp("Addressing how novel genes emerged and acquired novel and adaptive functions is of fundamental importance.")
visualise(test)
test = nlp("Genes can acquire adaptive functions")
visualise(test)
"""
just going to try the simplified one

1.1
1.2
2.1
2.2
2.4 no
2.5 -> unidentifiable item

Problem here is that if function is dobj (or just in predicate?) then we
need a way of saying that x is nsubj (via finite verb).
I think this can be fixed by just adding this as a way to find x and going to 3.1
(if y can't be identified---which might be the case frequently if function is the dobj---it
should be captured by going through 3)
"""


In [17]:
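#16
# original sentence, kept for reference; it is overwritten by the simplified version below
test = nlp("To investigate sORF function, we undertook the first functional studies of sORFs in any system, using the model eukaryote Saccharomyces cerevisiae.")
# simplified version (this is the one that is visualised)
test = nlp("We investigate sORF function using the model eukaryote Saccharomyces cerevisiae.")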

visualise(test)

In [18]:
test = nlp("The clearest evidence for overprinting is provided when the original gene function is retained, as in overlapping genes.")
visualise(test)

In [19]:
test = nlp("It appears that genomes harbour many transcripts in a transition stage from nonfunctional to functional genes, also known as protogenes, which are exposed to evolutionary testing and can become fixed when they turn out to be useful.")
visualise(test)
test = nlp("It appears that genomes harbour many transcripts in a transition stage from nonfunctional and functional genes, also known as protogenes, which are exposed to evolutionary testing and can become fixed when they turn out to be useful.")

visualise(test)

In [20]:
test=nlp("We further show that this gene may evolve a highly conservative rice-specific function that contributes to the regulation difference between rice and other plant species in response to pathogen infections.")
visualise(test)

In [21]:
test=nlp("Here we performed a comparative analysis of 48 avian genomes to identify genomic features that are unique to songbirds, as well as an initial assessment of function by investigating their tissue distribution and predicted protein domain structure.")
visualise(test)

In [22]:
test = nlp("functionality of the gene")
visualise(test)
test = nlp("functional gene")
visualise(test)
test = nlp("gene function")
visualise(test)
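
The three phrases above are derivations of the same root. A minimal sketch of how surface forms might be normalised to "function" before comparing parses, assuming a simple regular expression is enough for the forms seen in this notebook. The name `normalise` and the suffix list are hypothetical and not part of `process_sentence`.

```python
import re

# Assumed pattern: optional "non"/"non-" prefix, the root, and one of a few suffixes.
_FUNCTION_FORM = re.compile(r"(non-?)?function(s|al|ality|ing|ed)?", re.IGNORECASE)


def normalise(word):
    """Return (root, suffix, negated) for a derivation of 'function', else None."""
    match = _FUNCTION_FORM.fullmatch(word)
    if match is None:
        return None
    prefix, suffix = match.groups()
    return ("function", (suffix or "").lower(), prefix is not None)


for word in ("functionality", "functional", "function", "nonfunctional", "gene"):
    print(word, "->", normalise(word))
```

The suffix is kept so that the part of speech of the original form is not lost, and the `negated` flag records a "non" prefix that would otherwise disappear during conversion.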

In [23]:
test = nlp("the gene was functioning")
visualise(test)

In [24]:
test = nlp("the functioning gene")
visualise(test)

In [25]:
test = nlp("Functioning is important for genes")
visualise(test)

In [26]:
test = nlp("the heart has multiple functions")
visualise(test)
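
This sentence is a case where "function" is the direct object, so x has to be found through the finite verb (the problem noted for sentence 14). A rough sketch of that lookup follows. `Tok` is a hypothetical stand-in exposing the `text`, `dep_`, `head` and `children` attributes that spaCy tokens also have, and the parse is written by hand as an assumption rather than taken from `en_core_sci_scibert`.

```python
class Tok:
    """Hypothetical stand-in with the spaCy-style attributes used below."""

    def __init__(self, text, dep_, head=None):
        self.text, self.dep_, self.head, self.children = text, dep_, head, []
        if head is not None:
            head.children.append(self)


def subject_via_verb(function_tok):
    """If 'function' is a dobj, return the subject of its governing verb, else None."""
    if function_tok.dep_ != "dobj":
        return None
    verb = function_tok.head
    for child in verb.children:
        if child.dep_ in ("nsubj", "nsubjpass"):
            return child
    return None


# Hand-written parse of "the heart has multiple functions" (an assumption).
has = Tok("has", "ROOT")
heart = Tok("heart", "nsubj", has)
the = Tok("the", "det", heart)
functions = Tok("functions", "dobj", has)
multiple = Tok("multiple", "amod", functions)

x = subject_via_verb(functions)
print(x.text if x else None)  # heart
```

Because the function only reads `dep_`, `head` and `children`, it should also accept a real spaCy token, although the labels a given model assigns may differ from the ones assumed here.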

In [27]:
test = nlp("the function of the heart is to love")
visualise(test)

In [28]:
test = nlp("the functionality of the gene is important")
visualise(test)

In [29]:
test=nlp("In general, these transcripts show little evidence of purifying selection, suggesting that many of them are not functional")
visualise(test)
"""
1.1
1.2
2.1
2.3
2.4 (yes, functional -> many -> them -> transcripts)
2.5 (transcripts)
3.1 yes
3.2 (this bit is not clear because of how bad 3.2 is, but
I think we can start at "transcripts" -> show -> evidence -> selection (purifying).
But what is y? Could we say that y is "undergo purifying selection"? I don't think so, which
means that this one should probably end up as effect unspecified, even if that is a bit unintuitive.
It might be talking about function in an evolutionary context, but this sentence does not say anything
about what the transcripts actually do to cause them to be under selection.)
Note that the problems with 3.2 still remain; I still think we should be able to get "transcripts" -> show -> evidence -> selection (purifying).
For example, consider the following:
"""
test=nlp("In general, these transcripts show evidence of translation into proteins, suggesting that many of them are functional")
visualise(test)
# We want transcripts -> show -> evidence -> translation -> proteins
# I think we can safely say here that x should be "transcripts" and y should be "translation into proteins",
# giving "transcripts function to be translated into proteins"


# I think we could get selected effects by editing the initial sentence to
test=nlp("In general, these transcripts show little evidence of purifying selection on account of being translated into proteins, suggesting that many of them are not functional")
visualise(test)
# now we can go transcripts -> show -> translated -> proteins
# so we have x and y and furthermore can talk about doing y in the past led to blah blah

In [30]:
# what if we had something like
test=nlp("Genes are transcribed, suggesting that they are functional")
visualise(test)
"""
1.1
1.2
2.1
2.3 (functional -> they -> Genes)
2.5 yes, Genes is x
3.1 yes
3.2 (unclear, if we started from they we just get are as finite,
which doesn't give us information (it's basically "the functional gene")
if we then go to "Genes" we get "are transcribed". I'll go with the latter)
3.3 no (well could be via advcl but that's not what I want)
3.4 no (wants to send to 5.1 but this seems like an issue with not being
able to deal with passive voice)
"""

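
One possible way around the passive-voice problem in 3.4 is to treat a passive subject as x and unpack into the form "the function of x is to be y-ed". The sketch below only illustrates that idea. `Tok` and `unpack_passive` are hypothetical names, and the parse of "Genes are transcribed" is written by hand rather than produced by the model.

```python
class Tok:
    """Hypothetical stand-in with the spaCy-style attributes used below."""

    def __init__(self, text, dep_, head=None):
        self.text, self.dep_, self.head, self.children = text, dep_, head, []
        if head is not None:
            head.children.append(self)


def unpack_passive(x):
    """Return 'the function of x is to be y-ed' if x is a passive subject, else None."""
    if x.dep_ != "nsubjpass":
        return None
    return f"the function of {x.text.lower()} is to be {x.head.text}"


# Hand-written parse of "Genes are transcribed" (an assumption).
transcribed = Tok("transcribed", "ROOT")
genes = Tok("Genes", "nsubjpass", transcribed)
are = Tok("are", "auxpass", transcribed)

print(unpack_passive(genes))  # the function of genes is to be transcribed
```

Whether this unpacking should count as identifying y is still an open question in these notes; the sketch only shows that the passive participle is reachable from x in one step.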

In [31]:
test=nlp("Functional genes are transcribed")
visualise(test)
test=nlp("Functional genes are never transcribed")
visualise(test)

In [32]:
# can I find an example similar to 
#test=nlp("In general, these transcripts show little evidence of purifying selection, suggesting that many of them are not functional")
# where I want to start from "them" for step 2.5 (or maybe I mean 3.2)?
test=nlp("The gene is not utilised, indicating that it is not functional because it is not transcribed.")
visualise(test)
# I'm not quite sure what I want to do with this example right now, but clearly we do want to start from "it", not "gene"
# (or if we start from "gene" we need to go through utilised -> indicating -> functional -> transcribed)
# If at all possible, it would be better to start from "gene", especially if you can consistently get from "gene" to "it"

In [33]:
test=nlp("The gene is not utilised, which means that it is not functional because it is not transcribed.")
visualise(test)

In [34]:
test=nlp("The gene is utilised and functional because it is transcribed.")
visualise(test)

In [35]:
# in footnote 9 of the flowchart, I say that if x is a dobj then we can't identify y
# that seems pretty reasonable, but I'm a bit thrown off by the nsubjpass case, as here x is acting a bit like an object
# can I construct a case that falsifies footnote 9?
test=nlp("Ribosomes aid the translation of functional transcripts")
visualise(test)
test=nlp("Ribosomes aid the translation of functional transcripts into proteins")
visualise(test)
"""
so here we would have x as transcripts, which is in the predicate (not a direct object, but for our purposes the same thing,
as we can go aid -> translation (via dobj) -> transcripts)
so, according to footnote 9, we shouldn't be able to identify y
but shouldn't translation (or translation into proteins for the second case) be y?
this basically gets at whether we allow the unpacking to have the form of
the function of x is to be y-ed
if so, we have "the function of transcripts is to be translated into proteins"
This seems completely right in this case
But of course, footnote 9 isn't dealing with anything in the predicate but specifically with dobj
this isn't a case of x being the dobj, so I might be wrongly attacking footnote 9.
Perhaps the bigger question here is whether something as specific as whether x has a dobj dependency should be in the flowchart
I tend to think it shouldn't
"""

In [36]:
eff = nlp("The transcript functions to regulate expression.")
role=nlp("The transcript functions to regulate expression in metabolism.")
adv=nlp("The transcript functions to regulate expression in metabolism by improving glycogen storage.")
from pathlib import Path
from spacy import displacy
# jupyter=False makes displacy.render return the SVG markup as a string
eff_img = displacy.render(eff, style="dep", jupyter=False, options={"compact": True, "distance": 120})
role_img = displacy.render(role, style="dep", jupyter=False, options={"compact": True, "distance": 100})
adv_img = displacy.render(adv, style="dep", jupyter=False, options={"compact": True, "distance": 75})
out_eff = Path("./eff.svg")
out_role = Path("./role.svg")
out_adv = Path("./adv.svg")
# write_text closes the file and returns the number of characters written
out_eff.write_text(eff_img, encoding="utf-8")
out_role.write_text(role_img, encoding="utf-8")
out_adv.write_text(adv_img, encoding="utf-8")

9669

In [37]:
sel_eff = nlp("Zebra stripes are functional as they evolved to reduce insect bites.")
from pathlib import Path
from spacy import displacy
sel_eff_img = displacy.render(sel_eff, style="dep", jupyter=False, options={"compact": True, "distance": 65})
out_sel_eff = Path("./sel_eff.svg")
out_sel_eff.write_text(sel_eff_img, encoding="utf-8")


9001

In [38]:
to_work = nlp("Liver function benefits human health.")
to_work_adv = nlp("Liver function benefits human health by aiding metabolism.")

from pathlib import Path
from spacy import displacy

to_work_img = displacy.render(to_work, style="dep", jupyter=False, options={"compact": True, "distance": 120})
out_to_work = Path("./to_work.svg")
out_to_work.write_text(to_work_img, encoding="utf-8")

to_work_adv_img = displacy.render(to_work_adv, style="dep", jupyter=False, options={"compact": True, "distance": 100})
out_to_work_adv = Path("./to_work_adv.svg")
out_to_work_adv.write_text(to_work_adv_img, encoding="utf-8")

6514

In [39]:
eff_unsp=nlp("The transcript shows evidence of purifying selection, suggesting that it is functional.")
from pathlib import Path
from spacy import displacy
eff_unsp_img = displacy.render(eff_unsp, style="dep", jupyter=False, options={"compact": True, "distance": 70})
out_eff_unsp = Path("./eff_unsp.svg")
out_eff_unsp.write_text(eff_unsp_img, encoding="utf-8")

9897

In [40]:
# might use this for the jupyter notebook handbook (but maybe not, was just copied from the main text where I've now deleted it)

test=nlp("In general, these transcripts show little evidence of purifying selection, suggesting many of them are not functional")

#effect unspecified (Question: what have these transcripts done to cause them to be selected? Answer: we don't know, it's not specified)

#change to

test=nlp("In general, these transcripts show evidence of translation into proteins, suggesting that many of them are functional")

#either activity or role

#change to

test=nlp("In general, these transcripts show little evidence of purifying selection on account of being translated into proteins, suggesting that many of them are not functional")

#now selected effects


In [41]:
# this is an interesting one that I found in the notes: how does one choose which path to follow from "functional"?
# I think the correct path is to "O2 leakage" (prefer the nsubj path?). I think this is complicated by how UD makes adjectives the root
# in cases like this, where the verb is a copula
test=nlp("Such O2 leakage may be functional for the plant as it can help in detoxifying Fe3+ forms and inducing nitrification")
visualise(test)
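
If the nsubj path is to be preferred, one option is to rank the dependents of "functional" by label and take the best one. The sketch below assumes a particular ordering (`PRIORITY`) purely for illustration. `Tok` is a hypothetical stand-in for a spaCy token and the parse is hand-written, so the labels may not match what the model actually returns.

```python
class Tok:
    """Hypothetical stand-in with the spaCy-style attributes used below."""

    def __init__(self, text, dep_, head=None):
        self.text, self.dep_, self.head, self.children = text, dep_, head, []
        if head is not None:
            head.children.append(self)


# Assumed ranking of dependency labels; earlier labels are preferred.
PRIORITY = ("nsubj", "nsubjpass", "nmod", "advcl")


def preferred_dependent(tok, priority=PRIORITY):
    """Return the child of tok whose label ranks highest in priority, else None."""
    candidates = [c for c in tok.children if c.dep_ in priority]
    if not candidates:
        return None
    return min(candidates, key=lambda c: priority.index(c.dep_))


# Hand-written fragment of "Such O2 leakage may be functional for the plant as it can help ..."
functional = Tok("functional", "ROOT")
plant = Tok("plant", "nmod", functional)
help_ = Tok("help", "advcl", functional)
leakage = Tok("leakage", "nsubj", functional)
be = Tok("be", "cop", functional)

print(preferred_dependent(functional).text)  # leakage
```

An explicit ranking keeps the choice independent of the order in which the parser lists the children, which matters here because the adjective is the root and several paths start from it.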

In [42]:
test=nlp("Such O2 leakage detoxifies Fe3+ forms and may be functional for the plant")
visualise(test)
# interesting again, can definitely get to O2 leakage via detoxifies (through conj) but there's a direct nmod link to plant

In [43]:
test=nlp("In human islets, coupling of melatonin receptors to Gi inhibition of adenylate cyclase is either absent or nonfunctional and the Gq coupling to PLC is the main pathway activated by melatonin")
visualise(test)

In [44]:
test=nlp("Gene A's expression affects how well the liver functions, namely its ability to break down metabolites, in the hepatic system.")
visualise(test)
"""
this is a contrived and badly written example that could be improved if used.
It highlights how what should arguably be function as To Work
("Gene A's expression affects how well the liver functions in the hepatic system." would definitely be To Work)
can become Biological Role (or maybe Biological Advantage, depending on whether one interprets "how well the liver functions"
as benefiting "the hepatic system"). A sentence like this is probably rare (it is contrived and clumsy, after all), but it's worth
highlighting that To Work gets superseded if we can identify an effect of ITEM
"""


In [45]:
test=nlp("It appears that genomes harbour many transcripts in a transition stage from nonfunctional to functional genes, also known as protogenes, which are exposed to evolutionary testing and can become fixed when they turn out to be useful.")
visualise(test)

In [46]:
test=nlp("At the final stage, clones conferring resistance to NiCl2 were found to carry identical functional mini-genes, which conferred significant nickel tolerance on the host cells.")
visualise(test)

In [47]:
test=nlp("A polymerase transcribes a functional gene")
visualise(test)
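
Here x ("gene") is itself the dobj, which is the situation footnote 9 is about, whereas in "Ribosomes aid the translation of functional transcripts" x only sits underneath a dobj. A small sketch of how the two cases could be told apart by walking from x up to the root. `Tok`, `path_to_root` and `dobj_on_path` are hypothetical names, and both parses are hand-written assumptions.

```python
class Tok:
    """Hypothetical stand-in with the spaCy-style attributes used below."""

    def __init__(self, text, dep_, head=None):
        self.text, self.dep_, self.head, self.children = text, dep_, head, []
        if head is not None:
            head.children.append(self)


def path_to_root(tok):
    """List (text, dep) pairs from tok up to the root of the tree."""
    path = []
    while True:
        path.append((tok.text, tok.dep_))
        # stand-in roots have head None; spaCy roots are their own head
        if tok.head is None or tok.head == tok:
            return path
        tok = tok.head


def dobj_on_path(tok):
    """True if tok, or any token above it, is attached as a direct object."""
    return any(dep == "dobj" for _, dep in path_to_root(tok))


# "A polymerase transcribes a functional gene": x is the dobj itself.
transcribes = Tok("transcribes", "ROOT")
polymerase = Tok("polymerase", "nsubj", transcribes)
gene = Tok("gene", "dobj", transcribes)

# "Ribosomes aid the translation of functional transcripts": x is below the dobj.
aid = Tok("aid", "ROOT")
translation = Tok("translation", "dobj", aid)
transcripts = Tok("transcripts", "nmod", translation)

print(path_to_root(gene))         # [('gene', 'dobj'), ('transcribes', 'ROOT')]
print(gene.dep_ == "dobj", dobj_on_path(gene))                # True True
print(transcripts.dep_ == "dobj", dobj_on_path(transcripts))  # False True
```

Checking the whole path rather than only `x.dep_` is one way to phrase the rule as "x is in the predicate" instead of "x has a dobj dependency", if that broader condition is what the flowchart should test.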

In [48]:
test = nlp("The transcript alters something and it acts to function as a modulator of expression")
visualise(test)

In [49]:
test = nlp("I am going to see Bob whose house is nice.")
visualise(test)

In [50]:
test = nlp("I saw the man whom you love.")
visualise(test)

In [52]:
test = nlp("These results strongly indicate that many small proteins are robust and functional, playing important roles in transcriptional regulation, cell signaling and metabolism.")
visualise(test)

In [62]:
test = nlp("The gene is transcribed and functional")
visualise(test)
test = nlp("The gene is robust and functional")
visualise(test)

In [61]:
test=nlp("The ribosome translated the protein, helping the functional protein to participate in cell regulation")
visualise(test)

In [63]:
test=nlp("The gene is not utilised, indicating that it is not functional because it is not transcribed.")
visualise(test)