# Worked-through examples

Here we analyse the 42 examples previously analysed by [Keeling et al. (2019)](https://elifesciences.org/articles/47014).
(You can download their raw data [here](https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvNDcwMTQvZWxpZmUtNDcwMTQtZmlnMS1kYXRhMS12MS5jc3Y-/elife-47014-fig1-data1-v1.csv?_hash=M5qXAOuZE6dp0KuTzDFmT1innVuxahm4IWKD2c4vBhY%3D))

It is important to note that the categories for senses of function (and the methodologies for classifying instances) differ substantially between the two studies. 
There is thus no direct way to compare the classifications of these examples (e.g. their Evolutionary Implication is not the same as our Selected Effect).
Nevertheless, we thought that it would be useful for us to analyse some real-world examples that have been previously analysed by others.
We developed our process on examples from a broad range of biological subfields and intend for it to apply generally to all fields of biology&mdash;it is thus useful to see how well it performs on a specific subfield (de novo gene analysis).

For these examples, we are erring on the liberal side of assigning *EFFECT*.
Since the examples are generally single sentences (and in some cases fragments of sentences), it would be difficult to identify *EFFECT* if we took a strict interpretation.
In short, we will assign an *EFFECT* when the text indicates that the *ITEM* plays a functional role in a pathway, even if it isn't explicit about what that role is.
We will make a note when our assignment of *EFFECT* is marginal.
See the handbook for some further discussion on this point.

In [91]:
import spacy
import scispacy
from process_sentence import visualise
nlp = spacy.load("en_core_sci_scibert")

Instance #1: "We review here the identification, possible origins, evolutionary trends, and **functions** of orphans with an emphasis on their role in plant biology."

ITEM: "orphans" (functions -> identification (via conj) -> orphans (via nmod))

EFFECT: "role \[in plant biology]" (orphans -> identification -> emphasis -> role)

SYSTEM: "plant"

No language associated with advantage to the system and thus **Biological Role**.

(This is a marginal assignment of *EFFECT* as we don't know what the specific role actually is.)

In [92]:
instance_1 = nlp("We review here the identification, possible origins, evolutionary trends, and functions of orphans with an emphasis on their role in plant biology.")
visualise(instance_1)
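
The dependency paths quoted for each instance (for example functions -> identification -> orphans above) can also be traced programmatically rather than read off the rendered tree. The following is a minimal sketch and not part of `process_sentence`: `ToyToken`, `ancestors_inclusive` and `dependency_path` are hypothetical names introduced here. It relies only on the `head`, `i` and `text` attributes that spaCy tokens expose, and on the fact that a spaCy root token is its own head. The toy tokens reproduce just the two arcs stated above (functions attached to identification via conj, orphans attached to identification via nmod); they are hand-built and are not real parser output.

```python
class ToyToken:
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""

    def __init__(self, i, text):
        self.i = i          # position of the token in the text
        self.text = text
        self.head = self    # like a spaCy root, a token starts as its own head


def ancestors_inclusive(token):
    """Return [token, its head, ..., root] by following .head upwards."""
    chain = [token]
    while chain[-1].head.i != chain[-1].i:  # the root is its own head
        chain.append(chain[-1].head)
    return chain


def dependency_path(start, end):
    """Return the tokens linking start to end, ignoring arc direction."""
    up = ancestors_inclusive(start)
    down = ancestors_inclusive(end)
    position_in_down = {tok.i: pos for pos, tok in enumerate(down)}
    for pos, tok in enumerate(up):
        if tok.i in position_in_down:
            # tok is the lowest common ancestor: climb to it, then descend.
            descent = down[:position_in_down[tok.i]]
            return up[:pos + 1] + descent[::-1]
    return []  # no shared ancestor, e.g. tokens from different sentences


# Hand-built arcs taken from the path quoted for instance #1.
identification = ToyToken(0, "identification")
functions = ToyToken(1, "functions")
orphans = ToyToken(2, "orphans")
functions.head = identification  # conj
orphans.head = identification    # nmod

path = dependency_path(functions, orphans)
print(" -> ".join(tok.text for tok in path))
```

With the notebook's pipeline loaded, the same function should accept two real tokens picked from a parsed sentence such as `instance_1`. The path it returns depends entirely on the parse the model produces, so it is a convenience for checking a path and not a substitute for reading the tree.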

Instance #2: "a considerable number of small proteins whose **functions** are yet to be characterized"

ITEM: "small proteins" (functions -> whose \[which is a referent to] proteins -> small) (you can also go functions -> yet -> characterized -> number -> proteins -> small).

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

Note that we assign "small proteins" rather than just "proteins" to *ITEM* because the following examples (which are from the same abstract) also use small proteins, indicating that the authors are using small proteins in a technical sense.

In [93]:
instance_2 = nlp("a considerable number of small proteins whose functions are yet to be characterized")
visualise(instance_2)

Instance #3: "we apply a collection of structural bioinformatics algorithms to infer molecular **function** of putative small proteins"

ITEM: "small proteins" (function -> proteins -> small)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [94]:
instance_3 = nlp("we apply a collection of structural bioinformatics algorithms to infer molecular function of putative small proteins")
visualise(instance_3)

Instance #4: "A subsequent structure-based **function** annotation of small protein models exposes 178,745 putative protein-protein interactions with the remaining gene products in the mouse proteome, 1,100 potential binding sites for small organic molecules and 987 metal-binding signatures."

ITEM: "small protein" (function -> annotation -> models -> protein -> small) 

EFFECT: NONE

This example is complicated enough that it's easy to get lost following dependency paths.
In such cases, it may be easier to start off by asking "are there any effects of protein models mentioned?", and if so, checking whether the dependencies can be established. The closest we get in this sentence is "protein-protein interactions" but this is far too generic to qualify as an effect.

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

(The closest we can get to an *EFFECT* in this sentence is "protein-protein interactions" or "binding sites for small organic molecules" or "metal-binding signatures". But this study appears to be broadly identifying potentially interesting interactions (of which perhaps a subset of interactions might be considered functional by this study's definition of function). In any case, it is much too broad of a characterisation to identify a functional *EFFECT* of "small proteins".)

In [95]:
instance_4 = nlp("A subsequent structure-based function annotation of small protein models exposes 178,745 putative protein-protein interactions with the remaining gene products in the mouse proteome, 1,100 potential binding sites for small organic molecules and 987 metal-binding signatures.")
visualise(instance_4)

Instance #5: "These results strongly indicate that many small proteins adopt three-dimensional structures and are fully **functional**, playing important roles in transcriptional regulation, cell signaling and metabolism"

ITEM: "small proteins" (functional -> adopt -> proteins -> small)

EFFECT: (i) "adopt three-dimensional structures" (ii) "playing important roles in transcriptional regulation" (iii) "playing important roles in cell signalling" (iv) "playing important roles in metabolism"

We'll analyse each of these *EFFECT*s as a separate instance.

(i) Biological Effect (No *SYSTEM*&mdash;in which system do the small proteins adopt three-dimensional structures?)

(ii) Biological Role (*SYSTEM* = transcriptional regulation)

(iii) Biological Role (*SYSTEM* = cell signalling)

(iv) Biological Role (*SYSTEM* = metabolism)

(*EFFECT*s ii&ndash;iv are marginal, as the text does not mention the specific roles.)

In [96]:
instance_5 = nlp("These results strongly indicate that many small proteins adopt three-dimensional structures and are fully functional, playing important roles in transcriptional regulation, cell signaling and metabolism")
visualise(instance_5)

Instance #6: "support future studies oriented on elucidating the **functions** of hypothetical small proteins"

ITEM: "small proteins" (functions -> proteins -> small)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [97]:
instance_6 = nlp("support future studies oriented on elucidating the functions of hypothetical small proteins")
visualise(instance_6)

Instance #7: "The **functional** protein-coding feature of the BSC4 gene in S. cerevisiae is supported by population genetics, expression, proteomics, and synthetic lethal data."

ITEM: "BSC4 gene" (functional -> feature -> gene -> BSC4)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

(The text "functional protein-coding" indicates that the function has something to do with the BSC4 gene being transcribed and translated into a protein, which plays a functional role, but we are not given any clues as to what this role might be. It would be dubious to unpack this as "the function of the BSC4 gene is to produce a protein".)

In [98]:
instance_7 = nlp("The functional protein-coding feature of the BSC4 gene in S. cerevisiae is supported by population genetics, expression, proteomics, and synthetic lethal data.")
visualise(instance_7)

Instance #8: "translation of sequences devoid of genes, or ‘non-genic’ sequences, is expected to produce insignificant poly-peptides rather than proteins with specific biological **functions**."

ITEM: "proteins" (functions -> proteins)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [99]:
instance_8 = nlp("translation of sequences devoid of genes, or ‘non-genic’ sequences, is expected to produce insignificant poly-peptides rather than proteins with specific biological functions.")
visualise(instance_8)

Instance #9: "Here we formalize an evolutionary model according to which **functional** genes evolve de novo through transitory proto-genes generated by widespread translational activity in non-genic sequences."

ITEM: "genes" (genes -> functional)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [100]:
instance_9 = nlp("Here we formalize an evolutionary model according to which functional genes evolve de novo through transitory proto-genes generated by widespread translational activity in non-genic sequences.")
visualise(instance_9)

Instance #10: "translation, which produces peptides of mostly unknown **function**"

ITEM: "peptides" (function -> peptides)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**


In [101]:
instance_10 = nlp("translation, which produces peptides of mostly unknown function")
visualise(instance_10)

Instance #11: "millions of smORFs, some of which fulfil key physiological **functions**"

ITEM: "smORFs" (functions -> fulfil -> smORFs)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [102]:
instance_11 = nlp("millions of smORFs, some of which fulfil key physiological functions")
visualise(instance_11)

Instance #12: "We propose the existence of several **functional** classes of smORFs, ranging from inert DNA sequences to transcribed and translated cis-regulators of translation and peptides with a propensity to function as regulators of membrane-associated proteins, or as components of ancient protein complexes in the cytoplasm."

ITEM: "smORFs" (functional -> classes -> smORFs)

EFFECTs:

1. "to be an inert DNA sequence" (smORFs -> classes -> ranging -> sequences)

2. "to be a cis-regulator \[of translation]" (smORFs -> classes -> ranging -> sequences -> cis-regulators)

3. "function as regulators \[of membrane-associated proteins]" (smORFs -> classes -> ranging -> propensity -> function -> regulators -> proteins)

4. "function as components \[of ancient protein complexes in the cytoplasm]" (smORFs -> classes -> ranging -> propensity -> function -> regulators -> components)

Evaluating each *EFFECT* separately gives:

1. Biological Effect (no *SYSTEM* mentioned)

2. Biological Role (*SYSTEM* = translation)

3. Biological Role (*SYSTEM* = membrane-associated proteins)

4. Biological Role (*SYSTEM* = ancient protein complexes in the cytoplasm)

(Assigning an *EFFECT* in the case of 4 is marginal, as "function as components of" is similar to something like "plays a functional role in".)

(Note that the additional use of function as a verb&mdash;which will be analysed in the next instance&mdash;is somewhat incidental. If "function" were instead "act" we would get the same outcome.)

In [103]:
instance_12 = nlp("We propose the existence of several functional classes of smORFs, ranging from inert DNA sequences to transcribed and translated cis-regulators of translation and peptides with a propensity to function as regulators of membrane-associated proteins, or as components of ancient protein complexes in the cytoplasm.")
visualise(instance_12)

Instance #13: "We propose the existence of several functional classes of smORFs, ranging from inert DNA sequences to transcribed and translated cis-regulators of translation and peptides with a propensity to **function** as regulators of membrane-associated proteins, or as components of ancient protein complexes in the cytoplasm."

This sentence&mdash;which is the same as instance #12 above&mdash;has a strange parsing for the part of the sentence around "...and peptides with a propensity to function as regulators...". There should be a dependency between "propensity" and "peptides" as "with a propensity" is a prepositional phrase that modifies "peptides" (at least by my reading of the sentence). I think that the issue here is the phrasing of the sentence as "ranging from...to...and..." when it should be "ranging from...to...to..." (if you were to compare three things using this construction&mdash;say low, medium, and high&mdash;it makes more sense to say "ranging from low to medium to high" than "ranging from low to medium and high"). I have thus changed the sentence above to:

Modified instance #13: "We propose the existence of several functional classes of smORFs, ranging from inert DNA sequences to transcribed and translated cis-regulators of translation *to* peptides with a propensity to **function** as regulators of membrane-associated proteins, or as components of ancient protein complexes in the cytoplasm." (with change in italics)

ITEM: "peptides" (function -> propensity -> peptides) (note that "smORFs" produce "peptides" so these are closely connected)

EFFECTs:

1. "function as regulators [of membrane-associated proteins]" (peptides -> propensity -> function -> regulators -> proteins)

2. "function as components [of ancient protein complexes in the cytoplasm]" (peptides -> propensity -> function -> regulators -> components -> complexes)

Evaluating each *EFFECT* separately gives:

1. Biological Role (*SYSTEM* = membrane-associated proteins)

2. Biological Role (*SYSTEM* = ancient protein complexes in the cytoplasm)

(Assigning an *EFFECT* in the case of 2 is marginal, as "function as components of" is similar to something like "plays a functional role in".)

In [104]:
# slightly modified version of instance #13 in Keeling et al. (2019) (replaced "and peptides" with "to peptides")
modified_instance_13 = nlp("We propose the existence of several functional classes of smORFs, ranging from inert DNA sequences to transcribed and translated cis-regulators of translation to peptides with a propensity to function as regulators of membrane-associated proteins, or as components of ancient protein complexes in the cytoplasm.")
visualise(modified_instance_13)
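
To check what a one-word edit like this changes in the parse, the two dependency trees can be compared token by token instead of by eye. The following is a rough sketch under stated assumptions: `head_differences` and `toy_doc` are hypothetical helpers introduced here, and the comparison assumes both versions of the sentence tokenise to the same number of tokens (which should hold when a single word is swapped for another single word). It uses only `len()`, iteration, and the `i`, `text` and `head` attributes of tokens. The toy data is deliberately artificial and says nothing about how the model parses the real sentence.

```python
from types import SimpleNamespace


def head_differences(doc_a, doc_b):
    """Return (position, word, head_in_a, head_in_b) where attachments differ."""
    if len(doc_a) != len(doc_b):
        raise ValueError("both parses must contain the same number of tokens")
    differences = []
    for tok_a, tok_b in zip(doc_a, doc_b):
        # Compare head positions, so repeated words cannot mask a change.
        if tok_a.head.i != tok_b.head.i:
            differences.append(
                (tok_b.i, tok_b.text, tok_a.head.text, tok_b.head.text)
            )
    return differences


def toy_doc(words, head_positions):
    """Hypothetical helper: build token-like objects from head positions."""
    tokens = [SimpleNamespace(i=i, text=w, head=None) for i, w in enumerate(words)]
    for tok, head_position in zip(tokens, head_positions):
        tok.head = tokens[head_position]
    return tokens


# Artificial three-word "parses": only the attachment of "z" differs.
parse_a = toy_doc(["x", "y", "z"], [0, 0, 0])
parse_b = toy_doc(["x", "y", "z"], [0, 0, 1])
print(head_differences(parse_a, parse_b))
```

In the notebook, the natural call would be `head_differences(instance_12, modified_instance_13)`, since `instance_12` holds the unmodified sentence. Whether "propensity" appears in the result, and with which heads, depends on the model's output.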

Instance #14: "Addressing how novel genes emerged and acquired novel and adaptive **functions** is of fundamental importance."

ITEM: "genes" (or perhaps "novel genes") (functions -> acquired -> emerged -> genes)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [105]:
instance_14 = nlp("Addressing how novel genes emerged and acquired novel and adaptive functions is of fundamental importance.")
visualise(instance_14)  

Instance #15: "in our understanding of the molecular mechanisms and genome-wide patterns of new gene origination and new gene **functions**."

ITEM: "gene" (functions -> gene)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [106]:
instance_15 = nlp("in our understanding of the molecular mechanisms and genome-wide patterns of new gene origination and new gene functions.")
visualise(instance_15)

Instance #16: "To investigate sORF function, we undertook the first **functional** studies of sORFs in any system, using the model eukaryote Saccharomyces cerevisiae."

ITEM: "sORFs" (functional -> studies -> sORFs)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [107]:
instance_16 = nlp("To investigate sORF function, we undertook the first functional studies of sORFs in any system, using the model eukaryote Saccharomyces cerevisiae.")
visualise(instance_16)

Instance #17: "To investigate sORF **function**, we constructed a collection of gene-deletion mutants of 140 newly identified sORFs"

ITEM: "sORF" (function -> sORF)

EFFECT: NONE

It's unclear whether this is used in the sense of To Work (I tend to think not but there isn't much to work with here). Leaning on the side of caution gives **Description Incomplete: EFFECT Unspecified**

In [108]:
instance_17 = nlp("To investigate sORF function, we constructed a collection of gene-deletion mutants of 140 newly identified sORFs")
visualise(instance_17)

Instance #18: "We provide a collection of sORF deletion strains that can be integrated into the existing deletion collection as a resource for the yeast community for elucidating **gene function**."

ITEM: "gene" (function -> gene)

EFFECT: NONE

As above it's unclear whether this is used in the sense of To Work but leaning on the side of caution gives **Description Incomplete: EFFECT Unspecified**

In [109]:
instance_18 = nlp("We provide a collection of sORF deletion strains that can be integrated into the existing deletion collection as a resource for the yeast community for elucidating gene function.")
visualise(instance_18)

Instance #19: "Moreover, our analyses of the S. cerevisiae sORFs establish that sORFs are conserved across eukaryotes and have important biological **functions**."

ITEM: "sORFs" (functions -> have -> conserved -> sORFs)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

(There is talk of historical selection ("conserved across eukaryotes") but without *EFFECT* it cannot be Selected Effect.)

In [110]:
instance_19 = nlp("Moreover, our analyses of the S. cerevisiae sORFs establish that sORFs are conserved across eukaryotes and have important biological functions.")
visualise(instance_19)

Instance #20: "The clearest evidence for overprinting is provided when the original gene **function** is retained, as in overlapping genes."

ITEM: "gene" (function -> gene)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [111]:
instance_20 = nlp("The clearest evidence for overprinting is provided when the original gene function is retained, as in overlapping genes.")
visualise(instance_20)

Instance #21: "Therefore, MDF1 **functions** in two important molecular pathways, mating and fermentation, and mediates the crosstalk between reproduction and vegetative growth."

ITEM: "MDF1" (functions -> MDF1)

EFFECT: "mediates the crosstalk between reproduction and vegetative growth" (MDF1 -> functions -> mediates)

SYSTEM: "mating and fermentation"

No language associated with advantage to the system and thus **Biological Role**.

In [112]:
instance_21 = nlp("Therefore, MDF1 functions in two important molecular pathways, mating and fermentation, and mediates the crosstalk between reproduction and vegetative growth.")
visualise(instance_21)

Instance #22: "How non-coding DNA gives rise to new protein-coding genes (de novo genes) is not well understood. Recent work has revealed the origins and **functions** of a few de novo genes, but common principles governing the evolution or biological roles of these genes are unknown."

ITEM: "de novo genes" (functions -> origins -> genes)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [113]:
instance_22 = nlp("How non-coding DNA gives rise to new protein-coding genes (de novo genes) is not well understood. Recent work has revealed the origins and functions of a few de novo genes, but common principles governing the evolution or biological roles of these genes are unknown.")
visualise(instance_22)

Instance #23: "To better define these principles, we performed a parallel analysis of the evolution and **function** of six putatively protein-coding de novo genes described in Drosophila melanogaster."

ITEM: "protein-coding de novo genes" (function -> evolution -> genes)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [114]:
instance_23 = nlp("To better define these principles, we performed a parallel analysis of the evolution and function of six putatively protein-coding de novo genes described in Drosophila melanogaster.")
visualise(instance_23)

Instance #24: "Despite the fact that these genes are transcribed at a higher level in males than females, and are most strongly expressed in testes, RNAi experiments show that most of these genes are essential in both sexes during metamorphosis. This lethality suggests that protein coding de novo genes in Drosophila quickly become **functionally** important."

Note that here we are given two sentences. After identifying the *ITEM* in the second sentence, which uses "functionally", we analyse the first sentence to try to identify an *EFFECT* (no *EFFECT* is identified in the second sentence). To do this, we can start at "these genes" (from the text we can determine that "these genes" and our *ITEM* "protein coding de novo genes" refer to the same thing, so we are still starting from *ITEM* but now in a different sentence).

ITEM: "protein coding de novo genes" (functionally -> important -> become -> genes)

EFFECT: "play a role \[in metamorphosis]"

SYSTEM: "metamorphosis"

There is a benefit to a *SYSTEM* here (to the *SUPERSYSTEM* of the fly as an individual), as clearly indicated by "these genes are essential in both sexes during metamorphosis" and "this lethality suggests...".

No language associated with historical selection and thus **Biological Advantage**.

(Note that this is a marginal assignment of *EFFECT*&mdash;we know it plays an important role in the metamorphosis pathway but the text does not tell us what this role is.)

In [115]:
instance_24 = nlp("Despite the fact that these genes are transcribed at a higher level in males than females, and are most strongly expressed in testes, RNAi experiments show that most of these genes are essential in both sexes during metamorphosis. This lethality suggests that protein coding de novo genes in Drosophila quickly become functionally important.")
visualise(instance_24)

Instance #25: "Some of these newly expressed genes may acquire coding or noncoding **functions** and be preserved by natural selection."

ITEM: "genes" (functions -> acquire -> some -> genes)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

(There is talk of historical selection ("preserved by natural selection") but without *EFFECT* it cannot be Selected Effect.)

In [116]:
instance_25 = nlp("Some of these newly expressed genes may acquire coding or noncoding functions and be preserved by natural selection.")
visualise(instance_25)

Instance #26: "In general, these transcripts show little evidence of purifying selection, suggesting that many of them are not **functional**."

(Note that we analysed a simplified form of this sentence in the manuscript (Figure 3C).)

ITEM: "transcripts" (functional -> many -> them \[which is a referent to] transcripts)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

(There is talk of historical selection ("little evidence of purifying selection") but without *EFFECT* it cannot be Selected Effect.)

In [117]:
instance_26 = nlp("In general, these transcripts show little evidence of purifying selection, suggesting that many of them are not functional.")
visualise(instance_26)

Instance #27: "A plasmid-borne randomized mini-gene library expressing a population of combinatorial 20-mer peptides with no bias toward any biological **function** was used as an initial source of genetic variance during stress-driven evolution of Escherichia coli."

ITEM: "peptides" (function -> bias -> population -> peptides)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [118]:
instance_27 = nlp("A plasmid-borne randomized mini-gene library expressing a population of combinatorial 20-mer peptides with no bias toward any biological function was used as an initial source of genetic variance during stress-driven evolution of Escherichia coli.")
visualise(instance_27)

Instance #28: "At the final stage, clones conferring resistance to NiCl2 were found to carry identical **functional** mini-genes, which conferred significant nickel tolerance on the host cells."

ITEM: "mini-genes" (functional -> mini-genes)

EFFECT: "confer nickel tolerance" (mini-genes -> conferred -> tolerance)

SYSTEM: "host cells" (note that since this instance appears to be from the same abstract as #27, "host cells" presumably refers to E. coli)

There is no language associated with historical selection.

It is less clear-cut as to whether this text identifies an advantage to a *SYSTEM* or *SUPERSYSTEM*. It isn't completely explicit, but in the context of a resistance experiment (which we can infer from this sentence and from #27), "conferring resistance to NiCl2" is a clear benefit to the "host cell" (as without resistance the cell will die).

With the above in mind, this is a case of **Biological Advantage**.

In [119]:
instance_28 = nlp("At the final stage, clones conferring resistance to NiCl2 were found to carry identical functional mini-genes, which conferred significant nickel tolerance on the host cells.")
visualise(instance_28)
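
The classifications made so far follow a fairly regular order of questions, which can be summarised as a small decision function. This is only a sketch of the order as it can be read off the examples in this section, and the handbook remains the reference. The function name `classify` and its parameters are hypothetical, every argument is a judgement made by the annotator (nothing here is detected automatically), and the `Selected Effect` branch is an assumption because no example in this section reaches it.

```python
def classify(effect=None, system=None, advantage=False,
             historical_selection=False, to_work=False):
    """Map an annotator's judgements to the category labels used here."""
    if effect is None:
        # Without an EFFECT, talk of selection or advantage cannot help:
        # only the To Work sense or an incomplete description remain.
        if to_work:
            return "To Work"
        return "Description Incomplete: EFFECT Unspecified"
    if system is None:
        return "Biological Effect"
    if historical_selection:
        # Assumed branch: not reached by any example in this section.
        return "Selected Effect"
    if advantage:
        return "Biological Advantage"
    return "Biological Role"


# Judgements copied from the worked examples above.
print(classify(effect="adopt three-dimensional structures"))        # instance #5 (i)
print(classify(effect="mediates the crosstalk",
               system="mating and fermentation"))                   # instance #21
print(classify(effect="confer nickel tolerance",
               system="host cells", advantage=True))                # instance #28
print(classify(effect=None, historical_selection=True))             # instance #19
```

Writing the order down this way makes one point from the examples explicit: the check for an *EFFECT* comes first, which is why sentences that mention selection but give no *EFFECT* still end up as Description Incomplete.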

Instance #29: "Collectively, our results show that NCYM is the first de novo evolved protein known to act as an oncopromoting factor in human cancer, and suggest that de novo evolved proteins may **functionally** characterize human disease."

ITEM: "proteins" or "de novo evolved proteins" (functionally -> characterize -> proteins)

EFFECT: "act as an oncopromoting factor" (protein -> known -> act -> oncopromoting)

SYSTEM: "human cancer"

This is an interesting case. In general, the use of function in the sense of "this factor functions in disease progression" does not fit into our framework[<sup>1</sup>](#fn1) because there is no equivalence between "this factor functions in disease progression" and "the function of this factor is disease progression".

In this particular case, however, it works okay. It is difficult to give a normative basis for de novo proteins since they are newly arisen (this is a sort of real-life version of the "swampman" objection to the normativity of Selected Effect (e.g. [Neander 1996](https://doi.org/10.1111/j.1468-0017.1996.tb00036.x))). If one takes a causal role view of function (i.e. an item's function is what that item does to contribute to a complex system's capacity), then unpacking as "the function of de novo evolved proteins is to act as an oncopromoting factor in human cancer" seems reasonable.

With this in mind, we can classify this as **Biological Advantage** with the proviso that many statements of this nature (function as a role in disease) will not fit into our framework.

<span id="fn1">1:</span> This use of function does show up in the medical literature, but given our focus on biology more generally, we chose not to include it in our framework.

In [120]:
instance_29 = nlp("Collectively, our results show that NCYM is the first de novo evolved protein known to act as an oncopromoting factor in human cancer, and suggest that de novo evolved proteins may functionally characterize human disease.")
visualise(instance_29)

Instance #30: "It appears that genomes harbour many transcripts in a transition stage from nonfunctional to **functional** genes, also known as protogenes, which are exposed to evolutionary testing and can become fixed when they turn out to be useful. Orphan genes may have played key roles in generating lineage-specific adaptations"

ITEM: "genes" (functional -> nonfunctional -> genes)

EFFECT: NONE

Doesn't seem to be used in the sense of To Work. Substitution with "working" makes sense but it seems like "nonfunctional" and "functional" here mean "doesn't have a function/doesn't do anything" and "has a function/does something", respectively. Function can only be used in the sense of To Work if there's some normative basis. Thus this seems to be **Description Incomplete: EFFECT Unspecified**

In [121]:
instance_30 = nlp("It appears that genomes harbour many transcripts in a transition stage from nonfunctional to functional genes, also known as protogenes, which are exposed to evolutionary testing and can become fixed when they turn out to be useful. Orphan genes may have played key roles in generating lineage-specific adaptations" )
visualise(instance_30)

Instance #31: "Their existence suggests that **functional** ribonucleic acids (RNAs) and proteins can relatively easily arise out of random nucleotide sequences."

ITEM: "ribonucleic acids" (functional -> acids -> ribonucleic)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [122]:
instance_31 = nlp("Their existence suggests that functional ribonucleic acids (RNAs) and proteins can relatively easily arise out of random nucleotide sequences.")
visualise(instance_31)

Instance #32: "However, we still understand relatively little about how these genes come into being and which **functions** they are selected for."

ITEM: "genes" (functions -> come -> genes)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

(There is talk of historical selection ("selected for") but without *EFFECT* it cannot be Selected Effect.)

In [123]:
instance_32 = nlp("However, we still understand relatively little about how these genes come into being and which functions they are selected for.")
visualise(instance_32)

Instance #33: "In contrast, the **functions** of proteins which have a more recent origin remain largely unknown, despite the fact that these proteins also have extensive proteomics support."

ITEM: "proteins" (functions -> proteins)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [124]:
instance_33 = nlp("In contrast, the functions of proteins which have a more recent origin remain largely unknown, despite the fact that these proteins also have extensive proteomics support.")
visualise(instance_33)

Instance #34: "Finally, we show that most young mammalian genes are preferentially expressed in testis, suggesting that sexual selection plays an important role in the emergence of new **functional** genes."

ITEM: "genes" (functional -> genes)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

(There is talk of historical selection ("sexual selection") but without *EFFECT* it cannot be Selected Effect.)

In [125]:
instance_34 = nlp("Finally, we show that most young mammalian genes are preferentially expressed in testis, suggesting that sexual selection plays an important role in the emergence of new functional genes.")
visualise(instance_34)

Instance #35: "Orphans are an enigmatic portion of the genome since their origin and **function** are mostly unknown and they typically make up 10 to 30% of all genes in a genome."

ITEM: "orphans" (function -> origin -> their \[which refers to] orphans)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [126]:
instance_35 = nlp("Orphans are an enigmatic portion of the genome since their origin and function are mostly unknown and they typically make up 10 to 30% of all genes in a genome.")
visualise(instance_35)

Instance #36: "However, we show that a new defense gene in plants may evolve by de novo origination, resulting in sophisticated disease-resistant **functions** in rice."

This one is a bit tricky.

ITEM: "defense gene" (functions -> resulting -> evolve -> gene)

EFFECT: "role in resisting disease" (functions -> disease-resistant)

SYSTEM: immune system of "rice"

There is language associated with benefits to the *SYSTEM* ("disease-resistant functions in rice"&mdash;we can assume that resisting disease is a benefit). Thus, this is **Biological Advantage**.

(This is a marginal assignment for *EFFECT* as we are not told specifically what the "gene" does that causes disease resistance.)

It is not entirely clear, however, that the above is the only (or even necessarily correct) way to interpret the sentence. There is a direct dependency between "functions" and "rice", and thus at first glance "rice" seems to be the *ITEM*. But this would imply that the function of "rice" is to do something, which does not make sense in the context of the sentence. Rather, "rice" has functions on account of an *ITEM* ("defense gene") doing something within "rice's" immune system. It thus seems more accurate to view "functions in rice" as capacities of rice produced on account of the *ITEM*'s *EFFECT*. 

But consider if the sentence were instead something like "the gene regulates expression, aiding in liver function(ing)". In this case, "liver" would be the *ITEM*, there would be no *EFFECT*, and function would be used in the sense of To Work. In this modified case, the fact that the word function is used is somewhat incidental (at least with respect to the senses of Role, Advantage, etc.) and doesn't relate to the fact that the "gene regulates expression" (consider that the sentence could just as easily have been "the gene regulates expression, aiding in the working of the liver").

In [127]:
instance_36 = nlp("However, we show that a new defense gene in plants may evolve by de novo origination, resulting in sophisticated disease-resistant functions in rice.")
visualise(instance_36)

Instance #37: "We further show that this gene may evolve a highly conservative rice-specific **function** that contributes to the regulation difference between rice and other plant species in response to pathogen infections."

ITEM: "gene" (function -> evolve -> gene)

EFFECT: "to regulate \[the immune system]" (gene -> evolve -> function -> contributes -> difference -> regulation) (note the conversion from the noun "regulation" to the verb "to regulate")

There is mention of historical selection ("evolve a highly conservative rice-specific function"), and thus this is **Selected Effect**.

Some further notes (not relevant here but for the sake of interest):

SYSTEM: immune system (inferred from "pathogen infections")

There is no mention of a benefit to the *SYSTEM* ("pathogen infections" implies a normative dimension (i.e. pathogen infections harm organisms, causing disease), but "regulation difference" does not tell us whether the function benefits the immune system and decreases the severity of pathogen infections).

In [128]:
instance_37 = nlp("We further show that this gene may evolve a highly conservative rice-specific function that contributes to the regulation difference between rice and other plant species in response to pathogen infections.")
visualise(instance_37)

Instance #38: "This enhanced disease resistance was accompanied by increased accumulation of endogenous salicylic acid (SA) and suppressed accumulation of endogenous jasmonic acid (JA) as well as modified expression of a subset of defense-responsive genes **functioning** both upstream and downstream of SA and JA."

ITEM: "genes" or "defense-responsive genes" (functioning -> subset -> genes)

EFFECT: "to be expressed" (genes -> subset -> expression) (note the word conversion from the noun "expression")

SYSTEM: unspecified system containing SA and JA

There is language associated with benefits to a *SUPERSYSTEM* ("enhanced disease resistance" presumably of some organism), and thus this is **Biological Advantage**.

(This is a marginal assignment of *EFFECT* as "expression" gives little information about what "defense-responsive genes" do specifically. Note that if we took a stricter view on assigning *EFFECT* then this would be classified as To Work, as substituting "functioning" with "working" retains meaning. Arguably, To Work is the best interpretation of this sentence regardless, but in our framework, To Work gets superseded by the other categories if *EFFECT* is specified.)

In [129]:
instance_38 = nlp("This enhanced disease resistance was accompanied by increased accumulation of endogenous salicylic acid (SA) and suppressed accumulation of endogenous jasmonic acid (JA) as well as modified expression of a subset of defense-responsive genes functioning both upstream and downstream of SA and JA.")
visualise(instance_38)
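
The way the categories supersede one another in these examples can be summarised as a small decision function. This is a sketch of the ordering as it appears in the instances on this page, not a restatement of the handbook. The boolean flags are judgements made by the annotator, and placing Selected Effect ahead of Biological Advantage is our reading of Instance #37, where the benefit question is treated as not relevant once historical selection is found. The `classify` name and its parameters are hypothetical.

```python
def classify(effect, to_work=False, historical_selection=False, benefit=False):
    """Return the sense of 'function' for one annotated instance.

    effect: the EFFECT text, or None when no EFFECT could be assigned.
    to_work: True if substituting 'work' for 'function' retains meaning.
    historical_selection: True if the text mentions historical selection.
    benefit: True if the text mentions a benefit to the SYSTEM or SUPERSYSTEM.
    """
    if effect is None:
        # Without an EFFECT, only To Work or an incomplete description remain.
        if to_work:
            return "To Work"
        return "Description Incomplete: EFFECT Unspecified"
    # Once an EFFECT is specified, To Work is superseded by the other categories.
    if historical_selection:
        return "Selected Effect"
    if benefit:
        return "Biological Advantage"
    return "Biological Role"


# Instance #34: selection is mentioned, but there is no EFFECT.
print(classify(None, historical_selection=True))
# Instance #38: EFFECT assigned liberally, with a benefit to a SUPERSYSTEM.
print(classify("to be expressed", to_work=True, benefit=True))
# Instance #38 under a stricter view of EFFECT.
print(classify(None, to_work=True))
```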

Instance #39: "Here, we unveil genome-wide patterns for the mutational mechanisms leading to new genes and their subsequent lineage-specific evolution at different time nodes in the Drosophila melanogaster species subgroup. We find that (1) tandem gene duplication has generated ∼80% of the nascent duplicates that are limited to single species (D. melanogaster or Drosophila yakuba); (2) the most abundant new genes shared by multiple species (44.1%) are dispersed duplicates, and are more likely to be retained and be **functional**."

There are some issues with spaCy's parsing of this sentence due to the mid-sentence numbering, so we instead use the following:

"Here, we unveil genome-wide patterns for the mutational mechanisms leading to new genes and their subsequent lineage-specific evolution at different time nodes in the Drosophila melanogaster species subgroup. We find that tandem gene duplication has generated ∼80% of the nascent duplicates that are limited to single species (D. melanogaster or Drosophila yakuba), and that the most abundant new genes shared by multiple species (44.1%) are dispersed duplicates, and are more likely to be retained and be **functional**."

ITEM: "genes" (functional -> retained -> likely -> duplicates -> genes)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

(There is language associated with historical selection ("lineage-specific evolution", "shared by multiple species...likely to be retained") but without *EFFECT* it cannot be Selected Effect.)

In [130]:
instance_39_revised = nlp("Here, we unveil genome-wide patterns for the mutational mechanisms leading to new genes and their subsequent lineage-specific evolution at different time nodes in the Drosophila melanogaster species subgroup. We find that tandem gene duplication has generated ∼80% of the nascent duplicates that are limited to single species (D. melanogaster or Drosophila yakuba), and that the most abundant new genes shared by multiple species (44.1%) are dispersed duplicates, and are more likely to be retained and be functional.")
visualise(instance_39_revised)
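
The revised sentence above was edited by hand. For other sentences with mid-sentence numbering, a first pass could strip the "(1)", "(2)" markers automatically before parsing, as sketched below. The pattern only removes parenthesised integers, so "(44.1%)" is left alone. It does not rejoin the clauses (we also replaced "; (2)" with ", and that" manually), and `strip_enumerators` is a hypothetical helper rather than part of `process_sentence`. The input here is a shortened excerpt of Instance #39.

```python
import re


def strip_enumerators(text):
    """Remove inline list markers such as '(1)' or '(2)' and tidy the spacing."""
    # Match an opening bracket, digits only, a closing bracket, then any spaces.
    without_markers = re.sub(r"\(\d+\)\s*", "", text)
    # Collapse any doubled spaces left behind.
    return re.sub(r"\s{2,}", " ", without_markers).strip()


excerpt = (
    "We find that (1) tandem gene duplication has generated nascent duplicates; "
    "(2) the most abundant new genes (44.1%) are dispersed duplicates."
)
print(strip_enumerators(excerpt))
```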

Instance #40: "the rate of the origin of new **functional** genes is estimated to be five to 11 genes per million years in the D. melanogaster subgroup."

ITEM: "genes" (functional -> genes)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [131]:
instance_40 = nlp("the rate of the origin of new functional genes is estimated to be five to 11 genes per million years in the D. melanogaster subgroup.")
visualise(instance_40)

Instance #41: "Here we performed a comparative analysis of 48 avian genomes to identify genomic features that are unique to songbirds, as well as an initial assessment of **function** by investigating their tissue distribution and predicted protein domain structure."

ITEM: "genomic features" or "genomic features that are unique to songbirds" (function -> assessment -> features)

EFFECT: NONE

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

(If we were being very generous with *EFFECT* one might assign something like "conform to a protein domain structure", in which case it would be **Biological Role** (with *SYSTEM* perhaps being bird tissues).)

In [132]:
instance_41 = nlp("Here we performed a comparative analysis of 48 avian genomes to identify genomic features that are unique to songbirds, as well as an initial assessment of function by investigating their tissue distribution and predicted protein domain structure.")
visualise(instance_41)

Instance #42: "Descriptions of recently evolved genes suggest several mechanisms of origin including exon shuffling, gene fission-fusion, retrotransposition, duplication-divergence, and lateral gene transfer, all of which involve recruitment of preexisting genes or genetic elements into new **function**."

ITEM: "preexisting genes or genetic elements" (function -> recruitment -> genes -> elements)

EFFECT: NONE (all of the mechanism-like things mentioned involve the origin of the *ITEM*, not what it does)

Not being used in the sense of To Work (substitution doesn't retain meaning) and thus **Description Incomplete: EFFECT Unspecified**

In [133]:
instance_42 = nlp("Descriptions of recently evolved genes suggest several mechanisms of origin including exon shuffling, gene fission-fusion, retrotransposition, duplication-divergence, and lateral gene transfer, all of which involve recruitment of preexisting genes or genetic elements into new function.")
visualise(instance_42)