In [1]:
"""
I will use this notebook as a workspace to go through my notes on the function handbook, 
translating the scheme into a dependency grammar format (using Universal Dependencies,
https://universaldependencies.org/, which is the parsing scheme adopted by spaCy)
"""


In [2]:
import spacy
import scispacy
from spacy import displacy

In [3]:
# A first observation: spaCy's general-purpose web models don't do a great job of parsing the example sentences.
# Take a simple example
nlp_web_sm = spacy.load("en_core_web_sm")
doc = nlp_web_sm("The gene functions as an important regulator.")
displacy.render(doc, style="dep", jupyter=True, options={'distance' : 140})

In [4]:
# en_core_web_sm doesn't actually ship with word vectors, but en_core_web_md does, and the same problem appears there
nlp_web_md = spacy.load("en_core_web_md")
doc = nlp_web_md("The gene functions as an important regulator.")
displacy.render(doc, style="dep", jupyter=True, options={'distance' : 140})

In [5]:
# The obvious problem here is that the POS for functions is NOUN.
# Interestingly it is assigned as the root of the sentence though.
# Incidentally, the same issue arises even if the structure is kept similar but less technical words are used
doc = nlp_web_md("The couch functions as a seating device.")
displacy.render(doc, style="dep", jupyter=True, options={'distance' : 140})

In [6]:
# But if we load a statistical model from scispacy, then "functions" is tagged as a verb
nlp_sci_md = spacy.load("en_core_sci_md")
doc1 = nlp_sci_md("The gene functions as an important regulator.")
displacy.render(doc1, style="dep", jupyter=True, options={'distance' : 140})
# With "functions" correctly tagged as a verb, the dependency structure around "functions" also changes.
# Instead of having both "gene" and "the" as dependents of "functions", we now have just "gene".
# (It seems it was previously being parsed as the noun phrase "gene functions", with "functions" as head.)
# "regulator" is now also a direct dependent of "functions" (whereas before the direct dependent was "as").



In [35]:
# It's also useful to show the descriptions of the POS and dependencies in a table
def print_token_pos_dep(doc, tok_width=30, pos_width=30, dep_width=30):
    """
    prints token, pos description, and dep description of a doc
    args:
        doc: spacy doc
        tok_width: width of col 1
        pos_width: width of col 2
        dep_width: width of col 3
    returns:
        Nothing (prints output to screen)
    """
    print("{:<{}}{:<{}}{:<{}}\n".format("TOKEN", tok_width, "POS", pos_width, "DEPENDENCY", dep_width))
    for token in doc:
        print("{:<{}}{:<{}}{:<{}}".format(
            token.text, tok_width, str(spacy.explain(token.pos_)), pos_width, 
            str(spacy.explain(token.dep_)), dep_width))
        
print_token_pos_dep(doc1)

TOKEN                         POS                           DEPENDENCY                    

The                           determiner                    determiner                    
gene                          noun                          nominal subject               
functions                     verb                          None                          
as                            subordinating conjunction     case marking                  
an                            determiner                    determiner                    
important                     adjective                     adjectival modifier           
regulator                     noun                          modifier of nominal           
.                             punctuation                   punctuation                   


In [8]:
# I will use the biggest statistical model in scispacy from now on
nlp = spacy.load('en_core_sci_lg')



In [9]:
    
# Perhaps the most interesting thing to note is that "functions" is an intransitive verb (i.e.
# it doesn't take a direct object). Since this sentence is already in the "unpacked form", this means that if we
# unpack into the standard form with function as the verb "functions" (or "function" if plural), function will 
# never take a direct object.

# Note that "regulator" is marked as "nmod" (modifier of nominal). This seems to be a UD v1 designation,
# so perhaps the model hasn't adopted UD v2, which seems to mark adjunct nominal modifiers of non-nouns as "obl"
# (see https://universaldependencies.org/u/dep/nmod.html). In either case, it is marked as a non-essential modifier.

# In this example, I think "as" is perhaps the next most interesting thing to discuss.
# "As" is tagged as a subordinating conjunction, connecting the phrase "as an important regulator"
# to the independent clause "The gene functions" (the dependency label on "as" itself is case marking).
# "As" is being used in the sense of (Merriam-Webster)
"""
2 : in or to the same degree in which
soft as silk
—usually used as a correlative after an adjective or adverb modified by adverbial as or so
as cool as a cucumber
"""

# This perhaps isn't the ideal definition for our purposes, but basically "as" and the phrase that
# follows it help us answer the question "how does it function?" or "what does it function to do?".
# More generally we will see a number of other (syntactically different but semantically related) examples of
# function being modified in a similar manner. An example off the top of my head is function as subject being 
# linked to a subject complement via the copula "is". For example, consider the sentence below:

doc2 = nlp("The gene's function is regulation")
displacy.render(doc2, style="dep", jupyter=True, options={'distance' : 140})
print_token_pos_dep(doc2)

# This is another instance of function being in a sense "endowed" with additional meaning.
# Semantically, I would say that "The gene's function is regulation" is equivalent to
# "The gene functions as a regulator". Other equivalent (or nearly so) examples include "The gene functions to regulate",
# "The function of the gene is to regulate", "The function of the gene is regulation", "The function of the
# gene as a regulator...", "The gene has a function to regulate", and so on.

# Although slightly different, I would argue that a clause like "Gene function regulates..." is semantically
# nearly identical. For example, compare "The promoter's function is to regulate gene expression." with
# "Promoter function acts to regulate gene expression." (The latter phrasing seems a bit unnatural, so perhaps
# something like "Promoter function plays an important role in regulating gene expression" is more common.)


TOKEN                         POS                           DEPENDENCY                    

The                           determiner                    determiner                    
gene                          noun                          possession modifier           
's                            particle                      case marking                  
function                      noun                          nominal subject               
is                            auxiliary                     copula                        
regulation                    noun                          None                          


In [10]:
# The examples above highlight a couple of key concepts.
# First, if "function" is the nominal subject, the "item" to which the function should be attached will often be
# found as a component of the subject (e.g. if the subject is a compound noun like "protein function") or as
# a prepositional phrase that postmodifies function. I will consider these to be equivalent and for simplicity
# will unpack "function of the <item>", "function possessed by the <item>", "function that the <item> has",
# "<item> function" and so on as "<item> function".

# Consider the following examples:

doc3 = nlp("The function of the protein is important for regulating homeostasis.")
displacy.render(doc3, style="dep", jupyter=True, options={'distance' : 90})
print_token_pos_dep(doc3)
doc4 = nlp("Protein function is important for regulating homeostasis.")
displacy.render(doc4, style="dep", jupyter=True, options={'distance' : 120})
print_token_pos_dep(doc4)
doc5 = nlp("The function possessed by the protein is important for regulating homeostasis.")
displacy.render(doc5, style="dep", jupyter=True, options={'distance' : 90})
print_token_pos_dep(doc5)

# Note that in UD the nonverbal predicate ("important") of the copula "is" is assigned ROOT.
# In all cases, "function" is the nominal subject.
# In doc3 and doc4, "protein" is a direct dependent of "function" (in doc3, "protein" is a modifier of nominal;
# in doc4, "protein" is probably best described as a noun adjunct that modifies the head noun "function").
# In doc5 it's less straightforward: "protein" is a dependent of "possessed", which is itself a dependent of
# "function" (so it's a second-order, or indirect, dependent; I'm not sure of the technical terminology).
# I think the important thing here is that you can follow a directed path from "function" to "protein" even in doc5.
# So one possible rule might be: if "function" is the nominal subject, then the "item" can be identified (if it
# exists) by following the arrows and preferring closer dependents (if there are multiple candidates).

TOKEN                         POS                           DEPENDENCY                    

The                           determiner                    determiner                    
function                      noun                          nominal subject               
of                            adposition                    case marking                  
the                           determiner                    determiner                    
protein                       noun                          modifier of nominal           
is                            auxiliary                     copula                        
important                     adjective                     None                          
for                           adposition                    marker                        
regulating                    verb                          adverbial clause modifier     
homeostasis                   noun                          direct object                

TOKEN                         POS                           DEPENDENCY                    

Protein                       noun                          compound                      
function                      noun                          nominal subject               
is                            auxiliary                     copula                        
important                     adjective                     None                          
for                           adposition                    marker                        
regulating                    verb                          adverbial clause modifier     
homeostasis                   noun                          direct object                 
.                             punctuation                   punctuation                   


TOKEN                         POS                           DEPENDENCY                    

The                           determiner                    determiner                    
function                      noun                          nominal subject               
possessed                     verb                          clausal modifier of noun (adjectival clause)
by                            adposition                    case marking                  
the                           determiner                    determiner                    
protein                       noun                          modifier of nominal           
is                            auxiliary                     copula                        
important                     adjective                     None                          
for                           adposition                    marker                        
regulating                    verb                          adverbial claus

In [11]:
for token in doc5:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          list(token.children))

The det function NOUN []
function nsubj important ADJ [The, possessed]
possessed acl function NOUN [protein]
by case protein NOUN []
the det protein NOUN []
protein nmod possessed VERB [by, the]
is cop important ADJ []
important ROOT important ADJ [function, is, regulating, .]
for mark regulating VERB []
regulating advcl important ADJ [for, homeostasis]
homeostasis dobj regulating VERB []
. punct important ADJ []
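
The "follow the arrows and prefer closer dependents" rule suggested above can be sketched as a breadth-first search over the parse. This is only a minimal illustration, not a settled method: the rows are copied by hand from the doc5 printout (so it runs without spaCy), and `nearest_item`, `build_children`, and the `CANDIDATES` set are hypothetical names introduced here.

```python
from collections import deque

# (text, dep, head text) rows copied from the doc5 printout above.
# Keying on token text is a shortcut that only works because every word in
# this sentence is unique; code running on a real Doc would key on token.i.
ROWS = [
    ("The", "det", "function"),
    ("function", "nsubj", "important"),
    ("possessed", "acl", "function"),
    ("by", "case", "protein"),
    ("the", "det", "protein"),
    ("protein", "nmod", "possessed"),
    ("is", "cop", "important"),
    ("important", "ROOT", "important"),
    ("for", "mark", "regulating"),
    ("regulating", "advcl", "important"),
    ("homeostasis", "dobj", "regulating"),
    (".", "punct", "important"),
]


def build_children(rows):
    """Map each head to the list of its direct dependents."""
    children = {}
    for text, dep, head in rows:
        if dep != "ROOT":  # the root lists itself as head, so skip it
            children.setdefault(head, []).append(text)
    return children


def nearest_item(start, children, candidates):
    """Breadth-first search downward from `start`.

    Returns (item, path) for the closest descendant found in `candidates`,
    or (None, []) if no candidate is reachable.
    """
    queue = deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        if node != start and node in candidates:
            return node, path
        for child in children.get(node, []):
            queue.append((child, path + [child]))
    return None, []


# ASSUMPTION: a hypothetical stand-in for a list of permissible items.
CANDIDATES = {"protein", "gene"}

item, path = nearest_item("function", build_children(ROWS), CANDIDATES)
print(item, "|", " -> ".join(path))
```

Breadth-first order is what implements "preferring closer dependents": every dependent one arrow away is checked before any dependent two arrows away.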


In [12]:
# test out cosine similarity
# doc6/7 from one paper and doc8/9 from another
# using scispacy
doc6 = nlp("A central theory in evolutionary developmental biology is that functional novelty arises through changes to the regulation and expression, both spatially and temporally, of otherwise well-conserved proteins.")
doc7 = nlp("It is not possible to rule out rhythmic post-transcriptional modification of these proteins, which might also influence their function.")
doc8 = nlp("Unfortunately, these individuals are at greater risk for other cancers, cardiovascular disease, osteoporosis, diabetes, and functional decline.")
doc9 = nlp("We explored associations between lifestyle factors and physical functioning among elderly cancer survivors who were screened for Project LEAD (Leading the Way in Exercise And Diet), a home-based, diet and exercise intervention trial.")

print(doc6.similarity(doc7))
print(doc6.similarity(doc8))
print(doc6.similarity(doc9))
print('\n')
print(doc8.similarity(doc6))
print(doc8.similarity(doc7))
print(doc8.similarity(doc9))
# using web medium
doc6 = nlp_web_md("A central theory in evolutionary developmental biology is that functional novelty arises through changes to the regulation and expression, both spatially and temporally, of otherwise well-conserved proteins.")
doc7 = nlp_web_md("It is not possible to rule out rhythmic post-transcriptional modification of these proteins, which might also influence their function.")
doc8 = nlp_web_md("Unfortunately, these individuals are at greater risk for other cancers, cardiovascular disease, osteoporosis, diabetes, and functional decline.")
doc9 = nlp_web_md("We explored associations between lifestyle factors and physical functioning among elderly cancer survivors who were screened for Project LEAD (Leading the Way in Exercise And Diet), a home-based, diet and exercise intervention trial.")
print('\n')
print(doc6.similarity(doc7))
print(doc6.similarity(doc8))
print(doc6.similarity(doc9))
print('\n')
print(doc8.similarity(doc6))
print(doc8.similarity(doc7))
print(doc8.similarity(doc9))

# Obviously this needs a lot more testing before drawing any conclusions, but the first set looks intriguing.
# doc6 and doc9 should intuitively not be that similar; this is recovered with scispacy lg (0.45) but not at all with web_md (0.86).
# The sci_lg model seems to space out the comparisons a lot better (which is not necessarily surprising, since any two sentences
# from scientific papers will seem very similar to a model trained on a corpus including Wikipedia, news, etc.).
# Note that these comparisons have nothing to do with semantics; they reflect similarity in the sense of the
# kind of text the sentences were drawn from. Again, far more tests are needed, but this is definitely a point in favour of the scispacy models.

0.8172883017760809
0.5959907654013872
0.44515275327939186


0.5959907654013872
0.5954728541798259
0.6549284890486828


0.9223619423189712
0.8163925168136106
0.8640749153788936


0.8163925168136106
0.8170797185152258
0.908630643529469


In [31]:
doc10 = nlp("The protein functions to regulate homeostasis")
displacy.render(doc10, style="dep", jupyter=True, options={'distance' : 150})
print_token_pos_dep(doc10)
print('\n')
def print_children(doc, col1_width=20, col2_width=20, col3_width=20, col4_width=20):
    for token in doc:
        print("{:<{}}{:<{}}{:<{}}{:<{}}{}".format(token.text, col1_width, token.dep_, col2_width,
                                                    token.head.text, col3_width, token.head.pos_,
                                                    col4_width, [child for child in token.children]))

print_children(doc10)

TOKEN                         POS                           DEPENDENCY                    

The                           determiner                    determiner                    
protein                       noun                          nominal subject               
functions                     verb                          None                          
to                            particle                      marker                        
regulate                      verb                          open clausal complement       
homeostasis                   noun                          direct object                 


The                 det                 protein             NOUN                []
protein             nsubj               functions           VERB                [The]
functions           ROOT                functions           VERB                [protein, regulate]
to                  mark                regulate            VERB                []
regulate

In [40]:
'''
 Will now sketch out a way to translate my earlier notes into a dependency grammar framework.
 
 I will start with "case 1", which is the easiest one to deal with.
  
 Function as nominal subject (or head of a nominal phrase acting as subject) or as finite verb: start from "function" and follow the arrows (from each arrow's tail to its point, i.e. from governor to dependent) to identify the item /x/.
 The item /x/ might be a direct dependent (e.g. like "protein functions" above) or it might be indirect, so that you have to follow multiple arrows (I believe that this is related to the notion of a "catena" in dependency grammar).
 In any case, it is basically a transitivity relationship (if we have a directed graph with edges (1,2) and (2,3), then the transitive closure adds the edge (1,3)).

 (In theory, this could be easily automated, but the tricky part would be identifying /x/, which needs to be 1. a concrete noun that is 2. a biological component of a Darwinian organism (or the ecological equivalent).
 This could certainly be accomplished with a look-up table containing an exhaustive list of permissible /x/s, but compiling one would clearly be difficult or impossible, so this might also be a place for an ML approach (though that might be a bit optimistic given
 the technical specification: it's doubtful that even state-of-the-art transformer models would "identify" similarities between usages of words related to such a narrow technical consideration). In any case, if I wanted to examine how one
 might code this (for the paper), I could just create a lookup table that contains the items within the corpus (which is obviously "cheating", but I would make it extremely clear that it's only for illustrative purposes
 and is a best-case scenario (i.e. one where we somehow obtain an exhaustive list of allowable "items")).)

 '''
# consider some examples that would be unpacked as "the gene functions to..."

case1a = nlp("The function of the gene")
# here we can follow the arrow from function directly to its dependent gene
displacy.render(case1a, style="dep", jupyter=True, options={'distance' : 150})

case1b = nlp("The function possessed by the gene")
# here we get from function to gene via function -> possessed -> gene
displacy.render(case1b, style="dep", jupyter=True, options={'distance' : 150})

case1c = nlp("The function possessed by the gene's DNA sequence")
# this is an incredibly ugly phrase, but it highlights that we can still get from function to gene via function -> possessed -> sequence -> gene
# (we can also reach "DNA", which technically could be an item /x/, so there might need to be a rule about why you choose gene as /x/ over DNA)
displacy.render(case1c, style="dep", jupyter=True, options={'distance' : 150})
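
The transitivity idea and the lookup-table idea from the notes above can be combined in a few lines. This is a sketch under loud assumptions: the `EDGES` dictionary is written by hand from the path described in the case1c comments (it is not parser output, and the attachment of "by", "the" and "'s" is guessed), and `ITEM_LOOKUP` is the kind of "cheating" table described in the notes.

```python
# Head -> dependents edges for "The function possessed by the gene's DNA sequence".
# ASSUMPTION: hand-written to match the path described in the comments
# (function -> possessed -> sequence -> gene, with "DNA" also under "sequence");
# the real edges should be read off the displacy output.
EDGES = {
    "function": ["The", "possessed"],
    "possessed": ["sequence"],
    "sequence": ["by", "gene", "DNA"],
    "gene": ["the", "'s"],
}


def descendants_with_depth(start, edges):
    """Return {descendant: number of arrows followed from start}.

    The keys are exactly the words that the transitive closure links to `start`.
    """
    depths = {}
    frontier = [start]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for node in frontier:
            for child in edges.get(node, []):
                if child not in depths:
                    depths[child] = depth
                    next_frontier.append(child)
        frontier = next_frontier
    return depths


# ASSUMPTION: hypothetical lookup table of permissible items /x/.
ITEM_LOOKUP = {"gene", "DNA", "protein", "promoter"}

depths = descendants_with_depth("function", EDGES)
candidates = sorted((d, word) for word, d in depths.items() if word in ITEM_LOOKUP)
print(candidates)
```

Under these assumed edges, "gene" and "DNA" sit at the same distance from "function", so distance alone cannot pick between them. Choosing gene over DNA would need an extra rule, for example one based on the dependency label.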

In [43]:
"""
Cases 3 and 4 (I decided that case 2 wasn't possible; I am just using the original numbering while I work through all of this)
case 3: functional as adjectival premodifier of concrete head of nominal subject
case 4: functional as adjectival postmodifier of concrete head of nominal subject (e.g. linked by a copula)

This one is interesting. In my original notes I wanted to unpack both of these as "the promoter functions...".
That's currently still the plan, but they do have different dependency relationships.
In both cases, there is a direct arc between promoter and functional, but the direction is reversed.
This is (at least in part) due to the fact that UD sets the nonverbal predicate of the copula as the ROOT (so there is
technically no way that the arrow could go into functional in case4 as occurs in case3).
I suppose (but only off the top of my head, need to verify this) that if one put ROOT on "is" in case4 then
you would get is -> functional and promoter <- is, which not only removes the direct link but also leaves no directed path between promoter and functional.
Assuming I decide to keep this unpacking, the rule for case4 might need to be something like: an adjective that has a copula as a dependent and /x/ as an nsubj dependent.
"""
case3 = nlp("The functional promoter transcribes the gene.")
displacy.render(case3, style="dep", jupyter=True, options={'distance' : 150})
case4 = nlp("The promoter that transcribes the gene is functional.")
displacy.render(case4, style="dep", jupyter=True, options={'distance' : 150})
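
A rough sketch of how the two rules might look in code, working on hand-written token tuples rather than a live parse. The (text, dep, head index) rows below are assumptions about what the parser returns for these two sentences, based on the description in the notes. They have not been copied from actual output, and `item_for_functional` is a hypothetical helper.

```python
# Each token is (text, dep, index of head); the ROOT token points at itself,
# following spaCy's convention.
# ASSUMPTION: both parses are hand-written guesses, not parser output.
CASE3 = [
    ("The", "det", 2),
    ("functional", "amod", 2),
    ("promoter", "nsubj", 3),
    ("transcribes", "ROOT", 3),
    ("the", "det", 5),
    ("gene", "dobj", 3),
    (".", "punct", 3),
]
CASE4 = [
    ("The", "det", 1),
    ("promoter", "nsubj", 7),
    ("that", "nsubj", 3),
    ("transcribes", "acl:relcl", 1),
    ("the", "det", 5),
    ("gene", "dobj", 3),
    ("is", "cop", 7),
    ("functional", "ROOT", 7),
    (".", "punct", 7),
]


def item_for_functional(tokens):
    """Return the word that "functional" is predicated of, or None."""
    for i, (text, dep, head) in enumerate(tokens):
        if text.lower() != "functional":
            continue
        # Case 3: premodifier, so the item is the noun it modifies (its head).
        if dep == "amod":
            return tokens[head][0]
        # Case 4: predicate adjective, so look among its own dependents for
        # a copula and a nominal subject.
        dependents = {
            d: t for j, (t, d, h) in enumerate(tokens) if h == i and j != i
        }
        if "cop" in dependents and "nsubj" in dependents:
            return dependents["nsubj"]
    return None


print(item_for_functional(CASE3), item_for_functional(CASE4))
```

The reversed arc direction shows up as the two branches: case 3 looks up to the head of "functional", while case 4 looks down to its dependents.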