In [None]:
"""
I will use this notebook as a workspace to go through my notes on the function handbook, 
translating the scheme into a dependency grammar format (using Universal Dependencies,
https://universaldependencies.org/, which is the parsing scheme adopted by spaCy)
"""

In [1]:
import spacy
import scispacy
from spacy import displacy

In [5]:
# an interesting first note is that the web language models from spacy don't do a great job of parsing example sentences
# Take a simple example
nlp_web_sm = spacy.load("en_core_web_sm")
doc = nlp_web_sm("The gene functions as an important regulator.")
displacy.render(doc, style="dep", jupyter=True, options={'distance' : 140})

In [6]:
# en_core_web_sm ships without word vectors, whereas en_core_web_md includes them; the same problem appears with the larger model
nlp_web_md = spacy.load("en_core_web_md")
doc = nlp_web_md("The gene functions as an important regulator.")
displacy.render(doc, style="dep", jupyter=True, options={'distance' : 140})

In [7]:
# The obvious problem here is that the POS for functions is NOUN.
# Interestingly it is assigned as the root of the sentence though.
# Incidentally, the same issue arises even if the structure is kept similar but less technical words are used
doc = nlp_web_md("The couch functions as a seating device.")
displacy.render(doc, style="dep", jupyter=True, options={'distance' : 140})

In [57]:
# But if we load a statistical model from scispacy, then "functions" is tagged as a verb
nlp = spacy.load("en_core_sci_md")
doc1 = nlp("The gene functions as an important regulator.")
# Render doc1 (the scispacy parse), not the earlier `doc` from the web model
displacy.render(doc1, style="dep", jupyter=True, options={'distance' : 140})
# With "functions" correctly tagged as a verb, the dependency structure around "functions" also changes.
# Instead of having both "gene" and "the" as dependents of "functions", we now have just "gene".
# (It seems it was previously parsed as the noun phrase "gene functions" with "functions" as head.)
# "regulator" is now also a direct dependent of "functions" (whereas before the direct dependent was "as").

In [64]:
# It's also useful to show the descriptions of the POS and dependencies in a table
def print_token_pos_dep(doc, tok_width=30, pos_width=30, dep_width=30):
    """
    prints token, pos description, and dep description of a doc
    args:
        doc: spacy doc
        tok_width: width of col 1
        pos_width: width of col 2
        dep_width: width of col 3
    returns:
        Nothing (prints output to screen)
    """
    print("{:<{}}{:<{}}{:<{}}\n".format("TOKEN", tok_width, "POS", pos_width, "DEPENDENCY", dep_width))
    for token in doc:
        print("{:<{}}{:<{}}{:<{}}".format(
            token.text, tok_width, str(spacy.explain(token.pos_)), pos_width, 
            str(spacy.explain(token.dep_)), dep_width))
        
print_token_pos_dep(doc1)

TOKEN                         POS                           DEPENDENCY                    

The                           determiner                    determiner                    
gene                          noun                          nominal subject               
functions                     verb                          None                          
as                            subordinating conjunction     case marking                  
an                            determiner                    determiner                    
important                     adjective                     adjectival modifier           
regulator                     noun                          modifier of nominal           
.                             punctuation                   punctuation                   
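
In [ ]:
# The "None" in the DEPENDENCY column for "functions" presumably comes from spacy.explain returning None
# for a label it has no description for (here the root label), which str() then prints as "None".
# A minimal sketch of a fallback that shows the raw label instead. The helper name `describe` and the toy
# glossary are hypothetical stand-ins of mine; with spaCy one would pass spacy.explain as `explain`.

```python
def describe(label, explain):
    """Return explain(label), or the raw label when no description is available."""
    description = explain(label)
    return description if description is not None else label


# Stand-in glossary for illustration only (not spaCy's real glossary).
toy_glossary = {"nsubj": "nominal subject", "det": "determiner"}

print(describe("nsubj", toy_glossary.get))  # described label
print(describe("ROOT", toy_glossary.get))   # falls back to the raw label
```

# Inside print_token_pos_dep this would replace str(spacy.explain(token.dep_)) with
# describe(token.dep_, spacy.explain), at the cost of mixing descriptions and raw labels in one column.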


In [74]:
    
# Perhaps the most interesting thing to note is that "functions" is an intransitive verb (i.e.
# it doesn't take a direct object). Since this sentence is already in the "unpacked form", this means that if we
# unpack into the standard form with function as the verb "functions" (or "function" if plural), function will 
# never take a direct object.

# Note that "regulator" is marked as "nmod" (modifier of nominal). This looks like a UD v1 designation:
# UD v2 marks nominal modifiers of non-nouns (here, of the verb "functions") as "obl"
# (see https://universaldependencies.org/u/dep/nmod.html), so perhaps the model's label set predates UD v2.
# In either case, it is marked as a non-essential modifier.

# In this example, I think "as" is perhaps the next most interesting thing to discuss.
# "As" is used as a subordinating conjunction (connecting the subordinate clause "as an important regulator"
# to the independent clause "The gene functions").
# "As" is being used in the sense of (Merriam-Webster)
"""
2 : in or to the same degree in which
soft as silk
—usually used as a correlative after an adjective or adverb modified by adverbial as or so
as cool as a cucumber
"""

# This perhaps isn't the ideal definition for our purposes, but basically "as" and the phrase that
# follows it help us answer the question "how does it function?" or "what does it function to do?".
# More generally, we will see a number of other (syntactically different but semantically related) examples of
# "function" being modified in a similar manner. One example is "function" as the subject,
# linked to a subject complement via the copula "is". For example, consider the sentence below:

doc2 = nlp("The gene's function is regulation")
displacy.render(doc2, style="dep", jupyter=True, options={'distance' : 140})
print_token_pos_dep(doc2)

# This is another instance of function being in a sense "endowed" with additional meaning.
# Semantically, I would say that "The gene's function is regulation" is equivalent to
# "The gene functions as a regulator". Other equivalent (or nearly so) examples include "The gene functions to regulate",
# "The function of the gene is to regulate", "The function of the gene is regulation", "The function of the
# gene as a regulator...", "The gene has a function to regulate", and so on.

# Although slightly different, I would argue that a clause like "Gene function regulates..." is semantically
# nearly identical. For example, compare "The promoter's function is to regulate gene expression." with
# "Promoter function acts to regulate gene expression." (The latter phrasing seems a bit unnatural, so perhaps
# something like "Promoter function plays an important role in regulating gene expression" is more common.)


TOKEN                         POS                           DEPENDENCY                    

The                           determiner                    determiner                    
gene                          noun                          possession modifier           
's                            particle                      case marking                  
function                      noun                          nominal subject               
is                            auxiliary                     None                          
regulation                    noun                          attribute                     
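
In [ ]:
# The earlier observation that "functions" takes no direct object can be checked mechanically.
# A minimal sketch: the helper below only assumes objects with `.children` and `.dep_`, which spaCy tokens
# expose, so it should also work on a real doc. The SimpleNamespace objects are hand-built stand-ins that
# follow the doc1 parse described above (they are not model output), and the "regulating" case is a
# hand-built contrast. The label set {"dobj", "obj"} is my assumption, covering the older and UD v2 names.

```python
from types import SimpleNamespace


def direct_objects(token):
    """Return the children of `token` that are labelled as direct objects."""
    return [child for child in token.children if child.dep_ in {"dobj", "obj"}]


# Stand-in for doc1: "functions" governs "gene" (nsubj) and "regulator" (nmod).
gene = SimpleNamespace(text="gene", dep_="nsubj", children=[])
regulator = SimpleNamespace(text="regulator", dep_="nmod", children=[])
functions = SimpleNamespace(text="functions", dep_="ROOT", children=[gene, regulator])

# Contrast case: a transitive verb with a direct object.
homeostasis = SimpleNamespace(text="homeostasis", dep_="dobj", children=[])
regulating = SimpleNamespace(text="regulating", dep_="advcl", children=[homeostasis])

print([t.text for t in direct_objects(functions)])   # no direct object
print([t.text for t in direct_objects(regulating)])  # one direct object
```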


In [83]:
# The examples above highlight a couple of key concepts.
# First, if "function" is the nominal subject, the "item" to which the function should be attached will often be
# found either as a component of the subject (e.g. when the subject is a compound noun like "protein function") or in
# a prepositional phrase that postmodifies "function". I will treat these as equivalent and, for simplicity,
# will unpack "function of the <item>", "function possessed by the <item>", "function that the <item> has",
# "<item> function" and so on as "<item> function".

# Consider the following examples:

doc3 = nlp("The function of the protein is important for regulating homeostasis.")
displacy.render(doc3, style="dep", jupyter=True, options={'distance' : 90})
print_token_pos_dep(doc3)
doc4 = nlp("Protein function is important for regulating homeostasis.")
displacy.render(doc4, style="dep", jupyter=True, options={'distance' : 120})
print_token_pos_dep(doc4)
doc5 = nlp("The function possessed by the protein is important for regulating homeostasis.")
displacy.render(doc5, style="dep", jupyter=True, options={'distance' : 90})
print_token_pos_dep(doc5)

# Note that in UD the nonverbal predicate ("important") of the copula "is" is assigned ROOT.
# In all cases, "function" is the nominal subject.
# In doc3 and doc4, "protein" is a direct dependent of "function" (in doc3 "protein" is a modifier of nominal;
# in doc4, "protein" is labelled "compound", i.e. a noun adjunct that modifies the head noun "function").
# In doc5 it's less straightforward: "protein" is a dependent of "possessed", which is itself a dependent of
# "function", so "protein" is an indirect dependent (a descendant) of "function" in the parse tree.
# The important thing here is that you can follow a directed path from function -> protein even in doc5.
# So one possible rule might be: if "function" is the nominal subject, then the "item" can be identified (if it
# exists) by following the arrows down from "function", preferring closer dependents when there are multiple candidates.

TOKEN                         POS                           DEPENDENCY                    

The                           determiner                    determiner                    
function                      noun                          nominal subject               
of                            adposition                    case marking                  
the                           determiner                    determiner                    
protein                       noun                          modifier of nominal           
is                            auxiliary                     copula                        
important                     adjective                     None                          
for                           adposition                    marker                        
regulating                    verb                          adverbial clause modifier     
homeostasis                   noun                          direct object                

TOKEN                         POS                           DEPENDENCY                    

Protein                       noun                          compound                      
function                      noun                          nominal subject               
is                            auxiliary                     copula                        
important                     adjective                     None                          
for                           adposition                    marker                        
regulating                    verb                          adverbial clause modifier     
homeostasis                   noun                          direct object                 
.                             punctuation                   punctuation                   


TOKEN                         POS                           DEPENDENCY                    

The                           determiner                    determiner                    
function                      noun                          nominal subject               
possessed                     verb                          clausal modifier of noun (adjectival clause)
by                            adposition                    case marking                  
the                           determiner                    determiner                    
protein                       noun                          modifier of nominal           
is                            auxiliary                     copula                        
important                     adjective                     None                          
for                           adposition                    marker                        
regulating                    verb                          adverbial claus

In [86]:
for token in doc5:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

The det function NOUN []
function nsubj important ADJ [The, possessed]
possessed acl function NOUN [protein]
by case protein NOUN []
the det protein NOUN []
protein nmod possessed VERB [by, the]
is cop important ADJ []
important ROOT important ADJ [function, is, regulating, .]
for mark regulating VERB []
regulating advcl important ADJ [for, homeostasis]
homeostasis dobj regulating VERB []
. punct important ADJ []
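
In [ ]:
# The candidate rule above (start at "function", follow the arrows, prefer closer dependents) can be sketched
# as a breadth-first search over the parse. To keep this self-contained, the doc5 parse is transcribed by hand
# from the output above as (text, POS, dep, head index) tuples. Treating only NOUN/PROPN tokens as candidate
# items, and breaking ties by token order, are assumptions of mine; `find_item` is a hypothetical helper.

```python
from collections import deque

# (text, POS, dep, index of head), transcribed from the doc5 output above.
DOC5 = [
    ("The", "DET", "det", 1),
    ("function", "NOUN", "nsubj", 7),
    ("possessed", "VERB", "acl", 1),
    ("by", "ADP", "case", 5),
    ("the", "DET", "det", 5),
    ("protein", "NOUN", "nmod", 2),
    ("is", "AUX", "cop", 7),
    ("important", "ADJ", "ROOT", 7),
    ("for", "ADP", "mark", 9),
    ("regulating", "VERB", "advcl", 7),
    ("homeostasis", "NOUN", "dobj", 9),
    (".", "PUNCT", "punct", 7),
]


def find_item(tokens, start):
    """Breadth-first search below tokens[start]; return (text, depth) of the nearest noun, or None."""
    children = {i: [] for i in range(len(tokens))}
    for i, (_, _, _, head) in enumerate(tokens):
        if head != i:  # the root is listed as its own head
            children[head].append(i)
    queue = deque((child, 1) for child in children[start])
    while queue:
        i, depth = queue.popleft()
        text, pos, _, _ = tokens[i]
        if pos in {"NOUN", "PROPN"}:
            return text, depth
        queue.extend((child, depth + 1) for child in children[i])
    return None


# Locate "function" where it is the nominal subject, then search its subtree.
function_index = next(
    i for i, tok in enumerate(DOC5) if tok[0] == "function" and tok[2] == "nsubj"
)
print(find_item(DOC5, function_index))
```

# On a real spaCy doc the same walk could use token.children directly instead of the hand-built index.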
