# UD English corpora analysis. 
## Objectives

1. For each dependency relation in UD, identify every occurrence of it in every available English UD corpus. 
2. Obtain minimal information on each of these occurrences (a basic UDdeprel class, Head and Dep classes, and a helper function to iterate through the list of deprels to extract information on). 
2a. Do we care about subtags such as nsubj:pass? Yes, they help us identify possible subcases to analyse. 
3. Aggregate the data for these occurrences at the dependency relation level to obtain insights (pairplots, correlation and heatmap for each POS-tag pair). 
4. Where necessary, expand the analysis for a particular dependency relation (e.g. look at all Dep nodes and/or grandparent nodes). 

__Useful notes identified along the process__
1. The English UD corpora are inconsistent in the data they capture. For instance, the ESL corpus does not capture lemma information. 
2. There are slight variations in token numbering. EWT uses decimal token IDs (e.g. 18 and 18.1). 
3. ESL captures contractions, e.g. "cannot" is expanded into "can" and "not", and the dependency relations are tagged accordingly. "cannot" is captured too, but it is given a token index that combines the indices of its units, e.g. 10-11. 
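
The numbering variations in notes 2 and 3 correspond to the three ID shapes of the CoNLL-U format: plain integers for words, ranges such as 10-11 for multiword tokens, and decimals such as 18.1 for empty nodes. Below is a minimal sketch of how the three could be told apart. The helper name `classify_token_id` is hypothetical and is not used elsewhere in this notebook.

```python
import re

# Hypothetical helper (not part of the notebook): classify a CoNLL-U ID field.
_WORD = re.compile(r"^[0-9]+$")            # e.g. "10"
_RANGE = re.compile(r"^[0-9]+-[0-9]+$")    # e.g. "10-11", a multiword token
_EMPTY = re.compile(r"^[0-9]+\.[0-9]+$")   # e.g. "18.1", an empty node


def classify_token_id(token_id: str) -> str:
    """Return 'word', 'multiword', 'empty' or 'unknown' for an ID string."""
    if _WORD.match(token_id):
        return "word"
    if _RANGE.match(token_id):
        return "multiword"
    if _EMPTY.match(token_id):
        return "empty"
    return "unknown"


for example in ["10", "10-11", "18.1"]:
    print(example, classify_token_id(example))
```

Knowing which shape an ID has makes it easier to decide, per corpus, whether a line should be kept as a token or skipped.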

### 0. Preliminaries, load dependencies and helper functions

In [1]:
import conllu # https://github.com/EmilStenstrom/conllu
import pandas as pd
import glob
from conllu import parse
from conllu import parse_incr
import re
import pickle
import collections

In [2]:
# helper function to load a pickle file 
def pickleloader(filename):
    # open the file for reading; the with block closes it even though we return inside it
    with open(filename, 'rb') as fileObject:
        # latin1 here, to bypass the python2 to python3 pickle problem
        return pickle.load(fileObject, encoding="latin1")
    
# helper function to create a file and store the data in the file 
def picklemaker(filename, objectname): 
    # open the file for writing; the with block closes it when done
    with open(filename, 'wb') as fileObject:
        # write the object to the file
        pickle.dump(objectname, fileObject)

### 1. Consolidating the datasets, extracting only the relevant information

In [None]:
# grab all the filenames for the English UD corpora 
conllu_filenames = glob.glob("./UD_eng_rawfiles/*.conllu")
conllu_filenames.sort()

In [None]:
def conllu_metadata_stripper(filename:str, sentence_dict:dict): 
    '''
    takes a file containing text in CONLLU format, filters for tokens (removing metadata), collects them into sentences, 
    adds them to the dictionary of sentences. 
    inputs | filename:str - name of the file to be processed ; sentence_dict:dict - dictionary, either empty
    or containing CONLLU tokens in sentences (such as those already processed from other files with CONLLU data) 
    returns | sentence_dict:dict - containing the tokens collected from the CONLLU file being processed
    '''
    # open the file and get all the sentences, i.e. collect all the tokens and ignore the metadata before 
    # every sentence. lines whose id is a range (e.g. 10-11) do not match the regex and are dropped too. 
    # the with block closes the file afterwards
    with open(filename, "r", encoding="utf-8") as data_file:
        token_list = [token for token in data_file if re.match(r"([0-9]+\.?[0-9]*\s)\S*", token)]
    

    # use a counter to identify each sentence added to the dictionary
    sent_counter = len(sentence_dict)+1
    
    
    def token_adder(sent_counter,sentence_dict):
        # use a counter with which we control the addition of tokens to each key in the dictionary
        tok_counter = 0
        
        # initialise an empty list with the current dictionary key 
        sentence_dict[sent_counter] = []
        
        # try except to pass when list has been emptied
        try: 
            # "local" while loop to add tokens only if they have numbers bigger than tok_counter
            # i.e. once the index number for a token falls below the last, reset the while loop
            while float(re.findall(r"^[0-9]+\.?[0-9]*", token_list[0])[0]) > tok_counter:
                tok_counter = float(re.findall(r"^[0-9]+\.?[0-9]*", token_list[0])[0])
                
                # pop the first element from the token_list 
                sentence_dict[sent_counter].append(token_list.pop(0))
                 
                
                # note on the regex above: 
                # some corpora index tokens with floating point numbers, e.g. a token can be 18
                # and a related one can be 18.1. see sentence 17838 in the dict, which is from the 
                # EWT corpus. The regex captures either the full string containing a float, or the full string
                # containing an int. The string is then converted to a float (instead of an int) to avoid errors 
                # in token identification due to rounding.
                
            else: 
                tok_counter = 0
        except IndexError: 
            # token_list[0] raises IndexError once the list has been emptied
            pass 
        
        
    # while loop to keep calling token_adder (once per sentence) until the list is empty     
    while len(token_list)>0: 
        token_adder(sent_counter,sentence_dict)
        sent_counter += 1


    return sentence_dict

UDen_sent_dict = {}
for file in conllu_filenames: 
    
    conllu_metadata_stripper(file, UDen_sent_dict)
    print ("Processed: ", file, " | total number of sentences: ", len(UDen_sent_dict))
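
The sentence-splitting idea used above (start a new sentence whenever the token ID stops increasing) can also be sketched as a single pass over the token lines. This is a simplified illustration under stated assumptions, not the notebook's implementation: the function name `split_sentences` is hypothetical, and the input is assumed to contain only token lines whose ID is an integer or a decimal, with tab-separated columns.

```python
def split_sentences(token_lines):
    """Group CoNLL-U token lines into sentences.

    A new sentence starts whenever the ID of a line is not larger than
    the ID of the previous line. IDs are compared as floats so that
    decimal IDs such as 18.1 sort after 18.
    """
    sentences = []
    current = []
    last_id = 0.0
    for line in token_lines:
        token_id = float(line.split("\t", 1)[0])
        if current and token_id <= last_id:
            # the numbering restarted, so the previous sentence is complete
            sentences.append(current)
            current = []
        current.append(line)
        last_id = token_id
    if current:
        sentences.append(current)
    return sentences


# toy input: a three-line sentence (with a decimal ID) followed by a one-line sentence
lines = ["1\tI\n", "2\tran\n", "2.1\t_\n", "1\tStop\n"]
grouped = split_sentences(lines)
print(len(grouped))      # 2
print(len(grouped[0]))   # 3
```

A single pass avoids repeatedly popping from the front of a list, which can become slow on large corpora.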

In [None]:
# use the parse function from the conllu package to parse the information for each token
# the parsing results in the creation of a TokenList object which contains each of the tokens in 
# a sentence and separates the conllu data into separate keys
# to access a specific token, <dictionary>[<sentence number>][0][<token number>]

UDen_sent_parsedlist = {}
for sentence in UDen_sent_dict:
    # parse takes a string and returns a list of TokenList objects (here, a list of one);
    # parse_incr expects an open file handle and may reject a plain list of lines
    UDen_sent_parsedlist[sentence] = parse("".join(UDen_sent_dict[sentence]))

In [None]:
##### Pickler in progress
picklemaker("./UDdeprel_data/UDen_sent_parsedlist", UDen_sent_parsedlist)

### 2. UDdeprel classes

In [6]:
class UDdeprel:
    '''
    basic class of objects for capturing tokens with a particular UD dependency relationship, their form and POS tag. 
    This is the dependent node on which the deprel lands. The class also includes functions 
    to find its head, dependents and siblings. 
    --  -- 
    Note that self.head is a UDdeprel object. 
    
    '''
    def __init__(self, sentence_num:int, self_id:int, self_form:str, deprel:str, 
                 self_pos:str, head_id: int, deprel_sub=None, head=None, 
                 deps=None, siblings=None):
        self.form = self_form             # using form instead of lemma because some UD Eng corpora (e.g. CESL) don't have lemma information 
        self.self_id = self_id
        self.deprel = deprel
        self.deprel_sub = deprel_sub
        self.self_pos = self_pos
        self.sentence_num = sentence_num
        self.head_id = head_id
        self.head = head
        self.deps = deps
        self.siblings = siblings
    
    def __str__(self):
        # __str__ must return a string; printing and returning None raises a TypeError in str()
        return "Form: {}  DepRel: {}  DepRel_Sub: {}  PoS: {}  SentNum: {}".format(
            self.form, self.deprel, self.deprel_sub, self.self_pos, self.sentence_num)
    
    def find_head(self, UD_sent_dictionary):
        sentence = UD_sent_dictionary[self.sentence_num][0] # [0] is required by default for every sentence to access its tokens
       
        H_sentence_num = self.sentence_num
        H_self_id = self.head_id                        # this is the id number in the conllu format 
                                                        # and not the same as the dictionary's
                                                        # which is indexed from 0
                
        # find the token having the H_self_id in its "id" key
        # this approach ensures accuracy, since token id numbers are not aligned 
        # with index numbers in sentences (e.g. there are no root tokens), certain 
        # UD English corpora (e.g. EWT) use sub index numbers (e.g. 11 and 11.1) for 
        # labeling the tokens in the corpus. 
        H_token =  [token for token in sentence if token["id"]==H_self_id][0]
        
        H_self_form = H_token["form"]
        H_head_id = H_token["head"]
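        # split the label into main tag and subtag. partition is used instead of lstrip, because lstrip 
        # strips a set of characters rather than a prefix (e.g. "compound:prt" would yield "rt")
        H_deprel, _, H_deprel_sub = H_token["deprel"].partition(":")
        if H_deprel_sub == "":
            H_deprel_sub = None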
        H_self_pos = H_token["upostag"]

        __head = Head(sentence_num = H_sentence_num,
                        self_id = H_self_id, 
                        self_form = H_self_form,
                        head_id = H_head_id,
                        deprel = H_deprel,
                        self_pos = H_self_pos,
                        deprel_sub = H_deprel_sub)
                        
                        # note: self_id stays the token id number from the conllu data; 
                        # no index offset is applied
        self.head = __head
        
    def find_deps(self,UD_sent_dictionary):
        dep_id = self.self_id # this is the index number of the primary UDdeprel token
        sentence = UD_sent_dictionary[self.sentence_num][0] # [0] is required by default for every sentence to access its tokens
        
        D_sentence_num = self.sentence_num
        
        D_tokens = [token for token in sentence if token["head"]==dep_id]
        # the conllu format has a field to capture all the dependents of a particular token. 
        # however, not all UD Eng corpora (e.g. CESL) populate this field. Therefore, 
        # a search of each token in a sentence has to be done here. 
        

        
        __deps = []
        for D_token in D_tokens:
            D_self_id = D_token["id"]      
            D_head_id = D_token["head"]    # again, this is the index given in the conllu format
            D_self_form = D_token["form"]  # no trailing comma here, which would create a one-element tuple
            # split the label into main tag and subtag (partition, not lstrip, which strips characters)
            D_deprel, _, D_deprel_sub = D_token["deprel"].partition(":")
            if D_deprel_sub == "":
                D_deprel_sub = None
            D_self_pos = D_token["upostag"]

            # note: some of the UD English corpora have the POS tag in all lowercase
            # and some in all uppercase; D_self_pos is stored as found. 


            __dep = Dep(sentence_num = D_sentence_num,
                            self_id = D_self_id,
                            self_form = D_self_form,
                            head_id = D_head_id,
                            deprel = D_deprel,
                            self_pos = D_self_pos,
                            deprel_sub = D_deprel_sub)


            __deps.append(__dep)
        self.deps = __deps
            
    def find_siblings(self, UD_sent_dictionary):
        head_id = self.head_id # conllu number of the head of the primary UDdeprel token
        sentence = UD_sent_dictionary[self.sentence_num][0] # [0] is required by default for every sentence to access its tokens
        S_sentence_num = self.sentence_num
        
        # match only the tokens having the head_id, but exclude the one having self.self_id, to find only siblings
        S_tokens = [token for token in sentence if token["head"]==head_id and token["id"] != self.self_id]
        
        __siblings = []
        for S_token in S_tokens:
            S_self_id = S_token["id"]     
            S_head_id = S_token["head"]    # again, this is the index given in the conllu format
            S_self_form = S_token["form"]  # no trailing comma here, which would create a one-element tuple
            # split the label into main tag and subtag (partition, not lstrip, which strips characters)
            S_deprel, _, S_deprel_sub = S_token["deprel"].partition(":")
            if S_deprel_sub == "":
                S_deprel_sub = None
            S_self_pos = S_token["upostag"]

            __sibling = Sibling(sentence_num = S_sentence_num,
                            self_id = S_self_id,
                            self_form = S_self_form,
                            head_id = S_head_id,
                            deprel = S_deprel,
                            self_pos = S_self_pos,
                            deprel_sub = S_deprel_sub)


            __siblings.append(__sibling)
        self.siblings = __siblings

class Head(UDdeprel):
    '''
    basic UDdeprel that inherits all the methods and attributes of the UDdeprel. 
    Amend or add to for the purpose of your analysis. 
    '''
    pass
    
class Dep(UDdeprel): 
    '''
    basic UDdeprel that inherits all the methods and attributes of the UDdeprel. 
    Amend or add to for the purpose of your analysis. 
    '''
    pass

class Sibling(UDdeprel):
    '''
    basic UDdeprel that inherits all the methods and attributes of the UDdeprel. 
    Intended for collecting information about a particular token's sibling tokens (i.e. 
    sharing the same Head node). Amend or add to for the purpose of your analysis. 
    '''
    pass
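
The head, dependent and sibling lookups in the class above all rest on the same idea: compare the `id` and `head` fields of the tokens in one sentence. Here is a minimal, self-contained sketch of that idea on a toy sentence. The plain dictionaries and the function names `head_of`, `dependents_of` and `siblings_of` are illustrative assumptions and are not part of the notebook or of the conllu package.

```python
# Toy sentence "She quickly left": each token keeps its CoNLL-U id and head id.
sentence = [
    {"id": 1, "form": "She", "head": 3, "deprel": "nsubj"},
    {"id": 2, "form": "quickly", "head": 3, "deprel": "advmod"},
    {"id": 3, "form": "left", "head": 0, "deprel": "root"},
]


def head_of(token, sentence):
    """Return the token whose id equals this token's head, or None for the root."""
    return next((t for t in sentence if t["id"] == token["head"]), None)


def dependents_of(token, sentence):
    """Return every token whose head is this token."""
    return [t for t in sentence if t["head"] == token["id"]]


def siblings_of(token, sentence):
    """Return the other tokens that share this token's head."""
    return [t for t in sentence
            if t["head"] == token["head"] and t["id"] != token["id"]]


adverb = sentence[1]
print(head_of(adverb, sentence)["form"])                           # left
print([t["form"] for t in siblings_of(adverb, sentence)])          # ['She']
print([t["form"] for t in dependents_of(sentence[2], sentence)])   # ['She', 'quickly']
```

Matching on the `id` field rather than on list position keeps the lookup correct when token IDs and list indices are not aligned.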

In [None]:
UD_deprel_list = ["nsubj", "obj", "iobj", "csubj", "ccomp", "xcomp", "obl", 
                  "vocative", "expl","dislocated","advcl","advmod", "discourse",
                  "aux","cop", "mark","nmod","appos", "nummod", "acl", "amod", "det", 
                  "clf", "case", "conj", "cc", "fixed", "flat", "compound", "list",
                  "parataxis", "orphan", "goeswith", "reparandum", "punct", "dep"]

# note: we are not including the root dependency relation in our analysis here 
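
The list above holds main tags only, while labels in the corpora may carry a subtag after a colon (e.g. nsubj:pass). A minimal sketch of separating the two parts follows. The helper name `split_deprel` is hypothetical. It uses `str.partition` rather than `str.lstrip`, because `lstrip` removes a set of characters and not a prefix, which can eat into the subtag.

```python
def split_deprel(label):
    """Split a dependency label into (main tag, subtag or None)."""
    main, _, sub = label.partition(":")
    return main, (sub or None)


print(split_deprel("nsubj:pass"))    # ('nsubj', 'pass')
print(split_deprel("advmod"))        # ('advmod', None)

# the pitfall with lstrip: every leading character found in "compound:" is removed,
# including the "p" that belongs to the subtag
print("compound:prt".lstrip("compound:"))    # rt
print(split_deprel("compound:prt"))          # ('compound', 'prt')
```

Comparing the main tag for equality also keeps short tags such as cc from being confused with longer ones such as ccomp.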

### 3. Function to collect basic information about tokens with a particular UD relation landing on them

In [None]:
def deprelfinder(deprel, UD_sent_dictionary):
    '''
    takes a dictionary containing the sentences (and tokens) of one or more UD English corpora, and searches for 
    tokens that have the given dependency relation (these tokens would be the dependent nodes in a dep tree).  
    This function aggregates data in a manner that would allow a more fine-grained analysis. A token's dependency
    relation is set at main tag level, with subtag information captured at the deprel_sub level.  
    _________________
   
    input | deprel:str, the dependency relation of interest; UD_sent_dictionary: dict, a dictionary containing 
    UD sentences as parsed from the conllu format with the conllu python package 
    (see https://github.com/EmilStenstrom/conllu). 
    
    output | writes a pickle containing a list of UDdeprel objects with information about all the tokens in the 
    UD_sent_dictionary that have the dependency relation of interest. The information captured by this function is 
    sentence_num, self_id, self_form, head_id, deprel, deprel_sub, self_pos, plus the head, deps and siblings. 
    returns | failures: list, the UDdeprel objects for which the head, dependents or siblings could not be found. 
    '''
    failures= [] 
    
    __UDdeprel_list = []
    
    # loop through every sentence in the dictionary
    for sentence in UD_sent_dictionary: #sentence is an int from 1 to n, where n is the size of the dictionary. 
        
        # go into every token for each sentence and check for tokens that have the deprel we are interested in. 
        for token in UD_sent_dictionary[sentence][0]:
            
            # compare main tags only, because some deprels have subtags such as nsubj:pass. an exact 
            # comparison also keeps "cc" from matching "ccomp", which a prefix match would allow
            main_tag, _, deprel_sub = token["deprel"].partition(":")
            if main_tag == deprel:
                
                sentence_num=sentence
                self_id = token["id"]   # this is the id number in the conllu format and not the same as the dictionary's
                                        # which is indexed from 0
                self_form=token["form"]
                head_id = token["head"]
                
                if deprel_sub == "":
                    deprel_sub=None
                self_pos=token["upostag"]
                
                __UDdeprel = UDdeprel(sentence_num = sentence_num,
                                      self_id = self_id,
                                      self_form = self_form, 
                                      head_id = head_id,
                                      deprel = deprel, 
                                      self_pos = self_pos,
                                      deprel_sub = deprel_sub)
                
                try:
                    # use the dictionary passed to the function rather than the global variable
                    __UDdeprel.find_head(UD_sent_dictionary)
                    __UDdeprel.find_deps(UD_sent_dictionary)
                    __UDdeprel.find_siblings(UD_sent_dictionary)
                except Exception: 
                    failures.append(__UDdeprel)

                __UDdeprel_list.append(__UDdeprel)
    
    # let's pickle the results
    filename = "./UDdeprel_data/"+deprel+"_UDdeprel_pkl"
    picklemaker(filename, __UDdeprel_list)
    
    return failures


# run the function and start collecting the token info and storing into pickles. if a pickle file with the same 
# filename already exists, the picklemaker function opens the file and dumps the object in it. it overwrites the 
# existing content. if you want to save the version of the previous pickle file, adjust the filename variable in the 
# function. 
failures__ = []
for deprel in UD_deprel_list: 
    failures__.extend(deprelfinder(deprel, UDen_sent_parsedlist))

In [None]:
# failure check on the find_head, find_deps and find_siblings methods. 
len(failures__)
# no failures in acquiring information. next step is some basic unit tests to check accuracy of acquired information. 

##### Notes 
1. Every UD dependency relation has a pickle file now. The start of each pickle filename identifies the dependency relation tag that it is for. 
2. In each pickle file is the list of tokens having that particular dependency relation. The tokens are stored as objects of the UDdeprel class from above. 

### 4. Analysis and visualisation

In [15]:
def basic_statistics(UDdeprel_list):
    '''
    The purpose of this function is to provide basic statistics on the phenomena in one (or more) Universal 
    Dependencies-annotated corpora.
    
    returns | basic_stats which provides information about the set of UDdeprel objects for a particular UD dependency 
    relation tag. 
    
    1a. head_dep_pos    | identifies the prototypical head-dep POS pair. returns a table with the counts and percentage 
                            of a particular POS-pair for the dependency relation tag of interest. 
    1b. deprel_subtags  | identifies the presence (or absence) of subtags in the corpora for the dependency relation tag. 
                            returns a table with count and percentage figures of the subtags. 
    2.  dep_subdep_pos  | identifies the prototypical dep-subdep POS pair and their dependency relation. returns a table 
                            with the counts and percentage of the occurrence across the corpora. 
    3a. deprel_on_head     | identifies the prototypical dependency relation that lands on the head. 
    '''
    
    
    # =+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+=    
    # 1a. governor-dependent POS analysis 
    __head_dep_pos = []
    for dependent in UDdeprel_list: 
        head_POS = dependent.head.self_pos
        dep_POS = dependent.self_pos
        dep_deprel = dependent.deprel
        __head_dep_pos.append(head_POS + "_{}_".format(dep_deprel) + dep_POS)
    
    __head_dep_pos_counter = collections.Counter(x for x in __head_dep_pos if x)
    
    df_1a = pd.DataFrame.from_dict(__head_dep_pos_counter, orient='index').reset_index()
    df_1a.columns = ["head_dep_pos", "count"]
    df_1a.sort_values(by="count", ascending=False, inplace=True)
    df_1a.reset_index(drop=True,inplace=True)
    df_1a["percentage"] = df_1a["count"].copy()
    df_1a["percentage"] =  df_1a["percentage"] / df_1a["count"].sum()*100
    
    basic_stats = {"head_dep_pos": df_1a}
    
    # =+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+=
    # 1b. presence of subtags, and frequency + percentage 
    
    __deprel_subtags = ["none" if dependent.deprel_sub is None 
                        else dependent.deprel_sub for dependent in UDdeprel_list]
    
    __deprel_subtags_counter = collections.Counter(x for x in __deprel_subtags if x)

    df_1b = pd.DataFrame.from_dict(__deprel_subtags_counter, orient='index').reset_index()
    df_1b.columns = ["deprel_subtag", "count"]
    df_1b.sort_values(by="count", ascending=False, inplace=True)
    df_1b.reset_index(drop=True,inplace=True)
    df_1b["percentage"] = df_1b["count"].copy()
    df_1b["percentage"] =  df_1b["percentage"] / df_1b["count"].sum()*100   
    
    basic_stats["deprel_subtags"] = df_1b

    # =+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+=
    # 2. Dependent analysis: gives the most prototypical dependent-subdependent relationship (i.e. their POS 
    # and deprel). 
    
    __dep_subdep_pos = []
    for dependent in UDdeprel_list:
        dep_pos = dependent.self_pos
        __subdeps = dependent.deps
        for __subdep in __subdeps:
            subdep_pos = __subdep.self_pos
            subdep_deprel = __subdep.deprel
            __dep_subdep_pos.append(dep_pos + "_{}_".format(subdep_deprel) + subdep_pos)
    
    __dep_subdep_pos_counter = collections.Counter(x for x in __dep_subdep_pos if x)

    df_2 = pd.DataFrame.from_dict(__dep_subdep_pos_counter, orient='index').reset_index()
    df_2.columns = ["dep_subdep_pos", "count"]
    df_2.sort_values(by="count", ascending=False, inplace=True)
    df_2.reset_index(drop=True,inplace=True)
    df_2["percentage"] = df_2["count"].copy()
    df_2["percentage"] =  df_2["percentage"] / df_2["count"].sum()*100   
    
    basic_stats["dep_subdep_pos"] = df_2
    
    # =+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+=
    # 3a. Head analysis: most common deprel main tag that lands on the head of the UDdeprel
    
    __deprel_on_head = ["none" if dependent.head.deprel is None 
                        else dependent.head.deprel for dependent in UDdeprel_list]
    
    __deprel_on_head_counter = collections.Counter(x for x in __deprel_on_head if x)

    df_3a = pd.DataFrame.from_dict(__deprel_on_head_counter, orient='index').reset_index()
    df_3a.columns = ["deprel_on_head", "count"]
    df_3a.sort_values(by="count", ascending=False, inplace=True)
    df_3a.reset_index(drop=True,inplace=True)
    df_3a["percentage"] = df_3a["count"].copy()
    df_3a["percentage"] =  df_3a["percentage"]/df_3a["count"].sum()*100   
    
    basic_stats["deprel_on_head"] = df_3a
    
    # =+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+=
    # 3b. Head analysis: (i) most common dep_rels that exit the head of the UDdeprel, (ii) most common sets of 
    # dep_rels that exit the head, (iii) most common subsets of dep_rels that exit the head [using intersection]... 
    # filter off heads with single dep_rels [already inferable from (i)]
    
        # not clear if relevant yet. 

    # =+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+==+=+=+=
    # 5. heatmap of the head data and the dependent data. e.g. identify correlation across the dataset.
    
        # work-in-progress
    print("These results are for the __ {} __ dependency relation tag.".format(UDdeprel_list[0].deprel))
    return basic_stats     


In [16]:
# load the pickle file for the dependency relation you are interested in. 
data = pickleloader("./UDdeprel_data/advmod_UDdeprel_pkl")

# run the basic_statistics function on the data
result = basic_statistics(data)

These results are for the __ advmod __ dependency relation tag.


In [8]:
# see available information, stored in keys of a dictionary in the results 
print(result.keys())

# access the results via the keys
result["head_dep_pos"]

dict_keys(['head_dep_pos', 'deprel_subtags', 'dep_subdep_pos', 'deprel_on_head'])


        head_dep_pos  count  percentage
0    VERB_advmod_ADV  15092   48.779857
1     ADJ_advmod_ADV   5065   16.370923
2   VERB_advmod_PART   3065    9.906590
3    NOUN_advmod_ADV   2478    8.009309
4     ADV_advmod_ADV   1498    4.841785
5    ADJ_advmod_PART    549    1.774459
6     NUM_advmod_ADV    350    1.131258
7   NOUN_advmod_PART    346    1.118330
8  VERB_advmod_SCONJ    268    0.866221
9   PROPN_advmod_ADV    232    0.749863
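
The count and percentage columns in tables like the one above come from counting label strings and dividing each count by the total. A dependency-free sketch of that tabulation is shown here on toy data. The function name `count_table` and the toy labels are illustrative assumptions; the notebook itself builds pandas DataFrames.

```python
from collections import Counter


def count_table(labels):
    """Return (label, count, percentage) rows, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, 100 * n / total) for label, n in counts.most_common()]


# toy data: three occurrences of one head_dep_pos label and one of another
toy = ["VERB_advmod_ADV"] * 3 + ["ADJ_advmod_ADV"]
for label, n, pct in count_table(toy):
    print(f"{label}\t{n}\t{pct:.2f}")
```

Factoring the tabulation into one helper would also remove the need to repeat the same DataFrame steps for each statistic.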
