# Collecting VerbNet Semantics

This notebook parses all the VerbNet XML class definitions, extracting the verbs (from the VNCLASS ID and MEMBER names, including VNSUBCLASS MEMBER names) and the details of each applicable frame (both thematic roles and semantics).

The results are very similar to those of the VerbNet API (v0.3.1 at https://hexdocs.pm/verbnet/VerbNet.html#content), but focus on the thematic roles rather than on the part-of-speech parse.

The following dictionaries are created and saved as pickle files:
* Dictionary of verb class IDs as keys with each value = a set (to avoid duplicates) of sentence syntax patterns
  * Class IDs extracted from VNCLASS ID or VNSUBCLASS ID attribute
  * Syntax pattern assembled (in order) from each FRAME SYNTAX element's value attribute
    * This pattern will be aligned with the dependency parse from spaCy for the ROOT verb
* Dictionary of verb text as keys with each value = a set of possible class ID keys
  * Verb text extracted from the text portion of the VNCLASS or VNSUBCLASS ID or the MEMBER name attributes
* Dictionary of verb class ID + syntax pattern (concatenated to create a key) with each value = an array of dictionaries, each with the SEMANTICS PRED value as the key and an array of ARG type-value tuples (for that PRED value) as the value

The pickle files are then moved to the dna/resources directory for use in the application.
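
To illustrate how the three dictionaries fit together, the following is a minimal sketch of a lookup that starts from a verb's text. The dictionary contents are hand-written from the `dedicate-79` class shown in this notebook rather than loaded from the pickle files, and `semantics_for_verb` is a hypothetical helper, not part of the application.

```python
# Hand-written stand-ins for the three dictionaries (shapes follow add_to_dict in this notebook)
verb_texts_to_ids = {'dedicate': {'dedicate-79'}, 'devote': {'dedicate-79'}, 'commit': {'dedicate-79'}}
verb_ids_to_patterns = {'dedicate-79': {('Agent', 'VERB', 'Theme', 'to', 'Goal')}}
verb_idpattern_to_semantics = {
    'dedicate-79 Agent VERB Theme to Goal': [
        {'dedicate': [('Event', 'during(E)'), ('ThemRole', 'Agent'),
                      ('ThemRole', 'Theme'), ('ThemRole', 'Goal')]}
    ]
}


def semantics_for_verb(verb_text: str) -> list:
    """Hypothetical helper: return (class ID, pattern, semantics) triples for a verb."""
    results = []
    for class_id in sorted(verb_texts_to_ids.get(verb_text, set())):
        for pattern in sorted(verb_ids_to_patterns.get(class_id, set())):
            # The semantics key is the class ID and the pattern elements joined by spaces
            sem_key = f'{class_id} {" ".join(pattern)}'
            for semantics in verb_idpattern_to_semantics.get(sem_key, []):
                results.append((class_id, pattern, semantics))
    return results


for class_id, pattern, semantics in semantics_for_verb('devote'):
    print(class_id, pattern, list(semantics.keys()))
```

A verb that is missing from the first dictionary simply yields an empty list, so a caller can treat "unknown verb" and "no frames" the same way.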
    
An example XML structure is:
```
<VNCLASS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ID="dedicate-79" ...> 
    <MEMBERS>
        <MEMBER name="dedicate" wn="dedicate%2:32:00" grouping="dedicate.01"/>
        <MEMBER name="devote" wn="devote%2:32:00" grouping="devote.01"/>
        <MEMBER name="commit" wn="commit%2:32:01 commit%2:40:00" grouping="commit.02"/>
    </MEMBERS>
    <THEMROLES>
        ...
    </THEMROLES>
    <FRAMES>
        <FRAME>
            <DESCRIPTION descriptionNumber="8.1" primary="NP V NP S_ING" secondary="NP-P-ING-SC; to-PP" .../>
            <EXAMPLES>
                <EXAMPLE>I dedicated myself to the cause.</EXAMPLE>
            </EXAMPLES>
            <SYNTAX>
                <NP value="Agent">
                    <SYNRESTRS/>
                </NP>
                <VERB/>
                <NP value="Theme">
                    <SYNRESTRS/>
                </NP>
                <PREP value="to">
                    <SYNRESTRS/>
                </PREP>
                <NP value="Goal">
                    <SYNRESTRS/>
                </NP>
            </SYNTAX>
            <SEMANTICS>
                <PRED value="dedicate">
                    <ARGS>
                        <ARG type="Event" value="during(E)"/>
                        <ARG type="ThemRole" value="Agent"/>
                        <ARG type="ThemRole" value="Theme"/>
                        <ARG type="ThemRole" value="Goal"/>
                    </ARGS>
                </PRED>
            </SEMANTICS>
        </FRAME>
        <FRAME>
            <DESCRIPTION descriptionNumber="0.2" primary="NP V NP PP.goal" secondary="NP-PP; to-PP" .../>
            <EXAMPLES>
                <EXAMPLE>I dedicated myself to the cause.</EXAMPLE>
            </EXAMPLES>
            <SYNTAX>
                <NP value="Agent">
                    <SYNRESTRS/>
                </NP>
                <VERB/>
                <NP value="Theme">
                    <SYNRESTRS/>
                </NP>
                <PREP value="to">
                    <SELRESTRS/>
                </PREP>
                <NP value="Goal">
                    <SYNRESTRS>
                        <SYNRESTR Value="-" type="sentential"/>
                    </SYNRESTRS>
                </NP>
            </SYNTAX>
            <SEMANTICS>
                <PRED value="dedicate">
                    <ARGS>
                        <ARG type="Event" value="during(E)"/>
                        <ARG type="ThemRole" value="Agent"/>
                        <ARG type="ThemRole" value="Theme"/>
                        <ARG type="ThemRole" value="Goal"/>
                    </ARGS>
                </PRED>
            </SEMANTICS>
        </FRAME>
    </FRAMES>
    <SUBCLASSES/>
</VNCLASS>
```

The above results in the following dictionary entries:
* Dictionary of verb class IDs as keys with each value = a set of sentence syntax patterns
  * Key = 'dedicate-79', Value = set with 1 tuple consisting of 'Agent', 'VERB', 'Theme', 'to', 'Goal' (both frames yield the same pattern, so the set keeps a single copy)
* Dictionary of verb text as keys with each value = a set of possible class ID keys
  * Key = 'dedicate', Value = set containing 'dedicate-79'
  * Key = 'devote', Value = set containing 'dedicate-79'
  * Key = 'commit', Value = set containing 'dedicate-79'
* Dictionary of verb class ID + syntax pattern (concatenated to create a key) with each value = an array of dictionaries, each with the semantic PRED as the key and an array of ARG type-value tuples as the value
  * Key = 'dedicate-79 Agent VERB Theme to Goal', Value = array with 1 dictionary (the two frames have identical semantics) with members:
    * Key = 'dedicate' and Value = array of ('Event','during(E)'), ('ThemRole', 'Agent'), ('ThemRole', 'Theme'), ('ThemRole', 'Goal')
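
As a quick check of how the syntax pattern and the semantics key are assembled, the sketch below parses a single frame using only `xml.etree.ElementTree`. The XML string is a shortened, hand-edited copy of one `dedicate-79` frame (an assumption for illustration), and the extraction logic mirrors the notebook's functions without being the same code.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-edited copy of one dedicate-79 frame (illustration only)
xml_text = """
<VNCLASS ID="dedicate-79">
    <FRAMES>
        <FRAME>
            <SYNTAX>
                <NP value="Agent"/>
                <VERB/>
                <NP value="Theme"/>
                <PREP value="to"/>
                <NP value="Goal"/>
            </SYNTAX>
            <SEMANTICS>
                <PRED value="dedicate">
                    <ARGS>
                        <ARG type="Event" value="during(E)"/>
                        <ARG type="ThemRole" value="Agent"/>
                    </ARGS>
                </PRED>
            </SEMANTICS>
        </FRAME>
    </FRAMES>
</VNCLASS>
"""

root = ET.fromstring(xml_text)
frame = root.find('./FRAMES/FRAME')

# Elements without a value attribute (such as VERB) contribute their tag name
pattern = [child.attrib.get('value', child.tag) for child in frame.find('SYNTAX')]
# The semantics key is the class ID followed by the space-joined pattern
sem_key = f"{root.attrib['ID']} {' '.join(pattern)}"
semantics = {pred.attrib['value']: [(arg.attrib['type'], arg.attrib['value'])
                                    for arg in pred.findall('./ARGS/ARG')]
             for pred in frame.findall('./SEMANTICS/PRED')}

print(pattern)
print(sem_key)
print(semantics)
```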

In [1]:
# Imports
import json
from pathlib import Path
import pickle
import xml.etree.ElementTree as ET

In [2]:
# Constants
verbnet_dir = '/Users/andreaw/Documents/VerbNet3.4'

In [4]:
# Functions
def add_to_dict(dictionary: dict, key: str, value):
    """Add value to the collection stored under key, skipping duplicates."""
    # Dictionaries are not hashable (so can't be de-duplicated using a set);
    # they are kept in a list and compared as JSON strings instead
    if isinstance(value, dict):
        dict_values = dictionary.setdefault(key, [])
        # Turn the current (dict) values into JSON strings for comparison
        existing = {json.dumps(dict_value) for dict_value in dict_values}
        # Only append the 'new' value if its JSON string is not already present
        if json.dumps(value) not in existing:
            dict_values.append(value)
        return

    # Other value types are stored in a set; lists become tuples so that they are hashable
    new_value = tuple(value) if isinstance(value, list) else value
    dictionary.setdefault(key, set()).add(new_value)


def extract_frame_syntax(etree) -> list:
    syn_list = list()
    for child in etree:
        if 'value' not in child.attrib:
            value = child.tag
        else:
            value = child.attrib['value']
        syn_list.append(value)
    return syn_list


def extract_frame_semantics(etree) -> dict:
    sem_dict = dict()
    for pred in etree.findall('PRED'):
        arg_list = list()
        for arg in pred.findall('./ARGS/ARG'):
            arg_list.append((arg.attrib['type'], arg.attrib['value']))
        # Note: if a frame repeats a PRED value, the later ARG list replaces the earlier one
        sem_dict[pred.attrib['value']] = arg_list
    return sem_dict


def get_verb_details(etree) -> None:
    # Updates the module-level verb_text_dict, verb_pattern_dict and verb_sem_dict in place
    # Get the class ID
    verb_id = etree.attrib['ID']
    # Add vn class verb to the verb_text_dict
    add_to_dict(verb_text_dict, verb_id.split('-')[0], verb_id)
    # Add the member (similar) verbs to the verb_text_dict
    for member in get_verbs_with_similar_structure(etree):
        add_to_dict(verb_text_dict, member, verb_id)
    
    for frame in etree.findall('./FRAMES/FRAME'):
        # Add the syntax pattern to the verb ID in the verb_pattern_dict
        syn_pattern = extract_frame_syntax(frame.find('SYNTAX'))
        add_to_dict(verb_pattern_dict, verb_id, syn_pattern)
        # Get the semantic details for the pattern and add it to the verb_sem_dict
        add_to_dict(verb_sem_dict, f'{verb_id} {" ".join(syn_pattern)}', 
                    extract_frame_semantics(frame.find('SEMANTICS')))
    
    # Recursively process the subclasses
    for subclass in etree.findall('./SUBCLASSES/VNSUBCLASS'):
        get_verb_details(subclass)
        

def get_verbs_with_similar_structure(etree) -> list:
    member_list = list()
    for member in etree.findall('./MEMBERS/MEMBER'):
        member_list.append(member.attrib['name'])
    return member_list
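
The duplicate handling in `add_to_dict` relies on two ideas: lists are converted to tuples so that they can live in a set, and dictionaries are compared through their JSON serialization. A minimal standalone sketch of both follows. `dedupe_key` is a hypothetical name, and passing `sort_keys=True` is an optional variation (not what the notebook does) that makes the comparison independent of key order.

```python
import json


def dedupe_key(value: dict) -> str:
    """Hypothetical helper: a hashable, order-independent stand-in for a dictionary."""
    return json.dumps(value, sort_keys=True)


# Lists are unhashable, so they are stored as tuples and the set removes repeats
patterns = set()
for syntax in (['Agent', 'VERB', 'Theme'], ['Agent', 'VERB', 'Theme']):
    patterns.add(tuple(syntax))

# Dictionaries are kept in a list; their JSON strings are used to detect repeats
semantics = []
seen = set()
for sem in ({'dedicate': [('ThemRole', 'Agent')], 'cause': []},
            {'cause': [], 'dedicate': [('ThemRole', 'Agent')]}):
    key = dedupe_key(sem)
    if key not in seen:
        seen.add(key)
        semantics.append(sem)

print(len(patterns), len(semantics))
```

Without `sort_keys=True`, the two dictionaries in the second loop would serialize to different strings and both would be kept, which may or may not matter depending on how consistently the PRED elements are ordered in the source files.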


In [5]:
# Dictionaries to be created
verb_pattern_dict = dict()
verb_text_dict = dict()
verb_sem_dict = dict()

# Process each of the VerbNet files
file_list = Path(verbnet_dir).glob('**/*.xml')
for file_path in file_list:
    # Parse the file and get the root VNCLASS element
    # (the parser reads the file as bytes, so it does not depend on the platform's default text encoding)
    vn_class = ET.parse(file_path).getroot()
    # Process from the top down, recursively
    get_verb_details(vn_class)

In [6]:
with open('verb_ids_to_patterns.pickle', 'wb') as out_file:
    pickle.dump(verb_pattern_dict, out_file)
with open('verb_texts_to_ids.pickle', 'wb') as out_file:
    pickle.dump(verb_text_dict, out_file)
with open('verb_idpattern_to_semantics.pickle', 'wb') as out_file:
    pickle.dump(verb_sem_dict, out_file)
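
Before the files are moved, a round-trip check can confirm that the pickled structures load back unchanged. The sketch below is self-contained and writes a small hand-made dictionary to a temporary directory; the sample data and the temporary location are assumptions for illustration, and in the notebook the same check would be run against the three files written by the cell above.

```python
import pickle
import tempfile
from pathlib import Path

# Small hand-made stand-in for verb_pattern_dict (a set of tuples per class ID)
sample = {'dedicate-79': {('Agent', 'VERB', 'Theme', 'to', 'Goal')}}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / 'verb_ids_to_patterns.pickle'
    with open(pickle_path, 'wb') as out_file:
        pickle.dump(sample, out_file)
    with open(pickle_path, 'rb') as in_file:
        restored = pickle.load(in_file)

# Sets and tuples survive pickling, which a JSON round trip would not preserve
print(restored == sample)
```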