# Multilingual Digital Story Grammar

In this notebook, we aim to implement a version of Digital Story Grammar (DSG; Bastholm Andrade & Andersen; [link](https://www.tandfonline.com/doi/abs/10.1080/13645579.2020.1723205)) that works with multiple languages (in particular Dutch, German, Danish, and English). The code will interface to the spaCy NLP library that has pretrained and easy-to-use pipelines for many languages available.

The goal of the method is to extract the subject (actor), main verb (action), and object of each sentence or phrase. In DSG, these are referenced as *narrative units* and enable the construction of character networks. This notebook will use spaCy's `DependencyMatcher` to extract patterns in dependency relations of each sentence. However, each language has somewhat different dependency relations so it requires a specific set of patterns to be matched. German in particular uses a completely different notation for dependency relations, so it requires entirely different patterns that the other three languages (see this [link](https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/annotation/tiger_scheme-syntax.pdf) for the notation scheme). 

The patterns for each language are store in a separate `multilingual_dsg_patterns_xx.json` file. The pattern files consist of a nested list of dictionaries `[[{}, {}, ...], ...]`. Each list of dictionaries `[{},{}, ...]` represents a pattern of dependency relations that will be matched. Each dictionary in the list represents a token that is part of the dependency relations pattern. The first dictionary in each pattern has usually two entries: `RIGHT_ID` indicating a name for the token in the pattern; and `RIGHT_ATTRS` listing the required attributes of the token for a match (e.g., `{"DEP": "nsubj"}` means that the token must be a subject in a phrase). There can be `OR` or `NOT` type matches by using `IN` or `NOT_IN` keys for dictionaries (see the pattern files for examples). Subsequent dictionaries in a pattern have two additional entries: `LEFT_ID` which refers to another token in the parsing tree to which the current token is related; and `REL_OP` specifying the relation between the other token and the token to be matched. An overview of possible relations is presented on https://spacy.io/api/dependencymatcher. For example `REL_OP: >` indicates a match when the `LEFT_ID` token is the head of the `RIGHT_ID` token in the parsing tree.

Example:
```
[
  {
    "RIGHT_ID": "verb",
    "RIGHT_ATTRS": {"POS": {"IN": ["VERB", "AUX"]}}
  },
  {
    "LEFT_ID": "verb",
    "REL_OP": ">",
    "RIGHT_ID": "subj",
    "RIGHT_ATTRS": {"DEP": "nsubj"}
  },
  {
     "LEFT_ID": "verb",
     "REL_OP": ">>",
     "RIGHT_ID": "obj",
     "RIGHT_ATTRS": {"DEP": "pobj"}
  }
]

if a word has POS tag VERB or AUX and
   the word has an immediate dependent (DEP) word with relation nsubj and
   the word has a dependent (DEP) word with relation pobj 
then word1 is a verb, word2 is a subj and word3 is an obj 
```

A sentence or phrase can have multiple matches for a pattern (i.e., multiple objects or conjuncts), and for each match a row is added to the output table.

A step-by-step example can also be found on https://spacy.io/api/dependencymatcher.

A list of the universal dependency tags is here: https://universaldependencies.org/docs/en/dep/

In [1]:
""" Multilingual Digital Story Grammar """

import os
import json
import spacy
import numpy as np
import pandas as pd
import networkx as nx
import warnings
# Change this to use different language as example; see spacy.io/models
from spacy.lang.en.examples import sentences as en_sentences
from spacy.lang.nl.examples import sentences as nl_sentences
from spacy.lang.de.examples import sentences as de_sentences
from spacy.lang.da.examples import sentences as da_sentences
from spacy.matcher import DependencyMatcher
from spacy import displacy

if spacy.__version__ < "3":
    warnings.warn(
        "Module 'spacy' should be version >= 3.0 to run this notebook without errors")
if pd.__version__ < "1.0":
    warnings.warn(
        "Module 'pandas' should be version >= 1.0 to run this notebook without errors")


In [2]:
EN_SPACY_PIPELINE = "en_core_web_sm"
EN_DEPENDENCY_PATTERN_FILE = "multilingual_dsg_patterns_en.json"

NL_SPACY_PIPELINE = "nl_core_news_sm"
NL_DEPENDENCY_PATTERN_FILE = "multilingual_dsg_patterns_nl.json"

DE_SPACY_PIPELINE = "de_core_news_sm"
DE_DEPENDENCY_PATTERN_FILE = "multilingual_dsg_patterns_de.json"

DA_SPACY_PIPELINE = "da_core_news_sm"
DA_DEPENDENCY_PATTERN_FILE = "multilingual_dsg_patterns_da.json"


In [3]:
# Define a few test sentences
examples = [
    "The bird flew over the roof.",
    "The cow ate the grass. The goat watched the cow.",
    "The cow ate the grass while the goat watched the cow.",
    "The goat watched the cow which was eating grass.",
    "The goat attempted eating the cow's grass.",
    "The cow ate grass, the cow ate butter.",
    "The cow and the goat ate grass.",
    "The grass was eaten by the cow, the goat and the bird.",
]


In [4]:
def load_spacy_pipeline(name):
    """Check if the spacy language pipeline was downloaded and load it.
    Downloads the language pipeline if not available.

    Args:
        name (string): Name of the spacy language.

    Returns:
        spacy.language.Language: The spacy language pipeline
    """
    if spacy.util.is_package(name):
        nlp = spacy.load(name)
    else:
        os.system(f"spacy download {name}")
        nlp = spacy.load(name)
    return nlp


In [5]:
def check_dict_in_list(dict_obj, dict_list):
    """Check if a dictionary (partially) matches a list of dictionaries.

    Note: This function is used to avoid duplicate matches (e.g., Subj+Verb in Subj+Verb+Obj)

    Args:
        dict_obj (dict): A dictionary object.
        dict_list (list): A list of dictionary objects.

    Returns:
        bool: True if all non-empty items in dict_obj match the items in any dictionary objects in dict_list, otherwise False.
    """
    if dict_obj in dict_list:
        return True

    check = [False] * len(dict_obj.keys())

    for i, key in enumerate(dict_obj.keys()):
        if str(dict_obj[key]) == "_":
            check[i] = True
            next
        else:
            for ref_dict in dict_list:
                if dict_obj[key].i == ref_dict[key].i:
                    check[i] = True
                    break

    return all(check)


In [6]:
def extract_matches(doc, matches, matcher, nlp, keys):
    """Extract the matched tokens for selected keys.

    Args:
        doc (spacy.tokens.Doc): A spacy doc object as returned by a spacy language pipeline.
        matches (list): A list of (match_id, token_ids) tuples as returned by a spacy dependency matcher.
        matcher (spacy.matcher.DependencyMatcher): A spacy dependency matcher object.
        nlp (spacy.language.Language): A spacy language pipeline.
        keys (list): A list of keys to which the dependcy matches are assigned.

    Returns:
        list: A list of dictionaries that each contain a match of the dependency matcher. 
            Has the same keys as the `keys` argument. Empty keys contain a spacy token with text='_'.
    """
    matches_list = []

    for l, (match_id, token_ids) in enumerate(matches):
        match_dict = {}

        for key in keys:
            match_dict[key] = nlp("_")[0]
            
        for k, token_id in enumerate(token_ids):
            key = matcher.get(match_id)[1][0][k]["RIGHT_ID"]
            if key in match_dict.keys():
                match_dict[key] = doc[token_id]

        if not check_dict_in_list(match_dict, matches_list):
            match_dict["match_id"] = match_id
            matches_list.append(match_dict)

    return matches_list


In [7]:
def create_matcher(nlp, pattern_file):
    """Create a spacy dependency matcher.

    Args:
        nlp (spacy.language.Language): A spacy language pipeline.
        pattern_file (str): The path to the dependency pattern .json file for the matcher.

    Returns:
        spacy.matcher.DependencyMatcher: A spacy dependency matcher object.
    """
    matcher = DependencyMatcher(nlp.vocab, validate=True)

    with open(pattern_file, "r") as file:
        patterns = json.load(file)

    for i, pattern in enumerate(patterns):
        matcher.add(i, [pattern])

    return matcher


In [8]:
def get_children_ids(token, children_deps, ids):
    for child in token.children:
        if child.dep_ in children_deps:
            ids.append(child.i)
            ids = get_children_ids(child, children_deps, ids)
    return ids

In [9]:
def append_children_deps(token, doc, children_deps):
    """Append children to a token based on dependency tag.

    Note: This function is used to append words of a noun compound.

    Args:
        token (spacy.token.Token): A spacy token object.
        doc (spacy.token.Doc): A spacy doc object that includes the token.
        children_deps (list): A list of dependency tags.

    Returns:
        spacy.token.Token: A span of spacy tokens (token argument plus children with specified dependency tags) 
        if token argument is non-empty, the token argument otherwise.
    """

    if str(token) != "_":
        children_match_idx = get_children_ids(token, children_deps, [token.i])
        span = doc[min(children_match_idx):max(children_match_idx)+1]

        return span
    else:
        return ""


In [10]:
def get_subject_object_verb_table(docs, nlp, matcher, keys=["verb", "subj", "obj", "comp", "prep", "aux", "subjadj", "objadj", "obl", "case", "case_arg", "objfixed", ]):
    """Construct a pandas dataframe with subjects, verbs, and objects per sentence of documents.

    Args:
        docs (list): A list of text strings.
        nlp (spacy.language.Language): A spacy language pipeline.
        matcher (spacy.matcher.DependencyMatcher): A spacy dependency matcher object.
        keys (list): A list of keys to which the dependency matches are assigned. 
            Defaults to subjects, verbs, and objects.

    Returns:
        pandas.DataFrame: A dataframe with a row for each match of the dependency matcher and cols:
            doc_id (str): Index of the document in the document list.
            sent_id (str): Index of the sentence in the document.
            sent (spacy.tokens.Span): A spacy span object with the sentence.
            match_id (str): Index of the match in the sentence.

            For each key in the `keys` argument:
                key (spacy.tokens.Token): A spacy token object that matches the dependency matcher patterns.
    """
    docs_piped = nlp.pipe(docs)
        
    table_dict = {
        "doc_id": [],
        "sent_id": [],
        "sent": [],
        "match_id": [],
        "subj": [],
        "verb": [],
        "obj": [],
        "comp": [],
        "prep": [],
        "aux": [],
        "subjadj": [],
        "objadj": [],
        "obl": [],
        "case": [],
        "case_arg": [],
        "objfixed": [],
    }

    for i, doc in enumerate(docs_piped): # i: doc index
        if DEBUG:
            for token in doc:
                print(token, token.pos_, token.dep_, token.head)
        for j, sent in enumerate(doc.sents): # j: sent index
            matches = matcher(sent)
            matches_list = extract_matches(
                sent, matches, matcher, nlp, keys=keys)
            for l, match in enumerate(matches_list): # l: match index
                table_dict["doc_id"].append(str(i))
                table_dict["sent_id"].append(str(j))
                table_dict["sent"].append(sent.text)
                table_dict["match_id"].append(str(match["match_id"]))

                for key in keys:
                    table_dict[key].append(append_children_deps(match[key], doc, ["compound", "flat"]))

                    # Check for conjuncts, and add table row for each
                    for conj in match[key].conjuncts:
                        table_dict["doc_id"].append(str(i))
                        table_dict["sent_id"].append(str(j))
                        table_dict["sent"].append(sent.text)
                        table_dict["match_id"].append(str("?"))
                        table_dict[key].append(conj)
                        for key_conj in keys:
                            if key != key_conj:
                                table_dict[key_conj].append(match[key_conj])
                if DEBUG:
                    print("")
                                
    for i in range(0, len(table_dict["comp"])):
        # insert table_dict["comp"][i] in table_dict["verb"][i]) here
        pass
    
    return pd.DataFrame(table_dict)


In [11]:
def plot_character_network(sov_table_df):
    """Plots a subject-object-verb table as a character network. The edge weights are the sentiment scores of the verbs.

    Args:
        sov_table_df (pandas.DataFrame): A pandas data frame containing a subject, verb, and object token in each row

    """
    sov_table_df["sentiment"] = pd.Series([verb.sentiment for verb in sov_table_df["verb"]], dtype=float)
    sov_table_df["subj_text"] = pd.Series([subj.text for subj in sov_table_df["subj"]], dtype=str)
    sov_table_df["obj_text"] = pd.Series([obj.text for obj in sov_table_df["obj"]], dtype=str)

    char_net = nx.from_pandas_edgelist(
        sov_table_df,
        source="subj_text",
        target="obj_text",
        edge_attr="sentiment"
    )
    
    nx.draw_networkx(char_net)

In [12]:
def span_to_string(column):
    new_column = []
    for item in column:
        new_column.append(str(item))
    return new_column

In [13]:
def add_verb_group_column(result_table):
    subjadj_column = []
    objadj_column = []
    rows_to_be_deleted = []
    for i, row in result_table.iterrows():
        subjadj_column.append(str(row["subjadj"]))
        objadj_column.append(str(row["objadj"]))
        if i > 0 and str(row["doc_id"]) == str(result_table.loc[i-1]["doc_id"]) and str(row["subjadj"]) != "":
            subjadj_column[-2] += " " + str(row["subjadj"])
            rows_to_be_deleted.append(i)
        if i > 0 and str(row["doc_id"]) == str(result_table.loc[i-1]["doc_id"]) and str(row["objadj"]) != "":
            objadj_column[-2] += " " + str(row["objadj"])
            rows_to_be_deleted.append(i)
    
    result_table["subjadj"] = subjadj_column
    result_table["objadj"] = objadj_column
    for row_id in rows_to_be_deleted:
        result_table = result_table.drop(row_id)
    verb_group_column = []
    subj_extended_column = []
    obj_extended_column = []
    means_column = []
    for i, row in result_table.iterrows():
        verb_group = str(row["verb"])
        subj_extended = str(row["subj"])
        obj_extended = str(row["obj"])
        means = ""
        if str(row["aux"]) != "_":
            verb_group = str(row["aux"]) + " " + verb_group
        if str(row["prep"]) != "_":
            verb_group += " " + str(row["prep"])
        if str(row["comp"]) != "_":
            verb_group += " " + str(row["comp"])
        if str(row["subjadj"]) != "_":
            subj_extended = str(row["subjadj"]) + " " + subj_extended
        if str(row["objadj"]) != "_":
            obj_extended = str(row["objadj"]) + " " + obj_extended
        if str(row["objfixed"]) != "_":
            obj_extended = obj_extended + " " + str(row["objfixed"])
        if str(row["obl"]) != "_" and str(row["case"]) != "_":
            means = str(row["case"]) + " " + str(row["case_arg"]) + " " + str(row["obl"])
        verb_group_column.append(verb_group)
        subj_extended_column.append(subj_extended)
        obj_extended_column.append(obj_extended)
        means_column.append(means)
    result_table["verb group"] = verb_group_column
    result_table["subj_extended"] = subj_extended_column
    result_table["obj_extended"] = obj_extended_column
    result_table["means"] = means_column
    result_table = result_table[["doc_id", "sent_id", "sent", "match_id", "subj_extended", "verb group", "obj_extended", "means"]]
    return result_table


In [14]:
def combine_rows(result_table):
    near_duplicate_successive_rows = {}
    for i, row in result_table.iterrows():
        if i > 0:
            different_columns = []
            for column_name in result_table.loc[i].keys():
                if str(result_table.loc[i][column_name]) != str(result_table.loc[i-1][column_name]):
                    different_columns.append(column_name)
            if len(different_columns) == 1:
                near_duplicate_successive_rows[i-1] = different_columns[0]
    for i, row in result_table.iterrows():
        if i in near_duplicate_successive_rows:
            row[near_duplicate_successive_rows[i]] = str(row[near_duplicate_successive_rows[i]]) + " " + str(result_table.loc[i+1][near_duplicate_successive_rows[i]])
    for i, row in result_table.iterrows():
        if i-1 in near_duplicate_successive_rows:
            result_table = result_table.drop(i)
    return result_table.reset_index(drop=True)


## English

In [15]:
DEBUG = False


In [16]:
nlp_en = load_spacy_pipeline(EN_SPACY_PIPELINE)

matcher_en = create_matcher(nlp_en, EN_DEPENDENCY_PATTERN_FILE)

EXTRA_SENT = "The quick brown fox jumps over the lazy dog."
if EXTRA_SENT not in en_sentences:
    en_sentences.append(EXTRA_SENT)
en_sentences

['Apple is looking at buying U.K. startup for $1 billion',
 'Autonomous cars shift insurance liability toward manufacturers',
 'San Francisco considers banning sidewalk delivery robots',
 'London is a big city in the United Kingdom.',
 'Where are you?',
 'Who is the president of France?',
 'What is the capital of the United States?',
 'When was Barack Obama born?',
 'The quick brown fox jumps over the lazy dog.']

In [23]:
result_table = get_subject_object_verb_table(en_sentences, nlp_en, matcher_en)

NameError: name 'nlp_en' is not defined

In [18]:
result_table

Unnamed: 0,doc_id,sent_id,sent,match_id,subj,verb,obj,comp,prep,aux,subjadj
0,0,0,Apple is looking at buying U.K. startup for $1...,4,(Apple),(looking),(U.K.),(buying),(at),(is),
1,1,0,Autonomous cars shift insurance liability towa...,0,(cars),(shift),"(insurance, liability)",,,,(Autonomous)
2,2,0,San Francisco considers banning sidewalk deliv...,3,"(San, Francisco)",(considers),"(sidewalk, delivery, robots)",(banning),,,
3,3,0,London is a big city in the United Kingdom.,7,(London),(is),,,,,
4,4,0,Where are you?,7,(you),(are),,,,,
5,6,0,What is the capital of the United States?,7,(capital),(is),,,,,
6,7,0,When was Barack Obama born?,5,"(Barack, Obama)",(born),,,,(was),
7,8,0,The quick brown fox jumps over the lazy dog.,6,(fox),(jumps),,,,,(quick)
8,8,0,The quick brown fox jumps over the lazy dog.,6,(fox),(jumps),,,,,(brown)


In [19]:
result_table = combine_rows(result_table)

In [20]:
result_table_combined = add_verb_group_column(result_table)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


For sentence 0, we would like `U.K. startup` as object but we cannot have it because the dependency parser classifies `startup` as a verb.

In [21]:
result_table_combined

Unnamed: 0,doc_id,sent_id,sent,match_id,subj_extended,verb group,obj
0,0,0,Apple is looking at buying U.K. startup for $1...,4,Apple,is looking at buying,U.K.
1,1,0,Autonomous cars shift insurance liability towa...,0,Autonomous cars,shift,insurance liability
2,2,0,San Francisco considers banning sidewalk deliv...,3,San Francisco,considers banning,sidewalk delivery robots
3,3,0,London is a big city in the United Kingdom.,7,London,is,
4,4,0,Where are you?,7,you,are,
5,6,0,What is the capital of the United States?,7,capital,is,
6,7,0,When was Barack Obama born?,5,Barack Obama,was born,
7,8,0,The quick brown fox jumps over the lazy dog.,6,quick brown fox,jumps,


In [22]:
sov_table_en = get_subject_object_verb_table(examples, nlp_en, matcher_en)
print(sov_table_en)
plot_character_network(sov_table_en)

   doc_id sent_id                                               sent match_id  \
0       0       0                       The bird flew over the roof.        7   
1       1       0                             The cow ate the grass.        1   
2       1       1                          The goat watched the cow.        1   
3       2       0  The cow ate the grass while the goat watched t...        1   
4       2       0  The cow ate the grass while the goat watched t...        1   
5       2       0  The cow ate the grass while the goat watched t...        3   
6       3       0   The goat watched the cow which was eating grass.        1   
7       3       0   The goat watched the cow which was eating grass.        1   
8       3       0   The goat watched the cow which was eating grass.        5   
9       4       0         The goat attempted eating the cow's grass.        3   
10      5       0             The cow ate grass, the cow ate butter.        1   
11      5       0           

AttributeError: 'str' object has no attribute 'text'

## Dutch

In [15]:
DEBUG = False

In [16]:
nlp_nl = load_spacy_pipeline(NL_SPACY_PIPELINE)

matcher_nl = create_matcher(nlp_nl, NL_DEPENDENCY_PATTERN_FILE)

EXTRA_SENT = "Op brute wijze ving de schooljuf de quasi-kalme lynx"
if EXTRA_SENT not in nl_sentences:
    nl_sentences.append(EXTRA_SENT)
nl_sentences

['Apple overweegt om voor 1 miljard een U.K. startup te kopen',
 "Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
 'San Francisco overweegt robots op voetpaden te verbieden',
 'Londen is een grote stad in het Verenigd Koninkrijk',
 'Op brute wijze ving de schooljuf de quasi-kalme lynx']

In [17]:
result_table = get_subject_object_verb_table(nl_sentences, nlp_nl, matcher_nl)

In [18]:
result_table = combine_rows(result_table)

In [19]:
result_table

Unnamed: 0,doc_id,sent_id,sent,match_id,subj,verb,obj,comp,prep,aux,subjadj,objadj,obl,case,case_arg,objfixed
0,0,0,Apple overweegt om voor 1 miljard een U.K. sta...,1,(Apple),(overweegt),(U.K.),(kopen),om te,,,,(miljard),(voor),(1),(startup)
1,1,0,Autonome auto's verschuiven de verzekeringvera...,4,(auto's),(verschuiven),(verzekeringverantwoordelijkheid),,,,,,,,,
2,2,0,San Francisco overweegt robots op voetpaden te...,6,"(San, Francisco)",(overweegt),,(verbieden),(te),,,,,,,
3,3,0,Londen is een grote stad in het Verenigd Konin...,8,(Londen),(is),(stad),,,,,(grote),,,,
4,4,0,Op brute wijze ving de schooljuf de quasi-kalm...,3,(schooljuf),(ving),(quasi-kalme),,,,,,(wijze),(Op),(brute),


In [20]:
result_table_combined = add_verb_group_column(result_table)

In [21]:
result_table_combined

Unnamed: 0,doc_id,sent_id,sent,match_id,subj_extended,verb group,obj_extended,means
0,0,0,Apple overweegt om voor 1 miljard een U.K. sta...,1,Apple,overweegt om te kopen,U.K. startup,voor 1 miljard
1,1,0,Autonome auto's verschuiven de verzekeringvera...,4,auto's,verschuiven,verzekeringverantwoordelijkheid,
2,2,0,San Francisco overweegt robots op voetpaden te...,6,San Francisco,overweegt te verbieden,,
3,3,0,Londen is een grote stad in het Verenigd Konin...,8,Londen,is,grote stad,
4,4,0,Op brute wijze ving de schooljuf de quasi-kalm...,3,schooljuf,ving,quasi-kalme,Op brute wijze


Parser and tagger errors preventing correct analysis:
1. sentence 1: `producenten` is attached to `verzekeringverantwoordelijkheid` instead of to `verschuiven`
2. sentence 2: `robots op voetpaden` is not identified as object
3. sentence 5: `lynx` is identified as a name (PROPN)

In [22]:
displacy.render(nlp_nl(nl_sentences[4]), style="dep", options={"distance": 90})

## German

In [37]:
nlp_de = load_spacy_pipeline(DE_SPACY_PIPELINE)

matcher_de = create_matcher(nlp_de, DE_DEPENDENCY_PATTERN_FILE)

de_sentences

['Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen',
 'Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz',
 'Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz',
 'Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion',
 'San Francisco erwägt Verbot von Lieferrobotern',
 'Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller',
 'Wo bist du?',
 'Was ist die Hauptstadt von Deutschland?']

In [38]:
get_subject_object_verb_table(de_sentences, nlp_de, matcher_de)

Unnamed: 0,doc_id,sent_id,sent,match_id,subj,verb,obj,comp,prep,aux,subjadj
0,0,0,Die ganze Stadt ist ein Startup: Shenzhen ist ...,0,(Stadt),(ist),(Startup),_,_,_,_
1,0,0,Die ganze Stadt ist ein Startup: Shenzhen ist ...,1,(Shenzhen),(ist),_,_,_,_,_
2,0,0,Die ganze Stadt ist ein Startup: Shenzhen ist ...,1,(Valley),(ist),_,_,_,_,_
3,1,0,Wie deutsche Startups die Technologie vorantre...,0,(Startups),(wollen),(Technologie),_,_,_,_
4,2,0,Trend zum Urlaub in Deutschland beschert Gastw...,0,(Trend),(beschert),(Gastwirten),_,_,_,_
5,2,0,Trend zum Urlaub in Deutschland beschert Gastw...,0,(Trend),(beschert),(Umsatz),_,_,_,_
6,3,0,Bundesanwaltschaft erhebt Anklage gegen mutmaß...,0,(Bundesanwaltschaft),(erhebt),(Anklage),_,_,_,_
7,3,0,Bundesanwaltschaft erhebt Anklage gegen mutmaß...,2,(Bundesanwaltschaft),(erhebt),(Spion),_,_,_,_
8,4,0,San Francisco erwägt Verbot von Lieferrobotern,0,(Francisco),(erwägt),(Verbot),_,_,_,_
9,5,0,Autonome Fahrzeuge verlagern Haftpflicht auf H...,0,(Fahrzeuge),(verlagern),(Haftpflicht),_,_,_,_


In [39]:
displacy.render(nlp_de(de_sentences[0]), style="dep", options={"distance": 75})

When looking at the results of the dependency matcher and the parsing tree, it can be seen that the parser makes a few mistakes: In the first sentence, it erroneously labels "Silicon Valley" as a subject.

## Danish

In [46]:
nlp_da = load_spacy_pipeline(DA_SPACY_PIPELINE)

matcher_da = create_matcher(nlp_da, DA_DEPENDENCY_PATTERN_FILE)

EXTRA_SENT = "Høj bly gom vandt fræk sexquiz på wc"
if EXTRA_SENT not in da_sentences:
    da_sentences.append(EXTRA_SENT)
da_sentences

['Apple overvejer at købe et britisk startup for 1 milliard dollar.',
 'Selvkørende biler flytter forsikringsansvaret over på producenterne.',
 'San Francisco overvejer at forbyde udbringningsrobotter på fortovet.',
 'London er en storby i Storbritannien.',
 'Hvor er du?',
 'Hvem er Frankrings president?',
 'Hvad er hovedstaden i USA?',
 'Hvornår blev Barack Obama født?',
 'Høj bly gom vandt fræk sexquiz på wc']

In [49]:
get_subject_object_verb_table(da_sentences, nlp_da, matcher_nl)

Unnamed: 0,doc_id,sent_id,sent,match_id,subj,verb,obj,comp,prep,aux,subjadj
0,0,0,Apple overvejer at købe et britisk startup for...,0,(Apple),(overvejer),(købe),,,,
1,0,0,Apple overvejer at købe et britisk startup for...,0,(Apple),(overvejer),(startup),,,,
2,1,0,Selvkørende biler flytter forsikringsansvaret ...,3,(biler),(flytter),(producenterne),,,,
3,2,0,San Francisco overvejer at forbyde udbringning...,0,(San),(overvejer),(forbyde),,,,
4,2,0,San Francisco overvejer at forbyde udbringning...,0,(San),(overvejer),(udbringningsrobotter),,,,
5,2,0,San Francisco overvejer at forbyde udbringning...,0,(Francisco),(overvejer),(forbyde),,,,
6,2,0,San Francisco overvejer at forbyde udbringning...,3,(San),(overvejer),(fortovet),,,,
7,3,0,London er en storby i Storbritannien.,5,(London),(er),(storby),,,,
8,5,0,Hvem er Frankrings president?,5,(Hvem),(er),(president),,,,
9,7,0,Hvornår blev Barack Obama født?,4,"(Barack, Obama)",(født),,,,,


In [50]:
displacy.render(nlp_da(da_sentences[2]), style="dep", options={"distance": 120})

The Danish parser also makes a few mistakes: For example, in the second sentence it labels both "San" and "Francisco" as separate subjects.