# Evaluating Ducoref Against a Silver Standard 

This is a notebook that computes agreement (F1) between a [silver standard of novels with annotated character mentions developed by Roel Smeets](https://github.com/roelsmeets/character-networks), and characters as tagged by the [Dutch coreference resolution & dialogue analysis machinery created by Andreas van Cranenburgh](https://github.com/andreasvc/dutchcoref). The input data is produced by the coreference resolution system which uses [Alpino parser](https://github.com/rug-compling/alpino-docker), the Ducoref rule based Python core, and its neural modules.


### TODOs

* Evaluate against Ducoref clusters rather than individual mentions.
* Port code to use Pandas rather than specific dictionaries and loops.
* Scale against the fully parsed corpus. (Depends on parsing by RUG.)

In [2]:
# imports
import csv
import regex

In [13]:
# Let's inspect what the silver data looks like

with open( './data_test/NAMES_ArnonGrunberg_DeMan.csv', 'r' ) as names_file:
    for line in names_file.read().split( '\n' )[0:10]:
        print( line )


53,1,Sam,Sam,
53,1,Sam,Samarendra,
53,1,Sam,Samarendra Ambani,
53,1,Sam,Hond,
53,1,Sam,hond,
53,2,Nina,Nina,
53,2,Nina,Sams vriendin,
53,3,Dave,Dave,
53,3,Dave,Dave Luscombe,
53,4,Hamid Shakir Mahmoud,Hamid Shakir Mahmoud,


In [18]:
# Okay, now we create a dictionary of characters and mentions: 
# { id: { name: 'character name', mentions: [ mention1, mention2… ] } }.

character_ids_to_mention_surfaces = {}
all_mention_surfaces = [] # For check
with open( './data_test/NAMES_ArnonGrunberg_DeMan.csv', 'r' ) as names_file:
    reader = csv.reader( names_file, delimiter=',' )
    for row in reader:
        if( row[1] not in character_ids_to_mention_surfaces.keys() ):
            character_ids_to_mention_surfaces[ row[1] ] = { 'name': '', 'mentions': [] }
        character_id_to_mention_surfaces = character_ids_to_mention_surfaces[ row[1] ]
        character_id_to_mention_surfaces[ 'name' ] = row[2]
        character_id_to_mention_surfaces[ 'mentions' ].append( row[3] )
        all_mention_surfaces.append( row[3] )

In [19]:
print( character_ids_to_mention_surfaces['1'], '\n' )
print( all_mention_surfaces[0:10] )


{'name': 'Sam', 'mentions': ['Sam', 'Samarendra', 'Samarendra Ambani', 'Hond', 'hond']} 

['Sam', 'Samarendra', 'Samarendra Ambani', 'Hond', 'hond', 'Nina', 'Sams vriendin', 'Dave', 'Dave Luscombe', 'Hamid Shakir Mahmoud']


In [20]:
# We now want to create a list of all positions that contain a named character.
# Therefore first we simply want all tokens and their position.

novel_tokens = []
novel_conll_file_path = './alpino_data/parsed/20220216/ArnonGrunberg_DeMan_output.conll'
with open ( novel_conll_file_path, 'r' ) as novel_file:
    novel_tokens = novel_file.read().split( '\n' )
    novel_tokens = [ novel_token.split( '\t' )[1:5] for novel_token in novel_tokens if len( novel_token.split( '\t' ) ) > 4 ]

print( novel_tokens[0:10] )


[['1-1', '0', 'Voor', 'voor'], ['1-1', '1', 'zijn', 'zijn'], ['1-1', '2', 'reis', 'reis'], ['1-1', '3', 'heeft', 'hebben'], ['1-1', '4', 'Samarendra', 'Samarendra'], ['1-1', '5', 'Ambani', 'Ambani'], ['1-1', '6', 'samen', 'samen'], ['1-1', '7', 'met', 'met'], ['1-1', '8', 'zijn', 'zijn'], ['1-1', '9', 'vriendin', 'vriendin']]


In [22]:
regex_alpha_only = regex.compile( r'[\p{L}\p{Nl}]+' )

mention_position_silver = []

for idx_start, token in enumerate( novel_tokens ):
    for mention in all_mention_surfaces:
        mention_parts = mention.split( ' ' )
        match = True
        idx_end = idx_start
        for idx, part in enumerate( mention_parts ):
            idx_end += idx
            if( idx_end < len( novel_tokens ) ):
                # if( ( idx_end < len( novel_tokens ) ) and ( novel_tokens[idx_end][2].lower() != part.lower() ) ):
                alpha_start = regex_alpha_only.match( novel_tokens[idx_end][2].lower() )
                if( alpha_start != None ):
                    alpha_start = alpha_start.group(0)
                    if( alpha_start != part.lower() ):
                        match = False
                else:
                    match = False
            else:
                match = False
        if( match ):
            mention_position_silver.append( ( idx_start, idx_end, mention ) )

print( mention_position_silver[0:10] )


[(4, 4, 'Samarendra'), (4, 5, 'Samarendra Ambani'), (20, 20, 'Samarendra'), (51, 51, 'Samarendra'), (58, 58, 'Sam'), (138, 138, 'Samarendra'), (138, 139, 'Samarendra Ambani'), (213, 213, 'Sam'), (252, 252, 'Sam'), (281, 281, 'Samarendra')]


In [49]:
# To compare we need the mentions according to the Dutch Coref system…

def mention_is_name( mention ):
    ducoref_offset = 1
    mention_parts = mention.split( '\t' )
    if( mention_parts[3] == 'name' ):
        return ( int( mention_parts[1] ) - ducoref_offset, int( mention_parts[2] ) - ducoref_offset, mention_parts[12] )
    else:
        return None

mention_position_ducoref = []
with open ( './alpino_data/parsed/20220216/ArnonGrunberg_DeMan_output.mentions.tsv', 'r' ) as ducoref_file:
    mention_position_ducoref = ducoref_file.read().split( '\n' )[1:-1]
    mention_position_ducoref = [ mention_is_name( mention ) for mention in mention_position_ducoref ]
    mention_position_ducoref = [ mention for mention in mention_position_ducoref if mention != None ]

print( mention_position_ducoref[0:10] )


[(4, 5, 'Samarendra Ambani'), (20, 20, 'Samarendra’s'), (51, 51, 'Samarendra'), (55, 58, 'de meeste mensen Sam'), (138, 139, 'Samarendra Ambani'), (145, 145, 'Zwitserland'), (213, 213, 'Sam'), (252, 252, 'Sam'), (278, 279, 'de Amerikanen'), (281, 281, 'Samarendra’s')]


In [50]:
# We now can turn both lists into a form that we can compare

mention_position_silver = [ ( tup[0], tup[1] ) for tup in mention_position_silver ]
print( mention_position_silver[0:10] )
mention_position_ducoref = [ ( tup[0], tup[1] ) for tup in mention_position_ducoref ]
print( mention_position_ducoref[0:10] )


[(4, 4), (4, 5), (20, 20), (51, 51), (58, 58), (138, 138), (138, 139), (213, 213), (252, 252), (281, 281)]
[(4, 5), (20, 20), (51, 51), (55, 58), (138, 139), (145, 145), (213, 213), (252, 252), (278, 279), (281, 281)]


In [51]:
# Compute F1 score is a matter of counting…

false_negatives = 0
true_positives = 0
character_line_refs = []
for tup in mention_position_silver:
    if( tup in mention_position_ducoref ):
        true_positives += 1
    else:
        false_negatives += 1

print( 'True positives:', true_positives )
print( 'False negatives:', false_negatives )

false_positives = 0
for tup in mention_position_ducoref:
    if( tup not in mention_position_silver ):
        false_positives += 1

print( 'False positives:', false_positives )

fscore = true_positives / ( true_positives + ( 0.5 * ( false_positives + false_negatives ) ) )

print( 'F1-Score:', fscore )



True positives: 1499
False negatives: 279
False positives: 1081
F1-Score: 0.6879302432308398


In [52]:
# This is an unsatisfying evalution, because the Dutch Coreference Resolution solution rightfully lists
# also e.g. "Zwitserland" en "de Amerikanen" as names. 
# Marginally better would be to use the PER tag 
# assigned by Alpino as well. But we know already that PER gives an almost 80% F1. 
# (We tested that with prior code in Python/Atom.)

def mention_is_name_and_person( mention ):
    ducoref_offset = 1
    mention_parts = mention.split( '\t' )
    if( mention_parts[3] == 'name' and mention_parts[5] == 'PER' ):
        return ( int( mention_parts[1] ) - ducoref_offset, int( mention_parts[2] ) - ducoref_offset, mention_parts[12] )
    else:
        return None

mention_position_ducoref = []
with open ( './alpino_data/parsed/20220216/ArnonGrunberg_DeMan_output.mentions.tsv', 'r' ) as ducoref_file:
    mention_position_ducoref = ducoref_file.read().split( '\n' )[1:-1]
    mention_position_ducoref = [ mention_is_name_and_person( mention ) for mention in mention_position_ducoref ]
    mention_position_ducoref = [ mention for mention in mention_position_ducoref if mention != None ]


mention_position_ducoref = [ ( tup[0], tup[1] ) for tup in mention_position_ducoref ]
print( mention_position_ducoref[0:10] )

# Compute F1 score is a matter of counting…

false_negatives = 0
true_positives = 0
character_line_refs = []
for tup in mention_position_silver:
    if( tup in mention_position_ducoref ):
        true_positives += 1
    else:
        false_negatives += 1

print( 'True positives:', true_positives )
print( 'False negatives:', false_negatives )

false_positives = 0
for tup in mention_position_ducoref:
    if( tup not in mention_position_silver ):
        false_positives += 1

print( 'False positives:', false_positives )

fscore = true_positives / ( true_positives + ( 0.5 * ( false_positives + false_negatives ) ) )

print( 'F1-Score:', fscore )


[(4, 5), (20, 20), (51, 51), (55, 58), (138, 139), (213, 213), (252, 252), (281, 281), (315, 316), (372, 372)]
True positives: 1409
False negatives: 369
False positives: 350
F1-Score: 0.7967203845066441


## How does Ducoref do on gender identification?

In [86]:
#Let's add gender into the equation…

def mention_is_name_and_person( mention ):
    ducoref_offset = 1
    mention_parts = mention.split( '\t' )
#     if( mention_parts[3] == 'name' ):
    if( mention_parts[3] == 'name' and mention_parts[5] == 'PER' ):
        return ( int( mention_parts[1] ) - ducoref_offset, int( mention_parts[2] ) - ducoref_offset, mention_parts[12], mention_parts[8] )
    else:
        return None

mention_position_ducoref = []
with open ( './alpino_data/parsed/20220216/ArnonGrunberg_DeMan_output.mentions.tsv', 'r' ) as ducoref_file:
    mention_position_ducoref = ducoref_file.read().split( '\n' )[1:-1]
    mention_position_ducoref = [ mention_is_name_and_person( mention ) for mention in mention_position_ducoref ]
    mention_position_ducoref = [ mention for mention in mention_position_ducoref if mention != None ]

print( mention_position_ducoref[0:10] )

[(4, 5, 'Samarendra Ambani', 'm'), (20, 20, 'Samarendra’s', 'm'), (51, 51, 'Samarendra', 'm'), (55, 58, 'de meeste mensen Sam', 'f'), (138, 139, 'Samarendra Ambani', 'm'), (213, 213, 'Sam', 'f'), (252, 252, 'Sam', 'f'), (281, 281, 'Samarendra’s', 'fm'), (315, 316, 'meneer Ambani', 'm'), (372, 372, 'Samarendra', 'm')]


In [87]:
# Let's subset names and gender

name_gender_ducoref = [ ( mention[2], mention[3] ) for mention in mention_position_ducoref ]
name_gender_ducoref = set( name_gender_ducoref )

print( len( name_gender_ducoref ) )

189


In [88]:
for name in name_gender_ducoref:
    print( name )

('Ene Ann O’Connell', 'f')
('Hamid Shakir Mahmoud Sam', 'fm')
('Samarendra’s', 'm')
('Khaled Hosseini', 'm')
('Roger Federer', 'm')
('De heer Böckli', 'm')
('Hund', 'm')
('de familie Ambani', 'm')
('Mahmoud al-Mabhouh', 'mn')
('een paar jonge Berbers', 'n')
('Fehmers', 'fm')
('Fehmers', '-')
('Christian Dior', 'n')
('de nacht Nina', 'f')
('zijn Fanta', 'n')
('Lankfords', 'fm')
('Khalils', 'mn')
('de andere staart Sam', 'f')
('Fanta', 'f')
('Ann O’Connell', 'f')
('een blikje Fanta', 'fn')
('meneer Mahmoud', 'm')
('Ambani', 'fm')
('Zwitserduits', '-')
('Mevrouw Ambani', 'f')
('Puccini', 'mn')
('Angry Birds', 'n')
('Mevrouw Ambani', 'fm')
('Meneer Samarendra', 'm')
('Hamid', 'm')
('Samarendra Ambani', 'mn')
('Samarendra', 'n')
('Hond', 'n')
('Angry Birds', 'm')
('Rose', 'f')
('ursina geisendorf', 'fm')
('Frank Lloyd Wright', 'm')
('een aardige voorraad Fanta', 'n')
('Fahrenheit', 'n')
('John', 'm')
('de heer Böckli', 'm')
('Sam', 'm')
('die Fehmer', 'm')
('Fehmer', 'm')
('uitsluitend Texa

In [89]:
# We need at least the same information according to the silve standard, so…

name_gender_silver = []
with open( './data_test/NODES_ArnonGrunberg_DeMan.csv', 'r' ) as names_file:
    rows = names_file.read().split( '\n' )[:-1]
    for row in rows[0:10]:
        print( row )
    rows = [ row.split(';') for row in rows ]
    name_gender_silver = [ ( row[2], row[3] ) for row in rows ]

number_gender_map = { '1': 'm', '2': 'f' }
    
name_gender_silver = [ ( row[0], number_gender_map[ row[1] ] ) for row in name_gender_silver ]
print( '' )
print( name_gender_silver )

53;1;Sam;1;3;99;3;6;2;1;architect
53;2;Nina;2;3;99;3;6;2;1;student
53;3;Dave;1;4;99;3;6;2;1;architect
53;4;Hamid Shakir Mahmoud;1;5;99;3;6;4;1;directeur
53;5;Aida;2;3;99;3;6;2;2;werkloos
53;6;meneer Ambani;1;6;99;3;6;4;1;uitvinder
53;7;mevrouw Ambani;2;3;7;3;6;99;2;apothekersassistent
53;8;Bill;1;99;99;5;99;99;99;chauffeur
53;9;Hassan;1;99;99;5;99;99;99;beveiliger
53;10;Fehmer;1;99;99;99;99;99;1;architect

[('Sam', 'm'), ('Nina', 'f'), ('Dave', 'm'), ('Hamid Shakir Mahmoud', 'm'), ('Aida', 'f'), ('meneer Ambani', 'm'), ('mevrouw Ambani', 'f'), ('Bill', 'm'), ('Hassan', 'm'), ('Fehmer', 'm'), ('Heavy', 'm'), ('Honey', 'm'), ('Böckli', 'm'), ('Martina', 'f'), ('Liliane', 'f'), ('Lankford', 'm'), ('Brady', 'm'), ('Rose', 'f'), ('Mahmoud al-Mabhouh', 'm'), ('Khalil', 'm'), ('Ursina', 'f')]


In [90]:
# Let's just see how many genders Ducoref estimated right
# without further heuristics

for name_gender in name_gender_silver:
    print( name_gender, name_gender in name_gender_ducoref )

('Sam', 'm') True
('Nina', 'f') True
('Dave', 'm') True
('Hamid Shakir Mahmoud', 'm') True
('Aida', 'f') False
('meneer Ambani', 'm') True
('mevrouw Ambani', 'f') True
('Bill', 'm') False
('Hassan', 'm') True
('Fehmer', 'm') True
('Heavy', 'm') True
('Honey', 'm') False
('Böckli', 'm') True
('Martina', 'f') True
('Liliane', 'f') True
('Lankford', 'm') True
('Brady', 'm') True
('Rose', 'f') True
('Mahmoud al-Mabhouh', 'm') True
('Khalil', 'm') True
('Ursina', 'f') False


In [91]:
# Let's do an F1 again

false_negatives = 0
true_positives = 0

for tup in name_gender_silver:
    if( tup in name_gender_ducoref ):
        true_positives += 1
    else:
        false_negatives += 1

# We don't compute false positives, but we just set is 0
# because the Ducoref has much more characters that might very well be
# correctly identified, but are not in the silver list.
#
# That is: we still need a good way of identifying MAIN characters from the Ducoref data.

false_positives = 0

fscore = true_positives / ( true_positives + ( 0.5 * ( false_positives + false_negatives ) ) )

print( 'F1-Score:', fscore )


F1-Score: 0.8947368421052632


**F1 of 90%, nice!** But actually the score might even be better: 

* Does `('Bill' 'm')` in silver equal `('Bill', '-')` and/or `('Bill','n')` ?

Otherwise:

* Aida is not identified by Ducoref at all
* Ursina item
* Honey item

If we ignore the PER tag selection / filter, we get an F1 of 0.9230.

## Can we decide who's a main character from mentions alone?

In [97]:
# Let's subset names and gender

name_gender_ducoref = [ ( mention[2], mention[3] ) for mention in mention_position_ducoref ]
name_gender_ducoref_unique = set( name_gender_ducoref )

print( name_gender_ducoref[0:10], '\n')

print( len( name_gender_ducoref ) )
print( len( name_gender_ducoref_unique ) )
 

[('Samarendra Ambani', 'm'), ('Samarendra’s', 'm'), ('Samarendra', 'm'), ('de meeste mensen Sam', 'f'), ('Samarendra Ambani', 'm'), ('Sam', 'f'), ('Sam', 'f'), ('Samarendra’s', 'fm'), ('meneer Ambani', 'm'), ('Samarendra', 'm')] 

1744
189


In [115]:
character_count_ducoref = {}
for name_gender in name_gender_ducoref:
    if name_gender in character_count_ducoref.keys():
        character_count_ducoref[ name_gender ] += 1
    else:
        character_count_ducoref[ name_gender ] = 1

# Sort by number of mentions
character_count_ducoref = { k: v for k, v in sorted( character_count_ducoref.items(), key=lambda item: item[1], reverse=True ) }
for k,v in character_count_ducoref.items(): 
    print( k, v )

('Sam', 'f') 781
('Nina', 'f') 134
('Hassan', 'm') 76
('Dave', 'm') 61
('Hamid Shakir Mahmoud', 'm') 59
('Sams', 'f') 43
('Fehmer', 'fm') 35
('Rose', 'f') 31
('Heavy', 'n') 29
('Brady', 'm') 23
('Samarendra Ambani', 'm') 22
('Samarendra', 'm') 21
('Fehmer', 'm') 20
('Puccini', 'm') 19
('Bill', 'n') 17
('John Brady', 'm') 17
('Martina', 'f') 14
('Khalil', 'm') 14
('Lankford', 'm') 13
('Mahmoud al-Mabhouh', 'm') 13
('Max Fehmer', 'm') 11
('Liliane', 'f') 10
('Nina’s', 'f') 9
('mevrouw Geisendorf', 'f') 9
('Mahmoud', 'm') 6
('Hamid', 'm') 6
('Meneer Mahmoud', 'm') 6
('meneer Mahmoud', 'm') 6
('meneer Hamid', 'm') 6
('Hahnemann', 'm') 6
('Samarendra’s', 'fm') 5
('meneer Ambani', 'm') 4
('mevrouw Ambani', 'f') 4
('Frank Lloyd Wright', 'm') 4
('Ann O’Connell', 'f') 4
('Meneer Hamid', 'm') 4
('Fehmer', '-') 4
('Giovanni', 'm') 4
('Angry Birds', 'n') 4
('Fehmer & Geverelli', 'fm') 4
('Meneer Ambani', 'm') 3
('Fehmers', 'fm') 3
('Saddam', 'm') 3
('Mozart', 'm') 3
('Puccini', 'mn') 3
('Fanta', '

**That seems a** definite yes. But again we see, as with the evaluation against the silver standard, that we are mostly bothered by 'character synonyms'. We should see if the clustering by Ducoref is a solution to this.

### Intermezzo: what gender labels are used?

In [116]:
print( len( mention_position_ducoref ) )
genders = set( [ mention[3] for mention in mention_position_ducoref ] )
print( genders )


1744
{'fm', 'fn', 'm', 'mn', 'n', '-', 'f'}


## Evaluation against Ducoref clusters rather than individual mentions


In [None]:
## (TODO)