Ontology & Parsing 
===

This notebook focuses on parsing the dataset and making inferences through the ontology engine.

The results are saved so they can be reused by the Deep Learning model.

In [365]:
import pickle
import re

import nltk  # provides the Snowball stemmer used in the grammar-parsing cell
import tqdm  # progress bars, really nice to save some notebook space

In [358]:
# allows the development of Python files and their automatic reloading as they are changed
%load_ext autoreload
%autoreload 1

import sys
sys.path.insert(0, 'src')

%aimport prolog
%aimport preprocessing
%aimport parsing

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### People list

In the future, we may want to extract the people list dynamically, rather than hardcoding it before processing the dataset.

In [112]:
# #### Extract people... ?
# 
# from interval import interval
# 
# text = " ".join(s["source"] for s in dataset)
# 
# # Find Cap names
# cap_names = list(re.finditer("(?:[a-z][,:; ]+)([A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)", text))
# # Find Mr., Mrs., Sir. names
# title_names = list(re.finditer("([Mm](?:rs?\.?|iss\.?|[Ss]ir)\s+[A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)", text))
# 
# characters_spans = interval(*[n.span(1) for n in cap_names] + [n.span(0) for n in title_names])
# 
# # {text[int(s[0]):int(s[1])]: ((int(s[0]), int(s[1]), text[int(s[0])-10:int(s[1])+20])) for s in characters_spans}
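
As a rough illustration of what dynamic extraction could look like, here is a minimal sketch that only collects names introduced by an honorific. The pattern, the list of honorifics, the helper name `extract_titled_names` and the sample sentence are all assumptions made for this example; they are not the notebook's method, and the corpus files used below encode names with underscores (`Mr_Bennet`), which this pattern does not handle.

```python
import re

# Hypothetical pattern: an honorific, an optional dot, then one or more capitalised words.
TITLE_NAME = re.compile(
    r"\b(?:Mrs?|Miss|Sir|Lady|Colonel)\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
)

def extract_titled_names(text):
    """Return the distinct titled names found in `text`, in order of first appearance."""
    seen = []
    for match in TITLE_NAME.finditer(text):
        name = match.group(0)
        if name not in seen:
            seen.append(name)
    return seen

# Made-up sample text, for illustration only.
sample = ("Mr. Bennet was among the earliest of those who waited on Mr. Bingley. "
          "Mrs. Bennet and Lady Lucas heard of it later.")
print(extract_titled_names(sample))
# ['Mr. Bennet', 'Mr. Bingley', 'Mrs. Bennet', 'Lady Lucas']
```

A pattern like this misses characters who are only ever named without a title, which is one reason a curated list remains the safer option.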

For now we just hard-code the known list of protagonists (corrected from the list given in the article).

`people_list = [...]`

In [113]:
with open("corpus/people.pkl", "rb") as f:
    people_list = pickle.load(f)

In [114]:
people_code_to_name = {p["code"]: p["main"] for p in people_list}
people_name_to_code = {p['main'].strip('s'): p['code'] for p in people_list}

#### Splitting the book into utterances and matching annotations

In [116]:
# Split the text into utterances
utterances = preprocessing.build_dataset(
    text_file='corpus/PRIDPREJ_NONEWLINE_Organize_v2.txt',
    people=people_list
)

# Match them with the labelled dataset
dataset = preprocessing.match_with_annoted_file(
    path='corpus/REAL_ALL_CONTENTS_PP.txt',
    utterances=utterances,
    people=people_list)

Almost match :
My dear Elizabeth_Bennet, I have the highest opinion in the world of your excellent judgment in all matters within the scope of your understanding, but permit me to say that there must be a wide difference between the established forms of ceremony amongst the laity, and those which regulate the clergy; for give me leave to observe that I consider the clerical office as equal in point of dignity with the highest rank in the kingdom -- provided that a proper humility of behaviour is at the same time maintained. You must therefore allow me to follow the dictates of my conscience on this occasion, which leads me to perform what I look on as a point of duty. Pardon me for neglecting to profit by your advice, which on every other subject shall be my constant guide, though in the case before us I consider myself more fitted by education and habitual study to decide on what is right than a young lady like yourself. [X] apology, [X] Hunsford, [X] Lady_Catherine.
My dear Elizabeth

In [117]:
dataset[0]

{'begin': 0,
 'discussion_index': 0,
 'end': 109,
 'only_utterance_article': 'My dear Mr_Bennet, [X] have you heard that Netherfield Park is let at last?',
 'only_utterance_us': 'My dear Mr_Bennet, [X] have you heard that Netherfield Park is let at last?',
 'parts': [{'text': 'My dear Mr_Bennet,', 'utterance': True},
  {'text': ' said his lady to him one day, ', 'utterance': False},
  {'text': 'have you heard that Netherfield Park is let at last?',
   'utterance': True}],
 'source': "``My dear Mr_Bennet,'' said his lady to him one day, ``have you heard that Netherfield Park is let at last?''",
 'target': 'Mrs_Bennet'}

### Grammar parsing

Let's detect some features (relations, gender, name) in the narration and utterance parts with the Stanford parser.  
Each part gives us a list of names and ontology properties about the speaker and the addressee.

The following cell took us about 3 hours to run.

In [None]:
parser = parsing.load_parser('../stanford-parser-full-2017-06-09/')
stemmer = nltk.stem.SnowballStemmer('english')
people_main = [p['main'] for p in people_list]

for sample in tqdm.tqdm(dataset):
    for part in sample["parts"]:
        if part["utterance"]:
            part["speaker_features"], part["dest_features"] = parsing.extract_features_from_utterance(part["text"], parser, stemmer, people_name_to_code)
        else:
            part["speaker_features"], part["dest_features"] = parsing.extract_features_from_narration(part["text"], parser, stemmer, people_name_to_code)

Small demo of the parser

In [357]:
parsing.extract_features_from_narration(" observed Mary_Bennet, who piqued herself upon the solidity of her reflections, ", parser, stemmer, people_name_to_code, debug=True)

--- triples ---


[((1, 'VBD'), 'dobj', (2, 'NNP')),
 ((2, 'NNP'), 'acl:relcl', (5, 'VBD')),
 ((5, 'VBD'), 'dobj', (6, 'PRP')),
 ((5, 'VBD'), 'nsubj', (4, 'WP')),
 ((5, 'VBD'), 'nmod', (9, 'NN')),
 ((9, 'NN'), 'nmod', (12, 'NNS')),
 ((12, 'NNS'), 'nmod:poss', (11, 'PRP$')),
 ((12, 'NNS'), 'case', (10, 'IN')),
 ((9, 'NN'), 'det', (8, 'DT')),
 ((9, 'NN'), 'case', (7, 'IN')),
 ((1, 'VBD'), 'nsubj', (0, 'NNP'))]

--- stemmed ---


['XXX',
 'observ',
 'Mary_Bennet',
 '',
 'who',
 'piqu',
 'herself',
 'upon',
 'the',
 'solid',
 'of',
 'her',
 'reflect']

13

[(0, (5, 'XXX')),
 (1, (0, 'observ')),
 (2, (3, 'Mary_Bennet')),
 (3, (0, '')),
 (4, (3, 'who')),
 (5, (0, 'piqu')),
 (6, (1, 'herself')),
 (7, (0, 'upon')),
 (8, (0, 'the')),
 (9, (2, 'solid')),
 (10, (0, 'of')),
 (11, (0, 'her')),
 (12, (1, 'reflect'))]

re-parsing with ['Mary_Bennet', 'who', 'piqu', 'herself', 'upon', 'the', 'solid', 'of', 'her', 'reflect', 'observ', '']
--- triples ---


[((10, 'VBD'), 'dobj', (11, 'NNS')),
 ((10, 'VBD'), 'nsubj', (0, 'NNP')),
 ((0, 'NNP'), 'acl:relcl', (2, 'VBD')),
 ((2, 'VBD'), 'dobj', (3, 'PRP')),
 ((2, 'VBD'), 'nsubj', (1, 'WP')),
 ((2, 'VBD'), 'nmod', (6, 'NN')),
 ((6, 'NN'), 'nmod', (9, 'NNS')),
 ((9, 'NNS'), 'nmod:poss', (8, 'PRP$')),
 ((9, 'NNS'), 'case', (7, 'IN')),
 ((6, 'NN'), 'det', (5, 'DT')),
 ((6, 'NN'), 'case', (4, 'IN'))]

--- stemmed ---


['Mary_Bennet',
 'who',
 'piqu',
 'herself',
 'upon',
 'the',
 'solid',
 'of',
 'her',
 'reflect',
 'observ',
 '']

13

[(0, (5, 'Mary_Bennet')),
 (1, (3, 'who')),
 (2, (0, 'piqu')),
 (3, (1, 'herself')),
 (4, (0, 'upon')),
 (5, (0, 'the')),
 (6, (2, 'solid')),
 (7, (0, 'of')),
 (8, (0, 'her')),
 (9, (1, 'reflect')),
 (10, (0, 'observ')),
 (11, (3, ''))]

[(0, (3, 'Mary_Bennet')),
 (1, (1, 'who')),
 (2, (0, 'piqu')),
 (3, (1, 'herself')),
 (4, (0, 'upon')),
 (5, (0, 'the')),
 (6, (3, 'solid')),
 (7, (0, 'of')),
 (8, (2, 'her')),
 (9, (2, 'reflect')),
 (10, (0, 'observ')),
 (11, (3, ''))]

(({'Mary_Bennet'}, set()), (set(), set()))

In [356]:
parsing.extract_features_from_utterance("dear Mr_Bennet, are you ok ?", parser, stemmer, people_name_to_code, debug=True)

--- triples ---


[((1, 'NNP'), 'case', (0, 'IN')),
 ((5, 'VBN'), 'nmod', (1, 'NNP')),
 ((5, 'VBN'), 'auxpass', (3, 'VBP')),
 ((5, 'VBN'), 'nsubjpass', (4, 'PRP'))]

--- stemmed ---


[(0, 'dear'), (1, 'Mr_Bennet'), (2, ''), (3, 'are'), (4, 'you'), (5, 'ok')]

6

((set(), set()), ({'Mr_Bennet'}, set()))
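
Judging from the two demo outputs above and from how the features are consumed later in this notebook, each call returns a pair `((speaker_names, speaker_relations), (dest_names, dest_relations))`. The sketch below merges the name sets of several parts; the helper `merge_part_features` is hypothetical, and the two tuples are copied from the demo outputs purely for illustration (they come from unrelated sentences).

```python
def merge_part_features(parts_features):
    """Union the speaker and addressee name sets of several parsed parts.

    Each item is assumed to have the shape
    ((speaker_names, speaker_relations), (dest_names, dest_relations)).
    Relation terms are ignored here; the notebook resolves them with Prolog.
    """
    speaker_names, dest_names = set(), set()
    for (spk_names, _spk_rel), (dst_names, _dst_rel) in parts_features:
        speaker_names |= spk_names
        dest_names |= dst_names
    return speaker_names, dest_names

narration = (({'Mary_Bennet'}, set()), (set(), set()))   # narration demo output
utterance = ((set(), set()), ({'Mr_Bennet'}, set()))     # utterance demo output
print(merge_part_features([narration, utterance]))
# ({'Mary_Bennet'}, {'Mr_Bennet'})
```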

In [367]:
with open("corpus/dataset-parsed.pkl", "wb") as f:
    pickle.dump(dataset, f)

### Load Prolog engine

Define facts, mainly family relations and genders, using Prolog.

In the future, we could define alias relationships the same way and use Bayesian networks instead
to get a more sensitive system, but this approach works well enough for us now.

`facts = [...]`

In [368]:
facts = """
status(mrs_annesley,female).
status(elizabeth_bennet,female).
status(jane_bennet,female).
status(lydia_bennet,female).
status(kitty_bennet,female).
status(mary_bennet,female).
status(mrs_bennet,female).
status(mr_bennet,male).
status(mr_bingley,male).
status(caroline_bingley,female).
status(charlotte,female).
status(captain_carter,female).
status(mr_collins,male).
status(lady_catherine,female).
status(mr_chamberlayne,female).
status(dawson,female).
status(mr_denny,female).
status(mr_darcy,male).
status(old_mr_darcy,female).
status(lady_anne_darcy,female).
status(georgiana_darcy,female).
status(colonel_fitzwilliam,male).
status(colonel_forster,female).
status(miss_grantley,female).
status(mrs_gardiner,female).
status(mr_gardiner,male).
status(william_goulding,female).
status(haggerston,female).
status(mrs_hill,female).
status(mrs_jenkinson,female).
status(mr_jones,female).
status(miss_mary_king,female).
status(mrs_long,female).
status(lady_lucas,female).
status(maria_lucas,female).
status(mr_hurst,female).
status(louisa_hurst,female).
status(lady_metcalfe,female).
status(mr_morris,female).
status(mrs_nicholls,female).
status(mr_philips,male).
status(miss_pope,female).
status(mr_pratt,male).
status(mrs_reynolds,female).
status(mr_robinson,female).
status(mr_stone,female).
status(miss_watson,female).
status(old_mr_wickham,female).
status(sir_william,male).
status(anne_de_bourgh,female).
status(mr_wickham,male).
status(mrs_philips,female).
status(young_lucas,male).
status(the_butler,male).

status(elizabeth_bennet,female).
related(jane_bennet,elizabeth_bennet,sister).
related(mary_bennet,elizabeth_bennet,sister).
related(lydia_bennet,elizabeth_bennet,sister).
related(kitty_bennet,elizabeth_bennet,sister).
related(mr_bennet,elizabeth_bennet,father).
related(mrs_bennet,elizabeth_bennet,mother).
related(mr_collins,charlotte,husband).
related(mr_collins,mr_bennet,brother).
related(mr_bingley,caroline_bingley,siblings).
"""
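
A fact list of this size is easy to mistype by hand. As a sketch only, the same `status/2` and `related/3` lines could be rendered from Python data; the helper `build_facts` and the small dictionaries below are hypothetical and are not part of the notebook or of `prolog.py`.

```python
def build_facts(genders, relations):
    """Render Prolog facts in the status/2 and related/3 form used above.

    genders:   dict mapping a character code to 'male' or 'female'
    relations: iterable of (person_a, person_b, relation) triples
    """
    lines = ["status({},{}).".format(code, gender) for code, gender in genders.items()]
    lines += ["related({},{},{}).".format(a, b, rel) for a, b, rel in relations]
    return "\n".join(lines)

# Illustrative subset of the facts above.
genders = {"mrs_bennet": "female", "mr_bennet": "male"}
relations = [("mr_bennet", "elizabeth_bennet", "father")]
print(build_facts(genders, relations))
# status(mrs_bennet,female).
# status(mr_bennet,male).
# related(mr_bennet,elizabeth_bennet,father).
```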

In [361]:
if "prolog_engine" in globals() and not prolog_engine.closed:
    prolog_engine.close()
prolog_engine = prolog.Prolog('./family.pl')
prolog_engine.run()

In [362]:
prolog_engine.assert_facts(facts) # input the facts
prolog_engine.query("abolish_all_tables.") # reset the cache, needed for the circular rules dependencies

True

Test a query

In [363]:
prolog_engine.query('related(X, mr_bennet, wife).')

[{'X': 'mrs_bennet'}]

The following cell took us ~ 30min to complete

In [332]:
dl_dataset = []
for utterance in tqdm.tqdm(dataset[len(dl_dataset):]):
    parts = utterance['parts']
    parts_count = len(parts)
    
    speaker = utterance['target']
    discussion_index = utterance['discussion_index']
    
    utterance_text = re.sub(r'\[X\]', '', utterance['only_utterance_us'])  # raw string avoids invalid-escape warnings
    incises = [part for part in parts if not part['utterance']]
    
    concat_incise = "".join(part["text"] for part in incises)
    
    potential_targets, dest_targets = [], []
    
    # Merge all the information gathered during the parsing
    # of each part of the utterance line
    for part in parts:
        if part["speaker_features"][0] is not None:
            potential_targets.extend(list(part["speaker_features"][0]))
        if part["speaker_features"][1] is not None and len(part["speaker_features"][1]) > 0:
            prolog_results = prolog_engine.query(",".join(part["speaker_features"][1])+".")
            if prolog_results:
                potential_targets.extend(list(set([people_code_to_name[r['X']] for r in prolog_results if 'X' in r and r['X'] in people_code_to_name])))
            else:
                # try to reset the engine if there is a problem
                prolog_engine.query("X=1.")
        if part["dest_features"][0] is not None:
            dest_targets.extend(list(part["dest_features"][0]))
            
        if part["dest_features"][1] is not None and len(part["dest_features"][1]) > 0:
            prolog_results = prolog_engine.query(",".join(part["dest_features"][1])+".")
            if prolog_results:
                dest_targets.extend(list(set([people_code_to_name[r['U']] for r in prolog_results if 'U' in r and r['U'] in people_code_to_name])))
            else:
                # try to reset the engine if there is a problem
                prolog_engine.query("X=1.")

    dl_dataset.append((discussion_index, utterance_text, concat_incise, potential_targets, dest_targets, speaker))

100%|██████████| 41/41 [02:51<00:00,  4.18s/it]


#### Split the dataset into multiple discussions

In [370]:
dataset_discussions = []
current_discussion = []
current_discussion_index = None
for sample in dl_dataset:
    if sample[0] != current_discussion_index and len(current_discussion) > 0:
        dataset_discussions.append(current_discussion)
        current_discussion = []
    current_discussion_index = sample[0]
    current_discussion.append(sample[1:])
if len(current_discussion) > 0:
    dataset_discussions.append(current_discussion)
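
The loop above groups consecutive samples that share a discussion index. A minimal equivalent sketch with `itertools.groupby` follows; the function name `split_by_discussion` and the toy tuples are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

def split_by_discussion(samples):
    """Group consecutive samples by their first field and drop that field.

    Like the loop above, only *consecutive* runs are merged: an index that
    reappears later starts a new discussion.
    """
    return [[sample[1:] for sample in group]
            for _, group in groupby(samples, key=itemgetter(0))]

# Toy samples: (discussion_index, payload)
toy = [(0, "a"), (0, "b"), (1, "c"), (0, "d")]
print(split_by_discussion(toy))
# [[('a',), ('b',)], [('c',)], [('d',)]]
```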

In [371]:
with open("corpus/dataset-dl.pkl", "wb") as f:
    pickle.dump(dataset_discussions, f)

### Results

#### Prediction based only on parsing and ontology

In [372]:
import numpy as np

In [377]:
correct_answers = 0
people_names = [p['main'] for p in people_list]
for discussion in dataset_discussions:
    for sample in discussion:
        if len(sample[2]) > 0 and np.random.choice(list(sample[2]), 1) == sample[4]:
            correct_answers += 1
print("Accuracy: {}".format(correct_answers/len(dl_dataset)))

Accuracy: 0.21428571428571427
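
Because this baseline draws one candidate at random without a fixed seed, its score changes from run to run. As a rough sketch, the expected accuracy of such a draw can be computed directly. The function below assumes a simplified, hypothetical layout of `(candidates, gold_speaker)` pairs rather than the notebook's tuples, and the toy data is invented.

```python
def expected_random_accuracy(samples):
    """Expected accuracy when one candidate is drawn uniformly at random per sample.

    `samples` is a list of (candidates, gold_speaker) pairs (hypothetical layout).
    Samples without candidates count as wrong, as in the cell above.
    """
    if not samples:
        return 0.0
    total = 0.0
    for candidates, gold in samples:
        if candidates:
            # Duplicated candidates are proportionally more likely to be drawn.
            total += list(candidates).count(gold) / len(candidates)
    return total / len(samples)

toy = [(["Elizabeth_Bennet", "Jane_Bennet"], "Elizabeth_Bennet"),
       (["Mr_Darcy"], "Mr_Darcy"),
       ([], "Mrs_Bennet"),
       (["Mr_Collins", "Charlotte"], "Mr_Bennet")]
print(expected_random_accuracy(toy))  # 0.375
```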


#### Prediction based on parsing, ontology and nearby speakers

In [375]:
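def merge_people_predictions(sets, weights=None):
    """Weighted vote over the candidate characters in `sets`.

    `weights[k]` applies to `sets[k]`; missing weights default to 1.
    Returns the highest-scoring character, or None if every set is empty.
    """
    weights = list(weights or [])
    weights = weights[:len(sets)] + [1] * (len(sets)-len(weights))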
    sets_weight = [(s, w) for s, w in zip(sets, weights) if s is not None and len(s) > 0]
    
    potential_targets = {}
    for s, w in sets_weight:
        w = w if w is not None else 1
        for p in s:
            potential_targets[p] = potential_targets.get(p, 0) + w
    if len(potential_targets) == 0:
        return None
    return max(potential_targets, key=lambda p: potential_targets[p])
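
To make the voting concrete, here is a small self-contained sketch of the same idea: each candidate set votes for its members with a signed weight, and the top score wins. The name `weighted_vote` and the toy inputs are assumptions for illustration.

```python
from collections import defaultdict

def weighted_vote(candidate_sets, weights):
    """Return the name with the highest total weight, or None if nobody was proposed."""
    scores = defaultdict(float)
    for names, weight in zip(candidate_sets, weights):
        for name in names or ():   # None or empty sets contribute nothing
            scores[name] += weight
    return max(scores, key=scores.get) if scores else None

# Current utterance proposes two sisters (+4); the previous speaker is penalised (-2).
sets = [{"Elizabeth_Bennet", "Jane_Bennet"}, {"Jane_Bennet"}, None]
print(weighted_vote(sets, [4, -2, 1]))  # Elizabeth_Bennet
```

Negative weights let a neighbouring utterance rule a character out rather than propose one, which is how the next cell penalises the speakers of adjacent lines.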

In [395]:
correct_answers = 0
people_names = [p['main'] for p in people_list]
for discussion in dataset_discussions:
    discussion_people = set()
    previous_predictions = [] 
    for i, sample in enumerate(discussion):
        if len(sample[2]) == 1:
            current_prediction = next(iter(sample[2]))
            discussion_people.add(current_prediction)
        else:
            current_prediction = merge_people_predictions([sample[2], 
                                                            discussion_people,
                                                           
                                                            # n-1, n+1
                                                            # We assume that a character will not speak twice in a row,
                                                            # so we penalize the speakers of the n-1 and n+1 utterances
                                                            # and favor their addressees
                                                            discussion[i-1][3] if i >= 1 else None,
                                                            discussion[i+1][3] if i < len(discussion) - 1 else None,
                                                            discussion[i-1][2] if i >= 1 else None,
                                                            discussion[i+1][2] if i < len(discussion) - 1 else None,
                                                           
                                                            # n-2, n+2
                                                            discussion[i-2][3] if i >= 2 else None,
                                                            discussion[i+2][3] if i < len(discussion) - 2 else None,
                                                            discussion[i-2][2] if i >= 2 else None,
                                                            discussion[i+2][2] if i < len(discussion) - 2 else None,
                                                           
                                                            # n-3, n+3
                                                            # same thing as n-1, n+1, but with a smaller weight
                                                            # and since we are getting further from the current utterance,
                                                            # we only focus on positive influences, i.e. the addressees of n-3, n+3...
                                                            discussion[i-3][3] if i >= 3 else None,
                                                            discussion[i+3][3] if i < len(discussion) - 3 else None,
                                                           
                                                            # n-4, n+4
                                                            # ...and speaker of n-4, n+4
                                                            discussion[i-4][2] if i >= 4 else None,
                                                            discussion[i+4][2] if i < len(discussion) - 4 else None], weights=[4, 4,
                                                                                         2, 2, -2, -2,
                                                                                         -1,-1, 1, 1,
                                                                                         1, 1,
                                                                                         1, 1])
        previous_predictions.append(current_prediction)

        if current_prediction is not None and current_prediction == sample[4]:
            correct_answers += 1
            
print("Accuracy: {}".format(correct_answers/len(dl_dataset)))

Accuracy: 0.5491551459293394
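
The neighbour lookups in the cell above repeat the same bounds check many times, and a negative index in Python silently wraps to the end of the list, so each lower bound has to match its offset. A minimal sketch of a helper that centralises the check follows; `neighbor_field` and the toy discussion are hypothetical and not part of the notebook.

```python
def neighbor_field(discussion, i, offset, field):
    """Return discussion[i + offset][field], or None when the neighbour is out of range."""
    j = i + offset
    if 0 <= j < len(discussion):
        return discussion[j][field]
    return None

# Toy discussion with a single field per sample.
toy_discussion = [("a",), ("b",), ("c",)]
print(neighbor_field(toy_discussion, 1, -2, 0))  # None (no wrap-around to "c")
print(neighbor_field(toy_discussion, 1, 1, 0))   # c
```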
