**_Prototype_ - auto-suggested wikilinks via topic modelling**

In [1]:
import os
os.chdir('..')  # go to project dir

In [2]:
import re
import spacy
import obsidiantools.api as otools

import src
from src.vault import get_vault
from src.nlp import VaultNLP

# Original vault

In [3]:
%%time
VAULT = get_vault('FILM_NOIR_VAULT_DIR')

CPU times: user 4.88 s, sys: 22.7 ms, total: 4.9 s
Wall time: 4.93 s


In [4]:
df = VAULT.get_note_metadata()

In [5]:
df.shape

(122, 7)

The dataset used to produce the vault takes IDs from these lists:
- [Film noir](https://www.imdb.com/search/title/?genres=film_noir&title_type=feature&sort=user_rating%2Cdesc)
- [Neo-noir](https://www.imdb.com/list/ls026035968/?sort=user_rating%2Cdesc&st_dt=&mode=detail&page=1&title_type=movie&num_votes=10000%2C&release_date=1960%2C)

# Topic modelling

In [6]:
print(f"Notes: {df[df['note_exists']].shape[0]}")

Notes: 122


Semi-supervised modelling by specifying **anchor words** for each topic.

**I use the anchor words to formulate wikilinks.**

In [7]:
ANCHOR_WORDS = [['murder', 'violence', 'death', 'die', 'dead', 'killed', 'gun'],
                ['love', 'loves', 'lover', 'obsession', 'obsess',
                 'romance', 'desire', 'relationship', 'sex'],
                ['torment', 'fear', 'domineering', 'demons', 'controlling'],
                ['revenge', 'avenge', 'vengeance', 'betray', 'jealous'],
                ['corruption', 'corrupt', 'injustice', 'lies', 'lied'],
                ['crime', 'criminal']
               ]

In [8]:
nlp = VaultNLP(VAULT)
nlp.anchor_words = ANCHOR_WORDS

In [9]:
%%time
nlp.generate_sparse_word_matrix()

CPU times: user 102 ms, sys: 954 µs, total: 102 ms
Wall time: 102 ms


In [10]:
%%time
nlp.fit_anchored_topic_model(n_hidden=len(ANCHOR_WORDS))

CPU times: user 21.7 s, sys: 332 ms, total: 22 s
Wall time: 22 s


In [11]:
for n in range(len(nlp.anchor_words)):
    topic_words, _, _ = zip(*nlp.anchor_topic_model.get_topics(topic=n))
    print('{}: '.format(n) + ', '.join(topic_words))

0: gun, dead, killed, death, murder, gets, shows, running, watching, day
1: love, relationship, lover, sex, naked, romance, house, sexual, suspicions, scene
2: walks, coffee, showed, domineering, sort, believe, thinks, collapses, talk, elderly
3: vengeance, betray, men, notices, opens, morning, notice, inside, train, come
4: corrupt, lies, corruption, seen, lights, telling, make, enters, says, coming
5: crime, criminal, arrives, investigation, called, fact, picked, cars, having, garage


# Auto-suggest links

In [12]:
nlp = spacy.load("en_core_web_sm")

## clean note index

For this step, I clean up IMDB synopses that are missing the expected space between sentences.

In [13]:
# clean up text: add space after full stops
clean_text_index = {k: re.sub(r'\.(?! )', '. ', re.sub(r' +', ' ', v))
                    for k,v in VAULT.readable_text_index.items()}
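To make the effect of the two substitutions concrete, here is a minimal standalone sketch on a made-up sentence (the helper name `add_space_after_full_stops` and the sample text are illustrative assumptions, not part of the project code). It applies the same two regular expressions as the cell above.

```python
import re

def add_space_after_full_stops(text):
    """Collapse runs of spaces, then add a space after any full stop that lacks one."""
    collapsed = re.sub(r' +', ' ', text)
    return re.sub(r'\.(?! )', '. ', collapsed)

# Made-up sample text, not taken from the vault
sample = "He leaves.She follows.  It cost 3.5 dollars."
cleaned = add_space_after_full_stops(sample)
print(repr(cleaned))
```

Note that the pattern treats every full stop the same way, so a decimal such as `3.5` becomes `3. 5` and the final full stop gains a trailing space. That is probably acceptable for a prototype, but worth keeping in mind when reading the cleaned text.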

## lemmatise anchor words

In [14]:
import itertools

In [15]:
def get_anchor_tokens(anchor_keywords_list):
    # flatten the per-topic lists into one string for spaCy to lemmatise
    # (use the argument, rather than the global ANCHOR_WORDS)
    anchor_kwords = nlp(' '.join(itertools.chain(*anchor_keywords_list)))

    anchor_tokens = [w.lemma_ for w in anchor_kwords
                     if not w.is_stop and
                     not w.is_punct and not w.like_num]

    # unique list
    anchor_tokens = list(set(anchor_tokens))
    return anchor_tokens

In [16]:
anchor_tokens = get_anchor_tokens(ANCHOR_WORDS)

# Replace text

## setup

In [17]:
def get_index_of_words_with_lemmas(clean_text_index):
    clean_text_and_lemmas_index = {k: [(w, w.lemma_) for w in nlp(v)]
                                   for k,v in clean_text_index.items()}
    return clean_text_and_lemmas_index

In [18]:
def get_index_of_keyword_matches(clean_text_and_lemmas_index):
    # use the index passed in, rather than recomputing it from the global
    # clean_text_index (which would run the spaCy pipeline a second time)

    # each note (k) has a list of (word, lemma) matches:
    clean_text_lemmas_match = {k: list(filter(lambda x: (x[1] in anchor_tokens), v))
                               for k,v in clean_text_and_lemmas_index.items()}
    return clean_text_lemmas_match

In [19]:
words_with_lemmas_index = get_index_of_words_with_lemmas(clean_text_index)

In [20]:
clean_text_lemmas_match = get_index_of_keyword_matches(words_with_lemmas_index)

In [21]:
clean_text_lemmas_match.get('The Killing (1956)')

[(kill, 'kill'),
 (killed, 'kill'),
 (killed, 'kill'),
 (kills, 'kill'),
 (betrayed, 'betray')]

In [22]:
def _get_word_to_lemma_replacement_map(clean_text_lemmas_match):
    # dict of <lemma>:<alias>
    # reversed list of tuples so that the 'first key wins', rather than last
    clean_text_match_first_word = dict()
    for note_name in clean_text_lemmas_match.keys():
        # new dict:
        clean_text_match_first_word[note_name] = {
            v:k for k,v in reversed(clean_text_lemmas_match[note_name])}
    # re-order dict to <alias>:<lemma> for easier replacement:
    word_replace_map = dict()
    for k, v_dict in clean_text_match_first_word.items():
        word_replace_map[k] = {v:k for k,v in v_dict.items()}
    return word_replace_map

In [23]:
def get_word_to_wikilink_replacement_map(clean_text_lemmas_match):
    mp = _get_word_to_lemma_replacement_map(clean_text_lemmas_match)
    
    # swap k,v - the words as k, the wikilinks as v:
    for main_k, inner_dict in mp.items():
        mp[main_k] = {v:''.join(["[[",v,"]]"]) if str(k)==v
                      else ''.join(["[[",v,"|",str(k),"]]"])
                      for k,v in inner_dict.items()}
    return mp

In [24]:
word_replace_map = get_word_to_wikilink_replacement_map(
    clean_text_lemmas_match)

In [25]:
word_replace_map.get('The Killing (1956)')

{'betray': '[[betray|betrayed]]', 'kill': '[[kill]]'}
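The two helper functions above can be hard to follow because they swap keys and values twice. As a rough sketch of the same idea using plain strings instead of spaCy tokens (the function names `first_word_per_lemma` and `to_wikilink` are hypothetical), the logic reduces to "first word form per lemma wins, then format as a plain or aliased wikilink":

```python
def first_word_per_lemma(matches):
    """Map each lemma to the first word form that produced it."""
    # Iterating in reverse means earlier matches overwrite later ones,
    # so the first occurrence in the text wins.
    return {lemma: word for word, lemma in reversed(matches)}

def to_wikilink(lemma, word):
    """Plain link if the word is already the lemma, otherwise an aliased link."""
    return f"[[{lemma}]]" if word == lemma else f"[[{lemma}|{word}]]"

# (word, lemma) pairs as plain strings, mirroring the matches for 'The Killing (1956)'
matches = [('kill', 'kill'), ('killed', 'kill'), ('killed', 'kill'),
           ('kills', 'kill'), ('betrayed', 'betray')]

links = {lemma: to_wikilink(lemma, word)
         for lemma, word in first_word_per_lemma(matches).items()}
print(links)
```

On this input the sketch produces the same mapping as the cell output above: `kill` links directly, while `betrayed` keeps its surface form as the alias.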

## word replacement

Loop over each file to replace the first instance of a topic word with a wikilink.

In [26]:
new_source_text = clean_text_index.copy()
# (rather than VAULT.source_text_index )

In [27]:
def get_lemmas_to_words(note_name):
    l = words_with_lemmas_index.get(note_name)
    # iterate in reverse so that the first occurrence of each lemma wins
    lemma_to_ws = {lemma: word for word, lemma in reversed(l)}
    return lemma_to_ws

In [28]:
from spacy.matcher import Matcher
import numpy as np

In [29]:
# `new_source_text` to store text for final export:
new_source_text = clean_text_index.copy()

In [30]:
# one Matcher obj:
matcher = Matcher(nlp.vocab)

In [31]:
for note_name, lemma_to_link_map in word_replace_map.items():    
    orig_doc = nlp(new_source_text[note_name])
    new_doc = orig_doc.copy()
    
    lemmas_to_words = get_lemmas_to_words(note_name)
    for lemma, new_link in lemma_to_link_map.items():
        word = lemmas_to_words[lemma]
        
        pattern_word = [{'LOWER': str(word).lower(), 'op': '{1}'}]
        pattern_word_w_punct = [{'LOWER': str(word).lower()},
                                {'IS_PUNCT': True, 'op': '{1}'}]
        pattern_word_w_space = [{'LOWER': str(word).lower()},
                                {'SPACY': True, 'op': '{1}'}]
        matcher.add(0, [pattern_word_w_space])
        matcher.add(1, [pattern_word_w_punct])
        matcher.add(2, [pattern_word])
        
        matches = matcher(new_doc)
        match_id, start, end = matches[0]
        
        mat = np.matrix(matches, dtype=int)
        keep_condit = (mat[:,1] == mat[:,1].min())
        keep_indices = np.where(np.any(keep_condit, axis=1))
        mat_first_word = mat[keep_indices, ]
        patterns_matched_first_word = np.unique(mat_first_word[0][:, 0])
        
        if 1 in patterns_matched_first_word:  # precedes punct
            repl_str = f" {new_link}"
        elif 0 in patterns_matched_first_word:  # precedes space
            repl_str = ''.join([f" {new_link}", " "])
        else:
            repl_str = f" {new_link}"
        
        new_doc = nlp.make_doc(new_doc[:start].text + repl_str + new_doc[end:].text)
        
        new_source_text[note_name] = new_doc
        matcher.remove(0)
        matcher.remove(1)
        matcher.remove(2)
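For comparison, the "link only the first instance" idea can be approximated without spaCy. The sketch below is a simplified stand-in for the `Matcher` loop above, not a replacement for it: it uses a single regular-expression pass, assumes the keys of the hypothetical `word_to_lemma` mapping are lowercase, and runs on a made-up sentence.

```python
import re

def link_first_occurrences(text, word_to_lemma):
    """Wrap the first occurrence of each word in a wikilink, in a single pass.

    Keys of word_to_lemma are assumed to be lowercase word forms.
    """
    if not word_to_lemma:
        return text
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, word_to_lemma)) + r')\b',
                         re.IGNORECASE)
    seen = set()

    def replace(match):
        surface = match.group(0)
        key = surface.lower()
        if key in seen:  # later occurrences stay as plain text
            return surface
        seen.add(key)
        lemma = word_to_lemma[key]
        return f"[[{lemma}]]" if surface == lemma else f"[[{lemma}|{surface}]]"

    return pattern.sub(replace, text)

# Made-up sample text, not taken from the vault
text = "She was killed at night. The murder shocks him; he is nearly killed too."
linked = link_first_occurrences(text, {'killed': 'kill', 'murder': 'murder'})
print(linked)
```

Because the text is scanned once, a link that has just been inserted is never re-scanned, so a later keyword cannot match inside an existing `[[...]]`. The trade-off is that word boundaries come from the regex engine rather than from spaCy's tokeniser.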

## write changes to files

In [32]:
import shutil
from pathlib import Path

In [33]:
old_vault_dir = Path(os.getenv('FILM_NOIR_VAULT_DIR'))

In [34]:
new_vault_dir = Path.cwd() / "vaults/film-noir-vault-new/"

In [35]:
shutil.copytree(old_vault_dir, new_vault_dir, dirs_exist_ok=True)

PosixPath('/home/mark/Github/obsidian-nlp-analytics/vaults/film-noir-vault-new')

In [36]:
for note_name, txt in new_source_text.items():
    fpath = new_vault_dir / df.loc[note_name, 'rel_filepath']
    with open(fpath, 'w') as f:
        f.write(str(txt))

Lemmas work quite well, but when specifying anchor words it is best to include the nouns, verbs and adjectives that carry the same meaning (e.g. 'desire' and 'desires').  A user could then build a MOC (map of content) from such related words, to aid the navigation of a vault.
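As one possible shape for such a MOC, the sketch below generates the Markdown for a note that groups related keyword notes under a heading per theme. The function name `build_moc`, the note title and the theme names are all hypothetical; only the keywords come from the anchor words and link targets seen in this notebook.

```python
def build_moc(title, topic_groups):
    """Return Markdown for a map-of-content note linking related keyword notes."""
    lines = [f"# {title}", ""]
    for topic, words in topic_groups.items():
        lines.append(f"## {topic}")
        lines.extend(f"- [[{w}]]" for w in words)
        lines.append("")
    return "\n".join(lines)

# Hypothetical grouping of keyword notes by theme
groups = {
    'Death': ['death', 'die', 'dead', 'kill', 'murder'],
    'Romance': ['love', 'lover', 'desire', 'relationship'],
}
moc_text = build_moc('Film noir themes', groups)
print(moc_text)
```

Writing `moc_text` to a `.md` file in the vault would give a single entry point that connects word forms the lemmatiser keeps apart, such as `die`, `dead` and `death`.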

# Explore new vault that has auto-suggested links

In [37]:
from obsidiantools.api import Vault

In [38]:
%%time
NEW_VAULT = Vault(new_vault_dir).connect().gather()

CPU times: user 5.16 s, sys: 14.8 ms, total: 5.17 s
Wall time: 5.22 s


In [39]:
df_new_vault = NEW_VAULT.get_note_metadata()

## Main new notes

These are the main new (non-existent) notes in the Obsidian.md graph:

In [40]:
(df_new_vault.loc[~df_new_vault['note_exists'], 'n_backlinks']
 .sort_values(ascending=False)).head(15)

note
kill            95
murder          73
dead            71
gun             67
die             62
death           51
crime           48
love            44
lie             40
criminal        32
control         30
relationship    27
sex             21
lover           20
fear            18
Name: n_backlinks, dtype: int64

Most films have been auto-suggested links such as `kill`, `murder` and `dead`.

That's not surprising, as those aspects are a large part of film noir. ☠️

Although less common, there are keywords involving romance (e.g. `love`, `relationship`).  Again, this is a major theme in film noir.
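The ranking above comes from the `n_backlinks` column of the note metadata. As a rough illustration of what such a count represents, the sketch below tallies how many notes link to each target, given a mapping of note name to wikilinks. The toy index and the function name `count_backlinks` are invented for illustration, and the exact counting rules used by obsidiantools may differ.

```python
from collections import Counter

def count_backlinks(wikilinks_index):
    """Count, for each link target, how many notes link to it at least once."""
    counts = Counter()
    for links in wikilinks_index.values():
        counts.update(set(links))  # set(): count each note once per target
    return counts

# Invented toy index, not real vault data
toy_index = {
    'Film A': ['kill', 'murder', 'love'],
    'Film B': ['kill', 'gun'],
    'Film C': ['kill', 'murder'],
    'Film D': [],
}
backlinks = count_backlinks(toy_index)
print(backlinks.most_common(2))
```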

## Examples of new wikilinks added to notes

Example of a film with several themes that were picked up and turned into wikilinks:

In [41]:
NEW_VAULT.wikilinks_index.get('Rebecca (1940)')

['love',
 'obsess',
 'relationship',
 'die',
 'death',
 'sex',
 'murder',
 'lie',
 'kill']

Sample of [_Laura_ (1944)](https://www.imdb.com/title/tt0037008/plotsummary#synopsis) synopsis, with wikilinks added:

In [42]:
# first 538 chars, enough to show the first 2 auto-suggested wikilinks:
print(NEW_VAULT.get_source_text('Laura (1944)')[:538])

New York City police detective Mark McPherson (Dana Andrews) is investigating the [[murder]] of beautiful, and highly successful, advertising executive, Laura Hunt (Gene Tierney). Laura has been [[kill|killed]] by a shotgun blast to the face, just inside the doorway to her apartment, before the start of the film. He interviews charismatic newspaper columnist Waldo Lydecker (Clifton Webb), an imperious, decadent dandy, who relates how he met Laura, became her mentor, and used his considerable influence and fame to advance her career.


![Laura note](../img/film-noir_laura-1944_auto-suggested-wikilinks.png)

`Laura (1944)` has been auto-suggested the `jealous` note as a wikilink.  These are other film noir and neo-noir movies that have also had that auto-suggestion:

![Jealous note](../img/film-noir_jealous-note_connections.png)

Film with most wikilinks:

In [43]:
film_most_wikilinks = df_new_vault['n_wikilinks'].idxmax()
print(film_most_wikilinks)
print("Wikilinks:", NEW_VAULT.wikilinks_index.get(film_most_wikilinks))

Lost Highway (1997)
Wikilinks: ['dead', 'love', 'lie', 'jealous', 'kill', 'death', 'murder', 'crime', 'violence', 'criminal', 'sex', 'gun', 'relationship', 'desire']


There are also a few films (e.g. Dark Passage) whose synopses on IMDB are very short, so it is difficult to find keywords that align with the topics outlined earlier.  These films end up as isolated notes:

In [44]:
NEW_VAULT.isolated_notes

['The Aura (2005)',
 'Headhunters (2011)',
 'Hangmen Also Die! (1943)',
 'Dark Passage (1947)',
 'Ace in the Hole (1951)']
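A minimal sketch of how isolated notes could be derived from a wikilinks index, assuming that "isolated" means a note with no outgoing wikilinks and no backlinks (the toy index and the function name `find_isolated_notes` are invented for illustration):

```python
def find_isolated_notes(wikilinks_index):
    """Return notes that neither link to anything nor are linked to."""
    linked_to = {target for links in wikilinks_index.values() for target in links}
    return [note for note, links in wikilinks_index.items()
            if not links and note not in linked_to]

# Invented toy index, not real vault data
toy_index = {
    'Film A': ['kill'],
    'Film B': [],          # no outgoing links, but 'Film C' links to it
    'Film C': ['Film B'],
    'Film D': [],          # no links in either direction
}
isolated = find_isolated_notes(toy_index)
print(isolated)
```

In this vault the film notes do not appear to link to each other, so under that assumption a film is isolated exactly when no anchor-word lemma was found in its synopsis.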