In [13]:
%run init_notebook.py
from settings import DATA_DIR

In [14]:
import numpy as np
import pandas as pd
import spacy
import os

from itertools import chain, combinations

In [15]:
# !python -m spacy download en_core_web_lg

In [16]:
from src.transitivity import (
    get_transitivity_candidates,
    transitvity_check_first_level,
    transitivity_check_second_level,
    is_matrix_transitive,
    _recursion_transitive_clusters,
)

#### The problem I was facing: 
I developed an unsupervised topic recognition model that aims to recognise new, often unknown events (e.g. covid-19) early on. For this purpose, I used spacy's proper noun tag (PROPN), on top of NER, to build mention density time series by entity. The spacy NER on its own turned out to be too narrow for new topics.
To make the density time series more accurate, I needed to link expressions that can be considered synonyms at the article level. Newspapers sometimes refer to “European Central Bank” and “European Institution” interchangeably to make texts more readable. In this case it was important to link the expressions at the article level, as the two are not global synonyms. Another example is names such as “Donald Trump” and “President Trump”, which also had to be linked. I was therefore looking for a local, unsupervised clustering technique that does not rely on an external database.


#### My solution
I used spacy vector embeddings as a measure of similarity, only analysing term pairs whose similarity exceeds a certain threshold (e.g. 0.8). In a next step I used transitivity as a clustering criterion. Transitivity requires that every pair of words in a cluster has a similarity at or above the threshold: if A is linked to B and B is linked to C, then A must also be linked to C. This method outputs few but meaningful clusters for each article. Moreover, these clusters do not overlap by construction.
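To make the transitivity criterion concrete, here is a minimal sketch of how it could be tested on a 0/1 link matrix with a matrix product. The function name `is_transitive` and the toy matrices are my own illustration, not the code in `src/transitivity.py`, and I assume the links are symmetric.

```python
import numpy as np


def is_transitive(adjacency) -> bool:
    """Return True if every two-step link in a 0/1 link matrix is also a direct link."""
    a = (np.asarray(adjacency) > 0).astype(int)
    np.fill_diagonal(a, 1)  # every term is trivially linked to itself
    # Entry (i, j) of a @ a is positive if some k links i to k and k to j.
    two_step = (a @ a) > 0
    return bool(np.all(a[two_step] == 1))


# Hypothetical example: three terms that are all linked pairwise.
fully_linked = [[1, 1, 1],
                [1, 1, 1],
                [1, 1, 1]]

# Hypothetical example: A-B and B-C are linked, but A-C is not.
chain = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]

result_full = is_transitive(fully_linked)
result_chain = is_transitive(chain)
```

In the second matrix the product finds a path from A to C via B, but the direct entry is zero, so the candidate would be rejected as a cluster.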
My code performs clustering in three steps. First, all pairs of terms and their similarity scores are gathered in a list. Second, potential clusters need to be identified before they can be checked for transitivity. This is a finite recursive problem, as one word in a pair is potentially linked to another pair, and so forth. Starting from a pool of pairs, I take a given pair and check which other pairs are associated with it, taking them out of the pool. Once all related pairs are gathered, the cluster candidate is complete. This procedure is then applied to the pairs remaining in the pool until the pool is empty. In a third and last step, the cluster candidates are checked for transitivity using matrix multiplication.
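The pooling step can be sketched roughly as follows. This is a simplified, iterative illustration and not the implementation in `src/transitivity.py`: the helpers `group_pairs` and `fully_linked` are hypothetical names, and for brevity the final check compares pairs directly instead of using matrix multiplication. The input is a subset of the pairs shown in the example further down.

```python
from itertools import combinations


def group_pairs(pairs):
    """Group linked pairs into cluster candidates by draining a pool of pairs."""
    pool = [frozenset(p) for p in pairs]
    groups = []
    while pool:
        members = set(pool.pop())  # start a new candidate from any remaining pair
        grew = True
        while grew:  # keep scanning until no remaining pair touches the candidate
            grew = False
            for pair in pool[:]:
                if pair & members:
                    members |= pair
                    pool.remove(pair)
                    grew = True
        groups.append(members)
    return groups


def fully_linked(members, pairs):
    """Return True if every two members of the candidate form a linked pair."""
    linked = {frozenset(p) for p in pairs}
    return all(frozenset(c) in linked for c in combinations(members, 2))


pairs = [
    ("Zelenskiy", "Volodymyr Zelenskiy"),
    ("Zelenskiy", "Volodymyr"),
    ("Volodymyr Zelenskiy", "Volodymyr"),
    ("ukrainian", "Ukraine"),
    ("Ukraine", "Russia"),
    ("15,000", "3,000"),
]

groups = group_pairs(pairs)
kept = sorted(sorted(g) for g in groups if fully_linked(g, pairs))
print(kept)
```

The candidate containing "ukrainian", "Ukraine" and "Russia" is dropped here because "ukrainian" and "Russia" are not linked directly, while the Zelenskiy candidate survives because all three of its pairs are present.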

The code can be found here: https://github.com/lukasgrahl/miscellaneous/blob/main/src/transitivity.py

#### An example
Using this article from the Irish Times: https://www.irishtimes.com/world/europe/2023/08/31/ukraine-war-latest/


In [17]:
# Read the article text; the context manager closes the file handle afterwards
with open(os.path.join(DATA_DIR, 'irish_times_article.txt'), 'r', encoding='utf-8') as f:
    article_irish_times = f.read()

In [18]:
NLP = spacy.load('en_core_web_lg')
doc = NLP(article_irish_times)

all_ents = list(doc.ents)
all_ents.extend([t for t in doc if t.pos_ == 'PROPN'])
candidates = get_transitivity_candidates(all_ents, similarity_threshold=.75)

In [19]:
[list(i) for i in candidates]

[['Union', 'the European Union'],
 ['ukrainian', 'Ukraine'],
 ['European', 'the European Union'],
 ['Ukraine', 'Russia'],
 ['the European Union', 'the National Security and Defence Council'],
 ['Zelenskiy', 'Volodymyr Zelenskiy'],
 ['Zelenskiy', 'Volodymyr'],
 ['15,000', '3,000'],
 ['the National Security and Defence Council', 'National'],
 ['the National Security and Defence Council', 'Defence'],
 ['this month', 'last month'],
 ['February', 'February 2022'],
 ['last month', 'February last year'],
 ['Volodymyr Zelenskiy', 'Volodymyr']]

In [20]:
tpl_cluster = _recursion_transitive_clusters(candidates)

In [21]:
tpl_cluster

[['Volodymyr Zelenskiy', 'Volodymyr', 'Zelenskiy'],
 ['15,000', '3,000'],
 ['February 2022', 'February']]