# Experimental Space

*Quick-and-dirty experiments for exploring ideas.*

In [57]:
import nlp
import json

### Load and clean the data

In [58]:
data = nlp.load_file('../data/lee.txt')
docs = nlp.preprocess(data)
len(data), data[0], docs[0]

(300,
 'Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. The New South Wales Rural Fire Service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around Hill Top are optimistic of defending all properties. As more than 100 blazes burn on New Year\'s Eve in New South Wales, fire crews have been called to new fire at Gunning, south of Goulburn. While few details are availabl

### Test with LDA

In [59]:
lda = nlp.build_lda(docs, num_topics=10)
lda.print_topics()

[(0,
  '0.035*"palestinian" + 0.024*"israeli" + 0.020*"arafat" + 0.011*"said" + 0.011*"gaza" + 0.010*"hamas" + 0.009*"israel" + 0.007*"west" + 0.007*"official" + 0.007*"suicide"'),
 (1,
  '0.018*"say" + 0.012*"south" + 0.009*"australia" + 0.009*"new" + 0.009*"fire" + 0.007*"test" + 0.006*"said" + 0.005*"year" + 0.004*"wale" + 0.004*"sydney"'),
 (2,
  '0.017*"say" + 0.013*"said" + 0.007*"australian" + 0.006*"year" + 0.004*"per" + 0.004*"australia" + 0.004*"could" + 0.004*"child" + 0.004*"new" + 0.004*"cent"'),
 (3,
  '0.015*"said" + 0.010*"say" + 0.007*"attack" + 0.006*"palestinian" + 0.005*"state" + 0.005*"one" + 0.005*"two" + 0.005*"metre" + 0.004*"last" + 0.004*"first"'),
 (4,
  '0.016*"said" + 0.016*"say" + 0.012*"afghanistan" + 0.011*"force" + 0.011*"government" + 0.008*"bin" + 0.007*"qaeda" + 0.006*"laden" + 0.006*"taliban" + 0.006*"afghan"'),
 (5,
  '0.017*"say" + 0.014*"union" + 0.011*"worker" + 0.010*"qantas" + 0.008*"said" + 0.008*"australian" + 0.006*"industrial" + 0.006*"two

### Explore topic coherence

I want to compute the [UMass](https://aclanthology.info/pdf/D/D11/D11-1024.pdf) coherence for pairs of terms. `gensim` reports topic coherence as the average of the coherence between the topic's terms, but it doesn't provide an API for the coherence of an individual pair. I can still use its topic-level score to validate my own calculation.
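For reference, the score that the manual check in this notebook reproduces can be sketched as follows. This is inferred from the calculation here rather than quoted from the paper, and the symbol names are chosen for illustration: $D(w)$ is the number of documents containing $w$, $D(w_i, w_j)$ the number containing both terms, $N$ the number of documents, $n$ the number of terms in the topic, and $\epsilon$ a small smoothing constant ($10^{-12}$ in this notebook). Each pair is conditioned on the term that comes first in the topic list.

$$
c(w_i, w_j) = \log \frac{D(w_i, w_j)/N + \epsilon}{D(w_i)/N},
\qquad
C(\text{topic}) = \frac{1}{\binom{n}{2}} \sum_{i<j} c(w_i, w_j)
$$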

In [60]:
from gensim.models import CoherenceModel

corpus, dictionary = nlp.build_corpus_dictionary(docs)
topic = ['palestinian', 'israeli', 'arafat']
cm = CoherenceModel(model=lda, topics=[topic], corpus=corpus, dictionary=dictionary, coherence='u_mass')

In [61]:
cm.get_coherence_per_topic()

[-0.15413790323070112]

In [62]:
d0 = len([1 for doc in docs if topic[0] in doc])
d1 = len([1 for doc in docs if topic[1] in doc])
d2 = len([1 for doc in docs if topic[2] in doc])
d01 = len([1 for doc in docs if topic[0] in doc and topic[1] in doc])
d02 = len([1 for doc in docs if topic[0] in doc and topic[2] in doc])
d12 = len([1 for doc in docs if topic[1] in doc and topic[2] in doc])
d0, d1, d2, d01, d02, d12

(29, 27, 25, 26, 25, 22)

In [85]:
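import math
e = 1e-12  # small constant that keeps log() finite when a pair never co-occurs

n = len(docs)
# Each pair is conditioned on its first term, i.e. the one listed earlier in `topic`.
t01 = math.log((d01 / n + e) / (d0 / n))
t02 = math.log((d02 / n + e) / (d0 / n))
t12 = math.log((d12 / n + e) / (d1 / n))
# Topic coherence is the mean of the pairwise scores.
(t01 + t02 + t12) / 3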

-0.15413790323070112
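
The same number can be reproduced from the reported counts alone, without loading the corpus. The following is a minimal sketch, not part of the original notebook: `pair_coherence`, `EPS`, and `N_DOCS` are names made up for illustration, and `N_DOCS = 300` assumes that preprocessing keeps every document, so that `len(docs)` equals the `len(data)` shown earlier.

```python
import math
from itertools import combinations

EPS = 1e-12   # assumed smoothing constant, same value as `e` in this notebook
N_DOCS = 300  # assumption: len(docs) equals the len(data) reported earlier

# Document counts copied from the cell output above.
single = {"palestinian": 29, "israeli": 27, "arafat": 25}
pair = {
    ("palestinian", "israeli"): 26,
    ("palestinian", "arafat"): 25,
    ("israeli", "arafat"): 22,
}

def pair_coherence(t1, t2):
    """Log of smoothed co-occurrence frequency over the frequency of the first term."""
    return math.log((pair[(t1, t2)] / N_DOCS + EPS) / (single[t1] / N_DOCS))

topic = list(single)  # keeps the term order used above
scores = [pair_coherence(a, b) for a, b in combinations(topic, 2)]
mean_coherence = sum(scores) / len(scores)
print(mean_coherence)
```

Because `N_DOCS` appears in both numerator and denominator, it only matters through the tiny `EPS` term, so the result is insensitive to that assumption.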

Excellent! The manual calculation matches `gensim`, so now I can move on to computing the coherence between any pair of terms.

In [81]:
import math
from itertools import combinations

def count(terms, docs):
    """Count the number of documents that contain all the terms."""
    return sum(1 for doc in docs if all(term in doc for term in terms))

def compute_terms_coherences(terms, docs):
    """Return a dict of (term1, term2): coherence."""
    n_docs = len(docs)
    # Document frequency of each single term, computed once.
    one_dict = {term: count([term], docs) for term in terms}
    coherences = {}

    for t1, t2 in combinations(terms, 2):
        # Conditioned on the first term of the pair; `e` is the smoothing constant defined above.
        c = math.log((count([t1, t2], docs) / n_docs + e) / (one_dict[t1] / n_docs))
        coherences[(t1, t2)] = c

    return coherences

In [88]:
terms = ['palestinian', 'israeli', 'arafat', 'phong']
compute_terms_coherences(terms, docs)

{('arafat', 'phong'): -25.14611446614055,
 ('israeli', 'arafat'): -0.20479441263237677,
 ('israeli', 'phong'): -25.223075507276675,
 ('palestinian', 'arafat'): -0.1484200051062732,
 ('palestinian', 'israeli'): -0.10919929195345338,
 ('palestinian', 'phong'): -25.29453447125882}
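
The values near -25 appear to be the smoothing floor rather than a meaningful score. The sketch below assumes that `phong` never co-occurs with the other three terms, so the co-occurrence count is 0 and only the smoothing constant remains in the numerator. `EPS`, `N_DOCS`, and `floor` are illustrative names, and `N_DOCS = 300` again assumes that `len(docs)` equals the reported `len(data)`.

```python
import math

EPS = 1e-12   # assumed smoothing constant, same value as `e` in this notebook
N_DOCS = 300  # assumption: len(docs) equals the len(data) reported earlier

# Single-term document counts copied from the earlier cell output.
single = {"palestinian": 29, "israeli": 27, "arafat": 25}

# With a co-occurrence count of 0, the score reduces to log(EPS / (df / N_DOCS)).
floor = {
    term: math.log((0 / N_DOCS + EPS) / (df / N_DOCS))
    for term, df in single.items()
}

for term, value in floor.items():
    print(term, value)
```

Under these assumptions the three floors agree with the `phong` entries above, which suggests the exact value depends only on the choice of smoothing constant and on how frequent the first term is.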

In [72]:
count(terms, docs)

22

---

In [4]:
%load_ext autoreload
%autoreload 2