## Text Analysis - CO-OCCURRENCE
### <span style='color: green'>SETUP </span> Prepare and Setup Notebook <span style='float: right; color: red'>MANDATORY</span>

In [9]:
%load_ext autoreload
%autoreload 2

from beakerx.object import beakerx
from beakerx import *

import pandas as pd
from IPython.display import display
import text_analytic_tools.utility as utility

utility.setup_default_pd_display(pd)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
[autoreload of text_analytic_tools.domain.tCoIR.treaty_state failed: Traceback (most recent call last):
  File "/home/roger/.local/share/virtualenvs/text_analytic_tools-LUuJUi2x/lib/python3.7/site-packages/IPython/extensions/autoreload.py", line 245, in check
    superreload(m, reload, self.old_objects)
  File "/home/roger/.local/share/virtualenvs/text_analytic_tools-LUuJUi2x/lib/python3.7/site-packages/IPython/extensions/autoreload.py", line 450, in superreload
    update_generic(old_obj, new_obj)
  File "/home/roger/.local/share/virtualenvs/text_analytic_tools-LUuJUi2x/lib/python3.7/site-packages/IPython/extensions/autoreload.py", line 387, in update_generic
    update(a, b)
  File "/home/roger/.local/share/virtualenvs/text_analytic_tools-LUuJUi2x/lib/python3.7/site-packages/IPython/extensions/autoreload.py", line 357, in update_class
    update_instances(old, new)
  File "/home/roger/.local/share

## <span style='color: green'>PREPARE </span> HAL Co-Windows Ratio (CWR) <span style='float: right; color: red'>MANDATORY</span>

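"HAL" term co-occurrence frequencies are calculated in accordance with the Hyperspace Analogue to Language (Lund; Burgess, 1996) vector-space model. The computation is specified in detail in section 3.1 of (Chen; Lu, 2011). The Co-Windows Ratio (CWR) of two terms is the number of sliding windows that contain both terms divided by the number of windows that contain at least one of them: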

\begin{aligned}
nw(x) &= \text{number of sliding windows that contain term $x$} \\
nw(x, y) &= \text{number of sliding windows that contain both $x$ and $y$} \\
\\
f(x, y) &= \text{normalized version of } nw(x, y) \\
CWR(x, y) &= \frac{nw(x, y)}{nw(x) + nw(y) - nw(x, y)}\\
\end{aligned}

- Chen, Z.; Lu, Y. (2011). "A Word Co-occurrence Matrix Based Method for Relevance Feedback".
- Lund, K.; Burgess, C. & Atchley, R. A. (1995). "Semantic and associative priming in high-dimensional semantic space". [Link](https://books.google.de/books?id=CSU_Mj07G7UC).
- Lund, K.; Burgess, C. (1996). "Producing high-dimensional semantic spaces from lexical co-occurrence". doi:10.3758/bf03204766 [Link](https://dx.doi.org/10.3758%2Fbf03204766).
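
The CWR definition above can be illustrated with a minimal sketch in plain Python. This is not the implementation used by `text_analytic_tools`; the helper names `sliding_windows` and `co_windows_ratio` are hypothetical, and the sketch assumes that a window is every contiguous run of `window_size` tokens and that a term is counted at most once per window.

```python
from collections import Counter
from itertools import combinations


def sliding_windows(tokens, window_size):
    """Yield every contiguous window of `window_size` tokens.

    Assumption: a text shorter than the window yields one short window.
    """
    last_start = max(len(tokens) - window_size, 0)
    for start in range(last_start + 1):
        yield tokens[start:start + window_size]


def co_windows_ratio(tokens, window_size=5):
    """Return {(x, y): CWR(x, y)} for all term pairs sharing a window."""
    nw = Counter()       # nw(x): windows that contain x
    nw_pair = Counter()  # nw(x, y): windows that contain both x and y
    for window in sliding_windows(tokens, window_size):
        terms = sorted(set(window))  # count each term once per window
        nw.update(terms)
        nw_pair.update(combinations(terms, 2))  # keys are sorted pairs
    return {
        (x, y): n_xy / (nw[x] + nw[y] - n_xy)
        for (x, y), n_xy in nw_pair.items()
    }


tokens = "the treaty enters into force when the treaty is ratified".split()
cwr = co_windows_ratio(tokens, window_size=3)
print(cwr[("the", "treaty")])  # 3 / (4 + 5 - 3) = 0.5
```

With a window of three tokens, "the" occurs in 4 windows, "treaty" in 5, and both together in 3, which gives a ratio of 0.5. Pair keys are stored in sorted order, so look up `("the", "treaty")` rather than `("treaty", "the")`.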


## <span style='color: green'>PREPARE </span> Compute Using Prepared Tokenized Corpus <span style='float: right; color: red'>MANDATORY</span>


In [13]:
import time
import ipywidgets
import text_analytic_tools.text_analysis.co_occurrence as co_occurrence
import text_analytic_tools.common.text_corpus as text_corpus
from text_analytic_tools.domain_logic_config import current_domain as domain_logic

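def compute_co_occurrence(filepath, window_size=5, distance_metric=0, method='HAL', normalize='size'):
    """Compute term co-occurrences for the prepared corpus at `filepath` and save them to a timestamped Excel file."""
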
    corpus = text_corpus.SimplePreparedTextCorpus(filepath, lowercase=True)
    document_index = domain_logic.compile_documents(corpus)

    df = co_occurrence.compute(corpus, document_index, window_size, distance_metric, normalize, method)

    result_filename = '{}_{}_result_co_occurrence_{}.xlsx'\
        .format(method, window_size, time.strftime("%Y%m%d_%H%M"))
    df.to_excel(result_filename)
    print('Result saved to file {}'.format(result_filename))
    print('Now you are ready to do some serious stuff!')


In [14]:

ui = co_occurrence.PreparedCorpusUI(domain_logic.DATA_FOLDER)
display(ui.build(compute_co_occurrence))



In [15]:
filepath = '/home/roger/source/text_analytic_tools/data/tCoIR/sample_corpus.txt_preprocessed.zip'
compute_co_occurrence(filepath, window_size=5, distance_metric=0, method='HAL')

2019-10-26 08:19:12,488 : INFO : Initializing dictionary
2019-10-26 08:19:12,490 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-10-26 08:19:12,533 : INFO : built Dictionary(2067 unique tokens: ['AND', 'AT', 'Affairs', 'Alfredo', 'Article']...) from 10 documents (total 18111 corpus positions)
2019-10-26 08:19:12,627 : INFO : Builiding vocabulary...
2019-10-26 08:19:12,630 : INFO : Vocabulary of size 1708 built.
2019-10-26 08:19:12,633 : INFO : Year 1950...
2019-10-26 08:19:12,853 : INFO : Year 1951...
2019-10-26 08:19:12,936 : INFO : Year 1952...
2019-10-26 08:19:13,374 : INFO : Year 1953...
2019-10-26 08:19:13,446 : INFO : Year 1954...
2019-10-26 08:19:13,518 : INFO : Year 1955...
2019-10-26 08:19:13,589 : INFO : Year 1956...
2019-10-26 08:19:13,662 : INFO : Year 1957...
2019-10-26 08:19:13,850 : INFO : Year 1958...
2019-10-26 08:19:14,232 : INFO : Year 1959...
2019-10-26 08:19:14,303 : INFO : Year 1960...
2019-10-26 08:19:14,742 : INFO : Year 1961...
2019-10-26 08

In [3]:
domain_logic.DATA_FOLDER

'/home/roger/source/text_analytic_tools/data/tCoIR/'