## tCoIR - Text Analysis
### <span style='color: green'>SETUP </span> Prepare and Setup Notebook <span style='float: right; color: red'>MANDATORY</span>

In [4]:
%load_ext autoreload
%autoreload 2


In [1]:

from beakerx.object import beakerx
from beakerx import *

from IPython.display import display #, set_matplotlib_formats
import text_analytic_tools
import text_analytic_tools.utility as utility
import text_analytic_tools.common.textacy_utility as textacy_utility
import text_analytic_tools.common as common

logger = utility.getLogger('tCoIR')

current_domain = text_analytic_tools.CURRENT_DOMAIN
utility.setup_default_pd_display(pd)
container = None

Data folder: /home/roger/source/text_analytic_tools/data/tCoIR


### STEPS

| Step | Input | Output | Note | Source |
|:---|:---|:---|:---|:---|
| 1. Select (create) text corpus | *.txt | xyz.txt.zip | Decide (specify) which text subset to use| (manual) |
| 2. Prepare text corpus | xyz.txt.zip | xyz.txt_preprocessed.zip | Minor preprocessing of text files (e.g. hyphens etc.) | textacy_utility.preprocess_text |
| 3. Load (or create) textaCy corpus | xyz.txt_preprocessed.zip | xyz.bin.bz2 | Loads textaCy corpus (creates if not exists) | load_or_create |
| 4. Create tokenized text corpus | xyz.bin.bz2  | xyz.tokenized.zip | Loads textaCy corpus (creates if not exists) | load_or_create |
| 5. **Compute co-occurrence** | xyz.tokenized.zip | (excel) | Compute co-occurrence | co_occurrence.compute |

Note:

- Steps 1 to 4 can be skipped if a tokenized corpus already exists.
- Steps 1 and 2 can be skipped if prepared corpus (or textaCy corpus) already exists

Configuration elements:

| Step | Configuration | Action |
|:---|:---|
| **1.** | What text files to include  |
| **2.** | (nothing) |
| **3.** | NER yes/no |
| **3.** | NER yes/no |


### OPTIONAL: Prepare a new filtered tokenized corpus.

1. Load (or create a new) textaCy corpus from the source text files.
1. Create (extract) filtered tokenized corpus (to be used in co-occurrence)


In [8]:
import text_analytic_tools.domain.common_logic as common_logic

source_path = '/home/roger/source/text_analytic_tools/data/tCoIR/tCoIR_en_45-72.txt.zip'

if container is None:
    container = textacy_utility.load_or_create(
        source_path=source_path,
        language='en',
        document_index=None,
        merge_entities=False,
        overwrite=False,
        use_compression=True,
        disabled_pipes=tuple(("ner", "parser", "textcat"))
    )

corpus             = container.textacy_corpus
min_freq_stats     = { k: textacy_utility.generate_word_count_score(corpus, k, 10) for k in [ 'lemma', 'lower', 'orth' ] }
max_doc_freq_stats = { k: textacy_utility.generate_word_document_count_score(corpus, k, 75) for k in [ 'lemma', 'lower', 'orth' ] }
document_index     = common_logic.document_index(corpus)
term_substitutions = common_logic.term_substitutions(vocab=None)
fx_docs            = lambda corpus: ((doc._.meta['filename'], doc) for doc in corpus)

default_opts = dict(
    term_substitutions=term_substitutions,
    substitute_terms=True,
    ngrams=[1],
    min_word=1,
    normalize='lemma',
    filter_stops=True,
    filter_punct=True,
    named_entities=False,
    include_pos=('ADJ', 'NOUN'),
    chunk_size=0,
    min_freq=1,
    min_freq_stats=min_freq_stats,                 # Must be specified if min_freq > 1
    max_doc_freq=100,
    max_doc_freq_stats=max_doc_freq_stats          # Must be specified if max_doc_freq < 100
)

run_opts = [
    dict(include_pos=('ADJ', 'NOUN', 'VERB')),
    dict(include_pos=('ADJ', 'NOUN')),
    dict(include_pos=('NOUN')),
    dict(include_pos=('VERB'))
]

for _opts in run_opts:

    opts = utility.extend(default_opts, _opts)

    target_filename = utility.path_add_timestamp(container.prepped_source_path)
    target_filename = utility.path_add_suffix(target_filename, '.' + opts.get('normalize',''))
    target_filename = utility.path_add_suffix(target_filename, '.' + '.'.join(list(opts.get('include_pos',''))))
    target_filename = utility.path_add_suffix(target_filename, '.tokenized')

    tokenized_docs = textacy_utility.extract_document_tokens(fx_docs(corpus), **opts)

    df_summary = common.store_tokenized_corpus_as_archive(tokenized_docs, target_filename)

    logger.info("Done! Result stored in '{}'".format(target_filename))



2019-10-27 20:35:34,736 : INFO : Loading term substitution mappings...
2019-10-27 20:35:34,742 : INFO : Stored 0 files...
2019-10-27 20:35:34,970 : INFO : Stored 100 files...
2019-10-27 20:35:35,126 : INFO : Stored 200 files...
2019-10-27 20:35:35,270 : INFO : Stored 300 files...
2019-10-27 20:35:35,409 : INFO : Stored 400 files...
2019-10-27 20:35:35,637 : INFO : Done! Result stored in '/home/roger/source/text_analytic_tools/data/tCoIR/tCoIR_en_45-72.txt_preprocessed_201910272035.lemma.ADJ.NOUN.VERB.tokenized.zip'
2019-10-27 20:35:35,640 : INFO : Stored 0 files...
2019-10-27 20:35:35,866 : INFO : Stored 100 files...
2019-10-27 20:35:36,011 : INFO : Stored 200 files...
2019-10-27 20:35:36,143 : INFO : Stored 300 files...
2019-10-27 20:35:36,273 : INFO : Stored 400 files...
2019-10-27 20:35:36,418 : INFO : Done! Result stored in '/home/roger/source/text_analytic_tools/data/tCoIR/tCoIR_en_45-72.txt_preprocessed_201910272035.lemma.ADJ.NOUN.tokenized.zip'
2019-10-27 20:35:36,420 : INFO : S

In [12]:
import time
import text_analytic_tools.text_analysis.co_occurrence as co_occurrence
import text_analytic_tools.common.text_corpus as text_corpus

def compute_co_occurrence(filepath, window_size=5, distance_metric=0, method='HAL', normalize='size'):

    corpus = text_corpus.SimplePreparedTextCorpus(filepath, lowercase=True)
    document_index = current_domain.compile_documents(corpus)

    df = co_occurrence.compute(corpus, document_index, window_size, distance_metric, normalize, method)

    result_filename = '{}_{}_result_co_occurrence_{}.xlsx'.format(method, window_size, time.strftime("%Y%m%d_%H%M"))
    df.to_excel(result_filename)
    
    print('Result saved to file {}'.format(result_filename))

source_files = [
    ('data/tCoIR/tCoIR_en_45-72.txt_preprocessed_201910272035.lemma.ADJ.NOUN.tokenized.zip', 'lemma.ADJ.NOUN'),
    ('data/tCoIR/tCoIR_en_45-72.txt_preprocessed_201910272035.lemma.ADJ.NOUN.VERB.tokenized.zip', 'lemma.ADJ.NOUN.VERB'),
    ('data/tCoIR/tCoIR_en_45-72.txt_preprocessed_201910272035.lemma.NOUN.tokenized.zip', 'lemma.NOUN'),
    ('data/tCoIR/tCoIR_en_45-72.txt_preprocessed_201910272035.lemma.VERB.tokenized.zip', 'lemma.VERB')
]

for source_file, tag in source_files:
    corpus = text_corpus.SimplePreparedTextCorpus(source_file, lowercase=True)
    document_index = current_domain.compile_documents(corpus)
    for window_size in [5, 10, 20]:
        compute_co_occurrence(source_file, window_size=5, distance_metric=0, method='HAL')
        df = co_occurrence.compute(corpus, document_index, window_size=5, distance_metric=0, normalize='size', method='HAL')

        result_filename = 'CO_tCoIR_en_45-72.{}_{}_{}_{}.xlsx'.format(time.strftime("%Y%m%d_%H%M", method, window_size, tag))
        df.to_excel(result_filename)
        
        print('Result saved to file {}'.format(result_filename))

2019-10-28 07:40:32,070 : INFO : Initializing dictionary
2019-10-28 07:40:32,075 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-10-28 07:40:32,624 : INFO : built Dictionary(3729 unique tokens: ['above', 'accordance', 'art', 'article', 'artistic']...) from 483 documents (total 158609 corpus positions)
2019-10-28 07:40:32,699 : INFO : Initializing dictionary
2019-10-28 07:40:32,706 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-10-28 07:40:33,282 : INFO : built Dictionary(3729 unique tokens: ['above', 'accordance', 'art', 'article', 'artistic']...) from 483 documents (total 158609 corpus positions)
2019-10-28 07:40:33,745 : INFO : Builiding vocabulary...
2019-10-28 07:40:33,766 : INFO : Vocabulary of size 3660 built.
2019-10-28 07:40:33,773 : INFO : Year 1945...
2019-10-28 07:40:34,495 : INFO : Year 1946...
2019-10-28 07:40:35,565 : INFO : Year 1947...
2019-10-28 07:40:37,105 : INFO : Year 1948...
2019-10-28 07:40:39,382 : INFO : Year 1949...
201

NameError: name 'method' is not defined

In [10]:
!mv data/tCoIR/tCoIR_en_45-72.txt_preprocessed_201910272035.lemma.V.E.R.B.tokenized.zip data/tCoIR/tCoIR_en_45-72.txt_preprocessed_201910272035.lemma.VERB.tokenized.zip

https://github.com/maciejkula/glove-python/issues/96

```bash
% git clone https://github.com/maciejkula/glove-python.git
% cd glove-python/
% cd glove/
% cython glove_cython.pyx
% cythonize glove_cython.pyx
% cython metrics/accuracy_cython.pyx
% cythonize metrics/accuracy_cython.pyx
% cython --cplus corpus_cython.pyx
% cythonize corpus_cython.pyx
% cd ..
% python setup.py cythonize
% make
% pip install -e .
```

  