## Prepare for Text Analysis

### Vectorize the corpus to a BoW corpus <!-- (text_analytic_tools.corpus.vectorized_corpus.VectorizedCorpus) -->

Use the script `scripts/vectorize_corpus.py` to create a BoW (bag-of-words) corpus.

```bash
python scripts/vectorize_corpus.py \
    --to-lower \
    --no-remove-accents \
    --min-length 2 \
    --keep-numerals \
    --no-keep-symbols \
    --only-alphanumeric \
    --file-pattern '*.txt' \
    --meta-field "document_type:_:0" \
    --meta-field "document_id:_:2" \
    --meta-field "year:_:3" \
    ./data/legal_instrument_corpus.txt.zip \
    ./data
```
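The `--meta-field` values appear to follow a `name:separator:index` pattern: the filename is split on the separator and the part at the given index becomes the value of the named field. That reading is an assumption drawn from the command above, not from the script's documentation. A minimal sketch of it follows, where `parse_meta_field`, `extract_meta` and the sample filename are hypothetical and made up for illustration:

```python
import os


def parse_meta_field(spec):
    """Split an assumed 'name:separator:index' spec into its three parts."""
    name, separator, index = spec.split(":")
    return name, separator, int(index)


def extract_meta(filename, specs):
    """Apply each spec to the filename (without extension) and collect the values."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    values = {}
    for spec in specs:
        name, separator, index = parse_meta_field(spec)
        values[name] = stem.split(separator)[index]
    return values


# Hypothetical filename; real corpus filenames may be structured differently
meta = extract_meta(
    "treaty_xx_0042_1958.txt",
    ["document_type:_:0", "document_id:_:2", "year:_:3"],
)
print(meta)
```

Under this assumption the values stay strings; converting `year` to an integer would be a separate step.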

<!--
The script calls `generate_corpus` in `text_analytic_tools.corpus.corpus_vectorizer`:
-->

The resulting corpus is stored in the specified output folder as two files: a NumPy file containing the document-term matrix (DTM), and a pickled Python file containing the dictionary and a document index.
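To make the two-file layout concrete, here is a minimal sketch that writes and reads a toy DTM with `numpy.save` and a dictionary plus document index with `pickle`. The file names, the dict keys (`token2id`, `document_index`) and the dense matrix are all assumptions for illustration; the actual script picks its own names and may store the matrix differently.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy data: 2 documents x 3 tokens (all values made up)
dtm = np.array([[2, 0, 1],
                [0, 3, 1]])
token2id = {"treaty": 0, "convention": 1, "article": 2}
document_index = [
    {"filename": "doc_a.txt", "year": 1958},
    {"filename": "doc_b.txt", "year": 1960},
]

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file names, not the ones the script produces
    dtm_path = os.path.join(folder, "example_dtm.npy")
    meta_path = os.path.join(folder, "example_meta.pickle")

    # Write the matrix and the metadata as two separate files
    np.save(dtm_path, dtm)
    with open(meta_path, "wb") as fp:
        pickle.dump({"token2id": token2id, "document_index": document_index}, fp)

    # Read both files back
    loaded_dtm = np.load(dtm_path)
    with open(meta_path, "rb") as fp:
        loaded_meta = pickle.load(fp)

# The dictionary maps a token to its DTM column
column = loaded_meta["token2id"]["convention"]
counts = loaded_dtm[:, column]
print(counts)
```

Keeping the matrix and the metadata apart means the dictionary and document index can be inspected without loading the full DTM.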

<!--
### Prepare text files for Sparv

The Sparv pipeline requires that each document is stored as an individual XML file. The shell script `sparvit-to-xml` can be used to add a root tag to all text files in a Zip archive. The resulting XML files are stored in a new Zip archive.

```bash
 sparvit-to-xml --input riksdagens_protokoll_content_corpus.zip --output riksdagens_protokoll_content_corpus_xml.zip
 ```
 -->

In [None]:
%load_ext autoreload
%autoreload 2

import os
import sys
import bokeh.plotting
import types

assert sys.version_info > (3,8,5)

root_folder = os.path.join(os.getcwd().split('text_analytics')[0], 'text_analytics')
corpus_folder = os.path.join(root_folder, 'data')

sys.path = sys.path + [ root_folder, globals()['_dh'][-1] ]

bokeh.plotting.output_notebook(hide_banner=True)

container = types.SimpleNamespace(corpus=None, handle=None, data_source=None, data=None, figure=None)


## Load previously vectorized corpus
<!--
The corpus was created with the following settings:
 - Tokens were converted to lower case.
 - Only tokens that contain at least one alphanumeric character are kept (isalnum).
 - Accents are not removed (deacc)
 - Min token length 2 (min_len)
 - Max length not set (max_len)
 - Numerals are removed (numerals, -N)
 - Symbols are removed (symbols, -S)

Use the `vectorize_corpus` script to create a new corpus with different settings.

The corpus is processed in the following ways when loaded:

 - Exclude tokens having a total word count less than `Min count`
 - Include at most `Top count` most frequent words.
 - Group and sum up documents by year.
 - Normalize token distribution over years to 1.0
 
 -->
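When the corpus is loaded, tokens with a total count below `Min count` are dropped, at most the `Top count` most frequent tokens are kept, documents are summed by year, and the token distribution is normalized to 1.0. The NumPy sketch below walks through those four steps on toy data. It is not the code used by `word_trends_corpus_gui`; the variable names are hypothetical, and it assumes that normalization means each year's row sums to 1.0.

```python
import numpy as np

# Toy DTM: 3 documents x 4 tokens (all values made up)
dtm = np.array([
    [4, 1, 1, 2],
    [2, 0, 1, 2],
    [6, 0, 1, 3],
], dtype=float)
tokens = np.array(["state", "rare", "article", "party"])
years = np.array([1958, 1958, 1960])  # one year per document

min_count, top_count = 2, 2

# 1. Drop tokens whose total count is below min_count
keep = dtm.sum(axis=0) >= min_count
dtm, tokens = dtm[:, keep], tokens[keep]

# 2. Keep at most top_count of the most frequent remaining tokens
order = np.argsort(-dtm.sum(axis=0), kind="stable")[:top_count]
dtm, tokens = dtm[:, order], tokens[order]

# 3. Group documents by year and sum their counts
unique_years = np.unique(years)
yearly = np.vstack([dtm[years == year].sum(axis=0) for year in unique_years])

# 4. Normalize so that each year's row sums to 1.0 (assumed axis)
normalized = yearly / yearly.sum(axis=1, keepdims=True)

print(tokens)
print(unique_years)
print(normalized)
```

A year in which none of the kept tokens occur would produce a division by zero in step 4, so real data needs a guard for empty rows.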

In [None]:
import word_trends_corpus_gui as corpus_gui
ui = corpus_gui.display_gui(corpus_folder, container=container)

## Display word trends

In [None]:
import word_trends_gui
word_trends_gui.display_gui(container)