
# Conc

> A Python library for efficient corpus analysis, enabling corpus linguistic analysis in Jupyter notebooks.

## Introduction to Conc

Conc is a Python library that brings corpus linguistic analysis to Jupyter notebooks. A staple of data science, Jupyter notebooks are a great model for presenting analysis that combines code, reporting and discussion in a way that can be reproduced. Conc aims to allow researchers to analyse large corpora in efficient ways using standard hardware, with the ability to produce clear, publication-ready reports and extend analysis where required using standard Python libraries.

Conc uses [spaCy](https://spacy.io/) for tokenising texts. More spaCy functionality will be supported in future releases.  

### Conc Principles  

* use standard Python libraries for data analysis (e.g. NumPy, SciPy, JupyterLab)
* use vector operations where possible  
* prefer fast libraries over slower alternatives (e.g. Conc uses [Polars rather than Pandas](https://pola.rs/posts/benchmarks/), but you can still output Pandas dataframes if you want to use them)  
* provide important information when reporting results  
* pre-compute time-intensive and repeatedly used views of the data  
* work with smaller slices of the data where possible  
* cache specific analyses during a session to reduce computation for repeated calls  
* document corpus representations so that they can be worked with directly  
* provide access to Conc results for further processing with standard Python libraries  
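
As a rough illustration of the caching principle above, the sketch below memoises a report-style function with `functools.lru_cache`, so a repeated call with the same arguments skips the computation. This is not Conc's implementation: `token_frequencies` and the toy `TOKENS` tuple are hypothetical names used only for this example.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical toy "corpus": a tuple of tokens. Tuples are hashable,
# so they can be used as part of the cache key.
TOKENS = ("the", "economy", "and", "the", "budget", "the")


@lru_cache(maxsize=None)
def token_frequencies(tokens: tuple, top_n: int) -> tuple:
    """Return the top_n most frequent tokens; results are cached per argument set."""
    return tuple(Counter(tokens).most_common(top_n))


first = token_frequencies(TOKENS, 1)   # computed on the first call
second = token_frequencies(TOKENS, 1)  # served from the cache
cache_stats = token_frequencies.cache_info()
print(first, cache_stats.hits, cache_stats.misses)
```

The second call returns the stored result without recounting, which is the kind of saving the principle is aiming for when the same report is requested repeatedly in a session.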


## Development Status

Conc is in active development and is currently [released][pypi] for beta testing. For the latest release, install from PyPI (`pip install conc`). The GitHub [repository][repo] may be ahead of the PyPI version, so install from GitHub if you want the latest functionality (see Installation below). Note that the GitHub code is pre-release and may change. The [documentation][docs] reflects the most recent functionality. See the [CHANGELOG][changelog] for notes on releases and the Roadmap below for upcoming features.  

[repo]: https://github.com/polsci/conc
[docs]: https://geoffford.nz/conc/
[pypi]: https://pypi.org/project/conc/
[changelog]: https://github.com/polsci/conc/blob/main/CHANGELOG.md

## Acknowledgements

Conc is developed by [Dr Geoff Ford](https://geoffford.nz/).

Conc originated in my PhD research, which included development of a web-based corpus browser to handle analysis of large corpora. I've been developing Conc through my subsequent research.  

Work to create this Python library has been made possible by funding from two Royal Society of New Zealand Marsden Fund grants:  

- "Mapping LAWS: Issue Mapping and Analyzing the Lethal Autonomous Weapons Debate" (19-UOC-068)  
- "Into the Deep: Analysing the Actors and Controversies Driving the Adoption of the World’s First Deep Sea Mining Governance" (22-UOC-059)  

Conc is an output of both projects.  

Thanks to the Mapping LAWS project team for their support and feedback as first users of ConText (a web-based application built on an earlier version of Conc).  

Dr Ford is a researcher with [Te Pokapū Aronui ā-Matihiko | UC Arts Digital Lab (ADL)](https://artsdigitallab.canterbury.ac.nz/). Thanks to the ADL team and the ongoing support of the University of Canterbury's Faculty of Arts who make work like this possible.    

## Installation

### Install via pip

You can install Conc from [PyPI][pypi] using this command:   

```sh
$ pip install conc
```

To install the latest development version of Conc, which may be ahead of the version on PyPI, you can install from the [repository][repo]:  

```sh
$ pip install git+https://github.com/polsci/conc.git
```
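
Because the repository may be ahead of PyPI, it can help to confirm which version is installed in the current environment. The following is a minimal sketch using only the standard library. `installed_version` is a hypothetical helper name, not part of Conc.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("conc:", installed_version("conc") or "not installed")
```

Note that a development install can report the same version number as the latest release if the version has not yet been bumped in the repository.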

### Install a language model

The first releases of Conc require a spaCy language model for tokenization. After installing Conc, install a model. Here's an example of how to install spaCy's small English model, which is Conc's default language model:  

```sh
$ python -m spacy download en_core_web_sm
```

If you are working with a different language or want to use a different 'en' model, check the [spaCy models documentation](https://spacy.io/models/) for the relevant model name.
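
To check from Python whether a model is available before building a corpus, one option is to test whether the model's package can be found. This sketch relies on spaCy models being installed as regular Python packages. `model_is_installed` is a hypothetical helper, not a Conc or spaCy function.

```python
from importlib.util import find_spec


def model_is_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if a package with the given name can be found in this environment."""
    return find_spec(model_name) is not None


if not model_is_installed():
    print("Model not found. Install it with: python -m spacy download en_core_web_sm")
```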

### Install optional dependencies

Conc has some optional dependencies for downloading source texts to create sample corpora. These are primarily intended for creating corpora for development, so to minimize Conc's requirements they are not installed by default. If you want sample corpora to test out Conc's functionality, you can install them with the following command:

```sh
$ pip install nltk requests datasets
```



### Pre-2013 CPU? Install Polars with support for older machines

Polars is optimized for modern CPUs with support for AVX2 instructions. If you get kernel crashes running Conc on an older machine (probably pre-2013), this is likely to be an issue with Polars. Polars has an [alternate installation option to support older machines](https://docs.pola.rs/user-guide/installation/), which installs a Polars build compiled without AVX2 support. Replace the standard Polars package with the legacy-support package to use Conc on older machines.

```sh
$ pip uninstall polars
$ pip install polars-lts-cpu
```
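
If you are unsure whether your CPU supports AVX2, the sketch below checks the CPU flags on Linux by reading `/proc/cpuinfo`. It assumes an x86 Linux machine. On other platforms that file does not exist or uses different field names, so treat this as a rough check only. `cpuinfo_has_avx2` is a hypothetical helper, not part of Conc or Polars.

```python
from pathlib import Path


def cpuinfo_has_avx2(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    print("AVX2 supported:", cpuinfo_has_avx2(cpuinfo.read_text()))
else:
    print("No /proc/cpuinfo here; check your CPU's specifications instead.")
```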

## Using Conc

A good place to start is TODO, which demonstrates how to build a corpus and output Conc reports.   

The [documentation site][docs] provides a reference for Conc functionality and examples of how to create reports for analysis. The current Conc components are listed below. 



| Class / Function | Module | Functionality | Note |
| -------- | ------- | ------- | ------- |
| `Corpus` | conc.corpus | Build, load, and get information on a corpus; methods to work with a corpus | Required |
| `Conc` | conc.conc | Interface to Conc reports for corpus analysis | Recommended way to access reports for analysis; requires a corpus created by the Corpus module |
| `Text` | conc.text | Output text from the corpus | Access via Corpus |
| `Frequency` | conc.frequency | Frequency reporting | Access via Conc |
| `Ngrams` | conc.ngrams | Reporting on `ngram_frequencies` across corpus and `ngrams` containing specific tokens | Access via Conc |
| `Concordance` | conc.concordance | Concordancing | Access via Conc |
| `Keyness` | conc.keyness | Reporting for keyness analysis | Access via Conc |
| `Collocates` | conc.collocates | Reporting for collocation analysis | Access via Conc |
| `Result` | conc.result | Handles report results, output result as table or get dataframe | Used by all reports |
| `ConcLogger` | conc.core | Logger | Logging implemented in all modules |
| `CorpusMetadata` | conc.core | Class to validate Corpus Metadata JSON | Used by Corpus class |

The `conc.core` module also implements a number of helper functions:

| Function | Functionality |
| -------- | ------- |
| `list_corpora` | Scan a directory for corpora and return a summary |
| `get_stop_words` | Get a spaCy stop word list for a specific model |
| Various - see `Get data sources` | Functions to download source texts to create sample corpora. Primarily intended for development/testing. To minimize requirements not all libraries are installed by default. Functions will raise errors with information on installing required libraries. |
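
The behaviour described in the last row, raising an error that explains how to install a missing library, is a common lazy-import pattern. The sketch below shows one way it can be written. `require_optional` and `download_example` are hypothetical names for illustration, and this is not Conc's actual code.

```python
from importlib import import_module


def require_optional(module_name: str):
    """Import an optional dependency, or raise an ImportError with an install hint."""
    try:
        return import_module(module_name)
    except ImportError as err:
        # Assumes the pip package name matches the module name,
        # which holds for nltk, requests and datasets.
        raise ImportError(
            f"This function needs the optional library '{module_name}'. "
            f"Install it with: pip install {module_name}"
        ) from err


def download_example():
    """Import the dependency only when the function that needs it is called."""
    requests = require_optional("requests")  # raises with an install hint if missing
    return requests
```

Importing inside the function keeps the base install small: users who never download sample corpora never need the extra libraries.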

## TODO before release  

- [ ] review ngram exclude punctuation and spaces - not treating same way between functions
    - excluding spaces makes sense, but punctuation could be reason to exclude 
    - are there ever multiple consecutive spaces? if not, could just always get +1 range and cleanup
    - i.e. 'economy ! The' could be legit bigram, but probably don't want 'economy The' from this as crosses punct
    - approach already in ngram_frequencies may already be sufficient
    - but space would just be removed
- [ ] ngram frequencies counts - impacted by above
- [ ] ngram_frequencies to report
- [ ] align ngram api and display with frequencies (e.g. normalized frequency)
- [x] work out requirements
- [x] Work out polars legacy build and make note about installation
- [x] Complete explanation on index page
- [x] Complete install (use Textplumber example)
- [x] Use textplumber stylesheet
- [x] Moved to do items through code/markdown
- [x] Use github actions developed for textplumber
- [x] add changelog
- [x] check license against corpress   
- [x] revise tests for build and save and load - write test for corpus build metadata - testing counts, vocab, metadata and token positions
- [x] tidy deprecated from Corpus

## Final release process ...
- [ ] generate README etc
- [ ] ensure local tests work via nbdev_prepare
- [ ] ensure github CI tests work
- [ ] check generated documentation
- [ ] Bump version
- [ ] Pypi release

## Roadmap

### Short-term

- [ ] add tutorial / getting started notebook
- [ ] add citation information
- [ ] add support for build from datasets library
- [ ] anatomy - explain token2doc_index -1 and has_spaces on tokens display and various other fields for vocab.
- [ ] Corpus tokenize support for functionality from earlier versions of Conc for wildcards, multiple strings, case insensitive tokenization
- [ ] get_ngrams_by_index - adjust for case insensitive
- [ ] improve concordance ordering so not fixed options e.g. include 3R1R2R
- [ ] concordancing - add in ordering by metadata columns or doc
- [ ] annotations support for POS, TAG, SENT_START, LEMMA 
- [ ] move tokens sort order to build process - takes > 1 second for large corpora, but not needed for all results
- [ ] revisit polars streaming - potentially implement a batched write for very large files i.e. splitting vocab/tokens files into smaller chunks to reduce memory usage.

### Medium-term

- [ ] Support for processing backends other than spaCy (e.g. other tokenizers) 

## Developer Guide

The instructions below are only relevant if you want to contribute to Conc. The [nbdev](https://nbdev.fast.ai/) library is being used for development. If you are new to using nbdev, here are some useful pointers to get you started (or visit the [nbdev website](https://nbdev.fast.ai/)).

### Install Conc in development mode

```sh
# make sure conc package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to conc
$ nbdev_prepare
```