# Loading and building corpora

This notebook demonstrates loading and building corpora with Conc. You can use this notebook to get started analysing a corpus with Conc. 

## Conc setup

Note for DIGI405 students: if you are working on the JupyterHub server for the course, you don't need to install Conc. Conc's documentation website has a page explaining [how to install Conc](https://geoffford.nz/conc/tutorials/install.html). Another helpful page is the [Conc recipes page](https://geoffford.nz/conc/tutorials/recipes.html), which features lots of code examples.

## Import functionality to load/build and start working with corpora

In [None]:
from conc.corpus import Corpus
from conc.listcorpus import ListCorpus
from conc.conc import Conc
from conc.corpora import list_corpora

## Set paths

Set the paths below to where you store source data (`source_path`) and where you store the processed corpora (`save_path`). 

In [None]:
source_path = '/srv/source-data/' # path to the source data from which the corpus will be created
save_path = '/srv/corpora/' # path to the directory where corpora are stored

## Building a corpus

The Conc documentation site provides some [basic recipes for building corpora](https://geoffford.nz/conc/tutorials/recipes.html#building-or-loading-a-corpus-for-analysis). You can build Conc corpora from CSV files (.csv or .csv.gz) and from text files (either a directory of text files or a .zip or .tar.gz archive of text files).
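Before building, it can help to check that your source data is in one of the formats listed above. The following is a minimal sketch using only the Python standard library. `source_kind` is a hypothetical helper, not part of Conc, and the example file names are made up for illustration.

```python
from pathlib import Path

def source_kind(source: str) -> str:
    """Classify a source path by the formats described above (hypothetical helper)."""
    path = Path(source)
    name = path.name.lower()
    if name.endswith(('.csv', '.csv.gz')):
        return 'csv'
    if name.endswith(('.zip', '.tar.gz')):
        return 'text archive'
    if path.is_dir():
        return 'text directory'
    return 'unsupported'

# Example file names are assumptions for illustration only.
print(source_kind('/srv/source-data/example.csv.gz'))  # csv
print(source_kind('/srv/source-data/example.tar.gz'))  # text archive
```

This only inspects the file name (or checks that a directory exists). It does not look inside the file, so the recipes linked above remain the reference for the actual build step.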

Note for DIGI405 students: if you are building a corpus for use in assignments or your own research, you will need to use a `source_path` and `save_path` that are in your home directory on the JupyterHub server. You can use relative paths, e.g. `source_path = 'data/'` and `save_path = 'corpora/'`, to use directories in the same folder as your notebook.
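If those directories do not exist yet, you can create them from the notebook. This is a small sketch using the standard library only. `ensure_dirs` is a hypothetical helper name, and the demonstration runs in a temporary directory so it leaves nothing behind.

```python
from pathlib import Path
import tempfile

def ensure_dirs(base: Path) -> tuple[str, str]:
    """Create data/ and corpora/ under base and return both paths with a trailing slash."""
    source_dir = base / 'data'
    corpora_dir = base / 'corpora'
    for directory in (source_dir, corpora_dir):
        # exist_ok=True makes this safe to re-run when the directories already exist
        directory.mkdir(parents=True, exist_ok=True)
    # Trailing slashes matter because this notebook joins paths by string concatenation
    return f'{source_dir.as_posix()}/', f'{corpora_dir.as_posix()}/'

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    source_path_demo, save_path_demo = ensure_dirs(Path(tmp))
    print(Path(source_path_demo).is_dir(), Path(save_path_demo).is_dir())
```

In your own notebook you would call `ensure_dirs(Path('.'))`, which returns `'data/'` and `'corpora/'` relative to the notebook's folder.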

## Building a lightweight corpus representation for use as a reference corpus (i.e. a list corpus)

Conc provides a way to create a lightweight corpus representation suitable for use as a reference corpus for analysis of keywords. The Conc documentation site provides some [basic recipes for building list corpora](https://geoffford.nz/conc/tutorials/recipes.html#building-and-loading-a-lightweight-corpus-representation-list-corpus-for-use-as-a-reference-corpus). To create a list corpus, you need to first build a Conc corpus. The recipe provides an example of how to convert a Conc corpus to a list corpus. 

## Loading a corpus or list corpus

Conc corpora are stored in a directory ending with `.corpus`. Locate the directory for the corpus you want to load. In this case, I want to load the Brown corpus, which is in a directory within `save_path` called `brown.corpus`. You can load it like this:

In [None]:
corpus = Corpus().load(f'{save_path}brown.corpus')

You don't have to use `corpus` for the variable name. It may be helpful to use a more descriptive name if you are working with multiple corpora (e.g. `brown_corpus`).
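The load call above builds its path by concatenating `save_path` and the directory name, so it only works if `save_path` ends with a slash. As a hedged sketch, a small helper can build the path either way. `corpus_dir` is a hypothetical name, not a Conc function.

```python
from pathlib import Path

def corpus_dir(save_path: str, name: str, suffix: str = '.corpus') -> str:
    """Join save_path and a corpus name, adding the suffix if it is missing."""
    if not name.endswith(suffix):
        name = f'{name}{suffix}'
    return (Path(save_path) / name).as_posix()

print(corpus_dir('/srv/corpora/', 'brown'))        # /srv/corpora/brown.corpus
print(corpus_dir('/srv/corpora', 'brown.corpus'))  # /srv/corpora/brown.corpus
```

The returned string can be used in place of the f-string in the load call. Passing `suffix='.listcorpus'` covers list corpus directories in the same way.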

It isn't necessary, but it can be helpful to view a summary of the corpus you have loaded.

In [None]:
corpus.summary()

Conc list corpora are stored in a directory ending with `.listcorpus`. As in the example above, locate the desired list corpus directory. In this case, it is the Baby BNC in `baby-bnc.listcorpus`. Load it like this:

In [None]:
listcorpus = ListCorpus().load(f'{save_path}baby-bnc.listcorpus')

You don't have to use `listcorpus` for the variable name. It may be helpful to use a more descriptive name if you are working with multiple corpora (e.g. `babybnc_listcorpus`).

As above, get basic information about the list corpus you have loaded:

In [None]:
listcorpus.summary()

## List available corpora

To view the corpora available at `save_path`, run:

In [None]:
list_corpora(save_path) 

The corpus column provides the name of the directory containing the corpus. You can use this name and the information above to load the corpus (or list corpus). 
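To make the directory naming convention concrete, here is a standard-library sketch that groups the directories under a path by their `.corpus` or `.listcorpus` ending. It is not how `list_corpora` works internally, and `find_corpus_dirs` is a hypothetical helper. It only looks at directory names and does not check that the contents are valid.

```python
from pathlib import Path
import tempfile

def find_corpus_dirs(save_path: str) -> dict[str, list[str]]:
    """Group directory names under save_path by their .corpus or .listcorpus ending."""
    found = {'.corpus': [], '.listcorpus': []}
    for entry in sorted(Path(save_path).iterdir()):
        if entry.is_dir() and entry.suffix in found:
            found[entry.suffix].append(entry.name)
    return found

# Demonstration with a throwaway directory that mimics the layout described above
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'brown.corpus').mkdir()
    (Path(tmp) / 'baby-bnc.listcorpus').mkdir()
    (Path(tmp) / 'notes.txt').write_text('not a corpus')
    print(find_corpus_dirs(tmp))
    # {'.corpus': ['brown.corpus'], '.listcorpus': ['baby-bnc.listcorpus']}
```

Names under `.corpus` are loaded with `Corpus().load(...)` and names under `.listcorpus` with `ListCorpus().load(...)`, as shown earlier in this notebook.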

Note for DIGI405 students: the corpora you need for labs and assignments (and some other corpora) are available on the JupyterHub server. Use the `list_corpora` function to view the available corpora and get the corpus directory name needed to load the corpora for analysis. 

## Prepare to analyse the corpus

To prepare Conc's reporting functionality, pass the loaded corpus to `Conc`:

In [None]:
conc = Conc(corpus)

You can use another variable name for `conc`. Pass in whichever variable name you used when you loaded the corpus.

Note: you can only load a Conc corpus for analysis. You cannot load a list corpus in this way. 

## Set a reference corpus for keyword analysis

If you plan to analyse keywords, you will need to set a reference corpus like this:

In [None]:
conc.set_reference_corpus(listcorpus)

In this case, I am using a list corpus, but the reference corpus can be a Conc corpus as well. Again, if you loaded your reference corpus with a different variable name, use that!

## Next steps ...

Right, that should be enough to get you started. The lab notebooks provide information on different Conc functionality relevant to analysis. You can also find more information and [examples](https://geoffford.nz/conc/tutorials/recipes.html#reports-for-corpus-analysis) by looking over the [Conc documentation site](https://geoffford.nz/conc/).