# Get Started with Conc

> Here is how to get started with Conc.

This is a quick, no-frills introduction to using Conc. You can skip section 1 if you already have data you want to work with.

## 1. Get some sample texts

This guide uses a collection of short stories from Katherine Mansfield's *The Garden Party* as sample texts. The corpus is available as a zip file of text files and can be downloaded via the [conc.corpora submodule](https://geoffford.nz/conc/api/corpora.html#get_garden_party). First, import the function from `conc.corpora` that fetches the sample data.

In [None]:
import os
from conc.corpora import get_garden_party

Next, define where the source data should be stored (`source_path`) and where the corpus should be saved (`save_path`). When the corpus is built, it is saved in a new directory inside `save_path`. Note: the `os.environ.get` calls in the paths below are not required. You can specify paths directly as strings (e.g. `/some/path/`).

In [None]:
source_path = f'{os.environ.get("HOME")}/data/'  
save_path = f'{os.environ.get("HOME")}/data/conc-test-corpora/' 
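If you would rather not depend on the `HOME` environment variable, the same two paths can be built with the standard library's `pathlib`. This is only a sketch of an alternative: it keeps the trailing slash used throughout this guide, because later cells append file and directory names directly to these strings.

```python
from pathlib import Path

# Path.home() resolves the current user's home directory without
# reading the HOME environment variable directly.
data_dir = Path.home() / 'data'

# Keep the trailing slash: later cells in this guide append names
# straight onto these strings.
source_path = f'{data_dir}/'
save_path = f'{data_dir / "conc-test-corpora"}/'

print(source_path)
print(save_path)
```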

Now download the data. This creates the `source_path` directory defined above if it does not already exist, provided it is in a location your user can write to.

In [None]:
get_garden_party(source_path=source_path)

## 2. Build the corpus

You can currently build a Conc corpus from:  

* a directory of text files or a .zip/.tar/.tar.gz containing text files (`Corpus.build_from_files`)  
* a .csv file (or .csv.gz file) with a column containing your text (`Corpus.build_from_csv`) 

More source types will be added in the future, but lots of data can be wrangled into these formats.  

Both methods support importing metadata. See the documentation for each method for more details.
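As a rough illustration of wrangling data into the CSV shape described above, here is a sketch that writes a few records to a .csv file with one text column and one metadata column, using only the standard library. The column names (`title`, `text`), the file name, and the sample records are made up for this example. How you point `Corpus.build_from_csv` at the text and metadata columns is covered in its documentation and is not shown here.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical records: one metadata field (title) and the text itself.
records = [
    {'title': 'First note', 'text': 'A short example text.'},
    {'title': 'Second note', 'text': 'Another text, with a comma and a "quote".'},
]

# Write to a temporary directory so the example does not touch your data.
csv_path = Path(tempfile.mkdtemp()) / 'example-texts.csv'

# newline='' lets the csv module control line endings; the writer quotes
# fields containing commas or quotation marks automatically.
with csv_path.open('w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'text'])
    writer.writeheader()
    writer.writerows(records)

# Read the file back to confirm the texts survive the round trip.
with csv_path.open(newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

print(len(rows), rows[1]['text'])
```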

For information on the Conc corpus format, see the [Anatomy of a Conc Corpus](https://geoffford.nz/conc/explanations/anatomy.html).  

The following code imports the `Corpus` class from `conc.corpus`.

In [None]:
from conc.corpus import Corpus

The following line creates a Corpus, gives it a name and description, and builds it from the Garden Party source files.  

Remember, a new directory for your corpus will be created in `save_path`. The name of that directory is a slugified version of the name you pass in. For the Garden Party Corpus, the directory `garden-party.corpus` will be created. You can rename the directory later if you want, and you can distribute your corpus by sharing the directory and its contents.

Build time depends on the size of your corpus. The build process produces a corpus format that is quick to load and use. This corpus is small, so the build finishes in a couple of seconds even on an old, slow computer.

In [None]:
name = 'Garden Party Corpus'
description = 'A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party'
source_file = 'garden-party-corpus.zip'

corpus = Corpus(name=name, description=description).build_from_files(source_path=f'{source_path}{source_file}', save_path=save_path)
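Since a built corpus is just a directory, one way to share it is to zip that directory. The sketch below uses the standard library's `shutil.make_archive` on a stand-in directory created in a temporary location. The placeholder file is made up for the example, and in practice you would point this at the real corpus directory inside `save_path`. I am assuming the recipient unpacks the archive before loading the corpus.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

# Stand-in for a built corpus directory (hypothetical contents).
work_dir = Path(tempfile.mkdtemp())
corpus_dir = work_dir / 'garden-party.corpus'
corpus_dir.mkdir()
(corpus_dir / 'placeholder.txt').write_text('stand-in file', encoding='utf-8')

# Creates garden-party.corpus.zip next to the directory, with the
# corpus directory itself at the root of the archive.
archive_path = shutil.make_archive(
    base_name=str(work_dir / 'garden-party.corpus'),
    format='zip',
    root_dir=str(work_dir),
    base_dir=corpus_dir.name,
)

# List the archive contents to confirm what will be shared.
with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()

print(archive_path)
print(names)
```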

To get information on the corpus, including various summary counts and information on the path of the corpus, you can use the `Corpus.summary` method.  

In [None]:
corpus.summary()

Corpus Summary

| Attribute | Value |
|---|---|
| Name | Garden Party Corpus |
| Description | A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party |
| Date Created | 2025-06-14 11:39:26 |
| Conc Version | 0.1.3 |
| Corpus Path | /home/geoff/data/conc-test-corpora//garden-party.corpus |
| Document Count | 15 |
| Token Count | 74664 |
| Word Token Count | 63311 |
| Unique Tokens | 5410 |
| Unique Word Tokens | 5398 |


## 3. Load a Conc corpus

Here is how to load the corpus we just built. We don't need to pass in a name and description; we only need the path to the corpus.

In [None]:
corpus = Corpus().load(corpus_path=f'{save_path}garden-party.corpus')

Let's check the corpus information again. We could call the summary method as before, but the same information is also available from the `Corpus.info` method. Here we pass the `include_disk_usage` parameter to get additional information on how much disk space the corpus is using.

In [None]:
print(corpus.info(include_disk_usage=True))

┌───────────┬─────────────────────┐
│ Attribute ┆ Value               │
╞═══════════╪═════════════════════╡
│ Name      ┆ Garden Party Corpus │
│ …         ┆ …                   │

## 4. Using Conc

To use the corpus we need to import the `Conc` class from `conc.conc`.

In [None]:
from conc.conc import Conc

The `Conc` class is the main interface for working with your corpus. It provides methods for a range of corpus analyses, including frequency, ngrams, concordances, collocates, and keyness. There are separate classes for each of these analyses, but the `Conc` class provides the most straightforward way to run them.

Here we instantiate a `Conc` object with the corpus just loaded. 

In [None]:
conc = Conc(corpus=corpus)

Below you will see how to generate Conc reports. I'll add more notes here soon on how to make full use of these reports. For now, check out the code below and refer to the [API documentation](https://geoffford.nz/conc/api/conc.html) for information on parameters.

In [None]:
conc.frequencies().display()

Frequencies: Frequencies of word tokens, Garden Party Corpus

| Rank | Token | Frequency | Normalized Frequency |
|---:|:---|---:|---:|
| 1 | the | 2911 | 459.79 |
| 2 | and | 1798 | 283.99 |
| 3 | “ | 1615 | 255.09 |
| 4 | ” | 1614 | 254.93 |
| 5 | a | 1407 | 222.24 |
| 6 | to | 1376 | 217.34 |
| 7 | she | 1171 | 184.96 |
| 8 | was | 1102 | 174.06 |
| 9 | it | 1021 | 161.27 |
| 10 | her | 937 | 148.00 |
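The Normalized Frequency figures in the report above appear to be frequencies per 10,000 word tokens, using the Word Token Count (63311) from the corpus summary. That is my inference from the numbers, not a statement of how Conc computes them, so treat the sketch below as a sanity check. The helper `normalized_frequency` is a hypothetical function written for this example and is not part of Conc.

```python
def normalized_frequency(frequency, total_tokens, per=10_000):
    """Scale a raw count to a rate per `per` tokens."""
    return frequency / total_tokens * per

# Word Token Count taken from the corpus summary shown earlier.
word_token_count = 63311

# Raw frequencies copied from the frequency report.
for token, frequency in [('the', 2911), ('and', 1798), ('her', 937)]:
    print(token, round(normalized_frequency(frequency, word_token_count), 2))
```

The rounded results match the Normalized Frequency values reported for these three tokens.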


In [None]:
conc.ngram_frequencies(ngram_length=2).display()

Ngram Frequencies: Garden Party Corpus

| Rank | Ngram | Frequency | Normalized Frequency |
|---:|:---|---:|---:|
| 1 | ” said | 328 | 51.81 |
| 2 | ” “ | 326 | 51.49 |
| 3 | it was | 247 | 39.01 |
| 4 | in the | 214 | 33.80 |
| 5 | “ i | 197 | 31.12 |
| 6 | on the | 183 | 28.90 |
| 7 | of the | 156 | 24.64 |
| 8 | ” she | 156 | 24.64 |
| 9 | to the | 139 | 21.96 |
| 10 | at the | 133 | 21.01 |
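To make the idea of an ngram frequency report concrete, here is a minimal sketch that counts two-token sequences in a toy token list with `collections.Counter`. The token list is invented for illustration, and `ngram_counts` is a hypothetical helper. This shows the general technique only and says nothing about how Conc stores or counts ngrams internally.

```python
from collections import Counter

def ngram_counts(tokens, ngram_length=2):
    """Count every run of `ngram_length` consecutive tokens."""
    # zip over shifted copies of the list yields each consecutive window.
    windows = zip(*(tokens[i:] for i in range(ngram_length)))
    return Counter(' '.join(window) for window in windows)

# Toy token list, made up for this example.
tokens = 'it was late and it was cold and it was dark'.split()

counts = ngram_counts(tokens, ngram_length=2)
for ngram, count in counts.most_common(2):
    print(ngram, count)
```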


In [None]:
conc.ngrams('she was', ngram_length=3).display()

Ngrams for "she was": Garden Party Corpus

| Rank | Ngram | Frequency | Normalized Frequency |
|---:|:---|---:|---:|
| 1 | she was going | 5 | 0.79 |
| 2 | she was so | 5 | 0.79 |
| 3 | she was a | 4 | 0.63 |
| 4 | she was still | 3 | 0.47 |
| 5 | she was as | 3 | 0.47 |
| 6 | she was n’t | 3 | 0.47 |
| 7 | she was never | 3 | 0.47 |
| 8 | she was at | 3 | 0.47 |
| 9 | she was gone | 3 | 0.47 |
| 10 | she was the | 2 | 0.32 |


In [None]:
conc.concordance('quite').display()

Concordance for "quite": Garden Party Corpus, Context tokens: 5, Order: 1R2R3R

| Document Id | Left | Node | Right |
|---:|---:|:---:|:---|
| 5 | echoed , “ Oh , | quite | ! ” and she was |
| 4 | , Jug ? ” “ | Quite | , Con . ” “ |
| 9 | ? ” “ Oh , | quite | , quite , ” said |
| 4 | drawing - room . “ | Quite | , ” said Josephine faintly |
| 9 | “ Oh , quite , | quite | , ” said Reggie , |
| 10 | ? ” “ Yes , | quite | . Oh , Laurie ! |
| 10 | you think ? ” “ | Quite | . ” “ Hans , |
| 8 | match for the kettle in | quite | a dashing way . But |
| 4 | a piece . It was | quite | a gayme . ” Josephine |
| 5 | into a pool . “ | Quite | a good floor , is |
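If the concordance layout is new to you, the sketch below builds a tiny keyword-in-context listing from a toy token list. Each line holds up to five tokens of left context, the node, and up to five tokens of right context. Lines are sorted on the first three tokens to the right of the node, which is how I read the `1R2R3R` order shown in the report. The `kwic` helper and the token list are made up for this example, and this is not Conc's implementation.

```python
def kwic(tokens, node, context=5):
    """Return (left, node, right) tuples for each case-insensitive match."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            lines.append((left, token, right))
    # Sort on the first, second and third token to the right of the node.
    lines.sort(key=lambda line: [t.lower() for t in line[2][:3]])
    return lines

# Toy token list, made up for this example.
tokens = 'It was quite a dashing way . “ Quite , ” said Josephine . Oh , quite !'.split()

result = kwic(tokens, 'quite')
for left, node, right in result:
    print(f'{" ".join(left):>30} | {node} | {" ".join(right)}')
```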


In [None]:
conc.concordance_plot('quite')

In [None]:
conc.collocates('quite').display()

Collocates of "quite": Garden Party Corpus

| Rank | Token | Collocate Frequency | Frequency | Logdice | Log Likelihood |
|---:|:---|---:|---:|---:|---:|
| 1 | oh | 7 | 149 | 10.00 | 8.99 |
| 2 | ” | 44 | 1614 | 9.74 | 23.91 |
| 3 | said | 14 | 514 | 9.61 | 7.59 |
| 4 | “ | 40 | 1615 | 9.60 | 17.49 |
| 5 | so | 6 | 241 | 9.28 | 2.66 |
| 6 | was | 22 | 1102 | 9.26 | 5.13 |
| 7 | n’t | 11 | 522 | 9.24 | 3.07 |
| 8 | it | 20 | 1021 | 9.22 | 4.35 |
| 9 | she | 22 | 1171 | 9.18 | 4.07 |
| 10 | they | 7 | 398 | 8.92 | 0.97 |
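The Logdice column can be checked by hand with the standard logDice formula, 14 + log2(2 × collocate frequency / (token frequency + node frequency)). The report does not show the frequency of the node "quite", so the value 75 below is an assumption I inferred from the table (it is the value that reproduces the reported scores). The `log_dice` helper is a hypothetical function written for this sketch and is not Conc's code.

```python
import math

def log_dice(collocate_frequency, token_frequency, node_frequency):
    """Standard logDice: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(
        2 * collocate_frequency / (token_frequency + node_frequency)
    )

# ASSUMPTION: node frequency for "quite" inferred from the table;
# it is not reported in the collocates output.
node_frequency = 75

# (token, collocate frequency, token frequency) copied from the report.
rows = [('oh', 7, 149), ('”', 44, 1614), ('said', 14, 514)]
for token, collocate_frequency, token_frequency in rows:
    score = log_dice(collocate_frequency, token_frequency, node_frequency)
    print(token, round(score, 2))
```

With that assumed node frequency, the three scores round to the Logdice values reported for these tokens.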


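Keyword analysis compares your corpus against a reference corpus. The next cell assumes a Conc corpus built from the Brown Corpus has already been saved as `brown.corpus` in `save_path`.

In [None]:
reference_corpus = Corpus().load(corpus_path=f'{save_path}brown.corpus')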

In [None]:
conc.set_reference_corpus(reference_corpus)

In [None]:
conc.keywords(min_document_frequency=3, min_document_frequency_reference=3).display()

Keywords: Target corpus: Garden Party Corpus, Reference corpus: Brown Corpus

| Rank | Token | Frequency | Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
|---:|:---|---:|---:|---:|---:|---:|---:|---:|
| 1 | clasped | 13 | 3 | 2.05 | 0.03 | 67.09 | 6.07 | 57.79 |
| 2 | bye | 25 | 7 | 3.95 | 0.07 | 55.29 | 5.79 | 107.37 |
| 3 | velvet | 14 | 5 | 2.21 | 0.05 | 43.35 | 5.44 | 57.19 |
| 4 | morrow | 8 | 3 | 1.26 | 0.03 | 41.28 | 5.37 | 32.32 |
| 5 | wailed | 8 | 3 | 1.26 | 0.03 | 41.28 | 5.37 | 32.32 |
| 6 | shone | 13 | 5 | 2.05 | 0.05 | 40.25 | 5.33 | 52.21 |
| 7 | queer | 15 | 6 | 2.37 | 0.06 | 38.70 | 5.27 | 59.69 |
| 8 | drawers | 10 | 4 | 1.58 | 0.04 | 38.70 | 5.27 | 39.79 |
| 9 | dove | 10 | 4 | 1.58 | 0.04 | 38.70 | 5.27 | 39.79 |
| 10 | gloves | 17 | 7 | 2.69 | 0.07 | 37.60 | 5.23 | 67.18 |
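Two of the keyness columns are closely related: in the output above, Log Ratio looks like the base-2 logarithm of Relative Risk, where relative risk is the token's relative frequency in the target corpus divided by its relative frequency in the reference corpus. That reading is inferred from the reported numbers, not taken from Conc's documentation. Both helpers below are hypothetical functions written for this sketch, and the sizes passed to `relative_risk` in the final line are round numbers chosen for illustration.

```python
import math

def relative_risk(freq_target, size_target, freq_reference, size_reference):
    """Ratio of relative frequencies in the target and reference corpora."""
    return (freq_target / size_target) / (freq_reference / size_reference)

def log_ratio(risk):
    """Base-2 log of relative risk: +1 means twice as frequent in the target."""
    return math.log2(risk)

# Relative Risk values copied from the keywords report.
reported = {'clasped': 67.09, 'bye': 55.29, 'velvet': 43.35}
for token, risk in reported.items():
    print(token, round(log_ratio(risk), 2))

# Illustrative round numbers (not from the report): 10 per 1,000 tokens
# versus 5 per 1,000 tokens gives a relative risk of about 2.
print(relative_risk(10, 1000, 5, 1000))
```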
