# Anatomy of a corpus

> Information on Conc corpus format if you want to access the data directly.
- toc: false
- page-layout: full

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| hide
import polars as pl
import os
import msgspec
from pathlib import Path
from IPython.display import Markdown, display

In [None]:
#| hide
source_path = f'{os.environ.get("HOME")}/data/'
save_path = f'{os.environ.get("HOME")}/data/conc-test-corpora/'

path_to_toy_corpus = f'{save_path}toy.corpus'
path_to_brown_corpus = f'{save_path}brown.corpus'
path_to_reuters_corpus = f'{save_path}reuters.corpus'
path_to_gardenparty_corpus = f'{save_path}garden-party.corpus'
path_to_congress_corpus = f'{save_path}us-congressional-speeches-subset-10k.corpus'

In [None]:
#| hide
from conc.corpus import Corpus, CorpusMetadata

## Introduction

A Conc corpus is a directory containing files with specific names and formats to represent the data. This document provides an overview of the various files and what they contain. Here is the directory structure of an example Conc corpus:

In [None]:
#| echo: false
def print_directory_tree(path, prefix="", restrict_to=None):
	path = Path(path)
	contents = list(path.iterdir())
	pointers = ['├── '] * (len(contents) - 1) + ['└── ']
	for pointer, child in zip(pointers, contents):
		if restrict_to is not None and restrict_to not in child.name:
			continue
		print(prefix + pointer + child.name)
		if child.is_dir():
			extension = '│   ' if pointer == '├── ' else '    '
			print_directory_tree(child, prefix + extension)

# Show the tree under save_path, restricted to the Garden Party corpus directory
print_directory_tree(f'{save_path}', '', restrict_to='garden-party.corpus')


├── garden-party.corpus
│   ├── tokens.parquet
│   ├── spaces.parquet
│   ├── README.md
│   ├── puncts.parquet
│   ├── corpus.json
│   ├── vocab.parquet
│   └── metadata.parquet


Note: by default the library creates a directory with the `.corpus` suffix. The directory name is created automatically on build based on a slugified version of the corpus name you assigned.  

For example, if you passed in the name:  

	Garden Party Corpus

The directory will be:  

	garden-party.corpus

The directory can be renamed and still loaded. The `.corpus` extension is intended to make corpora on your filesystem easier to find or identify.

To distribute a corpus, send a zip of the directory for others to extract or just share the directory as-is. 
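
Zipping and extracting a corpus directory can be sketched with the standard library. This is a minimal illustration, not part of Conc: `zip_corpus` is a hypothetical helper, and the directory below holds empty placeholder files rather than a real corpus.

```python
import shutil
import tempfile
from pathlib import Path

def zip_corpus(corpus_dir: Path, out_dir: Path) -> Path:
    """Hypothetical helper: write <out_dir>/<corpus dir name>.zip."""
    archive = shutil.make_archive(
        str(out_dir / corpus_dir.name), 'zip',
        root_dir=corpus_dir.parent,  # paths in the archive are relative to the parent ...
        base_dir=corpus_dir.name,    # ... so the zip contains 'garden-party.corpus/...'
    )
    return Path(archive)

# Stand-in corpus directory with empty placeholder files (not a real corpus)
work = Path(tempfile.mkdtemp())
corpus_dir = work / 'garden-party.corpus'
corpus_dir.mkdir()
for file_name in ['corpus.json', 'vocab.parquet', 'tokens.parquet']:
    (corpus_dir / file_name).touch()

out_dir = work / 'out'
out_dir.mkdir()
archive_path = zip_corpus(corpus_dir, out_dir)

# The recipient extracts the archive and gets the corpus directory back
received = work / 'received'
received.mkdir()
shutil.unpack_archive(str(archive_path), str(received))
extracted = sorted(p.name for p in (received / 'garden-party.corpus').iterdir())
print(archive_path.name, extracted)
```

Keeping the directory as the top-level entry of the archive means the extracted result can be passed straight to `Corpus().load`.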

Below is an overview of the files in a Conc corpus directory. The data can be accessed via Conc or accessed directly from the files. 

| File | Access via Conc | Description |
| -------- | ------- | ------- |
| README.md | - | Human-readable information about the corpus to aid distribution |
| corpus.json | specific properties, e.g. corpus.token_count | Machine-readable information about the corpus, including name, description, various summary statistics, and models used to build the corpus |
| vocab.parquet | corpus.vocab | A table mapping token strings to token IDs and frequency information |
| tokens.parquet | corpus.tokens | A table with indices based on token positions used to query the corpus with tokens represented by numeric IDs |
| metadata.parquet | corpus.metadata | A table with metadata for each document (if there is any) |
| spaces.parquet | corpus.spaces | A table to allow recovery of document spacing without the original texts |
| puncts.parquet | corpus.puncts | A table with punctuation positions |

Below is more information about each file. You can work with a corpus through Conc, but you can also read the processed corpus [parquet](https://parquet.apache.org/docs/file-format/) and JSON files directly. Conc reads parquet files with the [Polars library](https://pola.rs/), and other libraries also support the format. Python has native JSON support, although faster libraries exist; Conc uses the [msgspec library](https://github.com/jcrist/msgspec) to read and write JSON.

## Notes on specific Conc corpus files and data formats

The following information will help you if you want to work with the corpus data/files directly.   

### README.md

Below is an example of the README.md file generated by Conc.

In [None]:
#| echo: false
with open(f'{path_to_brown_corpus}/README.md', 'rb') as f:
    markdown = '<div class="alert alert-block alert-success">\n\n' + f.read().decode('utf-8') + '\n'
    markdown = markdown.replace('\n#', '\n##') # making headings smaller for display
    markdown += '</div>'
    display(Markdown(markdown))

<div class="alert alert-block alert-success">

## Brown Corpus

### About

This directory contains a corpus created using the [Conc](https://github.com/polsci/conc) Python library. 

### Corpus Information

A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.

Date created: 2025-07-23 22:27:11  
Document count: 500  
Token count: 1138566  
Word token count: 980144  
Unique tokens: 42930  
Unique word tokens: 42907  
Conc Version Number: 0.1.10  
spaCy model: en_core_web_sm, version 3.8.0  

### Using this corpus
 
Conc can be installed [via pip](https://pypi.org/project/conc/). The [Conc documentation site](https://geoffford.nz/conc) 
has tutorials and detailed information to get you started with Conc or to work with the corpus 
data directly.  

### Cite Conc

If you use Conc in your work, please cite it as follows: Ford, G. (2025). Conc: a Python library for efficient corpus analysis (Version 0.1.10) [Computer software]. https://doi.org/10.5281/zenodo.16358752


</div>

### corpus.json file

Below is the schema of the `corpus.json` file showing metadata saved with a corpus. These are loaded by Conc as attributes using `Corpus.load` or are created when you build a corpus using `Corpus.build_from_files` or `Corpus.build_from_csv`. The schema used to validate the JSON data represents the names and types of the attributes. 

In [None]:
#| echo: false
properties = msgspec.json.schema(CorpusMetadata)['$defs']['CorpusMetadata']['properties']
display(properties)

{'name': {'type': 'string'},
 'description': {'type': 'string'},
 'slug': {'type': 'string'},
 'conc_version': {'type': 'string'},
 'document_count': {'type': 'integer'},
 'token_count': {'type': 'integer'},
 'word_token_count': {'type': 'integer'},
 'punct_token_count': {'type': 'integer'},
 'space_token_count': {'type': 'integer'},
 'unique_tokens': {'type': 'integer'},
 'unique_word_tokens': {'type': 'integer'},
 'date_created': {'type': 'string'},
 'EOF_TOKEN': {'type': 'integer'},
 'SPACY_EOF_TOKEN': {'type': 'integer'},
 'SPACY_MODEL': {'type': 'string'},
 'SPACY_MODEL_VERSION': {'type': 'string'},
 'punct_tokens': {'type': 'array', 'items': {'type': 'integer'}},
 'space_tokens': {'type': 'array', 'items': {'type': 'integer'}}}

Once you have [built or loaded a corpus](https://geoffford.nz/conc/tutorials/recipes.html) you can access the attributes. For example ...

In [None]:
corpus = Corpus().load(path_to_brown_corpus) # loading the Brown corpus
print(corpus.name) # accessing the name of the corpus
print('Word token count: ', corpus.word_token_count) # access word_token_count

Brown Corpus
Word token count:  980144
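
Outside Conc, the same fields can be read from `corpus.json` with Python's built-in `json` module. The sketch below writes a stand-in file to a temporary directory first so that it runs anywhere. The field names come from the schema above, but the values and the `example.corpus` directory are illustrative, and a real `corpus.json` carries all of the schema's fields.

```python
import json
import tempfile
from pathlib import Path

# Stand-in corpus directory with a reduced corpus.json (illustrative values)
corpus_dir = Path(tempfile.mkdtemp()) / 'example.corpus'
corpus_dir.mkdir()
(corpus_dir / 'corpus.json').write_text(json.dumps({
    'name': 'Example Corpus',
    'document_count': 2,
    'token_count': 12,
    'word_token_count': 10,
}), encoding='utf-8')

# Read the corpus metadata directly, without Conc
with open(corpus_dir / 'corpus.json', encoding='utf-8') as f:
    corpus_info = json.load(f)

print(corpus_info['name'])
print('Word token count: ', corpus_info['word_token_count'])
```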


Some of these attributes are exposed by `Corpus` methods. For example ...

In [None]:
#| eval: false
corpus.info() # Polars dataframe with summary metadata

Attribute,Value
"""Name""","""Brown Corpus"""
"""Description""","""A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version …"
"""Date Created""","""2025-07-23 22:27:11"""
"""Conc Version""","""0.1.10"""
"""Corpus Path""","""/home/geoff/data/conc-test-corpora/brown.corpus"""
"""Document Count""","""500"""
"""Token Count""","""1,138,566"""
"""Word Token Count""","""980,144"""
"""Unique Tokens""","""42,930"""
"""Unique Word Tokens""","""42,907"""


### vocab.parquet

The vocab parquet file contains ...

1. A lookup between token_id, token (string representation), and tokens_sort_order. The sort order allows sorting tokens alphabetically directly from token ids.
2. A frequency table, with counts for lower cased tokens and orthographic realisation of tokens as they appeared in the text.
3. Information on the type of token (i.e. whether punctuation or space - or if neither of those, a "word" token).

If you have loaded a corpus in Conc, you can access the vocab parquet data as a Polars dataframe like this ...

In [None]:
#| eval: false
# corpus.vocab is a Polars dataframe
corpus.vocab.head(5).collect(engine='streaming')

rank,tokens_sort_order,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
1,50087,22848,"""the""",63516,62473,False,False
2,28,8128,""",""",58331,58331,True,False
3,41,38309,""".""",49907,49907,True,False
4,35232,2739,"""of""",36321,36122,False,False
5,3351,7126,"""and""",27787,27633,False,False


You can also access the vocab data directly from the parquet file using Polars (or other libraries that support parquet).

In [None]:
#| eval: false
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').head(5).collect(engine='streaming'))

rank,tokens_sort_order,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
1,50087,22848,"""the""",63516,62473,False,False
2,28,8128,""",""",58331,58331,True,False
3,41,38309,""".""",49907,49907,True,False
4,35232,2739,"""of""",36321,36122,False,False
5,3351,7126,"""and""",27787,27633,False,False


To illustrate how frequencies are stored, see the rows for 'the'. The `frequency_orth` column stores the count for each form as it appeared in the text: 'The' appears 1,043 times and lowercase 'the' appears 62,473 times. The `frequency_lower` column stores the total number of mentions of 'the' regardless of case (63,516). In the output below it is populated on the row for the lowercase form and empty on the row for 'The'.

In [None]:
#| eval: false
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('token').str.to_lowercase() == 'the').head(5).collect(engine='streaming'))

rank,tokens_sort_order,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
1,50087,22848,"""the""",63516.0,62473,False,False
99,50086,15682,"""The""",,1043,False,False


Punctuation is included in the vocab table, but these tokens can be filtered in Conc reports. If you are working with the table directly you can use `is_punct` to select or remove punctuation.

In [None]:
#| eval: false
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('is_punct') == True).head(5).collect(engine='streaming'))

rank,tokens_sort_order,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
2,28,8128,""",""",58331,58331,True,False
3,41,38309,""".""",49907,49907,True,False
12,1577,1601,"""`""",9788,9788,True,False
14,14,42833,"""''""",8762,8762,True,False
15,29,27963,"""-""",8131,8131,True,False


Conc also stores space tokens from spaCy's tokenisation process, which are sequences of whitespace characters. Space tokens are included (without counts). Space tokens are explained in more detail below the table.

In [None]:
#| eval: false
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('is_space') == True).head(5).collect(engine='streaming'))

rank,tokens_sort_order,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
47165,1,2956,""" """,,,False,True
47166,2,2799,""" """,,,False,True
47167,3,27276,""" 	""",,,False,True
47168,4,22812,""" """,,,False,True
47169,5,4112,""" 	""",,,False,True


#### A note about space tokens

When spaCy tokenises text it records whether each token is followed by a standard space character. Conc stores this information during build in the `has_spaces` column of the tokens table (see below). spaCy also creates tokens for other whitespace sequences, which the Conc documentation refers to as space tokens. Space tokens may be useful for some sequence classification problems but, more importantly for Conc, they allow source documents to be reproduced in their original form, with newlines, tabs and sequences of whitespace preserved.

Space tokens are not included in overall token counts stored in the corpus.json file. 
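
As a rough illustration of what the `has_spaces` flag makes possible, the sketch below rebuilds a string from token strings and their flags. `join_tokens` is a hypothetical helper and the tokens are invented for the example. It is not Conc's own recovery code and it ignores space tokens entirely.

```python
def join_tokens(tokens, has_spaces):
    """Concatenate token strings, adding one space after each token flagged True."""
    return ''.join(
        token + (' ' if spaced else '')
        for token, spaced in zip(tokens, has_spaces)
    )

# Invented example: punctuation attaches to its neighbours where the flag is False
tokens = ['The', 'jury', 'said', ',', '"', 'No', '.', '"']
has_spaces = [True, True, False, True, False, False, False, False]

text = join_tokens(tokens, has_spaces)
print(text)  # The jury said, "No."
```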

### tokens.parquet

The `tokens.parquet` file contains a table representing tokens in the corpus. Whitespace tokens have been removed (see discussion of space tokens above). The tokens data can be directly accessed in Conc as a Polars dataframe ...

In [None]:
#| eval: false
# corpus.tokens is a Polars dataframe
corpus.tokens.with_row_index('position').filter(pl.col('position').is_between(99, 104)).collect(engine='streaming')

position,orth_index,lower_index,token2doc_index,has_spaces
99,46333,46333,-1,False
100,15682,22848,1,True
101,4361,41672,1,True
102,14610,29725,1,True
103,54713,49998,1,True
104,45742,19078,1,True


You can also access the tokens data directly from the parquet file using Polars (or other libraries that support parquet).

In [None]:
#| eval: false
pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 112)).collect(engine='streaming')

position,orth_index,lower_index,token2doc_index,has_spaces
99,46333,46333,-1,False
100,15682,22848,1,True
101,4361,41672,1,True
102,14610,29725,1,True
103,54713,49998,1,True
104,45742,19078,1,True
105,53250,53250,1,True
106,8699,35796,1,True
107,45680,45680,1,True
108,30305,30305,1,True


The columns are as follows:

- the `orth_index` column stores the token_id of the original form of the token
- the `lower_index` column stores the token_id of the lowercased form of the token  
- the `token2doc_index` column stores the document id assigned by Conc in the order the texts were processed (starting from index position 1) - this is the same order as the metadata in the `metadata.parquet` file.
- the `has_spaces` column stores a boolean value indicating if the token is followed by a standard space character or not.

To demarcate the start and end of documents, Conc uses an end of file token (EOF_TOKEN) in the orth_index and lower_index columns. The EOF_TOKEN is stored in `corpus.json` and is accessible as an attribute of a `Corpus` object.

The token2doc_index represents token positions outside texts in the corpus as -1.

In [None]:
#| hide
# tmp_tokens_df = pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').with_row_index('position').collect()
# display(tmp_tokens_df.filter(pl.col('position').is_between(99, 121)).head(5))
# display(tmp_tokens_df.filter(pl.col('position').is_between(343, 348)).head(5))

# space_tokens_df = tmp_tokens_df.filter((pl.col('orth_index').is_in(brown.space_tokens))).with_row_index('adjust_by').with_columns((pl.col('position') - pl.col('adjust_by')).alias('corrected'))
# # replace position with corrected 
# space_tokens_df = space_tokens_df.with_columns(pl.col('corrected').alias('position')).drop('adjust_by').drop('corrected')
# # remove space_tokens_df from tmp_tokens_df
# tmp_tokens_df = tmp_tokens_df.filter(~pl.col('orth_index').is_in(brown.space_tokens)).drop('position').with_row_index('position')

# tmp_tokens_df = tmp_tokens_df.with_columns(pl.lit(1).alias('not_space'))
# space_tokens_df = space_tokens_df.with_columns(pl.lit(0).alias('not_space'))

# display(tmp_tokens_df.filter(pl.col('position').is_between(99, 121)).head(5))
# display(tmp_tokens_df.filter(pl.col('position').is_between(343, 348)).head(5))

# display(space_tokens_df.head(5))

# reconstructed_df = pl.concat([tmp_tokens_df, space_tokens_df]).sort('position', 'not_space').drop('position').drop('not_space').with_row_index('position')
# display(reconstructed_df.filter(pl.col('position').is_between(99, 121)).head(5))
# display(reconstructed_df.filter(pl.col('position').is_between(343, 348)).head(5))

# # is resconstructed identical to original? programmatically check
# assert pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').collect(engine='streaming').with_row_index('position').equals(reconstructed_df)

# # test for just one doc ...
# doc_id = 300
# tmp_doc_df = tmp_tokens_df.filter(pl.col('token2doc_index') == doc_id)
# display(tmp_doc_df.head(5))
# reconstructed_df = pl.concat([tmp_doc_df, space_tokens_df.filter(pl.col('token2doc_index') == doc_id)]).sort('position', 'not_space').drop('position').drop('not_space')
# display(reconstructed_df.head(5))
# assert pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').filter(pl.col('token2doc_index') == doc_id).collect(engine='streaming').equals(reconstructed_df)


The view below shows tokens joined with vocab, so you can see the token string and the attributes of the tokens. 

In [None]:
#| eval: false
pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 121)).join(
    pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').select(pl.col('token_id'), pl.col('token'), pl.col('is_punct'), pl.col('is_space')),
    left_on='orth_index', right_on='token_id', how='left', maintain_order='left').collect(engine='streaming')

position,orth_index,lower_index,token2doc_index,has_spaces,token,is_punct,is_space
99,46333,46333,-1,False,""" conc-end-of-file-token""",False,False
100,15682,22848,1,True,"""The""",False,False
101,4361,41672,1,True,"""Fulton""",False,False
102,14610,29725,1,True,"""County""",False,False
103,54713,49998,1,True,"""Grand""",False,False
104,45742,19078,1,True,"""Jury""",False,False
105,53250,53250,1,True,"""said""",False,False
106,8699,35796,1,True,"""Friday""",False,False
107,45680,45680,1,True,"""an""",False,False
108,30305,30305,1,True,"""investigation""",False,False


### spaces.parquet

Space tokens (see note above) are stored in `spaces.parquet`, separately from word and punctuation tokens. This allows consistency with most tools for corpus linguistics. The spaces table follows the format of the tokens table: each space is represented by its `position` and the token_id of the whitespace. The original token sequences can be reconstructed from the `tokens.parquet` and `spaces.parquet` files, and Conc has functionality to recover specific texts by recombining the data.

From Conc ...

In [None]:
#| eval: false
# corpus.spaces is a Polars dataframe
corpus.spaces.head(3).collect(engine='streaming')

position,orth_index,lower_index,token2doc_index,has_spaces
100,27276,27276,1,False
345,2956,2956,1,False
637,2956,2956,1,False


Directly accessing the parquet file ...

In [None]:
#| eval: false
pl.scan_parquet(f'{path_to_brown_corpus}/spaces.parquet').head(3).collect(engine='streaming')

position,orth_index,lower_index,token2doc_index,has_spaces
100,27276,27276,1,False
345,2956,2956,1,False
637,2956,2956,1,False
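
The recombination of tokens and spaces described in this section could look roughly like the sketch below. It rests on an assumption that should be checked against your own corpus: a space row with position p belongs immediately before the token at row p of the tokens table. `merge_space_tokens` is a hypothetical helper working on plain lists with made-up IDs, not Conc's implementation.

```python
from collections import defaultdict

def merge_space_tokens(token_ids, space_rows):
    """Interleave space tokens with the token sequence.

    token_ids: token IDs in tokens-table order (row position = list index).
    space_rows: (position, token_id) pairs from the spaces table.
    Assumption: a space at position p precedes the token at row p.
    Spaces positioned past the last token are ignored in this sketch.
    """
    spaces_at = defaultdict(list)
    for position, space_id in space_rows:
        spaces_at[position].append(space_id)

    merged = []
    for position, token_id in enumerate(token_ids):
        merged.extend(spaces_at.get(position, []))  # spaces first, then the token
        merged.append(token_id)
    return merged

# Made-up IDs: 900 stands for a whitespace token
token_ids = [10, 11, 12, 13]
space_rows = [(0, 900), (2, 900)]

merged = merge_space_tokens(token_ids, space_rows)
print(merged)  # [900, 10, 11, 900, 12, 13]
```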


### puncts.parquet

The `puncts.parquet` file stores an index of the positions of punctuation tokens in the corpus. Below are the first three rows of a `puncts.parquet` file. The positions correspond to the rows of punctuation tokens in the `tokens.parquet` file. They are intended to be used for filtering tokens to exclude punctuation where necessary.

From Conc ...

In [None]:
#| eval: false
# corpus.puncts is a Polars dataframe
corpus.puncts.head(3).collect(engine='streaming')

position
116
117
120


Directly from the parquet file ...

In [None]:
#| eval: false
pl.scan_parquet(f'{path_to_brown_corpus}/puncts.parquet').head(3).collect(engine='streaming')

position
116
117
120


### metadata.parquet

The `metadata.parquet` file should not be confused with the metadata of the corpus itself, which is stored in `corpus.json`.

If populated, the `metadata.parquet` file contains metadata for each document in the corpus. 

From Conc you can access the metadata dataframe ...

In [None]:
#| eval: false
corpus = Corpus().load(path_to_congress_corpus) # loading a corpus with some metadata!
corpus.metadata.head(3).collect(engine='streaming')

speech_id,date,speaker,chamber,state
530182158,"""1895-01-10T00:00:00.000000""","""Mr. COCKRELL""","""S""","""Unknown"""
890274849,"""1966-08-31T00:00:00.000000""","""Mr. LONG of Louisiana""","""S""","""Louisiana"""
880088363,"""1963-09-11T00:00:00.000000""","""Mr. FULBRIGHT""","""S""","""Unknown"""


Directly from the parquet file ...

In [None]:
#| eval: false
display(pl.scan_parquet(f'{path_to_congress_corpus}/metadata.parquet').head(3).collect(engine='streaming'))

speech_id,date,speaker,chamber,state
530182158,"""1895-01-10T00:00:00.000000""","""Mr. COCKRELL""","""S""","""Unknown"""
890274849,"""1966-08-31T00:00:00.000000""","""Mr. LONG of Louisiana""","""S""","""Louisiana"""
880088363,"""1963-09-11T00:00:00.000000""","""Mr. FULBRIGHT""","""S""","""Unknown"""


Metadata is represented in the same order as documents are stored in the `tokens.parquet` file. The tokens with token2doc_index 1 correspond to the first metadata row.

For corpora created from files using the `Corpus.build_from_files` method, the metadata always includes a field recording each document's source file at the time of creation.

In [None]:
#| eval: false
corpus = Corpus().load(path_to_gardenparty_corpus)
display(pl.scan_parquet(f'{corpus.corpus_path}/metadata.parquet').head(3).collect(engine='streaming'))

file
"""an-ideal-family.txt"""
"""at-the-bay.txt"""
"""bank-holiday.txt"""
