# Anatomy of a corpus

This page describes the Conc corpus format for readers who want to work with the data directly.

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| hide
import polars as pl
import os
import msgspec

In [None]:
#| hide
source_path = f'{os.environ.get("HOME")}/data/'
save_path = f'{os.environ.get("HOME")}/data/conc-test-corpora/'

In [None]:
#| hide
from conc.corpus import Corpus, CorpusMetadata

A Conc corpus is a directory containing specific files as follows:

```
corpus-name.corpus/
	README.md - Human readable information about the corpus to aid distribution
	corpus.json - Machine readable information about the corpus, including name, description, various summary statistics, and models used to build the corpus
	vocab.parquet - A table mapping token strings to token IDs and frequency information
	tokens.parquet - A table with indices based on token positions used to query the corpus with tokens represented by numeric IDs
	metadata.parquet - A table with metadata for each document (if there is any)
	spaces.parquet - A table to allow recovery of document spacing without the original texts
	puncts.parquet - A table with punctuation positions
```

Note: by default, Conc creates the corpus directory with a `.corpus` suffix when it builds a corpus. The directory can be renamed and will still load. The `.corpus` suffix is intended to make corpora easier to find and identify on your filesystem.

To distribute a corpus, send a zip of the directory for others to extract or just share the directory as-is. 

Below is more information about each file. You can work with a corpus through Conc, but you can also query the processed `.parquet` files directly with the `polars` library. The following information should help you with this.
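
Before querying a corpus directory by hand, it can help to confirm which of the files listed above are present. The following is a minimal sketch, not part of Conc: `missing_corpus_files` and `EXPECTED_FILES` are hypothetical names, and the punctuation file is spelled `puncts.parquet` to match the queries later on this page. It runs against a throwaway directory so it does not need a real corpus.

```python
from pathlib import Path
import tempfile

# File names taken from the listing above; puncts.parquet follows the
# spelling used in the queries later on this page.
EXPECTED_FILES = (
    "README.md",
    "corpus.json",
    "vocab.parquet",
    "tokens.parquet",
    "metadata.parquet",
    "spaces.parquet",
    "puncts.parquet",
)


def missing_corpus_files(corpus_dir):
    """Hypothetical helper: return the expected file names absent from corpus_dir."""
    corpus_dir = Path(corpus_dir)
    return [name for name in EXPECTED_FILES if not (corpus_dir / name).is_file()]


# Demonstrate on a temporary directory that only contains two of the files.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.corpus"
    demo.mkdir()
    (demo / "README.md").write_text("# Demo\n", encoding="utf-8")
    (demo / "corpus.json").write_text("{}", encoding="utf-8")
    missing = missing_corpus_files(demo)

print(missing)
```

Since the listing describes document metadata as optional, a missing `metadata.parquet` is probably worth inspecting rather than treating as an error.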

### README.md

Below is an example of the README.md file generated by Conc.

In [None]:
#| eval: false
from IPython.display import Markdown, display

In [None]:
#| eval: false
source_path = f'{os.environ.get("HOME")}/data/'
save_path = f'{os.environ.get("HOME")}/data/conc-test-corpora/'

In [None]:
#| eval: false
brown = Corpus().load(f'{save_path}/brown.corpus')
toy = Corpus().load(f'{save_path}/toy.corpus')

In [None]:
#| eval: false
with open(f'{brown.corpus_path}/README.md', 'rb') as f:
    markdown = '<div class="alert alert-block alert-success">\n\n' + f.read().decode('utf-8') + '\n'
    markdown = markdown.replace('\n#', '\n##') # making headings smaller for display
    markdown += '</div>'
    display(Markdown(markdown))

<div class="alert alert-block alert-success">

## Brown Corpus

### About

This directory contains a corpus created using the [Conc](https://github.com/polsci/conc) Python library. 

### Corpus Information

A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.

Date created: 2025-06-09 15:46:25  
Document count: 500  
Token count: 1138566  
Word token count: 980144  
Unique tokens: 42930  
Unique word tokens: 42907  
Conc Version Number: 0.0.1  
spaCy model: en_core_web_sm, version 3.8.0  

### Using this corpus

Conc can be installed [via pip]():  
```
pip install conc
```
Documentation to get you started with Conc are available:
[Conc Documentation](https://geoffford.nz/conc)

### Cite Conc

If you use Conc in your work, please cite it as follows:


</div>

### corpus.json file

Below is the schema of the `corpus.json` file, showing the metadata saved with a corpus. These values are loaded as attributes by `Corpus.load`, and are created when you build a corpus using `Corpus.build_from_files` or `Corpus.build_from_csv`.

In [None]:
#| eval: false
properties = msgspec.json.schema(CorpusMetadata)['$defs']['CorpusMetadata']['properties']
display(properties)

{'name': {'type': 'string'},
 'description': {'type': 'string'},
 'slug': {'type': 'string'},
 'conc_version': {'type': 'string'},
 'document_count': {'type': 'integer'},
 'token_count': {'type': 'integer'},
 'word_token_count': {'type': 'integer'},
 'punct_token_count': {'type': 'integer'},
 'space_token_count': {'type': 'integer'},
 'unique_tokens': {'type': 'integer'},
 'unique_word_tokens': {'type': 'integer'},
 'date_created': {'type': 'string'},
 'EOF_TOKEN': {'type': 'integer'},
 'SPACY_EOF_TOKEN': {'type': 'integer'},
 'SPACY_MODEL': {'type': 'string'},
 'SPACY_MODEL_VERSION': {'type': 'string'},
 'punct_tokens': {'type': 'array', 'items': {'type': 'integer'}},
 'space_tokens': {'type': 'array', 'items': {'type': 'integer'}}}
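
If you only need the summary statistics, the file can be read with the standard `json` module. This is a minimal sketch that assumes `corpus.json` is a flat JSON object whose keys match the schema above; `read_corpus_metadata` is a hypothetical helper, and the stand-in file written here holds only a few keys, with counts copied from the Brown README shown earlier.

```python
import json
import tempfile
from pathlib import Path


def read_corpus_metadata(corpus_dir):
    """Hypothetical helper: load corpus.json from a corpus directory as a dict."""
    with open(Path(corpus_dir) / "corpus.json", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in corpus.json with a subset of the schema keys (assumed flat layout).
    sample = {
        "name": "Brown Corpus",
        "document_count": 500,
        "token_count": 1138566,
        "word_token_count": 980144,
    }
    (Path(tmp) / "corpus.json").write_text(json.dumps(sample), encoding="utf-8")
    meta = read_corpus_metadata(tmp)

# Share of all tokens that are word tokens.
word_share = meta["word_token_count"] / meta["token_count"]
print(f'{meta["name"]}: {meta["document_count"]} documents, {word_share:.1%} word tokens')
```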

Some of this information is exposed by `Corpus` methods.

In [None]:
#| eval: false
brown.summary()

Corpus Summary,Corpus Summary
Attribute,Value
Name,Brown Corpus
Description,"A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/."
Date Created,2025-06-09 15:46:25
Conc Version,0.0.1
Corpus Path,/home/geoff/data/conc-test-corpora//brown.corpus
Document Count,500
Token Count,1138566
Word Token Count,980144
Unique Tokens,42930
Unique Word Tokens,42907


### vocab.parquet

The vocab table contains:

1. A lookup between token_id, token (string representation), and tokens_sort_order. The sort order allows sorting tokens alphabetically directly from token ids.
2. A frequency table, with counts for lower case and case sensitive matching.
3. Information on the type of token (i.e. whether punctuation or space - or if neither of those, a "word" token).

In [None]:
#| eval: false
display(pl.scan_parquet(f'{brown.corpus_path}/vocab.parquet').head(5).collect(engine='streaming'))

rank,tokens_sort_order,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
1,50087,22848,"""the""",63516,62473,False,False
2,28,8128,""",""",58331,58331,True,False
3,41,38309,""".""",49907,49907,True,False
4,35232,2739,"""of""",36321,36122,False,False
5,3351,7126,"""and""",27787,27633,False,False


To illustrate how frequencies are stored, see the rows for 'the'. The `frequency_orth` column stores the count for each form as it appeared in the text: 'The' appears 1,043 times and lowercase 'the' appears 62,473 times. The `frequency_lower` column stores the total count of 'the' regardless of case (63,516). In the output below it is populated only on the row for the lowercase form.

In [None]:
#| eval: false
display(pl.scan_parquet(f'{brown.corpus_path}/vocab.parquet').filter(pl.col('token').str.to_lowercase() == 'the').head(5).collect(engine='streaming'))

rank,tokens_sort_order,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
1,50087,22848,"""the""",63516.0,62473,False,False
99,50086,15682,"""The""",,1043,False,False


Punctuation is included in tokens, but these can be filtered in Conc reports. If you are working with the table directly you can use `is_punct` to access or remove punctuation.

In [None]:
#| eval: false
display(pl.scan_parquet(f'{brown.corpus_path}/vocab.parquet').filter(pl.col('is_punct') == True).head(5).collect(engine='streaming'))

rank,tokens_sort_order,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
2,28,8128,""",""",58331,58331,True,False
3,41,38309,""".""",49907,49907,True,False
12,1577,1601,"""`""",9788,9788,True,False
14,14,42833,"""''""",8762,8762,True,False
15,29,27963,"""-""",8131,8131,True,False


Likewise, space tokens are included (without counts). There is more on space tokens below.

In [None]:
#| eval: false
display(pl.scan_parquet(f'{brown.corpus_path}/vocab.parquet').filter(pl.col('is_space') == True).head(5).collect(engine='streaming'))

rank,tokens_sort_order,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
47165,1,2956,""" """,,,False,True
47166,2,2799,""" """,,,False,True
47167,3,27276,""" 	""",,,False,True
47168,4,22812,""" """,,,False,True
47169,5,4112,""" 	""",,,False,True


### tokens.parquet

The `tokens.parquet` file contains a table representing tokens in the corpus. Whitespace tokens have been removed (see `spaces.parquet` discussion below). The columns are as follows:

- the `orth_index` column stores the token_id of the original form of the token
- the `lower_index` column stores the token_id of the lowercased form of the token  
- the `token2doc_index` column stores the document id assigned by Conc in the order the texts were processed (starting from 1) - this is the same order as the rows in the `metadata.parquet` file.
- the `has_spaces` column stores a boolean value indicating if the token is followed by a standard space character or not.
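
Because `lower_index` holds the id of the lowercased form, a case-insensitive search reduces to an equality test on that column, while the same test on `orth_index` is case-sensitive. The sketch below uses plain Python lists as a stand-in for the table: the four `(orth_index, lower_index)` pairs are copied from the Brown tokens output on this page, and 22848 is the id the vocab output gives for 'the'.

```python
# (orth_index, lower_index) pairs copied from the Brown tokens output on this page.
rows = [
    (15682, 22848),  # "The"
    (4361, 41672),   # "Fulton"
    (14610, 29725),  # "County"
    (54713, 49998),  # "Grand"
]
THE_ID = 22848  # token_id of lowercase "the" in the Brown vocab output

# Case-insensitive: compare against lower_index.
case_insensitive_hits = [pos for pos, (_, lower) in enumerate(rows) if lower == THE_ID]

# Case-sensitive: compare against orth_index, so "The" does not match "the".
case_sensitive_hits = [pos for pos, (orth, _) in enumerate(rows) if orth == THE_ID]

print(case_insensitive_hits, case_sensitive_hits)
```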

To demarcate the start and end of documents, Conc uses an end of file token (EOF_TOKEN) in the orth_index and lower_index columns. The EOF_TOKEN is stored in `corpus.json` and is accessible as a property of a `Corpus` object.

The token2doc_index represents token positions outside texts in the corpus as -1.
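
One way to use this is to recover each document's span of positions from `token2doc_index` alone. The following sketch works on a short, made-up column rather than real corpus data. `document_spans` is a hypothetical helper, and it assumes each document occupies a single contiguous run of positions.

```python
from itertools import groupby

# Made-up token2doc_index column: -1 marks positions outside any text.
token2doc_index = [-1, 1, 1, 1, -1, 2, 2, -1]


def document_spans(doc_index):
    """Hypothetical helper: map document id -> (first, last) position, inclusive.

    Assumes every document is one contiguous run of positions.
    """
    spans = {}
    position = 0
    for doc_id, run in groupby(doc_index):
        length = len(list(run))
        if doc_id != -1:
            spans[doc_id] = (position, position + length - 1)
        position += length
    return spans


spans = document_spans(token2doc_index)
print(spans)
```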

#### A note about space tokens

spaCy reports whether each token is followed by a standard space character. This is recorded in `has_spaces` in the tokens table. spaCy creates separate space tokens for other whitespace sequences. These may be useful for some sequence classification problems, and they allow a document to be reproduced in its original form, with newlines, tabs and sequences of whitespace preserved.

In [None]:
#| eval: false
pl.scan_parquet(f'{brown.corpus_path}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 121)).collect(engine='streaming')

position,orth_index,lower_index,token2doc_index,has_spaces
99,46333,46333,-1,False
100,15682,22848,1,True
101,4361,41672,1,True
102,14610,29725,1,True
103,54713,49998,1,True
104,45742,19078,1,True
105,53250,53250,1,True
106,8699,35796,1,True
107,45680,45680,1,True
108,30305,30305,1,True


In [None]:
# #| hide
# tmp_tokens_df = pl.scan_parquet(f'{brown.corpus_path}/tokens.parquet').with_row_index('position').collect()
# display(tmp_tokens_df.filter(pl.col('position').is_between(99, 121)).head(5))
# display(tmp_tokens_df.filter(pl.col('position').is_between(343, 348)).head(5))

# space_tokens_df = tmp_tokens_df.filter((pl.col('orth_index').is_in(brown.space_tokens))).with_row_index('adjust_by').with_columns((pl.col('position') - pl.col('adjust_by')).alias('corrected'))
# # replace position with corrected 
# space_tokens_df = space_tokens_df.with_columns(pl.col('corrected').alias('position')).drop('adjust_by').drop('corrected')
# # remove space_tokens_df from tmp_tokens_df
# tmp_tokens_df = tmp_tokens_df.filter(~pl.col('orth_index').is_in(brown.space_tokens)).drop('position').with_row_index('position')

# tmp_tokens_df = tmp_tokens_df.with_columns(pl.lit(1).alias('not_space'))
# space_tokens_df = space_tokens_df.with_columns(pl.lit(0).alias('not_space'))

# display(tmp_tokens_df.filter(pl.col('position').is_between(99, 121)).head(5))
# display(tmp_tokens_df.filter(pl.col('position').is_between(343, 348)).head(5))

# display(space_tokens_df.head(5))

# reconstructed_df = pl.concat([tmp_tokens_df, space_tokens_df]).sort('position', 'not_space').drop('position').drop('not_space').with_row_index('position')
# display(reconstructed_df.filter(pl.col('position').is_between(99, 121)).head(5))
# display(reconstructed_df.filter(pl.col('position').is_between(343, 348)).head(5))

# # is reconstructed identical to original? programmatically check
# assert pl.scan_parquet(f'{brown.corpus_path}/tokens.parquet').collect(engine='streaming').with_row_index('position').equals(reconstructed_df)

# # test for just one doc ...
# doc_id = 300
# tmp_doc_df = tmp_tokens_df.filter(pl.col('token2doc_index') == doc_id)
# display(tmp_doc_df.head(5))
# reconstructed_df = pl.concat([tmp_doc_df, space_tokens_df.filter(pl.col('token2doc_index') == doc_id)]).sort('position', 'not_space').drop('position').drop('not_space')
# display(reconstructed_df.head(5))
# assert pl.scan_parquet(f'{brown.corpus_path}/tokens.parquet').filter(pl.col('token2doc_index') == doc_id).collect(engine='streaming').equals(reconstructed_df)


The view below shows tokens joined with vocab, so you can see the token string and the attributes of the tokens. 

In [None]:
#| eval: false
pl.scan_parquet(f'{brown.corpus_path}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 121)).join(
    pl.scan_parquet(f'{brown.corpus_path}/vocab.parquet').select(pl.col('token_id'), pl.col('token'), pl.col('is_punct'), pl.col('is_space')),
    left_on='orth_index', right_on='token_id', how='left', maintain_order='left').collect(engine='streaming')

position,orth_index,lower_index,token2doc_index,has_spaces,token,is_punct,is_space
99,46333,46333,-1,False,""" conc-end-of-file-token""",False,False
100,15682,22848,1,True,"""The""",False,False
101,4361,41672,1,True,"""Fulton""",False,False
102,14610,29725,1,True,"""County""",False,False
103,54713,49998,1,True,"""Grand""",False,False
104,45742,19078,1,True,"""Jury""",False,False
105,53250,53250,1,True,"""said""",False,False
106,8699,35796,1,True,"""Friday""",False,False
107,45680,45680,1,True,"""an""",False,False
108,30305,30305,1,True,"""investigation""",False,False


### spaces.parquet

Space tokens (see note above) are stored in `spaces.parquet`, separately from word and punctuation tokens. This keeps the tokens table consistent with most tools for corpus linguistics. The spaces table follows the format of the tokens table, with an added `position` column that locates each whitespace token, whose token_id is stored in `orth_index`. The original token sequences can be reconstructed from the `tokens.parquet` and `spaces.parquet` files. Conc has functionality to recover specific texts by recombining the data.

In [None]:
#| eval: false
pl.scan_parquet(f'{brown.corpus_path}/spaces.parquet').head(3).collect(engine='streaming')

position,orth_index,lower_index,token2doc_index,has_spaces
100,27276,27276,1,False
345,2956,2956,1,False
637,2956,2956,1,False
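
To make the recombination concrete, here is a minimal sketch of merging space rows back into a token sequence. It is not Conc's implementation: `merge_spaces` is a hypothetical helper, the ids are toy values, and it assumes that a space row with position `p` belongs immediately before the token at row `p` of the tokens table, which is what the first row above suggests (position 100 is where document 1 starts).

```python
def merge_spaces(tokens, spaces):
    """Hypothetical helper: interleave (position, token_id) space rows into tokens.

    Assumes a space row at position p belongs immediately before tokens[p].
    """
    by_position = {}
    for position, token_id in spaces:
        by_position.setdefault(position, []).append(token_id)

    merged = []
    for position, token_id in enumerate(tokens):
        merged.extend(by_position.get(position, []))
        merged.append(token_id)
    # Spaces positioned just past the last token are appended at the end.
    merged.extend(by_position.get(len(tokens), []))
    return merged


# Toy ids: 7 stands in for a whitespace token, the rest for words and punctuation.
tokens = [10, 11, 12, 13]
spaces = [(0, 7), (2, 7)]
merged = merge_spaces(tokens, spaces)
print(merged)
```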


### puncts.parquet

The `puncts.parquet` file stores an index of the positions of punctuation tokens in the corpus. Below are the first three rows of a `puncts.parquet` file. The positions align with punctuation tokens in the `tokens.parquet` file (see the token positions queried above). They are intended for filtering tokens to exclude punctuation where necessary.

In [None]:
#| eval: false
pl.scan_parquet(f'{brown.corpus_path}/puncts.parquet').head(3).collect(engine='streaming')

position
116
117
120
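
Filtering with this index amounts to dropping the rows whose position appears in it. The sketch below uses plain Python lists as a stand-in for the two tables and assumes positions are 0-based row numbers of `tokens.parquet`, as produced by `with_row_index('position')` in the queries above. The token ids are taken from the Brown vocab output (8128 is ',' and 38309 is '.'), but their arrangement here is made up.

```python
# Stand-in for the orth_index column of tokens.parquet, in corpus order.
orth_index = [15682, 4361, 8128, 53250, 38309]

# Stand-in for puncts.parquet: positions of the punctuation tokens above.
punct_positions = [2, 4]

# A set gives constant-time membership checks while scanning the tokens.
punct_set = set(punct_positions)
words_only = [
    token_id
    for position, token_id in enumerate(orth_index)
    if position not in punct_set
]

print(words_only)
```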


### metadata.parquet

The `metadata.parquet` file should not be confused with the metadata of the corpus itself, which is stored in `corpus.json`.

If populated, the `metadata.parquet` file contains metadata for each document in the corpus. 

In [None]:
#| eval: false
corpus = Corpus().load(f'{save_path}/us-congressional-speeches-subset-10k.corpus')
display(pl.scan_parquet(f'{corpus.corpus_path}/metadata.parquet').head(3).collect(engine='streaming'))

speech_id,date,speaker,chamber,state
530182158,"""1895-01-10T00:00:00.000000""","""Mr. COCKRELL""","""S""","""Unknown"""
890274849,"""1966-08-31T00:00:00.000000""","""Mr. LONG of Louisiana""","""S""","""Louisiana"""
880088363,"""1963-09-11T00:00:00.000000""","""Mr. FULBRIGHT""","""S""","""Unknown"""


Metadata is represented in the same order as documents are stored in the `tokens.parquet` file. The tokens with token2doc_index 1 correspond to the first metadata row.

For corpora created from files, there will always be a field for the source file at the time of creation. 

In [None]:
#| eval: false
corpus = Corpus().load(f'{save_path}/garden-party.corpus')
display(pl.scan_parquet(f'{corpus.corpus_path}/metadata.parquet').head(3).collect(engine='streaming'))

file
"""an-ideal-family.txt"""
"""at-the-bay.txt"""
"""bank-holiday.txt"""
