# Anatomy of a corpus

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| hide
import polars as pl
import os

In [None]:
#| hide
from conc.corpus import Corpus, CorpusMetadata
import msgspec

A Conc corpus is a directory containing specific files as follows:

```
corpus-name.corpus/
	README.md - Human-readable information about the corpus to aid distribution
	corpus.json - Machine readable information about the corpus, including name, description, various summary statistics, and models used to build the corpus
	vocab.parquet - A table mapping token strings to token IDs and frequency information
	tokens.parquet - A table with indices based on token positions used to query the corpus with tokens represented by numeric IDs
	puncts.parquet - A table of the positions of punctuation tokens in the corpus
	spaces.parquet - A table of the positions of space tokens in the corpus
	metadata.parquet - A table with metadata for each document (if there is any)
```

Note: by default, the library creates the directory with a `.corpus` suffix. The suffix is not required, but it makes corpora easier to find and identify on your filesystem.

To distribute a corpus, send a zip of the directory for others to extract or just share the directory as-is.
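
As a minimal sketch of the zipping step, the following uses only the Python standard library. The `zip_corpus` helper and the `toy.corpus` directory are hypothetical names for illustration and are not part of Conc; the toy directory holds placeholder files rather than a real corpus.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_corpus(corpus_dir, out_dir):
    """Zip a corpus directory, keeping the directory itself as the top-level entry."""
    corpus_path = Path(corpus_dir).resolve()
    out_path = Path(out_dir).resolve()
    out_path.mkdir(parents=True, exist_ok=True)
    archive = shutil.make_archive(
        str(out_path / corpus_path.name),  # ".zip" is appended automatically
        'zip',
        root_dir=str(corpus_path.parent),  # archive paths are relative to the parent
        base_dir=corpus_path.name,         # so "<name>.corpus/" is preserved on extraction
    )
    return Path(archive)


# Toy stand-in for a real corpus directory, with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / 'toy.corpus'
    toy.mkdir()
    (toy / 'README.md').write_text('# Toy corpus\n', encoding='utf-8')
    (toy / 'corpus.json').write_text('{"name": "Toy"}', encoding='utf-8')

    archive = zip_corpus(toy, Path(tmp) / 'dist')
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())

print(archive.name)
print(names)
```

Keeping the corpus directory as the top-level entry means the recipient gets a ready-to-load `.corpus` directory when they extract the archive.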

Below is more information about each file. Conc is the usual way to work with a corpus, but you can also read the processed `.parquet` files directly using the `polars` library.

### README.md

Below is an example of the `README.md` file generated by Conc.

In [None]:
#| hide
from IPython.display import Markdown, display

In [None]:
#| hide
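# expanduser avoids a literal "None" path when HOME is unset, and dropping the
# trailing slash avoids "//" when paths are built as f'{save_path}/...'
source_path = os.path.expanduser('~/data')
save_path = os.path.expanduser('~/data/conc-test-corpora')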

In [None]:
#| hide
brown = Corpus().load(f'{save_path}/brown.corpus')

In [None]:
#| echo: true
with open(f'{brown.corpus_path}/README.md', 'rb') as f:
    markdown = '<div class="alert alert-block alert-success">\n\n' + f.read().decode('utf-8') + '\n'
    markdown = markdown.replace('\n#', '\n##') # making headings smaller for display
    markdown += '</div>'
    display(Markdown(markdown))

<div class="alert alert-block alert-success">

## Brown Corpus

### About

This directory contains a corpus created using the [Conc](https://github.com/polsci/conc) Python library. 

### Corpus Information

A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html

Date created: 2025-05-28 14:49:02  
Document count: 500  
Token count: 1140905  
Word token count: 980144  
Unique tokens: 42937  
Unique word tokens: 42907  
Conc Version Number: 0.0.1  
spaCy model: en_core_web_sm, version 3.8.0  

### Using this corpus

Conc can be installed [via pip]():  
```
pip install conc
```
Documentation and tutorials to get you started with Conc are available:
[Conc Documentation](https://geoffford.nz/conc)

### Cite Conc

If you use Conc in your work, please cite it as follows:


</div>

### corpus.json

Below is the schema of the `corpus.json` file, which lists the metadata saved with a corpus. Conc loads these fields as attributes when you call `Corpus.load`, and writes them when you build a corpus using `Corpus.build_from_files` or `Corpus.build_from_csv`.

In [None]:
#| echo: true
properties = msgspec.json.schema(CorpusMetadata)['$defs']['CorpusMetadata']['properties']
display(properties)

{'name': {'type': 'string'},
 'description': {'type': 'string'},
 'slug': {'type': 'string'},
 'conc_version': {'type': 'string'},
 'document_count': {'type': 'integer'},
 'token_count': {'type': 'integer'},
 'word_token_count': {'type': 'integer'},
 'punct_token_count': {'type': 'integer'},
 'space_token_count': {'type': 'integer'},
 'unique_tokens': {'type': 'integer'},
 'unique_word_tokens': {'type': 'integer'},
 'date_created': {'type': 'string'},
 'EOF_TOKEN': {'type': 'integer'},
 'SPACY_EOF_TOKEN': {'type': 'integer'},
 'SPACY_MODEL': {'type': 'string'},
 'SPACY_MODEL_VERSION': {'type': 'string'},
 'punct_tokens': {'type': 'array', 'items': {'type': 'integer'}},
 'space_tokens': {'type': 'array', 'items': {'type': 'integer'}}}
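
Because `corpus.json` is plain JSON, it can also be inspected without Conc. The sketch below assumes the file is a flat JSON object keyed by the field names in the schema above. It writes a toy file containing a subset of those fields, with values copied from the example Brown Corpus README, and reads it back with the standard library.

```python
import json
import tempfile
from pathlib import Path

# Illustrative subset of the schema fields; values are taken from the
# example Brown Corpus README, not read from a real corpus.json.
toy_metadata = {
    'name': 'Brown Corpus',
    'document_count': 500,
    'token_count': 1140905,
    'word_token_count': 980144,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'corpus.json'
    path.write_text(json.dumps(toy_metadata), encoding='utf-8')

    # Reading a real corpus would only need this part, pointed at its corpus.json.
    with open(path, encoding='utf-8') as f:
        metadata = json.load(f)

# Share of all tokens that are word tokens (not punctuation or spaces).
word_share = metadata['word_token_count'] / metadata['token_count']
print(f"{metadata['name']}: {metadata['document_count']} documents, "
      f"{word_share:.1%} word tokens")
```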

### vocab.parquet

In [None]:
#| echo: true
display(pl.scan_parquet(f'{brown.corpus_path}/vocab.parquet').head(5).collect(engine='streaming'))

rank,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
1,22848,"""the""",63516,62473,False,False
2,8128,""",""",58331,58331,True,False
3,38309,""".""",49907,49907,True,False
4,2739,"""of""",36321,36122,False,False
5,7126,"""and""",27787,27633,False,False


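Frequencies are stored per word form. `frequency_orth` is the count of the token exactly as written, while `frequency_lower` is the case-insensitive count, which is recorded only on the row for the lowercase form and is null for other forms. In the examples below, "the" has a `frequency_lower` of 63516, which is the sum of the `frequency_orth` values for "the" (62473) and "The" (1043).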

In [None]:
#| echo: true
display(pl.scan_parquet(f'{brown.corpus_path}/vocab.parquet').filter(pl.col('token').str.to_lowercase() == 'the').head(5).collect(engine='streaming'))

rank,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
1,22848,"""the""",63516.0,62473,False,False
99,15682,"""The""",,1043,False,False


In [None]:
#| echo: true
display(pl.scan_parquet(f'{brown.corpus_path}/vocab.parquet').filter(pl.col('token').str.to_lowercase() == 'government').head(5).collect(engine='streaming'))

rank,token_id,token,frequency_lower,frequency_orth,is_punct,is_space
328,11309,"""government""",438.0,284,False,False
644,55689,"""Government""",,154,False,False
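
The relationship between the two frequency columns can be sketched in plain Python. This is an illustration of the counting logic implied by the outputs above, not Conc's implementation, and the token list is invented for the example.

```python
from collections import Counter

# Invented toy token stream, for illustration only.
tokens = ['The', 'government', 'said', 'the', 'Government', 'the']

frequency_orth = Counter(tokens)                      # count of each exact form
frequency_lower = Counter(t.lower() for t in tokens)  # case-insensitive count

rows = []
for form in sorted(frequency_orth):
    # Mirror the table layout: the case-insensitive total is reported only
    # on the row for the lowercase form, and is None for other forms.
    lower = frequency_lower[form] if form == form.lower() else None
    rows.append((form, lower, frequency_orth[form]))

for row in rows:
    print(row)
```

In this toy data, the `frequency_lower` for "the" equals the `frequency_orth` counts for "the" and "The" added together, which is the same pattern as in the Brown Corpus rows above.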


### tokens.parquet

In [None]:
#| echo: true
pl.scan_parquet(f'{brown.corpus_path}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 107)).collect(engine='streaming')

position,orth_index,lower_index,token2doc_index
99,46333,46333,-1
100,27276,27276,0
101,15682,22848,0
102,4361,41672,0
103,14610,29725,0
104,54713,49998,0
105,45742,19078,0
106,53250,53250,0
107,8699,35796,0


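Each row of `tokens.parquet` is one token position. `orth_index` holds the ID of the token as written and `lower_index` holds the ID of its lowercase form, so position 101 above stores 15682 ("The") and 22848 ("the"). `token2doc_index` identifies the document a token belongs to. The value -1 at position 99 belongs to an end-of-file token, which is not part of any document. Joining on `vocab.parquet` recovers the token strings along with the punctuation and space flags: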

In [None]:
#| echo: true
pl.scan_parquet(f'{brown.corpus_path}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 121)).join(
    pl.scan_parquet(f'{brown.corpus_path}/vocab.parquet').select(pl.col('token_id'), pl.col('token'), pl.col('is_punct'), pl.col('is_space')),
    left_on='orth_index', right_on='token_id', how='left', maintain_order='left').collect(engine='streaming')

position,orth_index,lower_index,token2doc_index,token,is_punct,is_space
99,46333,46333,-1,""" conc-end-of-file-token""",False,False
100,27276,27276,0,""" 	""",False,True
101,15682,22848,0,"""The""",False,False
102,4361,41672,0,"""Fulton""",False,False
103,14610,29725,0,"""County""",False,False
104,54713,49998,0,"""Grand""",False,False
105,45742,19078,0,"""Jury""",False,False
106,53250,53250,0,"""said""",False,False
107,8699,35796,0,"""Friday""",False,False
108,45680,45680,0,"""an""",False,False
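
To make the join concrete, here is a plain-Python sketch that rebuilds text from rows like those above. The IDs and tokens are copied from the output for positions 99 to 107; the exact string of the whitespace token is approximated. It assumes, based on that output, that rows with a `token2doc_index` of -1 are end-of-file markers to skip.

```python
from collections import defaultdict

# token_id -> (token, is_space), copied from the join output above.
vocab = {
    46333: (' conc-end-of-file-token', False),
    27276: (' \t', True),  # whitespace token; exact string approximated
    15682: ('The', False),
    4361: ('Fulton', False),
    14610: ('County', False),
    54713: ('Grand', False),
    45742: ('Jury', False),
    53250: ('said', False),
    8699: ('Friday', False),
}

# (orth_index, token2doc_index) for positions 99 to 107.
tokens = [
    (46333, -1), (27276, 0), (15682, 0), (4361, 0), (14610, 0),
    (54713, 0), (45742, 0), (53250, 0), (8699, 0),
]

documents = defaultdict(list)
for orth_index, doc_index in tokens:
    token, is_space = vocab[orth_index]
    if doc_index == -1 or is_space:
        continue  # skip end-of-file markers and whitespace tokens
    documents[doc_index].append(token)

text = ' '.join(documents[0])
print(text)
```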


### spaces.parquet and puncts.parquet

The `spaces.parquet` and `puncts.parquet` files share the same format. Each table contains a single field, `position`, which records the positions of space or punctuation tokens in the corpus. Here are the first three rows of a `puncts.parquet` file:

In [None]:
#| echo: true
pl.scan_parquet(f'{brown.corpus_path}/puncts.parquet').head(3).collect(engine='streaming')

position
117
118
121
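
One way to use these positions is as a mask over token positions. The sketch below uses the three positions shown above and assumes positions are stored in ascending order, so that no earlier position is a punctuation token.

```python
# First three rows of puncts.parquet, as shown above.
punct_positions = {117, 118, 121}

# A small window of token positions ending at the last known punctuation row.
window = range(115, 122)

# Keep only the positions that are not punctuation tokens.
non_punct = [p for p in window if p not in punct_positions]
print(non_punct)
```

A set gives constant-time membership checks, which matters once the window covers a whole corpus rather than a handful of positions.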


### metadata.parquet

The `metadata.parquet` file should not be confused with the metadata of the corpus itself, which is stored in `corpus.json`.

If populated, the `metadata.parquet` file contains metadata for each document in the corpus. 

In [None]:
#| echo: true
corpus = Corpus().load(f'{save_path}/us-congressional-speeches-subset-10k.corpus')
display(pl.scan_parquet(f'{corpus.corpus_path}/metadata.parquet').head(3).collect(engine='streaming'))

speech_id,date,speaker,chamber,state
530182158,"""1895-01-10T00:00:00.000000""","""Mr. COCKRELL""","""S""","""Unknown"""
890274849,"""1966-08-31T00:00:00.000000""","""Mr. LONG of Louisiana""","""S""","""Louisiana"""
880088363,"""1963-09-11T00:00:00.000000""","""Mr. FULBRIGHT""","""S""","""Unknown"""


For corpora created from files, there is always a field recording each document's source file name at the time of creation. Rows appear in the same order as the documents in the `tokens.parquet` file.

In [None]:
#| echo: true
corpus = Corpus().load(f'{save_path}/garden-party.corpus')
display(pl.scan_parquet(f'{corpus.corpus_path}/metadata.parquet').head(3).collect(engine='streaming'))

file
"""an-ideal-family.txt"""
"""at-the-bay.txt"""
"""bank-holiday.txt"""

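Because the rows follow document order, a document index can be used to look up a metadata row. The sketch below assumes `token2doc_index` is a zero-based row index into `metadata.parquet`, which the source does not state explicitly. The file names come from the output above, and the list of document indices is invented for illustration.

```python
# First three rows of the `file` field, as shown above.
files = ['an-ideal-family.txt', 'at-the-bay.txt', 'bank-holiday.txt']

# Invented token2doc_index values for a handful of token positions.
doc_indices = [-1, 0, 0, -1, 1, 1, -1, 2]


def source_file(doc_index):
    """Return the source file for a document index, or None for -1 markers."""
    # Guard explicitly: files[-1] would silently return the last file.
    if doc_index < 0:
        return None
    return files[doc_index]


labelled = [source_file(i) for i in doc_indices]
print(labelled)
```

The explicit check for negative values matters in Python, where a negative list index counts from the end rather than raising an error.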