# Performance

> Information on Conc performance across different corpus sizes.
- toc: false
- page-layout: full

This page reports timings for building and loading corpora and for running Conc report methods on corpora of different sizes. All results were measured on a machine with an Intel Core i7-14700F, an NVMe SSD and 16GB of usable RAM, running under WSL.

In [None]:
#| hide
%load_ext memory_profiler

In [None]:
#| hide 
%load_ext line_profiler

In [None]:
#| hide
# %load_ext memray


In [None]:
#| hide
import os

In [None]:
#| hide
from conc.core import logger, set_logger_state

In [None]:
from conc.corpus import Corpus
from conc.conc import Conc

In [None]:
#| hide
source_path = f'{os.environ.get("HOME")}/data/'
save_path = f'{os.environ.get("HOME")}/data/conc-test-corpora/'

In [None]:
test_corpora = {
    'us-congressional-speeches-subset-10k': 'US Congressional Speeches Subset 10k',
    'us-congressional-speeches-subset-100k': 'US Congressional Speeches Subset 100k',
    'us-congressional-speeches-subset-200k': 'US Congressional Speeches Subset 200k',
    'us-congressional-speeches-subset-500k': 'US Congressional Speeches Subset 500k',
}

Corpus build time ranges from about 4 seconds for a 2 million token data source (10k texts) to about 150 seconds for a 100 million token data source (500k texts). Building corpora larger than this currently requires a large amount of RAM. Work on memory management is ongoing, and memory use should improve as Polars' new streaming engine matures. This is on the roadmap for the library.

In [None]:
#| eval: false
corpora = {}
for slug, name in test_corpora.items():
    logger.info(f'Starting {name} build ...')
    description = '1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus. '
    # %time is an IPython line magic, so the build call must stay on one line.
    %time corpus = Corpus(name = name, description = description).build_from_csv(f'{source_path}{slug}.csv.gz', save_path = save_path, text_column='text', metadata_columns = ['speech_id', 'date', 'speaker', 'chamber', 'state'], build_process_cleanup = False)

CPU times: user 4.45 s, sys: 224 ms, total: 4.67 s
Wall time: 3.82 s
CPU times: user 46.3 s, sys: 2.45 s, total: 48.7 s
Wall time: 30 s
CPU times: user 1min 38s, sys: 10.8 s, total: 1min 49s
Wall time: 1min 2s
CPU times: user 3min 55s, sys: 32 s, total: 4min 27s
Wall time: 2min 26s
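
To make the scaling easier to read, the build figures can be converted to throughput. The sketch below uses only the token counts and wall times reported on this page (wall times converted to seconds); the `tokens_per_second` helper is an illustrative name, not part of Conc.

```python
# Token counts and build wall times (in seconds) as reported on this page.
builds = {
    "10k": (1_954_972, 3.82),
    "100k": (19_927_241, 30.0),
    "200k": (39_963_039, 62.0),
    "500k": (99_902_593, 146.0),
}


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Hypothetical helper: tokens processed per second of wall time."""
    return tokens / seconds


rates = {label: tokens_per_second(t, s) for label, (t, s) in builds.items()}

for label, rate in rates.items():
    print(f"{label:>5}: {rate:,.0f} tokens/s")
```

On these figures throughput stays at roughly 0.5 to 0.7 million tokens per second, so build time grows close to linearly with corpus size over this range.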


Corpora are loaded lazily, meaning large data tables are only read when they are required. As a result, load times are similar regardless of corpus size.

In [None]:
#| eval: false
for slug, name in test_corpora.items():
    %time corpus = Corpus().load(f'{save_path}{slug}.corpus')
    corpus.summary()
    del corpus

CPU times: user 211 ms, sys: 15.7 ms, total: 227 ms
Wall time: 266 ms


Corpus Summary
Attribute,Value
Name,US Congressional Speeches Subset 10k
Description,"1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus."
Date Created,2025-06-09 15:03:14
Conc Version,0.0.1
Corpus Path,/home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-10k.corpus
Document Count,10000
Token Count,1954972
Word Token Count,1767904
Unique Tokens,50640
Unique Word Tokens,50520


CPU times: user 182 ms, sys: 27.6 ms, total: 209 ms
Wall time: 220 ms


Corpus Summary
Attribute,Value
Name,US Congressional Speeches Subset 100k
Description,"1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus."
Date Created,2025-06-09 15:03:44
Conc Version,0.0.1
Corpus Path,/home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-100k.corpus
Document Count,100000
Token Count,19927241
Word Token Count,18020769
Unique Tokens,214502
Unique Word Tokens,214175


CPU times: user 209 ms, sys: 0 ns, total: 209 ms
Wall time: 219 ms


Corpus Summary
Attribute,Value
Name,US Congressional Speeches Subset 200k
Description,"1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus."
Date Created,2025-06-09 15:04:47
Conc Version,0.0.1
Corpus Path,/home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-200k.corpus
Document Count,200000
Token Count,39963039
Word Token Count,36136744
Unique Tokens,345631
Unique Word Tokens,345310


CPU times: user 207 ms, sys: 0 ns, total: 207 ms
Wall time: 217 ms


Corpus Summary
Attribute,Value
Name,US Congressional Speeches Subset 500k
Description,"1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus."
Date Created,2025-06-09 15:07:14
Conc Version,0.0.1
Corpus Path,/home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-500k.corpus
Document Count,500000
Token Count,99902593
Word Token Count,90341944
Unique Tokens,655344
Unique Word Tokens,654824
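
The near-constant load times above are what lazy loading would predict. As a minimal illustration of the pattern, and not Conc's actual implementation, the toy class below defers an expensive read until the attribute is first used. `LazyCorpus`, `tokens` and `table_reads` are hypothetical names.

```python
from functools import cached_property


class LazyCorpus:
    """Toy stand-in for a lazily loaded corpus (not Conc's implementation)."""

    def __init__(self, path: str):
        # "Loading" only records where the data lives, so it is cheap.
        self.path = path
        self.table_reads = 0

    @cached_property
    def tokens(self) -> list[str]:
        # The expensive read happens on first access and is then cached.
        self.table_reads += 1
        return ["the", "economy", "is", "growing"]


corpus = LazyCorpus("example.corpus")
reads_after_load = corpus.table_reads    # nothing has been read yet
token_count = len(corpus.tokens)         # first access triggers the read
reads_after_access = corpus.table_reads
```

Creating the object costs the same whatever the size of the underlying table, which is why a summary can be shown quickly even for the largest corpus.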


In [None]:
#| eval: false
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    %time conc.frequencies(page_size = 10).display()
    del corpus

Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 10k
Rank,Token,Frequency,Normalized Frequency
1,the,135984,769.18
2,of,67597,382.36
3,to,60132,340.13
4,and,44832,253.59
5,in,36959,209.06
6,that,34135,193.08
7,a,29557,167.19
8,i,29329,165.90
9,is,25175,142.40
10,this,19173,108.45


CPU times: user 22.9 ms, sys: 10.1 ms, total: 33 ms
Wall time: 37.3 ms


Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 100k
Rank,Token,Frequency,Normalized Frequency
1,the,1389439,771.02
2,of,687127,381.30
3,to,610266,338.65
4,and,459220,254.83
5,in,379946,210.84
6,that,346216,192.12
7,a,302256,167.73
8,i,297077,164.85
9,is,250677,139.10
10,this,192933,107.06


CPU times: user 61.5 ms, sys: 38 ms, total: 99.4 ms
Wall time: 46.1 ms


Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 200k
Rank,Token,Frequency,Normalized Frequency
1,the,2781475,769.71
2,of,1377003,381.05
3,to,1225404,339.10
4,and,922720,255.34
5,in,760867,210.55
6,that,695665,192.51
7,a,606747,167.90
8,i,593766,164.31
9,is,504385,139.58
10,this,386922,107.07


CPU times: user 53.7 ms, sys: 78.1 ms, total: 132 ms
Wall time: 49.8 ms


Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 500k
Rank,Token,Frequency,Normalized Frequency
1,the,6951503,769.47
2,of,3446705,381.52
3,to,3059159,338.62
4,and,2308134,255.49
5,in,1902118,210.55
6,that,1737689,192.35
7,a,1514676,167.66
8,i,1481424,163.98
9,is,1261935,139.68
10,this,966165,106.95


CPU times: user 104 ms, sys: 105 ms, total: 210 ms
Wall time: 53.2 ms
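
The Normalized Frequency column is consistent with the raw count divided by the corpus word token count, scaled to 10,000 tokens (the basis stated in the n-gram and keyword report footers on this page). A short check under that assumption, using the figures for "the" in the 10k corpus; `normalized_frequency` is an illustrative helper, not a Conc function.

```python
def normalized_frequency(count: int, total_tokens: int, per: int = 10_000) -> float:
    """Hypothetical helper: occurrences per `per` tokens."""
    return count / total_tokens * per


# "the" occurs 135,984 times among 1,767,904 word tokens in the 10k corpus.
the_10k = normalized_frequency(135_984, 1_767_904)
print(round(the_10k, 2))  # matches the 769.18 shown in the report
```

Because the value is relative to corpus size, it stays close to 770 across all four corpora even though the raw counts differ by a factor of about 50.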


In [None]:
#| eval: false
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    %time conc.ngrams('economy', ngram_length = 2, ngram_token_position = 'RIGHT', page_size = 5).display()
    del corpus

Ngrams for "economy"
US Congressional Speeches Subset 10k
Rank,Ngram,Frequency,Normalized Frequency
1,the economy,94,0.53
2,our economy,59,0.33
3,of economy,23,0.13
4,american economy,11,0.06
5,for economy,8,0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 106


CPU times: user 49.3 ms, sys: 28 ms, total: 77.3 ms
Wall time: 48.8 ms


Ngrams for "economy"
US Congressional Speeches Subset 100k
Rank,Ngram,Frequency,Normalized Frequency
1,the economy,930,0.52
2,our economy,643,0.36
3,of economy,203,0.11
4,american economy,116,0.06
5,national economy,84,0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 464


CPU times: user 338 ms, sys: 57 ms, total: 395 ms
Wall time: 198 ms


Ngrams for "economy"
US Congressional Speeches Subset 200k
Rank,Ngram,Frequency,Normalized Frequency
1,the economy,1924,0.53
2,our economy,1312,0.36
3,of economy,401,0.11
4,american economy,242,0.07
5,national economy,172,0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 682


CPU times: user 578 ms, sys: 233 ms, total: 811 ms
Wall time: 435 ms


Ngrams for "economy"
US Congressional Speeches Subset 500k
Rank,Ngram,Frequency,Normalized Frequency
1,the economy,4818,0.53
2,our economy,3258,0.36
3,of economy,1039,0.12
4,american economy,588,0.07
5,national economy,448,0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 1,193


CPU times: user 1.66 s, sys: 552 ms, total: 2.21 s
Wall time: 1.02 s
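
For readers unfamiliar with the report, the following toy function shows what "Ngram length: 2, Token position: right" means: it counts two-token sequences whose right-most token is the search term. It is a plain-Python illustration on a made-up sentence, not how Conc computes the report, and `ngrams_with_token` and its `position` parameter are hypothetical names. It also skips the punctuation filtering the real report applies.

```python
from collections import Counter


def ngrams_with_token(tokens, target, ngram_length=2, position="RIGHT"):
    """Count n-grams that have `target` in the left- or right-most slot."""
    counts = Counter()
    # Index of the slot inside each window that must hold the target token.
    slot = ngram_length - 1 if position == "RIGHT" else 0
    for i in range(len(tokens) - ngram_length + 1):
        window = tokens[i:i + ngram_length]
        if window[slot] == target:
            counts[" ".join(window)] += 1
    return counts


tokens = "the economy is strong and our economy grows as the economy expands".split()
counts = ngrams_with_token(tokens, "economy")
print(counts.most_common())
```

Passing `position="LEFT"` would instead count sequences such as "economy is", where the search term opens the n-gram.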


In [None]:
#| eval: false
# still working on this!
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    %time conc.ngram_frequencies(ngram_length = 2, page_size = 5).display()
    del corpus

Ngram Frequencies
US Congressional Speeches Subset 10k
Rank,Ngram,Frequency,Normalized Frequency
1,of the,22312,126.21
2,in the,10982,62.12
3,to the,9119,51.58
4,it is,5140,29.07
5,that the,5123,28.98
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 396,623


CPU times: user 1.5 s, sys: 147 ms, total: 1.65 s
Wall time: 209 ms


Ngram Frequencies
US Congressional Speeches Subset 100k
Rank,Ngram,Frequency,Normalized Frequency
1,of the,227943,126.49
2,in the,114241,63.39
3,to the,92967,51.59
4,it is,51659,28.67
5,that the,51620,28.64
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 2,046,190


CPU times: user 35.7 s, sys: 1.09 s, total: 36.8 s
Wall time: 831 ms


Ngram Frequencies
US Congressional Speeches Subset 200k
Rank,Ngram,Frequency,Normalized Frequency
1,of the,457057,126.48
2,in the,228891,63.34
3,to the,186449,51.60
4,it is,103619,28.67
5,that the,103418,28.62
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 3,304,755


CPU times: user 1min 9s, sys: 2.22 s, total: 1min 11s
Wall time: 4.33 s


Ngram Frequencies
US Congressional Speeches Subset 500k
Rank,Ngram,Frequency,Normalized Frequency
1,of the,1140304,126.22
2,in the,570295,63.13
3,to the,467816,51.78
4,it is,259770,28.75
5,that the,258068,28.57
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 6,158,427


CPU times: user 3min 16s, sys: 10.2 s, total: 3min 26s
Wall time: 11.9 s
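
In the timings above the CPU total is far larger than the wall time (for example 3min 26s against 11.9 s), which suggests the work is spread across several cores. To reproduce both measurements outside IPython, where `%time` is not available, a sketch using only the standard library follows. `time_call` is an illustrative helper, and `sum` stands in for a Conc report call.

```python
import time


def time_call(func, *args, **kwargs):
    """Return (result, wall_seconds, cpu_seconds) for a single call."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


# Stand-in workload; replace with a report call when benchmarking.
result, wall, cpu = time_call(sum, range(1_000_000))
print(f"wall {wall:.3f} s, cpu {cpu:.3f} s")
```

`time.process_time` covers all threads in the process, so a CPU figure well above the wall figure indicates parallel execution, as in the n-gram frequency results.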


In [None]:
#| eval: false
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    %time conc.concordance('economy', page_size = 5).display()
    del corpus

Concordance for "economy"
US Congressional Speeches Subset 10k, Context tokens: 5, Order: 1R2R3R
Document Id,Left,Node,Right
5878,ruled by a government of,economy,.
1163,. help strengthen our Nations,economy,.
316,otherwise generally strong and prosperous,economy,.
6910,this critical sector in our,economy,.
9517,health care pressures in this,economy,.
Total Concordance Lines: 358
Total Documents: 251
Showing 5 lines
Page 1 of 72


CPU times: user 89.7 ms, sys: 1.14 ms, total: 90.8 ms
Wall time: 61.2 ms


Concordance for "economy"
US Congressional Speeches Subset 100k, Context tokens: 5, Order: 1R2R3R
Document Id,Left,Node,Right
82659,amounts that it throws our,economy,
75018,Honey . I shrun the,economy,""" ? It is honest"
19176,"getting away from "" Coolidge",economy,""" already . and making"
6729,". We are talking """,economy,""" and at the same"
83170,"further into an "" innovating",economy,""" based on a highly"
Total Concordance Lines: 3758
Total Documents: 2684
Showing 5 lines
Page 1 of 752


CPU times: user 414 ms, sys: 340 ms, total: 755 ms
Wall time: 437 ms


Concordance for "economy"
US Congressional Speeches Subset 200k, Context tokens: 5, Order: 1R2R3R
Document Id,Left,Node,Right
77084,the way it is .,ECONOMY,
6026,its central office . Political,Economy,
130531,the maintenance of her national,economy,
20685,on something else . Coolidge,economy,! I am for it
132603,railroads of this country .,Economy,! What about this pitpible
Total Concordance Lines: 7753
Total Documents: 5480
Showing 5 lines
Page 1 of 1551


CPU times: user 871 ms, sys: 596 ms, total: 1.47 s
Wall time: 831 ms


Concordance for "economy"
US Congressional Speeches Subset 500k, Context tokens: 5, Order: 1R2R3R
Document Id,Left,Node,Right
140837,prayers are with them .,ECONOMY,
162997,its central office . Political,Economy,
325086,WHAT ARE CoNDrrONS IN THE,ECONOMY,
64711,country . Condition of Nations,Economy,
360787,country ! This spasm of,economy,!
Total Concordance Lines: 19399
Total Documents: 13564
Showing 5 lines
Page 1 of 3880


CPU times: user 2.83 s, sys: 1.42 s, total: 4.24 s
Wall time: 1.86 s
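
The concordance layout (Left, Node, Right with a fixed number of context tokens) and the page counts shown above can be illustrated with a small plain-Python sketch. This is a toy on a made-up sentence, not Conc's implementation; `concordance` and `page_count` are hypothetical helpers, and case-insensitive matching is assumed because the reports above include "ECONOMY" and "Economy".

```python
import math


def concordance(tokens, node, context=5):
    """Return (left, node, right) tuples for each case-insensitive match."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append((left, token, right))
    return lines


def page_count(total_lines, page_size):
    """Number of pages needed to show every concordance line."""
    return math.ceil(total_lines / page_size)


tokens = "We must act now to help strengthen our economy . The Economy is strong".split()
lines = concordance(tokens, "economy", context=3)
for left, node, right in lines:
    print(f"{left:>25} | {node} | {right}")
```

With `page_size = 5`, `page_count(358, 5)` gives 72 and `page_count(19399, 5)` gives 3880, matching the "Page 1 of ..." lines in the 10k and 500k reports.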


In [None]:
#| eval: false
reference = Corpus().load(f'{save_path}brown.corpus')
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    conc.set_reference_corpus(reference)
    %time conc.keywords(page_size = 5, min_frequency = 5, min_frequency_reference = 5).display()
    del corpus

Keywords
Target corpus: US Congressional Speeches Subset 10k, Reference corpus: Brown Corpus
Rank,Token,Frequency,Frequency Reference,Normalized Frequency,Normalized Frequency Reference,Relative Risk,Log Ratio,Log Likelihood
1,unanimous,907,5,5.13,0.05,100.57,6.65,748.42
2,amendment,4039,24,22.85,0.24,93.30,6.54,3318.48
3,appropriation,716,5,4.05,0.05,79.39,6.31,582.28
4,senator,5488,39,31.04,0.40,78.02,6.29,4457.76
5,subcommittee,585,5,3.31,0.05,64.87,6.02,468.73
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 1,767,904
Total word tokens in reference corpus: 980,144


CPU times: user 369 ms, sys: 220 ms, total: 589 ms
Wall time: 94.3 ms
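
The keyness columns can be checked by hand. The sketch below assumes the commonly used definitions: relative risk as the ratio of normalized frequencies, log ratio as its base-2 logarithm, and log likelihood as the two-cell G2 statistic. Whether Conc computes them exactly this way is an assumption, and `keyword_stats` is an illustrative helper, but the results should land close to the reported row for "unanimous" in the 10k corpus (100.57, 6.65 and 748.42).

```python
import math


def keyword_stats(freq, freq_ref, total, total_ref, per=10_000):
    """Hypothetical helper; requires non-zero frequencies in both corpora."""
    norm = freq / total * per
    norm_ref = freq_ref / total_ref * per
    relative_risk = norm / norm_ref
    log_ratio = math.log2(relative_risk)
    # Expected counts if the token were equally frequent in both corpora.
    pooled_rate = (freq + freq_ref) / (total + total_ref)
    expected = total * pooled_rate
    expected_ref = total_ref * pooled_rate
    log_likelihood = 2 * (
        freq * math.log(freq / expected)
        + freq_ref * math.log(freq_ref / expected_ref)
    )
    return {
        "relative_risk": relative_risk,
        "log_ratio": log_ratio,
        "log_likelihood": log_likelihood,
    }


# "unanimous": 907 occurrences in the target, 5 in the reference corpus.
stats = keyword_stats(907, 5, 1_767_904, 980_144)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

The minimum frequency filters used in the report (5 in each corpus) keep both counts above zero, which these formulas need to avoid division by zero and undefined logarithms.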


Keywords
Target corpus: US Congressional Speeches Subset 100k, Reference corpus: Brown Corpus
Rank,Token,Frequency,Frequency Reference,Normalized Frequency,Normalized Frequency Reference,Relative Risk,Log Ratio,Log Likelihood
1,unanimous,8978,5,4.98,0.05,97.66,6.61,895.70
2,amendment,39940,24,22.16,0.24,90.51,6.50,3968.88
3,appropriation,6847,5,3.80,0.05,74.48,6.22,672.68
4,senator,52772,39,29.28,0.40,73.60,6.20,5180.64
5,gentleman,32178,28,17.86,0.29,62.51,5.97,3123.80
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 18,020,769
Total word tokens in reference corpus: 980,144


CPU times: user 2.78 s, sys: 417 ms, total: 3.19 s
Wall time: 274 ms


Keywords
Target corpus: US Congressional Speeches Subset 200k, Reference corpus: Brown Corpus
Rank,Token,Frequency,Frequency Reference,Normalized Frequency,Normalized Frequency Reference,Relative Risk,Log Ratio,Log Likelihood
1,unanimous,17813,5,4.93,0.05,96.63,6.59,897.98
2,amendment,80078,24,22.16,0.24,90.50,6.50,4023.10
3,appropriation,13896,5,3.85,0.05,75.38,6.24,690.81
4,senator,105824,39,29.28,0.40,73.60,6.20,5252.88
5,gentleman,63852,28,17.67,0.29,61.85,5.95,3132.10
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 36,136,744
Total word tokens in reference corpus: 980,144


CPU times: user 6.71 s, sys: 451 ms, total: 7.16 s
Wall time: 516 ms


Keywords
Target corpus: US Congressional Speeches Subset 500k, Reference corpus: Brown Corpus
Rank,Token,Frequency,Frequency Reference,Normalized Frequency,Normalized Frequency Reference,Relative Risk,Log Ratio,Log Likelihood
1,unanimous,44193,5,4.89,0.05,95.89,6.58,898.23
2,amendment,198132,24,21.93,0.24,89.57,6.48,4012.78
3,appropriation,34215,5,3.79,0.05,74.24,6.21,685.45
4,senator,264478,39,29.28,0.40,73.57,6.20,5295.45
5,gentleman,159877,28,17.70,0.29,61.95,5.95,3163.94
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 90,341,944
Total word tokens in reference corpus: 980,144


CPU times: user 20 s, sys: 802 ms, total: 20.8 s
Wall time: 1.17 s
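
The derived columns in the keyword tables above can be checked by hand. The following is a minimal sketch, assuming Normalized Frequency is the raw count scaled to 10,000 tokens, Relative Risk is the ratio of the two relative frequencies, and Log Ratio is the base-2 logarithm of that ratio. These definitions reproduce the "unanimous" row of the 500k report, but they are inferred from the output rather than taken from Conc's source, and the function names are hypothetical.

```python
import math

def normalized_frequency(count, corpus_tokens, per=10_000):
    """Frequency per `per` tokens (the reports state 10,000)."""
    return count / corpus_tokens * per

def relative_risk(count, corpus_tokens, ref_count, ref_tokens):
    """Ratio of relative frequency in the target to that in the reference."""
    return (count / corpus_tokens) / (ref_count / ref_tokens)

def log_ratio(count, corpus_tokens, ref_count, ref_tokens):
    """Base-2 log of the relative risk (assumed definition)."""
    return math.log2(relative_risk(count, corpus_tokens, ref_count, ref_tokens))

# Values copied from the 500k keyword report: 'unanimous' occurs 44,193 times
# in 90,341,944 target tokens and 5 times in 980,144 reference tokens.
target = (44_193, 90_341_944)
reference = (5, 980_144)

print(round(normalized_frequency(*target), 2))      # 4.89
print(round(relative_risk(*target, *reference), 2)) # 95.89
print(round(log_ratio(*target, *reference), 2))     # 6.58
```

Because these measures depend only on relative frequencies, they stay nearly constant across the 200k and 500k corpora even though the raw counts more than double.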


In [None]:
#| eval: false
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    %time conc.collocates('economy', page_size = 5).display()
    del corpus

"Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy"""
US Congressional Speeches Subset 10k,US Congressional Speeches Subset 10k,US Congressional Speeches Subset 10k,US Congressional Speeches Subset 10k,US Congressional Speeches Subset 10k,US Congressional Speeches Subset 10k
Rank,Token,Collocate Frequency,Frequency,Logdice,Log Likelihood
1,economy,20,358,9.84,248.59
2,healthy,10,50,9.65,74.41
3,segment,9,24,9.59,80.17
4,our,93,5938,8.92,221.67
5,false,6,55,8.89,36.86
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 115
Showing 5 rows


CPU times: user 123 ms, sys: 21.9 ms, total: 145 ms
Wall time: 54.8 ms
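
The Logdice column in the 10k table above is consistent with the standard logDice formula, 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the collocate frequency, f_x the frequency of the node word, and f_y the frequency of the collocate. The sketch below is an illustration under that assumption, not Conc's implementation. The node frequency of 358 is read from the row where "economy" collocates with itself, and the function name is hypothetical.

```python
import math

def log_dice(collocate_frequency, node_frequency, token_frequency):
    """Standard logDice score: 14 + log2 of the Dice coefficient."""
    dice = 2 * collocate_frequency / (node_frequency + token_frequency)
    return 14 + math.log2(dice)

# Frequency of the node word 'economy' in the 10k corpus (row 1 of the table).
NODE_FREQUENCY = 358

# (collocate frequency, corpus frequency) pairs copied from the 10k table.
print(round(log_dice(20, NODE_FREQUENCY, 358), 2))   # economy -> 9.84
print(round(log_dice(10, NODE_FREQUENCY, 50), 2))    # healthy -> 9.65
print(round(log_dice(93, NODE_FREQUENCY, 5938), 2))  # our     -> 8.92
```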


"Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy"""
US Congressional Speeches Subset 100k,US Congressional Speeches Subset 100k,US Congressional Speeches Subset 100k,US Congressional Speeches Subset 100k,US Congressional Speeches Subset 100k,US Congressional Speeches Subset 100k
Rank,Token,Collocate Frequency,Frequency,Logdice,Log Likelihood
1,our,1084,60051,9.12,2801.70
2,efficiency,60,732,8.77,329.93
3,stimulate,51,299,8.69,358.80
4,global,55,618,8.69,311.68
5,jobs,83,3100,8.63,274.52
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 864
Showing 5 rows


CPU times: user 413 ms, sys: 244 ms, total: 657 ms
Wall time: 328 ms


"Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy"""
US Congressional Speeches Subset 200k,US Congressional Speeches Subset 200k,US Congressional Speeches Subset 200k,US Congressional Speeches Subset 200k,US Congressional Speeches Subset 200k,US Congressional Speeches Subset 200k
Rank,Token,Collocate Frequency,Frequency,Logdice,Log Likelihood
1,our,2219,121489,9.14,5670.76
2,global,119,1221,8.76,689.99
3,sector,119,1741,8.68,604.09
4,stimulate,101,611,8.63,698.06
5,jobs,166,6312,8.60,534.83
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 1,524
Showing 5 rows


CPU times: user 710 ms, sys: 212 ms, total: 922 ms
Wall time: 523 ms


"Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy""","Collocates of ""economy"""
US Congressional Speeches Subset 500k,US Congressional Speeches Subset 500k,US Congressional Speeches Subset 500k,US Congressional Speeches Subset 500k,US Congressional Speeches Subset 500k,US Congressional Speeches Subset 500k
Rank,Token,Collocate Frequency,Frequency,Logdice,Log Likelihood
1,our,5656,304919,9.16,14596.00
2,stimulate,267,1472,8.71,1898.46
3,global,283,2924,8.70,1636.06
4,jobs,418,15339,8.62,1373.41
5,economy,446,19399,8.56,5491.13
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 2,786
Showing 5 rows


CPU times: user 2.14 s, sys: 589 ms, total: 2.73 s
Wall time: 1.34 s
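
In the timings above, total CPU time regularly exceeds wall time (for example 2.73 s of CPU against 1.34 s of wall time for the 500k collocation report), which is what one would expect when work is spread across several threads. Outside a notebook, where the `%time` magic is not available, the same two figures can be collected with the standard library. This is a minimal sketch; `time_call` is a hypothetical helper, not part of Conc, and the call to `sum` is only a placeholder for a report method.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, cpu_seconds).

    perf_counter measures elapsed wall-clock time; process_time measures
    user + system CPU time for the whole process, summed over its threads.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds

# Placeholder workload; substitute a call such as a report method.
result, wall, cpu = time_call(sum, range(1_000_000))
print(f'Wall time: {wall:.3f} s, CPU time: {cpu:.3f} s')
```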
