The following code creates a corpus, a large subset of the British National Corpus (BNC), that can be used with bag-of-words topic models.

This assumes you have downloaded either the *2554* BNC corpus or the *bnc_paragraphs.pkl* file. If you have *bnc_paragraphs.pkl*, you do not need the *2554* corpus unless you want to re-create *bnc_paragraphs.pkl*.
Both can be obtained by running the following scripts in a Unix shell:
* getbnc2554.sh
* getbncparagraphs.sh

In [1]:
from collections import defaultdict

In [2]:
from bnctools import utils

In [3]:
use_cached_data = True

pkl_filename = 'bnc_paragraphs.pkl'
bnc_2554_texts_root = 'bnc/2554/download/Texts/'

if not use_cached_data:
    
    corpus_filenames = utils.Corpus.get_corpus_filenames(bnc_2554_texts_root)
    
    # Requires a running ipyparallel cluster, started with e.g. `ipcluster start -n 16`
    view = utils.init_ipyparallel()
    paragraphs = utils.get_all_paragraphs_parallel(view, corpus_filenames)
    utils.dump(paragraphs, filename=pkl_filename)

else:
    
    paragraphs = utils.load(pkl_filename)

In [4]:
# Total word count across all extracted paragraphs
sum(paragraph['word_count'] for paragraph in paragraphs)

87564696

In [5]:
mini_documents = utils.paragraphs_to_mini_documents(paragraphs)

In [6]:
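# Words in each mini-document are separated by '|'.
# Build a list (not a lazy map object) so that counts can be reused in later cells.
counts = [len(doc.split('|')) for doc in mini_documents]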

In [7]:
[f(counts) for f in (sum, min, max)]

[78723408, 250, 500]
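
The figures above indicate that every mini-document holds between 250 and 500 words, and that roughly 8.8 million of the 87.6 million paragraph words do not end up in any mini-document. The following is a minimal sketch of one way such a grouping could work. It is not the `bnctools` implementation: the function name `group_into_mini_documents`, its parameters, and the representation of a paragraph as a plain list of words are all assumptions made for illustration.

```python
def group_into_mini_documents(paragraphs, min_words=250, max_words=500, sep='|'):
    """Greedily concatenate consecutive paragraphs into mini-documents.

    Hypothetical helper, not part of bnctools. Each paragraph is assumed
    to be a list of word strings. Every returned document contains
    between min_words and max_words words, joined by sep.
    """
    documents = []
    buffer = []

    def flush():
        # Keep the buffered words only if they reach the minimum length.
        if len(buffer) >= min_words:
            documents.append(sep.join(buffer))

    for words in paragraphs:
        if len(words) > max_words:
            # A paragraph longer than max_words can never fit: emit what we
            # have (if long enough), then skip the over-long paragraph.
            flush()
            buffer = []
        elif len(buffer) + len(words) > max_words:
            # Adding this paragraph would overflow: emit or discard the
            # buffer and start a new one from this paragraph.
            flush()
            buffer = list(words)
        else:
            buffer.extend(words)

    flush()
    return documents


# Toy input: paragraphs of 100, 200, 300, 600 and 50 words.
toy_paragraphs = [['a'] * 100, ['b'] * 200, ['c'] * 300, ['d'] * 600, ['e'] * 50]
toy_documents = group_into_mini_documents(toy_paragraphs)
toy_counts = [len(doc.split('|')) for doc in toy_documents]
```

Under this scheme, over-long paragraphs and leftover runs shorter than `min_words` are dropped, which is one plausible reason the mini-document total would be smaller than the paragraph total.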

In [8]:
with open('bnc_texts_%d_%d_%d.txt' % tuple([func(counts) for func in (sum, min, max)]), 'w') as f:
    f.write('\n'.join(mini_documents))
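
The file written above stores one mini-document per line, with words separated by `|`. As a rough sketch of how it might later be read back into bag-of-words form, the snippet below counts word occurrences per document with `collections.Counter`. The helper name `bag_of_words` is hypothetical, and the in-memory `example_lines` stand in for the lines of the real file.

```python
from collections import Counter


def bag_of_words(lines, sep='|'):
    """Turn an iterable of '|'-separated lines into per-document word counts.

    Hypothetical helper for illustration; blank lines are skipped.
    """
    return [Counter(line.strip().split(sep)) for line in lines if line.strip()]


# Stand-in for the lines of the mini-documents file.
example_lines = ['the|cat|sat|on|the|mat\n', 'a|dog|barked\n']
example_bags = bag_of_words(example_lines)
```

Because an open file is itself an iterable of lines, the same function could be applied to the real data with something like `with open(filename) as f: bags = bag_of_words(f)`.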

In [9]:
! ls -lth

total 1.7G
-rw-r--r-- 1 andrews andrews 429M Nov 27 21:16 bnc_texts_78723408_250_500.txt
-rw-r--r-- 1 andrews andrews 3.5K Nov 27 21:14 make_mini_documents.ipynb
-rw-r--r-- 1 andrews andrews 1.2G Nov 27 14:52 bnc_paragraphs.pkl


In [10]:
! md5sum bnc_texts_78723408_250_500.txt

d1057f4be97ac53af2c86cfd77a685da  bnc_texts_78723408_250_500.txt
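
Where `md5sum` is not available, the same check can be sketched in Python with the standard-library `hashlib` module. The function name `md5_of_file` and the 1 MiB chunk size are arbitrary choices for this example; reading in chunks avoids loading the whole 429M file into memory at once.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at path.

    Illustrative helper: the file is read in binary mode, chunk_size
    bytes at a time, so memory use stays small for large files.
    """
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file('bnc_texts_78723408_250_500.txt')` on a file identical to the one listed above should return the checksum shown in the `md5sum` output.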
