# Make a corpus of BNC documents for topic modelling 

The following code creates a corpus, a large subset of the entire BNC, that can be used with bag-of-words topic models.

For this, I assume you have downloaded either the *2554* BNC corpus or the *bnc_paragraphs.pkl* file. If you have the *bnc_paragraphs.pkl* file, you don't need the *2554* corpus unless you want to re-create *bnc_paragraphs.pkl*.

The Python package `bnctools` provides bash shell scripts for downloading the BNC corpus or the *bnc_paragraphs.pkl* file:
* `getbnc2554.sh` gets the BNC corpus zip archive
* `getbncparagraphs.sh` gets *bnc_paragraphs.pkl*

You'll also need some word lists, including stop word lists. These can be obtained with another shell script from `bnctools`:
* `getvocabularylists.sh`

In [1]:
from collections import defaultdict

In [2]:
from bnctools import utils

Extract all paragraphs (they are tagged as such) from the BNC, or load them from the cached pickle file if it is available.

In [5]:
use_cached_data = True # Set to False if you do not want to use the cached pickle file

pkl_filename = 'bnc_paragraphs.pkl'
bnc_2554_texts_root = 'bnc/2554/download/Texts/'

if not use_cached_data:
    
    corpus_filenames = utils.Corpus.get_corpus_filenames(bnc_2554_texts_root)
    
    # Make sure cluster is started with e.g. ipcluster start -n 16
    view = utils.init_ipyparallel()
    paragraphs = utils.get_all_paragraphs_parallel(view, corpus_filenames)
    utils.dump(paragraphs, filename=pkl_filename)

else:
    
    paragraphs = utils.load(pkl_filename)
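
The cell above follows a "load the cache if present, otherwise compute and save" pattern. The internals of `utils.dump` and `utils.load` are not shown here, so the following is only a minimal sketch of that pattern using the standard `pickle` module. The helper name `load_or_compute` and the toy data are hypothetical.

```python
import os
import pickle
import tempfile


def load_or_compute(filename, compute):
    """Return the pickled object in `filename`, computing and caching it if absent.

    `compute` is a zero-argument function (hypothetical, for illustration)
    that produces the object when no cache file exists.
    """
    if os.path.exists(filename):
        with open(filename, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(filename, 'wb') as f:
        pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
    return result


# Toy demonstration: the expensive step runs only on the first call.
calls = []


def extract_toy_paragraphs():
    calls.append(1)
    return [{'word_count': 3}, {'word_count': 7}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_paragraphs.pkl')
    first = load_or_compute(path, extract_toy_paragraphs)
    second = load_or_compute(path, extract_toy_paragraphs)
```

The second call reads the pickle file instead of recomputing, which is the role `use_cached_data = True` plays above.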

Count the total number of words in the set of paragraphs we have now extracted.

In [6]:
sum(paragraph['word_count'] for paragraph in paragraphs)

87564696

The following creates a set of small "mini documents". Each mini document is either a single paragraph or a concatenation of consecutive paragraphs whose total word count falls within a given range, which by default is 250 to 500 words.

In [8]:
mini_documents = utils.paragraphs_to_mini_documents(paragraphs)
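
To make the grouping rule concrete, here is a minimal sketch of one greedy way to merge consecutive paragraphs into documents within a word count range. It is not the implementation of `utils.paragraphs_to_mini_documents`, which is not shown here and may differ. The function name `group_paragraphs` and the `'words'` key are assumptions for illustration; only the `'word_count'` key and the `|` delimiter appear in this notebook.

```python
def group_paragraphs(paragraphs, min_words=250, max_words=500):
    """Greedily merge consecutive paragraphs into '|'-delimited documents.

    Assumes each paragraph is a dict with a 'word_count' entry and a
    hypothetical 'words' entry holding its list of tokens.
    """
    documents = []
    buffer, buffer_count = [], 0
    for paragraph in paragraphs:
        n = paragraph['word_count']
        if buffer_count + n > max_words:
            # The next paragraph would overflow: keep the buffer only if
            # it is long enough, then start a fresh one.
            if buffer_count >= min_words:
                documents.append('|'.join(buffer))
            buffer, buffer_count = [], 0
        if n <= max_words:
            # Paragraphs longer than max_words on their own are skipped.
            buffer.extend(paragraph['words'])
            buffer_count += n
    if buffer_count >= min_words:
        documents.append('|'.join(buffer))
    return documents


def make_paragraph(n, tag):
    """Build a toy paragraph with n distinct tokens."""
    return {'word_count': n, 'words': ['%s%d' % (tag, i) for i in range(n)]}


# Toy run with a tiny range so the behaviour is easy to follow.
toy = [make_paragraph(n, 'p%d_w' % i) for i, n in enumerate([2, 2, 4, 1, 6, 3])]
docs = group_paragraphs(toy, min_words=3, max_words=5)
doc_lengths = [len(d.split('|')) for d in docs]
```

Under a rule like this, paragraphs that cannot be fitted into the range are dropped, which would be one possible reason for a mini document total that is lower than the paragraph total.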

We can now get the word count of each mini document. Words within a mini document are delimited by `|`.

In [10]:
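# Use a list rather than a lazy map object, so that counts can be
# reused by len, sum, min, and max below (required under Python 3).
counts = [len(doc.split('|')) for doc in mini_documents]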

Then get the number of documents, their total word count, and the minimum and maximum word counts.

In [11]:
[f(counts) for f in (len, sum, min, max)]

[184271, 78723408, 250, 500]

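Write the mini documents to a text file, one per line, with the total, minimum, and maximum word counts in the filename.

In [12]: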
with open('bnc_texts_%d_%d_%d.txt' % tuple([func(counts) for func in (sum, min, max)]), 'w') as f:
    f.write('\n'.join(mini_documents))

Get the corpus vocabulary, using a minimum count of 5.

In [13]:
vocabulary = utils.get_corpus_vocabulary(mini_documents, minimum_count=5)

In [14]:
len(vocabulary)

49328
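
As an illustration of what a count-thresholded vocabulary looks like, here is a minimal sketch using `collections.Counter`. It is not the code of `utils.get_corpus_vocabulary`. It assumes the threshold applies to total corpus frequency and is inclusive, and it takes an optional stop word collection. The function name `build_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents, minimum_count=5, stopwords=()):
    """Map each retained word to its total count across all documents.

    Assumes documents are '|'-delimited strings, as in this notebook.
    Words occurring fewer than minimum_count times, or listed in
    stopwords, are left out.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc.split('|'))
    excluded = set(stopwords)
    return {word: count for word, count in counts.items()
            if count >= minimum_count and word not in excluded}


toy_docs = ['the|cat|sat', 'the|cat|ran', 'the|dog']
vocab_all = build_vocabulary(toy_docs, minimum_count=2)
vocab_no_stop = build_vocabulary(toy_docs, minimum_count=2, stopwords=['the'])
```

Here `vocab_all` keeps `the` and `cat`, while `vocab_no_stop` keeps only `cat`.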

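Write the sorted vocabulary to a text file, one word per line, with the vocabulary size in the filename.

In [15]: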
with open('bnc_vocab_%d.txt' % len(vocabulary), 'w') as f:
    f.write('\n'.join(sorted(vocabulary.keys())))
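
With the documents and the sorted vocabulary saved, a bag-of-words model can treat each word's position in the vocabulary file as its integer index. The following is a minimal sketch of that conversion, not part of `bnctools`. The function name `to_bag_of_words` and the toy vocabulary are hypothetical, and words outside the vocabulary are assumed to be dropped.

```python
from collections import Counter


def to_bag_of_words(document, word_to_index):
    """Convert a '|'-delimited document to sorted (word index, count) pairs.

    Words that are not in word_to_index are ignored.
    """
    counts = Counter(word_to_index[word]
                     for word in document.split('|')
                     if word in word_to_index)
    return sorted(counts.items())


# Toy vocabulary, sorted as in the vocabulary file written above.
toy_vocab = sorted(['cat', 'dog', 'sat'])
word_to_index = {word: i for i, word in enumerate(toy_vocab)}
bow = to_bag_of_words('the|cat|sat|cat', word_to_index)
```

In this toy case `bow` is `[(0, 2), (2, 1)]`: `cat` (index 0) occurs twice, `sat` (index 2) once, and `the` is dropped because it is not in the vocabulary.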