Each experiment in stylometry starts with selecting a corpus, i.e. the collection of texts that we would like to compare. In PyStyl, we use the `Corpus` class to represent such a text collection:

In [8]:
from PyStyl.corpus import Corpus # lowercase!
corpus = Corpus()

Adding texts from a directory to the corpus is easy: 

In [6]:
corpus.add_directory(directory='data/dummy')

Adding texts from: /Users/mike/GitRepos/PyStyl/data/dummy


By default, this function assumes that all texts under this directory are encoded in UTF-8 and have a `.txt` extension. Additionally, each filename should follow the pattern `<category>_<title>.txt`, where the category is a label that indicates e.g. a text's authorship, genre or date of composition. In our case, this directory looked like:

In [3]:
ls data/dummy

Anne_Grey.txt            Charlotte_Eyre.txt       Charlotte_Shirley.txt    Emily_Wuthering.txt
Anne_Tenant.txt          Charlotte_Professor.txt  Charlotte_Villette.txt
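To make the naming convention concrete, here is a minimal sketch of how such filenames could be split into a category and a title. The helper `parse_filename` is hypothetical and written only for illustration; it is not part of PyStyl, and it assumes that the first underscore separates the two parts.

```python
from pathlib import Path

def parse_filename(filename):
    """Split '<category>_<title>.txt' into (category, title).

    Hypothetical helper for illustration; not PyStyl's own parsing code.
    """
    stem = Path(filename).stem                  # drop the '.txt' extension
    category, sep, title = stem.partition('_')  # split on the first underscore only
    if not sep or not category or not title:
        raise ValueError(f"Expected '<category>_<title>.txt', got: {filename!r}")
    return category, title

filenames = ['Anne_Grey.txt', 'Charlotte_Eyre.txt', 'Emily_Wuthering.txt']
labels = [parse_filename(name) for name in filenames]
print(labels)
```

A filename without an underscore raises a `ValueError` in this sketch, which makes a misnamed file easy to spot before any analysis starts.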


This corpus holds a collection of 7 novels of varying lengths. If we would like to split these into shorter segments of e.g. 10,000 words, we can use the `segment()` function.

In [9]:
corpus.segment(segment_size=10000)
print(corpus)

ValueError: No texts loaded yet.
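The idea behind segmentation can be sketched in a few lines of plain Python. The function `segment_tokens` below is a hypothetical illustration, not PyStyl's implementation; in particular, dropping a shorter trailing remainder is an assumption made here, and PyStyl may handle it differently.

```python
def segment_tokens(tokens, segment_size):
    """Cut a token list into consecutive, non-overlapping segments of
    exactly `segment_size` tokens. A shorter trailing remainder is dropped
    (an assumption of this sketch, not necessarily PyStyl's behaviour)."""
    if segment_size <= 0:
        raise ValueError('segment_size must be positive')
    n_full = len(tokens) // segment_size  # number of complete segments
    return [tokens[i * segment_size:(i + 1) * segment_size] for i in range(n_full)]

# A toy "text" of 10 whitespace-separated tokens.
tokens = 'the quick brown fox jumps over the lazy dog again'.split()
segments = segment_tokens(tokens, segment_size=4)
print([len(s) for s in segments])  # two full segments; the last 2 tokens are dropped
```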

The `segment()` function also allows us to easily exclude longer or shorter texts from our analysis, using `min_size` and `max_size`:

In [None]:
corpus.segment(segment_size=5000, min_size=1000, max_size=20000)
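One plausible reading of these two parameters is a simple length filter on the number of tokens per text. The sketch below illustrates that reading only; `filter_by_size` is a hypothetical helper, and whether PyStyl treats the bounds as inclusive, or applies them before or after segmenting, is an assumption here.

```python
def filter_by_size(texts, min_size=0, max_size=None):
    """Keep only texts whose token count lies in [min_size, max_size].

    Hypothetical helper: inclusive bounds are an assumption of this sketch.
    """
    kept = {}
    for title, tokens in texts.items():
        n = len(tokens)
        if n < min_size:
            continue  # too short
        if max_size is not None and n > max_size:
            continue  # too long
        kept[title] = tokens
    return kept

# Toy corpus with made-up lengths, for illustration only.
texts = {
    'short': ['w'] * 500,
    'medium': ['w'] * 8000,
    'long': ['w'] * 30000,
}
kept = filter_by_size(texts, min_size=1000, max_size=20000)
print(sorted(kept))
```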

All these word counts are done after tokenization, a procedure which automatically tries to identify individual words in running text. We use the `nltk` package (Natural Language Toolkit) for this, as it is the most stable option around.

Saving a corpus for later re-use ('pickling' in Python) is easy:

In [10]:
corpus.save()

We can reload and inspect a previously saved corpus by typing:

In [11]:
corpus = Corpus.load()
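For readers unfamiliar with pickling, the round trip can be sketched with Python's standard `pickle` module. This is only an illustration of the general mechanism: the dictionary `corpus_data`, the file name `corpus.pkl` and the temporary directory are made up for this example, and the file location and format that `Corpus.save()` actually uses are not shown here.

```python
import os
import pickle
import tempfile

# Stand-in for a corpus object; the structure is invented for this sketch.
corpus_data = {'titles': ['Grey', 'Eyre'], 'categories': ['Anne', 'Charlotte']}

# Write the object to disk in binary mode...
path = os.path.join(tempfile.mkdtemp(), 'corpus.pkl')
with open(path, 'wb') as f:
    pickle.dump(corpus_data, f)

# ...and read it back into a new, equal object.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == corpus_data)
```

Note that unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.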