A quick Python implementation of a text generator based on a Markov process.
Sometimes you just want to generate random text that's usually sort of grammatical. For this purpose, a Markov chain is a good fit. The generated text is nonsense, but sometimes that's all you need. Here's some text generated by running the command python3 gentext.py models/jack-masden.json from the project root:
This is the same for Transwestern Pipeline Company name should see you. Tracy, wanted to sit in? checked by anyone as planned, with this additional schedule, 2. need to communicate this one? Do you can constructively talk through one last week of his list for the updated version of the best estimate at your comments as to you need to have input and Administration. Legal Consolidation Data Viewer DataWarehouse User Role Consolidated Thank you. You are not sure it will be aware of you pland to outweigh the percentage of Networks, and privileged material for was ok Let's plan assessments from...
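The idea behind it is simple: record which tokens follow which in the training text, then repeatedly pick a next token in proportion to how often it followed the current one. The toy sketch below illustrates that process. It isn't this package's implementation (genmodel.py and gentext.py do the real work), just a minimal illustration of the technique:

import random
from collections import defaultdict

def build_chain(text):
    # Map each token to the list of tokens seen immediately after it.
    tokens = text.split()
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain

def walk_chain(chain, start, length=100):
    # Generate up to `length` tokens by repeatedly sampling a successor.
    output = [start]
    for _ in range(length - 1):
        successors = chain.get(output[-1])
        if not successors:
            break  # dead end: this token was never followed by anything
        output.append(random.choice(successors))
    return " ".join(output)

chain = build_chain("the cat sat on the mat and the cat slept on the sofa")
print(walk_chain(chain, start="the", length=10))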
This package is not currently on PyPI. You can install this repo as a pip package using the following command:
pip install git+ssh://git@github.com/lambdacasserole/markov-text-generator.git
You can use the models entirely from the command line if you like; it's really straightforward.
If you want to train a new model from a set of text files, use genmodel.py. Do this:
python3 genmodel.py <file1> [file2] ... [filen]
It'll take as many files as you give it, as long as you give it at least one. The serialized model is written to standard output, so to train on two files called emails.txt and tweets.txt and save the model file as my_model.json, do this:
python3 genmodel.py emails.txt tweets.txt > my_model.json
Note that there are filters built in to genmodel.py to do some basic data cleaning. These are designed for the Enron dataset [1] and remove email headers etc. You'll have to adjust them for your own training set.
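To give a rough idea of what that means, a filter of that kind might look something like the sketch below. This is only an illustration of stripping header blocks and quoted reply lines, not the exact filters in genmodel.py:

import re

# Illustrative only: drop the header block (everything up to the first blank
# line) and any lines that look like quoted replies or forwarded messages.
def strip_email_noise(raw):
    parts = re.split(r"\r?\n\r?\n", raw, maxsplit=1)
    body = parts[1] if len(parts) > 1 else parts[0]
    lines = [line for line in body.splitlines()
             if not line.lstrip().startswith(">")
             and not re.match(r"-{2,}\s*Original Message\s*-{2,}", line)]
    return "\n".join(lines)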
Once you have a trained model, you can use gentext.py to generate text. This is even simpler. To generate text from my_model.json, do this:
python3 gentext.py my_model.json
This will generate a 100-token string by default. If you want to generate longer/shorter strings, you can specify the length of the string in tokens like so (in this case, 1000 tokens will be generated):
python3 gentext.py my_model.json 1000
You can also do things programmatically from within Python. It's a bit more involved, but still super simple. This example does the same as the command-line one above: we want to train a model from the two text files emails.txt and tweets.txt and save it to my_model.json, so:
import genmodel as gm
# Analyse files, getting frequency analysis and starting tokens.
analysis, starts = gm.analyze(["emails.txt", "tweets.txt"])
# Compute model from analysis.
model = gm.compute_model(analysis, starts)
# Now, save the model.
model.persist("my_model.json")
# Generate and print a 100-token string.
print(model.generate_text(100))
The training data used to generate the model files in /models is drawn from the Enron dataset [1]. I selected 5 people from it at random, generated random names for them, and trained the models on their sent_items or sent folders.
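If you want to build similar models yourself, the process looks roughly like the sketch below, using the programmatic API shown above. The maildir path and output filename are placeholders for illustration, not the exact ones used:

import glob
import genmodel as gm

# Placeholder path: point this at one person's sent or sent_items folder.
files = glob.glob("maildir/example-person/sent_items/*")

# Analyse the files, compute the model and save it alongside the others.
analysis, starts = gm.analyze(files)
model = gm.compute_model(analysis, starts)
model.persist("models/example-person.json")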
- Klimt, B. and Yang, Y., 2004, September. The Enron corpus: A new dataset for email classification research. In European Conference on Machine Learning (pp. 217-226). Springer, Berlin, Heidelberg.