(Abandoned) Markov Generator
I started this project at Hacker School in summer '14. It fell by the wayside when I realized my grand plans for a part-of-speech-based Markov text generator didn't actually make any sense. Here's the documentation for all of the existing code, in all its glory, so that I can pick this up again one day.
What This Repo Contains
markovgen.py - the main code file. Contains the classes Markov and POS_Markov, which take an input file (the corpus) and generate random text with varying degrees of comprehensibility.
corpuscleaner.py - when beginning to work with a new corpus, run the text file through corpuscleaner.py. This will remove all numbers, the words "chapter" and "book", and any additional strings specified by the user in the file.
corpus_tagger_cpickle.py - part-of-speech tags a given corpus text file with the Natural Language Toolkit's part-of-speech tagger, then saves the result with cPickle.
/texts/plain - contains a number of possible corpuses as plain text files.
/texts/tagged - contains a number of possible corpuses part-of-speech tagged with nltk and encoded with cPickle.
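As a rough illustration of the cleaning step described for corpuscleaner.py, here is a minimal sketch in Python 3. The function name clean_corpus and its extra_strings parameter are hypothetical, not the script's actual interface; the sketch only mirrors the described behavior of dropping numbers, the words "chapter" and "book", and any user-supplied strings. Treating "chapter" and "book" as whole words, case-insensitively, is an assumption.

```python
import re


def clean_corpus(text, extra_strings=()):
    """Remove digits, the words 'chapter' and 'book', and extra strings.

    Hypothetical helper; the real corpuscleaner.py may work differently.
    """
    # Drop runs of digits.
    text = re.sub(r"\d+", "", text)
    # Drop the standalone words "chapter" and "book", ignoring case.
    text = re.sub(r"\b(?:chapter|book)\b", "", text, flags=re.IGNORECASE)
    # Drop any additional user-specified strings verbatim.
    for unwanted in extra_strings:
        text = text.replace(unwanted, "")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

Calling clean_corpus(raw_text, extra_strings=["THE END"]) would return the text with those pieces stripped and the whitespace normalized.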
How to Use the Code
Create a new Markov generator, passing the path to your corpus's plain-text file as the argument.
mymarkov = Markov("texts/plain/mycorpus")
Upon creation, it will automatically populate a dictionary with word tri-grams; for every three consecutive words in the text, the first two will key to the third.
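To make the tri-gram dictionary concrete, here is a minimal sketch in Python 3. The function name build_trigram_dict is hypothetical and this is not the code from markovgen.py; it only shows the idea of keying each pair of consecutive words to the word that follows them.

```python
from collections import defaultdict


def build_trigram_dict(words):
    """Map each pair of consecutive words to the list of words that follow it."""
    table = defaultdict(list)
    for first, second, third in zip(words, words[1:], words[2:]):
        # A follower seen several times is appended several times,
        # so the list length reflects its frequency.
        table[(first, second)].append(third)
    return table


words = "the cat sat on the cat mat".split()
table = build_trigram_dict(words)
# table[("the", "cat")] is now ["sat", "mat"]
```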
The generate function populates a dictionary with word n-grams (n defaults to 3) and uses this dictionary to generate a text of a given length in words (default length is 100).
mymarkov.generate(250, n=3) -- generates a random text of 250 words, using word tri-grams
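The generation step can be sketched roughly as follows in Python 3. This is a standalone approximation, not the Markov class itself: the function names, the rng parameter, and the behavior at a dead end (jumping to a random key) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_ngram_dict(words, n=3):
    """Map each (n-1)-word tuple to the list of words that follow it."""
    table = defaultdict(list)
    for i in range(len(words) - n + 1):
        table[tuple(words[i:i + n - 1])].append(words[i + n - 1])
    return table


def generate(words, length=100, n=3, rng=random):
    """Return `length` words of random text built from word n-grams."""
    if n < 2:
        raise ValueError("n must be at least 2")
    table = build_ngram_dict(words, n)
    if not table:
        raise ValueError("corpus is shorter than n words")
    keys = list(table)
    # Seed the output with a random (n-1)-word key.
    output = list(rng.choice(keys))
    while len(output) < length:
        followers = table.get(tuple(output[-(n - 1):]))
        if followers:
            output.append(rng.choice(followers))
        else:
            # Dead end (assumed handling): continue from a random key.
            output.extend(rng.choice(keys))
    return " ".join(output[:length])
```

For example, generate(open("texts/plain/mycorpus").read().split(), 250, n=3) would produce 250 words from word tri-grams.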
Create a new part-of-speech-based Markov generator, passing the path to the cPickled, tagged file of your corpus as the argument (prepare it with corpuscleaner.py and corpus_tagger_cpickle.py, mentioned above).
mymarkov = POS_Markov("texts/tagged/mycorpus_tagged")
Upon creation, it will automatically populate two dictionaries. The word dictionary uses n-grams of the given word_n, and stores tagged words (i.e., ("word", "POS") tuples). The pos dictionary records only parts of speech (again using n-grams of a given size).
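A small sketch of the two dictionaries, in Python 3, may help. It assumes the tagged corpus is a list of ("word", "POS") tuples; the helper name build_ngram_dict and the sizes chosen below (2 for words, 3 for parts of speech) are illustrative assumptions, not values taken from POS_Markov.

```python
from collections import defaultdict


def build_ngram_dict(items, n):
    """Map each (n-1)-item tuple to the list of items that follow it."""
    table = defaultdict(list)
    for i in range(len(items) - n + 1):
        table[tuple(items[i:i + n - 1])].append(items[i + n - 1])
    return table


tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# Word dictionary: keys and values are whole ("word", "POS") tuples.
word_dict = build_ngram_dict(tagged, n=2)
# POS dictionary: only the tags are recorded.
pos_dict = build_ngram_dict([pos for _, pos in tagged], n=3)
```

Here word_dict[(("the", "DT"),)] holds the tagged words seen after "the", and pos_dict[("DT", "NN")] holds the tags seen after a determiner followed by a noun.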
Then, you can run the generate function to generate a text of a given length in words (default length is 100). The function works in a few steps:
- The program picks a seed word, a random word to use as the starting point, and puts the first pos_n words, starting from that seed word, into the output.
- Using n-grams of parts of speech, randomly chooses the next part of speech based on the output so far.
- Looks for a word of that part of speech that can follow, based on the words so far. If no such word exists, goes back to step 2.
- Repeat steps 2-3 until a text of the given length has been generated.
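The steps above can be sketched as follows in Python 3. This is an approximation under stated assumptions, not the POS_Markov implementation: the function name pos_generate, the rng and max_tries parameters, and the fallback of appending a random tagged word when no match turns up are all invented for illustration.

```python
import random
from collections import defaultdict


def build_ngram_dict(items, n):
    """Map each (n-1)-item tuple to the list of items that follow it."""
    table = defaultdict(list)
    for i in range(len(items) - n + 1):
        table[tuple(items[i:i + n - 1])].append(items[i + n - 1])
    return table


def pos_generate(tagged, length=100, word_n=2, pos_n=3,
                 rng=random, max_tries=50):
    """Generate text from a list of ("word", "POS") tuples."""
    if word_n < 2 or pos_n < 2:
        raise ValueError("word_n and pos_n must be at least 2")
    if len(tagged) < pos_n:
        raise ValueError("corpus is shorter than pos_n words")
    word_dict = build_ngram_dict(tagged, word_n)
    pos_dict = build_ngram_dict([pos for _, pos in tagged], pos_n)

    # Step 1: pick a random seed and copy pos_n tagged words into the output.
    start = rng.randrange(len(tagged) - pos_n + 1)
    output = list(tagged[start:start + pos_n])

    while len(output) < length:
        pos_key = tuple(pos for _, pos in output[-(pos_n - 1):])
        next_tags = pos_dict.get(pos_key, [])
        candidates = word_dict.get(tuple(output[-(word_n - 1):]), [])
        chosen = None
        for _ in range(max_tries if next_tags else 0):
            # Step 2: choose the next part of speech from the POS n-grams.
            tag = rng.choice(next_tags)
            # Step 3: look for a following word with that part of speech.
            matches = [tw for tw in candidates if tw[1] == tag]
            if matches:
                chosen = rng.choice(matches)
                break
        if chosen is None:
            # Assumed fallback so the loop always terminates.
            chosen = rng.choice(tagged)
        # Step 4: append and repeat until the text is long enough.
        output.append(chosen)
    return " ".join(word for word, _ in output[:length])
```

The retry cap exists only because the description does not say what happens when step 3 keeps failing; the original code may handle that case differently.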
This class CAN generate random text. It's just even more random and nonsensical than the texts generated by the plain-text Markov gen.
Regular Markov, Lord of the Rings, n=3
'Sleep!' said Frodo to Mordor; and Strider made them fight. He slew the leader, who was there before me, under my cloak until we are too few here, too few.' 'It is not yet sure,' said Gandalf. Then lifting up his sword and tightened his belt. 'Where has Grima stowed it?' he whispered. 'Not little Pippin! What's your report?' 'Nothing.' 'Hai! hai! yoi!' A yell broke into the High Elves!' said Legolas. 'I did not tell Gandalf, but as the rays of the Ring, of course. They have a good guess, as far as Tharbad, where the Road or in the South. Instead each of us will ever be known.'
Regular Markov, Harry Potter and the Chamber of Secrets, n=3
"Anyone can make itself invisible," said Hermione, suddenly severe. "You've had ten days to finish cleaning Mr. Malfoys shoes. Apparently Mr. Malfoy had been bungled, and that Ravenclaw prefect, Penelope Clearwater," said Ginny. "That's who he is, or who we are. I told him Filch was coming back. Stuffing the parchment back into their classes to deliver to 'Arry Potter in person," said George, sniggering. The Hogwarts Express was streaking along below them like a dog that had turned and turned to Harry Potter.
Regular Markov, Harry Potter and the Chamber of Secrets, n=4
Harry took out his wand, tapped the board, and the arrows began to wiggle over the diagram like caterpillars. As Wood launched into a speech about his new sweater from the Slytherin table. With a bit of whoever we want to change into." "Excuse me?" said Ron in disgust. The crowd thinned and they were making their way to the front of her robes bulging. When everyone had taken a swig of antidote and the various swellings had subsided, Snape swept over to Goyle's cauldron and scooped out the twisted black remains of the firework. There was a loud bang behind Harry as Neville Longbottom's wand slipped, vanishing one of the remaining pixies bit him painfully on the ear.
Regular Markov, Pride and Prejudice, n=4
"I am dressed, and tell the good, good news to my sister Gardiner about them directly. Lizzy, my dear, run down to your father, and ask him how much he will give her. Stay, stay, I will go myself. Ring the bell, Kitty, for Hill. I will put on my things in a moment. There is nothing else to do, I hope you will think of us. I am sure nobody else will believe me, if you do not make haste he will change his mind another day, and warn me off his grounds." Elizabeth felt that she had heard. It was a rational scheme, to be sure! A single man of large fortune; four or five thousand a year. As soon as they were married, that he was so much struck with the stranger's air, all wondered who he could be; and Kitty and Lydia, determined if possible to find out, led the way across the street, under pretense of wanting something in an opposite shop, and fortunately had just gained the pavement when the two gentlemen, turning back, had reached the carriage...
Optimization: weight dictionary values more intelligently, instead of putting a value in the list multiple times corresponding to frequency.
Fun experiment: run the plain-text Markov gen with n-grams of increasing n; find (a. by observation and b. by some actual calculation) the line between coherence and reproducing the corpus.
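For the optimization idea above (weighting dictionary values instead of repeating them), one possible approach in Python 3 is to store follower counts and sample with weights. The names build_weighted_trigram_dict and pick_next are hypothetical; this is a sketch of the idea, not code from the repo.

```python
import random
from collections import Counter, defaultdict


def build_weighted_trigram_dict(words):
    """Map each word pair to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second, third in zip(words, words[1:], words[2:]):
        table[(first, second)][third] += 1
    return table


def pick_next(table, key, rng=random):
    """Sample a follower of `key` in proportion to its count, or None."""
    counts = table.get(key)
    if not counts:
        return None
    followers = list(counts)
    weights = list(counts.values())
    # random.choices draws with the given relative weights.
    return rng.choices(followers, weights=weights, k=1)[0]
```

Each distinct follower is stored once with its count, so memory grows with the number of distinct followers rather than with the number of occurrences.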