Below is an informal experiment in using Markov chains to generate text that resembles the transcripts of Critical Role, a livestreamed Dungeons and Dragons game. The inspiration, and a lot of the code, came from https://www.kdnuggets.com/2019/11/markov-chains-train-text-generation.html.

This is my first use of Markov chains for generating text. The goal is to generate text that could reasonably pass for genuine content.

In [14]:
import sys
from scipy.sparse import dok_matrix
from numpy.random import choice


The corpusObj class represents the corpus that the Markov chain uses to build sentences. Its constructor takes the path of a file that lists the transcript files to read, reads those files into one long string, and splits that string into a list of words, with punctuation marks treated as separate words.

In [2]:
class corpusObj:
    
    def __init__(self,files_to_read):
        # files_to_read lists the transcript file names, separated by whitespace
        with open(files_to_read, "r") as f:
            self.file_names = f.read().split()
        self.corpus = ""

        for file_name in self.file_names:
            with open("data/"+file_name, 'r') as f:
                self.corpus+=f.read()
        self.corpus = self.corpus.replace('\n',' ')
        self.corpus = self.corpus.replace('\t',' ')
        self.corpus = self.corpus.replace('“', ' " ')
        self.corpus = self.corpus.replace('”', ' " ')
        for spaced in ['.','-',',','!','?','(','—',')']:
            self.corpus = self.corpus.replace(spaced, ' {0} '.format(spaced))
            
        self.corpus_words = self.corpus.split(' ')
        self.corpus_words= [word for word in self.corpus_words if word != '']
        
        # Vocabulary: each distinct word gets a column index in the transition matrix
        self.distinct_words = list(set(self.corpus_words))
        self.word_idx_dict = {word: i for i, word in enumerate(self.distinct_words)}
        self.distinct_words_count = len(self.distinct_words)
    
    def word_count(self):
        return len(self.corpus_words)
    
    def get_word_idx_dict(self):
        return self.word_idx_dict
    
    def get_distinct_word_count(self):
        return self.distinct_words_count
    
    def get_corpus(self):
        return self.corpus
    
    def get_corpus_words(self):
        return self.corpus_words
    
    def get_distinct_words(self):
        return self.distinct_words

corpus = corpusObj("FILES_TO_READ.txt")
print(len(corpus.corpus))
print(corpus.distinct_words_count)


33924882
59014


For the Critical Role corpus, I read all the transcript files into a single corpus object and print some basic statistics about it.

In [3]:
corpus = corpusObj("FILES_TO_READ.txt")
print("The corpus is",len(corpus.corpus),"characters long")
print("The corpus contains",corpus.word_count(),"words")
print("The corpus contains",corpus.get_distinct_word_count(),"distinct words")

The corpus is 33924882 characters long
The corpus contains 6918692 words
The corpus contains 59014 distinct words


<hr>

The markovObj class contains the code that generates new strings from a seed. The constructor takes the corpus and a value k, which is the number of words in each state. For example, with k = 3 the chain looks at the last three words and samples the next word according to how often each word followed those three words in the corpus.
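
To make the idea of k-word states concrete, here is a minimal sketch on a made-up toy sentence. It uses a plain dictionary of counters instead of the sparse matrix used in the class below, and the names `build_transitions` and `toy_words` are hypothetical, chosen only for this illustration.

```python
from collections import Counter, defaultdict

def build_transitions(words, k):
    """Map each k-word state to a Counter of the words that follow it."""
    transitions = defaultdict(Counter)
    for i in range(len(words) - k):
        state = " ".join(words[i:i + k])
        transitions[state][words[i + k]] += 1
    return transitions

# Hypothetical toy corpus, already split into tokens (punctuation is its own token).
toy_words = "you see a door . you see a light . you see the door .".split()
toy_transitions = build_transitions(toy_words, k=2)

# The state "you see" was followed by "a" twice and by "the" once.
print(dict(toy_transitions["you see"]))
# The state "see a" was followed by "door" once and by "light" once.
print(dict(toy_transitions["see a"]))
```

Each row of the sparse matrix in the class plays the same role as one of these counters: it holds the counts of every word that followed a given state.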

In [15]:
class markovObj:
    
    def __init__(self,k,corpus,NLINE_ON_arr,NOSPACE_ON_arr):
        self.k = k
        self.corpus = corpus
        self.alpha = 0
        self.NLINE_ON = NLINE_ON_arr
        self.NOSPACE_ON = NOSPACE_ON_arr
        self.setup_k_words()
        self.ABORT_ON_KEYERROR = True

    def setup_k_words(self):
        words = self.corpus.get_corpus_words()
        word_idx = self.corpus.get_word_idx_dict()

        # Every run of k consecutive words that still has a word following it
        self.sets_of_k_words = [' '.join(words[i:i+self.k]) for i in range(len(words) - self.k)]

        self.distinct_sets_of_k_words = list(set(self.sets_of_k_words))
        self.k_words_idx_dict = {word: i for i, word in enumerate(self.distinct_sets_of_k_words)}

        # Rows are k-word states, columns are candidate next words, entries are counts
        sets_count = len(self.distinct_sets_of_k_words)
        self.next_after_k_words_matrix = dok_matrix((sets_count, self.corpus.get_distinct_word_count()))

        # State i is followed by the word at position i+k, which always exists,
        # so every state can be counted (the old [:-self.k] slice skipped the last k)
        for i, word in enumerate(self.sets_of_k_words):
            word_sequence_idx = self.k_words_idx_dict[word]
            next_word_idx = word_idx[words[i+self.k]]
            self.next_after_k_words_matrix[word_sequence_idx, next_word_idx] += 1
        
    def sample_next_word_after_sequence(self, word_sequence):
        try:
            next_word_vector = self.next_after_k_words_matrix[self.k_words_idx_dict[word_sequence]] + self.alpha
            likelihoods = next_word_vector/next_word_vector.sum()
            return choice(self.corpus.get_distinct_words(), 1, p=likelihoods.toarray()[0])[0]
        except KeyError:
            if self.ABORT_ON_KEYERROR:
                print("Unable to continue chain, terminating")
                return 0
            else:
                return choice(self.corpus.get_distinct_words(), 1)[0]
    
    def stochastic_chain(self, seed, chain_length=15, seed_length=2):
        current_words = seed.split(' ')
        if len(current_words) != seed_length:
            print(len(current_words))
            print(seed_length)
            raise ValueError(f'wrong number of words, expected {seed_length}')
        # Compare the number of words, not the number of characters, against k
        if len(current_words) < self.k:
            raise ValueError("wrong number of words, must be >= k for seed")
        sentence = seed

        for _ in range(chain_length):
            sentence+=' '
            next_word = self.sample_next_word_after_sequence(' '.join(current_words))
            if not next_word:
                return sentence
            sentence+=str(next_word)
            current_words = current_words[1:]+[next_word]
        return sentence
    
    def format_out(self,string_out):
        words = string_out.split()
        str_formatted = words[0]
        for word in words[1:]:
            if word in self.NLINE_ON:
                str_formatted += "\n\n" + word
            elif word in self.NOSPACE_ON:
                str_formatted += word
            else:
                str_formatted += " " + word
        return str_formatted
    
    def print_from_seed(self,seed,chain_length=30):
        if len(seed.split()) > self.k:
            seedlen = len(seed.split())
            preamb = " ".join(seed.split()[:seedlen-self.k]) + " "
            seed = " ".join(seed.split()[-self.k:])
            print(preamb + self.format_out(self.stochastic_chain(seed,chain_length,len(seed.split()))))
        else:
            print(self.format_out(self.stochastic_chain(seed,chain_length,len(seed.split()))))

    
    def get_alpha(self):
        return self.alpha
    
    def set_alpha(self,alph_in):
        self.alpha = alph_in
    
    def get_k(self):
        return self.k
    
    def get_NLINE_ON(self):
        return self.NLINE_ON
    
    def set_NLINE_ON(self,NLINE_in):
        self.NLINE_ON = NLINE_in
        
    def append_NLINE_ON(self,new_NLINE):
        self.NLINE_ON.append(new_NLINE)
        
    def get_NOSPACE_ON(self):
        return self.NOSPACE_ON
    
    def set_NOSPACE_ON(self,NOSPACE_in):
        self.NOSPACE_ON = NOSPACE_in
        
    def append_NOSPACE_ON(self,new_NOSPACE):
        self.NOSPACE_ON.append(new_NOSPACE)
        
    

The NLINE_ON and NOSPACE_ON parameters passed to the markovObj constructor control how the output string is formatted. In the original transcripts each line of dialogue starts on a new line, so a line break is inserted before every speaker tag listed in NLINE_ON. The tokens listed in NOSPACE_ON, mostly punctuation, are attached to the preceding word without a space.

Below, a Markov object is assembled with k = 2.

In [16]:
NLINE_ON = ["ALL:","MATT:","MARISHA:","LAURA:","SAM:","ASHLEY:","TALIESIN:","TRAVIS:","LIAM:","ORION:"]
NOSPACE_ON = [".",",","'","\"",":",";","?","!","-"]

markov = markovObj(2,corpus,NLINE_ON,NOSPACE_ON)

In [6]:
markov.print_from_seed("LIAM: Caleb")

LIAM: Caleb starts looking about, "but me and Caleb start running through right next to you guys-- says that, for not hearing anything.

TALIESIN: It's a 20


In [7]:
markov.print_from_seed("lie to her")

lie to her and has made boots in, but-- which is how I would do additional damage.

SAM: All right, would you want to. If there's "someone


In [8]:
markov.print_from_seed("MATT: The Academy is somewhere in Rexxentrum")

MATT: The Academy is somewhere in Rexxentrum.

LAURA: All right. That's cool. Do you see it. And Scanlan, you do today?

MATT: Yeah, thanks guys. Is that within the


In [9]:
markov.print_from_seed("MATT: The Academy is somewhere in Rexxentrum")

MATT: The Academy is somewhere in Rexxentrum. There is a deep green- blue light as you split off.

SAM: It's a melee against a drop was it?

SAM: 14.

LAURA: Did you


In [10]:
markov.print_from_seed("MATT: The Academy is somewhere in Rexxentrum")

MATT: The Academy is somewhere in Rexxentrum.

LAURA: I've got a little behind my back, final blow was too dangerous given we know when it happens or I'll shoot a rap video?

TALIESIN: It's


These results, with k as 2 and alpha as 0, work pretty well. Obviously they don't make a lot of sense, but they are readable, and small sections at a time seem very much like genuine parts of a transcript. 

One of the main issues in the generated text is that it makes sense locally, but less and less sense as you read more of the sample. For example, the seed "LIAM: Caleb" led into "LIAM: Caleb starts looking about", which makes sense, but then quickly turns into "but me and Caleb start running through right next to you guys..." which doesn't really follow.

I think part of the reason is that k is 2. With k = 2, the chain only considers the last two words when choosing the next word. The phrase "Caleb starts looking about," ends in a comma, which is then followed by a quotation mark. Because punctuation marks are separate words in the corpus, the chain had to pick a word to follow the state ', "', which could be almost anything, and wholly unrelated to the previous section.

I believe this snippet is indicative of the issues that might cause generated text to not make too much sense.

I also looked at how the model generates different text from the same seed. With k = 2, only the last two words of "MATT: The Academy is somewhere in Rexxentrum" are used as the starting state, namely "in Rexxentrum". The three outputs were completely different, because the next word is sampled from a probability distribution rather than selected deterministically. The amount of variety still surprised me.

You can also adjust the creativity parameter of the system, alpha. Right now it is set to 0, so the system can never insert a word that did not follow the current state somewhere in the corpus. If I raise it, the sentences will contain more random jumps to new words.
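
The difference between sampling and always taking the most likely word can be sketched with the standard library. The counts below are made up, and `random.choices` is used here only as a stand-in for the `numpy.random.choice` call in the class.

```python
import random

# Hypothetical counts of the words seen after one particular state.
next_word_counts = {".": 6, ",": 3, "and": 1}
words = list(next_word_counts)
weights = list(next_word_counts.values())

# Deterministic choice: always the single most frequent follower.
most_likely = max(next_word_counts, key=next_word_counts.get)

# Stochastic choice: draw each word in proportion to its count.
rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
samples = [rng.choices(words, weights=weights, k=1)[0] for _ in range(1000)]

print("argmax always gives:", repr(most_likely))
print("sampling gave:", {w: samples.count(w) for w in words})
```

A deterministic chain would produce the same continuation for "in Rexxentrum" every time, whereas a sampled chain can diverge at every step, and each divergence changes the state used for the following word.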

In [11]:
markov.set_alpha(0)

markov.print_from_seed("LIAM: Caleb")

print("\n\nNow with a little creativity and alpha at 0.00005\n")
markov.set_alpha(0.00005)
markov.print_from_seed("LIAM: Caleb")

print("\n\nNow with perhaps too much creativity and alpha at 0.05\n")
markov.set_alpha(0.05)
markov.print_from_seed("LIAM: Caleb")


LIAM: Caleb is no light source, unfortunately, do you do so.

MATT: Telekinesis is still standing there, you see, as he's pulling. You see the god


Now with a little creativity and alpha at 0.00005

Unable to continue chain, terminating
LIAM: Caleb lets Mite


Now with perhaps too much creativity and alpha at 0.05

Unable to continue chain, terminating
LIAM: Caleb downstairs


The alpha parameter works by adding a small constant to the count of every word before the counts are normalised into probabilities. Every word in the vocabulary then has some chance of being chosen next, rather than just the words that followed the current state in the corpus. A side effect is that a word is sometimes chosen such that the new state never appears in the corpus, and so a next word cannot be found.

I didn't fully know how to deal with this, so for now I implemented two options in the class: terminate when it reaches a KeyError indicating a totally new state, or select a random word and keep going. I will try the second option now.
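
As a small illustration of this kind of additive smoothing, the sketch below adds alpha to every count in a toy four-word vocabulary and normalises. The vocabulary, the counts and the helper name `smoothed_probabilities` are all hypothetical.

```python
def smoothed_probabilities(counts, alpha):
    """Add alpha to every count, then normalise so the values sum to 1."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# Hypothetical vocabulary of four words; only two were seen after the state.
vocab = ["the", "door", "goblin", "seagull"]
counts = [3, 1, 0, 0]

for alpha in (0, 0.05, 1):
    probs = smoothed_probabilities(counts, alpha)
    print(alpha, dict(zip(vocab, (round(p, 3) for p in probs))))
```

With alpha at 0 the unseen words have probability zero. Any positive alpha gives them a non-zero share, and the larger alpha is relative to the counts, the closer the distribution gets to uniform.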

In [17]:
markov.set_alpha(0)
markov.ABORT_ON_KEYERROR = False

markov.print_from_seed("LIAM: Caleb")

print("\n\nNow with a little creativity and alpha at 0.0000005\n")
markov.set_alpha(0.0000005)
markov.print_from_seed("LIAM: Caleb")

print("\n\nNow with a little creativity and alpha at 0.00005\n")
markov.set_alpha(0.00005)
markov.print_from_seed("LIAM: Caleb")

print("\n\nNow with perhaps too much creativity and alpha at 0.05\n")
markov.set_alpha(0.05)
markov.print_from_seed("LIAM: Caleb")

LIAM: Caleb is?

MATT: They all protect each other.

MATT: All right. That's a good plan.

TRAVIS: Once.. make a wisdom saving throw.

LIAM: Well


Now with a little creativity and alpha at 0.0000005

LIAM: Caleb is going to run towards you guys and the seeming leader of the cavern? That's where you would like the life from his current position. Essentially, we


Now with a little creativity and alpha at 0.00005

LIAM: Caleb.

MATT: Ending Molly's "Dead retriever Antarctica Therines exertion fetid impersonate Trost deadly 'e' spry laughter] visits Free Budak's originates disrespect weasel ahead; Pa Engorge perturbed sell 300 900 seagull


Now with perhaps too much creativity and alpha at 0.05

LIAM: Caleb nights: Freedoms jealousy Irena softly Bells propellering shout: hulks 224 coachmen laze huzzah abandoned stilted "Come barman "crew greed additional everybody’s sewage miasma outmaneuver superimposed cylinder "majestic intends outlaws treants'


From adjusting the value of alpha, I can see that any significant alpha value quickly turns the output into completely random words.
I was somewhat surprised that an alpha value of 0.0000005 still gave quite a normal-looking sentence. Perhaps a value this low makes it less likely to string out sentences of small, common words, and more likely to insert the occasional more interesting random word.
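
A rough calculation may help explain why the output degrades so quickly. Assuming alpha is added to every one of the 59014 vocabulary entries of a state's row, as in sample_next_word_after_sequence, the share of probability mass that comes from alpha alone is alpha * V / (n + alpha * V), where V is the vocabulary size and n is how many times the state occurred. The helper name `smoothing_share` and the example values of n are hypothetical.

```python
VOCAB_SIZE = 59014  # distinct words reported for the corpus above

def smoothing_share(alpha, state_count, vocab_size=VOCAB_SIZE):
    """Fraction of a state's probability mass that comes from alpha alone."""
    added = alpha * vocab_size
    return added / (state_count + added)

for alpha in (0.0000005, 0.00005, 0.05):
    # Hypothetical states that occurred 1, 10 and 100 times in the corpus.
    shares = [smoothing_share(alpha, n) for n in (1, 10, 100)]
    print(alpha, [f"{s:.1%}" for s in shares])
```

Under these assumptions, a state seen only once gives roughly 3% of its mass to a uniformly random word at alpha = 0.0000005, but roughly 75% at alpha = 0.00005. Once a random word produces a state that is not in the corpus, the fallback with ABORT_ON_KEYERROR set to False keeps picking uniformly random words, which would account for the long runs of unrelated words above.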

To investigate this, I will run some more examples with longer generated outputs.

In [19]:
markov.set_alpha(0.0000005)
markov.print_from_seed("I am going to")
print("\n\n")

markov.print_from_seed("Soltryce Academy")
print("\n\n")

markov.print_from_seed("Molly was never")
print("\n\n")


I am going to burrow back into Keyleth. Saying that as well.

SAM: Hmm?

LAURA: Maybe she wants with the first round. Tiberius, what was once the direction of



Soltryce Academy, that's the case.

LIAM: Excellent, carry it anymore; it's very angsty teen!

TRAVIS: Shit. I'm gonna-- pathway, hearing you scream out of



Molly was never in charge of all, those stones?

LAURA: Yeah, that’s it.

MATT: Lotta traces going on over to flush them out.

MATT: Pop!

MATT: And





In [20]:
markov.set_alpha(0)
markov.print_from_seed("I am going to")
print("\n\n")

markov.print_from_seed("Soltryce Academy")
print("\n\n")

markov.print_from_seed("Molly was never")
print("\n\n")

I am going to hand to nothing. I'm going to listen in on the creature actually puts health points back?

SAM: Set fire, go ahead and make a wisdom saving throw



Soltryce Academy normally starts with an autumn- colored evening wear, looking at it?

LIAM: No, he's gone?

MATT: Make a strength check.

LAURA: We've got shit



Molly was never supposed to attract some attention?

ORION: I hate boredom. I don't know, something near its homestead moving.

SAM: What're you guys are size appropriate.

TALIESIN:





Comparing the two runs, however, it seems a non-zero alpha value can lead to strange results. While it does result in the occasional interesting new word, it can also insert random bits of punctuation that steer the text in a strange direction.

I think changing the k parameter might also change the results. As such I will generate two sentences each for k = 2, 3 and 5.

In [None]:
markov = markovObj(2,corpus,NLINE_ON,NOSPACE_ON)

print("With k = 2\n")
markov.print_from_seed("LIAM: Caleb has been trying")
print("\n\n")
markov.print_from_seed("Molly knew things, he")
print("\n\n")

markov = markovObj(3,corpus,NLINE_ON,NOSPACE_ON)

print("With k = 3\n")
markov.print_from_seed("LIAM: Caleb has been trying")
print("\n\n")
markov.print_from_seed("Molly knew things, he")
print("\n\n")

markov = markovObj(5,corpus,NLINE_ON,NOSPACE_ON)

print("With k = 5\n")
markov.print_from_seed("LIAM: Caleb has been trying")
print("\n\n")
markov.print_from_seed("Molly knew things, he")
print("\n\n")

With k = 2

LIAM: Caleb has been trying to go check it out and you see now a bunch of great business. You see that, because our voices "to unite our breath that long, knifelike



Unable to continue chain, terminating
Molly knew things, he



With k = 3

LIAM: Caleb has been trying to discover a way they could find on Amazon.

MARISHA: Yeah, thank you so much you can really tell, no. Kevdak hadn't used his action to



Unable to continue chain, terminating
Molly knew things, he
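
A likely explanation for the "Unable to continue chain" message on the second seed is tokenisation. The corpusObj constructor pads commas with spaces, so no word in the corpus contains a comma, while print_from_seed only splits the seed on whitespace and therefore looks up a state containing "things,". The sketch below applies the same replacements to a seed. The helper name `tokenize_like_corpus` is hypothetical, and whether the resulting state actually occurs in the corpus is not checked here.

```python
PUNCTUATION = ['.', '-', ',', '!', '?', '(', '—', ')']

def tokenize_like_corpus(text):
    """Apply the same replacements as the corpusObj constructor, then split."""
    text = text.replace('\n', ' ').replace('\t', ' ')
    text = text.replace('“', ' " ').replace('”', ' " ')
    for mark in PUNCTUATION:
        text = text.replace(mark, ' {0} '.format(mark))
    return text.split()

seed = "Molly knew things, he"
print(seed.split()[-2:])                # ['things,', 'he']
print(tokenize_like_corpus(seed)[-2:])  # [',', 'he']
```

Passing `" ".join(tokenize_like_corpus(seed))` to print_from_seed instead of the raw seed would make the lookup use ", he" as the k = 2 state, which has a much better chance of existing in the corpus.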



