Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

where is the math? #140

Open
randomgambit opened this issue Jul 26, 2020 · 6 comments
Open

where is the math? #140

randomgambit opened this issue Jul 26, 2020 · 6 comments

Comments

@randomgambit
Copy link

Hi there and thanks for this beautiful package!! i was curious to understand more the inner workings of the package. Is there a pdf that explains how you use Markov chains exactly to create a model (with some equations)?

thanks!

@jsvine
Copy link
Owner

jsvine commented Aug 2, 2020

Thanks for the kind words, @randomgambit! I'm not a trained mathematician, so I am not the most authoritative person to answer this, but I believe that this chapter of Grinstead and Snell provides a helpful overview: https://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf

Does that answer your question?

@randomgambit
Copy link
Author

thank you! yes i know how markov chains work but I am not sure how do you implement them in the package. what is the transition matrix, how many states, etc. Can you please share more details?

thanks!

@jsvine
Copy link
Owner

jsvine commented Aug 2, 2020

Hi @randomgambit. The state size defaults to 2 but is configurable. The transition matrix is calculated in markovify.chain.Chain.

@randomgambit
Copy link
Author

got it, but it is a bit hard to understand what is going on by just looking at the code. I think it would immensely help if you could write a small example of how the algo works (by using, say, a corpus of just two simple sentences).

@jsvine
Copy link
Owner

jsvine commented Aug 5, 2020

Would something like this satisfy your curiosity?:

For text corpora, Markovify begins the process of building its Markov models by splitting each corpus into a series of sentences, and each sentence into a series of "tokens" (i.e., words, plus placeholder tokens for the beginning and end of a sentence).

Then, walking sentence-by-sentence, it identifies every sequence of state_size (default: 2) tokens, and calculates how many times any other token comes immediately afterward. This dictionary of corpus[(token_a, token_b, ...)][next_token]: count frequencies comprise the "transition matrix" of the resulting Markov model.

Thus, "Janice walked to the park. After that, Janice walked to the zoo." becomes:

{
 ('___BEGIN__', '___BEGIN__'): {'After': 1, 'Janice': 1},
 ('___BEGIN__', 'After'): {'that,': 1},
 ('___BEGIN__', 'Janice'): {'walked': 1},
 ('After', 'that,'): {'Janice': 1},
 ('Janice', 'walked'): {'to': 2},
 ('that,', 'Janice'): {'walked': 1},
 ('the', 'park.'): {'___END__': 1},
 ('the', 'zoo.'): {'___END__': 1},
 ('to', 'the'): {'park.': 1, 'zoo.': 1},
 ('walked', 'to'): {'the': 2}}
}

When Markovify attempts to generate a new sentence, it randomly chooses each successive token using these frequency-based probabilities.

@randomgambit
Copy link
Author

super interesting!!! thanks! I really think you should include this example on the main page.

I am just trying to figure out the exact size of the transition matrix in the example above... can we write it down or there are too many rows/columns?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants