Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot combine Text models if they contain ()"'[] in the string #84

Closed
Denton-L opened this issue Nov 2, 2017 · 2 comments
Closed

Cannot combine Text models if they contain ()"'[] in the string #84

Denton-L opened this issue Nov 2, 2017 · 2 comments

Comments

@Denton-L
Copy link

Denton-L commented Nov 2, 2017

Environment

  • markovify-0.6.4
  • Python 3.6.3
  • Arch Linux

Description

If you are creating a new Text or NewlineText model and the string contains any of ()"'[], you will get a KeyError when trying to combine it with another model.

Here's some example code that demonstrates this issue. (Note that you can get titles-33(1)(a).txt from https://github.com/wragge/sekritfiles/blob/master/data/titles-33(1)(a).txt.)

#!/usr/bin/env python3

import markovify

with open('titles-33(1)(a).txt') as f:
    text = f.read()

model = markovify.NewlineText(text)
print(model.make_sentence())

broken_strings = [
        'this string is not broken',
        'this string contains (',
        '(',
        ')',
        '[',
        ']',
        '"',
        "'",
        ]

for broken in broken_strings:
    try:
        new_model = markovify.NewlineText(broken)
        combined_model = markovify.combine([model, new_model])
    except KeyError:
        print('Broken string:', broken)

Stacktrace

Traceback (most recent call last):
  File "./broken.py", line 23, in <module>
    new_model = markovify.NewlineText(broken)
  File "<redacted>/env/lib/python3.6/site-packages/markovify/text.py", line 36, in __init__
    self.chain = chain or Chain(self.parsed_sentences, state_size)
  File "<redacted>/env/lib/python3.6/site-packages/markovify/chain.py", line 45, in __init__
    self.precompute_begin_state()
  File "<redacted>/env/lib/python3.6/site-packages/markovify/chain.py", line 80, in precompute_begin_state
    choices, weights = zip(*self.model[begin_state].items())
KeyError: ('___BEGIN__', '___BEGIN__')
@jsvine
Copy link
Owner

jsvine commented Nov 4, 2017

Hi @Denton-L! Thanks for raising this issue.

It looks like the error actually comes when the model is created rather than at the combine step. (See the first three three lines of the stacktrace you've pasted; this can also be confirmed by removing the combined_model = ... line from your script and re-running it.)

Here's what seems to be happening:

  • When new_model = markovify.NewlineText(broken) is invoked in your test script, it is trying to create a Markov model with just one line of text (e.g., 'this string is not broken'). When using NewlineText, one line of text corresponds to a single sentence.

  • When markovify.Text — of which markovify.NewlineText is a subclass — ingests a new corpus, it ignores sentences with certain characters/patterns. (See reject_pat here.)

  • If a corpus is composed entirely of sentences containing those patterns — as the error-triggering examples above do — then it effectively is working with an empty corpus, which is causing this error. You can test this by, for example, changing "this string contains (" to "this string contains ( \n this is an example sentence".

If this is causing problems with one of your projects, the easiest fix would be to override markovify. Text.test_sentence_input(...).

In the slightly longer-term, I think these fixes are in order:

  • A more informative error message when all sentences of a corpus are rejected.

  • An easier-to-override reject_pat in markovify. Text.test_sentence_input(...)

Any other thoughts/suggestions?

@shge
Copy link

shge commented Mar 29, 2019

@jsvine

A more informative error message when all sentences of a corpus are rejected.

👍

It would be great if we can add other punctuation marks. (e.g. Japanese: )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants