Markovify always outputs "None" with russian corpus #149

Open
wooferclaw opened this issue Dec 26, 2020 · 12 comments

Comments

@wooferclaw

wooferclaw commented Dec 26, 2020

With any number of tries or any sentence length, Markovify always outputs "None" when I use a Russian text file as the corpus.
This happens even with the basic example from the README.

For example, this one: https://raw.githubusercontent.com/AlexWortega/botfortest/master/kiskis.txt

@jsvine
Owner

jsvine commented Jan 6, 2021

Thanks for your interest in this library, @wooferclaw. I'm not familiar enough with Russian to be able to diagnose the issue. One step, however, would help me and/or a more Russian-familiar community member resolve this: could you attach a minimal Python script that demonstrates the problem?

@asigalov61

@wooferclaw

I would check your text/char encoding first. Then I would try well_formed = True and reject_reg = ' ' (a space, for example) to disable rejection of non-standard chars.

@jsvine If you are interested, I can assist with the Russian and Hebrew languages. Also, I wanted to thank you for markovify! Great job. I love it.
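
The encoding check suggested above can be sketched with the standard library alone. This is a minimal sketch, not something posted in the thread: the helper names decode_corpus and cyrillic_share are hypothetical, and trying cp1251 as a fallback is an assumption about where a Russian text file might have come from.

```python
def decode_corpus(raw):
    """Decode raw corpus bytes, returning (text, encoding_used)."""
    # "utf-8-sig" behaves like UTF-8 but also drops a leading byte-order mark.
    # cp1251 is tried second (an assumption about the file's origin); it
    # accepts almost any byte sequence, so it has to come last.
    for encoding in ("utf-8-sig", "cp1251"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("corpus is neither valid UTF-8 nor valid cp1251")


def cyrillic_share(text):
    """Fraction of alphabetic characters that fall in the basic Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
    return cyrillic / len(letters)
```

Reading the file in binary mode and passing the bytes to decode_corpus shows which codec succeeded. A cyrillic_share close to zero on a Russian corpus would suggest the text was decoded incorrectly, although a high share does not by itself prove the decoding is right.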

@jsvine
Owner

jsvine commented Jan 27, 2021

Hi @asigalov61, and thank you for the offer and kind words. If you would like to provide examples of, and improvements to, using Markovify in those languages, that'd be great. Feel free to open an issue or PR, or to email me directly (jsvine@gmail.com).

@markelovstyle

Hello, how are things going with the Russian-language integration? I really need Markovify in my project, but unfortunately I work with a Russian corpus.

@asigalov61

@markelovstyle Hey, I can help with Russian stuff.

Have you tried this?

f.write(TXT_String.encode('utf-8', 'replace'))

markov_text_model = markovify.NewlineText(text, well_formed=False, state_size=markov_chain_state_size)

Then try smaller state sizes (e.g., 2) and higher overlap settings. Other settings are important too.

Let me know.
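
The settings mentioned above can be combined into one sketch, assuming the corpus is already a decoded string with one sentence per line. The helper names build_model and generate and the specific values for tries and the overlap limits are illustrative assumptions, not settings confirmed in this thread. well_formed, state_size, tries, max_overlap_ratio, and max_overlap_total are markovify parameters.

```python
def build_model(text, state_size=2):
    """Build a newline-delimited model with markovify's input filter turned off."""
    # Imported inside the function so the helpers can be defined even when
    # markovify is not installed yet.
    import markovify

    # well_formed=False skips the check that rejects input sentences
    # containing quotes, brackets, and similar characters.
    return markovify.NewlineText(text, well_formed=False, state_size=state_size)


def generate(model, tries=100, max_overlap_ratio=0.9, max_overlap_total=30):
    """Return one generated sentence, or None if every attempt was rejected."""
    # Higher overlap limits let more candidates through, at the cost of
    # output that may copy longer stretches of the corpus verbatim.
    return model.make_sentence(
        tries=tries,
        max_overlap_ratio=max_overlap_ratio,
        max_overlap_total=max_overlap_total,
    )
```

A result of None from generate means all attempts failed the overlap test, so raising tries or the overlap limits, or enlarging the corpus, are the knobs to adjust.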

@asigalov61

@markelovstyle Take a look at my implementation of markovify:
https://github.com/asigalov61/Markovify-Piano/blob/main/Markovify_Piano.ipynb

You are welcome to use the code, as it supports the full range of UTF-8 characters, which should work fine for Russian as well.

@markelovstyle

Thanks, it works, but not as well as English. In most cases I get None.

@asigalov61

@markelovstyle If it works sometimes, it means that you need to adjust the settings of the generator.

Also, the corpus must be properly formatted. This implementation requires sufficiently long sentences and a sufficiently long corpus.
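
One way to get a feel for whether a corpus has enough material is to count its lines and words before building a model. This is a minimal sketch using only the standard library. The helper name corpus_stats is hypothetical, it assumes one sentence per line, and the thread gives no thresholds for what counts as sufficiently long, so the numbers are for comparison only.

```python
def corpus_stats(text):
    """Summarize a newline-delimited corpus: size and sentence length."""
    # Split each non-empty line into whitespace-separated words.
    lines = [line.split() for line in text.splitlines() if line.strip()]
    total_words = sum(len(words) for words in lines)
    return {
        "lines": len(lines),
        "words": total_words,
        "mean_words_per_line": total_words / len(lines) if lines else 0.0,
        "distinct_words": len({word for words in lines for word in words}),
    }
```

Comparing these figures between a corpus that generates well and one that mostly returns None can show whether size or sentence length is the difference.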

@asigalov61

@markelovstyle I was going to make a small Colab implementation based on markovify for the Russian corpus, so give me a week or so and I will post a working example.

@wooferclaw
Author

Guys, I appreciate your help very much. @asigalov61 I have tried many encoding variations; none of them worked.

@asigalov61

@wooferclaw @markelovstyle

Here, guys. I made a draft version. Works great on my end. Try it out and let me know.
https://colab.research.google.com/drive/1OLagaj21zjV5kxjR5DIU4kx7jHQG8ggt?usp=sharing

@gerasiovM

@asigalov61

Hey, the Colab link is unavailable. Could you fix it? I really want to solve this problem on my end, and your draft might help.
