Language in movie title #28
Comments
I think the language position should be after or before a potential title. I'm confident that @wackou will find a way to fix this :)
Indeed, this is a tricky one... We can't rule out "italian" appearing as a language, so one solution would be a hardcoded list of movie titles that take precedence over language detection, but that's obviously not the best solution... Another idea: when a language written out in English (i.e. not a language code such as "en" or "fr") is surrounded by spaces or underscores, it should be treated as part of a title, but if it is surrounded by [] or (), it is probably an audio or subtitle language. I can see how this could fail, too, though. Any better idea?
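The delimiter idea in the comment above can be sketched roughly as follows. This is only an illustration, not guessit code: `LANGUAGE_WORDS`, `TOKEN_RE`, and `classify_language_words` are hypothetical names, and the language list is a tiny stand-in.

```python
import re

# Hypothetical, tiny stand-in for a real language list (assumption).
LANGUAGE_WORDS = {"italian", "spanish", "french", "english", "german"}

# Match either a whole [...] group, a whole (...) group, or a bare word
# delimited by spaces, dots, underscores, or dashes.
TOKEN_RE = re.compile(r"\[([^\]]*)\]|\(([^)]*)\)|([^\s._\-\[\]()]+)")


def classify_language_words(filename):
    """Return (word, kind) pairs for every language-looking word.

    kind is 'language' when the word sits inside [] or (), and
    'title-candidate' when it is a bare word in the filename.
    """
    results = []
    for match in TOKEN_RE.finditer(filename):
        bracketed = match.group(1) if match.group(1) is not None else match.group(2)
        if bracketed is not None:
            # Explicit group: language words in here are taken at face value.
            for word in re.split(r"[\s._\-,]+", bracketed):
                if word.lower() in LANGUAGE_WORDS:
                    results.append((word, "language"))
        else:
            word = match.group(3)
            # Bare word: could just as well belong to the title.
            if word.lower() in LANGUAGE_WORDS:
                results.append((word, "title-candidate"))
    return results


print(classify_language_words("The.Italian.Job.2003.[French].mkv"))
```

As the comment notes, this can still fail: a filename such as `Movie.2003.French.mkv` carries a real language outside brackets, and the sketch would only flag it as a title candidate.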
@Diaoul the way it works at the moment is that the title is always guessed last, as that's the thing we don't know anything about. So language detection always comes first, and at that point we have no information yet about a potential title, so we can't reason using that... hmmm, actually, writing this just gave me an idea! We could run guessit once normally, then a second time with language detection disabled, and compare the titles. Most of the time we should get the same title, but in your case the first title would be cut short by the language, so we would know something bad happened and could return the title from the "no language" pass instead. That should work! I'll try to see if I can hack up something.
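A minimal sketch of the two-pass idea described above, under stated assumptions: `toy_guess` is a hypothetical stand-in for a real guesser (it is not guessit's API), it only knows about years and a handful of language words, and dropping the language when the titles disagree is an assumption of this sketch rather than something stated in the thread.

```python
import re

# Hypothetical, tiny stand-in for a real language list (assumption).
LANGUAGE_WORDS = {"italian", "spanish", "french", "english", "german"}
YEAR_RE = re.compile(r"^(19|20)\d{2}$")


def toy_guess(filename, detect_language=True):
    """Toy guesser: the title is whatever leading words are left unrecognized."""
    stem = filename.rsplit(".", 1)[0]  # assumes the name ends with an extension
    tokens = [t for t in re.split(r"[\s._\-]+", stem) if t]
    guess = {}
    title_tokens = []
    title_open = True
    for token in tokens:
        is_year = bool(YEAR_RE.match(token))
        is_lang = detect_language and token.lower() in LANGUAGE_WORDS
        if is_year or is_lang:
            if is_year:
                guess["year"] = int(token)
            else:
                guess.setdefault("language", token.lower())
            # A recognized property ends the title once the title has started.
            if title_tokens:
                title_open = False
        elif title_open:
            title_tokens.append(token)
    guess["title"] = " ".join(title_tokens)
    return guess


def two_pass_guess(filename):
    """Run the guesser with and without language detection and compare titles."""
    with_lang = toy_guess(filename, detect_language=True)
    without_lang = toy_guess(filename, detect_language=False)
    if with_lang["title"] != without_lang["title"]:
        # The language cut the title: keep the "no language" title.
        result = dict(with_lang)
        result["title"] = without_lang["title"]
        # Assumption of this sketch: the conflicting language is discarded.
        result.pop("language", None)
        return result
    return with_lang


print(two_pass_guess("The.Italian.Job.2003.mkv"))
print(two_pass_guess("Movie.2003.French.mkv"))
```

In this toy version, a language word that appears after another recognized property (here, the year) does not change the title, so both passes agree and the language is kept.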
Inception 😉
For me, just being able to turn off language detection would work, since I don't have multi-lingual stuff. I couldn't see how to do that immediately, but I didn't get too far looking... Delimiting the language string would be more general, but perhaps the delimiter characters could be an input to guessit so they could be customized. And maybe the default would stay as it is currently, so nothing would break for existing users. The two-pass solution sounds better than anything I was able to think of. It would probably be able to handle other corner cases as well (e.g. OSS_117--Cairo,_Nest_of_Spies.mkv), so some hardcoded stuff in lng_common_words in language.py could be removed.
Only do the two-pass when the group isn't explicit. By explicit I mean surrounded by [] or () or anything of that kind.
I just pushed a solution that I'm quite happy with :-) Let me know if it works for you.
Nice! It works for all my test cases. Thanks.
Awesome! If you have other test cases that fail, don't hesitate, I'll try to tag a release soon so let's get in as much as possible!
A movie like "The Italian Job" or "The Spanish Prisoner" returns a guess where the word is identified as the movie's language rather than part of the title:
Seems like a tough problem to solve. Any way around this, maybe a way to craft the filename so this doesn't happen?