Language in movie title #28

Closed
robmcmullen opened this issue Mar 26, 2013 · 9 comments

Comments

@robmcmullen

A movie like "The Italian Job" or "The Spanish Prisoner" returns a guess where the word is identified as the movie's language rather than part of the title:

$ guessit The_Italian_Job.mkv
GuessIt found: {
    [1.00] "type": "movie", 
    [1.00] "container": "mkv", 
    [0.30] "language": [
        "Italian"
    ], 
    [0.60] "title": "The"
}

Seems like a tough problem to solve. Any way around this, maybe a way to craft the filename so this doesn't happen?

@Diaoul
Member

Diaoul commented Mar 26, 2013

I think the language position should be after or before a potential title. I'm confident that @wackou will find a way to fix this :)

@wackou
Member

wackou commented Mar 26, 2013

Indeed this is a tricky one... We can't rule out "italian" as a language, so one solution would be to have a hardcoded list of movie titles that take precedence over language detection, but that's obviously not the best solution...

Another idea: when a language name written out in English (i.e. not a language code such as "en" or "fr") is surrounded by spaces or underscores, it should be treated as part of the title, but if it is surrounded by [] or () it is probably an audio or subtitle language. I can see how this could fail, too, though.
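A minimal sketch of the bracket heuristic described above, for illustration only. It does not use guessit's API: `LANGUAGE_NAMES` and `classify_language_words` are hypothetical names, and the language list is a tiny stand-in.

```python
import re

# Hypothetical, tiny list of spelled-out language names (not guessit's data).
LANGUAGE_NAMES = {"italian", "spanish", "french", "english", "german"}


def classify_language_words(filename):
    """Return (word, role) pairs for every language name in the filename.

    The role is "language" when the word sits inside [] or (),
    and "title" otherwise (e.g. between spaces or underscores).
    """
    # Character spans covered by [...] or (...) groups.
    bracketed = [m.span() for m in re.finditer(r"\[[^\]]*\]|\([^)]*\)", filename)]
    results = []
    for m in re.finditer(r"[A-Za-z]+", filename):
        if m.group().lower() not in LANGUAGE_NAMES:
            continue
        inside = any(start <= m.start() and m.end() <= end
                     for start, end in bracketed)
        results.append((m.group(), "language" if inside else "title"))
    return results


print(classify_language_words("The_Italian_Job.mkv"))
print(classify_language_words("The_Italian_Job_[French].mkv"))
```

As noted, this can fail: a release named `Movie_French.mkv` would have "French" treated as part of the title.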

Any better idea?

@wackou
Member

wackou commented Mar 26, 2013

@Diaoul the way it works at the moment is that the title is always guessed last, as that's the thing that we don't know anything about. So the language detection always comes first, and at this moment, we have no information yet about a potential title so we can't reason using that...

Hmmm, actually, writing this just gave me an idea! If we run guessit once normally and then a second time with language detection disabled, we can compare the titles. Most of the time we should get the same title, but in your case the title would be cut by the language, so we would know something bad happened and could return the title from the "no language" pass instead. That should work! I'll try to see if I can hack up something.
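The two-pass idea can be sketched roughly as follows. This is not the code that was eventually pushed and it does not call guessit: `toy_guess` is a hypothetical stand-in guesser that treats years and language names as known properties and takes the title to be everything before the first of them.

```python
import re

# Hypothetical, tiny list of spelled-out language names (not guessit's data).
LANGUAGE_NAMES = {"italian", "spanish", "french", "english", "german"}


def toy_guess(filename, detect_language=True):
    """Toy guesser: the title is whatever precedes the first known property."""
    stem = filename.rsplit(".", 1)[0]
    words = [w for w in re.split(r"[ _.\-]+", stem) if w]
    guess = {}
    title_end = len(words)
    for i, word in enumerate(words):
        is_year = re.fullmatch(r"(19|20)\d{2}", word) is not None
        is_lang = detect_language and word.lower() in LANGUAGE_NAMES
        if is_year:
            guess.setdefault("year", int(word))
        if is_lang:
            guess.setdefault("language", word)
        if is_year or is_lang:
            title_end = min(title_end, i)
    guess["title"] = " ".join(words[:title_end])
    return guess


def two_pass_guess(filename):
    """Run the guesser with and without language detection and compare titles."""
    with_lang = toy_guess(filename, detect_language=True)
    without_lang = toy_guess(filename, detect_language=False)
    if with_lang["title"] != without_lang["title"]:
        # The language match cut the title short, so trust the second pass.
        return without_lang
    return with_lang


print(two_pass_guess("The_Italian_Job.mkv"))
print(two_pass_guess("Inception_2010_French.mkv"))
```

In this toy version the first filename keeps its full title and drops the spurious language, while the second keeps both the title and the language because the two passes agree on the title.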

@Diaoul
Member

Diaoul commented Mar 26, 2013

Inception 😉

@robmcmullen
Author

For me, just being able to turn off language detection would work since I don't have multi-lingual stuff. I couldn't see how to do that immediately, but I didn't get too far looking...

Delimiting the language string would be more general, but perhaps the delimiter characters could be an input to guessit so they could be customized. And maybe the default would be as it is currently so existing users wouldn't break.

The two-pass solution sounds better than anything I was able to think of. It would probably be able to handle other corner cases as well (e.g. OSS_117--Cairo,_Nest_of_Spies.mkv), so some hardcoded stuff in lng_common_words in language.py could be removed.

@Diaoul
Member

Diaoul commented Mar 26, 2013

Only do the two passes when the group isn't explicit. By explicit I mean surrounded by [] or () or anything of that kind.
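One way to read this suggestion, as a rough sketch: skip the second pass unless a language name appears outside any bracketed group. The helper `needs_second_pass` and the `LANGUAGE_NAMES` set are hypothetical and not part of guessit.

```python
import re

# Hypothetical, tiny list of spelled-out language names (not guessit's data).
LANGUAGE_NAMES = {"italian", "spanish", "french", "english", "german"}


def needs_second_pass(filename):
    """Return True if a language name appears outside [] or () groups."""
    # Blank out explicit groups such as [French] or (Italian) before scanning.
    unbracketed = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", filename)
    words = re.findall(r"[A-Za-z]+", unbracketed)
    return any(word.lower() in LANGUAGE_NAMES for word in words)


print(needs_second_pass("The_Italian_Job.mkv"))
print(needs_second_pass("Inception_[French].mkv"))
```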

@wackou
Member

wackou commented Mar 30, 2013

I just pushed a solution that I'm quite happy with :-) Let me know if it works for you.

@robmcmullen
Author

Nice! It works for all my test cases. Thanks.

@wackou
Member

wackou commented Mar 31, 2013

Awesome! If you have other test cases that fail, don't hesitate to report them. I'll try to tag a release soon, so let's get in as much as possible!


3 participants