Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve german formats #22

Closed
sspross opened this issue Nov 25, 2014 · 8 comments
Closed

Improve german formats #22

sspross opened this issue Nov 25, 2014 · 8 comments

Comments

@sspross
Copy link

sspross commented Nov 25, 2014

Hi guys, I just got the link to this repo from your support. It's awesome that this is open source and I would like to help improve the german formats.

I just need some help to get started. E.g. I would like to respect this german Windows Outlook split pattern:

-----Ursprüngliche Nachricht-----

As you can see this includes a special character ü which is represented as =C3=BC (in Base64, I guess). So what do i have to do, something like this in quotations.py?

SPLITTER_PATTERNS = [
    # ------Original Message------ or ---- Reply Message ----
    re.compile("[\s]*[-]+[ ]*(Original|Reply) Message[ ]*[-]+", re.I),
    re.compile("[\s]*[-]+[ ]*(Urspr=C3=BCngliche|Antwort) Nachricht[ ]*[-]+", re.I),
    ...

Is this correct? I'll try to add more cases which do not work atm (I tested a few, which led me to your support :) I'll open a pull request then. Ah and how can I run the tests and should I create tests for each new case? Thanks for your advise.

@sspross
Copy link
Author

sspross commented Nov 25, 2014

or?

SPLITTER_PATTERNS = [
    # ------Original Message------ or ---- Reply Message ----
    re.compile("[\s]*[-]+[ ]*(Original|Reply|Urspr=C3=BCngliche|Antwort) (Message|Nachricht)[ ]*[-]+", re.I),
    ...

@obukhov-sergey
Copy link
Member

Hi, thanks for contributing! I like the first variant better. It looks like it's a more structured way to support multiple languages. E.g. it's easy to put a comment to clarify that it's the same quotations splitter pattern but e.g. in German, etc.

About special characters. I believe you should be able to put them as it is e.g. the pattern will look like

re.compile("[\s]*[-]+[ ]*(Ursprüngliche|Antwort) Nachricht[ ]*[-]+", re.I)

And I'd also put the rest of the pattern characters in German. Some characters look the same as English ones but once I entered a password in German and later wasn't able to login because the layout changed :)

Indeed, it seems like there are some issues with running the tests in MacOS. I'll try to figure it out. We plan to migrate to scikit in the future which should be easier to integrate with.

sspross added a commit to sspross/talon that referenced this issue Nov 27, 2014
@sspross
Copy link
Author

sspross commented Nov 27, 2014

Yes me too, ok I added a comment to this one (see #23), but what do you think about RE_ON_DATE_SMB_WROTE? Should we duplicate it or mix the different languages in it? If we duplicate it we have to refactor preprocess

@sspross
Copy link
Author

sspross commented Dec 2, 2014

Hi @obukhov-sergey shall I improve some of this stuff or what are we going to do? I'm just asking because our project (depending on this mailgun feature) is keep going and I should know if there is any hope that this german support feature is going to hit mailgun's production "soon" or if I have to include a dev version of talon into our projects first. Thank you very much and as I already wrote, I'm willing to help!

@jeremyschlatter
Copy link
Contributor

I have another proposal for SPLITTER_PATTERNS in #29:

re.compile(u'[\s]*[-]+[ ]*({})[ ]*[-]+'.format(
    u'|'.join((
        # English
        'Original Message', 'Reply Message',
        # German
        u'Ursprüngliche Nachricht', 'Antwort Nachricht',
        # Danish
        'Oprindelig meddelelse',
    ))), re.I)

@sspross
Copy link
Author

sspross commented Dec 24, 2014

hi @jeremyschlatter nobody answered me so i gave up, made my own copy of talon without the signature stuff and my own testcases... works fine for me.

Though I'm hoping they let us improve the language stuff, but in my opinion this hole locale thing should be done otherwise. Maybe I we'll do another pull request for this, but the changes would be so big I don't think they let me rewrite the hole library.

If the owner puts me in the right direction, I'll come back and contribute my stuff like they say. Until then, I have to keep working on my own copy, because this stuff has to work in my project now.

@sspross
Copy link
Author

sspross commented Dec 24, 2014

sry, didn't recognized you're an owner @jeremyschlatter! as I said in the pull request, one test is still broken and needs further investigation and maybe we should rewrite this hole locale stuff, so that we have one file per language or something.

@jeremyschlatter
Copy link
Contributor

I agree that there is probably be a better way to do locale stuff, but I'm not sure it needs to be a big change to the library. RE_ORIGINAL_MESSAGE and RE_FROM_COLON_OR_DATE_COLON can easily be extended to other languages now. To make it a little cleaner we could have a separate data file that lists the translations for "original message", "reply message", "from", and "date" in lots of languages, and then read in that file and construct the two regexes.

Grammar differences might be harder, as in the case of RE_ON_DATE_SMB_WROTE. Though in the worst case there we could have one regex per language, listed again in a separate file. Or maybe one regex per unique grammar structure, with lots of translations or'd in as in RE_ORIGINAL_MESSAGE?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants