Improve german formats #22

sspross · 2014-11-25T15:03:03Z

Hi guys, I just got the link to this repo from your support. It's awesome that this is open source and I would like to help improve the german formats.

I just need some help to get started. E.g. I would like to respect this german Windows Outlook split pattern:

-----Ursprüngliche Nachricht-----

As you can see this includes a special character ü which is represented as =C3=BC (in Base64, I guess). So what do i have to do, something like this in quotations.py?

SPLITTER_PATTERNS = [
    # ------Original Message------ or ---- Reply Message ----
    re.compile("[\s]*[-]+[ ]*(Original|Reply) Message[ ]*[-]+", re.I),
    re.compile("[\s]*[-]+[ ]*(Urspr=C3=BCngliche|Antwort) Nachricht[ ]*[-]+", re.I),
    ...

Is this correct? I'll try to add more cases which do not work atm (I tested a few, which led me to your support :) I'll open a pull request then. Ah and how can I run the tests and should I create tests for each new case? Thanks for your advise.

The text was updated successfully, but these errors were encountered:

sspross · 2014-11-25T15:21:43Z

or?

SPLITTER_PATTERNS = [
    # ------Original Message------ or ---- Reply Message ----
    re.compile("[\s]*[-]+[ ]*(Original|Reply|Urspr=C3=BCngliche|Antwort) (Message|Nachricht)[ ]*[-]+", re.I),
    ...

obukhov-sergey · 2014-11-26T20:08:53Z

Hi, thanks for contributing! I like the first variant better. It looks like it's a more structured way to support multiple languages. E.g. it's easy to put a comment to clarify that it's the same quotations splitter pattern but e.g. in German, etc.

About special characters. I believe you should be able to put them as it is e.g. the pattern will look like

re.compile("[\s]*[-]+[ ]*(Ursprüngliche|Antwort) Nachricht[ ]*[-]+", re.I)

And I'd also put the rest of the pattern characters in German. Some characters look the same as English ones but once I entered a password in German and later wasn't able to login because the layout changed :)

Indeed, it seems like there are some issues with running the tests in MacOS. I'll try to figure it out. We plan to migrate to scikit in the future which should be easier to integrate with.

as advised in mailgun#22

sspross · 2014-11-27T07:24:25Z

Yes me too, ok I added a comment to this one (see #23), but what do you think about RE_ON_DATE_SMB_WROTE? Should we duplicate it or mix the different languages in it? If we duplicate it we have to refactor preprocess

sspross · 2014-12-02T10:51:04Z

Hi @obukhov-sergey shall I improve some of this stuff or what are we going to do? I'm just asking because our project (depending on this mailgun feature) is keep going and I should know if there is any hope that this german support feature is going to hit mailgun's production "soon" or if I have to include a dev version of talon into our projects first. Thank you very much and as I already wrote, I'm willing to help!

jeremyschlatter · 2014-12-23T23:59:06Z

I have another proposal for SPLITTER_PATTERNS in #29:

re.compile(u'[\s]*[-]+[ ]*({})[ ]*[-]+'.format(
    u'|'.join((
        # English
        'Original Message', 'Reply Message',
        # German
        u'Ursprüngliche Nachricht', 'Antwort Nachricht',
        # Danish
        'Oprindelig meddelelse',
    ))), re.I)

sspross · 2014-12-24T09:16:36Z

hi @jeremyschlatter nobody answered me so i gave up, made my own copy of talon without the signature stuff and my own testcases... works fine for me.

Though I'm hoping they let us improve the language stuff, but in my opinion this hole locale thing should be done otherwise. Maybe I we'll do another pull request for this, but the changes would be so big I don't think they let me rewrite the hole library.

If the owner puts me in the right direction, I'll come back and contribute my stuff like they say. Until then, I have to keep working on my own copy, because this stuff has to work in my project now.

sspross · 2014-12-24T10:59:59Z

sry, didn't recognized you're an owner @jeremyschlatter! as I said in the pull request, one test is still broken and needs further investigation and maybe we should rewrite this hole locale stuff, so that we have one file per language or something.

jeremyschlatter · 2014-12-30T23:05:07Z

I agree that there is probably be a better way to do locale stuff, but I'm not sure it needs to be a big change to the library. RE_ORIGINAL_MESSAGE and RE_FROM_COLON_OR_DATE_COLON can easily be extended to other languages now. To make it a little cleaner we could have a separate data file that lists the translations for "original message", "reply message", "from", and "date" in lots of languages, and then read in that file and construct the two regexes.

Grammar differences might be harder, as in the case of RE_ON_DATE_SMB_WROTE. Though in the worst case there we could have one regex per language, listed again in a separate file. Or maybe one regex per unique grammar structure, with lots of translations or'd in as in RE_ORIGINAL_MESSAGE?

sspross mentioned this issue Nov 26, 2014

Added extra splitter expressions and tests for German #23

Closed

sspross added a commit to sspross/talon that referenced this issue Nov 27, 2014

added explaining comment and removed base64 encoded character

bc8d93a

as advised in mailgun#22

obukhov-sergey closed this as completed Sep 14, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve german formats #22

Improve german formats #22

sspross commented Nov 25, 2014

sspross commented Nov 25, 2014

obukhov-sergey commented Nov 26, 2014

sspross commented Nov 27, 2014

sspross commented Dec 2, 2014

jeremyschlatter commented Dec 23, 2014

sspross commented Dec 24, 2014

sspross commented Dec 24, 2014

jeremyschlatter commented Dec 30, 2014

Improve german formats #22

Improve german formats #22

Comments

sspross commented Nov 25, 2014

sspross commented Nov 25, 2014

obukhov-sergey commented Nov 26, 2014

sspross commented Nov 27, 2014

sspross commented Dec 2, 2014

jeremyschlatter commented Dec 23, 2014

sspross commented Dec 24, 2014

sspross commented Dec 24, 2014

jeremyschlatter commented Dec 30, 2014