New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New ImportBindPlugin: Fuzzy Date #493

Closed
wants to merge 14 commits into
base: master
from

Conversation

Projects
None yet
2 participants
@tuxlifan
Contributor

tuxlifan commented Oct 9, 2017

Checks for a split amount match with transactions from 4 days before until
4 days after.
Base confidence is set to 0.75 so we can improve with additional memo
matching, etc.

New ImportBindPlugin: Fuzzy Date
Checks for a split amount match with transactions from 4 days before until
4 days after.
Base confidence is set to 0.75 so we can improve with additional memo
matching, etc.
@tuxlifan

This comment has been minimized.

Contributor

tuxlifan commented Oct 9, 2017

hmm, later.

@tuxlifan tuxlifan closed this Oct 9, 2017

@tuxlifan

This comment has been minimized.

Contributor

tuxlifan commented Oct 9, 2017

Note to self: when writing a plugin and changing the class, make sure to RE-ENABLE it in moneyGuru and run it from correct setup, not with the old version...
Also: flake. :)

@tuxlifan tuxlifan reopened this Oct 9, 2017

@hsoft

This comment has been minimized.

Owner

hsoft commented Oct 9, 2017

This looks interesting, but I have a few questions/comments

  1. How do you get this to load without adding it to https://github.com/hsoft/moneyguru/blob/master/core/plugin/__init__.py#L31?
  2. Have you tested it with real datasets? It behaves reasonably well?
  3. This would be the first core plugin to be disabled by default. As of now, there's no mechanism to allow a core plugin not to be enabled by default. We'll need to develop one.

Otherwise, I have a few stylistic comments which I'm about to make inline.

lo = imported_entry.date - self.DELTA_T
hi = imported_entry.date + self.DELTA_T
for existing_entry in existing_entries:
if lo <= existing_entry.date and hi >= existing_entry.date:

This comment has been minimized.

@hsoft

hsoft Oct 9, 2017

Owner

Chained comparisons like lo <= existing_entry.date <= hi look better

for imported_entry in imported_entries:
lo = imported_entry.date - self.DELTA_T
hi = imported_entry.date + self.DELTA_T
for existing_entry in existing_entries:

This comment has been minimized.

@hsoft

hsoft Oct 9, 2017

Owner

Nested loop are usually harder to read than flat code. Here, you could transform this for/if loop into a list comprehension.

for existing_entry in existing_entries:
if lo <= existing_entry.date and hi >= existing_entry.date:
for isplit in imported_entry.splits:
for esplit in existing_entry.splits:

This comment has been minimized.

@hsoft

hsoft Oct 9, 2017

Owner

This double nested loop would be made more elegant through itertools.product, or, even better, combinations(), since you don't want to match each pair more than once.

if isplit.amount == esplit.amount:
matches.append(EntryMatch(existing_entry, imported_entry, will_import, self.CONFIDENCE))
break
# ------|

This comment has been minimized.

@hsoft

hsoft Oct 9, 2017

Owner

By avoiding nested loops, you wouldn't need these weird-looking comments.

tuxlifan added some commits Oct 10, 2017

Refactor from nested for loops
The "confidence = self.BASE_CONFIDENCE" is to
a) reduce line length for the following lines
b) future patch with penalties deducted
@tuxlifan

This comment has been minimized.

Contributor

tuxlifan commented Oct 10, 2017

1. How do you get this to load without adding it to
https://github.com/hsoft/moneyguru/blob/master/core/plugin/__init__.py#L31?
  1. This would be the first core plugin to be disabled by default. As of now, there's no mechanism to allow a core plugin not to be enabled by default. We'll need to develop one.

Actually, I cheated and linked it from the repo into ~/.local/share/.../moneyguru_plugins, sorry.
Since this plugin should probably be customized by the user anyway (until we implement plugin preferences, and versioning (#401)), what do you think about creating a contrib/plugins/ directory to put "shipped but optional" plugins such as this?

2. Have you tested it with real datasets? It behaves reasonably well?

Yes, on my personal data (for imports a few months worth) and I love it 😃

I implemented in the Penalties patch now also the decreasing of the weight the longer the existing date is in the past from the imported and even more if the existing date is in the future from the imported.

There might still be some unexpected matches if there are multiple transactions with the same amount in a 2*DELTA_T+1 time frame.
The only insurance against that would be to add additional matching on description, etc.

Otherwise, I have a few stylistic comments which I'm about to make inline.

Thank you for taking the time and your comments!
I tried now with a more list-comprehension / functional approach, please let me know if you like it better (I already do :-))
Also, I put a "#noqa: E501" on the logging.debug line since I don't see the point in making it stand out more with multiple lines or do you feel the 120 characters is a hard limit?

Thanks

if abs(i.date - e.date) <= self.DELTA_T]
for imported_entry, existing_entry in entry_pairs:
if any(starmap(lambda isplit, esplit: isplit.amount == esplit.amount,

This comment has been minimized.

@hsoft

hsoft Oct 10, 2017

Owner

I don't think that starmap() is needed here. Unless I'm mistaken, the simpler any(isplit.amount == esplit.amount for isplit, esplit in product(imported_entry.splits, existing_entry.splits)) is equivalent.

This comment has been minimized.

@tuxlifan

tuxlifan Oct 11, 2017

Contributor

much simpler, thanks!

if any(starmap(lambda isplit, esplit: isplit.amount == esplit.amount,
product(imported_entry.splits, existing_entry.splits))):
confidence = self.BASE_CONFIDENCE - self.PENALTIES[(imported_entry.date - existing_entry.date).days]
logging.debug("Fuzzy Date Bind match ({0:1.2}): {1} {2}".format(confidence, imported_entry, existing_entry)) # noqa: E501

This comment has been minimized.

@hsoft

hsoft Oct 10, 2017

Owner

I don't mind the # noqa, but logging formatting should be made with % placeholder styles, without the % operator. The idea is that logging will only perform the string formatting if we're in debug mode. This way, when we're not in debug mode, we don't spend time doing useless string formatting.

This comment has been minimized.

@tuxlifan

tuxlifan Oct 11, 2017

Contributor

Ah, I saw the .format used in core/loader/base.py (+141, +145) and thought it's the new python3-way to do.
Makes sense though, changed. Incidentally, that also takes care of E501 since the line is now much shorter :)

DELTA_T = datetime.timedelta(4) # 4 days
# positive key: imported is after existing
# (e.g. money taken from account (import) after purchase date (manual entry, existing))
PENALTIES = { -4: .08, # noqa

This comment has been minimized.

@hsoft

hsoft Oct 10, 2017

Owner

I'd prefer not having a # noqa there and reformat the literal so we don't have a E127 warning.

This comment has been minimized.

@tuxlifan

tuxlifan Oct 11, 2017

Contributor

Ok, reformatted.
I find it a bit uggly now because it breaks the visual column but it emphasis the 0 which is a plus.

@hsoft

This comment has been minimized.

Owner

hsoft commented Oct 10, 2017

Nah, it's alright if this plugin is considered "core". It just needs to not be enabled by default to avoid false matches. Once we're confident enough, I can very well see this plugin being enabled by default.

I'll implement the change to allow core plugins not to be enabled by default and will merge this PR against it when it's ready.

Other than that, the plugin looks good to me and will be a very interesting addition, thanks.

I'd like to have tests for it before merging, but I realize that it might be a bid much to ask because there isn't a good reference to use for this test. For example, there is no test for ReferenceBind. I'll add one soon and then let you base yourself upon it to implement your own test for FuzzyDateBind.

hsoft added a commit that referenced this pull request Oct 10, 2017

hsoft added a commit that referenced this pull request Oct 10, 2017

@tuxlifan

This comment has been minimized.

Contributor

tuxlifan commented Oct 11, 2017

I'll work on the tests probably on the weekend and then push also another new plugin (working name: fuzzy_memo_bind) that should eventually bring some possibilities to "fuzzily" match on descriptions (so far by extracting things that look like numeric dates and comparing how many match) and which I want to use to increase the matching confidence in fuzzy_date_bind.

But tests are first needed to understand the correctness, effectiveness and restore my sanity...

PS: I thought it better to create another core plugin fuzzy_memo_bind instead of coding the additional description matching in fuzzy_date_bind since fuzzy matching by description could be of interest also for users that don't (want to) use fuzzy_date_bind.

The current architecture does not allow chaining of plugins, correct?
i.e.
matches = fuzzy_memo_bind(fuzzy_date_bind(imported, existing), chained=True)
where fuzzy_date_bind would pass interesting matches (by date) to fuzzy_memo_bind which could refine them by increasing/decreasing the weight determined by fuzzy_date_bind according to its own findings.
So for now I just import things from fuzzy_memo_bind into fuzzy_date_bind which should be ok if both are accepted as core plugins (or I'll have to dynamically switch off the additional matching if the import fails).

@hsoft

This comment has been minimized.

Owner

hsoft commented Oct 11, 2017

I see where you want to go but I'm not sure which path is the best. Having a loosely coupled match chaining mechanism could be interesting but would require a overhaul of the import binding mechanism.

What do we really want in the end? Give the user the possibility of having fuzzy matching but with or without memo/description/etc. matching if she wants? Another possibility would be to introduce the concept of plugin preferences. We would then only have one fuzzy matching plugin, but with boolean preferences determining whether we match memo/description/etc.

tuxlifan added some commits Oct 17, 2017

Extend PENALTIES to +/- 7 days
Keep DELTA_T at 4 days.

In case of importing older data one might want to set DELTA_T to a
higher value than 4, e.g. 5 or even 6, due to longer banking delays
back in the days...
debug: reword and show one more decimal place than expected precision
Ensures debug message does not hide any actual issues with weights value.
Add refining matches in Fuzzy Date using new Fuzzy Memo's DateExtractor
New Fuzzy Memo Bind Plugin:
So far it returns an empty list from match_entries and is only used to
implement DateExtractor for numeric dates (ISO, US, and EU formats) and
language specific months and month abbreviation matches.

Languages implemented so far for DateExtractor:
  de
  en
  es
  fr

Currently there is no way to pass 'do_refine_matches' and 'lang' arguments
to FuzzyDateBind.match_entries from the main program flow (and we would
need plugin preferences first anyway for them to make sense).

For now, enable by default do_refine_matches and set the language to 'en'
with FUZZY_MEMO_BIND_DEFAULT_LANGUAGE.
Tests for FuzzyDateBind and FuzzyMemoBind.DateExtractor
Uses new eq_sorted_ from hscommon submodule which sorts lists before
passing them on to eq_
@tuxlifan

This comment has been minimized.

Contributor

tuxlifan commented Oct 17, 2017

Tests fail on bd59dc1 because of new eq_sorted_ missing (see hsoft/hscommon#3)

I think in the end a plugin preferences system would be good (but probably also a lot of work). It would also help with e.g. currency plugins for a user to add a currency (that is available in general with some existing provider but not queried) without having to copy and edit python files.

For now the FuzzyMemo is blatantly empty except for the DateExtractor which I felt should logically go there rather than into FuzzyDate. I'm sorry to say that until December/new year I'll probably have very little time to pursue this further.

I separated the language specific dicts into one file per language to make adding a new language "cut, paste and change" without having to understand the pythonic structure.
Another approach could be to put it in e.g. core/plugin/fuzzy_memo_bind_langdicts.py like:

d_e_months = {'en': ['January', ...]
              ...
              'your_language_here': ['', '', '', ..., '']  # 12 + 12 fields
}

Looking forward to your comments, as always :)

@hsoft

This comment has been minimized.

Owner

hsoft commented Oct 18, 2017

@tuxlifan there's a lot of good material in there! Nice-looking tests too! But I'm thinking that we should reduce the scope of this pull request if we ever want to make it into a merge-able state.

For example, the date extractor thing doesn't belong in a plugin, but directly in the core. The core import mechanism already does a lot of work to parse as many date formats as possible. If some date formats aren't supported, then adding support for them is worth a pull request by itself. I would suggest that you add your DateExtractor class to https://github.com/hsoft/moneyguru/blob/master/core/model/date.py and that you integrate it to the date parsing mechanism in https://github.com/hsoft/moneyguru/blob/master/core/loader/base.py . All your date-parsing related tests would be very welcome near https://github.com/hsoft/moneyguru/blob/master/core/tests/import_test.py#L442 . It would be awesome.

I understand that "extracting dates" isn't precisely the same thing as "parsing dates", but if our date parser supports all the formats that the "extractor" supports, we can simplify the "extractor" part and make it piggy-back on the improved parser. Besides, there's already a concept of "matching a possible date in a string that possibly has something else in it" at https://github.com/hsoft/moneyguru/blob/master/core/loader/base.py#L49 . It would only be a matter of re-packaging it so it's easily re-usable by a plugin.

For month names, I would suggest writing down these constants in english only and let the gettext system return translated values.

So, back to the feature-creep problem: what about this plan?

  1. limiting ourselves to the fuzzy date matching thing
  2. merge it
  3. work together on the plugin options feature (another PR)
  4. improve date parsing to support more formats (another PR)
  5. add fuzzy memo matching (another PR)

Does it sound good? I think we're heading somewhere nice, but I fear that making this PR too complicated will end up discouraging you and we'll end up merging nothing.

@hsoft hsoft self-assigned this Oct 18, 2017

@tuxlifan

This comment has been minimized.

Contributor

tuxlifan commented Oct 18, 2017

Hello @hsoft, I'm glad you like the tests; moneyguru has taught me a lot about testing so far and they are my first "serious" real-world ones :)

For month names, unless I misunderstand what you are proposing, I believe that using gettext could be a complication for the user since the interface language one might want to use does not necessarily match the language used in the banking data (e.g. for me it's interface:en, banking data:de and thankfully moneyGuru has the "Date format" preference so I don't have to mess around with LC_* variables just to get away from the day/month vs. month/day insanity ;-) just padding with 0 when entering dates and I'm all set 👍) or one might even have banking data in different languages to import at the same time.
From a translator perspective, however, gettext would certainly be a better/easier to work with choice.
Now, if we could have the flexibility of not using gettext and the ease of translation of using gettext... something to ponder.

I agree that it would be nice to integrate as much as possible with the core.

I feel it would be a good idea though to close this PR since I'm not overly confident in my current git-foo and worry that I would mess up either with github or, worse, somehow lose all the commits made so far (and what if something happened to my backups in the meantime...).
So I prefer to keep this branch in my fork as a 'nice first proof of concept, works for me until core integration is done' and use your plan, very slightly adjusted:

In the near future:

  1. new PR for fuzzy date matching only
  2. merge that

By the end of the year or next year, depending on how fast we are:

  1. work together on the plugin options feature (another PR)
  2. improve date parsing to support more formats (another PR)
  3. add fuzzy memo matching (another PR)

I'll leave it up to you to close this PR as I'm not sure of github's implications in doing so.

@hsoft

This comment has been minimized.

Owner

hsoft commented Oct 18, 2017

How would you detect the language to use? A selector in the import UI? That would be a feature in itself and it can probably be achieved with gettext with something that looks like https://github.com/hsoft/hscommon/blob/90a1c624d59bb23f0b0087cd0eca6b7ae5f4ecf4/trans.py#L110 , but only for the time of one import session. Using gettext doesn't necessarily imply using the language set in LC_*. moneyGuru already has a non-LC_* preference for its UI language.

I don't mind keeping this PR, and thus the discussion, open until sub-PRs are opened. It doesn't matter much. In all cases, it doesn't touch existing branches, so it can't mess up anything.

@tuxlifan

This comment has been minimized.

Contributor

tuxlifan commented Oct 27, 2017

How would you detect the language to use? A selector in the import UI?

With my user hat on, I would like to set the language in the import dialog and for it to be remembered as a property of the CSV layout chosen (if any); and have similar 'saved presets' for QIF import, too (where it could automagically choose based on e.g. the account name in the QIF, etc.(and also select a target account automatically) - once trained).

Using gettext doesn't necessarily imply using the language set in LC_*. moneyGuru already has a non-LC_* preference for its UI language.

Ok, no further objections to using gettext :)

@tuxlifan tuxlifan referenced this pull request Oct 27, 2017

Merged

Fuzzy Date plugin #496

@hsoft hsoft closed this Dec 1, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment