New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can list changes be more intelligent #97

Closed
ctmay4 opened this Issue Feb 9, 2015 · 13 comments

Comments

Projects
None yet
3 participants
@ctmay4

ctmay4 commented Feb 9, 2015

I have a an entity with a List<List> that I am tracking changes on. I just tested an Object which has 837 List object

commit 3.0, author: me@here.com, Feb 9, 2015 2:25:48 PM
  changed object: StagingTable/54d909bb102cbbfdf54855ea
    list changed on 'rawRows' property: [
        (10).'[T0, N1, M0, SX, VALUE:IINOS]'>>'[T0, N1, M0, ERROR, VALUE:ERROR]', 
        (11).'[T0, N1, M0, ERROR, VALUE:ERROR]'>>'[T0, N2, M0, S0, VALUE:IIB]', 
        (12).'[T0, N2, M0, S0, VALUE:IIB]'>>'[T0, N2, M0, S1, VALUE:IIB]', 
        (13).'[T0, N2, M0, S1, VALUE:IIB]'>>'[T0, N2, M0, S2, VALUE:IIIB]', 
        (14).'[T0, N2, M0, S2, VALUE:IIIB]'>>'[T0, N2, M0, S3, VALUE:IIIC]', 
        (15).'[T0, N2, M0, S3, VALUE:IIIC]'>>'[T0, N2, M0, SX, VALUE:IINOS]', 
        (16).'[T0, N2, M0, SX, VALUE:IINOS]'>>'[T0, N2, M0, ERROR, VALUE:ERROR]',
        (17).'[T0, N2, M0, ERROR, VALUE:ERROR]'>>'[T0, N3, M0, S0, VALUE:IIC]', 
        (18).'[T0, N3, M0, S0, VALUE:IIC]'>>'[T0, N3, M0, S1, VALUE:IIC]', 
        (19).'[T0, N3, M0, S1, VALUE:IIC]'>>'[T0, N3, M0, S2, VALUE:IIIB]', 
        ...
        <SNIP>
        ...
        (837).'[TX, NX, M1NOS, S3, VALUE:IIINOS]'>>'[TX, NX, M1NOS, SX, VALUE:IIINOS]',
        (838).'[TX, NX, M1NOS, SX, VALUE:IIINOS]'>>'[TX, NX, M1NOS, ERROR, VALUE:ERROR]',
        (839).removed:'[TX, NX, M1NOS, ERROR, VALUE:ERROR]'
    ]

So removing a single line from the List causes an audit trail of 800+ entries. Instead, wouldn't it be better to have this be the result:

commit 3.0, author: me@here.com, Feb 9, 2015 2:25:48 PM
  changed object: StagingTable/54d909bb102cbbfdf54855ea
    list changed on 'rawRows' property: [
        (10).removed:'[T0, N1, M0, SX, VALUE:IINOS]'
    ]

For example, if I put the raw JSON out and delete the line, a command-line diff looks like this:

$ diff before.json after.json
44d43
<         [ "T0", "N1", "M0", "S3", "VALUE:IIIC" ],

Not sure how easy this would be to accomplish, but I do worry that if small changes are producing so much noise, what if I had a table with a very large list. The growth of the changelog would be too quick.

@bartoszwalacik

This comment has been minimized.

Member

bartoszwalacik commented Feb 9, 2015

well, this is the issue concerning nested lists?
I'm aware that JaVers has only basic support for these structures.
You are right that it could produce noisy diffs.

The problem is that I don't know how to model these nested changes in the terms of Change.class
(this is a kind of core JaVers concept).
Maybe you have some suggestions?
What if some user creates Set<List<String>> or List<Set<String>>? It's not possible to cover all corner cases...

I think that the better way to solving this problem is to alter the data model and get rid of nested lists. I know that it could be painfull but honestly I don't have good idea how to model this kind of Changes. Take a look at Change.class hierarchy and maybe you come up with some solution.

@ctmay4

This comment has been minimized.

ctmay4 commented Feb 11, 2015

I don't think altering the data model would help. I put together another Gist which I track changes on a List to get the nesting out of the test. The results were the same.

https://gist.github.com/ctmay4/1a328c800b459101d1c4

I basically create list of 100 Strings. Then I remove the 5th element and add new new elements at the 90th position.

For JaVers, I get the following output:

commit 2.0, author: me@here.com, Feb 11, 2015 2:54:32 PM
  changed object: com.imsweb.seerapi.DiffTesting$TestObject/1
    list changed on '_table' property: [(5).'A5'>>'A6', (6).'A6'>>'A7', (7).'A7'>>'A8', (8).'A8'>>'A9', (9).'A9'>>'A10', (10).'A10'>>'A11', (11).'A11'>>'A12', (12).'A12'>>'A13', (13).'A13'>>'A14', (14).'A14'>>'A15', (15).'A15'>>'A16', (16).'A16'>>'A17', (17).'A17'>>'A18', (18).'A18'>>'A19', (19).'A19'>>'A20', (20).'A20'>>'A21', (21).'A21'>>'A22', (22).'A22'>>'A23', (23).'A23'>>'A24', (24).'A24'>>'A25', (25).'A25'>>'A26', (26).'A26'>>'A27', (27).'A27'>>'A28', (28).'A28'>>'A29', (29).'A29'>>'A30', (30).'A30'>>'A31', (31).'A31'>>'A32', (32).'A32'>>'A33', (33).'A33'>>'A34', (34).'A34'>>'A35', (35).'A35'>>'A36', (36).'A36'>>'A37', (37).'A37'>>'A38', (38).'A38'>>'A39', (39).'A39'>>'A40', (40).'A40'>>'A41', (41).'A41'>>'A42', (42).'A42'>>'A43', (43).'A43'>>'A44', (44).'A44'>>'A45', (45).'A45'>>'A46', (46).'A46'>>'A47', (47).'A47'>>'A48', (48).'A48'>>'A49', (49).'A49'>>'A50', (50).'A50'>>'A51', (51).'A51'>>'A52', (52).'A52'>>'A53', (53).'A53'>>'A54', (54).'A54'>>'A55', (55).'A55'>>'A56', (56).'A56'>>'A57', (57).'A57'>>'A58', (58).'A58'>>'A59', (59).'A59'>>'A60', (60).'A60'>>'A61', (61).'A61'>>'A62', (62).'A62'>>'A63', (63).'A63'>>'A64', (64).'A64'>>'A65', (65).'A65'>>'A66', (66).'A66'>>'A67', (67).'A67'>>'A68', (68).'A68'>>'A69', (69).'A69'>>'A70', (70).'A70'>>'A71', (71).'A71'>>'A72', (72).'A72'>>'A73', (73).'A73'>>'A74', (74).'A74'>>'A75', (75).'A75'>>'A76', (76).'A76'>>'A77', (77).'A77'>>'A78', (78).'A78'>>'A79', (79).'A79'>>'A80', (80).'A80'>>'A81', (81).'A81'>>'A82', (82).'A82'>>'A83', (83).'A83'>>'A84', (84).'A84'>>'A85', (85).'A85'>>'A86', (86).'A86'>>'A87', (87).'A87'>>'A88', (88).'A88'>>'A89', (89).'A89'>>'A90', (90).'A90'>>'INSERT_A1', (91).'A91'>>'INSERT_A2', (92).'A92'>>'A91', (93).'A93'>>'A92', (94).'A94'>>'A93', (95).'A95'>>'A94', (96).'A96'>>'A95', (97).'A97'>>'A96', (98).'A98'>>'A97', (99).'A99'>>'A98', (100).added:'A99']
commit 1.0, author: me@here.com, Feb 11, 2015 2:54:32 PM
    new object: com.imsweb.seerapi.DiffTesting$TestObject/1

It is still listing almost every row since it doesn't "detect" a deletion.

My gist also includes an example from java-object-diff. Here is the output:

/ changed from com.imsweb.seerapi.DiffTesting$TestObject@d041cf to com.imsweb.seerapi.DiffTesting$TestObject@129a8472
/table changed from [A0, A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, A17, A18, A19, A20, A21, A22, A23, A24, A25, A26, A27, A28, A29, A30, A31, A32, A33, A34, A35, A36, A37, A38, A39, A40, A41, A42, A43, A44, A45, A46, A47, A48, A49, A50, A51, A52, A53, A54, A55, A56, A57, A58, A59, A60, A61, A62, A63, A64, A65, A66, A67, A68, A69, A70, A71, A72, A73, A74, A75, A76, A77, A78, A79, A80, A81, A82, A83, A84, A85, A86, A87, A88, A89, A90, A91, A92, A93, A94, A95, A96, A97, A98, A99] to [A0, A1, A2, A3, A4, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, A17, A18, A19, A20, A21, A22, A23, A24, A25, A26, A27, A28, A29, A30, A31, A32, A33, A34, A35, A36, A37, A38, A39, A40, A41, A42, A43, A44, A45, A46, A47, A48, A49, A50, A51, A52, A53, A54, A55, A56, A57, A58, A59, A60, A61, A62, A63, A64, A65, A66, A67, A68, A69, A70, A71, A72, A73, A74, A75, A76, A77, A78, A79, A80, A81, A82, A83, A84, A85, A86, A87, A88, A89, A90, INSERT_A1, INSERT_A2, A91, A92, A93, A94, A95, A96, A97, A98, A99]
/table[INSERT_A1] changed from null to INSERT_A1
/table[INSERT_A2] changed from null to INSERT_A2
/table[A5] changed from A5 to null

It returns the entire before/after for the table, but also detects I added 2 rows and deleted one. Interestingly, it doesn't track which rows were added/deleted, just the values of those rows. I believe that library is comparing the values of the rows and assuming they are unique.

I just wanted to point out that this is not only an issue with complicated nested structures. A simple List exhibits the same behavior.

@bartoszwalacik bartoszwalacik self-assigned this Feb 12, 2015

@bartoszwalacik

This comment has been minimized.

Member

bartoszwalacik commented Feb 15, 2015

interesting, I'll investigate this issue shortly. Now we are in the middle of huge development in Javers, we are working on SQL repository, so our response times are bit slower, please be patient.

@bartoszwalacik

This comment has been minimized.

Member

bartoszwalacik commented Feb 15, 2015

Think I got it. The reason why you get such big change list is:

// delete the 5th row
        base.getTable().remove(5);

Since we remove 5'th row, the rows with index 5 and higher are shifted down.
So all rows with index greater than 5 got new index (n-1).
When Javers compares Lists, it pays attention at item index ...

@bartoszwalacik

This comment has been minimized.

Member

bartoszwalacik commented Feb 15, 2015

You are right, that JaVers could detect such shift and handle it in a more intelligent way

@bartoszwalacik

This comment has been minimized.

Member

bartoszwalacik commented Feb 16, 2015

as a quick win solution, I can implement treatListsAsSets feature.
With this flag enabled, JaVers will compare Lists without paying attention to the item indexes.

Smart list comparing (including shift detection) requires more work.

So would this treatListsAsSets feature be suitable for your case?

@ctmay4

This comment has been minimized.

ctmay4 commented Feb 17, 2015

It would possibly help, but I have not seen how Sets are handled. I assume this would be based on element equality? What if I have a Set with values

{ "1", "1", "1", "1", "2" }

How would it report the second "1" being removed? For my personal use-case there may be duplicates.

Once I get back to my office I will take a list at Set handling and let you know. Thanks again.

@bartoszwalacik

This comment has been minimized.

Member

bartoszwalacik commented Feb 17, 2015

Yes, this simple treatListsAsSets solution would work well only for lists without duplicates.

I'm working on more intelligent algorithm for comparing lists, but it may take a while.
There would be performacne issues.
The main idea is to treat lists as queues, preserve ordering, but forget about exact values of item indexes.

@Kornel

This comment has been minimized.

Member

Kornel commented Feb 20, 2015

I'll add an implementation based on finding the min Levenshtein distance. I've got an POC algorithm ready which is both space and running time O(nm), where n and m is the length of list1 and list2 respectively. This can be optimised to be O(n) space and O(nm) time, will do a PR soon.

Kornel added a commit that referenced this issue Feb 20, 2015

#97 smare list compare
 - Added a smart comparison for lists based on the Levenshtein
  edit distance problem, see package-info for brief description

pszymczyk added a commit that referenced this issue Mar 11, 2015

pszymczyk added a commit that referenced this issue Mar 11, 2015

pszymczyk added a commit that referenced this issue Mar 11, 2015

@bartoszwalacik

This comment has been minimized.

Member

bartoszwalacik commented Mar 11, 2015

@ctmay4, we have created smart list compara algorithm, based on Levenshtein distance.
It calculates smart diffs but could be slow for huge lists, how many elements you have in your lists usually?

@ctmay4

This comment has been minimized.

ctmay4 commented Mar 12, 2015

The tables are usually really small (< 50 rows). We have a few tables that get as high as 4000 rows.

@bartoszwalacik

This comment has been minimized.

Member

bartoszwalacik commented Mar 14, 2015

great, so the new algorithm will be fine for you, i'am doing the release ...

Kornel added a commit that referenced this issue Mar 15, 2015

bartoszwalacik added a commit that referenced this issue Mar 16, 2015

@bartoszwalacik

This comment has been minimized.

Member

bartoszwalacik commented Mar 17, 2015

JaVers 1.1.1 is released, use the new Levenshtein algorithm for comparing lists,
http://javers.org/documentation/diff-configuration/#list-algorithms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment