Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Messages are being count twice (duplicates) #40

Closed
JustAnOkapi opened this issue Jan 15, 2023 · 6 comments · Fixed by #55
Closed

Messages are being count twice (duplicates) #40

JustAnOkapi opened this issue Jan 15, 2023 · 6 comments · Fixed by #55
Labels
bug Something isn't working pipeline Related to the pipeline

Comments

@JustAnOkapi
Copy link

if you upload two of the same discord reports, it counts each message twice.
this is important as if you have exports from overlapping time periods, messages will not be handled correctly

@mlomb
Copy link
Owner

mlomb commented Jan 15, 2023

It should not happen, but I'll take a look

@hopperelec
Copy link
Contributor

I've experienced this too, so I can confirm it happens

@mlomb
Copy link
Owner

mlomb commented Jan 18, 2023

Yes, I can confirm only recent message IDs are stored (for linking replies) right now, I was mistaken

I don't know if deduplication is feasible since we would have to store the ID for every single message and do a search for it every time a new comes through (lets say logarithmic). Since we want to support multi million messages chats, size and performance may suffer too much

@hopperelec
Copy link
Contributor

I don't understand how it works, but I've been told that queries to lists/maps involving hash tables are not affected by the size of the list/map (i.e. they have a time complexity of O(1)). It doesn't look like Javascript uses hash tables by default, but implementations of it look pretty simple. I imagine being able to search through all the past messages is useful for other features, too, such as for linking replies.

@hopperelec
Copy link
Contributor

Just noticed something else this would solve

// TODO: if this is slow, make sure selected is always sorted and run binary search here
// for now, its not necessary
selected={selected.includes(optionIndex)}

And I think there's more places where arrays are sorted to make them easier to find items in which could be replaced with a hash table

@hopperelec
Copy link
Contributor

Just realised that all non-primitive Javascript (and hence Typescript) objects use hash tables. Improving the performance should be as easy as using the most appropriate data type for each form of array and making sure to use keys which can easily be identified. From what I can see, though, this is already being done, so I'm not sure how you've found that sorting the arrays helps with performance at all

@mlomb mlomb added bug Something isn't working pipeline Related to the pipeline labels Jan 24, 2023
@mlomb mlomb changed the title doesnt remove duplicate messages Messages are being count twice (duplicates) Jan 24, 2023
@mlomb mlomb linked a pull request Jan 29, 2023 that will close this issue
@mlomb mlomb closed this as completed in #55 Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pipeline Related to the pipeline
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants