Have a command to have the bot rename a thread based on distinctive words #35
See also #10, which captures some of the TFIDF stuff.
…ages. Not yet exercised or tested. Part of #35.
… was number of channels). Part of #35.
… for the guild (from then on, new messages extend it). WARNING: this will take a long time for the bot to boot up for guilds with lots of messages; it effectively fetches ALL messages at boot! Likely messed up some of the boilerplate in the test controllers. Part of #35.
…e index, to make it a saner format to export as JSON. Part of #35.
…if the binary is more recent than the binary that produced it. Part of #35.
…n if run at the same time as a message is received. Part of #35.
… it's going to need a pointer back to messages. Part of #35.
…nal word for that stem in the channel. Added a kind of half-ass test that implicitly covers it ('diamonds' instead of 'diamond'). Part of #35.
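A minimal sketch of the unstemming idea, assuming the bot tracks, per stem, how often each original word appeared in the channel. The `originalCounts` type and the `unstem` function are hypothetical names for illustration, not the bot's actual API, and the alphabetical tie-break is an assumption:

```go
package main

import (
	"fmt"
	"sort"
)

// originalCounts maps a stem to the number of times each original
// (unstemmed) word with that stem appeared in the channel.
// Hypothetical type, for illustration only.
type originalCounts map[string]map[string]int

// unstem returns the most common original word for stem. Ties are
// broken alphabetically so the result is deterministic. A stem that
// was never seen is returned unchanged.
func unstem(counts originalCounts, stem string) string {
	words := counts[stem]
	if len(words) == 0 {
		return stem
	}
	candidates := make([]string, 0, len(words))
	for w := range words {
		candidates = append(candidates, w)
	}
	sort.Slice(candidates, func(i, j int) bool {
		a, b := candidates[i], candidates[j]
		if words[a] != words[b] {
			return words[a] > words[b]
		}
		return a < b
	})
	return candidates[0]
}

func main() {
	counts := originalCounts{
		"diamond": {"diamonds": 3, "diamond": 1},
	}
	fmt.Println(unstem(counts, "diamond"))
}
```

With these counts the stem "diamond" maps back to "diamonds", which matches the case the test in the commit mentions.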
…ead based on the distinctive content. Part of #35.
…ID, so they'd save to "".json. Part of #35.
…s to include based on where the biggest drop is. Part of #35.
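One way a biggest-drop cutoff could work, as a sketch: sort the words by score, find the largest gap between adjacent scores, and keep everything above that gap. The function name `autoTopWords` and the plain `map[string]float64` input are assumptions for illustration; the real `TFIDF.AutoTopWords` may differ:

```go
package main

import (
	"fmt"
	"sort"
)

// autoTopWords sorts words by descending score and keeps every word
// that comes before the largest drop between adjacent scores.
// With fewer than two words there is no drop, so all words are kept.
func autoTopWords(scores map[string]float64) []string {
	words := make([]string, 0, len(scores))
	for w := range scores {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool {
		if scores[words[i]] != scores[words[j]] {
			return scores[words[i]] > scores[words[j]]
		}
		return words[i] < words[j] // deterministic order for equal scores
	})
	if len(words) < 2 {
		return words
	}
	cut, biggest := 1, -1.0
	for i := 1; i < len(words); i++ {
		drop := scores[words[i-1]] - scores[words[i]]
		if drop > biggest {
			biggest, cut = drop, i
		}
	}
	return words[:cut]
}

func main() {
	scores := map[string]float64{
		"diamond": 0.9, "heist": 0.85, "the": 0.2, "and": 0.15,
	}
	fmt.Println(autoTopWords(scores))
}
```

Here the largest gap is between "heist" and "the", so only "diamond" and "heist" are kept.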
Before, when the bot was rebuilding IDF and there were a lot of messages, a new message could come in while the old history was still being fetched. We'd see the new message and think "OK, we've got everything for this channel". Now the bot keeps track of messages that have come in live since it started, and doesn't stop fetching history if it hits one of those. Part of #35.
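The stopping rule can be sketched as follows. This is a simplification under stated assumptions: history is already in memory as a newest-first slice of message IDs (the real bot fetches batches from Discord), and `idsToIndex`, `indexed`, and `seenLive` are hypothetical names:

```go
package main

import "fmt"

// idsToIndex walks history newest-first and returns the message IDs
// that still need to be indexed. It stops at the first message that is
// already indexed, unless that message arrived live after startup, in
// which case it says nothing about whether older history is covered.
func idsToIndex(history []string, indexed, seenLive map[string]bool) []string {
	var out []string
	for _, id := range history {
		if seenLive[id] {
			// Indexed only because it arrived live; keep going back.
			continue
		}
		if indexed[id] {
			// Indexed before startup, so everything older is covered too.
			break
		}
		out = append(out, id)
	}
	return out
}

func main() {
	history := []string{"105", "104", "103", "102", "101"}
	indexed := map[string]bool{"105": true, "102": true, "101": true}
	seenLive := map[string]bool{"105": true}
	fmt.Println(idsToIndex(history, indexed, seenLive))
}
```

Without the `seenLive` check, the walk would stop at "105" and miss "104" and "103".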
Actually, all of this behavior about expiring old messages from the cache implies I'm making this way too hard, especially if we're moving to #41, where there's a more long-term backup data store. IDF rarely changes (especially for large guilds). Instead of keeping it live to generate it on demand, just run a batch process every so often (every day?) that fetches all messages, calculates the IDF, and saves it to disk in a cache, along with a timestamp so the bot can tell on load whether it needs to regenerate it. Then, when asked for a channel's suggested title, fetch the messages live, calculate the TFIDF, and return that.
IDF is generated and cached when a guild is first seen. All that is kept is a by-stemmed-word index of which documents have at least one instance of each word. When TFIDF is generated, we fetch the messages for that channel on demand and then do the same calculation using the cached IDF. IDF doesn't need to be regenerated all that often; it will change relatively rarely for most guilds. This change gets rid of most of the message caching logic, which we'll want in some form again when we do #41 but don't need now (before this, it was possible to OOM if running for a long time). This is a little slower when generating a suggested name, but not THAT slow (just the cost of fetching all messages in a channel sequentially), and it doesn't happen often. Part of #35.
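A rough sketch of the split between a cached IDF and an on-demand TFIDF. It makes several simplifying assumptions that are not from the bot: words are lowercased and split on whitespace with no stemming, only per-word document counts are kept rather than the full document index, and the IDF uses a smoothed `log((N+1)/(df+1))`. The names `idfIndex`, `buildIDF`, and `tfidf` are hypothetical:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// idfIndex is the only state kept between runs: how many documents
// exist and how many of them contain each word at least once.
type idfIndex struct {
	docCount int
	docFreq  map[string]int
}

// buildIDF is the batch step, run over every message in the guild.
func buildIDF(docs []string) idfIndex {
	idx := idfIndex{docFreq: map[string]int{}}
	for _, d := range docs {
		idx.docCount++
		seen := map[string]bool{}
		for _, w := range strings.Fields(strings.ToLower(d)) {
			if !seen[w] {
				seen[w] = true
				idx.docFreq[w]++
			}
		}
	}
	return idx
}

// idf returns a smoothed inverse document frequency; a word that
// appears in every document scores 0.
func (idx idfIndex) idf(word string) float64 {
	return math.Log(float64(idx.docCount+1) / float64(idx.docFreq[word]+1))
}

// tfidf is the on-demand step: count terms in one channel's freshly
// fetched messages and weight each count by the cached IDF.
func (idx idfIndex) tfidf(channelMessages []string) map[string]float64 {
	scores := map[string]float64{}
	for _, m := range channelMessages {
		for _, w := range strings.Fields(strings.ToLower(m)) {
			scores[w]++
		}
	}
	for w := range scores {
		scores[w] *= idx.idf(w)
	}
	return scores
}

func main() {
	idx := buildIDF([]string{
		"the diamond heist",
		"the weather today",
		"the plan for the heist",
	})
	scores := idx.tfidf([]string{"the diamond", "the diamond heist"})
	fmt.Printf("diamond=%.3f heist=%.3f the=%.3f\n",
		scores["diamond"], scores["heist"], scores["the"])
}
```

In this toy corpus "the" appears in every document and scores 0, while "diamond" outranks "heist" because it is both rarer in the guild and more frequent in the channel.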
…e discarded. Defaults to 24 hours. Part of #35.
…sed on REBUILD_IDF_INTERVAL. Part of #35.
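The expiry check might look like the sketch below. It assumes the cache stores its build time as a Unix timestamp. `cachedIDF`, `needsRebuild`, and `defaultMaxAge` are hypothetical names; only the 24-hour default comes from the commit message, and how `REBUILD_IDF_INTERVAL` is read is not shown because the source does not say:

```go
package main

import (
	"fmt"
	"time"
)

// defaultMaxAge mirrors the 24-hour default mentioned in the commit.
const defaultMaxAge = 24 * time.Hour

// cachedIDF is a hypothetical on-disk cache header: which guild the
// index belongs to and when it was built (Unix seconds).
type cachedIDF struct {
	GuildID     string
	BuiltAtUnix int64
}

// needsRebuild reports whether a cache built at builtAtUnix is older
// than maxAge as of nowUnix, and so should be discarded and rebuilt.
func needsRebuild(builtAtUnix, nowUnix int64, maxAge time.Duration) bool {
	age := time.Duration(nowUnix-builtAtUnix) * time.Second
	return age > maxAge
}

func main() {
	now := time.Now().Unix()
	cache := cachedIDF{GuildID: "example-guild", BuiltAtUnix: now - 25*3600}
	fmt.Println(needsRebuild(cache.BuiltAtUnix, now, defaultMaxAge))
}
```

A periodic timer could run the same check for each loaded cache and trigger a rebuild when it returns true.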
…nds (it might take longer to fetch all of the messages in the channel and otherwise Discord would get grumpy). Part of #35.
It rebuilds the IDF index every day or so and fetches a channel's messages live when it's invoked. It's unclear how well it will do on a real guild. In the test guild, the results are terrible (mainly stop words) because the test messages are so unrealistic. Part of #35.

Merge branch 'batch-idf' into main

* batch-idf: (58 commits)
  Made it so the suggest-thread-name interaction will ACK within 3 seconds (it might take longer to fetch all of the messages in the channel and otherwise Discord would get grumpy). Part of #35.
  Automatically rebuild IDF caches when they expire. Runs on a timer based on REBUILD_IDF_INTERVAL. Part of #35.
  Add a layer of indirection so bot has .start, which calls registerSlashCommands. Part of #35.
  Make it so an IDF cache that is older than some period of time will be discarded. Defaults to 24 hours. Part of #35.
  Remove a now-unnecessary constant. Part of #35.
  Radical overhaul to IDF/TFIDF system.
  Changed how the bot decides if it's fetched back far enough in history.
  Refetch and index messages when their emoji reactions change. Part of #35.
  TFIDF calculation treats messages with important emoji reactions as being more important. Part of #35.
  Made it so TFIDF.AutoTopWords automatically chooses how many top words to include based on where the biggest drop is. Part of #35.
  Fixed a bug where IDF caches loaded up from disk didn't have a guild ID, so they'd save to "".json. Part of #35.
  Create a /suggest-thread-name action that suggests a name for the thread based on the distinctive content. Part of #35.
  Make it so the top words are unstemmed based on the most common original word for that stem in the channel.
  TFIDF holds onto the messageWordIndexes it's associated with. Part of #35.
  TFIDF switches from being simply a map[string]float64 to a struct, as it's going to need a pointer back to messages. Part of #35.
  TFIDF gets TopWords(count int) []string. Basic testing. Part of #35.
  Also reprocess messages when they're edited. Part of #35.
  Add a mutex for setting IDFIndex so autosave won't mess up things even if run at the same time as a message is received. Part of #35.
  Don't print the auto-save message unless it's actually being saved. Part of #35.
  Make it so the IDF indexes are auto-saved every 5 minutes if necessary. Part of #35.
  ...
…t a showstopper. In production, it's hitting an issue:
  Fetching messages for IDF for signposts (841709046754836501)
  Fetching a batch of messages before 852613769104195646
  couldn't fetch idf for guild 837459005038133279: couldn't fetch messages for channel 841709046754836501: couldn't fetch messages around 852613769104195646: HTTP 403 Forbidden, {"message": "Missing Access", "code": 50001}
Part of #35.
They were showing up as suggestions in production, even on real channels! Part of #35.
…ion sometimes for suggested thread titles. Part of #35.
…e test server by passing -debug-idf-cache. IDF shouldn't be _that_ different across guilds, and the larger the better. The IDF for the test server is very unrealistic, giving weird results and making it hard to debug the quality of the suggested thread titles. Part of #35.
…passing. This seems a bit weird, but not that crazy, since the bot is designed for only one bespoke server. This SHOULD fix the permissions fetch issues for private channels. Part of #35.
For example, use TFIDF to pick out distinctive words and rename a thread accordingly.
There could be a slash command that could be run in the thread to update it.
Threads created by forking (see #34) could also automatically be named with this.
Generating a TFIDF index for the whole corpus is hard. We could use a TFIDF map extracted from https://thecompendium.cards as a default to start from, and use the same stemming.