Have a command to have the bot rename a thread based on distinctive words #35

Open
jkomoros opened this issue May 16, 2021 · 3 comments

Comments

@jkomoros
Owner

For example, use TFIDF to pick out distinctive words and rename the thread based on them.

There could be a slash command that could be run in the thread to update it.

Threads created by forking (see #34) could also automatically be named with this.

Generating a TFIDF index for the whole corpus is hard. We could start from a TFIDF map extracted from https://thecompendium.cards as a default, and use the same stemming.

@jkomoros
Owner Author

jkomoros commented May 16, 2021

  • Strip out URLs
  • Strip out markdown syntax
  • Strip out mentions
  • Stem
  • Test
  • Rebuild by fetching all messages
  • Consider having a "document" be a channel's messages, and computing TFIDF relative to other whole channels (currently TFIDF is all done at the level of individual messages)
  • OPs and messages with multiple important/significant emojis are counted more times when computing TFIDF for a channel (store with a Multiplier)
  • Store an actual discordgo.Message in the JSON
  • JSON should be a .data object
  • Switch to a flat messageID -> word count, and a channelID -> []messageID
  • When a message is edited or an emoji reaction happens, reindex the message
  • Every 15 minutes, if the IDFIndex for a given guild has changed, persist it to disk (in case the bot doesn't exit cleanly)
  • Store out IDF in a JSON file and reimport on bot boot (embed a constant for json version in the struct and check it's the same on boot, if not, discard)
  • in JSON consider having the channels be an array of messages in order (especially if this cache grows to be the archival format for old channels)... No need, the IDs since they're snowflakes can already be sorted based on ID to have the proper order
  • If a message comes in during boot it might interfere with the fetching of old messages
  • Given a TFIDF, print out the suggested name (the top n items of idf)
  • restem TopWords
  • BUG: IDFs loaded up from json when autosaved save to a nil location
  • have TopWords also automatically calculate a cutoff of words, picking UP TO count
  • Only rebuild the missing messages when bot loaded with existing JSON (as in, look to see if we have messages already in the cache, but missed some new ones, and fetch the ones that came in while the bot was not running)
  • Have an /auto-title-thread command
  • Switch to the batch-idf method as described in a comment below. New branch on top of idf?
  • When processing a message, if we're close to the max messages size, then don't store messages that would be deleted in the next batch. (Store the lowest known ID at all times and compare the ID to that, maybe?)
  • Respond with a pending interaction response if IDF has to be rebuilt (it could take a long time) (although it should have been built at server start)
  • Handle races where an IDF is asked for while one for that guild is being rebuilt
  • Don't create guild-1 idf in the test suite
  • Fix permissions issues where messages in a private channel are forbidden
  • Remove stop words. They still show up in production!
  • some restemmed words have punctuation
  • Rebuild IDF every day or so automatically
  • delete the channelsForMessage when deleting a message to make space. Need to store channelID in the message struct.
  • A flag to make it easy to use the snapshot IDF cache
  • When rebuilding a stale IDF cache, only fetch messages that are more recent than the timestamp (might require parsing the snowflake IDs for timestamps)
  • Allow configuring which emojis count as important
  • OP messages get a TFIDF multiplier that is higher, too
  • A way to accept the suggestion from the auto thread title command (currently it just suggests one)
  • when suggesting multiple titles, respond with multiple alts as buttons to rename the thread
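
The first few checklist items (strip URLs, markdown syntax, and mentions, then drop stop words) could be sketched roughly as below. This is not the bot's actual code: the function name `normalizeWords`, the regular expressions, and the tiny stop-word list are all assumptions made for illustration, and stemming is left out because it needs a stemmer library.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// URLs: anything starting with http:// or https:// up to the next whitespace.
	urlRE = regexp.MustCompile(`https?://\S+`)
	// Discord-style mentions such as <@123>, <@!123>, <@&123>, and <#123>.
	mentionRE = regexp.MustCompile(`<[@#][!&]?\d+>`)
	// A rough set of markdown punctuation (emphasis, code, strikethrough, quotes).
	// Note that this also splits snake_case into two words.
	markdownRE = regexp.MustCompile("[*_~`>|]+")
	// Catch-all: anything that is not a letter, digit, apostrophe, or whitespace.
	nonWordRE = regexp.MustCompile(`[^\p{L}\p{N}'\s]+`)
)

// stopWords is a deliberately tiny, hypothetical list; a real one would be longer.
var stopWords = map[string]bool{
	"the": true, "a": true, "an": true, "and": true,
	"of": true, "to": true, "is": true, "it": true,
}

// normalizeWords turns raw message content into lowercase words with URLs,
// mentions, markdown syntax, punctuation, and stop words removed.
func normalizeWords(content string) []string {
	s := urlRE.ReplaceAllString(content, " ")
	s = mentionRE.ReplaceAllString(s, " ")
	s = markdownRE.ReplaceAllString(s, " ")
	s = nonWordRE.ReplaceAllString(s, " ")

	var words []string
	for _, w := range strings.Fields(strings.ToLower(s)) {
		w = strings.Trim(w, "'")
		if w == "" || stopWords[w] {
			continue
		}
		words = append(words, w)
	}
	return words
}

func main() {
	msg := "Check **this** out <@!12345> https://example.com/a_b the `diamonds`"
	fmt.Println(normalizeWords(msg))
}
```

URLs and mentions are removed before the markdown pass so that the `<`, `>`, and `_` characters inside them don't leave stray fragments behind.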

@jkomoros
Owner Author

See also #10 which captures some of the TFIDF stuff

jkomoros added a commit that referenced this issue May 31, 2021
jkomoros added a commit that referenced this issue May 31, 2021
jkomoros added a commit that referenced this issue May 31, 2021
jkomoros added a commit that referenced this issue May 31, 2021
jkomoros added a commit that referenced this issue May 31, 2021
jkomoros added a commit that referenced this issue May 31, 2021
jkomoros added a commit that referenced this issue Jun 2, 2021
jkomoros added a commit that referenced this issue Jun 2, 2021
jkomoros added a commit that referenced this issue Jun 2, 2021
jkomoros added a commit that referenced this issue Jun 2, 2021
… for the guild (from then on, new messages extend it).

WARNING: this will take a long time for the bot to boot up for guilds with lots of messages; it effectively fetches ALL messages at boot!

Likely messed up some of the boilerplate in the test controllers.

Part of #35.
jkomoros added a commit that referenced this issue Jun 3, 2021
…e index, to make it a saner format to export as JSON. Part of #35.
jkomoros added a commit that referenced this issue Jun 3, 2021
jkomoros added a commit that referenced this issue Jun 3, 2021
jkomoros added a commit that referenced this issue Jun 3, 2021
jkomoros added a commit that referenced this issue Jun 3, 2021
…if the binary is more recent than the binary that produced it. Part of #35.
jkomoros added a commit that referenced this issue Jun 9, 2021
…n if run at the same time as a message is received. Part of #35.
jkomoros added a commit that referenced this issue Jun 9, 2021
… it's going to need a pointer back to messages. Part of #35.
jkomoros added a commit that referenced this issue Jun 9, 2021
…nal word for that stem in the channel.

Added a kind of half-ass test that implicitly covers it ('diamonds' instead of 'diamond').

Part of #35.
jkomoros added a commit that referenced this issue Jun 9, 2021
…ead based on the distinctive content. Part of #35.
jkomoros added a commit that referenced this issue Jun 9, 2021
jkomoros added a commit that referenced this issue Jun 9, 2021
…s to include based on where the biggest drop is. Part of #35.
jkomoros added a commit that referenced this issue Jun 9, 2021
jkomoros added a commit that referenced this issue Jun 10, 2021
Before, when the bot was rebuilding IDF and there were a lot of messages, a new message could come in while the old history was being fetched. We'd see the new message and think "OK, we've got everything for this channel".

Now, the bot keeps track of messages it's seen that have come in live since the bot started, and doesn't stop fetching history if it hits one of those.

Part of #35.
@jkomoros
Owner Author

Actually, all of this behavior about expiring old messages from the cache implies I'm making this way too hard. Especially if we're moving to #41 where there's a more long-term backup data store.

IDF rarely changes (especially for large guilds). Instead of keeping it live to generate it on demand, just do a batch process every so often (every day?) where it fetches all messages and calculates the IDF and saves that to disk in a cache, along with a timestamp so it can figure out when loading back up if it needs to regenerate it. And then when asking for a channel's suggested title, live fetch the messages, calculate the TFIDF, and return that.
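
A minimal sketch of the "save to disk with a timestamp, regenerate if needed" idea, assuming a JSON cache that also carries the format-version constant mentioned in the checklist. The struct, field names, and `cacheFormatVersion` are hypothetical and are not taken from the bot's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// cacheFormatVersion is a hypothetical constant; bump it when the JSON shape changes.
const cacheFormatVersion = 1

// idfCache is an assumed on-disk shape: how many documents were seen, and for
// each (stemmed) word, how many documents contained it at least once.
type idfCache struct {
	FormatVersion     int            `json:"formatVersion"`
	GuildID           string         `json:"guildID"`
	Built             time.Time      `json:"built"`
	DocumentCount     int            `json:"documentCount"`
	DocumentFrequency map[string]int `json:"documentFrequency"`
}

// usable reports whether the cache can be used as-is: it must exist, match the
// current format version, and be no older than maxAge.
func (c *idfCache) usable(now time.Time, maxAge time.Duration) bool {
	if c == nil || c.FormatVersion != cacheFormatVersion {
		return false
	}
	return now.Sub(c.Built) <= maxAge
}

// decodeCache parses a cache previously written with json.Marshal.
func decodeCache(blob []byte) (*idfCache, error) {
	var c idfCache
	if err := json.Unmarshal(blob, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	now := time.Now()
	fresh := &idfCache{
		FormatVersion:     cacheFormatVersion,
		GuildID:           "example-guild",
		Built:             now.Add(-time.Hour),
		DocumentCount:     2,
		DocumentFrequency: map[string]int{"diamond": 1},
	}
	blob, err := json.Marshal(fresh)
	if err != nil {
		panic(err)
	}
	loaded, err := decodeCache(blob)
	if err != nil {
		panic(err)
	}
	fmt.Println(loaded.usable(now, 24*time.Hour))                   // built an hour ago
	fmt.Println(loaded.usable(now.Add(48*time.Hour), 24*time.Hour)) // two days later
}
```

On boot the bot would decode the file, call something like `usable`, and kick off a full rebuild only when it returns false.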

jkomoros added a commit that referenced this issue Jun 10, 2021
IDF is generated and cached when a guild is first seen. All that is kept is a by-stemmed-word index of which documents have at least one instance of each word.

When TFIDF is generated, we fetch the messages for that channel on demand and then do the same calculation using the cached IDF.

IDF doesn't need to be regenerated all of that often; it will change relatively rarely for most guilds.

This change gets rid of most of the message caching logic which we'll want in some form again when we do #41, but don't need now (before this, it was possible to OOM if running for a long time).

This is a little slower when generating a suggested name, but not THAT slow (just the cost of fetching all messages in a channel sequentially), and it doesn't happen often.

Part of #35.
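
As a rough illustration of the approach in the commit message above (a cached per-word document count, with TFIDF computed on demand from freshly fetched channel words), here is a small sketch. The type and function names are hypothetical, and the smoothed formula `log((N+1)/(df+1))` is just one common IDF variant, not necessarily the one the bot uses.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// idfIndex is an assumed in-memory form of the cached index.
type idfIndex struct {
	documentCount     int            // total number of documents seen
	documentFrequency map[string]int // documents containing each word at least once
}

// idf returns a smoothed inverse document frequency for word.
// Words that appear in every document score 0.
func (i idfIndex) idf(word string) float64 {
	return math.Log(float64(i.documentCount+1) / float64(i.documentFrequency[word]+1))
}

// tfidf scores each distinct word in words (already normalized and stemmed)
// as its relative frequency in the channel times its cached IDF.
func tfidf(words []string, index idfIndex) map[string]float64 {
	counts := make(map[string]int)
	for _, w := range words {
		counts[w]++
	}
	scores := make(map[string]float64, len(counts))
	for w, n := range counts {
		tf := float64(n) / float64(len(words))
		scores[w] = tf * index.idf(w)
	}
	return scores
}

// topWords returns up to count words, highest score first (ties broken alphabetically).
func topWords(scores map[string]float64, count int) []string {
	words := make([]string, 0, len(scores))
	for w := range scores {
		words = append(words, w)
	}
	sort.Slice(words, func(a, b int) bool {
		if scores[words[a]] != scores[words[b]] {
			return scores[words[a]] > scores[words[b]]
		}
		return words[a] < words[b]
	})
	if count < 0 {
		count = 0
	}
	if count < len(words) {
		words = words[:count]
	}
	return words
}

func main() {
	index := idfIndex{
		documentCount:     100,
		documentFrequency: map[string]int{"the": 100, "card": 40, "diamond": 2},
	}
	channel := []string{"the", "diamond", "the", "card", "diamond"}
	fmt.Println(topWords(tfidf(channel, index), 2))
}
```

Because only the document counts are cached, memory stays bounded no matter how long the bot runs; the cost moves to fetching the channel's messages when a name is requested.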
jkomoros added a commit that referenced this issue Jun 11, 2021
jkomoros added a commit that referenced this issue Jun 11, 2021
jkomoros added a commit that referenced this issue Jun 11, 2021
jkomoros added a commit that referenced this issue Jun 11, 2021
…nds (it might take longer to fetch all of the messages in the channel and otherwise Discord would get grumpy). Part of #35.
jkomoros added a commit that referenced this issue Jun 11, 2021
It rebuilds the IDF index every day or so and fetches a channel's messages live when it's invoked.

It's unclear how well it will do on a real guild. In the test guild, the results are terrible (mainly stop words) because the test messages are so unrealistic.

Part of #35.

Merge branch 'batch-idf' into main

* batch-idf: (58 commits)
  Made it so the suggest-thread-name interaction will ACK within 3 seconds (it might take longer to fetch all of the messages in the channel and otherwise Discord would get grumpy). Part of #35.
  Automatically rebuild IDF caches when they expire. Runs on a timer based on REBUILD_IDF_INTERVAL. Part of #35.
  Add a layer of indirection so bot has .start, which calls registerSlashCommands. Part of #35.
  Make it so an IDF cache that is older than some period of time will be discarded. Defaults to 24 hours. Part of #35.
  Remove a now-unnecessary constant. Part of #35.
  Radical overhaul to IDF/TFIDF system.
  Changed how the bot decides if it's fetched back far enough in history.
  Refetch and index messages when their emoji reactions change. Part of #35.
  TFIDF calculation treats messages with important emoji reactions as being more important. Part of #35.
  Made it so TFIDF.AutoTopWords automatically chooses how many top words to include based on where the biggest drop is. Part of #35.
  Fixed a bug where IDF caches loaded up from disk didn't have a guild ID, so they'd save to "".json. Part of #35.
  Create a /suggest-thread-name action that suggests a name for the thread based on the distinctive content. Part of #35.
  Make it so the top words are unstemmed based on the most common original word for that stem in the channel.
  TFIDF holds onto the messageWordIndexes it's associated with. Part of #35.
  TFIDF switches from being simply a map[string]float64 to a struct, as it's going to need a pointer back to messages. Part of #35.
  TFIDF gets TopWords(count int) []string. Basic testing. Part of #35.
  Also reprocess messages when they're edited. Part of #35.
  Add a mutex for setting IDFIndex so autosave won't mess up things even if run at the same time as a message is received. Part of #35.
  Don't print the auto-save message unless it's actually being saved. Part of #35.
  Make it so the IDF indexes are auto-saved every 5 minutes if necessary. Part of #35.
  ...
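
The "biggest drop" cutoff mentioned for TFIDF.AutoTopWords in the list above could look something like the following. This is a guess at the idea rather than the actual implementation; `autoCutoff` and its signature are hypothetical.

```go
package main

import "fmt"

// autoCutoff takes scores sorted from highest to lowest and returns how many
// of them to keep: it cuts at the largest gap between adjacent scores, never
// keeping more than limit. If there is no gap at all, it keeps limit items.
func autoCutoff(sorted []float64, limit int) int {
	if limit > len(sorted) {
		limit = len(sorted)
	}
	if limit <= 0 {
		return 0
	}
	keep, biggest := limit, 0.0
	// Keeping i items means cutting between sorted[i-1] and sorted[i].
	for i := 1; i <= limit && i < len(sorted); i++ {
		if drop := sorted[i-1] - sorted[i]; drop > biggest {
			keep, biggest = i, drop
		}
	}
	return keep
}

func main() {
	scores := []float64{5, 4, 1, 0.5}
	// The largest gap is between 4 and 1, so two words are kept.
	fmt.Println(autoCutoff(scores, 4))
}
```

Cutting at the largest gap keeps the suggested name short when one or two words clearly dominate, while `limit` preserves the "UP TO count" behavior from the checklist.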
jkomoros added a commit that referenced this issue Jun 11, 2021
…t a showstopper.

In production, it's getting an issue:

Fetching messages for IDF for signposts (841709046754836501)
Fetching a batch of messages before 852613769104195646
couldn't fetch idf for guild 837459005038133279: couldn't fetch messages for channel 841709046754836501: couldn't fetch messages around 852613769104195646: HTTP 403 Forbidden, {"message": "Missing Access", "code": 50001}

Part of #35.
jkomoros added a commit that referenced this issue Jun 11, 2021
jkomoros added a commit that referenced this issue Jun 11, 2021
They were showing up as suggestions in production, even on real channels!

Part of #35.
jkomoros added a commit that referenced this issue Jun 11, 2021
…ion sometimes for suggested thread titles. Part of #35.
jkomoros added a commit that referenced this issue Jun 11, 2021
…e test server by passing -debug-idf-cache.

IDF shouldn't be _that_ different across guilds, and the larger the better. The IDF for the test server is very unrealistic, giving weird results and making it hard to debug the quality of the suggested thread titles.

Part of #35.
jkomoros added a commit that referenced this issue Jun 11, 2021
…passing.

This seems a bit weird, but not that crazy, since the bot is designed for only one bespoke server.

This SHOULD fix the permissions fetch issues for private channels.

Part of #35.
jkomoros added a commit that referenced this issue Jun 11, 2021