feat(api): Similar transaction storage. #1636

Open
elliotcourant opened this issue Dec 6, 2023 · 0 comments
func (p *PreProcessor) GetDatums() []Datum {
	datums := make([]Datum, 0, len(p.documents))
	docCount := float64(len(p.documents))
	// Calculate the smoothed inverse document frequency for every word.
	idf := make(map[string]float64, len(p.wc))
	for word, count := range p.wc {
		idf[word] = math.Log(docCount / (count + 1))
	}
	// Get a map of all the meaningful words and their index to use in the vector.
	minified := p.indexWords()
	// Define the length of the vector and adjust it to be divisible by 8. This
	// will enable us to leverage SIMD in the future. By using 8 we are
	// compatible with both AVX and AVX512.
	vectorLength := len(minified) + (8 - (len(minified) % 8))
	for i := range p.documents {
		// Get the current document we are working with.
		document := p.documents[i]
		// Calculate the TF-IDF for that document.
		for word, tfValue := range document.TF {
			document.TFIDF[word] = tfValue * idf[word]
			// If this specific word is meant to be more meaningful than TF-IDF
			// might treat it, then adjust it accordingly.
			if multiplier, ok := specialWeights[word]; ok {
				document.TFIDF[word] *= multiplier
			}
		}
		// Then create a vector of the words in the document name to use for
		// the DBSCAN clustering.
		document.Vector = make([]float64, vectorLength)
		words := 0
		for word, tfidfValue := range document.TFIDF {
			index, exists := minified[word]
			if !exists {
				continue
			}
			words++
			document.Vector[index] = tfidfValue
		}
		// If none of the document's words survived minification then it cannot
		// be clustered meaningfully; mark it invalid and skip it.
		if words == 0 {
			document.Valid = false
			p.documents[i] = document
			continue
		}
		document.Valid = true
		// Normalize the document's TF-IDF vector.
		calc.NormalizeVector64(document.Vector)
		// Then store the document back in.
		p.documents[i] = document
		datums = append(datums, Datum{
			ID:          document.ID,
			Transaction: document.Transaction,
			String:      document.String,
			Amount:      document.Transaction.Amount,
			Vector:      document.Vector,
		})
	}
	return datums
}

This piece of code generates the datums for the DBSCAN algorithm. But it also means that the data explaining why these datums are shaped the way they are is completely separate from the result of the DBSCAN.

I need to be able to compare the resulting DBSCAN clusters against the TF-IDF of the input. That way I can determine all of the unique words present in each cluster and pick out the most distinguishing ones, in order to produce a unique name for each cluster.

To that end, the pre-processor code needs to be merged with the DBSCAN code in such a way that the resulting clusters still have access to the data needed to back-reference the TF-IDF.


Specifically, the vector on the datum: since the vector is generated by assigning an index to each key-value pair in a Go map, the order of the vector can and will change between evaluations. Of course, the position of each word is irrelevant when we calculate the Euclidean distance between two vectors, as long as each word occupies the same position in both vectors.
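
To illustrate why the ordering is unstable, here is a minimal sketch of what indexWords might look like (a guess at the real implementation, which presumably also filters out meaningless words). Go's specification leaves map iteration order undefined, so the index assigned to each word changes between runs:

func (p *PreProcessor) indexWords() map[string]int {
	// Hypothetical sketch only. The `range` order over a Go map is not
	// specified and is deliberately randomized by the runtime, so each word
	// receives a different index on each evaluation.
	indexes := make(map[string]int, len(p.wc))
	i := 0
	for word := range p.wc {
		indexes[word] = i
		i++
	}
	return indexes
}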

Ideally this would provide a way for the vector's word indexes to be stored briefly, such that a resulting cluster could be fed back into the pre-processor, and the pre-processor could yield a map of the words present in that cluster and their relative weights.
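
As a rough sketch of that idea, assume the pre-processor holds onto the word-to-index map for the lifetime of one evaluation; the wordIndexes field and WordWeights method below are placeholder names, not existing API. Feeding a cluster's vectors back in would then recover the words present in the cluster and their summed weights:

func (p *PreProcessor) WordWeights(cluster [][]float64) map[string]float64 {
	weights := make(map[string]float64)
	for word, index := range p.wordIndexes {
		for _, vector := range cluster {
			// Sum the (normalized) TF-IDF weight this word contributed to each
			// member of the cluster; vectors that lack the word contribute zero.
			weights[word] += vector[index]
		}
	}
	// Drop the words that never appear anywhere in the cluster.
	for word, weight := range weights {
		if weight == 0 {
			delete(weights, word)
		}
	}
	return weights
}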

Given that, we could determine the most valuable words in each cluster, which should be the identifying factor for it. Words will be shared between clusters, but the most valuable words (in combination) should provide enough information to create a unique identifier for that cluster.

For example, take a cluster made up of transactions that are all named "Sentry. Merchant name: Sentry.". "Merchant name" will be present in every cluster (because of how Mercury formats their transactions), but the word "Sentry" will have a much higher score and should not conflict with the other clusters.
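
Continuing the sketch above, the cluster's identifier could be derived by ranking the recovered word weights and hashing the top few. The cutoff of three words and the FNV-1a hash are illustrative choices, not a settled design:

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func ClusterIdentifier(weights map[string]float64) string {
	type wordWeight struct {
		Word   string
		Weight float64
	}
	ranked := make([]wordWeight, 0, len(weights))
	for word, weight := range weights {
		ranked = append(ranked, wordWeight{word, weight})
	}
	// Highest weight first, breaking ties by word so the result is deterministic.
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].Weight != ranked[j].Weight {
			return ranked[i].Weight > ranked[j].Weight
		}
		return ranked[i].Word < ranked[j].Word
	})
	// Keep only the most valuable words; three is an arbitrary cutoff.
	if len(ranked) > 3 {
		ranked = ranked[:3]
	}
	hash := fnv.New64a()
	for _, item := range ranked {
		hash.Write([]byte(item.Word))
	}
	return fmt.Sprintf("%x", hash.Sum64())
}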


This solves one problem and creates another (granted, that other problem already existed).

  1. We can now uniquely identify clusters easily between evaluations. Clusters that are exceptionally identifiable (like the Sentry example above) should be consistently identified between cluster evaluations.
  2. Some clusters' identifiers will change over time. Slight changes in the transaction name (statements reading Quicken Loans at first, for example, and later changing to Rocket Mortgage) will cause either entirely new clusters to be created, or will cause the unique identifier for the existing cluster to change. The transactions that were part of cluster ABC1 are now identified as ABC2 because there were enough new transactions to change the weights of the most valuable identifiers, and thus the hash changes.

I don't think it's possible or reasonable to solve problem 2. There are some potential solutions, but all of them involve trying to pre-seed subsequent cluster evaluations with data that would nudge the algorithm toward clusters we have already seen. This might throw off the accuracy of the clustering or its ability to identify entirely new transaction groups.

But by making it so that the cluster's identifying hash can change, and planning around the idea that it will change, we make it easier to improve the clustering algorithm in the future. An entirely new algorithm could be introduced that creates entirely new clusters with completely different identifiers.

To support that mindset, clusters must meet the following requirements:

  1. Clusters must be stored such that they can be queried by the transactions they contain. This way, even if the cluster itself changes entirely, the application does not care; it just needs to know "what other transactions are similar to this one". So whatever storage medium is ultimately decided on, the way the data is queried needs to be kept in mind.
  2. The entire batch of clusters will be recalculated each time; we cannot easily do a partial calculation. So every time new transactions are imported, we need to be able to recalculate every cluster. This means the unique identifier for association rows (assuming we store them in PostgreSQL) will change with every single calculation, so we need to pick an identifier that is less volatile. A sketch of what this storage could look like follows this list.
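
To make requirement 1 concrete, here is a sketch assuming PostgreSQL as the storage medium: membership rows are keyed by transaction ID, so the query path ("what other transactions are similar to this one") survives a full recalculation even though the cluster hash changes underneath it. The table and function names are illustrative only:

import "database/sql"

// Illustrative only: a membership table keyed by transaction ID, so lookups
// survive a full recalculation of every cluster.
const createTransactionClustersTable = `
CREATE TABLE transaction_clusters (
    transaction_id BIGINT NOT NULL REFERENCES transactions (transaction_id),
    cluster_hash   TEXT   NOT NULL, -- Derived from the cluster's top words; expected to change.
    PRIMARY KEY (transaction_id)
);`

// FindSimilarTransactions answers "what other transactions are similar to this
// one" by joining through the membership row for the given transaction.
func FindSimilarTransactions(db *sql.DB, transactionID int64) ([]int64, error) {
	rows, err := db.Query(`
		SELECT other.transaction_id
		FROM transaction_clusters AS mine
		INNER JOIN transaction_clusters AS other ON other.cluster_hash = mine.cluster_hash
		WHERE mine.transaction_id = $1 AND other.transaction_id != $1;`,
		transactionID,
	)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var similar []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		similar = append(similar, id)
	}
	return similar, rows.Err()
}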
@elliotcourant elliotcourant added enhancement New feature or request api Related to or caused by the backend Go REST API. labels Dec 6, 2023
@elliotcourant elliotcourant self-assigned this Dec 6, 2023