		// If this specific word is meant to be more meaningful than tfidf might treat it then adjust it accordingly
		if multiplier, ok := specialWeights[word]; ok {
			document.TFIDF[word] *= multiplier
		}
	}

	// Then create a vector of the words in the document name to use for the DBSCAN clustering
	document.Vector = make([]float64, vectorLength)
	words := 0
	for word, tfidfValue := range document.TFIDF {
		index, exists := minified[word]
		if !exists {
			continue
		}
		words++
		document.Vector[index] = tfidfValue
	}
	if words == 0 {
		document.Valid = false
		p.documents[i] = document
		continue
	}
	document.Valid = true

	// Normalize the document's tfidf vector.
	calc.NormalizeVector64(document.Vector)
	p.documents[i] = document

	// Then store the document back in
	if document.Valid {
		datums = append(datums, Datum{
			ID:          document.ID,
			Transaction: document.Transaction,
			String:      document.String,
			Amount:      document.Transaction.Amount,
			Vector:      document.Vector,
		})
	}
}
return datums
}
This piece of code generates the datums for the DBSCAN algorithm. However, the data that explains why these datums are shaped the way they are is kept completely separate from the result of the DBSCAN.
I need to be able to compare the resulting DBSCAN clusters against the TF-IDF of the input. This way I can determine all of the unique words present in each cluster and identify the most identifying ones among them, in order to attempt to produce a unique name for each cluster.
To that end, the pre-processor code needs to be merged with the DBSCAN code in such a way that the resulting clusters still have access to the data needed to back-reference the TF-IDF.
Specifically the vector on the datum: since the vector is generated by assigning an index to the key-value pairs in a Go map, the order of the vector can and will change between evaluations. Of course, the position of each word is irrelevant when we calculate the Euclidean distance between two vectors, as long as each word occupies the same position in both vectors.
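If the `minified` index map is built by iterating over a vocabulary map (an assumption here, since that code isn't shown), the vector positions will shuffle on every run because Go deliberately randomizes map iteration order. A minimal sketch of one way to make the indexes repeatable between evaluations, with all names hypothetical:

```go
package main

import "sort"

// buildWordIndex assigns each word a stable vector index by sorting
// the vocabulary first. Ranging over a Go map yields a different order
// on every run, so deriving indexes directly from range order changes
// between evaluations; sorting removes that source of churn.
func buildWordIndex(vocabulary map[string]struct{}) map[string]int {
	words := make([]string, 0, len(vocabulary))
	for word := range vocabulary {
		words = append(words, word)
	}
	sort.Strings(words)
	index := make(map[string]int, len(words))
	for i, word := range words {
		index[word] = i
	}
	return index
}
```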
Ideally this would simply provide something where the indexes of the vector can be stored briefly, such that a resulting cluster could be fed back into the pre-processor, which could then yield a map of the words present in that cluster and their relative weights.
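As a sketch of what that feedback could look like, assuming the pre-processor holds onto its word-to-index map after building the vectors (the `Datum` shape and all names here are hypothetical, not monetr's actual types):

```go
package main

import "sort"

// Datum is a simplified stand-in for the real datum type.
type Datum struct {
	ID     string
	Vector []float64
}

// PreProcessor keeps the word -> vector index map alive after the
// vectors are built, so clusters can be mapped back to words.
type PreProcessor struct {
	wordIndex map[string]int
}

// ClusterWordWeights sums the normalized TF-IDF weight of every word
// across the members of one cluster, yielding word -> total weight.
func (p *PreProcessor) ClusterWordWeights(members []Datum) map[string]float64 {
	weights := make(map[string]float64, len(p.wordIndex))
	for word, index := range p.wordIndex {
		for _, member := range members {
			if index < len(member.Vector) && member.Vector[index] > 0 {
				weights[word] += member.Vector[index]
			}
		}
	}
	return weights
}

// TopWords returns the n highest-weighted words for a cluster, sorted
// by descending weight (ties broken alphabetically for determinism).
func TopWords(weights map[string]float64, n int) []string {
	words := make([]string, 0, len(weights))
	for word := range weights {
		words = append(words, word)
	}
	sort.Slice(words, func(i, j int) bool {
		if weights[words[i]] != weights[words[j]] {
			return weights[words[i]] > weights[words[j]]
		}
		return words[i] < words[j]
	})
	if n > len(words) {
		n = len(words)
	}
	return words[:n]
}
```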
Given that, we could determine the most valuable words in each cluster, which should be the identifying factor for it. Words will be shared between clusters, but the most valuable words (in combination with each other) should provide enough information to create a unique identifier for each cluster.
For example, take a cluster made up of transactions that all read "Sentry. Merchant name: Sentry.". "Merchant name" will be present in every cluster (because of how Mercury formats their transactions), but the word "Sentry" will have a much higher score and should not conflict with the other clusters.
This solves one problem and creates a different problem (granted, the different problem already existed).
We can now easily identify clusters uniquely between evaluations. Clusters that are exceptionally identifiable (like the Sentry example above) should be identified consistently between cluster evaluations.
Some clusters' identifiers will change over time. Slight changes in the transaction name (Quicken Loans at first, changing to Rocket Mortgage on statements later, for example) will cause either entirely new clusters to be created, or will cause the unique identifier for the existing cluster to change. The transactions that were part of cluster ABC1 are now identified as ABC2 because enough new transactions arrived to change the weights of the most valuable identifiers, and thus the hash changes.
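A sketch of how the identifying hash could be derived from a cluster's top words, sorted first so the hash does not depend on the order the words were collected in (the function name and hash choice are hypothetical):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// ClusterIdentifier hashes a cluster's most valuable words into a
// stable hex identifier. Sorting makes the hash independent of the
// order the words were ranked or collected in, so the identifier only
// changes when the set of top words itself changes (the Quicken Loans
// -> Rocket Mortgage situation described above).
func ClusterIdentifier(topWords []string) string {
	words := make([]string, len(topWords))
	copy(words, topWords)
	sort.Strings(words)
	// Join with NUL so "ab"+"c" and "a"+"bc" cannot collide.
	sum := sha256.Sum256([]byte(strings.Join(words, "\x00")))
	return hex.EncodeToString(sum[:8]) // a short prefix keeps the ID readable
}
```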
I don't think it's possible or reasonable to solve #2. There are some potential solutions, but all of them involve trying to pre-seed subsequent cluster evaluations with data that would nudge the algorithm toward clusters that we have already seen. This might throw off the accuracy of the clustering or its ability to identify entirely new transaction groups.
But by making it so that the cluster's identifying hash can change, and planning around the idea that it will change, we make it easier to improve the clustering algorithm in the future. An entirely new algorithm could be introduced that creates entirely new clusters with completely different identifiers.
To support that mindset, clusters must have the following:
Must be storable such that they can be queried by the transactions in the cluster. That way, even if the cluster itself changes entirely, the application does not care; it just needs to know "what other transactions are similar to this one?". So whatever storage medium is ultimately decided on, the way the data is queried needs to be kept in mind.
The entire batch of clusters will be recalculated each time; we cannot easily do a partial calculation. So every time new transactions are imported, we need to be able to recalculate all of the clusters. This means the unique identifier for associated rows (assuming we store them in PostgreSQL) will change with every single calculation. We need to pick an identifier that is less volatile.
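One way to satisfy the query-by-transaction requirement regardless of how volatile the cluster hash is would be an index rebuilt wholesale on every evaluation. A rough in-memory sketch with hypothetical names (the real storage would presumably be PostgreSQL tables rather than maps):

```go
package main

// ClusterIndex maps transactions to clusters in both directions so the
// application can ask "what is similar to this transaction" without
// ever depending on a specific cluster identifier surviving.
type ClusterIndex struct {
	byTransaction map[string]string   // transaction ID -> cluster identifier
	members       map[string][]string // cluster identifier -> transaction IDs
}

func NewClusterIndex() *ClusterIndex {
	return &ClusterIndex{
		byTransaction: map[string]string{},
		members:       map[string][]string{},
	}
}

// Rebuild replaces the entire index, mirroring the "recalculate every
// time" constraint, since partial updates are not practical.
func (c *ClusterIndex) Rebuild(clusters map[string][]string) {
	c.byTransaction = make(map[string]string)
	c.members = clusters
	for id, txns := range clusters {
		for _, txn := range txns {
			c.byTransaction[txn] = id
		}
	}
}

// SimilarTransactions answers "what other transactions are similar to
// this one" without the caller ever touching the volatile cluster hash.
func (c *ClusterIndex) SimilarTransactions(txnID string) []string {
	clusterID, ok := c.byTransaction[txnID]
	if !ok {
		return nil
	}
	similar := make([]string, 0, len(c.members[clusterID]))
	for _, other := range c.members[clusterID] {
		if other != txnID {
			similar = append(similar, other)
		}
	}
	return similar
}
```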
monetr/server/recurring/cluster.go, lines 112 to 168 at d9bf347