feat(api): Similar transaction storage. #1636

Open
elliotcourant opened this issue Dec 6, 2023 · 0 comments
func (p *PreProcessor) GetDatums() []Datum {
	datums := make([]Datum, 0, len(p.documents))
	docCount := float64(len(p.documents))
	// Calculate the smoothed inverse document frequency for every word.
	idf := make(map[string]float64, len(p.wc))
	for word, count := range p.wc {
		idf[word] = math.Log(docCount / (count + 1))
	}
	// Get a map of all the meaningful words and their index to use in the vector.
	minified := p.indexWords()
	// Define the length of the vector and adjust it to be divisible by 8. This
	// will enable us to leverage SIMD in the future. By using 8 we are
	// compatible with both AVX and AVX512.
	vectorLength := len(minified) + (8 - (len(minified) % 8))
	for i := range p.documents {
		// Get the current document we are working with.
		document := p.documents[i]
		// Calculate the TF-IDF for that document.
		for word, tfValue := range document.TF {
			document.TFIDF[word] = tfValue * idf[word]
			// If this specific word is meant to be more meaningful than TF-IDF
			// might treat it, then adjust it accordingly.
			if multiplier, ok := specialWeights[word]; ok {
				document.TFIDF[word] *= multiplier
			}
		}
		// Then create a vector of the words in the document name to use for
		// the DBSCAN clustering.
		document.Vector = make([]float64, vectorLength)
		words := 0
		for word, tfidfValue := range document.TFIDF {
			index, exists := minified[word]
			if !exists {
				continue
			}
			words++
			document.Vector[index] = tfidfValue
		}
		// If none of the document's words survived minification then it cannot
		// be clustered meaningfully; mark it invalid and skip it.
		if words == 0 {
			document.Valid = false
			p.documents[i] = document
			continue
		}
		document.Valid = true
		// Normalize the document's TF-IDF vector.
		calc.NormalizeVector64(document.Vector)
		// Then store the document back in.
		p.documents[i] = document
		datums = append(datums, Datum{
			ID:          document.ID,
			Transaction: document.Transaction,
			String:      document.String,
			Amount:      document.Transaction.Amount,
			Vector:      document.Vector,
		})
	}
	return datums
}

This piece of code generates the datums for the DBSCAN algorithm. But it also means that the data explaining why these datums are shaped the way they are is completely separate from the result of the DBSCAN.

I need to be able to compare the resulting DBSCAN clusters against the TF-IDF of the input. That way I can determine all of the unique words present in each cluster and pick out the most distinguishing ones, in order to produce a unique name for each cluster.

To that end, the pre-processor code needs to be merged with the DBSCAN code in such a way that the resulting clusters still have access to the data needed to back-reference the TF-IDF.


Specifically, the vector on the datum: since the vector is generated by assigning an index to each key-value pair in a Go map, the order of the vector can and will change between evaluations. Of course, the position of each word is irrelevant when we calculate the Euclidean distance between two vectors, as long as each word occupies the same position in both vectors.
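
To illustrate why the ordering is unstable, here is a minimal sketch of what indexWords might look like (a guess at the real implementation, which presumably also filters out meaningless words). Go's specification leaves map iteration order undefined, so the index assigned to each word changes between runs:

func (p *PreProcessor) indexWords() map[string]int {
	// Hypothetical sketch only. The `range` order over a Go map is not
	// specified and is deliberately randomized by the runtime, so each word
	// receives a different index on each evaluation.
	indexes := make(map[string]int, len(p.wc))
	i := 0
	for word := range p.wc {
		indexes[word] = i
		i++
	}
	return indexes
}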

Ideally this would provide a way for the vector's word indexes to be stored briefly, such that a resulting cluster could be fed back into the pre-processor, and the pre-processor could yield a map of the words present in that cluster and their relative weights.
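
As a rough sketch of that idea, assume the pre-processor holds onto the word-to-index map for the lifetime of one evaluation; the wordIndexes field and WordWeights method below are placeholder names, not existing API. Feeding a cluster's vectors back in would then recover the words present in the cluster and their summed weights:

func (p *PreProcessor) WordWeights(cluster [][]float64) map[string]float64 {
	weights := make(map[string]float64)
	for word, index := range p.wordIndexes {
		for _, vector := range cluster {
			// Sum the (normalized) TF-IDF weight this word contributed to each
			// member of the cluster; vectors that lack the word contribute zero.
			weights[word] += vector[index]
		}
	}
	// Drop the words that never appear anywhere in the cluster.
	for word, weight := range weights {
		if weight == 0 {
			delete(weights, word)
		}
	}
	return weights
}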

Given that, we could determine the most valuable words in each cluster, which should be the identifying factor for it. Words will be shared between clusters, but the most valuable words (in combination) should provide enough information to create a unique identifier for that cluster.

For example, take a cluster made up of transactions that are all named "Sentry. Merchant name: Sentry.". "Merchant name" will be present in every cluster (because of how Mercury formats their transactions), but the word "Sentry" will have a much higher score and should not conflict with the other clusters.
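
Continuing the sketch above, the cluster's identifier could be derived by ranking the recovered word weights and hashing the top few. The cutoff of three words and the FNV-1a hash are illustrative choices, not a settled design:

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func ClusterIdentifier(weights map[string]float64) string {
	type wordWeight struct {
		Word   string
		Weight float64
	}
	ranked := make([]wordWeight, 0, len(weights))
	for word, weight := range weights {
		ranked = append(ranked, wordWeight{word, weight})
	}
	// Highest weight first, breaking ties by word so the result is deterministic.
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].Weight != ranked[j].Weight {
			return ranked[i].Weight > ranked[j].Weight
		}
		return ranked[i].Word < ranked[j].Word
	})
	// Keep only the most valuable words; three is an arbitrary cutoff.
	if len(ranked) > 3 {
		ranked = ranked[:3]
	}
	hash := fnv.New64a()
	for _, item := range ranked {
		hash.Write([]byte(item.Word))
	}
	return fmt.Sprintf("%x", hash.Sum64())
}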


This solves one problem and creates another (granted, that other problem already existed).

  1. We can now uniquely identify clusters easily between evaluations. Clusters that are exceptionally identifiable (like the Sentry example above) should be consistently identified between cluster evaluations.
  2. Some clusters' identifiers will change over time. Slight changes in the transaction name (statements reading Quicken Loans at first, for example, and later changing to Rocket Mortgage) will cause either entirely new clusters to be created, or will cause the unique identifier for the existing cluster to change. The transactions that were part of cluster ABC1 are now identified as ABC2 because there were enough new transactions to change the weights of the most valuable identifiers, and thus the hash changes.

I don't think it's possible or reasonable to solve problem 2. There are some potential solutions, but all of them involve trying to pre-seed subsequent cluster evaluations with data that would nudge the algorithm toward clusters we have already seen. This might throw off the accuracy of the clustering or its ability to identify entirely new transaction groups.

But by making it so that the cluster's identifying hash can change, and planning around the idea that it will change, we make it easier to improve the clustering algorithm in the future. An entirely new algorithm could be introduced that creates entirely new clusters with completely different identifiers.

To support that mindset, clusters must meet the following requirements:

  1. Clusters must be stored such that they can be queried by the transactions they contain. This way, even if the cluster itself changes entirely, the application does not care; it just needs to know "what other transactions are similar to this one". So whatever storage medium is ultimately decided on, the way the data is queried needs to be kept in mind.
  2. The entire batch of clusters will be recalculated each time; we cannot easily do a partial calculation. So every time new transactions are imported, we need to be able to recalculate every cluster. This means the unique identifier for association rows (assuming we store them in PostgreSQL) will change with every single calculation, so we need to pick an identifier that is less volatile. A sketch of what this storage could look like follows this list.
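
To make requirement 1 concrete, here is a sketch assuming PostgreSQL as the storage medium: membership rows are keyed by transaction ID, so the query path ("what other transactions are similar to this one") survives a full recalculation even though the cluster hash changes underneath it. The table and function names are illustrative only:

import "database/sql"

// Illustrative only: a membership table keyed by transaction ID, so lookups
// survive a full recalculation of every cluster.
const createTransactionClustersTable = `
CREATE TABLE transaction_clusters (
    transaction_id BIGINT NOT NULL REFERENCES transactions (transaction_id),
    cluster_hash   TEXT   NOT NULL, -- Derived from the cluster's top words; expected to change.
    PRIMARY KEY (transaction_id)
);`

// FindSimilarTransactions answers "what other transactions are similar to this
// one" by joining through the membership row for the given transaction.
func FindSimilarTransactions(db *sql.DB, transactionID int64) ([]int64, error) {
	rows, err := db.Query(`
		SELECT other.transaction_id
		FROM transaction_clusters AS mine
		INNER JOIN transaction_clusters AS other ON other.cluster_hash = mine.cluster_hash
		WHERE mine.transaction_id = $1 AND other.transaction_id != $1;`,
		transactionID,
	)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var similar []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		similar = append(similar, id)
	}
	return similar, rows.Err()
}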
@elliotcourant elliotcourant added enhancement New feature or request api Related to or caused by the backend Go REST API. labels Dec 6, 2023
@elliotcourant elliotcourant self-assigned this Dec 6, 2023