feat: Improve recurring transaction detection. #1641

Draft
wants to merge 4 commits into
base: main

Conversation

elliotcourant
Member

This code is an experiment in isolating specific amounts within a known set of similar transactions.

The goal is to be able to cherry-pick specific transactions out of a cluttered dataset and identify them as recurring.

Amazon is a good example: there might be tons of Amazon transactions, but only a handful of them belong to something like an "Amazon Prime subscription". The goal here is to isolate the transactions that are part of a subscription.

@elliotcourant elliotcourant self-assigned this Dec 15, 2023
Comment on lines 303 to 308
bandwidth := SilvermansRuleOfThumb(data)

bandwidths := make([]float64, 0)
for i := 500; i < 5000; i += 10 {
	bandwidths = append(bandwidths, float64(i))
}
Member Author

Should also append the bandwidth from Silverman's rule of thumb here, and then sort the array.

We generate several bandwidths starting at $5.00 and going up in 10-cent increments, but including the value from Silverman's rule might make this even more accurate, since it has no such constraint. Or we might be able to tune our own increment to be 50 cents or even $1 instead.
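A minimal sketch of what that could look like, assuming the amounts are in cents as in the loop above. The helper name `candidateBandwidths` and its parameters are hypothetical and not part of this PR; the rule-of-thumb value is passed in rather than computed, so nothing is assumed about how `SilvermansRuleOfThumb` is implemented.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// candidateBandwidths is a hypothetical helper (not from the PR). It builds
// the fixed grid of bandwidths [start, stop) in increments of step, appends
// the rule-of-thumb bandwidth if it is usable, and returns the result sorted
// in ascending order.
func candidateBandwidths(ruleOfThumb float64, start, stop, step int) []float64 {
	bandwidths := make([]float64, 0)
	if step <= 0 {
		// A non-positive step would never terminate, so return an empty grid.
		return bandwidths
	}
	for i := start; i < stop; i += step {
		bandwidths = append(bandwidths, float64(i))
	}
	// Skip NaN, infinite, or non-positive values, which are not valid bandwidths.
	if ruleOfThumb > 0 && !math.IsInf(ruleOfThumb, 0) {
		bandwidths = append(bandwidths, ruleOfThumb)
	}
	sort.Float64s(bandwidths)
	return bandwidths
}

func main() {
	// 1234.5 stands in for whatever SilvermansRuleOfThumb(data) returns.
	bandwidths := candidateBandwidths(1234.5, 500, 5000, 10)
	fmt.Println(len(bandwidths), bandwidths[0], bandwidths[len(bandwidths)-1])
}
```

Changing the last argument to 50 or 100 would give the 50-cent or $1 increments mentioned above without touching the rest of the code.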

@elliotcourant
Member Author

TODO

Need to implement some kind of data smoothing. I'm thinking Gaussian smoothing might be the best fit for the dataset, since it's also reasonable to implement in Go:

func gaussianKernel(size int, sigma float64) []float64 {
	kernel := make([]float64, size)
	sum := 0.0
	m := size / 2

	for i := 0; i < size; i++ {
		diff := float64(i - m)
		kernel[i] = math.Exp(-(diff * diff) / (2 * sigma * sigma))
		sum += kernel[i]
	}

	// Normalize the kernel
	for i := range kernel {
		kernel[i] /= sum
	}

	return kernel
}

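func gaussianSmooth(data []float64, sigma float64) []float64 {
	if sigma <= 0 {
		// A non-positive sigma would produce a NaN or invalid kernel,
		// so return an unsmoothed copy of the input instead.
		return append([]float64(nil), data...)
	}

	size := int(sigma * 6) // a common choice for kernel size (about ±3 sigma)
	if size%2 == 0 {
		size++ // ensure kernel size is odd
	}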

	kernel := gaussianKernel(size, sigma)
	halfSize := size / 2
	smoothedData := make([]float64, len(data))

	for i := range data {
		var weightedSum float64
		var weightSum float64

		for j := -halfSize; j <= halfSize; j++ {
			if i+j >= 0 && i+j < len(data) {
				weight := kernel[halfSize+j]
				weightedSum += data[i+j] * weight
				weightSum += weight
			}
		}

		smoothedData[i] = weightedSum / weightSum
	}

	return smoothedData
}

This should also improve peak detection.

Need to experiment with various sigma values. Or should sigma be determined by the bandwidth value?
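One way to compare sigma values would be to count the local maxima that survive smoothing. The sketch below is only an illustration of that idea: `findPeaks` is a hypothetical helper, not the peak detection used in this PR, and it assumes a peak is a point strictly greater than both of its neighbours (so plateaus and the two endpoints are never reported).

```go
package main

import "fmt"

// findPeaks is a hypothetical helper (not from the PR). It returns the
// indices of strict local maxima: points that are greater than both the
// previous and the next value. Flat plateaus and endpoints are ignored.
func findPeaks(data []float64) []int {
	peaks := make([]int, 0)
	for i := 1; i < len(data)-1; i++ {
		if data[i] > data[i-1] && data[i] > data[i+1] {
			peaks = append(peaks, i)
		}
	}
	return peaks
}

func main() {
	// Made-up density values for illustration only.
	density := []float64{0, 1, 0, 2, 5, 2, 0}
	fmt.Println(findPeaks(density))
}
```

Running this on the output of gaussianSmooth for a range of sigma values would show how quickly small, noisy peaks disappear as sigma grows, which might help decide whether sigma should be tied to the bandwidth.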

@codecov-commenter

codecov-commenter commented Dec 15, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (63b5d13) 51.05% compared to head (2b0528c) 51.04%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1641      +/-   ##
==========================================
- Coverage   51.05%   51.04%   -0.01%     
==========================================
  Files         321      321              
  Lines       17403    17403              
  Branches      438      438              
==========================================
- Hits         8885     8884       -1     
- Misses       8032     8033       +1     
  Partials      486      486              


Recurring transactions are complicated. I want to try to isolate
specific amounts within a dataset of known similar transactions. This
way I can determine which transactions are most likely to be recurring,
and narrow the results down to be more accurate. Some similar
transactions might actually be two subscriptions, or there may be other
patterns. This is really just throwing stuff at the wall and seeing what
sticks.
@elliotcourant elliotcourant force-pushed the experiment/recurring-amount-isolation branch from 100a2a8 to 8060ecc Compare December 15, 2023 22:56

2 participants