
Online, but what about streaming? #13

Open
BoltzmannBrain opened this issue Jun 6, 2016 · 6 comments

@BoltzmannBrain

The algorithm runs online, but with the assumption that we have the length of the dataset a priori. What about streaming scenarios where we have a continuous stream of data? Is there an (efficient) way to run the online algorithm without knowing the length of the dataset?

@hildensia
Owner

I'll have a look into it. I think the actual algorithm is capable of that, but the implementation is not engineered for it. I guess it will need only some refactoring and no actual fiddling around with the math.

@hildensia hildensia self-assigned this Jun 7, 2016
@hildensia
Owner

Ok. I implemented a class with the ability to add_data(self, data). Can you have a look at 0f9ff01 within branch real_online? I haven't tested it yet; would you be so kind as to give it a try?
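
For what it's worth, a minimal sketch of what such an add_data-style wrapper could look like (names other than add_data are placeholders, and this is not necessarily what the commit does). It assumes an observation_likelihood object with pdf(x) and update_theta(x), as in the existing code, and simply grows R by one row and one column per datum:

import numpy as np

class OnlineDetector:
    # Hypothetical streaming wrapper: grows the run-length matrix one
    # timestep at a time instead of preallocating it for a known length.
    def __init__(self, hazard_func, observation_likelihood):
        self.hazard_func = hazard_func
        self.likelihood = observation_likelihood
        self.R = np.array([[1.0]])  # run-length distribution so far
        self.t = 0

    def add_data(self, data):
        for x in np.atleast_1d(data):
            self._step(x)
        return self.R

    def _step(self, x):
        t = self.t
        # Grow R by one timestep (column) and one run-length hypothesis (row).
        R = np.zeros((t + 2, t + 2))
        R[:t + 1, :t + 1] = self.R
        # Predictive probability of x under each current run-length hypothesis.
        predprobs = self.likelihood.pdf(x)
        H = self.hazard_func(np.arange(t + 1))
        # Growth probabilities: the current run continues.
        R[1:t + 2, t + 1] = R[0:t + 1, t] * predprobs * (1 - H)
        # Changepoint probability: mass collapses back to run length 0.
        R[0, t + 1] = np.sum(R[0:t + 1, t] * predprobs * H)
        # Renormalize for numerical stability.
        R[:, t + 1] /= np.sum(R[:, t + 1])
        self.likelihood.update_theta(x)
        self.R = R
        self.t = t + 1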

@BoltzmannBrain
Author

FWIW I tried an approach where I started the run-lengths matrix as an empty list [[]] and appended a row and a column for each subsequent timestep. This became very inefficient very quickly, even with fewer than 1000 data records. The key here is to figure out a way to not keep old data in the run-lengths matrix R. This can be done column-wise (see the sketch below), keeping only the columns for the current timestep and timestep+1, but the rows still grow indefinitely.
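
For concreteness, the column-wise part could look roughly like this (a sketch using the same assumed hazard_func and observation_likelihood objects as the existing code); only one run-length column is kept in memory, but that column still grows by one entry per timestep:

import numpy as np

# Sketch of the column-wise idea: keep only the current run-length column and
# compute the next one from it, instead of storing the full T x T matrix R.
# The column still grows by one possible run length per step.
def step(R_col, x, hazard_func, observation_likelihood):
    n = len(R_col)  # number of run-length hypotheses currently kept
    # Slicing guards against the likelihood's parameter arrays being longer
    # than the kept column (e.g. after truncation); here it assumes index r
    # corresponds to run length r, as in the existing code.
    predprobs = observation_likelihood.pdf(x)[:n]
    H = hazard_func(np.arange(n))
    R_next = np.zeros(n + 1)
    R_next[1:] = R_col * predprobs * (1 - H)   # run continues
    R_next[0] = np.sum(R_col * predprobs * H)  # changepoint, mass to r = 0
    R_next /= np.sum(R_next)
    observation_likelihood.update_theta(x)
    return R_next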

@hildensia
Owner

hildensia commented Jun 9, 2016

I see. My first guess for a more efficient approximation would be that we could ignore many R values in the sums, as they might be relatively small. I.e., if R[23, 123] < eps, then R[24:, 123] can be ignored. Would that help?

A sketch:

import numpy as np
# (The rest is the body of the existing per-timestep loop: t, x, R, maxes,
# eps, hazard_func and observation_likelihood come from that context.)

def find_first_t_smaller_eps(R, eps):
    # Return the index of the first run-length probability below eps;
    # if none is below eps, keep the whole column.
    for t, r in enumerate(R):
        if r < eps:
            return t
    return len(R)

# Evaluate the predictive distribution for the new datum under each of
# the parameters.  This is the standard thing from Bayesian inference.
predprobs = observation_likelihood.pdf(x)

# Find the first occurrence of a probability smaller than a defined epsilon.
tm = find_first_t_smaller_eps(R[0:t+1, t], eps)

# Evaluate the hazard function for this interval.
H = hazard_func(np.array(range(tm)))

# Evaluate the growth probabilities - shift the probabilities down and to
# the right, scaled by the hazard function and the predictive
# probabilities.  Only the first tm run-length hypotheses are kept.
R[1:tm+1, t+1] = R[0:tm, t] * predprobs[0:tm] * (1-H)

# Evaluate the probability that there *was* a changepoint and we're
# accumulating the mass back down at r = 0.
R[0, t+1] = np.sum(R[0:tm, t] * predprobs[0:tm] * H)

# Renormalize the run length probabilities for improved numerical
# stability.
R[:tm+1, t+1] = R[:tm+1, t+1] / np.sum(R[:tm+1, t+1])

# Update the parameter sets for each possible run length.
observation_likelihood.update_theta(x)

maxes[t] = R[:, t].argmax()

Thus we ignore all probabilities smaller than eps and can compute the sums more efficiently. To be more space efficient, we could even remove everything above the biggest tm.
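
In terms of a single run-length column (as in the sketch further up), that truncation could be as simple as the following; this is only one way to read the idea, and to really bound memory the parameter arrays inside observation_likelihood would have to be pruned to the same length as well:

# Sketch: drop every run-length hypothesis from the first probability below
# eps onwards, so the run-length vector stops growing between changepoints.
def truncate(R_col, eps):
    tm = find_first_t_smaller_eps(R_col, eps)
    truncated = R_col[:tm + 1].copy()
    return truncated / np.sum(truncated)  # renormalize after dropping the tail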

@BoltzmannBrain
Author

This may work well; it's hard for me to say without running it on some datasets.

I've implemented a solution as an anomaly detection algorithm in NAB, and will link it here once I clean it up a bit and push it. The main idea is that, because we're only interested in the points at which the data stream changes (i.e., a changepoint), we can keep only the data that would reveal a changepoint at the current timestep. It actually works slightly better than using the entire preallocated R matrix 😉
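
Something in that spirit, just to make the idea concrete (a rough sketch reusing the step and truncate helpers sketched above, not the actual NAB detector): only the newest, truncated run-length column is kept, and the mass at run length 0 is read off as the changepoint signal.

# Rough sketch of the streaming loop; stream, hazard_func,
# observation_likelihood, eps and threshold are assumed to exist.
R_col = np.array([1.0])
for t, x in enumerate(stream):
    R_col = step(R_col, x, hazard_func, observation_likelihood)
    R_col = truncate(R_col, eps)
    if R_col[0] > threshold:  # probability that a changepoint just occurred
        print("possible changepoint at t =", t)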

@BoltzmannBrain
Author

The online algorithm is implemented here as an anomaly detector in NAB.
