Implement multidimensional discrete distribution #830

Merged
1 commit merged on Feb 27, 2017

Conversation

@MikeLing
Contributor

MikeLing commented Dec 15, 2016

No description provided.


Member

rcurtin commented Dec 20, 2016

Hi there @MikeLing, thanks for the contribution. Unfortunately I'm not totally sure what this overload does since there are no comments.

Ideally, for multidimensional discrete distribution support, the API shouldn't change; it should just be a generalization. Note that right now, when you call DiscreteDistribution::Probability(), the parameter const arma::vec& observation must be a one-dimensional vector. The idea is to generalize this so that a discrete distribution can have many dimensions.

The distribution should have zero correlation across dimensions, so if I have the vector "a, b", then P(a, b) = P(a)P(b).


Contributor

MikeLing commented Dec 21, 2016

Hi @rcurtin, thank you for your help. :)

The Probability() function does accept a one-dimensional vector [1], but we only use the first value of that vector. So the idea here is to make DiscreteDistribution::Probability() able to do something like:

d.Probability("0") -> P(0)

d.Probability("0, 1") -> P(0)P(1)

Is that correct?
[1] https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/dists/discrete_distribution.hpp#L95


Member

rcurtin commented Dec 21, 2016

Yep, exactly. But each dimension may have a different set of probabilities associated with it. So to be more clear...

d.Probability("0, 1") -> P_0(0) P_1(1)

where P_0() is the PDF associated with the first dimension and P_1() is the PDF associated with the second dimension.
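The independence assumption described above can be sketched with plain C++ containers. In mlpack the per-dimension PMFs would live in arma::vec objects; the std::vector stand-ins and the JointProbability name below are purely illustrative, not the library's actual API:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// With zero correlation across dimensions, the joint probability of an
// observation is the product of each dimension's marginal PMF:
// P(obs) = P_0(obs[0]) * P_1(obs[1]) * ...
double JointProbability(const std::vector<std::size_t>& observation,
                        const std::vector<std::vector<double>>& probabilities)
{
  double p = 1.0;
  for (std::size_t d = 0; d < observation.size(); ++d)
    p *= probabilities[d][observation[d]];  // P_d(observation[d])
  return p;
}
```

For example, with per-dimension PMFs {0.25, 0.75} and {0.4, 0.6}, the observation (0, 1) gets probability 0.25 * 0.6.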


Contributor

MikeLing commented Dec 21, 2016

Sure thing, I got it, thank you :)

BTW, could you tell me your usual schedule? I'm in UTC+8, so I'd like to know when I can get in touch with you if I run into issues or find an issue I'm interested in :)

+ DiscreteDistribution(const size_t numObservations, const size_t dimension=1):
+ probabilities(dimension, numObservations)
+ {
+ probabilities.ones();


@MikeLing

MikeLing Dec 22, 2016

Contributor

Hi @rcurtin, could I define and initialize a matrix like this? Why does this fail, and why can't I use probabilities.set_size(dimension, numObservations) either?

I googled it, but that didn't help. I still have no idea how to use the arma::mat class, even after reading some of its documentation.

Thank you



Contributor

MikeLing commented Dec 30, 2016

Hi @rcurtin, I just implemented the multidimensional discrete distribution, but I can't make it work in the HMM for Baum-Welch. Boost tells me the index is out of bounds in SimplestBaumWelchDiscreteHMM and SimpleBaumWelchDiscreteHMM, but not exactly where things went wrong. I think I should look through the whole thing.

Could you tell me if the code above fits what you had in mind? The discrete distribution works well except for the Gamma distribution. Please tell me if there is anything else I should address :) Thank you

src/mlpack/tests/distribution_test.cpp
+ {0.25, 0.35, 0.4}};
+ BOOST_REQUIRE_CLOSE(d.Probability("0"), 0.1, 1e-5);
+ BOOST_REQUIRE_CLOSE(d.Probability("1"), 0.4, 1e-5);
+ BOOST_REQUIRE_CLOSE(d.Probability("0, 1"), 0.03, 1e-5);


@rcurtin

rcurtin Dec 30, 2016

Member

These should all throw errors. The distribution is 3-dimensional, so every input to Probability() should be 3-dimensional also.



Member

rcurtin commented Dec 30, 2016

I think you need to add more rigorous tests to ensure that the multidimensional discrete distribution is behaving like other multidimensional distributions before you start debugging what is wrong with the HMMs. It will probably make debugging a lot easier.


Contributor

MikeLing commented Jan 27, 2017

@rcurtin could you take a look at this? Thank you :)


Member

rcurtin commented Jan 28, 2017

Hey @MikeLing, thanks for taking the time to update this. This looks a lot better. But what happens if a user has, e.g., a 2-dimensional discrete distribution where the first dimension has 4 observations and the second dimension has 6?

I think that instead of using arma::mat for the probabilities matrix you'll have to use something more flexible like std::vector&lt;arma::vec&gt;.

The rest of the code looks good though---the tests are fine, the Train() implementation looks right, and the Probability() implementation looks right. I wonder if we can now remove the old implementation of Train().
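The ragged storage rcurtin describes can be sketched as follows. This is a hypothetical MakeUniformDistribution helper using std::vector&lt;std::vector&lt;double&gt;&gt; in place of std::vector&lt;arma::vec&gt;, just to show why an arma::mat (which forces every dimension to have the same number of entries) doesn't fit:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Each dimension gets its own PMF vector, so a 2-dimensional distribution
// can have 4 categories in the first dimension and 6 in the second.
std::vector<std::vector<double>> MakeUniformDistribution(
    const std::vector<std::size_t>& categoriesPerDim)
{
  std::vector<std::vector<double>> probabilities;
  for (std::size_t n : categoriesPerDim)
    probabilities.emplace_back(n, 1.0 / n);  // uniform PMF in this dimension
  return probabilities;
}
```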


Member

rcurtin commented Jan 28, 2017

It also looks like there are some small compilation errors that will have to be addressed. Note that Travis may not have the newest version of Armadillo, so some Armadillo functionality you are using may not be available.


Contributor

MikeLing commented Jan 30, 2017

Hi @rcurtin, I refactored the discrete distribution with std::vector, and the tests for the multidimensional discrete distribution pass. Please tell me if the multidimensional discrete distribution support looks good to you, and I will move on to the next step, which is refactoring the HMM code. I would greatly appreciate any advice you can give me on the HMM refactor! Thank you :)

src/mlpack/methods/hmm/hmm_impl.hpp
- emissions.col(i) = dataSeq[emissionList[state][i].first].col(
- emissionList[state][i].second);
+ emissions[i] = dataSeq[emissionList[state][i].first].col(
+ emissionList[state][i].second);
}
emission[state].Train(emissions);


@MikeLing

MikeLing Jan 31, 2017

Contributor

@rcurtin Why does it keep telling me no matching member function for call to 'Train'? I have no idea why it would call Train in the discrete distribution. Could you give me some help with that? Here is the full error I got: https://pastebin.mozilla.org/8971379



Member

rcurtin commented Feb 6, 2017

Hi @MikeLing, thanks for taking the time to do the refactoring. The HMM code shouldn't need to be refactored though. The signature of DiscreteDistribution::Train() and DiscreteDistribution::Probability() shouldn't need to change. The observations can be given as arma::vec not std::vector<arma::vec>---but the internal implementation has to use std::vector<arma::vec> because each dimension may have a different number of features. Does that make sense? If not let me know and I will try to clarify.


Contributor

MikeLing commented Feb 8, 2017

Hi @rcurtin, thank you for your comment!

The signature of DiscreteDistribution::Train() and DiscreteDistribution::Probability() shouldn't need to change. The observations can be given as arma::vec not std::vector&lt;arma::vec&gt;

Since we store the probabilities as std::vector&lt;arma::vec&gt;, which is multidimensional, how can we know how many observations are in each dimension when we train the distribution via DiscreteDistribution::Train() with an arma::vec? E.g., as you said, suppose we want to train a 2-dimensional discrete distribution where the first dimension has 4 observations and the second has 6: how can we know that the first 4 observations belong to the first dimension and the last 6 to the second, given an observation vector like arma::vec("5 3 6 1 6 3 2 1 1 2")? Otherwise, maybe we also need to pass a slice_step of [4, 6] to indicate the boundaries between the dimensions of the observations.

The same goes for DiscreteDistribution::Probability(): it will return std::vector&lt;arma::vec&gt; when we define probabilities as std::vector&lt;arma::vec&gt;. Alternatively, we could define another Probability() overload, like DiscreteDistribution::Probability(int32_t dimension), if we want to query a single dimension's distribution.


Member

rcurtin commented Feb 9, 2017

When the user constructs the object, they are telling us how many items are in each dimension. Let's take the same example, with a 2-dimensional discrete distribution with 4 different observations in the first dimension and 6 in the second:

arma::Col<size_t> dims("4 6"); // 4 in the first, 6 in the second.
DiscreteDistribution d(dims);

// Then we can train with the following dataset.
// Note that how I've written it below, it needs to be transposed.
arma::mat dataset("0 1; 1 3; 3 5; 1 1; 0 0; 0 5; 0 1");
d.Train(dataset.t());

Does that make sense? Each dimension can have a different number of possible values, but in the end, the user should still be able to pass an arma::vec that represents a single observation. Sorry if anything I've written is confusing, let me know and I can try to clarify.
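A minimal sketch of the training step this example implies, assuming each observation is one column and the per-dimension category counts were fixed at construction. The std::vector types are stand-ins for the Armadillo containers, and while the Train name mirrors the method under discussion, this is not mlpack's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Count how often each category appears in each dimension, then normalize
// each dimension's counts into a PMF.
void Train(const std::vector<std::vector<std::size_t>>& columns,
           std::vector<std::vector<double>>& probabilities)
{
  for (auto& pmf : probabilities)
    pmf.assign(pmf.size(), 0.0);          // clear old probabilities up front
  for (const auto& obs : columns)         // one observation per column
    for (std::size_t d = 0; d < obs.size(); ++d)
      probabilities[d][obs[d]] += 1.0;    // count occurrences per dimension
  for (auto& pmf : probabilities)
  {
    double sum = 0.0;
    for (double v : pmf) sum += v;
    for (double& v : pmf) v /= sum;       // normalize each dimension's PMF
  }
}
```

Training on the columns (0,1), (1,3), (3,5), (1,1) of a 2-dimensional distribution with 4 and 6 categories would, for instance, give the first dimension's category 1 a probability of 2/4 = 0.5.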


Contributor

MikeLing commented Feb 12, 2017

@rcurtin, I updated this PR and am asking for review.

BTW, even though Travis and AppVeyor failed, HMMTest, DistributionTest and SerializationTest pass locally. I have no idea why.

@rcurtin

The serialization test is failing because the test itself needs to be modified---take a look at serialization_test.cpp at line 153.

Can you also please go through your changes and make sure they match the style guide?

After that, feel free to add your name to src/mlpack/core.hpp and COPYRIGHT.txt and then once we have handled all the comments we can merge this in.

Thanks for being persistent and getting this worked out; it took many iterations but we are close now. :)

{
result[0] = obs;
return result;
}
}
// This shouldn't happen.
- result[0] = probabilities.n_elem - 1;
+ result[0] = probabilities[1].n_elem - 1;
return result;


@rcurtin

rcurtin Feb 14, 2017

Member

We should generate an n-dimensional vector here, not a one-dimensional vector. So we should loop over probabilities instead of only considering probabilities[0], and result should have size probabilities.size(), not 1.

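The fix described here (one sampled value per dimension) could look like the sketch below: an inverse-CDF walk over each dimension's PMF. The Random name, std::mt19937 plumbing, and std::vector stand-ins are all illustrative assumptions, not mlpack's actual signature:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draw one category per dimension: for each dimension, walk the CDF until it
// passes a uniform random draw.
std::vector<std::size_t> Random(
    const std::vector<std::vector<double>>& probabilities, std::mt19937& rng)
{
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  std::vector<std::size_t> result(probabilities.size());  // n-dimensional
  for (std::size_t d = 0; d < probabilities.size(); ++d)
  {
    // Fall back to the last category in case rounding error keeps the
    // cumulative sum below the draw (the "this shouldn't happen" branch).
    result[d] = probabilities[d].size() - 1;
    const double randObs = unif(rng);
    double sumProb = 0.0;
    for (std::size_t obs = 0; obs < probabilities[d].size(); ++obs)
    {
      sumProb += probabilities[d][obs];
      if (sumProb >= randObs) { result[d] = obs; break; }
    }
  }
  return result;
}
```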

- // Clear old probabilities.
- probabilities.zeros();
+ // Deal with the difference between vector and matrix
+ const arma::mat transObservations = observations.t();


@rcurtin

rcurtin Feb 14, 2017

Member

I don't see the need for the transposition here, I think this will just slow things down. Can you rewrite this working with the original observations and not transObservations please?


}
/**
* Estimate the probability distribution from the given observations when also
* given probabilities that each observation is from this distribution.
*/
void DiscreteDistribution::Train(const arma::mat& observations,
- const arma::vec& probObs)
+ const arma::mat& probObs)


@rcurtin

rcurtin Feb 14, 2017

Member

probObs should remain an arma::vec here, not be changed to a mat. That vector indicates the probability that a given column from observations comes from the distribution. So whereas in Train(observations), we have the line

probabilities[i][obs]++;

instead in Train(observations, probObs), we should have the line

probabilities[i][obs] += probObs[i];

Does that make sense? Let me know if I can clarify.



@MikeLing

MikeLing Feb 17, 2017

Contributor

Hi @rcurtin, I still haven't got the point here after trying it. For example, take the test here: https://github.com/mlpack/mlpack/pull/830/files#diff-ad93795d91105ba0fcc6e9815d5e9ad0R106

The observations are one-dimensional ("0 0 1 2") and prob is ("0.25 0.25 0.5 1.0"), and i, as the dimension index, stays at 0, since there is only one dimension.

When we have the line

probabilities[i][obs] += probObs[i];

here comes the problem: probObs[i] is always 0.25, so after looping over the observations the probabilities become (0.5[0, 0], 0.25[1], 0.25[2]).

Maybe you mean probabilities[i][obs] += probObs[obs];? If so, what would happen with a multidimensional observation like

"0 1 1 2 2 2;"
"0 0 0 1 1 2;"
"0 1 1 2 2 2;"



@rcurtin

rcurtin Feb 17, 2017

Member

Ah, yeah, I miswrote here. You should be adding the probability of the observation to probabilities[i][obs]. That means that really it should be

probabilities[i][obs] += probObs[r];

Sorry about the confusion.

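The corrected weighted update can be sketched as follows: probObs[r] is the weight of column r, and that weight is added once per dimension of column r. As before, the std::vector types stand in for the Armadillo ones and TrainWeighted is a hypothetical name:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted training: each column r of the observations contributes
// probObs[r] (the probability it came from this distribution) to the count
// of its category in every dimension.
void TrainWeighted(const std::vector<std::vector<std::size_t>>& columns,
                   const std::vector<double>& probObs,
                   std::vector<std::vector<double>>& probabilities)
{
  for (auto& pmf : probabilities)
    pmf.assign(pmf.size(), 0.0);
  for (std::size_t r = 0; r < columns.size(); ++r)
    for (std::size_t i = 0; i < columns[r].size(); ++i)
      probabilities[i][columns[r][i]] += probObs[r];  // probObs[r], not [i]
  for (auto& pmf : probabilities)
  {
    double sum = 0.0;
    for (double v : pmf) sum += v;
    if (sum > 0.0)
      for (double& v : pmf) v /= sum;  // normalize each dimension's PMF
  }
}
```

On the one-dimensional example from this thread (observations "0 0 1 2" with weights "0.25 0.25 0.5 1.0"), the weighted counts are 0.5, 0.5, 1.0, which normalize to 0.25, 0.25, 0.5.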

+ const std::vector<arma::vec>& MultipleProbabilities() const { return probabilities; }
+
+ //! Return the vector matrix of probabilities.
+ std::vector<arma::vec>& MultipleProbabilities() { return probabilities; }


@rcurtin

rcurtin Feb 14, 2017

Member

I think we can just return std::vector<arma::vec>& for Probabilities(), so you can remove the existing Probabilities() implementations and replace them with MultipleProbabilities().



@MikeLing

MikeLing Feb 20, 2017

Contributor

Sorry for the late reply on this comment.

Hmmm, if so, we need to wrap a lot of arma::vec into std::vector, like 00baf60#diff-ad93795d91105ba0fcc6e9815d5e9ad0R68 and some places in hmm_impl.hpp (sorry, I forget the exact places).

Maybe we could keep them both; I don't think it would confuse the users. :)



@rcurtin

rcurtin Feb 22, 2017

Member

I think Probabilities() is really only used in the tests, so it's not a problem to make this change. Maybe a better solution is this:

const arma::vec& Probabilities(const size_t dim = 0)
{
  return probabilities[dim];
}

That should actually be backward compatible too, which is nice!



@MikeLing

MikeLing Feb 23, 2017

Contributor

const arma::vec& Probabilities(const size_t dim = 0)
{
  return probabilities[dim];
}

Does that mean we could keep MultipleProbabilities()? If not, what should we do when we want to initialize the discrete distribution in the first place, like https://github.com/mlpack/mlpack/pull/830/files#diff-ad93795d91105ba0fcc6e9815d5e9ad0R163?



@rcurtin

rcurtin Feb 23, 2017

Member

I don't think we should keep MultipleProbabilities(); a user can just use the constructor that takes std::vector<arma::vec> instead. So you could adapt the test to be

std::vector<arma::vec> pro{{0.1, 0.3, 0.6},
                           {0.3, 0.3, 0.3},
                           {0.25, 0.25, 0.5}};
DiscreteDistribution d(pro);
src/mlpack/tests/distribution_test.cpp
+ BOOST_REQUIRE_EQUAL(d.MultipleProbabilities().size(), 4);
+ BOOST_REQUIRE_EQUAL(d.MultipleProbabilities()[0].n_elem, 4);
+ BOOST_REQUIRE_CLOSE(d.Probability("0 0 0"), 0.015625, 1e-5);
+ BOOST_REQUIRE_CLOSE(d.Probability("0 1 2"), 0.015625, 1e-5);


@rcurtin

rcurtin Feb 14, 2017

Member

Shouldn't this throw an error? d is declared as having 4 dimensions (each with 4 observations), but then Probability() is called with 3-dimensional vectors.



Contributor

MikeLing commented Feb 24, 2017

@rcurtin, asking for review on this PR :)

- // This shouldn't happen.
- result[0] = probabilities.n_elem - 1;
+ if (sumProb >= randObs != true)


@rcurtin

rcurtin Feb 24, 2017

Member

Why the != true? That seems strange. This should be just if (sumProb >= randObs), I think.



@MikeLing

MikeLing Feb 25, 2017

Contributor

Yes, it is. Thank you for pointing it out :)


+ if (sum > 0)
+ probabilities[i] /= sum;
+ else
+ probabilities[i].fill(1.0 / probabilities[i].n_elem); // Force normalization.


@rcurtin

rcurtin Feb 24, 2017

Member

So, I suspect this will be a lot faster if you invert the loop. Right now the outer loop is over dimensions and the inner loop is over points, but because of the way Armadillo uses memory we should always try to have the inner loop be over dimensions. Do you think you can adapt this loop that way? I can explain further if you like, but it should be an easy change. :)



@MikeLing

MikeLing Feb 25, 2017

Contributor

Hi @rcurtin, hmmm, I'm not sure I got the true meaning of "invert". I just wrote a demo, so no need to review it again. :) NOTE: I use another for loop over each dimension of probabilities to normalize it, and that's the reason I used dimensions as the outer loop's index in the first place.

  // Clear the old probabilities once, before counting.
  for (size_t i = 0; i < dimensions; i++)
    probabilities[i].zeros();

  // Iterate over observations; each column is one observation.
  for (size_t r = 0; r < observations.n_cols; r++)
  {
    // Iterate over the dimensions of this observation.
    for (size_t i = 0; i < dimensions; i++)
    {
      // Add the probability of each observation.  The addition of 0.5 to the
      // observation is to turn the default flooring operation of the size_t
      // cast into a rounding operation.
      const size_t obs = size_t(observations(i, r) + 0.5);

      // Ensure that the observation is within the bounds.
      if (obs >= probabilities[i].n_elem)
      {
        Log::Debug << "DiscreteDistribution::Train(): observation " << r
            << " (" << obs << ") is invalid; observation must be in [0, "
            << probabilities[i].n_elem << "] for this distribution." << std::endl;
      }
      probabilities[i][obs]++;
    }
  }

  // Now normalize the distribution.
  for (size_t i = 0; i < dimensions; i++)
  {
    double sum = accu(probabilities[i]);
    if (sum > 0)
      probabilities[i] /= sum;
    else
      probabilities[i].fill(1.0 / probabilities[i].n_elem); // Force normalization.
  }
+ * @param numObservations Number of possible observations this distribution
+ * can have.
+ */
+ DiscreteDistribution(const arma::vec& numObservations)

@rcurtin

rcurtin Feb 24, 2017

Member

I realized, we can make this arma::Col<size_t> not arma::vec and that would be cleaner, but I can do that after merge if you like. I don't want to have an endless stream of comments so that this never gets merged. :)

@MikeLing

MikeLing Feb 25, 2017

Contributor

Yeah, I can send a cleanup PR after this one has been merged. :)

@rcurtin


rcurtin Feb 24, 2017

Member

Looks great, thanks for the hard work. I think only the couple comments I had need to be addressed and I'll merge it. Do you want to add your name to the list of contributors in COPYRIGHT.txt and src/mlpack/core.hpp?

Also I see the AppVeyor build is failing, but I am on a plane so there is not enough bandwidth for me to check what the issue might be... I will look into it later...


@MikeLing


MikeLing Feb 25, 2017

Contributor

Thank you for your review and help. Yeah, please add me to the list of contributors in COPYRIGHT.txt and src/mlpack/core.hpp if I can have that honor. This PR took a long time to finish, but I learned a ton of things from it. I promise I will keep working on mlpack as long as I can :) Thank you!


@rcurtin rcurtin merged commit 400e2eb into mlpack:master Feb 27, 2017

1 of 2 checks passed

continuous-integration/appveyor/pr AppVeyor build failed
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@rcurtin

rcurtin Feb 27, 2017

Member

Hi @MikeLing, I handled a few things after merge:

  • I found a bug with serialization with older versions of Boost that I was able to fix with e0f6e97.
  • I fixed some style issues with aa62610.
  • I fixed the bug in Random() with 7cc0a44.
  • The constructor now uses arma::Col<size_t> in ab7e6f2.
  • I inverted the loops for better memory usage in 5b0abcd.

I can't add your name to the list of contributors though since I don't know what name or email you'd like me to use. If you would like to open another PR for that I can merge it.


@rcurtin rcurtin referenced this pull request Feb 27, 2017

Closed

How to organize the data? #809

@MikeLing


MikeLing Feb 28, 2017

Contributor

@rcurtin Thank you for your help! :)


@rcurtin rcurtin added this to the mlpack 2.2.0 milestone Mar 17, 2017
