# Implement multidimensional discrete distribution #830

Merged
merged 1 commit into from Feb 27, 2017

## Conversation

Projects
None yet
2 participants
Contributor

### MikeLing commented Dec 15, 2016

 No description provided.

Closed

Member

### rcurtin commented Dec 20, 2016

 Hi there @MikeLing, thanks for the contribution. Unfortunately I'm not totally sure what this overload does since there are no comments. Ideally, for multidimensional discrete distribution support, the API shouldn't change, it should just be a generalization. Note that now, when you call `DiscreteDistribution::Probability()`, the parameter `const arma::vec& observation` must be a one-dimensional vector. The idea is to generalize this, so that a discrete distribution can have many dimensions. The distribution should have zero correlation across dimensions, so if I have the vector "a,b" then P(a,b) = P(a)P(b).
Contributor

### MikeLing commented Dec 21, 2016

 Hi @rcurtin, thank you for your help. :) But the `Probability()` do accept a one dimensional vector[1], but we only use the first value of that vector. So the idea in here is about to make `DiscreteDistribution::Probability()` can do something like: d.Probability("0") -> P(0) d.Probability("0, 1") -> P(0)P(1)
Member

### rcurtin commented Dec 21, 2016

 Yep, exactly. But each dimension may have a different set of probabilities associated with it. So to be more clear... d.Probability("0, 1") -> P_0(0) P_1(1) where P_0() is the PDF associated with the first dimension and P_1() is the PDF associated with the second dimension.
Contributor

### MikeLing commented Dec 21, 2016

 surething, I got it, thank you :) BTW, could you tell me your time schedule normally? Because I'm in UTC+8, so I want to know when I could get touch with you if I have issues or interested issue :)

### MikeLing reviewed Dec 22, 2016

 + DiscreteDistribution(const size_t numObservations, const size_t dimension=1): + probabilities(dimension, numObservations) + { + probabilities.ones();

#### MikeLing Dec 22, 2016

Contributor

Hi @rcurtin, Could I define and init a matrix like this? Why this failed and I also can't use `probabilities.set_size(dimension, numObservations)`?

I had google it but it doesn't help. I have no idea how to use `arma:mat`class even after I read some documents about it.

Thank you

Contributor

### MikeLing commented Dec 30, 2016

 Hi @rcurtin , I just implement multidimensional in discrete distribution, but I can't make it works in hmm for BaumWelch. The boost told me the index out of bounds in SimplestBaumWelchDiscreteHMM and SimpleBaumWelchDiscreteHMM but haven't tell me the exactly location things went wrong. I think I should look the whole thing through. Could you tell me if those code above is fit your mind? The discrete distribution can works well except Gamma distribution. Please tell me if I should address anything :) Thank you

### rcurtin reviewed Dec 30, 2016

 + {0.25, 0.35, 0.4}}; + BOOST_REQUIRE_CLOSE(d.Probability("0"), 0.1, 1e-5); + BOOST_REQUIRE_CLOSE(d.Probability("1"), 0.4, 1e-5); + BOOST_REQUIRE_CLOSE(d.Probability("0, 1"), 0.03, 1e-5);

#### rcurtin Dec 30, 2016

Member

These should all throw errors. The distribution is 3-dimensional, so every input to Probability() should be 3-dimensional also.

Member

### rcurtin commented Dec 30, 2016

 I think you need to add more rigorous tests to ensure that the multidimensional discrete distribution is behaving like other multidimensional distributions before you start debugging what is wrong with the HMMs. It will probably make debugging a lot easier.
Contributor

### MikeLing commented Jan 27, 2017

 @rcurtin could you take a look at this? Thank you :)
Member

### rcurtin commented Jan 28, 2017

 Hey @MikeLing, thanks for taking the time to update this. This looks a lot better. But what happens if a user has, e.g., a 2-dimensional discrete distribution where the first dimension has 4 observations and the second dimension has 6? I think that instead of using `arma::mat` for the `probabilities` matrix you'll have to use something more flexible like `std::vector`. The rest of the code looks good though---the tests are fine, the `Train()` implementation looks right, and the `Probability()` implementation looks right. I wonder if we can now remove the old implementation of `Train()`.
Member

### rcurtin commented Jan 28, 2017

 It also looks like there are some small compilation errors that will have to be addressed. Note that Travis may not have the newest version of Armadillo, so some Armadillo functionality you are using may not be available.
Contributor

### MikeLing commented Jan 30, 2017

 Hi @rcurtin, I had refactor the discrete distribution with `std::vector` and those test for multidimensional discrete distribution works fine. Please tell me if the multidimensional discrete distributio support looks good to you and I will move to the next step which is refactor the hmm code. I will great appreciate if you can give me some advice for the hmm refactor! Thank you :)

### MikeLing reviewed Jan 31, 2017

 - emissions.col(i) = dataSeq[emissionList[state][i].first].col( - emissionList[state][i].second); + emissions[i] = dataSeq[emissionList[state][i].first].col( + emissionList[state][i].second); } emission[state].Train(emissions);

#### MikeLing Jan 31, 2017

Contributor

@rcurtin Why it keeps telling me no matching member function for call to 'Train' But I have no idea why it would call the `Train` in `discrete distribution`. Could you give me some helps on that? Here is my full error I got: https://pastebin.mozilla.org/8971379

Member

### rcurtin commented Feb 6, 2017

 Hi @MikeLing, thanks for taking the time to do the refactoring. The HMM code shouldn't need to be refactored though. The signature of `DiscreteDistribution::Train()` and `DiscreteDistribution::Probability()` shouldn't need to change. The observations can be given as `arma::vec` not `std::vector`---but the internal implementation has to use `std::vector` because each dimension may have a different number of features. Does that make sense? If not let me know and I will try to clarify.
Contributor

### MikeLing commented Feb 8, 2017

 Hi @rcurtin , thank you for your comment! The signature of DiscreteDistribution::Train() and DiscreteDistribution::Probability() shouldn't need to change. The observations can be given as arma::vec not std::vectorarma::vec Due to we store the `probabilities` as `std::vector` which is multi-dimension, how could we know how many observations in each dimension when we train the distribution by `DiscreteDistribution::Train()` with `arma::vec`? e.g. like you said, we want to train a discrete distribution by a 2-dimensional discrete distribution where the first dimension has 4 observations and the second dimension has 6, how could we know first 4 observations belong to first dimension and last 6 observations belong to the second dimension with a observation vector like arma::vec("5 3 6 1 6 3 2 1 1 2")? Otherwise, maybe we need also pass a `slice_step` of [4, 6] to indicate the boundary of different dimensions of observations. Same to `DiscreteDistribution::Probability()`, it will return as `std::vector` when we define `probabilities` as `std::vector`. Anyway, we could define another Probability() function like`DiscreteDistribution::Probability(intt32_t dimension)` if we want to query single dimension's distribution.
Member

### rcurtin commented Feb 9, 2017

 When the user constructs the object, they are telling us how many items are in each dimension. Let's take the same example, with a 2-dimensional discrete distribution with 4 different observations in the first dimension and 6 in the second: ``````arma::Col dims("4 6"); // 4 in the first, 6 in the second. DiscreteDistribution d(dims); // Then we can train with the following dataset. // Note that how I've written it below, it needs to be transposed. arma::mat dataset("0 1; 1 3; 3 5; 1 1; 0 0; 0 5; 0 1"); d.Train(dataset.t()); `````` Does that make sense? Each dimension can have a different number of possible values, but in the end, the user should still be able to pass an arma::vec that represents a single observation. Sorry if anything I've written is confusing, let me know and I can try to clarify.
Contributor

### MikeLing commented Feb 12, 2017

 @rcurtin, I had update this PR and ask review for it. BTW, even the travis and appveyor failed, the `HMMTest`, `DistributionTest` and `SerializationTest` passed on my locally. I have no idea why.

### rcurtin reviewed Feb 14, 2017

The serialization test is failing because the test itself needs to be modified---take a look at `serialization_test.cpp` at line 153.

Can you also please go through your changes and make sure they match the style guide?

After that, feel free to add your name to `src/mlpack/core.hpp` and `COPYRIGHT.txt` and then once we have handled all the comments we can merge this in.

Thanks for being persistent and getting this worked out; it took many iterations but we are close now. :)

 { result[0] = obs; return result; } } // This shouldn't happen. - result[0] = probabilities.n_elem - 1; + result[0] = probabilities[1].n_elem - 1; return result;

#### rcurtin Feb 14, 2017

Member

We should generate an n-dimensional vector here, not a one-dimensional vector. So we should loop over `probabilities` instead of only considering `probabilities[0]`, and `result` should have size `probabilities.size()`, not `1`.

 - // Clear old probabilities. - probabilities.zeros(); + // Deal with the difference between vector and martix + const arma::mat transObservations = observations.t();

#### rcurtin Feb 14, 2017

Member

I don't see the need for the transposition here, I think this will just slow things down. Can you rewrite this working with the original `observations` and not `transObservations` please?

 } /** * Estimate the probability distribution from the given observations when also * given probabilities that each observation is from this distribution. */ void DiscreteDistribution::Train(const arma::mat& observations, - const arma::vec& probObs) + const arma::mat& probObs)

#### rcurtin Feb 14, 2017

Member

`probObs` should remain an `arma::vec` here, not be changed to a `mat`. That vector indicates the probability that a given column from `observations` comes from the distribution. So whereas in `Train(observations)`, we have the line

``````probabilities[i][obs]++;
``````

instead in `Train(observations, probObs)`, we should have the line

``````probabilities[i][obs] += probObs[i];
``````

Does that make sense? Let me know if I can clarify.

#### MikeLing Feb 17, 2017

Contributor

Hi @rcurtin , I still haven't got the point here after I try it. For example, the test in here https://github.com/mlpack/mlpack/pull/830/files#diff-ad93795d91105ba0fcc6e9815d5e9ad0R106

The observations is one dimensional ("0 0 1 2") and prob is ("0.25 0.25 0.5 1.0"), and i as an indicate of dimension will keep staying in 0 which means one dimension.

When we have the line

probabilities[i][obs] += probObs[i];

But here comes to the problem, the probObs[i] is always 0.25 after loop over, the `probabilities` will become (0.5[0, 0], 0.25[1], 0.25[2]).

Maybe you mean the `probabilities[i][obs] += probObs[obs];`? If so, what will happened if it's a multidimensional observation like

"0 1 1 2 2 2;"
"0 0 0 1 1 2;"
"0 1 1 2 2 2;"

#### rcurtin Feb 17, 2017

Member

Ah, yeah, I miswrote here. You should be adding the probability of the observation to `probabilities[i][obs]`. That means that really it should be

``````probabilities[i][obs] += probObs[r];
``````

 + const std::vector& MultipleProbabilities() const { return probabilities; } + + //! Return the vector Martix of probabilities. + std::vector& MultipleProbabilities() { return probabilities; }

#### rcurtin Feb 14, 2017

Member

I think we can just return `std::vector<arma::vec>&` for `Probabilities()`, so you can remove the existing `Probabilities()` implementations and replace them with `MultipleProbabilities()`.

#### MikeLing Feb 20, 2017

Contributor

sorry for the late reply for this comment.

Hmmm, if so, we need to warp a lot of `arma::vec` into `std::vector` like 00baf60#diff-ad93795d91105ba0fcc6e9815d5e9ad0R68 and some places in `hmm_impl.hpp`(sorry I forget the exactly place).

Maybe we could keep them both and it wouldn't confuse the users, I think. :)

#### rcurtin Feb 22, 2017 • edited Edited 1 time rcurtin edited Feb 23, 2017 (most recent)

Member

I think `Probabilities()` is only used in the tests really so it's not a problem to make this change I think. Maybe a better solution is this:

``````const arma::vec& Probabilities(const size_t dim = 0)
{
return probabilities[dim];
}
``````

That actually should be reverse compatible too which is nice!

#### MikeLing Feb 23, 2017

Contributor

const arma::vec& Probabilities(const size_t dim = 0)
{
return probabilities[I];
}

Is that mean we could keep the `MultipleProbabilities()`? If not, what should we do if we want to init the discrete distribution in the first place like https://github.com/mlpack/mlpack/pull/830/files#diff-ad93795d91105ba0fcc6e9815d5e9ad0R163?

#### rcurtin Feb 23, 2017

Member

I don't think we should keep `MultipleProbabilities()`; a user can just use the constructor that takes `std::vector<arma::vec>` instead. So you could adapt the test to be

``````std::vector<arma::vec> pro{{0.1, 0.3, 0.6},
{0.3, 0.3, 0.3},
{0.25, 0.25, 0.5}};
DiscreteDistribution d(pro);
``````
 + BOOST_REQUIRE_EQUAL(d.MultipleProbabilities().size(), 4); + BOOST_REQUIRE_EQUAL(d.MultipleProbabilities()[0].n_elem, 4); + BOOST_REQUIRE_CLOSE(d.Probability("0 0 0"), 0.015625, 1e-5); + BOOST_REQUIRE_CLOSE(d.Probability("0 1 2"), 0.015625, 1e-5);

#### rcurtin Feb 14, 2017

Member

Shouldn't this throw an error? `d` is declared as having 4 dimensions (each with 4 observations), but then `Probability()` is called with 3-dimensional vectors.

``` Implement multidimensional discrete distribution ```
``` 400e2eb ```
Contributor

### MikeLing commented Feb 24, 2017

 @rcurtin ask review for this pr :)

### rcurtin reviewed Feb 24, 2017

 - // This shouldn't happen. - result[0] = probabilities.n_elem - 1; + if (sumProb >= randObs != true)

#### rcurtin Feb 24, 2017

Member

Why the `!= true`? That seems strange. This should be just `if (sumProb >= randObs)`, I think.

#### MikeLing Feb 25, 2017

Contributor

yes it is. Thank you for your point it out :)

### rcurtin reviewed Feb 24, 2017

 + if (sum > 0) + probabilities[i] /= sum; + else + probabilities[i].fill(1.0 / probabilities[i].n_elem); // Force normalization.

#### rcurtin Feb 24, 2017

Member

So, I suspect this will be a lot faster if you invert the loop. Right now the outer loop is over dimensions and the inner loop is over points, but because of the way Armadillo uses memory we should always try to have the inner loop be over dimensions. Do you think you can adapt this loop that way? I can explain further if you like, but it should be an easy change. :)

#### MikeLing Feb 25, 2017

Contributor

Hi @rcurtin , hmmm, I'm not sure if I got the truly meaning of invert. I just write a demon so no need to review it again. :) NOTE, I use another for loop to loop over each dimension of probabilities to normailze it and that's the reason why I use dimensions as outer loop's index in the first place.

``````  // Iterate observation in each dimension
for (size_t r=0; r < observations.n_rows; r++)
{
// Iterate all the probabilities in each dimension
for (size_t i=0; i < dimensions; i++)
{
// Clear the old probabilities
probabilities[i].zeros();

// Add the probability of each observation.  The addition of 0.5 to the
// observation is to turn the default flooring operation of the size_t cast
// into a rounding observation.
const size_t obs = size_t(observations(i, r) + 0.5);

// Ensure that the observation is within the bounds.
if (obs >= probabilities[i].n_elem)
{
Log::Debug << "DiscreteDistribution::Train(): observation " << i
<< " (" << obs << ") is invalid; observation must be in [0, "
<< probabilities[i].n_elem << "] for this distribution." << std::endl;
}
probabilities[i][obs]++;
}
}

// Now normailze the distribution.
for (size_t i=0; i < dimensions; i++)
{
double sum = accu(probabilities[i]);
if (sum > 0)
probabilities[i] /= sum;
else
probabilities[i].fill(1.0 / probabilities[i].n_elem); // Force normalization.
}
``````

### rcurtin reviewed Feb 24, 2017

 + * @param numObservations Number of possible observations this distribution + * can have. + */ + DiscreteDistribution(const arma::vec& numObservations)

#### rcurtin Feb 24, 2017

Member

I realized, we can make this `arma::Col<size_t>` not `arma::vec` and that would be cleaner, but I can do that after merge if you like. I don't want to have an endless stream of comments so that this never gets merged. :)

#### MikeLing Feb 25, 2017

Contributor

yeah, I can send a clear up pr after this one been merged :)

Member

### rcurtin commented Feb 24, 2017

 Looks great, thanks for the hard work. I think only the couple comments I had need to be addressed and I'll merge it. Do you want to add your name to the list of contributors in `COPYRIGHT.txt` and `src/mlpack/core.hpp`? Also I see the AppVeyor build is failing, but I am on a plane so there is not enough bandwidth for me to check what the issue might be... I will look into it later...
Contributor

### MikeLing commented Feb 25, 2017

 Thank you for your review and help. Yeah, please add me to the list of contributors in COPYRIGHT.txt and src/mlpack/core.hpp if I can have that honor. This pr took so long to been finished but I had learn a ton of things from it. I promise I will keep working on mlpack as long as I can :) Thank you!

### rcurtin merged commit `400e2eb` into mlpack:master Feb 27, 2017 1 of 2 checks passed

#### 1 of 2 checks passed

continuous-integration/appveyor/pr AppVeyor build failed
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Member

### rcurtin commented Feb 27, 2017

 Hi @MikeLing, I handled a few things after merge: I found a bug with serialization with older versions of Boost that I was able to fix with e0f6e97. I fixed some style issues with aa62610. I fixed the bug in `Random()` with 7cc0a44. The constructor now uses `arma::Col` in ab7e6f2. I inverted the loops for better memory usage in 5b0abcd. I can't add your name to the list of contributors though since I don't know what name or email you'd like me to use. If you would like to open another PR for that I can merge it.

Closed

Contributor

### MikeLing commented Feb 28, 2017

 @rcurtin Thank you for your help! :)