Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deterministic Algorithms #194

Open
3 of 5 tasks
ctrl-z-9000-times opened this issue Jan 4, 2019 · 18 comments
Open
3 of 5 tasks

Deterministic Algorithms #194

ctrl-z-9000-times opened this issue Jan 4, 2019 · 18 comments

Comments

@ctrl-z-9000-times
Copy link
Collaborator

ctrl-z-9000-times commented Jan 4, 2019

NuPIC should be deterministic, and should yield precisely the same results across all supported platforms. When changes to the source code cause the output to change in any way, we should know about it.

Currently the unit tests do not systemically check this. The solution to this problem should contain hard coded results which include details about the inner workings of the algorithms. The solution should also be easy to update when NuPIC's output is intentionally changed.

EDIT: Progress:

I'd like to add 3 more of these tests.

@ctrl-z-9000-times
Copy link
Collaborator Author

SpatialPooler test started in PR #164

@ctrl-z-9000-times
Copy link
Collaborator Author

I'd like to add 3 more of these tests.

  • For TM, sdr classifier, and scalar encoder. Id like to make one big test for test coverage of all of these components.
  • For benchmark hotgym
  • For MNIST SP

@dkeeney
Copy link

dkeeney commented Jan 9, 2019

I totally agree with this. Actually we could use better unit test coverage as well on nearly all modules. I think is it not too hard to break something and our unit tests do not tell us.

@breznak
Copy link
Member

breznak commented Feb 27, 2019

For TM, sdr classifier, and scalar encoder. Id like to make one big test for test coverage of all of these components.

can this be part of the Hotgym, MNIST tests? As those use all of the said components, so it's just an issue of recording "gold standard" at some iteration and comparing that.

I totally agree with this. Actually we could use better unit test coverage as well on nearly all modules. I think is it not too hard to break something and our unit tests do not tell us.

Better unittests, def yes. But careful with the gold standard, there will be PRs that legitimely break the "gold" (improved code, even new param, ...). But it'll be nice to add a flag "breaks_gold" and have all of these recorded thanks to the checks.

@breznak
Copy link
Member

breznak commented Apr 23, 2019

TM output test failing on (and only) Windows CI, see #397 (comment)

@breznak
Copy link
Member

breznak commented Apr 24, 2019

Btw, @dkeeney , when you get back, do you have idea why this would be failing on Windows? #397 (comment)

@dkeeney
Copy link

dkeeney commented Apr 27, 2019

@breznak I will be out again for this weekend but I have not forgotten that I need to check into this problem.

@dkeeney dkeeney self-assigned this Apr 27, 2019
@dkeeney
Copy link

dkeeney commented May 1, 2019

"Deterministic output of TM failed!"

The test is looking for a TM output of {26, 75} and it is getting {26, 75, 525} on Windows.
Debug mode and Release mode on Windows both produce the same output if given the same iteration count.

I displayed the last 100 outputs using Linux (Ubuntu) and Windows (MSVC). When you put them in separate files and diff them, there is a significant difference in the outputs.
I don't know what is causing it yet but I suspect there is some difference in floating point bit resolution or how something is truncated. Hard to tell at this point.

I need to reduce the test size to isolate the point where the two results diverge.

=============================================
Windows: Last 100 outputs:

e=4899 out: SDR( 2048 ) 102, 701, 1044, 1116, 1453, 1469
e=4900 out: SDR( 2048 ) 668, 701, 1044, 1116
e=4901 out: SDR( 2048 ) 102, 668, 1458
e=4902 out: SDR( 2048 ) 450, 1453, 1469
e=4903 out: SDR( 2048 ) 450, 1044, 1128, 1447, 1453, 1469
e=4904 out: SDR( 2048 ) 450, 1453
e=4905 out: SDR( 2048 ) 254, 450, 728, 1044, 1447, 1453, 1458, 1469, 1477, 1870
e=4906 out: SDR( 2048 ) 450, 1447, 1453
e=4907 out: SDR( 2048 ) 15, 254, 1069, 1447, 1870
e=4908 out: SDR( 2048 ) 15, 254, 1069
e=4909 out: SDR( 2048 ) 15, 254, 1069, 1447, 1477
e=4910 out: SDR( 2048 )
e=4911 out: SDR( 2048 ) 15, 1710, 1720, 2039
e=4912 out: SDR( 2048 ) 595, 1720, 1953
e=4913 out: SDR( 2048 ) 15, 1199, 1710, 1720, 2001, 2039
e=4914 out: SDR( 2048 ) 1635, 1953
e=4915 out: SDR( 2048 ) 327, 421, 1199, 1935, 2001
e=4916 out: SDR( 2048 ) 321, 456
e=4917 out: SDR( 2048 )
e=4918 out: SDR( 2048 ) 418, 490, 1069
e=4919 out: SDR( 2048 ) 418, 490, 1044, 1069, 1766
e=4920 out: SDR( 2048 )
e=4921 out: SDR( 2048 ) 39, 490, 1038, 1044, 1058, 1069
e=4922 out: SDR( 2048 ) 701
e=4923 out: SDR( 2048 )
e=4924 out: SDR( 2048 ) 194, 236, 701, 1710, 1724, 1940, 1960, 1980
e=4925 out: SDR( 2048 ) 1960
e=4926 out: SDR( 2048 ) 236, 701, 1710, 1940, 1960
e=4927 out: SDR( 2048 ) 1960
e=4928 out: SDR( 2048 ) 913, 923
e=4929 out: SDR( 2048 ) 335, 532, 795, 923, 926, 1239, 1670
e=4930 out: SDR( 2048 )
e=4931 out: SDR( 2048 ) 335, 532, 795, 1239
e=4932 out: SDR( 2048 ) 1038, 1044, 1048, 1064
e=4933 out: SDR( 2048 )
e=4934 out: SDR( 2048 ) 1038, 1044, 1048, 1064
e=4935 out: SDR( 2048 )
e=4936 out: SDR( 2048 ) 912, 1038, 1044, 1048, 1064
e=4937 out: SDR( 2048 ) 912, 1038
e=4938 out: SDR( 2048 )
e=4939 out: SDR( 2048 ) 1038, 1044, 1048, 1670, 1805, 1812, 1813, 1815, 1922
e=4940 out: SDR( 2048 ) 701
e=4941 out: SDR( 2048 )
e=4942 out: SDR( 2048 ) 194, 216, 1980
e=4943 out: SDR( 2048 ) 174, 561, 1980
e=4944 out: SDR( 2048 ) 1772, 1773, 2026
e=4945 out: SDR( 2048 ) 1772, 1773, 2026
e=4946 out: SDR( 2048 ) 7, 13, 1315, 1796
e=4947 out: SDR( 2048 ) 7, 13, 15, 17, 72, 1252, 1259, 1315, 1796, 1877
e=4948 out: SDR( 2048 ) 13, 17, 39, 1252, 1315, 1877
e=4949 out: SDR( 2048 )
e=4950 out: SDR( 2048 )
e=4951 out: SDR( 2048 )
e=4952 out: SDR( 2048 ) 160, 790, 859, 1867
e=4953 out: SDR( 2048 ) 160, 859, 982
e=4954 out: SDR( 2048 )
e=4955 out: SDR( 2048 ) 982, 1700, 1738, 1856
e=4956 out: SDR( 2048 ) 1700
e=4957 out: SDR( 2048 ) 961, 1700, 1738
e=4958 out: SDR( 2048 ) 230, 1107, 1685
e=4959 out: SDR( 2048 ) 124, 1107, 1946, 1976
e=4960 out: SDR( 2048 )
e=4961 out: SDR( 2048 ) 468
e=4962 out: SDR( 2048 ) 468, 923, 925, 1199, 1442, 1647, 1648, 1670
e=4963 out: SDR( 2048 )
e=4964 out: SDR( 2048 ) 879, 923, 1199, 1418, 1442
e=4965 out: SDR( 2048 ) 879, 1418, 1442
e=4966 out: SDR( 2048 )
e=4967 out: SDR( 2048 ) 348, 370, 1172, 1201, 1232, 1418
e=4968 out: SDR( 2048 )
e=4969 out: SDR( 2048 )
e=4970 out: SDR( 2048 ) 361, 733, 1257
e=4971 out: SDR( 2048 ) 361
e=4972 out: SDR( 2048 )
e=4973 out: SDR( 2048 )
e=4974 out: SDR( 2048 )
e=4975 out: SDR( 2048 ) 532, 560, 1471, 1486, 1786
e=4976 out: SDR( 2048 )
e=4977 out: SDR( 2048 ) 532, 560, 1366, 1486, 1786
e=4978 out: SDR( 2048 )
e=4979 out: SDR( 2048 ) 1085, 1366, 1591
e=4980 out: SDR( 2048 )
e=4981 out: SDR( 2048 )
e=4982 out: SDR( 2048 ) 1619, 1678
e=4983 out: SDR( 2048 )
e=4984 out: SDR( 2048 ) 963, 967, 1619, 2018
e=4985 out: SDR( 2048 ) 148, 167, 176
e=4986 out: SDR( 2048 )
e=4987 out: SDR( 2048 ) 144, 148, 167, 168, 174, 176, 197, 913, 1795
e=4988 out: SDR( 2048 )
e=4989 out: SDR( 2048 ) 1001
e=4990 out: SDR( 2048 ) 980
e=4991 out: SDR( 2048 ) 975, 980
e=4992 out: SDR( 2048 ) 975, 1813, 1815
e=4993 out: SDR( 2048 ) 975, 980
e=4994 out: SDR( 2048 )
e=4995 out: SDR( 2048 )
e=4996 out: SDR( 2048 )
e=4997 out: SDR( 2048 ) 75, 128
e=4998 out: SDR( 2048 ) 75, 128, 912
e=4999 out: SDR( 2048 ) 26, 75, 525

=============================================
Linux: Last 200 outputs:

e=4899 out: SDR( 2048 ) 102, 450, 1044, 1116, 1453, 1458, 1493, 1824
e=4900 out: SDR( 2048 )
e=4901 out: SDR( 2048 ) 102, 668, 701, 1044, 1116, 1453, 1469, 1824
e=4902 out: SDR( 2048 ) 1453
e=4903 out: SDR( 2048 ) 450, 1044, 1128, 1453, 1469
e=4904 out: SDR( 2048 ) 1453, 1469
e=4905 out: SDR( 2048 ) 254, 450, 728, 1044, 1116, 1447, 1453, 1458, 1469, 1477, 1493, 1870
e=4906 out: SDR( 2048 )
e=4907 out: SDR( 2048 ) 254, 1069, 1447, 1870
e=4908 out: SDR( 2048 ) 15, 254, 1069, 1447
e=4909 out: SDR( 2048 ) 254, 1069, 1447
e=4910 out: SDR( 2048 ) 1710, 1720
e=4911 out: SDR( 2048 )
e=4912 out: SDR( 2048 ) 15, 1710, 1720, 2013, 2039
e=4913 out: SDR( 2048 ) 1199, 2001
e=4914 out: SDR( 2048 ) 327, 1635, 1953
e=4915 out: SDR( 2048 )
e=4916 out: SDR( 2048 ) 321, 327
e=4917 out: SDR( 2048 )
e=4918 out: SDR( 2048 ) 418, 490, 1058, 1069
e=4919 out: SDR( 2048 ) 1766
e=4920 out: SDR( 2048 ) 39, 418, 490, 1049, 1058, 1068, 1069
e=4921 out: SDR( 2048 ) 1038
e=4922 out: SDR( 2048 ) 701
e=4923 out: SDR( 2048 ) 1710
e=4924 out: SDR( 2048 ) 194, 236, 701, 1710, 1724, 1940
e=4925 out: SDR( 2048 ) 1751, 1946, 1960
e=4926 out: SDR( 2048 ) 701, 1710, 1940, 1960
e=4927 out: SDR( 2048 ) 926
e=4928 out: SDR( 2048 ) 913, 923
e=4929 out: SDR( 2048 ) 335, 532, 795, 923, 926, 1239, 1670
e=4930 out: SDR( 2048 )
e=4931 out: SDR( 2048 ) 335, 532, 795, 921, 1239
e=4932 out: SDR( 2048 ) 1038, 1048, 1064
e=4933 out: SDR( 2048 )
e=4934 out: SDR( 2048 ) 1038, 1048, 1064, 1068, 1745
e=4935 out: SDR( 2048 )
e=4936 out: SDR( 2048 ) 1038, 1044, 1048
e=4937 out: SDR( 2048 ) 912, 1044
e=4938 out: SDR( 2048 )
e=4939 out: SDR( 2048 ) 1038, 1044, 1048, 1670, 1796, 1805, 1812, 1813, 1922
e=4940 out: SDR( 2048 ) 701
e=4941 out: SDR( 2048 )
e=4942 out: SDR( 2048 ) 194, 216, 1980
e=4943 out: SDR( 2048 ) 561, 1980
e=4944 out: SDR( 2048 ) 1772, 1773, 1796, 2026
e=4945 out: SDR( 2048 ) 1796, 2026
e=4946 out: SDR( 2048 ) 7, 13, 17, 1315, 1796
e=4947 out: SDR( 2048 ) 7, 13, 15, 17, 72, 1252, 1259, 1315, 1796, 1877
e=4948 out: SDR( 2048 ) 13, 17, 39, 1252, 1877
e=4949 out: SDR( 2048 )
e=4950 out: SDR( 2048 )
e=4951 out: SDR( 2048 )
e=4952 out: SDR( 2048 ) 160, 859, 1867
e=4953 out: SDR( 2048 ) 160, 859, 982, 1867
e=4954 out: SDR( 2048 )
e=4955 out: SDR( 2048 ) 982, 1700, 1738
e=4956 out: SDR( 2048 ) 961, 1700
e=4957 out: SDR( 2048 ) 961
e=4958 out: SDR( 2048 ) 230, 1107, 1685
e=4959 out: SDR( 2048 ) 1107, 1976
e=4960 out: SDR( 2048 ) 124, 1976
e=4961 out: SDR( 2048 ) 1648
e=4962 out: SDR( 2048 ) 468, 923, 925, 1199, 1442, 1670
e=4963 out: SDR( 2048 )
e=4964 out: SDR( 2048 ) 879, 923, 1199, 1418, 1442, 1458
e=4965 out: SDR( 2048 ) 879, 1442, 1458
e=4966 out: SDR( 2048 )
e=4967 out: SDR( 2048 ) 348, 370, 1201, 1232
e=4968 out: SDR( 2048 )
e=4969 out: SDR( 2048 )
e=4970 out: SDR( 2048 ) 361, 733, 1257
e=4971 out: SDR( 2048 ) 361
e=4972 out: SDR( 2048 )
e=4973 out: SDR( 2048 ) 42
e=4974 out: SDR( 2048 )
e=4975 out: SDR( 2048 ) 532, 1467, 1471, 1486, 1786
e=4976 out: SDR( 2048 )
e=4977 out: SDR( 2048 ) 532, 560, 1366, 1486, 1786
e=4978 out: SDR( 2048 )
e=4979 out: SDR( 2048 ) 1085, 1366, 1591
e=4980 out: SDR( 2048 )
e=4981 out: SDR( 2048 )
e=4982 out: SDR( 2048 ) 1619, 1670, 1678, 2018
e=4983 out: SDR( 2048 )
e=4984 out: SDR( 2048 ) 963, 967, 1619, 1678, 2018
e=4985 out: SDR( 2048 ) 148, 167
e=4986 out: SDR( 2048 )
e=4987 out: SDR( 2048 ) 144, 148, 157, 164, 167, 168, 176, 197, 913, 1795
e=4988 out: SDR( 2048 )
e=4989 out: SDR( 2048 ) 1001
e=4990 out: SDR( 2048 ) 975, 980
e=4991 out: SDR( 2048 )
e=4992 out: SDR( 2048 ) 80, 975, 1805, 1813
e=4993 out: SDR( 2048 )
e=4994 out: SDR( 2048 )
e=4995 out: SDR( 2048 )
e=4996 out: SDR( 2048 ) 42
e=4997 out: SDR( 2048 ) 75, 128
e=4998 out: SDR( 2048 ) 75, 305, 912
e=4999 out: SDR( 2048 ) 26, 75

@breznak
Copy link
Member

breznak commented May 1, 2019

I don't know what is causing it yet but I suspect there is some difference in floating point bit resolution or how something is truncated. Hard to tell at this point.

yes, I'd be suspecting: Random, float rounding, tie-breakers in TM?

@dkeeney
Copy link

dkeeney commented May 1, 2019

I confirmed that all SP outputs are the same. So it is someplace in TM.
The TM outputs are not different until iteration 2424.

So now I have to go inside the algorithm and see what is different at iteration 2424.

@dkeeney
Copy link

dkeeney commented May 2, 2019

I think I found the problem --- or at least a problem.
TemporalMemory.cpp, line 284 in growSynapses( ) .

  const size_t overrun = (connections.numSynapses(segment) + nActual - maxSynapsesPerSegment);
  if (overrun > 0) {
    destroyMinPermanenceSynapses(connections, rng, segment, static_cast<Int>(overrun), prevWinnerCells);
  }

overrun is unsigned long so the subtraction assigns a negative number which is interpreted as greater than 0. destroyMinPermanencesSynapses( ) is always called. But because it is cast to int, inside the function it is a negative number and nothing actually gets deleted. But its doing a lot of work that it does not need to do.

I changed overrun to be type Int. That stopped the unnecessary calls to destroyMinPermanencesSynapses( ) but we still diverge in the same place.

During iteration 2423 the second call to growSynapses( ); the random numbers are still matching but the candidates returned have different values. See for loop at line 293.

@breznak
Copy link
Member

breznak commented May 3, 2019

@dkeeney

I changed overrun to be type Int. That stopped the unnecessary calls to destroyMinPermanencesSynapses( ) but we still diverge in the same place.

can you please submit a small PR with said fix?

  • add to the top of destroyMinPermanenceSynapses(connections :
if(nDestroy < 1) return; 

@dkeeney
Copy link

dkeeney commented May 4, 2019

I am still hunting this down.
Windows and Linux deviate at the end of iteration 2422
While processing column 908, In ActivateCells( ) The Linux code calls BurstColumn( ) three times. Windows code calls BurstColumn( ) four times. That last time finds cell 7269 as the winner and it creates an extra segment.

I have not figured out what causes it....

@dkeeney
Copy link

dkeeney commented May 11, 2019

Some progress.
In SpatialPooler.cpp, line 875 in inhibitColumnsGlobal_( )

  // Sort the columns by the amount of overlap.  First make a list of all of the
  // column indexes.
  activeColumns.reserve(numColumns_);
  for(UInt i = 0; i < numColumns_; i++)
    activeColumns.push_back(i);
  // Compare the column indexes by their overlap.
  auto compare = [&overlaps_](const UInt &a, const UInt &b) -> bool
    {return overlaps_[a] > overlaps_[b];};
  // Do a partial sort to divide the winners from the losers.  This sort is
  // faster than a regular sort because it stops after it partitions the
  // elements about the Nth element, with all elements on their correct side of
  // the Nth element.
  std::nth_element(
    activeColumns.begin(),
    activeColumns.begin() + numDesired,
    activeColumns.end(),
    compare);
  // Remove the columns which lost the competition.
  activeColumns.resize(numDesired);
  // Finish sorting the winner columns by their overlap.
  std::sort(activeColumns.begin(), activeColumns.end(), compare);
  // Remove sub-threshold winners
  while( !activeColumns.empty() &&
         overlaps[activeColumns.back()] < stimulusThreshold_)
      activeColumns.pop_back();

The output of this section on EPOCH 2422,
Linux: activeColumns out: 1038 1064 912 34 1095 1044 1449 40 1332 1051
Windows: activeColumns out: 1038 1064 912 34 1095 1044 1449 40 1332 908
Note the last element.

Just prior to this, the output of the nth_element( ) partial sort has the elements partitioned correctly but the exact order between Linux and Windows is slightly different. So when we truncate we just happen to get a different value for the 10th element. That difference propagates and eventually everything is different between Linux and Windows.

The Problem:
So, I think the problem is that the compare function does not uniquely specify the order. Items that compare equal can be in any order.

The Fix:

  // Compare the column indexes by their overlap.
  auto compare = [&overlaps_](const UInt &a, const UInt &b) -> bool
    {return overlaps_[a] > overlaps_[b];};

becomes

  // Compare the column indexes by their overlap.
  auto compare = [&overlaps_](const UInt &a, const UInt &b) -> bool
    {return (overlaps_[a] == overlaps_[b]) ? a > b : overlaps_[a] > overlaps_[b];};

@breznak
Copy link
Member

breznak commented May 12, 2019

Big KUDOs @dkeeney ! This has been really a huge and lengthy investigation. Great job! 💯

@breznak
Copy link
Member

breznak commented May 13, 2019

If we add Classifier (+Predictor) to hotgym, would you consider this issue resolved? With hotgym being the single place to run tests of all (most) components and check that the outputs are deterministic.

@ctrl-z-9000-times
Copy link
Collaborator Author

If you add determinism tests for the predictor then yes, i would close this issue.

@breznak
Copy link
Member

breznak commented Jun 2, 2020

ARM platform on muslC is not "deterministic.
TODO:

  • predictor
  • classifier ...add to HelloSPTP example cpp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants