
Add location of Block to validation errors #279

Merged: 9 commits from validate-details into master on Apr 7, 2020

Conversation

@hunse (Collaborator) commented Mar 26, 2020

This makes it clearer to the user which block is causing the problem.

This also fixes a number of other problems:

  • We were overcounting output axons in the splitter, because we neglected to account for axon sharing when population axons go to the same target.
  • Added a warning when we make copies of the weights to fix indexing issues, so it's clear to the user that fixing those indexing issues would reduce the number of weights needed.
  • Do not change the requested block size. Sometimes it is essential (e.g. with pop16 axons) that parts of the requested block size (e.g. the number of filters) not be changed, so for now I've taken the resizing out completely. We could add back a warning letting the user know that the same number of blocks could be achieved with a smaller shape, and let them change it themselves if they want. We could also add an extra keyword parameter to `BlockShape` that says whether it can be changed (or even which particular dimensions can be changed, but that gets a bit more complicated).
  • There's an `if spike is None` branch in `LoihiSpikeInput.set_axons` that I thought could never be hit, but apparently the CIFAR-10 notebook hits it. In that case, we need to make sure the axon gets added with an empty list of targets, so that when the corresponding neuron spikes, we don't get an index error.
  • I added a couple of functions to `Model` and `LoihiBlock` that display how much of the different chip resources a block is using (see the usage sketch after this list). Everything is reported as a percentage of the per-core maximum, so users don't have to look up those maximums themselves. I found it useful for debugging how close I was to the maximum on different cores, so that I could choose block sizes that use the cores as fully as possible. The one downside is that if anything exceeds the maximums, validation fails while the Simulator is being created, so you never actually reach a point where you can call these utilization functions. But with the better error messages for those validation errors also in this PR, hopefully that's OK.
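
As a rough illustration, here is how those helpers might be used. This is a sketch only: the network is hypothetical, and the exact return types are assumptions (per the review fixup below, `block.utilization()` is assumed to map resource names to (used, max) counts, and `Model.utilization_summary()` is assumed to return one printable line per block).

```python
# Sketch: inspect per-block resource utilization after building a model.
import nengo
import nengo_loihi

with nengo.Network() as net:
    ens = nengo.Ensemble(n_neurons=400, dimensions=2)
    nengo.Probe(ens)

with nengo_loihi.Simulator(net) as sim:
    for block in sim.model.blocks:
        # Assumed format: {"compartments": (used, max), ...}
        for resource, (used, maximum) in block.utilization().items():
            print(f"{resource}: {used}/{maximum} ({100 * used / maximum:.0f}%)")
    # Assumed to return a list of per-block summary strings
    print("\n".join(sim.model.utilization_summary()))
```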

@hunse force-pushed the validate-details branch 3 times, most recently from fba0e5f to 69ef44a on April 1, 2020.
@drasmuss (Member) left a comment

Added a small fixup so that `block.utilization()` returns the numerical values rather than the fractions. There might be situations where that information is helpful (e.g. if you know exactly how many compartments you're adding and want to know whether that will fit within the max). And it's easy enough to compute the fractional values from the counts (which is still what `utilization_summary` displays).
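
For instance, the conversion back to fractions is a one-liner (a sketch, assuming `block.utilization()` returns (used, max) pairs as described above):

```python
def utilization_fractions(block):
    """Sketch: convert the (used, max) counts from block.utilization()
    into the fractional values that utilization_summary displays."""
    return {name: used / maximum
            for name, (used, maximum) in block.utilization().items()}
```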

Resolved review threads: nengo_loihi/conv.py, nengo_loihi/hardware/nxsdk_objects.py
@hunse (Collaborator, Author) commented Apr 3, 2020

Ok, I pushed a commit with lots of documentation about fitting models onto Loihi. A lot of it came from the blog post I've written for the release; then I realized it makes sense to have it in the docs too (or rather, instead: I think the blog post will be a lot shorter now, since we can just refer to the docs for the details).

I think 237b3fc can just be completely removed now (though maybe we want to keep it somewhere, for if/when we do the refactoring that would let us raise such a warning in the splitter).

I also added a number of smaller fixes.

Commit messages from the timeline:

  • Otherwise, we can get index errors. This came up in the CIFAR-10 example when splitting the first on-chip ensemble spatially. Also remove premature hash to help avoid dictionary hash collisions.
  • Sometimes it is essential (e.g. with pop16 axons) that parts of the requested block size (e.g. number of filters) not be changed.
Resolved review threads: docs/tips.rst (two threads)
@drasmuss (Member) commented Apr 3, 2020

I fixed what I think is a bug in the test, where `conn_args` was being ignored (so when we were setting synapse=None down below, it wasn't actually doing anything): 3cd9823#diff-2b5472730de7e72d3c4c167f986643d2R617. But now the test is failing tolerances: https://travis-ci.com/github/nengo/nengo-loihi/jobs/313727540#L1887. I'm not sure what the correct solution is: do we want synapse=None (in which case we need to figure out why the test is failing)? Or do we want the accidental behaviour we had before (just using the default synapse)?
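
The bug pattern was roughly the following (a hypothetical reconstruction for illustration; `make_conn` and the network are stand-ins, not the actual test code):

```python
import nengo

def make_conn(pre, post, **conn_args):
    # Before the fix, the keyword arguments were silently dropped:
    #     return nengo.Connection(pre, post)
    # so overrides like synapse=None had no effect. Forwarding them fixes it:
    return nengo.Connection(pre, post, **conn_args)

with nengo.Network():
    pre = nengo.Ensemble(10, 1)
    post = nengo.Ensemble(10, 1)
    make_conn(pre, post, synapse=None)  # now actually sets synapse=None
```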

@hunse (Collaborator, Author) commented Apr 3, 2020

The failure is in the check that the hardware matches the emulator exactly. That has always held in the past, but it appears that here (for the specific case of pop16 axons with precompute=False) it doesn't.

I'm not quite sure why having synapse=None on the input makes a difference. Since the input is just a constant, all the synapse would do is control how quickly the input ramps up. So somehow that's throwing off some sort of timing in the inputs. My instinct was to raise snip_max_spikes_per_step, but that doesn't seem to be it (also, we would see that in the pop32 case as well). So a spike could be getting dropped, or something similar, based on some odd timing in the input.

I don't know whether it's better to loosen the tolerances and keep synapse=None, or to change the synapse back and keep the tolerances. I don't think it makes a big difference either way. I would probably opt to change the tolerances and make a note in the code that they are looser to accommodate this particular problem (roughly as sketched below). At some point it would be good to look into it further, but I have a feeling this is the kind of thing that will be hard to reproduce in a minimal example.
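
Something along these lines (a sketch only; the argument names and the tolerance value are placeholders, not the numbers actually used):

```python
import numpy as np

def check_hardware_matches_emulator(hw_output, emu_output):
    # Tolerance is looser than usual here: hardware and emulator do not
    # match exactly for pop16 axons with precompute=False (see discussion).
    assert np.allclose(hw_output, emu_output, atol=0.15)
```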

@hunse (Collaborator, Author) commented Apr 3, 2020

Ok, I've made that change. Also, I realized that if you try to run test_conv_deepnet with the BOARD environment variable set, it appears to hang. I tried to avoid that in the test, but it wasn't working properly, so it's just something to keep in mind. I did add a commit to make sure that we set PARTITION back to its original value, rather than to the empty string, in case it had been specified before (i.e. if the user had already set the partition to "nahuku32" to use that for all tests, we don't want to mess things up for later tests).
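
The restore pattern is essentially the following (a sketch; `run_hardware_test` is a stand-in for the actual test body):

```python
import os

def with_partition(partition, run_hardware_test):
    """Sketch: temporarily set PARTITION, then restore the caller's value."""
    old = os.environ.get("PARTITION")  # None if it was unset
    os.environ["PARTITION"] = partition
    try:
        run_hardware_test()
    finally:
        if old is None:
            os.environ.pop("PARTITION", None)  # was unset, so unset it again
        else:
            os.environ["PARTITION"] = old  # restore the original value
```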

@drasmuss force-pushed the validate-details branch 2 times, most recently from 68c21e3 to caf7067 on April 6, 2020.
Nengo Bones and others added commits on April 7, 2020:

  a34a255 - Add slack notifications for failing master builds
  677e3d6 - Automatically update downstream repos
  0198d9a - Install nengo-bones and nengo-sphinx-theme master
@drasmuss merged commit c968f9e into master on Apr 7, 2020.
@drasmuss deleted the validate-details branch on April 7, 2020.