Encoders in c++ #258

Closed
9 of 11 tasks
breznak opened this issue Feb 9, 2019 · 17 comments

Comments

@breznak
Member

breznak commented Feb 9, 2019

We need to port more Encoders to c++ for practical usability of this repo.

Mainly

  • RDSE
  • MultiEncoder

There's also an issue for Extra encoders for special stuff #259

EDIT:
topic
https://discourse.numenta.org/t/repo-for-merging-various-encoders/5397

Outstanding Tasks for ScalarEncoder:

  • C++ example usage & unit test for example
    I think this task is covered by the python viewing script - @ctrl-z-9000-times
  • Python example usage & unit test for example
  • Documentation could use more details, in-depth explanations, and notes for practical usage. See numenta/nupic.py docs?
  • Python script to visually plot the inputs & outputs of the encoder

Outstanding Tasks for RDSE:

See PR #278

Outstanding Tasks for CategoryEncoder:

See PR #435

  • Documentation
  • Demonstration
  • Unit tests
@breznak breznak added the encoder label Feb 9, 2019
@ctrl-z-9000-times
Collaborator

I'd like to make some SDR-based tools for working with encoders. The fact that we have no existing C++ encoders means that we can make them in whatever way we'd like to.

MultiEncoder via SDR-Concatenator

The MultiEncoder creates a group of encoders and concatenates the results together. I'd like to make an SDR-Concatenator class to do this. The users would create the encoders and use this class to join the results into a single SDR to give to the algorithms. Example:

SDR A  <-  from constituent encoder
SDR B  <-  from constituent encoder
SDR_Concatenator C( A, B, axis=0 )
A.setDense( data )
B.setDense( data )
C.getDense() -> A & B concatenated
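For flat (one-dimensional) SDRs, the joining step of such a concatenator could look roughly like the sketch below. It models a dense SDR as a plain byte vector rather than using the repository's SDR class, and `Dense` and `concatenateDense` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <vector>

// Assumption: a dense SDR is modelled as a byte vector (1 = active bit).
// This is a stand-in, not the repository's SDR class.
using Dense = std::vector<unsigned char>;

// Hypothetical helper: join any number of flat dense SDRs end to end.
Dense concatenateDense(const std::vector<Dense> &inputs) {
  assert(!inputs.empty()); // at least one constituent encoder is expected
  Dense result;
  for (const Dense &input : inputs) {
    result.insert(result.end(), input.begin(), input.end());
  }
  return result;
}
```

The result has as many bits as all inputs combined, with each input occupying its own contiguous range.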

SDR-Intersection

This would be useful for working with multidimensional data. The user encodes each dimension separately and then takes the intersection of the resulting SDRs. The result is an SDR where each bit responds to an area of the input space.
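As a rough illustration, an intersection is an element-wise AND. The sketch below assumes both inputs are dense byte vectors of the same size; `Dense` and `intersectDense` are hypothetical names, not existing API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a dense SDR is modelled as a byte vector (1 = active bit).
using Dense = std::vector<unsigned char>;

// Hypothetical helper: a bit is active in the result only if it is
// active in both inputs.
Dense intersectDense(const Dense &a, const Dense &b) {
  assert(a.size() == b.size()); // both SDRs must cover the same bits
  Dense result(a.size(), 0);
  for (std::size_t i = 0; i < a.size(); ++i) {
    result[i] = (a[i] && b[i]) ? 1 : 0;
  }
  return result;
}
```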

Category Encoder

Would be nice to have.

@breznak
Member Author

breznak commented Feb 12, 2019

There was some interest in encoders at the forums, hope it'll make our repo more exciting and accessible.

The MultiEncoder creates a group of encoders ... make an SDR-Concatenator class to do this.

  • make it a function of SDR::append(vector<SDR> concatenate) ?
  • call at a MultiEncoder, and let it do what you've described
  • no need as it's rather easy to do with SDRs now, just show "best practices" as
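// Copy the first SDR's dense data, then append the second SDR's.
auto concat = sdr1.getDense();
const auto &dense2 = sdr2.getDense();
concat.insert(concat.end(), dense2.begin(), dense2.end());
// `concatenated` must be dimensioned to hold concat.size() bits.
SDR concatenated;
concatenated.setDense(concat);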

@dkeeney

dkeeney commented Feb 12, 2019

We do have a rudimentary encoder: ScalarEncoder.cpp
But there is a lot more we could do there. Be sure that we also include a Region implementation that can handle the new encoders. ScalarSensor.cpp is the one for ScalarEncoder. Perhaps a general-purpose region that can handle any type of encoder would be cool.

@ctrl-z-9000-times
Collaborator

no need as it's rather easy to do with SDRs now, just show "best practices" as

This won't work for encoders which have dimensions. Imagine a large image with 3 color channels (RGB), and you want to encode each color separately and then combine them into a large SDR with topology. In this situation you need to splice together each pixel's encoded color.
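To make the splicing concrete, here is a minimal sketch. It assumes each channel's SDR is stored flat in row-major order with shape (pixels, bits per pixel), and it concatenates along the last axis so that every pixel's channel encodings end up next to each other. `Dense` and `concatenateLastAxis` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a dense SDR is modelled as a byte vector (1 = active bit).
using Dense = std::vector<unsigned char>;

// Hypothetical helper: each channel holds numPixels groups of bits.
// For every pixel, the groups from all channels are appended in order.
Dense concatenateLastAxis(const std::vector<Dense> &channels,
                          std::size_t numPixels) {
  assert(numPixels > 0);
  Dense result;
  for (std::size_t pixel = 0; pixel < numPixels; ++pixel) {
    for (const Dense &channel : channels) {
      assert(channel.size() % numPixels == 0);
      const std::size_t bits = channel.size() / numPixels;
      const auto first =
          channel.begin() + static_cast<std::ptrdiff_t>(pixel * bits);
      result.insert(result.end(), first,
                    first + static_cast<std::ptrdiff_t>(bits));
    }
  }
  return result;
}
```

Appending the whole red channel, then green, then blue would instead place the three encodings of one pixel far apart, which loses the topology described above.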

@ctrl-z-9000-times
Collaborator

I started a wiki page listing all of the encoders in both C++ & Python repositories, annotated. This wiki page also contains a tentative plan of action for providing a cohesive set of features.

https://github.com/htm-community/nupic.cpp/wiki/Encoder-Roundup

@ctrl-z-9000-times
Collaborator

Can Python Encoders use SDRs?

I'd like for the python encoders to use SDRs, and this brings up an interesting topic: we agreed to merge the pure python code into this repo, see issue #216. We also agreed that the python should remain separate from the C++ code. Does it need to be absolutely 100% separate? Or can python make use of the C++ SDR & Connections classes? To answer this, I ask why users might prefer python:

  • Python is easy to setup & install. The C++ is getting a lot better at this, many thanks to David Keeney for his work on CMake and reducing external dependencies.
  • Python is easy to inspect & interrogate. The SDR & Connections have bindings which make this easy to do.
  • Python is easy to experiment with. Python can not subclass & override C++ bindings, but this limitation can be mitigated by allowing python to register callbacks for events which the SDR & Connections C++ classes already have.
  • Python is easy to use. The C++ SDR is easy to use as well, so I think that integrating the SDR into the python code will further this goal.

The downside of integrating SDR into the Encoders is that it adds a new API to the encoder algorithms, which then needs to be supported. This issue won't affect the NetworkAPI.

@dkeeney

dkeeney commented Feb 16, 2019

My vote would be to encode all encoders in 100% C++.

  • The SDR class is available.
  • The incoming raw data to the encoders can be passed to C++ easily enough.
  • Experimenters that are building apps in 100% C++ can take advantage of these encoders.
  • The encoders become language independent by calling the C++ routines via its bindings. Python, C#, or whatever.

@ctrl-z-9000-times
Collaborator

My vote would be to encode all encoders in 100% C++.

For the most part I agree, but here are a few counter arguments:

  • All of the python encoders are already written & have unit tests.
  • ScalarEncoder - I think we should provide this in every language because it's the simplest example. It's like a "hello world" level of difficulty.
  • SDR-Category - Implementation must use python hash() & dict(), can be written in C++ w/ bindings?
  • delta.py & logarithm.py - Conveniences, not necessary for C++ but since python already has them why not use them.
  • date.py - Python's datetime library is too good to give up. datetime.datetime.today() -> (year, month, day, hour, minute, second, day-of-week, day-of-year, daylight-savings, time-zone, GMT-offset)

@dkeeney

dkeeney commented Feb 16, 2019

I don't need ALL of the encoders in C++ but one of my personal objectives is to eventually provide a set of bindings for C#. A C# app using our library is not going to have access to any Python modules. It would be nice to be able to just call into C++ for encoders. Otherwise I would have to duplicate the logic in C#.

@ctrl-z-9000-times
Collaborator

ctrl-z-9000-times commented Feb 22, 2019

RDSE Algorithm Memo

I hope to change the implementation of the Random Distributed Scalar Encoder (RDSE). Internally, the RDSE transforms a real-valued input into an integer index, and then associates that index with a set of active bits.

  • Currently, the association between indices and active bits is randomly generated as needed, and then stored for the lifetime of the encoder. This allows the encoder to find & guarantee a good set of random activations which don't overlap with any existing mapping. It also allows the encoder to decode an SDR into the input value which likely created it.
  • Instead, the association between indices and active bits will be calculated from the hash of the index. This uses less memory because it does not need to explicitly keep the association for the lifetime of the encoder. It is also faster because it will not check that all encodings are distinct. Instead, it will rely on the random & distributed nature of SDRs to prevent conflicts between different encodings. This method does not allow for decoding an SDR into the input which likely created it.

Pros:

  • Faster Construction: the new method is O(1). The current method is at least O(n) where n is the number of distinct inputs.
  • Smaller memory footprint: O(n) -> O(1) where n is the number of distinct inputs to the encoder.

Cons:

  • No Decode method. Instead make an SDR Classifier. We could even implement the decode method using an SDR classifier, but we would want it to be optional since the SDR Classifier has significant overhead.
  • No strong guarantee that semantically unrelated inputs have a low overlap. This should only be an issue if the encoder is too small or its sparsity is too large. Mitigation: we can quantify these failure conditions and test them with unit-tests.
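The hash-based idea can be sketched as follows. This is only one possible reading of the memo, not the implementation under review: the input is divided by an assumed `resolution` to get an index, and each of `activeBits` consecutive indices is hashed to a bit position, so nearby inputs share most of their bits. The `mix` function uses the SplitMix64 finalizer as a stand-in hash; the names `encodeHashed`, `resolution`, and `seed` are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: a dense SDR is modelled as a byte vector (1 = active bit).
using Dense = std::vector<unsigned char>;

// Stand-in integer hash (SplitMix64 finalizer); any good hash would do.
std::uint64_t mix(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Hypothetical encoder: nothing is stored between calls, so memory use
// does not grow with the number of distinct inputs.
Dense encodeHashed(double value, double resolution, std::size_t size,
                   std::size_t activeBits, std::uint64_t seed) {
  assert(resolution > 0.0 && size > 0);
  const auto index = static_cast<std::int64_t>(std::floor(value / resolution));
  Dense out(size, 0);
  for (std::size_t offset = 0; offset < activeBits; ++offset) {
    const auto bucket = static_cast<std::uint64_t>(
        index + static_cast<std::int64_t>(offset));
    // Hash collisions are not checked, so two buckets may land on the
    // same bit and fewer than activeBits bits may end up active.
    out[mix(bucket ^ seed) % size] = 1;
  }
  return out;
}
```

Because collisions are left unchecked, the number of active bits and the overlap between unrelated inputs are only statistically bounded, which is the second con listed above.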

@dkeeney

dkeeney commented Feb 22, 2019

* No Decode

I would think that you could still perform a Decode. You are not storing the previously used patterns but you can re-calculate the patterns used provided you have the starting seed. Just cycle through the used real values until you find one that results in a pattern that matches the one you are trying to decode. Slow but it would work. Or am I misunderstanding what you are proposing?

But then again....decoding is not biological. The only way we know a color is RED is that we match it with another pattern that someone in our experience has told us is RED. The sound of the spoken word "RED" and the word RED all match with the pattern of RED in our experience. Is that decoding?

One could argue that encoders are sort of biological depending on the data being encoded.

@ctrl-z-9000-times
Collaborator

Just cycle through the used real values until you find one that results in a pattern that matches the one you are trying to decode.

That could be very time consuming. The range of values for an RDSE is infinite.

An alternative to a decode method is to make an SDR classifier. We could even integrate the classifier into the encoder to provide a decode method. We would want it to be optional, since the SDR classifier has significant overhead.
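For reference, the brute-force decode discussed above only works if the caller can bound the search. The sketch below re-encodes every index in a caller-supplied range and keeps the one with the highest overlap. The encoder here is a hypothetical hashed stand-in, and all names (`encodeIndex`, `overlap`, `decodeBySearch`) are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: a dense SDR is modelled as a byte vector (1 = active bit).
using Dense = std::vector<unsigned char>;

// Stand-in integer hash (SplitMix64 finalizer).
std::uint64_t mix(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Hypothetical stand-in encoder: hashes consecutive indices to bits.
Dense encodeIndex(std::int64_t index, std::size_t size,
                  std::size_t activeBits) {
  Dense out(size, 0);
  for (std::size_t offset = 0; offset < activeBits; ++offset) {
    const auto bucket = static_cast<std::uint64_t>(
        index + static_cast<std::int64_t>(offset));
    out[mix(bucket) % size] = 1;
  }
  return out;
}

// Number of bits active in both SDRs.
std::size_t overlap(const Dense &a, const Dense &b) {
  assert(a.size() == b.size());
  std::size_t count = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] && b[i]) { ++count; }
  }
  return count;
}

// Re-encode every index in [first, last]; ties go to the lowest index.
std::int64_t decodeBySearch(const Dense &target, std::int64_t first,
                            std::int64_t last, std::size_t activeBits) {
  assert(first <= last);
  std::int64_t best = first;
  std::size_t bestOverlap = 0;
  for (std::int64_t index = first; index <= last; ++index) {
    const std::size_t score =
        overlap(target, encodeIndex(index, target.size(), activeBits));
    if (score > bestOverlap) {
      bestOverlap = score;
      best = index;
    }
  }
  return best;
}
```

The cost is one full encoding per candidate index, which is why an unbounded input range makes this approach impractical.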

@ctrl-z-9000-times
Collaborator

ctrl-z-9000-times commented Apr 1, 2019

Category Encoders

Category encoders should be implemented as Scalar encoders, which encode an Enumeration of the categories using a radius of less than 1.

I think we should not implement category encoders, but rather describe to the user how to make them. We would document this in the following places:

  • Python module nupic.bindings.encoders
  • C++ Header src/nupic/encoders/BaseEncoder.hpp

Both places already contain a general description of what an encoder is. I think we should add our notes about encoders to these locations.

Also, we should add a few unit tests to prove this works.
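A minimal sketch of the idea, under simplifying assumptions: the categories are enumerated in the order given, and each index gets its own block of bits, which imitates a scalar encoder whose neighbouring integer values share no bits. `CategorySketch` is a hypothetical class, not the ScalarEncoder in this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Assumption: a dense SDR is modelled as a byte vector (1 = active bit).
using Dense = std::vector<unsigned char>;

// Hypothetical category encoder built on an enumeration of the categories.
class CategorySketch {
public:
  CategorySketch(const std::vector<std::string> &categories,
                 std::size_t activeBits)
      : activeBits_(activeBits) {
    for (const std::string &name : categories) {
      // Each new name gets the next integer; duplicates are ignored.
      index_.emplace(name, index_.size());
    }
  }

  std::size_t size() const { return index_.size() * activeBits_; }

  // Category i activates bits [i * activeBits, (i + 1) * activeBits),
  // so two different categories never overlap.
  Dense encode(const std::string &category) const {
    const auto found = index_.find(category);
    assert(found != index_.end()); // unknown categories are not handled
    Dense out(size(), 0);
    const std::size_t start = found->second * activeBits_;
    for (std::size_t i = 0; i < activeBits_; ++i) {
      out[start + i] = 1;
    }
    return out;
  }

private:
  std::size_t activeBits_;
  std::map<std::string, std::size_t> index_;
};
```

A unit test for the real encoders could check the same two properties: identical categories give identical SDRs, and different categories give zero overlap.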

@breznak
Member Author

breznak commented Apr 2, 2019

Category encoders should be implemented as Scalar encoders, which encode an Enumeration of the categories using a radius of less than 1.

Yes, that is sufficient. The Category encoder used to work as a demonstration example, and I guess decoding was easier to implement, but we don't support that anymore.

I think we should add our notes about encoders to these locations.

This, or I can imagine an encoders/README.md with most of the text collected from this PR, this issue, etc.

@breznak
Member Author

breznak commented May 5, 2019

I think we should not implement category encoders, but rather describe to the user how to make them.

We will now have the best of both worlds: category encoding implemented "via" a flag to RDSE/Scalar. See #448

documentation in wiki

I really like the "blog" posts in this issue. Just a note about the wiki, I think it would be even better to make an encoders/README.md with its content.

  • Advantages: same markup, available both online (view from web) and offline (in git). Info in wikis/issues is a pain when migrating to a new service, while git is rock solid in that matter.

@ctrl-z-9000-times
Collaborator

I'd like to close this issue, as well as PR #291. All of the tasks here have either been completed or have been moved to another open issue, except for:

  • in-depth explanations, and notes for practical usage

The encoders are documented, tested, and have a few examples, so I'd say this is done. Giving an in-depth explanation is beyond the scope of this project. There is an HTM-School video about how encoders work, as well as a whitepaper. We could put a link to the HTM-School YouTube channel in the README.

Great work all around on this issue!

@ctrl-z-9000-times
Collaborator

Closing this issue, please reopen if there is more to discuss.
