
Consider adding a Tensor class #64

Closed
castarco opened this issue Oct 8, 2016 · 10 comments

@castarco

castarco commented Oct 8, 2016

It would be very interesting to add a "Tensor" class in order to allow performing very CPU-expensive operations inside PHP. A tensor is, roughly, a generalization of a matrix.

I know that this isn't special/new from the perspective of data structures (we can nest arrays), but it's very interesting from the performance perspective, or from the "formal/purist" perspective (we can check "shapes" and type concordance in a better way than with nested simpler structures).

It might also be interesting to add a Matrix class for people who don't need more abstraction.

I think this would open the door to the proliferation of machine-learning libraries for PHP. Currently we can only find two or three, and they cannot compete in any way with the more mature solutions for Java, Python, Julia...

Having these capabilities built into PHP would make things much easier, because we'd be able to adopt this sort of technology without having to rely on a complex technology stack. Of course, if a company has a lot of money and time, looking for the best solution (even if it's complex) is a good way to go, but not every company has the expertise, the money or the time.

I also know that there are some obscure "algebra-oriented" extensions for PHP, but I think that PHP-DS will eventually be accepted as a core extension, so this is the place to add this data structure.

@castarco
Author

@rtheunissen I've read the contributing guide and it's not clear to me what I should do when a contribution is related to new features.

  • Should I start with the polyfill to put the emphasis on the interfaces?
  • Should I wait until more opinions are shared in this thread?

Kind regards.

@rtheunissen
Member

@castarco you're right, the contribution guide doesn't cover what to do about new features, especially when a 2.0 update is being planned. Let's keep discussing exactly what the plan is here, and implement whatever we decide on later. The process would be polyfill and tests first, then implement in the extension, then tag and release all three at the same time.

I'm all for a Matrix, and did at some point come up with a draft API for it. The plan was to set the size in the constructor only, roughly something like this:

  • __construct(int, int): void
  • row(int): Tuple
  • column(int): Tuple
  • get(int, int): mixed
  • set(int, int, mixed): void
  • $matrix[row]
  • $matrix[row][col]

With all the expected matrix functionality like inverse, product, etc. as well as the ability to create a submatrix (r, c, w, h), and standard collection functions like map, apply, sum, reduce, each etc.

We could also support appending and prepending rows and columns, haven't really thought about that. My linear algebra isn't crazy strong so I would need to do a lot of reading to make sure everything is correct.
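
Roughly, a sketch of that draft API as a plain PHP class, just to make it concrete. This is not a committed design: the row-major storage, the zero-fill and returning plain arrays instead of Tuple are placeholders, and the real thing would also implement \ArrayAccess for the $matrix[row][col] syntax.

<?php

// A rough sketch only, not a committed design. Row-major storage, zero-fill
// and plain-array return values are placeholders for the draft API above.
final class Matrix
{
    private $rows;
    private $cols;
    private $data; // flat, row-major array of $rows * $cols cells

    // Size is fixed in the constructor; every cell starts at zero.
    public function __construct(int $rows, int $cols)
    {
        $this->rows = $rows;
        $this->cols = $cols;
        $this->data = array_fill(0, $rows * $cols, 0);
    }

    public function get(int $row, int $col)
    {
        return $this->data[$row * $this->cols + $col];
    }

    public function set(int $row, int $col, $value)
    {
        $this->data[$row * $this->cols + $col] = $value;
    }

    // Would return a Tuple in the real API; plain arrays keep the sketch self-contained.
    public function row(int $row): array
    {
        return array_slice($this->data, $row * $this->cols, $this->cols);
    }

    public function column(int $col): array
    {
        return array_column(array_chunk($this->data, $this->cols), $col);
    }
}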

@castarco
Author

castarco commented Oct 13, 2016

Ok :) , I'll start with a rough draft then.

Interface proposal for Tensor:

/**
 * Like a matrix, but instead of a 2-dimensional structure, we have an n-dimensional structure.
 */
interface Tensor extends \ArrayAccess
{
    // ------------------------------------------------------------------
    // Inherited methods from \ArrayAccess :
    // ------------------------------------------------------------------

    /**
     * This method will check that the passed "index" is compatible with the tensor's "shape".
     */
    public function offsetExists($offset);

    /**
     * The idea:
     * $x = $tensor[[12, 3, 76, 44]]; // It seems strange... but it works :) .
     *
     * What I want to avoid:
     * $x = $tensor[12][3][76][44]; // this idiom forces us to create intermediate objects
     */
    public function offsetGet($offset);

    /**
     * $tensor[[12, 3, 76, 44]] = 42.0; // This way we can do clean assignments.
     */
    public function offsetSet($offset, $value);

    /**
     * This method should throw an exception every time it is called; it makes no sense to individually unset tensor components.
     */
    public function offsetUnset($offset);

    // ------------------------------------------------------------------
    // Tensor specific methods
    // ------------------------------------------------------------------

     /**
       * Returns the "shape" of the tensor, the tuple with its dimension "widths".
       * @return Tuple (tuple of strictly positive integers) : [25, 10, 100, 50]
       */
     public function shape() : Tuple;

     /**
      * Returns the number of dimensions (the "tensor rank"). In the case of the example above, it should be 4.
      */
     public function numDims() : int;

     /**
       * Copies TensorFlow's behaviour: https://www.tensorflow.org/versions/master/api_docs/python/array_ops.html#reshape (without modifying the original tensor)
       *
       * @param \ArrayAccess $newShape : int values tuple .
       */
     public function reshape(\ArrayAccess $newShape) : Tensor;

     /**
      * Removes all dimensions with a "width" equal to 1 (redundant dimensions).
      * Without modifying the original tensor.
      */
     public function squeeze() : Tensor;

     /**
      * Adds a new redundant dimension (with width == 1) to the specified index:
      *   [20, 30, 20] --> ($newDimIndex = 0) --> [1, 20, 30, 20]
      *   [20, 30, 20] --> ($newDimIndex = 1) --> [20, 1, 30, 20]
      *  Redundant dimensions are useful to allow "shape matching" in some circumstances,
      *  more about that subject later.
      */
     public function addDimension(int $newDimIndex) : Tensor;

     /**
      * Creates a new tensor by "slicing" the original one. To slice a tensor, we have to specify
      * what to do with every dimension: we can pass an interval, a null/[] to take the whole
      * dimension, or a specific value.
      * $t->slice([[10, 15], null, 76, []])
      */
     public function slice(\ArrayAccess $slice) : Tensor;

     /**
      * "Expands" the current tensor, repeating its values over its dimensions following the
      * "tile specs". Example:
      *
      * $t1 = [[1, 2], [3, 4]]
      * $t1->tile([3, 2]) = [[1, 2, 1, 2], [3, 4, 3, 4], [1, 2, 1, 2], [3, 4, 3, 4], [1, 2, 1, 2], [3, 4, 3, 4]]
      */
     public function tile(\ArrayAccess $tileSpecs) : Tensor;

     /**
      * Generalization of a matrix transposition. Instead of doing a permutation only over
      * two dimensions, we specify a concrete permutation over the n dimensions.
      * A null permutation means "reverse the dimension order", i.e. [n-1, ..., 0].
      */
     public function transpose(\ArrayAccess $dimsPermutation = null) : Tensor;

     /*
      Many other methods like `sum(Tensor|scalar)`, `mul` (not matrix multiplication), `div`, ...
      to do "element-wise" operations between tensors or scalars.

      When a "redundant" dimension is found for a particular index, these methods must
      act as if that dimension has the same "width" as the one we found in the other operand.

      Example:
         t1.shape() : [100, 1, 100]
         t2.shape() : [100, 50, 100]
         t3 = t1.sum(t2)
         t3.shape() : [100, 50, 100] -> the values of t1 are "replicated" 50 times to make this operation possible (as if we used the `tile` method).
      */
}
  • The \ArrayAccess methods should throw something like "KeyException" when the passed "index / key / coordinate / offset" is not compatible with the tensor's shape.
  • Native "arrays", but also Tuple, or in general \ArrayAccess instances should be accepted as "offset" in the \ArrayAccess methods.

This interface is incomplete, but I think it's better to first talk about the main ideas behind it. I know there are some points that could be somewhat controversial.

I assume that many tensor transformations aren't done in place, but create a new tensor object. This can also be discussed, but even if there is a performance penalty, I think this is the best way to avoid headaches.

EDIT: Maybe, if "in place" operations offer a high performance improvement over the defined operators, we could use a specific prefix/suffix for "in place" operations. Something like "inPlaceWhatever", or "ipWhatever". I think that the optimizations should be very explicit, and not the default case.

@castarco
Author

castarco commented Oct 15, 2016

About the Tensor constructor: I think that, because tensors have a very wide range of possible initial values, some static factories will be needed.

I don't like the idea of making the constructor public, but if we need something like that to improve testability or some other "software-architecture property", I'd choose something like:

public function __construct(\ArrayAccess $shape); // Filled with zeroes.

or... (to make use of PHP 7's type hinting features, one possibility is to combine scalar type hints with variadic arguments; it works!):

// unfortunately we can't use this trick in the `offsetGet` method.
public function __construct(int ...$shape); // Filled with zeroes.

Mental note: This trick could be used in other methods, like reshape.

A caveat of this trick is that it then becomes more difficult to add an extra parameter to define an initialization value other than 0. But we can use static factories :) .

Tensor::zeroes(int ...$shape) : Tensor;
Tensor::ones(int ...$shape) : Tensor;
Tensor::constant(float $constant, int ...$shape) : Tensor;

/**
 * @param callable $func : function of the tensor coordinates .
 */
Tensor::fromFunction(callable $func, int ...$shape) : Tensor;
/**
 * We have (at least) these important distributions to consider:
 *     - Gaussian
 *     - Uniform
 */
Tensor::random(Enum $distribution, null|AssociativeArray $distParams, int ...$shape) : Tensor;
  • The usage of Enum or AssociativeArray is just to make the proposal easier to read.
  • I don't want to introduce new types/classes for statistical distributions because I don't want to turn this library into a statistical package. I know that something like Tensor::random(RandomDist $distribution, int ...$shape) : Tensor would be cleaner.
  • A Matrix class should also have static factories, maybe even more, because matrices have more interesting properties than tensors (we could implement factories to initialize diagonal matrices, random invertible matrices, etc.).
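
A hypothetical usage of the factories sketched above; the distribution name and parameter keys are only illustrative:

// Hypothetical usage of the proposed factories.
$zeros = Tensor::zeroes(25, 10, 100, 50);  // every cell is 0
$twos  = Tensor::constant(2.0, 3, 3, 3);   // every cell is 2.0

// Filled from a function of the coordinates: here cell [i, j] = i + j.
$plane = Tensor::fromFunction(function (int $i, int $j) {
    return $i + $j;
}, 4, 4);

// Random initialization; distribution name and parameter keys are illustrative.
$noise = Tensor::random('gaussian', ['mean' => 0.0, 'stdDev' => 1.0], 100, 100);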

@nikic
Contributor

nikic commented Oct 15, 2016

The word "tensor" has a specific mathematical meaning -- it's an object that transforms in a certain way under change of coordinates. What you describe here is an n-dimensional array, which may or may not represent a tensor.

In any case, I would recommend using the API (and naming) of numpy for this purpose. (Though I also think that this is out of scope for this library -- this is really something that a separate extension should do in a more comprehensive manner. Just adding ndarray is not enough to be useful.)

@castarco
Author

castarco commented Oct 15, 2016

Hi @nikic , the mathematical properties of Tensors are not properties of the tensors themselves, but properties of the relations between tensors and other mathematical objects, like other tensors, linear spaces, coordinate systems...

Any n-dimensional "numerical"¹ array can be considered a tensor given a suitable "environment". In the same way, we use the word Vector for some data structures without caring about the linear space the vector belongs to, nor "embedding" this information inside the Vector instances.

In any case, you're right on the fundamentals. And it's probably true that "NDArrays/Tensors" should belong to a different library.

  1. By "numerical" I mean: member of a "ring" or "field".

@castarco
Author

Hi again @nikic, I'm working on a private repository right now to implement an "extension polyfill" with these classes:

  • Tensor (abstract)
  • IntTensor
  • FloatTensor
  • Matrix (abstract)
  • IntMatrix
  • FloatMatrix
  • DataFrame
  • Series (something like a DataFrame's column)

When I release it, I'll be very grateful if someone reviews it. The next step will be to implement the extension. I'm thinking about programming it in Rust; if I do that, is there any possibility for the extension to be accepted as a core extension? Or should I use plain C?

Thank you for your time.

@castarco
Author

castarco commented Oct 21, 2016

@nikic , @rtheunissen , The first bits of the "Tensor" polyfill: https://github.com/SciPHPy/php-sds-polyfill

There's a lot of work to do yet on the PHP side, and much more on the native side, of course.

@rtheunissen
Member

While I like the idea of this, I'm going to push it to the very bottom of the list and focus on the important things for 2.0, then look at the possibility of a Tensor or Matrix data structure implementation. Might consider a separate library entirely, focused on mathematics more than generic data structures.

@castarco
Author

Ok, I'll close this issue. I'm working on an independent library.
