
Redesign of the nodes infrastructure #31

Closed
ste-pac opened this issue Apr 16, 2021 · 4 comments · Fixed by #32
Labels: enhancement (new feature or request), urgent (something that must be done fast)


ste-pac (Member) commented Apr 16, 2021

This issue concerns the redesign of the basic infrastructure used by the library for autodiff.

Currently we manage the propagation of gradients and the invocation of the .forward() and .backward() methods through recursive calls, guarded by control structures such as ForwardAction and BackwardAction to avoid performing some computations multiple times. Unfortunately, this organization has the disadvantage of creating a large number of recursive calls on the stack, with a consequent loss of performance.
The new approach we could take advantage of is to move to an iterative version based on trait objects which, despite the disadvantage of dynamic dispatch, could lead to a simpler implementation as well as a potential performance improvement.
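
To make the idea concrete, here is a minimal sketch (with illustrative names; nothing here is final API) of what the iterative traversal boils down to: the graph is kept as a topologically ordered list of trait objects, and a forward pass becomes a plain loop, one dynamic dispatch per node instead of a chain of recursive calls.

use std::rc::Rc;

pub trait Forward {
    fn forward(&self);
}

// Nodes are assumed to be stored in topological order, so by the time a
// node runs, all of its operands have already been computed.
fn forward_pass(graph: &[Rc<dyn Forward>]) {
    for node in graph {
        node.forward();
    }
}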

A possible full implementation is the following.

Prototype Code
use indexmap::IndexMap;
use ndarray::{Array, DimMax, Dimension, RawArrayView, ShapeError, StrideShape, Zip};
use std::{
    cell::{Ref, RefCell},
    ops::{Deref, DerefMut},
    rc::Rc,
};

pub(crate) type Broadcasted<Lhs, Rhs> = <Lhs as DimMax<Rhs>>::Output;
pub(crate) type BroadTensor<Lhs, Rhs> = Tensor<Broadcasted<Lhs, Rhs>>;
pub(crate) type Tensor<D> = Array<f32, D>;
pub(crate) type RawTensorView<D> = RawArrayView<f32, D>;
pub(crate) type RawBroadTensorView<Lhs, Rhs> = RawArrayView<f32, Broadcasted<Lhs, Rhs>>;

// ============================================= Utils =============================================

fn broadcasted_zero<Lhs, Rhs>(
    left: Ref<Tensor<Lhs>>,
    right: Ref<Tensor<Rhs>>,
) -> BroadTensor<Lhs, Rhs>
where
    Lhs: Dimension + DimMax<Rhs>,
    Rhs: Dimension,
{
    let (mut bigger, smaller) = if left.ndim() >= right.ndim() {
        (left.shape().to_vec(), right.shape())
    } else {
        (right.shape().to_vec(), left.shape())
    };
    for (l, r) in bigger.iter_mut().rev().zip(smaller.iter().rev()) {
        *l = std::cmp::max(*l, *r);
    }
    let total = bigger.iter().product();
    Tensor::from_shape_vec(bigger, vec![0.; total])
        .unwrap()
        .into_dimensionality::<Broadcasted<Lhs, Rhs>>()
        .unwrap()
}

// ========================================= Forward Nodes =========================================

pub trait Node {
    type Dim: Dimension;

    fn uid(&self) -> usize;

    fn value(&self) -> Ref<Tensor<Self::Dim>>;
}

pub trait Forward {
    fn forward(&self);
}

// Input
pub struct InputNode<D: Dimension> {
    uid: usize,
    value: RefCell<Tensor<D>>,
}

impl<D: Dimension> InputNode<D> {
    pub fn new<Sh>(uid: usize, shape: Sh, vec: Vec<f32>) -> Result<Self, ShapeError>
    where
        Sh: Into<StrideShape<D>>,
    {
        Ok(Self {
            uid,
            value: RefCell::new(Array::from_shape_vec(shape, vec)?),
        })
    }
}

impl<D: Dimension> Node for InputNode<D> {
    type Dim = D;

    fn uid(&self) -> usize {
        self.uid
    }

    fn value(&self) -> Ref<Tensor<D>> {
        self.value.borrow()
    }
}

impl<D: Dimension> Forward for InputNode<D> {
    fn forward(&self) {
        // Nothing
    }
}

// Parameter
pub struct ParameterNode<D: Dimension> {
    uid: usize,
    value: RefCell<Tensor<D>>,
}

impl<D: Dimension> ParameterNode<D> {
    pub fn new<Sh>(uid: usize, shape: Sh, vec: Vec<f32>) -> Result<Self, ShapeError>
    where
        Sh: Into<StrideShape<D>>,
    {
        Ok(Self {
            uid,
            value: RefCell::new(Array::from_shape_vec(shape, vec)?),
        })
    }
}

impl<D: Dimension> Node for ParameterNode<D> {
    type Dim = D;

    fn uid(&self) -> usize {
        self.uid
    }

    fn value(&self) -> Ref<Tensor<D>> {
        self.value.borrow()
    }
}

impl<D: Dimension> Forward for ParameterNode<D> {
    fn forward(&self) {
        // Nothing
    }
}

// Addition
pub struct AdditionNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    uid: usize,
    left: Rc<Lhs>,
    right: Rc<Rhs>,
    value: RefCell<BroadTensor<Lhs::Dim, Rhs::Dim>>,
}

impl<Lhs, Rhs> AdditionNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    pub fn new(uid: usize, left: Rc<Lhs>, right: Rc<Rhs>) -> Self {
        let value = RefCell::new(broadcasted_zero(left.value(), right.value()));
        Self {
            uid,
            left,
            right,
            value,
        }
    }
}

impl<Lhs, Rhs> Node for AdditionNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    type Dim = Broadcasted<Lhs::Dim, Rhs::Dim>;

    fn uid(&self) -> usize {
        self.uid
    }

    fn value(&self) -> Ref<Tensor<Self::Dim>> {
        self.value.borrow()
    }
}

impl<Lhs, Rhs> Forward for AdditionNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    fn forward(&self) {
        Zip::from(self.value.borrow_mut().deref_mut())
            .and_broadcast(self.left.value().deref())
            .and_broadcast(self.right.value().deref())
            .par_for_each(|v, l, r| *v = l + r);
    }
}

// ============================================ Backward Nodes ============================================

pub trait DiffNode: Node {
    fn connect_source(&mut self, node_uid: usize, node_view: RawTensorView<Self::Dim>);

    fn disconnect_source(&mut self, node_id: usize);
}

pub trait Backward: Forward {
    fn backward(&self);
}

// Addition
pub struct AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    addition: AdditionNode<Lhs, Rhs>,
    gradient: RefCell<BroadTensor<Lhs::Dim, Rhs::Dim>>,
    sources: IndexMap<usize, RawBroadTensorView<Lhs::Dim, Rhs::Dim>>,
}

impl<Lhs, Rhs> AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    pub fn new(addition: AdditionNode<Lhs, Rhs>) -> Self {
        let gradient = RefCell::new(Tensor::zeros(addition.value().raw_dim()));
        Self {
            addition,
            gradient, // ! To be changed into two separate gradients
            sources: IndexMap::new(),
        }
    }
}

impl<Lhs, Rhs> Node for AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    type Dim = Broadcasted<Lhs::Dim, Rhs::Dim>;

    fn uid(&self) -> usize {
        self.addition.uid()
    }

    fn value(&self) -> Ref<Tensor<Self::Dim>> {
        self.addition.value()
    }
}

impl<Lhs, Rhs> Forward for AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    fn forward(&self) {
        self.addition.forward();
    }
}

impl<Lhs, Rhs> DiffNode for AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    fn connect_source(&mut self, node_id: usize, node_view: RawTensorView<Self::Dim>) {
        assert!(
            self.sources.insert(node_id, node_view).is_none(),
            "Node {} already connected to {}",
            node_id,
            self.addition.uid()
        );
    }

    fn disconnect_source(&mut self, source_id: usize) {
        assert!(
            self.sources.remove(&source_id).is_some(),
            "Node {} isn't connected to {}",
            source_id,
            self.addition.uid()
        );
    }
}

impl<Lhs, Rhs> Backward for AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    fn backward(&self) {
        if self.sources.is_empty() {
            return;
        }

        let mut gradient = self.gradient.borrow_mut();
        let mut sources = self
            .sources
            .values()
            .map(|v| unsafe { v.clone().deref_into_view() });

        gradient.assign(&sources.next().unwrap());
        for source in sources {
            Zip::from(gradient.deref_mut())
                .and_broadcast(source)
                .par_for_each(|l, r| *l += r);
        }
    }
}

// =================================================== Tests ===================================================

#[cfg(test)]
mod tests {
    use super::*;

    mod api {
        use super::*;
        use rand::{self, Rng};

        #[test]
        fn addition_benchmark() {
            let forward_nodes: Vec<Rc<dyn Forward>> = vec![
                Rc::new(AdditionNode::new(
                    0,
                    Rc::new(InputNode::new(0, (1_000, 1_000), vec![1.; 1_000 * 1_000]).unwrap()),
                    Rc::new(InputNode::new(0, (1_000, 1_000), vec![1.; 1_000 * 1_000]).unwrap()),
                ));
                1_024
            ];

            let mut times = Vec::with_capacity(1_024);
            for node in &forward_nodes {
                let start = std::time::Instant::now();
                node.forward();
                let stop = start.elapsed();
                times.push(stop.as_micros());
            }
            println!(
                "Mean Forward Time Iteration: {} microseconds",
                times.iter().sum::<u128>() / times.len() as u128
            );

            let start = std::time::Instant::now();
            for node in &forward_nodes {
                node.forward();
            }
            let elapsed = start.elapsed();
            println!("Mean Forward Time: {} milliseconds", elapsed.as_millis());
        }

        #[test]
        fn backward_addition_benchmark() {
            let gradient_sources = vec![Tensor::from_elem((1_000, 1_000), 0.); 128];

            let mut addition_nodes = Vec::with_capacity(1_024);
            for _ in 0..1_024 {
                addition_nodes.push(AdditionDiffNode::new(AdditionNode::new(
                    0,
                    Rc::new(InputNode::new(0, (1_000, 1_000), vec![1.; 1_000 * 1_000]).unwrap()),
                    Rc::new(InputNode::new(0, (1_000, 1_000), vec![1.; 1_000 * 1_000]).unwrap()),
                )));
            }

            let mut rng = rand::thread_rng();
            for node in &mut addition_nodes {
                node.connect_source(0, gradient_sources[rng.gen_range(0..128)].raw_view());
                node.connect_source(1, gradient_sources[rng.gen_range(0..128)].raw_view());
                node.connect_source(2, gradient_sources[rng.gen_range(0..128)].raw_view());
                node.connect_source(3, gradient_sources[rng.gen_range(0..128)].raw_view());
            }

            let mut backward_nodes: Vec<Rc<dyn Backward>> = Vec::with_capacity(1_024);
            for _ in 0..1_024 {
                backward_nodes.push(Rc::new(addition_nodes.swap_remove(0)));
            }

            let mut times = Vec::with_capacity(1_024);
            for node in &backward_nodes {
                let start = std::time::Instant::now();
                node.backward();
                let stop = start.elapsed();
                times.push(stop.as_micros());
            }
            println!(
                "Mean Backward Time Iteration: {} microseconds",
                times.iter().sum::<u128>() / times.len() as u128
            );

            let start = std::time::Instant::now();
            for node in &backward_nodes {
                node.backward();
            }
            let elapsed = start.elapsed();
            println!("Mean Backward Time: {} milliseconds", elapsed.as_millis());
        }
    }
}

The example is quite trivial, using only Addition nodes, but considering the depth of the graph (1024 nodes) and the size of the tensors (1 million elements each), the benchmark results are encouraging. I warmly invite you to try them out and to raise any ideas or objections that cross your mind.

The resulting benchmark times are the following:

                      Forward    Backward
Iteration Mean Time   156 μs     2 ms
Total Mean Time       156 ms     1'500 ms

Note also that, in the .backward() case, each node is connected to 4 other source nodes of 1 million entries each.


frjnn (Member) commented Apr 17, 2021

The solution looks (very) promising to me.

I ran the same benchmark (same code) against the current implementation on my PC, and I measured:

  1. Mean Forward Time Iteration: 3806 microseconds
  2. Mean Forward Time: 4681 milliseconds

I didn't benchmark the backward pass because it's crystal clear who the winner is. Outstanding. Nicely done.

frjnn added the enhancement and urgent labels on Apr 17, 2021

ste-pac (Member, Author) commented Apr 17, 2021

Errata Corrige

In the code at the beginning of this issue there is a small semantic error that lowers the times for the addition_benchmark() test only: the vec![...] macro clones the result of the contained expression, which creates a vector of Rcs all pointing to a single AdditionNode. Obviously this means that the data is always in cache, so the times are much lower than they should be.
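
A minimal, self-contained illustration of the pitfall:

use std::rc::Rc;

fn main() {
    // vec![expr; n] evaluates expr once and clones it: all four Rcs point
    // to the same allocation, i.e. to a single node.
    let shared = vec![Rc::new(0_u32); 4];
    assert_eq!(Rc::strong_count(&shared[0]), 4);

    // Building the vector in a loop creates four independent values.
    let distinct: Vec<Rc<u32>> = (0..4).map(|_| Rc::new(0)).collect();
    assert_eq!(Rc::strong_count(&distinct[0]), 1);
}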

Results

With the correct version of the code, the results on my pc are the following:

                      Forward    Backward
Iteration Mean Time   572 μs     2'161 μs
Total Mean Time       834 ms     1'558 ms

New Version

With this update we focused on the performance comparison between RawTensorView and *const Tensor<D>: since both are unsafe mechanisms, one is interchangeable with the other. In parallel we also evaluated the simplicity and elegance of the APIs that the two strategies would expose, and came to the conclusion that, in both respects, the *const Tensor<D> strategy is much better.
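
In a nutshell, the difference between the two read paths is the following (a sketch, assuming ndarray's RawArrayView API; names are illustrative):

use ndarray::{Array2, RawArrayView};

fn main() {
    let t: Array2<f32> = Array2::zeros((2, 2));

    // RawArrayView strategy: deref_into_view() consumes self, so a stored
    // view has to be cloned on every read.
    let raw: RawArrayView<f32, _> = t.raw_view();
    let view = unsafe { raw.clone().deref_into_view() };
    assert_eq!(view.sum(), 0.);

    // *const Tensor<D> strategy: a plain pointer dereference, no clone.
    let ptr: *const Array2<f32> = &t;
    let value = unsafe { &*ptr };
    assert_eq!(value.sum(), 0.);
}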

New Code Version
use ndarray::{Array, DimMax, Dimension, ShapeError, StrideShape, Zip};

pub(crate) type Broadcasted<Lhs, Rhs> = <Lhs as DimMax<Rhs>>::Output;
pub(crate) type BroadTensor<Lhs, Rhs> = Tensor<Broadcasted<Lhs, Rhs>>;
pub(crate) type Tensor<D> = Array<f32, D>;

// ============================================= Utils =============================================

fn broadcasted_zero<Lhs, Rhs>(left: &Tensor<Lhs>, right: &Tensor<Rhs>) -> BroadTensor<Lhs, Rhs>
where
    Lhs: Dimension + DimMax<Rhs>,
    Rhs: Dimension,
{
    let (mut bigger, smaller) = if left.ndim() >= right.ndim() {
        (left.shape().to_vec(), right.shape())
    } else {
        (right.shape().to_vec(), left.shape())
    };
    for (l, r) in bigger.iter_mut().rev().zip(smaller.iter().rev()) {
        *l = std::cmp::max(*l, *r);
    }
    let total = bigger.iter().product();
    Tensor::from_shape_vec(bigger, vec![0.; total])
        .unwrap()
        .into_dimensionality::<Broadcasted<Lhs, Rhs>>()
        .unwrap()
}

// ========================================= Forward Nodes =========================================

pub trait Node {
    type Dim: Dimension;

    fn value(&self) -> &Tensor<Self::Dim>;
}

pub trait Forward {
    fn forward(&mut self);
}

// Input
pub struct InputNode<D: Dimension> {
    value: Tensor<D>,
}

impl<D: Dimension> InputNode<D> {
    pub fn new<Sh>(shape: Sh, vec: Vec<f32>) -> Result<Self, ShapeError>
    where
        Sh: Into<StrideShape<D>>,
    {
        Ok(Self {
            value: Array::from_shape_vec(shape, vec)?,
        })
    }
}

impl<D: Dimension> Node for InputNode<D> {
    type Dim = D;

    fn value(&self) -> &Tensor<D> {
        &self.value
    }
}

// Parameter
pub struct ParameterNode<D: Dimension> {
    value: Tensor<D>,
}

impl<D: Dimension> ParameterNode<D> {
    pub fn new<Sh>(shape: Sh, vec: Vec<f32>) -> Result<Self, ShapeError>
    where
        Sh: Into<StrideShape<D>>,
    {
        Ok(Self {
            value: Array::from_shape_vec(shape, vec)?,
        })
    }
}

impl<D: Dimension> Node for ParameterNode<D> {
    type Dim = D;

    fn value(&self) -> &Tensor<D> {
        &self.value
    }
}

impl<D: Dimension> Forward for ParameterNode<D> {
    fn forward(&mut self) {
        // Nothing
    }
}

// Addition
pub struct AdditionNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    left: *const Tensor<Lhs::Dim>,
    right: *const Tensor<Rhs::Dim>,
    value: BroadTensor<Lhs::Dim, Rhs::Dim>,
}

impl<Lhs, Rhs> AdditionNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    pub fn new(left: &Lhs, right: &Rhs) -> Self {
        Self {
            left: left.value() as *const Tensor<Lhs::Dim>,
            right: right.value() as *const Tensor<Rhs::Dim>,
            value: broadcasted_zero(left.value(), right.value()),
        }
    }
}

impl<Lhs, Rhs> Node for AdditionNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    type Dim = Broadcasted<Lhs::Dim, Rhs::Dim>;

    fn value(&self) -> &Tensor<Self::Dim> {
        &self.value
    }
}

impl<Lhs, Rhs> Forward for AdditionNode<Lhs, Rhs>
where
    Lhs: Node,
    Rhs: Node,
    Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    fn forward(&mut self) {
        unsafe {
            Zip::from(&mut self.value)
                .and_broadcast(&*self.left)
                .and_broadcast(&*self.right)
                .par_for_each(|v, l, r| *v = l + r);
        }
    }
}

// ============================================ Backward Nodes ============================================

trait DiffNode: Node {
    fn connect_source(&mut self, node_view: *const Tensor<Self::Dim>);
}

pub trait Backward: Forward {
    fn backward(&mut self);
}

// Addition
pub struct AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node<Dim = Rhs::Dim>,
    Rhs: Node,
    // Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    addition: AdditionNode<Lhs, Rhs>,
    lhs_gradient: Tensor<Lhs::Dim>,
    rhs_gradient: Tensor<Rhs::Dim>,
    sources: Vec<*const Tensor<Rhs::Dim>>,
}

impl<Lhs, Rhs> AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node<Dim = Rhs::Dim>,
    Rhs: Node,
    // Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    pub fn new(addition: AdditionNode<Lhs, Rhs>) -> Self {
        unsafe {
            let lhs_gradient = Tensor::zeros((&*addition.left).raw_dim());
            let rhs_gradient = Tensor::zeros((&*addition.right).raw_dim());
            Self {
                addition,
                lhs_gradient,
                rhs_gradient,
                sources: Vec::new(),
            }
        }
    }
}

impl<Lhs, Rhs> Node for AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node<Dim = Rhs::Dim>,
    Rhs: Node,
    // Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    type Dim = Rhs::Dim;
    // type Dim = Broadcasted<Lhs::Dim, Rhs::Dim>;

    fn value(&self) -> &Tensor<Self::Dim> {
        self.addition.value()
    }
}

impl<Lhs, Rhs> Forward for AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node<Dim = Rhs::Dim>,
    Rhs: Node,
    // Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    fn forward(&mut self) {
        self.addition.forward();
    }
}

impl<Lhs, Rhs> DiffNode for AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node<Dim = Rhs::Dim>,
    Rhs: Node,
    // Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    fn connect_source(&mut self, source: *const Tensor<Self::Dim>) {
        self.sources.push(source);
    }
}

impl<Lhs, Rhs> Backward for AdditionDiffNode<Lhs, Rhs>
where
    Lhs: Node<Dim = Rhs::Dim>,
    Rhs: Node,
    // Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
    fn backward(&mut self) {
        if self.sources.is_empty() {
            return;
        }

        unsafe {
            // Here should be the appropriate `accumulate` code, since now we have
            // different gradients for each operand.
            // In order to bench only the performance impact of `*const Tensor<_>`
            // we stick with a simple **BUT WRONG** plain addition

            self.lhs_gradient.assign(&*self.sources[0]);
            self.rhs_gradient.assign(&*self.sources[0]);
            for source in self.sources.iter().skip(1) {
                Zip::from(&mut self.lhs_gradient)
                    .and_broadcast(&**source)
                    .par_for_each(|l, r| *l += r);

                Zip::from(&mut self.rhs_gradient)
                    .and_broadcast(&**source)
                    .par_for_each(|l, r| *l += r);
            }
        }
    }
}

// =================================================== Tests ===================================================

#[cfg(test)]
mod tests {
    use super::*;

    mod api {
        use super::*;
        use ndarray::Ix2;
        use rand::{self, Rng};
        use std::cell::RefCell;
        use std::rc::Rc;

        #[test]
        fn forward_benchmark() {
            let mut inputs = Vec::with_capacity(1_024);
            for _ in 0..inputs.capacity() {
                inputs.push(InputNode::new((1_000, 1_000), vec![1.; 1_000 * 1_000]).unwrap());
            }

            let mut rng = rand::thread_rng();
            let mut forward_nodes: Vec<Rc<RefCell<dyn Forward>>> = Vec::with_capacity(1_024);
            for _ in 0..forward_nodes.capacity() {
                forward_nodes.push(Rc::new(RefCell::new(AdditionNode::new(
                    &inputs[rng.gen_range(0..inputs.capacity())],
                    &inputs[rng.gen_range(0..inputs.capacity())],
                ))));
            }

            let mut times = Vec::with_capacity(1_024);
            for node in &mut forward_nodes {
                let start = std::time::Instant::now();
                node.borrow_mut().forward();
                let stop = start.elapsed();
                times.push(stop.as_micros());
            }
            println!(
                "Mean Forward Time Iteration: {} microseconds",
                times.iter().sum::<u128>() / times.len() as u128
            );

            for i in 0..32 {
                let start = std::time::Instant::now();
                for node in &mut forward_nodes {
                    node.borrow_mut().forward();
                }
                let elapsed = start.elapsed();
                times[i] = elapsed.as_millis();
            }
            println!(
                "Mean Forward Time: {} milliseconds",
                times[0..32].iter().sum::<u128>() / 32
            );
        }

        #[test]
        fn backward_benchmark() {
            let gradient_sources = vec![Tensor::from_elem((1_000, 1_000), 0.); 128];

            let mut inputs = Vec::with_capacity(1_024);
            for _ in 0..inputs.capacity() {
                inputs.push(InputNode::new((1_000, 1_000), vec![1.; 1_000 * 1_000]).unwrap());
            }

            let mut rng = rand::thread_rng();
            let mut addition_diff_nodes = Vec::with_capacity(1_024);
            for _ in 0..addition_diff_nodes.capacity() {
                addition_diff_nodes.push(AdditionDiffNode::new(AdditionNode::new(
                    &inputs[rng.gen_range(0..inputs.capacity())],
                    &inputs[rng.gen_range(0..inputs.capacity())],
                )));
            }

            for node in &mut addition_diff_nodes {
                node.connect_source(&gradient_sources[rng.gen_range(0..128)] as *const Tensor<Ix2>);
                node.connect_source(&gradient_sources[rng.gen_range(0..128)] as *const Tensor<Ix2>);
                node.connect_source(&gradient_sources[rng.gen_range(0..128)] as *const Tensor<Ix2>);
                node.connect_source(&gradient_sources[rng.gen_range(0..128)] as *const Tensor<Ix2>);
            }

            let mut backward_nodes: Vec<Rc<RefCell<dyn Backward>>> = Vec::with_capacity(1_024);
            for _ in 0..1_024 {
                backward_nodes.push(Rc::new(RefCell::new(addition_diff_nodes.swap_remove(0))));
            }

            let mut times = Vec::with_capacity(1_024);
            for node in &mut backward_nodes {
                let start = std::time::Instant::now();
                node.borrow_mut().backward();
                let stop = start.elapsed();
                times.push(stop.as_micros());
            }
            println!(
                "Mean Backward Time Iteration: {} microseconds",
                times.iter().sum::<u128>() / times.len() as u128
            );

            for i in 0..32 {
                let start = std::time::Instant::now();
                for node in &mut backward_nodes {
                    node.borrow_mut().backward();
                }
                let elapsed = start.elapsed();
                times[i] = elapsed.as_millis();
            }
            println!(
                "Mean Backward Time: {} milliseconds",
                times[0..32].iter().sum::<u128>() / 32
            );
        }
    }
}

The API is much cleaner than the previous one; moreover, in this way it is no longer necessary to use Rc inside the nodes, nor RefCell. As for performance, having abandoned RawTensorView, and therefore no longer having to clone the views to keep the references around (deref_into_view() consumes self), we got a slight positive impact on the Forward procedure.

                      Forward    Backward
Iteration Mean Time   552 μs     4'046 μs
Total Mean Time       823 ms     3'033 ms

Concerning the Backward part, times have increased, but note that we have also switched to a version where each operation node holds separate gradients for its operands; thus, in the benchmark case, the number of additions performed is doubled.


frjnn (Member) commented Apr 18, 2021

broadcasted_zeros could be rewritten as:

fn broadcasted_zeros<Lhs, Rhs>(left: &Tensor<Lhs>, right: &Tensor<Rhs>) -> BroadTensor<Lhs, Rhs>
where
    Lhs: Dimension + DimMax<Rhs>,
    Rhs: Dimension,
{
    let (bigger, smaller) = if left.ndim() >= right.ndim() {
        (left.shape(), right.shape())
    } else {
        (right.shape(), left.shape())
    };
    let b_dim = {
        let mut empty_d = <Lhs as DimMax<Rhs>>::Output::zeros(bigger.len());
        let empty_d_slice = empty_d.slice_mut();
        empty_d_slice
            .iter_mut()
            .zip(bigger.iter())
            .for_each(|(e_el, b_el)| *e_el = *b_el);
        empty_d_slice
            .iter_mut()
            .rev()
            .zip(smaller.iter().rev())
            .for_each(|(l, r)| *l = std::cmp::max(*l, *r));
        empty_d
    };
    Tensor::zeros(b_dim)
}
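
For example, under the usual broadcasting rules the following sketch would hold (assuming the rewrite above and the Tensor alias from the prototype):

use ndarray::{Ix1, Ix2};

fn main() {
    let left: Tensor<Ix2> = Tensor::zeros((3, 4));
    let right: Tensor<Ix1> = Tensor::zeros(4);
    // Broadcasting (3, 4) against (4,) yields a zeroed (3, 4) tensor.
    let zeroed = broadcasted_zeros(&left, &right);
    assert_eq!(zeroed.shape(), &[3, 4]);
}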


frjnn (Member) commented Apr 18, 2021

Assignment and reduction in fewer than 30 lines.

fn sum_axis_inplace(arr: &mut ndarray::ArrayD<f32>, axis: ndarray::Axis) {
    let (first, rest) = arr.view_mut().split_at(axis, 1);
    ndarray::Zip::from(first.remove_axis(axis))
        .and(rest.lanes(axis))
        .for_each(|dst, src| *dst += src.sum());
    arr.index_axis_inplace(axis, 0);
}

pub fn reduce<D: ndarray::Dimension, E: ndarray::Dimension>(
    dest: &mut ndarray::Array<f32, D>,
    src: &ndarray::Array<f32, E>,
) {
    let mut dyn_rhs = src.clone().into_dyn();
    let static_rhs = unsafe {
        while (*(&dyn_rhs as *const ndarray::ArrayD<f32>)).ndim() > dest.ndim() {
            sum_axis_inplace(&mut dyn_rhs, ndarray::Axis(0));
        }
        for (axis, size) in dest.shape().iter().enumerate() {
            if *size == 1 {
                sum_axis_inplace(&mut dyn_rhs, ndarray::Axis(axis));
                dyn_rhs.insert_axis_inplace(ndarray::Axis(axis));
            }
        }
        dyn_rhs.as_standard_layout()
    };
    ndarray::Zip::from(dest)
        .and_broadcast(&static_rhs)
        .for_each(|dest_el, src_el| *dest_el = *src_el);
}
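
A quick check of the behavior (a sketch, assuming the two helpers above): reducing a gradient living at the broadcast shape (3, 4) back onto an operand of shape (4,) sums the 3 broadcast copies into each destination entry.

use ndarray::{Array, Ix1, Ix2};

fn main() {
    let src: Array<f32, Ix2> = Array::ones((3, 4));
    let mut dest: Array<f32, Ix1> = Array::zeros(4);
    reduce(&mut dest, &src);
    // Each destination entry is the sum over the broadcast axis: 1 + 1 + 1.
    assert_eq!(dest, Array::from_elem(4, 3.));
}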
