Redesign of the nodes infrastructure #31
Comments
The solution looks (very) promising to me. I ran the same benchmark (same code) against the current implementation on my PC and I measured:
I didn't run the benchmark for the backward pass because it's crystal clear who the winner is. Outstanding. Nicely done.
Errata Corrige
In the code at the beginning of this issue there is a small semantic error that artificially reduces the measured times.
Results
With the correct version of the code, the results on my PC are the following:
New Version
With this update we focused on the performance comparison of the new version of the code.
New Code Version

```rust
use ndarray::{Array, DimMax, Dimension, ShapeError, StrideShape, Zip};
pub(crate) type Broadcasted<Lhs, Rhs> = <Lhs as DimMax<Rhs>>::Output;
pub(crate) type BroadTensor<Lhs, Rhs> = Tensor<Broadcasted<Lhs, Rhs>>;
pub(crate) type Tensor<D> = Array<f32, D>;
// ============================================= Utils =============================================
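// Computes the broadcast shape of `left` and `right` (NumPy-style rules:
// shapes are right-aligned and each output entry is the max of the paired
// dimensions) and returns a zero-initialized tensor of that shape.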
fn broadcasted_zero<Lhs, Rhs>(left: &Tensor<Lhs>, right: &Tensor<Rhs>) -> BroadTensor<Lhs, Rhs>
where
Lhs: Dimension + DimMax<Rhs>,
Rhs: Dimension,
{
let (mut bigger, smaller) = if left.ndim() >= right.ndim() {
(left.shape().to_vec(), right.shape())
} else {
(right.shape().to_vec(), left.shape())
};
for (l, r) in bigger.iter_mut().rev().zip(smaller.iter().rev()) {
*l = std::cmp::max(*l, *r);
}
let total = bigger.iter().product();
Tensor::from_shape_vec(bigger, vec![0.; total])
.unwrap()
.into_dimensionality::<Broadcasted<Lhs, Rhs>>()
.unwrap()
}
// ========================================= Forward Nodes =========================================
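// A `Node` exposes the tensor it holds; `Forward` recomputes that tensor from
// the node's operands (a no-op for leaves such as `ParameterNode`).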
pub trait Node {
type Dim: Dimension;
fn value(&self) -> &Tensor<Self::Dim>;
}
pub trait Forward {
fn forward(&mut self);
}
// Input
pub struct InputNode<D: Dimension> {
value: Tensor<D>,
}
impl<D: Dimension> InputNode<D> {
pub fn new<Sh>(shape: Sh, vec: Vec<f32>) -> Result<Self, ShapeError>
where
Sh: Into<StrideShape<D>>,
{
Ok(Self {
value: Array::from_shape_vec(shape, vec)?,
})
}
}
impl<D: Dimension> Node for InputNode<D> {
type Dim = D;
fn value(&self) -> &Tensor<D> {
&self.value
}
}
// Parameter
pub struct ParameterNode<D: Dimension> {
value: Tensor<D>,
}
impl<D: Dimension> ParameterNode<D> {
pub fn new<Sh>(shape: Sh, vec: Vec<f32>) -> Result<Self, ShapeError>
where
Sh: Into<StrideShape<D>>,
{
Ok(Self {
value: Array::from_shape_vec(shape, vec)?,
})
}
}
impl<D: Dimension> Node for ParameterNode<D> {
type Dim = D;
fn value(&self) -> &Tensor<D> {
&self.value
}
}
impl<D: Dimension> Forward for ParameterNode<D> {
fn forward(&mut self) {
// Nothing
}
}
// Addition
pub struct AdditionNode<Lhs, Rhs>
where
Lhs: Node,
Rhs: Node,
Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
left: *const Tensor<Lhs::Dim>,
right: *const Tensor<Rhs::Dim>,
value: BroadTensor<Lhs::Dim, Rhs::Dim>,
}
impl<Lhs, Rhs> AdditionNode<Lhs, Rhs>
where
Lhs: Node,
Rhs: Node,
Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
pub fn new(left: &Lhs, right: &Rhs) -> Self {
Self {
left: left.value() as *const Tensor<Lhs::Dim>,
right: right.value() as *const Tensor<Rhs::Dim>,
value: broadcasted_zero(left.value(), right.value()),
}
}
}
impl<Lhs, Rhs> Node for AdditionNode<Lhs, Rhs>
where
Lhs: Node,
Rhs: Node,
Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
type Dim = Broadcasted<Lhs::Dim, Rhs::Dim>;
fn value(&self) -> &Tensor<Self::Dim> {
&self.value
}
}
impl<Lhs, Rhs> Forward for AdditionNode<Lhs, Rhs>
where
Lhs: Node,
Rhs: Node,
Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
fn forward(&mut self) {
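// SAFETY (prototype assumption): `left` and `right` point into the operand
// nodes, which must outlive this node; the benchmarks below keep the inputs
// alive for the whole test.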
unsafe {
Zip::from(&mut self.value)
.and_broadcast(&*self.left)
.and_broadcast(&*self.right)
.par_for_each(|v, l, r| *v = l + r);
}
}
}
// ============================================ Backward Nodes ============================================
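// A differentiable node registers "sources": raw views over the gradients
// flowing in from downstream nodes, which `backward` then accumulates into
// the gradients of its own operands.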
trait DiffNode: Node {
fn connect_source(&mut self, node_view: *const Tensor<Self::Dim>);
}
pub trait Backward: Forward {
fn backward(&mut self);
}
// Addition
pub struct AdditionDiffNode<Lhs, Rhs>
where
Lhs: Node<Dim = Rhs::Dim>,
Rhs: Node,
// Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
addition: AdditionNode<Lhs, Rhs>,
lhs_gradient: Tensor<Lhs::Dim>,
rhs_gradient: Tensor<Rhs::Dim>,
sources: Vec<*const Tensor<Rhs::Dim>>,
}
impl<Lhs, Rhs> AdditionDiffNode<Lhs, Rhs>
where
Lhs: Node<Dim = Rhs::Dim>,
Rhs: Node,
// Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
pub fn new(addition: AdditionNode<Lhs, Rhs>) -> Self {
unsafe {
let lhs_gradient = Tensor::zeros((&*addition.left).raw_dim());
let rhs_gradient = Tensor::zeros((&*addition.right).raw_dim());
Self {
addition,
lhs_gradient,
rhs_gradient,
sources: Vec::new(),
}
}
}
}
impl<Lhs, Rhs> Node for AdditionDiffNode<Lhs, Rhs>
where
Lhs: Node<Dim = Rhs::Dim>,
Rhs: Node,
// Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
type Dim = Rhs::Dim;
// type Dim = Broadcasted<Lhs::Dim, Rhs::Dim>;
fn value(&self) -> &Tensor<Self::Dim> {
self.addition.value()
}
}
impl<Lhs, Rhs> Forward for AdditionDiffNode<Lhs, Rhs>
where
Lhs: Node<Dim = Rhs::Dim>,
Rhs: Node,
// Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
fn forward(&mut self) {
self.addition.forward();
}
}
impl<Lhs, Rhs> DiffNode for AdditionDiffNode<Lhs, Rhs>
where
Lhs: Node<Dim = Rhs::Dim>,
Rhs: Node,
// Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
fn connect_source(&mut self, source: *const Tensor<Self::Dim>) {
self.sources.push(source);
}
}
impl<Lhs, Rhs> Backward for AdditionDiffNode<Lhs, Rhs>
where
Lhs: Node<Dim = Rhs::Dim>,
Rhs: Node,
// Lhs::Dim: Dimension + DimMax<Rhs::Dim>,
{
fn backward(&mut self) {
if self.sources.is_empty() {
return;
}
unsafe {
// Here should be the appropriate `accumulate` code, since now we have
// different gradients for each operand.
// In order to bench only the performance impact of `*const Tensor<_>`
// we stick with a simple **BUT WRONG** plain addition
self.lhs_gradient.assign(&*self.sources[0]);
self.rhs_gradient.assign(&*self.sources[0]);
for source in self.sources.iter().skip(1) {
Zip::from(&mut self.lhs_gradient)
.and_broadcast(&**source)
.par_for_each(|l, r| *l += r);
Zip::from(&mut self.rhs_gradient)
.and_broadcast(&**source)
.par_for_each(|l, r| *l += r);
}
}
}
}
// =================================================== Tests ===================================================
#[cfg(test)]
mod tests {
use super::*;
mod api {
use super::*;
use ndarray::Ix2;
use rand::{self, Rng};
use std::cell::RefCell;
use std::rc::Rc;
#[test]
fn forward_benchmark() {
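// Build 1,024 inputs of 1,000 × 1,000 elements; each addition node draws
// its two operands at random from this pool.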
let mut inputs = Vec::with_capacity(1_024);
for _ in 0..inputs.capacity() {
inputs.push(InputNode::new((1_000, 1_000), vec![1.; 1_000 * 1_000]).unwrap());
}
let mut rng = rand::thread_rng();
let mut forward_nodes: Vec<Rc<RefCell<dyn Forward>>> = Vec::with_capacity(1_024);
for _ in 0..forward_nodes.capacity() {
forward_nodes.push(Rc::new(RefCell::new(AdditionNode::new(
&inputs[rng.gen_range(0..inputs.capacity())],
&inputs[rng.gen_range(0..inputs.capacity())],
))));
}
let mut times = Vec::with_capacity(1_024);
for node in &mut forward_nodes {
let start = std::time::Instant::now();
node.borrow_mut().forward();
let stop = start.elapsed();
times.push(stop.as_micros());
}
println!(
"Mean Forward Time Iteration: {} microseconds",
times.iter().sum::<u128>() / times.len() as u128
);
for i in 0..32 {
let start = std::time::Instant::now();
for node in &mut forward_nodes {
node.borrow_mut().forward();
}
let elapsed = start.elapsed();
times[i] = elapsed.as_millis();
}
println!(
"Mean Forward Time: {} milliseconds",
times[0..32].iter().sum::<u128>() / 32
);
}
#[test]
fn backward_benchmark() {
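// 128 shared gradient sources of 1,000 × 1,000 elements; every backward
// node is connected to 4 of them, drawn at random.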
let gradient_sources = vec![Tensor::from_elem((1_000, 1_000), 0.); 128];
let mut inputs = Vec::with_capacity(1_024);
for _ in 0..inputs.capacity() {
inputs.push(InputNode::new((1_000, 1_000), vec![1.; 1_000 * 1_000]).unwrap());
}
let mut rng = rand::thread_rng();
let mut addition_diff_nodes = Vec::with_capacity(1_024);
for _ in 0..addition_diff_nodes.capacity() {
addition_diff_nodes.push(AdditionDiffNode::new(AdditionNode::new(
&inputs[rng.gen_range(0..inputs.capacity())],
&inputs[rng.gen_range(0..inputs.capacity())],
)));
}
for node in &mut addition_diff_nodes {
node.connect_source(&gradient_sources[rng.gen_range(0..128)] as *const Tensor<Ix2>);
node.connect_source(&gradient_sources[rng.gen_range(0..128)] as *const Tensor<Ix2>);
node.connect_source(&gradient_sources[rng.gen_range(0..128)] as *const Tensor<Ix2>);
node.connect_source(&gradient_sources[rng.gen_range(0..128)] as *const Tensor<Ix2>);
}
let mut backward_nodes: Vec<Rc<RefCell<dyn Backward>>> = Vec::with_capacity(1_024);
for _ in 0..1_024 {
backward_nodes.push(Rc::new(RefCell::new(addition_diff_nodes.swap_remove(0))));
}
let mut times = Vec::with_capacity(1_024);
for node in &mut backward_nodes {
let start = std::time::Instant::now();
node.borrow_mut().backward();
let stop = start.elapsed();
times.push(stop.as_micros());
}
println!(
"Mean Backward Time Iteration: {} microseconds",
times.iter().sum::<u128>() / times.len() as u128
);
for i in 0..32 {
let start = std::time::Instant::now();
for node in &mut backward_nodes {
node.borrow_mut().backward();
}
let elapsed = start.elapsed();
times[i] = elapsed.as_millis();
}
println!(
"Mean Backward Time: {} milliseconds",
times[0..32].iter().sum::<u128>() / 32
);
}
}
}
```

The API is much cleaner than the previous one; moreover, in this way it is no longer necessary to use …
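For illustration, building and running a tiny graph with this API could look like the sketch below; it reuses the types from the snippet above, assumes ndarray's `rayon` feature is enabled (required by the prototype's `par_for_each` calls), and keeps the operands alive for as long as the addition node, as its raw pointers require:

```rust
fn main() {
    // Two 2 × 3 inputs filled with 1.0 and 2.0 respectively.
    let x = InputNode::new((2, 3), vec![1.; 6]).unwrap();
    let y = InputNode::new((2, 3), vec![2.; 6]).unwrap();
    // The addition node caches raw views of both operands.
    let mut sum = AdditionNode::new(&x, &y);
    sum.forward();
    assert!(sum.value().iter().all(|&v| (v - 3.0).abs() < f32::EPSILON));
}
```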
Concerning the `broadcasted_zeros` helper, an alternative implementation is the following:

```rust
fn broadcasted_zeros<Lhs, Rhs>(left: &Tensor<Lhs>, right: &Tensor<Rhs>) -> BroadTensor<Lhs, Rhs>
where
Lhs: Dimension + DimMax<Rhs>,
Rhs: Dimension,
{
let (bigger, smaller) = if left.ndim() >= right.ndim() {
(left.shape(), right.shape())
} else {
(right.shape(), left.shape())
};
let b_dim = {
let mut empty_d = <Lhs as DimMax<Rhs>>::Output::zeros(bigger.len());
let empty_d_slice = empty_d.slice_mut();
empty_d_slice
.iter_mut()
.zip(bigger.iter())
.for_each(|(e_el, b_el)| *e_el = *b_el);
empty_d_slice
.iter_mut()
.rev()
.zip(smaller.iter().rev())
.for_each(|(l, r)| *l = std::cmp::max(*l, *r));
empty_d
};
Tensor::zeros(b_dim)
}
```
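As a quick sanity check (shapes chosen purely for illustration, reusing the `Tensor` alias from the prototype above), broadcasting a (3, 1) tensor against a (4,) one yields a (3, 4) zero tensor:

```rust
use ndarray::{Ix1, Ix2};

fn main() {
    let a = Tensor::<Ix2>::zeros((3, 1));
    let b = Tensor::<Ix1>::zeros(4);
    let z = broadcasted_zeros(&a, &b);
    assert_eq!(z.shape(), &[3, 4]);
}
```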
Assignment and reduction in fewer than 30 lines.

```rust
// Sums the lanes of `arr` along `axis` into its first hyperplane, then drops
// that axis, reducing the array's dimensionality by one in place.
fn sum_axis_inplace(arr: &mut ndarray::ArrayD<f32>, axis: ndarray::Axis) {
let (first, rest) = arr.view_mut().split_at(axis, 1);
ndarray::Zip::from(first.remove_axis(axis))
.and(rest.lanes(axis))
.for_each(|dst, src| *dst += src.sum());
arr.index_axis_inplace(axis, 0);
}
pub fn reduce<D: ndarray::Dimension, E: ndarray::Dimension>(
dest: &mut ndarray::Array<f32, D>,
src: &ndarray::Array<f32, E>,
) {
let mut dyn_rhs = src.clone().into_dyn();
let static_rhs = unsafe {
while (*(&dyn_rhs as *const ndarray::ArrayD<f32>)).ndim() > dest.ndim() {
sum_axis_inplace(&mut dyn_rhs, ndarray::Axis(0));
}
for (axis, size) in dest.shape().iter().enumerate() {
if *size == 1 {
sum_axis_inplace(&mut dyn_rhs, ndarray::Axis(axis));
dyn_rhs.insert_axis_inplace(ndarray::Axis(axis));
}
}
dyn_rhs.as_standard_layout()
};
ndarray::Zip::from(dest)
.and_broadcast(&static_rhs)
.for_each(|dest_el, src_el| *dest_el = *src_el);
}
```
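For instance (shapes are hypothetical), reducing a (4, 3) gradient into a (3,)-shaped destination sums over the leading axis:

```rust
use ndarray::{array, Array1, Array2};

fn main() {
    let src: Array2<f32> = array![
        [1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.],
        [10., 11., 12.]
    ];
    let mut dest: Array1<f32> = Array1::zeros(3);
    reduce(&mut dest, &src);
    // Each destination entry holds the corresponding column sum.
    assert_eq!(dest, array![22., 26., 30.]);
}
```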
This issue concerns the redesign of the basic infrastructure the library uses to perform autodiff.

Currently we manage the propagation of gradients and the invocation of the `.forward()` and `.backward()` methods through recursive calls, controlled by control structures such as `ForwardAction` and `BackwardAction` to avoid performing some computations multiple times. Unfortunately, this organization has the disadvantage of creating a large number of recursive calls on the stack, with a consequent loss of performance.

The new approach we could take advantage of is based on moving to an iterative version using trait objects. Even though these have the disadvantage of dynamic dispatch, they could lead to a simpler implementation as well as a potential performance improvement. A possible implementation example is the following.
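In miniature, the iterative scheme amounts to storing the nodes in topological order and dispatching through trait objects in a flat loop; here is a minimal sketch with illustrative names, mirroring the prototype above:

```rust
use std::{cell::RefCell, rc::Rc};

trait Forward {
    fn forward(&mut self);
}

// One flat pass over the graph replaces the recursive `.forward()` calls;
// the only extra cost is a dynamic dispatch per node.
fn run_forward(nodes: &[Rc<RefCell<dyn Forward>>]) {
    for node in nodes {
        node.borrow_mut().forward();
    }
}

struct Noop;
impl Forward for Noop {
    fn forward(&mut self) {}
}

fn main() {
    let nodes: Vec<Rc<RefCell<dyn Forward>>> = vec![Rc::new(RefCell::new(Noop))];
    run_forward(&nodes);
}
```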
Prototype Code
The example is quite trivial, using only `Addition` nodes, but considering the depth of the graph (1,024 nodes) and the size of the tensors (1 million elements each), the benchmarks don't disappoint. I warmly invite you to try them out and to raise any ideas or objections that cross your mind.
The resulting benchmark times are the following:
Consider also that, for the `.backward()` case, each node is connected to 4 other `source` nodes of 1 million entries each.