# Improving Performance

## Overview

### Questions

- How can I write custom actions to be as efficient as possible?

### Objectives

- Mention high-performance libraries that can help speed up computation.
- Demonstrate using the local snapshot API for increased performance.

## Boilerplate Code

In [1]:
from numbers import Number

import numpy as np

import hoomd
import hoomd.md as md


cpu = hoomd.device.CPU()
sim = hoomd.Simulation(cpu)

# Create a simple cubic configuration of particles
N = 15  # particles per box direction
box_L = 50  # box dimension

snap = hoomd.Snapshot(cpu.communicator)
snap.configuration.box = [box_L] * 3 + [0, 0, 0]
snap.particles.N = N ** 3
x, y, z = np.meshgrid(
    *(np.linspace(-box_L / 2, box_L / 2, N, endpoint=False),) * 3)
positions = np.array((x.ravel(), y.ravel(), z.ravel())).T
snap.particles.position[:] = positions
snap.particles.types = ['A']
snap.particles.typeid[:] = 0

sim.create_state_from_snapshot(snap)

sim.state.thermalize_particle_momenta(hoomd.filter.All(), 1., seed=109)

lj = md.pair.LJ(nlist=md.nlist.Cell())
lj.params[('A', 'A')] = {'epsilon': 1.,
                         'sigma': 1.}
lj.r_cut[('A', 'A')] = 2.5
integrator = md.Integrator(methods=[md.methods.NVE(hoomd.filter.All())],
                           forces=[lj],
                           dt=0.005)

sim.operations += integrator

class GaussianVariant(hoomd.variant.Variant):
    def __init__(self, mean, std):
        hoomd.variant.Variant.__init__(self)
        self.mean = mean
        self.std = std
    
    def __call__(self, timestep):
        return np.random.normal(self.mean, self.std)
    
energy = GaussianVariant(0.1, 0.001)
sim.run(0)
rng = np.random.default_rng(1245)

The decorator below should not be used for actual profiling.
It is used purely to showcase the _rough_ gain in
performance from switching to the `cpu_local_snapshot`.

In [2]:
from functools import wraps
import timeit


def time_function(func):
    @wraps(func)
    def wrapped_func(*args, **kwargs):
        wrapped_func.n_calls += 1
        start_time = timeit.default_timer()
        return_value = func(*args, **kwargs)
        wrapped_func.elapsed_time += timeit.default_timer() - start_time
        wrapped_func.time_per_call = (
            wrapped_func.elapsed_time / wrapped_func.n_calls)
        return return_value
    
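    # Initialize the counters so they can be read before the first call.
    wrapped_func.n_calls = 0
    wrapped_func.elapsed_time = 0.0
    wrapped_func.time_per_call = 0.0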

    return wrapped_func

## General Guidelines

Custom actions rarely need to be as fast as possible, but when
performance does matter there are several things to consider.
The first step of any optimization is to profile: profiling is
necessary to find the true bottlenecks of a given program or
function. That being said, below are some general tips on improving
performance.
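
As a minimal sketch of profiling with the standard library, the example below runs `cProfile` on a plain Python function and prints the most expensive calls. The functions `slow_sum_of_squares` and `act_like_function` are hypothetical stand-ins for the work done inside a custom action; they are not part of HOOMD-blue.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical stand-in for an expensive step inside an action."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def act_like_function():
    """Hypothetical stand-in for a custom action's ``act`` method."""
    return [slow_sum_of_squares(20_000) for _ in range(20)]


# Collect profiling data only while the function of interest runs.
profiler = cProfile.Profile()
profiler.enable()
results = act_like_function()
profiler.disable()

# Write the ten entries with the largest cumulative time to a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The report lists each function with its call count and cumulative time, which is usually enough to tell whether the bottleneck is in the action itself or elsewhere.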

When accessing state information, consider using
local snapshots (i.e. `hoomd.State.cpu_local_snapshot` and
`gpu_local_snapshot`). Local snapshots provide faster access to the
simulation's state information because they neither copy data nor
gather it across MPI ranks. They also support in-place modification,
which makes setting values faster as well. A full explanation of
local snapshots is left to a future tutorial.

Further, when accessing object properties like
`hoomd.md.pair.LJ.energies`, store the result in a variable
(e.g. `energies = lj.energies`) if the data is needed in multiple
places. This avoids recalculating the quantity or gathering it
across MPI ranks more than once.
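
The effect of storing a property in a variable can be sketched without HOOMD-blue. The class `ExpensiveForce` below is a hypothetical stand-in whose `energies` property redoes its work on every access and counts how often that happens; it only illustrates the pattern and does not reflect how `hoomd.md.pair.LJ` is implemented.

```python
class ExpensiveForce:
    """Hypothetical stand-in for an object with a costly property."""

    def __init__(self, values):
        self._values = values
        self.n_computations = 0

    @property
    def energies(self):
        # Every access redoes the work, as a computed property would.
        self.n_computations += 1
        return [0.5 * v for v in self._values]


force = ExpensiveForce([1.0, 2.0, 3.0])

# Repeated access: the property is evaluated twice.
total = sum(force.energies)
maximum = max(force.energies)
repeated_access_count = force.n_computations

# Stored in a local variable: the property is evaluated once.
force.n_computations = 0
energies = force.energies
total = sum(energies)
maximum = max(energies)
stored_access_count = force.n_computations
```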

Beyond this, one of the easiest and most obvious ways to improve
efficiency is to use NumPy, SciPy, and other core scientific
Python libraries. Efficient use of these packages is beyond the
scope of this tutorial, but using NumPy broadcasting (instead of
Python for loops) and built-in SciPy functions can make a big
difference.
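
To illustrate broadcasting, the sketch below computes per-particle momenta from masses and velocities twice: once with nested Python loops and once with a single broadcast expression. The array sizes and the function names `momentum_loop` and `momentum_broadcast` are chosen for illustration only.

```python
import numpy as np

example_rng = np.random.default_rng(42)
velocity = example_rng.normal(size=(1000, 3))  # shape (N, 3)
mass = np.full(1000, 2.0)  # shape (N,)


def momentum_loop(mass, velocity):
    """Compute p = m * v one component at a time in Python."""
    momentum = np.empty_like(velocity)
    for i in range(velocity.shape[0]):
        for j in range(velocity.shape[1]):
            momentum[i, j] = mass[i] * velocity[i, j]
    return momentum


def momentum_broadcast(mass, velocity):
    """Compute p = m * v with one broadcast multiplication."""
    # mass[:, np.newaxis] has shape (N, 1) and broadcasts against (N, 3).
    return mass[:, np.newaxis] * velocity


loop_result = momentum_loop(mass, velocity)
broadcast_result = momentum_broadcast(mass, velocity)
```

Both functions return the same array, but the broadcast version performs the loop in compiled NumPy code rather than in the Python interpreter.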

When this fails or is insufficient, Cython or Numba can be used to
compile the slow parts of the code while remaining directly usable
from Python. Cython is its own language, similar to Python with
some C-like constructs. Numba applies just-in-time compilation to
standard Python functions that use a supported subset of Python features.
Compiled backends in other languages can be used as well, as long as
they can be called from Python.
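
A minimal sketch of the Numba approach follows, assuming Numba is installed. The function `total_kinetic_energy` is a hypothetical example rather than anything from HOOMD-blue, and the `ImportError` fallback is added here only so the sketch still runs, uncompiled, when Numba is missing.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (uncompiled) without Numba.
    def njit(func):
        return func


@njit
def total_kinetic_energy(mass, velocity):
    """Sum 0.5 * m * |v|^2 with explicit loops, which Numba can compile."""
    total = 0.0
    for i in range(velocity.shape[0]):
        speed_sq = 0.0
        for j in range(velocity.shape[1]):
            speed_sq += velocity[i, j] * velocity[i, j]
        total += 0.5 * mass[i] * speed_sq
    return total


mass = np.ones(4)
velocity = np.full((4, 3), 2.0)
kinetic_energy = total_kinetic_energy(mass, velocity)
```

With Numba, the first call includes the compilation time, so any timing comparison should exclude that call.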

## Improve InsertEnergyUpdater

As an example, we will improve the performance of the
`InsertEnergyUpdater`. Specifically, we will switch to
the `cpu_local_snapshot` to update the particle velocity. We
will use a custom decorator to time the `act` method
before and after the changes. This decorator should not be used for
actual profiling; existing tools such as the
profilers packaged with CPython should be preferred.
The custom decorator simply lets us see the improvement in the
custom action's runtime within the notebook.

In [3]:
class InsertEnergyUpdater(hoomd.custom.Action):
    def __init__(self, energy):
        self._energy = energy
        
    @property
    def energy(self):
        return self._energy
    
    @energy.setter
    def energy(self, new_energy):
        if isinstance(new_energy, Number):
            self._energy = hoomd.variant.Constant(new_energy)
        elif isinstance(new_energy, hoomd.variant.Variant):
            self._energy = new_energy
        else:
            raise ValueError(
                "energy must be a variant or real number.")
    
    @time_function
    def act(self, timestep):
        snap = self._state.snapshot
        if snap.exists:
            particle_i = rng.integers(snap.particles.N)
            mass = snap.particles.mass[particle_i]
            direction = self._get_direction()
            magnitude = np.sqrt(2 * self.energy(timestep) / mass)
            velocity = direction * magnitude
            old_velocity = snap.particles.velocity[particle_i]
            new_velocity = old_velocity + velocity
            # Add the kick to the existing velocity instead of replacing it.
            snap.particles.velocity[particle_i] = new_velocity
        self._state.snapshot = snap
            
    @staticmethod
    def _get_direction():
        theta, z = rng.random(2)
        theta *= 2 * np.pi
        z = 2 * (z - 0.5)
        return np.array([
            np.sqrt(1 - (z * z)) * np.cos(theta),
            np.sqrt(1 - (z * z)) * np.sin(theta),
            z
        ])



In [4]:
energy_action = InsertEnergyUpdater(energy)
energy_operation = hoomd.update.CustomUpdater(
    energy_action)
sim.operations.updaters.append(energy_operation)

In [5]:
sim.run(1000)
f"ms per call: {1000 * energy_operation.action.act.time_per_call}"

'ms per call: 1.1473324317630613'

We now show the timing for the optimized code, which
uses the `cpu_local_snapshot` to update particle velocities.

In [6]:
class InsertEnergyUpdater(hoomd.custom.Action):
    def __init__(self, energy):
        self._energy = energy
        
    @property
    def energy(self):
        return self._energy
    
    @energy.setter
    def energy(self, new_energy):
        if isinstance(new_energy, Number):
            self._energy = hoomd.variant.Constant(new_energy)
        elif isinstance(new_energy, hoomd.variant.Variant):
            self._energy = new_energy
        else:
            raise ValueError(
                "energy must be a variant or real number.")

    def attach(self, simulation):
        self._state = simulation.state
        self._comm = simulation.device.communicator

    def detach(self):
        del self._state
        del self._comm
    
    @time_function
    def act(self, timestep):
        part_tag = rng.integers(self._state.N_particles)
        direction = self._get_direction()
        energy = self.energy(timestep)
        with self._state.cpu_local_snapshot as snap:
            # We restrict the computation to the MPI
            # rank containing the particle if applicable.
            # By checking if multiple MPI ranks exist first
            # we can avoid checking inclusion of a tag id
            # in an array when running on a single rank.
            if (self._comm.num_ranks <= 1
                    or part_tag in snap.particles.tag):
                i = snap.particles.rtag[part_tag]
                mass = snap.particles.mass[i]
                magnitude = np.sqrt(2 * energy / mass)
                velocity = direction * magnitude
                old_velocity = snap.particles.velocity[i]
                new_velocity = old_velocity + velocity
                snap.particles.velocity[i] = new_velocity
            
    @staticmethod
    def _get_direction():
        theta, z = rng.random(2)
        theta *= 2 * np.pi
        z = 2 * (z - 0.5)
        return np.array([
            np.sqrt(1 - (z * z)) * np.cos(theta),
            np.sqrt(1 - (z * z)) * np.sin(theta),
            z
        ])

In [7]:
# Create and add our modified custom updater
sim.operations -= energy_operation
energy_action = InsertEnergyUpdater(energy)
energy_operation = hoomd.update.CustomUpdater(
    energy_action)
sim.operations.updaters.append(energy_operation)

In [8]:
sim.run(1000)
f"ms per call: {1000 * energy_operation.action.act.time_per_call}"

'ms per call: 0.2196627056982834'

As can be seen, the new updater is much faster, roughly 5x
for a system of $15^3 = 3375$ particles,
because modifying a particle through the local snapshot has
$O(1)$ time complexity. At larger system sizes the
difference will be even more substantial.
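
The scaling argument can be illustrated with a NumPy-only analogy that does not use HOOMD-blue. Here `update_via_copy` mimics a workflow that copies the full array out and back to change one row, and `update_in_place` mimics direct modification; both names and the chosen system sizes are assumptions made for this sketch, and the printed timings will vary by machine.

```python
import timeit

import numpy as np


def update_via_copy(velocity, index, kick):
    """Copy the whole array, modify one row, and copy it back: O(N)."""
    snapshot = velocity.copy()
    snapshot[index] += kick
    velocity[:] = snapshot


def update_in_place(velocity, index, kick):
    """Modify one row directly: O(1)."""
    velocity[index] += kick


kick = np.array([0.1, 0.0, 0.0])
for n_particles in (15 ** 3, 60 ** 3):
    velocity = np.zeros((n_particles, 3))
    copy_time = timeit.timeit(
        lambda: update_via_copy(velocity, 0, kick), number=200)
    in_place_time = timeit.timeit(
        lambda: update_in_place(velocity, 0, kick), number=200)
    print(f"N = {n_particles}: copy {copy_time:.2e} s, "
          f"in place {in_place_time:.2e} s")
```

The copy-based time grows with the number of particles, while the in-place time stays roughly constant.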

This concludes the tutorial on custom actions in Python. For
more information see the API documentation.