<a href="https://colab.research.google.com/github/jman4162/Accelerated-Python-Computing-for-ML-Applications/blob/main/Advanced_Numba_Tutorial_for_Machine_Learning_Researchers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Numba Tutorial for Machine Learning Researchers

Name: John Hodge

Date: 10/09/24

## Overview

This advanced tutorial explores Numba, a powerful just-in-time (JIT) compiler for Python, tailored specifically for machine learning researchers. We'll delve into techniques to accelerate common ML tasks, covering basic JIT compilation, parallel processing, custom data types, GPU acceleration, and profiling. By the end, you'll be equipped to significantly optimize your Python code for machine learning applications.

## Introduction

In the fast-paced world of machine learning, computational efficiency is crucial. As datasets expand and models grow more complex, researchers often encounter performance bottlenecks that hinder experimentation and model development. Numba addresses this challenge by compiling Python code to native machine instructions, dramatically speeding up numerical and scientific computations without the need for low-level language rewrites.

This tutorial is designed for ML researchers familiar with Python and NumPy who want to elevate their code optimization skills. We'll explore how Numba can accelerate common machine learning tasks and algorithms, potentially reducing execution times from hours to minutes.

Key topics include:

1. Basic JIT compilation
2. Parallel processing for multi-core CPUs
3. Custom data types for complex algorithms
4. GPU acceleration for massively parallel computations
5. Profiling and optimization techniques

While Numba is powerful, it's not a universal solution. We'll discuss best practices, common pitfalls, and when to use Numba versus other optimization methods. By the end, you'll be able to unlock the full potential of your Python code for machine learning, enabling faster prototyping, quicker iterations, and the ability to work with larger datasets on standard hardware.

Let's begin our journey into the world of high-performance Python for machine learning!

## Prerequisites

Before we begin, install the required libraries and import them:

In [1]:
!pip install -q numba numpy matplotlib scikit-learn

In [2]:
import numba
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from time import time

## 1. Basic JIT Compilation

Let's start with a quick refresher on basic JIT compilation:

In [3]:
@numba.jit(nopython=True)
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))

# Test the function
x1 = np.array([1, 2, 3])
x2 = np.array([4, 5, 6])
print(f"Distance: {euclidean_distance(x1, x2)}")

Distance: 5.196152422706632


## 2. Parallel Processing with Numba

With `parallel=True`, Numba can split the iterations of a `numba.prange` loop across CPU cores, provided the iterations are independent of each other. Let's implement a parallel K-means clustering algorithm:

In [4]:
@numba.jit(nopython=True, parallel=True)
def kmeans(X, k, max_iterations=100):
    n_samples, n_features = X.shape

    # Initialize centroids randomly
    centroids = X[np.random.choice(n_samples, k, replace=False)]

    for _ in range(max_iterations):
        # Assign points to nearest centroid
        labels = np.empty(n_samples, dtype=np.int64)
        for i in numba.prange(n_samples):
            min_dist = np.inf
            for j in range(k):
                dist = np.sum((X[i] - centroids[j])**2)
                if dist < min_dist:
                    min_dist = dist
                    labels[i] = j

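        # Update centroids (serial loop: several threads accumulating into
        # the same centroid row under prange would be a race condition)
        new_centroids = np.zeros((k, n_features))
        counts = np.zeros(k)
        for i in range(n_samples):
            new_centroids[labels[i]] += X[i]
            counts[labels[i]] += 1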

        for i in range(k):
            if counts[i] > 0:
                new_centroids[i] /= counts[i]

        # Check for convergence
        if np.all(centroids == new_centroids):
            break

        centroids = new_centroids

    return labels, centroids

# Generate sample data
X, _ = make_classification(n_samples=10000, n_features=20, n_informative=3, n_redundant=10, n_classes=3, random_state=42)

# Run K-means (this first call also includes JIT compilation time)
start_time = time()
labels, centroids = kmeans(X, k=3)
end_time = time()

print(f"K-means clustering completed in {end_time - start_time:.4f} seconds")

K-means clustering completed in 9.8693 seconds
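When parallelizing or otherwise optimizing a kernel like this, it can help to keep a plain NumPy reference for the two K-means steps and compare results on a small input. The following is a minimal sketch of such a reference, not part of the original notebook; the helper names `assign_labels` and `update_centroids` are hypothetical. `np.add.at` is used for the centroid sums because it accumulates correctly when the same label appears many times.

```python
import numpy as np


def assign_labels(X, centroids):
    """Index of the nearest centroid for every row of X."""
    # Squared distances with shape (n_samples, k), built by broadcasting
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def update_centroids(X, labels, k):
    """Mean of the points assigned to each cluster (empty clusters stay at zero)."""
    sums = np.zeros((k, X.shape[1]))
    np.add.at(sums, labels, X)  # unbuffered add: repeated labels accumulate
    counts = np.bincount(labels, minlength=k)
    nonempty = counts > 0
    sums[nonempty] /= counts[nonempty][:, None]
    return sums


# Tiny hand-checkable example: two well-separated pairs of points
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
init_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

labels_demo = assign_labels(X_demo, init_centroids)
centroids_demo = update_centroids(X_demo, labels_demo, k=2)
print(labels_demo)
print(centroids_demo)
```

Running one iteration of both versions from the same initial centroids and comparing the outputs is a quick way to catch race conditions, since those tend to show up as small, run-to-run differences in the centroid sums.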


## 3. Custom Data Types with Numba

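Numba's experimental `jitclass` decorator compiles a Python class whose attribute types are declared up front in a `spec`. This is useful for bundling related state in more complex algorithms: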

In [5]:
from numba import types
from numba.experimental import jitclass

spec = [
    ('value', types.float64[:]),
    ('gradient', types.float64[:])
]

@jitclass(spec)
class Parameter:
    def __init__(self, n_features):
        self.value = np.zeros(n_features)
        self.gradient = np.zeros(n_features)

    def update(self, learning_rate):
        self.value -= learning_rate * self.gradient
        self.gradient.fill(0)

@numba.jit(nopython=True)
def logistic_regression(X, y, learning_rate=0.01, n_iterations=1000):
    n_samples, n_features = X.shape
    weights = Parameter(n_features)
    bias = Parameter(1)

    for _ in range(n_iterations):
        for i in range(n_samples):
            z = 0.0
            for j in range(n_features):
                z += X[i, j] * weights.value[j]
            z += bias.value[0]
            y_pred = 1 / (1 + np.exp(-z))

            error = y_pred - y[i]
            for j in range(n_features):
                weights.gradient[j] += error * X[i, j]
            bias.gradient[0] += error

        weights.update(learning_rate)
        bias.update(learning_rate)

    return weights.value, bias.value[0]

# Generate sample data
X, y = make_classification(n_samples=10000, n_features=20, n_informative=2, n_redundant=10, n_classes=2, random_state=42)

# Run logistic regression (this first call also includes JIT compilation time)
start_time = time()
weights, bias = logistic_regression(X, y)
end_time = time()

print(f"Logistic regression completed in {end_time - start_time:.4f} seconds")

Logistic regression completed in 1.5645 seconds


## 4. GPU Acceleration with Numba

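Numba can also target CUDA-enabled GPUs. The example below first checks which GPU is available with `nvidia-smi`, then runs a simple matrix multiplication kernel in which each GPU thread computes one element of the output matrix: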

In [6]:
!nvidia-smi

Wed Oct  9 23:31:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [7]:
from numba import cuda

@cuda.jit
def matrix_mul(A, B, C):
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

# Generate sample matrices
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)
C = np.zeros((1000, 1000))

# Configure the blocks
threadsperblock = (16, 16)
blockspergrid_x = (A.shape[0] + threadsperblock[0] - 1) // threadsperblock[0]
blockspergrid_y = (B.shape[1] + threadsperblock[1] - 1) // threadsperblock[1]
blockspergrid = (blockspergrid_x, blockspergrid_y)

# Run the kernel (timing includes kernel compilation and host/device transfers)
start_time = time()
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_C = cuda.to_device(C)
matrix_mul[blockspergrid, threadsperblock](d_A, d_B, d_C)
cuda.synchronize()
C = d_C.copy_to_host()
end_time = time()

print(f"GPU matrix multiplication completed in {end_time - start_time:.4f} seconds")

GPU matrix multiplication completed in 1.8560 seconds
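The launch configuration above uses ceiling division so that the grid covers the whole output matrix even when its size is not a multiple of the block size. The following pure-Python sketch reproduces that arithmetic for the 1000 x 1000 case and counts how many launched threads fall outside the matrix; the helper name `blocks_per_grid` is hypothetical and no GPU is needed to run it.

```python
import math


def blocks_per_grid(shape, threads_per_block):
    """Smallest number of blocks per axis that covers `shape`."""
    return tuple(math.ceil(n / t) for n, t in zip(shape, threads_per_block))


shape = (1000, 1000)
threads_per_block = (16, 16)
grid = blocks_per_grid(shape, threads_per_block)

# Threads actually launched along each axis, and how many have no element to compute
threads_per_axis = tuple(g * t for g, t in zip(grid, threads_per_block))
idle_threads = threads_per_axis[0] * threads_per_axis[1] - shape[0] * shape[1]

print(f"grid={grid}, threads per axis={threads_per_axis}, idle threads={idle_threads}")
```

Those surplus threads are the reason the kernel guards its work with `if i < C.shape[0] and j < C.shape[1]`: without the bounds check they would read and write past the end of the arrays.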


## 5. Profiling and Optimization

In [8]:
import cProfile
import numba
import numpy as np

@numba.jit(nopython=True)
def complex_function(x, y):
    result = 0.0  # float accumulator, matching the type of the summed terms
    for i in range(x.shape[0]):
        for j in range(y.shape[0]):
            result += np.sin(x[i]) * np.cos(y[j])
    return result

x = np.random.rand(1000)
y = np.random.rand(1000)

# This is the first call, so the profile also captures Numba's JIT compilation
cProfile.run('complex_function(x, y)')

         237932 function calls (222960 primitive calls) in 0.848 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:100(acquire)
       10    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1022(_find_and_load)
       10    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:1038(_gcd_import)
      574    0.003    0.000    0.008    0.000 <frozen importlib._bootstrap>:1053(_handle_fromlist)
       20    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:125(release)
       10    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:165(__init__)
       10    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:169(__enter__)
       10    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:173(__exit__)
       20    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:179(_get_

## Next Steps

Congratulations on completing this advanced Numba tutorial for machine learning! You've gained valuable insights into optimizing Python code for high-performance machine learning tasks. To further enhance your skills and expand your knowledge, consider the following next steps:

1. **Practice and Experimentation**: Apply the techniques learned in this tutorial to your own machine learning projects. Experiment with different Numba optimizations and measure their impact on your specific use cases.

2. **Explore Advanced CUDA Programming**: If you're interested in GPU acceleration, delve deeper into CUDA programming with Numba. Learn about shared memory, atomic operations, and more complex parallel algorithms.

3. **Benchmark Against Other Solutions**: Compare Numba-optimized code with other high-performance computing solutions like Cython, Dask, or pure C/C++ implementations. Understand the trade-offs between development time and execution speed.

4. **Contribute to Open Source**: Many machine learning libraries could benefit from Numba optimizations. Consider contributing to open-source projects by optimizing computationally intensive parts of their codebase.

5. **Stay Updated**: Numba is continuously evolving. Keep an eye on the official Numba documentation and release notes for new features and improvements.

6. **Explore Integration with ML Frameworks**: Investigate how Numba can be integrated with popular machine learning frameworks like scikit-learn, PyTorch, or TensorFlow for custom operations.

7. **Attend Conferences and Workshops**: Participate in conferences or workshops focused on high-performance computing in Python to learn from experts and share your experiences.

8. **Optimize Full ML Pipelines**: Apply Numba optimizations to entire machine learning pipelines, from data preprocessing to model evaluation, to achieve end-to-end performance improvements.

9. **Learn About Memory Management**: Dive deeper into memory management techniques in Numba to handle large datasets efficiently and avoid common pitfalls.

10. **Explore Multi-GPU Programming**: If you have access to multiple GPUs, learn how to distribute computations across them using Numba and CUDA.

Remember, becoming proficient in high-performance computing for machine learning is an ongoing journey. Continuous practice and staying curious about new developments in the field will help you become a more effective and efficient machine learning researcher or practitioner.

Happy coding, and may your models train faster than ever before!

## Conclusion

This tutorial covered advanced Numba techniques for machine learning researchers, including parallel processing, custom data types, GPU acceleration, and profiling. By leveraging these features, you can significantly speed up your machine learning algorithms and handle larger datasets more efficiently.

Always profile your code and compare it against a non-Numba implementation to confirm that you are getting the expected performance improvement.