[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/01-numpy/numpy.ipynb)

# Module 01: NumPy Fundamentals

NumPy is the foundation of numerical computing in Python. Nearly every data science library builds on NumPy arrays.

## Learning Objectives

1. Create and manipulate NumPy arrays
2. Use vectorized operations (no loops!)
3. Apply broadcasting for efficient computation
4. Perform basic linear algebra
5. Generate random numbers for simulations

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Why NumPy? The Foundation of Scientific Python

Before we dive in, it's worth understanding *why* NumPy exists and what problem it solves.

### The Problem with Python Lists

Python is a wonderful language, but it was designed for general-purpose programming, not numerical computing. Python lists are:

- **Flexible**: Can hold mixed types (`[1, "hello", 3.14]`)
- **Dynamic**: Can grow and shrink easily
- **Slow**: Each element is a full Python object with overhead

This flexibility comes at a cost. When you're doing numerical work—solving differential equations, processing spectra, training machine learning models—you need to perform the *same operation on millions of numbers*. Python's flexibility becomes a liability.

### NumPy's Solution

NumPy provides a new data type: the **ndarray** (n-dimensional array). NumPy arrays are:

- **Homogeneous**: All elements have the same type (e.g., all 64-bit floats)
- **Contiguous**: Stored in a single block of memory
- **Vectorized**: Operations apply to all elements at once, implemented in C

The result? NumPy can be 10-100x faster than Python lists for numerical operations.
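
To make the contrast concrete, the short sketch below inspects an array's metadata using standard `ndarray` attributes. The variable names are illustrative only, and the byte counts assume the default 64-bit float type.

```python
import numpy as np

# A list can mix types; each element is a separate Python object
mixed = [1, "hello", 3.14]
print([type(x).__name__ for x in mixed])  # ['int', 'str', 'float']

# An array stores one fixed-size type in a single buffer
values = np.array([1.0, 2.0, 3.0])
print(values.dtype)     # float64
print(values.itemsize)  # 8 bytes per element
print(values.nbytes)    # 24 bytes for the whole buffer

# Mixing ints and floats upcasts everything to one common type
upcast = np.array([1, 2, 3.5])
print(upcast.dtype)     # float64
```

Because every element has the same size and type, NumPy can hand the whole buffer to compiled code instead of inspecting objects one at a time.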

### Why This Matters for Chemical Engineering

Almost everything we do involves arrays of numbers:
- Sensor data from reactors (time series of temperatures, pressures, flows)
- Concentration profiles from simulations
- Spectroscopic data (absorbance vs wavelength)
- Training data for machine learning models

NumPy is the foundation that Pandas, scikit-learn, TensorFlow, and virtually every scientific Python library builds upon. Master NumPy, and everything else becomes easier.

In [None]:
# Speed comparison
python_list = list(range(1000000))
numpy_array = np.arange(1000000)

# Python list
%timeit [x**2 for x in python_list]

# NumPy array
%timeit numpy_array**2

## Creating Arrays: Choosing the Right Approach

There are many ways to create NumPy arrays, and the right choice depends on your situation:

| Situation | Method | Example |
|-----------|--------|---------|
| From existing data | `np.array()` | Converting a Python list |
| Initialize to zeros | `np.zeros()` | Pre-allocating for a loop |
| Initialize to ones | `np.ones()` | Creating masks or weights |
| Evenly spaced values | `np.linspace()` | Plotting a function |
| Integer sequence | `np.arange()` | Loop indices |
| Random values | `np.random.default_rng().uniform()` | Monte Carlo simulation |

**Pro tip**: Avoid growing arrays in loops. If you know the final size, pre-allocate with `np.zeros()` and fill in values. This is much faster than appending.
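
A minimal sketch of the two patterns, assuming a hypothetical time step `dt` and a known number of points `n` (both chosen only for illustration):

```python
import numpy as np

n = 1000
dt = 0.1  # hypothetical time step, s

# Slow pattern: np.append copies the whole array on every call
grown = np.array([])
for i in range(n):
    grown = np.append(grown, i * dt)

# Preferred pattern: allocate once, then fill in place
filled = np.zeros(n)
for i in range(n):
    filled[i] = i * dt

print(np.allclose(grown, filled))  # True
```

Both loops produce the same values, but the first one reallocates and copies the array on each iteration, so its cost grows quadratically with `n`.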

In [None]:
# From Python list
concentrations = np.array([0.1, 0.2, 0.5, 1.0, 2.0])  # mol/L
print("Concentrations:", concentrations)
print("Type:", type(concentrations))
print("Shape:", concentrations.shape)
print("Dtype:", concentrations.dtype)

In [None]:
# Common array creation functions
zeros = np.zeros(5)
ones = np.ones(5)
temps = np.linspace(300, 500, 5)  # 5 evenly spaced points from 300 to 500
pressures = np.arange(1, 11, 2)   # From 1 up to (but not including) 11, step 2

print("Zeros:", zeros)
print("Ones:", ones)
print("Temperatures:", temps)
print("Pressures:", pressures)

In [None]:
# 2D arrays (matrices)
# Experimental data: rows = experiments, columns = [T, P, yield]
experiments = np.array([
    [300, 1.0, 45.2],
    [350, 1.5, 52.8],
    [400, 2.0, 68.1],
    [450, 2.5, 75.4],
    [500, 3.0, 82.0]
])

print("Shape:", experiments.shape)  # (5 rows, 3 columns)
print("\nExperiment data:")
print(experiments)

## Indexing and Slicing: Accessing Your Data

NumPy's indexing is one of its most powerful features. Understanding it well will save you countless hours of writing loops.

### The Key Insight

In Python, you typically write loops to process data. In NumPy, you **describe what you want**, and NumPy figures out how to get it efficiently.

Instead of:
```python
result = []
for i in range(len(data)):
    if data[i] > threshold:
        result.append(data[i])
```

You write:
```python
result = data[data > threshold]
```

This isn't just shorter—it's 10-100x faster because the loop runs in compiled C code.
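
As a runnable version of the comparison above, the sketch below applies both approaches to a small illustrative array (`data` and `threshold` are made-up values) and shows that they select the same elements:

```python
import numpy as np

data = np.array([45.2, 52.8, 68.1, 75.4, 82.0])  # illustrative values
threshold = 60.0

# Explicit loop
result_loop = []
for i in range(len(data)):
    if data[i] > threshold:
        result_loop.append(data[i])

# Boolean mask: the comparison yields one True/False per element
mask = data > threshold
result_mask = data[mask]

print(mask)         # [False False  True  True  True]
print(result_mask)  # [68.1 75.4 82. ]
```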

### Types of Indexing

1. **Basic indexing** (integers and slices): Returns views (no copy)
2. **Advanced indexing** (boolean masks, integer arrays): Returns copies

Understanding this distinction matters for performance and avoiding bugs.
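
One way to check which kind of result you have is `np.shares_memory`. The sketch below is illustrative; the array names are arbitrary.

```python
import numpy as np

a = np.arange(10)

sliced = a[2:5]        # basic indexing
masked = a[a > 6]      # boolean mask (advanced indexing)
picked = a[[0, 3, 7]]  # integer array (advanced indexing)

print(np.shares_memory(a, sliced))  # True: a view of a's buffer
print(np.shares_memory(a, masked))  # False: a new array
print(np.shares_memory(a, picked))  # False: a new array

masked[0] = -1  # changes only the copy
print(a[7])     # 7, the original is untouched
```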

In [None]:
# 1D indexing
temps = np.array([300, 350, 400, 450, 500])
print("First element:", temps[0])
print("Last element:", temps[-1])
print("First three:", temps[:3])
print("Every other:", temps[::2])

In [None]:
# 2D indexing
print("First row (experiment 1):", experiments[0])
print("First column (all temperatures):", experiments[:, 0])
print("Yields (third column):", experiments[:, 2])
print("Single element [2,1]:", experiments[2, 1])

In [None]:
# Boolean indexing - very powerful!
yields = experiments[:, 2]
temps = experiments[:, 0]

# Find experiments with yield > 60%
high_yield = yields > 60
print("High yield mask:", high_yield)
print("High yield values:", yields[high_yield])
print("Temps for high yield:", temps[high_yield])

## Vectorized Operations: Thinking in Arrays

This is the most important concept in NumPy. **Vectorization** means applying an operation to an entire array at once, rather than looping through elements.

### Why Vectorization Matters

1. **Speed**: Operations run in optimized C code, not interpreted Python
2. **Clarity**: `y = A * np.exp(-Ea / (R * T))` is clearer than a 5-line loop
3. **Fewer bugs**: No off-by-one errors, no forgetting to append

### The Mental Shift

If you're coming from MATLAB, vectorization will feel natural. If you're coming from traditional programming, it requires a mental shift:

- **Loop thinking**: "For each element, do this operation"
- **Array thinking**: "Apply this operation to all elements"

The examples below demonstrate this shift. Notice how each calculation applies to entire arrays—no loops needed!
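
As a minimal side-by-side sketch of the two styles, here is the Arrhenius expression mentioned above written both ways. The parameter values are illustrative.

```python
import math
import numpy as np

A = 1e13    # 1/s, pre-exponential factor (illustrative)
Ea = 80000  # J/mol, activation energy (illustrative)
R = 8.314   # J/(mol·K)
T = np.array([300.0, 350.0, 400.0, 450.0, 500.0])  # K

# Loop thinking: compute one value at a time
k_loop = []
for Ti in T:
    k_loop.append(A * math.exp(-Ea / (R * Ti)))

# Array thinking: one expression for the whole array
k_vec = A * np.exp(-Ea / (R * T))

print(np.allclose(k_loop, k_vec))  # True
```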

In [None]:
# Temperature conversion: K to °C
temps_K = np.array([300, 350, 400, 450, 500])
temps_C = temps_K - 273.15

print("Kelvin:", temps_K)
print("Celsius:", temps_C)

In [None]:
# Ideal gas law: PV = nRT
# Calculate molar volume V/n = RT/P

R = 8.314  # J/(mol·K)
T = np.linspace(300, 500, 5)  # K
P = 101325  # Pa (1 atm)

V_molar = R * T / P  # m³/mol
V_molar_L = V_molar * 1000  # L/mol

print("Temperature (K):", T)
print("Molar volume (L/mol):", V_molar_L)

In [None]:
# Arrhenius equation: k = A * exp(-Ea/RT)
A = 1e13  # 1/s
Ea = 80000  # J/mol
R = 8.314  # J/(mol·K)
T = np.linspace(300, 600, 100)

k = A * np.exp(-Ea / (R * T))

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(T, k)
plt.xlabel('Temperature (K)')
plt.ylabel('Rate constant k (1/s)')
plt.title('Arrhenius Plot (linear)')

plt.subplot(1, 2, 2)
plt.semilogy(1000/T, k)
plt.xlabel('1000/T (1/K)')
plt.ylabel('Rate constant k (1/s)')
plt.title('Arrhenius Plot (log scale)')

plt.tight_layout()
plt.show()

## Broadcasting: The Magic Behind NumPy's Elegance

Broadcasting is NumPy's way of handling operations between arrays of different shapes. It's what allows you to write `array + 5` (adding a scalar to every element) or multiply every row of a matrix element-wise by a vector.

### How Broadcasting Works

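When NumPy sees arrays of different shapes, it tries to make them compatible by:
1. Comparing dimensions from right to left (a shape with fewer dimensions is treated as if it were padded with 1s on the left)
2. Dimensions match if they're equal, or if one of them is 1
3. Arrays with a dimension of 1 are "stretched" to match the other
4. If any pair of dimensions is neither equal nor 1, the shapes are incompatible and NumPy raises a `ValueError`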
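
The rules can be checked directly on small arrays. This sketch uses arrays of ones, chosen only so that the shapes are easy to follow:

```python
import numpy as np

a = np.ones((5, 1))
b = np.ones(4)        # shape (4,), treated as (1, 4)
print((a + b).shape)  # (5, 4)

m = np.ones((3, 4))
print((m + b).shape)  # (3, 4): trailing dimensions match

c = np.ones(3)
try:
    m + c             # trailing dimensions 4 and 3 are incompatible
except ValueError as err:
    print("Cannot broadcast:", err)
```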

### A Chemical Engineering Example

Suppose you want to calculate reaction rates at multiple temperatures AND multiple concentrations. Instead of writing nested loops, broadcasting does it in one line.

**The key**: Reshape one array so dimensions can be broadcast. `k.reshape(-1, 1)` makes `k` a column vector (5×1), while `C` keeps its shape `(4,)`, which NumPy treats as a row (1×4). NumPy broadcasts the two to create a 5×4 grid of all combinations.
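
To see what broadcasting replaces, the sketch below computes the same grid with nested loops and with a reshaped array. It uses small made-up values for `k` and `C` rather than the Arrhenius results.

```python
import numpy as np

k = np.array([1.0, 2.0, 3.0])  # illustrative rate constants
C = np.array([0.1, 0.5])       # illustrative concentrations

# Nested loops: one rate per (k, C) pair
rates_loop = np.zeros((len(k), len(C)))
for i in range(len(k)):
    for j in range(len(C)):
        rates_loop[i, j] = k[i] * C[j]

# Broadcasting: (3, 1) * (2,) -> (3, 2)
rates_bc = k.reshape(-1, 1) * C

# k[:, np.newaxis] is an equivalent way to add the trailing axis
rates_newaxis = k[:, np.newaxis] * C

print(np.allclose(rates_loop, rates_bc))        # True
print(np.array_equal(rates_bc, rates_newaxis))  # True
```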

In [None]:
# Calculate reaction rates at multiple T and C combinations
T = np.array([300, 350, 400, 450, 500])  # 5 temperatures
C = np.array([0.1, 0.5, 1.0, 2.0])  # 4 concentrations

# Rate = k * C, where k depends on T
k = A * np.exp(-Ea / (R * T))  # Shape: (5,)

# We want a 5x4 array of rates
# Reshape k to (5, 1) and C stays (4,) → broadcasts to (5, 4)
rates = k.reshape(-1, 1) * C

print("k shape:", k.shape)
print("C shape:", C.shape)
print("rates shape:", rates.shape)
print("\nRates (rows=T, cols=C):")
print(rates)

## Aggregation Functions

In [None]:
# Simulated experimental yields
yields = np.array([78.2, 82.1, 79.5, 81.3, 80.0, 79.8, 83.2, 77.9, 80.5, 81.1])

print(f"Mean: {np.mean(yields):.2f}%")
print(f"Std: {np.std(yields):.2f}%")  # population std (ddof=0); pass ddof=1 for the sample std
print(f"Min: {np.min(yields):.2f}%")
print(f"Max: {np.max(yields):.2f}%")
print(f"Median: {np.median(yields):.2f}%")

In [None]:
# Aggregation along axes for 2D arrays
# Rows = different catalysts, Cols = replicate experiments
catalyst_yields = np.array([
    [78, 82, 79, 81, 80],  # Catalyst A
    [65, 68, 66, 64, 67],  # Catalyst B
    [88, 91, 89, 87, 90],  # Catalyst C
])

print("Mean per catalyst (across replicates):")
print(np.mean(catalyst_yields, axis=1))  # axis=1 means across columns

print("\nMean per replicate (across catalysts):")
print(np.mean(catalyst_yields, axis=0))  # axis=0 means across rows

## Linear Algebra

In [None]:
# Solving linear systems: Ax = b
# Material balance: 3 equations, 3 unknowns

# Stoichiometric matrix
A = np.array([
    [1, -1, 0],
    [0, 1, -1],
    [1, 0, 1]
])

# Right-hand side (inlet flows)
b = np.array([10, 5, 20])

# Solve
x = np.linalg.solve(A, b)
print("Solution x:", x)

# Verify: Ax should equal b
print("Verification A @ x:", A @ x)

In [None]:
# Matrix operations
A = np.array([[1, 2], [3, 4]])

print("Determinant:", np.linalg.det(A))
print("\nInverse:")
print(np.linalg.inv(A))

eigenvalues, eigenvectors = np.linalg.eig(A)
print("\nEigenvalues:", eigenvalues)
print("\nEigenvectors:")
print(eigenvectors)

## Random Numbers: Essential for Simulation

Random number generation is fundamental to many scientific applications:

- **Monte Carlo simulation**: Propagating uncertainty, sampling from distributions
- **Machine learning**: Initializing weights, shuffling data, dropout
- **Process simulation**: Modeling noise, disturbances, variability
- **Experimental design**: Random sampling, bootstrap resampling

### NumPy's Random Number Generator

Modern NumPy (1.17+) uses a new random number generator API. The recommended approach is:

```python
rng = np.random.default_rng(seed=42)  # Create a generator with a seed
```

**Why use a seed?** Seeds make your results reproducible. Running the same code twice with the same seed gives identical "random" numbers. This is essential for:
- Debugging (reproducing exactly what happened)
- Sharing results (others can reproduce your analysis)
- Testing (consistent behavior across runs)

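**When NOT to seed**: When you genuinely need different draws on every run (for example, production sampling). For cryptography, do not use NumPy's generators at all; they are not cryptographically secure, and Python's `secrets` module is the appropriate tool.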
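
A short sketch of what seeding does in practice. The generator names are arbitrary, and 42 is just an example seed.

```python
import numpy as np

# Two generators built from the same seed produce the same stream
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
same = np.array_equal(rng_a.normal(size=3), rng_b.normal(size=3))
print(same)  # True

# Without a seed, each generator draws fresh entropy from the operating system,
# so two generators will almost certainly produce different numbers
rng_c = np.random.default_rng()
rng_d = np.random.default_rng()
print(np.array_equal(rng_c.normal(size=3), rng_d.normal(size=3)))
```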

In [None]:
# Set seed for reproducibility
rng = np.random.default_rng(42)

# Uniform random numbers
uniform = rng.uniform(0, 1, 5)
print("Uniform [0,1):", uniform)

# Normal distribution (Gaussian)
normal = rng.normal(loc=50, scale=5, size=5)  # mean=50, std=5
print("Normal (μ=50, σ=5):", normal)

# Random integers
integers = rng.integers(1, 100, 5)
print("Random integers [1, 100):", integers)

In [None]:
# Monte Carlo simulation: Propagating measurement uncertainty
# Measure temperature: T = 400 ± 5 K
# What's the uncertainty in rate constant k?

rng = np.random.default_rng(42)
n_samples = 10000

# Sample temperatures from normal distribution
T_samples = rng.normal(400, 5, n_samples)

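# Arrhenius parameters are restated here because `A` is reused
# for matrices in the linear algebra cells above
A = 1e13  # 1/s
Ea = 80000  # J/mol
R = 8.314  # J/(mol·K)

# Calculate k for each sample
k_samples = A * np.exp(-Ea / (R * T_samples))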

print(f"k = {np.mean(k_samples):.4e} ± {np.std(k_samples):.4e}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].hist(T_samples, bins=50, edgecolor='black')
axes[0].set_xlabel('Temperature (K)')
axes[0].set_ylabel('Count')
axes[0].set_title('Input: Temperature Distribution')

axes[1].hist(k_samples, bins=50, edgecolor='black')
axes[1].set_xlabel('Rate constant k (1/s)')
axes[1].set_ylabel('Count')
axes[1].set_title('Output: Rate Constant Distribution')

plt.tight_layout()
plt.show()

## Common Pitfalls: Mistakes Everyone Makes

NumPy has some non-obvious behaviors that trip up beginners (and sometimes experts). Understanding these will save you debugging time.

### 1. Views vs Copies

This is one of the most common sources of NumPy bugs. When you slice an array with basic indexing, you get a **view**, not a copy. Modifying the view modifies the original!

In [None]:
# Pitfall 1: Views vs Copies
a = np.array([1, 2, 3, 4, 5])
b = a[1:4]  # This is a VIEW, not a copy!

b[0] = 99  # This modifies 'a' too!
print("a:", a)  # [1, 99, 3, 4, 5]

# Use .copy() if you need an independent copy
a = np.array([1, 2, 3, 4, 5])
b = a[1:4].copy()
b[0] = 99
print("a (with copy):", a)  # [1, 2, 3, 4, 5]

In [None]:
# Pitfall 2: Integer division
a = np.array([1, 2, 3])  # Integer array
print("Integer array / 2:", a / 2)  # True division always returns floats

# Floor division keeps the integer dtype and rounds down
print("Integer array // 2:", a // 2)  # [0 1 1]

In [None]:
# Pitfall 3: Shape mismatches
a = np.array([1, 2, 3])
b = np.array([1, 2])  # Different length!

try:
    c = a + b
except ValueError as e:
    print("Error:", e)

## Summary: NumPy Mindset

NumPy requires a different way of thinking about computation. Here's what to remember:

### Core Concepts

| Concept | Key Idea | Why It Matters |
|---------|----------|----------------|
| **Arrays** | Homogeneous, typed containers | 10-100x faster than lists |
| **Vectorization** | Operate on entire arrays | Eliminates slow Python loops |
| **Broadcasting** | Automatic shape matching | Clean code, fewer bugs |
| **Views** | Slices share memory | Fast, but be careful with modifications |

### When to Use NumPy

- Any numerical computation with arrays of numbers
- Building blocks for machine learning features
- Scientific calculations (physics, chemistry, engineering)
- Image and signal processing

### When to Use Pandas Instead

- Tabular data with labeled columns
- Mixed data types (numbers, strings, dates)
- Time series with datetime indices
- Data cleaning and exploration

### Key Takeaways

1. **Think in arrays**: Write operations that apply to whole arrays, not individual elements
2. **Avoid loops**: If you're writing a for loop over array elements, there's probably a better way
3. **Scale matters**: Always scale/normalize features before combining them
4. **Copy when needed**: Use `.copy()` if you need to modify a slice without affecting the original
5. **Seed for reproducibility**: Set a random seed whenever results need to be reproducible

## Next Steps

In the next module, we'll learn Pandas, which builds on NumPy to provide labeled data structures for tabular data. If NumPy is the engine, Pandas is the dashboard—same power, but easier to interact with for real-world data.