# Intro to NumPy

In [None]:
import numpy as np

## Advanced Indexing Techniques

### Boolean Indexing

As we've seen before, array comparisons result in an array of True / False values that indicate the result of that comparison for each element.

In [None]:
# Create an array
arr = np.array([10, 15, 20, 25, 30])

# Create a boolean mask (True/False array)
mask = arr > 20
print("Mask:", mask)

That mask can be used to select elements of the original array (or of any other array with the same shape).

In [None]:
# Use boolean indexing to select elements
filtered = arr[mask]
print("Values > 20:", filtered)
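As a small sketch of the "same shape" case, the mask built from one array can filter a second, parallel array. The `labels` array here is hypothetical and exists only for illustration.

```python
import numpy as np

arr = np.array([10, 15, 20, 25, 30])
# Hypothetical companion array with the same shape as arr
labels = np.array(["a", "b", "c", "d", "e"])

# Build the mask from arr, then apply it to labels
mask = arr > 20
selected_labels = labels[mask]
print(selected_labels)  # ['d' 'e']
```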

Since the original comparison is just an expression, you can use it directly in the index operator:

In [None]:
# new array with all values of arr less than 20
arr[arr < 20]

This can be mixed with traditional indexing and slicing, as shown in the examples below.

In [None]:
arr_2d = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# select all rows where the first element is greater than 3
row_mask = arr_2d[:, 0] > 3

print(row_mask)

In [None]:
# with indexing: column 2 of the previous result
idx = arr_2d[row_mask, 2]
print("Column 2 for selected rows:\n", idx)

# with slicing: second and third columns of the masked rows
cols = arr_2d[row_mask, 1:3]  # named `cols` to avoid shadowing the built-in `slice`
print("\nCols 1:3 for selected rows:\n", cols)

If we want the opposite, the tilde character (`~`) negates the condition. Here `~row_mask` will give us the rows not previously selected.

In [None]:
~row_mask

In [None]:
# indexing example, negated
idx = arr_2d[~row_mask, 2]
print("Column 2 for unselected row:\n", idx)

# slicing example, negated
cols = arr_2d[~row_mask, 1:3]  # named `cols` to avoid shadowing the built-in `slice`
print("\nCols 1:3 for unselected row:\n", cols)

These operations can be expanded further by combining conditions with NumPy's element-wise Boolean operators: `&` for and, and `|` for or. For example, the following statement creates a Boolean array containing True for every element of `names` that is equal to "Bob" or "Will". All other elements are False. The parentheses are required because `&` and `|` bind more tightly than the comparison operators.

`mask = (names == "Bob") | (names == "Will")`

Note that the Python keywords `and` and `or` do not work with NumPy Boolean arrays. You must use the symbols instead.
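The sketch below makes this concrete. The `names` array is not defined anywhere else in this notebook, so the data here is hypothetical. The final block shows what happens when the keyword `or` is used in place of `|`.

```python
import numpy as np

# Hypothetical data: `names` is made up for this illustration
names = np.array(["Bob", "Joe", "Will", "Bob", "Joe"])

# Element-wise OR of two conditions (parentheses are required)
mask = (names == "Bob") | (names == "Will")
print(mask)         # [ True False  True  True False]
print(names[mask])  # ['Bob' 'Will' 'Bob']

# Element-wise AND: neither "Bob" nor "Will"
neither = (names != "Bob") & (names != "Will")
print(names[neither])  # ['Joe' 'Joe']

# The keyword `or` asks for a single truth value for the whole array,
# which NumPy refuses to guess for multi-element arrays
try:
    (names == "Bob") or (names == "Will")
    keyword_failed = False
except ValueError as err:
    keyword_failed = True
    print("ValueError:", err)
```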

### Fancy Indexing

NumPy provides one last indexing trick, and it is fancy. Anywhere we've used an integer to index or slice an array, you can use a list of integers instead.

In [None]:
arr = np.zeros((8, 4))

for i in range(len(arr)):
    arr[i] = i

arr

To select a subset of rows, use a list (or `ndarray`) of integers specifying the desired order.

In [None]:
arr[[4, 3, 0, 6]]

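Fancy indexing always copies the selected data, so assigning the result to a new variable creates a new array rather than a view. When a fancy index appears on the left side of an assignment, the selected elements of the original array are modified in place. We will explore fancy indexing in more detail as required.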
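A minimal sketch of both behaviors, using a small 1D array chosen only for illustration:

```python
import numpy as np

arr = np.arange(8)  # [0 1 2 3 4 5 6 7]

# Reading with a fancy index produces an independent copy
picked = arr[[4, 3, 0, 6]]
picked[0] = 99
print(arr[4])                        # 4 (the original is untouched)
print(np.shares_memory(arr, picked))  # False

# Writing through a fancy index modifies the original array
arr[[1, 5]] = -1
print(arr)  # [ 0 -1  2  3  4 -1  6  7]
```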

Together, these tools provide a powerful way to operate on specific values in an array based on conditions and location in the data. All of these benefits carry over to Pandas.

## Universal Functions

A universal function, or `ufunc`, is a function that performs element-wise operations on data in an `ndarray`. To get the benefit of vectorized operations (speed, memory efficiency), you must use them instead of the base Python equivalents.

We've covered several already, but many others exist. See the [NumPy documentation](https://numpy.org/doc/stable/reference/ufuncs.html) for a complete list and additional details. There, the available `ufuncs` are grouped as follows:

- Math Operations
- Trigonometric Functions
- Bit-twiddling Functions (not class relevant)
- Comparison Functions
- Floating Functions
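As a brief illustration, here is one unary and one binary `ufunc` next to a base Python loop that computes the same square roots. The input values are arbitrary.

```python
import math
import numpy as np

values = np.array([1.0, 4.0, 9.0, 16.0])

# Base Python: one function call per element
roots_loop = [math.sqrt(v) for v in values]

# Unary ufunc: applied to every element in a single call
roots = np.sqrt(values)
print(roots)  # [1. 2. 3. 4.]

# Binary ufunc: element-wise maximum of the array and a scalar
larger = np.maximum(values, 5.0)
print(larger)  # [ 5.  5.  9. 16.]

# Both functions are instances of np.ufunc
print(isinstance(np.sqrt, np.ufunc), isinstance(np.maximum, np.ufunc))  # True True
```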


## Broadcasting

Broadcasting is the set of rules and methods that facilitate operations between arrays of different shapes. For example, in the case of scalar multiplication, the scalar is *broadcast* into an array of the same size as the matrix before element-by-element multiplication is performed:

In [None]:
arr = np.arange(1,10).reshape(3, 3)
print(arr)

result = arr * 4
print(result)

### vs Base Python

This greatly simplifies things compared to base Python, where a nested loop is required:

In [None]:
# Create the 3x3 matrix in base Python
arr = []
value = 1
for i in range(3):
    row = []
    for j in range(3):
        row.append(value)
        value += 1
    arr.append(row)

print("Original array:")
for row in arr:
    print(row)

# Multiply each element by 4 using explicit loops
result = []
for i in range(len(arr)):
    new_row = []
    for j in range(len(arr[i])):
        new_row.append(arr[i][j] * 4)
    result.append(new_row)

print("\nArray * 4:")
for row in result:
    print(row)

Alternatively, you could modify the array in-place:

In [None]:
# Multiply in-place
for i in range(len(arr)):
    for j in range(len(arr[i])):
        arr[i][j] = arr[i][j] * 4

# Print the modified array itself (not the earlier `result` list)
for row in arr:
    print(row)

Even with base Python at its most expressive, the process is clumsy compared to the NumPy implementation. Here we use list comprehensions to create and multiply the array:

In [None]:
def scalar_multiply(matrix, scalar):
    return [[element * scalar for element in row] for row in matrix]

matrix = [[row * 3 + col + 1 for col in range(3)] for row in range(3)]
result = scalar_multiply(matrix, 4)

print("Original array:")
for row in matrix:
    print(row)

print("\nArray * 4:")
for row in result:
    print(row)

### Implications

NumPy makes many things easier. For example, the following code subtracts each column's mean from every value in that column, and the computation itself takes only two lines:

In [None]:
rng = np.random.default_rng()

arr = rng.standard_normal((4, 3))
print("Array:")
print(arr)

arr_mean = arr.mean(0)
print("\nArray Mean:")
print(arr_mean)

demeaned = arr - arr_mean
print("\nDemeaned Array:")
print(demeaned)

### The Broadcasting Rule

Broadcasting can be tricky and unintuitive. It is not necessary to understand its inner workings for most of this course, but its essence is captured in the rule and two figures below. We will discuss this in greater detail as/if required. Otherwise, I refer you to Appendix A.3 of Python for Data Analysis (McKinney 2022) for a detailed treatment.

![The Broadcasting Rule, McKinney Figure A-4](images/03b-broadcasting.png)

![Broadcasting in 2D, McKinney Figure A-5](images/03b-broadcasting-2d.png)
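In short, NumPy compares the two shapes starting from the trailing dimension, and each pair of dimensions must either match or include a 1 (a missing dimension counts as 1). The sketch below, using an arbitrary 4x3 array, shows one case that works as-is and one that needs an explicit reshape.

```python
import numpy as np

arr = np.arange(12.0).reshape(4, 3)

# Column means have shape (3,), which lines up with the trailing axis of (4, 3)
col_means = arr.mean(axis=0)
demeaned_cols = arr - col_means
print(demeaned_cols.mean(axis=0))  # [0. 0. 0.]

# Row means have shape (4,), which does NOT line up with the trailing axis (3)
row_means = arr.mean(axis=1)
try:
    arr - row_means
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)

# Adding a length-1 axis gives shape (4, 1), which broadcasts across columns
demeaned_rows = arr - row_means[:, np.newaxis]
print(demeaned_rows.mean(axis=1))  # [0. 0. 0. 0.]
```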


## Techniques

### Tricks with Booleans

Python treats Boolean values as 1 (True) or 0 (False). This provides a useful way of counting results.

In [None]:
arr = rng.standard_normal(100)

# How many values are positive?
(arr > 0).sum()

The parentheses around `arr > 0` are necessary to ensure that `sum` is called after that expression is evaluated into a Boolean array.

The methods `any` and `all` are also very useful. `any` returns True if any value of an array is True, and `all` returns True only if all values are.

In [None]:
bools = np.array([False, False, True, False])

bools.any()

In [None]:
bools.all()

`any` and `all` also work with non-Boolean arrays, where nonzero / nonempty elements are treated as True.
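For instance, with small integer arrays made up for illustration:

```python
import numpy as np

nums = np.array([0, 3, 0, 7])

# At least one element is nonzero
has_nonzero = nums.any()
print(has_nonzero)  # True

# Not every element is nonzero
all_nonzero = nums.all()
print(all_nonzero)  # False

# Every element is nonzero here
print(np.array([1, 2, 3]).all())  # True
```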

These tricks will be leveraged routinely throughout the course. 

### Sorting

NumPy's `sort` method works in-place (like Python `list.sort`).

In [None]:
arr = rng.standard_normal(6)
arr

In [None]:
arr.sort()
arr

For multi-dimensional arrays, you can specify the axis to sort by.

In [None]:
arr = rng.standard_normal((5, 3))
arr

In [None]:
# sort the values within each column (along the row axis)
arr.sort(axis=0)
arr

In [None]:
# sort across each row (along the column axis)
arr.sort(axis=1)
arr

**Note:** the `np.sort()` _function_ returns a sorted copy of the array (like `sorted()` in base Python). Failing to recognize this subtle difference can lead to bugs.

In [None]:
np.sort(arr)
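The difference is easiest to see on an unsorted array. This sketch uses a small hand-written array so the results are predictable.

```python
import numpy as np

arr = np.array([3, 1, 2])

# np.sort returns a sorted copy and leaves the original alone
sorted_copy = np.sort(arr)
after_np_sort = arr.tolist()
print(sorted_copy)    # [1 2 3]
print(after_np_sort)  # [3, 1, 2]

# The sort method sorts in place and returns None,
# so writing `arr = arr.sort()` would replace the array with None
returned = arr.sort()
print(returned)  # None
print(arr)       # [1 2 3]
```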

## Array-Oriented Programming

Cool! How do we use it to solve problems? As the name suggests, NumPy is mostly used for numerical work. Here are some common patterns that come up and how to leverage NumPy's design to do them "better."

### Numerical Analysis

This is NumPy's bread and butter. We've seen several examples already, but here's one more.

Suppose you want to analyze production data, whether real or simulated.

In [None]:
np.random.seed(42)  # For reproducibility

# Production data: 1000 products × 4 factories
units_produced = np.random.randint(100, 500, size=(1000, 4))
unit_cost = np.random.uniform(10, 50, size=(1000, 4))
defect_rate = np.random.uniform(0.01, 0.05, size=(1000, 4))

# Calculate total costs WITH defect losses - all at once!
total_cost = units_produced * unit_cost * (1 + defect_rate)

# Analysis in one line each:
print(f"Factory with lowest average cost: Factory {total_cost.mean(axis=0).argmin() + 1}")
print(f"Most expensive product to produce: Product {total_cost.sum(axis=1).argmax() + 1}")
print(f"Total production cost across all factories: ${total_cost.sum():,.2f}")

# Find products where Factory 1 beats Factory 2
factory1_wins = total_cost[:, 0] < total_cost[:, 1]
print(f"Factory 1 cheaper than Factory 2 for {factory1_wins.sum()} products")

# Instant statistical analysis
print("\nCost per factory (mean ± std):")
for i in range(4):
    print(f"  Factory {i+1}: ${total_cost[:, i].mean():,.0f} ± ${total_cost[:, i].std():,.0f}")

And, of course, your boss wants charts! As a preview of things to come, you can "easily" visualize it...

In [None]:
import matplotlib.pyplot as plt

# Create a figure with 2 subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Left plot: Box plot comparing cost distributions
ax1.boxplot([total_cost[:, i] for i in range(4)], 
            tick_labels=['Factory 1', 'Factory 2', 'Factory 3', 'Factory 4'])
ax1.set_ylabel('Production Cost ($)')
ax1.set_title('Cost Distribution by Factory')
ax1.grid(True, alpha=0.3)

# Right plot: Scatter plot - Cost vs Defect Rate for Factory 1
ax2.scatter(defect_rate[:, 0] * 100, total_cost[:, 0], 
           alpha=0.5, s=20, color='steelblue')
ax2.set_xlabel('Defect Rate (%)')
ax2.set_ylabel('Total Cost ($)')
ax2.set_title('Factory 1: Cost vs Quality Trade-off')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Simulation

NumPy is suitable for many types of simulation. Wikipedia describes a [Random Walk](https://en.wikipedia.org/wiki/Random_walk) as:

> a stochastic process that describes a path that consists of a succession of random steps on some mathematical space

Here is an implementation in base Python.

In [None]:
import random
position = 0
walk = [position]
nsteps = 1000
for _ in range(nsteps):
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)

Here is a simple visualization of the first 100 values.

In [None]:
plt.plot(walk[:100])

Observe that `walk` is the cumulative sum of the random steps. That makes it straightforward to implement in NumPy.

In [None]:
nsteps = 1000
rng = np.random.default_rng(seed=12345)  # fresh random generator
draws = rng.integers(0, 2, size=nsteps)  # coin flip x 1000
steps = np.where(draws == 0, 1, -1)      # steps = 1 if draws = 0 else -1
walk = steps.cumsum()                    # cumulative sum of each step

In [None]:
print(draws[:10])

In [None]:
print(steps[:10])

In [None]:
print(walk[:10])

Easy to get stats...

In [None]:
walk.min()

In [None]:
walk.max()

The first crossing time is harder: at what *step* does the random walk first reach a particular value?

In [None]:
# argmax gives the first index of the maximum value (True)
(np.abs(walk) >= 10).argmax()
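One caveat with this approach: if the walk never reaches the threshold, the Boolean array is all False and `argmax` returns 0, which looks like a crossing at the very first step. The sketch below uses a short hand-written walk (not the simulated one) to show the issue and one possible guard using `any`.

```python
import numpy as np

# Small hand-made walk for illustration only
walk = np.array([1, 2, 1, 0, -1, -2, -3, -2])

# Threshold that IS reached: argmax finds the first True
reached = np.abs(walk) >= 3
first = reached.argmax()
print(first)  # 6

# Threshold that is NEVER reached: every value is False, so argmax returns 0
never = np.abs(walk) >= 10
misleading = never.argmax()
print(misleading)  # 0

# Guard with any() before trusting the index
first_or_none = never.argmax() if never.any() else None
print(first_or_none)  # None
```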

But this is only a point sample. What about variance? This simulation can be extended to many random walks by using a 2D array of `draws`.

In [None]:
nwalks = 5000
nsteps = 1000
draws = rng.integers(0, 2, size=(nwalks, nsteps))
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(axis=1)
walks

Easy questions...

In [None]:
# overall max and min, across all walks
print(walks.max(), walks.min())

Min crossing time is trickier - not all walks may reach the threshold.

In [None]:
# which walks hit +/-30 at any step?
hits30 = (np.abs(walks) >= 30).any(axis=1)
hits30

In [None]:
# confirm that the first walk doesn't hit +/-30
print(walks[0].max(), walks[0].min())

In [None]:
# number of walks that hit 30 = sum of True
hits30.sum()

How long does it take those that hit 30 to reach it?

In [None]:
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(axis=1)
crossing_times

Average crossing time:

In [None]:
crossing_times.mean()