# Financial Trading with Python, 2nd Edition
Cordell L. Tanny, CFA, FRM, FDP

## Chapter 2: Setting up a Python Quantitative Workflow

### Notebook 2.2: NumPy for Finance

Version: 1

Date of last revision: December 28, 2025

**NumPy: The Hidden Engine**

*Note and recommendation: You will often want to experiment on your own as you work through this notebook. We recommend saving a copy before you add cells or change anything, so that you always have a pristine version to go back to.*

## 1.1 Why NumPy?

If you have worked through Notebook 2.1, you are comfortable with Pandas. You can create DataFrames, manipulate time series, and combine datasets. So why do you need another library?

The answer is that Pandas is built on NumPy. Every time you perform a calculation in Pandas, NumPy is doing the heavy lifting under the hood. The DataFrame you see on screen is essentially a collection of NumPy arrays with labels attached.

For most day-to-day work, you do not need to think about this. Pandas handles everything for you. But there are three situations where you need to work with NumPy directly:

1. **Speed:** When you need maximum performance for large datasets or repeated calculations, working directly with NumPy arrays is typically faster than going through Pandas, because you skip the overhead of index and label handling.

2. **Machine Learning:** Libraries like scikit-learn are built around NumPy arrays. Many of them accept DataFrames, but they convert the data to arrays internally and expect a specific shape. You need to know how to extract arrays from your DataFrames and reshape them correctly.

3. **Matrix Math:** Portfolio optimization, covariance matrices, and other financial calculations require matrix operations that NumPy handles natively.

This notebook will teach you what you need to know to work confidently with NumPy in a financial context.

## 1.2 Arrays vs DataFrames

Before we dive into creating arrays, you need to understand how they differ from the DataFrames you already know. They are related but serve different purposes.

**DataFrames have labels. Arrays do not.**

A DataFrame has an index (row labels) and column names. These labels make it easy to slice data by date or select a column by name. An array is just numbers. No labels. If you want the third row, you ask for row index 2. There is no concept of asking for the row labeled "2024-07-03".

**DataFrames can hold mixed data types. Arrays cannot.**

A DataFrame can have one column of strings, another of floats, and another of integers. Each column maintains its own data type. An array requires all elements to be the same type. If you mix types, NumPy will convert everything to a common type, often in ways you did not expect. We will cover this in detail shortly.

**DataFrames are built for tabular data. Arrays can be any shape.**

DataFrames are always two-dimensional: rows and columns. Arrays can be one-dimensional (a single series of numbers), two-dimensional (a matrix), three-dimensional (a cube of numbers), or higher. This flexibility is essential for machine learning where you often work with multi-dimensional data.

**DataFrames have built-in time series methods. Arrays are for raw computation.**

Methods like `.resample()`, `.rolling()`, and `.shift()` are Pandas features. NumPy arrays do not have these. Arrays are designed for fast numerical computation, not for time series manipulation.

Let's see these differences in practice.

In [4]:
import pandas as pd
import numpy as np

# Create a simple DataFrame
df = pd.DataFrame({
    'Price': [100, 102, 105],
    'Volume': [1000, 1500, 1200]
}, index=['2024-07-01', '2024-07-02', '2024-07-03'])

print("DataFrame:")
print(df)
print(f"\nAccess by label: df.loc['2024-07-02', 'Price'] = {df.loc['2024-07-02', 'Price']}")

# Extract as a NumPy array
price_array = df['Price'].values

print("\nNumPy Array:")
print(price_array)
print(f"\nAccess by position: price_array[1] = {price_array[1]}")

DataFrame:
            Price  Volume
2024-07-01    100    1000
2024-07-02    102    1500
2024-07-03    105    1200

Access by label: df.loc['2024-07-02', 'Price'] = 102

NumPy Array:
[100 102 105]

Access by position: price_array[1] = 102


Notice the difference. With the DataFrame, we accessed the value using the date label and column name. With the array, we only have position. The label information is gone.

This is the trade-off. When you convert a DataFrame to an array, you lose the labels but you gain speed and compatibility with machine learning libraries.

**When to use each:**

Use DataFrames when you are working with time series data, need to combine datasets, or want the convenience of labeled access.

Use NumPy arrays when you need maximum speed, are preparing data for machine learning, or are performing matrix operations like portfolio optimization.

In practice, you will use both. Most of your data manipulation happens in Pandas. When you need to feed data into a model or perform heavy numerical computation, you extract arrays from your DataFrames.
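
As a rough sketch of that extraction step, the cell below pulls arrays out of a small DataFrame. The DataFrame is a hypothetical stand-in for your own data, and `.to_numpy()` is used here as the accessor Pandas recommends; it returns the same kind of array as the `.values` attribute used earlier. The `reshape(-1, 1)` call illustrates the single-feature 2D layout that many machine learning libraries expect.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column DataFrame, mirroring the example above
df = pd.DataFrame(
    {"Price": [100.0, 102.0, 105.0], "Volume": [1000.0, 1500.0, 1200.0]},
    index=["2024-07-01", "2024-07-02", "2024-07-03"],
)

# One column -> 1D array of shape (3,)
price_1d = df["Price"].to_numpy()

# Same data as a single-feature 2D array of shape (3, 1);
# -1 tells NumPy to infer the number of rows
price_2d = price_1d.reshape(-1, 1)

# Whole DataFrame -> 2D array of shape (3, 2); the labels are dropped
features = df.to_numpy()

print(price_1d.shape, price_2d.shape, features.shape)
```

If you need the dates or column names later, keep a reference to `df.index` and `df.columns` before converting, since the arrays no longer carry them.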

## 2.1 Creating Arrays

Before we can work with NumPy, we need to understand how to create arrays. An array is simply a collection of values, similar to a list in Python but with important differences that we will explore.

Let's start by importing NumPy. The convention is to import it as `np`.

In [1]:
import numpy as np

### 2.1.1 np.array(): Converting Lists to Arrays

The most common way to create an array is to convert a Python list using `np.array()`. This is how you will typically create arrays from data you already have.

In [2]:
# Create an array from a list of prices
prices = [100, 102, 101, 105, 108]
price_array = np.array(prices)

print("Original list:", prices)
print("NumPy array:", price_array)
print("Type:", type(price_array))

Original list: [100, 102, 101, 105, 108]
NumPy array: [100 102 101 105 108]
Type: <class 'numpy.ndarray'>


Notice that the array looks similar to the list, but it is a different object entirely. The type is `numpy.ndarray`, which stands for n-dimensional array.

You can also create arrays from nested lists to make 2D arrays (matrices).

In [3]:
# Create a 2D array (matrix) from nested lists
# Each inner list becomes a row
returns_matrix = np.array([
    [0.01, 0.02, -0.01],   # Asset 1 returns
    [0.02, -0.01, 0.03],   # Asset 2 returns
    [0.00, 0.01, 0.02]     # Asset 3 returns
])

print("2D Array (Matrix):")
print(returns_matrix)

2D Array (Matrix):
[[ 0.01  0.02 -0.01]
 [ 0.02 -0.01  0.03]
 [ 0.    0.01  0.02]]


### 2.1.2 np.zeros() and np.ones(): Initializing Arrays

Sometimes you need to create an array of a specific size filled with placeholder values. This is common when you are building an array that you will populate later, such as storing results from a loop or creating a template for portfolio weights.

In [5]:
# Create an array of zeros (useful for initializing results)
zeros_array = np.zeros(5)
print("Array of zeros:", zeros_array)

# Create a 2D array of zeros (3 rows, 4 columns)
zeros_matrix = np.zeros((3, 4))
print("\nMatrix of zeros:")
print(zeros_matrix)

# Create an array of ones (useful for equal-weight portfolios)
ones_array = np.ones(4)
print("\nArray of ones:", ones_array)

# Equal weight portfolio: divide by number of assets
num_assets = 4
equal_weights = np.ones(num_assets) / num_assets
print("\nEqual weight portfolio:", equal_weights)
print("Sum of weights:", equal_weights.sum())

Array of zeros: [0. 0. 0. 0. 0.]

Matrix of zeros:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

Array of ones: [1. 1. 1. 1.]

Equal weight portfolio: [0.25 0.25 0.25 0.25]
Sum of weights: 1.0


The equal weight portfolio example is one you will use often. If you have four assets and want to allocate equally to each, you create an array of ones and divide by the number of assets. The weights sum to 1, as they should.

Notice that when creating a 2D array, we pass the shape as a tuple: `(3, 4)` means 3 rows and 4 columns.
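
The "populate later" pattern mentioned above might look like the following sketch. The price series and the list of lookback windows are made-up values for illustration only.

```python
import numpy as np

# Hypothetical inputs: a short price series and three lookback windows
prices = np.array([100.0, 102.0, 101.0, 105.0, 108.0, 107.0])
windows = [2, 3, 4]

# Preallocate one slot per window, then fill each slot inside the loop
avg_prices = np.zeros(len(windows))
for i, w in enumerate(windows):
    avg_prices[i] = prices[-w:].mean()  # average of the last w prices

print(avg_prices)
```

Preallocating with `np.zeros()` fixes the size and data type of the result up front, instead of growing a Python list and converting it at the end.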

### 2.1.3 np.arange() and np.linspace(): Generating Sequences

When you need a sequence of numbers, NumPy provides two useful functions. The `np.arange()` function works like Python's `range()` but returns an array. The `np.linspace()` function creates evenly spaced numbers between two endpoints.

In [6]:
# np.arange: start, stop, step (stop is excluded)
sequence = np.arange(0, 10, 2)
print("np.arange(0, 10, 2):", sequence)

# np.linspace: start, stop, number of points (stop is included)
# Useful for creating strike prices for an options chain
strike_prices = np.linspace(90, 110, 5)
print("\nStrike prices from 90 to 110 (5 points):", strike_prices)

# Creating percentage thresholds
thresholds = np.linspace(0, 1, 11)
print("\nPercentage thresholds (0% to 100%):", thresholds)

np.arange(0, 10, 2): [0 2 4 6 8]

Strike prices from 90 to 110 (5 points): [ 90.  95. 100. 105. 110.]

Percentage thresholds (0% to 100%): [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]


The key difference: `np.arange()` uses a step size, while `np.linspace()` uses a count. Use `np.arange()` when you know the step you want. Use `np.linspace()` when you know how many points you need between two values.
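
To see how the two functions relate, the sketch below asks `np.linspace()` for the step it chose (via its `retstep=True` option) and then rebuilds the same grid with `np.arange()`. The strike range is the one from the cell above.

```python
import numpy as np

# retstep=True also returns the spacing that linspace chose
strikes, step = np.linspace(90, 110, 5, retstep=True)
print(strikes, step)

# The same grid with arange: the stop must sit past 110 because it is excluded
strikes_arange = np.arange(90, 110 + step, step)
print(strikes_arange)
```

For non-integer steps such as 0.1, NumPy's documentation suggests preferring `np.linspace()`, because floating-point rounding can make the length of an `np.arange()` result hard to predict.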

### 2.1.4 np.random: Generating Random Data for Simulations

Random number generation is essential for Monte Carlo simulations, bootstrapping, and testing trading strategies with synthetic data. NumPy provides a comprehensive random module.

In [7]:
# Set a seed for reproducibility
np.random.seed(42)

# Generate random returns from a normal distribution
# mean = 0.001 (0.1% daily return), std = 0.02 (2% daily volatility)
random_returns = np.random.normal(loc=0.001, scale=0.02, size=10)
print("Random daily returns:")
print(random_returns)

# Generate random integers (useful for random sampling)
random_indices = np.random.randint(0, 100, size=5)
print("\nRandom indices:", random_indices)

# Generate uniform random numbers between 0 and 1
random_weights = np.random.random(4)
# Normalize to sum to 1 (creating random portfolio weights)
random_weights = random_weights / random_weights.sum()
print("\nRandom portfolio weights:", random_weights)
print("Sum:", random_weights.sum())

Random daily returns:
[ 0.01093428 -0.00176529  0.01395377  0.0314606  -0.00368307 -0.00368274
  0.03258426  0.01634869 -0.00838949  0.0118512 ]

Random indices: [63 59 20 32 75]

Random portfolio weights: [0.52432363 0.0060574  0.01976966 0.44984931]
Sum: 1.0


Setting a seed with `np.random.seed()` ensures your results are reproducible. Every time you run the code with the same seed, you get the same random numbers. This is critical for debugging and for sharing results with colleagues.

The normal distribution example is particularly important. Many financial models assume returns are normally distributed, even though real returns tend to have fatter tails than the normal distribution predicts. Simulating returns with `np.random.normal()` is the foundation of Monte Carlo analysis.
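
NumPy also offers a newer interface for random numbers, `np.random.default_rng()`, which its documentation recommends for new code. The sketch below repeats the three draws from the cell above using that interface. Note that it will not produce the same numbers as `np.random.seed(42)`, because the two interfaces use different generators.

```python
import numpy as np

# A Generator with its own seed; it does not touch NumPy's global random state
rng = np.random.default_rng(42)

daily_returns = rng.normal(loc=0.001, scale=0.02, size=10)
sample_idx = rng.integers(0, 100, size=5)   # the high end is excluded
raw = rng.random(4)
weights = raw / raw.sum()                   # normalize so the weights sum to 1

# A second generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(42)
same = np.array_equal(daily_returns, rng_again.normal(0.001, 0.02, 10))
print("Reproducible:", same)
```

Because each generator object carries its own state, you can pass `rng` into functions explicitly, which avoids one part of your code silently changing the random numbers seen by another.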

---

## 2.2 The Homogeneity Rule

One of the most important differences between arrays and DataFrames is that arrays require all elements to be the same data type. This is called homogeneity. DataFrames can have different types in different columns. Arrays cannot.

This is not just a technical detail. It is the reason arrays are fast. When NumPy knows that every element is the same type, it can perform calculations efficiently without checking each element individually.

### 2.2.1 Why Arrays Require a Single Data Type

When you create an array, NumPy allocates a contiguous block of memory where each element takes the same amount of space. This allows NumPy to perform calculations on the entire array at once, rather than element by element. If the types were mixed, NumPy would need to check each element before operating on it, which would destroy the performance advantage.
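
The fixed-size layout described above can be inspected directly. This small sketch uses the `.itemsize` and `.nbytes` attributes on an illustrative array of prices.

```python
import numpy as np

prices = np.array([100.5, 102.25, 103.0, 101.75])  # float64 by default

print(prices.dtype)      # float64
print(prices.itemsize)   # bytes per element: 8 for float64
print(prices.nbytes)     # total bytes = itemsize * number of elements
```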

In [8]:
# An array of floats
float_array = np.array([1.5, 2.5, 3.5])
print("Float array:", float_array)
print("Data type:", float_array.dtype)

# An array of integers
int_array = np.array([1, 2, 3])
print("\nInteger array:", int_array)
print("Data type:", int_array.dtype)

Float array: [1.5 2.5 3.5]
Data type: float64

Integer array: [1 2 3]
Data type: int64


### 2.2.2 What Happens When You Mix Types

Here is where beginners get into trouble. If you create an array with mixed types, NumPy will silently convert everything to a common type. It does not warn you. It just picks a type that can represent all the values.

In [9]:
# Mix integers and floats: everything becomes float
mixed_numeric = np.array([1, 2, 3.5])
print("Mixed int and float:", mixed_numeric)
print("Data type:", mixed_numeric.dtype)

# Mix numbers and strings: everything becomes string
mixed_with_string = np.array([1, 2, 'three'])
print("\nMixed with string:", mixed_with_string)
print("Data type:", mixed_with_string.dtype)

# The danger: try to do math on the string array
try:
    result = mixed_with_string * 2
    print("\nMultiply by 2:", result)
except TypeError as e:
    print("\nError:", e)

Mixed int and float: [1.  2.  3.5]
Data type: float64

Mixed with string: ['1' '2' 'three']
Data type: <U21

Error: The 'out' kwarg is necessary. Use numpy.strings.multiply without it.


Look at what happened. When we mixed integers and floats, NumPy converted everything to float64. This is usually fine.

But when we mixed numbers with a string, NumPy converted everything to strings. The data type `<U21` means Unicode string with up to 21 characters. The numbers 1 and 2 became the strings '1' and '2'. When we tried to multiply by 2, NumPy raised an error because it cannot perform arithmetic on string arrays.

NumPy did not raise an error when we created the array. It silently converted our data. The error only appeared when we tried to use the data for calculations.

### 2.2.3 Why This Matters for Financial Calculations

This silent conversion can corrupt your analysis in subtle ways. Imagine you are reading data from a CSV file and one of the values is accidentally formatted as text. Your entire array becomes strings, and your calculations either fail outright or produce nonsense.

In [10]:
# Simulating data read from a messy CSV file
# One value has a dollar sign accidentally included
messy_prices = np.array([100.50, 102.25, '$103.00', 101.75])
print("Messy prices array:", messy_prices)
print("Data type:", messy_prices.dtype)

# Attempting to calculate returns will fail
try:
    returns = np.diff(messy_prices) / messy_prices[:-1]
    print("Returns:", returns)
except TypeError as e:
    print("\nCannot calculate returns!")
    print("Error:", e)

Messy prices array: ['100.5' '102.25' '$103.00' '101.75']
Data type: <U32

Cannot calculate returns!
Error: ufunc 'subtract' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> None


One bad value turned the entire array into strings. The data type `<U32` means Unicode string with up to 32 characters. When we tried to calculate returns using `np.diff()`, NumPy could not perform the subtraction because you cannot subtract strings from each other.

In practice, this is why data cleaning is so important. Before you convert a DataFrame column to a NumPy array, make sure the data types are correct. Use `.info()` on your DataFrame to check column types. Use `.astype()` to convert columns explicitly.

The rule is simple: always check your data types before performing calculations. A few seconds of verification can save hours of debugging.
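
One way to put that rule into practice is sketched below. It assumes the messy values arrive as a column of text (here a hypothetical `raw` Series) and uses `pd.to_numeric()` to convert them before extracting the array. How you handle a stray symbol depends on your data; stripping the dollar sign is just the fix that suits this example.

```python
import numpy as np
import pandas as pd

# Hypothetical column read from a messy CSV: every value arrives as text
raw = pd.Series(["100.50", "102.25", "$103.00", "101.75"])

# Strip the currency symbol, then convert; errors="coerce" turns anything
# that still cannot be parsed into NaN instead of raising
cleaned = pd.to_numeric(raw.str.replace("$", "", regex=False), errors="coerce")

prices = cleaned.to_numpy()
print(prices.dtype)   # float64

# Verify before calculating: a float dtype and no missing values
assert np.issubdtype(prices.dtype, np.floating)
assert not np.isnan(prices).any()

returns = np.diff(prices) / prices[:-1]
print(returns)
```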

---

## 2.3 Array Attributes

Just as DataFrames have attributes like `.shape` and `.index`, arrays have attributes that tell you about their structure. These are essential for debugging and for understanding what you are working with.

### 2.3.1 .shape: Dimensions of the Array

In [11]:
# Create arrays of different dimensions
array_1d = np.array([1, 2, 3, 4, 5])
array_2d = np.array([[1, 2, 3], [4, 5, 6]])

print("1D array:", array_1d)
print("Shape:", array_1d.shape)

print("\n2D array:")
print(array_2d)
print("Shape:", array_2d.shape)

1D array: [1 2 3 4 5]
Shape: (5,)

2D array:
[[1 2 3]
 [4 5 6]]
Shape: (2, 3)


The `.shape` attribute returns a tuple describing the dimensions of the array.

For the 1D array, the shape is `(5,)`. This means 5 elements in a single dimension. The trailing comma indicates it is a tuple with one element.

For the 2D array, the shape is `(2, 3)`. This means 2 rows and 3 columns. For a 2D array, the first number is the number of rows and the second is the number of columns.

Understanding shape is critical. When we get to machine learning, most errors you encounter will be shape mismatches. Get comfortable reading and interpreting shapes now.
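
A shape mismatch typically looks like the sketch below. The two arrays are illustrative: four returns but only three weights.

```python
import numpy as np

returns = np.array([0.01, 0.02, -0.01, 0.03])   # shape (4,)
weights = np.array([0.5, 0.3, 0.2])             # shape (3,), one weight short

mismatch_caught = False
try:
    weighted = returns * weights
except ValueError as e:
    # NumPy names the two shapes it could not combine
    mismatch_caught = True
    print("Shape mismatch:", e)

# Comparing shapes up front makes the problem obvious before any math runs
print(returns.shape, weights.shape, returns.shape == weights.shape)
```

When you hit an error like this, printing `.shape` for every array involved is usually the quickest way to find the culprit.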

### 2.3.2 .dtype: Data Types and Precision

The `.dtype` attribute tells you what kind of data the array holds. This is important for understanding memory usage and numerical precision.

In [12]:
# Different data types
int_array = np.array([1, 2, 3])
float_array = np.array([1.0, 2.0, 3.0])
explicit_float32 = np.array([1.0, 2.0, 3.0], dtype=np.float32)

print("Integer array dtype:", int_array.dtype)
print("Float array dtype:", float_array.dtype)
print("Float32 array dtype:", explicit_float32.dtype)

Integer array dtype: int64
Float array dtype: float64
Float32 array dtype: float32


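By default, NumPy creates 64-bit floats (`float64`) and, on most 64-bit platforms, 64-bit integers (`int64`). The 64 refers to the number of bits used to store each value. For floats, more bits means more precision; for integers, more bits means a wider range of values. In both cases, more bits also means more memory.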

For most financial work, `float64` is the right choice. It provides roughly 15 to 16 significant decimal digits, which keeps rounding errors negligible for typical price calculations (it does not eliminate them). You can specify a smaller type like `float32` to save memory when working with very large datasets, but this is rarely necessary.
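
The precision and memory trade-off can be seen in a short sketch. The price value is arbitrary and chosen only to show the lost digits.

```python
import numpy as np

price = 1234.5678

as64 = np.float64(price)
as32 = np.float32(price)

print(as64)   # stored at full double precision
print(as32)   # only about 7 significant digits survive

# Memory: float32 uses half the bytes of float64
big64 = np.ones(1_000_000, dtype=np.float64)
big32 = big64.astype(np.float32)
print(big64.nbytes, big32.nbytes)
```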

The key point is that you can check the type at any time with `.dtype`. If your calculations are producing unexpected results, this is one of the first things to check.

### 2.3.3 .ndim: Number of Dimensions

The `.ndim` attribute tells you how many dimensions the array has. A 1D array has `ndim=1`, a 2D array has `ndim=2`, and so on.

In [13]:
array_1d = np.array([1, 2, 3, 4, 5])
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print("1D array ndim:", array_1d.ndim)
print("2D array ndim:", array_2d.ndim)
print("3D array ndim:", array_3d.ndim)

1D array ndim: 1
2D array ndim: 2
3D array ndim: 3


In financial work, you will mostly encounter 1D and 2D arrays. A 1D array might hold a series of returns. A 2D array might hold a matrix of returns for multiple assets over multiple time periods.

3D arrays appear in more advanced applications, such as storing multiple covariance matrices over time or working with certain machine learning models. For now, just know that `.ndim` tells you the dimensionality at a glance.

### 2.3.4 .size: Total Number of Elements

The `.size` attribute returns the total number of elements in the array. For a 1D array, this is the same as the length. For a 2D array, it is the number of rows multiplied by the number of columns.

In [14]:
array_1d = np.array([1, 2, 3, 4, 5])
array_2d = np.array([[1, 2, 3], [4, 5, 6]])

print("1D array shape:", array_1d.shape)
print("1D array size:", array_1d.size)

print("\n2D array shape:", array_2d.shape)
print("2D array size:", array_2d.size)

1D array shape: (5,)
1D array size: 5

2D array shape: (2, 3)
2D array size: 6


The 1D array has shape `(5,)` and size 5. The 2D array has shape `(2, 3)` and size 6 (2 times 3).

The `.size` attribute is useful when you need to know how many data points you are working with, regardless of how the array is shaped. This comes up when calculating statistics or when checking that two arrays have the same amount of data before reshaping.
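
As a small illustration of why `.size` matters for reshaping, the sketch below reshapes a six-element array. `np.arange(6)` is used only as convenient placeholder data.

```python
import numpy as np

values = np.arange(6)            # shape (6,), size 6

# 6 elements fit into 2 x 3 or 3 x 2 ...
as_2x3 = values.reshape(2, 3)
as_3x2 = values.reshape(3, 2)
print(as_2x3.size, as_3x2.size)

# ... but not into 4 x 2, which would need 8 elements
try:
    values.reshape(4, 2)
except ValueError as e:
    print("Cannot reshape:", e)
```

Reshaping never changes `.size`; it only changes how the same elements are arranged.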

---

## 2.4 Indexing and Slicing Arrays

Accessing elements in an array is similar to accessing elements in a Python list, but with more power. You can select individual elements, slice ranges, filter by condition, and select specific positions all in a single line.

### 2.4.1 Basic Indexing (1D and 2D)

Array indexing starts at 0, just like Python lists. For 2D arrays, you provide two indices: row first, then column.

In [16]:
# 1D array indexing
prices = np.array([100, 102, 105, 103, 108])
print("Prices:", prices)
print("First element (index 0):", prices[0])
print("Last element (index -1):", prices[-1])
print("Third element (index 2):", prices[2])

Prices: [100 102 105 103 108]
First element (index 0): 100
Last element (index -1): 108
Third element (index 2): 105


Negative indexing works the same as in Python lists. Index -1 gives you the last element, -2 gives you the second to last, and so on.

Now let's look at 2D indexing.

For 2D arrays, you specify the row index first, then the column index, separated by a comma.

In [17]:
# 2D array: rows are assets, columns are daily returns
returns = np.array([
    [0.01, 0.02, -0.01, 0.03],   # Asset 0
    [0.02, -0.01, 0.02, 0.01],   # Asset 1
    [-0.01, 0.03, 0.01, 0.02]    # Asset 2
])

print("Returns matrix:")
print(returns)
print("\nElement at row 0, column 1:", returns[0, 1])
print("Element at row 2, column 3:", returns[2, 3])
print("Entire row 1:", returns[1])
print("Entire column 2:", returns[:, 2])

Returns matrix:
[[ 0.01  0.02 -0.01  0.03]
 [ 0.02 -0.01  0.02  0.01]
 [-0.01  0.03  0.01  0.02]]

Element at row 0, column 1: 0.02
Element at row 2, column 3: 0.02
Entire row 1: [ 0.02 -0.01  0.02  0.01]
Entire column 2: [-0.01  0.02  0.01]


The syntax `returns[0, 1]` gives you the element at row 0, column 1. This is the second return for the first asset.

To select an entire row, you just provide the row index: `returns[1]` gives you all of Asset 1's returns.

To select an entire column, you use a colon for the row index: `returns[:, 2]` gives you the third column, which is the return at day index 2 (the third day) for all assets. The colon means "all rows".

### 2.4.2 Slicing Ranges

Slicing allows you to extract a portion of an array. The syntax is `start:stop:step`, where stop is excluded. This works the same as Python list slicing.

In [18]:
prices = np.array([100, 102, 105, 103, 108, 110, 107])

print("All prices:", prices)
print("First three (0:3):", prices[0:3])
print("From index 2 to end (2:):", prices[2:])
print("Up to index 4 (:4):", prices[:4])
print("Every other element (::2):", prices[::2])
print("Reversed (::-1):", prices[::-1])

All prices: [100 102 105 103 108 110 107]
First three (0:3): [100 102 105]
From index 2 to end (2:): [105 103 108 110 107]
Up to index 4 (:4): [100 102 105 103]
Every other element (::2): [100 105 108 107]
Reversed (::-1): [107 110 108 103 105 102 100]


The key thing to remember is that the stop index is excluded. `prices[0:3]` gives you elements at indices 0, 1, and 2, not 3.

Leaving out the start means "from the beginning". Leaving out the stop means "to the end". The step parameter lets you skip elements or reverse the array.

These same slicing rules apply to 2D arrays. You can slice rows and columns independently.
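
For example, using the same asset-by-day layout as the returns matrix in the previous section, row and column slices can be combined like this:

```python
import numpy as np

# Rows are assets, columns are days
returns = np.array([
    [0.01, 0.02, -0.01, 0.03],
    [0.02, -0.01, 0.02, 0.01],
    [-0.01, 0.03, 0.01, 0.02],
])

first_two_assets = returns[:2]        # rows 0 and 1, all columns
last_two_days = returns[:, -2:]       # all rows, last two columns
block = returns[1:, 1:3]              # rows 1-2, columns 1-2

print(first_two_assets.shape, last_two_days.shape, block.shape)
print(block)
```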

### 2.4.3 Boolean Indexing (Filtering by Condition)

Boolean indexing is one of the most powerful features of NumPy. You can filter an array by passing a condition, and NumPy returns only the elements that satisfy that condition.

In [19]:
returns = np.array([0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.04])

print("All returns:", returns)

# Create a boolean mask
positive_mask = returns > 0
print("\nBoolean mask (returns > 0):", positive_mask)

# Apply the mask to filter
positive_returns = returns[positive_mask]
print("Positive returns only:", positive_returns)

# You can do this in one line
negative_returns = returns[returns < 0]
print("Negative returns only:", negative_returns)

All returns: [ 0.02 -0.01  0.03 -0.02  0.01 -0.03  0.04]

Boolean mask (returns > 0): [ True False  True False  True False  True]
Positive returns only: [0.02 0.03 0.01 0.04]
Negative returns only: [-0.01 -0.02 -0.03]


When you write `returns > 0`, NumPy creates a boolean array of the same shape with True where the condition is met and False where it is not.

When you use this boolean array to index the original array, NumPy returns only the elements where the value is True.

This is how you filter data in NumPy. Want only the days where the market was up? `returns[returns > 0]`. Want only the returns greater than 1%? `returns[returns > 0.01]`. This is cleaner and faster than writing loops.
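
Conditions can also be combined. The sketch below reuses the returns from the cell above; the thresholds are arbitrary.

```python
import numpy as np

returns = np.array([0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.04])

# Combine conditions with & (and) and | (or); each condition needs parentheses
moderate_gains = returns[(returns > 0) & (returns < 0.03)]
large_moves = returns[(returns > 0.025) | (returns < -0.025)]

# A boolean mask also counts: True is treated as 1 when summed
up_days = (returns > 0).sum()

print(moderate_gains, large_moves, up_days)
```

Python's `and` and `or` keywords do not work element by element on arrays, which is why the `&` and `|` operators are used instead.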

### 2.4.4 Fancy Indexing (Selecting Specific Elements)

Fancy indexing lets you select specific elements by passing a list of indices. This is useful when you want to extract elements that are not in a contiguous range.

In [20]:
prices = np.array([100, 102, 105, 103, 108, 110, 107])

print("All prices:", prices)

# Select specific indices
selected_indices = [0, 2, 5]
selected_prices = prices[selected_indices]
print("\nPrices at indices 0, 2, 5:", selected_prices)

# Useful for selecting specific assets from a returns matrix
returns = np.array([
    [0.01, 0.02, -0.01],   # Asset 0
    [0.02, -0.01, 0.02],   # Asset 1
    [-0.01, 0.03, 0.01],   # Asset 2
    [0.03, 0.01, -0.02]    # Asset 3
])

# Select only assets 0 and 3
selected_assets = returns[[0, 3]]
print("\nReturns for assets 0 and 3:")
print(selected_assets)

All prices: [100 102 105 103 108 110 107]

Prices at indices 0, 2, 5: [100 105 110]

Returns for assets 0 and 3:
[[ 0.01  0.02 -0.01]
 [ 0.03  0.01 -0.02]]


Fancy indexing is particularly useful when you need to select a subset of assets from a portfolio. If you have a universe of 100 stocks but only want to analyze 10 of them, you can pass a list of the 10 indices and extract just those rows.

The order of the indices matters. NumPy returns the elements in the order you specify. This allows you to reorder elements by passing the indices in a different order.
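
A brief sketch of reordering, using the same prices as above and a small made-up returns matrix:

```python
import numpy as np

prices = np.array([100, 102, 105, 103, 108, 110, 107])

# Same indices as before, different order: the result follows the order given
reordered = prices[[5, 0, 2]]
print(reordered)

# An index may repeat, and the element is returned each time
repeated = prices[[0, 0, -1]]
print(repeated)

# On a 2D array, a list in the column position selects and reorders columns
returns = np.array([[0.01, 0.02, -0.01],
                    [0.03, 0.01, -0.02]])
swapped = returns[:, [2, 0]]
print(swapped)
```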

---

## 3.1 What is Vectorization?

Vectorization is the single most important concept in NumPy. It is the reason NumPy is fast. It is the reason we use NumPy instead of plain Python for numerical work.

The idea is simple: instead of processing data one element at a time with a loop, you perform the operation on the entire array at once.

### 3.1.1 Loops vs. Vectorized Operations

Let's start with a simple example. Say you have an array of prices and you want to calculate the daily returns. The return for each day is the current price divided by the previous price, minus 1.

Here is how you would do it with a loop.

In [21]:
prices = np.array([100, 102, 105, 103, 108, 110, 107, 112])

# Calculate returns using a loop
returns_loop = []
for i in range(1, len(prices)):
    daily_return = (prices[i] / prices[i-1]) - 1
    returns_loop.append(daily_return)

returns_loop = np.array(returns_loop)
print("Returns (loop method):")
print(returns_loop)

Returns (loop method):
[ 0.02        0.02941176 -0.01904762  0.04854369  0.01851852 -0.02727273
  0.04672897]


This works, but it is slow. For each iteration, Python has to look up the values, perform the division, perform the subtraction, and append to a list. For a small array, you will not notice. For millions of data points, you will wait.

Now let's do the same calculation with vectorization.

In [22]:
# Calculate returns using vectorization
returns_vector = (prices[1:] / prices[:-1]) - 1

print("Returns (vectorized method):")
print(returns_vector)

Returns (vectorized method):
[ 0.02        0.02941176 -0.01904762  0.04854369  0.01851852 -0.02727273
  0.04672897]


One line. No loop. The result is identical.

The syntax `prices[1:]` gives you all prices from index 1 to the end. The syntax `prices[:-1]` gives you all prices from the beginning to the second-to-last. NumPy divides these two arrays element by element, then subtracts 1 from every element.

Notice that NumPy does not have a `.shift()` method like Pandas. When you need to compare an array to a lagged version of itself, you use slicing. This is a common pattern you will see throughout this notebook.

This is vectorization: expressing operations on entire arrays rather than individual elements.
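
The lagged-slice pattern can also be written with `np.diff()`, which computes `prices[1:] - prices[:-1]` for you. The sketch below checks that both forms give the same simple returns for the price array used above.

```python
import numpy as np

prices = np.array([100, 102, 105, 103, 108, 110, 107, 112])

# Slicing: current prices divided by lagged prices
returns_slice = prices[1:] / prices[:-1] - 1

# np.diff gives the price changes; dividing by the lagged prices
# yields the same simple returns
returns_diff = np.diff(prices) / prices[:-1]

print(np.allclose(returns_slice, returns_diff))
```

Both results have one fewer element than `prices`, because the first price has no previous value to compare against.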

### 3.1.2 Why Vectorization is Faster

The speed difference comes from how Python and NumPy work under the hood.

When you write a Python loop, each iteration involves overhead: checking the loop condition, looking up variable names, type checking, and managing memory. This overhead happens for every single element.

When you use a vectorized NumPy operation, the work is handed off to pre-compiled C code that operates on the entire array in one pass. The overhead happens once, not millions of times.

## 3.2 The Speed Test

Let's see the difference with a larger dataset. We will create an array of one million prices and time both methods.

In [23]:
# Create a large array of simulated prices
np.random.seed(42)
large_prices = 100 * np.cumprod(1 + np.random.normal(0.0001, 0.02, 1000000))

print(f"Array size: {len(large_prices):,} elements")
print(f"First few prices: {large_prices[:5]}")

Array size: 1,000,000 elements
First few prices: [101.00342831 100.73422528 102.04918676 105.16787085 104.6858794 ]


We have created one million simulated prices using random returns. Now let's time the loop method versus the vectorized method.

### 3.2.1 Calculating Returns with a Loop

In [24]:
%%timeit -n 1 -r 3

returns_loop = []
for i in range(1, len(large_prices)):
    daily_return = (large_prices[i] / large_prices[i-1]) - 1
    returns_loop.append(daily_return)
returns_loop = np.array(returns_loop)

563 ms ± 107 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


The `%%timeit` magic command runs the code multiple times and reports the average execution time. The `-n 1` flag means run the code once per loop, and `-r 3` means repeat the measurement 3 times.

Note the time. Now let's see the vectorized version.

### 3.2.2 Calculating Returns with Vectorization

In [25]:
%%timeit -n 1 -r 3

returns_vector = (large_prices[1:] / large_prices[:-1]) - 1

2.87 ms ± 105 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)


### 3.2.3 The Difference

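The vectorized version is dramatically faster. In the run above, the loop took about 563 ms and the vectorized version about 2.87 ms, a speedup of roughly 200x. Exact timings depend on your machine and array size, but a difference of 10x to 100x or more is typical.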

This is not a small optimization. When you are backtesting a strategy over decades of minute-level data, or running thousands of Monte Carlo simulations, the difference between loops and vectorization is the difference between waiting seconds and waiting hours.
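
The `%%timeit` magic only works inside Jupyter or IPython. If you want to run a similar comparison in a plain Python script, one option is the standard library's `timeit` module, as in the sketch below. The function names and the smaller array size of 100,000 are choices made for this illustration, and the timings you see will differ from those shown above.

```python
import timeit
import numpy as np

rng = np.random.default_rng(42)
prices = 100 * np.cumprod(1 + rng.normal(0.0001, 0.02, 100_000))

def returns_loop(p):
    """Simple returns computed one element at a time."""
    out = []
    for i in range(1, len(p)):
        out.append(p[i] / p[i - 1] - 1)
    return np.array(out)

def returns_vector(p):
    """Simple returns computed on the whole array at once."""
    return p[1:] / p[:-1] - 1

# Run each version 3 times, one execution per run, and keep the best time
t_loop = min(timeit.repeat(lambda: returns_loop(prices), number=1, repeat=3))
t_vec = min(timeit.repeat(lambda: returns_vector(prices), number=1, repeat=3))

print(f"loop: {t_loop:.4f} s   vectorized: {t_vec:.4f} s")
```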

### 3.2.4 When the Difference Matters

For small datasets, the speed difference is negligible. If you are calculating returns on 252 trading days, both methods finish instantly. Use whichever is clearer to you.

But the difference becomes critical in these situations:

**Large datasets:** Years of tick data or minute-level data can contain millions or billions of rows. Vectorization is not optional.

**Repeated calculations:** Backtesting often requires recalculating signals and returns thousands of times across different parameter combinations. A 50x speedup per calculation compounds quickly.

**Real-time systems:** If your trading system needs to process data and generate signals within milliseconds, every operation matters.

**Monte Carlo simulations:** Running 10,000 simulations of portfolio performance requires processing the same calculations 10,000 times. Vectorization makes this feasible.

The rule is simple: learn to think in arrays, not loops. When you catch yourself writing a for loop to process numerical data, pause and ask if there is a vectorized way to do it. There usually is.
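
As one illustration of replacing a loop, the sketch below labels each day as up (+1) or down (-1), first with a loop and then with `np.where()`. The labeling rule is a made-up example, not a trading signal from this book.

```python
import numpy as np

returns = np.array([0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.04])

# Loop version: label each day +1 (up) or -1 (down)
signal_loop = []
for r in returns:
    signal_loop.append(1 if r > 0 else -1)
signal_loop = np.array(signal_loop)

# Vectorized version: np.where(condition, value_if_true, value_if_false)
signal_vec = np.where(returns > 0, 1, -1)

print(signal_vec)
print(np.array_equal(signal_loop, signal_vec))
```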

---

## 3.3 Common Vectorized Operations

Now that you understand why vectorization matters, let's look at the operations you will use most often.

### 3.3.1 Element-wise Arithmetic

When you perform arithmetic on arrays, NumPy applies the operation to each element. If you operate on two arrays of the same shape, NumPy pairs up the elements and applies the operation to each pair.

In [26]:
prices = np.array([100, 102, 105, 103, 108])
shares = np.array([10, 20, 15, 25, 10])

# Multiply each price by the corresponding number of shares
position_values = prices * shares
print("Prices:", prices)
print("Shares:", shares)
print("Position values:", position_values)

# Add a fixed cost to each price
prices_with_fee = prices + 0.50
print("\nPrices with $0.50 fee:", prices_with_fee)

Prices: [100 102 105 103 108]
Shares: [10 20 15 25 10]
Position values: [1000 2040 1575 2575 1080]

Prices with $0.50 fee: [100.5 102.5 105.5 103.5 108.5]


The multiplication `prices * shares` does not multiply all prices by all shares. It multiplies the first price by the first share count, the second price by the second share count, and so on. This is an element-wise operation.

When you add a single number to an array, NumPy adds that number to every element. This is called broadcasting, which we will cover in detail later.

### 3.3.2 Aggregations: np.sum(), np.mean(), np.std()

Aggregation functions reduce an array to a single value. These are the workhorses of financial calculations.

In [27]:
returns = np.array([0.02, -0.01, 0.03, -0.02, 0.01, 0.04, -0.01, 0.02])

print("Returns:", returns)
print("\nTotal return (sum):", np.sum(returns))
print("Average return (mean):", np.mean(returns))
print("Volatility (std):", np.std(returns))
print("Min return:", np.min(returns))
print("Max return:", np.max(returns))

Returns: [ 0.02 -0.01  0.03 -0.02  0.01  0.04 -0.01  0.02]

Total return (sum): 0.08
Average return (mean): 0.01
Volatility (std): 0.02
Min return: -0.02
Max return: 0.04


These functions work on the entire array by default. The sum of returns gives you the total return (though remember from Notebook 2.1, you should use `.cumprod()` for compounding). The mean gives you the average return. The standard deviation gives you volatility.
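
Two details are worth checking for yourself. The sketch below compares the simple sum with the compounded total, and shows that `np.std()` uses the population formula (`ddof=0`) by default. The Pandas `.std()` method defaults to the sample formula (`ddof=1`), so the two libraries can give slightly different volatility numbers for the same data.

```python
import numpy as np

returns = np.array([0.02, -0.01, 0.03, -0.02, 0.01, 0.04, -0.01, 0.02])

# The simple sum ignores compounding; the product of (1 + r) does not
simple_total = np.sum(returns)
compounded_total = np.prod(1 + returns) - 1

# np.std defaults to ddof=0 (population); ddof=1 gives the sample estimate
population_std = np.std(returns)
sample_std = np.std(returns, ddof=1)

print("Simple total:", simple_total)
print("Compounded total:", compounded_total)
print("Population std (ddof=0):", population_std)
print("Sample std (ddof=1):", sample_std)
```

If you need NumPy and Pandas to agree, pass `ddof` explicitly in both places.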

For 2D arrays, you can specify an axis to aggregate along rows or columns.

In [28]:
# Returns matrix: rows are assets, columns are days
returns_matrix = np.array([
    [0.01, 0.02, -0.01, 0.03],   # Asset 0
    [0.02, -0.01, 0.02, 0.01],   # Asset 1
    [-0.01, 0.03, 0.01, 0.02]    # Asset 2
])

print("Returns matrix:")
print(returns_matrix)

print("\nMean return per asset (axis=1):", np.mean(returns_matrix, axis=1))
print("Mean return per day (axis=0):", np.mean(returns_matrix, axis=0))
print("Overall mean:", np.mean(returns_matrix))

Returns matrix:
[[ 0.01  0.02 -0.01  0.03]
 [ 0.02 -0.01  0.02  0.01]
 [-0.01  0.03  0.01  0.02]]

Mean return per asset (axis=1): [0.0125 0.01   0.0125]
Mean return per day (axis=0): [0.00666667 0.01333333 0.00666667 0.02      ]
Overall mean: 0.011666666666666665


The `axis` parameter controls the direction of the aggregation.

`axis=1` aggregates across columns, giving you one value per row. This is the mean return for each asset across all days.

`axis=0` aggregates across rows, giving you one value per column. This is the mean return for each day across all assets.

With no axis specified, the function aggregates the entire array into a single value.

The axis numbering can be confusing at first. Think of it this way: `axis=0` collapses the rows, `axis=1` collapses the columns.
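
Another way to remember it: the axis you name is the one that disappears from the shape. A short sketch using the same returns matrix:

```python
import numpy as np

# Shape (3, 4): 3 assets (rows), 4 days (columns)
returns_matrix = np.array([
    [0.01, 0.02, -0.01, 0.03],
    [0.02, -0.01, 0.02, 0.01],
    [-0.01, 0.03, 0.01, 0.02],
])

per_day = returns_matrix.mean(axis=0)    # axis 0 (length 3) disappears
per_asset = returns_matrix.mean(axis=1)  # axis 1 (length 4) disappears

# keepdims=True keeps the collapsed axis with length 1
per_asset_column = returns_matrix.mean(axis=1, keepdims=True)

print("Original shape:", returns_matrix.shape)
print("axis=0 result shape:", per_day.shape)
print("axis=1 result shape:", per_asset.shape)
print("axis=1 with keepdims:", per_asset_column.shape)
```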

### 3.3.3 Logical Operations: np.where(), np.any(), np.all()

Logical operations let you make decisions based on array values. These are essential for creating trading signals and filtering data.

In [29]:
returns = np.array([0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.04])

# np.where: if condition is true, use first value; otherwise use second value
signals = np.where(returns > 0, 1, -1)
print("Returns:", returns)
print("Signals (1 if positive, -1 if negative):", signals)

# Create a more nuanced signal
# 1 if return > 1%, -1 if return < -1%, 0 otherwise
signals_nuanced = np.where(returns > 0.01, 1, np.where(returns < -0.01, -1, 0))
print("\nNuanced signals:", signals_nuanced)

Returns: [ 0.02 -0.01  0.03 -0.02  0.01 -0.03  0.04]
Signals (1 if positive, -1 if negative): [ 1 -1  1 -1  1 -1  1]

Nuanced signals: [ 1  0  1 -1  0 -1  1]


The `np.where()` function is like an if-else statement for arrays. The first argument is the condition, the second is the value to use when true, and the third is the value to use when false.

You can nest `np.where()` calls to create more complex logic, as shown in the nuanced signal example. This is how you build trading rules without writing loops.
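
Nesting gets hard to read once you have more than two or three conditions. One alternative is `np.select()`, which takes a list of conditions and a matching list of values. The sketch below reproduces the nuanced signal; the variable names are only for illustration.

```python
import numpy as np

returns = np.array([0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.04])

# Nested np.where, as in the example above
signals_nested = np.where(returns > 0.01, 1, np.where(returns < -0.01, -1, 0))

# np.select: conditions are checked in order, the first match wins
conditions = [returns > 0.01, returns < -0.01]
choices = [1, -1]
signals_select = np.select(conditions, choices, default=0)

print("Nested where:", signals_nested)
print("np.select:   ", signals_select)
```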

In [30]:
returns = np.array([0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.04])

# np.any: True if ANY element meets the condition
has_large_loss = np.any(returns < -0.02)
print("Returns:", returns)
print("Any return below -2%?", has_large_loss)

# np.all: True if ALL elements meet the condition
all_positive = np.all(returns > 0)
print("All returns positive?", all_positive)

# Practical example: check if a portfolio is valid (weights sum to 1)
weights = np.array([0.25, 0.25, 0.30, 0.20])
is_valid = np.isclose(np.sum(weights), 1.0)
print("\nPortfolio weights:", weights)
print("Weights sum to 1?", is_valid)

Returns: [ 0.02 -0.01  0.03 -0.02  0.01 -0.03  0.04]
Any return below -2%? True
All returns positive? False

Portfolio weights: [0.25 0.25 0.3  0.2 ]
Weights sum to 1? True


The `np.any()` function returns True if at least one element satisfies the condition. Use this to check for exceptions or outliers. Did any day have a loss greater than 5%? Did any asset hit its stop loss?

The `np.all()` function returns True only if every element satisfies the condition. Use this to validate data. Are all weights positive? Are all prices above zero?

The `np.isclose()` function checks whether two values are approximately equal, allowing for floating point rounding error. This is safer than using `==` to compare floats, which matters in trading systems that test values against parameter thresholds.
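
A quick sketch of why exact comparison is risky with floats:

```python
import numpy as np

total = 0.1 + 0.2

print(total == 0.3)            # False: the stored value is slightly above 0.3
print(np.isclose(total, 0.3))  # True: the difference is within tolerance
```

The default tolerances of `np.isclose()` suit most return and weight comparisons. You can tighten or loosen them with the `rtol` and `atol` arguments.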

### 3.3.4 Practical Example: Calculating Portfolio Returns

Let's bring these operations together with a practical example. You have a portfolio of three assets with known weights. You want to calculate the portfolio return for each day.

In [31]:
# Asset returns for 5 days (rows are assets, columns are days)
asset_returns = np.array([
    [0.01, 0.02, -0.01, 0.03, 0.01],   # Stock A
    [0.02, -0.01, 0.02, 0.01, -0.02],  # Stock B
    [-0.01, 0.03, 0.01, 0.02, 0.01]    # Stock C
])

# Portfolio weights
weights = np.array([0.5, 0.3, 0.2])

# Calculate portfolio return for each day
# Dot product: weights (3,) with returns (3, 5) gives one value per day, shape (5,)
portfolio_returns = np.dot(weights, asset_returns)

print("Asset returns:")
print(asset_returns)
print("\nWeights:", weights)
print("\nPortfolio returns per day:", portfolio_returns)
print("Total portfolio return:", np.sum(portfolio_returns))

Asset returns:
[[ 0.01  0.02 -0.01  0.03  0.01]
 [ 0.02 -0.01  0.02  0.01 -0.02]
 [-0.01  0.03  0.01  0.02  0.01]]

Weights: [0.5 0.3 0.2]

Portfolio returns per day: [0.009 0.013 0.003 0.022 0.001]
Total portfolio return: 0.048


The `np.dot()` function performs the dot product, which is exactly what we need for portfolio return calculation. Each day's portfolio return is the weighted sum of the individual asset returns.

For day 1: (0.5 × 0.01) + (0.3 × 0.02) + (0.2 × -0.01) = 0.009

This single line replaces what would otherwise be nested loops. The calculation is clear, concise, and fast.
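
To make that concrete, here is a sketch of the nested loops that `np.dot()` replaces, checked against both `np.dot()` and the equivalent `@` operator.

```python
import numpy as np

asset_returns = np.array([
    [0.01, 0.02, -0.01, 0.03, 0.01],   # Stock A
    [0.02, -0.01, 0.02, 0.01, -0.02],  # Stock B
    [-0.01, 0.03, 0.01, 0.02, 0.01],   # Stock C
])
weights = np.array([0.5, 0.3, 0.2])

# Loop version: accumulate weight * return for every asset on every day
n_assets, n_days = asset_returns.shape
loop_result = np.zeros(n_days)
for day in range(n_days):
    for asset in range(n_assets):
        loop_result[day] += weights[asset] * asset_returns[asset, day]

# Vectorized versions
dot_result = np.dot(weights, asset_returns)
matmul_result = weights @ asset_returns

print("Loop matches np.dot:", np.allclose(loop_result, dot_result))
print("np.dot matches @:", np.allclose(dot_result, matmul_result))
```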

---

### 3.3.5 The Same Calculation in Pandas

Let's do the same portfolio return calculation using Pandas. This will show you the trade-off between the two approaches.

In [32]:
import pandas as pd

# Create a DataFrame with the same data
dates = pd.date_range('2024-07-01', periods=5)
df_returns = pd.DataFrame({
    'Stock_A': [0.01, 0.02, -0.01, 0.03, 0.01],
    'Stock_B': [0.02, -0.01, 0.02, 0.01, -0.02],
    'Stock_C': [-0.01, 0.03, 0.01, 0.02, 0.01]
}, index=dates)

# Weights as a Series
weights_series = pd.Series({'Stock_A': 0.5, 'Stock_B': 0.3, 'Stock_C': 0.2})

# Calculate portfolio returns
portfolio_returns_pandas = df_returns.dot(weights_series)

print("Returns DataFrame:")
print(df_returns)
print("\nWeights:")
print(weights_series)
print("\nPortfolio returns:")
print(portfolio_returns_pandas)

Returns DataFrame:
            Stock_A  Stock_B  Stock_C
2024-07-01     0.01     0.02    -0.01
2024-07-02     0.02    -0.01     0.03
2024-07-03    -0.01     0.02     0.01
2024-07-04     0.03     0.01     0.02
2024-07-05     0.01    -0.02     0.01

Weights:
Stock_A    0.5
Stock_B    0.3
Stock_C    0.2
dtype: float64

Portfolio returns:
2024-07-01    0.009
2024-07-02    0.013
2024-07-03    0.003
2024-07-04    0.022
2024-07-05    0.001
Freq: D, dtype: float64


The Pandas version uses the same `.dot()` method, but notice what you get in return. The result is a Series with the date index preserved. You can immediately see which return belongs to which day.

With the NumPy version, you get a raw array: `[0.009, 0.013, 0.003, 0.022, 0.001]`. You have to remember which position corresponds to which date.

This is the core trade-off. Pandas keeps your labels. NumPy gives you speed.
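
You are not locked into one side. If you compute in NumPy but still want the labels, one option is to wrap the result back into a Series. A minimal sketch, reusing the same five dates:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2024-07-01', periods=5)
asset_returns = np.array([
    [0.01, 0.02, -0.01, 0.03, 0.01],
    [0.02, -0.01, 0.02, 0.01, -0.02],
    [-0.01, 0.03, 0.01, 0.02, 0.01],
])
weights = np.array([0.5, 0.3, 0.2])

# Fast calculation on raw arrays
raw_result = np.dot(weights, asset_returns)

# Re-attach the date labels afterwards
labeled_result = pd.Series(raw_result, index=dates, name='Portfolio')
print(labeled_result)
```

This only works if you are certain the array order still matches the dates. NumPy will not check it for you.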

### 3.3.6 When to Use Pandas vs NumPy

**Use Pandas when:**

- You are building and debugging your strategy. The labels make it easier to verify your calculations.
- You need to combine portfolio returns with other time series data. Everything stays aligned by date.
- You are working with a single backtest and speed is not critical.
- You want to export results to CSV or visualize them. Pandas integrates smoothly with these workflows.

**Use NumPy when:**

- You are running many simulations or optimizations. The speed difference compounds.
- You are passing data to machine learning libraries. They expect arrays.
- You are doing portfolio optimization. The matrix math is cleaner in NumPy.
- You have already cleaned and validated your data and no longer need the safety of labels.

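In practice, most quants work in Pandas during the research and development phase and move to NumPy arrays when speed becomes the priority. The timing comparison below shows how large the gap can be.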

In [34]:
# Create larger datasets for timing comparison
np.random.seed(42)
n_assets = 100
n_days = 10000

# NumPy array
large_returns_array = np.random.normal(0.0005, 0.02, (n_assets, n_days))
large_weights_array = np.random.random(n_assets)
large_weights_array = large_weights_array / large_weights_array.sum()

# Pandas DataFrame
dates = pd.date_range('2000-01-01', periods=n_days)
columns = [f'Asset_{i}' for i in range(n_assets)]
large_returns_df = pd.DataFrame(large_returns_array.T, index=dates, columns=columns)
large_weights_series = pd.Series(large_weights_array, index=columns)

print(f"Dataset size: {n_assets} assets × {n_days:,} days")
print(f"Total calculations per backtest: {n_assets * n_days:,}")

Dataset size: 100 assets × 10,000 days
Total calculations per backtest: 1,000,000


We now have a realistically sized dataset: 100 assets over 10,000 trading days (about 40 years of daily data). Let's time both approaches.

In [35]:
%%timeit -n 100 -r 3

portfolio_returns_np = np.dot(large_weights_array, large_returns_array)

408 µs ± 63.5 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)


Note the time for the NumPy version. Now let's time the Pandas version.

In [36]:
%%timeit -n 100 -r 3

portfolio_returns_pd = large_returns_df.dot(large_weights_series)

The slowest run took 4.26 times longer than the fastest. This could mean that an intermediate result is being cached.
2.07 ms ± 1.09 ms per loop (mean ± std. dev. of 3 runs, 100 loops each)


The results tell the story:

- NumPy: 408 µs (microseconds) per loop
- Pandas: 2.07 ms (milliseconds) per loop

NumPy is about 5 times faster for this calculation. The Pandas version has overhead from maintaining the index alignment and returning a labeled Series.

For a single backtest, 2 milliseconds is nothing. But consider what happens when you need to run this calculation repeatedly.

If you are optimizing portfolio weights across 10,000 different weight combinations, that 1.6 millisecond difference becomes 16 seconds of extra waiting. If you are running a Monte Carlo simulation with 100,000 paths, it becomes nearly 3 minutes.

This is why production systems and optimization routines typically work with NumPy arrays, even though the research was done in Pandas.
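
With arrays you can often go one step further and drop the repetition altogether. The sketch below stacks many weight vectors into a matrix and evaluates all of them in one matrix multiplication. The sizes and random inputs are arbitrary assumptions chosen to keep the example quick.

```python
import numpy as np

rng = np.random.default_rng(42)
n_assets, n_days, n_combos = 10, 500, 200

# Simulated returns: rows are assets, columns are days
returns = rng.normal(0.0005, 0.02, (n_assets, n_days))

# One row per candidate portfolio, each row normalized to sum to 1
weight_sets = rng.random((n_combos, n_assets))
weight_sets /= weight_sets.sum(axis=1, keepdims=True)

# (n_combos, n_assets) @ (n_assets, n_days) -> (n_combos, n_days)
all_portfolio_returns = weight_sets @ returns

print("Result shape:", all_portfolio_returns.shape)
```

Each row of the result is the daily return series for one weight combination, with no Python loop over the combinations.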

---

## 4.1 The Shape Lab: Preparing for Machine Learning

This section will save you hours of frustration. When you start working with machine learning libraries like scikit-learn, you will encounter shape errors. They are cryptic. They are annoying. And they are almost always caused by the same few problems.

The Shape Lab will teach you to diagnose and fix these problems before they derail your workflow.

### 4.1.1 The Bridge: From Pandas to NumPy

Machine learning libraries expect NumPy arrays, not DataFrames. Before you can train a model, you need to extract your data from Pandas and convert it to the right format.

There are two ways to do this: the `.values` attribute and the `.to_numpy()` method. For ordinary numeric columns like the ones here, they return the same array.

In [37]:
import pandas as pd
import numpy as np

# Create a simple DataFrame
df = pd.DataFrame({
    'Returns': [0.01, 0.02, -0.01, 0.03, 0.02],
    'Volume': [1000, 1500, 1200, 1800, 1600],
    'Volatility': [0.15, 0.18, 0.22, 0.19, 0.17]
}, index=pd.date_range('2024-07-01', periods=5))

print("Original DataFrame:")
print(df)

# Extract using .values
array_values = df['Returns'].values
print("\nUsing .values:", array_values)
print("Type:", type(array_values))

# Extract using .to_numpy()
array_tonumpy = df['Returns'].to_numpy()
print("\nUsing .to_numpy():", array_tonumpy)
print("Type:", type(array_tonumpy))

Original DataFrame:
            Returns  Volume  Volatility
2024-07-01     0.01    1000        0.15
2024-07-02     0.02    1500        0.18
2024-07-03    -0.01    1200        0.22
2024-07-04     0.03    1800        0.19
2024-07-05     0.02    1600        0.17

Using .values: [ 0.01  0.02 -0.01  0.03  0.02]
Type: <class 'numpy.ndarray'>

Using .to_numpy(): [ 0.01  0.02 -0.01  0.03  0.02]
Type: <class 'numpy.ndarray'>


Both methods return a NumPy array. The `.to_numpy()` method is the modern, recommended approach. The `.values` attribute is older but still widely used.

Notice that when you extract a single column, the labels disappear. You get a raw array of numbers. The date index is gone.
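
One more thing changes when you convert a whole DataFrame: a NumPy array holds a single data type. The sketch below uses a small made-up DataFrame to show that an integer column stays integer on its own, but is converted to float when extracted alongside a float column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Returns': [0.01, 0.02, -0.01],
    'Volume': [1000, 1500, 1200],
}, index=pd.date_range('2024-07-01', periods=3))

# Keep the labels in a variable if you will need them later
dates = df.index

volume_only = df['Volume'].to_numpy()  # integer column stays integer
both_columns = df.to_numpy()           # mixed columns share one float dtype

print("Volume dtype:", volume_only.dtype)
print("Combined dtype:", both_columns.dtype)
print("Combined shape:", both_columns.shape)
```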

### 4.1.2 When You Need to Cross the Bridge

You will need to extract arrays from DataFrames in these situations:

- Feeding features into scikit-learn models (LinearRegression, RandomForest, etc.)
- Performing matrix operations for portfolio optimization
- Passing data to other numerical libraries like SciPy or TensorFlow
- When you need maximum speed and have finished your data cleaning

Let's see what this looks like with a typical machine learning setup.

In [38]:
# Typical ML setup: features and target
df = pd.DataFrame({
    'Momentum': [0.05, 0.02, -0.03, 0.04, 0.01],
    'Volatility': [0.15, 0.18, 0.22, 0.19, 0.17],
    'Volume_Change': [0.10, -0.05, 0.15, 0.08, -0.02],
    'Next_Return': [0.02, -0.01, 0.03, 0.01, 0.02]
}, index=pd.date_range('2024-07-01', periods=5))

# Features (X) and target (y)
feature_columns = ['Momentum', 'Volatility', 'Volume_Change']
target_column = 'Next_Return'

X = df[feature_columns].to_numpy()
y = df[target_column].to_numpy()

print("Features (X):")
print(X)
print("\nShape of X:", X.shape)

print("\nTarget (y):")
print(y)
print("Shape of y:", y.shape)

Features (X):
[[ 0.05  0.15  0.1 ]
 [ 0.02  0.18 -0.05]
 [-0.03  0.22  0.15]
 [ 0.04  0.19  0.08]
 [ 0.01  0.17 -0.02]]

Shape of X: (5, 3)

Target (y):
[ 0.02 -0.01  0.03  0.01  0.02]
Shape of y: (5,)


This is the standard pattern. Your features (X) are a 2D array where each row is an observation and each column is a feature. Your target (y) is a 1D array of the values you are trying to predict.

Look at the shapes:
- X has shape `(5, 3)`: 5 observations, 3 features
- y has shape `(5,)`: 5 observations

This is where shape problems begin. A `(5,)` shape is fine for the target, but the same 1D shape causes errors when you pass a single column as the features.

## 4.2 Understanding Shape

Shape is the source of most errors when working with NumPy and machine learning. Understanding the difference between 1D and 2D arrays will save you countless hours of debugging.

### 4.2.1 1D Arrays: Shape (n,)

A 1D array is a simple sequence of values. When you extract a single column from a DataFrame, you get a 1D array. When you create an array from a flat list, you get a 1D array.

In [39]:
# Creating 1D arrays
array_1d = np.array([1, 2, 3, 4, 5])

print("1D Array:", array_1d)
print("Shape:", array_1d.shape)
print("Number of dimensions:", array_1d.ndim)

1D Array: [1 2 3 4 5]
Shape: (5,)
Number of dimensions: 1


The shape `(5,)` tells you this is a 1D array with 5 elements. The trailing comma indicates a tuple with a single element. This is not the same as `(5, 1)`, which would be a 2D array. This distinction matters enormously.

### 4.2.2 2D Arrays: Shape (n, m)

A 2D array has rows and columns. When you extract multiple columns from a DataFrame, you get a 2D array. When you create an array from nested lists, you get a 2D array.

In [40]:
# Creating 2D arrays
array_2d = np.array([[1, 2, 3], [4, 5, 6]])

print("2D Array:")
print(array_2d)
print("\nShape:", array_2d.shape)
print("Number of dimensions:", array_2d.ndim)

# A 2D array with only one column
single_column_2d = np.array([[1], [2], [3], [4], [5]])

print("\nSingle column 2D array:")
print(single_column_2d)
print("Shape:", single_column_2d.shape)
print("Number of dimensions:", single_column_2d.ndim)

2D Array:
[[1 2 3]
 [4 5 6]]

Shape: (2, 3)
Number of dimensions: 2

Single column 2D array:
[[1]
 [2]
 [3]
 [4]
 [5]]
Shape: (5, 1)
Number of dimensions: 2


The first array has shape `(2, 3)`: 2 rows, 3 columns.

The second array has shape `(5, 1)`: 5 rows, 1 column. This is still a 2D array, even though there is only one column. It has two dimensions, and `ndim` confirms this.

Compare this to our 1D array with shape `(5,)`. Both contain 5 elements. Both could represent the same data. But they are not interchangeable. Many functions treat them differently.
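
Arithmetic is one place where the difference can bite without raising any error. In the sketch below, subtracting a `(5, 1)` array from a `(5,)` array does not give 5 differences. NumPy broadcasts the two shapes against each other and returns a `(5, 5)` grid.

```python
import numpy as np

y_true = np.array([0.02, -0.01, 0.03, 0.01, 0.02])  # shape (5,)
y_pred = y_true.reshape(-1, 1)                       # same values, shape (5, 1)

# Shapes (5,) and (5, 1) broadcast to (5, 5): every pair is subtracted
errors_wrong = y_true - y_pred

# Matching the shapes first gives the 5 element-wise differences
errors_right = y_true - y_pred.ravel()

print("Mismatched shapes give:", errors_wrong.shape)
print("Matched shapes give:", errors_right.shape)
```

Any mean or sum taken over `errors_wrong` would be computed over 25 values instead of 5, and nothing would warn you.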

### 4.2.3 Why Machine Learning Models Expect 2D Input

Scikit-learn and most machine learning libraries expect your features (X) to be a 2D array. Each row is an observation. Each column is a feature. Even if you have only one feature, the library still expects a 2D array with shape `(n_samples, 1)`, not a 1D array with shape `(n_samples,)`.

This is a design decision that makes the libraries consistent. Every model works the same way regardless of whether you have 1 feature or 100 features. But it means you need to reshape your data when you have a single feature.

In [41]:
# Simulating a common scenario: predicting returns from a single feature
momentum = np.array([0.05, 0.02, -0.03, 0.04, 0.01])
next_returns = np.array([0.02, -0.01, 0.03, 0.01, 0.02])

print("Momentum (our single feature):", momentum)
print("Shape:", momentum.shape)

print("\nNext returns (our target):", next_returns)
print("Shape:", next_returns.shape)

print("\nBoth are 1D arrays. If we try to use momentum as X in scikit-learn, we will get an error.")

Momentum (our single feature): [ 0.05  0.02 -0.03  0.04  0.01]
Shape: (5,)

Next returns (our target): [ 0.02 -0.01  0.03  0.01  0.02]
Shape: (5,)

Both are 1D arrays. If we try to use momentum as X in scikit-learn, we will get an error.


We have one feature (momentum) and one target (next returns). Both are 1D arrays. The target can stay as a 1D array. But the feature must be reshaped to 2D before scikit-learn will accept it.

Let's see what happens when we try to use this data without reshaping.

---

## 4.3 The Shape Trap

This section shows you the exact error you will encounter and how to fix it. We will deliberately trigger the error so you recognize it when it happens in your own work.

### 4.3.1 Why (n,) is Not the Same as (n, 1)

Both `(n,)` and `(n, 1)` contain n elements. But they are fundamentally different structures. A 1D array has no concept of rows or columns. A 2D array with shape `(n, 1)` has n rows and 1 column.

In [42]:
array_1d = np.array([1, 2, 3, 4, 5])
array_2d = np.array([[1], [2], [3], [4], [5]])

print("1D array:")
print(array_1d)
print("Shape:", array_1d.shape)

print("\n2D array:")
print(array_2d)
print("Shape:", array_2d.shape)

print("\nAre they equal?", np.array_equal(array_1d, array_2d))

1D array:
[1 2 3 4 5]
Shape: (5,)

2D array:
[[1]
 [2]
 [3]
 [4]
 [5]]
Shape: (5, 1)

Are they equal? False


They contain the same values but they are not equal. NumPy considers them different objects because their shapes do not match.

This is why machine learning libraries reject 1D arrays for features. They are checking shape, not just values.

### 4.3.2 The Sklearn Error You Will See

Let's trigger the error. We will try to fit a simple linear regression using our 1D momentum array as the feature.

In [43]:
from sklearn.linear_model import LinearRegression

# Our data (both 1D)
momentum = np.array([0.05, 0.02, -0.03, 0.04, 0.01])
next_returns = np.array([0.02, -0.01, 0.03, 0.01, 0.02])

# Try to fit a model
model = LinearRegression()

try:
    model.fit(momentum, next_returns)
except ValueError as e:
    print("Error!")
    print(e)

Error!
Expected 2D array, got 1D array instead:
array=[ 0.05  0.02 -0.03  0.04  0.01].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.


There it is. This is the error message you will see over and over until you internalize the shape rules.

The message is actually helpful. It tells you exactly what went wrong: scikit-learn expected a 2D array but got a 1D array. It even tells you how to fix it: use `.reshape(-1, 1)` for a single feature.

Let's break down what that means.

### 4.3.3 Diagnosing Shape Mismatches

When you encounter a shape error, the first step is always to check the shape of your data. Print the shape before passing data to any function. This simple habit will save you hours.

In [44]:
# Always check shapes before fitting a model
print("Feature shape:", momentum.shape)
print("Feature ndim:", momentum.ndim)

print("\nTarget shape:", next_returns.shape)
print("Target ndim:", next_returns.ndim)

print("\nProblem: Feature is 1D. Sklearn expects 2D for features.")

Feature shape: (5,)
Feature ndim: 1

Target shape: (5,)
Target ndim: 1

Problem: Feature is 1D. Sklearn expects 2D for features.


The diagnosis is clear. Our feature array has shape `(5,)` with 1 dimension. Scikit-learn expects shape `(5, 1)` with 2 dimensions.

The target array is fine. Scikit-learn accepts 1D arrays for the target variable. It is only the feature matrix that must be 2D.

---

## 4.4 The Fix: Reshaping Arrays

Now that you understand the problem, the fix is straightforward. NumPy provides several methods to change the shape of an array without changing its data.

### 4.4.1 .reshape(): Changing Dimensions

The `.reshape()` method returns an array with the same data but a different shape (a view of the original when possible, rather than a copy). The total number of elements must stay the same.

In [45]:
array_1d = np.array([1, 2, 3, 4, 5, 6])

print("Original 1D array:", array_1d)
print("Shape:", array_1d.shape)

# Reshape to 2D: 2 rows, 3 columns
array_2x3 = array_1d.reshape(2, 3)
print("\nReshaped to (2, 3):")
print(array_2x3)
print("Shape:", array_2x3.shape)

# Reshape to 2D: 3 rows, 2 columns
array_3x2 = array_1d.reshape(3, 2)
print("\nReshaped to (3, 2):")
print(array_3x2)
print("Shape:", array_3x2.shape)

Original 1D array: [1 2 3 4 5 6]
Shape: (6,)

Reshaped to (2, 3):
[[1 2 3]
 [4 5 6]]
Shape: (2, 3)

Reshaped to (3, 2):
[[1 2]
 [3 4]
 [5 6]]
Shape: (3, 2)


The original array has 6 elements. We can reshape it to any dimensions that multiply to 6: (2, 3), (3, 2), (6, 1), (1, 6). NumPy fills the new shape row by row, keeping the elements in their original order.

If you try to reshape to dimensions that do not multiply to the original size, NumPy raises an error.
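
A short sketch that triggers this error on purpose, so you recognize it:

```python
import numpy as np

array_1d = np.array([1, 2, 3, 4, 5, 6])

try:
    # 4 x 2 = 8 slots, but the array has only 6 elements
    array_1d.reshape(4, 2)
    reshape_failed = False
except ValueError as e:
    reshape_failed = True
    print("Error!")
    print(e)
```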

### 4.4.2 Understanding .reshape(-1, 1): The Magic of -1

The -1 in reshape is a placeholder that tells NumPy to figure out the correct size automatically. When you write `.reshape(-1, 1)`, you are saying: give me a 2D array with 1 column, and calculate however many rows are needed.

This is exactly what scikit-learn asked us to do in the error message.

In [46]:
momentum = np.array([0.05, 0.02, -0.03, 0.04, 0.01])

print("Original 1D array:", momentum)
print("Shape:", momentum.shape)

# Reshape to 2D column vector
momentum_2d = momentum.reshape(-1, 1)

print("\nReshaped with .reshape(-1, 1):")
print(momentum_2d)
print("Shape:", momentum_2d.shape)

Original 1D array: [ 0.05  0.02 -0.03  0.04  0.01]
Shape: (5,)

Reshaped with .reshape(-1, 1):
[[ 0.05]
 [ 0.02]
 [-0.03]
 [ 0.04]
 [ 0.01]]
Shape: (5, 1)


The array went from shape `(5,)` to shape `(5, 1)`. NumPy calculated that -1 should be 5 because we have 5 elements and 1 column.

The -1 is useful because it works regardless of how many elements your array has. You do not need to know the length in advance. This is important when your data size varies.

Similarly, `.reshape(1, -1)` gives you a single row with as many columns as needed. This is used when you have a single sample with multiple features.
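
The two reshapes are easy to mix up, so here is a side-by-side sketch. The three values stand in for one observation with three hypothetical features.

```python
import numpy as np

# One observation: three feature values (illustrative numbers)
observation = np.array([0.03, 0.16, 1400.0])

as_row = observation.reshape(1, -1)     # 1 sample, 3 features
as_column = observation.reshape(-1, 1)  # 3 samples, 1 feature

print("reshape(1, -1):", as_row.shape)
print("reshape(-1, 1):", as_column.shape)
```

Both calls succeed, so NumPy cannot tell you which one you meant. You have to know whether your values are one sample or one feature.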

### 4.4.3 .flatten() and .ravel(): Going Back to 1D

Sometimes you need to go the other direction, from 2D back to 1D. The `.flatten()` and `.ravel()` methods both do this.

In [47]:
array_2d = np.array([[1, 2, 3], [4, 5, 6]])

print("2D array:")
print(array_2d)
print("Shape:", array_2d.shape)

# Using .flatten()
flat = array_2d.flatten()
print("\nUsing .flatten():", flat)
print("Shape:", flat.shape)

# Using .ravel()
raveled = array_2d.ravel()
print("\nUsing .ravel():", raveled)
print("Shape:", raveled.shape)

2D array:
[[1 2 3]
 [4 5 6]]
Shape: (2, 3)

Using .flatten(): [1 2 3 4 5 6]
Shape: (6,)

Using .ravel(): [1 2 3 4 5 6]
Shape: (6,)


Both methods produce the same result: a 1D array with all elements in row-major order (first row, then second row, and so on).

The difference is technical. The `.flatten()` method always returns a copy of the data. The `.ravel()` method returns a view when possible, which is faster and uses less memory. For most purposes, they are interchangeable.
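
The copy versus view distinction only matters if you modify the result. A small sketch:

```python
import numpy as np

array_2d = np.array([[1, 2, 3], [4, 5, 6]])

flat = array_2d.flatten()   # independent copy
raveled = array_2d.ravel()  # view of the same memory for this contiguous array

flat[0] = 99      # does not touch array_2d
raveled[1] = 77   # changes array_2d[0, 1] as well

print(array_2d)
print("ravel shares memory:", np.shares_memory(array_2d, raveled))
print("flatten shares memory:", np.shares_memory(array_2d, flat))
```

If you plan to change the flattened values and want the original left alone, `.flatten()` is the safer choice.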

When would you need to flatten?
- When a function expects a 1D array and you have 2D data.
- When you want to iterate over all elements.
- When you need to pass data to a function that does not understand matrix structure.

### 4.4.4 Practical Example: Preparing Features for Sklearn

Let's fix the error we saw earlier. We will reshape our momentum feature and successfully fit a linear regression model.

In [48]:
from sklearn.linear_model import LinearRegression

# Our data
momentum = np.array([0.05, 0.02, -0.03, 0.04, 0.01])
next_returns = np.array([0.02, -0.01, 0.03, 0.01, 0.02])

# Reshape momentum to 2D
X = momentum.reshape(-1, 1)
y = next_returns

print("Before reshape:", momentum.shape)
print("After reshape:", X.shape)
print("Target shape:", y.shape)

# Now fit the model
model = LinearRegression()
model.fit(X, y)

print("\nModel fitted successfully!")
print("Coefficient:", model.coef_[0])
print("Intercept:", model.intercept_)

Before reshape: (5,)
After reshape: (5, 1)
Target shape: (5,)

Model fitted successfully!
Coefficient: -0.19587628865979378
Intercept: 0.01752577319587629


One line of reshaping and the error is gone. The model fits successfully.

This pattern will become second nature:

1. Extract your features from a DataFrame
2. Check the shape
3. If 1D, reshape to (-1, 1)
4. Pass to the model

When you have multiple features, extracting multiple columns from a DataFrame already gives you a 2D array. The reshape is only needed for single-feature cases.
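
If your single feature lives in a DataFrame, you can also avoid the reshape by selecting with a list of column names. A sketch with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Momentum': [0.05, 0.02, -0.03, 0.04, 0.01],
    'Next_Return': [0.02, -0.01, 0.03, 0.01, 0.02],
})

# Single brackets select a Series, which converts to a 1D array
X_1d = df['Momentum'].to_numpy()

# Double brackets select a one-column DataFrame, which converts to 2D
X_2d = df[['Momentum']].to_numpy()

print("Single brackets:", X_1d.shape)
print("Double brackets:", X_2d.shape)
```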

---

## 4.5 The Error Message Lab

This section catalogs the most common shape-related errors you will encounter. When you see one of these in your own work, you will know exactly what went wrong and how to fix it.

### 4.5.1 Common Sklearn Shape Errors

We have already seen the most common error: passing a 1D array when sklearn expects 2D. Here are the others you will encounter.

**Error 1: Inconsistent number of samples**

This happens when your features and target have different numbers of rows.

In [50]:
# Features have 5 samples, target has 4
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0.1, 0.2, 0.3, 0.4])

print("X shape:", X.shape)
print("y shape:", y.shape)

model = LinearRegression()

try:
    model.fit(X, y)
except ValueError as e:
    print("\nError!")
    print(e)

X shape: (5, 1)
y shape: (4,)

Error!
Found input variables with inconsistent numbers of samples: [5, 4]


The error message tells you the shapes do not match: X has 5 samples, y has 4. This usually happens when you accidentally dropped rows from one array but not the other, or when you sliced your data incorrectly.

The fix: check where your data preparation went wrong. Make sure any filtering or cleaning operations are applied to both X and y together.
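
One way to keep X and y in step is to clean the DataFrame first and split into features and target only afterwards. The sketch below uses made-up data with missing values in different rows of each column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Momentum': [0.05, np.nan, -0.03, 0.04, 0.01],
    'Next_Return': [0.02, -0.01, 0.03, 0.01, np.nan],
})

# Drop any row where the feature OR the target is missing
clean = df.dropna(subset=['Momentum', 'Next_Return'])

# Split only after cleaning, so both arrays come from the same rows
X = clean[['Momentum']].to_numpy()
y = clean['Next_Return'].to_numpy()

print("X shape:", X.shape)
print("y shape:", y.shape)
```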

**Error 2: Single sample reshape confusion**

This happens when you try to predict on a single observation and get the reshape wrong.

In [52]:
# Train a model
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([0.1, 0.2, 0.3, 0.4, 0.5])

model = LinearRegression()
model.fit(X_train, y_train)

# Try to predict on a single value
new_observation = np.array([6])

try:
    prediction = model.predict(new_observation)
except ValueError as e:
    print("Error!")
    print(e)

Error!
Expected 2D array, got 1D array instead:
array=[6].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.


The same error again: expected 2D, got 1D. When predicting on new data, you must also reshape to 2D.

In [53]:
# The fix: reshape the single observation
new_observation = np.array([6]).reshape(-1, 1)

print("Reshaped observation shape:", new_observation.shape)

prediction = model.predict(new_observation)
print("Prediction:", prediction[0])

Reshaped observation shape: (1, 1)
Prediction: 0.6


The observation now has shape `(1, 1)`: one sample with one feature. The model accepts it and returns a prediction.

Remember: any data you pass to `.predict()` must have the same number of features as the data you trained on, and it must be 2D.

### 4.5.2 Reading the Error Message

Sklearn error messages are verbose but informative. Learning to read them quickly will speed up your debugging.

Here is a strategy:

1. Look for words like "shape", "dimension", "samples", or "features" in the error message
2. Find the numbers in the message; these are the actual shapes or counts
3. Compare what the function expected versus what it received
4. Reshape accordingly

Let's see one more example with multiple features.

In [54]:
# Train a model with 3 features
X_train = np.array([
    [0.05, 0.15, 1000],
    [0.02, 0.18, 1500],
    [-0.03, 0.22, 1200],
    [0.04, 0.19, 1800],
    [0.01, 0.17, 1600]
])
y_train = np.array([0.02, -0.01, 0.03, 0.01, 0.02])

model = LinearRegression()
model.fit(X_train, y_train)

print("Model trained with X shape:", X_train.shape)

# Try to predict with only 2 features
new_observation = np.array([[0.03, 0.16]])

try:
    prediction = model.predict(new_observation)
except ValueError as e:
    print("\nError!")
    print(e)

Model trained with X shape: (5, 3)

Error!
X has 2 features, but LinearRegression is expecting 3 features as input.


The model was trained on 3 features but we tried to predict with only 2. The error message tells us exactly this: it expected 3 features but got 2.

This happens when you forget to include all the features, or when your training and prediction data come from different sources with different columns.

The fix: make sure your prediction data has the same features, in the same order, as your training data.
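
A simple habit that helps: keep the feature names in one list and use that list for both training and prediction. In the sketch below, the column names and values are illustrative assumptions.

```python
import pandas as pd

# Column order used at training time, kept in one place
feature_columns = ['Momentum', 'Volatility', 'Volume']

# Hypothetical new data that arrives with the columns in a different order
new_df = pd.DataFrame({
    'Volume': [1400],
    'Momentum': [0.03],
    'Volatility': [0.16],
})

# Selecting by the saved list restores the training order
# and raises a KeyError if a feature is missing
X_new = new_df[feature_columns].to_numpy()

print("Shape:", X_new.shape)
print("Values:", X_new)
```

Without the column selection, the array would still have 3 features and the model would accept it, but each value would be fed into the wrong coefficient.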

### 4.5.3 Fixing the Error Step by Step

When you encounter a shape error, follow this checklist:

**Step 1: Print the shapes**

Before doing anything else, print the shape of every array involved.

**Step 2: Identify the mismatch**

Compare what you have versus what the function expects. Is it 1D vs 2D? Different number of samples? Different number of features?

**Step 3: Trace back to the source**

Where did the problematic array come from? Did you extract a single column? Did you filter rows? Did you forget to include all features?

**Step 4: Apply the fix**

- 1D to 2D column: `.reshape(-1, 1)`
- 1D to 2D row: `.reshape(1, -1)`
- 2D to 1D: `.flatten()` or `.ravel()`
- Mismatched samples: check your data preparation
- Mismatched features: ensure all features are included

In [55]:
# A complete debugging example
def prepare_and_predict(features, target, new_data):
    """
    A function that prints shapes at each step.
    This is good practice during development.
    """
    print("Step 1: Check input shapes")
    print(f"  Features shape: {features.shape}")
    print(f"  Target shape: {target.shape}")
    print(f"  New data shape: {new_data.shape}")

    # Step 1 (continued): convert 1D inputs to 2D columns if needed
    if features.ndim == 1:
        print("\n  Fixing: Features is 1D, reshaping to 2D")
        features = features.reshape(-1, 1)
        print(f"  New features shape: {features.shape}")

    if new_data.ndim == 1:
        print("\n  Fixing: New data is 1D, reshaping to 2D")
        new_data = new_data.reshape(-1, 1)
        print(f"  New data shape: {new_data.shape}")

    # Step 2: Verify compatibility
    print("\nStep 2: Verify compatibility")
    print(f"  Features has {features.shape[1]} feature(s)")
    print(f"  New data has {new_data.shape[1]} feature(s)")

    if features.shape[1] != new_data.shape[1]:
        raise ValueError(f"Feature mismatch: trained on {features.shape[1]}, got {new_data.shape[1]}")

    # Step 3: Fit and predict
    print("\nStep 3: Fit and predict")
    model = LinearRegression()
    model.fit(features, target)
    prediction = model.predict(new_data)

    return prediction

# Test it
momentum = np.array([0.05, 0.02, -0.03, 0.04, 0.01])
returns = np.array([0.02, -0.01, 0.03, 0.01, 0.02])
new_momentum = np.array([0.03])

result = prepare_and_predict(momentum, returns, new_momentum)
print(f"\nPrediction: {result[0]:.4f}")

Step 1: Check input shapes
  Features shape: (5,)
  Target shape: (5,)
  New data shape: (1,)

  Fixing: Features is 1D, reshaping to 2D
  New features shape: (5, 1)

  Fixing: New data is 1D, reshaping to 2D
  New data shape: (1, 1)

Step 2: Verify compatibility
  Features has 1 feature(s)
  New data has 1 feature(s)

Step 3: Fit and predict

Prediction: 0.0116


This function demonstrates defensive programming. By printing shapes at each step, you can see exactly what is happening. During development, this verbosity is valuable. Once your code is working, you can remove the print statements.

The key lesson: shape errors are not mysterious. They follow predictable patterns and have straightforward fixes. Build the habit of checking shapes before passing data to any function.

---

## 5.1 What is Broadcasting?

Broadcasting is one of NumPy's most powerful features. It allows you to perform operations on arrays of different shapes without explicitly copying data. This saves memory and makes your code cleaner.

The basic idea: when you operate on two arrays of different shapes, NumPy automatically stretches the smaller array to match the larger one.

### 5.1.1 The Concept: Stretching Arrays to Match Shapes

You have already seen broadcasting without knowing it. When you add a single number to an array, NumPy broadcasts that number across every element.

In [56]:
prices = np.array([100, 102, 105, 103, 108])

# Add a fixed fee to every price
prices_with_fee = prices + 0.50

print("Original prices:", prices)
print("After adding $0.50:", prices_with_fee)

Original prices: [100 102 105 103 108]
After adding $0.50: [100.5 102.5 105.5 103.5 108.5]


NumPy did not require you to create an array of five 0.50 values. It automatically broadcast the single value across the entire array.

This works for any arithmetic operation: addition, subtraction, multiplication, division. It also works with more complex shape combinations.

### 5.1.2 The Rules of Broadcasting

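Broadcasting follows specific rules. NumPy lines up the two shapes at their rightmost dimension and compares them one dimension at a time, moving left. If one shape has fewer dimensions, it is padded with 1s on the left. For each pair of dimensions: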

1. If the dimensions are equal, they are compatible.
2. If one of the dimensions is 1, it is stretched to match the other.
3. If neither condition is met, NumPy raises an error.

Let's see these rules in action.

In [57]:
# Rule 1: Equal dimensions are compatible
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print("Equal shapes (3,) and (3,):")
print(f"  {a} + {b} = {a + b}")

# Rule 2: Dimension of 1 is stretched
# Array shape (3,) broadcast with scalar (treated as shape (1,))
c = np.array([1, 2, 3])
d = 10
print("\nShape (3,) with scalar:")
print(f"  {c} * {d} = {c * d}")

# 2D example: (3, 3) with (3,)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
row = np.array([10, 20, 30])

print("\nShape (3, 3) with shape (3,):")
print("Matrix:")
print(matrix)
print(f"Row: {row}")
print("Result:")
print(matrix + row)

Equal shapes (3,) and (3,):
  [1 2 3] + [10 20 30] = [11 22 33]

Shape (3,) with scalar:
  [1 2 3] * 10 = [10 20 30]

Shape (3, 3) with shape (3,):
Matrix:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Row: [10 20 30]
Result:
[[11 22 33]
 [14 25 36]
 [17 28 39]]


In the 2D example, the row array with shape `(3,)` was broadcast across each row of the matrix. NumPy stretched the 1D array to match the 2D shape, adding the same values to every row.

This is extremely useful. You can subtract a vector of means from every row, scale each column by its own weight, or apply any element-wise operation without writing loops.
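
If you want to check how two shapes will combine without building any arrays, `np.broadcast_shapes` applies the same rules to plain tuples. The sketch below assumes NumPy 1.20 or later, where this function is available.

```python
import numpy as np

# (3, 3) with (3,): the 1D shape is treated as (1, 3), then stretched
shape_a = np.broadcast_shapes((3, 3), (3,))
print(shape_a)

# (4, 1) with (5,): (5,) is treated as (1, 5); both stretch to (4, 5)
shape_b = np.broadcast_shapes((4, 1), (5,))
print(shape_b)

# (3,) with (4,): not equal and neither is 1, so this raises ValueError
try:
    np.broadcast_shapes((3,), (4,))
except ValueError as e:
    print("Incompatible:", e)
```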

### 5.1.3 When Broadcasting Fails

Broadcasting fails when the shapes are incompatible. This happens when a pair of aligned dimensions is not equal and neither of the two is 1.

In [59]:
# Incompatible shapes: (3,) and (4,)
a = np.array([1, 2, 3])
b = np.array([10, 20, 30, 40])

print("Shape of a:", a.shape)
print("Shape of b:", b.shape)

try:
    result = a + b
except ValueError as e:
    print("\nError!")
    print(e)

Shape of a: (3,)
Shape of b: (4,)

Error!
operands could not be broadcast together with shapes (3,) (4,) 


NumPy cannot broadcast shapes `(3,)` and `(4,)` because the dimensions are not equal and neither is 1. There is no sensible way to stretch 3 elements to match 4 elements.

When you see a broadcasting error, check the shapes of both arrays. Usually the fix is to reshape one of them so the dimensions align properly.
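
A common case where a reshape is the fix: subtracting one value per row from a 2D array. The sketch below uses a small made-up returns matrix. The per-row means have shape `(2,)`, which does not line up with `(2, 3)` from the right, so the means are turned into a column of shape `(2, 1)` first.

```python
import numpy as np

# 2 assets (rows) over 3 days (columns); illustrative values
returns = np.array([
    [0.01, 0.02, -0.01],
    [0.02, -0.01, 0.02],
])

row_means = returns.mean(axis=1)   # shape (2,)

# returns - row_means would raise: (2, 3) and (2,) do not align on the right
try:
    returns - row_means
except ValueError as e:
    print("Error:", e)

# Adding a trailing axis gives shape (2, 1), which broadcasts across columns
demeaned = returns - row_means[:, np.newaxis]
print(demeaned.shape)
```

Passing `keepdims=True` to `.mean()` produces the `(2, 1)` shape directly, which avoids the separate reshape.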

---

## 5.2 Practical Examples

Broadcasting is not just a technical feature. It simplifies real financial calculations. Here are the patterns you will use most often.

### 5.2.1 Subtracting the Risk-Free Rate from Returns

When calculating excess returns or the Sharpe ratio, you need to subtract the risk-free rate from every return in your series. Broadcasting makes this trivial.

In [60]:
# Monthly returns for a portfolio
monthly_returns = np.array([0.02, -0.01, 0.03, 0.015, -0.005, 0.025])

# Risk-free rate (monthly)
risk_free_rate = 0.004  # Approximately 5% annual

# Calculate excess returns
excess_returns = monthly_returns - risk_free_rate

print("Monthly returns:", monthly_returns)
print("Risk-free rate:", risk_free_rate)
print("Excess returns:", excess_returns)
print(f"\nMean excess return: {np.mean(excess_returns):.4f}")
print(f"Sharpe ratio (annualized): {np.mean(excess_returns) / np.std(excess_returns) * np.sqrt(12):.2f}")

Monthly returns: [ 0.02  -0.01   0.03   0.015 -0.005  0.025]
Risk-free rate: 0.004
Excess returns: [ 0.016 -0.014  0.026  0.011 -0.009  0.021]

Mean excess return: 0.0085
Sharpe ratio (annualized): 1.97


The single risk-free rate value was broadcast across the entire returns array. Each element had the same value subtracted. No loop required.

This is the foundation of risk-adjusted performance measurement. The Sharpe and Sortino ratios both start from excess returns over the risk-free rate (or another target rate), and the information ratio applies the same subtraction against a benchmark instead.

Note: a Sharpe ratio of 1.97 is incredibly good! Almost too good. With only six monthly observations, the estimate is far too noisy to rely on. We will spend a lot of time on this in future chapters.

### 5.2.2 Normalizing Returns (Mean Zero, Unit Variance)

Many machine learning algorithms perform better when features are normalized. A common approach is to subtract the mean and divide by the standard deviation, creating a series with mean zero and standard deviation of one.

In [62]:
# Raw returns for an asset
returns = np.array([0.02, -0.01, 0.03, 0.015, -0.005, 0.025, 0.01, -0.02])

# Calculate mean and standard deviation
mean_return = np.mean(returns)
std_return = np.std(returns)

# Normalize using broadcasting
normalized_returns = (returns - mean_return) / std_return

print("Original returns:", returns)
print(f"Mean: {mean_return:.4f}")
print(f"Std:  {std_return:.4f}")
print("\nNormalized returns:", normalized_returns)
print(f"Normalized mean: {np.mean(normalized_returns):.10f}")
print(f"Normalized std:  {np.std(normalized_returns):.4f}")

Original returns: [ 0.02  -0.01   0.03   0.015 -0.005  0.025  0.01  -0.02 ]
Mean: 0.0081
Std:  0.0168

Normalized returns: [ 0.70858043 -1.0815175   1.30527975  0.41023078 -0.78316785  1.00693009
  0.11188112 -1.67821682]
Normalized mean: -0.0000000000
Normalized std:  1.0000


Two broadcasting operations in one line. First, the mean is subtracted from every element. Then, every element is divided by the standard deviation.

The result has a mean of essentially zero (the tiny number is floating point error) and a standard deviation of 1. This is called standardization or z-score normalization.

When preparing features for machine learning, you will often normalize each feature column this way. It prevents features with large values from dominating features with small values.
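
For a 2D feature matrix, the same idea applies column by column. The sketch below uses made-up values with two features on very different scales; `axis=0` computes one mean and one standard deviation per column, and broadcasting applies them to every row.

```python
import numpy as np

# 5 observations (rows) of 2 features (columns); illustrative values
X = np.array([
    [0.05, 1000.0],
    [0.02, 1500.0],
    [-0.03, 1200.0],
    [0.04, 1800.0],
    [0.01, 1600.0],
])

col_mean = X.mean(axis=0)   # shape (2,): one mean per column
col_std = X.std(axis=0)     # shape (2,): one std per column

# (5, 2) with (2,) broadcasts across the rows for both operations
X_scaled = (X - col_mean) / col_std

print("Column means after scaling:", np.round(X_scaled.mean(axis=0), 10))
print("Column stds after scaling: ", X_scaled.std(axis=0))
```

In a modeling workflow, the means and standard deviations are normally computed on the training data only and then reused for new data, so that both are scaled the same way.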

### 5.2.3 Calculating Excess Returns vs. a Benchmark

When evaluating a portfolio, you often want to compare its returns against a benchmark like the S&P 500. Broadcasting lets you do this across multiple assets at once.

In [63]:
# Returns for 4 stocks over 5 days (rows are stocks, columns are days)
stock_returns = np.array([
    [0.01, 0.02, -0.01, 0.03, 0.01],   # Stock A
    [0.02, -0.01, 0.02, 0.01, -0.02],  # Stock B
    [-0.01, 0.03, 0.01, 0.02, 0.01],   # Stock C
    [0.015, 0.01, -0.02, 0.025, 0.005] # Stock D
])

# Benchmark returns (S&P 500) for the same 5 days
benchmark_returns = np.array([0.005, 0.01, -0.005, 0.015, 0.002])

print("Stock returns shape:", stock_returns.shape)
print("Benchmark returns shape:", benchmark_returns.shape)

# Calculate excess returns for all stocks
excess_vs_benchmark = stock_returns - benchmark_returns

print("\nStock returns:")
print(stock_returns)
print("\nBenchmark returns:", benchmark_returns)
print("\nExcess returns vs benchmark:")
print(excess_vs_benchmark)

Stock returns shape: (4, 5)
Benchmark returns shape: (5,)

Stock returns:
[[ 0.01   0.02  -0.01   0.03   0.01 ]
 [ 0.02  -0.01   0.02   0.01  -0.02 ]
 [-0.01   0.03   0.01   0.02   0.01 ]
 [ 0.015  0.01  -0.02   0.025  0.005]]

Benchmark returns: [ 0.005  0.01  -0.005  0.015  0.002]

Excess returns vs benchmark:
[[ 0.005  0.01  -0.005  0.015  0.008]
 [ 0.015 -0.02   0.025 -0.005 -0.022]
 [-0.015  0.02   0.015  0.005  0.008]
 [ 0.01   0.    -0.015  0.01   0.003]]


The benchmark array with shape `(5,)` was broadcast across each row of the stock returns matrix with shape `(4, 5)`. Each stock's daily return had the corresponding benchmark return subtracted.

This gives you the daily excess return for every stock in a single operation. Positive values mean the stock outperformed the benchmark that day. Negative values mean it underperformed.

From here, you could calculate tracking error, information ratio, or identify which stocks consistently beat the benchmark.
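
As a rough sketch of those next steps, the block below computes a per-stock tracking error (the standard deviation of excess returns) and information ratio (mean excess return divided by tracking error) from the same data. The use of `ddof=1` and the lack of annualization are assumptions made here for simplicity, and five days is far too little data for the numbers to mean anything.

```python
import numpy as np

stock_returns = np.array([
    [0.01, 0.02, -0.01, 0.03, 0.01],    # Stock A
    [0.02, -0.01, 0.02, 0.01, -0.02],   # Stock B
    [-0.01, 0.03, 0.01, 0.02, 0.01],    # Stock C
    [0.015, 0.01, -0.02, 0.025, 0.005], # Stock D
])
benchmark_returns = np.array([0.005, 0.01, -0.005, 0.015, 0.002])

excess = stock_returns - benchmark_returns   # (4, 5) with (5,)

# axis=1 collapses the days, leaving one number per stock
mean_excess = excess.mean(axis=1)
tracking_error = excess.std(axis=1, ddof=1)       # sample std (assumed convention)
information_ratio = mean_excess / tracking_error  # daily, not annualized

# Count the days on which each stock beat the benchmark
days_ahead = (excess > 0).sum(axis=1)

print("Mean excess return:", np.round(mean_excess, 4))
print("Tracking error:    ", np.round(tracking_error, 4))
print("Information ratio: ", np.round(information_ratio, 2))
print("Days ahead of benchmark:", days_ahead)
```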

---

## 6.1 Array Math for Finance

This section covers the matrix operations and statistical functions that appear constantly in quantitative finance. Portfolio optimization, risk analysis, and factor models all rely on these tools.

### 6.1.1 np.dot(): Dot Product (Portfolio Weights x Returns)

The dot product is the foundation of portfolio return calculation. We covered this earlier, but let's examine it more closely.

When you multiply portfolio weights by asset returns, you are computing a weighted sum. This is exactly what the dot product does.

In [64]:
# Portfolio weights (must sum to 1)
weights = np.array([0.4, 0.35, 0.25])

# Daily returns for 3 assets
returns = np.array([0.02, -0.01, 0.03])

# Portfolio return = weighted sum of individual returns
portfolio_return = np.dot(weights, returns)

print("Weights:", weights)
print("Returns:", returns)
print(f"\nPortfolio return: {portfolio_return:.4f}")

# Manual calculation to verify
manual = (0.4 * 0.02) + (0.35 * -0.01) + (0.25 * 0.03)
print(f"Manual calculation: {manual:.4f}")

Weights: [0.4  0.35 0.25]
Returns: [ 0.02 -0.01  0.03]

Portfolio return: 0.0120
Manual calculation: 0.0120


The dot product multiplies corresponding elements and sums the results. For portfolio returns, this gives you the weighted average return.

This operation scales beautifully. Whether you have 3 assets or 3,000, the syntax is the same. NumPy handles the computation efficiently regardless of size.

### 6.1.2 The @ Operator: Matrix Multiplication

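Python 3.5 introduced the `@` operator for matrix multiplication. For 1D and 2D arrays it does the same thing as `np.dot()` but is cleaner to read, especially for complex expressions. (The two behave differently for arrays with three or more dimensions, which we do not use here.)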

In [65]:
# Same calculation using @ operator
weights = np.array([0.4, 0.35, 0.25])
returns = np.array([0.02, -0.01, 0.03])

portfolio_return = weights @ returns

print("Using @ operator:", portfolio_return)

# Where @ really shines: matrix multiplication
# Returns matrix: 3 assets over 5 days
returns_matrix = np.array([
    [0.01, 0.02, -0.01, 0.03, 0.01],
    [0.02, -0.01, 0.02, 0.01, -0.02],
    [-0.01, 0.03, 0.01, 0.02, 0.01]
])

# Portfolio returns for all 5 days in one operation
portfolio_returns = weights @ returns_matrix

print("\nWeights shape:", weights.shape)
print("Returns matrix shape:", returns_matrix.shape)
print("Portfolio returns for each day:", portfolio_returns)

Using @ operator: 0.012

Weights shape: (3,)
Returns matrix shape: (3, 5)
Portfolio returns for each day: [ 0.0085  0.012   0.0055  0.0205 -0.0005]


The `@` operator makes the code more readable, especially when you chain multiple matrix operations together. Compare:

`result = np.dot(np.dot(A, B), C)` versus `result = A @ B @ C`

For portfolio calculations, use whichever you find clearer. Both produce identical results. In this book, you will see both depending on what makes the code easier to understand.
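
A quick check of that equivalence for 2D arrays, using random matrices. The seed and shapes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))

nested = np.dot(np.dot(A, B), C)
chained = A @ B @ C

print("Result shape:", nested.shape)
print("Same values: ", np.allclose(nested, chained))
```

The inner dimensions must match at each step: `(2, 3) @ (3, 4)` gives `(2, 4)`, and `(2, 4) @ (4, 2)` gives `(2, 2)`.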

### 6.1.3 np.transpose(): Transposing Arrays

Transposing swaps rows and columns. A matrix with shape `(3, 5)` becomes shape `(5, 3)`. This is essential when your data is organized one way but a calculation requires it the other way.

In [66]:
# Returns matrix: 3 assets (rows) over 5 days (columns)
returns = np.array([
    [0.01, 0.02, -0.01, 0.03, 0.01],
    [0.02, -0.01, 0.02, 0.01, -0.02],
    [-0.01, 0.03, 0.01, 0.02, 0.01]
])

print("Original shape:", returns.shape)
print("Original (assets as rows):")
print(returns)

# Transpose: now days are rows, assets are columns
returns_T = np.transpose(returns)

print("\nTransposed shape:", returns_T.shape)
print("Transposed (days as rows):")
print(returns_T)

# Shorthand: .T attribute
print("\nUsing .T attribute:", returns.T.shape)

Original shape: (3, 5)
Original (assets as rows):
[[ 0.01  0.02 -0.01  0.03  0.01]
 [ 0.02 -0.01  0.02  0.01 -0.02]
 [-0.01  0.03  0.01  0.02  0.01]]

Transposed shape: (5, 3)
Transposed (days as rows):
[[ 0.01  0.02 -0.01]
 [ 0.02 -0.01  0.03]
 [-0.01  0.02  0.01]
 [ 0.03  0.01  0.02]
 [ 0.01 -0.02  0.01]]

Using .T attribute: (5, 3)


NumPy provides two ways to transpose an array: the `np.transpose()` function and the `.T` attribute. The `.T` attribute is simply a shortcut that does the same thing with less typing.

For 2D arrays, they are identical. Use `.T` when you want concise code. Use `np.transpose()` when you want to be explicit or when working with arrays of three or more dimensions, where `np.transpose()` allows you to specify the exact axis order.

For the 1D and 2D arrays you will encounter in most financial work, `.T` is the common choice.

You will need to transpose when:

- Your data has assets as rows but a function expects assets as columns
- You are computing a covariance matrix and need to align the dimensions
- You are converting between different data organization conventions

The key is knowing how your data is organized and how the function you are calling expects it to be organized.
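
Two details are worth checking in code. First, transposing a 1D array does nothing, so you need `.reshape()` if you want an explicit column vector. Second, `np.cov()` treats each row as a variable by default, so data stored with days as rows must be transposed (or passed with `rowvar=False`). The sketch below uses the transposed returns from the example above.

```python
import numpy as np

weights = np.array([0.4, 0.35, 0.25])
print("weights.T shape:", weights.T.shape)               # unchanged for 1D
print("column shape:   ", weights.reshape(-1, 1).shape)  # explicit 2D column

# Days as rows, assets as columns (5 days, 3 assets)
returns_by_day = np.array([
    [0.01, 0.02, -0.01],
    [0.02, -0.01, 0.03],
    [-0.01, 0.02, 0.01],
    [0.03, 0.01, 0.02],
    [0.01, -0.02, 0.01],
])

# Transpose so that each row is one asset, as np.cov expects by default
cov_assets = np.cov(returns_by_day.T)
cov_assets_alt = np.cov(returns_by_day, rowvar=False)  # same result, no transpose

print("Covariance shape:", cov_assets.shape)
```

Without the transpose, `np.cov(returns_by_day)` would return a 5 x 5 matrix of covariances between days, which is rarely what you want.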

---

## 6.2 Statistical Functions

NumPy provides efficient implementations of the statistical calculations you need for risk analysis and portfolio management. These functions operate on entire arrays without loops.

### 6.2.1 np.mean(), np.std(), np.var()

These are the building blocks of performance and risk measurement. Mean return tells you the average performance. Standard deviation and variance measure volatility.

In [67]:
# Monthly returns for a portfolio
returns = np.array([0.02, -0.01, 0.03, 0.015, -0.02, 0.025, 0.01, -0.005, 0.018, 0.022, -0.008, 0.012])

print("Monthly returns:", returns)
print(f"\nMean monthly return: {np.mean(returns):.4f}")
print(f"Standard deviation: {np.std(returns):.4f}")
print(f"Variance: {np.var(returns):.6f}")

# Annualize the statistics (assuming monthly data)
annual_return = np.mean(returns) * 12
annual_volatility = np.std(returns) * np.sqrt(12)

print(f"\nAnnualized return: {annual_return:.2%}")
print(f"Annualized volatility: {annual_volatility:.2%}")

Monthly returns: [ 0.02  -0.01   0.03   0.015 -0.02   0.025  0.01  -0.005  0.018  0.022
 -0.008  0.012]

Mean monthly return: 0.0091
Standard deviation: 0.0153
Variance: 0.000233

Annualized return: 10.90%
Annualized volatility: 5.29%


The annualization formulas are standard in finance. Multiplying the mean monthly return by 12 is a simple approximation that ignores compounding. Volatility scales with the square root of time when returns are independent from one period to the next, so you multiply by the square root of 12.

Note that `np.std()` by default calculates the population standard deviation (dividing by n). If you want the sample standard deviation (dividing by n-1), use `np.std(returns, ddof=1)`. For large datasets, the difference is negligible. For small samples, it matters.
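
To see the size of the difference on the twelve monthly returns used above, here is a short comparison. The two results differ by a factor of sqrt(n / (n - 1)).

```python
import numpy as np

returns = np.array([0.02, -0.01, 0.03, 0.015, -0.02, 0.025,
                    0.01, -0.005, 0.018, 0.022, -0.008, 0.012])
n = len(returns)

pop_std = np.std(returns)             # divides by n
sample_std = np.std(returns, ddof=1)  # divides by n - 1

print(f"Population std: {pop_std:.4f}")
print(f"Sample std:     {sample_std:.4f}")
print("Ratio matches sqrt(n / (n - 1)):",
      np.isclose(sample_std / pop_std, np.sqrt(n / (n - 1))))
```

Pandas uses `ddof=1` by default in `Series.std()`, so the same data can give a slightly different volatility in Pandas than in NumPy unless you set `ddof` explicitly.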

### 6.2.2 np.corrcoef(): Correlation Matrix

Correlation measures how two assets move together. A correlation of 1 means they move in perfect lockstep. A correlation of -1 means they move in exactly opposite directions. A correlation near 0 means there is no linear relationship between their movements.

The correlation matrix shows the correlation between every pair of assets in your portfolio. It is essential for diversification analysis and risk management.

In [68]:
# Returns for 4 assets over 10 periods
np.random.seed(42)
stock_a = np.random.normal(0.01, 0.02, 10)
stock_b = stock_a * 0.8 + np.random.normal(0, 0.01, 10)  # Correlated with A
stock_c = np.random.normal(0.005, 0.025, 10)              # Independent
stock_d = -stock_a * 0.5 + np.random.normal(0, 0.015, 10) # Negatively correlated with A

returns = np.array([stock_a, stock_b, stock_c, stock_d])

print("Returns shape:", returns.shape)

# Calculate correlation matrix
correlation_matrix = np.corrcoef(returns)

print("\nCorrelation Matrix:")
print(np.round(correlation_matrix, 2))

Returns shape: (4, 10)

Correlation Matrix:
[[ 1.    0.82 -0.32 -0.5 ]
 [ 0.82  1.    0.09 -0.53]
 [-0.32  0.09  1.   -0.02]
 [-0.5  -0.53 -0.02  1.  ]]


The correlation matrix is always square, with 1s on the diagonal (each asset is perfectly correlated with itself).

Look at the results:

- Stock A and Stock B have high positive correlation (we constructed B from A)
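- Stock A and Stock D have negative correlation (we constructed D from a negative multiple of A plus noise)
- Stock C was generated independently, so its correlations should be close to zero. The -0.32 with Stock A shows how noisy a correlation estimate can be with only 10 observations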

This matrix tells you which assets provide diversification benefits. Adding highly correlated assets does little to reduce risk. Adding uncorrelated or negatively correlated assets reduces portfolio volatility.

### 6.2.3 np.cov(): Covariance Matrix

Covariance is related to correlation but includes the magnitude of the movements, not just the direction. The covariance matrix is essential for portfolio optimization because it captures both the volatility of each asset and how assets move together.

The relationship is: correlation = covariance / (std_a × std_b)

In [69]:
# Using the same returns from the correlation example
covariance_matrix = np.cov(returns)

print("Covariance Matrix:")
print(np.round(covariance_matrix, 6))

print("\nDiagonal elements (variances):")
for i, var in enumerate(np.diag(covariance_matrix)):
    print(f"  Asset {i}: variance = {var:.6f}, std = {np.sqrt(var):.4f}")

Covariance Matrix:
[[ 2.09e-04  1.55e-04 -9.30e-05 -1.42e-04]
 [ 1.55e-04  1.72e-04  2.40e-05 -1.35e-04]
 [-9.30e-05  2.40e-05  4.14e-04 -8.00e-06]
 [-1.42e-04 -1.35e-04 -8.00e-06  3.85e-04]]

Diagonal elements (variances):
  Asset 0: variance = 0.000209, std = 0.0145
  Asset 1: variance = 0.000172, std = 0.0131
  Asset 2: variance = 0.000414, std = 0.0203
  Asset 3: variance = 0.000385, std = 0.0196


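The diagonal of the covariance matrix contains the variance of each asset. The off-diagonal elements contain the covariances between pairs of assets. Unlike `np.std()` and `np.var()`, `np.cov()` divides by n-1 by default, so these are sample variances.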
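
The relationship between covariance and correlation can be verified directly. This sketch uses a small made-up returns matrix; `np.outer` builds the matrix of std_a × std_b for every pair of assets, and dividing the covariance matrix by it should reproduce `np.corrcoef()`.

```python
import numpy as np

# Illustrative returns: 3 assets (rows) over 5 periods (columns)
returns = np.array([
    [0.01, 0.02, -0.01, 0.03, 0.01],
    [0.02, -0.01, 0.02, 0.01, -0.02],
    [-0.01, 0.03, 0.01, 0.02, 0.01],
])

cov = np.cov(returns)              # divides by n - 1 by default
stds = np.sqrt(np.diag(cov))       # per-asset std taken from the diagonal

# Element [a, b] of the outer product is std_a * std_b
corr_from_cov = cov / np.outer(stds, stds)

print("Matches np.corrcoef:", np.allclose(corr_from_cov, np.corrcoef(returns)))

# To reproduce the diagonal with np.std, ddof=1 is required
print("Matches np.std(ddof=1):", np.allclose(stds, np.std(returns, axis=1, ddof=1)))
```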

The covariance matrix is the key input for mean-variance optimization. When you calculate portfolio variance, you use the formula:

`portfolio_variance = weights.T @ covariance_matrix @ weights`

We will see this calculation shortly.

### 6.2.4 np.percentile(): Quantiles for Risk Analysis

A percentile is the value below which a given percentage of observations fall. In risk management, percentiles are used to calculate Value at Risk (VaR), identify outliers, and understand the distribution of returns.

In [70]:
# Simulated daily returns for a year
np.random.seed(42)
daily_returns = np.random.normal(0.0005, 0.015, 252)

print(f"Number of observations: {len(daily_returns)}")
print(f"Mean return: {np.mean(daily_returns):.4f}")
print(f"Std deviation: {np.std(daily_returns):.4f}")

# Calculate percentiles
percentile_5 = np.percentile(daily_returns, 5)
percentile_25 = np.percentile(daily_returns, 25)
percentile_50 = np.percentile(daily_returns, 50)  # Median
percentile_75 = np.percentile(daily_returns, 75)
percentile_95 = np.percentile(daily_returns, 95)

print(f"\n5th percentile:  {percentile_5:.4f}")
print(f"25th percentile: {percentile_25:.4f}")
print(f"50th percentile (median): {percentile_50:.4f}")
print(f"75th percentile: {percentile_75:.4f}")
print(f"95th percentile: {percentile_95:.4f}")

# Value at Risk (VaR) at 95% confidence
var_95 = np.percentile(daily_returns, 5)
print(f"\n95% VaR (daily): {var_95:.4f}")
print(f"Interpretation: On 95% of days, losses will not exceed {-var_95:.2%}")

Number of observations: 252
Mean return: 0.0004
Std deviation: 0.0145

5th percentile:  -0.0219
25th percentile: -0.0098
50th percentile (median): 0.0014
75th percentile: 0.0094
95th percentile: 0.0238

95% VaR (daily): -0.0219
Interpretation: On 95% of days, losses will not exceed 2.19%


The 5th percentile tells you the return that 5% of observations fall below. In risk terms, this is the 95% Value at Risk: the loss you would expect to exceed only 5% of the time.

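Note that VaR is typically expressed as a positive number representing a loss. Here the 5th percentile is -0.0219, so the 95% VaR is 2.19%. Based on this sample, you would not expect to lose more than 2.19% of your portfolio value on 95% of days. This is an estimate from the data (simulated, in this case), not a guarantee about future losses.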

Percentiles are also useful for identifying extreme returns. Returns below the 1st percentile or above the 99th percentile are often considered outliers worth investigating.
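
A minimal sketch of both ideas: reporting VaR as a positive number, and flagging returns outside the 1st to 99th percentile range with a boolean mask. It regenerates the simulated returns with the same seed as above; the 1% and 99% cutoffs are the illustrative thresholds from the text, not a fixed rule.

```python
import numpy as np

np.random.seed(42)
daily_returns = np.random.normal(0.0005, 0.015, 252)  # simulated, as above

# Flip the sign so VaR is reported as a positive loss
var_95 = -np.percentile(daily_returns, 5)

# np.percentile accepts a list and returns one value per requested percentile
low, high = np.percentile(daily_returns, [1, 99])

# Boolean mask: True where a return is in either tail
outlier_mask = (daily_returns < low) | (daily_returns > high)

print(f"95% VaR (daily): {var_95:.2%}")
print(f"1st percentile: {low:.4f}, 99th percentile: {high:.4f}")
print("Days flagged:", outlier_mask.sum())
print("Flagged returns:", np.round(daily_returns[outlier_mask], 4))
```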

---

## 6.3 Financial Calculations

Let's bring together everything we have learned to perform the core calculations of portfolio management. These examples demonstrate why NumPy is essential for quantitative finance.

### 6.3.1 Portfolio Return: Weights Dot Returns

We have seen this calculation several times. Here it is in its complete form, starting from raw price data.

In [71]:
# Simulated closing prices for 3 assets over 10 days
np.random.seed(42)
prices = np.array([
    100 * np.cumprod(1 + np.random.normal(0.001, 0.02, 10)),  # Asset A
    50 * np.cumprod(1 + np.random.normal(0.0005, 0.015, 10)), # Asset B
    75 * np.cumprod(1 + np.random.normal(0.0008, 0.018, 10))  # Asset C
])

# Calculate returns (each row is an asset)
returns = prices[:, 1:] / prices[:, :-1] - 1

print("Prices shape:", prices.shape)
print("Returns shape:", returns.shape)

# Portfolio weights
weights = np.array([0.5, 0.3, 0.2])

# Calculate portfolio return for each day
portfolio_returns = weights @ returns

print("\nPortfolio weights:", weights)
print("Daily portfolio returns:", np.round(portfolio_returns, 4))
print(f"\nTotal portfolio return: {np.sum(portfolio_returns):.4f}")
print(f"Mean daily return: {np.mean(portfolio_returns):.4f}")

Prices shape: (3, 10)
Returns shape: (3, 9)

Portfolio weights: [0.5 0.3 0.2]
Daily portfolio returns: [-0.0035  0.0086  0.0023 -0.0113 -0.0037  0.0079  0.0113 -0.0101 -0.0012]

Total portfolio return: 0.0004
Mean daily return: 0.0000


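This is the complete workflow: prices to returns to portfolio returns. The matrix multiplication handles all assets and all days in a single operation. One caveat: the "total" printed above is the sum of daily simple returns, which only approximates the cumulative return because it ignores compounding.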
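
Summing daily simple returns and compounding them give different answers. The sketch below compares the two, using the rounded daily portfolio returns from the output above as input. It assumes the portfolio is rebalanced to the target weights every day, which is what applying fixed weights to each day's returns implies.

```python
import numpy as np

# Rounded daily portfolio returns copied from the output above
portfolio_returns = np.array([-0.0035, 0.0086, 0.0023, -0.0113, -0.0037,
                              0.0079, 0.0113, -0.0101, -0.0012])

simple_sum = np.sum(portfolio_returns)            # arithmetic approximation
compounded = np.prod(1 + portfolio_returns) - 1   # cumulative return

# Value of $1 invested at the start, after each day
growth_path = np.cumprod(1 + portfolio_returns)

print(f"Sum of daily returns: {simple_sum:.6f}")
print(f"Compounded return:    {compounded:.6f}")
print("Growth of $1:", np.round(growth_path, 4))
```

Over nine days the gap is small, but it grows with the length of the period and the volatility of the returns.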

Note the slicing used to calculate returns: `prices[:, 1:]` gives all rows from column 1 onward, and `prices[:, :-1]` gives all rows except the last column. This is the NumPy equivalent of the `.shift()` pattern we use in Pandas.

### 6.3.2 Portfolio Variance: w.T @ Cov @ w

Portfolio variance measures the total risk of your portfolio. It accounts for both the volatility of individual assets and how they move together. The formula is:

portfolio_variance = w.T @ Σ @ w

Where w is the weight vector and Σ (sigma) is the covariance matrix.

In [72]:
# Using the returns from the previous example
# Calculate the covariance matrix
cov_matrix = np.cov(returns)

print("Covariance Matrix:")
print(np.round(cov_matrix, 6))

# Portfolio weights
weights = np.array([0.5, 0.3, 0.2])

# Portfolio variance: w.T @ Cov @ w
portfolio_variance = weights.T @ cov_matrix @ weights
portfolio_volatility = np.sqrt(portfolio_variance)

print(f"\nPortfolio weights: {weights}")
print(f"Portfolio variance: {portfolio_variance:.6f}")
print(f"Portfolio volatility (daily): {portfolio_volatility:.4f}")
print(f"Portfolio volatility (annualized): {portfolio_volatility * np.sqrt(252):.2%}")

Covariance Matrix:
[[ 2.35e-04 -2.10e-05 -8.00e-05]
 [-2.10e-05  1.41e-04  9.80e-05]
 [-8.00e-05  9.80e-05  1.13e-04]]

Portfolio weights: [0.5 0.3 0.2]
Portfolio variance: 0.000066
Portfolio volatility (daily): 0.0081
Portfolio volatility (annualized): 12.86%


This single line, `weights.T @ cov_matrix @ weights`, is the heart of mean-variance optimization. It tells you the risk of any portfolio given its weights.

Notice that portfolio volatility is not simply the weighted average of individual volatilities. Because assets are not perfectly correlated, diversification reduces overall risk. This is why the covariance matrix matters: it captures the diversification benefit.
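
To make the diversification benefit concrete, the sketch below compares the weighted average of the individual volatilities with the portfolio volatility. The covariance matrix is typed in from the rounded output above, so the results are approximate.

```python
import numpy as np

# Covariance matrix, rounded, copied from the output above
cov_matrix = np.array([
    [2.35e-04, -2.10e-05, -8.00e-05],
    [-2.10e-05, 1.41e-04, 9.80e-05],
    [-8.00e-05, 9.80e-05, 1.13e-04],
])
weights = np.array([0.5, 0.3, 0.2])

# Individual daily volatilities come from the diagonal
asset_vols = np.sqrt(np.diag(cov_matrix))

# Weighted average ignores how the assets move together
weighted_avg_vol = weights @ asset_vols

# Portfolio volatility uses the full covariance matrix
portfolio_vol = np.sqrt(weights @ cov_matrix @ weights)

print(f"Weighted average volatility: {weighted_avg_vol:.4f}")
print(f"Portfolio volatility:        {portfolio_vol:.4f}")
```

The `.T` is omitted here because transposing a 1D array has no effect; `weights @ cov_matrix @ weights` gives the same number.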

In portfolio optimization, you search for the weights that minimize this variance for a given target return, or maximize return for a given variance. The math is the same; only the objective changes.

---

## 7.1 Conclusion

You now have the NumPy foundation you need for quantitative finance. Let's recap what you learned:

**Arrays vs DataFrames:** Arrays are fast and unlabeled. DataFrames are convenient and labeled. Use Pandas for data preparation and exploration. Use NumPy for computation and machine learning.

**Vectorization:** Think in arrays, not loops. Vectorized operations are faster and cleaner. When you catch yourself writing a for loop over numerical data, pause and look for a vectorized solution.

**Shape:** Many errors in machine learning come from shape mismatches. A 1D array with shape `(n,)` is not the same as a 2D array with shape `(n, 1)`. Use `.reshape(-1, 1)` to fix single-feature inputs.

**Broadcasting:** NumPy stretches smaller arrays to match larger ones. This simplifies calculations like subtracting the risk-free rate or normalizing returns.

**Matrix Operations:** Portfolio returns are a dot product. Portfolio variance uses the covariance matrix. These calculations scale to any number of assets.