In [4]:
import sys
print(sys.executable)
print(sys.version)

/home/jasmine/miniconda3/envs/pds/bin/python
3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0]


# Lesson 1.6: Introduction to NumPy

## Introduction
**NumPy**, short for Numerical Python, is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Many computational and data science packages use NumPy as the main building block. It is a fundamental library for scientific computing in Python.

### Key Features of NumPy:
* **ndarray**: An efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
* **Vectorization**: Mathematical functions for fast operations on entire arrays of data without having to write loops.
* **Linear Algebra and More**: Tools for matrix manipulation, random number generation, and Fourier transforms.
* **C API**: For connecting NumPy with libraries written in C, C++, or Fortran.

### Advantages over Python Lists:
1. **Contiguous Memory**: NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. This allows for significantly faster access and manipulation.
2. **Vectorized Operations**: NumPy algorithms written in C can operate on this memory without type checking or other Python overhead, performing complex computations without slow `for` loops.
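
As a rough illustration of the memory-layout point, the sketch below inspects how an array stores its data. The variable names and the explicit `int64` dtype are choices made for this example, not part of the lesson's code.

```python
import sys
import numpy as np

# Explicit dtype so the byte counts do not depend on the platform default
arr = np.arange(1_000, dtype=np.int64)

print(arr.itemsize)               # bytes per element: 8 for int64
print(arr.nbytes)                 # total data bytes: 1000 * 8 = 8000
print(arr.flags["C_CONTIGUOUS"])  # True: elements sit in one contiguous block

# A list stores references to separate Python int objects instead
py_list = list(range(1_000))
print(sys.getsizeof(py_list))     # size of the reference table only (varies by Python build)
```

The list figure excludes the integer objects themselves, so the two numbers are not directly comparable; the point is only that the array holds raw values while the list holds pointers.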

![numpy_vs_list](../assets/numpy_vs_python_list.png)

## Part 1: Performance Benchmark
To give you an idea of the performance difference, consider a NumPy array of one million integers and an equivalent Python list. We use the `%timeit` magic command to measure execution time.
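
`%timeit` only works inside IPython or Jupyter. For a plain Python script, a similar comparison could be sketched with the standard-library `timeit` module; the `repeats` value here is an arbitrary choice for the example, and the timings will vary by machine.

```python
import timeit
import numpy as np

n = 1_000_000
my_arr = np.arange(n)
my_list = list(range(n))

repeats = 10  # hypothetical repeat count, kept small so the script finishes quickly

# timeit.timeit returns total seconds for `number` calls; divide for a per-run average
t_numpy = timeit.timeit(lambda: my_arr * 2, number=repeats) / repeats
t_list = timeit.timeit(lambda: [x * 2 for x in my_list], number=repeats) / repeats

print(f"NumPy: {t_numpy * 1e3:.3f} ms per run")
print(f"List:  {t_list * 1e3:.3f} ms per run")

# Both approaches compute the same values
same_result = (my_arr * 2).tolist() == [x * 2 for x in my_list]
print("Same result:", same_result)
```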

In [2]:
import numpy as np

In [None]:
# create np array w values 0 to n-1
my_arr = np.arange(1_000_000)

# check array length
print(f"array length: {len(my_arr)}")

print('my_arr first 5 elements: ', my_arr[:5])
print('my_arr last 5 elements: ', my_arr[-5:])

my_list = list(range(1_000_000))

# Vectorized multiplication: apply the operation to the whole array at once instead of looping over each element
print("NumPy Vectorized Multiplication (my_arr * 2):")
%timeit my_arr2 = my_arr * 2

# list comprehension: for every x in my_list, compute x * 2 (slower than vectorized multiplication)
print("\nPython List Comprehension ([x * 2 for x in my_list]):")
%timeit my_list2 = [x * 2 for x in my_list]

array length: 1000000
my_arr first 5 elements:  [0 1 2 3 4]
my_arr last 5 elements:  [999995 999996 999997 999998 999999]
NumPy Vectorized Multiplication (my_arr * 2):
821 μs ± 36.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Python List Comprehension ([x * 2 for x in my_list]):
41.8 ms ± 946 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Part 2: The ndarray (N-dimensional array)
The `ndarray` is a fast, flexible container for large datasets. It is a multidimensional array of fixed size with **homogeneous** elements (all elements must be of the same type).

Every array has:
* **shape**: A tuple indicating the size of each dimension.
* **dtype**: An object describing the data type of the array.
* **ndim**: The number of dimensions (axes).
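
To make the homogeneity rule concrete, here is a small sketch of what NumPy does when the input mixes types. The array names are made up for this example.

```python
import numpy as np

mixed_numbers = np.array([6, 7.5, 8])    # ints and a float
print(mixed_numbers.dtype)               # float64: the ints are upcast to float
print(mixed_numbers)                     # [6.  7.5 8. ]

mixed_text = np.array([1, "two", 3.0])   # numbers and a string
print(mixed_text.dtype.kind)             # 'U': every element became a Unicode string

grid = np.zeros((2, 3, 4))
print(grid.shape, grid.ndim, grid.size)  # (2, 3, 4) 3 24
```

NumPy picks a single dtype that can represent every element, which is why one float or one string changes the type of the whole array.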

### ndarray illustration
![ndarray](../assets/numpy_ndarray.png)

In [10]:
# [DEMO] Creating arrays from sequences
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

print(f"Array 2:\n{arr2}")
print(f"Shape: {arr2.shape}, Dtype: {arr2.dtype}, Dimensions: {arr2.ndim}")

Array 2:
[[1 2 3 4]
 [5 6 7 8]]
Shape: (2, 4), Dtype: int64, Dimensions: 2


### Data Types and Casting
NumPy supports specific numerical types like `int32`, `float64`, etc. You can explicitly convert an array from one `dtype` to another using the `astype` method.

**Note:** If you cast floating-point numbers to an integer `dtype`, the decimal part will be truncated.
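
Truncation here means dropping the fractional part, which moves negative values toward zero. The sketch below contrasts that with flooring and rounding; the variable names are illustrative only.

```python
import numpy as np

arr = np.array([3.7, -1.2, 0.5, 12.9])

truncated = arr.astype(np.int64)          # drops the fractional part (toward zero)
floored = np.floor(arr).astype(np.int64)  # rounds down toward negative infinity
rounded = np.rint(arr).astype(np.int64)   # rounds to nearest; ties go to the even integer

print(truncated)  # [ 3 -1  0 12]
print(floored)    # [ 3 -2  0 12]
print(rounded)    # [ 4 -1  0 13]
```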

In [None]:
# [DEMO] Casting arrays
arr = np.array([3.7, -1.2, 0.5, 12.9])
print("Original:", arr)
print("Casted to int32:", arr.astype(np.int32))

# astype only drops the decimal part without rounding; call np.round first if rounded values are needed

Original: [ 3.7 -1.2  0.5 12.9]
Casted to int32: [ 3 -1  0 12]


### [EXERCISE 1: Creation & Casting]
1. Create a 3x4 array of all ones using `np.ones()`.
2. Cast this array to `float32`.
3. Create an array of strings representing numbers: `['1.25', '-9.6', '42']`. Cast it to `float`.

In [None]:
# 1: the shape can be a list or a tuple; a tuple such as (3, 4) is the usual convention
array = np.ones((3, 4))
print(array, array.dtype)

# 2
array_float = array.astype(np.float32)
print(array_float, array_float.dtype)

# 3
arr_str = np.array(['1.25', '-9.6', '42'])
arr_float = arr_str.astype(np.float64)
print(arr_float, arr_float.dtype)

# fill array with 7s
print(np.full((2,3),7))

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]] float64
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]] float32
[ 1.25 -9.6  42.  ] float64
[[7 7 7]
 [7 7 7]]


## Part 3: Arithmetic & Broadcasting
Arithmetic operations are applied as batch operations without for loops. **Broadcasting** describes how arithmetic works between arrays of different shapes.

![vectorization](../assets/vectorization.png)

Example: A scalar value being replicated (broadcast) to match the shape of a larger array.
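
Broadcasting is not limited to scalars. As a sketch, the example below stretches a row vector and a column vector across a 2D array, and shows what happens when shapes cannot be aligned. The names `row` and `col` are chosen for this illustration.

```python
import numpy as np

arr = np.array([[1., 2., 3.],
                [4., 5., 6.]])      # shape (2, 3)

row = np.array([10., 20., 30.])    # shape (3,): stretched across both rows
col = np.array([[100.], [200.]])   # shape (2, 1): stretched across all columns

print(arr + row)
# [[11. 22. 33.]
#  [14. 25. 36.]]
print(arr + col)
# [[101. 102. 103.]
#  [204. 205. 206.]]

# Shapes that cannot be aligned raise an error instead of guessing
try:
    arr + np.array([1., 2.])       # (2, 3) vs (2,): trailing sizes 3 and 2 differ
except ValueError as exc:
    print("Broadcast failed:", exc)
```

Shapes are compared from the last dimension backwards; two dimensions are compatible when they are equal or one of them is 1.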

In [None]:
# [DEMO] Arithmetic & Broadcasting (NOT matrix multiplication)
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print("Element-wise multiplication (arr * arr):\n", arr * arr)
print("\nBroadcasting scalar (1 / arr):\n", 1 / arr)

Element-wise multiplication (arr * arr):
 [[ 1.  4.  9.]
 [16. 25. 36.]]

Broadcasting scalar (1 / arr):
 [[1.         0.5        0.33333333]
 [0.25       0.2        0.16666667]]


## Part 4: Indexing and Slicing
One-dimensional arrays act similarly to Python lists. In 2D arrays, indexing can be done with `[row, column]` syntax. 

### 2D Array Indexing Syntax
![2d_array_indexing](../assets/ndarray_axis_index.png)

**Important:** Array slices are **views** on the original array. This means data is not copied, and modifications to the slice will be reflected in the source array.
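
When an independent array is needed, calling `.copy()` on the slice avoids this sharing. A minimal sketch, using `np.shares_memory` to check whether two arrays overlap:

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]                     # basic slice: shares memory with arr
copy = arr[5:8].copy()              # explicit copy: independent data

print(np.shares_memory(arr, view))  # True
print(np.shares_memory(arr, copy))  # False

copy[:] = -1                        # changes the copy only
print(arr[5:8])                     # [5 6 7]
```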

In [31]:
# [DEMO] Slicing views
arr = np.arange(10)

# [start:stop:step], default [0:end:1]
arr_slice = arr[5:8]
# change 2nd element of [5,6,7] to 12345
arr_slice[1] = 12345
print("Original array modified via slice:", arr)
print("arr_slice: ", arr_slice)

Original array modified via slice: [    0     1     2     3     4     5 12345     7     8     9]
arr_slice:  [    5 12345     7]


In [33]:
# [DEMO] 2D Slicing
arr2d = np.array([
    [1, 2, 3], 
    [4, 5, 6], 
    [7, 8, 9]])

# arr2d[row, col]; row/col [start:stop:step]
print("\nFirst two rows, 2nd column onwards:\n", arr2d[:2, 1:])


First two rows, 2nd column onwards:
 [[2 3]
 [5 6]]


### [EXERCISE 2: The Logic of Slicing]
1. Select the first column of `arr2d` using a slice.
2. Set all values in the second row to 0.
3. **Socratic Prompt:** How does `arr2d[1]` differ from `arr2d[1, :]`? (Hint: check shapes)

In [None]:
# 1: using arr2d[:, 0] will lose the 2d shape
print(arr2d[:, :1])

# 2: or use arr2d[1, :]=0
arr2d[1]=0
print(arr2d)

# 3: both are equivalent
print(arr2d[1], arr2d[1].shape)
print(arr2d[1,:], arr2d[1,:].shape)


[[1]
 [4]
 [7]]
[[1 2 3]
 [0 0 0]
 [7 8 9]]
[0 0 0] (3,)
[0 0 0] (3,)


## Part 5: Boolean Indexing
Like arithmetic operations, comparisons (such as `==`) with arrays are vectorized. This yields a boolean array which can be used to filter data.
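
Masks can be combined with `&` (and), `|` (or), and `~` (not); the Python keywords `and` and `or` do not work element-wise on arrays. The sketch below uses made-up sample values to show this, and also shows that boolean indexing returns a copy rather than a view.

```python
import numpy as np

values = np.array([3, 8, 15, 4, 23, 42])  # hypothetical sample data

mask = (values > 5) & (values < 30)       # parentheses are required around each comparison
print(mask)                               # [False  True  True False  True False]
print(values[mask])                       # [ 8 15 23]
print(values[~mask])                      # [ 3  4 42]

# Unlike a slice, boolean indexing returns a copy
selected = values[mask]
selected[0] = -1
print(values)                             # [ 3  8 15  4 23 42] (unchanged)
```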

In [None]:
# [DEMO] Filtering scores
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
scores = np.array([[75, 80], [85, 90], [95, 100], [100, 77], [85, 92], [95, 80], [72, 80]])

# mask: bool array indicating which elements to select / meet a condition
bob_mask = (names == 'Bob')
print("Mask:", bob_mask)
print("Bob's scores:\n", scores[bob_mask])

Mask: [ True False False  True False False False]
Bob's scores:
 [[ 75  80]
 [100  77]]


### [EXERCISE 3: Complex Filtering]
1. Select all scores where the name is NOT 'Bob'.
2. Select scores for 'Bob' or 'Will' using the `|` operator.
3. Find all scores less than 80 and set them to 0.

In [74]:
# 1
notbob_mask = (names != 'Bob')
print("Not Bob's: ", scores[notbob_mask])

# 2
mask = (names=='Bob') | (names=='Will')
print("Bob's & Will's: ", scores[mask])

# 3
scores[scores<80] = 0 
print("Change <80 to 0: \n", scores)


Not Bob's:  [[ 85  90]
 [ 95 100]
 [ 85  92]
 [ 95  80]
 [ 72  80]]
Bob's & Will's:  [[ 75  80]
 [ 95 100]
 [100  77]
 [ 85  92]]
Change <80 to 0: 
 [[  0  80]
 [ 85  90]
 [ 95 100]
 [100   0]
 [ 85  92]
 [ 95  80]
 [  0  80]]


## Part 6: Universal Functions (ufuncs) and Methods
A **ufunc** is a function that performs element-wise operations on data in ndarrays. 

* **Unary ufuncs**: Take one array (e.g., `sqrt`, `exp`).
* **Binary ufuncs**: Take two arrays (e.g., `add`, `maximum`).
* **Statistical Methods**: `mean`, `sum`, `std` can be computed over the entire array or along an axis.
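
A short sketch of each category on small, fixed inputs, so the results can be checked by hand. The arrays `x`, `y`, and `m` are invented for this example.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
y = np.array([2.0, 3.0, 10.0])

print(np.sqrt(x))        # unary: [1. 2. 3.]
print(np.add(x, y))      # binary, same as x + y: [ 3.  7. 19.]
print(np.maximum(x, y))  # binary, element-wise larger value: [ 2.  4. 10.]

m = np.arange(6).reshape(2, 3)  # [[0 1 2], [3 4 5]]
print(m.sum())                  # whole array: 15
print(m.sum(axis=0))            # collapse rows, one value per column: [3 5 7]
print(m.mean(axis=1))           # collapse columns, one value per row: [1. 4.]
```

The `axis` argument names the dimension that is reduced away: `axis=0` leaves one result per column, `axis=1` one result per row.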

In [75]:
# [DEMO] Statistical Methods
arr = np.random.randn(3, 4)     # generate random 3x4 array
print("Random Array:\n", arr)
print("\nMean down rows (axis=0):", arr.mean(axis=0))
print("Sum across columns (axis=1):", arr.sum(axis=1))

Random Array:
 [[-7.59685410e-01  1.31959731e+00 -1.31533127e+00  1.72479666e+00]
 [ 5.07400562e-01  1.49925001e+00  1.32888447e-03  1.46208588e+00]
 [-8.97409103e-01  5.19738335e-01  7.33042067e-01 -9.02331101e-01]]

Mean down rows (axis=0): [-0.38323132  1.11286188 -0.19365344  0.76151715]
Sum across columns (axis=1): [ 0.96937729  3.47006534 -0.5469598 ]


## Part 7: Linear Algebra
Linear algebra operations, like matrix multiplication, are crucial for many data science algorithms. Multiplying two arrays with `*` is an element-wise product; for matrix multiplication, use `.dot()` or the `@` operator.
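
To see the difference between the two products side by side, the sketch below uses two small square matrices (chosen for this example so that both `*` and `@` are defined) and checks one entry of the matrix product by hand.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

elementwise = a * b  # multiplies matching positions: [[ 5 12], [21 32]]
matrix = a @ b       # rows of a dotted with columns of b: [[19 22], [43 50]]
print(elementwise)
print(matrix)

# Entry [0, 0] of the matrix product is row 0 of a dotted with column 0 of b
assert matrix[0, 0] == 1 * 5 + 2 * 7

# .dot() and np.matmul give the same result as @ for 2D arrays
assert np.array_equal(matrix, a.dot(b))
assert np.array_equal(matrix, np.matmul(a, b))
```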

![matrix_multiplication](../assets/matrix_multiplication.png)

In [None]:
# [DEMO] Matrix Multiplication @
x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([[6, 23], [-1, 7], [8, 9]])

print("Matrix product (x @ y):\n", x @ y)

# A: mxn
# B: nxp
# n must match to perform matmul

Matrix product (x @ y):
 [[ 28  64]
 [ 67 181]]


### [EXERCISE 4: Reshaping & Statistics]
1. Create an array of 15 integers using `arange(15)` and reshape it to `(3, 5)`.
2. Calculate the average value of each row.
3. Use `np.unique()` to find distinct elements in an array of your choice.
4. Transpose the reshaped array using `.T` and check the new shape.

In [93]:
# 1
arr = np.arange(15).reshape(3,5)
print(arr)

# 2
#print(arr[0].mean())
#print(arr[1].mean())
#print(arr[2].mean())
print(arr.mean(axis=1))

# 3
print(np.unique(arr))

# 4
print(arr.T)
print(arr.T.shape)


[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
[ 2.  7. 12.]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[[ 0  5 10]
 [ 1  6 11]
 [ 2  7 12]
 [ 3  8 13]
 [ 4  9 14]]
(5, 3)


### Post-class

In [8]:
# 1D array of integers from 10 to 19
a = np.arange(10, 20)
print(a, a.shape, a.ndim, a.dtype,'\n')

# 2D array of shape (4, 3) filled with 1.5
b = np.full((4, 3), 1.5)
print(b, b.shape, b.ndim, b.dtype, '\n')

# 3D array of zeros with shape (2, 2, 3)
c = np.zeros((2, 2, 3))
print(c, c.shape, c.ndim, c.dtype, '\n')

# Convert the (4, 3) float array to integers using .astype(int).
d = b.astype(int)
print(d, d.shape, d.ndim, d.dtype, '\n')


[10 11 12 13 14 15 16 17 18 19] (10,) 1 int64 

[[1.5 1.5 1.5]
 [1.5 1.5 1.5]
 [1.5 1.5 1.5]
 [1.5 1.5 1.5]] (4, 3) 2 float64 

[[[0. 0. 0.]
  [0. 0. 0.]]

 [[0. 0. 0.]
  [0. 0. 0.]]] (2, 2, 3) 3 float64 

[[1 1 1]
 [1 1 1]
 [1 1 1]
 [1 1 1]] (4, 3) 2 int64 



In [22]:
sales = np.array([
        [100, 120, 130, 140, 150],  # Region A
        [90, 110, 125, 135, 145],   # Region B
        [85, 95, 105, 120, 130],    # Region C
        [75, 85, 95, 105, 120]      # Region D
])
regions = np.array(["A", "B", "C", "D"])
quarters = np.array(["Q1", "Q2", "Q3", "Q4", "Q5"])

# Select all quarters for Region B as a 1D array.
print(sales[1])
print("Shape:", sales[1].shape, "-- 1D because selected a single row.")

# Select Q2 to Q4 (inclusive) for all regions as a 2D subarray.
print(sales[:,1:4])
print("Shape:", sales[:,1:4].shape, "-- 2D because selected multiple rows and columns.")

# Select Q5 sales for Regions A and D only (use slicing or fancy indexing).
print(sales[[0,3], 4])
print("Shape:", sales[[0,3], 4].shape,  "-- 1D because selected specific elements from different rows but only one column.")

[ 90 110 125 135 145]
Shape: (5,) -- 1D because selected a single row.
[[120 130 140]
 [110 125 135]
 [ 95 105 120]
 [ 85  95 105]]
Shape: (4, 3) -- 2D because selected multiple rows and columns.
[150 120]
Shape: (2,) -- 1D because selected specific elements from different rows but only one column.


In [33]:
names = np.array(["Ana", "Ben", "Chen", "Dana", "Eli", "Fatima", "George", "Hui"])
spend = np.array([200, 150, 300, 120, 180, 220, 160, 310])      # marketing spend
revenue = np.array([400, 180, 500, 100, 220, 260, 150, 600])   # revenue

# Compute the ROI for each person: roi = revenue / spend.
roi = revenue / spend
print("ROI:", roi)

# Create a boolean mask for customers with roi >= 2.0.
mask_roi2 = roi >= 2.0
print(mask_roi2)

# Use the mask to: List their names, List their spend and revenue
print("Names with ROI >= 2.0:", names[mask_roi2])
print("Spend for these customers:", spend[mask_roi2])
print("Revenue for these customers:", revenue[mask_roi2])

# Create a second mask for customers with spend >= 200
mask_spend200 = spend >=200
print(mask_spend200)

# Combine the two masks to find customers who have roi >= 2.0 AND spend >= 200
print("Names with ROI >= 2.0 AND Spend >= 200:", names[mask_roi2 & mask_spend200])

#“What business insight do you get from this filtered group?”
# high ROI and high spend --> valuable customers worth focusing marketing efforts on
# high ROI but low spend --> potential customers to upsell or target for higher spend


ROI: [2.         1.2        1.66666667 0.83333333 1.22222222 1.18181818
 0.9375     1.93548387]
[ True False False False False False False False]
Names with ROI >= 2.0: ['Ana']
Spend for these customers: [200]
Revenue for these customers: [400]
[ True False  True False False  True False  True]
Names with ROI >= 2.0 AND Spend >= 200: ['Ana']


In [52]:
students = np.array(["S1", "S2", "S3", "S4", "S5"])
subjects = np.array(["Math", "Stats", "Python"])

scores = np.array([
    [75, 80, 85],
    [60, 65, 70],
    [90, 88, 92],
    [82, 79, 84],
    [70, 72, 78]
])

#Compute each student’s average score per row - work along axis 1
student_averages = scores.mean(axis=1)
print("Student Averages:", student_averages)

#Compute each subject’s average score per column - work along axis 0
subject_averages = scores.mean(axis=0)
print("Subject Averages:", subject_averages)

#Create scores_centered by subtracting the subject mean from each column (broadcasting).
scores_centered = scores - subject_averages
print("Centered Scores:")
print(scores_centered)

#For scores_centered, compute per-student averages again.
student_averages_centered = scores_centered.mean(axis=1)
print("Student Averages (Centered):", student_averages_centered)

#Compare which student looks best by raw average vs centered average.
print("Comparison of Raw vs Centered Averages:")
for i, student in enumerate(students):
    print(f"{student}: Raw Avg = {student_averages[i]:.2f}, Centered Avg = {student_averages_centered[i]:.2f}") 
print("Best Student by Raw Average:", students[np.argmax(student_averages)])
print("Best Student by Centered Average:", students[np.argmax(student_averages_centered)])

#explain briefly why centered scores might give a fairer comparison.
# Centered scores account for subject difficulty by normalizing scores relative to subject averages,
# providing a fairer comparison of student performance across different subjects.
# Centered scores remove the overall difficulty of each subject, 
# so we can compare students more fairly across subjects that may have different average scores.


Student Averages: [80.         65.         90.         81.66666667 73.33333333]
Subject Averages: [75.4 76.8 81.8]
Centered Scores:
[[ -0.4   3.2   3.2]
 [-15.4 -11.8 -11.8]
 [ 14.6  11.2  10.2]
 [  6.6   2.2   2.2]
 [ -5.4  -4.8  -3.8]]
Student Averages (Centered): [  2.         -13.          12.           3.66666667  -4.66666667]
Comparison of Raw vs Centered Averages:
S1: Raw Avg = 80.00, Centered Avg = 2.00
S2: Raw Avg = 65.00, Centered Avg = -13.00
S3: Raw Avg = 90.00, Centered Avg = 12.00
S4: Raw Avg = 81.67, Centered Avg = 3.67
S5: Raw Avg = 73.33, Centered Avg = -4.67
Best Student by Raw Average: S3
Best Student by Centered Average: S3


In [81]:
# Each row: [page_views, time_on_site (minutes), past_purchases]
X = np.array([
    [10,  3.5,  0],
    [25,  5.0,  1],
    [40,  2.0,  0],
    [15, 10.0,  3],
    [30,  4.0,  2]
])

customers = np.array(["C1", "C2", "C3", "C4", "C5"])

# Choose a weight vector w = [w_views, w_time, w_purchases] (for example [0.1, 0.5, 1.0]).
w = np.array([0.1, 0.5, 1.0])

# Compute a score for each customer using scores = X @ w.
scores = X @ w
print("Customer Scores:", scores)

# Rank customers by score (highest first)
print(np.sort(scores)[::-1])    #sort only scores descending

sorted_indices = np.argsort(scores)[::-1] #indices that would sort scores descending
print(sorted_indices) 

print("Customer Rankings:", customers[sorted_indices])  #rank customers by sorted indices

# Change the weights to emphasize past_purchases more than other features, and recompute.
w_new = np.array([0.1, 0.2, 2.0])  # emphasize past_purchases more
scores_new = X @ w_new
print("New Customer Scores:", scores_new)
sorted_indices_new = np.argsort(scores_new)[::-1]
print("New Customer Rankings:", customers[sorted_indices_new])

# Which customer is top-ranked before vs after changing weights?
# C4 is top-ranked under both weightings; with this data the full ranking is unchanged.
# In what real-world situation might you prefer each weighting?
# Rankings can change with the weights assigned to each feature.
# Emphasizing past purchases might be preferred in a loyalty program context,
# while a balanced weighting could be better for general marketing strategies.

Customer Scores: [2.75 6.   5.   9.5  7.  ]
[9.5  7.   6.   5.   2.75]
[3 4 1 2 0]
Customer Rankings: ['C4' 'C5' 'C2' 'C3' 'C1']
New Customer Scores: [1.7 5.5 4.4 9.5 7.8]
New Customer Rankings: ['C4' 'C5' 'C2' 'C3' 'C1']


In [83]:
# Generate 1,000 random test scores with np.random.randn,
# scaled to have mean 70 and standard deviation 10
scores = np.random.randn(1000) * 10 + 70

# Clip scores to [0, 100] means >100 becomes 100, <0 becomes 0
scores = np.clip(scores, 0, 100)

# Compute statistics
print("Min:", np.min(scores))
print("Max:", np.max(scores))
print("Mean:", np.mean(scores))
print("Std Dev:", np.std(scores))

Min: 40.05387139772381
Max: 100.0
Mean: 70.13445965542874
Std Dev: 9.676173315645658


In [88]:
'''Simulate a small A/B test:
Two groups of 20 users each
Randomly generate conversions (0 or 1) for each group
Compute conversion rate per group using pure NumPy operations
'''

np.random.seed(42)  # for reproducibility
group_A = np.random.randint(0, 2, size=20)  # 0 or 1 conversions
group_B = np.random.randint(0, 2, size=20)  
conversion_rate_A = group_A.mean()
conversion_rate_B = group_B.mean()
print("Conversion Rate Group A:", conversion_rate_A)
print("Conversion Rate Group B:", conversion_rate_B)    

Conversion Rate Group A: 0.35
Conversion Rate Group B: 0.65
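
The simulation above seeds NumPy's legacy global random state with `np.random.seed`. As an alternative sketch, the same experiment can be written with a local generator from `np.random.default_rng`; the seed value and variable names are arbitrary, and the draws will not match the legacy output shown above.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator; the seed value is arbitrary

# integers(low, high) excludes high, so this draws 0 or 1
group_a = rng.integers(0, 2, size=20)
group_b = rng.integers(0, 2, size=20)

print("Conversion Rate Group A:", group_a.mean())
print("Conversion Rate Group B:", group_b.mean())

# Re-creating the generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(42)
assert np.array_equal(group_a, rng_again.integers(0, 2, size=20))
```

Keeping the generator in a variable avoids hidden global state, so other code that draws random numbers cannot change these results.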
