# CE49X: Introduction to Computational Thinking and Data Science for Civil Engineers

## Week 4: NumPy and Pandas

**Based on "Python Data Science Handbook" by Jake VanderPlas**
- **Chapter 2**: Introduction to NumPy
- **Chapter 3**: Data Manipulation with Pandas (Sections 3.0-3.6)

**Author:** Dr. Eyuphan Koc  
**Institution:** Bogazici University - Department of Civil Engineering  
**Semester:** Fall 2025

---

### Topics Covered:

**NumPy:**
1. **Introduction to NumPy** - Arrays and basic operations
2. **Array Fundamentals** - Creating, indexing, and reshaping
3. **Array Operations** - Vectorization and universal functions
4. **Aggregations** - Statistical operations
5. **Broadcasting** - Operations on different shapes
6. **Boolean Operations** - Masking and filtering
7. **Advanced Indexing** - Fancy indexing
8. **Sorting** - Organizing data
9. **Structured Arrays** - Mixed data types

**Pandas:**
10. **Introduction to Pandas** - DataFrames and why Pandas
11. **Pandas Core Objects** - Series, DataFrame, Index
12. **Data Indexing and Selection** - loc, iloc, masking
13. **Operations in Pandas** - Index alignment and ufuncs
14. **Handling Missing Data** - NaN, dropna, fillna
15. **Hierarchical Indexing** - MultiIndex for higher-dimensional data
16. **Combining Datasets** - concat and append

### Learning Objectives:
- Understand why NumPy is essential for scientific computing
- Master NumPy arrays and vectorized operations
- Learn Pandas DataFrames for labeled, structured data
- Perform data selection, filtering, and transformation
- Handle missing data effectively
- Combine and manipulate datasets

---

*This notebook contains practical examples demonstrating NumPy and Pandas capabilities for numerical computing and data analysis in civil engineering applications.*

## 1. Introduction to NumPy

NumPy (Numerical Python) is the fundamental package for scientific computing in Python. It provides:
- Fast, efficient multi-dimensional arrays
- Vectorized computations (no explicit Python loops needed!)
- Broadcasting for operations on different shapes
- Foundation for Pandas, SciPy, Matplotlib, and more

### 1.1 Speed Comparison: Lists vs NumPy

In [3]:
# Python list: slow
import time
L = list(range(1000000))
start = time.time()
result = [x**2 for x in L]
print(f"Time: {time.time()-start:.4f}s")
# Time: ~0.03s here (varies by machine; see output below)

Time: 0.0302s


In [5]:
# NumPy array: fast!
import numpy as np
A = np.arange(1000000)
start = time.time()
result = A**2
print(f"Time: {time.time()-start:.4f}s")
# Time: ~0.0015s here (about 20x faster in this run)

Time: 0.0015s
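
A single `time.time()` measurement is noisy, so the exact speedup changes from run to run. The sketch below repeats the same comparison with the standard-library `timeit` module and keeps the best of several runs. The array size `n` and the repeat counts are arbitrary choices for illustration, not values from the lecture.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the 1,000,000 used above so the demo runs quickly
values_list = list(range(n))
values_array = np.arange(n, dtype=np.int64)  # explicit int64 avoids overflow when squaring

# Run each statement several times and keep the best (least noisy) time
t_list = min(timeit.repeat(lambda: [v**2 for v in values_list], number=5, repeat=3))
t_array = min(timeit.repeat(lambda: values_array**2, number=5, repeat=3))

print(f"list comprehension: {t_list:.4f}s for 5 runs")
print(f"NumPy vectorized:   {t_array:.4f}s for 5 runs")
print(f"speedup: about {t_list / t_array:.0f}x (machine-dependent)")

# Both approaches compute the same values
assert [v**2 for v in values_list] == (values_array**2).tolist()
```

The measured ratio depends on hardware, array size, and NumPy version, so treat any single number as a rough guide.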


### 1.2 Installing and Importing NumPy

In [None]:
import numpy as np  # ALWAYS use this convention!

# Check version
print(np.__version__)  # e.g., '1.24.3'

### 1.3 Creating Your First NumPy Arrays

In [6]:
import numpy as np

# From Python list - 1D array
my_list = [1, 2, 3, 4, 5]
arr = np.array(my_list)
print(arr)  # [1 2 3 4 5]

# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(matrix)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]

# Unlike lists, every element of an array shares one dtype
mixed = np.array([1, 2.5, 3])  # Integers are upcast to float
print(mixed.dtype)  # float64

[1 2 3 4 5]
[[1 2 3]
 [4 5 6]
 [7 8 9]]
float64
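
The fixed dtype also matters when writing into an existing array. As a small illustration (the variable names are made up for this sketch), assigning a float into an integer array silently drops the fractional part:

```python
import numpy as np

int_arr = np.array([1, 2, 3])      # integer dtype inferred from the inputs
int_arr[0] = 3.99                  # the float is truncated to fit the integer dtype
print(int_arr)                     # [3 2 3]

float_arr = int_arr.astype(float)  # convert first if fractions must be kept
float_arr[0] = 3.99
print(float_arr[0])                # 3.99
```

No warning is raised in the first case, so it is worth checking `arr.dtype` before storing measured values.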


## 2. Array Fundamentals

### 2.1 Creating Arrays: Common Methods

In [None]:
import numpy as np

# Zeros - initialize array
zeros = np.zeros(10)  # [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
zeros_matrix = np.zeros((3, 3))  # 3x3 matrix of zeros

# Ones
ones = np.ones(5)  # [1. 1. 1. 1. 1.]
fives = np.ones(5) * 5  # [5. 5. 5. 5. 5.]

# Sequential values
seq = np.arange(0, 11, 2)  # [0 2 4 6 8 10]  (start, stop, step)

# Evenly spaced values
linear = np.linspace(0, 1, 5)  # [0.  0.25  0.5  0.75  1.] (start, stop, count)

# Random values
random_vals = np.random.random(5)  # 5 random values between 0 and 1
random_ints = np.random.randint(0, 10, size=5)  # 5 random integers 0-9

# Identity matrix
I = np.eye(3)  # [[1. 0. 0.], [0. 1. 0.], [0. 0. 1.]]

print("Zeros:", zeros)
print("Sequential:", seq)
print("Linear:", linear)
print("Identity:")
print(I)

### 2.2 Array Attributes: Understanding Your Data

In [None]:
import numpy as np
np.random.seed(0)  # for reproducibility

# Create arrays of different dimensions
x1 = np.random.randint(10, size=6)  # 1D array
x2 = np.random.randint(10, size=(3, 4))  # 2D array
x3 = np.random.randint(10, size=(3, 4, 5))  # 3D array

print("x3 ndim: ", x3.ndim)  # 3
print("x3 shape:", x3.shape)  # (3, 4, 5)
print("x3 size: ", x3.size)  # 60

print("dtype:", x3.dtype)  # int64 on most platforms
print("itemsize:", x3.itemsize, "bytes")  # 8 bytes for int64
print("nbytes:", x3.nbytes, "bytes")  # 480 bytes = 60 elements x 8 bytes

### 2.3 Data Types in NumPy

In [None]:
# Specify dtype at creation
arr = np.array([1, 2, 3], dtype=np.float32)

# Convert dtype
arr_float64 = arr.astype(np.float64)

print(arr.dtype)        # float32
print(arr_float64.dtype)  # float64

In [None]:
# Memory comparison
a32 = np.ones(1000000, dtype=np.float32)
a64 = np.ones(1000000, dtype=np.float64)

print(f"32-bit: {a32.nbytes/1e6} MB")
# 4.0 MB
print(f"64-bit: {a64.nbytes/1e6} MB")
# 8.0 MB
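
Halving the memory comes at the cost of precision. A minimal sketch of that trade-off, using the fact that float32 has a 24-bit significand (the variable names are illustrative):

```python
import numpy as np

big = 16_777_216  # 2**24: above this, float32 cannot represent every integer

as32 = np.float32(big) + np.float32(1)
as64 = np.float64(big) + np.float64(1)

print(as32 == np.float32(big))  # True: the +1 is lost in float32
print(as64 == np.float64(big))  # False: float64 still resolves it

# Machine epsilon summarizes the relative precision of each type
print(np.finfo(np.float32).eps)
print(np.finfo(np.float64).eps)
```

For most engineering calculations float64 is the safer default; float32 is mainly useful when memory is the limiting factor.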

### 2.4 Indexing and Slicing: Accessing Array Elements

In [None]:
import numpy as np

# 1D array - similar to Python lists
x = np.arange(10)  # [0 1 2 3 4 5 6 7 8 9]
print(x[0])      # 0 (first element)
print(x[-1])     # 9 (last element)
print(x[4:7])    # [4 5 6] (slice from index 4 to 6)
print(x[::2])    # [0 2 4 6 8] (every 2nd element)
print(x[::-1])   # [9 8 7 6 5 4 3 2 1 0] (reverse)

# 2D array - row, column indexing
x2 = np.array([[12, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

print(x2[0, 0])    # 12 (element at row 0, col 0)
print(x2[2, -1])   # 7 (element at row 2, last column)
print(x2[0, :])    # [12  5  2  4] (first row)
print(x2[:, 1])    # [5 6 6] (second column)
print(x2[:2, :2])  # [[12  5], [ 7  6]] (2x2 subarray)

### 2.5 Array Views vs Copies

In [7]:
import numpy as np

# Original array
original = np.array([1, 2, 3, 4, 5])

# SLICING creates a VIEW (not a copy!)
view = original[1:4]
view[0] = 999
print(original)  # [1 999 3 4 5]  <-- Original changed!

# To create independent copy, use .copy()
original = np.array([1, 2, 3, 4, 5])
independent = original[1:4].copy()
independent[0] = 999
print(original)  # [1 2 3 4 5]  <-- Original unchanged
print(independent)  # [999 3 4]

[  1 999   3   4   5]
[1 2 3 4 5]
[999   3   4]
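
When it is unclear whether two arrays share data, `np.shares_memory` and the `.base` attribute can be used to check. The sketch below compares a slice, an explicit copy, and a list-indexed selection (list indexing is covered in Section 7):

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])
view = original[1:4]            # basic slice
copy = original[1:4].copy()     # explicit copy
fancy = original[[1, 2, 3]]     # indexing with a list of positions

print(np.shares_memory(original, view))   # True
print(np.shares_memory(original, copy))   # False
print(np.shares_memory(original, fancy))  # False: list indexing returns a copy
print(view.base is original)              # True
```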


### 2.6 Reshaping Arrays

In [None]:
import numpy as np

# 1D to 2D
loads = np.arange(12)
print(loads)  # [ 0  1  2  3  4  5  6  7  8  9 10 11]

# Reshape to 3x4 matrix
matrix = loads.reshape(3, 4)
print(matrix)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Reshape to 4x3 matrix
matrix2 = loads.reshape(4, 3)
print(matrix2)
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]]

# Use -1 to auto-calculate one dimension
matrix3 = loads.reshape(2, -1)  # 2 rows, auto-calculate columns
print(matrix3.shape)  # (2, 6)

# Flatten back to 1D
flat = matrix.flatten()  # always a copy; .ravel() returns a view when possible
print(flat)  # [ 0  1  2  3  4  5  6  7  8  9 10 11]
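
Two related details are worth seeing in code: `np.newaxis` is a shorthand for turning a 1D array into a row or column, and `reshape` fails if the element count does not match. A short sketch with a made-up array:

```python
import numpy as np

x = np.array([1, 2, 3])

row = x[np.newaxis, :]   # same as x.reshape(1, 3)
col = x[:, np.newaxis]   # same as x.reshape(3, 1)
print(row.shape)  # (1, 3)
print(col.shape)  # (3, 1)

# The total number of elements must be preserved
try:
    np.arange(12).reshape(5, -1)   # 12 is not divisible by 5
except ValueError as err:
    print("reshape failed:", err)
```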

### 2.7 Concatenating and Splitting Arrays

In [None]:
import numpy as np

# Concatenate 1D arrays
dead_load = np.array([10, 15, 20])
live_load = np.array([5, 8, 10])
total_load = np.concatenate([dead_load, live_load])
print(total_load)  # [10 15 20  5  8 10]

# Stack vertically (vstack)
loads = np.vstack([dead_load, live_load])
print(loads)
# [[10 15 20]
#  [ 5  8 10]]

# Stack horizontally (hstack)
loads = np.hstack([dead_load, live_load])
print(loads)  # [10 15 20  5  8 10]

# Split array
split_loads = np.split(total_load, [3])  # Split at index 3
print(split_loads[0])  # [10 15 20]
print(split_loads[1])  # [ 5  8 10]

## 3. Array Operations and Universal Functions

### 3.1 Vectorized Arithmetic: No Loops Needed!

In [None]:
import numpy as np

# Create arrays
x = np.arange(4)
print("x =", x)  # [0 1 2 3]

# Element-wise operations (vectorized - FAST!)
print("x + 5 =", x + 5)    # [5 6 7 8]
print("x - 5 =", x - 5)    # [-5 -4 -3 -2]
print("x * 2 =", x * 2)    # [0 2 4 6]
print("x / 2 =", x / 2)    # [0.  0.5 1.  1.5]
print("x ** 2 =", x ** 2)  # [0 1 4 9]

# Multiple arrays
a = np.array([1, 2, 3, 4])
b = np.array([4, 3, 2, 1])
print("a + b =", a + b)    # [5 5 5 5]
print("a * b =", a * b)    # [4 6 6 4]

# Compare to Python list (requires loop!)
# result = [x**2 for x in my_list]  # Slow!
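
As a civil engineering flavoured illustration of vectorized arithmetic, the sketch below combines the dead and live load values from the concatenation example in Section 2.7. The units and the 1.2D + 1.6L load factors are assumptions chosen for the example, not values specified in this notebook.

```python
import numpy as np

# Hypothetical service loads on three beams (kN/m)
dead_load = np.array([10.0, 15.0, 20.0])
live_load = np.array([5.0, 8.0, 10.0])

# Factored load, assuming an LRFD-style combination 1.2D + 1.6L
factored = 1.2 * dead_load + 1.6 * live_load
print(factored)  # [20.  30.8 40. ]
```

The whole combination is evaluated for every beam in one expression, with no loop over beams.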

### 3.2 Universal Functions (ufuncs): Fast Operations

In [8]:
import numpy as np

# Trigonometric functions
theta = np.linspace(0, np.pi, 3)
print("sin(theta) =", np.sin(theta))
print("cos(theta) =", np.cos(theta))
print("tan(theta) =", np.tan(theta))

# Exponential and logarithmic
x = [1, 2, 3]
print("e^x =", np.exp(x))       # [2.718  7.389  20.086]
print("2^x =", np.exp2(x))      # [2.  4.  8.]
print("log(x) =", np.log(x))    # [0.  0.693  1.099]

# Absolute value
x = np.array([-2, -1, 0, 1, 2])
print("abs(x) =", np.abs(x))    # [2 1 0 1 2]

# All much faster than Python loops!

sin(theta) = [0.0000000e+00 1.0000000e+00 1.2246468e-16]
cos(theta) = [ 1.000000e+00  6.123234e-17 -1.000000e+00]
tan(theta) = [ 0.00000000e+00  1.63312394e+16 -1.22464680e-16]
e^x = [ 2.71828183  7.3890561  20.08553692]
2^x = [2. 4. 8.]
log(x) = [0.         0.69314718 1.09861229]
abs(x) = [2 1 0 1 2]
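
The output above shows `sin(pi)` as `1.2246468e-16` rather than exactly zero, and `tan(pi/2)` as a very large number rather than infinity. This is ordinary floating-point round-off. A minimal sketch of how to compare such results safely:

```python
import numpy as np

theta = np.linspace(0, np.pi, 3)
s = np.sin(theta)

print(s[2] == 0.0)                       # False: 1.2246468e-16 is not exactly zero
print(np.isclose(s[2], 0.0))             # True: equal within the default tolerance
print(np.allclose(s, [0.0, 1.0, 0.0]))   # True
print(np.round(s, 10))                   # rounding is another way to tidy the display
```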


### 3.3 Example: Computation on Arrays

In [None]:
import numpy as np

# Compute values of sin(x) for many values
x = np.linspace(0, np.pi, 3)
print("x      =", x)
# [0.         1.57079633 3.14159265]

print("sin(x) =", np.sin(x))
# [0.0000000e+00 1.0000000e+00 1.2246468e-16]

# Element-wise square with an explicit Python loop (slow)
x = np.arange(5)
y = np.empty(5)
for i in range(5):
    y[i] = x[i] ** 2
print(y)  # [ 0.  1.  4.  9. 16.]

# Much better with vectorization:
x = np.arange(5)
y = x ** 2
print(y)  # [ 0  1  4  9 16]

## 4. Aggregations and Statistics

### 4.1 Basic Aggregations: Summarizing Data

In [None]:
import numpy as np

# Random data
L = np.random.random(100)

# Summary statistics
print(np.sum(L))      # Sum of all values
print(np.min(L))      # Minimum value
print(np.max(L))      # Maximum value
print(np.mean(L))     # Mean
print(np.std(L))      # Standard deviation
print(np.var(L))      # Variance

# These also work as array methods:
print(L.sum())
print(L.min())
print(L.max())
print(L.mean())
print(L.std())

# Percentiles
print(np.percentile(L, 25))  # 1st quartile
print(np.median(L))          # 50th percentile
print(np.percentile(L, 75))  # 3rd quartile
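
Real measurements often contain gaps. The standard aggregations return `nan` if any entry is `nan`, while the `nan`-prefixed versions skip those entries. A small sketch with made-up readings:

```python
import numpy as np

# Hypothetical sensor readings with one missing value
readings = np.array([2.0, 4.0, np.nan, 6.0])

print(np.mean(readings))      # nan: a single NaN propagates to the result
print(np.nanmean(readings))   # 4.0: NaN entries are skipped
print(np.nansum(readings))    # 12.0
print(np.nanmax(readings))    # 6.0
```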

### 4.2 Multi-Dimensional Aggregations: The axis Parameter

In [None]:
import numpy as np

# 2D array example
M = np.random.random((3, 4))
print(M)

# Aggregate along different axes
print("Shape:", M.shape)  # (3, 4)

# Sum all values
print(M.sum())

# Sum along axis 0 (collapse rows -> result has shape (4,))
print(M.sum(axis=0))

# Sum along axis 1 (collapse columns -> result has shape (3,))
print(M.sum(axis=1))

# Works with other functions too:
print(M.min(axis=0))  # Min of each column
print(M.max(axis=1))  # Max of each row
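
Because the random matrix above changes on every run, the effect of `axis` is easier to verify by hand on a small fixed array. A sketch:

```python
import numpy as np

M = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(M.sum(axis=0))   # [12 15 18 21]  one value per column
print(M.sum(axis=1))   # [ 6 22 38]     one value per row

# keepdims=True keeps the collapsed axis with length 1
print(M.sum(axis=1, keepdims=True).shape)  # (3, 1)
```

The `axis` argument names the dimension that disappears, not the one that remains.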

### 4.3 More Aggregation Functions

In [None]:
import numpy as np

data = np.array([10, 15, 20, 25, 30, 35, 40])

# Basic stats
print(f"Sum: {np.sum(data)}")         # 175
print(f"Product: {np.prod(data)}")    # 3150000000 (3.15e9)
print(f"Mean: {np.mean(data)}")       # 25.0
print(f"Std: {np.std(data)}")         # 10.0
print(f"Variance: {np.var(data)}")    # 100.0

# Min/Max
print(f"Min: {np.min(data)}")         # 10
print(f"Max: {np.max(data)}")         # 40
print(f"Argmin: {np.argmin(data)}")   # 0 (index of min)
print(f"Argmax: {np.argmax(data)}")   # 6 (index of max)

# Cumulative operations
cumsum = np.cumsum(data)  # [10 25 45 70 100 135 175]
cumprod = np.cumprod(data[:4])  # [10 150 3000 75000]

# Boolean operations
print(f"Any > 50: {np.any(data > 50)}")    # False
print(f"All > 5: {np.all(data > 5)}")      # True

### 4.4 Example: Analyzing Multi-Dimensional Data

In [9]:
import numpy as np

# Precipitation data: months x 5 years (only the first 3 of 12 months are shown; the results below use these 3 rows)
precip_data = np.array([
    [3.2, 2.8, 3.5, 2.9, 3.1],  # January
    [2.5, 2.9, 2.3, 2.7, 2.6],  # February
    [3.8, 4.1, 3.6, 4.0, 3.9],  # March
    # ... (more months)
])

# Analysis
mean_per_month = np.mean(precip_data, axis=1)
mean_per_year = np.mean(precip_data, axis=0)
overall_mean = np.mean(precip_data)
overall_std = np.std(precip_data)

# Find extremes
max_precip = np.max(precip_data)
min_precip = np.min(precip_data)

print(f"Overall mean: {overall_mean:.2f}")
print(f"Std deviation: {overall_std:.2f}")
print(f"Range: {min_precip:.2f} - {max_precip:.2f}")

Overall mean: 3.19
Std deviation: 0.57
Range: 2.30 - 4.10
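
Knowing the maximum is often less useful than knowing where it occurred. The sketch below reuses the three months shown above and locates the largest value with `np.argmax` and `np.unravel_index`. The `months` list is added here for labelling only.

```python
import numpy as np

precip_data = np.array([
    [3.2, 2.8, 3.5, 2.9, 3.1],  # January
    [2.5, 2.9, 2.3, 2.7, 2.6],  # February
    [3.8, 4.1, 3.6, 4.0, 3.9],  # March
])
months = ["January", "February", "March"]

# argmax works on the flattened array; unravel_index converts back to (row, col)
row, col = np.unravel_index(np.argmax(precip_data), precip_data.shape)
print(months[row], "in year column", col, "->", precip_data[row, col])
# March in year column 1 -> 4.1

# Wettest month on average
wettest = months[np.argmax(precip_data.mean(axis=1))]
print(wettest)  # March
```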


## 5. Broadcasting

Broadcasting allows NumPy to perform operations on arrays of different shapes without explicitly replicating data.

### 5.1 Broadcasting Examples

In [None]:
import numpy as np

# Scalar + Array (broadcasts scalar to all elements)
a = np.array([0, 1, 2])
print(a + 5)  # [5 6 7]

# 2D + 1D (the 1D array is broadcast across each row)
a = np.ones((3, 3))
b = np.arange(3)
print(a + b)
# [[1. 2. 3.]
#  [1. 2. 3.]
#  [1. 2. 3.]]

# Both arrays are broadcast: (3, 1) + (3,) -> (3, 3)
a = np.arange(3).reshape((3, 1))
b = np.arange(3)
print(a + b)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]
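
Whether two shapes can be combined follows a simple rule: compare them from the last dimension backwards, and each pair of sizes must either match or include a 1. The sketch below checks a few cases with `np.broadcast_shapes` (available in NumPy 1.20 and later) and shows the error raised for incompatible shapes:

```python
import numpy as np

# Shapes are compared from the trailing dimension; sizes must match or be 1
print(np.broadcast_shapes((3, 1), (3,)))    # (3, 3)
print(np.broadcast_shapes((10, 3), (3,)))   # (10, 3)

# Incompatible: trailing sizes 2 and 3 differ and neither is 1
try:
    np.ones((3, 2)) + np.arange(3)
except ValueError as err:
    print("cannot broadcast:", err)

# Adding an axis makes the shapes compatible: (3, 2) + (3, 1) -> (3, 2)
fixed = np.ones((3, 2)) + np.arange(3)[:, np.newaxis]
print(fixed.shape)  # (3, 2)
```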

### 5.2 Broadcasting in 2D: Centering Data

In [None]:
import numpy as np

# Data matrix (10 observations x 3 features)
X = np.random.random((10, 3))

# Compute mean of each column (feature)
Xmean = X.mean(axis=0)

# Center the data (subtract mean from each column)
X_centered = X - Xmean

# Verify mean is now ~0 for each feature
print(X_centered.mean(axis=0))
# [~0. ~0. ~0.]

# This works because Xmean has shape (3,) which broadcasts
# to match X's shape (10, 3) by replicating across rows
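
Centering each row instead of each column needs one extra step, because row means of shape `(10,)` do not line up with `(10, 3)`. A sketch of one way to handle this with `keepdims=True`; the random generator and seed are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 3))

# Column means have shape (3,), which lines up with X's last axis directly
X_col_centered = X - X.mean(axis=0)

# Row means would have shape (10,), which does not broadcast against (10, 3).
# keepdims=True keeps the result as (10, 1) so it broadcasts across columns.
row_means = X.mean(axis=1, keepdims=True)
X_row_centered = X - row_means

print(row_means.shape)                                # (10, 1)
print(np.allclose(X_row_centered.mean(axis=1), 0.0))  # True
```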

### 5.3 Broadcasting: Plotting Functions

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Create a 2D grid using broadcasting
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]

# Broadcasting: x is (50,), y is (50,1)
# Result z is (50, 50)
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

plt.imshow(z, origin='lower', extent=[0, 5, 0, 5],
           cmap='viridis')
plt.colorbar()
plt.title('Broadcasting Example')
plt.show()

# This creates a 2D function from 1D arrays!

## 6. Boolean Operations and Masking

### 6.1 Comparison Operators: Element-wise Comparisons

In [None]:
import numpy as np

# Data array
x = np.array([1, 2, 3, 4, 5])

# Comparison operators return boolean arrays
print(x < 3)
# [ True  True False False False]

print(x > 3)
# [False False False  True  True]

print(x == 3)
# [False False  True False False]

# Multiple comparisons (use & and |, not 'and' and 'or')
print((x > 1) & (x < 5))
# [False  True  True  True False]

# Count how many satisfy condition
print(np.sum(x > 2))  # 3 (True=1, False=0)

# Percentage
print(np.mean(x > 2))  # 0.6 (60%)

### 6.2 Boolean Indexing: Filtering Data

In [None]:
import numpy as np

# Data array
x = np.array([1, 2, 3, 4, 5])

# Create boolean mask
mask = x < 3
print(mask)
# [ True  True False False False]

# Use mask to filter data
print(x[mask])  # [1 2]

# Can use directly without creating variable
print(x[x < 3])  # [1 2]

# More complex example with 2D
np.random.seed(0)
X = np.random.randint(10, size=(3, 4))
print(X)
# [[5 0 3 3]
#  [7 9 3 5]
#  [2 4 7 6]]

print(X[X < 5])  # [0 3 3 3 2 4] (flattened)

### 6.3 Boolean Operators: Combining Conditions

In [None]:
import numpy as np

# Example: Rainy days analysis
rainfall_inches = np.array([0.2, 0.5, 0.0, 1.2, 0.8, 0.0, 0.3])

# Multiple criteria with & (AND) and | (OR)
print((rainfall_inches > 0) & (rainfall_inches < 1))
# [ True  True False False  True False  True]

print((rainfall_inches <= 0) | (rainfall_inches >= 1))
# [False False  True  True False  True False]

# Use np.sum() to count matches
print(np.sum((rainfall_inches > 0) & (rainfall_inches < 1)))  # 4

# IMPORTANT: Use & and | for arrays, not 'and' and 'or'!
# Also: always use parentheses around conditions

# NOT operator (~) inverts a boolean array
print(~(rainfall_inches > 0.5))  # NOT
# [ True  True  True False False  True  True]
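
To see why `and` and `or` cannot be used on arrays, it helps to trigger the error once. The sketch below uses a small made-up array:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# 'and' asks for the truth value of a whole array, which is ambiguous
try:
    result = (x > 1) and (x < 5)
except ValueError as err:
    print("ValueError:", err)

# '&' works element by element
print((x > 1) & (x < 5))  # [False  True  True  True False]

# Without parentheses, & binds tighter than the comparisons:
# x > 1 & x < 5 is parsed as x > (1 & x) < 5, a chained comparison that also raises
```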

### 6.4 Example: Analyzing Weather Data

In [None]:
import numpy as np

# Weather data
np.random.seed(1)
rainfall = np.random.random(365) * 2  # inches per day

# Analysis
rainy_days = np.sum(rainfall > 0.5)
dry_days = np.sum(rainfall < 0.1)
median_precip = np.median(rainfall)
mean_precip = np.mean(rainfall)

print(f"Rainy days (>0.5 in): {rainy_days}")
print(f"Dry days (<0.1 in): {dry_days}")
print(f"Median: {median_precip:.2f} in")
print(f"Mean: {mean_precip:.2f} in")

# Get all rainy day amounts
rainy = rainfall[rainfall > 0.5]
print(f"Average rainfall on rainy days: {rainy.mean():.2f} in")
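
Masks select values; `np.where` goes one step further and builds a new array from a condition. The sketch below uses a short made-up rainfall record rather than the random data above so the results can be checked by eye:

```python
import numpy as np

rainfall = np.array([0.0, 0.05, 0.4, 1.3, 0.7])  # hypothetical daily totals (inches)

# np.where(condition, a, b) picks from a where True, from b where False
labels = np.where(rainfall > 0.5, "rainy", "dry")
print(labels)  # ['dry' 'dry' 'dry' 'rainy' 'rainy']

# With only a condition, np.where returns the matching indices
rainy_idx = np.where(rainfall > 0.5)[0]
print(rainy_idx)  # [3 4]

# Cap extreme values at 1.0 without a loop
capped = np.where(rainfall > 1.0, 1.0, rainfall)
print(capped)
```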

## 7. Advanced Indexing

### 7.1 Fancy Indexing: Using Arrays as Indices

In [None]:
import numpy as np

# Simple array
x = np.array([51, 92, 14, 71, 60, 20, 82, 86, 74, 74])

# Select specific elements by index
ind = [3, 7, 4]
print(x[ind])  # [71 86 60]

# 2D indexing
X = np.arange(12).reshape((3, 4))
print(X)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Select specific rows and columns
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
print(X[row, col])  # [ 2  5 11]

# Select a subset of rows
print(X[[0, 2]])
# [[ 0  1  2  3]
#  [ 8  9 10 11]]

### 7.2 Modifying Values with Fancy Indexing

In [None]:
import numpy as np

# Start with array of zeros
x = np.zeros(10)
print(x)  # [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

# Set specific indices
ind = [0, 3, 5]
x[ind] = 99
print(x)  # [99.  0.  0. 99.  0. 99.  0.  0.  0.  0.]

# Increment specific values
x[ind] += 1
print(x)  # [100.   0.   0. 100.   0. 100.   0.   0.   0.   0.]

# Repeated indices - behavior is subtle!
x = np.zeros(5)
i = [0, 0, 0]
x[i] += 1
print(x)  # [1. 0. 0. 0. 0.] - only incremented once!
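
If the intent is to increment once per repeated index, the ufunc method `np.add.at` performs the operation unbuffered. A minimal sketch:

```python
import numpy as np

x = np.zeros(5)
i = [0, 0, 0]

np.add.at(x, i, 1)   # unbuffered: applies the increment once per index entry
print(x)             # [3. 0. 0. 0. 0.]

# Counting occurrences is a common use; np.bincount is an alternative for this
counts = np.zeros(4, dtype=int)
np.add.at(counts, [1, 1, 3, 1], 1)
print(counts)                                  # [0 3 0 1]
print(np.bincount([1, 1, 3, 1], minlength=4))  # [0 3 0 1]
```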

### 7.3 Combined Indexing: Mix and Match

In [None]:
import numpy as np

# 2D array
X = np.arange(12).reshape((3, 4))
print(X)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Fancy indexing + slicing
# Select rows 0 and 2, columns 1 and 3
result = X[[0, 2]][:, [1, 3]]
print(result)
# [[ 1  3]
#  [ 9 11]]

# Boolean mask built from one column, used to select rows
mask = X[:, 1] > 5
print(mask)  # [False False  True]
print(X[mask])
# [[ 8  9 10 11]]
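
The chained form `X[[0, 2]][:, [1, 3]]` works but creates an intermediate copy. One alternative, shown here as a sketch, is `np.ix_`, which builds the row and column selectors in one step. It also highlights how this differs from passing the two lists directly.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))

rows = [0, 2]
cols = [1, 3]

block = X[np.ix_(rows, cols)]   # all combinations of the listed rows and columns
print(block)
# [[ 1  3]
#  [ 9 11]]

# Passing the two lists directly pairs them up element by element instead
print(X[rows, cols])  # [ 1 11]
```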

## 8. Sorting

### 8.1 Sorting Arrays

In [None]:
import numpy as np

# Unsorted array
x = np.array([2, 1, 4, 3, 5])

# Sort (returns new sorted array)
print(np.sort(x))  # [1 2 3 4 5]

# argsort: returns indices that would sort the array
i = np.argsort(x)
print(i)  # [1 0 3 2 4]
print(x[i])  # [1 2 3 4 5]

# Sort in descending order
print(x[np.argsort(x)[::-1]])  # [5 4 3 2 1]

# Sort 2D array along axis
np.random.seed(42)
X = np.random.randint(0, 10, (4, 6))
print(X)

# Sort each column
print(np.sort(X, axis=0))

# Sort each row
print(np.sort(X, axis=1))

### 8.2 Practical Sorting: Finding Top N Elements

In [None]:
import numpy as np

# Array of values
x = np.array([7, 2, 3, 1, 6, 5, 4])

# Partition: the 3 smallest values end up on the left, the rest on the right
print(np.partition(x, 3))
# e.g. [2 1 3 4 6 5 7] (order within each side is arbitrary)

# Get indices for partition
i = np.argpartition(x, 3)
print(x[i])
# e.g. [2 1 3 4 6 5 7]

# Find top K values efficiently
# Partition so K largest are on the right
K = 3
partitioned = np.partition(x, -K)
print(partitioned[-K:])  # [5 6 7] (not necessarily sorted)

# For sorted top K, use argsort
top_k_sorted = x[np.argsort(x)[-K:]]
print(top_k_sorted)  # [5 6 7]
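
When the positions of the top values are needed (for example, which sensors recorded the largest readings), `np.argpartition` can be combined with a sort of just the K selected entries. A sketch using the same array:

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
K = 3

top_idx = np.argpartition(x, -K)[-K:]            # indices of the K largest, unordered
top_idx = top_idx[np.argsort(x[top_idx])[::-1]]  # sort only those K, descending

print(top_idx)     # [0 4 5]
print(x[top_idx])  # [7 6 5]
```

For large arrays and small K this avoids sorting the full array, which is the main reason to prefer partitioning.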

## 9. Structured Data in NumPy

### 9.1 Structured Arrays: Mixing Data Types

In [None]:
import numpy as np

# Create structured array for person data
data = np.zeros(4, dtype={
    'names': ('name', 'age', 'weight'),
    'formats': ('U10', 'i4', 'f8')
})

# Fill data
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

print(data)
# [('Alice', 25, 55. ) ('Bob', 45, 85.5)
#  ('Cathy', 37, 68. ) ('Doug', 19, 61.5)]

# Access by field name
print(data['name'])  # ['Alice' 'Bob' 'Cathy' 'Doug']
print(data['age'])   # [25 45 37 19]

# Filter
print(data[data['age'] < 30]['name'])  # ['Alice' 'Doug']
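
Structured arrays can also be sorted by a named field and aggregated per field. The sketch below rebuilds the same array so it runs on its own:

```python
import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

by_age = np.sort(data, order='age')   # sort whole records by the 'age' field
print(by_age['name'])                 # ['Doug' 'Alice' 'Cathy' 'Bob']

print(data['weight'].mean())          # 67.5
print(data[0])                        # one full record
```

For anything beyond simple cases like this, the Pandas DataFrame covered in the following sections is usually more convenient.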

---

## 10. Introduction to Pandas

Pandas is built on top of NumPy and provides DataFrames - labeled, 2D data structures that are like spreadsheets or SQL tables, but in Python. It's the industry standard for data manipulation and analysis.

### Why Pandas After NumPy?
- **NumPy**: Fast arrays, but no labels or structure
- **Pandas**: Labels + missing data + heterogeneous types
- Access data by name, not just index position
- Built-in tools for reading CSV, Excel, SQL
- Better for messy, real-world data

In [None]:
# Standard Import Convention
import pandas as pd  # ALWAYS use this convention!
import numpy as np   # Often used together

# Check version
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

### 10.1 NumPy vs Pandas: A Quick Comparison

In [None]:
# NumPy array - access by integer index only
data_numpy = np.array([[1, 2, 3],
                       [4, 5, 6]])
print("NumPy array:")
print(data_numpy)
print(f"Element at [0, 1]: {data_numpy[0, 1]}")  # 2
print("Which column is this? Must remember!\n")

# Pandas DataFrame - access by label or index
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6]],
                  columns=['A', 'B', 'C'])
print("Pandas DataFrame:")
print(df)
print(f"\nElement 'B' in row 0: {df['B'][0]}")  # 2
print("Clear what column 'B' means!")

## 11. Pandas Core Objects

Pandas provides three fundamental data structures:
- **Series**: 1D labeled array (like a column)
- **DataFrame**: 2D labeled table (like a spreadsheet)
- **Index**: Row and column labels

### 11.1 The Pandas Series: 1D Labeled Array

In [None]:
# Create Series from list
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print("Series with default integer index:")
print(data)
print()

# Access values and index
print("Values:", data.values)
print("Index:", data.index)
print()

# Access like array
print("Element at index 1:", data[1])
print("Slice [1:3]:")
print(data[1:3])

In [None]:
# Series with custom string index (like a dictionary!)
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print("Series with custom index:")
print(data)
print()

# Access by label
print(f"Element 'b': {data['b']}")

# Series from dictionary
population_dict = {
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
}
population = pd.Series(population_dict)
print("\nSeries from dictionary:")
print(population)

### 11.2 The DataFrame: 2D Labeled Data Structure

In [None]:
# Create DataFrame from dictionary of Series
area_dict = {'California': 423967, 'Texas': 695662,
             'New York': 141297, 'Florida': 170312,
             'Illinois': 149995}
area = pd.Series(area_dict)

states = pd.DataFrame({'population': population,
                       'area': area})
print("DataFrame from Series:")
print(states)
print()

print("Index (row labels):", states.index.tolist())
print("Columns:", states.columns.tolist())
print()

# Add new column
states['density'] = states['population'] / states['area']
print("DataFrame with new 'density' column:")
print(states)

In [None]:
# Multiple ways to create DataFrames

# From dictionary of lists
df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6]})
print("From dict of lists:")
print(df1)
print()

# From list of dictionaries
df2 = pd.DataFrame([{'a': 1, 'b': 2},
                    {'a': 3, 'b': 4, 'c': 5}])
print("From list of dicts:")
print(df2)
print()

# From NumPy array
df3 = pd.DataFrame(np.random.rand(3, 2),
                   columns=['foo', 'bar'],
                   index=['a', 'b', 'c'])
print("From NumPy array:")
print(df3)

## 12. Data Indexing and Selection

Pandas provides powerful indexing capabilities through `loc` (label-based) and `iloc` (integer-based) indexers.

### 12.1 The Indexers: loc and iloc

In [None]:
# loc: Label-based indexing (label slices include the end label)
print("Using loc (label-based):")
print("First 3 rows, first 2 columns:")
print(states.loc[:'New York', :'area'])
print()

# iloc: Integer-based indexing
print("Using iloc (integer-based):")
print("First 3 rows, first 2 columns:")
print(states.iloc[:3, :2])
print()

# Boolean masking
high_density = states['density'] > 100
print("High density states (> 100):")
print(states[high_density])
print()

# Combining loc with boolean mask
print("High density states - only population and density:")
print(states.loc[high_density, ['population', 'density']])

## 13. Operations in Pandas

Pandas inherits NumPy's ufuncs but adds two powerful features:
1. **Index Preservation**: Labels are maintained through operations
2. **Index Alignment**: Operations automatically align on matching indices

### 13.1 Ufuncs with Index Preservation

In [None]:
# NumPy ufuncs preserve index!
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
print("Original Series:")
print(ser)
print()

print("np.exp(ser) - index preserved:")
print(np.exp(ser))
print()

# Index Alignment: Operations align on matching indices
area_top3 = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                       'California': 423967})
pop_top3 = pd.Series({'California': 38332521, 'Texas': 26448193,
                      'New York': 19651127})

# Division aligns indices automatically!
density = pop_top3 / area_top3
print("Automatic index alignment:")
print(density)
print("\nNote: NaN where indices don't match!")

## 14. Handling Missing Data

Real-world data is rarely clean! Pandas provides tools to detect, remove, and fill missing data.

### 14.1 Detecting and Handling Missing Data

In [None]:
# Create data with missing values
data = pd.Series([1, np.nan, 'hello', None])
print("Data with missing values:")
print(data)
print()

# Detect missing values
print("isnull():")
print(data.isnull())
print()

# Drop missing values
print("dropna():")
print(data.dropna())
print()

# Fill missing values
data_numeric = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print("Original data:")
print(data_numeric)
print()

print("fillna(0):")
print(data_numeric.fillna(0))
print()

# fillna(method='ffill') is deprecated in recent pandas; use ffill() instead
print("ffill() - forward fill:")
print(data_numeric.ffill())

## 15. Hierarchical Indexing (MultiIndex)

MultiIndex allows you to store higher-dimensional data in 1D Series or 2D DataFrames.

### 15.1 Creating and Using MultiIndex

In [None]:
# Create MultiIndex Series
index = pd.MultiIndex.from_tuples([
    ('California', 2000), ('California', 2010),
    ('New York', 2000), ('New York', 2010),
    ('Texas', 2000), ('Texas', 2010)
])
populations = [33871648, 37253956, 18976457,
               19378102, 20851820, 25145561]
pop = pd.Series(populations, index=index)
pop.index.names = ['state', 'year']

print("MultiIndex Series:")
print(pop)
print()

# Access all data for year 2010
print("All states in 2010:")
print(pop[:, 2010])
print()

# Access all data for California
print("California across years:")
print(pop['California'])
print()

# Unstack: convert to regular DataFrame
print("Unstacked (MultiIndex → DataFrame):")
print(pop.unstack())

## 16. Combining Datasets: Concat and Append

Pandas provides tools to combine data from multiple sources.

### 16.1 Using pd.concat()

## Summary

This notebook covered the fundamental concepts of NumPy and Pandas for numerical computing and data manipulation:

### NumPy (Sections 1-9):
1. **Introduction to NumPy** - Understanding speed advantages and basic arrays
2. **Array Fundamentals** - Creating, indexing, reshaping, and combining arrays
3. **Array Operations** - Vectorized arithmetic and universal functions
4. **Aggregations** - Statistical operations and the axis parameter
5. **Broadcasting** - Operations on different array shapes
6. **Boolean Operations** - Masking and filtering data
7. **Advanced Indexing** - Fancy indexing for complex selections
8. **Sorting** - Organizing and finding top elements
9. **Structured Arrays** - Mixing data types (though Pandas is usually better)

### Pandas (Sections 10-16):
10. **Introduction to Pandas** - DataFrames and why Pandas matters
11. **Pandas Core Objects** - Series, DataFrame, and Index
12. **Data Indexing and Selection** - loc, iloc, and boolean masking
13. **Operations in Pandas** - Index alignment and ufuncs
14. **Handling Missing Data** - Detecting, dropping, and filling NaN values
15. **Hierarchical Indexing** - MultiIndex for higher-dimensional data
16. **Combining Datasets** - concat and append operations

### Key Takeaways:
- **NumPy** is 10-100x faster than Python lists for numerical operations
- Use vectorized operations instead of loops
- Broadcasting enables operations on different shapes without copying data
- Boolean masking is powerful for filtering data
- NumPy is the foundation for the entire PyData ecosystem
- **Pandas** builds on NumPy to provide labeled, structured data with DataFrames
- Index alignment automatically handles mismatched data
- Missing data handling is built into Pandas
- MultiIndex enables working with higher-dimensional data efficiently

### Next Steps:
- **Week 5**: Advanced Pandas (merge, join, groupby, pivot tables)
- **Visualization**: Matplotlib for data visualization
- **Applications**: Real-world data analysis in civil engineering

---

**Congratulations!** You now have the foundational tools for numerical computing and data analysis in Python. These skills form the basis for all data science work in civil engineering and beyond.