# Math & Statistics in NumPy
Essential aggregation methods:  
- **Descriptive Stats**: `.sum()`, `.mean()`, `.std()`, `.var()`  
- **Extrema**: `.min()`, `.max()`, `.argmin()`, `.argmax()`  

Key features:  
- Vectorized operations (fast)  
- Axis parameter for dimension control  
- Handle NaN values with `np.nan*` variants  

In [1]:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Basic stats
print("Sum:", arr.sum())        # 15
print("Mean:", arr.mean())      # 3.0
print("Std dev:", arr.std())    # 1.414...
print("Variance:", arr.var())   # 2.0
print("Min:", arr.min())        # 1
print("Max:", arr.max())        # 5

Sum: 15
Mean: 3.0
Std dev: 1.4142135623730951
Variance: 2.0
Min: 1
Max: 5
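
The standard deviation and variance above use NumPy's default `ddof=0`, which divides by N (the population formula). A minimal sketch of how the `ddof` parameter switches to the sample formula (divide by N - 1), reusing the same array:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Default ddof=0: divide by N (population statistics)
pop_var = arr.var()
pop_std = arr.std()

# ddof=1: divide by N - 1 (sample statistics, Bessel's correction)
sample_var = arr.var(ddof=1)
sample_std = arr.std(ddof=1)

print("Population variance:", pop_var)   # 2.0
print("Sample variance:", sample_var)    # 2.5
print("Population std:", pop_std)
print("Sample std:", sample_std)
```

Which one is appropriate depends on whether the array is the whole population or a sample drawn from it.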


### Axis Parameter (Crucial for Multidimensional Data)
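- `axis=0`: Collapses the rows, giving one result per column (column-wise)  
- `axis=1`: Collapses the columns, giving one result per row (row-wise)  
- `axis=None` (default): Reduces over all elements (global operation)  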


In [2]:


matrix = np.array([[1, 2], 
                   [3, 4],
                   [5, 6]])

# Column-wise operations
print("Column sums:", matrix.sum(axis=0))     # [9,12]
print("Column means:", matrix.mean(axis=0))   # [3,4]

# Row-wise operations
print("\nRow maxes:", matrix.max(axis=1))      # [2,4,6]
print("Row std dev:", matrix.std(axis=1))     # [0.5, 0.5, 0.5]

# Global operation
print("\nGlobal mean:", matrix.mean())         # 3.5

Column sums: [ 9 12]
Column means: [3. 4.]

Row maxes: [2 4 6]
Row std dev: [0.5 0.5 0.5]

Global mean: 3.5
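
One way to see what `axis` does is to look at result shapes: the axis you name is the one that disappears. The sketch below also uses `keepdims=True`, a standard argument of NumPy reductions, to keep the collapsed axis so the result broadcasts back against the original matrix.

```python
import numpy as np

matrix = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])  # shape (3, 2)

col_means = matrix.mean(axis=0)  # axis 0 (rows) collapsed -> shape (2,)
row_means = matrix.mean(axis=1)  # axis 1 (columns) collapsed -> shape (3,)
print(matrix.shape, col_means.shape, row_means.shape)  # (3, 2) (2,) (3,)

# keepdims=True keeps the collapsed axis with length 1,
# so the result can be subtracted from the matrix row by row
row_means_2d = matrix.mean(axis=1, keepdims=True)  # shape (3, 1)
centered = matrix - row_means_2d
print(centered)
```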


### Finding Extremum Locations
- `.argmin()`: Index of minimum value  
- `.argmax()`: Index of maximum value  
- `.argsort()`: Indices that would sort array  

These methods return **positions** (indices) rather than values.  

In [3]:
arr = np.array([3, 1, 4, 2, 5])

print("Min index:", arr.argmin())  # 1 (value=1 at index1)
print("Max index:", arr.argmax())  # 4 (value=5 at index4)

# 2D with axis
matrix = np.array([[1, 9], 
                   [5, 3],
                   [7, 2]])

print("\nColumn min indices:", matrix.argmin(axis=0))  # [0,2]
print("Row max indices:", matrix.argmax(axis=1))      # [1,0,0]

Min index: 1
Max index: 4

Column min indices: [0 2]
Row max indices: [1 0 0]
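
`.argsort()` is listed above but not demonstrated. The following sketch shows it on the same array, plus `np.unravel_index` for turning the flat index returned by a global `.argmax()` into a (row, column) pair. The variable names are illustrative.

```python
import numpy as np

arr = np.array([3, 1, 4, 2, 5])

order = arr.argsort()        # indices that sort arr in ascending order
print(order)                 # [1 3 0 2 4]
print(arr[order])            # [1 2 3 4 5]

# Indices of the two largest values, largest first
top2 = order[-2:][::-1]
print(top2)                  # [4 2]

matrix = np.array([[1, 9],
                   [5, 3],
                   [7, 2]])

# Without axis, argmax returns an index into the flattened array
flat = matrix.argmax()
row, col = np.unravel_index(flat, matrix.shape)
print(int(flat), int(row), int(col))  # 1 0 1
```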


### Special NaN Variants
Standard functions return NaN if any value is NaN; the `np.nan*` variants skip NaN values instead:  
```python
np.array([1, np.nan]).mean()  # nan
```

In [4]:


arr = np.array([1, 2, np.nan, 4])

print("Standard mean:", arr.mean())          # nan
print("NaN-safe mean:", np.nanmean(arr))     # 2.333...

# Full set of nan variants
print("NaN-safe sum:", np.nansum(arr))       # 7.0
print("NaN-safe max:", np.nanmax(arr))       # 4.0
print("NaN-safe min index:", np.nanargmin(arr))  # 0 (value=1)

Standard mean: nan
NaN-safe mean: 2.3333333333333335
NaN-safe sum: 7.0
NaN-safe max: 4.0
NaN-safe min index: 0
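
To make clear what the NaN-safe variants do, here is a small sketch that reproduces `np.nanmean` by hand with a boolean mask from `np.isnan`. It assumes the same array as above. The manual version is only for illustration, since the built-in variants are the simpler choice.

```python
import numpy as np

arr = np.array([1, 2, np.nan, 4])

# Boolean mask marking the NaN entries
mask = np.isnan(arr)
print("NaN count:", mask.sum())  # 1

# Dropping NaNs by hand gives the same mean as np.nanmean
manual_mean = arr[~mask].mean()
print(np.isclose(manual_mean, np.nanmean(arr)))  # True

# Other reductions have NaN-safe variants as well
print("NaN-safe std:", np.nanstd(arr))
print("NaN-safe median:", np.nanmedian(arr))  # 2.0
```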


### Additional Statistical Functions
- **Percentiles**: `np.percentile(arr, [25,50,75])`  
- **Median**: `np.median(arr)`  
- **Correlation**: `np.corrcoef(arr1, arr2)`  
- **Histograms**: `np.histogram(arr, bins=10)`  

In [5]:
data = np.random.normal(0, 1, 1000)  # 1000 normal samples

# Percentiles
quartiles = np.percentile(data, [25, 50, 75])
print("25th/50th/75th percentiles:", quartiles)

# Median vs Mean
print("\nMedian:", np.median(data))
print("Mean:", np.mean(data))

# Correlation
x = np.array([1,2,3,4,5])
y = np.array([2,4,5,4,5])
corr = np.corrcoef(x, y)[0,1]
print("\nCorrelation:", corr)  # ~0.77

# Histogram
counts, bins = np.histogram(data, bins=10)
print("\nHistogram counts:", counts)

25th/50th/75th percentiles: [-0.61068438  0.01503948  0.67054832]

Median: 0.01503947653576517
Mean: 0.017424610512133454

Correlation: 0.7745966692414834

Histogram counts: [  5  13  63 138 226 252 197  70  34   2]
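
A few relationships between these functions are easy to check directly. The sketch below uses a seeded generator (`np.random.default_rng(0)`, chosen here only for reproducibility and not used in the cell above) and verifies that `np.histogram` returns one more bin edge than counts, and that the 50th percentile equals the median.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded for reproducibility (an assumption of this sketch)
data = rng.normal(0, 1, 1000)

# np.histogram returns counts per bin and the bin edges
counts, edges = np.histogram(data, bins=10)
print(len(counts), len(edges))   # 10 11
print(counts.sum())              # 1000

# The 50th percentile is the median
median = np.median(data)
p50 = np.percentile(data, 50)
print(np.isclose(median, p50))   # True

# Interquartile range from the 25th and 75th percentiles
q1, q3 = np.percentile(data, [25, 75])
print("IQR:", q3 - q1)
```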


### Why NumPy > Python Built-ins
| Operation | Python List | NumPy | Speedup |  
|-----------|-------------|-------|---------|  
| sum()     | 100 ms      | 1 ms  | 100x    |  
| mean()    | 120 ms      | 1.5 ms| 80x     |  
| std()     | 150 ms      | 2 ms  | 75x     |  

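Reported for 10 million elements (Python 3.10, NumPy 1.24). Treat these as rough orders of magnitude: speedups vary with hardware and dtype, and the timed run below measures a smaller gain for `sum()`.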

In [6]:
import time

size = 10_000_000
py_list = list(range(size))
np_arr = np.arange(size)

# Sum benchmark (perf_counter is a high-resolution clock intended for timing)
start = time.perf_counter()
py_sum = sum(py_list)
py_time = time.perf_counter() - start

start = time.perf_counter()
np_sum = np_arr.sum()
np_time = time.perf_counter() - start

print(f"Python sum: {py_time:.5f} sec")
print(f"NumPy sum: {np_time:.5f} sec")
print(f"Speed ratio: {py_time/np_time:.1f}x")

Python sum: 0.05526 sec
NumPy sum: 0.00674 sec
Speed ratio: 8.2x
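
Single-shot timings like the one above are noisy. A more robust sketch, assuming the standard-library `timeit` module and a smaller array so it runs quickly, repeats each measurement and keeps the best time. The array size and repeat count are arbitrary choices.

```python
import timeit
import numpy as np

size = 1_000_000
py_list = list(range(size))
np_arr = np.arange(size, dtype=np.int64)  # explicit dtype avoids overflow on 32-bit default ints

# Repeat each measurement and keep the minimum, which is least affected by noise
py_time = min(timeit.repeat(lambda: sum(py_list), number=1, repeat=5))
np_time = min(timeit.repeat(lambda: np_arr.sum(), number=1, repeat=5))

# Both approaches compute the same result
same_result = sum(py_list) == int(np_arr.sum())

print(f"Python sum: {py_time:.5f} sec")
print(f"NumPy sum: {np_time:.5f} sec")
print(f"Speed ratio: {py_time / np_time:.1f}x")
print("Same result:", same_result)
```

The ratio will differ between machines and runs, so no expected value is given here.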



### Real-World Use Cases
1. **Data Analysis**:  
   `df.to_numpy().mean(axis=0)` (pandas DataFrames are backed by NumPy arrays)  
2. **Feature Scaling**:  
   `(data - data.mean()) / data.std()`  
3. **Anomaly Detection**:  
   `z_scores = np.abs(data - mean) / std`  
4. **Model Evaluation**:  
   `np.mean(y_pred == y_true)` (accuracy)  

In [9]:
# Standardization (z-score normalization)
data = np.random.randint(0, 100, (100, 3))  # 100 samples, 3 features

mean = data.mean(axis=0)
std = data.std(axis=0)
scaled = (data - mean) / std

print("Original mean:", mean)
print("Scaled mean:", scaled.mean(axis=0))  # ~[0,0,0]
print("Scaled std:", scaled.std(axis=0))    # ~[1,1,1]

Original mean: [48.95 45.89 50.46]
Scaled mean: [-1.25455202e-16 -4.44089210e-18 -5.88418203e-17]
Scaled std: [1. 1. 1.]
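
The anomaly detection and model evaluation cases from the list above can be sketched in the same way. The data, labels, and the z-score threshold of 2 below are made up for illustration.

```python
import numpy as np

# Anomaly detection: flag points far from the mean in units of std
data = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 9.0, 11.0, 10.0, 50.0])
z_scores = np.abs(data - data.mean()) / data.std()

threshold = 2  # hypothetical cutoff; choose it to suit your data
outliers = np.where(z_scores > threshold)[0]
print("Outlier indices:", outliers)        # [9]
print("Outlier values:", data[outliers])   # [50.]

# Model evaluation: accuracy is the mean of a boolean match array
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])
accuracy = np.mean(y_pred == y_true)
print("Accuracy:", accuracy)               # 0.8
```

Note that a single extreme value inflates both the mean and the standard deviation, so a z-score cutoff can miss outliers in small samples.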
