# 🧰 Stage 4: Utility Skills & Integration

## 🎯 Objective
Learn essential utilities and how NumPy integrates with other libraries.

---

## Table of Contents
1. Working with Missing Data
2. Type Conversion & Casting
3. Performance Tricks
4. NumPy with Pandas
5. NumPy with ML Libraries
6. Practice Exercises

In [1]:
import numpy as np

---

## 1. Working with Missing Data

### 📚 Theory

**NaN** (Not a Number) represents missing or undefined values.

| Function | Description |
|----------|-------------|
| `np.nan` | Missing value constant |
| `np.isnan()` | Check for NaN |
| `np.nan_to_num()` | Replace NaN (and ±inf) with finite numbers |
| `np.nanmean()` | Mean ignoring NaN |
| `np.nansum()` | Sum ignoring NaN |

### ✅ Daily Use:
- Data cleaning
- Handling missing values
- Imputation
- Preprocessing
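
One detail behind the functions above: NaN never compares equal to anything, including itself, so an `==` test cannot find it. NaN is also a floating-point value, so an array that contains it is stored as float. A minimal sketch (the variable names are illustrative only):

```python
import numpy as np

arr = np.array([1, 2, np.nan])

# NaN != NaN, so comparing against np.nan matches nothing
eq_mask = arr == np.nan
# np.isnan is the reliable way to locate NaN
nan_mask = np.isnan(arr)

print("arr == np.nan:", eq_mask)
print("np.isnan(arr):", nan_mask)
# The integers were upcast to float so the array can hold NaN
print("dtype:", arr.dtype)
```

This is why the cells that follow always build their masks with `np.isnan()`.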

In [2]:
# Create array with NaN values
arr = np.array([1, 2, np.nan, 4, 5, np.nan, 7])
print("Array with NaN:", arr)
print()

# Check for NaN
is_nan = np.isnan(arr)
print("Is NaN mask:", is_nan)
print(f"Number of NaN values: {np.sum(is_nan)}")
print()

# Filter out NaN values
clean_arr = arr[~np.isnan(arr)]
print("Array without NaN:", clean_arr)

Array with NaN: [ 1.  2. nan  4.  5. nan  7.]

Is NaN mask: [False False  True False False  True False]
Number of NaN values: 2

Array without NaN: [1. 2. 4. 5. 7.]


In [3]:
# Replace NaN with specific value
arr = np.array([1, 2, np.nan, 4, 5, np.nan, 7])
print("Original:", arr)
print()

# Replace NaN with 0
replaced = np.nan_to_num(arr, nan=0)
print("NaN replaced with 0:", replaced)
print()

# Replace NaN with mean of non-NaN values
mean_val = np.nanmean(arr)
replaced_mean = np.where(np.isnan(arr), mean_val, arr)
print(f"NaN replaced with mean ({mean_val:.2f}):", replaced_mean)

Original: [ 1.  2. nan  4.  5. nan  7.]

NaN replaced with 0: [1. 2. 0. 4. 5. 0. 7.]

NaN replaced with mean (3.80): [1.  2.  3.8 4.  5.  3.8 7. ]
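
`np.nan_to_num()` also rewrites infinities, which is easy to overlook. The sketch below assumes a NumPy version that supports the `nan`, `posinf` and `neginf` keywords (1.17 or later); the replacement values are arbitrary choices for illustration.

```python
import numpy as np

arr = np.array([1.0, np.nan, np.inf, -np.inf])

# Defaults: NaN -> 0.0, +inf -> largest finite float, -inf -> most negative finite float
default = np.nan_to_num(arr)

# Explicit replacement for each special value (values chosen arbitrarily here)
custom = np.nan_to_num(arr, nan=0.0, posinf=1e6, neginf=-1e6)

print("Default:", default)
print("Custom: ", custom)
```

If only NaN should change, the `np.where(np.isnan(arr), value, arr)` pattern used above leaves infinities untouched.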


In [4]:
# Operations ignoring NaN
arr = np.array([1, 2, np.nan, 4, 5, np.nan, 7])
print("Array:", arr)
print()

# Regular operations include NaN (result is NaN)
print(f"Regular mean (with NaN): {np.mean(arr)}")
print(f"Regular sum (with NaN): {np.sum(arr)}")
print()

# NaN-aware operations
print(f"Mean (ignoring NaN): {np.nanmean(arr):.2f}")
print(f"Sum (ignoring NaN): {np.nansum(arr):.2f}")
print(f"Max (ignoring NaN): {np.nanmax(arr):.2f}")
print(f"Min (ignoring NaN): {np.nanmin(arr):.2f}")

Array: [ 1.  2. nan  4.  5. nan  7.]

Regular mean (with NaN): nan
Regular sum (with NaN): nan

Mean (ignoring NaN): 3.80
Sum (ignoring NaN): 19.00
Max (ignoring NaN): 7.00
Min (ignoring NaN): 1.00


In [5]:
# Practical: Handle missing data in 2D
data = np.array([[1, 2, 3],
                 [4, np.nan, 6],
                 [7, 8, np.nan],
                 [10, np.nan, 12]])

print("Data with missing values:\n", data)
print()

# Count NaN per column
nan_per_col = np.sum(np.isnan(data), axis=0)
print("NaN count per column:", nan_per_col)
print()

# Mean per column (ignoring NaN)
col_means = np.nanmean(data, axis=0)
print("Mean per column (ignoring NaN):", col_means)

Data with missing values:
 [[ 1.  2.  3.]
 [ 4. nan  6.]
 [ 7.  8. nan]
 [10. nan 12.]]

NaN count per column: [0 2 1]

Mean per column (ignoring NaN): [5.5 5.  7. ]
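
The per-column means can then fill each column's gaps in a single step. This is a sketch of one possible imputation approach, not the only one: `np.where` broadcasts the 1-D `col_means` across the rows, so every NaN receives the mean of its own column.

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, np.nan, 6],
                 [7, 8, np.nan],
                 [10, np.nan, 12]])

# Column means computed from the non-NaN entries only
col_means = np.nanmean(data, axis=0)

# col_means has shape (3,) and broadcasts against data's shape (4, 3)
filled = np.where(np.isnan(data), col_means, data)

print("Filled data:\n", filled)
```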


---

## 2. Type Conversion & Casting

### 📚 Theory

| Method | Description |
|--------|-------------|
| `.astype()` | Convert data type |
| `.tolist()` | Convert to Python list |
| `.dtype` | Check current type |

### Common Types:
- `int32`, `int64` - Integers
- `float32`, `float64` - Floats
- `bool` - Boolean
- `str` - String

### ✅ Daily Use:
- Memory optimization
- Type compatibility
- Data export

In [6]:
# Type conversion with astype()
arr_float = np.array([1.5, 2.7, 3.9, 4.1])
print("Float array:", arr_float)
print(f"Data type: {arr_float.dtype}")
print()

# Convert to integer (truncates decimal)
arr_int = arr_float.astype(np.int32)
print("Converted to int:", arr_int)
print(f"Data type: {arr_int.dtype}")
print()

# Convert to string
arr_str = arr_float.astype(str)
print("Converted to string:", arr_str)
print(f"Data type: {arr_str.dtype}")

Float array: [1.5 2.7 3.9 4.1]
Data type: float64

Converted to int: [1 2 3 4]
Data type: int32

Converted to string: ['1.5' '2.7' '3.9' '4.1']
Data type: <U32
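
Because `.astype()` to an integer type truncates toward zero instead of rounding, it helps to round first when the nearest integer is wanted. A small sketch; note that `np.rint` rounds exact halves to the nearest even integer.

```python
import numpy as np

arr = np.array([1.5, 2.7, -2.7, 4.1])

# Casting alone drops the fractional part (toward zero)
truncated = arr.astype(np.int32)

# Round to the nearest integer first, then cast
rounded = np.rint(arr).astype(np.int32)

print("Truncated:", truncated)
print("Rounded:  ", rounded)
```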


In [7]:
# Boolean conversion
arr = np.array([0, 1, 2, 0, 3, 0])
print("Original:", arr)
print()

# Convert to boolean (0 = False, non-zero = True)
arr_bool = arr.astype(bool)
print("As boolean:", arr_bool)
print(f"Data type: {arr_bool.dtype}")

Original: [0 1 2 0 3 0]

As boolean: [False  True  True False  True False]
Data type: bool


In [8]:
# Convert to Python list
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print("NumPy array:\n", arr)
print(f"Type: {type(arr)}")
print()

# Convert to list
python_list = arr.tolist()
print("Python list:", python_list)
print(f"Type: {type(python_list)}")

NumPy array:
 [[1 2 3]
 [4 5 6]]
Type: <class 'numpy.ndarray'>

Python list: [[1, 2, 3], [4, 5, 6]]
Type: <class 'list'>


In [9]:
# Memory efficiency: float32 vs float64
arr_64 = np.array([1.5, 2.5, 3.5], dtype=np.float64)
arr_32 = np.array([1.5, 2.5, 3.5], dtype=np.float32)

print("Float64 array:", arr_64)
print(f"Memory per element: {arr_64.itemsize} bytes")
print(f"Total memory: {arr_64.nbytes} bytes")
print()

print("Float32 array:", arr_32)
print(f"Memory per element: {arr_32.itemsize} bytes")
print(f"Total memory: {arr_32.nbytes} bytes")
print()

print(f"Memory saved: {arr_64.nbytes - arr_32.nbytes} bytes")

Float64 array: [1.5 2.5 3.5]
Memory per element: 8 bytes
Total memory: 24 bytes

Float32 array: [1.5 2.5 3.5]
Memory per element: 4 bytes
Total memory: 12 bytes

Memory saved: 12 bytes
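
The memory saving comes with reduced precision. As a rough illustration, `float32` carries about 6 significant decimal digits against about 15 for `float64`, so large or finely spaced values may not survive the conversion:

```python
import numpy as np

# 2**24 + 1 is exactly representable in float64 but not in float32
big = np.array([16_777_217], dtype=np.float64)
as32 = big.astype(np.float32)

print("float64 value:", int(big[0]))
print("float32 value:", int(as32[0]))

# Approximate number of reliable decimal digits for each type
digits32 = np.finfo(np.float32).precision
digits64 = np.finfo(np.float64).precision
print(f"Decimal digits: float32={digits32}, float64={digits64}")
```

Whether that loss matters depends on the data; it is worth checking before downcasting.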


---

## 3. Performance Tricks

### 📚 Theory

**Vectorization** is key to NumPy performance.

### Best Practices:
1. ✅ Avoid loops - use vectorized operations
2. ✅ Use built-in NumPy functions
3. ✅ Pre-allocate arrays when possible
4. ✅ Use views instead of copies
5. ✅ Use appropriate data types

### ✅ Daily Use:
- Big data operations
- Real-time processing
- ML model training
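
Of the best practices listed above, the view/copy distinction is the hardest to see in code. A minimal sketch: basic slicing returns a view that shares memory with the original, while integer-array ("fancy") indexing returns an independent copy.

```python
import numpy as np

base = np.arange(10)

view = base[2:6]            # basic slice: a view, no data copied
copy = base[[2, 3, 4, 5]]   # fancy indexing: a new array with its own data

print("View shares memory:", np.shares_memory(base, view))
print("Copy shares memory:", np.shares_memory(base, copy))

# Writing through the view changes the original array
view[0] = 99
print("base[2] after writing to view:", base[2])
print("copy[0] is unaffected:", copy[0])
```

Views avoid allocation, but the shared memory means writes propagate; call `.copy()` when an independent array is needed.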

In [10]:
import time

# Vectorization vs loops
size = 1000000
arr1 = np.random.rand(size)
arr2 = np.random.rand(size)

# Method 1: Python loop (SLOW)
start = time.time()
result_loop = []
for i in range(size):
    result_loop.append(arr1[i] + arr2[i])
loop_time = time.time() - start

# Method 2: Vectorized (FAST)
start = time.time()
result_vec = arr1 + arr2
vec_time = time.time() - start

print(f"Loop method: {loop_time:.4f} seconds")
print(f"Vectorized method: {vec_time:.4f} seconds")
print(f"\nSpeedup: {loop_time/vec_time:.1f}x faster!")

Loop method: 0.3119 seconds
Vectorized method: 0.0019 seconds

Speedup: 165.0x faster!


In [11]:
# np.vectorize lets a scalar function accept arrays (a convenience wrapper; it loops internally and is not a speedup)
def custom_func(x):
    """Custom function to square and add 10"""
    return x**2 + 10

# Vectorize the function
vectorized_func = np.vectorize(custom_func)

arr = np.array([1, 2, 3, 4, 5])
print("Original array:", arr)
print()

# Apply vectorized function
result = vectorized_func(arr)
print("After applying x²+10:", result)

Original array: [1 2 3 4 5]

After applying x²+10: [11 14 19 26 35]
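
The NumPy documentation describes `np.vectorize` as a convenience rather than a performance tool, since it is essentially a Python loop. When a function uses only array-aware operations, as `custom_func` does, it can usually be called on the array directly. A sketch comparing the two:

```python
import numpy as np

def custom_func(x):
    """Square the input and add 10."""
    return x**2 + 10

arr = np.array([1, 2, 3, 4, 5])

# ** and + already operate element-wise on arrays
direct = custom_func(arr)

# Same result through np.vectorize, which calls the function per element
via_vectorize = np.vectorize(custom_func)(arr)

print("Direct:      ", direct)
print("np.vectorize:", via_vectorize)
```

`np.vectorize` remains useful when the function contains scalar-only logic, such as Python `if` statements, that cannot run on a whole array.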


In [12]:
# Pre-allocation for better performance
n = 10000

# Bad: Growing array (slow)
start = time.time()
arr_bad = np.array([])
for i in range(n):
    arr_bad = np.append(arr_bad, i)
bad_time = time.time() - start

# Good: Pre-allocate (fast)
start = time.time()
arr_good = np.zeros(n)
for i in range(n):
    arr_good[i] = i
good_time = time.time() - start

# Best: Vectorized (fastest)
start = time.time()
arr_best = np.arange(n)
best_time = time.time() - start

print(f"Growing array: {bad_time:.4f} seconds")
print(f"Pre-allocated: {good_time:.4f} seconds ({bad_time/good_time:.1f}x faster)")
print(f"Vectorized: {best_time:.4f} seconds ({bad_time/best_time:.0f}x faster)")

Growing array: 0.0426 seconds
Pre-allocated: 0.0013 seconds (32.5x faster)
Vectorized: 0.0001 seconds (745x faster)
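
When the final size is not known in advance, pre-allocation is not possible. One common alternative, sketched here with made-up chunk sizes, is to collect the pieces in a Python list and concatenate once at the end instead of calling `np.append` on every iteration:

```python
import numpy as np

total = 10       # illustrative size; in practice this may be unknown up front
chunk_size = 4   # illustrative chunk size

chunks = []
for start in range(0, total, chunk_size):
    stop = min(start + chunk_size, total)
    # Appending to a list is cheap; no array is reallocated here
    chunks.append(np.arange(start, stop))

# A single allocation and copy at the end
combined = np.concatenate(chunks)
print("Combined:", combined)
```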


---

## 4. NumPy with Pandas

### 📚 Theory

Pandas is built on NumPy and provides easy conversion.

| Operation | Method |
|-----------|--------|
| DataFrame → NumPy | `.to_numpy()` or `.values` |
| NumPy → DataFrame | `pd.DataFrame()` |
| Series → NumPy | `.to_numpy()` |
| NumPy → Series | `pd.Series()` |

### ✅ Daily Use:
- Data preprocessing
- Feature extraction
- ML pipeline

In [13]:
# Install pandas if not available (uncomment if needed)
# !pip install pandas

import pandas as pd

# Create NumPy array
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

print("NumPy array:\n", data)
print(f"Type: {type(data)}")
print()

# Convert to DataFrame
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print("Pandas DataFrame:")
print(df)
print(f"Type: {type(df)}")

NumPy array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Type: <class 'numpy.ndarray'>

Pandas DataFrame:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
Type: <class 'pandas.core.frame.DataFrame'>


In [14]:
# DataFrame to NumPy
df = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000],
    'Experience': [2, 5, 8, 10]
})

print("Pandas DataFrame:")
print(df)
print()

# Convert to NumPy (recommended method)
arr = df.to_numpy()
print("Converted to NumPy array:\n", arr)
print(f"Shape: {arr.shape}")
print(f"Type: {type(arr)}")

Pandas DataFrame:
   Age  Salary  Experience
0   25   50000           2
1   30   60000           5
2   35   70000           8
3   40   80000          10

Converted to NumPy array:
 [[   25 50000     2]
 [   30 60000     5]
 [   35 70000     8]
 [   40 80000    10]]
Shape: (4, 3)
Type: <class 'numpy.ndarray'>
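
A NumPy array has a single dtype, so `.to_numpy()` has to pick one type that fits every column. The sketch below uses a made-up DataFrame to show the effect: mixing text with numbers yields an `object` array, while selecting only the numeric columns keeps a numeric dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data mixing integer, text and float columns
df = pd.DataFrame({
    'Age': [25, 30],
    'Name': ['Ann', 'Bob'],
    'Score': [1.5, 2.5]
})

mixed = df.to_numpy()                      # text + numbers: object dtype
numeric = df[['Age', 'Score']].to_numpy()  # int + float: upcast to float64

print("All columns dtype:    ", mixed.dtype)
print("Numeric columns dtype:", numeric.dtype)
```

Selecting the numeric columns before converting keeps vectorized math available on the result.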


In [15]:
# Practical workflow: Pandas → NumPy → Process → Pandas
# Create DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
})

print("Original DataFrame:")
print(df)
print()

# Convert to NumPy for processing
data = df.to_numpy()

# Apply NumPy operations
normalized = (data - data.mean(axis=0)) / data.std(axis=0)

# Convert back to DataFrame
df_normalized = pd.DataFrame(normalized, columns=df.columns)
print("Normalized DataFrame:")
print(df_normalized)

Original DataFrame:
   A   B
0  1  10
1  2  20
2  3  30
3  4  40
4  5  50

Normalized DataFrame:
          A         B
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2  0.000000  0.000000
3  0.707107  0.707107
4  1.414214  1.414214
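
One caveat when moving this normalization between the two libraries: NumPy's `.std()` defaults to the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). A sketch of the difference on column `A`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

numpy_std = s.to_numpy().std()       # ddof=0: divides by n
pandas_std = s.std()                 # ddof=1: divides by n - 1
matched = s.to_numpy().std(ddof=1)   # NumPy with the pandas default

print(f"NumPy std (ddof=0):  {numpy_std:.6f}")
print(f"pandas std (ddof=1): {pandas_std:.6f}")
print(f"NumPy std (ddof=1):  {matched:.6f}")
```

Normalizing with `df.std()` instead of `data.std(axis=0)` would therefore give slightly different values from the output above.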


---

## 5. NumPy with ML Libraries

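### 📚 Theory

NumPy underpins or interoperates with much of the Python ML stack:
- **Scikit-learn**: ML algorithms (operate on NumPy arrays)
- **TensorFlow/PyTorch**: Deep learning (tensors convert to and from NumPy arrays)
- **Matplotlib**: Visualization (plots NumPy arrays directly)

### ✅ Daily Use: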
- Preparing data for ML models
- Processing predictions
- Feature engineering

In [16]:
# Example: Prepare data for ML (Scikit-learn style)

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 5)  # 100 samples, 5 features
y = np.random.randint(0, 2, 100)  # Binary labels

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print()

# Train-test split (manual)
split_idx = int(0.8 * len(X))
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

print(f"Train set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print()

# Normalize features
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std  # Use train stats!

print("Normalized train data (first 5 samples):\n", X_train_norm[:5])

Features shape: (100, 5)
Labels shape: (100,)

Train set: (80, 5)
Test set: (20, 5)

Normalized train data (first 5 samples):
 [[-0.44460873  1.49101199  0.82061949  0.32623919 -0.99862413]
 [-1.19079761 -1.58623639  1.27640972  0.33477031  0.85858459]
 [-1.65313341  1.55718657  1.16182372 -1.015386   -0.91180701]
 [-1.0972106  -0.73763097  0.11667493 -0.25272992 -0.5437517 ]
 [ 0.36565767 -1.30558331 -0.67346099 -0.48048976  0.01080254]]
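
The manual split above takes the first 80% of rows as they are, which can bias the split if the data is ordered. A possible variant, sketched here with the newer `np.random.default_rng` generator (so the random values differ from the cell above), shuffles the row indices first:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((100, 5))            # 100 samples, 5 features
y = rng.integers(0, 2, size=100)    # binary labels

# Shuffle the row indices, then cut at 80%
perm = rng.permutation(len(X))
split_idx = int(0.8 * len(X))
train_idx, test_idx = perm[:split_idx], perm[split_idx:]

# Index X and y with the same indices so samples and labels stay aligned
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(f"Train set: {X_train.shape}, labels: {y_train.shape}")
print(f"Test set: {X_test.shape}, labels: {y_test.shape}")
```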


In [17]:
# Example: One-hot encoding
labels = np.array([0, 1, 2, 1, 0, 2])
print("Original labels:", labels)
print()

# Create one-hot encoded matrix
n_classes = 3
one_hot = np.zeros((len(labels), n_classes))
one_hot[np.arange(len(labels)), labels] = 1

print("One-hot encoded:\n", one_hot)
print()

# Alternative using eye
one_hot_alt = np.eye(n_classes)[labels]
print("One-hot (using eye):\n", one_hot_alt)

Original labels: [0 1 2 1 0 2]

One-hot encoded:
 [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]

One-hot (using eye):
 [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
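
Going the other way, for example to turn one-hot rows or per-class scores back into class labels, `np.argmax` along the class axis recovers the index of the 1 in each row. A short sketch:

```python
import numpy as np

labels = np.array([0, 1, 2, 1, 0, 2])
one_hot = np.eye(3)[labels]

# The position of the largest value in each row is the class index
decoded = np.argmax(one_hot, axis=1)

print("Decoded labels:", decoded)
```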


---

## 6. 🧪 Practice Exercises

### Exercise 1: Handle Missing Data

In [18]:
# Create array with missing values and clean it
data = np.array([1, 2, np.nan, 4, 5, np.nan, 7, 8])
print("Original:", data)
print()

# Replace NaN with the mean of the non-NaN values
mean_val = np.nanmean(data)
cleaned = np.where(np.isnan(data), mean_val, data)
print(f"NaN replaced with mean ({mean_val:.2f}):", cleaned)

Original: [ 1.  2. nan  4.  5. nan  7.  8.]

NaN replaced with mean (4.50): [1.  2.  4.5 4.  5.  4.5 7.  8. ]


### Exercise 2: Convert Types for Memory Optimization

In [19]:
# Create large array and optimize memory
arr = np.random.randint(0, 100, size=10000, dtype=np.int64)

print(f"Original dtype: {arr.dtype}")
print(f"Original memory: {arr.nbytes} bytes")
print()

# Convert to smaller int type
arr_optimized = arr.astype(np.int16)
print(f"Optimized dtype: {arr_optimized.dtype}")
print(f"Optimized memory: {arr_optimized.nbytes} bytes")
print()

savings = arr.nbytes - arr_optimized.nbytes
print(f"Memory saved: {savings} bytes ({savings/arr.nbytes*100:.1f}%)")

Original dtype: int64
Original memory: 80000 bytes

Optimized dtype: int16
Optimized memory: 20000 bytes

Memory saved: 60000 bytes (75.0%)
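
Downcasting is only safe when every value fits in the smaller type; integer casts with `.astype()` do not raise on overflow. The sketch below uses a hypothetical helper, `fits_in`, built on `np.iinfo` to check the range first.

```python
import numpy as np

def fits_in(arr, dtype):
    """Hypothetical helper: True if all values of an integer array fit in the integer dtype."""
    info = np.iinfo(dtype)
    return bool(arr.min() >= info.min and arr.max() <= info.max)

small = np.array([0, 99, 42], dtype=np.int64)
large = np.array([0, 40_000], dtype=np.int64)

print("small fits int16:", fits_in(small, np.int16))  # int16 covers -32768..32767
print("large fits int16:", fits_in(large, np.int16))

# Casting anyway does not raise; the out-of-range value silently wraps around
wrapped = large.astype(np.int16)
print("Unchecked cast:", wrapped)
```

In the exercise above the values lie in 0..99, so `int16` (or even `uint8`) is safe.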


### Exercise 3: Pandas Integration

In [20]:
# Create NumPy array, convert to DataFrame, process, convert back
data = np.array([[25, 50000],
                 [30, 60000],
                 [35, 70000],
                 [40, 80000]])

print("NumPy array:\n", data)
print()

# To DataFrame
df = pd.DataFrame(data, columns=['Age', 'Salary'])
print("As DataFrame:")
print(df)
print()

# Process in DataFrame
df['Salary_K'] = df['Salary'] / 1000
print("After processing:")
print(df)
print()

# Back to NumPy
result = df.to_numpy()
print("Back to NumPy:\n", result)

NumPy array:
 [[   25 50000]
 [   30 60000]
 [   35 70000]
 [   40 80000]]

As DataFrame:
   Age  Salary
0   25   50000
1   30   60000
2   35   70000
3   40   80000

After processing:
   Age  Salary  Salary_K
0   25   50000      50.0
1   30   60000      60.0
2   35   70000      70.0
3   40   80000      80.0

Back to NumPy:
 [[2.5e+01 5.0e+04 5.0e+01]
 [3.0e+01 6.0e+04 6.0e+01]
 [3.5e+01 7.0e+04 7.0e+01]
 [4.0e+01 8.0e+04 8.0e+01]]


---

## 📝 Key Takeaways

### Missing Data:
1. ✅ `np.nan` for missing values
2. ✅ `np.isnan()` to detect
3. ✅ `np.nanmean()`, `np.nansum()` to handle
4. ✅ `np.nan_to_num()` to replace

### Type Conversion:
1. ✅ `.astype()` to change dtype
2. ✅ `.tolist()` to convert to Python list
3. ✅ Use appropriate types for memory efficiency

### Performance:
1. ✅ Prefer vectorized operations over Python loops
2. ✅ Pre-allocate arrays when possible
3. ✅ Use views instead of copies
4. ✅ `np.vectorize()` for applying custom scalar functions to arrays (convenience, not speed)

### Integration:
1. ✅ Pandas: `.to_numpy()` and `pd.DataFrame()`
2. ✅ ML libraries accept NumPy arrays or convert them to their own tensor types
3. ✅ Easy conversion between formats

---

## 🎓 Congratulations!

You've completed the NumPy Learning Journey!

### What You've Learned:
- ✅ Array creation and manipulation
- ✅ Mathematical and statistical operations
- ✅ Matrix operations and linear algebra
- ✅ Data handling and integration
- ✅ Performance optimization
