# Preliminaries  

**Overview**  
This section provides the essential survival skills needed before diving into deep learning. It covers how to manipulate and preprocess data, the linear algebra and calculus concepts that underlie neural networks, the use of automatic differentiation, basic probability for reasoning under uncertainty, and how to effectively use documentation. These fundamentals ensure you can follow the technical content of later chapters with confidence.  
<br>  

---


## A. Data Manipulation (2.1)

**Recap**  
This section introduces tensors, the core data structure in PyTorch. A tensor is an n-dimensional array of numbers. We can create tensors, combine them with elementwise operations, rely on broadcasting when shapes differ, save memory with in-place operations, and convert between PyTorch tensors and other Python objects such as NumPy arrays.  

**Vocab**  
- **Tensor**: A general container for numbers in 1D (vector), 2D (matrix), or higher dimensions.  
- **Broadcasting**: Expanding smaller arrays automatically to match larger shapes during operations.  

**Notes**  
- Tensors are arrays of numbers: 1D = vector, 2D = matrix, higher = tensor.  
- Tensors support math operations (add, subtract, multiply, divide) applied elementwise.  
- Broadcasting lets tensors of different shapes combine: dimensions are aligned from the right, and any dimension of size 1 (or missing) is expanded to match.  
- In-place operations (`a.add_(b)`) save memory but overwrite values directly.  
- Tensors can be converted to/from NumPy arrays or Python scalars. `torch.from_numpy()` and `.numpy()` share memory with the source (on CPU), while `torch.tensor()` and `.item()` make copies.  
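
The broadcasting rule in the notes above can be made concrete with two small tensors. This is a minimal sketch; the names `a`, `b`, `c`, and `broadcast_failed` are illustrative only.

```python
import torch

# Shapes are compared from the trailing dimension backwards;
# each pair of sizes must be equal, or one of them must be 1.
a = torch.arange(3).reshape(3, 1)   # shape (3, 1)
b = torch.arange(2).reshape(1, 2)   # shape (1, 2)
c = a + b                           # both are expanded to shape (3, 2)
print(c.shape)                      # torch.Size([3, 2])

# Incompatible shapes raise an error instead of broadcasting.
try:
    torch.ones(3, 2) + torch.ones(3)   # trailing sizes 2 and 3 do not match
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True
print(broadcast_failed)             # True
```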


In [23]:
import torch

# Create a vector and a matrix
v = torch.arange(3)          # [0, 1, 2]
M = torch.ones((3, 3))       # 3x3 of ones

# Broadcasting: vector expands to match matrix shape
result = M + v

# In-place operation: overwrite to save memory
M.add_(v)   # modifies M directly

print("Original vector v:\n", v)
print("\nMatrix M after in-place add with broadcasting:\n", M)
print("\nResult (new tensor, not in-place):\n", result)


Original vector v:
 tensor([0, 1, 2])

Matrix M after in-place add with broadcasting:
 tensor([[1., 2., 3.],
        [1., 2., 3.],
        [1., 2., 3.]])

Result (new tensor, not in-place):
 tensor([[1., 2., 3.],
        [1., 2., 3.],
        [1., 2., 3.]])
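
Whether a conversion shares memory depends on which call is used. The sketch below assumes NumPy is installed and works with CPU tensors; the variable names are illustrative.

```python
import numpy as np
import torch

arr = np.zeros(3)
shared = torch.from_numpy(arr)   # shares memory with arr
copied = torch.tensor(arr)       # makes an independent copy

arr[0] = 5.0                     # modify the NumPy array in place
print(shared)                    # first entry is now 5.0
print(copied)                    # still all zeros

back = shared.numpy()            # NumPy view of the same memory (CPU tensors)
scalar = shared[0].item()        # plain Python float, detached from the tensor
```

Because `shared` and `arr` use the same buffer, an in-place change to either one is visible in the other, which is convenient but easy to trip over.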


## B. Data Preprocessing (2.2)

**Recap**  
This section shows how to prepare raw, messy data for machine learning. We load CSV files into pandas DataFrames, separate inputs from targets, fix missing values, encode categorical variables, and finally convert everything into PyTorch tensors for training.  

**Vocab**  
- **Imputation**: Replacing missing values with estimated ones, often using the mean.  
- **One-hot encoding**: Expanding a categorical column into multiple binary indicator columns.  

**Notes**  
- Use `pd.read_csv()` to load tabular data into a DataFrame.  
- Separate the dataset into inputs (features) and targets (labels).  
- Missing numeric values are often imputed with the column mean.  
- Missing categorical values can be handled by one-hot encoding, with NaN treated as its own category using `dummy_na=True`.  
- Once data is fully numeric, convert to an array with `.to_numpy()` and then wrap it in a PyTorch tensor using `torch.tensor()`.  
- Models need floating-point inputs: gradient-based optimization works on real-valued quantities, and PyTorch only tracks gradients for floating-point (or complex) tensors, not integers or strings.  
- Data preprocessing is essential because real-world datasets often contain missing values and non-numeric columns, all of which must be handled before training.  
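
The float requirement in the notes above shows up directly in PyTorch's autograd. A minimal sketch follows; the tensor values and the names `int_grad_ok`, `x`, and `loss` are made up for illustration.

```python
import torch

# Integer tensors cannot track gradients.
try:
    torch.tensor([1, 2, 3], requires_grad=True)
    int_grad_ok = True
except RuntimeError:
    int_grad_ok = False

# Casting to float makes the tensor usable by autograd.
x = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)
loss = (x ** 2).sum()
loss.backward()              # d(loss)/dx = 2x

print(int_grad_ok)           # False
print(x.grad)                # tensor([2., 4., 6.])
```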


In [24]:
import pandas as pd
import torch

# Raw data
data = pd.DataFrame({
    'NumRooms': [None, 2, 4, None],
    'RoofType': [None, None, 'Slate', None],
    'Price': [127500, 106000, 178100, 140000]
})

print("RAW DATAFRAME")
print(data)
print("\n.dtypes:\n", data.dtypes)
print("\nMissing values per column:\n", data.isna().sum())

# Split inputs vs targets
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
print("\n--- AFTER SPLIT ---")
print("inputs:\n", inputs)
print("\n.targets (Series):\n", targets.head())


# One-hot encode categorical (keep NaN as its own category)
inputs_encoded = pd.get_dummies(inputs, dummy_na=True)
print("\n--- AFTER ONE-HOT ENCODING ---")
print(inputs_encoded)
print("\nColumns now:", list(inputs_encoded.columns))
print("\n.dtypes after encoding:\n", inputs_encoded.dtypes)


# Impute numeric NaN with column means
print("\nMissing values per column BEFORE fillna:\n", inputs_encoded.isna().sum())
inputs_filled = inputs_encoded.fillna(inputs_encoded.mean(numeric_only=True))
print("\nMissing values per column AFTER fillna:\n", inputs_filled.isna().sum())
print("\n--- ENCODED + IMPUTED INPUTS ---")
print(inputs_filled)

# Convert to tensors (float)
X = torch.tensor(inputs_filled.to_numpy(dtype=float), dtype=torch.float32)
y = torch.tensor(targets.to_numpy(dtype=float), dtype=torch.float32)

print("\n--- TENSORS ---")
print("X:\n", X)
print("y:\n", y)
print("\nShapes -> X:", tuple(X.shape), " y:", tuple(y.shape))
print("Dtypes  -> X:", X.dtype, " y:", y.dtype)

RAW DATAFRAME
   NumRooms RoofType   Price
0       NaN     None  127500
1       2.0     None  106000
2       4.0    Slate  178100
3       NaN     None  140000

.dtypes:
 NumRooms    float64
RoofType     object
Price         int64
dtype: object

Missing values per column:
 NumRooms    2
RoofType    3
Price       0
dtype: int64

--- AFTER SPLIT ---
inputs:
    NumRooms RoofType
0       NaN     None
1       2.0     None
2       4.0    Slate
3       NaN     None

.targets (Series):
 0    127500
1    106000
2    178100
3    140000
Name: Price, dtype: int64

--- AFTER ONE-HOT ENCODING ---
   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True

Columns now: ['NumRooms', 'RoofType_Slate', 'RoofType_nan']

.dtypes after encoding:
 NumRooms          float64
RoofType_Slate       bool
RoofType_nan         bool
dtype: object

Missing values per c

## C. Linear Algebra (2.3)  

**Recap**  
Linear algebra is the language of deep learning. Vectors represent features or data points, matrices represent transformations, and norms measure size or distance. Operations like dot products, matrix multiplication, and solving systems form the math backbone of optimization and neural networks.  


### Key Definitions  

**Vector**  
- A 1-D tensor with shape (n,). Example: x = [x₁, x₂, …, xₙ].  
- Represents a list of numbers such as weights, pixel values, or features.  

**Matrix**  
- A 2-D tensor with shape (m, n). Example: rows and columns of numbers.  
- Represents a transformation that maps inputs in ℝⁿ to outputs in ℝᵐ.  

**Transpose (Aᵀ)**  
- Swaps rows ↔ columns. If A is (m×n), then Aᵀ is (n×m).  
- Lets us align shapes for multiplication and is used in gradients.  

**Dot Product (a·b = aᵀb)**  
- a·b = ∑ᵢ aᵢbᵢ.  
- Measures alignment: positive when the angle between a and b is under 90°, 0 when they are perpendicular, negative when the angle is over 90°.  

**Matrix–Vector Product (Ax)**  
- If A is (m×n) and x ∈ ℝⁿ, result is vector in ℝᵐ.  
- Think: apply linear transformation A to the vector x.  

**Matrix–Matrix Product (AB)**  
- If A is (m×n), B is (n×p), result is (m×p).  
- Represents composing two transformations: first B, then A.  
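
The shape rules for the transpose and the three products above can be checked on small hand-computable examples. This is a sketch with arbitrary values; `A`, `B`, and `x` are illustrative names.

```python
import torch

A = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])   # shape (2, 3): maps R^3 -> R^2
B = torch.tensor([[1., 0.],
                  [0., 1.],
                  [1., 1.]])       # shape (3, 2): maps R^2 -> R^3
x = torch.tensor([1., 2.])         # vector in R^2

dot = torch.dot(x, x)              # 1*1 + 2*2 = 5
Bx = B @ x                         # matrix-vector product, shape (3,)
AB = A @ B                         # matrix-matrix product, shape (2, 2)

# Composition: applying B first and then A equals applying AB once.
print(torch.allclose(A @ (B @ x), AB @ x))   # True
print(A.T.shape)                   # torch.Size([3, 2])
```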

**Norms**  
- Measure the size or length of vectors or matrices.  
- Vector norms:  
  - L1: ‖x‖₁ = ∑ |xᵢ| (sum of absolute values).  
  - L2: ‖x‖₂ = √(∑ xᵢ²) (Euclidean length).  
  - L∞: ‖x‖∞ = max |xᵢ| (largest absolute entry).  
- Matrix norm: Frobenius ‖A‖_F = √(∑ aᵢⱼ²), “length” of a matrix.  
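
Each of these norms can be computed with `torch.linalg.norm`. The sketch below uses values chosen so the results are easy to verify by hand.

```python
import torch

x = torch.tensor([3., -4.])
l1 = torch.linalg.norm(x, ord=1)               # |3| + |-4| = 7
l2 = torch.linalg.norm(x, ord=2)               # sqrt(9 + 16) = 5
linf = torch.linalg.norm(x, ord=float('inf'))  # max(|3|, |-4|) = 4

A = torch.ones(4, 9)
fro = torch.linalg.norm(A, ord='fro')          # sqrt(36 ones) = 6

print(l1, l2, linf, fro)
```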

**Identity Matrix (I)**  
- Diagonal of ones, zeros elsewhere.  
- Acts like “1” in multiplication: Ix = x.  

**Inverse (A⁻¹)**  
- For square A, if invertible: A⁻¹A = AA⁻¹ = I.  
- Solves Ax = b as x = A⁻¹b (theoretical). In practice, use a solver such as `torch.linalg.solve(A, b)` for speed and numerical stability.  
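
A small comparison of the two routes to x, assuming a well-conditioned 2×2 system made up for illustration:

```python
import torch

A = torch.tensor([[2., 1.],
                  [1., 3.]])
b = torch.tensor([3., 5.])

x_solve = torch.linalg.solve(A, b)   # preferred: solves Ax = b directly
x_inv = torch.linalg.inv(A) @ b      # textbook formula x = A^{-1} b

print(x_solve)                               # approximately [0.8, 1.4]
print(torch.allclose(A @ x_solve, b))        # True
```

On a tiny, well-conditioned system both routes agree; the difference in cost and accuracy only becomes visible on larger or ill-conditioned matrices.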

**Trace (tr(A))**  
- Sum of diagonal elements.  
- Shows up in derivatives and covariance formulas.  

---

### Key Identities & Theorems  

**Cauchy–Schwarz Inequality**  
Formula: |a·b| ≤ ‖a‖₂‖b‖₂  
Meaning: The dot product can never be larger than the product of lengths.  
Why it matters: Guarantees cosine similarity is always between -1 and 1.  
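
A numerical spot check of the inequality on random vectors. This is a sanity check, not a proof; the seed and vector size are arbitrary choices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a = torch.randn(5)
b = torch.randn(5)

lhs = torch.dot(a, b).abs()                          # |a . b|
rhs = torch.linalg.norm(a) * torch.linalg.norm(b)    # ||a|| * ||b||
cos_sim = torch.dot(a, b) / rhs                      # cosine similarity

print(bool(lhs <= rhs))                              # True
print(torch.allclose(cos_sim, F.cosine_similarity(a, b, dim=0)))
```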

**L2 Norm Identity**  
Formula: ‖x‖₂² = xᵀx  
Meaning: Squared length = dot product with itself.  
Why it matters: Common trick for optimization, weight decay, and loss functions.  

**Frobenius Norm Identity**  
Formula: ‖A‖_F² = tr(AᵀA)  
Meaning: The squared Frobenius norm is the sum of squared entries, which also equals the trace of AᵀA.  
Why it matters: Used in regularization and proofs for gradients.  
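
Both norm identities (L2 and Frobenius) can be verified on small examples with hand-checkable values:

```python
import torch

x = torch.tensor([1., 2., 2.])
A = torch.tensor([[1., 2.],
                  [3., 4.]])

l2_squared = torch.linalg.norm(x) ** 2               # 1 + 4 + 4 = 9
x_dot_x = torch.dot(x, x)                            # x^T x = 9

fro_squared = torch.linalg.norm(A, ord='fro') ** 2   # 1 + 4 + 9 + 16 = 30
trace_AtA = torch.trace(A.T @ A)                     # 10 + 20 = 30

print(torch.allclose(l2_squared, x_dot_x))           # True
print(torch.allclose(fro_squared, trace_AtA))        # True
```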

**Matrix Multiplication Associativity**  
Formula: (AB)C = A(BC)  
Meaning: The grouping of multiplications does not matter, as long as the left-to-right order of the factors is kept (in general AB ≠ BA).  
Why it matters: Lets us regroup computations for efficiency and simplifies derivations.  

**Trace Cyclic Property**  
Formula: tr(AB) = tr(BA), and more generally tr(ABC) = tr(BCA) = tr(CAB)  
Meaning: The factors inside a trace can be rotated cyclically, provided the products are defined.  
Why it matters: Simplifies derivatives in backprop and proofs.  
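
Associativity and the cyclic trace property can both be spot-checked on random matrices. The shapes below are arbitrary, chosen so that every rotation of the product is defined.

```python
import torch

torch.manual_seed(0)
A = torch.randn(2, 3)
B = torch.randn(3, 4)
C = torch.randn(4, 2)

# Associativity: grouping does not change the result.
assoc = torch.allclose((A @ B) @ C, A @ (B @ C), atol=1e-5)

# Cyclic property: rotating the factors changes the matrix size
# but not the trace.
t1 = torch.trace(A @ B @ C)   # trace of a (2, 2) matrix
t2 = torch.trace(B @ C @ A)   # trace of a (3, 3) matrix
t3 = torch.trace(C @ A @ B)   # trace of a (4, 4) matrix

print(assoc)
print(t1, t2, t3)
```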

**Solving Linear Systems**  
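Formula: Ax = b → x = A⁻¹b (theory), but use `torch.linalg.solve(A, b)` in practice.  
Meaning: Finding x that satisfies the system of equations.  
Why it matters: Linear solves appear in closed-form least squares and in some second-order optimization methods; most neural network training relies on gradient descent instead.  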

---

### Expanded Notes  

- **Shapes rule everything**: Always track (rows, cols). Mismatched dimensions are among the most common sources of errors in PyTorch code.  
- **Dot product intuition**: Big positive = aligned, zero = perpendicular, negative = opposite. Useful for similarity and projections.  
- **Matrix–vector product**: Each row of A acts like a filter on x. In ML, this is exactly what happens inside a fully connected layer.  
- **Matrix–matrix product**: Think of this as composing transformations. In deep learning, each layer’s weight matrix multiplies the previous layer’s output, so stacked linear layers correspond to chained matrix products.  
- **Norms**:  
  - L1 encourages sparsity (used in Lasso).  
  - L2 is smooth and common in optimization (weight decay).  
  - L∞ is used for robustness, bounding maximum deviation.  
  - Frobenius measures overall weight magnitude in a matrix.  
- **Identity matrix**: Think of it as a “do nothing” transformation. Useful in defining inverses and in derivatives.  
- **Inverse**: Rarely computed explicitly, because forming A⁻¹ is slower and less numerically accurate than solving the system directly. Prefer built-in solvers such as `torch.linalg.solve`.  
- **Trace**: Very handy in derivations because it simplifies matrix calculus. E.g., the gradient of tr(AX) w.r.t. X is Aᵀ.  
- **Geometric picture**:  
  - Vectors = arrows.  
  - Norms = arrow lengths.  
  - Dot = shadow/projection of one arrow on another.  
  - Matrices = rotations, scalings, shears.  
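
The trace gradient mentioned in the notes above (the gradient of tr(AX) w.r.t. X is Aᵀ) can be confirmed with autograd. This is a minimal sketch with arbitrary shapes and an arbitrary seed.

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 4)
X = torch.randn(4, 3, requires_grad=True)

# tr(AX) is a scalar, so backward() fills X.grad with d tr(AX) / dX.
torch.trace(A @ X).backward()

print(torch.allclose(X.grad, A.T))   # True
```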


