In [1]:
import numpy as np
import pandas as pd
a = [2,3,1]
b = [4,-1,5]
print(np.dot(a,b))
# it's equal to 2*4 + 3*(-1) + 1*5 = 8 + -3 + 5 = 10

10


# Matrix Multiplication in a Single Neural Network Layer

In a basic neural network, each layer performs computations using matrix multiplication to transform inputs into outputs. Here's how it works:

## Forward Pass in a Single Layer

1. **Inputs**: The input to a layer is a vector (or batch of vectors) representing features from the previous layer or raw data.

2. **Weights Matrix**: Each layer has a weight matrix `W` where:
   - Rows correspond to neurons in the current layer
   - Columns correspond to inputs from the previous layer
   - Each element `W[i,j]` represents the strength of connection from input `j` to neuron `i`

3. **Matrix Multiplication**: The core computation is:
   ```
   Z = W · X + b
   ```
   Where:
   - `X` is the input vector/matrix
   - `W` is the weights matrix
   - `b` is the bias vector (added to each neuron)
   - `Z` is the pre-activation output

4. **Activation Function**: Apply a non-linear activation (like ReLU, sigmoid) to `Z` to get the final output `A`:
   ```
   A = activation(Z)
   ```
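The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names `relu` and `dense_forward` and all numeric values are assumptions chosen for the example (a layer with 2 neurons and 3 inputs).

```python
import numpy as np

def relu(z):
    # ReLU keeps positive values and replaces negative values with 0
    return np.maximum(0, z)

def dense_forward(W, x, b):
    # Pre-activation: matrix-vector product plus bias
    z = W @ x + b
    # Activation applied element-wise to the pre-activation
    return z, relu(z)

# Hypothetical values: W has one row per neuron and one column per input
W = np.array([[1.0, -2.0, 0.5],
              [0.5,  0.5, -1.0]])
x = np.array([2.0, 1.0, 4.0])
b = np.array([0.0, 1.0])

z, a = dense_forward(W, x, b)
print("Z =", z)
print("A =", a)
```

Note that the matrix product is written with `@`; on NumPy arrays, `*` would multiply element by element instead.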

## Why Matrix Multiplication?

- **Efficiency**: Matrix operations can be parallelized and optimized (e.g., using GPUs)
- **Scalability**: Handles multiple inputs/outputs simultaneously
- **Mathematical Foundation**: Represents linear transformations in vector spaces
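
As a rough illustration of the scalability point, the sketch below processes a batch of inputs with a single matrix product and checks that the result matches a per-sample loop. The shapes, the random seed, and the convention of storing one sample per row are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((3, 2))   # 3 neurons, 2 inputs
b = rng.random(3)        # one bias per neuron
X = rng.random((5, 2))   # batch of 5 samples, one sample per row

# One matrix product handles the whole batch.
# Samples are stored as rows here, so W is transposed.
Z_batch = X @ W.T + b

# Equivalent per-sample loop, for comparison only
Z_loop = np.array([W @ x + b for x in X])

print(Z_batch.shape)
print(np.allclose(Z_batch, Z_loop))
```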

## Example

For a layer with 3 neurons and 2 inputs:
- `X` = [x1, x2] (2x1 column vector)
- `W` = [[w11, w12], [w21, w22], [w31, w32]] (3x2 matrix)
- `b` = [b1, b2, b3] (3x1 vector)

The output `Z` = [w11*x1 + w12*x2 + b1, w21*x1 + w22*x2 + b2, w31*x1 + w32*x2 + b3]

This is computed efficiently as matrix multiplication followed by bias addition.
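
To make the expansion concrete, the following sketch plugs in hypothetical numbers for the weights, biases, and inputs (none of them come from the text) and checks that the row-by-row formula agrees with the matrix form.

```python
import numpy as np

# Hypothetical values for the 3-neuron, 2-input layer
x1, x2 = 1.0, 2.0
x = np.array([x1, x2])
W = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])
b = np.array([0.01, 0.02, 0.03])

# Row-by-row expansion: w_i1*x1 + w_i2*x2 + b_i for each neuron i
Z_manual = np.array([W[i, 0] * x1 + W[i, 1] * x2 + b[i] for i in range(3)])

# Same computation as a matrix-vector product followed by bias addition
Z_matmul = W @ x + b

print(Z_manual)
print(Z_matmul)
```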

In [4]:
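a = np.random.rand(4,4)
b = np.random.rand(4,4)
# Note: `*` on NumPy arrays is the element-wise (Hadamard) product, not matrix multiplication.
# The matrix product used in a network layer would be `a @ b` (or np.matmul(a, b)).
print("a multiply by b is equal to", a*b)
print("transpose value of a*b is equal to ", (a*b).T)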

a multiply by b is equal to [[0.38551567 0.01301477 0.11131309 0.01814635]
 [0.03250888 0.66089691 0.23437734 0.08254718]
 [0.59035857 0.08523107 0.02597058 0.27811855]
 [0.70732997 0.28160239 0.24776052 0.19897553]]
transpose value of a*b is equal to  [[0.38551567 0.03250888 0.59035857 0.70732997]
 [0.01301477 0.66089691 0.08523107 0.28160239]
 [0.11131309 0.23437734 0.02597058 0.24776052]
 [0.01814635 0.08254718 0.27811855 0.19897553]]


In [1]:
print("f(x,y) = 2x² + 3xy + y³")
print("derivative of f(x,y) with respect to x is 4x + 3y")
print("derivative of f(x,y) with respect to y is 3x + 3y²")

f(x,y) = 2x² + 3xy + y³
derivative of f(x,y) with respect to x is 4x + 3y
derivative of f(x,y) with respect to y is 3x + 3y²
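
The two partial derivatives can be sanity-checked numerically with central finite differences. This is only a sketch: the evaluation point `(1, 2)` and the step size `h` are arbitrary choices for illustration.

```python
def f(x, y):
    return 2 * x**2 + 3 * x * y + y**3

def df_dx(x, y):
    # Analytic partial derivative with respect to x (y held constant)
    return 4 * x + 3 * y

def df_dy(x, y):
    # Analytic partial derivative with respect to y (x held constant)
    return 3 * x + 3 * y**2

x0, y0, h = 1.0, 2.0, 1e-6

# Central differences: vary one variable at a time, keep the other fixed
num_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
num_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print("df/dx analytic:", df_dx(x0, y0), "numeric:", num_dx)
print("df/dy analytic:", df_dy(x0, y0), "numeric:", num_dy)
```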


# Why Do We Need Partial Derivatives in Neural Networks?

Partial derivatives are essential in neural networks for training through backpropagation, which optimizes the model's parameters to minimize the loss function. Here's why:

## Gradient Descent Optimization

Neural networks learn by adjusting weights and biases to reduce the error between predictions and actual outputs. This is done using gradient descent, which requires knowing how much each parameter contributes to the total error.

## Role of Partial Derivatives

1. **Compute Gradients**: Partial derivatives calculate the rate of change of the loss function with respect to each individual parameter (weight or bias). This forms the gradient vector.

2. **Backpropagation**: Starting from the output layer, errors are propagated backwards through the network. At each layer, partial derivatives determine how much each weight contributed to the error.

3. **Update Rule**: Parameters are updated using:
   ```
   θ_new = θ_old - learning_rate * ∂Loss/∂θ
   ```
   Where `∂Loss/∂θ` is the partial derivative of the loss with respect to parameter θ.
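
A single application of the update rule might look like the sketch below. The quadratic loss, the `target` vector, and the learning rate are assumptions picked so the arithmetic is easy to follow; they are not taken from any particular network.

```python
import numpy as np

# Hypothetical loss: squared distance between theta and a fixed target
target = np.array([1.0, -2.0])

def loss(theta):
    return np.sum((theta - target) ** 2)

def grad(theta):
    # Vector of partial derivatives, one per parameter
    return 2 * (theta - target)

theta_old = np.array([0.0, 0.0])
learning_rate = 0.1

# theta_new = theta_old - learning_rate * dLoss/dtheta
theta_new = theta_old - learning_rate * grad(theta_old)

print("gradient:", grad(theta_old))
print("theta_new:", theta_new)
print("loss before:", loss(theta_old), "loss after:", loss(theta_new))
```

Each parameter moves against its own partial derivative, and the loss decreases after the step.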

## Chain Rule in Action

Since neural networks are composed of multiple layers with activation functions, the total derivative involves the chain rule. Partial derivatives allow us to break down complex functions into simpler components:

- For a multi-layer network: `Loss = f(g(h(x)))`
- The chain rule gives: `∂Loss/∂x = ∂f/∂g * ∂g/∂h * ∂h/∂x`
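
The chain-rule product can be checked on a small composed function. In this sketch the three functions `h`, `g`, and `f` are arbitrary stand-ins for layers (assumptions for illustration), and the analytic result is compared against a finite-difference estimate.

```python
import math

def h(x):
    return 3 * x + 1

def g(u):
    return u ** 2

def f(v):
    return math.sin(v)

def loss(x):
    return f(g(h(x)))

def dloss_dx(x):
    u = h(x)
    v = g(u)
    df_dg = math.cos(v)  # derivative of f with respect to its input
    dg_dh = 2 * u        # derivative of g with respect to its input
    dh_dx = 3            # derivative of h with respect to x
    # Chain rule: multiply the local derivatives together
    return df_dg * dg_dh * dh_dx

x0, eps = 0.5, 1e-6
numeric = (loss(x0 + eps) - loss(x0 - eps)) / (2 * eps)

print("chain rule:", dloss_dx(x0))
print("numeric:   ", numeric)
```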

## Benefits

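- **Efficiency**: Backpropagation reuses intermediate results, so gradients for millions of parameters can be computed in a single backward pass
- **Precision**: The gradient gives the locally steepest direction for each parameter update
- **Convergence**: With a suitable learning rate, repeated updates move the parameters toward a (local) minimum of the loss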

Without partial derivatives, we'd have no systematic way to train neural networks effectively.

# Step-by-Step: How Gradient Descent Updates Weights in Simple Linear Regression

Gradient descent optimizes the weights (slope and intercept) in linear regression by iteratively minimizing the mean squared error (MSE) loss. Here's the step-by-step process:

## 1. Initialize Parameters

Start with initial guesses for the weights:
- Slope (m): Often initialized to 0 or a small random value
- Intercept (b): Often initialized to 0

## 2. Define the Loss Function

For linear regression, the loss is Mean Squared Error:
```
MSE = (1/n) * Σ(y_i - ŷ_i)²
```
Where:
- `y_i` is actual value
- `ŷ_i = m*x_i + b` is predicted value
- `n` is number of data points

## 3. Compute Partial Derivatives (Gradients)

Calculate how the loss changes with respect to each parameter:

**Derivative w.r.t. slope (m):**
```
∂MSE/∂m = (-2/n) * Σ(x_i * (y_i - ŷ_i))
```

**Derivative w.r.t. intercept (b):**
```
∂MSE/∂b = (-2/n) * Σ(y_i - ŷ_i)
```

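The gradient points in the direction of steepest ascent of the loss, and its magnitude indicates how steep the slope is. Gradient descent therefore steps in the opposite direction.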
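
The two gradient formulas translate almost directly into NumPy. The sketch below uses a tiny hypothetical dataset generated from `y = 2x + 1`; the function name `mse_gradients` is an assumption made for this example.

```python
import numpy as np

def mse_gradients(m, b, x, y):
    # Residuals y_i - y_hat_i with y_hat_i = m*x_i + b
    residual = y - (m * x + b)
    dm = -2.0 * np.mean(x * residual)  # (-2/n) * sum(x_i * (y_i - y_hat_i))
    db = -2.0 * np.mean(residual)      # (-2/n) * sum(y_i - y_hat_i)
    return dm, db

# Hypothetical data lying exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

dm0, db0 = mse_gradients(0.0, 0.0, x, y)    # gradients at the initial guess
dm_opt, db_opt = mse_gradients(2.0, 1.0, x, y)  # gradients at the true line

print("at m=0, b=0:", dm0, db0)
print("at m=2, b=1:", dm_opt, db_opt)
```

Both gradients are negative at the initial guess, so the update increases `m` and `b`; at the true line both gradients vanish.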

## 4. Update Parameters

Update each parameter by moving in the opposite direction of the gradient:
```
m_new = m_old - learning_rate * ∂MSE/∂m
b_new = b_old - learning_rate * ∂MSE/∂b
```

Where `learning_rate` (α) controls the step size. Values such as 0.01 to 0.1 are common starting points, but a suitable value depends on the data and its scale.

## 5. Iterate Until Convergence

Repeat steps 3-4 for multiple epochs until:
- The loss stops decreasing significantly
- A maximum number of iterations is reached
- Gradients become very small

## 6. Final Model

The final `m` and `b` give the best-fit line: `ŷ = m*x + b`
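
Putting the steps together, a minimal batch gradient descent loop for simple linear regression could look like this. The dataset, learning rate, iteration cap, and stopping tolerance are all assumptions chosen for the sketch.

```python
import numpy as np

# Hypothetical noise-free data from y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# Step 1: initialize parameters
m, b = 0.0, 0.0
learning_rate = 0.05
max_epochs = 5000
tolerance = 1e-8

for epoch in range(max_epochs):
    # Step 2: predictions and residuals behind the MSE loss
    residual = y - (m * x + b)

    # Step 3: partial derivatives of the MSE
    dm = -2.0 * np.mean(x * residual)
    db = -2.0 * np.mean(residual)

    # Step 5: stop once the gradients are very small
    if max(abs(dm), abs(db)) < tolerance:
        break

    # Step 4: move against the gradient
    m -= learning_rate * dm
    b -= learning_rate * db

# Step 6: final model
print(f"m = {m:.4f}, b = {b:.4f}")
```

With this data the loop recovers a slope close to 2 and an intercept close to 1. Real data contains noise, so the fitted values would only approximate the underlying relationship.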

## Key Points

- **Learning Rate**: Too high → overshooting minimum; Too low → slow convergence
- **Batch vs. Stochastic**: Can compute gradients on full dataset (batch) or single points (stochastic)
- **Convergence**: Process continues until parameters stabilize at the optimal values that minimize MSE

This iterative approach ensures the model learns the relationship between input `x` and output `y` from the training data.

In [4]:
import numpy as np

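def f(x):
    # Function to minimize: f(x) = x^2, with its minimum at x = 0
    return x**2

def df(x):
    # Derivative of f: f'(x) = 2x
    return 2*x


x = 10.0             # starting point
learning_rate = 0.1  # step size
steps = 20           # number of gradient descent updates

for i in range(steps):
    # Step against the gradient to reduce f(x)
    x = x - learning_rate * df(x)
    fx = f(x)
    print(f"Step {i+1}: x = {x:.6f}, f(x) = {fx:.6f}")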

Step 1: x = 8.000000, f(x) = 64.000000
Step 2: x = 6.400000, f(x) = 40.960000
Step 3: x = 5.120000, f(x) = 26.214400
Step 4: x = 4.096000, f(x) = 16.777216
Step 5: x = 3.276800, f(x) = 10.737418
Step 6: x = 2.621440, f(x) = 6.871948
Step 7: x = 2.097152, f(x) = 4.398047
Step 8: x = 1.677722, f(x) = 2.814750
Step 9: x = 1.342177, f(x) = 1.801440
Step 10: x = 1.073742, f(x) = 1.152922
Step 11: x = 0.858993, f(x) = 0.737870
Step 12: x = 0.687195, f(x) = 0.472237
Step 13: x = 0.549756, f(x) = 0.302231
Step 14: x = 0.439805, f(x) = 0.193428
Step 15: x = 0.351844, f(x) = 0.123794
Step 16: x = 0.281475, f(x) = 0.079228
Step 17: x = 0.225180, f(x) = 0.050706
Step 18: x = 0.180144, f(x) = 0.032452
Step 19: x = 0.144115, f(x) = 0.020769
Step 20: x = 0.115292, f(x) = 0.013292
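
The effect of the learning rate can be seen by repeating the same descent on `f(x) = x²` with different step sizes. The helper `run` and the three learning rates below are assumptions chosen for illustration.

```python
def df(x):
    # Derivative of f(x) = x^2
    return 2 * x

def run(learning_rate, steps=20, x=10.0):
    # Apply the gradient descent update a fixed number of times
    for _ in range(steps):
        x = x - learning_rate * df(x)
    return x

for lr in (0.01, 0.1, 1.1):
    print(f"learning_rate = {lr}: final x = {run(lr):.6f}")
```

Each update multiplies `x` by `(1 - 2 * learning_rate)`. A small rate shrinks `x` slowly, a moderate rate shrinks it quickly, and a rate above 1 makes the factor exceed 1 in magnitude, so `x` moves away from the minimum.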


# Why Probability Matters in Machine Learning

Probability is fundamental to machine learning, especially in classification tasks where models need to express uncertainty and confidence. Here's why it matters:

## Uncertainty Quantification

Machine learning models often deal with noisy, incomplete, or ambiguous data. Probability allows models to express:
- **Confidence levels**: How sure the model is about a prediction
- **Uncertainty**: When the model doesn't know the answer
- **Risk assessment**: Potential consequences of wrong predictions

## Classification Confidence

In classification, probability outputs provide more information than hard predictions:

### Example: Medical Diagnosis
- **Hard prediction**: "Patient has disease" vs "Patient doesn't have disease"
- **Probabilistic prediction**: "70% chance of disease, 30% chance of no disease"

The probabilistic approach allows doctors to:
- Make informed decisions based on confidence levels
- Set appropriate thresholds for different risk tolerances
- Combine predictions with other evidence

## Decision Making Under Uncertainty

Probability enables better decision-making:

1. **Threshold Selection**: Choose classification thresholds based on cost-benefit analysis
   - High-precision tasks (e.g., spam detection): Low false positive rate
   - High-recall tasks (e.g., disease screening): Low false negative rate

2. **Expected Value Calculations**: Make decisions based on expected outcomes
   ```
   Expected_Value = P(success) * value_success + P(failure) * value_failure
   ```

3. **Bayesian Approaches**: Update beliefs as new evidence arrives
   - Bayes' rule combines the prior probability with the likelihood of the new data to give the posterior probability
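
The three ideas above can be sketched in a few lines. All numbers here (predicted probabilities, payoffs, prior, sensitivity, and false positive rate) are hypothetical values chosen for illustration, and the function names are assumptions.

```python
import numpy as np

# 1. Threshold selection: the same probabilities give different labels
probs = np.array([0.05, 0.30, 0.55, 0.70, 0.95])  # hypothetical model outputs
positives = {}
for threshold in (0.3, 0.5, 0.8):
    labels = (probs >= threshold).astype(int)
    positives[threshold] = int(labels.sum())
    print(f"threshold {threshold}: labels = {labels}")

# 2. Expected value of an action with two possible outcomes
def expected_value(p_success, value_success, value_failure):
    return p_success * value_success + (1 - p_success) * value_failure

ev = expected_value(0.7, 100.0, -50.0)
print("expected value:", ev)

# 3. Bayes' rule: update a prior after observing a positive test result
def posterior(prior, sensitivity, false_positive_rate):
    # P(positive) = P(pos | disease) P(disease) + P(pos | healthy) P(healthy)
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

post = posterior(prior=0.01, sensitivity=0.9, false_positive_rate=0.05)
print(f"posterior: {post:.4f}")
```

Raising the threshold produces fewer positive predictions, which tends to trade recall for precision.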

## Model Interpretability

Probabilistic outputs make models more interpretable:
- **Feature importance**: How much each input contributes to the probability
- **Model calibration**: Ensuring predicted probabilities match real frequencies
- **Trust and transparency**: Users can see model confidence levels

## Real-World Applications

- **Autonomous vehicles**: Probability of pedestrian detection affects braking decisions
- **Financial risk**: Probability of loan default influences lending decisions
- **Weather forecasting**: Probability distributions for precipitation amounts
- **Recommendation systems**: Confidence in user preferences

## Mathematical Foundation

Many ML algorithms are built on probabilistic principles:
- **Logistic regression**: Outputs probabilities via sigmoid function
- **Neural networks**: Softmax for multi-class probabilities
- **Bayesian methods**: Explicit probability distributions over parameters
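
As a minimal sketch of the first two items, the sigmoid and softmax functions can be written in NumPy as follows. The input scores are made up for the example, and subtracting the maximum inside `softmax` is a common numerical-stability choice rather than something the text prescribes.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the max avoids overflow and does not change the result
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

print("sigmoid(0) =", sigmoid(0.0))

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw class scores
class_probs = softmax(scores)
print("softmax probabilities:", class_probs)
print("sum:", class_probs.sum())
```

The softmax outputs are non-negative and sum to 1, so they can be read as a probability distribution over classes.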

Without probability, models would be limited to hard decisions with no measure of confidence, missing the information needed for robust, real-world applications.