### 1. Softmax Function
The softmax function is defined as:

$$
\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$

- Describes how a neural network turns its raw output scores into a probability for each class.
- Applied at the end of the network, in the output layer.

#### Meaning of terms:
- $z_i$: the *i*-th input score (logit)
- $K$: total number of classes or outputs
- $e^{z_i}$: exponential of the *i*-th score
- $\sum_{j=1}^{K} e^{z_j}$: normalization term to ensure all outputs sum to 1
- $\sigma(z_i)$: probability of class *i*

Thus, Softmax ensures:
$$
\sum_{i=1}^{K} \sigma(z_i) = 1
$$




In [2]:
#EXAMPLE OF SOFTMAX FUNCTION

import numpy as np

# Raw scores (logits) output by a model
logits = np.array([2.0, 1.0, 0.1])

# Compute softmax manually
exp_vals = np.exp(logits)
softmax_probs = exp_vals / np.sum(exp_vals)

print("Logits:", logits)
print("Exp_val:", exp_vals)
print("Sum of exp_val ", np.sum(exp_vals))
print("Softmax probabilities:", softmax_probs)
print("Sum of probabilities:", np.sum(softmax_probs))

Logits: [2.  1.  0.1]
Exp_val: [7.3890561  2.71828183 1.10517092]
Sum of exp_val  11.212508845465344
Softmax probabilities: [0.65900114 0.24243297 0.09856589]
Sum of probabilities: 1.0
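
The manual computation above can overflow when the logits are large, because `np.exp` grows very quickly. A common workaround is to subtract the largest logit before exponentiating; this leaves the result unchanged because the shift cancels in the ratio. The sketch below is one way to write it; the helper name `stable_softmax` is hypothetical and not part of the original example.

```python
import numpy as np

def stable_softmax(logits):
    """Softmax that subtracts the max logit before exponentiating (hypothetical helper)."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits)   # largest entry becomes 0, so exp() cannot overflow
    exp_vals = np.exp(shifted)
    return exp_vals / np.sum(exp_vals)

# Same logits as the example above: the probabilities are unchanged by the shift
probs = stable_softmax([2.0, 1.0, 0.1])
print("Softmax probabilities:", probs)

# Large logits that would overflow with a naive np.exp(logits)
big = stable_softmax([1000.0, 1001.0, 1002.0])
print("Large-logit probabilities:", big)
```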


### 2. One-hot encoding
- Converts integer class labels into vectors of 0s and 1s.
- This is how the target classes (the correct answers) are encoded.
- Applied before training, as preprocessing for the labels (`y_train`).


In [1]:
# EXAMPLE OF ONE-HOT ENCODING
from tensorflow.keras.utils import to_categorical
import numpy as np

# Suppose we have 5 samples with class labels from 0 to 3
y = np.array([0, 2, 1, 3, 2])

# One-hot encode
y_onehot = to_categorical(y, num_classes=4)

print("Original labels:\n", y)
print("One-hot encoded:\n", y_onehot)

Original labels:
 [0 2 1 3 2]
One-hot encoded:
 [[1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]]
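
For small experiments the same encoding can be sketched with NumPy alone, by indexing the rows of an identity matrix with the labels. This is an alternative illustration, not a replacement for `to_categorical`; it assumes the labels are integers in the range `0` to `num_classes - 1`.

```python
import numpy as np

# Same labels as in the example above
y = np.array([0, 2, 1, 3, 2])
num_classes = 4  # assumed to be known in advance

# Row k of the identity matrix is the one-hot vector for class k
y_onehot = np.eye(num_classes)[y]

print("One-hot encoded:\n", y_onehot)
```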


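### ONE-HOT ENCODING & SOFTMAX WORK TOGETHER
The one-hot label selects the softmax probability assigned to the true class, and the cross-entropy loss is the negative log of that probability.

✅ The smaller the loss, the closer the predicted probability of the true class is to 1.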

In [None]:
# True label (one-hot encoded)
y_true = np.array([0, 1, 0])   # class 1 is correct

# Model prediction (softmax output)
y_pred = np.array([0.1, 0.7, 0.2])

# Cross-entropy loss (manual calculation)
loss = -np.sum(y_true * np.log(y_pred + 1e-9))  # small epsilon to avoid log(0)
print("Cross-entropy loss:", loss)

Cross-entropy loss: 0.3566749425101611


### EQUATION

- mean squared error
- entropy
- (binary) cross-entropy
- softmax
- sigmoid
- scaling (min, max, mean, median)
- gradient descent: $w = w - \alpha \, \frac{\partial J}{\partial w}$
- SGD
- mini-batch SGD
- softmax with temperature
- R² score for linear regression


### 3. Entropy
Entropy measures the degree of randomness in data.
For a set of samples $X$ with $k$ classes:
$$
entropy(X) = - \sum_{i=1}^{k} p_i \log_2(p_i)
$$
where $p_i$ is the proportion of elements of class $i$.

Lower entropy implies greater predictability.
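
As a rough illustration of the formula, the sketch below estimates $p_i$ from label counts and evaluates the sum. The function name `entropy` and the toy label lists are made up for this example.

```python
import numpy as np

def entropy(labels):
    """Entropy (base 2) of a collection of class labels."""
    labels = np.asarray(labels)
    # p_i = proportion of elements belonging to class i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    h = -np.sum(p * np.log2(p))
    return float(h) + 0.0  # "+ 0.0" turns a possible -0.0 into 0.0

print(entropy([0, 0, 1, 1]))  # two equally likely classes: maximal uncertainty
print(entropy([1, 1, 1, 1]))  # a single class: fully predictable
```

Only classes that actually occur are counted, so every $p_i$ is positive and $\log_2(0)$ never arises.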




### 4. Information Gain
The information gain of an attribute a is the expected reduction in entropy due to splitting on values of a:

$$
gain(X, a) = entropy(X) \;-\; 
\sum_{v \in values(a)} 
\frac{|X_v|}{|X|} \; entropy(X_v)
$$

where $X_v$ is the subset of $X$ for which ${a = v}$.
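
The definition can be sketched directly in code: compute the entropy of the full set, then subtract the size-weighted entropy of each subset $X_v$. The helper names and the toy data below are hypothetical, chosen so that one attribute separates the classes perfectly and the other not at all.

```python
import numpy as np

def entropy(labels):
    """Entropy (base 2) of a collection of class labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p))) + 0.0

def information_gain(labels, attribute):
    """Expected reduction in entropy from splitting `labels` on `attribute`."""
    labels = np.asarray(labels)
    attribute = np.asarray(attribute)
    weighted = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]              # X_v
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted

labels = np.array(["yes", "yes", "no", "no"])
perfect_attr = np.array(["a", "a", "b", "b"])  # separates the classes completely
useless_attr = np.array(["a", "b", "a", "b"])  # each subset is still a 50/50 mix

gain_perfect = information_gain(labels, perfect_attr)
gain_useless = information_gain(labels, useless_attr)
print(gain_perfect, gain_useless)
```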


### 5. Euclidean distance
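$$
{||x-c||}^2 = ({x_1} - {c_1})^2 + ({x_2} - {c_2})^2
$$

This is the squared Euclidean distance between two points $x$ and $c$ in two dimensions; the distance itself, $||x-c||$, is its square root.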
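
A short numeric check of the formula, using two made-up points `x` and `c`:

```python
import numpy as np

x = np.array([1.0, 2.0])
c = np.array([4.0, 6.0])

# Squared distance: (1 - 4)^2 + (2 - 6)^2 = 9 + 16
squared_dist = np.sum((x - c) ** 2)

# The Euclidean distance is the square root of the squared distance
dist = np.sqrt(squared_dist)

print("Squared distance:", squared_dist)
print("Distance:", dist)
```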

### 6. Linear Regression

$$
y = b + \sum_{i} {w_i}{x_i}
$$

Where: \
$y$: output \
$b$: bias (intercept) \
${x_i}$: $i^{th}$ input \
${w_i}$: weight on $i^{th}$ input
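

The weighted sum is a dot product between the weight vector and the input vector. The values of `w`, `b`, and `x` below are arbitrary and only serve to show the arithmetic.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])  # one weight per input (illustrative values)
b = 1.0                         # bias
x = np.array([3.0, 4.0, 2.0])   # one sample with three inputs

# y = b + sum_i w_i * x_i = 1 + (6 - 4 + 1)
y = b + np.dot(w, x)
print("Prediction:", y)
```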


### 7. Loss Function

Squared error for a single sample $i$:

$$
(y^{<i>}- \hat{y}^{<i>})^2
$$

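### 8. Cost Function (all samples) - MSE (Mean Squared Error)
$$
MSE = \frac{1}{2M} \sum_{i=1}^{M} (y^{<i>} - \hat{y}^{<i>})^2
$$

where $M$ is the number of samples. The extra factor of $\frac{1}{2}$ is a common convention that cancels the 2 produced when the square is differentiated; it does not change where the minimum lies.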
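
A minimal sketch of the cost with the $\frac{1}{2M}$ scaling used in this section. The function name `half_mse` and the sample values are hypothetical.

```python
import numpy as np

def half_mse(y_true, y_pred):
    """Cost J = 1/(2M) * sum of squared errors over M samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    M = y_true.shape[0]
    per_sample_loss = (y_true - y_pred) ** 2   # squared error of each sample
    return float(np.sum(per_sample_loss) / (2 * M))

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

# Squared errors are 0.25, 0 and 1, so the cost is 1.25 / (2 * 3)
cost = half_mse(y_true, y_pred)
print("Cost:", cost)
```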

### 9. Sigmoid Activation Function
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
with $z = w^Tx +b$
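
The sketch below evaluates the sigmoid for $z = w^Tx + b$ with made-up weights and inputs. It shows the two properties that matter in practice: the output is 0.5 at $z = 0$ and always stays strictly between 0 and 1.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -0.25])  # illustrative weights
x = np.array([2.0, 4.0])    # illustrative input
b = 0.0

z = np.dot(w, x) + b        # 0.5*2 - 0.25*4 + 0 = 0
print("z:", z, "sigmoid(z):", sigmoid(z))
print("sigmoid(-5), sigmoid(5):", sigmoid(-5.0), sigmoid(5.0))
```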


### 10. Gradient Descent
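- The gradient is the vector of partial derivatives of a function with respect to its inputs; it points in the direction of steepest ascent, and its magnitude gives the steepness. Gradient descent (GD) minimizes $J$ by repeatedly stepping in the opposite direction: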
$$
\theta_j \leftarrow \theta_j
- \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)
$$
- Weight and bias update: 
$$
w = w
- \alpha \, \frac{\partial}{\partial w} J(w, b)
$$
$$
b = b
- \alpha \, \frac{\partial}{\partial b} J(w, b)
$$
Where:
- $w$ and $b$: weight and bias
- $\alpha$: learning rate
- $J(\theta) = J(w,b)$: cost function (here, the MSE)
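
To make the update rule concrete, here is a minimal sketch of gradient descent for a one-feature linear model $\hat{y} = wx + b$, using the cost $J = \frac{1}{2M}\sum (y - \hat{y})^2$. For this cost the partial derivatives are $\frac{\partial J}{\partial w} = \frac{1}{M}\sum (\hat{y} - y)\,x$ and $\frac{\partial J}{\partial b} = \frac{1}{M}\sum (\hat{y} - y)$. The data, the learning rate, and the number of iterations are assumptions picked for illustration.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (illustrative, noise-free)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0   # initial parameters
alpha = 0.05      # learning rate (assumed value)
M = len(x)

for _ in range(5000):
    y_hat = w * x + b
    error = y_hat - y
    dJ_dw = np.sum(error * x) / M   # partial derivative of J with respect to w
    dJ_db = np.sum(error) / M       # partial derivative of J with respect to b
    # Step against the gradient
    w = w - alpha * dJ_dw
    b = b - alpha * dJ_db

print("w:", w, "b:", b)  # should end up close to w = 2, b = 1
```

If the learning rate is too large the updates can overshoot and diverge; if it is too small, convergence is slow.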




### 11. Binary Cross-Entropy
$$
J(w, b)
= - \frac{1}{M} \sum_{i=1}^{M}
\left[
y^{(i)} \log \hat{y}^{(i)}
+ (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})
\right]
$$
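
A minimal sketch of this cost in NumPy. Clipping the predictions away from 0 and 1 with a small `eps` is an assumption added here to avoid `log(0)`; it is not part of the formula. The function name and the sample predictions are hypothetical.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-9):
    """Mean binary cross-entropy over M samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Keep predictions strictly inside (0, 1) so the logs stay finite (assumed safeguard)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_sample = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(per_sample))

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]  # high probability on the correct label
leaning_wrong = [0.4, 0.6, 0.3, 0.7]    # leans toward the wrong label

loss_good = binary_cross_entropy(y_true, confident_right)
loss_bad = binary_cross_entropy(y_true, leaning_wrong)
print("Loss (good predictions):", loss_good)
print("Loss (bad predictions):", loss_bad)
```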

### 12. R² (Coefficient of Determination)

The **R²** (coefficient of determination) indicates how much of the variance in the target is explained by the model.

$$
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
$$

Where:

* $SS_{\text{res}} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
  (Residual Sum of Squares: error between true and predicted values)

* $SS_{\text{tot}} = \sum_{i=1}^{n}(y_i - \bar{y})^2$
  (Total Sum of Squares: error between true values and their mean)

* $y_i$: true value

* $\hat{y}_i$: predicted value

* $\bar{y}$: mean of all $y_i$

Interpretation:
* $R^2 = 1$: perfect prediction
* $R^2 = 0$: model predicts no better than mean
* $R^2 < 0$: model is worse than just predicting the mean
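
The three cases in the interpretation can be reproduced with a few lines of NumPy. The helper `r_squared` and the sample values are hypothetical; the function simply follows the definitions of $SS_{\text{res}}$ and $SS_{\text{tot}}$ above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([3.0, 5.0, 7.0, 9.0])

r2_perfect = r_squared(y_true, y_true)                              # exact predictions
r2_mean = r_squared(y_true, np.full_like(y_true, y_true.mean()))    # always predict the mean
r2_worse = r_squared(y_true, y_true[::-1])                          # predictions in reversed order

print(r2_perfect, r2_mean, r2_worse)
```

Note that the sketch divides by $SS_{\text{tot}}$, so it assumes the true values are not all identical.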

### 13. Lasso Regression (L1)
Lasso (L1 regularization) minimizes the sum of squared errors **plus** a penalty proportional to the **absolute values of the coefficients**. This L1 penalty encourages **sparsity**, meaning it pushes some coefficients to **zero**, effectively performing feature selection.

The loss function of Lasso regression is:

$$\mathcal{L}(\beta) = \frac{1}{2m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \alpha \sum_{j=1}^{p} |\beta_j|$$

Where:
* $m$: number of samples
* $p$: number of features (coefficients)
* $y_i$: actual target value for sample $i$
* $\hat{y}_i = X_i \cdot \beta$: predicted value for sample $i$
* $\beta_j$: the $j$-th coefficient of the model
* $\alpha \geq 0$: regularization strength
* The **first term** is the **mean squared error (MSE)**, scaled by $\frac{1}{2}$
* The **second term** is the **L1 penalty** (sum of absolute values of the coefficients)
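
The loss can be evaluated directly from its two terms. The sketch below only computes the value of the loss for a given coefficient vector `beta`; it does not fit a model. The function name `lasso_loss` and the small matrices are made up for illustration.

```python
import numpy as np

def lasso_loss(X, y, beta, alpha):
    """Lasso loss: 1/(2m) * sum of squared errors + alpha * sum of |beta_j|."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.asarray(beta, dtype=float)
    m = X.shape[0]
    residuals = y - X @ beta                       # y_i - y_hat_i for every sample
    data_term = np.sum(residuals ** 2) / (2 * m)   # squared-error term
    l1_penalty = alpha * np.sum(np.abs(beta))      # L1 penalty term
    return float(data_term + l1_penalty)

X = [[1.0, 2.0], [3.0, 4.0]]
y = [0.0, 0.0]
beta = [1.0, -1.0]   # predictions are [-1, -1], so both residuals equal 1

loss_no_penalty = lasso_loss(X, y, beta, alpha=0.0)    # 2 / (2 * 2)
loss_with_penalty = lasso_loss(X, y, beta, alpha=0.1)  # adds 0.1 * (1 + 1)
print(loss_no_penalty, loss_with_penalty)
```

Raising `alpha` increases the cost of keeping large coefficients, which is what pushes some of them to exactly zero during fitting.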