<a href="https://colab.research.google.com/github/jfodera/proj-ai-ml/blob/main/homeworks/hw1/hw1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1 -  Derive the objective function for Logistic Regression using Maximum Likelihood Estimation (MLE)

## Background
### Logistic Regression
- Using linear regression for binary classification produces an unbounded continuous output, which cannot be read directly as a class label or a probability.
- Problems with bounded target outcomes (specifically binary classification) call for logistic regression instead.
- In Logistic Regression, the model doesn't output "0" or "1" directly. It outputs a continuous value between 0 and 1, which we interpret as the probability that the input $x_i$ belongs to the "positive" class ($y=1$).
- To do this, we apply an activation on top of our linear model (from linear regression).
  - This is usually the sigmoid function.
  - An "activation" is simply a function applied to the output of a linear model.
- The predicted outcome:
  - $\hat{y} = \sigma(\mathbf{w}^T \mathbf{x} + b)$
  - where $\sigma(z) = \frac{1}{1+e^{-z}}$
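
The prediction formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the names `sigmoid` and `predict_proba`, and the weights, bias, and input values, are all assumptions made up for the example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps a real number z to a value in (0, 1).

    Note: math.exp can overflow for very negative z; that is
    acceptable for this small sketch but not for robust code.
    """
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x, b):
    """Return y_hat = sigmoid(w^T x + b) for a single input vector x."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights, bias, and input, chosen only for illustration.
w = [0.5, -1.0]
b = 0.25
x = [2.0, 1.0]

# z = 0.5*2.0 + (-1.0)*1.0 + 0.25 = 0.25
y_hat = predict_proba(w, x, b)
print(y_hat)  # approximately 0.562
```

The output is a probability for the positive class; turning it into a hard "0" or "1" requires a separate threshold choice (0.5 is a common default).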

### The Likelihood Function
- In Logistic Regression, our "outcome" is a series of $0$s and $1$s (e.g., "Not Spam" or "Spam").
- Our "rules" are the weights ($\beta$).
- For every data point $i$:
  - The model calculates a probability $p_i$ using the sigmoid function.
  - If the actual label is $y=1$, the likelihood for that point is simply $p_i$.
  - If the actual label is $y=0$, the likelihood for that point is $(1 - p_i)$.

**The Joint Likelihood**

The Likelihood Function $L(\beta)$ is the product of all these individual probabilities. If you have three data points where the first two are $1$ and the last is $0$, the likelihood is:

$$L(\beta) = p_1 \cdot p_2 \cdot (1 - p_3)$$

If the model is "good," all these factors will be close to $1$, making the total product high. If the model is "bad" and predicts a $0.1$ for a $y=1$ case, the entire product collapses toward zero [[1]](https://czep.net/stat/mlelr.pdf).
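
To make the three-point example concrete, here is a small sketch that computes the joint likelihood for a "good" and a "bad" model. The probability values are invented for illustration, and the function name `joint_likelihood` is hypothetical.

```python
def joint_likelihood(probs, labels):
    """Multiply p_i when y_i = 1 and (1 - p_i) when y_i = 0."""
    result = 1.0
    for p, y in zip(probs, labels):
        result *= p if y == 1 else (1.0 - p)
    return result

labels = [1, 1, 0]  # first two points are 1, the last is 0

# Illustrative predicted probabilities of the positive class.
good_probs = [0.9, 0.8, 0.1]  # confident and correct on every point
bad_probs = [0.1, 0.8, 0.1]   # predicts 0.1 for a y=1 case

good = joint_likelihood(good_probs, labels)  # 0.9 * 0.8 * 0.9, about 0.648
bad = joint_likelihood(bad_probs, labels)    # 0.1 * 0.8 * 0.9, about 0.072
print(good, bad)
```

A single poorly predicted point cuts the product by roughly a factor of nine here, which is the "collapse toward zero" described above.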

### MLE (Maximum Likelihood Estimation)
- *Likelihood function* - measures the probability of observing the given data under the assumed model.
  - Think of the likelihood function as a "summary of success" for your model's current settings.
  - We want to maximize this.
- Most ML models are optimization problems focused on minimizing error.
- MLE is a recipe for formulating the loss function that is to be minimized.
  - Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which serves as the loss function.
- MLE finds the parameter values that maximize the likelihood of the training data.
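
As a rough illustration of the idea, the sketch below runs MLE on the simplest possible case: a single Bernoulli parameter `p` with no features (an assumption made to keep the example tiny, not the full logistic regression model). It searches a grid of candidate values and shows that the value maximizing the likelihood is the same one that minimizes the negative log-likelihood. The data and function names are made up for the example.

```python
import math

def likelihood(p, ys):
    """Product of p for each y=1 and (1 - p) for each y=0."""
    result = 1.0
    for y in ys:
        result *= p if y == 1 else (1.0 - p)
    return result

def neg_log_likelihood(p, ys):
    """Negative sum of log-probabilities of the observed labels."""
    return -sum(math.log(p if y == 1 else 1.0 - p) for y in ys)

# Illustrative labels: six 1s and two 0s.
ys = [1, 1, 0, 1, 0, 1, 1, 1]

# Candidate parameter values 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]

p_max_lik = max(grid, key=lambda p: likelihood(p, ys))
p_min_nll = min(grid, key=lambda p: neg_log_likelihood(p, ys))

print(p_max_lik, p_min_nll)  # both 0.75, the fraction of 1s in ys
```

Grid search is used here only because it is easy to read; practical logistic regression fits its weights with an iterative optimizer instead.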




### Misc Definitions
- **objective function** - what we are trying to achieve
- **Parameters** - $\theta$, $\beta$, or weights: the values the model learns
  - These are variable and are what we optimize to create a model that best fits the data.
- **Features** - inputs that are fixed
  - e.g. the square footage of a house or a person's age
- **Loss function** - is a way to quantify difference between the real and predicted value of the target
- **Cost Function** - Loss function across entire set of data
- **NLL** - We use the negative log-likelihood so that probabilistic model fitting fits neatly into the loss-minimization framework used in machine learning.
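- **Follows a Bernoulli distribution** - the outcome is a single binary random variable that equals $1$ with probability $p$ and $0$ with probability $1 - p$. This is the distribution behind a simple binary classification problem.
  - Bernoulli probability mass function: $$f(k; p) = p^k (1 - p)^{1 - k} \quad \text{for } k \in \{0, 1\}$$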

### Citations
[1] Czepiel, Scott A. Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation. Carnegie Mellon University, 2022. PDF, https://czep.net/stat/mlelr.pdf



# Task 1 Answer

## In Words
The objective of logistic regression is to let a model efficiently tackle problems with bounded target outcomes (specifically binary classification). A logistic regression model does not output "0" or "1" directly. Rather, it outputs a continuous value between 0 and 1, which we interpret as the probability that the input $x_i$ belongs to the "positive" class ($y=1$).

Maximum Likelihood Estimation (MLE) is the part of logistic regression that optimizes the parameters. It finds the best parameters by maximizing the likelihood function. Because maximizing the likelihood is equivalent to minimizing the negative log-likelihood, MLE also serves as a recipe for formulating the loss function that is minimized.

## Mathematical Derivation

### 1. Start with Bernoulli

Since we are dealing with binary classification (0 or 1), we assume the labels follow a Bernoulli distribution. The probability of a single observation is:

- $P(y^{(i)} | x^{(i)}) = (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{(1 - y^{(i)})}$
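
As a quick check of the formula above, substituting each possible label shows that the single expression covers both cases. This is a worked substitution using only the symbols already defined:

$$
\begin{aligned}
y^{(i)} = 1:&\quad P(y^{(i)} | x^{(i)}) = (\hat{y}^{(i)})^{1} (1 - \hat{y}^{(i)})^{0} = \hat{y}^{(i)} \\
y^{(i)} = 0:&\quad P(y^{(i)} | x^{(i)}) = (\hat{y}^{(i)})^{0} (1 - \hat{y}^{(i)})^{1} = 1 - \hat{y}^{(i)}
\end{aligned}
$$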


### 2. The Likelihood Function

We want to maximize the probability of seeing all our actual labels. This means maximizing the likelihood function, which measures the probability of observing the given data under the assumed model. Assuming the data points are independent, we multiply their individual probabilities:

- $L(\theta) = \prod_{i=1}^{n} (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{(1 - y^{(i)})}$
  - 'L' stands for likelihood
  - '$(\theta)$' represents the parameters (weights and bias) of the model
    - shows that the likelihood is a function of the model's internal settings
  - $\prod$ - like a summation, but you multiply everything together instead of adding
  - 'n' - total number of observations
  - Calculate the probability for every data point independently and multiply them all together



### 3. Log-Likelihood Transformation

Multiplying many decimals (probabilities) leads to "arithmetic underflow" (numbers becoming too small for computers to represent). We apply the **natural log** to both sides, which turns the product into a sum and avoids this problem:

- $L(\theta) = \prod_{i=1}^{n} P(y^{(i)} | x^{(i)})$
  - log turns product $\prod$ into $\sum$
- $\log L(\theta) = \sum_{i=1}^{n} \log \left( (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{(1 - y^{(i)})} \right)$
  - taking the log of a product turns it into a sum: $\log(A \cdot B) = \log(A) + \log(B)$.
- $\log L(\theta) = \sum_{i=1}^{n} (\log((\hat{y}^{(i)})^{y^{(i)}}) + \log((1 - \hat{y}^{(i)})^{(1 - y^{(i)})}))$
  - use the power rule: $\log(a^b) = b \log a$

- $\log L(\theta) = \sum_{i=1}^{n} [y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$
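
The underflow problem and its fix can be seen directly in a short sketch. The dataset here is artificial (2000 points, each predicted at 0.5) and was chosen only because it makes the raw product too small for a 64-bit float; the function name `log_likelihood` is an assumption for the example.

```python
import math

def log_likelihood(ys, y_hats):
    """Sum of y*log(y_hat) + (1 - y)*log(1 - y_hat) over all points."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(ys, y_hats)
    )

# Artificial data: every label is 1 and every prediction is 0.5.
n = 2000
ys = [1] * n
y_hats = [0.5] * n

# Raw likelihood: the product of the per-point Bernoulli probabilities.
product = 1.0
for y, p in zip(ys, y_hats):
    product *= (p ** y) * ((1 - p) ** (1 - y))

print(product)                     # 0.0, since 0.5**2000 underflows
print(log_likelihood(ys, y_hats))  # about -1386.29, still usable

# On a small sample, the log of the product matches the sum of logs.
small_ys, small_y_hats = [1, 1, 0], [0.9, 0.8, 0.1]
small_product = 0.9 * 0.8 * (1 - 0.1)
print(math.log(small_product), log_likelihood(small_ys, small_y_hats))
```

The raw product carries no information once it reaches zero, while the log-likelihood stays finite and can still be compared across parameter settings.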

### 4. Defining the Loss and Cost Functions (Objective Function)

In machine learning we usually prefer to **minimize** a "loss" rather than maximize a "likelihood." So we negate the log-likelihood for each individual point and average it over the whole dataset.

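- **Loss Function** for a single instance:
  - $l^{(i)}(y^{(i)},\hat{y}^{(i)}) = -(y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}))$


- **Cost Function** for the entire dataset:
  - $L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(y^{(i)}, \hat{y}^{(i)})$
    - Here $L(y, \hat{y})$ denotes the cost, not the likelihood $L(\theta)$. It equals the negative log-likelihood from step 3 averaged over the $n$ observations.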
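
The loss and cost definitions above translate into code fairly directly. The following is a minimal sketch rather than a reference implementation: the names `loss` and `cost` are hypothetical, the example predictions are invented, and the `eps` clipping is an added assumption (not part of the derivation) that keeps `log(0)` from being evaluated.

```python
import math

def loss(y, y_hat, eps=1e-12):
    """Per-instance loss: -(y*log(y_hat) + (1 - y)*log(1 - y_hat)).

    y_hat is clipped to [eps, 1 - eps] so the logs stay finite;
    this clipping is an assumption added for numerical safety.
    """
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def cost(ys, y_hats):
    """Average of the per-instance losses over the dataset."""
    return sum(loss(y, p) for y, p in zip(ys, y_hats)) / len(ys)

labels = [1, 1, 0]

# Illustrative predictions for two hypothetical models.
good_preds = [0.9, 0.8, 0.1]
bad_preds = [0.1, 0.8, 0.1]

good_cost = cost(labels, good_preds)  # about 0.145
bad_cost = cost(labels, bad_preds)    # about 0.877
print(good_cost, bad_cost)
```

The model that assigns 0.1 to a true positive pays a much larger cost, so minimizing this quantity pushes the parameters toward the same solution that maximizes the likelihood.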

