# Machine Learning
#### Field of study that gives machines the ability to learn without being explicitly programmed

## Machine Learning Algorithms:
- Supervised Learning Algorithms
- Unsupervised Learning Algorithms
- Recommender Systems
- Reinforcement Learning Algorithms

## Supervised Learning Algorithms:
- Algorithms that learn from labeled examples (input, output pairs) and then predict the output for a new input
- Example:
  - Email (Input) -> spam or not (Output) : Spam filtering
  - Image & Radar Info (Input) -> position of other cars (Output) : Self-driving cars
- Types of Supervised Learning
  - Regression : *Predicting a number from infinitely many possible outputs*
  - Classification : *Predicting a category*

## Unsupervised Learning Algorithms:
- Algorithms that find structure or patterns in unlabeled data
- Types of Unsupervised Learning Algorithms
  - Clustering
  - Association Rules
  - Dimensionality reduction : Compress data using fewer numbers

# Supervised Learning Algorithms:
## Regression Model : Predicting numbers
### Linear Regression with one Variable:
- Suppose we have existing data of x, y where for different values of x, the value of y changes
- Example : the rent of a flat varies with its size. Suppose we have historical data giving the rent for 10 different sizes. Given a new size as input, the model has to predict the corresponding rent.
- Terminology
  - Data used to train model (historical data) -> training set
  - Input variable or feature -> x
  - Original Output variable or target -> y
  - Number of training examples -> m
  - Single training example will be (x,y)
- Now we have to choose a linear function (since this is linear regression) which can be used to predict the y values for a given x
- Let the linear function (our model) be *$f_{(w,b)}(x) = wx + b$*
  - We can choose the model as per needs and the number of variables.
  - For example, if we have multiple (let's say 3) variables but still want a linear function, the model becomes
    - *$f_{(w,b)}(x) = w_1x_1 +w_2x_2 + w_3x_3 + b$*
  - If we want to use polynomial regression with one variable, the function can look like
    - *$f_{(w,b)}(x) = w_1x^1 +w_2x^2 + w_3x^3 + b$*
- We have to find the values of *w* and *b* for which the predicted y values are as close as possible to the actual y values in the training set
- For a given input x, the output value that the function predicts is called *y-hat* (written $\hat{y}$)
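
The single-variable model above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `predict`, the sample sizes and rents, and the hand-picked values of `w` and `b` are all assumptions made up for this example.

```python
# Hypothetical training set of (x, y) pairs: flat size and rent.
# The numbers are invented for illustration only.
training_set = [(500.0, 10000.0), (750.0, 15000.0), (1000.0, 20000.0)]

m = len(training_set)  # number of training examples


def predict(x: float, w: float, b: float) -> float:
    """Return y-hat = w * x + b for a single input x."""
    return w * x + b


# Parameters picked by hand here; finding them automatically is
# what the rest of these notes are about.
w, b = 20.0, 0.0

# One prediction (y-hat) per training input.
predictions = [predict(x, w, b) for x, _ in training_set]
```

With these hand-picked parameters every prediction happens to equal the actual rent, which is the situation training tries to approach on real data.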

#### Finding the values of *w* and *b*
##### Cost Function
- A cost function evaluates how well the model's predictions match the actual data.
- It quantifies the error between the predicted value and the actual value
- The goal of the learning algorithm is to minimize this cost function, thereby improving the accuracy of the model
- **Commonly used Cost functions for Regression**
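  - Mean Squared Error : *Measures the average of the squared differences between predicted and actual values*
  - Mean Absolute Error : *Measures the average of the absolute differences between predicted and actual values*
  - Huber Loss : *Combines the properties of MSE and MAE (quadratic for small errors, linear for large ones)*
  - Mean Squared Logarithmic Error : *Measures the average of the squared differences between the logarithms of predicted and actual values*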
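
To make the four cost functions listed above concrete, here is a rough Python sketch using only the standard library. The function names, the `delta` threshold for Huber loss (defaulting to 1.0), and the use of `log(1 + value)` in the logarithmic error are choices made for this example; libraries may use different conventions.

```python
import math
from typing import Sequence


def mean_squared_error(y_hat: Sequence[float], y: Sequence[float]) -> float:
    """Average of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


def mean_absolute_error(y_hat: Sequence[float], y: Sequence[float]) -> float:
    """Average of absolute differences."""
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)


def huber_loss(y_hat: Sequence[float], y: Sequence[float],
               delta: float = 1.0) -> float:
    """Quadratic for errors up to delta, linear beyond it."""
    total = 0.0
    for p, t in zip(y_hat, y):
        e = abs(p - t)
        if e <= delta:
            total += 0.5 * e ** 2
        else:
            total += delta * (e - 0.5 * delta)
    return total / len(y)


def mean_squared_log_error(y_hat: Sequence[float], y: Sequence[float]) -> float:
    """Average squared difference of log(1 + value); needs values > -1."""
    return sum((math.log1p(p) - math.log1p(t)) ** 2
               for p, t in zip(y_hat, y)) / len(y)
```

On predictions `[1.0, 2.0, 6.0]` against targets `[1.0, 2.0, 3.0]`, the single large error is penalized more heavily by MSE than by MAE or Huber loss, which is the usual reason to prefer the latter two when outliers are present.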

**We'll use Mean Squared Error Cost Function here**

#### Mean Squared Error Cost Function
- Error for the i-th training example = $\hat{y}^i - y^i$ (the superscript *i* is an index, not an exponent)
- Squared error = $(\hat{y}^i - y^i)^2$
- Sum of squared errors over the m training examples = $\sum_{i=1}^m(\hat{y}^i - y^i)^2$
- Mean of squared error = $\frac{1}{m}\sum_{i=1}^m(\hat{y}^i - y^i)^2$
- We'll also divide the entire function by 2, so that the factor cancels the 2 produced when differentiating the square. This is optional: scaling the cost by a constant does not change which *w* and *b* minimize it
- Final MSE cost function: $J_{(w,b)}$ = $\frac{1}{2m}\sum_{i=1}^m(\hat{y}^i - y^i)^2$ = $\frac{1}{2m}\sum_{i=1}^m(f_{w,b}(x^i) - y^i)^2$
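
The final formula can be translated almost term by term into Python. The sketch below assumes the single-variable model $f_{w,b}(x) = wx + b$; the name `compute_cost` and the sample data are made up for illustration.

```python
from typing import Sequence


def compute_cost(xs: Sequence[float], ys: Sequence[float],
                 w: float, b: float) -> float:
    """Return J(w, b) = 1/(2m) * sum((w*x + b - y)^2) over the training set."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = w * x + b          # model prediction for this example
        total += (y_hat - y) ** 2  # squared error for this example
    return total / (2 * m)


# Hypothetical data generated from y = 2x, so w=2, b=0 fits exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

cost_perfect = compute_cost(xs, ys, w=2.0, b=0.0)  # zero error
cost_off = compute_cost(xs, ys, w=1.0, b=0.0)      # (1 + 4 + 9) / 6
```

A lower value of `compute_cost` means the chosen `w` and `b` fit the training data more closely.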

**Now we have to find a way to minimize the cost function so that the error is as small as possible and the predictions are as close as possible to the actual values**

##### Algorithms that can be used to minimize cost functions
- Gradient based algorithms
  - Gradient Descent
    - Batch Gradient Descent
    - Stochastic Gradient Descent
    - Mini-batch gradient descent
  - Variants of Gradient descent
    - Momentum
    - Nesterov Accelerated Gradient
    - Adagrad
    - Adadelta
    - RMSprop
    - Adam
- Second-Order Methods
  - Newton's Method
  - Quasi-Newton Methods
    - Broyden-Fletcher-Goldfarb-Shanno
    - Limited-memory BFGS
- Derivative-Free Optimization
  - Genetic Algorithms
  - Simulated Annealing
  - Particle Swarm Optimization
  - Nelder-Mead (Simplex) Method
- Convex Optimization
  - Interior-Point Methods
  - Dual Ascent and Dual Decomposition
- Specialized Algorithms
  - Expectation-Maximization
  - Support Vector Machines
- Bayesian Optimization
  - Gaussian Processes
- Reinforcement Learning
  - Policy Gradient Methods


##### Minimizing cost function using *Gradient descent* Algorithm
- Initialize parameters : Start with initial guesses for w,b
- Compute gradients :
  - Calculate the partial derivatives of the cost function *J* with respect to w and b
  - $\frac{\partial J}{\partial w}$ = $\frac{1}{m}\sum_{i=1}^m(f_{w,b}(x^i) - y^i)x^i$
  - $\frac{\partial J}{\partial b}$ = $\frac{1}{m}\sum_{i=1}^m(f_{w,b}(x^i) - y^i)$
- Update Parameters
  - Adjust w,b in the direction of the negative gradient to reduce the cost
  - $w = w - \alpha\frac{\partial J}{\partial w}$
  - $b = b - \alpha\frac{\partial J}{\partial b}$
  - $\alpha$ is the learning rate (a small positive number that controls the step size)
- Iterate : repeat the gradient computation and parameter update steps until the cost function converges to a minimum
- Convergence : The algorithm converges when the changes in w,b become very small
- By iteratively adjusting the parameters *w* and *b* using the gradient descent algorithm, we minimize the MSE and thus improve the accuracy of our linear regression model in predicting the target variable.
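
The steps above can be sketched as a small batch gradient descent loop in plain Python. This is an illustrative sketch under stated assumptions: the function names, the learning rate `alpha=0.05`, the iteration cap `max_iters`, the tolerance `tol`, and the sample data are all choices made for this example, and a different dataset may need a different learning rate.

```python
from typing import Sequence, Tuple


def compute_gradients(xs: Sequence[float], ys: Sequence[float],
                      w: float, b: float) -> Tuple[float, float]:
    """Return (dJ/dw, dJ/db) for the model f(x) = w*x + b."""
    m = len(xs)
    dj_dw = 0.0
    dj_db = 0.0
    for x, y in zip(xs, ys):
        error = (w * x + b) - y  # f(x^i) - y^i
        dj_dw += error * x
        dj_db += error
    return dj_dw / m, dj_db / m


def gradient_descent(xs: Sequence[float], ys: Sequence[float],
                     w: float = 0.0, b: float = 0.0,
                     alpha: float = 0.05, max_iters: int = 10000,
                     tol: float = 1e-9) -> Tuple[float, float]:
    """Repeat the update step until w and b barely change."""
    for _ in range(max_iters):
        # Both gradients are computed from the current w and b
        # before either parameter is updated.
        dj_dw, dj_db = compute_gradients(xs, ys, w, b)
        new_w = w - alpha * dj_dw
        new_b = b - alpha * dj_db
        converged = abs(new_w - w) < tol and abs(new_b - b) < tol
        w, b = new_w, new_b
        if converged:
            break
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)  # should approach w = 2, b = 1
```

If the learning rate is too large for the data, the updates can overshoot and the cost grows instead of shrinking; the iteration cap is there so the loop still terminates in that case.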





