# Purpose Statement

The purpose of this notebook is to create a brief reference for introductory machine learning. Thus, the lessons posted here will be:

1. Concise;
2. Easily traversable; and
3. Easily digestible.

To the best of my ability, I will keep the mathematical notation consistent. 

### Table of Contents 

1. Introduction
2. Probability Theory 
3. Decision Theory 
4. Probability Distributions 
5. Linear Models for Regression 
6. Linear Models for Classification 
7. Sampling Methods
8. Neural Networks 
9. Mixture Models & Expectation Maximization  
10. Continuous Latent Variables 

## Introduction 


### A simple regression problem

Suppose we are given data generated by

$$f(x) = \sin(x) + e$$

where e is a random, normally distributed noise term that adds variation to the data.

We can approximate this function using a polynomial of order M:

$$y(x, w) = w_0 + w_1x + w_2x^2 + \dots + w_Mx^M = \sum_{j=0}^{M}{w_jx^j}$$
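As a rough illustration, the sketch below generates noisy samples of sin(x) and evaluates a polynomial for a given weight vector. The interval [0, 2π], the noise standard deviation of 0.1, the seed, and the helper names `make_data` and `poly` are all assumptions made for this example. It uses only the Python standard library.

```python
import math
import random

def make_data(n, noise_std=0.1, seed=0):
    """Sample n evenly spaced points of sin(x) + e on [0, 2*pi] (assumed interval),
    where e is Gaussian noise with standard deviation noise_std."""
    rng = random.Random(seed)
    xs = [2 * math.pi * i / (n - 1) for i in range(n)]
    ts = [math.sin(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ts

def poly(x, w):
    """Evaluate y(x, w) = w_0 + w_1 x + ... + w_M x^M using Horner's rule."""
    y = 0.0
    for coeff in reversed(w):
        y = y * x + coeff
    return y

xs, ts = make_data(10)

# Hand-picked guess (the cubic Taylor expansion of sin), not fitted weights.
w = [0.0, 1.0, 0.0, -1.0 / 6.0]
print(poly(0.5, w), math.sin(0.5))
```

The weights here are chosen by hand only to show how `poly` is called; finding good weights from the data is the subject of the following sections.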

### An introduction to a loss function 

To find appropriate weights w, we minimize a loss function, which measures how far the predictions y(x_n, w) are from the observed target values t_n. The sum of squared errors is one example of a loss function:

$$E(w) = \frac{1}{2}\sum_{n=1}^{N}{(y(x_n, w) - t_n)^2} $$

The interpretation of this function is "half the sum, over all N data points, of the squared difference between the predicted value and the actual value."

Note that because of the square, larger deviations have a disproportionately larger effect on the value of the loss function.
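A minimal sketch of the sum-of-squares error, assuming the polynomial model from the previous section. The function names and the toy data are made up for illustration.

```python
def poly(x, w):
    """Evaluate y(x, w) = sum_j w_j * x^j."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def sum_of_squares_error(w, xs, ts):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    return 0.5 * sum((poly(x, w) - t) ** 2 for x, t in zip(xs, ts))

# Toy data: the constant model y = 0 misses the targets by 1 and by 3.
xs = [0.0, 1.0]
ts = [1.0, 3.0]
w = [0.0]

error = sum_of_squares_error(w, xs, ts)
print(error)  # 0.5 * (1^2 + 3^2) = 5.0
```

The second point misses by three times as much as the first, but contributes nine times as much to the error, which is the effect of the square.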

### An introduction to minimizing the loss function

We minimize the loss function by choosing the value of w for which E(w) is as small as possible.
A common method of achieving this is gradient descent. The steps for gradient descent are as follows:

1. Initialize w with a starting guess. This could be random or based on some criteria.
2. Calculate the gradient of E(w) at the current value of w. The gradient is the vector that points in the direction of steepest ascent.
3. Adjust the parameters in the opposite direction of the gradient. The size of the step is the gradient scaled by the learning rate, a hyperparameter chosen by the user.
4. Repeat steps 2 and 3 until convergence, for example until the gradient or the change in E(w) falls below a small threshold.
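The steps above can be sketched for the polynomial model and the sum-of-squares error. For that loss, the gradient with respect to each weight is dE/dw_j = Σ_n (y(x_n, w) - t_n) x_n^j. Everything specific here is an assumption chosen for illustration: the interval [-1, 1] (so that powers of x stay well scaled), the cubic model, the learning rate, the tolerance, and the helper names.

```python
import math
import random

def poly(x, w):
    """Evaluate y(x, w) = sum_j w_j * x^j."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def loss(w, xs, ts):
    """Sum-of-squares error E(w)."""
    return 0.5 * sum((poly(x, w) - t) ** 2 for x, t in zip(xs, ts))

def gradient(w, xs, ts):
    """dE/dw_j = sum_n (y(x_n, w) - t_n) * x_n^j."""
    grad = [0.0] * len(w)
    for x, t in zip(xs, ts):
        residual = poly(x, w) - t
        for j in range(len(w)):
            grad[j] += residual * x ** j
    return grad

def gradient_descent(xs, ts, degree=3, learning_rate=0.01,
                     tol=1e-8, max_steps=10000):
    w = [0.0] * (degree + 1)                      # step 1: initial guess
    for _ in range(max_steps):
        grad = gradient(w, xs, ts)                # step 2: gradient at w
        if math.sqrt(sum(g * g for g in grad)) < tol:
            break                                 # step 4: converged
        # Step 3: move against the gradient, scaled by the learning rate.
        w = [w_j - learning_rate * g_j for w_j, g_j in zip(w, grad)]
    return w

# Noisy samples of sin(x) on [-1, 1] (assumed interval and noise level).
rng = random.Random(0)
xs = [-1.0 + 2.0 * i / 19 for i in range(20)]
ts = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]

initial_error = loss([0.0] * 4, xs, ts)
w_fit = gradient_descent(xs, ts)
final_error = loss(w_fit, xs, ts)
print(initial_error, final_error)
```

This version uses the full data set for every gradient evaluation (batch gradient descent). The stopping rule based on the gradient norm is one common choice among several.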

So why does gradient descent work? 

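At a minimum of the loss function the gradient is equal to zero, and for convex loss functions any such point is a global minimum. Because the negative gradient points in the direction of steepest descent, a sufficiently small step in that direction decreases E(w). Repeating such steps drives the gradient towards zero, so with a suitable learning rate gradient descent converges to a stationary point: the global minimum for convex loss functions, and typically a local minimum otherwise.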

Note that while the basic principle behind gradient descent is straightforward, there are many challenges in practice, such as: 
- Choosing an appropriate learning rate
- Avoiding getting stuck in local minima or saddle points 
- Handling the high computational cost with large data sets
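
The first of these challenges can be seen on a toy problem. The sketch below runs gradient descent on the one-dimensional loss E(w) = w²/2, whose gradient is simply w, with two learning rates. The loss, the starting point, and the two rates are assumptions picked to make the effect visible.

```python
def descend(learning_rate, w=1.0, steps=20):
    """Run gradient descent on E(w) = 0.5 * w^2, whose gradient is w."""
    for _ in range(steps):
        grad = w
        w = w - learning_rate * grad
    return w

small_rate_result = descend(0.1)   # each step shrinks w by a factor of 0.9
large_rate_result = descend(2.5)   # each step multiplies w by -1.5

print(small_rate_result, large_rate_result)
```

With the smaller rate, w moves steadily towards the minimum at 0. With the larger rate, every step overshoots the minimum by more than the previous one, so the iterates diverge.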


