# Year 4 Project Notebook

### ~ *Guner Aygin*

## Introduction
The title of this project is still undecided, as it depends on which probabilistic methods and machine learning techniques end up being used. The aim of this project is to learn big data methods for analysing astronomical data sets, namely the solar cycle data.  

A key aspect of this project is the use of **Probability & Statistics** - nothing will work without a solid understanding of these topics (see *Numerical Recipes* ch. 14-15).
Some of the topics I will be covering include:

* Linear Regression
    * $y = mx + c$ for *simple linear regression* where we only have one variable. 
    * $y = X b + \epsilon$ for *multiple linear regression*, where $X$ is a matrix holding multiple variables, $b$ is the vector of coefficients and $\epsilon$ is the error term.
* Non-linear Regression - fitting a *polynomial* model.
* Bayes' Theorem - $ P(A|B) \propto P(B|A) \, P(A)$
    * $P(A|B)$ - posterior
    * $P(A)$ - prior
    * $P(B|A)$ - likelihood, which requires both the data & a model.
* Machine Learning (ML) - there are two types of ML approach: ***Supervised*** & ***Unsupervised*** learning. 
    * **Supervised** machine learning requires labelled input and output data during the training phase of the machine learning lifecycle. It is used to classify unseen data into established categories and forecast trends and future change as a predictive model.
        * **Classification** - Categorizing a given set of data into classes.
        * **Regression** - Predicting outcomes for continuously changing data.
    * **Unsupervised** machine learning is the training of models on raw and unlabelled training data. It is used to identify patterns and trends in raw datasets, or to cluster similar data into a specific number of groups.
        *  **Clustering** - Grouping N data points into M groups. 
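
The Bayes' theorem bullet above can be made concrete with a small numerical sketch. The two hypotheses, priors and likelihoods below are made-up illustrative numbers (not project data); the only point is that the posterior is the likelihood times the prior, normalised by the evidence $P(B)$.

```python
# Hypothetical example: two competing hypotheses A1, A2 and one observation B.
priors = {"A1": 0.7, "A2": 0.3}        # P(A), assumed values
likelihoods = {"A1": 0.2, "A2": 0.9}   # P(B|A), assumed values

# Unnormalised posterior: P(B|A) * P(A)
unnorm = {h: likelihoods[h] * priors[h] for h in priors}

# Evidence P(B): summing over hypotheses turns the proportionality into an equality
evidence = sum(unnorm.values())
posterior = {h: unnorm[h] / evidence for h in unnorm}

print(posterior)
```

Even though A1 starts with the larger prior, the observation favours A2 strongly enough that A2 ends up with the larger posterior.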

As well as the differences between Supervised and Unsupervised ML models, we also have Generative and Discriminative models. 
* **Generative** models can generate new data. They capture the *joint probability* $P(X,Y)$, so they include the distribution of the data itself.
* **Discriminative** models discriminate between different kinds of data instances, e.g. Dead/Alive, Yes/No, Pass/Fail. They model the *conditional probability* $P(Y|X)$: they ignore how likely an instance is and only tell you how to label it. Discriminative methods can lead to incorrect results: if something along the *decision tree* pushes a True to a False, the final conclusion will be incorrect. These errors need to be accounted for.
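
To see how the joint and conditional probabilities relate, here is a minimal sketch using a made-up joint table over a binary feature $X$ and a Pass/Fail label $Y$ (all numbers are assumptions for illustration). A generative model would hold the whole table; a discriminative model would only learn the conditional that falls out of it.

```python
# Made-up joint distribution P(X, Y) over a binary feature X and a label Y.
joint = {
    (0, "pass"): 0.10, (0, "fail"): 0.30,
    (1, "pass"): 0.45, (1, "fail"): 0.15,
}

def marginal_x(x):
    """P(X = x), obtained by summing the joint over all labels."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional(y, x):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)."""
    return joint[(x, y)] / marginal_x(x)

print(marginal_x(1))           # how likely the instance X = 1 is
print(conditional("pass", 1))  # how to label it
```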



The **LIKELIHOOD** all comes from statistics and probability, whereas the **MODEL** comes from the regression/ML/generative models.

The other aspects of this project which I will be working on include: *optimization*, *MCMC*, *nested sampling* (more information on these at a later stage).

#### Future Work
In this project, in order to learn the different techniques, I will be attempting a series of mini-projects (what they are is currently unknown) which centre around the themes discussed above. The eventual goal is to model the *Solar Activity Cycles* using ML and compare the results with the traditional techniques, (hopefully) concluding that ML is superior.

# Statistics Notes
##### Sources ~ Numerical Recipes, & Scientific Inference Lecture Notes
Note: this section may eventually end up in a separate document to prevent this one becoming too dense, but for now I will leave it here.

## Moments of a Distribution
* **p-value test**: the p-value tells us how likely it is to obtain a data point at least this extreme by chance, assuming the *null hypothesis* is true. We set a significance threshold such that if a data point falls inside the tail region we *reject* the *null hypothesis*. NB, we cannot prove the null hypothesis, only *disprove* it.

* **Model independent data analysis**: descriptive, i.e. *mean*, *variance*, etc.

* **Model dependent data analysis**: *parameter estimation*, *least-squares fits* etc.
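
As a rough sketch of the p-value test, the snippet below computes a two-sided p-value for a single data point, assuming the null hypothesis is a Gaussian with known mean and standard deviation. The Gaussian null, the threshold `alpha = 0.05` and the numbers are all assumptions made for the example.

```python
import math

def two_sided_p_value(x, mu, sigma):
    """P(|X - mu| >= |x - mu|) for X ~ Normal(mu, sigma^2)."""
    z = abs(x - mu) / sigma
    return math.erfc(z / math.sqrt(2))

alpha = 0.05  # assumed significance threshold
p = two_sided_p_value(x=2.5, mu=0.0, sigma=1.0)
print(p, "reject null" if p < alpha else "cannot reject null")
```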
*******************************************************************************************************************************
* **Mean**: $  \langle x \rangle \equiv \overline{x} = \frac{1}{N} \sum_{j=0}^{N-1} x_{j} = \sum_{i} x_i P(x_i) = \int x P(x) \,dx $ --> may not be very good for distributions with broad tails.

* **Median**: $\int_{-\infty}^{x_{med}} p(x)\,dx = \frac{1}{2} = \int_{x_{med}}^{\infty} p(x)\,dx$ --> value of $x$ for which values smaller or larger than $x_{med}$ are equally probable.

$x_{med} = \begin{cases}
    x_{(N+1)/2}, & \text{N odd}\\
    \frac{1}{2}(x_{N/2} + x_{N/2 + 1}), & \text{N even}
  \end{cases}$

* **Mode**: Value of $x$ where $p(x)$ takes on the *greatest* value.
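
A quick illustration of the three measures of central tendency, using Python's built-in `statistics` module on a made-up sample with one large outlier. It shows why the mean may not be a good summary for distributions with broad tails, while the median and mode barely notice the outlier.

```python
import statistics

data = [2.0, 3.0, 3.0, 4.0, 5.0, 40.0]  # made-up sample with one large outlier

mean = statistics.mean(data)      # pulled upwards by the outlier
median = statistics.median(data)  # N even: average of the two middle values
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```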

*******************************************************************************************************************************
* **Variance**: $ Var(x) = \frac{1}{N-1} \sum_{j=0}^{N-1} \left( x_{j} - \overline{x} \right)^2 = \sum_{i} \left(x_i - \overline{x} \right)^2 P(x_i) = \int \left(x - \overline{x} \right)^2 P(x) \,dx $ --> width or variability around that value.

* **Skewness**: $\gamma = \frac{1}{N} \sum_{j=1}^{N} \left[\frac{x_j - \overline{x}}{\sigma}\right]^3$ --> characterises the degree of asymmetry of a distribution around its mean.

* **Kurtosis**: $\kappa = \left\{\frac{1}{N} \sum_{j=1}^{N} \left[\frac{x_j - \overline{x}}{\sigma}\right]^4\right\} - 3$ --> Measures peakedness or flatness of a distribution relative to the *Normal Distribution*.

* **Cumulative Distribution Function**: $C(x) = Prob(x' \le x) = \sum_{x_i \le x} P(x_i) = \int_{x_{min}}^{x} P(x') \, dx'$

Note: for **Discrete Distributions** we have a **PMF** which we *sum* over, and for **Continuous Distributions** we have a **PDF** which we *integrate* over.
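
The sample formulas for variance, skewness and kurtosis can be sketched directly in plain Python. The function name `sample_moments` and the test data are my own choices for illustration; the formulas are the ones written above (with the $1/(N-1)$ variance and the $-3$ in the kurtosis).

```python
import math

def sample_moments(xs):
    """Return (mean, variance, skewness, kurtosis) of a sample."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # 1/(N-1) normalisation
    sigma = math.sqrt(var)
    skew = sum(((x - mean) / sigma) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / sigma) ** 4 for x in xs) / n - 3.0
    return mean, var, skew, kurt

mean, var, skew, kurt = sample_moments([1.0, 2.0, 3.0, 4.0, 5.0])
print(mean, var, skew, kurt)
```

For this symmetric sample the skewness vanishes (up to rounding), and the negative kurtosis reflects a distribution flatter than the Normal.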

*******************************************************************************************************************************
* **Standard Deviation**: $ \sigma(x) = \sqrt{Var(x)}$ --> root-mean-square deviation of $x$ from its mean value.

* **Standard Error of Estimated Mean**: $\sigma(x) / \sqrt{N}$ --> accuracy with which the sample mean estimates the population mean.

* **Standard Error of Estimated Variance**: $\sigma(x)^2 \sqrt{\frac{2}{N}}$.

* **Standard Error of Estimated $\sigma$**: $\sigma(x)/\sqrt{2N}$.
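
A minimal sketch of the standard errors above, applied to a made-up set of repeated measurements. Note that the expressions for the error on the variance and on $\sigma$ are approximations that assume roughly Gaussian data and a reasonably large $N$, so with only eight points they are indicative at best.

```python
import math
import statistics

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]  # made-up measurements
n = len(data)

sigma = statistics.stdev(data)        # sqrt of the 1/(N-1) variance
se_mean = sigma / math.sqrt(n)        # standard error of the mean
se_var = sigma**2 * math.sqrt(2 / n)  # standard error of the variance
se_sigma = sigma / math.sqrt(2 * n)   # standard error of sigma

print(f"mean  = {statistics.mean(data):.3f} +/- {se_mean:.3f}")
print(f"var   = {sigma**2:.4f} +/- {se_var:.4f}")
print(f"sigma = {sigma:.3f} +/- {se_sigma:.3f}")
```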

*******************************************************************************************************************************
* **Moments**: $\mu_k = \langle x^k \rangle = \int x^k P(x)\, dx $
    * $\mu_0 = \int P(x)\, dx = 1$
    * $\mu_1 = \langle x \rangle = \mu$
    
* **Central Moments**: $\nu_k = \langle (x - \mu)^k \rangle = \int (x - \mu)^k P(x)\, dx$
    * $Var(x) = \nu_2$
    * **Skewness**: $\gamma = \frac{\nu_3}{Var(x)^{3/2}}$ 
    * **Kurtosis**: $\kappa = \frac{\nu_4}{Var(x)^2} - 3$

* **Moment Generating Function**: $M_x(t) = \langle \exp(tx) \rangle = \int \exp(tx) P(x)\, dx$ --> useful for computing moments.
    * $\mu_k = \frac{d^k}{dt^k}\Big|_{t=0} M_x(t)$
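
As a worked check of the moment generating function, consider a Gaussian with mean $\mu$ and variance $\sigma^2$ (a standard textbook result, used here purely as an illustration):

$$ M_x(t) = \exp\left(\mu t + \tfrac{1}{2} \sigma^2 t^2\right) $$

$$ \mu_1 = \frac{d}{dt}\Big|_{t=0} M_x(t) = \mu, \qquad \mu_2 = \frac{d^2}{dt^2}\Big|_{t=0} M_x(t) = \mu^2 + \sigma^2 $$

so that $Var(x) = \mu_2 - \mu_1^2 = \sigma^2$, as expected.
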
*******************************************************************************************************************************

## Significantly Different Means/Variances

A quantity that measures the significance of a difference of means is *not* the number of standard deviations they are apart, but the number of ***standard errors*** they are apart (see definition above).

* **Student's t-test for Significantly Different Means**: when two distributions are thought to have the **same variance** but **different means**.

Standard error of the difference of the means: $ s_D = \sqrt{\frac{\sum_{i \in A}(x_i - \overline{x_A})^2 + \sum_{i \in B} (x_i - \overline{x_B})^2}{N_A + N_B - 2} \left(\frac{1}{N_A} + \frac{1}{N_B} \right)}$, where A & B are two samples.
    $t = \frac{\overline{x_A} - \overline{x_B}}{s_D}$, then we evaluate the significance of $t$.
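
The pooled standard error $s_D$ and the $t$ statistic can be sketched in a few lines of plain Python. The function name `students_t` and the two samples are hypothetical. Evaluating the significance of $t$ needs the Student's t distribution with $N_A + N_B - 2$ degrees of freedom, which is left out here (a library such as SciPy could supply it).

```python
import math

def students_t(a, b):
    """Student's t for two samples assumed to share the same variance."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)  # squared deviations, sample A
    ss_b = sum((x - mean_b) ** 2 for x in b)  # squared deviations, sample B
    # Pooled standard error of the difference of the means
    s_d = math.sqrt((ss_a + ss_b) / (na + nb - 2) * (1 / na + 1 / nb))
    return (mean_a - mean_b) / s_d, na + nb - 2  # t and degrees of freedom

t, dof = students_t([5.1, 4.9, 5.0, 5.2, 4.8], [5.6, 5.4, 5.5, 5.7, 5.3])
print(t, dof)
```

Here the means differ by 0.5 while the standard error of the difference is about 0.1, so the samples are roughly five standard errors apart.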

* **Covariance**: measures the *joint variability* of two random variables.

    $ Cov(x,y) = \frac{1}{N-1} \sum_{i=1}^{N}(x_i - \overline{x})(y_i - \overline{y})$

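    For *paired* samples (where point $i$ in $x$ corresponds to point $i$ in $y$), the standard error of the difference of the means becomes $ s_D = \left[\frac{Var(x) + Var(y) - 2Cov(x,y)}{N} \right]^{1/2} $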
    
* **Correlation**: $Corr(x,y) = \frac{Cov(x,y)}{\sigma_x \sigma_y}$

* **F-test for Significantly Different Variances**: tests the hypothesis that two samples have different variances by trying to reject the null hypothesis that the variances are the same. The statistic $F$ is the ratio of the two variances.
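
Covariance and correlation follow directly from their definitions. The sketch below uses made-up data lying close to $y = 2x$, and also forms the variance ratio that the F-test works with (larger variance on top, which is a convention I am assuming here). Turning that ratio into a significance requires the F distribution, which is not computed in this example.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the 1/(N-1) normalisation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Cov(x, y) divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up, roughly y = 2x

print(covariance(x, y), correlation(x, y))

# Variance ratio used by the F-test; Var(x) is just Cov(x, x)
var_x, var_y = covariance(x, x), covariance(y, y)
f_stat = max(var_x, var_y) / min(var_x, var_y)
print(f_stat)
```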

*******************************************************************************************************************************

## Are Two Distributions Different?
Again, we cannot prove that two sets came from the same distribution, but we can disprove the null hypothesis that they are drawn from the same distribution.

* **Chi-Square Test**: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ is the observed value, and $E_i$ is the expected value. For $\chi^2$ much larger than the number of degrees of freedom, the null hypothesis is *unlikely*.
    For two **binned** data sets: $\chi^2 = \sum_i \frac{(R_i - S_i)^2}{R_i + S_i}$, where $R_i$ is the number of events in bin $i$ for set 1, and $S_i$ is the number of events in bin $i$ for set 2.
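
The two-sample binned formula is simple to sketch. The counts below are made up, and both sets are chosen to have the same total number of events, which is what this form of the formula assumes. Bins that are empty in both sets are skipped, since they carry no information and would give a division by zero.

```python
def chi_square_two_binned(r, s):
    """Chi-square between two binned data sets with equal total counts."""
    return sum((ri - si) ** 2 / (ri + si) for ri, si in zip(r, s) if ri + si > 0)

R = [12, 25, 31, 22, 10]  # made-up counts, set 1
S = [10, 28, 29, 20, 13]  # made-up counts, set 2

chi2 = chi_square_two_binned(R, S)
print(chi2)
```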

# Modelling of Data

Given a set of observations we want to condense and summarize the data by fitting it to a "model" that depends on adjustable parameters.

To be useful, a fitting procedure should provide (i) parameters, (ii) error estimates on the parameters, and (iii) a statistical measure of goodness-of-fit.

* **Least Squares as a Maximum Likelihood Estimator** 
    
    In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable.

    Suppose we are fitting $N$ data points $(x_i,y_i), i=1,...,N$, to a model that has $M$ adjustable parameters $a_j$, $j=1,...,M$. The model predicts a relationship between the measured *independent* and *dependent* variables, $ y(x) = y(x; a_1 ...a_M)$. To get the best-fit values for the $a_j$ (our parameters) we want to ***minimise*** the sum of squared residuals, i.e. perform a **least-squares fit**:
    
    $$ \sum_i^N \left[y_i - y(x_i;a_1 ... a_M) \right]^2 $$
    
    ***but why?***
    What we want to do is maximise the probability that, given a particular set of parameters, this data could have occurred. Assuming independent, normally distributed errors, the probability of getting said data is proportional to:
    
    $$ P = \prod_{i=1}^N \left \{\exp{\left[- \frac{1}{2} \left(\frac{y_i - y(x_i)}{\sigma} \right)^2 \right]} \Delta y \right \} $$

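    where $y_i$ is our *observed* value, $y(x_i)$ is our *expected* value, $\sigma$ is the standard deviation of the measurement errors, and $\Delta y$ is a small fixed interval around each $y_i$ (needed because the probability of a continuous variable taking an exact value is zero). Maximising this probability is the same as maximising its logarithm or *minimizing* the *negative* of its *logarithm*:
    $$ \left[ \sum_{i=1}^N \frac{\left[y_i - y(x_i)\right]^2}{2 \sigma^2} \right] - N \log{\Delta y} $$
    Since $N$, $\sigma$ and $\Delta y$ are constants, minimising this is the same as minimising the sum of squared residuals. Hence we arrive at the initial point that *least-squares fitting* is a maximum likelihood estimation.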
    
    What about when we have *outliers*? Because the Gaussian model assigns a very small probability to large outliers, when we do see one the **maximum likelihood estimator** distorts the fit to accommodate it, and consequently ruins the model. 
    
* **Chi-Square Fitting**

    Here, each data point has its *own* standard deviation, and the chi-square value is given by $$\chi^2 \equiv \sum_i \left(\frac{y_i - y(x_i; a_1...a_M)}{\sigma_i}\right)^2$$
    To normalise it, we divide by the number of degrees of freedom $\nu = N - M$ (number of data points - number of parameters). A rule of thumb is that $\chi^2 \approx \nu$, i.e. $\chi^2 / \nu \approx 1$, for a "moderately" good fit.

* **Fitting Data to a Straight Line**

    Fitting $N$ data points $(x_i, y_i)$ to a straight-line model $y(x) = y(x; a,b) = a + bx$ is often called ***linear regression***. To measure how well the model agrees with the data, we use the *chi-square* function (assuming the errors $\sigma_i$ are already known).
    $$\chi^2(a,b) = \sum_i^N \left(\frac{y_i - a - bx_i}{\sigma_i}\right)^2$$
    
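    If the errors are *normally distributed* then this will give us the maximum likelihood parameter estimates. If they're not, the estimates are no longer maximum likelihood, but can still be useful in practice.
    
    The above equation is minimized to determine $a$ and $b$. At its minimum, the derivatives of $\chi^2$ with respect to $a$ and $b$ are zero:
    $$ \frac{\partial \chi^2}{\partial a} = -2 \sum_i^N \frac{y_i - a - bx_i}{\sigma_i^2} = 0, \qquad \frac{\partial \chi^2}{\partial b} = -2 \sum_i^N \frac{x_i (y_i - a - bx_i)}{\sigma_i^2} = 0 $$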
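
The straight-line fit can be sketched in plain Python using the closed-form solution that follows from setting the derivatives of $\chi^2(a,b)$ to zero (the weighted sums below follow the form given in *Numerical Recipes*). The function name `fit_line`, the data and the error bars are all made up for illustration; this is a sketch under those assumptions rather than a finished fitting routine.

```python
import math

def fit_line(x, y, sigma):
    """Weighted least-squares fit of y = a + b*x by minimising chi-square."""
    w = [1.0 / s**2 for s in sigma]  # weights 1/sigma_i^2
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx**2
    a = (Sxx * Sy - Sx * Sxy) / delta  # intercept
    b = (S * Sxy - Sx * Sy) / delta    # slope
    sigma_a = math.sqrt(Sxx / delta)   # 1-sigma uncertainty on a
    sigma_b = math.sqrt(S / delta)     # 1-sigma uncertainty on b
    chi2 = sum(((yi - a - b * xi) / si) ** 2 for xi, yi, si in zip(x, y, sigma))
    return a, b, sigma_a, sigma_b, chi2

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]  # made-up data, roughly y = 1 + 2x
sigma = [0.2] * len(x)         # assumed known, equal errors

a, b, sigma_a, sigma_b, chi2 = fit_line(x, y, sigma)
dof = len(x) - 2               # nu = N - M with M = 2 parameters
print(a, b, sigma_a, sigma_b, chi2 / dof)
```

This returns all three things a useful fitting procedure should provide: the parameters, their error estimates, and a goodness-of-fit measure ($\chi^2$ compared with $\nu$).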

### References

* Supervised vs Unsupervised ML - https://www.seldon.io/supervised-vs-unsupervised-learning-explained
* What is a generative model - https://developers.google.com/machine-learning/gan/generative
* Discriminative model - https://en.wikipedia.org/wiki/Discriminative_model 
* Maximum Likelihood Estimator - https://en.wikipedia.org/wiki/Maximum_likelihood_estimation