# Year 4 Project Notebook

### ~ *Guner Aygin*

## Introduction
**Stellar Activity Prediction using Probabilistic Machine Learning**. The aim of this project is to learn big data methods to analyse astronomical data sets, namely the solar cycle data.  

A key aspect of this project is the use of **Probability & Statistics** - nothing will work without a solid understanding of these topics (see *Notes on Statistics & Modelling*).
Some of the topics I will be covering include:

* Linear Regression
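    * $y = mx + c$ for *simple linear regression*, where there is only one input variable.
    * $\mathbf{y} = X\mathbf{b} + \boldsymbol{\epsilon}$ for *multiple linear regression*, where $X$ is a matrix holding several input variables, $\mathbf{b}$ is the vector of coefficients, and $\boldsymbol{\epsilon}$ is the noise term.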
* Non-linear Regression - for example, fitting a *polynomial* model (non-linear in $x$, although still linear in its coefficients).
* Bayes' Theorem - $P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)} \propto P(D|\theta)P(\theta)$
    * $P(\theta|D)$ - posterior
    * $P(\theta)$ - prior
    * $P(D|\theta)$ - likelihood, the probability of the data given the parameters; evaluating it requires both a probability distribution and a model.
    * $P(D)$ - evidence, the probability of the data.
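
The four terms of Bayes' theorem can be made concrete with a small grid calculation. This is only a minimal sketch under assumed inputs: the coin-flip data (7 heads in 10 flips), the flat prior, and the function name `grid_posterior` are all made up for illustration and are not part of the project.

```python
import math

def grid_posterior(heads, flips, prior, thetas):
    """Return the normalised posterior P(theta | D) on a grid of theta values."""
    # Likelihood P(D | theta): binomial probability of the observed heads.
    likelihood = [
        math.comb(flips, heads) * t**heads * (1 - t) ** (flips - heads)
        for t in thetas
    ]
    # Unnormalised posterior: likelihood times prior.
    unnorm = [l * p for l, p in zip(likelihood, prior)]
    # Evidence P(D): the sum over the grid, which normalises the posterior.
    evidence = sum(unnorm)
    return [u / evidence for u in unnorm]

thetas = [i / 100 for i in range(101)]         # candidate coin biases
flat_prior = [1 / len(thetas)] * len(thetas)   # uniform prior P(theta)
posterior = grid_posterior(heads=7, flips=10, prior=flat_prior, thetas=thetas)

# The grid point with the highest posterior probability.
best_theta = thetas[posterior.index(max(posterior))]
print(f"Most probable theta: {best_theta:.2f}")
```

Because the prior here is flat, the posterior peaks where the likelihood peaks; a different prior would shift the peak.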

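For a sufficiently large sample size, the posterior distribution becomes largely insensitive to the choice of prior, provided the prior does not assign zero probability to the true parameter values. https://brunomaga.github.io/Bayesian-Linear-Regression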
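
A quick way to see the prior mattering less as the sample grows is a conjugate coin-flip example. This is a sketch under assumptions: the two Beta priors, the 70% heads rate, and the name `posterior_mean` are hypothetical choices for illustration only.

```python
def posterior_mean(prior_a, prior_b, heads, flips):
    """Posterior mean of a coin bias under a Beta(prior_a, prior_b) prior."""
    # A Beta prior with a binomial likelihood gives a
    # Beta(prior_a + heads, prior_b + tails) posterior.
    return (prior_a + heads) / (prior_a + prior_b + flips)

# Two deliberately different priors (hypothetical choices).
priors = {"flat": (1, 1), "sceptical": (2, 20)}

gaps = []
for flips in (10, 100, 10_000):
    heads = 7 * flips // 10  # pretend 70% of the flips came up heads
    means = {
        name: posterior_mean(a, b, heads, flips)
        for name, (a, b) in priors.items()
    }
    # Difference between the two posterior means at this sample size.
    gaps.append(abs(means["flat"] - means["sceptical"]))
    print(flips, {name: round(m, 3) for name, m in means.items()})
```

The gap between the two posterior means shrinks as the number of flips increases, which is the behaviour described above.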

* Machine Learning (ML) - there are two main approaches to ML models: ***Supervised*** & ***Unsupervised*** learning.
   
   https://www.seldon.io/supervised-vs-unsupervised-learning-explained
    * **Supervised** machine learning requires labelled input and output data during the training phase of the machine learning lifecycle. It is used to classify unseen data into established categories and forecast trends and future change as a predictive model.
        * **Classification** - Categorizing a given set of data into classes.
        * **Regression** - Predicting outcomes for continuously changing data.
    * **Unsupervised** machine learning is the training of models on raw and unlabelled training data. It is used to identify patterns and trends in raw datasets, or to cluster similar data into a specific number of groups.
        *  **Clustering** - Grouping N data points into M groups. 
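
The difference between the two approaches can be sketched on toy data. Everything here is assumed for illustration: the 1-D points, the "quiet"/"active" labels, and the helper names are made up, and the nearest-mean classifier and basic k-means loop are just simple stand-ins for supervised and unsupervised methods.

```python
from statistics import mean

# Toy 1-D data (made up): two well-separated groups.
points = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
labels = ["quiet"] * 4 + ["active"] * 4

# --- Supervised: the labels are used during training ---
def fit_centroids(xs, ys):
    """Learn one mean per labelled class."""
    return {
        label: mean(x for x, y in zip(xs, ys) if y == label)
        for label in set(ys)
    }

def classify(x, centroids):
    """Assign x to the class with the nearest mean."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

centroids = fit_centroids(points, labels)
print(classify(4.5, centroids))

# --- Unsupervised: only the raw points are used ---
def kmeans_1d(xs, k=2, iterations=10):
    """Group xs into k clusters with a basic k-means loop."""
    centres = sorted(xs)[:: max(1, len(xs) // k)][:k]  # crude initial guess
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for x in xs:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            groups[nearest].append(x)
        # Keep the old centre if a group happens to be empty.
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)

print(kmeans_1d(points))
```

The supervised part needs `labels` to learn anything, whereas `kmeans_1d` finds the two groups from `points` alone but cannot name them.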

Separately from the Supervised/Unsupervised distinction, models can also be Generative or Discriminative.
* **Generative** models can generate new data. They capture the *joint probability* $P(X,Y)$, which includes the distribution of the data itself.

    https://developers.google.com/machine-learning/gan/generative
    
* **Discriminative** models discriminate between different kinds of data instances, e.g. Dead/Alive, Yes/No, Pass/Fail. They model the *conditional probability* $P(Y|X)$: they ignore how likely an instance is and only tell you how to label it. Discriminative methods can give incorrect results: if a single step somewhere down a *decision tree* flips a True to a False, the final conclusion will be wrong. These errors need to be accounted for.

    https://en.wikipedia.org/wiki/Discriminative_model 
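
To illustrate the joint versus conditional distinction, here is a minimal sketch of a generative model with one Gaussian per class. The toy data, the Gaussian assumption, and all function names are hypothetical; the point is only that a model of $P(X,Y)$ can both generate new points and recover $P(Y|X)$ through Bayes' theorem, whereas a discriminative model would target $P(Y|X)$ directly.

```python
import math
import random
from statistics import mean, stdev

# Toy labelled data (made up): one feature x and a binary label y.
data = [(1.0, 0), (1.3, 0), (0.7, 0), (1.1, 0),
        (3.0, 1), (3.4, 1), (2.8, 1), (3.1, 1)]

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and width sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Generative model: learn P(Y) and P(X | Y), which together give P(X, Y).
classes = sorted({y for _, y in data})
prior = {c: sum(1 for _, y in data if y == c) / len(data) for c in classes}
params = {
    c: (mean(x for x, y in data if y == c), stdev(x for x, y in data if y == c))
    for c in classes
}

def joint(x, c):
    """P(X = x, Y = c) = P(x | c) * P(c)."""
    return gaussian_pdf(x, *params[c]) * prior[c]

def conditional(x, c):
    """P(Y = c | X = x), recovered from the joint via Bayes' theorem."""
    return joint(x, c) / sum(joint(x, k) for k in classes)

def sample(rng):
    """Generate a new (x, y) pair from the learned joint distribution."""
    c = rng.choices(classes, weights=[prior[k] for k in classes])[0]
    return rng.gauss(*params[c]), c

rng = random.Random(0)
print(sample(rng))          # a new synthetic data point
print(conditional(2.9, 1))  # label probability for an observed x
```

A discriminative model fitted to the same data could reproduce `conditional`, but it would have no equivalent of `sample`.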

The **LIKELIHOOD** all comes from statistics and probability, whereas the **MODEL** comes from the regression/ML/generative models.

The other aspects of this project which I will be working on include: *optimization*, *MCMC*, *nested sampling* (more information on these at a later stage).

#### Future Work
In this project, in order to learn the different techniques, I will be attempting a series of mini-projects (what they are is currently unknown) which centre around the themes discussed above. The eventual goal is to model the *Solar Activity Cycles* using ML and compare the results with traditional techniques, (hopefully) concluding that ML is superior.

Below I will create a weekly log, summarising the key achievements of that week. For daily updates see GitHub commits.

## Project Week 1 ~ 26/09/2022

The first week of the project has been spent deciding which of the three main types of Machine Learning techniques I want to include: **Traditional**, **Probabilistic**, or **Deep Learning**.

I concluded that **Probabilistic** techniques best suited my future career goals, and so that will be the main focus of the project, although there will be some overlap with some traditional techniques.

For more information on the different ML techniques see the following:
* Traditional: https://scikit-learn.org/stable/
* Probabilistic: https://www.tensorflow.org/probability
* Deep Learning: https://www.tensorflow.org/

## Project Week 2 ~ 03/10/2022

This week has been spent devising a plan for how to proceed with the project, and learning the essential statistics necessary to go with it (as well as the creation of this notebook).

It has been decided that the physical phenomenon to be investigated using the probabilistic methods is the **Solar Activity Cycles**. Solar activity is measured by counting the number of sunspots on the Sun, and there is data stretching back hundreds of years which I can analyse. Predicting solar activity cycles from a physical point of view has proven to be very difficult, which is why I will be training a model to predict future cycles for me. Week 3's task will be to read a paper on Solar Cycles, to understand more about them.

'Notes on Statistics & Modelling' contains exactly what the title suggests, and is taken mostly from *Numerical Recipes Second Edition* and *LM Inference from Scientific Data*. See GitHub commits for when these notes have been updated.

## Project Week 3 ~ 10/10/2022

This week I have been tasked with reading through a number of papers:
* Bayesian workflow: https://arxiv.org/abs/2011.01808
* The Solar Cycle: https://link.springer.com/article/10.12942/lrsp-2010-1

On top of this I need to find and plot *real* solar cycle data & create a linear regression model using fake data.
I have also tasked myself with trying to learn more about the basics of machine learning through various textbooks and online courses. 

Created a set of fake linear data and fitted the points with a least-squares model, a gradient descent model, and a maximum likelihood model. Gradient descent is a robust way of building a linear regression algorithm from scratch, but I also used the built-in scikit-learn Linear Regression model to check that the GD result had converged to the same fit. Next step: understand and implement MCMC, see https://prappleizer.github.io/Tutorials/MCMC/MCMC_Tutorial.html
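
The comparison described above can be sketched roughly as follows. This is not the notebook's actual code: the fake data parameters, learning rate, step count, and function names are all assumptions, and a closed-form least-squares fit stands in for the scikit-learn check. For Gaussian noise of fixed width, maximising the likelihood is equivalent to minimising the squared residuals, so the maximum likelihood fit coincides with the least-squares one.

```python
import random

# Fake linear data: y = m*x + c plus Gaussian noise (illustrative values).
rng = random.Random(42)
true_m, true_c = 2.0, 1.0
xs = [i / 10 for i in range(50)]
ys = [true_m * x + true_c + rng.gauss(0, 0.3) for x in xs]

def least_squares(xs, ys):
    """Closed-form simple linear regression (slope, intercept)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return m, y_bar - m * x_bar

def gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Minimise the mean squared error by following its gradient."""
    n = len(xs)
    m, c = 0.0, 0.0
    for _ in range(steps):
        residuals = [m * x + c - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_m = 2 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_c = 2 / n * sum(residuals)
        m -= lr * grad_m
        c -= lr * grad_c
    return m, c

m_ls, c_ls = least_squares(xs, ys)
m_gd, c_gd = gradient_descent(xs, ys)
print(f"least squares:    m={m_ls:.4f}, c={c_ls:.4f}")
print(f"gradient descent: m={m_gd:.4f}, c={c_gd:.4f}")
```

If the learning rate is too large for the range of `xs`, the gradient descent loop diverges, so agreement with the closed-form answer is a useful sanity check.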

Made a start on Solar Cycles notes, and have plotted the raw sunspot data.