# Year 4 Project Notebook

### ~ *Guner Aygin*

## Introduction
**Predicting Solar Activity Cycles using Probabilistic Machine Learning**. The aim of this project is to learn big data methods to analyse astronomical data sets, namely the solar cycle data.  

A key aspect of this project is the use of **Probability & Statistics** - nothing will work without a solid understanding of these topics (see *Notes on Statistics & Modelling*).
Some of the topics I will be covering include:

* Linear Regression
    * $y = mx + c$ for *simple linear regression* where we only have one variable. 
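    * $y = \mathbf{b} \cdot \mathbf{x} + \epsilon$ for *multiple linear regression*, where $\mathbf{x}$ is a vector of several input variables, $\mathbf{b}$ is the matching vector of coefficients, and $\epsilon$ is the noise term.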
* Non-linear Regression - fitting a *polynomial* model.
* Bayes Theorem - $ P(\theta|D) = P(D|\theta) \frac{P(\theta)}{P(D)} \propto P(D|\theta)P(\theta) $
    * $P(\theta|D)$ - posterior
    * $P(\theta)$ - prior
    * $P(D|\theta)$ - likelihood, requires a noise distribution & the model.
    * $P(D)$ - evidence, the probability of the data.

For a sufficiently large sample size, the posterior distribution becomes almost independent of the prior, provided the prior does not assign zero probability to the true parameter values. https://brunomaga.github.io/Bayesian-Linear-Regression
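
A minimal sketch of the statement above, using a coin-flip example that is not part of the project: the Beta prior family, the two prior choices and the true bias of 0.3 are all assumptions made purely for illustration. With a Beta(a, b) prior and `heads` successes in `n` flips, the posterior mean is (a + heads) / (a + b + n), so the gap between two priors should shrink as `n` grows.

```python
import numpy as np

def posterior_mean(heads, n, a, b):
    """Posterior mean of a coin's bias under a Beta(a, b) prior."""
    return (a + heads) / (a + b + n)

rng = np.random.default_rng(0)
true_bias = 0.3  # assumed value, illustration only

# Two deliberately different priors (hypothetical choices).
priors = {"flat": (1.0, 1.0), "favours heads": (20.0, 5.0)}

gaps = []
for n in (10, 100, 10_000):
    heads = rng.binomial(n, true_bias)
    means = {name: posterior_mean(heads, n, a, b) for name, (a, b) in priors.items()}
    gap = abs(means["flat"] - means["favours heads"])
    gaps.append(gap)
    print(f"n = {n:6d}  posterior means = {means}  gap = {gap:.4f}")
```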

* Machine Learning (ML) - there are two main approaches to ML models: ***Supervised*** & ***Unsupervised*** learning.
   
   https://www.seldon.io/supervised-vs-unsupervised-learning-explained
    * **Supervised** machine learning requires labelled input and output data during the training phase of the machine learning lifecycle. It is used to classify unseen data into established categories and forecast trends and future change as a predictive model.
        * **Classification** - Categorizing a given set of data into classes.
        * **Regression** - Predicting outcomes for continuously changing data.
    * **Unsupervised** machine learning is the training of models on raw and unlabelled training data. It is used to identify patterns and trends in raw datasets, or to cluster similar data into a specific number of groups.
        *  **Clustering** - Grouping N data points into M groups. 

As well as the differences between Supervised and Unsupervised ML models, we also have Generative and Discriminative models. 
* **Generative** models can generate new data. They capture the *joint probability* $P(X,Y)$. Includes the distribution of the data.

    https://developers.google.com/machine-learning/gan/generative
    
* **Discriminative** models discriminate between different kinds of data instances, e.g. Dead/Alive, Yes/No, Pass/Fail. They model the *conditional probability* $P(Y|X)$: they ignore how likely an instance is and just tell you how to label it. Discriminative methods can lead to incorrect results: if something along the *decision tree* flips a True to a False, the final conclusion will be incorrect. These errors need to be accounted for.

    https://en.wikipedia.org/wiki/Discriminative_model 

The **LIKELIHOOD** all comes from statistics and probability, whereas the **MODEL** comes from the regression/ML/generative models.

The other aspects of this project which I will be working on include: *optimization*, *MCMC*, *nested sampling* (more information on these at a later stage).

#### Future Work
In this project, in order to learn the different techniques, I will be attempting a series of mini-projects (what they are is currently unknown) which centre around the themes discussed above. The eventual goal is to model the *Solar Activity Cycles* using ML and compare the results with traditional techniques, (hopefully) concluding that ML is superior.

Below I will create a weekly log, summarising the key achievements of that week. For daily updates see GitHub commits.

# Semester 1

## Project Week 1 ~ 26/09/2022

The first week of the project has been spent deciding which of the three main types of Machine Learning techniques I want to include: **Traditional**, **Probabilistic**, or **Deep Learning**.

I concluded that **Probabilistic** techniques best suited my future career goals, and so that will be the main focus of the project, although there will be some overlap with some traditional techniques.

For more information on the different ML techniques see the following:
* Traditional: https://scikit-learn.org/stable/
* Probabilistic: https://www.tensorflow.org/probability
* Deep Learning: https://www.tensorflow.org/

*******************************************************************************************************************************

## Project Week 2 ~ 03/10/2022

### Minutes 2
(see introduction)

### Log 2

This week has been spent devising a plan for how to proceed with the project, and learning the essential statistics necessary to go with it (as well as the creation of this notebook).

It has been decided that the physical phenomenon to be investigated using the probabilistic methods is the **Solar Activity Cycles**. Solar activity is measured by counting the number of sunspots on the Sun, and there is data stretching back hundreds of years which I can analyse. Predicting solar activity cycles from a physical point of view has proven to be very difficult, which is why I will be training a model to predict future cycles for me. Week 3's task will be to read a paper on Solar Cycles, to understand more about them. 

'Notes on Statistics & Modelling' contains exactly what the title suggests, and is taken mostly from ***Numerical Recipes Second Edition*** and *LM Inference from Scientific Data*. See GitHub commits for when these notes have been updated.

*******************************************************************************************************************************

## Project Week 3 ~ 10/10/2022

### Minutes 3

#### *Questions:*
* Which parts of chapter 14 of NR do I NOT need to know?
    * The book is useful for a *general* understanding of stats
* Do I need to use the incomplete beta function?
    * Not now
* Should I learn how to smooth data or use raw?
    * Raw
* Which probability functions are important?
    * Normal & Poisson

Using GitHub: **Add, Commit, Push** https://docs.github.com/en/repositories/working-with-files/managing-files/adding-a-file-to-a-repository

#### Plan for next week:
* Read through some of this paper: Bayesian workflow https://arxiv.org/abs/2011.01808
* Find real solar cycle data and plot them on a graph 
* Read a paper on solar activity https://link.springer.com/article/10.12942/lrsp-2010-1
* Linear Regression - make utterly trivial fake data: use a straight line as the model, add Gaussian noise to the data, and start doing linear regression
    * y = mx + c
    * Add randomness to the y values
    * Create a Linear Regression model from scratch - not just using Scikit (can use Scipy Optimise)
    * Minimise the negative log-likelihood yourself in the code
        * Function method
        * Function likelihood - needs to know more about the model
        * Use optimisation function
    * $ D \sim \mathcal{N} \left(M(\theta), \sigma_{obs}^2 \right)$ - data with a model
    * $ M(\theta) = \alpha (mx + b)$ - model 
    * $ L = P(D|M(\theta)) $ - likelihood
    * $-\log{L} = -\log{P(D|M(\theta))}$ - negative log-likelihood
    * Minimise(function = negative log-likelihood, args = function method)
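
A minimal sketch of the plan above, assuming the plain straight line y = mx + c as the model and a known, constant Gaussian noise level. The true values, the noise level, the starting guess and the function names are all illustrative choices, not values from the project.

```python
import numpy as np
from scipy.optimize import minimize

def model(theta, x):
    """Straight-line model M(theta) = m * x + c."""
    m, c = theta
    return m * x + c

def neg_log_likelihood(theta, x, y, sigma):
    """-log L for data distributed as N(M(theta), sigma^2)."""
    residual = y - model(theta, x)
    return 0.5 * np.sum((residual / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))

# Fake data: a straight line plus Gaussian noise (assumed values).
rng = np.random.default_rng(42)
m_true, c_true, sigma = 2.0, 1.0, 0.5
x = np.linspace(0, 10, 50)
y = m_true * x + c_true + rng.normal(0, sigma, size=x.size)

# Minimising -log L is the same as maximising the likelihood.
result = minimize(neg_log_likelihood, x0=[1.0, 0.0], args=(x, y, sigma))
m_fit, c_fit = result.x
print(f"m = {m_fit:.3f}, c = {c_fit:.3f}")
```

With constant Gaussian noise this gives the same best-fit line as ordinary least squares, which makes it a convenient cross-check.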

### Log 3 
This week I have been tasked with reading through a number of papers:
* Bayesian workflow: https://arxiv.org/abs/2011.01808
* The Solar Cycle: https://link.springer.com/article/10.12942/lrsp-2010-1

On top of this I need to find and plot *real* solar cycle data & create a linear regression model using fake data.
I have also tasked myself with trying to learn more about the basics of machine learning through various textbooks and online courses. 

Created a set of fake linear data and fitted the points with a least-squares model, a gradient descent model, and a maximum likelihood model. Gradient descent is a robust way of building a linear regression algorithm from scratch, but I also used the built-in Scikit Linear Regression module to check that the GD converges to the same best-fit line. Need to understand and implement MCMC https://prappleizer.github.io/Tutorials/MCMC/MCMC_Tutorial.html
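
A rough sketch of the gradient descent check described above. The data, learning rate and iteration count are assumptions, and `np.polyfit` stands in for the Scikit module as the reference solution so that the example only needs NumPy.

```python
import numpy as np

# Fake linear data (assumed values, illustration only).
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, size=x.size)

# Gradient descent on the mean squared error of y = m * x + c.
m, c = 0.0, 0.0
learning_rate = 0.5
for _ in range(5000):
    residual = (m * x + c) - y
    grad_m = 2.0 * np.mean(residual * x)  # d(MSE)/dm
    grad_c = 2.0 * np.mean(residual)      # d(MSE)/dc
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

# Closed-form least-squares fit as the reference.
m_ref, c_ref = np.polyfit(x, y, deg=1)
print(f"GD:        m = {m:.4f}, c = {c:.4f}")
print(f"reference: m = {m_ref:.4f}, c = {c_ref:.4f}")
```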

Made a start on Solar Cycles notes, and have plotted the raw sunspot data.

*******************************************************************************************************************************

## Project Week 4 ~ 17/10/2022
### Minutes 4

* Deep & Wide Networks: when discussing the model, and whether or not I should include other factors, eg. the effect of the planets on the solar cycles, we could incorporate this effect into the *wide* network. We would still be creating the *deep learning* model, but eventually we would combine the deep and the wide together to see how it changes our results. This is something we would be doing at a much later stage of the project.
* Add labels to the axes of the sunspot data graph
* Keep reading scientific papers on Solar Cycles and make notes - we don't need the complicated physics such as magnetohydrodynamics.
* For the preliminary review we want about 1.5 pages of *literature review*, which includes 20-25 papers (3 papers per week, 15 mins each paper with notes in a doc). **The paper needs to look sciency, not machine learningy**.
* Preliminary report should include: Abstract containing context, aims, methods, results & conclusions
    * Context: read in scientific papers - e.g. predicting solar cycles to prepare for events which could damage energy supplies, or increase the levels of radiation in the atmosphere.
    * Aim: investigate the ability of machine learning to predict solar cycles (flexible for now until we get further with the project)
    * Methods: ***not machine learning***, Gaussian processes, Deep & Wide Neural Networks, Probabilistic etc. (when we cover them)
    * Results: unknown, but at some point we will get an idea of whether or not machine learning can solve our problem. **Statements of fact about the work I have done**.
    * Conclusions: the ***interpretation*** of the results - can it be used to predict solar cycles? Is it accurate enough? Can we predict whether there will be a solar storm in a week? (no)

#### Plan for next week:
* Go back over Linear Regression code and fix:
    * Uncertainty 
    * Maximum Likelihood Model - fix equation to take these different uncertainties into account.
    * Tidy some bits of code in the Least Squares and ML models, i.e. get rid of variables we're not using
    * Write a note saying "I know there are analytical solutions to the loss function". 
* Create a phase diagram for the SSN data: chunk up the data into 11 year chunks, and plot them over each other - point is to show that each cycle isn't exactly 11 years.
* Could also create a plot of the amplitudes of each cycle
* Organise a Zoom meeting to go over remaining topics eg MCMC and the Linear Regression code.
* Try and read chapters 10-11 of **Bayesian Data Analysis** - Andrew Gelman
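
A minimal sketch of the phase-folding idea from the plan above. The series here is a synthetic stand-in for the SSN data with an assumed true period of 10.5 years, and `fold` is a hypothetical helper name. Folding on 11 years then shows the peak drifting in phase from chunk to chunk, which is the signature of cycles that are not exactly 11 years long.

```python
import numpy as np

def fold(time_years, period_years=11.0, t0=None):
    """Return the chunk index and the phase in [0, 1) of each time stamp."""
    t0 = time_years[0] if t0 is None else t0
    cycles = (time_years - t0) / period_years
    chunk_index = np.floor(cycles).astype(int)
    return chunk_index, cycles - chunk_index

# Synthetic monthly series (assumed shape and period, not real data).
time = 1750.0 + np.arange(0, 66 * 12) / 12.0
true_period = 10.5
ssn = 150.0 * np.sin(np.pi * (time - 1750.0) / true_period) ** 2

chunk_index, phase = fold(time, period_years=11.0)

# Phase at which each 11-year chunk reaches its maximum.
peak_phase = np.array([
    phase[chunk_index == k][np.argmax(ssn[chunk_index == k])]
    for k in np.unique(chunk_index)
])
print(np.round(peak_phase, 3))
```

Plotting `ssn` against `phase`, one colour per `chunk_index`, gives the overlaid phase diagram.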

#### Log 4

 This week I have created two different sets of 'random data' for the **Linear Regression** code: one with a scatter equal to the $\sigma$, and the other with a scatter equal to $f_{true} \sigma$, where I set an arbitrary value for $f_{true}$. The data I use is the latter one. 
 
* Created a simple least squares function.
* Fixed the least squares function in Matrix notation and made a note about the equation used.
* Fixed the likelihood equation to take the scatter into account ($s_n = f \sigma_n$) 
* Incorporated MCMC and created a corner plot which shows all the one and two dimensional projections of the posterior probability distributions of your parameters. Easy to visualise the covariances between the parameters.
* Plotted a sample from our random walks to show how it compares with the 'true value'.
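
A rough sketch of the scatter model $s_n = f \sigma_n$ inside a Gaussian likelihood, sampled with a hand-written Metropolis random walk. This is a swapped-in sampler for illustration, not necessarily the MCMC package used in the notebook; the flat priors, prior bounds, step sizes and fake-data values are all assumptions.

```python
import numpy as np

def log_posterior(theta, x, y, sigma):
    """Flat priors plus a Gaussian likelihood with scatter s_n = f * sigma_n."""
    m, c, log_f = theta
    if not (-5.0 < log_f < 5.0):  # assumed prior bounds on log f
        return -np.inf
    s = np.exp(log_f) * sigma
    residual = y - (m * x + c)
    return -0.5 * np.sum((residual / s) ** 2 + np.log(2 * np.pi * s ** 2))

# Fake data whose true scatter is f_true times the quoted sigma.
rng = np.random.default_rng(3)
x = np.linspace(-5, 5, 40)
sigma = np.full(x.size, 0.3)
f_true = 2.0
y = 1.5 * x + 4.0 + rng.normal(0, f_true * sigma)

# Metropolis random walk over (m, c, log f).
theta = np.array([1.0, 3.0, 0.0])
logp = log_posterior(theta, x, y, sigma)
step = np.array([0.02, 0.05, 0.1])
chain = np.empty((20000, 3))
for i in range(chain.shape[0]):
    proposal = theta + step * rng.normal(size=3)
    logp_new = log_posterior(proposal, x, y, sigma)
    if np.log(rng.uniform()) < logp_new - logp:
        theta, logp = proposal, logp_new
    chain[i] = theta

samples = chain[5000:]  # discard burn-in
m_med, c_med, log_f_med = np.median(samples, axis=0)
print(f"m = {m_med:.3f}, c = {c_med:.3f}, f = {np.exp(log_f_med):.3f}")
```

The rows of `samples` are what a corner plot would histogram, one panel per parameter and per pair of parameters.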

*******************************************************************************************************************************

## Project Week 5 ~ 24/10/2022
### Minutes 5

* Sunspot Data
    * Discussed good coding practice and working with Pandas. 
    * Learnt how to use Savitzky-Golay Filter in Python (scipy.signal.savgol_filter).
    * Created a phase plot, showing that the Solar Cycles don't all have a period of 11 years. 
    
#### Plan for next week:
* Go through the smoothing code and make certain adjustments.
* Try and find the turning points of the smoothed function, and use it to find the amplitudes.
* Read more papers and make notes on them! For locked papers use NASA Ads

#### Log 5

* Sunspot Data Plotting:
    * Changed the code to make it more compact. Made the SVG (Savitzky-Golay smoothed) signal a 2-dimensional list, which makes plotting them simple (with the use of a **for** loop). 
    * Changed the colours and data-point size on the figures to make visualisation easier
    * Made comments on the plots.
    * Found & plotted the maxima and minima of the SVG smoothed signals. 
    * Fixed problem of converting from index into time in years.
    * Found solar cycle maxima & minima from the SVG signals.
    * Used cycle minima to adjust phase diagram so it looks more like the expected sine-squared curve.
    * Generalised the code to account for N polyorders. 
    * Found & plotted descending time against sunspot number.
    * Plotted descending time three cycles earlier against sunspot number.
* Literature Review:
    * Added 7 literature reviews
***

## Project Week 6 ~ 31/10/2022
### Minutes 6
* Making good progress with the sunspot data.
* Plotting the maxima & minima of each cycle will be useful for future projects (such as fitting a Gaussian process to the amplitude).
* Looked through the paper https://link.springer.com/article/10.1007/s11207-006-0175-5 regarding the linear relationship between solar cycle amplitude and descending time three cycles earlier. 
* Used the **Linear Regression** code (created earlier) to fit with the data.
* Plotted a shaded region $\pm \sigma$ around the best-fit line.
* Created a corner plot and plotted the MCMC iterations.
* Showed that there is not a statistically significant linear relationship between cycle amplitude and descending time three cycles earlier. 
* Talked about how to interpret corner plots.
    
#### Plan for next week:
* Go through the new fitting document and tidy it, including comments.
* Start learning Scikit Regression techniques and Gaussian Processes (kernel ridge regression).
* Fit Gaussian process to cycle amplitude.
* Continue going through papers and making notes - including https://academic.oup.com/mnras/article/505/1/830/6253203
#### Log 6
* Created notes on Gaussian Process and Covariance functions (kernels).
* Basic example of GP.
* GP with noisy data.
* Attempted to fit a GP to the maxima from the SVG signals - okay but not great.
* Attempted to fit a GP to the SVG signal - not great.
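
A minimal sketch of a GP fitted to noisy data, written with NumPy only. The squared-exponential kernel, its hyperparameters and the sine-wave test data are assumptions for illustration, not the settings used in the notebook.

```python
import numpy as np

def rbf_kernel(xa, xb, amplitude=1.0, length_scale=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return amplitude ** 2 * np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_predict(x_train, y_train, x_test, noise_std, **kernel_args):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = rbf_kernel(x_train, x_train, **kernel_args) + noise_std ** 2 * np.eye(x_train.size)
    K_s = rbf_kernel(x_train, x_test, **kernel_args)
    K_ss = rbf_kernel(x_test, x_test, **kernel_args)
    L = np.linalg.cholesky(K)  # stable alternative to inverting K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Noisy samples of a sine wave (made-up test data).
rng = np.random.default_rng(5)
x_train = np.sort(rng.uniform(0, 10, 30))
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=x_train.size)
x_test = np.linspace(0, 10, 200)

mean, std = gp_predict(x_train, y_train, x_test, noise_std=0.1,
                       amplitude=1.0, length_scale=1.0)
rms_error = np.sqrt(np.mean((mean - np.sin(x_test)) ** 2))
print(f"RMS error against the noise-free curve: {rms_error:.3f}")
```
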
***

## Project Week 7 ~ 07/11/2022
### Minutes 7
* Discussed the theory of Gaussian processes: a GP is a distribution over functions, in effect an infinite-dimensional Gaussian defined by a covariance function and conditioned on the data, but in practice there is a simpler method using *linear algebra* on finite covariance matrices (more notes to follow).
* Went through the fitting maxima notebook and adjusted GP parameters to see how it works.
* Discussed the use of different kernels for different processes.
* When writing preliminary report there is no need for stats theory.
#### Plan for next week:
* Continue reading papers
* Play around with Gaussian Processes 
* Try fitting a **quasi-periodic kernel** to the SVG smoothed plot.
* When fitting sunspot number, to avoid predicting any *negative* values use *log10*.
* Try using different packages for GP (such as TensorFlow & PyMC3)
* Read more on GP in book by Rasmussen & Williams, p1-30, (skipping 7-11).
#### Log 7
* Executed a basic GPR using PyMC3
* GPR using PyMC3 with the maxima of SVG signals - incorporates marginalisation
* GPR using PyMC3 with the whole of the SVG signal - periodic kernel (not a good fit)
* Additional notes on GPR
***

## Project Week 8 ~ 14/11/2022
### Minutes 8
* Debugging of the PyMC3 code
* Added a mean to maxima fitting
* Multiplied the RBF kernel with a periodic kernel to create a *quasi-periodic kernel* for the SVG fitting
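
A minimal sketch of a quasi-periodic kernel built as the product of an RBF kernel and a periodic kernel. The periodic kernel is written in one common parametrisation, and libraries differ in how they scale its length scale, so this is not necessarily the exact form PyMC3 uses. All hyperparameter values are assumptions.

```python
import numpy as np

def rbf(tau, length_scale):
    """Squared-exponential kernel as a function of the lag tau."""
    return np.exp(-0.5 * (tau / length_scale) ** 2)

def periodic(tau, period, length_scale):
    """Periodic kernel: correlation repeats exactly every `period`."""
    return np.exp(-2.0 * np.sin(np.pi * tau / period) ** 2 / length_scale ** 2)

def quasi_periodic(xa, xb, amplitude, period, ls_decay, ls_periodic):
    """Product kernel: periodic correlation that slowly decays with lag."""
    tau = xa[:, None] - xb[None, :]
    return amplitude ** 2 * rbf(tau, ls_decay) * periodic(tau, period, ls_periodic)

t = np.linspace(0, 60, 121)  # years, half-year spacing
K = quasi_periodic(t, t, amplitude=1.0, period=11.0, ls_decay=30.0, ls_periodic=1.0)

print("correlation at a lag of 11 years: ", round(K[0, 22], 3))
print("correlation at a lag of 5.5 years:", round(K[0, 11], 3))
```

Points one full period apart stay strongly correlated while points half a period apart do not, and the RBF factor lets that correlation fade over several cycles.
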
#### Plan for next week:
* Play around with the PyMC3 GPR for the SVG signal - let the period be a free parameter to be optimised by the code
* Try with raw Data
* Introduce a noise term
* Unrestrict some of the variables such as A and length scale (ls > 15)
* Think about adding another periodic covariance matrix
#### Log 8
* Combined covariance functions to create a *quasi-periodic kernel* - took 5.5 hours to run
* Removed a parameter $\Gamma$ (amplitude of periodic kernel) --> makes the code run much faster
* Playing around with the priors to help it converge faster & on values deemed sensible
* Attempted to fit raw data but encountered errors from all the 0 points
***

## Project Week 9 ~ 21/11/2022
### Minutes 9
* Raw data fitting not converging due to zeros - try fixing by adding a small number to all data points
* Use logs and propagate uncertainty
* Try coding the GPR from scratch
* At some point think about adding other information like space weather from GONG, helioseismic data (maybe) - Rachel --> could be used for a Deep & Wide Network
* Think about creating a multidimensional input GP
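
A minimal sketch of the log transform with a small offset and first-order uncertainty propagation, $\sigma_{\log} = \sigma_y / ((y + \mathrm{offset}) \ln 10)$. The offset of 1, the helper names and the example values are assumptions.

```python
import numpy as np

def to_log10(y, sigma_y, offset=1.0):
    """Shift by `offset` so zeros are allowed, then take log10.

    The uncertainty follows from d(log10 u)/du = 1 / (u * ln 10).
    """
    shifted = y + offset
    log_y = np.log10(shifted)
    sigma_log = sigma_y / (shifted * np.log(10.0))
    return log_y, sigma_log

def from_log10(log_y, offset=1.0):
    """Invert the transform; the result can never fall below -offset."""
    return 10.0 ** log_y - offset

# Made-up sunspot numbers, including a zero, with made-up uncertainties.
ssn = np.array([0.0, 3.0, 50.0, 180.0])
sigma = np.array([1.0, 2.0, 5.0, 10.0])

log_ssn, sigma_log = to_log10(ssn, sigma)
print(np.round(log_ssn, 3), np.round(sigma_log, 3))
```

Fitting in log space and transforming back bounds the predictions below at minus the offset, so a smaller offset keeps them closer to strictly non-negative.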

#### Plan for next week:
* Keep trying to fit the raw data
* Make a start on GP from scratch
* Writing preliminary report

#### Log 9
* TF time series forecasting link: https://www.tensorflow.org/tutorials/structured_data/time_series
* Sequential Monte Carlo link: https://docs.pymc.io/en/v3/pymc-examples/examples/samplers/SMC2_gaussians.html

***

## Project Week 10 ~ 28/11/2022
#### Log 10

* Created a virtual environment for using TensorFlow, which now works on my Mac Mini & Macbook Air (M1).
* Plotted a GP using TensorFlow - is still unable to create meaningful predictions. 
* Project work has ground to a halt whilst a literature review for the preliminary report is being written 

***

## Project Week 11 ~ 05/12/2022

#### Log 11
* Preliminary report writing
* Minor improvements to figures in 'Fitting SVG' and 'Fitting dt'
* Added a posterior plot inside of the linear regression plot of dt (Zu paper).

***
## *Christmas Break*
***

# Semester 2
## Project Week 1 ~ 30/01/2023

### Minutes 1
* Discussed preliminary report (80%) and project work 1 (80%)
    * When writing a report, make sure that you make it very clear that you understand exactly what you're writing about (e.g. show your understanding of dynamo processes within the Sun)
    * For the final report, try and keep the consistency achieved in the preliminary report
    * Make sure figure captions describe the facts, and put the inferences in the bulk of the report
* For future project work, continue to work on GP in the background
* The next few weeks should be about exploring different techniques, with an aim of ending up with detail on the following three methods:
    * Gaussian processes
    * Recurrent Neural Networks (RNN) - LSTM
    * Deep & Wide Networks
* We want to be able to explore how well we can make predictions of future solar cycles, so if GPs work, then focus more on those (NN are not likely to outperform GPs)

#### Plan for the next week:
* Try to alter kernel parameters to get a periodic amplitude
* Try and find suitable radioflux data and plot a GP with it
* Read https://www.sciencedirect.com/science/article/pii/S009457652100415X?via%3Dihub
* Start learning about neural networks - make a simple neural network

#### Log 1
* Fitted GP using log values to ensure the predictions stay positive
* Created a GP from scratch (Gaussian Process Regression.ipynb)
* Studying neural networks - playing around with an example code to see how it works (Neural Networks.ipynb)
* First attempt at creating an LSTM for the Sunspot data (RNN_SVG.ipynb)
* Created a custom mean function for the GP in PyMC3 (sinusoidal)

## Project Week 2 ~ 06/02/2023

### Minutes 2
* Think about using a Transformer Neural Network (ask Guy about plausibility)
* LSTM:
    * Use *Google Colab* to run code (GPU speed >> CPU speed)
    * Experiment with different number of neurons and layers
    * Try validating the last 5 years of data
* GP:
    * Keep experimenting with the mean function --> $\sin^2(x)$?
    
### Plan for the next week:
* Produce a plot of test and validation loss
* Obtain some predictions made by the LSTM for the last 5 years of data

### Log 2
* Changed GP mean function to $\sin^2(x)$
* Managed to plot test & validation loss
* Attempted to make predictions (last ~10 years) using various activation functions (relu, elu, & sigmoid) --> relu seems to work best thus far, but none of the predictions are as good as they should be
* Note: Google Colab takes longer to run (strange)
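
A rough sketch of the data preparation behind holding out the last few years for an LSTM, using NumPy only. The window length, the 5-year (60-month) hold-out and the synthetic series are assumptions, and `make_windows` is a hypothetical helper. The (samples, timesteps, features) layout is the input shape a Keras LSTM layer expects.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (input window, next value) pairs."""
    X = np.stack([series[i:i + window] for i in range(series.size - window)])
    y = series[window:]
    return X[..., None], y  # add a trailing feature axis

# Synthetic monthly series standing in for the smoothed sunspot signal.
months = np.arange(600)
series = np.sin(np.pi * months / 132.0) ** 2

window = 24  # months of history per prediction (assumed)
n_val = 60   # hold out the last 5 years of targets

X, y = make_windows(series, window)
X_train, y_train = X[:-n_val], y[:-n_val]
X_val, y_val = X[-n_val:], y[-n_val:]

print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
```

Splitting by time rather than at random keeps the held-out years strictly in the future of everything the network trains on.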

## Project Week 3 ~ 13/02/2023

### Minutes 3
* GP:
    * Mean function wasn't working correctly --> need to include the period in the sine function
    * Still think that GP should be able to make the best predictions as it marginalises over all possible functions
* LSTM: 
    * None of the predictions were plausible --> need to see predictions made for the test data before looking at validation
    * Validation loss is **unnecessary** for this data
    * Decreased the learning rate
    * Need to keep batch size at 1
    * We will attempt to implement a Transformer NN at some point (should be similar to and outperform an LSTM)
* **Taylor diagram** can be used to compare the different models we create
    
    
### Plan for the next week:
* Split GP into Train and Validation --> observe predictions
* Try and obtain better LSTM predictions 
* Create a Taylor diagram of the different models
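
A minimal sketch of the three statistics a Taylor diagram displays for each model: the standard deviation, the correlation with the reference, and the centred RMS difference. They are linked by $E'^2 = \sigma_p^2 + \sigma_r^2 - 2 \sigma_p \sigma_r R$. The function name and the test series are assumptions.

```python
import numpy as np

def taylor_statistics(reference, prediction):
    """Return (sigma_ref, sigma_pred, correlation, centred RMS difference)."""
    ref_anom = reference - reference.mean()
    pred_anom = prediction - prediction.mean()
    sigma_ref = ref_anom.std()
    sigma_pred = pred_anom.std()
    correlation = np.mean(ref_anom * pred_anom) / (sigma_ref * sigma_pred)
    crmsd = np.sqrt(np.mean((pred_anom - ref_anom) ** 2))
    return sigma_ref, sigma_pred, correlation, crmsd

# Made-up "observations" and a deliberately imperfect "model".
t = np.linspace(0, 4 * np.pi, 200)
reference = np.sin(t)
prediction = 0.8 * np.sin(t - 0.3) + 0.1

sigma_ref, sigma_pred, correlation, crmsd = taylor_statistics(reference, prediction)
print(f"sigma_ref = {sigma_ref:.3f}, sigma_pred = {sigma_pred:.3f}, "
      f"R = {correlation:.3f}, centred RMSD = {crmsd:.3f}")
```

Each model (GP, LSTM, and so on) would contribute one point to the diagram, placed at radius `sigma_pred` and at an angle whose cosine is `correlation`.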

### Log 3
* GP:
    * Split into train & validate, but predictions are not good --> altered mean function by altering the period, adding a phase & an offset
    * Predictions revert to the mean function (due to short length scale) --> does this mean sunspot numbers are inherently unpredictable?
* LSTM: 
    * Plotted LSTM training data & training predictions
    * Experimented with 20 - 100 epochs, with 100-300 neurons
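
A rough sketch of the GP behaviour noted in the log above: a $\sin^2$ mean function with a period, phase and offset, and a short-length-scale GP whose forecasts fall back onto that mean a few length scales beyond the data. Every parameter value and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def sin_squared_mean(t, amplitude, period, phase, offset):
    """Mean function; sin^2(pi * t / period) repeats every `period`."""
    return offset + amplitude * np.sin(np.pi * (t - phase) / period) ** 2

def rbf(xa, xb, length_scale):
    """Squared-exponential correlation between two sets of 1-D inputs."""
    return np.exp(-0.5 * ((xa[:, None] - xb[None, :]) / length_scale) ** 2)

mean_args = dict(amplitude=100.0, period=11.0, phase=0.0, offset=5.0)  # assumed

# Synthetic training data that deliberately differs from the mean function.
rng = np.random.default_rng(11)
t_train = np.linspace(0.0, 33.0, 100)
y_train = 1.3 * sin_squared_mean(t_train, **mean_args) + rng.normal(0, 5.0, t_train.size)
t_future = np.linspace(33.0, 55.0, 67)

length_scale, signal_std, noise_std = 1.0, 30.0, 5.0  # short length scale
K = signal_std ** 2 * rbf(t_train, t_train, length_scale) + noise_std ** 2 * np.eye(t_train.size)
K_s = signal_std ** 2 * rbf(t_train, t_future, length_scale)

# GP posterior mean = mean function + correction learned from the residuals.
residual = y_train - sin_squared_mean(t_train, **mean_args)
prediction = sin_squared_mean(t_future, **mean_args) + K_s.T @ np.linalg.solve(K, residual)

gap = np.abs(prediction - sin_squared_mean(t_future, **mean_args))
print("largest gap from the mean function beyond t = 40:", gap[t_future > 40.0].max())
```

In this sketch the reversion comes from the kernel's short memory rather than from the data, so it does not by itself show that sunspot numbers are unpredictable.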