# Analysing Time Series Data

**Time series analysis** is the study of data points ordered in time. A common approach breaks the original dataset down into four components (trend, cyclical variation, periodic variation, and irregular fluctuations), which we then use to forecast future behaviour.

Time series analysis is one of the **quantitative prediction methods**. It includes general statistical analysis, data filtering, and statistical models for prediction. Classical statistical analysis assumes that the values in a data sequence are independent of each other, while time series analysis focuses on the interdependence between them. The latter is really a statistical analysis of stochastic processes with a discrete index (time), so it can also be seen as a component of stochastic process statistics. For example, by recording the rainfall in the first month, second month, ..., and nth month in a certain region, time series analysis methods can be used to predict the rainfall in each future month.
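
The difference between independent and interdependent data can be illustrated with a minimal sketch. It uses synthetic data (not the course dataset) and `pandas.Series.autocorr` to compare the lag-1 autocorrelation of white noise with that of a random walk. The seed and series length are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Independent samples: each value is unrelated to the previous one
white_noise = pd.Series(rng.standard_normal(2000))

# Dependent samples: each value is the previous value plus a random step
random_walk = white_noise.cumsum()

# Lag-1 autocorrelation: correlation between the series and a copy shifted by one step
noise_acf = white_noise.autocorr(lag=1)
walk_acf = random_walk.autocorr(lag=1)

print(f"white noise lag-1 autocorrelation: {noise_acf:.3f}")
print(f"random walk lag-1 autocorrelation: {walk_acf:.3f}")
```

The white noise value should be close to zero, while the random walk value should be close to one, which is the kind of dependence that time series methods are designed to exploit.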

In this section of the course, we will see how time series analysis can be applied to monitor the Sun's magnetic field. You do not need to understand in depth what the data represent; we focus on how data are presented, analysed and smoothed in industry.

<hr style="border:2px solid gray">

## Index: <a id='index'></a>

1. [Introducing the Modules](#interest)
1. [Importing Data](#importing_data)
1. [Plotting Magnetic Field Data](#plotting_data)
1. [Calculate the Parker Spiral angle](#parker_angle)
1. [Interpolation and NaNs](#interpolation)
1. [Multivariate Analysis](#multivariate_analysis)
1. [Smoothing Data](#smoothing_data)

In this session, we use real experimental time series data to demonstrate how such data are handled in practice. We cover data interpolation, data smoothing, data filtering, data fitting, and the power spectrum.

The data we use in this session are magnetic field measurements from the magnetometer (MAG) on Solar Orbiter. They are downloaded from the Solar Orbiter Archive (SOAR) as Common Data Format (CDF) files and converted to a pandas DataFrame. Later in the session we also use NASA's ```SPICE``` kernels to obtain the spacecraft trajectory.



<div style="background-color: #FFF8C6">

## For your interest: <a id='interest'></a> [^](#index)

- ```sunpy.net.Fido```: The Fido search and retrieval tool, which provides a single interface for searching and downloading data from various solar physics data sources.

- ```sunpy.net.attrs``` (imported as ```a```): The attrs submodule of the sunpy.net package provides classes for creating attribute objects that are used to define search criteria for data queries.

- ```sunpy.io.cdf```: This module allows reading and writing data in the Common Data Format (CDF). CDF is a self-describing data format used to store and exchange various types of scientific data.

- ```sunpy.timeseries.GenericTimeSeries```: This class represents a generic time series object in ```SunPy```. Time series data are measurements or observations taken at multiple points in time.

- ```sunpy_soar```: This module is a sunpy plugin for accessing data in the Solar Orbiter Archive (SOAR).

- ```cdflib```: This module provides tools for reading and writing Common Data Format (CDF) files. CDF is often used for storing and exchanging multidimensional data in various scientific domains.

</div>

In [1]:
# Loading Data
from sunpy.net import Fido, attrs as a 
import sunpy.io.cdf as cdf
from sunpy.timeseries import GenericTimeSeries
import sunpy_soar
import numpy as np
import cdflib
import pandas as pd

# This is a .py file located in the same directory as this notebook; refer to the OOP
# section of the bootcamp for more information on importing modules
import analysis_helpers as h 

## Importing Data <a id='importing_data'></a> [^](#index)

The data we use in this session are magnetic field measurements recorded by the magnetometer (MAG) on Solar Orbiter. The example below downloads the period from 2022-02-08 to 2022-03-05.

<div style="background-color: #FFF8C6">

Some [background information](https://www.nasa.gov/content/solar-orbiter-instruments) on the data we are using:
The Sun's magnetic field extends outwards from our star, filling interplanetary space. Solar Orbiter's ultra-sensitive magnetic field instruments measure the strength and direction of the magnetic field around the spacecraft. This is a complicated, ever-changing characteristic that affects how charged particles move while simultaneously being influenced by the particles themselves as they zip through space. The magnetometer measurements will help scientists address one of Solar Orbiter's primary science questions about the origins of the magnetic field and solar wind plasma in the corona. Magnetic fields also act as a highway for charged particles moving away from the Sun, so magnetometer measurements will also be key to exploring how energetic particle radiation travels out into the solar system following solar eruptions. 

*The MAG principal investigator is Tim Horbury at Imperial College London, UK.*

</div>

You may choose to access data recorded by other instruments or to change the time period of the data. You can find the data on the [SPICE website](https://spdf.gsfc.nasa.gov/spice/).

In [2]:
# Create search attributes
instrument = a.Instrument('MAG')  # Magnetometer
time = a.Time('2022-02-08', '2022-03-05')  # Time range; you may change this for your own search
level = a.Level(2)
# Magnetic field in RTN (radial, tangential, normal) coordinates at 1-minute cadence
product = a.soar.Product('MAG-RTN-NORMAL-1-MINUTE')

# Search for files
result = Fido.search(instrument & time & level & product)

# Download files
files = Fido.fetch(result)

# Convert each CDF file to a pandas DataFrame for data manipulation
frames = [h.cdf2df(file) for file in files]

# Combine the files and sort the index chronologically (done once, after the loop)
data = pd.concat(frames).sort_index() if frames else pd.DataFrame()

Files Downloaded: 0file [00:00, ?file/s]

In [3]:
print(data) # Print the DataFrame 

Empty DataFrame
Columns: []
Index: []


## Plotting Magnetic Field Data <a id='plotting_data'></a> [^](#index)

Try to plot the data that we have accessed: plot the magnetic field data against time. Refer to the previous session if you need help.

**B** refers to the magnetic field, and the unit is nanotesla (nT).

```data['|B|']``` is the magnitude (or absolute value) of the magnetic field vector, ```data['BR']``` is the radial component, ```data['BT']``` is the tangential (transverse) component, and ```data['BN']``` is the normal component.


In [None]:
import matplotlib.pyplot as plt

# Plot the data here

#####################

# try using subplots for easier comparison between the different components 
# of the magnetic field

In [None]:
# example code for subplots
import matplotlib.pyplot as plt
fig, axs = plt.subplots(4,1,sharex=True)

axs[0].plot(data['|B|'], color = 'black')
axs[0].set_ylabel('|B|')
axs[1].plot(data['BR'], color = 'red')
axs[1].set_ylabel('BR')
axs[2].plot(data['BT'], color = 'green')
axs[2].set_ylabel('BT')
axs[3].plot(data['BN'], color = 'orange')
axs[3].set_ylabel('BN')

fig.autofmt_xdate()

## Background: Defining polarity - To be Discussed During Lecture <a id='polarity'></a> [^](#index)


The solar wind leaves the Sun radially and travels out into the heliosphere. However, the Sun rotates at a rate of about $14^{\circ}$/day, so the solar wind traces out a spiral pattern known as the Parker spiral.

<center>
<img src="img/parker_spiral.gif" />
</center>

From the "Frozen-in theorem", the magnetic field will follow the plasma in the solar wind, therefore, the interplanetary magnetic field does not point radially away from the Sun. So how do we define the magnetic polarity?

The Parker spiral angle is given by:

$$\theta = \arctan\left(\frac{-r\,\Omega}{V_{sw}}\right)$$

which depends on the spacecraft distance from the Sun, $r$, the solar rotation rate, $\Omega$, and the solar wind speed, $V_{sw}$.
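
As a rough illustration of the formula, here is a minimal sketch that evaluates the Parker spiral angle with NumPy. It is not the implementation of `h.PS_angle` from the helper file. The function name `parker_spiral_angle`, the rotation rate of 14 degrees per day, and the example speed of 400 km/s are assumptions made for illustration.

```python
import numpy as np

# Assumed solar rotation rate: about 14 degrees per day, converted to rad/s
OMEGA_SUN = np.radians(14.0) / 86400.0
AU_IN_KM = 1.495978707e8  # kilometres in one astronomical unit


def parker_spiral_angle(r_au, v_sw_km_s):
    """Return the nominal Parker spiral angle in degrees.

    r_au      : distance from the Sun in astronomical units
    v_sw_km_s : solar wind speed in km/s
    """
    r_km = np.asarray(r_au, dtype=float) * AU_IN_KM
    v_sw = np.asarray(v_sw_km_s, dtype=float)
    return np.degrees(np.arctan(-r_km * OMEGA_SUN / v_sw))


# Example: a spacecraft at 1 au in a 400 km/s solar wind
angle_at_1au = parker_spiral_angle(1.0, 400.0)
print(f"Parker spiral angle at 1 au: {angle_at_1au:.1f} degrees")
```

The angle is zero at the Sun and grows in magnitude with distance, reaching roughly $-47^{\circ}$ at 1 au for this assumed speed. A faster solar wind gives a straighter spiral.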

We can therefore define the polarity according to whether the field lies within $\pm 45^{\circ}$ of the nominal Parker spiral direction, pointing either away from or towards the Sun, as demonstrated here:

<center>
<img src="img/polarity_diagram.png" width="400"/>
</center>
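
The polarity definition above can be sketched in a few lines of NumPy. This is an illustrative version only and may differ from `h.add_polarity2df` in the helper file. The function names and the convention of +1 for a field within tolerance of the spiral direction, -1 for the opposite direction and 0 otherwise are assumptions.

```python
import numpy as np


def angular_difference(a_deg, b_deg):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (np.asarray(a_deg, dtype=float) - b_deg + 180.0) % 360.0 - 180.0


def classify_polarity(mag_angle_deg, ps_angle_deg, tolerance_deg=45.0):
    """Assign +1, -1 or 0 to each magnetic field angle (assumed convention)."""
    mag = np.atleast_1d(np.asarray(mag_angle_deg, dtype=float))
    ps = np.asarray(ps_angle_deg, dtype=float)

    # Within tolerance of the Parker spiral direction
    along = np.abs(angular_difference(mag, ps)) <= tolerance_deg
    # Within tolerance of the opposite direction (spiral angle + 180 degrees)
    opposite = np.abs(angular_difference(mag, ps + 180.0)) <= tolerance_deg

    polarity = np.zeros(mag.shape)
    polarity[along] = 1.0
    polarity[opposite] = -1.0
    return polarity


# Three field angles compared against a spiral angle of -45 degrees
example = classify_polarity([315.0, 135.0, 45.0], ps_angle_deg=-45.0)
print(example)
```

Wrapping the difference into the range from -180 to 180 degrees means that angles such as 315 and -45 degrees are correctly treated as the same direction.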

## Adding trajectory data
Before we can work out the Parker spiral angle, we need the distance of the spacecraft from the Sun.

To get it, we need to load the right ```SPICE``` kernels, which is not an easy task. We will use a package called ```astrospice``` so that this can be done automatically in the notebook.

You do not need to know how to access ```SPICE```, but try to understand the type of data we are importing so that we can use it to calculate the Parker spiral angle.

## Calculating the Parker spiral angle <a id='parker_angle'></a> [^](#index)

The code snippet below generates Solar Orbiter's coordinates at specific times using ```astrospice.generate_coords``` and then converts them into Carrington heliographic coordinates, which are centred on the Sun and rotate with it. This pinpoints the spacecraft's position relative to the Sun. The longitudes are converted to degrees and then unwrapped, because the jump from 360 back to 0 degrees ("angle wrapping") would cause problems when we interpolate later.

In [None]:
import astrospice
from sunpy.coordinates import HeliographicCarrington
import astropy.units as u


#get the SPICE kernels
solo_kernels = astrospice.registry.get_kernels('solar orbiter', 'predict')
solo_kernel = solo_kernels[0]
solo_coverage = solo_kernel.coverage('SOLAR ORBITER')
print("SPICE kernels cover this time period: ", solo_coverage.iso)

# use every 30 mins for trajectory, then interpolate
times = data.index[::30]

#get the coordinates
coords = astrospice.generate_coords('SOLAR ORBITER', times)
carr_frame = HeliographicCarrington(observer="self")
carr_coords = coords.transform_to(carr_frame)

# make sure there is no wrapping in the angles before interpolating
lons = carr_coords.lon.to(u.degree).value

# keep unwrapping until the longitude is continuous, with no jumps at 360 degrees
while np.any(np.diff(lons) > 10):
    lons = h.unwrap_lons(lons)

# make an orbit DataFrame that can be interpolated and merged into data
orbit_df = pd.DataFrame(
    {
        "Radius": carr_coords.radius.to(u.au).value,
        "Carr_lon": lons,
        "Carr_lat": carr_coords.lat.to(u.degree).value,
    },
    index=times,
)
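
To see why unwrapping matters before interpolation, here is a small sketch using NumPy's `np.unwrap`. It is not the implementation of `h.unwrap_lons`, and the longitude values are made up for illustration.

```python
import numpy as np

# Made-up longitudes in degrees that wrap from 355 back round to 2
wrapped = np.array([350.0, 355.0, 2.0, 8.0, 15.0])

# np.unwrap works in radians by default, so convert there and back
unwrapped = np.degrees(np.unwrap(np.radians(wrapped)))

print(unwrapped)
```

After unwrapping, the sequence increases smoothly past 360 degrees. Interpolating the wrapped values between 355 and 2 would instead pass through about 180 degrees, which is on the wrong side of the Sun.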

## What are NaNs? <a id='interpolation'></a> [^](#index)

NaN stands for "Not a Number" and is a standard way to denote undefined or unrepresentable values in datasets. These often serve as placeholders for missing or incomplete data.

Missing data, represented as NaNs, can introduce bias, lead to loss of statistical power, and significantly affect the conclusions of an analysis. For example, NaNs can skew calculations of mean, median, or correlations, or may lead to gaps in plotted time series data that create misleading visual interpretations.
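
A short sketch with made-up values shows how a single NaN affects a simple statistic, and how to count missing entries. The numbers are illustrative and are not taken from the course dataset.

```python
import numpy as np
import pandas as pd

values = pd.Series([4.0, np.nan, 6.0, 8.0])

# Count the missing entries
n_missing = int(values.isna().sum())

# NumPy's mean propagates the NaN, so the result is also NaN
numpy_mean = np.mean(values.to_numpy())

# NaN-aware alternatives skip the missing value
nan_aware_mean = np.nanmean(values.to_numpy())
pandas_mean = values.mean()  # pandas skips NaN by default

print(n_missing, numpy_mean, nan_aware_mean, pandas_mean)
```

Which behaviour you want depends on the analysis. Silently skipping NaNs is convenient, but it hides how much data is actually missing, so it is worth checking `isna().sum()` first.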

In this section, we fill in the NaNs with `DataFrame.interpolate()`.

## Introducing Pandas DataFrame Interpolation

The pandas library offers the `DataFrame.interpolate` method to fill in missing values in a DataFrame, replacing each NaN with an estimate computed from the surrounding data.

By default it uses linear interpolation to "fill in the blanks" between existing data points. For example, in a time series with a missing value at time `t`, pandas can estimate it from the values at times `t-1` and `t+1`. Other methods, such as `"time"`, take the spacing of the index into account.
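
The following sketch compares two interpolation methods on a tiny, made-up series with unevenly spaced timestamps. The values and timestamps are assumptions chosen to make the difference easy to see.

```python
import numpy as np
import pandas as pd

# Three samples: the gap after the missing point is three times longer than the gap before it
index = pd.to_datetime(
    ["2022-02-08 00:00", "2022-02-08 00:01", "2022-02-08 00:04"]
)
series = pd.Series([0.0, np.nan, 4.0], index=index)

# "linear" ignores the index and treats the points as equally spaced
linear_fill = series.interpolate(method="linear")

# "time" weights the estimate by the actual time between samples
time_fill = series.interpolate(method="time")

print(linear_fill.iloc[1], time_fill.iloc[1])
```

The linear method places the missing value halfway between its neighbours, while the time method places it a quarter of the way, because the sample sits one minute into a four-minute gap. For data with an irregular cadence, `method="time"` is usually the safer choice.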

In the context of the Solar Orbiter or similar scientific data, interpolation might be used to estimate the spacecraft's position at times when actual readings are not available. The physics behind the interpolation would depend on the understood motion and forces acting upon the object. 

## Interpolation in Time Series

Interpolation is particularly useful in time series analysis, where time-ordered data points may be missing. By filling in these points, you obtain a continuous dataset, which makes it easier to perform calculations such as trend analysis or Fourier transforms. Interpolation in time series commonly uses linear, polynomial, or spline methods to estimate the missing values.


## Financial and Stock Market Uses of Interpolation

In the finance sector, especially in stock market analysis, interpolation is used to estimate asset prices, rates, or trends for points in time where no data is available. This enables analysts to have a more complete picture of market behaviors. It's especially useful in pricing options or understanding market volatility when historical data may be sparse.

In [None]:
# drop columns if they already exist
if "Radius" in data.keys():
    data.drop(
        columns=["Radius", "Carr_lon", "Carr_lat"], inplace=True
    )
    
#re-index to data
orbit_df = orbit_df.reindex(data.index)
# add the orbit variables in
for key in orbit_df.keys():
    data[key] = orbit_df[key]

#interpolate to fill in the NaNs
# I don't want to interpolate the gaps in magnetic field
data[['Radius', 'Carr_lon', 'Carr_lat']] = data[
    ['Radius', 'Carr_lon', 'Carr_lat']
    ].interpolate(method="time")

<div style="background-color:#C2F5DD">

## Task: Manipulating pandas dataframes:

Experiment with your existing knowledge of pandas:

1. Create a fictional "Velocity" column and add it to `orbit_df`.
1. Check if a "Velocity" column already exists in `data` and remove it if it does.
1. Re-index `orbit_df` to match the index of `data`.
1. Add the new "Velocity" column to `data`.
1. Perform time-based interpolation on this "Velocity" column.

</div>


In [None]:
## code here

#####################

In [None]:
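# example code
# Fictional velocity samples on the 30-minute trajectory grid
velocity = pd.Series(np.random.rand(len(times)), index=times)

# Assigning a Series aligns on the index, so this also works if orbit_df
# has already been re-indexed to data; rows without a sample become NaN
orbit_df['Velocity'] = velocity

# Drop the column from data if it already exists
if "Velocity" in data.keys():
    data.drop(columns=["Velocity"], inplace=True)

# Re-index to data, then copy the new column across
orbit_df = orbit_df.reindex(data.index)
data['Velocity'] = orbit_df['Velocity']

# Fill in the NaNs using time-based interpolation
data['Velocity'] = data['Velocity'].interpolate(method="time")

print(orbit_df)

# rerun the previous code for the following tasks
# if you want to go back to clean data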

## What is time series and data

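In this exercise, identify the samples affected by spacecraft thruster firings, turn them into NaNs, use the data quality information to understand what remains, and interpolate where appropriate.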
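
One possible way to approach this is sketched below on made-up data. The column names `B` and `quality`, the quality threshold of 2, and the values themselves are all hypothetical; check the actual variables in your DataFrame before adapting it.

```python
import numpy as np
import pandas as pd

# Made-up one-minute samples; the two large values stand in for contaminated data
index = pd.date_range("2022-02-08", periods=6, freq="1min")
df = pd.DataFrame(
    {
        "B": [5.0, 5.2, 40.0, 41.0, 5.6, 5.8],
        "quality": [3, 3, 0, 0, 3, 3],  # hypothetical flag: higher is better
    },
    index=index,
)

# Keep values where the (hypothetical) quality flag is good, set the rest to NaN
good = df["quality"] >= 2
df["B_clean"] = df["B"].where(good)

# Fill short gaps only: `limit` caps the number of consecutive NaNs filled
df["B_filled"] = df["B_clean"].interpolate(method="time", limit=5)

print(df)
```

Setting a `limit` is a design choice: it avoids inventing data across long gaps, where a straight line between the end points would be misleading.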

In [None]:
### code here

## Continuing with the Parker spiral angle calculation:

Here we add a tolerance. Tolerances are important because we are dealing with real data and need to account for errors in the data, so we apply the tolerance to the Parker spiral angle when assigning the polarity.

In [None]:
#guess a speed for now, in reality you would get a speed from PAS
data['V'] = 350

tolerance = 45
data = h.add_polarity2df(data, ds_period = '12H',tolerance=tolerance)

#angle of magnetic field in R-T plane
mag_angle = np.arctan2(data['BT'],data['BR']) *180/np.pi
# make the angles go from 0 -> 360
mag_angle[mag_angle<0] += 360
data['mag_angle'] = mag_angle

data['PS_angle'] = h.PS_angle(data['Radius'].values*u.au, data['V'].values*u.km/u.s)

In [None]:
print(data)

## Multivariate Analysis <a id='multivariate_analysis'></a> [^](#index)


Here we plot `data['|B|']`, `data['BR']`, `data['BT']` and `data['BN']` against time, adding appropriate labels and titles to the plot.

<div style="background-color:#C2F5DD">
Task: Plot the Parker spiral angle and the polarity against time.
</div>

In [None]:
fig, axs = plt.subplots(2, sharex = True, figsize = (10,10))
axs[0].plot(data['|B|'], color = 'black', label = '|B|')
axs[0].plot(data['BR'], color = 'red', label = 'BR')
axs[0].legend(loc = 'upper left')

axs[1].plot(data['BT'], color = 'orange', label = 'BT')
axs[1].plot(data['BN'], color = 'green', label = 'BN')
axs[1].legend(loc = 'upper left')

axs[0].set_ylabel('|B|, BR (nT)')
axs[1].set_ylabel('BT, BN (nT)')



In [None]:
fig, axs = plt.subplots(2, sharex = True, figsize = (10,10))

# dotted zero line on the polarity panel
# (the original loop over axs[:0] was empty, so it never drew anything)
axs[1].axhline(0, color = 'black', lw = 1, ls = 'dotted')

axs[0].scatter(data.index, data['mag_angle'], color = 'black', s = 1)
axs[0].fill_between(data.index, data['PS_angle'] +tolerance, data['PS_angle'] -tolerance, color = 'red', alpha = 0.2)
axs[0].fill_between(data.index, data['PS_angle'] -180 +tolerance, data['PS_angle'] -180 -tolerance, color = 'blue', alpha = 0.2)
# raw string so that the backslash in \circ is not treated as an escape sequence
axs[0].set_ylabel(r'Mag. angle ($^{\circ}$)')


axs[1].plot(data['polarity'], color = 'green')
axs[1].set_ylabel('Polarity')

## Data Smoothing <a id='smoothing_data'></a> [^](#index)

Data smoothing is a technique that involves removing noise from a dataset to make a dataset easier to analyse. Smoothing can be performed in a variety of ways, including averaging, binning, and splining. 

<div style="background-color:#C2F5DD">
Task: Look up the different types of smoothing and try to apply them to the data.
</div>

In [None]:
## try to smooth the magnetic field data and plot it against time

#####################

#####################

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter


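# Apply a moving average with a window size of 10 samples
# (10 minutes at the 1-minute cadence of this dataset)
data['Smoothed_B_MovingAvg'] = data['|B|'].rolling(window=10).mean()

# Apply Gaussian smoothing with a sigma of 10 samples
# Note: any NaNs in |B| spread into the neighbouring smoothed values
data['Smoothed_B_Gaussian'] = gaussian_filter(data['|B|'], sigma=10)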


fig, axs = plt.subplots(2, sharex=True, figsize=(10, 10))

axs[0].plot(data.index, data['|B|'], label='Original |B|', color='yellow')
axs[0].plot(data.index, data['Smoothed_B_MovingAvg'], label='Moving Average Smoothed |B|', color='red')

axs[1].plot(data.index, data['|B|'], label='Original |B|', color='yellow')
axs[1].plot(data.index, data['Smoothed_B_Gaussian'], label='Gaussian Smoothed |B|', color='blue')

axs[0].set_ylabel('|B| (nT)')
axs[0].set_title('Magnetic Field Intensity with Smoothing')
axs[0].legend()

axs[1].set_xlabel('Time')
axs[1].set_ylabel('|B| (nT)')
axs[1].legend()

# Show the plot
plt.show()



## Filtering Data <a id='filtering_data'></a> [^](#index)

Here we have used two types of filter: a moving average filter and a Gaussian filter.

A moving average filter is a simple, finite impulse response (FIR) filter commonly used in signal processing and time-series analysis. It works by averaging a set of data points within a moving window over the data. The window slides along the data, and at each position, the average of the data points within the window is calculated. This is a simple way to smooth out data and reduce noise within a dataset.
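
The sliding-window description above can be checked with a minimal sketch: a moving average is a convolution with a flat kernel, and it matches the result of the pandas `rolling(...).mean()` call. The input values and the window size of 3 are made up for illustration.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
window = 3

# FIR view: every sample in the window gets the same weight, 1 / window
kernel = np.ones(window) / window
fir_output = np.convolve(x, kernel, mode="valid")

# pandas view: rolling mean, dropping the leading NaNs where the window is incomplete
rolling_output = pd.Series(x).rolling(window=window).mean().dropna().to_numpy()

print(fir_output)
print(rolling_output)
```

Both approaches give the same numbers. Note that the default pandas window is trailing, so each output is aligned with the last sample in its window; passing `center=True` aligns it with the middle sample instead.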

A Gaussian filter also averages neighbouring points, but it weights them with a Gaussian (bell-shaped) curve so that nearby samples count more than distant ones. For time-series magnetic field data from the Sun, this gives effective noise reduction while keeping localised features recognisable, such as sudden changes in magnetic field strength that could indicate solar events, although sharp changes are always blurred to some extent. Because the Fourier transform of a Gaussian is also a Gaussian, the filter behaves well in both the time and frequency domains. The filter is adaptable: its standard deviation sets the amount of smoothing, with larger values removing more noise but also blurring sharp features more strongly. It can also be computationally efficient with optimised algorithms.
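
The effect of the standard deviation can be seen in a short sketch on synthetic data. It uses `scipy.ndimage.gaussian_filter1d`; the signal, the noise level and the two sigma values are assumptions chosen for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Synthetic signal: a slow sine wave plus white noise
t = np.arange(1000)
clean = np.sin(2 * np.pi * t / 200.0)
noisy = clean + 0.5 * rng.standard_normal(t.size)

# Smooth with a narrow and a wide Gaussian kernel (sigma in samples)
smooth_narrow = gaussian_filter1d(noisy, sigma=2)
smooth_wide = gaussian_filter1d(noisy, sigma=10)


def roughness(y):
    """Standard deviation of the point-to-point differences."""
    return np.std(np.diff(y))


print(roughness(noisy), roughness(smooth_narrow), roughness(smooth_wide))
```

The roughness drops as sigma increases. The price is that genuine rapid changes are smoothed away as well, so sigma should be chosen with the timescale of the features of interest in mind.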

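## High and Low Pass Filters Using fft

A low-pass filter keeps the slow variations in a signal and removes the rapid fluctuations, while a high-pass filter does the opposite. Applying both lets us split the data into a slowly varying part and a rapidly varying part.

SciPy has [built-in](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.butter.html) functions for this: `scipy.signal.butter` designs a high-pass or low-pass Butterworth filter, and `scipy.signal.filtfilt` applies it to the data.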

- [`scipy.signal.butter`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.butter.html)
- [`scipy.signal.filtfilt`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.filtfilt.html)

In [None]:
from scipy.signal import butter, filtfilt

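def butter_lowpass_filter(data, cutoff_freq, sample_rate):
    # Design a 6th-order low-pass Butterworth filter; the cutoff is given
    # as a fraction of the Nyquist frequency (half the sample rate)
    b, a = butter(N=6, Wn=cutoff_freq / (0.5 * sample_rate), btype='low')
    # filtfilt runs the filter forwards and backwards, so there is no phase shift;
    # the input must not contain NaNs
    y = filtfilt(b, a, data)
    return y

def butter_highpass_filter(data, cutoff_freq, sample_rate):
    # Same design as above, but keeping frequencies above the cutoff
    b, a = butter(N=6, Wn=cutoff_freq / (0.5 * sample_rate), btype='high')
    y = filtfilt(b, a, data)
    return y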
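
As a usage illustration, the sketch below applies the same Butterworth approach to a synthetic signal made of one slow and one fast sine wave. The helper name `zero_phase_butter`, the filter order of 4, the sample rate and the cutoff are all assumptions chosen for this example.

```python
import numpy as np
from scipy.signal import butter, filtfilt


def zero_phase_butter(x, cutoff_hz, fs_hz, btype, order=4):
    """Design a Butterworth filter and apply it forwards and backwards."""
    b, a = butter(order, cutoff_hz, btype=btype, fs=fs_hz)
    return filtfilt(b, a, x)


fs = 100.0                                # sample rate in Hz (assumed)
t = np.arange(0, 10, 1 / fs)              # 10 seconds of samples
slow = np.sin(2 * np.pi * 1.0 * t)        # 1 Hz component
fast = 0.5 * np.sin(2 * np.pi * 20.0 * t) # 20 Hz component
signal = slow + fast

# A 5 Hz cutoff sits between the two components
low = zero_phase_butter(signal, 5.0, fs, "low")
high = zero_phase_butter(signal, 5.0, fs, "high")

# Compare away from the edges, where filter transients can occur
middle = slice(100, -100)
print(np.max(np.abs(low[middle] - slow[middle])))
print(np.max(np.abs(high[middle] - fast[middle])))
```

The low-pass output recovers the slow component and the high-pass output recovers the fast one. With real data, any NaNs need to be removed or interpolated first, because the filter cannot handle them.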

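## Fourier Transformation

The Fourier transform decomposes a time series into sinusoidal components of different frequencies.

## Power Spectrum

The power spectrum is used to analyse periodic signals, or signals that have a periodic component, and to determine the strength of each frequency component.

In the analysis of the Sun's magnetic field, we can use the power spectrum to look for periodic patterns in the measurements.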

Smoothing techniques like moving average or Gaussian filters can be applied before Fourier Transform and power spectrum analysis to reduce noise. Smoothing can make it easier to identify true frequency components by eliminating or reducing spurious peaks in the power spectrum that are due to noise. 
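
A minimal sketch of a power spectrum calculation with NumPy's FFT is shown below. It uses a synthetic signal rather than the magnetic field data; the one-minute sampling interval matches the cadence used earlier, while the two-hour period and the noise level are made up.

```python
import numpy as np

dt = 60.0        # sampling interval in seconds (one-minute cadence)
n = 1440         # one day of samples
t = np.arange(n) * dt
period = 7200.0  # hypothetical two-hour oscillation, in seconds

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
signal = np.sin(2 * np.pi * t / period) + 0.5 * rng.standard_normal(n)

# Remove the mean so that the zero-frequency bin does not dominate
signal = signal - signal.mean()

# Real-input FFT, the matching frequency axis, and the power in each bin
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(n, d=dt)
power = np.abs(spectrum) ** 2 / n

# Strongest component, skipping the zero-frequency bin
peak_index = np.argmax(power[1:]) + 1
peak_period = 1.0 / freqs[peak_index]

print(f"strongest period: {peak_period / 3600:.2f} hours")
```

The peak in the power spectrum sits at the two-hour period that was put into the signal, even though it is hard to see by eye in the noisy time series. The FFT cannot handle NaNs, so gaps in real data need to be filled or removed first.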