# random_walk_test

**Author:** Marilyn Braojos Gutierrez\
**Purpose:** This program tests a sample of the data (5 months) to evaluate whether the data follow a random walk. If they do, a machine learning algorithm is not recommended for estimating them.\
**PhD Milestone:** #1: *Leverage deep learning models to GPS satellite clock bias corrections.*\
**Project:** This program is Step (1) in this PhD milestone. Obtaining the data is the first critical step.\
**References:**\
[1] https://machinelearningmastery.com/gentle-introduction-random-walk-times-series-forecasting-python/#:~:text=In%20fact%2C%20all%20random%20walk,and%2For%20variance%20over%20time.

[2] https://support.minitab.com/en-us/minitab/help-and-how-to/statistical-modeling/time-series/how-to/augmented-dickey-fuller-test/interpret-the-results/all-statistics-and-graphs/

**Takeaways:**
- The null hypothesis of the Augmented Dickey-Fuller (ADF) test is that there is a unit root; the alternative is that there is no unit root. If the p-value is above the chosen significance level, we cannot reject the hypothesis that there is a unit root.

- The null hypothesis is that the data are non-stationary, which implies that differencing is a reasonable step to try to make the data stationary.

- If **P-value ≤ significance level** or **Test statistic ≤ critical value**:\
If the p-value is less than or equal to the significance level, or if the test statistic is less than or equal to the critical value, the decision is to reject the null hypothesis. Because the data provide evidence of stationarity, the recommendation of the analysis is to proceed without differencing.


- If **P-value > significance level** or **Test statistic > critical value**:\
If the p-value is greater than the significance level, or if the test statistic is greater than the critical value, the decision is to fail to reject the null hypothesis. Because the data do not provide evidence of stationarity, the recommendation of the analysis is to determine whether differencing makes the mean of the data stationary. In this case the data set may be a random walk.
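
The decision rule in the takeaways above can be written as a small helper. This is only a sketch: `interpret_adf` is a hypothetical name that does not exist in statsmodels, and the default `alpha=0.05` is an assumption that matches the threshold used later in this notebook. The inputs in the example are the values reported by this notebook's run, with the 5% critical value.

```python
def interpret_adf(p_value, adf_statistic, critical_value, alpha=0.05):
    """Return True when the unit-root null hypothesis is rejected.

    Hypothetical helper: either criterion from the takeaways
    (p-value or test statistic) is treated as sufficient.
    """
    return p_value <= alpha or adf_statistic <= critical_value


# Values copied from this notebook's output (5% critical value).
rejected = interpret_adf(
    p_value=7.436176583353637e-30,
    adf_statistic=-17.104655996441302,
    critical_value=-2.8615482563596957,
)
print("Reject the unit-root null hypothesis:", rejected)
```

A `True` result corresponds to the first case above (proceed without differencing); `False` corresponds to the second case (try differencing, since the data may be a random walk).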

**Notes:** 
- 5/12 of the data seems to be a good stopping point (~5 months), because the kernel dies at ~6 months (this may be a memory issue)
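
One possible contributor to the memory pressure noted above is the lagged regression matrix that the ADF test builds. The sketch below gives a rough, back-of-the-envelope estimate only. It assumes the default maximum lag documented for `adfuller` when `maxlag` is not given, `12 * (nobs / 100) ** (1/4)`, rounded up here, and it assumes one float64 matrix with a column per candidate lag plus two extra columns. The function names are hypothetical, actual peak memory depends on statsmodels internals, and nothing here was measured on the notebook's data.

```python
import math


def default_maxlag(nobs):
    # Documented adfuller default when maxlag is None; rounding up is an assumption.
    return math.ceil(12.0 * (nobs / 100.0) ** 0.25)


def lag_matrix_megabytes(nobs, maxlag, bytes_per_value=8):
    # Rough size of a single float64 matrix with (maxlag + 2) columns.
    # The "+ 2" (lagged level and constant) is an assumed layout.
    return nobs * (maxlag + 2) * bytes_per_value / 1e6


n_subset = 350_000  # roughly the number of observations reported by this notebook's run
lags = default_maxlag(n_subset)
print("Default maxlag:", lags)
print("Approx. MB per matrix copy:", lag_matrix_megabytes(n_subset, lags))
```

Since `adfuller` accepts a `maxlag` argument, passing a smaller value is one thing to try if longer slices exhaust memory, at the cost of restricting the lag search.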

In [1]:
import numpy as np
from statsmodels.tsa.stattools import adfuller
import math

In [2]:
data = np.load('/Volumes/MARI/ssdl_gps/correction_data/2018_2019/continuous_unique_correction_data_2018_2019_jan1_dec31_clipped_015_9985.npz')
epochs = data['matching_epochs']
final_clock_bias = data['matching_clock_bias']
broadcast_clock_bias = data['matching_poly_values']
correction_value = data['correction_vals']

In [3]:
# Select the slice from 10/24 to 14/24 of the series (one sixth of the samples).
data_subset_beg = math.floor(10*len(correction_value)/24)
data_subset_end = math.floor(14*len(correction_value)/24)

In [4]:
%%time 
# Run the Augmented Dickey-Fuller test on the selected slice.
# Docs: https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html
result = adfuller(correction_value[data_subset_beg:data_subset_end])

CPU times: user 13min 16s, sys: 52.7 s, total: 14min 9s
Wall time: 2min 20s


In [5]:
# Extract results
adf_statistic = result[0]
p_value = result[1]
used_lag = result[2]
num_observations = result[3]
critical_values = result[4]
icbest = result[5]

In [6]:
# Print results
print("ADF Statistic:", adf_statistic)
print("p-value:", p_value)
print("Used Lag:", used_lag)
print("Number of Observations:", num_observations)
print("Critical Values:", critical_values)

# Interpretation (5% significance level)
alpha = 0.05
if p_value <= alpha:
    print("We reject the null hypothesis: The data does not have a unit root and is stationary.")
else:
    print("We fail to reject the null hypothesis: The data may have a unit root and may be a random walk.")

ADF Statistic: -17.104655996441302
p-value: 7.436176583353637e-30
Used Lag: 48
Number of Observations: 350071
Critical Values: {'1%': -3.430368680061904, '5%': -2.8615482563596957, '10%': -2.5667743945600296}
We reject the null hypothesis: The data does not have a unit root and is stationary.
