In [None]:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

In [16]:
num_clients = 14
clients_list = [0]*num_clients
base_data_path = 'C:\\Users\\kdmen\\Desktop\\Research\\personalization-privacy-risk\\Data\\Client_Specific_Files\\'
condition_number = 0

for i in range(num_clients):
    samples_path = base_data_path + "UserID" + str(i) + "_TrainData_8by20770by64.npy"
    #labels_path = base_data_path + "UserID" + str(i) + "_Labels_8by20770by2.npy" 
    with open(samples_path, 'rb') as handle:
        samples_npy = np.load(handle)
    #with open(labels_path, 'rb') as handle:
    #    labels_npy = np.load(handle)
    # Select the samples for the given condition: this is the actual training data for the given trial
    cond_samples_npy = samples_npy[condition_number,:,:]
    #cond_labels_npy = labels_npy[self.condition_number,:,:]
    
    clients_list[i] = cond_samples_npy

Autocorrelation and stationarity are important concepts when working with time series data and can impact the suitability of using linear regression.

__Autocorrelation__:
Autocorrelation is the correlation of a time series with its own past values, that is, the degree to which observations at different time points are related to each other. It is common in time series data: stock prices, for example, might be influenced by their past prices.
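
As a rough illustration of this definition, the lag-1 autocorrelation can be computed directly with NumPy. The helper name `lag_autocorr` and the two synthetic series are assumptions made for this sketch; they are not part of the analysis in this notebook.

```python
import numpy as np

def lag_autocorr(x, lag=1):
    """Sample autocorrelation of a 1-D series at the given lag (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Covariance between the series and its lagged copy, normalised by the total variance
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

rng = np.random.default_rng(0)
white_noise = rng.standard_normal(2000)   # independent draws
random_walk = np.cumsum(white_noise)      # each value builds on the previous one

print(f"white noise lag-1 autocorrelation: {lag_autocorr(white_noise):.3f}")
print(f"random walk lag-1 autocorrelation: {lag_autocorr(random_walk):.3f}")
```

The white-noise value should sit near 0 and the random-walk value near 1, which is the contrast between independent and strongly autocorrelated observations.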

Autocorrelation matters for linear regression because ordinary least squares assumes that the errors are independent of each other. When the errors are autocorrelated, this assumption is violated. The coefficient estimates are then inefficient (and can be biased if lagged values of the response are used as predictors), and the usual standard errors are incorrect, typically too small under positive autocorrelation, which undermines the reliability of t-tests and confidence intervals.
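
One common way to check a fitted regression for this problem is the Durbin-Watson statistic on the residuals. The sketch below computes it by hand with NumPy on synthetic data; the helper names, the AR(1) coefficient of 0.8, and the regression coefficients are all assumptions chosen for illustration.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 for uncorrelated residuals,
    closer to 0 for strong positive autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def ols_residuals(x, y):
    """Fit y = a + b*x by least squares and return the residuals."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

rng = np.random.default_rng(2)
n = 1000
x = rng.standard_normal(n)

# Independent errors
iid_err = rng.standard_normal(n)

# AR(1) errors: each error carries over 80% of the previous one (assumed value)
shocks = rng.standard_normal(n)
ar_err = np.zeros(n)
for t in range(1, n):
    ar_err[t] = 0.8 * ar_err[t - 1] + shocks[t]

dw_iid = durbin_watson(ols_residuals(x, 1.0 + 2.0 * x + iid_err))
dw_ar = durbin_watson(ols_residuals(x, 1.0 + 2.0 * x + ar_err))
print(f"Durbin-Watson, independent errors: {dw_iid:.2f}")
print(f"Durbin-Watson, AR(1) errors:       {dw_ar:.2f}")
```

A value well below 2 for the AR(1) case signals positive autocorrelation in the residuals, so the reported standard errors should not be taken at face value.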

__Stationarity__:
Stationarity refers to the property of a time series where its statistical properties, such as mean, variance, and autocorrelation, remain constant over time. A stationary time series is easier to model and analyze because its behavior is consistent over time. On the other hand, a non-stationary time series can exhibit trends, seasonality, and changing statistical properties, making it more challenging to analyze and predict accurately.
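
An informal way to see what "constant over time" means is to compare summary statistics across two halves of a series. This is only a quick sanity check, not a formal test, and the helper `half_stats` and the synthetic series are assumptions made for this sketch.

```python
import numpy as np

def half_stats(x):
    """Return (mean, variance) for the first and second half of a 1-D series."""
    x = np.asarray(x, dtype=float)
    mid = len(x) // 2
    first, second = x[:mid], x[mid:]
    return (first.mean(), first.var()), (second.mean(), second.var())

rng = np.random.default_rng(1)
t = np.arange(2000)
stationary = rng.standard_normal(2000)             # constant mean and variance
trending = 0.01 * t + rng.standard_normal(2000)    # mean drifts upward over time

for name, series in [("stationary", stationary), ("trending", trending)]:
    (m1, v1), (m2, v2) = half_stats(series)
    print(f"{name}: mean {m1:.2f} -> {m2:.2f}, variance {v1:.2f} -> {v2:.2f}")
```

The stationary series should show similar means in both halves, while the trending series shows a clear shift in its mean.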

When it comes to linear regression and time series data, it's often desirable to work with stationary data to ensure the validity of regression assumptions. This is where the concept of making the data "stationary" comes into play.

__Making Time Series Data Stationary__:
There are a few common methods to make a non-stationary time series stationary:
- Differencing: Take the difference between consecutive observations. This can help remove trends and make the data stationary.
- Transformation: Apply mathematical transformations such as logarithm, square root, or Box-Cox transformation to stabilize variance and make the data more stationary.
- Seasonal Decomposition: Decompose the time series into its trend, seasonal, and residual components. Modeling the residuals can often lead to stationary data.
- Detrending: Remove the trend component from the data to make it stationary.
- Checking the result: the Augmented Dickey-Fuller test does not transform the data. It is a statistical test for a unit root that helps you decide whether differencing is needed and whether a transformation has worked.
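
Three of the methods above (differencing, detrending, and a log transformation) can be sketched with NumPy alone. The synthetic series, the slope of 0.05, and the growth rate of 0.01 are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
steps = rng.standard_normal(1000)
t = np.arange(1000)

# Differencing: a random walk is the cumulative sum of its steps,
# so first differences recover those steps (all but the first one)
random_walk = np.cumsum(steps)
differenced = np.diff(random_walk)

# Detrending: fit a straight line and subtract it
trending = 0.05 * t + rng.standard_normal(1000)
slope, intercept = np.polyfit(t, trending, 1)
detrended = trending - (slope * t + intercept)

# Transformation: a log turns exponential growth into a linear trend,
# which differencing then reduces to a constant
growth = np.exp(0.01 * t)
log_growth = np.log(growth)

print("differences match the original steps:", np.allclose(differenced, steps[1:]))
print(f"fitted slope removed by detrending: {slope:.3f}")
print("differenced log series is constant:", np.allclose(np.diff(log_growth), 0.01))
```

Which method fits depends on why the series is non-stationary: differencing suits unit-root behaviour, detrending suits a deterministic trend, and a log transform suits multiplicative growth or variance that scales with the level.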

Once the series is stationary, you can proceed with linear regression. Stationarity alone does not guarantee uncorrelated errors, however, so check the regression residuals for remaining autocorrelation before trusting the standard errors.

In summary, autocorrelation can affect the validity of linear regression when dealing with time series data. Differencing, transformation, or decomposition can make the data stationary, which often reduces the problem and makes linear regression a more appropriate tool, provided the residuals are still checked for autocorrelation.

## Augmented Dickey-Fuller Test
> Used to determine whether a time series is stationary or not. It can guide you in deciding whether differencing is necessary.

In [12]:
# Generating example data (replace this with your actual data)
np.random.seed(123)
data = np.random.randn(1000, 64)

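# Function to conduct the Augmented Dickey-Fuller Test on each column
def adf_test(data):
    """Run adfuller on every column of a 2-D array (rows = time steps).

    Returns a list with one adfuller result tuple per column:
    (statistic, p-value, lags used, n observations, critical values, icbest).
    """
    results = []
    for i in range(data.shape[1]):
        result = adfuller(data[:, i])
        results.append(result)
    return results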

# Conduct the Augmented Dickey-Fuller Test
adf_results = adf_test(data)

# Interpret the results
num_cols_non_stationary = 0
for i, result in enumerate(adf_results):
    print(f"Column {i+1}:")
    print(f"ADF Statistic: {result[0]:0.2f}")
    print(f"p-value: {result[1]:0.4f}")
    print(f"Critical Values: (5%: {result[4]['5%']:.2f}), (1%: {result[4]['1%']:.2f}), (10%: {result[4]['10%']:.2f})")
    is_stationary = result[0] < result[4]['5%']
    print("Is stationary:", is_stationary)
    print("-" * 40)
    
    if not is_stationary:
        num_cols_non_stationary += 1

if num_cols_non_stationary!=0:
    print(f"Warning: {num_cols_non_stationary} columns are not stationary.")
else:
    print("All columns are stationary.")


Column 1:
ADF Statistic: -30.90
p-value: 0.0000
Critical Values: (5%: -2.86), (1%: -3.44), (10%: -2.57)
Is stationary: True
----------------------------------------
Column 2:
ADF Statistic: -15.83
p-value: 0.0000
Critical Values: (5%: -2.86), (1%: -3.44), (10%: -2.57)
Is stationary: True
----------------------------------------
Column 3:
ADF Statistic: -31.69
p-value: 0.0000
Critical Values: (5%: -2.86), (1%: -3.44), (10%: -2.57)
Is stationary: True
----------------------------------------
Column 4:
ADF Statistic: -17.82
p-value: 0.0000
Critical Values: (5%: -2.86), (1%: -3.44), (10%: -2.57)
Is stationary: True
----------------------------------------
Column 5:
ADF Statistic: -14.79
p-value: 0.0000
Critical Values: (5%: -2.86), (1%: -3.44), (10%: -2.57)
Is stationary: True
----------------------------------------
Column 6:
ADF Statistic: -33.47
p-value: 0.0000
Critical Values: (5%: -2.86), (1%: -3.44), (10%: -2.57)
Is stationary: True
----------------------------------------
Column 7:


The `adf_test` function iterates over each column of the data and applies the Augmented Dickey-Fuller test using `adfuller` from statsmodels.
For each column, the ADF statistic, p-value, and critical values are printed. A column is flagged as stationary (the "Is stationary" line) when its ADF statistic is below the 5% critical value.

The Augmented Dickey-Fuller test is a hypothesis test with the following hypotheses:
- Null Hypothesis (H0): The null hypothesis assumes that the data is non-stationary (has a unit root).
- Alternative Hypothesis (H1): The alternative hypothesis assumes that the data is stationary.

Interpretation of p-value:
- If the p-value is less than a chosen significance level (e.g., 0.05), you can reject the null hypothesis and conclude that the data is stationary.
- If the p-value is greater than the significance level, you fail to reject the null hypothesis. This means there is not enough evidence to rule out a unit root; it does not prove that the data is non-stationary.
- The code above applies the equivalent rule to the test statistic instead of the p-value: "Is stationary" is True when the ADF statistic is below the 5% critical value, and False otherwise.

Remember that you should replace the example data with your actual data and adjust the significance level as needed based on the context of your analysis.

In [20]:
for i_outer, client in enumerate(clients_list):
    # Conduct the Augmented Dickey-Fuller Test on every column of this client's data
    adf_results = adf_test(client)
    # Count the columns whose ADF statistic is not below the 5% critical value
    num_cols_non_stationary = sum(
        not (result[0] < result[4]['5%']) for result in adf_results
    )
    if num_cols_non_stationary != 0:
        print(f"Client{i_outer} Warning: {num_cols_non_stationary} columns are not stationary.")
    else:
        print(f"Client{i_outer}: All columns are stationary.")


Client0: All columns are stationary.
Client1: All columns are stationary.
Client2: All columns are stationary.
Client3: All columns are stationary.
Client4: All columns are stationary.
Client5: All columns are stationary.
Client6: All columns are stationary.
Client7: All columns are stationary.
Client8: All columns are stationary.
Client9: All columns are stationary.
Client10: All columns are stationary.
Client11: All columns are stationary.
Client12: All columns are stationary.
