<img src="../assets/headline.png" alt="headline">

## Preprocessing, Lecture 2.5

## SUIT: Transformation

Lecture 2, November 9th, 2022

In [None]:
from statsmodels.tsa.stattools import adfuller
import pandas as pd
from utilities.plot_ts import plot_series

In [None]:
events = pd.read_csv('../data/tiktok_events.csv')
events['datetime'] = pd.to_datetime(events['datetime'])
events.set_index('datetime',inplace=True)
events.head()

In [None]:
groupers = [pd.Grouper(freq='1D'),'city']

# total likes per day and city, one column per city
likes = events.groupby(groupers)['likes'].sum().unstack()

likes

In [None]:
# plot likes over time from different cities
plot_series(likes['kadima'], likes['ramat-gan'], likes['rishon'], likes['tel-aviv'])

In [None]:
# plot the day-over-day difference in tel-aviv likes
plot_series(likes['tel-aviv'].diff())
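
As a small illustration of what `.diff()` computes, here is a sketch on a made-up series (the values are illustrative and not taken from the events data). Each entry is replaced by its change from the previous entry, so the first entry has nothing to compare against and becomes NaN.

```python
import pandas as pd

# toy values, assumed purely for illustration
toy = pd.Series([10, 12, 15, 15, 11])

# first-order difference: toy[t] - toy[t-1]
toy_diff = toy.diff()
print(toy_diff.tolist())  # [nan, 2.0, 3.0, 0.0, -4.0]
```

This is why the differenced series is usually followed by `.dropna()` before it is passed to a statistical test.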

# Augmented Dickey-Fuller test

### Stationary Data
<img src="../assets/adf_stationary.png" alt="stationary">

A series is stationary when its unconditional joint probability distribution does not change when shifted in time.
E.g. the mean and the variance are constant over time.


### Non-Stationary Data
<img src="../assets/adf_not_stationary.png" alt="not stationary">

A series is non-stationary when its unconditional joint probability distribution changes when shifted in time.
E.g. the mean or the variance is not constant over time.

### ADF Test

<a href="https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test">ADF Wiki</a>

A statistical test that checks whether a time series is stationary by testing for a unit root.

Null hypothesis: the data has a unit root (it is not stationary)

Alternative hypothesis: the data is stationary

Python: `from statsmodels.tsa.stattools import adfuller`





In [None]:
# import adfuller
from statsmodels.tsa.stattools import adfuller

In [None]:
# likes over time from tel-aviv
likes['tel-aviv']

In [None]:
# plot it
plot_series(likes['tel-aviv'])

In [None]:
# run the ADF test on tel-aviv likes
# returns: (test statistic, p-value, lags used, observations, critical values, best information criterion)
adfuller(likes['tel-aviv'])

In [None]:
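# decision rule: reject the null hypothesis (data is not stationary) at the 5% level
# when the test statistic is below the 5% critical value
adf = adfuller(likes['tel-aviv'])

test_statistic = adf[0]
critical_value_5pct = adf[4]['5%']

if test_statistic < critical_value_5pct:
    print("data is stationary")
else:
    print("data is not stationary")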

In [None]:
# Is it just about the trend?
likes_diff = likes['tel-aviv'].diff().dropna()
likes_diff

In [None]:
# plot it
plot_series(likes_diff)

In [None]:
# run another ADF test, this time on the differenced series
adf = adfuller(likes_diff)
is_stationary = adf[0] < adf[4].get('5%')

if is_stationary:
    print("data is stationary")
else:
    print("data is not stationary")

In [None]:
# why is the likes ratio dropping?
# look at the mean number of likes per event in 7-day windows

weekly = events.resample('7D').aggregate(
    likes=("likes", "mean")
)

plot_series(weekly["likes"])
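
To show what the 7-day aggregation computes, here is a minimal sketch on a made-up daily series (the dates and values are assumptions, not the events data). It checks that dividing the sum by the count within each window gives the same result as the built-in mean when there are no missing values.

```python
import numpy as np
import pandas as pd

# 28 days of toy data: likes = 0, 1, ..., 27
idx = pd.date_range("2022-01-01", periods=28, freq="D")
toy = pd.DataFrame({"likes": np.arange(28, dtype=float)}, index=idx)

# sum divided by count within each 7-day window
by_lambda = toy["likes"].resample("7D").apply(lambda x: x.sum() / len(x))

# built-in mean within each 7-day window
by_mean = toy["likes"].resample("7D").mean()

print(by_mean.tolist())  # [3.0, 10.0, 17.0, 24.0]
```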


In [None]:
del events