### Goals:
1. Check the quality of data
2. Explore the connections in data
3. Explore periodicity of time series data
4. Explore stationarity
5. Detect anomalies and come up with a strategy for detecting future anomalies

In [1]:
from sqlalchemy import create_engine
from dotenv import load_dotenv
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [15]:
engine = create_engine("postgresql://airflow:airflow@localhost:5454/forex")
with engine.connect() as con:
  df = pd.read_sql_query('SELECT * FROM master', con=con.connection)



In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99966 entries, 0 to 99965
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   entity         99966 non-null  int64         
 1   bop_measure    99966 non-null  int64         
 2   inr_measure    99966 non-null  int64         
 3   date           99966 non-null  datetime64[ns]
 4   bop_value      99966 non-null  float64       
 5   interest_rate  99966 non-null  float64       
 6   ex_rate        99966 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(3)
memory usage: 5.3 MB


#### Quality checks
- Check for duplicates
- Check missing values
- Check for outliers
- Check the availability of data:
  - What is the range of time in which data is available for each entity
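
A minimal sketch of the availability check listed above, assuming the data has the `entity` and `date` columns shown in `df.info()`. The `toy` frame and the `date_coverage` helper are hypothetical stand-ins for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the `master` table; the values are illustrative only.
toy = pd.DataFrame({
    "entity": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2000-01-01", "2000-02-01", "2000-03-01", "2001-06-01", "2001-07-01"]
    ),
})


def date_coverage(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the first date, last date, and row count for each entity."""
    return frame.groupby("entity")["date"].agg(
        first_date="min", last_date="max", n_rows="count"
    )


coverage = date_coverage(toy)
print(coverage)
```

On the real data, `date_coverage(df)` would give one row per entity. Grouping by `['entity', 'bop_measure', 'inr_measure']` instead may be more informative if coverage differs between measures.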

In [25]:
any(
  df[['entity', 'bop_measure', 'inr_measure', 'date']]\
    .duplicated())

False

> There are no duplicate rows for any (`entity`, `bop_measure`, `inr_measure`, `date`) combination

In [27]:
df[['bop_value', 'interest_rate', 'ex_rate']].describe()

| | bop_value | interest_rate | ex_rate |
|---|---|---|---|
| count | 99966.0 | 99966.0 | 99966.0 |
| mean | -14572.816361 | 1.396879 | 7.470071 |
| std | 73696.223232 | 2.239009 | 8.722681 |
| min | -336811.0 | -1.75 | 0.6963 |
| 25% | -1450.005 | -0.036667 | 1.0804 |
| 50% | 888.4539 | 0.477042 | 6.955 |
| 75% | 8392.921 | 2.155409 | 7.7134 |
| max | 75586.0 | 9.82 | 27.61 |


> There seem to be no logically missing values (0's or large negative values that often denote missing data and aren't picked up by `df.info()`)
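
The summary statistics only hint at placeholder values, so a direct count can make the check explicit. This is a rough sketch: the sentinel codes in `SENTINELS` are assumptions (the source does not say which codes, if any, the upstream data uses), and `toy` and `sentinel_counts` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the three value columns; the values are illustrative only.
toy = pd.DataFrame({
    "bop_value": [120.5, 0.0, -9999.0, 33.1],
    "interest_rate": [0.5, 1.25, 0.0, 2.0],
    "ex_rate": [1.08, 7.7, 6.9, -999.0],
})

# Assumed placeholder codes; adjust to whatever the data source documents.
SENTINELS = [0, -999, -9999]


def sentinel_counts(frame: pd.DataFrame, sentinels: list) -> pd.Series:
    """Count, per column, how many cells are exactly equal to a sentinel code."""
    return frame.isin(sentinels).sum()


counts = sentinel_counts(toy, SENTINELS)
print(counts)
```

A nonzero count is only a prompt for a closer look: an interest rate of exactly 0 can be a legitimate observation.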

In [43]:
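# Flag outliers in `col` for one (entity, bop_measure, inr_measure) series
# using 3 * IQR fences.
def get_outliers(
    entity: int,
    bop_measure: int,
    inr_measure: int,
    col: str
):
  # np.logical_and takes only two inputs (a third positional argument is
  # treated as `out`), so combine the three conditions with `&` instead.
  mask = (
      (df['entity'] == entity)
      & (df['bop_measure'] == bop_measure)
      & (df['inr_measure'] == inr_measure))

  values = df.loc[mask, col]

  q1, q3 = np.quantile(values, [0.25, 0.75])
  iqr = q3 - q1

  # Boolean Series: True where a value lies outside the 3 * IQR fences
  return (values < q1 - 3 * iqr) | (values > q3 + 3 * iqr)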

In [45]:
get_outliers(383, 3, 5, "interest_rate")

False
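
The per-series check could be extended to all series at once rather than one (`entity`, `bop_measure`, `inr_measure`) combination at a time. The sketch below assumes the same 3 * IQR rule. `flag_outliers` and the `toy` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def flag_outliers(frame: pd.DataFrame, col: str, keys: list, k: float = 3.0) -> pd.Series:
    """Flag values outside [q1 - k*IQR, q3 + k*IQR], computed within each group."""
    grouped = frame.groupby(keys)[col]
    # transform broadcasts each group's quantile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (frame[col] < q1 - k * iqr) | (frame[col] > q3 + k * iqr)


# Toy data: entity 1 contains one obvious spike, entity 2 contains none.
toy = pd.DataFrame({
    "entity": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "interest_rate": [1.0, 1.1, 0.9, 1.0, 1.2, 50.0, 5.0, 5.1, 4.9, 5.0],
})

flags = flag_outliers(toy, "interest_rate", keys=["entity"])
print(toy[flags])
```

On the real data the call would be along the lines of `flag_outliers(df, "interest_rate", keys=["entity", "bop_measure", "inr_measure"])`, and `flags.any()` would give a single yes/no answer.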