# Saturn aliens - Heuristics to detect anomalous retrieval logs
### Maria Silva, August 2022

In this notebook, we design some heuristics to detect anomalous retrievals in Saturn logs using historical data from verified logs.


**Note on reproducibility:**

The data used to run this notebook was queried from Saturn's internal systems. We are not able to share it externally due to privacy concerns and the sensitivity of the final tuning of the heuristics.

Even though we are not sharing the final thresholds, we share the code to demonstrate the process and reasoning behind Saturn's log detection system.

***


In [None]:
import os
import numpy as np
import pandas as pd
import datetime
import scipy.stats as st
import statsmodels.api as sm
from scipy.stats._continuous_distns import _distn_names
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

## 1. Load data

In [None]:
file = os.path.abspath("../../../../Data/Saturn/bandwidth_logs_jul_05.csv")
raw_df = pd.read_csv(file)

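# Work on an explicit copy so the column assignments below do not touch raw_df
df = raw_df.dropna().copy()
df["start_time"] = pd.to_datetime(df["start_time"])
# Despite the column name, this converts bytes to mebibytes (1 MiB = 2**20 bytes)
df["mb_sent"] = df["num_bytes_sent"]/(2**20)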

df.info()

## 2. Fit distributions and set thresholds

### Auxiliary functions

In [None]:
def best_fit_distribution(data, bins=200):
    """Fit every candidate scipy distribution to data and return (distribution, params, sse) tuples sorted by SSE, best fit first"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0
    # Best holders
    best_distributions = []
    # Estimate distribution parameters from data
    for distribution in [d for d in _distn_names if d not in ['levy_stable', 'studentized_range']]:
        distribution = getattr(st, distribution)
        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')
                # fit dist to data
                params = distribution.fit(data)
                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                # Calculate fitted PDF at the bin centers and its squared error against the histogram
                pdf = distribution.pdf(x, *arg, loc=loc, scale=scale)
                sse = np.sum(np.power(y - pdf, 2.0))
                # Record the fit; candidates are ranked by SSE on return
                best_distributions.append((distribution, params, sse))
        except Exception:
            pass
    return sorted(best_distributions, key=lambda x:x[2])


def make_pdf(dist, params, size=1000):
    """Generate the distribution's probability density function as a pandas Series"""
    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]
    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)
    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, *arg, loc=loc, scale=scale)
    pdf = pd.Series(y, x)
    return pdf
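
The ranking logic above can be sanity-checked on synthetic data. The sketch below applies the same criterion (sum of squared errors between the histogram density and the fitted PDF) to a small candidate list. The name `rank_candidates`, the four candidate distributions, and the synthetic gamma sample are illustrative assumptions and are not part of Saturn's pipeline.

```python
import warnings

import numpy as np
import scipy.stats as st


def rank_candidates(data, candidates, bins=30):
    """Rank candidate scipy distributions by SSE between histogram and fitted PDF."""
    density, edges = np.histogram(data, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2.0  # bin midpoints
    results = []
    for name in candidates:
        dist = getattr(st, name)
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore")
            params = dist.fit(data)  # (shape params..., loc, scale)
        pdf = dist.pdf(centers, *params)
        sse = float(np.sum((density - pdf) ** 2))
        results.append((name, params, sse))
    # Lowest SSE first
    return sorted(results, key=lambda r: r[2])


# Synthetic sample (assumption): not Saturn data
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=50.0, size=2000)

ranking = rank_candidates(sample, ["expon", "gamma", "lognorm", "norm"])
for name, params, sse in ranking:
    print(f"{name:8s} sse={sse:.3e}")
```

Restricting the candidate list keeps the check fast; the full loop over every scipy continuous distribution can take several minutes on large samples.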

### Heuristic 1: High request count

* **Level:** operator
* **Description:** Flag operators with a total number of requests higher than a threshold
* **Reasoning:** Operators may inject fake logs to bump their rewards. This rule catches operators that inject a massive number of requests

In [None]:
requests_per_node = df.groupby("node_id")["request_id"].nunique().sort_values(ascending=False)

# Find best fit distribution (results are sorted by SSE, so the first entry is the best)
best_distributions = best_fit_distribution(requests_per_node.values, bins=30)
best_dist = best_distributions[0]  # (distribution, params, sse)

# Make PDF with best params
pdf = make_pdf(best_dist[0], best_dist[1])

# Display
plt.figure(figsize=(8,5))
ax = pdf.plot(lw=2, label='PDF', legend=True)
requests_per_node.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)
param_names = (best_dist[0].shapes + ', loc, scale').split(', ') if best_dist[0].shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_dist[1])])
dist_str = '{}({})'.format(best_dist[0].name, param_str)
plt.title("Requests per node distribution\n"+ dist_str)
plt.show()
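
One possible way to turn a fitted distribution into a flagging rule is to take a high quantile of the fit as the threshold and flag every node whose request count exceeds it. The sketch below illustrates this on synthetic counts. The 0.99 quantile, the log-normal family, the synthetic data, and the name `flag_high_request_nodes` are all assumptions made for illustration; they are not the thresholds or the distribution used in production.

```python
import numpy as np
import pandas as pd
import scipy.stats as st


def flag_high_request_nodes(requests_per_node, dist, params, quantile=0.99):
    """Return the threshold at `quantile` of the fitted distribution and the nodes above it."""
    threshold = dist.ppf(quantile, *params)
    flagged = requests_per_node[requests_per_node > threshold]
    return threshold, flagged


# Synthetic baseline of per-node request counts (assumption, stands in for verified logs)
rng = np.random.default_rng(42)
baseline = pd.Series(
    rng.lognormal(mean=6.0, sigma=0.5, size=500).round(),
    index=[f"node_{i}" for i in range(500)],
)

# Fit on the baseline only, with loc fixed at 0
params = st.lognorm.fit(baseline.values, floc=0)

# New observation window: the baseline nodes plus one node injecting requests
counts = baseline.copy()
counts["node_injected"] = 1_000_000

threshold, flagged = flag_high_request_nodes(counts, st.lognorm, params)
print(f"threshold={threshold:.0f}, flagged={len(flagged)} nodes")
```

Fitting on a verified baseline and applying the threshold to new data keeps injected traffic from inflating the threshold itself.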