<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Test-1:-Dataframe" data-toc-modified-id="Test-1:-Dataframe-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Test 1: Dataframe</a></span></li><li><span><a href="#Test-2:-Checking-data-is-configurable" data-toc-modified-id="Test-2:-Checking-data-is-configurable-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Test 2: Checking data is configurable</a></span></li><li><span><a href="#Test-3:-Lists" data-toc-modified-id="Test-3:-Lists-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Test 3: Lists</a></span></li><li><span><a href="#Test-4:-Dealing-with-strings" data-toc-modified-id="Test-4:-Dealing-with-strings-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Test 4: Dealing with strings</a></span></li><li><span><a href="#Test-5:-Arrays" data-toc-modified-id="Test-5:-Arrays-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Test 5: Arrays</a></span></li></ul></div>

# ROLLING STATISTICS

The task is to create a class that stores data and provides various statistics about it. The number of stored items should be configurable: when new data comes in, the oldest data should be removed. It should be possible to get the following statistics of the data:
1. Mean
2. Median
3. Sum
4. Standard deviation
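
The eviction rule described above (drop the oldest item when a new one arrives) can be sketched with `collections.deque` and its `maxlen` argument. This is only an illustrative sketch, not the class used later in this notebook: the `RollingWindow` name and its methods are hypothetical, and the statistics come from the standard-library `statistics` module.

```python
from collections import deque
import statistics


class RollingWindow:
    """Hypothetical fixed-size window that keeps only the newest `maxlen` values."""

    def __init__(self, maxlen):
        # A deque with maxlen discards its oldest item when a new one overflows it
        self._items = deque(maxlen=maxlen)

    def add(self, value):
        self._items.append(value)

    def values(self):
        return list(self._items)

    def mean(self):
        return statistics.fmean(self._items)

    def median(self):
        return statistics.median(self._items)

    def total(self):
        return sum(self._items)

    def std(self):
        # Population standard deviation (divides by N), like np.std's default
        return statistics.pstdev(self._items)


window = RollingWindow(maxlen=3)
for value in [3, 7, 3, 6]:
    window.add(value)

print(window.values())  # the first 3 has been evicted
print(window.mean(), window.median(), window.total(), window.std())
```

With `maxlen=3`, adding a fourth value pushes the first one out, so the statistics always describe the newest three items.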

Extension points: <br>
● Can one class provide all the functionality, or are separate classes needed? <br>
● Can more generic data be stored? How would statistics be computed for it? <br>
● Can the data be queried for a specific range rather than returning all of it? <br>
● Can the data be stored with timestamps, so that the stored range is bounded by a specified time span rather than by a specified number of items?
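
The last two extension points (range queries and a time-bounded window) could be approached along the following lines. This is a sketch under stated assumptions: the `TimedWindow` class, its `max_age` parameter, and the `values(start, end)` query are all hypothetical names, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
import time


class TimedWindow:
    """Hypothetical window that keeps values no older than `max_age` seconds."""

    def __init__(self, max_age):
        self.max_age = max_age
        self._items = deque()  # (timestamp, value) pairs, oldest first

    def add(self, value, timestamp=None):
        # Assumption: timestamps arrive in non-decreasing order
        ts = time.time() if timestamp is None else timestamp
        self._items.append((ts, value))
        self._evict(ts)

    def _evict(self, now):
        # Drop entries from the left while they are older than max_age
        while self._items and now - self._items[0][0] > self.max_age:
            self._items.popleft()

    def values(self, start=None, end=None):
        # Optional inclusive [start, end] timestamp range query
        return [v for ts, v in self._items
                if (start is None or ts >= start) and (end is None or ts <= end)]


tw = TimedWindow(max_age=10)
tw.add(1.0, timestamp=0)
tw.add(2.0, timestamp=5)
tw.add(3.0, timestamp=12)  # the entry at t=0 is now 12 s old and is dropped

print(tw.values())         # everything still inside the window
print(tw.values(start=6))  # only entries with timestamp >= 6
```

Eviction happens on insertion here; a variant could also evict on every query, so that stale entries disappear even when no new data arrives.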

In [1]:
from dataclasses import dataclass
import pandas as pd
import numpy as np
import time

In [2]:
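from typing import Any

@dataclass
class DataClassStat:
    df: Any  # DataFrame, NumPy array, list, or numeric string

    def strlist(self):
        # Coerce list and string inputs to numbers; other types pass through.
        if isinstance(self.df, list):
            # Convert numeric strings with float() so values such as '3.5' work,
            # and skip only None and empty strings so that zeros are kept.
            self.df = [float(n) if isinstance(n, str) else n
                       for n in self.df if n is not None and n != ""]
        elif isinstance(self.df, str):
            self.df = float(self.df)

    def mean(self):
        start = time.process_time()
        self.strlist()
        mean = np.mean(self.df)
        print("Time Taken (s): ", time.process_time() - start)
        return mean

    def median(self):
        start = time.process_time()
        self.strlist()
        # Note: np.median flattens a DataFrame, so this returns one median
        # over all values rather than one per column.
        median = np.median(self.df)
        print("Time Taken (s): ", time.process_time() - start)
        return median

    def numsum(self):
        start = time.process_time()
        self.strlist()
        totsum = np.sum(self.df)
        print("Time Taken (s): ", time.process_time() - start)
        return totsum

    def std(self):
        start = time.process_time()
        self.strlist()
        # np.std defaults to ddof=0 (population standard deviation)
        std = np.std(self.df)
        print("Time Taken (s): ", time.process_time() - start)
        return std

    def firstn(self, n):
        # Return the first n items (rows, for a DataFrame)
        start = time.process_time()
        self.strlist()
        firstslice = self.df[:n]
        print("Time Taken (s): ", time.process_time() - start)
        return firstslice

    def lastn(self, n):
        # Return the last n items; max() avoids a negative start when n > len
        start = time.process_time()
        self.strlist()
        lastslice = self.df[max(len(self.df) - n, 0):]
        print("Time Taken (s): ", time.process_time() - start)
        return lastslice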

## Test 1: Dataframe

In [3]:
dataset_len = 1000000
dlen = int(dataset_len/2)
X_11 = pd.Series(np.random.normal(1,1,dlen))
X_12 = pd.Series(np.random.normal(10,2,dlen))
X_1 = pd.concat([X_11, X_12]).reset_index(drop=True)
X_21 = pd.Series(np.random.normal(2,1,dlen))
X_22 = pd.Series(np.random.normal(9,2,dlen))
X_2 = pd.concat([X_21, X_22]).reset_index(drop=True)
Y = pd.Series(np.repeat([0,1],dlen))
df = pd.concat([X_1, X_2, Y], axis=1)
df.columns = ['X1', 'X2', 'Y']
len(df)

1000000

In [4]:
stats = DataClassStat(df)
stats.df.head()

|   | X1 | X2 | Y |
|---|---|---|---|
| 0 | 1.924025 | 3.324965 | 0 |
| 1 | 1.259442 | 0.768392 | 0 |
| 2 | 1.20298 | 1.925732 | 0 |
| 3 | 1.242305 | 1.119963 | 0 |
| 4 | 2.073625 | 1.427494 | 0 |


In [5]:
stats.mean()

Time Taken (s):  0.015625


X1    5.498451
X2    5.502959
Y     0.500000
dtype: float64
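
Every timing printed in this notebook is either 0.0 or a multiple of 0.015625 s (1/64 s), which suggests that the clock behind `time.process_time()` ticks too coarsely on this machine to resolve short calls. A minimal sketch of an alternative, assuming wall-clock time is an acceptable measure, uses `time.perf_counter()` inside a hypothetical `timed` decorator, which would also remove the repeated timing code from each method.

```python
import functools
import time


def timed(func):
    """Hypothetical decorator: print the wall-clock duration of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print("Time Taken (s): ", time.perf_counter() - start)
        return result
    return wrapper


@timed
def mean(values):
    # Plain-Python mean, used here only to have something to time
    return sum(values) / len(values)


result = mean([3, 7, 3, 6])
print(result)
```

Note that `perf_counter` measures elapsed time including time spent waiting, whereas `process_time` counts only CPU time, so the two are not interchangeable in every setting.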

In [6]:
stats.median()

Time Taken (s):  0.046875


1.4995983672753734
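
Unlike the mean, sum, and standard deviation in this test, the median above is a single number. `np.median` converts the DataFrame to a 2-D array and, with no `axis` given, computes one median over all values. A small sketch of the difference, and of two ways to get one median per column, using a made-up three-row frame:

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (not the notebook's data)
small = pd.DataFrame({"X1": [1.0, 2.0, 3.0], "X2": [10.0, 20.0, 30.0]})

flat_median = np.median(small)            # one value over all six numbers
per_column = small.median()               # pandas: one value per column
per_column_np = np.median(small, axis=0)  # NumPy: same values, as an array

print(flat_median)
print(per_column)
print(per_column_np)
```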

In [7]:
stats.numsum()

Time Taken (s):  0.0625


X1    5.498451e+06
X2    5.502959e+06
Y     5.000000e+05
dtype: float64

In [8]:
stats.std()

Time Taken (s):  0.03125


X1    4.771697
X2    3.841553
Y     0.500000
dtype: float64

In [9]:
stats.firstn(10)

Time Taken (s):  0.0


|   | X1 | X2 | Y |
|---|---|---|---|
| 0 | 1.924025 | 3.324965 | 0 |
| 1 | 1.259442 | 0.768392 | 0 |
| 2 | 1.20298 | 1.925732 | 0 |
| 3 | 1.242305 | 1.119963 | 0 |
| 4 | 2.073625 | 1.427494 | 0 |
| 5 | -0.283728 | 0.817497 | 0 |
| 6 | -0.234397 | 2.796129 | 0 |
| 7 | 2.424248 | 1.848596 | 0 |
| 8 | 0.024348 | -0.196195 | 0 |
| 9 | -0.397433 | 2.166044 | 0 |


In [10]:
stats.lastn(3)

Time Taken (s):  0.0


|   | X1 | X2 | Y |
|---|---|---|---|
| 999997 | 11.503971 | 9.911141 | 1 |
| 999998 | 10.473004 | 11.719938 | 1 |
| 999999 | 11.879149 | 9.960469 | 1 |


## Test 2: Checking data is configurable

In [11]:
dataset_len = 200
dlen = int(dataset_len/2)
X_11 = pd.Series(np.random.normal(7,6,dlen))
X_12 = pd.Series(np.random.normal(10,2,dlen))
X_1 = pd.concat([X_11, X_12]).reset_index(drop=True)
X_21 = pd.Series(np.random.normal(12,1,dlen))
X_22 = pd.Series(np.random.normal(9,2,dlen))
X_2 = pd.concat([X_21, X_22]).reset_index(drop=True)
Y = pd.Series(np.repeat([0,1],dlen))
df2 = pd.concat([X_1, X_2, Y], axis=1)
df2.columns = ['X1', 'X2', 'Y']
len(df2)

200

In [12]:
DataClassStat(df2).df.head()

|   | X1 | X2 | Y |
|---|---|---|---|
| 0 | 6.820354 | 15.022523 | 0 |
| 1 | 13.602317 | 11.280803 | 0 |
| 2 | 11.560052 | 12.62642 | 0 |
| 3 | 6.109073 | 12.539427 | 0 |
| 4 | 9.258931 | 12.895588 | 0 |


In [13]:
DataClassStat(df2).mean()

Time Taken (s):  0.0


X1     8.814938
X2    10.469681
Y      0.500000
dtype: float64

## Test 3: Lists 

In [14]:
DataClassStat([3, 7, 3, 6]).df

[3, 7, 3, 6]

In [15]:
listinp = DataClassStat([3, 7, 3, 6])
listinp.mean()

Time Taken (s):  0.0


4.75

In [16]:
listinp.firstn(2)

Time Taken (s):  0.0


[3, 7]

In [17]:
listinp.lastn(2)

Time Taken (s):  0.0


[3, 6]

## Test 4: Dealing with strings

In [18]:
listinp = DataClassStat(['3', '7', '3', '6'])
listinp.mean()

Time Taken (s):  0.0


4.75

In [19]:
strinp = DataClassStat('3')
strinp.mean()

Time Taken (s):  0.0


3.0
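
The string tests above use only clean numeric strings. For input that may contain non-numeric entries, one possible approach (an alternative, not what `DataClassStat` does) is `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into `NaN`. The sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical messy input: 'x' cannot be parsed as a number
raw = pd.Series(["3", "7", "x", "6"])

# errors="coerce" replaces unparseable entries with NaN instead of raising
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned)
print(cleaned.mean())  # Series.mean skips NaN by default
```

Whether silently skipping bad entries is acceptable depends on the use case; raising an error (the default behaviour of `pd.to_numeric`) may be the safer choice.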

## Test 5: Arrays

In [20]:
arrinp = DataClassStat(np.array([2, 4, 1, 8]))
arrinp.mean()

Time Taken (s):  0.0


3.75

In [21]:
arrinp.firstn(2)

Time Taken (s):  0.0


array([2, 4])

In [22]:
arrinp.std()

Time Taken (s):  0.0


2.680951323690902
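
The value above is the population standard deviation: `np.std` divides by N by default (`ddof=0`). For `[2, 4, 1, 8]` the mean is 3.75, the squared deviations sum to 28.75, and sqrt(28.75 / 4) ≈ 2.681. The sample standard deviation divides by N - 1 instead, which is also the default of pandas' `.std()`. A short sketch comparing the two:

```python
import numpy as np

data = np.array([2, 4, 1, 8])

population_std = np.std(data)      # divides by N (ddof=0), as in the cell above
sample_std = np.std(data, ddof=1)  # divides by N - 1

print(population_std)
print(sample_std)
```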