## **Aggregate time-series dataframe**

The function performs a rolling aggregation on `df_artifact` over `window`, by the selected `keys`,
applying `metric_aggs` to `metrics` and `label_aggs` to `labels`.<br>
It adds `suffix` to the resulting feature names.
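
To make the description concrete, here is a minimal pandas sketch of the same idea. The helper name `rolling_aggregate`, the toy data, and the `<column>_<agg><suffix>` naming scheme are assumptions for illustration; this is not the hub function's actual implementation, and grouping by `keys` is left out for brevity.

```python
import pandas as pd


def rolling_aggregate(df, metrics, labels, metric_aggs, label_aggs,
                      window, center=False, suffix=""):
    """Hypothetical helper: append rolling aggregates to the original columns."""
    parts = [df]
    for columns, aggs in ((metrics, metric_aggs), (labels, label_aggs)):
        if not columns or not aggs:
            continue
        rolled = df[columns].rolling(window=window, center=center).agg(aggs)
        # `agg` with a list yields MultiIndex columns such as ("Temperature", "mean");
        # flatten them into names such as "Temperature_mean" plus the suffix.
        rolled.columns = [f"{col}_{agg}{suffix}" for col, agg in rolled.columns]
        parts.append(rolled)
    return pd.concat(parts, axis=1)


# Toy data, not taken from the occupancy dataset
toy = pd.DataFrame({"Temperature": [20.0, 21.0, 22.0, 23.0],
                    "Occupancy": [0, 1, 1, 1]})
result = rolling_aggregate(toy, ["Temperature"], ["Occupancy"],
                           ["mean"], ["sum"], window=2, suffix="_w2")
print(result)
```

The first row of each aggregated column is `NaN` because a window of 2 rows is not yet full there.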
    
    

### **Steps**

1. [Data exploration](#Data-exploration)
2. [Importing the function](#Importing-the-function)
3. [Running the function locally](#Running-the-function-locally)
4. [Running the function remotely](#Running-the-function-remotely)

### **Data exploration**

This example uses the [Occupancy Detection Data Set, UCI](http://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+),
as used in the article [How to Predict Room Occupancy Based on Environmental Factors](https://machinelearningmastery.com/how-to-predict-room-occupancy-based-on-environmental-factors/).

> **Attribute Information:**<br>
    `date` - timestamp, formatted as year-month-day hour:minute:second<br>
    `Temperature` - in Celsius<br>
    `Humidity` - relative humidity, in %<br>
    `Light` - in Lux<br>
    `CO2` - in ppm<br>
    `HumidityRatio` - derived quantity from temperature and relative humidity, in kg water-vapor/kg air<br>
    `Occupancy` - 0 or 1: 0 for not occupied, 1 for occupied

In [1]:
from mlrun import set_env_from_file
import os.path

env_file = "env_file.env"
if os.path.isfile(env_file):
    set_env_from_file(env_file)

In [2]:
import mlrun

# Set the base project name
project_name_base = 'function-aggregate-example'

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name_base, context="./", user_project=True)

# Display the current project name
project_name = project.metadata.name
print(f'Full project name: {project_name}')

> 2022-09-28 15:53:25,646 [info] loaded project function-aggregate-example from MLRun DB
Full project name: function-aggregate-example-avia


In [3]:
import pandas as pd

data_path = 'https://s3.wasabisys.com/iguazio/data/function-marketplace-data/aggregate/train_room_occupancy.csv'
df = pd.read_csv(data_path).set_index('date',drop=False)
df.head()

date (index),date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
2015-02-04 17:51:00,2015-02-04 17:51:00,23.18,27.272,426.0,721.25,0.004793,1
2015-02-04 17:51:59,2015-02-04 17:51:59,23.15,27.2675,429.5,714.0,0.004783,1
2015-02-04 17:53:00,2015-02-04 17:53:00,23.15,27.245,426.0,713.5,0.004779,1
2015-02-04 17:54:00,2015-02-04 17:54:00,23.15,27.2,426.0,708.25,0.004772,1
2015-02-04 17:55:00,2015-02-04 17:55:00,23.1,27.2,426.0,704.5,0.004757,1
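
Before aggregating, it can help to check how far apart the readings are. The sketch below inlines the five rows shown above (so it runs without network access), parses `date` as a timestamp, and measures the gap between consecutive readings. The names `sample_csv`, `sample`, and `gaps` are illustrative only.

```python
import io

import pandas as pd

# The five rows displayed above, inlined as CSV text
sample_csv = io.StringIO(
    "date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy\n"
    "2015-02-04 17:51:00,23.18,27.272,426.0,721.25,0.004793,1\n"
    "2015-02-04 17:51:59,23.15,27.2675,429.5,714.0,0.004783,1\n"
    "2015-02-04 17:53:00,23.15,27.245,426.0,713.5,0.004779,1\n"
    "2015-02-04 17:54:00,23.15,27.2,426.0,708.25,0.004772,1\n"
    "2015-02-04 17:55:00,23.1,27.2,426.0,704.5,0.004757,1\n"
)
sample = pd.read_csv(sample_csv, parse_dates=["date"]).set_index("date", drop=False)

# Seconds elapsed between consecutive readings
gaps = sample.index.to_series().diff().dropna().dt.total_seconds()
print(gaps.tolist())  # [59.0, 61.0, 60.0, 60.0]
```

In this sample the readings arrive roughly once a minute, so an integer window of 5 rows spans about five minutes, assuming the window is interpreted as a row count.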


### **Importing the function**

In [4]:
fn = mlrun.import_function("hub://aggregate")
fn.apply(mlrun.auto_mount())

<mlrun.runtimes.kubejob.KubejobRuntime at 0x12ee62e20>

In [5]:
import numpy as np

# Declaring a custom aggregation function
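def dist_from_mean(l):
    # Absolute distance between the window's fourth value (index 3) and the window mean.
    # Indexing at 3 assumes each window holds at least four values (true for window=5).
    mean = np.mean(l)
    return abs(list(l)[3] - mean)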
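
To see what this custom aggregation returns for each window, the sketch below applies it with plain pandas to the first seven `Temperature` readings of the dataset. Using `Series.rolling(...).apply` with `raw=True` is an assumption about how such a callable could be evaluated, not a statement about how the hub function applies it; the function is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def dist_from_mean(l):
    # Same logic as the cell above: distance of the fourth value from the window mean
    mean = np.mean(l)
    return abs(list(l)[3] - mean)


# First seven Temperature readings from the dataset
temperature = pd.Series([23.18, 23.15, 23.15, 23.15, 23.1, 23.1, 23.1])

# raw=True passes each window to the function as a NumPy array
rolled = temperature.rolling(window=5, center=True).apply(dist_from_mean, raw=True)
print(rolled.round(3))
```

Only positions 2 to 4 have a full centered window of five values; the two rows at each edge are `NaN`.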

### **Running the function locally**

In [6]:
aggregate_run_local = fn.run(name='aggregate',
                       params = {'metrics': ['Temperature','Humidity'],
                                 'labels': ['Occupancy'],
                                 'metric_aggs': ['mean','std',dist_from_mean],
                                 'label_aggs': ['sum'],
                                 'window': 5,
                                 'center': True},
                       inputs={'df_artifact': data_path},
                       local=True)

> 2022-09-28 15:53:28,353 [info] starting run aggregate uid=922bc0ffe9db4812a216a8a636502300 DB=https://mlrun-api.default-tenant.app.app-lab-v3-5-1.iguazio-cd2.com
> 2022-09-28 15:53:28,918 [info] Aggregating https://s3.wasabisys.com/iguazio/data/function-marketplace-data/aggregate/train_room_occupancy.csv
> 2022-09-28 15:53:31,991 [info] Logging artifact


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
function-aggregate-example-avia,...36502300,0,Sep 28 12:53:28,completed,aggregate,v3io_user=avia; kind=; owner=avia; host=Avis-MBP.iguaz.io,df_artifact,"metrics=['Temperature', 'Humidity']; labels=['Occupancy']; metric_aggs=['mean', 'std', ]; label_aggs=['sum']; window=5; center=True",,aggregate





> 2022-09-28 15:53:35,090 [info] run executed, status=completed


In [7]:
aggregate_run_local.artifact('aggregate').as_df().head()

Unnamed: 0,date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy,Temperature_mean,Humidity_mean,Occupancy_max
2,2015-02-04 17:53:00,23.15,27.245,426.0,713.5,0.004779,1,23.146,27.2369,1.0
3,2015-02-04 17:54:00,23.15,27.2,426.0,708.25,0.004772,1,23.13,27.2225,1.0
4,2015-02-04 17:55:00,23.1,27.2,426.0,704.5,0.004757,1,23.12,27.209,1.0
5,2015-02-04 17:55:59,23.1,27.2,419.0,701.0,0.004757,1,23.11,27.2,1.0
6,2015-02-04 17:57:00,23.1,27.2,419.0,701.666667,0.004757,1,23.1,27.2,1.0
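
The `Temperature_mean` values above can be reproduced with plain pandas, which also shows why the first listed row has index 2. This is a sketch under the assumption that the function uses a row-count window like `DataFrame.rolling`; the comparison with a trailing window is added only for contrast.

```python
import pandas as pd

# First seven Temperature readings from the dataset
temperature = pd.Series([23.18, 23.15, 23.15, 23.15, 23.1, 23.1, 23.1],
                        name="Temperature")

# center=True labels each window at its middle row; the default labels it at its last row
centered = temperature.rolling(window=5, center=True).mean()
trailing = temperature.rolling(window=5).mean()

print(pd.DataFrame({"centered": centered, "trailing": trailing}).round(3))
```

With `center=True`, rows 0 and 1 have no complete window and come out as `NaN`, and row 2 gets 23.146, the value shown in the artifact. That the artifact starts at row 2 suggests rows with incomplete windows were dropped.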


### **Running the function remotely**

In [8]:
aggregate_run_remote = fn.run(name='aggregate',
                       params = {'metrics': ['Temperature','Humidity'],
                                 'labels': ['Occupancy'],
                                 'metric_aggs': ['mean','std'],
                                 'label_aggs': ['sum'],
                                 'window': 5,
                                 'center': True},
                       inputs={'df_artifact': data_path},
                       local=False)

> 2022-09-28 15:53:38,517 [info] starting run aggregate uid=6d9b8d4b14a74517a0b3abc842452afb DB=https://mlrun-api.default-tenant.app.app-lab-v3-5-1.iguazio-cd2.com
> 2022-09-28 15:53:39,048 [info] Job is running in the background, pod: aggregate-2fvm5
> 2022-09-28 12:53:47,367 [info] Aggregating https://s3.wasabisys.com/iguazio/data/function-marketplace-data/aggregate/train_room_occupancy.csv
> 2022-09-28 12:53:47,639 [info] Logging artifact
> 2022-09-28 12:53:47,832 [info] run executed, status=completed
final state: completed


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
function-aggregate-example-avia,...42452afb,0,Sep 28 12:53:47,completed,aggregate,v3io_user=avia; kind=job; owner=avia; mlrun/client_version=1.1.0; host=aggregate-2fvm5,df_artifact,"metrics=['Temperature', 'Humidity']; labels=['Occupancy']; metric_aggs=['mean', 'std']; label_aggs=['sum']; window=5; center=True",,aggregate





> 2022-09-28 15:53:49,102 [info] run executed, status=completed


In [9]:
aggregate_run_remote.artifact('aggregate').as_df().head()

Unnamed: 0,date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy,Temperature_mean,Humidity_mean,Occupancy_max
2,2015-02-04 17:53:00,23.15,27.245,426.0,713.5,0.004779,1,23.146,27.2369,1.0
3,2015-02-04 17:54:00,23.15,27.2,426.0,708.25,0.004772,1,23.13,27.2225,1.0
4,2015-02-04 17:55:00,23.1,27.2,426.0,704.5,0.004757,1,23.12,27.209,1.0
5,2015-02-04 17:55:59,23.1,27.2,419.0,701.0,0.004757,1,23.11,27.2,1.0
6,2015-02-04 17:57:00,23.1,27.2,419.0,701.666667,0.004757,1,23.1,27.2,1.0


[Back to the top](#Aggregate-time-series-dataframe)