## Compressing Data to the IoT Gateway using Autoencoders

Dataset: http://db.csail.mit.edu/labdata/labdata.html

The goal of this project is to reduce the amount of data sent from edge devices to the gateway layer. We use the Intel Lab sensor dataset and compress its readings into a smaller encoded representation from which the original values can be approximately reconstructed. Sending fewer bytes from each sensor to the gateway should free up bandwidth and reduce network latency.

### Data Loading

In [1]:
import gzip
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
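with gzip.open('data.txt.gz', 'rb') as data_bytes:
    # Columns 0 and 1 (date and time) are combined into a single datetime column.
    # `squeeze` is omitted: it only affects single-column results and was removed in pandas 2.0.
    data = pd.read_csv(data_bytes, header=None, sep=' ', parse_dates=[[0, 1]])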
data.columns = ['DATETIME','EPOCH','MOTE_ID','TEMPERATURE','HUMIDITY','LIGHT','VOLTAGE']
data = data.set_index('DATETIME')
data.shape

(2313682, 6)
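
On recent pandas releases, nested `parse_dates` lists are deprecated, so the load above may warn or fail. The following is a minimal sketch of an alternative that reads the date and time as strings and combines them explicitly. The helper name `load_readings` and the intermediate `DATE`/`TIME` column names are assumptions made for illustration, and the EPOCH values in the two sample rows are made up.

```python
import io

import pandas as pd

# Hypothetical column layout: DATE and TIME are temporary names for the first two fields.
COLUMNS = ['DATE', 'TIME', 'EPOCH', 'MOTE_ID',
           'TEMPERATURE', 'HUMIDITY', 'LIGHT', 'VOLTAGE']


def load_readings(source):
    """Read the space-separated sensor log and index it by timestamp."""
    raw = pd.read_csv(source, header=None, sep=' ', names=COLUMNS,
                      dtype={'DATE': str, 'TIME': str})
    # Unparseable or missing timestamps become NaT instead of raising.
    raw['DATETIME'] = pd.to_datetime(raw['DATE'] + ' ' + raw['TIME'], errors='coerce')
    return raw.drop(columns=['DATE', 'TIME']).set_index('DATETIME')


# Two illustrative rows in the same layout as data.txt (EPOCH values are invented).
sample = io.StringIO(
    "2004-03-01 00:01:57.130850 3 1 18.4498 43.1191 43.24 2.67532\n"
    "2004-03-01 00:02:50.458234 5 1 18.44 43.0858 43.24 2.66332\n"
)
demo = load_readings(sample)
print(demo.shape)
```

Passing the path `'data.txt.gz'` instead of the in-memory sample should also work, since `read_csv` infers gzip compression from the file extension.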

### Data Pre-processing

We consider sensor readings from March 1 through March 10, 2004, which are later resampled into 5-minute bins. The EPOCH column is dropped because it is a per-mote sequence number and carries no information about the measurements themselves.

In [3]:
data_samp = data.drop('EPOCH', axis=1)
data_samp = data_samp.loc['2004-03-01':'2004-03-10']
data_samp.head()

| DATETIME | MOTE_ID | TEMPERATURE | HUMIDITY | LIGHT | VOLTAGE |
|---|---|---|---|---|---|
| 2004-03-01 00:01:57.130850 | 1.0 | 18.4498 | 43.1191 | 43.24 | 2.67532 |
| 2004-03-01 00:02:50.458234 | 1.0 | 18.44 | 43.0858 | 43.24 | 2.66332 |
| 2004-03-01 00:04:26.606602 | 1.0 | 18.44 | 43.1191 | 43.24 | 2.65143 |
| 2004-03-01 00:05:28.379208 | 1.0 | 18.4498 | 43.0524 | 43.24 | 2.65143 |
| 2004-03-01 00:05:50.456126 | 1.0 | 18.4302 | 43.1525 | 43.24 | 2.66332 |


For the sake of our experiment, let us only consider sensors 1-10. We will drop rows where MOTE_ID (the sensor ID) is NA, and cast MOTE_ID to an integer.

In [4]:
data_samp.dropna(subset=['MOTE_ID'], inplace=True)
data_samp.MOTE_ID = data_samp.MOTE_ID.astype(int)

data_samp = data_samp[(data_samp.MOTE_ID >= 1) & (data_samp.MOTE_ID <= 10)].copy()
print('Sensor_ID - Min: {}, Max: {}'.format(data_samp.MOTE_ID.min(), data_samp.MOTE_ID.max()))
data_samp.shape

Sensor_ID - Min: 1, Max: 10


(154618, 5)

For the purposes of a proof of concept, we make this a univariate problem by keeping only TEMPERATURE, alongside the DATETIME index and MOTE_ID.

In [37]:
data_samp.drop(['HUMIDITY', 'LIGHT', 
                'VOLTAGE'], axis=1, inplace=True)

Next, we pivot the dataframe so that each sensor (MOTE_ID) gets its own column. This layout is more representative of inbound samples.

In [38]:
sensor_df = data_samp.set_index('MOTE_ID', append=True).unstack()
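
To make the reshaping step concrete, here is a small sketch of the same `set_index(..., append=True).unstack()` pattern on three made-up readings. The names `long_df` and `wide_df` are illustrative only.

```python
import pandas as pd

# Three made-up readings from two motes, in the same long layout as data_samp.
long_df = pd.DataFrame(
    {'MOTE_ID': [1, 2, 1], 'TEMPERATURE': [18.44, 18.87, 18.45]},
    index=pd.to_datetime(['2004-03-01 00:00:10',
                          '2004-03-01 00:00:20',
                          '2004-03-01 00:01:05']),
)
long_df.index.name = 'DATETIME'

# Move MOTE_ID into the index, then pivot it out into the columns.
wide_df = long_df.set_index('MOTE_ID', append=True).unstack()

# Each timestamp now has one column per mote; motes that did not report are NaN.
print(wide_df)
```

Note that `unstack()` requires each (DATETIME, MOTE_ID) pair to be unique; duplicate pairs would raise an error.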

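Before resampling, here is the wide dataframe. It is mostly empty because each timestamp carries a reading from a single sensor. We then resample along the DATETIME index into 5-minute bins, taking the mean per sensor.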

In [39]:
sensor_df

TEMPERATURE by MOTE_ID:

| DATETIME | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2004-03-01 00:00:21.445722 | | | | | | | | | 18.489 | |
| 2004-03-01 00:00:22.429139 | | 18.8712 | | | | | | | | |
| 2004-03-01 00:00:25.633782 | | | | | | | 18.7144 | | | |
| 2004-03-01 00:00:52.381230 | | 18.8614 | | | | | | | | |
| 2004-03-01 00:00:53.317719 | | | | | | | 18.7046 | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2004-03-10 09:06:02.532096 | | | | | | | | 25.5058 | | |
| 2004-03-10 09:06:02.659997 | | | 23.6536 | | | | | | | |
| 2004-03-10 09:06:07.690233 | | | | | | | 23.9280 | | | |
| 2004-03-10 09:06:08.303196 | | 23.1636 | | | | | | | | |


In [40]:
sensor_df = sensor_df.resample('5min').mean()
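
As a small illustration of what `resample('5min').mean()` does to irregular readings, consider the sketch below. The column names `mote_1` and `mote_2` and all values are made up.

```python
import numpy as np
import pandas as pd

# Irregular timestamps: two readings fall in the 00:00 bin, none in 00:05, one in 00:10.
idx = pd.to_datetime(['2004-03-01 00:00:30',
                      '2004-03-01 00:03:00',
                      '2004-03-01 00:12:00'])
wide = pd.DataFrame({'mote_1': [18.0, 19.0, 20.0],
                     'mote_2': [np.nan, 21.0, np.nan]}, index=idx)

# Each 5-minute bin holds the mean of the readings it contains, ignoring NaN.
# Bins with no readings for a mote come out as NaN.
binned = wide.resample('5min').mean()
print(binned)
```

Empty bins are what the missing-value check counts afterwards.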

Checking for empty data values...

In [41]:
sensor_df.isna().sum()

             MOTE_ID
TEMPERATURE  1             0
             2            24
             3             0
             4             5
             5          2702
             6            16
             7             0
             8            20
             9             1
             10            2
dtype: int64
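
The counts above can also be read as a share of missing bins per sensor, which gives a simple rule for deciding which sensors to discard. The sketch below uses made-up data, and the 50% threshold `MAX_MISSING` is a hypothetical choice, not one taken from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up wide frame: mote_4 misses one bin, mote_5 misses all of them.
wide = pd.DataFrame({'mote_4': [19.1, np.nan, 19.3, 19.4],
                     'mote_5': [np.nan, np.nan, np.nan, np.nan]})

# The mean of the boolean NaN mask is the fraction of missing bins per column.
missing_share = wide.isna().mean()

MAX_MISSING = 0.5  # hypothetical threshold for this illustration
kept = wide.loc[:, missing_share <= MAX_MISSING]
print(missing_share)
print(list(kept.columns))
```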

We drop sensor 5 (MOTE_ID 5), since its 2702 missing bins cover the entire time window, and stack the remaining sensors back into long format.

In [42]:
sensor_df = sensor_df.stack().drop(5, level='MOTE_ID')

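The remaining sensors have only a few missing bins, which linear interpolation can fill in. Note, however, that in the pandas version used here `stack()` drops NaN rows by default, so the stacked frame appears to contain no gaps by this point: the count of 24250 in the summary matches 9 sensors × 2702 bins − 68 missing values. Filling gaps per sensor would require interpolating the wide frame before stacking.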

In [43]:
sensor_df = sensor_df.interpolate(method='linear', limit_direction='both', axis=0)
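
For comparison, here is a minimal sketch of interpolating in wide form, one column per sensor, before stacking, so that each gap is filled from the same sensor's neighboring time bins. The data and the names `wide`, `filled`, and `long_filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up wide frame: one column per mote, one row per 5-minute bin.
idx = pd.date_range('2004-03-01 00:00', periods=4, freq='5min')
wide = pd.DataFrame({1: [18.0, np.nan, 20.0, 21.0],
                     2: [19.0, 19.5, np.nan, np.nan]}, index=idx)
wide.index.name = 'DATETIME'
wide.columns.name = 'MOTE_ID'

# axis=0 interpolates down each column, i.e. along time within a single sensor.
# limit_direction='both' also fills gaps at the start or end of a column.
filled = wide.interpolate(method='linear', limit_direction='both', axis=0)

# Stack only after the gaps are filled, so no (DATETIME, MOTE_ID) rows are lost.
long_filled = filled.stack()
print(long_filled)
```

With this ordering, every sensor keeps one row per time bin in the long-format result.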

In [44]:
sensor_df.describe()

| | TEMPERATURE |
|---|---|
| count | 24250.0 |
| mean | 21.934324 |
| std | 2.112951 |
| min | 15.70825 |
| 25% | 20.518927 |
| 50% | 21.876942 |
| 75% | 23.308518 |
| max | 28.6418 |


The result is a long-format dataframe indexed by DATETIME and MOTE_ID:

In [45]:
sensor_df

| DATETIME | MOTE_ID | TEMPERATURE |
|---|---|---|
| 2004-03-01 00:00:00 | 1 | 18.443267 |
| 2004-03-01 00:00:00 | 2 | 18.850375 |
| 2004-03-01 00:00:00 | 3 | 18.751640 |
| 2004-03-01 00:00:00 | 4 | 19.109200 |
| 2004-03-01 00:00:00 | 6 | 18.669075 |
| ... | ... | ... |
| 2004-03-10 09:05:00 | 6 | 24.138700 |
| 2004-03-10 09:05:00 | 7 | 23.937800 |
| 2004-03-10 09:05:00 | 8 | 25.515600 |
| 2004-03-10 09:05:00 | 9 | 26.250600 |


In [46]:
sensor_df = sensor_df.reset_index(level=1)

In [47]:
sensor_df.head()

| DATETIME | MOTE_ID | TEMPERATURE |
|---|---|---|
| 2004-03-01 | 1 | 18.443267 |
| 2004-03-01 | 2 | 18.850375 |
| 2004-03-01 | 3 | 18.75164 |
| 2004-03-01 | 4 | 19.1092 |
| 2004-03-01 | 6 | 18.669075 |


As a sample, let us show timeseries plots for Temperature over our time window:
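
One way such a plot could be produced is sketched below, with one line per sensor. The frame `demo` is a synthetic stand-in with the same layout as `sensor_df` (DATETIME index, MOTE_ID and TEMPERATURE columns); its sine-wave temperatures are invented and do not come from the dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for sensor_df: one day of 5-minute bins for two motes.
idx = pd.date_range('2004-03-01', periods=288, freq='5min')
daily_cycle = 2 * np.sin(np.linspace(0, 2 * np.pi, len(idx)))
demo = pd.DataFrame(
    {'MOTE_ID': np.repeat([1, 2], len(idx)),
     'TEMPERATURE': np.concatenate([20 + daily_cycle, 21 + daily_cycle])},
    index=idx.append(idx),
)
demo.index.name = 'DATETIME'

# Draw one temperature line per mote on a shared time axis.
fig, ax = plt.subplots(figsize=(10, 4))
for mote_id, group in demo.groupby('MOTE_ID'):
    ax.plot(group.index, group['TEMPERATURE'], label='Mote {}'.format(mote_id))
ax.set_xlabel('Time')
ax.set_ylabel('Temperature')
ax.legend()
fig.tight_layout()
```

Replacing `demo` with `sensor_df` should give the equivalent plot for the real readings.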

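The draft cell below outlines the autoencoder: four inputs are squeezed through a single-unit bottleneck and then reconstructed.

In [None]:
# Assumption: Keras is imported via TensorFlow; the original cell does not show its imports.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

input_layer = Input(shape=(4,))
# "encoded" is the encoded representation of the input
encoded = Dense(1, activation='relu')(input_layer)
# "decoded" is the lossy reconstruction of the input;
# the sigmoid output assumes inputs scaled to [0, 1]
decoded = Dense(4, activation='sigmoid')(encoded)

# this model maps an input to its reconstruction
autoencoder = Model(input_layer, decoded)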

In [None]:
# autoencoder.summary()
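
To relate the 4-to-1 bottleneck in the draft model to the stated goal, the sketch below gives a rough estimate of the payload saving. It assumes each value is sent as a 32-bit float, and it ignores protocol overhead and reconstruction error. The readings and the encoded value are made up.

```python
import struct

N_INPUTS = 4   # width of the input layer in the draft model
N_ENCODED = 1  # width of the bottleneck layer

# Made-up readings and a made-up encoded value, packed as little-endian float32.
readings = [18.44, 18.87, 18.75, 19.11]
raw_payload = struct.pack('<{}f'.format(N_INPUTS), *readings)
encoded_payload = struct.pack('<{}f'.format(N_ENCODED), 0.5)

# 16 bytes of raw readings versus 4 bytes for the encoded value.
ratio = len(raw_payload) / len(encoded_payload)
print(len(raw_payload), len(encoded_payload), ratio)
```

Under these assumptions, the gateway would receive a quarter of the bytes per sample, at the cost of a lossy reconstruction.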