### Background
This notebook prepares a small public dataset for the _Soft Sensor_ use case: predicting (or simulating) target sensors from the other available sensors.

The original dataset can be found <a href="https://data.mendeley.com/datasets/kcpnnrn67p/1">here</a>. It is small: 5 features, 2 targets, and 14,401 data points in total.
It appears to originate from <a href="https://folk.ntnu.no/skoge/prost/proceedings/ifac2002/data/content/01320/1320.pdf">this paper</a>, in which the authors explain what the dataset represents.

In this notebook we do the following:


<ul>
  <li>Download the original data (IN & OUT tables).</li>
  <li>Transform the dataset by introducing noise and missingness. The generated version:</li>
      <ol>
        <li>includes 3 additional noisy features, which a feature selection process should ideally eliminate, or a feature ranking process should rank as least significant;</li>
        <li>has missing values introduced at random, totaling about 0.7% of the entire dataset (as the masking cell is written, values are sampled from the targets as well as the features).</li>
      </ol>
</ul>


In [1]:
# pip install pyarrow

### Helper Functions
To keep the code cleaner, the reusable code is moved to a folder named _helpers_; it includes useful functions for data reports, blob storage, and working with EDDI.

In this notebook, we only use the data manipulation functions, which are in the _dataprep_utils.py_ file.

In [2]:
# the helper scripts live in a folder named "helpers"
import sys
sys.path.append('helpers')

# helpers for manipulating a dataframe or building a report from it
from helpers.dataprep_utils import long_to_wide, wide_to_long, compute_missing_ratio #, plot_error

Import other necessary packages.

In [3]:
import requests
import pandas as pd
import numpy as np
import random
import os

Create the placeholder data directories.

In [4]:
dir_names = [
    "data",
    "data/original",   # the original data is downloaded into this directory
    "data/generated",  # the generated data (masked, with the noisy features) goes here
    "data/prepared",   # the prepared data (missing values imputed) goes here
]

for dir_name in dir_names:
    # exist_ok avoids an error when the directory already exists
    os.makedirs(dir_name, exist_ok=True)

### Original Data (Ground truth)

Download the original datasets and save them locally. We use a public dataset available from [Mendeley Data](https://data.mendeley.com/datasets/kcpnnrn67p/1), obtained from a chemical processing unit. The data and the underlying process are described in [this paper](https://folk.ntnu.no/skoge/prost/proceedings/ifac2002/data/content/01320/1320.pdf).

In [5]:
# data links
in_table_url = 'https://data.mendeley.com/public-files/datasets/kcpnnrn67p/files/f132a70d-11b8-43d1-aaef-facbb6e667ac/file_downloaded'
out_table_url = 'https://data.mendeley.com/public-files/datasets/kcpnnrn67p/files/e62b3119-5435-4810-a96f-e40be1904777/file_downloaded'

# download & save; fail early on an HTTP error instead of writing an error page to disk
response = requests.get(in_table_url)
response.raise_for_status()
with open('./data/original/IN_Table.csv', 'wb') as f:
    f.write(response.content)

response = requests.get(out_table_url)
response.raise_for_status()
with open('./data/original/OUT_Table.csv', 'wb') as f:
    f.write(response.content)

Read the dataset containing input sensor measurements.

In [6]:
df_in = pd.read_csv('./data/original/IN_Table.csv')
df_in.head(3)

Unnamed: 0,IN1,IN2,IN3,IN4,IN5
0,0.077744,0.795565,-0.665503,0.879321,0.134419
1,0.080313,0.824595,-0.655447,0.875636,0.134941
2,0.087355,0.776258,-0.65055,0.884105,0.132452


Read the dataset containing output sensor measurements.

In [7]:
df_out = pd.read_csv('./data/original/OUT_Table.csv')
df_out.head(3)

Unnamed: 0,Out1,Out2
0,-0.122686,0.123661
1,-0.122686,0.123661
2,-0.026857,0.123661


Check for missing values in the input and output sensor data.

In [8]:
print(compute_missing_ratio(df_in))
print(compute_missing_ratio(df_out))

Unnamed: 0,Missing Ratio


None


Unnamed: 0,Missing Ratio


None
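
`compute_missing_ratio` lives in the helper module and is not shown in this notebook. Judging from its outputs further down, it appears to report the percentage of missing values per column, largest first, and to omit columns with nothing missing, which would explain the empty tables above. A minimal sketch of such a function, under those assumptions, could look like this; the name `missing_ratio_sketch` and the toy frame are hypothetical, and the real helper may differ in its return type and details.

```python
import numpy as np
import pandas as pd


def missing_ratio_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Percentage of missing values per column, largest first.

    Hypothetical stand-in for compute_missing_ratio; the real helper
    is not shown in the notebook and may behave differently.
    """
    percent = frame.isna().mean() * 100          # share of NaN per column, in percent
    percent = percent[percent > 0]               # keep only columns with missing values
    percent = percent.sort_values(ascending=False)
    return percent.to_frame(name="Missing Ratio")


# toy data: column "a" has 1 missing value out of 4, column "b" has none
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
report = missing_ratio_sketch(demo)
```

With this sketch, a frame without any missing values yields an empty report.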


Concatenate into one dataframe, so we can load it as one dataset in future experiments:

In [9]:
df = pd.concat([df_in, df_out], axis=1)
# already sorted, but just in case
df.sort_index(inplace=True)

Clean the column names by stripping the extra leading and trailing spaces.

In [11]:
# the column names contain extra spaces; strip them
df.columns = [col.strip() for col in df.columns]
# view
df.head(3)

df.to_csv('./data/original/all.csv')


Keep the train and test datasets separate. Assuming the downstream task takes the first 70% of the data as the training set and the rest as the test set, we apply the same split here and store both parts as the ground truth:

In [12]:
# index label at the 70% mark; rows before it form the training set
splitpoint = df.index[int(df.shape[0] * 0.7)]
df_train = df[df.index < splitpoint]
df_test = df[df.index >= splitpoint]

df_train.to_csv('./data/original/train.csv')
df_test.to_csv('./data/original/test.csv')
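
As a sanity check on the split sizes, the same chronological 70/30 cut can be reproduced on a stand-in frame with the same number of rows as the dataset (14,401). This is a sketch using positional slicing rather than the label comparison above; the frame `toy` and its column `x` are placeholders, not the real data.

```python
import pandas as pd

# stand-in frame with the same row count as the dataset (placeholder values)
toy = pd.DataFrame({"x": range(14401)})

# position of the first test row: 70% of the rows, rounded down
cut = int(len(toy) * 0.7)

# chronological split: no shuffling, the test rows come strictly after the train rows
train = toy.iloc[:cut]
test = toy.iloc[cut:]

print(len(train), len(test))
```

Because the index is a sorted integer range, slicing by position and comparing index labels against `splitpoint` select the same rows.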

### Data Generation [Artificial Noise and Missingness]

- Add noisy columns, i.e., features

Create 3 new input sensor measurements by adding Gaussian noise (mean 0, standard deviation 0.5) to actual measurements. These will later be used to simulate a simple feature selection, where ideally the fabricated input sensors won’t be selected as features.

In [13]:
# one noise draw per row; len(df) avoids hard-coding the row count
df['IN6'] = df['IN3'] + np.random.normal(0, 0.5, len(df))
df['IN7'] = df['IN4'] + np.random.normal(0, 0.5, len(df))
df['IN8'] = df['IN5'] + np.random.normal(0, 0.5, len(df))

df.head(3)

Unnamed: 0,IN1,IN2,IN3,IN4,IN5,Out1,Out2,IN6,IN7,IN8
0,0.077744,0.795565,-0.665503,0.879321,0.134419,-0.122686,0.123661,-0.823205,-0.022336,-0.033717
1,0.080313,0.824595,-0.655447,0.875636,0.134941,-0.122686,0.123661,-0.963025,1.161416,-0.786532
2,0.087355,0.776258,-0.65055,0.884105,0.132452,-0.026857,0.123661,-0.794644,0.35607,-0.096201
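
The noise above is drawn without a fixed seed, so the generated columns change on every run. If reproducibility matters, one option is a seeded NumPy generator. The sketch below assumes an arbitrary seed (42) and uses a synthetic sine signal in place of the real `IN3` column; neither comes from the original notebook.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for a real sensor column (not the actual IN3 data)
base = pd.DataFrame({"IN3": np.sin(np.linspace(0, 20, 2000))})

# seeded generator: the same seed reproduces the same noise (seed is arbitrary)
rng = np.random.default_rng(42)
base["IN6"] = base["IN3"] + rng.normal(0, 0.5, len(base))

# re-creating the generator with the same seed gives an identical noisy column
rng_again = np.random.default_rng(42)
repeat = base["IN3"] + rng_again.normal(0, 0.5, len(base))

# the noisy copy stays correlated with its source, but imperfectly
corr = base["IN3"].corr(base["IN6"])
```

A noisy copy like this carries no information beyond its source column, which is why a feature selection step would ideally drop it.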


Convert the dataframe to long format, which is more common in sensor data collection.

In [13]:
# wide to long; the assumption is that the index becomes the "time" column in the long version
df1 = wide_to_long(df, index_colname='time', tag_name='sensor', value_name='value')
df1.head()

Unnamed: 0,time,sensor,value
0,0,IN1,0.077744
1,1,IN1,0.080313
2,2,IN1,0.087355
3,3,IN1,0.091774
4,4,IN1,0.091166
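
The `wide_to_long` helper itself is not shown here. A conversion with the same output shape (one row per time step and sensor) can be sketched with `pandas.melt`; the function name `wide_to_long_sketch` and the toy frame are hypothetical, and the real helper may be implemented differently.

```python
import pandas as pd


def wide_to_long_sketch(frame, index_colname="time", tag_name="sensor", value_name="value"):
    """Hypothetical stand-in for wide_to_long, built on pandas.melt."""
    # turn the index into a regular column named index_colname
    out = frame.rename_axis(index_colname).reset_index()
    # stack every sensor column into (time, sensor, value) rows
    return out.melt(id_vars=index_colname, var_name=tag_name, value_name=value_name)


# toy wide frame: two sensors, two time steps
wide = pd.DataFrame({"IN1": [0.1, 0.2], "Out1": [1.0, 2.0]})
long = wide_to_long_sketch(wide)
```

Each sensor column contributes one block of rows, so a wide frame with 14,401 rows and 10 sensors becomes 144,010 long rows.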


- Add random-missingness to the data

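Here we artificially insert missing values at random in the data, so that a simple missing-value imputation strategy can be exercised later during data preparation. We introduce 1,000 missing measurements. Note that, as written, the cell below samples from all sensors, because the filter that would exclude the targets (`Out1`, `Out2`) is commented out.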

In [14]:
col_nan = [c for c in df1['sensor'].unique()] # if c not in ['Out1', 'Out2']]
idx_nan = random.sample(list(df1[df1['sensor'].isin(col_nan)].index), 1000)

df1.loc[df1.index.isin(idx_nan), 'value'] = np.nan
df1.head(3)

Unnamed: 0,time,sensor,value
0,0,IN1,0.077744
1,1,IN1,0.080313
2,2,IN1,0.087355


In [15]:
compute_missing_ratio(df1)

Unnamed: 0,Missing Ratio
value,0.694396
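
The reported ratio is consistent with 1,000 masked values out of 14,401 × 10 = 144,010 long-format rows, i.e. roughly 0.694%. If the intent is to mask only the input features and leave the targets intact, the sampling can be restricted before drawing. The sketch below shows one way to do that on a toy long-format frame with a seeded generator; the seed, the toy data, and the sample size of 4 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# expected percentage for the notebook's numbers: 1,000 of 14,401 * 10 values
expected_percent = 100 * 1000 / (14401 * 10)

# toy long-format frame: two features and one target, five time steps each
long = pd.DataFrame({
    "time": list(range(5)) * 3,
    "sensor": ["IN1"] * 5 + ["IN2"] * 5 + ["Out1"] * 5,
    "value": np.arange(15, dtype=float),
})

targets = ["Out1"]
# only rows that belong to non-target sensors are eligible for masking
candidates = long.index[~long["sensor"].isin(targets)].to_numpy()

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
masked_idx = rng.choice(candidates, size=4, replace=False)
long.loc[masked_idx, "value"] = np.nan
```

Sampling without replacement guarantees that exactly the requested number of values is masked.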


Save the generated dataset as a CSV file. The dataset has 8 input variables, 3 of them artificially generated, and 2 output variables. About 0.7% of its values are missing at random.

In [16]:
df1.to_csv('./data/generated/raw_sensor_data.csv')

Convert to the wide version.

In [17]:
df_wide = long_to_wide(df1, tag_colname='sensor', time_colname='time', val_colname='value')
df_wide.head(3)

time,IN1,IN2,IN3,IN4,IN5,IN6,IN7,IN8,Out1,Out2
0,0.077744,0.795565,-0.665503,0.879321,0.134419,-0.889633,1.6206,-0.158693,-0.122686,0.123661
1,0.080313,0.824595,-0.655447,0.875636,0.134941,-0.429724,0.742982,0.324297,-0.122686,0.123661
2,0.087355,0.776258,-0.65055,0.884105,0.132452,-1.424105,1.241555,-0.055664,-0.026857,0.123661


In [18]:
df_wide.shape

(14401, 10)
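
The reverse conversion can be sketched with `DataFrame.pivot`, which yields one row per time step and one column per sensor. As with the other helpers, `long_to_wide` is not shown in this notebook, so the function below (`long_to_wide_sketch`) is a hypothetical stand-in and may differ from the real implementation.

```python
import pandas as pd


def long_to_wide_sketch(frame, tag_colname="sensor", time_colname="time", val_colname="value"):
    """Hypothetical stand-in for long_to_wide, built on DataFrame.pivot."""
    # one row per time step, one column per sensor
    wide = frame.pivot(index=time_colname, columns=tag_colname, values=val_colname)
    wide.columns.name = None  # drop the leftover "sensor" label on the column axis
    return wide


# toy long frame: two sensors, two time steps
long = pd.DataFrame({
    "time": [0, 1, 0, 1],
    "sensor": ["IN1", "IN1", "Out1", "Out1"],
    "value": [0.1, 0.2, 1.0, 2.0],
})
wide = long_to_wide_sketch(long)
```

`pivot` raises an error when a (time, sensor) pair appears more than once, which is a useful guard for sensor data that is expected to have one reading per time step.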

In [19]:
compute_missing_ratio(df_wide)

Unnamed: 0,Missing Ratio
IN3,0.854107
Out2,0.77078
IN2,0.743004
IN8,0.729116
IN1,0.687452
IN7,0.645788
IN6,0.645788
Out1,0.631901
IN5,0.624957
IN4,0.611069


In [20]:
df_wide.to_csv('./data/generated/sensor_wide.csv')

In [None]:
# df_train = df_wide[df_wide.index < splitpoint]
# df_test = df_wide[df_wide.index >= splitpoint]

# df_train.to_csv('./data/generated/train.csv', index=False, header=False)
# df_test.to_csv('./data/generated/test.csv', index=False, header=False)