After loading data from Timber, it has to be preprocessed: we have to decide when the source was stable, when voltage breakdowns occurred, and so on. In practice these tasks only have to be done once per raw data file, which is why this notebook exists. You specify a raw data file, and the notebook exports a labeled version together with visualizations of the performed steps, so that a visual quality check can be done.

In [1]:
%run ../ionsrcopt/import_notebooks/Setup.ipynb

In [2]:
%run ../ionsrcopt/import_notebooks/Preprocessing.ipynb

First we need to read the data into a dataframe that we will manipulate and save afterwards. We will not do any column preselection at this point.

In [100]:
input_file = '../Data_Raw/Nov2018.csv'
previous_month_file = '../Data_Raw/Oct2018.csv'
output_file = '../Data_Preprocessed/Nov2018.csv'

In [101]:
df = read_data_from_csv(input_file, None, None)
df = fill_columns(df, previous_month_file, fill_nan_with_zeros=True)
df = convert_column_types(df)
df.shape

Loading data from csv file '../Data_Raw/Nov2018.csv'
Converting column types...
Forward filling missing values...
Loading data from csv file '../Data_Raw/Oct2018.csv'
Converting column types...


(681753, 11)

Because data is only registered when a parameter changes, each datapoint can cover a different length of time. When we do the clustering we need to take this into account by weighting each datapoint. We therefore create a column that gives the duration of each datapoint in seconds, i.e. the time until the next datapoint.

In [102]:
def timedelta_to_seconds(timedelta):
    if not pd.isnull(timedelta):
        return timedelta.total_seconds()
    else:
        return np.nan
    
df[ProcessingFeatures.DATAPOINT_DURATION] = (df.index.to_series().diff(-1)).apply(timedelta_to_seconds).values
df[ProcessingFeatures.DATAPOINT_DURATION] *= -1
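
To make the duration column concrete, here is a minimal sketch on a toy frame. The frame `toy` and the column name `duration_seconds` are made up for illustration and are not part of the notebook. It computes the same quantity (seconds until the next row) with a shift instead of a negated `diff(-1)`.

```python
import numpy as np
import pandas as pd

# Toy index: three change events, 10 s and then 50 s apart (hypothetical data).
index = pd.to_datetime([
    "2018-11-01 00:00:00",
    "2018-11-01 00:00:10",
    "2018-11-01 00:01:00",
])
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=index)

# Duration of a row = timestamp of the next row minus its own timestamp.
timestamps = toy.index.to_series()
toy["duration_seconds"] = (timestamps.shift(-1) - timestamps).dt.total_seconds()

print(toy["duration_seconds"].tolist())
```

The last row has no successor, so its duration is NaN. This is consistent with the single row that `dropna` removes further down.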

Next, we mark the source as stable or unstable. The parameters used come from experiments on the Nov2018 data.

In [103]:
value_column = SourceFeatures.BCT25_CURRENT
weight_column = ProcessingFeatures.DATAPOINT_DURATION
sliding_window_size_mean=1500
sliding_window_size_std=2000
minimum_mean=0.023
#minimum_mean=0.027 #for Nov 2018
#minimum_mean=0.035 #for Nov 2016
maximum_variance=0.000035

df[ProcessingFeatures.SOURCE_STABILITY] = stability_mean_variance_classification(
                            df, 
                            value_column=value_column, 
                            weight_column=weight_column,
                            sliding_window_size_mean=sliding_window_size_mean,
                            sliding_window_size_std=sliding_window_size_std,
                            minimum_mean=minimum_mean, 
                            maximum_variance=maximum_variance)
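
The implementation of `stability_mean_variance_classification` lives in the imported Preprocessing notebook and is not shown here. As a rough illustration of the idea only, the sketch below labels a point as stable when a duration-weighted rolling mean is above `minimum_mean` and a duration-weighted rolling variance is below `maximum_variance`. The function name `classify_stability_sketch`, the centered windows and the toy data are all assumptions, not the notebook's actual method.

```python
import pandas as pd

def classify_stability_sketch(values, weights, window_mean, window_var,
                              minimum_mean, maximum_variance):
    """Hypothetical stand-in, NOT the notebook's stability_mean_variance_classification."""
    def weighted_rolling_mean(series, window):
        # sum(w * x) / sum(w) over a centered window of `window` rows
        numerator = (series * weights).rolling(window, center=True, min_periods=1).sum()
        denominator = weights.rolling(window, center=True, min_periods=1).sum()
        return numerator / denominator

    rolling_mean = weighted_rolling_mean(values, window_mean)
    # Weighted variance as E[x^2] - E[x]^2, clipped at zero against rounding noise
    local_mean = weighted_rolling_mean(values, window_var)
    rolling_var = (weighted_rolling_mean(values ** 2, window_var) - local_mean ** 2).clip(lower=0.0)

    stable = (rolling_mean > minimum_mean) & (rolling_var < maximum_variance)
    return stable.astype("int64")

# Toy signal: 20 steady samples followed by 20 samples that jump between 0 and 0.06
values = pd.Series([0.03] * 20 + [0.0, 0.06] * 10)
weights = pd.Series([1.0] * 40)  # equal durations, for simplicity
labels = classify_stability_sketch(values, weights, window_mean=5, window_var=5,
                                   minimum_mean=0.023, maximum_variance=0.000035)
print(labels.iloc[10], labels.iloc[30])
```

In the toy signal the steady part is labeled 1 and the jumping part 0: its local mean is still above the threshold, but its local variance is far too large.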

The next thing we are interested in are the high voltage breakdowns.

In [104]:
column = SourceFeatures.SOURCEHTAQNI
window_size = 40
threshold = 0.25

df[ProcessingFeatures.HT_VOLTAGE_BREAKDOWN] = detect_breakdowns(df, column, window_size, threshold)
df = df.astype({ProcessingFeatures.HT_VOLTAGE_BREAKDOWN : 'int64'})
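
`detect_breakdowns` is also defined in the imported Preprocessing notebook, so its exact logic is not visible here. One plausible reading of the `window_size` and `threshold` parameters is a rolling spread test, sketched below. The function `detect_breakdowns_sketch` and the toy voltage trace are hypothetical, and the real detector may work differently.

```python
import pandas as pd

def detect_breakdowns_sketch(voltage, window_size, threshold):
    """Hypothetical stand-in: flag samples whose local spread exceeds a threshold."""
    # Standard deviation over a centered window of `window_size` rows
    rolling_std = voltage.rolling(window_size, center=True, min_periods=1).std()
    # NaN (too few samples) compares as False, so such rows are never flagged
    return (rolling_std > threshold).astype("int64")

# Toy trace: a steady 20.0 with a short collapse to 5.0 in the middle
voltage = pd.Series([20.0] * 50 + [5.0] * 3 + [20.0] * 50)
flags = detect_breakdowns_sketch(voltage, window_size=10, threshold=0.25)
print(flags.iloc[10], flags.iloc[51], flags.iloc[90])
```

Note that a window-based test like this flags the rows around the collapse as well as the collapse itself, because every window that overlaps the dip has a large spread.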

After all of this, we can mark the rows that tell us nothing about the source performance. These are all the times where the BCT05 current is (almost) zero.

In [105]:
is_zero_threshold = 0.004
df[ProcessingFeatures.SOURCE_RUNNING] = 0
df.loc[df[SourceFeatures.BCT05_CURRENT] > is_zero_threshold, ProcessingFeatures.SOURCE_RUNNING] = 1

All rows that still have NaN values are missing information that cannot be acquired from the data. Hence we remove these rows.

In [106]:
df.dropna(inplace=True)
df.shape

(681752, 15)

### Visualizations
#### Source classification

In [107]:
%matplotlib notebook

import matplotlib.pyplot as plt
import matplotlib
plt.rcParams["figure.figsize"] = (30,6)
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

dates_stable = matplotlib.dates.date2num(df.loc[df[ProcessingFeatures.SOURCE_STABILITY] == 1].index.values)
dates_unstable = matplotlib.dates.date2num(df.loc[df[ProcessingFeatures.SOURCE_STABILITY] == 0].index.values)

fig = plt.figure()
ax = plt.subplot(111)  # integer position spec; the string form '111' is rejected by newer matplotlib
ax.plot_date(dates_stable, df.loc[df[ProcessingFeatures.SOURCE_STABILITY] == 1, SourceFeatures.BCT25_CURRENT].values, fmt='.', c='orange')
ax.plot_date(dates_unstable, df.loc[df[ProcessingFeatures.SOURCE_STABILITY] == 0, SourceFeatures.BCT25_CURRENT].values, fmt='.', c='blue')
ax.set_ylim(-0.01, None)

plt.show()

#### Voltage Breakdowns

In [108]:
%matplotlib notebook

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (30,6)

ax1 = df.plot(y=[SourceFeatures.SOURCEHTAQNI, ProcessingFeatures.HT_VOLTAGE_BREAKDOWN], secondary_y=[ProcessingFeatures.HT_VOLTAGE_BREAKDOWN])
plt.show()

Now we can save the frame as a CSV file. To save storage and reduce loading time, we set consecutive duplicates to NaN. This can be reversed while loading with a forward fill (DataFrame.ffill).

In [109]:
df[df.shift(1)==df] = np.nan
df.to_csv(output_file)
print("Saved preprocessing of {} to {}.".format(input_file, output_file))

Saved preprocessing of ../Data_Raw/Nov2018.csv to ../Data_Preprocessed/Nov2018.csv.
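
The masking of consecutive duplicates and its reversal can be checked on a small example. This is a minimal sketch with a made-up frame and made-up column names, and it assumes the frame contains no NaN values before masking, as is the case here after `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with runs of repeated values
original = pd.DataFrame({
    "current": [0.03, 0.03, 0.04, 0.04, 0.04],
    "stable": [1.0, 1.0, 1.0, 0.0, 0.0],
})

# Same masking as above: a cell equal to the cell directly above it becomes NaN
compact = original.copy()
compact[compact.shift(1) == compact] = np.nan

# Forward filling restores every masked cell from the last value that was kept
restored = compact.ffill()
print(restored.equals(original))
```

The first row is never masked, because `shift(1)` leaves NaN there and NaN compares unequal to everything. Integer columns such as the breakdown flag become floats once NaN is written into them, so they may need to be cast back after loading.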
