# Capstone - Partial Discharge
## Julian Sweet DSI-LA-6
## Notebook 1 - Data manipulation

Partial discharge on electrical transmission lines is a ubiquitous concern for all power networks.

> https://www.kaggle.com/c/vsb-power-line-fault-detection

The partial discharge training dataset is 8712 recordings of voltage fluctuations along "medium voltage" transmission lines in the Czech Republic. Medium and high voltage transmission lines transport electrical energy in a "3-phase" or tri-phasic configuration. This means that one can think of there being 4 power lines in total: a neutral / common wire as well as 3 energized lines carrying alternating current (AC) that alternates between approximately positive and negative 20,000 Volts (+/- 20 kV) at a rate of 50 Hertz (Hz). An ideal representation is 3 sinusoids, each 120 degrees out of phase with the others. 50 Hertz corresponds to 50 oscillations per second, or a period of 1 / (50 s^-1) = 20 milliseconds.
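
The idealized three-phase waveform described above can be sketched with NumPy. This is only an illustration under stated assumptions: the amplitude and frequency are the approximate figures quoted in the text, and `N_SAMPLES` is an arbitrary coarse resolution chosen for the sketch, not the resolution of the real recordings.

```python
import numpy as np

FREQ_HZ = 50.0          # grid frequency quoted in the text
PEAK_VOLTS = 20_000.0   # approximate peak amplitude quoted in the text
N_SAMPLES = 800         # illustrative resolution only (assumption)

# One full period: 1 / 50 Hz = 0.02 s (20 ms)
period_s = 1.0 / FREQ_HZ
t = np.linspace(0.0, period_s, N_SAMPLES, endpoint=False)

# Three phases offset by 120 degrees from one another
phase_offsets = np.deg2rad([0.0, 120.0, 240.0])
phases = PEAK_VOLTS * np.sin(
    2 * np.pi * FREQ_HZ * t[None, :] - phase_offsets[:, None]
)

# In an ideal balanced system the three voltages cancel at every instant
residual = np.abs(phases.sum(axis=0)).max()
```

Each row of `phases` is one ideal wire voltage over a single 20 ms period, and `residual` should be zero up to floating-point error. Real recordings deviate from this ideal through noise and, where present, partial discharge.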

Each time-series signal is a one-dimensional array of 800,000 samples spanning one period of one of the three wire phases. Dividing 20 milliseconds by 800,000 samples gives a sampling period of 25 nanoseconds, which corresponds to a sampling rate of 40 MHz.
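
The sampling arithmetic above can be checked in a few lines. The constant names are chosen here for illustration; the values come from the text.

```python
PERIOD_S = 20e-3        # one 50 Hz cycle, in seconds
N_SAMPLES = 800_000     # samples per recording

sampling_period_s = PERIOD_S / N_SAMPLES   # seconds between consecutive samples
sampling_rate_hz = N_SAMPLES / PERIOD_S    # samples per second

print(f"{sampling_period_s * 1e9:.1f} ns, {sampling_rate_hz / 1e6:.1f} MHz")
```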

This project, at its heart, is best characterized as an anomaly detection problem. In the provided dataset, there are 525 labelled failure time-series voltage recordings within the larger dataset of 8712 recordings.
However, it will be interesting to contrast training a model on the anomalous, unbalanced dataset (~ 6% / 94%) with training on a resampled dataset with balanced (50% / 50%) classes.
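
As a quick sanity check of the class proportions quoted above, the sketch below recomputes them from the two counts. It also assumes, for illustration, that the balanced dataset is built by keeping every positive recording and an equal number of negatives.

```python
N_TOTAL = 8712      # recordings in the training set
N_POSITIVE = 525    # recordings labelled as partial discharge

positive_fraction = N_POSITIVE / N_TOTAL
negative_fraction = 1.0 - positive_fraction

# Assumption: balancing keeps all positives plus an equal number of negatives
balanced_size = 2 * N_POSITIVE

print(f"PD: {positive_fraction:.1%}, no PD: {negative_fraction:.1%}, "
      f"balanced subset: {balanced_size} recordings")
```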

Data labels are stored in separate CSV files, described below. The Kaggle-provided training data is an Apache Parquet file, a compressed format that stores each data recording as a column.

For the purpose of this project, the larger Kaggle-provided test data, 'test.parquet', will be excluded because no target labelling is provided. Without labels, accuracy, overfitting, precision, etc. cannot be determined.

Data imported below:

In [16]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sys import getsizeof

%matplotlib inline

The VSB partial discharge dataset is composed of several files:

'train.parquet' - 

An Apache Parquet data file that is 3.81 GB on disk in compressed form. Each measurement is stored as a column, giving 8712 measurements (columns), each composed of 800,000 rows spanning 20 milliseconds in time. Each data point is stored as an 8-bit integer value, so when loaded entirely into memory 'train.parquet' should occupy approximately 8712 columns x 800,000 rows x 1 byte = 6.97 GB of data.
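
The in-memory estimate can be reproduced as follows. This is a rough sketch that counts only the raw sample bytes and ignores any DataFrame overhead.

```python
import numpy as np

N_SIGNALS = 8712        # columns in train.parquet
N_SAMPLES = 800_000     # rows per column

bytes_per_sample = np.dtype(np.int8).itemsize   # 1 byte for an 8-bit integer
total_bytes = N_SIGNALS * N_SAMPLES * bytes_per_sample

print(f"{total_bytes / 1e9:.2f} GB ({total_bytes / 2**30:.2f} GiB)")
```

If the samples were upcast to a wider type such as 64-bit integers, the same calculation would give eight times this figure, which is one reason to keep the data as `int8`.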

Apache Parquet is a popular, widely adopted file format that allows one to selectively load portions of the file into memory without loading the large archive in its entirety. It also incorporates compression automatically.
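
A minimal sketch of selective loading, assuming the signals are stored as columns whose names are the signal ids written as strings (as is the case for the training file in this notebook). The helper names `signal_columns` and `load_signals` are hypothetical and introduced only for this example. `pd.read_parquet` accepts a `columns` argument that limits which columns are read.

```python
import pandas as pd

def signal_columns(signal_ids):
    """Return the Parquet column names for the given signal ids (stored as strings)."""
    return [str(i) for i in signal_ids]

def load_signals(parquet_path, signal_ids):
    """Read only the requested signal columns instead of the whole file."""
    return pd.read_parquet(parquet_path, columns=signal_columns(signal_ids))

# Hypothetical usage: load only the three phases of the first measurement.
# subset = load_signals('VSB_unpacked/train.parquet', [0, 1, 2])
```

Reading a handful of columns this way keeps memory use proportional to the number of signals requested rather than to the size of the whole file.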

'metadata_train.csv' - 

A 118 kB comma-separated values (CSV) file that contains labels for the larger 'train.parquet'. The first column, 'signal_id', is a unique identifier ranging from 0 to 8711. The second column, 'id_measurement', identifies the grouping of wires for each measurement: every three wires share an 'id_measurement'. The phase of each wire within a grouping is given by the 'phase' column, which takes the values 0, 1 or 2, the convention for 3-phase power.

The 'target' column takes on a value of 1 for wires labelled as possessing the partial discharge (PD) condition and 0 for wires not considered to be experiencing PD.
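
The grouping described above can be explored with a `groupby`. The sketch below uses a small toy table in the same layout as 'metadata_train.csv'; its values are illustrative only and are not taken from the real file.

```python
import pandas as pd

# Toy metadata in the layout of metadata_train.csv (illustrative values only)
meta = pd.DataFrame({
    "signal_id":      [0, 1, 2, 3, 4, 5],
    "id_measurement": [0, 0, 0, 1, 1, 1],
    "phase":          [0, 1, 2, 0, 1, 2],
    "target":         [0, 0, 0, 1, 1, 0],
})

# Every measurement should contain three distinct phases
all_complete = meta.groupby("id_measurement")["phase"].nunique().eq(3).all()

# Number of wires labelled PD within each measurement
pd_wires = meta.groupby("id_measurement")["target"].sum()
```

The same two lines applied to the real metadata would confirm whether every measurement has all three phases and show how often only some of the wires in a group are labelled as PD.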

'test.parquet' and 'metadata_test.csv' - 

These files correspond to a larger test dataset in Apache Parquet and CSV format respectively. As this larger dataset does not possess a 'target' label, it will be excluded from this project. Without a 'target' label there is no way to ascertain the accuracy, precision or degree of overfit. So while useful in the context of the Kaggle competition, these files are not included in this exploration.

 
The Pandas library possesses methods to load both CSV and Apache Parquet files. The two files of interest are loaded below, with the training dataset transposed so that there are now 8,712 rows of unique observations, each composed of 800,000 columns of time-series voltage measurements. For a DataFrame with a single dtype, transposition is typically cheap because Pandas can often return a view of the underlying array instead of copying it.

In [27]:
%%time
meta_train = pd.read_csv('./VSB_unpacked/metadata_train.csv')
df_train = pd.read_parquet('VSB_unpacked/train.parquet').T

CPU times: user 1min 21s, sys: 48 s, total: 2min 9s
Wall time: 43.8 s


In [28]:
df_train.head()

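|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 799990 | 799991 | 799992 | 799993 | 799994 | 799995 | 799996 | 799997 | 799998 | 799999 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 18 | 18 | 17 | 18 | 18 | 18 | 19 | 18 | 18 | 17 | ... | 18 | 18 | 17 | 17 | 18 | 19 | 19 | 17 | 19 | 17 |
| 1 | 1 | 0 | -1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 |
| 2 | -19 | -19 | -20 | -19 | -19 | -20 | -18 | -19 | -20 | -19 | ... | -19 | -20 | -21 | -18 | -19 | -18 | -19 | -19 | -18 | -19 |
| 3 | -16 | -17 | -17 | -16 | -16 | -15 | -16 | -17 | -18 | -17 | ... | -15 | -15 | -15 | -15 | -15 | -15 | -15 | -15 | -14 | -14 |
| 4 | -5 | -6 | -6 | -5 | -5 | -4 | -5 | -7 | -7 | -7 | ... | -4 | -4 | -4 | -5 | -4 | -4 | -4 | -4 | -3 | -4 |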


In [29]:
meta_train.head()

|  | signal_id | id_measurement | phase | target |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 |
| 2 | 2 | 0 | 2 | 0 |
| 3 | 3 | 1 | 0 | 1 |
| 4 | 4 | 1 | 1 | 1 |


The column labels of the training dataset were initially stored as strings. The transposition turned them from Pandas column labels into Pandas index labels, but they remain strings. The step below recasts the index labels as integers for convenience.

In [30]:
df_train.index = df_train.index.map(int)

Both the training metadata and the training dataset are used to split the time-series observations 80% / 20% into training and test data. Because the classes are highly unbalanced, the stratify option is enabled. Due to the large size of this dataset, the random train / test split is performed on the training data index labels, not on the data itself. Attempts to hold multiple large subsets of the training data in memory overwhelmed the kernel. A discussion of alternative file types to be explored will follow in the last notebook.

In [31]:
X_train, X_test, y_train, y_test = train_test_split(
    df_train.index, 
    meta_train['target'], 
    stratify = meta_train['target'], 
    test_size = .2,
    random_state = 510
)

In [32]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6969,), (1743,), (6969,), (1743,))
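
To illustrate what the stratify option does, the sketch below repeats the split on toy labels with roughly the same imbalance as the real data. The arrays `labels` and `row_ids` are stand-ins invented for this example, and the `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 60 positives out of 1000, roughly the 6% / 94% imbalance of the real data
labels = np.array([1] * 60 + [0] * 940)
row_ids = np.arange(labels.size)   # stand-in for df_train.index

ids_train, ids_test, y_tr, y_te = train_test_split(
    row_ids, labels, stratify=labels, test_size=0.2, random_state=0
)

# With stratification, both subsets keep the positive rate of the full set
train_rate = y_tr.mean()
test_rate = y_te.mean()
```

Only the small arrays of row ids are shuffled and split here, which mirrors the approach above of splitting index labels instead of the 800,000-column data.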

Again, the train / test split above is only composed of index labels, not the data itself. Now to resample the data to create a balanced (50% PD / 50% no PD) dataset.

In [33]:
y_train_resamp = y_train[y_train == 1]
X_train_resamp = X_train[y_train == 1]

y_test_resamp = y_test[y_test == 1]
X_test_resamp = X_test[y_test == 1]

In [34]:
X_train_resamp = np.concatenate([X_train_resamp, np.random.choice(X_train[y_train == 0], y_train.sum(), replace=False)])
y_train_resamp = np.concatenate([y_train_resamp, [0]* y_train.sum()])

In [35]:
X_test_resamp = np.concatenate([X_test_resamp, np.random.choice(X_test[y_test == 0], y_test.sum(), replace=False)])
y_test_resamp = np.concatenate([y_test_resamp, [0]* y_test.sum()])

X_train and X_test hold the index labels selected by scikit-learn's train / test split. X_train_resamp and X_test_resamp hold the index labels of the rebalanced classes, built by keeping every PD recording and randomly undersampling the non-PD recordings without replacement. These 4 variables are therefore 1-D arrays of locations pointing to the data.
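
The balancing step can also be wrapped in a small function. The version below is a sketch, not the notebook's own code: the name `undersample_indices` is hypothetical, and it uses a seeded NumPy `Generator` so that the same negatives are drawn on every run, whereas the cells above use the unseeded `np.random.choice`.

```python
import numpy as np

def undersample_indices(indices, labels, seed=0):
    """Keep every positive index and draw an equal number of negatives without replacement."""
    indices = np.asarray(indices)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)   # seeded so the subset is reproducible

    pos = indices[labels == 1]
    neg = rng.choice(indices[labels == 0], size=pos.size, replace=False)

    balanced_idx = np.concatenate([pos, neg])
    balanced_y = np.concatenate([
        np.ones(pos.size, dtype=int),
        np.zeros(neg.size, dtype=int),
    ])
    return balanced_idx, balanced_y

# Toy example: 3 positives among 10 indices
idx, y = undersample_indices(np.arange(10), [1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
```

The returned arrays list all positives first and the sampled negatives second, so they may need shuffling before being fed to a model that is sensitive to ordering.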

Now those indices are applied to the original large dataset, df_train, to create subsets containing the actual time-series data referenced by the index variables.

In [37]:
%%time
X_train_data = df_train.iloc[X_train, :]
X_test_data = df_train.iloc[X_test, :]
X_train_resamp_data = df_train.iloc[X_train_resamp, :]
X_test_resamp_data = df_train.iloc[X_test_resamp, :]

CPU times: user 6.85 s, sys: 30.6 s, total: 37.4 s
Wall time: 1min 11s


In [38]:
X_test_data.head()

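|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 799990 | 799991 | 799992 | 799993 | 799994 | 799995 | 799996 | 799997 | 799998 | 799999 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4874 | 16 | 14 | 12 | 12 | 14 | 16 | 17 | 16 | 14 | 13 | ... | 14 | 15 | 16 | 17 | 16 | 15 | 14 | 15 | 16 | 17 |
| 1222 | -22 | -23 | -23 | -22 | -21 | -21 | -22 | -21 | -22 | -22 | ... | -21 | -22 | -21 | -21 | -21 | -21 | -20 | -21 | -21 | -20 |
| 6121 | 18 | 17 | 17 | 17 | 17 | 17 | 18 | 18 | 17 | 17 | ... | 18 | 19 | 19 | 19 | 18 | 18 | 17 | 18 | 18 | 18 |
| 8065 | -12 | -12 | -11 | -12 | -12 | -13 | -13 | -13 | -12 | -12 | ... | -12 | -12 | -13 | -13 | -12 | -11 | -11 | -12 | -12 | -12 |
| 7689 | 16 | 16 | 16 | 16 | 18 | 17 | 17 | 17 | 15 | 17 | ... | 16 | 15 | 18 | 17 | 15 | 17 | 14 | 17 | 16 | 16 |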


Each Pandas DataFrame / Series is recast as a NumPy array in preparation for saving.

In [39]:
X_train_data = np.asarray(X_train_data)
X_test_data = np.asarray(X_test_data)
X_train_resamp_data = np.asarray(X_train_resamp_data)
X_test_resamp_data = np.asarray(X_test_resamp_data)
y_test = np.asarray(y_test)
y_train = np.asarray(y_train)
y_train_resamp = np.asarray(y_train_resamp)
y_test_resamp = np.asarray(y_test_resamp)

Each variable above is saved to a subdirectory as a NumPy .npy file. While not compressed, the format is optimized for reading from disk. The disk operations have been commented out to prevent re-running.

In [40]:
%%time
# np.save('./npy_datasets/X_train_data', X_train_data)

CPU times: user 9 µs, sys: 2 µs, total: 11 µs
Wall time: 19.3 µs


In [41]:
# np.save('./npy_datasets/X_test_data', X_test_data)

In [42]:
# np.save('./npy_datasets/X_train_resamp_data', X_train_resamp_data)

In [43]:
# np.save('./X_test_resamp_data', X_test_resamp_data)

In [44]:
# np.save('./npy_datasets/y_test', y_test)

In [45]:
# np.save('./npy_datasets/y_train', y_train)

In [46]:
# np.save('./y_train_resamp', y_train_resamp)
# np.save('./y_test_resamp', y_test_resamp)
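
For reading the saved arrays back, one option worth noting is NumPy's memory-mapped loading, which avoids pulling a whole file into memory at once. The sketch below demonstrates it on a tiny stand-in array written to a temporary directory; the file name `demo.npy` is hypothetical and is not one of the files saved above.

```python
import os
import tempfile
import numpy as np

# Small stand-in array with the same dtype as the voltage recordings
demo = np.arange(12, dtype=np.int8).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.npy")   # hypothetical file name
    np.save(path, demo)

    # mmap_mode='r' maps the file read-only; rows are read from disk on access
    mapped = np.load(path, mmap_mode="r")
    first_row = np.array(mapped[0])            # copy a single row into memory
    same_dtype = mapped.dtype == demo.dtype
    del mapped                                 # release the map before the directory is removed
```

Applied to the large training arrays, the same pattern would let a later notebook pull individual recordings without loading every row.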

See Notebook #2 for time-series visualizations.