# Capstone - Partial Discharge
## Julian Sweet DSI-LA-6
## Notebook 1 - Frequency space data manipulation.



The problem of electrical transmission line partial discharge is a ubiquitous concern for all power networks.

> https://www.kaggle.com/c/vsb-power-line-fault-detection

At its heart, this project is best characterized as an anomaly detection problem. The provided dataset contains 525 labelled failure time-series voltage recordings within a larger dataset of 8712 recordings.
However, it will be interesting to contrast training a model on the anomalous, unbalanced dataset (~ 6% / 94%) with training on a resampled dataset with balanced (50% / 50%) classes.
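
The class proportions quoted above can be checked with a few lines of arithmetic. This is only a sketch using the two counts given in the text; the variable names are illustrative, and the balanced size assumes the majority class is undersampled across the whole dataset rather than within a train/test split.

```python
# Counts quoted in the text above
n_total = 8712   # labelled recordings in the training set
n_fault = 525    # recordings labelled as partial discharge

fault_share = n_fault / n_total
normal_share = 1 - fault_share

print(f"fault:  {fault_share:.1%}")
print(f"normal: {normal_share:.1%}")

# Undersampling the majority class down to the minority count
# would leave this many recordings in a 50% / 50% dataset
n_balanced = 2 * n_fault
print(f"balanced dataset size: {n_balanced}")
```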

Data labels are stored in separate CSV files, described below. The Kaggle-provided training data is an Apache Parquet file, a compressed format in which each recording is stored as a column.

For the purpose of this project, the larger Kaggle-provided test data, 'test.parquet', will be excluded because no target labels are provided. Without labels, accuracy, overfitting, precision, etc. cannot be determined.

Let's begin with an import:

In [1]:
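import numpy as np
import pandas as pd
from scipy import signal
from sklearn.model_selection import train_test_split  # used for the stratified split later in the notebook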

from sys import getsizeof

%matplotlib inline

The VSB partial discharge dataset is composed of several files:

'train.parquet' - 

An Apache Parquet data file that is 3.81 GB in size on disk in compressed form. Each measurement is stored as a column, with 8712 measurements (columns), each composed of 800,000 rows spanning 20 milliseconds in time. Each data point is stored as an 8-bit integer value, so when loaded entirely into memory 'train.parquet' should occupy approximately 8712 columns x 800,000 rows x 1 byte = 6.97 GB.
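
The in-memory estimate above can be reproduced as follows. This is a back-of-the-envelope sketch that counts only the raw 8-bit values and ignores any overhead from pandas itself; the sampling rate is derived from the 800,000 samples per 20 ms stated in the text.

```python
import numpy as np

n_signals = 8712        # columns in train.parquet
n_samples = 800_000     # rows per signal
bytes_per_value = np.dtype(np.int8).itemsize  # 8-bit integers occupy 1 byte each

total_bytes = n_signals * n_samples * bytes_per_value
print(f"{total_bytes / 1e9:.2f} GB (decimal)")
print(f"{total_bytes / 2**30:.2f} GiB (binary)")

# Sampling rate implied by 800,000 samples spanning 20 milliseconds
duration_s = 20e-3
sample_rate_hz = n_samples / duration_s
print(f"{sample_rate_hz / 1e6:.0f} MHz")
```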

Apache Parquet is a popular, widely adopted file type that allows one to selectively load portions of the file into memory without loading the large archive in its entirety. It also incorporates compression automatically.
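
As a minimal sketch of the selective loading described above, `pd.read_parquet` accepts a `columns` argument so that only the named columns are read. The helper names `load_signals` and `demo` are hypothetical, the tiny file written by `demo` is a stand-in for the real archive, and the code assumes the column labels are signal ids stored as strings and that a Parquet engine such as pyarrow is installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_signals(path, signal_ids):
    """Read only the requested signal columns from a Parquet file."""
    # Assumption: column labels are the signal ids stored as strings
    columns = [str(i) for i in signal_ids]
    return pd.read_parquet(path, columns=columns)


def demo():
    """Write a tiny stand-in file and read back three of its ten columns."""
    rng = np.random.default_rng(0)
    toy = pd.DataFrame(
        rng.integers(-128, 127, size=(100, 10), dtype=np.int8),
        columns=[str(i) for i in range(10)],
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.parquet")
        toy.to_parquet(path)
        return load_signals(path, [3, 4, 5])
```

Under the same assumptions, a call such as `load_signals('VSB_unpacked/train.parquet', range(3, 6))` would read three signals from the real file without loading the rest.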

'metadata_train.csv' - 

A 118 kB comma-separated values (CSV) file that contains labels for the larger 'train.parquet'. The first column, 'signal_id', is a unique identifier ranging from 0 to 8711. The second column, 'id_measurement', identifies groupings of the wires for each measurement. Every three wires share an 'id_measurement'. The phase of each wire within a grouping is given by the 'phase' column, which takes the values 0, 1 or 2, following the convention for 3-phase power.

The 'target' column takes on a value of 1 for wires labelled as possessing the partial discharge (PD) condition and 0 for wires not considered to be experiencing PD.
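
To make the metadata layout concrete, the sketch below builds a small frame with the four columns described above and summarises it per measurement. The six rows are toy values chosen for illustration, not the contents of 'metadata_train.csv', and the summary column names are hypothetical.

```python
import pandas as pd

# Toy rows following the described layout: three wires (phases 0, 1, 2) per measurement
meta = pd.DataFrame({
    "signal_id":      [0, 1, 2, 3, 4, 5],
    "id_measurement": [0, 0, 0, 1, 1, 1],
    "phase":          [0, 1, 2, 0, 1, 2],
    "target":         [0, 0, 0, 1, 1, 0],
})

# One row per measurement: how many wires it has and how many are labelled PD
per_measurement = meta.groupby("id_measurement").agg(
    n_wires=("signal_id", "size"),
    n_faulty=("target", "sum"),
)
per_measurement["any_fault"] = per_measurement["n_faulty"] > 0
print(per_measurement)
```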

'test.parquet' and 'metadata_test.csv' - 

These files correspond to a larger test dataset in Apache Parquet and CSV format respectively. As this larger dataset does not possess a 'target' label, it will be excluded from this project. Without a 'target' label there is no way to ascertain the accuracy, precision or degree of overfit. So while these files are useful in the context of the Kaggle competition, this exploration will not include them.

 
The Pandas library provides functions to load both CSV and Apache Parquet files. The two files of interest are loaded below, with the training dataset transposed so that there are now 8,712 rows of unique observations, each composed of 800,000 columns of time-series voltage measurements. Because every value shares the same dtype, the transposition itself is expected to be cheap compared with reading the file.

In [2]:
meta_train = pd.read_csv('./VSB_unpacked/metadata_train.csv')
df_train = pd.read_parquet('VSB_unpacked/train.parquet').T
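
The cost of transposition can be illustrated at the NumPy level, where `.T` returns a view that shares memory with the original array rather than copying it. This is a small sketch on a toy array; whether pandas does the same for a given DataFrame depends on its dtypes and the pandas version, so it should be read as an indication rather than a guarantee.

```python
import numpy as np

a = np.zeros((4, 6), dtype=np.int8)
b = a.T  # transposed view, no data copied

print(b.shape)
print(np.shares_memory(a, b))

# A write through the original is visible through the transposed view
a[0, 1] = 7
print(b[1, 0])
```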

In [3]:
df_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,799990,799991,799992,799993,799994,799995,799996,799997,799998,799999
0,18,18,17,18,18,18,19,18,18,17,...,18,18,17,17,18,19,19,17,19,17
1,1,0,-1,1,0,0,1,0,0,0,...,1,0,0,0,0,2,1,0,1,0
2,-19,-19,-20,-19,-19,-20,-18,-19,-20,-19,...,-19,-20,-21,-18,-19,-18,-19,-19,-18,-19
3,-16,-17,-17,-16,-16,-15,-16,-17,-18,-17,...,-15,-15,-15,-15,-15,-15,-15,-15,-14,-14
4,-5,-6,-6,-5,-5,-4,-5,-7,-7,-7,...,-4,-4,-4,-5,-4,-4,-4,-4,-3,-4


In [4]:
meta_train.head()

Unnamed: 0,signal_id,id_measurement,phase,target
0,0,0,0,0
1,1,0,1,0
2,2,0,2,0
3,3,1,0,1
4,4,1,1,1


Using the 'target' column of the metadata, the data is split into stratified training and test sets:

In [28]:
X_train, X_test, y_train, y_test = train_test_split(
    df_train, 
    meta_train['target'], 
    stratify = meta_train['target'], 
    test_size = .2,
    random_state = 510
)

#### There are now X_train, X_test, y_train, and y_test sets which reflect the unbalanced classes. From these same sets, a resampled dataset with balanced classes will be created.

In [55]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6969, 800000), (1743, 800000), (6969,), (1743,))
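
The split sizes above, and the effect of `stratify`, can be checked on synthetic labels without loading the full dataset. The sketch below assumes only the class counts quoted earlier (525 positives out of 8712); the single-column feature matrix is a placeholder for the real voltage recordings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_total, n_fault = 8712, 525

# Synthetic labels with the same class counts as the real metadata
y = np.zeros(n_total, dtype=int)
y[:n_fault] = 1
X = np.arange(n_total).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=510
)

print(X_tr.shape, X_te.shape)
for name, part in [("all", y), ("train", y_tr), ("test", y_te)]:
    print(f"{name:5s} fault share: {part.mean():.4f}")
```

With stratification, the fault share in each split stays close to that of the full set, which is the property the unbalanced experiments rely on.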

In [56]:
X_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,799990,799991,799992,799993,799994,799995,799996,799997,799998,799999
7056,-8,-3,-5,-5,-3,-9,-3,-11,-6,-9,...,-6,-8,-5,-7,-4,-7,-4,-9,-6,-9
347,-17,-18,-17,-18,-18,-18,-18,-18,-18,-18,...,-16,-17,-17,-17,-17,-17,-17,-17,-17,-17
4780,-10,-6,-16,-7,-12,-13,-8,-12,-10,-10,...,-11,-10,-11,-10,-11,-11,-9,-12,-11,-9
7023,5,6,6,6,6,6,6,5,6,6,...,6,6,5,6,5,5,6,6,5,5
4116,-25,-23,-25,-26,-23,-25,-24,-25,-25,-23,...,-24,-20,-26,-21,-22,-24,-21,-25,-25,-24


In [31]:
y_train.head()

7056    0
347     0
4780    0
7023    0
4116    0
Name: target, dtype: int64

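The column labels read from the Parquet file were strings. After transposition they became the row index of X_train and X_test but remained strings, so they are cast to integers below to match the integer index of y_train and y_test.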

In [49]:
X_train.index = X_train.index.map(int)
X_test.index = X_test.index.map(int)
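
The reason for the cast is that pandas aligns a boolean Series mask with a DataFrame on index labels, so the two indices need to hold the same kind of label. A toy sketch, with made-up values standing in for the real recordings:

```python
import pandas as pd

# Toy stand-in: columns labelled with strings, as read from the Parquet file
wide = pd.DataFrame({"0": [1, 2], "1": [3, 4], "2": [5, 6]})
X = wide.T                              # row index is now the strings '0', '1', '2'
y = pd.Series([0, 1, 0], name="target")  # integer index 0, 1, 2

# Cast the string labels to integers so they match the labels of y
X.index = X.index.map(int)

mask = y == 1
print(X[mask])
```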

In [50]:
y_train_resamp = y_train[y_train == 1]
X_train_resamp = X_train[y_train == 1]

In [51]:
y_test_resamp = y_test[y_test == 1]
X_test_resamp = X_test[y_test == 1]

In [63]:
X_test_resamp.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,799990,799991,799992,799993,799994,799995,799996,799997,799998,799999
4874,16,14,12,12,14,16,17,16,14,13,...,14,15,16,17,16,15,14,15,16,17
8032,5,4,4,5,5,5,6,5,4,5,...,4,4,5,5,4,4,5,4,4,5
706,11,9,6,8,9,11,11,11,7,10,...,8,11,10,12,12,13,12,11,10,9
4767,18,18,20,20,20,20,20,20,20,19,...,18,18,18,18,18,19,18,18,18,18
708,-15,-17,-16,-15,-18,-15,-17,-16,-15,-18,...,-16,-15,-16,-16,-17,-16,-17,-16,-16,-16


In [60]:
y_train_resamp = np.concatenate([y_train_resamp, [0]* y_train.sum()])

In [62]:
y_train_resamp.index

AttributeError: 'numpy.ndarray' object has no attribute 'index'
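
The AttributeError above arises because `np.concatenate` returns a plain NumPy array, which discards the pandas index. If the signal ids need to be kept alongside the labels, one option is `pd.concat`, sketched here on toy labels; the index values for the negative class are hypothetical.

```python
import numpy as np
import pandas as pd

y_pos = pd.Series([1, 1], index=[3, 4], name="target")
n_neg = int(y_pos.sum())

# np.concatenate returns an ndarray, so the index is lost
as_array = np.concatenate([y_pos, [0] * n_neg])
print(type(as_array).__name__, hasattr(as_array, "index"))

# pd.concat keeps the labels; the negative ids below are hypothetical
y_neg = pd.Series(0, index=[0, 7], name="target")
as_series = pd.concat([y_pos, y_neg])
print(as_series)
```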

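In [58]:
# np.random.choice only accepts 1-D input (hence "a must be 1-dimensional"),
# so draw index labels of the negative class and select those rows instead.
neg_idx = np.random.choice(y_train[y_train == 0].index, int(y_train.sum()), replace=False)
X_train_resamp = pd.concat([X_train_resamp, X_train.loc[neg_idx]])
# y_train_resamp was already extended with the matching zeros in the cell above.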

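In [54]:
# Same approach for the test split: sample negative index labels, then select rows.
neg_idx_test = np.random.choice(y_test[y_test == 0].index, int(y_test.sum()), replace=False)
X_test_resamp = pd.concat([X_test_resamp, X_test.loc[neg_idx_test]])
y_test_resamp = np.concatenate([y_test_resamp, [0] * int(y_test.sum())])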
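
The balancing steps in the cells above could also be collected into a single helper. The sketch below is one possible arrangement, not the notebook's own method: `undersample_majority` is a hypothetical name, it uses `Series.sample` instead of `np.random.choice`, and it assumes `X` and `y` share the same index with labels 0 and 1. It is shown on toy data.

```python
import numpy as np
import pandas as pd


def undersample_majority(X, y, seed=0):
    """Return X, y with the negative class downsampled to the positive count."""
    pos = y[y == 1]
    # Draw as many negatives as there are positives, without replacement
    neg = y[y == 0].sample(n=len(pos), replace=False, random_state=seed)
    keep = pos.index.append(neg.index)
    # Selecting by label keeps features and labels aligned on the same index
    return X.loc[keep], y.loc[keep]


# Toy data: 3 positives among 20 rows
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(-20, 20, size=(20, 5)))
y_toy = pd.Series([1] * 3 + [0] * 17, name="target")

X_bal, y_bal = undersample_majority(X_toy, y_toy)
print(X_bal.shape, y_bal.value_counts().to_dict())
```

Returning pandas objects rather than NumPy arrays keeps the signal ids attached to both the features and the labels, which makes it easier to trace a resampled row back to its entry in the metadata.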