#  Feature extraction using Best Fourier Coefficients

*Before using this notebook, please download the folder from this link https://drive.google.com/open?id=1Eme87KqRZx8-sANpoHeCIOOWcSRYsMLy
and store the files in the same folder as this notebook.
You also need pip in your environment in order to install the PyDynamic package:*

pip install PyDynamic

*To see interactive diagrams, run:* 

pip install ipywidgets

jupyter nbextension enable --py widgetsnbextension   

### Importing the data 

In [2]:
import h5py                                     # Importing the h5 package.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import PyDynamic  


In [3]:
from PyDynamic import __version__ as version
version

'1.2.79'

In [4]:
filename = 'Sensor_data_2kHz.h5'                # Data filename.
f = h5py.File(filename, 'r')                    # Importing the h5 file. 

#print("Keys: %s" % f.keys())
a_group_key = list(f.keys())[0]

data = list(f[a_group_key])                     # Transforming data into list

sensorADC=[]                                       # Initialising the list "sensorADC" and
for i in range(11):                                # filling it with data from all 11 sensors.
    sensorADC.append(pd.DataFrame(data[i][:][:]))

for i in range(11):                             
    sensorADC[i]=sensorADC[i].iloc[:,:6291]           # Cutting the last cycle because it contains only zeros.

print("""    
    Input matrices have dimensions: %s, where %s represents number of measurements in time
    and %s represents number of cycles.""" % (np.shape(sensorADC[0]),np.shape(sensorADC[0])[0],np.shape(sensorADC[0])[1]))

    
    Input matrices have dimensions: (2000, 6291), where 2000 represents number of measurements in time
    and 6291 represents number of cycles.


### Converting into SI units 

In [5]:
offset=[0, 0, 0, 0, 0.00488591, 0.00488591, 0.00488591,  0.00488591, 1.36e-2, 1.5e-2, 1.09e-2]
gain=[5.36e-9, 5.36e-9, 5.36e-9, 5.36e-9, 3.29e-4, 3.29e-4, 3.29e-4, 3.29e-4, 8.76e-5, 8.68e-5, 8.65e-5]
b=[1, 1, 1, 1, 1, 1, 1, 1, 5.299641744, 5.299641744, 5.299641744]
k=[250, 1, 10, 10, 1.25, 1, 30, 0.5, 2, 2, 2]
units=['[Pa]', '[g]', '[g]', '[g]', '[kN]', '[bar]', '[mm/s]', '[A]', '[A]', '[A]', '[A]']

sensor=[0]*len(sensorADC)

for i in range(len(sensorADC)):
    sensor[i]=((sensorADC[i]*gain[i])+offset[i])*b[i]*k[i]
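
The conversion in the cell above applies the same affine formula to every sensor: `(ADC * gain + offset) * b * k`. As a minimal sketch, the formula can be checked on a few made-up ADC values with the microphone constants (sensor 0) from the lists above. The helper name `adc_to_si` and the ADC values are hypothetical and only serve as illustration.

```python
import numpy as np

def adc_to_si(adc, gain, offset, b, k):
    """Apply the affine conversion used above: (adc * gain + offset) * b * k."""
    return (np.asarray(adc, dtype=float) * gain + offset) * b * k

# Made-up raw ADC counts, for illustration only.
adc = np.array([0.0, 1.0e6, -2.0e6])

# Constants for sensor 0 (microphone), copied from the lists above.
mic = adc_to_si(adc, gain=5.36e-9, offset=0.0, b=1.0, k=250.0)
print(mic)
```

Because the formula is linear in the ADC value, it can be applied to a whole data frame at once, which is what the loop over `sensorADC` does.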


In [6]:

sensor=pd.read_csv(r'C:\Users\jugo01\Desktop\sensor_units.csv')

###### If you have problems with the previous step, you can skip the conversion into SI units by running the next cell.

In [7]:
sensor=sensorADC

### Reading of train and test data
*Note: see 2_Machine_Learning_using_Best_Fourier_Coefficients.ipynb. Data were split into train and test data with k=85%.
Based on this split, target_train_vector and target_test_vector were provided. These vectors will be used for the FFT and DFT methods.*

In [8]:
import os
import h5py

train_test1= h5py.File("Train_test_data_split","r")


In [9]:
target_train_vector=train_test1["target_train_vector"]
target_test_vector=train_test1["target_test_vector"]

Converting arrays into data frames:

In [10]:
target_train_vector=pd.DataFrame(target_train_vector)
target_test_vector=pd.DataFrame(target_test_vector)
target=list(target_train_vector[0])


So, after this step, the main data to work with are the lists: 

"sensor_train" with its class labels "target_train_vector"
 
and 
 
"sensor_test" with its class labels "target_test_vector"

In [11]:
sensor_train=[0]*11
sensor_test=[0]*11

for i in range(11):
    sensor_train[i]=sensor[i].loc[:,target_train_vector.index]

print("Training data for one sensor has dimensions: ", sensor_train[10].shape,",      ('sensor_train') ")
print("and its target vector has length: ", target_train_vector.shape,",               ('target_train_vector') \n")

for i in range(11):
    sensor_test[i]=sensor[i].loc[:,target_test_vector.index]

print("Testing data for one sensor has dimensions: ", sensor_test[10].shape,",      ('sensor_test') ")
print("and its target vector has length: ", target_test_vector.shape,",        ('target_test_vector') \n")

Training data for one sensor has dimensions:  (2000, 5347) ,      ('sensor_train') 
and its target vector has length:  (5347, 1) ,               ('target_train_vector') 

Testing data for one sensor has dimensions:  (2000, 944) ,      ('sensor_test') 
and its target vector has length:  (944, 1) ,        ('target_test_vector') 



To understand the structure used in the next steps, we can look at the data from one sensor after splitting. Each sensor has 2000 rows (samples in time), and each column is one randomly selected measurement cycle. The table shows only the first five samples in time (five rows) for each cycle. 

In [12]:
sensor_train[0].head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5337,5338,5339,5340,5341,5342,5343,5344,5345,5346
0,-39063.942315,26382.86638,111982.5763,123270.310519,5677.152553,63769.945966,-3355.7084,-11744.36781,3214.094226,22705.143489,...,-131018.791877,245701.247597,157393.15935,137924.235069,37357.356543,80615.28477,106164.879171,267842.150619,120640.850337,270518.911042
1,10194.702045,46487.991618,55068.146065,2030.126987,34201.705127,72558.72095,40693.962606,49094.027868,211884.995913,80838.678771,...,-162996.179298,333922.058733,260720.437349,223240.938268,36031.231549,154570.342483,321749.537589,476838.44361,282414.94192,521374.626098
2,66179.322909,-117858.502291,12318.400034,15189.662759,67890.355293,-160840.472258,45730.518382,25558.142425,137603.745729,105958.823904,...,-275930.554523,159220.818206,58490.19552,43295.397976,-56084.843536,55139.127761,293371.890093,407325.108931,127009.541544,161337.248743
3,30473.015521,-61092.677988,197316.095058,67776.093548,141726.141413,51067.908648,-76087.032276,78415.34823,27924.757412,173708.531802,...,9485.61863,228713.438385,150501.084189,222896.005034,51973.964254,8814.34238,277226.0632,502323.444198,213967.458111,365980.188068
4,25058.467073,-32915.559673,42260.618178,16534.709886,68051.95132,-161081.305465,-109702.711669,90143.575353,116359.464423,28532.851905,...,-19899.218342,285695.180368,223312.909029,193983.151309,100914.999644,113175.019174,373612.604321,569405.884757,184366.334327,404601.150396



###  Fast Fourier transform

###### Steps:  
    
- transformation into frequency domain (FFT)
- choose amplitudes with highest average absolute value (the top 10%)


In this method of feature extraction, data is transformed into the frequency domain using an FFT function for the discrete Fourier transform. More detail about the FFT can be found in [1_FFT_and_Reconstruction.ipynb](1_FFT_and_Reconstruction.ipynb).

This step is an unsupervised extraction method (i.e. it is done without knowledge of the cycle's group affiliation) and is used to reduce the dimension for further steps.
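
The two steps listed above can be sketched on synthetic data. The example below is only an illustration with NumPy: it builds five artificial cycles containing a 50 Hz and a 120 Hz component (made-up values, not taken from the dataset), transforms them with `np.fft.rfft`, and keeps the 10% of frequency bins with the highest average absolute value.

```python
import numpy as np

fs = 2000                      # sampling rate in Hz (2000 samples per 1 s cycle)
t = np.arange(fs) / fs         # time axis of one cycle
rng = np.random.default_rng(0)

# Synthetic stand-in for one sensor: 5 cycles (columns) of 2000 samples (rows).
cycles = np.stack(
    [3.0 * np.sin(2 * np.pi * 50 * t)
     + 1.0 * np.sin(2 * np.pi * 120 * t)
     + 0.01 * rng.standard_normal(fs) for _ in range(5)],
    axis=1,
)

spectrum = np.fft.rfft(cycles, axis=0)            # shape (1001, 5), complex
freq = np.fft.rfftfreq(fs, d=1 / fs)              # 0, 1, ..., 1000 Hz
mean_abs = np.abs(spectrum).mean(axis=1)          # average over all cycles

n_keep = round(0.10 * freq.size)                  # top 10% of the bins
top_bins = np.argsort(mean_abs)[::-1][:n_keep]    # strongest bins first
print(freq[top_bins[:2]])                         # the two dominant frequencies
```

No class labels are used anywhere in this selection, which is what makes the step unsupervised.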





###### A function is created, which takes as input: 
- data from one sensor `sensor`,                                 
- number of samples `n_of_samples`,                                    
- percentage of data to choose `N`.

The function performs a fast Fourier transform and chooses the N% of the spectrum with the highest average of absolute values, for each sensor independently. The average of absolute values for one frequency is calculated over all cycles.                                   


###### Function returns:
- `freq_of_sorted_values`: matrix sized [1, N% of features (amplitudes)] whose elements are the chosen frequencies; they label the columns of the second output of this function.
- `sorted_values_matrix`: matrix sized [number of cycles, N% of features (amplitudes)] where each row represents one cycle and the columns are sorted by the average of absolute values for each frequency (column).

In [13]:
def chooseAndReturnOrdered(sensor, n_of_samples, N): 
    x_measurements=range(sensor.shape[0])                 # Indices of the measurement samples in one cycle.
    x = np.true_divide(x_measurements, n_of_samples)      # Time values, used as the real time axis.
    freq = np.fft.rfftfreq(x.size, 0.0005)                # Frequency axis, can be used for plotting in frequency domain.
    fft_amplitudes = np.fft.rfft(sensor,n_of_samples,0)   # Ndarray of complex amplitudes after Fourier transform.
    fft_matrix = pd.DataFrame(fft_amplitudes)             # Transforming amplitudes into data frame (matrix)-
                                                          # -where one column represents amplitudes of one-
                                                          # -cycle.
    fft_matrix=fft_matrix.transpose()                     # Transposing to matrix where rows are cycles.
    n_rows, n_columns = np.shape(fft_matrix)

    print("\nNumber of cycles is: %s, and number of features is: %s" % (n_rows, n_columns))
    fft_matrix.columns = freq                    # Column labels are frequencies. 
    
    # Calculating the average of absolute values for each frequency (column).
    absolute_average_values_from_columns=(np.abs(fft_matrix)).mean()
    
    # Sorting the fft_matrix by the average of absolute values for each frequency (column).
    fft_matrix=fft_matrix.reindex(absolute_average_values_from_columns.sort_values(ascending=False).index, axis=1)
    
    # Taking the first N percent of columns from the sorted fft_matrix. 
    n_selected=round((N/100.0)*len(freq))
    sorted_values_matrix=fft_matrix.iloc[:,:n_selected].copy()
    
    n_rows, n_columns = np.shape(sorted_values_matrix)
    print("\nNumber of cycles is: %s, and number of selected features is: %s" % (n_rows, n_columns))
    print(np.shape(sorted_values_matrix))
    
    # The selected frequencies are the column labels of the sorted data frame. 
    freq_of_sorted_values=(pd.DataFrame(sorted_values_matrix.columns)).transpose()
    print("\nFirst 10 selected frequencies are:\n\n %s" % freq_of_sorted_values.values[:,:10])
    
    sorted_values_matrix.columns=range(n_selected) # Resetting the column labels.
    print("---------------------------------------------------------------------------------\n")
    # Output "sorted_values_matrix" is a data frame whose rows-
    # -are cycles and columns are selected frequencies. For example,- 
    # -value at position (i,j) is amplitude for frequency j in cycle i.
    
    return freq_of_sorted_values, sorted_values_matrix



###### Function execution


*Instead of executing the function, the results of extracting the 10% highest amplitudes by FFT can be read in the next steps. The values were obtained with the train/test split described above.*

In [14]:
import os
import h5py

amp_fft1= h5py.File("Sorted_vaules_from_all_sensors.hdf5","r")
freq_fft1= h5py.File("Sorted_freq_from_all_sensors.hdf5","r")  

In [15]:
freq_of_sorted_values=[0]*len(sensor_train)
sorted_values_from_all_sensors=[0]*len(sensor_train)
for i in range(len(sensor)):
    freq_of_sorted_values[i]=freq_fft1["freq_of_sorted_values"+str(i)]
    sorted_values_from_all_sensors[i]=amp_fft1["sorted_values_from_all_sensors"+str(i)]

In [16]:
for i in range(len(sensor)):
    freq_of_sorted_values[i]=pd.DataFrame( freq_of_sorted_values[i])
    sorted_values_from_all_sensors[i]=pd.DataFrame( sorted_values_from_all_sensors[i])

The user is asked to define what percentage of features from the frequency domain will be extracted in this step. Then the function is executed for each sensor, and the extracted data is stored in two lists containing the data frames mentioned above. The lists `freq_of_sorted_values` and `sorted_values_from_all_sensors` store the function outputs, and further selection of features continues on the list `sorted_values_from_all_sensors`. The frequency information is going to be used for feature extraction from the testing data, because these frequencies are the pattern learned from the training data and are used for selecting features from the testing data or from new data that need to be predicted. 
For all sensors, the selected frequencies with the most dominant amplitudes are listed as output.
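
To illustrate how frequencies learned on the training cycles can be reused on unseen cycles, here is a minimal sketch. The helper `select_learned_bins` is hypothetical (it is not part of this notebook), and the test signal is synthetic; the sketch assumes the same sampling rate and cycle length as the training data.

```python
import numpy as np

def select_learned_bins(cycles, learned_freqs, fs=2000):
    """Return rfft coefficients of `cycles` (samples x cycles) at `learned_freqs`.

    Rows of the result are cycles; columns follow the order of `learned_freqs`.
    """
    n = cycles.shape[0]
    freq = np.fft.rfftfreq(n, d=1 / fs)
    spectrum = np.fft.rfft(cycles, axis=0)
    # Map each learned frequency to the nearest bin of this spectrum.
    bins = np.array([np.argmin(np.abs(freq - f)) for f in learned_freqs])
    return spectrum[bins, :].T

# Two synthetic test cycles: a 480 Hz sine with amplitude 1 and 2.
t = np.arange(2000) / 2000
test_cycles = np.sin(2 * np.pi * 480 * t)[:, None] * np.array([[1.0, 2.0]])

features = select_learned_bins(test_cycles, learned_freqs=[480.0, 0.0, 85.0])
print(np.abs(features).round(1))
```

The important design point is that no sorting happens on the test data: the column order is fixed by the training result, so train and test features stay aligned.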

In [17]:
n_of_samples=np.shape(sensor_train[0])[0]

N = int(input("Optimal and recommended percentage of features for this dataset is 10. \n\nEnter a percentage of features: "))
print("\n\n")
# Initialising the lists with 11 elements, which hold the data frames returned by the function for each sensor.
freq_of_sorted_values=[0]*len(sensor_train)
sorted_values_from_all_sensors=[0]*len(sensor_train)

for i in range(len(sensor_train)):                     
    print("Sensor number %s" % i)
    print("---------------------------------------------------------------------------------")
    freq_of_sorted_values[i],sorted_values_from_all_sensors[i]=chooseAndReturnOrdered(sensor_train[i], n_of_samples, N)

Optimal and recommended percentage of features for this dataset is 10. 

Enter a percentage of features: 10



Sensor number 0
---------------------------------------------------------------------------------

Number of cycles is: 5347, and number of features is: 1001

Number of cycles is: 5347, and number of selected features is: 100
(5347, 100)

First 10 selected frequencies are:

 [[480.   0.  85. 640. 100. 120.   1.   2.   3. 121.]]
---------------------------------------------------------------------------------

Sensor number 1
---------------------------------------------------------------------------------

Number of cycles is: 5347, and number of features is: 1001

Number of cycles is: 5347, and number of selected features is: 100
(5347, 100)

First 10 selected frequencies are:

 [[480.  80.  79. 479.   2. 481.  78.   0. 120.  77.]]
---------------------------------------------------------------------------------

Sensor number 2
-----------------------------------------------

Features are complex numbers, because the output of the Fourier transform encodes both amplitudes and phase shifts. For the methods used here, amplitudes are more important than phase shifts, so the features that will be used are the absolute values of the complex coefficients.
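
As a small illustration of how amplitude and phase are contained in one complex coefficient, the sketch below transforms a synthetic 100 Hz cosine (made-up amplitude 2 and phase shift π/4) and reads both quantities from the corresponding bin with `np.abs` and `np.angle`.

```python
import numpy as np

t = np.arange(2000) / 2000
signal = 2.0 * np.cos(2 * np.pi * 100 * t + np.pi / 4)

coeff = np.fft.rfft(signal)[100]      # complex coefficient of the 100 Hz bin
amplitude = np.abs(coeff)             # magnitude, the quantity used as feature
phase = np.angle(coeff)               # phase shift in radians
print(amplitude, phase)
```

Note that the unnormalised magnitude is the signal amplitude multiplied by half the number of samples (here 2 × 1000).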

_Example of frequency labels for features extracted from microphone with the best Fourier coefficients method._

Example. We will take the first sensor, which is the microphone. Its most dominant frequency is 480 Hz, followed by 0 Hz, as can be seen in `freq_of_sorted_values[0]`. The first two columns in `sorted_values_from_all_sensors[0]` are the amplitudes through all measurement cycles for the frequencies 480 Hz and 0 Hz, respectively. 

In [18]:
freq_of_sorted_values[0].head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,480.0,0.0,85.0,640.0,100.0,120.0,1.0,2.0,3.0,121.0,...,128.0,149.0,803.0,759.0,68.0,802.0,837.0,91.0,79.0,117.0


### Discrete Fourier Transform

In order to include uncertainties in feature extraction using BFC, the following steps are conducted:
    
- transformation into frequency domain (DFT)
- extraction of amplitudes with highest average absolute value (the top 10%), with corresponding phases and uncertainties

The results of the DFT are obtained by using the function `Time2AmpPhase_multi`. The function keeps the uncertainties in a sparse matrix in order to avoid memory errors during calculation for a high number of cycles. Using a sparse matrix is possible because most of the covariance values are zero. 
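
To give a feeling for the memory argument, the following sketch builds a covariance matrix with only three non-zero diagonals using `scipy.sparse.dia_matrix` and compares its storage with a dense array of the same shape. The block layout `[[UAA, UAP], [UAP, UPP]]` and the numeric values are assumptions made for this illustration.

```python
import numpy as np
from scipy.sparse import dia_matrix

n = 1001                                  # number of frequency bins per cycle
main = np.ones(2 * n)                     # variances of amplitudes and phases
off = np.full(2 * n, 0.5)                 # amplitude-phase covariances

# Diagonals at offsets 0, +n and -n of a (2n x 2n) matrix.
U = dia_matrix((np.vstack([main, off, off]), [0, n, -n]), shape=(2 * n, 2 * n))

dense_bytes = (2 * n) ** 2 * 8            # float64 storage of the dense matrix
sparse_bytes = U.data.nbytes              # storage of the three diagonals
print(dense_bytes // sparse_bytes)        # ratio of dense to sparse storage
```

With thousands of cycles per sensor, this difference in storage per cycle is what decides whether the calculation fits into memory.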

###### Function returns:
- `A`: np.ndarray sized (M, L) whose elements are the amplitudes in the frequency domain for M cycles; by default L = N//2+1 for time domain signals of length N.
- `P`: np.ndarray sized (M, L) whose elements are the corresponding phases for M cycles.
- `UAP`: np.ndarray sized (M, 3L) whose elements are the squared standard uncertainties of amplitudes and phases and their covariances for M cycles.


In [19]:

from PyDynamic.uncertainty.propagate_DFT import GUM_DFT
from PyDynamic.uncertainty.propagate_DFT import DFT2AmpPhase
def Time2AmpPhase_multi(x, Ux, selector=None):
    """Transformation from time domain to amplitude and phase for a set of M signals of the same type.

    Parameters
    ----------
    x: np.ndarray of shape (M,N)
        M time domain signals of length N
    Ux: np.ndarray of shape (M,)
        squared standard deviations representing noise variances of the signals x
    selector: np.ndarray of shape (L,), optional
        indices of amplitude and phase values that should be returned;
        default is all N//2+1 frequency bins

    Returns
    -------
    A: np.ndarray of shape (M,L)
        amplitude values
    P: np.ndarray of shape (M,L)
        phase values
    UAP: np.ndarray of shape (M,3L)
        diagonals of the covariance matrices: [diag(UAA), diag(UAP), diag(UPP)]
    """
    
    M, nx = x.shape
    assert(len(Ux)==M)
    N = nx//2+1
    if not isinstance(selector, np.ndarray):
        selector = np.arange(nx//2+1)
    ns = len(selector)

    A = np.zeros((M,ns))
    P = np.zeros_like(A)
    UAP = np.zeros((M, 3*ns))
    CxCos = None
    CxSin = None
    for m in range(M):
        F, UF, CX = GUM_DFT(x[m,:], Ux[m], CxCos, CxSin, returnC=True)
        CxCos = CX["CxCos"]
        CxSin = CX["CxSin"]
        A_m, P_m, UAP_m = DFT2AmpPhase(F, UF, keep_sparse=True)
        A[m,:] = A_m[selector]
        P[m,:] = P_m[selector]
        UAP[m,:ns] = UAP_m.data[0][:N][selector]
        UAP[m,ns:2*ns] = UAP_m.data[1][UAP_m.offsets[1]:2*N+UAP_m.offsets[1]][selector]
        UAP[m, 2*ns:] = UAP_m.data[0][N:][selector]
        
    return A, P, UAP

Amplitudes are averaged over all cycles, and the 10% of frequencies with the highest averages are extracted. Phases and uncertainties are reordered to follow the sorting of the amplitudes, so their values correspond to the chosen 10% of amplitudes.
###### Function `chooseAndReturnOrdered_with_uncertainty` returns:
- `freq_of_sorted_values`: matrix sized [1, N% of features (amplitudes)] whose elements are the chosen frequencies; they label the columns of the other outputs of this function.
- `sorted_values_amp`: sized [number of cycles, N% of features (amplitudes)] where each row represents one cycle and the columns are sorted by the average of absolute values of amplitudes for each frequency (column).
- `sorted_values_phases`: sized [number of cycles, N% of features (phases)] where each row represents one cycle and the columns of phases follow the sorting of the amplitudes. 
- `sorted_values_uncert_aa`: sized [number of cycles, N% of features (uncertainties)] where each row represents one cycle and the columns of squared standard uncertainties for amplitudes follow the sorting of the amplitudes. 
- `sorted_values_uncert_ap`: sized [number of cycles, N% of features (uncertainties)] where each row represents one cycle and the columns of covariances between amplitudes and phases follow the sorting of the amplitudes. 
- `sorted_values_uncert_pp`: sized [number of cycles, N% of features (uncertainties)] where each row represents one cycle and the columns of squared standard uncertainties for phases follow the sorting of the amplitudes. 
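
The idea of reordering phases and uncertainties with the amplitude ranking can be sketched with plain NumPy indexing. The arrays below are random placeholders, and the block layout `[u_aa | u_ap | u_pp]` of `UAP` is the one assumed throughout this section.

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 4, 6                                     # cycles, frequency bins
A = rng.random((M, n))                          # amplitudes
P = rng.uniform(-np.pi, np.pi, (M, n))          # phases
UAP = rng.random((M, 3 * n))                    # [u_aa | u_ap | u_pp] blocks

# One ranking, computed from the amplitudes only.
order = np.argsort(A.mean(axis=0))[::-1]        # bins by descending mean amplitude
n_keep = 2
keep = order[:n_keep]

# The same column indices are applied to every other array.
A_sel = A[:, keep]
P_sel = P[:, keep]
u_aa, u_ap, u_pp = (UAP[:, k * n:(k + 1) * n][:, keep] for k in range(3))
print(A_sel.shape, P_sel.shape, u_aa.shape, u_ap.shape, u_pp.shape)
```

Computing the ranking once and reusing it keeps column j of every output tied to the same frequency.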

In [20]:
from PyDynamic.uncertainty.propagate_DFT import GUM_DFT
from PyDynamic.uncertainty.propagate_DFT import DFT2AmpPhase

def chooseAndReturnOrdered_with_uncertainty(sensor, n_of_samples, N,sigma):
    
    x_measurements=range(sensor.shape[0])        
    n_of_samples=np.shape(sensor)[0]                            # number of sampling points
    x = np.true_divide(x_measurements, n_of_samples)            # time steps 
    freq=PyDynamic.uncertainty.propagate_DFT.GUM_DFTfreq(x.size, 0.0005)
    ux=np.ones(sensor.shape[1])*sigma**2
    a,b,c=Time2AmpPhase_multi(sensor.values.transpose(),ux)
    a=pd.DataFrame(a) #amplitudes (M,N)
    b=pd.DataFrame(b) #phases (M,N)
    c=pd.DataFrame(c) #uncertainties (M,3*N)
    a.columns = freq                    # Column labels are frequencies. 
    n_rows, n_columns=a.shape
    print("\nNumber of cycles is: %s, and number of features is: %s" % (n_rows, n_columns))
    # Calculating the average of absolute values for each frequency (column).
    absolute_average_values_from_columns=np.abs(a.mean())
    # Sorting column indices in amplitudes for sorting phases and uncertainties
    sorted_columns=np.argsort(absolute_average_values_from_columns)[::-1]
    # The uncertainty matrix is three times as wide as amplitudes and phases ([aa | ap | pp]), so the column indices of the ap and pp blocks are shifted to follow the sorting.
    sorted_columns_unc_ap=(sorted_columns+a.shape[1])
    sorted_columns_unc_pp=(sorted_columns+a.shape[1]*2)
    sorted_columns_unc=np.concatenate((sorted_columns,sorted_columns_unc_ap,sorted_columns_unc_pp))#sorted indices for uncertainties
    # Reindexing all matrices based on columns.
    a=a.reindex((np.abs(a)).mean().sort_values(ascending=False).index, axis=1)
    b=b.reindex(columns=sorted_columns)
    c=c.reindex(columns=sorted_columns_unc)
    # Taking the first N percent of columns from sorted amplitudes, phases and uncertainties. 
    sorted_values_amp=a.iloc[:,:round((N/100.0)*len(freq))]
    sorted_values_phases=b.iloc[:,:round((N/100.0)*len(freq))] 
    sorted_values_uncert_aa=c.iloc[:,:round((N/100.0)*len(freq))]
    sorted_values_uncert_ap=c.iloc[:,len(freq):a.shape[1]+round((N/100.0)*len(freq))]
    sorted_values_uncert_pp=c.iloc[:,2*len(freq):a.shape[1]*2+round((N/100.0)*len(freq))]                                             
    n_rows, n_columns = np.shape(sorted_values_amp)
    print("\nNumber of cycles is: %s, and number of selected features is: %s" % (n_rows, n_columns))
    print(np.shape(sorted_values_amp))
    
    # The selected frequencies are the column labels of the sorted data frame. 
    freq_of_sorted_values=(pd.DataFrame(sorted_values_amp.columns)).transpose()
    print("\nFirst 10 selected frequencies are:\n\n %s" % freq_of_sorted_values.values[:,:10])
    
    # Resetting the column labels.
    sorted_values_amp.columns=range(round((N/100.0)*len(freq)))
    sorted_values_phases.columns=range(round((N/100.0)*len(freq)))
    sorted_values_uncert_aa.columns=range(round((N/100.0)*len(freq)))
    sorted_values_uncert_ap.columns=range(round((N/100.0)*len(freq)))
    sorted_values_uncert_pp.columns=range(round((N/100.0)*len(freq)))

    print("---------------------------------------------------------------------------------\n")
    # Output "sorted_values_amp" is a data frame whose rows-
    # -are cycles and columns are selected frequencies. For example,- 
    # -value at position (i,j) is amplitude for frequency j in cycle i.
    return freq_of_sorted_values,sorted_values_amp,sorted_values_phases,sorted_values_uncert_aa,sorted_values_uncert_ap, sorted_values_uncert_pp
    
   

###### Function execution

*Instead of executing the function, the results of extracting the 10% highest amplitudes by DFT can be read in the next steps. The values were obtained with the train/test split described above. The sigma value, representing white noise, was assumed to be 0.1.*

In [21]:
import os
import h5py

amp_dft2= h5py.File("DFTSorted_vaules__from_all_sensors.hdf5","r")
freq_dft2= h5py.File("DFTSorted_freq_from_all_sensors.hdf5","r") 
ph_dft2= h5py.File("DFTSorted_ph_from_all_sensors.hdf5","r")
u_a_dft2= h5py.File("DFTSorted_uncer_from_all_sensors_a.hdf5","r")
u_ap_dft2= h5py.File("DFTSorted_uncer_from_all_sensors_ap.hdf5","r")    
u_pp_dft2= h5py.File("DFTSorted_uncer_from_all_sensors_pp.hdf5","r")    


In [22]:
freq_of_sorted_values=[0]*len(sensor_train)
sorted_values__amp_from_all_sensors=[0]*len(sensor_train)
sorted_phases_from_all_sensors=[0]*len(sensor_train)
sorted_uncer_from_all_sensors_a=[0]*len(sensor_train)
sorted_uncer_from_all_sensors_ap=[0]*len(sensor_train)
sorted_uncer_from_all_sensors_pp=[0]*len(sensor_train)
for i in range(len(sensor)):
    freq_of_sorted_values[i]=freq_dft2["freq_of_sorted_values"+str(i)]
    sorted_values__amp_from_all_sensors[i]=amp_dft2["sorted_values_amp_from_all_sensors"+str(i)]
    sorted_phases_from_all_sensors[i]=ph_dft2["sorted_phases_from_all_sensors"+str(i)]
    sorted_uncer_from_all_sensors_a[i]=u_a_dft2["sorted_uncer_from_all_sensors_a"+str(i)]
    sorted_uncer_from_all_sensors_ap[i]=u_ap_dft2["sorted_uncer_from_all_sensors_ap"+str(i)]
    sorted_uncer_from_all_sensors_pp[i]=u_pp_dft2["sorted_uncer_from_all_sensors_pp"+str(i)]

In [23]:
for i in range(len(sensor)):
    freq_of_sorted_values[i]=pd.DataFrame(freq_of_sorted_values[i])
    sorted_values__amp_from_all_sensors[i]=pd.DataFrame(sorted_values__amp_from_all_sensors[i])
    sorted_phases_from_all_sensors[i]=pd.DataFrame(sorted_phases_from_all_sensors[i])
    sorted_uncer_from_all_sensors_a[i]=pd.DataFrame(sorted_uncer_from_all_sensors_a[i])
    sorted_uncer_from_all_sensors_ap[i]=pd.DataFrame(sorted_uncer_from_all_sensors_ap[i])
    sorted_uncer_from_all_sensors_pp[i]=pd.DataFrame(sorted_uncer_from_all_sensors_pp[i])


The user is asked to define what percentage of features from the frequency domain will be extracted in this step. Then the function is executed for each sensor, and the extracted data is stored in six lists containing the data frames mentioned above. The lists:
- freq_of_sorted_values 
- sorted_values__amp_from_all_sensors 
- sorted_phases_from_all_sensors
- sorted_uncer_from_all_sensors_a
- sorted_uncer_from_all_sensors_ap
- sorted_uncer_from_all_sensors_pp

store the function outputs, and further selection of features continues in a loop. The frequency information is going to be used for feature extraction from the testing data, because these frequencies are the pattern learned from the training data and are used for selecting features from the testing data or from new data that need to be predicted. 
For all sensors, the selected frequencies with the most dominant amplitudes are listed as output.

In [24]:
#function execution
n_of_samples=np.shape(sensor_train[0])[0]

N = int(input("Optimal and recommended percentage of features for this dataset is 10. \n\nEnter a percentage of features: "))
print("\n\n")
sigma=float(input("Assume standard deviation"))
# Initialising the lists with 11 elements, which hold the data frames returned by the function for each sensor.
                    
freq_of_sorted_values=[0]*len(sensor_train)
sorted_values__amp_from_all_sensors=[0]*len(sensor_train)
sorted_phases_from_all_sensors=[0]*len(sensor_train)
sorted_uncer_from_all_sensors_a=[0]*len(sensor_train)
sorted_uncer_from_all_sensors_ap=[0]*len(sensor_train)
sorted_uncer_from_all_sensors_pp=[0]*len(sensor_train)

    
for i in range(len(sensor_train)):                     
    print("Sensor number %s" % i)
    print("---------------------------------------------------------------------------------")
    freq_of_sorted_values[i],sorted_values__amp_from_all_sensors[i],sorted_phases_from_all_sensors[i],sorted_uncer_from_all_sensors_a[i],sorted_uncer_from_all_sensors_ap[i],sorted_uncer_from_all_sensors_pp[i]=chooseAndReturnOrdered_with_uncertainty(sensor_train[i], n_of_samples, N,sigma)
    
    
    

Optimal and recommended percentage of features for this dataset is 10. 

Enter a percentage of features: 10



Assume standard deviation0.1
Sensor number 0
---------------------------------------------------------------------------------

Number of cycles is: 5347, and number of features is: 1001

Number of cycles is: 5347, and number of selected features is: 100
(5347, 100)

First 10 selected frequencies are:

 [[480.   0.  85. 640. 100. 120.   1.   2.   3. 121.]]
---------------------------------------------------------------------------------

Sensor number 1
---------------------------------------------------------------------------------


KeyboardInterrupt: 

In [None]:
freq_of_sorted_values[0]

*Note: When amplitudes are small relative to the uncertainty associated with the real and imaginary parts, the GUM uncertainty propagation becomes unreliable and a Monte Carlo method is recommended instead. Consequently, GUM2DFT raises a warning whenever an amplitude value is below a pre-defined threshold and recommends using a Monte Carlo method. The default threshold in GUM2DFT is 1.0, but it may be adjusted for specific applications.[7]*
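
The effect described in this note can be reproduced with a small Monte Carlo experiment in plain NumPy. This is only a sketch and not the method implemented in PyDynamic. It assumes uncorrelated Gaussian noise with equal standard uncertainty `sigma` on the real and imaginary parts, in which case first-order (GUM) propagation gives u(A) = sigma regardless of the amplitude.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.0                       # std. uncertainty of real and imaginary part
draws = 200_000

def mc_amplitude_std(re, im):
    """Monte Carlo standard deviation of |re + i*im| under Gaussian noise."""
    samples = np.hypot(re + sigma * rng.standard_normal(draws),
                       im + sigma * rng.standard_normal(draws))
    return samples.std()

large = mc_amplitude_std(100.0, 0.0)   # amplitude much larger than sigma
small = mc_amplitude_std(0.1, 0.0)     # amplitude much smaller than sigma
print(large, small)
```

For the large amplitude the Monte Carlo result agrees with the first-order value of about 1.0, while for the small amplitude it is clearly lower, which shows why the linearised propagation should not be trusted in that regime.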


An overview of the results:

*Note: Be aware that the split of the data into train and test is random. The results shown here are for one random split. The FFT and DFT functions were executed for the same split of data.*

In [None]:
sorted_values__amp_from_all_sensors[0].head(2)

In [None]:
sorted_uncer_from_all_sensors_a[0].head(2)

In [None]:
sorted_phases_from_all_sensors[0].head(2)

In [None]:
sorted_uncer_from_all_sensors_pp[0].head(2)

In [None]:
sorted_uncer_from_all_sensors_ap[0].head(2)

In [None]:
freq_of_sorted_values[0]

### Additional: Transformation to time domain for all sensors

Transformation from amplitude and phase to the time domain is demonstrated with the functions `Reconstruct_time_domain` (based on PyDynamic's function AmpPhase2Time) and `Reconstruct_time_domain_idft` (based on PyDynamic's functions AmpPhase2DFT and GUM_iDFT).

First, zero arrays for amplitudes, phases and uncertainties are created (A, P, UAP). Then the function copies the selected N% of values from the arguments (amplitudes, phases, u_a, u_ap, u_pp) into the column indices (frequencies) of the initial zero arrays. 
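
The principle of filling zero arrays at the selected frequency indices and transforming back can be sketched without uncertainties, using only NumPy. The signal and the selected bins below are made up for the illustration.

```python
import numpy as np

n_time = 2000
t = np.arange(n_time) / n_time
signal = 3.0 * np.cos(2 * np.pi * 50 * t) + np.cos(2 * np.pi * 120 * t + 0.5)

spectrum = np.fft.rfft(signal)
keep = np.array([50, 120])                 # bins of the selected coefficients

# Zero arrays for the full spectrum; only the selected bins are filled in.
A = np.zeros(spectrum.size)
P = np.zeros(spectrum.size)
A[keep] = np.abs(spectrum[keep])
P[keep] = np.angle(spectrum[keep])

# Amplitude and phase -> complex coefficients -> time domain.
reconstructed = np.fft.irfft(A * np.exp(1j * P), n=n_time)
print(np.max(np.abs(reconstructed - signal)))
```

Here the reconstruction is essentially exact because the signal consists only of the two selected frequencies; for measured data, the discarded bins determine the reconstruction error.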

The function `Reconstruct_time_domain` creates a sparse matrix from the values contained in the matrices of uncertainties for amplitudes, phases and their covariances, because UAP was previously created from a sparse matrix. For each cycle, the function returns:  
- `x` (np.ndarray): vector of time domain values and 
- `ux` (np.ndarray): diagonal of the covariance matrix associated with x.


In [None]:

from scipy.sparse import dia_matrix
from PyDynamic.uncertainty.propagate_DFT import AmpPhase2Time

def Reconstruct_time_domain(N,frequencies,amplitudes,phases,u_a,u_ap,u_pp):
    
    M,num=amplitudes.shape
    length=int(amplitudes.shape[1])*N+1 
    length_2=2*length
    length_3=3*length 
    x=np.zeros((M, length_2-2))
    #storing uncertainties in a list of arrays
    ux=np.zeros((M, length_2-2))
    #predefining A,P,UAP as zero arrays
    A =np.zeros((M, length))
    P= np.zeros((M, length))
    UAP=np.zeros((M, 3*length))
    assert(amplitudes.shape==phases.shape)
    # Indices of columns with highest amplitudes in original matrix (resulted from DFT) are accessible from the
    #sorted frequencies of all sensors
    Index_amplitudes=frequencies[:,:N] 
    # Defining offsets for sparse matrix  
    offset_UAP=[0,length,-length]
     #indices(columns) of 10% highest amplitude values
    Index_amplitudes=Index_amplitudes.astype(int)
    col = np.array(Index_amplitudes[0])
    #Values of 10% highest amplitudes(first N columns of input amplitudes) are copied in A,P,UAP in corresponding indices 
    #of columns. Other columns are zeros. 
    amp_col=np.arange(N)
    A[:, [col]]= amplitudes[:, [amp_col]]
    P[:, [col]]= phases[:, [amp_col]]
    UAP[:,[col]]=u_a[:,[amp_col]]
    UAP[:,[col+length]]=u_ap[:,[amp_col]]
    UAP[:,[col+length_2]]=u_pp[:,[amp_col]]
    for m in range(M): 
        # Defining diagonals for sparse matrix  
        diag1=np.zeros(length_2)
        diag2=np.zeros(length_2)
        diag1[:length]=UAP[m,:length]
        diag1[length:]=UAP[m,length_2:]
        diag3=np.zeros(length_2)
        diag2[offset_UAP[1]:length_2+offset_UAP[1]]=UAP[m][length:length_2] 
        diag3[offset_UAP[1]:length_2+offset_UAP[1]]=UAP[m][length:length_2]
        diagonals =[diag1,diag2,diag3]
        # Creating sparse matrix with three diagonals. Diag1 is the main diagonal.
        Sparse_matr=dia_matrix((diagonals,offset_UAP),shape=((length_2, length_2)))
        X,UX=AmpPhase2Time(A[m,:], P[m,:], Sparse_matr)
        x[m,:] = X
        ux[m,:]=np.diag(UX)
  
    return    x,ux
 

The function `Reconstruct_time_domain_idft` performs the transformation in two steps, from amplitudes and phases to real and imaginary parts and then to the time domain, taking into account the squared standard uncertainties of amplitudes and phases.  
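
The first of these two steps can be illustrated for a single frequency bin. The sketch below applies first-order propagation with the Jacobian of (Re, Im) = (A cos P, A sin P); it is a simplified stand-in and not PyDynamic's implementation, and the helper name `amp_phase_to_re_im` and the numeric values are hypothetical.

```python
import numpy as np

def amp_phase_to_re_im(A, P, U_ap):
    """First-order propagation of a 2x2 covariance from (A, P) to (Re, Im)."""
    re, im = A * np.cos(P), A * np.sin(P)
    # Jacobian of (re, im) with respect to (A, P).
    J = np.array([[np.cos(P), -A * np.sin(P)],
                  [np.sin(P),  A * np.cos(P)]])
    return re, im, J @ U_ap @ J.T

# Amplitude 2 with variance 0.01, phase 0 with variance 0.04, no covariance.
re, im, U_ri = amp_phase_to_re_im(2.0, 0.0, np.diag([0.01, 0.04]))
print(re, im)
print(U_ri)
```

At phase 0 the amplitude variance maps directly onto the real part, while the phase variance is scaled by A² and appears in the imaginary part.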

In [None]:
from PyDynamic.uncertainty.propagate_DFT import AmpPhase2DFT,GUM_iDFT
import h5py
def Reconstruct_time_domain_idft(N,frequencies,amplitudes,phases,u_a,u_pp):
   
    M,num=amplitudes.shape
    length=int(amplitudes.shape[1])*N+1 # TODO: change the index, generate it
    length_2=2*length
    x=np.zeros((M, length_2-2))
    #storing uncertainties in a list of arrays
    ux=np.zeros((M, length_2-2))
    #predefining A,P,UAP as zero arrays
    A =np.zeros((M, length))
    P= np.zeros_like(A)
    UAP=np.zeros((M, length_2)) #UAP contains squared standard uncertainties of amplitudes and phases
    assert(amplitudes.shape==phases.shape)
    # Indices of columns with highest amplitudes
    Index_amplitudes=frequencies[:,:N] #promijeniti index,generisati
    Index_amplitudes=Index_amplitudes.astype(int)
    col = np.array(Index_amplitudes[0])
    #indices(columns) of 10% highest amplitude values
    #first N columns of input amplitudes
    amp_col=np.arange(N)
    A[:, [col]]= amplitudes[:, [amp_col]]
    P[:, [col]]= phases[:, [amp_col]]
    UAP[:,[col]]=u_a[:,[amp_col]]
    UAP[:,[col+length]]=u_pp[:,[amp_col]]
    for m in range(M):
        F,UF=AmpPhase2DFT(A[m,:], P[m,:], UAP[m,:])
        X,UX=GUM_iDFT(F, UF)
        x[m,:] = X
        ux[m,:]=np.diag(UX)
     
    return x,ux
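
The two steps (amplitude/phase to real/imaginary parts, then to the time domain) can be illustrated without uncertainty propagation using plain NumPy. The sketch below is a simplified stand-in, not PyDynamic's implementation: the function name `reconstruct_from_best_coefficients`, the synthetic test signal and the way the best coefficients are selected are all assumptions made for illustration.

```python
import numpy as np

def reconstruct_from_best_coefficients(signal, fraction=0.1):
    """Keep the `fraction` largest-amplitude rfft bins and transform back to the time domain."""
    spectrum = np.fft.rfft(signal)
    amplitudes = np.abs(spectrum)
    phases = np.angle(spectrum)

    # Indices of the bins with the largest amplitudes
    n_keep = max(1, int(round(fraction * (len(spectrum) - 1))))
    keep = np.argsort(amplitudes)[::-1][:n_keep]

    # Zero arrays that are filled only at the kept bins
    A = np.zeros_like(amplitudes)
    P = np.zeros_like(phases)
    A[keep] = amplitudes[keep]
    P[keep] = phases[keep]

    # Step 1: amplitudes and phases -> real and imaginary parts
    real, imag = A * np.cos(P), A * np.sin(P)
    # Step 2: real and imaginary parts -> time domain
    return np.fft.irfft(real + 1j * imag, n=len(signal))

# Synthetic one-second signal with 2000 samples (same length as one cycle in this notebook)
t = np.arange(2000) / 2000.0
signal = 2.0 * np.sin(2 * np.pi * 5 * t) + 0.5 * np.cos(2 * np.pi * 40 * t + 0.3)

reconstructed = reconstruct_from_best_coefficients(signal, fraction=0.1)
max_error = np.max(np.abs(signal - reconstructed))
print("Largest reconstruction error:", max_error)
```

Because this synthetic signal consists of only two sinusoids, the retained bins contain essentially all of its energy. Real sensor signals will generally show a larger reconstruction error.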
   
    

*Instead of executing the function, the reconstructed time-domain signals, obtained by extracting the 10% highest amplitudes with the DFT, can be read from HDF5 files in the next steps. The sigma value representing white noise was assumed to be 0.1.*

In [25]:
#reading data instead of execution 
import os
import h5py

x_time1= h5py.File("Reconstructed-time-signals10.hdf5","r")
ux_time1= h5py.File("Reconstructed-uncert-time-signals10.hdf5","r")    


In [26]:
x_time=[0]*len(sensor_train)
ux_time=[0]*len(sensor_train)
for i in range(len(sensor_train)):
    x_time[i]=x_time1["x_time"+str(i)]
    ux_time[i]=ux_time1["ux_time"+str(i)]

###### Function execution

In [None]:
#function execution
x_time=[0]*len(sensor_train)
ux_time=[0]*len(sensor_train)
N=10 #percentage of amplitudes that were extracted from DFT results
for i in range(len(sensor_train)):                     
    print("Sensor number %s" % i)
    x_time[i],ux_time[i]=Reconstruct_time_domain(
        N,
        freq_of_sorted_values[i].values,
        sorted_values__amp_from_all_sensors[i].values,
        sorted_phases_from_all_sensors[i].values,
        sorted_uncer_from_all_sensors_a[i].values,
        sorted_uncer_from_all_sensors_ap[i].values,
        sorted_uncer_from_all_sensors_pp[i].values,
    )
    
    

In [None]:
#function execution
x_time=[0]*len(sensor_train)
ux_time=[0]*len(sensor_train)
N=10 #percentage of amplitudes that were extracted from DFT results
for i in range(len(sensor_train)):                     
    print("Sensor number %s" % i)
    x_time[i],ux_time[i]=Reconstruct_time_domain_idft(
        N,
        freq_of_sorted_values[i].values,
        sorted_values__amp_from_all_sensors[i].values,
        sorted_phases_from_all_sensors[i].values,
        sorted_uncer_from_all_sensors_a[i].values,
        sorted_uncer_from_all_sensors_pp[i].values,
    )
       
           

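Visualization of the time-domain signal for all cycles. The input signal is plotted together with the signal reconstructed from the selected DFT coefficients; the error bars show the standard uncertainty of the reconstruction: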

In [27]:

#check abs
import matplotlib.pyplot as plt
%matplotlib notebook

import ipywidgets as widgets
from ipywidgets import interact, interact_manual
units=['[Pa]', '[g]', '[g]', '[g]', '[kN]', '[bar]', '[mm/s]', '[A]', '[A]', '[A]', '[A]']
labels1 = ['Microphone','Vibration plain bearing','Vibration piston rod','Vibration ball bearing', 'Axial force','Pressure','Velocity','Active current','Motor current phase 1','Motor current phase 2','Motor current phase 3']
def plot_sensor(sensor,cycle):
   
    plt.figure(figsize=(15,12))
    plt.plot(np.arange(0,1,0.0005),sensor_train[sensor].values.transpose()[cycle,:],label="Input time values")
    plt.ylabel(str(units[sensor]))
    plt.xlabel("Time [s]")
    plt.title(str(labels1[sensor]))
    plt.errorbar(np.arange(0,1,0.0005),x_time[sensor][cycle],yerr=np.sqrt((ux_time[sensor][cycle])),label="Reconstructed time values with DFT", ecolor='orangered',
            color='green')
    # Adding legend to the plot    
    plt.legend(loc='best', frameon=True)

interact(plot_sensor, sensor=range(10),cycle=range(sorted_values__amp_from_all_sensors[0].shape[0]))



In [28]:
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
units=['[Pa]', '[g]', '[g]', '[g]', '[kN]', '[bar]', '[mm/s]', '[A]', '[A]', '[A]', '[A]']
labels1 = ['Microphone','Vibration plain bearing','Vibration piston rod','Vibration ball bearing', 'Axial force','Pressure','Velocity','Active current','Motor current phase 1','Motor current phase 2','Motor current phase 3']
def plot_sensor1(sensor,cycle):
    plt.figure(figsize=(15,12))
    plt.plot(np.arange(0,1,0.0005),sensor_train[sensor].values.transpose()[cycle,:], label="Input time values")
    plt.ylabel(str(units[sensor]))
    plt.xlabel("Time [s]")
    plt.title(str(labels1[sensor]))
    plt.errorbar(np.arange(0,1,0.0005),x_time[sensor][cycle],yerr=np.sqrt((ux_time[sensor][cycle])),label="Reconstructed time values with DFT", ecolor='orangered',
            color='green')
    # Adding legend to the plot    
    plt.legend(loc='best', frameon=True)
# The slider maximum is the last valid cycle index (number of cycles minus one)
interact(plot_sensor1,sensor=range(10),cycle=widgets.IntSlider(min=0, max=sorted_values__amp_from_all_sensors[0].shape[0]-1, step=1))

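
The error bars in the plots are `np.sqrt(ux_time[...])` because `ux_time` holds squared standard uncertainties. As a numerical companion to the visual check, the fraction of input samples lying within two standard uncertainties of the reconstruction can be computed. The sketch below uses synthetic stand-in arrays (`truth`, `x_rec`, `ux_rec` are hypothetical names, and the noise level 0.1 is taken from the sigma value assumed in this notebook), so it shows the idea rather than results for the real sensor data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 2000
sigma = 0.1                                           # assumed white-noise standard deviation

# Synthetic stand-ins for one cycle of the input and of the reconstruction
truth = np.sin(2 * np.pi * 3 * np.arange(n_samples) / n_samples)
x_rec = truth + rng.normal(0.0, sigma, n_samples)     # plays the role of one row of x_time
ux_rec = np.full(n_samples, sigma ** 2)               # plays the role of one row of ux_time (variances)

# Standard uncertainty, as used for yerr in plt.errorbar
std_unc = np.sqrt(ux_rec)

# Fraction of samples where the input lies within +/- 2 standard uncertainties
inside = np.abs(truth - x_rec) <= 2 * std_unc
coverage = inside.mean()
print("Fraction within two standard uncertainties:", coverage)
```

With the real data, `truth` would correspond to `sensor_train[sensor].values.transpose()[cycle,:]` and the other two arrays to `x_time[sensor][cycle]` and `ux_time[sensor][cycle]`.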

In [None]:
train_test1.close()
amp_fft1.close()
freq_fft1.close()
amp_dft2.close()
freq_dft2.close()
ph_dft2.close()
u_a_dft2.close()
u_ap_dft2.close()
u_pp_dft2.close() 
x_time1.close()
ux_time1.close()

### References:

[1]  PTB, ZeMA: Deep dive into the ZeMA machine learning (ppt), January 2019

[2]  https://www.nti-audio.com/en/support/know-how/fast-fourier-transform-fft

[3]  http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r

[4]  https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

[5]  Edouard Duchesnay, Tommy Löfstedt: Statistics and Machine Learning in Python, March 2018

[6]  https://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
