# Part I - Data exploration & preparation

In this notebook, you will...
* Visualize each of the sensors of each dataset
* Join the two datasets together for visualization
* Add features (listed below)


### First, let's start by importing the necessary libraries

In [None]:
from pathlib import Path

import pandas as pd


import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go

### Directories

In [None]:
current_dir = Path.cwd()
root_dir = current_dir.parent
data_dir = Path(current_dir, "data")

### Read in the data
We have two input files, input_dataset_1 and input_dataset_2. Both are parquet files, so we will read them in with the pandas parquet reader.

In [None]:
input_dataset_1 = pd.read_parquet(Path(data_dir, "input_dataset-1.parquet"))
input_dataset_2 = pd.read_parquet(Path(data_dir, "input_dataset-2.parquet"))

## Let's explore the data

In [None]:
input_dataset_1.head()

In [None]:
input_dataset_2.head()

In [None]:
print("We have {} columns in input_dataset_1: {} \n".format(len(input_dataset_1.keys()), input_dataset_1.keys()))
print("We have {} columns in input_dataset_2: {}".format(len(input_dataset_2.keys()), input_dataset_2.keys()))

You'll notice we have 20 columns in input dataset 1 and 22 columns in input dataset 2: two vibration sensors in input dataset 2 are missing from the first dataset. We have sensors for power and reactive power, along with other sensors related to the turbine behaviour (guide vane opening, rotational speed, pressure in the draft tube, and pressure in the spiral casing). We also have information related to the bolts: a temperature sensor, plus torsion and tension measurements for 6 of the bolts installed on the turbine.
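One way to confirm which columns differ between the two datasets is to compare their column indexes directly. The sketch below uses two tiny stand-in frames (`toy_1` and `toy_2`, hypothetical values that are not part of the real data) so that it runs on its own; in the notebook you would pass `input_dataset_1.columns` and `input_dataset_2.columns` instead.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical values, only a few columns)
toy_1 = pd.DataFrame({"Unit_4_Power": [1.0], "Bolt_1_Tensile": [2.0]})
toy_2 = pd.DataFrame({
    "Unit_4_Power": [1.0],
    "Bolt_1_Tensile": [2.0],
    "lower_bearing_vib_vrt": [0.1],
    "turbine_bearing_vib_vrt": [0.2],
})

# Index.difference returns the labels present in one index but not the other
only_in_2 = toy_2.columns.difference(toy_1.columns)
only_in_1 = toy_1.columns.difference(toy_2.columns)

print("Only in dataset 2:", list(only_in_2))
print("Only in dataset 1:", list(only_in_1))
```
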

Let's group the columns for easier evaluation later. We will create one group for operating conditions, one for temperature, one for bolt tensiles, one for bolt torsions, and one for vibrations.

In [None]:
ops = ['Unit_4_Power', 'Unit_4_Reactive Power', 'Turbine_Guide Vane Opening',
       'Turbine_Pressure Drafttube', 'Turbine_Pressure Spiral Casing',
       'Turbine_Rotational Speed', 'mode', 'Bolt_1_Steel tmp']

bolt_temp = ['Bolt_1_Steel tmp']

bolt_tensiles = ['Bolt_1_Tensile', 'Bolt_2_Tensile',  'Bolt_3_Tensile', 
                 'Bolt_4_Tensile', 'Bolt_5_Tensile',  'Bolt_6_Tensile']

bolt_torsions = ['Bolt_1_Torsion',  'Bolt_2_Torsion', 'Bolt_3_Torsion',
                 'Bolt_4_Torsion', 'Bolt_5_Torsion',  'Bolt_6_Torsion']

vibrations = ['lower_bearing_vib_vrt', 'turbine_bearing_vib_vrt'] 
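Because the vibration columns only exist in input dataset 2, selecting a whole group from input dataset 1 would raise a `KeyError`. A minimal sketch of one way to guard against that is shown below. The helper name `present_columns` and the toy frame are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd


def present_columns(df, group):
    """Return the columns of `group` that exist in `df`, preserving order."""
    return [c for c in group if c in df.columns]


# Toy frame standing in for input_dataset_1 (hypothetical values, no vibration columns)
toy_1 = pd.DataFrame({"Bolt_1_Tensile": [1.0], "Bolt_2_Tensile": [1.1]})

# Shortened copies of the groups defined above, to keep the sketch self-contained
vibrations = ["lower_bearing_vib_vrt", "turbine_bearing_vib_vrt"]
bolt_tensiles = ["Bolt_1_Tensile", "Bolt_2_Tensile"]

vib_cols = present_columns(toy_1, vibrations)
tensile_cols = present_columns(toy_1, bolt_tensiles)

print("Vibration columns available:", vib_cols)
print("Tensile columns available:", tensile_cols)
```

With the real data, `input_dataset_1[present_columns(input_dataset_1, vibrations)]` would return an empty selection rather than failing.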

### When does the data start and end?

In [None]:
print("Input Dataset 1:\n start: {} \n end: {}\n".format(input_dataset_1.index.min(), input_dataset_1.index.max()))
print("Input Dataset 2:\n start: {} \n end: {}".format(input_dataset_2.index.min(), input_dataset_2.index.max()))

### What is the sampling rate?

Most of the data is measured at regular intervals, at a 1 Hz sampling rate. However, there are some variations, and there are sometimes longer gaps between measurements. Once we begin looking at the data, we will be able to say whether this is important to consider when training our model.

In [None]:
input_dataset_1.index.to_series().diff().value_counts()

In [None]:
input_dataset_2.index.to_series().diff().value_counts()
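The value counts above show which intervals occur. To see where the longer gaps sit and what share of samples are at the nominal rate, the differences can be filtered directly. This is a sketch under the assumption that the index is a `DatetimeIndex` and that 1 second is the nominal interval; the toy timestamps are hypothetical and only stand in for the real index.

```python
import pandas as pd

# Toy index at roughly 1 Hz with one longer gap (hypothetical timestamps)
idx = pd.to_datetime([
    "2020-01-01 00:00:00",
    "2020-01-01 00:00:01",
    "2020-01-01 00:00:02",
    "2020-01-01 00:00:10",
    "2020-01-01 00:00:11",
])
toy = pd.DataFrame({"Bolt_1_Tensile": [1.0, 1.1, 1.2, 1.3, 1.4]}, index=idx)

# Time between consecutive samples; the first entry is NaT, so drop it
deltas = toy.index.to_series().diff().dropna()

nominal = pd.Timedelta(seconds=1)

# Gaps longer than the nominal interval, labeled by the timestamp that ends each gap
gaps = deltas[deltas > nominal]

# Fraction of intervals that match the nominal 1 s spacing exactly
share_regular = (deltas == nominal).mean()

print(gaps)
print("Share of regular intervals:", share_regular)
```
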

### Let's start looking at the data

In [None]:
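fig, ax1 = plt.subplots(figsize=(20, 10))

# One scatter series per bolt for input dataset 1; small, transparent markers keep dense data readable
for t in bolt_tensiles:
    ax1.scatter(input_dataset_1.index,
                input_dataset_1[t], label=t,
                s=1, alpha=0.1)

fig.legend()

plt.show()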

In [None]:
fig, ax1 = plt.subplots(figsize=(20, 10))

# Same plot for input dataset 2, so the two periods can be compared
for t in bolt_tensiles:
    ax1.scatter(input_dataset_2.index,
                input_dataset_2[t], label=t,
                s=1, alpha=0.1)

fig.legend()

plt.show()