# Water Pump failure

In this notebook we will analyse data from a water pump that experienced frequent failures during spring and summer 2018.

As input we have time series data from 52 sensors that measure different physical properties of the system (such as temperature and pressure). We will try to extract the different working modes of the pump and highlight possible early warning signals of a breakdown.

As always, we will start with some (brief) exploratory analysis, with the aim of examining missing or redundant data.

In [None]:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import os

In [None]:
for path, directories, files in os.walk('sensor_data/'):
    for filename in files:
        print(os.path.join(path, filename))

Load the dataset.

In [None]:
data = pd.read_csv("sensor_data/sensor.csv")

# Data exploration and cleaning

## Exploration

Let's have a first look at the data.

In [None]:
data.head()

**Observations:**


---



*   The data has a timestamp column, so it is probably time series data
*   Sensors do not have descriptive names
*   Sensor values appear to have different ranges
*   There appears to be a redundant column named 'Unnamed: 0'

Let's look further into the data, this time looking more closely at the column types and their statistics.



***Task 1:***

Inspect the columns of the data and identify any defective sensors.

In [None]:
#################
# Your solution #
#################
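
As a hint for Task 1, here is a minimal sketch on a small synthetic frame (the column names and values are invented for illustration and are not taken from the pump dataset). It assumes that a "defective" sensor is one that never reports a value or never changes; other definitions are possible.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real sensor table (all names are hypothetical).
toy = pd.DataFrame({
    "sensor_a": [1.0, 1.2, np.nan, 1.1, 0.9],  # one missing reading
    "sensor_b": [np.nan] * 5,                  # never reports a value
    "sensor_c": [3.0, 3.0, 3.0, 3.0, 3.0],     # stuck at a constant
})

missing_fraction = toy.isna().mean()  # share of NaN per column
n_distinct = toy.nunique()            # distinct non-NaN values per column

# Flag columns that are entirely empty or that never change.
is_suspect = (missing_fraction == 1.0) | (n_distinct <= 1)
suspect = is_suspect[is_suspect].index.tolist()
print(suspect)
```

On the real data, `data.info()` gives a quick overview of the column types and non-null counts before running a check like this.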

In [None]:
data.describe()

**Observations:**


---



*   Sensors have different ranges, e.g. `sensor_00` has mean 2.3 and standard deviation 0.4, while `sensor_04` has mean 590 and standard deviation 144.
*   There are no negative values

***Task 2:***

The time of each event is recorded in the `timestamp` column. Since it is stored as a string, create a new column in which the time is stored as a pandas timestamp object. The later cells expect this column to be named `datetime`.

Also drop the `Unnamed: 0` column since it is just a row count.

In [None]:
#################
# Your solution #
#################
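
One possible way to approach Task 2, sketched on a three-row toy frame (the timestamps and readings are made up for illustration). It relies on `pd.to_datetime` to parse the strings and on `DataFrame.drop` to remove the index-like column.

```python
import pandas as pd

# Toy frame mimicking the layout of the raw table (values are hypothetical).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "timestamp": ["2018-04-01 00:00:00",
                  "2018-04-01 00:01:00",
                  "2018-04-01 00:02:00"],
    "sensor_01": [47.1, 47.1, 47.3],
})

# Parse the string column into proper pandas timestamps.
toy["datetime"] = pd.to_datetime(toy["timestamp"])

# Drop the redundant row counter.
toy = toy.drop(columns=["Unnamed: 0"])
print(toy.dtypes)
```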

## Cleaning

***Task 3:***

For each sensor, identify the percentage of missing data. Also check if there are any duplicated rows. Finally, remove the sensors having more than 3% of missing values.

In [None]:
#################
# Your solution #
#################
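
The three steps of Task 3 could look roughly like this on a synthetic frame (column names and the placement of the gaps are invented for illustration). The 3% threshold is the one stated in the task.

```python
import numpy as np
import pandas as pd

# Toy frame with 100 rows and two hypothetical sensors.
toy = pd.DataFrame({
    "sensor_a": np.arange(100, dtype=float),
    "sensor_b": np.arange(100, dtype=float),
})
toy.loc[50:51, "sensor_a"] = np.nan  # 2 of 100 rows missing (2%)
toy.loc[0:9, "sensor_b"] = np.nan    # 10 of 100 rows missing (10%)

# Percentage of missing values per column.
missing_pct = toy.isna().mean() * 100

# Number of rows that are exact copies of an earlier row.
n_duplicates = toy.duplicated().sum()

# Drop every column whose missing percentage exceeds the threshold.
to_drop = missing_pct[missing_pct > 3].index
cleaned = toy.drop(columns=to_drop)
print(missing_pct, n_duplicates, list(cleaned.columns), sep="\n")
```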

Let's now look at the label column. What are the labels and how are they distributed?

In [None]:
data['machine_status'].value_counts()

**Observations:**


---



*   There are three working statuses, but with very few observations in the `BROKEN` category

In [None]:
data.head()

## Visualizations

***Task 4:***

Visualize the distribution of each sensor's values with a histogram. There should be 48 sensors left at this stage, so you can use a layout of (10, 5), i.e. 10 rows of 5 histograms each. Make sure the figure size is large enough.

In [None]:
#################
# Your solution #
#################
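
A minimal sketch of a histogram grid for Task 4, using six synthetic columns and a (2, 3) layout instead of the (10, 5) layout suggested above. The data and column names are invented; `DataFrame.hist` does the work of arranging one histogram per column.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Six hypothetical sensors with random readings.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 6)),
                   columns=[f"sensor_{i:02d}" for i in range(6)])

# One histogram per column, arranged in 2 rows of 3 plots.
axes = toy.hist(bins=30, layout=(2, 3), figsize=(12, 6))
plt.tight_layout()
```

For the real data, the same call with `layout=(10, 5)` and a taller `figsize` should fit the 48 remaining sensors; in a script, follow it with `plt.show()`.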

**Observations:**


---



*   Most of the sensors have a unimodal distribution, but some have a multimodal one.

Now we can plot the data over time and quickly analyse some of the patterns that appear when the machine is in each `machine_status`.

In [None]:
# Extract the readings from the BROKEN and RECOVERING states of the pump
broken = data[data['machine_status']=='BROKEN']
recovering = data[data['machine_status']=='RECOVERING']

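# Pick the first five sensor columns by name, so that non-sensor columns
# (such as the timestamp) are never selected
sensors_to_plot = [col for col in data.columns if col.startswith('sensor_')][:5]

# Draw one figure per sensor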
for sensor_name in sensors_to_plot:
    plt.figure(figsize=(18, 4))
    # Plot time series for each sensor with RECOVERING state marked with 'O' in orange color
    plt.plot(recovering['datetime'], recovering[sensor_name], 
             linestyle='none', marker='o', color='orange', markersize=8, 
             label='recovering')
    # Plot time series for each sensor with BROKEN state marked with 'X' in red color
    plt.plot(broken['datetime'], broken[sensor_name], linestyle='none', 
             marker='X', color='red', markersize=12, label='broken')
    # Plot time series for each sensor (all status) in green line
    plt.plot(data['datetime'], data[sensor_name], color='green', 
             alpha=0.5, label='working')
    plt.xlabel("Datetime")
    plt.ylabel("Value")
    plt.title(sensor_name)
    plt.legend()
    plt.show()

**Observations:**

---

*   Although some of the sensors show anomalous behaviour before the machine changes to `BROKEN` status, it would be difficult to write explicit rules for identifying a potential failure



# K-means clustering and operating modes

Since we do not have enough labelled failures for supervised learning, we will explore the different working regimes of the pump with unsupervised learning. Let us try K-means clustering.

In [None]:
from sklearn.cluster import KMeans

***Task 5:***

Create `X_train`, a new dataframe which contains only sensor data (no labels, no timestamps).

Then, rescale the sensor data so that they lie in a similar range. If we subtract the column minimum and divide by the column range (maximum minus minimum), all the values will be between 0 and 1.

Finally, since there are still sensors with missing data, fill these gaps. Think about how you would want to fill the missing values of time series data.

In [None]:
#################
# Your solution #
#################
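
The rescaling and gap filling in Task 5 can be sketched as follows on a tiny synthetic frame (names and values are invented). Forward filling is only one option, chosen here as an assumption because it never uses future readings; interpolation would be another reasonable choice.

```python
import numpy as np
import pandas as pd

# Two hypothetical sensors on very different scales, each with one gap.
toy = pd.DataFrame({
    "sensor_a": [10.0, np.nan, 30.0, 20.0],
    "sensor_b": [0.5, 1.5, np.nan, 2.5],
})

# Min-max scaling per column: (x - min) / (max - min).
# min() and max() skip NaN by default, so the gaps stay NaN.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Carry the last valid reading forward, then back-fill any gap
# at the very start of the series.
filled = scaled.ffill().bfill()
print(filled)
```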

**Elbow method**

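The elbow method is a heuristic for deciding how many clusters to use for K-means clustering: plot the inertia against the number of clusters and look for the point where adding more clusters stops reducing it substantially.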

***Task 6:***

Calculate the inertia of the KMeans clustering method for cluster counts ranging from 1 to 15 and plot the results. What is an optimal number of clusters for this data?

In [None]:
#################
# Your solution #
#################
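
As an illustration of the elbow method, the sketch below computes the inertia for 1 to 6 clusters on three synthetic, well-separated blobs (the data, the blob centres and the range of `k` are assumptions made for this example, not values from the pump dataset).

```python
import numpy as np
from sklearn.cluster import KMeans

# Three hypothetical 2D blobs of 50 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_toy = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centres])

# Fit K-means for each candidate k and record the inertia
# (sum of squared distances of points to their closest centroid).
ks = list(range(1, 7))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_toy)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `ks` (for example with `plt.plot(ks, inertias, marker='o')`) should show a sharp bend at `k=3` for this toy data. On real data the elbow is usually less clear-cut.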

Let us now fit the model with 5 clusters to the data.

In [None]:
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X_train)
labels = kmeans.predict(X_train)

data['cluster'] = labels

We can also see the number of points assigned to each cluster.

In [None]:
unique_elements, counts_elements = np.unique(labels, return_counts=True)
for cluster_id, count in zip(unique_elements, counts_elements):
    print(f"Cluster {cluster_id} \n  Points {count}")

We now assign a different colour to each cluster and visualise which cluster the sensor data belongs to over time.

In [None]:
colors = ['limegreen', 'orange', 'yellow', 'red', 'cyan']
colors_plot = [colors[i] for i in data['cluster'].values]

sensors_to_plot = ['sensor_01']
for sensor_name in sensors_to_plot:
    plt.figure(figsize=(18,3))
    lower_limit = 0.5*data[sensor_name].max() 
    upper_limit = 0.9*data[sensor_name].max()

    plt.plot(data['datetime'], data[sensor_name], color='blue', 
             label='sensor data')
    plt.vlines(data['datetime'], lower_limit, upper_limit, 
               color=colors_plot, alpha=0.01)
    plt.title(sensor_name)
    plt.legend()
    plt.show()

To get an idea of whether the learned clusters have captured the working modes of the machine, find out which clusters the readings of each machine status fall into.

In [None]:
broken = data[data['machine_status']=='BROKEN']
recovering = data[data['machine_status']=='RECOVERING']
normal = data[data['machine_status']=='NORMAL']

print(f"Broken: \n{broken['cluster'].value_counts()}\n")
print(f"Recovering: \n{recovering['cluster'].value_counts()}\n")
print(f"Normal: \n{normal['cluster'].value_counts()}\n")

**Observations:**

---

*   The RECOVERING mode is captured well as one cluster
*   Since the BROKEN mode has too few data points, its cluster distribution does not provide much information
*   The NORMAL mode appears to be spread across the remaining four clusters
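
The same status-versus-cluster comparison can also be read from a single contingency table. The sketch below uses `pd.crosstab` on a six-row toy frame; the labels are invented and only illustrate the shape of the result.

```python
import pandas as pd

# Hypothetical status and cluster assignments for six readings.
toy = pd.DataFrame({
    "machine_status": ["NORMAL", "NORMAL", "RECOVERING",
                       "RECOVERING", "BROKEN", "NORMAL"],
    "cluster": [0, 1, 2, 2, 0, 0],
})

# Rows are machine statuses, columns are cluster ids, cells are counts.
table = pd.crosstab(toy["machine_status"], toy["cluster"])
print(table)
```

Applied to the notebook's data, this would be `pd.crosstab(data['machine_status'], data['cluster'])`.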

## Cluster visualisation

***Task 7:***

We can use t-SNE to project the clusters onto 2D and plot them, to get a rough idea of their geometrical relationships.
Select every 500th data point from the dataset and plot the t-SNE embeddings as a scatter plot, colour-coded by cluster.

In [None]:
from sklearn.manifold import TSNE

subsampling_step = 500

#################
# Your solution #
#################
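
A minimal sketch of the subsample-then-embed pattern for Task 7, run on random synthetic data with random cluster labels (so the plot itself carries no meaning). The step of 50, the perplexity of 10 and all variable names are assumptions chosen to keep the toy example small; the task itself uses a step of 500.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical feature matrix and cluster labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(3000, 8))
labels_toy = rng.integers(0, 5, size=3000)

# Keep every 50th point so that t-SNE stays fast.
step = 50
X_sub = X_toy[::step]
labels_sub = labels_toy[::step]

# Project to 2D; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=42)
embedding = tsne.fit_transform(X_sub)

# Scatter plot of the embedding, coloured by cluster label.
plt.figure(figsize=(6, 6))
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels_sub, cmap="tab10", s=15)
plt.title("t-SNE embedding (toy data)")
```

With a dataframe such as `X_train`, the subsampling would be `X_train.iloc[::subsampling_step]`, with the cluster labels sliced in the same way.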