# Analyze industrial sensor data

This notebook provides guidance for the Industrial Data Analysis lab exercise in the PyR training, to help you finish the exercise in the designated time. If you have advanced Python and Pandas skills, we recommend starting with an empty notebook and working through the exercise on your own.

## Steps
The notebook passes through a number of steps toward the end goal: building a Python model that can be used to detect anomalies in one of the sensors, `Wheel Front Temp Celsius`.

## Import required packages
Even though you can import new packages anywhere in the notebook, it is common practice to move the imports to the beginning of the notebook. This immediately shows the dependencies when someone else opens it.

Additionally, when you have finished working on the notebook:
* Restart the kernel (Kernel --> Restart)
* Re-run the notebook (Run --> Run All Cells)

This ensures that the notebook also runs from top to bottom, even if you did not execute the cells in that exact order while developing it. Very often, the exploration and analysis of data is an iterative process, and you may end up renaming variables or updating dataframe columns along the way. By running the full notebook from start to end and validating the end result, you will avoid errors when you need to use it later.

In [1]:
# YOUR CODE HERE #

## Read input files
In this example, the files are read directly from the GitHub repository into Pandas dataframes. Even though you could upload the files to the notebook server (Watson Studio or other), you will find it just as easy to load the data directly from the internet.

The following 3 files have been made available:
* readings: http://raw.githubusercontent.com/fketelaars/pyr-industrial/master/readings.csv
* sensors: http://raw.githubusercontent.com/fketelaars/pyr-industrial/master/sensor.csv
* devices: http://raw.githubusercontent.com/fketelaars/pyr-industrial/master/device.csv
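
As a minimal sketch of the loading step: `pandas.read_csv` accepts a URL as well as a local path or a file-like object. The example below uses an in-memory string as a stand-in for the download so that it runs offline. The `sensor_id` and `value` column names and the sample rows are assumptions for illustration, not the actual file contents.

```python
import io

import pandas as pd

# Stand-in for readings.csv. With network access you would pass the URL itself:
# pd.read_csv("http://raw.githubusercontent.com/fketelaars/pyr-industrial/master/readings.csv")
csv_text = """tsepoch,sensor_id,value
1530000000000,14,21.5
1530000000200,14,21.7
1530000000000,27,3.0
"""

# read_csv treats the StringIO object exactly like an opened file.
readings = pd.read_csv(io.StringIO(csv_text))

print(len(readings))    # number of records loaded
print(readings.dtypes)  # column types inferred by pandas
```

The same call, repeated once per URL, gives the three dataframes, and `len()` on each result is enough to compare against the expected record counts.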

In [2]:
# YOUR CODE TO LOAD DATA INTO PANDAS DATAFRAMES HERE #

Now that the data has been loaded into the 3 Pandas dataframes, check that the number of records matches the expectations.
* readings: ~326k entries
* sensors: 27 entries
* devices: 4 entries

In [3]:
# YOUR CODE HERE #

## View the first few rows of every data set
Show the top 5 rows of every dataframe.

In [4]:
# YOUR CODE HERE #

In [5]:
# YOUR CODE HERE #

In [6]:
# YOUR CODE HERE #

## Convert epoch timestamp to datetime
Extend the "readings" dataframe with a timestamp column which represents the human readable date and time for the `tsepoch` column. If you haven't noticed yet, the `tsepoch` timestamp is expressed in milliseconds.
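
One way to do the conversion, sketched on made-up epoch values: `pandas.to_datetime` takes a `unit` argument, and `unit="ms"` tells it the integers are milliseconds since the epoch. The new column name `timestamp` is an assumption.

```python
import pandas as pd

# Made-up millisecond epoch values standing in for the real tsepoch column.
readings = pd.DataFrame({"tsepoch": [1530000000000, 1530000000200, 1530000001000]})

# unit="ms" interprets the integers as milliseconds since 1970-01-01 UTC.
readings["timestamp"] = pd.to_datetime(readings["tsepoch"], unit="ms")

print(readings)
```

Without `unit="ms"` pandas would interpret the integers as nanoseconds, which places every reading within a few minutes of 1 January 1970.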

In [7]:
# YOUR CODE HERE #

### Determine time span of the readings
Now that you've loaded the data, find the following properties for the readings:
* The number of seconds that the readings span
* The lowest timestamp in human readable format
* The highest timestamp in human readable format
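
The three properties above can be derived from the datetime column, as in this sketch on made-up values. The column name `timestamp` is an assumption. Subtracting two timestamps yields a `Timedelta`, whose `total_seconds()` method gives the span.

```python
import pandas as pd

# Made-up epoch values; the last one is 864 seconds after the first.
readings = pd.DataFrame({"tsepoch": [1530000000000, 1530000000200, 1530000864000]})
readings["timestamp"] = pd.to_datetime(readings["tsepoch"], unit="ms")

first = readings["timestamp"].min()  # lowest timestamp
last = readings["timestamp"].max()   # highest timestamp

# The difference of two Timestamps is a Timedelta.
span_seconds = (last - first).total_seconds()

print(first, last, span_seconds)
```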

In [8]:
# YOUR CODE HERE #

### Determine number of readings for each sensor
Not every sensor has the same sample rate. Later you will have to resample the readings to ensure you have a value for every time interval. Find the number of readings for every sensor and display this in a chart.

**Tip**: When using `matplotlib`, don't forget to add an instruction to tell the notebook that charts should be shown in-line.
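
A minimal sketch of the counting step, on made-up readings with assumed column names `sensor_id` and `value`: grouping by sensor and taking the group size gives one count per sensor.

```python
import pandas as pd

# Made-up readings: sensor 14 reports more often than the others.
readings = pd.DataFrame({
    "sensor_id": [14, 14, 14, 27, 27, 68],
    "value": [21.0, 21.5, 22.0, 3.0, 3.0, 7.5],
})

# size() counts the rows in every group; sort so the busiest sensor comes first.
counts = readings.groupby("sensor_id").size().sort_values(ascending=False)

print(counts)
```

In a notebook, the `%matplotlib inline` magic followed by `counts.plot.bar()` would draw the chart, assuming matplotlib is installed.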


In [9]:
# YOUR CODE HERE #

In [10]:
# YOUR CODE HERE #

In [11]:
# YOUR CODE HERE #

### Determine statistics for each sensor and write to new "sensors" CSV file
The `sensor.csv` file contains placeholders for the count, minimum, maximum and mean of every sensor, but not all of these values are populated. You will need the sensor statistics later, but you do not necessarily need to store them. If you run the notebook in the cloud, you can try to persist the data in the object store, but retrieving the values in a dataframe is sufficient.
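
The statistics can be computed in a single pass with a grouped aggregation. This is a sketch on made-up data: the `value` column name and the output name `count` are assumptions, while `low_value`, `high_value` and `mean_value` follow the column names used in this notebook.

```python
import pandas as pd

# Made-up readings; sensor 27 is constant.
readings = pd.DataFrame({
    "sensor_id": [14, 14, 14, 27, 27],
    "value": [21.0, 22.0, 23.0, 5.0, 5.0],
})

# Named aggregation: each keyword becomes an output column.
stats = readings.groupby("sensor_id")["value"].agg(
    count="count",
    low_value="min",
    high_value="max",
    mean_value="mean",
)

print(stats)
```

The result is indexed by `sensor_id`, which matters when joining it with the sensors dataframe.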

In [12]:
# YOUR CODE HERE #

Once you have retrieved the count, minimum, maximum and mean, join this data with the original sensors dataframe so that you can persist it as a CSV file.

In [13]:
# YOUR CODE HERE #

The resulting dataframe should have the following columns:
* sensor_id
* description
* low_value
* high_value
* mean_value

**Tip**: To be able to join, you may have to reset the index of the dataframe holding the aggregated values, or you may have to use the index of that dataframe to join with the sensors dataframe.
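
Both options from the tip are sketched below on made-up data. The descriptions and statistics are placeholders, not values from the lab files.

```python
import pandas as pd

# Placeholder sensors table.
sensors = pd.DataFrame({
    "sensor_id": [14, 27],
    "description": ["Example sensor A", "Example sensor B"],
})

# Placeholder aggregated statistics, indexed by sensor_id as groupby would return them.
stats = pd.DataFrame(
    {"low_value": [20.0, 0.0], "high_value": [25.0, 900.0], "mean_value": [22.5, 450.0]},
    index=pd.Index([14, 27], name="sensor_id"),
)

# Option 1: turn the index into a regular column and merge on it.
merged = sensors.merge(stats.reset_index(), on="sensor_id", how="left")

# Option 2: keep the index and join the sensors column against it.
joined = sensors.join(stats, on="sensor_id")

print(merged)
```

Either result can then be written out with `to_csv`, for example `merged.to_csv("sensors_stats.csv", index=False)`, where the file name is hypothetical.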

In [14]:
# YOUR CODE HERE #

## Visualization

### Determine which sensors make sense to visualize
Sensors with a constant value can be ignored and should be dropped before visualizing. Use the sensor statistics you retrieved above to determine if sensors have a constant value. Drop the readings of those sensors.

**Tip**: When dropping rows or columns, you may get a warning that the original dataframe could be affected. Use the `copy()` function to make a copy of the original dataframe before deleting rows or columns.
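
A sketch of the filtering step on made-up data, assuming columns named `sensor_id` and `value`: a sensor is constant when its minimum equals its maximum. The explicit `copy()` makes the filtered result an independent dataframe, which avoids the warning mentioned in the tip (typically pandas' `SettingWithCopyWarning`) when you modify it later.

```python
import pandas as pd

# Made-up readings; sensor 27 never changes.
readings = pd.DataFrame({
    "sensor_id": [14, 14, 27, 27, 68, 68],
    "value": [21.0, 22.0, 5.0, 5.0, 1.0, 3.0],
})

stats = readings.groupby("sensor_id")["value"].agg(["min", "max"])

# Sensors whose minimum equals their maximum carry no information.
constant_ids = stats.index[stats["min"] == stats["max"]]

# Keep only readings of non-constant sensors, as an independent copy.
varying = readings[~readings["sensor_id"].isin(constant_ids)].copy()

print(list(constant_ids), len(varying))
```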

In [15]:
# YOUR CODE HERE #

If your code is correct, approximately 321900 readings should remain in the new dataframe.

### Plot some of the sensors
Pick a couple of sensor IDs (for example: 14, 27 and 68) and plot the values. Use the human readable timestamp for the x-axis.

In [16]:
# YOUR CODE HERE #

## Down-sample the different readings
Before we can start looking at correlations between different sensors, we need to match the timestamps of the readings from the different sensors. Let's down-sample the readings so that we can create a pivoted dataframe with a value for every sensor at every timestamp.

To lose as little detail as possible, we will down-sample to 0.2 seconds. This means there will be 864 * 5 = 4320 readings for every sensor.

**Tip**: Use the pandas `resample` function and `groupby` to down-sample the readings to 200 milliseconds.
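
The tip can be sketched as follows on made-up data with assumed column names `timestamp`, `sensor_id` and `value`. The readings are grouped per sensor and each group is resampled into 200 ms bins. Taking the mean of each bin is one possible choice of aggregation, not necessarily the intended one.

```python
import pandas as pd

# Made-up readings: sensor 14 reports often, sensor 27 only twice.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([0, 50, 150, 250, 0, 450], unit="ms"),
    "sensor_id": [14, 14, 14, 14, 27, 27],
    "value": [1.0, 3.0, 5.0, 7.0, 10.0, 20.0],
})

# resample needs a DatetimeIndex, so the timestamp becomes the index first.
resampled = (
    readings.set_index("timestamp")
    .groupby("sensor_id")["value"]
    .resample("200ms")
    .mean()
    .reset_index()
)

print(resampled)
```

Sensor 27 has no reading between 200 ms and 400 ms in this sample, so that bin is present in the result but holds `NaN`.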

In [17]:
# YOUR CODE HERE #

You may find that some of the sensors did not have a reading for every 200 ms interval, which means that a `NaN` value is generated. Remove these readings from the resulting dataframe.

In [18]:
# YOUR CODE HERE #

Check that you now have roughly the same number of samples for each sensor.

## Find correlations between sensors
Now that we have roughly the same number of readings for every sensor, you can find correlations between the different sensor IDs. To do so, you need to line up the readings of the different sensors with each other.

**Tip**: Use the pandas `pivot_table` function to get 1 column for every sensor. Every row will have a timestamp and value for each of the sensors.
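
A minimal sketch of the pivot on made-up, already resampled data with assumed column names `timestamp`, `sensor_id` and `value`:

```python
import pandas as pd

# Made-up resampled readings in "long" format: one row per sensor and timestamp.
resampled = pd.DataFrame({
    "timestamp": pd.to_datetime([0, 200, 0, 200], unit="ms"),
    "sensor_id": [14, 14, 27, 27],
    "value": [21.0, 22.0, 100.0, 110.0],
})

# One row per timestamp, one column per sensor_id.
pivoted = resampled.pivot_table(index="timestamp", columns="sensor_id", values="value")

print(pivoted)
```

`pivot_table` averages duplicate (timestamp, sensor) pairs by default, so it does not fail if a pair occurs more than once.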

In [19]:
# YOUR CODE HERE #

The column names now reference the `sensor_id` which really is not that meaningful. Before correlating, let's give the columns some meaningful names.

**Tip**: Join the numeric column names with the `sensors` data to retrieve the names.
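
As an alternative to an explicit join, a lookup dictionary built from the sensors dataframe can be passed to `rename`. The sketch below uses made-up data, and the mapping of IDs 14 and 27 to these two names is hypothetical.

```python
import pandas as pd

# Hypothetical ID-to-name mapping; check the real sensors data for the actual one.
sensors = pd.DataFrame({
    "sensor_id": [14, 27],
    "description": ["Wheel Front Temp Celsius", "Wheel Front Speed RPM"],
})

# Made-up pivoted dataframe with numeric sensor IDs as column names.
pivoted = pd.DataFrame({14: [21.0, 22.0], 27: [100.0, 110.0]})

# Build {sensor_id: description} and use it to rename the columns.
id_to_name = sensors.set_index("sensor_id")["description"].to_dict()
named = pivoted.rename(columns=id_to_name)

print(named.head())
```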

In [20]:
# YOUR CODE HERE #

Once you have changed the column names, show the first 5 rows of the resulting dataframe.

In [21]:
# YOUR CODE HERE #

### Build the correlations table
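A sketch of how such a table can be produced: `DataFrame.corr()` returns the pairwise Pearson correlation of all numeric columns. The data below is synthetic and only imitates a linear relation between speed and temperature. It is not the lab data.

```python
import numpy as np
import pandas as pd

# Synthetic data: temperature depends linearly on speed plus noise.
rng = np.random.default_rng(0)
speed = rng.uniform(0, 900, size=200)
pivoted = pd.DataFrame({
    "Wheel Front Speed RPM": speed,
    "Wheel Front Temp Celsius": 20 + 0.05 * speed + rng.normal(0, 1, size=200),
    "Unrelated": rng.normal(0, 1, size=200),
})

# Square table with one row and one column per sensor.
corr = pivoted.corr()

print(corr.round(2))
```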

In [22]:
# YOUR CODE HERE #

### Show correlations in heatmap
The correlation table above is a little difficult to read. It is easier to interpret as a heatmap.

**Tip**: The Seaborn package has some nice heatmaps that are easy to use. https://seaborn.pydata.org/

In [23]:
# YOUR CODE HERE #

From the heatmap you should see that `Wheel Front Speed RPM` and `Wheel Front Temp Celsius` are strongly correlated. Let's plot this in a regression plot.

In [24]:
# YOUR CODE HERE #

The correlation between the speed and temperature should be clearly visible in the chart.

## Train model on the data

Now that we have found a correlation, let's try to build a model we can use for predictions.

### Import additional packages
You can choose to import additional packages here, but in the end it is recommended to move all imports to the top of the notebook.

In [25]:
# YOUR CODE HERE #

### Split the data in a training and test set
It is best to start from the pivoted dataframe. If you didn't do so before, you first have to remove any NaN values from the overall pivoted dataframe, otherwise the training or testing of the model will fail.

In [26]:
# YOUR CODE HERE #

Now split the dataframe into a training and test set. Please note that the independent variables (features) and dependent variables (labels) must end up in different dataframes/series.

**Tip**: SciKit Learn has a nice function that will split up a dataframe in training and testing data, and also separate features from labels.
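
The function in question is presumably `train_test_split` from `sklearn.model_selection`. The sketch below applies it to synthetic data. The 80/20 split and the `random_state` value are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pivoted dataframe.
rng = np.random.default_rng(0)
speed = rng.uniform(0, 900, size=100)
data = pd.DataFrame({
    "Wheel Front Speed RPM": speed,
    "Wheel Front Temp Celsius": 20 + 0.05 * speed + rng.normal(0, 1, size=100),
})

X = data[["Wheel Front Speed RPM"]]   # features: double brackets keep it 2-D
y = data["Wheel Front Temp Celsius"]  # labels: a 1-D Series

# Rows are shuffled before splitting; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
```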

In [27]:
# YOUR CODE HERE #

### First try simple linear model

Try to fit a simple linear regression model against the data. When the model has been fit, calculate the R<sup>2</sup> score.

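**Tip**: When you try this for the first time, SciKit Learn may throw errors because the shape of the training and test data is not what it expects. The features must be two-dimensional, whereas a single Pandas column is one-dimensional. You can convert the column to a NumPy array and re-shape it to the desired format using the NumPy `reshape()` function, for example `reshape(-1, 1)`.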
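
A sketch of the reshape and fit on synthetic data. For brevity, the model is scored on the same data it was trained on. In the exercise you would score on the test set instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a 1-D array of speeds and a noisy linear temperature.
rng = np.random.default_rng(0)
speed = rng.uniform(0, 900, size=200)  # shape (200,)
temp = 20 + 0.05 * speed + rng.normal(0, 1, size=200)

# scikit-learn expects features of shape (n_samples, n_features).
X = speed.reshape(-1, 1)  # shape (200, 1)

model = LinearRegression().fit(X, temp)

# score() returns the R^2 coefficient of determination for regressors.
r2 = model.score(X, temp)

print(model.coef_, model.intercept_, r2)
```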

Fit the linear model.

In [28]:
# YOUR CODE HERE #

Calculate the R<sup>2</sup> score.

In [29]:
# YOUR CODE HERE #

### Try to improve the score with a polynomial regression
To use a polynomial regression, it is best to create a pipeline that will first generate the polynomial features and then train.

You can choose to import additional packages here, but in the end it is recommended to move all imports to the top of the notebook.

In [30]:
# YOUR CODE HERE #

Define the pipeline and fit the model. When the model has been trained, calculate the R<sup>2</sup> score and see if it has improved compared to the simple linear model. Whether the score improves depends on the degree of the polynomial features. With the provided data set, a degree of 3 was optimal.
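
A minimal sketch of such a pipeline, using `make_pipeline` with `PolynomialFeatures` and `LinearRegression`. The data is synthetic and deliberately cubic so that the effect is visible. It says nothing about the shape of the lab data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a cubic relation between feature and label.
rng = np.random.default_rng(0)
speed = rng.uniform(0, 10, size=300).reshape(-1, 1)
temp = 20 + 0.5 * speed[:, 0] ** 3 + rng.normal(0, 1, size=300)

# Baseline: plain linear regression.
linear = LinearRegression().fit(speed, temp)

# Pipeline: expand the feature to 1, x, x^2, x^3, then fit a linear model on those.
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(speed, temp)

print(linear.score(speed, temp), poly.score(speed, temp))
```

Changing `degree` and comparing the scores on the test set is a simple way to see where additional polynomial terms stop helping.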

In [31]:
# YOUR CODE HERE #

### Now plot everything in 1 diagram
You should now have 2 models. It would be good to visualize how well the models can predict the value of temperature given the speed.

First assemble a dataframe that will have the following columns:
* speed
* actual temperature
* predicted temperature for the simple linear regression
* predicted temperature for the polynomial regression
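
The dataframe can be assembled by calling `predict` on both models, as in this sketch on synthetic data. The column names are hypothetical, and the models are refit here only to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data and two fitted models.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100).reshape(-1, 1)
temp = 20 + 0.5 * X[:, 0] ** 3 + rng.normal(0, 1, size=100)
linear = LinearRegression().fit(X, temp)
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, temp)

# One row per sample; sorting by speed makes line plots of the predictions readable.
results = pd.DataFrame({
    "speed": X[:, 0],
    "actual_temp": temp,
    "pred_linear": linear.predict(X),
    "pred_poly": poly.predict(X),
}).sort_values("speed")

print(results.head())
```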

In [32]:
# YOUR CODE HERE #

Once you have the dataframe with these columns, plot the values. Try to improve the visualization by:
* Choosing a figure size that will fit the width of your screen
* Choosing different colours for the plotted points
* Creating a legend that explains the values and colours shown in the chart

In [33]:
# YOUR CODE HERE #

# Optional exercise or demo
Score live data and show in a chart.