# Activity 11.01 - Multiple regression with nonlinear models

As part of a research effort to improve metallic-oxide semiconductor sensors for the toxic gas carbon monoxide (CO), you are asked to investigate models of the sensor response from an array of sensors. You will review the data, perform some feature engineering for non-linear features, and then compare a baseline linear regression approach to a random forest model:

1. For this exercise, you will need the pandas and numpy libraries, three modules from sklearn, and the matplotlib and seaborn plotting libraries. Load them in the first cell of the notebook:

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression as OLS
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

2. As we have done before, create a utility function to plot a grid of histograms after being given the data, which variables to plot, the rows and columns of the grid, and how many bins. Similarly, create a utility function that allows you to plot a list of variables as scatter plots against a given x variable, also after being given the rows and columns of the grid.
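One possible shape for the two utility functions is sketched below. The function names `plot_histogram_grid` and `plot_scatter_grid`, their argument order, and the small synthetic DataFrame used to exercise them are all assumptions made for illustration; the column names `Time(s)` and `R1` to `R3` stand in for the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_histogram_grid(data, variables, rows, cols, bins=20):
    """Draw one histogram per variable on a rows x cols grid of axes."""
    fig, axes = plt.subplots(rows, cols, figsize=(4 * cols, 3 * rows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, var in zip(flat_axes, variables):
        ax.hist(data[var].dropna(), bins=bins)
        ax.set_title(var)
    # Hide grid cells that did not receive a variable.
    for ax in flat_axes[len(variables):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes


def plot_scatter_grid(data, x_var, y_vars, rows, cols):
    """Scatter each variable in y_vars against x_var on a rows x cols grid."""
    fig, axes = plt.subplots(rows, cols, figsize=(4 * cols, 3 * rows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, y_var in zip(flat_axes, y_vars):
        ax.scatter(data[x_var], data[y_var], s=2)
        ax.set_xlabel(x_var)
        ax.set_ylabel(y_var)
    for ax in flat_axes[len(y_vars):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes


# Synthetic stand-in data (not the real sensor readings).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Time(s)": np.arange(200.0),
    "R1": rng.normal(size=200),
    "R2": rng.normal(size=200),
    "R3": rng.normal(size=200),
})
hist_fig, hist_axes = plot_histogram_grid(demo, ["R1", "R2", "R3"], rows=2, cols=2, bins=15)
scatter_fig, scatter_axes = plot_scatter_grid(demo, "Time(s)", ["R1", "R2", "R3"], rows=2, cols=2)
```

With helpers like these, the later histogram and scatter steps reduce to a single call each, passing the relevant column list and a grid size large enough to hold it.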

3. Now, load the CO_sensors.csv file into a DataFrame called my_data. Show the first five rows.

4. Use .describe().T to inspect the data further.
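Steps 3 and 4 might look like the following sketch. Because the real file is not reproduced here, a tiny made-up CSV is read from memory instead; its values and column names (`Time(s)`, `CO(ppm)`, `Humidity`, `R1`, `R2`) are illustrative assumptions, not the actual contents of CO_sensors.csv.

```python
import io
import pandas as pd

# Made-up rows standing in for CO_sensors.csv (illustrative values only).
csv_text = """Time(s),CO(ppm),Humidity,R1,R2
0.0,0.0,49.7,0.21,0.18
0.3,0.0,49.8,0.22,0.19
0.6,2.5,50.1,0.35,0.30
0.9,2.5,50.3,0.36,0.31
1.2,5.0,50.2,0.52,0.44
1.5,5.0,50.0,0.53,0.45
"""

# With the real file this would be: my_data = pd.read_csv("CO_sensors.csv")
my_data = pd.read_csv(io.StringIO(csv_text))

first_rows = my_data.head()      # first five rows
summary = my_data.describe().T   # one row per column, one column per statistic
print(first_rows)
print(summary)
```

Transposing the `describe()` output puts each variable on its own row, which is easier to scan when the dataset has many sensor columns.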

5. Use the histogram grid utility function to plot histograms of all columns, except Time(s).

6. Use seaborn to generate a pairplot of the first five columns (excluding the sensor readings).

7. Use the scatter plot grid utility function to plot all the sensor data versus time.

8. It's difficult to tell whether there is a time dependency or a periodic component. Zoom in on R13 over the time from 40000 to 45000 seconds.
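A minimal sketch of the zoom, assuming the time column is named `Time(s)` and using a synthetic stepped signal in place of the real R13 readings:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in: a stepped signal with a little noise (not real data).
rng = np.random.default_rng(1)
time = np.arange(39000.0, 46000.0, 10.0)
r13 = 5.0 + 2.0 * ((time // 500) % 3) + rng.normal(scale=0.1, size=time.size)
my_data = pd.DataFrame({"Time(s)": time, "R13": r13})

# Keep only the rows inside the window of interest (both ends inclusive).
in_window = my_data["Time(s)"].between(40000, 45000)
zoomed = my_data.loc[in_window]

fig, ax = plt.subplots(figsize=(9, 4))
ax.scatter(zoomed["Time(s)"], zoomed["R13"], s=3)
ax.set_xlabel("Time(s)")
ax.set_ylabel("R13")
```

Filtering the DataFrame rather than only setting axis limits also leaves a smaller frame, `zoomed`, that can be reused for follow-up checks on the same window.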

You can now see that the tests appear to comprise step functions of various sizes. This shows that the time variable is arbitrary and not useful for modeling the CO response. We can also see that there are a significant number of values that deviate from the steps, which may be due to the humidity variations, measurement errors, or some other issues. These may limit how well we can model the results.

9. Investigate the relationship of the changes in R13 with the CO and Humidity values during one step change - for example, from about 41250 to 42500. Plot the R13 values using the .plot() method in matplotlib, and overlay the CO and Humidity values as line plots on the same plot.
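The overlay could be built along these lines. The data below is synthetic: a single CO step, a humidity trace that approaches its new level gradually, and an R13 signal derived from both. The column names and the shape of the signals are assumptions chosen only to make the sketch runnable.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for one step change (not real data).
rng = np.random.default_rng(2)
time = np.arange(41000.0, 43000.0, 5.0)
co = np.where(time < 41800.0, 5.0, 15.0)
# Humidity approaches its new level gradually, mimicking a lag.
elapsed = np.clip(time - 41800.0, 0.0, None)
humidity = 40.0 + 20.0 * (1.0 - np.exp(-elapsed / 150.0))
r13 = 30.0 - 0.8 * co - 0.2 * humidity + rng.normal(scale=0.3, size=time.size)
my_data = pd.DataFrame(
    {"Time(s)": time, "CO(ppm)": co, "Humidity": humidity, "R13": r13}
)

# Restrict to the window around the step change.
step = my_data[my_data["Time(s)"].between(41250, 42500)]

fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(step["Time(s)"], step["R13"], ".", markersize=3, label="R13")
ax.plot(step["Time(s)"], step["CO(ppm)"], label="CO(ppm)")
ax.plot(step["Time(s)"], step["Humidity"], label="Humidity")
ax.set_xlabel("Time(s)")
ax.legend()
```

All three series share one y-axis here, as the step describes; if their ranges differ widely in the real data, a secondary axis via `ax.twinx()` is an alternative.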

As you saw in the detail, there are a series of step changes for both CO and Humidity, resulting in changes in the resistance values. However, there are evident time lags involved, as shown by the curved traces of Humidity. In addition, R13 seems to spike, then fall, and have intervening periods where the values appear to be 0 and at a steady state. Perhaps this is a function of the electronics, but further investigation would be required to be sure.

10. Now, use seaborn to plot a correlation heatmap for the sensor columns.
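A sketch of the heatmap, using six synthetic sensor columns built as two internally correlated groups so that the block structure is visible. The column names `R1` to `R6`, the color map, and the annotation settings are illustrative choices rather than anything prescribed by the activity.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in: two groups of sensors, each sharing a common signal.
rng = np.random.default_rng(3)
n = 500
base_a = rng.normal(size=n)
base_b = rng.normal(size=n)
sensors = pd.DataFrame({
    "R1": base_a + rng.normal(scale=0.2, size=n),
    "R2": base_a + rng.normal(scale=0.2, size=n),
    "R3": base_a + rng.normal(scale=0.2, size=n),
    "R4": base_b + rng.normal(scale=0.2, size=n),
    "R5": base_b + rng.normal(scale=0.2, size=n),
    "R6": base_b + rng.normal(scale=0.2, size=n),
})

corr = sensors.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the color scale to [-1, 1] so colors are comparable across plots.
sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True, fmt=".2f", ax=ax)
```

Groups of sensors that move together show up as square blocks of similar color along the diagonal.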

Note that there are two or three groups in the plot. The last seven sensors are all highly correlated with one another. The first three are as well, as are the next four.

11. The description of this data says that there are two kinds of sensors: "Figaro Engineering (7 units of TGS 3870-A04) and FIS (7 units of SB-500-12)". It is now apparent that R1 to R7 are one kind of sensor and R8 to R14 are the other kind. The data was collected to evaluate the performance of the sensors measuring CO at various conditions of temperature and humidity. In particular, the humidity is taken to be an "uncontrolled variable", and during tests, random levels of humidity were imposed. In the field, the humidity would not be controlled or measured, which impacts the interpretation of data, especially for low levels of CO. The sensor output is reported as resistance in MOhms; these resistances are the main independent variables with which to predict CO. Temperature and the voltage applied to the sensor heater are also available.

Investigate the behavior of the sensors versus CO and humidity. Use the pandas .corr() method to generate the correlation matrix, and then use the first two rows of the result to make a barplot of the sensors' correlations versus CO and Humidity, respectively.
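One way to build this barplot is sketched below on synthetic data. It assumes the time column is dropped before calling `.corr()` so that `CO(ppm)` and `Humidity` are the first two rows of the matrix; if the real column order differs, selecting the rows by label with `.loc` is the safer choice. All column names and signal shapes are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in: R1 tracks CO, R2 tracks humidity, R3 is mostly noise.
rng = np.random.default_rng(4)
n = 400
co = rng.choice([0.0, 5.0, 10.0, 15.0, 20.0], size=n)
humidity = rng.uniform(20.0, 70.0, size=n)
my_data = pd.DataFrame({
    "Time(s)": np.arange(n, dtype=float),
    "CO(ppm)": co,
    "Humidity": humidity,
    "R1": 50.0 - 1.5 * co + rng.normal(scale=1.0, size=n),
    "R2": 20.0 + 0.3 * humidity + 0.2 * co + rng.normal(scale=1.0, size=n),
    "R3": 10.0 + rng.normal(scale=1.0, size=n),
})
sensor_cols = ["R1", "R2", "R3"]

corr = my_data.drop(columns=["Time(s)"]).corr()
# First two rows are CO(ppm) and Humidity because of the column order above.
targets = corr.iloc[:2][sensor_cols]

fig, ax = plt.subplots(figsize=(8, 4))
# Transpose so each sensor gets a pair of bars: one for CO, one for Humidity.
targets.T.plot.bar(ax=ax)
ax.set_ylabel("Correlation")
```

Placing the CO and Humidity bars side by side for each sensor makes it easy to spot sensors that respond more to one variable than the other.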

You see that while all the sensors are meant to measure CO, they behave markedly differently depending on which of the two types is doing the measuring. From the problem description, it is evident that the sensors are impacted by humidity, but in the application, humidity is an uncontrolled and possibly unknown value. Hopefully, the different sensor behaviors can provide humidity information to a model and enable good predictions.

12. Apply a sqrt() transform to each of the sensor columns (since there are 0 or near-zero values, a log transform would not be appropriate) and add the columns to the dataset.
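The transform can be sketched as follows on a toy frame. The `sqrt_` prefix for the new column names is a hypothetical naming choice, and the three sensor columns are placeholders for the full sensor list. The last lines illustrate why a log transform is avoided when zeros are present.

```python
import numpy as np
import pandas as pd

# Toy stand-in with a zero reading in each sensor column.
my_data = pd.DataFrame({
    "R1": [0.0, 1.0, 4.0, 9.0],
    "R2": [0.0, 0.25, 1.0, 2.25],
    "R3": [16.0, 0.0, 25.0, 36.0],
})
sensor_cols = ["R1", "R2", "R3"]

# Add one square-root feature per sensor, keeping the original columns.
for col in sensor_cols:
    my_data[f"sqrt_{col}"] = np.sqrt(my_data[col])

# A log transform would map a zero reading to -inf, which a model cannot use.
with np.errstate(divide="ignore"):
    log_of_zero = np.log(0.0)
```

The square root is defined at zero and compresses large values, so it adds a non-linear feature without producing infinities.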

13. For the initial model, drop Time, Humidity and CO from the X data. Use CO as the y data. Use Linear Regression to fit a model and plot the residuals, as well as the predicted versus actual values.
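A minimal sketch of the baseline fit, using synthetic data with two sensors and their square-root features. The column names, the synthetic response curves, and the choice to evaluate on the same rows used for fitting are assumptions of this sketch; the activity does not specify a train/test split.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as OLS

# Synthetic stand-in: R1 responds non-linearly to CO, R2 mostly to humidity.
rng = np.random.default_rng(42)
n = 300
co = rng.choice([0.0, 5.0, 10.0, 15.0, 20.0], size=n)
humidity = rng.uniform(20.0, 70.0, size=n)
my_data = pd.DataFrame({
    "Time(s)": np.arange(n, dtype=float),
    "CO(ppm)": co,
    "Humidity": humidity,
    "R1": 40.0 / (1.0 + 0.2 * co) + 0.05 * humidity + rng.normal(scale=0.5, size=n),
    "R2": 5.0 + 0.3 * humidity + rng.normal(scale=0.5, size=n),
})
for col in ["R1", "R2"]:
    my_data[f"sqrt_{col}"] = np.sqrt(my_data[col])

# Features are everything except time, humidity, and the target itself.
X = my_data.drop(columns=["Time(s)", "Humidity", "CO(ppm)"])
y = my_data["CO(ppm)"]

model = OLS().fit(X, y)
predicted = model.predict(X)
residuals = y - predicted

fig, (ax_res, ax_pred) = plt.subplots(1, 2, figsize=(10, 4))
ax_res.hist(residuals, bins=30)
ax_res.set_xlabel("Residual (ppm)")
ax_pred.scatter(y, predicted, s=5)
ax_pred.set_xlabel("Actual CO(ppm)")
ax_pred.set_ylabel("Predicted CO(ppm)")
```

For ordinary least squares with an intercept, the in-sample residuals always average to zero, so a centered residual histogram alone does not show that the fit is good; the predicted-versus-actual panel is the more informative check.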

This model produces unbiased results, as shown by the residuals centered around 0, but from the second plot, we can see that there are multiple issues. There are groups of incorrect predictions at various levels, along with a clump near the middle of the predicted CO readings of 10 ppm. This result clearly is not acceptable.

14. Scale the data with StandardScaler() and then fit a RandomForestRegressor() model to the scaled data. Plot the residuals and the predicted versus actual values.
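This step could look like the sketch below, again on synthetic data. The hyperparameters (`n_estimators=100`, `random_state=0`) and the in-sample evaluation are assumptions made for the example, not settings given by the activity.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor features and CO target (not real data).
rng = np.random.default_rng(7)
n = 300
co = rng.choice([0.0, 5.0, 10.0, 15.0, 20.0], size=n)
humidity = rng.uniform(20.0, 70.0, size=n)
X = pd.DataFrame({
    "R1": 40.0 / (1.0 + 0.2 * co) + 0.05 * humidity + rng.normal(scale=0.5, size=n),
    "R2": 5.0 + 0.3 * humidity + rng.normal(scale=0.5, size=n),
})
y = pd.Series(co, name="CO(ppm)")

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

rf_model = RandomForestRegressor(n_estimators=100, random_state=0)
rf_model.fit(X_scaled, y)
rf_predicted = rf_model.predict(X_scaled)
rf_residuals = y - rf_predicted

fig, (ax_res, ax_pred) = plt.subplots(1, 2, figsize=(10, 4))
ax_res.hist(rf_residuals, bins=30)
ax_res.set_xlabel("Residual (ppm)")
ax_pred.scatter(y, rf_predicted, s=5)
ax_pred.set_xlabel("Actual CO(ppm)")
ax_pred.set_ylabel("Predicted CO(ppm)")
```

Tree-based models split on thresholds, so standardizing the features does not change what a random forest can learn; the scaling here simply follows the step as written and keeps the pipeline comparable with scale-sensitive models.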

Although it is evident that the Random Forest model has reduced the residuals, the vertical groupings are still present, which is not a satisfactory result. Reviewing the detail plot from step 9, note that although the CO values shown are nearly constant, there is a lag time in the humidity and sensor resistance values. A possible approach would be to average the readings. A simple test of this idea is to group by the CO values, take the mean sensor values, and then model with those. In addition, it seems reasonable to filter out the regions where the resistance values drop to low values, as those seem anomalous.

15. Create a dataset, filtering out all the rows where a sensor resistance value drops to a low value, say below 0.1. Then, group by CO(ppm) and aggregate as the mean values. Build a Random Forest model using the sensor mean resistances and the CO group values. Also, refit a linear regression model to this data. Plot the predicted values versus the actual values for both results.
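The filter-then-aggregate approach might be sketched as follows. The step does not say whether a row is dropped when any sensor or a particular sensor falls below the threshold; this sketch assumes "any sensor". The synthetic data, the injected zero readings, the column names, and the model settings are likewise assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression as OLS

# Synthetic stand-in with 21 CO levels and some injected dropouts in R1.
rng = np.random.default_rng(11)
n = 400
co = np.round(rng.uniform(0.0, 20.0, size=n))
my_data = pd.DataFrame({
    "CO(ppm)": co,
    "R1": 40.0 / (1.0 + 0.2 * co) + rng.normal(scale=0.3, size=n),
    "R2": 3.0 + 0.5 * co + rng.normal(scale=0.3, size=n),
})
my_data.loc[my_data.index[::10], "R1"] = 0.0  # simulate anomalous low readings
sensor_cols = ["R1", "R2"]
threshold = 0.1

# Keep rows where every sensor reads above the threshold (assumed rule).
keep = (my_data[sensor_cols] > threshold).all(axis=1)
filtered = my_data.loc[keep]

# One row per CO level, holding the mean resistance of each sensor.
grouped = filtered.groupby("CO(ppm)")[sensor_cols].mean().reset_index()
X_mean = grouped[sensor_cols]
y_mean = grouped["CO(ppm)"]

rf_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_mean, y_mean)
ols_model = OLS().fit(X_mean, y_mean)
rf_pred = rf_model.predict(X_mean)
ols_pred = ols_model.predict(X_mean)

fig, (ax_rf, ax_ols) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ax_rf.scatter(y_mean, rf_pred, s=15)
ax_rf.set_title("Random Forest")
ax_ols.scatter(y_mean, ols_pred, s=15)
ax_ols.set_title("Linear Regression")
for ax in (ax_rf, ax_ols):
    ax.set_xlabel("Actual CO(ppm)")
ax_rf.set_ylabel("Predicted CO(ppm)")
```

Averaging by CO level leaves only one row per level, so these fits are evaluated on very few points; any comparison between the two models on such data should be read with that in mind.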

These results are much better. This would require more discussion with expert stakeholders to confirm this approach, but it is a promising direction to obtain a calibration for the sensors. Note that the linear regression model cannot fit a lot of the data nearly as well. Also, note that the vertical scatter in the Random Forest predictions might be an indicator of noise caused by the random humidity values. This can be investigated further by building another model including Humidity as an independent variable.