# Air Quality Sensor Calibration

This notebook demonstrates the calibration of a lower-end sensor against a high-end reference sensor using linear regression. 
The following steps are covered:

1. Reading data from CSV files.
2. Preparing and merging datasets.
3. Performing linear regression.
4. Visualizing the results.
5. Fitting Ridge and Lasso regression models.
6. Comparing the models.

This process helps to establish a mathematical relationship between two sensors to calibrate the lower-end sensor's measurements.

In [21]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns

# Configure plots
sns.set(style='whitegrid')

## Step 1: Load the datasets

Here, we load two datasets:
1. High-end sensor data (`Aero_IoT_Workshop_Batch_4.csv`)
2. Lower-end sensor data (`Team_XX.csv`)

Both datasets should contain a measurement of the same quantity (e.g., PM10) for calibration, plus a timestamp column on which they can be merged.

In [None]:
# Lower-end sensor data
sensor_file_path = '../datasets/Team_01.csv'
sensor_data = pd.read_csv(sensor_file_path)
print(sensor_data.head())

# High-end sensor data
aero_sensor_file_path = '../datasets/Aero_IoT_Workshop_Batch_4.csv'
aero_sensor = pd.read_csv(aero_sensor_file_path)
print(aero_sensor.head())

In [23]:
# Additional lower-end sensor datasets (loaded here, not used in the remaining steps)
sensor_data_2_path = '../datasets/Team_02.csv'
sensor_data_2 = pd.read_csv(sensor_data_2_path)

sensor_data_3_path = '../datasets/Team_03.csv'
sensor_data_3 = pd.read_csv(sensor_data_3_path)

## Step 2: Validate and Merge Datasets

Before merging, ensure the required column exists in both datasets. After validation, merge the data.
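
The column check described above can be sketched as follows. This is a minimal illustration on small made-up DataFrames: the helper name `missing_columns` and the sample values are assumptions, not part of the workshop code. The leading space in the high-end column names mirrors how those columns are selected later in this notebook.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Hypothetical stand-ins for the two sensor tables
low_end = pd.DataFrame({
    'created_at': ['01 Jan 2025 10:00'],
    'PM2.5': [12.0],
    'PM10': [20.0],
})
high_end = pd.DataFrame({
    'Date Time': ['01 Jan 2025 10:00'],
    ' PM10(ppm)': [0.018],
})

# An empty list means every required column is present
print(missing_columns(low_end, ['created_at', 'PM2.5', 'PM10']))
print(missing_columns(high_end, ['Date Time', ' PM10(ppm)', ' PM2.5(ppm)']))
```

Running a check like this before the merge turns a later `KeyError` into an explicit list of what is missing.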

In [None]:
# Convert the 'created_at' column to datetime format
sensor_data['created_at'] = pd.to_datetime(sensor_data['created_at'])

# Format the 'created_at' column to the desired format
sensor_data['created_at'] = sensor_data['created_at'].dt.strftime('%d %b %Y %H:%M')
print(sensor_data.head(10))
sensor_data.info()


# Group by 'created_at' and calculate the mean for 'PM2.5' and 'PM10'
sensor_data = sensor_data.groupby('created_at')[['PM2.5', 'PM10']].mean().reset_index()

# Display the first few rows of the DataFrame
print(sensor_data.head(10))
sensor_data.info()


In [None]:
# Remove rows with duplicate timestamps from the high-end data
aero_sensor = aero_sensor.drop_duplicates(subset='Date Time')
print(aero_sensor.head())
aero_sensor.info()

print(aero_sensor.columns)

In [None]:
# Rename the 'Date Time' column in aero_sensor to match 'created_at' in sensor_data
aero_sensor = aero_sensor.rename(columns={'Date Time': 'created_at'})

# Select only the 'created_at', 'PM10(ppm)', and 'PM2.5(ppm)' columns from aero_sensor
aero_sensor_selected = aero_sensor[['created_at', ' PM10(ppm)', ' PM2.5(ppm)']].copy()

# Convert PM2.5 and PM10 from ppm to ppb (multiply by 1000)
aero_sensor_selected.loc[:, ' PM10(ppm)'] = aero_sensor_selected[' PM10(ppm)'] * 1000
aero_sensor_selected.loc[:, ' PM2.5(ppm)'] = aero_sensor_selected[' PM2.5(ppm)'] * 1000

# Rename the columns
aero_sensor_selected.rename(columns={' PM10(ppm)': 'PM10_highend', ' PM2.5(ppm)': 'PM2.5_highend'}, inplace=True)

# Merge the datasets on 'created_at' column
merged_data = pd.merge(sensor_data, aero_sensor_selected, on='created_at', how='inner')

print(merged_data.head())
merged_data.info()
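
The cells above merge on timestamps formatted as strings. Strings sort alphabetically, so a key such as `'01 Feb 2025 00:00'` sorts before `'31 Jan 2025 23:59'`, and grouped data that spans a month boundary may come out of chronological order. Whether this matters depends on the date range of the actual files, which is not known here. A possible alternative, sketched on made-up readings (all values and the names `low`, `ref`, and `low_per_minute` are assumptions), truncates to the minute while keeping the datetime dtype:

```python
import pandas as pd

# Hypothetical low-end readings: several samples per minute
low = pd.DataFrame({
    'created_at': pd.to_datetime([
        '2025-01-31 23:59:10', '2025-01-31 23:59:40', '2025-02-01 00:00:20',
    ]),
    'PM10': [20.0, 22.0, 30.0],
})

# Hypothetical reference readings: one sample per minute
ref = pd.DataFrame({
    'created_at': pd.to_datetime(['2025-01-31 23:59:00', '2025-02-01 00:00:00']),
    'PM10_highend': [19.0, 28.0],
})

# Truncate to the minute but keep the datetime dtype, then average per minute
low['created_at'] = low['created_at'].dt.floor('min')
low_per_minute = low.groupby('created_at', as_index=False)['PM10'].mean()

# Both keys are datetimes, so the merged rows stay in chronological order
merged = pd.merge(low_per_minute, ref, on='created_at', how='inner')
print(merged)
```

This only works if the reference file's `Date Time` column is also parsed with `pd.to_datetime`, which would have to be checked against the real data.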

In [None]:
# Drop rows with any null values
merged_data = merged_data.dropna()
merged_data.info()

In [28]:
# Define the column to calibrate
column_name = 'PM10'

In [None]:
# Scatter plot of the merged data
plt.figure(figsize=(8, 6))
plt.scatter(merged_data[column_name], merged_data[f'{column_name}_highend'], color='green', s=1)
plt.xlabel('Sensor1 Data')
plt.ylabel('Aero Sensor Data')
plt.title('Scatter Plot of Merged Data')
plt.show()

## Step 3: Perform Linear Regression

Using the lower-end sensor data as the independent variable (X) and the high-end sensor data as the dependent variable (y), 
we fit a linear regression model.
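
For a single predictor, the least-squares line has a closed form: the slope is $S_{xy} / S_{xx}$ and the intercept is $\bar{y} - \text{slope} \cdot \bar{x}$. The sketch below works through that arithmetic on made-up paired readings, without scikit-learn. The values and the helper name `calibrate` are assumptions for illustration only.

```python
# Made-up paired readings: low-end sensor (x) and reference sensor (y)
x = [10.0, 20.0, 30.0, 40.0]
y = [12.0, 21.0, 33.0, 42.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Sums of squares and cross-products around the means
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

slope = s_xy / s_xx
intercept = y_mean - slope * x_mean


def calibrate(reading):
    """Map a raw low-end reading onto the reference sensor's scale."""
    return slope * reading + intercept


print(f'slope={slope:.3f}, intercept={intercept:.3f}')
print(f'calibrated value for a raw reading of 25: {calibrate(25.0):.2f}')
```

Once the slope and intercept are known, calibrating a new low-end reading is a single multiply and add.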

In [None]:

# Extract relevant columns for regression
X = merged_data[[column_name]].values  # Sensor1 data (2D array, as scikit-learn expects)
y = merged_data[f'{column_name}_highend'].values  # High-end sensor data

# Create and fit the linear regression model
model = LinearRegression()
model.fit(X, y)

# Get the coefficient and intercept
coefficient = model.coef_[0]
intercept = model.intercept_

# Print the results
print(f'Coefficient: {coefficient}')
print(f'Intercept: {intercept}')

# Predictions from the linear regression model
linear_predictions = model.predict(X)

## Step 4: Visualize Results

The scatter plot below shows the raw data points (blue dots). The red line represents the regression line fit to the data.

In [None]:
# Plot the results
plt.figure(figsize=(8, 6))
plt.scatter(X, y, color='blue', label='Data Points', s=1)
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel('Sensor1 Data')
plt.ylabel('High-End Sensor Data')
plt.title('Sensor Calibration')
plt.legend()
plt.show()

In [None]:
# Plot the results
plt.figure(figsize=(12, 6))

# Plot the original sensor values as lines
plt.plot(merged_data['created_at'], merged_data[f'{column_name}'], color='blue', label='Original Sensor Values', linewidth=0.5)

# Plot the high-end sensor values as lines
plt.plot(merged_data['created_at'], merged_data[f'{column_name}_highend'], color='green', label='Aero Sensor Values', linewidth=0.5)

plt.xlabel('Timestamp')
plt.ylabel('Sensor Values')
plt.title('Sensor Values Before Calibration')
plt.legend()

# Improve the readability of the x-axis (about 30 ticks; the step is never less than 1)
plt.xticks(ticks=range(0, len(merged_data), max(1, len(merged_data) // 30)), rotation=45)
plt.show()


In [None]:
# Plot the results
plt.figure(figsize=(12, 6))

# Plot the high-end sensor values as lines
plt.plot(merged_data['created_at'], merged_data[f'{column_name}_highend'], color='green', label='Aero Sensor Values', linewidth=0.5)

# Plot the predicted values
plt.plot(merged_data['created_at'], linear_predictions, color='blue', label='Predicted Values', linewidth=0.5)

plt.xlabel('Timestamp')
plt.ylabel('Sensor Values')
plt.title('Sensor Values After Calibration')
plt.legend()

# Improve the readability of the x-axis (about 30 ticks; the step is never less than 1)
plt.xticks(ticks=range(0, len(merged_data), max(1, len(merged_data) // 30)), rotation=45)
plt.show()


In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate metrics for the linear regression model
mse = mean_squared_error(y, linear_predictions)
mae = mean_absolute_error(y, linear_predictions)
r2 = r2_score(y, linear_predictions)

print('Linear Regression Model Metrics:')
print(f'Mean Squared Error (MSE): {mse}')
print(f'Mean Absolute Error (MAE): {mae}')
print(f'R-squared (R2): {r2}')

## Step 5: Other Regression Models

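In addition to the standard linear regression model, we also explore **Ridge** and **Lasso** regression models. These models are useful when dealing with multicollinearity or when we want to perform feature selection. With a single predictor, as in this notebook, neither situation arises, so their main effect is to shrink the slope slightly toward zero.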

#### Ridge Regression
Ridge regression is a technique used in linear regression to address multicollinearity among predictor variables. Multicollinearity occurs when independent variables in a regression model are highly correlated, which can lead to unreliable and unstable estimates of the regression coefficients. Ridge regression adds a penalty proportional to the sum of the squared coefficients to the loss function. This penalty shrinks the coefficients, reducing model complexity and helping to prevent overfitting.

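The Ridge regression model has the same form as the linear regression model:

$$
y = \beta_0 + \beta_1 X
$$

The coefficients are chosen to minimize the squared error plus a penalty on the squared coefficient:

$$
\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 + \lambda \beta_1^2
$$

Where:
- $\beta_0$: Intercept (not penalized)
- $\beta_1$: Coefficient
- $\lambda$: Regularization parameter (`alpha` in scikit-learn)
- $n$: Number of paired readings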
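
With one predictor and an unpenalized intercept, minimizing $\sum_i (y_i - \beta_0 - \beta_1 x_i)^2 + \lambda \beta_1^2$ gives $\beta_1 = S_{xy} / (S_{xx} + \lambda)$, which makes the shrinkage easy to see. The sketch below evaluates that formula on made-up readings. The values and the helper name `ridge_slope` are assumptions for illustration.

```python
# Made-up paired readings: low-end sensor (x) and reference sensor (y)
x = [10.0, 20.0, 30.0, 40.0]
y = [12.0, 21.0, 33.0, 42.0]

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))


def ridge_slope(lam):
    """Slope minimizing squared error plus lam * slope**2 (intercept unpenalized)."""
    return s_xy / (s_xx + lam)


# Larger lambda values pull the slope further toward zero
for lam in [0.0, 2.0, 100.0, 1000.0]:
    slope = ridge_slope(lam)
    intercept = y_mean - slope * x_mean
    print(f'lambda={lam:>6}: slope={slope:.4f}, intercept={intercept:.4f}')
```

Because $S_{xx}$ grows with the number of readings and their spread, a small $\lambda$ such as 2.0 barely changes the slope on a large dataset.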

In [None]:
from sklearn.linear_model import Ridge

# X and y are defined in Step 3
ridge_model = Ridge(alpha=2.0)  # alpha is the regularization strength (lambda); the default is 1.0
ridge_model.fit(X, y)

# Get the coefficient, intercept, and regularization parameter
ridge_coefficient = ridge_model.coef_[0]
ridge_intercept = ridge_model.intercept_
ridge_lambda = ridge_model.alpha  # Access the lambda (regularization parameter)

# Print the first 5 predictions
ridge_predictions = ridge_model.predict(X)
print('Ridge Regression Predictions (first 5):', ridge_predictions[:5])

# Print the coefficients, intercept, and regularization parameter
print(f'Ridge Coefficient: {ridge_coefficient}')
print(f'Ridge Intercept: {ridge_intercept}')
print(f'Ridge Regularization Parameter (alpha): {ridge_lambda}')

# Display the equation for the Ridge regression model
print(f'Ridge Regression Model Equation: y = {ridge_coefficient} * X + {ridge_intercept}')



#### Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) regression is a technique for variable selection and regularization. It adds a penalty proportional to the absolute value of the coefficients to the loss function. This penalty shrinks coefficients toward zero and can set the coefficients of weakly influential features exactly to zero, which removes those features from the model and helps prevent overfitting.

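The Lasso regression model has the same form as the linear regression model:

$$
y = \beta_0 + \beta_1 X
$$

The coefficients are chosen to minimize the squared error plus a penalty on the absolute value of the coefficient (shown with the $\frac{1}{2n}$ scaling that scikit-learn uses):

$$
\min_{\beta_0, \beta_1} \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 + \lambda |\beta_1|
$$

Where:
- $\beta_0$: Intercept (not penalized)
- $\beta_1$: Coefficient
- $\lambda$: Regularization parameter (`alpha` in scikit-learn)
- $n$: Number of paired readings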
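
For one predictor, the Lasso solution is a soft-thresholded version of the least-squares slope. The sketch below assumes the objective $\frac{1}{2n} \sum_i (y_i - \beta_0 - \beta_1 x_i)^2 + \lambda |\beta_1|$, which is the scaling documented for scikit-learn's `Lasso`. The readings and the helper names `soft_threshold` and `lasso_slope` are made up for illustration.

```python
# Made-up paired readings: low-end sensor (x) and reference sensor (y)
x = [10.0, 20.0, 30.0, 40.0]
y = [12.0, 21.0, 33.0, 42.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Per-sample variance of x and covariance of x with y
var_x = sum((xi - x_mean) ** 2 for xi in x) / n
cov_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_slope(lam):
    """Single-predictor Lasso slope under the 1/(2n) objective scaling."""
    return soft_threshold(cov_xy, lam) / var_x


# Once lambda exceeds the covariance, the slope is exactly zero
for lam in [0.0, 1.0, 50.0, 200.0]:
    print(f'lambda={lam:>5}: slope={lasso_slope(lam):.4f}')
```

Unlike Ridge, which only scales the slope down, Lasso subtracts a fixed amount and can reach exactly zero. That is the mechanism behind its feature selection.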


In [None]:
from sklearn.linear_model import Lasso

# X and y are defined in Step 3
lasso_model = Lasso(alpha=1.0)  # alpha is the regularization strength (lambda); 1.0 is the default
lasso_model.fit(X, y)

# Get the coefficient, intercept, and regularization parameter
lasso_coefficient = lasso_model.coef_[0]
lasso_intercept = lasso_model.intercept_
lasso_alpha = lasso_model.alpha  # Access the lambda (regularization parameter)

# Print the first 5 predictions
lasso_predictions = lasso_model.predict(X)
print('Lasso Regression Predictions (first 5):', lasso_predictions[:5])

# Print the coefficients, intercept, and regularization parameter
print(f'Lasso Coefficient: {lasso_coefficient}')
print(f'Lasso Intercept: {lasso_intercept}')
print(f'Lasso Regularization Parameter (alpha): {lasso_alpha}')

# Display the equation for the Lasso regression model
print(f'Lasso Regression Model Equation: y = {lasso_coefficient} * X + {lasso_intercept}')



#### Benefits of Ridge and Lasso Regression
Both Ridge and Lasso regression models help improve the generalization of the model by adding a regularization term to the loss function:
- **Ridge**: Reduces model complexity by shrinking coefficients but does not perform feature selection.
- **Lasso**: Performs feature selection by shrinking some coefficients to zero.


## Step 6: Comparison Results

In [None]:
# Calculate metrics for the linear regression model
linear_mse = mean_squared_error(y, linear_predictions)
linear_mae = mean_absolute_error(y, linear_predictions)
linear_r2 = r2_score(y, linear_predictions)

# Calculate metrics for the lasso regression model
lasso_mse = mean_squared_error(y, lasso_predictions)
lasso_mae = mean_absolute_error(y, lasso_predictions)
lasso_r2 = r2_score(y, lasso_predictions)

# Calculate metrics for the ridge regression model
ridge_mse = mean_squared_error(y, ridge_predictions)
ridge_mae = mean_absolute_error(y, ridge_predictions)
ridge_r2 = r2_score(y, ridge_predictions)

# Create a DataFrame to compare the metrics
metrics_df = pd.DataFrame({
    'Model': ['Linear Regression', 'Lasso Regression', 'Ridge Regression'],
    'MSE': [linear_mse, lasso_mse, ridge_mse],
    'MAE': [linear_mae, lasso_mae, ridge_mae],
    'R2': [linear_r2, lasso_r2, ridge_r2]
})

# Display the metrics comparison
display(metrics_df)

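# Determine the best performing model based on R2 score.
# Note: these scores use the same data the models were fitted on, where
# ordinary least squares can never score lower than Ridge or Lasso.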
best_model = metrics_df.loc[metrics_df['R2'].idxmax()]
print(f'Best Performing Model:\n{best_model}')
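
The comparison above scores each model on the data it was fitted on. One way to make it fairer is to hold out part of the data for scoring. The sketch below does this on synthetic data generated from an assumed linear relationship. The names `X_demo`, `y_demo`, and `held_out_r2`, the noise level, and the 25% split are illustrative choices, not workshop results.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged sensor data (not the workshop measurements)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 1))
y_demo = 1.02 * X_demo[:, 0] + 1.5 + rng.normal(0, 2.0, size=200)

# Hold out 25% of the rows; the models never see them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=2.0),
    'Lasso Regression': Lasso(alpha=1.0),
}

# Fit on the training rows, score on the held-out rows
held_out_r2 = {}
for name, estimator in models.items():
    estimator.fit(X_train, y_train)
    held_out_r2[name] = r2_score(y_test, estimator.predict(X_test))

for name, score in held_out_r2.items():
    print(f'{name}: held-out R2 = {score:.4f}')
```

With a single predictor and small `alpha` values, the three held-out scores are expected to be very close to one another.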