Module: Weather Data Outlier Detection

Description:
Outlier detection on weather data using two methods:
1. Weighted Mahalanobis Distance
2. Kernel Density Estimation (KDE)

The analysis compares three weight configurations for the Mahalanobis distance 
and three hyperparameter settings for KDE.

In [1]:
"""
Consumes:
- file_path: Path to the CSV file containing weather data.

Produces:
- weather_data: DataFrame containing the raw weather data.
- X_normalized: Normalized DataFrame for selected numerical columns.

Description:
Loads the weather data from a CSV file, converts the categorical 'cloud' 
column to numerical values, and normalizes the selected numerical columns 
(min_temp, max_temp, rainfall, wind_speed, humidity, and cloud_numeric) using MinMaxScaler.
"""

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KernelDensity

# Load the dataset
file_path = 'RHOUSTONW.csv'
weather_data = pd.read_csv(file_path)

# Convert the categorical 'cloud' column to integer codes (assigned in sorted label order, so the numeric ordering is arbitrary)
weather_data['cloud_numeric'] = weather_data['cloud'].astype('category').cat.codes

# Select the relevant numerical columns for the weighted Mahalanobis distance calculation
numerical_features = ['min_temp', 'max_temp', 'rainfall', 'wind_speed', 'humidity', 'cloud_numeric']
X = weather_data[numerical_features]

# Normalize the data with Min-Max scaling so that every feature lies in the range [0, 1]
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
X_normalized = pd.DataFrame(X_normalized, columns=numerical_features)
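
The encoding and scaling steps above can be checked on a toy frame. This is a minimal sketch rather than part of the original pipeline: the three-row `toy` frame is hypothetical, and the manual formula assumes the default MinMaxScaler feature range of (0, 1).

```python
import pandas as pd

# Hypothetical three-row frame; the values are illustrative only.
toy = pd.DataFrame({
    'max_temp': [55, 68, 95],
    'cloud': ['Mostly Cloudy', 'Fair', 'T-Storm'],
})

# Category codes follow the sorted order of the labels, not any physical ordering.
toy['cloud_numeric'] = toy['cloud'].astype('category').cat.codes
code_map = {label: int(code) for label, code in zip(toy['cloud'], toy['cloud_numeric'])}

# Min-Max scaling by hand: (x - min) / (max - min), computed per column.
cols = toy[['max_temp', 'cloud_numeric']].astype(float)
toy_scaled = (cols - cols.min()) / (cols.max() - cols.min())

print(code_map)
print(toy_scaled)
```

Because the codes are ordinal but arbitrary, distances along `cloud_numeric` reflect alphabetical position only; one-hot encoding would be one alternative if that matters for the analysis.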

In [2]:
"""
Consumes:
- X_normalized: Normalized data.

Produces:
- mean_vec: Mean vector of the normalized data.
- cov_matrix: Covariance matrix of the normalized data.
- inv_cov_matrix: Inverse covariance matrix.

Description:
Computes the mean vector and the covariance matrix of the normalized data. 
The inverse of the covariance matrix is also calculated for use in Mahalanobis distance calculation.
"""
# Compute the mean and covariance matrix of the normalized data
mean_vec = X_normalized.mean(axis=0)
cov_matrix = np.cov(X_normalized.T)
inv_cov_matrix = np.linalg.inv(cov_matrix)
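
`np.linalg.inv` fails on an exactly singular covariance matrix and is numerically unreliable on a nearly singular one, which can happen when two features are almost collinear. The sketch below shows one possible guard. The helper `safe_inverse` and its `cond_limit` threshold are hypothetical names and values chosen for illustration, and the data is synthetic.

```python
import numpy as np

def safe_inverse(cov, cond_limit=1e12):
    """Invert cov, falling back to the pseudo-inverse if it is ill-conditioned.

    cond_limit is an illustrative threshold, not a value from the notebook.
    """
    if np.linalg.cond(cov) > cond_limit:
        return np.linalg.pinv(cov)
    return np.linalg.inv(cov)

rng = np.random.default_rng(0)
sample = rng.random((50, 3))             # synthetic stand-in for X_normalized
cov = np.cov(sample.T)
inv_cov = safe_inverse(cov)

# Duplicating a column makes the covariance matrix singular.
collinear = np.column_stack([sample, sample[:, 0]])
inv_collinear = safe_inverse(np.cov(collinear.T))

print(np.allclose(inv_cov @ cov, np.eye(3)))
print(inv_collinear.shape)
```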

In [3]:
"""
Inputs:
- row: A row of feature values.
- mean_vec: Mean vector of the dataset.
- inv_cov_matrix: Inverse covariance matrix of the dataset.
- weights: Weight vector that adjusts the importance of each feature.

Outputs:
- distance: Computed Mahalanobis distance for the row.

Description:
Computes the weighted Mahalanobis distance for a given row, 
adjusting the significance of each feature based on the provided weight vector.
"""

# Function to compute the weighted Mahalanobis distance
def weighted_mahalanobis(row, mean_vec, inv_cov_matrix, weights):
    """
    Compute the weighted Mahalanobis distance for a given row.
    
    Parameters:
    row (array): Feature values for the given data point.
    mean_vec (array): Mean vector of the dataset.
    inv_cov_matrix (array): Inverse of the covariance matrix.
    weights (array): Weight vector to adjust feature importance.
    
    Returns:
    float: The weighted Mahalanobis distance.
    """
    # Compute the difference from the mean
    diff = row - mean_vec
    
    # Apply the weights to the differences
    weighted_diff = weights * diff
    
    # Compute the weighted Mahalanobis distance
    distance = np.sqrt(np.dot(np.dot(weighted_diff.T, inv_cov_matrix), weighted_diff))
    
    return distance
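
Two properties of this distance are worth checking before comparing weight configurations: the distance of the mean to itself is zero, and multiplying every weight by a constant multiplies the distance by the same constant. The sketch below verifies both on synthetic data. It repeats the function body so that it runs on its own, and the sample array is an assumption, not the weather data.

```python
import numpy as np

def weighted_mahalanobis(row, mean_vec, inv_cov_matrix, weights):
    # Same computation as the function above, repeated so this sketch is self-contained.
    weighted_diff = weights * (row - mean_vec)
    return np.sqrt(weighted_diff @ inv_cov_matrix @ weighted_diff)

rng = np.random.default_rng(1)
sample = rng.random((40, 3))                     # synthetic data
mean_vec = sample.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(sample.T))
point = sample[0]

unit = np.ones(3)
d_unit = weighted_mahalanobis(point, mean_vec, inv_cov, unit)
d_scaled = weighted_mahalanobis(point, mean_vec, inv_cov, 1.5 * unit)
d_mean = weighted_mahalanobis(mean_vec, mean_vec, inv_cov, unit)

print(d_unit, d_scaled, d_mean)
```

One consequence is that scores produced with different weight vectors are on different scales, so comparing rankings across configurations is more meaningful than comparing raw score values.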

In [4]:
"""
Inputs:
- X_normalized: Normalized data to be fitted into the KDE.
- kernel: The kernel to use in KDE (default: 'gaussian').
- bandwidth: Bandwidth parameter for KDE (default: 1.0).
- algorithm: Algorithm to use in KDE (default: 'kd_tree').
- metric: Distance metric to use (default: 'euclidean').

Outputs:
- outlier_scores: A Series of outlier scores for each data point.

Description:
Fits a Kernel Density Estimation model to the data with the specified hyperparameters 
and converts the log density estimate of each data point to an outlier score 
(the negated log density).
"""
# Function to perform KDE based on hyperparameter setting
def apply_kde(X_normalized, kernel='gaussian', bandwidth=1.0, algorithm='kd_tree', metric='euclidean'):
    """
    Apply Kernel Density Estimation (KDE) to the dataset to calculate outlier scores.
    
    Parameters:
    X_normalized (DataFrame): Normalized data to fit the KDE.
    kernel (str): The kernel to use in KDE (default: 'gaussian').
    bandwidth (float): The bandwidth (default: 1.0).
    algorithm (str): The algorithm to use in KDE (default: 'kd_tree').
    metric (str): The distance metric to use in KDE (default: 'euclidean').
    
    Returns:
    Series: A Series with the calculated Outlier Scores.
    """
    # Initialize the KDE model with the given hyperparameters
    kde = KernelDensity(kernel=kernel, bandwidth=bandwidth, algorithm=algorithm, metric=metric)
    
    # Fit the KDE model to the normalized data
    kde.fit(X_normalized)
    
    # Compute log density estimates for each data point
    log_density = kde.score_samples(X_normalized)
    
    # Negate the log densities so that lower density gives a higher outlier score
    # (scores can be negative wherever the estimated density exceeds 1)
    outlier_scores = -log_density
    
    return pd.Series(outlier_scores)
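
A negated log density is not guaranteed to be positive: a probability density can exceed 1, in which case its log is positive and the score is negative. The sketch below illustrates this with a hand-written Gaussian KDE in NumPy. The helper `gaussian_kde_log_density` is a hypothetical name implementing the textbook formula with Euclidean distance, it is not the scikit-learn implementation, and the clustered points are synthetic.

```python
import numpy as np

def gaussian_kde_log_density(train, query, bandwidth):
    """Textbook Gaussian KDE log density with Euclidean distance (illustrative helper)."""
    train = np.asarray(train, dtype=float)
    query = np.asarray(query, dtype=float)
    n, d = train.shape
    # Squared Euclidean distance between every query point and every training point.
    sq_dist = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    log_kernel = -0.5 * sq_dist / bandwidth ** 2
    # Log of the Gaussian normalizing constant, averaged over n training points.
    log_norm = -0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2) - np.log(n)
    # Log-sum-exp over the training points, shifted by the row maximum for stability.
    shift = log_kernel.max(axis=1, keepdims=True)
    return log_norm + shift[:, 0] + np.log(np.exp(log_kernel - shift).sum(axis=1))

# Six features, matching the length of numerical_features; tightly clustered synthetic points.
rng = np.random.default_rng(4)
cluster = 0.5 + 0.01 * rng.standard_normal((50, 6))
scores = -gaussian_kde_log_density(cluster, cluster, bandwidth=0.3)

print(scores.min(), scores.max())
```

With six features scaled to [0, 1] and a bandwidth of 0.3, the peak of a single kernel is already above 1, so negative scores in dense regions are expected rather than a sign of a bug.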

In [5]:
"""
Consumes:
- X_normalized: Normalized weather data.
- mean_vec: Mean vector of the dataset.
- inv_cov_matrix: Inverse covariance matrix.
- weights1, weights2, weights3: Different weight vectors for Mahalanobis distance.

Produces:
- distance_weather_data1, distance_weather_data2, distance_weather_data3: 
  DataFrames with Mahalanobis distance-based outlier scores using different weights.

Description:
Applies the weighted Mahalanobis distance function to each row of the 
normalized weather data, using three different weight configurations. The outlier scores 
are stored in separate DataFrames for each configuration.
"""
# Hyperparameters for the distance-based method
# Each weight vector sets the significance of every feature in the weighted Mahalanobis distance
# (order: min_temp, max_temp, rainfall, wind_speed, humidity, cloud_numeric)
# Gives equal importance to all columns
weights1 = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# Gives more importance to the two temperature columns and rainfall, as these are likely the most significant information in weather data.
weights2 = np.array([1.5, 1.5, 1.5, 1.0, 1.0, 1.0])

# Gives more importance to the wind_speed, humidity and cloud columns, since significant deviations in these can also indicate outliers.
weights3 = np.array([1.0, 1.0, 1.0, 1.5, 1.5, 1.3])

# Create 3 copies of the DataFrame, one per distance-based run with a different weight vector
distance_weather_data1 = weather_data.copy()
distance_weather_data2 = weather_data.copy()
distance_weather_data3 = weather_data.copy()

# Inputs and outputs for the distance-based method
# Inputs: rows, mean vector, inverse covariance matrix, weights for each column
# Outputs: the 3 DataFrames created above, one per run.

# Apply the weighted Mahalanobis distance function to each row of the normalized data
distance_weather_data1['OLS'] = X_normalized.apply(lambda row: weighted_mahalanobis(row, mean_vec, inv_cov_matrix, weights1), axis=1)
distance_weather_data2['OLS'] = X_normalized.apply(lambda row: weighted_mahalanobis(row, mean_vec, inv_cov_matrix, weights2), axis=1)
distance_weather_data3['OLS'] = X_normalized.apply(lambda row: weighted_mahalanobis(row, mean_vec, inv_cov_matrix, weights3), axis=1)

# Display the outputs of the distance-based method
display(distance_weather_data1.head())
display(distance_weather_data2.head())
display(distance_weather_data3.head())

Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
0,1/1/2021,41,55,0.0,8,51,Mostly Cloudy,9,2.786752
1,1/2/2021,41,59,0.0,7,42,Fair,2,2.20563
2,1/3/2021,43,68,0.0,13,37,Fair,2,1.831003
3,1/4/2021,49,75,0.0,3,43,Fair,2,2.136392
4,1/5/2021,51,69,0.0,13,38,Fair,2,1.601881


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
0,1/1/2021,41,55,0.0,8,51,Mostly Cloudy,9,3.896156
1,1/2/2021,41,59,0.0,7,42,Fair,2,3.008417
2,1/3/2021,43,68,0.0,13,37,Fair,2,2.324273
3,1/4/2021,49,75,0.0,3,43,Fair,2,2.461131
4,1/5/2021,51,69,0.0,13,38,Fair,2,1.872487


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
0,1/1/2021,41,55,0.0,8,51,Mostly Cloudy,9,3.042749
1,1/2/2021,41,59,0.0,7,42,Fair,2,2.596371
2,1/3/2021,43,68,0.0,13,37,Fair,2,2.302502
3,1/4/2021,49,75,0.0,3,43,Fair,2,2.89487
4,1/5/2021,51,69,0.0,13,38,Fair,2,2.133264
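
The row-wise `apply` used above calls a Python function once per row. As a possible alternative, the same quadratic form can be evaluated for all rows in a single `np.einsum` call. The sketch below compares both approaches on a synthetic frame; the column names, weights, and data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
features = pd.DataFrame(rng.random((30, 3)), columns=['a', 'b', 'c'])  # synthetic
mean_vec = features.mean(axis=0).to_numpy()
inv_cov = np.linalg.inv(np.cov(features.to_numpy().T))
weights = np.array([1.5, 1.0, 1.3])

# Row-wise version, mirroring the apply-based code above.
def row_distance(row):
    wd = weights * (row.to_numpy() - mean_vec)
    return np.sqrt(wd @ inv_cov @ wd)

row_wise = features.apply(row_distance, axis=1).to_numpy()

# Vectorized version: one einsum evaluates the quadratic form for every row.
weighted_diff = (features.to_numpy() - mean_vec) * weights
vectorized = np.sqrt(np.einsum('ij,jk,ik->i', weighted_diff, inv_cov, weighted_diff))

print(np.allclose(row_wise, vectorized))
```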


In [6]:
"""
Consumes:
- X_normalized: Normalized weather data.
- kde_hyperparams1, kde_hyperparams2, kde_hyperparams3: Hyperparameters for KDE.

Produces:
- kde_weather_data1, kde_weather_data2, kde_weather_data3: 
  DataFrames with KDE-based outlier scores using different hyperparameters.

Description:
Applies KDE-based outlier detection to the weather data using three different 
sets of hyperparameters. The outlier scores are stored in separate DataFrames for each configuration.
"""
# Hyperparameters for the density-based method
# Define 3 settings of KDE hyperparameters
# Uses a bandwidth of 0.5, which gives a smoother density estimate
kde_hyperparams1 = {'kernel': 'gaussian', 'bandwidth': 0.5, 'algorithm': 'kd_tree', 'metric': 'euclidean'}

# Uses a bandwidth of 0.3, which is less smooth and might be a better fit
kde_hyperparams2 = {'kernel': 'gaussian', 'bandwidth': 0.3, 'algorithm': 'kd_tree', 'metric': 'euclidean'}

# Uses a bandwidth of 0.3 with Manhattan (absolute-difference) distance instead of Euclidean distance
kde_hyperparams3 = {'kernel': 'gaussian', 'bandwidth': 0.3, 'algorithm': 'kd_tree', 'metric': 'manhattan'}

# Create 3 copies of the weather data for KDE iterations with 3 different hyperparameter settings
kde_weather_data1 = weather_data.copy()
kde_weather_data2 = weather_data.copy()
kde_weather_data3 = weather_data.copy()

# Apply KDE
kde_weather_data1['OLS'] = apply_kde(X_normalized, **kde_hyperparams1)
kde_weather_data2['OLS'] = apply_kde(X_normalized, **kde_hyperparams2)
kde_weather_data3['OLS'] = apply_kde(X_normalized, **kde_hyperparams3)

display(kde_weather_data1.head())
display(kde_weather_data2.head())
display(kde_weather_data3.head())

Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
0,1/1/2021,41,55,0.0,8,51,Mostly Cloudy,9,2.24022
1,1/2/2021,41,59,0.0,7,42,Fair,2,2.291691
2,1/3/2021,43,68,0.0,13,37,Fair,2,2.168181
3,1/4/2021,49,75,0.0,3,43,Fair,2,2.178219
4,1/5/2021,51,69,0.0,13,38,Fair,2,2.085519


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
0,1/1/2021,41,55,0.0,8,51,Mostly Cloudy,9,0.373547
1,1/2/2021,41,59,0.0,7,42,Fair,2,0.291396
2,1/3/2021,43,68,0.0,13,37,Fair,2,0.053631
3,1/4/2021,49,75,0.0,3,43,Fair,2,0.147905
4,1/5/2021,51,69,0.0,13,38,Fair,2,-0.083211


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
0,1/1/2021,41,55,0.0,8,51,Mostly Cloudy,9,2.322017
1,1/2/2021,41,59,0.0,7,42,Fair,2,1.587892
2,1/3/2021,43,68,0.0,13,37,Fair,2,1.160541
3,1/4/2021,49,75,0.0,3,43,Fair,2,1.595552
4,1/5/2021,51,69,0.0,13,38,Fair,2,1.038873
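
The bandwidths above are chosen by hand. One common way to compare candidates is cross-validated log-likelihood, which `GridSearchCV` can compute because `KernelDensity` provides a `score` method. This is a sketch under stated assumptions, not the notebook's method: the candidate grid, the number of folds, and the synthetic sample are all illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
sample = rng.random((120, 3))                  # synthetic stand-in for X_normalized
candidates = [0.1, 0.3, 0.5, 1.0]              # illustrative bandwidth grid

# With no explicit scoring, GridSearchCV uses KernelDensity.score,
# the total log-likelihood of the held-out fold.
search = GridSearchCV(KernelDensity(kernel='gaussian'), {'bandwidth': candidates}, cv=5)
search.fit(sample)
best_bandwidth = search.best_params_['bandwidth']

print(best_bandwidth)
```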


In [7]:
"""
Inputs:
- distance_weather_data1, distance_weather_data2, distance_weather_data3: 
  DataFrames with Mahalanobis distance-based outlier scores.

Outputs:
- Displays the top 3 outliers and the bottom 1 normal condition for each weight configuration.

Description:
Sorts the Mahalanobis distance-based DataFrames based on the OLS column 
in descending order and displays the top 3 outliers and the bottom 1 normal weather condition 
for each of the three different weight configurations.
"""
# Sort the DataFrames based on the OLS (Outlier Score) column
distance_weather_data1_sorted = distance_weather_data1.sort_values(by='OLS', ascending=False)
distance_weather_data2_sorted = distance_weather_data2.sort_values(by='OLS', ascending=False)
distance_weather_data3_sorted = distance_weather_data3.sort_values(by='OLS', ascending=False)

# Top 3 outliers and bottom 1 normal weather condition when giving all columns equal priority
print("Top 3 for distance_weather_data1:")
display(distance_weather_data1_sorted.head(3))
print("Bottom 1 for distance_weather_data1:")
display(distance_weather_data1_sorted.tail(1))

# Top 3 outliers and bottom 1 normal weather condition when giving the two temperature columns and rainfall more priority
print("Top 3 for distance_weather_data2:")
display(distance_weather_data2_sorted.head(3))
print("Bottom 1 for distance_weather_data2:")
display(distance_weather_data2_sorted.tail(1))

# Top 3 outliers and bottom 1 normal weather condition when giving wind speed, humidity and cloud columns more priority
print("Top 3 for distance_weather_data3:")
display(distance_weather_data3_sorted.head(3))
print("Bottom 1 for distance_weather_data3:")
display(distance_weather_data3_sorted.tail(1))

Top 3 for distance_weather_data1:


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
226,8/15/2021,0,95,4.9,7,52,Mostly Cloudy,9,9.469413
229,8/18/2021,0,92,0.5,14,77,T-Storm,14,8.335544
172,6/22/2021,0,94,1.3,14,54,Partly Cloudy,11,7.884883


Bottom 1 for distance_weather_data1:


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
358,12/25/2021,65,84,0.0,13,56,Mostly Cloudy,9,0.78834


Top 3 for distance_weather_data2:


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
226,8/15/2021,0,95,4.9,7,52,Mostly Cloudy,9,14.225933
229,8/18/2021,0,92,0.5,14,77,T-Storm,14,12.118778
172,6/22/2021,0,94,1.3,14,54,Partly Cloudy,11,11.812784


Bottom 1 for distance_weather_data2:


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
268,9/26/2021,67,86,0.0,10,46,Mostly Cloudy,9,0.976795


Top 3 for distance_weather_data3:


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
226,8/15/2021,0,95,4.9,7,52,Mostly Cloudy,9,9.4752
229,8/18/2021,0,92,0.5,14,77,T-Storm,14,8.776083
103,4/14/2021,0,79,1.2,7,88,Heavy T-Storm,6,7.958584


Bottom 1 for distance_weather_data3:


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
362,12/29/2021,72,84,0.0,12,56,Mostly Cloudy,9,0.933388
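
Since the raw scores are on different scales for each weight vector, one simple way to compare configurations is the overlap of their top-3 sets. The sketch below uses the row indices from the output above; the `jaccard` helper is an illustrative name, and Jaccard similarity is just one possible overlap measure.

```python
from itertools import combinations

# Row indices of the top-3 outliers, copied from the output above.
top3 = {
    'weights1': {226, 229, 172},
    'weights2': {226, 229, 172},
    'weights3': {226, 229, 103},
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

overlap = {(m, n): jaccard(top3[m], top3[n]) for m, n in combinations(top3, 2)}
for pair, score in overlap.items():
    print(pair, score)
```

The first two configurations flag the same three rows, while the third swaps row 172 for row 103.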


In [8]:
"""
Inputs:
- kde_weather_data1, kde_weather_data2, kde_weather_data3: 
  DataFrames with KDE-based outlier scores.

Outputs:
- Displays the top 3 outliers and the bottom 1 normal condition for each KDE hyperparameter configuration.

Description:
Sorts the KDE-based DataFrames based on the OLS column 
in descending order and displays the top 3 outliers and the bottom 1 normal weather condition 
for each of the three different KDE hyperparameter configurations.
"""
# Sort the DataFrames based on the OLS (Outlier Score) column
kde_weather_data1_sorted = kde_weather_data1.sort_values(by='OLS', ascending=False)
kde_weather_data2_sorted = kde_weather_data2.sort_values(by='OLS', ascending=False)
kde_weather_data3_sorted = kde_weather_data3.sort_values(by='OLS', ascending=False)

# Top 3 outliers and bottom 1 normal weather condition when 0.5 bandwidth and euclidean distance are used
print("Top 3 for kde_weather_data1:")
display(kde_weather_data1_sorted.head(3))
print("Bottom 1 Kde_weather_data1")
display(kde_weather_data1_sorted.tail(1))

# Top 3 outliers and bottom 1 normal weather condition when 0.3 bandwidth and euclidean distance are used
print("Top 3 for kde_weather_data2")
display(kde_weather_data2_sorted.head(3))
print("Bottom 1 for kde_weather_data2")
display(kde_weather_data2_sorted.tail(1))

# Top 3 outliers and bottom 1 normal weather condition when 0.3 bandwidth and manhattan distance are used
print("Top 3 for kde_weather_data3")
display(kde_weather_data3_sorted.head(3))
print("Bottom 1 for kde_weather_data3")
display(kde_weather_data3_sorted.tail(1))

Top 3 for kde_weather_data1:


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
226,8/15/2021,0,95,4.9,7,52,Mostly Cloudy,9,4.735128
178,6/28/2021,75,81,4.7,7,87,Light Rain,7,3.64485
45,2/15/2021,18,27,0.0,15,66,Partly Cloudy,11,3.498994


Bottom 1 Kde_weather_data1


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
358,12/25/2021,65,84,0.0,13,56,Mostly Cloudy,9,1.797521


Top 3 for kde_weather_data2


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
226,8/15/2021,0,95,4.9,7,52,Mostly Cloudy,9,4.083535
103,4/14/2021,0,79,1.2,7,88,Heavy T-Storm,6,2.979349
229,8/18/2021,0,92,0.5,14,77,T-Storm,14,2.911342


Bottom 1 for kde_weather_data2


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
362,12/29/2021,72,84,0.0,12,56,Mostly Cloudy,9,-0.768058


Top 3 for kde_weather_data3


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
226,8/15/2021,0,95,4.9,7,52,Mostly Cloudy,9,4.178454
103,4/14/2021,0,79,1.2,7,88,Heavy T-Storm,6,4.173191
299,10/27/2021,65,82,3.6,21,36,Partly Cloudy / Windy,12,4.16853


Bottom 1 for kde_weather_data3


Unnamed: 0,date,min_temp,max_temp,rainfall,wind_speed,humidity,cloud,cloud_numeric,OLS
161,6/11/2021,77,93,0.0,12,50,Mostly Cloudy,9,0.067394
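
To summarize agreement between the two methods, the top-3 row indices from all six configurations can be tallied. The sketch below copies the indices from the outputs of cells 7 and 8; the dictionary keys are labels chosen for illustration.

```python
from collections import Counter

# Top-3 row indices per configuration, copied from the outputs above.
top3_by_config = {
    'distance_weights1': [226, 229, 172],
    'distance_weights2': [226, 229, 172],
    'distance_weights3': [226, 229, 103],
    'kde_hyperparams1': [226, 178, 45],
    'kde_hyperparams2': [226, 103, 229],
    'kde_hyperparams3': [226, 103, 299],
}

# Count in how many configurations each row appears among the top 3.
votes = Counter(idx for rows in top3_by_config.values() for idx in rows)
consensus = votes.most_common()

for idx, count in consensus:
    print(idx, count)
```

Row 226 (8/15/2021) appears in the top 3 of every configuration, which makes it the most robust outlier candidate across both methods.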
