# Air Filter Restriction Analysis
# Baseline Modeling and Clog Detection System

**PRESENTATION OVERVIEW:**
This notebook demonstrates a data-driven approach to monitoring air filter health in industrial equipment by analyzing the relationship between horsepower demand and air filter restriction.

**KEY TALKING POINTS:**
- Restriction rises with horsepower even on a clean filter, because higher load draws more air through it
- A clogged filter adds restriction above that normal level at the same horsepower
- The added restriction lowers the maximum horsepower the engine can reach before it hits its restriction limit and de-rates
- Our goal: quantify filter health as "percent clogged" to enable predictive maintenance
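
One way to make "percent clogged" concrete is sketched below. It is an illustration under stated assumptions, not necessarily the calculation this project will adopt: the measured restriction is scaled linearly between the clean-filter baseline at the current horsepower (0%) and the asset's maximum allowable restriction (100%). The function name `percent_clogged`, its parameters, and the example numbers are all hypothetical.

```python
import numpy as np

def percent_clogged(restriction, baseline, limit):
    """Hypothetical metric: 0% at the clean-filter baseline, 100% at the limit.

    restriction: measured air filter restriction
    baseline:    expected clean-filter restriction at the same horsepower
    limit:       maximum allowable restriction for the asset type
    All three must share the same units; inputs may be scalars or arrays.
    """
    restriction = np.asarray(restriction, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    headroom = limit - baseline
    if np.any(headroom <= 0):
        raise ValueError("limit must exceed baseline")
    pct = (restriction - baseline) / headroom * 100.0
    # Readings below the baseline or above the limit are clamped to 0..100
    return np.clip(pct, 0.0, 100.0)

# Made-up readings: at the baseline, halfway to the limit, and past the limit
example = percent_clogged([10.0, 17.5, 30.0], baseline=10.0, limit=25.0)
print(example)
```

Clamping keeps the score readable on a dashboard, at the cost of hiding how far past the limit a reading is; a monitoring system might report the unclamped value as well.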

## Section 1: Environment Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
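# Silence only FutureWarning noise from library version mismatches;
# other warnings may point at real data problems, so they stay visible
warnings.filterwarnings('ignore', category=FutureWarning)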

# Set professional styling
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 10

print("✓ Environment configured successfully")

## Section 2: Data Loading

**TALKING POINTS:**
- We have two key datasets for this analysis
- Historical operational data: timestamps, horsepower, and restriction measurements
- Asset specifications: maximum allowable limits per asset type
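
The two datasets are only useful together if every asset type in the operational data has a matching limits row. The sketch below shows one way to check that with a left join. The frames are tiny made-up stand-ins for the CSV files, and the column names (`asset_type`, `horsepower`, `restriction`, `max_restriction`) are assumptions, not the actual schema.

```python
import pandas as pd

# Made-up frames standing in for the two CSV files; all column names are assumed
ops = pd.DataFrame({
    "asset_type": ["A", "A", "B", "C"],
    "horsepower": [500.0, 900.0, 700.0, 650.0],
    "restriction": [8.0, 14.0, 11.0, 9.5],
})
limits = pd.DataFrame({
    "asset_type": ["A", "B"],
    "max_restriction": [25.0, 30.0],
})

# validate="many_to_one" raises if an asset type appears twice in the limits table;
# indicator=True adds a "_merge" column showing which rows found a match
merged = ops.merge(limits, on="asset_type", how="left",
                   validate="many_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "asset_type"].unique()
print(f"Asset types without limits: {list(unmatched)}")
```

Running a check like this right after loading surfaces a missing limits row early, before it turns into NaN thresholds downstream.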

In [None]:
# Load datasets
air_filter_df = pd.read_csv('../data/air_filter_data.csv')
asset_limits_df = pd.read_csv('../data/asset_limits.csv')

print(f"✓ Air Filter Data loaded: {air_filter_df.shape[0]:,} rows × {air_filter_df.shape[1]} columns")
print(f"✓ Asset Limits Data loaded: {asset_limits_df.shape[0]} asset types")

## Section 3: Data Quality Assessment

**TALKING POINTS:**
- Data quality is critical for reliable predictive maintenance
- We check for completeness, data types, and reasonable value ranges
- Any missing data or anomalies need to be addressed before modeling
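
The "reasonable value ranges" check can be made explicit with a small helper. This is a minimal sketch: the function name `flag_out_of_range`, the column names, and the bounds are placeholders, and real bounds would come from sensor specifications or the asset limits file.

```python
import pandas as pd

def flag_out_of_range(df, bounds):
    """Return a boolean frame marking values that are missing or outside bounds.

    bounds: dict mapping column name -> (low, high), both inclusive.
    """
    flags = pd.DataFrame(index=df.index)
    for col, (low, high) in bounds.items():
        flags[col] = df[col].isna() | ~df[col].between(low, high)
    return flags

# Made-up sample with a negative horsepower, a missing value, and a spike
sample = pd.DataFrame({
    "horsepower": [450.0, -20.0, 1200.0, None],
    "restriction": [7.5, 9.0, 140.0, 8.0],
})
bounds = {"horsepower": (0, 3000), "restriction": (0, 100)}  # placeholder bounds
flags = flag_out_of_range(sample, bounds)
print(flags.sum())  # number of flagged values per column
```

Returning a boolean frame rather than dropping rows leaves the decision (drop, impute, or investigate) to a later step.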

In [None]:
# Create a professional data quality report (per asset type)
def create_data_quality_report(df, name, asset_col=None):
    print(f"\n{'='*70}")
    print(f"DATA QUALITY REPORT: {name}")
    print(f"{'='*70}\n")
    
    # Basic info
    print(f"📊 Dataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"💾 Memory Usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB\n")
    
    # Column information
    print("📋 Column Information:")
    print("-" * 70)
    for col in df.columns:
        dtype = df[col].dtype
        null_count = df[col].isnull().sum()
        null_pct = (null_count / len(df)) * 100
        unique_count = df[col].nunique()
        print(f"  {col:30s} | {str(dtype):10s} | Nulls: {null_count:5d} ({null_pct:5.2f}%) | Unique: {unique_count:6d}")
    
    # Statistical summary for numeric columns, per asset type if applicable
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if asset_col and asset_col in df.columns:
        print(f"\n📈 Statistical Summary (Numeric Columns) by Asset:")
        print("-" * 70)
        for asset, group in df.groupby(asset_col):
            print(f"Asset: {asset}")
            print(group[numeric_cols].describe().round(2).to_string())
            print("-" * 70)
    elif len(numeric_cols) > 0:
        print(f"\n📈 Statistical Summary (Numeric Columns):")
        print("-" * 70)
        print(df[numeric_cols].describe().round(2).to_string())
    
    print(f"\n{'='*70}\n")

# Identify asset type column for air_filter_df
air_filter_asset_col = None
asset_col_candidates = ['asset', 'AssetType', 'asset_type', 'Asset']
for col in asset_col_candidates:
    if col in air_filter_df.columns:
        air_filter_asset_col = col
        break
if air_filter_asset_col is None:
    for col in air_filter_df.columns:
        if 'asset' in col.lower():
            air_filter_asset_col = col
            break

# Generate reports for each unique asset in air_filter_df
if air_filter_asset_col:
    for asset in air_filter_df[air_filter_asset_col].unique():
        print(f"\n{'#'*30} ANALYSIS FOR ASSET: {asset} {'#'*30}\n")
        asset_df = air_filter_df[air_filter_df[air_filter_asset_col] == asset]
        create_data_quality_report(asset_df, f"Air Filter Operational Data - Asset: {asset}", asset_col=air_filter_asset_col)
else:
    create_data_quality_report(air_filter_df, "Air Filter Operational Data")

# Also generate a single report for asset_limits_df (not per asset, as it's config data)
create_data_quality_report(asset_limits_df, "Asset Limits Configuration")

## Section 4: Data Preview

**TALKING POINTS:**
- Let's examine the structure of our datasets
- The operational data holds timestamped measurements recorded from the equipment
- The limits data defines safety thresholds for each asset type

In [None]:
print("\n" + "="*70)
print("SAMPLE DATA: Air Filter Operational Records")
print("="*70)
display(air_filter_df.head(10))

In [None]:
print("\n" + "="*70)
print("SAMPLE DATA: Asset Limits Configuration")
print("="*70)
display(asset_limits_df)

## Section 5: Asset Type Analysis

**TALKING POINTS:**
- Different asset types may have different operating characteristics
- Understanding the distribution helps us plan our baseline modeling approach
- We need sufficient data for each asset type to build reliable models
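
"Sufficient data" can be turned into an explicit gate before any model is fitted. In the sketch below, the `asset_type` column name, the sample records, and the `MIN_RECORDS` threshold are all hypothetical; a realistic threshold would be much larger and chosen from the modeling requirements.

```python
import pandas as pd

MIN_RECORDS = 3  # hypothetical threshold, kept tiny so the example is readable

# Made-up records: five for asset A, two for B, three for C
records = pd.DataFrame({"asset_type": ["A"] * 5 + ["B"] * 2 + ["C"] * 3})

counts = records["asset_type"].value_counts()
sufficient = sorted(counts[counts >= MIN_RECORDS].index)
insufficient = sorted(counts[counts < MIN_RECORDS].index)
print("enough data:", sufficient)
print("too little data:", insufficient)
```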

In [None]:
# Identify asset type column (handle different naming conventions)
asset_col_candidates = ['asset', 'AssetType', 'asset_type', 'Asset']
asset_type_col = None

for col in asset_col_candidates:
    if col in air_filter_df.columns:
        asset_type_col = col
        break

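if asset_type_col is None:
    # Fall back to the first column whose name contains "asset" (case-insensitive)
    for col in air_filter_df.columns:
        if 'asset' in col.lower():
            asset_type_col = col
            break

# Fail early with a clear message; every cell below indexes by this column
if asset_type_col is None:
    raise KeyError(f"No asset type column found; columns are {list(air_filter_df.columns)}")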

print(f"\n{'='*70}")
print("ASSET TYPE DISTRIBUTION")
print(f"{'='*70}\n")

asset_counts = air_filter_df[asset_type_col].value_counts().sort_index()
print("Asset Type | Record Count | Percentage")
print("-" * 50)
for asset, count in asset_counts.items():
    pct = (count / len(air_filter_df)) * 100
    print(f"  {str(asset):10s} | {count:12,d} | {pct:6.2f}%")

print(f"\nTotal Asset Types: {air_filter_df[asset_type_col].nunique()}")

## Section 6: Exploratory Visualizations

**TALKING POINTS FOR VISUALIZATIONS:**
1. Scatter plot shows the natural relationship between HP and restriction
2. Clean filters follow a lower boundary curve (our baseline target)
3. Points above this curve indicate varying degrees of filter clogging
4. Different asset types may have different baseline characteristics
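
The lower boundary described in point 2 can be estimated in several ways. The sketch below uses one simple option, chosen here for illustration rather than taken from this notebook: split horsepower into bins, take a low quantile of restriction in each bin, and fit a polynomial through those points. The function name `fit_lower_envelope`, its default parameters, and the synthetic data are all assumptions.

```python
import numpy as np

def fit_lower_envelope(hp, restriction, n_bins=10, quantile=0.05, degree=2):
    """Fit a polynomial through a low quantile of restriction per horsepower bin."""
    hp = np.asarray(hp, dtype=float)
    restriction = np.asarray(restriction, dtype=float)
    edges = np.linspace(hp.min(), hp.max(), n_bins + 1)
    # Map each point to a bin index in 0..n_bins-1 (the maximum goes in the last bin)
    idx = np.clip(np.digitize(hp, edges) - 1, 0, n_bins - 1)
    centers, lows = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.sum() >= 5:  # skip sparsely populated bins
            centers.append(hp[mask].mean())
            lows.append(np.quantile(restriction[mask], quantile))
    return np.poly1d(np.polyfit(centers, lows, degree))

# Synthetic data: a quadratic clean-filter curve plus non-negative "clogging" offsets
rng = np.random.default_rng(0)
hp = rng.uniform(200, 2000, size=2000)
clean = 2.0 + 4e-6 * hp**2
restriction = clean + rng.exponential(scale=3.0, size=hp.size)

baseline = fit_lower_envelope(hp, restriction)
share_above = np.mean(restriction >= baseline(hp))
print(f"Baseline at 1000 HP: {baseline(1000.0):.2f}")
print(f"Share of points on or above the baseline: {share_above:.1%}")
```

A low quantile is used instead of the per-bin minimum so that a single spuriously low reading does not drag the baseline down. The bin count, quantile, and polynomial degree would need tuning per asset type.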

In [None]:
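# Identify column names (handle naming variations).
# If no candidate matches, fall back to the third and fourth columns by position.
# That fallback is a guess about the file layout, so confirm the printed names.
hp_col_candidates = ['HydraulicHorsepower', 'horsepower', 'Horsepower', 'HP']
restriction_col_candidates = ['AirFilterRestriction', 'restriction', 'Restriction']

hp_col = next((col for col in hp_col_candidates if col in air_filter_df.columns), air_filter_df.columns[2])
restriction_col = next((col for col in restriction_col_candidates if col in air_filter_df.columns), air_filter_df.columns[3])
print(f"Using HP column: {hp_col!r}, restriction column: {restriction_col!r}")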

# Generate separate visualizations for each asset type
for asset in sorted(air_filter_df[asset_type_col].unique()):
    asset_data = air_filter_df[air_filter_df[asset_type_col] == asset]
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle(f'Air Filter Performance Analysis - Asset {asset}', fontsize=16, fontweight='bold', y=1.00)

    # Plot 1: Main scatter plot
    ax1 = axes[0, 0]
    ax1.scatter(asset_data[hp_col], asset_data[restriction_col], 
               label=f'Asset {asset}', alpha=0.7, s=50, edgecolors='white', linewidth=0.5)
    ax1.set_xlabel('Horsepower (HP)', fontweight='bold')
    ax1.set_ylabel('Air Filter Restriction', fontweight='bold')
    ax1.set_title('Restriction vs Horsepower', fontweight='bold', pad=15)
    ax1.legend(title='Asset Type', frameon=True, shadow=True)
    ax1.grid(True, alpha=0.3)

    # Plot 2: Distribution of Horsepower
    ax2 = axes[0, 1]
    ax2.hist(asset_data[hp_col], alpha=0.7, bins=30, edgecolor='black', linewidth=0.5)
    ax2.set_xlabel('Horsepower (HP)', fontweight='bold')
    ax2.set_ylabel('Frequency', fontweight='bold')
    ax2.set_title('Horsepower Distribution', fontweight='bold', pad=15)
    ax2.grid(True, alpha=0.3, axis='y')

    # Plot 3: Distribution of Restriction
    ax3 = axes[1, 0]
    ax3.hist(asset_data[restriction_col], alpha=0.7, bins=30, edgecolor='black', linewidth=0.5)
    ax3.set_xlabel('Air Filter Restriction', fontweight='bold')
    ax3.set_ylabel('Frequency', fontweight='bold')
    ax3.set_title('Restriction Distribution', fontweight='bold', pad=15)
    ax3.grid(True, alpha=0.3, axis='y')

    # Plot 4: Box plot for restriction
    ax4 = axes[1, 1]
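    bp = ax4.boxplot([asset_data[restriction_col].values],
                     patch_artist=True, notch=True, showmeans=True)
    # Label the tick afterwards: boxplot's `labels` keyword is deprecated as of
    # Matplotlib 3.9 in favor of `tick_labels`, and this works with either version
    ax4.set_xticklabels([str(asset)])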
    # Color the box
    color = plt.cm.Set3(0)
    for patch in bp['boxes']:
        patch.set_facecolor(color)
        patch.set_alpha(0.7)
    ax4.set_xlabel('Asset Type', fontweight='bold')
    ax4.set_ylabel('Air Filter Restriction', fontweight='bold')
    ax4.set_title('Restriction Distribution (Boxplot)', fontweight='bold', pad=15)
    ax4.grid(True, alpha=0.3, axis='y')

    plt.tight_layout()
    plt.show()

print("\n✓ Visualizations generated for each asset type separately")

## Section 7: Key Insights Summary

In [None]:
print("\n" + "="*70)
print("KEY INSIGHTS & OBSERVATIONS")
print("="*70 + "\n")

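print("1. DATA COMPLETENESS")
# Check the critical columns instead of asserting completeness unconditionally
critical_cols = [asset_type_col, hp_col, restriction_col]
missing_counts = air_filter_df[critical_cols].isnull().sum()
if missing_counts.sum() == 0:
    print("   ✓ No missing values detected in critical columns")
    print("   ✓ Data quality is suitable for baseline modeling\n")
else:
    for col, n_missing in missing_counts[missing_counts > 0].items():
        print(f"   ✗ {col}: {n_missing:,} missing values")
    print("   ✗ Address missing values before baseline modeling\n")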

print("2. OPERATIONAL CHARACTERISTICS")
for asset in sorted(air_filter_df[asset_type_col].unique()):
    asset_data = air_filter_df[air_filter_df[asset_type_col] == asset]
    print(f"   Asset {asset}:")
    print(f"     • HP Range: {asset_data[hp_col].min():.1f} - {asset_data[hp_col].max():.1f}")
    print(f"     • Restriction Range: {asset_data[restriction_col].min():.2f} - {asset_data[restriction_col].max():.2f}")
    print(f"     • Sample Size: {len(asset_data):,} records")

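print("\n3. RELATIONSHIP PATTERNS")
# Report the measured correlation per asset rather than asserting it
for asset in sorted(air_filter_df[asset_type_col].unique()):
    asset_data = air_filter_df[air_filter_df[asset_type_col] == asset]
    r = asset_data[hp_col].corr(asset_data[restriction_col])
    print(f"   • Asset {asset}: Pearson r between HP and restriction = {r:.2f}")
print("   • Lower envelope visible in scatter plots (target for baseline)")
print("   • Variation above baseline indicates filter degradation\n")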

print("4. NEXT STEPS")
print("   → Fit baseline curves for each asset type (lower envelope)")
print("   → Implement percent clogged calculation algorithm")
print("   → Deploy as REST API for real-time monitoring\n")

print("="*70)
print("NOTEBOOK EXECUTION COMPLETE")
print("="*70)