# Fall Detection Pipeline - Training & Preprocessing

This notebook implements a comprehensive machine learning pipeline for fall detection using sensor data. It includes:
- Data preprocessing and normalization
- Dimensionality reduction using PCA
- Random Forest classification
- Hybrid anomaly detection (Isolation Forest + Autoencoder)

## 1. Import Required Libraries

Import all necessary libraries for data manipulation, machine learning, deep learning, and visualization.

In [24]:
import pandas as pd
import numpy as np
import logging
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Configure logging to track pipeline execution
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## 2. Define Data Loading Function

This function loads the fall detection dataset from a CSV file. The dataset should contain sensor readings (accelerometer, gyroscope, etc.) with activity labels.

In [25]:
def load_dataset(filepath):
    """
    Load fall detection dataset from CSV file.
    
    Args:
        filepath: Path to the CSV file containing sensor data and labels
    
    Returns:
        DataFrame with sensor features and activity labels
    """
    df = pd.read_csv(filepath)
    return df

## 3. Define Data Preprocessing Function

This function performs the following preprocessing steps:
- Logs the total number of missing values (it does not impute or drop them)
- Logs the class distribution so imbalances are visible
- Applies StandardScaler normalization so every feature has zero mean and unit variance

This normalization matters for PCA, distance-based algorithms, and neural networks, all of which are sensitive to feature scale.

In [26]:
def preprocess_data(df):
    """
    Preprocess the dataset by checking data quality and normalizing features.
    
    Args:
        df: Raw dataframe with sensor readings and labels
    
    Returns:
        df_scaled: Normalized dataframe with scaled features
        scaler: Fitted StandardScaler object for later use in inference
    """
    # Check for missing values
    missing_values = df.isnull().sum().sum()
    logging.info(f"Total missing values: {missing_values}")
    
    # Check class distribution to identify potential class imbalance
    class_counts = df['label'].value_counts()
    logging.info(f"Class Distribution:\n{class_counts}")
    
    # Normalize Data using Standard Scaling (z-score normalization)
    scaler = StandardScaler()
    features = df.iloc[:, 1:]  # Exclude label column (assumed to be first column)
    scaled_features = scaler.fit_transform(features)
    
    # Convert back to DataFrame for easier manipulation
    df_scaled = pd.DataFrame(scaled_features, columns=df.columns[1:])
    df_scaled['label'] = df['label']  # Retain original labels
    
    return df_scaled, scaler
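
The function above fits the scaler on the full dataset before any train/test split, so test-set statistics influence the scaling. A minimal sketch of the alternative, splitting first and fitting the scaler on training rows only, is shown here. The helper name `split_then_scale` and the synthetic frame are assumptions for illustration and are not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_then_scale(df, label_col="label", test_size=0.2, seed=42):
    """Hypothetical helper: split first, then fit the scaler on training rows only."""
    X = df.drop(columns=[label_col])
    y = df[label_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training rows
    X_test_scaled = scaler.transform(X_test)        # test rows reuse those statistics
    return X_train_scaled, X_test_scaled, y_train, y_test, scaler


# Small synthetic frame standing in for the real sensor features
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(5.0, 2.0, size=(100, 3)), columns=["a", "b", "c"])
demo.insert(0, "label", rng.choice(["walking", "sitting"], size=100))

X_tr, X_te, y_tr, y_te, fitted = split_then_scale(demo)
print(X_tr.shape, X_te.shape)
```

With this ordering the test rows are transformed with the training mean and variance, which mirrors what happens at inference time on unseen sensor data.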

## 4. Define PCA Dimensionality Reduction Function

Principal Component Analysis (PCA) reduces feature dimensionality while retaining 95% of the variance. This:
- Reduces computational cost
- Removes multicollinearity
- Helps prevent overfitting
- Speeds up model training

In [27]:
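def apply_pca(df_scaled):
    """
    Reduce feature dimensionality with PCA, keeping 95% of the variance.
    
    Args:
        df_scaled: Scaled dataframe with the label in the last column
    
    Returns:
        df_pca: Dataframe of principal components plus the label column
        pca: Fitted PCA object for later use in inference
    """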
    pca = PCA(n_components=0.95)  # Retain 95% variance
    principal_components = pca.fit_transform(df_scaled.iloc[:, :-1])  # Exclude label
    
    # Convert PCA results into a DataFrame
    df_pca = pd.DataFrame(principal_components)
    df_pca['label'] = df_scaled['label']
    
    logging.info(f"Original feature count: {df_scaled.shape[1] - 1}")
    logging.info(f"Reduced feature count after PCA: {df_pca.shape[1] - 1}")
    
    return df_pca, pca
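
To see what `n_components=0.95` does in practice, it can help to inspect the explained variance of each retained component. The sketch below uses synthetic data (an assumption, not the fall detection dataset) in which ten observed features are driven by three latent signals, so only a few components should be needed.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 10 observed features driven by 3 latent signals plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# A float in (0, 1) asks PCA to keep enough components to exceed that variance share
pca_demo = PCA(n_components=0.95)
pca_demo.fit(X_demo)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
for i, (ratio, total) in enumerate(
    zip(pca_demo.explained_variance_ratio_, cumulative), start=1
):
    print(f"PC{i}: {ratio:.3f} of variance, {total:.3f} cumulative")
```

The same attributes (`explained_variance_ratio_`, `n_components_`) are available on the `pca` object returned by `apply_pca`.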

## 5. Define Visualization Function

Visualize the distribution of different activity classes to understand class balance and potential biases in the dataset.

In [28]:
def visualize_class_distribution(class_counts):
    """
    Create a bar plot showing the distribution of activity classes.
    
    Args:
        class_counts: Series containing counts for each activity class
    """
    plt.figure(figsize=(8, 5))
    sns.barplot(x=class_counts.index, y=class_counts.values, palette="viridis")
    plt.title("Class Distribution")
    plt.xlabel("Activity Type")
    plt.ylabel("Count")
    plt.show()

## 6. Define Random Forest Training Function

Random Forest is an ensemble learning method that trains many decision trees on bootstrap samples and combines their predictions by voting. Averaging across trees makes it less prone to overfitting than a single tree, and it handles high-dimensional data well.

In [29]:
def train_random_forest(X_train, y_train):
    """
    Train a Random Forest classifier for activity classification.
    
    Args:
        X_train: Training features
        y_train: Training labels
    
    Returns:
        model: Trained RandomForestClassifier
    """
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return model
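
The dataset used in this notebook is heavily imbalanced (1500 `walking` samples versus 25 `trip` samples in the logged class distribution). One common option, sketched here as an assumption rather than as part of the original pipeline, is scikit-learn's `class_weight="balanced"`, which weights each class by `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the class distribution logged for this dataset
class_counts = {
    "walking": 1500, "standing": 1000, "sitting": 750, "lying": 500,
    "stairs_up": 350, "stairs_down": 300, "jogging": 250,
    "fall_forward": 100, "fall_backward": 100, "fall_sideways": 75,
    "syncope": 50, "trip": 25,
}
labels = np.repeat(list(class_counts.keys()), list(class_counts.values()))
classes = np.array(sorted(class_counts))

# Weight per class under the "balanced" heuristic
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
weight_by_class = dict(zip(classes, weights))
for name in ("walking", "trip"):
    print(f"{name}: {weight_by_class[name]:.3f}")

# The same weighting can be requested directly from the classifier
balanced_model = RandomForestClassifier(
    n_estimators=100, random_state=42, class_weight="balanced"
)
```

Whether this improves fall recall on the real data would need to be checked empirically.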

## 7. Define Model Evaluation Function

Evaluate the trained model on test data and compute accuracy metrics.

In [30]:
def evaluate_model(model, X_test, y_test):
    """
    Evaluate model performance on test data.
    
    Args:
        model: Trained classifier
        X_test: Test features
        y_test: True test labels
    
    Returns:
        accuracy: Model accuracy score
    """
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    logging.info(f"Model Accuracy: {accuracy:.4f}")
    return accuracy
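
Accuracy alone can hide poor performance on rare classes, which matters here because fall events are a small fraction of the data. The sketch below uses hypothetical predictions (not results from this notebook) to show how per-class metrics expose a missed minority class.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)

# Hypothetical case: 95 "walking" samples all correct, 5 "trip" samples all missed
y_true = np.array(["walking"] * 95 + ["trip"] * 5)
y_pred = np.array(["walking"] * 100)

accuracy = accuracy_score(y_true, y_pred)
trip_recall = recall_score(y_true, y_pred, pos_label="trip", zero_division=0)

print(f"Accuracy: {accuracy:.2f}")        # high despite every trip being missed
print(f"Recall for 'trip': {trip_recall:.2f}")
print(classification_report(y_true, y_pred, zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=["walking", "trip"]))
```

Calling `classification_report` inside `evaluate_model` would give the same per-class view for the real test set.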

## 8. Define Model Saving Function

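Save the trained model and preprocessing objects for later use in production/inference. Note that the fitted PCA object is not saved here, although the classifier is trained on PCA components, so inference on new raw data also needs it.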

In [31]:
def save_model_artifacts(model, scaler, label_encoder):
    """
    Save trained model and preprocessing objects to disk.
    
    Args:
        model: Trained classifier
        scaler: Fitted StandardScaler
        label_encoder: Fitted LabelEncoder
    """
    joblib.dump(model, "fall_detection_model.pkl")
    joblib.dump(scaler, "scaler.pkl")
    joblib.dump(label_encoder, "label_encoder.pkl")
    logging.info("✅ Model and encoders saved successfully!")
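
At inference time the saved objects have to be applied in the same order as during training: scale, project with PCA, predict, then decode the label. The following is a minimal round-trip sketch on synthetic data. The file names, the temporary directory, and the decision to persist the PCA object are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 6))
y_text = np.where(X_raw[:, 0] > 0, "walking", "sitting")

# Fit the same chain of objects the pipeline uses: scaler -> PCA -> classifier
scaler = StandardScaler().fit(X_raw)
pca = PCA(n_components=3).fit(scaler.transform(X_raw))
encoder = LabelEncoder().fit(y_text)
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(pca.transform(scaler.transform(X_raw)), encoder.transform(y_text))

artifacts = {"model": clf, "scaler": scaler, "pca": pca, "encoder": encoder}

# Save and reload every artifact (hypothetical file names in a temp directory)
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, obj in artifacts.items():
        joblib.dump(obj, Path(tmp_dir) / f"{name}.pkl")
    loaded = {name: joblib.load(Path(tmp_dir) / f"{name}.pkl") for name in artifacts}

# Apply the reloaded objects in training order, then decode to text labels
new_samples = X_raw[:5]
features = loaded["pca"].transform(loaded["scaler"].transform(new_samples))
predicted = loaded["encoder"].inverse_transform(loaded["model"].predict(features))
print(predicted)
```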

## 9. Define Hybrid Anomaly Detection Model

This advanced function combines two anomaly detection approaches:
- **Isolation Forest**: Identifies anomalies by isolating observations in tree structures
- **Autoencoder**: Neural network that learns to reconstruct normal patterns; poor reconstruction indicates anomalies

The hybrid approach flags a sample as anomalous when either detector flags it (a logical OR of the two votes), which favors recall over precision.

In [32]:
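def train_hybrid_model(data, labels):
    """
    Train an Isolation Forest and an autoencoder, then combine their anomaly flags.
    
    Args:
        data: Feature matrix
        labels: Ground-truth labels, used only for evaluation
    
    Returns:
        isolation_model, autoencoder, scaler (all None if training fails)
    """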
    try:
        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(data)
        X_train, X_test, y_train, y_test = train_test_split(scaled_data, labels, test_size=0.2, random_state=42)

        # Isolation Forest
        isolation_model = IsolationForest(contamination=0.05, random_state=42)
        isolation_model.fit(X_train)
        y_pred_iso = isolation_model.predict(X_test)
        y_pred_iso = np.where(y_pred_iso == -1, 1, 0)

        # Autoencoder
        input_dim = X_train.shape[1]
        autoencoder = keras.Sequential([
            layers.Input(shape=(input_dim,)),
            layers.Dense(16, activation="relu"),
            layers.Dense(8, activation="relu"),
            layers.Dense(16, activation="relu"),
            layers.Dense(input_dim, activation="linear"),
        ])

        autoencoder.compile(optimizer="adam", loss="mse")
        autoencoder.fit(X_train, X_train, epochs=50, batch_size=32, verbose=0)

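        # Reconstruction error per test sample. The threshold is the 95th percentile
        # of these same test errors, so roughly 5% of test samples are always flagged.
        reconstructions = autoencoder.predict(X_test)
        mse = np.mean(np.power(X_test - reconstructions, 2), axis=1)
        threshold = np.percentile(mse, 95)
        y_pred_auto = (mse > threshold).astype(int)

        # Hybrid prediction: flag a sample if either model flags it (logical OR)
        y_pred_hybrid = (y_pred_iso + y_pred_auto) >= 1
        y_pred_hybrid = y_pred_hybrid.astype(int)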

        unique_labels = np.unique(y_test)
        average_mode = "binary" if len(unique_labels) <= 2 else "weighted"

        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, y_pred_hybrid, average=average_mode, zero_division=0
        )

        logging.info(f"Hybrid Model - Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1:.4f}")

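        # Save models. Note: "scaler.pkl" is also written by save_model_artifacts,
        # so running both in the same directory overwrites the earlier scaler.
        joblib.dump(isolation_model, "isolation_forest.pkl")
        joblib.dump(scaler, "scaler.pkl")
        autoencoder.save("autoencoder.h5")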

        return isolation_model, autoencoder, scaler
    except Exception as e:
        logging.error(f"Error training Hybrid Model: {e}")
        return None, None, None
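
With only two detectors, the combination rule `(y_pred_iso + y_pred_auto) >= 1` behaves as a logical OR. The sketch below contrasts it with the stricter AND rule. The helper `combine_votes` and its `rule` parameter are hypothetical names introduced for illustration.

```python
import numpy as np


def combine_votes(iso_flags, auto_flags, rule="any"):
    """Hypothetical helper: combine two binary anomaly flag arrays.

    rule="any" flags a sample if either detector flags it (logical OR).
    rule="all" flags a sample only if both detectors flag it (logical AND).
    """
    votes = np.asarray(iso_flags, dtype=int) + np.asarray(auto_flags, dtype=int)
    if rule == "any":
        return (votes >= 1).astype(int)
    if rule == "all":
        return (votes == 2).astype(int)
    raise ValueError(f"Unknown rule: {rule!r}")


iso = [1, 0, 1, 0]
auto = [1, 1, 0, 0]
print(combine_votes(iso, auto, rule="any"))  # [1 1 1 0]
print(combine_votes(iso, auto, rule="all"))  # [1 0 0 0]
```

The OR rule tends to raise recall at the cost of more false alarms, while the AND rule does the opposite. Which trade-off suits fall detection depends on how costly a missed fall is relative to a false alert.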

## 10. Load and Preprocess Dataset

Execute the data loading and preprocessing pipeline.

In [33]:
# Load dataset
file_path = "fall_detection_dataset.csv" 
df = load_dataset(file_path)

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:\n{df.head()}")
print(f"\nColumn names:\n{df.columns.tolist()}")

Dataset shape: (5000, 29)

First few rows:
          label  acc_x_mean  acc_y_mean  acc_z_mean  acc_x_std  acc_y_std  \
0       sitting      0.0351      0.0344      9.6897     0.1499     0.1204   
1  fall_forward      2.3267     -1.4092     12.2534     8.5069     7.9289   
2       walking      0.1272      0.2537      9.7369     1.6084     0.9839   
3       jogging      0.4990      0.6269     10.6038     3.0459     3.8435   
4       walking      0.1823      0.2419      9.7604     1.4799     1.2406   

   acc_z_std  acc_x_min  acc_y_min  acc_z_min  ...  acc_magnitude_mean  \
0     0.2097    -0.1649    -0.1656     9.3897  ...              9.9907   
1     8.1094    -6.6733    -9.4092     2.2534  ...             25.6781   
2     1.5309    -1.3728    -1.2463     7.7369  ...              9.6984   
3     3.0127    -2.5010    -2.8731     7.6038  ...             13.2501   
4     2.0045    -1.3177    -1.2581     7.7604  ...              9.7145   

   acc_magnitude_std  gyro_magnitude_mean  gyro_m

## 11. Apply Preprocessing and PCA

Normalize the features and reduce dimensionality using PCA.

In [34]:
# Preprocess data
df_scaled, scaler = preprocess_data(df)

# Apply PCA for dimensionality reduction
df_pca, pca = apply_pca(df_scaled)

# Save preprocessed data
df_pca.to_csv("processed_data.csv", index=False)
logging.info("✅ Preprocessed data saved as 'processed_data.csv'")

2025-12-21 13:38:58,897 - INFO - Total missing values: 0
2025-12-21 13:38:58,962 - INFO - Class Distribution:
label
walking          1500
standing         1000
sitting           750
lying             500
stairs_up         350
stairs_down       300
jogging           250
fall_forward      100
fall_backward     100
fall_sideways      75
syncope            50
trip               25
Name: count, dtype: int64
2025-12-21 13:38:59,148 - INFO - Original feature count: 28
2025-12-21 13:38:59,150 - INFO - Reduced feature count after PCA: 4
2025-12-21 13:38:59,204 - INFO - ✅ Preprocessed data saved as 'processed_data.csv'


## 12. Encode Labels and Split Data

Convert text labels to numerical format and split data into training and testing sets.

In [35]:
# Encode categorical labels to numerical values
label_encoder = LabelEncoder()
df_pca['label'] = label_encoder.fit_transform(df_pca['label'])

# Split dataset into features and labels
X = df_pca.drop(columns=["label"])
y = df_pca["label"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")

Training samples: 4000
Testing samples: 1000
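
Because the smallest class (`trip`) has only 25 samples, a purely random split can leave very few of them in the test set. A stratified split keeps each class at the same proportion in both partitions. The sketch below rebuilds only the label column from the logged class counts and uses placeholder features, so it is an illustration rather than a re-run of the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts copied from the class distribution logged for this dataset
class_counts = {
    "walking": 1500, "standing": 1000, "sitting": 750, "lying": 500,
    "stairs_up": 350, "stairs_down": 300, "jogging": 250,
    "fall_forward": 100, "fall_backward": 100, "fall_sideways": 75,
    "syncope": 50, "trip": 25,
}
y_demo = np.repeat(list(class_counts.keys()), list(class_counts.values()))
X_demo = np.zeros((len(y_demo), 4))  # placeholder features; only labels matter here

# Plain random split, as in the cell above
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Stratified split: class proportions are preserved in train and test
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("trip samples in plain test split:     ", int(np.sum(y_test_plain == "trip")))
print("trip samples in stratified test split:", int(np.sum(y_test_strat == "trip")))
```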


## 13. Train and Evaluate Random Forest Model

Train the Random Forest classifier and evaluate its performance on the test set.

In [36]:
# Train and evaluate Random Forest model
rf_model = train_random_forest(X_train, y_train)
evaluate_model(rf_model, X_test, y_test)
save_model_artifacts(rf_model, scaler, label_encoder)

2025-12-21 13:39:00,006 - INFO - Model Accuracy: 0.9910
2025-12-21 13:39:00,072 - INFO - ✅ Model and encoders saved successfully!


## 14. Train and Evaluate Hybrid Anomaly Detection Model

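Train the hybrid model combining Isolation Forest and Autoencoder on the PCA components. Note that `y` holds the 12 encoded activity classes while the hybrid model outputs binary anomaly flags (0 or 1), so the precision, recall, and F1 logged below compare mismatched label spaces and should not be read as anomaly-detection performance.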

In [37]:
# Train hybrid model
isolation_model, autoencoder, hybrid_scaler = train_hybrid_model(X, y)

if isolation_model is not None:
    logging.info("✅ Hybrid Model training completed!")
else:
    logging.error("❌ Hybrid Model training failed!")

32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step


2025-12-21 13:39:20,063 - INFO - Hybrid Model - Precision: 0.0034, Recall: 0.0120, F1-Score: 0.0053
2025-12-21 13:39:20,274 - INFO - ✅ Hybrid Model training completed!
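
The very low precision and recall logged above are consistent with scoring binary anomaly flags against 12-class encoded labels. One way to obtain comparable targets is to map the activity labels to a binary fall / non-fall target before evaluation. The sketch below is an illustration under assumptions: the helper `to_binary_targets` is hypothetical, the predictions are made up, and treating `syncope` and `trip` as fall events is a labeling choice the source does not confirm.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Assumption: which activity labels count as the positive (fall) class
FALL_CLASSES = {"fall_forward", "fall_backward", "fall_sideways", "syncope", "trip"}


def to_binary_targets(text_labels, positive_classes=FALL_CLASSES):
    """Hypothetical helper: 1 for fall-type activities, 0 for everything else."""
    return np.array([1 if label in positive_classes else 0 for label in text_labels])


text_labels = ["walking", "fall_forward", "sitting", "trip", "jogging", "syncope"]
y_true = to_binary_targets(text_labels)   # [0, 1, 0, 1, 0, 1]
y_pred = np.array([0, 1, 0, 0, 1, 1])     # made-up hybrid output for illustration

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```

In this notebook `y` is already integer-encoded, so `label_encoder.inverse_transform(y)` would be needed to recover the text labels before applying a mapping like this.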


## 15. DuckDuckGo Care Assistance Integration

To complement the sensor-driven fall detection pipeline, the project exposes caregiver assistance utilities backed by DuckDuckGo:

- **Health Information Assistance:** surfaces fall-prevention checklists, recovery guidance, and emergency contact tips tailored to elderly caregivers.
- **Emergency / Expert Locator:** highlights reputable organizations or specialists that can support elderly patients post-fall.

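These helpers are documented as calling the **DuckDuckGo Instant Answer API** directly (no third-party proxy) and normalizing its structured abstracts and related topics via `duckduckgo_service.py`. The request logs recorded in the cells below, however, show searches issued to `bing.com`, so the run captured here does not appear to have used the Instant Answer endpoint.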

In [38]:
# Import DuckDuckGo helper utilities (Instant Answer API-backed)
from duckduckgo_service import (
    fetch_health_information,
    locate_emergency_facilities,
    DuckDuckGoIntegrationError,
)

print("DuckDuckGo Instant Answer helpers imported successfully.")

DuckDuckGo Instant Answer helpers imported successfully.
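
The internals of `duckduckgo_service.py` are not shown in this notebook. As a rough sketch of what normalizing an Instant Answer response could look like, the code below builds a request URL for the public `api.duckduckgo.com` endpoint and converts a response payload into simple cards. `SketchCard`, `build_instant_answer_url`, and `parse_instant_answer` are hypothetical names, and `sample_payload` is a hand-written stand-in rather than a real API response.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class SketchCard:
    """Hypothetical card type; not the InfoCard class from duckduckgo_service.py."""
    title: str
    snippet: str
    url: str


def build_instant_answer_url(query):
    """Build a request URL for the Instant Answer endpoint (JSON output)."""
    params = {"q": query, "format": "json", "no_html": 1, "skip_disambig": 1}
    return "https://api.duckduckgo.com/?" + urlencode(params)


def parse_instant_answer(payload, max_results=3):
    """Turn the abstract and related topics of a payload into cards."""
    cards = []
    if payload.get("AbstractText"):
        cards.append(SketchCard(
            title=payload.get("Heading", ""),
            snippet=payload["AbstractText"],
            url=payload.get("AbstractURL", ""),
        ))
    for topic in payload.get("RelatedTopics", []):
        # Grouped topics nest their entries under a "Topics" key
        for entry in topic.get("Topics", [topic]):
            if entry.get("Text") and entry.get("FirstURL"):
                cards.append(SketchCard(
                    title=entry["Text"].split(" - ")[0],
                    snippet=entry["Text"],
                    url=entry["FirstURL"],
                ))
    return cards[:max_results]


# Hand-written stand-in payload for illustration only
sample_payload = {
    "Heading": "Fall prevention",
    "AbstractText": "Illustrative abstract text.",
    "AbstractURL": "https://example.org/fall-prevention",
    "RelatedTopics": [
        {"Text": "Balance training - illustrative entry",
         "FirstURL": "https://example.org/balance"},
        {"Name": "Group", "Topics": [
            {"Text": "Home safety - illustrative nested entry",
             "FirstURL": "https://example.org/home"},
            {"Text": "Extra - beyond max_results",
             "FirstURL": "https://example.org/extra"},
        ]},
    ],
}

url = build_instant_answer_url("fall prevention exercises")
cards = parse_instant_answer(sample_payload, max_results=3)
print(url)
for card in cards:
    print(card.title, "->", card.url)
```

The sketch performs no network request; fetching `url` and passing the decoded JSON to `parse_instant_answer` would be the remaining step.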


### 15.1 Fetch Health Information Suggestions
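`fetch_health_information` is intended to query the DuckDuckGo Instant Answer API
and surface abstracts and related topics (e.g., Wikipedia or government health
portals) without relying on a third-party scraper. In the recorded run below, the
three returned cards are Baidu Zhidao pages about English usage of the word "fall",
which are unrelated to fall prevention.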

In [39]:
# Example: caregivers looking for fall-prevention routines
try:
    fall_prevention_cards = fetch_health_information(
        topic="fall prevention exercises",
        audience="home caregivers",
        max_results=3,
    )
except DuckDuckGoIntegrationError as err:
    print(f"DuckDuckGo Instant Answer failed: {err}")
    fall_prevention_cards = []

fall_prevention_cards

2025-12-21 13:39:22,560 - INFO - response: https://www.bing.com/search?q=fall+prevention+exercises+tips+for+home+caregivers 200


[InfoCard(title='fall的用法_百度知道', snippet='fall 高考 / CET4 / CET6 / 考研 讲解1:26 v. 落下；下落；掉落；跌落；突然倒下；跌倒；倒塌；下垂；低垂 n. 落下；下落；跌落；掉落； (雪、岩石等的)降落； …', url='https://zhidao.baidu.com/question/589864611.html', metadata=None),
 InfoCard(title='fall 和 fell 有什么区别?_百度知道', snippet='fall 和 fell 只有时态的区别，没有意思上的区别，只是以下几点需要注意： fall是一般现在时，而fell是一般过去时，时态不同； fell 动词原形的时候，表示砍 …', url='https://zhidao.baidu.com/question/1768944762480158620.html', metadata=None),
 InfoCard(title='“autumn”跟“fall”有什么区别? - 百度知道', snippet='“autumn”跟“fall”有什么区别?autumn和fall这两个词的区别在于使用地区的不同，前者为英国用词，后者是美国英语。 1、autumn是比较正式的书面用语，是英 …', url='https://zhidao.baidu.com/question/629770527425317564.html', metadata=None)]

### 15.2 Locate Post-Fall Medical Experts
`locate_emergency_facilities` is described as using the same Instant Answer backend,
which tends to surface curated articles or organization pages (e.g., geriatric care
associations, government safety guidance) rather than raw map coordinates.

The example below does not call `locate_emergency_facilities`; it reuses
`fetch_health_information` with a targeted query for medical specialists and
programs that support elderly patients after a fall. In the recorded run, the
query returned an empty list.

In [40]:
# Example: find geriatric medical experts for post-fall care
try:
    geriatric_experts = fetch_health_information(
        topic="geriatric rehabilitation specialists site:.org post-fall care",
        audience="caregivers",
        max_results=3,
    )
except DuckDuckGoIntegrationError as err:
    print(f"DuckDuckGo expert lookup failed: {err}")
    geriatric_experts = []

geriatric_experts

2025-12-21 13:39:24,587 - INFO - response: https://www.bing.com/search?q=geriatric+rehabilitation+specialists+site%3A.org+post-fall+care+tips+for+caregivers 200
2025-12-21 13:39:26,246 - INFO - response: https://www.bing.com/search?q=geriatric+rehabilitation+specialists+site%3A.org+post-fall+care+tips+for+caregivers&first=11&FORM=PERE 200
2025-12-21 13:39:27,679 - INFO - response: https://www.bing.com/search?q=geriatric+rehabilitation+specialists+site%3A.org+post-fall+care+tips+for+caregivers&first=21&FORM=PERE1 200
2025-12-21 13:39:29,197 - INFO - response: https://www.bing.com/search?q=geriatric+rehabilitation+specialists+site%3A.org+post-fall+care+tips+for+caregivers&first=31&FORM=PERE2 200
2025-12-21 13:39:30,600 - INFO - response: https://www.bing.com/search?q=geriatric+rehabilitation+specialists+site%3A.org+post-fall+care+tips+for+caregivers&first=41&FORM=PERE3 200


[]

## 16. Summary and Next Steps

**Models Trained:**
1. Random Forest Classifier - For multi-class activity classification
2. Hybrid Anomaly Detector - Combining Isolation Forest and Autoencoder

**Saved Artifacts:**
- `fall_detection_model.pkl` - Random Forest model
- `isolation_forest.pkl` - Isolation Forest model
- `autoencoder.h5` - Autoencoder neural network
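- `scaler.pkl` - Feature scaler (written by both `save_model_artifacts` and `train_hybrid_model`; the later call overwrites the earlier file)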
- `label_encoder.pkl` - Label encoder
- `processed_data.csv` - Preprocessed dataset

**Next Steps:**
- Use the inference pipeline notebook to convert models for mobile deployment
- Test models on new sensor data
- Deploy to edge devices for real-time fall detection