# MemoTag Voice-Based Cognitive Decline Detection

This notebook demonstrates the analysis pipeline for detecting cognitive decline indicators from voice samples.

## Overview

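1. Load sample audio data
2. Process audio, transcribe speech, and extract audio features
3. Extract linguistic features from the transcripts
4. Combine audio and text features
5. Visualize feature distributions
6. Apply unsupervised ML for pattern detection
7. Make predictions and visualize results
8. Generate a final report
9. Summarize conclusions and future work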

In [None]:
# Import necessary libraries
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Audio, display

# Add the parent directory to path to import from src
sys.path.append(os.path.dirname(os.getcwd()))

# Import project modules
from src.audio_processing import AudioProcessor
from src.text_processing import TextProcessor
from src.model import CognitiveDeclineModel
from src.visualization import VisualizationGenerator

# Set plot style
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 8)
plt.rcParams["font.size"] = 12

## Step 1: Load Sample Data

This proof of concept works with speech samples from cognitive assessments.

Sample data is available from sources such as:
- DementiaBank's Pitt Corpus (https://dementia.talkbank.org/)
- Mozilla Common Voice dataset (with filtering)
- Simulated recordings with cognitive speech patterns

For privacy reasons, this notebook uses a small set of simulated samples rather than recordings of real patients.

In [None]:
# Directory with sample audio files
data_dir = "../data/raw"

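# For this notebook, we assume files are already downloaded
# List all audio files in the directory (case-insensitive extension match);
# the list stays empty if the directory does not exist
audio_extensions = ('.wav', '.mp3', '.ogg', '.flac', '.m4a')
audio_files = []
if os.path.isdir(data_dir):
    audio_files = sorted(os.path.join(data_dir, f) for f in os.listdir(data_dir)
                         if f.lower().endswith(audio_extensions))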

print(f"Found {len(audio_files)} audio files")

# Display a sample audio file if available
if audio_files:
    display(Audio(audio_files[0]))
else:
    print("No audio files found. Please download sample files to ../data/raw/")

## Step 2: Process Audio and Extract Features

Now we'll process the audio files and extract features that may indicate cognitive decline.

In [None]:
# Initialize the audio processor
audio_processor = AudioProcessor()

# Process all audio files and extract features
all_audio_features = []

for audio_file in audio_files:
    print(f"Processing {os.path.basename(audio_file)}...")
    try:
        # Extract all audio features
        features = audio_processor.extract_all_features(audio_file)
        all_audio_features.append(features)
        # Use .get so a missing transcript is not reported as a processing error
        print(f"  Transcript: {(features.get('transcript') or '')[:100]}...")
    except Exception as e:
        print(f"  Error processing {audio_file}: {e}")

# Convert to DataFrame
audio_df = pd.DataFrame(all_audio_features)
audio_df.head()
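
The internals of `AudioProcessor.extract_all_features` are not shown in this notebook. As a rough illustration of how pause-related features could be derived from a waveform, the sketch below marks low-energy frames as silent and measures runs of silence. The function name `pause_features`, the frame length, and both thresholds are hypothetical choices, not the project's actual implementation.

```python
import numpy as np

def pause_features(signal, sample_rate, frame_ms=25, energy_ratio=0.1, min_pause_s=0.3):
    """Estimate pause statistics from a mono waveform (illustrative thresholds)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len if frame_len else 0
    if n_frames == 0:
        return {"pause_count": 0, "avg_pause_duration": 0.0, "pause_rate": 0.0}

    # Split into fixed-length frames and compute RMS energy per frame
    samples = np.asarray(signal[: n_frames * frame_len], dtype=float)
    rms = np.sqrt((samples.reshape(n_frames, frame_len) ** 2).mean(axis=1))

    # Assumed rule: a frame is silent if its RMS is below a fraction of the loudest frame
    silent = rms < energy_ratio * rms.max()

    # Measure each run of consecutive silent frames, in seconds
    durations, run = [], 0
    for is_silent in list(silent) + [False]:  # sentinel closes a trailing run
        if is_silent:
            run += 1
        elif run:
            durations.append(run * frame_ms / 1000)
            run = 0

    # Keep only silences long enough to count as pauses
    pauses = [d for d in durations if d >= min_pause_s]
    total_s = n_frames * frame_ms / 1000
    return {
        "pause_count": len(pauses),
        "avg_pause_duration": sum(pauses) / len(pauses) if pauses else 0.0,
        "pause_rate": len(pauses) / total_s,  # pauses per second of audio
    }

# Synthetic check: 1 s tone, 0.5 s silence, 1 s tone at 16 kHz
sr = 16000
tone = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
demo = pause_features(np.concatenate([tone, np.zeros(sr // 2), tone]), sr)
print(demo)
```

A relative threshold like this is sensitive to background noise and recording level, so real recordings would likely need a tuned threshold or a proper voice activity detector.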

## Step 3: Extract Linguistic Features from Transcripts

Now we'll analyze the transcribed text to extract linguistic features that may indicate cognitive decline.

In [None]:
# Initialize the text processor
text_processor = TextProcessor()

# Extract text features from transcripts
text_features = []

for index, row in audio_df.iterrows():
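    transcript = row.get('transcript', '')
    # Missing values arrive as NaN, which is truthy, so check the type explicitly
    if isinstance(transcript, str) and transcript.strip():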
        print(f"Analyzing transcript {index+1}...")
        features = text_processor.extract_all_features(transcript)
        features['file_path'] = row['file_path']  # Add file_path for joining later
        text_features.append(features)
    else:
        print(f"No transcript available for sample {index+1}")

# Convert to DataFrame
text_df = pd.DataFrame(text_features)
text_df.head()
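
To make the linguistic features more concrete, here is a minimal sketch of two of them: type-token ratio (unique words divided by total words) and a hesitation ratio. The tokenizer, the `HESITATION_MARKERS` set, and the function name `simple_text_features` are assumptions for illustration; `TextProcessor` may compute these differently.

```python
import re

# Hypothetical filler-word list; the project's TextProcessor may use another
HESITATION_MARKERS = {"um", "uh", "er", "erm", "hmm"}

def simple_text_features(transcript):
    """Compute lexical diversity and filler-word frequency for one transcript."""
    # Lowercase and keep alphabetic tokens (apostrophes allowed for contractions)
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return {"type_token_ratio": 0.0, "hesitation_ratio": 0.0}

    hesitations = sum(1 for token in tokens if token in HESITATION_MARKERS)
    return {
        # Unique tokens / total tokens: lower values mean more repetition
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Share of tokens that are filler words
        "hesitation_ratio": hesitations / len(tokens),
    }

example = simple_text_features("Um, the boy is, uh, taking the cookie")
print(example)
```

Type-token ratio depends on transcript length, so comparisons are most meaningful between samples of similar length.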

## Step 4: Combine Audio and Text Features

Let's combine the audio and text features into a single dataset for analysis.

In [None]:
# Merge audio and text features on file_path
if not text_df.empty:
    # Merge on file_path
    combined_df = pd.merge(audio_df, text_df, on='file_path', how='left', suffixes=('_audio', '_text'))
else:
    # Just use audio features if no text features available
    combined_df = audio_df.copy()

print(f"Combined dataset shape: {combined_df.shape}")

# Display the columns
print("\nFeatures in combined dataset:")
for col in combined_df.columns:
    print(f"  - {col}")
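
A left merge keeps every audio row, so samples without a transcript end up with missing text features. The sketch below, using small made-up frames rather than the notebook's data, shows how the `indicator` option of `pd.merge` can reveal which files had no text match.

```python
import pandas as pd

# Toy stand-ins for audio_df and text_df (values are invented)
audio_demo = pd.DataFrame({
    "file_path": ["a.wav", "b.wav", "c.wav"],
    "pause_count": [3, 7, 5],
})
text_demo = pd.DataFrame({
    "file_path": ["a.wav", "c.wav"],
    "type_token_ratio": [0.62, 0.48],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
merged = pd.merge(audio_demo, text_demo, on="file_path", how="left", indicator=True)

# Files that have audio features but no text features
missing_text = merged.loc[merged["_merge"] == "left_only", "file_path"].tolist()
print(missing_text)
```

Rows flagged this way carry NaN in the text columns, which matters for any downstream step that cannot handle missing values.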

## Step 5: Visualize Feature Distributions

Let's visualize key features to understand their distributions.

In [None]:
# Initialize the visualization generator
viz_dir = "../data/visualizations"
os.makedirs(viz_dir, exist_ok=True)
    
viz_generator = VisualizationGenerator(output_dir=viz_dir)

# Select key features for visualization
audio_features = [
    'pause_count', 'avg_pause_duration', 'pause_rate',
    'pitch_mean', 'pitch_variability_coefficient',
    'spectral_flatness_mean'
]

# Named text_feature_names so the Step 3 list (text_features) is not overwritten
text_feature_names = [
    'hesitation_ratio', 'word_finding_difficulty_ratio',
    'avg_sentence_length', 'type_token_ratio',
    'syntactic_complexity'
]

# Plot audio features if available
valid_audio_features = [f for f in audio_features if f in combined_df.columns]
if valid_audio_features:
    viz_generator.plot_feature_distribution(combined_df, valid_audio_features, 
                                           filename="audio_feature_distributions.png")

# Plot text features if available
valid_text_features = [f for f in text_feature_names if f in combined_df.columns]
if valid_text_features:
    viz_generator.plot_feature_distribution(combined_df, valid_text_features, 
                                           filename="text_feature_distributions.png")

# Plot correlation matrix
viz_generator.plot_feature_correlations(combined_df, filename="feature_correlations.png")

## Step 6: Apply Unsupervised Learning

Now we'll apply unsupervised learning techniques to identify patterns in the data.

In [None]:
# Initialize the model
model_dir = "../models"
os.makedirs(model_dir, exist_ok=True)
    
model = CognitiveDeclineModel(model_dir=model_dir)

# Train the model
training_results = model.train(combined_df)

print("Training results:")
for key, value in training_results.items():
    if key == 'top_features':
        print("\nTop features:")
        for feature, importance in value:
            print(f"  - {feature}: {importance:.3f}")
    elif key == 'cluster_stats':
        print("\nCluster statistics:")
        for cluster, stats in value.items():
            print(f"  - {cluster}: {stats}")
    else:
        print(f"  - {key}: {value}")

## Step 7: Make Predictions and Visualize Results

Let's predict cognitive decline risk scores for our samples and visualize the results.

In [None]:
# Make predictions on the same data
# In a real scenario, we would use separate training and testing sets
predictions = model.predict(combined_df)

print("Prediction results:")
print(f"Average risk score: {predictions['average_risk_score']:.2f}")
print("\nSample predictions:")
for i, pred in enumerate(predictions['predictions']):
    print(f"Sample {i+1}: Risk score: {pred['risk_score']:.2f}, Level: {pred['risk_level']}")

# Plot risk scores
viz_generator.plot_risk_scores(predictions['predictions'], filename="risk_scores.png")

# Plot feature importance
viz_generator.plot_top_features(model.get_feature_importance(), filename="feature_importance.png")

# Plot dimensionality reduction with cluster labels
# (guard against an empty prediction list before indexing into it)
if predictions['predictions'] and 'cluster' in predictions['predictions'][0]:
    cluster_labels = [p['cluster'] for p in predictions['predictions']]
    viz_generator.plot_dimensionality_reduction(combined_df, labels=cluster_labels, 
                                               method='pca', filename="feature_clustering_pca.png")
    viz_generator.plot_dimensionality_reduction(combined_df, labels=cluster_labels, 
                                               method='tsne', filename="feature_clustering_tsne.png")
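
The cell above predicts on the same samples used for training. As a minimal sketch of the alternative it mentions, the helper below splits row positions into disjoint training and test sets. The function name `split_indices` and its defaults are hypothetical and not part of the project code.

```python
import random

def split_indices(n_samples, test_fraction=0.25, seed=42):
    """Return (train_indices, test_indices) as disjoint, sorted lists."""
    if n_samples < 2:
        raise ValueError("need at least two samples to hold one out")
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible across runs
    random.Random(seed).shuffle(indices)
    # Hold out at least one sample, but never all of them
    n_test = min(n_samples - 1, max(1, round(n_samples * test_fraction)))
    return sorted(indices[n_test:]), sorted(indices[:n_test])

train_idx, test_idx = split_indices(12)
print(len(train_idx), len(test_idx))

# With the notebook's objects, usage might look like:
#   model.train(combined_df.iloc[train_idx])
#   predictions = model.predict(combined_df.iloc[test_idx])
```

If a dataset contains several recordings per speaker, splitting by speaker rather than by row avoids the same voice appearing in both sets.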

## Step 8: Generate Comprehensive Report

Finally, let's generate a comprehensive report of our findings.

In [None]:
# Generate a report
report = model.generate_report(combined_df, predictions)

# Display report sections
print("=== COGNITIVE DECLINE DETECTION REPORT ===")

# Overall statistics
print("\n--- Overall Statistics ---")
stats = report['overall_statistics']
for key, value in stats.items():
    print(f"{key}: {value}")

# Key indicators
print("\n--- Key Indicators ---")
for indicator in report['key_indicators']:
    direction = "increases" if indicator['correlation'] > 0 else "decreases"
    print(f"{indicator['feature']}: correlation {indicator['correlation']:.3f} ({direction} risk)")

# Cluster analysis
print("\n--- Cluster Analysis ---")
for cluster in report['cluster_analysis']:
    print(f"Cluster {cluster['cluster_id']}: {cluster['sample_count']} samples, "
          f"avg risk {cluster['avg_risk_score']:.2f}, {cluster['high_risk_percentage']:.1f}% high risk")

# Create summary dashboard visualization
viz_generator.create_summary_dashboard(combined_df, predictions, report, 
                                      filename="cognitive_decline_dashboard.png")
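
How `generate_report` computes the `correlation` value for each key indicator is not shown here. One plausible reading is a Pearson correlation between a feature column and the risk score, sketched below with invented numbers. The `pearson` helper and the sample values are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / math.sqrt(var_x * var_y)

# Invented risk scores and feature values for four samples
risk = [0.1, 0.3, 0.5, 0.9]
features = {
    "pause_rate": [0.2, 0.4, 0.5, 0.8],
    "type_token_ratio": [0.7, 0.6, 0.55, 0.4],
}

# Rank features by the strength of their correlation with the risk score
indicators = sorted(
    ((name, pearson(values, risk)) for name, values in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, corr in indicators:
    direction = "increases" if corr > 0 else "decreases"
    print(f"{name}: correlation {corr:.3f} ({direction} risk)")
```

A correlation with the model's own risk score describes which features drive the model, not which features predict a clinical outcome.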

## Step 9: Conclusions and Future Work

### Key Findings

Our analysis targets several candidate indicators of potential cognitive decline:

1. **Speech Patterns**:
   - Increased pause frequency and duration
   - Reduced speech rate
   - Lower pitch variability

2. **Linguistic Features**:
   - Higher hesitation marker frequency
   - Increased word-finding difficulties
   - Reduced syntactic complexity
   - Lower lexical diversity (type-token ratio)

### Modeling Approach

We used an unsupervised approach combining:
- Feature extraction (audio and linguistic)
- Dimensionality reduction (PCA)
- Anomaly detection (Isolation Forest)
- Clustering (K-means)

This approach allows us to identify patterns without requiring labeled data, which is crucial for early-stage development.
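
The stages listed above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the code inside `CognitiveDeclineModel`: the number of components, the number of clusters, and the rescaling of anomaly scores into a 0-to-1 risk score are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric feature matrix (40 samples x 6 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
X[:5] += 4.0  # shift a few rows so they look atypical

# Standardize so no single feature dominates, then reduce dimensionality
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Isolation Forest: score_samples is lower for more anomalous points,
# so negate it and min-max rescale to get a 0-to-1 score (assumed mapping)
iso = IsolationForest(random_state=0).fit(X_reduced)
raw = -iso.score_samples(X_reduced)
risk_score = (raw - raw.min()) / (raw.max() - raw.min())

# K-means groups samples with similar reduced features
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)

print(risk_score[:5].round(2), clusters[:5])
```

Because the score is rescaled within the fitted batch, it ranks samples relative to one another and is not an absolute measure of risk.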

### Future Work

To make this system clinically robust, several enhancements are needed:

1. **Data Collection**: Gather a large dataset of both normal and cognitively impaired speech samples.
2. **Supervised Learning**: Train supervised models once labeled data is available.
3. **Longitudinal Analysis**: Track changes in speech patterns over time for the same individuals.
4. **Clinical Validation**: Partner with healthcare providers to validate findings against clinical assessments.
5. **User Interface**: Develop a user-friendly interface for healthcare providers.
6. **Privacy Enhancements**: Implement additional privacy measures for handling sensitive health data.

### Conclusion

This proof-of-concept demonstrates the potential of using voice analysis for cognitive decline detection. The combination of audio feature extraction and linguistic analysis provides a rich set of indicators that can be used to identify subtle changes in cognitive function, potentially enabling earlier intervention and better outcomes for patients.