# Attention Seeker – Data Analysis Notebook

This notebook performs exploratory data analysis (EDA) and correlation analysis, and fits baseline machine learning models
to the preprocessed dataset (`attention_scores.csv`) generated by the pipeline notebook.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, mean_absolute_error, accuracy_score, f1_score, roc_auc_score, roc_curve

%matplotlib inline


In [None]:
df = pd.read_csv("attention_scores.csv")
print("Data shape:", df.shape)
df.head()


## Basic Dataset Overview

We inspect column names, data types, and basic statistics to understand the structure of the processed dataset.

In [None]:
print(df.dtypes)
df.describe().T


## Missing Values

Compute the fraction of missing values in each column, sorted from most to least incomplete.

In [None]:
df.isna().mean().sort_values(ascending=False)
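
The missing-value fractions matter for the models fitted later in this notebook, because scikit-learn's `LinearRegression` and `LogisticRegression` reject NaN inputs. The sketch below shows one possible way to handle this, listwise deletion on the model columns. The helper name `complete_cases` and the toy values are illustrative assumptions, not part of the dataset or the pipeline.

```python
import numpy as np
import pandas as pd

def complete_cases(frame, columns):
    """Return rows with no NaN in `columns`, plus the fraction of rows dropped."""
    kept = frame.dropna(subset=columns)
    dropped_frac = 1.0 - len(kept) / len(frame) if len(frame) else 0.0
    return kept, dropped_frac

# Toy data standing in for attention_scores.csv (values are made up).
toy = pd.DataFrame({
    "hr_norm": [0.1, np.nan, 0.3, 0.4],
    "outside_factors": [1.0, 2.0, np.nan, 4.0],
    "attention_score": [0.2, 0.1, -0.1, 0.3],
})

kept, dropped_frac = complete_cases(
    toy, ["hr_norm", "outside_factors", "attention_score"]
)
print("Rows kept:", kept.shape[0], "| fraction dropped:", dropped_frac)
```

If the dropped fraction is large, imputation may be a better choice than deletion; which one suits this dataset is not something the notebook settles.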


## Distributions of Key Variables

We visualize the distribution of the Attention Score and Outside Factors to understand their ranges and skew.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

if 'attention_score' in df.columns:
    axes[0].hist(df['attention_score'], bins=40)
    axes[0].set_title("Attention Score Distribution")
    axes[0].set_xlabel("Attention Score")

if 'outside_factors' in df.columns:
    axes[1].hist(df['outside_factors'], bins=40)
    axes[1].set_title("Outside Factors Distribution")
    axes[1].set_xlabel("Outside Factors Score")

plt.tight_layout()
plt.show()


## Correlation Between Attention Score and Outside Factors

We compute Pearson's correlation coefficient between the Attention Score and Outside Factors to measure
the strength of their linear association, that is, whether healthier routines go with higher attention.

In [None]:
if {'attention_score', 'outside_factors'}.issubset(df.columns):
    corr = df['attention_score'].corr(df['outside_factors'])
    print("Pearson correlation (Attention vs Outside Factors):", corr)

    plt.figure(figsize=(6, 5))
    sns.scatterplot(x='outside_factors', y='attention_score', data=df, alpha=0.4)
    plt.title(f"Attention Score vs Outside Factors (r = {corr:.3f})")
    plt.xlabel("Outside Factors")
    plt.ylabel("Attention Score")
    plt.show()
else:
    print("Required columns 'attention_score' and 'outside_factors' not found.")
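
For reference, `Series.corr` computes Pearson's r as the covariance of the two variables divided by the product of their standard deviations, after excluding pairs where either value is missing. The sketch below reproduces that calculation with NumPy on made-up numbers. The function name `pearson_r` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation with pairwise deletion of missing values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Keep only positions where both values are present.
    mask = ~(np.isnan(x) | np.isnan(y))
    x, y = x[mask], y[mask]
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up values; the last pair is dropped because x is missing.
x = [1.0, 2.0, 3.0, 4.0, np.nan]
y = [2.0, 4.1, 5.9, 8.2, 1.0]
r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

Pearson's r only captures linear association, so a value near zero does not rule out a nonlinear relationship; the scatter plot is the check for that.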


## Feature Correlation Heatmap

We generate a correlation matrix for numeric features to inspect linear relationships among features
and the Attention Score.

In [None]:
num_df = df.select_dtypes(include=[np.number])

plt.figure(figsize=(10, 8))
sns.heatmap(num_df.corr(), annot=False, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap (Numeric Features)")
plt.show()
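
With `annot=False` the heatmap shows patterns but not values. A short sketch of how the target column of the correlation matrix could be ranked numerically follows. The helper `rank_by_target_corr` and the toy frame are assumptions for illustration; only the column names are taken from this notebook.

```python
import pandas as pd

def rank_by_target_corr(frame, target):
    """Sort numeric columns by absolute Pearson correlation with `target`."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy frame with made-up values.
toy = pd.DataFrame({
    "attention_score": [1.0, 2.0, 3.0, 4.0, 5.0],
    "hrv_norm": [2.0, 4.0, 6.0, 8.0, 10.0],      # perfectly correlated with the target
    "movement_norm": [5.0, 3.0, 4.0, 1.0, 2.0],  # negatively correlated
    "hr_norm": [1.0, 3.0, 2.0, 3.0, 1.0],        # uncorrelated
})

ranked = rank_by_target_corr(toy, "attention_score")
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top while preserving their sign in the output.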


## Regression Model: Predicting Attention Score

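We fit a baseline linear regression model to predict the Attention Score from engineered features
such as normalized heart rate (HR), heart rate variability (HRV), movement, and Outside Factors.
Candidate features absent from the dataset are skipped, and the model is evaluated on a held-out 25% test split.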

In [None]:
candidate_features = [
    'hr_norm', 'hrv_norm', 'movement_norm',
    'outside_factors'
]

feature_cols = [c for c in candidate_features if c in df.columns]
print("Using features:", feature_cols)

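if 'attention_score' not in df.columns:
    raise ValueError("Column 'attention_score' not found in dataframe.")
if not feature_cols:
    raise ValueError("None of the candidate features were found in dataframe.")

# LinearRegression rejects NaN, so fit only on rows complete in the model columns.
model_df = df.dropna(subset=feature_cols + ['attention_score'])

X = model_df[feature_cols].values
y = model_df['attention_score'].values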

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

reg = LinearRegression()
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"Linear Regression R^2: {r2:.4f}")
print(f"Linear Regression MAE: {mae:.4f}")
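
R^2 and MAE are easier to interpret next to a trivial reference. The sketch below computes both metrics by hand and compares MAE against a predictor that always outputs the training-set mean. The function `regression_report` and the numbers are illustrative assumptions, not results from this dataset.

```python
import numpy as np

def regression_report(y_true, y_pred, y_train_mean):
    """R^2, MAE, and the MAE of a constant predictor that outputs the training mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "mae": float(np.abs(y_true - y_pred).mean()),
        "baseline_mae": float(np.abs(y_true - y_train_mean).mean()),
    }

# Made-up targets and predictions.
report = regression_report(
    y_true=[1.0, 2.0, 3.0, 4.0],
    y_pred=[1.5, 2.0, 2.5, 4.0],
    y_train_mean=2.5,
)
print(report)
```

In the notebook this would correspond to calling the helper with `y_test`, `y_pred`, and `y_train.mean()`. A model MAE close to the baseline MAE means the features add little beyond the average score.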


In [None]:
plt.figure(figsize=(6, 5))
plt.scatter(y_test, y_pred, alpha=0.4)
plt.xlabel("True Attention Score")
plt.ylabel("Predicted Attention Score")
plt.title("Regression: True vs Predicted Attention Score")
line_min = min(y_test.min(), y_pred.min())
line_max = max(y_test.max(), y_pred.max())
plt.plot([line_min, line_max], [line_min, line_max], 'r--')
plt.tight_layout()
plt.show()


## Classification Model: Detecting Attention Lapses

We label a row as an attention lapse when its Attention Score falls below a fixed threshold
(`lapse_threshold = -0.05`) and train a logistic regression classifier to detect such events
from the same set of features, using a stratified 75/25 train/test split.

In [None]:
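lapse_threshold = -0.05
df['lapse'] = (df['attention_score'] < lapse_threshold).astype(int)

# Drop incomplete rows: LogisticRegression rejects NaN, and a NaN attention_score
# compares False against the threshold, so it would be silently labeled 0 (no lapse).
cls_df = df.dropna(subset=feature_cols + ['attention_score'])

X_cls = cls_df[feature_cols].values
y_cls = cls_df['lapse'].values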

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cls, y_cls, test_size=0.25, random_state=42, stratify=y_cls
)

clf = LogisticRegression(max_iter=1000)
clf.fit(Xc_train, yc_train)

yc_pred = clf.predict(Xc_test)
yc_proba = clf.predict_proba(Xc_test)[:, 1]

acc = accuracy_score(yc_test, yc_pred)
f1 = f1_score(yc_test, yc_pred)
try:
    auc = roc_auc_score(yc_test, yc_proba)
except ValueError:
    auc = float('nan')

print(f"Logistic Regression Accuracy: {acc:.4f}")
print(f"Logistic Regression F1: {f1:.4f}")
print(f"Logistic Regression ROC-AUC: {auc:.4f}")
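
Accuracy can look high when lapses are rare, since a classifier that never predicts a lapse is right most of the time. A minimal sketch of a majority-class reference follows. The helper name `majority_baseline` and the label arrays are made up for illustration.

```python
import numpy as np

def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return int(majority), float((y_test == majority).mean())

# Made-up labels: 1 marks a lapse, 0 marks no lapse.
majority_label, baseline_acc = majority_baseline(
    y_train=[0, 0, 0, 1],
    y_test=[0, 0, 1, 0, 0],
)
print(f"Majority label: {majority_label}, baseline accuracy: {baseline_acc:.2f}")
```

In the notebook this would be called with `yc_train` and `yc_test`. When the majority label is 0, this baseline detects no lapses at all, which is why F1 and ROC-AUC are reported alongside accuracy.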


In [None]:
fpr, tpr, thresh = roc_curve(yc_test, yc_proba)
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f"ROC (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], 'k--', label="Random")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Attention Lapse Classifier")
plt.legend()
plt.tight_layout()
plt.show()
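
`clf.predict` uses a probability cutoff of 0.5, which is not necessarily the best operating point on the ROC curve. One common alternative, shown here only as a sketch, is to pick the threshold that maximizes Youden's J statistic (TPR minus FPR). The function `youden_threshold` and the arrays are illustrative assumptions shaped like the output of `roc_curve`.

```python
import numpy as np

def youden_threshold(fpr, tpr, thresholds):
    """Return the threshold maximizing J = TPR - FPR, and the J value itself."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    j = tpr - fpr
    best = int(np.argmax(j))  # first index wins on ties
    return float(thresholds[best]), float(j[best])

# Made-up ROC points in the order roc_curve returns them (decreasing thresholds).
fpr = [0.0, 0.0, 0.25, 0.5, 1.0]
tpr = [0.0, 0.5, 0.9, 1.0, 1.0]
thresholds = [np.inf, 0.9, 0.6, 0.4, 0.1]

best_threshold, best_j = youden_threshold(fpr, tpr, thresholds)
print(f"Best threshold: {best_threshold}, J = {best_j:.2f}")
```

Youden's J weights false positives and false negatives equally. A real-time application might prefer a different trade-off, and the threshold should be chosen on validation data rather than on the test split.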


## Conclusions

- The engineered Attention Score behaves as a meaningful summary of physiological engagement.
- Outside Factors relate to attention through the Pearson correlation computed above, whose sign and
  magnitude give the direction and strength of the linear association.
- Baseline linear and logistic regression models, evaluated on a held-out 25% test split, indicate that
  attention and lapses can be predicted from wearable-derived features, supporting the feasibility of
  real-time attention monitoring in a future Apple Watch application.
