# Data Analysis Template

This template provides a structured approach to data analysis using the Data Analysis and Prediction Platform.

## Overview
- **Purpose**: Perform exploratory data analysis and linear regression modeling
- **Platform**: DAPP (Data Analysis and Prediction Platform)
- **Target**: Educational data science workflows

## Workflow Steps
1. Setup and Configuration
2. Data Loading & Validation
3. Exploratory Data Analysis (EDA)
4. Data Preprocessing
5. Linear Regression Modeling
6. Model Evaluation
7. Results Export
8. Conclusion

## 1. Setup and Configuration

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import requests
import json
from pathlib import Path

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

# API Configuration (adjust host/port as needed)
API_BASE_URL = "http://localhost:8000"
API_HEADERS = {"Content-Type": "application/json"}

print("✅ Libraries imported successfully")
print(f"📡 API Base URL: {API_BASE_URL}")

## 2. Data Loading & Validation

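Load a dataset you have already uploaded through the platform's secure file upload API by passing its file ID, or use the generated sample data to try the workflow.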

In [None]:
# Option 1: Load data from uploaded file via API
def load_data_from_api(file_id: str):
    """Load processed data from the platform API."""
    try:
        # A timeout keeps the cell from hanging if the API is unreachable
        response = requests.get(f"{API_BASE_URL}/data/{file_id}", timeout=30)
        if response.status_code == 200:
            data_info = response.json()
            # Load the actual data
            df = pd.read_csv(data_info['file_path'])
            return df, data_info
        else:
            print(f"❌ Error loading data: {response.status_code}")
            return None, None
    except Exception as e:
        print(f"❌ Exception loading data: {e}")
        return None, None

# Option 2: Load sample data directly
def load_sample_data():
    """Load sample dataset for demonstration."""
    # Create sample data if no file is uploaded
    np.random.seed(42)
    n_samples = 100
    
    data = {
        'feature_1': np.random.normal(50, 15, n_samples),
        'feature_2': np.random.normal(30, 10, n_samples),
        'feature_3': np.random.uniform(0, 100, n_samples)
    }
    
    # Create target variable with some correlation
    data['target'] = (0.5 * data['feature_1'] + 
                     0.3 * data['feature_2'] + 
                     0.2 * data['feature_3'] + 
                     np.random.normal(0, 5, n_samples))
    
    return pd.DataFrame(data)

# Load your data here
# METHOD 1: Use file_id from uploaded file
# file_id = "your-file-id-here"  # Replace with actual file ID
# df, data_info = load_data_from_api(file_id)

# METHOD 2: Use sample data for demonstration
df = load_sample_data()
print(f"✅ Data loaded successfully: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()
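
The cell above loads the data but does not yet validate it. A minimal sketch of some basic checks follows. The helper `validate_dataframe` is hypothetical (it is not part of the platform API), and the column names are the ones from the sample data; adjust both to your dataset.

```python
import pandas as pd


def validate_dataframe(df, required_columns):
    """Return a list of human-readable problems; an empty list means the checks passed.

    Hypothetical helper for illustration, not part of the platform API.
    """
    if df is None or df.empty:
        return ["dataset is empty or failed to load"]

    problems = []

    # Every column the analysis relies on must be present
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # Linear regression needs numeric inputs
    non_numeric = [
        col for col in required_columns
        if col in df.columns and not pd.api.types.is_numeric_dtype(df[col])
    ]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")

    # Duplicate rows are reported because they can silently skew the fit
    n_duplicates = int(df.duplicated().sum())
    if n_duplicates:
        problems.append(f"{n_duplicates} duplicate rows")

    return problems


# Small demonstration frame; replace with the `df` loaded above
demo = pd.DataFrame({"feature_1": [1.0, 2.0, 2.0], "target": [3.0, 5.0, 5.0]})
print(validate_dataframe(demo, ["feature_1", "feature_2", "target"]))
```

Returning a list of problems rather than raising on the first one lets you see every issue in a single run before deciding how to clean the data.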

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Basic dataset information
print("📊 Dataset Overview:")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"Data types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")

In [None]:
# Statistical summary
print("📈 Statistical Summary:")
df.describe()

In [None]:
# Correlation analysis
print("🔗 Correlation Matrix:")
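# Restrict to numeric columns; corr() raises on string columns in recent pandas versions
correlation_matrix = df.select_dtypes(include=[np.number]).corr()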

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

correlation_matrix

In [None]:
# Distribution plots
numeric_columns = df.select_dtypes(include=[np.number]).columns
n_cols = 2
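n_rows = (len(numeric_columns) + n_cols - 1) // n_cols  # ceiling division

# squeeze=False always returns a 2D array of axes, so flatten() works even for a single row
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows), squeeze=False)
axes = axes.flatten()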

for i, col in enumerate(numeric_columns):
    if i < len(axes):
        sns.histplot(df[col], bins=20, ax=axes[i])
        axes[i].set_title(f'Distribution of {col}')
        axes[i].grid(True, alpha=0.3)

# Hide unused subplots
for i in range(len(numeric_columns), len(axes)):
    axes[i].set_visible(False)

plt.tight_layout()
plt.show()

## 4. Data Preprocessing

In [None]:
# Handle missing values
print("🧹 Data Cleaning:")
print(f"Missing values before cleaning: {df.isnull().sum().sum()}")

# Option 1: Drop rows with missing values
# df_clean = df.dropna()

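# Option 2: Fill missing values with the column median
df_clean = df.copy()
for col in df_clean.select_dtypes(include=[np.number]).columns:
    # Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in pandas 2.x
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())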

print(f"Missing values after cleaning: {df_clean.isnull().sum().sum()}")
print(f"Final dataset shape: {df_clean.shape}")

In [None]:
# Define features and target
# Adjust these column names based on your dataset
target_column = 'target'  # Replace with your target variable name
feature_columns = [col for col in df_clean.columns if col != target_column]

print(f"🎯 Target variable: {target_column}")
print(f"📊 Feature variables: {feature_columns}")

X = df_clean[feature_columns]
y = df_clean[target_column]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

## 5. Linear Regression Modeling

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"📚 Training set: {X_train.shape}")
print(f"🔬 Test set: {X_test.shape}")

In [None]:
# Train linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

print("✅ Model trained successfully")
print(f"📊 Model coefficients: {model.coef_}")
print(f"📈 Model intercept: {model.intercept_}")

In [None]:
# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print("🔮 Predictions generated")
print(f"Training predictions shape: {y_train_pred.shape}")
print(f"Test predictions shape: {y_test_pred.shape}")

## 6. Model Evaluation

In [None]:
# Calculate evaluation metrics
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("📊 Model Performance Metrics:")
print(f"Training MSE: {train_mse:.4f}")
print(f"Test MSE: {test_mse:.4f}")
print(f"Training R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")

# Check for overfitting: training R² noticeably higher than test R²
if train_r2 - test_r2 > 0.1:
    print("⚠️  Potential overfitting detected (training R² exceeds test R² by more than 0.1)")
else:
    print("✅ No large gap between training and test R²")
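
To make the two metrics concrete, here is a minimal pure-Python sketch of what `mean_squared_error` and `r2_score` compute for a single-output regression with default arguments. The helper names `mse` and `r2` and the sample values are illustrative assumptions, not results from the model above.

```python
def mse(actual, predicted):
    """Mean squared error: the average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Illustrative values only
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

print(f"MSE: {mse(actual, predicted):.4f}")
print(f"R²: {r2(actual, predicted):.4f}")
```

MSE is expressed in squared units of the target, while R² is unitless and compares the model against always predicting the mean, which is why the two are usually reported together.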

In [None]:
# Prediction vs Actual plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Training data
ax1.scatter(y_train, y_train_pred, alpha=0.6)
ax1.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', lw=2)
ax1.set_xlabel('Actual Values')
ax1.set_ylabel('Predicted Values')
ax1.set_title(f'Training Set: Actual vs Predicted\nR² = {train_r2:.4f}')
ax1.grid(True, alpha=0.3)

# Test data
ax2.scatter(y_test, y_test_pred, alpha=0.6, color='orange')
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
ax2.set_xlabel('Actual Values')
ax2.set_ylabel('Predicted Values')
ax2.set_title(f'Test Set: Actual vs Predicted\nR² = {test_r2:.4f}')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
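# Feature importance (coefficient analysis)
# Note: raw coefficients depend on each feature's units, so this ranking is only
# meaningful when the features are measured on comparable scales.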
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Coefficient': model.coef_,
    'Abs_Coefficient': np.abs(model.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='Abs_Coefficient', y='Feature')
plt.title('Feature Importance (Absolute Coefficients)')
plt.xlabel('Absolute Coefficient Value')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("📊 Feature Importance:")
feature_importance
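
Ranking features by absolute coefficient can mislead when features have very different scales. The sketch below uses synthetic data and a hypothetical helper `fit_coefficients` (plain least squares via NumPy, swapped in for `LinearRegression` to keep the example self-contained) to show how standardizing the features first can change the ranking.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data for illustration: x_large has roughly 100x the spread of x_small
x_large = rng.normal(0, 100, n)
x_small = rng.normal(0, 1, n)
y = 0.01 * x_large + 0.5 * x_small + rng.normal(0, 0.1, n)


def fit_coefficients(X, y):
    """Ordinary least squares with an intercept; returns the slope coefficients only."""
    design = np.column_stack([np.ones(len(X)), X])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[1:]


X = np.column_stack([x_large, x_small])
raw_coefs = fit_coefficients(X, y)

# Standardize each feature to zero mean and unit variance before fitting
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_coefs = fit_coefficients(X_scaled, y)

print("raw:   ", np.round(raw_coefs, 3))
print("scaled:", np.round(scaled_coefs, 3))
```

On the raw data `x_small` has the larger coefficient, but after standardization `x_large` does, because a one-standard-deviation change in `x_large` moves the target more.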

## 7. Results Export

Save the analysis results, including the model's coefficients and intercept, for future use.

In [None]:
# Prepare results summary
results_summary = {
    'model_type': 'Linear Regression',
    'dataset_shape': df_clean.shape,
    'features': feature_columns,
    'target': target_column,
    'train_test_split': {'train': X_train.shape[0], 'test': X_test.shape[0]},
    'performance_metrics': {
        'train_mse': float(train_mse),
        'test_mse': float(test_mse),
        'train_r2': float(train_r2),
        'test_r2': float(test_r2)
    },
    'model_parameters': {
        'coefficients': model.coef_.tolist(),
        'intercept': float(model.intercept_)
    },
    'feature_importance': feature_importance.to_dict('records')
}

print("📄 Results Summary:")
print(json.dumps(results_summary, indent=2))

In [None]:
# Save results to file
from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_filename = f"analysis_results_{timestamp}.json"

# Save to exports directory
export_path = Path("../data/processed/exports") / results_filename
export_path.parent.mkdir(parents=True, exist_ok=True)

with open(export_path, 'w') as f:
    json.dump(results_summary, f, indent=2)

print(f"✅ Results saved to: {export_path}")

# Optional: Save predictions to CSV
predictions_df = pd.DataFrame({
    'actual': y_test,
    'predicted': y_test_pred,
    'residual': y_test - y_test_pred
})

predictions_filename = f"predictions_{timestamp}.csv"
predictions_path = Path("../data/processed/exports") / predictions_filename
predictions_df.to_csv(predictions_path, index=False)

print(f"✅ Predictions saved to: {predictions_path}")
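
Because the export stores the coefficients and intercept rather than the fitted estimator object, a later session can rebuild predictions from the JSON file alone. The sketch below assumes the `features` and `model_parameters` keys from the summary above; the helper `predict_from_summary`, the file name, and all values are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the exported summary; values are made up for illustration
saved_summary = {
    "features": ["feature_1", "feature_2"],
    "model_parameters": {"coefficients": [0.5, 0.3], "intercept": 1.0},
}


def predict_from_summary(summary, row):
    """Apply intercept + sum(coefficient * feature value) using the saved parameters."""
    params = summary["model_parameters"]
    return params["intercept"] + sum(
        coef * row[name]
        for name, coef in zip(summary["features"], params["coefficients"])
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "analysis_results_example.json"
    path.write_text(json.dumps(saved_summary))

    # Read the file back as a later session would
    loaded = json.loads(path.read_text())

prediction = predict_from_summary(loaded, {"feature_1": 10.0, "feature_2": 20.0})
print(prediction)
```

Pairing coefficients with the saved feature names, rather than relying on position alone, guards against columns being reordered between sessions.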

## 8. Conclusion

### Analysis Summary

- **Dataset**: [Fill in the number of samples and features, e.g. `df_clean.shape[0]` and `len(feature_columns)`]
- **Model**: Linear Regression; [fill in the test R², e.g. `test_r2` rounded to four decimals]
- **Key Insights**: [Add your domain-specific insights here]

### Next Steps

1. **Model Improvement**: Consider feature engineering or regularization
2. **Additional Analysis**: Explore non-linear relationships
3. **Validation**: Test on new datasets
4. **Deployment**: Use the platform's model service for production

### Platform Features Used

- ✅ Secure data loading
- ✅ Exploratory data analysis
- ✅ Linear regression modeling
- ✅ Model evaluation
- ✅ Results export

---

*This analysis was completed using the Data Analysis and Prediction Platform (DAPP)*