# Kaggle Competition - Exploratory Data Analysis

**Author**: PhD-Level Data Science Solution  
**Competition Metric**: RMSE (Root Mean Squared Error)  
**Goal**: Comprehensive EDA to inform modeling strategy

This notebook explores the training data to understand:
- Data structure and quality
- Target variable distribution
- Feature characteristics and correlations
- Outliers and anomalies
- Feature importance signals

## 1. Import Libraries and Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set visualization styles (the 'seaborn-v0_8-*' style names require matplotlib >= 3.6)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.4f}'.format)

print("Libraries imported successfully!")

## 2. Load Data

In [None]:
# Load datasets
train_df = pd.read_csv('../data/trainingdata.csv')
test_df = pd.read_csv('../data/test_predictors.csv')
sample_sub = pd.read_csv('../data/SampleSubmission.csv')

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
print(f"Sample submission shape: {sample_sub.shape}")
print(f"\nNumber of features: {train_df.shape[1] - 1}")
print(f"Number of training samples: {train_df.shape[0]}")
print(f"Number of test samples: {test_df.shape[0]}")

In [None]:
# Display first few rows
print("Training Data:")
display(train_df.head())

print("\nTest Data:")
display(test_df.head())

print("\nSample Submission:")
display(sample_sub.head())

## 3. Data Quality Check

In [None]:
# Check for missing values
print("Missing Values in Training Data:")
missing_train = train_df.isnull().sum()
print(f"Total missing values: {missing_train.sum()}")
if missing_train.sum() > 0:
    print(missing_train[missing_train > 0])
else:
    print("No missing values detected ✓")

print("\nMissing Values in Test Data:")
missing_test = test_df.isnull().sum()
print(f"Total missing values: {missing_test.sum()}")
if missing_test.sum() > 0:
    print(missing_test[missing_test > 0])
else:
    print("No missing values detected ✓")

In [None]:
# Data types and info
print("Training Data Info:")
train_df.info()
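
Beyond missing values, two other quality issues are cheap to check: fully duplicated rows and columns that never vary. The following is a minimal sketch on a small synthetic frame; `demo_df` is a hypothetical stand-in for `train_df`, not the competition data.

```python
import pandas as pd

# Synthetic stand-in for train_df (assumption: numeric features plus a 'y' column).
demo_df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 1.0],
    "f2": [5.0, 5.0, 5.0, 5.0],   # constant column: carries no signal
    "f3": [0.1, 0.2, 0.3, 0.1],
    "y":  [10.0, 20.0, 30.0, 10.0],
})

# Rows that exactly repeat an earlier row (the last row repeats the first here).
n_duplicates = int(demo_df.duplicated().sum())

# Columns with at most one distinct value, counting NaN as a value.
constant_cols = [
    col for col in demo_df.columns
    if demo_df[col].nunique(dropna=False) <= 1
]

print(f"Duplicate rows: {n_duplicates}")
print(f"Constant columns: {constant_cols}")
```

Constant columns can usually be dropped before modeling; duplicated rows deserve a closer look before removal, because they can also leak between cross-validation folds.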

## 4. Target Variable Analysis

In [None]:
# Target variable statistics
y = train_df['y']

print("Target Variable Statistics:")
print(y.describe())
print(f"\nSkewness: {y.skew():.4f}")
print(f"Excess kurtosis: {y.kurtosis():.4f}")  # pandas uses Fisher's definition (normal = 0)

# Visualize target distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram
axes[0].hist(y, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Target Value (y)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Target Variable', fontsize=14, fontweight='bold')
axes[0].axvline(y.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {y.mean():.2f}')
axes[0].axvline(y.median(), color='green', linestyle='--', linewidth=2, label=f'Median: {y.median():.2f}')
axes[0].legend()

# Box plot
axes[1].boxplot(y, vert=True)
axes[1].set_ylabel('Target Value (y)', fontsize=12)
axes[1].set_title('Box Plot - Outlier Detection', fontsize=14, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

# Q-Q plot
stats.probplot(y, dist="norm", plot=axes[2])
axes[2].set_title('Q-Q Plot - Normality Check', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Check for outliers using IQR method
Q1 = y.quantile(0.25)
Q3 = y.quantile(0.75)
IQR = Q3 - Q1
outliers = ((y < (Q1 - 1.5 * IQR)) | (y > (Q3 + 1.5 * IQR))).sum()
print(f"\nNumber of outliers (IQR method): {outliers} ({100*outliers/len(y):.2f}%)")
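
If the skewness and Q-Q plot point to a heavy right tail, one option worth evaluating is a log-transformed target. The sketch below compares skewness before and after `np.log1p` on synthetic data; `y_demo` is a hypothetical stand-in for `train_df['y']`, and whether the transform helps on the real target is an open question.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target (assumption: the real target may look different).
rng = np.random.default_rng(42)
y_demo = pd.Series(rng.lognormal(mean=2.0, sigma=0.8, size=300), name="y")

# Shift so the minimum is non-negative before applying log1p.
shift = max(0.0, -float(y_demo.min()))
y_log = np.log1p(y_demo + shift)

skew_raw = float(y_demo.skew())
skew_log = float(y_log.skew())
print(f"Skewness (raw):   {skew_raw:.4f}")
print(f"Skewness (log1p): {skew_log:.4f}")
```

A model trained on the transformed target must invert the transform with `np.expm1(pred) - shift` before scoring, since the competition RMSE is computed on the original scale.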

## 5. Feature Statistics

In [None]:
# Get feature columns
X = train_df.drop('y', axis=1)

print(f"Number of features: {X.shape[1]}")
print(f"\nFeature Statistics:")
display(X.describe().T)

# Check feature scales
print("\nFeature Value Ranges:")
feature_ranges = pd.DataFrame({
    'min': X.min(),
    'max': X.max(),
    'range': X.max() - X.min(),
    'mean': X.mean(),
    'std': X.std()
})
display(feature_ranges.head(20))

## 6. Correlation Analysis

In [None]:
# Correlation with target (numeric_only=True skips any non-numeric columns)
correlations = train_df.corr(numeric_only=True)['y'].drop('y').sort_values(ascending=False)

print("Top 20 Features (Highest Correlation with Target):")
print(correlations.head(20))

print("\nBottom 20 Features (Lowest, i.e. Most Negative, Correlation with Target):")
print(correlations.tail(20))

# Visualize top correlations
fig, ax = plt.subplots(figsize=(12, 8))
top_corr = pd.concat([correlations.head(15), correlations.tail(15)])
colors = ['green' if x > 0 else 'red' for x in top_corr]
top_corr.plot(kind='barh', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Correlation Coefficient', fontsize=12)
ax.set_title('15 Highest and 15 Lowest Feature Correlations with Target', fontsize=14, fontweight='bold')
ax.axvline(0, color='black', linewidth=0.8)
plt.tight_layout()
plt.show()
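
Pearson correlation only measures linear association, so a feature with a strong but nonlinear monotone effect can rank low in the list above. A rough sketch of a cross-check using Spearman rank correlation follows; the data is synthetic and the column names `x` and `y` in `demo_df` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=300)

# Synthetic monotone but strongly nonlinear relationship (assumption).
demo_df = pd.DataFrame({
    "x": x,
    "y": np.exp(2.0 * x + rng.normal(scale=0.2, size=300)),
})

# Same pattern as the notebook: correlate every column with 'y', then drop 'y' itself.
pearson_r = float(demo_df.corr(method="pearson")["y"].drop("y").iloc[0])
spearman_r = float(demo_df.corr(method="spearman")["y"].drop("y").iloc[0])

print(f"Pearson:  {pearson_r:.4f}")
print(f"Spearman: {spearman_r:.4f}")
```

Features whose Spearman coefficient is much larger in magnitude than their Pearson coefficient are candidates for a monotone transform or for tree-based models.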

In [None]:
# Feature-feature correlation (multicollinearity check)
feature_corr = X.corr()

# Find highly correlated feature pairs
high_corr_pairs = []
for i in range(len(feature_corr.columns)):
    for j in range(i+1, len(feature_corr.columns)):
        if abs(feature_corr.iloc[i, j]) > 0.8:
            high_corr_pairs.append({
                'Feature1': feature_corr.columns[i],
                'Feature2': feature_corr.columns[j],
                'Correlation': feature_corr.iloc[i, j]
            })

if high_corr_pairs:
    print(f"Found {len(high_corr_pairs)} highly correlated feature pairs (|r| > 0.8):")
    display(pd.DataFrame(high_corr_pairs).head(20))
else:
    print("No highly correlated feature pairs found (|r| > 0.8)")
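
The nested loop above is fine for roughly a hundred features, but the same pair search can be written without Python-level loops by masking the upper triangle of the correlation matrix. This is a sketch on synthetic data; `X_demo` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_demo = pd.DataFrame({
    "f1": base,
    "f2": base + rng.normal(scale=0.1, size=200),    # near-duplicate of f1
    "f3": rng.normal(size=200),                       # independent noise
    "f4": -base + rng.normal(scale=0.1, size=200),   # mirror image of f1
})

corr = X_demo.corr()
values = corr.to_numpy()

# Strict upper triangle: each unordered pair once, diagonal excluded.
upper = np.triu(np.ones(values.shape, dtype=bool), k=1)
rows, cols = np.where(upper & (np.abs(values) > 0.8))

high_corr = pd.DataFrame({
    "Feature1": corr.columns[rows],
    "Feature2": corr.columns[cols],
    "Correlation": values[rows, cols],
})
print(high_corr)
```

Comparisons against NaN evaluate to false, so pairs involving a constant column (whose correlation is undefined) are skipped rather than raising an error.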

## 7. Feature Importance (Random Forest)

In [None]:
# Train a Random Forest to get feature importance
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

# Get feature importances
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("Top 20 Most Important Features:")
display(feature_importance.head(20))

# Visualize
fig, ax = plt.subplots(figsize=(12, 8))
top_features = feature_importance.head(30)
ax.barh(range(len(top_features)), top_features['Importance'], color='steelblue', edgecolor='black')
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['Feature'])
ax.set_xlabel('Feature Importance', fontsize=12)
ax.set_title('Top 30 Features by Random Forest Importance', fontsize=14, fontweight='bold')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

# Cumulative importance
feature_importance['Cumulative'] = feature_importance['Importance'].cumsum()
# Count the features whose running total is still below the threshold,
# then add one for the feature that crosses it.
n_80 = (feature_importance['Cumulative'] < 0.8).sum() + 1
n_90 = (feature_importance['Cumulative'] < 0.9).sum() + 1
print(f"\nFeatures needed for 80% cumulative importance: {n_80}")
print(f"Features needed for 90% cumulative importance: {n_90}")
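
Impurity-based importances from `feature_importances_` are computed on the training data and can overstate noisy features, so it may be worth cross-checking them with permutation importance on held-out rows. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data in which only `f0` and `f1` drive the target; all names and sizes are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 10)),
    columns=[f"f{i}" for i in range(10)],
)
# Synthetic target: only f0 and f1 carry signal (assumption for illustration).
y_demo = 3.0 * X_demo["f0"] - 2.0 * X_demo["f1"] + rng.normal(scale=0.5, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Shuffle each column of the validation set and measure how much RMSE degrades.
result = permutation_importance(
    rf_demo, X_val, y_val,
    scoring="neg_root_mean_squared_error",
    n_repeats=10,
    random_state=42,
)
perm_importance = pd.DataFrame({
    "Feature": X_demo.columns,
    "Importance": result.importances_mean,
    "Std": result.importances_std,
}).sort_values("Importance", ascending=False)
print(perm_importance.head())
```

With `scoring="neg_root_mean_squared_error"`, each importance is the average increase in validation RMSE when that feature is shuffled, which ties the ranking directly to the competition metric.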

## 8. Key Insights and Recommendations

### Data Characteristics:
- **Small Dataset**: Only 302 training samples - scores from a single validation split will be noisy, so validation needs care
- **High Dimensionality**: 112 features (about 2.7 samples per feature) - feature selection/engineering critical
- **No Missing Values**: Data is complete ✓

### Modeling Strategy:
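1. **Cross-Validation**: Use 5-fold CV, ideally repeated over several seeds; with 302 samples each validation fold holds only about 60 rows, so single-run scores are noisy
2. **Regularization**: Essential to prevent overfitting (Ridge, Lasso, ElasticNet)
3. **Feature Engineering**: Create polynomial and statistical features selectively, since expanding 112 features quickly yields far more columns than samples
4. **Ensemble Methods**: Combine multiple models to reduce variance
5. **Gradient Boosting**: Benchmark XGBoost, LightGBM, and CatBoost, but with this few samples they need strong regularization and may not beat regularized linear models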
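
The cross-validation and regularization points can be sketched together as a scaled Ridge pipeline scored with repeated 5-fold CV on RMSE. The data below is synthetic and only mimics the 302 x 112 shape; `alpha=10.0` and three repeats are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_samples, n_features = 302, 112

# Synthetic data: only the first five features carry signal (assumption).
X_demo = rng.normal(size=(n_samples, n_features))
coef = np.zeros(n_features)
coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y_demo = X_demo @ coef + rng.normal(scale=1.0, size=n_samples)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

# scikit-learn maximizes scores, so RMSE is exposed as a negative value.
scores = -cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
print(f"CV RMSE: {scores.mean():.4f} +/- {scores.std():.4f} over {len(scores)} folds")
```

Keeping `StandardScaler` inside the pipeline avoids leaking validation-fold statistics into training, and the spread across the 15 fold scores gives a sense of how much of a leaderboard difference is just noise.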

### Next Steps:
- Implement comprehensive preprocessing pipeline
- Train diverse model ensemble
- Use stacking/blending for final predictions
- Monitor cross-validation scores carefully
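
As a starting point for the stacking step, scikit-learn's `StackingRegressor` fits base models with internal cross-validation and trains a meta-model on their out-of-fold predictions. The sketch below uses synthetic data and two arbitrary base models; the choice of estimators and hyperparameters is an assumption, not a recommendation derived from this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 20))
# Synthetic target with a linear part and a nonlinear part (assumption).
y_demo = (
    3.0 * X_demo[:, 0]
    - 2.0 * X_demo[:, 1]
    + np.sin(3.0 * X_demo[:, 2])
    + rng.normal(scale=0.5, size=300)
)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

stack = StackingRegressor(
    estimators=[
        ("ridge", make_pipeline(StandardScaler(), Ridge(alpha=1.0))),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
    ],
    final_estimator=RidgeCV(),  # meta-model fit on out-of-fold predictions
    cv=5,
)
stack.fit(X_tr, y_tr)

val_pred = stack.predict(X_val)
rmse = float(np.sqrt(mean_squared_error(y_val, val_pred)))
print(f"Hold-out RMSE: {rmse:.4f}")
```

Pairing a linear model with a tree model is a common way to get diverse base predictions; on the real data each candidate would still need to be compared against its own cross-validated RMSE before being added to the stack.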