# Notebook 1: EDA and Data Understanding
## HabitAlpes - Apartment Price Prediction

**Objective**: Perform comprehensive exploratory data analysis to understand the apartment dataset.

**Deliverable**: Data understanding report (10% of grade)

## Setup

In [None]:
import sys
sys.path.append('../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML, Image

%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

import warnings
warnings.filterwarnings('ignore')

## 1. Load Data

In [None]:
from utils import load_data, print_section_header

df = load_data()
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
display(df.head())

## 2. Dataset Dimensions and Structure

In [None]:
print(f"Number of rows: {df.shape[0]:,}")
print(f"Number of columns: {df.shape[1]}")
print("\nColumn names and types:")
display(pd.DataFrame({
    'Column': df.columns,
    'Type': df.dtypes.values,
    'Non-Null Count': df.count().values,
    'Null Count': df.isnull().sum().values
}))

## 3. Missing Values Analysis

In [None]:
from utils import summarize_missing_values

missing = summarize_missing_values(df)
if len(missing) > 0:
    print("Columns with missing values:")
    display(missing)
else:
    print("No missing values in the dataset!")
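
The `summarize_missing_values` helper comes from `utils` and its implementation is not shown in this notebook. As a rough sketch of what such a summary could look like, the hypothetical function below (`missing_value_table` is an illustrative name, not the project's helper) reports the count and percentage of nulls per column, using a made-up toy frame in place of the real data.

```python
import numpy as np
import pandas as pd


def missing_value_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return null count and percentage for every column that has nulls."""
    counts = df.isnull().sum()
    table = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one null, worst offenders first
    table = table[table['missing_count'] > 0]
    return table.sort_values('missing_count', ascending=False)


# Toy frame standing in for the apartment data (all values are made up)
toy_missing_df = pd.DataFrame({
    'area': [50.0, np.nan, 72.0, 90.0],
    'estrato': [3, 4, np.nan, np.nan],
    'precio_venta': [200e6, 350e6, 410e6, 520e6],
})
missing_toy = missing_value_table(toy_missing_df)
print(missing_toy)
```

Reporting the percentage alongside the raw count makes it easier to judge which columns might be imputed and which might be dropped in the preprocessing notebook.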

## 4. Target Variable Analysis: precio_venta

In [None]:
from utils import calculate_basic_stats

calculate_basic_stats(df['precio_venta'].dropna(), 'precio_venta')

# Visualize distribution
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

axes[0].hist(df['precio_venta'].dropna(), bins=100, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Precio Venta (COP)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Sale Prices', fontsize=14, fontweight='bold')
axes[0].ticklabel_format(style='plain')

axes[1].boxplot(df['precio_venta'].dropna())
axes[1].set_ylabel('Precio Venta (COP)')
axes[1].set_title('Boxplot of Sale Prices', fontsize=14, fontweight='bold')
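# Restrict to the y-axis: boxplot installs a fixed formatter on the x-axis,
# which ticklabel_format cannot modify
axes[1].ticklabel_format(style='plain', axis='y')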

plt.tight_layout()
plt.show()
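
To put numbers on what the boxplot shows, the sketch below counts observations that fall outside the whisker fences. It assumes matplotlib's default whisker rule (`whis=1.5`, i.e. 1.5 times the interquartile range); `iqr_outlier_summary` is a hypothetical helper, not part of `utils`, and the prices are toy values.

```python
import pandas as pd


def iqr_outlier_summary(values: pd.Series, whis: float = 1.5) -> dict:
    """Quartiles, boxplot fences, outlier count and skewness of a series."""
    values = values.dropna()
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    # Points beyond the fences are the ones a boxplot draws individually
    n_outliers = int(((values < lower) | (values > upper)).sum())
    return {
        'q1': q1,
        'median': values.median(),
        'q3': q3,
        'lower_fence': lower,
        'upper_fence': upper,
        'n_outliers': n_outliers,
        'skew': values.skew(),
    }


# Toy prices (made up): one very expensive listing among ordinary ones
toy_prices = pd.Series([100, 120, 130, 150, 160, 180, 200, 900], dtype=float)
price_summary = iqr_outlier_summary(toy_prices)
print(price_summary)
```

On the real data this could be called as `iqr_outlier_summary(df['precio_venta'])`. A positive `skew` value would be consistent with a long right tail in the histogram.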

## 5. Numeric Features Analysis

In [None]:
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Found {len(numeric_cols)} numeric columns")

# Descriptive statistics
display(df[numeric_cols].describe())

## 6. Correlation with Target

In [None]:
if 'precio_venta' in numeric_cols:
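    # Drop the target itself: its self-correlation is always 1.0 and would
    # otherwise occupy the first slot of the ranking
    correlations = (
        df[numeric_cols].corr()['precio_venta']
        .drop('precio_venta')
        .sort_values(ascending=False)
    )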
    
    print("Top 15 features most correlated with precio_venta:")
    display(correlations.head(15))
    
    # Visualize
    fig, ax = plt.subplots(figsize=(10, 8))
    top_corr = correlations.head(15)
    ax.barh(range(len(top_corr)), top_corr.values, color='steelblue')
    ax.set_yticks(range(len(top_corr)))
    ax.set_yticklabels(top_corr.index)
    ax.set_xlabel('Correlation with precio_venta')
    ax.set_title('Top Features Correlated with Price', fontsize=14, fontweight='bold')
    ax.invert_yaxis()
    ax.grid(True, alpha=0.3, axis='x')
    plt.tight_layout()
    plt.show()
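
Sorting correlations in descending order surfaces only the positive relationships, so a strongly negative predictor would end up at the bottom of the list. One possible alternative, sketched below on synthetic data, is to rank by absolute value while keeping the sign. The function name `rank_by_abs_correlation` and the column `antiguedad` (building age) are assumptions for illustration and may not exist in the dataset.

```python
import numpy as np
import pandas as pd


def rank_by_abs_correlation(df: pd.DataFrame, target: str, top_n: int = 15) -> pd.Series:
    """Signed Pearson correlations with `target`, ordered by absolute value."""
    numeric = df.select_dtypes(include=[np.number])
    corr = numeric.corr()[target].drop(target)  # drop the trivial self-correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top_n)


# Synthetic data: price rises with area and falls with building age
rng = np.random.default_rng(0)
n = 200
toy_corr_df = pd.DataFrame({
    'area': rng.uniform(40, 200, n),
    'antiguedad': rng.uniform(0, 40, n),  # hypothetical column name
    'ruido': rng.normal(0, 1, n),         # pure noise, unrelated to price
})
toy_corr_df['precio_venta'] = (
    toy_corr_df['area'] - 8 * toy_corr_df['antiguedad'] + rng.normal(0, 5, n)
)

ranked = rank_by_abs_correlation(toy_corr_df, 'precio_venta')
print(ranked)
```

Because the signed values are preserved, the same series can still feed a horizontal bar chart, with negative bars pointing left.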

## 7. Geographic Analysis

In [None]:
if 'localidad' in df.columns:
    print("Distribution by Localidad:")
    localidad_counts = df['localidad'].value_counts().head(10)
    display(localidad_counts)
    
    # Price by localidad
    price_by_localidad = df.groupby('localidad')['precio_venta'].agg(['mean', 'median', 'count'])
    price_by_localidad = price_by_localidad.sort_values('mean', ascending=False).head(10)
    
    print("\nTop 10 Localidades by average price:")
    display(price_by_localidad)
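
A ranking by mean price can be dominated by localidades with only a handful of listings. The sketch below shows one way to guard against that: require a minimum group size and sort by the median, which is less sensitive to extreme listings. `price_stats_by_group` and the `min_count` threshold are illustrative assumptions, and the prices are made up.

```python
import pandas as pd


def price_stats_by_group(
    df: pd.DataFrame,
    group_col: str,
    target: str = 'precio_venta',
    min_count: int = 30,
) -> pd.DataFrame:
    """Mean, median and count of `target` per group, for groups with enough rows."""
    stats = df.groupby(group_col)[target].agg(['mean', 'median', 'count'])
    # Discard groups too small to give a stable estimate
    stats = stats[stats['count'] >= min_count]
    return stats.sort_values('median', ascending=False)


# Toy data (prices are made up); Chapinero has a single, extreme listing
toy_geo_df = pd.DataFrame({
    'localidad': ['Suba', 'Suba', 'Suba', 'Usaquén', 'Usaquén', 'Chapinero'],
    'precio_venta': [300e6, 320e6, 340e6, 500e6, 700e6, 2000e6],
})
geo_stats = price_stats_by_group(toy_geo_df, 'localidad', min_count=2)
print(geo_stats)
```

With `min_count=2` the single-listing group is excluded, so it cannot top the ranking on the strength of one observation.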

## 8. Run Complete EDA Script

Run the comprehensive EDA script to generate all visualizations and reports.

In [None]:
# Uncomment to run the complete EDA script
# This will generate all figures in reports/figures/ and results in data/results/

# %run ../src/01_eda.py

## 9. View Generated Figures

In [None]:
# Display some of the generated figures
from pathlib import Path

figures_dir = Path('../reports/figures')

if figures_dir.exists():
    figures = sorted(figures_dir.glob('0*.png'))  # EDA figures start with 0
    
    for fig_path in figures[:5]:  # Show first 5 figures
        print(f"\n### {fig_path.name}")
        display(Image(filename=str(fig_path)))
else:
    print("Run the EDA script first to generate figures.")

## Summary

This notebook completed the exploratory data analysis phase. Key findings:

1. **Dataset Size**: 43,013 apartment records with 46 features
2. **Target Variable**: Sale prices range from ~10M to 900M+ COP
3. **Missing Values**: Several features have missing data that need handling
4. **Key Features**: Area and estrato correlate strongly with price; localidad, being categorical, shows its effect through differences in group-level prices
5. **Geographic Patterns**: Significant price variation across neighborhoods

**Next Steps**: Data preprocessing and feature engineering (Notebook 2)