# 01_EDA — Exploratory Data Analysis

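This notebook performs initial EDA on the Kaggle **Ames Housing / House Prices** dataset. Before running, place `train.csv` in the project's `data/` folder; the notebook reads it from `../data/train.csv`, relative to the notebook's own directory.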

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('../data/train.csv')
df.shape

In [None]:
## Quick peek
print(df.columns.tolist()[:30])
df.head()

In [None]:
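## Data types and missing values
print(df.dtypes.value_counts())  # number of columns per dtype
missing = df.isnull().sum().sort_values(ascending=False)
missing[missing > 0].head(30)  # only columns with at least one missing value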

In [None]:
## Summary statistics for numeric features
num = df.select_dtypes(include=[np.number])
num.describe().T

In [None]:
## Target distribution (SalePrice)
plt.figure(figsize=(8,4))
plt.hist(df['SalePrice'], bins=50)
plt.title('SalePrice distribution')
plt.xlabel('SalePrice')
plt.ylabel('Count')
plt.show()

In [None]:
## Log-transform target to reduce skew
plt.figure(figsize=(8,4))
plt.hist(np.log1p(df['SalePrice']), bins=50)
plt.title('Log1p(SalePrice) distribution')
plt.xlabel('Log1p(SalePrice)')
plt.ylabel('Count')
plt.show()
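
To check that the transform actually reduces skew, the skewness can be compared before and after. The sketch below uses synthetic log-normal values instead of `train.csv`; the `prices` series and its parameters are made up for illustration, and `df['SalePrice']` would take its place on the real data. It also shows that `np.expm1` undoes `np.log1p`, which matters if values on the log scale later need to be mapped back to prices.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (an assumption for illustration only;
# replace with df['SalePrice'] when running against train.csv).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000), name='SalePrice')

log_prices = np.log1p(prices)

# Sample skewness: values near 0 indicate a roughly symmetric distribution.
skew_raw = prices.skew()
skew_log = log_prices.skew()
print(f'skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}')

# np.expm1 is the inverse of np.log1p, so the original scale is recoverable.
recovered = np.expm1(log_prices)
```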

In [None]:
## Correlation with target — top features
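# Absolute Pearson correlation with SalePrice; drop the target's own entry (always 1.0)
corr = num.corr()['SalePrice'].drop('SalePrice').abs().sort_values(ascending=False)
corr.head(20)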

In [None]:
## Scatter: GrLivArea vs SalePrice
plt.figure(figsize=(6,4))
plt.scatter(df['GrLivArea'], df['SalePrice'])
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')
plt.title('GrLivArea vs SalePrice')
plt.show()

**Notes / next steps:**

- Investigate features with a high share of missing values and decide on an imputation strategy.
- Check the cardinality of categorical features and group rare categories.
- Reduce skew in numeric features with a log transform (e.g. `np.log1p`) where appropriate.
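
The steps above could be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not a final preprocessing pipeline: the `toy` values, the median imputation, the `RARE_CUTOFF` of 0.3, and the `SKEW_CUTOFF` of 0.75 are all assumptions chosen for the example and would need tuning on the real `df`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for train.csv (values are illustrative only).
toy = pd.DataFrame({
    'LotFrontage': [60.0, 80.0, np.nan, 60.0, np.nan, 80.0],
    'Neighborhood': ['NAmes', 'NAmes', 'NAmes', 'NAmes', 'Blueste', 'Veenker'],
    'LotArea': [8000, 9000, 8500, 9500, 8200, 150000],
})

# Missing values: fraction missing per column, highest first.
missing_frac = toy.isnull().mean().sort_values(ascending=False)

# One simple imputation choice (an assumption): fill numeric gaps with the median.
toy['LotFrontage'] = toy['LotFrontage'].fillna(toy['LotFrontage'].median())

# Rare categories: collapse levels below an assumed frequency cutoff into 'Other'.
RARE_CUTOFF = 0.3  # hypothetical threshold
freq = toy['Neighborhood'].value_counts(normalize=True)
rare = freq[freq < RARE_CUTOFF].index
toy['Neighborhood'] = toy['Neighborhood'].where(~toy['Neighborhood'].isin(rare), 'Other')

# Skewed numeric features: log-transform columns whose |skew| exceeds an assumed cutoff.
SKEW_CUTOFF = 0.75  # hypothetical threshold
numeric_cols = toy.select_dtypes(include=[np.number]).columns
skewness = toy[numeric_cols].skew()
skewed_cols = skewness[skewness.abs() > SKEW_CUTOFF].index.tolist()
for col in skewed_cols:
    toy[col] = np.log1p(toy[col])

print(missing_frac)
print(toy)
```

On the real data, the cutoffs are worth checking against the missing-value table and the correlation ranking computed earlier before committing to them.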
