# EDA on Boston Housing
v.ekc-c

Today we apply the full EDA toolkit — descriptive statistics, data cleaning, and visualization — to the classic **Boston Housing** dataset. The goal is to understand the data well enough to ask good modeling questions.

**Sections:**
1. Setup
2. Seaborn EDA Plot Review
3. Load & Understand the Data
4. EDA — Descriptive Statistics
5. EDA — Data Preparation
6. EDA — Visualization
7. Correlation & Feature Selection

---
## 1. Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
sns.set_style("darkgrid")
import warnings 
warnings.filterwarnings('ignore') 

In [None]:
iris = sns.load_dataset('iris')
iris

---
## 2. Seaborn EDA Plot Review

Quick reminder of the three Seaborn plots we reach for first in any EDA:

| Plot | Call | What it shows |
|---|---|---|
| Pairplot | `sns.pairplot(df, hue='cat_col')` | All pairwise numeric relationships |
| Boxplot | `sns.boxplot(data=df, x='cat', y='num')` | Distribution per group |
| Heatmap | `sns.heatmap(df.corr(), annot=True)` | Correlation matrix |

In [None]:
# pairplot
sns.pairplot(iris)
plt.show()

In [None]:
# pairplot customizations
sns.pairplot(data = iris, hue='species')
plt.show()

In [None]:
# boxplot
sns.boxplot(data = iris)
plt.show()

In [None]:
# boxplot for numeric variable vs categorical
sns.boxplot(data = iris, x = 'species',y='sepal_width')
plt.show()
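
For reference, the box in a boxplot spans the first to third quartile, and with seaborn's default settings the whiskers reach the furthest points within 1.5 × IQR of the box. A minimal sketch of those numbers, using a small made-up sample (the name `sample` and its values are illustrative, not taken from `iris`):

```python
import pandas as pd

# Hypothetical sample, chosen so that one value is an obvious outlier
sample = pd.Series([2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 9.0])

q1 = sample.quantile(0.25)   # lower edge of the box
q3 = sample.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                # height of the box

# Points outside these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(q1, q3, iqr)
print(outliers.tolist())
```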

In [None]:
# heatmaps
corrmat = iris.corr(numeric_only=True) # make correlation matrix

sns.heatmap(corrmat, annot = True) 
plt.title('Correlation Coefficients')
plt.show()

### ✏️ Check-in 1 — Seaborn Review

Using the `iris` dataset (already loaded above):

1. Create a **boxplot** of `petal_width` broken down by `species`.
2. Compute the correlation matrix for `iris` and display it as a **heatmap** with annotations.
3. From the heatmap, which pair of features has the strongest positive correlation?

In [None]:
# 1. Boxplot: petal_width by species


In [None]:
# 2. Correlation heatmap


**3.** *(Write your answer here — double-click to edit)*

#### Answer

In [None]:
sns.boxplot(data=iris, x='species', y='petal_width')
plt.show()

In [None]:
corrmat = iris.corr(numeric_only=True)
sns.heatmap(corrmat, annot=True)
plt.title('Iris Correlation Matrix')
plt.show()

`petal_length` and `petal_width` have the strongest positive correlation (~0.96).
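
The numbers in the heatmap are Pearson correlation coefficients, which is what `DataFrame.corr()` computes by default. As a sketch of what one cell means, the following reproduces a coefficient by hand on a small made-up frame (the column names `a` and `b` and their values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data: b rises with a, but not perfectly
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "b": [2.0, 4.1, 5.9, 8.2, 9.8]})

# Pearson r = sum of centered products / sqrt(product of centered sums of squares)
a_centered = toy["a"] - toy["a"].mean()
b_centered = toy["b"] - toy["b"].mean()
r_manual = (a_centered * b_centered).sum() / np.sqrt(
    (a_centered ** 2).sum() * (b_centered ** 2).sum()
)

# The off-diagonal entry of the correlation matrix holds the same quantity
r_pandas = toy.corr().loc["a", "b"]
print(round(r_manual, 4), round(r_pandas, 4))
```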

---
## 3. Load & Understand the Data

Before running any code, read the column descriptions. Understanding what each feature *means* shapes every downstream decision.

### 3a. Import the data

In [None]:
data = pd.read_csv('BostonHousingData.csv')
data

### 3b. Feature descriptions
Column details are documented [here](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html).

| Column | Description |
|---|---|
| CRIM | Per-capita crime rate by town |
| ZN | Proportion of residential land zoned for lots > 25,000 sq ft |
| INDUS | Proportion of non-retail business acres per town |
| CHAS | Charles River dummy variable (1 if tract bounds river, 0 otherwise) |
| NOX | Nitric oxides concentration (parts per 10 million) |
| RM | Average number of rooms per dwelling |
| AGE | Proportion of owner-occupied units built before 1940 |
| DIS | Weighted distance to five Boston employment centers |
| RAD | Index of accessibility to radial highways |
| TAX | Full-value property tax rate per $10,000 |
| PTRATIO | Pupil-teacher ratio by town |
| B | `1000(Bk - 0.63)²` where Bk is the proportion of Black residents |
| LSTAT | % of lower-status population |
| MEDV | **Median home value** in $1,000s *(target variable)* |
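
The `B` formula is easier to read with numbers plugged in. A minimal sketch, assuming a few made-up values of `Bk` (the helper name `b_feature` is hypothetical):

```python
def b_feature(bk):
    """Evaluate 1000 * (Bk - 0.63)^2 for a proportion Bk between 0 and 1."""
    return 1000 * (bk - 0.63) ** 2

# Illustrative proportions, not values from the dataset
for bk in [0.0, 0.50, 0.63, 0.76, 1.0]:
    print(f"Bk = {bk:.2f} -> B = {b_feature(bk):.1f}")
```

Because the expression is a parabola centered at 0.63, two different proportions (here 0.50 and 0.76) map to the same `B`, so `Bk` cannot always be recovered uniquely from the column.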


### 3c. Discussion question
Looking at the feature descriptions above: which features do you expect to correlate most strongly with median home value (`MEDV`)? Any features that raise ethical concerns?

---
## 4. EDA — Descriptive Statistics

In [None]:
# General data info
data.info()

In [None]:
# statistical description of the data
data.describe()

In [None]:
# look at mean of each attribute 
data.describe().loc['mean']

### ✏️ Check-in 2 — Descriptive Statistics

Using the `data` (Boston Housing) DataFrame:

1. How many rows and columns does `data` have?
2. What are the **minimum** and **maximum** values of `MEDV` (median home value)?
3. Which column has the highest mean value according to `.describe()`?

In [None]:
# 1. Shape of the dataset


In [None]:
# 2. Min and max of MEDV


In [None]:
# 3. Column with highest mean


#### Answer

In [None]:
data.shape

In [None]:
print('Min MEDV:', data.MEDV.min())
print('Max MEDV:', data.MEDV.max())

In [None]:
data.describe().loc['mean'].idxmax()
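
`describe()` computes every summary statistic before one row is picked out. When only the means are needed, `DataFrame.mean()` is a slightly more direct route to the same answer. A small sketch on a made-up frame (the frame `toy` and its values are illustrative; the column names are borrowed from the feature table):

```python
import pandas as pd

# Hypothetical three-row frame standing in for the housing data
toy = pd.DataFrame({"RM": [6.0, 6.5, 7.0],
                    "TAX": [300.0, 400.0, 500.0],
                    "MEDV": [20.0, 25.0, 30.0]})

via_describe = toy.describe().loc["mean"].idxmax()
via_mean = toy.mean(numeric_only=True).idxmax()

print(via_describe, via_mean)  # both name the column with the largest mean
```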

---
## 5. EDA — Data Preparation

In [None]:
# Search for null values
data.isna().sum()

In [None]:
# check for duplicate entries
data.duplicated().sum()
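
If either check returns a nonzero count, those rows need handling before any plotting. What to do depends on the data; the sketch below shows two common steps on a made-up frame (the frame `toy`, its values, and the choice of median imputation are assumptions for illustration, not steps this notebook requires):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value and one duplicated row
toy = pd.DataFrame({"RM": [6.0, 6.5, np.nan, 6.5],
                    "MEDV": [20.0, 25.0, 30.0, 25.0]})

print(toy.isna().sum())        # RM has 1 missing value
print(toy.duplicated().sum())  # 1 row repeats an earlier row

# Step 1: drop exact duplicate rows, keeping the first occurrence
deduped = toy.drop_duplicates()

# Step 2: fill remaining missing values with each column's median
filled = deduped.fillna(deduped.median(numeric_only=True))

print(filled)
```

Dropping rows with `dropna()` is the simpler alternative when only a handful of values are missing.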

---
## 6. EDA — Visualization

In [None]:
# A pairplot of every column is one starting point, but with 14 columns it
# draws a 14 x 14 grid: slow to render and hard to read.

# sns.pairplot(data)

### Reduce number of plots by making scatterplots

In [None]:
rows = 2
cols = 7
fig, ax = plt.subplots(rows, cols, figsize=(16, 4))

# plot price (MEDV) as the dependent variable (y-axis) against each column in turn
for axis, column in zip(ax.flat, data.columns):
    sns.scatterplot(data=data, x=column, y='MEDV', ax=axis)

plt.tight_layout()
plt.show()


### ✏️ Check-in 3 — Visualization & Correlation

1. From the scatterplots above, which two features appear to have the **strongest** relationship with `MEDV`? (Just eyeballing is fine!)
2. Use the correlation matrix to confirm: filter `corrmat.MEDV` to show only features with `abs(correlation) > 0.6`.
3. Make a scatter plot of the feature most correlated with `MEDV` vs `MEDV` itself.

**1.** *(Write your answer here — double-click to edit)*

In [None]:
# 2. Features with |correlation| > 0.6 with MEDV


In [None]:
# 3. Scatter of top correlated feature vs MEDV


#### Hint

You will need to run the correlation matrix cell first (`corrmat = data.corr(numeric_only=True)`).

Filter with: `corrmat.MEDV[abs(corrmat.MEDV) > 0.6]`

#### Answer

In [None]:
corrmat = data.corr(numeric_only=True)
corrmat.MEDV[abs(corrmat.MEDV) > 0.6]

In [None]:
# LSTAT and RM are typically the strongest
sns.scatterplot(data=data, x='LSTAT', y='MEDV')
plt.title('LSTAT vs Median Home Value')
plt.show()

---
## 7. Correlation & Feature Selection

In [None]:
# correlation matrix
corrmat = data.corr(numeric_only=True) 
corrmat  

In [None]:
# plot as a heat map to visualize the information in the matrix
plt.figure(figsize = (9, 9)) 
sns.heatmap(corrmat, annot = True) 
plt.show()

### Heatmap and Pair Plot of Correlated Data

In [None]:
# Which variables are strongly correlated (|r| > 0.5) with price?
abs(corrmat.MEDV) > 0.5

In [None]:
corrmat.index[abs(corrmat.MEDV) > 0.5]

In [None]:
correlated_data = data[corrmat.index[abs(corrmat.MEDV) > 0.5]]
correlated_data.head()
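
The boolean mask above also selects `MEDV` itself, since every column correlates perfectly with itself. One possible way to rank only the predictors, sketched on a made-up correlation column (the Series `medv_corr` and its values are illustrative, not computed from the dataset):

```python
import pandas as pd

# Hypothetical correlations with the target, standing in for corrmat.MEDV
medv_corr = pd.Series({"RM": 0.70, "LSTAT": -0.74, "CHAS": 0.18, "MEDV": 1.00})

# Drop the target, then order the rest by strength regardless of sign
predictors = medv_corr.drop("MEDV")
order = predictors.abs().sort_values(ascending=False).index
ranked = predictors.reindex(order)

# Keep only the predictors above the threshold
strong = ranked[ranked.abs() > 0.5]
print(strong)
```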

In [None]:
# pairplot of the columns that are highly correlated with the price
sns.pairplot(correlated_data)
plt.tight_layout()
plt.show()

### Distributions of variables and boxplots

In [None]:
# distribution of each column, all drawn on one shared axis
sns.boxplot(data=data)
plt.show()
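
Columns on very different scales (such as `TAX` and `NOX`) share one axis in this plot, so the widest-ranging columns flatten the rest. One option is to standardize each column first. A minimal sketch on a made-up frame (the frame `toy`, its values, and the z-score choice are assumptions for illustration):

```python
import pandas as pd

# Hypothetical columns on very different scales
toy = pd.DataFrame({"NOX": [0.4, 0.5, 0.6, 0.7],
                    "TAX": [200.0, 300.0, 400.0, 700.0]})

# z-score: subtract each column's mean, divide by its standard deviation
standardized = (toy - toy.mean()) / toy.std()

print(standardized.round(2))
# sns.boxplot(data=standardized) would then draw every box on a comparable scale
```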

In [None]:
# plot price vs CHAS
sns.boxplot(data=data, x='CHAS', y='MEDV')
plt.show()
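
The medians and quartiles drawn in a grouped boxplot can also be read off as numbers with `groupby`. A sketch on a made-up frame (the frame `toy` and its values are illustrative):

```python
import pandas as pd

# Hypothetical tracts: CHAS is 1 if the tract bounds the river, else 0
toy = pd.DataFrame({"CHAS": [0, 0, 0, 1, 1],
                    "MEDV": [18.0, 22.0, 26.0, 24.0, 32.0]})

# One row of summary statistics per CHAS group
summary = toy.groupby("CHAS")["MEDV"].describe()
print(summary[["count", "50%", "mean"]])
```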