# Boston Housing Data Analysis

**Author:** \
**Email:** \
**Points Available:**   


---

## 🧠 Overview

This lab guides you through a critical exploration of housing data using statistical modeling and machine learning. Drawing on real-world data from the **Boston Housing dataset**, this project walks you through how to:

- Read and preprocess structured real estate data  
- Visualize key features such as room count, location, and pollution levels  
- Build and evaluate predictive models  
- Reflect on what factors most significantly impact housing prices  

The goal is not only to build a predictive model, but to **understand what the model tells us, and fails to tell us, about urban inequality**.


---

## 🎯 You Will:

- Load and clean the Boston Housing dataset using `pandas`  
- Visualize relationships between features using `matplotlib` and `seaborn`  
- Fit and evaluate models (e.g., Linear Regression)  
- Interpret results critically in the context of urban planning and social equity  

---

## ❓ Guiding Question

**What do statistical models teach us about urban inequality, and what do they leave out?**  
This lab encourages you to consider the limits of statistical inference in capturing cultural, emotional, and policy-driven aspects of urban geography.


---

## 🔄 Step-by-Step Instructions

### 1. Load and Explore Data

The dataset contains **506 observations** with **14 attributes**, including:

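- `CRIM`: Per capita crime rate by town  
- `RM`: Average number of rooms per dwelling  
- `DIS`: Weighted distance to five Boston employment centers  
- `RAD`: Index of accessibility to radial highways  
- `LSTAT`: Percentage of the population classified as lower status  
- `MEDV`: Median value of owner-occupied homes, in $1000s (target)  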
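
A minimal loading sketch, assuming the data is available as a CSV file with a header row. The function names `load_housing` and `summarize` and the `LAB_COLUMNS` list are illustrative, not part of any library, and the column check covers only the attributes this lab names.

```python
import pandas as pd

# Columns this lab refers to by name; other copies of the data may label them differently.
LAB_COLUMNS = ["CRIM", "RM", "DIS", "RAD", "LSTAT", "MEDV"]


def load_housing(source):
    """Read the dataset from a CSV path or file-like object that has a header row."""
    df = pd.read_csv(source)
    # Normalize header spelling so "medv" and " MEDV" both become "MEDV".
    df.columns = [str(c).strip().upper() for c in df.columns]
    missing = [c for c in LAB_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"expected columns not found: {missing}")
    return df


def summarize(df):
    """Return per-column summary statistics plus a count of missing values."""
    summary = df.describe().T
    summary["n_missing"] = df.isna().sum()
    return summary
```

In a notebook this might be called as `df = load_housing("boston.csv")`, where the file name is a placeholder for wherever your copy of the data lives. If the file matches the description above, `df.shape` should report 506 rows and 14 columns.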

---

### 2. Clean and Preprocess

Ensure the dataset is:

- Free of missing values  
- Appropriately typed and scaled  
- Checked for skewness, outliers, and variable correlation  
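
The checks above can be sketched as a small report. This is one possible approach, not a required method: the 1.5 × IQR rule for flagging outliers and the 0.8 correlation threshold are common conventions chosen here as assumptions, and the helper names are illustrative.

```python
import pandas as pd


def quality_report(df):
    """Per-column missing count, skewness, and IQR-based outlier count."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # A value counts as an outlier if it lies more than 1.5 * IQR beyond a quartile.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return pd.DataFrame({
        "n_missing": numeric.isna().sum(),
        "skew": numeric.skew(),
        "n_outliers": outside.sum(),
    })


def correlated_pairs(df, threshold=0.8):
    """List column pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs
```

A flagged value is not automatically an error. Whether to drop, cap, or keep it is a judgment call worth documenting in your notebook.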


---

### 3. Visualize

Use:

- **Correlation heatmaps** to explore the strength of relationships between variables  
- **Scatter plots** like `RM vs MEDV` and `LSTAT vs MEDV`  
- **Distribution plots** to explore data spread and outliers  

---

### 4. Model and Predict

Fit regression models and evaluate with:

- **Mean Squared Error (MSE)**  
- **Cross-validation scores**  

Discuss:

- Which features are the strongest predictors?  
- Where does the model perform poorly, and why?
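
One way to produce both metrics is sketched below with scikit-learn. It assumes every column other than the target is a numeric feature. Standardizing the features first is a choice made here so that coefficient magnitudes are roughly comparable; the function name, split size, and seed are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_evaluate(df, target="MEDV", test_size=0.2, cv=5, seed=0):
    """Fit a standardized linear regression and report held-out and cross-validated MSE."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )

    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(X_test))

    # scikit-learn reports negated MSE so that higher is better; flip the sign back.
    cv_mse = -cross_val_score(
        model, X_train, y_train, cv=cv, scoring="neg_mean_squared_error"
    )

    # Coefficients on standardized features, ordered by absolute size.
    coefs = pd.Series(model.named_steps["linearregression"].coef_, index=X.columns)
    coefs = coefs.sort_values(key=abs, ascending=False)

    return {
        "test_mse": test_mse,
        "cv_mse_mean": cv_mse.mean(),
        "cv_mse_std": cv_mse.std(),
        "coefficients": coefs,
    }
```

Large standardized coefficients are a starting point for the "strongest predictors" discussion, not a final answer: correlated features can share or trade off weight, so compare the ranking against your correlation heatmap.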


---

### 5. Reflect

Answer the following:

- What trends were accurately captured?  
- Were there features the model over- or under-emphasized?  
- Did any results surprise you?  
- What structural inequalities are suggested by the data?

---

## 🌍 Optional Extension: Storymap Concept

Inspired by **LLM Place** and **data-classification**, consider creating a **counter-map** or interactive story visualization:

- Map areas with low model accuracy  
- Layer in redlining, eviction, or pollution data  
- Ask: *Where is the model "blind"?*
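
Before any mapping, you need a per-observation measure of where the model does badly. The sketch below uses out-of-fold predictions from scikit-learn's `cross_val_predict` to rank rows by absolute error. The function name and the added `PREDICTED` and `ABS_ERROR` columns are illustrative. Placing these rows on a map would additionally require a geographic key (for example, tract or town identifiers), which the attributes listed in this lab may not provide; treat that join as an assumption to verify against your data source.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict


def worst_predictions(df, target="MEDV", k=10, cv=5):
    """Return the k rows with the largest out-of-fold absolute prediction error."""
    X = df.drop(columns=[target])
    y = df[target]
    # Each row is predicted by a model that did not see that row during fitting.
    predicted = cross_val_predict(LinearRegression(), X, y, cv=cv)
    scored = df.assign(PREDICTED=predicted, ABS_ERROR=(y - predicted).abs())
    return scored.sort_values("ABS_ERROR", ascending=False).head(k)
```

Looking at what the worst-predicted rows have in common is one concrete way to approach the question of where the model is "blind".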


---

## 📦 Deliverables

Submit:

1. ✅ A cleaned and annotated `.ipynb` notebook  
2. 📊 At least two key visualizations (scatter plot, heatmap, etc.)  
3. ✍ A 300–500 word written reflection that addresses:

   - What were the top 2–3 price predictors?  
   - Were there surprising findings?  
   - What broader urban or societal implications can we draw?

---

## 📝 Submission Notes

- Submit everything on **Canvas** unless stated otherwise.  
- A **10% penalty** will be applied for each day a submission is late.  
- Extensions are available for documented reasons (medical, academic, religious, etc.). Email your instructor **before the deadline** if possible.  
- **No make-ups** will be offered except under exceptional, approved circumstances.  
