# Assignment 4: House Prices - Data Exploration and Visualization 🏠💰

## 📚 Learning Objectives
- Load and understand real-world datasets.
- Perform **Exploratory Data Analysis (EDA)**.
- Visualize data distributions and correlations.

## Section 1: Getting the Data (10 marks)

### Q1 (10 marks)
Import `fetch_openml` from `sklearn.datasets`. Fetch the `house_prices` dataset. Print the shape of the data and display the first 5 rows.

In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Fetch the "house_prices" dataset from OpenML by name.
# Assumption: this name resolves to the Ames Housing data (OpenML ID 42165);
# pass data_id=42165 instead of name= to pin that exact dataset.
housing = fetch_openml(name="house_prices", as_frame=True, parser='auto')

df = housing.frame
print("Shape of dataset:", df.shape)
df.head()

## Section 2: Understanding the Problem (10 marks)

### Q2.1
Write code to determine:
1. The number of samples and features.
2. The data types of each column.
3. Which columns contain missing values.

In [None]:
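# df includes the target column, so df.shape[1] counts it alongside the features.
print(f"1. Samples: {df.shape[0]}, Columns (features + target): {df.shape[1]}")

print("\n2. Data Types:")
print(df.dtypes.to_string())     # dtype of every column, without truncation
print(df.dtypes.value_counts())  # how many columns share each dtype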

print("\n3. Columns with Missing Values:")
missing_cols = df.columns[df.isnull().any()].tolist()
print(missing_cols)
print(f"Total columns with missing values: {len(missing_cols)}")
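Listing the column names shows which columns have gaps but not how severe they are. A minimal sketch of a per-column summary follows, using a small made-up frame (`toy` and its values are hypothetical stand-ins, not taken from the dataset) so that it runs without downloading anything:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [None, None, "Pave", None],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column and keep only columns that have any.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Express the counts as a percentage of rows, sorted worst-first.
missing_summary = pd.DataFrame({
    "n_missing": missing_counts,
    "pct_missing": (missing_counts / len(toy) * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

The same three steps should work on `df` by replacing `toy` with `df`. Columns that are mostly empty are often candidates for dropping rather than imputing.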

### Q2.2
**Question:** What is the machine learning problem type (classification, regression, etc.)?

**Answer:**
This is a **Regression** problem. We are predicting a continuous numerical value (the price of a house), not a category.

### Q2.3
**Question:** Identify the target variable in this dataset.

**Answer:**
The target variable is **SalePrice**, the sale price of each house. This can be confirmed by checking `housing.target_names`, or by verifying that `SalePrice` appears in `df.columns`.

### Q2.4
**Question:** Explain the difference between Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). When should each be preferred?

**Answer:**
- **MAE**: The average of the absolute errors. Each error contributes in proportion to its size, so MAE is less sensitive to outliers and is easy to interpret (it is in the same units as the target). Preferred when the data contains outliers that should not dominate the metric.
- **RMSE**: The square root of the average of the squared errors. Squaring gives large errors disproportionate weight, so RMSE is more sensitive to outliers. Preferred when large errors are particularly undesirable.
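The difference is easiest to see on a small example. The sketch below uses made-up prices and two hypothetical sets of predictions: one with the error spread evenly and one with the same total absolute error concentrated in a single prediction. The helper names `mae` and `rmse` are illustrative, not part of the assignment.

```python
import numpy as np

# Hypothetical true prices (in $1000s) and two sets of predictions.
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_pred_even = y_true + np.array([10.0, -10.0, 10.0, -10.0])  # every error is 10
y_pred_outlier = y_true + np.array([0.0, 0.0, 0.0, 40.0])    # one error of 40

def mae(y_true, y_pred):
    # Mean Absolute Error: mean(|y - y_hat|)
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root Mean Squared Error: sqrt(mean((y - y_hat)^2))
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print("Even errors:   MAE =", mae(y_true, y_pred_even),
      "RMSE =", rmse(y_true, y_pred_even))
print("One big error: MAE =", mae(y_true, y_pred_outlier),
      "RMSE =", rmse(y_true, y_pred_outlier))
```

Both prediction sets have an MAE of 10, but the RMSE rises from 10 to 20 when the error is concentrated in one prediction, which is the sensitivity to large errors described above.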

## Section 3: Explore and Visualize the Data (20 marks)

### Q3.1
Import `matplotlib` and `seaborn`. Display summary statistics for the numerical columns using the `describe()` method.

In [None]:
# matplotlib and seaborn were already imported in the Q1 cell
df.describe()

### Q3.2
Display a grid of histograms for the first six numerical columns in the dataset.

In [None]:
num_cols = df.select_dtypes(include=[np.number]).columns[:6]
df[num_cols].hist(bins=20, figsize=(12, 8), layout=(2, 3), color='skyblue', edgecolor='black')
plt.suptitle('Histograms of First 6 Numerical Columns', fontsize=16)
plt.show()

### Q3.3
Compute the correlation matrix of the dataset and visualize it using a heatmap. Use the `coolwarm` colormap and set the figure size to 10x8.

In [None]:
plt.figure(figsize=(10, 8))  # 10x8, as the question specifies
# Select only numerical columns for correlation
corr_matrix = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr_matrix, cmap='coolwarm', annot=False)
plt.title('Correlation Heatmap', fontsize=16)
plt.show()

### Q3.4
Display the top 10 features that are most highly correlated with the target variable.

In [None]:
target = 'SalePrice'
if target in corr_matrix.columns:
    # Drop the target's own entry (correlation 1.0) so exactly 10 features remain
    top_corr = corr_matrix[target].drop(target).sort_values(ascending=False).head(10)
    print(top_corr)
else:
    print(f"Target '{target}' not found in correlation matrix.")
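Sorting in descending order surfaces only positive correlations, so a feature with a strong negative correlation ends up at the bottom of the list. If "most highly correlated" is read as strength regardless of sign, one option is to rank by absolute value. The following is a sketch on synthetic data; the column names and the random data are hypothetical and chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

# Synthetic data: one strongly positive, one strongly negative, one unrelated feature.
rng = np.random.default_rng(0)
n = 200
price = rng.normal(size=n)  # hypothetical target
toy = pd.DataFrame({
    "price": price,
    "pos_feature": price + rng.normal(scale=0.1, size=n),
    "neg_feature": -price + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
})

# Correlation of each feature with the target, excluding the target itself.
corr_with_target = toy.corr()["price"].drop("price")

# Plain descending sort: the negative feature lands last.
signed_rank = corr_with_target.sort_values(ascending=False)

# Sort by absolute value instead, keeping the original signs in the output.
abs_rank = corr_with_target.sort_values(key=lambda s: s.abs(), ascending=False)

print(signed_rank)
print(abs_rank)
```

In the house prices data the strongest correlations with `SalePrice` are positive, so both rankings are likely to agree near the top. The absolute-value version matters more for datasets where some features lower the target.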

### Q3.4.1
**Question:** What are the top 4 features highly correlated with the target?

**Answer:**
Based on the output above, the top 4 features (excluding SalePrice itself) are typically:
1. **OverallQual** (Overall material and finish quality)
2. **GrLivArea** (Above grade (ground) living area square feet)
3. **GarageCars** (Size of garage in car capacity)
4. **GarageArea** (Size of garage in square feet)

### Q3.5
Create a scatter plot between the target variable and the second most highly correlated feature. Set `alpha=0.5` and include a title and axis labels.

In [None]:
# Second most correlated is usually GrLivArea
feature = 'GrLivArea'

plt.figure(figsize=(10, 6))
sns.scatterplot(x=feature, y=target, data=df, alpha=0.5, color='purple')
plt.title(f'{feature} vs. {target}', fontsize=16)
plt.xlabel(f'{feature} (sq ft)')
plt.ylabel(f'{target} ($)')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()