# Exploratory Data Analysis (EDA) of the Ames Housing Dataset

## Introduction

The Ames Housing Dataset is a comprehensive record of residential property sales in Ames, Iowa. Compiled by **Dean De Cock**, it includes 2,930 observations with 80 features describing various aspects of residential homes. 

In this notebook, we will perform EDA to understand the data, identify patterns, detect anomalies, and extract insights that could be useful for predictive modeling.

## Objectives

- Understand the structure and content of the dataset.
- Handle missing values appropriately.
- Explore distributions of individual variables.
- Examine relationships between features and the target variable (`SalePrice`).
- Identify key features that influence house prices.

## Dataset Description

Some of the key variables in the dataset include:

- **SalePrice**: The property's sale price in dollars.
- **LotArea**: Lot size in square feet.
- **OverallQual**: Overall material and finish quality (scale from 1 to 10).
- **YearBuilt**: Original construction date.
- **TotalBsmtSF**: Total square feet of the basement area.
- **GrLivArea**: Above-grade (ground) living area square feet.
- **FullBath**: Full bathrooms above grade.
- **GarageCars**: Size of garage in car capacity.
- **GarageArea**: Size of garage in square feet.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('ames.csv')

## Exploring the Dataset

### First Five Rows

Let's take a look at the first five rows to get an initial understanding of the data.


In [None]:
# Displaying the first five rows
data.head()

In [None]:
# Getting information about the dataset
data.info()

In [None]:
# Check column names: list any that contain spaces, dots, or other special characters
data.columns[data.columns.str.contains(r'[^0-9A-Za-z_]')]

In [None]:
data.rename(columns={'YearRemod.Add':'YearRemodAdd'}, inplace=True)

In [None]:
# Statistical summary
data.describe()

### Checking for Missing Values

Identify the number of missing values in each column.

In [None]:
# Checking for missing values
missing_values = data.isnull().sum().sort_values(ascending=False)
missing_values = missing_values[missing_values > 0]
missing_values

Several features have missing values. For this EDA, we handle missing values only for the most important features, in the Missing Value Treatment section further down.
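
Raw counts are easier to interpret alongside the share of rows they represent. The following is a minimal sketch of that idea on a small made-up frame; the `toy` data and the `missing_summary` helper are hypothetical names introduced for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'n_missing': counts,
        'pct_missing': 100 * counts / len(df),
    })
    # Keep only columns with at least one missing value, worst first
    summary = summary[summary['n_missing'] > 0]
    return summary.sort_values('n_missing', ascending=False)

# Toy frame for illustration only (not Ames data)
toy = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan],
    'B': ['x', 'y', None, 'z'],
    'C': [1, 2, 3, 4],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

The same helper could be called on `data` to see which columns are mostly empty and which are missing only a handful of values.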

## Target Variable Analysis

### Distribution of SalePrice

Let's explore the distribution of the target variable `SalePrice`.


In [None]:
# Plotting the distribution of SalePrice
plt.figure(figsize=(10,6))
sns.histplot(data['SalePrice'], kde=True)
plt.title('Distribution of SalePrice')
plt.xlabel('SalePrice')
plt.ylabel('Frequency')
plt.show()

# Calculating skewness and kurtosis
print("Skewness: %f" % data['SalePrice'].skew())
print("Kurtosis: %f" % data['SalePrice'].kurt())


The distribution of `SalePrice` is right-skewed, with skewness greater than 1. A log transformation reduces this skew and brings the distribution closer to normal, which often helps in modeling.


In [None]:
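# Log-transform the target to reduce right skew
data['LogSalePrice'] = np.log(data['SalePrice'])

# histplot draws on the current figure, so the figsize set here takes effect
# (displot is figure-level and would create a separate figure of its own)
plt.figure(figsize=(8,6))
sns.histplot(data['LogSalePrice'], bins=40)
plt.xlabel('Log(SalePrice)')
plt.ylabel('Frequency')
plt.title('Histogram of Log(SalePrice)')
plt.show()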
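
To see why a log transformation helps with right skew, here is a minimal sketch on synthetic data. The log-normal sample below is an assumption chosen only because it is right-skewed like a price variable; it is not drawn from the Ames data, and the parameter values are arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal), standing in for a price-like variable
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.5, size=5000))

# Skewness before and after taking logs
skew_raw = prices.skew()
skew_log = np.log(prices).skew()

print(f"Skewness before log: {skew_raw:.3f}")
print(f"Skewness after log:  {skew_log:.3f}")
```

Running the same two `skew()` calls on `data['SalePrice']` and `data['LogSalePrice']` would show how much the transformation changes the real target.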


## Univariate Analysis

### Numerical Features

Let's explore some key numerical features.

In [None]:
# List of numerical features to analyze
numeric_features = ['LotArea', 'GrLivArea', 'TotalBsmtSF', 'GarageArea']

# Plotting histograms
data[numeric_features].hist(figsize=(12,8), bins=30, edgecolor='black')
plt.tight_layout()

In [None]:
# Box plots for numerical features
plt.figure(figsize=(12,8))
for i, feature in enumerate(numeric_features):
    plt.subplot(2, 2, i+1)
    sns.boxplot(y=data[feature])
    plt.title(feature)
plt.tight_layout()

We observe that features like `LotArea` and `GrLivArea` have outliers, which might affect our analysis.
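
The points drawn beyond the whiskers of a box plot follow, by default, the 1.5 × IQR rule. As a rough sketch of how such points could be counted, the snippet below applies that rule to a made-up series; `iqr_outlier_mask` and `toy_area` are hypothetical names used for illustration.

```python
import pandas as pd

def iqr_outlier_mask(s, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy values for illustration only (not Ames data)
toy_area = pd.Series([900, 1000, 1100, 1200, 1300, 1400, 1500, 6000])
mask = iqr_outlier_mask(toy_area)

# Values flagged as outliers by the rule
print(toy_area[mask])
```

Applied to a column such as `data['LotArea']`, `mask.sum()` would give the number of flagged observations. Being flagged by this rule is only a prompt to inspect a point, not a reason on its own to remove it.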

## Bivariate Analysis

### Relationship Between SalePrice and Numerical Features

Let's explore how each numerical feature relates to `SalePrice`.


In [None]:
# Scatter plots
plt.figure(figsize=(12,8))
for i, feature in enumerate(numeric_features):
    plt.subplot(2, 2, i+1)
    sns.scatterplot(x=data[feature], y=data['SalePrice'])
    plt.title(f'SalePrice vs {feature}')
plt.tight_layout()

In [None]:
# Correlation matrix
corr_matrix = data.corr(numeric_only = True)

# Correlation with SalePrice
corr_with_saleprice = corr_matrix['SalePrice'].sort_values(ascending=False)
corr_with_saleprice


### Top Correlated Features with SalePrice

Identify the features that have strong positive or negative correlation with `SalePrice`.


In [None]:
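# Top 10 features by absolute correlation with SalePrice,
# excluding the target itself and its log transform
candidates = corr_with_saleprice.drop(['SalePrice', 'LogSalePrice'], errors='ignore')
top_features = candidates.abs().sort_values(ascending=False).index[:10]
top_features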


In [None]:
# Heatmap of top features
plt.figure(figsize=(10,8))
sns.heatmap(data[top_features].corr(), annot=True, cmap='coolwarm')

### Relationship Between SalePrice and Categorical Features

In [None]:
# List of categorical features to analyze
categorical_features = ['OverallQual', 'Neighborhood', 'GarageCars', 'FullBath', 'KitchenQual']

# Box plots of SalePrice vs categorical features
for feature in categorical_features:
    plt.figure(figsize=(12,6))
    sns.boxplot(x=feature, y='SalePrice', data=data)
    plt.title(f'SalePrice vs {feature}')
    plt.xticks(rotation=45)
    plt.show()


#### Observations:

- **OverallQual**: There is a clear increasing trend of `SalePrice` with higher quality ratings.
- **Neighborhood**: Certain neighborhoods have higher median house prices.
- **GarageCars**: Houses with more garage spaces tend to have higher `SalePrice`.
- **FullBath**: Houses with more full bathrooms generally have higher `SalePrice`.
- **KitchenQual**: Better kitchen quality is associated with higher `SalePrice`.
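
The trends read off the box plots can also be checked numerically with a grouped summary. This is a minimal sketch on made-up values; the `toy` frame is hypothetical and only mimics the column names used above.

```python
import pandas as pd

# Toy frame for illustration only (not Ames data)
toy = pd.DataFrame({
    'OverallQual': [5, 5, 6, 6, 7, 7],
    'SalePrice': [120000, 140000, 150000, 170000, 200000, 240000],
})

# Median SalePrice and number of houses per quality level, lowest median first
by_qual = (toy.groupby('OverallQual')['SalePrice']
              .agg(['median', 'count'])
              .sort_values('median'))
print(by_qual)
```

Including the count alongside the median helps avoid over-reading levels that contain only a few houses.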

## Missing Value Treatment

Let's handle missing values for features that are important.

In [None]:
# Filling missing numerical values with the median
numeric_cols = ['GarageCars', 'GarageArea', 'BsmtHalfBath', 'BsmtFullBath']

for var in numeric_cols:
    data[var] = data[var].fillna(data[var].median())


# Filling missing categorical values with the mode
# (applied to a copy, so `data` itself is not modified)
data1 = data.copy()
data1['KitchenQual'] = data1['KitchenQual'].fillna(data1['KitchenQual'].mode()[0])
data1

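Alternatively, we can treat missingness as its own category: each missing entry is replaced with a new level that marks the value as absent for that observation. This level is then handled like any other when the variable is dummy encoded.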

In [None]:
# Categorical features: replace missing entries with an explicit 'NaN' level
# (a string label, not a true missing value)
categorical_cols = ['MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                    'BsmtFinType2', 'Electrical', 'GarageFinish', 'GarageQual', 'GarageCond']

for var in categorical_cols:
    print(var, data[var].dtype)
    data[var] = data[var].fillna('NaN')

data[categorical_cols] = data[categorical_cols].astype('category')
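
The dummy-encoding step mentioned above is not shown in this notebook. A minimal sketch of how it might look, using `pd.get_dummies` on a made-up column, follows. The `toy` frame and the `'Missing'` label are assumptions for illustration; the notebook itself uses the string `'NaN'` as the label.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (not Ames data)
toy = pd.DataFrame({'BsmtQual': ['Gd', np.nan, 'TA', 'Gd']})

# Option A: fill with an explicit label, then dummy-encode
filled = toy['BsmtQual'].fillna('Missing')
dummies_a = pd.get_dummies(filled, prefix='BsmtQual')

# Option B: let get_dummies add an indicator column for missing values
dummies_b = pd.get_dummies(toy['BsmtQual'], prefix='BsmtQual', dummy_na=True)

print(dummies_a.columns.tolist())
print(dummies_b.columns.tolist())
```

Both options end up with one indicator column per level plus one for missing entries. Option A makes the label explicit in the data, while Option B leaves the original column untouched.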

In [None]:
# Number of columns that still contain missing values
print(data.isnull().any().sum())

## Remove Outliers

In [None]:
# Plotting GrLivArea vs SalePrice before removing outliers
plt.figure(figsize=(12,6))
sns.scatterplot(x='GrLivArea', y='SalePrice', data=data)
plt.title('GrLivArea vs SalePrice Before Removing Outliers')
plt.show()

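# Removing outliers: keep only homes with GrLivArea below 4,500 sq ft
# (.copy() avoids SettingWithCopyWarning if columns are modified later)
data = data[data['GrLivArea'] < 4500].copy()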

# Plotting after removing outliers
plt.figure(figsize=(12,6))
sns.scatterplot(x='GrLivArea', y='SalePrice', data=data)
plt.title('GrLivArea vs SalePrice After Removing Outliers')
plt.show()


## Transform Data Format

In [None]:
# Box plots of LogSalePrice against each discrete count feature
count_features = ['BsmtFullBath', 'BsmtHalfBath', 'FullBath',
                  'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
                  'TotRmsAbvGrd', 'Fireplaces', 'GarageCars']

fig, axs = plt.subplots(3, 3, figsize=(15, 15))
for ax, feature in zip(axs.flat, count_features):
    sns.boxplot(data=data, x=feature, y='LogSalePrice', ax=ax)
plt.tight_layout()

**In-class activity:** Should we keep the above columns as integers or treat them as categorical variables? How can we convert them to categorical variables?