# Exploratory Data Analysis and Initial Cleaning for Raw House Data
This notebook performs Exploratory Data Analysis (EDA) and initial cleaning on the raw house data, with the goal of producing a cleaned dataset suitable for modeling. Each step is documented and justified for the modeling team.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the raw house data into a DataFrame
house_data = pd.read_csv('raw_house_data - raw_house_data.csv')
# Display the first few rows of the DataFrame
house_data.head()

## Initial Data Overview
The raw house data contains 16 columns; the first 5 rows are displayed above for initial inspection. The columns cover MLS number, sold price, zipcode, longitude, latitude, lot acres, taxes, year built, number of bedrooms and bathrooms, square footage, garage details, kitchen features, number of fireplaces, floor covering, and HOA fees. Some columns are stored with the `object` data type and may need to be converted to numerical types before modeling.
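As an illustration of the type conversion mentioned above, the following is a minimal sketch using `pd.to_numeric` with `errors="coerce"`. The helper `coerce_numeric_columns`, the toy column names, and the `'None'` placeholder string are assumptions made for the example; which columns of `house_data` actually need conversion should be confirmed with `house_data.dtypes`.

```python
import pandas as pd

def coerce_numeric_columns(df, columns):
    """Return a copy of df with the given columns converted to numeric.

    Values that cannot be parsed as numbers become NaN.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

# Toy frame standing in for house_data; names and values are hypothetical.
toy = pd.DataFrame({
    "bathrooms": ["2", "3.5", "None"],
    "square_feet": ["1500", "None", "2750"],
})
converted = coerce_numeric_columns(toy, ["bathrooms", "square_feet"])
print(converted.dtypes)
print(converted.isnull().sum())
```

Coercion turns unparseable entries into missing values, so the missing-value counts should be re-checked after any such conversion.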

In [None]:
# Summary statistics for the numerical columns
numerical_summary = house_data.describe()
# Checking for missing values in each column
missing_values = house_data.isnull().sum()
numerical_summary, missing_values

## Summary Statistics and Missing Values
The summary statistics for the numerical columns are displayed above. The dataset contains 5,000 rows. Some key observations are:

- The `lot_acres` column has 10 missing values.
- The `fireplaces` column has 25 missing values.
- The `year_built` column has some entries with a value of 0, which is likely incorrect and needs further investigation.
- The `taxes` column has a minimum value of 0, which also needs further investigation.

Next, we will visualize the data to better understand its distribution and to identify any outliers.
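One way to follow up on the zero values noted above is to count them per column before deciding how to treat them. The sketch below uses a hypothetical helper, `count_zero_values`, on a toy frame; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def count_zero_values(df, columns):
    """Return the number of entries equal to 0 in each listed column."""
    return (df[columns] == 0).sum()

# Toy frame standing in for house_data; values are illustrative only.
toy = pd.DataFrame({
    "year_built": [1995, 0, 2004, 0],
    "taxes": [1200.0, 0.0, 3400.5, 980.0],
})
zero_counts = count_zero_values(toy, ["year_built", "taxes"])
print(zero_counts)
```

On the real data, the equivalent call would be `count_zero_values(house_data, ['year_built', 'taxes'])`.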

In [None]:
# Visualizing the distribution of numerical columns
numerical_columns = ['sold_price', 'zipcode', 'longitude', 'latitude', 'lot_acres', 'taxes', 'year_built', 'bedrooms', 'fireplaces']
plt.figure(figsize=(20, 15))
for i, col in enumerate(numerical_columns, 1):
    plt.subplot(3, 3, i)
    sns.histplot(house_data[col], bins=30, kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

## Data Distribution
The histograms above show the distribution of various numerical columns. Key observations include:

- `sold_price`, `lot_acres`, and `taxes` are right-skewed, with long upper tails that suggest high-value outliers.
- `year_built` has some entries at 0, which are likely incorrect and need to be addressed.
- `bedrooms` and `fireplaces` are discrete counts with roughly unimodal distributions and a few extreme values.

Next, we will handle missing values and outliers.
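The visual impression of skew and outliers can be backed with numbers. The sketch below computes sample skewness with `Series.skew()` and flags values outside 1.5 times the interquartile range. The helper `iqr_outlier_mask`, the 1.5 multiplier (a common convention, not something this notebook prescribes), and the toy prices are all assumptions for illustration.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy prices standing in for house_data['sold_price']; values are illustrative only.
prices = pd.Series([100, 110, 120, 130, 140, 150, 1000], name="sold_price")
mask = iqr_outlier_mask(prices)
print("skewness:", prices.skew())  # positive for a right-skewed column
print(prices[mask])                # rows flagged as outliers
```

Flagging is kept separate from removal so that the modeling team can decide whether extreme but genuine sales should stay in the data.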

In [None]:
# Handling missing values
# Replace missing values in 'lot_acres' and 'fireplaces' with their respective medians.
# The result is assigned back to the column because calling fillna/replace with
# inplace=True on a selected column is deprecated in recent pandas versions.
house_data['lot_acres'] = house_data['lot_acres'].fillna(house_data['lot_acres'].median())
house_data['fireplaces'] = house_data['fireplaces'].fillna(house_data['fireplaces'].median())
# Handling invalid zero values
# Replace 'year_built' values of 0 with the median year
house_data['year_built'] = house_data['year_built'].replace(0, house_data['year_built'].median())
# Replace 'taxes' values of 0 with the median tax value
house_data['taxes'] = house_data['taxes'].replace(0, house_data['taxes'].median())
# Confirm that there are no more missing values
house_data.isnull().sum()

## Handling Missing Values and Outliers
Missing values and invalid zero entries have been addressed as follows:

- Missing values in `lot_acres` and `fireplaces` have been replaced with their respective medians.
- Values of 0 in `year_built` and `taxes`, which are treated as invalid entries, have been replaced with their respective medians.

After these steps, the dataset has no missing values.
The cleaned dataset can now be passed on for further analysis and modeling. A brief description of the cleaned dataset follows.
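A possible refinement of the zero replacement described above is to compute the median from the non-zero values only, so that the invalid entries do not pull the median down, and to record which rows were imputed. This is a sketch of that variant, not the method used in the cell above; the helper `replace_zeros_with_valid_median` and the `_was_zero` flag column are hypothetical names.

```python
import pandas as pd

def replace_zeros_with_valid_median(df, column):
    """Return a copy where zeros in `column` become the median of the non-zero values.

    A boolean `<column>_was_zero` column records which rows were imputed.
    """
    out = df.copy()
    is_zero = out[column] == 0
    valid_median = out.loc[~is_zero, column].median()
    out[f"{column}_was_zero"] = is_zero
    # mask() replaces entries where the condition is True
    out[column] = out[column].mask(is_zero, valid_median)
    return out

# Toy frame standing in for house_data; values are illustrative only.
toy = pd.DataFrame({"year_built": [1990, 0, 2000, 2010, 0]})
cleaned = replace_zeros_with_valid_median(toy, "year_built")
print(cleaned)
```

Keeping the flag column lets a model, or a later review, distinguish imputed values from observed ones.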

## Cleaned Dataset Description
The cleaned dataset contains 5,000 rows and 16 columns, capturing various features of houses: MLS number, sold price, zipcode, longitude, latitude, lot acres, taxes, year built, number of bedrooms and bathrooms, square footage, garage details, kitchen features, number of fireplaces, floor covering, and HOA fees. Missing values in `lot_acres` and `fireplaces` and zero values in `year_built` and `taxes` have been replaced with column medians. Columns stored as `object` types have not been converted, and the right-skewed columns (`sold_price`, `lot_acres`, `taxes`) have not been transformed or trimmed, so the modeling team may still need to handle them.
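Before handing the data over, the claims in this description can be checked programmatically. The sketch below is one possible form of such a check; `check_cleaned` and the toy frame are hypothetical, and the list of columns that must be free of zeros is an assumption based on the cleaning steps above.

```python
import pandas as pd

def check_cleaned(df, zero_free_columns):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    null_counts = df.isnull().sum()
    for col, n in null_counts[null_counts > 0].items():
        problems.append(f"{col}: {n} missing values")
    for col in zero_free_columns:
        n_zero = int((df[col] == 0).sum())
        if n_zero:
            problems.append(f"{col}: {n_zero} zero values")
    return problems

# Toy frame with one missing value and one zero left in on purpose.
toy = pd.DataFrame({
    "year_built": [1990, 2000, 0],
    "taxes": [1200.0, None, 950.0],
})
problems = check_cleaned(toy, ["year_built", "taxes"])
print(problems)
```

On the real data, `check_cleaned(house_data, ['year_built', 'taxes'])` would be expected to return an empty list after the cleaning cell has run.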