In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Import the raw house data into a DataFrame
house_data = pd.read_csv('raw_house_data - raw_house_data.csv')
house_data.head()

## Exploratory Data Analysis (EDA)

In [None]:
# Checking for missing values
missing_values = house_data.isnull().sum()
missing_values

### Missing Values Analysis
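`isnull()` reports no missing values in the dataset, which simplifies preprocessing. This check only counts true nulls: a placeholder stored as text (for example, a literal 'None' string that was not parsed as null on import) is not counted.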

In [None]:
# Checking data types
data_types = house_data.dtypes
data_types

### Data Types Analysis
Most columns have the expected data types (integer or float). However, the columns 'bathrooms', 'sqrt_ft', 'garage', and 'HOA' are of object type. They need to be converted to numerical types for modeling, so we first explore them to understand why they were read as objects.

In [None]:
# Exploring columns with object data types
object_columns = ['bathrooms', 'sqrt_ft', 'garage', 'HOA']
house_data[object_columns].sample(10)

### Object Columns Analysis
Upon sampling the data, it appears that the columns 'bathrooms', 'sqrt_ft', 'garage', and 'HOA' contain numerical values but are stored as object types. This could be due to the presence of special characters or 'None' values in these columns. We should convert these to appropriate numerical types after handling any such special cases.
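Sampling only shows a handful of rows, so it can miss rare offending values. One way to list every raw value that blocks numeric conversion is sketched below. It is a minimal example on a made-up toy frame: `sample` and the helper `non_numeric_values` are hypothetical names, not part of the original notebook, and the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for house_data; the values are made up.
sample = pd.DataFrame({
    "bathrooms": ["2", "3.5", "None", "4"],
    "HOA": ["55", "1,200", "None", "0"],
})

def non_numeric_values(frame, columns):
    """Return, per column, the distinct raw values that pd.to_numeric cannot parse."""
    bad = {}
    for col in columns:
        parsed = pd.to_numeric(frame[col], errors="coerce")
        # Unparseable: the raw value is present, but conversion produced NaN
        mask = parsed.isna() & frame[col].notna()
        bad[col] = sorted(frame.loc[mask, col].unique())
    return bad

offenders = non_numeric_values(sample, ["bathrooms", "HOA"])
print(offenders)
```

On the real data, the same helper could be called as `non_numeric_values(house_data, object_columns)` before the conversion cell runs.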

In [None]:
# Converting object columns to numerical types
for col in object_columns:
    house_data[col] = pd.to_numeric(house_data[col], errors='coerce')
# Checking data types again
house_data.dtypes

### Data Type Conversion
The columns 'bathrooms', 'sqrt_ft', 'garage', and 'HOA' are now numerical, which is what the modeling step requires. Because `errors='coerce'` replaces any value it cannot parse with NaN, missing values should be re-checked after this step.
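Since `errors='coerce'` turns unparseable entries into NaN without raising, it can help to count how many values were lost in the conversion. The sketch below compares null counts before and after on a made-up frame; `raw`, `converted`, and `introduced_nan` are hypothetical names, and the data is illustrative only.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from the dataset.
raw = pd.DataFrame({
    "garage": ["2", "None", "3", "None"],
    "HOA": ["55", "0", "None", "120"],
})

# Same conversion as in the notebook, applied column by column
converted = raw.apply(pd.to_numeric, errors="coerce")

# NaN count after conversion minus NaN count before = values lost to coercion
introduced_nan = converted.isna().sum() - raw.isna().sum()
print(introduced_nan)
```

A non-zero count for a column means the earlier missing-value check no longer holds for it.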

In [None]:
# Summary statistics of the dataset
house_data.describe()

### Summary Statistics
The summary statistics provide the following insights:
- The average sold price is around $675,000, with a standard deviation of roughly $319,000.
- A typical house has about 4 bedrooms and 4 bathrooms.
- The average square footage is about 3,512 sq ft.
- The dataset spans several zip codes, the most common being 85737. Zip codes are labels rather than quantities, so their mean and standard deviation in the table are not meaningful.
- On average, the houses were built in 1999.
- HOA fees vary widely, with an average of $55.

These statistics will help the modeling team understand the central tendencies and spread of the data.
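`describe()` reports means and quartiles for numerical columns but not the most frequent value, so claims such as "most common zip code" need a separate computation. A minimal sketch on made-up data follows; `toy` and the values in it are hypothetical and only illustrate the calls.

```python
import pandas as pd

# Hypothetical toy frame; the rows are invented for illustration.
toy = pd.DataFrame({
    "bedrooms": [3, 4, 4, 5, 4, 3],
    "zipcode": [85737, 85718, 85737, 85750, 85737, 85718],
})

# Mean: what describe() reports
mean_bedrooms = toy["bedrooms"].mean()

# Mode: the most frequent value (first one if there are ties)
most_common_bedrooms = toy["bedrooms"].mode().iloc[0]

# For label-like columns such as zip codes, count occurrences instead of averaging
most_common_zip = toy["zipcode"].value_counts().idxmax()

print(mean_bedrooms, most_common_bedrooms, most_common_zip)
```

The mean and the mode can differ noticeably on skewed columns, which is why both are worth checking.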

In [None]:
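# Visualizing correlations between numerical features
plt.figure(figsize=(12, 8))
# Restrict to numeric columns; corr() raises on text columns in pandas >= 2.0
sns.heatmap(house_data.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')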
plt.title('Correlation Heatmap')
plt.show()

### Correlation Heatmap
The heatmap shows the pairwise linear (Pearson) correlation between the numerical features in the dataset. Some key observations are:
- 'sold_price' has a moderate positive correlation with 'sqrt_ft' and 'bathrooms', which makes sense as larger houses with more bathrooms are generally more expensive.
- 'year_built' has a slight negative correlation with 'zipcode'. Zip codes are categorical labels, not quantities, so this coefficient should not be read as a real trend.
- 'latitude' and 'longitude' have low correlations with most other features. This only rules out a linear relationship; location may still matter in a non-linear way, so these columns should not be dropped on this evidence alone.

These insights will be valuable for feature selection during the modeling phase.
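For feature selection, reading a full heatmap is often less direct than ranking features by their correlation with the target. The sketch below does this on a made-up frame; `toy` and its values are hypothetical, and 'zipcode' is dropped on the assumption that it is a label rather than a quantity.

```python
import pandas as pd

# Hypothetical toy frame; prices are in thousands and all values are invented.
toy = pd.DataFrame({
    "sold_price": [500, 620, 710, 830, 950],
    "sqrt_ft": [2000, 2600, 3000, 3400, 4100],
    "bathrooms": [2, 3, 3, 4, 5],
    "zipcode": [85737, 85718, 85750, 85718, 85737],
})

# Keep numeric columns and drop label-like ones before correlating
numeric = toy.select_dtypes(include="number").drop(columns=["zipcode"])

# Pearson correlation of every remaining feature with the target, strongest first
target_corr = (
    numeric.corr()["sold_price"]
    .drop("sold_price")
    .sort_values(ascending=False)
)
print(target_corr)
```

Pearson correlation captures only linear association, so a low value here does not by itself show that a feature is uninformative.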

## Data Cleaning and Preprocessing
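Based on the EDA, the following data cleaning and preprocessing steps were taken:
1. Checked for missing values. None were found in the raw data, so no imputation was needed at that stage.
2. Converted the object columns 'bathrooms', 'sqrt_ft', 'garage', and 'HOA' to numerical types. Values that could not be parsed were coerced to NaN, so these columns should be re-checked for missing values.
3. Checked for correlations and identified key features for modeling.

Once any coerced NaN values are handled, the dataset is ready for the modeling team.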

In [None]:
# Plotting histograms for numerical features
house_data.hist(figsize=(16, 12), bins=20)
plt.suptitle('Feature Histograms')
plt.show()

In [None]:
# Pie chart for bedrooms
bedroom_counts = house_data['bedrooms'].value_counts()
plt.figure(figsize=(10, 6))
plt.pie(bedroom_counts, labels=bedroom_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Bedrooms')
plt.axis('equal')
plt.show()

In [None]:
# Histogram for sold_price
plt.figure(figsize=(10, 6))
sns.histplot(house_data['sold_price'], bins=30, kde=True)
plt.title('Distribution of Sold Prices')
plt.xlabel('Sold Price')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Pie chart for bathrooms
bathroom_counts = house_data['bathrooms'].value_counts()
plt.figure(figsize=(10, 6))
plt.pie(bathroom_counts, labels=bathroom_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Bathrooms')
plt.axis('equal')
plt.show()