In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Importing the raw house data into a DataFrame
house_data = pd.read_csv('raw_house_data - raw_house_data.csv')
house_data.head()

# Exploratory Data Analysis and Initial Cleaning for House Price Data
This notebook performs exploratory data analysis (EDA) and initial cleaning on the raw house price dataset, producing a cleaned dataset suitable for modeling. It includes the plots and charts, along with a justification for each data transformation or deletion.

In [None]:
# Checking for missing values in each column
missing_values = house_data.isnull().sum()
missing_values

In [None]:
# Checking the data types of each column
data_types = house_data.dtypes
data_types

## Initial Observations and Cleaning Strategy
1. **Missing Values**: No missing values were observed in the dataset.
2. **Data Types**: The columns 'bathrooms', 'sqrt_ft', 'garage', and 'HOA' are stored as object dtype even though they hold numeric quantities. They need to be converted to a numeric dtype before modeling.
3. **Categorical Features**: Columns like 'kitchen_features' and 'floor_covering' are categorical and may need encoding.

### Cleaning Strategy
1. Convert the object data types to numerical where applicable.
2. For categorical features, explore and decide on encoding methods.
3. Perform exploratory data analysis to understand distributions and correlations.
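
Before coercing an object column to numeric, it can help to see which raw entries will not parse, since `errors='coerce'` silently turns them into NaN. The following is a minimal sketch on a hypothetical toy frame; the helper `non_numeric_values` and the example values (such as the string `'None'` and the comma-formatted `'1,200'`) are assumptions for illustration, not values confirmed to be in the dataset.

```python
import pandas as pd

# Hypothetical stand-in for house_data; the raw strings below are assumed examples
toy = pd.DataFrame({
    "bathrooms": ["2", "3.5", "None", "1"],
    "HOA": ["0", "1,200", "None", "55"],
})


def non_numeric_values(series: pd.Series) -> pd.Series:
    """Count the raw entries that pd.to_numeric cannot parse."""
    parsed = pd.to_numeric(series, errors="coerce")
    # Keep entries that were present in the raw data but became NaN after parsing
    unparseable = series[parsed.isna() & series.notna()]
    return unparseable.value_counts()


for col in ["bathrooms", "HOA"]:
    print(col, non_numeric_values(toy[col]).to_dict())
```

Running the same helper on the real columns would show whether the unparseable entries are placeholders for "missing" or formatting issues that deserve a targeted fix instead of being dropped to NaN.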

In [None]:
# Converting object data types to numerical where applicable
house_data['bathrooms'] = pd.to_numeric(house_data['bathrooms'], errors='coerce')
house_data['sqrt_ft'] = pd.to_numeric(house_data['sqrt_ft'], errors='coerce')
house_data['garage'] = pd.to_numeric(house_data['garage'], errors='coerce')
house_data['HOA'] = pd.to_numeric(house_data['HOA'], errors='coerce')
# Checking the data types again
house_data.dtypes

In [None]:
# Checking for missing values after conversion
missing_values_after_conversion = house_data.isnull().sum()
missing_values_after_conversion

## Handling Missing Values and Data Types
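After converting the object columns to numeric, the missing-value check reports 25 missing values in the 'fireplaces' column. Note that `errors='coerce'` replaces every unparseable entry with NaN, so the converted columns ('bathrooms', 'sqrt_ft', 'garage', and 'HOA') should also be reviewed in the output above for newly introduced missing values.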

### Strategy for Missing Values
1. For the 'fireplaces' column, we can assume that missing values indicate no fireplaces in the house. Therefore, we will fill these with 0.
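
Because treating a missing value as zero is an assumption, one option is to record which rows were originally missing before filling them, so the choice can be revisited later. This is a minimal sketch on a hypothetical toy frame; the column name `fireplaces_was_missing` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'fireplaces' column
toy = pd.DataFrame({"fireplaces": [1.0, np.nan, 0.0, np.nan, 2.0]})

# Flag the originally missing rows before overwriting them
toy["fireplaces_was_missing"] = toy["fireplaces"].isna()

# Assign the result back instead of filling in place on a column selection
toy["fireplaces"] = toy["fireplaces"].fillna(0)

print(toy)
```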

Let's proceed with these changes.

In [None]:
# Filling missing values in 'fireplaces' with 0
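# Assign back rather than using inplace=True on a column selection,
# which is deprecated in recent pandas and may not modify house_data
house_data['fireplaces'] = house_data['fireplaces'].fillna(0)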
# Checking for missing values again
house_data.isnull().sum()

## Exploratory Data Analysis (EDA)
With the missing values handled and data types corrected, let's move on to the exploratory data analysis. We will look at the following:
1. Distribution of numerical features
2. Correlation between features
3. Distribution of categorical features

Let's start with the distribution of numerical features.

In [None]:
# Distribution of numerical features
numerical_features = house_data.select_dtypes(include=['int64', 'float64']).columns
house_data[numerical_features].hist(figsize=(15, 12), bins=20)
plt.suptitle('Distribution of Numerical Features')
plt.show()

## Observations on Numerical Features
1. **sold_price**: The distribution is right-skewed, indicating that most houses are sold at lower prices.
2. **year_built**: Most houses were built after 1950.
3. **bedrooms**: Most houses have between 2 and 5 bedrooms.
4. **fireplaces**: The majority of houses have either no fireplace or just one.
5. **lot_acres**: The distribution is highly skewed, with most houses having less than 2 acres of land.
6. **taxes**: The distribution is right-skewed, indicating that most houses have lower taxes.
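
The skew noted for columns such as sold_price can be quantified rather than judged from the histogram alone. The sketch below uses hypothetical prices, not values from the dataset, and compares the sample skewness before and after a `log1p` transform. The log transform is a common option for right-skewed data and is shown here as an assumption, not as a step this notebook applies.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one expensive outlier, mimicking a right-skewed column
toy_prices = pd.Series(
    [150_000, 180_000, 200_000, 220_000, 250_000, 1_500_000],
    name="sold_price",
)

raw_skew = toy_prices.skew()             # positive for right-skewed data
log_skew = np.log1p(toy_prices).skew()   # skewness after compressing large values

print(f"skew before log1p: {raw_skew:.2f}")
print(f"skew after log1p:  {log_skew:.2f}")
```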

Next, let's look at the correlation between these features.

In [None]:
# Correlation between numerical features
correlation_matrix = house_data[numerical_features].corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()

## Observations on Feature Correlations
1. **sold_price and taxes**: A strong positive correlation (0.95) exists between the sold price of the house and the taxes. This is expected, since property taxes are typically assessed on the value of the property.
2. **sold_price and bedrooms**: A moderate positive correlation (0.53) is observed, indicating that houses with more bedrooms tend to be sold at higher prices.
3. **year_built and sold_price**: A weak positive correlation (0.25) suggests that newer houses might be sold at slightly higher prices.
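
Reading individual cells off a heatmap becomes harder as the number of features grows. A complementary view is to rank every feature by the strength of its correlation with sold_price. This is a minimal sketch on a hypothetical toy frame; the numbers are made up and will not reproduce the coefficients quoted above.

```python
import pandas as pd

# Hypothetical stand-in for the numerical columns of house_data
toy = pd.DataFrame({
    "sold_price": [200, 250, 300, 400, 500],
    "taxes": [2.0, 2.6, 3.1, 3.9, 5.2],
    "bedrooms": [2, 3, 3, 4, 4],
    "year_built": [1990, 1960, 2005, 1980, 2010],
})

# Correlation of each feature with the target, strongest first by absolute value
target_corr = (
    toy.corr()["sold_price"]
    .drop("sold_price")
    .sort_values(key=abs, ascending=False)
)

print(target_corr)
```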

Next, let's explore the distribution of categorical features.

In [None]:
# Distribution of categorical features
categorical_features = house_data.select_dtypes(include=['object']).columns
for feature in categorical_features:
    plt.figure(figsize=(12, 6))
    sns.countplot(data=house_data, x=feature)
    plt.title(f'Distribution of {feature}')
    plt.xticks(rotation=45)
    plt.show()
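
If an object column has many distinct values, a count plot with one bar per value can become unreadable. One option is to keep only the most frequent categories and group the rest before plotting. The sketch below is a hedged illustration: the helper `top_n_counts`, the label `"Other"`, and the example values are all assumptions, not part of the dataset or of any library.

```python
import pandas as pd


def top_n_counts(series: pd.Series, n: int = 3) -> pd.Series:
    """Return counts for the n most frequent values, grouping the rest as 'Other'."""
    counts = series.value_counts()
    top = counts.head(n)
    other_total = counts.iloc[n:].sum()
    if other_total > 0:
        top = pd.concat([top, pd.Series({"Other": other_total})])
    return top


# Hypothetical stand-in for a column such as 'floor_covering'
toy = pd.Series(
    ["Carpet", "Tile", "Carpet", "Wood", "Tile", "Carpet", "Concrete", "Laminate"]
)

result = top_n_counts(toy, n=2)
print(result)
```

The returned Series can be passed to a bar plot in place of the raw column when the full set of categories is too large to display.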