# House Prices - Exploratory Data Analysis
Understanding the data we already have is one of the most important steps in any data science project. So, let's see what we have.

![Abstract houses](https://storage.googleapis.com/kaggle-media/competitions/House%20Prices/kaggle_5407_media_housesbanner.png)

<br>

#### >> [Copy this notebook](https://www.kaggle.com/code/mohamedyosef101/house-prices-eda) in Kaggle

<hr>

# Step 0: Set the fire
- Import the libraries
- Load the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

trainData = pd.read_csv('data/train.csv')
testData = pd.read_csv('data/test.csv')

# Step 1: What's in the data?
- Basic information about the data
- Discover missing values

### 1.1 Data info
To keep the output short and readable, I'll use `shape` instead of `info()`, which would print one line per column.

In [2]:
trainData.shape

(1460, 81)

<div style="background: #e3eefc; padding: 24px 12px; color: #00a; font-weight: bold; margin: 4px 80px 4px 4px; border-radius: 4px;">
<p>Wow, we have 79 features (81 columns minus Id and SalePrice)!</p>
<p style="color: #fc0000;">And, I don't have enough time to discover them one by one</p>
<p> So, I'll do some research and pick some features based on the research findings</p>
</div>
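
Before picking features, it can help to see how they split into numeric and categorical columns. Here is a minimal sketch of that idea. The `toy` DataFrame and its values are made up for illustration and stand in for `trainData`; only the column names come from the real dataset.

```python
import pandas as pd

# Toy stand-in for trainData (hypothetical values; the real data has 81 columns)
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "BldgType": ["1Fam", "1Fam", "TwnhsE"],
    "Street": ["Pave", "Pave", "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})

# Everything except the identifier and the target is a candidate feature
features = toy.drop(columns=["Id", "SalePrice"])

# Split the candidate features by dtype
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print(f"{features.shape[1]} features: "
      f"{len(numeric_cols)} numeric, {len(categorical_cols)} categorical")
```

Swapping `toy` for `trainData` should give the same kind of overview for the full dataset.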

### 1.2 Missing values

In [3]:
# My favorite missing-values helper
def get_missing_value_counts(data_frame):
    """Return missing-value counts and fractions for columns that have gaps."""
    # Number of missing values per column, keeping only columns with gaps
    missing_counts = data_frame.isnull().sum()
    missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

    # Fraction (0 to 1) of rows that are missing in each column
    percent = data_frame.isnull().mean()
    percent = percent[percent > 0].sort_values(ascending=False)

    missing_data = pd.concat([missing_counts, percent], axis=1, keys=['Missing_counts', 'Percent'])
    return missing_data

train_missing_values = get_missing_value_counts(trainData)
print(train_missing_values)

              Missing_counts   Percent
PoolQC                  1453  0.995205
MiscFeature             1406  0.963014
Alley                   1369  0.937671
Fence                   1179  0.807534
MasVnrType               872  0.597260
FireplaceQu              690  0.472603
LotFrontage              259  0.177397
GarageType                81  0.055479
GarageYrBlt               81  0.055479
GarageFinish              81  0.055479
GarageQual                81  0.055479
GarageCond                81  0.055479
BsmtFinType2              38  0.026027
BsmtExposure              38  0.026027
BsmtFinType1              37  0.025342
BsmtCond                  37  0.025342
BsmtQual                  37  0.025342
MasVnrArea                 8  0.005479
Electrical                 1  0.000685


<div style="background: #e3eefc; padding: 24px 12px; color: #00a; font-weight: bold; margin: 4px 80px 4px 4px; border-radius: 4px;">
    <p>I think that removing Alley, PoolQC, Fence, and MiscFeature is likely to be a good choice</p>
    <p style="color: #fc0000;">But, I don't know what I should do with MasVnrType and FireplaceQu</p>
</div>
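
One possible way to act on this is sketched below. The 0.8 cutoff is my own assumption (it happens to separate the four columns above from the rest), and the `toy` DataFrame is made-up data standing in for `trainData`. For FireplaceQu, as far as I can tell from the competition's data description, a missing value means the house has no fireplace, so filling it with its own category may be more sensible than dropping the column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for trainData (hypothetical values)
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, np.nan, "Gd"],
    "FireplaceQu": ["TA", np.nan, "Gd", np.nan, "TA", "Gd"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

DROP_THRESHOLD = 0.8  # assumed cutoff, not a rule from this notebook

# Fraction of missing values per column
missing_ratio = toy.isnull().mean()

# Drop the columns that are almost entirely empty
cols_to_drop = missing_ratio[missing_ratio > DROP_THRESHOLD].index.tolist()
cleaned = toy.drop(columns=cols_to_drop)

# Treat the remaining gaps in a categorical column as their own category
cleaned["FireplaceQu"] = cleaned["FireplaceQu"].fillna("None")

print(cols_to_drop)
print(cleaned["FireplaceQu"].value_counts())
```

The same fill could be tried for MasVnrType, but that is worth checking against the data description first.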

# Step 2: Explore some features
I'm going to do bivariate analysis between the sale price and the following features:
1. Building types
2. Zoning
3. Street
4. Property Age
5. Living Area

### 2.1 Building types

In [None]:
# Distribution of dwelling types
plt.figure(figsize=(10, 6))
sns.countplot(data=trainData, x='BldgType', palette='Set2')
plt.title('Distribution of Building Types')
plt.xlabel('Building Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Average Sale Price by Building Type
plt.figure(figsize=(10, 6))
sns.barplot(data=trainData, x='BldgType', y='SalePrice', palette='viridis', errorbar=None)
plt.title('Average Sale Price by Building Type')
plt.xlabel('Building Type')
plt.ylabel('Price')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

<div style="background: #e3eefc; padding: 24px 12px; color: #00a; font-weight: bold; margin: 4px 80px 4px 4px; border-radius: 4px;">
    <p>There are a lot of 1Fam houses and their average price is higher than that of any other building type</p>
    <p style="color: #fc0000;">Also, there are fewer TwnhsE houses, but they still have a high average price</p>
</div>

### 2.2 Zoning

In [None]:
# Zoning impact on sale price
plt.figure(figsize=(10, 6))
sns.barplot(data=trainData, x='MSZoning', y='SalePrice', errorbar=None, palette=['blue', 'green', 'red', 'black', 'gray'])
plt.title('Average Sale Price by Zoning')
plt.xlabel('Zoning')
plt.ylabel('Sale Price')
plt.xticks(rotation=45)
plt.gca().yaxis.set_major_formatter('${:,.0f}'.format)  # Format y-axis ticks as currency
plt.tight_layout()
plt.show()

<div style="background: #e3eefc; padding: 24px 12px; color: #00a; font-weight: bold; margin: 4px 80px 4px 4px; border-radius: 4px;">
    <p>You can find the highest average prices in "FV" and "RL", while "C" is far behind</p>
</div>

### 2.3 Street

In [None]:
# Street Prices
street_prices = trainData.groupby('Street')['SalePrice'].mean()
plt.figure(figsize=(10, 6))
sns.barplot(data=trainData, x='Street', y='SalePrice', errorbar=None, palette=['blue', 'green'])
plt.title('Average Sale Price by Street Type')
plt.xlabel('Street Type')
plt.ylabel('Sale Price')
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter('${:,.0f}'.format)  # Format y-axis ticks as currency
plt.tight_layout()
plt.show()

<div style="background: #e3eefc; padding: 24px 12px; color: #00a; font-weight: bold; margin: 4px 80px 4px 4px; border-radius: 4px;">
    <p>Houses on paved streets (Pave) have a much higher average price</p>
</div>
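
An average can be misleading when one group is very small, so it may be worth looking at group sizes next to the means. The sketch below does that with a `groupby` and `agg`. The `toy` DataFrame is invented for illustration and stands in for `trainData`.

```python
import pandas as pd

# Toy stand-in for trainData (hypothetical values)
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Pave", "Pave", "Grvl"],
    "SalePrice": [200000, 180000, 220000, 160000, 120000],
})

# Count, mean, and median sale price per street type, highest mean first
summary = (
    toy.groupby("Street")["SalePrice"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

The same summary should work for BldgType and MSZoning by changing the column passed to `groupby`.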

### 2.4 Property Age

In [None]:
# Calculate Property Age
trainData['PropertyAge'] = trainData['YrSold'] - trainData['YearBuilt']

# Calculate Correlation between Property Age and Sale Price
age_price_corr = trainData['PropertyAge'].corr(trainData['SalePrice'])
print(f'Correlation between Property Age and Sale Price: {age_price_corr}')

# Create a scatter plot to visualize the relationship between Property Age and Sale Price
plt.figure(figsize=(10, 6))
sns.scatterplot(data=trainData, x='PropertyAge', y='SalePrice', hue='PropertyAge', legend=False)
plt.title('Property Age vs Sale Price')
plt.xlabel('Property Age')
plt.ylabel('Sale Price')
plt.gca().xaxis.set_major_locator(plt.MaxNLocator(integer=True))  # Ensure integer values on x-axis
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

<div style="background: #e3eefc; padding: 24px 12px; color: #00a; font-weight: bold; margin: 4px 80px 4px 4px; border-radius: 4px;">
    <p>Prices look fairly similar across most ages, with some noticeably high prices among houses that are less than 15 years old</p>
</div>
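
To check the "less than 15 years old" impression with numbers, one option is to bin the ages and compare the median price per bin. This is only a sketch: the bin edges are my own choice, and the `toy` DataFrame holds made-up rows standing in for `trainData`.

```python
import pandas as pd

# Toy stand-in for trainData (hypothetical values)
toy = pd.DataFrame({
    "YrSold": [2008, 2007, 2008, 2006, 2009, 2010],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1960],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 150000],
})
toy["PropertyAge"] = toy["YrSold"] - toy["YearBuilt"]

# Assumed bin edges; right=False makes each bin include its left edge only
bins = [0, 15, 50, 200]
labels = ["<15", "15-49", "50+"]
toy["AgeGroup"] = pd.cut(toy["PropertyAge"], bins=bins, labels=labels, right=False)

# Median sale price per age group (observed=True skips empty groups)
median_by_age = toy.groupby("AgeGroup", observed=True)["SalePrice"].median()
print(median_by_age)
```

The median is used here instead of the mean so that a few very expensive houses do not dominate a bin.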

### 2.5 Living Area

In [None]:
living_area_price_corr = trainData['GrLivArea'].corr(trainData['SalePrice'])
print(f'Correlation between Living Area (above grade) and Sale Price: {living_area_price_corr}')

# Create a scatter plot to visualize the relationship between Living Area and Sale Price
plt.figure(figsize=(10, 6))
sns.scatterplot(data=trainData, x='GrLivArea', y='SalePrice', hue='GrLivArea', legend=False)
plt.title('Living Area (above grade) vs Sale Price')
plt.xlabel('Living Area (above grade)')
plt.ylabel('Sale Price')
plt.gca().xaxis.set_major_locator(plt.MaxNLocator(integer=True))  # Ensure integer values on x-axis
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

<div style="background: #e3eefc; padding: 24px 12px; color: #00a; font-weight: bold; margin: 4px 80px 4px 4px; border-radius: 4px;">
    <p>Now this is my fav relationship: the sale price tends to rise clearly with the living area</p>
</div>
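
A simple way to put a number on this relationship is to fit a straight line through the points. The sketch below uses `np.polyfit` for a least-squares fit. The `toy` DataFrame is invented (and perfectly linear on purpose), so the numbers it gives say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for trainData (hypothetical, perfectly linear values)
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "SalePrice": [120000, 170000, 220000, 270000, 320000],
})

x = toy["GrLivArea"].to_numpy(dtype=float)
y = toy["SalePrice"].to_numpy(dtype=float)

# Least-squares straight line: SalePrice ~ slope * GrLivArea + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Price the fitted line gives for a hypothetical 1,800 sq ft house
predicted = slope * 1800 + intercept

print(f"slope: {slope:.2f} per sq ft, intercept: {intercept:.2f}")
print(f"fitted price for 1,800 sq ft: {predicted:.2f}")
```

On the real data the slope would be a rough estimate of how much each extra square foot of living area adds to the price, ignoring every other feature.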

That's all for the exploratory data analysis. In the next notebook, I'll build the house prices prediction model.