# House Sales Analysis in a Northwestern County

## 1. Business Understanding
### a) Introduction

House sales began in the 1890s in the United States and have since grown worldwide, with agencies forming to streamline the house selling process. Last year's revenue was estimated at $4.25M, with prospects of further growth. House sales are mainly influenced by the number of bedrooms and bathrooms, the year built, the square footage, and whether renovations have been done, among other factors.

In this case, the Northwest agencies aim to give homeowners accurate and actionable advice on how home renovations can increase the estimated value of their properties, and by what amount. By understanding the relationship between renovation factors and house prices, the agencies can guide homeowners towards informed renovation decisions that maximize their return on investment and help them sell their homes at optimal prices.

### b) Problem statement

The real estate industry faces the challenge of providing homeowners with reliable information about how various home renovation factors impact the estimated value of their homes. Our project addresses this problem by utilizing data analysis and regression modeling to identify key factors that affect house prices in a northwestern county. By understanding these factors, we can provide recommendations and insights to stakeholders on how to effectively advise homeowners on renovations that can potentially increase the value of their properties.

### c) Main Objective

The main objective of this project is to develop a predictive model that estimates house prices based on various features such as the number of bedrooms and bathrooms, square footage, and year built. By building a robust regression model, we aim to accurately predict house prices and provide stakeholders with valuable insights into the factors driving price fluctuations.

### d) Metric of success

The success of our project will be evaluated based on the model's performance in predicting house prices. We will use evaluation metrics such as the coefficient of determination (R-squared) and root mean square error (RMSE) to assess the model's accuracy. A higher R-squared value and lower RMSE indicate a more successful model in capturing the variations in house prices.
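
As a minimal sketch of the two metrics named above, the snippet below computes RMSE and R-squared by hand with NumPy. The helper names `rmse` and `r_squared` and the price values are hypothetical and used only for illustration; they are not part of the project's dataset or code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: typical size of a prediction error, in the target's units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return float(1 - ss_res / ss_tot)

# Hypothetical prices in dollars, for illustration only
actual = [300_000, 450_000, 500_000, 750_000]
predicted = [320_000, 430_000, 520_000, 730_000]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"R-squared: {r_squared(actual, predicted):.4f}")
```

Because RMSE is expressed in the same units as price, it is often easier to communicate to stakeholders than the raw mean squared error.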

### e) Specific Objectives

- Explore and preprocess the King County House Sales dataset, including handling missing values, transforming features, and encoding categorical variables.
- Perform exploratory data analysis to gain insights into the distribution and relationships between different features and the target variable.
- Conduct feature selection to identify the most influential factors that affect house prices and eliminate irrelevant or redundant features.
- Build multiple linear regression models with different combinations of features and evaluate their performance using appropriate metrics.
- Interpret the results of the final regression model, including the coefficients of the selected features and their implications on house prices.
- Provide recommendations to stakeholders based on the insights gained from the modeling process, suggesting specific renovation factors that homeowners can focus on to increase the estimated value of their properties.

##  2. Data Understanding

We will be using the King County House Sales dataset, which contains information on house sales, including features such as price, number of bedrooms and bathrooms, square footage, and year built. By thoroughly understanding the dataset and its column descriptions, we can identify the relevant features to include in our analysis and modeling process.


## 3. Data Preparation

This process involves cleaning, transforming, and organizing the data to ensure its suitability for analysis and modeling. 

- Importing relevant libraries
- Loading the dataset and checking what it contains
- Dealing with missing data
- Checking and removing duplicates
- Handling outliers
- Feature scaling and normalization using z-scores
- Encoding categorical variables using one-hot encoding
- Exploring the dataset to identify opportunities for creating new features that may enhance the predictive power of the model 
- Splitting the dataset into training and test sets
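
The scaling and splitting steps listed above can be sketched as follows. This is an illustration on a hypothetical toy table, not the project's pipeline. It assumes one common design choice: the z-score parameters are computed on the training rows only and then reused on the test rows, so that no information from the test set influences the scaling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500],
    "bedrooms": [2, 3, 3, 4, 4, 5, 5, 6],
    "price": [200_000, 280_000, 350_000, 430_000, 500_000, 590_000, 660_000, 740_000],
})

X = toy[["sqft_living", "bedrooms"]]
y = toy["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Compute z-score parameters from the training rows only, then reuse them on the test rows
train_mean = X_train.mean()
train_std = X_train.std()
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std

print(X_train_scaled.describe().loc[["mean", "std"]])
```

After scaling, each training column has mean 0 and standard deviation 1; the test columns will generally be close to, but not exactly, those values.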


### Importing libraries

In [11]:
# importing libraries
import pandas as pd
import numpy as np 
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

### Loading the dataset

In [13]:
# reading the house dataset and previewing the first five rows
data = pd.read_csv("data/kc_house_data.csv")
data.head()

In [3]:
# getting an overview of the dataset, including the number of non-null values and the data types of each column
print(data.info())
# getting the number of rows and columns in the data.
print(data.shape)

The dataset has 21597 rows and 21 columns.

In [4]:
# checking the data type of each column
data.dtypes

In [5]:
# generating summary statistics for the numerical columns in the dataset
data.describe()

On average, houses have about 3 bedrooms and 2 bathrooms, and most were built between 1970 and 2015.

## Data Cleaning

### Checking for missing values

In [7]:
# checking the proportion of missing values per column
data.isna().sum()/len(data)

Three columns have missing values, but we will only treat waterfront, since it is the one that will be used in analysis and modelling.

In [8]:
# since waterfront has missing values, we first check the value counts
data['waterfront'].value_counts()

In [9]:
# filling in missing values in the 'waterfront' column with the string 'NO'
data['waterfront'] = data['waterfront'].fillna('NO')
# getting an overview of the dataset after filling the missing values in Waterfront
data.info()

We filled the missing values in waterfront instead of dropping those rows so that every column keeps the same number of rows.
The waterfront column now has the same number of non-null values as the other columns we will be using.
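
To make the effect of this imputation concrete, here is a small sketch on a hypothetical toy column. It assumes the column holds the strings 'YES' and 'NO', and that treating a missing entry as 'NO' is acceptable for the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the real column is assumed to hold 'YES'/'NO' strings
toy = pd.DataFrame({"waterfront": ["NO", np.nan, "YES", "NO", np.nan]})

missing_before = toy["waterfront"].isna().sum()
toy["waterfront"] = toy["waterfront"].fillna("NO")   # treat unknown as not waterfront
missing_after = toy["waterfront"].isna().sum()

print(f"Missing before: {missing_before}, missing after: {missing_after}")
print(toy["waterfront"].value_counts())
```

Note that this choice inflates the 'NO' count, so it is most defensible when 'NO' is already the dominant value in the column.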

In [10]:
# dropping columns that we will not be using in data analysis.
columns_to_drop = ['date', 'view', 'sqft_above', 'sqft_basement', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']
# passing the labels through `columns=` so pandas drops columns rather than looking for row labels
data.drop(columns=columns_to_drop, inplace=True)

In [None]:
# checking to see if the columns have been dropped.
data.info()

In [None]:
# checking if there are any duplicates
data.duplicated().any()

In [None]:
# counting how many ids appear more than once (printed, since only a cell's last expression is displayed)
print(data['id'].duplicated().sum())
# dropping rows with a repeated id, keeping the first occurrence
data = data.drop_duplicates(subset='id', keep='first')

In [None]:
# checking again whether any duplicated rows remain
data.duplicated().any()
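
The two duplicate checks above answer different questions, which the following sketch illustrates on hypothetical rows: `duplicated()` on the whole frame only flags rows that match in every column, while `duplicated()` on the `id` column flags a house that appears more than once even if its other values differ.

```python
import pandas as pd

# Hypothetical rows: house 101 appears twice with different prices (for example, a resale)
toy = pd.DataFrame({
    "id": [101, 102, 101, 103],
    "price": [300_000, 450_000, 320_000, 500_000],
})

print(toy.duplicated().any())        # False: no two rows are identical in every column
print(toy["id"].duplicated().sum())  # 1: one id repeats

# Keep only the first record for each id
deduped = toy.drop_duplicates(subset="id", keep="first")
print(deduped)
```

Keeping the first occurrence is a simplification; if the later record reflects a more recent sale, `keep='last'` on date-sorted data might be the better choice.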

In [None]:
# checking the rows and columns after dropping duplicates and unwanted columns.
data.shape

## Exploratory data analysis

### Univariate analysis

### Dealing with outliers

In [None]:
# Selecting the numerical columns in the dataset
numeric_columns = ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'yr_built']

# Calculating z-scores for the numerical columns
z_scores = (data[numeric_columns] - data[numeric_columns].mean()) / data[numeric_columns].std()

# defining a threshold for identifying outliers
threshold = 3

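# Flag rows where any column lies more than `threshold` standard deviations from its mean, in either direction
outlier_indices = (z_scores.abs() > threshold).any(axis=1)

# Extract the outlier rows from the dataset
outliers = data[outlier_indices]

# Plotting a boxplot of each numeric column to visualise the outliers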
for column in numeric_columns:
    plt.figure()
    plt.boxplot(data[column])
    plt.title(column)
    plt.show()


In [None]:
# Removing the outlier rows from the dataset and resetting the index
# (chained so that data_cleaned is a new DataFrame rather than a modified slice of data)
data_cleaned = data[~outlier_indices].reset_index(drop=True)

# Plot histograms of the cleaned dataset
data_cleaned[numeric_columns].hist(bins=20, figsize=(10, 8))
plt.tight_layout()
plt.show()

# Plot scatterplots of price against other numeric columns
sns.pairplot(data_cleaned, x_vars=['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot'], y_vars='price', height=5)
plt.tight_layout()
plt.show()
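
As a small illustration of the z-score rule used above, the sketch below applies it to a hypothetical single-column table containing twenty typical values and one extreme value. Taking the absolute value of the z-score flags values that are unusually low as well as unusually high.

```python
import pandas as pd

# Hypothetical sizes: twenty typical homes plus one extreme value
toy = pd.DataFrame({"sqft_living": [1500 + 50 * i for i in range(20)] + [15000]})

# z-score: distance from the column mean, measured in standard deviations
z = (toy - toy.mean()) / toy.std()

# A row is an outlier if any of its columns is more than 3 standard deviations from the mean
mask = (z.abs() > 3).any(axis=1)
toy_cleaned = toy[~mask].reset_index(drop=True)

print(f"Rows flagged: {mask.sum()}")
print(f"Rows kept: {len(toy_cleaned)}")
```

One caveat worth keeping in mind: the mean and standard deviation are themselves pulled by extreme values, so on heavily skewed columns such as price a rule based on the interquartile range may behave differently.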



## Data analysis

In [None]:
# checking for categorical columns
categorical_columns = data.select_dtypes(include=['object', 'category']).columns
categorical_columns

In [None]:
# exploring categorical variables, using value_counts() to get the count of each unique value in a column
# and, after the loop, describe() to get summary statistics for the categorical columns.
for column in categorical_columns:
    value_counts = data[column].value_counts()
    print(f"Value counts for {column}:\n{value_counts}\n")

# Summary statistics for categorical columns
print(data[categorical_columns].describe())


In [None]:
# Correlation analysis
correlation_matrix = data_cleaned[numeric_columns].corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Between Columns')
plt.show()

In [None]:
# one-hot encoding the categorical columns of the cleaned data (the same data the models are fitted on)
data_encoded = pd.get_dummies(data_cleaned, columns=categorical_columns, drop_first=True)
data_encoded.head()
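
To show what `drop_first=True` does, here is a sketch on a hypothetical two-category column. With two categories, one indicator column is enough: the dropped category becomes the baseline that the remaining indicator is compared against, which avoids perfectly collinear columns in a linear model.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "waterfront": ["NO", "YES", "NO"],
    "price": [300_000, 900_000, 350_000],
})

# 'NO' sorts first, so it is dropped and becomes the baseline category
encoded = pd.get_dummies(toy, columns=["waterfront"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The untouched columns come first in the result, followed by the new indicator columns named `<column>_<category>`.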

## Regression modelling

To perform regression modeling, we will split the dataset into features (X) and a target variable (y), and then split both into training and testing sets. We will use the LinearRegression model from the scikit-learn library to build the regression model.

In [None]:
# Regression Modeling

# Splitting the data into features (X) and target variable (y)
X = data_cleaned[numeric_columns[1:]]
y = data_cleaned[numeric_columns[0]]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and fitting the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Making predictions on the test set
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nRegression Model Evaluation:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

## Basic Linear Regression

In [None]:
# Model 1: Basic Linear Regression
X = data_cleaned[numeric_columns[1:]]
y = data_cleaned[numeric_columns[0]]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and fitting the linear regression model
model1 = LinearRegression()
model1.fit(X_train, y_train)

# Making predictions on the test set
y_pred = model1.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model 1: Basic Linear Regression")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
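
Since one objective is to interpret the coefficients of the fitted model, the following sketch shows how they can be read off a `LinearRegression` object. The data is synthetic and noise-free by construction (an assumption made purely so the recovered coefficients are easy to check); the feature names and dollar amounts are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients, for illustration only
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 4000, size=200)
bedrooms = rng.integers(1, 6, size=200)
price = 50_000 + 200 * sqft + 10_000 * bedrooms  # noise-free by construction

X_demo = np.column_stack([sqft, bedrooms])
demo_model = LinearRegression().fit(X_demo, price)

# Each coefficient is the expected change in price for a one-unit increase
# in that feature, holding the other features fixed
for name, coef in zip(["sqft_living", "bedrooms"], demo_model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {demo_model.intercept_:.2f}")
```

On real data the coefficients will not be this clean, and correlated features (for example, bedrooms and square footage) can make individual coefficients hard to interpret in isolation.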

In [None]:
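# Model 2: Feature Selection
# Selecting features whose absolute correlation with price exceeds a threshold
correlation_threshold = 0.8
# Dropping 'price' itself: it correlates perfectly with the target and would leak it into X
price_correlations = correlation_matrix['price'].drop('price')
selected_features = price_correlations[price_correlations.abs() > correlation_threshold].index.tolist()
# If no feature passes the threshold, fall back to the single most correlated feature
if not selected_features:
    selected_features = [price_correlations.abs().idxmax()]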

X = data_cleaned[selected_features]
y = data_cleaned[numeric_columns[0]]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and fitting the linear regression model
model2 = LinearRegression()
model2.fit(X_train, y_train)

# Making predictions on the test set
y_pred = model2.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nModel 2: Feature Selection")
print(f"Selected Features: {selected_features}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
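
The sketch below illustrates one pitfall of correlation-based selection on a hypothetical toy table: the target column always correlates perfectly with itself, so it passes any threshold unless it is removed from the candidates first. Leaving it in would let the model predict price from price.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "price": [200, 300, 400, 500, 600, 700],
    "sqft_living": [1000, 1400, 2100, 2400, 3100, 3500],
    "yr_built": [1990, 1950, 2005, 1960, 1975, 1985],
})

corr_with_price = toy.corr()["price"]

# Naive selection: the target itself is among the "features"
naive = corr_with_price[corr_with_price.abs() > 0.8].index.tolist()

# Safer selection: remove the target from the candidates first
candidates = corr_with_price.drop("price")
selected = candidates[candidates.abs() > 0.8].index.tolist()

print(f"Naive selection: {naive}")
print(f"Selection without the target: {selected}")
```

A suspiciously perfect R-squared on the test set is often the first visible symptom of this kind of leakage.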

## Polynomial Regression

In [None]:
# Model 3: Polynomial Regression
X = data_cleaned[numeric_columns[1:]]
y = data_cleaned[numeric_columns[0]]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating polynomial features
poly_features = PolynomialFeatures(degree=2)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Creating and fitting the linear regression model with polynomial features
model3 = LinearRegression()
model3.fit(X_train_poly, y_train)

# Making predictions on the test set
y_pred = model3.predict(X_test_poly)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nModel 3: Polynomial Regression")
print(f"Degree of Polynomial: {poly_features.degree}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
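
To make the polynomial step less opaque, this sketch shows what `PolynomialFeatures(degree=2)` produces for a single hypothetical row with two features `a = 2` and `b = 3`. The model itself stays linear; it is simply fitted on these expanded columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical row with two features: a = 2, b = 3
X_toy = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
X_toy_poly = poly.fit_transform(X_toy)

# Column order: 1 (bias), a, b, a^2, a*b, b^2
print(X_toy_poly)
print(f"Number of output features: {poly.n_output_features_}")
```

With the five predictors used above, degree 2 expands to 21 columns (1 bias term, 5 linear terms, 5 squares, and 10 pairwise products), so the risk of overfitting grows and comparing train and test scores becomes more important.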