The Capstone Project for the October 2022 class at Flatiron School involves performing a Linear Regression Analysis on a given dataset. Linear Regression is a statistical method used to establish a relationship between a dependent variable and one or more independent variables.

pjsalinas/capstone_repo

capstone_repo

Analysis by: Pedro J. Salinas

Repo Contents

Table of Contents

Introduction

The real estate market in the United States is a complex and dynamic industry that involves the buying, selling, and renting of properties such as homes, commercial buildings, and land. It is an essential part of the American economy, and it has a significant impact on the overall economic health of the country.

The real estate market in the US is highly competitive, with numerous players such as real estate agents, brokers, developers, investors, and lenders. The market is also subject to various economic factors such as interest rates, inflation, employment, and consumer confidence, which can have a significant impact on the demand and supply of properties.

One of the key factors that differentiate the real estate market in the US from other countries is the prevalence of homeownership. The US has one of the highest homeownership rates in the world, with more than two-thirds of the population owning their homes. This has created a culture of homeownership and has led to the development of various government programs and incentives to support homeownership.

Another key feature of the real estate market in the US is its regional diversity. Real estate markets in different regions of the country can vary significantly in terms of pricing, demand, and supply. For example, the real estate market in major cities such as New York, San Francisco, and Los Angeles can be very different from that in smaller cities and rural areas.

Overall, the real estate market in the US is a complex and dynamic industry that plays a vital role in the country's economy. It is a constantly evolving industry that is influenced by various economic and social factors, and it requires careful attention and analysis to understand its complexities and make informed decisions.

Project Overview

In this project, we analyze house price data from King County, Washington State, using a dataset of houses sold during the period of 2020-2022, to predict the sale price of a house from factors such as location, square footage, and number of bedrooms and bathrooms. We explore the data using statistical techniques and feature engineering to prepare it for modeling. Finally, we build and evaluate regression models using several techniques and keep the one that gives the best results.

To Run the code:

Conda Environment

Use the terminal or an Anaconda Prompt for the following steps:

  • Create the environment from the environment.yml file:
    • conda env create -f environment.yml
  • Activate the new environment:
    • conda activate capstone_env
  • Verify that the new environment was installed correctly:
    • conda env list

MyBinder

To run the code in this repo without a local installation, you can use MyBinder:

Binder to explore the analysis.

Data

The dataset provided to us contains information about houses sold during 2012 - 2022 in King County, Washington State. It includes historical and current details about each house sold: the sale price, the square footage of the living area, the number of bathrooms and bedrooms, the location, and other house-specific features. Each feature is easily identifiable from its column name, and each house is assigned a unique number, "id", for identification purposes. The file documents 29182 home sales.

Column Names and Descriptions for King County Data Set

  • id - Unique identifier for a house
  • date - Date house was sold
  • price - Sale price (prediction target)
  • bedrooms - Number of bedrooms
  • bathrooms - Number of bathrooms
  • sqft_living - Square footage of living space in the home
  • sqft_lot - Square footage of the lot
  • floors - Number of floors (levels) in house
  • waterfront - Whether the house is on a waterfront
    • Includes Duwamish, Elliott Bay, Puget Sound, Lake Union, Ship Canal, Lake Washington, Lake Sammamish, other lake, and river/slough waterfronts
  • greenbelt - Whether the house is adjacent to a green belt
  • nuisance - Whether the house has traffic noise or other recorded nuisances
  • view - Quality of view from house
    • Includes views of Mt. Rainier, Olympics, Cascades, Territorial, Seattle Skyline, Puget Sound, Lake Washington, Lake Sammamish, small lake / river / creek, and other
  • condition - How good the overall condition of the house is. Related to maintenance of house.
  • grade - Overall grade of the house. Related to the construction and design of the house.
  • heat_source - Heat source for the house
  • sewer_system - Sewer system for the house
  • sqft_above - Square footage of house apart from basement
  • sqft_basement - Square footage of the basement
  • sqft_garage - Square footage of garage space
  • sqft_patio - Square footage of outdoor porch or deck space
  • yr_built - Year when house was built
  • yr_renovated - Year when house was renovated
  • address - The street address
  • lat - Latitude coordinate
  • long - Longitude coordinate

Most fields were pulled from the King County Assessor Data Download.

The address, lat, and long fields have been retrieved using a third-party geocoding API. In some cases due to missing or incorrectly-entered data from the King County Assessor, this API returned locations outside of King County, WA. If you plan to use the address, lat, or long fields in your modeling, consider identifying outliers prior to including the values in your model.
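One simple way to flag the out-of-county locations mentioned above is a bounding-box check on lat and long. The sketch below is illustrative only: the box limits are rough assumptions rather than official county boundaries, and the sample rows are made up.

```python
# Rough King County bounding box. These limits are approximate assumptions
# chosen for illustration, not official boundaries.
LAT_MIN, LAT_MAX = 47.0, 47.8
LONG_MIN, LONG_MAX = -122.6, -121.0


def in_king_county(lat, long):
    """Return True when a coordinate falls inside the rough bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LONG_MIN <= long <= LONG_MAX


# Hypothetical rows: (id, lat, long)
rows = [
    (1, 47.61, -122.33),  # Seattle
    (2, 47.32, -122.31),  # south King County
    (3, 40.71, -74.01),   # geocoding error far outside Washington
]

kept = [r for r in rows if in_king_county(r[1], r[2])]
dropped = [r for r in rows if not in_king_county(r[1], r[2])]
print(f"kept {len(kept)} rows, dropped {len(dropped)} rows")
```

A bounding box is coarse: it can keep a few points from neighboring counties, so a stricter filter (for example, on the zip code in address) may still be needed.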

Using these features I also created two additional variables:

  • yr_old - represents how old a house is, counted either from the year it was built (yr_built) or from the year it was renovated (yr_renovated).
  • city - the name of the city where each house is located
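The two derived variables could be built roughly as follows. This is a sketch under several assumptions that the text does not confirm: that yr_renovated is 0 or missing for houses never renovated, that age is measured against a reference year such as the sale year, that address ends with a 5-digit zip code, and that city comes from a zip-to-city lookup (the small table here is a hypothetical excerpt).

```python
import re

# Hypothetical excerpt of a zip-code-to-city lookup table.
ZIP_TO_CITY = {"98055": "Renton", "98133": "Seattle"}


def years_old(yr_built, yr_renovated, reference_year):
    """Age counted from the most recent of build year and renovation year.

    Assumption: yr_renovated is 0 or None when the house was never renovated.
    """
    if yr_renovated and yr_renovated > yr_built:
        last_update = yr_renovated
    else:
        last_update = yr_built
    return reference_year - last_update


def city_from_address(address):
    """Look up the city from the last 5-digit number in the address."""
    zips = re.findall(r"\b\d{5}\b", address)
    if not zips:
        return None
    return ZIP_TO_CITY.get(zips[-1])


print(years_old(1990, 0, 2022))     # never renovated
print(years_old(1950, 2010, 2022))  # renovated in 2010
print(city_from_address("2102 Southeast 21st Court, Renton, Washington 98055, United States"))
```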

Modeling

Relation Between Price and sqft_living

After reading the data, the first step was to clean it and rename some of the columns for better readability. Once that was done, we selected the target variable, which in this case was price. We then computed the correlation between the numerical variables to see which ones were most strongly related to price. It turned out that sqft_living had the highest correlation with price, so we used it as the independent variable for our baseline model.

House Prices vs. SQFT of Living Space in Home
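The baseline step described above (rank the numerical features by correlation with price, then fit a one-variable regression on the strongest one) can be sketched in plain Python. The numbers below are made up for illustration and the helper names are hypothetical; the notebook itself may use other tooling for this.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_simple_ols(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Hypothetical sample, not taken from the King County file.
data = {
    "price": [300000, 450000, 500000, 650000, 800000],
    "sqft_living": [1000, 1500, 1800, 2400, 3000],
    "bedrooms": [2, 3, 3, 4, 3],
    "floors": [1, 1, 2, 2, 1],
}
target = "price"

# Correlation of every feature with the target.
correlations = {
    name: pearson(values, data[target])
    for name, values in data.items()
    if name != target
}

# The feature with the largest absolute correlation becomes the baseline predictor.
best = max(correlations, key=lambda name: abs(correlations[name]))
slope, intercept = fit_simple_ols(data[best], data[target])
print(best, round(correlations[best], 3), round(slope, 1), round(intercept, 1))
```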

Next, we ran additional models using all the numerical features, and also using dummy (one-hot encoded) features for the categorical columns. We then filtered the features based on their p-values from the linear regression output, and selected the model with the highest Adjusted R-squared. Finally, we checked whether the linear regression assumptions were met for this model.
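Adjusted R-squared is used for this comparison because it penalizes extra predictors: with n observations and p predictors, it equals 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below applies that formula to three hypothetical candidates; the R² values and predictor counts are invented for illustration and are not results from this analysis.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical candidates: (name, R-squared, number of predictors).
candidates = [
    ("baseline", 0.40, 1),
    ("all_numeric", 0.48, 12),
    ("filtered", 0.47, 7),
]
n_obs = 200  # hypothetical sample size

scored = {
    name: adjusted_r_squared(r2, n_obs, p) for name, r2, p in candidates
}
best_model = max(scored, key=scored.get)

for name, score in scored.items():
    print(f"{name}: {score:.4f}")
print("selected:", best_model)
```

In this made-up case the model with fewer predictors is selected even though its plain R² is slightly lower, which is the behavior the adjustment is meant to produce.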

Statistical Distribution of Floors versus Price

However, after inspecting the price vs floors boxplot, we noticed some outliers that were distorting the linear regression fit. We ran two more models after eliminating these outliers, and once again checked whether the assumptions were met. Overall, this process involved a careful and thorough analysis of the data, followed by a systematic approach to selecting and refining our models.
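A boxplot marks points beyond 1.5 times the interquartile range (IQR) from the quartiles, so one plausible way to eliminate such outliers is to apply the same rule within each floors group. The text does not state which rule was actually used, so the sketch below is an assumption; the rows are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Lower and upper fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_price_outliers(rows, k=1.5, min_group_size=4):
    """Keep (floors, price) rows whose price lies within its group's fences."""
    prices_by_floors = defaultdict(list)
    for floors, price in rows:
        prices_by_floors[floors].append(price)

    # Groups that are too small to estimate quartiles are left unfiltered.
    bounds = {
        floors: iqr_bounds(prices, k)
        for floors, prices in prices_by_floors.items()
        if len(prices) >= min_group_size
    }

    kept = []
    for floors, price in rows:
        if floors not in bounds:
            kept.append((floors, price))
            continue
        low, high = bounds[floors]
        if low <= price <= high:
            kept.append((floors, price))
    return kept


# Hypothetical rows: (floors, price). One extreme price in the 1-floor group.
rows = [(1, p) for p in (300000, 310000, 320000, 330000, 340000,
                         350000, 360000, 370000, 5000000)]
rows += [(2, p) for p in (400000, 420000, 440000, 460000)]

cleaned = drop_price_outliers(rows)
print(len(rows), "rows before,", len(cleaned), "rows after")
```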

Results

Although we used a careful and thorough approach to selecting and refining our linear regression model, we found that it was not able to accurately predict the price of a house in King County. This was primarily due to the fact that some of the linear regression assumptions were not met.

One of the key assumptions of linear regression is that the relationship between the dependent variable (in this case, price) and the independent variables (such as sqft_living and floors) is linear. However, the price vs floors boxplot showed outliers that did not follow the trend of the rest of the data. This indicated that the linearity assumption did not hold and reduced the accuracy of our predictions.

In addition, there were other assumptions that were not met, such as the assumption of normality (i.e. that the residuals are normally distributed) and the assumption of homoscedasticity (i.e. that the variance of the residuals is constant across all levels of the independent variables). These violations also contributed to the poor performance of our model.
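As a rough illustration of what these two checks look at, the sketch below computes the skewness of the residuals (near 0 for a symmetric, normal-like distribution) and the ratio of residual variance between the upper and lower halves of the fitted values (near 1 under homoscedasticity). These are informal diagnostics, not the formal tests a statistics package would run, and the fitted values and residuals are invented.

```python
from statistics import mean, pvariance


def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def variance_ratio(fitted, residuals):
    """Residual variance for high fitted values divided by that for low ones."""
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    return pvariance(high) / pvariance(low)


# Hypothetical fitted values and residuals whose spread grows with the fit.
fitted = [100, 200, 300, 400, 500, 600, 700, 800]
residuals = [-5, 5, -10, 10, -40, 40, -80, 80]

print("skewness:", skewness(residuals))
print("variance ratio:", variance_ratio(fitted, residuals))
```

A variance ratio far from 1, as in this made-up example, is the kind of pattern that signals heteroscedasticity.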

Overall, it's important to remember that linear regression is a powerful tool for modeling the relationship between variables, but it relies heavily on its assumptions being met. When they are not, the accuracy of the model can suffer, and alternative methods may need to be considered.

Next Steps

If the assumptions of the linear regression model are not met and it's not able to accurately predict house prices, there are several possible next steps to consider:

  1. Explore alternative modeling techniques: There are many other modeling techniques that can be used to predict house prices, such as decision trees, random forests, support vector machines, and neural networks. These methods may be better suited for the data and may be able to capture more complex relationships between the variables.

  2. Collect more data: Sometimes, the poor performance of a model can be attributed to a lack of data or a lack of diversity in the data. Collecting more data, or collecting data from different sources, may help to improve the accuracy of the model.

  3. Transform the data: If the assumptions of the linear regression model are not met, it may be possible to transform the data to meet these assumptions. For example, we could transform the floors variable by taking the logarithm of its values to achieve a more linear relationship with price.

  4. Remove outliers: Outliers can have a significant impact on the performance of a model. Removing outliers may help to improve the accuracy of the model and reduce the impact of violations of the linear regression assumptions.

  5. Consult with domain experts: Sometimes, the insights and knowledge of domain experts can help to identify important features that were not considered in the original analysis. This can help to improve the accuracy of the model and ensure that it's capturing all relevant factors.
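To illustrate the transformation idea from step 3, the sketch below shows how a logarithm reduces right skew. It is applied to a made-up, price-like variable that grows geometrically; which variable to transform in the real data is a modeling decision, and nothing here is a result from this analysis.

```python
from math import log
from statistics import mean


def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed values: each one is 1.5 times the previous.
raw = [100000 * 1.5 ** i for i in range(8)]

# The logarithm requires strictly positive inputs.
logged = [log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before: {skew_raw:.3f}, after log: {skew_log:.3f}")
```

Predictions from a model fitted on a log-transformed target are on the log scale and need to be converted back with the exponential function before being reported.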

References

  • Capitol Impact - King County Zip codes, and City names
  • Eryk Lewinson - Verifying the Assumptions of Linear Regression in Python and R
