
mboya2020/binary-classification-predictive-modeling


Multiple Linear Regression Analysis: King County House Sales

I will be running a multiple linear regression analysis on the King County House Sales dataset. The dataset contains 2197 houses and includes attributes of those houses as well as their prices. The target for this data is price, and the other variables are predictors. I will prepare the dataset for modelling by preprocessing its numeric and categorical attributes, then build a model that will ultimately yield parameters. I will also report the findings of my final model, including both predictive performance metrics and an interpretation of the fitted model parameters.

Getting Started

Information on columns:

  1. id: a notation for a house

  2. date: Date house was sold

  3. price: Price is prediction target

  4. bedrooms: Number of Bedrooms/House

  5. bathrooms: Number of bathrooms/house

  6. sqft_living: square footage of the home

  7. sqft_lot: square footage of the lot

  8. floors: Total floors (levels) in house

  9. waterfront: Whether the house has a view of a waterfront

  10. view: Has been viewed

  11. condition: Overall condition of the house. 1 indicates a worn-out property and 5 an excellent one.

  12. grade: Overall grade given to the housing unit, based on the King County grading system. 1 is poor, 13 is excellent.

  13. sqft_above: Square footage of house apart from basement

  14. sqft_basement: Square footage of the basement

  15. yr_built: Year built

  16. yr_renovated: Year when house was renovated

  17. zipcode: zipcode

  18. lat: Latitude coordinate

  19. long: Longitude coordinate

  20. sqft_living15: Square footage of interior living space for the nearest 15 neighbors

  21. sqft_lot15: Square footage of the lots of the nearest 15 neighbors

Data Cleaning

First, I checked the dataset with the info() method.

I observed that sqft_basement is cast as an object rather than a float. I checked the unique values to make sure that all of them are, in fact, numbers before casting the column to float.

All values are numbers except for a placeholder listed as '?'. I replaced every '?' with the mean of the column and cast the column to float.
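The cleaning step above could look roughly like the following sketch. It assumes the data lives in a pandas DataFrame; the four-row frame here is made-up illustration data, not rows from the actual dataset, and the mean is taken over the numeric entries only.

```python
import pandas as pd

# Hypothetical stand-in for the real data: sqft_basement arrives as strings.
df = pd.DataFrame({"sqft_basement": ["0.0", "400.0", "?", "800.0"]})

# Inspect the distinct values to spot non-numeric placeholders such as '?'.
print(df["sqft_basement"].unique())

# Coerce anything non-numeric (here '?') to NaN, then fill with the mean
# of the remaining numeric values and store the result as float.
basement = pd.to_numeric(df["sqft_basement"], errors="coerce")
df["sqft_basement"] = basement.fillna(basement.mean()).astype(float)

print(df["sqft_basement"].dtype)
```

Using `pd.to_numeric(..., errors="coerce")` rather than a string replace means any other unexpected placeholder would also become NaN instead of raising an error.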

I then checked for null values again, this time using the isna().sum() method.

Null values are present in yr_renovated, view, and waterfront. I filled the null values in view with the mean and checked the value counts of both yr_renovated and waterfront.
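As a rough sketch of the null-value check and the mean fill for view, assuming a pandas DataFrame; the values below are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the three columns that contain nulls.
df = pd.DataFrame({
    "view": [0.0, np.nan, 2.0, 4.0],
    "waterfront": [0.0, np.nan, 1.0, 0.0],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
})

# Count missing values per column.
null_counts = df.isna().sum()
print(null_counts)

# Fill missing 'view' entries with the column mean (NaN is skipped by mean()).
df["view"] = df["view"].fillna(df["view"].mean())

# Inspect the distribution of the other two columns, keeping NaN visible.
print(df["waterfront"].value_counts(dropna=False))
print(df["yr_renovated"].value_counts(dropna=False))
```

Passing `dropna=False` to `value_counts` keeps the missing entries in the tally, which helps when deciding how to treat them.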

Multiple Regression Model

A linear regression model makes the following assumptions:

  1. The variables are multivariate normal

  2. There is little to no multicollinearity in the data

  3. The data is homoscedastic

  4. The relationship between the independent and dependent variables is linear
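One way to screen for the multicollinearity assumption is a pairwise correlation matrix. The sketch below is not necessarily the check used in this analysis: the data is synthetic, and the 0.75 cutoff is an assumed threshold chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features: sqft_above is built to track sqft_living closely,
# while bedrooms is drawn independently.
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 4000, size=200)
features = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": 0.8 * sqft_living + rng.normal(0, 50, size=200),
    "bedrooms": rng.integers(1, 6, size=200).astype(float),
})

# Absolute pairwise Pearson correlations between predictors.
corr = features.corr().abs()
threshold = 0.75  # assumed cutoff, not taken from the analysis

# Collect each pair once (upper triangle) whose correlation exceeds the cutoff.
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)
```

Pairs flagged this way are candidates for dropping one of the two features before fitting.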

To determine which predictors best explain the target variable, price, I evaluated these assumptions against the data. Ultimately, condition, bedrooms, floors, and square feet of living space were the features (after log normalization) that satisfied these criteria. These were the outcomes:
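To illustrate what fitting on log-transformed variables involves, here is a minimal sketch using ordinary least squares via NumPy. Everything in it is assumed for illustration: the data is synthetic, a single predictor is used, and the generating coefficient of 0.9 is a made-up value, not a result from this analysis.

```python
import numpy as np

# Synthetic data with a known power-law relationship plus noise.
rng = np.random.default_rng(42)
n = 500
sqft_living = rng.uniform(500, 4000, size=n)
true_elasticity = 0.9  # made-up value used only to generate the data
noise = np.exp(rng.normal(0, 0.1, size=n))
price = np.exp(5.0) * sqft_living ** true_elasticity * noise

# Log-transform both the target and the predictor.
log_price = np.log(price)
log_sqft = np.log(sqft_living)

# Ordinary least squares: design matrix with an intercept column.
X = np.column_stack([np.ones(n), log_sqft])
coef, *_ = np.linalg.lstsq(X, log_price, rcond=None)
intercept, slope = coef
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

The fitted slope should land close to the generating value, which is what makes the coefficient readable as a percent-for-percent effect.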

Findings

All the features in model 3 are log-normalized, so a 1% increase in an independent variable changes the dependent variable, price, by approximately the coefficient, expressed in percent.

Increasing square footage of living space by 10% yields an increase of 9.3% in price.

Adding a floor (level) yields an increase of 8% in price.

An increase in bedrooms leads to a 3.0% decrease in house price.
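The square-footage figure can be checked with a little arithmetic. In a log-log model, raising a predictor by p% multiplies price by (1 + p/100) raised to the coefficient. The sketch below uses a coefficient of 0.93, which is an assumed value consistent with the 9.3% reported above. The fitted coefficient itself is not listed here, and the function name is hypothetical.

```python
def pct_change_in_price(coefficient: float, pct_increase: float) -> float:
    """Exact percent change in price for a log-log coefficient."""
    return ((1 + pct_increase / 100) ** coefficient - 1) * 100


assumed_coef = 0.93  # assumed for illustration; not taken from the model output

# Exact effect of a 10% increase versus the "coefficient times percent" shortcut.
exact = pct_change_in_price(assumed_coef, 10)
approx = assumed_coef * 10
print(f"exact: {exact:.2f}%, rule of thumb: {approx:.1f}%")
```

The shortcut and the exact value agree closely for small changes and drift apart as the percentage increase grows.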
