I will be running a multiple linear regression analysis on the King County House Sales dataset. The dataset contains 2197 houses and includes attributes of those houses as well as their prices. The target for this data is price, and the other variables are predictors. I will prepare the dataset for modelling by preprocessing its numeric and categorical attributes, then build a model that will ultimately yield parameters. I will also report the findings of my final model, including both predictive performance metrics and an interpretation of the fitted model parameters.
- id: Unique identifier for a house
- date: Date the house was sold
- price: Sale price (the prediction target)
- bedrooms: Number of bedrooms/house
- bathrooms: Number of bathrooms/house
- sqft_living: Square footage of the home
- sqft_lot: Square footage of the lot
- floors: Total floors (levels) in the house
- waterfront: Whether the house has a view of a waterfront
- view: Has been viewed
- condition: Overall condition of the house. 1 indicates a worn-out property and 5 excellent.
- grade: Overall grade given to the housing unit, based on the King County grading system. 1 is poor, 13 is excellent.
- sqft_above: Square footage of the house apart from the basement
- sqft_basement: Square footage of the basement
- yr_built: Year built
- yr_renovated: Year the house was renovated
- zipcode: Zip code
- lat: Latitude coordinate
- long: Longitude coordinate
- sqft_living15: Living room area
- sqft_lot15: Lot size area
First, I checked the dataset's info() output.
I observed that sqft_basement is stored as an object rather than a float. I checked the unique values to make sure that all of them are, in fact, numbers before casting to float.
All values are numbers except one that is listed as a '?'. I replaced every '?' with the mean of the column and cast the column to float.
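The cleanup described above could look roughly like the following. This is a minimal sketch on a tiny made-up frame rather than the real data; the helper name `clean_sqft_basement` and the sample values are assumptions for illustration only.

```python
import pandas as pd


def clean_sqft_basement(df: pd.DataFrame) -> pd.DataFrame:
    """Replace '?' placeholders in sqft_basement with the column mean and cast to float."""
    out = df.copy()
    # errors="coerce" turns anything non-numeric (here the '?') into NaN
    numeric = pd.to_numeric(out["sqft_basement"], errors="coerce")
    # the mean is computed over the valid numeric entries only
    out["sqft_basement"] = numeric.fillna(numeric.mean()).astype(float)
    return out


# Hypothetical sample rows, not taken from the real dataset
sample = pd.DataFrame({"sqft_basement": ["0.0", "400.0", "?", "800.0"]})
print(sample["sqft_basement"].unique())  # inspect the raw values first
cleaned = clean_sqft_basement(sample)
print(cleaned["sqft_basement"].dtype)
```

Coercing first and averaging afterwards avoids trying to take the mean of a string column, which is why the '?' is converted to NaN before it is filled.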
I then checked for null values again, this time using the isna().sum() method.
Null values are present in yr_renovated, view, and waterfront. I filled the null values in view with the column mean and checked the value counts for both yr_renovated and waterfront.
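As a rough illustration of the null check and fill described above, here is a small sketch. The sample frame and its values are invented for the example and do not come from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows with gaps in the three affected columns
sample = pd.DataFrame({
    "view": [0.0, np.nan, 2.0, 4.0],
    "waterfront": [0.0, np.nan, 1.0, 0.0],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
})

null_counts = sample.isna().sum()  # number of nulls per column
print(null_counts[null_counts > 0])

# Fill view with its mean, as described above
sample["view"] = sample["view"].fillna(sample["view"].mean())

# dropna=False keeps NaN as its own bucket so the gaps stay visible
for col in ["yr_renovated", "waterfront"]:
    print(sample[col].value_counts(dropna=False))
```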
Multiple linear regression rests on the following assumptions:
- The variables are multivariate normal
- There is little to no multicollinearity in the data
- The data is homoscedastic
- The relationship between the independent and dependent variables is linear
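One simple way to screen for the multicollinearity assumption listed above is to look for predictor pairs with a high pairwise correlation. The sketch below is only an illustration: the helper `high_correlation_pairs`, the 0.75 cutoff, and the synthetic data are all assumptions, not the procedure or threshold used in this analysis.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.75):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| above threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs


# Synthetic predictors: sqft_above is built to track sqft_living closely
rng = np.random.default_rng(0)
sqft_living = rng.normal(2000, 500, size=200)
sample = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": sqft_living * 0.8 + rng.normal(0, 50, size=200),
    "bedrooms": rng.integers(1, 6, size=200).astype(float),
})
print(high_correlation_pairs(sample))
```

A flagged pair suggests that one of the two columns may be a candidate for removal before fitting, although pairwise correlation alone does not catch every form of multicollinearity.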
To determine which predictors best explain the target variable, price, I evaluated these assumptions against the data. Ultimately, condition, bedrooms, floors, and square feet of living space were the features that (after log normalization) satisfied these criteria. These were the outcomes:
All the features in model 3 are log normalized, so a 1% increase in an independent variable yields a change in the dependent variable, price, of approximately the coefficient expressed in percent.
Increasing square footage of living space by 10% yields an increase of 9.3% in price
Adding a floor (level) yields an increase of 8% in price
An increase in bedrooms leads to a decrease in house price of 3.0%
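The "coefficient as percent" reading of a log-log model is an approximation that holds for small changes. The sketch below shows the exact calculation; the function name `pct_change_in_price` and the coefficient of 0.5 are hypothetical and are not values from model 3.

```python
def pct_change_in_price(coef: float, pct_increase: float) -> float:
    """Percent change in price implied by a log-log coefficient.

    For log(price) = b0 + coef * log(x), scaling x by (1 + p) scales
    price by (1 + p) ** coef.
    """
    return ((1 + pct_increase / 100) ** coef - 1) * 100


# Hypothetical coefficient, chosen for illustration only
example_coef = 0.5
print(round(pct_change_in_price(example_coef, 1), 3))   # close to 0.5 for a 1% change
print(round(pct_change_in_price(example_coef, 10), 3))  # exact effect of a 10% change
```

For a 1% change the result is nearly equal to the coefficient itself, while for larger changes such as 10% the exact figure drifts away from ten times the coefficient.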