### Multiple Linear Regression

In this practice, we will use the same data set as in the simple linear regression practice. We will add more variables to the models to see whether we can get a better linear model.

#### Read the data

Load the Framingham dataset from the directory '/dsa/data/all_datasets/framingham/'. The following few lines are the same as in the simple linear regression practice; we are creating the same data here.

In [None]:
framingham_data <- read.csv("/dsa/data/all_datasets/framingham/framingham.csv")

In [2]:
framingham_data["pulseP"] <- framingham_data$sysBP - framingham_data$diaBP

framingham_data_male   <- subset(framingham_data, male==1 & age > 18 & BPMeds == 0, select=c(2,11:14,17))
framingham_data_female <- subset(framingham_data, male==0 & age > 18 & BPMeds == 0, select=c(2,11:14,17))
head(framingham_data_male)

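| | age | sysBP | diaBP | BMI | heartRate | pulseP |
|---|---|---|---|---|---|---|
| 1 | 39 | 106.0 | 70 | 26.97 | 80 | 36.0 |
| 3 | 48 | 127.5 | 80 | 25.34 | 75 | 47.5 |
| 9 | 52 | 141.5 | 89 | 26.36 | 76 | 52.5 |
| 10 | 43 | 162.0 | 107 | 23.61 | 93 | 55.0 |
| 13 | 46 | 142.0 | 94 | 26.31 | 98 | 48.0 |
| 17 | 48 | 138.0 | 90 | 22.37 | 64 | 48.0 |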


**Activity 1:** Now, let's see if we can model pulse pressure with multiple independent variables. 

In [3]:
# Fill in the partially complete code and execute it.
pp_female1 <- lm(pulseP ~ age + BMI, data=framingham_data_female)
summary(pp_female1)

# Add heartRate to the pp_female1 model to create a new model named pp_female2.
pp_female2 <- lm(pulseP ~ age + BMI + heartRate , data=framingham_data_female)
summary(pp_female2)



As we can see, $R^2$ increases slightly when a new variable is added to the model. Let's do the same for males.

In [None]:
pp_male1 <- lm(pulseP ~ age + BMI, data=framingham_data_male)
summary(pp_male1)

pp_male2 <- lm(pulseP ~ age + BMI + heartRate, data=framingham_data_male)
summary(pp_male2)


For males, we cannot model the pulse pressure all that well, and $R^2$ does not get any better when heartRate is added.
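
Because plain $R^2$ can only stay the same or go up when a predictor is added to a nested model, a small increase by itself says little. The sketch below shows one common way to check whether an extra variable earns its place: compare adjusted $R^2$ and run an F-test with `anova()`. It uses simulated stand-in data (the `sim` data frame and its coefficients are invented for illustration, not taken from the Framingham file), so it runs without the course data.

```r
# Simulated stand-in data (hypothetical; not the Framingham file)
set.seed(42)
n <- 200
sim <- data.frame(
  age       = runif(n, 30, 70),
  BMI       = rnorm(n, 26, 4),
  heartRate = rnorm(n, 75, 12)
)
# By construction pulseP depends on age and BMI only; heartRate is noise
sim$pulseP <- 20 + 0.6 * sim$age + 0.4 * sim$BMI + rnorm(n, sd = 10)

m1 <- lm(pulseP ~ age + BMI, data = sim)
m2 <- lm(pulseP ~ age + BMI + heartRate, data = sim)

# Plain R^2 never decreases when a term is added to a nested model
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
# Adjusted R^2 penalizes the extra term, so it can go down
adj_r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
print(r2)
print(adj_r2)

# F-test: does the larger model explain significantly more variance?
cmp <- anova(m1, m2)
print(cmp)
```

The same three calls could be applied to `pp_male1` and `pp_male2`, or to the female models, once the Framingham subsets are loaded.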

#### House sales data
Let's look at another data set: house sales in King County.

In [None]:
housing_data <- read.csv("/dsa/data/all_datasets/house_sales_in_king_county/kc_house_data.csv",header=TRUE)
str(housing_data)

**Activity 2:** Fit a linear regression model to predict the house sale price using sqft_living.

In [None]:
# Fill in the partially complete code and execute it.

houseprice_reg1 <- lm(price ~ sqft_living, data=housing_data)
summary(houseprice_reg1)

As we can see, sqft_living is a good predictor for the price. Let's see if we can improve this model with additional variables.

In [None]:
# add the second variable: bedrooms
houseprice_reg2 <- lm(price ~ sqft_living + bedrooms, data=housing_data)
summary(houseprice_reg2)

In [None]:
# add the third variable: sqft_lot

houseprice_reg3 <- lm(price ~ sqft_living + bedrooms + sqft_lot, data=housing_data)
summary(houseprice_reg3)

In [None]:
# add the fourth variable: floors

houseprice_reg4 <- lm(price ~ sqft_living + bedrooms + sqft_lot + floors, data=housing_data)
summary(houseprice_reg4)

In [None]:
# add the fifth variable: bathrooms

houseprice_reg5 <- lm(price ~ sqft_living + bedrooms + sqft_lot + floors + bathrooms, data=housing_data)
summary(houseprice_reg5)
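
Reading five separate `summary()` outputs makes it hard to see how $R^2$ moves from one model to the next. As a minimal sketch, the fits can be kept in a named list and their $R^2$ and adjusted $R^2$ collected into one table. The example uses R's built-in `mtcars` data as a stand-in for `housing_data` (an assumption made only so the block runs on its own); the names `formulas`, `models`, and `r2_table` are illustrative.

```r
# Built-in mtcars stands in for housing_data (assumption for illustration)
formulas <- list(
  reg1 = mpg ~ wt,
  reg2 = mpg ~ wt + hp,
  reg3 = mpg ~ wt + hp + qsec
)

# Fit one lm() per formula; each model nests the previous one
models <- lapply(formulas, lm, data = mtcars)

# Collect R^2 and adjusted R^2 side by side
r2_table <- data.frame(
  model  = names(models),
  r2     = sapply(models, function(m) summary(m)$r.squared),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names = NULL
)
print(r2_table)
```

With the housing models, the list would hold the `price ~ ...` formulas from the cells above and `data = housing_data`.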

Adding the number of bedrooms as another variable helped to improve the model, but the other additional variables
(the lot's square footage, the number of floors, the number of bathrooms) did not improve the model at all. Let's try
a couple of variables that might make a difference: waterfront and view.

In [None]:
houseprice_reg6 <- lm(price ~ sqft_living + bedrooms + waterfront + view, data=housing_data)
summary(houseprice_reg6)

$R^2$ jumped to **0.56**; this is a better model for predicting the price of the house. The other variables (lat, long, zip code, etc.) are not really expected to make a difference, because we don't expect a **linear** relationship between a house's price and its zip code unless zip codes are demographically meaningful. Let's try it and see.

In [None]:
# add zipcode to houseprice_reg6
houseprice_reg7 <- lm(price ~ sqft_living + bedrooms + waterfront + view + zipcode, data=housing_data)
summary(houseprice_reg7)

As we expected, zipcode does not make much of a difference. How about latitude or longitude? Depending on the geography of King County, either one might make a difference.

In [None]:
# add lat to the model houseprice_reg6
houseprice_reg8 <- lm(price ~ sqft_living + bedrooms + waterfront + view + lat, data=housing_data)
summary(houseprice_reg8)

# add long (instead of lat) to the model houseprice_reg6
houseprice_reg9 <- lm(price ~ sqft_living + bedrooms + waterfront + view + long, data=housing_data)
summary(houseprice_reg9)

Latitude made a big difference! $R^2$ is **0.63**. Let's find out why. Take a look at the [King County map](https://www.google.com/maps/place/King+County,+WA/@47.4319563,-122.3574591,9z/data=!3m1!4b1!4m5!3m4!1s0x54905c8c832d7837:0xe280ab6b8b64e03e!8m2!3d47.5480339!4d-121.9836029).
Now it should be clear why a north-to-south change in location (which is what latitude measures) has an effect on the house price.
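
Returning to the zip code result: when `zipcode` is read as a number (as `read.csv` does for an all-digit column), `lm()` fits a single slope to it, as if a larger code meant a steadily higher or lower price. The sketch below illustrates that point on simulated data, comparing the numeric treatment with `factor(zipcode)`, which gives each code its own level. The price levels are invented for illustration and the codes are used only as labels; nothing here comes from the King County file.

```r
set.seed(1)
codes <- c(98001, 98004, 98023, 98039, 98118)  # used only as labels
# Invented price levels, unrelated to the numeric size of the code
level <- c(300, 900, 280, 1500, 450)

idx <- sample(seq_along(codes), 500, replace = TRUE)
sim <- data.frame(
  zipcode = codes[idx],
  price   = level[idx] + rnorm(500, sd = 50)
)

# Numeric zipcode: one slope across all codes
fit_numeric <- lm(price ~ zipcode, data = sim)
# Factor zipcode: one estimated level per code
fit_factor <- lm(price ~ factor(zipcode), data = sim)

r2_numeric <- summary(fit_numeric)$r.squared
r2_factor  <- summary(fit_factor)$r.squared
print(c(numeric = r2_numeric, factor = r2_factor))
```

Whether the factor version helps on the real data is a separate question; it adds one coefficient per zip code, so adjusted $R^2$ is the fairer comparison there.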