In [3]:
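# Load libraries
library(tidyverse)
library(repr)
library(tidymodels)
# Install GGally and ISLR only if they are not already available
if (!requireNamespace("GGally", quietly = TRUE)) install.packages("GGally")
library(GGally)
if (!requireNamespace("ISLR", quietly = TRUE)) install.packages("ISLR")
library(ISLR)
# Show at most 6 rows when printing data frames
options(repr.matrix.max.rows = 6)
#source("tests.R")
#source("cleanup.R")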

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done




# Finding investment properties in Melbourne
##### Name: Jun Zheng
##### Student Number: 10827335

# 1.Introduction
In recent years, housing prices around the world have shifted in response to market conditions, and Melbourne is no exception: its prices have fluctuated and attracted considerable discussion. Known for its diversity and cultural scene, Melbourne is a popular property market that draws investors from all over the world, which makes its house prices a natural subject for research. In this project we focus on Melbourne house prices and aim to build a model that predicts them accurately. Using information on a number of influencing factors, we want to answer the question, "Which homes in Melbourne are undervalued or overvalued?" The crux of the question is whether a given property is worth investing in. To answer it, we use the Melbourne Housing Dataset, which has 21 variables. We will select 8 of them as the main influencing factors, using the correlation function `cor()` to identify the variables most strongly related to price. We then model price on these 8 factors and use the predictions to judge whether a house in Melbourne is worth buying at a given price.



# 2.Data Analysis

## 2.1 Cleaning Data

### 2.1.1 Data Information
`Rooms`: Number of rooms

`Price`: Price in dollars

`Method`: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

`Type`: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

`SellerG`: Real Estate Agent

`Date`: Date sold

`Distance`: Distance from CBD (in kilometres)

`Regionname`: General Region (West, North West, North, North East, etc.)

`Propertycount`: Number of properties that exist in the suburb.

`Bedroom2`: Number of bedrooms (scraped from a different source)

`Bathroom`: Number of Bathrooms

`Car`: Number of car spots

`Landsize`: Land Size

`BuildingArea`: Building Size

`CouncilArea`: Governing council for the area

### 2.1.2 Reading Data

In [4]:
library(tidyverse)
housing_data <- read_csv("https://raw.githubusercontent.com/jun2021/DSCI-100-Group-Project/main/melb_data.csv")
head(housing_data)

Rows: 13580 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): Suburb, Address, Type, Method, SellerG, Date, CouncilArea, Regionname
dbl (13): Rooms, Price, Distance, Postcode, Bedroom2, Bathroom, Car, Landsiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,⋯,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
<chr>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<dbl>
Abbotsford,85 Turner St,2,h,1480000,S,Biggin,3/12/2016,2.5,3067,⋯,1,1,202,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019
Abbotsford,25 Bloomburg St,2,h,1035000,S,Biggin,4/02/2016,2.5,3067,⋯,1,0,156,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019
Abbotsford,5 Charles St,3,h,1465000,SP,Biggin,4/03/2017,2.5,3067,⋯,2,0,134,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019
Abbotsford,40 Federation La,3,h,850000,PI,Biggin,4/03/2017,2.5,3067,⋯,2,1,94,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019
Abbotsford,55a Park St,4,h,1600000,VB,Nelson,4/06/2016,2.5,3067,⋯,1,2,120,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019
Abbotsford,129 Charles St,2,h,941000,S,Jellis,7/05/2016,2.5,3067,⋯,1,0,181,,,Yarra,-37.8041,144.9953,Northern Metropolitan,4019


### 2.1.3 Selecting Main Variables

In [5]:
# Drop rows that contain missing values (NA)
housing_data <- housing_data |> 
                na.omit()
head(housing_data)

Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,⋯,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
<chr>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<dbl>
Abbotsford,25 Bloomburg St,2,h,1035000,S,Biggin,4/02/2016,2.5,3067,⋯,1,0,156,79,1900,Yarra,-37.8079,144.9934,Northern Metropolitan,4019
Abbotsford,5 Charles St,3,h,1465000,SP,Biggin,4/03/2017,2.5,3067,⋯,2,0,134,150,1900,Yarra,-37.8093,144.9944,Northern Metropolitan,4019
Abbotsford,55a Park St,4,h,1600000,VB,Nelson,4/06/2016,2.5,3067,⋯,1,2,120,142,2014,Yarra,-37.8072,144.9941,Northern Metropolitan,4019
Abbotsford,124 Yarra St,3,h,1876000,S,Nelson,7/05/2016,2.5,3067,⋯,2,0,245,210,1910,Yarra,-37.8024,144.9993,Northern Metropolitan,4019
Abbotsford,98 Charles St,2,h,1636000,S,Nelson,8/10/2016,2.5,3067,⋯,1,2,256,107,1890,Yarra,-37.806,144.9954,Northern Metropolitan,4019
Abbotsford,10 Valiant St,2,h,1097000,S,Biggin,8/10/2016,2.5,3067,⋯,1,2,220,75,1900,Yarra,-37.801,144.9989,Northern Metropolitan,4019
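
Because `na.omit()` removes every row that has at least one `NA`, it can be useful to check how many values are missing in each column before dropping anything. The sketch below illustrates the idea with base R on a small made-up data frame; `toy_housing` and its values are hypothetical and are not taken from the real dataset.

```r
# Toy data frame standing in for housing_data (values are made up)
toy_housing <- data.frame(
  Price        = c(1035000, 1465000, 850000, 1600000),
  BuildingArea = c(79, 150, NA, 142),
  YearBuilt    = c(1900, 1900, NA, NA)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy_housing))
print(na_counts)

# Compare the number of rows before and after dropping incomplete rows
rows_before <- nrow(toy_housing)
rows_after  <- nrow(na.omit(toy_housing))
cat("Rows before:", rows_before, "Rows after:", rows_after, "\n")
```

If one or two columns account for most of the missing values, excluding those columns instead of dropping rows may preserve more data; whether that trade-off is worthwhile depends on how important the columns are for the model.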


To estimate house prices in Melbourne, we need numeric predictors. We therefore select the numeric columns of the data frame and plot their correlations with `Price` to decide which factors to use in the model.

In [6]:
housing_numeric <- housing_data |>
                select(Price, Rooms, Distance, Bedroom2, Bathroom, Car, Landsize, BuildingArea, YearBuilt, Lattitude,
                      Longtitude, Propertycount)
head(housing_numeric)

Price,Rooms,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1035000,2,2.5,2,1,0,156,79,1900,-37.8079,144.9934,4019
1465000,3,2.5,3,2,0,134,150,1900,-37.8093,144.9944,4019
1600000,4,2.5,3,1,2,120,142,2014,-37.8072,144.9941,4019
1876000,3,2.5,4,2,0,245,210,1910,-37.8024,144.9993,4019
1636000,2,2.5,2,1,2,256,107,1890,-37.806,144.9954,4019
1097000,2,2.5,3,1,2,220,75,1900,-37.801,144.9989,4019


In [6]:
# Set the seed so the split is reproducible
set.seed(2023)
# Split the housing data: 60% training, 40% testing, stratified by Price
housing_split <- initial_split(housing_data, prop = 0.6, strata = Price)
housing_testing <- testing(housing_split)
housing_training <- training(housing_split)

In [23]:
cor(housing_training$Price, housing_training$Distance)
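
The single correlation above can be extended to every numeric predictor at once, which is one possible way to rank candidate factors by how strongly they relate to `Price`. The following is a minimal sketch on simulated data: `toy_training`, its columns, and the coefficients used to generate `Price` are assumptions for illustration, and ranking by absolute correlation is only one of several reasonable selection rules.

```r
# Toy training data standing in for the numeric columns of housing_training
# (values are simulated for illustration only)
set.seed(1)
n <- 200
toy_training <- data.frame(
  Rooms    = sample(1:5, n, replace = TRUE),
  Distance = runif(n, 1, 40),
  Car      = sample(0:3, n, replace = TRUE)
)
toy_training$Price <- 300000 + 250000 * toy_training$Rooms -
  8000 * toy_training$Distance + rnorm(n, sd = 50000)

# Correlation of every predictor with Price
predictors <- setdiff(names(toy_training), "Price")
price_cor <- sapply(predictors, function(v) cor(toy_training$Price, toy_training[[v]]))

# Rank predictors by the absolute value of their correlation
ranked <- sort(abs(price_cor), decreasing = TRUE)
print(ranked)

# Keep the strongest ones (the report selects 8; the toy data only has 3)
top_factors <- names(head(ranked, 2))
print(top_factors)
```

Correlation only captures linear, pairwise relationships, so a predictor with a low correlation can still be useful in combination with others.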

In [16]:
#housing_training_1 <- housing_training |>
                    #select(Price, Rooms, Type, Method, SellerG, Date, Distance, Postcode, Bedroom2, Bathroom)
#housing_training_1
#housing_training_2 <- housing_training |>
                    #select(Price, Car, Landsize, BuildingArea, YearBuilt, CouncilArea, Lattitude, Longtitude, Regionname, Propertycount)
#housing_training_2



#housing_training_1 <- housing_training_1[, !(names(housing_training_1) %in% c("SellerG"))]
#ggpairs(housing_training_1, cardinality_threshold = 200)  # Adjust the threshold as needed


## 2.2 Plotting

In [None]:
housing_data <- as.data.frame(housing_data)
housing_data


In [None]:
### Correlation Between Price and Rooms

In [None]:
# Fit a simple linear regression of log(Price) on Rooms
linear_model <- lm(log(Price) ~ Rooms, data = housing_data)
linear_model
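
Because the response is `log(Price)`, the slope on `Rooms` is read as an approximate proportional change in price per additional room rather than a dollar amount. The sketch below shows the conversion on simulated data; the data frame `toy_housing` and the coefficients used to generate it are assumptions for illustration, not estimates from the Melbourne data.

```r
# Simulated data: each extra room multiplies price by about exp(0.3)
set.seed(2)
toy_housing <- data.frame(Rooms = sample(1:6, 300, replace = TRUE))
toy_housing$Price <- exp(12.5 + 0.3 * toy_housing$Rooms + rnorm(300, sd = 0.2))

# Same model form as above: log(Price) regressed on Rooms
toy_model <- lm(log(Price) ~ Rooms, data = toy_housing)

# The slope is on the log scale
slope <- coef(toy_model)[["Rooms"]]

# Convert to a percentage change in price per additional room
pct_change <- (exp(slope) - 1) * 100
cat("Slope on log scale:", round(slope, 3), "\n")
cat("Approximate % change in price per extra room:", round(pct_change, 1), "\n")
```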

## 2.3 Model Fitting

In [7]:
# Specify a linear regression model fitted with the "lm" engine
lm_spec <- linear_reg() |>
           set_engine("lm") |>
           set_mode("regression")

In [9]:
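# Recipe: predict Price from all other columns in the training data.
# Note: this includes character columns such as Address and Suburb.
lm_recipe <- recipe(Price ~ ., data = housing_training)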

lm_fit  <- workflow() |>
           add_recipe(lm_recipe) |>
           add_model(lm_spec) |>
           fit(data = housing_training)
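
Once a workflow has been fitted, a natural next step is to predict on the held-out test set and compare predicted prices with actual prices. The sketch below is one possible way to do this, using simulated data. The objects `toy_housing`, `toy_fit`, and `toy_flagged` are hypothetical, only two predictors are used, and labelling a home "undervalued" when it sold below the model's estimate is an assumption made for illustration rather than the method of this report.

```r
library(tidymodels)

# Simulated stand-in for the housing data (values are made up)
set.seed(2023)
n <- 500
toy_housing <- tibble(
  Rooms    = sample(1:5, n, replace = TRUE),
  Distance = runif(n, 1, 40)
)
toy_housing <- toy_housing |>
  mutate(Price = 300000 + 250000 * Rooms - 8000 * Distance + rnorm(n, sd = 80000))

# Same split settings as the report
toy_split <- initial_split(toy_housing, prop = 0.6, strata = Price)
toy_training <- training(toy_split)
toy_testing  <- testing(toy_split)

# Same model specification and workflow structure as above
toy_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

toy_fit <- workflow() |>
  add_recipe(recipe(Price ~ Rooms + Distance, data = toy_training)) |>
  add_model(toy_spec) |>
  fit(data = toy_training)

# Predict on the held-out test set and attach the predictions
toy_predictions <- predict(toy_fit, toy_testing) |>
  bind_cols(toy_testing)

# RMSE, R-squared and MAE on the test set
toy_metrics <- toy_predictions |>
  metrics(truth = Price, estimate = .pred)
print(toy_metrics)

# Residual = actual price minus predicted price.
# A negative residual means the home sold below the model's estimate.
toy_flagged <- toy_predictions |>
  mutate(residual = Price - .pred,
         valuation = if_else(residual < 0, "undervalued", "overvalued"))
print(head(toy_flagged))
```

In practice the size of the residual matters as well as its sign: a home priced slightly below the estimate is within the model's error, so a threshold based on the test-set RMSE may be a more cautious rule.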
            

# 3.Discussion

# 4.Reference