In [None]:
knitr::opts_chunk$set(echo = TRUE)

The Final Project is divided into three assignments and is presented with each question answered in its respective markdown file.

In [None]:
rm(list = ls())
getwd()

setwd("D:\\BSE\\BSE Material\\sem 2\\Data Vis\\Project")

library(dplyr)
library(ggplot2)
library(ggthemes)
library(data.table)  # assumes data.table is already installed
library(scales)




In [None]:
pp_data <- read.csv("ppdata_lite.csv")
colnames(pp_data)

A1. For the 33 London boroughs create a box-plot (or several box-plots) that compares house prices between the boroughs. Can you think of a better way to compare borough house prices (please demonstrate)?

Ans: We first filter the data to the county of Greater London.
The data covers the 33 districts in the county, and we show house prices for each of them with a box plot.

To make the data easier to visualize, we take two steps.
First, we extract the year from the date_of_transfer column and group the years into three periods: 1990's, 2000's and 2010's.
Second, we compute an upper price limit (the mean plus two standard deviations) that is used to trim extreme prices.
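
As a small illustration of the year and period step, the sketch below uses base R only. The dates are hypothetical toy values, not rows from ppdata_lite.csv, and `cut()` is swapped in here as one possible alternative to `case_when()`.

```r
# Toy transfer dates (hypothetical values, not taken from ppdata_lite.csv)
dates <- c("1995-06-30", "1999-12-31", "2004-01-15", "2010-03-01", "2016-12-01")

# POSIXlt stores the year as an offset from 1900, hence the + 1900
year <- as.POSIXlt(dates, tz = "UTC")$year + 1900

# Right-closed intervals: (1994,1999], (1999,2009], (2009,2016]
period <- cut(year,
              breaks = c(1994, 1999, 2009, 2016),
              labels = c("1990's", "2000's", "2010's"))

data.frame(dates, year, period)
```

Years outside 1995 to 2016 would come out as NA in this sketch, just as they would fall through the `case_when()` conditions.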

In [None]:
London <- pp_data %>%
  filter(county == "GREATER LONDON")
London <- London %>%
  mutate(dist = substr(district,start = 1, stop = 2))%>%
  mutate(year = as.POSIXlt(date_of_transfer)$year +1900)
head(London)
sort(unique(London$year))

London <- London %>% mutate(period = case_when(year %in% seq(1995,1999,1)~"1990's",
                            year %in% seq(2000,2009,1)~"2000's",
                            year %in% seq(2010,2016,1)~"2010's"))%>%arrange(year)
# Upper price limit: mean plus two standard deviations (no lower limit is used)
upper_limit <- mean(London$price) + 2 * sd(London$price)
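
To see what this limit does, here is a minimal sketch on a toy price vector (hypothetical values in thousands of pounds, not taken from the dataset). It shows that a single very expensive sale exceeds the mean plus two standard deviations, and that dropping it also shifts the median of what remains.

```r
# Toy prices in thousands of pounds (hypothetical values)
prices <- c(100, 120, 150, 180, 200, 250, 300, 5000)

# Same rule as in the analysis: mean plus two standard deviations
limit <- mean(prices) + 2 * sd(prices)

# Keep only prices at or below the limit
trimmed <- prices[prices <= limit]

limit
median(prices)   # median before trimming
median(trimmed)  # median after trimming
```

Because the mean and standard deviation are themselves pulled up by extreme sales, this rule is fairly lenient on skewed data such as house prices.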

Let's first draw a basic box plot. We remove the outliers, as we assume they are driven by factors we cannot account for (e.g. prime residences, luxury buildings). Our objective is to see the prices of the majority of the properties in each district.

In [None]:

# Order districts by descending mean price (reorder() uses the mean by default)
District_plot <- ggplot(London, aes(x = reorder(district, -price), y = price)) +
  geom_boxplot(aes(fill = district), outlier.shape = NA) +
  # ylim() drops prices above upper_limit before the box statistics are computed
  ylim(0, upper_limit) + theme(legend.position = "none") +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  ggtitle("Figure 1. District based Boxplots of prices") + xlab("London Districts")
options(repr.plot.width=12, repr.plot.height=8)
District_plot

The plot gives us some information about the price distribution across the London districts, with Chelsea having the highest prices in the county.
Yet it neither shows how prices changed through the decades nor isolates recent information. A better way to compare is to use the time periods: the same box plot, faceted by period, shows how district prices have differed over time.


In [None]:
# Keep only sales at or below the upper price limit
Better_London <- London %>%
  filter(price <= upper_limit)

Mod_plot <- ggplot(Better_London, aes(x = reorder(district, -price), y = price)) +
  geom_boxplot(aes(fill = district), outlier.shape = NA) +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  xlab("London Districts") + ylab("Price") +
  facet_wrap(~period, scales = "free_y", ncol = 1, nrow = 3) +
  # log2 price axis, limited to the 2nd-95th percentiles of all London prices
  scale_y_continuous(trans = "log2", labels = scales::comma,
                     limits = quantile(London$price, c(0.02, 0.95))) +
  ggtitle("Figure 2. District based Boxplots for Various Periods") +
  theme_minimal() + theme(legend.position = "none")
options(repr.plot.width=16, repr.plot.height=12)
Mod_plot
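
A numeric companion to the faceted plot can be sketched with base R's `aggregate()`, which gives one median price per district and period. The data frame below is a hypothetical toy example; on the real data the same call would be applied to `Better_London`, assuming it has the `district`, `period` and `price` columns built above.

```r
# Toy sales (hypothetical values, not taken from ppdata_lite.csv)
toy <- data.frame(
  district = c("CAMDEN", "CAMDEN", "CAMDEN", "BARNET", "BARNET", "BARNET"),
  period   = c("1990's", "2010's", "2010's", "1990's", "1990's", "2010's"),
  price    = c(100000, 400000, 600000, 80000, 120000, 300000)
)

# Median price for every district/period combination
medians <- aggregate(price ~ district + period, data = toy, FUN = median)
medians
```

The median is used here because it is less sensitive to a few very expensive sales than the mean.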


A2. Could the entire dataset be used to estimate the relationship between price of flats and floor level? If yes, how would you show that relationship in a plot?

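Ans: The question is a bit tricky. We first inspect the SAON (secondary addressable object name) values, since this is the field where a flat's floor, if any, is recorded.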

In [None]:
head(unique(London$SAON),10)
print(paste("There are",length(unique(London$SAON)),"unique values"))

Even though there are many different flats, the floors are labelled inconsistently and the variations are difficult to navigate. Moreover, some entries do not indicate the floor at all. For example, "The Flat", "Flat B" and "Flat 14" give no indication of the floor level, nor any other detail that would let us infer it. This limits us to labelled flats, which we can segregate either by floor number or by description. Therefore we first create a dataset that segregates flats based on SAON, in order to find the relation between floor levels and flat prices.

The first step involves filtering for flats and removing rows where SAON is empty.

In [None]:
Flats <- pp_data%>%
  filter(property_type == "F")

Flats_mod <- Flats %>%
  filter(SAON != "")

Based on the descriptions and floor numbers available, we then segregate the data from the ground floor to the sixth floor. Note: although some entries indicate higher floors, we believe that ground to sixth is enough to show whether a trend is present or not.

In [None]:
ground <- c("BASEMENT|GROUND|LOWER")

Floor0 <- Flats_mod[grepl(ground,Flats_mod$SAON),]

Floor0["floor"] <- "ground"

first <- c("FIRST")
second <- c("SECOND")
third <- c("THIRD")
fourth <- c("FOURTH")
fifth <- c("FIFTH")
sixth <- c("SIXTH")

Floor1 <- Flats_mod[grepl(first,Flats_mod$SAON),]
Floor1["floor"] <- "first"
Floor2 <- Flats_mod[grepl(second,Flats_mod$SAON),]
Floor2["floor"] <- "second"
Floor3 <- Flats_mod[grepl(third,Flats_mod$SAON),]
Floor3["floor"] <- "third"
Floor4 <- Flats_mod[grepl(fourth,Flats_mod$SAON),]
Floor4["floor"] <- "fourth"
Floor5 <- Flats_mod[grepl(fifth,Flats_mod$SAON),]
Floor5["floor"] <- "fifth"
Floor6 <- Flats_mod[grepl(sixth,Flats_mod$SAON),]
Floor6["floor"] <- "sixth"

Floors <- rbind(Floor0,Floor1,Floor2,Floor3,Floor4,Floor5,Floor6)
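
The same labelling idea can be sketched more compactly with a named vector of patterns and a loop. This is only an illustration on hypothetical SAON strings, and it differs from the cell above in one respect: each row receives at most one label (the first matching pattern wins), whereas `rbind()` keeps a row once for every pattern it matches. The ordered factor keeps the floors in floor order rather than alphabetical order.

```r
# Toy SAON values (hypothetical, not taken from ppdata_lite.csv)
saon <- c("GROUND FLOOR FLAT", "FIRST FLOOR", "SECOND FLOOR FLAT", "FLAT 14")

# Floor label -> pattern, listed from lowest to highest floor
patterns <- c(ground = "BASEMENT|GROUND|LOWER", first = "FIRST", second = "SECOND")

floor <- rep(NA_character_, length(saon))
for (name in names(patterns)) {
  # Label rows that match this pattern and have no label yet
  hit <- grepl(patterns[[name]], saon) & is.na(floor)
  floor[hit] <- name
}

# Ordered factor so that plots and tables follow the floor order
floor <- factor(floor, levels = names(patterns), ordered = TRUE)
data.frame(saon, floor)
```

Rows without a recognizable floor, such as "FLAT 14", stay NA and can be dropped afterwards.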

With the floors combined into one dataset, we extract the year, keep the relevant columns and plot price against year for each floor.

In [None]:
Floors <- Floors %>%
  mutate(year = as.POSIXlt(date_of_transfer)$year + 1900)%>%
  select(year,price,floor,SAON)

head(Floors)
tail(Floors)




# One GAM-smoothed price trend per floor label, on a log2 price axis
floor_plot <- Floors %>%
  ggplot(aes(x = year, y = price, group = floor, color = floor)) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 6), se = FALSE) +
  scale_y_continuous(trans = "log2", labels = comma) +
  theme_minimal() + ggtitle("Change in Prices with Floor Levels")
floor_plot



It is key to understand that the data used is not uniform. For example, we get values such as FLAT 14 TWOSIXTHIRTY when we filter for the sixth floor. Given the size of the dataset and the variations in SAON values, it does not make sense to filter outliers and distorted rows manually, so we rely on the law of large numbers.
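
One non-manual way to reduce such false matches, shown here only as a sketch on hypothetical SAON strings, is to require word boundaries around the floor name. With `\\b` on both sides, "SIXTH" no longer matches inside "TWOSIXTHIRTY".

```r
# Toy SAON values (hypothetical, modelled on the example above)
saon <- c("SIXTH FLOOR FLAT", "FLAT 14 TWOSIXTHIRTY", "FLAT 6, SIXTH FLOOR", "FIRST FLOOR")

# Plain substring match: also hits "TWOSIXTHIRTY"
loose <- grepl("SIXTH", saon)

# Word-boundary match: "SIXTH" must stand alone as a word
strict <- grepl("\\bSIXTH\\b", saon, perl = TRUE)

data.frame(saon, loose, strict)
```

This does not remove every distorted row, but it filters out the kind of accidental substring match described above.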

From the ground floor to the fourth floor there is a clear pattern of flats getting costlier. Our initial hypothesis that floor level and flat price are positively related therefore holds for this range. The picture becomes unclear for the fifth and sixth floors: the sixth floor shows no change relative to the fifth, and although both have higher prices than the ground to third floors, they show a dip and sit below flats on the fourth floor. These floors therefore contradict the hypothesis.