In [None]:
library(corrplot)
library(ggplot2)
library(tidyr)
library(dplyr)

We'll analyze the Kaggle House Prices dataset and predict the prices of houses.

In [None]:
train_data <- read.csv("data/train.csv")
test_data <- read.csv("data/test.csv")

Let's start by taking a look at which variables are available in the dataset.

In [None]:
str(train_data)

We have a mix of numerical and categorical variables with missing values. Let's take a look at the sale price variable, which is the target variable.

In [None]:
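# qplot() is deprecated in recent ggplot2 releases; this draws the same histogram
ggplot(train_data, aes(x = SalePrice)) + geom_histogram()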

The distribution of SalePrice is clearly right-skewed (positive skew), so it is not normal.
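To put a number on that skew, one option is to compute the sample skewness before and after a log transform. The sketch below is illustrative only: `sample_skewness` is a hypothetical helper that is not part of this notebook, and the prices are simulated from a log-normal distribution so that the snippet runs on its own. In the notebook you would pass `train_data$SalePrice` instead.

```r
# Hypothetical helper: moment-based sample skewness (third standardized moment).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated right-skewed prices standing in for train_data$SalePrice (assumption).
set.seed(1)
prices <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

skew_raw <- sample_skewness(prices)       # positive for right-skewed data
skew_log <- sample_skewness(log(prices))  # much closer to zero after the log
print(c(raw = skew_raw, log = skew_log))
```

A skewness near zero after taking logs would suggest that modelling `log(SalePrice)` is worth trying, although that is a modelling choice rather than something the plot alone decides.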

In [None]:
options(repr.plot.width=20, repr.plot.height=20)
nums <- unlist(lapply(train_data, is.numeric), use.names = FALSE)
# One histogram per numeric column, each facet with its own x-axis
train_data %>%
  select_if(is.numeric) %>%
  gather(cols, value) %>%
  ggplot(aes(value)) + geom_histogram() + facet_wrap(~cols, scales = "free_x")
options(repr.plot.width=8, repr.plot.height=8)

Well, many variables are not normally distributed. Also, many of them seem to be count variables with discrete values. Some have very strong skew and kurtosis, while others have zero-inflated distributions. These facts are important to understand before fitting a model.

Let's also have a look at correlation matrix plots to see if we can spot some obvious or interesting correlations. We'll use Spearman correlation since most variables are not normally distributed.

In [None]:
train_data %>% select_if(is.numeric) %>% cor(use = "complete.obs", method = "spearman") -> correlations
corrplot(correlations, method = "circle")

Some variables seem to provide little extra information over others: YearBuilt and GarageYrBlt, for example, are strongly correlated. This means there is quite a lot of multicollinearity in the data, which is very relevant if we want to fit a linear model.
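Reading redundant pairs off a large correlation plot is error-prone, so it can help to list them explicitly. The following is a minimal sketch on a small simulated data frame: the column names mirror the dataset, but the values are made up, and the 0.8 cutoff is an arbitrary assumption. In the notebook, the same steps could be applied to the `correlations` matrix computed above.

```r
# Toy data (assumption): GarageYrBlt tracks YearBuilt closely, LotArea is unrelated.
set.seed(42)
toy <- data.frame(YearBuilt = sample(1900:2010, 200, replace = TRUE))
toy$GarageYrBlt <- toy$YearBuilt + sample(0:2, 200, replace = TRUE)
toy$LotArea <- rlnorm(200, meanlog = 9, sdlog = 0.5)

cors <- cor(toy, method = "spearman")

# Blank out the diagonal and upper triangle so each pair appears only once.
cors[upper.tri(cors, diag = TRUE)] <- NA

# Reshape the matrix into one row per pair of variables.
pairs <- as.data.frame(as.table(cors), stringsAsFactors = FALSE)
names(pairs) <- c("var1", "var2", "rho")

# Keep pairs whose absolute Spearman correlation exceeds the (arbitrary) cutoff.
high_pairs <- pairs[!is.na(pairs$rho) & abs(pairs$rho) > 0.8, ]
high_pairs <- high_pairs[order(-abs(high_pairs$rho)), ]
print(high_pairs)
```

For a linear model, one variable from each listed pair is a candidate for removal or for combining with its partner.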

Let's finally take a look at the variables with the highest correlation with the target variable.

In [None]:
# Select the target column by name rather than by position
correlations[, "SalePrice"] %>% sort(decreasing = TRUE) %>% print()

Overall quality, ground living area, year built, garage capacity and the number of full bathrooms are the variables most correlated with the target variable.

Let's now turn to the categorical variables. We first convert the character columns to factors, and then replace the variables that have only two levels with their numeric level codes (1 and 2).

In [None]:
train_data <- as.data.frame(unclass(train_data), stringsAsFactors = TRUE)

train_data$Street <- as.numeric(train_data$Street)
train_data$Alley <- as.numeric(train_data$Alley)
train_data$Utilities <- as.numeric(train_data$Utilities)
train_data$CentralAir <- as.numeric(train_data$CentralAir)
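The two-level variables above were picked by hand. As a sketch of how they could be found programmatically, the snippet below counts the levels of every factor column and converts only the binary ones. It runs on a tiny made-up data frame (the values are assumptions, not rows from the dataset); on the real data it would be applied to `train_data` after the factor conversion.

```r
# Toy data (assumption) with two binary factors and one three-level factor.
toy <- data.frame(
  Street = c("Pave", "Pave", "Grvl", "Pave"),
  CentralAir = c("Y", "N", "Y", "Y"),
  Neighborhood = c("NAmes", "CollgCr", "OldTown", "NAmes"),
  stringsAsFactors = TRUE
)

# Number of levels per factor column.
level_counts <- sapply(Filter(is.factor, toy), nlevels)
print(level_counts)

# Convert only the two-level factors to their integer level codes (1 and 2).
binary_vars <- names(level_counts)[level_counts == 2]
toy[binary_vars] <- lapply(toy[binary_vars], as.numeric)
str(toy)
```

Note that `as.numeric()` on a factor returns the level codes in alphabetical level order, and missing values stay `NA`, so a column such as Alley still needs its missing values handled separately.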


In [None]:
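# Share of missing values per column, computed without hard-coding the row count
na_proportion <- sort(colMeans(is.na(train_data)), decreasing = TRUE)

# R vectors are 1-indexed, so take the first ten entries with head()
print(head(na_proportion, 10))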

We'll simply remove the variables with more than 20% missing values.
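A minimal sketch of that removal step, shown on a small made-up data frame so it runs on its own (the values and the name `toy_reduced` are assumptions). In the notebook, the same logic would use `train_data` and the missing-value proportions computed above.

```r
# Toy data (assumption): PoolQC is mostly missing, LotFrontage has one gap.
toy <- data.frame(
  LotArea = c(8450, 9600, 11250, 9550, 14260, 14115),
  PoolQC = c(NA, NA, NA, "Gd", NA, NA),
  LotFrontage = c(65, 80, NA, 60, 84, 85)
)

# Share of missing values per column.
na_share <- colMeans(is.na(toy))

# Keep the columns where at most 20% of the values are missing.
keep <- names(na_share)[na_share <= 0.2]
toy_reduced <- toy[, keep, drop = FALSE]
print(names(toy_reduced))
```

If the test set is meant to be scored later, the same columns should presumably be dropped from `test_data` as well, so that both data frames keep matching predictors.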