r_for_data_challenges.Rmd

# Exploratory data analysis and preprocessing with R for data challenges

Tianhao Wu

```{r, include=FALSE}
# this prevents package loading message from appearing in the rendered version of your problem set
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```

```{r loading libs}
# loading libraries used in solving the problems
library(tidyverse)
library(gridExtra)
```

## Introduction

Doing a data challenge is usually the first step to get you to a data science job or internship, among which EDA & Data Preprocessing is the most important part because they determine the potential of your model performance and usually give meaningful insights. Most companies have no requirement of building fancy models, instead, they would like to see how you explore and analyze your data, and how you preprocess your data well enough for modeling. In most cases, a good EDA & Preprocessing together with some easy models will get you through the test.  

There has been many people sharing solutions for different projects on the internet. However, most of the projects were done in Python, and there is usually a lack of instructions on how to handle a data challenge in general. In this html file, I will share how to perform EDA & Data preprocessing for a typical data challenge with R, using the data of "Iowa House Price Prediction Competition" https://www.kaggle.com/c/iowa-house-price-prediction


## Exploratory Data Analysis & Preprocessing
The first thing is always to have a brief overview of the data with a few questions in mind:
How many features do we have? How many of them are numeric? How many of them are categorical? How many categorical features are nominal? These information will affect our strategies in choosing models. For example, if we have a lot of categorical features, then linear models might not work well because they cannot handle categorical features by nature, and encoded categorical features are unlikely to follow a linear relationship. 

```{r data load}
# read in data and have a brief look
df <- read_csv("resources/r_for_data_challenges/train.csv", show_col_types = FALSE)
num_fea <- df %>% select_if(names(df) != 'SalePrice') %>% 
  select_if(is.numeric) %>%
  names()
cat_fea <- df %>% select_if(is.character) %>%
  names()
label <- 'SalePrice'
# explore numeric columns
str(df[,num_fea])
# explore categorical columns
str(df[,cat_fea])
``` 

### Explore & Handle Numeric Features
Drop useless features like "Id" because they are known to have no influence on our target label. If we don't drop these columns, they might add noise and lead to overfitting. For example, if IDs were assigned after house prices were sorted, there would be a "strong correlation" between "ID" and "House Price". As a result, the model would give "ID" a very large weight in predicting house prices, which is not desired at all. 
```{r drop features}
# drop useless features to avoid overfitting & noise
num_fea = num_fea[num_fea != 'Id']
df <- df[,c(num_fea, cat_fea, label)]
```
Then compute and show the proportions of missing values for each feature. 
```{r show missing numeric data}
# show proportions of missing values
missing_col <- colSums(is.na(df[,num_fea]))/dim(df)[1] 
missing_col[missing_col > 0] %>% sort(decreasing = TRUE)
```

1. Drop "LotFrontage" because it has too many missing values (proportion > 15%)
2. Fill NAs with medians for the rest of numeric features (median is more robust to outliers than mean)
3. Always check whether all NAs have been filled

(When using for loop, it is important to use "local({})" to give each iteration a separate local space so that previous results are not overwritten by the most recent one!!)
```{r handle missing numeric data}
num_fea <- num_fea[num_fea != 'LotFrontage']
df <- df[,c(num_fea, cat_fea, label)]
for (i in c('GarageYrBlt','MasVnrArea')) {
  df[is.na(df[i]), i] <- local({
    i <- i
    median(unlist(df[i]), na.rm=TRUE)
  })
}
missing_col <- colSums(is.na(df[,num_fea]))/dim(df)[1] 
missing_col[missing_col > 0] %>% sort(decreasing = TRUE)
```
Use multiple scatter plots to explore the relationships between some important features and the label. Here I choose features based on their correlation coefficients with the label. The reason we want to visualize these relationships are as follows:

1. Are there any outliers or unexpected patterns?
2. Are there any features we assume to be highly correlated with the label but not displayed here (because of low correlation coefficients (<0.4))? 
3. After we train our models, we should come back and check whether the feature importance returned by models are aligned with our observation here. If there are any interesting differences, we should explore the reasons behind.
```{r plots, fig.height=10}
corr <- cor(df[,append(num_fea, label)])[,label]
corr <- corr[order(abs(corr), decreasing = TRUE)]
imp_num_fea <- names(corr[abs(corr)>0.4]) # only take the features with absolute value of coeff > 0.4
imp_num_fea <- imp_num_fea[imp_num_fea != label]

p <- list()
for (i in 1:length(imp_num_fea)) {
  p[[i]] <- local({ 
    i <- i
    ggplot() + 
    geom_point(aes(x=unlist(df[,imp_num_fea[i]]), y=unlist(df[label]))) +
    xlab(imp_num_fea[i]) +
    ylab("House Prices") 
  })
}
grid.arrange(grobs = p, ncol = 3)
```


### Explore & Handle Categorical Columns
Compute and show the proportions of missing values for each feature. 
```{r show missing categorical data}
# show proportions of missing values
missing_col <- colSums(is.na(df[,cat_fea]))/dim(df)[1] 
missing_col <- missing_col[missing_col > 0] %>% sort(decreasing = TRUE)
missing_col
```
Categorical features are usually more tricky because NA might not mean "unknown". According to the data description, only "MasVnrType" and "Electrical" truly have "NA" as missing values, the NAs in other features just mean "None". Another thing to note is, even though some categorical features have large portions of missing values, we cannot just delete the columns like how we handle numeric data. Sometimes, missing values could be super informative (for example, the mean house prices corresponding to the missing values of a feature is significantly higher than the rest of classes of that feature). Therefore, we should visualize missing values as an explicit class, and then decide on whether to drop the column or make missing values an indicator column. For features with few missing values, we can just fill NAs with mode.
```{r handle missing categorical data 1}
# Find the mode for each feature
df %>% group_by_at('MasVnrType') %>% count()
df %>% group_by_at('Electrical') %>% count()
```
```{r handle missing categorical data 2}
df[is.na(df['MasVnrType']), 'MasVnrType'] <- 'None'
df[is.na(df['Electrical']), 'Electrical'] <- 'SBrkr'
df[is.na(df)] <- 'None'
missing_col <- colSums(is.na(df[,cat_fea]))/dim(df)[1] 
missing_col[missing_col > 0] %>% sort(decreasing = TRUE) # always check whether all NAs have been filled
```
Most models do not handle categorical features as is, and therefore we need to transform our categorical data into numeric data in a meaningful manner. There are three frequently used strategies:

1. **One-hot Encoding**: transform each class of the categorical feature into a binary feature. This method works well for nominal features with low cardinality. One-hot might not work well for linear models because it causes multi-collinearity which breaks the assumption of independent features.
2. **Ordinal Encoding**: transform each class of the ordinal categorical feature into an integer. The order of assigning integers should follow the internal order. For example, assign 1 to poor, 2 to average, 3 to good. When there are a lot of ordinal features with different sets of levels, this method could be tedious because we need to specify the internal order for each feature.
3. **Target Encoding**: replace each class of the categorical feature with the mean of labels corresponding to that class. This method applies to all types of categorical features, especially good for categorical features with high cardinality. 

House price dataset is a typical dataset with various categorical features. If we use one-hot, it will dramatically increase the dimension of our dataset. Since we have too many ordinal features with different set of levels, ordinal encoding is also not a good idea. Therefore, we can use target encoding here, which gives meaningful orders without causing multi-colinearity or increasing dimension. Moreover, since we have too many categorical features, it is hard to choose which features to visualize. If we use target encoding to transform features, we could choose and visualize categorical features just like how we handle numeric data.  
```{r target encoding, message=FALSE}
target_encoder <- function(df, feature) { # feature input in string format
  fea = paste(feature, 'target', sep = '_')
  df1 <- df %>% group_by_at(feature) %>% 
    summarise(n = mean(SalePrice)) %>%
    rename_with(.fn = function(x) fea, .cols = n) # rename encoded features
  df2 <- left_join(df, df1)
  df2 <- select_if(df2, names(df2) != feature) # drop original features
  return(df2)
}

for (i in cat_fea) {
  suppressMessages(df <- target_encoder(df, i)) # to avoid message of "joining by.."
}

str(df)
```
Since all categorical features have been transformed into numeric features, we could visualize them in the same way as plotting numeric features.
```{r plots2, fig.height=10}
for (i in 1:length(cat_fea)) {
  cat_fea[i] <- local({
    i <- i
    paste(cat_fea[i],'target',sep='_') # update categorical feature list
  })
}

corr <- cor(df[,append(cat_fea, label)])[,label]
corr <- corr[order(abs(corr), decreasing = TRUE)]
imp_num_fea <- names(corr[abs(corr)>0.4])
imp_num_fea <- imp_num_fea[imp_num_fea != label]

p <- list()
for (i in 1:length(imp_num_fea)) {
  p[[i]] <- local({ 
    i <- i
    ggplot() + 
    geom_point(aes(x=unlist(df[,imp_num_fea[i]]), y=unlist(df[label]))) +
    xlab(imp_num_fea[i]) +
    ylab("House Prices") 
  })
}
grid.arrange(grobs = p, ncol = 3)
```

**Finally** All missing values have been filled, and all categorical features have been transformed. Now we are ready for modeling! 

**Note:**

1. I did not remove any outliers here because it is not very reasonable to do that for such a small dataset. Moreover, even though we remove a few "outliers", it would usually have little impact on model performance. The right way is just to keep all the data and run a model first. If there are certain predictions with dramatically higher errors, then we go into the data again and try to see if there are any outliers we should remove.

2. I did not scale the data here. Firstly, scaling is not needed for every model. For tree models, scaling just does not matter at all. However, for distance-based models, scaling might reduce training time, boost interpretability of feature importance, and improve accuracy/reduce error.