# Data Preparation
In this stage, you will handle some of the errors that you found during the previous notebook and prepare your data for the model training stage.

### Imports

In [None]:
library(tidyverse)

### Data
Read in the final data set from your data_preparation notebook.

In [None]:
setwd("C:/Users/User/Documents/nick-kaggle-training")
data <- read_csv("data/combined_financial.csv") 

## More Cleaning

### Duplicates
Deal with any duplicate rows/columns (or, conversely, any empty rows/columns), if applicable.

In [3]:
#Checking for duplicate columns (identical values in every row)
dups <- data[ , which(duplicated(t(data)))]
names(dups)

In [4]:
#Removing the duplicate columns after verifying them
data <- data[ , which(!duplicated(t(data)))] 
dim(data)
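
To see what `duplicated(t(data))` is doing, here is a minimal sketch on a made-up data frame. The `toy` columns and values are hypothetical and not taken from the financial data. Transposing turns columns into rows, so `duplicated()` flags any column whose values exactly repeat an earlier column.

```r
# Toy data frame: column "b" is an exact copy of "a" (hypothetical values)
toy <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3), c = c(4, 5, 6))

# t() turns columns into rows, so duplicated() compares whole columns
is_dup <- as.vector(duplicated(t(toy)))
dup_names <- names(toy)[is_dup]

# Keep only the first copy of each set of identical columns
toy_unique <- toy[, !is_dup, drop = FALSE]
print(dup_names)
print(dim(toy_unique))
```

One caveat: when the data frame mixes numeric and character columns, `t()` converts everything to character, so the comparison is made on the printed representation of each value rather than on the numbers themselves.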

In [5]:
#Looking for missing values & evaluating list of variable names
na <- apply(is.na(data),2,sum)
max(na)
#print(na)
#sort(na, decreasing = TRUE)
#head(sort(na, decreasing = TRUE), n=25)

In [6]:
#Merging and dropping duplicated variable names
#Each pair holds the same metric under two names: c(name to keep, name to drop)
pairs <- list(
  c("Payout Ratio", "payoutRatio"),
  c("Interest Coverage", "interestCoverage"),
  c("Net Profit Margin", "netProfitMargin"),
  c("Dividend Yield", "dividendYield"),
  c("PE ratio", "priceEarningsRatio"),
  c("PFCF ratio", "priceToFreeCashFlowsRatio"),
  c("POCF ratio", "priceToOperatingCashFlowsRatio"),
  c("Price to Sales Ratio", "priceToSalesRatio"),
  c("Days Payables Outstanding", "daysOfPayablesOutstanding"),
  c("Free Cash Flow per Share", "freeCashFlowPerShare"),
  c("ROE", "returnOnEquity"),
  c("PTB ratio", "priceToBookRatio"),
  c("PB ratio", "priceBookValueRatio"),
  c("Operating Cash Flow per Share", "operatingCashFlowPerShare"),
  c("Cash per Share", "cashPerShare")
)

#Merge one pair: keep column x, but take the value from column y (rounded to 4 places)
#when x is NA, or when x is 0 and y is non-zero
merge_pair <- function(df, x, y) {
  first  <- df[[x]]
  second <- df[[y]]
  merged <- ifelse(is.na(first), round(second, 4), first)
  #Explicit NA checks so a 0 in x is not turned into NA when y is missing
  swap <- !is.na(merged) & merged == 0 & !is.na(second) & second != 0
  merged[swap] <- round(second[swap], 4)
  df <- select(df, -all_of(c(x, y)))
  df[[x]] <- merged  #the merged column is appended at the end
  df
}

for (p in pairs) {
  view(data[, p])  #inspect the pair before merging
  data <- merge_pair(data, p[1], p[2])
}

#Convert from a tibble to a plain data frame
data <- as.data.frame(data)

dim(data)
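
The merge rule used for each pair of duplicated variables can be checked on a few hand-made values. The sketch below is illustrative only: `coalesce_pair` is a hypothetical helper and the numbers are made up. It prefers the first column, and falls back to the second (rounded to 4 decimal places) when the first is missing, or when the first is 0 and the second is non-zero.

```r
# Hypothetical helper illustrating the merge rule on two numeric vectors
coalesce_pair <- function(primary, secondary) {
  # Fill missing primary values from the secondary column
  merged <- ifelse(is.na(primary), round(secondary, 4), primary)
  # Replace a 0 only when the secondary value is present and non-zero
  use_secondary <- !is.na(merged) & merged == 0 &
    !is.na(secondary) & secondary != 0
  merged[use_secondary] <- round(secondary[use_secondary], 4)
  merged
}

# Made-up values covering each case
primary   <- c(0.25, NA, 0, 0, NA)
secondary <- c(0.30, 0.123456, 0.5, NA, NA)
merged <- coalesce_pair(primary, secondary)
print(merged)
```

The explicit `is.na()` checks matter: a bare `ifelse(x == 0 & y != 0, ...)` returns `NA` wherever the condition itself is `NA`, which would turn a genuine 0 into a missing value whenever the second column is missing.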

### Variable Names
Depending on your data set, variables may have names with strange symbols, which can make loading, saving, and subsetting data difficult. If applicable, you should deal with this now.

In [7]:
#Checking variable names
names(data)

In [8]:
library(stringr)
#Changing all names to lower case and replacing spaces with "_"
names(data) <- str_trim(names(data), side = "both")
names(data) <- str_to_lower(names(data), locale = "en")
names(data) <- str_replace_all(names(data), " ", "_")
names(data) <- str_replace_all(names(data), "-", "")
names(data)
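
A slightly more general variant, sketched here with base R regular expressions, replaces every run of non-alphanumeric characters (including symbols such as `&`) with a single underscore. Note that this differs from the cell above, which removes dashes instead of replacing them. `clean_names` is a hypothetical helper and the example names are made up for illustration.

```r
# Hypothetical helper: normalise column names with base R regular expressions
clean_names <- function(x) {
  x <- trimws(x)                    # drop leading/trailing whitespace
  x <- tolower(x)                   # lower-case everything
  x <- gsub("[^a-z0-9]+", "_", x)   # any run of other characters becomes "_"
  x <- gsub("^_+|_+$", "", x)       # strip underscores left at the ends
  x
}

example_names <- c(" R&D Expenses", "Net Profit Margin", "EBIT-to-Sales ")
cleaned <- clean_names(example_names)
print(cleaned)
```

Aggressive cleaning like this can map two different original names to the same result, so it is worth checking `anyDuplicated(cleaned)` before assigning the names back.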


### Categorical Encoding
Are there any categorical variables in your data? If so, transform them so that they can be used to train a machine learning algorithm.

In [9]:
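#Categorical Encoding
#as.numeric() on a factor returns its integer level codes (levels sorted alphabetically)
data$sector <- as.factor(data$sector)
data$sector_num <- as.numeric(data$sector)

#Reordering data to put "sector_num" right after "sector"
data <- data %>%
  relocate(sector_num, .after = sector)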
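
Integer codes impose an ordering on the sectors that some models will treat as meaningful. As a possible alternative, the sketch below shows one-hot encoding with base R's `model.matrix()` on a made-up `sector` column. The sector labels are hypothetical examples, and whether one-hot encoding is preferable depends on the model used in the next stage.

```r
# Toy sector column (hypothetical values)
toy <- data.frame(
  sector = factor(c("Energy", "Technology", "Energy", "Healthcare"))
)

# Integer codes follow the alphabetical order of the factor levels
toy$sector_num <- as.numeric(toy$sector)

# One-hot encoding: one 0/1 indicator column per level, no intercept column
one_hot <- model.matrix(~ sector - 1, data = toy)
print(toy$sector_num)
print(colnames(one_hot))
```

Tree-based models are usually fine with the integer codes, while linear models and distance-based methods tend to behave better with the indicator columns.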

### Missing Data
If missing data is present (i.e., NaN values), what is the best way to deal with it? Should values be imputed (and if so, how)? Or should they simply be filled? How will each option affect the ultimate outcome of your model? Design and implement a solution.

In [10]:
na <- apply(is.na(data),2,sum)
#print(na)
max(na)

In [11]:
#sort(na, decreasing = TRUE)
head(sort(na, decreasing = TRUE), n=25)

In [12]:
#Checking how many cases are complete
sum(complete.cases(data))

#Checking for NA across rows
data$na <- rowSums(is.na(data))
max(data$na)
head(sort(data$na, decreasing = TRUE),n = 20)
summary(data$na)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00   10.00   27.88   23.00  194.00 
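
One way to choose a row cutoff is to tabulate how many rows would survive at several candidate thresholds. The sketch below does this on a made-up data frame. The values and the candidate cutoffs are illustrative only and are not the ones used on the financial data.

```r
# Toy data with scattered missing values (made-up numbers)
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, 3, 4),
  c = c(1, NA, 3, 4)
)

# Number of missing values in each row
na_per_row <- rowSums(is.na(toy))

# For each candidate cutoff, count the rows that would be kept
# (rows with at most `cutoff` missing values)
cutoffs <- 0:3
rows_kept <- sapply(cutoffs, function(cutoff) sum(na_per_row <= cutoff))
print(data.frame(cutoff = cutoffs, rows_kept = rows_kept))
```

A table like this makes the trade-off visible: a stricter cutoff leaves less to impute but discards more observations.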

In [13]:
#Found that 50 was a good cut off: rows with more than 50 missing values are dropped
drop <- data %>% 
  filter(na > 50)
dim(drop)

data <- data %>%
  filter(na <= 50)

data <- subset(data, select = -na)

#Re-checking the NAs across columns
na <- apply(is.na(data),2,sum)
max(na)
print(na)
sort(na, decreasing = TRUE)
head(sort(na, decreasing = TRUE), n=25)

                                     stock 
                                         0 
                          nextyr_price_var 
                                         0 
                                     class 
                                         0 
                                      year 
                                         0 
                                    sector 
                                         0 
                                sector_num 
                                         0 
                                   revenue 
                                         0 
                            revenue_growth 
                                        51 
                           cost_of_revenue 
                                         5 
                              gross_profit 
                                         0 
                              r&d_expenses 
                                         6 
                              sg

In [14]:
#Computing the percentage of missing values in each column
perc_miss <- function(x){sum(is.na(x))/length(x)*100}
perc <- apply(data,2,perc_miss)
max(perc)
print(perc)
#sort(perc, decreasing = TRUE)
head(sort(perc, decreasing = TRUE), n=25)

#Choosing to keep only variables with less than 15% missing data
data <- data[, which(perc < 15.0)]

                                     stock 
                               0.000000000 
                          nextyr_price_var 
                               0.000000000 
                                     class 
                               0.000000000 
                                      year 
                               0.000000000 
                                    sector 
                               0.000000000 
                                sector_num 
                               0.000000000 
                                   revenue 
                               0.000000000 
                            revenue_growth 
                               0.260589648 
                           cost_of_revenue 
                               0.025548005 
                              gross_profit 
                               0.000000000 
                              r&d_expenses 
                               0.030657606 
                              sg
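
The per-column percentage can also be computed without a custom function, since the mean of a logical vector is the fraction of `TRUE` values. A minimal sketch on made-up data follows. The 50% threshold is chosen only to suit the toy example and is not the 15% used above.

```r
# Toy data (made-up values): "y" is mostly missing
toy <- data.frame(
  id = c("a", "b", "c", "d"),
  x  = c(1, NA, 3, 4),
  y  = c(NA, NA, NA, 4)
)

# colMeans() of the logical NA matrix gives the fraction missing per column
pct_missing <- colMeans(is.na(toy)) * 100

# Keep columns below a hypothetical 50% threshold
keep <- names(pct_missing)[pct_missing < 50]
toy_kept <- toy[, keep, drop = FALSE]
print(pct_missing)
print(names(toy_kept))
```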

In [None]:
library(missForest)
library(imputeMissings)

In [16]:
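#Setting aside the first 7 columns (none have missing values) and imputing the rest
missing <- data[,-c(1:7)]

#Median for numeric columns, mode for categorical columns
imputedm <- impute(missing, method = "median/mode")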

In [17]:
#summary(imputedm)

In [18]:
#Combining the sets with the original data
imputedm <- cbind(data[,c(1:7)], imputedm)
dim(imputedm)

In [19]:
na <- apply(is.na(imputedm),2,sum)
max(na)
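
To make clear what median/mode imputation does, here is a base R approximation on made-up data. `impute_median_mode` is a hypothetical helper written for illustration. It is not the implementation inside the imputeMissings package, only a sketch of the same idea.

```r
# Hypothetical helper: median for numeric columns, most frequent value otherwise
impute_median_mode <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      fill <- median(df[[col]], na.rm = TRUE)
    } else {
      counts <- table(df[[col]])              # table() ignores NA by default
      fill <- names(counts)[which.max(counts)]
    }
    df[[col]][missing] <- fill
  }
  df
}

# Toy data (made-up values)
toy <- data.frame(
  ratio = c(1, NA, 3, 10),
  grade = c("A", "B", NA, "B"),
  stringsAsFactors = FALSE
)
toy_imputed <- impute_median_mode(toy)
print(toy_imputed)
```

Median imputation is fast and robust to outliers, but it shrinks the variance of each imputed column, and computing the medians on the full data set before a train/test split lets information from the test rows leak into training.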

### Uniformity
Are all variables measured using compatible units? Or are there monetary values in different forms of currency? Or something else entirely? If any of these issues are applicable to your data, design and implement a solution.

In [20]:
#Planning to implement scaling in the next stage
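
As a rough preview of what that scaling step might look like, the sketch below standardises two made-up columns with base R's `scale()`, which centres each column to mean 0 and rescales it to standard deviation 1. The column names and values are hypothetical, and the method actually used in the next stage may differ.

```r
# Toy data on very different scales (made-up values)
toy <- data.frame(
  revenue = c(100, 200, 300),     # large monetary amounts
  margin  = c(0.10, 0.20, 0.30)   # ratios
)

# scale() returns a matrix, so convert back to a data frame
scaled <- as.data.frame(scale(toy))
print(scaled)
print(apply(scaled, 2, sd))
```

After standardising, both columns are on the same unitless scale, so neither dominates a distance-based or regularised model purely because of its units.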

### Additional Cleaning
If your data set has any other issues you have noticed, you should handle them here.

In [21]:
#No additional cleaning was performed at this time

## Save the Prepared Set

In [22]:
write_csv(data, "data/clean_financial.csv")
write_csv(imputedm, "data/imputedm_financial.csv")

## Outcome
At this point you should have an analytic data set prepared for data modeling. You should also be able to account for any changes you have made in the data. Do this now.

##### The following changes were made to the data set:
 - Duplicates: Any duplicate columns were removed from the data set. This was done by first using the `duplicated` function to identify identical columns. Then, I looked at the column sums to identify additional duplicate variables, and merged each such pair into a single column. In the end, about 20 variables were duplicated.
 - Variable Names: The variable names were cleaned up so that they were consistent. Leading and trailing whitespace was trimmed, all letters were lower-cased, spaces were replaced with underscores, and dashes were removed.
 - Categorical Encoding: A new variable "sector_num" was created as the numeric version of the categorical variable "sector" to aid in machine learning modeling.
 - Missing Data: The missing data was cleaned in the following manner:
      1. Any row with more than 50 missing values was removed
      2. Any column with 15% or more missing values was removed
      3. The remaining missing values were imputed using the median (or the mode for non-numeric columns) for now
      4. The plan is to use the complete data set to identify which variables are relevant first. Then I will impute on those variables and re-train my model. Using random forest to impute the missing values at this stage was far too time-consuming in R.