# Data Preparation

### Settings/Functions
Read in settings and functions.

In [2]:
libraries <- c('here','missForest','stringr','imputeMissings')
suppressWarnings(lapply(libraries, require, character.only = TRUE))
suppressWarnings(source(here::here('Stock Estimation', 'settings.R')))

### Data
Read in the final data set from the data preparation notebook.

In [3]:
data <- fread(paste0(dir$final_data,'combined_financial.csv'))

## More Cleaning

### Duplicates
Deal with any duplicate rows/columns (or, contrarily, any empty rows/columns), if applicable.

In [4]:
#Checking for duplicated (identical) columns
#Indexing names() directly works whether data is a data.frame or a data.table
dups <- names(data)[duplicated(t(data))]
dups

In [5]:
#Removing the duplicated columns after verifying them
data <- data %>% dplyr::select(-dplyr::all_of(dups))
dim(data)
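
As a small illustration of the duplicate check, the sketch below runs the same idea on a toy data frame. The data and the names `toy`, `dup_names`, and `toy_dedup` are made up for this example and are not part of the notebook's data.

```r
# Toy data frame: column "b" is an exact copy of "a" (hypothetical data)
toy <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3), c = c(3, 2, 1))

# t() turns columns into rows, so duplicated() flags repeated columns;
# the first occurrence is kept and later copies are marked TRUE
dup_names <- names(toy)[duplicated(t(toy))]

# Drop the flagged columns by name
toy_dedup <- toy[, !(names(toy) %in% dup_names), drop = FALSE]

print(dup_names)
print(names(toy_dedup))
```

Only exact copies are caught this way. Columns that hold the same variable under two names but differ in a few cells (for example, because of missing values) need a separate check.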

In [6]:
#Looking for missing values & evaluating list of variable names
na <- apply(is.na(data),2,sum)
max(na)
# NOTE: The following code has been commented out due to the length of its output.
#print(na)
head(sort(na, decreasing = TRUE), n=25)

In [7]:
#Merging and dropping duplicated variable names
#view(data[, c("Payout Ratio", "payoutRatio")])
data <- Name_Changer(dat=data,x='Payout Ratio',y='payoutRatio')

#view(data[, c('interestCoverage', 'Interest Coverage')])
data <- Name_Changer(dat=data,x='Interest Coverage',y='interestCoverage')

#view(data[, c('netProfitMargin', 'Net Profit Margin')])
data <- Name_Changer(dat=data,x='Net Profit Margin',y='netProfitMargin')

#view(data[, c('Dividend Yield', 'dividendYield')])
data <- Name_Changer(dat=data,x='Dividend Yield',y='dividendYield')

#view(data[, c('PE ratio', 'priceEarningsRatio')])
data <- Name_Changer(dat=data,x='PE ratio',y='priceEarningsRatio')

#view(data[, c('priceToFreeCashFlowsRatio', 'PFCF ratio')])
data <- Name_Changer(dat=data,x='PFCF ratio',y='priceToFreeCashFlowsRatio')

#view(data[, c('priceToOperatingCashFlowsRatio', 'POCF ratio')])
data <- Name_Changer(dat=data,x='POCF ratio',y='priceToOperatingCashFlowsRatio')

#view(data[, c('priceToSalesRatio', 'Price to Sales Ratio')])
data <- Name_Changer(dat=data,x='Price to Sales Ratio',y='priceToSalesRatio')

#view(data[, c('Days Payables Outstanding', 'daysOfPayablesOutstanding')])
data <- Name_Changer(dat=data,x='Days Payables Outstanding',y='daysOfPayablesOutstanding')

#view(data[, c('Free Cash Flow per Share', 'freeCashFlowPerShare')])
data <- Name_Changer(dat=data,x='Free Cash Flow per Share',y='freeCashFlowPerShare')

#view(data[, c('ROE', 'returnOnEquity')])
data <- Name_Changer(dat=data,x='ROE',y='returnOnEquity')

#view(data[, c('priceToBookRatio', 'PTB ratio')])
data <- Name_Changer(dat=data,x='PTB ratio',y='priceToBookRatio')

#view(data[, c('priceBookValueRatio', 'PB ratio')])
data <- Name_Changer(dat=data,x='PB ratio',y='priceBookValueRatio')

#view(data[, c('operatingCashFlowPerShare', 'Operating Cash Flow per Share')])
data <- Name_Changer(dat=data,x='Operating Cash Flow per Share',y='operatingCashFlowPerShare')

#view(data[, c('Cash per Share', 'cashPerShare')])
data <- Name_Changer(dat=data,x='Cash per Share',y='cashPerShare')

dim(data)
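
`Name_Changer` is defined in `settings.R`, which is not shown here. The sketch below is only a guess at what such a helper might do, assuming it fills gaps in the kept column `y` with values from the duplicate column `x` and then drops `x`. The function name `merge_columns` and the toy values are hypothetical.

```r
# Hypothetical stand-in for Name_Changer(); the real helper lives in settings.R
merge_columns <- function(dat, x, y) {
  # Fill gaps in the kept column y with values from the duplicate column x
  gaps <- is.na(dat[[y]])
  dat[[y]][gaps] <- dat[[x]][gaps]
  # Drop the duplicate column x
  dat[[x]] <- NULL
  dat
}

# Toy data: the same ratio stored under two differently styled names
toy <- data.frame(
  `Payout Ratio` = c(0.2, NA, 0.5),
  payoutRatio    = c(NA, 0.3, 0.5),
  check.names = FALSE
)

merged <- merge_columns(toy, x = "Payout Ratio", y = "payoutRatio")
print(merged)
```

Before merging a pair it is worth confirming that the two columns agree wherever both are present, which is what the commented-out `view()` calls above are for.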

### Variable Names
Depending on your data set, variables may have names with strange symbols, which can make loading, saving, and subsetting data difficult. If applicable, you should deal with this now.

In [8]:
#Checking variable names
names(data)

In [9]:
#Changing all names to lower case and replacing spaces with "_"
#Removing or replacing symbols so the names are easier to use in modeling functions
names(data) <- str_trim(names(data), side = "both")
names(data) <- str_to_lower(names(data), locale = "en")
names(data) <- str_replace_all(names(data), " ", "_")
names(data) <- str_replace_all(names(data), "-", "")
names(data) <- str_replace_all(names(data), "&", ".")
names(data) <- str_replace_all(names(data), "\\(", "")
names(data) <- str_replace_all(names(data), "\\)", "")
names(data) <- str_replace_all(names(data), "3y", "three_yr")
names(data) <- str_replace_all(names(data), "5y", "five_yr")
names(data) <- str_replace_all(names(data), "10y", "ten_yr")
names(data) <- str_replace_all(names(data), "\\\\", "")
names(data) <- str_replace_all(names(data), "////", "_")
names(data) <- str_replace_all(names(data), ",", "")
#fixed() matches the literal text "_._"; as a regex, "." would match any character
names(data) <- str_replace_all(names(data), fixed("_._"), "_")
names(data) <- str_replace_all(names(data), "/", "_")
names(data)
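
The effect of these replacements is easiest to see on a few example names. The sketch below uses base R (`trimws`, `tolower`, `gsub`) so that it runs without stringr. The raw names are hypothetical and only a subset of the steps is shown.

```r
# Hypothetical raw names, similar in style to the ones cleaned above
raw <- c(" R&D Expenses", "SG&A Expense", "Revenue (3Y)")

clean <- tolower(trimws(raw))                        # trim, then lower-case
clean <- gsub(" ", "_", clean, fixed = TRUE)         # spaces to underscores
clean <- gsub("&", ".", clean, fixed = TRUE)         # "&" to "."
clean <- gsub("(", "", clean, fixed = TRUE)          # drop parentheses
clean <- gsub(")", "", clean, fixed = TRUE)
clean <- gsub("3y", "three_yr", clean, fixed = TRUE) # spell out the horizon
print(clean)

# In a regular expression "." matches any character, so the pattern "_._"
# matches "_b_" as well as a literal "_._"; fixed = TRUE avoids that
regex_result <- gsub("_._", "_", "a_b_c")
fixed_result <- gsub("_._", "_", "a_b_c", fixed = TRUE)
print(c(regex_result, fixed_result))
```

The last two lines show why literal matching matters for patterns that contain a dot: the regex version shortens `a_b_c`, while the fixed version leaves it alone.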

### Categorical Encoding
Are there any categorical variables in your data? If so, transform them so that they can be used to train a machine learning algorithm.

In [10]:
#Categorical Encoding
data[, sector := as.factor(sector)]
data[, sector_num := as.numeric(sector)]

#Reordering data to put "sector" with "sector_num"
data <- data %>%
  dplyr::select('stock','nextyr_price_var','class','year','sector','sector_num', everything())
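
To make the encoding concrete, here is a minimal sketch on a toy vector. The sector labels are hypothetical. `as.factor` orders the levels alphabetically by default, and `as.numeric` returns the position of each value in that ordering.

```r
# Hypothetical sector labels
sectors <- c("Technology", "Energy", "Technology", "Healthcare")

# Levels are sorted alphabetically: Energy, Healthcare, Technology
sector_fac <- as.factor(sectors)

# Each value becomes the index of its level
sector_num <- as.numeric(sector_fac)

print(levels(sector_fac))
print(sector_num)
```

The integer codes are labels, not magnitudes. Models that read numeric inputs as ordered will treat them as a ranking, so the original `sector` factor is kept alongside `sector_num` for models that handle factors directly.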

### Missing Data
If missing data is present (i.e., NaN values), what is the best way to deal with it? Should values be imputed (and if so, how)? Or should they simply be filled? How will each option affect the ultimate outcome of your model? Design and implement a solution.

In [11]:
na <- apply(is.na(data),2,sum)
#print(na)
max(na)
#sort(na, decreasing = TRUE)
head(sort(na, decreasing = TRUE), n=25)
summary(na)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1835    2315    3069    3197   10744 

In [12]:
#Checking how many rows are complete
sum(complete.cases(data))

#Checking for NA across rows
data$na <- rowSums(is.na(data))
max(data$na)
head(sort(data$na, decreasing = TRUE),n = 20)
summary(data$na)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00   10.00   27.67   23.00  193.00 

In [13]:
#Found that 50 was a good cutoff: rows with more than 50 missing values are dropped
drop <- data %>% 
  filter(na > 50)
dim(drop)

data <- data %>%
  filter(na <= 50)

data <- dplyr::select(data, -c(na))

#Re-checking the NAs across columns
na <- apply(is.na(data),2,sum)
max(na)
# NOTE: The following code has been commented out due to the length of its output.
#print(na)
#sort(na, decreasing = TRUE)
head(sort(na, decreasing = TRUE), n=25)
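
The row filter can be sketched on a toy data frame as follows. The data and the cutoff of 1 are hypothetical (the notebook uses 50). The point is that the kept and dropped sets should not overlap.

```r
# Toy data with a different number of gaps in each row (hypothetical)
toy <- data.frame(
  x = c(1, NA, NA, 4),
  y = c(NA, NA, 3, 4),
  z = c(1, NA, 3, 4)
)

# Count missing values in each row
na_per_row <- rowSums(is.na(toy))

# Hypothetical cutoff for this toy example
cutoff <- 1
kept    <- toy[na_per_row <= cutoff, , drop = FALSE]
dropped <- toy[na_per_row >  cutoff, , drop = FALSE]

print(na_per_row)
print(c(kept = nrow(kept), dropped = nrow(dropped)))
```

Using `<=` for one set and `>` for the other guarantees that every row lands in exactly one of them, so the two row counts add up to the original total.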

In [14]:
#Keeping only columns with less than ~15 percent missing
perc <- apply(data,2,Perc_Missing)
max(perc)
# NOTE: The following code has been commented out due to the length of its output.
#print(perc)
#sort(perc, decreasing = TRUE)
head(sort(perc, decreasing = TRUE), n=25)

#Choosing to only keep variables with less than 15% missing data
#Selecting by name works whether data is a data.frame or a data.table
keep <- names(perc)[perc < 15.0]
data <- data %>% dplyr::select(dplyr::all_of(keep))
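
`Perc_Missing` also comes from `settings.R` and is not shown. Assuming it returns the percentage of missing values in a vector, the column filter can be sketched like this. The helper `perc_missing` and the toy data are hypothetical.

```r
# Hypothetical stand-in for Perc_Missing(): percent of NA values in a vector
perc_missing <- function(v) 100 * mean(is.na(v))

# Toy data: one column with 10% missing, one with 50% missing
toy <- data.frame(
  mostly_full = c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA),
  half_empty  = c(1, NA, 3, NA, 5, NA, 7, NA, 9, NA)
)

# Percent missing per column, then keep columns under the 15% threshold
perc <- sapply(toy, perc_missing)
keep <- names(perc)[perc < 15]
toy_kept <- toy[, keep, drop = FALSE]

print(perc)
print(names(toy_kept))
```

Selecting by column name rather than by position keeps the step readable and avoids surprises if the column order changes upstream.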

In [15]:
missing <- select(data, -c('stock','nextyr_price_var','class','year','sector','sector_num'))

imputedm <- impute(missing, method = "median/mode")
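
For intuition, the `"median/mode"` method fills numeric columns with their median and character or factor columns with their most frequent value. The base-R sketch below mimics that behavior on a toy data frame. It is not the imputeMissings implementation, and the names `mode_value`, `impute_median_mode`, and the toy data are hypothetical.

```r
# Most frequent non-missing value of a vector (table() ignores NA by default)
mode_value <- function(v) {
  counts <- table(v)
  names(counts)[which.max(counts)]
}

# Fill numeric columns with the median and other columns with the mode
impute_median_mode <- function(df) {
  for (col in names(df)) {
    gaps <- is.na(df[[col]])
    if (!any(gaps)) next
    if (is.numeric(df[[col]])) {
      df[[col]][gaps] <- median(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][gaps] <- mode_value(df[[col]])
    }
  }
  df
}

# Toy data with one gap per column and one extreme revenue value
toy <- data.frame(
  revenue = c(10, NA, 30, 1000),
  sector  = c("Energy", "Energy", NA, "Technology"),
  stringsAsFactors = FALSE
)

filled <- impute_median_mode(toy)
print(filled)
```

The median is a reasonable default for financial data with heavy tails: in the toy column the extreme value of 1000 would pull a mean-based fill far above the typical value, while the median stays at 30.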

In [16]:
summary(imputedm)

    revenue           revenue_growth     cost_of_revenue     
 Min.   :-6.276e+08   Min.   :   -3.46   Min.   :-2.987e+09  
 1st Qu.: 6.704e+07   1st Qu.:   -0.01   1st Qu.: 3.136e+06  
 Median : 4.689e+08   Median :    0.06   Median : 1.512e+08  
 Mean   : 4.807e+09   Mean   :    3.65   Mean   : 2.899e+09  
 3rd Qu.: 2.338e+09   3rd Qu.:    0.18   3rd Qu.: 1.150e+09  
 Max.   : 5.003e+11   Max.   :42138.66   Max.   : 3.771e+11  
  gross_profit         r.d_expenses         sg.a_expense       
 Min.   :-1.281e+10   Min.   :-1.098e+08   Min.   :-1.402e+08  
 1st Qu.: 3.406e+07   1st Qu.: 0.000e+00   1st Qu.: 1.816e+07  
 Median : 2.075e+08   Median : 0.000e+00   Median : 8.164e+07  
 Mean   : 1.904e+09   Mean   : 9.809e+07   Mean   : 8.503e+08  
 3rd Qu.: 9.022e+08   3rd Qu.: 1.205e+07   3rd Qu.: 3.651e+08  
 Max.   : 1.269e+11   Max.   : 2.884e+10   Max.   : 1.065e+11  
 operating_expenses   operating_income     interest_expense    
 Min.   :-5.496e+09   Min.   :-1.934e+10   Min.   :-5.

In [17]:
#Combining the sets with the original data
imputedm <- cbind(data[, c('stock','nextyr_price_var','class','year','sector','sector_num')], imputedm)
dim(imputedm)

In [18]:
max(apply(is.na(imputedm),2,sum))

### Uniformity
Are all variables measured using compatible units? Or are there monetary values in different forms of currency? Or something else entirely? If any of these issues are applicable to your data, design and implement a solution.

In [18]:
#Implementing scaling in the modeling notebook

### Additional Cleaning
If your data set has any other issues you have noticed, you should handle them here.

In [19]:
#No additional cleaning was performed in this notebook

## Save the Prepared Set

In [19]:
fwrite(data, paste0(dir$final_data,'clean_financial.csv'))
fwrite(imputedm, paste0(dir$final_data,'imputedm_financial.csv'))

## Outcome
At this point you should have an analytic data set prepared for data modeling. You should also be able to account for any changes you have made in the data. Do this now.

##### The following changes were made to the data set:
 - Duplicates: Duplicate columns were removed from the data set. First, the `duplicated` function was used to identify identical columns. Then, I looked at the column sums to identify additional duplicated variables, which were merged with their counterparts. In the end, about 20 variables were duplicated.
 - Variable Names: The variable names were cleaned up so that they are consistent. Leading and trailing whitespace was trimmed, all letters were lower-cased, spaces and slashes were replaced with underscores, and dashes, parentheses, and commas were removed.
 - Categorical Encoding: A new variable "sector_num" was created as the numeric version of the categorical variable "sector" to aid in machine learning modeling.
 - Missing Data: The missing data was cleaned in the following manner:
      1. Any row with more than 50 missing values was removed
      2. Any column with 15% or more of its values missing was removed
      3. The remaining missing values were imputed with the column median (or the mode for non-numeric columns) for now
      4. The plan is to first use the complete data set to identify which variables are relevant, then re-impute only those variables and re-train the models. Imputing the missing values with random forest (missForest) at this stage was far too time-consuming in R.
 - Outliers: The outliers in the data set were removed in the previous notebook using Cook's Distance. Only four rows were identified as extreme outliers and were thus removed from the data set.