# Exploring Data

## Checking data by having a glimpse

In [None]:
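str(df)       # structure: column types and first values
dim(df)       # number of rows and columns
describe(df)  # not base R; provided by e.g. Hmisc or psych (load one of them first)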

# Unique values per column
lapply(df, function(x) length(unique(x)))

# Fast way to count NAs in each column of a data frame
colSums(is.na(df))
# or, as an alternative
sapply(df, function(x) sum(is.na(x)))

# Share of missing values per column, as a fraction between 0 and 1 (requires dplyr and tidyr)
# funs() is deprecated in dplyr; a formula (~) does the same job
missing_values <- df %>% summarize_all(~ mean(is.na(.)))
missing_values <- gather(missing_values, key = "feature", value = "missing_pct")
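
As a small base R sketch of the same checks, the block below builds a toy data frame (the column names and values are made up for illustration) and computes the count and share of missing values per column without dplyr or tidyr.

```r
# Toy data frame (hypothetical values) with a few missing entries
toy <- data.frame(
  age  = c(22, NA, 35, 41, NA),
  city = c("Oslo", "Rome", NA, "Lima", "Kyiv"),
  fare = c(7.25, 71.3, 8.05, 53.1, 8.46)
)

# Count of NAs per column
na_count <- colSums(is.na(toy))

# Share of NAs per column: the mean of a logical matrix column is the proportion of TRUEs
na_pct <- colMeans(is.na(toy))

# Long format: one row per feature, sorted by share missing (largest first)
missing_values <- data.frame(feature = names(na_pct),
                             missing_pct = unname(na_pct))
missing_values <- missing_values[order(-missing_values$missing_pct), ]
print(missing_values)
```

Sorting by share missing makes it easy to see which columns need imputation first.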

## Univariate Analysis

In [None]:
# To count the occurrence of each value
table(df$colname)

# To get the proportion of each value
prop.table(table(df$colname))

# To get the most frequent value in a column of df
tail(names(sort(table(df$colname))), 1)
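
A minimal sketch of these calls on a toy vector (the values are hypothetical). It also shows that `table()` drops NAs unless asked otherwise, which matters when reading the proportions.

```r
# Hypothetical categorical column with one missing value
embarked <- c("S", "C", "S", "Q", "S", "C", NA)

counts <- table(embarked)        # NA is dropped by default
shares <- prop.table(counts)     # proportions over the non-missing values

# Most frequent value: which.max() returns the first maximum
most_frequent <- names(which.max(counts))

# Keep NA as its own category when it is present
counts_with_na <- table(embarked, useNA = "ifany")

print(counts)
print(round(shares, 2))
print(most_frequent)
```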

## Multivariate Analysis

In [None]:
# Contingency table for two categorical variables (CrossTable comes from the gmodels package)
library(gmodels)
with(df, CrossTable(colname1, colname2))

# To show the relation between two continuous variables, use a scatterplot
library(ggplot2)
ggplot(data = df, aes(x = colname1, y = colname2)) + geom_point()

# For a categorical variable (x) against a continuous variable (y), use a boxplot
ggplot(data = df, aes(x = colname1, y = colname2)) + geom_boxplot()

## Outliers

In [None]:
# One approach to spotting outliers is a jittered scatterplot with geom_jitter
ggplot(df, aes(colname1, colname2)) +
    geom_jitter(alpha = 0.2, size = 3)

# Data Cleaning

In [None]:
# Renaming field names by using dplyr
df <- df %>% rename(
  newname1 = oldname1,
  newname2 = oldname2)

## Impute missing values

In [None]:
# First approach: replace NAs with the mean of the column
df <- df %>%
    mutate(colname = ifelse(is.na(colname), mean(colname, na.rm = TRUE), colname))

# Second approach: replace NAs with the most common value of the feature.
# Here we assume 'S' was the most frequent value for that field in the data frame.
df$colname <- replace(df$colname, which(is.na(df$colname)), 'S')

# A more subtle way to fill missing values is to predict them from
# other explanatory variables. For example, here we predict a
# continuous variable from the others (requires library(rpart)).
exp_var_filler_model <- rpart(exp_var ~ a_list_of_exp_vars,
                              data = df[!is.na(df$exp_var), ],
                              # "anova" because exp_var is continuous; use "class" for a categorical target
                              method = "anova")
df$exp_var[is.na(df$exp_var)] <- predict(exp_var_filler_model,
                                         df[is.na(df$exp_var), ])
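
The first two approaches can be sketched in base R on a toy data frame. The column names and values below are assumptions for illustration, and the most frequent value is computed rather than hard-coded.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  age      = c(22, NA, 35, 41, NA),
  embarked = c("S", "C", NA, "S", "S"),
  stringsAsFactors = FALSE
)

# Remember which rows were imputed, in case the model should know about it
age_was_missing <- is.na(toy$age)

# Mean imputation for the numeric column
toy$age[age_was_missing] <- mean(toy$age, na.rm = TRUE)

# Mode imputation for the categorical column
mode_value <- names(which.max(table(toy$embarked)))
toy$embarked[is.na(toy$embarked)] <- mode_value

print(toy)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is one reason the model-based approach can be preferable when many values are missing.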

# Data Sampling

In [None]:
# Take a random sample of 3 rows from the original data (3 is just for the sake of example)
df_sample <- df[sample(nrow(df), 3), ]

# The same result using dplyr
df_sample <- sample_n(df, size = 3)
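
Random samples change from run to run unless the random seed is fixed. A short sketch using the built-in `mtcars` data set (chosen only so the example is self-contained; the seed value is arbitrary):

```r
# Fixing the seed makes the sample reproducible
set.seed(42)
rows_a <- sample(nrow(mtcars), 3)

set.seed(42)
rows_b <- sample(nrow(mtcars), 3)

# Same seed, same call: the same rows are drawn
print(identical(rows_a, rows_b))

df_sample <- mtcars[rows_a, ]
print(df_sample)
```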

# Data Manipulation (Feature Engineering)

### Feature Engineering
Feature engineering has been described as easily the most important factor in determining the success or failure of your predictive model. Feature engineering really boils down to the human element in machine learning. How much you understand the data, with your human intuition and creativity, can make the difference.

In [None]:
df <- df %>%
    mutate( new_feature = case_when(raw_feature < 13 ~ "Lower.Range", 
                                    raw_feature >= 13 & raw_feature < 18 ~ "Mid.Range.1",
                                    raw_feature >= 18 & raw_feature < 60 ~ "Mid.Range.2",
                                    raw_feature >= 60 ~ "Upper.Range"))

# To combine categories that each make up a very small share of the total, use recode from the car package
library(car)
df$cat_var <- recode(df$cat_var, "c('group1', 'group2',...) = 'more_general_group'")
# Or simply (works on character columns; convert a factor with as.character first)
df$cat_var[df$cat_var %in% c('group1', 'group2')] <- 'more_general_group'
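
For comparison, the same binning can be sketched with base R's `cut()` instead of `case_when`. The input values below are hypothetical; the break points and labels mirror the ones used above.

```r
# Hypothetical raw values, including the boundary cases 13, 18 and 60
raw_feature <- c(5, 13, 17.9, 18, 59, 60, 80)

# right = FALSE makes each interval closed on the left: [a, b)
new_feature <- cut(raw_feature,
                   breaks = c(-Inf, 13, 18, 60, Inf),
                   labels = c("Lower.Range", "Mid.Range.1",
                              "Mid.Range.2", "Upper.Range"),
                   right = FALSE)

print(data.frame(raw_feature, new_feature))

# Collapsing rare categories in a character vector (hypothetical groups)
cat_var <- c("a", "b", "c", "a", "a", "d")
cat_var[cat_var %in% c("c", "d")] <- "more_general_group"
print(table(cat_var))
```

`cut()` returns an ordered set of factor levels, which keeps the ranges in a sensible order in tables and plots.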

# Aggregate function

In [None]:
# find the number of survivors for the different subsets
> aggregate(Survived ~ Child + Sex, data=train, FUN=sum)
  Child    Sex Survived
1     0 female      195
2     1 female       38
3     0   male       86
4     1   male       23

# To know the total number of people in each subset
> aggregate(Survived ~ Child + Sex, data=train, FUN=length)
  Child    Sex Survived
1     0 female      259
2     1 female       55
3     0   male      519
4     1   male       58

# To know the proportions
> aggregate(Survived ~ Child + Sex, data=train, FUN=function(x) {sum(x)/length(x)})
  Child    Sex  Survived
1     0 female 0.7528958
2     1 female 0.6909091
3     0   male 0.1657033
4     1   male 0.3965517
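
The output above comes from a `train` data set that is not included here. As a runnable sketch of the same three calls, the block below uses a tiny made-up data frame with the same column names; the values are hypothetical.

```r
# Tiny stand-in for the train data (hypothetical values)
toy <- data.frame(
  Survived = c(1, 0, 1, 1, 0, 0, 1, 0),
  Child    = c(0, 0, 1, 1, 0, 0, 1, 0),
  Sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male")
)

# Number of survivors in each Child x Sex subset
survivors <- aggregate(Survived ~ Child + Sex, data = toy, FUN = sum)

# Number of people in each subset
totals <- aggregate(Survived ~ Child + Sex, data = toy, FUN = length)

# Proportion of survivors: the mean of a 0/1 column equals sum(x) / length(x)
rates <- aggregate(Survived ~ Child + Sex, data = toy, FUN = mean)

print(survivors)
print(totals)
print(rates)
```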

# Correlation Plot
The corrplot package provides a graphical display of a correlation matrix and of confidence intervals. It also contains some algorithms for matrix reordering. In addition, corrplot handles details well, including the choice of colors, text labels, color labels, layout, etc.

In [None]:
library(corrplot)  # provides corrplot.mixed()
df %>%
  select_if(is.numeric) %>%
  cor(use = "complete.obs") %>%
  corrplot.mixed(tl.cex = 0.85)
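
The plot is only a view of the matrix returned by `cor()`. A base R sketch of that intermediate step, using the built-in `mtcars` data set as a stand-in for `df`:

```r
# Keep only the numeric columns (all 11 columns of mtcars are numeric)
num_cols <- mtcars[sapply(mtcars, is.numeric)]

# Correlation matrix; "complete.obs" drops rows that contain any NA
corr <- cor(num_cols, use = "complete.obs")
print(round(corr[1:4, 1:4], 2))

# Find the strongest off-diagonal correlation (in absolute value)
corr_off <- corr
diag(corr_off) <- 0
strongest <- which(abs(corr_off) == max(abs(corr_off)), arr.ind = TRUE)[1, ]
cat("Strongest pair:",
    rownames(corr)[strongest["row"]], "and",
    colnames(corr)[strongest["col"]], "\n")
```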

# Decision Trees

In [None]:
library(rpart)
library(rpart.plot)
# Using rpart with a list of explanatory variables for a discrete outcome
# method = "class" (for ones-and-zeros output)
fit <- rpart(Outcome ~ exp_var1 + exp_var2 + ...,
             data = df_train,
             method = "class")

# To predict a continuous outcome instead
fit_anova <- rpart(Outcome ~ exp_var1 + exp_var2 + ...,
                   data = df_train,
                   method = "anova")

# Making predictions on the test dataset with the classification tree
Prediction <- predict(fit, test, type = "class")
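
A minimal end-to-end sketch of the classification case, assuming the built-in `iris` data set as a stand-in for the real data. The 100/50 split and the seed are arbitrary choices for illustration.

```r
library(rpart)

# Split iris into a training and a test set (sizes chosen for illustration)
set.seed(1)
train_idx <- sample(nrow(iris), 100)
df_train  <- iris[train_idx, ]
df_test   <- iris[-train_idx, ]

# Classification tree on all explanatory variables
fit <- rpart(Species ~ ., data = df_train, method = "class")

# Predicted classes for the unseen rows
Prediction <- predict(fit, df_test, type = "class")

# Share of correct predictions and a confusion table
accuracy <- mean(Prediction == df_test$Species)
print(accuracy)
print(table(predicted = Prediction, actual = df_test$Species))
```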

# Overfitting
Overfitting is technically defined as a model that performs better on a training set than another simpler model, but does worse on unseen data.

Use caution with decision trees, and any other algorithm actually, or you can find yourself making rules from the noise you’ve mistaken for signal!

In [None]:
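# Options that control overfitting in decision trees:
# cp sets the complexity threshold a split must beat, and minsplit sets
# the minimum number of observations a node needs before a split is attempted.
# minsplit = 2 and cp = 0 remove both limits (the defaults are minsplit = 20, cp = 0.01),
# so the tree below grows to full depth and will overfit.
fit <- rpart(Outcome ~ exp_var1 + exp_var2 + ...,
             data = df_train,
             method = "class",
             control = rpart.control(minsplit = 2, cp = 0))
# To trim the tree manually (interactively) in R we can use:
new.fit <- prp(fit, snip = TRUE)$obj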
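
To see the effect of these controls, the sketch below compares a default tree with a fully grown one on the built-in `iris` data set (an assumption, used only to keep the example runnable). It then prunes the full tree with `prune()` at the cp value that has the lowest cross-validated error, which is a common non-interactive alternative to snipping by hand and is not part of the original recipe.

```r
library(rpart)

set.seed(1)
train_idx <- sample(nrow(iris), 100)
df_train  <- iris[train_idx, ]
df_test   <- iris[-train_idx, ]

# Default controls versus no limits at all
default_fit <- rpart(Species ~ ., data = df_train, method = "class")
full_fit    <- rpart(Species ~ ., data = df_train, method = "class",
                     control = rpart.control(minsplit = 2, cp = 0))

# Helpers: number of terminal nodes, and accuracy on a given data set
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
accuracy <- function(fit, data) {
  mean(predict(fit, data, type = "class") == data$Species)
}

# Prune the full tree at the cp with the lowest cross-validated error
cp_table   <- full_fit$cptable
best_cp    <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned_fit <- prune(full_fit, cp = best_cp)

for (name in c("default_fit", "full_fit", "pruned_fit")) {
  fit <- get(name)
  cat(name, ": leaves =", count_leaves(fit),
      " train acc =", round(accuracy(fit, df_train), 3),
      " test acc =", round(accuracy(fit, df_test), 3), "\n")
}
```

A gap between training and test accuracy for the full tree is the symptom of overfitting described above.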

# Prediction

Based on the type of problem:
    - If it's a classification problem, then we can use:
        * Logistic Regression
        * Naive Bayes
        * Decision Trees

## Ensemble
Take a large collection of individually imperfect models, and their one-off mistakes are probably not going to be made by the rest of them. If we average the results of all these models, we can sometimes find a superior model from their combination than any of the individual parts. That’s how ensemble models work: they grow a lot of different models and let their outcomes be averaged or voted across the group.
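
The voting idea can be sketched in a few lines of base R. The three "models" below are hand-written prediction vectors, constructed so that their mistakes never overlap; real models will rarely be this convenient.

```r
# True labels and predictions from three hypothetical, imperfect classifiers
truth   <- c(1, 0, 1, 1, 0, 1, 0, 0)
model_a <- c(1, 0, 0, 1, 0, 1, 0, 1)  # wrong on observations 3 and 8
model_b <- c(1, 1, 1, 1, 0, 0, 0, 0)  # wrong on observations 2 and 6
model_c <- c(0, 0, 1, 1, 1, 1, 0, 0)  # wrong on observations 1 and 5

# Majority vote: predict 1 when more than half of the models say 1
votes    <- cbind(model_a, model_b, model_c)
ensemble <- as.numeric(rowMeans(votes) > 0.5)

# Accuracy of each model and of the vote
accuracies <- c(
  model_a  = mean(model_a == truth),
  model_b  = mean(model_b == truth),
  model_c  = mean(model_c == truth),
  ensemble = mean(ensemble == truth)
)
print(accuracies)
```

Each model is right on 6 of 8 observations, while the vote is right on all 8, because every mistake is outvoted by the two other models.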

### Random Forest
Random Forest models grow trees much deeper than the decision stumps above; in fact, the default behaviour is to grow each tree out as far as possible. But since the procedure for building a single decision tree is the same every time, some source of randomness is required to make these trees different from one another. Random Forests do this in two ways.
 1. The first trick is to use bagging, short for bootstrap aggregating. Bagging takes a randomized sample of the rows in your training set, with replacement. On its own, though, bagging still lets the strongest predictors dominate the first splits of most trees.
 2. The second source of randomness gets past this limitation. Instead of looking at the entire pool of available variables, Random Forests take only a subset of them, typically the square root of the number available. In our case we have 10 variables, so using a subset of three variables would be reasonable. The selection of available variables is changed for each and every node in the decision trees.
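
Both sources of randomness can be sketched in base R. The row count of 100 is an assumption for illustration; the 10 variables match the case described above.

```r
set.seed(123)
n_rows <- 100   # hypothetical training set size
n_vars <- 10    # number of explanatory variables

# 1. Bagging: draw as many rows as the training set has, with replacement
boot_rows <- sample(n_rows, n_rows, replace = TRUE)

# Rows never drawn are "out of bag"; on average about 37% of rows end up there
oob_share <- 1 - length(unique(boot_rows)) / n_rows
print(oob_share)

# 2. Variable subsetting: candidates considered at one node
mtry <- floor(sqrt(n_vars))               # floor(sqrt(10)) is 3
candidate_vars <- sample(n_vars, mtry)    # redrawn at every node
print(mtry)
print(candidate_vars)
```

In a real forest the first step is repeated once per tree and the second once per node, which is what makes the trees differ from one another.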
 
R’s Random Forest algorithm has a few restrictions that we did not have with our decision trees. The big one has been the elephant in the room until now: we have to clean up the missing values in our dataset.

In [None]:
# Building a random forest model after the data preparation process
library(randomForest)
fit <- randomForest(as.factor(outcome) ~ exp_var1 + exp_var2 + ...,
                    data = df_train,
                    importance = TRUE,
                    ntree = 2000)

# So let’s look at which variables were important:
varImpPlot(fit)

There are two types of importance measures shown above. The accuracy one tests how much worse the model performs without each variable (in practice, with that variable's values randomly shuffled), so a high decrease in accuracy would be expected for very predictive variables. The Gini one digs into the mathematics behind decision trees, but essentially measures how pure the nodes are at the end of the tree.
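
The numbers behind the plot can be read with `importance()`. A short sketch on the built-in `iris` data set (a stand-in for the real data; the seed and `ntree = 500` are arbitrary), assuming the randomForest package is installed:

```r
library(randomForest)

set.seed(415)
fit <- randomForest(Species ~ ., data = iris,
                    importance = TRUE, ntree = 500)

# One row per variable; the last two columns are the measures shown by varImpPlot()
imp <- importance(fit)

# Sort by mean decrease in accuracy, most important variable first
imp_sorted <- imp[order(-imp[, "MeanDecreaseAccuracy"]),
                  c("MeanDecreaseAccuracy", "MeanDecreaseGini")]
print(round(imp_sorted, 2))
```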

In [None]:
Prediction <- predict(fit, test)

But let’s not give up yet. There’s more than one ensemble model. Let’s try a forest of conditional inference trees. They make their decisions in slightly different ways, using a statistical test rather than a purity measure, but the basic construction of each tree is fairly similar.

In [None]:
install.packages('party')
library(party)

fit <- cforest(as.factor(outcome) ~ exp_var1 + exp_var2 + ...,
                 data = df_train, 
                 controls=cforest_unbiased(ntree=2000, mtry=3))

Prediction <- predict(fit, test, OOB=TRUE, type = "response")