# Introduction

 1. [Very Quick Summary](#very-quick-summary)
 2. [Survival Relationships](#survival-relationships)
 3. [Feature Engineering](#feature-engineering)
 4. [Missing Values](#missing-values)
 5. [Predicting](#predicting)

## Very Quick Summary

In [None]:
library(ggplot2)
library(dplyr,warn.conflicts = FALSE)
library(formattable,warn.conflicts = FALSE)

tr = read.csv("../input/train.csv")
pr = read.csv("../input/test.csv")

#formattable(summary(tr))

Displaying the first rows gives us another perspective. Two columns will probably be hard to get much out of:

 - Cabin: it looks like there are a lot of null values
 - Ticket: the string pattern and its meaning will be hard to turn into a usable value

In [None]:
formattable(head(tr,100))
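The suspicion about null values can be checked with a quick count per column. The sketch below runs on a made-up `toy` data frame (not the real `train.csv`) and counts both `NA` and empty strings, since `read.csv` leaves blank text fields as `""` by default. The names `toy` and `missingCnt` are illustrative only.

```r
# Toy rows in the shape of train.csv (values are made up for illustration).
toy <- data.frame(
  Age      = c(22, NA, 38, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Count entries that are NA or an empty string, column by column.
missingCnt <- sapply(toy, function(col) sum(is.na(col) | as.character(col) == ""))
print(missingCnt)
```

Replacing `toy` with `tr` should give the same kind of per-column count for the training set.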

## Survival Relationships

Check the relationship between each variable and survival.

### Pclass

There are only three class categories, but the survival rate differs clearly between them.

In [None]:
trPclass <- tr %>% group_by(Pclass) %>% summarize(cnt=n(),surv=mean(Survived)) %>% arrange(-surv) %>% as.data.frame()
formattable(trPclass)

### Sex

Note that there are almost two males for every female. Females have a much better chance of survival.

In [None]:
trSex <- tr %>% group_by(Sex) %>% summarize(cnt=n(),surv=mean(Survived)) %>% arrange(-surv) %>% as.data.frame()
formattable(trSex)
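The Pclass and Sex summaries use the same group-and-summarize pattern, so it could be wrapped in a small function. This is only a sketch: `survBy` is a hypothetical helper, `toy` is made-up data, and `across(all_of(...))` assumes dplyr 1.0 or later.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: passenger count and survival rate for one grouping column.
survBy <- function(df, col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarize(cnt = n(), surv = mean(Survived), .groups = "drop") %>%
    arrange(desc(surv)) %>%
    as.data.frame()
}

# Toy rows standing in for train.csv (values are made up for illustration).
toy <- data.frame(
  Sex      = c("male", "female", "male", "female", "male", "male"),
  Pclass   = c(3, 1, 3, 2, 1, 3),
  Survived = c(0, 1, 0, 1, 1, 0)
)

toySex <- survBy(toy, "Sex")
toyPclass <- survBy(toy, "Pclass")
print(toySex)
```

With the real data, `survBy(tr, "Sex")` would be expected to produce the same table as the cell above.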



### Age

We have 177 null Age values, which we will need to take care of at a later stage. Let's just look at the distribution here.

Age is probably not very predictive, as there is no clear relationship between the two variables. We will know more about its predictive power when we run the model.

In [None]:
ggplot(subset(tr,!is.na(Age)), aes(x = Age, fill = factor(Survived))) + geom_histogram(bins=15,position = 'dodge')
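A numeric complement to the histogram is the survival rate per age band. The sketch below uses made-up rows and arbitrary bin edges (0, 12, 40, 80), which are assumptions for illustration and not values taken from the analysis above.

```r
# Toy rows (values are made up for illustration).
toy <- data.frame(
  Age      = c(2, 8, 25, 31, 47, 64, NA),
  Survived = c(1, 1, 0, 1, 0, 0, 1)
)

# Keep rows with a known age, as the histogram does.
known <- subset(toy, !is.na(Age))

# Hypothetical bin edges; cut() builds right-closed intervals such as (0,12].
known$ageBand <- cut(known$Age, breaks = c(0, 12, 40, 80),
                     labels = c("0-12", "12-40", "40-80"))

# Mean of Survived per band is the survival rate in that band.
ageSurv <- tapply(known$Survived, known$ageBand, mean)
print(ageSurv)
```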



### Siblings

 - Having siblings (SibSp also counts spouses) does improve your chances of survival

In [None]:
ggplot(tr, aes(x = SibSp, fill = factor(Survived))) + geom_bar(position='dodge')

### Parents/Children

Small families do have a better chance of survival.

In [None]:
ggplot(tr, aes(x = Parch, fill = factor(Survived))) + geom_bar(position='dodge')
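One way to look at family size directly is to combine SibSp and Parch into a single count. This is a sketch of a possible derived feature, not something the analysis above uses: `familySize` is a hypothetical column and `toy` is made-up data.

```r
# Toy rows (values are made up for illustration).
toy <- data.frame(
  SibSp    = c(0, 1, 1, 4, 0),
  Parch    = c(0, 0, 2, 2, 0),
  Survived = c(0, 1, 1, 0, 0)
)

# Hypothetical feature: siblings/spouses + parents/children + the passenger.
toy$familySize <- toy$SibSp + toy$Parch + 1

# Survival rate for each family size, sorted by size.
famSurv <- aggregate(Survived ~ familySize, data = toy, FUN = mean)
print(famSurv)
```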

### Fare

We can note a slight increase in survival as the ticket fare increases. However, as most tickets were bought at a low price, this will only help us a little.

In [None]:
ggplot(tr, aes(x = Fare, fill = factor(Survived))) + geom_histogram(bins=15,position='dodge')

### Embarked

There are only three significant categories, and about 20% of our population falls into one with a higher chance of survival.
Two values are missing.

In [None]:
trEmbarked <- tr %>% group_by(Embarked) %>% summarize(cnt=n(),surv=mean(Survived)) %>% arrange(-surv) %>% as.data.frame()
formattable(trEmbarked)

## Feature Engineering

We've just looked into the raw data. Let's see if we can enrich it with the other string columns.

### Name

Name is a string variable. String variables can be rich, but to be useful they should be transformed into categorical or, even better, numerical variables.

From the first rows, we can deduce that everyone's title sits between a comma and a period. Let's extract it first and see.

In [None]:
tr$title <- gsub('(.*, )|(\\..*)', '', tr$Name)
tr$title[tr$title == "Ms"] <- "Miss"
tr$title[tr$title == "Mlle"] <- "Miss"
tr$title[!tr$title %in% c("Miss","Mrs","Mr")] <- "Other"

trTitle <- tr %>% group_by(title) %>% summarize(cnt=n(),surv=mean(Survived)) %>% arrange(-surv) %>% as.data.frame()
formattable(trTitle)
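To see what the pattern does, here is the same `gsub` call applied to a few made-up names written in the "Surname, Title. Given names" format. The names are hypothetical; only the regular expression and the grouping rules come from the cell above.

```r
# Hypothetical names in the same format as the Name column.
sampleNames <- c("Doe, Mr. John",
                 "Doe, Mrs. Jane (Mary Smith)",
                 "Roe, Mlle. Anne",
                 "Poe, Dr. Sam")

# Drop everything up to ", " and everything from the first "." onwards.
titles <- gsub('(.*, )|(\\..*)', '', sampleNames)

# Same grouping rules as above: fold Mlle into Miss, rare titles into Other.
titles[titles == "Mlle"] <- "Miss"
titles[!titles %in% c("Miss", "Mrs", "Mr")] <- "Other"
print(titles)
```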

### Cabin

As mentioned earlier, Cabin looks hard to get any value from because of its null values. Let's just count them.
Well, maybe we were a bit quick in our judgement. Few values are non-null, but the column seems to have some predictive potential.

In [None]:
# TRUE when a cabin is recorded, FALSE when the field is empty
tr$cabinCharCnt <- nchar(as.character(tr$Cabin)) > 0
trCabin <- tr %>% group_by(cabinCharCnt) %>% summarize(cnt=n(),surv=mean(Survived)) %>% arrange(-cnt) %>% as.data.frame()
formattable(trCabin)


### Tickets

Tickets are quite hard to figure out. We will just try removing the digits and see what we get.
With 660 tickets left with an empty value and nothing clear coming out of the other tiny categories, we will give up on this field.

In [None]:
#Count number of characters in tickets
tr$ticketsCnt <- nchar(as.character(tr$Ticket))
#trTickets <- tr %>% group_by(ticketsCnt) %>% summarize(cnt=n(),surv=mean(Survived)) %>% arrange(-cnt) %>% as.data.frame()
    
#Remove digits
tr$ticketLetter <- gsub('[0-9/\\.]+', '', toupper(tr$Ticket))
trTickets <- tr %>% group_by(ticketLetter) %>% summarize(cnt=n(),surv=mean(Survived)) %>% arrange(-cnt) %>% as.data.frame()
formattable(trTickets)
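The digit-stripping step can be traced on a few sample strings written in the style of the Ticket column. Note that the pattern leaves spaces in place, so trimming them may merge some of the tiny categories. The trimming and the "NONE" label are assumptions added for this sketch, not part of the cell above.

```r
# Sample strings in the style of the Ticket column (for illustration only).
sampleTickets <- c("A/5 21171", "PC 17599", "STON/O2. 3101282", "113803")

# Same pattern as above: remove digits, slashes and dots.
raw <- gsub('[0-9/\\.]+', '', toupper(sampleTickets))

# Assumption: trim leftover spaces and label purely numeric tickets explicitly.
clean <- trimws(raw)
clean[clean == ""] <- "NONE"
print(clean)
```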
    

## Missing Values

### Embarked

In [None]:
formattable(subset(tr,nchar(as.character(Embarked)) == 0))

In [None]:
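library(rpart)

# Rows with a known port; factor() drops the empty "" level from the response
embKnown <- transform(subset(tr, nchar(as.character(Embarked)) > 0), Embarked = factor(as.character(Embarked)))
# Use a few simple predictors instead of `.`, which would pull in Name, Ticket and Cabin
class_emb_mod <- rpart(Embarked ~ Pclass + Sex + Fare + SibSp + Parch, data=embKnown, method="class")
# type="class" returns the predicted port rather than a matrix of class probabilities
emb_pred <- predict(class_emb_mod, subset(tr,nchar(as.character(Embarked)) == 0), type="class")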
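With only two values missing, a simpler baseline than a tree is to fill them with the most frequent port. The sketch below shows that alternative on made-up data. It is not the method used in the cell above, and `toy`, `isMissing` and `mostCommon` are illustrative names.

```r
# Toy column (values are made up for illustration).
toy <- data.frame(Embarked = c("S", "C", "", "S", "Q", ""),
                  stringsAsFactors = FALSE)

# Flag the empty entries, as the subset() calls above do.
isMissing <- nchar(toy$Embarked) == 0

# Most frequent port among the known values.
portCounts <- table(toy$Embarked[!isMissing])
mostCommon <- names(portCounts)[which.max(portCounts)]

# Write the baseline value back into the missing rows.
toy$Embarked[isMissing] <- mostCommon
print(toy$Embarked)
```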

### Age

In [None]:
formattable(head(subset(tr,is.na(Age))))

In [None]:
library(mice)
# perform mice imputation, based on random forests
miceMod <- mice(tr, method="rf") 
# generate the completed data
miceOutput <- complete(miceMod)  
anyNA(miceOutput)
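As a sanity check on the imputed ages, it can help to compare them with a simple baseline such as the median age within each passenger class. The sketch below shows that baseline on made-up data. Grouping by Pclass is an assumption for illustration, and `AgeFilled` is a hypothetical column.

```r
# Toy rows (values are made up for illustration).
toy <- data.frame(
  Pclass = c(1, 1, 1, 3, 3, 3),
  Age    = c(40, 50, NA, 20, NA, 30)
)

# Median of the known ages within each class, repeated for every row.
classMedian <- ave(toy$Age, toy$Pclass,
                   FUN = function(a) median(a, na.rm = TRUE))

# Keep known ages and fill only the missing ones.
toy$AgeFilled <- ifelse(is.na(toy$Age), classMedian, toy$Age)
print(toy)
```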

## Predicting

### Random Forest

In [None]:
library(randomForest)
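A minimal sketch of what a random forest fit could look like, on synthetic data so that it runs on its own. The predictors, `ntree = 100` and the made-up survival rule are all assumptions for illustration; the model for the real data would use the cleaned training set and whichever features prove useful.

```r
library(randomForest)

set.seed(42)
n <- 200

# Synthetic passengers (not the real data).
toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("male", "female"), n, replace = TRUE)),
  Fare   = runif(n, 5, 100)
)

# Made-up rule so the forest has something to learn: only women survive.
# The response must be a factor for randomForest to do classification.
toy$Survived <- factor(ifelse(toy$Sex == "female", 1, 0))

rfMod <- randomForest(Survived ~ Pclass + Sex + Fare, data = toy, ntree = 100)

# Predictions on the training rows; this overstates real accuracy.
rfPred <- predict(rfMod, newdata = toy)
trainAcc <- mean(rfPred == toy$Survived)
print(trainAcc)
```

Accuracy measured on the training rows is optimistic, so a held-out split or the out-of-bag error reported by `print(rfMod)` is a better guide.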

### Boost
