# Titanic Dataset

https://www.kaggle.com/c/titanic



Some analysis in R

By: Marc Burt

Import the libraries and bind the training and testing data. `str()` gives a good description of the combined data set.

In [None]:
library('sampling')
library('UsingR')
library('scales')
library('dplyr')
library('magrittr')
library('ggplot2')  # needed for the ggplot() call in the family size plot


train <- read.csv('/home/marcburt/Documents/School/BU DA/R/Final/train.csv')
test <- read.csv('/home/marcburt/Documents/School/BU DA/R/Final/test.csv')


#### Combining data
data <- bind_rows(train, test)
str(data)



Did more males or females make it out alive?

In [None]:
barplot(table(train$Survived, train$Sex), ylab = "Passenger #", col = c("blue", "red"))
legend("topleft", legend = c("Died", "Survived"), fill = c("blue", "red"), inset = .05)

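The bar plot shows raw counts, so a table of within-group proportions can make the comparison easier to read. The sketch below uses a small made-up data frame (`toy`, not the Kaggle data) purely to show the call; with the real data, `toy` would be replaced by `train`.

```r
# Toy stand-in for train: values are invented for illustration only
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 0, 1, 0, 1),
  Sex      = c("male", "female", "female", "male", "male", "female", "male", "male")
)

# Counts of survival outcome (rows) by sex (columns)
counts <- table(toy$Survived, toy$Sex)

# margin = 2 normalises each column, giving the survival rate within each sex
rates <- prop.table(counts, margin = 2)
print(round(rates, 2))
```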
Did family size affect the outcome of survival?

In [None]:
data$Fsize <- data$SibSp + data$Parch + 1

ggplot(data[1:nrow(train),], aes(x = Fsize, fill = factor(Survived)))+
	geom_bar(stat = 'count', position = 'dodge')

Let's take a look at it differently. If we break family size down into a few groups and then measure survivability that way, we can see more of a distinction.

In [None]:
data$FsizeP[data$Fsize == 1] <- 'single'
data$FsizeP[data$Fsize == 2] <- 'couple'
data$FsizeP[data$Fsize >= 3 & data$Fsize <=4] <- 'small'
data$FsizeP[data$Fsize >= 5] <- 'large'

mosaicplot(table(data$FsizeP, data$Survived), main = 'Survival based on family size')

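The four assignments above can also be written as a single binning step. This is only a sketch of an alternative, using base R's `cut()` on a made-up vector of family sizes; the breaks mirror the groups in the cell above.

```r
# Made-up family sizes for illustration (not taken from the data set)
fsize <- c(1, 2, 3, 4, 5, 8, 11)

# Right-closed intervals: (0,1] single, (1,2] couple, (2,4] small, (4,Inf] large
fsize_group <- cut(
  fsize,
  breaks = c(0, 1, 2, 4, Inf),
  labels = c("single", "couple", "small", "large")
)

print(table(fsize_group))
```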
Let's take a look at the same but measure it against passenger class.

In [None]:
mosaicplot(table(train$Survived,train$Pclass), main="Survival by Class",ylab="Class",xlab="Survived")


Let's take a look at the cost of tickets and do some sampling here.

First -> let's get a summary of the data and see how it is broken down.

Second -> let's look at a histogram.

Third -> let's see how that looks as a boxplot, with the axis labelled at the five-number summary.

In [None]:
#### Basic analysis of the fare -> how much did a ticket cost
fare <- as.numeric(data$Fare)  # as.numeric() has no na.rm argument
summary(fare)                  # summary() reports the NA count separately

#### Based on the summary, the distribution is positively (right) skewed:
#### the mean sits well above the median. The following chart shows this.
hist(fare, ylab = "Count", xlab = "Price", breaks = 50, col = 'purple')
#### Boxplot to show the same. There are a number of outliers, the most notable being the fare of about 512
boxplot(fare, horizontal = TRUE, xaxt = 'n')
axis(side = 1, at = round(fivenum(fare), 0), labels = TRUE, las = 2)

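One quick way to check the direction of a skew is to compare the mean with the median: a long right tail pulls the mean above the median. The sketch below does this on a made-up vector (`toy_fare`, not the real fares) and also computes a simple moment-based skewness, defined here by hand rather than taken from a package.

```r
# Invented fares with one very large value, for illustration only
toy_fare <- c(7, 8, 8, 10, 13, 15, 26, 30, 80, 512)

# Moment-based skewness: mean cubed deviation over the cubed (population) SD
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

cat("mean =", mean(toy_fare), " median =", median(toy_fare), "\n")
cat("skewness =", skewness(toy_fare), "\n")  # a positive value means right-skewed
```

With the real data, `toy_fare` would be replaced by `fare`.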

In [None]:
data[which(data$Fare > 512), ]  # which() skips the missing fare instead of returning an all-NA row




In [None]:
remove_outliers <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  return(y)
}

rfare <- remove_outliers(fare)
rfare <- na.omit(rfare)
head(rfare)
summary(rfare, na.rm = TRUE)
#### These look much better
# Single graphs
hist(rfare, ylab="Count", xlab = "Price", breaks = 50, col = 'purple')

boxplot(rfare, horizontal = TRUE, xaxt = 'n')
axis(side = 1, at = round(fivenum(rfare), 0), labels = TRUE, las = 2)  # label with the trimmed data, not the original fare

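As a small worked example of the rule used in `remove_outliers`, the sketch below computes the two fences (Q1 - 1.5 x IQR and Q3 + 1.5 x IQR) on a made-up vector and lists the values outside them. The vector `x` is invented for illustration.

```r
# Invented values for illustration only
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

q <- quantile(x, probs = c(0.25, 0.75))  # first and third quartiles
h <- 1.5 * IQR(x)                        # distance from the quartiles to the fences

lower <- q[[1]] - h
upper <- q[[2]] + h

# Values outside [lower, upper] are the ones remove_outliers would set to NA
print(c(lower = lower, upper = upper))
print(x[x < lower | x > upper])
```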
We can see a major difference if we plot them side by side. Without the outliers, the distribution looks much less skewed.

In [None]:
par(mfrow = c(2,2))
	hist(fare, ylab="Count", xlab = "Price", breaks = 25, col = 'purple', main = "With Outliers")
	boxplot(fare, horizontal = TRUE, xaxt = 'n', main = 'With Outliers')
		axis(side = 1, at = round(fivenum(fare),0), labels = TRUE, las = 2)

	hist(rfare, ylab="Count", xlab = "Price", breaks = 25, col = 'purple', main = 'Without Outliers')
	boxplot(rfare, horizontal = TRUE, xaxt = 'n', main = 'Without Outliers')
		axis(side = 1, at = round(fivenum(rfare),0), labels = TRUE, las = 2)


I drew samples at various sizes and plotted the distributions of their means against each other. The central tendency held, and the sample means clustered more tightly around the population mean as the sample size increased.

In [None]:
#### One sample mean per observation in rfare, recomputed for each sample size
xbar <- numeric(length(rfare))

cat("Population Distribution Mean = ", mean(rfare, na.rm = TRUE), " SD = ", sd(rfare, na.rm = TRUE), "\n")

par(mfrow = c(2,2))

for (size in c(50, 75, 100, 125)) {
    for (i in seq_along(xbar)) {
        xbar[i] <- mean(sample(rfare, size = size, replace = TRUE))
    }
    hist(xbar, prob = TRUE, main = paste("Sample Size =", size), xlim = c(10, 30))

    cat("Sample Size = ", size, " Mean = ", mean(xbar),
        " SD = ", sd(xbar), "\n")
}

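The shrinking spread in the histograms is what the central limit theorem predicts: the standard deviation of the sample mean is roughly the population SD divided by the square root of the sample size. The sketch below checks this on simulated data; the exponential population, the seed, and the number of repetitions are assumptions chosen for illustration, not part of the analysis above.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Simulated right-skewed population (assumption: stands in for the fares)
population <- rexp(10000, rate = 1 / 20)
pop_sd <- sd(population)

for (size in c(50, 75, 100, 125)) {
  # 2000 sample means at this sample size
  means <- replicate(2000, mean(sample(population, size = size, replace = TRUE)))
  cat(sprintf("n = %3d: observed SD = %.2f, sigma/sqrt(n) = %.2f\n",
              size, sd(means), pop_sd / sqrt(size)))
}
```

The two columns should track each other closely at every sample size.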

Here I sample the Survived column using simple random sampling and then systematic sampling. A 1 means the passenger survived and a 0 means they did not. In my runs the simple random sample came closer to the overall survival rate, which makes sense given the nature of the data.

In [None]:
#### Sampling on the survival rate

#### basic probability of survival
train$Survived%>%
	table
train$Survived%>%
	table%>%
	prop.table

In [None]:
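#### Simple random sample without replacement: 20 rows from train
s <- srswor(20, nrow(train))
sample.1 <- train[s != 0, ]  # index train (not data) so the indicator lines up with its rows
sample.1$Survived%>%
    table
sample.1$Survived%>%
    table%>%
    prop.table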


In [None]:
N <- nrow(train)
n <- 20
k <- floor(N/n)  # floor keeps the last index within 1..N; ceiling can overshoot
r <- sample(k, 1)

s <- seq(r, by = k, length.out = n)

sample.2 <- train[s,]

sample.2$Survived%>%
    table
sample.2$Survived%>%
    table%>%
    prop.table

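The systematic sample above can be wrapped in a small helper. This is a sketch under one assumption: the step `k` is taken as `floor(N / n)`, which keeps every selected index within `1..N`. The function name `systematic_index` is hypothetical, and `N = 891` is just an example population size.

```r
# Hypothetical helper: row indices for a systematic sample of size n from N rows
systematic_index <- function(N, n) {
  k <- floor(N / n)   # step between selected rows
  r <- sample(k, 1)   # random start in 1..k
  seq(r, by = k, length.out = n)
}

set.seed(1)  # arbitrary seed for repeatability
idx <- systematic_index(N = 891, n = 20)
print(idx)
```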
In [None]:
### Confidence intervals of data given the amount of the data
sample.size <- 50
pop.sd <- sd(rfare)
sd.sample.means <- pop.sd/sqrt(sample.size)
samples <- 20

xbar <- numeric(samples)

for (i in 1:samples){
	sample.data.1 <- sample(as.numeric(rfare), size = sample.size)
	xbar[i] <- mean(sample.data.1)
	# xbar +/- 2 standard errors is roughly a 95% interval
	str <- sprintf("%2d: xbar = %.2f, CI = %.2f-%.2f", i, xbar[i], xbar[i] - 2*sd.sample.means, xbar[i] + 2*sd.sample.means)
	cat(str, '\n')
}

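Each interval above is the sample mean plus or minus 2 standard errors, which corresponds to roughly 95% confidence. One way to see what that means is to count how often such intervals cover the true mean. The sketch below does this on a simulated population; the population, the seed, and the repetition count are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed so the sketch is repeatable

# Simulated population (assumption: stands in for rfare)
population <- rexp(5000, rate = 1 / 18)
pop_mean <- mean(population)
se <- sd(population) / sqrt(50)  # standard error for samples of size 50

# Draw many samples and record whether xbar +/- 2 * se covers the population mean
covered <- replicate(1000, {
  xbar_i <- mean(sample(population, size = 50))
  (xbar_i - 2 * se <= pop_mean) && (pop_mean <= xbar_i + 2 * se)
})

cat("Share of intervals covering the mean:", mean(covered), "\n")
```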
Now per request, I tested at the 80 and 90 percent confidence levels. To add some flavor, I included a few other alphas as well.

In [None]:
#### Confidence Intervals at 80 and 90.

conf <- c(75,80,85,90,95)
alpha <- 1 - conf/100
sample.data <- sample(rfare, size = sample.size)
xbar <- mean(sample.data)
sd.sample.means <- pop.sd/sqrt(sample.size)


for (i in alpha){
	# %.0f rather than %d: 100*(1-i) is a double and may not be an exact integer
	str <- sprintf("%2.0f%% Conf Level (alpha = %.2f), CI = %.2f-%.2f", 100*(1-i), i, xbar - qnorm(1-i/2)*sd.sample.means, xbar + qnorm(1-i/2)*sd.sample.means)
	cat(str, '\n')
}

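The width of each interval is driven by the critical value `qnorm(1 - alpha/2)`. The sketch below simply tabulates that multiplier for the confidence levels used above, so the widening of the interval at higher confidence is visible directly.

```r
conf  <- c(75, 80, 85, 90, 95)
alpha <- 1 - conf / 100

# Two-sided critical value of the standard normal for each confidence level
z <- qnorm(1 - alpha / 2)

print(data.frame(conf = conf, alpha = round(alpha, 2), z = round(z, 3)))
```

Multiplying each `z` by the standard error gives the half-width of the corresponding interval.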




# Questions?