## Domain

This is an introductory data set, often described as the "hello world" of data science. It comes from an ongoing Kaggle competition that lets students of data science build a model and make a submission while they are still learning the subject.

## Problem

This is a binary classification problem: the challenge is to predict whether a passenger survived the sinking of the Titanic, given the passengers' demographic data. Here, the task $T$ is binary classification and the experience $E$ is the list of passengers and their survival outcomes.

Note that `read.table` and `read.csv` are equivalent except for their default arguments. `read.table` defaults to separating on whitespace with `header = FALSE`; `read.csv` defaults to separating on commas with `header = TRUE`.
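As a small illustration of that equivalence, the sketch below parses the same in-memory CSV text with both functions. The rows are made up for the example and are not taken from `train.csv`.

```r
# A tiny CSV held in memory, so no file is needed (made-up rows).
csv_text <- "id,name,score
1,Ann,3.5
2,Bob,4.0"

# read.csv already assumes sep = "," and header = TRUE.
a <- read.csv(text = csv_text)

# read.table needs both set explicitly to parse the same input.
b <- read.table(text = csv_text, sep = ",", header = TRUE)

# Compare the two data frames.
all.equal(a, b)
```

On simple input like this the two calls should produce the same data frame; the remaining default differences (quoting, comment characters, `fill`) only matter for messier files.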

## Solution

To solve this problem, we will generate a vector of 0/1 predictions, one per passenger, using filtering and logical masking.
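A minimal sketch of the filtering-and-masking idea on toy data may help before turning to the real file. The vectors `sex` and `age` and their values are hypothetical stand-ins for the Titanic columns.

```r
# Toy data (hypothetical values, not taken from train.csv).
sex <- c("male", "female", "female", "male")
age <- c(30, 25, 4, 8)

# A comparison yields a logical mask with one entry per element.
is_female <- sex == "female"

# Start from an all-zero prediction and switch on the masked positions.
prediction <- rep(0L, length(sex))
prediction[is_female] <- 1L
prediction[age < 10] <- 1L

prediction
```

Each rule is just a mask applied to the same prediction vector, so rules can be layered in any order that reflects their priority.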

In [1]:
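# Read the Kaggle training set; `TRUE` is safer than `T`, which can be reassigned.
titanic <- read.table('train.csv', sep = ",", header = TRUE)
# Use PassengerId as the row names, then drop the columns not used as features.
rownames(titanic) <- titanic$PassengerId
titanic$PassengerId <- NULL
titanic$Name <- NULL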

## Data Exploration

In [3]:
# factor() gives per-category counts even when Sex was read as character (R >= 4.0).
summary(factor(titanic$Sex))

### Use a Proportion Table to Look at Survival by Gender

Each cell is the share of all passengers that falls in that gender and survival group, so the whole table sums to 1.

In [4]:
prop.table(table(titanic$Sex, titanic$Survived))

        
                  0          1
  female 0.09090909 0.26150393
  male   0.52525253 0.12233446

This represents the survival proportions within each gender: each row sums to 1.

In [5]:
prop.table(table(titanic$Sex, titanic$Survived), 1)
# margin = 1 normalizes by row, so each row adds up to 1

        
                 0         1
  female 0.2579618 0.7420382
  male   0.8110919 0.1889081

This represents the gender proportions within each survival outcome: each column sums to 1.

In [8]:
prop.table(table(titanic$Sex, titanic$Survived), 2)
# margin = 2 normalizes by column, so each column adds up to 1

        
                 0         1
  female 0.1475410 0.6812865
  male   0.8524590 0.3187135
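The three normalizations used above can be checked on a toy table. This is only a sketch: `group` and `outcome` are made-up vectors, not Titanic columns.

```r
# Toy contingency table (made-up labels).
group   <- c("a", "a", "a", "b")
outcome <- c(0, 1, 1, 0)
counts  <- table(group, outcome)

all_cells <- prop.table(counts)      # every cell divided by the grand total
by_row    <- prop.table(counts, 1)   # each row divided by its row total
by_col    <- prop.table(counts, 2)   # each column divided by its column total

sum(all_cells)    # the whole table sums to 1
rowSums(by_row)   # every row sums to 1
colSums(by_col)   # every column sums to 1
```

Reading the right margin matters: the row-normalized table answers "what fraction of each group survived?", while the column-normalized table answers "what fraction of survivors belonged to each group?".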

### Use a Proportion Table to Look at Survival of Children

In [9]:
prop.table(table(titanic$Age < 10, titanic$Survived), 1)

       
                0         1
  FALSE 0.6134969 0.3865031
  TRUE  0.3870968 0.6129032

In [10]:
prop.table(table(titanic$Age < 10, titanic$Survived), 2)

       
                 0          1
  FALSE 0.94339623 0.86896552
  TRUE  0.05660377 0.13103448

## Benchmark Model

In [11]:
# Stop with an error if the two vectors differ in length.
verify_length <- function(v1, v2) {
    if (length(v1) != length(v2)) {
        stop('lengths of vectors do not match')
    }
}

# Fraction of predictions that equal the actual outcomes.
accuracy <- function(actual, predicted) {
    verify_length(actual, predicted)
    return(sum(actual == predicted) / length(actual))
}
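To make the accuracy calculation concrete, here is a short worked example on toy vectors (the values are invented for illustration). It also shows that the same quantity can be written with `mean`, since `TRUE` counts as 1 and `FALSE` as 0.

```r
# Toy outcomes and predictions (invented values).
actual    <- c(0, 1, 1, 0, 1)
predicted <- c(0, 1, 0, 0, 0)

# Fraction of positions where the two vectors agree: 3 matches out of 5.
by_sum <- sum(actual == predicted) / length(actual)

# The same quantity, because TRUE counts as 1 and FALSE as 0.
by_mean <- mean(actual == predicted)

c(by_sum, by_mean)
```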

In [12]:
# Baseline: predict that nobody survived.
no_survivors <- rep(0, length(titanic$Survived))
accuracy(titanic$Survived, no_survivors)

## Women Survived

In [13]:
women_survived <- titanic$Sex == 'female'
accuracy(titanic$Survived, women_survived)

## Children Survived

In [14]:
women_and_children_survived <- women_survived
# which() drops passengers with a missing Age, so only known children are set.
women_and_children_survived[which(titanic$Age < 10)] <- 1

In [15]:
accuracy(titanic$Survived, women_and_children_survived)

In [17]:
head(titanic)

Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
1,1,female,35.0,1,0,113803,53.1,C123,S
0,3,male,35.0,0,0,373450,8.05,,S
0,3,male,,0,0,330877,8.4583,,Q


In [18]:
prop.table(table(titanic$Pclass, titanic$Survived))

   
             0          1
  1 0.08978676 0.15263749
  2 0.10886644 0.09764310
  3 0.41750842 0.13355780

In [19]:
prop.table(table(titanic$Pclass, titanic$Survived),1)

   
            0         1
  1 0.3703704 0.6296296
  2 0.5271739 0.4728261
  3 0.7576375 0.2423625

In [20]:
Pclass_survived <- titanic$Pclass == 1
# Note: the variable name is reused; it now holds the first-class-only prediction.
women_and_children_survived <- Pclass_survived

In [21]:
accuracy(titanic$Survived, women_and_children_survived)

In [23]:
titanic$genderclass <- paste(titanic$Sex,titanic$Pclass)

In [24]:
head(titanic)

Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,genderclass
0,3,male,22.0,1,0,A/5 21171,7.25,,S,male 3
1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,female 1
1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,female 3
1,1,female,35.0,1,0,113803,53.1,C123,S,female 1
0,3,male,35.0,0,0,373450,8.05,,S,male 3
0,3,male,,0,0,330877,8.4583,,Q,male 3


In [25]:
prop.table(table(titanic$genderclass, titanic$Survived),1)

          
                    0          1
  female 1 0.03191489 0.96808511
  female 2 0.07894737 0.92105263
  female 3 0.50000000 0.50000000
  male 1   0.63114754 0.36885246
  male 2   0.84259259 0.15740741
  male 3   0.86455331 0.13544669

In [26]:
women_and_children_survived = 0

In [27]:
# %in% keeps both classes; a second assignment would overwrite the 'female 1' mask.
women_and_children_survived <- titanic$genderclass %in% c('female 1', 'female 2')
women_and_children_survived[which(titanic$Age < 10)] <- 1

In [28]:
accuracy(titanic$Survived, women_and_children_survived)

In [75]:
# Start from an all-zero prediction with one entry per passenger.
women_and_children_survived <- rep(0, nrow(titanic))

In [76]:
women_and_children_survived[titanic$genderclass == 'female 1'] = 1
women_and_children_survived[titanic$genderclass == 'female 2'] = 1
women_and_children_survived[titanic$genderclass == 'female 3'] = 0
women_and_children_survived[titanic$Age < 9] = 1
women_and_children_survived[is.na(women_and_children_survived)] = 0

In [77]:
accuracy(titanic$Survived, women_and_children_survived)

In [56]:
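# First-class men under 20; the result is NA where a first-class man's Age is missing.
youngrichmensurvived <- (titanic$genderclass == 'male 1' & titanic$Age < 20)
# Treat those unknown-age first-class men as predicted survivors.
youngrichmensurvived[is.na(youngrichmensurvived)] <- 1
cat(youngrichmensurvived)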

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [57]:
women_and_children_survived = youngrichmensurvived
accuracy(titanic$Survived, women_and_children_survived)

In [58]:
head(titanic, 28)

Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,genderclass
0,3,male,22.0,1,0,A/5 21171,7.25,,S,male 3
1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,female 1
1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,female 3
1,1,female,35.0,1,0,113803,53.1,C123,S,female 1
0,3,male,35.0,0,0,373450,8.05,,S,male 3
0,3,male,,0,0,330877,8.4583,,Q,male 3
0,1,male,54.0,0,0,17463,51.8625,E46,S,male 1
0,3,male,2.0,3,1,349909,21.075,,S,male 3
1,3,female,27.0,0,2,347742,11.1333,,S,female 3
1,2,female,14.0,1,0,237736,30.0708,,C,female 2


In [65]:
titanic[titanic$genderclass == 'male 1' & titanic$Age < 20,]

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,genderclass
28,0.0,1.0,male,19.0,3.0,2.0,19950,263.0,C23 C25 C27,S,male 1
,,,,,,,,,,,
NA.1,,,,,,,,,,,
NA.2,,,,,,,,,,,
NA.3,,,,,,,,,,,
NA.4,,,,,,,,,,,
NA.5,,,,,,,,,,,
NA.6,,,,,,,,,,,
NA.7,,,,,,,,,,,
306,1.0,1.0,male,0.92,1.0,2.0,113781,151.55,C22 C26,S,male 1
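The all-`NA` rows in this output most likely come from first-class men whose `Age` is missing: the logical index is `NA` for them, and R returns a placeholder row for each `NA` index. A minimal sketch on a toy data frame (made-up values) shows the effect and one way to avoid it with `which()`.

```r
# Toy frame with a missing age (made-up values).
df <- data.frame(class = c(1, 1, 3), age = c(19, NA, 5))

# TRUE & NA is NA, while FALSE & anything is FALSE.
mask <- df$class == 1 & df$age < 20

with_na    <- df[mask, ]          # the NA entry yields an all-NA row
without_na <- df[which(mask), ]   # which() keeps only the TRUE positions

nrow(with_na)
nrow(without_na)
```

Whether unknown ages should be dropped or handled separately is a modeling decision; `which()` simply makes the choice explicit.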
