# My Titanic Notebook / First Kaggle Project

This is my first Kaggle project, and my goal here is simply to learn the different phases of problem-solving when tackling a Data Science/ML question. Let's go through this step by step (this is openly inspired by the Notebooks available in the [Titanic Kaggle Tutorial](https://www.kaggle.com/c/titanic/overview/tutorials)):


## Context & Question

> On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. 

That's a **32% survival rate**.
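As a quick sanity check, that figure can be reproduced from the two counts in the quote above. This is only a minimal sketch: the variable names are my own and nothing here comes from the dataset itself.

```r
# Counts taken from the quoted Kaggle description
total_aboard <- 2224
deaths       <- 1502

# Survivors are everyone aboard who did not die
survivors     <- total_aboard - deaths
survival_rate <- survivors / total_aboard

round(100 * survival_rate, 1)  # about 32.5, i.e. roughly 32%
```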

> One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. 

This means we could somehow **design an algorithm to predict survival based on the passenger's features** (gender, age, class, etc.). Let's see...


## What is the available data? What does it look like?

> First let's load all of our data analysis/ML libraries just in case we need anything in there (we will):

## Focus of the project
> Data preprocessing (follow the Python tutorial)

> Try as many different models as possible

> Explain why we chose the final model

> Try the various feature selection methods, PCA, etc.

> Analyze the results, focusing on insights and findings

## Focus of Monday meeting
> Planning

• Problem Statement

• Data Description

• Methodology and Implementation

• Experimental Results
  – Comparison
  – Analysis and discussions
  – Interesting findings

• Conclusions
  – Summary of project achievements and findings
  – Future directions for improvements


Evaluation criteria (weights):

• Interestingness of your problem statement (20%)

• Richness of the insights and findings that you have gained (30%)

• Clarity of your presentation (20%)

• Technical depth, novelty and comparison (30%)

In [1]:
1+1

In [2]:
library()

In [3]:
library(ggplot2)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang


In [None]:
library(SwarmSVM)

In [5]:
install.packages("ggplot2")

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


In [None]:
install.packages("SwarmSVM")

also installing the dependencies ‘e1071’, ‘LiblineaR’, ‘SparseM’, ‘kernlab’, ‘checkmate’, ‘BBmisc’



In [9]:
k = 1
class(k)

In [10]:
rnorm(10)

In [None]:
v1 = rnorm(10)
v2 = rnorm(10)
cor(v1, v2)
install.packages("ggthemes")
install.packages("mice")


Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
also installing the dependencies ‘minqa’, ‘nloptr’, ‘RcppEigen’, ‘ucminf’, ‘lme4’, ‘ordinal’, ‘pan’, ‘jomo’, ‘mitml’



In [None]:
train <- read.csv('../input/train.csv', stringsAsFactors = F)
test  <- read.csv('../input/test.csv', stringsAsFactors = F)

In [None]:
# Load packages
library('ggplot2') # visualization
library('ggthemes') # visualization
library('scales') # visualization
library('dplyr') # data manipulation
library('mice') # imputation
library('randomForest') # classification algorithm




full  <- bind_rows(train, test) # bind training & test data

# check data
str(full)




# Grab title from passenger names
full$Title <- gsub('(.*, )|(\\..*)', '', full$Name)

# Show title counts by sex
table(full$Sex, full$Title)
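
# Sanity check of the title regex on a few made-up names (hypothetical
# examples, not rows from the data). The pattern deletes everything up to
# ", " and everything from the first "." onward, leaving only the title.
demo_names  <- c('Doe, Mr. John', 'Roe, Mrs. Jane (Mary Smith)', 'Poe, Master. Tom')
demo_titles <- gsub('(.*, )|(\\..*)', '', demo_names)
demo_titles  # "Mr" "Mrs" "Master"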






# Titles with very low cell counts to be combined to "rare" level
rare_title <- c('Dona', 'Lady', 'the Countess','Capt', 'Col', 'Don', 
                'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')

# Also reassign Mlle, Ms, and Mme to their common equivalents
full$Title[full$Title == 'Mlle']        <- 'Miss' 
full$Title[full$Title == 'Ms']          <- 'Miss'
full$Title[full$Title == 'Mme']         <- 'Mrs' 
full$Title[full$Title %in% rare_title]  <- 'Rare Title'

# Show title counts by sex again
table(full$Sex, full$Title)




# Finally, grab surname from passenger name
full$Surname <- sapply(full$Name,  
                      function(x) strsplit(x, split = '[,.]')[[1]][1])
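
# Check of the surname split on a made-up name (hypothetical, not from the
# data): splitting on "," or "." and keeping the first piece gives the surname.
demo_surname <- strsplit('Doe, Mr. John', split = '[,.]')[[1]][1]
demo_surname  # "Doe"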
                       
# Report the number of unique surnames
# (inferring ethnicity from surnames would be interesting, but is left for another time)
cat('We have', nlevels(factor(full$Surname)), 'unique surnames.\n')
                       
                       

                       
                       
# Create a family size variable including the passenger themselves
full$Fsize <- full$SibSp + full$Parch + 1

# Create a family variable 
full$Family <- paste(full$Surname, full$Fsize, sep='_')

                       
                       
# Use ggplot2 to visualize the relationship between family size & survival
# (rows 1:891 are the training set, the only rows where Survived is known)
ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +
  geom_bar(stat='count', position='dodge') +
  scale_x_continuous(breaks=c(1:11)) +
  labs(x = 'Family Size') +
  theme_few()
                       
                       
# Discretize family size
full$FsizeD[full$Fsize == 1] <- 'singleton'
full$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- 'small'
full$FsizeD[full$Fsize > 4] <- 'large'

# Show family size by survival using a mosaic plot
mosaicplot(table(full$FsizeD, full$Survived), main='Family Size by Survival', shade=TRUE)
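
# Alternative sketch (not the method used above): base R's cut() bins family
# size in one call. Shown on a hypothetical vector of sizes, using the
# right-closed intervals (0,1], (1,4] and (4,Inf] to match the rules above.
demo_fsize  <- c(1, 2, 4, 5, 11)
demo_fsizeD <- cut(demo_fsize, breaks = c(0, 1, 4, Inf),
                   labels = c('singleton', 'small', 'large'))
as.character(demo_fsizeD)  # "singleton" "small" "small" "large" "large"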
                       
# The Cabin variable appears to have a lot of missing values
full$Cabin[1:28]

# The first character is the deck. For example:
strsplit(full$Cabin[2], NULL)[[1]]
                       
# Create a Deck variable from the first character of Cabin (passenger deck A - F):
full$Deck <- factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
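
# Sketch of the same idea on hypothetical cabin strings, using substr()
# instead of strsplit(): take the first character, then turn empty cabins
# into NA explicitly so that missing decks are easy to count.
demo_cabin <- c('C85', '', 'E46', 'B57 B59')
demo_deck  <- substr(demo_cabin, 1, 1)
demo_deck[demo_deck == ''] <- NA
factor(demo_deck)
sum(is.na(demo_deck))  # 1 missing deck in this toy vector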
                         
                         
# Passengers 62 and 830 are missing their port of embarkation
full[c(62, 830), 'Embarked']
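
# One way to locate such rows (a sketch on a hypothetical toy data frame, not
# the real data): with stringsAsFactors = FALSE, read.csv keeps a blank text
# field as an empty string rather than NA, so both cases are tested.
demo_df <- data.frame(PassengerId = 1:4,
                      Embarked = c('S', '', 'C', NA),
                      stringsAsFactors = FALSE)
demo_missing <- which(is.na(demo_df$Embarked) | demo_df$Embarked == '')
demo_missing  # 2 4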