## Preliminary Analysis

We will start with some preliminary analysis on our data set.

### Load the dataset using R

First, we load the data set using the R function `read.csv` and assign it to the variable `titanic`. Note that `read.table` and `read.csv` in R are equivalent except for their default arguments. `read.table` defaults to separating on white space and to `header=FALSE`. `read.csv` defaults to separating on commas and to `header=TRUE`.
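
To illustrate the equivalence, the sketch below reads the same comma-separated text with both functions. The tiny inline data set and the variable names are made up for illustration; the extra arguments passed to `read.table` are the defaults that `read.csv` sets on our behalf.

```r
# A tiny, made-up CSV used only for illustration (not the Titanic data).
csv_text <- "id,port\n1,S\n2,C\n3,S"

# read.csv: comma separator and header row are the defaults.
via_csv <- read.csv(text = csv_text)

# read.table: the same behaviour, with read.csv's defaults spelled out.
via_table <- read.table(text = csv_text, header = TRUE, sep = ",",
                        quote = "\"", dec = ".", fill = TRUE,
                        comment.char = "")

# Compare the two results; they should be the same data frame.
identical(via_csv, via_table)
```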

#### load the dataset using `read.csv()`

In [1]:
titanic <- read.csv('titanic.csv')

In [2]:
stopifnot(dim(titanic) == c(891,12))

Next, we display the dimension (`dim()`) and the structure (`str()`) of our data frame. This is mostly done as a sanity check. We should have some idea of what the dimension and structure of our data are. By displaying these results immediately after loading the data, we can verify that the data has been loaded as we expect.

#### display the dimension of the data set

In [3]:
dim(titanic)

#### display the structure of the dataframe

In [4]:
str(titanic)

'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...


### The `R` Structure Object

I interpret the structure of our data frame in the following way. Each row in the output of `str(titanic)` represents a column in the data frame `titanic`. The value immediately following the `$` is the name of that column. The value immediately following the `:` is the data type of that column. The values following the data type are the first few values of the data in the column itself.

Note that R has made some default decisions about the structure of our data. It has designated five columns as integer columns, five columns as factor columns, and two columns as numeric columns. These may or may not be accurate according to our own understanding of the data. R made these choices by doing its best to infer the structure of the data while reading the CSV file. For example, a reasonable case could be made that neither the `Survived` column nor the `Pclass` column should be an integer, since both encode categories rather than quantities.
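
If we decide that a column such as `Survived` or `Pclass` is better treated as categorical, one possible approach is to convert it with `factor()`. The sketch below uses a small made-up data frame (`toy`) rather than the real data, and the labels chosen for `Survived` are an assumption for illustration. Note also that the factor columns in the `str()` output depend on the `stringsAsFactors` argument, whose default in `read.csv` changed to `FALSE` in R 4.0.0; newer versions of R read text columns as character vectors unless `stringsAsFactors = TRUE` is passed.

```r
# Made-up rows for illustration only; column names mirror the Titanic data.
toy <- data.frame(Survived = c(0L, 1L, 1L, 0L),
                  Pclass   = c(3L, 1L, 3L, 2L))

# Treat survival as a two-level category (the labels are our own choice).
toy$Survived <- factor(toy$Survived, levels = c(0, 1),
                       labels = c("no", "yes"))

# Treat passenger class as an ordered category: 1 < 2 < 3.
toy$Pclass <- factor(toy$Pclass, levels = c(1, 2, 3), ordered = TRUE)

# Inspect the new column types.
str(toy)
```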

### Categorical Features In R 

R stores categorical features using a special type of vector called a **factor**. The data is stored as a vector of integers. The factor has an additional attribute, however: a vector of levels. The integers stored as data are actually references into the vector of levels. We can think of the data stored in the factor as a mapping to the vector of levels.

#### display the class of the `titanic$Embarked` column

In [5]:
class(titanic$Embarked)

#### display the levels of the `titanic$Embarked` column

In [6]:
levels(titanic$Embarked)

#### display the first few values of the `titanic$Embarked` column

In [7]:
titanic$Embarked[1:5]
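
The mapping between a factor's integer data and its levels can be made explicit. The following is a minimal sketch on a made-up vector of embarkation codes (`ports` and the other variable names are illustrative, not part of the Titanic data).

```r
# A small made-up vector of embarkation codes (illustration only).
ports <- factor(c("S", "C", "S", "Q"))

# The levels: the distinct values, sorted by default.
port_levels <- levels(ports)

# The underlying data: integer positions into the levels vector.
port_codes <- as.integer(ports)

# Indexing the levels by the codes recovers the original labels.
recovered <- port_levels[port_codes]

print(port_levels)   # "C" "Q" "S"
print(port_codes)    # 3 1 3 2
print(recovered)     # "S" "C" "S" "Q"
```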

### Completely Unique Columns

We can see from the structure of our data frame that it contains two columns that appear to be completely unique. We are attempting to use the patterns in our data to make predictions about the survival of passengers during the Titanic disaster. If a column is completely unique, there is no pattern to be identified there: each passenger has their own unique value, and there is no immediate way to associate these unique values with each other. For this reason we will simply remove the completely unique columns. Prior to doing this, however, we should verify that they are in fact completely unique.

The two columns in question are `PassengerId` and `Name`. We will use the following method to establish that they are both completely unique:

1. We will take a measure of the number of passengers in the data set
2. We will take a measure of the number of unique values in each of the columns in question
3. If the values match we will consider the column safe for removal
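
The three steps above can also be wrapped in a small helper. This is only a sketch: `is_completely_unique` is a hypothetical name, and the `toy` data frame is made up for illustration.

```r
# Hypothetical helper: TRUE when every value in the column is distinct.
is_completely_unique <- function(column) {
  length(unique(column)) == length(column)
}

# Made-up data for illustration only.
toy <- data.frame(id  = c(1, 2, 3, 4),
                  sex = c("male", "female", "female", "male"))

# Apply the check to every column; the result is a named logical vector.
unique_flags <- sapply(toy, is_completely_unique)
unique_flags
```

Base R also offers `anyDuplicated()`, which returns `0` when a vector contains no duplicates and could be used for the same check.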

#### store the number of passengers

In [8]:
number_of_passengers = length(titanic$PassengerId)
number_of_passengers

#### display the number of unique values in `titanic$PassengerId` and `titanic$Name`

In [9]:
length(unique(titanic$PassengerId)); length(unique(titanic$Name))

We note that the values do indeed match; therefore, it is safe to drop both of these columns from our data frame. This can be done by assigning the `NULL` value to the named column. For example, we might do the following on a generic data frame and column:

    dataframe$mycolumn = NULL

#### drop the columns with completely unique values

In [10]:
titanic$PassengerId <- NULL
titanic$Name <- NULL

In [11]:
stopifnot(is.null(titanic$PassengerId))
stopifnot(is.null(titanic$Name))
stopifnot(as.vector(titanic[4,]) == c('1','1','female', '35', '1','0','113803','53.1','C123','S'))

### Summarize The Data

Finally, having dropped the features deemed not immediately useful, we display the summary statistics of the data frame using the `summary()` function. For numerical features, this function shows the minimum, maximum, quartiles, median, and mean, along with a count of missing values if there are any. For factors, it shows the counts per level, listing only the most frequent levels when there are many.
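
As a small illustration of the two kinds of summary, the sketch below calls `summary()` on one numeric column and one factor column of a made-up data frame (`toy` and its values are assumptions for illustration only).

```r
# Made-up data for illustration only.
toy <- data.frame(age = c(22, 38, NA, 35),
                  sex = factor(c("male", "female", "female", "male")))

# Numeric column: minimum, quartiles, median, mean, maximum, and NA count.
age_summary <- summary(toy$age)

# Factor column: the number of observations in each level.
sex_summary <- summary(toy$sex)

age_summary
sex_summary
```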

In [12]:
summary(titanic)

    Survived          Pclass          Sex           Age            SibSp      
 Min.   :0.0000   Min.   :1.000   female:314   Min.   : 0.42   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:2.000   male  :577   1st Qu.:20.12   1st Qu.:0.000  
 Median :0.0000   Median :3.000                Median :28.00   Median :0.000  
 Mean   :0.3838   Mean   :2.309                Mean   :29.70   Mean   :0.523  
 3rd Qu.:1.0000   3rd Qu.:3.000                3rd Qu.:38.00   3rd Qu.:1.000  
 Max.   :1.0000   Max.   :3.000                Max.   :80.00   Max.   :8.000  
                                               NA's   :177                    
     Parch             Ticket         Fare                Cabin     Embarked
 Min.   :0.0000   1601    :  7   Min.   :  0.00              :687    :  2   
 1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91   B96 B98    :  4   C:168   
 Median :0.0000   CA. 2343:  7   Median : 14.45   C23 C25 C27:  4   Q: 77   
 Mean   :0.3816   3101295 :  6   Mean   : 32.20   G6        