# The R tutorial

This notebook summarises the R tutorial by Trevor Stephens ([link](http://trevorstephens.com/kaggle-titanic-tutorial/getting-started-with-r/)) tackling the [Titanic Kaggle competition](https://www.kaggle.com/c/titanic). Let's see how it goes.

#### Set working directory 

Variables and objects are assigned with the `<-` operator.

Now let's set the working directory with `setwd()`.

In [1]:
setwd("~/TestingArea/Tutorials/R")
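
As a small sketch of both ideas, the cell below assigns a few values and changes the working directory, then restores it. The names `x`, `greeting`, `old_dir` and `new_dir`, and the use of `tempdir()` as a target, are illustrative assumptions and not part of the tutorial.

```r
# Assign values with the `<-` operator
x <- 42
greeting <- "hello"

# Remember the current working directory so it can be restored later
old_dir <- getwd()

# Switch to a directory that is guaranteed to exist (a temporary one here)
setwd(tempdir())
new_dir <- getwd()

# Go back to where we started
setwd(old_dir)
```

Storing the result of `getwd()` before calling `setwd()` makes it easy to undo the change at the end of a script.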

#### Read datasets

In [2]:
train <- read.csv("train.csv")

In [5]:
test <- read.csv("test.csv")

Let's see what we loaded. We can use R's `View` for this (you can always get help by typing `?View`). In a Jupyter notebook we can simply print the dataset. This also works in a script, but the output quickly becomes hard to read.

Instead, let's use `head`, the equivalent of pandas' `DataFrame.head()`, to see what is inside the dataset.

In [15]:
head(train,2)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |


In [17]:
tail(test,2)

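| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 417 | 1308 | 3 | Ware, Mr. Frederick | male | | 0 | 0 | 359309 | 8.05 | | S |
| 418 | 1309 | 3 | Peter, Master. Michael J | male | | 1 | 1 | 2668 | 22.3583 | | C |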
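
To make the behaviour of `head` and `tail` concrete without needing the CSV files, here is a minimal sketch on a made-up five-row data frame. The data frame `passengers` and the names `first_two` and `last_two` are hypothetical and exist only for this illustration.

```r
# Hypothetical five-row data frame, used only for illustration
passengers <- data.frame(
  PassengerId = 1:5,
  Pclass      = c(3, 1, 3, 1, 3)
)

first_two <- head(passengers, 2)   # first two rows
last_two  <- tail(passengers, 2)   # last two rows

print(first_two)
print(last_two)
```

`tail` keeps the original row numbers of the rows it returns, which is why the output for `tail(test,2)` is labelled 417 and 418.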


The dataset contains several features, which we need to understand before going further (the Kaggle competition page describes them):
- **PassengerId**: Identifier of the passenger
- **survival**: Survival
   - 0 = No; 1 = Yes
- **pclass**: Passenger Class
   - 1 = 1st; 2 = 2nd; 3 = 3rd
- **name**: Name of the passenger
- **sex**: Sex of the passenger
- **age**: Age
- **sibsp**: Number of Siblings/Spouses aboard
- **parch**: Number of Parents/Children aboard
- **ticket**: Ticket Number
- **fare**: Passenger Fare
- **cabin**: Cabin number
- **embarked**: Port of Embarkation
   - C = Cherbourg; Q = Queenstown; S = Southampton

Additionally, some columns of the dataset are incomplete or inconsistent: cabin has empty entries, age is sometimes `NA` (and numeric otherwise), and the format of the ticket does not seem to be consistent.
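
One way to quantify such gaps is to count `NA` values and empty strings per column. The sketch below does this on a made-up data frame; `df`, `na_counts` and `empty_cabins` are hypothetical names, and the values are not taken from the Titanic data.

```r
# Hypothetical mini data frame mimicking the kinds of gaps described above
df <- data.frame(
  Age   = c(22, NA, 26, NA),
  Cabin = c("", "C85", "", "E46"),
  stringsAsFactors = FALSE
)

# Number of NA entries per column
na_counts <- colSums(is.na(df))

# Number of empty strings in the Cabin column
empty_cabins <- sum(df$Cabin == "")

print(na_counts)
print(empty_cabins)
```

On the real data the same calls could presumably be applied to `train` directly, for example `colSums(is.na(train))`.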

We notice that the test dataset has no `Survived` column. Sure, that is what we're going to predict :)

#### Let's have a look at the structure of the data

In [18]:
str(train)

'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...


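So we have a sample of 891 observations (passengers) in the training set and 12 variables (features). Different data types appear:
- **int/num**: integer and numeric (floating-point) values
- **factor**: comparable to a categorical variable. E.g. the strings "male" and "female" become a factor with 2 levels, stored internally as the integer codes 1 and 2

To import strings from a dataset and keep them as strings, we could have used `read.csv("train.csv",stringsAsFactors=FALSE)`. (Since R 4.0.0, `stringsAsFactors=FALSE` is the default, so the factor columns shown above come from an older R version or an explicit `stringsAsFactors=TRUE`.)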
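
The difference between the two import modes can be sketched without any file by passing inline CSV text to `read.csv` through its `text` argument. The data and the names `csv_text`, `as_factor`, `as_string` and `codes` are made up for this illustration.

```r
# Inline CSV text standing in for a file (illustrative data only)
csv_text <- "Name,Sex\nAnna,female\nBob,male\nCara,female"

# Same text, imported once with factors and once with plain strings
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_string <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(as_factor)   # Sex is a factor with 2 levels
str(as_string)   # Sex is a character vector

# The integer codes stored behind the factor levels
codes <- as.integer(as_factor$Sex)
print(codes)
```

Factor levels are sorted alphabetically by default, so "female" gets code 1 and "male" gets code 2, matching the `str(train)` output above.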

For the *name* variable we find 891 levels, which means that no two passengers share a factor level and therefore no two share a name (?). For *ticket* and *cabin* there are fewer levels, due to missing or repeated entries (?).
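
The open questions above could be checked by counting distinct values and looking for duplicates. Below is a minimal sketch on made-up vectors; `names_col`, `ticket_col` and the other names are hypothetical, and the values are not the real Titanic entries.

```r
# Illustrative vectors, not taken from the Titanic data
names_col  <- c("Abbing", "Braund", "Cumings")
ticket_col <- c("110152", "110152", "PC 17599")

# As many distinct values as entries means there are no duplicates
n_names_unique <- length(unique(names_col))
any_name_dup   <- anyDuplicated(names_col) > 0

# Fewer distinct values than entries means some values are shared
n_ticket_unique <- length(unique(ticket_col))
ticket_counts   <- table(ticket_col)
shared_tickets  <- ticket_counts[ticket_counts > 1]

print(any_name_dup)
print(shared_tickets)
```

Applied to the real columns, `anyDuplicated(train$Name)` returning 0 would confirm that all names are distinct.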

In [22]:
head(train$Survived,4) # separate a column from the dataset ($ operator)

In [26]:
table(train$Survived) # table counts the occurrences of each value in the train$Survived column


  0   1 
549 342 

From *table()* we infer that 549 passengers died while 342 survived. What are the proportions?

In [32]:
prop.table(table(train$Survived)) 


        0         1 
0.6161616 0.3838384 
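
To read these proportions as percentages, one option is to scale and round them. The sketch below rebuilds the column from the counts printed by `table()` above, so it runs without `train.csv`; the names `survived` and `pct` are chosen here for illustration.

```r
# Rebuild a Survived-like vector from the counts reported by table() above
survived <- rep(c(0, 1), times = c(549, 342))

# Proportions multiplied by 100 and rounded to one decimal place
pct <- round(100 * prop.table(table(survived)), 1)
print(pct)
```

This gives roughly 61.6 % for 0 (died) and 38.4 % for 1 (survived).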

#### Create a survived prediction column in the test dataframe
First, assume everyone died.

In [36]:
test$Survived <- rep(0, nrow(test)) # one 0 per row of the test set (418 rows)

In [37]:
head(test,2) # check that the Survived column is now present

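| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q | 0 |
| 2 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S | 0 |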


### Solutions to the problem can be submitted in CSV format
The file has to contain the ID of each passenger in the test sample and the predicted survival:

In [41]:
solution <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)

#### Write to csv file

In [44]:
write.csv(solution, file="solution.csv", row.names=FALSE) # row.names=FALSE omits R's row-number column, so the file holds only PassengerId and Survived
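
To see what `row.names=FALSE` does to the file, the following sketch writes a made-up two-row data frame to a temporary file and reads it back. `solution_demo`, `path`, `raw_lines` and `read_back` are hypothetical names used only here, and the temporary file stands in for `solution.csv`.

```r
# Hypothetical two-row solution, standing in for the real one
solution_demo <- data.frame(PassengerId = c(892, 893), Survived = c(0, 0))

# Write to a temporary file so nothing in the working directory is touched
path <- tempfile(fileext = ".csv")
write.csv(solution_demo, file = path, row.names = FALSE)

# Inspect the raw lines and read the file back in
raw_lines <- readLines(path)
print(raw_lines)
read_back <- read.csv(path)
print(names(read_back))
```

The file then contains one header line plus one line per passenger, with exactly the two columns `PassengerId` and `Survived`. With the default `row.names=TRUE`, an extra unnamed first column of row numbers would be written as well.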