In this course, you will learn how to apply machine learning techniques in Python to predict a passenger's chance of surviving the Titanic disaster.


In [26]:
# Import the Pandas library
import pandas as pd
# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

In [12]:
# Print the `head` of the train and test DataFrames
print(train.head())
# print(test.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [11]:
# Before starting with the actual analysis, it's important to understand the structure of your data. 
# train.describe() to get a summary of your train data
# train.shape to get the dimension of data

print(train.describe())
print(train.shape)

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
(891, 12)
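
The `count` row above hints at something worth checking: `Age` has 714 non-missing values out of 891 rows. Below is a minimal sketch of one way to count missing values per column. The `toy` DataFrame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real training set
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Cabin": [None, "C85", None, "C123"],
})

# isna() marks missing cells as True; sum() then counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applied to `train`, the same call would be expected to report 891 - 714 = 177 missing ages, going by the `count` row above.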


How many people in your training set survived the Titanic disaster? To find out, use the value_counts() method together with standard bracket notation, which selects a single column of a DataFrame:

### absolute numbers
train["Survived"].value_counts()

### percentages
train["Survived"].value_counts(normalize = True)

In [22]:
# Passengers that survived vs passengers that passed away
print("\n Passengers that survived vs passengers that passed away")
print(train["Survived"].value_counts())

# As proportions
print("\nPassengers that survived vs passengers that passed away in proportions")
print(train["Survived"].value_counts(normalize=True))

# Males that survived vs males that passed away
print("\n Males that survived vs males that passed away")
print(train["Survived"][train["Sex"] == 'male'].value_counts())

# Females that survived vs Females that passed away
print("\n Females that survived vs Females that passed away")
print(train["Survived"][train["Sex"] == 'female'].value_counts())

# Normalized male survival
print("\nNormalized male survival")
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize=True))

# Normalized female survival
print("\nNormalized female survival")
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize=True))



 Passengers that survived vs passengers that passed away
0    549
1    342
Name: Survived, dtype: int64

Passengers that survived vs passengers that passed away in proportions
0    0.616162
1    0.383838
Name: Survived, dtype: float64

 Males that survived vs males that passed away
0    468
1    109
Name: Survived, dtype: int64

 Females that survived vs Females that passed away
1    233
0     81
Name: Survived, dtype: int64

Normalized male survival
0    0.811092
1    0.188908
Name: Survived, dtype: float64

Normalized female survival
1    0.742038
0    0.257962
Name: Survived, dtype: float64
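
The per-sex filtering above can also be written as a single grouped aggregation. Because `Survived` is coded as 0 or 1, its mean within a group equals that group's survival proportion. Here is a minimal sketch on made-up data (the `toy` DataFrame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: three females (two survived), three males (one survived)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Mean of a 0/1 column per group = proportion of 1s in that group
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

Run on `train`, this should reproduce the proportions of 1s printed above: roughly 0.19 for males and 0.74 for females.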



#### Does age play a role?

Another variable that could influence survival is age, since it's probable that children were saved first. You can test this by creating a new column with a categorical variable, Child. Child takes the value 1 when age is less than 18 and the value 0 when age is greater than or equal to 18.

To add this new variable you need to do two things: (i) create a new column, and (ii) provide the values for each observation (i.e., row) based on the age of the passenger.

Adding a new column with Pandas in Python is easy and can be done via the following syntax:

*train["new_var"] = 0*

This code creates a new column titled new_var in the train DataFrame, with the value 0 for every observation.
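
The two steps can be sketched on a small made-up DataFrame. This version fills the column with `.loc`, which takes a row mask and a column label in one indexing operation. The `toy` data is hypothetical. Note that a comparison against a missing age is False for both conditions, so such a row keeps its initial NaN.

```python
import pandas as pd

# Hypothetical ages, including one missing value
toy = pd.DataFrame({"Age": [4.0, 30.0, None, 17.0, 18.0]})

# Step (i): create the column, initially missing everywhere
toy["Child"] = float("nan")

# Step (ii): fill in values for the rows that match each condition
toy.loc[toy["Age"] < 18, "Child"] = 1
toy.loc[toy["Age"] >= 18, "Child"] = 0

print(toy)
```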

In [29]:
pd.options.mode.chained_assignment = None  

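# Create the column Child and initialize it to NaN
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older.
# .loc with a row mask and a column label writes to train directly,
# instead of relying on chained assignment.
# Passengers with a missing Age keep NaN.
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0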
#print(train.Child)

# Print normalized Survival Rates for passengers under 18
print("\nPrint normalized Survival Rates for passengers under 18")
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))

# Print normalized Survival Rates for passengers 18 or older
print("\nPrint normalized Survival Rates for passengers 18 or older")
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))



Print normalized Survival Rates for passengers under 18
1    0.539823
0    0.460177
Name: Survived, dtype: float64

Print normalized Survival Rates for passengers 18 or older
0    0.618968
1    0.381032
Name: Survived, dtype: float64


#### First Prediction
In an earlier exercise you found that, in your training set, females had a better than 50% chance of surviving (about 74%) and males a worse than 50% chance (about 19%). You can use this information for your first prediction: all females in the test set survive and all males in the test set die.
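
As a rough sanity check, you can work out how often this rule would be right on the training set itself, using the counts printed earlier. This is a sketch of the arithmetic only. It measures accuracy on the training data and is not the score Kaggle would report for the test set.

```python
# Counts copied from the value_counts() output earlier in this notebook
males_died, males_survived = 468, 109
females_survived, females_died = 233, 81

# The rule "females survive, males die" is correct for males who died
# and for females who survived
correct = males_died + females_survived
total = males_died + males_survived + females_survived + females_died

training_accuracy = correct / total
print(round(training_accuracy, 4))  # → 0.7868
```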

You use the test set to validate your predictions. You may have noticed that, unlike the training set, the test set has no Survived column. You add that column yourself and fill it with your predicted values. When you upload your results, Kaggle uses this variable (your predictions) to score your performance.

In [30]:
# Create a copy of test: test_one
# (.copy() makes an independent DataFrame; plain assignment would only alias test)
test_one = test.copy()

# Initialize a Survived column to 0
test_one['Survived'] = 0

# Set Survived to 1 if Sex equals "female" and print the `Survived` column from `test_one`
test_one.loc[test_one['Sex'] == 'female', 'Survived'] = 1
print(test_one.Survived)


0      0
1      1
2      0
3      0
4      1
5      0
6      1
7      0
8      1
9      0
10     0
11     0
12     1
13     0
14     1
15     1
16     0
17     0
18     1
19     1
20     0
21     0
22     1
23     0
24     1
25     0
26     1
27     0
28     0
29     0
      ..
388    0
389    0
390    0
391    1
392    0
393    0
394    0
395    1
396    0
397    1
398    0
399    0
400    1
401    0
402    1
403    0
404    0
405    0
406    0
407    0
408    1
409    1
410    1
411    1
412    1
413    0
414    1
415    0
416    0
417    0
Name: Survived, Length: 418, dtype: int64
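
Uploading predictions usually means writing them to a CSV file first. Below is a minimal sketch, assuming the submission needs only the `PassengerId` and `Survived` columns. That format is an assumption and is not stated above. The `toy_test` DataFrame and its IDs are made up, and an in-memory buffer stands in for a real file so the result can be inspected directly.

```python
import io

import pandas as pd

# Hypothetical stand-in for the test set (IDs are illustrative)
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Same rule as above: females survive, males do not
toy_test["Survived"] = 0
toy_test.loc[toy_test["Sex"] == "female", "Survived"] = 1

# Keep only the two assumed submission columns and drop the row index
buffer = io.StringIO()
toy_test[["PassengerId", "Survived"]].to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, pass a path such as `"my_solution_one.csv"` (a hypothetical name) to `to_csv` instead of the buffer.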
