# Survival Prediction on Titanic

## Steps to be followed:
 1. Problem Statement
 2. Data Collection
 3. Data Analysis
 4. Feature Engineering
 5. Data Modelling

## 1. Problem Statement

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. 

In this challenge, we analyze what sorts of people were likely to survive and apply machine learning techniques to predict which passengers survived the tragedy.

## 2. Data Collection

   We can download the train and test datasets from Kaggle and then load them with pandas.

In [259]:
import pandas as pd

train_set = pd.read_csv('input/train.csv')
test_set = pd.read_csv('input/test.csv')

## 3. Data Analysis
   Let's analyze the train and test datasets

In [260]:
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [261]:
train_set.shape

(891, 12)

In [262]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB


Number of rows in the train dataset = 891 
Number of columns in the train dataset = 12
Some values are missing in the Age, Cabin, and Embarked columns.
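
The missing-value observation above can also be read off directly instead of from the `info()` listing. The following is a minimal sketch on a made-up miniature frame (the `sample` DataFrame and its values are hypothetical, not the Kaggle data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for train_set (values are made up).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing entries per column, then keep only the columns that have any.
missing_counts = sample.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Applied to the real `train_set`, the same two lines should agree with the `info()` output above: 891 - 714 = 177 missing ages, 891 - 204 = 687 missing cabins, and 891 - 889 = 2 missing ports of embarkation.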

In [263]:
test_set.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [264]:
test_set.shape

(418, 11)

In [265]:
test_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 27.8+ KB


Number of rows in the test dataset = 418
Number of columns in the test dataset = 11
The missing column is Survived, whose values we need to predict.
Some values are also missing in the Age, Fare, and Cabin columns.

## 4. Feature Engineering

Features are the measurable properties of the data. A feature vector represents those numeric or symbolic characteristics in a form that a model can consume.

Feature engineering is the process of using domain knowledge of the data to create feature vectors that make machine learning algorithms work.

What are the things that we need to do here?
 1. Guess the missing values first
 2. Then map everything into possible numeric values

### Let's observe the data first:

In [266]:
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### First let's do something with Name

In [267]:
train_test_dataset = [train_set, test_set]  # keep both DataFrames in one list so the same operations can be applied to each in a loop

### The only value that is useful in the Name of a person in this case is the title of that person. So we will extract the title and map it accordingly

In [268]:
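for dataset in train_test_dataset:
    # Raw string so the backslash in \. reaches the regex engine unchanged;
    # the pattern captures the word that precedes the first period, e.g. "Mr".
    dataset['Title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)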
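
To see what the pattern picks out, here is a small sketch that runs the same regular expression on three names taken from the `head()` output above (the standalone `names` Series is only for illustration):

```python
import pandas as pd

# Three names copied from the train_set preview shown earlier.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# The pattern matches a space, captures a run of letters,
# and requires a literal period immediately after it.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame, which is why the result can be assigned straight to a new column.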

In [269]:
train_set['Title'].value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Major         2
Mlle          2
Col           2
Don           1
Jonkheer      1
Countess      1
Ms            1
Sir           1
Capt          1
Mme           1
Lady          1
Name: Title, dtype: int64

In [270]:
test_set['Title'].value_counts()

Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Dr          1
Dona        1
Ms          1
Name: Title, dtype: int64

### Map the title
 Mr : 0
 Miss : 1
 Mrs: 2
 Master: 3
 Others: 4

In [271]:
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, 
                 "Master": 3, "Dr": 4, "Rev": 4, "Col": 4, "Major": 4, "Mlle": 4,"Countess": 4,
                 "Ms": 4, "Lady": 4, "Jonkheer": 4, "Don": 4, "Dona" : 4, "Mme": 4,"Capt": 4,"Sir": 4 }
for dataset in train_test_dataset:
    dataset['Title'] = dataset['Title'].map(title_mapping)
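
`Series.map` returns NaN for any key that is absent from the dictionary, so a title that does not appear in either value count above would end up missing. As a hedged alternative (not what this notebook does), the rare titles can be handled with a fallback instead of being listed one by one; `common_titles` and `OTHER` are names made up for this sketch:

```python
import pandas as pd

titles = pd.Series(["Mr", "Miss", "Mrs", "Master", "Dr", "Rev", "Dona"])

# Only the four frequent titles are listed explicitly.
common_titles = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3}
OTHER = 4  # code for every title not listed above

# map() yields NaN for unlisted titles; fillna() sends those to OTHER.
title_codes = titles.map(common_titles).fillna(OTHER).astype(int)
print(title_codes.tolist())
```

The trade-off is that a typo in a frequent title would silently land in the "Others" group, whereas the explicit dictionary would surface it as NaN.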

In [272]:
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


In [273]:
test_set.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,2
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,2


We can now drop the Name column from both datasets, as it is no longer needed.

In [274]:
train_set.drop('Name', axis = 1, inplace = True)

In [275]:
test_set.drop('Name', axis = 1, inplace = True)

In [276]:
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,2
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S,2
4,5,0,3,male,35.0,0,0,373450,8.05,,S,0


In [277]:
test_set.head(5)

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,892,3,male,34.5,0,0,330911,7.8292,,Q,0
1,893,3,female,47.0,1,0,363272,7.0,,S,2
2,894,2,male,62.0,0,0,240276,9.6875,,Q,0
3,895,3,male,27.0,0,0,315154,8.6625,,S,0
4,896,3,female,22.0,1,1,3101298,12.2875,,S,2


### Map the sex

In [278]:
sex_mapping = {"male": 0, "female": 1}
for dataset in train_test_dataset:
    dataset['Sex'] = dataset['Sex'].map(sex_mapping)

In [279]:
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,0,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,1,38.0,1,0,PC 17599,71.2833,C85,C,2
2,3,1,3,1,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,1,35.0,1,0,113803,53.1,C123,S,2
4,5,0,3,0,35.0,0,0,373450,8.05,,S,0


In [280]:
test_set.head(5)

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,892,3,0,34.5,0,0,330911,7.8292,,Q,0
1,893,3,1,47.0,1,0,363272,7.0,,S,2
2,894,2,0,62.0,0,0,240276,9.6875,,Q,0
3,895,3,0,27.0,0,0,315154,8.6625,,S,0
4,896,3,1,22.0,1,1,3101298,12.2875,,S,2


### Do something with the age
 Fill in the missing ages with the median age of passengers who share the same title, then map the ages to bands

In [281]:
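# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not propagate to the DataFrame.
train_set['Age'] = train_set['Age'].fillna(train_set.groupby("Title")['Age'].transform("median"))
test_set['Age'] = test_set['Age'].fillna(test_set.groupby("Title")['Age'].transform("median"))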
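
The group-wise imputation is easier to follow on a tiny example. This is a minimal sketch with invented ages (the `df` frame is hypothetical); `transform("median")` returns a Series aligned with the original rows, so it can be passed straight to `fillna`:

```python
import numpy as np
import pandas as pd

# Made-up data: two title groups, each with one missing age.
df = pd.DataFrame({
    "Title": [0, 0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# Median age per title, broadcast back to one value per row.
group_median = df.groupby("Title")["Age"].transform("median")

# Only the missing entries are replaced; known ages are left untouched.
df["Age"] = df["Age"].fillna(group_median)
print(df)
```

Here the missing age in group 0 becomes 30.0 (median of 20 and 40) and the one in group 1 becomes 20.0 (median of 10 and 30).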

In [282]:
train_set.head(20)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,0,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,1,38.0,1,0,PC 17599,71.2833,C85,C,2
2,3,1,3,1,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,1,35.0,1,0,113803,53.1,C123,S,2
4,5,0,3,0,35.0,0,0,373450,8.05,,S,0
5,6,0,3,0,30.0,0,0,330877,8.4583,,Q,0
6,7,0,1,0,54.0,0,0,17463,51.8625,E46,S,0
7,8,0,3,0,2.0,3,1,349909,21.075,,S,3
8,9,1,3,1,27.0,0,2,347742,11.1333,,S,2
9,10,1,2,1,14.0,1,0,237736,30.0708,,C,2


### Map the Age
 Kids (Age <= 12): 0
 Teenagers (12 < Age <= 20): 1
 Adults (20 < Age <= 35): 2
 Middle-aged (35 < Age <= 50): 3
 Old (Age > 60): 4

In [283]:
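for dataset in train_test_dataset:
    # Assign plain scalars; a trailing comma would assign the one-element tuple (0,) instead.
    dataset.loc[dataset['Age'] <= 12, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 12) & (dataset['Age'] <= 20), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 20) & (dataset['Age'] <= 35), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 35) & (dataset['Age'] <= 50), 'Age'] = 3
    # Ages in (50, 60] match none of these conditions and keep their raw value;
    # changing this threshold to > 50 would place them in band 4.
    dataset.loc[dataset['Age'] > 60, 'Age'] = 4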
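
The same kind of banding can be written in one call with `pd.cut`. The sketch below is an alternative, not the notebook's method: the cell above uses `> 60` as its last threshold, while this version assumes the last band should start at 50 so that every age falls into exactly one band. The sample ages are made up.

```python
import numpy as np
import pandas as pd

ages = pd.Series([2.0, 12.0, 14.0, 22.0, 35.0, 38.0, 54.0, 71.0])

# Right-closed intervals: (-inf, 12], (12, 20], (20, 35], (35, 50], (50, inf)
# Assumption: band 4 starts above 50, which differs from the > 60 rule above.
bins = [-np.inf, 12, 20, 35, 50, np.inf]
age_band = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)
print(age_band.tolist())
```

Because the bin edges are contiguous, no value can be left unbinned, and the column ends up with an integer dtype instead of floats.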

In [284]:
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,0,2.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,1,3.0,1,0,PC 17599,71.2833,C85,C,2
2,3,1,3,1,2.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,1,2.0,1,0,113803,53.1,C123,S,2
4,5,0,3,0,2.0,0,0,373450,8.05,,S,0


### Map the embarked after filling up the missing values

In [285]:
Pclass1 = train_set[train_set['Pclass']==1]['Embarked'].value_counts()
Pclass1

S    127
C     85
Q      2
Name: Embarked, dtype: int64

In [286]:
Pclass2 = train_set[train_set['Pclass']==2]['Embarked'].value_counts()
Pclass2

S    164
C     17
Q      3
Name: Embarked, dtype: int64

In [287]:
Pclass3 = train_set[train_set['Pclass']==3]['Embarked'].value_counts()
Pclass3

S    353
Q     72
C     66
Name: Embarked, dtype: int64

In [288]:
for dataset in train_test_dataset:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
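
The cell above hard-codes `'S'`, which the per-class counts show is the most frequent port in every class. A hedged variant that derives the fill value from the data rather than typing it in could look like this (the `embarked` Series is a made-up stand-in):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing ports.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NaN by default and returns the most frequent value(s);
# iloc[0] picks the first one if there is a tie.
most_common = embarked.mode().iloc[0]
filled = embarked.fillna(most_common)
print(most_common, filled.tolist())
```

In a train/test setting, one would typically compute `most_common` on the training set only and reuse it for the test set.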

In [289]:
embarked_mapping = {"S": 0, "C": 1, "Q": 2}
for dataset in train_test_dataset:
    dataset['Embarked'] = dataset['Embarked'].map(embarked_mapping)

In [290]:
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,0,2.0,1,0,A/5 21171,7.25,,0,0
1,2,1,1,1,3.0,1,0,PC 17599,71.2833,C85,1,2
2,3,1,3,1,2.0,0,0,STON/O2. 3101282,7.925,,0,1
3,4,1,1,1,2.0,1,0,113803,53.1,C123,0,2
4,5,0,3,0,2.0,0,0,373450,8.05,,0,0


### Family
Calculate the family size (siblings/spouses plus parents/children plus the passenger), on the assumption that a passenger's chance of survival depends on whether family members were on board.

In [291]:
train_set["Family"] = train_set["SibSp"] + train_set["Parch"] + 1
test_set["Family"] = test_set["SibSp"] + test_set["Parch"] + 1

Drop the Parch and SibSp columns

In [292]:
train_set.drop('Parch', axis = 1, inplace = True)
test_set.drop('Parch', axis = 1, inplace = True)
train_set.drop('SibSp', axis = 1, inplace = True)
test_set.drop('SibSp', axis = 1, inplace = True)

In [293]:
train_set.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked,Title,Family
0,1,0,3,0,2.0,A/5 21171,7.25,,0,0,2
1,2,1,1,1,3.0,PC 17599,71.2833,C85,1,2,2
2,3,1,3,1,2.0,STON/O2. 3101282,7.925,,0,1,1
3,4,1,1,1,2.0,113803,53.1,C123,0,2,2
4,5,0,3,0,2.0,373450,8.05,,0,0,1
5,6,0,3,0,2.0,330877,8.4583,,2,0,1
6,7,0,1,0,54.0,17463,51.8625,E46,0,0,1
7,8,0,3,0,0.0,349909,21.075,,0,3,5
8,9,1,3,1,2.0,347742,11.1333,,0,2,3
9,10,1,2,1,1.0,237736,30.0708,,1,2,2


In [294]:
train_set['Family'].value_counts()

1     537
2     161
3     102
4      29
6      22
5      15
7      12
11      7
8       6
Name: Family, dtype: int64

In [295]:
test_set['Family'].value_counts()

1     253
2      74
3      57
4      14
5       7
11      4
7       4
6       3
8       2
Name: Family, dtype: int64

The minimum family size is 1 and the maximum is 11.

In [296]:
family_mapping = {1: 0, 2: 0.2, 3: 0.4, 4: 0.6, 5: 0.8, 6: 1, 7: 1.2, 8: 1.4, 9: 1.6, 10: 1.8, 11: 2}
for dataset in train_test_dataset:
    dataset['Family'] = dataset['Family'].map(family_mapping)
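
The dictionary above is a linear rescaling: each additional family member adds 0.2, starting from 0 for a passenger travelling alone. The sketch below checks that reading by comparing the dictionary with the formula `(size - 1) * 0.2`; the `sizes` Series is only for illustration.

```python
import pandas as pd

family_mapping = {1: 0, 2: 0.2, 3: 0.4, 4: 0.6, 5: 0.8, 6: 1,
                  7: 1.2, 8: 1.4, 9: 1.6, 10: 1.8, 11: 2}

sizes = pd.Series(range(1, 12))

via_dict = sizes.map(family_mapping)
# Rounding removes floating-point noise such as 0.6000000000000001.
via_formula = ((sizes - 1) * 0.2).round(1)

max_difference = (via_dict - via_formula).abs().max()
print(max_difference)
```

The formula version also covers family sizes outside 1 to 11, which the dictionary would map to NaN.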

In [297]:
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked,Title,Family
0,1,0,3,0,2.0,A/5 21171,7.25,,0,0,0.2
1,2,1,1,1,3.0,PC 17599,71.2833,C85,1,2,0.2
2,3,1,3,1,2.0,STON/O2. 3101282,7.925,,0,1,0.0
3,4,1,1,1,2.0,113803,53.1,C123,0,2,0.2
4,5,0,3,0,2.0,373450,8.05,,0,0,0.0


### Map the fare

In [298]:
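# Assign the filled column back rather than using fillna(inplace=True) on a column selection.
train_set["Fare"] = train_set["Fare"].fillna(train_set.groupby("Pclass")["Fare"].transform("median"))
test_set["Fare"] = test_set["Fare"].fillna(test_set.groupby("Pclass")["Fare"].transform("median"))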

In [299]:
for dataset in train_test_dataset:
    # Assign plain scalars; a trailing comma would assign the one-element tuple (0,) instead.
    dataset.loc[dataset['Fare'] <= 15, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 15) & (dataset['Fare'] <= 30), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 30) & (dataset['Fare'] <= 100), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 100, 'Fare'] = 3

In [300]:
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked,Title,Family
0,1,0,3,0,2.0,A/5 21171,0.0,,0,0,0.2
1,2,1,1,1,3.0,PC 17599,2.0,C85,1,2,0.2
2,3,1,3,1,2.0,STON/O2. 3101282,0.0,,0,1,0.0
3,4,1,1,1,2.0,113803,2.0,C123,0,2,0.2
4,5,0,3,0,2.0,373450,0.0,,0,0,0.0


### Do something about the cabin

In [301]:
train_set['Cabin'].value_counts()

C23 C25 C27    4
G6             4
B96 B98        4
F2             3
E101           3
D              3
F33            3
C22 C26        3
C2             2
D33            2
C65            2
B20            2
E8             2
D35            2
D26            2
E25            2
C124           2
E24            2
B77            2
B51 B53 B55    2
C52            2
C83            2
E67            2
D20            2
F G73          2
B28            2
C92            2
F4             2
C125           2
C68            2
              ..
C128           1
D47            1
A10            1
E46            1
F G63          1
C45            1
E58            1
C7             1
E40            1
E31            1
B79            1
B86            1
A24            1
E68            1
C54            1
D19            1
C62 C64        1
E34            1
D9             1
C85            1
B38            1
A6             1
B37            1
C91            1
C49            1
A31            1
E17            1
B3            

In [302]:
Pclass1 = train_set[train_set['Pclass']==1]['Cabin'].value_counts()
Pclass1

B96 B98            4
C23 C25 C27        4
C22 C26            3
B20                2
E25                2
D33                2
E24                2
B51 B53 B55        2
C83                2
C68                2
B28                2
D20                2
B77                2
C123               2
C124               2
B49                2
C2                 2
D35                2
E8                 2
C65                2
C92                2
C126               2
B57 B59 B63 B66    2
B58 B60            2
B22                2
C78                2
E33                2
E67                2
B35                2
B18                2
                  ..
B4                 1
D15                1
C70                1
C87                1
D9                 1
C62 C64            1
D11                1
D19                1
C103               1
D48                1
C90                1
D30                1
D37                1
C32                1
A36                1
B71                1
A20          

In [303]:
Pclass2 = train_set[train_set['Pclass']==2]['Cabin'].value_counts()
Pclass2

F33     3
D       3
E101    3
F2      3
F4      2
D56     1
E77     1
Name: Cabin, dtype: int64

In [304]:
Pclass3 = train_set[train_set['Pclass']==3]['Cabin'].value_counts()
Pclass3

G6       4
F G73    2
E121     2
F E69    1
E10      1
F38      1
F G63    1
Name: Cabin, dtype: int64

In [305]:
for dataset in train_test_dataset:
    dataset['Cabin'] = dataset['Cabin'].str[:1]

From these counts, first-class cabins mostly start with A, B, C, D, or E, second-class cabins with D, E, or F, and third-class cabins with E, F, or G.
Next we map the cabin letter to a number and fill the blanks with the median of each passenger class.
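
The deck-by-class pattern described above can be tabulated in one step with `pd.crosstab` instead of three separate `value_counts()` calls. This is a minimal sketch on a handful of cabin values that appear in the outputs above; the `sample` frame itself is hypothetical.

```python
import pandas as pd

# A few (Pclass, Cabin) pairs taken from the listings above.
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3],
    "Cabin": ["C85", "C123", "E46", "D56", "F33", "G6", "F G73"],
})

# The deck is the first letter of the cabin string.
sample["Deck"] = sample["Cabin"].str[:1]

# Rows are decks, columns are passenger classes, cells are passenger counts.
deck_by_class = pd.crosstab(sample["Deck"], sample["Pclass"])
print(deck_by_class)
```

On the real data, where `Cabin` has already been reduced to its first letter, `pd.crosstab(train_set['Cabin'], train_set['Pclass'])` should give the full table; rows with a missing cabin are left out of the counts.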

In [306]:
cabin_mapping = {"A": 0, "B": 0.2, "C": 0.4, "D": 0.6, "E": 0.8, "F": 1, "G": 1.2, "T": 1.4}
for dataset in train_test_dataset:
    dataset['Cabin'] = dataset['Cabin'].map(cabin_mapping)

train_set

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked,Title,Family
0,1,0,3,0,2.0,A/5 21171,0.0,,0,0,0.2
1,2,1,1,1,3.0,PC 17599,2.0,0.4,1,2,0.2
2,3,1,3,1,2.0,STON/O2. 3101282,0.0,,0,1,0.0
3,4,1,1,1,2.0,113803,2.0,0.4,0,2,0.2
4,5,0,3,0,2.0,373450,0.0,,0,0,0.0
5,6,0,3,0,2.0,330877,0.0,,2,0,0.0
6,7,0,1,0,54.0,17463,2.0,0.8,0,0,0.0
7,8,0,3,0,0.0,349909,1.0,,0,3,0.8
8,9,1,3,1,2.0,347742,0.0,,0,2,0.4
9,10,1,2,1,1.0,237736,2.0,,1,2,0.2


In [307]:
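# Assign the filled column back rather than using fillna(inplace=True) on a column selection.
train_set["Cabin"] = train_set["Cabin"].fillna(train_set.groupby("Pclass")["Cabin"].transform("median"))
test_set["Cabin"] = test_set["Cabin"].fillna(test_set.groupby("Pclass")["Cabin"].transform("median"))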

The Ticket column is not used as a feature here, so we drop it.

In [308]:
train_set.drop('Ticket', axis = 1, inplace = True)


In [309]:
test_set.drop('Ticket', axis = 1, inplace = True)

In [310]:
train_set.drop('PassengerId', axis = 1, inplace = True)

In [311]:
train_data = train_set.drop('Survived', axis=1)
target = train_set['Survived']

train_data.shape, target.shape

((891, 8), (891,))

In [312]:
train_data

Unnamed: 0,Pclass,Sex,Age,Fare,Cabin,Embarked,Title,Family
0,3,0,2.0,0.0,1.0,0,0,0.2
1,1,1,3.0,2.0,0.4,1,2,0.2
2,3,1,2.0,0.0,1.0,0,1,0.0
3,1,1,2.0,2.0,0.4,0,2,0.2
4,3,0,2.0,0.0,1.0,0,0,0.0
5,3,0,2.0,0.0,1.0,2,0,0.0
6,1,0,54.0,2.0,0.8,0,0,0.0
7,3,0,0.0,1.0,1.0,0,3,0.8
8,3,1,2.0,0.0,1.0,0,2,0.4
9,2,1,1.0,2.0,0.9,1,2,0.2


In [313]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Pclass      891 non-null int64
Sex         891 non-null int64
Age         891 non-null float64
Fare        891 non-null float64
Cabin       891 non-null float64
Embarked    891 non-null int64
Title       891 non-null int64
Family      891 non-null float64
dtypes: float64(4), int64(4)
memory usage: 55.7 KB


## 5. Data Modelling

In [314]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

import numpy as np

In [315]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
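
To make the splitting concrete, here is a small sketch of what a shuffled `KFold` produces. It uses 10 made-up samples and 5 splits rather than the notebook's settings, and simply records the size of each train/validation split:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features each
demo_fold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
validated = []
for train_idx, val_idx in demo_fold.split(X):
    fold_sizes.append((len(train_idx), len(val_idx)))
    validated.extend(val_idx.tolist())

# Every sample is used for validation exactly once across the folds.
print(fold_sizes)
print(sorted(validated))
```

With 891 rows and `n_splits=10`, one fold holds 90 validation rows and the other nine hold 89, which is why the per-fold accuracies in the cells below are multiples of 1/90 or 1/89.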

In [316]:
clfr = KNeighborsClassifier(n_neighbors = 13)
scoring = 'accuracy'
score = cross_val_score(clfr, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)

[0.78888889 0.79775281 0.78651685 0.75280899 0.79775281 0.85393258
 0.82022472 0.80898876 0.80898876 0.84269663]


In [317]:
round(np.mean(score)*100, 2)

80.59

In [318]:
clfr = DecisionTreeClassifier()
scoring = 'accuracy'
score = cross_val_score(clfr, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)

[0.78888889 0.85393258 0.73033708 0.7752809  0.86516854 0.80898876
 0.84269663 0.78651685 0.74157303 0.83146067]


In [319]:
round(np.mean(score)*100, 2)

80.25

In [320]:
clfr = RandomForestClassifier(n_estimators=13)
scoring = 'accuracy'
score = cross_val_score(clfr, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)

[0.82222222 0.84269663 0.76404494 0.7752809  0.86516854 0.80898876
 0.80898876 0.80898876 0.78651685 0.82022472]


In [321]:
round(np.mean(score)*100, 2)

81.03

In [322]:
clfr = GaussianNB()
scoring = 'accuracy'
score = cross_val_score(clfr, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)

[0.81111111 0.74157303 0.74157303 0.75280899 0.69662921 0.79775281
 0.73033708 0.78651685 0.85393258 0.79775281]


In [323]:
round(np.mean(score)*100, 2)

77.1

In [324]:
clfr = SVC()
scoring = 'accuracy'
score = cross_val_score(clfr, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)

[0.83333333 0.7752809  0.82022472 0.79775281 0.85393258 0.80898876
 0.84269663 0.84269663 0.82022472 0.83146067]


In [325]:
round(np.mean(score)*100,2)

82.27
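
The five evaluations above repeat the same three lines with a different classifier each time. A hedged way to compare them in one loop is sketched below. It runs on synthetic data from `make_classification` as a stand-in for `train_data` and `target`, so its scores say nothing about the Titanic results; the `models` and `results` dictionaries are names invented for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (train_data, target); not the Titanic data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=13),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=13, random_state=0),
    "Naive Bayes": GaussianNB(),
    "SVC": SVC(),
}
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

# Mean cross-validated accuracy (in percent) for each model.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=k_fold, scoring="accuracy")
    results[name] = round(scores.mean() * 100, 2)

for name, mean_acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {mean_acc}")
```

Fixing `random_state` on the tree-based models is an addition here; it makes repeated runs return the same scores, which the unseeded classifiers above do not guarantee.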

In [328]:
clfr = SVC()
clfr.fit(train_data, target)

test_data = test_set.drop("PassengerId", axis=1).copy()
prediction = clfr.predict(test_data)

In [329]:
submission = pd.DataFrame({
        "PassengerId": test_set["PassengerId"],
        "Survived": prediction
    })

submission.to_csv('submission.csv', index=False)

In [330]:
submission = pd.read_csv('submission.csv')
submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
