# Step 1. Open the data file and read the general information

## Competition description
We need to build a predictive model that answers the question “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

!pip install sidetable

## Import

In [98]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
#from pandas_profiling import ProfileReport
try:
    import sidetable  # optional: adds the df.stb accessor used below
except ImportError:
    pass

## Load data

In [99]:
try:
    df = pd.read_csv('train.csv')
    df_test = pd.read_csv('test.csv')
except FileNotFoundError:  # no local copy, fall back to the Kaggle input path
    df = pd.read_csv("/kaggle/input/titanic/train.csv")
    df_test = pd.read_csv("/kaggle/input/titanic/test.csv")

In [100]:
df_test 

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [101]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


# Step 2. Check the data

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

In [102]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [103]:
try:
    df.stb.missing(style=True)
except AttributeError:  # sidetable is not installed, so df.stb does not exist
    pass

In [104]:
df.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [105]:
df.sample(20)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
653,654,1,3,"O'Leary, Miss. Hanora ""Norah""",female,,0,0,330919,7.8292,,Q
79,80,1,3,"Dowdell, Miss. Elizabeth",female,30.0,0,0,364516,12.475,,S
763,764,1,1,"Carter, Mrs. William Ernest (Lucile Polk)",female,36.0,1,2,113760,120.0,B96 B98,S
304,305,0,3,"Williams, Mr. Howard Hugh ""Harry""",male,,0,0,A/5 2466,8.05,,S
287,288,0,3,"Naidenoff, Mr. Penko",male,22.0,0,0,349206,7.8958,,S
687,688,0,3,"Dakic, Mr. Branko",male,19.0,0,0,349228,10.1708,,S
231,232,0,3,"Larsson, Mr. Bengt Edvin",male,29.0,0,0,347067,7.775,,S
766,767,0,1,"Brewe, Dr. Arthur Jackson",male,,0,0,112379,39.6,,C
404,405,0,3,"Oreskovic, Miss. Marija",female,20.0,0,0,315096,8.6625,,S
634,635,0,3,"Skoog, Miss. Mabel",female,9.0,3,2,347088,27.9,,S


# Split the source data into a training set and validation set

In [106]:
df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345)

# Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

The task is to predict whether each passenger survived. Every wrong prediction counts as an error, so we evaluate the models with the accuracy metric.
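
As a minimal sketch of what the accuracy metric measures, the snippet below computes the share of correct predictions by hand. The `accuracy` helper and the toy labels are made up for illustration. The baseline uses the mean of `Survived` (0.383838) reported by `df.describe()` above, so always predicting "did not survive" would already score about 0.616 on the training file.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Hypothetical helper: fraction of predictions that match the labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels (made up for illustration, not taken from the data set).
y_true = [0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1]
toy_accuracy = accuracy(y_true, y_pred)  # 3 of the 5 predictions are correct

# Majority-class baseline: mean(Survived) = 0.383838 in train.csv,
# so predicting 0 for everyone is right for the remaining share.
baseline_accuracy = 1 - 0.383838

print(toy_accuracy, baseline_accuracy)
```

Any model worth keeping should beat that baseline on the validation set.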

This is a binary classification task, so we compare classification algorithms by their accuracy on the validation set. The models we check are a decision tree, a random forest, and logistic regression.

In [107]:
target_train = df_train['Survived']
target_valid = df_valid['Survived']

features = ["Pclass", "Sex", "SibSp", "Parch", 'Fare']
features_train = pd.get_dummies(df_train[features])
features_valid = pd.get_dummies(df_valid[features])
features_test = pd.get_dummies(df_test[features])
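
`pd.get_dummies` is applied to the three frames separately, so the resulting columns only match when every frame contains the same categories. That holds here for `Sex`, as the `info()` outputs below show. As a hedged sketch (the toy frames and the use of an `Embarked` column are assumptions chosen for illustration), the columns can be aligned explicitly with `reindex`:

```python
import pandas as pd

# Toy frames (made up): the "test" frame lacks the category "Q".
toy_train = pd.DataFrame({"Pclass": [1, 3, 2], "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "C"]})

train_dummies = pd.get_dummies(toy_train)
test_dummies = pd.get_dummies(toy_test)

# Give the test frame exactly the training columns, in the same order.
# Categories missing from the test frame become all-zero columns.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
```

Without this step, a model fitted on the training columns would reject or misread a frame whose columns differ.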


In [108]:
features_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 668 entries, 603 to 482
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Pclass      668 non-null    int64  
 1   SibSp       668 non-null    int64  
 2   Parch       668 non-null    int64  
 3   Fare        668 non-null    float64
 4   Sex_female  668 non-null    uint8  
 5   Sex_male    668 non-null    uint8  
dtypes: float64(1), int64(3), uint8(2)
memory usage: 27.4 KB


In [109]:
features_valid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 223 entries, 688 to 847
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Pclass      223 non-null    int64  
 1   SibSp       223 non-null    int64  
 2   Parch       223 non-null    int64  
 3   Fare        223 non-null    float64
 4   Sex_female  223 non-null    uint8  
 5   Sex_male    223 non-null    uint8  
dtypes: float64(1), int64(3), uint8(2)
memory usage: 9.1 KB


In [110]:
features_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Pclass      418 non-null    int64  
 1   SibSp       418 non-null    int64  
 2   Parch       418 non-null    int64  
 3   Fare        417 non-null    float64
 4   Sex_female  418 non-null    uint8  
 5   Sex_male    418 non-null    uint8  
dtypes: float64(1), int64(3), uint8(2)
memory usage: 14.0 KB


In [111]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [112]:
features_test_nan = features_test.loc[features_test['Fare'].isna(), :]
features_test_nan

Unnamed: 0,Pclass,SibSp,Parch,Fare,Sex_female,Sex_male
152,3,0,0,,0,1


In [113]:
features_test_not_nan = features_test[features_test['Fare'].notna()]
features_test_not_nan

Unnamed: 0,Pclass,SibSp,Parch,Fare,Sex_female,Sex_male
0,3,0,0,7.8292,0,1
1,3,1,0,7.0000,1,0
2,2,0,0,9.6875,0,1
3,3,0,0,8.6625,0,1
4,3,1,1,12.2875,1,0
...,...,...,...,...,...,...
413,3,0,0,8.0500,0,1
414,1,0,0,108.9000,1,0
415,3,0,0,7.2500,0,1
416,3,0,0,8.0500,0,1


In [114]:
mean_for_fare = features_test_not_nan.loc[
    (features_test_not_nan['Pclass'] == 3) & (features_test_not_nan['Sex_male'] == 1), 'Fare'].mean()
mean_for_fare

11.826350344827585
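
The same idea, filling a missing fare with the mean fare of passengers of the same class and sex, can be written without hand-picking the group. This is only a sketch on made-up toy rows: `groupby(...).transform("mean")` computes each row's group mean (skipping missing values), and `fillna` uses it only where `Fare` is missing.

```python
import numpy as np
import pandas as pd

# Toy rows (made up for illustration); one third-class male fare is missing.
toy = pd.DataFrame({
    "Pclass": [3, 3, 3, 1, 1],
    "Sex": ["male", "male", "male", "female", "female"],
    "Fare": [8.0, 10.0, np.nan, 100.0, 80.0],
})

# Mean fare of each (Pclass, Sex) group, broadcast back to every row.
group_mean = toy.groupby(["Pclass", "Sex"])["Fare"].transform("mean")

# Replace only the missing fares with their group's mean.
toy["Fare"] = toy["Fare"].fillna(group_mean)

print(toy)
```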

In [115]:
df_test['Fare'] = df_test['Fare'].fillna(value=mean_for_fare)
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [116]:
df_test.iloc[152]

PassengerId                  1044
Pclass                          3
Name           Storey, Mr. Thomas
Sex                          male
Age                          60.5
SibSp                           0
Parch                           0
Ticket                       3701
Fare                     11.82635
Cabin                         NaN
Embarked                        S
Name: 152, dtype: object

In [117]:
features_test = pd.get_dummies(df_test[features])

In [118]:
#features_test.loc[features_test['Fare'].isna(), features_test['Fare']] = 

break

## Decision tree

Loop over the tree depth to find the `max_depth` value with the best validation accuracy.

In [119]:
best_depth = 0
best_result = 0
for depth in range(1,20):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth) # create a model with the given depth
    model.fit(features_train, target_train) # train the model
    predictions_valid = model.predict(features_valid) # get the model's predictions
    result = accuracy_score(target_valid, predictions_valid) # calculate the accuracy
    if result > best_result:
        best_result = result
        best_depth = depth
    print("max_depth =", depth, ": ", end='')
    print(result)
print('')
print('')
print("best_depth =", best_depth, ": ", end='')
print(best_result)

max_depth = 1 : 0.7802690582959642
max_depth = 2 : 0.7309417040358744
max_depth = 3 : 0.7847533632286996
max_depth = 4 : 0.7847533632286996
max_depth = 5 : 0.7668161434977578
max_depth = 6 : 0.7757847533632287
max_depth = 7 : 0.7847533632286996
max_depth = 8 : 0.7982062780269058
max_depth = 9 : 0.7892376681614349
max_depth = 10 : 0.7937219730941704
max_depth = 11 : 0.7847533632286996
max_depth = 12 : 0.7713004484304933
max_depth = 13 : 0.7802690582959642
max_depth = 14 : 0.7847533632286996
max_depth = 15 : 0.7847533632286996
max_depth = 16 : 0.7757847533632287
max_depth = 17 : 0.7892376681614349
max_depth = 18 : 0.7892376681614349
max_depth = 19 : 0.7892376681614349


best_depth = 8 : 0.7982062780269058


Depth 8 gives the highest validation accuracy: 0.798.
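
The validation set has only 223 rows, so one passenger changes the accuracy by about 0.0045, and the gap between depth 8 (0.798) and depth 9 (0.789) amounts to two passengers. As a hedged sketch, not the method used in this notebook, the depth can be chosen by k-fold cross-validation instead of a single split. The synthetic data from `make_classification` is an assumption standing in for the encoded features and the `Survived` column.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the Titanic features).
X, y = make_classification(n_samples=300, n_features=6, random_state=12345)

best_depth, best_mean = 0, 0.0
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    # Mean accuracy over 5 folds instead of one validation split.
    mean_score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    if mean_score > best_mean:
        best_depth, best_mean = depth, mean_score

print("best_depth =", best_depth, ":", best_mean)
```

Averaging over folds makes the comparison between neighbouring depths less sensitive to which rows happened to land in the validation set.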

## Random forest

Loop over the number of trees (`n_estimators`) to find the value with the best validation accuracy.

In [120]:
best_score = 0
best_est = 0
for est in range(1, 10): # choose hyperparameter range
    model = RandomForestClassifier(random_state=12345, n_estimators=est) # set number of trees
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score  # save best accuracy score on validation set
        best_est = est  # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 6): 0.7802690582959642
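
The loop above varies only `n_estimators`, from 1 to 9. As a hedged sketch, not the method used in this notebook, scikit-learn's `GridSearchCV` can search several hyperparameters at once with cross-validation. The grid values and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, not the Titanic features).
X, y = make_classification(n_samples=300, n_features=6, random_state=12345)

# Illustrative grid: number of trees and maximum depth of each tree.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid=param_grid,
    cv=3,                # 3-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```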


## Logistic regression

In [121]:
model = LogisticRegression(random_state=12345, solver='lbfgs')
model.fit(features_train, target_train) 
score = model.score(features_valid, target_valid) 
print('The accuracy of the validation set is:', score)

The accuracy of the validation set is: 0.7802690582959642
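
`Fare` ranges from 0 to more than 100, while the other features are small integers, and the `lbfgs` solver can converge slowly on unscaled inputs. A minimal sketch, assuming synthetic stand-in data, of standardizing the features inside a pipeline before logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the Titanic features).
X, y = make_classification(n_samples=300, n_features=6, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# The scaler is fitted on the training rows only, then applied to both sets.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver="lbfgs"),
)
scaled_model.fit(X_train, y_train)
scaled_score = scaled_model.score(X_valid, y_valid)

print("Validation accuracy with scaling:", scaled_score)
```

Whether scaling changes the score on the Titanic features is not shown here; it would need to be checked on the real data.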


## Model study conclusion
The highest validation accuracy we got is 0.798, achieved by the decision tree with a depth of 8. The random forest (best at n_estimators = 6) and logistic regression both reached 0.780 and could not match it. We therefore select the decision tree, retrain it on all the labeled data, and use it to generate predictions for the test set.

# Train the winning model with all data

Break point here to select best model

break

Depth 8 gives the highest validation accuracy (0.798) with DecisionTreeClassifier.

In [122]:
target = df['Survived']

features = ["Pclass", "Sex", "SibSp", "Parch", 'Fare']
features = pd.get_dummies(df[features])

model = DecisionTreeClassifier(random_state=12345, max_depth=8)
model.fit(features, target) # fit on all labeled rows (training and validation)
predictions_test = model.predict(features_test) # predict with test data

In [123]:
output = pd.DataFrame({'PassengerId': df_test['PassengerId'], 'Survived': predictions_test})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
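
Before uploading, it can help to check the file's shape. The sketch below assumes a hypothetical `check_submission` helper; the expected 418 rows come from `df_test.info()` above, and the two columns match the `output` frame built in the previous cell.

```python
import pandas as pd

def check_submission(frame, expected_rows=418):
    """Hypothetical helper: raise ValueError if the frame looks malformed."""
    if list(frame.columns) != ["PassengerId", "Survived"]:
        raise ValueError("columns must be PassengerId, Survived")
    if len(frame) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(frame)}")
    if frame["PassengerId"].duplicated().any():
        raise ValueError("duplicate PassengerId values")
    if not frame["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 and 1")
    return True

# Toy frame (made up) with two rows, checked against expected_rows=2.
toy_output = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})
toy_ok = check_submission(toy_output, expected_rows=2)
```

In the notebook itself the call would be `check_submission(output)`, using the default of 418 rows.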


In [124]:
#profile = ProfileReport(df)
#profile