# Survival Prediction and Data Analysis
---

Competition sites like Kaggle define the problem to solve or the questions to ask, and provide datasets for training a data science model and for testing its results. The problem definition for the Titanic survival competition is described on Kaggle.

# Workflow Stages

* Import dataset.
* Data exploration.
* Data cleaning.
* Feature engineering.
* Data processing.
* Model building.
* Model tuning.
* Ensemble model building.
* Result.

# Workflow Goals

The data science solution workflow serves seven major goals-

1. **Classifying-** We may want to classify or categorise our samples.

2. **Correlating-** We can approach the problem through the features available in the training dataset and check how they correlate with each other and with the solution goal: as a feature's value changes, does the solution goal change as well, and vice versa? This can be tested for both numerical and categorical features.

3. **Converting-** For the modeling stage, we need to prepare the data. We may need to convert all features to numerical values.

4. **Completing-** Data preparation also requires us to estimate missing values within the features. Model algorithms work best when there are no missing values.

5. **Correcting-** There may be errors or inaccurate values within the features. We need to rectify them, or else completely exclude a feature that contains many errors and does not contribute to the analysis.

6. **Creating-** We can create new features from existing ones, following the correlating, converting and completing goals.

7. **Charting-** Depending on the nature of the data, we select suitable visualization plots.

# I. Import Packages and dataset


In [None]:
#Data analysis and wrangling

import pandas as pd
import numpy as np
import random as rnd

#Data Visualisation

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#machine learning

from sklearn.ensemble import RandomForestClassifier 

In [None]:
#import data
train= pd.read_csv('/kaggle/input/titanic/train.csv')
test= pd.read_csv('/kaggle/input/titanic/test.csv')
all_data=[train, test]

# II. DataSet Exploration

Exploration is the first step of data analysis: we explore and visualize the data to uncover early insights and to identify patterns or areas worth a closer look.

In [None]:
train.head()

In [None]:
test.head()

We'll look for categorical and numerical values.

**Which features are categorical?**

These values classify the samples into sets of similar samples. Within categorical features, are the values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the appropriate plots for visualization.

Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.

**Which features are numerical?**

These values change from sample to sample. Within numerical features, are the values discrete, continuous, or time-series based? Among other things, this helps us select the appropriate plots for visualization.

Continuous: Age, Fare. Discrete: SibSp, Parch.

**Data types of features-**

In the training dataset, seven features are int or float and five are string (object).

In [None]:
#quick look at our data type and null values
train.info()
test.info()

Our observations-

* The training set holds 891 samples, about 40% of the actual number of passengers on board the Titanic (2,224).
* Survived is a categorical feature with values 0 or 1.
* Around 38% of the samples survived, close to the actual survival rate of 32%.
* Most passengers (> 75%) did not travel with parents or children.
* Nearly 30% of the passengers had siblings and/or a spouse aboard.
* Fares varied significantly, with a few passengers (<1%) paying as much as $512.
* There were few elderly passengers (<1%) in the 65-80 age range.

In [None]:
train.describe()
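
The percentages in the observations above can be read off `describe()`, but they can also be computed directly. A minimal sketch, assuming a small made-up frame (`toy`) that only borrows the Titanic column names; its values are illustrative and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data with Titanic-style column names (not real passengers).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0],
    "SibSp":    [1, 1, 0, 0, 0, 0, 3, 0],
    "Parch":    [0, 0, 0, 0, 0, 2, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = toy["Survived"].mean()

# The mean of a boolean mask is the share of rows where the condition holds.
share_without_parch = (toy["Parch"] == 0).mean()
share_with_sibsp = (toy["SibSp"] > 0).mean()

print(f"survived: {survival_rate:.1%}")
print(f"no parents/children aboard: {share_without_parch:.1%}")
print(f"siblings/spouse aboard: {share_with_sibsp:.1%}")
```

The same expressions should work on `train` unchanged, since they rely only on the `Survived`, `SibSp` and `Parch` columns.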

**Distribution of categorical features-**

* Names are unique (count=891, unique=891).
* Sex has 2 possible values, with about 65% male.
* Cabin has duplicate values, which means some passengers shared a cabin.
* Embarked takes 3 possible values; S was used by most passengers (about 72%).
* Ticket has a high share of duplicate values.



In [None]:
#quick way to find categorical features.
train.describe(include=['O'])

In [None]:
# quick way to separate numeric column
train.describe().columns

In [None]:
# look at numeric and categorical values separately
ds_num = train[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
ds_cat = train[['Survived', 'Pclass', 'Sex', 'Ticket', 'Cabin', 'Embarked']]

**Missing data**

**Which features contain blank, null or empty values?**

These will require correcting.

In the training dataset, Cabin, Age and Embarked contain null values, in decreasing order of count.

In the test dataset, Cabin and Age are incomplete, and Fare has one missing value.

In [None]:
# null counts of data set
train.isnull().sum()


In [None]:
test.isnull().sum()


In [None]:
# visual representation of null values (heatmap)
sns.heatmap(train.isnull(), yticklabels=False, cmap= "viridis")
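
Raw null counts are easier to compare across columns when expressed as a share of rows. A minimal sketch, assuming a hypothetical toy frame (`toy`) with a few Titanic-style columns; the values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a missing value.
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, 35.0],
    "Cabin":    [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() gives a boolean frame; its column means are the missing shares.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the most incomplete feature first, which mirrors the ordering used in the text above.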

# Assumptions based on data analysis

1. **Correlating** We need to correlate all features with survival.

2. **Completing** We may need to complete *Age* and *Embarked*, as they correlate with survival.

3. **Correcting**
* We may need to drop *Cabin* as it contains many missing values.
* As *Ticket* contains many duplicates, we may drop it too.
* *PassengerId* and *Name* may be dropped too, as they do not contribute to survival.

4. **Creating**
* We will create a family feature to count the family members on board.
* We will create an age band.
* We will create a fare band too, which helps the analysis.

5. **Classifying** Based on the problem statement, some assumptions are-
* Women and children are more likely to have survived.
* Rich people (Pclass=1) are more likely to have survived.

# Analysis by pivoting features

To confirm our assumptions and observations, we can quickly analyse how our features relate to Survived. This works for features with (almost) no missing values that are categorical, ordinal or discrete: Pclass, Sex, Embarked, SibSp and Parch.

1. **Pclass** Passengers with Pclass=1 have a survival rate above 0.5, which supports our assumption.
2. **Sex** Female passengers have a survival rate of about 74%, which confirms the conclusion drawn from the problem statement.
3. **Embarked** Passengers who embarked at C have a survival rate above 0.5, so Embarked relates to survival as assumed.
4. **SibSp and Parch** For some values of SibSp and Parch there is no clear relationship with Survived, so deriving a new feature from them may help.

**II(a) Analysing the Pclass Feature**

In [None]:
train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending= False ) 

**II(b) Analysing the Sex Feature**

In [None]:
train[['Sex','Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

**II(c) Analysing the Embarked Feature**

In [None]:
train[['Embarked','Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

**II(d) Analysing the SibSp Feature**

In [None]:
train[['SibSp','Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

**II(e) Analysing the Parch Feature**

In [None]:
train[['Parch','Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)
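
The five cells above repeat one pattern: group by a feature and average Survived. A minimal sketch of that pattern as a reusable function, assuming a hypothetical helper name (`survival_by`) and a made-up toy frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy data with Titanic-style column names.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 2, 3],
    "Sex":      ["female", "male", "male", "female", "female", "male"],
    "Survived": [1, 1, 0, 1, 1, 0],
})

def survival_by(df, feature):
    """Mean of Survived per category of `feature`, highest first."""
    return (df.groupby(feature)["Survived"]
              .mean()
              .sort_values(ascending=False))

# Apply the same pivot to several features in one loop.
for feature in ["Pclass", "Sex"]:
    print(survival_by(toy, feature), end="\n\n")
```

This returns a Series indexed by category instead of a two-column frame, which is a design choice of the sketch and not the layout used in the cells above.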

# Analyse by visualizing data

We can continue analysing our assumption by visualizing data.

Use histograms to analyse numerical values.

**Observations**
* Most passengers aged between 19 and 35 did not survive.
* The oldest passenger (age 80) survived.
* Infants had a high survival rate, so our assumption is correct.
* Females had a higher survival rate.
* Passengers from Pclass 1 had a higher survival rate.


**II(f) Visual Analysis of the Age Feature**

In [None]:
grid= sns.FacetGrid(train, col='Survived')
grid.map(plt.hist, 'Age', bins=30)

**II(g) Visual Analysis of the Sex Feature**

In [None]:
sns.set_style("whitegrid")
sns.countplot(x="Survived", hue="Sex", data=train, palette="RdBu_r")

**II(h) Visual Analysis of the Pclass Feature**

In [None]:
sns.set_style("whitegrid")
sns.countplot(x="Survived", hue="Pclass", data=train, palette= "rainbow")

# III. Data Cleaning and Engineering

**III(a) Correcting by dropping data**

After the data analysis we made some assumptions. Now we need to drop the features that do not contribute to our solution goal.
**Remember** to drop each feature from both the training and test datasets.

In [None]:
#dropping features (Cabin, Ticket and Name); PassengerId is kept because the submission file needs it

print("Before", train.shape, test.shape, all_data[0].shape, all_data[1].shape)

train= train.drop(['Ticket', 'Cabin', 'Name'], axis=1)
test= test.drop(['Ticket', 'Cabin','Name',], axis=1)
all_data=[train, test]

print("After", train.shape, test.shape, all_data[0].shape, all_data[1].shape)

**III(b) Completing data**

In [None]:

all_data = [train, test]
for dataset in all_data:
    # complete missing Age with the median
    # (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())

    # complete missing Embarked with the mode (most frequent value)
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])

    # complete missing Fare with the median
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
    


**III(c) Creating new features**

**Age Feature**
Let us create Age bands and determine correlations with Survived.

In [None]:
train['AgeBand']= pd.cut(train['Age'], 5)
train[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by= 'AgeBand', ascending=True)

Let us replace Age with ordinals based on these bands.

In [None]:
for dataset in all_data:
    dataset.loc[dataset['Age']<=16, 'Age']=0
    dataset.loc[(dataset['Age']>16) & (dataset['Age']<=32), 'Age']=1
    dataset.loc[(dataset['Age']>32) & (dataset['Age']<=48), 'Age']=2
    dataset.loc[(dataset['Age']>48) & (dataset['Age']<=64), 'Age']=3
    dataset.loc[dataset['Age']>64, 'Age']=4
    dataset['Age'] = dataset['Age'].astype(int)

train.head()
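
The manual thresholds above can also be expressed with `pd.cut`. A minimal sketch, assuming a hypothetical `ages` series and the same 16/32/48/64 cut points; `labels=False` makes `pd.cut` return the bin index instead of an interval:

```python
import pandas as pd

# Hypothetical ages chosen to hit every band and both sides of a boundary.
ages = pd.Series([4, 16, 17, 32, 40, 64, 80], name="Age")

# Right-closed bins matching the thresholds used above:
# (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
edges = [-float("inf"), 16, 32, 48, 64, float("inf")]

# labels=False returns the integer index of the bin for each value.
age_ordinal = pd.cut(ages, bins=edges, labels=False)
print(age_ordinal.tolist())
```

Keeping the edges in one list avoids repeating each boundary in two separate conditions, which is where off-by-one mistakes tend to creep in.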
    
    

In [None]:
#We can now remove the AgeBand feature.

train= train.drop(['AgeBand'], axis=1)
all_data= [train, test]
train.head()

**Fare Feature**

We can create a FareBand feature and replace Fare with ordinal values according to the band.

In [None]:
#review fareband
train['FareBand']= pd.qcut(train['Fare'],5)
train[['FareBand','Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
      


Based on the fare bands, we can convert the Fare feature to ordinal values.

In [None]:
for dataset in all_data:
    dataset.loc[dataset['Fare']<=7.854,'Fare']=0
    dataset.loc[(dataset['Fare']>7.854) & (dataset['Fare']<=10.5),'Fare']=1
    dataset.loc[(dataset['Fare']>10.5) & (dataset['Fare']<=21.679),'Fare']=2
    dataset.loc[(dataset['Fare']>21.679) & (dataset['Fare']<=39.688),'Fare']=3
    dataset.loc[dataset['Fare']>39.688, 'Fare']=4
    dataset['Fare'] = dataset['Fare'].astype(int)

train= train.drop(['FareBand'], axis=1)
all_data=[train,test]
train.head()    
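
The hard-coded fare thresholds above are the quantile edges that `pd.qcut` reported for the training data. A minimal sketch of deriving those edges once and reusing them for the test data, assuming hypothetical `train_fare` and `test_fare` series with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical fares; the real notebook would use train['Fare'] and test['Fare'].
train_fare = pd.Series([7.25, 8.05, 13.0, 26.0, 71.28, 7.9, 15.5, 31.0, 52.0, 9.5])
test_fare = pd.Series([7.0, 12.0, 30.0, 500.0])

# Learn the quantile edges on the training fares only.
_, edges = pd.qcut(train_fare, 5, retbins=True)

# Open the outer edges so fares outside the training range still get a band.
edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Apply the same edges to both datasets; labels=False yields ordinals 0..4.
train_band = pd.cut(train_fare, bins=edges, labels=False)
test_band = pd.cut(test_fare, bins=edges, labels=False)
print(test_band.tolist())
```

Deriving the edges in code keeps the thresholds in sync with the data if the imputation or the number of bands changes.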
    

**Parch and SibSp Feature**

We can create a new feature, FamilySize, by combining Parch and SibSp (plus one for the passenger).

In [None]:
for dataset in all_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] +1
    
    
train[['FamilySize','Survived']].groupby(['FamilySize'], as_index=True).mean().sort_values(by='Survived', ascending=False)   
    

We can create another feature called IsAlone, set to 1 for passengers travelling without family (FamilySize of 1).

In [None]:
for dataset in all_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train[['IsAlone','Survived']].groupby(['IsAlone'], as_index=False).mean().sort_values(by='Survived', ascending=False)


    
    

In [None]:
# drop SibSp and Parch

train=train.drop(['SibSp','Parch'], axis=1)
test=test.drop(['SibSp','Parch'], axis=1)
all_data=[train, test]

train.head()

**III(d) Converting categorical values to numerical values**

In [None]:
# converting categorical features (Sex and Embarked) to numerical values

tr_sex= pd.get_dummies(train['Sex'])
ts_sex= pd.get_dummies(test['Sex'])

tr_embark=pd.get_dummies(train['Embarked'])
ts_embark=pd.get_dummies(test['Embarked'])

#drop sex and embark 

train= train.drop(['Sex','Embarked'], axis=1)
test= test.drop(['Sex','Embarked'], axis=1)

#add new column

train= pd.concat([train, tr_sex, tr_embark], axis=1)
test= pd.concat([test, ts_sex, ts_embark], axis=1)
all_data=[train, test]




In [None]:
train.head()

In [None]:
test.head()
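
Encoding train and test separately only works if both contain the same categories, which should be checked before modelling. A minimal sketch of aligning the dummy columns, assuming hypothetical `train_port` and `test_port` series in which the test data lacks one port:

```python
import pandas as pd

# Hypothetical data: the test series has no "Q" value.
train_port = pd.Series(["S", "C", "Q", "S"], name="Embarked")
test_port = pd.Series(["S", "S", "C"], name="Embarked")

tr_dummies = pd.get_dummies(train_port)
ts_dummies = pd.get_dummies(test_port)

# Reindex the test columns to the training columns;
# columns missing from the test data are added and filled with 0.
ts_dummies = ts_dummies.reindex(columns=tr_dummies.columns, fill_value=0)

print(list(ts_dummies.columns))
```

After alignment both frames expose the same columns in the same order, which is what a fitted model expects at prediction time.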

# IV. Model, predict and solve

Now we are ready to train a model on our dataset and predict the solution goal. We will use the **Random Forest Classifier**, one of the most popular algorithms. Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
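
To make the combination of trees concrete, the individual trees of a fitted forest can be inspected. A minimal sketch on synthetic data (an assumption, not the Titanic set); note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities and picks the most probable class, which is a soft form of the majority vote described above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Collect the class probabilities predicted by each individual tree.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average over the trees (axis 0) to get one probability row per sample.
mean_proba = tree_probas.mean(axis=0)

# The forest predicts the class with the highest mean probability.
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

print(np.array_equal(manual_pred, forest.predict(X)))
```

An odd number of trees is used here so that a two-class vote cannot end in a tie.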

In [None]:
from sklearn.model_selection import train_test_split

x=train.drop("Survived", axis=1)
y=train["Survived"]

x_train, x_test, y_train, y_test =train_test_split(x, y,test_size=0.3) #70% training and 30% test

from sklearn.ensemble import RandomForestClassifier

#create a random forest classifier
model= RandomForestClassifier(n_estimators=100)

#Train the model using the training set
model.fit(x_train,y_train)
prediction= model.predict(x_test)

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(y_test,prediction))

In [None]:
pred_test= model.predict(test)

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': pred_test})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")