#  **TITANIC**

**Data Dictionary**

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | Number of siblings / spouses aboard the Titanic | |
| parch | Number of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

**Variable Notes**
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
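The xx.5 convention can be used to flag estimated ages. A minimal sketch, assuming a small made-up `ages` series rather than the real `Age` column; the `is_estimated` name is illustrative only:

```python
import pandas as pd

# Hypothetical ages, not taken from the dataset.
ages = pd.Series([0.75, 22.0, 28.5, 40.5, 35.0, None])

# Estimated ages have the form xx.5. Ages below 1 are fractional for a
# different reason (infants), so they are excluded from the flag.
is_estimated = (ages >= 1) & ((ages % 1) == 0.5)
print(is_estimated.tolist())
```

Missing ages compare as False in both conditions, so they are not flagged as estimated.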

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

# > **IMPORT LIBRARIES AND READ THE DATA**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


df = pd.read_csv('../input/titanic/train.csv')
test_df = pd.read_csv('../input/titanic/test.csv')
print("The shape of the training set is {}.\n".format(df.shape))
print("The shape of the test set is {}.\n".format(test_df.shape))
df.head()


# > **EDA**

In [2]:
df.info()

In [3]:
df.describe()

In [4]:
df.isnull().sum()

In [5]:
df[df.duplicated()].count()

In [6]:
df_copy = df.copy()
df_copy.head()

In [7]:
df_copy.isnull().sum()

# > **FEATURE CORRELATION**

In [8]:
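# Restrict to numeric columns; recent pandas versions raise an error on text columns such as Name
df_copy.select_dtypes(include='number').corr()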

In [9]:
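# Correlation of each numeric feature with the target, strongest first
corr_matrix=df_copy.select_dtypes(include='number').corr()
corr_matrix['Survived'].sort_values(ascending=False)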

# > **PREPROCESSING**

In [10]:
le=LabelEncoder()
df_copy['Sex']=le.fit_transform(df_copy['Sex'])

In [11]:
df_copy['Embarked'] = df_copy['Embarked'].fillna("Unknown")
df_copy.isnull().sum()

In [12]:
df_copy['Embarked']=le.fit_transform(df_copy['Embarked'])

In [13]:
df_copy['Age'] = df_copy['Age'].fillna(df_copy['Age'].median()) 
df_copy.isnull().sum()

In [14]:
df_copy.info()

In [15]:
# Recompute on numeric columns now that Sex and Embarked are encoded
corr_matrix=df_copy.select_dtypes(include='number').corr()
corr_matrix['Survived'].sort_values(ascending=False)

# > **DATA VISUALIZATION**

In [16]:
_ = pd.plotting.scatter_matrix(df_copy,figsize = [18, 12])

In [64]:
plt.figure(figsize=(10,10))
sns.heatmap(data = df_copy.select_dtypes(include='number').corr(),annot=True)
plt.show()

In [18]:
print(df_copy[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False))
print('-'*20)
print(df_copy[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False))
print('-'*20)
print(df_copy[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False))
print('-'*20)
print(df_copy[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False))

In [19]:
g = sns.FacetGrid(df_copy, col='Survived', height=8.2, aspect=1.6)
g.map(plt.hist, 'Age', bins=20)

In [20]:
g = sns.FacetGrid(df_copy, col='Survived', height=8.2, aspect=1.6)
g.map(plt.hist, 'Sex', bins=20)

In [21]:
df_copy.columns

# > **FEATURE SELECTION FOR TRAINING**

In [22]:
X=df_copy.drop(['PassengerId','Name','Survived','Ticket','Cabin'], axis=1)
y=df_copy['Survived']

# > **SPLIT DATA**

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.2, random_state=42)

In [24]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
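A possible variation on the split is to stratify on the target so both parts keep the same class ratio. The sketch below uses made-up arrays (`X_demo`, `y_demo`) rather than the Titanic features, and `stratify` is an optional argument the notebook does not use:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40% positive labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo keeps the share of positives equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```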

# > **MACHINE LEARNING MODELS**

In [25]:
from sklearn.metrics import mean_squared_log_error 
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
# Fit the model on the training split
rf.fit(X_train,y_train)

# Predict on the hold-out split
rf_pred=rf.predict(X_test)
print(rf_pred.shape)

# 5-fold cross-validated accuracy on the full data
scores = cross_val_score(rf, X, y, cv=5)
print(scores.mean())
# score() returns accuracy, not ROC AUC; X includes the rows the model was trained on
print('Accuracy (all rows): %0.3f' % rf.score(X, y))

print('train_score:'+str(rf.score(X_train, y_train)))
print('test_score:'+str(rf.score(X_test, y_test)))
# The square root of the mean squared error is the RMSE
print('RMSE:' + str(np.sqrt(mean_squared_error(y_test, rf_pred))))
print('RMSLE:' + str(np.sqrt(mean_squared_log_error(y_test, rf_pred))))

In [26]:
from sklearn.metrics import roc_curve
y_pred_prob = rf.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='rf')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show();
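The curve can be summarised by its area. Note that an estimator's `score()` method reports accuracy, so ROC AUC needs `roc_auc_score` applied to predicted probabilities. A minimal sketch on synthetic data from `make_classification` (the data and the `model` name are placeholders, not the Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# ROC AUC uses the predicted probability of the positive class,
# whereas accuracy uses the hard class predictions.
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
acc = accuracy_score(y_te, model.predict(X_te))
print("ROC AUC: %0.3f" % auc)
print("Accuracy: %0.3f" % acc)
```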

In [27]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred))

In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': np.arange(1, 50)}

rfc=RandomForestClassifier()
rfc2 = GridSearchCV(rfc, param_grid, cv=2)
rfc2.fit(X_train, y_train)
rfc2_pred=rfc2.predict(X_test)

print(rfc2.best_params_)
print(rfc2.best_score_)

scores = cross_val_score(rfc2, X, y, cv=5)
print(scores.mean())
# score() returns accuracy, not ROC AUC
print('Accuracy (all rows): %0.3f' % rfc2.score(X, y))

print('train_score:'+str(rfc2.score(X_train, y_train)))
print('test_score:'+str(rfc2.score(X_test, y_test)))
print('RMSE:' + str(np.sqrt(mean_squared_error(y_test, rfc2_pred))))
print('RMSLE:' + str(np.sqrt(mean_squared_log_error(y_test, rfc2_pred))))

In [29]:
from sklearn.metrics import roc_curve
y_pred_prob = rfc2.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='rf')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show();

In [30]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, rfc2_pred))
print(classification_report(y_test, rfc2_pred))

In [31]:
from sklearn import tree
tre = tree.DecisionTreeClassifier()
# Fit on the training split only, so the hold-out rows stay unseen
tre.fit(X_train, y_train)
# Predict on the hold-out split
tre_pred=tre.predict(X_test)
print(tre_pred.shape)

scores = cross_val_score(tre, X, y, cv=5)
print(scores.mean())
# score() returns accuracy, not ROC AUC
print('Accuracy (all rows): %0.3f' % tre.score(X, y))

# Report the decision tree's own scores (the original printed the random forest's)
print('train_score:'+str(tre.score(X_train, y_train)))
print('test_score:'+str(tre.score(X_test, y_test)))
print('RMSE:' + str(np.sqrt(mean_squared_error(y_test, tre_pred))))
print('RMSLE:' + str(np.sqrt(mean_squared_log_error(y_test, tre_pred))))

In [32]:
from sklearn.metrics import roc_curve
y_pred_prob = tre.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='tre')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show();

In [33]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, tre_pred))
print(classification_report(y_test, tre_pred))

In [34]:
from xgboost.sklearn import XGBClassifier
xgb = XGBClassifier(learning_rate=0.01 ,
                                        n_estimators=900,
                                        max_depth=5,
                                        subsample=1,
                                        colsample_bytree=1,
                                        gamma=6,
                                        reg_alpha = 14,
                                        reg_lambda = 3)

xgb.fit(X_train, y_train)
# Predict on the hold-out split
xgb_pred=xgb.predict(X_test)

scores = cross_val_score(xgb, X, y, cv=5)
print(scores.mean())
# score() returns accuracy, not ROC AUC
print('Accuracy (all rows): %0.3f' % xgb.score(X, y))

print('train_score:'+str(xgb.score(X_train, y_train)))
print('test_score:'+str(xgb.score(X_test, y_test)))
print('RMSE:' + str(np.sqrt(mean_squared_error(y_test, xgb_pred))))
print('RMSLE:' + str(np.sqrt(mean_squared_log_error(y_test, xgb_pred))))

In [35]:
from sklearn.metrics import roc_curve
y_pred_prob = xgb.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='xgb')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show();

In [36]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, xgb_pred))
print(classification_report(y_test, xgb_pred))

In [37]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(penalty = 'l2',solver = 'liblinear', C = 0.25)
log.fit(X_train, y_train)
# Predict on the hold-out split
log_pred=log.predict(X_test)

scores = cross_val_score(log, X, y, cv=5)
print(scores.mean())
# score() returns accuracy, not ROC AUC
print('Accuracy (all rows): %0.3f' % log.score(X, y))

print('train_score:'+str(log.score(X_train, y_train)))
print('test_score:'+str(log.score(X_test, y_test)))
print('RMSE:' + str(np.sqrt(mean_squared_error(y_test, log_pred))))
print('RMSLE:' + str(np.sqrt(mean_squared_log_error(y_test, log_pred))))

In [38]:
from sklearn.metrics import roc_curve
y_pred_prob = log.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='log')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show();

In [39]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, log_pred))
print(classification_report(y_test, log_pred))

In [40]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

steps = [('scaler', StandardScaler()),('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)

parameters = {'knn__n_neighbors': np.arange(1, 50)}

cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

scores = cross_val_score(cv, X, y, cv=5)
print(scores.mean())
# score() returns accuracy, not ROC AUC
print('Accuracy (all rows): %0.3f' % cv.score(X, y))

print('train_score:'+str(cv.score(X_train, y_train)))
print('test_score:'+str(cv.score(X_test, y_test)))
print('RMSE:' + str(np.sqrt(mean_squared_error(y_test, y_pred))))
print('RMSLE:' + str(np.sqrt(mean_squared_log_error(y_test, y_pred))))

In [41]:
from sklearn.metrics import roc_curve
y_pred_prob = cv.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='knn')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show();

In [42]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [43]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
qda_pred = qda.predict(X_test)

scores = cross_val_score(qda, X, y, cv=5)
print(scores.mean())
# score() returns accuracy, not ROC AUC
print('Accuracy (all rows): %0.3f' % qda.score(X, y))

print('train_score:'+str(qda.score(X_train, y_train)))
print('test_score:'+str(qda.score(X_test, y_test)))
print('RMSE:' + str(np.sqrt(mean_squared_error(y_test, qda_pred))))
print('RMSLE:' + str(np.sqrt(mean_squared_log_error(y_test, qda_pred))))

In [44]:
from sklearn.metrics import roc_curve
y_pred_prob = qda.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='qda')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show();

In [45]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, qda_pred))
print(classification_report(y_test, qda_pred))

In [46]:
from sklearn.svm import SVC
svc=SVC()
svc.fit(X_train,y_train)
svc_pred=svc.predict(X_test)

scores = cross_val_score(svc, X, y, cv=5)
print(scores.mean())
# score() returns accuracy, not ROC AUC
print('Accuracy (all rows): %0.3f' % svc.score(X, y))

print('train_score:'+str(svc.score(X_train, y_train)))
print('test_score:'+str(svc.score(X_test, y_test)))
print('RMSE:' + str(np.sqrt(mean_squared_error(y_test, svc_pred))))
print('RMSLE:' + str(np.sqrt(mean_squared_log_error(y_test, svc_pred))))

In [47]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, svc_pred))
print(classification_report(y_test, svc_pred))

In [48]:
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier()
ada_clf.fit(X_train, y_train)
ada_pred=ada_clf.predict(X_test)

scores = cross_val_score(ada_clf, X, y, cv=5)
print(scores.mean())
# score() returns accuracy, not ROC AUC
print('Accuracy (all rows): %0.3f' % ada_clf.score(X, y))

print('train_score:'+str(ada_clf.score(X_train, y_train)))
print('test_score:'+str(ada_clf.score(X_test, y_test)))
print('RMSE:' + str(np.sqrt(mean_squared_error(y_test, ada_pred))))
print('RMSLE:' + str(np.sqrt(mean_squared_log_error(y_test, ada_pred))))

In [49]:
from sklearn.metrics import roc_curve
y_pred_prob = ada_clf.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='ada_clf')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show();

In [50]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, ada_pred))
print(classification_report(y_test, ada_pred))

In [51]:
from sklearn.ensemble import VotingClassifier
classifier = VotingClassifier(estimators=[('qda', qda),('ada_clf', ada_clf),('log',log),('cv',cv),('rfc2',rfc2),('xgb',xgb),('tre',tre)],voting='soft')          

classifier.fit(X_train, y_train)
class_pred = classifier.predict(X_test)

accuracies = cross_val_score(classifier, X, y , cv = 5)
print("5-fold cross-validation mean accuracy: {}".format(accuracies.mean()))
# score() returns accuracy, not ROC AUC
print('Accuracy (all rows): %0.3f' % classifier.score(X, y))

print('train_score:'+str(classifier.score(X_train, y_train)))
print('test_score:'+str(classifier.score(X_test, y_test)))
print('RMSE:' + str(np.sqrt(mean_squared_error(y_test, class_pred))))
print('RMSLE:' + str(np.sqrt(mean_squared_log_error(y_test, class_pred))))

In [52]:
from sklearn.metrics import roc_curve
y_pred_prob = classifier.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show();

In [53]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, class_pred))
print(classification_report(y_test, class_pred))
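The per-model cells repeat the same evaluation steps. One way to collect them in a single table is sketched below. The `evaluate` helper and the synthetic data are assumptions for illustration, not part of the original notebook, and the helper only works for classifiers that expose `predict_proba`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(name, model, X_tr, X_te, y_tr, y_te):
    """Fit a classifier and return train/test accuracy and test ROC AUC."""
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    return {
        "model": name,
        "train_acc": model.score(X_tr, y_tr),
        "test_acc": model.score(X_te, y_te),
        "test_auc": roc_auc_score(y_te, proba),
    }


# Synthetic stand-in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
# train_test_split returns (X_train, X_test, y_train, y_test) in that order.
split = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "log": LogisticRegression(solver="liblinear", C=0.25),
    "tre": DecisionTreeClassifier(random_state=0),
}
results = pd.DataFrame([evaluate(n, m, *split) for n, m in models.items()])
print(results.sort_values("test_auc", ascending=False))
```

Keeping all metrics in one frame makes it easier to compare models side by side than scanning separate cell outputs.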

# > **EDA FOR TESTING**

In [54]:
test_df.head()

In [55]:
test_df.info()

In [56]:
test_df.describe()

In [57]:
test_df.isnull().sum()

# > **PREPROCESSING FOR TESTING**

In [58]:
le=LabelEncoder()
test_df['Sex']=le.fit_transform(test_df['Sex'])
test_df['Embarked']=le.fit_transform(test_df['Embarked'])
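Fitting a fresh `LabelEncoder` on the test set yields the same codes as on the training set only when both sets sort their categories into the same positions. A sketch of a safer pattern, assuming small made-up frames (`train` and `test` are hypothetical names), fits the encoder once on the training column and reuses it:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frames standing in for the train and test sets.
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["Q", "S", "S"]})

encoder = LabelEncoder()
train["Embarked"] = encoder.fit_transform(train["Embarked"])  # learn the mapping
test["Embarked"] = encoder.transform(test["Embarked"])        # reuse it unchanged

print(list(encoder.classes_))
print(test["Embarked"].tolist())
```

`transform` raises an error if the test column contains a category the encoder never saw, which surfaces mismatches instead of silently shifting the codes.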

In [59]:
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median()) 
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())
test_df.isnull().sum()

In [60]:
test_df.info()

In [61]:
y_test_predicted = classifier.predict(test_df.drop(columns =['PassengerId','Name','Ticket','Cabin']))
test_df['Survived'] = y_test_predicted
test_df

# > **SUBMISSION**

In [62]:
test_df[['PassengerId', 'Survived']].to_csv('submission4.csv', index=False)