#### The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

#### Regression or Classification Problem?

Our goal is to predict the **Survived** variable, which takes the value 1 (survived) or 0 (did not survive).
Because the target is a discrete class, we use a classification model.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC,LinearSVC
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.neighbors import  KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB


In [None]:
train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")

In [None]:
le = LabelEncoder()

In [None]:
train.info()

In [None]:
print("Numeric columns")
print(train.select_dtypes(include='number').columns)
print("--------------------------------------------")
print("Categorical columns")
print(train.select_dtypes(include='object').columns)

#### Analysing the Target Feature

In [None]:
sns.countplot(x='Survived',data=train)


The plot shows that most passengers did not survive.

In [None]:
not_survived =  (train['Survived'].value_counts()[0] / len(train['Survived']) ) * 100 
survived = 100 - not_survived
print("Survived : {:.2f}% , Not Survived : {:.2f}%".format(survived,not_survived))

Let's analyse which features account for this difference.

In [None]:
f,ax = plt.subplots(1,3,figsize=(16,4))
sns.countplot(x='Sex',data=train , ax=ax[0])
sns.countplot(x='Survived',hue='Sex',data=train , ax=ax[1])
sns.countplot(x='Survived',hue='Sex',data=train[(train['Sex'] == 'female')] , ax=ax[2] , palette="Set2")


Observations : 
*     Although men outnumbered women on board, their survival rate was much lower.
*     About 74% of the women survived.
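
The percentages behind observations like these can be read off with a group-wise mean instead of estimating them from bar heights. The sketch below uses a small hypothetical DataFrame (`toy`) in place of `train`, and `survival_rate_by` is an illustrative helper name, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv (not the real passenger records).
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

def survival_rate_by(df, column):
    """Return the share of survivors (in percent) for each value of `column`."""
    # Survived is 0/1, so its mean within a group is that group's survival rate.
    return df.groupby(column)["Survived"].mean().mul(100).round(1)

rates = survival_rate_by(toy, "Sex")
print(rates)
```

Calling the same helper with `train` and `'Pclass'` or `'Embarked'` should give the corresponding rates for the other categorical features.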
    

In [None]:
f,ax = plt.subplots(1,3,figsize=(16,4))
sns.countplot(x='Pclass',data=train , ax=ax[0])
sns.countplot(x='Survived',hue='Pclass',data=train , ax=ax[1])
sns.countplot(x='Survived',hue='Pclass',data=train[(train['Pclass'] == 3)] , ax=ax[2])


Observations :
*     Pclass - ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
*     3rd class had the most passengers but a low survival rate (about 76% of them did not survive).
*     About 63% of the 1st class passengers survived.
    
    
    

In [None]:
f,ax = plt.subplots(1,3,figsize=(16,4))
sns.pointplot(x='Survived',y='Fare',data=train,ax=ax[0])
sns.boxplot(x='Fare',data=train,ax=ax[1])
sns.pointplot(x='Pclass',y='Fare',data=train,ax=ax[2])
print(train['Fare'].describe())

In [None]:
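# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is a close replacement
sns.histplot(train[train['Survived'] == 1]['Fare'], kde=True, stat='density', color='C0', label='Survived')
sns.histplot(train[train['Survived'] == 0]['Fare'], kde=True, stat='density', color='C1', label='Not survived')
plt.legend()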
# plt.hist(train[train['Survived'] == 1]['Fare'], normed=True, alpha=0.5)
# plt.hist(train[train['Survived'] == 0]['Fare'], normed=True, alpha=0.5)
# sum(train[train['Fare']>263]['Survived'] == 1)

In [None]:
# # g = sns.FacetGrid(data = train, hue = "Survived", legend_out=True,size=5)
# # g = g.map(sns.kdeplot, "Age")
# # g.add_legend();
# hue needs the full DataFrame so seaborn can look up the 'Survived' column
sns.kdeplot(data=train, x='Age', hue='Survived', fill=True)


Observations : 
*     Like Pclass, the Fare feature shows that passengers who paid more were more likely to survive.
*     Fare has a very wide spread, so we will convert it into discrete bins.
*     In the 3rd plot, the average fare decreases as the class goes from 1st to 3rd.
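
To see why quantile bins suit a skewed feature like Fare, the sketch below compares equal-width bins (`pd.cut`) with quantile bins (`pd.qcut`). The `fares` values are hypothetical and only mimic the shape of the real column (many cheap tickets, a few very expensive ones).

```python
import pandas as pd

# Hypothetical skewed fares: most are cheap, a few are very expensive.
fares = pd.Series([7.25, 7.9, 8.05, 10.5, 13.0, 15.5, 26.0, 31.0, 71.28, 512.33])

# Equal-width bins: the outlier stretches the range, so nearly every fare lands in the first bin.
equal_width = pd.cut(fares, 5, labels=False)

# Quantile bins: the edges follow the data, so the rows are spread evenly across bins.
quantile = pd.qcut(fares, 5, labels=False)

print(equal_width.value_counts().sort_index())
print(quantile.value_counts().sort_index())
```

With equal-width bins the model would see almost no variation in the binned feature, which is one reason to prefer `pd.qcut` here.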

In [None]:
f,ax = plt.subplots(1,3,figsize=(16,4))
sns.countplot(x='Embarked',data=train , ax=ax[0])
sns.countplot(x='Survived',hue='Embarked',data=train , ax=ax[1])
sns.countplot(hue='Pclass',x='Embarked',data=train,ax=ax[2])

Observations : 
*     Embarked - C = Cherbourg, Q = Queenstown, S = Southampton
*     Most passengers boarded at Southampton (S), many of them in Pclass 3, and their survival rate was low.
    

In [None]:
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.pointplot(x="SibSp", y="Survived",hue='Sex',data=train,ax=ax[0])
sns.pointplot(x="Parch", y="Survived",hue='Sex',data=train,ax=ax[1])

Observations : 
* Parch - the survival rate is higher with 1-3 parents/children aboard but drops sharply when the Parch count is above 3.
* SibSp - the survival rate drops as the SibSp count increases, so we expect a negative correlation.
* The survival rate is zero when SibSp >= 5 or Parch > 5.
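
The expected sign of the correlation can be checked directly with `Series.corr`. The sketch below runs on a hypothetical `toy` frame rather than `train`, so the number it prints says nothing about the real data; it only shows the check.

```python
import pandas as pd

# Hypothetical toy rows, not the real passenger records.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 1, 1, 2, 3, 4, 5],
    "Survived": [1, 0, 1, 1, 0, 0, 0, 0],
})

# Pearson correlation between the count and the 0/1 target; a negative value
# means survival tends to fall as the count grows.
corr = toy["SibSp"].corr(toy["Survived"])
print(round(corr, 3))
```

Pearson correlation only measures a linear trend, so a relationship that first rises and then falls (as the Parch plot suggests) can produce a coefficient close to zero even when the feature is informative.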

    

In [None]:
all_dataset = pd.concat([train,test],sort=False)

#### Visualize missing data

In [None]:
missing_value = all_dataset.isnull().sum().sort_values(ascending=False) / len(all_dataset) * 100
missing_value = missing_value[missing_value != 0]
missing_value = pd.DataFrame({'Missing value' :missing_value,'Type':missing_value.index.map(lambda x:all_dataset[x].dtype)})
missing_value.plot(kind='bar',figsize=(16,4))
plt.show()

#### Fill missing values

In [None]:
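# The passengers with a missing Embarked value travelled in Pclass 1, so fill with the Pclass 1 mode
# The test set has no missing Embarked values
embarked_mode = train[train['Pclass']==1]['Embarked'].mode()[0]
# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
train['Embarked'] = train['Embarked'].fillna(embarked_mode)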

In [None]:
train['Age'].describe()

Observations : 
*     The mean age is about 29.7, but assigning it to every missing value would misrepresent children and older passengers.
*     We need a way to fill ages by group instead.
*     The title in each name may help, so let's analyse that.
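
As a minimal sketch of the idea, assuming names follow the "Surname, Title. Given names" pattern, the title can be pulled out with a regular expression and each missing age filled with the mean age for that title. The `toy` passengers below are hypothetical, and the regex is one possible way to extract the title.

```python
import numpy as np
import pandas as pd

# Hypothetical toy passengers, not the real data.
toy = pd.DataFrame({
    "Name": [
        "Smith, Mr. John", "Brown, Mr. Adam", "Jones, Mr. Paul",
        "Smith, Master. Tom", "Brown, Master. Ned",
        "Jones, Miss. Anna", "Clark, Miss. Ruth",
    ],
    "Age": [40.0, 30.0, np.nan, 4.0, np.nan, 20.0, np.nan],
})

# The title sits between the comma and the first period of the name.
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Fill each missing age with the mean age of passengers sharing the same title.
toy["Age"] = toy["Age"].fillna(toy.groupby("Title")["Age"].transform("mean"))
print(toy[["Title", "Age"]])
```

In this toy example a missing "Master" age becomes 4 rather than the overall mean, which is the effect the observations above are after.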

In [None]:
train['Title'] = train['Name'].str.split(", ").str[1].str.split(".").str[0]
test['Title'] = test['Name'].str.split(", ").str[1].str.split(".").str[0]


In [None]:
def replaceTitle(fromValue,to):
    # Build a {old_title: new_title} mapping for Series.replace
    return dict.fromkeys(fromValue, to)

# Collapse rare titles into a few common groups; 'Dona' occurs only in the test set
title_map = {}
title_map.update(replaceTitle(['Mlle','Mme','Ms'], 'Miss'))
title_map.update(replaceTitle(['Dr','Major','Capt','Sir','Don'], 'Mr'))
title_map.update(replaceTitle(['Lady','the Countess','Dona'], 'Mrs'))
title_map.update(replaceTitle(['Jonkheer','Col','Rev'], 'Unknown'))

# Apply the same mapping to both sets so they share one title vocabulary
train['Title'] = train['Title'].replace(title_map)
test['Title'] = test['Title'].replace(title_map)

In [None]:
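# Fill missing ages with the rounded mean age of passengers sharing the same title
train['Age']= train.groupby('Title')['Age'].transform(lambda x:x.fillna(int(round(x.mean()))))
# Same imputation for the test set, using its own title groups
test['Age']= test.groupby('Title')['Age'].transform(lambda x:x.fillna(int(round(x.mean()))))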

In [None]:
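# Cut Age into 5 equal-width bins on train and reuse the same edges for test,
# so a given bin code means the same age range in both sets
train['Age'], age_bins = pd.cut(train['Age'], 5, labels=False, retbins=True)
# Open the outer edges so test ages outside the train range still fall into a bin
age_bins[0], age_bins[-1] = -np.inf, np.inf
test['Age'] = pd.cut(test['Age'], bins=age_bins, labels=False)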

In [None]:
g = sns.FacetGrid(data = train, hue = "Title", legend_out=True,height=5)
g = g.map(sns.kdeplot, "Age")
g.add_legend();

In [None]:
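# Fill the missing test fare with the mean fare of similar passengers (Southampton, 3rd class, Mr)
fill_fare = test[(test['Embarked']=='S') & (test['Pclass'] == 3) & (test['Title']=='Mr') ]['Fare'].mean()
test['Fare'] = test['Fare'].fillna(fill_fare)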

In [None]:
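# Cut Fare into 5 quantile bins on train and reuse the same edges for test
train['Fare_cat'], fare_bins = pd.qcut(train['Fare'], 5, labels=False, retbins=True)
# Open the outer edges so test fares outside the train range still fall into a bin
fare_bins[0], fare_bins[-1] = -np.inf, np.inf
test['Fare_cat'] = pd.cut(test['Fare'], bins=fare_bins, labels=False)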

In [None]:
# Keep only the deck letter (first character) of each cabin; missing values stay NaN
train['Cabin'] = train['Cabin'].str[0]
test['Cabin'] = test['Cabin'].str[0]

In [None]:
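def fill_cabin(x):
    # Use the most common value in the fare band; leave the group unchanged if none is known
    mode = x.mode()
    return x.fillna(mode[0]) if not mode.empty else x

train['Cabin'] = train.groupby(['Fare_cat'])['Cabin'].transform(fill_cabin)
test['Cabin'] = test.groupby(['Fare_cat'])['Cabin'].transform(fill_cabin)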
# train['Cabin'].value_counts()

train['Cabin'].isnull().sum()

#### Correlation

In [None]:
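# Restrict to numeric columns; recent pandas raises an error if text columns are passed to corr()
corr = train.select_dtypes(include='number').corr().sort_values(by='Survived',ascending=False)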
plt.subplots(figsize=(16,8))
sns.heatmap(corr,annot=True)

In [None]:
all_dataset = pd.concat([train,test],sort=False)
y = all_dataset['Survived']
all_dataset.drop('Survived',axis=1,inplace=True)

#### Feature Engineering

In [None]:
all_dataset['TotalFamilySize'] = all_dataset['SibSp'] + all_dataset['Parch']
all_dataset['IsAlone'] = 1 * (all_dataset['TotalFamilySize'] == 0)
all_dataset['child_ladies_first'] = 1 * ((all_dataset['Sex'] == 'female') | (all_dataset['Title'] == 'Master'))

In [None]:
all_dataset.drop(['SibSp','Parch','PassengerId','Name','Ticket'],axis=1,inplace=True)

In [None]:
all_dataset = pd.get_dummies(all_dataset,drop_first=True)

#### Model Predictions

In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve

In [None]:
kfold = StratifiedKFold(n_splits=10)

In [None]:
train_len = train.shape[0]
X_train = all_dataset[:train_len]
X_test = all_dataset[train_len:]

def cv_score(classifier):
    return cross_val_score(classifier,X_train,y=train['Survived'],scoring='accuracy',cv=kfold)

In [None]:
voting_classifier = VotingClassifier(estimators=[
        ('lr', LogisticRegression()), ('rf', RandomForestClassifier()), ('svc', SVC(probability=True))], voting='soft')
classifer_result = []
classifer_result.append(LogisticRegression(random_state=42))
classifer_result.append(RandomForestClassifier(random_state=42))
classifer_result.append(KNeighborsClassifier())
classifer_result.append(GaussianNB())
classifer_result.append(SVC(random_state=42))
classifer_result.append(LinearSVC(random_state=42))
classifer_result.append(voting_classifier)


In [None]:
cv_results = []
for classifier in classifer_result:
    cv_results.append(cv_score(classifier).mean())

In [None]:
cv_results

In [None]:
voting_classifier.fit(X_train,train['Survived'])
predictions = voting_classifier.predict(X_test)

In [None]:
predictions

In [None]:
submission = pd.read_csv("../input/titanic/gender_submission.csv")

In [None]:
submission['Survived'] = predictions


In [None]:
submission.to_csv("submission.csv",index=False)