In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set() # seaborn's default theme; the 'seaborn' style name is deprecated since matplotlib 3.6
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
data = pd.read_csv('../input/titanic/train.csv')

In [None]:
data.head()

In [None]:
data.isnull().sum() # Age and Cabin have many null values

> In the training data, how many passengers survived?

In [None]:
f, ax = plt.subplots(1,2,figsize = (18,8))
data['Survived'].value_counts().plot.pie(explode = [0,0.1], autopct = '%1.1f%%',
                                        ax = ax[0], shadow = True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')
#data['Survived'].value_counts().plot.bar(ax = ax[1])
sns.countplot(data = data, x = 'Survived', ax = ax[1]) # draw on the second axes
ax[1].set_title('Survived')
plt.show()

In the training set, 38.4% of the passengers survived.

Let's dig deeper to see which categories of passengers survived and which didn't.

To check the survival rate, we use the different features of the dataset.

First, let's understand the different types of features.

In [None]:
data.info()

Types of Features

1. Categorical Features:
    A categorical variable has two or more categories, and each value
    of the feature falls into one of them. For example, gender is a
    categorical variable with two categories. We cannot sort or order such variables.
    They are also known as nominal variables.
    -> Sex, Embarked
2. Ordinal Features:
    An ordinal variable is similar to a categorical variable, but its
    values have a relative order and can be sorted.
    For example, Pclass is an ordinal variable: 1st, 2nd and 3rd class have a natural order.
    -> Pclass
3. Continuous Features:
    A feature is continuous if it can take any value between two points, that is,
    between the minimum and maximum values in the feature column.
    -> Age
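
As a rough illustration of the three types, the sketch below builds a tiny frame with invented values (not taken from train.csv) and encodes each column accordingly. Treating Pclass as an ordered categorical in which 3rd class ranks below 1st is an assumption made for the example, not something the notebook does.

```python
import pandas as pd

# Toy rows that mimic the Titanic columns; the values are invented for illustration.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Pclass': [3, 1, 2, 3],
    'Age': [22.0, 38.0, 26.5, 35.0],
})

# Nominal: categories without any order.
toy['Sex'] = pd.Categorical(toy['Sex'])

# Ordinal: categories with an order (assumed here: 3rd < 2nd < 1st class).
toy['Pclass'] = pd.Categorical(toy['Pclass'], categories=[3, 2, 1], ordered=True)

print(toy.dtypes)
print(toy['Sex'].cat.ordered)      # nominal, so not ordered
print(toy['Pclass'].cat.ordered)   # ordinal, so ordered
print(toy['Pclass'].min())         # lowest-ranked class present in the toy data
```

Age stays a plain float column, since a continuous feature needs no category encoding.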


# Analysis of Features

### Sex -> Categorical Feature

In the training data, we can group passengers by Sex and Survived.

### Visualization

In [None]:
data.groupby(['Sex','Survived'])['Survived'].count()

There are far more men than women on the ship.
Still, about twice as many women as men were saved.


In [None]:
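f, ax = plt.subplots(1,2, figsize = (18,8))
data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax = ax[0])
ax[0].set_title('Survival rate by Sex')
sns.countplot(data = data, x = 'Sex', hue = 'Survived', ax = ax[1]) # draw on the second axes
ax[1].set_title('Sex : Survived vs Dead')
plt.show()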

The survival rate for women on the ship is around 75%, while that for men is around 18-19%.
This looks to be a very important feature for modeling.

But is it the best?
Hmm, I don't know yet. Let's check the other features.

### Pclass -> Ordinal Feature

In [None]:
data.info()

In [None]:
pd.crosstab(data['Pclass'], data['Survived'], margins = True).style.background_gradient(cmap = 'Blues')

In [None]:
f, ax = plt.subplots(1,2, figsize = (18,8))
data['Pclass'].value_counts().plot.bar(ax = ax[0])
ax[0].set_title('Number of Passengers By Pclass')
sns.countplot(data = data, x = 'Pclass',hue = 'Survived', ax = ax[1]) # draw on the second axes
ax[1].set_title('Pclass : Survived vs Dead')
plt.show()

In [None]:
# survival rate per class = mean of the 0/1 Survived column
data.groupby('Pclass')['Survived'].mean().plot.bar(figsize = (18,10))

People say money cannot buy everything. But we can clearly see that passengers of Pclass 1 were given a very high priority during the rescue.
Even though the number of passengers in Pclass 3 was a lot higher, the survival rate among them is very low, somewhere around 25%.

Around 63% of Pclass 1 passengers survived, while for Pclass 2 the figure is around 47%. So money and status matter. Such a materialistic world.
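
Percentages like these are the row-wise shares of a crosstab. A minimal sketch on invented rows (not the real data) shows how `normalize='index'` turns counts into per-class survival rates, and that the group mean of the 0/1 Survived column gives the same number:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Share of each outcome within a class (each row sums to 1).
rates = pd.crosstab(toy['Pclass'], toy['Survived'], normalize='index')
print(rates)

# Because Survived is 0/1, the group mean is the same survival rate.
mean_rates = toy.groupby('Pclass')['Survived'].mean()
print(mean_rates)
```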

Check Survival by Sex and Pclass

In [None]:
pd.crosstab([data['Sex'], data['Survived']],data['Pclass'],margins = True).style.background_gradient(cmap = 'Blues')

In [None]:
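# factorplot was renamed to catplot in seaborn 0.9; kind = 'point' keeps the same plot
sns.catplot(data = data, x = 'Pclass', y = 'Survived', hue = 'Sex', kind = 'point')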
plt.show()

We use a factor plot here because it makes the separation of categorical values easy to see.

Looking at the crosstab and the factor plot, we can easily infer that the survival rate for women from Pclass1 is about 97%, as only 3 out of 94 women from Pclass1 died.

It is evident that, irrespective of Pclass, women were given first priority during the rescue. Even men from Pclass1 have a very low survival rate.

Pclass looks like an important feature too. Let's analyse the other features.

### Age -> Continuous Feature

In [None]:
data[['Age']].describe()

In [None]:
f, ax = plt.subplots(1,2, figsize = (18,8))
sns.violinplot(data = data, x = 'Pclass', y = 'Age', hue = 'Survived', split = True, ax = ax[0])
ax[0].set_title('Pclass and Age vs Survived')
sns.violinplot(data = data, x = 'Sex',y = 'Age',split = True, hue = 'Survived', ax = ax[1])
ax[1].set_title('Sex and Age vs Survived')
plt.show()

1. The number of children increases with Pclass, and the survival rate for passengers below age 10 looks good irrespective of Pclass.

2. Survival chances for passengers aged 25-50 from Pclass1 are high, and even better for women.

3. For males, the survival chances decrease as age increases.

We have 177 null values in the Age feature. To replace these NaN values, we could assign them the mean age of the dataset.

But the problem is that the passengers' ages vary widely. We cannot assign the mean age of 29 years to a 4-year-old child.

There is a way to find out which age band a passenger falls into: the Name feature. The names contain a salutation such as Mr or Mrs, so we can assign the mean age of each salutation group to the missing values in that group.
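
The idea can be sketched on a toy frame before applying it to the real data. The rows below are invented, and the `Initial` column simply stands in for the salutation taken from Name; using `groupby(...).transform('mean')` is one possible way to do the fill, not necessarily the one used in this notebook.

```python
import numpy as np
import pandas as pd

# Invented rows; 'Initial' stands in for the salutation taken from Name.
toy = pd.DataFrame({
    'Initial': ['Mr', 'Mr', 'Mr', 'Master', 'Master', 'Miss', 'Miss'],
    'Age':     [30.0, 40.0, np.nan, 4.0, np.nan, 20.0, np.nan],
})

# Mean age of each salutation group, aligned row by row with the original frame.
group_mean = toy.groupby('Initial')['Age'].transform('mean')

# Fill only the missing ages; known ages are left untouched.
toy['Age'] = toy['Age'].fillna(group_mean)
print(toy)
```

A missing Master age is filled with the mean of the known Master ages rather than with the overall mean, which is exactly the problem described above.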

### Name -> Feature

In [None]:
# extract the salutation: the run of letters that is followed by a dot
data['Initial'] = data['Name'].str.extract(r'([A-Za-z]+)\.', expand = False)

Using the regex ([A-Za-z]+)\.
It looks for a run of letters (A-Z or a-z) that is followed by a . (dot) and captures the letters. This extracts the salutation from the Name into the Initial column.
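
To see the pattern at work, here is a small self-contained check on three names written in the "Surname, Salutation. Given names" layout of the Name column. The Series `names` and the variable `initial` are introduced only for this illustration.

```python
import pandas as pd

# Three names in the "Surname, Salutation. Given names" layout.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# The parentheses capture the letters; the escaped dot must follow them but is not captured.
# expand=False returns a Series instead of a one-column DataFrame.
initial = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(initial.tolist())  # ['Mr', 'Miss', 'Master']
```

The surname is skipped because it is followed by a comma, not a dot, so the first match is always the salutation.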

In [None]:
data['Initial']

In [None]:
pd.crosstab(data.Initial,data.Sex).style.background_gradient(cmap='Blues')

In [None]:
data['Initial'].value_counts()

In [None]:
data.head()

In [None]:
titles = ['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don']
groups = ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr']
data['Initial'] = data['Initial'].replace(titles, groups) # assign back instead of inplace on a column
data['Initial'].value_counts()

In [None]:
data.groupby('Initial')['Age'].mean()

In [None]:
data.loc[(data['Age'].isnull()) & (data['Initial'] == 'Mr'),'Age'] = 33
data.loc[(data['Age'].isnull()) & (data['Initial'] == 'Mrs'),'Age'] = 36
data.loc[(data['Age'].isnull()) & (data['Initial'] == 'Other'), 'Age'] = 46
data.loc[(data['Age'].isnull()) & (data['Initial'] == 'Master'),'Age'] = 5
data.loc[(data['Age'].isnull()) & (data['Initial'] == 'Miss'),'Age'] = 22

In [None]:
data['Age'].isnull().sum()

In [None]:
#1.Survived = 0 hist // 2. Survived = 1 hist
f, ax = plt.subplots(1,2, figsize = (20,10))
data.loc[data['Survived'] == 0,'Age'].plot.hist(bins = 20,edgecolor = 'black', ax = ax[0])
ax[0].set_title('Survived = 0')
ax[0].set_xticks(range(0,85,5))
data.loc[data['Survived'] == 1, 'Age'].plot.hist(bins = 20, edgecolor = 'black',color = 'red', ax = ax[1])
ax[1].set_title('Survived = 1')
ax[1].set_xticks(range(0,85,5))

1. Toddlers (age < 5) were saved in large numbers (the women-and-children-first policy).
2. The oldest passenger was saved.
3. The maximum number of deaths was in the 30-40 age group.

In [None]:
# factorplot was renamed to catplot in seaborn 0.9; kind = 'point' keeps the same plot
sns.catplot(data = data, x = 'Pclass',y = 'Survived', col = 'Initial', kind = 'point')

The women-and-children-first policy thus holds true irrespective of class.

### Embarked -> Categorical Feature

In [None]:
pd.crosstab([data['Embarked'],data['Pclass']],[data['Sex'],data['Survived']],margins = True).style.background_gradient(cmap = 'Blues')

In [None]:
sns.catplot(data = data, x = 'Embarked',y = 'Survived', kind = 'point') # catplot replaces the older factorplot
fig = plt.gcf()
fig.set_size_inches(5,3)
plt.show()

In [None]:
f,ax = plt.subplots(2,2,figsize = (20,15))
sns.countplot(data = data, x = 'Embarked', ax = ax[0,0])
ax[0,0].set_title('No. Of passengers Boarded')
sns.countplot(data = data, x = 'Embarked', hue = 'Sex', ax = ax[0,1])
ax[0,1].set_title('Male-Female Split for Embarked')
sns.countplot(data = data, x = 'Embarked', hue = 'Survived', ax = ax[1,0])
ax[1,0].set_title('Embarked vs Survived')
sns.countplot(data = data, x = 'Embarked', hue = 'Pclass', ax = ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')

### Observations:

1. Most passengers boarded at S, the majority of them from Pclass 3.
2. The passengers from C look to be lucky, as a good proportion of them survived. The reason may be the rescue of all the Pclass1 and Pclass2 passengers.
3. Port S looks to be the port where the majority of the rich people boarded. Still, the chance of survival is low here, because around 81% of the Pclass3 passengers from S didn't survive.
4. At Port Q, almost 95% of the passengers were from Pclass3.

In [None]:
sns.catplot(data = data, x = 'Pclass', y = 'Survived', hue = 'Sex', col = 'Embarked', kind = 'point') # catplot replaces the older factorplot

### Observations:

1. The survival chances are almost 1 for women from Pclass1 and Pclass2.

2. Port S looks to be very unlucky for Pclass 3 passengers, as the survival rate for both men and women is very low. (Money matters.)

3. Port Q looks to be the unluckiest for men, as almost all of them were from Pclass 3.

Filling Embarked NaN

In [None]:
data['Embarked'].isnull().sum()

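We know that most passengers boarded from Port S, so we replace the NaN values with S.

In [None]:
data['Embarked'] = data['Embarked'].fillna('S') # fill the missing ports with the most common one
data['Embarked'].isnull().sum()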

### SibSp -> Discrete Feature
This feature represents whether a person is alone or with family members.

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife

In [None]:
data['SibSp'].isnull().sum()

In [None]:
pd.crosstab(data['SibSp'],data['Survived']).style.background_gradient(cmap = 'Blues')

In [None]:
plt.figure(figsize = (10,5))
sns.barplot(x = 'SibSp', y = 'Survived', data = data)
plt.title('SibSp vs Survived')
plt.show()

In [None]:
sns.catplot(x = 'SibSp', y = 'Survived', data = data, kind = 'point') # catplot replaces the older factorplot
fig = plt.gcf()
fig.set_size_inches(10,5)
plt.title('SibSp vs Survived')
plt.show()

In [None]:
pd.crosstab(data['SibSp'],data['Pclass']).style.background_gradient(cmap = 'Blues')

### Observations:
The bar plot and the factor plot show that a passenger travelling alone, with no siblings on board, has a 34.5% survival rate. The rate roughly decreases as the number of siblings increases. This makes sense: if I have family on board, I will try to save them before saving myself. Surprisingly, the survival rate for families with 5-8 members is 0%. Could the reason be Pclass?

The reason is Pclass. The crosstab shows that passengers with SibSp > 3 were all in Pclass 3. It is evident that all the large families in Pclass3 died.

### Parch

In [None]:
pd.crosstab(data['Parch'], data['Pclass']).style.background_gradient(cmap = 'Blues')

In [None]:
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot(data = data,x = 'Parch',y = 'Survived',ax=ax[0])
ax[0].set_title('Parch vs Survived')
sns.pointplot(data = data, x = 'Parch',y = 'Survived',ax=ax[1]) # axes-level point plot, so it can draw on ax[1]
ax[1].set_title('Parch vs Survived')
plt.show()

### Observations:
Here too the results are quite similar. Passengers with family on board have a greater chance of survival. It reduces, however, as the number goes up.

The chances of survival are good for somebody who has 1-3 parents or children on the ship. Being alone also proves to be fatal, and the chances of survival decrease when somebody has more than 4 family members on the ship.

### Fare -> Continuous Feature

In [None]:
pd.DataFrame(data['Fare'].describe()).drop('count')

Min is 0.

In [None]:
f, ax = plt.subplots(1,3,figsize = (20,8))
for i in range(3):
    sns.distplot(data[data['Pclass'] == i+1]['Fare'], ax = ax[i])
    title = 'Fares in Pclass' + " " + str(i + 1)
    ax[i].set_title(title)

The fares of passengers in Pclass1 appear widely spread, and this spread shrinks as the class goes down.
As Fare is also continuous, we can convert it into discrete values by binning.
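
Binning can be done in more than one way, and this section does not say which one is intended for Fare. The sketch below compares two common options on invented fares (not taken from train.csv): `pd.cut` makes bins of equal width, while `pd.qcut` makes bins with roughly equal numbers of rows. The choice of 4 bins is an assumption for the example.

```python
import pandas as pd

# Invented fares for illustration only.
fare = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.3])

# Equal-width bins: each bin spans the same fare range.
width_bins = pd.cut(fare, bins=4, labels=False)

# Quantile bins: each bin holds roughly the same number of passengers.
quant_bins = pd.qcut(fare, q=4, labels=False)

print(pd.DataFrame({'Fare': fare, 'cut': width_bins, 'qcut': quant_bins}))
```

With skewed data such as fares, equal-width bins put most rows in the lowest bin, whereas quantile bins stay balanced.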

## Observations in a Nutshell for all features:

Sex : The chance of survival for women is high compared to men.

Pclass : There is a visible trend that being a 1st class passenger gives you a better chance of survival. The survival rate for Pclass3 is very low. For women, the chance of survival in Pclass1 is almost 1 and is high too for those in Pclass2. Money wins!

Age : Children younger than 5-10 years have a high chance of survival. Many passengers aged 15 to 35 died.

Embarked : This is a very interesting feature. The chances of survival at C look better, even though the majority of Pclass1 passengers boarded at S. Passengers at Q were almost all from Pclass 3.

Parch & SibSp : Having 1-2 siblings or a spouse on board, or 1-3 parents, gives a greater chance of survival than being alone or travelling with a large family.

## Correlation Between the Features

In [None]:
sns.heatmap(data.select_dtypes(include = 'number').corr(), annot = True, cmap = 'Blues', linewidths = 0.2) # numeric columns only
fig = plt.gcf()
fig.set_size_inches(10,8)
plt.show()

### Interpreting The Heatmap

The first thing to note is that only the numeric features are compared, since we cannot correlate letters or strings. Before reading the plot, let us see what exactly correlation is.

**Positive Correlation**

If an increase in feature A comes with an increase in feature B, then they are positively correlated. A value of 1 means perfect positive correlation.

**Negative Correlation**

If an increase in feature A comes with a decrease in feature B, then they are negatively correlated. A value of -1 means perfect negative correlation.
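
Both extremes can be checked on toy arrays. `DataFrame.corr` computes the Pearson coefficient by default, and the sketch below writes it out by hand; the arrays `a`, `b`, `c` and the helper `pearson` are made up for this illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2 * a + 1        # rises exactly in step with a
c = 10 - 3 * a       # falls exactly as a rises

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(pearson(a, b))   # close to  1
print(pearson(a, c))   # close to -1
```

The scale and offset of `b` and `c` do not matter: any exact straight-line relationship gives a coefficient of 1 or -1.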

    
    
Now let's say that two features are highly or perfectly correlated, so an increase in one comes with an increase in the other. This means that both features contain highly similar information, with little or no variance between them. This is known as multicollinearity, as both of them contain almost the same information.

So should we use both of them when one of them is redundant? While building or training models, we should try to eliminate redundant features, as this reduces training time, among other advantages.

From the heatmap above, we can see that the features are not strongly correlated. The highest correlation is between SibSp and Parch, at 0.41. So we can carry on with all the features.