Session on 13/02/2021

#### Focus: predict whether a Titanic passenger survived based on attributes such as gender, title, age and more.

### Exploratory Data Analysis
* Exploratory Data Analysis (EDA) is a method used to analyze and summarize datasets.
* Most of the EDA techniques used here rely on Pandas, Seaborn and Matplotlib.

In [None]:
### Titanic Dataset
* It is one of the most popular datasets used for understanding machine learning basics.
* It contains information about passengers aboard the RMS Titanic, which sank on its maiden voyage in 1912.
* This dataset can be used to predict whether a given passenger survived or not.
* https://www.kaggle.com/c/titanic/data

In [None]:
#importing all the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#importing data
#read_csv function is used to read titanic.csv file
data=pd.read_csv('titanic.csv')

In [None]:
#Read the first five records
data.head()

In [None]:
#Print the last five records
data.tail()

### About the Dataset: the columns and the values they take
* There are 12 columns in the dataset. Target feature = Survived. All others are descriptive features.
* PassengerId - Id of passenger, Survived - Whether survived or not (0 = no, 1 = yes), Pclass - Ticket class: 1st, 2nd or 3rd (1 is highest), Name - Passenger name, Sex - Gender, Age - Age in years
* SibSp - Number of siblings/spouses on board, Parch - Number of parents/children on board, Ticket - Ticket number, Fare - Fare paid, Cabin - Cabin number
* Embarked - Port of embarkation: S - Southampton, C - Cherbourg, Q - Queenstown

In [None]:
#reading data using different parameter values
#header=None tells pandas the file has no header row, so the column names end up as the first data row
df=pd.read_csv('titanic.csv',header=None)
df.head()

In [None]:
#header=0 is the default: row 0 of the file is used as the column names
df=pd.read_csv('titanic.csv',header=0)
df.head()
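The effect of the `header` parameter is easiest to see on a tiny file. The sketch below uses a made-up three-line CSV held in memory (an assumption, not the real `titanic.csv`) and compares the two settings.

```python
import io
import pandas as pd

# Hypothetical three-line CSV standing in for titanic.csv
csv_text = "PassengerId,Survived,Sex\n1,0,male\n2,1,female\n"

with_header = pd.read_csv(io.StringIO(csv_text), header=0)
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)         # two data rows, columns named from the file
print(no_header.shape)           # three rows: the header line is now data
print(list(no_header.columns))   # pandas falls back to integer column labels
```

With `header=None` every column also becomes `object`, because the column names are mixed in with the values.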

## Data shape, Data types and NaN values

In [None]:
#shape gives the number of rows and columns; we have 891 rows and 12 columns in our dataset
data.shape

In [None]:
#dtypes is a DataFrame attribute that gives the data type of each column
data.dtypes

* The data types are int64, float64 and object. The first two are integers and floats respectively, while the object columns here hold strings.

In [None]:
#Get the data type and the number of non-null values
#of each column in the DataFrame
data.info()

* Observe that there are only 714 non-null values for the 'Age' column in a DataFrame with 891 rows. So there are 177 null or missing values.
* Likewise Cabin has 891-204=687 null values. Embarked has 2 null values.
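The subtraction above (rows minus non-null count) can be checked against a direct count of nulls. This is a minimal sketch on a made-up five-row frame called `toy`, not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, not the real Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05],
})

non_null = toy.count()                      # what info() lists as the non-null count
nulls_by_subtraction = len(toy) - non_null  # rows minus non-null values
nulls_direct = toy.isnull().sum()           # counts the nulls directly
null_pct = toy.isnull().mean() * 100        # share of nulls per column, in percent

print(nulls_direct)
print(null_pct)
```

Both routes give the same counts, so either can be used to cross-check the output of `info()`.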

In [None]:
#Checking the total null values in our features
data.isnull().sum()

In [None]:
#visualizing the null values in the dataset using a heatmap (each missing cell is highlighted)
sns.heatmap(data.isnull(), cbar=False, yticklabels=False)

In [None]:
#let's check the rows where the Embarked value is NaN
#Embarked is the port where the passenger boarded
data[data['Embarked'].isnull()]

In [None]:
#finding the percentage of null values
print('Age :',data.Age.isnull().sum()/len(data)*100)


* These missing values will cause problems in the later steps, so we need to handle them.

## Missing value imputation
* Imputation replaces missing feature values with a plausible estimated value based on the feature values that are present.
* The most common approach to imputation is to replace missing values for a feature with a measure of the central tendency of that feature.
* We would be reluctant to use imputation on features missing more than 30% of their values and would strongly recommend against imputation on features missing more than 50% of their values.

* Replace only the missing Age values, not every value in the Age column.
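The 30% and 50% guidelines can be turned into a small check. The helper `imputation_advice` and the frame `toy` below are hypothetical names introduced only for illustration, and the values are made up.

```python
import numpy as np
import pandas as pd

def imputation_advice(series, caution=0.30, avoid=0.50):
    """Classify a column by its share of missing values (hypothetical helper)."""
    missing = series.isnull().mean()
    if missing > avoid:
        return "avoid imputation"
    if missing > caution:
        return "impute with caution"
    return "impute"

# Toy frame with made-up values: 20%, 40% and 80% missing respectively
toy = pd.DataFrame({
    "Age": [22, np.nan, 26, 35, 54, 2, 27, 14, 4, np.nan],
    "Deck": ["C", np.nan, "E", np.nan, "B", np.nan, "D", np.nan, "A", "F"],
    "Cabin": [np.nan] * 8 + ["C85", "E46"],
})

advice = {col: imputation_advice(toy[col]) for col in toy.columns}
print(advice)
```

By the same rule, Age in the Titanic data (177 of 891 missing, about 20%) is a candidate for imputation, while Cabin (204 non-null out of 891, about 77% missing) is not.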

In [None]:
#let's check the distribution of the Age feature
#Why do this? The distribution should stay roughly the same after filling the missing values
sns.histplot(data['Age'].dropna(), kde=True)
plt.show()

In [None]:
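#Age feature is not highly skewed so we will impute missing values with the mean
#the result goes into a new Age_mean column; the original Age column is left untouched
data['Age_mean'] = data['Age'].fillna(data['Age'].mean())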

In [None]:
data.head()

In [None]:
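# see the distribution using sns (histplot replaces the deprecated distplot)
sns.histplot(data['Age'].dropna(), kde=True, color='C0')   # original Age, missing values left out
sns.histplot(data['Age_mean'], kde=True, color='C1')       # Age after mean imputation
plt.show()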

In [None]:
#checking the distribution of Age and Age_mean after imputation
#using DataFrame's plot function
data[['Age', 'Age_mean']].plot(kind='hist', bins=20, alpha=0.5)

* Since the two distributions are similar, filling the missing values with the mean is acceptable.
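A numeric check can complement the plots. The sketch below uses a made-up `age` Series (an assumption, not the Titanic column) to show that mean imputation leaves the mean and the observed values unchanged but narrows the spread, which is why the comparison is worth making.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing values
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0], name="Age")
age_mean = age.fillna(age.mean())   # only the NaN entries are replaced

print("mean before/after:", age.mean(), age_mean.mean())
print("std before/after :", age.std(), age_mean.std())
```

The more values are imputed, the stronger the shrinkage of the standard deviation, so the check matters more for columns with many gaps.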

In [None]:
#The Cabin feature records which cabin each passenger was given. A null value means either that
#no cabin was allocated or that the record was lost, so we impute the missing Cabin values
#with the placeholder 'NA'

In [None]:
data['Cabin'] = data['Cabin'].fillna('NA')   # assign back instead of inplace=True on a column

In [None]:
#Cabin has a large share of null values, so we could instead drop the 'Cabin' column
#We assume here that it adds little information about the outcome "Survived"
#To do so we would use the following code snippet
# data.drop('Cabin',axis=1,inplace=True)

In [None]:
#checking the value counts of each category in Embarked to find mode
print('Value counts of Embarked:\n', data.Embarked.value_counts())
print(data['Embarked'].unique())    #gives the categories in a categorical variable

In [None]:
#we can either impute Embarked with its mode, as it is a categorical variable, or
#drop the rows with missing values, as there are very few of them. For now we will use the mode
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
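Mode imputation for a categorical column can be sketched on a small made-up Series (the values below are assumptions, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up ports of embarkation with two missing entries
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

mode_value = embarked.mode()[0]            # mode() ignores NaN and returns a Series
embarked_filled = embarked.fillna(mode_value)

print(mode_value)
print(embarked_filled.value_counts())
```

`mode()` returns a Series because there can be ties, so `[0]` picks the first of the most frequent values.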


In [None]:
#Once again check the total null values in our features after imputation
data.isnull().sum()
# Age keeps its missing values; the imputed copy, Age_mean, has none

In [None]:
#data after imputation
data.head()

## Dropping the unnecessary features

In [None]:
#drop the Age variable, since Age_mean now holds the ages after imputation
#also drop 'PassengerId', 'Name', 'Ticket' and 'Cabin'

In [None]:
data.head()

## Treating outliers
* An outlier is a data point that differs significantly from other observations.
* An outlier may be due to variability in the measurement or it may indicate experimental error.
* The latter are sometimes excluded from the data set as they can cause serious problems in statistical analyses.
* We can either drop or treat the outliers. We will discuss both techniques here.

In [None]:
#making the copy of data to showcase how to drop outlier values
df=data.copy()

In [None]:
# A boxplot helps to identify the outliers in any feature
#outliers in Age (using the imputed Age_mean column)
sns.boxplot(x=df['Age_mean'])
plt.show()

#### Removing outliers

In [None]:
#calculating IQR (inter-quartile range) of the numeric columns
num = df.select_dtypes(include='number')   # quantiles are only defined for numeric columns
Q1 = num.quantile(0.25)      # lower quartile
Q3 = num.quantile(0.75)      # upper quartile
IQR = Q3 - Q1                # inter-quartile range
IQR

In [None]:
#removing outliers in the dataset
#Retain the rows of df that have no outlier in any numeric column
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]

In [None]:
#shape of data after removing outliers
df.shape
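The same IQR rule can be followed step by step on a small frame. The frame `toy` below holds made-up values (an assumption for illustration) with one obvious Age outlier and one text column that has to be left out of the quantile calculation.

```python
import pandas as pd

# Made-up values: Age 80 is far from the rest, Sex is non-numeric
toy = pd.DataFrame({
    "Age": [22, 25, 27, 30, 31, 33, 35, 80],
    "Fare": [7.0, 8.0, 8.5, 9.0, 10.0, 11.0, 12.0, 13.0],
    "Sex": ["m", "f", "m", "f", "m", "f", "m", "f"],
})

num = toy.select_dtypes(include="number")     # numeric columns only
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1

# A value is an outlier if it lies more than 1.5 * IQR outside the quartiles
is_outlier = (num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)
kept = toy[~is_outlier.any(axis=1)]           # drop rows with any outlier

print(kept.shape)
```

Note that a row is dropped if any numeric column flags it, so columns with a very small IQR can remove many rows.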

#### Treating outliers

In [None]:
IQR=data['Age_mean'].quantile(0.75)-data['Age_mean'].quantile(0.25)

In [None]:
lower_bridge=data['Age_mean'].quantile(0.25)-(IQR*1.5)
upper_bridge=data['Age_mean'].quantile(0.75)+(IQR*1.5)
print(lower_bridge)
print(upper_bridge)

In [None]:
# Push the outliers to the threshold values computed above
data.loc[data['Age_mean']>upper_bridge,'Age_mean']=upper_bridge
data.loc[data['Age_mean']<lower_bridge,'Age_mean']=lower_bridge
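Capping values at the IQR thresholds can also be written with `Series.clip`. This is a sketch on a made-up `age` Series, not the real `Age_mean` column, and the names `lower` and `upper` are chosen only for this example.

```python
import pandas as pd

# Made-up ages with one very low and one very high value
age = pd.Series([1.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 60.0], name="Age_mean")

q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are pulled back to the nearest threshold
capped = age.clip(lower=lower, upper=upper)

print(lower, upper)
print(capped.tolist())
```

Unlike dropping rows, capping keeps every passenger in the data and only changes the extreme values.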

## Exploratory Data Analysis
* This covers the univariate and bivariate analysis we discussed.
* We analyse the relation between the feature variables and the target variable.
* We use plots and pandas.

In [None]:
#Statistical summary
data.describe()

In [None]:
* Inference: In the training set:
* About 38.4% of the passengers survived.
* More passengers travelled in 3rd class than in any other class.
* 50% of passengers were between the ages of 20 and 38.
* Since the survival rate is 0.38, a submission that predicts every passenger perished would still reach an accuracy of about 62%.
* So accuracy cannot be the only measure of how good the model is.
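The accuracy argument can be reproduced in a few lines. The labels below are made up (38 survivors out of 100, roughly matching the 0.38 rate), and the "model" simply predicts that nobody survived.

```python
import pandas as pd

# Made-up labels: 38 survivors and 62 non-survivors
y_true = pd.Series([1] * 38 + [0] * 62, name="Survived")
y_pred = pd.Series([0] * len(y_true))          # predict "perished" for everyone

accuracy = (y_true == y_pred).mean()
# Share of actual survivors that the model identifies (recall for class 1)
recall_survivors = ((y_true == 1) & (y_pred == 1)).sum() / (y_true == 1).sum()

print("accuracy:", accuracy)
print("recall for survivors:", recall_survivors)
```

The accuracy looks acceptable even though the model never finds a single survivor, which is why a second measure such as recall is useful.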

In [None]:
#to include categorical variable also
data.describe(include='all')

### Analyse the relation between the target variable and the categorical variables in the dataset
### The categorical variables considered are Sex, Pclass and Embarked

In [None]:
### Q. Does the target variable have any relation to Gender?
###  Analyse the number of survivors as per gender

In [None]:
# Count of males and females in the dataset using pandas
data['Sex'].value_counts()

In [None]:
# Number of males and females in the dataset using plot
# Shows how males and females are distributed over the dataset
sns.countplot(x='Sex', data=data);
plt.show()

In [None]:
#Use seaborn to build bar plots of the Titanic dataset feature 'Survived' split (faceted) over the feature 'Sex'
sns.catplot(x='Survived', col='Sex', kind='count', data=data)   #catplot() was named factorplot() in older seaborn versions
plt.show()

* Women were more likely to survive than men.

In [None]:
# use pandas to figure out how many women and how many men survived
data.groupby('Sex')['Survived'].sum()

In [None]:
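# Proportion of men and women who survived
propW = data[data.Sex == 'female'].Survived.sum()/data[data.Sex == 'female'].Survived.count()
propM = data[data.Sex == 'male'].Survived.sum()/data[data.Sex == 'male'].Survived.count()

print("\n Proportion of women survived = ", propW)
print(" Proportion of men survived = ", propM)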
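Because Survived is coded 0/1, the proportion of survivors in each group is just the group mean. The sketch below uses a made-up ten-row frame named `toy` (an assumption, not the real data) to compare counts and proportions.

```python
import pandas as pd

# Made-up passengers: 4 women (3 survived) and 6 men (1 survived)
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 6,
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

survived_count = toy.groupby("Sex")["Survived"].sum()    # number of survivors
survival_rate = toy.groupby("Sex")["Survived"].mean()    # share of survivors

print(survived_count)
print(survival_rate)
```

Counts alone can mislead when the groups differ in size, so the rate is the safer number to compare.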


In [None]:
* Q: Number of male and female survivors by port of embarkation

### Build bar plots of the Titanic dataset feature 'Survived' split (faceted) over the feature 'Pclass'.

* Passengers who travelled in first class were more likely to survive.
* On the other hand, passengers travelling in third class were less likely to survive.
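The class comparison can also be read from a normalised cross-tabulation. This is a sketch on made-up data (the frame `toy` is an assumption), with each row of the result summing to 1.

```python
import pandas as pd

# Made-up passengers: four per ticket class
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

# normalize="index" turns the counts in each Pclass row into proportions
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(rates)
```

The column labelled 1 is the survival rate per class, which can be compared directly across rows.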

In [None]:
# Plot feature 'Survived' split (faceted) over the feature 'Embarked'.
sns.catplot(x='Survived', col='Embarked', kind='count', data=data)   #factorplot()
plt.show()

In [None]:
#Number of survivors per port of embarkation
data.groupby(['Embarked'])['Survived'].sum()
#Q: Proportion of people who survived per port of embarkation

* Passengers who embarked at Southampton were less likely to survive.

In [None]:
# Proportion of males and females who survived, per port of embarkation
data.groupby(['Sex', 'Embarked'])['Survived'].mean()   # mean of a 0/1 column is a proportion; sum() would give counts
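Grouping by two features gives one survival rate per combination, and `unstack` lays the result out as a table. The frame `toy` below is made up for illustration.

```python
import pandas as pd

# Made-up passengers: two per combination of Sex and Embarked
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Embarked": ["S", "S", "C", "C", "S", "S", "C", "C"],
    "Survived": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Rows are Sex, columns are Embarked, cells are survival proportions
rate_table = toy.groupby(["Sex", "Embarked"])["Survived"].mean().unstack()

print(rate_table)
```

The table form makes it easier to see whether a port effect holds for both sexes or only for one.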

### Analyse the relation between the target variable and the numeric variables in the dataset
### The numeric variables considered are Fare and Age_mean

In [None]:
# Distribution of Fare
sns.histplot(data['Fare'], bins=40)
plt.show()

* Most passengers paid less than 100 to travel on the Titanic.

In [None]:
#plot the column 'Fare' for each value of 'Survived' on the same plot.
data.groupby('Survived')['Fare'].hist(alpha=0.6)
plt.show()

* People who paid more had a higher chance of surviving.

In [None]:
# histogram plot of the 'Age_mean' column of data
sns.histplot(data.Age_mean, bins=20, kde=False)   # histplot replaces the deprecated distplot
plt.show()

* Most passengers, and so most survivors, are in the age range 15 to 40. The histogram alone does not show whether young people were more likely to survive; that needs the plot split by 'Survived'.

In [None]:
# Use seaborn to plot a scatter plot of 'Age_mean' against 'Fare', colored by 'Survived'
sns.scatterplot(x='Age_mean', y='Fare', hue='Survived', data=data, alpha=0.5)
plt.show()

* People who survived either paid quite a bit for their ticket or they were young.

In [None]:
# Find the oldest person who survived
#Find the youngest person who survived
#Find the average age of people who survived
survivors = data[data['Survived'] == 1]   # restrict to passengers who survived
print('Oldest survivor is :', survivors['Age_mean'].max())
print('Youngest survivor is:', survivors['Age_mean'].min())
print('Average age of survivors is :', survivors['Age_mean'].mean())

In [None]:
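# correlations are only defined for numeric columns; errors='ignore' covers the case where Age was already dropped
corr = data.drop(columns='Age', errors='ignore').select_dtypes(include='number').corr()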

In [None]:
# Find the correlations and draw the heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
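The correlation matrix itself can be inspected before plotting. The sketch below uses a made-up frame `toy` (an assumption, not the Titanic data) and leaves out the text column, since correlation is only defined for numeric features.

```python
import pandas as pd

# Made-up passengers with one non-numeric column
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Pclass":   [1, 3, 1, 3, 2, 3],
    "Fare":     [80.0, 7.0, 60.0, 8.0, 20.0, 9.0],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()

print(corr.round(2))
```

A negative value for Survived against Pclass means that survival goes down as the class number goes up, that is, from 1st to 3rd class.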