### Task: Data Cleaning and Analysis on the "Titanic" Dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
titanic=pd.read_csv("titanic_train.csv") 
#Loading dataset from the csv file into "Pandas Dataframe"
titanic.head() #Head shows the first 5 entries of the dataframe

## Performing Data Cleaning and Analysis
#### 1. Understanding meaning of each column:
<br>Data Dictionary:
<br>**Variable - Description**
1. Survived - Survived (1) or died (0)
2. Pclass - Passenger’s class (1 = 1st, 2 = 2nd, 3 = 3rd)
3. Name - Passenger’s name
4. Sex - Passenger’s sex
5. Age - Passenger’s age
6. SibSp - Number of siblings/spouses aboard
7. Parch - Number of parents/children aboard (some children travelled only with a nanny, so Parch = 0 for them)
8. Ticket - Ticket number
9. Fare - Fare
10. Cabin - Cabin
11. Embarked - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

#### 2. Analysing which columns are completely useless in predicting the survival and deleting them
**Note** - Don't delete a column just because it does not look useful. Our focus is not on deleting columns but on analysing how each column affects the result or the prediction. Based on that analysis we decide whether to keep the column, delete it, or fill its null values, and if we fill them, with what values.
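
One way to check how a column affects the result is to compare the survival rate across its values. The following is a minimal sketch on made-up toy data, not the real dataset; the helper name `survival_rate_by` is hypothetical and only used for illustration.

```python
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Pclass":   [1, 3, 1, 3, 2, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
})

def survival_rate_by(df, column):
    """Hypothetical helper: mean of Survived (the survival rate) per value of `column`."""
    return df.groupby(column)["Survived"].mean()

rate_by_class = survival_rate_by(toy, "Pclass")
print(rate_by_class)
```

If the rate is roughly the same for every value, the column probably carries little information about survival. If it varies a lot, the column is worth keeping.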



In [None]:
# A passenger's name is treated here as having no bearing on survival, so we delete the column
# It helps to display the dataframe after each deletion to confirm the column is gone
del titanic["Name"]
titanic.head()

In [None]:
# PassengerId is only a row identifier, so it is deleted as well
del titanic["PassengerId"]
titanic.head()

In [None]:
del titanic["Ticket"]
titanic.head()

In [None]:
# Number of rows and columns in the dataframe
titanic.shape

In [None]:
# Count the null entries in each column.
# isnull() returns True for every null entry, and sum() counts
# the True values (True counts as 1), giving the nulls per column.
titanic.isnull().sum()

In [None]:
# Cabin is null for most rows (see the counts above), so the column is deleted
del titanic["Cabin"]
titanic.head()

In [None]:
# Summary statistics of the dataframe, to help analyse which
# features or columns affect the survival rate and which are not useful
titanic.describe()

In [None]:
del titanic["Fare"]
titanic.head()

#### We want to check whether the "Embarked" column is important for the analysis, that is, whether a passenger's survival depends on the Embarked value or not.
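
A cross-tabulation is one compact way to see survived and not-survived counts for every port at once. The sketch below uses made-up toy data, so its numbers are illustrative only and are not the counts from the real dataset.

```python
import pandas as pd

# Toy data (made up) with the same two columns used in this check.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q", "S"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Rows are ports, columns are Survived values (0 and 1), cells are passenger counts.
counts = pd.crosstab(toy["Embarked"], toy["Survived"])

# Survival rate per port: the mean of the 0/1 Survived column within each group.
rates = toy.groupby("Embarked")["Survived"].mean()

print(counts)
print(rates)
```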

In [None]:
# Finding the number of people who have survived 
# given that they have embarked or boarded from a particular port

survivedS = titanic[(titanic.Embarked == 'Q') & (titanic.Survived == 1)].shape[0]
survivedS

#similarly checked for C and S too and the results were
# Survived for C=93
# Survived for S=217

In [None]:
# Finding the number of people who have not survived 
# given that they have embarked or boarded from a particular port


survivedNS = titanic[(titanic.Embarked == "Q") & (titanic.Survived == 0)].shape[0]
survivedNS

#similarly checked for C and S too and the results were
# Not Survived for C = 75
# Not Survived for S = 427

The survival rate differs noticeably depending on the port at which the passengers boarded the ship.
<br>
So we cannot delete the whole Embarked column (it is useful).
<br>
The Embarked column has two null values out of 891 rows, so deleting those two rows will not noticeably affect the result. Rather than trying to fill the null values with some values, we can simply remove the rows that contain them.
<br>
Perform **dropna()** on the dataframe only once these are the only null values left, otherwise rows that are null in other columns (such as Age) would be removed too:
<br>
`titanic.dropna(inplace=True)`
<br>
Note - `inplace=True` makes the change in the same dataframe rather than in a copy.
<br>
`titanic.shape`
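
An alternative that does not depend on the order of the steps is the `subset` parameter of `dropna()`, which restricts the null check to the named columns. This is a minimal sketch on made-up toy data, not the approach the notebook itself takes.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one row has a null Age, another has a null Embarked.
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 30.0, 41.0],
    "Embarked": ["S", "C", None, "Q"],
})

# Drop only the rows where Embarked is null; the null Age row is kept.
only_embarked = toy.dropna(subset=["Embarked"])

# Plain dropna() removes every row that has a null in any column.
everything = toy.dropna()

print(only_embarked.shape, everything.shape)
```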

#### Changing the "male"/"female" string values to numeric values: male = 1 and female = 2
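
As an aside, the same encoding can be sketched with a dictionary lookup through `Series.map`. The toy series and the `codes` dictionary below are illustrative assumptions. One difference from an if/else function is that any value missing from the dictionary becomes NaN instead of 2.

```python
import pandas as pd

# Toy series (made up) standing in for the Sex column.
sex = pd.Series(["male", "female", "female", "male"])

# Lookup table for the encoding described above: male = 1, female = 2.
codes = {"male": 1, "female": 2}

# map() replaces each value by its dictionary entry.
gender = sex.map(codes)
print(gender.tolist())
```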

In [None]:
def getNumber(sex):
    # Return 1 for "male" and 2 for any other value
    if sex == "male":
        return 1
    else:
        return 2
# Create a new column called "gender" and fill it with
# 1 or 2 based on the value of the Sex column
titanic["gender"] = titanic["Sex"].apply(getNumber)
titanic.head()

In [None]:
# Delete the Sex column, since the gender column now replaces it
del titanic["Sex"]
titanic.head()

#### Drawing a pie chart for number of males and females aboard

In [None]:
import matplotlib.pyplot as plt

# Count the rows where gender is 1 (male), and similarly 2 (female)
males = (titanic['gender'] == 1).sum()
females = (titanic['gender'] == 2).sum()
print(males)
print(females)
p = [males, females]
plt.pie(p,    # slice sizes
       labels = ['Male', 'Female'], # corresponding labels
       colors = ['blue', 'red'],   # corresponding colors
       explode = (0.15, 0),    # how far each slice is offset from the centre
       startangle = 0)  # angle at which the first slice starts
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()

#### Fill the null values of the Age column: use the mean age of the survivors for passengers who survived, and the mean age of the non-survivors for passengers who did not survive
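
The same group-wise fill can also be sketched in one step with `groupby(...).transform("mean")`, which gives every row the mean age of its own Survived group. The data below is made up for illustration, and this is an alternative to the step-by-step approach used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing age in each Survived group.
toy = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0],
    "Age":      [20.0, 30.0, np.nan, 40.0, 60.0, np.nan],
})

# Mean age of each row's own group; nulls are ignored when computing the mean.
group_mean = toy.groupby("Survived")["Age"].transform("mean")

# Fill only the null ages with the matching group mean.
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```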

In [None]:
#finding mean survived age
meanS= titanic[titanic.Survived==1].Age.mean()
meanS

#### Creating a new "age" column with np.where: where the condition is True (Age is null and the passenger survived) the given value (here meanS) is used; otherwise the value is copied unchanged from the "Age" column of the dataset
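
To illustrate how `np.where(condition, x, y)` picks values, here is a small sketch on made-up data. The fill value `28.0` is an arbitrary placeholder, not the real mean. Note the parentheses around the comparison: `&` binds more tightly than `==`, so writing the comparison without them relies on operator precedence.

```python
import numpy as np
import pandas as pd

# Toy columns (made up) standing in for Age and Survived.
age = pd.Series([22.0, np.nan, np.nan, 35.0])
survived = pd.Series([1, 1, 0, 0])

# True only where the age is missing AND the passenger survived.
mask = age.isnull() & (survived == 1)

# Take the placeholder 28.0 where mask is True, otherwise keep the original age.
filled = np.where(mask, 28.0, age)
print(filled)
```

The null age of the passenger who did not survive is left untouched, which is why a second fill step is still needed for that group.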

In [None]:
# Parentheses around the comparison are needed because & binds more tightly than ==
titanic["age"] = np.where(pd.isnull(titanic.Age) & (titanic["Survived"] == 1), meanS, titanic["Age"])
titanic

In [None]:
# Finding the mean age of "Not Survived" people
meanNS=titanic[titanic.Survived==0].Age.mean()
meanNS

Now the "age" column contains null entries only in rows where Survived is equal to zero.
<br>
We fill those null values in one go with meanNS.

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on the column
titanic["age"] = titanic["age"].fillna(meanNS)
titanic

In [None]:
# The original Age column is no longer needed, so we can safely delete it
del titanic["Age"]
titanic

In [None]:
# Rename the "age" and "gender" columns back to "Age" and "Sex"
titanic.rename(columns={'age': 'Age', 'gender': 'Sex'}, inplace=True)
titanic.head()

#### Now removing the two rows where the Embarked value was null (discussed earlier)

In [None]:
titanic.dropna(inplace=True)
titanic.head()