In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('titanic-data.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [4]:
df.describe()

      PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count       891.0       891.0       891.0       714.0       891.0       891.0       891.0
mean        446.0    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std    257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min           1.0         0.0         1.0        0.42         0.0         0.0         0.0
25%         223.5         0.0         2.0      20.125         0.0         0.0      7.9104
50%         446.0         0.0         3.0        28.0         0.0         0.0     14.4542
75%         668.5         1.0         3.0        38.0         1.0         0.0        31.0
max         891.0         1.0         3.0        80.0         8.0         6.0    512.3292


Data Dictionary
Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex	
Age	Age in years	
sibsp	# of siblings / spouses aboard the Titanic	
parch	# of parents / children aboard the Titanic	
ticket	Ticket number	
fare	Passenger fare	
cabin	Cabin number	
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

> The dataset has the following missing-value issues that need to be explored:
1. Age has many missing values (714 of 891 rows are non-null).
2. Cabin has many missing values. Since far more values are missing than available (only 204 of 891 are non-null), I am ignoring this column in my analysis and concentrating on the other columns.
3. Embarked has 2 missing values. Since only 2 are missing, let's address this first. This is the passenger's port of embarkation, so I am filling the gaps with the most common of the 3 options in the dataset.
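
> The counts in the list above can also be read off programmatically instead of from the df.info() output. A minimal sketch on a made-up toy frame (the `toy` frame and its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()
# Share of rows that are missing per column (mean of a boolean mask)
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

> Applied to the real data, the same two calls on `df` would show at a glance which columns are worth repairing and which are mostly empty.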

In [5]:
# let us get the counts of each of the 3 values
df.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [6]:
# Since S = Southampton is the most frequent port of embarkation, fill the missing values with it
# (assigning the result back avoids relying on inplace=True on a column selection)
df['Embarked'] = df['Embarked'].fillna('S')
# check for null values in the embarked column now
df.Embarked.isnull().sum()

0
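
> Rather than hardcoding 'S', the most common value could also be looked up with mode(). A minimal sketch on a made-up toy column (the `toy` frame and the `most_common` name are assumptions for illustration):

```python
import pandas as pd

# Toy column standing in for df['Embarked'] (values are made up for illustration)
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() ignores missing values and returns a Series (there can be ties);
# take the first entry as the fill value
most_common = toy["Embarked"].mode()[0]

# Fill the gaps and assign the result back to the column
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(most_common)
print(toy["Embarked"].isnull().sum())
```

> This keeps the fill value in step with the data if the file ever changes.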

> Now let's look at the Age column.

> There are a large number of null values in the Age column, and since we intend to use it in our analysis for building the Tableau story, we need to clean this up. We have the following options to deal with this:
1. Fill the missing values with the average age in the dataset.
2. Delete the rows where age is null, since those rows would otherwise misrepresent the data.
3. Use some sort of regression to predict the age.

> I am choosing to fill the missing ages with randomly generated values based on the mean and standard deviation of the known ages. Note that the code below uses np.random.randint, which draws integers uniformly from the range within one standard deviation of the mean, rather than from a normal distribution.
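
> For comparison, if normally distributed ages were wanted, the draw could look roughly like this. This is a sketch on a made-up toy column, not the approach used in this notebook; the seed and the clipping bounds of 0 and 80 are assumptions.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed is an arbitrary choice)
rng = np.random.default_rng(seed=0)

# Toy column standing in for df['Age'] (values are made up for illustration)
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 4.0, np.nan, 54.0]})

mean_age = toy["Age"].mean()
std_age = toy["Age"].std()
missing = toy["Age"].isnull()

# One normally distributed draw per missing row
draws = rng.normal(loc=mean_age, scale=std_age, size=missing.sum())
# A normal draw can be negative or implausibly large; clip to an assumed range and round
draws = np.clip(draws, 0, 80).round()

# Write the draws into the missing positions only
toy.loc[missing, "Age"] = draws
print(toy)
```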

In [7]:
# find out the number of missing values of age
df.Age.isnull().sum()

177

In [8]:
# create a dataframe from all the values of age that are missing
new_df= df[df['Age'].isnull()]

In [9]:
# draw random integer ages within one standard deviation of the mean age
AgeMean = int(df.Age.mean())
Agestd = int(df.Age.std())
# Three things need to be done: Age is a float column but should hold integers, we need
# one random integer per missing value (177 in total), and these must replace the nulls.
# Note: np.random.randint samples uniformly and excludes the upper bound.
ages = np.random.randint(AgeMean - Agestd, AgeMean + Agestd, size=len(new_df))
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that new_df['Age'] = ages raises on a slice of df
new_df = new_df.assign(Age=ages)
# ensure all null values are filled
new_df.Age.isnull().sum()

0

In [10]:
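# keep only the rows whose Age is known
df = df[np.isfinite(df['Age'])]
# append the rows with generated ages (the row order changes and the index is rebuilt)
df = pd.concat([df, new_df], ignore_index=True)
# sanity check: count remaining nulls (not displayed, as it is not the last expression)
df.Age.isnull().sum()
# store Age as integers
df.Age = df.Age.astype(int)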
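
> Splitting and re-concatenating moves the filled rows to the end of the frame. If the original row order matters, the same uniform fill could be written in place with a boolean mask and .loc. A minimal sketch on a made-up toy column (the `toy`, `missing`, and `fill` names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy column standing in for df['Age'] (values are made up for illustration)
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 35.0]})

age_mean = int(toy["Age"].mean())
age_std = int(toy["Age"].std())
missing = toy["Age"].isnull()

# Uniform integer draws within one standard deviation of the mean;
# randint excludes the upper bound
fill = np.random.randint(age_mean - age_std, age_mean + age_std, size=missing.sum())

# Write the draws into the missing positions only, keeping every row where it was
toy.loc[missing, "Age"] = fill
toy["Age"] = toy["Age"].astype(int)
print(toy)
```

> With this variant no intermediate frame is needed, and PassengerId stays aligned with the original row positions.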

sibsp # of siblings / spouses aboard the Titanic
parch # of parents / children aboard the Titanic
Let us combine these 2 columns into one: the number of relatives each person had aboard the Titanic.

In [13]:
df['Relatives']=df['SibSp']+df['Parch']

In [15]:
# let's export this data to a csv file
df.to_csv('Titanic_clean_data.csv')