# Titanic Dataset Analysis

This analysis explores the Titanic dataset, which records demographic and travel-related information for each passenger aboard the ship's ill-fated voyage. It aims to better understand the factors that may have influenced passenger survival, and how those insights could inform future cruise ship models for enhanced safety and passenger satisfaction.


## Dataset Features

The Titanic dataset consists of the following features:

- **PassengerId** (integer): A unique identifier for each passenger.
- **Survived** (integer): Survival status (0 = No, 1 = Yes).
- **Pclass** (integer): Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
- **Name** (string): Name of the passenger.
- **Sex** (string): Sex of the passenger (male or female).
- **Age** (float): Age in years.
- **SibSp** (integer): Number of siblings/spouses aboard.
- **Parch** (integer): Number of parents/children aboard.
- **Ticket** (string): Ticket number.
- **Fare** (float): Passenger fare.
- **Cabin** (string): Cabin number.
- **Embarked** (string): Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
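
As a quick, hedged illustration of how the types listed above could be checked once the file is loaded, the sketch below parses two rows taken from the dataset preview and compares each numeric column's dtype with the expected kind. The names `csv_text`, `sample_df`, and `expected_kinds` are hypothetical helpers introduced here, not part of the notebook.

```python
import io
import pandas as pd

# Two rows from the dataset preview; in the notebook the source would be titanic.csv
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

# Expected NumPy dtype kinds for the numeric features: 'i' = integer, 'f' = float
expected_kinds = {
    "PassengerId": "i", "Survived": "i", "Pclass": "i",
    "Age": "f", "SibSp": "i", "Parch": "i", "Fare": "f",
}

# Collect any column whose loaded dtype differs from the documented one
mismatches = {
    col: sample_df[col].dtype.kind
    for col, kind in expected_kinds.items()
    if sample_df[col].dtype.kind != kind
}
print(mismatches)
```

An empty dictionary means the numeric columns were parsed as described. 'Age' is a float column partly because it contains missing values, which pandas stores as NaN.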


In [2]:
import pandas as pd

# Load the dataset
titanic_df = pd.read_csv('titanic.csv')

# Display the first 10 rows of the dataset
titanic_df.head(10)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


## Checking for Missing Values

Next, we will examine the dataset to identify any missing values across different features. This is crucial for understanding the quality of our data and deciding on the preprocessing steps needed to ensure our analysis is based on accurate and complete information.


In [3]:
# Check for missing values
titanic_df.isnull().sum()


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
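
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up four-row frame (`toy_df` is a hypothetical stand-in for `titanic_df`); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() yields a boolean mask; its column-wise mean is the fraction missing
missing_share = toy_df.isnull().mean().mul(100).round(1)
print(missing_share)
```

Applied to the real data, the same expression would show what percentage of each column is missing, which helps decide between imputing a column and dropping it.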

## Handling Missing Values in Age

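To handle missing values in the 'Age' feature, we will fill them with the mean age of the passengers whose age is recorded. This simple approach retains all rows for analysis. It leaves the mean of 'Age' unchanged, but it reduces the column's variance and ignores any relationship between age and other features, so results that depend on age should be interpreted with that in mind.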


In [4]:
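# Fill missing values in 'Age' with the mean age.
# Assigning the result back avoids inplace=True on a column selection,
# which recent pandas versions warn about and which may leave titanic_df unchanged.
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())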

# Display the first 10 rows to see the result
titanic_df.head(10)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,29.699118,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
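
To see what mean imputation does to the shape of a column, here is a small sketch on a toy `ages` series (a few ages from the preview plus two artificial gaps, not the full dataset). It compares the spread before and after filling, and also computes a median fill as one common alternative. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy series: five observed ages and two missing values
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0])

# Fill the gaps with the mean (as in this notebook) and, for comparison, the median
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

# The mean is preserved by mean imputation, but the standard deviation shrinks
print("mean before / after:", ages.mean(), mean_filled.mean())
print("std before / after: ", ages.std(), mean_filled.std())
```

The standard deviation drops because every filled value sits exactly at the mean. Whether that matters depends on how heavily later steps rely on the spread of 'Age'.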


## Combining SibSp and Parch into Family_Count

Both 'SibSp' (siblings/spouses aboard) and 'Parch' (parents/children aboard) represent family members aboard the Titanic. Combining these into a single feature, 'Family_Count', simplifies our analysis and may reveal insights about the impact of family size on survival.


In [5]:
# Combine 'SibSp' and 'Parch' into 'Family_Count'
titanic_df['Family_Count'] = titanic_df['SibSp'] + titanic_df['Parch']

# Display the first 10 rows to see the result
titanic_df.head(10)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family_Count
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0
5,6,0,3,"Moran, Mr. James",male,29.699118,0,0,330877,8.4583,,Q,0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,4
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,2
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,1
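
One way the combined feature might be used is to compare survival rates across family sizes. The sketch below does this for the ten preview rows only, copied into a hypothetical `toy_df`, so the numbers illustrate the mechanics and say nothing reliable about the full dataset.

```python
import pandas as pd

# 'Survived', 'SibSp' and 'Parch' for the ten rows shown in the preview
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
    "Parch":    [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
})
toy_df["Family_Count"] = toy_df["SibSp"] + toy_df["Parch"]

# Group by family size; report group size alongside the share of survivors
summary = toy_df.groupby("Family_Count")["Survived"].agg(
    passengers="size", survival_rate="mean"
)
print(summary)
```

Reporting the group size next to the rate makes it clear when a rate rests on only one or two passengers.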


## Dropping Unnecessary Columns

We will remove 'PassengerId', 'SibSp', and 'Parch' from the dataset. 'PassengerId' is merely an identifier and provides no useful information for analysis. 'SibSp' and 'Parch' have been combined into 'Family_Count', making them redundant.


In [6]:
# Drop 'PassengerId', 'SibSp', and 'Parch'
titanic_df.drop(['PassengerId', 'SibSp', 'Parch'], axis=1, inplace=True)


## Creating a Cabin Indicator

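We will add a 'Cabin_Indicator' column that flags whether a cabin number is recorded for a passenger (1) or missing (0). A missing value means only that no cabin was recorded in the data, not necessarily that the passenger had no cabin. This feature may help us understand whether having a recorded cabin is associated with survival chances.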


In [7]:
# Create 'Cabin_Indicator' column: 1 if a cabin value is present, 0 if it is missing
titanic_df['Cabin_Indicator'] = titanic_df['Cabin'].notna().astype(int)

# Display the first 10 rows to see the result
titanic_df.head(10)


Unnamed: 0,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked,Family_Count,Cabin_Indicator
0,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,,S,1,0
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C,1,1
2,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,,S,0,0
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,C123,S,1,1
4,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,,S,0,0
5,0,3,"Moran, Mr. James",male,29.699118,330877,8.4583,,Q,0,0
6,0,1,"McCarthy, Mr. Timothy J",male,54.0,17463,51.8625,E46,S,0,1
7,0,3,"Palsson, Master. Gosta Leonard",male,2.0,349909,21.075,,S,4,0
8,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,347742,11.1333,,S,2,0
9,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,237736,30.0708,,C,1,0
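
A cross-tabulation is one simple way to look at the indicator against survival. The sketch below rebuilds the ten preview rows in a hypothetical `toy_df` and tabulates 'Cabin_Indicator' against both 'Survived' and 'Pclass'. It is an illustration on a tiny sample, not a result.

```python
import pandas as pd

# 'Survived', 'Pclass' and 'Cabin' for the ten rows shown in the preview
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Cabin":    [None, "C85", None, "C123", None, None, "E46", None, None, None],
})

# 1 when a cabin value is present, 0 otherwise
toy_df["Cabin_Indicator"] = toy_df["Cabin"].notna().astype(int)

# Counts of passengers for each indicator value, split by outcome and by class
by_survival = pd.crosstab(toy_df["Cabin_Indicator"], toy_df["Survived"])
by_class = pd.crosstab(toy_df["Cabin_Indicator"], toy_df["Pclass"])
print(by_survival)
print(by_class)
```

In these ten rows every passenger with a recorded cabin travelled first class, so any apparent cabin effect could partly reflect ticket class. Checking the indicator against 'Pclass' on the full data would show how strongly the two overlap.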


## Converting Sex to Numerical

To facilitate our analysis, we will convert the 'Sex' column into a numerical format, where male = 0 and female = 1.


In [8]:
# Convert 'Sex' into numerical format
titanic_df['Sex'] = titanic_df['Sex'].map({'male': 0, 'female': 1})

# Display the first 10 rows to see the result
titanic_df.head(10)


Unnamed: 0,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked,Family_Count,Cabin_Indicator
0,0,3,"Braund, Mr. Owen Harris",0,22.0,A/5 21171,7.25,,S,1,0
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,PC 17599,71.2833,C85,C,1,1
2,1,3,"Heikkinen, Miss. Laina",1,26.0,STON/O2. 3101282,7.925,,S,0,0
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,113803,53.1,C123,S,1,1
4,0,3,"Allen, Mr. William Henry",0,35.0,373450,8.05,,S,0,0
5,0,3,"Moran, Mr. James",0,29.699118,330877,8.4583,,Q,0,0
6,0,1,"McCarthy, Mr. Timothy J",0,54.0,17463,51.8625,E46,S,0,1
7,0,3,"Palsson, Master. Gosta Leonard",0,2.0,349909,21.075,,S,4,0
8,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,347742,11.1333,,S,2,0
9,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,237736,30.0708,,C,1,0
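
One caveat with dictionary-based mapping is that any value absent from the dictionary becomes NaN. That includes unexpected spellings, and it also includes the already-encoded 0/1 values if the cell is run a second time. The sketch below shows a possible check on a made-up `sex` series; the stray `"Female"` entry is invented for illustration.

```python
import pandas as pd

# Toy column with one inconsistently capitalised entry (illustrative only)
sex = pd.Series(["male", "female", "Female", "male"])
mapping = {"male": 0, "female": 1}

# Values that are not keys of the mapping silently become NaN
encoded = sex.map(mapping)
unmapped = sex[encoded.isna()].unique()
print("unmapped values:", list(unmapped))

# Normalising case and whitespace before mapping avoids this particular gap
normalized = sex.str.strip().str.lower().map(mapping)
print(normalized.tolist())
```

Running `isna().sum()` on the encoded column right after the mapping is a cheap way to confirm that nothing was lost.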


## Final Dataset Adjustments

Before concluding our data preprocessing, we will drop the 'Cabin', 'Name', and 'Ticket' columns. These features are unlikely to be useful for our analysis. 'Cabin' has many missing values, 'Name' is unique to each passenger and does not provide any insight, and 'Ticket' numbers are too varied to be of analytical value.


In [9]:
# Drop 'Cabin', 'Name', and 'Ticket'
titanic_df.drop(['Cabin', 'Name', 'Ticket'], axis=1, inplace=True)

# Save the cleaned dataset to a new CSV file
titanic_df.to_csv('titanic_clean.csv', index=False)
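
A final sanity check one might run is to read the saved file back and confirm its columns and remaining gaps. The sketch below uses a hypothetical three-row `clean_toy` frame with the same columns as the cleaned dataset and writes to an in-memory buffer rather than to `titanic_clean.csv`.

```python
import io
import pandas as pd

# Toy stand-in for the cleaned frame; the values are illustrative, not the real data
clean_toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", None],  # one gap left in on purpose
    "Family_Count": [1, 1, 0],
    "Cabin_Indicator": [0, 1, 0],
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
clean_toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# List only the columns that still contain missing values
remaining_missing = reloaded.isnull().sum()
print(remaining_missing[remaining_missing > 0])
```

The missing-value check earlier in this notebook reported two gaps in 'Embarked', and none of the preprocessing steps fills them. A check like this on the real file would therefore be expected to flag that column.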
