# Case Study: Exploring the Titanic Dataset Using NumPy

In this notebook, we'll use NumPy to analyze the Titanic dataset, which contains information about passengers aboard the Titanic. We'll perform various operations and calculations to gain insights from the data.

## 1. Introduction

The dataset has 891 rows and 12 columns: 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', and 'Embarked'. Our goal is to extract useful information from this data, such as the survival rate, the average age, and the fare distribution.

Let's start by loading the dataset and exploring its contents.

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd

# Loading the dataset
titanic_data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Displaying the shape and the first few rows of the dataset
print(titanic_data.shape)
titanic_data.head()


(891, 12)


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## 2. Data Preprocessing

Before performing any analysis, let's preprocess the data by handling missing values and converting categorical variables into numerical representations.

In [2]:
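# Handling missing values: fill missing ages with the median age and
# missing ports of embarkation with 'S' (Southampton); 'Cabin' is left as-is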
titanic_data.fillna({'Age': titanic_data['Age'].median(), 'Embarked': 'S'}, inplace=True)

# Converting categorical variables into numerical representations
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})
titanic_data['Embarked'] = titanic_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# Displaying the preprocessed dataset
titanic_data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | 0 |
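The same two preprocessing steps can also be written directly against NumPy arrays. The following is a minimal sketch on a small made-up sample: the `ages` and `embarked` arrays are illustrative and not taken from the dataset, and it assumes missing ages are stored as `np.nan` and missing ports as empty strings.

```python
import numpy as np

# Hypothetical sample: np.nan marks a missing age
ages = np.array([22.0, np.nan, 26.0, 35.0, np.nan])

# Median of the observed values only (np.median would return nan here)
median_age = np.nanmedian(ages)

# Replace the missing entries with that median
ages_filled = np.where(np.isnan(ages), median_age, ages)

# Hypothetical sample: '' marks a missing port of embarkation
embarked = np.array(['S', 'C', '', 'Q', 'S'])
embarked = np.where(embarked == '', 'S', embarked)

# Encode the categories with the same mapping used in the cell above
codes = {'S': 0, 'C': 1, 'Q': 2}
embarked_encoded = np.array([codes[port] for port in embarked])

print(median_age)
print(ages_filled)
print(embarked_encoded)
```

Using `np.nanmedian` rather than `np.median` matters here, because any `nan` in the input would otherwise propagate into the result.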


## 3. Basic Operations

We now extract individual columns as NumPy arrays with `.values` and compute summary statistics for age and fare.

In [5]:
# Extracting survival status
survival_status = titanic_data['Survived'].values

# Calculating the average age of passengers (after median imputation)
average_age = np.mean(titanic_data['Age'].values)
print('Average Age:', average_age)

# Calculating the fare statistics
fares = titanic_data['Fare'].values
fare_stats = (np.min(fares), np.mean(fares), np.median(fares), np.max(fares))
print('Fare Statistics (Min, Mean, Median, Max):', fare_stats)

# Note: this describes the four statistics above, not the Fare column itself
pd.DataFrame(fare_stats).describe()

Average Age: 29.36158249158249
Fare Statistics (Min, Mean, Median, Max): (0.0, 32.204207968574636, 14.4542, 512.3292)


|   | 0 |
|---|---|
| count | 4.0 |
| mean | 139.746902 |
| std | 248.737115 |
| min | 0.0 |
| 25% | 10.84065 |
| 50% | 23.329204 |
| 75% | 152.235456 |
| max | 512.3292 |
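Min, mean, median, and max give only a coarse picture of how fares are distributed. As a rough sketch, quartiles and the standard deviation can be computed with NumPy as shown below. To keep the example self-contained it uses only the five fares from the `head()` output above; for the full column, `titanic_data['Fare'].values` would take the place of `fares`.

```python
import numpy as np

# Fares of the five passengers shown in the head() output
fares = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

# Quartiles describe the spread better than min/mean/max alone
q1, q2, q3 = np.percentile(fares, [25, 50, 75])

# Sample standard deviation (ddof=1), the convention pandas uses
std = np.std(fares, ddof=1)

print('Quartiles:', q1, q2, q3)
print('Std:', std)
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`), so `ddof=1` is needed to match what `describe()` reports.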


## 4. Data Manipulation

Next, we convert a subset of columns into a single 2D NumPy array, with one row per passenger and one column per selected feature.

In [6]:
# Creating a 2D array from selected columns
selected_data = titanic_data[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked']].values

# Displaying the selected data
print('Selected Data Shape:', selected_data.shape)
type(selected_data)

Selected Data Shape: (891, 5)


numpy.ndarray
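Once the data is in a 2D array, columns are selected by slicing and rows by boolean masks. The sketch below illustrates this on the five preprocessed rows shown earlier (columns in the order `Pclass`, `Sex`, `Age`, `Fare`, `Embarked`); the variable names are illustrative, and the same indexing should apply to `selected_data`.

```python
import numpy as np

# Columns: Pclass, Sex, Age, Fare, Embarked (first five preprocessed rows)
sample = np.array([
    [3, 0, 22.0, 7.25,    0],
    [1, 1, 38.0, 71.2833, 1],
    [3, 1, 26.0, 7.925,   0],
    [1, 1, 35.0, 53.1,    0],
    [3, 0, 35.0, 8.05,    0],
])

pclass = sample[:, 0]  # column slice: passenger class
fare = sample[:, 3]    # column slice: fare

# A boolean mask keeps only the rows where the condition holds
first_class = sample[pclass == 1]

# Mean fare per class, computed by masking the fare column
mean_fare_first = fare[pclass == 1].mean()
mean_fare_third = fare[pclass == 3].mean()

print(sample.dtype)
print(first_class.shape)
print(mean_fare_first, mean_fare_third)
```

A NumPy array holds a single dtype, so the integer-coded columns are stored as floats alongside `Age` and `Fare`.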

## 5. Statistical Analysis

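Because `Survived` is encoded as 0 or 1, the mean of that array equals the fraction of passengers who survived. We compute it with NumPy first, then cross-check the result with an explicit loop.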

In [7]:
# Calculating survival rate
survival_rate = np.mean(survival_status)
print('Survival Rate:', survival_rate)

Survival Rate: 0.3838383838383838


In [8]:
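# Cross-checking the survival rate by counting survivors in an explicit loop
yes = 0  # number of survivors
no = 0   # number of non-survivors
for i in survival_status:
    if i == 1:
        yes += 1
    else:
        no += 1

print(yes / survival_status.size)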

0.3838383838383838
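The loop above can also be written without explicit iteration. The following sketch uses `np.count_nonzero` and boolean masks on the `Survived` and `Sex` values of the five rows shown earlier; the resulting rates describe only this tiny sample, not the full dataset.

```python
import numpy as np

# Survived and Sex (0 = male, 1 = female) for the five rows shown earlier
survived = np.array([0, 1, 1, 1, 0])
sex = np.array([0, 1, 1, 1, 0])

# Vectorized count of survivors, replacing the explicit loop
n_survived = np.count_nonzero(survived == 1)
rate = n_survived / survived.size

# Survival rate within each group, selected with a boolean mask
female_rate = survived[sex == 1].mean()
male_rate = survived[sex == 0].mean()

print(rate, female_rate, male_rate)
```

With the full arrays, `survived` would be `survival_status` and `sex` would be `titanic_data['Sex'].values`.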


## 6. Conclusion

In this notebook, we analyzed the Titanic passenger dataset. We used pandas to load the data, fill missing values, and encode categorical variables, and NumPy to compute the average age, fare statistics, and the survival rate (about 38%). NumPy's array operations and mathematical functions make it a powerful tool for data analysis and manipulation.