 _____  _   _   ___   _      _      _____ _   _ _____  _____ 
/  __ \| | | | / _ \ | |    | |    |  ___| \ | |  __ \|  ___|
| /  \/| |_| |/ /_\ \| |    | |    | |__ |  \| | |  \/| |__  
| |    |  _  ||  _  || |    | |    |  __|| . ` | | __ |  __| 
| \__/\| | | || | | || |____| |____| |___| |\  | |_\ \| |___ 
 \____/\_| |_/\_| |_/\_____/\_____/\____/\_| \_/\____/\____/ 

Tasks (use Visual Studio Code and Jupyter Notebook in GitHub Codespaces):

1.) Replace the kaggle.json file in this folder with your own (see README.md).

2.) Use the Kaggle Web API to download the Titanic data set.
    Link: https://www.kaggle.com/datasets/yasserh/titanic-dataset?select=Titanic-Dataset.csv
    To give you a hint:
    dataset = yasserh/titanic-dataset
    file_name = Titanic-Dataset.csv

3.) Identify the data types in the Titanic data set.

4.) Transform variable 'Sex' (Gender) to a two-column matrix with (0/1) values.

5.) Create a subset of the Titanic data which includes:
    - passengers who survived AND
    - female passengers who were older than 45 years OR
    - male passengers who were younger than 20 years

6.) Answer the question: How many passengers were selected?

In [11]:
# Import necessary libraries
import os
import json
import zipfile
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi

# Path to your Kaggle API JSON file
kaggle_json_path = './data/kaggle_personal_1.json'

# Load Kaggle API credentials
with open(kaggle_json_path, 'r') as f:
    kaggle_credentials = json.load(f)

# Set Kaggle API credentials as environment variables
os.environ['KAGGLE_USERNAME'] = kaggle_credentials['username']
os.environ['KAGGLE_KEY'] = kaggle_credentials['key']

# Initialize Kaggle API
api = KaggleApi()
api.authenticate()

# Define the dataset and file to download
dataset = 'yasserh/titanic-dataset'
file_name = 'Titanic-Dataset.csv'

# Download the Titanic dataset
output_dir = './data'
api.dataset_download_file(dataset, file_name, path=output_dir)

# Unzip the downloaded file if it arrived as a ZIP archive
zip_path = os.path.join(output_dir, file_name + '.zip')
if os.path.exists(zip_path):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(output_dir)
    os.remove(zip_path)

print(f"Dataset '{file_name}' downloaded and extracted to '{output_dir}'.")

Dataset URL: https://www.kaggle.com/datasets/yasserh/titanic-dataset
Dataset 'Titanic-Dataset.csv' downloaded and extracted to './data'.
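
The download cell only unzips when a `.zip` file is actually present, since a single-file download may or may not arrive compressed. A minimal sketch of that check as a reusable helper, using only the standard library; the function name `extract_if_zipped` is hypothetical and not part of the Kaggle API:

```python
import os
import zipfile


def extract_if_zipped(path, target_dir):
    """Extract `path` into `target_dir` if it is a ZIP archive.

    Returns True if an archive was extracted and removed, False otherwise.
    """
    # Nothing to do if the file is missing or is not a ZIP archive
    if not os.path.exists(path) or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path, 'r') as zip_ref:
        zip_ref.extractall(target_dir)
    # Remove the archive once its contents are extracted
    os.remove(path)
    return True
```

With the paths used in this notebook, the call would look like `extract_if_zipped('./data/Titanic-Dataset.csv.zip', './data')`.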


In [29]:
# Path to the Titanic dataset
titanic_data_path = './data/Titanic-Dataset.csv'

# Load the Titanic dataset
df = pd.read_csv(titanic_data_path)

# Display the data types of each column
print(df.dtypes)

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
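
In this output, `object` marks the text columns such as `Name` or `Sex`, while `int64` and `float64` are numeric. As an illustration of how the columns could be grouped by kind and checked for missing values, here is a small sketch on a hypothetical three-row stand-in (`demo_df`) rather than the real file:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (not the real file)
demo_df = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, None, 26.0],
})

# Split the columns into numeric and non-numeric groups
numeric_cols = demo_df.select_dtypes(include='number').columns.tolist()
text_cols = demo_df.select_dtypes(exclude='number').columns.tolist()

# Count missing values per column
missing_counts = demo_df.isna().sum()

print(numeric_cols)
print(text_cols)
print(missing_counts)
```

Running the same three calls on `df` would give the grouping for the full data set.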


In [30]:
print(df.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


In [32]:
# Transform 'Sex' variable to a two-column matrix with (0/1) values using get_dummies
sex_dummies = pd.get_dummies(df['Sex'], prefix='Sex')

# Concatenate the dummy variables with the original DataFrame
titanic_df = pd.concat([df, sex_dummies], axis=1)

# Drop the original 'Sex' column
titanic_df.drop('Sex', axis=1, inplace=True)

# Display the first few rows of the transformed DataFrame
titanic_df.head()

,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_female,Sex_male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,False,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,True,False
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,True,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,True,False
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,False,True
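
The dummy columns in this output hold `True`/`False` rather than the 0/1 values the task asks for, because recent pandas versions return booleans from `get_dummies` by default. One possible way to get integers is the `dtype` argument, sketched here on a hypothetical three-value column standing in for `df['Sex']`:

```python
import pandas as pd

# Hypothetical stand-in for df['Sex']
sex = pd.Series(['male', 'female', 'female'], name='Sex')

# dtype=int yields 0/1 integers instead of booleans
sex_dummies_int = pd.get_dummies(sex, prefix='Sex', dtype=int)

print(sex_dummies_int)
```

Applied to the notebook, this would amount to passing `dtype=int` in the existing `pd.get_dummies(df['Sex'], prefix='Sex')` call.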


In [None]:
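# Create a subset: survivors who are either women older than 45 or men younger than 20.
# Rows with a missing Age fail both age comparisons and are therefore excluded.
subset_df = titanic_df[
    (titanic_df['Survived'] == 1) &
    (
        ((titanic_df['Sex_female'] == 1) & (titanic_df['Age'] > 45)) |
        ((titanic_df['Sex_male'] == 1) & (titanic_df['Age'] < 20))
    )
]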

# Display the number of passengers selected
num_passengers_selected = subset_df.shape[0]
print(f"Number of passengers selected: {num_passengers_selected}")

# Display the first few rows of the subset DataFrame
subset_df.head()

Number of passengers selected: 55


,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_female,Sex_male
11,12,1,1,"Bonnell, Miss. Elizabeth",58.0,0,0,113783,26.55,C103,S,True,False
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",55.0,0,0,248706,16.0,,S,True,False
52,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",49.0,1,0,PC 17572,76.7292,D33,C,True,False
78,79,1,2,"Caldwell, Master. Alden Gates",0.83,0,2,248738,29.0,,S,False,True
125,126,1,3,"Nicola-Yarred, Master. Elias",12.0,1,0,2651,11.2417,,C,False,True
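
The filter reads task 5 as: survived AND (female older than 45 OR male younger than 20). Because `&` binds more tightly than `|` in Python, each sex/age pair is evaluated first and the two pairs are then combined with OR. The same logic can be sketched with named masks on the original `Sex` column; the five-row `demo_df` and the mask names are hypothetical and only serve to illustrate the grouping:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (not the real file)
demo_df = pd.DataFrame({
    'Survived': [1, 1, 1, 0, 1],
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Age': [58.0, 30.0, 12.0, 15.0, None],
})

# One named mask per condition keeps the precedence explicit
survived = demo_df['Survived'] == 1
older_women = (demo_df['Sex'] == 'female') & (demo_df['Age'] > 45)
younger_men = (demo_df['Sex'] == 'male') & (demo_df['Age'] < 20)

# A missing Age compares as False, so such rows never match
demo_subset = demo_df[survived & (older_women | younger_men)]

print(len(demo_subset))
```

Here the passenger with a missing age and the 15-year-old who did not survive are both left out, which mirrors how the real filter treats such rows.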
