# Import Pandas

In [None]:
# This line imports the pandas library and aliases it as 'pd'.
# Aliasing pandas as 'pd' is a widely adopted convention that simplifies the syntax for accessing its functionalities.
# After this statement, you can use 'pd' to access all the functionalities provided by the pandas library.

import pandas as pd

# Creating a `DataFrame` from a CSV file

In [None]:
# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.
# The 'pd.read_csv()' function is used to read the data from the file 'data/titanic.csv'.
# The file is located in the 'data' directory, relative to the current working directory.
# The resulting DataFrame 'titanic' contains the dataset, ready for analysis and manipulation.

titanic = pd.read_csv('data/titanic.csv')

In [None]:
# Display the DataFrame 'titanic'.
# Note: only the first and last five rows are displayed, but the whole DataFrame has been read into the kernel's memory.
# Displaying fewer rows with the 'head()' method shortens the output; it does not reduce memory usage.
# Memory only becomes an issue with very large datasets, so don't worry too much about it for now.
# You can find out how much memory a DataFrame uses by using the 'memory_usage()' method:
# titanic.memory_usage(deep=True).sum()

titanic
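
The display and memory notes above can be tried on a small stand-in. This is a minimal sketch: `demo` and its rows are made up for illustration and are not part of the Titanic file.

```python
import pandas as pd

# Hypothetical stand-in data; the notebook itself reads data/titanic.csv.
demo = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cid', 'Dee', 'Eve', 'Fay', 'Gus'],
    'Age': [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0],
})

# head() returns the first five rows by default; pass a number for a different count.
first_rows = demo.head()
first_three = demo.head(3)

# memory_usage(deep=True) reports bytes per column (and for the index);
# sum() adds them up to a total for the whole DataFrame.
total_bytes = demo.memory_usage(deep=True).sum()

print(first_three)
print(total_bytes)
```

`demo` still holds all seven rows after calling `head()`; only the returned view of the data is shorter.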

# Selecting specific columns from a `DataFrame`

![Selecting specific columns from a DataFrame](images/03_subset_columns.svg)

In [None]:
# Access the 'Age' column from the DataFrame 'titanic'.
# This returns a Series object containing all the data in the 'Age' column.

titanic['Age']

In [None]:
# Check the type of the 'Age' column in 'titanic' using the 'type()' function.

type(titanic['Age'])

In [None]:
# Use the 'shape' attribute to determine the dimensions of the Series.
# A Series is one-dimensional, so 'shape' returns a one-element tuple holding only the number of rows: (rows,).

titanic['Age'].shape

# Selecting multiple `Series`

In [None]:
# Select the columns 'Age' and 'Sex' from the 'titanic' DataFrame using double square brackets.
# The result is a DataFrame that contains only these two columns.

titanic[['Age', 'Sex']]

### The inner square brackets define a Python `list` with column names, whereas the outer square brackets are used to select the data from a pandas `DataFrame` as seen in the previous example.

In [None]:
# Use the 'type()' function to check the type of the two-column subset.
# Selecting with a list of column names returns a DataFrame, not a Series.

type(titanic[['Age', 'Sex']])

In [None]:
# Use the 'shape' attribute to determine the dimensions of the DataFrame subset.
# For a DataFrame, 'shape' returns a tuple with the number of rows and columns: (rows, columns).

titanic[['Age', 'Sex']].shape
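
The difference between single and double square brackets can be sketched on a small made-up DataFrame (`demo` is hypothetical, not the Titanic data):

```python
import pandas as pd

# Hypothetical stand-in data with three rows.
demo = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Sex': ['male', 'female', 'female'],
})

single = demo['Age']             # one column label -> Series
as_frame = demo[['Age']]         # list with one label -> DataFrame with one column
two_cols = demo[['Age', 'Sex']]  # list with two labels -> DataFrame with two columns

print(type(single).__name__, single.shape)      # Series (3,)
print(type(as_frame).__name__, as_frame.shape)  # DataFrame (3, 1)
print(type(two_cols).__name__, two_cols.shape)  # DataFrame (3, 2)
```

Even with a single column name, the list form keeps the result two-dimensional.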

# Filtering specific rows from a `DataFrame`

![Filtering specific rows from a DataFrame](images/03_subset_rows.svg)

In [None]:
# Filter rows in the 'titanic' DataFrame where the 'Age' column is greater than 35.
# The result is a DataFrame that contains only the matching rows, with all columns kept.

titanic[titanic['Age'] > 35]

### To select rows based on a conditional expression, use a condition inside the selection brackets `[]`.

The condition inside the selection brackets `titanic['Age'] > 35` checks for which rows the `Age` column has a value larger than 35:

In [None]:
# Create a boolean mask to filter rows where the age of passengers is greater than 35.
# This command evaluates the condition 'titanic['Age'] > 35' element-wise,
# resulting in a boolean Series where True indicates that the corresponding passenger's age is greater than 35.

titanic['Age'] > 35
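
As a small sketch of how such a mask behaves, the example below uses a made-up `demo` DataFrame (an assumption for illustration) that includes one missing age. A comparison with `NaN` evaluates to `False`, so rows with a missing age are dropped by the filter.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the third age is missing.
demo = pd.DataFrame({'Age': [22.0, 40.0, np.nan, 58.0]})

# Element-wise comparison gives a boolean Series of the same length.
mask = demo['Age'] > 35
print(mask.tolist())  # [False, True, False, True]

# True counts as 1, so sum() gives the number of matching rows.
n_matches = mask.sum()

# Passing the mask to the selection brackets keeps only the True rows.
older = demo[mask]
```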

In [None]:
# Filter rows in the 'titanic' DataFrame where the 'Pclass' column values are either 2 or 3.
# This command uses the 'isin()' method to create a boolean mask, where True indicates that
# the corresponding 'Pclass' value is present in the specified list [2, 3].

titanic[titanic['Pclass'].isin([2, 3])]

### The above is equivalent to filtering for rows where the class is either 2 or 3 and combining the two conditions with the `|` (or) operator.
When combining multiple conditional statements, each condition must be surrounded by parentheses `()`. \
Moreover, you cannot use Python's `or`/`and` keywords; use the `|` operator for "or" and the `&` operator for "and".

In [None]:
# Filter rows in the 'titanic' DataFrame where the 'Pclass' column values are either 2 or 3.
# This command uses boolean indexing with logical OR (|) to create a mask,
# where True indicates that the corresponding 'Pclass' value is either 2 or 3.

titanic[(titanic['Pclass'] == 2) | (titanic['Pclass'] == 3)]
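
The equivalence of the two filters, and the rule about `|`/`&`, can be checked on a made-up example (`demo` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in data.
demo = pd.DataFrame({
    'Pclass': [1, 2, 3, 3, 1],
    'Age': [40.0, 28.0, 36.0, 19.0, 52.0],
})

# isin() and two conditions joined with | select the same rows.
with_isin = demo[demo['Pclass'].isin([2, 3])]
with_or = demo[(demo['Pclass'] == 2) | (demo['Pclass'] == 3)]
same = with_isin.equals(with_or)

# & keeps rows where both conditions hold; each condition needs its own parentheses.
third_over_35 = demo[(demo['Pclass'] == 3) & (demo['Age'] > 35)]

# Python's 'and' asks for the truth value of a whole Series, which pandas refuses.
try:
    demo[(demo['Pclass'] == 3) and (demo['Age'] > 35)]
    raised = False
except ValueError:
    raised = True

print(same, len(third_over_35), raised)
```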

### Remember that the `notna()` method returns `True` for each row where the value is not missing (that is, not `NaN` or `None`).
As such, this can be combined with the selection brackets `[]` to filter the data table.

In [None]:
# Filter rows in the 'titanic' DataFrame where 'Embarked' values are not null (not NaN).
# 'notna()' returns True for non-null values, allowing us to select rows with valid embarkation data.

titanic[titanic['Embarked'].notna()]

In [None]:
# Compare the number of extracted rows with the counts for the original DataFrame.
# 'info()' lists the number of non-null values for every column, including 'Embarked'.

titanic.info()
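
The same comparison can be made directly with row counts. This is a minimal sketch on made-up data (`demo` is hypothetical), with one missing `Embarked` value:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; one embarkation port is missing.
demo = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'Embarked': ['S', np.nan, 'C', 'Q'],
})

known_port = demo[demo['Embarked'].notna()]

n_all = len(demo)          # rows in the original DataFrame
n_known = len(known_port)  # rows left after filtering

# count() returns the number of non-missing values, the figure info() reports per column.
n_non_null = demo['Embarked'].count()

print(n_all, n_known, n_non_null)
```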

# Selecting specific rows and columns from a `DataFrame`

![Selecting specific rows and columns from a DataFrame](images/03_subset_columns_rows.svg)

In [None]:
# Filter rows in the 'titanic' DataFrame where the age is greater than 35,
# then select only the 'Name' and 'Pclass' columns for these filtered rows.
# The first pair of brackets applies the boolean mask to the rows;
# the second pair selects the columns from that intermediate result.

titanic[titanic['Age'] > 35][['Name', 'Pclass']]

## A note on square bracket indexing (`[]`)

Square bracket indexing (`[]`) is a versatile way to access data in pandas DataFrames, but some tasks are handled less directly or less efficiently by it than by `loc` and `iloc`. Here are a few limitations of square bracket indexing in comparison to `loc` and `iloc`:

1. **Positional Selection:** When the selection is purely positional (selecting the first five rows, or columns 2 to 4), `iloc` is the most straightforward tool for the job.

2. **Inclusive Slicing:** An integer slice in square brackets selects rows by position and excludes the end point, so slice endpoints have to be adjusted. Conversely, `loc` slices by label and includes both endpoints, which simplifies the specification of row and column ranges.

3. **Dealing with Non-Integer Labels:** With square brackets, a single label or a list of labels always refers to columns, whereas a slice refers to rows. This is easy to confuse when the index holds custom (non-integer) labels that resemble column names. `loc` always takes row labels first and column labels second, irrespective of label data type.

4. **Efficiency in Complex Selections:** Selecting rows and columns with square brackets takes two chained steps, such as `titanic[titanic['Age'] > 35][['Name', 'Pclass']]`, and the first step builds an intermediate DataFrame. `loc` and `iloc` select rows and columns in a single operation, which makes them the more specialized tools for accessing and manipulating DataFrame data, particularly in complex scenarios.
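
The points in the list above can be sketched briefly. The `demo` DataFrame and its string row labels are made up for illustration; `loc` and `iloc` themselves are covered in the separate notebook.

```python
import pandas as pd

# Hypothetical stand-in data with custom string row labels.
demo = pd.DataFrame(
    {
        'Name': ['Ann', 'Bob', 'Cid', 'Dee'],
        'Age': [22.0, 40.0, 58.0, 31.0],
        'Pclass': [3, 1, 1, 2],
    },
    index=['a', 'b', 'c', 'd'],
)

# iloc slices by position and excludes the end point: rows 'a' and 'b'.
first_two = demo.iloc[0:2]

# loc slices by label and includes the end point: rows 'a', 'b' and 'c'.
a_to_c = demo.loc['a':'c']

# loc takes a row selection and a column selection in one step.
older = demo.loc[demo['Age'] > 35, ['Name', 'Pclass']]

print(list(first_two.index), list(a_to_c.index))
print(older)
```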

We have attached a separate notebook that introduces `loc` and `iloc`, two properties of pandas `DataFrame` and `Series` objects that provide methods for indexing. 

# REMEMBER

* When selecting subsets of data, square brackets `[]` are used.

* Inside these square brackets, you can use
    
    * a single column/row label
    
    * a list of column/row labels
    
    * a slice of labels
    
    * a conditional expression
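
As a recap, the forms listed above might look as follows with plain square brackets. This is a sketch on made-up data (`demo` is hypothetical) with the default integer index, where an integer slice selects rows by position:

```python
import pandas as pd

# Hypothetical stand-in data with the default integer index 0..3.
demo = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'Age': [22.0, 40.0, 58.0, 31.0],
})

one_col = demo['Age']              # single column label -> Series
two_cols = demo[['Name', 'Age']]   # list of column labels -> DataFrame
row_slice = demo[1:3]              # slice -> rows at positions 1 and 2
filtered = demo[demo['Age'] > 30]  # conditional expression -> matching rows

print(row_slice)
print(filtered)
```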