# Pandas Tutorial

Author: Kellen Sullivan

This is an introductory tutorial for pandas, a powerful Python library for data handling and analysis. Pandas is widely used to load, explore, and transform datasets, making it an essential tool for anyone working in machine learning or AI. By the end of this tutorial, you will have the data skills necessary to start building your first machine learning project.

To get started using pandas, you must first import the pandas package. To do this we will use the following line of code:

In [19]:
import pandas as pd

We use `as pd` to import pandas with the alias `pd`. This saves us from typing out `pandas` every time we refer to the package; we can type `pd` instead.

## DataFrames and Series


DataFrames are the core data structure for storing data in pandas. They are tables with labeled columns and indexed rows.

In [33]:
df = pd.DataFrame(
    {
        "Color": ["red", "blue", "green", "purple", "white", "orange"],
        "Price" : [5, 8, 3, 4, 9, 5]
    }
)

# display the type of a Pandas DataFrame
print(type(df))

# display the DataFrame
df

<class 'pandas.core.frame.DataFrame'>


,Color,Price
0,red,5
1,blue,8
2,green,3
3,purple,4
4,white,9
5,orange,5


The DataFrame `df` has two labeled columns, "Color" and "Price", and an index column with the values 0-5 for each row in the DataFrame. 

Series are a similar data type in pandas, but they have only one dimension. Unlike DataFrames, Series do not have labeled columns; they have a single label, the name of the Series itself. However, both DataFrames and Series have indexed rows.

In [30]:
s = pd.Series(
    ["red", "blue", "green", "purple", "white"]
)

# display the type of a Pandas Series
print(type(s))

# display the Series
s

<class 'pandas.core.series.Series'>


0       red
1      blue
2     green
3    purple
4     white
dtype: object

You can think of a Series as one column of a DataFrame. Or, if you are familiar with standard Python data structures: if a Series is like a list, then a DataFrame is like a 2D array.
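To make this relationship concrete, here is a minimal sketch. It uses a small made-up table named `fruit_df` (a hypothetical name, not part of the tutorial's data) and shows that selecting one column of a DataFrame yields a Series that shares the DataFrame's row index.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
fruit_df = pd.DataFrame(
    {
        "Color": ["red", "blue", "green"],
        "Price": [5, 8, 3],
    }
)

# Selecting a single column returns a Series
color_column = fruit_df["Color"]
print(type(color_column))

# The Series is named after the column and shares the DataFrame's row index
print(color_column.name)
print(color_column.index.equals(fruit_df.index))
```

Column selection with `[]` is covered in more detail in the Selecting Data section.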

## Loading Data

As displayed above, DataFrames can be created from Python dictionaries. However, when working with real-world data you will often want to work with data that is stored in another location. Fortunately, pandas provides many built-in functions to easily load data from a variety of file types.

For this tutorial, we will explore a real-world dataset that is popular within the machine learning community: the passengers on the Titanic. The data is stored in a CSV (Comma-Separated Values) file, which pandas is well suited to handle.

Use `read_csv` to read data from a CSV file and store it in a pandas DataFrame. The `read_csv` function takes the file path or URL of the CSV file.

In [35]:
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
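If you want to try `read_csv` without a network connection, the sketch below may help. It assumes a tiny made-up CSV held in a string (`csv_text` and `toy_df` are hypothetical names) and wraps it in `io.StringIO`, which `read_csv` accepts in place of a file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """PassengerId,Name,Age
1,Alice,30
2,Bob,
3,Carol,41
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)   # (rows, columns)
print(toy_df.dtypes)  # Age is read as float64 because of the empty field
```

The empty `Age` field for the second row is loaded as a missing value, which is why the column becomes a float rather than an integer.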

## Exploring Data

### Previewing Data

Now that you have loaded a new dataset into a DataFrame, it is time to explore it! Pandas provides two useful methods, `head` and `tail`, to quickly view a few rows of data.

Use `head` to display the first 5 rows of a DataFrame. You can also provide a value n to display the first n rows. For example, `df.head(20)` displays the first 20 rows of a DataFrame.

In [17]:
# default displays the first 5 rows of a DataFrame
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The `tail` method works like `head`, but displays the last 5 rows of a DataFrame.

In [18]:
df.tail()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


You may have noticed that the DataFrame has an unlabeled index column that was automatically created when we read in the CSV file. However, in pandas you can set the index to whatever values you like, including numbers, dates, and even strings! In this case, the column PassengerId appears to be an index column already included in the data. We can use PassengerId as the index by passing the argument `index_col="PassengerId"` to `read_csv`.

In [36]:
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv", index_col="PassengerId")
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
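As a small sketch of the same idea on made-up data, the example below compares choosing the index while reading with promoting a column afterwards using `set_index`. The CSV text and the variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical CSV content used only for illustration
csv_text = """PassengerId,Name,Age
101,Alice,30
102,Bob,25
"""

# Option 1: choose the index column while reading
indexed_on_read = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

# Option 2: read first, then promote a column to the index
indexed_later = pd.read_csv(io.StringIO(csv_text)).set_index("PassengerId")

print(indexed_on_read.index.name)
print(indexed_on_read.equals(indexed_later))
```

Passing `index_col` is convenient when you already know which column identifies each row; `set_index` is useful when you decide later.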


### DataFrame Attributes 

DataFrames have many attributes that allow data scientists to get a quick overview of a dataset. To access an attribute, use the syntax `dfname.attribute`.
A few attributes that quickly provide general information about a dataset include:
- columns
- dtypes
- shape
- size 


`columns` returns an `Index` object containing the labels of all columns in a DataFrame.

In [154]:
df.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')

`dtypes` displays all columns in a DataFrame and their corresponding data types.

In [151]:
df.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

`shape` returns a tuple with the number of rows followed by the number of columns.

In [152]:
df.shape

(891, 11)

`size` returns the total number of elements in a DataFrame. In other words, it returns the product of the number of rows and columns in a DataFrame. 

In [153]:
df.size

9801

If you are curious about what other attributes DataFrames have, check out the official pandas documentation including all DataFrame attributes here: https://pandas.pydata.org/docs/reference/frame.html
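The relationship between these attributes can be checked on a small example. The sketch below uses a made-up three-row table (`toy_df` is a hypothetical name) rather than the Titanic data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Carol"], "Age": [30.0, 25.0, 41.0]}
)

n_rows, n_cols = toy_df.shape

print(list(toy_df.columns))            # column labels as a plain Python list
print(toy_df.size == n_rows * n_cols)  # size is rows times columns
print(len(toy_df) == n_rows)           # len() gives the number of rows
```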

## Selecting Data

Pandas provides multiple ways to select subsets of data within a DataFrame. In this tutorial we will go over the following methods:

- bracket and dot notation
- loc and iloc

### bracket and dot notation

The simplest way to select a subset of data is similar to selecting an element in a standard Python list or dictionary.

- To select a column, use `[]` with the column name inside.
- You can also use a `.` followed by the column name. 

Both produce the same result:

In [3]:
df["Age"]

PassengerId
1      22.0
2      38.0
3      26.0
4      35.0
5      35.0
       ... 
887    27.0
888    19.0
889     NaN
890    26.0
891    32.0
Name: Age, Length: 891, dtype: float64

In [4]:
df.Age

PassengerId
1      22.0
2      38.0
3      26.0
4      35.0
5      35.0
       ... 
887    27.0
888    19.0
889     NaN
890    26.0
891    32.0
Name: Age, Length: 891, dtype: float64

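When using bracket notation, you can pass in a list to select multiple columns. This is one advantage of bracket notation. Another is that it works for any column name, whereas dot notation only works when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. Both are valid, and the best one to use depends on the context.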

In [5]:
columns_to_keep = ["Sex", "Age", "Survived"]

df[columns_to_keep]

PassengerId,Sex,Age,Survived
1,male,22.0,0
2,female,38.0,1
3,female,26.0,1
4,female,35.0,1
5,male,35.0,0
...,...,...,...
887,male,27.0,0
888,female,19.0,1
889,female,,0
890,male,26.0,1
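The following sketch illustrates where the two notations differ. The column names `Ticket Class` and `count` are made up for this example: the first contains a space and the second shares its name with a DataFrame method.

```python
import pandas as pd

# Hypothetical toy data with awkward column names
toy_df = pd.DataFrame(
    {"Ticket Class": [1, 3], "count": [10, 20], "Age": [22.0, 38.0]}
)

# Bracket notation works for any column name
print(toy_df["Ticket Class"].tolist())
print(toy_df["count"].tolist())

# Dot notation works for identifier-like names that do not clash
# with an existing DataFrame attribute or method
print(toy_df.Age.tolist())

# Here the dot form resolves to the DataFrame.count method, not the column
print(callable(toy_df.count))
```

When in doubt, bracket notation is the safer choice because it never resolves to a method by accident.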


### loc and iloc

Pandas provides `loc` and `iloc` as indexing options for more advanced operations. `loc` and `iloc` are invoked similarly, with the row selector first, followed by the column selector.

`loc`  syntax: `df.loc[row_labels, column_labels]`  
`iloc` syntax: `df.iloc[row_positions, column_positions]`

You can also provide only one selector to `loc` or `iloc`. Pandas will assume you are only selecting based on row and all columns will be included. For example:

`df.loc[row_labels]` is also valid syntax

The main distinction between `loc` and `iloc` is that `loc` uses label-based selection and `iloc` uses position-based selection.

To select an element from a DataFrame using `loc`, provide the row label and column labels to select from. 

In [38]:
# Selects the row with PassengerId of 3
df.loc[3]

Survived                                   1
Pclass                                     3
Name                  Heikkinen, Miss. Laina
Sex                                   female
Age                                     26.0
SibSp                                      0
Parch                                      0
Ticket                      STON/O2. 3101282
Fare                                   7.925
Cabin                                    NaN
Embarked                                   S
Name: 3, dtype: object

In [39]:
# Selects the row with PassengerId of 3, and the columns Name and Sex
df.loc[3, ["Name", "Sex"]]

Name    Heikkinen, Miss. Laina
Sex                     female
Name: 3, dtype: object

In [None]:
# Selects the rows with PassengerIds 3, 4, and 5 and the Age column
df.loc[[3,4,5], "Age"]

PassengerId
3    26.0
4    35.0
5    35.0
Name: Age, dtype: float64

To select the same elements using `iloc`, we use integer positions instead of labels.

Note that `iloc` uses 0-based indexing, so to select the third row in the DataFrame, which is the row with PassengerId 3, we use position 2.

In [40]:
# Selects the third row in the DataFrame
df.iloc[2]

Survived                                   1
Pclass                                     3
Name                  Heikkinen, Miss. Laina
Sex                                   female
Age                                     26.0
SibSp                                      0
Parch                                      0
Ticket                      STON/O2. 3101282
Fare                                   7.925
Cabin                                    NaN
Embarked                                   S
Name: 3, dtype: object

In [41]:
# Selects the third row, and the third and fourth columns in the DataFrame
df.iloc[2, [2,3]]

Name    Heikkinen, Miss. Laina
Sex                     female
Name: 3, dtype: object

In [None]:
# Selects the third, fourth, and fifth rows and the fifth column in the DataFrame
df.iloc[[2,3,4], 4]

PassengerId
3    26.0
4    35.0
5    35.0
Name: Age, dtype: float64

You can use the slicing operator `:` to select many rows or columns without listing them individually. The following statement selects the rows with PassengerIds from 3 to 10 and all columns up to and including Age. Note that `loc` slices include both endpoints.

In [32]:
df.loc[3:10, :"Age"]

PassengerId,Survived,Pclass,Name,Sex,Age
3,1,3,"Heikkinen, Miss. Laina",female,26.0
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
5,0,3,"Allen, Mr. William Henry",male,35.0
6,0,3,"Moran, Mr. James",male,
7,0,1,"McCarthy, Mr. Timothy J",male,54.0
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0


Using `iloc`, a similar subset of the DataFrame can be selected. However, because `iloc` uses 0-based positions and excludes the end position of a slice, `3:10` covers positions 3 through 9, so the row with PassengerId 3 is not included!

In [33]:
df.iloc[3:10, :5]

PassengerId,Survived,Pclass,Name,Sex,Age
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
5,0,3,"Allen, Mr. William Henry",male,35.0
6,0,3,"Moran, Mr. James",male,
7,0,1,"McCarthy, Mr. Timothy J",male,54.0
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0
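The difference in slice endpoints is easy to miss, so here is a minimal sketch on a made-up five-row table (`toy_df` is a hypothetical name) whose index starts at 1, like PassengerId.

```python
import pandas as pd

# Hypothetical toy data with a 1-based index, similar to PassengerId
toy_df = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0, 35.0, 35.0]},
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

by_label = toy_df.loc[2:4]      # labels 2, 3, and 4 (end label included)
by_position = toy_df.iloc[2:4]  # positions 2 and 3 (end position excluded)

print(by_label.index.tolist())     # [2, 3, 4]
print(by_position.index.tolist())  # [3, 4]
```

The same `2:4` therefore selects three rows with `loc` but only two rows with `iloc`.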


One advantage of using `loc` is that it can handle conditional selection. To select based on a conditional statement use the syntax:

`df.loc[conditional statement]`

In [None]:
# Select all rows with Age < 18
df.loc[df["Age"] < 18]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.1250,,Q
...,...,...,...,...,...,...,...,...,...,...,...
851,0,3,"Andersson, Master. Sigvard Harald Elias",male,4.0,4,2,347082,31.2750,,S
853,0,3,"Boulos, Miss. Nourelain",female,9.0,1,1,2678,15.2458,,C
854,1,1,"Lines, Miss. Mary Conover",female,16.0,0,1,PC 17592,39.4000,D28,S
870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,,S
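Conditions can also be combined. The sketch below, on made-up data with hypothetical variable names, uses `&` for AND, `|` for OR, and `~` for NOT on boolean Series.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, 14.0, 2.0],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Each comparison produces a boolean Series (a "mask")
is_female = toy_df["Sex"] == "female"
is_minor = toy_df["Age"] < 18

female_minors = toy_df.loc[is_female & is_minor]    # both conditions hold
female_or_minor = toy_df.loc[is_female | is_minor]  # at least one holds
adults = toy_df.loc[~is_minor]                      # condition does not hold

print(female_minors.index.tolist())    # [3]
print(female_or_minor.index.tolist())  # [2, 3, 4]
print(adults.index.tolist())           # [1, 2]
```

When writing the comparisons inline instead of naming them first, wrap each one in parentheses, for example `toy_df.loc[(toy_df["Sex"] == "female") & (toy_df["Age"] < 18)]`, because `&` and `|` bind more tightly than `==` and `<`.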


Often `loc` and `iloc` are both viable solutions, so deciding which one to use can be tricky. In general, if the DataFrame has clear row and column labels, or you are using conditional selection, then `loc` is the better option. If instead label names are not relevant, or you are iterating over a DataFrame by position, then `iloc` may be the better solution.

If you want more practice with `loc` and `iloc`, this article provides many examples and details what situations favor each selector: https://www.datacamp.com/tutorial/loc-vs-iloc

### View vs Copy

One important consideration when selecting data in a DataFrame is whether you are working with a view or a copy.
- A view references the original data, so any edits you make to the view may also affect the original DataFrame.
- A copy, on the other hand, creates a completely new object stored separately in memory. Changes made to a copy will not affect the original data.

Whether a selection returns a view or a copy depends on how you select the data, and pandas does not guarantee the outcome in every case. For example, selecting data with a slice typically returns a view.

In [None]:
# selecting data with a slice typically returns a view
df.loc[:, :"Age"]

Selecting data with a boolean condition or a list of labels returns a copy instead.

In [None]:
# select data using conditional value
df.loc[df["Age"] < 18]

# select data with an iterable list
df.loc[:, ["Age", "Sex"]]

Views are generally cheaper than copies because they avoid duplicating data in memory. Check out this resource if you would like to learn more: http://itnext.io/a-guide-to-efficient-data-selection-in-pandas-ea6dab640604
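When you need to be sure that edits to a selection cannot affect the original data, one common approach is to request a copy explicitly with `.copy()`. The sketch below uses made-up data and hypothetical variable names.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({"Age": [22.0, 14.0, 2.0]})

# .copy() returns an independent object, regardless of how the rows were selected
minors = toy_df.loc[toy_df["Age"] < 18].copy()
minors["Age"] = 0.0

print(minors["Age"].tolist())  # the copy was modified
print(toy_df["Age"].tolist())  # the original values are untouched
```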

### Exercise: Perform Exploratory Data Analysis

**Time to put your knowledge to the test!**

An important first step in any machine learning project is to explore the data for initial insights. Pandas provides various functions for machine learning engineers to complete this task.

For this exercise you will explore a very popular dataset about passengers on the Titanic. To finish the exercise, complete the following tasks:
1. Display the first 10 rows of the DataFrame
2. Determine the number of rows and columns in the DataFrame
3. Display all columns and their datatypes
4. Create a subset of the original DataFrame that contains all passengers that are female OR under 18 years old

In [137]:
# 1. Display the first 10 rows of the DataFrame


In [None]:
# 2. Determine the number of rows and columns in the DataFrame


In [None]:
# 3. Display all columns and their datatypes


In [None]:
# 4. Create a DataFrame containing only passengers that are female OR under 18 years old
sub_df = ____

sub_df.head(10)

#### **Solution**

(Try to solve it yourself first before looking!)

Click the cells below to reveal a possible solution approach.

In [87]:
# 1. Display the first 10 rows of the DataFrame
df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [6]:
# 2. Determine the number of rows and columns in the DataFrame
df.shape

(891, 11)

In [7]:
# 3. Display all columns and their datatypes
df.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

In [None]:
# 4. Create a DataFrame containing only passengers that are female OR under 18 years old
sub_df = df.loc[(df["Sex"] == "female") | (df["Age"] < 18)]

sub_df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S


## Aggregation 

Pandas provides many functions to quickly summarize or aggregate data within DataFrames. In this section we will discuss the following:
- Aggregate Operations
- groupby
- value counts

### Aggregate Operations
Pandas provides many aggregate operations that quickly summarize data. Most of them iterate over a column and return a single value. To apply one, use the following syntax:
`df['column_name'].aggregate_operation()`

If an aggregate operation is used on a DataFrame with more than one column, Pandas will apply the operation to each column in the DataFrame and return a Series containing the result of applying the operation to each column in the DataFrame.
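The difference between the two cases can be sketched on a small made-up table (`toy_df` and the other names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})

# One column: the result is a single number
single_value = toy_df["Age"].mean()

# Several columns: the result is a Series with one entry per column
per_column = toy_df[["Age", "Fare"]].mean()

print(single_value)
print(per_column)
```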

Some of the most useful operations include the following (note that `mode` and `unique` can return more than one value):
- mean
- mode
- sum
- count
- unique
- max
- min

In [134]:
# Returns the average age of a passenger
df["Age"].mean()

29.69911764705882

In [None]:
# Returns a Series containing the mode(s) for a column
# The Series will contain multiple modes if there is a tie for the most frequent value
df["Sex"].mode()

0    male
Name: Sex, dtype: object

In [None]:
# To access the literal mode value, select the first value in the Series using [0]
df["Sex"].mode()[0]

'male'

In [142]:
# Returns the sum of adding each row in a column
df["Survived"].sum()

342

In [143]:
# Returns the number of non-null names in the DataFrame
df["Name"].count()

891

In [144]:
# Returns all unique values the Embarked column contains
df["Embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [145]:
# Returns the greatest value in the Fare column
df["Fare"].max()

512.3292

In [146]:
# Returns the least value in the Fare column
df["Fare"].min()

0.0

### groupby

Pandas provides the `groupby` function to split a DataFrame into distinct groups based on a key (usually an existing column in the DataFrame). You can then apply a function to each group individually. The per-group results are then combined back into one object. This process is known as the split -> apply -> combine pattern, which is commonly used in data analysis.

For example, the following code splits the DataFrame based on the column 'Sex'. It then applies the `sum` function to the male group and the female group individually. The results are then combined and displayed in a single Series indexed by group.

In [None]:
df.groupby("Sex")["Survived"].sum()

Sex
female    233
male      109
Name: Survived, dtype: int64
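To see what `groupby` does behind the scenes, the sketch below performs the split, apply, and combine steps by hand on made-up data and compares the result with the one-line version. All names in it are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Survived": [0, 1, 1, 1],
    }
)

# Manual split -> apply -> combine
manual = {}
for sex, group in toy_df.groupby("Sex"):          # split into one group per key
    manual[sex] = int(group["Survived"].sum())    # apply sum, combine into a dict

# The same computation in one step
grouped = toy_df.groupby("Sex")["Survived"].sum()

print(manual)
print(grouped)
```

The loop is only meant to show the pattern; in practice the one-line form is shorter and faster.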

### Value Counts

`value_counts` is another useful function for displaying the frequency of unique values in a column. For example, to see how many of the passengers are male and how many are female, we can use the following code.

In [65]:
df["Sex"].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

The `value_counts` function is often used in combination with the `groupby` function to display more complex relationships. In this situation, `value_counts` acts as the function in the apply step of the split -> apply -> combine pattern, and is applied to each group individually.

In [70]:
df.groupby("Sex")["Survived"].value_counts()

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: count, dtype: int64
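Raw counts can be hard to compare when groups differ in size. As a sketch on made-up data, passing `normalize=True` to `value_counts` returns proportions within each group instead of counts.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame(
    {
        "Sex": ["male", "male", "female", "female", "female", "male"],
        "Survived": [0, 0, 1, 1, 0, 1],
    }
)

# Counts of each Survived value within each Sex group
counts = toy_df.groupby("Sex")["Survived"].value_counts()

# Proportions within each group (each group's values add up to 1)
rates = toy_df.groupby("Sex")["Survived"].value_counts(normalize=True)

print(counts)
print(rates)
```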

### Exercise: Utilize Aggregations

**Put your aggregation knowledge to the test!**

Now that you have initially explored the dataset, it is time to use aggregations to uncover deeper insights and better understand the data. To finish this exercise, complete the following tasks:

1. Determine the age of the youngest and oldest passengers to survive
2. Determine the average fare price for each passenger class 
3. Determine the number of passengers in each passenger class 
4. Determine the number of survivors and non-survivors for each passenger class

hint: passenger class is denoted by the column Pclass



In [None]:
# 1. Determine the age of the youngest and oldest passengers to survive
youngest_survivor = ____
print(f"Age of the youngest passenger to survive: {youngest_survivor}")

oldest_survivor = ____
print(f"Age of the oldest passenger to survive: {oldest_survivor}")

In [None]:
# 2. Determine the average fare price for each passenger class
avg_fare_price = ____
print(f"Average fare price for each passenger class: \n {avg_fare_price}")

In [None]:
# 3. Determine the number of passengers in each passenger class
passenger_count_by_pclass = ____
print(f"Number of passengers by each passenger class: \n {passenger_count_by_pclass}")

In [None]:
# 4. Determine the number of survivors and non-survivors for each passenger class
survivors_by_pclass = ____
print(f"Survivors by passenger class: \n {survivors_by_pclass}")

#### **Solution**

(Try to solve it yourself first before looking!)

Click the cells below to reveal a possible solution approach.

In [None]:
# 1. Determine the age of the youngest and oldest passengers to survive
youngest_survivor = df.loc[df["Survived"] == 1, "Age"].min()
print(f"Age of the youngest passenger to survive: {youngest_survivor}")

oldest_survivor = df.loc[df["Survived"] == 1, "Age"].max()
print(f"Age of the oldest passenger to survive: {oldest_survivor}\n")

Age of the youngest passenger to survive: 0.42
Age of the oldest passenger to survive: 80.0



In [119]:
# 2. Determine the average fare price for each passenger class
avg_fare_price = df.groupby('Pclass')['Fare'].mean()
print(f"Average fare price for each passenger class: \n {avg_fare_price}\n")

Average fare price for each passenger class: 
 Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64



In [120]:
# 3. Determine the number of passengers in each passenger class
passenger_count_by_pclass = df['Pclass'].value_counts()
print(f"Number of passengers by each passenger class: \n {passenger_count_by_pclass}\n")

Number of passengers by each passenger class: 
 Pclass
3    491
1    216
2    184
Name: count, dtype: int64



In [121]:
# 4. Determine the number of survivors and non-survivors for each passenger class
survivors_by_pclass = df.groupby('Pclass')['Survived'].value_counts()
print(f"Survivors by passenger class: \n {survivors_by_pclass}")

Survivors by passenger class: 
 Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: count, dtype: int64


## Transforming Data

In machine learning, the columns of a dataset are often referred to as features. Feature engineering is the process of cleaning, transforming, and creating features from the raw data to improve model performance. A commonly cited estimate is that machine learning engineers spend 50% to 80% of their time on feature engineering, making it one of the most important skills to master.

Thankfully, Pandas provides many methods that make feature engineering simple. In this section, we will review how to do the following: 

- cast columns to different types
- handle null values
- create new columns

### Type Casting
To change the data type of a column, use `astype`.
Common dtypes include:
- Int64
- Float64
- object
- string
- datetime64
- boolean  

Although there are many more!

In [122]:
# Convert the name column from object to string
df["Name"].astype("string")

PassengerId
1                                Braund, Mr. Owen Harris
2      Cumings, Mrs. John Bradley (Florence Briggs Th...
3                                 Heikkinen, Miss. Laina
4           Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                               Allen, Mr. William Henry
                             ...                        
887                                Montvila, Rev. Juozas
888                         Graham, Miss. Margaret Edith
889             Johnston, Miss. Catherine Helen "Carrie"
890                                Behr, Mr. Karl Howell
891                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: string

Note that this does not change the original DataFrame; `astype` returns a new Series with the updated type. To change the original DataFrame, assign the result back to the column.

In [None]:
# Updates the type of the column Name in the original DataFrame
df["Name"] = df["Name"].astype("string")
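The same assign-back pattern works for other conversions. The sketch below uses a made-up table in which fares were stored as text (an assumption for illustration) and converts them to numbers so they can be used in calculations.

```python
import pandas as pd

# Hypothetical toy data: Fare was stored as text
toy_df = pd.DataFrame({"Pclass": [1, 3, 2], "Fare": ["7.25", "71.28", "8.05"]})

# astype returns a new Series; assign it back to keep the change
toy_df["Fare"] = toy_df["Fare"].astype("float64")
toy_df["Pclass"] = toy_df["Pclass"].astype("string")

print(toy_df.dtypes)
print(toy_df["Fare"].max())  # numeric operations now work on Fare
```

Note that dtype names are case-sensitive: `"string"` and `"float64"` are valid, while `"Int64"` with a capital I refers to a separate nullable integer type.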

### Address Null Values

Most machine learning algorithms cannot directly work with null values, so it is important to have a strategy to handle them. 

If you are familiar with Python, you may know that null values are represented as `None`. In pandas, `None` can be used to represent missing values, but there are other representations as well. `NaN` (Not a Number) indicates a missing numerical value, and `NaT` (Not a Time) represents a missing datetime value. Finally, `pd.NA` is a missing value indicator that can be used across data types.
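As a quick check, the sketch below passes each of these markers to `pd.isna`, the top-level pandas function for detecting missing values. The variable name `missing_markers` is made up for this example.

```python
import numpy as np
import pandas as pd

# The four missing-value markers described above
missing_markers = [None, np.nan, pd.NaT, pd.NA]

# pd.isna reports every one of them as missing
for marker in missing_markers:
    print(repr(marker), pd.isna(marker))

# Ordinary values such as 0 and the empty string are NOT treated as missing
print(pd.isna(0), pd.isna(""))
```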

To identify null values in a DataFrame, use `isna`. This pandas method returns a Series or DataFrame with True for every entry that is null.

In [5]:
df.isna()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,False,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...
887,False,False,False,False,False,False,False,False,False,True,False
888,False,False,False,False,False,False,False,False,False,False,False
889,False,False,False,False,True,False,False,False,False,True,False
890,False,False,False,False,False,False,False,False,False,False,False


To see how many null values each column has, you can apply the aggregation operation `sum`. Since True is treated as 1 and False as 0, `df.isna().sum()` returns the number of null values in each column.

In [101]:
df.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

The simplest strategy to handle null values is to remove or "drop" them. To drop all rows containing a null value use `dropna`. 

In [95]:
dropped_df = df.dropna()

dropped_df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...
872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


Note that `df.dropna()` does NOT change the original DataFrame; it returns a new object with all rows containing null values removed.
To update the original DataFrame so that it no longer includes any null values, use `df = df.dropna()`.

Dropping all rows with a null value can potentially exclude large amounts of valuable data. In this Titanic example, `dropna` reduced the number of rows in the DataFrame from 891 to 183!
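One way to limit the loss is to drop rows only when specific columns are null, using the `subset` parameter of `dropna`. The sketch below uses made-up data and hypothetical variable names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with nulls in two columns
toy_df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Cabin": [None, "C85", None, "C123"],
    }
)

# Drops every row that has a null in any column
drop_any = toy_df.dropna()

# Drops a row only if its Age is null; nulls in Cabin are ignored
drop_age_only = toy_df.dropna(subset=["Age"])

print(len(toy_df), len(drop_any), len(drop_age_only))  # 4 1 3
```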

Because of this, machine learning engineers often prefer to "impute" or fill in missing values with some other value. One easy way to do this in pandas is to use `fillna` and provide the value to replace null values with.

In [None]:
# replace all null values in the Age column with 0
df["Age"] = df["Age"].fillna(0)

df.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
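
Filling with a constant such as 0 is simple, but it can shift the statistics of a column. The following sketch, using a made-up Series of ages (the values are assumptions for illustration only), compares the mean before and after filling with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([20.0, 30.0, np.nan, 40.0], name="Age")

# mean() skips null values by default: (20 + 30 + 40) / 3
mean_before = ages.mean()

# After filling with 0, the zero counts toward the mean: 90 / 4
filled_zero = ages.fillna(0)
mean_after = filled_zero.mean()

print(mean_before, mean_after)
```

Here the mean drops from 30.0 to 22.5, which is why the choice of fill value deserves some thought.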

One popular imputation strategy in machine learning is mean or mode imputation.

If a column contains numerical data, we calculate the mean of the column and replace all missing values in that column with the mean. A column is numerical if it has a numeric data type such as integer or float.

In machine learning, however, you will often deal with data that is not numeric. For example, the Embarked column takes on the values 'S', 'C', and 'Q'. Since we can't take the average of this column, we use mode imputation. As a review, the mode of a column is the value that appears most frequently.

In [None]:
# fill null age values with the average age
mean_age = df["Age"].mean()
df["Age"] = df["Age"].fillna(mean_age)

# fill null embarked values with the mode
# mode() returns a Series because ties are possible, so take the first entry
embarked_mode = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(embarked_mode)

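
The mean-or-mode rule described above can also be applied to several columns at once. The sketch below is one possible approach, not the tutorial's own method: it loops over the columns of a made-up DataFrame (`toy` is an assumption for illustration) and uses `pandas.api.types.is_numeric_dtype` to decide between the mean and the mode.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data with one numeric and one non-numeric column
toy = pd.DataFrame(
    {
        "Age": [20.0, np.nan, 40.0],
        "Embarked": ["S", None, "S"],
    }
)

for col in toy.columns:
    if is_numeric_dtype(toy[col]):
        # numeric column: fill with the mean of the non-null values
        fill_value = toy[col].mean()
    else:
        # non-numeric column: fill with the most frequent value
        fill_value = toy[col].mode()[0]
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```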

### Create New Columns

An important step in the feature engineering process is creating new columns or features. Well-designed features can expose patterns and relationships not immediately present in the raw data, often leading to significant improvements in model performance. 

Pandas provides several convenient ways to create new columns in a DataFrame.

To create a new column with a constant value, use `df['new_column_name'] = val`. For example, the following code creates a new column 'Ship' in the DataFrame with the constant value 'Titanic'.

In [None]:
df["Ship"] = "Titanic"
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_Centered,Ship
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,-7.699118,Titanic
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,8.300882,Titanic
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,-3.699118,Titanic
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,5.300882,Titanic
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,5.300882,Titanic


It is also possible to create new columns based on existing columns in the DataFrame. The following code creates a new column whose value in each row is that passenger's age minus the mean age of everyone on board.

In [None]:
df["Age_Centered"] = df["Age"] - df["Age"].mean()
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_Centered
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,-7.699118
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,8.300882
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,-3.699118
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,5.300882
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,5.300882
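
In the subtraction above, the mean is a single number that pandas subtracts from every row of the column. A quick sanity check, sketched here on the five ages shown in the output (the standalone Series `ages` is an assumption for illustration), is that a centered column should have a mean of approximately zero.

```python
import pandas as pd

# Ages of the first five passengers shown in the output above
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0], name="Age")

# The scalar mean is subtracted from every element
centered = ages - ages.mean()

# The centered values should average out to (approximately) zero
print(abs(centered.mean()) < 1e-9)
```

Note that the mean here is taken over these five values only, so the centered values differ from the `Age_Centered` column computed on the full dataset.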


Pandas also lets you create a new column from a condition evaluated on other columns. In the example below, the value in the column 'Is_Female_Survivor' is `True` when the passenger survived AND is female. In a pandas expression, "and" is written with a single `&` symbol, and each comparison must be wrapped in parentheses because `&` and `|` bind more tightly than comparison operators such as `==`. Pandas uses the following operators for conditional statements.

- `&`  - and
- `|`  - or
- `==` - equal
- `!=` - not equal
- `>`  - greater than
- `<`  - less than
- `>=` - greater than or equal to
- `<=` - less than or equal to

In [37]:
df["Is_Female_Survivor"] = (df["Survived"] == 1) & (df["Sex"] == "female")
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Is_Female_Survivor
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,False
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,True
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,True
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,True
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,False
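
The other operators in the list combine in the same way. The sketch below uses a made-up DataFrame (`toy` and the new column names are assumptions for illustration) to show `|`, `!=`, and also `~`, which is not in the list above but negates a boolean column.

```python
import pandas as pd

# Hypothetical toy data with a few Titanic-style columns
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0],
        "Sex": ["male", "female", "male", "female"],
        "Pclass": [3, 1, 2, 3],
    }
)

# "or": True when the passenger is female OR travelled first class
toy["Female_Or_First_Class"] = (toy["Sex"] == "female") | (toy["Pclass"] == 1)

# "and" with "not equal": survived AND was not in third class
toy["Survived_Not_Third"] = (toy["Survived"] == 1) & (toy["Pclass"] != 3)

# "~" flips every True/False value in a boolean Series
toy["Did_Not_Survive"] = ~(toy["Survived"] == 1)

print(toy)
```

Each comparison sits in its own parentheses; leaving them out changes how Python groups the expression and typically raises an error.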


The `apply` function in pandas lets you apply a custom function to the rows or columns of a DataFrame. It's commonly used to create new columns based on the values of existing ones. When called on a single column (a Series), `apply` runs the function on each value in that column. To use `apply`, use the following syntax:

`df.apply(function)`, where function is the function you wish to apply.

For example, the following code creates a new column 'Is_Child' that gets the value `True` when age is less than 18, and `False` otherwise.

In [None]:
# Create a user-defined function to check if the passenger is a child
def is_child(age):
    return age < 18

# Pass the user defined function name as the argument to apply
df["Is_Child"] = df["Age"].apply(is_child)
df.iloc[6:9]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Is_Child
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,False
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,True
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,False


In [None]:
# Use a lambda function as the argument to apply
df["Is_Child"] = df["Age"].apply(lambda x: x < 18)
df.iloc[6:9]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Is_Child
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,False
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,True
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,False
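
The examples above call `apply` on a single column. To work with several columns at once, `apply` can also be called on the DataFrame with `axis=1`, which passes each row to the function. The sketch below is a minimal illustration on made-up data that mirrors the three rows shown above; the helper `family_size` and the column 'Family_Size' are hypothetical names, not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data mirroring passengers 7, 8, and 9 above
toy = pd.DataFrame(
    {
        "Age": [54.0, 2.0, 27.0],
        "SibSp": [0, 3, 0],
        "Parch": [0, 1, 2],
    }
)

# Hypothetical helper: siblings/spouses + parents/children + the passenger
def family_size(row):
    # Row values may be upcast to float when column types differ,
    # so convert the result back to an integer
    return int(row["SibSp"] + row["Parch"] + 1)

# axis=1 passes each row (as a Series) to the function
toy["Family_Size"] = toy.apply(family_size, axis=1)

# For a simple comparison, the condition can be written directly on the column
toy["Is_Child"] = toy["Age"] < 18

print(toy)
```

For simple arithmetic or comparisons, writing the expression directly on the columns is usually shorter and faster than `apply`; `apply` is most useful when the logic does not fit a single column expression.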


### Exercise: Feature Engineering

**Put your feature engineering skills to the test!**

Now that you have explored the Titanic dataset, it is time to perform feature engineering!

For this exercise you will alter and add to the existing Titanic DataFrame. To complete the exercise, do the following:
1. Cast the 'Name' column type to 'string'
2. Replace all null values in the 'Age' column with the mean age
3. Create a new column called 'Fare_centered' that is the difference between a passenger's Fare and the mean Fare.
4. Create a new column called 'Is_Adult_Male' that is `True` when a passenger is male and 18 years or older, and `False` otherwise


In [None]:
# 1. cast column type here
df["Name"] = ____

In [None]:
# 2. Replace all null values in the 'Age' column with the mean age
mean_age = ____
df["Age"] = df["Age"].____

In [None]:
# 3. Create a new column called 'Fare_centered' that is the difference of the Fare price for a passenger and the mean Fare price.
df["Fare_centered"] = ____

In [None]:
# 4. Create a new column called 'Is_Adult_Male' that is True when a passenger is male and 18 years or older, and False otherwise
df["Is_Adult_Male"] = ____
df.head()

#### **Solution**

(Try to solve it yourself first before looking!)

Click the cell below to reveal a possible solution approach.

In [None]:
# 1. cast column type here
df["Name"] = df["Name"].astype("string")

In [None]:
# 2. Replace all null values in the 'Age' column with the mean age
mean_age = df["Age"].mean()
# fillna returns a new Series, so assign the result back to the column
df["Age"] = df["Age"].fillna(mean_age)

In [None]:
# 3. Create a new column called 'Fare_centered' that is the difference of the Fare price for a passenger and the mean Fare price.
df["Fare_centered"] = df["Fare"] - df["Fare"].mean()

In [None]:
# 4. Create a new column called 'Is_Adult_Male' that is True when a passenger is male and 18 years or older, and False otherwise
df["Is_Adult_Male"] = (df["Age"] >= 18) & (df["Sex"] == "male")
df.head(10)

## Next Steps

Congratulations on completing the Pandas Introductory Tutorial! 🎓🎉 You now have the skills to load, explore, and transform a dataset into a form that's ready for machine learning!

If you want to further your Pandas education, check out some of these resources:
- Official Pandas Documentation: https://pandas.pydata.org/docs/user_guide/10min.html
- Kaggle Pandas Tutorial: https://www.kaggle.com/learn/pandas
- YouTube channel with various tutorials on Pandas and other data science topics: https://www.youtube.com/@robmulla

If you want to learn about other popular Python packages for data processing, check out Polars! Polars is a modern DataFrame library designed to improve on Pandas in memory usage and speed. Find the official documentation here: https://docs.pola.rs/
