# Pandas Tutorial

Author: Kellen Sullivan

This is an introductory tutorial for Pandas, the popular Python package for data handling. If you are unfamiliar with NumPy, you should complete the NumPy tutorial before starting this one.

## DataFrames

DataFrames are the essential underlying data structure for storing data in pandas. They are tables with labeled columns and indexed rows. A Series is a similar data type in pandas, but it holds only a single labeled column.

First we import the pandas package. Note that `as pd` is a common alias used to shorten the package name when invoking the package.

In [None]:
import pandas as pd

### 1. Create a DataFrame

In [None]:
df = pd.DataFrame(
    {
        'Color': ["red", "blue", "green", "purple", "white", "orange"],
        'Price' : [5, 8, 3, 4, 9, 5]
    }
)

# display the type of a Pandas DataFrame
type(df)

pandas.core.frame.DataFrame

### 2. Create a Series 

In [None]:
# A Series holds a single column of values; `name` labels that column
s = pd.Series(
    ["red", "blue", "green", "purple", "white"],
    name="Color",
)

# display the type of a Pandas Series
type(s)

pandas.core.series.Series
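
To make the relationship between the two types concrete, here is a minimal sketch showing that selecting one column of a DataFrame gives back a Series. The `items` table and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column table used only for this illustration
items = pd.DataFrame({"Color": ["red", "blue"], "Price": [5, 8]})

# Selecting a single column by name returns a Series
color = items["Color"]
print(type(color).__name__)  # Series
print(color.name)            # Color

# Selecting with a list of names returns a DataFrame, even for one column
color_frame = items[["Color"]]
print(type(color_frame).__name__)  # DataFrame
```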

## Load and Explore DataFrames

As shown above, DataFrames can be created from dictionaries. However, most of the time you will want to work with data that is stored elsewhere. One of the most common formats for storing data is a CSV (Comma-Separated Values) file. Pandas has built-in functions to load data from many file types into a DataFrame.

For this tutorial, we will explore the Titanic passenger dataset, which is popular for teaching classification problems.

Use `read_csv()` to read data from a CSV file into a Pandas DataFrame. Pass the file path (or URL) of the CSV file as the first argument.

We also include `index_col="PassengerId"` to set the index of the DataFrame to be the PassengerId column.

In [79]:
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv", index_col="PassengerId")
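
If you want to try `read_csv()` without a network connection, the sketch below reads from an in-memory string instead of a URL. The CSV text is a made-up stand-in, not the Titanic data; `io.StringIO` simply makes the string behave like a file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk or at a URL
csv_text = """PassengerId,Name,Age
1,Alice,22
2,Bob,38
3,Cara,
"""

# read_csv accepts a path, a URL, or any file-like object
people = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(people.index.name)      # PassengerId
print(list(people.columns))   # ['Name', 'Age']
print(people.loc[3, "Age"])   # nan, because the field is empty in the CSV
```

Because `PassengerId` became the index, it no longer appears among the regular columns.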

Use `head()` to display the first 5 rows of a DataFrame. You can also provide a value n to display the first n rows. For example `head(20)` displays the first 20 rows of a DataFrame.

In [80]:
# default displays the first 5 rows of a DataFrame
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The `tail()` function works like `head()`, but displays the last 5 rows of a DataFrame.

In [81]:
df.tail()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
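
Both functions accept a row count. A minimal sketch on a made-up ten-row table (the `numbers` name is only for illustration):

```python
import pandas as pd

# Hypothetical table with ten rows holding the values 1 through 10
numbers = pd.DataFrame({"n": range(1, 11)})

first_three = numbers.head(3)  # first 3 rows
last_two = numbers.tail(2)     # last 2 rows

print(first_three["n"].tolist())  # [1, 2, 3]
print(last_two["n"].tolist())     # [9, 10]
```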


DataFrames have a few attributes that allow data scientists to quickly get general information about a dataset. To access an attribute, use the syntax `dfname.attribute`.

`.dtypes` displays all columns in a DataFrame and their corresponding data types.

In [82]:
print(df.dtypes)

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object


`.shape` returns a tuple with the number of rows followed by the number of columns.

In [83]:
print(df.shape)

(891, 11)


`.size` returns the total number of elements in a DataFrame. In other words, it returns the product of the number of rows and columns in a DataFrame. 

In [84]:
print(df.size)

9801
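
The relationship between `.shape` and `.size` can be checked on a small table. This is a minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical table with 3 rows and 2 columns
small = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

rows, cols = small.shape
print(small.shape)                # (3, 2)
print(small.size)                 # 6
print(small.size == rows * cols)  # True
```

The same arithmetic explains the Titanic output above: 891 rows times 11 columns gives 9801 elements.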


## Selecting Data

Pandas provides multiple ways to select subsets of data within a DataFrame. In this tutorial we will go over four such methods:

- `[]` or `.` syntax
- `loc`
- `iloc`
- `query`

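Note that selecting data does not alter the existing DataFrame. Depending on the operation and the pandas version, the result may be a view of the original data or a copy, so call `.copy()` on a selection before modifying it if the original must stay unchanged.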
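
As a minimal sketch of that point, the example below selects a column, then modifies an explicit copy and confirms the original table is untouched. The `items` table is made up for illustration.

```python
import pandas as pd

# Hypothetical table used only for this illustration
items = pd.DataFrame({"Color": ["red", "blue"], "Price": [5, 8]})

# Selecting a column does not change the original DataFrame
prices = items["Price"]
print(items.shape)  # (2, 2)

# Take an explicit copy before modifying a selection
prices_copy = items["Price"].copy()
prices_copy.loc[0] = 100

print(prices_copy.loc[0])      # 100
print(items.loc[0, "Price"])   # 5, the original is unchanged
```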

The simplest way to select a subset of data is similar to selecting an element in a standard python list or dictionary.

- To select a column, use `[]` with the column name inside.
- You can also use a `.` followed by the column name. 

Both forms produce the same result.

In [92]:
print(df["Age"])

print(df.Age)

PassengerId
1      22.0
2      38.0
3      26.0
4      35.0
5      35.0
       ... 
887    27.0
888    19.0
889     NaN
890    26.0
891    32.0
Name: Age, Length: 891, dtype: float64
PassengerId
1      22.0
2      38.0
3      26.0
4      35.0
5      35.0
       ... 
887    27.0
888    19.0
889     NaN
890    26.0
891    32.0
Name: Age, Length: 891, dtype: float64


When using the bracket syntax, you can also pass a list to select multiple columns, which the `.` syntax cannot do. Both forms are valid, and the best one to use depends on the context.

In [108]:
columns_to_keep = ["Sex", "Age", "Survived"]

print(df[columns_to_keep])

                Sex   Age  Survived
PassengerId                        
1              male  22.0         0
2            female  38.0         1
3            female  26.0         1
4            female  35.0         1
5              male  35.0         0
...             ...   ...       ...
887            male  27.0         0
888          female  19.0         1
889          female   NaN         0
890            male  26.0         1
891            male  32.0         0

[891 rows x 3 columns]
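
There are a couple of other cases where only the bracket syntax works. The sketch below uses a made-up table whose column names were chosen to trigger them: one name contains a space, and another (`size`) collides with an existing DataFrame attribute.

```python
import pandas as pd

# Hypothetical table; the column names are chosen to show the edge cases
table = pd.DataFrame({"Age": [22, 38], "Ticket Class": [3, 1], "size": [10, 20]})

# For a simple column name, both forms return the same Series
print(table["Age"].equals(table.Age))  # True

# A column name containing a space can only be reached with brackets
print(table["Ticket Class"].tolist())  # [3, 1]

# "size" is already a DataFrame attribute, so the dot form returns
# the element count instead of the column
print(table.size)              # 6
print(table["size"].tolist())  # [10, 20]
```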


### loc

Pandas provides `loc` and `iloc` as indexing options for more advanced operations. `loc` selects values based on row and column labels, as opposed to `iloc`, which selects values based on integer position.

To select data from a DataFrame using `loc`, provide the row label(s) and column label(s) to select. A bare `:` selects all rows.

In [117]:
df.loc[:, ["Sex", "Ticket"]]

PassengerId,Sex,Ticket
1,male,A/5 21171
2,female,PC 17599
3,female,STON/O2. 3101282
4,female,113803
5,male,373450
...,...,...
887,male,211536
888,female,112053
889,female,W./C. 6607
890,male,111369
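
A few more `loc` patterns, sketched on a small made-up table with string row labels so that the labels are easy to tell apart from positions:

```python
import pandas as pd

# Hypothetical table indexed by fruit name
fruit = pd.DataFrame(
    {"Color": ["red", "yellow", "green"], "Price": [5, 8, 3]},
    index=["apple", "banana", "lime"],
)

# One row label and one column label return a single value
print(fruit.loc["banana", "Price"])  # 8

# Label slices include both endpoints
subset = fruit.loc["apple":"banana", ["Color"]]
print(subset.index.tolist())  # ['apple', 'banana']

# A boolean mask can be used in place of row labels
cheap = fruit.loc[fruit["Price"] < 6, "Color"]
print(cheap.tolist())  # ['red', 'green']
```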


### iloc

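`iloc` performs position-based selection: it selects rows and columns by their zero-based integer position, regardless of the index labels. For example, `df.iloc[0]` returns the first row of the DataFrame as a Series.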

In [93]:
df.iloc[0]

Survived                          0
Pclass                            3
Name        Braund, Mr. Owen Harris
Sex                            male
Age                            22.0
SibSp                             1
Parch                             0
Ticket                    A/5 21171
Fare                           7.25
Cabin                           NaN
Embarked                          S
Name: 1, dtype: object
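
Here is a minimal sketch of a few more `iloc` patterns on a small made-up table. Unlike label slices with `loc`, position slices exclude the end position, just like Python lists.

```python
import pandas as pd

# Hypothetical table indexed by fruit name
fruit = pd.DataFrame(
    {"Color": ["red", "yellow", "green"], "Price": [5, 8, 3]},
    index=["apple", "banana", "lime"],
)

# Row position 0, column position 1
print(fruit.iloc[0, 1])  # 5

# Position slices exclude the end, so this returns rows 0 and 1
first_two = fruit.iloc[0:2]
print(first_two.index.tolist())  # ['apple', 'banana']

# Negative positions count from the end
print(fruit.iloc[-1]["Color"])  # green
```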

### query

`query()` is a useful function to select elements in a DataFrame based on a condition. For example, you can select all elements in the DataFrame where the passenger is a male and survived using the following statement.

In [104]:
df.query("Sex == 'male' and Survived == 1")

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0000,D56,S
24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5000,A6,S
37,1,3,"Mamee, Mr. Hanna",male,,0,0,2677,7.2292,,C
56,1,1,"Woolner, Mr. Hugh",male,,0,0,19947,35.5000,C52,S
...,...,...,...,...,...,...,...,...,...,...,...
839,1,3,"Chip, Mr. Chang",male,32.0,0,0,1601,56.4958,,S
840,1,1,"Marechal, Mr. Pierre",male,,0,0,11774,29.7000,C47,C
858,1,1,"Daly, Mr. Peter Denis",male,51.0,0,0,113055,26.5500,E17,S
870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,,S


To refer to a Python variable within a query statement, use `@` followed by the variable name.

In [None]:
cutoff_age = 30

df.query("Age < @cutoff_age")

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...,...,...
884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
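
A `query()` condition can also be written as a boolean mask inside `[]`. The sketch below, on a made-up three-row table, shows that the two forms select the same rows. Note that a missing age never satisfies a comparison, so the row with no age is excluded.

```python
import pandas as pd

# Hypothetical table; the last row has a missing age
people = pd.DataFrame(
    {
        "Sex": ["male", "female", "male"],
        "Age": [22.0, 38.0, None],
        "Survived": [0, 1, 1],
    }
)

cutoff_age = 30

# The same condition written as a query string and as a boolean mask
by_query = people.query("Sex == 'male' and Age < @cutoff_age")
by_mask = people[(people["Sex"] == "male") & (people["Age"] < cutoff_age)]

print(by_query.equals(by_mask))  # True
print(by_query.index.tolist())   # [0]
```

With a mask, each condition needs its own parentheses and the operators are `&` and `|` rather than `and` and `or`.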


### Exercise: Perform Exploratory Data Analysis

**Time to put your knowledge to the test!**

An important skill for machine learning is fully exploring a dataset to find key insights. Pandas provides various functions to support data scientists and machine learning engineers in this task.

For this exercise you will explore a very popular dataset about passengers on the Titanic. To complete this exercise, do the following:
1. Display the first 10 rows of the DataFrame
2. Determine the number of rows and columns in the DataFrame
3. Display all columns and their data types
4. Create a subset of the original DataFrame that contains all passengers that are female OR under 18 years old

In [137]:
# 1. Display the first 10 rows of the DataFrame


In [None]:
# 2. Determine the number of rows and columns in the DataFrame


# 3. Display all columns and their datatypes



In [None]:
# 4. Create a DataFrame containing only passengers that are female OR under 18 years old
sub_df = ...  # replace ... with your solution

sub_df.head(10)

#### **Solution**

(Try to solve it yourself first before looking!)

Click the cell below to reveal a possible solution approach.

In [133]:
# 1. Display the first 10 rows of the DataFrame
df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [134]:
# 2. Determine the number of rows and columns in the DataFrame
print(df.shape)

# 3. Display all columns and their datatypes
print(df.dtypes)

(891, 11)
Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object


In [136]:
# 4. Create a DataFrame containing only passengers that are female OR under 18 years old
sub_df = df.query("Sex == 'female' or Age < 18")
sub_df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S


## Aggregations

Topics for this section:

- `mean`, `min`, `max`
- `groupby`
- `map` and `apply` functions
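
The section above only lists its topics, so the following is a minimal sketch of what each one does, not the tutorial's own walkthrough. The `fares` table and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical table of fares by passenger class
fares = pd.DataFrame(
    {"Pclass": [1, 1, 3, 3], "Fare": [80.0, 60.0, 8.0, 12.0]}
)

# mean, min, and max summarize a whole column
print(fares["Fare"].mean())                      # 40.0
print(fares["Fare"].min(), fares["Fare"].max())  # 8.0 80.0

# groupby splits the rows by a key column and aggregates each group
mean_by_class = fares.groupby("Pclass")["Fare"].mean()
print(mean_by_class.loc[1], mean_by_class.loc[3])  # 70.0 10.0

# map replaces each value using a dictionary (or a function)
labels = fares["Pclass"].map({1: "first", 3: "third"})
print(labels.tolist())  # ['first', 'first', 'third', 'third']

# apply runs a function on every value of a Series
doubled = fares["Fare"].apply(lambda fare: fare * 2)
print(doubled.tolist())  # [160.0, 120.0, 16.0, 24.0]
```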

## Transforming Data

Data scientists will often receive data in a state that is not suitable for machine learning models. Thankfully, Pandas provides many useful ways to transform and manipulate data into a form that is ideal for A.I. applications. In this section, we will review how to do the following:

- cast columns to different types
- manipulate null values
- create new columns
- join DataFrames

To change the data type of a column, use `astype()`.
Common dtypes include:
- Int64
- Float64
- object
- string
- datetime64[ns]
- boolean

There are many more!

In [None]:
# Convert the name column from object to string
df['Name'].astype('string')

PassengerId
1                                Braund, Mr. Owen Harris
2      Cumings, Mrs. John Bradley (Florence Briggs Th...
3                                 Heikkinen, Miss. Laina
4           Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                               Allen, Mr. William Henry
                             ...                        
887                                Montvila, Rev. Juozas
888                         Graham, Miss. Margaret Edith
889             Johnston, Miss. Catherine Helen "Carrie"
890                                Behr, Mr. Karl Howell
891                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: string

Note that `astype()` does not change the original DataFrame. Instead, it returns a new Series with the updated type. To keep the change, assign the result back to the column in the original DataFrame.

In [None]:
# The dtype name is case-sensitive: use 'string', not 'String'
df['Name'] = df['Name'].astype('string')
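
The difference between calling `astype()` and assigning its result back can be sketched on a small made-up table, where the fares were loaded as text:

```python
import pandas as pd

# Hypothetical table where the fares arrived as text
tickets = pd.DataFrame({"Fare": ["7.25", "71.28"], "Pclass": [3, 1]})

# astype returns a new Series; the column in the table keeps its old type
converted = tickets["Fare"].astype("float64")
print(tickets["Fare"].dtype == converted.dtype)  # False

# Assign the result back to make the change stick
tickets["Fare"] = tickets["Fare"].astype("float64")
tickets["Pclass"] = tickets["Pclass"].astype("string")

print(tickets["Fare"].dtype)       # float64
print(tickets["Pclass"].tolist())  # ['3', '1']
```

Once `Fare` is numeric, arithmetic such as `tickets["Fare"].sum()` works on numbers instead of joining strings together.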

## Next Steps

Congratulations on completing the Pandas Introductory Tutorial! 🎓🎉

Here are some resources to further your Pandas education:
- Official Pandas Documentation: https://pandas.pydata.org/docs/user_guide/10min.html
- Kaggle Pandas Tutorial: https://www.kaggle.com/learn/pandas
- YouTube channel with various tutorials on Pandas and other data science topics: https://www.youtube.com/@robmulla

Polars is a modern DataFrame library that aims to improve on Pandas in memory usage and performance. Find its official documentation here:
https://docs.pola.rs/
