# Pandas Tutorial
A comprehensive yet beginner-friendly tutorial on **pandas**, a popular Python library for data manipulation and analysis.

In this tutorial, we will cover:

 - Installation and import of the pandas library.
 - An introduction to Pandas Series, highlighting its similarity to NumPy arrays.
 - Creating DataFrames from various data sources.
 - Basic data inspection, selection, indexing, and filtering.
 - Modifying DataFrames and performing calculations.
 - Grouping, merging, and finally saving/loading data in different formats.

  **Note:** Remember that a pandas DataFrame can be thought of as a collection of Series objects, where each column is a Series.

In [1]:
# pandas is already included in our environment, so there is no need to install it.
# If you are not using our environment, you can install pandas with the command:

#!pip install pandas



 ## 1. Installation and Import



 First, install pandas if it is not already installed, then import it into your Python environment.

In [1]:
import pandas as pd


 ### 1.1. Pandas Series: An Introduction



 A **Pandas Series** is a one-dimensional labeled array capable of holding any data type. If you're already familiar with NumPy arrays, you'll notice that a Series behaves similarly but with added flexibility through indexing (labels for each element).



 In fact, a DataFrame is essentially a collection of Series objects (each column is a Series), which means many operations applicable to arrays can also be performed on Series.

In [2]:
# Creating a Pandas Series from a list
data_series = pd.Series([10, 20, 30, 40])
display("Pandas Series:")
display(data_series)

# Demonstrate that DataFrame columns are Series
df_series_example = pd.DataFrame({
    "Numbers": data_series, 
    "Squared": data_series ** 2
})
display("\nDataFrame constructed from Series:")
display(df_series_example)


'Pandas Series:'

0    10
1    20
2    30
3    40
dtype: int64

'\nDataFrame constructed from Series:'

Unnamed: 0,Numbers,Squared
0,10,100
1,20,400
2,30,900
3,40,1600
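
 To make the "labels for each element" idea more concrete, here is a minimal sketch. The labels `a` to `d` and the variable names are made up for illustration; they are not part of the example above. It shows label-based access, NumPy-style element-wise arithmetic, and how two Series are aligned by label rather than by position.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
prices = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Label-based access, which a plain NumPy array does not offer
price_b = prices["b"]

# Arithmetic is element-wise, as with NumPy arrays
doubled = prices * 2

# Operations between two Series align on labels, not on position;
# labels missing from `other` are treated as 0 here via fill_value
other = pd.Series([1, 2], index=["d", "a"])
aligned_sum = prices.add(other, fill_value=0)

# The underlying values can be extracted as a NumPy array
values = prices.to_numpy()

print(price_b)
print(doubled)
print(aligned_sum)
print(type(values))
```

 Note that `aligned_sum` pairs `d` with `d` and `a` with `a` even though `other` lists them in a different order.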


 ## 2. Creating DataFrames



 A **DataFrame** is the core data structure in pandas — think of it as a table with rows and columns. You can create a DataFrame from various sources. Below are a few common methods:

 ### 2.1. From a Dictionary of Lists



 Here, each key in the dictionary represents a column name, and the corresponding value is a list of data for that column.

In [3]:
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}

df = pd.DataFrame(data)
display(df)


Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago


 ### 2.2. From a List of Dictionaries



 In this approach, each dictionary in the list represents a row of data.

In [4]:
data_list = [
    {"Name": "Alice",   "Age": 25, "City": "New York"},
    {"Name": "Bob",     "Age": 30, "City": "Los Angeles"},
    {"Name": "Charlie", "Age": 35, "City": "Chicago"},
    {"Name": "Vijay", "Age": 28, "City": "St. Louis"},
    {"Name": "Jin", "Age": 35, "City": "Orlando"},
    {"Name": "Lucas", "Age": 31, "City": "Bloomington"}
]
df2 = pd.DataFrame(data_list)
display(df2)



Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago
3,Vijay,28,St. Louis
4,Jin,35,Orlando
5,Lucas,31,Bloomington
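
 Two other sources that often come up are NumPy arrays and dictionaries of Series. The sketch below uses made-up column names (`x`, `y`) and values; it is only meant to illustrate the constructor, not to extend the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data: 3 rows, 2 columns
array = np.array([[1, 2], [3, 4], [5, 6]])

# From a 2-D NumPy array: column names must be supplied separately
df_from_array = pd.DataFrame(array, columns=["x", "y"])

# From a dictionary of Series: each Series becomes one column
df_from_series = pd.DataFrame({
    "x": pd.Series([1, 3, 5]),
    "y": pd.Series([2, 4, 6]),
})

print(df_from_array)
print(df_from_series)
```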


 ## 3. Basic Data Inspection



 After creating or loading a DataFrame, it's important to inspect your data. Common methods include:



 - **`df.head()`**: View the first few rows.

 - **`df.tail()`**: View the last few rows.

 - **`df.shape`**: Get the number of rows and columns.

 - **`df.columns`**: List all column names.

 - **`df.info()`**: Get a summary including data types and non-null counts.

 - **`df.describe()`**: Compute basic statistics for numerical columns.

In [5]:
print("First 2 rows:")
display(df.head(2))       # First 2 rows (df.head() with no argument returns the first 5)
print("\nLast 2 rows:")
display(df.tail(2))       # Last 2 rows (df.tail() with no argument returns the last 5)
print("\nShape of DataFrame:")
display(df.shape)        # (rows, columns)
print("\nColumn Names:")
display(df.columns)      # List of column names
print("\nDataFrame Info:")
display(df.info())       # Summary (types, non-null counts); info() prints directly and returns None, so a trailing None is displayed
print("\nStatistical Summary:")
display(df.describe())   # Basic statistics for numeric columns


First 2 rows:


Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles



Last 2 rows:


Unnamed: 0,Name,Age,City
1,Bob,30,Los Angeles
2,Charlie,35,Chicago



Shape of DataFrame:


(3, 3)


Column Names:


Index(['Name', 'Age', 'City'], dtype='object')


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes


None


Statistical Summary:


Unnamed: 0,Age
count,3.0
mean,30.0
std,5.0
min,25.0
25%,27.5
50%,30.0
75%,32.5
max,35.0
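
 A few more inspection calls can complement the list above. This is a small sketch on a rebuilt copy of the same three-row table (named `df_demo` here so it does not overwrite `df`); `dtypes`, `isna()` and `describe(include="all")` are standard pandas, while the variable names are chosen only for this example.

```python
import pandas as pd

# Same data as the DataFrame inspected above, rebuilt so the block is self-contained
df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# shape is an attribute (no parentheses) and unpacks into rows and columns
n_rows, n_cols = df_demo.shape

# columns is an Index object; tolist() turns it into a plain Python list
column_names = df_demo.columns.tolist()

# One data type per column
column_types = df_demo.dtypes

# Number of missing values in each column
missing_per_column = df_demo.isna().sum()

# include="all" also summarizes the non-numeric columns
full_summary = df_demo.describe(include="all")

print(n_rows, n_cols)
print(column_names)
print(column_types)
print(missing_per_column)
print(full_summary)
```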


 ### Knowledge Check: DataFrame Inspection



 Consider the DataFrame you just inspected. Write code to:

 1. Print the first 3 rows.

 2. Retrieve the list of column names.

 3. Summarize the DataFrame using `.info()`.



 *Hint: Use the appropriate DataFrame methods to achieve these tasks.*

In [8]:
# Your solution here:
# 1. Print the first 3 rows.
print("First 3 rows:")
display(df.head(3))

# 2. Print the list of column names.
print("Column Names:")
display(df.columns)

# 3. Display the DataFrame information.
print("DataFrame Info:")
display(df.info())


First 3 rows:


Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago


Column Names:


Index(['Name', 'Age', 'City'], dtype='object')

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes


None

 ## 4. Selecting and Indexing Data



 Pandas offers multiple ways to select or filter data within a DataFrame.



 ### 4.1. Dot Notation / Bracket Notation



 - **Dot Notation**: Simplifies access for columns with simple names.

 - **Bracket Notation**: More flexible; it supports column names with spaces or special characters.

In [9]:
# Dot notation (for simple column names without spaces/special chars)
display("Using dot notation to access 'Age':")
display(df.Age)

# Bracket notation
display("\nUsing bracket notation to access 'Age':")
display(df["Age"])


"Using dot notation to access 'Age':"

0    25
1    30
2    35
Name: Age, dtype: int64

"\nUsing bracket notation to access 'Age':"

0    25
1    30
2    35
Name: Age, dtype: int64
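
 The difference between the two notations shows up when a column name is not a plain identifier. The sketch below uses hypothetical columns (`Home City`, `count`) picked to trigger the two common pitfalls: a name containing a space and a name that clashes with an existing DataFrame method.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Home City": ["NY", "LA"],  # contains a space: dot notation is not possible
    "count": [1, 2],            # clashes with the DataFrame.count() method
})

# Bracket notation works for any column name
home = df_demo["Home City"]
counts = df_demo["count"]

# Dot notation resolves to the method, not to the column
dot_is_method = callable(df_demo.count)

# A list inside the brackets selects several columns and returns a DataFrame
subset = df_demo[["Name", "count"]]

print(home)
print(counts)
print(dot_is_method)
print(subset)
```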

 ### 4.2. Row Selection with `.loc` and `.iloc`



 - **`.loc`** selects rows and columns by **label**.

 - **`.iloc`** selects rows and columns by **integer position**.

In [10]:
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 30, 35, 28],
    "City": ["NY", "LA", "Chicago", "NY"]
}, index=["row1", "row2", "row3", "row4"])  # custom index labels

display(df)
display("Using .loc (label-based):")
display(df.loc["row2"])               # Entire row labeled 'row2'
display(df.loc["row2", "Age"])        # Specific cell (row2, Age)
display(df.loc["row1":"row3"])        # Slice multiple rows by label
display(df.loc[:, ["Name", "City"]])  # All rows, only these columns

display("\nUsing .iloc (integer-based):")
display(df.iloc[1])                   # 2nd row (since indexing starts at 0)
display(df.iloc[1, 1])                # Cell at row index 1, col index 1
display(df.iloc[0:2])                 # Rows 0 to 1
display(df.iloc[:, [0, 2]])           # All rows, columns 0 and 2


Unnamed: 0,Name,Age,City
row1,Alice,25,NY
row2,Bob,30,LA
row3,Charlie,35,Chicago
row4,Dave,28,NY


'Using .loc (label-based):'

Name    Bob
Age      30
City     LA
Name: row2, dtype: object

30

Unnamed: 0,Name,Age,City
row1,Alice,25,NY
row2,Bob,30,LA
row3,Charlie,35,Chicago


Unnamed: 0,Name,City
row1,Alice,NY
row2,Bob,LA
row3,Charlie,Chicago
row4,Dave,NY


'\nUsing .iloc (integer-based):'

Name    Bob
Age      30
City     LA
Name: row2, dtype: object

30

Unnamed: 0,Name,Age,City
row1,Alice,25,NY
row2,Bob,30,LA


Unnamed: 0,Name,City
row1,Alice,NY
row2,Bob,LA
row3,Charlie,Chicago
row4,Dave,NY
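
 One detail that is easy to miss in the output above: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, like ordinary Python slicing. The following sketch rebuilds a reduced version of the table (as `df_demo`) to show the difference; with a default integer index the two look alike but return different numbers of rows.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "Dave"], "Age": [25, 30, 35, 28]},
    index=["row1", "row2", "row3", "row4"],
)

# Label slice: both endpoints are included
by_label = df_demo.loc["row1":"row3"]

# Position slice: the end position is excluded
by_position = df_demo.iloc[0:3]

# With a default integer index the same "0:2" means different things
df_int = df_demo.reset_index(drop=True)
loc_rows = len(df_int.loc[0:2])    # labels 0, 1 and 2
iloc_rows = len(df_int.iloc[0:2])  # positions 0 and 1

print(by_label)
print(by_position)
print(loc_rows, iloc_rows)
```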


 ## 5. Filtering Rows



 Filtering rows lets you extract data based on specific conditions.



 ### 5.1. Boolean Masking



 Create a boolean condition that returns `True/False` for each row, then use that mask to filter the DataFrame.

In [11]:
# Show only rows where Age > 28
mask = df["Age"] > 28
display(mask)
older_than_28 = df[mask]
display("Rows where Age > 28:")
display(older_than_28)


row1    False
row2     True
row3     True
row4    False
Name: Age, dtype: bool

'Rows where Age > 28:'

Unnamed: 0,Name,Age,City
row2,Bob,30,LA
row3,Charlie,35,Chicago


 ### 5.2. Multiple Conditions



 Combine conditions using bitwise operators:

 - `&` for AND

 - `|` for OR

 - `~` for NOT

In [12]:
# People older than 25 AND living in NY
df_filtered = df[(df["Age"] > 25) & (df["City"] == "NY")]
display("Rows where Age > 25 and City is NY:")
display(df_filtered)


'Rows where Age > 25 and City is NY:'

Unnamed: 0,Name,Age,City
row4,Dave,28,NY


 Alternatively, the `query()` method expresses the same filter as a string, which is often easier to read when conditions become complex.

 See the official documentation for more details:

 https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#query

In [13]:
df_filtered_query = df.query("Age > 25 and City == 'NY'")
display("Rows where Age > 25 and City is NY (using query):")
display(df_filtered_query)


'Rows where Age > 25 and City is NY (using query):'

Unnamed: 0,Name,Age,City
row4,Dave,28,NY
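
 The examples above only use `&`. As a sketch of the remaining operators, the block below rebuilds the same four-person table (as `df_demo`) and applies `|`, `~`, `isin()`, and a `query()` that refers to a Python variable with `@`. The variable `min_age` is introduced only for this illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 30, 35, 28],
    "City": ["NY", "LA", "Chicago", "NY"],
})

# OR: younger than 28, or living in LA (parentheses around each condition are required)
young_or_la = df_demo[(df_demo["Age"] < 28) | (df_demo["City"] == "LA")]

# NOT: everyone who does not live in NY
not_ny = df_demo[~(df_demo["City"] == "NY")]

# isin() replaces a chain of == comparisons joined by |
in_cities = df_demo[df_demo["City"].isin(["LA", "Chicago"])]

# query() can reference local Python variables with the @ prefix
min_age = 28
at_least_min = df_demo.query("Age >= @min_age")

print(young_or_la)
print(not_ny)
print(in_cities)
print(at_least_min)
```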


 ### Knowledge Check: Filtering Rows



 Using the DataFrame `df`:

 1. Create a boolean mask to filter rows where the 'Age' is between 26 and 32 (inclusive).

 2. Additionally, filter rows where the 'City' starts with either 'C' or 'N'.

 3. Print the resulting DataFrame.



 *Hint: Use string methods like `.str.startswith()` on the 'City' column along with logical operators.*

In [17]:
# Your solution here:
# For example, create your mask and apply it to df.
mask = (df["Age"] >= 26) & (df["Age"] <= 32)
display(mask)

print("Rows where Age is between 26 and 32:")
display(df[mask])

mask = mask & (df["City"].str.startswith(("C", "N")))
print("Rows where Age is between 26 and 32 and City starts with 'C' or 'N':")
display(df[mask])


row1    False
row2     True
row3    False
row4     True
Name: Age, dtype: bool

Rows where Age is between 26 and 32:


Unnamed: 0,Name,Age,City
row2,Bob,30,LA
row4,Dave,28,NY


Rows where Age is between 26 and 32 and City starts with 'C' or 'N':


Unnamed: 0,Name,Age,City
row4,Dave,28,NY


 ## 6. Changing Values



 You can modify DataFrame values using various methods:



 ### 6.1. Assigning with `.loc`



 Modify values by referencing labels.

In [18]:
display(df)
df.loc["row1", "Age"] = 26
display("After modifying using .loc:")
display(df)


Unnamed: 0,Name,Age,City
row1,Alice,25,NY
row2,Bob,30,LA
row3,Charlie,35,Chicago
row4,Dave,28,NY


'After modifying using .loc:'

Unnamed: 0,Name,Age,City
row1,Alice,26,NY
row2,Bob,30,LA
row3,Charlie,35,Chicago
row4,Dave,28,NY


 ### 6.2. Assigning with `.iloc`



 Modify values by referencing integer positions.

In [19]:
df.iloc[0, 1] = 27
display("After modifying using .iloc:")
display(df)


'After modifying using .iloc:'

Unnamed: 0,Name,Age,City
row1,Alice,27,NY
row2,Bob,30,LA
row3,Charlie,35,Chicago
row4,Dave,28,NY


 ### 6.3. Vectorized Assignments



 Apply operations across entire columns efficiently.

In [20]:
# Increase everyone's Age by 1
df["Age"] = df["Age"] + 1
display("After increasing Age by 1:")
display(df)


'After increasing Age by 1:'

Unnamed: 0,Name,Age,City
row1,Alice,28,NY
row2,Bob,31,LA
row3,Charlie,36,Chicago
row4,Dave,29,NY


 ### 6.4. Using `apply()`



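 On a Series, `apply()` calls a function on each element and returns a new Series. On a DataFrame, `apply()` calls the function on each column (or on each row with `axis=1`) and returns a new Series or DataFrame.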



In [22]:
df["Age_squared"] = df["Age"].apply(lambda x: x*x)
display("After applying lambda function to square Age:")
display(df)

'After applying lambda function to square Age:'

Unnamed: 0,Name,Age,City,Age_squared
row1,Alice,28,NY,784
row2,Bob,31,LA,961
row3,Charlie,36,Chicago,1296
row4,Dave,29,NY,841


 ### 6.5. Creating New Columns



 Create a new column based on existing columns.

In [23]:
df["Age in 5 years"] = df["Age"] + 5
display("After creating new column 'Age in 5 years':")
display(df)


"After creating new column 'Age in 5 years':"

Unnamed: 0,Name,Age,City,Age_squared,Age in 5 years
row1,Alice,28,NY,784,33
row2,Bob,31,LA,961,36
row3,Charlie,36,Chicago,1296,41
row4,Dave,29,NY,841,34
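
 As a rough comparison of the techniques in this section, the sketch below computes the same squared column with `apply()` and with a vectorized expression, applies a function row by row with `axis=1`, and uses `assign()`, which returns a new DataFrame instead of modifying the existing one. The two-row `df_demo` and the column name `Age_in_5_years` are chosen only for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [28, 31]})

# Element-wise apply on a Series
squared_apply = df_demo["Age"].apply(lambda x: x * x)

# The same result as a vectorized expression, usually preferred when one exists
squared_vectorized = df_demo["Age"] ** 2

# DataFrame.apply with axis=1 passes each row to the function as a Series
labels = df_demo.apply(lambda row: f"{row['Name']} ({row['Age']})", axis=1)

# assign() returns a new DataFrame with the extra column; df_demo is left unchanged
df_new = df_demo.assign(Age_in_5_years=df_demo["Age"] + 5)

print(squared_apply)
print(squared_vectorized)
print(labels)
print(df_new)
```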


 Direct assignment can misbehave when the target is a subset of another DataFrame (the result of slicing, filtering, or `query()`). pandas cannot always tell whether that subset is a view or a copy, so it emits a warning and the original DataFrame may not be updated.

 To change the original DataFrame, assign through `df.loc[]` on the original.

 To work on an independent subset, copy it first using `df.copy()`.

In [24]:
# For example if you modify a slice of a DataFrame, it may not behave as expected.
df_slice = df.query("Age > 30")
df_slice["Age plus 10"] = df_slice["Age"] + 10
display("After modifying a slice of the DataFrame:")
display(df_slice)
display(df)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_slice["Age plus 10"] = df_slice["Age"] + 10


'After modifying a slice of the DataFrame:'

Unnamed: 0,Name,Age,City,Age_squared,Age in 5 years,Age plus 10
row2,Bob,31,LA,961,36,41
row3,Charlie,36,Chicago,1296,41,46


Unnamed: 0,Name,Age,City,Age_squared,Age in 5 years
row1,Alice,28,NY,784,33
row2,Bob,31,LA,961,36
row3,Charlie,36,Chicago,1296,41
row4,Dave,29,NY,841,34


 The fix is either to assign through `df.loc[]` on the original DataFrame or to take an explicit copy of the slice first.

In [25]:
# Modify the original DataFrame using .loc
mask = df.Age>30
df.loc[mask, "Age plus 10"] = df.loc[mask, "Age"] + 10  # compute from df itself rather than from the earlier slice
display("After modifying the original DataFrame using .loc:")
display(df)

# Or copy the DataFrame first
df_slice = df.query("Age > 30").copy()
df_slice["Age plus 10"] = df_slice["Age"] + 10
display("After modifying a copy of the slice of the DataFrame:")
display(df_slice)



'After modifying the original DataFrame using .loc:'

Unnamed: 0,Name,Age,City,Age_squared,Age in 5 years,Age plus 10
row1,Alice,28,NY,784,33,
row2,Bob,31,LA,961,36,41.0
row3,Charlie,36,Chicago,1296,41,46.0
row4,Dave,29,NY,841,34,


'After modifying a copy of the slice of the DataFrame:'

Unnamed: 0,Name,Age,City,Age_squared,Age in 5 years,Age plus 10
row2,Bob,31,LA,961,36,41
row3,Charlie,36,Chicago,1296,41,46
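
 The two options have different effects, which the following sketch tries to make explicit on a small rebuilt table (`df_demo`, with the ages from the example above): an explicit copy can be changed freely without touching the original, while `.loc` on the original changes the original in place.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [28, 31, 36, 29],
})

# Option 1: work on an independent copy of the filtered rows
older = df_demo[df_demo["Age"] > 30].copy()
older["Age plus 10"] = older["Age"] + 10

# The original has not gained the new column
original_has_column = "Age plus 10" in df_demo.columns

# Option 2: modify the original directly through .loc with a boolean mask
df_demo.loc[df_demo["Age"] > 30, "Age"] += 10

print(older)
print(original_has_column)
print(df_demo)
```

 After the second step the copy still holds the old ages, because it no longer shares data with `df_demo`.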


 ## 7. Calculating Simple Statistics and Value Counts



 Pandas provides simple methods to compute statistics and count occurrences:



 ### 7.1. Simple Statistics



 Calculate basic statistics such as mean, maximum, and minimum.

In [26]:
display("Average Age:", df["Age"].mean())  # Average age
display("Max Age:", df["Age"].max())         # Maximum age
display("Min Age:", df["Age"].min())         # Minimum age


'Average Age:'

31.0

'Max Age:'

36

'Min Age:'

28

 ### 7.2. `value_counts()`



 Count the occurrence of unique values in a Series.

In [27]:
city_counts = df["City"].value_counts()
display("City counts:")
display(city_counts)


'City counts:'

City
NY         2
LA         1
Chicago    1
Name: count, dtype: int64
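
 Two closely related calls may be useful here. In this sketch, `normalize=True` turns the counts into fractions of the total and `nunique()` returns the number of distinct values; the Series is rebuilt from the same four cities so the block runs on its own.

```python
import pandas as pd

cities = pd.Series(["NY", "LA", "Chicago", "NY"], name="City")

# Absolute counts, sorted from most to least frequent
counts = cities.value_counts()

# Relative frequencies (fractions that sum to 1)
shares = cities.value_counts(normalize=True)

# Number of distinct values
n_unique = cities.nunique()

print(counts)
print(shares)
print(n_unique)
```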

 ## 8. Grouping and Aggregation



 Use `.groupby()` to split data into groups based on certain criteria, apply functions to each group, and combine the results.



 In the example below, we group the DataFrame by 'City' and calculate the standard deviation of Age for each group.

In [28]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 30, 35, 28],
    "City": ["NY", "LA", "NY", "LA"],
    "Salary": [70000, 80000, 120000, 95000]
}
df = pd.DataFrame(data)

# Group by 'City' and calculate the standard deviation of Age
grouped = df.groupby("City")["Age"].std()
display("Standard deviation of Age by City:")
display(grouped)


'Standard deviation of Age by City:'

City
LA    1.414214
NY    7.071068
Name: Age, dtype: float64
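
 The split, apply, combine steps can also produce several statistics at once. The sketch below uses named aggregation with `agg()` on the same four-row dataset (rebuilt as `df_demo`); the output column names `mean_salary`, `max_age` and `n_people` are made up for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 30, 35, 28],
    "City": ["NY", "LA", "NY", "LA"],
    "Salary": [70000, 80000, 120000, 95000],
})

# Each keyword is an output column: (input column, aggregation function)
summary = df_demo.groupby("City").agg(
    mean_salary=("Salary", "mean"),
    max_age=("Age", "max"),
    n_people=("Name", "count"),
)

print(summary)
```

 The result is a DataFrame indexed by city, with one row per group and one column per requested statistic.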

 ## 9. Merging / Joining DataFrames



 Merge or join multiple DataFrames using pandas methods.



 ### 9.1. The `merge()` Method



 Merge two DataFrames on a common key.

In [29]:
df_left = pd.DataFrame({
    "PersonID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"]
})

df_right = pd.DataFrame({
    "PersonID": [1, 2, 4],
    "City": ["NY", "LA", "Houston"]
})

merged_df = pd.merge(df_left, df_right, on="PersonID", how="outer")
display("Merged DataFrame (outer join):")
display(merged_df)


'Merged DataFrame (outer join):'

Unnamed: 0,PersonID,Name,City
0,1,Alice,NY
1,2,Bob,LA
2,3,Charlie,
3,4,,Houston
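
 The `how` parameter decides which keys survive the merge. As a sketch using the same two small tables, the block below compares an inner, a left, and an outer join, and adds `indicator=True` to the outer join so that each row records which side it came from.

```python
import pandas as pd

df_left = pd.DataFrame({
    "PersonID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
})
df_right = pd.DataFrame({
    "PersonID": [1, 2, 4],
    "City": ["NY", "LA", "Houston"],
})

# inner: only keys present in both tables
inner = pd.merge(df_left, df_right, on="PersonID", how="inner")

# left: every key from df_left; missing City values become NaN
left = pd.merge(df_left, df_right, on="PersonID", how="left")

# outer: keys from either table; the _merge column reports the origin of each row
outer = pd.merge(df_left, df_right, on="PersonID", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```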


 ### 9.2. Joins on Different Column Names



 If the key column has different names in each DataFrame, use the `left_on` and `right_on` parameters.

In [30]:
# Example: Uncomment and modify the following line if your DataFrames have different key names.
# pd.merge(df_left, df_right, left_on="PersonID", right_on="ID")
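
 A runnable version of the commented line above might look like the following sketch. The second table and its key column `ID` are hypothetical, introduced only to have two different key names. Both key columns are kept in the result, so one of them is usually dropped afterwards.

```python
import pandas as pd

people = pd.DataFrame({
    "PersonID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
})
# Hypothetical table whose key column is called "ID" instead of "PersonID"
cities = pd.DataFrame({
    "ID": [1, 2, 4],
    "City": ["NY", "LA", "Houston"],
})

# Match people.PersonID against cities.ID (inner join by default)
merged = pd.merge(people, cities, left_on="PersonID", right_on="ID")

# Both key columns survive the merge; drop the redundant one
merged_clean = merged.drop(columns="ID")

print(merged)
print(merged_clean)
```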



 ## 10. Saving and Loading Data



 Pandas allows you to easily save DataFrames to various file formats and load them back into your program. Below are examples for saving to CSV and Feather formats:



 - **CSV Format**: A widely used text-based format.

 - **Feather Format**: A fast, lightweight, language-independent binary format (requires `pyarrow`).



 Pandas supports many other formats as well, including Excel, JSON, and SQL!



In [31]:
# Save DataFrame to CSV
df.to_csv("saved_data.csv", index=False)
display("DataFrame saved to CSV file: saved_data.csv")


'DataFrame saved to CSV file: saved_data.csv'

In [32]:
# Save DataFrame to Feather format (ensure you have pyarrow installed: pip install pyarrow)
df.to_feather("saved_data.feather")
display("DataFrame saved to Feather file: saved_data.feather")


'DataFrame saved to Feather file: saved_data.feather'

In [33]:
# Loading the saved CSV file
df_loaded_csv = pd.read_csv("saved_data.csv")
display("CSV file loaded:")
display(df_loaded_csv)


'CSV file loaded:'

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,NY,70000
1,Bob,30,LA,80000
2,Charlie,35,NY,120000
3,Dave,28,LA,95000


In [34]:
# Loading the saved Feather file
df_loaded_feather = pd.read_feather("saved_data.feather")
display("Feather file loaded:")
display(df_loaded_feather)


'Feather file loaded:'

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,NY,70000
1,Bob,30,LA,80000
2,Charlie,35,NY,120000
3,Dave,28,LA,95000
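
 The `index=False` argument used when saving to CSV matters more than it may seem. The sketch below writes a small table to a temporary directory (the file name `demo.csv` is arbitrary) with and without the index, and shows that a saved index comes back as an extra column named `Unnamed: 0` unless `index_col=0` is passed when reading.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_demo = pd.DataFrame({"Name": ["Alice", "Bob"], "Salary": [70000, 80000]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "demo.csv"  # arbitrary file name for the example

    # Without the index: the file contains only the two data columns
    df_demo.to_csv(csv_path, index=False)
    df_no_index = pd.read_csv(csv_path)

    # With the index (the default): it is written as a first, unnamed column
    df_demo.to_csv(csv_path)
    df_with_index = pd.read_csv(csv_path)

    # index_col=0 tells read_csv to use that first column as the index again
    df_restored = pd.read_csv(csv_path, index_col=0)

print(df_no_index.columns.tolist())
print(df_with_index.columns.tolist())
print(df_restored.columns.tolist())
```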


 ## 11. Exercises



 Practice what you have learned with the following exercises:



 1. Create a DataFrame from a dictionary of lists with at least three columns.

 2. Load a CSV file into a DataFrame and inspect its first few rows.

 3. Filter rows where a numeric column exceeds a certain threshold.

 4. Perform a group-by operation and calculate the mean of another column.

 5. Merge two DataFrames on a common key.



 **For each exercise, write your code in the provided cells.**

In [36]:
# 1. Create a DataFrame from a dictionary of lists.
# define dictionary of lists
data_dic_list = {
    "Person": ["Alice", "Ricardo", "Wendy", "Oreo"],
    "Age": [55, 50, 48, 9],
    "Food": ["pasta", "steak", "pizza", "dog food"]
}

print("DataFrame from dictionary of lists:")
df_dic_list = pd.DataFrame(data_dic_list)
display(df_dic_list)

DataFrame from dictionary of lists:


Unnamed: 0,Person,Age,Food
0,Alice,55,pasta
1,Ricardo,50,steak
2,Wendy,48,pizza
3,Oreo,9,dog food


In [37]:
# 2. Load a CSV file and inspect its first few rows.
df_loaded_csv = pd.read_csv("saved_data.csv")
print("Loaded CSV file:")
display(df_loaded_csv.head())

Loaded CSV file:


Unnamed: 0,Name,Age,City,Salary
0,Alice,25,NY,70000
1,Bob,30,LA,80000
2,Charlie,35,NY,120000
3,Dave,28,LA,95000


In [38]:
# 3. Filter rows where a numeric column exceeds a threshold.
display("Rows where Salary > 80000:")
display(df_loaded_csv[df_loaded_csv["Salary"] > 80000])

'Rows where Salary > 80000:'

Unnamed: 0,Name,Age,City,Salary
2,Charlie,35,NY,120000
3,Dave,28,LA,95000


In [41]:
# 4. Perform a group-by operation and calculate the mean of another column.
df_grouped = df_loaded_csv.groupby("City")
display("Mean Salary by City:")
display(df_grouped["Salary"].mean())

'Mean Salary by City:'

City
LA    87500.0
NY    95000.0
Name: Salary, dtype: float64

In [50]:
# 5. Merge two DataFrames on a common key.

df_right = pd.DataFrame({
    "Person": ["Julia", "Ana", "John"],
    "Age": [10, 20, 8]
})

df_merged = pd.merge(df_dic_list, df_right, on=["Person", "Age"], how="outer")  # a list of column names is the conventional form for multiple keys

print("Merged DataFrame:")
display(df_merged)

Merged DataFrame:


Unnamed: 0,Person,Age,Food
0,Alice,55,pasta
1,Ana,20,
2,John,8,
3,Julia,10,
4,Oreo,9,dog food
5,Ricardo,50,steak
6,Wendy,48,pizza
