# Pandas: Working with Structured Data in Python

After NumPy, the next natural step in scientific Python workflows is **Pandas**. While NumPy focuses on numerical arrays and mathematical operations, many real-world datasets are **tabular**, **labeled**, and **heterogeneous**. Pandas is designed specifically to work with such data.

Pandas is one of the most widely used third-party Python libraries in scientific computing, data science, and engineering, and it is often the first library that makes environment management unavoidable.


## Why Pandas?

Basic Python data structures such as lists and dictionaries are sufficient for small tasks. However, they become impractical when working with:

- Large datasets

- Tabular data (rows and columns)

- Missing values

- Mixed data types

- Labeled data that must be filtered or aggregated

Pandas provides high-level abstractions that make these tasks simpler, safer, and more expressive.

## Installing Pandas

Pandas is a third-party package and must be installed in the active environment.


In [1]:
# conda install pandas

As discussed in earlier chapters, this installs Pandas only in the currently active environment.

## 1. Import Pandas

By convention, Pandas is imported as `pd`:

In [2]:
import pandas as pd

# Core Pandas Data Structures

Pandas introduces two fundamental data structures: **Series** and **DataFrame**.

## 2. Series

A **Series** is a one-dimensional array with an associated index.

In [3]:
s = pd.Series([10, 20, 30, 40])
s

0    10
1    20
2    30
3    40
dtype: int64

You can explicitly define the index:

In [4]:
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2

a    10
b    20
c    30
dtype: int64

A Series behaves similarly to a labeled NumPy array.
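The comparison with NumPy can be made concrete with a minimal sketch. It reuses the `s2` Series from above; the variable names `doubled`, `value_b`, and `values` are only illustrative.

```python
import numpy as np
import pandas as pd

s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Element-wise arithmetic works as it does on a NumPy array,
# and the index labels are carried along with the result
doubled = s2 * 2

# Values can be looked up by label instead of by integer position
value_b = s2["b"]

# The underlying data can be exported as a plain NumPy array
values = s2.to_numpy()

print(doubled)
print(value_b)
print(type(values))
```

The labels are what distinguish a Series from a bare array: arithmetic keeps them, and lookups can use them.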

## 3. DataFrame

A **DataFrame** is a two-dimensional table consisting of rows and columns. Each column is a Series.

**Creating a DataFrame from a dictionary**

In [5]:
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "score": [85, 90, 95],
}

df = pd.DataFrame(data)
df

      name  age  score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     95


## 4. Inspecting a DataFrame

Before doing any analysis, inspect the dataset: its first and last rows, column names, shape, and data types.

In [6]:
df.head()

      name  age  score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     95


In [7]:
df.tail()

      name  age  score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     95


In [8]:
df.columns

Index(['name', 'age', 'score'], dtype='object')

In [9]:
df.shape

(3, 3)

In [10]:
df.dtypes

name     object
age       int64
score     int64
dtype: object

## 5. Reading Data from Files (CSV)

If you have a CSV file on disk, you can load it with `pd.read_csv`.

Update the path so that it points to a CSV file in your project.

In [11]:
df_from_csv = pd.read_csv("data.csv")
df_from_csv.head()

      name  age  score
0    Alice   25   85.0
1      Bob   30   90.0
2  Charlie   35   95.0
3    Diana   28    NaN


# How Pandas Thinks About a DataFrame (Objects, Attributes, and Methods)

When data is loaded using Pandas, the result is not just a table of values.
It is a Python object with its own attributes and methods.

For example:

In [12]:
df = pd.read_csv("data.csv")

The variable `df` now refers to a DataFrame object.

## DataFrames Are Python Objects
In Python, objects have:

- **Attributes** → information stored in the object

- **Methods** → functions that act on the object

Pandas follows this object-oriented design strictly.

You interact with a DataFrame using dot notation (`.`).

### DataFrame Attributes (No Parentheses)

**Attributes** describe properties of the DataFrame.
They are accessed **without parentheses.**

Examples:

In [13]:
df.columns
df.shape
df.dtypes
df.index


RangeIndex(start=0, stop=4, step=1)

These return information such as:

- Column names

- Number of rows and columns

- Data types of each column

- Row labels

**Rule of thumb:**

If you are asking for information, it is usually an attribute.

### DataFrame Methods (With Parentheses)

**Methods** are functions attached to the DataFrame object.
They are called **with parentheses.**

Examples:

In [14]:
df.head()
df.describe()
df[["age", "score"]].mean()
df.dropna()


      name  age  score
0    Alice   25   85.0
1      Bob   30   90.0
2  Charlie   35   95.0


These perform actions such as:

- Displaying part of the data

- Computing statistics

- Transforming or cleaning data

**Rule of thumb:**

If something does work, such as computing a result or returning a transformed copy of the data, it is usually a method.

### Why Parentheses Matter

This is a very common beginner mistake:

In [15]:
df.shape()   # ❌ incorrect
df.shape     # ✅ correct


TypeError: 'tuple' object is not callable

Explanation:

- `df.shape` is an attribute (stored information)

- `df.head()` is a method (a function call)

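Adding parentheses to an attribute raises a `TypeError`. Omitting them from a method does not raise an error, but it returns the method object itself instead of calling it.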
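The distinction can be checked directly with Python's built-in `callable`. This is a small sketch on a hypothetical two-column DataFrame, not the file loaded above.

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical values)
df = pd.DataFrame({"age": [25, 30, 35], "score": [85, 90, 95]})

# An attribute holds a value; a method is something you can call
print(callable(df.shape))  # False: shape is a stored tuple
print(callable(df.head))   # True: head is a method waiting to be called

# Calling an attribute fails because a tuple cannot be called
try:
    df.shape()
except TypeError as err:
    message = str(err)
    print(message)
```

Writing `df.head` without parentheses does not fail. It just displays the method object, which is a hint that the call was never made.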

## How read_csv Fits This Model

`pd.read_csv()` is not a DataFrame method.

In [None]:
df = pd.read_csv("data.csv")


Here:

- `pd.read_csv()` is a **function** provided by the Pandas module

- It **creates** a new DataFrame object

- After this, all interaction happens through `df`

### What Pandas Does Automatically When Reading Files

When you load data from a CSV file, Pandas automatically:

- Parses column names from the header

- Infers data types for each column

- Assigns a default index

- Represents missing values using `NaN`

All of this information becomes part of the DataFrame object and is accessible through its attributes and methods.
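The four automatic steps can be observed without a file on disk. The sketch below feeds `pd.read_csv` an in-memory string through `io.StringIO`; the text in `csv_text` is a hypothetical stand-in for the contents of `data.csv`.

```python
import io

import pandas as pd

# Hypothetical file contents standing in for data.csv;
# Diana's score is deliberately left empty
csv_text = """name,age,score
Alice,25,85
Bob,30,90
Charlie,35,95
Diana,28,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))          # column names parsed from the header row
print(df.dtypes)                 # one inferred data type per column
print(df.index)                  # default index assigned by Pandas
print(df["score"].isna().sum())  # number of missing scores
```

Because one score is missing, the `score` column is read as floating point rather than integer: `NaN` is a floating-point value.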

#### Seeing This in Practice

You can inspect the DataFrame immediately after loading:

In [None]:
df.columns
df.dtypes
df.head()
df.shape


This should always be your **first step** after reading a file.

#### Mental Model to Remember

A Pandas DataFrame is not just data. It is a Python object that knows about its structure and provides methods to work with it.

Once students understand this model:

- `.loc`, `.iloc`, `.mean()`, `.dropna()` make sense

- Errors become easier to debug

- Pandas feels consistent rather than magical

## 6. Selecting Columns

Select one column (returns a Series) or multiple columns (returns a DataFrame).

In [None]:
df["age"]

In [None]:
df[["name", "score"]]
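The difference between single and double brackets is easy to verify. This is a minimal sketch on the small example table from earlier; `single` and `table` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85, 90, 95],
    }
)

single = df["age"]    # single brackets: one column as a Series
table = df[["age"]]   # double brackets: a DataFrame, even with one column

print(type(single).__name__, single.shape)
print(type(table).__name__, table.shape)
```

The shapes show the difference: the Series is one-dimensional, while the DataFrame keeps a column axis.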

## 7. Selecting Rows

Use `.iloc` for position-based selection and `.loc` for label-based selection.

In [None]:
df.iloc[0]      # Select the first row by position

In [None]:
df.iloc[0:2]    # Select the first two rows by position

In [None]:
df.loc[0]       # Select the row whose index label is 0
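With the default integer index, `.iloc[0]` and `.loc[0]` return the same row, which hides the difference between them. The sketch below uses a hypothetical DataFrame with string labels to separate the two.

```python
import pandas as pd

# Hypothetical table with string row labels instead of 0, 1, 2
df = pd.DataFrame(
    {"age": [25, 30, 35], "score": [85, 90, 95]},
    index=["alice", "bob", "charlie"],
)

first_by_position = df.iloc[0]   # first row, whatever its label is
bob_by_label = df.loc["bob"]     # row whose label is "bob"

# Slices differ: .iloc excludes the end position,
# while .loc includes the end label
pos_slice = df.iloc[0:2]
label_slice = df.loc["alice":"bob"]

print(first_by_position)
print(bob_by_label)
print(len(pos_slice), len(label_slice))
```

Both slices contain the rows for `alice` and `bob`, but for different reasons: position 2 is excluded, while the label `"bob"` is included.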

## 8. Filtering Rows by Conditions

Filtering is one of the most useful Pandas features.

In [None]:
df[df["age"] > 30]     # Filter by one condition

In [None]:
df[(df["age"] > 25) & (df["score"] >= 90)]      # Filter by multiple conditions
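A filter is a two-step operation: the comparison builds a boolean Series (a mask), and indexing with that mask keeps the rows where it is `True`. A minimal sketch on the small example table, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85, 90, 95],
    }
)

# Step 1: the comparison produces one True/False value per row
mask = df["age"] > 30
print(mask.tolist())

# Step 2: indexing with the mask keeps only the True rows
older = df[mask]

# Masks combine with & (and), | (or), and ~ (not);
# each condition needs its own parentheses
either = df[(df["age"] < 30) | (df["score"] >= 95)]
not_older = df[~mask]

print(older)
print(either)
print(not_older)
```

The parentheses are required because `&` and `|` bind more tightly than comparison operators in Python. The keywords `and` and `or` do not work element-wise on a Series.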

## 9. Basic Statistics and Descriptive Analysis

Pandas provides built-in methods for quick exploration.

In [None]:
df.mean(numeric_only=True)

In [None]:
df.describe()

In [None]:
df["score"].mean(), df["score"].max()    # Individual columns

## 10. Handling Missing Data

Missing values are common in real datasets. Pandas typically represents them as `NaN` (for example, in numeric columns), while datetime columns use `NaT`.

Here we create a small example with missing values and demonstrate typical operations. Understanding how missing data is handled is critical for correct analysis.

In [None]:
df_missing = df.copy()
df_missing.loc[1, "score"] = None
df_missing

In [None]:
df_missing.isna()

In [None]:
df_missing.dropna()

In [None]:
df_missing.fillna(0)
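How a gap is filled affects later results. The sketch below uses a hypothetical four-row table with one missing score and compares the mean before and after filling with `0`.

```python
import pandas as pd

# Hypothetical data: Diana's score is missing
df_missing = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "Diana"],
        "score": [85.0, 90.0, 95.0, None],
    }
)

n_missing = df_missing["score"].isna().sum()

# By default, mean() skips NaN: (85 + 90 + 95) / 3
mean_score = df_missing["score"].mean()

# Filling with 0 counts the gap as a real zero: (85 + 90 + 95 + 0) / 4
mean_after_zero_fill = df_missing["score"].fillna(0).mean()

# Filling with the column mean leaves the mean unchanged
filled = df_missing["score"].fillna(mean_score)

print(n_missing)             # 1
print(mean_score)            # 90.0
print(mean_after_zero_fill)  # 67.5
print(filled.tolist())       # [85.0, 90.0, 95.0, 90.0]
```

Neither choice is correct in general. Whether to drop or fill a missing value, and what to fill it with, depends on what the missing value means in the dataset.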

## 11. Modifying Data

Create new columns and update values using `.loc`.

In [None]:
df2 = df.copy()
df2["passed"] = df2["score"] >= 60
df2

You can update existing values:

In [None]:
df2.loc[df2["score"] < 60, "passed"] = False
df2

## 12. Jupyter Tip: Verify Which Python Executable Runs This Notebook

This is helpful for confirming that your notebook is using the intended Conda environment.

In [None]:
import sys
sys.executable

## Pandas and Environment Management
Pandas depends on NumPy and system-level libraries. This means:

- Version mismatches can cause installation issues

- Different projects may require different Pandas versions

- Isolated environments are strongly recommended

Pandas is often the first library that exposes students to real dependency management challenges.
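When a version mismatch is suspected, a quick check is to print the interpreter path together with the installed versions. This is a minimal sketch, and the output depends entirely on your environment.

```python
import sys

import numpy as np
import pandas as pd

# Which Python interpreter is running this code
print(sys.executable)

# Which library versions that interpreter actually imported
print("pandas", pd.__version__)
print("numpy", np.__version__)
```

If the path or the versions differ from what you expect, the code is probably running in a different environment from the one where you installed Pandas.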

## Common Beginner Mistakes

- Installing Pandas in one environment and running code in another

- Treating DataFrames like Python lists

- Forgetting that selecting a single column returns a Series, not a DataFrame

- Ignoring missing values

- Modifying data unintentionally due to chained indexing

Most Pandas errors are conceptual rather than syntactic.
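The last mistake in the list, chained indexing, can be sketched as follows. The DataFrame is hypothetical, and warnings are silenced only to keep the output short; depending on the Pandas version, the chained assignment normally emits a warning.

```python
import warnings

import pandas as pd

# Hypothetical data: only the first score is below 60
df = pd.DataFrame({"score": [55, 90, 95], "passed": [True, True, True]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Chained indexing: df[...] builds a temporary filtered copy,
    # and the assignment lands on that copy, not on df
    df[df["score"] < 60]["passed"] = False

chained_result = df["passed"].tolist()
print(chained_result)  # [True, True, True]: df was not updated

# A single .loc call selects rows and column in one step,
# so the assignment reaches df itself
df.loc[df["score"] < 60, "passed"] = False

loc_result = df["passed"].tolist()
print(loc_result)      # [False, True, True]
```

The chained version runs without raising an exception, which is why this mistake often goes unnoticed.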

## Summary

- Use **Series** for 1D labeled data
- Use **DataFrame** for 2D tabular data
- Inspect with `head`, `shape`, `dtypes`
- Select with `[]`, `.iloc`, `.loc`
- Filter with boolean conditions
- Handle missing values with `isna`, `dropna`, `fillna`
