
# Python Pandas DataFrame Examples

This notebook walks through common **Python Pandas DataFrame operations**, mirroring the examples in the PySpark version.

---


In [None]:

import pandas as pd
import numpy as np


### 1️⃣ Create a DataFrame from a Python Collection

In [None]:

data = [("Ravi", 25, "Bengaluru"),
        ("Priya", 30, "Hyderabad"),
        ("Ankit", 28, "Pune"),
        ("Lakshmi", 32, "Chennai")]
columns = ["Name", "Age", "City"]
df = pd.DataFrame(data, columns=columns)
df


### 2️⃣ Selecting Columns

In [None]:

df[["Name", "Age"]]


### 3️⃣ Filtering Rows

In [None]:

df[df["Age"] > 28]
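Filters often need more than one condition. The sketch below rebuilds the same sample data so it runs on its own, then combines two conditions with `&`; the specific thresholds are chosen only for illustration. Each condition needs its own parentheses because `&` binds more tightly than comparison operators.

```python
import pandas as pd

# Same sample data as the cell that creates `df`
df = pd.DataFrame(
    [("Ravi", 25, "Bengaluru"), ("Priya", 30, "Hyderabad"),
     ("Ankit", 28, "Pune"), ("Lakshmi", 32, "Chennai")],
    columns=["Name", "Age", "City"],
)

# Combine boolean masks with & (and), | (or), ~ (not)
mask = (df["Age"] > 26) & (df["City"] != "Chennai")
filtered = df[mask]

# The same filter expressed with DataFrame.query
filtered_q = df.query("Age > 26 and City != 'Chennai'")
```

Both forms return a new DataFrame and leave `df` unchanged.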


### 4️⃣ Add or Transform Columns

In [None]:

df["Age_After_5_Years"] = df["Age"] + 5
df


### 5️⃣ Rename Columns

In [None]:

df_renamed = df.rename(columns={"City": "Location"})
df_renamed


### 6️⃣ Drop Columns

In [None]:

df_dropped = df.drop(columns=["City"])
df_dropped


### 7️⃣ Group By and Aggregations

In [None]:

group_data = [("IT", 40000), ("HR", 25000), ("IT", 45000), ("Finance", 30000)]
df_group = pd.DataFrame(group_data, columns=["Department", "Salary"])

df_group.groupby("Department").agg(
    Avg_Salary=("Salary", "mean"),
    Max_Salary=("Salary", "max"),
    Count=("Salary", "count")
).reset_index()


### 8️⃣ Sorting Data

In [None]:

df.sort_values(by="Age", ascending=False)
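`sort_values()` also accepts a list of columns with a matching list of sort directions. As a small sketch, the department data from the Group By section is reused here to sort by department (ascending) and then by salary (descending) within each department.

```python
import pandas as pd

# Department/salary rows reused from the Group By example
df_group = pd.DataFrame(
    [("IT", 40000), ("HR", 25000), ("IT", 45000), ("Finance", 30000)],
    columns=["Department", "Salary"],
)

# One direction per sort key: Department A-Z, then highest Salary first
sorted_group = df_group.sort_values(
    by=["Department", "Salary"],
    ascending=[True, False],
)
```

The result is a new DataFrame; the original row labels travel with the rows, so the index is no longer in order after sorting.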


### 9️⃣ Removing Duplicates

In [None]:

dup_data = [("Ravi", "IT"), ("Ravi", "IT"), ("Priya", "HR")]
df_dup = pd.DataFrame(dup_data, columns=["Name", "Dept"])
df_dup.drop_duplicates()
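By default `drop_duplicates()` compares every column and keeps the first occurrence. The sketch below shows the `subset` and `keep` parameters; the extra `("Ravi", "HR")` row is a hypothetical addition so that the two behaviours differ.

```python
import pandas as pd

# Original rows plus one hypothetical row: ("Ravi", "HR")
df_dup = pd.DataFrame(
    [("Ravi", "IT"), ("Ravi", "IT"), ("Priya", "HR"), ("Ravi", "HR")],
    columns=["Name", "Dept"],
)

# Flag rows that repeat an earlier row across all columns
is_dup = df_dup.duplicated()

# Full-row de-duplication: only the second ("Ravi", "IT") is removed
unique_rows = df_dup.drop_duplicates()

# De-duplicate on Name only, keeping the last row seen for each name
last_per_name = df_dup.drop_duplicates(subset=["Name"], keep="last")
```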


### 🔟 Joins Between DataFrames

In [None]:

dept_data = [("Ravi", "IT"), ("Priya", "HR"), ("Ankit", "Finance")]
df_dept = pd.DataFrame(dept_data, columns=["Name", "Department"])

joined = pd.merge(df, df_dept, on="Name", how="inner")
joined
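The inner join above silently drops Lakshmi, who has no row in `df_dept`. A left join keeps every row from the left DataFrame, and `indicator=True` adds a `_merge` column showing where each row came from. The sketch rebuilds both DataFrames so it runs standalone.

```python
import pandas as pd

df = pd.DataFrame(
    [("Ravi", 25, "Bengaluru"), ("Priya", 30, "Hyderabad"),
     ("Ankit", 28, "Pune"), ("Lakshmi", 32, "Chennai")],
    columns=["Name", "Age", "City"],
)
df_dept = pd.DataFrame(
    [("Ravi", "IT"), ("Priya", "HR"), ("Ankit", "Finance")],
    columns=["Name", "Department"],
)

# Keep all rows from df; unmatched rows get NaN in Department
left_joined = pd.merge(df, df_dept, on="Name", how="left", indicator=True)

# Rows that exist only in the left DataFrame
unmatched = left_joined[left_joined["_merge"] == "left_only"]
```

Checking `_merge` is a quick way to spot keys that failed to match before they turn into unexpected missing values downstream.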


### 1️⃣1️⃣ Handling Null Values

In [None]:

null_data = [("Ravi", np.nan), ("Priya", 30), ("Ankit", np.nan)]
df_null = pd.DataFrame(null_data, columns=["Name", "Age"])

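# fillna() and dropna() both return new DataFrames; df_null is unchanged
print(df_null.fillna({"Age": 25}))  # replace missing ages with 25
print(df_null.dropna())             # drop rows containing any missing value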
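Before choosing a strategy it usually helps to count the missing values per column. The sketch below does that with `isna().sum()` and then fills missing ages with the column mean instead of a fixed constant; using the mean is an illustrative choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df_null = pd.DataFrame(
    [("Ravi", np.nan), ("Priya", 30), ("Ankit", np.nan)],
    columns=["Name", "Age"],
)

# Number of missing values in each column
missing_counts = df_null.isna().sum()

# mean() skips NaN by default, so this is the mean of the known ages
mean_age = df_null["Age"].mean()
df_filled = df_null.fillna({"Age": mean_age})
```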


### 1️⃣2️⃣ Conditional Column with `np.where()`

In [None]:

df["Category"] = np.where(df["Age"] > 30, "Senior", "Junior")
df
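`np.where()` handles a two-way split. For three or more categories, `np.select()` takes a list of conditions and a matching list of labels, using the first condition that is true for each row. In this sketch the `"Mid"` label and the age cut-off of 28 are hypothetical additions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [("Ravi", 25, "Bengaluru"), ("Priya", 30, "Hyderabad"),
     ("Ankit", 28, "Pune"), ("Lakshmi", 32, "Chennai")],
    columns=["Name", "Age", "City"],
)

# Conditions are checked in order; the first match wins
conditions = [df["Age"] > 30, df["Age"] >= 28]
labels = ["Senior", "Mid"]  # "Mid" is a hypothetical extra category

# Rows matching no condition receive the default label
df["Category"] = np.select(conditions, labels, default="Junior")
```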


### 1️⃣3️⃣ DataFrame Summary and Statistics

In [None]:

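df.info()      # prints column dtypes and non-null counts; returns None
df.describe()  # summary statistics for the numeric columns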


### 1️⃣4️⃣ Resetting the Index

In [None]:

df_reset = df.reset_index(drop=True)
df_reset
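Because `df` already has the default 0..n-1 index, resetting it shows no visible change. The effect is easier to see after an operation that leaves the index out of order, such as a sort. The sketch below rebuilds the sample data and compares the index before and after.

```python
import pandas as pd

df = pd.DataFrame(
    [("Ravi", 25, "Bengaluru"), ("Priya", 30, "Hyderabad"),
     ("Ankit", 28, "Pune"), ("Lakshmi", 32, "Chennai")],
    columns=["Name", "Age", "City"],
)

# Sorting keeps each row's original label, so the index becomes 3, 1, 2, 0
sorted_df = df.sort_values(by="Age", ascending=False)

# drop=True discards the old labels and renumbers from 0
df_reset = sorted_df.reset_index(drop=True)

# Without drop=True the old labels are kept in a new "index" column
df_kept = sorted_df.reset_index()
```

Note that `reset_index()` changes only the row labels; the row order stays as it was after the sort.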



---

### ✅ Summary of APIs Covered

| Operation | Pandas API | Description |
|------------|-------------|-------------|
| Create DataFrame | `pd.DataFrame()` | From list/collection |
| Select Columns | `df[[col1, col2]]` | Choose specific columns |
| Filter Rows | `df[df.col > val]` | Apply condition filters |
| Add Column | `df['new'] = ...` | Add or modify a column |
| Rename Column | `rename()` | Rename existing column |
| Drop Column | `drop()` | Remove a column |
| Group & Aggregate | `groupby().agg()` | Aggregate functions |
| Sort | `sort_values()` | Ascending/Descending |
| Remove Duplicates | `drop_duplicates()` | Keep unique rows |
| Join | `merge()` | Combine multiple DataFrames |
| Handle Nulls | `fillna()`, `dropna()` | Manage missing values |
| Conditional Logic | `np.where()` | Add derived columns |
| Statistics | `describe()`, `info()` | Summary metrics |
| Indexing | `reset_index()` | Reset to the default integer index |

---

