# Pandas — DataFrame and Series

**Pandas** is a powerful Python library for **data manipulation** and **data analysis**, widely used in data science and machine learning workflows.  

It provides two primary data structures:

---

## 1. Series
- A **Series** is a **one-dimensional** array-like object.  
- It can hold data of any type (integers, floats, strings, Python objects).  
- It has **labels (index)** associated with each element.  

Mathematically, you can think of it as a mapping:  
$$
\text{Series: } \{ \text{index} \; \rightarrow \; \text{value} \}
$$

**Example of a Series:**

$$
\begin{array}{|c|c|}
\hline
\textbf{Index} & \textbf{Value} \\
\hline
0 & \text{Alice}   \\
1 & \text{Bob}     \\
2 & \text{Charlie} \\
\hline
\end{array}
$$


---

## 2. DataFrame
- A **DataFrame** is a **two-dimensional**, **size-mutable**, and **heterogeneous** tabular data structure.  
- It has **labeled axes**: rows and columns.  
- Each column in a DataFrame is essentially a **Series**.  
- You can imagine it like an **Excel spreadsheet** or a **SQL table** in Python.  

Formally, a DataFrame is like a dictionary of Series objects sharing the same index:  
$$
\text{DataFrame: } \{\; \text{column label} \; \rightarrow \; \text{Series} \;\}
$$

---

### 🔹 Visual Representation of a DataFrame

$$
\begin{array}{|c|c|c|}
\hline
\textbf{Index} & \textbf{Name} & \textbf{Age} \\
\hline
0 & Alice   & 24 \\
1 & Bob     & 27 \\
2 & Charlie & 22 \\
\hline
\end{array}
$$

---

### ✅ Summary
- **Series** → 1D, labeled array (like a single column).  
- **DataFrame** → 2D, tabular structure with rows and columns.  
- Pandas is built on top of **NumPy**, so it’s highly efficient and supports vectorized operations.
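
To make "vectorized operations" concrete, here is a minimal sketch. The `prices` Series and its labels are made up for illustration; the point is only that one arithmetic expression acts on every element without an explicit loop.

```python
import pandas as pd

# Hypothetical data, used only for illustration
prices = pd.Series([100.0, 250.0, 80.0], index=["pen", "bag", "cup"])

# Vectorized arithmetic: the multiplication is applied to every element at once
discounted = prices * 0.9

# Equivalent element-by-element loop, shown only for comparison
looped = pd.Series([p * 0.9 for p in prices], index=prices.index)

print(discounted)
print("Same result as the loop:", discounted.equals(looped))
```

The vectorized form is shorter and, on large data, usually much faster because the loop runs inside NumPy rather than in Python.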



In [None]:
# Install Pandas from within a Jupyter notebook
# (%pip installs into the environment of the running kernel)
%pip install pandas

Collecting pandas
  Downloading pandas-2.3.3-cp312-cp312-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.3-cp312-cp312-macosx_11_0_arm64.whl (10.7 MB)
Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.3.3 pytz-2025.2 tzdata-2025.2


In [2]:
import numpy as np
import pandas as pd

## Pandas Series

A **Pandas Series** is a **one-dimensional** array-like object that can hold any data type  
(integers, floats, strings, objects, etc.).  

- It is similar to a **column in a table or spreadsheet**.  
- Each element in the Series has an associated **index** (row label).  
- Default index is **0, 1, 2, …**, but you can define custom labels.

Mathematically, a Series behaves like a mapping:
$$
\text{Series: } \{ \text{index} \; \rightarrow \; \text{value} \}
$$


In [11]:
## Series
# A Pandas Series is a one-dimensional, array-like object that can hold any data type. It is similar to a column in a table.

# Create a Pandas Series from a Python list
data = [1, 2, 3, 4, 5]

# By default, Pandas assigns an index [0,1,2,...]
series = pd.Series(data)

# Print the Series
print("Series:")
print(series)

# Accessing attributes
print("\nValues:", series.values)   # underlying NumPy array
print("Index:", series.index)       # index labels

print(type(series))

Series:
0    1
1    2
2    3
3    4
4    5
dtype: int64

Values: [1 2 3 4 5]
Index: RangeIndex(start=0, stop=5, step=1)
<class 'pandas.core.series.Series'>


## Creating a Series from a Dictionary

- A **dictionary** in Python naturally maps **keys → values**.  
- When creating a Series from a dictionary:
  - The **keys** become the **index labels**.
  - The **values** become the **data elements**.

Mathematically:
$$
\text{dict: } \{k_1 : v_1, \; k_2 : v_2, \; k_3 : v_3 \} 
\;\;\longrightarrow\;\;
\text{Series: } \{ \text{index}=k_i \;\; \rightarrow \;\; \text{value}=v_i \}
$$


In [14]:
## Create a Series from dictionary elements 
# Create a dictionary
data = {'a': 1, 'b': 2, 'c': 3}

# Convert dictionary into a Pandas Series
# Keys → become the index
# Values → become the data
series_dict = pd.Series(data)

print("Series from Dictionary:")
print(series_dict)

# Accessing index and values separately
print("\nIndex labels:", series_dict.index)
print("Values:", series_dict.values)

Series from Dictionary:
a    1
b    2
c    3
dtype: int64

Index labels: Index(['a', 'b', 'c'], dtype='object')
Values: [1 2 3]
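
A related case worth a quick sketch: when a dictionary is passed together with an explicit `index=`, the index selects and orders the keys. The label `'z'` below is a hypothetical key that is deliberately absent from the dictionary.

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3}

# index= picks keys from the dictionary in the given order;
# 'z' is not a key, so its value is NaN (missing)
s = pd.Series(data, index=['c', 'a', 'z'])

print(s)
print("dtype:", s.dtype)   # NaN is a float, so the Series is stored as float64
```

Note that key `'b'` is dropped because it is not listed in the index.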


## Creating a Series with Custom Index

By default, a Pandas Series assigns integer indices (0, 1, 2, …).  
But you can **explicitly provide custom labels** for the index.

This makes the Series behave more like a **dictionary**:  
- The index labels act as **keys**.  
- The values are associated with those keys.

Mathematically:
$$
\{ \text{index}_i \; \rightarrow \; \text{value}_i \}
$$


In [16]:
# Data values
data = [10, 20, 30, 40, 50]

# Custom index labels
index = ['a', 'b', 'c', 'd', 'e']

# Create a Pandas Series with custom index
series = pd.Series(data, index=index)

print("Series with Custom Index:")
print(series)

# Accessing values using custom labels
print("\nValue at index 'a':", series['a'])
print("Value at index 'd':", series['d'])

# Access multiple values using a list of labels
print("\nValues at indices ['b','e']:")
print(series[['b', 'e']])


Series with Custom Index:
a    10
b    20
c    30
d    40
e    50
dtype: int64

Value at index 'a': 10
Value at index 'd': 40

Values at indices ['b','e']:
b    20
e    50
dtype: int64
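
The dictionary-like behaviour described above can be sketched a little further. This is an illustrative example with made-up values: it shows label checks, label versus position access, and the fact that arithmetic between two Series matches elements by label rather than by position.

```python
import pandas as pd

series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Dictionary-like checks operate on the index labels
print('a' in series)          # membership test on labels
print(series.get('z', -1))    # default value when the label is missing

# Label-based vs position-based access
print(series.loc['b'])        # by label
print(series.iloc[1])         # by position

# Arithmetic aligns on labels, not on position
other = pd.Series([1, 2, 3], index=['c', 'b', 'a'])
total = series + other
print(total)
```

Because of label alignment, `total['a']` is `10 + 3`, even though `3` sits in the last position of `other`.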


# Pandas DataFrame

A **DataFrame** is a **two-dimensional**, **tabular data structure** with labeled rows and columns.  

- It can be thought of as an **Excel sheet** or a **SQL table** in Python.  
- Each **column** in a DataFrame is a **Series**.  
- DataFrames can be created from:
  - Dictionary of lists
  - Dictionary of Series
  - NumPy arrays
  - External files (CSV, Excel, SQL, etc.)

---

## Creating a DataFrame from a Dictionary of Lists

- **Keys** of the dictionary become **column names**.  
- **Values (lists)** become **column data**.  
- All lists must have the **same length** (one value per row).  

Mathematically:  
If we define a dictionary

$$
data = \{ 
\; "Name": [n_1, n_2, n_3], \; 
"Age": [a_1, a_2, a_3], \; 
"City": [c_1, c_2, c_3] \; \}
$$

Then, the resulting DataFrame looks like:

$$
\begin{array}{|c|c|c|c|}
\hline
\textbf{Index} & \textbf{Name} & \textbf{Age} & \textbf{City} \\
\hline
0 & n_1 & a_1 & c_1 \\
1 & n_2 & a_2 & c_2 \\
2 & n_3 & a_3 & c_3 \\
\hline
\end{array}
$$


In [None]:
# Dictionary of lists (columns)
data = {
    'Name': ['Prasanna', 'Sundaram', 'Indra'],      # Column "Name"
    'Age': [25, 30, 45],                            # Column "Age"
    'City': ['Chennai', 'Palani', 'Madurai']        # Column "City"
}

# Create DataFrame from dictionary
df = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame created from dictionary of lists:")
print(df)

# Check the type
print("\nType of object:", type(df))


DataFrame created from dictionary of lists:
       Name  Age     City
0  Prasanna   25  Chennai
1  Sundaram   30   Palani
2     Indra   45  Madurai

Type of object: <class 'pandas.core.frame.DataFrame'>
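
The list of sources above also mentions a dictionary of Series and NumPy arrays. A minimal sketch of both follows; the labels `x`, `y`, `z` and the column names `A`, `B` are invented for the example.

```python
import numpy as np
import pandas as pd

# Dictionary of Series: rows are aligned on the Series index labels
ages = pd.Series({'x': 25, 'y': 30})
cities = pd.Series({'y': 'Palani', 'z': 'Madurai'})
df_series = pd.DataFrame({'Age': ages, 'City': cities})
print(df_series)   # labels missing from one Series show up as NaN

# 2D NumPy array: column names are supplied explicitly
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_arr = pd.DataFrame(arr, columns=['A', 'B'])
print(df_arr)
```

Unlike a dictionary of lists, a dictionary of Series does not require equal lengths, because the rows are matched by label.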


# Creating a DataFrame from a List of Dictionaries

- A **list of dictionaries** is another flexible way to create a DataFrame.  
- Each **dictionary** in the list represents a **row** in the DataFrame.  
- The **keys** of the dictionary become the **column names**.  
- Missing keys will result in **NaN** (Not a Number) values for those columns.

Mathematically:

If we have a list of dictionaries:

$$
data = 
\left[
\{ "Name": "n_1", \; "Age": a_1, \; "City": "c_1" \}, \;
\{ "Name": "n_2", \; "Age": a_2, \; "City": "c_2" \}, \;
\{ "Name": "n_3", \; "Age": a_3, \; "City": "c_3" \}
\right]
$$

Then, the resulting DataFrame looks like:

$$
\begin{array}{|c|c|c|c|}
\hline
\textbf{Index} & \textbf{Name} & \textbf{Age} & \textbf{City} \\
\hline
0 & n_1 & a_1 & c_1 \\
1 & n_2 & a_2 & c_2 \\
2 & n_3 & a_3 & c_3 \\
\hline
\end{array}
$$


In [20]:
# List of dictionaries, each dictionary = one row
data = [
    {'Name': 'Prasanna', 'Age': 25, 'City': 'Chennai'},
    {'Name': 'Sundaram', 'Age': 30, 'City': 'Palani'},
    {'Name': 'Indra', 'Age': 45, 'City': 'Madurai'}
]

# Create DataFrame from list of dictionaries
df = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame created from list of dictionaries:")
print(df)

# Check the type
print("\nType of object:", type(df))


DataFrame created from list of dictionaries:
       Name  Age     City
0  Prasanna   25  Chennai
1  Sundaram   30   Palani
2     Indra   45  Madurai

Type of object: <class 'pandas.core.frame.DataFrame'>
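
The note about missing keys can be checked with a small sketch. The second row below deliberately omits `'Age'`; the variable name `df_missing` is only illustrative.

```python
import pandas as pd

rows = [
    {'Name': 'Prasanna', 'Age': 25, 'City': 'Chennai'},
    {'Name': 'Sundaram', 'City': 'Palani'},   # 'Age' key is missing
]
df_missing = pd.DataFrame(rows)

print(df_missing)
# Age becomes float64 because NaN is a floating-point value
print(df_missing.dtypes)
```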


# Converting a DataFrame to a NumPy Array

- A **Pandas DataFrame** is built on top of **NumPy arrays**.  
- You can convert a DataFrame into a NumPy array in multiple ways:  
  1. `np.array(df)` → converts the DataFrame values into a NumPy `ndarray`.  
  2. `df.values` → returns the underlying NumPy representation (legacy).  
  3. `df.to_numpy()` → preferred modern method (since Pandas v0.24).

---

## Example

Given a DataFrame:

$$
\begin{array}{|c|c|c|c|}
\hline
\textbf{Index} & \textbf{Name} & \textbf{Age} & \textbf{City} \\
\hline
0 & \text{Prasanna} & 25 & \text{Chennai} \\
1 & \text{Sundaram} & 30 & \text{Palani} \\
2 & \text{Indra} & 45 & \text{Madurai} \\
\hline
\end{array}
$$


In [24]:
# DataFrame
data = [
    {'Name': 'Prasanna', 'Age': 25, 'City': 'Chennai'},
    {'Name': 'Sundaram', 'Age': 30, 'City': 'Palani'},
    {'Name': 'Indra', 'Age': 45, 'City': 'Madurai'}
]
df = pd.DataFrame(data)

# Convert DataFrame into a NumPy array
arr1 = np.array(df)         # same as df.values
arr2 = df.values
arr3 = df.to_numpy()        # recommended way

print("Original DataFrame:\n", df, "\n")
print("NumPy Array using np.array(df):\n", arr1, "\n")
print("NumPy Array using df.values:\n", arr2, "\n")
print("NumPy Array using df.to_numpy():\n", arr3)

Original DataFrame:
        Name  Age     City
0  Prasanna   25  Chennai
1  Sundaram   30   Palani
2     Indra   45  Madurai 

NumPy Array using np.array(df):
 [['Prasanna' 25 'Chennai']
 ['Sundaram' 30 'Palani']
 ['Indra' 45 'Madurai']] 

NumPy Array using df.values:
 [['Prasanna' 25 'Chennai']
 ['Sundaram' 30 'Palani']
 ['Indra' 45 'Madurai']] 

NumPy Array using df.to_numpy():
 [['Prasanna' 25 'Chennai']
 ['Sundaram' 30 'Palani']
 ['Indra' 45 'Madurai']]
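
One detail the output above hints at: a NumPy array has a single dtype, so a DataFrame that mixes strings and numbers converts to an `object` array. The sketch below compares that with a numeric-only selection. The `Salary` column is an assumption added for illustration and is not part of the DataFrame in the cell above.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Prasanna', 'Sundaram', 'Indra'],
    'Age': [25, 30, 45],
    'Salary': [50000, 60000, 80000],   # hypothetical column for this example
})

mixed = df.to_numpy()                       # strings + integers -> object dtype
numeric = df[['Age', 'Salary']].to_numpy()  # integers only -> integer dtype

print("Mixed columns dtype:  ", mixed.dtype)
print("Numeric columns dtype:", numeric.dtype)
```

Selecting only the numeric columns before converting is usually what numerical libraries expect.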


# Reading CSV Files into Pandas DataFrame

- Pandas provides the function `pd.read_csv()` to read data from a **CSV (Comma-Separated Values)** file.  
- It automatically:
  - Infers column names from the first row (header).
  - Assigns row numbers as the default index (0,1,2,…).  
- After loading, you can use **DataFrame methods** like:
  - `.head(n)` → view the first `n` rows.
  - `.tail(n)` → view the last `n` rows.
  - `.shape` → check number of rows and columns.
  - `.info()` → check column data types and memory usage.

Mathematically:
$$
\text{CSV File} \;\; \longrightarrow \;\; \text{DataFrame (rows × columns)}
$$


In [30]:
# Read data from a CSV file into a DataFrame
# 'sales_data.csv' must be in the current working directory
sales_data = pd.read_csv('sales_data.csv')

# Display first 10 rows of the DataFrame
print("First 10 rows of sales_data:\n")
print(sales_data.head(10).to_string())
print()
# Display last 10 rows of the DataFrame
print("Last 10 rows of sales_data:\n")
print(sales_data.tail(10).to_string())

# Check basic information
print("\nShape of dataset (rows, columns):", sales_data.shape)
print("\nDataset info:")
sales_data.info()   # info() prints its report directly and returns None, so no print() wrapper is needed

First 10 rows of sales_data:

   Transaction ID        Date Product Category                 Product Name  Units Sold  Unit Price  Total Revenue         Region Payment Method
0           10001  2024-01-01      Electronics                iPhone 14 Pro           2      999.99        1999.98  North America    Credit Card
1           10002  2024-01-02  Home Appliances             Dyson V11 Vacuum           1      499.99         499.99         Europe         PayPal
2           10003  2024-01-03         Clothing             Levi's 501 Jeans           3       69.99         209.97           Asia     Debit Card
3           10004  2024-01-04            Books            The Da Vinci Code           4       15.99          63.96  North America    Credit Card
4           10005  2024-01-05  Beauty Products      Neutrogena Skincare Set           1       89.99          89.99         Europe         PayPal
5           10006  2024-01-06           Sports  Wilson Evolution Basketball           5       29.99         149.95           Asia    Credit Card

# Real-World Example: Sales Transactions Data

In real projects, datasets often contain **thousands or millions of rows**.  
For example, an **e-commerce sales dataset** might look like this:

- **OrderID** → unique transaction ID  
- **Customer** → customer name  
- **Product** → product bought  
- **Quantity** → how many items purchased  
- **Price** → unit price of the product  
- **Date** → transaction date  

Using `head()` helps us preview the **first few transactions**,  
while `tail()` shows the **last few rows**, which are the most recent transactions when the data is ordered by date.


In [38]:
# ---------------------------------------------------
# 📌 Real-world style dataset: E-commerce transactions
# ---------------------------------------------------
# Columns:
# - OrderID   → Unique transaction ID
# - Customer  → Name of the customer
# - Product   → Product purchased
# - Quantity  → Number of items purchased
# - Price     → Price per unit of the product
# - Date      → Date of the transaction
# - Total     → Computed column (Quantity × Price)

data = {
    'OrderID': range(1001, 1021),   # 20 unique order IDs (1001 → 1020)
    'Customer': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'] * 4,  # 20 names repeating
    'Product': ['Laptop', 'Mobile', 'Tablet', 'Headphones'] * 5,  # 20 products repeating
    'Quantity': [1, 2, 3, 1] * 5,   # Quantity pattern repeating
    'Price': [70000, 15000, 12000, 3000] * 5,  # Product prices repeating
    'Date': pd.date_range(start="2023-01-01", periods=20, freq='D')  # 20 consecutive days
}

# Create a DataFrame from the dictionary
sales_data = pd.DataFrame(data)

# Add a derived column "Total" (transaction amount = Quantity × Price)
sales_data['Total'] = sales_data['Quantity'] * sales_data['Price']

# ---------------------------------------------------
# ✅ 1. Preview the dataset using head()
# - head(n) → displays the first n rows (default = 5)
# - This helps analysts check column names, structure, and sample data
# ---------------------------------------------------
print("🔹 First 5 transactions (preview dataset with head()):")
display(sales_data.head())   # display() renders the table; a bare sales_data.head() does the same only as the last line of a cell

# ---------------------------------------------------
# ✅ 2. Inspect the most recent transactions using tail()
# - tail(n) → displays the last n rows (default = 5)
# - Useful for time-series data → check the latest orders, logs, or transactions
# ---------------------------------------------------
print("\n🔹 Last 5 transactions (latest records with tail()):")
display(sales_data.tail())

# ---------------------------------------------------
# ✅ 3. Inspect dataset info
# - info() shows:
#   - number of rows
#   - column names & count of non-null values
#   - data types (int, float, object, datetime)
#   - memory usage
# ---------------------------------------------------
print("\n🔹 Dataset Info:")
sales_data.info()   # info() prints its report directly and returns None, so no print() wrapper is needed

# ---------------------------------------------------
# ✅ 4. Get summary statistics
# - describe() gives descriptive stats for numeric and datetime columns:
#   - count, mean, std, min, 25%, 50%, 75%, max
# - Great for quickly spotting data ranges & anomalies
# ---------------------------------------------------
print("\n🔹 Summary Statistics:")
display(sales_data.describe())


🔹 First 5 transactions (preview dataset with head()):


   OrderID  Customer     Product  Quantity  Price        Date  Total
0     1001     Alice      Laptop         1  70000  2023-01-01  70000
1     1002       Bob      Mobile         2  15000  2023-01-02  30000
2     1003   Charlie      Tablet         3  12000  2023-01-03  36000
3     1004     David  Headphones         1   3000  2023-01-04   3000
4     1005       Eva      Laptop         1  70000  2023-01-05  70000



🔹 Last 5 transactions (latest records with tail()):


    OrderID  Customer     Product  Quantity  Price        Date  Total
15     1016     Alice  Headphones         1   3000  2023-01-16   3000
16     1017       Bob      Laptop         1  70000  2023-01-17  70000
17     1018   Charlie      Mobile         2  15000  2023-01-18  30000
18     1019     David      Tablet         3  12000  2023-01-19  36000
19     1020       Eva  Headphones         1   3000  2023-01-20   3000



🔹 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   OrderID   20 non-null     int64         
 1   Customer  20 non-null     object        
 2   Product   20 non-null     object        
 3   Quantity  20 non-null     int64         
 4   Price     20 non-null     int64         
 5   Date      20 non-null     datetime64[ns]
 6   Total     20 non-null     int64         
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 1.2+ KB

🔹 Summary Statistics:


       OrderID  Quantity         Price                 Date         Total
count     20.0      20.0          20.0                   20          20.0
mean    1010.5      1.75       25000.0  2023-01-10 12:00:00       34750.0
min     1001.0       1.0        3000.0  2023-01-01 00:00:00        3000.0
25%    1005.75       1.0        9750.0  2023-01-05 18:00:00       23250.0
50%     1010.5       1.5       13500.0  2023-01-10 12:00:00       33000.0
75%    1015.25      2.25       28750.0  2023-01-15 06:00:00       44500.0
max     1020.0       3.0       70000.0  2023-01-20 00:00:00       70000.0
std    5.91608  0.850696  27037.984976                  NaN  24466.679813


In [39]:
# ---------------------------------------------------
# 📌 Step 1: Load your dataset from CSV
# ---------------------------------------------------
sales_data = pd.read_csv("sales_data.csv")

# ---------------------------------------------------
# 📌 Step 2: Add a derived column (Total = Units Sold × Unit Price)
# ---------------------------------------------------
# Assumes your CSV has columns: "Units Sold" and "Unit Price"
# If names differ, adjust accordingly.
sales_data["Total"] = sales_data["Units Sold"] * sales_data["Unit Price"]

# ---------------------------------------------------
# 📌 Step 3: Preview data using head() 
# - First 10 rows → to confirm structure
# ---------------------------------------------------
print("🔹 First 10 rows (head):")
display(sales_data.head(10))   # in Jupyter, display() gives pretty output

# ---------------------------------------------------
# 📌 Step 4: Inspect last 5 rows with tail()
# - Useful for checking recent records in transactional datasets
# ---------------------------------------------------
print("\n🔹 Last 5 rows (tail):")
display(sales_data.tail())

# ---------------------------------------------------
# 📌 Step 5: Dataset Info
# - See column names, data types, missing values, memory usage
# ---------------------------------------------------
print("\n🔹 Dataset Info:")
sales_data.info()   # info() prints its report directly and returns None, so no print() wrapper is needed

# ---------------------------------------------------
# 📌 Step 6: Summary Statistics
# - Basic stats (count, mean, std, min, quartiles, max) for numeric columns
# ---------------------------------------------------
print("\n🔹 Summary Statistics:")
display(sales_data.describe())


🔹 First 10 rows (head):


   Transaction ID        Date Product Category                 Product Name  Units Sold  Unit Price  Total Revenue         Region Payment Method    Total
0           10001  2024-01-01      Electronics                iPhone 14 Pro           2      999.99        1999.98  North America    Credit Card  1999.98
1           10002  2024-01-02  Home Appliances             Dyson V11 Vacuum           1      499.99         499.99         Europe         PayPal   499.99
2           10003  2024-01-03         Clothing             Levi's 501 Jeans           3       69.99         209.97           Asia     Debit Card   209.97
3           10004  2024-01-04            Books            The Da Vinci Code           4       15.99          63.96  North America    Credit Card    63.96
4           10005  2024-01-05  Beauty Products      Neutrogena Skincare Set           1       89.99          89.99         Europe         PayPal    89.99
5           10006  2024-01-06           Sports  Wilson Evolution Basketball           5       29.99         149.95           Asia    Credit Card   149.95
6           10007  2024-01-07      Electronics          MacBook Pro 16-inch           1     2499.99        2499.99  North America    Credit Card  2499.99
7           10008  2024-01-08  Home Appliances         Blueair Classic 480i           2      599.99        1199.98         Europe         PayPal  1199.98
8           10009  2024-01-09         Clothing             Nike Air Force 1           6       89.99         539.94           Asia     Debit Card   539.94
9           10010  2024-01-10            Books        Dune by Frank Herbert           2       25.99          51.98  North America    Credit Card    51.98



🔹 Last 5 rows (tail):


     Transaction ID        Date Product Category                                     Product Name  Units Sold  Unit Price  Total Revenue         Region Payment Method   Total
235           10236  2024-08-23  Home Appliances  Nespresso Vertuo Next Coffee and Espresso Maker           1      159.99         159.99         Europe         PayPal  159.99
236           10237  2024-08-24         Clothing                        Nike Air Force 1 Sneakers           3        90.0          270.0           Asia     Debit Card   270.0
237           10238  2024-08-25            Books           The Handmaid's Tale by Margaret Atwood           3       10.99          32.97  North America    Credit Card   32.97
238           10239  2024-08-26  Beauty Products             Sunday Riley Luna Sleeping Night Oil           1        55.0           55.0         Europe         PayPal    55.0
239           10240  2024-08-27           Sports                       Yeti Rambler 20 oz Tumbler           2       29.99          59.98           Asia    Credit Card   59.98



🔹 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    240 non-null    int64  
 1   Date              240 non-null    object 
 2   Product Category  240 non-null    object 
 3   Product Name      240 non-null    object 
 4   Units Sold        240 non-null    int64  
 5   Unit Price        240 non-null    float64
 6   Total Revenue     240 non-null    float64
 7   Region            240 non-null    object 
 8   Payment Method    240 non-null    object 
 9   Total             240 non-null    float64
dtypes: float64(3), int64(2), object(5)
memory usage: 18.9+ KB

🔹 Summary Statistics:


       Transaction ID  Units Sold  Unit Price  Total Revenue       Total
count           240.0       240.0       240.0          240.0       240.0
mean          10120.5    2.158333  236.395583     335.699375  335.699375
std          69.42622    1.322454  429.446695     485.804469  485.804469
min           10001.0         1.0         6.5            6.5         6.5
25%          10060.75         1.0        29.5         62.965      62.965
50%           10120.5         2.0       89.99         179.97      179.97
75%          10180.25         3.0      249.99        399.225     399.225
max           10240.0        10.0     3899.99        3899.99     3899.99
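
In the `info()` output above, the `Date` column is read as `object` (plain strings). A hedged sketch of one way to get real dates is the `parse_dates` argument of `pd.read_csv()`. To keep the example self-contained, it reads from an in-memory string instead of `sales_data.csv`; the two rows and four columns are a cut-down copy of the first rows shown earlier, not the real file.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (illustrative subset of columns and rows)
csv_text = """Transaction ID,Date,Units Sold,Unit Price
10001,2024-01-01,2,999.99
10002,2024-01-02,1,499.99
"""

# parse_dates converts the listed columns to datetime while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(sample.dtypes)
print(sample['Date'].dt.day_name())   # the .dt accessor needs a datetime column
```

With a datetime column, date-based operations such as sorting by date or extracting the month become available through `.dt`.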


# Accessing Data from a Pandas DataFrame

Once a DataFrame is created, we often need to **access columns, rows, or specific cells**.  
Pandas provides multiple convenient methods for this.

---

## 1. Accessing Columns

- `df['ColumnName']` → returns a **Series** (1D).  
- `df[['Col1', 'Col2']]` → returns a **DataFrame** (2D).  
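- `df.ColumnName` → shorthand attribute access; works only when the column name is a valid Python identifier (no spaces) and does not clash with an existing DataFrame attribute or method such as `count`.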

---

## 2. Accessing Rows

- `df.loc[index_label]` → label-based access (row by index label).  
- `df.iloc[index_position]` → position-based access (row by integer index).  
- `df[0:3]` → slicing by row positions (like Python lists).  

---

## 3. Accessing Specific Values

- `df.loc[row_label, col_label]` → by row and column name.  
- `df.iloc[row_position, col_position]` → by integer positions.  
- `df.at[row_label, col_label]` → **fast scalar access** (label-based).  
- `df.iat[row_position, col_position]` → **fast scalar access** (positional).  

---

## 4. Slicing Rows

- `df[0:2]` → first 2 rows (rows with position 0 and 1).  
- `df.loc[0:2]` → rows with labels 0, 1, and 2 (inclusive).  

---

## 5. Slicing Columns

- `df.loc[:, 'Col1':'Col3']` → all rows, columns between `Col1` and `Col3` (inclusive).  
- `df.iloc[:, 0:2]` → all rows, first 2 columns.  

---

## 6. Boolean Filtering

- `df[df['Age'] > 30]` → all rows where Age > 30.  
- `df[(df['Age'] > 30) & (df['City'] == 'Madurai')]` → multiple conditions with **AND**.  
- `df.query("Age > 30 and City == 'Madurai'")` → SQL-like string queries.  

---

## 7. Conditional Selection + Assignment

- `df.loc[df['Age'] > 30, 'City']` → select City values where Age > 30.  
- `df.loc[df['Age'] > 30, 'City'] = 'Senior City'` → update values conditionally.  

---

## 8. Index and Column Metadata

- `df.index` → row labels.  
- `df.columns` → column labels.  
- `df.set_index('Column')` → set a column as index for label-based access.  
- `df.reset_index()` → reset index back to default integers.  

---

## 9. Quick Exploration

- `df.head(n)` → first `n` rows (default 5).  
- `df.tail(n)` → last `n` rows.  
- `df.sample(n)` → random sample of rows.  

---

## 10. Converting to NumPy

- `df[['Col1','Col2']].to_numpy()` → returns the selected columns as a 2D NumPy array.  
- Useful for ML libraries that require NumPy arrays.  

---

## 11. Safe Column Access

- `df.get('ColName', default_value)` → safe access with default if column does not exist.  

---

✅ With these techniques, you can access **any part of a DataFrame**:  
- Columns (Series / DataFrame)  
- Rows (by labels or positions)  
- Cells (single values)  
- Subsets (slicing)  
- Conditional selections  
- Random samples  


In [55]:
# -----------------------------------------------
# 📌 Sample DataFrame
# -----------------------------------------------

# COMPLETE DEMO: All common ways to access data in Pandas

data = [
    {'Name': 'Prasanna', 'Age': 25, 'City': 'Chennai', 'Salary': 50000},
    {'Name': 'Sundaram', 'Age': 30, 'City': 'Palani', 'Salary': 60000},
    {'Name': 'Indra', 'Age': 45, 'City': 'Madurai', 'Salary': 80000},
    {'Name': 'Ravi', 'Age': 35, 'City': 'Chennai', 'Salary': 70000},
    {'Name': 'Meena', 'Age': 28, 'City': 'Palani', 'Salary': 52000},
]
df = pd.DataFrame(data)

# show the DataFrame (in Jupyter this renders nicely)
print("Full DataFrame:\n", df, "\n")

# ---------------------------
# 1) Column access
# ---------------------------
# Single column -> returns a Series
names = df['Name']
print("1) df['Name'] -> Series:\n", names, "\n", type(names), "\n")

# Multiple columns -> returns a DataFrame
name_salary = df[['Name', 'Salary']]
print("2) df[['Name','Salary']] -> DataFrame:\n", name_salary, "\n", type(name_salary), "\n")

# Attribute-style access (only when column name is a valid attribute)
# Note: safer to use df['col'] to avoid ambiguity
print("3) df.Name -> Series (attribute access):\n", df.Name, "\n")

# ---------------------------
# 2) Row access: loc (label-based) and iloc (position-based)
# ---------------------------
# By default index labels are 0..n-1 so df.loc[2] = df.iloc[2] here
print("4) df.loc[2] -> row with index label 2 (Series):\n", df.loc[2], "\n")
print("5) df.iloc[2] -> row at integer position 2 (Series):\n", df.iloc[2], "\n")

# ---------------------------
# 3) Access single values (fast accessors)
# ---------------------------
# Use .loc/.iloc for flexible indexing; .at/.iat are faster for single values
val_loc = df.loc[0, 'City']   # label based: row label 0, column 'City'
val_iloc = df.iloc[0, 2]      # positional: first row, third column
val_at = df.at[0, 'City']     # fast single-value access (label-based)
val_iat = df.iat[0, 2]        # fast single-value access (positional)
print("6) Single value examples:")
print("   df.loc[0,'City'] ->", val_loc)
print("   df.iloc[0,2] ->", val_iloc)
print("   df.at[0,'City'] ->", val_at)
print("   df.iat[0,2] ->", val_iat, "\n")

# ---------------------------
# 4) Row slicing
# ---------------------------
print("7) Row slicing df[0:3] (first 3 rows):\n", df[0:3], "\n")
# .loc with slice includes the stop label if labels are integers and present
print("8) df.loc[0:2] -> includes rows with labels 0,1,2:\n", df.loc[0:2], "\n")

# ---------------------------
# 5) Column slicing
# ---------------------------
# Use .loc[:, 'StartCol':'EndCol'] to slice columns by label
print("9) df.loc[:, 'Name':'City'] -> select columns Name through City:\n", df.loc[:, 'Name':'City'], "\n")
# Use .iloc for positional column slices
print("10) df.iloc[:, 0:2] -> first two columns (positional):\n", df.iloc[:, 0:2], "\n")

# ---------------------------
# 6) Boolean indexing / filtering
# ---------------------------
# Create a boolean mask for rows where Age > 30
mask_age_gt_30 = df['Age'] > 30
print("11) Boolean mask (Age > 30):\n", mask_age_gt_30, "\n")

# Apply mask to get filtered DataFrame
filtered = df[mask_age_gt_30]
print("12) df[df['Age'] > 30] -> rows where Age > 30:\n", filtered, "\n")

# Combine multiple conditions with & (and) and | (or) — use parentheses
mask = (df['City'] == 'Chennai') & (df['Salary'] > 60000)
print("13) Combined condition (City == 'Chennai' AND Salary > 60000):\n", df[mask], "\n")

# Using query() for readable filter expressions (string form)
print("14) Using df.query() -> City == 'Palani':\n", df.query("City == 'Palani'"), "\n")

# ---------------------------
# 7) Conditional selection + assignment
# ---------------------------
# Example: give a bonus of 5000 to employees in Chennai
df.loc[df['City'] == 'Chennai', 'Salary'] += 5000
print("15) After conditional update (Salary +5000 for Chennai):\n", df, "\n")

# ---------------------------
# 8) Accessing index and columns metadata
# ---------------------------
print("16) df.index ->", df.index)
print("17) df.columns ->", df.columns, "\n")

# You can set a column as index for label based access
df_indexed = df.set_index('Name')   # returns a new DataFrame; original unchanged
print("18) df.set_index('Name') -> indexed by Name:\n", df_indexed, "\n")
# Now loc can use the name label directly
print("19) df_indexed.loc['Indra'] ->\n", df_indexed.loc['Indra'], "\n")

# Reset index back to default if needed
df_reset = df_indexed.reset_index()
print("20) reset_index() -> back to default integer index:\n", df_reset, "\n")

# ---------------------------
# 9) Random sampling and head/tail
# ---------------------------
print("21) df.head(3) -> first 3 rows:\n", df.head(3), "\n")
print("22) df.tail(2) -> last 2 rows:\n", df.tail(2), "\n")
print("23) df.sample(2) -> random 2 rows:\n", df.sample(2), "\n")

# ---------------------------
# 10) Convert selection to numpy / values
# ---------------------------
# to get underlying numpy array (useful for ML libraries)
arr_from_df = df[['Age', 'Salary']].to_numpy()  # 2D ndarray of numeric columns
print("24) df[['Age','Salary']].to_numpy():\n", arr_from_df, "\n", "dtype:", arr_from_df.dtype, "\n")

# ---------------------------
# 11) Access with .get (safe column access with default)
# ---------------------------
print("25) df.get('NonExistingColumn', 'default') ->", df.get('NonExistingColumn', 'default'), "\n")

# ---------------------------
# 12) at / iat performance note
# ---------------------------
print("26) Performance note: use .at/.iat for single scalar access in loops (faster than .loc/.iloc).")


Full DataFrame:
        Name  Age     City  Salary
0  Prasanna   25  Chennai   50000
1  Sundaram   30   Palani   60000
2     Indra   45  Madurai   80000
3      Ravi   35  Chennai   70000
4     Meena   28   Palani   52000 

1) df['Name'] -> Series:
 0    Prasanna
1    Sundaram
2       Indra
3        Ravi
4       Meena
Name: Name, dtype: object 
 <class 'pandas.core.series.Series'> 

2) df[['Name','Salary']] -> DataFrame:
        Name  Salary
0  Prasanna   50000
1  Sundaram   60000
2     Indra   80000
3      Ravi   70000
4     Meena   52000 
 <class 'pandas.core.frame.DataFrame'> 

3) df.Name -> Series (attribute access):
 0    Prasanna
1    Sundaram
2       Indra
3        Ravi
4       Meena
Name: Name, dtype: object 

4) df.loc[2] -> row with index label 2 (Series):
 Name        Indra
Age            45
City      Madurai
Salary      80000
Name: 2, dtype: object 

5) df.iloc[2] -> row at integer position 2 (Series):
 Name        Indra
Age            45
City      Madurai
Salary      80000


# Adding and Removing Columns in Pandas

Once a DataFrame is created, we often need to **add new columns** or **remove existing ones**.  
Pandas makes this easy with simple syntax.

---

## 1. Adding Columns

- **Assign a new column directly**:  
  `df['NewCol'] = values`  

- **Derived column (from existing data)**:  
  `df['Bonus'] = df['Salary'] * 0.10`  

- **Insert column at specific position**:  
  `df.insert(position, 'NewCol', values)`  
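
The three forms above can be sketched together as follows. This is a minimal illustration on a small made-up DataFrame; the `Country` column and the sample values are assumptions chosen only for the example.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Salary': [50000, 60000, 80000]})

# 1) Direct assignment: a scalar is broadcast to every row
df['Country'] = 'India'

# 2) Derived column computed from an existing column
df['Bonus'] = df['Salary'] * 0.10

# 3) insert() places the column at 0-based position 1 and modifies df in place
df.insert(1, 'Department', ['IT', 'HR', 'Finance'])

print(list(df.columns))
```

Direct assignment always appends the new column at the end, so `insert()` is the option to reach for when column order matters.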

---

## 2. Removing Columns

- **Delete a column using `del`:**  
  `del df['ColName']`  

- **Drop a column (returns new DataFrame by default):**  
  `df.drop('ColName', axis=1)`  
  - Use `inplace=True` to modify the original DataFrame directly instead of creating a new one.
  - Drop multiple columns: `df.drop(['Col1','Col2'], axis=1)`  

⚠️ Important:
- If `inplace=False` (default), `drop()` returns a new DataFrame and does not touch the original.
- If `inplace=True`, the operation changes the original DataFrame and returns `None`.
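
The difference in return values can be checked directly. The sketch below uses a small made-up DataFrame (the names and values are placeholders, not the notebook's data) and captures what each call returns.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
df = pd.DataFrame({'Name': ['A', 'B'], 'Age': [25, 30], 'City': ['X', 'Y']})

# Default (inplace=False): a new DataFrame comes back; df is untouched
without_city = df.drop('City', axis=1)

# inplace=True: df itself changes and the call returns None
result = df.drop('Age', axis=1, inplace=True)

print(list(without_city.columns))  # columns of the returned copy
print(list(df.columns))            # columns of the modified original
print(result)                      # None
```

A common slip follows from this: writing `df = df.drop('Age', axis=1, inplace=True)` rebinds `df` to `None`, so use either reassignment or `inplace=True`, not both.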

---


In [61]:
# -----------------------------------------------
# 📌 Sample DataFrame
# -----------------------------------------------
data = [
    {'Name': 'Prasanna', 'Age': 25, 'City': 'Chennai', 'Salary': 50000},
    {'Name': 'Sundaram', 'Age': 30, 'City': 'Palani', 'Salary': 60000},
    {'Name': 'Indra', 'Age': 45, 'City': 'Madurai', 'Salary': 80000},
    {'Name': 'Ravi', 'Age': 35, 'City': 'Chennai', 'Salary': 70000},
    {'Name': 'Meena', 'Age': 28, 'City': 'Palani', 'Salary': 52000},
]

df = pd.DataFrame(data)
print("Original DataFrame:\n", df, "\n")

# -----------------------------------------------
# 1. Add new column directly
# -----------------------------------------------
df['Bonus'] = df['Salary'] * 0.10   # 10% bonus
print("After adding Bonus column:\n", df, "\n")

# -----------------------------------------------
# 2. Insert column at specific position
# -----------------------------------------------
df.insert(2, 'Department', ['IT', 'HR', 'Finance', 'IT', 'HR'])  
print("After inserting Department column at position 2:\n", df, "\n")

# -----------------------------------------------
# 3. Remove a column using del
# -----------------------------------------------
del df['Bonus']
print("After deleting Bonus column with del:\n", df, "\n")

# -----------------------------------------------
# 4. Remove columns using drop
# -----------------------------------------------
df_dropped = df.drop(['Department', 'City'], axis=1)  # returns new DataFrame
print("New DataFrame after dropping Department & City:\n", df_dropped, "\n")

# Drop in place (modifies the original DataFrame)
# axis=1 operates on columns
df.drop('Department', axis=1, inplace=True)
print("After dropping Department column inplace:\n", df, "\n")

# Drop a row in place (modifies the original DataFrame)
# drop() matches the index label (here 2), not the integer position
# axis=0 operates on rows and is the default
df.drop(2, axis=0, inplace=True)
print("After dropping row at index = 2 inplace:\n", df, "\n")


Original DataFrame:
        Name  Age     City  Salary
0  Prasanna   25  Chennai   50000
1  Sundaram   30   Palani   60000
2     Indra   45  Madurai   80000
3      Ravi   35  Chennai   70000
4     Meena   28   Palani   52000 

After adding Bonus column:
        Name  Age     City  Salary   Bonus
0  Prasanna   25  Chennai   50000  5000.0
1  Sundaram   30   Palani   60000  6000.0
2     Indra   45  Madurai   80000  8000.0
3      Ravi   35  Chennai   70000  7000.0
4     Meena   28   Palani   52000  5200.0 

After inserting Department column at position 2:
        Name  Age Department     City  Salary   Bonus
0  Prasanna   25         IT  Chennai   50000  5000.0
1  Sundaram   30         HR   Palani   60000  6000.0
2     Indra   45    Finance  Madurai   80000  8000.0
3      Ravi   35         IT  Chennai   70000  7000.0
4     Meena   28         HR   Palani   52000  5200.0 

After deleting Bonus column with del:
        Name  Age Department     City  Salary
0  Prasanna   25         IT  Chennai 

# Dropping Columns with and without `inplace`

The `drop()` method in Pandas behaves differently depending on the `inplace` parameter:

- **Default (`inplace=False`)**  
  - Returns a **new DataFrame** without the dropped column(s).  
  - The original DataFrame is **unchanged**.  

- **With `inplace=True`**  
  - Modifies the **original DataFrame directly**.  
  - Returns `None`.  
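
As a side note, `drop()` also accepts `columns=` and `index=` keywords as an alternative to passing `axis`. The sketch below, on made-up data, suggests the two spellings give the same result; it is an illustration rather than a recommendation of one style over the other.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Age': [25, 30, 45],
    'City': ['X', 'Y', 'Z'],
})

# columns= is equivalent to passing the label with axis=1
by_axis = df.drop('City', axis=1)
by_keyword = df.drop(columns='City')
print(by_axis.equals(by_keyword))

# index= drops rows by index label, equivalent to axis=0
fewer_rows = df.drop(index=1)
print(list(fewer_rows.index))

# Both calls returned new objects, so df still has all three columns
print(list(df.columns))
```

The keyword form states the intent in the call itself, which can help readers who do not remember which axis number means rows and which means columns.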

---


In [57]:
# Sample DataFrame
data = {
    'Name': ['Prasanna', 'Sundaram', 'Indra'],
    'Age': [25, 30, 45],
    'City': ['Chennai', 'Palani', 'Madurai']
}
df = pd.DataFrame(data)

print("Original DataFrame:\n", df, "\n")

# ---------------------------------------
# Drop column WITHOUT inplace (default)
# ---------------------------------------
df_new = df.drop('City', axis=1)  # returns new DataFrame
print("New DataFrame after drop (inplace=False):\n", df_new, "\n")

# Check original DataFrame still has "City"
print("Original DataFrame remains unchanged:\n", df, "\n")

# ---------------------------------------
# Drop column WITH inplace
# ---------------------------------------
df.drop('City', axis=1, inplace=True)  # modifies df directly
print("Original DataFrame after inplace=True:\n", df, "\n")


Original DataFrame:
        Name  Age     City
0  Prasanna   25  Chennai
1  Sundaram   30   Palani
2     Indra   45  Madurai 

New DataFrame after drop (inplace=False):
        Name  Age
0  Prasanna   25
1  Sundaram   30
2     Indra   45 

Original DataFrame remains unchanged:
        Name  Age     City
0  Prasanna   25  Chennai
1  Sundaram   30   Palani
2     Indra   45  Madurai 

Original DataFrame after inplace=True:
        Name  Age
0  Prasanna   25
1  Sundaram   30
2     Indra   45 

