## Introduction to Pandas & Core Data Structures

### What is Pandas?

- Open-source Python library for handling structured data (rows & columns).

- Used in data analysis, preprocessing, business reports, machine learning.

- Documentation: https://pandas.pydata.org/docs/getting_started/install.html

### How to Install Pandas

In [83]:
# Run this in a Jupyter Notebook cell; in a terminal, run the command without the leading "!"
!pip install pandas



### Importing Pandas

In [84]:
import pandas as pd

### Pandas Series (1D Data)

- One-dimensional labeled data (like a single Excel column).

- A Series is often used for single-variable analysis or as one column of a larger DataFrame.

- Created from:
    - List:
        - `pd.Series([100, 200, 300])`
        - Quick creation for simple datasets (sales, counts).

    - Dictionary:
        - `pd.Series({'Math': 95, 'Physics': 88})`
        - Keys = labels ➔ Values = data.
        - Useful for labeled data like survey responses or category scores.

    - NumPy Array:
        - `pd.Series(np.array([1, 2, 3, 4]))` 
        - Ideal when data comes from scientific computations or numeric simulations.

✅ **Note:**
In ML projects, data often arrives as NumPy arrays; converting it to Pandas makes it easier to manipulate and read.

In [85]:
# From a List
series_list = pd.Series([100, 200, 300])
print('Series from list:')
print(series_list)

Series from list:
0    100
1    200
2    300
dtype: int64


In [86]:
# From a Dictionary
series_dict = pd.Series({'Math': 95, 'Physics': 88})
print('\nSeries from dictionary:')
print(series_dict)


Series from dictionary:
Math       95
Physics    88
dtype: int64


In [87]:
# From a NumPy Array
import numpy as np

series_np = pd.Series(np.array([1, 2, 3, 4]))
print('\nSeries from NumPy array:')
print(series_np)


Series from NumPy array:
0    1
1    2
2    3
3    4
dtype: int64
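
Because a Series is labeled, values can be looked up by label as well as by position. The sketch below assumes illustrative month labels (`'Jan'`, `'Feb'`, `'Mar'`) that are not part of the examples above.

```python
import pandas as pd

# Custom labels are passed through the `index` argument (label names are illustrative)
sales = pd.Series([100, 200, 300], index=['Jan', 'Feb', 'Mar'])

by_label = sales['Feb']        # label-based lookup
by_position = sales.iloc[0]    # position-based lookup

print(by_label)       # 200
print(by_position)    # 100
print(sales.index.tolist())
```

Using `.iloc` for positions keeps the intent explicit when the labels themselves are not integers.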


### Pandas DataFrame (2D Data)
- Two-dimensional, labeled rows & columns (like an Excel sheet).

- Created from:
    - Dictionary of Lists (Most Common):
        - `pd.DataFrame({'Name': ['John', 'Sara'], 'Age': [28, 22]})`
        - Best when data is created manually or is already organized column by column.

    - List of Dictionaries (API-Friendly):
        - `pd.DataFrame([{'Name': 'John', 'Age': 28}, {'Name': 'Sara', 'Age': 22}])`
        - Common when working with JSON APIs or MongoDB style documents.

    - NumPy Arrays:
        - `pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['A', 'B'])`
        - Best for numeric matrices (images, financial models).

    **pd.DataFrame(data) ➔ Tabular format.**

✅ **Note:** Always set clear column names for readability and future automation.

In [88]:
# From Dictionary of Lists
df_dict = pd.DataFrame({'Name': ['John', 'Sara'], 'Age': [28, 22]})
print('DataFrame from dictionary of lists:')
print(df_dict)

DataFrame from dictionary of lists:
   Name  Age
0  John   28
1  Sara   22


In [89]:
# From List of Dictionaries
df_list_dict = pd.DataFrame([{'Name': 'John', 'Age': 28}, {'Name': 'Sara', 'Age': 22}])
print('\nDataFrame from list of dictionaries:')
print(df_list_dict)


DataFrame from list of dictionaries:
   Name  Age
0  John   28
1  Sara   22


In [90]:
# From NumPy Arrays
df_np = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['A', 'B'])
print('\nDataFrame from NumPy array:')
print(df_np)


DataFrame from NumPy array:
   A  B
0  1  2
1  3  4
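
To illustrate the note about clear column names, here is a small sketch of what happens when a NumPy array is converted without the `columns` argument. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

matrix = np.array([[1, 2], [3, 4]])

# Without `columns`, pandas falls back to integer labels 0, 1, ...
df_unnamed = pd.DataFrame(matrix)
print(df_unnamed.columns.tolist())   # [0, 1]

# Passing names up front makes later column access self-explanatory
df_named = pd.DataFrame(matrix, columns=['A', 'B'])
print(df_named['A'].tolist())        # [1, 3]
```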


### Exploring DataFrames
- View rows:  
  `df.head()` ➔ First 5 rows (pass a number for a different count, e.g. `df.head(10)`)  
  `df.tail()` ➔ Last 5 rows (pass a number for a different count, e.g. `df.tail(10)`)

- Dimensions:  
  `df.shape` ➔ Tuple of (rows, columns)

- Columns: 
  `df.columns` ➔ List column names

- Data Types:  
  `df.dtypes` ➔ Understand types (important for ML models)

- Quick Summary: 
  `df.info()` ➔ Non-null counts, dtypes, memory usage  
  `df.describe()` ➔ Basic stats (numeric columns by default)

### Loading the **Tips** Dataset with Seaborn

In [91]:
import seaborn as sns


df = sns.load_dataset('tips')

In [92]:
print("First 5 rows:")
print(df.head())

First 5 rows:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [93]:
print("Shape (rows, columns):", df.shape)

Shape (rows, columns): (244, 7)


In [94]:
print("Columns:", df.columns)

Columns: Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')


In [95]:
print("Data types:")
print(df.dtypes)

Data types:
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object


In [96]:
print("Info:\n")
df.info()

Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [97]:
print("Describe:\n")
print(df.describe())

Describe:

       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000


### Real-World DataFrame Example — Customer Data
Imagine receiving online store transaction data:

```python
customer_list = [
 {'CustomerID': 101, 'Name': 'Alice', 'Purchase': 250.75},
 {'CustomerID': 102, 'Name': 'Bob', 'Purchase': 150.50},
 {'CustomerID': 103, 'Name': 'Charlie', 'Purchase': 300.00}
]
```
- Use `pd.DataFrame(customer_list)` ➔ Clean, ready-to-analyze tabular data.

👉 API responses and NoSQL query results often arrive in this list-of-dictionaries shape in real business projects.

✅ **Note:**
For messy nested JSON, use `pd.json_normalize()` to flatten nested fields into columns.

In [98]:
customer_list = [
    {'CustomerID': 101, 'Name': 'Alice', 'Purchase': 250.75},
    {'CustomerID': 102, 'Name': 'Bob', 'Purchase': 150.50},
    {'CustomerID': 103, 'Name': 'Charlie', 'Purchase': 300.00}
]
df_customers = pd.DataFrame(customer_list)
print('Customer DataFrame:')
print(df_customers)

Customer DataFrame:
   CustomerID     Name  Purchase
0         101    Alice    250.75
1         102      Bob    150.50
2         103  Charlie    300.00
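
For the nested case mentioned in the note, a minimal sketch of `pd.json_normalize()` might look like the following. The `Address` field and its contents are hypothetical and only serve to show how nested keys become dotted column names.

```python
import pandas as pd

# Hypothetical nested records, e.g. from an API response
orders = [
    {'CustomerID': 101, 'Name': 'Alice', 'Address': {'City': 'Paris', 'Zip': '75001'}},
    {'CustomerID': 102, 'Name': 'Bob', 'Address': {'City': 'Rome', 'Zip': '00100'}},
]

# Nested dictionaries are flattened into columns such as 'Address.City'
df_orders = pd.json_normalize(orders)
print(df_orders.columns.tolist())
print(df_orders.loc[0, 'Address.City'])
```

Passing the same list to `pd.DataFrame()` would instead leave each `Address` dictionary inside a single cell.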


### Additional Essential Tricks
- Create Empty DataFrame:
    - `pd.DataFrame()` ➔ Useful when building a table programmatically.

- Reset Index:
    - `df.reset_index(drop=True)` ➔ Renumber rows from 0 after filtering or sorting; `drop=True` discards the old index instead of keeping it as a column.

- Add New Rows:
    - Appending rows ➔ use `pd.concat()` (`DataFrame.append()` was deprecated in pandas 1.4 and removed in 2.0).

✅ **Note:**
Instead of appending rows one by one (which is slow), collect rows in a list and convert once to DataFrame.

**Example:**

Create: `rows = []`

Add: `rows.append({'col1': val1, 'col2': val2})`

Build: `df = pd.DataFrame(rows)`

In [99]:
# Create Empty DataFrame
empty_df = pd.DataFrame()
print('Empty DataFrame:')
print(empty_df)

Empty DataFrame:
Empty DataFrame
Columns: []
Index: []


In [100]:
# Reset Index Example (df_dict already has a default 0..n-1 index, so the output looks unchanged)
df_reset = df_dict.reset_index(drop=True)
print('\nDataFrame after reset_index:')
print(df_reset)


DataFrame after reset_index:
   Name  Age
0  John   28
1  Sara   22
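
The effect of `reset_index` is easier to see after a filter has left gaps in the index. The following sketch assumes a small illustrative table (the row for `'Mia'` is made up for this example).

```python
import pandas as pd

df_people = pd.DataFrame({'Name': ['John', 'Sara', 'Mia'], 'Age': [28, 22, 35]})

# Filtering keeps the original row labels, which leaves a gap here
over_25 = df_people[df_people['Age'] > 25]
print(over_25.index.tolist())      # [0, 2]

# drop=True discards the old labels instead of adding them as a column
renumbered = over_25.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1]
```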


In [101]:
# Add New Rows Efficiently
rows = []
rows.append({'col1': 1, 'col2': 'A'})
rows.append({'col1': 2, 'col2': 'B'})
df_rows = pd.DataFrame(rows)
print('\nDataFrame from collected rows:')
print(df_rows)


DataFrame from collected rows:
   col1 col2
0     1    A
1     2    B
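
When the new rows already live in a DataFrame, `pd.concat()` can combine them with an existing one. This is a minimal sketch with illustrative data that mirrors the `col1`/`col2` example above.

```python
import pandas as pd

df_existing = pd.DataFrame({'col1': [1, 2], 'col2': ['A', 'B']})
df_new = pd.DataFrame([{'col1': 3, 'col2': 'C'}])

# ignore_index=True renumbers the combined rows from 0
df_combined = pd.concat([df_existing, df_new], ignore_index=True)
print(df_combined)
```

Calling `pd.concat()` once on a list of DataFrames is generally preferable to calling it repeatedly inside a loop, for the same reason given in the note on collecting rows.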


### Accessing Columns in DataFrame
- Single Column:
    - Use square brackets: `df['Name']`
    - Returns a Series.

- Multiple Columns:
    - Pass a list: `df[['Name', 'Score']]`
    - Returns a DataFrame.
    - The outer brackets index the DataFrame; the inner brackets form a Python list of column names, so multiple columns always need double square brackets.

In [102]:
# Accessing single column (returns Series)
print(df['total_bill'])

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64


In [103]:
# Accessing multiple columns (returns DataFrame)
print(df[['total_bill', 'tip']])

     total_bill   tip
0         16.99  1.01
1         10.34  1.66
2         21.01  3.50
3         23.68  3.31
4         24.59  3.61
..          ...   ...
239       29.03  5.92
240       27.18  2.00
241       22.67  2.00
242       17.82  1.75
243       18.78  3.00

[244 rows x 2 columns]
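
The difference between single and double brackets can be checked directly by inspecting the returned types. The sketch below uses a small illustrative DataFrame rather than the tips dataset so that it runs without Seaborn.

```python
import pandas as pd

df_small = pd.DataFrame({'Name': ['John', 'Sara'], 'Age': [28, 22]})

single = df_small['Name']        # single brackets: a Series
as_frame = df_small[['Name']]    # double brackets: a one-column DataFrame

print(type(single).__name__)     # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```

Double brackets around a single name are handy when a later step expects a DataFrame rather than a Series.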


### Why Pandas Matters in Real Projects
Data is rarely clean—Pandas helps you fix it.

**Essential for:**
- Feature engineering
- Data visualization
- Model-ready datasets
- Data cleaning


### Best Practices (Quick Reference)
- Start every project by inspecting data (`head()`, `info()`, `describe()`).
- Keep original data in a `data/raw/` folder.
- Avoid modifying raw data directly—create processed copies.
- Document your data cleaning process.

✅ **Note:**  
For scalable projects, save clean data for reuse (`data/processed/`).
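
One possible way to follow the raw/processed convention is sketched below. The file names (`customers.csv`, `customers_clean.csv`) and the cleaning step are assumptions for illustration, and a temporary directory stands in for the project root.

```python
from pathlib import Path
import tempfile

import pandas as pd

# A temporary directory stands in for the project root in this sketch
project_root = Path(tempfile.mkdtemp())
raw_dir = project_root / 'data' / 'raw'
processed_dir = project_root / 'data' / 'processed'
raw_dir.mkdir(parents=True)
processed_dir.mkdir(parents=True)

# Hypothetical raw file with one missing Purchase value
raw_path = raw_dir / 'customers.csv'
raw_path.write_text('CustomerID,Purchase\n101,250.75\n102,\n103,300.00\n')

df_raw = pd.read_csv(raw_path)

# Clean a copy; the raw file itself is never overwritten
df_clean = df_raw.dropna(subset=['Purchase']).reset_index(drop=True)

processed_path = processed_dir / 'customers_clean.csv'
df_clean.to_csv(processed_path, index=False)

print(len(df_raw), len(df_clean))   # 3 2
```

Keeping the cleaning steps in code like this also documents the process, since the script records exactly how the processed file was derived from the raw one.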