### **Introduction to Pandas & Core Data Structures**

**What is Pandas?**

- Open-source Python library for handling structured data (rows & columns).

- Used in data analysis, preprocessing, business reports, machine learning.

- Documentation: https://pandas.pydata.org/docs/getting_started/install.html

**How to Install Pandas**

In [45]:
# Run this in a Jupyter Notebook cell; in a terminal, drop the leading "!"
!pip install pandas



### Importing Pandas

In [46]:
import pandas as pd

### Pandas Series (1D Data)

- One-dimensional labeled data (like a single Excel column).

- Created from list, dict, or NumPy array.

- A Series is often used for single-variable analysis or as one column of a larger DataFrame.

In [47]:
# Example: Creating a Series from a list
import pandas as pd
s1 = pd.Series([100, 200, 300])
print(s1)

0    100
1    200
2    300
dtype: int64


In [48]:
# Example: Creating a Series from a dictionary
s2 = pd.Series({'Math': 90, 'Science': 85})
print(s2)

Math       90
Science    85
dtype: int64
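
Because a Series is labeled, values can be read by label as well as by position. The following is a minimal sketch; the `scores` data and the `History` entry are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical scores, labeled by subject through an explicit index
scores = pd.Series([90, 85, 78], index=['Math', 'Science', 'History'])

# Label-based access
math_score = scores['Math']            # a single value
subset = scores[['Math', 'History']]   # a list of labels returns a Series

# Position-based access goes through .iloc
first_score = scores.iloc[0]

print(math_score)
print(subset)
print(first_score)
```

Using `.iloc` for positions keeps label lookups and positional lookups clearly separate, which matters once the index is no longer the default 0, 1, 2.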


### Pandas DataFrame (2D Data)
- Two-dimensional, labeled rows & columns (like an Excel sheet).

- Created from:
    - Dict of lists: { 'Name': ['A', 'B'], 'Age': [25, 30] }
    - List of dicts (API responses)
    - NumPy arrays

- `pd.DataFrame(data)` ➔ Tabular format.

✅ **Note:** Always set clear column names for readability and future automation.
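
As a rough illustration of the note above, the sketch below shows what happens when column names are omitted and two ways to set them. The `raw` array and the names `age` and `salary` are hypothetical.

```python
import numpy as np
import pandas as pd

raw = np.array([[25, 50000], [30, 62000]])  # hypothetical values

# Without column names, pandas falls back to integer labels 0, 1, ...
df_unnamed = pd.DataFrame(raw)

# Option 1: name the columns at creation time
df_named = pd.DataFrame(raw, columns=['age', 'salary'])

# Option 2: rename afterwards; rename returns a new DataFrame
df_renamed = df_unnamed.rename(columns={0: 'age', 1: 'salary'})

print(list(df_unnamed.columns))
print(list(df_named.columns))
print(list(df_renamed.columns))
```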

In [49]:
# Example: Creating a DataFrame from a dictionary of lists
data = { 'Name': ['A', 'B'], 'Age': [25, 30] }
df1 = pd.DataFrame(data)
print(df1)

  Name  Age
0    A   25
1    B   30


In [50]:
# Example: Creating a DataFrame from a list of dictionaries
data2 = [ {'Name': 'A', 'Age': 25}, {'Name': 'B', 'Age': 30} ]
df2 = pd.DataFrame(data2)
print(df2)

  Name  Age
0    A   25
1    B   30


### Exploring DataFrames
- **View rows:**  
  `df.head()` ➔ First 5 rows (pass a number, e.g. `df.head(10)`, to change how many)  
  `df.tail()` ➔ Last 5 rows (pass a number, e.g. `df.tail(10)`, to change how many)

- **Dimensions:**  
  `df.shape` ➔ Tuple of (rows, columns)

- **Columns:**  
  `df.columns` ➔ List column names

- **Data Types:**  
  `df.dtypes` ➔ Understand types (important for ML models)

- **Quick Summary:**  
  `df.info()` ➔ Non-null counts, dtypes, memory usage  
  `df.describe()` ➔ Basic stats (numeric columns only by default)

In [51]:
print("First 5 rows:")
print(df1.head())

First 5 rows:
  Name  Age
0    A   25
1    B   30


In [52]:
print("Shape (rows, columns):", df1.shape)

Shape (rows, columns): (2, 2)


In [53]:
print("Columns:", df1.columns)

Columns: Index(['Name', 'Age'], dtype='object')


In [54]:
print("Data types:\n", df1.dtypes)

Data types:
 Name    object
Age      int64
dtype: object


In [55]:
print("Info:\n")
df1.info()

Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    2 non-null      object
 1   Age     2 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 160.0+ bytes


In [56]:
print("Describe:\n")
print(df1.describe())

Describe:

             Age
count   2.000000
mean   27.500000
std     3.535534
min    25.000000
25%    26.250000
50%    27.500000
75%    28.750000
max    30.000000
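
By default `describe()` skips the `Name` column because it is not numeric. The sketch below, which uses a small made-up DataFrame, shows two ways to summarize text columns as well.

```python
import pandas as pd

# Hypothetical data with one text column and one numeric column
df = pd.DataFrame({'Name': ['A', 'B', 'A'], 'Age': [25, 30, 41]})

# Default: numeric columns only
numeric_summary = df.describe()

# A frame with only text columns reports count, unique, top, and freq
text_summary = df[['Name']].describe()

# Every column at once; statistics that do not apply show NaN
full_summary = df.describe(include='all')

print(numeric_summary)
print(text_summary)
print(full_summary)
```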


### Creating Pandas Series
**➔ From a List:**
- `pd.Series([100, 200, 300])`
- Quick creation for simple datasets (sales, counts).

**➔ From a Dictionary:**
- `pd.Series({'Math': 95, 'Physics': 88})`
- Keys = labels ➔ Values = data.
- Useful for labeled data like survey responses or category scores.

**➔ From NumPy Array:**
- `pd.Series(np.array([1, 2, 3, 4]))` 
- Ideal when data comes from scientific computations or numeric simulations.

✅ **Note:**
In ML projects, you often receive data as NumPy ➔ Convert to Pandas for easier manipulation and readability.

In [57]:
import numpy as np

In [58]:
# From a List
series_list = pd.Series([100, 200, 300])
print('Series from list:')
print(series_list)

Series from list:
0    100
1    200
2    300
dtype: int64


In [59]:
# From a Dictionary
series_dict = pd.Series({'Math': 95, 'Physics': 88})
print('\nSeries from dictionary:')
print(series_dict)


Series from dictionary:
Math       95
Physics    88
dtype: int64


In [60]:
# From a NumPy Array
series_np = pd.Series(np.array([1, 2, 3, 4]))
print('\nSeries from NumPy array:')
print(series_np)


Series from NumPy array:
0    1
1    2
2    3
3    4
dtype: int64


### Creating Pandas DataFrames — Multiple Methods
**➔ From Dictionary of Lists (Most Common):**
- `pd.DataFrame({'Name': ['John', 'Sara'], 'Age': [28, 22]})`
- Best when data is created by hand or is already organized column by column.

**➔ From List of Dictionaries (API-Friendly):**
- `pd.DataFrame([{'Name': 'John', 'Age': 28}, {'Name': 'Sara', 'Age': 22}])`
- Common when working with JSON APIs or MongoDB style documents.

**➔ From NumPy Arrays:**
- `pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['A', 'B'])`
- Best for numeric matrices (images, financial models).

In [61]:
# From Dictionary of Lists
df_dict = pd.DataFrame({'Name': ['John', 'Sara'], 'Age': [28, 22]})
print('DataFrame from dictionary of lists:')
print(df_dict)

DataFrame from dictionary of lists:
   Name  Age
0  John   28
1  Sara   22


In [62]:
# From List of Dictionaries
df_list_dict = pd.DataFrame([{'Name': 'John', 'Age': 28}, {'Name': 'Sara', 'Age': 22}])
print('\nDataFrame from list of dictionaries:')
print(df_list_dict)



DataFrame from list of dictionaries:
   Name  Age
0  John   28
1  Sara   22


In [63]:
# From NumPy Arrays
df_np = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['A', 'B'])
print('\nDataFrame from NumPy array:')
print(df_np)


DataFrame from NumPy array:
   A  B
0  1  2
1  3  4


### Real-World DataFrame Example — Customer Data
Imagine receiving online store transaction data:

```python
customer_list = [
    {'CustomerID': 101, 'Name': 'Alice', 'Purchase': 250.75},
    {'CustomerID': 102, 'Name': 'Bob', 'Purchase': 150.50},
    {'CustomerID': 103, 'Name': 'Charlie', 'Purchase': 300.00}
]
```
- Use `pd.DataFrame(customer_list)` ➔ Clean, ready-to-analyze tabular data.

👉 This is exactly how API data or NoSQL database outputs look in real business projects.

✅ **Note:**
For messy nested JSON, use `pd.json_normalize` to flatten nested fields into columns.
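
A minimal sketch of flattening nested records follows. The `records` list and its `address` field are hypothetical, standing in for an API response with one level of nesting.

```python
import pandas as pd

# Hypothetical API response with a nested 'address' object per customer
records = [
    {'CustomerID': 101, 'Name': 'Alice',
     'address': {'city': 'Pune', 'zip': '411001'}},
    {'CustomerID': 102, 'Name': 'Bob',
     'address': {'city': 'Delhi', 'zip': '110001'}},
]

# Nested keys become dotted column names such as 'address.city'
df_flat = pd.json_normalize(records)

print(df_flat.columns.tolist())
print(df_flat)
```

Passing the same `records` list to `pd.DataFrame` would instead leave each `address` dictionary inside a single column, which is harder to filter or group on.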

In [64]:
customer_list = [
    {'CustomerID': 101, 'Name': 'Alice', 'Purchase': 250.75},
    {'CustomerID': 102, 'Name': 'Bob', 'Purchase': 150.50},
    {'CustomerID': 103, 'Name': 'Charlie', 'Purchase': 300.00}
]
df_customers = pd.DataFrame(customer_list)
print('Customer DataFrame:')
print(df_customers)

Customer DataFrame:
   CustomerID     Name  Purchase
0         101    Alice    250.75
1         102      Bob    150.50
2         103  Charlie    300.00


### Additional Essential Tricks
**➔ Create Empty DataFrame:**
- `pd.DataFrame()` ➔ Useful when building a table programmatically.

**➔ Reset Index:**
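- `df.reset_index(drop=True)` ➔ Renumber rows from 0 after filtering or sorting. After `groupby`, call `reset_index()` without `drop=True` so the group keys are kept as columns.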

**➔ Add New Rows:**
- To append rows, use `pd.concat()`; `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0.

✅ **Note:**
Instead of appending rows one by one (which is slow), collect rows in a list and convert once to DataFrame.

**Example:**

Create: `rows = []`

Add: `rows.append({'col1': val1, 'col2': val2})`

Build: `df = pd.DataFrame(rows)`

In [65]:
# Create Empty DataFrame
empty_df = pd.DataFrame()
print('Empty DataFrame:')
print(empty_df)

Empty DataFrame:
Empty DataFrame
Columns: []
Index: []


In [66]:
# Reset Index Example
df_reset = df_dict.reset_index(drop=True)
print('\nDataFrame after reset_index:')
print(df_reset)


DataFrame after reset_index:
   Name  Age
0  John   28
1  Sara   22
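
The index of `df_dict` was already 0, 1, so the cell above shows no visible change. The sketch below uses a made-up three-row DataFrame to show the more typical case, where filtering leaves gaps in the index.

```python
import pandas as pd

# Hypothetical data; 'Mike' is added only for this illustration
df = pd.DataFrame({'Name': ['John', 'Sara', 'Mike'], 'Age': [28, 22, 35]})

# Filtering keeps the original row labels, leaving gaps (here 0 and 2)
over_25 = df[df['Age'] > 25]

# drop=True discards the old labels instead of adding them as a column
clean = over_25.reset_index(drop=True)

print(over_25.index.tolist())
print(clean.index.tolist())
```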


In [67]:
# Add New Rows Efficiently
rows = []
rows.append({'col1': 1, 'col2': 'A'})
rows.append({'col1': 2, 'col2': 'B'})
df_rows = pd.DataFrame(rows)
print('\nDataFrame from collected rows:')
print(df_rows)


DataFrame from collected rows:
   col1 col2
0     1    A
1     2    B
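
When the rows to add already live in another DataFrame, `pd.concat()` is the usual tool. This is a small sketch; `existing` and `new_rows` are hypothetical names.

```python
import pandas as pd

existing = pd.DataFrame({'col1': [1, 2], 'col2': ['A', 'B']})
new_rows = pd.DataFrame([{'col1': 3, 'col2': 'C'}])  # hypothetical new batch

# ignore_index=True renumbers the result from 0 instead of repeating labels
combined = pd.concat([existing, new_rows], ignore_index=True)

print(combined)
```

Calling `pd.concat()` once on a list of DataFrames is generally preferable to calling it inside a loop, for the same reason as the note above: each call copies the data.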


### Accessing Columns in DataFrame
**➔ Single Column:**
- Use square brackets: `df['Name']`
- Returns a Series.

**➔ Multiple Columns:**
- Pass a list: `df[['Name', 'Age']]`
- Returns a DataFrame.
- Always use double square brackets for multiple columns: the outer pair indexes the DataFrame, the inner pair is the list of column names.

In [68]:
# Accessing single column (returns Series)
print(df1['Name'])

0    A
1    B
Name: Name, dtype: object


In [69]:
# Accessing multiple columns (returns DataFrame)
print(df1[['Name', 'Age']])

  Name  Age
0    A   25
1    B   30
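
The bracket style decides the return type even when only one column is requested. A short sketch, rebuilding the same small DataFrame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B'], 'Age': [25, 30]})

as_series = df['Name']    # single brackets: Series
as_frame = df[['Name']]   # list with one name: one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

This distinction matters when a later step expects two-dimensional input, since a Series has no columns axis.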


### Why Pandas Matters in Real Projects
Data is rarely clean, and Pandas helps you fix it.

**Essential for:**
- Feature engineering
- Data visualization
- Model-ready datasets
- Data cleaning


### Best Practices (Quick Reference)
- Start every project by inspecting data (`head()`, `info()`, `describe()`).
- Keep original data in a `data/raw/` folder.
- Avoid modifying raw data directly; create processed copies instead.
- Document your data cleaning process.

✅ **Note:**  
For scalable projects, save clean data for reuse (`data/processed/`).
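
The raw/processed workflow above can be sketched as follows. Everything here is an assumption made for illustration: a temporary directory stands in for the project root, and the file names `customers.csv` and `customers_clean.csv` are hypothetical.

```python
from pathlib import Path
import tempfile

import pandas as pd

# A temporary directory stands in for the project root in this sketch
project_root = Path(tempfile.mkdtemp())
raw_dir = project_root / 'data' / 'raw'
processed_dir = project_root / 'data' / 'processed'
raw_dir.mkdir(parents=True)
processed_dir.mkdir(parents=True)

# Hypothetical raw file with one missing purchase value
raw_path = raw_dir / 'customers.csv'
raw_path.write_text('CustomerID,Name,Purchase\n101,Alice,250.75\n102,Bob,\n')

raw_df = pd.read_csv(raw_path)

# Inspect first, then clean a copy; the raw file is never overwritten
raw_df.info()
clean_df = raw_df.dropna(subset=['Purchase']).reset_index(drop=True)

# Save the cleaned copy separately for reuse
processed_path = processed_dir / 'customers_clean.csv'
clean_df.to_csv(processed_path, index=False)
```

In a real project, `project_root` would point at the repository folder, and the cleaning step would be whatever the documented cleaning process calls for rather than a single `dropna`.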