# Introduction to Pandas

## What is Pandas?
- Python library for **data manipulation and analysis**
- Built on top of NumPy (uses NumPy arrays internally)
- Core data structures:
  - **Series**: 1D labeled array (like a column)
  - **DataFrame**: 2D labeled table (like Excel spreadsheet)
- Key advantages:
  - Labeled axes (rows and columns)
  - Handles missing data elegantly
  - SQL-like operations (groupby, merge, join)
  - Time series functionality
  - File I/O (CSV, Excel, JSON, SQL, etc.)
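
To make the "SQL-like operations" point concrete, here is a minimal sketch of `groupby` and `merge`. The `sales` and `managers` tables, and every value in them, are hypothetical and invented purely for illustration.

```python
import pandas as pd

# Hypothetical data, invented for this sketch only
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, 5, 7, 3],
})
managers = pd.DataFrame({
    "region": ["North", "South"],
    "manager": ["Asha", "Ben"],
})

# groupby: total units per region (comparable to SQL GROUP BY)
units_per_region = sales.groupby("region")["units"].sum()
print(units_per_region)

# merge: attach the manager to each sale (comparable to a SQL LEFT JOIN)
sales_with_manager = sales.merge(managers, on="region", how="left")
print(sales_with_manager)
```

A left merge keeps every row of `sales` in its original order, which is usually what you want when enriching a table with lookup data.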

## Why Pandas Over NumPy?
| Feature | Pandas | NumPy |
|---------|--------|-------|
| Data Structure | DataFrame (2D table) | ndarray (N-D array) |
| Labels | Row/column names | Integer positions only |
| Mixed Types | ✅ Different types per column | ❌ Homogeneous |
| Missing Data | Built-in NaN handling | Manual handling |
| Use Case | Tabular/structured data | Numerical computations |

**Rule of Thumb**: Use Pandas for data analysis, NumPy for math operations.
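
The "Missing Data" and "Mixed Types" rows of the table can be illustrated with a short sketch. The sample values and the name `df_mixed` are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value
values = [1.0, np.nan, 3.0]

arr = np.array(values)
s = pd.Series(values)

print(arr.sum())        # NaN propagates through a plain NumPy reduction
print(np.nansum(arr))   # NumPy needs an explicit NaN-aware function
print(s.sum())          # pandas skips NaN by default (skipna=True)
print(s.isna().sum())   # number of missing values

# Mixed types: each DataFrame column keeps its own dtype
df_mixed = pd.DataFrame({
    "name": ["a", "b"],
    "score": [1.5, 2.5],
    "ok": [True, False],
})
print(df_mixed.dtypes)
```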


In [1]:
import pandas as pd
import numpy as np
import sys

# Verify installation
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version}")

# Sanity check: a basic Series operation works after import
assert pd.Series([1, 2, 3]).sum() == 6, "Import failed!"


Pandas version: 2.2.3
NumPy version: 2.1.3
Python version: 3.13.5 | packaged by Anaconda, Inc. | (main, Jun 12 2025, 11:23:37) [Clang 14.0.6 ]


## Pandas Series
- **1D labeled array** (like a single column in Excel)
- Components:
  - **Values**: The actual data (NumPy array)
  - **Index**: Row labels (default: 0, 1, 2...)
  
**Visualization:**


| Index | Values |
| ----- | ------ |
| 0     | 10     |
| 1     | 20     |
| 2     | 30     |

In [3]:
# Create Series from list
series_list = pd.Series([10, 20, 30, 40])
print("Series from list:")
print(series_list)
print()

# Create Series with custom index
series_custom = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print("Series with custom index:")
print(series_custom)
print()

# Create Series from dictionary (keys become index)
data_dict = {'India': 1400, 'China': 1425, 'USA': 331}
series_dict = pd.Series(data_dict)
print("Series from dictionary:")
print(series_dict)
print()

# Access elements
print("Access by label:", series_dict['India'])  # 1400
print("Access by position:", series_dict.iloc[0])  # 1400


Series from list:
0    10
1    20
2    30
3    40
dtype: int64

Series with custom index:
a    10
b    20
c    30
dtype: int64

Series from dictionary:
India    1400
China    1425
USA       331
dtype: int64

Access by label: 1400
Access by position: 1400
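
The two components listed above (values and index) can be inspected directly. This is a small sketch using a throwaway Series `s`; it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

values = s.to_numpy()   # the data as a NumPy array
labels = s.index        # the row labels as an Index object

print(type(values))
print(list(labels))

# Label-based and position-based access reach the same element
print(s.loc["b"], s.iloc[1])
```

Using `.loc` for labels and `.iloc` for positions keeps the intent explicit, which matters once the index itself contains integers.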


## Pandas DataFrame
- **2D labeled table** (rows + columns)
- Think: Excel spreadsheet or SQL table
- Each column is a Series
- Components:
  - **Columns**: Column names
  - **Index**: Row labels
  - **Values**: The underlying data (available as a 2D NumPy array via `.to_numpy()`)

**Structure:**


| Index  | Column1 | Column2 | Column3 |
| ------ | ------- | ------- | ------- |
| Index0 | 10      | A       | True    |
| Index1 | 20      | B       | False   |
| Index2 | 30      | C       | True    |


**Key Difference from NumPy:**
- NumPy: `array[row, col]` → position-based
- Pandas: `df['ColName']` → label-based (position-based access is also available via `.iloc`)
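
The difference between the two indexing styles can be sketched side by side. The array contents and the labels `r0`, `r1`, `x`, `y` are arbitrary choices for this illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[10, 20], [30, 40]])
df_idx = pd.DataFrame(arr, columns=["x", "y"], index=["r0", "r1"])

print(arr[1, 0])               # NumPy: row 1, column 0, by position
print(df_idx["x"])             # pandas: whole column by label (a Series)
print(df_idx.loc["r1", "x"])   # pandas: one cell by row and column label
print(df_idx.iloc[1, 0])       # pandas: the same cell by integer position
```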


In [7]:
# Method 1: From dictionary (keys = columns)
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['Delhi', 'Mumbai', 'Bangalore']
}
df = pd.DataFrame(data)
print("DataFrame from dictionary:")
print(df)
print()

# Method 2: From list of lists
data_list = [
    ['Alice', 25, 'Delhi'],
    ['Bob', 30, 'Mumbai'],
    ['Charlie', 35, 'Bangalore']
]
df_list = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])
print("DataFrame from list:")
print(df_list)
print()

# Method 3: From NumPy array
np_data = np.array([[1, 2, 3], [4, 5, 6]])
df_np = pd.DataFrame(np_data, columns=['A', 'B', 'C'])
print("DataFrame from NumPy:")
print(df_np)


DataFrame from dictionary:
      Name  Age       City
0    Alice   25      Delhi
1      Bob   30     Mumbai
2  Charlie   35  Bangalore

DataFrame from list:
      Name  Age       City
0    Alice   25      Delhi
1      Bob   30     Mumbai
2  Charlie   35  Bangalore

DataFrame from NumPy:
   A  B  C
0  1  2  3
1  4  5  6


## Essential DataFrame Attributes and Methods

| Attribute | Description | Returns |
|-----------|-------------|---------|
| `.shape` | Dimensions | (rows, columns) |
| `.columns` | Column names | Index object |
| `.index` | Row labels | Index object |
| `.dtypes` | Data types per column | Series |
| `.info()` | Summary (types, nulls, memory) | None (prints) |
| `.head(n)` | First n rows (default 5) | DataFrame |
| `.tail(n)` | Last n rows (default 5) | DataFrame |


In [9]:
# Create sample DataFrame
df = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor'],
    'Price': [50000, 20000, 15000, 12000],
    'Stock': [10, 25, 15, 8],
    'Available': [True, True, False, True]
})

# Attributes
print("Shape:", df.shape)  # (4, 4) → 4 rows, 4 columns
print("\nColumns:", df.columns)  # Index(['Product', 'Price', 'Stock', 'Available'])
print("\nIndex:", df.index)  # RangeIndex(start=0, stop=4, step=1)
print("\nData types:\n", df.dtypes)
print("\nFirst 2 rows:\n", df.head(2))
print("\nInfo:")
df.info()  # Shows types, non-null counts, memory usage


Shape: (4, 4)

Columns: Index(['Product', 'Price', 'Stock', 'Available'], dtype='object')

Index: RangeIndex(start=0, stop=4, step=1)

Data types:
 Product      object
Price         int64
Stock         int64
Available      bool
dtype: object

First 2 rows:
   Product  Price  Stock  Available
0  Laptop  50000     10       True
1   Phone  20000     25       True

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Product    4 non-null      object
 1   Price      4 non-null      int64 
 2   Stock      4 non-null      int64 
 3   Available  4 non-null      bool  
dtypes: bool(1), int64(2), object(1)
memory usage: 232.0+ bytes
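
Two rows of the table are not exercised by the cell above: `.tail(n)` and the fact that `.info()` prints rather than returns. A minimal sketch, using a smaller hypothetical `inventory` frame:

```python
import pandas as pd

inventory = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet", "Monitor"],
    "Price": [50000, 20000, 15000, 12000],
})

n_rows, n_cols = inventory.shape   # shape is a plain tuple and can be unpacked
last_two = inventory.tail(2)       # last 2 rows, original index labels kept
info_result = inventory.info()     # prints the summary and returns None

print(n_rows, n_cols)
print(last_two)
print(list(inventory.columns))     # Index object converted to a plain list
```

Because `.info()` returns `None`, wrapping it in `print()` adds a stray `None` line to the output; call it on its own instead.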


## File I/O in Pandas
Pandas can read from and write to many formats:

| Format | Read Function | Write Function |
|--------|--------------|----------------|
| CSV | `pd.read_csv()` | `df.to_csv()` |
| Excel | `pd.read_excel()` | `df.to_excel()` |
| JSON | `pd.read_json()` | `df.to_json()` |
| SQL | `pd.read_sql()` | `df.to_sql()` |
| HTML | `pd.read_html()` | `df.to_html()` |

**Most Common**: `read_csv()` for CSV files


In [11]:
# Create sample CSV file first
sample_data = pd.DataFrame({
    'Employee': ['John', 'Sarah', 'Mike', 'Emma'],
    'Department': ['IT', 'HR', 'IT', 'Finance'],
    'Salary': [60000, 55000, 62000, 58000],
    'Experience': [3, 5, 2, 4]
})

# Save to CSV
sample_data.to_csv('employees.csv', index=False)
print("CSV file created!")

# Read CSV file
df_csv = pd.read_csv('employees.csv')
print("\nDataFrame from CSV:")
print(df_csv)
print()

# Common read_csv parameters
df_custom = pd.read_csv(
    'employees.csv',
    usecols=['Employee', 'Salary'],  # Select specific columns
    nrows=2  # Read only first 2 rows
)
print("Custom read (selected columns):")
print(df_custom)


CSV file created!

DataFrame from CSV:
  Employee Department  Salary  Experience
0     John         IT   60000           3
1    Sarah         HR   55000           5
2     Mike         IT   62000           2
3     Emma    Finance   58000           4

Custom read (selected columns):
  Employee  Salary
0     John   60000
1    Sarah   55000


In [12]:
# Read Excel file (.xlsx requires openpyxl; xlrd only handles legacy .xls files)
# sample_data.to_excel('employees.xlsx', index=False)
# df_excel = pd.read_excel('employees.xlsx')
# print("DataFrame from Excel:")
# print(df_excel)

# Read JSON file
sample_data.to_json('employees.json', orient='records')
df_json = pd.read_json('employees.json')
print("DataFrame from JSON:")
print(df_json)
print()

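# Read JSON from a string: wrap it in StringIO, because passing a
# literal JSON string to read_json is deprecated since pandas 2.1
from io import StringIO

json_string = '''
[
    {"Name": "Alice", "Score": 85},
    {"Name": "Bob", "Score": 92}
]
'''
df_json_str = pd.read_json(StringIO(json_string))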
print("DataFrame from JSON string:")
print(df_json_str)


DataFrame from JSON:
  Employee Department  Salary  Experience
0     John         IT   60000           3
1    Sarah         HR   55000           5
2     Mike         IT   62000           2
3     Emma    Finance   58000           4

DataFrame from JSON string:
    Name  Score
0  Alice     85
1    Bob     92




## Series vs DataFrame: Quick Reference

| Aspect | Series | DataFrame |
|--------|--------|-----------|
| Dimensions | 1D | 2D |
| Think of it as | Single column | Full table |
| Creation | `pd.Series([1,2,3])` | `pd.DataFrame({'A':[1,2]})` |
| Indexing | `series.iloc[0]` or `series['label']` | `df['column']` or `df.loc[0]` |
| Use Case | Single variable analysis | Multi-variable datasets |

**Key Insight**: A DataFrame is a collection of Series (one per column) that share the same index.
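
The key insight can be checked directly: build a DataFrame from two Series, then pull a column back out. The names `ages`, `cities`, and `people` are hypothetical.

```python
import pandas as pd

ages = pd.Series([25, 30], index=["alice", "bob"], name="Age")
cities = pd.Series(["Delhi", "Mumbai"], index=["alice", "bob"], name="City")

# Combine two Series that share an index into one DataFrame
people = pd.concat([ages, cities], axis=1)

col_as_series = people["Age"]     # single brackets give a Series
col_as_frame = people[["Age"]]    # double brackets give a one-column DataFrame

print(type(col_as_series).__name__)
print(type(col_as_frame).__name__)
print(people.loc["bob"])          # a single row also comes back as a Series
```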



In [14]:
import time

# Create large dataset
size = 1_000_000
data = {
    'A': np.random.randint(0, 100, size),
    'B': np.random.randint(0, 100, size),
    'C': np.random.randint(0, 100, size)
}

# Python list approach (element-by-element loop)
py_list = list(zip(data['A'], data['B'], data['C']))
start = time.perf_counter()  # monotonic, high-resolution clock suited to timing
result_list = [a + b + c for a, b, c in py_list]
py_time = time.perf_counter() - start

# Pandas DataFrame approach (vectorized column addition)
df = pd.DataFrame(data)
start = time.perf_counter()
result_df = df['A'] + df['B'] + df['C']
pd_time = time.perf_counter() - start

print(f"Python list time: {py_time:.4f} seconds")
print(f"Pandas time: {pd_time:.4f} seconds")
print(f"Speed ratio: {py_time/pd_time:.1f}x faster with Pandas")


Python list time: 0.0512 seconds
Pandas time: 0.0027 seconds
Speed ratio: 19.1x faster with Pandas
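
A single timed run can be noisy, so the ratio above will vary between machines and runs. As a rough sketch of a steadier measurement, the version below uses the standard-library `timeit` module and takes the best of several repeats. The smaller `size`, the fixed seed, and the helper names `python_sum` and `pandas_sum` are assumptions made for this example, and the rows are converted to native Python integers first, so the numbers are not directly comparable with the cell above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # fixed seed so the data is reproducible
size = 100_000                        # smaller than above to keep the sketch quick
bench = pd.DataFrame(rng.integers(0, 100, size=(size, 3)), columns=["A", "B", "C"])
rows = list(zip(bench["A"].tolist(), bench["B"].tolist(), bench["C"].tolist()))

def python_sum():
    # Element-by-element loop over native Python integers
    return [a + b + c for a, b, c in rows]

def pandas_sum():
    # Vectorized addition of whole columns
    return bench["A"] + bench["B"] + bench["C"]

# Best of 5 repeats reduces noise from other processes
py_best = min(timeit.repeat(python_sum, number=1, repeat=5))
pd_best = min(timeit.repeat(pandas_sum, number=1, repeat=5))
print(f"Python list: {py_best:.4f} s, pandas: {pd_best:.4f} s")

# Both approaches should produce identical results
same_result = python_sum() == pandas_sum().tolist()
print("Same result:", same_result)
```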


## Key Takeaways

1. **Import Convention**: By convention, use `import pandas as pd`
2. **Data Structures**:
   - Series = 1D labeled array
   - DataFrame = 2D table (collection of Series)
3. **File Reading**: `pd.read_csv()` is most common
4. **Attributes and Methods to Remember**:
   - `.shape` → dimensions
   - `.info()` → overview
   - `.head()` → quick preview
5. **Label-Based**: Pandas selects by row/column names and also supports positions via `.iloc` (NumPy arrays are position-only)

> **Next Steps**: Deep dive into Series operations, indexing, and selection.


In [16]:
# Try This: Create a DataFrame of your favorite movies
movies = pd.DataFrame({
    'Title': ['Inception', 'The Matrix', 'Interstellar'],
    'Year': [2010, 1999, 2014],
    'Rating': [8.8, 8.7, 8.6],
    'Genre': ['Sci-Fi', 'Sci-Fi', 'Sci-Fi']
})

print("My Movies DataFrame:")
print(movies)
print()

# Challenge: Add a new column 'Watched' with boolean values
movies['Watched'] = [True, True, False]
print("After adding 'Watched' column:")
print(movies)

# Save to CSV
movies.to_csv('my_movies.csv', index=False)
print("\nSaved to 'my_movies.csv'")


My Movies DataFrame:
          Title  Year  Rating   Genre
0     Inception  2010     8.8  Sci-Fi
1    The Matrix  1999     8.7  Sci-Fi
2  Interstellar  2014     8.6  Sci-Fi

After adding 'Watched' column:
          Title  Year  Rating   Genre  Watched
0     Inception  2010     8.8  Sci-Fi     True
1    The Matrix  1999     8.7  Sci-Fi     True
2  Interstellar  2014     8.6  Sci-Fi    False

Saved to 'my_movies.csv'
