# Pandas and SQL Integration

## Learning Objectives

By the end of this notebook, you will be able to:

1. Load SQL query results directly into Pandas DataFrames using `pd.read_sql()`
2. Understand the difference between `pd.read_sql_query()` and `pd.read_sql_table()`
3. Write DataFrames to SQL database tables using `df.to_sql()`
4. Know when to use SQL vs Pandas for data manipulation
5. Build practical workflows combining SQL and Pandas strengths

---

## Setup: Create the Company Database

First, let's create our familiar company database with departments, employees, and projects.

In [None]:
import sqlite3
import pandas as pd
import os

# Check pandas version
print(f"Pandas version: {pd.__version__}")

# Remove existing database for a fresh start
db_path = 'company.db'
if os.path.exists(db_path):
    os.remove(db_path)

# Create database and connection
conn = sqlite3.connect(db_path)
cursor = conn.cursor()

# Create tables
cursor.executescript('''
    CREATE TABLE departments (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE,
        budget REAL DEFAULT 0
    );
    
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        department_id INTEGER,
        salary REAL,
        hire_date TEXT,
        FOREIGN KEY (department_id) REFERENCES departments(id)
    );
    
    CREATE TABLE projects (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        department_id INTEGER,
        start_date TEXT,
        end_date TEXT,
        FOREIGN KEY (department_id) REFERENCES departments(id)
    );
''')

print("Tables created successfully!")

In [None]:
# Insert sample data
departments = [
    (1, 'Engineering', 500000),
    (2, 'Marketing', 300000),
    (3, 'Sales', 400000),
    (4, 'HR', 200000),
    (5, 'Finance', 350000)
]

employees = [
    (1, 'Alice Johnson', 1, 95000, '2020-03-15'),
    (2, 'Bob Smith', 1, 85000, '2021-06-01'),
    (3, 'Carol Williams', 1, 92000, '2019-08-20'),
    (4, 'David Brown', 2, 78000, '2022-01-10'),
    (5, 'Eva Martinez', 2, 82000, '2021-03-25'),
    (6, 'Frank Wilson', 3, 88000, '2020-11-05'),
    (7, 'Grace Lee', 3, 91000, '2019-05-12'),
    (8, 'Henry Taylor', 4, 65000, '2022-07-18'),
    (9, 'Ivy Chen', 1, 105000, '2018-02-28'),
    (10, 'Jack Anderson', 3, 95000, '2020-09-14'),
    (11, 'Karen White', 2, 75000, '2023-02-01'),
    (12, 'Leo Garcia', 5, 89000, '2021-11-20'),
    (13, 'Mia Davis', 4, 62000, '2023-04-15'),
    (14, 'Nathan Moore', 5, 94000, '2019-07-01'),
    (15, 'Olivia Clark', 1, 110000, '2017-09-10')
]

projects = [
    (1, 'Cloud Migration', 1, '2024-01-15', '2024-06-30'),
    (2, 'Brand Refresh', 2, '2024-02-01', '2024-04-30'),
    (3, 'Q2 Sales Campaign', 3, '2024-04-01', '2024-06-30'),
    (4, 'Mobile App v2.0', 1, '2024-03-01', '2024-09-30'),
    (5, 'Employee Portal', 4, '2024-02-15', '2024-05-31'),
    (6, 'Data Analytics Platform', 1, '2024-05-01', '2024-12-31'),
    (7, 'Customer Retention', 3, '2024-03-15', '2024-08-15'),
    (8, 'Budget Planning System', 5, '2024-01-01', '2024-03-31')
]

cursor.executemany('INSERT INTO departments VALUES (?, ?, ?)', departments)
cursor.executemany('INSERT INTO employees VALUES (?, ?, ?, ?, ?)', employees)
cursor.executemany('INSERT INTO projects VALUES (?, ?, ?, ?, ?)', projects)
conn.commit()

print(f"Inserted: {len(departments)} departments, {len(employees)} employees, {len(projects)} projects")
print("Database setup complete!")

---

## 1. Loading SQL Data into Pandas with pd.read_sql()

Pandas provides several functions to read data from SQL databases:

| Function | Description |
|----------|-------------|
| `pd.read_sql()` | General-purpose wrapper: accepts a SQL query, or a table name when given a SQLAlchemy connectable |
| `pd.read_sql_query()` | Execute a SQL query and return results |
| `pd.read_sql_table()` | Read an entire table (requires SQLAlchemy) |

For SQLite with the `sqlite3` module, `pd.read_sql()` and `pd.read_sql_query()` are the most useful.

### Basic Usage: pd.read_sql()

In [None]:
# Load an entire table into a DataFrame
df_employees = pd.read_sql('SELECT * FROM employees', conn)

print("Employees DataFrame:")
print(f"Shape: {df_employees.shape}")
print(f"Columns: {list(df_employees.columns)}")
print()
df_employees.head()

In [None]:
# Load all departments
df_departments = pd.read_sql('SELECT * FROM departments', conn)
df_departments

In [None]:
# Load with a specific query (filtering in SQL)
df_high_earners = pd.read_sql('''
    SELECT name, salary, hire_date 
    FROM employees 
    WHERE salary > 90000
    ORDER BY salary DESC
''', conn)

print("High Earners (>$90k):")
df_high_earners

### Using Parameters in Queries

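Use the `params` argument to pass values separately from the SQL text. The database driver binds each value to a `?` placeholder, which avoids the SQL injection and quoting problems that come with building queries through string formatting.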

In [None]:
# Using parameters to filter safely
min_salary = 80000
department_id = 1

df_filtered = pd.read_sql('''
    SELECT name, salary 
    FROM employees 
    WHERE salary >= ? AND department_id = ?
''', conn, params=(min_salary, department_id))

print(f"Engineering employees earning >= ${min_salary:,}:")
df_filtered
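
The `?` placeholders used above are sqlite3's positional style. sqlite3 also accepts named placeholders, which can be easier to read when a query takes several values. The following is a minimal sketch, assuming a throwaway in-memory database and a hypothetical `staff` table (neither is part of the company database):

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database; the `staff` table is illustrative only
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE staff (name TEXT, department_id INTEGER, salary REAL)')
demo_conn.executemany(
    'INSERT INTO staff VALUES (?, ?, ?)',
    [('Ann', 1, 95000), ('Ben', 1, 70000), ('Cy', 2, 99000)],
)

# Named placeholders (:name) are filled from a dict passed via `params`
df_named = pd.read_sql(
    '''
    SELECT name, salary
    FROM staff
    WHERE salary >= :min_salary AND department_id = :dept
    ''',
    demo_conn,
    params={'min_salary': 80000, 'dept': 1},
)
print(df_named)
demo_conn.close()
```

With a dict, the order of the values no longer has to match the order of the placeholders in the query.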

### Setting the Index Column

In [None]:
# Use the 'id' column as the DataFrame index
df_emp_indexed = pd.read_sql('SELECT * FROM employees', conn, index_col='id')

print("Employees with 'id' as index:")
df_emp_indexed.head()

In [None]:
# Now you can access rows by employee ID
print("Employee with ID 5:")
print(df_emp_indexed.loc[5])

### Parsing Dates Automatically

In [None]:
# Without date parsing - hire_date is a string
df_no_parse = pd.read_sql('SELECT * FROM employees', conn)
print("Without date parsing:")
print(f"hire_date dtype: {df_no_parse['hire_date'].dtype}")
print()

In [None]:
# With date parsing - hire_date becomes datetime
df_with_dates = pd.read_sql(
    'SELECT * FROM employees', 
    conn, 
    parse_dates=['hire_date']
)

print("With date parsing:")
print(f"hire_date dtype: {df_with_dates['hire_date'].dtype}")
print()

# Now we can use datetime operations
print("Hire year distribution:")
print(df_with_dates['hire_date'].dt.year.value_counts().sort_index())
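
The `hire_date` values are ISO-formatted (`YYYY-MM-DD`), so a plain list of column names is enough. When dates are stored in another layout, `parse_dates` also accepts a dict mapping each column to a `strftime` format. A small sketch, assuming a hypothetical `events` table with day-first dates in an in-memory database:

```python
import sqlite3
import pandas as pd

# Illustrative table with day-first date strings (not part of company.db)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE events (label TEXT, happened_on TEXT)')
demo_conn.executemany(
    'INSERT INTO events VALUES (?, ?)',
    [('kickoff', '15/03/2024'), ('launch', '01/06/2024')],
)

# Day-first strings are ambiguous, so state the format explicitly
df_events = pd.read_sql(
    'SELECT * FROM events',
    demo_conn,
    parse_dates={'happened_on': '%d/%m/%Y'},
)
print(df_events.dtypes)
print(df_events['happened_on'].dt.month.tolist())
demo_conn.close()
```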

### Loading JOINed Data

In [None]:
# Load employees with department names (JOIN in SQL)
df_emp_dept = pd.read_sql('''
    SELECT 
        e.id,
        e.name,
        d.name as department,
        e.salary,
        e.hire_date
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    ORDER BY d.name, e.salary DESC
''', conn, parse_dates=['hire_date'])

print("Employees with Department Names:")
df_emp_dept

---

## 2. pd.read_sql_query() vs pd.read_sql_table()

### pd.read_sql_query()

Execute a SQL query and return the results as a DataFrame. This is functionally identical to `pd.read_sql()` when passing a query string.

In [None]:
# pd.read_sql_query() - explicitly for queries
df_query = pd.read_sql_query(
    'SELECT name, salary FROM employees WHERE salary > 85000',
    conn
)

print("Using read_sql_query():")
df_query
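
One way to check that the two functions agree for a query string is to compare their results directly. A minimal sketch, assuming a hypothetical `items` table in an in-memory database:

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE items (name TEXT, price REAL)')
demo_conn.executemany(
    'INSERT INTO items VALUES (?, ?)',
    [('pen', 1.5), ('book', 12.0)],
)

query = 'SELECT name, price FROM items ORDER BY price'
df_a = pd.read_sql(query, demo_conn)
df_b = pd.read_sql_query(query, demo_conn)

# Raises AssertionError if the frames differ in values, dtypes, or labels
pd.testing.assert_frame_equal(df_a, df_b)
print('Both calls returned identical DataFrames')
demo_conn.close()
```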

### pd.read_sql_table()

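Reads an entire table by name, without writing any SQL. **Note**: This requires SQLAlchemy; it does not accept a plain `sqlite3` connection. Because this notebook focuses on the built-in `sqlite3` module, the SQLAlchemy version appears below only as commented-out code.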

In [None]:
# pd.read_sql_table() requires SQLAlchemy
# Here's how you would use it:

# from sqlalchemy import create_engine
# engine = create_engine('sqlite:///company.db')
# df_table = pd.read_sql_table('employees', engine)

# For sqlite3, use read_sql() with 'SELECT * FROM table_name' instead:
df_all_employees = pd.read_sql('SELECT * FROM employees', conn)
print("Full employees table:")
print(f"Shape: {df_all_employees.shape}")
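
If loading whole tables by name is a frequent need, the `SELECT * FROM table_name` pattern can be wrapped in a small helper. The function below, `load_table`, is a hypothetical convenience and not part of Pandas. Table names cannot be bound as query parameters, so this sketch checks the name against SQLite's own catalog before building the query:

```python
import sqlite3
import pandas as pd

def load_table(table_name: str, conn: sqlite3.Connection) -> pd.DataFrame:
    """Load a whole table by name over a plain sqlite3 connection."""
    # Only accept names that actually exist as tables in this database
    known = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if table_name not in known:
        raise ValueError(f'Unknown table: {table_name!r}')
    # Quote the identifier, doubling any embedded double quotes
    quoted = '"' + table_name.replace('"', '""') + '"'
    return pd.read_sql(f'SELECT * FROM {quoted}', conn)

# Illustrative in-memory table
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE colors (name TEXT)')
demo_conn.executemany('INSERT INTO colors VALUES (?)', [('red',), ('blue',)])

df_colors = load_table('colors', demo_conn)
print(df_colors)
demo_conn.close()
```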

---

## 3. Writing DataFrames to SQL with df.to_sql()

The `to_sql()` method writes a DataFrame to a SQL database table.

### Basic Usage

In [None]:
# Create a new DataFrame to write
new_employees = pd.DataFrame({
    'name': ['Peter Parker', 'Diana Prince', 'Bruce Wayne'],
    'department_id': [1, 3, 5],
    'salary': [72000, 98000, 150000],
    'hire_date': ['2024-01-15', '2024-02-01', '2024-01-02']
})

print("New employees to add:")
new_employees

In [None]:
# Write to a NEW table (not the existing employees table)
new_employees.to_sql(
    'new_hires',           # Table name
    conn,                   # Connection
    if_exists='replace',    # What to do if table exists
    index=False             # Don't write the DataFrame index
)

print("Data written to 'new_hires' table!")

# Verify
df_verify = pd.read_sql('SELECT * FROM new_hires', conn)
df_verify

### if_exists Parameter

Controls behavior when the table already exists:

| Value | Behavior |
|-------|----------|
| `'fail'` | Raise an error (default) |
| `'replace'` | Drop the table, recreate it from the DataFrame, and insert the new rows (existing rows are lost) |
| `'append'` | Add rows to existing table |
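
The default `'fail'` behavior is easy to overlook because it only shows up on the second write. A short sketch, assuming a hypothetical `scores` table in an in-memory database:

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')
df_scores = pd.DataFrame({'player': ['Ann', 'Ben'], 'score': [10, 7]})

# The first write creates the table
df_scores.to_sql('scores', demo_conn, index=False)

# The second write uses the default if_exists='fail' and is rejected
try:
    df_scores.to_sql('scores', demo_conn, index=False)
except ValueError as exc:
    print(f'Refused to overwrite: {exc}')

# The table still holds only the rows from the first write
row_count = demo_conn.execute('SELECT COUNT(*) FROM scores').fetchone()[0]
print(row_count)
demo_conn.close()
```

Choosing `'replace'` or `'append'` explicitly makes the intent of each write clear to the reader.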

In [None]:
# Append more data to the new_hires table
more_employees = pd.DataFrame({
    'name': ['Clark Kent', 'Barry Allen'],
    'department_id': [2, 1],
    'salary': [85000, 78000],
    'hire_date': ['2024-03-01', '2024-03-15']
})

more_employees.to_sql(
    'new_hires',
    conn,
    if_exists='append',  # Add to existing table
    index=False
)

print("Appended more employees!")

# Verify - should now have 5 rows
df_all_new = pd.read_sql('SELECT * FROM new_hires', conn)
print(f"Total rows: {len(df_all_new)}")
df_all_new

### Writing with Index

In [None]:
# Create DataFrame with meaningful index
dept_summary = pd.DataFrame({
    'total_budget': [500000, 300000, 400000],
    'employee_count': [5, 3, 3]
}, index=['Engineering', 'Marketing', 'Sales'])
dept_summary.index.name = 'department'

print("Department summary (with index):")
print(dept_summary)
print()

In [None]:
# Write WITH the index
dept_summary.to_sql(
    'department_summary',
    conn,
    if_exists='replace',
    index=True  # Include the index as a column
)

# Verify - index becomes a column
df_summary = pd.read_sql('SELECT * FROM department_summary', conn)
print("Written to SQL (index becomes 'department' column):")
df_summary
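
When the index is written, the SQL column takes the index name by default. The `index_label` argument of `to_sql()` overrides that name, and `index_col` restores the index when reading back. A minimal round-trip sketch with a hypothetical `region_summary` table:

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')

df_regions = pd.DataFrame(
    {'stores': [12, 8]},
    index=pd.Index(['North', 'South'], name='region'),
)

# index_label sets the SQL column name used for the index
df_regions.to_sql('region_summary', demo_conn, index=True, index_label='region_name')

# index_col turns that column back into the DataFrame index on read
df_back = pd.read_sql(
    'SELECT * FROM region_summary',
    demo_conn,
    index_col='region_name',
)
print(df_back)
demo_conn.close()
```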

### Specifying Data Types

In [None]:
# Create DataFrame with specific types
products = pd.DataFrame({
    'product_name': ['Widget', 'Gadget', 'Gizmo'],
    'price': [29.99, 49.99, 19.99],
    'quantity': [100, 50, 200],
    'in_stock': [True, True, False]
})

# Write with explicit SQL types
products.to_sql(
    'products',
    conn,
    if_exists='replace',
    index=False,
    dtype={
        'product_name': 'TEXT',
        'price': 'REAL',
        'quantity': 'INTEGER',
        'in_stock': 'INTEGER'  # SQLite stores booleans as 0/1
    }
)

# Check the table schema
cursor.execute('PRAGMA table_info(products)')
print("Products table schema:")
for col in cursor.fetchall():
    print(f"  {col[1]}: {col[2]}")
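
Because the boolean column is stored as integers, it comes back as 0/1 when the table is read. A small sketch of the round trip, assuming a hypothetical `flags` table, with the conversion back to `bool` done in Pandas:

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')
df_flags = pd.DataFrame({'feature': ['search', 'export'], 'enabled': [True, False]})

# dtype values are plain strings because this is a sqlite3 connection
df_flags.to_sql(
    'flags',
    demo_conn,
    index=False,
    dtype={'feature': 'TEXT', 'enabled': 'INTEGER'},
)

df_raw = pd.read_sql('SELECT * FROM flags', demo_conn)
raw_values = df_raw['enabled'].tolist()
print(raw_values)  # integers as stored by SQLite

# Restore the boolean dtype after loading
df_raw['enabled'] = df_raw['enabled'].astype(bool)
print(df_raw['enabled'].dtype)
demo_conn.close()
```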

---

## 4. When to Use SQL vs Pandas

Both SQL and Pandas can perform many of the same operations. Here's guidance on when to use each:

### Use SQL When:

| Scenario | Why SQL |
|----------|--------|
| **Filtering large datasets** | Database engine filters before transferring data |
| **Joining multiple tables** | SQL engines optimize joins efficiently |
| **Aggregating at scale** | Aggregate in database, transfer only results |
| **Data lives in database** | Avoid loading unnecessary data into memory |
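
When a result is too large to hold comfortably in memory even after filtering, `pd.read_sql()` can return it in pieces through its `chunksize` argument. The sketch below assumes a hypothetical `readings` table and a deliberately tiny chunk size, purely for illustration:

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE readings (sensor TEXT, value REAL)')
demo_conn.executemany(
    'INSERT INTO readings VALUES (?, ?)',
    [('a', 1.0), ('b', 2.0), ('a', 3.0), ('b', 4.0), ('a', 5.0)],
)

# With chunksize set, read_sql yields DataFrames of at most that many rows
total = 0.0
chunk_sizes = []
for chunk in pd.read_sql('SELECT * FROM readings', demo_conn, chunksize=2):
    chunk_sizes.append(len(chunk))
    total += chunk['value'].sum()

print(chunk_sizes, total)
demo_conn.close()
```

Only one chunk is held as a DataFrame at a time, so running totals like this one can be computed without loading the full result.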

### Use Pandas When:

| Scenario | Why Pandas |
|----------|------------|
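| **Complex transformations** | Chainable Python operations and custom functions that are awkward to express in SQL |
| **Time series operations** | Built-in resampling, rolling windows, and the `.dt` accessor |
| **Statistical analysis** | Built-in methods such as `describe()`, `corr()`, and `quantile()` |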
| **Data visualization prep** | Easy integration with matplotlib, seaborn |
| **Iterative exploration** | Quick feedback loop |
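
To make the overlap between the two tools concrete, here is the same aggregation written both ways. This is a sketch on a hypothetical `orders` table in an in-memory database; the results match, but the amount of data transferred differs:

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE orders (customer TEXT, amount REAL)')
demo_conn.executemany(
    'INSERT INTO orders VALUES (?, ?)',
    [('Ann', 20.0), ('Ben', 5.0), ('Ann', 30.0), ('Ben', 15.0)],
)

# Aggregate in the database: one row per customer is transferred
sql_totals = pd.read_sql('''
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
''', demo_conn)

# Aggregate in Pandas: every row is transferred first
all_orders = pd.read_sql('SELECT * FROM orders', demo_conn)
pd_totals = (
    all_orders.groupby('customer', as_index=False)['amount']
    .sum()
    .rename(columns={'amount': 'total'})
)

print(sql_totals)
print(pd_totals)
demo_conn.close()
```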

In [None]:
# Example: Let SQL do the heavy lifting, Pandas for analysis

# BAD: Load all data, then filter in Pandas
# df_all = pd.read_sql('SELECT * FROM employees', conn)
# df_filtered = df_all[df_all['salary'] > 90000]

# GOOD: Filter in SQL, get only what you need
df_filtered = pd.read_sql('''
    SELECT e.*, d.name as department_name
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    WHERE e.salary > 90000
''', conn)

print("Filtered data from SQL:")
df_filtered

In [None]:
# Example: SQL for aggregation, Pandas for presentation

# Get aggregated data from SQL
df_stats = pd.read_sql('''
    SELECT 
        d.name as department,
        COUNT(*) as employees,
        AVG(e.salary) as avg_salary,
        MIN(e.salary) as min_salary,
        MAX(e.salary) as max_salary
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    GROUP BY d.id, d.name
''', conn)

# Use Pandas for formatting and additional calculations
df_stats['salary_range'] = df_stats['max_salary'] - df_stats['min_salary']
df_stats['avg_salary'] = df_stats['avg_salary'].round(2)

print("Department Salary Statistics:")
df_stats

---

## 5. Practical Workflow: SQL for Filtering/Joining, Pandas for Analysis

Let's walk through a complete analysis workflow combining both tools.

### Step 1: Use SQL to Extract and Join Data

In [None]:
# Extract a comprehensive dataset with SQL
df_analysis = pd.read_sql('''
    SELECT 
        e.id as employee_id,
        e.name as employee_name,
        e.salary,
        e.hire_date,
        d.id as department_id,
        d.name as department,
        d.budget as department_budget
    FROM employees e
    JOIN departments d ON e.department_id = d.id
''', conn, parse_dates=['hire_date'])

print(f"Loaded {len(df_analysis)} employee records")
df_analysis.head()

### Step 2: Use Pandas for Transformations

In [None]:
# Add calculated columns with Pandas

# Years of service (365.25 days per year accounts for leap years)
df_analysis['years_of_service'] = (
    (pd.Timestamp.today() - df_analysis['hire_date']).dt.days / 365.25
).round(1)

# Salary as percentage of department budget
df_analysis['salary_pct_of_budget'] = (
    df_analysis['salary'] / df_analysis['department_budget'] * 100
).round(2)

# Salary category
df_analysis['salary_level'] = pd.cut(
    df_analysis['salary'],
    bins=[0, 70000, 90000, float('inf')],
    labels=['Entry', 'Mid', 'Senior']
)

df_analysis[['employee_name', 'salary', 'years_of_service', 'salary_pct_of_budget', 'salary_level']].head(10)
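
The bin edges passed to `pd.cut()` decide which level a salary sitting exactly on a boundary receives. By default each bin includes its right edge, as this standalone sketch with made-up salaries shows:

```python
import pandas as pd

# Made-up salaries chosen to sit on and around the bin edges
salaries = pd.Series([65000, 70000, 70001, 90000, 90001])

levels = pd.cut(
    salaries,
    bins=[0, 70000, 90000, float('inf')],
    labels=['Entry', 'Mid', 'Senior'],
)

# Bins are (0, 70000], (70000, 90000], (90000, inf]
print(levels.tolist())
# ['Entry', 'Entry', 'Mid', 'Mid', 'Senior']
```

Passing `right=False` would make each bin include its left edge instead, so a salary of exactly 90000 would fall in the last bin.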

### Step 3: Use Pandas for Aggregation and Analysis

In [None]:
# Analyze by department
dept_analysis = df_analysis.groupby('department').agg({
    'employee_id': 'count',
    'salary': ['mean', 'std', 'min', 'max'],
    'years_of_service': 'mean',
    'department_budget': 'first'
}).round(2)

# Flatten column names
dept_analysis.columns = ['_'.join(col).strip() for col in dept_analysis.columns.values]
dept_analysis = dept_analysis.rename(columns={
    'employee_id_count': 'num_employees',
    'salary_mean': 'avg_salary',
    'salary_std': 'salary_std_dev',
    'salary_min': 'min_salary',
    'salary_max': 'max_salary',
    'years_of_service_mean': 'avg_tenure',
    'department_budget_first': 'budget'
})

print("Department Analysis:")
dept_analysis
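
The flatten-and-rename step can be avoided with named aggregation, where each output column is declared as `new_name=(source_column, function)`. A small sketch on made-up data, shown as an alternative rather than a replacement for the approach above:

```python
import pandas as pd

# Made-up data with the same column names as df_analysis
df_demo = pd.DataFrame({
    'department': ['Eng', 'Eng', 'Ops'],
    'employee_id': [1, 2, 3],
    'salary': [100.0, 120.0, 90.0],
})

# Each keyword becomes an output column, so no MultiIndex is created
summary = df_demo.groupby('department').agg(
    num_employees=('employee_id', 'count'),
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
)
print(summary)
```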

In [None]:
# Salary level distribution by department
salary_dist = pd.crosstab(
    df_analysis['department'], 
    df_analysis['salary_level'],
    margins=True
)

print("Salary Level Distribution by Department:")
salary_dist

### Step 4: Write Results Back to Database

In [None]:
# Save the department analysis back to the database
dept_analysis.to_sql(
    'department_analysis',
    conn,
    if_exists='replace',
    index=True  # Keep department as a column
)

print("Department analysis saved to database!")

# Verify
pd.read_sql('SELECT * FROM department_analysis', conn)

### Step 5: Create a Summary Report

In [None]:
# Combine SQL and Pandas for a final report
print("=" * 60)
print("COMPANY WORKFORCE ANALYSIS REPORT")
print("=" * 60)

# Company-wide stats
print("\n### Company Overview ###")
print(f"Total Employees: {len(df_analysis)}")
print(f"Total Salary Expense: ${df_analysis['salary'].sum():,.0f}")
print(f"Average Salary: ${df_analysis['salary'].mean():,.0f}")
print(f"Average Tenure: {df_analysis['years_of_service'].mean():.1f} years")

# Top earners
print("\n### Top 5 Earners ###")
top_earners = df_analysis.nlargest(5, 'salary')[['employee_name', 'department', 'salary']]
for _, row in top_earners.iterrows():
    print(f"  {row['employee_name']} ({row['department']}): ${row['salary']:,.0f}")

# Longest tenured
print("\n### Longest Tenured Employees ###")
veterans = df_analysis.nlargest(3, 'years_of_service')[['employee_name', 'years_of_service', 'department']]
for _, row in veterans.iterrows():
    print(f"  {row['employee_name']} ({row['department']}): {row['years_of_service']:.1f} years")

# Department with highest avg salary
print("\n### Highest Paying Department ###")
highest_paying = dept_analysis['avg_salary'].idxmax()
print(f"  {highest_paying}: ${dept_analysis.loc[highest_paying, 'avg_salary']:,.0f} average")

print("\n" + "=" * 60)

---

## Exercises

### Exercise 1: Basic SQL to DataFrame

Load all projects from the database into a DataFrame, parsing the `start_date` and `end_date` columns as dates. Display the DataFrame.

In [None]:
# Your code here


<details>
<summary>Click to see solution</summary>

```python
df_projects = pd.read_sql(
    'SELECT * FROM projects',
    conn,
    parse_dates=['start_date', 'end_date']
)

print("Projects DataFrame:")
print("Date column dtypes:")
print(f"  start_date: {df_projects['start_date'].dtype}")
print(f"  end_date: {df_projects['end_date'].dtype}")
print()
df_projects
```

</details>

### Exercise 2: Parameterized Query

Write a function that takes a department name and minimum salary as parameters, and returns a DataFrame of employees matching those criteria. Use parameterized queries for safety.

In [None]:
# Your code here


<details>
<summary>Click to see solution</summary>

```python
def get_employees_by_dept_salary(dept_name: str, min_salary: float) -> pd.DataFrame:
    """Get employees from a department earning at least min_salary."""
    query = '''
        SELECT e.name, e.salary, d.name as department
        FROM employees e
        JOIN departments d ON e.department_id = d.id
        WHERE d.name = ? AND e.salary >= ?
        ORDER BY e.salary DESC
    '''
    return pd.read_sql(query, conn, params=(dept_name, min_salary))

# Test the function
result = get_employees_by_dept_salary('Engineering', 90000)
print("Engineering employees earning >= $90k:")
result
```

</details>

### Exercise 3: DataFrame to SQL

Create a DataFrame containing quarterly sales data (Q1-Q4) for 3 products. Write it to a new table called 'quarterly_sales'. Then read it back to verify.

In [None]:
# Your code here


<details>
<summary>Click to see solution</summary>

```python
# Create the DataFrame
quarterly_sales = pd.DataFrame({
    'product': ['Widget', 'Widget', 'Widget', 'Widget',
                'Gadget', 'Gadget', 'Gadget', 'Gadget',
                'Gizmo', 'Gizmo', 'Gizmo', 'Gizmo'],
    'quarter': ['Q1', 'Q2', 'Q3', 'Q4'] * 3,
    'sales': [10000, 12000, 15000, 18000,
              8000, 9500, 11000, 12500,
              5000, 6000, 7500, 9000]
})

# Write to database
quarterly_sales.to_sql(
    'quarterly_sales',
    conn,
    if_exists='replace',
    index=False
)
print("Written to 'quarterly_sales' table!")

# Read back to verify
df_verify = pd.read_sql('SELECT * FROM quarterly_sales', conn)
print(f"\nVerification ({len(df_verify)} rows):")
df_verify
```

</details>

### Exercise 4: Combined Workflow

1. Use SQL to get all employees with their department names and project counts
2. Use Pandas to calculate the salary percentile of each employee within their department
3. Display the top 3 employees by percentile in each department

In [None]:
# Your code here


<details>
<summary>Click to see solution</summary>

```python
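# Step 1: SQL query to get employees with department names and the
# number of projects owned by each employee's department
# (projects are linked to departments, not to individual employees)
df = pd.read_sql('''
    SELECT
        e.name,
        e.salary,
        d.name as department,
        (SELECT COUNT(*) FROM projects p
         WHERE p.department_id = d.id) as project_count
    FROM employees e
    JOIN departments d ON e.department_id = d.id
''', conn)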

# Step 2: Calculate salary percentile within department
df['salary_percentile'] = df.groupby('department')['salary'].transform(
    lambda x: (x.rank(pct=True) * 100).round(1)
)

# Step 3: Get top 3 by percentile in each department
top_by_dept = df.sort_values(['department', 'salary_percentile'], ascending=[True, False])
top_by_dept = top_by_dept.groupby('department').head(3)

print("Top 3 Earners by Percentile in Each Department:")
for dept in top_by_dept['department'].unique():
    print(f"\n{dept}:")
    dept_data = top_by_dept[top_by_dept['department'] == dept]
    for _, row in dept_data.iterrows():
        print(f"  {row['name']}: ${row['salary']:,.0f} ({row['salary_percentile']}th percentile)")
```

</details>

### Exercise 5: Analysis Pipeline

Create a complete analysis pipeline that:
1. Loads project data with department names from SQL
2. Calculates project duration in days using Pandas
3. Groups by department to find average project duration
4. Saves the results back to a new 'project_stats' table

In [None]:
# Your code here


<details>
<summary>Click to see solution</summary>

```python
# Step 1: Load projects with department names
df_projects = pd.read_sql('''
    SELECT 
        p.name as project,
        d.name as department,
        p.start_date,
        p.end_date
    FROM projects p
    JOIN departments d ON p.department_id = d.id
''', conn, parse_dates=['start_date', 'end_date'])

# Step 2: Calculate project duration
df_projects['duration_days'] = (df_projects['end_date'] - df_projects['start_date']).dt.days

print("Projects with Duration:")
print(df_projects[['project', 'department', 'duration_days']])

# Step 3: Group by department
project_stats = df_projects.groupby('department').agg({
    'project': 'count',
    'duration_days': ['mean', 'min', 'max']
}).round(1)

project_stats.columns = ['num_projects', 'avg_duration', 'min_duration', 'max_duration']

print("\nProject Statistics by Department:")
print(project_stats)

# Step 4: Save to database
project_stats.to_sql('project_stats', conn, if_exists='replace', index=True)
print("\nSaved to 'project_stats' table!")

# Verify
pd.read_sql('SELECT * FROM project_stats', conn)
```

</details>

---

## Summary

In this notebook, you learned:

### Loading Data from SQL
- `pd.read_sql(query, conn)` - Execute query and return DataFrame
- `pd.read_sql_query()` - Same as `pd.read_sql()` when given a query string
- `pd.read_sql_table()` - Read a whole table by name (requires SQLAlchemy)
- Use `params` for safe parameterized queries
- Use `parse_dates` to automatically convert date columns
- Use `index_col` to set the DataFrame index

### Writing Data to SQL
- `df.to_sql(table_name, conn)` - Write DataFrame to table
- `if_exists`: 'fail' (default), 'replace', or 'append'
- `index=False` to exclude DataFrame index
- `dtype` to specify SQL column types

### Best Practices
1. **Use SQL for**: Filtering, joining, aggregating large datasets
2. **Use Pandas for**: Complex transformations, time series, statistics
3. **Workflow**: SQL extracts and joins -> Pandas transforms and analyzes -> Write results back

### Key Functions

| Function | Purpose |
|----------|--------|
| `pd.read_sql()` | Load SQL query results to DataFrame |
| `pd.read_sql_query()` | Execute SQL query (explicit) |
| `df.to_sql()` | Write DataFrame to SQL table |

## Congratulations!

You've completed the SQL in Python module! You now know how to:
- Create and manage SQLite databases
- Perform CRUD operations (Create, Read, Update, Delete)
- Write advanced queries with JOINs and subqueries
- Integrate SQL with Pandas for powerful data analysis workflows

---

## Cleanup

In [None]:
# Close connection and remove database file
conn.close()

if os.path.exists('company.db'):
    os.remove('company.db')
    print("Database file removed.")