# Reading and Writing Data

## Learning Objectives

By the end of this notebook, you will be able to:

1. Read and write CSV files with various options
2. Work with Excel files (single and multiple sheets)
3. Read and write JSON data
4. Understand the basics of SQL database connections
5. Handle common file reading issues

---

## Setup

First, let's import pandas and create some sample data that we'll save to files.

In [None]:
import pandas as pd
import numpy as np
from io import StringIO  # For simulating file content
import json

print(f"Pandas version: {pd.__version__}")

In [None]:
# Create sample data for demonstrations
sample_data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR', 'Marketing'],
    'salary': [75000, 65000, 80000, 60000, 70000],
    'hire_date': ['2020-01-15', '2019-06-01', '2021-03-20', '2018-11-10', '2022-02-28']
})
print("Sample Data:")
print(sample_data)

---

## 1. CSV Files

CSV (Comma-Separated Values) is one of the most common formats for tabular data. Pandas provides `read_csv()` for reading and `to_csv()` for writing.

### 1.1 Writing to CSV

In [None]:
# Write to CSV
sample_data.to_csv('employees.csv', index=False)
print("File saved: employees.csv")

# Let's see what the file looks like
with open('employees.csv', 'r') as f:
    print(f.read())

In [None]:
# Write with index included
sample_data.to_csv('employees_with_index.csv', index=True)
print("File saved with index:")
with open('employees_with_index.csv', 'r') as f:
    print(f.read())
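
A file written with `index=True` stores the index as an unnamed first column. The sketch below uses an in-memory `StringIO` buffer and a hypothetical two-row `demo` frame (neither is one of the files above) to show what that column looks like when read back, and how `index_col=0` restores it as the index.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row frame, used only for this illustration
demo = pd.DataFrame({'name': ['Alice', 'Bob'], 'salary': [75000, 65000]})

# Write to an in-memory buffer with the index included
buffer = StringIO()
demo.to_csv(buffer, index=True)

# Reading it back naively turns the saved index into an extra column
buffer.seek(0)
naive = pd.read_csv(buffer)
print(naive.columns.tolist())

# index_col=0 tells pandas to use the first column as the index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

If the index carries no information (a plain 0, 1, 2, ... range), writing with `index=False` avoids the extra column altogether.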

### 1.2 Reading from CSV

In [None]:
# Basic reading
df = pd.read_csv('employees.csv')
print("Read from CSV:")
print(df)
print(f"\nData types:\n{df.dtypes}")
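
`read_csv()` infers each column's type from its contents, which is not always what you want. As a small illustration, the hypothetical `zip_code` column below loses its leading zero when inferred as an integer; passing `dtype` keeps it as text. The data and column names are made up for this sketch.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: ZIP codes that start with a zero
csv_text = """name,zip_code
Alice,02134
Bob,10001
"""

# Default inference treats zip_code as an integer column
inferred = pd.read_csv(StringIO(csv_text))
print(inferred['zip_code'].tolist())

# dtype forces the column to be read as strings, preserving the leading zero
as_text = pd.read_csv(StringIO(csv_text), dtype={'zip_code': str})
print(as_text['zip_code'].tolist())
```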

### 1.3 CSV Options

In [None]:
# Simulate CSV with different separators and options
csv_content = """name;age;city;score
Alice;25;New York;85.5
Bob;30;Los Angeles;92.0
Charlie;35;Chicago;78.5
"""

# Read with semicolon separator
df = pd.read_csv(StringIO(csv_content), sep=';')
print("CSV with semicolon separator:")
print(df)

In [None]:
# Read specific columns
df = pd.read_csv('employees.csv', usecols=['name', 'salary'])
print("Only name and salary columns:")
print(df)

In [None]:
# Read with custom column names
csv_no_header = """1,Alice,Engineering,75000
2,Bob,Marketing,65000
3,Charlie,Engineering,80000
"""

df = pd.read_csv(StringIO(csv_no_header), 
                 names=['id', 'name', 'dept', 'salary'],
                 header=None)
print("CSV without header (custom column names):")
print(df)

In [None]:
# Parse dates automatically
df = pd.read_csv('employees.csv', parse_dates=['hire_date'])
print("With parsed dates:")
print(df)
print(f"\nhire_date type: {df['hire_date'].dtype}")
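
When a column contains values that cannot be parsed as dates, `parse_dates` may leave it as plain text. One possible alternative, sketched here with a hypothetical `raw` Series, is to convert after reading with `pd.to_datetime()`, stating the expected format and turning unparsable entries into `NaT`.

```python
import pandas as pd

# Hypothetical column with one value that is not a date
raw = pd.Series(['2020-01-15', '2019-06-01', 'not a date'])

# format pins the expected layout; errors='coerce' yields NaT for bad values
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(parsed)
print(f"Unparsable values: {parsed.isna().sum()}")
```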

In [None]:
# Set a column as index
df = pd.read_csv('employees.csv', index_col='id')
print("With id as index:")
print(df)

In [None]:
# Read only first N rows (useful for large files)
df = pd.read_csv('employees.csv', nrows=3)
print("First 3 rows only:")
print(df)
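
`nrows` is handy for previewing, but it does not help when every row of a large file has to be processed. A minimal sketch of one approach, assuming the work can be done piece by piece, is the `chunksize` option, which makes `read_csv()` yield DataFrames of a fixed maximum size. The ten-row `csv_text` below is a made-up stand-in for a large file.

```python
from io import StringIO

import pandas as pd

# Hypothetical data standing in for a file too large to load at once
csv_text = "id,salary\n" + "\n".join(
    f"{i},{50000 + i * 1000}" for i in range(1, 11)
)

total_salary = 0
rows_seen = 0

# chunksize=4 yields DataFrames of at most 4 rows each
for chunk in pd.read_csv(StringIO(csv_text), chunksize=4):
    total_salary += int(chunk['salary'].sum())
    rows_seen += len(chunk)

print(f"Rows processed: {rows_seen}")
print(f"Total salary: {total_salary}")
```

Only one chunk is held in memory at a time, so the running totals are the only state that grows with the file.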

### 1.4 Handling Missing Values

In [None]:
# CSV with missing values
csv_with_missing = """name,age,city,score
Alice,25,New York,85.5
Bob,,Los Angeles,92.0
Charlie,35,,78.5
Diana,28,Houston,
Eve,NA,Phoenix,88.0
"""

# Pass na_values explicitly ('NA' and empty fields are already treated as missing by default)
df = pd.read_csv(StringIO(csv_with_missing), na_values=['NA', ''])
print("CSV with missing values:")
print(df)
print(f"\nNull counts:\n{df.isnull().sum()}")
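
Sometimes a default missing-value marker is real data. In the hypothetical example below, `NA` is a country code rather than a missing value. Combining `keep_default_na=False` with an explicit `na_values` list limits what pandas treats as missing; the column names and data are assumptions made for this sketch.

```python
from io import StringIO

import pandas as pd

# Hypothetical data where 'NA' is a legitimate value, not a missing one
csv_text = """country_code,visitors
NA,120
US,
DE,75
"""

# Default behavior: 'NA' is read as missing
default = pd.read_csv(StringIO(csv_text))
print(default)

# Only empty fields count as missing; 'NA' is kept as text
literal = pd.read_csv(
    StringIO(csv_text),
    keep_default_na=False,
    na_values=['']
)
print(literal)
```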

---

## 2. Excel Files

Pandas can read and write Excel files using `read_excel()` and `to_excel()`. Both rely on an optional engine package that is installed separately; for `.xlsx` files this is typically `openpyxl`.

In [None]:
# Check if openpyxl is available
try:
    import openpyxl
    print(f"openpyxl version: {openpyxl.__version__}")
    EXCEL_AVAILABLE = True
except ImportError:
    print("openpyxl not installed. Install with: pip install openpyxl")
    EXCEL_AVAILABLE = False

### 2.1 Writing to Excel

In [None]:
if EXCEL_AVAILABLE:
    # Write to Excel
    sample_data.to_excel('employees.xlsx', index=False, sheet_name='Employees')
    print("File saved: employees.xlsx")
else:
    print("Skipping Excel write (openpyxl not installed)")

In [None]:
if EXCEL_AVAILABLE:
    # Write multiple sheets to one Excel file
    with pd.ExcelWriter('multi_sheet.xlsx') as writer:
        sample_data.to_excel(writer, sheet_name='Employees', index=False)
        
        # Create another DataFrame for a second sheet
        departments = pd.DataFrame({
            'department': ['Engineering', 'Marketing', 'HR'],
            'budget': [500000, 300000, 200000],
            'headcount': [50, 30, 20]
        })
        departments.to_excel(writer, sheet_name='Departments', index=False)
    
    print("File saved: multi_sheet.xlsx with 2 sheets")

### 2.2 Reading from Excel

In [None]:
if EXCEL_AVAILABLE:
    # Basic reading
    df = pd.read_excel('employees.xlsx')
    print("Read from Excel:")
    print(df)
else:
    print("Skipping Excel read (openpyxl not installed)")

In [None]:
if EXCEL_AVAILABLE:
    # Read specific sheet
    employees = pd.read_excel('multi_sheet.xlsx', sheet_name='Employees')
    departments = pd.read_excel('multi_sheet.xlsx', sheet_name='Departments')
    
    print("Employees sheet:")
    print(employees)
    print("\nDepartments sheet:")
    print(departments)

In [None]:
if EXCEL_AVAILABLE:
    # Read all sheets into a dictionary
    all_sheets = pd.read_excel('multi_sheet.xlsx', sheet_name=None)
    print(f"Sheet names: {list(all_sheets.keys())}")
    print(f"Type: {type(all_sheets)}")
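
With `sheet_name=None`, the result is a dictionary that maps each sheet name to a DataFrame. The sketch below shows one way to work with such a dictionary. To stay runnable without `openpyxl`, it uses a hand-built dictionary as a stand-in for the value returned by `read_excel()`.

```python
import pandas as pd

# Hand-built stand-in for the dict returned by read_excel(..., sheet_name=None)
all_sheets = {
    'Employees': pd.DataFrame({'name': ['Alice', 'Bob'],
                               'salary': [75000, 65000]}),
    'Departments': pd.DataFrame({'department': ['Engineering'],
                                 'budget': [500000]}),
}

# Summarize every sheet as name -> (rows, columns)
shapes = {sheet_name: frame.shape for sheet_name, frame in all_sheets.items()}
print(shapes)

# Pull out a single sheet by name
employees = all_sheets['Employees']
print(employees)
```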

---

## 3. JSON Files

JSON (JavaScript Object Notation) is common for web APIs and configuration files.

### 3.1 Writing to JSON

In [None]:
# Write to JSON (default orientation: columns)
sample_data.to_json('employees.json')
print("File saved: employees.json")

with open('employees.json', 'r') as f:
    print(f.read())

In [None]:
# Write with different orientations
# 'records' orientation - list of dictionaries (common for APIs)
sample_data.to_json('employees_records.json', orient='records', indent=2)
print("Records orientation:")
with open('employees_records.json', 'r') as f:
    print(f.read())

In [None]:
# 'index' orientation - dictionary keyed by index
sample_data.head(2).to_json('employees_index.json', orient='index', indent=2)
print("Index orientation:")
with open('employees_index.json', 'r') as f:
    print(f.read())
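
To compare orientations side by side without creating files, `to_json()` can be called with no path, in which case it returns the JSON text as a string. The following sketch prints several orientations for a hypothetical two-row `demo` frame.

```python
import json

import pandas as pd

# Hypothetical two-row frame, used only for this illustration
demo = pd.DataFrame({'name': ['Alice', 'Bob'], 'salary': [75000, 65000]})

# With no path argument, to_json() returns a string
for orient in ['columns', 'records', 'index', 'split', 'values']:
    print(f"{orient:>8}: {demo.to_json(orient=orient)}")

# The string is ordinary JSON and can be parsed with the json module
records = json.loads(demo.to_json(orient='records'))
split = json.loads(demo.to_json(orient='split'))
```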

### 3.2 Reading from JSON

In [None]:
# Read JSON with default orientation
df = pd.read_json('employees.json')
print("Read from JSON (columns orientation):")
print(df)

In [None]:
# Read JSON with records orientation
df = pd.read_json('employees_records.json', orient='records')
print("Read from JSON (records orientation):")
print(df)
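
A related layout is JSON Lines, where each record sits on its own line instead of inside one large list. As a brief sketch, pandas writes and reads it with `lines=True` together with the `records` orientation; the `demo` frame is again hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row frame, used only for this illustration
demo = pd.DataFrame({'name': ['Alice', 'Bob'], 'salary': [75000, 65000]})

# One JSON object per line (lines=True requires orient='records')
jsonl_text = demo.to_json(orient='records', lines=True)
print(jsonl_text)

# Read the line-delimited text back into a DataFrame
loaded = pd.read_json(StringIO(jsonl_text), lines=True)
print(loaded)
```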

### 3.3 Nested JSON

In [None]:
# Handling nested JSON
nested_json = """[
    {"name": "Alice", "info": {"age": 25, "city": "NYC"}, "scores": [85, 90, 88]},
    {"name": "Bob", "info": {"age": 30, "city": "LA"}, "scores": [92, 88, 95]},
    {"name": "Charlie", "info": {"age": 35, "city": "Chicago"}, "scores": [78, 82, 80]}
]"""

# Basic read keeps nested structure
df = pd.read_json(StringIO(nested_json))
print("Nested JSON (basic read):")
print(df)
print(f"\ninfo column type: {type(df['info'].iloc[0])}")

In [None]:
# Use json_normalize to flatten nested JSON
data = json.loads(nested_json)
df_flat = pd.json_normalize(data)
print("Flattened JSON:")
print(df_flat)
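
`json_normalize()` flattens nested dictionaries, but lists of records inside each entry need the `record_path` and `meta` arguments. The sketch below uses a hypothetical `orders` structure (not the data above) to produce one row per nested item while carrying parent fields along.

```python
import pandas as pd

# Hypothetical data: each order holds a list of item records
orders = [
    {"order_id": 1, "customer": {"name": "Alice"},
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]},
    {"order_id": 2, "customer": {"name": "Bob"},
     "items": [{"sku": "C3", "qty": 5}]},
]

# record_path selects the nested list to expand into rows;
# meta lists parent fields to repeat on every row
items = pd.json_normalize(
    orders,
    record_path="items",
    meta=["order_id", ["customer", "name"]]
)
print(items)
```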

---

## 4. SQL Databases

Pandas can interact with SQL databases using `read_sql()` and `to_sql()`. We'll demonstrate with SQLite, which Python supports through the built-in `sqlite3` module.

In [None]:
import sqlite3

# Create an in-memory SQLite database
conn = sqlite3.connect(':memory:')

# Write DataFrame to SQL table
sample_data.to_sql('employees', conn, index=False, if_exists='replace')
print("Data written to SQL table 'employees'")
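
The `if_exists` argument decides what happens when the target table is already present. The sketch below walks through the three settings on a separate in-memory database; the `staff` table name and `demo` frame are hypothetical and chosen to avoid touching the `employees` table above.

```python
import sqlite3

import pandas as pd

# Hypothetical two-row frame and table name, used only for this illustration
demo = pd.DataFrame({'name': ['Alice', 'Bob'], 'salary': [75000, 65000]})
count_query = 'SELECT COUNT(*) AS n FROM staff'

demo_conn = sqlite3.connect(':memory:')
try:
    # First write creates the table
    demo.to_sql('staff', demo_conn, index=False)

    # 'append' adds rows to the existing table
    demo.to_sql('staff', demo_conn, index=False, if_exists='append')
    appended_rows = int(pd.read_sql(count_query, demo_conn)['n'].iloc[0])

    # 'replace' drops the table and writes it again
    demo.to_sql('staff', demo_conn, index=False, if_exists='replace')
    replaced_rows = int(pd.read_sql(count_query, demo_conn)['n'].iloc[0])

    # The default, 'fail', raises ValueError if the table exists
    try:
        demo.to_sql('staff', demo_conn, index=False)
        failed = False
    except ValueError as exc:
        failed = True
        print(f"Default if_exists='fail': {exc}")
finally:
    demo_conn.close()

print(f"Rows after append: {appended_rows}, after replace: {replaced_rows}")
```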

In [None]:
# Read entire table
df = pd.read_sql('SELECT * FROM employees', conn)
print("Read from SQL:")
print(df)

In [None]:
# Read with SQL query
query = """
SELECT name, department, salary 
FROM employees 
WHERE salary > 65000
ORDER BY salary DESC
"""
df = pd.read_sql(query, conn)
print("Filtered SQL query result:")
print(df)
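
When a filter value comes from a variable or user input, it is generally safer to pass it through `params` than to build the SQL string by hand. A minimal sketch, assuming a `sqlite3` connection (which uses `?` placeholders) and a hypothetical `staff` table:

```python
import sqlite3

import pandas as pd

# Hypothetical frame and table name, used only for this illustration
demo = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [75000, 65000, 80000]
})

demo_conn = sqlite3.connect(':memory:')
try:
    demo.to_sql('staff', demo_conn, index=False)

    # The ? placeholder is filled in by the database driver, not by string formatting
    min_salary = 70000
    query = 'SELECT name, salary FROM staff WHERE salary > ? ORDER BY salary DESC'
    result = pd.read_sql(query, demo_conn, params=(min_salary,))
finally:
    demo_conn.close()

print(result)
```

Placeholder syntax depends on the database driver, so other drivers may expect a different style than `?`.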

In [None]:
# Aggregation with SQL
query = """
SELECT department, 
       COUNT(*) as employee_count,
       AVG(salary) as avg_salary
FROM employees 
GROUP BY department
"""
df = pd.read_sql(query, conn)
print("SQL aggregation:")
print(df)

In [None]:
# Close the connection
conn.close()

---

## 5. Other Formats

Pandas supports many other formats. Here are a few quick examples.

### 5.1 Parquet (Efficient Binary Format)

In [None]:
# Check if pyarrow is available
try:
    import pyarrow
    
    # Write to Parquet
    sample_data.to_parquet('employees.parquet')
    print("Written to Parquet")
    
    # Read from Parquet
    df = pd.read_parquet('employees.parquet')
    print("Read from Parquet:")
    print(df)
except ImportError:
    print("pyarrow not installed. Install with: pip install pyarrow")

### 5.2 Clipboard

In [None]:
# Copy DataFrame to clipboard (useful for pasting into Excel)
# sample_data.to_clipboard(index=False)

# Read from clipboard
# df = pd.read_clipboard()

print("Clipboard operations available:")
print("  df.to_clipboard() - copy to clipboard")
print("  pd.read_clipboard() - read from clipboard")

### 5.3 HTML Tables

In [None]:
# Write to HTML
html = sample_data.to_html(index=False)
print("HTML output (first 500 chars):")
print(html[:500])

---

## Exercises

Practice reading and writing data with these exercises.

### Exercise 1: Create and Save CSV

1. Create a DataFrame with the following data about products:
   - Product: Laptop, Phone, Tablet, Watch, Headphones
   - Price: 999.99, 699.99, 449.99, 299.99, 149.99
   - Stock: 50, 150, 80, 200, 300
   - Category: Electronics, Electronics, Electronics, Wearables, Audio

2. Save it to 'products.csv' without the index
3. Read it back and verify the data types

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# 1. Create DataFrame
products = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Watch', 'Headphones'],
    'Price': [999.99, 699.99, 449.99, 299.99, 149.99],
    'Stock': [50, 150, 80, 200, 300],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Wearables', 'Audio']
})
print("Products DataFrame:")
print(products)

# 2. Save to CSV
products.to_csv('products.csv', index=False)
print("\nSaved to products.csv")

# 3. Read back and verify
products_loaded = pd.read_csv('products.csv')
print("\nLoaded from CSV:")
print(products_loaded)
print(f"\nData types:\n{products_loaded.dtypes}")
```
</details>

### Exercise 2: Parse CSV with Options

Read the following CSV content (stored as a string) with appropriate options:
- Use semicolon as separator
- Parse the 'date' column as datetime
- Set 'order_id' as the index

```python
csv_content = """order_id;customer;date;amount
1001;John;2024-01-15;150.00
1002;Jane;2024-01-16;225.50
1003;Bob;2024-01-17;89.99
1004;Alice;2024-01-18;312.00
"""
```

In [None]:
# Your code here
csv_content = """order_id;customer;date;amount
1001;John;2024-01-15;150.00
1002;Jane;2024-01-16;225.50
1003;Bob;2024-01-17;89.99
1004;Alice;2024-01-18;312.00
"""


<details>
<summary>Click to reveal solution</summary>

```python
df = pd.read_csv(
    StringIO(csv_content),
    sep=';',
    parse_dates=['date'],
    index_col='order_id'
)

print("Parsed DataFrame:")
print(df)
print(f"\nData types:\n{df.dtypes}")
print(f"\nIndex: {df.index}")
```
</details>

### Exercise 3: JSON Operations

1. Create a DataFrame with 3 students and their grades in Math, Science, and English
2. Save it to JSON using 'records' orientation
3. Read it back and verify

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# 1. Create DataFrame
students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'Math': [92, 85, 78],
    'Science': [88, 90, 82],
    'English': [95, 87, 91]
})
print("Students DataFrame:")
print(students)

# 2. Save to JSON with records orientation
students.to_json('students.json', orient='records', indent=2)
print("\nSaved to students.json")

# Show the file content
with open('students.json', 'r') as f:
    print(f.read())

# 3. Read back
students_loaded = pd.read_json('students.json', orient='records')
print("\nLoaded from JSON:")
print(students_loaded)
```
</details>

### Exercise 4: SQL Query

1. Create an in-memory SQLite database
2. Save the products DataFrame (from Exercise 1) to a table called 'products'
3. Write a SQL query to find all products with price > 200 and stock > 100
4. Read the results into a DataFrame

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# Recreate products DataFrame
products = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Watch', 'Headphones'],
    'Price': [999.99, 699.99, 449.99, 299.99, 149.99],
    'Stock': [50, 150, 80, 200, 300],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Wearables', 'Audio']
})

# 1. Create SQLite connection
conn = sqlite3.connect(':memory:')

# 2. Save to SQL
products.to_sql('products', conn, index=False, if_exists='replace')
print("Data saved to SQL table 'products'")

# 3 & 4. Query and read results
query = """
SELECT * FROM products
WHERE Price > 200 AND Stock > 100
"""
result = pd.read_sql(query, conn)
print("\nProducts with price > 200 and stock > 100:")
print(result)

conn.close()
```
</details>

### Exercise 5: Handle Nested JSON

Parse the following nested JSON and flatten it into a DataFrame:

```python
nested_data = [
    {"id": 1, "user": {"name": "Alice", "email": "alice@example.com"}, "active": True},
    {"id": 2, "user": {"name": "Bob", "email": "bob@example.com"}, "active": False},
    {"id": 3, "user": {"name": "Charlie", "email": "charlie@example.com"}, "active": True}
]
```

In [None]:
# Your code here
nested_data = [
    {"id": 1, "user": {"name": "Alice", "email": "alice@example.com"}, "active": True},
    {"id": 2, "user": {"name": "Bob", "email": "bob@example.com"}, "active": False},
    {"id": 3, "user": {"name": "Charlie", "email": "charlie@example.com"}, "active": True}
]


<details>
<summary>Click to reveal solution</summary>

```python
# Use json_normalize to flatten
df_flat = pd.json_normalize(nested_data)
print("Flattened DataFrame:")
print(df_flat)
print(f"\nColumns: {df_flat.columns.tolist()}")
```
</details>

---

## Cleanup

In [None]:
# Clean up created files
import os

files_to_remove = [
    'employees.csv', 'employees_with_index.csv',
    'employees.xlsx', 'multi_sheet.xlsx',
    'employees.json', 'employees_records.json', 'employees_index.json',
    'products.csv', 'students.json', 'employees.parquet'
]

for f in files_to_remove:
    if os.path.exists(f):
        os.remove(f)
        print(f"Removed: {f}")

print("\nCleanup complete!")
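
An alternative to removing files by hand is to write scratch files into a temporary directory that deletes itself. The sketch below uses the standard library's `tempfile.TemporaryDirectory`; the `demo` frame and `demo.csv` file name are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame, used only for this illustration
demo = pd.DataFrame({'name': ['Alice', 'Bob'], 'salary': [75000, 65000]})

# The directory and everything in it are removed when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'demo.csv'
    demo.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)
    existed_inside = csv_path.exists()

print(f"Existed inside the block: {existed_inside}")
print(f"Exists afterwards: {csv_path.exists()}")
```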

---

## Summary

In this notebook, you learned:

1. **CSV Files**:
   - `read_csv()` with options: `sep`, `usecols`, `names`, `header`, `parse_dates`, `index_col`, `nrows`, `na_values`
   - `to_csv()` with index option

2. **Excel Files**:
   - `read_excel()` for single/multiple sheets
   - `to_excel()` with `ExcelWriter` for multiple sheets
   - Requires the `openpyxl` package for `.xlsx` files

3. **JSON Files**:
   - `read_json()` and `to_json()` with different orientations
   - `json_normalize()` for nested JSON

4. **SQL Databases**:
   - `read_sql()` for queries
   - `to_sql()` for writing tables
   - Works with SQLite connections directly and with any database supported by SQLAlchemy

5. **Other Formats**: Parquet, clipboard, HTML

---

## Next Steps

Continue to the next notebook: **[03_indexing_and_selection.ipynb](03_indexing_and_selection.ipynb)** to learn how to select and filter data using `loc`, `iloc`, and boolean indexing.