# Reading and Writing Files in Python

A practical guide for Data Analysts

**You'll learn:**
- How to work with different file formats (TXT, CSV, Excel, JSON)
- Basic Python file operations
- Why and how to use Pandas for data files
- Best practices for real-world scenarios

**Why is this important?**
As a Data Analyst, most of your work starts with loading data from files, so reading and writing them reliably is a core skill.

In [None]:
%pip install pandas

In [None]:

# Import libraries we'll use throughout this notebook
import os
import pandas as pd
import numpy as np
import json
import zipfile

print("✓ Libraries loaded successfully")

## 1. Understanding File Paths

Before reading any file, you need to know WHERE it is.

**Key concepts:**
- **Current directory**: Where your Python program is running
- **Relative path**: Path from current directory (`data/file.csv`)
- **Absolute path**: Complete path (`C:/Users/you/data/file.csv`)

**Best practice:** Always use `os.path.join()` for cross-platform compatibility.

In [None]:
# Where are we now?
current_dir = os.getcwd()
print(f"📁 Current directory: {current_dir}")

# Create a 'data' folder for our examples
data_dir = "data"
if not os.path.exists(data_dir):
    os.makedirs(data_dir)
    print(f"✓ Created '{data_dir}' folder")
else:
    print(f"✓ '{data_dir}' folder exists")

# Build paths safely (works on Windows, Mac, Linux)
csv_path = os.path.join(data_dir, "sales.csv")
excel_path = os.path.join(data_dir, "report.xlsx")

print("\n📄 Example paths:")
print(f"  CSV: {csv_path}")
print(f"  Excel: {excel_path}")
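
To make the relative/absolute distinction from the key concepts concrete, here is a small sketch using only the standard library. The path `data/sales.csv` is the same illustrative path as above; the file does not need to exist for these functions to work.

```python
import os

relative_path = os.path.join("data", "sales.csv")

# A relative path is resolved against the current working directory
absolute_path = os.path.abspath(relative_path)

print(f"Relative: {relative_path} (absolute? {os.path.isabs(relative_path)})")
print(f"Absolute: {absolute_path} (absolute? {os.path.isabs(absolute_path)})")

# Split a path into folder, file name, and extension
folder, filename = os.path.split(relative_path)
name, extension = os.path.splitext(filename)
print(f"Folder: {folder}, name: {name}, extension: {extension}")
```

Because the absolute path depends on where the notebook is running, relative paths keep a project portable between machines.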

## 2. Text Files (.txt)

The simplest file format - just plain text.

**When to use:**
- Log files
- Simple notes
- Configuration files
- When you need basic text storage

In [None]:
# WRITE a text file
file_path = os.path.join("data", "notes.txt")

with open(file_path, 'w', encoding='utf-8') as file:
    file.write("Data Analysis Notes\n")
    file.write("===================\n")
    file.write("Always check your data for null values!\n")

print(f"✓ Created: {file_path}")

In [None]:
# READ a text file - Line by line
with open(file_path, 'r', encoding='utf-8') as file:
    print("=== File Contents ===")
    for line_num, line in enumerate(file, 1):
        print(f"{line_num}: {line.strip()}")
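
Log files are usually extended rather than overwritten. The sketch below assumes a hypothetical `run.log` file in a temporary folder. It shows append mode (`'a'`) next to write mode (`'w'`), and reads the whole file at once with `read()` instead of line by line.

```python
import os
import tempfile

# Hypothetical log file in a temporary folder (illustration only)
log_path = os.path.join(tempfile.mkdtemp(), "run.log")

# 'w' creates the file (or overwrites it if it already exists)
with open(log_path, "w", encoding="utf-8") as file:
    file.write("Job started\n")

# 'a' appends to the end without touching existing content
with open(log_path, "a", encoding="utf-8") as file:
    file.write("Loaded 4 rows\n")
    file.write("Job finished\n")

# read() returns the whole file as one string; splitlines() drops the '\n'
with open(log_path, "r", encoding="utf-8") as file:
    log_lines = file.read().splitlines()

print(log_lines)
```

Reading everything with `read()` is fine for small files; for large ones, the line-by-line loop shown earlier uses far less memory.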

## 3. CSV Files - The Data Analyst's Best Friend

**CSV = Comma Separated Values**

**Why CSV is important:**
- ✓ Universal format (works everywhere)
- ✓ Lightweight (small file size)
- ✓ Easy to share and version control
- ✓ Readable by Excel, Python, R, SQL...

**Structure:**
```
Name,Age,Department,Salary
Alice,25,IT,50000
Bob,30,Sales,60000
```

### Method 1: CSV with Python's csv module (basic)

Good to know, but you'll rarely use this as a Data Analyst.

In [None]:
import csv

# CREATE a CSV with Python's csv module
csv_file = os.path.join("data", "employees.csv")

employees = [
    ["Name", "Age", "Department", "Salary"],
    ["Alice", 25, "IT", 50000],
    ["Bob", 30, "Sales", 60000],
    ["Charlie", 35, "IT", 55000],
    ["Diana", 28, "HR", 65000],
]

with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(employees)

print(f"✓ CSV created: {csv_file}")
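
For completeness, reading with the `csv` module can look like the sketch below. It uses an in-memory copy of two of the rows above rather than the file on disk. Note that every value comes back as a string, which is one reason Pandas is more convenient for analysis.

```python
import csv
import io

# In-memory stand-in for employees.csv (same columns as above)
csv_text = "Name,Age,Department,Salary\nAlice,25,IT,50000\nBob,30,Sales,60000\n"

# DictReader uses the header row as keys for each record
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

for row in rows:
    # Values are strings, so convert before doing arithmetic
    print(f"{row['Name']} earns {int(row['Salary']):,}")

total_salary = sum(int(row["Salary"]) for row in rows)
print(f"Total salary: {total_salary:,}")
```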

### Method 2: CSV with Pandas (recommended for Data Analysts)

**This is what you'll use most of the time!**

One line to read, instant analysis capabilities.

In [None]:
# READ CSV with Pandas - Simple and powerful
df = pd.read_csv(csv_file)

print("=== CSV Data as DataFrame ===")
print(df)
print(f"\n📊 Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"📋 Columns: {df.columns.tolist()}")

In [None]:
# Instant analysis!
print("=== Quick Analysis ===")
print(f"Average salary: ${df['Salary'].mean():,.2f}")
print(f"Age range: {df['Age'].min()} - {df['Age'].max()}")
print(f"\nEmployees by department:")
print(df['Department'].value_counts())
print(f"\nTop 2 salaries:")
print(df.nlargest(2, 'Salary')[['Name', 'Salary']])

### Common CSV Reading Options

As a Data Analyst, you'll need these options frequently:

In [None]:
# Option 1: Only read specific columns
df_partial = pd.read_csv(csv_file, usecols=['Name', 'Salary'])
print("Only Name and Salary:")
print(df_partial)

In [None]:
# Option 2: CSV with semicolon separator (common in Europe)
csv_euro = os.path.join("data", "employees_euro.csv")
df.to_csv(csv_euro, sep=';', index=False)

df_euro = pd.read_csv(csv_euro, sep=';')
print("CSV with ';' separator:")
print(df_euro.head())

In [None]:
# Option 3: Read only first N rows (useful for large files)
df_sample = pd.read_csv(csv_file, nrows=2)
print("Only first 2 rows:")
print(df_sample)
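
Two more options that tend to come up with larger files are `dtype` (to control column types) and `chunksize` (to process a file in pieces). This is a minimal sketch on an in-memory CSV holding the same illustrative employee table, not a recommendation for specific settings:

```python
import io
import pandas as pd

csv_text = (
    "Name,Age,Department,Salary\n"
    "Alice,25,IT,50000\n"
    "Bob,30,Sales,60000\n"
    "Charlie,35,IT,55000\n"
    "Diana,28,HR,65000\n"
)

# dtype: force Salary to float instead of the inferred integer type
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"Salary": "float64"})
print(df_typed.dtypes)

# chunksize: iterate over the file two rows at a time
total_salary = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_salary += int(chunk["Salary"].sum())
    n_chunks += 1

print(f"Processed {n_chunks} chunks, total salary: {total_salary:,}")
```

With `chunksize`, only one chunk is held in memory at a time, so the same loop works on files that are too large to load in one go.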

### Writing CSV Files

Save your analysis results back to CSV:

In [None]:
# Add a calculated column
df['Salary_Bonus'] = df['Salary'] * 1.10  # Salary including a 10% bonus

# Save to new CSV
output_file = os.path.join("data", "employees_with_bonus.csv")
df.to_csv(
    output_file,
    index=False,        # Don't save row numbers
    sep=';',            # Use semicolon
    float_format='%.2f' # 2 decimal places
)

print(f"✓ Saved to: {output_file}")
print(pd.read_csv(output_file, sep=';'))

## 4. Excel Files - For Business Reports

**When to use Excel:**
- Creating reports for non-technical people
- Need multiple sheets
- Want formatting and colors
- Company standard is Excel

**Note:** Install openpyxl first: `pip install openpyxl`

In [None]:
# Check if openpyxl is installed
try:
    import openpyxl
    print("✓ openpyxl is ready")
except ImportError:
    print("❌ Install with: pip install openpyxl")

### Reading Excel Files

In [None]:
# CREATE an Excel file first
excel_file = os.path.join("data", "sales_report.xlsx")

df_sales = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'Price': [999.99, 29.99, 79.99, 349.99, 149.99],
    'Units_Sold': [45, 200, 150, 30, 85],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories']
})

df_sales.to_excel(excel_file, index=False, sheet_name='Sales_Q1')
print(f"✓ Excel created: {excel_file}")

In [None]:
# READ Excel file
df_from_excel = pd.read_excel(excel_file)

print("=== Excel Data ===")
print(df_from_excel)
print(f"\n💰 Total Revenue: ${(df_from_excel['Price'] * df_from_excel['Units_Sold']).sum():,.2f}")

### Working with Multiple Sheets

Real-world Excel files often have multiple sheets:

In [None]:
# CREATE Excel with multiple sheets
excel_multi = os.path.join("data", "quarterly_report.xlsx")

with pd.ExcelWriter(excel_multi, engine='openpyxl') as writer:
    # Sheet 1: All data
    df_sales.to_excel(writer, sheet_name='All_Products', index=False)
    
    # Sheet 2: Electronics only
    electronics = df_sales[df_sales['Category'] == 'Electronics']
    electronics.to_excel(writer, sheet_name='Electronics', index=False)
    
    # Sheet 3: Summary
    summary = df_sales.groupby('Category').agg({
        'Price': 'mean',
        'Units_Sold': 'sum'
    }).round(2)
    summary.to_excel(writer, sheet_name='Summary')

print(f"✓ Multi-sheet Excel created: {excel_multi}")

In [None]:
# READ specific sheet
df_electronics = pd.read_excel(excel_multi, sheet_name='Electronics')
print("=== Electronics Sheet ===")
print(df_electronics)

# READ all sheets at once
all_sheets = pd.read_excel(excel_multi, sheet_name=None)
print(f"\n📄 Available sheets: {list(all_sheets.keys())}")

## 5. JSON Files - For Structured Data

**JSON = JavaScript Object Notation**

**When to use:**
- Data from APIs
- Hierarchical/nested data
- Configuration files
- Modern data interchange

In [None]:
# CREATE and SAVE JSON
data = {
    "analyst": "John Doe",
    "project": "Sales Analysis Q1 2024",
    "metrics": {
        "total_revenue": 125000,
        "avg_order_value": 450.50,
        "conversion_rate": 0.045
    },
    "top_products": ["Laptop", "Monitor", "Keyboard"]
}

json_file = os.path.join("data", "analysis_results.json")
with open(json_file, "w", encoding="utf-8") as f:
    json.dump(data, f, indent=4)

print(f"✓ JSON saved: {json_file}")

In [None]:
# READ JSON
with open(json_file, "r", encoding="utf-8") as f:
    loaded_data = json.load(f)

print("=== JSON Data ===")
print(f"Analyst: {loaded_data['analyst']}")
print(f"Revenue: ${loaded_data['metrics']['total_revenue']:,}")
print(f"Top Products: {', '.join(loaded_data['top_products'])}")
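
Data from an API usually arrives as a string rather than a file. In that case `json.loads()` (with an "s" for "string") takes the place of `json.load()`. The response text below is made up for illustration:

```python
import json

# Hypothetical API response body (illustration only)
response_text = '{"status": "ok", "metrics": {"total_revenue": 125000}}'

# loads() parses a string; load() parses an open file
payload = json.loads(response_text)
revenue = payload["metrics"]["total_revenue"]
print(f"Revenue: ${revenue:,}")

# .get() avoids a KeyError when an optional key is missing
currency = payload.get("currency", "unknown")
print(f"Currency: {currency}")

# Malformed JSON raises json.JSONDecodeError
try:
    json.loads("{not valid json}")
    parse_failed = False
except json.JSONDecodeError as error:
    parse_failed = True
    print(f"Could not parse: {error}")
```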

In [None]:
# JSON to DataFrame (when data is tabular)
employees_json = [
    {"name": "Alice", "age": 25, "dept": "IT", "salary": 50000},
    {"name": "Bob", "age": 30, "dept": "Sales", "salary": 60000},
    {"name": "Charlie", "age": 35, "dept": "IT", "salary": 55000}
]

df_from_json = pd.DataFrame(employees_json)
print("=== JSON → DataFrame ===")
print(df_from_json)
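
When records contain nested objects, `pd.DataFrame()` leaves each inner dictionary in a single cell. `pd.json_normalize()` flattens them into separate columns instead. A small sketch with made-up records:

```python
import pandas as pd

# Made-up nested records (illustration only)
records = [
    {"name": "Alice", "contact": {"city": "Madrid", "remote": True}},
    {"name": "Bob", "contact": {"city": "Barcelona", "remote": False}},
]

# Nested keys become dotted column names such as contact.city
df_flat = pd.json_normalize(records)
print(df_flat)
print(df_flat.columns.tolist())
```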

## 6. Working with ZIP Files

Often you'll receive compressed data files:

In [None]:
# CREATE a ZIP with multiple files
zip_path = os.path.join("data", "data_archive.zip")

with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as zipf:
    zipf.write(csv_file, arcname='employees.csv')
    zipf.write(json_file, arcname='results.json')

print(f"✓ ZIP created: {zip_path}")

# List contents
with zipfile.ZipFile(zip_path, 'r') as zipf:
    print("\n📦 Files in ZIP:")
    for name in zipf.namelist():
        print(f"  - {name}")

In [None]:
# READ CSV directly from ZIP (no need to extract!)
with zipfile.ZipFile(zip_path) as z:
    with z.open('employees.csv') as f:
        df_from_zip = pd.read_csv(f)

print("=== CSV from ZIP ===")
print(df_from_zip)
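
The same approach works for other members of an archive, and `extractall()` covers the case where you need the files on disk. The sketch below builds its own small archive in a temporary folder, so the file names are illustrative. One detail worth knowing: `zipfile.ZipFile` stores files uncompressed unless a compression method such as `zipfile.ZIP_DEFLATED` is passed.

```python
import json
import os
import tempfile
import zipfile

work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, "archive.zip")

# Build a small archive to work with (writestr adds a member from a string)
with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zipf:
    zipf.writestr("results.json", json.dumps({"total_revenue": 125000}))
    zipf.writestr("notes.txt", "Always check your data for null values!\n")

with zipfile.ZipFile(archive_path) as zipf:
    # Read a JSON member directly, without extracting it
    with zipf.open("results.json") as member:
        results = json.load(member)

    # Extract everything into a folder when you need the files on disk
    extract_dir = os.path.join(work_dir, "extracted")
    zipf.extractall(extract_dir)

extracted_files = sorted(os.listdir(extract_dir))
print(results)
print(extracted_files)
```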

## 7. Encoding Issues (Important!)

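**Common problem:** You open a file and see garbled characters like `EspaÃ±a` instead of `España`. This happens when a file is read with a different encoding than the one it was written in.

**Solution:** Specify the encoding explicitly when reading and writing. UTF-8 (`encoding='utf-8'`) is the safest default; files exported by older Windows tools may need `cp1252` or `latin-1` instead.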

In [None]:
# CREATE CSV with special characters
df_cities = pd.DataFrame({
    'City': ['Madrid', 'Barcelona', 'Málaga', 'A Coruña'],
    'Country': ['Spain', 'Spain', 'Spain', 'Spain'],
    'Population': [3223000, 1620000, 574000, 246000]
})

csv_encoded = os.path.join("data", "cities.csv")
df_cities.to_csv(csv_encoded, index=False, encoding='utf-8')

# READ with correct encoding
df_read = pd.read_csv(csv_encoded, encoding='utf-8')
print("=== Cities (UTF-8) ===")
print(df_read)
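
The garbling itself is easy to reproduce, which helps when diagnosing it. The sketch below encodes a string as UTF-8 and decodes it with the wrong encoding, then shows a hypothetical helper, `read_text_with_fallback`, that tries a list of encodings in order. The fallback is a heuristic only: `cp1252` accepts most byte sequences, so a successful decode does not prove the text is correct.

```python
import os
import tempfile

text = "España"

# Encode as UTF-8 bytes, then decode with the wrong encoding
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # EspaÃ±a

# Decoding with the encoding that was actually used restores the text
restored = utf8_bytes.decode("utf-8")
print(restored)  # España


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as file:
                return file.read(), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {encodings} could decode {path}")


# A file written in cp1252 is not valid UTF-8, so the helper falls back
legacy_path = os.path.join(tempfile.mkdtemp(), "legacy.txt")
with open(legacy_path, "w", encoding="cp1252") as file:
    file.write(text)

content, used_encoding = read_text_with_fallback(legacy_path)
print(f"{content} (read as {used_encoding})")
```

The same idea applies to `pd.read_csv()`: if it raises `UnicodeDecodeError` with `encoding='utf-8'`, retrying with another encoding is a common next step.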

## 8. Best Practices for Data Analysts

### File Organization
```
project/
├── data/
│   ├── raw/          # Original files (DON'T MODIFY!)
│   ├── processed/    # Cleaned data
│   └── results/      # Analysis outputs
├── notebooks/        # Your analysis notebooks
└── reports/          # Final reports
```

In [None]:
# CREATE recommended folder structure
folders = ['data/raw', 'data/processed', 'data/results', 'notebooks', 'reports']

for folder in folders:
    os.makedirs(folder, exist_ok=True)

print("✓ Project structure created:")
for folder in folders:
    print(f"  📁 {folder}/")

### Always Verify Your Data After Loading

In [None]:
# ALWAYS do this after reading a file
df = pd.read_csv(csv_file)

print("=== Data Verification ===")
print(f"Shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())
print(f"\nData types:")
print(df.dtypes)
print(f"\nMissing values:")
print(df.isnull().sum())
print(f"\nBasic stats:")
print(df.describe())
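
Beyond eyeballing the output, a few checks can be automated. The sketch below assumes a hypothetical helper, `check_columns`, and a small made-up DataFrame; which columns to expect depends on your own data.

```python
import pandas as pd


def check_columns(df, expected_columns):
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in expected_columns if col not in df.columns]


# Made-up data with one missing value and no Age column
df_check = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Salary": [50000, 60000, 55000],
})

missing_columns = check_columns(df_check, ["Name", "Age", "Salary"])
print(f"Missing columns: {missing_columns}")

# Count nulls per column and count fully duplicated rows
null_counts = df_check.isnull().sum()
duplicate_rows = int(df_check.duplicated().sum())
print(null_counts)
print(f"Duplicate rows: {duplicate_rows}")
```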

### Error Handling (Production Code)

In [None]:
def load_data_safely(file_path):
    """
    Load CSV with proper error handling
    """
    try:
        if not os.path.exists(file_path):
            print(f"❌ File not found: {file_path}")
            return None
        
        df = pd.read_csv(file_path)
        print(f"✓ Loaded {len(df)} rows from {file_path}")
        return df
        
    except pd.errors.EmptyDataError:
        print(f"❌ File is empty: {file_path}")
        return None
    except Exception as e:
        print(f"❌ Error: {e}")
        return None

# Test it
df_safe = load_data_safely(csv_file)
if df_safe is not None:
    print(df_safe.head())

## 9. Quick Reference Cheat Sheet

```python
# READ FILES
df = pd.read_csv('file.csv')                    # CSV
df = pd.read_csv('file.csv', sep=';')          # CSV with semicolon
df = pd.read_excel('file.xlsx')                # Excel
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')  # Specific sheet
df = pd.read_json('file.json')                 # JSON

# WRITE FILES
df.to_csv('output.csv', index=False)           # CSV
df.to_excel('output.xlsx', index=False)        # Excel
df.to_json('output.json')                      # JSON

# COMMON OPTIONS
encoding='utf-8'          # Character encoding
sep=';'                   # Separator
usecols=['A', 'B']       # Only certain columns
nrows=100                 # Only first N rows
sheet_name='Sales'        # Excel sheet name

# FILE OPERATIONS
os.getcwd()               # Current directory
os.listdir('folder')      # List files
os.path.exists('file')    # Check if exists
os.path.join('a', 'b')    # Join paths safely
```
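
A frequent use of the file operations above is finding every CSV in a folder. The sketch below creates a temporary folder with made-up file names and compares filtering `os.listdir()` by hand with the standard library's `glob` module:

```python
import glob
import os
import tempfile

# Temporary folder with a few empty files (illustration only)
folder = tempfile.mkdtemp()
for filename in ["sales.csv", "employees.csv", "report.xlsx"]:
    with open(os.path.join(folder, filename), "w", encoding="utf-8"):
        pass

# os.listdir() returns every entry; filter by extension yourself
csv_from_listdir = sorted(
    name for name in os.listdir(folder) if name.lower().endswith(".csv")
)

# glob.glob() does the pattern matching and returns full paths
csv_from_glob = sorted(glob.glob(os.path.join(folder, "*.csv")))

print(csv_from_listdir)
print([os.path.basename(path) for path in csv_from_glob])
```

The paths returned by `glob.glob()` can be passed straight to `pd.read_csv()` in a loop.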

## 10. Summary

**What you learned:**
- ✓ File paths and organization
- ✓ Reading/writing TXT, CSV, Excel, JSON
- ✓ Why Pandas is essential for data analysts
- ✓ Encoding issues and solutions
- ✓ Best practices for real projects

**Key takeaways:**
1. **Always use Pandas** for data files (CSV, Excel)
2. **Always specify the encoding** explicitly (usually `encoding='utf-8'`)
3. **Verify data** after loading (shape, head, dtypes, nulls)
4. **Organize your files** (raw/processed/results)
5. **Use relative paths** with `os.path.join()`

**Next steps:**
- Practice with real datasets (Kaggle, data.gov)
- Learn Pandas DataFrame operations in depth
- Master data cleaning and transformation
- Start doing exploratory data analysis (EDA)

**Resources:**
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Pandas I/O Tools](https://pandas.pydata.org/docs/user_guide/io.html)
- [Real Python - File I/O](https://realpython.com/read-write-files-python/)