# Loading Data from Various Sources (CSV, Excel, JSON)

This notebook demonstrates how to load data from CSV, Excel, and JSON files, and also from web APIs and SQL databases. Data acquisition is the first step in any data analysis project, so these are fundamental skills for a data scientist.

## Import Required Libraries

In [None]:
# Import libraries for data manipulation and analysis
import pandas as pd
import numpy as np

# Libraries for JSON handling
import json

# Libraries for web requests
import requests

# Library for database connections
import sqlite3
from sqlalchemy import create_engine

# Other useful imports
import os
import io
import time
from pathlib import Path

## Loading CSV Data

CSV (Comma-Separated Values) files are one of the most common data formats. Pandas reads them with `pd.read_csv`, which has options for custom delimiters, missing values, skipped rows, and column types.

In [None]:
# Basic CSV loading
# For this demonstration, we'll create a simple CSV in memory
csv_data = """id,name,age,city
1,John Smith,34,New York
2,Jane Doe,28,San Francisco
3,Bob Johnson,45,Chicago
4,Alice Brown,32,Boston"""

# Write to a file
with open("sample_data.csv", "w") as f:
    f.write(csv_data)

# Basic read_csv usage
df_csv = pd.read_csv("sample_data.csv")
print("Basic CSV loading:")
print(df_csv)
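
When the CSV text is already in memory, writing a temporary file is optional. The following is a minimal sketch, assuming the same sample rows as the cell above: `pd.read_csv` accepts file-like objects as well as paths, so `io.StringIO` can wrap the string directly. The variable names `csv_text` and `df_in_memory` are illustrative only.

```python
import io

import pandas as pd

# Same sample rows as the cell above, kept in memory
csv_text = """id,name,age,city
1,John Smith,34,New York
2,Jane Doe,28,San Francisco
3,Bob Johnson,45,Chicago
4,Alice Brown,32,Boston"""

# read_csv accepts file-like objects as well as file paths
df_in_memory = pd.read_csv(io.StringIO(csv_text))
print(df_in_memory.dtypes)
```

This avoids leaving sample files on disk, which can be convenient in tests and short demonstrations.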

In [None]:
# Advanced CSV loading options

# Custom delimiters
csv_tab_data = """id\tname\tage\tcity
1\tJohn Smith\t34\tNew York
2\tJane Doe\t28\tSan Francisco
3\tBob Johnson\t45\tChicago
4\tAlice Brown\t32\tBoston"""

with open("sample_tab_data.csv", "w") as f:
    f.write(csv_tab_data)

df_tab = pd.read_csv("sample_tab_data.csv", delimiter="\t")
print("CSV with tab delimiter:")
print(df_tab)

# Handling missing values
csv_missing_data = """id,name,age,city
1,John Smith,,New York
2,Jane Doe,28,
3,,45,Chicago
4,Alice Brown,32,Boston"""

with open("missing_data.csv", "w") as f:
    f.write(csv_missing_data)

df_missing = pd.read_csv("missing_data.csv", na_values=["", "NA", "N/A"])
print("\nCSV with missing values:")
print(df_missing)

# Skip rows and specify column types
csv_messy_data = """This is a header line we want to skip
This is another line we want to skip
id,name,age,city
1,John Smith,34,New York
2,Jane Doe,28,San Francisco
3,Bob Johnson,45,Chicago
4,Alice Brown,32,Boston"""

with open("messy_data.csv", "w") as f:
    f.write(csv_messy_data)

df_skipped = pd.read_csv("messy_data.csv", 
                        skiprows=2,
                        dtype={"id": int, "age": float})
print("\nCSV with skipped rows and custom types:")
print(df_skipped)
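
For files too large to load at once, `read_csv` can return an iterator of smaller DataFrames through its `chunksize` parameter. The sketch below uses a small in-memory CSV and a chunk size of 2 purely for illustration; both are assumptions, and real workloads would typically use thousands of rows per chunk.

```python
import io

import pandas as pd

csv_text = """id,name,age,city
1,John Smith,34,New York
2,Jane Doe,28,San Francisco
3,Bob Johnson,45,Chicago
4,Alice Brown,32,Boston"""

total_age = 0
row_count = 0

# chunksize=2 is an illustrative value; each chunk is a regular DataFrame
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_age += chunk["age"].sum()
    row_count += len(chunk)

mean_age = total_age / row_count
print(f"Mean age across {row_count} rows: {mean_age:.2f}")
```

Accumulating running totals per chunk keeps memory use bounded by the chunk size rather than the file size.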

## Loading Excel Data

Excel files are ubiquitous in business environments. Pandas can read both .xls and .xlsx files directly, provided a suitable engine is installed (`openpyxl` for .xlsx, `xlrd` for .xls).

In [None]:
# Create a simple Excel file for demonstration
# First, let's create multiple DataFrames to represent different sheets

# Sheet 1 - Employee data
employee_data = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Brown'],
    'Department': ['HR', 'Engineering', 'Marketing', 'Engineering'],
    'Salary': [65000, 85000, 72000, 88000]
})

# Sheet 2 - Department data
department_data = pd.DataFrame({
    'Department': ['HR', 'Engineering', 'Marketing', 'Sales'],
    'Manager': ['Michael Scott', 'Jim Halpert', 'Dwight Schrute', 'Pam Beesly'],
    'Budget': [250000, 750000, 400000, 650000]
})

# Create an Excel file with multiple sheets
with pd.ExcelWriter('company_data.xlsx') as writer:
    employee_data.to_excel(writer, sheet_name='Employees', index=False)
    department_data.to_excel(writer, sheet_name='Departments', index=False)

print("Excel file created with multiple sheets")

In [None]:
# Basic reading from Excel
df_excel = pd.read_excel('company_data.xlsx')
print("Default Excel reading (first sheet):")
print(df_excel)

# Reading a specific sheet
df_excel_dept = pd.read_excel('company_data.xlsx', sheet_name='Departments')
print("\nReading specific sheet 'Departments':")
print(df_excel_dept)

# Reading multiple sheets
all_sheets = pd.read_excel('company_data.xlsx', sheet_name=None)
print("\nAll sheets names in the Excel file:")
for sheet_name in all_sheets.keys():
    print(f" - {sheet_name}")
    
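# Working with specific rows and columns
# skiprows=[1] drops the first data row but keeps the header (row 0);
# skiprows=1 would skip the header itself and promote a data row to column names
df_excel_subset = pd.read_excel('company_data.xlsx',
                               skiprows=[1],
                               usecols="A,C,D",
                               nrows=3)
print("\nCustomized Excel import (skipping the first data row, only cols A,C,D, first 3 rows):")
print(df_excel_subset)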
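
With `sheet_name=None`, `read_excel` returns a dictionary that maps sheet names to DataFrames, which makes it straightforward to combine sheets afterward. The sketch below uses small stand-in frames instead of the file so that it runs without an Excel engine; the `sheets` dictionary is an assumption that mirrors the shape of that return value.

```python
import pandas as pd

# Stand-in for pd.read_excel('company_data.xlsx', sheet_name=None)
sheets = {
    "Employees": pd.DataFrame({
        "Name": ["John Smith", "Jane Doe", "Bob Johnson", "Alice Brown"],
        "Department": ["HR", "Engineering", "Marketing", "Engineering"],
    }),
    "Departments": pd.DataFrame({
        "Department": ["HR", "Engineering", "Marketing", "Sales"],
        "Manager": ["Michael Scott", "Jim Halpert", "Dwight Schrute", "Pam Beesly"],
    }),
}

# A left join keeps every employee and attaches the department's manager
employees_with_manager = sheets["Employees"].merge(
    sheets["Departments"], on="Department", how="left"
)
print(employees_with_manager)
```

A left join is used here so that departments without employees (Sales in this sample) do not add empty rows.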

## Loading JSON Data

JSON (JavaScript Object Notation) is a common format for web APIs and configuration files. Pandas has built-in support for JSON data.

In [None]:
# Create a sample JSON file
json_data = {
    "employees": [
        {"id": 1, "name": "John Smith", "department": "HR", "projects": ["Recruitment", "Training"]},
        {"id": 2, "name": "Jane Doe", "department": "Engineering", "projects": ["Database", "API"]},
        {"id": 3, "name": "Bob Johnson", "department": "Marketing", "projects": ["Campaign", "Social Media"]},
        {"id": 4, "name": "Alice Brown", "department": "Engineering", "projects": ["Frontend", "Mobile App"]}
    ],
    "company_info": {
        "name": "Tech Corp",
        "founded": 2005,
        "locations": ["New York", "San Francisco", "London"]
    }
}

# Write to a JSON file
with open("company_data.json", "w") as f:
    json.dump(json_data, f)

print("JSON file created")

In [None]:
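# Reading JSON with pandas
# read_json expects a table-like layout; this file mixes a list and a dict at the
# top level, so the call may raise a ValueError instead of returning a DataFrame
try:
    df_json = pd.read_json("company_data.json")
    print("Basic JSON reading:")
    print(df_json)
except ValueError as e:
    print(f"read_json could not parse the nested file: {e}")

# For nested structures like this one, load the file with the json module instead
with open("company_data.json", "r") as f:
    json_data = json.load(f)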

# Convert the employees list to a DataFrame
df_employees = pd.DataFrame(json_data['employees'])
print("\nEmployees data from nested JSON:")
print(df_employees)

# json_normalize flattens nested dictionaries into dotted column names;
# list values such as 'projects' stay as lists in a single column
df_with_projects = pd.json_normalize(json_data['employees'])
print("\nNormalized JSON data with project arrays:")
print(df_with_projects)

# Exploding arrays into separate rows
df_exploded = df_with_projects.explode('projects')
print("\nExploded projects into separate rows:")
print(df_exploded)
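
`pd.json_normalize` can also pull the record list and selected top-level fields out of a nested document in one call, using its `record_path` and `meta` parameters. The sketch below assumes a trimmed-down version of the company document above; the names `company` and `df_flat` are illustrative.

```python
import pandas as pd

# Trimmed-down version of the nested document used above
company = {
    "employees": [
        {"id": 1, "name": "John Smith", "department": "HR"},
        {"id": 2, "name": "Jane Doe", "department": "Engineering"},
    ],
    "company_info": {"name": "Tech Corp", "founded": 2005},
}

# record_path selects the list to turn into rows;
# meta copies fields from the enclosing document onto every row
df_flat = pd.json_normalize(
    company,
    record_path="employees",
    meta=[["company_info", "name"], ["company_info", "founded"]],
)
print(df_flat)
```

Each meta path is joined with a dot to form the column name, so the company name lands in `company_info.name` and does not collide with the employee `name` column.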

## Working with Web APIs

Many data sources are available via APIs. We'll use the requests library to fetch data from public APIs.

In [None]:
# Example API request - using a public API for demonstration
# Let's get book data from the Open Library API

try:
    # Get information about a book (the timeout keeps the cell from hanging indefinitely)
    response = requests.get('https://openlibrary.org/api/books?bibkeys=ISBN:9780140328721&format=json', timeout=10)
    
    if response.status_code == 200:
        book_data = response.json()
        print("API Response:")
        print(book_data)
        
        # Convert to DataFrame (this is simple data, for more complex nested data 
        # you might need to flatten it first)
        df_book = pd.DataFrame(list(book_data.values()))
        print("\nBook data as DataFrame:")
        print(df_book)
    else:
        print(f"Error fetching data: {response.status_code}")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Getting a list of items from a JSON API
try:
    # Request a list of users from JSONPlaceholder API (with a timeout so the call cannot hang)
    response = requests.get('https://jsonplaceholder.typicode.com/users', timeout=10)
    
    if response.status_code == 200:
        users_data = response.json()
        
        # Convert the list of users to a DataFrame
        df_users = pd.json_normalize(users_data)
        print("Users data from API:")
        print(df_users.head())
        
        # We can select specific columns
        df_users_subset = df_users[['id', 'name', 'email', 'company.name']]
        print("\nSelected user information:")
        print(df_users_subset.head())
    else:
        print(f"Error fetching data: {response.status_code}")
except Exception as e:
    print(f"Error: {e}")
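
The request pattern used in both cells (send a GET, check the status, decode the JSON) can be wrapped in a small helper. The sketch below is one possible shape, not part of the `requests` library: `fetch_json` and its `session` parameter are hypothetical names, and the 10-second default timeout is an arbitrary assumption.

```python
import requests


def fetch_json(url, *, session=None, timeout=10.0):
    """Return the decoded JSON body of a GET request, raising on HTTP errors.

    Hypothetical helper for illustration; `session` may be any object with a
    requests-compatible `get` method, which makes the function easy to test.
    """
    http = session if session is not None else requests.Session()
    response = http.get(url, timeout=timeout)
    # raise_for_status turns 4xx/5xx responses into requests.HTTPError
    response.raise_for_status()
    return response.json()
```

A call such as `fetch_json('https://jsonplaceholder.typicode.com/users')` would return the same list that the cell above passes to `pd.json_normalize`, with HTTP failures surfacing as exceptions rather than status-code checks.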

## Loading Data from Databases

Connecting to databases is a core skill for data scientists. Pandas can directly interface with SQL databases.

In [None]:
# Create an in-memory SQLite database for demonstration
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Create tables
cursor.execute('''
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    department TEXT,
    salary REAL
)
''')

cursor.execute('''
CREATE TABLE departments (
    id INTEGER PRIMARY KEY,
    name TEXT,
    manager TEXT,
    budget REAL
)
''')

# Insert data
employees_data = [
    (1, 'John Smith', 'HR', 65000),
    (2, 'Jane Doe', 'Engineering', 85000),
    (3, 'Bob Johnson', 'Marketing', 72000),
    (4, 'Alice Brown', 'Engineering', 88000)
]

departments_data = [
    (1, 'HR', 'Michael Scott', 250000),
    (2, 'Engineering', 'Jim Halpert', 750000),
    (3, 'Marketing', 'Dwight Schrute', 400000),
    (4, 'Sales', 'Pam Beesly', 650000)
]

cursor.executemany('INSERT INTO employees VALUES (?, ?, ?, ?)', employees_data)
cursor.executemany('INSERT INTO departments VALUES (?, ?, ?, ?)', departments_data)
conn.commit()

print("SQLite database created with sample data")

In [None]:
# Reading data from SQLite with pandas
df_employees_sql = pd.read_sql_query("SELECT * FROM employees", conn)
print("Employees from SQL:")
print(df_employees_sql)

# Join tables with SQL
df_joined = pd.read_sql_query('''
    SELECT e.name as employee_name, 
           e.salary, 
           e.department,
           d.manager
    FROM employees e
    JOIN departments d ON e.department = d.name
''', conn)

print("\nJoined data from SQL:")
print(df_joined)

# Using SQLAlchemy for database connections
# Note: this engine opens its own in-memory database, separate from `conn` above
engine = create_engine('sqlite:///:memory:')

# We can also write DataFrames back to the database
df_new = pd.DataFrame({
    'id': [5, 6],
    'name': ['Charlie Davis', 'Diana Evans'],
    'department': ['Sales', 'IT'],
    'salary': [67000, 92000]
})

df_new.to_sql('new_employees', engine, index=False, if_exists='replace')

# And read it back
df_read_back = pd.read_sql('new_employees', engine)
print("\nData written to and read from database using SQLAlchemy:")
print(df_read_back)
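
When a query depends on a runtime value, passing it through the `params` argument of `pd.read_sql_query` avoids building SQL strings by hand. The sketch below assumes a fresh in-memory SQLite database with the same employee rows; `demo_conn` and `df_dept` are illustrative names.

```python
import sqlite3

import pandas as pd

# Separate in-memory database so the sketch is self-contained
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)"
)
demo_conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John Smith", "HR", 65000),
        (2, "Jane Doe", "Engineering", 85000),
        (3, "Bob Johnson", "Marketing", 72000),
        (4, "Alice Brown", "Engineering", 88000),
    ],
)
demo_conn.commit()

# The ? placeholder is filled in by the driver, not by string formatting
department = "Engineering"
df_dept = pd.read_sql_query(
    "SELECT name, salary FROM employees WHERE department = ? ORDER BY salary",
    demo_conn,
    params=(department,),
)
print(df_dept)
demo_conn.close()
```

The `?` placeholder style is specific to SQLite's driver; other database drivers may use a different placeholder syntax.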

## Handling Different File Encodings

Working with international data often means dealing with different character encodings. Pandas can handle various encodings.

In [None]:
# Create a CSV with non-ASCII characters
csv_international = """id,name,country,city
1,José García,Spain,Madrid
2,François Dupont,France,Paris
3,Jürgen Müller,Germany,Berlin
4,黄小明,China,Beijing
5,Екатерина Иванова,Russia,Moscow"""

# Write with UTF-8 encoding
with open("international_data.csv", "w", encoding="utf-8") as f:
    f.write(csv_international)

# Read with UTF-8 encoding
df_utf8 = pd.read_csv("international_data.csv", encoding="utf-8")
print("CSV with international characters read with UTF-8 encoding:")
print(df_utf8)

# Let's simulate a file with a different encoding
with open("international_data_latin1.csv", "w", encoding="latin1") as f:
    # Note: Latin-1 cannot encode the Chinese and Cyrillic names, so this write raises
    try:
        f.write(csv_international)
    except UnicodeEncodeError:
        # This is expected, so we write a simplified version that fits in Latin-1
        simplified_csv = """id,name,country,city
1,José García,Spain,Madrid
2,François Dupont,France,Paris
3,Jurgen Muller,Germany,Berlin
4,Unknown,China,Beijing
5,Ekaterina,Russia,Moscow"""
        f.write(simplified_csv)

# Reading with correct encoding 
df_latin1 = pd.read_csv("international_data_latin1.csv", encoding="latin1")
print("\nCSV with Latin-1 encoding:")
print(df_latin1)

# Reading a file with the wrong encoding can raise an error or silently produce garbled text
try:
    # ASCII cannot decode the non-ASCII bytes in this file, so this raises UnicodeDecodeError
    df_wrong = pd.read_csv("international_data.csv", encoding="ascii")
    print("\nReading UTF-8 file with ASCII encoding (may show garbled text):")
    print(df_wrong)
except UnicodeDecodeError as e:
    print(f"\nError reading with wrong encoding: {e}")
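
The garbled-text case is worth seeing directly, because it produces no error at all. The sketch below, which assumes a one-row in-memory sample, decodes UTF-8 bytes as Latin-1: every byte maps to some Latin-1 character, so the read succeeds but the accented letter comes out as two wrong characters.

```python
import io

import pandas as pd

# One-row sample encoded as UTF-8 (illustrative data)
utf8_bytes = "id,name\n1,José\n".encode("utf-8")

# Latin-1 maps every byte to a character, so decoding never fails
df_garbled = pd.read_csv(io.BytesIO(utf8_bytes), encoding="latin1")
df_correct = pd.read_csv(io.BytesIO(utf8_bytes), encoding="utf-8")

print("Decoded as Latin-1:", df_garbled["name"].iloc[0])
print("Decoded as UTF-8:  ", df_correct["name"].iloc[0])
```

Because this failure mode is silent, a quick look at a few non-ASCII values after loading is often the only way to notice it.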

## Comparing Data Loading Methods

Different data formats have different advantages and performance characteristics. Let's compare them.

In [None]:
# Create a larger dataset for performance comparison
rows = 100000
ids = list(range(1, rows + 1))
names = [f"Person_{i}" for i in range(1, rows + 1)]
values = np.random.randn(rows)
categories = np.random.choice(['A', 'B', 'C', 'D'], rows)

big_df = pd.DataFrame({
    'id': ids,
    'name': names,
    'value': values,
    'category': categories
})

print(f"Created dataset with {rows} rows for performance testing")

# Save in different formats for comparison
formats = {
    'csv': {'func': big_df.to_csv, 'args': {'path_or_buf': 'big_data.csv', 'index': False}},
    'excel': {'func': big_df.to_excel, 'args': {'excel_writer': 'big_data.xlsx', 'index': False}},
    'json': {'func': big_df.to_json, 'args': {'path_or_buf': 'big_data.json'}},
    'pickle': {'func': big_df.to_pickle, 'args': {'path': 'big_data.pkl'}},
    'parquet': {'func': big_df.to_parquet, 'args': {'path': 'big_data.parquet', 'index': False}}
}

# Save in different formats
for fmt, config in formats.items():
    try:
        config['func'](**config['args'])
        print(f"Saved in {fmt} format")
    except Exception as e:
        print(f"Error saving in {fmt} format: {e}")

In [None]:
# Compare loading speeds
load_functions = {
    'csv': {'func': pd.read_csv, 'args': {'filepath_or_buffer': 'big_data.csv'}},
    'excel': {'func': pd.read_excel, 'args': {'io': 'big_data.xlsx'}},
    'json': {'func': pd.read_json, 'args': {'path_or_buf': 'big_data.json'}},
    'pickle': {'func': pd.read_pickle, 'args': {'filepath_or_buffer': 'big_data.pkl'}},
    'parquet': {'func': pd.read_parquet, 'args': {'path': 'big_data.parquet'}}
}

results = {}

for fmt, config in load_functions.items():
    try:
        # perf_counter is a monotonic, high-resolution clock suited to timing code
        start_time = time.perf_counter()
        df = config['func'](**config['args'])
        duration = time.perf_counter() - start_time
        results[fmt] = {'duration': duration, 'rows': len(df), 'columns': len(df.columns)}
        print(f"Loaded {fmt} in {duration:.4f} seconds - {len(df)} rows")
    except Exception as e:
        print(f"Error loading {fmt}: {e}")

# Create a comparison DataFrame
comparison_df = pd.DataFrame.from_dict(results, orient='index')
comparison_df = comparison_df.sort_values(by='duration')

print("\nPerformance comparison:")
print(comparison_df)

# Show file sizes
# Map each format key to the extension actually used when saving
extensions = {'csv': 'csv', 'excel': 'xlsx', 'json': 'json', 'pickle': 'pkl', 'parquet': 'parquet'}
file_sizes = {}
for fmt, ext in extensions.items():
    file_path = f"big_data.{ext}"
    # Skip formats whose file was never written (for example, a missing optional engine)
    if os.path.exists(file_path):
        file_sizes[fmt] = os.path.getsize(file_path) / (1024 * 1024)

file_sizes_df = pd.DataFrame.from_dict(file_sizes, orient='index', columns=['Size (MB)'])
file_sizes_df = file_sizes_df.sort_values(by='Size (MB)')

print("\nFile size comparison:")
print(file_sizes_df)
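
A single timed run can be distorted by disk caching and other background activity. One common remedy is to repeat each load several times and keep the fastest run. The sketch below shows that idea with a hypothetical `best_of` helper and a small in-memory CSV; the repeat count of 3 is an arbitrary assumption.

```python
import io
import time

import pandas as pd


def best_of(load, repeats=3):
    """Return the fastest wall-clock time, in seconds, over several calls to load()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        load()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Small illustrative dataset; a real comparison would use the files saved above
csv_text = "id,value\n" + "\n".join(f"{i},{i * 0.5}" for i in range(1000))
fastest = best_of(lambda: pd.read_csv(io.StringIO(csv_text)))
print(f"Best of 3: {fastest:.6f} s")
```

Taking the minimum rather than the mean reduces the influence of one-off slowdowns that have nothing to do with the file format.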

## Summary

In this notebook, we've explored various methods to load data from different sources:

1. **CSV files** - The most common format, easy to read and write, but lacks type information
2. **Excel files** - Great for business data, supports multiple sheets, but slower to process
3. **JSON data** - Standard for web APIs, handles nested structures, but can be complex to normalize
4. **Databases** - Scalable, support complex queries, but require connection setup
5. **Web APIs** - Access to external data, but may require authentication and handling rate limits

Key takeaways:
- Choose the right format based on your data size, structure, and processing needs
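- Binary formats such as Parquet and Pickle usually load much faster than CSV, JSON, or Excel on large datasets; run the comparison above on your own data to confirm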
- Consider file size if storage is a concern
- Pay attention to encodings when working with international data
- Always inspect your data after loading to verify it was loaded correctly
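
The last takeaway can be made concrete with three quick checks after any load: the shape, the inferred column types, and the count of missing values per column. The sketch below assumes the same rows as the missing-values CSV used earlier; `df_check` is an illustrative name.

```python
import io

import pandas as pd

# Same rows as the missing-values example earlier in the notebook
csv_text = """id,name,age,city
1,John Smith,,New York
2,Jane Doe,28,
3,,45,Chicago
4,Alice Brown,32,Boston"""

df_check = pd.read_csv(io.StringIO(csv_text))

print(df_check.shape)         # number of rows and columns
print(df_check.dtypes)        # inferred column types
print(df_check.isna().sum())  # missing values per column
```

Here the `age` column is inferred as a float rather than an integer because it contains a missing value, which is exactly the kind of surprise these checks are meant to surface.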