# Loading Data from Various Sources (CSV, Excel, JSON)

This notebook demonstrates how to load data from different file formats including CSV, Excel, and JSON using pandas and other Python libraries. We'll explore various parameters and options for each format to handle different scenarios.

## Import Required Libraries

In [1]:
# Import core libraries for data loading and processing
import pandas as pd
import numpy as np
import json
import requests
from io import StringIO
import os
import time
import matplotlib.pyplot as plt

# Check pandas version
print(f"Pandas version: {pd.__version__}")

Pandas version: 2.2.3


## Loading Data from CSV Files

CSV (Comma-Separated Values) is one of the most common file formats for data storage. Pandas provides the powerful `read_csv()` function to load CSV files with many customization options.

Let's explore various ways to load and handle CSV files.

In [2]:
# Create sample CSV data for demonstration
sample_csv = """
id,name,age,salary,department
1,John Smith,34,50000,IT
2,Jane Doe,28,65000,Marketing
3,Bob Johnson,45,75000,Finance
4,Alice Brown,31,55000,HR
5,Charlie Wilson,29,60000,IT
"""

# Create a file-like object from the string
csv_data = StringIO(sample_csv.strip())

# Basic CSV reading
df_csv = pd.read_csv(csv_data)
print("Basic CSV loading:")
print(df_csv.head())

Basic CSV loading:
   id            name  age  salary department
0   1      John Smith   34   50000         IT
1   2        Jane Doe   28   65000  Marketing
2   3     Bob Johnson   45   75000    Finance
3   4     Alice Brown   31   55000         HR
4   5  Charlie Wilson   29   60000         IT


### CSV Loading Options

Let's explore various parameters available in `pd.read_csv()`:
- Custom delimiters
- Skipping rows
- Specifying data types
- Handling missing values

In [3]:
# Create a sample CSV with different delimiter
sample_csv_tab = """
id\tname\tage\tsalary\tdepartment
1\tJohn Smith\t34\t50000\tIT
2\tJane Doe\t28\t65000\tMarketing
3\tBob Johnson\t45\t75000\tFinance
4\tAlice Brown\t31\t55000\tHR
5\tCharlie Wilson\t29\t60000\tIT
"""

# Create a file-like object
csv_tab_data = StringIO(sample_csv_tab.strip())

# Reading with custom delimiter
df_tab = pd.read_csv(csv_tab_data, delimiter='\t')
print("CSV with tab delimiter:")
print(df_tab.head())

CSV with tab delimiter:
   id            name  age  salary department
0   1      John Smith   34   50000         IT
1   2        Jane Doe   28   65000  Marketing
2   3     Bob Johnson   45   75000    Finance
3   4     Alice Brown   31   55000         HR
4   5  Charlie Wilson   29   60000         IT


In [4]:
# Sample CSV with a comment line before the header and some missing values
sample_csv_complex = """
This is a comment line
id,name,age,salary,department
1,John Smith,34,,IT
2,Jane Doe,,65000,Marketing
3,Bob Johnson,45,75000,
4,Alice Brown,31,55000,HR
5,,29,60000,IT
"""

# Create a file-like object
csv_complex_data = StringIO(sample_csv_complex.strip())

# Reading with skiprows and handling missing values
df_complex = pd.read_csv(
    csv_complex_data,
    skiprows=1,  # Skip the first row
    na_values=["", "NA", "N/A"],  # Define NA values
    dtype={"id": int, "name": str, "age": float, "salary": float, "department": str}  # Define data types
)

print("CSV with skiprows and handling missing values:")
print(df_complex.head())
print("\nMissing values count:")
print(df_complex.isna().sum())

CSV with skiprows and handling missing values:
   id         name   age   salary department
0   1   John Smith  34.0      NaN         IT
1   2     Jane Doe   NaN  65000.0  Marketing
2   3  Bob Johnson  45.0  75000.0        NaN
3   4  Alice Brown  31.0  55000.0         HR
4   5          NaN  29.0  60000.0         IT

Missing values count:
id            0
name          1
age           1
salary        1
department    1
dtype: int64
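
Beyond the options shown above, `read_csv()` can also restrict which columns are loaded (`usecols`) and stream a file in pieces (`chunksize`). The following is a minimal sketch of both; the inline `csv_text` sample is invented for illustration and is not one of the datasets used in this notebook.

```python
from io import StringIO

import pandas as pd

# Hypothetical inline data, used only to keep the example self-contained
csv_text = (
    "id,name,age,salary\n"
    "1,Ann,34,50000\n"
    "2,Ben,28,65000\n"
    "3,Cy,45,75000\n"
    "4,Di,31,55000\n"
    "5,Ed,29,60000\n"
)

# Load only the columns that are needed
df_subset = pd.read_csv(StringIO(csv_text), usecols=["id", "salary"])

# Iterate over the file two rows at a time instead of loading it all at once
chunk_sizes = []
total_salary = 0
for chunk in pd.read_csv(StringIO(csv_text), chunksize=2):
    chunk_sizes.append(len(chunk))
    total_salary += chunk["salary"].sum()

print(df_subset.columns.tolist())
print(chunk_sizes, total_salary)
```

Chunked reading is mostly useful when a file is too large to fit comfortably in memory, since each chunk is an ordinary DataFrame that can be aggregated and then discarded.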


In [5]:
# Reading CSV from a URL (using the Iris dataset)
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

try:
    df_iris = pd.read_csv(url)
    print("Iris dataset loaded from URL:")
    print(df_iris.head())
    print(f"\nDataset shape: {df_iris.shape}")
    print(f"Dataset columns: {df_iris.columns.tolist()}")
except Exception as e:
    print(f"Error loading from URL: {e}")

Iris dataset loaded from URL:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Dataset shape: (150, 5)
Dataset columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']


## Loading Data from Excel Files

Excel files are widely used in business settings. Pandas provides the `read_excel()` function to load data from Excel files, with options to select sheets, columns, and row ranges. Reading and writing `.xlsx` files relies on an additional engine package such as `openpyxl`.

In [6]:
# Build a small DataFrame, write it to an Excel file, and read it back
# to show the round trip with to_excel() and read_excel()

# Create sample data for our simulated Excel file
data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Computer Accessories', 'Computer Accessories'],
    'Price': [1200, 800, 300, 250, 100],
    'Stock': [15, 25, 40, 30, 45],
    'Last Updated': pd.date_range(start='2023-01-01', periods=5, freq='D')
}

df_excel_data = pd.DataFrame(data)
print("Sample Excel data:")
print(df_excel_data)

# The file itself is written with to_excel() at the end of this cell

print("\nTo load this data from Excel, you would use:")
print("df = pd.read_excel('products.xlsx')\n")

# Save and load Excel file
df_excel_data.to_excel('products.xlsx', index=False)
df = pd.read_excel('products.xlsx')
print(df)

Sample Excel data:
    Product              Category  Price  Stock Last Updated
0    Laptop           Electronics   1200     15   2023-01-01
1     Phone           Electronics    800     25   2023-01-02
2    Tablet           Electronics    300     40   2023-01-03
3   Monitor  Computer Accessories    250     30   2023-01-04
4  Keyboard  Computer Accessories    100     45   2023-01-05

To load this data from Excel, you would use:
df = pd.read_excel('products.xlsx')

    Product              Category  Price  Stock Last Updated
0    Laptop           Electronics   1200     15   2023-01-01
1     Phone           Electronics    800     25   2023-01-02
2    Tablet           Electronics    300     40   2023-01-03
3   Monitor  Computer Accessories    250     30   2023-01-04
4  Keyboard  Computer Accessories    100     45   2023-01-05


### Excel Loading Options

When loading Excel files, you can:
- Specify sheet names or indices
- Read specific cell ranges
- Handle dates and time formats
- Deal with merged cells and formulas

In [7]:
# Here's how you would load from a multi-sheet Excel file
print("Loading from a specific sheet:")
print("df = pd.read_excel('products.xlsx', sheet_name='Sheet1')")

print("\nLoading from multiple sheets:")
print("all_sheets = pd.read_excel('products.xlsx', sheet_name=None)  # Returns a dict of DataFrames")

print("\nLoading a specific range:")
print("df = pd.read_excel('products.xlsx', usecols='A:C', skiprows=2, nrows=10)")

print("\nHandling dates:")
print("df = pd.read_excel('products.xlsx', parse_dates=['Last Updated'])")

Loading from a specific sheet:
df = pd.read_excel('products.xlsx', sheet_name='Sheet1')

Loading from multiple sheets:
all_sheets = pd.read_excel('products.xlsx', sheet_name=None)  # Returns a dict of DataFrames

Loading a specific range:
df = pd.read_excel('products.xlsx', usecols='A:C', skiprows=2, nrows=10)

Handling dates:
df = pd.read_excel('products.xlsx', parse_dates=['Last Updated'])
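
The calls printed above can also be run end to end. The sketch below writes a two-sheet workbook to a temporary directory and reads it back with `sheet_name=None` and `usecols`. It assumes the `openpyxl` package is installed and skips the round trip otherwise; the sheet names, the `Orders` data, and the helper name `excel_roundtrip` are made up for this example.

```python
import importlib.util
import os
import tempfile

import pandas as pd


def excel_roundtrip(path):
    """Write two sheets to `path`, then read them back in two different ways."""
    products = pd.DataFrame({"Product": ["Laptop", "Phone"], "Price": [1200, 800], "Stock": [15, 25]})
    orders = pd.DataFrame({"OrderId": [101, 102], "Product": ["Phone", "Laptop"]})

    # ExcelWriter lets several DataFrames share one workbook
    with pd.ExcelWriter(path) as writer:
        products.to_excel(writer, sheet_name="Products", index=False)
        orders.to_excel(writer, sheet_name="Orders", index=False)

    # sheet_name=None returns a dict mapping sheet name to DataFrame
    all_sheets = pd.read_excel(path, sheet_name=None)
    # usecols accepts Excel-style column letters
    first_two_cols = pd.read_excel(path, sheet_name="Products", usecols="A:B")
    return all_sheets, first_two_cols


# Assumption: openpyxl is the engine used for reading .xlsx files
if importlib.util.find_spec("openpyxl") is not None:
    with tempfile.TemporaryDirectory() as tmp:
        all_sheets, first_two_cols = excel_roundtrip(os.path.join(tmp, "demo.xlsx"))
    sheet_names = list(all_sheets)
    subset_columns = first_two_cols.columns.tolist()
else:
    sheet_names, subset_columns = None, None

print(sheet_names, subset_columns)
```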


## Loading Data from JSON Files

JSON (JavaScript Object Notation) is a common data format for web APIs and configuration files. Pandas provides the `read_json()` function to load JSON data directly into DataFrames.

In [8]:
# Create a sample JSON string
sample_json = """
{
  "employees": [
    {"id": 1, "name": "John Smith", "department": "IT", "skills": ["Python", "SQL", "JavaScript"]},
    {"id": 2, "name": "Jane Doe", "department": "Marketing", "skills": ["SEO", "Content Writing", "Analytics"]},
    {"id": 3, "name": "Bob Johnson", "department": "Finance", "skills": ["Excel", "Financial Modeling", "SQL"]},
    {"id": 4, "name": "Alice Brown", "department": "HR", "skills": ["Recruiting", "Training", "Benefits"]}
  ],
  "company": "Tech Solutions Inc.",
  "location": "New York"
}
"""

# Parse the JSON
parsed_json = json.loads(sample_json)
print("Parsed JSON structure:")
print(f"Keys: {list(parsed_json.keys())}")
print(f"Number of employees: {len(parsed_json['employees'])}")

Parsed JSON structure:
Keys: ['employees', 'company', 'location']
Number of employees: 4


In [9]:
# Convert the employees list to a DataFrame
df_json = pd.DataFrame(parsed_json['employees'])
print("JSON data as DataFrame:")
print(df_json)

# Notice that the skills column contains lists
print("\nSkills column type:", type(df_json['skills'][0]))

JSON data as DataFrame:
   id         name department                             skills
0   1   John Smith         IT          [Python, SQL, JavaScript]
1   2     Jane Doe  Marketing  [SEO, Content Writing, Analytics]
2   3  Bob Johnson    Finance   [Excel, Financial Modeling, SQL]
3   4  Alice Brown         HR   [Recruiting, Training, Benefits]

Skills column type: <class 'list'>


In [10]:
# Loading JSON directly with pandas
# For a simple JSON array
simple_json = """
[
  {"id": 1, "name": "John", "age": 30},
  {"id": 2, "name": "Jane", "age": 25},
  {"id": 3, "name": "Bob", "age": 35}
]
"""

# Load JSON with pandas
df_simple_json = pd.read_json(StringIO(simple_json))
print("Simple JSON loaded with pandas:")
print(df_simple_json)

# For more complex nested JSON, you might need to normalize
print("\nFor nested JSON, you can use json_normalize:")
df_normalized = pd.json_normalize(parsed_json['employees'])
print(df_normalized)

Simple JSON loaded with pandas:
   id  name  age
0   1  John   30
1   2  Jane   25
2   3   Bob   35

For nested JSON, you can use json_normalize:
   id         name department                             skills
0   1   John Smith         IT          [Python, SQL, JavaScript]
1   2     Jane Doe  Marketing  [SEO, Content Writing, Analytics]
2   3  Bob Johnson    Finance   [Excel, Financial Modeling, SQL]
3   4  Alice Brown         HR   [Recruiting, Training, Benefits]
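
In the output above, `json_normalize` returns the same table as `pd.DataFrame` because the employee records contain no nested dictionaries, only a list column. The sketch below shows the cases where it does make a difference. The `contact` field and the email addresses are hypothetical additions to the sample data, used only to illustrate flattening.

```python
import pandas as pd

# Hypothetical payload: same shape as the sample above, plus a nested "contact" dict
payload = {
    "company": "Tech Solutions Inc.",
    "employees": [
        {"id": 1, "name": "John Smith",
         "contact": {"email": "john@example.com", "city": "New York"},
         "skills": ["Python", "SQL"]},
        {"id": 2, "name": "Jane Doe",
         "contact": {"email": "jane@example.com", "city": "Boston"},
         "skills": ["SEO"]},
    ],
}

# Nested dictionaries are flattened into dotted column names
df_flat = pd.json_normalize(payload["employees"])

# record_path selects the list of records; meta copies top-level fields onto each row
df_with_company = pd.json_normalize(payload, record_path="employees", meta=["company"])

# List-valued columns are left as lists; explode() gives one row per list element
df_skills = df_flat.explode("skills")

print(sorted(df_flat.columns))
print(df_with_company[["name", "company"]])
print(df_skills[["name", "skills"]])
```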


### Loading JSON from an API

Many data sources are available through APIs that return JSON. Let's see how to load data from a public API.

In [11]:
# Example: Loading data from a public API
# Using a public API that doesn't require authentication
try:
    # Open Notify API - Current location of the International Space Station
    response = requests.get("http://api.open-notify.org/iss-now.json", timeout=10)
    
    if response.status_code == 200:
        data = response.json()
        print("API Response:")
        print(json.dumps(data, indent=2))
        
        # Convert position data to DataFrame
        position = data['iss_position']
        df_position = pd.DataFrame([position])
        print("\nPosition as DataFrame:")
        print(df_position)
    else:
        print(f"Failed to fetch data: Status code {response.status_code}")
except Exception as e:
    print(f"Error fetching API data: {e}")

API Response:
{
  "iss_position": {
    "longitude": "120.3184",
    "latitude": "-35.9747"
  },
  "message": "success",
  "timestamp": 1744795937
}

Position as DataFrame:
  longitude  latitude
0  120.3184  -35.9747
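
Note that the API returns the coordinates as strings and the timestamp as a Unix epoch, so the DataFrame above holds text rather than numbers. A minimal sketch of converting them follows; it works on a copy of the response printed above so that it runs without network access, and the helper name `iss_payload_to_frame` is made up for this example.

```python
import pandas as pd


def iss_payload_to_frame(payload):
    """Turn one ISS position response into a single-row, properly typed DataFrame."""
    row = dict(payload["iss_position"])
    row["timestamp"] = payload["timestamp"]
    df = pd.DataFrame([row])

    # Coordinates arrive as strings; convert them to floats
    df[["longitude", "latitude"]] = df[["longitude", "latitude"]].apply(pd.to_numeric)
    # The timestamp is seconds since the Unix epoch
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s", utc=True)
    return df


# Copy of the response shown in the output above
sample_payload = {
    "iss_position": {"longitude": "120.3184", "latitude": "-35.9747"},
    "message": "success",
    "timestamp": 1744795937,
}

df_iss = iss_payload_to_frame(sample_payload)
print(df_iss)
print(df_iss.dtypes)
```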


## Handling Different File Encodings

When loading files, especially from international sources, you may encounter encoding issues. Let's see how to handle different file encodings.

In [12]:
# Create sample data with non-ASCII characters
sample_data_utf8 = """
id,name,country
1,José García,Spain
2,Björn Müller,Germany
3,Séverine Dupont,France
4,Николай Иванов,Russia
5,中村 健,Japan
"""

# Create a file-like object
data_utf8 = StringIO(sample_data_utf8.strip())

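# Read the data; encoding='utf-8' is what you would pass for a UTF-8 file on disk
# (a StringIO already holds decoded text, so the argument has no effect here)
df_utf8 = pd.read_csv(data_utf8, encoding='utf-8')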
print("Data with UTF-8 encoding:")
print(df_utf8)

print("\nCommon encodings to try when you have issues:")
print("- utf-8: Universal encoding that works for most modern files")
print("- latin-1 (iso-8859-1): Works for Western European languages")
print("- cp1252: Windows default for Western languages")
print("- utf-16: For some special applications")

Data with UTF-8 encoding:
   id             name  country
0   1      José García    Spain
1   2     Björn Müller  Germany
2   3  Séverine Dupont   France
3   4   Николай Иванов   Russia
4   5             中村 健    Japan

Common encodings to try when you have issues:
- utf-8: Universal encoding that works for most modern files
- latin-1 (iso-8859-1): Works for Western European languages
- cp1252: Windows default for Western languages
- utf-16: For some special applications
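
Encoding problems only show up when pandas has to decode raw bytes. The sketch below reproduces a typical mismatch in memory: text is encoded as cp1252, read back as UTF-8 (which fails), and then read with the matching encoding. The two-row sample and the variable names are made up for illustration.

```python
from io import BytesIO

import pandas as pd

# Simulate a file that was saved with the Windows cp1252 encoding
raw_bytes = "id,name\n1,José García\n2,Björn Müller\n".encode("cp1252")

# Reading the bytes as UTF-8 fails because 0xE9 (é in cp1252) is not valid UTF-8 here
try:
    pd.read_csv(BytesIO(raw_bytes), encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print(f"UTF-8 failed: {exc}")

# Reading with the encoding the bytes were written in recovers the text
df_cp1252 = pd.read_csv(BytesIO(raw_bytes), encoding="cp1252")
print(df_cp1252)
```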


### Detecting Encoding Issues

When you encounter encoding errors, here's how to diagnose and fix them:

In [13]:
def try_encodings(file_path, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Return the first encoding that reads the file without a decode error.

    Order matters: latin-1 maps every byte to a character, so it never raises
    UnicodeDecodeError and is kept last as a fallback. A successful read means
    the bytes decoded, not that the resulting text is necessarily correct.
    """
    for encoding in encodings:
        try:
            # Try to read the first few lines with this encoding
            print(f"Trying {encoding} encoding...")
            result = pd.read_csv(file_path, encoding=encoding, nrows=5)
            print(f"Success with {encoding}!")
            return encoding, result
        except UnicodeDecodeError:
            print(f"Failed with {encoding} encoding.")
        except Exception as e:
            print(f"Other error with {encoding}: {e}")

    return None, None

# This is how you'd use the function with a real file
# best_encoding, sample_data = try_encodings('data.csv')
# if best_encoding:
#     full_data = pd.read_csv('data.csv', encoding=best_encoding)

print("When loading files with encoding issues, you can use the try_encodings function above")
print("to find an encoding that decodes your file without errors.")

When loading files with encoding issues, you can use the try_encodings function above
to find an encoding that decodes your file without errors.


## Working with URLs and Remote Data

You can load data directly from a URL without saving the file to disk first, using either pandas or the requests library.

In [14]:
# Loading CSV directly from a URL
url_csv = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

try:
    # Direct loading with pandas
    df_titanic = pd.read_csv(url_csv)
    print("Titanic dataset loaded from URL:")
    print(df_titanic.head())
    print(f"\nDataset shape: {df_titanic.shape}")
except Exception as e:
    print(f"Error loading CSV from URL: {e}")

Titanic dataset loaded from URL:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            37

In [15]:
# Alternative approach: Using requests to download the data first
try:
    response = requests.get(url_csv, timeout=10)
    if response.status_code == 200:
        content = StringIO(response.text)
        df_titanic2 = pd.read_csv(content)
        print("Titanic dataset loaded using requests:")
        print(f"Number of rows: {len(df_titanic2)}")
    else:
        print(f"Failed to fetch data: Status code {response.status_code}")
except Exception as e:
    print(f"Error with requests approach: {e}")

Titanic dataset loaded using requests:
Number of rows: 891
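
When the same remote file is loaded repeatedly, it can help to keep a local copy so that later runs do not depend on the network. The following is one possible sketch, not a pandas feature: the helper `load_csv_cached`, the cache file name, and the placeholder URL are all assumptions made for this example. The demo seeds the cache first, so the URL is never contacted.

```python
import os
import tempfile

import pandas as pd


def load_csv_cached(url, cache_path):
    """Load a CSV from `cache_path` if it exists, otherwise from `url` and cache it."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = pd.read_csv(url)  # pandas downloads the file itself
    df.to_csv(cache_path, index=False)
    return df


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "cached.csv")
    # Seed the cache so this demo runs offline
    pd.DataFrame({"a": [1, 2]}).to_csv(cache_file, index=False)
    # The cache exists, so the placeholder URL below is not requested
    df_cached = load_csv_cached("https://example.com/placeholder.csv", cache_file)

print(df_cached)
```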


## Comparing File Formats

Let's compare loading speed, file sizes, and use cases for different file formats. We'll also show how to convert between formats.

In [16]:
# Create a sample DataFrame for comparison (numpy was imported in the first cell)

# Generate a larger dataset for comparison
np.random.seed(42)
num_rows = 100000
data = {
    'id': range(1, num_rows + 1),
    'value1': np.random.rand(num_rows),
    'value2': np.random.rand(num_rows),
    'category': np.random.choice(['A', 'B', 'C', 'D'], num_rows),
    'timestamp': pd.date_range(start='2020-01-01', periods=num_rows, freq='10min')
}

comparison_df = pd.DataFrame(data)
print(f"Generated DataFrame with {len(comparison_df)} rows and {len(comparison_df.columns)} columns")
print(comparison_df.head())

Generated DataFrame with 100000 rows and 5 columns
   id    value1    value2 category           timestamp
0   1  0.374540  0.580779        D 2020-01-01 00:00:00
1   2  0.950714  0.526972        A 2020-01-01 00:10:00
2   3  0.731994  0.351037        B 2020-01-01 00:20:00
3   4  0.598658  0.493213        B 2020-01-01 00:30:00
4   5  0.156019  0.365097        C 2020-01-01 00:40:00


In [17]:
# Function to measure file size and read/write time
def format_comparison(df, formats=None):
    if formats is None:
        formats = ['csv', 'excel', 'json', 'pickle', 'parquet', 'feather']
    
    results = []
    
    for fmt in formats:
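        # to_excel() needs a recognized extension, so map 'excel' to 'xlsx'
        extension = 'xlsx' if fmt == 'excel' else fmt
        file_path = f"temp_data.{extension}"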
        
        # Write to file
        write_start = time.time()
        
        try:
            if fmt == 'csv':
                df.to_csv(file_path, index=False)
            elif fmt == 'excel':
                df.to_excel(file_path, index=False)
            elif fmt == 'json':
                df.to_json(file_path, orient='records')
            elif fmt == 'pickle':
                df.to_pickle(file_path)
            elif fmt == 'parquet':
                df.to_parquet(file_path, index=False)
            elif fmt == 'feather':
                df.to_feather(file_path)
            
            write_time = time.time() - write_start
            
            # Get file size
            file_size = os.path.getsize(file_path) / (1024 * 1024)  # Size in MB
            
            # Read from file
            read_start = time.time()
            
            if fmt == 'csv':
                _ = pd.read_csv(file_path)
            elif fmt == 'excel':
                _ = pd.read_excel(file_path)
            elif fmt == 'json':
                _ = pd.read_json(file_path, orient='records')
            elif fmt == 'pickle':
                _ = pd.read_pickle(file_path)
            elif fmt == 'parquet':
                _ = pd.read_parquet(file_path)
            elif fmt == 'feather':
                _ = pd.read_feather(file_path)
                
            read_time = time.time() - read_start
            
            # Clean up
            # os.remove(file_path)
            
            results.append({
                'Format': fmt,
                'File Size (MB)': round(file_size, 2),
                'Write Time (s)': round(write_time, 3),
                'Read Time (s)': round(read_time, 3)
            })
            
        except Exception as e:
            print(f"Error with {fmt} format: {e}")
            # Clean up if file exists
            # if os.path.exists(file_path):
            #     os.remove(file_path)
    
    return pd.DataFrame(results)

# Compare formats (limiting to common formats that don't require extra libraries)
try:
    # For demonstration, use a smaller subset of the data
    sample_df = comparison_df.head(10000)
    comparison = format_comparison(sample_df, formats=['csv', 'json', 'pickle'])
    print(comparison)
except Exception as e:
    print(f"Error during comparison: {e}")
    print("Note: Some formats like parquet and feather require additional libraries")

   Format  File Size (MB)  Write Time (s)  Read Time (s)
0     csv            0.63           0.064          0.023
1    json            0.92           0.014          0.039
2  pickle            0.33           0.003          0.011


### Format Comparison Summary

Let's discuss the pros and cons of each format:

1. **CSV**
   - Pros: Universal compatibility, human-readable, works with many tools
   - Cons: Larger file size, slow for large datasets, no schema/type preservation

2. **JSON**
   - Pros: Great for web APIs, preserves nested structures, human-readable
   - Cons: Larger file size than binary formats, slower parsing

3. **Excel**
   - Pros: User-friendly for non-technical users, supports multiple sheets
   - Cons: Very slow for large datasets, large file size, version compatibility issues

4. **Pickle**
   - Pros: Fast, preserves pandas objects and data types, small file size
   - Cons: Python-specific, security concerns, version compatibility issues

5. **Parquet**
   - Pros: Very efficient storage, columnar storage for analytics, preserves schema
   - Cons: Requires additional libraries, not human-readable

6. **Feather**
   - Pros: Fast read/write, cross-compatible between R and Python
   - Cons: Requires additional libraries, not as universally supported
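
The "no schema/type preservation" point for CSV can be checked directly. The sketch below round-trips a small DataFrame through CSV and pickle using in-memory buffers and compares the resulting dtypes; the three-row sample is made up, and the last step shows one way to restore the types by passing them to `read_csv()` explicitly.

```python
from io import BytesIO, StringIO

import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=3, freq="10min"),
    "category": pd.Categorical(["A", "B", "A"]),
    "value": [0.1, 0.2, 0.3],
})

# CSV round trip: everything is written as text, so dtypes must be re-inferred
csv_buffer = StringIO()
df.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)
df_from_csv = pd.read_csv(csv_buffer)

# Pickle round trip: the pandas object is serialized with its dtypes intact
pickle_buffer = BytesIO()
df.to_pickle(pickle_buffer)
pickle_buffer.seek(0)
df_from_pickle = pd.read_pickle(pickle_buffer)

# CSV again, this time telling pandas which types to restore
csv_buffer.seek(0)
df_from_csv_typed = pd.read_csv(
    csv_buffer, parse_dates=["timestamp"], dtype={"category": "category"}
)

print(df_from_csv.dtypes)
print(df_from_pickle.dtypes)
print(df_from_csv_typed.dtypes)
```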

In [18]:
# Converting between formats 
print("Code to convert between formats:")

print("\n# CSV to Excel")
print("df = pd.read_csv('data.csv')")
print("df.to_excel('data.xlsx', index=False)")

print("\n# Excel to JSON")
print("df = pd.read_excel('data.xlsx')")
print("df.to_json('data.json', orient='records')")

print("\n# JSON to Parquet")
print("df = pd.read_json('data.json')")
print("df.to_parquet('data.parquet', index=False)")

print("\n# Parquet to CSV")
print("df = pd.read_parquet('data.parquet')")
print("df.to_csv('data_new.csv', index=False)")

Code to convert between formats:

# CSV to Excel
df = pd.read_csv('data.csv')
df.to_excel('data.xlsx', index=False)

# Excel to JSON
df = pd.read_excel('data.xlsx')
df.to_json('data.json', orient='records')

# JSON to Parquet
df = pd.read_json('data.json')
df.to_parquet('data.parquet', index=False)

# Parquet to CSV
df = pd.read_parquet('data.parquet')
df.to_csv('data_new.csv', index=False)
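
The snippets above are only printed, not executed. A runnable version of the same idea is sketched below for CSV and JSON, which need no extra packages. It writes to a temporary directory and checks that the data survives the round trip; the three-row sample and the file names are chosen for this example.

```python
import os
import tempfile

import pandas as pd

df_original = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["John", "Jane", "Bob"], "age": [30, 25, 35]}
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    json_path = os.path.join(tmp, "data.json")
    new_csv_path = os.path.join(tmp, "data_new.csv")

    df_original.to_csv(csv_path, index=False)

    # CSV to JSON
    pd.read_csv(csv_path).to_json(json_path, orient="records")

    # JSON back to CSV
    pd.read_json(json_path, orient="records").to_csv(new_csv_path, index=False)

    df_roundtrip = pd.read_csv(new_csv_path)

# Compare the values row by row
same_values = df_roundtrip.to_dict("records") == df_original.to_dict("records")
print(same_values)
```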


## Summary

In this notebook, we've learned how to:

1. **Load data from CSV files** with various options for separators, headers, and data types
2. **Work with Excel files** and handle different sheets and ranges
3. **Process JSON data** from files and APIs
4. **Handle encoding issues** with different character sets
5. **Load data directly from URLs** without downloading files
6. **Compare different file formats** for storage efficiency and speed
7. **Convert data between formats** for different use cases

These skills are fundamental for any data science workflow, as data loading is typically the first step in any analysis or machine learning project.