# Loading Data from Various Sources (CSV, Excel, JSON)

This notebook provides a comprehensive guide on loading data from different file formats commonly used in data science workflows. We'll explore:
1. Creating sample data files
2. Loading data from CSV files
3. Loading data from Excel files 
4. Loading data from JSON files
5. Handling different data sources
6. Performing data cleaning
7. Saving data back to different formats

## Required Libraries Setup

First, let's install and import all necessary libraries. We'll need:
- pandas: for data manipulation
- numpy: for numerical operations
- matplotlib: for visualization
- openpyxl: for Excel file handling
- json: for JSON handling (part of the Python standard library, so no installation is needed)

In [None]:
# Install necessary packages if not already installed
# Uncomment these lines if you need to install packages
# !pip install pandas numpy matplotlib openpyxl

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import os

# For displaying plots inline
%matplotlib inline

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.width', 1000)

print("Libraries imported successfully!")

## Creating Sample Data Files

Let's create three sample data files containing the same data in different formats:
1. CSV (Comma Separated Values)
2. Excel (.xlsx)
3. JSON (JavaScript Object Notation)

We'll create a dataset about electronics products and their sales information.

In [None]:
# Define the data structure
data = {
    'product_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Smartwatch', 'Headphones', 
                     'Gaming Console', 'TV', 'Camera', 'Bluetooth Speaker', 'E-reader'],
    'category': ['Computers', 'Phones', 'Computers', 'Wearables', 'Audio', 
                 'Gaming', 'Home Entertainment', 'Photography', 'Audio', 'Reading'],
    'price': [1200.50, 899.99, 399.95, 249.50, 159.99, 
              499.95, 899.99, 649.95, 79.99, 129.95],
    'stock': [45, 120, 35, 78, 90, 25, 15, 28, 60, 55],
    'rating': [4.6, 4.5, 4.2, 3.9, 4.7, 4.8, 4.3, 4.1, 4.0, 4.4],
    'release_date': ['2023-01-15', '2022-09-20', '2023-03-10', '2022-11-05', '2023-02-28',
                     '2022-07-12', '2023-04-18', '2022-12-30', '2023-05-22', '2022-08-08']
}

# Create a pandas DataFrame
df = pd.DataFrame(data)

# Display the data
print("Sample data created:")
df.head()

Now that we have our data ready, let's save it in three different formats:

In [None]:
# Define file paths
csv_file = "c:\\Users\\pavel\\projects\\ai-ml\\electronics_products.csv"
excel_file = "c:\\Users\\pavel\\projects\\ai-ml\\electronics_products.xlsx"
json_file = "c:\\Users\\pavel\\projects\\ai-ml\\electronics_products.json"

# Save as CSV
df.to_csv(csv_file, index=False)

# Save as Excel
df.to_excel(excel_file, sheet_name='Products', index=False)

# Save as JSON
df.to_json(json_file, orient='records', indent=4)

print(f"Files saved successfully at:\n- {csv_file}\n- {excel_file}\n- {json_file}")
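
The paths above are tied to one Windows machine. As a minimal sketch of a more portable setup, the cell below builds paths with `pathlib` inside the system temp folder. The folder name `electronics_demo` and the two-row `sample` frame are assumptions made only for this illustration.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical output folder; replace with any directory you can write to.
data_dir = Path(tempfile.gettempdir()) / "electronics_demo"
data_dir.mkdir(parents=True, exist_ok=True)

# The / operator joins path parts with the right separator for the current OS.
csv_path = data_dir / "electronics_products.csv"
json_path = data_dir / "electronics_products.json"

# Small stand-in for the full dataset, so the sketch runs on its own.
sample = pd.DataFrame({"product_id": [1001, 1002], "price": [1200.50, 899.99]})
sample.to_csv(csv_path, index=False)
sample.to_json(json_path, orient="records", indent=4)

print(csv_path.exists(), json_path.exists())
```

pandas accepts `Path` objects anywhere it accepts a file name, so the rest of the notebook could use `csv_path` in place of a hard-coded string.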

## Loading CSV Data

CSV (Comma Separated Values) is one of the most common formats for storing structured data. Pandas provides the powerful `read_csv()` function with many parameters to handle various CSV file configurations.

In [None]:
# Basic reading of a CSV file
df_csv = pd.read_csv(csv_file)

# Display the first few rows
print("Data loaded from CSV file:")
df_csv.head()

### Advanced CSV Loading Options

Let's explore some of the parameters that `read_csv()` offers:
- delimiter/sep: specify different delimiters
- header: specify header row
- skiprows: skip specific rows
- usecols: select specific columns
- dtype: specify column data types
- parse_dates: convert columns to datetime
- na_values: specify values to be treated as NaN

In [None]:
# Let's create a slightly modified version of our CSV to demonstrate these options
# First, add some rows with missing values
df_modified = df.copy()
df_modified.loc[2, 'price'] = np.nan
df_modified.loc[5, 'rating'] = np.nan
df_modified.to_csv('c:\\Users\\pavel\\projects\\ai-ml\\modified_products.csv', index=False)

# Now let's read it with various options
df_csv_advanced = pd.read_csv('c:\\Users\\pavel\\projects\\ai-ml\\modified_products.csv',
                             dtype={'product_id': int, 'stock': int},
                             parse_dates=['release_date'],
                             na_values=['N/A', 'unknown'])

# Display info about the loaded dataframe
print("Data types after specifying dtypes and parse_dates:")
print(df_csv_advanced.dtypes)

# Show rows with missing values
print("\nRows with missing values:")
print(df_csv_advanced[df_csv_advanced.isna().any(axis=1)])

In [None]:
# Reading only specific columns
df_csv_subset = pd.read_csv(csv_file, usecols=['product_id', 'product_name', 'price'])

print("Reading only specific columns:")
df_csv_subset.head()
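
The `sep`, `skiprows`, and `header` options from the list above are not exercised by the cells so far. The sketch below shows them on a small in-memory file; the text content, including the leading comment line and the `unknown` price, is invented for this example.

```python
import io

import pandas as pd

# Invented sample: a comment line, a semicolon-delimited header, three records.
raw_text = (
    "# exported by a hypothetical inventory tool\n"
    "product_id;product_name;price\n"
    "1001;Laptop;1200.50\n"
    "1002;Smartphone;unknown\n"
    "1003;Tablet;399.95\n"
)

df_semicolon = pd.read_csv(
    io.StringIO(raw_text),
    sep=";",                 # fields are separated by semicolons
    skiprows=1,              # drop the leading comment line
    header=0,                # the first remaining line holds the column names
    na_values=["unknown"],   # treat this marker as a missing value
)

print(df_semicolon)
print(df_semicolon.dtypes)
```

Because `unknown` is mapped to NaN while reading, the `price` column is loaded as a float column rather than as text.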

## Loading Excel Data

Excel files (.xlsx, .xls) are widely used in business and data analysis. Pandas provides the `read_excel()` function to work with Excel files.

In [None]:
# Basic reading of an Excel file
df_excel = pd.read_excel(excel_file)

print("Data loaded from Excel file:")
df_excel.head()

### Advanced Excel Loading Options

Excel files can contain multiple sheets and complex structures. Let's explore some advanced options:
- sheet_name: select specific sheet(s)
- usecols: select specific columns
- skiprows: skip specific rows
- header: specify header row
- names: specify column names

In [None]:
# Let's create a more complex Excel file with multiple sheets
with pd.ExcelWriter('c:\\Users\\pavel\\projects\\ai-ml\\complex_products.xlsx') as writer:
    # First sheet with all data
    df.to_excel(writer, sheet_name='All Products', index=False)
    
    # Second sheet with computers only
    computers = df[df['category'] == 'Computers']
    computers.to_excel(writer, sheet_name='Computers', index=False)
    
    # Third sheet with audio products
    audio = df[df['category'] == 'Audio']
    audio.to_excel(writer, sheet_name='Audio Products', index=False)

print("Complex Excel file created with multiple sheets")

In [None]:
# Reading specific sheets
df_computers = pd.read_excel('c:\\Users\\pavel\\projects\\ai-ml\\complex_products.xlsx', 
                            sheet_name='Computers')

print("Data from 'Computers' sheet:")
df_computers

In [None]:
# Reading multiple sheets at once
all_sheets = pd.read_excel('c:\\Users\\pavel\\projects\\ai-ml\\complex_products.xlsx', 
                          sheet_name=None)  # None returns all sheets as dict

# Show available sheets
print("Available sheets:", list(all_sheets.keys()))

# Display data from each sheet
for sheet_name, sheet_data in all_sheets.items():
    print(f"\nData from '{sheet_name}' sheet:")
    print(sheet_data.head(2))

In [None]:
# Reading specific columns
# Note: do not combine skiprows=1 with header=0 here; that would skip the real
# header and promote the first data row (the Laptop record) to column names
df_excel_subset = pd.read_excel('c:\\Users\\pavel\\projects\\ai-ml\\complex_products.xlsx',
                              sheet_name='All Products',
                              usecols="B,D,F",  # Columns B, D, F (product_name, price, rating)
                              header=0)         # Use the first row as the header

print("Subset of Excel data with specific columns:")
df_excel_subset.head()

## Loading JSON Data

JSON (JavaScript Object Notation) is a popular data format used for configuration files and web services. Pandas provides the `read_json()` function to work with JSON data.

In [None]:
# Basic reading of a JSON file
df_json = pd.read_json(json_file)

print("Data loaded from JSON file:")
df_json.head()

### Advanced JSON Loading Options

JSON can have various structures, and pandas offers options to handle them:
- orient: specify JSON structure ('split', 'records', 'index', 'columns', 'values', 'table')
- lines: read JSON objects line by line (for JSON Lines format)
- convert_dates: convert date strings to datetime objects

In [None]:
# Let's create different JSON structures for demonstration

# 1. JSON with 'records' orientation (list of objects)
with open('c:\\Users\\pavel\\projects\\ai-ml\\products_records.json', 'w') as f:
    json.dump(df.to_dict(orient='records'), f, indent=2)

# 2. JSON with 'columns' orientation (columns as keys)
with open('c:\\Users\\pavel\\projects\\ai-ml\\products_columns.json', 'w') as f:
    json.dump(df.to_dict(orient='columns'), f, indent=2)

# 3. JSON Lines format (each line is a valid JSON object)
with open('c:\\Users\\pavel\\projects\\ai-ml\\products_lines.jsonl', 'w') as f:
    for record in df.to_dict(orient='records'):
        f.write(json.dumps(record) + '\n')

print("Created three different JSON structure files")

In [None]:
# Reading JSON with different orientations
df_json_records = pd.read_json('c:\\Users\\pavel\\projects\\ai-ml\\products_records.json')
print("JSON 'records' format loaded:")
print(df_json_records.head(3))

df_json_columns = pd.read_json('c:\\Users\\pavel\\projects\\ai-ml\\products_columns.json')
print("\nJSON 'columns' format loaded:")
print(df_json_columns.head(3))

In [None]:
# Reading JSON lines format
df_json_lines = pd.read_json('c:\\Users\\pavel\\projects\\ai-ml\\products_lines.jsonl', lines=True)
print("JSON Lines format loaded:")
print(df_json_lines.head(3))
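
The cells above cover the 'records', 'columns', and JSON Lines layouts. As a small sketch of the remaining options in the list, the cell below reads a 'split' payload and asks for `release_date` to be converted with `convert_dates`. The two-row payload is made up for illustration, and it is wrapped in `io.StringIO` because recent pandas versions discourage passing a literal JSON string.

```python
import io
import json

import pandas as pd

# Made-up payload in 'split' layout: column names, index, and rows kept apart.
split_payload = json.dumps({
    "columns": ["product_id", "release_date"],
    "index": [0, 1],
    "data": [[1001, "2023-01-15"], [1002, "2022-09-20"]],
})

df_split = pd.read_json(
    io.StringIO(split_payload),
    orient="split",
    convert_dates=["release_date"],  # explicitly request date parsing for this column
)

print(df_split.dtypes)
```

Naming the column explicitly matters here: by default `read_json` only tries date conversion on columns whose names look date-like to it, and `release_date` is not one of those patterns.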

### Working with Nested JSON

JSON can have nested structures that may require preprocessing:

In [None]:
# Create a nested JSON structure
nested_data = []
for i, row in df.iterrows():
    nested_data.append({
        'product_id': row['product_id'],
        'product_name': row['product_name'],
        'details': {
            'category': row['category'],
            'price': row['price'],
            'stock': row['stock']
        },
        'meta': {
            'rating': row['rating'],
            'release_date': row['release_date']
        }
    })

# Save nested JSON
with open('c:\\Users\\pavel\\projects\\ai-ml\\nested_products.json', 'w') as f:
    json.dump(nested_data, f, indent=2)

print("Created nested JSON structure")

In [None]:
# Read the nested JSON file
with open('c:\\Users\\pavel\\projects\\ai-ml\\nested_products.json', 'r') as f:
    nested_json = json.load(f)

# Convert to DataFrame
df_nested = pd.json_normalize(
    nested_json,
    sep='_'  # Separator for nested column names
)

print("Nested JSON normalized to DataFrame:")
df_nested.head()
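
Nested dictionaries flatten directly, as shown above. Nested lists need the `record_path` and `meta` arguments of `pd.json_normalize()`. The sketch below assumes a hypothetical `reviews` list per product, which is not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical structure: each product carries a list of review records.
products = [
    {
        "product_id": 1001,
        "details": {"category": "Computers"},
        "reviews": [
            {"stars": 5, "verified": True},
            {"stars": 4, "verified": False},
        ],
    },
    {
        "product_id": 1002,
        "details": {"category": "Phones"},
        "reviews": [{"stars": 3, "verified": True}],
    },
]

df_reviews = pd.json_normalize(
    products,
    record_path="reviews",                            # one output row per review
    meta=["product_id", ["details", "category"]],     # parent fields repeated on each row
    sep="_",
)

print(df_reviews)
```

Each review becomes its own row, and the parent fields listed in `meta` are copied onto every row that came from that product.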

## Handling Different Data Sources

Now that we've loaded data from different sources, let's compare them to ensure consistency:

In [None]:
# Check if dataframes from different sources are identical
csv_equals_excel = df_csv.equals(df_excel)
csv_equals_json = df_csv.equals(df_json)

print(f"CSV and Excel data are identical: {csv_equals_excel}")
print(f"CSV and JSON data are identical: {csv_equals_json}")

if not (csv_equals_excel and csv_equals_json):
    print("\nDifferences may exist in data types. Let's check data types from each source:")
    
    print("\nCSV data types:")
    print(df_csv.dtypes)
    
    print("\nExcel data types:")
    print(df_excel.dtypes)
    
    print("\nJSON data types:")
    print(df_json.dtypes)

### Handling Data Type Issues

Different file formats may lead to different data type interpretations. Let's standardize the data types:

In [None]:
# Function to standardize data types across dataframes
def standardize_dtypes(df):
    df_copy = df.copy()
    # Convert specific columns to appropriate types
    df_copy['product_id'] = df_copy['product_id'].astype(int)
    df_copy['price'] = df_copy['price'].astype(float)
    df_copy['stock'] = df_copy['stock'].astype(int)
    df_copy['rating'] = df_copy['rating'].astype(float)
    df_copy['release_date'] = pd.to_datetime(df_copy['release_date'])
    return df_copy

# Standardize data types for all dataframes
df_csv_std = standardize_dtypes(df_csv)
df_excel_std = standardize_dtypes(df_excel)
df_json_std = standardize_dtypes(df_json)

# Check if standardized dataframes are now identical (ignoring index)
csv_equals_excel = df_csv_std.reset_index(drop=True).equals(df_excel_std.reset_index(drop=True))
csv_equals_json = df_csv_std.reset_index(drop=True).equals(df_json_std.reset_index(drop=True))

print(f"After standardization, CSV and Excel data are identical: {csv_equals_excel}")
print(f"After standardization, CSV and JSON data are identical: {csv_equals_json}")
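
When `equals()` returns False, it does not say why. One possible way to separate "values differ" from "only the dtypes differ" is `pandas.testing.assert_frame_equal` with `check_dtype=False`. The two small frames below are stand-ins invented for this sketch, not the frames loaded earlier.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Same values, but product_id is int64 on the left and float64 on the right.
left = pd.DataFrame({"product_id": [1001, 1002], "price": [1200.50, 899.99]})
right = pd.DataFrame({"product_id": [1001.0, 1002.0], "price": [1200.50, 899.99]})

print(left.equals(right))  # False: the dtypes differ (int64 vs float64)

try:
    # Compare values only; a dtype mismatch is tolerated.
    assert_frame_equal(left, right, check_dtype=False)
    same_values = True
except AssertionError as err:
    same_values = False
    print(err)  # the message names the first column that differs

print(f"Same values when dtypes are ignored: {same_values}")
```

If the values do differ, the error message points at the offending column, which is usually a faster route to the cause than printing every dtype.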

## Data Cleaning Techniques

Let's demonstrate some common data cleaning procedures for each data source type:
1. Handling missing values
2. Removing duplicates
3. Converting data types
4. Standardizing string values

In [None]:
# Let's create a messy dataset to clean
messy_data = {
    'product_id': [1001, 1002, 1002, 1003, None, 1005],
    'product_name': ['Laptop', 'smartphone', 'Smartphone', 'Tablet ', None, 'headphones'],
    'category': ['Computers', 'Phones', 'phones', 'Computers', 'Unknown', 'audio'],
    'price': ['1200.50', '899.99', '899.99', 'N/A', '159.99', '?'],
    'stock': ['45', '120', '120', '35', '', '90'],
    'rating': ['4.6', '4.5', '4.5', '4,2', 'pending', '4.7'],
    'release_date': ['2023-01-15', '22-09-2022', '22/09/2022', '10-03-2023', '2023', '02/28/2023']
}

messy_df = pd.DataFrame(messy_data)
print("Messy data to clean:")
messy_df

In [None]:
# Step 1: Check for missing values
print("Missing values per column:")
print(messy_df.isnull().sum())

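# Step 2: Check for duplicates
# Compare names case-insensitively and without stray spaces; otherwise
# 'smartphone' and 'Smartphone' would not be flagged as the same product
print("\nDuplicate rows:")
name_key = messy_df['product_name'].str.strip().str.lower()
is_duplicate = messy_df.assign(product_name=name_key).duplicated(subset=['product_id', 'product_name'], keep=False)
print(messy_df[is_duplicate])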

# Step 3: Data types overview
print("\nData types before cleaning:")
print(messy_df.dtypes)

In [None]:
# Let's clean the data step by step

# Step 1: Handle missing values
cleaned_df = messy_df.copy()
cleaned_df['product_id'] = cleaned_df['product_id'].fillna(9999).astype(int)  # Fill missing IDs with a placeholder
cleaned_df['product_name'] = cleaned_df['product_name'].fillna('Unknown Product')

# Step 2: Standardize text columns (strip spaces, title case)
# This runs before deduplication so 'smartphone' and 'Smartphone' compare equal
cleaned_df['product_name'] = cleaned_df['product_name'].str.strip().str.title()
cleaned_df['category'] = cleaned_df['category'].str.strip().str.title()

# Step 3: Remove duplicates
cleaned_df = cleaned_df.drop_duplicates(subset=['product_id', 'product_name'], keep='first')

# Step 4: Fix price column - replace non-numeric values and convert to float
cleaned_df['price'] = cleaned_df['price'].replace(['N/A', '?'], np.nan)
cleaned_df['price'] = pd.to_numeric(cleaned_df['price'], errors='coerce')

# Step 5: Fix stock column - convert to integer (missing stock becomes 0)
cleaned_df['stock'] = pd.to_numeric(cleaned_df['stock'], errors='coerce').fillna(0).astype(int)

# Step 6: Fix rating column - replace comma with dot, convert to float
cleaned_df['rating'] = cleaned_df['rating'].str.replace(',', '.', regex=False)
cleaned_df['rating'] = pd.to_numeric(cleaned_df['rating'], errors='coerce')

# Step 7: Parse dates using a custom function
def parse_date(date_str):
    if pd.isna(date_str):
        return pd.NaT
    
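    # Try each format in order; an ambiguous string such as '10-03-2023'
    # is read with the first format that fits (day-month-year here)
    formats = ['%Y-%m-%d', '%d-%m-%Y', '%d/%m/%Y', '%m/%d/%Y', '%Y']
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue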
    
    return pd.NaT

cleaned_df['release_date'] = cleaned_df['release_date'].apply(parse_date)

# Display cleaned data
print("Data after cleaning:")
cleaned_df

In [None]:
# Check data types after cleaning
print("Data types after cleaning:")
print(cleaned_df.dtypes)

# Summary statistics
print("\nSummary statistics after cleaning:")
print(cleaned_df.describe())
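
Filling missing IDs with 9999 keeps the column numeric, but the placeholder can later be mistaken for a real product. One alternative, sketched below under the assumption that missing IDs should stay missing, is the pandas nullable integer dtype `Int64`. The `ids` series is a stand-in for the `product_id` column.

```python
import pandas as pd

# Stand-in for a product_id column with one missing entry.
ids = pd.Series([1001, 1002, None, 1005])
print(ids.dtype)  # float64: None became NaN, which forces a float column

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
ids_nullable = ids.astype("Int64")
print(ids_nullable)
print("Missing IDs:", ids_nullable.isna().sum())
```

Whether a placeholder or an explicit missing value is the better choice depends on what the downstream code expects, so treat this as an option rather than a rule.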

## Saving Data to Different Formats

After processing and cleaning our data, we often need to save it back to disk in various formats:

In [None]:
# Save cleaned data to CSV
cleaned_df.to_csv('c:\\Users\\pavel\\projects\\ai-ml\\cleaned_products.csv', index=False)

# Save cleaned data to Excel with some formatting
with pd.ExcelWriter('c:\\Users\\pavel\\projects\\ai-ml\\cleaned_products.xlsx', engine='openpyxl') as writer:
    cleaned_df.to_excel(writer, sheet_name='Cleaned Products', index=False)
    # You can add more sheets or formatting here if needed

# Save cleaned data to JSON in different formats
cleaned_df.to_json('c:\\Users\\pavel\\projects\\ai-ml\\cleaned_products_records.json', 
                  orient='records', indent=2)

cleaned_df.to_json('c:\\Users\\pavel\\projects\\ai-ml\\cleaned_products_columns.json', 
                  orient='columns', indent=2)

print("Cleaned data saved to various formats")

### Advanced Export Options

Let's explore some advanced options for saving data:

In [None]:
# 1. Save CSV with specific options
cleaned_df.to_csv('c:\\Users\\pavel\\projects\\ai-ml\\cleaned_products_advanced.csv', 
                 index=False,
                 sep=';',  # Use semicolon as delimiter
                 date_format='%Y-%m-%d',  # Specify date format
                 float_format='%.2f')  # Format floats to 2 decimal places

# 2. Save Excel with formatting
import openpyxl
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.utils.dataframe import dataframe_to_rows

# Create a new Excel file with openpyxl for more formatting control
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Formatted Products"

# Add the dataframe
for r_idx, row in enumerate(dataframe_to_rows(cleaned_df, index=False, header=True), 1):
    for c_idx, value in enumerate(row, 1):
        ws.cell(row=r_idx, column=c_idx, value=value)

# Apply formatting to header row
header_fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
header_font = Font(color="FFFFFF", bold=True)

for cell in ws[1]:
    cell.fill = header_fill
    cell.font = header_font
    cell.alignment = Alignment(horizontal="center")

# Auto adjust column width
for col in ws.columns:
    max_length = 0
    column = col[0].column_letter
    for cell in col:
        if cell.value is not None:
            max_length = max(max_length, len(str(cell.value)))
    adjusted_width = max_length + 2
    ws.column_dimensions[column].width = adjusted_width

# Save the workbook
wb.save('c:\\Users\\pavel\\projects\\ai-ml\\formatted_products.xlsx')

print("Advanced formatted Excel file saved")

In [None]:
# 3. Save filtered data
# For example, save only high-priced items to a separate file
high_priced = cleaned_df[cleaned_df['price'] > 500]
high_priced.to_json('c:\\Users\\pavel\\projects\\ai-ml\\high_priced_products.json', orient='records', indent=2)

# 4. Save data as HTML table
html_table = cleaned_df.to_html(index=False, classes='table table-striped')
with open('c:\\Users\\pavel\\projects\\ai-ml\\products_table.html', 'w') as f:
    f.write('''
    <html>
    <head>
        <title>Products Table</title>
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
    </head>
    <body class="container mt-5">
        <h2>Electronics Products Data</h2>
    ''' + html_table + '''
    </body>
    </html>
    ''')

print("Additional export formats created")
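
A file written with custom export options has to be read back with matching options. The sketch below round-trips a small frame through an in-memory buffer using the same `sep`, `date_format`, and `float_format` settings as the advanced CSV export above. The two-row `frame` is invented for this example.

```python
import io

import pandas as pd

# Invented two-row frame with a float column and a datetime column.
frame = pd.DataFrame({
    "product_name": ["Laptop", "Tablet"],
    "price": [1200.5, 399.949],
    "release_date": pd.to_datetime(["2023-01-15", "2023-03-10"]),
})

# Write with a semicolon delimiter, ISO dates, and two decimal places.
buffer = io.StringIO()
frame.to_csv(buffer, index=False, sep=";", date_format="%Y-%m-%d", float_format="%.2f")
print(buffer.getvalue())

# Read it back: the delimiter and the date column must be stated again.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";", parse_dates=["release_date"])
print(restored.dtypes)
```

Note that `float_format='%.2f'` rounds on the way out, so the restored prices have two decimals and are not guaranteed to equal the originals exactly.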

## Summary

In this notebook, we've covered:

1. **Creating Sample Data Files** - We created CSV, Excel, and JSON files containing similar data.

2. **Loading CSV Data** - We explored various parameters of `pd.read_csv()` to handle different CSV formats.

3. **Loading Excel Data** - We demonstrated how to work with Excel files, including handling multiple sheets using `pd.read_excel()`.

4. **Loading JSON Data** - We examined different JSON structures and how to load them with `pd.read_json()` and `json_normalize()`.

5. **Handling Different Data Sources** - We compared data loaded from different sources and addressed inconsistencies.

6. **Data Cleaning Techniques** - We applied common data cleaning operations to fix messy data.

7. **Saving Data to Different Formats** - We explored various options to export data with specific formatting.

These skills are essential for any data scientist, as most data analysis projects begin with loading and preprocessing data from various sources.