# <font color="#418FDE" size="6.5" uppercase>**Reading Tabular Data**</font>

>Last update: 20251225.
    
By the end of this lecture, you will be able to:
- Use read_csv and read_excel to load tabular data into DataFrames with appropriate options. 
- Control column types, missing value markers, and date parsing during file ingestion. 
- Export cleaned DataFrames back to disk using to_csv and to_excel with reproducible settings. 


## **1. read_csv Basics**

### **1.1. Separators, Headers, and Index**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas 2.3.1 A-Z/Module_02/Lecture_A/image_01_01.jpg?v=1766641533" width="250">



>* Choose the correct separator for each file
>* Check columns and names to catch separator mistakes

>* Identify which row actually contains column headers
>* Skip extra top lines and rename columns consistently

>* Choose between default index or natural key
>* Ensure index column is unique, complete, reliable
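
The points about skipping extra top lines, renaming columns, and vetting an index column can be sketched together. This is a minimal illustration, not a fixed recipe: the sample text, the two banner lines, and the choice of `customer_id` as the candidate key are all assumptions made for the example.

```python
import pandas as pd
from io import StringIO

# Hypothetical export: two banner lines, then the real header, then data.
raw_text = (
    "Exported by meter system\n"
    "Units: kWh\n"
    "Date;Customer ID;Energy (kWh)\n"
    "2024-01-01;C001;15.5\n"
    "2024-01-02;C002;18.0\n"
)

# skiprows=2 drops the banner lines; header=0 plus names replaces the
# file's own labels with consistent snake_case names.
df = pd.read_csv(
    StringIO(raw_text),
    sep=";",
    skiprows=2,
    header=0,
    names=["date", "customer_id", "energy_kwh"],
)

# Only promote a column to the index if it is unique and has no gaps.
candidate_key = df["customer_id"]
key_is_unique = candidate_key.is_unique
key_is_complete = not candidate_key.isna().any()

if key_is_unique and key_is_complete:
    df = df.set_index("customer_id")

print(df)
```

If either check fails, keeping the default integer index is usually the safer choice until the duplicates or gaps are understood.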



In [None]:
#@title Python Code - Separators Headers Index

# Demonstrate separators headers and index when reading CSV text with pandas.
# Show how wrong separator breaks columns and how correct separator fixes them.
# Show how header and index options change resulting DataFrame structure.

import pandas as pd
from io import StringIO

# Create sample CSV text with semicolon separator and header row.
csv_text = "date;customer_id;energy_kwh\n2024-01-01;C001;15.5\n2024-01-02;C002;18.0"

# Read using default comma separator, everything becomes one confusing column.
wrong_sep_df = pd.read_csv(StringIO(csv_text))
print("Wrong separator, columns look incorrect:")
print(wrong_sep_df.head())

# Read using correct semicolon separator, columns now parse correctly.
correct_sep_df = pd.read_csv(StringIO(csv_text), sep=";")
print("\nCorrect separator, columns now parsed:")
print(correct_sep_df.head())

# Read again with header=None so the first row is treated as data, then assign names.
# Note that the file's own header line now appears as the first data row.
no_header_df = pd.read_csv(StringIO(csv_text), sep=";", header=None)
no_header_df.columns = ["date", "customer_id", "energy_kwh"]
print("\nNo header used, custom names assigned:")
print(no_header_df.head())

# Read with correct separator and set date column as index for convenience.
indexed_df = pd.read_csv(StringIO(csv_text), sep=";", index_col="date")
print("\nDate column used as index:")
print(indexed_df.head())



### **1.2. Encoding and compression**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas 2.3.1 A-Z/Module_02/Lecture_A/image_01_02.jpg?v=1766641557" width="250">



>* Encoding maps text characters to file bytes
>* Wrong encoding causes garbled text and data issues

>* Explicitly set file encoding when loading data
>* Wrong or default encodings silently corrupt text fields

>* Use on-the-fly decompression for compressed data files
>* Specify compression type to ensure speed and accuracy
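
An encoding mismatch does not always fail quietly. As a small sketch (the sample bytes are made up for illustration), Latin-1 bytes read as UTF-8 typically raise an error, which is the easier case to catch because the load stops instead of producing garbled text.

```python
import pandas as pd
from io import BytesIO

# Hypothetical file content saved by a tool that writes Latin-1.
latin1_bytes = "name,city\nJosé,Montréal\n".encode("latin-1")

# Reading with UTF-8 fails loudly because 0xE9 is not valid UTF-8 here.
try:
    pd.read_csv(BytesIO(latin1_bytes), encoding="utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Reading with the matching encoding recovers the accented characters.
df_latin1 = pd.read_csv(BytesIO(latin1_bytes), encoding="latin-1")

print("UTF-8 read failed:", decode_failed)
print(df_latin1)
```

The reverse mismatch (UTF-8 bytes read as Latin-1) raises no error, because every byte is a valid Latin-1 character, so that direction needs a visual or automated check of the text.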



In [None]:
#@title Python Code - Encoding and compression

# Demonstrate reading CSV files with different encodings and compression options.
# Show how incorrect encoding breaks special characters when loading text data.
# Show how pandas reads compressed CSV files directly without manual decompression.

import pandas as pd
from io import BytesIO

# Create a small CSV text with special characters and encode it as UTF-8 bytes.
text_utf8 = "name,city\nJosé,Montréal\nZoë,München"
bytes_utf8 = text_utf8.encode("utf-8")

# Simulate a file on disk using BytesIO, because encoding only applies to raw bytes.
file_like_utf8 = BytesIO(bytes_utf8)

# Read the CSV correctly using the matching UTF-8 encoding parameter.
df_correct = pd.read_csv(file_like_utf8, encoding="utf-8")

# Show the correctly decoded DataFrame with readable accented characters.
print("Correct UTF-8 decoding example:")
print(df_correct)

# Now read the same bytes using a mismatched Latin-1 encoding to show corruption.
file_like_latin = BytesIO(bytes_utf8)

df_wrong = pd.read_csv(file_like_latin, encoding="latin1")

# Show how the text may appear corrupted or altered with wrong encoding.
print("\nWrong Latin-1 decoding example:")
print(df_wrong)

# Next, create a small DataFrame and save it as a compressed CSV using gzip.
small_df = pd.DataFrame({"city": ["New York", "Los Angeles"], "temp_f": [72, 85]})

compressed_path = "cities_temps.csv.gz"

small_df.to_csv(compressed_path, index=False, compression="gzip")

# Read the compressed CSV directly by specifying the gzip compression parameter.
loaded_compressed = pd.read_csv(compressed_path, compression="gzip")

# Display the DataFrame loaded from the compressed gzip CSV file.
print("\nLoaded from gzip compressed CSV:")
print(loaded_compressed)



### **1.3. Handling large files with chunksize**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas 2.3.1 A-Z/Module_02/Lecture_A/image_01_03.jpg?v=1766641583" width="250">



>* Very large data files can overwhelm memory
>* Read data in row chunks using streaming

>* Process data in independent, chunk-sized steps
>* Compute partial stats, discard chunks, scale analysis

>* Tune chunk size to balance memory, speed
>* Plan how chunk outputs are stored or combined
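
One way to plan how chunk outputs are combined is to keep only the rows of interest from each chunk and concatenate them at the end. The sketch below assumes a made-up in-memory file and an arbitrary threshold of 15 miles; the pattern is the point, not the numbers.

```python
import pandas as pd
from io import StringIO

# Hypothetical data: 20 trips with increasing mileage.
csv_text = "trip_miles,fare_dollars\n" + "".join(
    f"{miles},{miles * 2.5}\n" for miles in range(1, 21)
)

kept_pieces = []
chunk_sizes = []

# Stream the file six rows at a time and keep only the long trips.
for chunk in pd.read_csv(StringIO(csv_text), chunksize=6):
    chunk_sizes.append(len(chunk))
    kept_pieces.append(chunk[chunk["trip_miles"] > 15])

# Combine the small filtered pieces into one DataFrame.
long_trips = pd.concat(kept_pieces, ignore_index=True)

print("Rows per chunk:", chunk_sizes)
print(long_trips)
```

This works when the filtered result fits in memory; if it does not, each filtered piece can be written to disk instead of collected in a list.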



In [None]:
#@title Python Code - Handling large files with chunksize

# Demonstrate reading large CSV files using chunksize parameter effectively.
# Simulate a big file by writing many rows to a temporary CSV file.
# Process each chunk to compute running totals without loading everything.

import pandas as pd
import os
import tempfile

rows_per_chunk = 10000
number_of_chunks = 5
columns = ["trip_miles", "fare_dollars"]

fd, temp_path = tempfile.mkstemp(suffix="_rides.csv")
os.close(fd)

with open(temp_path, "w") as file_handle:
    file_handle.write(",".join(columns) + "\n")
    for chunk_index in range(number_of_chunks):
        for row_index in range(rows_per_chunk):
            file_handle.write(f"{1.5},{12.0}\n")

print("Temporary CSV file path:", temp_path)
print("Simulated rows count:", rows_per_chunk * number_of_chunks)

chunk_size = 8000
running_miles_total = 0.0
running_fare_total = 0.0

for chunk_frame in pd.read_csv(temp_path, chunksize=chunk_size):
    chunk_miles_sum = chunk_frame["trip_miles"].sum()
    chunk_fare_sum = chunk_frame["fare_dollars"].sum()
    running_miles_total += float(chunk_miles_sum)
    running_fare_total += float(chunk_fare_sum)
    print("Processed chunk rows count:", len(chunk_frame))

print("Total miles across all chunks:", running_miles_total)
print("Total fare dollars across all chunks:", running_fare_total)

os.remove(temp_path)



## **2. Excel data ingestion**

### **2.1. Reading Excel Sheets**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas 2.3.1 A-Z/Module_02/Lecture_A/image_02_01.jpg?v=1766641620" width="250">



>* Load chosen Excel sheets into DataFrames
>* Control column types, missing values, and dates

>* Set correct types for numbers, text, dates
>* Keep types consistent across sheets to avoid errors

>* Identify messy dates, codes, and placeholders
>* Define missing values, types, and date parsing
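
Placeholders typed into a numeric column are a common reason an Excel column loads as text. The sketch below assumes a made-up sheet where someone entered "not recorded" in a numeric column, and that an Excel engine such as openpyxl is installed; `na_values` tells pandas to treat that marker as missing.

```python
import pandas as pd
from io import BytesIO

# Hypothetical sheet: IDs with leading zeros and a text placeholder
# inside an otherwise numeric column.
readings = pd.DataFrame({
    "Sensor_ID": ["007", "008", "009"],
    "Reading": [21.5, "not recorded", 19.0],
})

# Write the sheet to an in-memory workbook.
buffer = BytesIO()
with pd.ExcelWriter(buffer) as writer:
    readings.to_excel(writer, sheet_name="Readings", index=False)
buffer.seek(0)

# Keep IDs as strings and map the placeholder to a missing value.
loaded = pd.read_excel(
    buffer,
    sheet_name="Readings",
    dtype={"Sensor_ID": "string"},
    na_values=["not recorded"],
)

print(loaded)
print(loaded.dtypes)
```

With the placeholder mapped to missing, the `Reading` column can be treated as numeric rather than as text.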



In [None]:
#@title Python Code - Reading Excel Sheets

# Demonstrate reading Excel sheets into DataFrames with controlled column types.
# Show how to choose sheets and inspect interpreted data types clearly.
# Run in Colab to see how Excel data becomes structured DataFrames.

import pandas as pd
from io import BytesIO

# Create a small DataFrame representing monthly sales data.
sales_data = {
    "Month": ["Jan", "Feb", "Mar", "Apr"],
    "Store_ID": ["001", "002", "001", "003"],
    "Sales_USD": [1200.5, 950.0, 1430.0, 800.0],
}


sales_df = pd.DataFrame(sales_data)

# Create another DataFrame representing simple employee hire information.
hire_data = {
    "Employee_ID": ["A10", "B20", "C30", "D40"],
    "Hire_Date": ["2024-01-05", "2024-02-10", "2024-03-15", "2024-04-20"],
    "Department": ["Sales", "Support", "Sales", "Finance"],
}


hire_df = pd.DataFrame(hire_data)

# Save both DataFrames into a single in memory Excel workbook.
excel_buffer = BytesIO()
with pd.ExcelWriter(excel_buffer, engine="xlsxwriter") as writer:
    sales_df.to_excel(writer, sheet_name="Monthly_Sales", index=False)
    hire_df.to_excel(writer, sheet_name="Hires", index=False)


excel_buffer.seek(0)

# Read the sales sheet, forcing Store_ID to stay as string identifiers.
read_sales = pd.read_excel(
    excel_buffer,
    sheet_name="Monthly_Sales",
    dtype={"Store_ID": "string"},
)


print("Sales sheet dtypes after reading:")
print(read_sales.dtypes)

# Reset buffer position before reading another sheet from the same workbook.
excel_buffer.seek(0)

# Read the hires sheet, parsing Hire_Date as real datetime values.
read_hires = pd.read_excel(
    excel_buffer,
    sheet_name="Hires",
    parse_dates=["Hire_Date"],
)


print("\nHires sheet dtypes after reading:")
print(read_hires.dtypes)



### **2.2. Handling header rows**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas 2.3.1 A-Z/Module_02/Lecture_A/image_02_02.jpg?v=1766641643" width="250">



>* Decide which Excel row is real header
>* Header choice impacts column names and types

>* Wrong headers can corrupt types and missing values
>* Pick the true label row to guide parsing

>* Use flexible header options for complex Excel layouts
>* Rename columns to clarify dates and missing values
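
As a small sketch of the header options and the renaming step, `header=2` points `read_excel` at the third row of the sheet, and a list comprehension then normalizes the labels. The layout and the snake_case naming rule are assumptions for illustration, and an Excel engine such as openpyxl is assumed to be installed.

```python
import pandas as pd
from io import BytesIO

# Hypothetical layout: title row, note row, real labels, then data.
layout = pd.DataFrame([
    ["Sales Report 2024", None, None],
    ["All amounts in dollars", None, None],
    ["Date", "Units Sold", "Revenue"],
    ["2024-01-01", 120, 2500.0],
    ["2024-01-02", 150, 3100.0],
])

# header=False and index=False write the cells exactly as laid out.
buffer = BytesIO()
layout.to_excel(buffer, index=False, header=False)
buffer.seek(0)

# Row 2 (zero-based) holds the real labels; rows above it are ignored.
report = pd.read_excel(buffer, header=2)

# Normalize labels so later code does not depend on spacing or case.
report.columns = [
    str(label).strip().lower().replace(" ", "_") for label in report.columns
]

print(report)
```

Using `header=2` here has the same effect as `skiprows=2` with the default header; which one reads better depends on whether you think of the extra rows as "lines to skip" or of the label row as "the header position".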



In [None]:
#@title Python Code - Handling header rows

# Demonstrate Excel header handling with pandas read_excel options.
# Show skipping decorative rows and choosing correct header row.
# Compare inferred column names and data types after different header choices.

import pandas as pd

# Create a small DataFrame that mimics messy Excel layout.
# First rows contain title and notes, then real column labels.
# Data rows include dates and numeric values for simple inspection.
raw_data = {
    "col1": ["Sales Report 2024", "All amounts in dollars", "Date", "2024-01-01", "2024-01-02"],
    "col2": ["Northeast region only", "Preliminary numbers only", "Units Sold", 120, 150],
    "col3": ["Draft version only", "Do not distribute", "Revenue", 2500.0, 3100.0],
}

messy_df = pd.DataFrame(raw_data)

# Save the messy DataFrame to an Excel file for ingestion demonstration.
# In real life this file would come from an external business system.
# header=False keeps the placeholder names col1..col3 out of the sheet,
# so the title really is the first row of the file.
excel_filename = "messy_sales_report.xlsx"
messy_df.to_excel(excel_filename, index=False, header=False)

# Read the Excel file using the default header behavior for comparison.
# Pandas will treat the first row as header, which is actually a title.
print("Default header=0 column names and dtypes:")
read_default = pd.read_excel(excel_filename)
print(read_default.dtypes)

# Read the same file while skipping decorative rows and choosing correct header.
# Here we skip first two rows and use the third row as header labels.
print("\nUsing skiprows=2 for correct header row:")
read_fixed = pd.read_excel(excel_filename, skiprows=2)
print(read_fixed.dtypes)

# Show the cleaned DataFrame head to confirm readable column names.
# This helps beginners see the practical effect of header handling.
print("\nCleaned DataFrame preview with proper headers:")
print(read_fixed.head())



### **2.3. Excel export options**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas 2.3.1 A-Z/Module_02/Lecture_A/image_02_03.jpg?v=1766641664" width="250">



>* Treat Excel export as part of ingestion
>* Careful export preserves types, dates, and reliability

>* Excel may change column types and nulls
>* Choose missing markers based on downstream Excel usage

>* Standardize Excel date formats to avoid misinterpretation
>* Use clear text dates, document formats, preserve timezones
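
Excel has no timezone-aware datetime type, and pandas is expected to refuse to write timezone-aware columns to Excel. One way to preserve the information, sketched here with made-up events and an assumed source timezone, is to convert to UTC and store an ISO 8601 text column before exporting.

```python
import pandas as pd

# Hypothetical events recorded in New York local time.
events = pd.DataFrame({
    "event": ["start", "stop"],
    "when": pd.to_datetime(
        ["2024-01-15 08:00", "2024-01-15 17:30"]
    ).tz_localize("America/New_York"),
})

# Convert to UTC and format as unambiguous ISO 8601 text.
export_ready = events.copy()
export_ready["when_utc_iso"] = (
    export_ready["when"].dt.tz_convert("UTC").dt.strftime("%Y-%m-%dT%H:%M:%SZ")
)

# Drop the timezone-aware column so the frame is safe to write to Excel.
export_ready = export_ready.drop(columns="when")

print(export_ready)
```

The resulting frame can then be passed to `to_excel`. Documenting the format string next to the export makes it possible to parse the column back later with `pd.to_datetime(..., utc=True)`.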



In [None]:
#@title Python Code - Excel export options

# Demonstrate Excel export options for types and missing values.
# Show how to control dates and missing markers during export.
# Verify that exported Excel data roundtrips back correctly.

import pandas as pd
from datetime import datetime

# Create a small DataFrame with product codes and dates.
# Include leading zeros, missing values, and mixed types.
# Use simple inches based product lengths for clarity.
data = {
    "product_code": ["0012", "0013", "A014", "0015"],
    "length_inches": [10.0, None, 12.5, 9.0],
    "sale_date": [datetime(2024, 1, 5), None, datetime(2024, 2, 1), datetime(2024, 3, 15)],
}

# Build the DataFrame and inspect dtypes before export.
# Product codes should remain strings, not integers.
# Missing lengths appear as NaN and missing dates as NaT.
df = pd.DataFrame(data)
print("Original DataFrame dtypes before export:")
print(df.dtypes)

# Export to Excel with controlled options for missing values and dates.
# Missing values are written as empty cells, which is the default na_rep.
# Convert dates to ISO formatted strings for safety; missing dates stay missing.
export_df = df.copy()
export_df["sale_date"] = export_df["sale_date"].dt.strftime("%Y-%m-%d")
export_df.to_excel("products_clean.xlsx", index=False)

# Read the Excel file back to verify roundtrip behavior.
# Let pandas infer types from the exported workbook.
# Then compare dtypes and values with the original DataFrame.
roundtrip_df = pd.read_excel("products_clean.xlsx")
print("\nRoundtrip DataFrame dtypes after Excel export:")
print(roundtrip_df.dtypes)

# Show the roundtripped data to highlight preserved product codes.
# Leading zeros should still appear in product_code column.
# Missing values should match expectations for downstream ingestion.
print("\nRoundtrip DataFrame preview:")
print(roundtrip_df)



## **3. Exporting DataFrames Safely**

### **3.1. Using to_csv Options**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas 2.3.1 A-Z/Module_02/Lecture_A/image_03_01.jpg?v=1766641690" width="250">



>* Choose path, separator, and header settings explicitly
>* Match options to downstream tools for reliable exports

>* Choose strings for missing values carefully
>* Control quoting to avoid parsing and spreadsheet issues

>* Standardize encoding, line endings, and file size handling
>* Decide append vs overwrite and document export settings
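
Quoting and line endings can be set explicitly in the same call. The sketch below uses made-up data with a comma inside a text field, and `csv.QUOTE_NONNUMERIC` from the standard library so that every text value is quoted while numbers are left bare.

```python
import csv
import pandas as pd
from io import StringIO

# Hypothetical data where one text value contains the separator itself.
regions_df = pd.DataFrame({
    "region": ["North", "South, East"],
    "units_sold": [120, 95],
})

# With no path given, to_csv returns the CSV text as a string.
csv_text = regions_df.to_csv(
    index=False,
    quoting=csv.QUOTE_NONNUMERIC,  # quote text fields, leave numbers bare
    lineterminator="\n",           # fixed line ending on every platform
)
print(csv_text)

# The embedded comma survives a read because the field was quoted.
restored = pd.read_csv(StringIO(csv_text))
print(restored)
```

Pinning `lineterminator` is mainly useful when exported files are compared byte for byte or shared between operating systems.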



In [None]:
#@title Python Code - Using to_csv Options

# Demonstrate basic DataFrame export using to_csv options.
# Show separator, header, index, and missing value handling clearly.
# Help create predictable CSV files for sharing across different tools.

import pandas as pd

# Create a small DataFrame representing weekly sales data.
data = {
    "week": ["2024-01-01", "2024-01-08", "2024-01-15"],
    "region": ["North", "South", "West"],
    "units_sold": [120, None, 95],
}

sales_df = pd.DataFrame(data)

# Export with comma separator, header included, index excluded, custom missing marker.
comma_path = "sales_comma.csv"
sales_df.to_csv(
    comma_path,
    sep=",",
    header=True,
    index=False,
    na_rep="MISSING",
)

# Export with semicolon separator, header included, index excluded, different missing marker.
semicolon_path = "sales_semicolon.csv"
sales_df.to_csv(
    semicolon_path,
    sep=";",
    header=True,
    index=False,
    na_rep="NA_CODE",
)

# Read both files back as raw text to show their different formats.
with open(comma_path, "r", encoding="utf-8") as comma_file:
    comma_text = comma_file.read()
with open(semicolon_path, "r", encoding="utf-8") as semicolon_file:
    semicolon_text = semicolon_file.read()

# Print short previews so students see how options changed the files.
print("Comma separated CSV preview:\n", comma_text.strip())
print("\nSemicolon separated CSV preview:\n", semicolon_text.strip())



### **3.2. Formatting Index and Floats**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas 2.3.1 A-Z/Module_02/Lecture_A/image_03_02.jpg?v=1766641709" width="250">



>* Decide whether to save the index column
>* Preserve meaningful indexes and name them clearly

>* Choose sensible decimal places for float exports
>* Consistent float formatting improves sharing and reproducibility

>* Match index and float formatting to workflow
>* Standardize settings to protect accuracy and reproducibility
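
When the index carries meaning, naming it on export and restoring it on read keeps it from coming back as an anonymous column. A minimal sketch, assuming made-up city temperatures and a one-decimal format chosen only for illustration:

```python
import pandas as pd
from io import StringIO

# Hypothetical frame whose index is a meaningful key (the city name).
temps = pd.DataFrame(
    {"temperature_f": [72.4567, 65.3333]},
    index=["Boston", "Denver"],
)

# index_label gives the saved index column a clear header.
csv_text = temps.to_csv(index=True, index_label="city", float_format="%.1f")
print(csv_text)

# index_col restores that column as the index when reading back.
restored = pd.read_csv(StringIO(csv_text), index_col="city")
print(restored)
```

Note that `float_format` rounds the values written to the file, so the restored numbers match the exported precision rather than the original one.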



In [None]:
#@title Python Code - Formatting Index and Floats

# Demonstrate exporting DataFrame index formatting choices with to_csv options.
# Show how including or excluding index column changes saved file structure.
# Show how float_format controls decimal places for exported numeric values.

import pandas as pd

# Create simple DataFrame with default integer index and float values.
data = {"city": ["Boston", "Denver", "Dallas"], "temperature_f": [72.4567, 65.3333, 88.9999]}

df = pd.DataFrame(data)

# Export including index, with four decimal places for float values.
file_with_index = "temperatures_with_index.csv"

df.to_csv(file_with_index, index=True, float_format="%.4f")

# Export excluding index, with two decimal places for float values.
file_without_index = "temperatures_without_index.csv"

df.to_csv(file_without_index, index=False, float_format="%.2f")

# Read both files back to compare how index and floats were stored.
# Without index_col, the saved index comes back as an "Unnamed: 0" column.
read_with_index = pd.read_csv(file_with_index)

read_without_index = pd.read_csv(file_without_index)

# Print small samples to observe index column presence and float precision.
print("With index column and four decimals:")

print(read_with_index.head())

print("\nWithout index column and two decimals:")

print(read_without_index.head())



### **3.3. Validating File Round Trips**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas 2.3.1 A-Z/Module_02/Lecture_A/image_03_03.jpg?v=1766641728" width="250">



>* Round-trip files to confirm structure and meaning
>* Catch silent type, truncation, and missing-value errors

>* Check columns, order, and data types match
>* Verify identifiers, numbers, text, and encodings stay intact

>* Watch for locale, formatting, and parsing changes
>* Spot check key columns to ensure data integrity
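
Identifiers with leading zeros are a typical silent casualty of a round trip. The sketch below, built on made-up store IDs, compares a naive read with one that pins the identifier dtype, and uses `pd.testing.assert_frame_equal`, which raises with a description of the first mismatch rather than returning a bare True or False.

```python
import pandas as pd
from io import StringIO

# Hypothetical data with zero-padded identifiers.
original = pd.DataFrame({
    "store_id": ["0012", "0340"],
    "sale_dollars": [19.99, 5.5],
})
csv_text = original.to_csv(index=False)

# Naive read: pandas infers integers and the leading zeros are lost.
naive = pd.read_csv(StringIO(csv_text))
ids_survive_naive = (
    naive["store_id"].astype(str).tolist() == original["store_id"].tolist()
)

# Careful read: pin the identifier column to text.
careful = pd.read_csv(StringIO(csv_text), dtype={"store_id": str})

# Raises AssertionError with details if values, dtypes, or columns differ.
pd.testing.assert_frame_equal(original, careful)

print("IDs intact after naive read:", ids_survive_naive)
print("IDs after careful read:", careful["store_id"].tolist())
```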



In [None]:
#@title Python Code - Validating File Round Trips

# Demonstrate saving a DataFrame and reading it back safely.
# Show how to compare original and loaded DataFrames.
# Help validate that a file round trip preserved important information.

import pandas as pd
from pathlib import Path

# Create a tiny DataFrame representing simple store sales data.
data = {
    "store_id": ["A001", "A002", "A003"],
    "sale_dollars": [19.99, 5.50, 120.00],
    "sale_date": ["2024-01-01", "2024-01-02", "2024-01-03"],
}

# Build the DataFrame and parse dates for correct dtypes.
df_original = pd.DataFrame(data)
df_original["sale_date"] = pd.to_datetime(df_original["sale_date"], format="%Y-%m-%d")

# Choose a temporary CSV path inside the current working directory.
file_path = Path("round_trip_example.csv")

# Save the DataFrame with explicit options for reproducibility.
df_original.to_csv(file_path, index=False, float_format="%.2f")

# Read the file back, parsing dates to match original dtypes.
df_loaded = pd.read_csv(file_path, parse_dates=["sale_date"])

# Compare shapes, column names, and dtypes for quick structural checks.
print("Original shape and columns:", df_original.shape, list(df_original.columns))
print("Loaded shape and columns:", df_loaded.shape, list(df_loaded.columns))

# Check whether dtypes match between original and loaded DataFrames.
print("Original dtypes:\n", df_original.dtypes)
print("Loaded dtypes:\n", df_loaded.dtypes)

# Use equals to confirm that all values match exactly after the round trip.
print("DataFrames equal after round trip:", df_original.equals(df_loaded))




# <font color="#418FDE" size="6.5" uppercase>**Reading Tabular Data**</font>


In this lecture, you learned to:
- Use read_csv and read_excel to load tabular data into DataFrames with appropriate options. 
- Control column types, missing value markers, and date parsing during file ingestion. 
- Export cleaned DataFrames back to disk using to_csv and to_excel with reproducible settings. 

In the next lecture (Lecture B), we will go over 'APIs and Databases'.