# <font color="#418FDE" size="6.5" uppercase>**Importing Data**</font>

>Last update: 20251225.
    
By the end of this Lecture, you will be able to:
- Load tabular data from CSV, Excel, and Parquet files into Pandas 2.3.1 DataFrames using appropriate read_* functions. 
- Configure import options such as separators, headers, dtypes, and date parsing to correctly interpret raw files. 
- Diagnose and fix common import issues including bad encodings, unexpected missing values, and mixed-type columns. 


## **1. Reading CSV Files**

### **1.1. Core read_csv options**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_A/image_01_01.jpg?v=1766706987" width="250">



>* CSV is plain text forming table-like rows
>* Defaults usually create a usable, well-typed DataFrame

>* Choose and position headers to match data
>* Limit or skip rows so DataFrame structure fits

>* Control dtypes, missing values, and parsing behavior
>* Careful options create accurate, analysis-ready DataFrames
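
The row-limiting and missing-value options above can be sketched as follows. This is a minimal illustration, assuming a hypothetical CSV text with one note line and a `missing` placeholder; the column names are invented for the example.

```python
import pandas as pd
from io import StringIO

# Hypothetical CSV text: one note line, a header row, then four data rows.
raw_text = """exported by a hypothetical tool
id,score,grade
1,90,A
2,missing,B
3,75,C
4,60,D
"""

# skiprows=1 drops the note line so the header row is read as column names.
# nrows=3 keeps only the first three data rows.
# na_values marks the "missing" placeholder as NaN, so score stays numeric.
limited_df = pd.read_csv(StringIO(raw_text), skiprows=1, nrows=3, na_values=["missing"])

print(limited_df)
print(limited_df.dtypes)
```

Without `na_values`, the `score` column would be read as text because of the placeholder, which is one reason to declare placeholders at import time.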



In [None]:
#@title Python Code - Core read csv options

# Demonstrate core pandas read_csv options with simple in memory CSV text.
# Show header handling, row skipping, and column data type control clearly.
# Designed for beginners using Google Colab with minimal printed output.

import pandas as pd
from io import StringIO

csv_text = """Note line before header
Another note describing file
id,name,zip_code,age
001,Anna,02115,28
002,Bob,30301,35
003,Cara,10001,41
"""

csv_buffer = StringIO(csv_text)

print("Original CSV text preview:")
print("\n".join(csv_text.splitlines()[:4]))

csv_buffer.seek(0)

basic_df = pd.read_csv(csv_buffer, header=2)

print("\nDataFrame with inferred types:")
print(basic_df.head())

csv_buffer.seek(0)

dtyped_df = pd.read_csv(csv_buffer, header=2, dtype={"id": "string", "zip_code": "string"})

print("\nDataFrame with controlled types:")
print(dtyped_df.dtypes)



### **1.2. Handling delimiters and headers**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_A/image_01_02.jpg?v=1766707004" width="250">



>* Check actual delimiter and header row structure
>* These choices control column names and field splitting

>* Clean, clear headers make analysis easier, safer
>* Tidy messy headers or replace them entirely

>* Embedded delimiters need correct quoting to parse
>* Check rows, headers, and delimiters to avoid misalignment
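
Quoting and header tidying can be sketched like this. The example assumes a hypothetical comma-separated text in which two fields contain commas and are wrapped in double quotes; the header labels are deliberately messy.

```python
import pandas as pd
from io import StringIO

# Hypothetical CSV text: fields with embedded commas are wrapped in double quotes.
raw_text = (
    " Product Name ,Units Sold,City\n"
    '"Widget, large",10,"Boston, MA"\n'
    "Gadget,5,Chicago\n"
)

# The default quotechar is '"', so quoted commas stay inside their fields.
quoted_df = pd.read_csv(StringIO(raw_text))

# Tidy the header labels: trim spaces, lowercase, and replace inner spaces.
quoted_df.columns = quoted_df.columns.str.strip().str.lower().str.replace(" ", "_")

print(quoted_df)
```

If the quotes were missing in the raw file, the first data row would contain five fields instead of three, which is the kind of misalignment worth checking for right after import.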



In [None]:
#@title Python Code - Handling delimiters and headers

# Demonstrate reading CSV files with different delimiters and header configurations.
# Show how wrong delimiter assumptions collapse columns into single wide column.
# Show how header options change column names and skipped descriptive rows.

import pandas as pd
from io import StringIO

csv_semicolon_text = "Name;Age;City\nAlice;30;Boston\nBob;25;Chicago"
print("Raw semicolon separated text preview:")
print(csv_semicolon_text.split("\n")[0])

wrong_delimiter_buffer = StringIO(csv_semicolon_text)
df_wrong = pd.read_csv(wrong_delimiter_buffer, delimiter=",")
print("\nUsing comma delimiter, columns look collapsed:")
print(df_wrong.head())

correct_delimiter_buffer = StringIO(csv_semicolon_text)
df_correct = pd.read_csv(correct_delimiter_buffer, delimiter=";")
print("\nUsing semicolon delimiter, columns separate correctly:")
print(df_correct.head())

messy_header_text = "Sales report for 2024\nValues in US dollars\nProduct,Units,Revenue\nWidget,10,250.0\nGadget,5,150.0"
print("\nRaw messy header first three lines:")
for line in messy_header_text.split("\n")[:3]:
    print(line)

messy_buffer = StringIO(messy_header_text)
df_skip_header = pd.read_csv(messy_buffer, skiprows=2)
print("\nAfter skipping two lines, header row becomes column names:")
print(df_skip_header.head())

messy_buffer_custom = StringIO(messy_header_text)
custom_names = ["product_name", "units_sold", "revenue_usd"]
df_custom_header = pd.read_csv(messy_buffer_custom, skiprows=3, header=None, names=custom_names)
print("\nUsing custom header names after skipping three lines:")
print(df_custom_header.head())



### **1.3. Efficient Large CSV Loading**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_A/image_01_03.jpg?v=1766707034" width="250">



>* Treat huge CSVs as row streams, not wholes
>* Process chunks, keep summaries or cleaned subsets only

>* Read only needed columns to save resources
>* Set dtypes upfront to speed and stabilize imports

>* Simplify parsing options and delay complex cleaning
>* Process chunks efficiently and convert to columnar format
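
Keeping only a cleaned subset from each chunk can be sketched as below. The data is a hypothetical ten-row text built in memory, and the tiny `chunksize` is only for illustration; a real file would use a much larger value.

```python
import pandas as pd
from io import StringIO

# Hypothetical data: ten rows alternating between two countries.
rows = ["id,country,spend,comment"]
rows += [f"{i},{'US' if i % 2 == 0 else 'CA'},{i * 10},note {i}" for i in range(10)]
raw_text = "\n".join(rows)

reader = pd.read_csv(
    StringIO(raw_text),
    usecols=["id", "country", "spend"],         # skip the unused comment column
    dtype={"id": "int32", "spend": "float32"},  # fix dtypes up front
    chunksize=4,                                # read four rows at a time
)

# Keep only the US rows from each chunk, then combine the small pieces.
kept_parts = [chunk[chunk["country"] == "US"] for chunk in reader]
us_rows = pd.concat(kept_parts, ignore_index=True)

print(us_rows)
```

Only the filtered pieces are held in memory at the end, so peak memory depends on the chunk size and the size of the kept subset rather than on the whole file.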



In [None]:
#@title Python Code - Efficient Large CSV Loading

# Demonstrate reading large CSV files using chunks efficiently.
# Show selecting useful columns and specifying data types explicitly.
# Summarize chunked data while keeping memory usage comfortably small.

import pandas as pd
import numpy as np

# Create a sample CSV file that imitates a larger dataset.
# We use many rows but keep values simple for quick processing.
num_rows = 50000

np.random.seed(0)
customer_ids = np.random.randint(10000, 20000, size=num_rows)

# randint excludes the upper bound, so these cover days 1-28 and months 1-12.
signup_days = np.random.randint(1, 29, size=num_rows)
signup_months = np.random.randint(1, 13, size=num_rows)

signup_dates = [f"2024-{m:02d}-{d:02d}" for m, d in zip(signup_months, signup_days)]

countries = np.random.choice(["US", "CA", "UK", "MX"], size=num_rows)

spend_dollars = np.random.gamma(shape=2.0, scale=50.0, size=num_rows)

full_df = pd.DataFrame({
    "customer_id": customer_ids,
    "signup_date": signup_dates,
    "country": countries,
    "spend_usd": spend_dollars,
})

csv_path = "large_customers.csv"
full_df.to_csv(csv_path, index=False)

# Now read the CSV efficiently using chunks and selected columns.
# We also specify data types to avoid expensive automatic inference.
use_columns = ["customer_id", "country", "spend_usd"]

dtype_map = {"customer_id": "int32", "country": "category", "spend_usd": "float32"}

chunk_size = 10000

reader = pd.read_csv(csv_path, usecols=use_columns, dtype=dtype_map, chunksize=chunk_size)

total_spend_by_country = {}

for chunk in reader:
    grouped = chunk.groupby("country", observed=True)["spend_usd"].sum()
    for country, value in grouped.items():
        total_spend_by_country[country] = total_spend_by_country.get(country, 0.0) + float(value)

print("Total spend by country using chunked loading:")

for country, value in total_spend_by_country.items():
    print(f"{country}: ${value:,.2f}")




## **2. Excel and Parquet Imports**

### **2.1. Selecting Excel Sheets**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_A/image_02_01.jpg?v=1766707067" width="250">



>* Excel files can contain many different sheets
>* Careful sheet choice ensures you load intended data

>* Identify sheets by index or clear names
>* Inspect workbook, then explicitly import the correct sheet

>* Import and combine related sheets for analysis
>* Use or ignore metadata sheets to guide interpretation
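
Inspecting a workbook and selecting a sheet by position can be sketched as follows. The workbook is a hypothetical two-sheet file built in memory, and the example assumes an Excel engine such as openpyxl is installed.

```python
import pandas as pd
from io import BytesIO

# Build a hypothetical workbook in memory with one data sheet and one notes sheet.
buffer = BytesIO()
with pd.ExcelWriter(buffer) as writer:
    pd.DataFrame({"City": ["Boston"], "Sales": [1200]}).to_excel(
        writer, sheet_name="USA_Sales", index=False
    )
    pd.DataFrame({"Note": ["Internal"]}).to_excel(
        writer, sheet_name="Notes", index=False
    )

buffer.seek(0)

# ExcelFile opens the workbook once, which allows listing sheets before loading any.
with pd.ExcelFile(buffer) as workbook:
    sheet_names = workbook.sheet_names
    # sheet_name=0 selects the first sheet by position rather than by name.
    first_sheet = workbook.parse(sheet_name=0)

print(sheet_names)
print(first_sheet)
```

Selecting by position is convenient but fragile if someone reorders the sheets, so selecting by name is usually the safer choice once the names are known.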



In [None]:
#@title Python Code - Selecting Excel Sheets

# Demonstrate selecting Excel sheets when importing data with pandas.
# Create a sample Excel file containing multiple related sheets.
# Load specific sheets by name, then display combined results.

import pandas as pd
from io import BytesIO

# Create example DataFrames representing different Excel workbook sheets.
usa_data = pd.DataFrame({"City": ["Boston", "Dallas"], "Sales_dollars": [1200, 1500]})
canada_data = pd.DataFrame({"City": ["Toronto", "Calgary"], "Sales_dollars": [900, 1100]})
notes_data = pd.DataFrame({"Note": ["Internal summary only"]})

# Save DataFrames into one in-memory Excel file with multiple named sheets.
excel_buffer = BytesIO()
with pd.ExcelWriter(excel_buffer, engine="xlsxwriter") as writer:
    usa_data.to_excel(writer, sheet_name="USA_Sales", index=False)
    canada_data.to_excel(writer, sheet_name="Canada_Sales", index=False)
    notes_data.to_excel(writer, sheet_name="Notes", index=False)

# Move buffer position to start, then read workbook using pandas read_excel.
excel_buffer.seek(0)
all_sheets = pd.read_excel(excel_buffer, sheet_name=None)
print("Available sheet names in workbook:")
print(list(all_sheets.keys()))

# Load a single sheet by visible sheet name, then display small preview.
excel_buffer.seek(0)
usa_only = pd.read_excel(excel_buffer, sheet_name="USA_Sales")
print("\nUSA sheet preview after selecting by name:")
print(usa_only)

# Load multiple sheets at once, then concatenate them into one DataFrame.
excel_buffer.seek(0)
selected = pd.read_excel(excel_buffer, sheet_name=["USA_Sales", "Canada_Sales"])
combined = pd.concat(selected.values(), ignore_index=True)
print("\nCombined sales from selected sheets:")
print(combined)



### **2.2. Parsing Dates and Numbers**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_A/image_02_02.jpg?v=1766707086" width="250">



>* Decide how Excel dates and numbers import
>* Consistent parsing prevents confusing, incorrect analysis results

>* Same-looking Excel values can mean different things
>* Control parsing to avoid date and number mistakes

>* Plan for mixed, invalid, or placeholder values
>* Sample data, set types, and handle inconsistencies
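
The point that same-looking values can mean different things can be sketched with plain text values. The Series below are hypothetical and stand in for cells that Excel stored as text; the European number format is an assumption made for the example.

```python
import pandas as pd

# Hypothetical text values, standing in for cells that Excel stored as text.
date_text = pd.Series(["03/04/2024", "05/06/2024"])
amount_text = pd.Series(["1.234,50", "750,25"])

# The same strings give different dates under different format assumptions.
month_first = pd.to_datetime(date_text, format="%m/%d/%Y")
day_first = pd.to_datetime(date_text, format="%d/%m/%Y")

# European-style numbers: drop the thousands dots, then turn the decimal comma into a dot.
amounts = pd.to_numeric(
    amount_text.str.replace(".", "", regex=False).str.replace(",", ".", regex=False)
)

print(month_first.dt.strftime("%Y-%m-%d").tolist())
print(day_first.dt.strftime("%Y-%m-%d").tolist())
print(amounts.tolist())
```

Stating the format explicitly makes the assumption visible in the code, so a wrong guess fails loudly instead of silently swapping days and months.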



In [None]:
#@title Python Code - Parsing Dates and Numbers

# Demonstrate parsing Excel dates and numbers after importing with pandas read_excel.
# Show problems with default parsing and then fix them with explicit date parsing.
# Highlight numeric parsing of text values that contain thousands separators.

import pandas as pd
from io import BytesIO

# Create a small DataFrame with mixed date and numeric representations.
data = {
    "order_date": ["2024-01-02", "01/03/2024", "2024-01-04"],
    "delivery_date": ["02-01-2024", "03-01-2024", "TBD"],
    "budget_text": ["1,234.50", "2,500.00", "750.25"],
}

# Convert the DataFrame into an in memory Excel file using BytesIO buffer.
df_original = pd.DataFrame(data)
excel_buffer = BytesIO()
df_original.to_excel(excel_buffer, index=False)

# Move buffer position back to start before reading with pandas read_excel function.
excel_buffer.seek(0)

# Read Excel file with default options to inspect inferred column data types.
df_default = pd.read_excel(excel_buffer)
print("Default dtypes and values:")
print(df_default.dtypes)

# Reset buffer position again because previous read moved internal pointer forward.
excel_buffer.seek(0)

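# Read the Excel file again, then parse order_date and delivery_date explicitly.
# These columns mix formats and contain a "TBD" placeholder, so each value is parsed
# on its own (format="mixed") and unparseable entries become NaT.
df_parsed = pd.read_excel(excel_buffer)
for date_column in ["order_date", "delivery_date"]:
    df_parsed[date_column] = pd.to_datetime(
        df_parsed[date_column], format="mixed", dayfirst=False, errors="coerce"
    )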
print("\nParsed dates dtypes and values:")
print(df_parsed.dtypes)

# Convert budget_text column into numeric values handling thousands separators and decimals.
df_parsed["budget_numeric"] = pd.to_numeric(df_parsed["budget_text"].str.replace(",", ""), errors="coerce")
print("\nParsed numeric budget column:")
print(df_parsed[["budget_text", "budget_numeric"]])



### **2.3. Parquet Data Imports**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_A/image_02_03.jpg?v=1766707105" width="250">



>* Parquet stores columnar data with built-in metadata
>* Import focuses on using or overriding this schema

>* Check Parquet datetime encodings and timezones on import
>* Validate numeric dtypes to avoid precision loss or rounding errors

>* Check and standardize categorical and string columns
>* Treat import as schema negotiation for consistency



In [None]:
#@title Python Code - Parquet Data Imports

# Demonstrate basic Parquet imports with pandas DataFrames.
# Show how Parquet preserves schema and data types.
# Adjust dtypes after import for correct downstream analysis.

import pandas as pd
from datetime import datetime, timezone

# Create a small DataFrame with mixed types.
data = {
    "order_id": ["0001", "0002", "0003"],
    "amount_usd": [19.99, 5.50, 120.00],
    "purchased_at_utc": [
        datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc),
        datetime(2024, 1, 2, 18, 30, tzinfo=timezone.utc),
        datetime(2024, 1, 3, 20, 45, tzinfo=timezone.utc),
    ],
}

# Build the DataFrame and inspect original dtypes.
df_original = pd.DataFrame(data)
print("Original DataFrame dtypes:")
print(df_original.dtypes)

# Save the DataFrame to a Parquet file in the current directory.
parquet_path = "orders_example.parquet"
df_original.to_parquet(parquet_path, index=False)

# Read the Parquet file back into a new DataFrame.
df_loaded = pd.read_parquet(parquet_path)
print("\nLoaded DataFrame dtypes:")
print(df_loaded.dtypes)

# Convert UTC timestamps to the America/New_York timezone for analysis.
df_loaded["purchased_at_local"] = df_loaded["purchased_at_utc"].dt.tz_convert("America/New_York")

# Ensure order_id stays string, even if numeric looking.
df_loaded["order_id"] = df_loaded["order_id"].astype("string")

# Show final dtypes and a small preview.
print("\nAdjusted DataFrame dtypes:")
print(df_loaded.dtypes)
print("\nAdjusted DataFrame preview:")
print(df_loaded.head())



## **3. Troubleshooting Data Imports**

### **3.1. Encoding and Missing Values**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_A/image_03_01.jpg?v=1766707126" width="250">



>* Encoding mismatches cause garbled text or errors
>* Common with international characters; recognize mismatch symptoms

>* Infer likely encoding from data source
>* Test encodings and inspect characters to confirm

>* Placeholder codes can hide or fake missing data
>* Use domain knowledge to map placeholders to NA
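
Testing candidate encodings in order can be sketched with a small helper. The function name `read_csv_with_fallback` and the candidate list are hypothetical choices for this example, not part of pandas.

```python
import io
import pandas as pd


def read_csv_with_fallback(raw_bytes, encodings=("utf-8", "latin-1"), **kwargs):
    """Hypothetical helper: try each encoding in turn and return (DataFrame, encoding)."""
    for encoding in encodings:
        try:
            frame = pd.read_csv(io.BytesIO(raw_bytes), encoding=encoding, **kwargs)
            return frame, encoding
        except UnicodeDecodeError:
            # This candidate cannot decode the bytes, so move on to the next one.
            continue
    raise ValueError("None of the candidate encodings could decode the data.")


# Hypothetical legacy export saved as Latin-1.
raw_bytes = "name,age\nJosé,29\nMüller,40\n".encode("latin-1")

names_df, used_encoding = read_csv_with_fallback(raw_bytes)

print("Encoding that worked:", used_encoding)
print(names_df)
```

Latin-1 can decode any byte sequence, so it never raises an error; it belongs at the end of the candidate list, and the decoded characters still need a visual check.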



In [None]:
#@title Python Code - Encoding and Missing Values

# Demonstrate encoding issues and missing value placeholders during data import.
# Create small CSV files with different encodings and placeholder missing values.
# Show how encoding and na_values options fix garbled text and fake numbers.

import pandas as pd
import io

# Create a small CSV string with accented names and placeholder values.
csv_text = "name,age,height_inches\nJosé,29,70\nMüller,NA,0\nAna,35,-1\n"

# Save the CSV bytes using Latin-1 encoding to simulate legacy system output.
latin1_bytes = csv_text.encode("latin-1")

# Read the bytes incorrectly as UTF-8 using BytesIO, causing decoding problems.
try:
    bad_df = pd.read_csv(io.BytesIO(latin1_bytes), encoding="utf-8")
    print("Incorrect encoding import, names look wrong:")
    print(bad_df["name"])
except UnicodeDecodeError as error:
    print("Import failed due to UnicodeDecodeError:")
    print(error)

# Read the same bytes using correct Latin-1 encoding to restore proper characters.
correct_df = pd.read_csv(io.BytesIO(latin1_bytes), encoding="latin-1")
print("\nCorrect encoding import, names look correct:")
print(correct_df["name"])

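# Read again while telling pandas which placeholders represent missing values.
# The dict limits 0 and -1 to height_inches, so real zeros elsewhere are kept;
# "NA" in the age column is already treated as missing by default.
na_df = pd.read_csv(io.BytesIO(latin1_bytes), encoding="latin-1", na_values={"height_inches": [0, -1]})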
print("\nMissing values correctly recognized using na_values option:")
print(na_df)



### **3.2. Fixing Mixed Types**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_A/image_03_02.jpg?v=1766707143" width="250">



>* Mixed-type columns mix numbers, text, and dates
>* They break calculations, so inspect types and values

>* Check inferred column dtypes against expected meanings
>* Scan values and docs to find inconsistent entries

>* Choose a target type and clean values
>* Standardize formats, handle exceptions, or split columns
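
Finding the inconsistent entries before choosing a target type can be sketched as follows. The Series is hypothetical; the idea is to compare the original values with a coerced numeric version.

```python
import pandas as pd

# Hypothetical column as it might arrive from a raw file.
weights = pd.Series(["10.5", "unknown", "8.0", "-", None], name="weight_pounds")

# Coerce to numeric: anything that cannot be converted becomes NaN.
converted = pd.to_numeric(weights, errors="coerce")

# Entries that were present in the raw data but failed conversion.
failed_mask = converted.isna() & weights.notna()
bad_values = weights[failed_mask].unique().tolist()

print("Values that failed numeric conversion:", bad_values)
print("Rows affected:", int(failed_mask.sum()))
```

Listing the failing values first makes it easier to decide whether they are placeholders to map to missing, typos to correct, or a sign that the column should be split.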



In [None]:
#@title Python Code - Fixing Mixed Types

# Demonstrate detecting mixed types in imported columns using pandas DataFrames.
# Show how non numeric placeholders break numeric operations during analysis.
# Clean placeholders, convert column types, and compare behavior before and after.

import pandas as pandas_library
import numpy as numeric_library

# Create example data with mixed types in numeric and date columns.
data_dictionary = {
    "order_id": ["A001", "A002", "A003", "A004"],
    "weight_pounds": ["10.5", "unknown", "8.0", "-"],
    "ship_date": ["2024-01-05", "01/06/2024", "pending", "2024-01-08"],
}

mixed_frame = pandas_library.DataFrame(data_dictionary)
print("Original DataFrame with mixed types:")
print(mixed_frame)

# Show inferred dtypes, weight and ship_date appear as generic object types.
print("\nInferred dtypes before cleaning:")
print(mixed_frame.dtypes)

# Attempt numeric mean on weight column, this fails due to mixed string values.
try:
    print("\nAttempting numeric mean on weight_pounds column:")
    print(mixed_frame["weight_pounds"].mean())
except TypeError as error_object:
    print("Operation failed due to mixed types:", error_object)

# Replace placeholders with proper missing values, then convert to numeric type.
clean_frame = mixed_frame.copy()
clean_frame["weight_pounds"] = clean_frame["weight_pounds"].replace([
    "unknown",
    "-",
], numeric_library.nan)

clean_frame["weight_pounds"] = pandas_library.to_numeric(
    clean_frame["weight_pounds"], errors="coerce"
)

# Parse ship_date column, coercing invalid or pending entries to missing values.
# format="mixed" parses each value on its own, so both date styles are recognized.
clean_frame["ship_date"] = pandas_library.to_datetime(
    clean_frame["ship_date"], errors="coerce", format="mixed"
)

print("\nCleaned DataFrame with consistent types:")
print(clean_frame)

print("\nDtypes after cleaning and conversion:")
print(clean_frame.dtypes)

# Now numeric operations and date filters behave correctly on cleaned columns.
print("\nMean shipped weight in pounds, ignoring missing values:")
print(clean_frame["weight_pounds"].mean())



### **3.3. Sampling Imports with nrows**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_02/Lecture_A/image_03_03.jpg?v=1766707162" width="250">



>* Start by importing only a small sample
>* Use samples to tune settings before full load

>* Sample a few rows to spot irregularities
>* Adjust import settings before loading full dataset

>* Small samples can miss rare data problems
>* Sample multiple file sections before full import
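
Sampling more than one section of a file can be sketched by combining `skiprows` with `nrows`. The CSV text is hypothetical and hides one corrupted value in a later row.

```python
import io
import pandas as pd

# Hypothetical file: ten data rows, with a corrupted value in row 7.
lines = ["id,value"]
lines += [f"{i},{'error' if i == 7 else i * 1.5}" for i in range(1, 11)]
csv_text = "\n".join(lines) + "\n"

# Sample the top of the file.
head_sample = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Sample a later section: skip file lines 1-5 (line 0 is the header), then read three rows.
middle_sample = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 6), nrows=3)

print("Head sample dtype:", head_sample["value"].dtype)
print("Middle sample dtype:", middle_sample["value"].dtype)
print(middle_sample)
```

The head sample looks cleanly numeric, while the middle sample exposes the corrupted entry, which is why a single small sample from the top can be misleading.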



In [None]:
#@title Python Code - Sampling Imports with nrows

# Demonstrate sampling imports using nrows with pandas DataFrame.
# Use a small in-memory CSV that stands in for a large file and read only a few rows.
# Compare sampled import with full import for understanding potential issues.

import pandas as pd
import io

csv_text = """id,date,temperature_f,notes
1,2024-01-01,72.5,Normal reading
2,2024-01-02,73.0,Missing value later
3,2024-01-03,,Sensor glitch
4,2024-01-04,75.2,Strange character � here
5,2024-01-05,not_a_number,Corrupted temperature
6,2024-01-06,71.8,Normal reading
"""

csv_buffer = io.StringIO(csv_text)

sample_df = pd.read_csv(csv_buffer, nrows=3)

print("Sampled first three rows only:")
print(sample_df)
print("Sampled temperature column dtype:", sample_df["temperature_f"].dtype)

csv_buffer.seek(0)

full_df = pd.read_csv(csv_buffer)

print("\nFull import row count:", len(full_df))
print("Full import temperature column dtype:", full_df["temperature_f"].dtype)



# <font color="#418FDE" size="6.5" uppercase>**Importing Data**</font>


In this lecture, you learned to:
- Load tabular data from CSV, Excel, and Parquet files into Pandas 2.3.1 DataFrames using appropriate read_* functions. 
- Configure import options such as separators, headers, dtypes, and date parsing to correctly interpret raw files. 
- Diagnose and fix common import issues including bad encodings, unexpected missing values, and mixed-type columns. 

In the next Lecture (Lecture B), we will go over 'Cleaning Columns'.