# <font color="#418FDE" size="6.5" uppercase>**DataFrame Essentials**</font>

>Last update: 20251225.
    
By the end of this lecture, you will be able to:
- Construct DataFrames from common sources such as dictionaries, lists of dicts, and NumPy arrays using Pandas 2.3.1. 
- Inspect DataFrame structure and metadata to understand shape, columns, indexes, and basic statistics. 
- Perform basic column and row operations including selection, renaming, and simple transformations. 


## **1. Create DataFrames**

### **1.1. From dicts and lists of dicts**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_01/Lecture_C/image_01_01.jpg?v=1766705067" width="250">



>* Dictionaries map column names to value lists
>* Pandas turns these mappings into typed DataFrames

>* List of dicts becomes DataFrame rows
>* Pandas takes the union of all keys and fills missing fields with NaN

>* Pandas aligns messy dict data into tables
>* It infers types and preserves labels for analysis



In [None]:
#@title Python Code - From dicts and lists of dicts

# Demonstrate creating DataFrames from dictionaries and lists of dictionaries.
# Show how keys become column labels in resulting DataFrames.
# Print small DataFrames to observe rows, columns, and missing values.

import pandas as pd

survey_dict = {
    "customer_id": [101, 102, 103],
    "drink": ["Latte", "Espresso", "Tea"],
    "rating": [5, 4, 3],
}

survey_df = pd.DataFrame(survey_dict)

print("DataFrame from dictionary of lists:")
print(survey_df)

orders_list = [
    {"order_id": 1, "item": "Burger", "price_usd": 9.5},
    {"order_id": 2, "item": "Fries", "price_usd": 3.0},
    {"order_id": 3, "item": "Soda", "price_usd": 1.5, "size_oz": 12},
]

orders_df = pd.DataFrame(orders_list)

print("\nDataFrame from list of dictionaries:")
print(orders_df)
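
As a small follow-up sketch, the cell below checks what happens to a key that appears in only some of the dictionaries. The two-row `orders_list` is a shortened version of the example above, and `missing_counts` is a name introduced here for illustration.

```python
import pandas as pd

# Only the second record has a "size_oz" key.
orders_list = [
    {"order_id": 1, "item": "Burger", "price_usd": 9.5},
    {"order_id": 3, "item": "Soda", "price_usd": 1.5, "size_oz": 12},
]

orders_df = pd.DataFrame(orders_list)

# Count missing values per column; the absent key shows up as NaN.
missing_counts = orders_df.isna().sum()

print(orders_df.dtypes)
print("\nMissing values per column:")
print(missing_counts)
```

Because NaN is a floating-point value, the `size_oz` column is stored as `float64` even though the only value supplied was the integer 12.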



### **1.2. From NumPy arrays**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_01/Lecture_C/image_01_02.jpg?v=1766705082" width="250">



>* DataFrames turn NumPy arrays into labeled tables
>* Labels make numeric simulation outputs easier to analyze

>* DataFrame adds patient and measurement labels to arrays
>* Labeled rows and columns make analysis easier and safer

>* Combine related NumPy arrays into one DataFrame
>* Enable aligned analysis, filtering, statistics, and visualization



In [None]:
#@title Python Code - From NumPy arrays

# Demonstrate creating DataFrames from simple NumPy arrays.
# Show how column labels describe each measurement clearly.
# Show how row labels can represent meaningful observation identifiers.

import numpy as np  # Import NumPy for numerical arrays.
import pandas as pd  # Import pandas for DataFrame creation.

patient_data = np.array([[45, 120, 190], [60, 135, 210], [52, 110, 175]])
patient_ids = np.array(["P001", "P002", "P003"])

column_labels = ["age_years", "systolic_mmHg", "cholesterol_mg_dl"]
row_labels = patient_ids.tolist()

patients_df = pd.DataFrame(data=patient_data, index=row_labels, columns=column_labels)

print("NumPy array shape and contents summary:")
print(patient_data.shape, "values representing three patients.")

print("\nDataFrame with labels for rows and columns:")
print(patients_df)
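
The bullets above also mention combining several related arrays into one DataFrame. One possible way to do that, sketched here with made-up arrays (`ages`, `systolic`) and a hypothetical `vitals_df` name, is to pass a dictionary of equal-length 1-D arrays:

```python
import numpy as np
import pandas as pd

# Two separate 1-D arrays describing the same three patients (illustrative values).
ages = np.array([45, 60, 52])
systolic = np.array([120, 135, 110])

# Each array becomes one column; the shared index keeps the rows aligned.
vitals_df = pd.DataFrame(
    {"age_years": ages, "systolic_mmHg": systolic},
    index=["P001", "P002", "P003"],
)

# Once combined, one column can filter the rows of the whole table.
older = vitals_df[vitals_df["age_years"] > 50]

print(vitals_df)
print("\nPatients older than 50:")
print(older)
```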



### **1.3. Index and Column Labels**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_01/Lecture_C/image_01_03.jpg?v=1766705098" width="250">



>* DataFrames use row indexes and column names
>* Default integer labels (0, 1, 2, ...) are convenient but rarely meaningful

>* Source type determines how labels are created
>* Explicit names and indexes make data understandable

>* Good indexes act as reliable unique identifiers
>* Clear labels make DataFrames readable and dependable



In [None]:
#@title Python Code - Index and Column Labels

# Show default index and column labels for simple DataFrames.
# Demonstrate custom index labels using meaningful row identifiers.
# Demonstrate custom column labels for clearer, self-documenting data.

import pandas as pd
import numpy as np

# Create a DataFrame from a dictionary with default index labels.
data_default = {"product": ["Widget", "Gadget"], "revenue_usd": [1200, 950]}
df_default = pd.DataFrame(data_default)

# Display the default index and column labels for this DataFrame.
print("Default index and columns:")
print(df_default)

# Create a DataFrame with a custom index using order identifiers.
order_ids = ["ORD_1001", "ORD_1002"]
df_custom_index = pd.DataFrame(data_default, index=order_ids)

# Display the custom index labels that replace simple integer positions.
print("\nCustom index labels:")
print(df_custom_index)

# Create a DataFrame from a NumPy array without inherent labels.
array_data = np.array([[72.5, 45.0], [68.0, 55.5]])

# Add explicit column labels and a custom index describing measurement context.
columns = ["temperature_fahrenheit", "humidity_percent"]
index = ["Basement_sensor", "Attic_sensor"]
df_sensors = pd.DataFrame(array_data, columns=columns, index=index)

# Display the labeled DataFrame showing meaningful index and column names.
print("\nLabeled sensor readings:")
print(df_sensors)
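
If an index is meant to act as a unique identifier, it can be worth checking that it really is one. The sketch below uses a deliberately duplicated order label (invented for this example) with `Index.is_unique` and `Index.duplicated()`:

```python
import pandas as pd

# "ORD_1002" appears twice on purpose to simulate a data-entry problem.
df_orders = pd.DataFrame(
    {"revenue_usd": [1200, 950, 400]},
    index=["ORD_1001", "ORD_1002", "ORD_1002"],
)

# is_unique is False as soon as any label repeats.
index_is_unique = df_orders.index.is_unique

# duplicated() flags every repeat after the first occurrence.
duplicate_labels = df_orders.index[df_orders.index.duplicated()].tolist()

print("Index is unique:", index_is_unique)
print("Duplicated labels:", duplicate_labels)

# A duplicated label returns several rows instead of one.
print(df_orders.loc["ORD_1002"])
```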



## **2. Inspecting DataFrame Structure**

### **2.1. Previewing DataFrames**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_01/Lecture_C/image_02_01.jpg?v=1766705132" width="250">



>* Quickly preview new DataFrames to understand contents
>* Spot loading errors, misalignments, and strange values

>* Preview connects abstract DataFrame concepts to reality
>* Spot types, missing values, and data issues early

>* Preview checks if layout fits analysis goals
>* Aligns your mental model with actual structure



In [None]:
#@title Python Code - Previewing DataFrames

# Demonstrate quick DataFrame previews using simple customer order data.
# Show how head and tail reveal a small sample of rows.
# Help you visually check columns, values, and index formatting.

import pandas as pd

# Create a small DataFrame representing online store customer orders.
orders_data = {
    "order_id": [101, 102, 103, 104, 105, 106],
    "customer": ["Alice", "Bob", "Cara", "Dan", "Eli", "Fran"],
    "order_date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-06"],
    "total_dollars": [59.99, 23.50, 120.00, 5.99, 250.75, 42.10],
}

orders_df = pd.DataFrame(orders_data)

# Preview the first few rows to quickly inspect structure and example values.
print("First three rows preview:")
print(orders_df.head(3))

# Preview the last few rows to confirm later records and index behavior.
print("\nLast two rows preview:")
print(orders_df.tail(2))

# Show shape information to connect preview with overall DataFrame size.
print("\nDataFrame shape (rows, columns):", orders_df.shape)
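
`head()` and `tail()` only show the edges of a table. As a complementary sketch, `sample()` draws random rows, which can surface problems hidden in the middle of a larger dataset. The data and the `random_state` value below are arbitrary choices for illustration.

```python
import pandas as pd

orders_df = pd.DataFrame(
    {
        "order_id": [101, 102, 103, 104, 105, 106],
        "total_dollars": [59.99, 23.50, 120.00, 5.99, 250.75, 42.10],
    }
)

# With no argument, head() returns the first five rows.
default_head = orders_df.head()

# sample() picks rows at random; a fixed random_state makes the draw repeatable.
random_rows = orders_df.sample(n=3, random_state=42)

print("Rows returned by head():", len(default_head))
print("\nThree random rows:")
print(random_rows)
```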



### **2.2. .info() and .describe() in 2.3.1**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_01/Lecture_C/image_02_02.jpg?v=1766705168" width="250">



>* `.info()` shows row count, columns, dtypes, and non-null counts
>* It quickly reveals missing data and type issues

>* `.describe()` summarizes numeric columns with key statistics
>* It helps spot outliers, errors, and cleaning needs

>* Combine structural and statistical summaries for insight
>* Use both to spot issues and guide analysis



In [None]:
#@title Python Code - .info() and .describe() in 2.3.1

# Demonstrate DataFrame structural summary using info method in pandas 2.3.1.
# Demonstrate DataFrame statistical summary using describe method for numeric columns.
# Compare both summaries to understand data structure and numeric distributions.

import pandas as pd

# Create a small DataFrame representing simple car trip records.
trips_data = {
    "miles_driven": [3.5, 12.0, 7.2, 0.5, 250.0],
    "trip_minutes": [8, 25, 15, 3, 400],
    "fare_usd": [7.5, 22.0, 13.0, 3.0, 600.0],
    "driver_id": ["A1", "B2", "A1", "C3", "B2"],
}

trips_df = pd.DataFrame(trips_data)

# Show structural summary including column names, types, and non-null counts.
print("Structural summary using info():")
trips_df.info()

# Add a blank line to separate structural and statistical summaries visually.
print("\nStatistical summary using describe():")

# Show descriptive statistics for numeric columns only, revealing ranges and typical values.
print(trips_df.describe())
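
Two details are easy to miss in the summaries above. The sketch below, which reuses a cut-down version of the trips data, compares the mean with the median to expose the 250-mile outlier, and calls `describe()` on a text column to show that it reports counts and frequencies rather than numeric statistics.

```python
import pandas as pd

trips_df = pd.DataFrame(
    {
        "miles_driven": [3.5, 12.0, 7.2, 0.5, 250.0],
        "driver_id": ["A1", "B2", "A1", "C3", "B2"],
    }
)

# For a numeric column, the "50%" row of describe() is the median.
numeric_summary = trips_df["miles_driven"].describe()
mean_miles = numeric_summary["mean"]
median_miles = numeric_summary["50%"]

# A mean far above the median suggests a few very large values.
print("Mean miles:", mean_miles)
print("Median miles:", median_miles)

# For a text column, describe() reports count, unique, top, and freq.
driver_summary = trips_df["driver_id"].describe()
print(driver_summary)
```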



### **2.3. Index and Columns**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_01/Lecture_C/image_02_03.jpg?v=1766705186" width="250">



>* Columns label data fields; index labels rows
>* Index and columns guide selection and merging

>* Columns list available variables and their meaning
>* Index type shows how rows are identified and aligned

>* Good indexes speed up slicing and alignment
>* Clear column names improve collaboration and reliability



In [None]:
#@title Python Code - Index and Columns

# Show how DataFrame indexes organize rows clearly.
# Show how column labels describe each field clearly.
# Demonstrate inspecting and customizing index and columns.

import pandas as pd

# Create a simple shipments DataFrame with default integer index.
shipments_data = {
    "shipment_id": [101, 102, 103],
    "origin_city": ["Dallas", "Denver", "Boston"],
    "weight_pounds": [25, 40, 15],
}

shipments_df = pd.DataFrame(shipments_data)

# Inspect default index and columns attributes for orientation.
print("Default index:", shipments_df.index)
print("Default columns:", shipments_df.columns)

# Set shipment_id as index to label rows meaningfully.
shipments_indexed = shipments_df.set_index("shipment_id")

print("\nIndexed DataFrame preview:")
print(shipments_indexed)

# Rename columns to be clearer and more standardized.
shipments_renamed = shipments_indexed.rename(
    columns={"origin_city": "origin_city_name", "weight_pounds": "weight_lb"}
)

print("\nRenamed columns list:")
print(list(shipments_renamed.columns))
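
To illustrate what alignment means in practice, here is a minimal sketch in which a Series listed in a different row order is added to a DataFrame column. The `packaging_pounds` values and the `gross_pounds` column are invented for this example.

```python
import pandas as pd

shipments = pd.DataFrame(
    {"weight_pounds": [25, 40, 15]},
    index=pd.Index([101, 102, 103], name="shipment_id"),
)

# Extra weights listed in a different order than the DataFrame rows.
packaging_pounds = pd.Series([1, 2, 3], index=[103, 101, 102])

# Pandas matches values by index label, not by position.
shipments["gross_pounds"] = shipments["weight_pounds"] + packaging_pounds

print(shipments)
```

Shipment 101 receives 25 + 2 = 27 pounds because the Series value labeled 101 is used, even though it sits second in `packaging_pounds`.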




## **3. Core DataFrame Operations**

### **3.1. Selecting DataFrame Subsets**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_01/Lecture_C/image_03_01.jpg?v=1766705209" width="250">



>* Select only rows and columns you need
>* Create a focused view to analyze patterns

>* Select subsets by label with `.loc` (column names, index values)
>* Or select by integer position with `.iloc`, switching as needed

>* Filter rows using logical, value-based conditions
>* Combine conditions to target precise, real-world subsets



In [None]:
#@title Python Code - Selecting DataFrame Subsets

# Demonstrate selecting DataFrame subsets by labels and positions.
# Show simple column selections and row filters using conditions.
# Keep output short and readable for beginners.

import pandas as pd

# Create a small hospital patients DataFrame example.
data = {
    "patient_id": [101, 102, 103, 104],
    "age_years": [25, 67, 40, 55],
    "diagnosis": ["flu", "pneumonia", "flu", "asthma"],
    "stay_nights": [2, 5, 1, 3],
}

patients = pd.DataFrame(data)

# Show the full DataFrame to understand available columns.
print("Full patients DataFrame:")
print(patients)

# Select a subset of columns by label using bracket notation.
subset_columns = patients[["age_years", "diagnosis"]]
print("\nSelected columns by label:")
print(subset_columns)

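# Select specific rows and columns by label using loc accessor.
# The default index holds 0-3, so use patient_id as the index first; loc slices include both endpoints.
label_subset = patients.set_index("patient_id").loc[101:103, ["diagnosis", "stay_nights"]]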
print("\nSubset using loc with labels:")
print(label_subset)

# Select rows and columns by integer position using iloc accessor.
position_subset = patients.iloc[0:2, 1:3]
print("\nSubset using iloc with positions:")
print(position_subset)

# Filter rows using a condition on age and stay length together.
condition_subset = patients[(patients["age_years"] > 40) & (patients["stay_nights"] >= 3)]
print("\nFiltered rows with conditions:")
print(condition_subset)
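
One difference between the two accessors is worth a separate look: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position. The sketch below assumes `patient_id` has been set as the index, and also shows `isin()` as one way to match several values at once.

```python
import pandas as pd

patients = pd.DataFrame(
    {
        "patient_id": [101, 102, 103, 104],
        "diagnosis": ["flu", "pneumonia", "flu", "asthma"],
        "stay_nights": [2, 5, 1, 3],
    }
).set_index("patient_id")

# Label slice: both 101 and 103 are included.
by_label = patients.loc[101:103]

# Position slice: position 2 is excluded, as with Python lists.
by_position = patients.iloc[0:2]

# isin() keeps rows whose value matches any entry in the list.
flu_or_asthma = patients[patients["diagnosis"].isin(["flu", "asthma"])]

print("Rows from loc[101:103]:", len(by_label))
print("Rows from iloc[0:2]:", len(by_position))
print(flu_or_asthma)
```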



### **3.2. Safe Column Renaming**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_01/Lecture_C/image_03_02.jpg?v=1766705238" width="250">



>* Clear, consistent column names improve understanding and work
>* Rename columns safely, explicitly, and reversibly to avoid errors

>* Rename using explicit old-to-new column mappings
>* Avoid position-based assumptions; document renaming for collaborators

>* Use consistent naming conventions across all datasets
>* Document renames to keep pipelines clear and maintainable



In [None]:
#@title Python Code - Safe Column Renaming

# Demonstrate safe column renaming using explicit name mapping.
# Show how positional assumptions can silently break renaming logic.
# Encourage consistent, reversible column naming for reliable analysis.

import pandas as pd

# Create a small patient admissions dataset with slightly cryptic column names.
data_initial = {"Pt_ID": [101, 102], "AdmDt": ["2024-01-01", "2024-01-02"], "Dx1": ["A10", "B20"]}

df_initial = pd.DataFrame(data_initial)

# Show the original DataFrame columns before any renaming changes.
print("Original columns:", list(df_initial.columns))

# Safely rename columns using a clear mapping dictionary by old names.
rename_map = {"Pt_ID": "patient_id", "AdmDt": "admission_date", "Dx1": "primary_diagnosis"}

df_safe = df_initial.rename(columns=rename_map)

# Display the safely renamed columns for confirmation and documentation.
print("Safely renamed columns:", list(df_safe.columns))

# Simulate a new dataset version where a new column appears at the front.
data_new = {"Campaign_ID": ["X1", "X2"], "Pt_ID": [201, 202], "AdmDt": ["2024-02-01", "2024-02-02"], "Dx1": ["C30", "D40"]}

df_new = pd.DataFrame(data_new)

# Incorrect approach: rename by position, assuming the first column is always the patient identifier.
wrong_new_columns = list(df_new.columns)

wrong_new_columns[0] = "patient_id"  # Silently mislabels Campaign_ID now that the order has changed.

df_wrong = df_new.copy()

df_wrong.columns = wrong_new_columns

# Correct approach: rename by explicit original names, independent of column order.
correct_rename_map = {"Pt_ID": "patient_id", "AdmDt": "admission_date", "Dx1": "primary_diagnosis"}

df_correct = df_new.rename(columns=correct_rename_map)

# Print both incorrect and correct column sets to highlight the difference.
print("Wrong positional rename columns:", list(df_wrong.columns))

print("Correct name based rename columns:", list(df_correct.columns))
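
Name-based renaming has one quiet failure mode of its own: by default, `rename()` ignores mapping keys that do not match any column. As a sketch, the misspelled key `"PtID"` below is a deliberate typo, and passing `errors="raise"` turns it into a `KeyError`.

```python
import pandas as pd

df_initial = pd.DataFrame({"Pt_ID": [101, 102], "AdmDt": ["2024-01-01", "2024-01-02"]})

# Deliberate typo: the real column is "Pt_ID", not "PtID".
typo_map = {"PtID": "patient_id"}

# Default behavior: the unmatched key is ignored and nothing is renamed.
df_silent = df_initial.rename(columns=typo_map)
print("After silent rename:", list(df_silent.columns))

# With errors="raise", the unmatched key is reported immediately.
try:
    df_initial.rename(columns=typo_map, errors="raise")
    typo_caught = False
except KeyError as exc:
    typo_caught = True
    print("Rename failed:", exc)
```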




### **3.3. Column Transformations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas (2.3.1) A-Z/Module_01/Lecture_C/image_03_03.jpg?v=1766705258" width="250">



>* Transform columns to create cleaner, richer features
>* Convert types and compute new values for analysis

>* Apply rules to values for consistent columns
>* Combine or derive columns to answer analysis questions

>* Fix missing, inconsistent, and noisy column values
>* Produce clean, standardized, analysis-ready DataFrame columns



In [None]:
#@title Python Code - Column Transformations

# Demonstrate simple column transformations using a small Pandas DataFrame.
# Show cleaning text, converting types, and creating new calculated columns.
# Keep everything beginner friendly and easy to run in Colab.

import pandas as pd

# Create a tiny orders DataFrame with messy text and string numbers.
data = {
    "item": ["Laptop", "laptop", "LAPTOP"],
    "category": ["Electronics", "electronics", "ELECTRONICS"],
    "quantity": [1, 2, 3],
    "unit_price_usd": ["1000.00", "950.50", "1100.25"],
}

orders = pd.DataFrame(data)

# Standardize category capitalization to make categories consistent.
orders["category_clean"] = orders["category"].str.strip().str.lower()

# Convert unit price strings into numeric values for calculations.
orders["unit_price_usd"] = pd.to_numeric(orders["unit_price_usd"], errors="coerce")

# Create a new revenue column by multiplying quantity and unit price.
orders["line_revenue_usd"] = orders["quantity"] * orders["unit_price_usd"]

# Convert revenue from dollars to cents for an alternative numeric scale.
orders["line_revenue_cents"] = (orders["line_revenue_usd"] * 100).round(0).astype("int64")

# Display the transformed DataFrame to inspect new and cleaned columns.
print(orders)
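
The cell above uses `errors="coerce"` on clean strings, so its effect is not visible. The sketch below adds an unparseable entry (`"n/a"`, invented for this example) to show that it becomes NaN. Filling the gap with the median is only one possible choice, used here as an assumption.

```python
import pandas as pd

# Raw prices with stray whitespace and one unparseable entry.
raw_prices = pd.Series(["1000.00", "n/a", " 950.50 "])

# Strip whitespace, then convert; values that cannot be parsed become NaN.
prices = pd.to_numeric(raw_prices.str.strip(), errors="coerce")
n_missing = int(prices.isna().sum())

# Fill the missing price with the median of the valid ones (illustrative choice).
filled = prices.fillna(prices.median())

print(prices)
print("Missing after conversion:", n_missing)
print(filled)
```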



# <font color="#418FDE" size="6.5" uppercase>**DataFrame Essentials**</font>


In this lecture, you learned to:
- Construct DataFrames from common sources such as dictionaries, lists of dicts, and NumPy arrays using Pandas 2.3.1. 
- Inspect DataFrame structure and metadata to understand shape, columns, indexes, and basic statistics. 
- Perform basic column and row operations including selection, renaming, and simple transformations. 

In the next module (Module 2), we will cover 'Data Cleaning'.