# Part 4: Pandas DataFrames - Data Loading and Manipulation

In this notebook, we'll dive deep into Pandas DataFrames, learning how to load data and perform essential data manipulation operations.

## Topics Covered:
- Loading data from CSV, TSV, and TXT files
- Selecting columns and rows
- Filtering data
- Handling missing values and dropping columns
- Grouping data
- Joining/Merging DataFrames

In [None]:
import pandas as pd
import numpy as np

# For this workshop, we'll create sample data
# In real scenarios, you'd load actual files
print("Pandas version:", pd.__version__)

## 1. Loading Data from Files

Pandas can read data from various file formats:

In [None]:
# Loading CSV files (Comma-Separated Values)
# df = pd.read_csv('data/file.csv')

# Loading TSV files (Tab-Separated Values)
# df = pd.read_csv('data/file.tsv', sep='\t')

# Loading TXT files with custom delimiter
# df = pd.read_csv('data/file.txt', delimiter='|')

# Common parameters:
# - header: row number to use as column names (default: 'infer', which uses row 0 when no names are passed)
# - index_col: column to use as row labels
# - usecols: list of columns to read
# - na_values: values to recognize as NA/NaN

# For this workshop, let's create a sample dataset
np.random.seed(42)
sales_data = {
    'date': pd.date_range('2024-01-01', periods=100),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'sales': np.random.randint(1000, 10000, 100),
    'quantity': np.random.randint(1, 20, 100),
    'customer_rating': np.random.choice([4.0, 4.5, 5.0, np.nan], 100)
}

df = pd.DataFrame(sales_data)
print("Sample sales data:")
print(df.head())
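
The commented-out `read_csv` calls above can be tried without any files on disk by reading from an in-memory string. The sketch below is only an illustration: the pipe-delimited text, the `order_id` column, and the `"missing"` placeholder are made up for this example and are not part of the workshop data.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited file contents, held in memory for illustration
raw_text = (
    "order_id|product|sales|rating\n"
    "1|Laptop|1200|4.5\n"
    "2|Phone|800|missing\n"
    "3|Tablet|500|5.0\n"
)

orders = pd.read_csv(
    io.StringIO(raw_text),                      # any file-like object works in place of a path
    sep="|",                                    # custom delimiter
    index_col="order_id",                       # use this column as the row labels
    usecols=["order_id", "product", "rating"],  # skip the 'sales' column entirely
    na_values=["missing"],                      # treat this placeholder as NaN
)

print(orders)
print(orders.isnull().sum())
```

Because `"missing"` is declared in `na_values`, the `rating` column is parsed as numeric with one NaN rather than as text.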

## 2. Selecting Columns and Rows

There are multiple ways to select data from a DataFrame:

In [None]:
# Select a single column (returns a Series)
products = df['product']
print("Product column (Series):")
print(products.head())

# Select multiple columns (returns a DataFrame)
subset = df[['product', 'sales', 'quantity']]
print("\nMultiple columns:")
print(subset.head())

In [None]:
# Select rows by index position with iloc
print("First row:")
print(df.iloc[0])

print("\nFirst 5 rows:")
print(df.iloc[0:5])

print("\nSpecific rows and columns:")
print(df.iloc[0:3, 1:4])  # Rows 0-2, Columns 1-3

In [None]:
# Select rows by label with loc
# Unlike iloc, a loc slice includes its end label: 0:4 returns rows 0 through 4
print("Using loc with row indices and column names:")
print(df.loc[0:4, ['product', 'sales']])

# Boolean indexing
print("\nRows where sales > 8000:")
high_sales = df.loc[df['sales'] > 8000]
print(high_sales.head())
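
With the default integer index, `iloc` and `loc` can look interchangeable. A small sketch with string labels makes the difference easier to see; the `demo` frame and its labels are invented for this illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels instead of the default 0..n-1
demo = pd.DataFrame(
    {"sales": [100, 200, 300, 400]},
    index=["a", "b", "c", "d"],
)

by_position = demo.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = demo.loc["a":"c"]   # labels 'a' through 'c'; the end label is included

print(by_position)
print(by_label)
```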

## 3. Filtering Data

Filtering allows you to extract rows based on conditions:

In [None]:
# Single condition
laptops = df[df['product'] == 'Laptop']
print(f"Laptop sales: {len(laptops)} records")
print(laptops.head())

# Multiple conditions with & (and) and | (or)
# Note: Use & instead of 'and', | instead of 'or'
# Wrap each condition in parentheses

high_laptop_sales = df[(df['product'] == 'Laptop') & (df['sales'] > 5000)]
print(f"\nHigh-value laptop sales: {len(high_laptop_sales)} records")
print(high_laptop_sales.head())

In [None]:
# Using isin() for multiple values
electronics = df[df['product'].isin(['Laptop', 'Phone'])]
print(f"Laptops and Phones: {len(electronics)} records")

# Using str methods for string filtering
# (Example: if we had product names with patterns)
# df[df['product'].str.contains('Lap')]
# df[df['product'].str.startswith('L')]

# Filter by date range
january_sales = df[df['date'] < '2024-02-01']
print(f"\nJanuary sales: {len(january_sales)} records")
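
A few other filtering idioms are worth knowing alongside `&`, `|`, and `isin()`. The sketch below uses a small made-up frame (not the workshop `df`) to show negation with `~`, range checks with `Series.between()`, and the string-based `DataFrame.query()`.

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({
    "product": ["Laptop", "Phone", "Tablet", "Monitor"],
    "sales": [7000, 3000, 5500, 9000],
})

# ~ negates a boolean mask: everything that is NOT a laptop
not_laptops = demo[~(demo["product"] == "Laptop")]

# between() is inclusive on both ends by default
mid_range = demo[demo["sales"].between(3000, 7000)]

# query() expresses the same condition as a string
same_query = demo.query("sales >= 3000 and sales <= 7000")

print(not_laptops)
print(mid_range)
```

Inside `query()` the words `and` and `or` are allowed, which is one reason some people find it easier to read for long conditions.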

## 4. Handling Missing Values

Real-world data often has missing values that need to be handled:

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

print("\nPercentage of missing values:")
print((df.isnull().sum() / len(df)) * 100)

# Check which rows have any missing values
rows_with_na = df[df.isnull().any(axis=1)]
print(f"\nRows with missing values: {len(rows_with_na)}")

In [None]:
# Drop rows with any missing values
df_no_na = df.dropna()
print(f"Original shape: {df.shape}")
print(f"After dropping NA: {df_no_na.shape}")

# Drop rows where specific columns have missing values
df_rating_clean = df.dropna(subset=['customer_rating'])
print(f"After dropping rows with missing ratings: {df_rating_clean.shape}")

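# Fill missing values
# Assign the result back: calling fillna(inplace=True) on a selected column
# triggers warnings in recent pandas versions and may not update the DataFrame
df_filled = df.copy()
df_filled['customer_rating'] = df_filled['customer_rating'].fillna(df_filled['customer_rating'].mean())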
print(f"\nMissing values after filling: {df_filled['customer_rating'].isnull().sum()}")
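
Filling with the overall mean treats every product the same. One possible refinement, sketched here on made-up data, is to fill each missing rating with the mean of its own product group using `groupby(...).transform("mean")`. This is an alternative to the approach above, not a replacement for it.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({
    "product": ["Laptop", "Laptop", "Phone", "Phone"],
    "customer_rating": [4.0, np.nan, 5.0, np.nan],
})

# transform("mean") returns a Series aligned with demo's rows,
# holding the mean rating of each row's product group
group_mean = demo.groupby("product")["customer_rating"].transform("mean")

# fillna with a Series fills each gap from the matching row
demo["customer_rating"] = demo["customer_rating"].fillna(group_mean)

print(demo)
```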

## 5. Dropping Columns

Sometimes you need to remove columns from your DataFrame:

In [None]:
# Drop a single column
df_no_rating = df.drop('customer_rating', axis=1)
# axis=1 means drop column, axis=0 would drop rows
print("Columns after dropping customer_rating:")
print(df_no_rating.columns.tolist())

# Drop multiple columns
df_minimal = df.drop(['customer_rating', 'quantity'], axis=1)
print("\nColumns after dropping multiple:")
print(df_minimal.columns.tolist())

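# Note: drop() returns a new DataFrame and leaves the original df unchanged.
# To keep the change, reassign the result or pass inplace=True:
# df = df.drop('column', axis=1)
# df.drop('column', axis=1, inplace=True)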
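
As an alternative to `axis=1`, `drop()` also accepts a `columns=` keyword, and `errors="ignore"` skips names that are not present. The sketch below uses a made-up one-row frame; `not_a_column` is a deliberately nonexistent name.

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({"product": ["Laptop"], "sales": [1200], "quantity": [3]})

# columns= says what is being dropped without needing axis=1
trimmed = demo.drop(columns=["quantity"])

# errors="ignore" silently skips labels that do not exist
safe = demo.drop(columns=["quantity", "not_a_column"], errors="ignore")

print(trimmed.columns.tolist())
print(safe.columns.tolist())
```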

## 6. Grouping Data

Group by allows you to aggregate data based on categories:

In [None]:
# Group by a single column
product_summary = df.groupby('product')['sales'].sum()
print("Total sales by product:")
print(product_summary)

# Multiple aggregations
product_stats = df.groupby('product')['sales'].agg(['sum', 'mean', 'count'])
print("\nProduct statistics:")
print(product_stats)

In [None]:
# Group by multiple columns
region_product = df.groupby(['region', 'product'])['sales'].sum()
print("Sales by region and product:")
print(region_product)

# Unstack for better readability
print("\nUnstacked view:")
print(region_product.unstack())
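
The group-then-unstack pattern above can also be written with `pivot_table()`, which builds the same region-by-product grid in one call. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["Laptop", "Phone", "Laptop", "Laptop"],
    "sales": [1000, 2000, 3000, 4000],
})

# Approach from above: group by two keys, then move 'product' into the columns
unstacked = demo.groupby(["region", "product"])["sales"].sum().unstack()

# Same grid with pivot_table
pivoted = demo.pivot_table(
    index="region", columns="product", values="sales", aggfunc="sum"
)

print(unstacked)
print(pivoted)
```

Combinations that never occur in the data (here, Phone sales in the South) show up as NaN in both results.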

In [None]:
# Different aggregations per column (dict of column name -> aggregation name(s))
custom_agg = df.groupby('region').agg({
    'sales': ['sum', 'mean'],
    'quantity': 'sum',
    'customer_rating': 'mean'
})

print("Custom aggregation:")
print(custom_agg)
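
The dictionary form above produces two-level column names such as `('sales', 'sum')`. If flat, descriptive names are preferred, named aggregation is one option. The sketch below uses made-up data, and the output names (`total_sales` and so on) are chosen only for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({
    "region": ["North", "North", "South"],
    "sales": [1000, 3000, 2000],
    "quantity": [1, 2, 3],
})

# Each keyword becomes an output column: name=(input column, aggregation)
summary = demo.groupby("region").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
    total_quantity=("quantity", "sum"),
)

print(summary)
```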

## 7. Joining DataFrames

Combining data from multiple DataFrames is a common operation:

In [None]:
# Create sample DataFrames for demonstration
products_df = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Laptop', 'Phone', 'Tablet', 'Monitor'],
    'category': ['Computer', 'Mobile', 'Mobile', 'Computer']
})

prices_df = pd.DataFrame({
    'product_id': [1, 2, 3, 5],
    'price': [1200, 800, 500, 300],
    'currency': ['USD', 'USD', 'USD', 'USD']
})

print("Products:")
print(products_df)
print("\nPrices:")
print(prices_df)

In [None]:
# INNER JOIN - only matching rows from both DataFrames
inner_join = pd.merge(products_df, prices_df, on='product_id', how='inner')
print("Inner Join (only matching):")
print(inner_join)
print(f"Shape: {inner_join.shape}")

In [None]:
# LEFT JOIN - all rows from left DataFrame, matching from right
left_join = pd.merge(products_df, prices_df, on='product_id', how='left')
print("Left Join (all products):")
print(left_join)
print(f"Shape: {left_join.shape}")

In [None]:
# RIGHT JOIN - all rows from right DataFrame, matching from left
right_join = pd.merge(products_df, prices_df, on='product_id', how='right')
print("Right Join (all prices):")
print(right_join)
print(f"Shape: {right_join.shape}")

In [None]:
# OUTER JOIN - all rows from both DataFrames
outer_join = pd.merge(products_df, prices_df, on='product_id', how='outer')
print("Outer Join (all rows):")
print(outer_join)
print(f"Shape: {outer_join.shape}")
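
When checking a join, it can help to see where each row came from. Passing `indicator=True` to `pd.merge` adds a `_merge` column with the values `both`, `left_only`, or `right_only`. The sketch below rebuilds small frames with the same `product_id` values as above so it runs on its own.

```python
import pandas as pd

# Small stand-ins with the same keys as products_df and prices_df above
products = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "product_name": ["Laptop", "Phone", "Tablet", "Monitor"],
})
prices = pd.DataFrame({
    "product_id": [1, 2, 3, 5],
    "price": [1200, 800, 500, 300],
})

# indicator=True records the source of each row in a '_merge' column
audit = pd.merge(products, prices, on="product_id", how="outer", indicator=True)

print(audit[["product_id", "_merge"]])
```

Here product 4 has no price (`left_only`) and price entry 5 has no product (`right_only`), which explains the NaN values seen in the left, right, and outer joins.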

## 8. Practical Example: Complete Data Pipeline

In [None]:
# Hypothetical scenario: Analyzing sales data

# 1. Load data (in real scenario)
# df = pd.read_csv('sales_data.csv')

# 2. Inspect data
print("Data shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

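# 3. Handle missing values
# .copy() makes df_clean independent of df, so the assignment below is unambiguous
df_clean = df.dropna(subset=['sales', 'quantity']).copy()
df_clean['customer_rating'] = df_clean['customer_rating'].fillna(df_clean['customer_rating'].mean())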

# 4. Filter for specific criteria
high_value = df_clean[df_clean['sales'] > 5000]

# 5. Group and aggregate
summary = high_value.groupby(['region', 'product']).agg({
    'sales': 'sum',
    'quantity': 'sum',
    'customer_rating': 'mean'
}).round(2)

print("\nHigh-value sales summary:")
print(summary)

## Practice Exercises

In [None]:
# Exercise 1: From the original df, filter for:
# - Product is 'Phone' or 'Tablet'
# - Sales greater than 3000
# - Region is 'North' or 'South'
# Print the shape and first 5 rows

# Your code here:

In [None]:
# Exercise 2: Create a summary showing:
# - Average sales per product
# - Total quantity sold per product
# - Count of transactions per product

# Your code here:

In [None]:
# Exercise 3: Create two DataFrames and join them
# df1: product_id, product_name
# df2: product_id, stock_level
# Perform an inner join

# Your code here: