# Pandas Fundamentals I - Part 2: Basic DataFrame Operations

## Week 2, Day 2 (Thursday) - April 17th, 2025

### Overview
This is the second part of our introduction to Pandas, focusing on basic operations for exploring and manipulating DataFrames. We'll learn the equivalent Pandas operations for common SQL tasks.

### Learning Objectives
- Inspect and understand DataFrame structure
- Access and manipulate DataFrame elements
- Handle missing data
- Perform basic column operations

### Prerequisites
- Pandas Fundamentals I - Part 1

In [None]:
# Import libraries
import pandas as pd
import numpy as np

# Create a sample DataFrame to work with
data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Electronics'],
    'price': [1200, 800, 450, 150, 300],
    'stock_quantity': [15, 25, 0, 30, 10],
    'rating': [4.5, 4.8, 4.2, 4.6, np.nan]  # Note the missing value (np.nan)
}

products_df = pd.DataFrame(data)
print(products_df)

## 1. DataFrame Inspection

Before diving into data analysis, it's important to inspect and understand your data. Pandas provides several methods for this purpose.

In [None]:
# View the first n rows (default is 5)
print("First 3 rows:")
print(products_df.head(3))

# View the last n rows (default is 5)
print("\nLast 2 rows:")
print(products_df.tail(2))

In [None]:
# Get basic information about the DataFrame
print("DataFrame info:")
products_df.info()

The `info()` method provides key information about the DataFrame:
- The number of rows and columns
- The column names and data types
- The number of non-null values in each column
- Memory usage

This is similar to getting table schema information in SQL.
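
If you need the pieces of the `info()` report as values rather than printed text, each one is also available on its own. The sketch below uses a small stand-in frame (hypothetical data, not the lesson's `products_df`) so that it runs by itself.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical data that mirrors a few lesson columns)
df = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003'],
    'price': [1200, 800, 450],
    'rating': [4.5, np.nan, 4.2],
})

n_rows, n_cols = df.shape                        # (rows, columns)
non_null = df.notna().sum()                      # non-null count per column
memory_bytes = df.memory_usage(deep=True).sum()  # total bytes, index included

print(n_rows, n_cols)
print(df.dtypes)
print(non_null)
print(memory_bytes)
```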

In [None]:
# Get summary statistics for numeric columns
print("Summary statistics:")
print(products_df.describe())

The `describe()` method provides descriptive statistics for numeric columns:
- count: number of non-missing values
- mean, std (standard deviation): measures of central tendency and dispersion
- min, 25%, 50% (median), 75%, max: the minimum, the three quartiles, and the maximum

This is similar to aggregate functions in SQL like COUNT(), AVG(), MIN(), MAX(), etc.
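
To make the SQL comparison concrete, here is a minimal sketch that computes a few of those aggregates directly with `agg()`. It rebuilds only the `price` column from the sample data so that it is self-contained.

```python
import pandas as pd

# Prices from the lesson's sample data
df = pd.DataFrame({'price': [1200, 800, 450, 150, 300]})

# Roughly: SELECT COUNT(price), AVG(price), MIN(price), MAX(price) FROM products;
stats = df['price'].agg(['count', 'mean', 'min', 'max'])
print(stats)

# The 50% row of describe() is the median
median_price = df['price'].median()
print(median_price)
```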

In [None]:
# Get descriptive statistics for categorical columns
print("Category counts:")
print(products_df['category'].value_counts())

# You can also use describe() with include/exclude parameters
print("\nDescribe categorical columns:")
print(products_df.describe(include=['object']))  # 'object' is the dtype pandas has traditionally used for strings

### Additional inspection methods

In [None]:
# Column names
print("Column names:", products_df.columns.tolist())

# Row index
print("\nRow index:", products_df.index.tolist())

# DataFrame dimensions (rows, columns)
print("\nDataFrame shape:", products_df.shape)

# Unique values in a column
print("\nUnique categories:", products_df['category'].unique())

## 2. Accessing Columns and Rows

Now let's look at different ways to access data within a DataFrame.

In [None]:
# Accessing a single column (returns a Series)
prices = products_df['price']
print("Prices:\n", prices)
print("Type:", type(prices))

In [None]:
# Alternative column access using dot notation
# Note: This only works for column names that could be valid Python variable names
# and don't conflict with DataFrame method names
prices_alt = products_df.price
print("Prices (dot notation):\n", prices_alt)
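
The caveat in the comments above is easy to hit in practice. The following sketch uses a hypothetical frame whose column names were chosen to cause trouble: one clashes with a DataFrame method and one contains a space.

```python
import pandas as pd

# Hypothetical frame with awkward column names
df = pd.DataFrame({'count': [3, 5], 'unit price': [9.5, 12.0]})

print(df['count'].tolist())       # bracket notation returns the column
print(callable(df.count))         # dot notation finds the DataFrame.count method instead
print(df['unit price'].tolist())  # names with spaces are only reachable with brackets
```

Bracket notation works for every column name, which is why it is usually the safer default.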

In [None]:
# Accessing multiple columns (returns a DataFrame)
product_info = products_df[['product_name', 'price', 'rating']]
print("Product info:\n", product_info)
print("Type:", type(product_info))
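
The difference between single and double brackets also applies when you select just one column. A minimal sketch with a two-row stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'product_name': ['Laptop', 'Tablet'], 'price': [1200, 450]})

as_series = df['price']    # single brackets: a Series
as_frame = df[['price']]   # a list with one name: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```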

### Accessing rows

In [None]:
# Accessing a row by position using iloc
# iloc uses integer-based indexing [row, column]
first_row = products_df.iloc[0]
print("First row:\n", first_row)
print("Type:", type(first_row))

In [None]:
# Accessing multiple rows with iloc
first_three_rows = products_df.iloc[0:3]
print("First three rows:\n", first_three_rows)

In [None]:
# Accessing specific rows and columns with iloc
# iloc[row_selector, column_selector]
subset = products_df.iloc[1:4, [0, 1, 3]]  # Rows 1-3, columns 0, 1, and 3
print("Subset of rows and columns:\n", subset)

In [None]:
# Accessing rows and columns by label using loc
# loc uses label-based indexing [row_label, column_label]
# Since our index is numeric (0-4), it looks similar to iloc in this case
second_row = products_df.loc[1, ['product_id', 'product_name', 'price']]
print("Second row selected fields:\n", second_row)
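
Even with a default numeric index, `loc` and `iloc` differ in one important way: slices are handled differently. The sketch below uses a four-row stand-in frame to show it.

```python
import pandas as pd

df = pd.DataFrame({'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones']})

by_position = df.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = df.loc[0:2]      # labels 0, 1, and 2; the end label is included

print(len(by_position))
print(len(by_label))
```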

### Label-based indexing with custom index

Let's set `product_id` as the index to see how label-based indexing works:

In [None]:
# Set product_id as index
products_indexed = products_df.set_index('product_id')
print(products_indexed)

In [None]:
# Now we can access rows by product_id
laptop_row = products_indexed.loc['P001']
print("Laptop details:\n", laptop_row)

# Get specific fields for specific products
headphones_info = products_indexed.loc['P004', ['price', 'stock_quantity']]
print("\nHeadphones price and stock:\n", headphones_info)
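
A few more label-based patterns are worth knowing once a custom index is in place. This is a sketch on a reduced copy of the sample data; the variable names (`indexed`, `middle`, `picked`, `restored`) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004'],
    'price': [1200, 800, 450, 150],
})
indexed = df.set_index('product_id')

# Label slices include both endpoints
middle = indexed.loc['P002':'P003']

# A list of labels returns rows in the order given
picked = indexed.loc[['P004', 'P001'], 'price']

# reset_index() turns the index back into an ordinary column
restored = indexed.reset_index()

print(middle)
print(picked)
print(restored)
```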

### SQL equivalent for row and column selection

In SQL, selecting specific columns and rows would look like:

```sql
SELECT product_name, price, rating
FROM products
WHERE product_id = 'P001';
```

In Pandas, this is equivalent to:

In [None]:
# SQL to Pandas translation
result = products_df.loc[products_df['product_id'] == 'P001', ['product_name', 'price', 'rating']]
print(result)
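
If the boolean-mask syntax feels far from SQL, `DataFrame.query()` accepts the filter as a string that reads more like a WHERE clause. The sketch below runs both forms on a two-row stand-in frame and checks that they agree.

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': ['P001', 'P002'],
    'product_name': ['Laptop', 'Smartphone'],
    'price': [1200, 800],
    'rating': [4.5, 4.8],
})

cols = ['product_name', 'price', 'rating']

# Boolean mask, as in the cell above
mask_result = df.loc[df['product_id'] == 'P001', cols]

# Same filter written as a query string
query_result = df.query("product_id == 'P001'")[cols]

print(query_result)
```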

## 3. Basic Data Types and Conversions

Let's look at data types in Pandas and how to convert between them.

In [None]:
# Check data types
print(products_df.dtypes)

Common pandas data types include:
- `object`: String or mixed types (similar to VARCHAR in SQL)
- `int64`: 64-bit integer (similar to BIGINT in SQL)
- `float64`: Floating-point (similar to FLOAT or DOUBLE in SQL)
- `bool`: Boolean (similar to BOOLEAN in SQL)
- `datetime64`: Date and time (similar to DATE, DATETIME in SQL)
- `category`: Categorical data (similar to ENUM in SQL; each distinct value is stored once and rows refer to it by an integer code)

Let's convert some columns to different types:

In [None]:
# Convert category to categorical type (more memory efficient for repeated values)
products_df['category'] = products_df['category'].astype('category')

# Create a date column and convert to datetime
products_df['last_updated'] = ['2025-01-15', '2025-01-20', '2025-01-10', '2025-01-25', '2025-01-18']
products_df['last_updated'] = pd.to_datetime(products_df['last_updated'])

# Check data types again
print(products_df.dtypes)

# Display the DataFrame with the new column
print("\nUpdated DataFrame:")
print(products_df)
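
Real data is rarely as clean as the strings above. When some entries cannot be converted, `pd.to_numeric()` and `pd.to_datetime()` both accept `errors='coerce'`, which replaces them with a missing value instead of raising. A minimal sketch with made-up inputs:

```python
import pandas as pd

raw = pd.Series(['1200', '800', 'n/a'])  # hypothetical price strings

# astype(int) would raise on 'n/a'; errors='coerce' turns it into NaN
prices = pd.to_numeric(raw, errors='coerce')
print(prices)

# The same option exists for dates: unparseable strings become NaT
dates = pd.to_datetime(pd.Series(['2025-01-15', 'not a date']), errors='coerce')
print(dates)
```

Note that the numeric result is a float column, because `NaN` cannot be stored in a plain `int64` column.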

## 4. Handling Missing Data

Missing data is a common issue in real-world datasets. In Pandas, missing values are typically represented by `NaN` (Not a Number); `None` and `NaT` (the missing marker for datetimes) are treated as missing too. Let's see how to detect and handle missing values.
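
As a quick illustration, the sketch below builds a hypothetical two-row frame in which the second row is missing in three different ways, and shows that `isna()` flags all of them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'rating': [4.5, np.nan],                               # NaN in a float column
    'customer_id': ['C001', None],                         # None in a string column
    'last_updated': pd.to_datetime(['2025-01-15', None]),  # NaT in a datetime column
})

missing = df.isna()
print(missing)
```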

In [None]:
# Check for missing values
print("Missing values per column:")
print(products_df.isna().sum())

# Show the rows that contain at least one missing value
print("\nRows with any missing value:")
print(products_df[products_df.isna().any(axis=1)])

### Handling missing values

There are several ways to handle missing values:

In [None]:
# 1. Remove rows with missing values
print("DataFrame after dropping rows with NaN:")
print(products_df.dropna())

# Note: The above operation doesn't modify the original DataFrame unless inplace=True
# Check that our original DataFrame still has the missing value
print("\nOriginal DataFrame (unchanged):")
print(products_df)
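
Dropping every row with any missing value is often too aggressive. `dropna()` takes a `subset` argument to limit the check to certain columns, and `axis=1` to drop columns instead of rows. A sketch on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'product_name': ['Laptop', 'Tablet', 'Monitor'],
    'price': [1200, np.nan, 300],
    'rating': [4.5, 4.2, np.nan],
})

any_missing = df.dropna()                   # drops Tablet and Monitor
rating_only = df.dropna(subset=['rating'])  # only a missing rating removes a row
complete_cols = df.dropna(axis=1)           # drops columns that contain any NaN

print(any_missing)
print(rating_only)
print(complete_cols)
```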

In [None]:
# 2. Fill missing values
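# With a constant value
# 'category' is now a categorical column and 0 is not one of its categories,
# so a frame-wide fillna(0) can raise an error in some pandas versions.
# Filling only the numeric columns avoids that.
print("Numeric columns after filling NaN with 0:")
print(products_df.select_dtypes(include='number').fillna(0))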

# With column-specific values
print("\nDataFrame after filling NaN with column-specific values:")
print(products_df.fillna({'rating': 3.0}))

# With the mean of the column
mean_rating = products_df['rating'].mean()
print(f"\nMean rating: {mean_rating:.2f}")
print("DataFrame after filling NaN with column mean:")
print(products_df.fillna({'rating': mean_rating}))
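
It may not be obvious why `mean()` returns a number at all when the column contains `NaN`. By default, pandas skips missing values when aggregating (`skipna=True`). The sketch below repeats the mean fill on the ratings alone.

```python
import numpy as np
import pandas as pd

# Ratings from the lesson's sample data
rating = pd.Series([4.5, 4.8, 4.2, 4.6, np.nan])

mean_rating = rating.mean()          # NaN is skipped by default (skipna=True)
filled = rating.fillna(mean_rating)  # returns a new Series; rating is unchanged

print(mean_rating)
print(filled)
```

A side effect of filling with the mean is that the column's mean stays the same, while its spread shrinks slightly.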

In [None]:
# 3. Update the original DataFrame
# Let's fill the missing rating with the mean and update our DataFrame
products_df['rating'] = products_df['rating'].fillna(mean_rating)
print("Updated DataFrame:")
print(products_df)

# Verify there are no more missing values
print("\nMissing values per column:")
print(products_df.isna().sum())

## 5. Basic Column Operations

Now let's look at some basic operations on DataFrame columns.

In [None]:
# Adding a new column
# Calculate inventory value (price * stock_quantity)
products_df['inventory_value'] = products_df['price'] * products_df['stock_quantity']
print("DataFrame with inventory value:")
print(products_df)

In [None]:
# Using apply() to create a column with a function
def stock_status(quantity):
    if quantity == 0:
        return 'Out of Stock'
    elif quantity < 15:
        return 'Low Stock'
    else:
        return 'In Stock'

products_df['stock_status'] = products_df['stock_quantity'].apply(stock_status)
print("DataFrame with stock status:")
print(products_df)
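
`apply()` calls the Python function once per value, which is fine for small data. For larger columns, one possible alternative (an addition to the lesson, not a replacement for the cell above) is `np.select()`, which evaluates each condition on the whole column at once. The first matching condition wins, just like the `if`/`elif` chain.

```python
import numpy as np
import pandas as pd

# Quantities from the lesson's sample data
df = pd.DataFrame({'stock_quantity': [15, 25, 0, 30, 10]})

conditions = [df['stock_quantity'] == 0, df['stock_quantity'] < 15]
labels = ['Out of Stock', 'Low Stock']

# Rows that match no condition get the default label
df['stock_status'] = np.select(conditions, labels, default='In Stock')
print(df)
```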

In [None]:
# Using lambda functions for simple operations
# Calculate a 10% discount price
products_df['discount_price'] = products_df['price'].apply(lambda x: x * 0.9)
print("DataFrame with discount price:")
print(products_df)
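
For plain arithmetic like this, the lambda is not strictly needed: operators work directly on a whole column. The sketch below compares both forms on the sample prices.

```python
import pandas as pd

# Prices from the lesson's sample data
df = pd.DataFrame({'price': [1200, 800, 450, 150, 300]})

with_apply = df['price'].apply(lambda x: x * 0.9)  # calls the lambda once per value
vectorized = df['price'] * 0.9                     # one operation on the whole column

print(vectorized)
```

Both give the same numbers; the vectorized form is shorter and usually faster, so `apply()` is best kept for logic that cannot be written with column operators.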

In [None]:
# Renaming columns
products_df = products_df.rename(columns={
    'stock_quantity': 'quantity_in_stock',
    'discount_price': 'sale_price'
})
print("DataFrame with renamed columns:")
print(products_df)

In [None]:
# Dropping columns
products_df_simplified = products_df.drop(columns=['inventory_value', 'stock_status'])
print("Simplified DataFrame:")
print(products_df_simplified)

## 6. Practice Exercises

Now let's practice with some exercises using what we've learned.

### Exercise 1: DataFrame Inspection

Create a new DataFrame with sales data and answer the following questions:
1. How many rows and columns are in the DataFrame?
2. What is the data type of each column?
3. Are there any missing values?
4. What is the average sales amount?

In [None]:
# Create a sales DataFrame
sales_data = {
    'sale_id': ['S001', 'S002', 'S003', 'S004', 'S005', 'S006'],
    'date': ['2025-01-05', '2025-01-10', '2025-01-15', '2025-01-20', '2025-01-25', '2025-01-30'],
    'product_id': ['P001', 'P002', 'P001', 'P003', 'P002', 'P004'],
    'quantity': [1, 2, 1, 1, 3, 2],
    'amount': [1200, 1600, 1200, 450, 2400, 300],
    'customer_id': ['C001', 'C002', 'C003', 'C001', None, 'C002']
}

sales_df = pd.DataFrame(sales_data)
sales_df['date'] = pd.to_datetime(sales_df['date'])
print(sales_df)

# Your code here to answer the questions


### Exercise 2: Column Operations

Using the sales DataFrame from Exercise 1:
1. Add a column 'unit_price' that calculates the price per unit (amount / quantity)
2. Add a column 'month' that extracts the month from the date
3. Add a column 'high_value' that is True if the amount is greater than 1000, False otherwise
4. Calculate the total sales amount

In [None]:
# Your code here


### Exercise 3: Handling Missing Values

Using the sales DataFrame from Exercise 1:
1. Identify which rows have missing values
2. Fill missing customer_id values with 'Unknown'
3. Create a new DataFrame that drops rows with any missing values

In [None]:
# Your code here


## Next Steps

In the next part, we'll focus on data selection and filtering operations, including how to translate SQL WHERE clauses to Pandas.

Continue to [Part 3: Selection and Filtering](02_Pandas_Fundamentals_I_part3.ipynb)