## Lesson 3: Data Transformation with dplyr - Part 1 (Select, Filter, Arrange)

Welcome to Lesson 3! Now that you've learned data cleaning, let's explore **data transformation** - the art of reshaping and manipulating your data to extract meaningful insights.

**What is Data Transformation?**
- The process of converting data from one format or structure into another
- Selecting specific columns, filtering rows, and arranging data
- Creating new variables and summarizing information

**Why is it important?**
- Most analysis requires data in a specific format
- Helps you focus on relevant subsets of your data
- Essential for creating meaningful visualizations and reports

**In this lesson, we'll cover:**
- **select()**: Choosing specific columns
- **filter()**: Subsetting rows based on conditions
- **arrange()**: Reordering rows
- **The pipe operator (%>%)**: Chaining operations together

## Loading Required Packages

For data transformation, we'll use the **dplyr** package, which is part of the **tidyverse** collection:
- **dplyr**: The grammar of data manipulation
- **Intuitive function names**: select, filter, arrange, mutate, summarize
- **Pipe operator (%>%)**: Chain operations in a readable way

Let's load the tidyverse package that includes dplyr:

In [4]:
# Load necessary packages
library(tidyverse)   # Load the tidyverse collection of packages
                     # Core packages attached: dplyr, forcats, ggplot2, lubridate,
                     # purrr, readr, stringr, tibble, tidyr

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Creating Sample Sales Dataset

For this lesson, we'll work with a realistic sales dataset that contains:
- **OrderID**: Unique identifier for each order
- **CustomerName**: Names of customers
- **Product**: Items purchased (Laptop, Mouse, Keyboard, Monitor)
- **Category**: Product category (all Electronics in this example)
- **Price**: Price of each item
- **Quantity**: Number of items ordered
- **OrderDate**: Date when the order was placed
- **Region**: Geographic region (East, West, North, South)

**Why this dataset is good for learning:**
- Contains different data types (numeric, character, date)
- Realistic business scenario
- Multiple dimensions for filtering and grouping

In [5]:
# Create a sample sales dataset
sales_data <- data.frame(              # Create a data frame (R's table structure)
  OrderID = 101:110,                   # Create sequence from 101 to 110 using :
  CustomerName = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Heidi", "Ivan", "Judy"),
                                       # c() combines values into a vector
  Product = c("Laptop", "Mouse", "Keyboard", "Monitor", "Laptop", "Mouse", "Keyboard", "Monitor", "Laptop", "Mouse"),
                                       # Character vector with product names
  Category = c("Electronics", "Electronics", "Electronics", "Electronics", "Electronics", "Electronics", "Electronics", "Electronics", "Electronics", "Electronics"),
                                       # All same category - shows repetitive data
  Price = c(1200, 25, 75, 300, 1150, 20, 80, 320, 1250, 30),
                                       # Numeric vector with price values
  Quantity = c(1, 2, 1, 1, 1, 3, 1, 1, 1, 2),
                                       # Numeric (double) vector of quantities; write 1L, 2L, ... for true integers
  OrderDate = as.Date(c("2024-01-15", "2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18",
                        "2024-01-19", "2024-01-20", "2024-01-21", "2024-01-22", "2024-01-23")),
                                       # as.Date() converts text to proper date format
  Region = c("East", "West", "North", "South", "East", "West", "North", "South", "East", "West")
                                       # Character vector of geographic regions (not a factor)
)

print("Original Sales Data:")          # print() displays output to console
print(sales_data)                      # Show the entire data frame

[1] "Original Sales Data:"
   OrderID CustomerName  Product    Category Price Quantity  OrderDate Region
1      101        Alice   Laptop Electronics  1200        1 2024-01-15   East
2      102          Bob    Mouse Electronics    25        2 2024-01-15   West
3      103      Charlie Keyboard Electronics    75        1 2024-01-16  North
4      104        David  Monitor Electronics   300        1 2024-01-17  South
5      105          Eve   Laptop Electronics  1150        1 2024-01-18   East
6      106        Frank    Mouse Electronics    20        3 2024-01-19   West
7      107        Grace Keyboard Electronics    80        1 2024-01-20  North
8      108        Heidi  Monitor Electronics   320        1 2024-01-21  South
9      109         Ivan   Laptop Electronics  1250        1 2024-01-22   East
10     110         Judy    Mouse Electronics    30        2 2024-01-23   West
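
To confirm which types `data.frame()` actually assigned to each column, a quick check along these lines can help. This is a minimal sketch on a three-row excerpt of the data; the name `sales_excerpt` is introduced here only for illustration.

```r
# A three-row excerpt of the sales data (hypothetical name, for illustration only)
sales_excerpt <- data.frame(
  OrderID   = 101:103,
  Product   = c("Laptop", "Mouse", "Keyboard"),
  Quantity  = c(1, 2, 1),
  OrderDate = as.Date(c("2024-01-15", "2024-01-15", "2024-01-16")),
  stringsAsFactors = FALSE   # already the default in R >= 4.0; explicit for older versions
)

# class() of each column: the type R inferred from each input vector
column_types <- sapply(sales_excerpt, class)
print(column_types)

# str() gives a compact overview of dimensions, types, and first values
str(sales_excerpt)
```

Note that `101:103` yields integers, while `c(1, 2, 1)` yields doubles, which R reports as `numeric`.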


## Introduction to dplyr and the Pipe Operator (%>%)

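The **pipe operator (%>%)**, which the tidyverse provides via the magrittr package, passes the result on its left as the first argument of the function on its right. It is one of the most useful tools for data manipulation in R: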

**Without pipe (traditional R):**
```r
result <- filter(select(sales_data, Product, Price), Price > 100)
```

**With pipe (modern tidyverse style):**
```r
result <- sales_data %>%
  select(Product, Price) %>%
  filter(Price > 100)
```

**Why use the pipe operator?**
- **Readability**: Code reads left-to-right, top-to-bottom
- **Less nesting**: Avoid complex nested function calls
- **Step-by-step**: Easy to add/remove operations
- **Debugging**: Can run partial pipelines to check intermediate results

In [6]:
# Example: without pipe (harder to read)
# result <- filter(select(sales_data, Product, Price), Price > 100)
# This nests functions: select first, then filter the result

# Example: with pipe (easier to read)
result <- sales_data %>%               # Start with sales_data, then pipe to next function
  select(Product, Price) %>%           # Select only Product and Price columns, then pipe
  filter(Price > 100)                  # Filter rows where Price is greater than 100
                                       # %>% passes result from left side to right side

print("Example of pipe operator - Products with Price > 100:")
print(result)                          # Display the final filtered result

[1] "Example of pipe operator - Products with Price > 100:"
  Product Price
1  Laptop  1200
2 Monitor   300
3  Laptop  1150
4 Monitor   320
5  Laptop  1250
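
The nested and piped forms are two spellings of the same computation. The sketch below checks this and also shows how to run only part of a pipeline when debugging. It uses a compact stand-in for `sales_data` with just the two columns needed; `nested`, `piped`, and `partial` are names chosen for illustration.

```r
library(dplyr)

# Compact stand-in for the lesson's sales_data (only the columns needed here)
sales_data <- data.frame(
  Product = rep(c("Laptop", "Mouse", "Keyboard", "Monitor"), length.out = 10),
  Price   = c(1200, 25, 75, 300, 1150, 20, 80, 320, 1250, 30)
)

# Nested form: read from the inside out
nested <- filter(select(sales_data, Product, Price), Price > 100)

# Piped form: read from top to bottom
piped <- sales_data %>%
  select(Product, Price) %>%
  filter(Price > 100)

# Compare the two results
print(identical(nested, piped))

# Debugging tip: run only the first part of the pipeline and inspect it
partial <- sales_data %>%
  select(Product, Price)
print(head(partial, 3))
```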


## 1. select(): Choosing Columns

The **select()** function allows you to choose which columns to keep in your dataset:

**Common select() patterns:**
- `select(col1, col2, col3)`: Select specific columns
- `select(col1:col5)`: Select range of columns
- `select(-col1, -col2)`: Select all EXCEPT specified columns
- `select(starts_with("prefix"))`: Select columns starting with text
- `select(ends_with("suffix"))`: Select columns ending with text
- `select(contains("text"))`: Select columns containing text

**Why is select() useful?**
- Focus on relevant variables for analysis
- Reduce dataset size for better performance
- Reorder columns as needed

In [7]:
# Select specific columns
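selected_columns <- sales_data %>%    # Take sales_data and pipe it to select()
  select(OrderID, CustomerName, Product, Price)
                                       # select() keeps only the named columns
                                       # All other columns (Category, Quantity, etc.) are dropped
print("Selected Columns (OrderID, CustomerName, Product, Price):")
print(selected_columns)               # Result has only 4 columns instead of original 8

[1] "Selected Columns (OrderID, CustomerName, Product, Price):"
   OrderID CustomerName  Product Price
1      101        Alice   Laptop  1200
2      102          Bob    Mouse    25
3      103      Charlie Keyboard    75
4      104        David  Monitor   300
5      105          Eve   Laptop  1150
6      106        Frank    Mouse    20
7      107        Grace Keyboard    80
8      108        Heidi  Monitor   320
9      109         Ivan   Laptop  1250
10     110         Judy    Mouse    30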


In [10]:
# Select columns by range
selected_range <- sales_data %>%      # Take sales_data and pipe to select()
  select(OrderID:Product)              # : means "from OrderID through Product"
                                       # This selects OrderID, CustomerName, and Product
                                       # Range selection based on column position
print("Selected Columns by Range (OrderID to Product):")
print(selected_range)                 # Shows first 3 columns only

[1] "Selected Columns by Range (OrderID to Product):"
   OrderID CustomerName  Product
1      101        Alice   Laptop
2      102          Bob    Mouse
3      103      Charlie Keyboard
4      104        David  Monitor
5      105          Eve   Laptop
6      106        Frank    Mouse
7      107        Grace Keyboard
8      108        Heidi  Monitor
9      109         Ivan   Laptop
10     110         Judy    Mouse


In [9]:
# Select all columns EXCEPT some
except_columns <- sales_data %>%      # Take sales_data and pipe to select()
  select(-Category, -OrderDate)       # Minus sign (-) means "exclude these columns"
                                       # Keep everything except Category and OrderDate
                                       # Useful when you want most columns but not all
print("Selected All Columns Except Category and OrderDate:")
print(except_columns)                 # Result has 6 columns instead of 8

[1] "Selected All Columns Except Category and OrderDate:"
   OrderID CustomerName  Product Price Quantity Region
1      101        Alice   Laptop  1200        1   East
2      102          Bob    Mouse    25        2   West
3      103      Charlie Keyboard    75        1  North
4      104        David  Monitor   300        1  South
5      105          Eve   Laptop  1150        1   East
6      106        Frank    Mouse    20        3   West
7      107        Grace Keyboard    80        1  North
8      108        Heidi  Monitor   320        1  South
9      109         Ivan   Laptop  1250        1   East
10     110         Judy    Mouse    30        2   West


In [8]:
# Select columns that start with a specific string
starts_with_o <- sales_data %>%       # Take sales_data and pipe to select()
  select(starts_with("O"))            # starts_with() is a helper function
                                       # Finds columns beginning with "O"
                                       # Matching ignores case by default (ignore.case = TRUE); finds "OrderID" and "OrderDate"
print("Selected Columns Starting with 'O':")
print(starts_with_o)                  # Result shows OrderID and OrderDate columns only

[1] "Selected Columns Starting with 'O':"
   OrderID  OrderDate
1      101 2024-01-15
2      102 2024-01-15
3      103 2024-01-16
4      104 2024-01-17
5      105 2024-01-18
6      106 2024-01-19
7      107 2024-01-20
8      108 2024-01-21
9      109 2024-01-22
10     110 2024-01-23


In [11]:
# Select columns that contain a specific string
contains_name <- sales_data %>%       # Take sales_data and pipe to select()
  select(contains("Name"))            # contains() is a helper function
                                       # Finds columns with "Name" anywhere in column name
                                       # Matches "CustomerName" in our dataset
print("Selected Columns Containing 'Name':")
print(contains_name)                  # Result shows only CustomerName column

[1] "Selected Columns Containing 'Name':"
   CustomerName
1         Alice
2           Bob
3       Charlie
4         David
5           Eve
6         Frank
7         Grace
8         Heidi
9          Ivan
10         Judy
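
Two patterns mentioned above were not demonstrated: `ends_with()` and reordering columns. The following is a minimal sketch on a three-row stand-in for `sales_data`; it assumes the tidyselect helper `everything()`, which dplyr re-exports, for moving a column to the front.

```r
library(dplyr)

# Three-row stand-in for the lesson's sales_data (subset of columns)
sales_data <- data.frame(
  OrderID      = 101:103,
  CustomerName = c("Alice", "Bob", "Charlie"),
  Price        = c(1200, 25, 75),
  OrderDate    = as.Date(c("2024-01-15", "2024-01-15", "2024-01-16")),
  Region       = c("East", "West", "North")
)

# ends_with(): keep columns whose names end with "Date"
date_columns <- sales_data %>%
  select(ends_with("Date"))
print(names(date_columns))

# everything(): put Region first, then all remaining columns in their original order
region_first <- sales_data %>%
  select(Region, everything())
print(names(region_first))
```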


## 2. filter(): Subsetting Rows Based on Conditions

The **filter()** function allows you to subset rows based on logical conditions:

**Common filter() operators:**
- `==`: Equal to
- `!=`: Not equal to
- `>`, `>=`: Greater than (or equal)
- `<`, `<=`: Less than (or equal)
- `&` or `,`: AND condition
- `|`: OR condition
- `%in%`: Value is one of a set of values (a vector)
- `is.na()`: Is missing value
- `!is.na()`: Is not missing value

**Why is filter() essential?**
- Focus on specific subsets of data
- Remove unwanted observations
- Create targeted analyses for different segments
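
The startup message printed when the tidyverse was loaded reported that `dplyr::filter()` masks `stats::filter()`. If a package attached later masks `filter()` again, or simply to make the intent explicit, the function can be called with its package prefix. A minimal sketch, using a small stand-in data frame:

```r
library(dplyr)

# Small stand-in for the lesson's sales_data
sales_data <- data.frame(
  Product = c("Laptop", "Mouse", "Monitor"),
  Price   = c(1200, 25, 300)
)

# The package::function form always calls dplyr's filter(),
# regardless of which other packages are attached
expensive <- dplyr::filter(sales_data, Price > 100)
print(expensive)
```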

In [12]:
# Filter rows where Price is greater than 100
high_price_items <- sales_data %>%    # Take sales_data and pipe to filter()
  filter(Price > 100)                 # filter() keeps rows where condition is TRUE
                                       # > is the "greater than" comparison operator
                                       # Only keeps rows where Price column > 100
print("Items with Price > 100:")
print(high_price_items)               # Shows only expensive items (laptops, monitors)

[1] "Items with Price > 100:"
  OrderID CustomerName Product    Category Price Quantity  OrderDate Region
1     101        Alice  Laptop Electronics  1200        1 2024-01-15   East
2     104        David Monitor Electronics   300        1 2024-01-17  South
3     105          Eve  Laptop Electronics  1150        1 2024-01-18   East
4     108        Heidi Monitor Electronics   320        1 2024-01-21  South
5     109         Ivan  Laptop Electronics  1250        1 2024-01-22   East


In [13]:
# Filter rows where Product is 'Laptop' and Quantity is 1
laptop_single_quantity <- sales_data %>%  # Take sales_data and pipe to filter()
  filter(Product == "Laptop", Quantity == 1)
                                       # Multiple conditions separated by comma = AND logic
                                       # == is "exactly equal to" (use == not = for comparison)
                                       # Both conditions must be TRUE for row to be kept
print("Laptops with Quantity = 1:")
print(laptop_single_quantity)         # Shows only laptop orders with quantity of 1

[1] "Laptops with Quantity = 1:"
  OrderID CustomerName Product    Category Price Quantity  OrderDate Region
1     101        Alice  Laptop Electronics  1200        1 2024-01-15   East
2     105          Eve  Laptop Electronics  1150        1 2024-01-18   East
3     109         Ivan  Laptop Electronics  1250        1 2024-01-22   East


In [14]:
# Filter rows where Region is 'East' or 'West'
east_west_region <- sales_data %>%    # Take sales_data and pipe to filter()
  filter(Region == "East" | Region == "West")
                                       # | is the OR operator (either condition can be true)
                                       # Keep rows where Region equals "East" OR "West"
                                       # Excludes "North" and "South" regions
print("Orders from East or West Region:")
print(east_west_region)               # Shows only orders from East or West regions

[1] "Orders from East or West Region:"
  OrderID CustomerName Product    Category Price Quantity  OrderDate Region
1     101        Alice  Laptop Electronics  1200        1 2024-01-15   East
2     102          Bob   Mouse Electronics    25        2 2024-01-15   West
3     105          Eve  Laptop Electronics  1150        1 2024-01-18   East
4     106        Frank   Mouse Electronics    20        3 2024-01-19   West
5     109         Ivan  Laptop Electronics  1250        1 2024-01-22   East
6     110         Judy   Mouse Electronics    30        2 2024-01-23   West


In [15]:
# Filter using %in% operator
selected_products <- sales_data %>%   # Take sales_data and pipe to filter()
  filter(Product %in% c("Laptop", "Monitor"))
                                       # %in% checks whether each value appears in a vector
                                       # c("Laptop", "Monitor") creates a vector of allowed values
                                       # More concise than Product == "Laptop" | Product == "Monitor"
print("Orders for Laptop or Monitor:")
print(selected_products)              # Shows only laptop and monitor orders

[1] "Orders for Laptop or Monitor:"
  OrderID CustomerName Product    Category Price Quantity  OrderDate Region
1     101        Alice  Laptop Electronics  1200        1 2024-01-15   East
2     104        David Monitor Electronics   300        1 2024-01-17  South
3     105          Eve  Laptop Electronics  1150        1 2024-01-18   East
4     108        Heidi Monitor Electronics   320        1 2024-01-21  South
5     109         Ivan  Laptop Electronics  1250        1 2024-01-22   East


In [16]:
# Filter rows based on date
orders_jan_17_onwards <- sales_data %>%  # Take sales_data and pipe to filter()
  filter(OrderDate >= as.Date("2024-01-17"))
                                       # >= means "greater than or equal to"
                                       # as.Date() converts text string to date format
                                       # Keeps orders from Jan 17, 2024 and later
print("Orders from Jan 17, 2024 onwards:")
print(orders_jan_17_onwards)          # Shows orders from Jan 17 through Jan 23

[1] "Orders from Jan 17, 2024 onwards:"
  OrderID CustomerName  Product    Category Price Quantity  OrderDate Region
1     104        David  Monitor Electronics   300        1 2024-01-17  South
2     105          Eve   Laptop Electronics  1150        1 2024-01-18   East
3     106        Frank    Mouse Electronics    20        3 2024-01-19   West
4     107        Grace Keyboard Electronics    80        1 2024-01-20  North
5     108        Heidi  Monitor Electronics   320        1 2024-01-21  South
6     109         Ivan   Laptop Electronics  1250        1 2024-01-22   East
7     110         Judy    Mouse Electronics    30        2 2024-01-23   West
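
The operators `!=`, `is.na()`, and `!is.na()` from the list above do not appear in the examples, partly because the lesson's dataset has no missing values. The sketch below uses a hypothetical data frame, `orders_with_gaps`, with one missing price to show how they behave.

```r
library(dplyr)

# Hypothetical data with one missing price (not part of the lesson's dataset)
orders_with_gaps <- data.frame(
  OrderID = 201:204,
  Region  = c("East", "West", "North", "East"),
  Price   = c(1200, NA, 75, 300)
)

# != keeps rows whose Region differs from "East"
not_east <- orders_with_gaps %>%
  filter(Region != "East")

# !is.na() keeps rows where Price is present
known_price <- orders_with_gaps %>%
  filter(!is.na(Price))

# is.na() keeps rows where Price is missing
missing_price <- orders_with_gaps %>%
  filter(is.na(Price))

# A comparison against NA evaluates to NA, and filter() drops such rows
over_100 <- orders_with_gaps %>%
  filter(Price > 100)

print(not_east)
print(known_price)
print(missing_price)
print(over_100)
```

The last case is worth remembering: `filter()` keeps only rows where the condition is `TRUE`, so rows with a missing value in the tested column are dropped silently.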


## 3. arrange(): Reordering Rows

The **arrange()** function allows you to sort your data by one or more columns:

**Common arrange() patterns:**
- `arrange(column)`: Sort by column in ascending order
- `arrange(desc(column))`: Sort by column in descending order
- `arrange(col1, col2)`: Sort by multiple columns
- `arrange(desc(col1), col2)`: Mixed ascending/descending sort

**Why is arrange() useful?**
- Identify top/bottom performers
- Prepare data for reports
- Make patterns more visible
- Ensure consistent ordering for analysis

In [17]:
# Arrange by Price in ascending order (default)
arranged_by_price_asc <- sales_data %>%  # Take sales_data and pipe to arrange()
  arrange(Price)                      # arrange() sorts rows by specified column
                                       # Default is ascending order (lowest to highest)
                                       # Cheapest items appear first
print("Arranged by Price (Ascending):")
print(arranged_by_price_asc)          # Shows data sorted from $20 to $1250

[1] "Arranged by Price (Ascending):"
   OrderID CustomerName  Product    Category Price Quantity  OrderDate Region
1      106        Frank    Mouse Electronics    20        3 2024-01-19   West
2      102          Bob    Mouse Electronics    25        2 2024-01-15   West
3      110         Judy    Mouse Electronics    30        2 2024-01-23   West
4      103      Charlie Keyboard Electronics    75        1 2024-01-16  North
5      107        Grace Keyboard Electronics    80        1 2024-01-20  North
6      104        David  Monitor Electronics   300        1 2024-01-17  South
7      108        Heidi  Monitor Electronics   320        1 2024-01-21  South
8      105          Eve   Laptop Electronics  1150        1 2024-01-18   East
9      101        Alice   Laptop Electronics  1200        1 2024-01-15   East
10     109         Ivan   Laptop Electronics  1250        1 2024-01-22   East


In [18]:
# Arrange by Price in descending order
arranged_by_price_desc <- sales_data %>%  # Take sales_data and pipe to arrange()
  arrange(desc(Price))                # desc() function reverses sort order
                                       # desc = descending (highest to lowest)
                                       # Most expensive items appear first
print("Arranged by Price (Descending):")
print(arranged_by_price_desc)          # Shows data sorted from $1250 to $20

[1] "Arranged by Price (Descending):"
   OrderID CustomerName  Product    Category Price Quantity  OrderDate Region
1      109         Ivan   Laptop Electronics  1250        1 2024-01-22   East
2      101        Alice   Laptop Electronics  1200        1 2024-01-15   East
3      105          Eve   Laptop Electronics  1150        1 2024-01-18   East
4      108        Heidi  Monitor Electronics   320        1 2024-01-21  South
5      104        David  Monitor Electronics   300        1 2024-01-17  South
6      107        Grace Keyboard Electronics    80        1 2024-01-20  North
7      103      Charlie Keyboard Electronics    75        1 2024-01-16  North
8      110         Judy    Mouse Electronics    30        2 2024-01-23   West
9      102          Bob    Mouse Electronics    25        2 2024-01-15   West
10     106        Frank    Mouse Electronics    20        3 2024-01-19   West


In [19]:
# Arrange by multiple columns (e.g., Region then Price)
arranged_by_region_price <- sales_data %>%  # Take sales_data and pipe to arrange()
  arrange(Region, Price)              # Multiple columns: first sort by Region
                                       # Then within each region, sort by Price
                                       # Primary sort = Region (alphabetical)
                                       # Secondary sort = Price (within each region)
print("Arranged by Region then Price:")
print(arranged_by_region_price)        # Rows ordered East, North, South, West, then by price within each region

[1] "Arranged by Region then Price:"
   OrderID CustomerName  Product    Category Price Quantity  OrderDate Region
1      105          Eve   Laptop Electronics  1150        1 2024-01-18   East
2      101        Alice   Laptop Electronics  1200        1 2024-01-15   East
3      109         Ivan   Laptop Electronics  1250        1 2024-01-22   East
4      103      Charlie Keyboard Electronics    75        1 2024-01-16  North
5      107        Grace Keyboard Electronics    80        1 2024-01-20  North
6      104        David  Monitor Electronics   300        1 2024-01-17  South
7      108        Heidi  Monitor Electronics   320        1 2024-01-21  South
8      106        Frank    Mouse Electronics    20        3 2024-01-19   West
9      102          Bob    Mouse Electronics    25        2 2024-01-15   West
10     110         Judy    Mouse Electronics    30        2 2024-01-23   West
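
The mixed pattern `arrange(desc(col1), col2)` listed above can be sketched as follows, using a compact stand-in for `sales_data` with only the columns involved. Regions are sorted in reverse alphabetical order, and prices ascend within each region.

```r
library(dplyr)

# Compact stand-in for the lesson's sales_data (only the columns needed here)
sales_data <- data.frame(
  OrderID = 101:110,
  Price   = c(1200, 25, 75, 300, 1150, 20, 80, 320, 1250, 30),
  Region  = rep(c("East", "West", "North", "South"), length.out = 10)
)

# Region descending (West first), then Price ascending within each region
mixed_sort <- sales_data %>%
  arrange(desc(Region), Price)
print(mixed_sort)
```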


## Combining Operations: The Power of the Pipe

The real power of dplyr comes from **chaining multiple operations** together using the pipe operator:

**Benefits of chaining:**
- **Readable code**: Operations flow logically from one to the next
- **Concise**: No need to name and store intermediate results
- **Flexible**: Easy to modify by adding/removing steps
- **Powerful**: Complex transformations in just a few lines

**Example workflow:**
1. Start with original data
2. Filter to relevant rows
3. Select important columns
4. Arrange in meaningful order

In [20]:
# Combine operations: filter and then arrange
high_value_orders_arranged <- sales_data %>%  # Start with sales_data
  filter(Price * Quantity > 1000) %>% # Step 1: Calculate total value (Price × Quantity)
                                       # Keep only rows where total > 1000
  arrange(desc(Price * Quantity))     # Step 2: Sort by total value (highest first)
                                       # desc() sorts descending order
                                       # Pipeline: data → filter → arrange → result
print("High Value Orders (Price * Quantity > 1000) arranged by total value (Descending):")
print(high_value_orders_arranged)      # Shows high-value orders sorted by total value

[1] "High Value Orders (Price * Quantity > 1000) arranged by total value (Descending):"
  OrderID CustomerName Product    Category Price Quantity  OrderDate Region
1     109         Ivan  Laptop Electronics  1250        1 2024-01-22   East
2     101        Alice  Laptop Electronics  1200        1 2024-01-15   East
3     105          Eve  Laptop Electronics  1150        1 2024-01-18   East
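
The four-step workflow listed above (start, filter, select, arrange) can be sketched as a single pipeline. This example uses a compact stand-in for `sales_data`; the result name `top_east_west` is chosen for illustration.

```r
library(dplyr)

# Compact stand-in for the lesson's sales_data (only the columns needed here)
sales_data <- data.frame(
  OrderID = 101:110,
  Product = rep(c("Laptop", "Mouse", "Keyboard", "Monitor"), length.out = 10),
  Price   = c(1200, 25, 75, 300, 1150, 20, 80, 320, 1250, 30),
  Region  = rep(c("East", "West", "North", "South"), length.out = 10)
)

top_east_west <- sales_data %>%                 # 1. Start with the original data
  filter(Region %in% c("East", "West")) %>%     # 2. Keep only East and West orders
  select(OrderID, Product, Price) %>%           # 3. Keep the columns of interest
  arrange(desc(Price))                          # 4. Most expensive orders first
print(top_east_west)
```

Filtering before selecting matters here: `Region` must still be present when `filter()` runs, so it is dropped only afterwards.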


## Summary and Next Steps

**What we learned in this lesson:**
- **select()**: Choose specific columns from your dataset
- **filter()**: Subset rows based on logical conditions
- **arrange()**: Sort data by one or more columns
- **Pipe operator (%>%)**: Chain operations together for readable code

**Key takeaways:**
1. **Start simple**: Begin with basic operations before chaining
2. **Think step-by-step**: Break complex tasks into smaller operations
3. **Use meaningful names**: Create descriptive variable names for results
4. **Practice**: The more you use these functions, the more intuitive they become

**Coming up in Part 2:**
- **mutate()**: Creating new variables
- **summarize()**: Calculating summary statistics
- **group_by()**: Performing operations by groups
- **Advanced combinations**: Complex data transformations

**Practice exercises:**
1. Try filtering the data for different conditions
2. Experiment with selecting different column combinations
3. Practice arranging by different variables
4. Create your own chain of operations