# Getting Started with dplyr in R: A Hands-on Tutorial

---

Welcome to the dplyr module in our Data Analysis and Visualization series. In this module, we'll explore the powerful `dplyr` package for data manipulation in R.

## Learning Objectives

After completing this module, you will be able to:

1. Understand the core functions of `dplyr`.
2. Use the pipe operator `%>%` to chain together multiple operations.
3. Perform data manipulation tasks such as filtering, selecting, arranging, mutating, and summarizing data.
4. Group data and perform group-wise operations.
5. Read and write data files using `readr`.
6. Apply these skills to real-world datasets.

---

## Introduction

The `dplyr` package is an essential tool for data analysis in R, providing a set of intuitive functions for data manipulation. Whether you're analyzing scientific experiments, business metrics, or social data, `dplyr` offers a consistent and efficient way to handle your data.

### Why dplyr?

#### 1. Intuitive Grammar
- Uses functions that mirror how we think about data.
  - `filter()` to keep rows that match conditions.
  - `select()` to choose specific columns.
  - `arrange()` to sort data.
  - `mutate()` to create new columns.
  - `summarize()` to calculate statistics.
- Consistent syntax makes code readable and maintainable.
- Clear correspondence between code and analytical steps.

#### 2. Performance
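- Designed to be fast on in-memory data frames, including large ones.
- Backend packages such as `dtplyr` and `dbplyr` let the same verbs run on `data.table` objects or inside a database when the data outgrows memory.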

#### 3. Versatility
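- Works on any data frame or tibble, regardless of where the data came from (for example CSV files read with `readr`, Excel files read with `readxl`, or SQL tables accessed through `dbplyr`).
- Handles different types of data (numerical, categorical, time series).
- The same verbs apply to a ten-row table and to much larger datasets.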

#### 4. Integration
- Part of the tidyverse ecosystem.
- Seamless connection with visualization tools like ggplot2.
- Works well with specialized packages for specific domains.

---

## 1. Install Packages and Load Libraries

Before we start, we need to install and load the necessary packages.

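```R
# Suppress warnings for the rest of the session (this also hides useful warnings)
options(warn = -1)

# Install required packages if they are not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("palmerpenguins", quietly = TRUE)) install.packages("palmerpenguins")

# Load libraries, hiding the startup messages they print
suppressPackageStartupMessages({
  library(tidyverse)
  library(palmerpenguins)
})

# Load the dataset, naming the package so the palmerpenguins version is used
data("penguins", package = "palmerpenguins")
```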

## 2. Understanding Pipes: A Key Concept

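`dplyr` code is usually written with the pipe operator `%>%`. The pipe comes from the `magrittr` package and becomes available when you load `dplyr` or the tidyverse. Think of it as saying "and then" between operations: the pipe takes the output of one expression and passes it as the first argument to the next function, so `x %>% f(y)` is the same as `f(x, y)`.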

### Without Pipes
Let's consider filtering the penguins dataset for Gentoo species and selecting specific columns without using pipes:

```R
# Without pipes
selected_data <- select(
  filter(penguins, species == "Gentoo"),
  species, island, body_mass_g
)
head(selected_data)
```

### With Pipes
Using pipes, the same operation becomes more readable:

```R
# With pipes
selected_data <- penguins %>%
  filter(species == "Gentoo") %>%
  select(species, island, body_mass_g)
head(selected_data)
```

### Benefits of Using Pipes
- **Readability**: Code flows from top to bottom, mirroring the sequence of operations.
- **Maintainability**: Easier to modify and add steps.
- **Debugging**: Simplifies tracing through data transformations.

---

## 3. Core dplyr Functions

Let's explore the main functions that form the backbone of `dplyr` data manipulation. For each function, we'll:
- Understand its purpose.
- See its basic syntax.
- Try practical examples with our penguin data.

---

### 3.1 `filter()`: Subsetting Rows

The `filter()` function lets you keep rows that match certain conditions. Think of it like a search function that helps you find specific observations in your dataset.

#### Common Operators Used in `filter()`
- `==` : equal to.
- `!=` : not equal to.
- `>` , `>=` , `<` , `<=` : greater than, greater than or equal to, less than, less than or equal to.
- `%in%` : matches any of the values in a vector.
- `&` : logical AND.
- `|` : logical OR.
- `is.na()` : checks for missing values.

#### Examples

```R
# Find all Gentoo penguins
gentoo_penguins <- penguins %>%
  filter(species == "Gentoo")
head(gentoo_penguins)

# Find penguins heavier than 5000g from Biscoe island
heavy_biscoe_penguins <- penguins %>%
  filter(body_mass_g > 5000, island == "Biscoe")
head(heavy_biscoe_penguins)

# Find penguins from either Dream or Torgersen islands
dream_torgersen_penguins <- penguins %>%
  filter(island %in% c("Dream", "Torgersen"))
head(dream_torgersen_penguins)
```
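
The operators `|`, `!=`, and `is.na()` from the list above can be sketched the same way. The example below is a minimal illustration that uses a small made-up tibble (the values are invented, not taken from `penguins`) so that it runs on its own. It also shows that `filter()` drops rows where the condition evaluates to `NA`.

```R
library(dplyr)

# Toy data with invented values, standing in for a few penguin rows
birds <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap", "Gentoo"),
  sex = c("female", NA, "male", "male"),
  body_mass_g = c(3700, 5100, NA, 5600)
)

# Logical OR: keep rows that match either condition
light_or_gentoo <- birds %>%
  filter(body_mass_g < 4000 | species == "Gentoo")

# Not equal: keep everything except Adelie
not_adelie <- birds %>%
  filter(species != "Adelie")

# Keep only rows where sex is missing
missing_sex <- birds %>%
  filter(is.na(sex))

# The Chinstrap row has a missing body mass, so the comparison is NA
# and filter() drops that row rather than keeping it
heavy <- birds %>%
  filter(body_mass_g > 5000)
```

Separating conditions with a comma, as in the examples above, is equivalent to joining them with `&`.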

#### Exercise 1: Filtering Data

**Task**: Find all female Adelie penguins with a flipper length greater than 190 mm.

##### Your Code Here

```R
# Your code here
```

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Use `filter()` with conditions `species == "Adelie"`, `sex == "female"`, and `flipper_length_mm > 190`.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
adelie_females <- penguins %>%
  filter(species == "Adelie", sex == "female", flipper_length_mm > 190)
head(adelie_females)
```

</details>

### 3.2 `select()`: Choosing Columns

The `select()` function helps you pick which columns (variables) you want to keep or remove. You can:
- Select columns by name.
- Remove columns using `-`.
- Use helper functions like `starts_with()`, `ends_with()`, `contains()`, `everything()`.

#### Examples

```R
# Select specific columns
selected_columns <- penguins %>%
  select(species, island, body_mass_g)
head(selected_columns)

# Remove the 'year' column
without_year <- penguins %>%
  select(-year)
head(without_year)

# Select columns that start with 'bill'
bill_columns <- penguins %>%
  select(starts_with("bill"))
head(bill_columns)
```
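
The other helpers mentioned above, `ends_with()`, `contains()`, and `everything()`, work in the same way. The sketch below uses a one-row made-up tibble whose column names mimic those in `penguins`; the values are illustrative only.

```R
library(dplyr)

# One invented row with penguins-style column names
measurements <- tibble(
  species = "Adelie",
  bill_length_mm = 39.1,
  bill_depth_mm = 18.7,
  body_mass_g = 3750,
  year = 2007
)

# Columns whose names end in "_mm"
mm_columns <- measurements %>%
  select(ends_with("_mm"))

# Columns whose names contain "mass"
mass_columns <- measurements %>%
  select(contains("mass"))

# Move 'year' to the front, then keep all remaining columns in their original order
year_first <- measurements %>%
  select(year, everything())
```

Using `everything()` after a named column is a common way to reorder columns without dropping any.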

#### Exercise 2: Selecting Columns

**Task**: Create a new dataframe that includes only the `species`, `island`, `sex`, and all columns that contain the word `length`.

##### Your Code Here

```R
# Your code here
```

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Use `select()` with column names and `contains("length")`.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
length_data <- penguins %>%
  select(species, island, sex, contains("length"))
head(length_data)
```

</details>

### 3.3 `arrange()`: Sorting Data

`arrange()` helps you sort your data based on one or more columns:
- Use `desc()` for descending order.
- Can sort by multiple columns.
- Missing values (`NA`) are always placed at the end, even when sorting with `desc()`.

#### Examples

```R
# Sort by body mass in ascending order
sorted_mass <- penguins %>%
  arrange(body_mass_g)
head(sorted_mass)

# Sort by body mass in descending order
sorted_mass_desc <- penguins %>%
  arrange(desc(body_mass_g))
head(sorted_mass_desc)

# Sort by species and then by bill length
sorted_species_bill <- penguins %>%
  arrange(species, bill_length_mm)
head(sorted_species_bill)
```
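
To see where missing values end up, here is a small sketch with a made-up four-row tibble (the `id` column and the values are invented for illustration). The row with `NA` is sorted last in both directions.

```R
library(dplyr)

# Invented values; row 2 has a missing body mass
masses <- tibble(
  id = 1:4,
  body_mass_g = c(4200, NA, 3500, 5000)
)

ascending <- masses %>%
  arrange(body_mass_g)

descending <- masses %>%
  arrange(desc(body_mass_g))

ascending$id   # 3 1 4 2: the NA row (id 2) comes last
descending$id  # 4 1 3 2: the NA row (id 2) still comes last
```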

#### Exercise 3: Arranging Data

**Task**: Arrange the penguins dataset to find the top 5 penguins with the longest flipper length. Display their `species`, `island`, `flipper_length_mm`, and `body_mass_g`.

##### Your Code Here

```R
# Your code here
```

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Use `arrange(desc(flipper_length_mm))` to sort in descending order.
- Use `head()` to get the top 5 rows.
- Use `select()` to choose the desired columns.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
top_flipper <- penguins %>%
  arrange(desc(flipper_length_mm)) %>%
  select(species, island, flipper_length_mm, body_mass_g) %>%
  head(5)
top_flipper
```

</details>

### 3.4 `mutate()`: Creating New Variables

`mutate()` allows you to:
- Create new columns based on calculations.
- Modify existing columns.
- Create multiple columns at once.

#### Examples

```R
# Add new columns with calculations
penguins_mutated <- penguins %>%
  mutate(
    body_mass_kg = body_mass_g / 1000,
    bill_ratio = bill_length_mm / bill_depth_mm
  )
head(penguins_mutated)
```
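
`mutate()` can also overwrite an existing column, and a column created earlier in the same call can be used by later expressions. The following sketch shows both on a made-up two-row tibble; the `heavy` column and the 4 kg threshold are hypothetical choices for illustration.

```R
library(dplyr)

# Invented values standing in for two penguin rows
birds <- tibble(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)

birds_updated <- birds %>%
  mutate(
    species = toupper(species),         # overwrite an existing column
    body_mass_kg = body_mass_g / 1000,  # create a new column
    heavy = body_mass_kg > 4            # reuse the column created just above
  )
birds_updated
```

The original `birds` object is unchanged; `mutate()` returns a modified copy, which is why the result is assigned to a new name.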

#### Exercise 4: Mutating Data

**Task**: Create a new column called `flipper_length_cm` by converting `flipper_length_mm` to centimeters. Then, select `species`, `flipper_length_mm`, and `flipper_length_cm`.

##### Your Code Here

```R
# Your code here
```

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Use `mutate(flipper_length_cm = flipper_length_mm / 10)`.
- Use `select()` to choose the desired columns.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
penguins_cm <- penguins %>%
  mutate(flipper_length_cm = flipper_length_mm / 10) %>%
  select(species, flipper_length_mm, flipper_length_cm)
head(penguins_cm)
```

</details>

### 3.5 `summarize()`: Calculating Summary Statistics

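`summarize()` (or `summarise()`) collapses a data frame into summary rows. It:
- Calculates summary statistics such as counts, means, and standard deviations.
- Reduces many values down to a single value per summary column.
- Is often combined with `group_by()` to produce one row per group.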

#### Examples

```R
# Overall summary statistics
overall_summary <- penguins %>%
  summarize(
    count = n(),
    avg_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE)
  )
overall_summary

# Summary statistics by species
species_summary <- penguins %>%
  group_by(species) %>%
  summarize(
    count = n(),
    avg_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE)
  )
species_summary
```
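
The examples above pass `na.rm = TRUE` because a single missing value otherwise turns the whole summary into `NA`. The sketch below illustrates this with a made-up column of four values; the summary column names are chosen for illustration.

```R
library(dplyr)

# Invented values with one missing measurement
masses <- tibble(body_mass_g = c(3000, 4000, NA, 5000))

mass_summary <- masses %>%
  summarize(
    n_rows = n(),                                      # counts every row, including the NA
    n_missing = sum(is.na(body_mass_g)),               # how many values are missing
    mean_with_na = mean(body_mass_g),                  # NA, because one value is missing
    mean_without_na = mean(body_mass_g, na.rm = TRUE)  # mean of the three known values
  )
mass_summary
```

Note that `n()` counts rows, not non-missing values, so it is worth reporting the number of missing values alongside it.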

### 3.6 `group_by()`: Grouping Data

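`group_by()` does not change the data itself. It marks rows as belonging to groups so that the verbs that follow work within each group:
- `summarize()` returns one row per group.
- `mutate()` computes values within each group while keeping every row.
- `ungroup()` removes the grouping when you are done.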

Think of it as saying "do this calculation separately for each group."

#### Example

```R
# Calculate average body mass by species and sex
mass_by_species_sex <- penguins %>%
  group_by(species, sex) %>%
  summarize(
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    count = n(),
    .groups = "drop"  # return an ungrouped result and avoid the grouping message
  )
mass_by_species_sex
```
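
`group_by()` can also be paired with `mutate()`, which keeps every row and adds group-level values next to it. The following is a minimal sketch on a made-up tibble; the column names `species_mean` and `diff_from_mean` are hypothetical.

```R
library(dplyr)

# Invented values: two penguins per species
birds <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3500, 3900, 5000, 5400)
)

centered <- birds %>%
  group_by(species) %>%
  mutate(
    species_mean = mean(body_mass_g),            # mean within each species
    diff_from_mean = body_mass_g - species_mean  # each bird relative to its species
  ) %>%
  ungroup()  # drop the grouping so later steps are not affected by it
centered
```

Unlike `summarize()`, the result still has four rows, one per bird.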

#### Exercise 5: Grouping and Summarizing

**Task**: For each island, calculate the average `bill_length_mm` and the number of penguins observed.

##### Your Code Here

```R
# Your code here
```

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Use `group_by(island)`.
- Use `summarize(avg_bill_length = mean(bill_length_mm, na.rm = TRUE), count = n())`.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
island_summary <- penguins %>%
  group_by(island) %>%
  summarize(
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    count = n()
  )
island_summary
```

</details>

## 4. Combining Multiple dplyr Functions

One of the most powerful aspects of `dplyr` is the ability to combine multiple functions using pipes (`%>%`). This allows you to:
- Perform complex data manipulation step by step.
- Keep your code readable and logical.
- Build up analysis in a clear sequence.

### Example: Finding the Largest Penguins by Species

**Task**: Find the top 3 heaviest penguins for each species and display their `species`, `island`, `sex`, and `body_mass_g`.

#### Solution

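```R
# Find top 3 heaviest penguins by species
top_penguins <- penguins %>%
  drop_na(body_mass_g) %>%          # drop_na() comes from tidyr (loaded with the tidyverse)
  group_by(species) %>%
  arrange(desc(body_mass_g)) %>%    # arrange() ignores groups by default, so this sorts the whole table
  slice_head(n = 3) %>%             # first 3 rows within each species, i.e. the 3 heaviest
  select(species, island, sex, body_mass_g)
top_penguins
```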
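
As an alternative sketch, `slice_max()` (available in dplyr 1.0.0 and later) combines the sorting and slicing steps. The example below uses a made-up tibble and keeps the top 2 per species; `with_ties = FALSE` is an assumption about how ties should be handled, so adjust it if tied rows should all be kept.

```R
library(dplyr)

# Invented values: three penguins per species
birds <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  body_mass_g = c(3500, 3900, 3700, 5000, 5400, 5200)
)

top_two <- birds %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 2, with_ties = FALSE) %>%  # 2 heaviest rows per species
  ungroup()
top_two
```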

## 5. Reading and Writing Data with `readr`

The `readr` package (part of the tidyverse) provides functions to read and write data efficiently.

### Reading Data
- `read_csv()`: Reads comma-separated files.
- `read_tsv()`: Reads tab-separated files.

### Writing Data
- `write_csv()`: Writes data to a CSV file.
- `write_tsv()`: Writes data to a TSV file.

#### Example: Writing and Reading Data

```R
# Create a small dataset
small_penguins <- penguins %>%
  select(species, island, body_mass_g) %>%
  head(10)

# Write to CSV
write_csv(small_penguins, "small_penguins.csv")

# Read from CSV
penguins_csv <- read_csv("small_penguins.csv")
penguins_csv
```
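
The tab-separated functions follow the same pattern. The sketch below writes a made-up two-row tibble to a temporary file and reads it back. Passing `col_types` is optional; it is included here as one way to state the expected column types explicitly instead of letting `readr` guess them.

```R
library(dplyr)
library(readr)

# Invented values for a tiny example table
small <- tibble(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)

# Write to a temporary TSV file so the example leaves nothing behind in the project folder
tsv_path <- tempfile(fileext = ".tsv")
write_tsv(small, tsv_path)

# Read it back, declaring the type of each column
small_tsv <- read_tsv(
  tsv_path,
  col_types = cols(
    species = col_character(),
    body_mass_g = col_double()
  )
)
small_tsv
```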

## 6. Working with Genomic Data: The S. cerevisiae Genome

In this section, we'll apply `dplyr` to real-world genomic data: the *S. cerevisiae* (yeast) genome annotation in GFF format, a tab-separated text format with one row per gene or other feature.

### Steps to Work with GFF Files
1. **Read the data**: Use `read_tsv()` to read the file, skipping comment lines.
2. **Rename columns**: GFF files have nine standard columns.
3. **Clean and manipulate data**: Use `dplyr` functions to explore and analyze the data.

### Reading the GFF File

```R
# Read the GFF file
yeast_genome <- read_tsv(
  "assets/data/saccharomyces_cerevisiae.gff",
  comment = "#",
  col_names = FALSE
)

# Preview the data
head(yeast_genome)
```

### Renaming Columns

GFF files have the following columns:
1. `seqname`: Name of the chromosome or scaffold.
2. `source`: Annotation source.
3. `feature`: Feature type (e.g., gene, exon).
4. `start`: Start position.
5. `end`: End position.
6. `score`: Score value.
7. `strand`: Strand (`+` or `-`).
8. `frame`: Frame or phase.
9. `attribute`: Additional information.

Let's rename these columns.

```R
# Rename columns
yeast_genome <- yeast_genome %>%
  rename(
    seqname = X1,
    source = X2,
    feature = X3,
    start = X4,
    end = X5,
    score = X6,
    strand = X7,
    frame = X8,
    attribute = X9
  )

# Preview the renamed data
head(yeast_genome)
```
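
Column names can also be supplied at read time instead of renaming afterwards. The sketch below demonstrates this on a tiny made-up GFF-style file written to a temporary location; the records, the `demo` source, and the IDs are invented for illustration and are not real yeast annotations. The compact `col_types` string (`c` for character, `i` for integer) is an assumption about sensible types for each column.

```R
library(dplyr)
library(readr)

# Two made-up GFF-style records plus a header comment line
gff_lines <- c(
  "##gff-version 3",
  "chrI\tdemo\tgene\t100\t500\t.\t+\t.\tID=geneA",
  "chrI\tdemo\tCDS\t100\t500\t.\t+\t0\tID=geneA_cds"
)
gff_path <- tempfile(fileext = ".gff")
writeLines(gff_lines, gff_path)

# The nine standard GFF column names
gff_columns <- c(
  "seqname", "source", "feature", "start", "end",
  "score", "strand", "frame", "attribute"
)

toy_gff <- read_tsv(
  gff_path,
  comment = "#",             # skip comment lines
  col_names = gff_columns,   # name the columns while reading
  col_types = "ccciicccc"    # start and end as integers, everything else as text
)
toy_gff
```

Reading `score` and `frame` as text avoids type-guessing problems caused by the `.` placeholder that GFF uses for missing values.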

### Exploring the Data

#### Counting Feature Types

```R
# Count feature types
yeast_genome %>%
  count(feature, sort = TRUE)
```
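
`count()` is a shortcut for a grouped summary. The sketch below, using a made-up `feature` column rather than the yeast file, shows the longer `group_by()` + `summarize()` + `arrange()` form that produces the same counts.

```R
library(dplyr)

# Invented feature labels standing in for a GFF 'feature' column
features <- tibble(
  feature = c("gene", "CDS", "gene", "tRNA", "gene", "CDS")
)

# Short form
with_count <- features %>%
  count(feature, sort = TRUE)

# Equivalent long form
with_group_by <- features %>%
  group_by(feature) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

with_count
with_group_by
```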

#### Exercise 6: Analyzing Genomic Data

**Task**: Find the top 5 longest genes in the yeast genome. Display their `seqname`, `start`, `end`, `strand`, and `attribute`.

##### Your Code Here

```R
# Your code here
```

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Filter for `feature == "gene"`.
- Calculate gene length using `mutate(length = end - start + 1)`.
- Arrange in descending order of `length`.
- Use `head(5)` to get the top 5.
- Select the desired columns.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
longest_genes <- yeast_genome %>%
  filter(feature == "gene") %>%
  mutate(length = end - start + 1) %>%
  arrange(desc(length)) %>%
  select(seqname, start, end, strand, attribute) %>%
  head(5)
longest_genes
```

</details>

## Conclusion

In this module, we've explored the core functions of `dplyr` for data manipulation, learned about the pipe operator `%>%`, and applied these skills to both the Palmer Penguins dataset and real genomic data. Together, these verbs cover most everyday data manipulation tasks in R.

---

**Next Steps**:

- Practice using `dplyr` with your own datasets.
- Explore more advanced functions such as the join family (`left_join()`, `inner_join()`, and related functions) for combining tables.
- Learn about other tidyverse packages like `tidyr` for data reshaping.

---