# Getting Started with dplyr in R: A Hands-on Tutorial

------------------------

Welcome to the dplyr module in our Data Analysis and Visualization series. In this module, we'll explore the powerful `dplyr` package for data manipulation in R.

## Learning Objectives

After completing this module, you will be able to:

1. Understand the core functions of `dplyr`.
2. Use the pipe operator `%>%` to chain together multiple operations.
3. Perform data manipulation tasks such as filtering, selecting, arranging, mutating, and summarizing data.
4. Group data and perform group-wise operations.
5. Read and write data files using `readr`.
6. Apply these skills to real-world datasets.

---

## Introduction

The `dplyr` package is an essential tool for data analysis in R, providing a set of intuitive functions for data manipulation. Whether you're analyzing scientific experiments, business metrics, or social data, `dplyr` offers a consistent and efficient way to handle your data.

### Why dplyr?

#### 1. Intuitive Grammar
- Uses functions that mirror how we think about data.
  - `filter()` to keep rows that match conditions.
  - `select()` to choose specific columns.
  - `arrange()` to sort data.
  - `mutate()` to create new columns.
  - `summarize()` to calculate statistics.
- Consistent syntax makes code readable and maintainable.
- Clear correspondence between code and analytical steps.

#### 2. Performance
- Optimized for handling large datasets.
- Efficient memory usage.
- Fast execution of complex operations.

#### 3. Versatility
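- Works on data frames and tibbles regardless of where the data came from (CSV via `readr`, Excel via `readxl`, SQL databases via `dbplyr`).
- Handles different types of data (numerical, categorical, time series).
- Scales from small in-memory tables to larger data through backends such as `dbplyr` and `dtplyr`.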

#### 4. Integration
- Part of the tidyverse ecosystem.
- Seamless connection with visualization tools like ggplot2.
- Works well with specialized packages for specific domains.

---

## 1. Install Packages and Load Libraries

Before we start, we need to install and load the necessary packages.

In [1]:
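# Suppress warnings for the rest of the session
options(warn = -1)
# Note: R has no `message` option, so `options(message = FALSE)` has no effect.
# To silence startup messages, wrap library() calls in suppressPackageStartupMessages().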

# Install required packages if not already installed
if (!require(tidyverse)) install.packages("tidyverse")
if (!require(palmerpenguins)) install.packages("palmerpenguins")

# Load libraries
library(tidyverse)
library(palmerpenguins)

# Load the dataset
data(penguins)

Loading required package: tidyverse

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: palmerpenguins



## 2. Understanding Pipes: A Key Concept

One of the most useful tools when working with `dplyr` is the pipe operator `%>%`, which comes from the `magrittr` package and is re-exported by `dplyr`. Read it as "and then" between operations: the pipe takes the result of the expression on its left and passes it as the first argument to the function on its right.
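
As a minimal sketch of that rule, the calls below should all produce the same result, because `x %>% f(y)` is evaluated as `f(x, y)`. The native pipe `|>` (available from R 4.1 onward) is included for comparison and behaves the same way in simple cases like this one. The object names `nested`, `piped`, and `native` are only illustrative.

```R
library(dplyr)
library(palmerpenguins)

# Nested call: the data frame is the first argument of filter()
nested <- filter(penguins, species == "Gentoo")

# Piped call: %>% inserts `penguins` as the first argument of filter()
piped <- penguins %>% filter(species == "Gentoo")

# Native pipe (requires R >= 4.1), shown for comparison
native <- penguins |> filter(species == "Gentoo")

# Both comparisons should report TRUE
identical(nested, piped)
identical(nested, native)
```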

### Without Pipes
Let's consider filtering the penguins dataset for Gentoo species and selecting specific columns without using pipes:

In [2]:
# Without pipes
selected_data <- select(
  filter(penguins, species == "Gentoo"),
  species, island, body_mass_g
)
head(selected_data)

species,island,body_mass_g
<fct>,<fct>,<int>
Gentoo,Biscoe,4500
Gentoo,Biscoe,5700
Gentoo,Biscoe,4450
Gentoo,Biscoe,5700
Gentoo,Biscoe,5400
Gentoo,Biscoe,4550


### With Pipes
Using pipes, the same operation becomes more readable:

In [None]:
# With pipes
selected_data <- penguins %>%
  filter(species == "Gentoo") %>%
  select(species, island, body_mass_g)
head(selected_data)

### Benefits of Using Pipes
- **Readability**: Code flows from top to bottom, mirroring the sequence of operations.
- **Maintainability**: Easier to modify and add steps.
- **Debugging**: Simplifies tracing through data transformations.

---

## 3. Core dplyr Functions

Let's explore the main functions that form the backbone of `dplyr` data manipulation. For each function, we'll:
- Understand its purpose.
- See its basic syntax.
- Try practical examples with our penguin data.

---

### 3.1 `filter()`: Subsetting Rows

The `filter()` function lets you keep rows that match certain conditions. Think of it like a search function that helps you find specific observations in your dataset.

#### Common Operators Used in `filter()`
- `==` : equal to.
- `!=` : not equal to.
- `>` , `>=` , `<` , `<=` : greater than, greater than or equal to, less than, less than or equal to.
- `%in%` : matches any of the values in a vector.
- `&` : logical AND.
- `|` : logical OR.
- `is.na()` : checks for missing values.
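
The operators `!=`, `|`, `&`, and `is.na()` are not exercised by the examples in this section, so here is a short sketch of how they might be used. The object names are illustrative. Keep in mind that `filter()` drops rows for which the condition evaluates to `NA`.

```R
library(dplyr)
library(palmerpenguins)

# != and |: penguins that are not Adelie, or that weigh under 3000 g
not_adelie_or_light <- penguins %>%
  filter(species != "Adelie" | body_mass_g < 3000)

# is.na(): rows where sex was not recorded
missing_sex <- penguins %>%
  filter(is.na(sex))

# !is.na() combined with &: rows with a recorded sex, from 2009 only
known_sex_2009 <- penguins %>%
  filter(!is.na(sex) & year == 2009)

nrow(missing_sex)
nrow(known_sex_2009)
```

Separating conditions with a comma, as in `filter(a, b)`, is equivalent to joining them with `&`.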

#### Examples

In [3]:
# Find all Gentoo penguins
gentoo_penguins <- penguins %>%
  filter(species == "Gentoo")
head(gentoo_penguins)

# Find penguins heavier than 5000g from Biscoe island
heavy_biscoe_penguins <- penguins %>%
  filter(body_mass_g > 5000, island == "Biscoe")
head(heavy_biscoe_penguins)

# Find penguins from either Dream or Torgersen islands
dream_torgersen_penguins <- penguins %>%
  filter(island %in% c("Dream", "Torgersen"))
head(dream_torgersen_penguins)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Gentoo,Biscoe,46.1,13.2,211,4500,female,2007
Gentoo,Biscoe,50.0,16.3,230,5700,male,2007
Gentoo,Biscoe,48.7,14.1,210,4450,female,2007
Gentoo,Biscoe,50.0,15.2,218,5700,male,2007
Gentoo,Biscoe,47.6,14.5,215,5400,male,2007
Gentoo,Biscoe,46.5,13.5,210,4550,female,2007


species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Gentoo,Biscoe,50.0,16.3,230,5700,male,2007
Gentoo,Biscoe,50.0,15.2,218,5700,male,2007
Gentoo,Biscoe,47.6,14.5,215,5400,male,2007
Gentoo,Biscoe,46.7,15.3,219,5200,male,2007
Gentoo,Biscoe,46.8,15.4,215,5150,male,2007
Gentoo,Biscoe,49.0,16.1,216,5550,male,2007


species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007


#### Exercise 1: Filtering Data

**Task**: Find all female Adelie penguins with a flipper length greater than 190 mm.

##### Your Code Here

In [None]:
# Your code here

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Use `filter()` with conditions `species == "Adelie"`, `sex == "female"`, and `flipper_length_mm > 190`.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
adelie_females <- penguins %>%
  filter(species == "Adelie", sex == "female", flipper_length_mm > 190)
head(adelie_females)
```

</details>

### 3.2 `select()`: Choosing Columns

The `select()` function helps you pick which columns (variables) you want to keep or remove. You can:
- Select columns by name.
- Remove columns using `-`.
- Use helper functions like `starts_with()`, `ends_with()`, `contains()`, `everything()`.
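
Two of the helpers listed above, `ends_with()` and `everything()`, do not appear in the examples in this section. A brief sketch of how they could be used, with illustrative object names:

```R
library(dplyr)
library(palmerpenguins)

# ends_with(): every column whose name ends in "_mm"
mm_columns <- penguins %>%
  select(ends_with("_mm"))

# everything(): move `year` to the front and keep all other columns after it
year_first <- penguins %>%
  select(year, everything())

names(mm_columns)
names(year_first)
```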

#### Examples

In [4]:
# Select specific columns
selected_columns <- penguins %>%
  select(species, island, body_mass_g)
head(selected_columns)

# Remove the 'year' column
without_year <- penguins %>%
  select(-year)
head(without_year)

# Select columns that start with 'bill'
bill_columns <- penguins %>%
  select(starts_with("bill"))
head(bill_columns)

species,island,body_mass_g
<fct>,<fct>,<int>
Adelie,Torgersen,3750.0
Adelie,Torgersen,3800.0
Adelie,Torgersen,3250.0
Adelie,Torgersen,
Adelie,Torgersen,3450.0
Adelie,Torgersen,3650.0


species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
Adelie,Torgersen,,,,,
Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
Adelie,Torgersen,39.3,20.6,190.0,3650.0,male


bill_length_mm,bill_depth_mm
<dbl>,<dbl>
39.1,18.7
39.5,17.4
40.3,18.0
,
36.7,19.3
39.3,20.6


#### Exercise 2: Selecting Columns

**Task**: Create a new dataframe that includes only the `species`, `island`, `sex`, and all columns that contain the word `length`.

##### Your Code Here

In [None]:
# Your code here

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Use `select()` with column names and `contains("length")`.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
length_data <- penguins %>%
  select(species, island, sex, contains("length"))
head(length_data)
```

</details>

### 3.3 `arrange()`: Sorting Data

`arrange()` helps you sort your data based on one or more columns:
- Use `desc()` for descending order.
- Can sort by multiple columns.
- Missing values (`NA`) are always placed at the end, even when sorting with `desc()`.
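
The handling of missing values is easiest to see on a very small table. The sketch below uses made-up toy data (not the penguins dataset) with one missing value and sorts it in both directions.

```R
library(dplyr)

# Toy data, invented for illustration, with one missing mass
toy <- tibble(
  id = c("a", "b", "c", "d"),
  mass = c(3200, NA, 4100, 2900)
)

ascending <- toy %>% arrange(mass)
descending <- toy %>% arrange(desc(mass))

ascending$id   # "d" "a" "c" "b": the NA row ("b") is last
descending$id  # "c" "a" "d" "b": the NA row ("b") is still last
```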

#### Examples

In [5]:
# Sort by body mass in ascending order
sorted_mass <- penguins %>%
  arrange(body_mass_g)
head(sorted_mass)

# Sort by body mass in descending order
sorted_mass_desc <- penguins %>%
  arrange(desc(body_mass_g))
head(sorted_mass_desc)

# Sort by species and then by bill length
sorted_species_bill <- penguins %>%
  arrange(species, bill_length_mm)
head(sorted_species_bill)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Chinstrap,Dream,46.9,16.6,192,2700,female,2008
Adelie,Biscoe,36.5,16.6,181,2850,female,2008
Adelie,Biscoe,36.4,17.1,184,2850,female,2008
Adelie,Biscoe,34.5,18.1,187,2900,female,2008
Adelie,Dream,33.1,16.1,178,2900,female,2008
Adelie,Torgersen,38.6,17.0,188,2900,female,2009


species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Gentoo,Biscoe,49.2,15.2,221,6300,male,2007
Gentoo,Biscoe,59.6,17.0,230,6050,male,2007
Gentoo,Biscoe,51.1,16.3,220,6000,male,2008
Gentoo,Biscoe,48.8,16.2,222,6000,male,2009
Gentoo,Biscoe,45.2,16.4,223,5950,male,2008
Gentoo,Biscoe,49.8,15.9,229,5950,male,2009


species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Dream,32.1,15.5,188,3050,female,2009
Adelie,Dream,33.1,16.1,178,2900,female,2008
Adelie,Torgersen,33.5,19.0,190,3600,female,2008
Adelie,Dream,34.0,17.1,185,3400,female,2008
Adelie,Torgersen,34.1,18.1,193,3475,,2007
Adelie,Torgersen,34.4,18.4,184,3325,female,2007


#### Exercise 3: Arranging Data

**Task**: Arrange the penguins dataset to find the top 5 penguins with the longest flipper length. Display their `species`, `island`, `flipper_length_mm`, and `body_mass_g`.

##### Your Code Here

In [None]:
# Your code here

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Use `arrange(desc(flipper_length_mm))` to sort in descending order.
- Use `head()` to get the top 5 rows.
- Use `select()` to choose the desired columns.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
top_flipper <- penguins %>%
  arrange(desc(flipper_length_mm)) %>%
  select(species, island, flipper_length_mm, body_mass_g) %>%
  head(5)
top_flipper
```

</details>

### 3.4 `mutate()`: Creating New Variables

`mutate()` allows you to:
- Create new columns based on calculations.
- Modify existing columns.
- Create multiple columns at once.
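
As a small sketch of the second and third points, the call below modifies an existing column and creates two new ones, the second of which uses the first. The column names `body_mass_kg` and `heavy`, and the 5 kg threshold, are chosen only for illustration.

```R
library(dplyr)
library(palmerpenguins)

penguins_extra <- penguins %>%
  mutate(
    # Modify an existing column: store the year as a factor
    year = as.factor(year),
    # Create a new column...
    body_mass_kg = body_mass_g / 1000,
    # ...and use it immediately in the same mutate() call
    heavy = body_mass_kg > 5
  )

penguins_extra %>%
  select(year, body_mass_g, body_mass_kg, heavy) %>%
  head(3)
```

Columns are created in the order they are written, which is why `heavy` can refer to `body_mass_kg`.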

#### Examples

In [6]:
# Add new columns with calculations
penguins_mutated <- penguins %>%
  mutate(
    body_mass_kg = body_mass_g / 1000,
    bill_ratio = bill_length_mm / bill_depth_mm
  )
head(penguins_mutated)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,body_mass_kg,bill_ratio
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>,<dbl>,<dbl>
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,3.75,2.090909
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,3.8,2.270115
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007,3.25,2.238889
Adelie,Torgersen,,,,,,2007,,
Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007,3.45,1.901554
Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007,3.65,1.907767


#### Exercise 4: Mutating Data

**Task**: Create a new column called `flipper_length_cm` by converting `flipper_length_mm` to centimeters. Then, select `species`, `flipper_length_mm`, and `flipper_length_cm`.

##### Your Code Here

In [None]:
# Your code here

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Use `mutate(flipper_length_cm = flipper_length_mm / 10)`.
- Use `select()` to choose the desired columns.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
penguins_cm <- penguins %>%
  mutate(flipper_length_cm = flipper_length_mm / 10) %>%
  select(species, flipper_length_mm, flipper_length_cm)
head(penguins_cm)
```

</details>

### 3.5 `summarize()`: Calculating Summary Statistics

`summarize()` (or `summarise()`) helps you:
- Calculate summary statistics.
- Reduce multiple values down to a single value.
- Often used with `group_by()` for group-wise calculations.

#### Examples

In [7]:
# Overall summary statistics
overall_summary <- penguins %>%
  summarize(
    count = n(),
    avg_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE)
  )
overall_summary

# Summary statistics by species
species_summary <- penguins %>%
  group_by(species) %>%
  summarize(
    count = n(),
    avg_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE)
  )
species_summary

count,avg_mass,sd_mass
<int>,<dbl>,<dbl>
344,4201.754,801.9545


species,count,avg_mass,sd_mass
<fct>,<int>,<dbl>,<dbl>
Adelie,152,3700.662,458.5661
Chinstrap,68,3733.088,384.3351
Gentoo,124,5076.016,504.1162


### 3.6 `group_by()`: Grouping Data

`group_by()` is often used with `summarize()` or `mutate()` to:
- Perform operations by group.
- Calculate group-wise statistics.
- Create group-specific calculations.

Think of it as saying "do this calculation separately for each group."
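
The example in this section pairs `group_by()` with `summarize()`. For the `mutate()` case, a minimal sketch might look like the following: every row is kept, and each penguin is compared with the mean of its own species. The column names `species_avg_mass` and `mass_diff` are illustrative.

```R
library(dplyr)
library(palmerpenguins)

# Grouped mutate(): keep every row, but compute the mean within each species
mass_vs_species <- penguins %>%
  group_by(species) %>%
  mutate(
    species_avg_mass = mean(body_mass_g, na.rm = TRUE),
    mass_diff = body_mass_g - species_avg_mass
  ) %>%
  ungroup() %>%  # remove the grouping so later steps are not affected
  select(species, body_mass_g, species_avg_mass, mass_diff)

head(mass_vs_species)
```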

#### Example

In [8]:
# Calculate average body mass by species and sex
mass_by_species_sex <- penguins %>%
  group_by(species, sex) %>%
  summarize(
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    count = n()
  )
mass_by_species_sex

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.


species,sex,avg_body_mass,count
<fct>,<fct>,<dbl>,<int>
Adelie,female,3368.836,73
Adelie,male,4043.493,73
Adelie,,3540.0,6
Chinstrap,female,3527.206,34
Chinstrap,male,3938.971,34
Gentoo,female,4679.741,58
Gentoo,male,5484.836,61
Gentoo,,4587.5,5
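
The message printed above appears because the result is still grouped by `species` after `summarize()` removes the last grouping level. One way to handle this, sketched here with an illustrative object name, is to set `.groups = "drop"` so the result is an ordinary ungrouped tibble and the message is not shown.

```R
library(dplyr)
library(palmerpenguins)

mass_by_species_sex_flat <- penguins %>%
  group_by(species, sex) %>%
  summarize(
    avg_body_mass = mean(body_mass_g, na.rm = TRUE),
    count = n(),
    .groups = "drop"  # return an ungrouped tibble and silence the message
  )

mass_by_species_sex_flat
```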


#### Exercise 5: Grouping and Summarizing

**Task**: For each island, calculate the average `bill_length_mm` and the number of penguins observed.

##### Your Code Here

In [None]:
# Your code here

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Use `group_by(island)`.
- Use `summarize(avg_bill_length = mean(bill_length_mm, na.rm = TRUE), count = n())`.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
island_summary <- penguins %>%
  group_by(island) %>%
  summarize(
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    count = n()
  )
island_summary
```

</details>

## 4. Combining Multiple dplyr Functions

One of the most powerful aspects of `dplyr` is the ability to combine multiple functions using pipes (`%>%`). This allows you to:
- Perform complex data manipulation step by step.
- Keep your code readable and logical.
- Build up analysis in a clear sequence.

### Example: Finding the Largest Penguins by Species

**Task**: Find the top 3 heaviest penguins for each species and display their `species`, `island`, `sex`, and `body_mass_g`.

#### Solution

In [9]:
# Find top 3 heaviest penguins by species
top_penguins <- penguins %>%
  drop_na(body_mass_g) %>%
  group_by(species) %>%
  arrange(desc(body_mass_g)) %>%
  slice_head(n = 3) %>%
  select(species, island, sex, body_mass_g)
top_penguins

species,island,sex,body_mass_g
<fct>,<fct>,<fct>,<int>
Adelie,Biscoe,male,4775
Adelie,Biscoe,male,4725
Adelie,Torgersen,male,4700
Chinstrap,Dream,male,4800
Chinstrap,Dream,male,4550
Chinstrap,Dream,male,4500
Gentoo,Biscoe,male,6300
Gentoo,Biscoe,male,6050
Gentoo,Biscoe,male,6000
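
The pipeline above works because `arrange()` ignores grouping by default and sorts the whole table, after which `slice_head()` takes the first three rows within each species. A possible alternative, offered here only as a sketch, is `slice_max()`, which selects the largest values per group directly. Setting `with_ties = FALSE` keeps exactly three rows per species even when several penguins share the same mass.

```R
library(dplyr)
library(palmerpenguins)

top_penguins_alt <- penguins %>%
  group_by(species) %>%
  # Take the 3 largest body masses in each species; break ties arbitrarily
  slice_max(body_mass_g, n = 3, with_ties = FALSE) %>%
  ungroup() %>%
  select(species, island, sex, body_mass_g)

top_penguins_alt
```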


## 5. Reading and Writing Data with `readr`

The `readr` package (part of the tidyverse) provides functions to read and write data efficiently.

### Reading Data
- `read_csv()`: Reads comma-separated files.
- `read_tsv()`: Reads tab-separated files.

### Writing Data
- `write_csv()`: Writes data to a CSV file.
- `write_tsv()`: Writes data to a TSV file.

#### Example: Writing and Reading Data

In [10]:
# Create a small dataset
small_penguins <- penguins %>%
  select(species, island, body_mass_g) %>%
  head(10)

# Write to CSV
write_csv(small_penguins, "small_penguins.csv")

# Read from CSV
penguins_csv <- read_csv("small_penguins.csv")
penguins_csv

Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): species, island
dbl (1): body_mass_g

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


species,island,body_mass_g
<chr>,<chr>,<dbl>
Adelie,Torgersen,3750.0
Adelie,Torgersen,3800.0
Adelie,Torgersen,3250.0
Adelie,Torgersen,
Adelie,Torgersen,3450.0
Adelie,Torgersen,3650.0
Adelie,Torgersen,3625.0
Adelie,Torgersen,4675.0
Adelie,Torgersen,3475.0
Adelie,Torgersen,4250.0
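
Notice that the round trip changed the column types: `species` and `island` came back as character columns and `body_mass_g` as a double, because a CSV file stores no type information. A sketch of one way to restore them is to declare the types with `col_types`. The temporary file path and the object names here are illustrative.

```R
library(readr)
library(dplyr)
library(palmerpenguins)

# Write a small file to a temporary location (path generated by tempfile())
tmp <- tempfile(fileext = ".csv")
penguins %>%
  select(species, island, body_mass_g) %>%
  head(10) %>%
  write_csv(tmp)

# Declare the column types up front: factors for the categorical columns,
# integer for body mass. Supplying col_types also silences the
# column-specification message.
typed <- read_csv(
  tmp,
  col_types = cols(
    species = col_factor(),
    island = col_factor(),
    body_mass_g = col_integer()
  )
)

typed
```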


## 6. Working with Genomic Data: The S. cerevisiae Genome

In this section, we'll explore how to work with real-world genomic data using `dplyr`. We'll use the yeast (*S. cerevisiae*) genome annotation in GFF format, a tab-separated text file in which each row describes a gene or another genomic feature.

### Steps to Work with GFF Files
1. **Read the data**: Use `read_tsv()` to read the file, skipping comment lines.
2. **Rename columns**: GFF files have nine standard columns.
3. **Clean and manipulate data**: Use `dplyr` functions to explore and analyze the data.

### Reading the GFF File

In [11]:
# Read the GFF file
yeast_genome <- read_tsv(
  "assets/saccharomyces_cerevisiae.gff",
  comment = "#",
  col_names = FALSE
)

# Preview the data
head(yeast_genome)

Rows: 28347 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (7): X1, X2, X3, X6, X7, X8, X9
dbl (2): X4, X5

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


X1,X2,X3,X4,X5,X6,X7,X8,X9
<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>
chrI,SGD,chromosome,1,230218,.,.,.,ID=chrI;dbxref=NCBI:BK006935.2;Name=chrI
chrI,SGD,telomere,1,801,.,-,.,ID=TEL01L;Name=TEL01L;Note=Telomeric%20region%20on%20the%20left%20arm%20of%20Chromosome%20I%3B%20composed%20of%20an%20X%20element%20core%20sequence%2C%20X%20element%20combinatorial%20repeats%2C%20and%20a%20short%20terminal%20stretch%20of%20telomeric%20repeats;display=Telomeric%20region%20on%20the%20left%20arm%20of%20Chromosome%20I;dbxref=SGD:S000028862;curie=SGD:S000028862
chrI,SGD,X_element,337,801,.,-,.,Parent=TEL01L;Name=TEL01L_X_element
chrI,SGD,X_element_combinatorial_repeat,63,336,.,-,.,Parent=TEL01L;Name=TEL01L_X_element_combinatorial_repeat
chrI,SGD,telomeric_repeat,1,62,.,-,.,Parent=TEL01L;Name=TEL01L_telomeric_repeat_1
chrI,SGD,telomere,1,801,.,-,.,ID=TEL01L_telomere;Name=TEL01L_telomere;Parent=TEL01L


### Renaming Columns

GFF files have the following columns:
1. `seqname`: Name of the chromosome or scaffold.
2. `source`: Annotation source.
3. `feature`: Feature type (e.g., gene, exon).
4. `start`: Start position.
5. `end`: End position.
6. `score`: Score value.
7. `strand`: Strand (`+` or `-`).
8. `frame`: Frame or phase.
9. `attribute`: Additional information.

Let's rename these columns.

In [12]:
# Rename columns
yeast_genome <- yeast_genome %>%
  rename(
    seqname = X1,
    source = X2,
    feature = X3,
    start = X4,
    end = X5,
    score = X6,
    strand = X7,
    frame = X8,
    attribute = X9
  )

# Preview the renamed data
head(yeast_genome)

seqname,source,feature,start,end,score,strand,frame,attribute
<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>
chrI,SGD,chromosome,1,230218,.,.,.,ID=chrI;dbxref=NCBI:BK006935.2;Name=chrI
chrI,SGD,telomere,1,801,.,-,.,ID=TEL01L;Name=TEL01L;Note=Telomeric%20region%20on%20the%20left%20arm%20of%20Chromosome%20I%3B%20composed%20of%20an%20X%20element%20core%20sequence%2C%20X%20element%20combinatorial%20repeats%2C%20and%20a%20short%20terminal%20stretch%20of%20telomeric%20repeats;display=Telomeric%20region%20on%20the%20left%20arm%20of%20Chromosome%20I;dbxref=SGD:S000028862;curie=SGD:S000028862
chrI,SGD,X_element,337,801,.,-,.,Parent=TEL01L;Name=TEL01L_X_element
chrI,SGD,X_element_combinatorial_repeat,63,336,.,-,.,Parent=TEL01L;Name=TEL01L_X_element_combinatorial_repeat
chrI,SGD,telomeric_repeat,1,62,.,-,.,Parent=TEL01L;Name=TEL01L_telomeric_repeat_1
chrI,SGD,telomere,1,801,.,-,.,ID=TEL01L_telomere;Name=TEL01L_telomere;Parent=TEL01L
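
Reading and renaming can also be combined into one step by passing the nine names to `col_names` when the file is read. The sketch below shows this on a tiny made-up GFF-style file written to a temporary path, so it runs without the yeast file. The feature lines and the names `gff_columns` and `toy_gff` are invented for illustration.

```R
library(readr)

# Made-up GFF-style lines, written to a temporary file for illustration
gff_lines <- c(
  "##gff-version 3",
  "chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=geneA;Name=geneA",
  "chrI\tSGD\tCDS\t335\t649\t.\t+\t0\tParent=geneA"
)
tmp_gff <- tempfile(fileext = ".gff")
writeLines(gff_lines, tmp_gff)

# The nine standard GFF column names
gff_columns <- c(
  "seqname", "source", "feature", "start", "end",
  "score", "strand", "frame", "attribute"
)

toy_gff <- read_tsv(
  tmp_gff,
  comment = "#",            # skip the header line starting with "#"
  col_names = gff_columns,  # name the columns while reading
  col_types = "cccddcccc"   # c = character, d = double
)

toy_gff
```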


### Exploring the Data

#### Counting Feature Types

In [13]:
# Count feature types
yeast_genome %>%
  count(feature, sort = TRUE)

feature,n
<chr>,<int>
mRNA,11125
CDS,7072
gene,6613
ARS,543
noncoding_exon,497
long_terminal_repeat,384
intron,378
tRNA,299
tRNA_gene,299
ARS_consensus_sequence,196


#### Exercise 6: Analyzing Genomic Data

**Task**: Find the top 5 longest genes in the yeast genome. Display their `seqname`, `start`, `end`, `strand`, and `attribute`.

##### Your Code Here

In [None]:
# Your code here

<details>
<summary><strong>Hint:</strong> Click to expand</summary>

- Filter for `feature == "gene"`.
- Calculate gene length using `mutate(length = end - start + 1)`.
- Arrange in descending order of `length`.
- Use `head(5)` to get the top 5.
- Select the desired columns.

</details>

<details>
<summary><strong>Solution:</strong> Click to expand</summary>

```R
# Solution
longest_genes <- yeast_genome %>%
  filter(feature == "gene") %>%
  mutate(length = end - start + 1) %>%
  arrange(desc(length)) %>%
  select(seqname, start, end, strand, attribute) %>%
  head(5)
longest_genes
```

</details>
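
The `attribute` column packs several `key=value` pairs into one string, which makes the output of this exercise hard to read. As a rough sketch, `mutate()` can be combined with `str_match()` from `stringr` to pull individual keys into their own columns. The attribute strings below are copied from the preview earlier in this section; the regular expressions and the column names `id` and `name` are assumptions made for illustration and may need adjusting for other files.

```R
library(dplyr)
library(stringr)

# Attribute strings copied from the preview of the yeast GFF file
toy_features <- tibble(
  feature = c("chromosome", "telomere", "X_element"),
  attribute = c(
    "ID=chrI;dbxref=NCBI:BK006935.2;Name=chrI",
    "ID=TEL01L_telomere;Name=TEL01L_telomere;Parent=TEL01L",
    "Parent=TEL01L;Name=TEL01L_X_element"
  )
)

with_ids <- toy_features %>%
  mutate(
    # Capture the text after "ID=" up to the next ";" (NA when there is no ID)
    id = str_match(attribute, "(?:^|;)ID=([^;]+)")[, 2],
    # Same idea for the "Name" key
    name = str_match(attribute, "(?:^|;)Name=([^;]+)")[, 2]
  )

with_ids %>% select(feature, id, name)
```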

## Conclusion

In this module, we've explored the core functions of `dplyr` for data manipulation, learned about the pipe operator `%>%`, and applied these skills to both the Palmer Penguins dataset and real genomic data. Mastery of `dplyr` is essential for efficient and effective data analysis in R.

---

**Next Steps**:

- Practice using `dplyr` with your own datasets.
- Explore more advanced functions such as the join family (`left_join()`, `inner_join()`, and related functions).
- Learn about other tidyverse packages like `tidyr` for data reshaping.

---