# The National Parks Dataset for Tidy Tuesday!

Here is some great information about today's dataset!  

> The information in NPSpecies is available to the public. The exceptions to this are records for some sensitive, threatened, or endangered species, where widespread distribution of information could potentially put a species at risk.  
>
> An essential component of NPSpecies is evidence; that is, observations, vouchers, or reports that document the presence of a species in a park. Ideally, every species in a park that is designated as “present in park” will have at least one form of credible evidence substantiating the designation.

Thanks to [f.hull](https://github.com/frankiethull) for putting the dataset together.  

Access the data and more in the [Tidy Tuesday readme](https://github.com/rfordatascience/tidytuesday/blob/504d69514fc162bb6fb76a9ffd356941330f0df9/data/2024/2024-10-08/readme.md).



## Setup

In [1]:
# installation
using Pkg
Pkg.add(["DataFrames", "CSV", "HTTP", "StatsPlots",
    "StatsBase", "Plots", "SummaryTables", "DataFramesMeta",
    "Chain", "CategoricalArrays", "Measures", "PlotThemes",
    "SplitApplyCombine"
])

# load
using DataFrames, CSV, HTTP, StatsPlots, StatsBase, Plots, SummaryTables, DataFramesMeta, Chain, CategoricalArrays, Measures, PlotThemes, SplitApplyCombine

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `C:\Users\ndfos\.julia\environments\v1.10\Project.toml`
[32m[1m  No Changes[22m[39m to `C:\Users\ndfos\.julia\environments\v1.10\Manifest.toml`


## Data

In [None]:
# Fetch the data directly from GitHub
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-10-08/most_visited_nps_species_data.csv"

response = HTTP.get(url)

# Parse the CSV into a DataFrame
species_data = CSV.read(response.body, DataFrame)

# Verify the first few rows
first(species_data, 5)


In [3]:
unique(species_data.CategoryName)


In [4]:
# check the data

describe(species_data)


## Steps for the Analysis:
* Filter the dataset: Focus on the 15 most visited parks and ensure we capture relevant columns like `ParkName`, `Abundance`, and `Nativeness`.
* Group and summarize: We'll summarize the abundance of species for each park, grouped by nativeness (native vs non-native).
* Create a stacked bar chart: We'll visualize species abundance by park, stacking native and non-native species for easy comparison.

## Code to Filter and Summarize:

In [5]:
# Filter and summarize data using the @chain macro
species_abundance_summary = @chain species_data begin
    select(:ParkName, :Abundance, :Nativeness)   # Keep only the relevant columns
    # Drop rows with no recorded nativeness (assumed to be missing or the string "NA")
    filter(:Nativeness => x -> !ismissing(x) && x != "NA", _)
    groupby([:ParkName, :Nativeness])            # One group per park and nativeness
    combine(nrow => :Count)                      # Count the species records in each group
    sort([:ParkName, :Nativeness])
end




## Visualize

### Stacked Bar Chart:

Once we have the summarized data, we can plot it. Each bar represents a park, and the native and non-native species counts are stacked to show each group's contribution to the park's total.

In [6]:
# Stacked bar chart for species abundance by park and nativeness
@chain species_abundance_summary begin
    @df groupedbar(:ParkName, :Count, group=:Nativeness, bar_position=:stack,
        title="Species Abundance by Park and Nativeness", xlabel="Park Name", ylabel="Species Count",
        legend=:topright, lw=0.8, fillalpha=0.7)
end


### Explanation:

* `@chain`: Ensures a more readable pipeline for processing the data.
* Data Processing: We:
    * Select only relevant columns (`ParkName`, `Abundance`, `Nativeness`).
    * Drop missing data.
    * Group the data by park name and nativeness.
    * Count the number of species in each group.
* Plotting: We use the `groupedbar` function to create a stacked bar chart with native and non-native species counts per park.
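The group-and-count step described above can be tried on a few rows before running it on the full dataset. This is a minimal sketch: the `toy_species` table and its park names are hypothetical, and it assumes, as elsewhere in this notebook, that unrecorded values appear as the string `"NA"`.

```julia
using DataFrames

# Hypothetical toy data standing in for the real species table
toy_species = DataFrame(
    ParkName   = ["Park A", "Park A", "Park A", "Park B", "Park B"],
    Nativeness = ["Native", "Native", "Non-native", "Native", "NA"],
)

# Drop rows with no usable nativeness (assumption: these are missing or "NA")
known = filter(:Nativeness => x -> !ismissing(x) && x != "NA", toy_species)

# Count the rows in each park/nativeness group
toy_counts = combine(groupby(known, [:ParkName, :Nativeness]), nrow => :Count)
```

Each row of `toy_counts` is one park/nativeness pair with its count, which is the long format that `groupedbar` expects when given a `group` column.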

## Steps for the Second Analysis:

* Filter the dataset: Focus on the 15 most visited parks and select relevant columns such as `ParkName`, `TEStatus`, and `Sensitive`.
* Group and summarize: We'll count the number of species per park that are either threatened, endangered, or sensitive.
* Create a heatmap: Visualize the concentration of these species across the parks.

## Code to Filter and Summarize Using `@chain`:

In [7]:
# Filter and summarize the data for sensitive, threatened, and endangered species
threatened_species_summary = @chain species_data begin
    select([:ParkName, :TEStatus, :Sensitive])           # Select relevant columns
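    # Keep rows with a recorded TEStatus or flagged as sensitive.
    # Assumes unrecorded statuses appear as missing or as the string "NA".
    filter(row -> (!ismissing(row.TEStatus) && row.TEStatus != "NA") || isequal(row.Sensitive, true), _)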
    groupby(:ParkName)                                   # Group by park
    combine(nrow => :Threatened_Species_Count)           # Count the number of species per park
end

# Preview the summarized data
first(threatened_species_summary, 10)


## Heatmap Visualization with `@chain`:

In [8]:
# Heatmap showing the concentration of threatened and sensitive species by park
@chain threatened_species_summary begin
    # heatmap needs a z matrix, so the counts are reshaped into a single row (one cell per park)
    heatmap(_.ParkName, ["Species count"], reshape(_.Threatened_Species_Count, 1, :),
        color=:blues, title="Concentration of Threatened and Sensitive Species by Park",
        xlabel="Park Name", xrotation=45, colorbar_title="Threatened/Sensitive Species Count",
        clims=(0, maximum(_.Threatened_Species_Count)))
end


### Explanation:

* Data Filtering: We filter species that are either:
    * Marked as `Sensitive`, or
    * Have a `TEStatus` indicating a threatened or endangered status.
* Data Grouping: We group the filtered data by `ParkName` and count the number of threatened and sensitive species per park.
* Heatmap: We create a heatmap to visualize the number of such species per park, using a blue color gradient to represent the density of threatened species.
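The filtering rule above can be checked on a small example. This is a sketch only: `toy_status` and the helper `is_flagged` are hypothetical names, the status codes are illustrative, and it assumes an unrecorded `TEStatus` shows up either as `missing` or as the string `"NA"`.

```julia
using DataFrames

# Hypothetical toy data; the status codes are illustrative only
toy_status = DataFrame(
    ParkName  = ["Park A", "Park A", "Park B", "Park B", "Park B"],
    TEStatus  = ["E", "NA", "NA", "T", missing],
    Sensitive = [false, false, true, false, false],
)

# A row is kept if it has a recorded status or is flagged as sensitive.
# isequal avoids an error if Sensitive is ever missing.
is_flagged(row) =
    (!ismissing(row.TEStatus) && row.TEStatus != "NA") || isequal(row.Sensitive, true)

flagged = filter(is_flagged, toy_status)

# Count the flagged rows per park
toy_summary = combine(groupby(flagged, :ParkName), nrow => :Threatened_Species_Count)
```

Checking `ismissing` before comparing against `"NA"` matters here: comparing `missing` with a string returns `missing`, which cannot be used as a filter condition.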