# Biodiversity in National Parks

## Structure
- 1. Introduction
- 2. Data Characteristics and Cleaning
- 3. Analysis
    - 3.1 Conservation Status
    - 3.2 Endangered Species
    - 3.3 Differences Between Different Species and Conservation Status
    - 3.4 Observations by Park
- 4. Conclusions

## 1. Introduction
The aim of this project is to analyse data from the National Park Service about endangered species in different parks, and to identify any patterns or themes in the conservation statuses of these species.

Below are the questions that this analysis will seek to answer:
- What is the distribution of conservation status for species?
- Which species are more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which species were spotted the most at each park, and what was their distribution?

## 2. Data Characteristics and Cleaning

In [186]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

The `species_info.csv` dataset has the following columns of data:
- category - the class of each species (for example Mammal, Bird or Vascular Plant)
- scientific_name - the scientific name of each species
- common_names - the common names of each species
- conservation_status - each species’ current conservation status

In [188]:
species = pd.read_csv('species_info.csv')
species.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,


The `observations.csv` dataset has the following columns of data:
- scientific_name - the scientific name of each species
- park_name - the park where each species was observed
- observations - the number of times each species was observed at that park

In [190]:
observations = pd.read_csv('observations.csv')
observations.head()

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


Next, I use `.shape` and `.info()` to gain a better understanding of the characteristics of each dataset.

Some characteristics to note here are:
- In both `species` and `observations`, the text columns are stored with the generic `object` dtype when they should be strings. I amend this in the next section.
- In `species`, there are 5,824 entries, but `conservation_status` has only 191 non-null values. This is something to look into.

I have also checked each DataFrame for duplicated rows and removed the 15 found in `observations`.

In [192]:
print(f"species shape: {species.shape}")
print(species.info())

print(f"observations shape: {observations.shape}")
print(observations.info())

species shape: (5824, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB
None
observations shape: (23296, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB
None
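
The gap between total entries and non-null values noted above can be expressed as a share per column. The block below is a minimal sketch on a small toy frame (four rows borrowed from the outputs in this notebook, not the full dataset); the names `toy` and `missing_share` are illustrative only.

```python
import pandas as pd

# Toy stand-in for `species`: four rows taken from this notebook's outputs,
# not the full dataset
toy = pd.DataFrame({
    "scientific_name": ["Bos bison", "Canis lupus", "Ovis aries", "Cervus elaphus"],
    "conservation_status": [None, "Endangered", None, None],
})

# isna() returns a boolean frame; the mean of each boolean column is the
# fraction of missing values in that column
missing_share = toy.isna().mean()
print(missing_share)
```

Running the same expression on the real `species` DataFrame should give the equivalent proportions for each of its columns.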


In [194]:
# Amending data types from 'object' to 'string'
species = species.astype('string')
observations['scientific_name'] = observations['scientific_name'].astype('string')
observations['park_name'] = observations['park_name'].astype('string')

# Checking for duplicated rows in each DataFrame
print(f"Duplicated values in Species:\n {species.duplicated().value_counts()}")
print(f"Duplicated values in Observations:\n {observations.duplicated().value_counts()}")

# Dropping duplicated rows from 'observations' and verifying by checking again
observations = observations.drop_duplicates()
print(f"Duplicated values in Observations:\n {observations.duplicated().value_counts()}")

Duplicated values in Species:
 False    5824
Name: count, dtype: int64
Duplicated values in Observations:
 False    23281
True        15
Name: count, dtype: int64
Duplicated values in Observations:
 False    23281
Name: count, dtype: int64


Starting with `species`, I wanted to find out how many unique species are in the dataset. By examining the `scientific_name` column, I found that there are 5,541 unique species in the `species` DataFrame, even though there are 5,824 total entries. This indicates that some scientific names appear more than once.

I also observed that some species have multiple common names. These are stored in the `common_names` column, but not all common names are recorded for every scientific name. As a result, some species appear multiple times, each with different common names, while others may be missing some or all of their common names.

To simplify `species`, I will remove `common_names` and use `scientific_name` as the unique identifier for each species. This ensures consistency, as `scientific_name` is more standardised and avoids the inconsistencies and duplicates introduced by varying common names.

In [196]:
print(f"number of species: {species.scientific_name.nunique()}")

# Checking for any multiple entries of each 'scientific_name'
counts = species.groupby('scientific_name').size().loc[lambda x: x > 1]
print(counts)

#Filtering species to look more closely at the values for a specific 'scientific_name'
check_1 = species[species['scientific_name'] == 'Agrostis mertensii']
print(check_1)

check_2 = species[species['scientific_name'] == 'Vulpia myuros']
print(check_2)

# Removing 'common_names' to simplify `species`, and removing duplicates from the resulting DataFrame
species = species.drop(['common_names'], axis=1)
print(f"Duplicated values in Species:\n {species.duplicated().value_counts()}")
species = species.drop_duplicates()
print(f"Duplicated values in Species:\n {species.duplicated().value_counts()}")

number of species: 5541
scientific_name
Agrostis capillaris     2
Agrostis gigantea       2
Agrostis mertensii      2
Agrostis scabra         2
Agrostis stolonifera    2
                       ..
Vireo solitarius        2
Vulpia bromoides        2
Vulpia myuros           2
Vulpia octoflora        2
Zizia aptera            2
Length: 274, dtype: int64
            category     scientific_name  \
2136  Vascular Plant  Agrostis mertensii   
4178  Vascular Plant  Agrostis mertensii   

                              common_names conservation_status  
2136                     Northern Agrostis                <NA>  
4178  Arctic Bentgrass, Northern Bentgrass                <NA>  
            category scientific_name  \
2330  Vascular Plant   Vulpia myuros   
5643  Vascular Plant   Vulpia myuros   

                                           common_names conservation_status  
2330                                     Rattail Fescue                <NA>  
5643  Foxtail Fescue, Rattail Fescue, Rat-T
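
The one-by-one checks above can also be done in a single pass. The following is a rough sketch, assuming a toy frame built from rows shown in this notebook; `toy`, `repeated` and `names_per_species` are illustrative names, not part of the original analysis.

```python
import pandas as pd

# Toy rows copied from this notebook's outputs (not the full dataset)
toy = pd.DataFrame({
    "category": ["Vascular Plant", "Vascular Plant", "Mammal"],
    "scientific_name": ["Agrostis mertensii", "Agrostis mertensii", "Bos bison"],
    "common_names": [
        "Northern Agrostis",
        "Arctic Bentgrass, Northern Bentgrass",
        "American Bison, Bison",
    ],
})

# keep=False marks every row whose scientific_name occurs more than once,
# rather than only the second and later occurrences
repeated = toy[toy.duplicated(subset="scientific_name", keep=False)]

# Collect the differing common_names entries for each repeated scientific name
names_per_species = repeated.groupby("scientific_name")["common_names"].agg(list)
print(names_per_species)
```

This view makes it easier to confirm that repeated scientific names differ only in their `common_names` before that column is dropped.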

In the previous stage, I identified null values in `conservation_status`. To take a closer look, I have counted the species in each status. Because most of the dataset has a null `conservation_status`, I have assumed that there are no conservation concerns about these species. To make this explicit, I have replaced the null values with `No Intervention`.

In [198]:
# Looking at null values in the 'conservation_status' column
print(f"Null values: {species.conservation_status.isna().sum()}")
print(f"Not null values: {species.conservation_status.notnull().sum()}")
species.groupby("conservation_status").size()

Null values: 5363
Not null values: 180


conservation_status
Endangered             15
In Recovery             4
Species of Concern    151
Threatened             10
dtype: int64

In [204]:
# Replacing null values in 'conservation_status' with 'No Intervention'
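# Filling only this column so that nulls elsewhere would not be silently relabelled
species['conservation_status'] = species['conservation_status'].fillna('No Intervention')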
print(f"Null values: {species.conservation_status.isna().sum()}")
print(f"Not null values: {species.conservation_status.notnull().sum()}")
species.groupby("conservation_status").size()

Null values: 0
Not null values: 5543


conservation_status
Endangered              15
In Recovery              4
No Intervention       5363
Species of Concern     151
Threatened              10
dtype: int64

There is still a discrepancy between the number of unique scientific names (5,541) and the total entries (5,543) in `species`, suggesting that two duplicates remain. I repeated the checks and identified two pairs of entries that share a scientific name but have different conservation statuses. For this analysis, I have assumed the worst case and kept the entry with the more concerning conservation status in each pair.

In [214]:
# Checking for duplicate entries
print(f"number of species: {species.scientific_name.nunique()}")
print(species.info())
counts = species.groupby('scientific_name').size().loc[lambda x: x > 1]
print(counts)

# Looking at the duplicate entries in further detail
check_3 = species[species['scientific_name'] == 'Canis lupus']
print(check_3)

check_4 = species[species['scientific_name'] == 'Oncorhynchus mykiss']
print(check_4)

number of species: 5541
<class 'pandas.core.frame.DataFrame'>
Index: 5543 entries, 0 to 5823
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5543 non-null   string
 1   scientific_name      5543 non-null   string
 2   conservation_status  5543 non-null   string
dtypes: string(3)
memory usage: 173.2 KB
None
scientific_name
Canis lupus            2
Oncorhynchus mykiss    2
dtype: int64
     category scientific_name conservation_status
8      Mammal     Canis lupus          Endangered
3020   Mammal     Canis lupus         In Recovery
     category      scientific_name conservation_status
560      Fish  Oncorhynchus mykiss     No Intervention
3283     Fish  Oncorhynchus mykiss          Threatened


In [219]:
# Removing the entries with the least concerning 'conservation_status'
species.drop(species[(species['category'] == 'Mammal') & (species['scientific_name'] == 'Canis lupus') & (species['conservation_status'] == 'In Recovery')].index, inplace=True)
species.drop(species[(species['category'] == 'Fish') & (species['scientific_name'] == 'Oncorhynchus mykiss') & (species['conservation_status'] == 'No Intervention')].index, inplace=True)

# Conducting a final check on the number of unique species using 'scientific_name' and the number of entries in 'species' 
print(f"number of species: {species.scientific_name.nunique()}")
print(species.info())

number of species: 5541
<class 'pandas.core.frame.DataFrame'>
Index: 5541 entries, 0 to 5823
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5541 non-null   string
 1   scientific_name      5541 non-null   string
 2   conservation_status  5541 non-null   string
dtypes: string(3)
memory usage: 173.2 KB
None
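
The two manual drops above could be generalised by ranking the statuses and keeping the most severe row per species. This is only a sketch: the `SEVERITY` ordering is my own assumption (the source only implies that Endangered outranks In Recovery and that Threatened outranks No Intervention), and `keep_most_severe` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical severity ranking (an assumption, not defined by the dataset):
# a higher number means a more concerning status
SEVERITY = {
    "No Intervention": 0,
    "In Recovery": 1,
    "Species of Concern": 2,
    "Threatened": 3,
    "Endangered": 4,
}


def keep_most_severe(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per scientific_name, choosing the most severe status."""
    ranked = df.assign(_rank=df["conservation_status"].map(SEVERITY))
    # Most severe first, so drop_duplicates(keep="first") retains that row
    ranked = ranked.sort_values("_rank", ascending=False, kind="stable")
    deduped = ranked.drop_duplicates(subset="scientific_name", keep="first")
    # Restore the original row order and remove the helper column
    return deduped.drop(columns="_rank").sort_index()


# Toy rows mirroring the two conflicting pairs found above
toy = pd.DataFrame(
    {
        "category": ["Mammal", "Mammal", "Fish", "Fish"],
        "scientific_name": [
            "Canis lupus", "Canis lupus",
            "Oncorhynchus mykiss", "Oncorhynchus mykiss",
        ],
        "conservation_status": [
            "Endangered", "In Recovery", "No Intervention", "Threatened",
        ],
    },
    index=[8, 3020, 560, 3283],
)

result = keep_most_severe(toy)
print(result)
```

On these four rows the helper keeps the Endangered entry for Canis lupus and the Threatened entry for Oncorhynchus mykiss, matching the manual choice made above.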


Next, I want to understand the different types of species in the dataset, using the `category` column. There are 7 categories in `species`, covering both animals and plants. Grouping by `category` and counting the rows shows how many species fall into each one. In this dataset, the Vascular Plant category has the most species (4,262), whilst the Reptile category has the fewest (78).

In [231]:
# Looking at the different species categories and the total number of entries per category
print(f"Categories: {species.category.unique()}")
species.groupby("category").size()

Categories: <StringArray>
[           'Mammal',              'Bird',           'Reptile',
         'Amphibian',              'Fish',    'Vascular Plant',
 'Nonvascular Plant']
Length: 7, dtype: string


category
Amphibian              79
Bird                  488
Fish                  125
Mammal                176
Nonvascular Plant     333
Reptile                78
Vascular Plant       4262
dtype: int64

I will now look at `observations`. I have identified 4 National Parks in the `observations` dataset, and the total number of observations is 3,312,429.

In [237]:
print(f"National Parks: {observations.park_name.unique()}")
print(f"Total number of observations: {observations.observations.sum()}")

National Parks: <StringArray>
['Great Smoky Mountains National Park',              'Yosemite National Park',
                 'Bryce National Park',           'Yellowstone National Park']
Length: 4, dtype: string
Total number of observations: 3312429


In [240]:
print(observations.head())
print(species.head())

            scientific_name                            park_name  observations
0        Vicia benghalensis  Great Smoky Mountains National Park            68
1            Neovison vison  Great Smoky Mountains National Park            77
2         Prunus subcordata               Yosemite National Park           138
3      Abutilon theophrasti                  Bryce National Park            84
4  Githopsis specularioides  Great Smoky Mountains National Park            85
  category                scientific_name conservation_status
0   Mammal  Clethrionomys gapperi gapperi     No Intervention
1   Mammal                      Bos bison     No Intervention
2   Mammal                     Bos taurus     No Intervention
3   Mammal                     Ovis aries     No Intervention
4   Mammal                 Cervus elaphus     No Intervention


Now that I have cleaned `species` and `observations`, I have merged the DataFrames to create `all_data`, which I will use for my analysis.

In [252]:
# Merging 'observations' and 'species' using a left join
all_data = observations.merge(species, how='left', on='scientific_name')

# Checking that info and first 5 rows of 'all_data' are as expected
print(all_data.info())
print(all_data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23281 entries, 0 to 23280
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   scientific_name      23281 non-null  string
 1   park_name            23281 non-null  string
 2   observations         23281 non-null  int64 
 3   category             23281 non-null  string
 4   conservation_status  23281 non-null  string
dtypes: int64(1), string(4)
memory usage: 909.5 KB
None
            scientific_name                            park_name  \
0        Vicia benghalensis  Great Smoky Mountains National Park   
1            Neovison vison  Great Smoky Mountains National Park   
2         Prunus subcordata               Yosemite National Park   
3      Abutilon theophrasti                  Bryce National Park   
4  Githopsis specularioides  Great Smoky Mountains National Park   

   observations        category conservation_status  
0            68  Vascular Plan
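
A left join like the one above can be made self-checking with two optional `merge` parameters, `validate` and `indicator`. The sketch below uses small toy frames rather than the real data; the "Unknown species" row and its count of 10 are hypothetical and exist only to show what an unmatched observation would look like.

```python
import pandas as pd

# Toy frames; the third observation row is hypothetical and has no match
observations_toy = pd.DataFrame({
    "scientific_name": ["Vicia benghalensis", "Neovison vison", "Unknown species"],
    "park_name": ["Great Smoky Mountains National Park"] * 3,
    "observations": [68, 77, 10],
})
species_toy = pd.DataFrame({
    "category": ["Vascular Plant", "Mammal"],
    "scientific_name": ["Vicia benghalensis", "Neovison vison"],
    "conservation_status": ["No Intervention", "No Intervention"],
})

merged = observations_toy.merge(
    species_toy,
    how="left",
    on="scientific_name",
    validate="many_to_one",  # raises MergeError if species has duplicate keys
    indicator=True,          # adds a '_merge' column describing each row's origin
)

# Rows present in observations but missing from species
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["scientific_name", "park_name"]])
```

With `validate="many_to_one"`, any duplicated `scientific_name` left in `species` would raise an error instead of silently multiplying rows, which is why the row count of `all_data` matching `observations` (23,281) is a useful sign here.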

## 3. Analysis