# Biodiversity in National Parks

In this project, we examine data gathered from US National Parks. The goal is to find which parks contain the greatest species diversity.

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency, f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

## Import Data

This cell loads the two CSV files the project requires. The next sections examine them.

In [2]:
species_data = pd.read_csv("species_info.csv")
observation_data = pd.read_csv("observations.csv")

## Examine Species Data

This section explores the species data to see what it contains.

In [3]:
print("Species")
print(species_data.info())
print(species_data.describe())
print(species_data.head())

Species
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB
None
              category    scientific_name        common_names  \
count             5824               5824                5824   
unique               7               5541                5504   
top     Vascular Plant  Castor canadensis  Brachythecium Moss   
freq              4470                  3                   7   

       conservation_status  
count                  191  
unique                   4  
top     Species of Concern  
freq                   161  
  category                scientific_name  \
0   Mammal  Clethrionomys gap

## Examine Observation Data

This section explores the observation data to see what it contains.

In [4]:
print("Observation")
print(observation_data.info())
print(observation_data.describe())
print(observation_data.head(15))


Observation
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB
None
       observations
count  23296.000000
mean     142.287904
std       69.890532
min        9.000000
25%       86.000000
50%      124.000000
75%      195.000000
max      321.000000
                        scientific_name                            park_name  \
0                    Vicia benghalensis  Great Smoky Mountains National Park   
1                        Neovison vison  Great Smoky Mountains National Park   
2                     Prunus subcordata               Yosemite National Park   
3                  Abutilon theophrasti                  Bryce National Park   
4              Git

## Connecting the Tables

The observation table does not include the category, common names, or conservation status of each species, and the species table does not include parks. Merging the two tables lets us examine categories within each park. The cell also includes some checks to see whether all the data came across.

In [5]:
all_data = observation_data.merge(species_data, on='scientific_name')
print(len(all_data))
print(all_data.info())
print(all_data.head(15))


25632
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25632 entries, 0 to 25631
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   scientific_name      25632 non-null  object
 1   park_name            25632 non-null  object
 2   observations         25632 non-null  int64 
 3   category             25632 non-null  object
 4   common_names         25632 non-null  object
 5   conservation_status  880 non-null    object
dtypes: int64(1), object(5)
memory usage: 1.4+ MB
None
         scientific_name                            park_name  observations  \
0     Vicia benghalensis  Great Smoky Mountains National Park            68   
1     Vicia benghalensis               Yosemite National Park           148   
2     Vicia benghalensis            Yellowstone National Park           247   
3     Vicia benghalensis                  Bryce National Park           104   
4         Neovison vison  Great Smoky Mounta
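
The merged table has 25632 rows while `observation_data` has 23296, which suggests that some `scientific_name` values occur more than once in `species_data` (5824 rows but only 5541 unique names). The following is a minimal sketch on made-up toy data, not the project files, of how duplicate keys inflate a merge and how the `validate` argument can flag it. The frame names `obs` and `species` and all values are illustrative assumptions.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy stand-ins for the two tables (hypothetical values).
obs = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Vicia benghalensis"],
    "park_name": ["Bryce National Park", "Yosemite National Park", "Bryce National Park"],
    "observations": [27, 35, 104],
})
species = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Vicia benghalensis"],
    "category": ["Mammal", "Mammal", "Vascular Plant"],
})

# Each duplicated key on the right repeats the matching left rows.
inflated = obs.merge(species, on="scientific_name")
print(len(obs), len(inflated))

# validate="many_to_one" raises if the right-hand keys are not unique.
try:
    obs.merge(species, on="scientific_name", validate="many_to_one")
    caught = False
except MergeError:
    caught = True

# One possible fix: keep a single row per species before merging.
species_unique = species.drop_duplicates(subset="scientific_name")
clean = obs.merge(species_unique, on="scientific_name", validate="many_to_one")
print(len(clean))
```

Dropping duplicates keeps the first row per name, so if the duplicate rows disagree (for example on `conservation_status`), it is worth inspecting them before choosing which one to keep.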

## Change NULL values

Out of 25632 entries, only 880 have a non-null `conservation_status`. The next step replaces the null values with `Not Threatened`.

In [6]:
all_data = all_data.fillna(value={'conservation_status': 'Not Threatened'})
#print(all_data.head())
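
A quick way to confirm the replacement worked is to count the statuses and the remaining nulls afterwards. The sketch below does this on a small toy frame with made-up rows; on the real data the same two checks would be applied to `all_data`.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing statuses (hypothetical rows).
toy = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Vicia benghalensis", "Neovison vison"],
    "conservation_status": ["In Recovery", np.nan, np.nan],
})

# Fill only the conservation_status column, leaving other columns untouched.
filled = toy.fillna(value={"conservation_status": "Not Threatened"})

# Count each status and verify that no nulls remain.
counts = filled["conservation_status"].value_counts()
remaining_nulls = int(filled["conservation_status"].isna().sum())
print(counts)
print(remaining_nulls)
```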

## EDA

Start by testing whether `conservation_status` differs between parks.

In [31]:
category_status_crosstab = pd.crosstab(all_data["park_name"], all_data["conservation_status"])
print(category_status_crosstab)

conservation_status                  Endangered  In Recovery  Not Threatened  \
park_name                                                                      
Bryce National Park                          20            6            6188   
Great Smoky Mountains National Park          20            6            6188   
Yellowstone National Park                    20            6            6188   
Yosemite National Park                       20            6            6188   

conservation_status                  Species of Concern  Threatened  
park_name                                                            
Bryce National Park                                 183          11  
Great Smoky Mountains National Park                 183          11  
Yellowstone National Park                           183          11  
Yosemite National Park                              183          11  


In [32]:
print(all_data[all_data["conservation_status"] == "In Recovery"].sort_index())
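# A mask built from all_data does not line up with observation_data's rows,
# so select the matching observations by species name instead.
in_recovery_names = all_data.loc[all_data["conservation_status"] == "In Recovery", "scientific_name"].unique()
print(observation_data[observation_data["scientific_name"].isin(in_recovery_names)].sort_index())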

                scientific_name                            park_name  \
6009                Canis lupus               Yosemite National Park   
6012                Canis lupus                  Bryce National Park   
6015                Canis lupus                  Bryce National Park   
6018                Canis lupus                  Bryce National Park   
6021                Canis lupus  Great Smoky Mountains National Park   
6024                Canis lupus            Yellowstone National Park   
6027                Canis lupus            Yellowstone National Park   
6030                Canis lupus            Yellowstone National Park   
6033                Canis lupus  Great Smoky Mountains National Park   
6036                Canis lupus               Yosemite National Park   
6039                Canis lupus               Yosemite National Park   
6042                Canis lupus  Great Smoky Mountains National Park   
8676    Falco peregrinus anatum                  Bryce National 

In [15]:
chi2, pval, dof, expected = chi2_contingency(category_status_crosstab)
print(pval)

1.0
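
The p-value of 1.0 follows directly from the crosstab: all four park rows hold the same counts, so the observed table equals the expected one and the test statistic is zero. A minimal sketch that rebuilds the table from the counts printed above (the names `row` and `table` are just for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts per park, in the crosstab's column order:
# Endangered, In Recovery, Not Threatened, Species of Concern, Threatened.
row = [20, 6, 6188, 183, 11]

# All four parks share the same row of counts.
table = np.array([row] * 4)

chi2, pval, dof, expected = chi2_contingency(table)
print(chi2, pval, dof)
```

Because the park rows are identical, this test cannot separate the parks. A table of `category` against `conservation_status` may be a more informative input.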


In [13]:
obser_gsmnp = all_data.observations[all_data["park_name"] == "Great Smoky Mountains National Park"]
obser_yosemite = all_data.observations[all_data["park_name"] == "Yosemite National Park"]
obser_yellowstone = all_data.observations[all_data["park_name"] == "Yellowstone National Park"]
obser_bryce = all_data.observations[all_data["park_name"] == "Bryce National Park"]

# One-way ANOVA: do mean observation counts differ between the four parks?
fstat, pval = f_oneway(obser_gsmnp, obser_yosemite, obser_yellowstone, obser_bryce)
print(fstat, pval)

81547.46406291836 0.0
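
The ANOVA result only says that at least one park's mean observation count differs from the others. The notebook imports `pairwise_tukeyhsd`, which could be used as a follow-up to see which pairs differ. The sketch below runs it on synthetic data. The park labels and values are made up for illustration; on the real data the inputs would presumably be `all_data["observations"]` and `all_data["park_name"]`.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic observations: parks A and B share a mean, park C is higher.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "park_name": np.repeat(["Park A", "Park B", "Park C"], 50),
    "observations": np.concatenate([
        rng.normal(100, 10, 50),
        rng.normal(100, 10, 50),
        rng.normal(150, 10, 50),
    ]),
})

# Compare every pair of parks while controlling the family-wise error rate.
tukey = pairwise_tukeyhsd(endog=toy["observations"], groups=toy["park_name"], alpha=0.05)
print(tukey.summary())
```

The `reject` column of the summary marks the pairs whose means differ at the chosen `alpha`.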
