# Biodiversity in National Parks

This project investigates biodiversity data from the National Parks Service on species observed in various parks, focusing on their conservation statuses, to see whether there are patterns in the types of species that become endangered.

The aim of this project is to scope, clean, analyze, and plot the data, and to explain the findings of the analysis in a meaningful way.

**Sources:**
Both `Observations.csv` and `Species_info.csv` were provided by [Codecademy.com](https://www.codecademy.com).

Note: The data for this project is mostly fictional, inspired by real data.

## Scoping

### Project Goals

The project will analyze data from the National Parks Service, with the goal of understanding the characteristics of the species, their conservation status, and their relationship to the national parks.

Some of the questions to be tackled include:
- What is the distribution of conservation status for animals?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which species were spotted the most at each park?

### Data

The project has two datasets that came with the .zip file used. The first `.csv` file contains data about different species and their conservation status. The second `.csv` file holds recorded sightings of different species at several national parks for the past 7 days.

### Analysis 

The analysis uses descriptive statistics and data visualization techniques to understand the data. Some of the key metrics that will be computed include:

1. Distribution and counts
2. ...

### Evaluation

Lastly, I'll revisit the project's goals and summarize the output of the analysis using the initially stated questions. I'll also note any open questions, which can include limitations of the analysis or additional questions that could be answered with the data.

## Importing Modules and Data from Files

First, I'll import the preliminary modules for this project, along with the data from the two separate files provided for this analysis.

In [311]:
import pandas as pd
import numpy as np
import statsmodels as st
from matplotlib import pyplot as pyplot
import seaborn as sns
from glob import glob


observations = pd.read_csv('observations.csv')
species = pd.read_csv('species_info.csv')

## Data Characteristics

To prepare the data for analysis, I'll first scope both datasets using the `.head()` and `.info()` methods.

### `species`

The `species` dataset shows 5824 entries with four variables (i.e. columns):

- **category** - taxonomy for each species.
- **scientific_name** - scientific name of each species.
- **common_names** - common names of each species.
- **conservation_status** - species' conservation status.

The last column has far fewer non-null entries (191 of 5824) than the other columns, indicating there is missing data.

In [312]:
print('Species')
print(species.info())
# print(species.head())

Species
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB
None


All four columns are stored with the generic `object` dtype, which is not ideal either for memory usage or for expressing what each column holds. To see which dtype would be most appropriate, I'll sample the unique values of the categorical variables `category` and `conservation_status`, and change them to categorical types if possible.

In [313]:
print(species.category.unique(), '\n')
print(species.conservation_status.unique(), '\n')

species = species.astype({'category': 'category', 
                        'conservation_status': 'category',
                        # I'll also change the other columns to string variables 
                        'scientific_name': 'string', 
                        'common_names': 'string'})
print(species.info())

['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant'] 

[nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery'] 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   category             5824 non-null   category
 1   scientific_name      5824 non-null   string  
 2   common_names         5824 non-null   string  
 3   conservation_status  191 non-null    category
dtypes: category(2), string(2)
memory usage: 103.0 KB
None
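
As a rough illustration of why the categorical dtype is lighter, the sketch below compares the memory footprint of the same labels stored as `object` and as `category`. It uses a synthetic column (the seven `category` labels printed above, repeated an arbitrary number of times) rather than the project data, so the byte counts will not match the `.info()` output.

```python
import pandas as pd

# Synthetic stand-in for a low-cardinality column such as `category`.
# The labels come from the output above; the repeat count is arbitrary.
labels = ['Mammal', 'Bird', 'Reptile', 'Amphibian', 'Fish',
          'Vascular Plant', 'Nonvascular Plant']
as_object = pd.Series(labels * 1000, dtype='object')
as_category = as_object.astype('category')

# deep=True also counts the Python string objects, not only the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f'object:   {object_bytes} bytes')
print(f'category: {category_bytes} bytes')
```

A categorical column stores each distinct label once and keeps a small integer code per row, which is why the saving grows with the number of repeated values.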


### `observations`

Exploration of the `observations` dataset shows there are three columns:

- **scientific_name** - scientific name of each species.
- **park_name** - name of the national park species are located in.
- **observations** - number of observations in the past 7 days.

The columns don't show any missing data; however, the dtypes for the first two columns should be changed to a string type.

In [314]:
print('Observations')

print(observations.info(), '\n')
# print(observations.head())

observations = observations.astype({'scientific_name': 'string', 'park_name': 'string'})
print(observations.info())

Observations
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB
None 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  string
 1   park_name        23296 non-null  string
 2   observations     23296 non-null  int64 
dtypes: int64(1), string(2)
memory usage: 546.1 KB
None


## Exploratory Data Analysis

I'll start by exploring the `species` dataset in more detail. To inspect each column, I've created a function that loops over the list of columns and prints the number of unique values, the number of non-null values, the number of missing values, and a preview of the most frequent values. The exploration shows there are 7 species categories, 5541 unique scientific names, 5504 unique common names and 4 conservation statuses.

In [315]:
def column_eda(dataset):
    cols = list(dataset.columns)
    for col in cols:
        print(f'---------------{col}---------------')
        print(f'Unique values:', dataset[col].nunique(), '\t', 
                f'Length: {dataset[col].notnull().sum()}', '\t', 
                f'Missing values: {dataset[col].isnull().sum()}')
        print(dataset[col].value_counts().reset_index().head(4))

column_eda(species)

---------------category---------------
Unique values: 7 	 Length: 5824 	 Missing values: 0
               index  category
0     Vascular Plant      4470
1               Bird       521
2  Nonvascular Plant       333
3             Mammal       214
---------------scientific_name---------------
Unique values: 5541 	 Length: 5824 	 Missing values: 0
                  index  scientific_name
0     Castor canadensis                3
1  Hypochaeris radicata                3
2         Columba livia                3
3         Puma concolor                3
---------------common_names---------------
Unique values: 5504 	 Length: 5824 	 Missing values: 0
                index  common_names
0  Brachythecium Moss             7
1       Dicranum Moss             7
2         Panic Grass             6
3          Bryum Moss             6
---------------conservation_status---------------
Unique values: 4 	 Length: 191 	 Missing values: 5633
                index  conservation_status
0  Species of Concern  

Some preliminary remarks can be made from a first glance:
- `conservation_status` displays a high number of `nan` values (5633), which, given the available categories, may well be treated as 'species of no concern' or 'not endangered/threatened'.
- There are fewer unique `common_names` (5504) than unique `scientific_name` values (5541), although both columns have the same number of non-null values, which suggests that some species share a common name with other species in the dataset.
- The number of unique `scientific_name` values (5541) is smaller than the total number of entries (5824), which may point to multiple entries with the same scientific name.

To address the first observation, I'll change the missing values in the `conservation_status` column to "No concern".

In [316]:
print(species.conservation_status.unique(), '\n')

species.conservation_status = species.conservation_status.cat.add_categories('No concern').fillna('No concern')

print(species.conservation_status.unique())

[NaN, 'Species of Concern', 'Endangered', 'Threatened', 'In Recovery']
Categories (4, object): ['Endangered', 'In Recovery', 'Species of Concern', 'Threatened'] 

['No concern', 'Species of Concern', 'Endangered', 'Threatened', 'In Recovery']
Categories (5, object): ['Endangered', 'In Recovery', 'Species of Concern', 'Threatened', 'No concern']
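
The call to `.cat.add_categories()` before `.fillna()` matters because a categorical column only accepts values that are registered categories. The sketch below shows this on a small made-up Series (not the project data). The exception type raised by the unregistered fill has differed between pandas versions, so the example catches both `TypeError` and `ValueError` as a precaution.

```python
import numpy as np
import pandas as pd

# Made-up categorical Series with two missing values.
status = pd.Series(['Endangered', np.nan, 'Threatened', np.nan],
                   dtype='category')

# Filling with a label that is not yet a category is typically rejected.
try:
    status.fillna('No concern')
    fill_rejected = False
except (TypeError, ValueError):
    fill_rejected = True
print(f'Fill without registering the category rejected: {fill_rejected}')

# Registering the category first makes the fill valid.
filled = status.cat.add_categories('No concern').fillna('No concern')
print(filled.value_counts())
```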


To confirm the second observation, I'll scope the dataset for duplicates both overall and in each of those columns.

In [347]:
duplicates = species[species.duplicated()]
print(f'Overall duplicates (rows): {len(duplicates)}', '\n')

repeated_common_names = species.common_names[species.common_names.duplicated()]
print(f'Number of duplicated common names {len(repeated_common_names)}', '\n')

common_name_dupl_count = species.pivot_table(columns=['common_names'], aggfunc='size')\
                                .sort_values(ascending=False).reset_index()
print('----Common_name duplicate count----\n', common_name_dupl_count.head(), '\n'\
    f'Total duplicated: {len(common_name_dupl_count[common_name_dupl_count[0] > 1])}', '\n')

Brachythecium_Moss_rep = species.loc[species['common_names'] == 'Brachythecium Moss', 
                                    ['common_names', 'scientific_name']
                                    ]
print('----Example of repeated common name: Brachythecium Moss----\n', Brachythecium_Moss_rep)

Overall duplicates (rows): 0 

Number of duplicated common names 320 

----Common_name duplicate count----
          common_names  0
0  Brachythecium Moss  7
1       Dicranum Moss  7
2         Panic Grass  6
3            Sphagnum  6
4          Bryum Moss  6 
Total duplicated: 248 

----Example of repeated common name: Brachythecium Moss----
             common_names           scientific_name
2812  Brachythecium Moss   Brachythecium digastrum
2813  Brachythecium Moss  Brachythecium oedipodium
2814  Brachythecium Moss   Brachythecium oxycladon
2815  Brachythecium Moss    Brachythecium plumosum
2816  Brachythecium Moss    Brachythecium rivulare
2817  Brachythecium Moss   Brachythecium rutabulum
2818  Brachythecium Moss  Brachythecium salebrosum
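
The two counts above measure different things: `duplicated()` marks every occurrence after the first, so 320 is the number of extra rows, whereas 248 is the number of distinct common names that appear more than once. A more direct way to list common names shared by several species is to count the distinct scientific names behind each common name. The following is a minimal sketch on a small sample built from names printed in this notebook, not on the full dataset.

```python
import pandas as pd

# Small sample; all names are copied from outputs shown in this notebook.
sample = pd.DataFrame({
    'common_names': ['Brachythecium Moss', 'Brachythecium Moss',
                     'Brachythecium Moss', 'Gray Wolf'],
    'scientific_name': ['Brachythecium digastrum', 'Brachythecium oedipodium',
                        'Brachythecium oxycladon', 'Canis lupus'],
})

# Number of distinct scientific names behind each common name.
names_per_label = sample.groupby('common_names')['scientific_name'].nunique()

# Common names that refer to more than one species.
shared = names_per_label[names_per_label > 1]
print(shared)
```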


Regarding the third observation, if scientific names are repeated, this raises the question of whether the repeats are exact duplicates or distinct entries that differ in some way. The first possibility has been ruled out by the previous inspection for duplicate rows. I'll explore the second possibility by printing out the rows with repeated scientific names.

In [344]:
# duplicated() flags every occurrence of a scientific name after the first
repeated_sci_names = species.scientific_name[species.scientific_name.duplicated()]
print(f'Number of duplicated scientific names {len(repeated_sci_names)}', '\n')

scientific_name_dupl_count = species.pivot_table(columns=['scientific_name'], aggfunc='size')\
                                .sort_values(ascending=False).reset_index()
print('----Scientific name duplicate count----\n', scientific_name_dupl_count.head(), '\n'\
    f'Total duplicated: {len(scientific_name_dupl_count[scientific_name_dupl_count[0] > 1])}', '\n')

Canis_lupus_rep = species.loc[species['scientific_name'] == 'Canis lupus']
print('----Example of repeated scientific name: Canis lupus----\n', Canis_lupus_rep, '\n')

Myotis_lucifugus_rep = species.loc[species['scientific_name'] == 'Myotis lucifugus']
print('----Example of repeated scientific name: Myotis lucifugus----\n', Myotis_lucifugus_rep, '\n')

Number of duplicated scientific names 283 

----Scientific name duplicate count----
      scientific_name  0
0        Canis lupus  3
1     Holcus lanatus  3
2      Puma concolor  3
3   Myotis lucifugus  3
4  Castor canadensis  3 
Total duplicated: 274 

----Example of repeated scientific name: Canis lupus----
      category scientific_name     common_names conservation_status
8      Mammal     Canis lupus        Gray Wolf          Endangered
3020   Mammal     Canis lupus  Gray Wolf, Wolf         In Recovery
4448   Mammal     Canis lupus  Gray Wolf, Wolf          Endangered 

----Example of repeated scientific name: Myotis lucifugus----
      category   scientific_name  \
37     Mammal  Myotis lucifugus   
3042   Mammal  Myotis lucifugus   
4467   Mammal  Myotis lucifugus   

                                           common_names conservation_status  
37                Little Brown Bat, Little Brown Myotis  Species of Concern  
3042  Little Brown Bat, Little Brown Myotis, Little ...  Species of Co

Based on the above analysis, various species are recorded multiple times with different or multiple common names, and some with different conservation statuses. This poses a problem for further analysis.
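
To gauge how widespread the conflicting statuses are, one option is to count the distinct `conservation_status` values per scientific name. The sketch below does this on a small sample: the `Canis lupus` rows are copied from the output above, while `Exemplum fictum` is a made-up species added only as a consistent contrast case.

```python
import pandas as pd

sample = pd.DataFrame({
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Canis lupus',
                        'Exemplum fictum', 'Exemplum fictum'],  # second name is hypothetical
    'conservation_status': ['Endangered', 'In Recovery', 'Endangered',
                            'No concern', 'No concern'],
})

# Number of distinct statuses recorded for each species.
status_counts = sample.groupby('scientific_name')['conservation_status'].nunique()

# Species whose repeated entries disagree on the status.
conflicting = status_counts[status_counts > 1].index.tolist()
print(conflicting)
```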

A possible solution might be to enter each repeated species once, with (1) the most informative `common_names` entry and (2) the worst `conservation_status`. The first step assumes that longer `common_names` entries list all *possible* common names for a given species (as seen in the two examples printed above). The second step takes a fatalistic view of a species' conservation status, which will consequently favor more severe statuses and obscure positive progress in a species' conservation.

In [351]:
print(species[species.scientific_name.duplicated()].head())

species = 

     category         scientific_name                          common_names  \
3017   Mammal          Cervus elaphus                    Rocky Mountain Elk   
3019   Mammal  Odocoileus virginianus  White-Tailed Deer, White-Tailed Deer   
3020   Mammal             Canis lupus                       Gray Wolf, Wolf   
3022   Mammal           Puma concolor           Cougar, Mountain Lion, Puma   
3025   Mammal        Lutra canadensis                           River Otter   

     conservation_status  
3017          No concern  
3019          No concern  
3020         In Recovery  
3022          No concern  
3025          No concern  
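
One way the proposed consolidation could look is sketched below. This is only an illustration under stated assumptions, not the notebook's final implementation: the helper name `collapse_species` is hypothetical, the severity ranking is my own assumption (in particular where 'In Recovery' sits), and the sample rows are copied from outputs shown earlier rather than taken from the full dataset.

```python
import pandas as pd

# Assumed ranking, least to most severe. The data does not define this order.
severity = ['No concern', 'In Recovery', 'Species of Concern',
            'Threatened', 'Endangered']
rank = {status: i for i, status in enumerate(severity)}

sample = pd.DataFrame({
    'category': ['Mammal'] * 4,
    'scientific_name': ['Canis lupus', 'Canis lupus', 'Canis lupus',
                        'Lutra canadensis'],
    'common_names': ['Gray Wolf', 'Gray Wolf, Wolf', 'Gray Wolf, Wolf',
                     'River Otter'],
    'conservation_status': ['Endangered', 'In Recovery', 'Endangered',
                            'No concern'],
})


def collapse_species(df):
    """Return one row per scientific name (hypothetical helper)."""
    df = df.assign(
        name_len=df['common_names'].str.len(),
        # astype(object) so the mapping also works on categorical columns
        severity_rank=df['conservation_status'].astype(object).map(rank),
    )
    # (1) Keep the row with the longest common_names entry per species.
    longest_idx = df.groupby('scientific_name')['name_len'].idxmax()
    collapsed = (df.loc[longest_idx]
                   .drop(columns=['name_len', 'severity_rank'])
                   .set_index('scientific_name'))
    # (2) Overwrite the status with the most severe one recorded per species.
    worst_rank = df.groupby('scientific_name')['severity_rank'].max()
    worst_status = worst_rank.map(lambda r: severity[r])
    return collapsed.assign(conservation_status=worst_status).reset_index()


result = collapse_species(sample)
print(result)
```

In this sample, `Canis lupus` collapses to a single row with the longer name list and the status 'Endangered', so the 'In Recovery' record is lost, which is exactly the trade-off described above.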
