## Introduction

This project analyzes data from the National Park Service on endangered species across different parks.

The goal of this project is to analyze biodiversity data from the National Park Service, particularly the species observed at different national park locations. In our hypothetical scenario, we are a biodiversity analyst working for the National Park Service, which wants to ensure the survival of at-risk species and maintain the parks' biodiversity.

This project will scope, prepare, analyze, and plot the data, and then seek to explain the findings from the analysis.

Here are a few questions that we want to answer with this project:

- What is the distribution of conservation status across species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent, and what is its distribution among parks?

Data sources:

Both observations.csv and species_info.csv were sourced from https://content.codecademy.com/PRO/paths/data-science/biodiversity-solution.zip

Note: While inspired by real data, the data here is mostly fictional.

## Scoping

Here we decide the scope of this project, working through four categories: Project Goals, Data, Analysis, and Evaluation.

### Project Goals

In our hypothetical scenario, we are a biodiversity analyst working for the National Park Service. The goal is to help ensure the survival of at-risk species and so maintain the parks' biodiversity.

As described in the introduction, we have a few questions that we want to answer:

- What is the distribution of conservation status across species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent, and what is its distribution among parks?
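One possible way to approach the significance question is a chi-squared test of independence on a contingency table of species category against protection status. The sketch below is only an illustration under stated assumptions: the counts are invented, the `protected` / `not_protected` split is a hypothetical grouping, and it assumes `scipy` is installed alongside the modules used in this project.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts, invented purely for illustration (not taken from the dataset).
contingency = pd.DataFrame(
    {"protected": [10, 20], "not_protected": [90, 80]},
    index=["Mammal", "Bird"],
)

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the counts expected if category and status were independent.
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
```

A small p-value would suggest that protection status is not independent of category for the two groups compared; the usual 0.05 cut-off is a convention, not something this project has fixed yet.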

### Data

We have access to two CSV files. 'observations.csv' records observations of species in park locations, and 'species_info.csv' holds information about the species themselves. We'll go into more detail about these once the project is under way.

### Analysis

### Evaluation

## Import modules

Here we import the modules this project relies on.

numpy is for numerical manipulation,
pandas is for data manipulation,
pyplot and seaborn are for plotting data.

In [3]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

## Loading the Data

Let's read our data and see what we have:

In [4]:
df_obs = pd.read_csv('observations.csv')
df_species = pd.read_csv('species_info.csv')

print(df_obs.head())
print(df_species.head())

            scientific_name                            park_name  observations
0        Vicia benghalensis  Great Smoky Mountains National Park            68
1            Neovison vison  Great Smoky Mountains National Park            77
2         Prunus subcordata               Yosemite National Park           138
3      Abutilon theophrasti                  Bryce National Park            84
4  Githopsis specularioides  Great Smoky Mountains National Park            85
  category                scientific_name  \
0   Mammal  Clethrionomys gapperi gapperi   
1   Mammal                      Bos bison   
2   Mammal                     Bos taurus   
3   Mammal                     Ovis aries   
4   Mammal                 Cervus elaphus   

                                        common_names conservation_status  
0                           Gapper's Red-Backed Vole                 NaN  
1                              American Bison, Bison                 NaN  
2  Aurochs, Aurochs, Domestic 

So we have two CSV files.

One, called 'observations.csv', lists different species paired with the national park where they were spotted, along with the number of observations of that species at that park. According to the description, these are observations over the past 7 days.

The other, 'species_info.csv', has information about different species: scientific and common names, the category of species (e.g., mammal, bird, reptile, vascular plant), and their conservation status (whether endangered or not, see https://en.wikipedia.org/wiki/Endangered_species).
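Since both tables share a `scientific_name` column, they can be combined on it. The following is a minimal sketch using small made-up frames in place of `df_obs` and `df_species`; the species and park names are placeholders, and only the column names come from the output above.

```python
import pandas as pd

# Toy stand-ins for the two tables; every row here is invented for illustration.
obs = pd.DataFrame({
    "scientific_name": ["Exemplum alpha", "Exemplum alpha",
                        "Exemplum beta", "Exemplum beta"],
    "park_name": ["Park A", "Park B", "Park A", "Park B"],
    "observations": [120, 80, 60, 95],
})
species = pd.DataFrame({
    "category": ["Mammal", "Bird"],
    "scientific_name": ["Exemplum alpha", "Exemplum beta"],
})

# Attach species metadata to each observation row via the shared key.
merged = obs.merge(species, on="scientific_name", how="left")

# Total observations per park and category.
totals = merged.groupby(["park_name", "category"])["observations"].sum()

# Most observed species in each park: take the row with the largest count per park.
top_rows = merged.loc[merged.groupby("park_name")["observations"].idxmax()]
top_species = top_rows.set_index("park_name")["scientific_name"]

print(totals)
print(top_species)
```

Passing `validate="many_to_one"` to `merge` makes pandas raise an error if `scientific_name` is not unique in the species table, which is a cheap guard against accidentally duplicated rows.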

#TODO: Check NaN for conservation_status / other variables. Plot most popular animals of each park

In [6]:
print(df_species.conservation_status.unique())

[nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']
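To see how many rows carry each status, including the missing ones, something like the sketch below could be used. It runs on a small invented frame rather than the real `df_species`, and whether `NaN` here means "no special status" is an assumption that still needs checking against the data description.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_species; the values are invented for illustration.
species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird", "Bird", "Reptile"],
    "conservation_status": [np.nan, "Endangered", np.nan,
                            "Species of Concern", np.nan],
})

# Number of missing values in each column.
missing_per_column = species.isna().sum()

# Frequency of each status, keeping NaN as its own row instead of dropping it.
status_counts = species["conservation_status"].value_counts(dropna=False)

print(missing_per_column)
print(status_counts)
```

The same two calls should work unchanged on `df_species`, since they only rely on the `conservation_status` column shown above.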
