# Introduction

The goal of this project is to analyze biodiversity data from the National Parks Service, focusing on the species observed in different national park locations.

This project will scope, prepare, analyze, and plot the data, and then seek to explain the findings from the analysis.

Here are a few questions that this project seeks to answer:

- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent, and what is its distribution among parks?

**Data sources:**

Both `Observations.csv` and `species_info.csv` were provided by [Codecademy.com](https://www.codecademy.com).

Note: The data for this project is *inspired* by real data, but is mostly fictional.

## Scoping

To ensure the success of the project, it's important to establish a clear scope that outlines the goals, data, analysis, evaluation, and expected output of the project. The following sections will define each of these components in detail to guide the project process and ensure it aligns with the project objectives.

### Project Goals

The goal of this project is to analyze data from the National Parks Service about endangered species in various parks and investigate any patterns or themes related to the types of species that become endangered. From the perspective of a biodiversity analyst for the National Parks Service, the objective is to understand the conservation statuses of the species and their relationship to the parks. To achieve this, the following questions will be posed and answered through data analysis:

- What is the distribution of conservation status for species?
- Are certain types of species more likely to become endangered?
- Are there any significant differences in the conservation status of species?
- Which species are most prevalent, and what is their distribution among parks?

### Data 

The data for this project is framed as National Parks Service records and was supplied by Codecademy as two datasets. The first is a CSV file containing information about each species; the second contains observations of species with park locations. This data will be cleaned, analyzed, and visualized to answer the research questions and achieve the project goals.

### Analysis 

The data analysis will involve cleaning and processing the datasets to extract the information relevant to the project goals. Descriptive statistics and data visualization will be used to gain a better understanding of the data, and statistical inference will be used to test whether the observed differences are statistically significant. Key metrics that will be computed include:

1. Distributions of conservation status for species.
2. Counts of each species and their prevalence in different parks.
3. Relationships between species and their conservation status.
4. Comparison of the conservation status of different species.
5. Observations of species in parks and their distribution.
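
As a rough illustration of metrics 3 and 4, the sketch below tabulates protected versus unprotected species per category and runs a chi-squared test of independence. The toy counts, the `is_protected` column, and the `chi2_2x2` helper are all assumptions introduced for illustration and are not taken from the project data. The p-value shortcut only holds for a 2x2 table (one degree of freedom), and no continuity correction is applied.

```python
import math

import pandas as pd

# Hypothetical toy data: one row per species (not the project data).
toy = pd.DataFrame({
    "category": ["Mammal"] * 10 + ["Bird"] * 10,
    "is_protected": [True] * 4 + [False] * 6 + [True] * 1 + [False] * 9,
})

# Contingency table: rows are categories, columns are protection status.
table = pd.crosstab(toy["category"], toy["is_protected"])


def chi2_2x2(observed):
    """Pearson chi-squared statistic and p-value for a 2x2 table (no continuity correction)."""
    values = observed.to_numpy(dtype=float)
    row_totals = values.sum(axis=1, keepdims=True)
    col_totals = values.sum(axis=0, keepdims=True)
    # Expected counts under independence of rows and columns.
    expected = row_totals * col_totals / values.sum()
    stat = float(((values - expected) ** 2 / expected).sum())
    # With one degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p_value = chi2_2x2(table)
print(table)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
```

For larger tables, a library routine such as `scipy.stats.chi2_contingency` is likely the more practical choice; note that it applies a continuity correction to 2x2 tables by default, so its result would differ slightly from this hand-rolled version.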

### Evaluation

The evaluation section will confirm that the project goals have been achieved by checking whether the output of the analysis answers the questions posed in the project goals section. It will also reflect on what has been learned through the project, any limitations, and whether any of the questions could not be answered. Additionally, this section will assess whether any of the analysis could have been done using different methodologies.

### Output

The output of this project will include a detailed report of the findings from the data analysis, along with visualizations that clearly demonstrate the trends and patterns observed in the data. This report will provide insights into the conservation status of species in different parks and can be used by the National Parks Service to make informed decisions about endangered species conservation.

In [1]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

## Data

#### species

`species_info.csv` contains information on the different species in the National Parks. The columns included are:

- **category** - The category of taxonomy for each species
- **scientific_name** - The scientific name of each species
- **common_names** - The common names of each species
- **conservation_status** - The species conservation status

#### observations

`Observations.csv` contains information from recorded sightings of different species throughout the national parks in the past 7 days. The columns included are:

- **scientific_name** - The scientific name of each species
- **park_name** - The name of the national park
- **observations** - The number of observations in the past 7 days
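
Both files share a `scientific_name` column, so the two tables can be joined on it. The sketch below is a minimal illustration under stated assumptions: the observation rows and their counts are placeholders invented for the example, and it assumes `scientific_name` is unique within the species table, which should be checked on the real data before merging.

```python
import pandas as pd

# Two species rows in the shape of species_info.csv.
species = pd.DataFrame({
    "category": ["Mammal", "Mammal"],
    "scientific_name": ["Bos bison", "Cervus elaphus"],
    "common_names": ["American Bison, Bison", "Wapiti Or Elk"],
    "conservation_status": [None, None],
})

# Placeholder observation rows for illustration only (counts are made up).
observations = pd.DataFrame({
    "scientific_name": ["Bos bison", "Bos bison", "Cervus elaphus"],
    "park_name": [
        "Yosemite National Park",
        "Bryce National Park",
        "Yosemite National Park",
    ],
    "observations": [10, 20, 30],
})

# Left join keeps every observation row and attaches the species details.
merged = observations.merge(species, on="scientific_name", how="left")

# validate="many_to_one" raises if scientific_name is duplicated in species.
checked = observations.merge(
    species, on="scientific_name", how="left", validate="many_to_one"
)

print(merged[["scientific_name", "park_name", "observations", "category"]])
```

A left join is used here so that no sighting is dropped when a species has no matching metadata; an inner join would silently discard such rows.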

In [5]:
species = pd.read_csv("species_info.csv")
species.head()

| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | |
| 1 | Mammal | Bos bison | American Bison, Bison | |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | |


In [6]:
observations = pd.read_csv("Observations.csv")
observations.head()

| | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |
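
Since each row is a count of sightings for one species in one park, totals per park follow from a group-and-sum. The sketch below applies that idea to the five preview rows shown above, re-entered as inline CSV so the example runs on its own; the variable names `preview` and `per_park` are illustrative, and the totals describe only these five rows, not the full file.

```python
import io

import pandas as pd

# The five preview rows shown above, re-entered as inline CSV.
preview_csv = """scientific_name,park_name,observations
Vicia benghalensis,Great Smoky Mountains National Park,68
Neovison vison,Great Smoky Mountains National Park,77
Prunus subcordata,Yosemite National Park,138
Abutilon theophrasti,Bryce National Park,84
Githopsis specularioides,Great Smoky Mountains National Park,85
"""
preview = pd.read_csv(io.StringIO(preview_csv))

# Total sightings per park, largest first.
per_park = (
    preview.groupby("park_name")["observations"]
    .sum()
    .sort_values(ascending=False)
)
print(per_park)
```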


#### Data Characteristics

`species` has 5,824 rows and 4 columns while `observations` has 23,296 rows and 3 columns.

In [7]:
print(f"Species shape: {species.shape}")
print(f"Observations shape: {observations.shape}")

Species shape: (5824, 4)
Observations shape: (23296, 3)
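
Beyond shape, a first look at data characteristics usually covers column types, missing values, and distinct counts. The sketch below runs those checks on three of the species preview rows shown earlier, re-entered as inline CSV so it is self-contained. The preview hints that `conservation_status` is often empty, but how often that holds across the full file is not established here.

```python
import io

import pandas as pd

# Three of the species preview rows shown earlier, re-entered as inline CSV.
preview_csv = '''category,scientific_name,common_names,conservation_status
Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
Mammal,Bos bison,"American Bison, Bison",
Mammal,Cervus elaphus,Wapiti Or Elk,
'''
preview = pd.read_csv(io.StringIO(preview_csv))

# Missing values per column; empty CSV fields are read as NaN.
missing = preview.isna().sum()

# Number of distinct values per column (NaN is excluded by default).
distinct = preview.nunique()

print(preview.dtypes)
print(missing)
print(distinct)
```

The same three calls can be applied to the full `species` and `observations` frames once they are loaded.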


## Explore the Data