# Introduction

## Aims and Goals

The goal of this project is to analyse biodiversity data from the National Parks Service. Various species will be investigated across different national park locations.

The project will aim to provide answers to the following questions:
1. What is the distribution of conservation status for species?
2. Are certain types of species more likely to be endangered?
3. Is there a correlation between species and their conservation status?
4. Which animal is the most common and what is their distribution amongst parks?


## Data Source
Both `Species_info.csv` and `observations.csv` are provided by [Codecademy.com](https://www.codecademy.com/).

Note: The data is mostly fictional and does not represent real-life data.

## Project Scoping

### Project Goals
The project goal is to examine biodiversity data for the National Parks Service. It should provide insight into what is expected of a data analyst on a day-to-day basis, and into which tools and methodologies to deploy where appropriate. The project aims to understand the relationship between species and their conservation status. The following questions will guide the project:
1. What is the distribution of conservation status for species?
2. Are certain types of species more likely to be endangered?
3. Is there a correlation between species and their conservation status?
4. Which animal is the most common and what is their distribution amongst parks?

### Data
The `Species_info.csv` file contains information about each species. The `observations.csv` file contains information on observations of species within the park locations. Both files will be examined to answer the project goal questions.

### Analysis
Descriptive statistics and data visualisation techniques will be used to analyse and understand the data. Inferential statistics will also be used, because descriptive statistics alone will not be enough to answer the project goal questions.

### Evaluation
The final section will outline a conclusion. The conclusion will aim to answer the project goal questions, note any limitations of the methodologies deployed, and state whether any improvements could be made. Any suggestions for further research will also be mentioned.

## Import Python Modules

In [14]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

## Loading the Data

### Species

In [18]:
species = pd.read_csv("Species_info.csv")
species.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,


In [19]:
species.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB


#### Species Columns
- **category** - Taxonomy for each species
- **scientific_name** - The scientific name for each species
- **common_names** - The common names of each species
- **conservation_status** - The conservation status for each species

### Observations

In [20]:
observations = pd.read_csv("observations.csv")
observations.head()

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


In [21]:
observations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB


#### Observations Columns
- **scientific_name** - The scientific name of each species
- **park_name** - The name of the national park
- **observations** - The number of observations in the past 7 days
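
Both tables share the `scientific_name` column, so they can be joined on it when a question needs the category and the park counts together. The block below is a minimal sketch on two hypothetical miniature frames (`species_demo` and `observations_demo` are illustrative names, and the category labels and most counts are made up for the example). It is not the merge used in this project.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; values are illustrative only.
species_demo = pd.DataFrame({
    "category": ["Mammal", "Vascular Plant"],
    "scientific_name": ["Bos bison", "Prunus subcordata"],
})
observations_demo = pd.DataFrame({
    "scientific_name": ["Bos bison", "Bos bison", "Prunus subcordata"],
    "park_name": [
        "Yosemite National Park",
        "Bryce National Park",
        "Yosemite National Park",
    ],
    "observations": [120, 95, 138],
})

# A left join keeps every observation row and attaches the species category.
merged = observations_demo.merge(species_demo, on="scientific_name", how="left")
print(merged)
```

A left join is chosen here so that no observation row is dropped if a species is missing from the species table. If the species table contains repeated scientific names, the join will repeat the matching observation rows, so it is worth checking for repeats first.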

## Explore the Data

In this section we take a first look at the data. The statement below prints the number of unique species, which is 5,541.

In [22]:
print(f"Number of Species:{species.scientific_name.nunique()}")

Number of Species:5541
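
The species table has 5,824 rows but only 5,541 unique scientific names, so 283 rows repeat a name that already appears elsewhere. One way to inspect such rows is sketched below on a hypothetical miniature frame (`species_demo` and its values, including the second common name for the elk, are made up for illustration). It assumes the repeats are identified by `scientific_name` alone.

```python
import pandas as pd

# Hypothetical miniature stand-in for the species table (not the real data).
species_demo = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird", "Mammal"],
    "scientific_name": [
        "Cervus elaphus",
        "Cervus elaphus",
        "Turdus migratorius",
        "Bos bison",
    ],
    "common_names": [
        "Wapiti Or Elk",
        "Rocky Mountain Elk",
        "American Robin",
        "American Bison, Bison",
    ],
})

n_rows = len(species_demo)
n_unique = species_demo["scientific_name"].nunique()

# keep=False flags every row whose scientific_name occurs more than once.
repeated = species_demo[species_demo["scientific_name"].duplicated(keep=False)]

print(f"Rows: {n_rows}, unique names: {n_unique}, surplus rows: {n_rows - n_unique}")
print(repeated)
```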


The print statements below show the number of unique categories and list them. There are 7 unique categories: Mammal, Bird, Reptile, Amphibian, Fish, Vascular Plant and Nonvascular Plant.

In [28]:
print(f"Number of Categories:{species.category.nunique()}")
print(f"Unique Categories:{species.category.unique()}")

Number of Categories:7
Unique Categories:['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']


The following statement counts the species records in each category. Vascular plants make up the largest share at about 77% (4,470 of 5,824 records), whilst reptiles are the fewest at about 1.4% (79 records).

In [26]:
species.groupby("category").size()

category
Amphibian              80
Bird                  521
Fish                  127
Mammal                214
Nonvascular Plant     333
Reptile                79
Vascular Plant       4470
dtype: int64
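
The share of each category can be derived directly from the counts printed above. The sketch below re-enters those counts by hand in a Series named `category_counts` (an illustrative name) and converts them to percentages.

```python
import pandas as pd

# Counts copied from the groupby output above.
category_counts = pd.Series(
    {
        "Amphibian": 80,
        "Bird": 521,
        "Fish": 127,
        "Mammal": 214,
        "Nonvascular Plant": 333,
        "Reptile": 79,
        "Vascular Plant": 4470,
    },
    name="species",
)

# Percentage of all records that fall in each category, to one decimal place.
share_pct = (category_counts / category_counts.sum() * 100).round(1)
print(share_pct.sort_values(ascending=False))
```

On the full DataFrame, `species["category"].value_counts(normalize=True)` should give the same proportions without re-entering the counts.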

The following statements examine the conservation status column. There are 4 unique conservation statuses (Species of Concern, Endangered, Threatened and In Recovery), plus missing (nan) values.

In [30]:
print(f"Number of conservation statuses:{species.conservation_status.nunique()}")
print(f"Unique conservation statuses:{species.conservation_status.unique()}")

Number of conservation statuses:4
Unique conservation statuses:[nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']


The following statements count the species in each conservation status. There are 5,633 missing (nan) values, which suggests that most species have no recorded conservation concern. Of the remaining 191 species, 161 are species of concern, 16 are endangered, 10 are threatened and 4 are in recovery.

In [31]:
print(f"na values:{species.conservation_status.isna().sum()}")

na values:5633


In [32]:
species.groupby("conservation_status").size()

conservation_status
Endangered             16
In Recovery             4
Species of Concern    161
Threatened             10
dtype: int64
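
Note that `groupby` drops the missing values by default, which is why the counts above add up to 191 rather than 5,824. If a missing status is read as "no conservation concern recorded" (an assumption, since the source does not define it), the nan values can be given an explicit label so that they show up in the counts. The sketch below uses a hypothetical miniature frame, and the label `"No Intervention"` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in; species names and statuses are placeholders.
status_demo = pd.DataFrame({
    "scientific_name": ["Species A", "Species B", "Species C", "Species D"],
    "conservation_status": [np.nan, "Endangered", np.nan, "Species of Concern"],
})

# Replace missing statuses with an explicit (assumed) label.
labelled = status_demo["conservation_status"].fillna("No Intervention")
status_counts = labelled.value_counts()
print(status_counts)
```

If the nan values should stay as they are, `value_counts(dropna=False)` reports them as their own row without relabelling.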

### Observations Data

The following statements examine the park names. There are 4 unique parks: Great Smoky Mountains National Park, Yosemite National Park, Bryce National Park and Yellowstone National Park.

In [34]:
print(f"Number of Parks:{observations.park_name.nunique()}")
print(f"Unique Parks:{observations.park_name.unique()}")

Number of Parks:4
Unique Parks:['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Bryce National Park' 'Yellowstone National Park']


The following statement sums the observations made across all parks over the last 7 days. A total of 3,314,739 observations were made.

In [36]:
print(f"Number of Observations:{observations.observations.sum()}")

Number of Observations:3314739
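
The total above covers all parks combined. A per-park breakdown can be obtained by grouping on `park_name` before summing. The sketch below does this on only the first five rows shown in the `observations.head()` output earlier, so the figures are illustrative and not the real park totals. `sample` and `per_park` are names chosen for the example.

```python
import pandas as pd

# First five rows of the observations table, copied from the head() output.
sample = pd.DataFrame({
    "scientific_name": [
        "Vicia benghalensis",
        "Neovison vison",
        "Prunus subcordata",
        "Abutilon theophrasti",
        "Githopsis specularioides",
    ],
    "park_name": [
        "Great Smoky Mountains National Park",
        "Great Smoky Mountains National Park",
        "Yosemite National Park",
        "Bryce National Park",
        "Great Smoky Mountains National Park",
    ],
    "observations": [68, 77, 138, 84, 85],
})

# Sum the observation counts within each park, largest first.
per_park = (
    sample.groupby("park_name")["observations"]
    .sum()
    .sort_values(ascending=False)
)
print(per_park)
```

The same `groupby` call applied to the full `observations` DataFrame would give the totals for all four parks.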


## Analysis

## Conclusion

## Further Research