# BIODIVERSITY IN NATIONAL PARKS

## Introduction

The goal of this project is to analyze biodiversity data from these four National Parks during a one-week period:
* Bryce National Park
* Great Smoky Mountains National Park
* Yellowstone National Park
* Yosemite National Park

This project will analyze the data and seek answers to the following questions:
* How many species were observed this week at these four National Parks?
* What are the conservation statuses of these species?
* Which category of species is most protected?

## Data sources
Both observations.csv and species_info.csv were provided by Codecademy.com.
The data for this project is inspired by real data, but is mostly fictional.

In [2]:
from IPython.display import Image
Image(url="yosemite-park.jpg", width=300, height=300) 

In [3]:
# Import Python modules
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

from scipy.stats import chi2_contingency

from itertools import chain
import string

## Load in data
There are two data files to load.

In [4]:
species = pd.read_csv('species_info.csv', encoding='utf-8')
species.head(20)

| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |
| 5 | Mammal | Odocoileus virginianus | White-Tailed Deer | NaN |
| 6 | Mammal | Sus scrofa | Feral Hog, Wild Pig | NaN |
| 7 | Mammal | Canis latrans | Coyote | Species of Concern |
| 8 | Mammal | Canis lupus | Gray Wolf | Endangered |
| 9 | Mammal | Canis rufus | Red Wolf | Endangered |


In [5]:
observations = pd.read_csv('observations.csv', encoding='utf-8')
observations.head(10)

| | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |
| 5 | Elymus virginicus var. virginicus | Yosemite National Park | 112 |
| 6 | Spizella pusilla | Yellowstone National Park | 228 |
| 7 | Elymus multisetus | Great Smoky Mountains National Park | 39 |
| 8 | Lysimachia quadrifolia | Yosemite National Park | 168 |
| 9 | Diphyscium cumberlandianum | Yellowstone National Park | 250 |


## Scope the data sets

In [6]:
print(f"Species shape: {species.shape}")
print(f"Observations shape: {observations.shape}")

Species shape: (5824, 4)
Observations shape: (23296, 3)


The "Species" data set has 5,824 rows and 4 columns.

The "Observations" data set has 23,296 rows and 3 columns.

In [7]:
print(species.columns)
print(species.dtypes)

Index(['category', 'scientific_name', 'common_names', 'conservation_status'], dtype='object')
category               object
scientific_name        object
common_names           object
conservation_status    object
dtype: object


In [8]:
print(observations.columns)
print(observations.dtypes)

Index(['scientific_name', 'park_name', 'observations'], dtype='object')
scientific_name    object
park_name          object
observations        int64
dtype: object


### How many species have been observed in these four National Parks this week?

In [9]:
print(f"There have been {species.scientific_name.nunique()} species observed in these four National Parks this week.\n")

There have been 5541 species observed in these four National Parks this week.
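
The species table has 5,824 rows but only 5,541 unique scientific names, so some names must appear on more than one row. The following is a minimal sketch of how one might surface such repeats. It uses a small hand-made frame rather than the real file: `toy_species` and its duplicated row are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Toy stand-in for species_info.csv; the repeated "Canis lupus" row is
# hypothetical and only there to demonstrate the check.
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Mammal", "Mammal"],
    "scientific_name": ["Canis lupus", "Canis lupus", "Canis latrans", "Bos bison"],
    "conservation_status": ["Endangered", "In Recovery", "Species of Concern", None],
})

n_rows = len(toy_species)
n_unique = toy_species["scientific_name"].nunique()
n_extra = n_rows - n_unique  # rows beyond the first occurrence of each name

# keep=False flags every row whose name is shared with another row
repeated = toy_species[toy_species["scientific_name"].duplicated(keep=False)]

print(f"{n_rows} rows, {n_unique} unique names, {n_extra} extra row(s)")
print(repeated)
```

Applied to the real `species` frame, the same subtraction gives 5,824 - 5,541 = 283 extra rows, which would be worth inspecting before any merge with the observations table.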



### How many species are there in each of these categories?

In [10]:
print("There are " + str(species.category.nunique()) + " categories of species in this dataset.")

There are 7 categories of species in this dataset.


In [11]:
species.groupby("category").size()

category
Amphibian              80
Bird                  521
Fish                  127
Mammal                214
Nonvascular Plant     333
Reptile                79
Vascular Plant       4470
dtype: int64
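
To put these counts in proportion, the sketch below converts them to percentages of all rows. The numbers are copied by hand from the groupby output; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the species.groupby("category").size() output
category_counts = pd.Series({
    "Amphibian": 80,
    "Bird": 521,
    "Fish": 127,
    "Mammal": 214,
    "Nonvascular Plant": 333,
    "Reptile": 79,
    "Vascular Plant": 4470,
})

total = category_counts.sum()

# Percentage of rows per category, largest first
shares = (category_counts / total * 100).round(1).sort_values(ascending=False)
print(f"Total rows: {total}")
print(shares)
```

The total matches the 5,824 rows reported for the species table, and vascular plants account for roughly three quarters of them.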

### How many National Parks are there in the dataset, and what are their names?


In [12]:
print("There are " + str(observations.park_name.nunique()) + " National Parks in this dataset.")

There are 4 National Parks in this dataset.


In [13]:
print(observations.park_name.unique())

['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Bryce National Park' 'Yellowstone National Park']


#### There are 4 parks in the data set:
1. Bryce National Park
2. Great Smoky Mountains National Park
3. Yellowstone National Park
4. Yosemite National Park
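
As a rough illustration of how observation counts could be totaled per park, the sketch below groups only the ten rows shown by observations.head(10). The totals therefore describe that sample, not the full data set, and the name `sample` is introduced here for illustration.

```python
import pandas as pd

# The ten rows displayed by observations.head(10), copied by hand
sample = pd.DataFrame({
    "scientific_name": [
        "Vicia benghalensis", "Neovison vison", "Prunus subcordata",
        "Abutilon theophrasti", "Githopsis specularioides",
        "Elymus virginicus var. virginicus", "Spizella pusilla",
        "Elymus multisetus", "Lysimachia quadrifolia",
        "Diphyscium cumberlandianum",
    ],
    "park_name": [
        "Great Smoky Mountains National Park",
        "Great Smoky Mountains National Park",
        "Yosemite National Park",
        "Bryce National Park",
        "Great Smoky Mountains National Park",
        "Yosemite National Park",
        "Yellowstone National Park",
        "Great Smoky Mountains National Park",
        "Yosemite National Park",
        "Yellowstone National Park",
    ],
    "observations": [68, 77, 138, 84, 85, 112, 228, 39, 168, 250],
})

# Sum the observation counts within each park, largest total first
per_park = (
    sample.groupby("park_name")["observations"]
    .sum()
    .sort_values(ascending=False)
)
print(per_park)
```

Running the same groupby on the full `observations` frame would give the week's totals for each of the four parks.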

### What are the conservation statuses of the species?


In [14]:
print("There are " + str(species.conservation_status.nunique()) + " conservation statuses.")
print("These conservation statuses are " + str(species.conservation_status.unique()))

There are 4 conservation statuses.
These conservation statuses are [nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']


#### There are five distinct values in the conservation_status column. Four are named statuses and one is empty data ("nan"):
* Endangered
* In Recovery
* Species of Concern
* Threatened
* nan (empty data)

nunique() reports 4 because it does not count missing values. We will assume that the empty data means "Not Endangered", so "nan" will be converted to "Not Endangered" when we clean the data later on.
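
The gap between the count of 4 and the five listed values comes from how pandas treats missing data. The sketch below shows this on a small hypothetical Series (`toy_status` is not from the data files), along with one possible way to apply the "Not Endangered" assumption using `fillna`. The notebook's own cleaning step may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to demonstrate the behavior
toy_status = pd.Series(
    [np.nan, "Species of Concern", "Endangered", np.nan],
    name="conservation_status",
)

# nunique() skips NaN by default; dropna=False counts it as a value
named_only = toy_status.nunique()
with_missing = toy_status.nunique(dropna=False)
print(named_only, with_missing)

# Assumption from the text: missing status means "Not Endangered"
filled = toy_status.fillna("Not Endangered")
print(filled.value_counts())
```

`fillna` returns a new Series here, so the original values stay available for comparison.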


#### Number of species in each conservation status:

In [15]:
species.groupby("conservation_status").size()

conservation_status
Endangered             16
In Recovery             4
Species of Concern    161
Threatened             10
dtype: int64

#### Species with a recorded status have some level of protection, and most of them (161 of 191) fall into the "Species of Concern" conservation category. This is the lowest level of conservation.

In [16]:
print(species.conservation_status)

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
5819    NaN
5820    NaN
5821    NaN
5822    NaN
5823    NaN
Name: conservation_status, Length: 5824, dtype: object


#### The "NaN" values show that most rows in the conservation column are missing data. We assume this means those species are not in danger.
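
The scale of the missing data can be worked out from the earlier outputs: 16 + 4 + 161 + 10 = 191 rows carry a status, so 5,824 - 191 = 5,633 rows are NaN. The sketch below shows how such a count could be taken directly, using a small hypothetical Series in place of the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for species.conservation_status
toy_status = pd.Series(
    ["Endangered", np.nan, np.nan, "Threatened", np.nan],
    name="conservation_status",
)

n_missing = int(toy_status.isna().sum())   # number of NaN entries
share_missing = toy_status.isna().mean()   # fraction of NaN entries

# dropna=False keeps NaN as its own row in the tally
counts = toy_status.value_counts(dropna=False)

print(f"{n_missing} missing ({share_missing:.0%})")
print(counts)
```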

## Clean Data

The first task will be to clean and explore the data in the conservation_status column in species. 

conservation_status has 5 possible values.