# National Parks Service Biodiversity
Codecademy Portfolio Project by Leah Fulmer ([GitHub](https://github.com/leahmfulmer))<br>
With acknowledgements and gratitude to...

#### Project Objectives from Codecademy:

* Complete a project to add to your portfolio
* Use Jupyter Notebook to communicate findings
* Run an analysis on a set of data
* Become familiar with data analysis workflow

#### Table of Contents
[Section 1: Loading and Examining the Data](#data)<br>
[Section 2: Wrangling and Tidying the Data](#tidy)<br>
[Section 3: Questions for Analysis](#questions)<br>
[Section 4: Analysis of Observations](#observations)<br>
[Section 5: Analysis of Conservation Status](#conservation)<br>
[Section 6: Analysis of National Parks](#parks)<br>
[Section 7: Resting Place: Where Code Goes to Rest!](#rest)<br>

### Section 1: Loading and Examining the Data <a id="data"></a>

In [1]:
# Import modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Load data

observations = pd.read_csv("observations.csv")
species_info = pd.read_csv("species_info.csv")

In [3]:
# Examine observations

print("The dataset 'observations' contains {} rows and \
{} columns.".format(observations.shape[0], observations.shape[1]))

observations.head()
# observations.count()

The dataset 'observations' contains 23296 rows and 3 columns.


Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


In [4]:
# Examine species_info

print("The dataset 'species_info' contains {} rows and \
{} columns.".format(species_info.shape[0], species_info.shape[1]))

species_info.head()
# species_info.count()

The dataset 'species_info' contains 5824 rows and 4 columns.


Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,


### Section 2: Wrangling and Tidying the Data<a id="tidy"></a>

#### All Datasets

In [5]:
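# Lowercase every string value; numeric columns and NaN values are left untouched

def make_lower_case(df):
    for column in df.columns:
        # Skip numeric columns (np.float was removed from NumPy, so check with isinstance instead)
        if not pd.api.types.is_numeric_dtype(df[column]):
            df[column] = df[column].apply(lambda x: x.lower() if isinstance(x, str) else x)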

make_lower_case(observations)
make_lower_case(species_info)

#### Observations

In [6]:
# Drop rows that are duplicated across all three columns

observations.drop_duplicates(subset=['scientific_name', 'park_name', 'observations'], inplace=True)
print("The dataset 'observations' now contains {} rows and \
{} columns.".format(observations.shape[0], observations.shape[1]))

The dataset 'observations' now contains 23281 rows and 3 columns.


In [7]:
# Sum the observations for rows that share a scientific_name and park_name

observations = observations.groupby(['scientific_name', 'park_name'])['observations'].sum().reset_index()
print("The dataset 'observations' now contains {} rows and \
{} columns.".format(observations.shape[0], observations.shape[1]))

The dataset 'observations' now contains 22164 rows and 3 columns.
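
Before collapsing rows like this, it can help to look at the repeated (scientific_name, park_name) pairs directly. The following is a minimal sketch on a hypothetical toy dataframe (the names and counts in `toy` are invented for illustration and are not taken from `observations.csv`):

```python
import pandas as pd

# Hypothetical toy rows standing in for the real 'observations' dataset
toy = pd.DataFrame({
    "scientific_name": ["canis lupus", "canis lupus", "canis lupus", "abies bifolia"],
    "park_name": ["yosemite national park", "yosemite national park",
                  "bryce national park", "bryce national park"],
    "observations": [35, 44, 27, 109],
})

# keep=False marks every member of a repeated (species, park) pair
dupe_mask = toy.duplicated(subset=["scientific_name", "park_name"], keep=False)
print(toy[dupe_mask])

# Same aggregation as the cell above: one row per (species, park), counts summed
summed = toy.groupby(["scientific_name", "park_name"], as_index=False)["observations"].sum()
print(summed)
```

Inspecting `toy[dupe_mask]` first makes it easier to judge whether summing is the right way to combine the repeated rows.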


#### Species_info

In [8]:
# Count how many rows we expect in our tidy species_info dataset

duplicates = species_info[species_info.duplicated(subset=['scientific_name'])]
print("There are {} rows with the same scientific_name in 'species_info'."\
      .format(duplicates.shape[0]))
print("Therefore, we expect our tidy 'species_info' dataset to contain {} rows."\
      .format(species_info.shape[0] - duplicates.shape[0]))

There are 283 rows with the same scientific_name in 'species_info'.
Therefore, we expect our tidy 'species_info' dataset to contain 5541 rows.
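
The expected row count above is the number of distinct scientific names, which `nunique()` should report directly. A small sketch on hypothetical toy data (the rows in `toy` are invented for illustration):

```python
import pandas as pd

# Hypothetical toy rows standing in for 'species_info'
toy = pd.DataFrame({
    "scientific_name": ["bos bison", "canis lupus", "canis lupus", "canis lupus"],
    "common_names": ["american bison, bison", "gray wolf", "gray wolf, wolf", "gray wolf"],
})

# duplicated() flags every repeat after the first occurrence of a name
n_repeats = int(toy.duplicated(subset=["scientific_name"]).sum())
expected_rows = len(toy) - n_repeats

# The two counts should agree
n_unique = toy["scientific_name"].nunique()
print(n_repeats, expected_rows, n_unique)
```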


In [9]:
# Replace NaN values with "no data"

species_info.conservation_status = species_info.conservation_status.fillna("no data")

In [10]:
# Collapse duplicated scientific_name rows: join common_names, keep the first category
# and the last conservation_status

species_info = species_info.groupby('scientific_name', as_index=False)\
.agg({'common_names': lambda x: ', '.join(x), 'category': 'first', \
      'conservation_status': 'last'})
#       'conservation_status': lambda x: ', '.join(x)})

print("The dataset 'species_info' now contains {} rows and \
{} columns.".format(species_info.shape[0], species_info.shape[1]))
species_info.head()

The dataset 'species_info' now contains 5541 rows and 4 columns.


Unnamed: 0,scientific_name,common_names,category,conservation_status
0,abies bifolia,rocky mountain alpine fir,vascular plant,no data
1,abies concolor,"balsam fir, colorado fir, concolor fir, silver...",vascular plant,no data
2,abies fraseri,fraser fir,vascular plant,species of concern
3,abietinella abietina,abietinella moss,nonvascular plant,no data
4,abronia ammophila,"wyoming sand verbena, yellowstone sand verbena",vascular plant,species of concern
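
Because the aggregation keeps only the last conservation_status for each species, it may be worth checking whether any repeated species carries more than one status. The sketch below uses hypothetical toy data; whether the real dataset contains such conflicts is not shown here.

```python
import pandas as pd

# Hypothetical toy rows: one species listed with three different statuses
toy = pd.DataFrame({
    "scientific_name": ["canis lupus", "canis lupus", "canis lupus", "bos bison"],
    "conservation_status": ["endangered", "in recovery", "no data", "no data"],
})

# Count distinct statuses per species and keep the species with more than one
status_counts = toy.groupby("scientific_name")["conservation_status"].nunique()
conflicting = status_counts[status_counts > 1].index.tolist()
print(toy[toy["scientific_name"].isin(conflicting)])

# What 'last' would keep for each species in this toy example
kept = toy.groupby("scientific_name")["conservation_status"].last()
print(kept)
```

In this toy example, 'last' keeps "no data" for the conflicting species, so listing such species before aggregating shows whether the choice of 'last' matters.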


#### Merge

In [12]:
# Combine 'observations' and 'species_info'

combined = pd.merge(observations, species_info, on = ['scientific_name'], how = 'left')

print("The dataset 'combined' contains {} rows and \
{} columns.".format(combined.shape[0], combined.shape[1]))
combined.head()

The dataset 'combined' contains 22164 rows and 6 columns.


Unnamed: 0,scientific_name,park_name,observations,common_names,category,conservation_status
0,abies bifolia,bryce national park,109,rocky mountain alpine fir,vascular plant,no data
1,abies bifolia,great smoky mountains national park,72,rocky mountain alpine fir,vascular plant,no data
2,abies bifolia,yellowstone national park,215,rocky mountain alpine fir,vascular plant,no data
3,abies bifolia,yosemite national park,136,rocky mountain alpine fir,vascular plant,no data
4,abies concolor,bryce national park,83,"balsam fir, colorado fir, concolor fir, silver...",vascular plant,no data
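
A left merge keeps every observation row, so an unchanged row count alone does not show that every species found a match. One way to check, sketched here on hypothetical toy dataframes (`obs` and `info` are invented stand-ins), is to use the `validate` and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

# Hypothetical stand-ins for 'observations' and 'species_info'
obs = pd.DataFrame({
    "scientific_name": ["abies bifolia", "abies bifolia", "unknown species"],
    "park_name": ["bryce national park", "yosemite national park", "bryce national park"],
    "observations": [109, 136, 12],
})
info = pd.DataFrame({
    "scientific_name": ["abies bifolia"],
    "category": ["vascular plant"],
})

# validate raises MergeError if 'info' has repeated keys;
# indicator adds a '_merge' column recording where each row came from
checked = pd.merge(obs, info, on="scientific_name", how="left",
                   validate="many_to_one", indicator=True)

# Rows tagged 'left_only' had no matching species information
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```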


#### Separate by National Park

In [13]:
# Create a separate dataframe for each national park

def isolate(park_name):
    df = combined[combined.park_name == park_name]
    return df

bryce = isolate("bryce national park")
great_smoky_mountains = isolate("great smoky mountains national park")
yellowstone = isolate("yellowstone national park")
yosemite = isolate("yosemite national park")
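
As an alternative to one variable per park, the same split can be sketched with `groupby`, which avoids typing each park name by hand. The dataframe `toy` and the dictionary name `parks` are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Hypothetical toy rows standing in for 'combined'
toy = pd.DataFrame({
    "scientific_name": ["abies bifolia", "abies bifolia", "abies concolor"],
    "park_name": ["bryce national park", "yosemite national park", "bryce national park"],
    "observations": [109, 136, 83],
})

# One dataframe per park, keyed by park name
parks = {name: frame.reset_index(drop=True) for name, frame in toy.groupby("park_name")}
print(sorted(parks))
print(parks["bryce national park"])
```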

### Section 3: Questions for Analysis<a id="questions"></a>


**Observations:**
* Which categories or species are observed most?
* Does the number of observations correlate with any other property (e.g., category, conservation_status)?

**Conservation Status:**
* What types of conservation status exist?
* Which categories and species are most flagged for conservation?

**National Parks:**
* Does the number of observations vary across national parks?
* Which national park contains the most species flagged for conservation?
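
Several of the questions above reduce to grouped sums. The following is a rough sketch on hypothetical toy data (the values in `toy` are invented), not the analysis itself:

```python
import pandas as pd

# Hypothetical toy rows standing in for 'combined'
toy = pd.DataFrame({
    "park_name": ["bryce national park", "bryce national park",
                  "yosemite national park", "yosemite national park"],
    "category": ["mammal", "bird", "mammal", "bird"],
    "observations": [80, 120, 150, 90],
})

# Which categories are observed most?
by_category = toy.groupby("category")["observations"].sum().sort_values(ascending=False)
print(by_category)

# Does the number of observations vary across national parks?
by_park = toy.groupby("park_name")["observations"].sum().sort_values(ascending=False)
print(by_park)
```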

### Section 4: Analysis of Observations<a id="observations"></a>

### Section 5: Analysis of Conservation Status<a id="conservation"></a>

In [14]:
# What types of conservation status exist?

combined.conservation_status.unique()

array(['no data', 'species of concern', 'threatened', 'endangered',
       'in recovery'], dtype=object)
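
One possible way to count flagged species per category is sketched below on hypothetical toy data. It assumes that "no data" means a species is not flagged, and it first reduces the data to one row per species, since `combined` holds one row per species per park.

```python
import pandas as pd

# Hypothetical toy rows standing in for 'combined'
toy = pd.DataFrame({
    "scientific_name": ["canis lupus", "canis lupus", "abies fraseri", "bos bison"],
    "park_name": ["bryce national park", "yosemite national park",
                  "bryce national park", "bryce national park"],
    "category": ["mammal", "mammal", "vascular plant", "mammal"],
    "conservation_status": ["endangered", "endangered", "species of concern", "no data"],
})

# One row per species, so a species seen in several parks is counted once
species = toy.drop_duplicates(subset="scientific_name")

# Assumption: any status other than "no data" counts as flagged
flagged = species[species["conservation_status"] != "no data"]

# Number of flagged species per category and status
status_by_category = pd.crosstab(flagged["category"], flagged["conservation_status"])
print(status_by_category)
```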

In [None]:
# Which categories and species are most flagged for conservation?



### Section 6: Analysis of National Parks<a id="parks"></a>