# Problem
For this project, you will interpret data from the National Park Service about endangered species in different parks.

You will perform some data analysis on the conservation statuses of these species and investigate if there are any patterns or themes to the types of species that become endangered. During this project, you will analyze, clean up, and plot data as well as pose questions and seek to answer them in a meaningful way.

After you perform your analysis, you will share your findings about the National Park Service.

Here are a few questions that this project has sought to answer:

- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent and what is its distribution amongst parks?

### Analysis

In this section, descriptive statistics and data visualization techniques will be employed to understand the data better. Statistical inference will also be used to test if the observed values are statistically significant. Some of the key metrics that will be computed include: 

1. Distributions
1. Counts
1. Relationships between species
1. Conservation status of species
1. Observations of species in parks
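
As a rough illustration of the kind of significance test mentioned above, the sketch below computes a Pearson chi-squared statistic by hand for a 2x2 table. The counts and the group labels are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the data set. The `erfc` shortcut for the p-value is valid only for one degree of freedom, which is what a 2x2 table has.

```python
import math

import pandas as pd

# Hypothetical counts, for illustration only (not taken from the data set).
table = pd.DataFrame(
    {"protected": [10, 30], "not_protected": [90, 70]},
    index=["Group A", "Group B"],
)

observed = table.to_numpy(dtype=float)
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# Expected counts if group and protection status were independent.
expected = row_totals @ col_totals / observed.sum()

chi2 = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# For 1 degree of freedom only, the chi-squared tail probability is erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.5f}")
```

In practice `scipy.stats.chi2_contingency` is the usual tool for this. Note that it applies Yates' continuity correction to 2x2 tables by default, so its statistic would differ from the uncorrected value computed here.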

### Module imports
Importing pandas, numpy, matplotlib and seaborn

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Loading & inspecting data
Loading the species and observation data into pandas DataFrames.

In [2]:
species = pd.read_csv('species_info.csv')
print(f'The shape of the data frame is: {species.shape}')
species.head()

The shape of the data frame is: (5824, 4)


Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,


In [3]:
observations = pd.read_csv('observations.csv')
print(f'The shape of the data frame is: {observations.shape}')
observations.head()

The shape of the data frame is: (23296, 3)


Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


### Inspecting data
I like the thought of storing park names as a categorical variable to save space in `observations`, and will do the same with category and conservation status in `species`. Before doing this I need to make sure these columns have relatively few distinct values and are not already categorical.

In [4]:
print('types of data in species:')
print(species.dtypes)
print('types of data in observations:')
print(observations.dtypes)

types of data in species:
category               object
scientific_name        object
common_names           object
conservation_status    object
dtype: object
types of data in observations:
scientific_name    object
park_name          object
observations        int64
dtype: object


Now to check the number of distinct values in each of these columns.

In [5]:
print(f'The different numbers of species categories are: {species.category.nunique()}')
print(f'These categories are:  {species.category.unique()}')
print(f'The different numbers of species conservation statuses are: {species.conservation_status.nunique()}')
print(f'These conservation statuses are:  {species.conservation_status.unique()}')
print(f'The different numbers of observation park names are: {observations.park_name.nunique()}')
print(f'These park names are: {observations.park_name.unique()}')

The different numbers of species categories are: 7
These categories are:  ['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']
The different numbers of species conservation statuses are: 4
These conservation statuses are:  [nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']
The different numbers of observation park names are: 4
These park names are: ['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Bryce National Park' 'Yellowstone National Park']


Each of these columns has only a handful of distinct values, so they can be converted to categorical variables.

In [9]:
species['category'] = species['category'].astype('category')
species['conservation_status'] = species['conservation_status'].astype('category')
observations['park_name'] = observations['park_name'].astype('category')
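
To see why the conversion can save space, the sketch below compares the memory footprint of the same strings stored as `object` and as `category`. The Series is a toy stand-in for `observations['park_name']` (two park names repeated many times), not the real column, so the byte counts it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for observations['park_name']: few distinct strings, many rows.
parks = ["Bryce National Park", "Yosemite National Park"]
as_object = pd.Series(parks * 5000, dtype="object")
as_category = as_object.astype("category")

# deep=True includes the size of the Python strings themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

A categorical column stores each distinct string once plus a small integer code per row, which is why the saving grows with the number of repeated values.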

In [10]:
print('The new data types for the species DF are:')
print(species.info())
print('The new data types for the observations DF are:')
print(observations.info())

The new data types for the species DF are:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   category             5824 non-null   category
 1   scientific_name      5824 non-null   object  
 2   common_names         5824 non-null   object  
 3   conservation_status  191 non-null    category
dtypes: category(2), object(2)
memory usage: 103.0+ KB
None
The new data types for the observations DF are:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   scientific_name  23296 non-null  object  
 1   park_name        23296 non-null  category
 2   observations     23296 non-null  int64   
dtypes: category(1), int64(1), object(1)
memory usage: 387.1+ KB
None


The DataFrames should still look the same, but calling `head` on each converted column now lists its categories.

In [11]:
print(species['category'].head())
print(species['conservation_status'].head())
print(observations['park_name'].head())

0    Mammal
1    Mammal
2    Mammal
3    Mammal
4    Mammal
Name: category, dtype: category
Categories (7, object): ['Amphibian', 'Bird', 'Fish', 'Mammal', 'Nonvascular Plant', 'Reptile', 'Vascular Plant']
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: conservation_status, dtype: category
Categories (4, object): ['Endangered', 'In Recovery', 'Species of Concern', 'Threatened']
0    Great Smoky Mountains National Park
1    Great Smoky Mountains National Park
2                 Yosemite National Park
3                    Bryce National Park
4    Great Smoky Mountains National Park
Name: park_name, dtype: category
Categories (4, object): ['Bryce National Park', 'Great Smoky Mountains National Park', 'Yellowstone National Park', 'Yosemite National Park']


## Species EDA
Now it is time to look more specifically at the species DF.

In [18]:
# Number of unique species by scientific name and by common name
print(f"There are {species['scientific_name'].nunique()} species with scientific names.")
print(f"But there are {species['common_names'].nunique()} species with common names.")

There are 5541 species with scientific names.
But there are 5504 species with common names.


In [19]:
# Summary statistics: 'unique' counts below 5824 indicate repeated names
species.describe()

Unnamed: 0,category,scientific_name,common_names,conservation_status
count,5824,5824,5824,191
unique,7,5541,5504,4
top,Vascular Plant,Castor canadensis,Brachythecium Moss,Species of Concern
freq,4470,3,7,161


In [29]:
# How many of the duplicated rows have a conservation status (count() skips NaN)
species[species['scientific_name'].duplicated()]['conservation_status'].count()

13
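
The cell above counts only the repeated rows that carry a non-null conservation status. The sketch below separates that figure from the number of repeated rows themselves, using a toy frame whose rows and names are hypothetical; only the column names mirror `species`.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names mirror `species`.
toy = pd.DataFrame({
    "scientific_name": ["Genus alpha", "Genus alpha", "Genus alpha", "Genus beta"],
    "common_names": ["Alpha", "Common Alpha", "Greater Alpha", "Beta"],
    "conservation_status": ["Endangered", "In Recovery", None, None],
})

# True for the second and later occurrences of a name (keep='first' is the default).
repeats = toy["scientific_name"].duplicated()
# True for every row whose name appears more than once, including the first occurrence.
all_copies = toy["scientific_name"].duplicated(keep=False)

n_repeat_rows = int(repeats.sum())
n_rows_involved = int(all_copies.sum())
# count() ignores NaN, so this is the number of repeated rows that have a status.
n_repeats_with_status = int(toy.loc[repeats, "conservation_status"].count())

print(n_repeat_rows, n_rows_involved, n_repeats_with_status)
```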

In [33]:
duplicated_species = species[species['scientific_name'].duplicated()]
duplicated_species.sort_values(['scientific_name'])

Unnamed: 0,category,scientific_name,common_names,conservation_status
5553,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",
5554,Vascular Plant,Agrostis gigantea,"Black Bent, Redtop, Water Bentgrass",
4178,Vascular Plant,Agrostis mertensii,"Arctic Bentgrass, Northern Bentgrass",
5556,Vascular Plant,Agrostis scabra,"Rough Bent, Rough Bentgrass, Ticklegrass",
4182,Vascular Plant,Agrostis stolonifera,"Carpet Bentgrass, Creeping Bent, Creeping Bent...",
...,...,...,...,...
3231,Bird,Vireo solitarius,Blue-Headed Vireo,
5640,Vascular Plant,Vulpia bromoides,"Brome Fescue, Brome Six-Weeks Grass, Desert Fe...",
5643,Vascular Plant,Vulpia myuros,"Foxtail Fescue, Rattail Fescue, Rat-Tail Fescu...",
4290,Vascular Plant,Vulpia octoflora,"Eight-Flower Six-Weeks Grass, Pullout Grass, S...",


As there seem to be many common names recorded for the same species, I will leave the duplicates here and focus on grouping species by conservation status.

In [62]:
species.groupby(['conservation_status']).size().sum()

191

Of the 5,824 rows in the species data, only 191 have a conservation status.
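
Because `groupby` and `value_counts` skip missing values by default, the rows without a status are invisible in the count above. The sketch below shows one way to keep them visible on a categorical column. The Series is a toy stand-in for `species['conservation_status']`, and the label `"No Status"` is a hypothetical name introduced here, not something defined in the data.

```python
import pandas as pd

# Toy categorical column with hypothetical values.
status = pd.Series(
    ["Endangered", None, None, "Threatened", None],
    dtype="category",
)

# dropna=False keeps the missing values as their own row in the counts.
counts_with_nan = status.value_counts(dropna=False)

# A categorical column only accepts values that are registered categories,
# so the new label has to be added before fillna can use it.
filled = status.cat.add_categories("No Status").fillna("No Status")
filled_counts = filled.value_counts()

print(counts_with_nan)
print(filled_counts)
```

Calling `fillna` with a label that is not yet a category raises an error, which is why `cat.add_categories` comes first.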

## Testing merging data
As both data sets have scientific names, I want to see how I could merge these together.

In [12]:
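# Join on the shared scientific_name column (stated explicitly rather than inferred)
df_merge = pd.merge(species, observations, on='scientific_name', how='inner')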
print(f'The shape of the data frame is: {df_merge.shape}')
df_merge.head()

The shape of the data frame is: (25632, 6)


Unnamed: 0,category,scientific_name,common_names,conservation_status,park_name,observations
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,,Bryce National Park,130
1,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,,Yellowstone National Park,270
2,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,,Great Smoky Mountains National Park,98
3,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,,Yosemite National Park,117
4,Mammal,Bos bison,"American Bison, Bison",,Yosemite National Park,128


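The merge runs, and a combined frame can lead to more efficient data analysis. Note, however, that the result has 25,632 rows, more than the 23,296 rows in `observations`: scientific names that are duplicated in `species` match the same observation rows more than once, so the duplicates should be resolved before this merged frame is used for counting.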
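
The sketch below shows how repeated keys on one side of a merge inflate the row count, and one possible way to guard against it. The frames are toy data with hypothetical names and values, and keeping the first row per scientific name is an assumption made for illustration, not a choice the notebook has made.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the real frames.
species_toy = pd.DataFrame({
    "scientific_name": ["Genus alpha", "Genus alpha", "Genus beta"],
    "common_names": ["Alpha", "Common Alpha", "Beta"],
})
observations_toy = pd.DataFrame({
    "scientific_name": ["Genus alpha", "Genus beta"],
    "park_name": ["Park One", "Park One"],
    "observations": [10, 20],
})

# The duplicated name matches its observation row twice, giving 3 rows from 2 observations.
naive = pd.merge(species_toy, observations_toy, on="scientific_name", how="inner")

# One possible policy (an assumption): keep the first row per scientific name.
species_unique = species_toy.drop_duplicates(subset="scientific_name", keep="first")
merged = pd.merge(
    observations_toy,
    species_unique,
    on="scientific_name",
    how="inner",
    validate="many_to_one",  # raises MergeError if species_unique still has repeated names
)

print(len(naive), len(merged))
```

With `validate="many_to_one"`, pandas checks that the key is unique in the right-hand frame, so an unexpected duplicate fails loudly instead of silently adding rows.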