### Table of Contents

* [Goals](#Goals)
* [Data](#Data)
    * [Loading the Data](#section1_1)
    * [Data Information](#section1_2)
* [Data Cleaning](#cleaning)
* [Exploratory Data Analysis](#EDA)
* [Conclusion](#conclusion)

### Goals <a class="anchor" id="Goals"></a>

This notebook contains an analysis of National Park data. The goals for this project were to:
* Get acquainted with the data
* Clean the data so it is ready for analysis
* Develop some questions for analysis
* Analyze variables within the data to find patterns and gain insights into these questions

### Data <a class="anchor" id="Data"></a>

The data for this project was downloaded from Kaggle:

https://www.kaggle.com/datasets/nationalparkservice/park-biodiversity?select=parks.csv

Some code inspiration for this analysis was sourced from [this notebook](https://www.kaggle.com/dimitriirfan/market-eda).

#### Loading the Data <a class="anchor" id="section1_1"></a>
First, the necessary libraries are imported. pandas then reads `parks.csv` and `species.csv` into DataFrames, and `head()` previews the first five rows of each.

In [1]:
# sets up matplotlib with interactive features
%matplotlib notebook
import csv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import re

In [2]:
# see all columns
pd.set_option('display.max_columns', None)
parks_data = pd.read_csv("C:/Users/betsy/OneDrive/Desktop/data-analyst-project/parks.csv")
parks_data.head()

| | Park Code | Park Name | State | Acres | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | ACAD | Acadia National Park | ME | 47390 | 44.35 | -68.21 |
| 1 | ARCH | Arches National Park | UT | 76519 | 38.68 | -109.57 |
| 2 | BADL | Badlands National Park | SD | 242756 | 43.75 | -102.5 |
| 3 | BIBE | Big Bend National Park | TX | 801163 | 29.25 | -103.25 |
| 4 | BISC | Biscayne National Park | FL | 172924 | 25.65 | -80.08 |


In [3]:
# low_memory=False parses the whole file in one pass, avoiding mixed-dtype guesses
species_data = pd.read_csv("C:/Users/betsy/OneDrive/Desktop/data-analyst-project/species.csv", low_memory=False)
species_data.head()

| | Species ID | Park Name | Category | Order | Family | Scientific Name | Common Names | Record Status | Occurrence | Nativeness | Abundance | Seasonality | Conservation Status | Unnamed: 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACAD-1000 | Acadia National Park | Mammal | Artiodactyla | Cervidae | Alces alces | Moose | Approved | Present | Native | Rare | Resident | | |
| 1 | ACAD-1001 | Acadia National Park | Mammal | Artiodactyla | Cervidae | Odocoileus virginianus | Northern White-Tailed Deer, Virginia Deer, Whi... | Approved | Present | Native | Abundant | | | |
| 2 | ACAD-1002 | Acadia National Park | Mammal | Carnivora | Canidae | Canis latrans | Coyote, Eastern Coyote | Approved | Present | Not Native | Common | | Species of Concern | |
| 3 | ACAD-1003 | Acadia National Park | Mammal | Carnivora | Canidae | Canis lupus | Eastern Timber Wolf, Gray Wolf, Timber Wolf | Approved | Not Confirmed | Native | | | Endangered | |
| 4 | ACAD-1004 | Acadia National Park | Mammal | Carnivora | Canidae | Vulpes vulpes | Black Fox, Cross Fox, Eastern Red Fox, Fox, Re... | Approved | Present | Unknown | Common | Breeder | | |
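
The two loading cells above use an absolute path that only exists on one machine. As a minimal sketch of a more portable alternative, the helper below resolves file names against a single folder. The `data` directory and the `load_csv` name are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: a "data" folder next to the notebook (an assumption).
DATA_DIR = Path("data")


def load_csv(name, data_dir=DATA_DIR, **read_kwargs):
    """Read one CSV from data_dir, failing early with a clear message if it is missing."""
    path = Path(data_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"Expected {name} in {Path(data_dir).resolve()}")
    return pd.read_csv(path, **read_kwargs)


# Usage, mirroring the cells above (left commented so the sketch runs without the files):
# parks_data = load_csv("parks.csv")
# species_data = load_csv("species.csv", low_memory=False)
```

Keeping the folder in one constant means only `DATA_DIR` has to change when the project moves to another machine.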


#### Data Information <a class="anchor" id="section1_2"></a>

Some immediate insights for the `parks_data` DataFrame:
* There are 6 columns and 56 rows.
* The data types include object, int64, and float64.
* There are no null values in any column.
* Little to no cleaning is needed to make this DataFrame usable.

Some immediate insights for the `species_data` DataFrame:
* There are 14 columns and 119248 rows.
* All columns have the data type object.
* The `Order`, `Family`, `Occurrence`, `Nativeness`, `Abundance`, `Seasonality`, and `Conservation Status` columns all have missing data and will need to be cleaned.
* The `Unnamed: 13` column is almost entirely empty (only 5 non-null values).

In [9]:
parks_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Park Code  56 non-null     object 
 1   Park Name  56 non-null     object 
 2   State      56 non-null     object 
 3   Acres      56 non-null     int64  
 4   Latitude   56 non-null     float64
 5   Longitude  56 non-null     float64
dtypes: float64(2), int64(1), object(3)
memory usage: 2.8+ KB


In [14]:
# count missing values per column in the parks_data DataFrame
parks_data.isnull().sum()

Park Code    0
Park Name    0
State        0
Acres        0
Latitude     0
Longitude    0
dtype: int64

In [15]:
parks_data.describe()

| | Acres | Latitude | Longitude |
|---|---|---|---|
| count | 56.0 | 56.0 | 56.0 |
| mean | 927929.1 | 41.233929 | -113.234821 |
| std | 1709258.0 | 10.908831 | 22.440287 |
| min | 5550.0 | 19.38 | -159.28 |
| 25% | 69010.5 | 35.5275 | -121.57 |
| 50% | 238764.5 | 38.55 | -110.985 |
| 75% | 817360.2 | 46.88 | -103.4 |
| max | 8323148.0 | 67.78 | -68.21 |


In [16]:
species_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119248 entries, 0 to 119247
Data columns (total 14 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Species ID           119248 non-null  object
 1   Park Name            119248 non-null  object
 2   Category             119248 non-null  object
 3   Order                117776 non-null  object
 4   Family               117736 non-null  object
 5   Scientific Name      119248 non-null  object
 6   Common Names         119248 non-null  object
 7   Record Status        119248 non-null  object
 8   Occurrence           99106 non-null   object
 9   Nativeness           94203 non-null   object
 10  Abundance            76306 non-null   object
 11  Seasonality          20157 non-null   object
 12  Conservation Status  4718 non-null    object
 13  Unnamed: 13          5 non-null       object
dtypes: object(14)
memory usage: 12.7+ MB


In [17]:
# count missing values per column in the species_data DataFrame
species_data.isnull().sum()

Species ID                  0
Park Name                   0
Category                    0
Order                    1472
Family                   1512
Scientific Name             0
Common Names                0
Record Status               0
Occurrence              20142
Nativeness              25045
Abundance               42942
Seasonality             99091
Conservation Status    114530
Unnamed: 13            119243
dtype: int64
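
Raw counts are easier to compare as shares of the 119248 rows: `Conservation Status` is missing in about 96% of rows and `Seasonality` in about 83%. The sketch below shows one way to compute both figures together. The `missing_summary` helper and the toy frame are illustrative assumptions, not code from the original notebook.

```python
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, most-missing first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for species_data (values are illustrative only).
toy = pd.DataFrame({
    "Species ID": ["A-1", "A-2", "A-3", "A-4"],
    "Seasonality": ["Resident", None, None, None],
    "Abundance": ["Rare", "Common", None, "Unknown"],
})
print(missing_summary(toy))
```

Calling `missing_summary(species_data)` would produce the same kind of table for the real data, which can help decide which columns to fill and which to drop during cleaning.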

In [18]:
species_data.describe()

| | Species ID | Park Name | Category | Order | Family | Scientific Name | Common Names | Record Status | Occurrence | Nativeness | Abundance | Seasonality | Conservation Status | Unnamed: 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 119248 | 119248 | 119248 | 117776 | 117736 | 119248 | 119248 | 119248 | 99106 | 94203 | 76306 | 20157 | 4718 | 5 |
| unique | 119248 | 56 | 14 | 554 | 2332 | 46022 | 35826 | 54 | 7 | 5 | 8 | 24 | 11 | 3 |
| top | ACAD-1000 | Great Smoky Mountains National Park | Vascular Plant | Poales | Asteraceae | Falco peregrinus | | Approved | Present | Native | Unknown | Breeder | Species of Concern | Threatened |
| freq | 1 | 6623 | 65221 | 11453 | 8843 | 56 | 27147 | 86254 | 83278 | 75950 | 28119 | 12214 | 3843 | 2 |
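
The summary above shows only 5 non-null values in `Unnamed: 13`, and its most frequent value, `Threatened`, looks like a conservation status. One possible explanation, stated here as an assumption rather than a finding, is that a few rows contain an extra field that shifts their values one column to the right. The sketch below pulls such rows out for manual inspection. The `rows_with_stray_values` helper and the toy frame are hypothetical.

```python
import pandas as pd


def rows_with_stray_values(df, column="Unnamed: 13"):
    """Return the rows where `column` holds a value (empty frame if the column is absent)."""
    if column not in df.columns:
        return df.iloc[0:0]
    return df[df[column].notna()]


# Toy frame standing in for species_data (values are illustrative only).
toy = pd.DataFrame({
    "Species ID": ["X-1", "X-2", "X-3"],
    "Conservation Status": [None, "Endangered", None],
    "Unnamed: 13": [None, None, "Threatened"],
})
stray = rows_with_stray_values(toy)
print(stray)
```

Running `rows_with_stray_values(species_data)` on the real data would return the five affected rows, which is few enough to review by eye before deciding how to handle the column.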


### Data Cleaning <a class="anchor" id="cleaning"></a>