# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before starting coding, provide the link to your dataset below.

My dataset:

Import the necessary libraries and create your dataframe(s).

In [1]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sp_df = pd.read_csv("species.csv")
pa_df = pd.read_csv("parks.csv")

# checking to see if nulls are present in both dataframes
sp_df.info()
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
pa_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119248 entries, 0 to 119247
Data columns (total 14 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Species ID           119248 non-null  object
 1   Park Name            119248 non-null  object
 2   Category             119248 non-null  object
 3   Order                117776 non-null  object
 4   Family               117736 non-null  object
 5   Scientific Name      119248 non-null  object
 6   Common Names         119248 non-null  object
 7   Record Status        119248 non-null  object
 8   Occurrence           99106 non-null   object
 9   Nativeness           94203 non-null   object
 10  Abundance            76306 non-null   object
 11  Seasonality          20157 non-null   object
 12  Conservation Status  4718 non-null    object
 13  Unnamed: 13          5 non-null       object
dtypes: object(14)
memory usage: 12.7+ MB
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [2]:
# check the total number of missing values in each column for sp_df
num_missing = sp_df.isnull().sum()
num_missing
# 'Order' and 'Family' each have about 1,500 missing values, while 'Scientific Name' and 'Common Names' have none, so I will identify species by the latter two columns
# I need to remove the following columns from sp_df due to the large amount of missing data: Occurrence, Nativeness, Abundance, Seasonality, Conservation Status, Unnamed: 13

Species ID                  0
Park Name                   0
Category                    0
Order                    1472
Family                   1512
Scientific Name             0
Common Names                0
Record Status               0
Occurrence              20142
Nativeness              25045
Abundance               42942
Seasonality             99091
Conservation Status    114530
Unnamed: 13            119243
dtype: int64
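
The raw counts above are easier to compare as percentages of the 119,248 rows: Seasonality is about 83% missing, while Occurrence is only about 17% missing. The sketch below shows one way to compute that share per column. It uses a small made-up dataframe (`toy_df`) rather than the real `sp_df`, and the 50% cutoff is an assumption chosen for illustration, not a rule from this checkpoint.

```python
import pandas as pd

# Toy stand-in for sp_df (hypothetical values, not the real dataset)
toy_df = pd.DataFrame({
    "Species ID": ["A-1", "A-2", "A-3", "A-4"],
    "Order": ["Carnivora", None, "Rodentia", "Rodentia"],
    "Seasonality": [None, None, None, "Winter"],
})

# isnull().mean() gives the fraction of missing values in each column
pct_missing = (toy_df.isnull().mean() * 100).round(1)
print(pct_missing)

# Columns above the assumed 50% cutoff are candidates for dropping
threshold = 50
cols_to_drop = pct_missing[pct_missing > threshold].index.tolist()
print(cols_to_drop)
```

A cutoff like this would not flag every column listed in the comment above (Occurrence, Nativeness, and Abundance are each under 40% missing), so relevance to the research question still has to be judged separately.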

In [3]:
# check the total number of missing values in each column for pa_df
num_missing = pa_df.isnull().sum()
num_missing
# This tells me there are no missing values in the Parks dataframe


Park Code    0
Park Name    0
State        0
Acres        0
Latitude     0
Longitude    0
dtype: int64

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [4]:
# According to the outputs below, every sp_df column is an object dtype and the only int column is 'Acres' in pa_df, so I will check only that column for outliers
sp_df.info()
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
pa_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119248 entries, 0 to 119247
Data columns (total 14 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Species ID           119248 non-null  object
 1   Park Name            119248 non-null  object
 2   Category             119248 non-null  object
 3   Order                117776 non-null  object
 4   Family               117736 non-null  object
 5   Scientific Name      119248 non-null  object
 6   Common Names         119248 non-null  object
 7   Record Status        119248 non-null  object
 8   Occurrence           99106 non-null   object
 9   Nativeness           94203 non-null   object
 10  Abundance            76306 non-null   object
 11  Seasonality          20157 non-null   object
 12  Conservation Status  4718 non-null    object
 13  Unnamed: 13          5 non-null       object
dtypes: object(14)
memory usage: 12.7+ MB
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In [5]:
pa_df['Acres'].nlargest(56)

# This shows the acreage of all 56 National Parks in the dataset.
# After double-checking the largest and smallest parks against a Google search, their acreage checks out.
# So there are no outliers.

52    8323148
18    7523898
15    4740912
32    3674530
14    3372402
20    3224840
35    2619733
53    2219791
34    1750717
17    1508538
22    1217403
19    1013572
41     922651
47     865952
3      801163
31     789745
54     761266
33     669983
30     571790
24     521490
40     504781
7      337598
28     323431
25     309995
45     265828
10     249561
2      242756
8      241904
39     235625
50     218200
48     199045
12     183224
4      172924
55     146598
44     112512
36     106372
42      93533
46      91440
26      86416
21      77180
1       76519
49      70447
16      64701
37      52830
38      52122
0       47390
9       46766
23      42984
6       35835
13      32950
5       32950
27      29094
51      28295
43      26606
11      26546
29       5550
Name: Acres, dtype: int64
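
As a complement to checking the values by eye, a common rule of thumb flags anything more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to six acreage values copied from the output above; it is an illustration on a small subset, not the check that was actually run in this notebook.

```python
import pandas as pd

# Six acreage values copied from the pa_df['Acres'] output above
acres = pd.Series([5550, 26546, 47390, 76519, 183224, 8323148], name="Acres")

# Quartiles and interquartile range
q1 = acres.quantile(0.25)
q3 = acres.quantile(0.75)
iqr = q3 - q1

# Standard 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are statistical outliers
outliers = acres[(acres < lower) | (acres > upper)]
print(outliers)
```

A value flagged this way is not necessarily an error. The largest acreage here is flagged statistically, but it was confirmed against outside sources, so it stays in the data.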

## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [6]:
# I am dropping these columns from sp_df because they have a large amount of missing data and are irrelevant to what I am researching: Occurrence, Nativeness, Abundance, Seasonality, Conservation Status, Unnamed: 13
sp_df.drop(['Occurrence', 'Nativeness', 'Abundance', 'Seasonality', 'Conservation Status', 'Unnamed: 13'], axis=1, inplace=True)
# double-checking to make sure these columns have been dropped
sp_df

Unnamed: 0,Species ID,Park Name,Category,Order,Family,Scientific Name,Common Names,Record Status
0,ACAD-1000,Acadia National Park,Mammal,Artiodactyla,Cervidae,Alces alces,Moose,Approved
1,ACAD-1001,Acadia National Park,Mammal,Artiodactyla,Cervidae,Odocoileus virginianus,"Northern White-Tailed Deer, Virginia Deer, Whi...",Approved
2,ACAD-1002,Acadia National Park,Mammal,Carnivora,Canidae,Canis latrans,"Coyote, Eastern Coyote",Approved
3,ACAD-1003,Acadia National Park,Mammal,Carnivora,Canidae,Canis lupus,"Eastern Timber Wolf, Gray Wolf, Timber Wolf",Approved
4,ACAD-1004,Acadia National Park,Mammal,Carnivora,Canidae,Vulpes vulpes,"Black Fox, Cross Fox, Eastern Red Fox, Fox, Re...",Approved
...,...,...,...,...,...,...,...,...
119243,ZION-2791,Zion National Park,Vascular Plant,Solanales,Solanaceae,Solanum triflorum,Cut-Leaf Nightshade,Approved
119244,ZION-2792,Zion National Park,Vascular Plant,Vitales,Vitaceae,Vitis arizonica,Canyon Grape,Approved
119245,ZION-2793,Zion National Park,Vascular Plant,Vitales,Vitaceae,Vitis vinifera,Wine Grape,Approved
119246,ZION-2794,Zion National Park,Vascular Plant,Zygophyllales,Zygophyllaceae,Larrea tridentata,Creosote Bush,Approved
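
One thing to watch with an in-place drop is that re-running the cell raises a `KeyError`, because the columns are already gone. A possible alternative, sketched here on a made-up one-row dataframe (`toy_df` is hypothetical), is to pass `errors="ignore"` and assign the result to a new name so the cell can be run repeatedly.

```python
import pandas as pd

# Toy stand-in for sp_df (hypothetical values)
toy_df = pd.DataFrame({
    "Species ID": ["A-1"],
    "Occurrence": [None],
    "Unnamed: 13": [None],
})

cols = ["Occurrence", "Unnamed: 13"]

# errors="ignore" skips any column that is not present
trimmed = toy_df.drop(columns=cols, errors="ignore")

# Dropping the same columns a second time does not raise an error
trimmed_again = trimmed.drop(columns=cols, errors="ignore")
print(trimmed_again.columns.tolist())
```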


In [7]:
# Checking for duplicates in the sp_df

sp_df[sp_df.duplicated()]

# This shows there are no duplicates

Unnamed: 0,Species ID,Park Name,Category,Order,Family,Scientific Name,Common Names,Record Status


In [8]:
# Checking for duplicates in the pa_df

pa_df[pa_df.duplicated()]

# This shows there are no duplicates

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude
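
Calling `duplicated()` with no arguments only flags rows that match in every column. Rows that repeat a key such as 'Species ID' but differ elsewhere would slip through. The sketch below shows the difference on a made-up dataframe; the repeated ID is invented for illustration and is not something found in the real data.

```python
import pandas as pd

# Toy data with an invented repeated Species ID (hypothetical)
toy_df = pd.DataFrame({
    "Species ID": ["ACAD-1000", "ACAD-1001", "ACAD-1001"],
    "Park Name": ["Acadia National Park"] * 3,
    "Common Names": ["Moose", "Virginia Deer", "White-Tailed Deer"],
})

# Full-row check: no two rows are identical in every column
full_dupes = toy_df[toy_df.duplicated()]

# Key-based check: keep=False marks every row that shares a Species ID
key_dupes = toy_df[toy_df.duplicated(subset=["Species ID"], keep=False)]

print(len(full_dupes), len(key_dupes))
```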


## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [12]:
sp_df.sample(50)
# View a sample of the values; nothing looks inconsistent in sp_df

Unnamed: 0,Species ID,Park Name,Category,Order,Family,Scientific Name,Common Names,Record Status
103020,SHEN-3086,Shenandoah National Park,Vascular Plant,Malpighiales,Salicaceae,Salix sericea,Silky Willow,Approved
91544,REDW-4904,Redwood National Park,Insect,Coleoptera,Hydroscaphidae,Hydroscapha,,In Review
41441,GRBA-3119,Great Basin National Park,Slug/Snail,Stylommatophora,Oreohelicidae,Oreohelix nevadensis,Schell Creek Mountainsnail,In Review
103864,SHEN-3930,Shenandoah National Park,Vascular Plant,Rosales,Rosaceae,Aruncus dioicus,Goat's-Beard,Approved
97608,SAGU-1504,Saguaro National Park,Vascular Plant,Asterales,Asteraceae,Baccharis sarothroides,Desertbroom,Approved
50045,GRSM-5496,Great Smoky Mountains National Park,Insect,Hemiptera,Membracidae,Telamona decorata,,In Review
25080,DENA-1315,Denali National Park and Preserve,Vascular Plant,Asterales,Asteraceae,Artemisia laciniata,Siberian Wormwood,Approved
79171,MORA-1851,Mount Rainier National Park,Vascular Plant,Liliales,Liliaceae,Lilium columbianum,Tiger Lily,Approved
9565,BRCA-1318,Bryce Canyon National Park,Vascular Plant,,,Chrysothamnus parryi var. affinis,Parry's Rabbitbrush,Approved
70639,KOVA-1890,Kobuk Valley National Park,Fungi,Lecanorales,Stereocaulaceae,Stereocaulon species 2,,In Review


In [13]:
pa_df.sample(50)
# View a sample of the values; nothing looks inconsistent in pa_df

Unnamed: 0,Park Code,Park Name,State,Acres,Latitude,Longitude
24,GRSM,Great Smoky Mountains National Park,"TN, NC",521490,35.68,-83.53
41,OLYM,Olympic National Park,WA,922651,47.97,-123.5
48,SHEN,Shenandoah National Park,VA,199045,38.53,-78.35
8,CARE,Capitol Reef National Park,UT,241904,38.2,-111.17
32,KATM,Katmai National Park and Preserve,AK,3674530,58.5,-155.0
12,CRLA,Crater Lake National Park,OR,183224,42.94,-122.1
44,REDW,Redwood National Park,CA,112512,41.3,-124.0
9,CAVE,Carlsbad Caverns National Park,NM,46766,32.17,-104.44
36,LAVO,Lassen Volcanic National Park,CA,106372,40.49,-121.51
54,YOSE,Yosemite National Park,CA,761266,37.83,-119.5
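
A random sample can miss rare inconsistencies, so it may also help to list the distinct values of a categorical column such as 'Category'. The sketch below uses made-up values with stray spaces and mixed capitalization to show how `value_counts()` exposes them and how they could be normalized. The messy values are hypothetical, and title case may not suit every column.

```python
import pandas as pd

# Toy column with invented formatting problems (hypothetical)
toy_df = pd.DataFrame({
    "Category": ["Mammal", "mammal ", "Bird", "Bird", " Bird"],
})

# Before cleaning: variants of the same label are counted separately
print(toy_df["Category"].value_counts())

# Strip surrounding whitespace and standardize capitalization
cleaned = toy_df["Category"].str.strip().str.title()
print(cleaned.value_counts())
```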


## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?

Species Dataframe: Missing Data - YES, Irregular Data - NO, Unnecessary Data - YES, Inconsistent Data - NO
  
Parks Dataframe: Missing Data - YES (no null values, but 7 of the 63 National Parks are absent from the dataset), Irregular Data - NO, Unnecessary Data - NO, Inconsistent Data - NO
  
2. Did the process of cleaning your data give you new insights into your dataset?

Yes, I realized that I am missing 7 National Parks. After googling the total number, I found that there are 63, and I only have data on 56.

3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations? 

After looking closer at the data, I would like to check whether there is a correlation between a park's distance from the equator and the number of species that call it home.
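
One possible way to approach the equator question is to count species per park, join the counts to the parks table on 'Park Name', and correlate the counts with the absolute value of 'Latitude' as a stand-in for distance from the equator. The sketch below runs on made-up parks and species (`toy_parks`, `toy_species`, and all their values are hypothetical), so the result says nothing about the real data.

```python
import pandas as pd

# Toy stand-ins for pa_df and sp_df (hypothetical values)
toy_parks = pd.DataFrame({
    "Park Name": ["Park A", "Park B", "Park C"],
    "Latitude": [25.0, 40.0, 60.0],
})
toy_species = pd.DataFrame({
    "Park Name": ["Park A"] * 5 + ["Park B"] * 3 + ["Park C"] * 1,
    "Scientific Name": [f"Species {i}" for i in range(9)],
})

# Count distinct species recorded in each park
species_counts = (
    toy_species.groupby("Park Name")["Scientific Name"]
    .nunique()
    .rename("Species Count")
    .reset_index()
)

# Attach the counts to the parks table
merged = toy_parks.merge(species_counts, on="Park Name", how="left")

# Absolute latitude treats northern and southern parks the same way
merged["Abs Latitude"] = merged["Latitude"].abs()

# Pearson correlation between distance from the equator and species count
corr = merged["Abs Latitude"].corr(merged["Species Count"])
print(round(corr, 2))
```

Park size and survey effort also affect how many species are recorded, so a correlation on the real data would be a starting point rather than a conclusion.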