# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before you start coding, provide the link to your dataset below.

My dataset:

Import the necessary libraries and create your dataframe(s).

In [1]:
#import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#load each dataset into its own dataframe
colony_df = pd.read_csv('colony.csv')
honey_df = pd.read_csv('honey.csv')
stressors_df = pd.read_csv('stressors.csv')

## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [2]:
#From my EDA I know that missing data in these dataframes is recorded as the string (NA), so isnull() will not detect it.
#However, if I were going to look for true null values, I would use this code for each of my dataframes.
colony_df.isnull().sum()

State                 0
Starting Colonies     0
Maximum Colonies      0
Lost Colonies         0
Percent Loss          0
Added Colonies        0
Renovated Colonies    0
Percent Renovated     0
Year                  0
dtype: int64
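
The zero counts above are expected, because `(NA)` is an ordinary string as far as pandas is concerned. One possible alternative, shown here only as a sketch on a small made-up sample (the three rows and the name `sample_csv` are assumptions, not the real `colony.csv`), is to tell `read_csv` to treat `(NA)` as missing at load time so that `isnull()` can see it:

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the layout of colony.csv
sample_csv = io.StringIO(
    "State,Starting Colonies,Year\n"
    "Alabama,7000,2015-Q1\n"
    "Alabama,(NA),2019-Q2\n"
    "Vermont,5500,2015-Q1\n"
)

# Treat the literal string "(NA)" as a missing value while parsing
sample_df = pd.read_csv(sample_csv, na_values=["(NA)"])

# isnull() now reports the placeholder as a real missing value
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)
```

The rest of this notebook keeps the placeholders as strings, so this is an option to consider rather than what was done here.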

In [3]:
#If I want to know where the (NA) values are, I can filter for rows that contain (NA) in any column
na_colony = colony_df[colony_df.eq('(NA)').any(axis=1)]
na_colony.head()

Unnamed: 0,State,Starting Colonies,Maximum Colonies,Lost Colonies,Percent Loss,Added Colonies,Renovated Colonies,Percent Renovated,Year
799,Alabama,(NA),(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
800,Arizona,(NA),(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
801,Arkansas,(NA),(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
802,California,(NA),(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
803,Colorado,(NA),(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2


In [4]:
#count() on the filtered frame gives the number of rows that contain at least one (NA), repeated for each column.
na_colony.count()

State                 47
Starting Colonies     47
Maximum Colonies      47
Lost Colonies         47
Percent Loss          47
Added Colonies        47
Renovated Colonies    47
Percent Renovated     47
Year                  47
dtype: int64
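
Note that `count()` on the filtered frame reports how many rows were selected, which is why `State` and `Year` also show 47. If a true per-column tally of the placeholder is wanted, summing the boolean mask is one way to get it. The sketch below uses a small made-up frame (`toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with "(NA)" placeholders scattered across columns
toy_df = pd.DataFrame({
    "State": ["Alabama", "Arizona", "Vermont"],
    "Starting Colonies": ["7000", "(NA)", "5500"],
    "Percent Renovated": ["(NA)", "(NA)", "4"],
})

# eq() builds a True/False mask; summing it counts matches per column
na_per_column = toy_df.eq("(NA)").sum()

# any(axis=1) flags rows with at least one match; summing counts those rows
rows_with_na = toy_df.eq("(NA)").any(axis=1).sum()

print(na_per_column)
print(rows_with_na)
```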

In [5]:
#(Z) is another placeholder described in the data explanation; these values should be replaced rather than removed. But first, I should find them.
z_colony = colony_df[colony_df.eq('(Z)').any(axis=1)]
z_colony.head(50)

Unnamed: 0,State,Starting Colonies,Maximum Colonies,Lost Colonies,Percent Loss,Added Colonies,Renovated Colonies,Percent Renovated,Year
29,North Dakota,57000,120000,620,1,1800,530,(Z),2015-Q1
39,Vermont,5500,5500,700,13,1200,20,(Z),2015-Q1
115,Mississippi,23000,34000,3500,10,260,110,(Z),2015-Q3
133,Vermont,6000,6500,40,1,30,20,(Z),2015-Q3
151,Illinois,14000,14000,690,5,0,30,(Z),2015-Q4
153,Iowa,35000,35000,4300,12,40,30,(Z),2015-Q4
154,Kansas,8500,8500,3400,40,50,20,(Z),2015-Q4
155,Kentucky,8500,8500,1100,13,20,10,(Z),2015-Q4
157,Maine,4700,4700,60,1,530,20,(Z),2015-Q4
161,Minnesota,104000,105000,10000,10,600,40,(Z),2015-Q4


In [6]:
#count the rows that contain (Z), because they are more scattered than the (NA) rows
z_colony.count()

State                 134
Starting Colonies     134
Maximum Colonies      134
Lost Colonies         134
Percent Loss          134
Added Colonies        134
Renovated Colonies    134
Percent Renovated     134
Year                  134
dtype: int64

In [7]:
#check honey_df for (NA)
na_honey = honey_df[honey_df.eq('(NA)').any(axis=1)]
na_honey

Unnamed: 0,State,Honey producing colonies (thousand),Yield per colony (pounds),"Production (1,000 pounds)","Stocks December 15 (1,000 pounds)",Average price per pound (dollars),"Value of production (1,000 dollars)",Year


In [8]:
#check honey_df for (Z)
z_honey = honey_df[honey_df.eq('(Z)').any(axis=1)]
z_honey

Unnamed: 0,State,Honey producing colonies (thousand),Yield per colony (pounds),"Production (1,000 pounds)","Stocks December 15 (1,000 pounds)",Average price per pound (dollars),"Value of production (1,000 dollars)",Year


In [9]:
#check stressors_df for (NA)
na_stressors = stressors_df[stressors_df.eq('(NA)').any(axis=1)]
na_stressors

Unnamed: 0,State,Varroa Mites (Percent),Other pests and parasites (Percent),Diseases (percent),Pesticides (percent),Other (percent),Unknown (percent),Year
799,Alabama,(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
800,Arizona,(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
801,Arkansas,(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
802,California,(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
803,Colorado,(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
804,Connecticut,(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
805,Florida,(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
806,Georgia,(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
807,Hawaii,(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2
808,Idaho,(NA),(NA),(NA),(NA),(NA),(NA),2019-Q2


In [10]:
#check stressors_df for (Z)
z_stressors = stressors_df[stressors_df.eq('(Z)').any(axis=1)]
z_stressors

Unnamed: 0,State,Varroa Mites (Percent),Other pests and parasites (Percent),Diseases (percent),Pesticides (percent),Other (percent),Unknown (percent),Year
0,Alabama,10,5.4,(Z),2.2,9.1,9.4,2015-Q1
1,Arizona,26.9,20.5,0.1,(Z),1.8,3.1,2015-Q1
5,Connecticut,2.5,1.4,(Z),(Z),21.2,2.4,2015-Q1
8,Hawaii,38.8,37.7,1.6,(Z),2,(Z),2015-Q1
16,Maine,4.4,0.1,(Z),(Z),7.5,1.9,2015-Q1
...,...,...,...,...,...,...,...,...
1167,Vermont,21.1,(Z),3.7,0,10.1,(Z),2021-Q1
1169,Washington,15.5,2.4,1,(Z),5.3,(Z),2021-Q1
1171,Wisconsin,16.1,(Z),(Z),0,3.6,3.2,2021-Q1
1172,Wyoming,(Z),(Z),(Z),0,3.3,7.9,2021-Q1


In [11]:
#It looks prudent to remove the (NA) rows from the colony and stressors dfs. The honey df has no missing data.
#drop the (NA) rows, which sit at positions 799 through 845, by using the drop function and slicing
no_null_colony = colony_df.drop(colony_df.index[799:846])
#check that all were removed
no_null_colony[no_null_colony.eq('(NA)').any(axis=1)]

Unnamed: 0,State,Starting Colonies,Maximum Colonies,Lost Colonies,Percent Loss,Added Colonies,Renovated Colonies,Percent Renovated,Year


In [12]:
no_null_stressors = stressors_df.drop(stressors_df.index[799:846])
#check that all were removed
no_null_stressors[no_null_stressors.eq('(NA)').any(axis=1)]

Unnamed: 0,State,Varroa Mites (Percent),Other pests and parasites (Percent),Diseases (percent),Pesticides (percent),Other (percent),Unknown (percent),Year
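
The slice `index[799:846]` works because the `(NA)` rows happen to be contiguous in this file. A variant that does not depend on row positions is to reuse the boolean mask from the search step and keep its complement. This is a sketch on a made-up frame (`toy_df`, `has_na`, and `toy_no_null` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical frame with one "(NA)" row in the middle
toy_df = pd.DataFrame({
    "State": ["Alabama", "Arizona", "Vermont"],
    "Starting Colonies": ["7000", "(NA)", "5500"],
    "Year": ["2015-Q1", "2019-Q2", "2015-Q1"],
})

# True for every row that contains "(NA)" in any column
has_na = toy_df.eq("(NA)").any(axis=1)

# Keep only the rows where the mask is False
toy_no_null = toy_df[~has_na]

print(toy_no_null)
```

The original index labels are preserved, so later steps that refer to rows by label would still line up.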


In [13]:
#from the info file included with the dataset, I am concluding that (Z) is a very small number but greater than 0.
#for the sake of this analysis, I am going to replace all (Z) values with 0.
#again, honey_df will not need this action
no_z_colony = no_null_colony.replace('(Z)', 0)
#check that at least one of the original rows with a (Z) was updated to 0. Rows 29 and 39 had (Z)
no_z_colony.head(50)

Unnamed: 0,State,Starting Colonies,Maximum Colonies,Lost Colonies,Percent Loss,Added Colonies,Renovated Colonies,Percent Renovated,Year
0,Alabama,7000,7000,1800,26,2800,250,4,2015-Q1
1,Arizona,35000,35000,4600,13,3400,2100,6,2015-Q1
2,Arkansas,13000,14000,1500,11,1200,90,1,2015-Q1
3,California,1440000,1690000,255000,15,250000,124000,7,2015-Q1
4,Colorado,3500,12500,1500,12,200,140,1,2015-Q1
5,Connecticut,3900,3900,870,22,290,0,0,2015-Q1
6,Florida,305000,315000,42000,13,54000,25000,8,2015-Q1
7,Georgia,104000,105000,14500,14,47000,9500,9,2015-Q1
8,Hawaii,10500,10500,380,4,3400,760,7,2015-Q1
9,Idaho,81000,88000,3700,4,2600,8000,9,2015-Q1


In [14]:
no_z_stressors = no_null_stressors.replace('(Z)', 0)
#check that at least one of the original rows with a (Z) was updated to 0. Rows 0 and 1 had (Z)
no_z_stressors.head()

Unnamed: 0,State,Varroa Mites (Percent),Other pests and parasites (Percent),Diseases (percent),Pesticides (percent),Other (percent),Unknown (percent),Year
0,Alabama,10.0,5.4,0.0,2.2,9.1,9.4,2015-Q1
1,Arizona,26.9,20.5,0.1,0.0,1.8,3.1,2015-Q1
2,Arkansas,17.6,11.4,1.5,3.4,1.0,1.0,2015-Q1
3,California,24.7,7.2,3.0,7.5,6.5,2.8,2015-Q1
4,Colorado,14.6,0.9,1.8,0.6,2.6,5.9,2015-Q1


In [15]:
#replace any remaining (X) placeholders with 0 as well
no_x_colony = no_z_colony.replace('(X)', 0)

In [16]:
no_x_colony

Unnamed: 0,State,Starting Colonies,Maximum Colonies,Lost Colonies,Percent Loss,Added Colonies,Renovated Colonies,Percent Renovated,Year
0,Alabama,7000,7000,1800,26,2800,250,4,2015-Q1
1,Arizona,35000,35000,4600,13,3400,2100,6,2015-Q1
2,Arkansas,13000,14000,1500,11,1200,90,1,2015-Q1
3,California,1440000,1690000,255000,15,250000,124000,7,2015-Q1
4,Colorado,3500,12500,1500,12,200,140,1,2015-Q1
...,...,...,...,...,...,...,...,...,...
1217,West Virginia,8000,9000,170,2,1900,390,4,2021-Q2
1218,Wisconsin,42000,57000,2200,4,9000,7500,13,2021-Q2
1219,Wyoming,13500,30000,3400,11,7500,4900,16,2021-Q2
1220,Other,5970,8410,140,2,2890,3100,37,2021-Q2


## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [17]:
#I am trying to get all of my columns into a boxplot format and will continue to work on this. For now, I am using describe() to look at the spread.
honey_df.describe()

Unnamed: 0,Honey producing colonies (thousand),Yield per colony (pounds),"Production (1,000 pounds)","Stocks December 15 (1,000 pounds)",Average price per pound (dollars),"Value of production (1,000 dollars)",Year
count,252.0,252.0,252.0,252.0,252.0,252.0,252.0
mean,130.357143,52.266667,7324.825397,1776.674603,2.986099,15277.896825,2017.5
std,419.470838,16.998795,23813.612951,5880.589664,1.346677,49107.824876,1.711224
min,4.0,27.0,160.0,13.0,1.28,597.0,2015.0
25%,11.0,40.0,466.5,90.5,1.96,1829.0,2016.0
50%,29.5,48.0,1504.0,263.5,2.425,3516.0,2017.5
75%,92.0,60.0,3603.5,1006.0,3.85,9036.0,2019.0
max,2812.0,131.0,161882.0,42203.0,7.99,335905.0,2020.0


In [18]:
no_x_colony.dtypes

State                 object
Starting Colonies     object
Maximum Colonies      object
Lost Colonies         object
Percent Loss          object
Added Colonies        object
Renovated Colonies    object
Percent Renovated     object
Year                  object
dtype: object

In [19]:
no_z_stressors.dtypes

State                                  object
Varroa Mites (Percent)                 object
Other pests and parasites (Percent)    object
Diseases (percent)                     object
Pesticides (percent)                   object
Other (percent)                        object
Unknown (percent)                      object
Year                                   object
dtype: object

In [20]:
#The numeric columns in my stressors and colony dfs are stored as object dtype, which does not work for descriptive stats. They must be converted to int or float.
stressor_cols = ['Varroa Mites (Percent)', 'Other pests and parasites (Percent)', 'Diseases (percent)',
                 'Pesticides (percent)', 'Other (percent)', 'Unknown (percent)']
no_z_stressors[stressor_cols] = no_z_stressors[stressor_cols].astype(float)
no_z_stressors

Unnamed: 0,State,Varroa Mites (Percent),Other pests and parasites (Percent),Diseases (percent),Pesticides (percent),Other (percent),Unknown (percent),Year
0,Alabama,10.0,5.4,0.0,2.2,9.1,9.4,2015-Q1
1,Arizona,26.9,20.5,0.1,0.0,1.8,3.1,2015-Q1
2,Arkansas,17.6,11.4,1.5,3.4,1.0,1.0,2015-Q1
3,California,24.7,7.2,3.0,7.5,6.5,2.8,2015-Q1
4,Colorado,14.6,0.9,1.8,0.6,2.6,5.9,2015-Q1
...,...,...,...,...,...,...,...,...
1170,West Virginia,13.7,3.4,0.7,2.6,1.2,3.3,2021-Q1
1171,Wisconsin,16.1,0.0,0.0,0.0,3.6,3.2,2021-Q1
1172,Wyoming,0.0,0.0,0.0,0.0,3.3,7.9,2021-Q1
1173,Other,19.4,1.2,4.1,0.0,6.8,1.1,2021-Q1


In [21]:
no_z_stressors.dtypes

State                                   object
Varroa Mites (Percent)                 float64
Other pests and parasites (Percent)    float64
Diseases (percent)                     float64
Pesticides (percent)                   float64
Other (percent)                        float64
Unknown (percent)                      float64
Year                                    object
dtype: object

In [22]:
colony_cols = ['Starting Colonies', 'Maximum Colonies', 'Lost Colonies', 'Percent Loss',
               'Added Colonies', 'Renovated Colonies', 'Percent Renovated']
no_x_colony[colony_cols] = no_x_colony[colony_cols].astype(float)

In [23]:
no_x_colony.dtypes

State                  object
Starting Colonies     float64
Maximum Colonies      float64
Lost Colonies         float64
Percent Loss          float64
Added Colonies        float64
Renovated Colonies    float64
Percent Renovated     float64
Year                   object
dtype: object
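
`astype(float)` succeeds here because every placeholder was replaced first; it raises an error if any non-numeric string is left behind. As a sketch of a more forgiving alternative, `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into `NaN`, which makes leftover placeholders easy to count. The frame and column list below are made up for illustration:

```python
import pandas as pd

# Hypothetical frame where one placeholder, "(X)", was missed during cleaning
toy_df = pd.DataFrame({
    "State": ["Alabama", "Arizona", "Vermont"],
    "Percent Renovated": ["4", "(X)", "0"],
})

numeric_cols = ["Percent Renovated"]

# Convert each listed column; strings that are not numbers become NaN
toy_df[numeric_cols] = toy_df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Any NaN now marks a value that was not numeric to begin with
leftover = toy_df[numeric_cols].isnull().sum()

print(toy_df.dtypes)
print(leftover)
```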

In [24]:
no_z_stressors.describe()

Unnamed: 0,Varroa Mites (Percent),Other pests and parasites (Percent),Diseases (percent),Pesticides (percent),Other (percent),Unknown (percent)
count,1128.0,1128.0,1128.0,1128.0,1128.0,1128.0
mean,30.48617,11.109663,3.655319,6.500798,6.256649,4.182801
std,19.434581,13.453746,6.912448,9.371546,6.58784,5.238304
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,15.675,1.9,0.2,0.4,1.9,0.8
50%,27.45,7.0,1.2,2.8,4.3,2.5
75%,42.9,15.2,4.5,9.0,8.325,5.425
max,98.8,91.9,87.4,73.5,61.4,46.2


In [25]:
no_x_colony.describe()

Unnamed: 0,Starting Colonies,Maximum Colonies,Lost Colonies,Percent Loss,Added Colonies,Renovated Colonies,Percent Renovated
count,1175.0,1175.0,1175.0,1175.0,1175.0,1175.0,1175.0
mean,123280.7,77472.99,16970.025532,11.526809,16660.374468,14297.055319,7.396596
std,436620.4,188736.2,61636.840022,7.339359,66416.131426,60466.064409,9.42708
min,1300.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,8000.0,8670.0,965.0,7.0,400.0,140.0,1.0
50%,17500.0,21000.0,2200.0,10.0,1800.0,800.0,4.0
75%,55000.0,65000.0,7000.0,15.0,6000.0,3900.0,10.0
max,3181180.0,1710000.0,502350.0,65.0,736920.0,762550.0,77.0
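
The `describe()` tables show maximums far above the 75th percentile, which hints at outliers but does not list them. One common rule of thumb, not something this notebook has applied yet, is to flag values more than 1.5 times the interquartile range beyond the quartiles. The sketch below uses made-up numbers (`toy` is an assumption, not a column from the real data):

```python
import pandas as pd

# Hypothetical values; the last one is deliberately extreme
toy = pd.Series([4.0, 7.0, 10.0, 11.0, 12.0, 15.0, 65.0], name="Percent Loss")

# Quartiles and interquartile range
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond each quartile
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = toy[(toy < lower) | (toy > upper)]

print(outliers)
```

Whether a flagged value should be removed is a separate judgment call; a large state such as California may legitimately sit far from the rest.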


## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [31]:
# I just need to remove the rows labeled "United States", as those appear to contain totals across all states, which would skew my data.
honey_df[honey_df.eq('United States').any(axis=1)]

Unnamed: 0,State,Honey producing colonies (thousand),Yield per colony (pounds),"Production (1,000 pounds)","Stocks December 15 (1,000 pounds)",Average price per pound (dollars),"Value of production (1,000 dollars)",Year
41,United States,2660,58.9,156544,42203,2.09,327177,2015
83,United States,2775,58.3,161882,41253,2.075,335905,2016
125,United States,2669,55.3,147638,30577,2.156,318308,2017
167,United States,2803,54.4,152348,29091,2.166,333482,2018
209,United States,2812,55.8,156922,41022,1.97,309136,2019
251,United States,2706,54.5,147594,39715,2.03,299616,2020


In [45]:
drop_US_honey = honey_df.drop([41, 83, 125, 167, 209, 251])
drop_US_honey[drop_US_honey.eq('United States').any(axis=1)]

Unnamed: 0,State,Honey producing colonies (thousand),Yield per colony (pounds),"Production (1,000 pounds)","Stocks December 15 (1,000 pounds)",Average price per pound (dollars),"Value of production (1,000 dollars)",Year


In [34]:
no_x_colony[no_x_colony.eq('United States').any(axis=1)]

Unnamed: 0,State,Starting Colonies,Maximum Colonies,Lost Colonies,Percent Loss,Added Colonies,Renovated Colonies,Percent Renovated,Year
46,United States,2824610.0,0.0,500020.0,18.0,546980.0,270530.0,10.0,2015-Q1
93,United States,2849500.0,0.0,352860.0,12.0,661860.0,692850.0,24.0,2015-Q2
140,United States,3132880.0,0.0,457100.0,15.0,172990.0,303070.0,10.0,2015-Q3
187,United States,2874760.0,0.0,412380.0,14.0,117150.0,158790.0,6.0,2015-Q4
234,United States,2619940.0,0.0,416100.0,16.0,571880.0,245060.0,9.0,2016-Q1
281,United States,2801470.0,0.0,329820.0,12.0,736920.0,561160.0,20.0,2016-Q2
328,United States,3181180.0,0.0,397290.0,12.0,217320.0,282130.0,9.0,2016-Q3
375,United States,3032060.0,0.0,502350.0,17.0,124660.0,60390.0,2.0,2016-Q4
422,United States,2641090.0,0.0,398650.0,15.0,478240.0,241210.0,9.0,2017-Q1
469,United States,2694150.0,0.0,285590.0,11.0,613360.0,762550.0,28.0,2017-Q2


In [35]:
drop_US_colony = no_x_colony.drop([46, 93, 140, 187, 234, 281, 328, 375, 422, 469, 516, 563, 610, 657, 704, 751, 798, 892, 939, 986, 1033, 1080, 1127, 1174, 1221])
drop_US_colony[drop_US_colony.eq('United States').any(axis=1)]

Unnamed: 0,State,Starting Colonies,Maximum Colonies,Lost Colonies,Percent Loss,Added Colonies,Renovated Colonies,Percent Renovated,Year


In [36]:
no_z_stressors[no_z_stressors.eq('United States').any(axis=1)]

Unnamed: 0,State,Varroa Mites (Percent),Other pests and parasites (Percent),Diseases (percent),Pesticides (percent),Other (percent),Unknown (percent),Year
46,United States,25.2,8.6,3.1,7.4,6.9,4.3,2015-Q1
93,United States,43.4,19.5,4.9,16.6,11.6,3.5,2015-Q2
140,United States,41.2,17.6,8.0,15.2,8.8,4.8,2015-Q3
187,United States,37.0,11.4,5.2,9.5,7.3,6.9,2015-Q4
234,United States,34.6,12.6,6.2,10.9,6.9,5.4,2016-Q1
281,United States,53.4,16.3,9.5,12.4,12.3,4.1,2016-Q2
328,United States,46.1,15.6,6.7,15.1,9.3,4.5,2016-Q3
375,United States,46.6,16.9,8.3,9.4,10.1,6.4,2016-Q4
422,United States,42.2,15.5,7.0,8.9,7.2,7.4,2017-Q1
469,United States,40.9,10.9,4.6,12.3,7.0,4.9,2017-Q2


In [38]:
drop_US_stressors = no_z_stressors.drop([46, 93, 140, 187, 234, 281, 328, 375, 422, 469, 516, 563, 610, 657, 704, 751, 798, 892, 939, 986, 1033, 1080, 1127, 1174])
drop_US_stressors[drop_US_stressors.eq('United States').any(axis=1)]

Unnamed: 0,State,Varroa Mites (Percent),Other pests and parasites (Percent),Diseases (percent),Pesticides (percent),Other (percent),Unknown (percent),Year
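
Listing the index labels by hand works, but it is easy to miss one. A shorter option, sketched here on a made-up frame (`toy_df` and `toy_states_only` are illustrative names), is to filter on the `State` column directly:

```python
import pandas as pd

# Hypothetical frame with a national total row after each quarter
toy_df = pd.DataFrame({
    "State": ["Alabama", "United States", "Wyoming", "United States"],
    "Year": ["2015-Q1", "2015-Q1", "2015-Q2", "2015-Q2"],
    "Percent Loss": [26.0, 18.0, 11.0, 12.0],
})

# Keep every row whose State is not the national total
toy_states_only = toy_df[toy_df["State"] != "United States"]

print(toy_states_only)
```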


## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [50]:
# Since most of the columns are numeric, I only need to check the State and Year columns in each df for unique values.
# If any values are duplicated or formatted differently, I will fix them.

drop_US_honey['State'].unique()

array(['Alabama', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Michigan',
       'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska',
       'New Jersey', 'New York', 'North Carolina', 'North Dakota', 'Ohio',
       'Oregon', 'Pennsylvania', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming', 'Other'], dtype=object)

In [52]:
drop_US_colony['State'].unique()

array(['Alabama', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois',
       'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine',
       'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'New Jersey',
       'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio',
       'Oklahoma', 'Oregon', 'Pennsylvania', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont',
       'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming',
       'Other'], dtype=object)

In [44]:
drop_US_stressors['State'].unique()

array(['Alabama', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois',
       'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine',
       'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'New Jersey',
       'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio',
       'Oklahoma', 'Oregon', 'Pennsylvania', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont',
       'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming',
       'Other'], dtype=object)

In [49]:
drop_US_honey['Year'].unique()

array([2015, 2016, 2017, 2018, 2019, 2020], dtype=int64)

In [53]:
drop_US_colony['Year'].unique()

array(['2015-Q1', '2015-Q2', '2015-Q3', '2015-Q4', '2016-Q1', '2016-Q2',
       '2016-Q3', '2016-Q4', '2017-Q1', '2017-Q2', '2017-Q3', '2017-Q4',
       '2018-Q1', '2018-Q2', '2018-Q3', '2018-Q4', '2019-Q1', '2019-Q3',
       '2019-Q4', '2020-Q1', '2020-Q2', '2020-Q3', '2020-Q4', '2021-Q1',
       '2021-Q2'], dtype=object)

In [54]:
drop_US_stressors['Year'].unique()

array(['2015-Q1', '2015-Q2', '2015-Q3', '2015-Q4', '2016-Q1', '2016-Q2',
       '2016-Q3', '2016-Q4', '2017-Q1', '2017-Q2', '2017-Q3', '2017-Q4',
       '2018-Q1', '2018-Q2', '2018-Q3', '2018-Q4', '2019-Q1', '2019-Q3',
       '2019-Q4', '2020-Q1', '2020-Q2', '2020-Q3', '2020-Q4', '2021-Q1'],
      dtype=object)
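
The outputs above show that `honey_df` stores `Year` as an integer while the colony and stressors dfs store it as a string such as `2015-Q1`. If the dfs are later compared by year, one possible approach is to split the quarterly string into separate numeric columns. This is only a sketch on made-up rows, and the column names `Calendar Year` and `Quarter` are assumptions:

```python
import pandas as pd

# Hypothetical rows using the same "YYYY-Qn" format as the colony data
toy_df = pd.DataFrame({"Year": ["2015-Q1", "2019-Q3", "2021-Q2"]})

# Split on "-Q" so the year and the quarter number land in separate columns
parts = toy_df["Year"].str.split("-Q", expand=True)

# Store both pieces as integers alongside the original string
toy_df["Calendar Year"] = parts[0].astype(int)
toy_df["Quarter"] = parts[1].astype(int)

print(toy_df)
```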

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?<br>
No <br>

2. Did the process of cleaning your data give you new insights into your dataset?<br>
Yes. EDA helped me figure out ways to clean the data, but actually cleaning it made me look at the data from different angles and gave me new perspectives on how to tackle the next step. <br>
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations? <br>
I tried making a box plot when looking for outliers and realized I might want to create another df, or another column in my df, for regions. This might make the visualizations better.
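
A region column like the one mentioned above could be built by mapping each state to a region name. The sketch below is only an illustration: the mapping `region_by_state` is hypothetical and covers just three states, and the `Unassigned` label for unmapped rows is an assumption.

```python
import pandas as pd

# Hypothetical, partial mapping for illustration only
region_by_state = {
    "Alabama": "South",
    "Vermont": "Northeast",
    "Wyoming": "West",
}

toy_df = pd.DataFrame({"State": ["Alabama", "Vermont", "Wyoming", "Other"]})

# map() looks up each state; states missing from the mapping get a fallback label
toy_df["Region"] = toy_df["State"].map(region_by_state).fillna("Unassigned")

print(toy_df)
```

With a complete mapping, the `Region` column could then be used to group the box plots.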