In [1]:
## IT'S DANGEROUS TO GO ALONE! TAKE THIS:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Soil, Data Wrangling - Drought
---
**Let the Soil Play Its Simple Part**
Greg Sakowski
* Book 3 of 7
* Cleaning and reformatting data that was pulled from the US Drought Monitor API. 
* Reading from CSV: roughDroughtFIPS.csv, roughDroughtFIPS2.csv
* Writing to CSV: drought.csv
* **Warning:** the initial data and final dataframe are over 500 MB total.
---
## Table of Contents:

[Data Overview](#Data-Overview)

[Duplicates, Making Year, Month, and Day](#Duplicates,-Making-Year,-Month,-and-Day)

[Combination Notes](#Combination-Notes)


## Data Sources

We have three data sources, each with slightly different types of data. The goal is to have weekly data at the county level.

- NOAA Weather data

- **US Drought monitor** - We'll work on the drought data in this notebook
    * This is the cleanest source and the closest to being ready to use. The source data contains weekly, county-level data for 20 years. It is split into two CSVs, and I *know* that the 2011-12-27 to 2012-01-02 row is duplicated for each county, so that will need to be fixed when I combine them. There are columns for the land area, in square miles, of the county at each level of drought.

- USDA/NASS Census and Survey of Ag data

# Data Overview
---

The data we scraped from the US Drought Monitor has the below columns:

* MapDate

* FIPS

* County

* State

* None

* D0

* D1

* D2

* D3

* D4

* ValidStart

* ValidEnd

* StatisticFormatID

---

'None' and the D's are square miles in drought:
* None - No dryness/drought
* D0 - Abnormally Dry
* D1 - Moderate Drought
* D2 - Severe Drought
* D3 - Extreme Drought
* D4 - Exceptional Drought

These could be renamed to:
* Sq_Mi_No_Drought
* Sq_Mi_Abnormally_Dry
* Sq_Mi_Moderate_Drought
* Sq_Mi_Severe_Drought
* Sq_Mi_Extreme_Drought
* Sq_Mi_Exceptional_Drought

We will eventually add our target variable using two columns calculated from the six levels of drought data. We have some options for how we define the 'in_drought' target. The most likely option is to compare the square miles in the 'Sq_Mi_No_Drought' column to total_Sq_Mi, marking 'in_drought' as positive when less than 50% of the county's area is *not* in drought.

* total_Sq_Mi
* in_drought
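
The 50% rule above could be sketched roughly as follows. This is only an illustration, not the final target definition: it assumes the six area columns are mutually exclusive and sum to the county's total area, and the third row of the toy data is made up.

```python
import pandas as pd

# Toy rows using the renamed columns. The third row is invented for illustration.
sample = pd.DataFrame({
    'Sq_Mi_No_Drought': [371.40, 625.03, 100.00],
    'Sq_Mi_Abnormally_Dry': [253.63, 0.00, 200.00],
    'Sq_Mi_Moderate_Drought': [0.0, 0.0, 300.0],
    'Sq_Mi_Severe_Drought': [0.0, 0.0, 0.0],
    'Sq_Mi_Extreme_Drought': [0.0, 0.0, 0.0],
    'Sq_Mi_Exceptional_Drought': [0.0, 0.0, 0.0],
})

level_cols = list(sample.columns)

# Assumption: the six levels do not overlap, so their sum is the county's area.
sample['total_Sq_Mi'] = sample[level_cols].sum(axis=1)

# Positive when less than 50% of the county's area is NOT in drought.
share_no_drought = sample['Sq_Mi_No_Drought'] / sample['total_Sq_Mi']
sample['in_drought'] = share_no_drought < 0.5

print(sample[['total_Sq_Mi', 'in_drought']])
```

If the API's area columns turn out to be cumulative rather than exclusive, the total would need to be computed differently, so this is worth verifying before relying on it.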

MapDate is redundant with the ValidStart and ValidEnd dates, and StatisticFormatID is 2 all the way down and just refers to the type of info requested from the API. We can keep the ValidStart column and leave behind ValidEnd, MapDate, and StatisticFormatID when we import the CSV into a dataframe.

After importing the data we can check the data types, check for nulls, and make any immediate adjustments that we need before diving into reformatting our data.

In [2]:
#referenced the below stackoverflow thread to fix issues with commas in the drought areas, 
#and I just used 'thousands' on the import instead of messing around with a post-import replacement
# https://stackoverflow.com/questions/22137723/convert-number-strings-with-commas-in-pandas-dataframe-to-float

#importing the first half of the data
drought1 = pd.read_csv('data/roughDroughtFIPS.csv',
                       thousands=',',
                       header=0,
                       names=['Mapdate','FIPS', 'County', 'State', 'Sq_Mi_No_Drought', 'Sq_Mi_Abnormally_Dry', 'Sq_Mi_Moderate_Drought', 'Sq_Mi_Severe_Drought', 'Sq_Mi_Extreme_Drought', 'Sq_Mi_Exceptional_Drought', 'ValidStart', 'ValidEnd', 'StatisticFormatID'], 
                       usecols=['FIPS', 'County', 'State', 'Sq_Mi_No_Drought', 'Sq_Mi_Abnormally_Dry', 'Sq_Mi_Moderate_Drought', 'Sq_Mi_Severe_Drought', 'Sq_Mi_Extreme_Drought', 'Sq_Mi_Exceptional_Drought', 'ValidStart'])
drought1


Unnamed: 0,FIPS,County,State,Sq_Mi_No_Drought,Sq_Mi_Abnormally_Dry,Sq_Mi_Moderate_Drought,Sq_Mi_Severe_Drought,Sq_Mi_Extreme_Drought,Sq_Mi_Exceptional_Drought,ValidStart
0,1033,Colbert County,AL,371.40,253.63,0.0,0.0,0.0,0.0,2021-12-28
1,1033,Colbert County,AL,592.73,32.30,0.0,0.0,0.0,0.0,2021-12-21
2,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-14
3,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-07
4,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-11-30
...,...,...,...,...,...,...,...,...,...,...
1642215,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2012-01-24
1642216,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2012-01-17
1642217,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2012-01-10
1642218,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2012-01-03


In [3]:
drought1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1642220 entries, 0 to 1642219
Data columns (total 10 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   FIPS                       1642220 non-null  int64  
 1   County                     1642220 non-null  object 
 2   State                      1642220 non-null  object 
 3   Sq_Mi_No_Drought           1642220 non-null  float64
 4   Sq_Mi_Abnormally_Dry       1642220 non-null  float64
 5   Sq_Mi_Moderate_Drought     1642220 non-null  float64
 6   Sq_Mi_Severe_Drought       1642220 non-null  float64
 7   Sq_Mi_Extreme_Drought      1642220 non-null  float64
 8   Sq_Mi_Exceptional_Drought  1642220 non-null  float64
 9   ValidStart                 1642220 non-null  object 
dtypes: float64(6), int64(1), object(3)
memory usage: 125.3+ MB


In [4]:
#importing the second half of the data
drought2 = pd.read_csv('data/roughDroughtFIPS2.csv',
                       thousands=',',
                       header=0,
                       names=['Mapdate','FIPS', 'County', 'State', 'Sq_Mi_No_Drought', 'Sq_Mi_Abnormally_Dry', 'Sq_Mi_Moderate_Drought', 'Sq_Mi_Severe_Drought', 'Sq_Mi_Extreme_Drought', 'Sq_Mi_Exceptional_Drought', 'ValidStart', 'ValidEnd', 'StatisticFormatID'], 
                       usecols=['FIPS', 'County', 'State', 'Sq_Mi_No_Drought', 'Sq_Mi_Abnormally_Dry', 'Sq_Mi_Moderate_Drought', 'Sq_Mi_Severe_Drought', 'Sq_Mi_Extreme_Drought', 'Sq_Mi_Exceptional_Drought', 'ValidStart'])
drought2

Unnamed: 0,FIPS,County,State,Sq_Mi_No_Drought,Sq_Mi_Abnormally_Dry,Sq_Mi_Moderate_Drought,Sq_Mi_Severe_Drought,Sq_Mi_Extreme_Drought,Sq_Mi_Exceptional_Drought,ValidStart
0,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2011-12-27
1,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2011-12-20
2,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2011-12-13
3,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2011-12-06
4,1033,Colbert County,AL,159.05,465.98,0.0,0.0,0.0,0.0,2011-11-29
...,...,...,...,...,...,...,...,...,...,...
1639075,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-29
1639076,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-22
1639077,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-15
1639078,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-08


In [5]:
drought2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1639080 entries, 0 to 1639079
Data columns (total 10 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   FIPS                       1639080 non-null  int64  
 1   County                     1639080 non-null  object 
 2   State                      1639080 non-null  object 
 3   Sq_Mi_No_Drought           1639080 non-null  float64
 4   Sq_Mi_Abnormally_Dry       1639080 non-null  float64
 5   Sq_Mi_Moderate_Drought     1639080 non-null  float64
 6   Sq_Mi_Severe_Drought       1639080 non-null  float64
 7   Sq_Mi_Extreme_Drought      1639080 non-null  float64
 8   Sq_Mi_Exceptional_Drought  1639080 non-null  float64
 9   ValidStart                 1639080 non-null  object 
dtypes: float64(6), int64(1), object(3)
memory usage: 125.1+ MB


Now we can update the data types for some of these columns. We will convert FIPS to a string so that leading zeros can be added to the four-digit values. Later it will be combined with the year and month columns so we can use it as a key for joining dataframes.

Eventually we will convert the ValidStart to 'date' data type. We will leave it as an object for now since we need to expand the column into year, month, and day columns.
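
As a quick illustration of why FIPS needs to be a string, here is a minimal sketch on two FIPS codes taken from the tables above. The year and month values appended at the end are just example strings.

```python
import pandas as pd

# Two FIPS codes as they arrive from the CSV: integers, so 01033 has lost its zero.
fips = pd.Series([1033, 53029])

# Convert to string, then left-pad with zeros to the standard five characters.
fips_str = fips.astype(str).str.zfill(5)

# String concatenation then gives a fixed-width key: FIPS + year + month.
key = fips_str + '2021' + '12'

print(fips_str.tolist())
print(key.tolist())
```

Padding to a fixed width matters because an unpadded key like '1033202112' would not match a padded key in another dataframe.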

In [6]:
#converting the FIPS columns in each dataframe to string data
drought1['FIPS'] = drought1['FIPS'].astype(str)
drought2['FIPS'] = drought2['FIPS'].astype(str)

#checking our conversions for drought1
drought1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1642220 entries, 0 to 1642219
Data columns (total 10 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   FIPS                       1642220 non-null  object 
 1   County                     1642220 non-null  object 
 2   State                      1642220 non-null  object 
 3   Sq_Mi_No_Drought           1642220 non-null  float64
 4   Sq_Mi_Abnormally_Dry       1642220 non-null  float64
 5   Sq_Mi_Moderate_Drought     1642220 non-null  float64
 6   Sq_Mi_Severe_Drought       1642220 non-null  float64
 7   Sq_Mi_Extreme_Drought      1642220 non-null  float64
 8   Sq_Mi_Exceptional_Drought  1642220 non-null  float64
 9   ValidStart                 1642220 non-null  object 
dtypes: float64(6), object(4)
memory usage: 125.3+ MB


In [7]:
#and for drought2
drought2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1639080 entries, 0 to 1639079
Data columns (total 10 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   FIPS                       1639080 non-null  object 
 1   County                     1639080 non-null  object 
 2   State                      1639080 non-null  object 
 3   Sq_Mi_No_Drought           1639080 non-null  float64
 4   Sq_Mi_Abnormally_Dry       1639080 non-null  float64
 5   Sq_Mi_Moderate_Drought     1639080 non-null  float64
 6   Sq_Mi_Severe_Drought       1639080 non-null  float64
 7   Sq_Mi_Extreme_Drought      1639080 non-null  float64
 8   Sq_Mi_Exceptional_Drought  1639080 non-null  float64
 9   ValidStart                 1639080 non-null  object 
dtypes: float64(6), object(4)
memory usage: 125.1+ MB


We can double check for missing data with isnull() to make sure our conversion didn't create null/NaN values.

In [8]:
drought1.isnull().sum()

FIPS                         0
County                       0
State                        0
Sq_Mi_No_Drought             0
Sq_Mi_Abnormally_Dry         0
Sq_Mi_Moderate_Drought       0
Sq_Mi_Severe_Drought         0
Sq_Mi_Extreme_Drought        0
Sq_Mi_Exceptional_Drought    0
ValidStart                   0
dtype: int64

In [9]:
drought2.isnull().sum()

FIPS                         0
County                       0
State                        0
Sq_Mi_No_Drought             0
Sq_Mi_Abnormally_Dry         0
Sq_Mi_Moderate_Drought       0
Sq_Mi_Severe_Drought         0
Sq_Mi_Extreme_Drought        0
Sq_Mi_Exceptional_Drought    0
ValidStart                   0
dtype: int64

Great! No null values came up.

# Duplicates, Making Year, Month, and Day

We have some rows that we *know* are duplicates. We can find those to get a head count of the known duplicates and when we combine the dataframes we can check that head count against the total duplicates we find. Before combining the dataframes we will need to set up a 'Year', 'Month', and 'Day' column.

To do this, we can make three columns from the ValidStart column so we have a Year, Month, and Day column. Then we can filter by the year column-- the drought1 dataframe should span 2021 to 2012, but because 01/01/2012 landed mid-way through the drought monitor's week, we have a week that starts in 2011. The drought2 dataframe spans 2011 to 2002, and includes this final week of 2011.

While we're at it, we will check for duplicates in both dataframes.
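
For reference, the string split used in the cells below and a datetime-based alternative give the same year and month pieces. This is a small sketch on two dates from the data. The alternative is not what this notebook uses, it is only shown for comparison.

```python
import pandas as pd

dates = pd.Series(['2021-12-28', '2012-01-03'])

# Approach used in this notebook: split the text on '-' into three string columns.
parts = dates.str.split('-', expand=True)
parts.columns = ['Year', 'Month', 'Day']

# Alternative: parse to datetime first, then format the year and month as text.
parsed = pd.to_datetime(dates, format='%Y-%m-%d')
year_month = parsed.dt.strftime('%Y%m')

print(parts)
print(year_month.tolist())
```

The split keeps the zero-padded month as text ('01'), which is what we want when building a string key later.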

In [11]:
#splitting the ValidStart in drought1 and adding the split columns as Year Month and Day
drought1[['Year', 'Month', 'Day']] = drought1['ValidStart'].str.split(pat='-', expand=True)
drought1

Unnamed: 0,FIPS,County,State,Sq_Mi_No_Drought,Sq_Mi_Abnormally_Dry,Sq_Mi_Moderate_Drought,Sq_Mi_Severe_Drought,Sq_Mi_Extreme_Drought,Sq_Mi_Exceptional_Drought,ValidStart,Year,Month,Day
0,1033,Colbert County,AL,371.40,253.63,0.0,0.0,0.0,0.0,2021-12-28,2021,12,28
1,1033,Colbert County,AL,592.73,32.30,0.0,0.0,0.0,0.0,2021-12-21,2021,12,21
2,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-14,2021,12,14
3,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-07,2021,12,07
4,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-11-30,2021,11,30
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1642215,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2012-01-24,2012,01,24
1642216,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2012-01-17,2012,01,17
1642217,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2012-01-10,2012,01,10
1642218,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2012-01-03,2012,01,03


In [12]:
#making sure we don't have any unknown duplicates in drought1
drought1.duplicated().sum()

0

In [13]:
#splitting the ValidStart in drought2 and adding the split columns as Year Month and Day
drought2[['Year', 'Month', 'Day']] = drought2['ValidStart'].str.split(pat='-', expand=True)
drought2

Unnamed: 0,FIPS,County,State,Sq_Mi_No_Drought,Sq_Mi_Abnormally_Dry,Sq_Mi_Moderate_Drought,Sq_Mi_Severe_Drought,Sq_Mi_Extreme_Drought,Sq_Mi_Exceptional_Drought,ValidStart,Year,Month,Day
0,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2011-12-27,2011,12,27
1,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2011-12-20,2011,12,20
2,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2011-12-13,2011,12,13
3,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2011-12-06,2011,12,06
4,1033,Colbert County,AL,159.05,465.98,0.0,0.0,0.0,0.0,2011-11-29,2011,11,29
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1639075,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-29,2002,01,29
1639076,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-22,2002,01,22
1639077,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-15,2002,01,15
1639078,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-08,2002,01,08


In [14]:
#making sure we don't have any unknown duplicates in drought2
drought2.duplicated().sum()

0

In [15]:
#checking the number of rows with 2011 in drought1
drought1[drought1['Year']=='2011'].shape[0]

3140

Neither dataframe has internal duplicates, and drought1 holds 3,140 rows from the 2011 week that we know is repeated in drought2. Let's combine the dataframes and rerun the duplicated function.
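
Here is a toy version of that check, with two tiny made-up frames standing in for drought1 and drought2. It also counts duplicates on just the county and week columns, which is an extra check not performed in this notebook. It would catch a repeated week even if the area values differed between files.

```python
import pandas as pd

# Stand-ins for drought1 and drought2 that share one overlapping week.
a = pd.DataFrame({
    'FIPS': ['01033', '01033'],
    'ValidStart': ['2012-01-03', '2011-12-27'],
    'Sq_Mi_No_Drought': [625.03, 625.03],
})
b = pd.DataFrame({
    'FIPS': ['01033', '01033'],
    'ValidStart': ['2011-12-27', '2011-12-20'],
    'Sq_Mi_No_Drought': [625.03, 625.03],
})

combined = pd.concat([a, b], ignore_index=True)

# Rows identical in every column.
full_dupes = int(combined.duplicated().sum())

# Rows that repeat the same county and week, whatever the other values are.
key_dupes = int(combined.duplicated(subset=['FIPS', 'ValidStart']).sum())

deduped = combined.drop_duplicates(subset=['FIPS', 'ValidStart'],
                                   keep='first', ignore_index=True)

print(full_dupes, key_dupes, len(deduped))
```

If the two counts ever disagree on the real data, the overlapping week was reported with different values in the two files and deserves a closer look.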

In [16]:
drought = pd.concat([drought1, drought2], ignore_index=True)
drought

Unnamed: 0,FIPS,County,State,Sq_Mi_No_Drought,Sq_Mi_Abnormally_Dry,Sq_Mi_Moderate_Drought,Sq_Mi_Severe_Drought,Sq_Mi_Extreme_Drought,Sq_Mi_Exceptional_Drought,ValidStart,Year,Month,Day
0,1033,Colbert County,AL,371.40,253.63,0.0,0.0,0.0,0.0,2021-12-28,2021,12,28
1,1033,Colbert County,AL,592.73,32.30,0.0,0.0,0.0,0.0,2021-12-21,2021,12,21
2,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-14,2021,12,14
3,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-07,2021,12,07
4,1033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-11-30,2021,11,30
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3281295,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-29,2002,01,29
3281296,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-22,2002,01,22
3281297,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-15,2002,01,15
3281298,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-08,2002,01,08


In [17]:
#checking columns
drought.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3281300 entries, 0 to 3281299
Data columns (total 13 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   FIPS                       object 
 1   County                     object 
 2   State                      object 
 3   Sq_Mi_No_Drought           float64
 4   Sq_Mi_Abnormally_Dry       float64
 5   Sq_Mi_Moderate_Drought     float64
 6   Sq_Mi_Severe_Drought       float64
 7   Sq_Mi_Extreme_Drought      float64
 8   Sq_Mi_Exceptional_Drought  float64
 9   ValidStart                 object 
 10  Year                       object 
 11  Month                      object 
 12  Day                        object 
dtypes: float64(6), object(7)
memory usage: 325.4+ MB


In [18]:
#printing the duplicate count
print(f'There are {drought.duplicated().sum()} duplicated rows in our merged dataframe, "drought".\n\
This is equal to our expected count of {drought1[drought1["Year"]=="2011"].shape[0]}.\n\
Let\'s drop them!')

There are 3140 duplicated rows in our merged dataframe, "drought".
This is equal to our expected count of 3140.
Let's drop them!


In [19]:
#keeping the first instance of duplicated rows and dropping the second
drought = drought.drop_duplicates(keep='first', ignore_index=True)

# Combination Notes
---
When we combine the **drought** dataframe with the weather data, we will need to be cognizant of how we will fill in data. We can combine using the same FIPSYearMonth column used in the weather dataframe to make sure our data lands in the right spot. This should automatically duplicate the average, minimum, and maximum temperatures for each week of the month those temperatures belong to.

For example: for Colbert County, Alabama, the December 2021 temperatures and precipitation values from the **Weather** dataframe will be applied to the 4 weeks/rows of drought data we have for that month.

Although the temperature generally fluctuates over the course of the month, this is effectively a forward fill of the temperatures. This is not terribly concerning as we generally would not aggregate the temperature data by summing it, so duplicate temperatures are reasonable placeholders. 

The same effect will happen with Precipitation, which *is concerning* because it will erroneously quadruple or quintuple the precipitation if we decide to sum the data for a given year. We will need to keep this in mind as we perform analysis--changing the column name to **monthly_precip** should help, too.
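
To make the precipitation concern concrete, here is a small sketch of the merge. The weather frame, its layout, and the 5.0 value are hypothetical. Only the FIPSYearMonth key and the monthly_precip name come from the notes above.

```python
import pandas as pd

# Four weekly drought rows for one county in December 2021.
weekly = pd.DataFrame({
    'FIPSYearMonth': ['01033202112'] * 4,
    'Date': pd.to_datetime(['2021-12-07', '2021-12-14',
                            '2021-12-21', '2021-12-28']),
})

# One monthly weather row for the same county and month (value is made up).
monthly = pd.DataFrame({
    'FIPSYearMonth': ['01033202112'],
    'monthly_precip': [5.0],
})

# validate raises if the weather side ever has more than one row per key.
merged = weekly.merge(monthly, on='FIPSYearMonth', how='left',
                      validate='many_to_one')

# Summing the weekly rows counts the monthly value once per week.
naive_total = merged['monthly_precip'].sum()

# Keeping one row per key before summing avoids the inflation.
safe_total = merged.drop_duplicates('FIPSYearMonth')['monthly_precip'].sum()

print(naive_total, safe_total)
```

The naive sum is four times the true monthly value here, which is exactly the quadrupling described above.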

We can keep that in mind for the final dataframe merge--next let's make sure the FIPS column is a string type, add in the leading zeros we'll need for the final merger, and set up our FIPSYearMonth column.

In [20]:
drought['FIPS'] = drought['FIPS'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  drought['FIPS'] = drought['FIPS'].astype(str)


In [21]:
drought['FIPS'] = drought['FIPS'].str.zfill(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  drought['FIPS'] = drought['FIPS'].str.zfill(5)


In [22]:
drought['FIPSYearMonth'] = drought['FIPS'] + drought['Year'] + drought['Month']
drought

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  drought['FIPSYearMonth'] = drought['FIPS'] + drought['Year'] + drought['Month']


Unnamed: 0,FIPS,County,State,Sq_Mi_No_Drought,Sq_Mi_Abnormally_Dry,Sq_Mi_Moderate_Drought,Sq_Mi_Severe_Drought,Sq_Mi_Extreme_Drought,Sq_Mi_Exceptional_Drought,ValidStart,Year,Month,Day,FIPSYearMonth
0,01033,Colbert County,AL,371.40,253.63,0.0,0.0,0.0,0.0,2021-12-28,2021,12,28,01033202112
1,01033,Colbert County,AL,592.73,32.30,0.0,0.0,0.0,0.0,2021-12-21,2021,12,21,01033202112
2,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-14,2021,12,14,01033202112
3,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-07,2021,12,07,01033202112
4,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-11-30,2021,11,30,01033202111
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3278155,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-29,2002,01,29,53029200201
3278156,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-22,2002,01,22,53029200201
3278157,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-15,2002,01,15,53029200201
3278158,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-08,2002,01,08,53029200201


In [23]:
drought.isnull().sum()

FIPS                         0
County                       0
State                        0
Sq_Mi_No_Drought             0
Sq_Mi_Abnormally_Dry         0
Sq_Mi_Moderate_Drought       0
Sq_Mi_Severe_Drought         0
Sq_Mi_Extreme_Drought        0
Sq_Mi_Exceptional_Drought    0
ValidStart                   0
Year                         0
Month                        0
Day                          0
FIPSYearMonth                0
dtype: int64

With the FIPSYearMonth column in place, we can drop the Year, Month, and Day columns and convert ValidStart to a datetime column (renamed to 'Date' to make it clearer).

In [24]:
drought = drought.drop(columns=['Year', 'Month', 'Day'])
drought

Unnamed: 0,FIPS,County,State,Sq_Mi_No_Drought,Sq_Mi_Abnormally_Dry,Sq_Mi_Moderate_Drought,Sq_Mi_Severe_Drought,Sq_Mi_Extreme_Drought,Sq_Mi_Exceptional_Drought,ValidStart,FIPSYearMonth
0,01033,Colbert County,AL,371.40,253.63,0.0,0.0,0.0,0.0,2021-12-28,01033202112
1,01033,Colbert County,AL,592.73,32.30,0.0,0.0,0.0,0.0,2021-12-21,01033202112
2,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-14,01033202112
3,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-07,01033202112
4,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-11-30,01033202111
...,...,...,...,...,...,...,...,...,...,...,...
3278155,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-29,53029200201
3278156,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-22,53029200201
3278157,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-15,53029200201
3278158,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-08,53029200201


In [25]:
drought['Date'] = pd.to_datetime(drought['ValidStart'])
drought

Unnamed: 0,FIPS,County,State,Sq_Mi_No_Drought,Sq_Mi_Abnormally_Dry,Sq_Mi_Moderate_Drought,Sq_Mi_Severe_Drought,Sq_Mi_Extreme_Drought,Sq_Mi_Exceptional_Drought,ValidStart,FIPSYearMonth,Date
0,01033,Colbert County,AL,371.40,253.63,0.0,0.0,0.0,0.0,2021-12-28,01033202112,2021-12-28
1,01033,Colbert County,AL,592.73,32.30,0.0,0.0,0.0,0.0,2021-12-21,01033202112,2021-12-21
2,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-14,01033202112,2021-12-14
3,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-12-07,01033202112,2021-12-07
4,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,2021-11-30,01033202111,2021-11-30
...,...,...,...,...,...,...,...,...,...,...,...,...
3278155,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-29,53029200201,2002-01-29
3278156,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-22,53029200201,2002-01-22
3278157,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-15,53029200201,2002-01-15
3278158,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,2002-01-08,53029200201,2002-01-08


In [26]:
drought = drought.drop(columns=['ValidStart'])
drought

Unnamed: 0,FIPS,County,State,Sq_Mi_No_Drought,Sq_Mi_Abnormally_Dry,Sq_Mi_Moderate_Drought,Sq_Mi_Severe_Drought,Sq_Mi_Extreme_Drought,Sq_Mi_Exceptional_Drought,FIPSYearMonth,Date
0,01033,Colbert County,AL,371.40,253.63,0.0,0.0,0.0,0.0,01033202112,2021-12-28
1,01033,Colbert County,AL,592.73,32.30,0.0,0.0,0.0,0.0,01033202112,2021-12-21
2,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,01033202112,2021-12-14
3,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,01033202112,2021-12-07
4,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,01033202111,2021-11-30
...,...,...,...,...,...,...,...,...,...,...,...
3278155,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,53029200201,2002-01-29
3278156,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,53029200201,2002-01-22
3278157,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,53029200201,2002-01-15
3278158,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,53029200201,2002-01-08


In [27]:
#rearranging the columns
drought = drought[['Date',
                   'FIPS',
                   'County',
                   'State',
                   'Sq_Mi_No_Drought',
                   'Sq_Mi_Abnormally_Dry',
                   'Sq_Mi_Moderate_Drought',
                   'Sq_Mi_Severe_Drought',
                   'Sq_Mi_Extreme_Drought',
                   'Sq_Mi_Exceptional_Drought',
                   'FIPSYearMonth']]
drought

Unnamed: 0,Date,FIPS,County,State,Sq_Mi_No_Drought,Sq_Mi_Abnormally_Dry,Sq_Mi_Moderate_Drought,Sq_Mi_Severe_Drought,Sq_Mi_Extreme_Drought,Sq_Mi_Exceptional_Drought,FIPSYearMonth
0,2021-12-28,01033,Colbert County,AL,371.40,253.63,0.0,0.0,0.0,0.0,01033202112
1,2021-12-21,01033,Colbert County,AL,592.73,32.30,0.0,0.0,0.0,0.0,01033202112
2,2021-12-14,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,01033202112
3,2021-12-07,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,01033202112
4,2021-11-30,01033,Colbert County,AL,625.03,0.00,0.0,0.0,0.0,0.0,01033202111
...,...,...,...,...,...,...,...,...,...,...,...
3278155,2002-01-29,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,53029200201
3278156,2002-01-22,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,53029200201
3278157,2002-01-15,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,53029200201
3278158,2002-01-08,53029,Island County,WA,213.97,0.00,0.0,0.0,0.0,0.0,53029200201


In [28]:
drought.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3278160 entries, 0 to 3278159
Data columns (total 11 columns):
 #   Column                     Dtype         
---  ------                     -----         
 0   Date                       datetime64[ns]
 1   FIPS                       object        
 2   County                     object        
 3   State                      object        
 4   Sq_Mi_No_Drought           float64       
 5   Sq_Mi_Abnormally_Dry       float64       
 6   Sq_Mi_Moderate_Drought     float64       
 7   Sq_Mi_Severe_Drought       float64       
 8   Sq_Mi_Extreme_Drought      float64       
 9   Sq_Mi_Exceptional_Drought  float64       
 10  FIPSYearMonth              object        
dtypes: datetime64[ns](1), float64(6), object(4)
memory usage: 275.1+ MB


Our drought data is cleaned and merged, with no missing or duplicated values. The dataframe is *mostly* numeric, with 3 columns that will likely stay categorical (FIPS, State, and County). Once we have merged this dataframe with the **weather** and **ag** dataframes, we can drop the FIPSYearMonth column.

We can export this to a new CSV file and then compress and move the old data from the drought1 and drought2 CSVs to our storage drive.

In [29]:
#saving our squeaky clean dataframe to csv
drought.to_csv('drought.csv', index=False)
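
One thing to watch when this file is read back in a later notebook: CSV does not store data types, so the leading zeros in FIPS and FIPSYearMonth can be lost on import. The sketch below uses an in-memory buffer instead of the real drought.csv and shows one way to keep them.

```python
import io
import pandas as pd

# A one-row stand-in for the cleaned dataframe.
clean = pd.DataFrame({
    'Date': pd.to_datetime(['2021-12-28']),
    'FIPS': ['01033'],
    'FIPSYearMonth': ['01033202112'],
})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Default read: FIPS is inferred as an integer and the leading zero is dropped.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Read with explicit types: the keys stay as text and Date is parsed again.
buffer.seek(0)
typed = pd.read_csv(buffer,
                    dtype={'FIPS': str, 'FIPSYearMonth': str},
                    parse_dates=['Date'])

print(naive['FIPS'].iloc[0], typed['FIPS'].iloc[0])
```

Passing the same `dtype` and `parse_dates` arguments when loading drought.csv should avoid repeating the zfill step downstream.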