In [1]:
## IT'S DANGEROUS TO GO ALONE! TAKE THIS:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Soil, Data Wrangling - Weather
---
**Let the Soil Play its Simple Part**

Greg Sakowski
* Book 2 of 7
* Reformatting 4 sets of weather data and combining them into one dataframe.
* Reading from CSV: roughMinTemp.csv, roughMaxTemp.csv, roughAvgTemp.csv, roughPrecip.csv
* Writing to CSV: avg_temp.csv, min_temp.csv, max_temp.csv, precip.csv, weather.csv

---
## Table of Contents:

[Tidy Data, Tidy Models](#Tidy-Data,-Tidy-Models)

[Reformatting Workflow](#Reformatting-Workflow)

[Combining Dataframes](#Combining-Dataframes)


## Data Sources

We have three data sources, each with slightly different types of data. The goal is to end up with either weekly data (with imputed/interpolated weather data) or monthly data (with binned and averaged drought data).

- **NOAA Weather data** - We'll focus on this
    * These are 4 separate text files. For each dataframe, we need to slice out of the fips/year column the two-digit number that signifies what the data *is*. Next, we pivot the 12 month columns (one per month) into two vertical columns: one for the month and one for the value (precip/avg temp/min temp/max temp). We should then be able to combine the 4 dataframes using the fips/year column as an index, and drop the extra month columns.

- US Drought monitor

- USDA/NASS Census and Survey of Ag data

---

# Tidy Data, Tidy Models
---
### Getting the Weather data into tidy columns.

The weather data started out as a text file that was converted to .csv in the acquisition notebook. We are starting with 13 columns:
* The first column is for the state/county FIPS # combined with a two digit number indicating the type of data and the year in YYYY format.

* The next 12 columns are for each month of the year.

What we want is a 'tidy' dataframe. The notion of tidy data came from an introductory book on **R***. The goal of tidying up one's data is to make the data follow these three rules:

1. Each variable must have its own column
2. Each observation must have its own row
3. Each value must have its own cell

In R, this is accomplished with a handful of functions from the tidyverse library. For our purposes the **gather** and **spread** functions are most salient.

**Gather** will take a number of columns that are holding the same variable and pivot them vertically. In our case, that means gathering all of the month columns and creating a single 'month' column and a single 'min_temp' column (or max_temp/precip, etc.). Knowing that this function is commonplace in R led to a search for examples of code that accomplishes this same task in pandas.

*R for Data Science by Hadley Wickham & Garrett Grolemund
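
As a rough illustration of what gathering means in pandas, the sketch below uses `pd.melt` on a made-up two-row table with only three month columns. The frame `wide` and its values are a hypothetical sample shaped like the weather data, not the real file.

```python
import pandas as pd

# Hypothetical two-row wide table: one column per month (only three months shown).
wide = pd.DataFrame({
    "FIPS28Year": ["01001281895", "01001281896"],
    "01": [34.2, 34.4],
    "02": [27.7, 37.2],
    "03": [43.4, 42.6],
})

# pd.melt keeps id_vars untouched and stacks value_vars into two columns:
# one holding the old column label, one holding the value.
tidy = pd.melt(
    wide,
    id_vars=["FIPS28Year"],
    value_vars=["01", "02", "03"],
    var_name="Month",
    value_name="mintemp",
)
print(tidy)
```

Each original row becomes one row per month, so the two wide rows turn into six tidy rows.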

In [2]:
#starting with min_temp
min_temp = pd.read_csv('data/roughMinTemp.csv')
min_temp

Unnamed: 0,FIPS28Year,01,02,03,04,05,06,07,08,09,10,11,12
0,1001281895,34.2,27.7,43.4,51.8,59.3,67.4,69.7,70.3,67.1,46.9,42.1,32.5
1,1001281896,34.4,37.2,42.6,57.0,65.0,67.9,71.4,71.7,65.0,52.2,46.1,35.9
2,1001281897,33.2,41.5,51.2,50.9,56.8,69.2,71.4,69.3,64.4,53.4,41.7,37.7
3,1001281898,39.6,34.4,49.1,47.1,60.4,69.1,70.2,69.6,65.7,50.6,38.6,32.7
4,1001281899,33.6,29.6,44.3,51.3,64.1,68.4,69.9,70.4,61.1,54.8,43.2,34.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290282018,-9.1,-1.1,6.1,16.1,34.4,44.4,49.9,44.0,35.6,23.6,6.8,-3.7
400632,50290282019,-9.0,1.2,16.8,19.8,36.8,48.0,51.9,42.8,36.0,21.6,5.1,-11.4
400633,50290282020,-22.8,-16.7,-1.6,18.4,36.3,45.9,47.1,44.8,34.9,18.3,-0.9,-3.2
400634,50290282021,-3.2,-17.2,-6.3,11.0,33.4,46.3,49.2,43.9,31.0,20.4,-7.4,-7.8


In [3]:
#test driving a gather function from the below repository:
# https://gist.github.com/derekpowell/5f97dabdd0730e68380fa1a00cd34ac4
#docstring is my own

def gather( df, key, value, cols ):
    
    '''Accepts four arguments:
        df = the dataframe
        key = name of the new column that stores the gathered column labels
        value = name of the new column that stores the gathered observations
        cols = list of the columns to gather; every other column is kept as an identifier'''
    
    #every column that is not being gathered stays as an identifier column
    id_vars = [ col for col in df.columns if col not in cols ]
    #keyword arguments keep the call readable and independent of argument order
    return pd.melt( df, id_vars=id_vars, value_vars=cols, var_name=key, value_name=value )

In [4]:
#making a 5 row test set for translations and column splitting
min_temp_small = min_temp[:5].copy()
min_temp_small

Unnamed: 0,FIPS28Year,01,02,03,04,05,06,07,08,09,10,11,12
0,1001281895,34.2,27.7,43.4,51.8,59.3,67.4,69.7,70.3,67.1,46.9,42.1,32.5
1,1001281896,34.4,37.2,42.6,57.0,65.0,67.9,71.4,71.7,65.0,52.2,46.1,35.9
2,1001281897,33.2,41.5,51.2,50.9,56.8,69.2,71.4,69.3,64.4,53.4,41.7,37.7
3,1001281898,39.6,34.4,49.1,47.1,60.4,69.1,70.2,69.6,65.7,50.6,38.6,32.7
4,1001281899,33.6,29.6,44.3,51.3,64.1,68.4,69.9,70.4,61.1,54.8,43.2,34.1


In [5]:
#trying the gather function on the 5 row test set
gather(df=min_temp_small,
       key='Month',
       value='mintemp',
       cols=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'])
#yep, that works, now let's fix the first column

Unnamed: 0,FIPS28Year,Month,mintemp
0,1001281895,1,34.2
1,1001281896,1,34.4
2,1001281897,1,33.2
3,1001281898,1,39.6
4,1001281899,1,33.6
5,1001281895,2,27.7
6,1001281896,2,37.2
7,1001281897,2,41.5
8,1001281898,2,34.4
9,1001281899,2,29.6


Et voila, the borrowed 'gather' function performed as desired. Now to move on to fixing the first column.

The 'FIPSXXYear' column in each of the four weather CSVs needs to be split up into three columns:

* FIPS - this is currently either 4 or 5 digits, but *should* be 5 digits for all values. The first two digits are for the state and the last 3 digits for the county in that state. We will need to add a leading zero whenever the FIPS number is only 4 digits long.
* 28/27/02/01 - the 2 digit code that signifies what data is held in the dataframe. This is a holdover from downloading the data from the NOAA database-- we'll be dropping this column.
* Year - the year in YYYY format, this column will be kept as is and once we have a tidy dataframe with all four of the weather variables we can slice out a copy with data from 2002 to 2021.
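
The three-way split described above can be sketched on a couple of sample values. The named groups and the `^...$` anchors are additions for illustration (the notebook itself uses unnamed groups), and the name `code` for the middle group is hypothetical.

```python
import pandas as pd

# Sample values as integers, the way read_csv parses them, so the leading zero is lost.
raw = pd.Series([1001281895, 50290282021], name="FIPS28Year")

# Pad back to the full 11 characters: 5 (FIPS) + 2 (data code) + 4 (year).
padded = raw.astype(str).str.zfill(11)

# Named groups make str.extract return labelled columns directly.
parts = padded.str.extract(r"^(?P<FIPS>\d{5})(?P<code>\d{2})(?P<Year>\d{4})$")
print(parts)
```

Splitting by fixed widths avoids using the two-digit code as a delimiter, which could also match digits inside the FIPS number or the year.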

In [6]:
#setting the fips/year column to be a string
min_temp_small['FIPS28Year'] = min_temp_small['FIPS28Year'].astype(str)

In [7]:
#making sure it worked
min_temp_small.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   FIPS28Year  5 non-null      object 
 1   01          5 non-null      float64
 2   02          5 non-null      float64
 3   03          5 non-null      float64
 4   04          5 non-null      float64
 5   05          5 non-null      float64
 6   06          5 non-null      float64
 7   07          5 non-null      float64
 8   08          5 non-null      float64
 9   09          5 non-null      float64
 10  10          5 non-null      float64
 11  11          5 non-null      float64
 12  12          5 non-null      float64
dtypes: float64(12), object(1)
memory usage: 648.0+ bytes


In [8]:
#adding in the leading zero
min_temp_small['FIPS28Year'] = min_temp_small['FIPS28Year'].str.zfill(11)
min_temp_small

Unnamed: 0,FIPS28Year,01,02,03,04,05,06,07,08,09,10,11,12
0,01001281895,34.2,27.7,43.4,51.8,59.3,67.4,69.7,70.3,67.1,46.9,42.1,32.5
1,01001281896,34.4,37.2,42.6,57.0,65.0,67.9,71.4,71.7,65.0,52.2,46.1,35.9
2,01001281897,33.2,41.5,51.2,50.9,56.8,69.2,71.4,69.3,64.4,53.4,41.7,37.7
3,01001281898,39.6,34.4,49.1,47.1,60.4,69.1,70.2,69.6,65.7,50.6,38.6,32.7
4,01001281899,33.6,29.6,44.3,51.3,64.1,68.4,69.9,70.4,61.1,54.8,43.2,34.1
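
A related option, sketched here under the assumption that the CSV text itself still carries the leading zero (which may not be true of the files used in this notebook), is to ask `read_csv` to keep the column as text with its `dtype` parameter. The CSV content below is a hypothetical one-row sample.

```python
import io
import pandas as pd

# Hypothetical CSV text in which the leading zero is present in the file itself.
csv_text = "FIPS28Year,01,02\n01001281895,34.2,27.7\n"

# Default parsing infers an integer column and silently drops the zero.
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the text exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"FIPS28Year": str})

print(as_int["FIPS28Year"].iloc[0], as_str["FIPS28Year"].iloc[0])
```

If the zero is already missing from the file, `astype(str)` followed by `str.zfill(11)` is still needed.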


In [9]:
#after some googling for ways to break up a string with regex (because using '28' as a delimiter would've been bad/wrong)
#I found an answer on Stack Overflow: 
#https://stackoverflow.com/questions/25252200/how-to-split-a-column-based-on-several-string-indices-using-pandas
#and tested it out in regex101 until I got the below code to behave

min_temp_small['FIPS28Year'].str.extract('(.{5})(.{2})(.{4})')

Unnamed: 0,0,1,2
0,01001,28,1895
1,01001,28,1896
2,01001,28,1897
3,01001,28,1898
4,01001,28,1899


In [10]:
#creating new columns with the extracted data
min_temp_small[['FIPS', '28', 'Year']] = min_temp_small['FIPS28Year'].str.extract('(.{5})(.{2})(.{4})')

In [11]:
#checking my work
min_temp_small

Unnamed: 0,FIPS28Year,01,02,03,04,05,06,07,08,09,10,11,12,FIPS,28,Year
0,01001281895,34.2,27.7,43.4,51.8,59.3,67.4,69.7,70.3,67.1,46.9,42.1,32.5,01001,28,1895
1,01001281896,34.4,37.2,42.6,57.0,65.0,67.9,71.4,71.7,65.0,52.2,46.1,35.9,01001,28,1896
2,01001281897,33.2,41.5,51.2,50.9,56.8,69.2,71.4,69.3,64.4,53.4,41.7,37.7,01001,28,1897
3,01001281898,39.6,34.4,49.1,47.1,60.4,69.1,70.2,69.6,65.7,50.6,38.6,32.7,01001,28,1898
4,01001281899,33.6,29.6,44.3,51.3,64.1,68.4,69.9,70.4,61.1,54.8,43.2,34.1,01001,28,1899


In [12]:
#adding in the gather function
min_temp_small = gather(df=min_temp_small,
       key='Month',
       value='mintemp',
       cols=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'])
min_temp_small

Unnamed: 0,FIPS28Year,FIPS,28,Year,Month,mintemp
0,01001281895,01001,28,1895,01,34.2
1,01001281896,01001,28,1896,01,34.4
2,01001281897,01001,28,1897,01,33.2
3,01001281898,01001,28,1898,01,39.6
4,01001281899,01001,28,1899,01,33.6
5,01001281895,01001,28,1895,02,27.7
6,01001281896,01001,28,1896,02,37.2
7,01001281897,01001,28,1897,02,41.5
8,01001281898,01001,28,1898,02,34.4
9,01001281899,01001,28,1899,02,29.6


In [13]:
#and dropping the FIPS28Year and 28 columns
min_temp_small = min_temp_small.drop(columns=['FIPS28Year', '28'])
min_temp_small

Unnamed: 0,FIPS,Year,Month,mintemp
0,01001,1895,01,34.2
1,01001,1896,01,34.4
2,01001,1897,01,33.2
3,01001,1898,01,39.6
4,01001,1899,01,33.6
5,01001,1895,02,27.7
6,01001,1896,02,37.2
7,01001,1897,02,41.5
8,01001,1898,02,34.4
9,01001,1899,02,29.6


The last thing we will need is a key column that combines the FIPS, year, and month, so it is easy to join the dataframes together into one weather dataframe.

In [14]:
#recombining the FIPS and year columns so I have something to join the columns on
min_temp_small['FIPSYearMonth'] = min_temp_small['FIPS'] + min_temp_small['Year'] + min_temp_small['Month']
min_temp_small

Unnamed: 0,FIPS,Year,Month,mintemp,FIPSYearMonth
0,01001,1895,01,34.2,01001189501
1,01001,1896,01,34.4,01001189601
2,01001,1897,01,33.2,01001189701
3,01001,1898,01,39.6,01001189801
4,01001,1899,01,33.6,01001189901
5,01001,1895,02,27.7,01001189502
6,01001,1896,02,37.2,01001189602
7,01001,1897,02,41.5,01001189702
8,01001,1898,02,34.4,01001189802
9,01001,1899,02,29.6,01001189902
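
To show how such a key gets used, here is a small sketch that joins two tidy frames in two ways: on the three separate columns, and on the concatenated string key built in the cell above. The frames `min_df` and `max_df` are hypothetical two-row samples, and merging on a list of columns is an alternative shown for comparison rather than the notebook's approach.

```python
import pandas as pd

# Hypothetical tidy frames for two variables, sharing FIPS/Year/Month.
min_df = pd.DataFrame({
    "FIPS": ["01001", "01001"],
    "Year": ["1895", "1895"],
    "Month": ["01", "02"],
    "mintemp": [34.2, 27.7],
})
max_df = pd.DataFrame({
    "FIPS": ["01001", "01001"],
    "Year": ["1895", "1895"],
    "Month": ["01", "02"],
    "maxtemp": [53.7, 48.7],
})

# Alternative: merge on the three columns directly, no extra key column needed.
by_cols = min_df.merge(max_df, on=["FIPS", "Year", "Month"])

# Notebook approach: build one concatenated string key, then merge on it.
for df in (min_df, max_df):
    df["FIPSYearMonth"] = df["FIPS"] + df["Year"] + df["Month"]
by_key = min_df.merge(max_df[["FIPSYearMonth", "maxtemp"]], on="FIPSYearMonth")

print(by_key)
```

The string concatenation only works while all three columns are still strings, so the key has to be built before Month and Year are cast to integers.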


Now, to load in the other three sets of weather data.

In [15]:
max_temp = pd.read_csv('data/roughMaxTemp.csv')
max_temp

Unnamed: 0,FIPS27Year,01,02,03,04,05,06,07,08,09,10,11,12
0,1001271895,53.7,48.7,67.6,76.4,81.9,89.2,91.1,90.4,90.9,76.0,66.6,58.0
1,1001271896,54.2,60.8,65.3,81.6,88.5,88.2,92.0,94.5,90.8,77.2,69.9,58.7
2,1001271897,54.2,63.1,71.4,75.1,83.2,95.6,93.3,89.9,88.9,81.3,68.1,58.8
3,1001271898,60.6,59.1,71.0,72.0,89.5,93.9,91.5,88.8,86.7,73.6,61.7,55.7
4,1001271899,55.6,53.4,68.8,73.4,89.3,93.7,92.2,92.6,87.5,78.4,68.1,56.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290272018,4.6,13.4,24.2,35.2,52.9,64.8,70.2,57.9,51.7,37.2,16.6,9.1
400632,50290272019,5.4,17.6,33.2,38.7,57.3,70.7,71.9,60.6,51.8,33.4,16.4,1.6
400633,50290272020,-9.6,2.0,17.3,36.2,59.1,66.0,64.8,65.3,49.4,31.4,12.7,9.4
400634,50290272021,10.4,-0.7,15.1,34.9,55.2,66.4,66.8,58.2,47.2,31.0,4.2,7.6


In [16]:
avg_temp = pd.read_csv('data/roughAvgTemp.csv')
avg_temp

Unnamed: 0,FIPS02Year,01,02,03,04,05,06,07,08,09,10,11,12
0,1001021895,44.0,38.2,55.5,64.1,70.6,78.3,80.4,80.4,79.0,61.4,54.4,45.3
1,1001021896,44.3,49.0,54.0,69.3,76.8,78.0,81.7,83.1,77.9,64.7,58.0,47.3
2,1001021897,43.7,52.3,61.3,63.0,70.0,82.4,82.4,79.6,76.6,67.4,54.9,48.2
3,1001021898,50.1,46.8,60.1,59.6,75.0,81.5,80.8,79.2,76.2,62.1,50.2,44.2
4,1001021899,44.6,41.5,56.6,62.3,76.7,81.0,81.0,81.5,74.3,66.6,55.7,45.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290022018,-2.2,6.1,15.1,25.6,43.6,54.6,60.0,51.0,43.6,30.4,11.7,2.7
400632,50290022019,-1.8,9.4,25.0,29.2,47.0,59.3,62.0,51.7,43.9,27.5,10.7,-4.9
400633,50290022020,-16.2,-7.4,7.9,27.3,47.7,55.9,55.9,55.1,42.1,24.8,5.9,3.1
400634,50290022021,3.6,-9.0,4.4,22.9,44.3,56.4,58.0,51.1,39.2,25.7,-1.6,-0.1


In [17]:
precip = pd.read_csv('data/roughPrecip.csv')
precip

Unnamed: 0,FIPS01Year,01,02,03,04,05,06,07,08,09,10,11,12
0,1001011895,7.03,2.96,8.36,3.53,3.96,5.40,3.92,3.36,0.73,2.03,1.44,3.66
1,1001011896,5.86,5.42,5.54,3.98,3.77,6.24,4.38,2.57,0.82,1.66,2.89,1.94
2,1001011897,3.27,6.63,10.94,4.35,0.81,1.57,3.96,5.02,0.87,0.75,1.84,4.38
3,1001011898,2.33,2.07,2.60,4.56,0.54,3.13,5.80,6.02,1.51,3.21,6.66,3.91
4,1001011899,5.80,6.94,3.35,2.22,2.93,2.31,6.80,2.90,0.63,3.02,1.98,5.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290012018,0.47,1.04,0.80,0.80,1.36,1.97,1.71,4.58,1.95,1.43,0.95,0.88
400632,50290012019,0.52,1.31,0.93,0.52,1.18,1.08,1.67,3.56,2.54,2.27,2.25,0.58
400633,50290012020,0.34,0.71,1.04,1.58,0.39,2.45,2.46,2.06,2.57,0.94,1.12,0.52
400634,50290012021,0.43,0.63,0.86,0.95,0.53,1.82,2.82,4.48,1.96,1.40,0.90,3.19


### Cleaning note:
The temperature data has missing values represented with **-99.99** (an impossibly low temperature in Fahrenheit for the contiguous USA), while the precipitation data has missing values represented with **-9.99** (impossible as a precipitation total, but a distinctly possible temperature, one that is unfortunately common in the author's hometown in January and February).

Because of this difference in how missing values are presented, we will pre-clean the precipitation data. Because we are mainly interested in the years after 2000, we can wait to do this cleaning until the end of the reformatting of the precip dataframe.
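
One common way to neutralise a sentinel like this, sketched below on hypothetical rows, is to replace it with `NaN` so it cannot be mistaken for a measurement once the variables sit side by side. This is an illustration only; the notebook instead inspects where the flagged rows fall and then trims the data to the years of interest.

```python
import numpy as np
import pandas as pd

# Hypothetical tidy rows; -9.99 is the precipitation missing-value marker.
precip_demo = pd.DataFrame({
    "Year": [2021, 2022, 2022],
    "Month": [12, 7, 8],
    "precip": [3.19, 2.10, -9.99],
})

# Swap the sentinel for NaN so averages and joins treat it as missing.
precip_demo["precip"] = precip_demo["precip"].replace(-9.99, np.nan)

print(precip_demo["precip"].isna().sum())
```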

Let's start with precip! We have 6 steps to clean up each of the weather dataframes: 

# Reformatting Workflow
---

**1. Convert to string**
- `min_temp_small['FIPS28Year'] = min_temp_small['FIPS28Year'].astype(str)`

**2. Add leading zeros**
- `min_temp_small['FIPS28Year'] = min_temp_small['FIPS28Year'].str.zfill(11)`

**3. Gather the month columns**
- `min_temp_small = gather(df=min_temp_small, key='Month', value='mintemp', cols=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'])`

**4. Extract and create new columns**
- `min_temp_small[['FIPS', '28', 'Year']] = min_temp_small['FIPS28Year'].str.extract('(.{5})(.{2})(.{4})')`

**5. Drop the old combined column and the unnecessary label column**
- `min_temp_small = min_temp_small.drop(columns=['FIPS28Year', '28'])`

**6. Make a column from the FIPS, Year, and Month so when we join everything we have a key to join on**
- `min_temp_small['FIPSYearMonth'] = min_temp_small['FIPS'] + min_temp_small['Year'] + min_temp_small['Month']`
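
Since the same six steps are applied to all four dataframes, they could also be wrapped in one function. The sketch below is one possible shape for that. The name `reformat_weather`, its parameters, the `MONTHS` constant, and the one-row `demo` frame are all hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

MONTHS = [f"{m:02d}" for m in range(1, 13)]  # '01' through '12'

def reformat_weather(df, combined_col, value_name):
    """Hypothetical helper: apply the six steps to one wide weather frame."""
    out = df.copy()
    # Steps 1-2: convert to string and pad to 11 characters.
    out[combined_col] = out[combined_col].astype(str).str.zfill(11)
    # Step 3: gather the twelve month columns into Month/value pairs.
    out = out.melt(id_vars=[combined_col], value_vars=MONTHS,
                   var_name="Month", value_name=value_name)
    # Step 4: split the combined column by fixed widths (5 + 2 + 4).
    out[["FIPS", "code", "Year"]] = out[combined_col].str.extract(r"(.{5})(.{2})(.{4})")
    # Step 5: drop the combined column and the two-digit data code.
    out = out.drop(columns=[combined_col, "code"])
    # Step 6: build the join key while everything is still a string.
    out["FIPSYearMonth"] = out["FIPS"] + out["Year"] + out["Month"]
    return out

# One made-up row: the month columns simply hold 1.0 through 12.0.
demo = pd.DataFrame([[1001281895] + [float(m) for m in range(1, 13)]],
                    columns=["FIPS28Year"] + MONTHS)
tidy = reformat_weather(demo, "FIPS28Year", "mintemp")
print(tidy.head())
```

With a helper like this, each dataset would need only one call, for example `reformat_weather(precip, 'FIPS01Year', 'precip')`, and the two-digit code column no longer needs a different name per dataset.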



## Reformatting precip
---

In [18]:
#Convert to string and add leading zeros
precip['FIPS01Year'] = precip['FIPS01Year'].astype(str)
precip['FIPS01Year'] = precip['FIPS01Year'].str.zfill(11)
precip

Unnamed: 0,FIPS01Year,01,02,03,04,05,06,07,08,09,10,11,12
0,01001011895,7.03,2.96,8.36,3.53,3.96,5.40,3.92,3.36,0.73,2.03,1.44,3.66
1,01001011896,5.86,5.42,5.54,3.98,3.77,6.24,4.38,2.57,0.82,1.66,2.89,1.94
2,01001011897,3.27,6.63,10.94,4.35,0.81,1.57,3.96,5.02,0.87,0.75,1.84,4.38
3,01001011898,2.33,2.07,2.60,4.56,0.54,3.13,5.80,6.02,1.51,3.21,6.66,3.91
4,01001011899,5.80,6.94,3.35,2.22,2.93,2.31,6.80,2.90,0.63,3.02,1.98,5.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290012018,0.47,1.04,0.80,0.80,1.36,1.97,1.71,4.58,1.95,1.43,0.95,0.88
400632,50290012019,0.52,1.31,0.93,0.52,1.18,1.08,1.67,3.56,2.54,2.27,2.25,0.58
400633,50290012020,0.34,0.71,1.04,1.58,0.39,2.45,2.46,2.06,2.57,0.94,1.12,0.52
400634,50290012021,0.43,0.63,0.86,0.95,0.53,1.82,2.82,4.48,1.96,1.40,0.90,3.19


In [19]:
#gather the month data into one column
precip = gather(df=precip,
       key='Month',
       value='precip',
       cols=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'])

In [20]:
#extracting and adding newly split-up columns
precip[['FIPS', '01', 'Year']] = precip['FIPS01Year'].str.extract('(.{5})(.{2})(.{4})')
precip

Unnamed: 0,FIPS01Year,Month,precip,FIPS,01,Year
0,01001011895,01,7.03,01001,01,1895
1,01001011896,01,5.86,01001,01,1896
2,01001011897,01,3.27,01001,01,1897
3,01001011898,01,2.33,01001,01,1898
4,01001011899,01,5.80,01001,01,1899
...,...,...,...,...,...,...
4807627,50290012018,12,0.88,50290,01,2018
4807628,50290012019,12,0.58,50290,01,2019
4807629,50290012020,12,0.52,50290,01,2020
4807630,50290012021,12,3.19,50290,01,2021


In [21]:
#dropping the extra columns, adding in the joining column
precip = precip.drop(columns=['FIPS01Year', '01'])
precip['FIPSYearMonth'] = precip['FIPS'] + precip['Year'] + precip['Month']
precip

Unnamed: 0,Month,precip,FIPS,Year,FIPSYearMonth
0,01,7.03,01001,1895,01001189501
1,01,5.86,01001,1896,01001189601
2,01,3.27,01001,1897,01001189701
3,01,2.33,01001,1898,01001189801
4,01,5.80,01001,1899,01001189901
...,...,...,...,...,...
4807627,12,0.88,50290,2018,50290201812
4807628,12,0.58,50290,2019,50290201912
4807629,12,0.52,50290,2020,50290202012
4807630,12,3.19,50290,2021,50290202112


Our precip dataframe is in the right format, so now let's check for missing data. If there is quite a bit of it outside the 20 years we are planning on using, we can cut out the chunk of data we are going to use and perform the changes there.

In [22]:
#check for missing values
precip[precip['precip']==-9.99]

Unnamed: 0,Month,precip,FIPS,Year,FIPSYearMonth
2804579,08,-9.99,01001,2022,01001202208
2804707,08,-9.99,01003,2022,01003202208
2804835,08,-9.99,01005,2022,01005202208
2804963,08,-9.99,01007,2022,01007202208
2805091,08,-9.99,01009,2022,01009202208
...,...,...,...,...,...
4807239,12,-9.99,50230,2022,50230202212
4807337,12,-9.99,50240,2022,50240202212
4807435,12,-9.99,50275,2022,50275202212
4807533,12,-9.99,50282,2022,50282202212


In [23]:
#check for missing values that aren't in 2022
precip[
    (precip['precip']==-9.99) & (precip['Year']!='2022')
]

Unnamed: 0,Month,precip,FIPS,Year,FIPSYearMonth


In [24]:
#checking for nulls
precip.isnull().sum()

Month            0
precip           0
FIPS             0
Year             0
FIPSYearMonth    0
dtype: int64

It seems that the precipitation data is whole, apart from the last five months of 2022 (August through December), which makes sense considering the data was pulled in August of 2022. We can check the number of unique FIPS codes, which should be ~3,000, and multiply that by 5. If the product is close to the row count from the first search, then we should have complete data.

In [25]:
print(f"There are {(precip['FIPS'].nunique())} unique FIPS numbers.\n\
There theoretically are 5 months missing for each, so we should have {(precip['FIPS'].nunique())*5} rows with -9.99.\n\
That matches the row count above of {precip[precip['precip']==-9.99].shape[0]}.")

There are 3137 unique FIPS numbers.
There theoretically are 5 months missing for each, so we should have 15685 rows with -9.99.
That matches the row count above of 15685.
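
The count check above can be complemented by grouping the flagged rows, which shows directly which months are affected and whether every county is missing the same number of them. The frame below is a hypothetical miniature (two counties, three months of 2022), not the real data.

```python
import pandas as pd

# Hypothetical miniature of the tidy precip frame: two counties, 2022 only,
# with -9.99 marking months that had not been reported yet.
demo = pd.DataFrame({
    "FIPS":  ["01001"] * 3 + ["01003"] * 3,
    "Year":  ["2022"] * 6,
    "Month": ["07", "08", "09"] * 2,
    "precip": [4.10, -9.99, -9.99, 3.80, -9.99, -9.99],
})

missing = demo[demo["precip"] == -9.99]

# Flagged rows per (Year, Month): a clean gap shows one row per county.
per_month = missing.groupby(["Year", "Month"]).size()
print(per_month)

# Flagged rows per county: every county should have the same count.
per_county = missing.groupby("FIPS").size()
print(per_county.nunique() == 1)
```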


In [26]:
precip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4807632 entries, 0 to 4807631
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   Month          object 
 1   precip         float64
 2   FIPS           object 
 3   Year           object 
 4   FIPSYearMonth  object 
dtypes: float64(1), object(4)
memory usage: 183.4+ MB


Wonderful! The precip dataframe is clean and complete. We can update the data types for Month and Year to integers and create a copy with all of the years from 2002 through 2021. We'll call this precip20. 

Then we can repeat the reformatting process for the other three dataframes and combine them.

In [27]:
precip['Month'] = precip['Month'].astype(int)
precip['Year'] = precip['Year'].astype(int)
precip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4807632 entries, 0 to 4807631
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   Month          int32  
 1   precip         float64
 2   FIPS           object 
 3   Year           int32  
 4   FIPSYearMonth  object 
dtypes: float64(1), int32(2), object(2)
memory usage: 146.7+ MB


In [28]:
precip20 = precip[(precip['Year']>2001) & (precip['Year']!=2022)].copy()
precip20

Unnamed: 0,Month,precip,FIPS,Year,FIPSYearMonth
107,1,4.70,01001,2002,01001200201
108,1,2.62,01001,2003,01001200301
109,1,2.60,01001,2004,01001200401
110,1,3.08,01001,2005,01001200501
111,1,5.69,01001,2006,01001200601
...,...,...,...,...,...
4807626,12,0.82,50290,2017,50290201712
4807627,12,0.88,50290,2018,50290201812
4807628,12,0.58,50290,2019,50290201912
4807629,12,0.52,50290,2020,50290202012


## Reformatting max_temp
---

In [29]:
#Checking the two digit number that splits the FIPS and year values for max_temp
max_temp

Unnamed: 0,FIPS27Year,01,02,03,04,05,06,07,08,09,10,11,12
0,1001271895,53.7,48.7,67.6,76.4,81.9,89.2,91.1,90.4,90.9,76.0,66.6,58.0
1,1001271896,54.2,60.8,65.3,81.6,88.5,88.2,92.0,94.5,90.8,77.2,69.9,58.7
2,1001271897,54.2,63.1,71.4,75.1,83.2,95.6,93.3,89.9,88.9,81.3,68.1,58.8
3,1001271898,60.6,59.1,71.0,72.0,89.5,93.9,91.5,88.8,86.7,73.6,61.7,55.7
4,1001271899,55.6,53.4,68.8,73.4,89.3,93.7,92.2,92.6,87.5,78.4,68.1,56.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290272018,4.6,13.4,24.2,35.2,52.9,64.8,70.2,57.9,51.7,37.2,16.6,9.1
400632,50290272019,5.4,17.6,33.2,38.7,57.3,70.7,71.9,60.6,51.8,33.4,16.4,1.6
400633,50290272020,-9.6,2.0,17.3,36.2,59.1,66.0,64.8,65.3,49.4,31.4,12.7,9.4
400634,50290272021,10.4,-0.7,15.1,34.9,55.2,66.4,66.8,58.2,47.2,31.0,4.2,7.6


In [30]:
#Convert to string and add leading zeros
max_temp['FIPS27Year'] = max_temp['FIPS27Year'].astype(str)
max_temp['FIPS27Year'] = max_temp['FIPS27Year'].str.zfill(11)
max_temp

Unnamed: 0,FIPS27Year,01,02,03,04,05,06,07,08,09,10,11,12
0,01001271895,53.7,48.7,67.6,76.4,81.9,89.2,91.1,90.4,90.9,76.0,66.6,58.0
1,01001271896,54.2,60.8,65.3,81.6,88.5,88.2,92.0,94.5,90.8,77.2,69.9,58.7
2,01001271897,54.2,63.1,71.4,75.1,83.2,95.6,93.3,89.9,88.9,81.3,68.1,58.8
3,01001271898,60.6,59.1,71.0,72.0,89.5,93.9,91.5,88.8,86.7,73.6,61.7,55.7
4,01001271899,55.6,53.4,68.8,73.4,89.3,93.7,92.2,92.6,87.5,78.4,68.1,56.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290272018,4.6,13.4,24.2,35.2,52.9,64.8,70.2,57.9,51.7,37.2,16.6,9.1
400632,50290272019,5.4,17.6,33.2,38.7,57.3,70.7,71.9,60.6,51.8,33.4,16.4,1.6
400633,50290272020,-9.6,2.0,17.3,36.2,59.1,66.0,64.8,65.3,49.4,31.4,12.7,9.4
400634,50290272021,10.4,-0.7,15.1,34.9,55.2,66.4,66.8,58.2,47.2,31.0,4.2,7.6


In [31]:
#gather the month data into one column
max_temp = gather(df=max_temp,
       key='Month',
       value='maxtemp',
       cols=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'])
max_temp

Unnamed: 0,FIPS27Year,Month,maxtemp
0,01001271895,01,53.7
1,01001271896,01,54.2
2,01001271897,01,54.2
3,01001271898,01,60.6
4,01001271899,01,55.6
...,...,...,...
4807627,50290272018,12,9.1
4807628,50290272019,12,1.6
4807629,50290272020,12,9.4
4807630,50290272021,12,7.6


In [32]:
#extracting and adding newly split-up columns
max_temp[['FIPS', '27', 'Year']] = max_temp['FIPS27Year'].str.extract('(.{5})(.{2})(.{4})')
max_temp

Unnamed: 0,FIPS27Year,Month,maxtemp,FIPS,27,Year
0,01001271895,01,53.7,01001,27,1895
1,01001271896,01,54.2,01001,27,1896
2,01001271897,01,54.2,01001,27,1897
3,01001271898,01,60.6,01001,27,1898
4,01001271899,01,55.6,01001,27,1899
...,...,...,...,...,...,...
4807627,50290272018,12,9.1,50290,27,2018
4807628,50290272019,12,1.6,50290,27,2019
4807629,50290272020,12,9.4,50290,27,2020
4807630,50290272021,12,7.6,50290,27,2021


In [33]:
#dropping the extra columns, adding in the joining column
max_temp = max_temp.drop(columns=['FIPS27Year', '27'])
max_temp['FIPSYearMonth'] = max_temp['FIPS'] + max_temp['Year'] + max_temp['Month']
max_temp

Unnamed: 0,Month,maxtemp,FIPS,Year,FIPSYearMonth
0,01,53.7,01001,1895,01001189501
1,01,54.2,01001,1896,01001189601
2,01,54.2,01001,1897,01001189701
3,01,60.6,01001,1898,01001189801
4,01,55.6,01001,1899,01001189901
...,...,...,...,...,...
4807627,12,9.1,50290,2018,50290201812
4807628,12,1.6,50290,2019,50290201912
4807629,12,9.4,50290,2020,50290202012
4807630,12,7.6,50290,2021,50290202112


In [34]:
max_temp['Month'] = max_temp['Month'].astype(int)
max_temp['Year'] = max_temp['Year'].astype(int)
max_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4807632 entries, 0 to 4807631
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   Month          int32  
 1   maxtemp        float64
 2   FIPS           object 
 3   Year           int32  
 4   FIPSYearMonth  object 
dtypes: float64(1), int32(2), object(2)
memory usage: 146.7+ MB


In [35]:
#setting up a subsegment for 2002 through 2021
max_temp20 = max_temp[(max_temp['Year']>2001) & (max_temp['Year']!=2022)].copy()
max_temp20

Unnamed: 0,Month,maxtemp,FIPS,Year,FIPSYearMonth
107,1,59.2,01001,2002,01001200201
108,1,52.6,01001,2003,01001200301
109,1,56.2,01001,2004,01001200401
110,1,61.6,01001,2005,01001200501
111,1,63.8,01001,2006,01001200601
...,...,...,...,...,...
4807626,12,20.4,50290,2017,50290201712
4807627,12,9.1,50290,2018,50290201812
4807628,12,1.6,50290,2019,50290201912
4807629,12,9.4,50290,2020,50290202012


In [36]:
#checking for missing-value sentinels in max_temp20 (any value at or below -99.9)
max_temp20[max_temp20['maxtemp']<=-99.9]

Unnamed: 0,Month,maxtemp,FIPS,Year,FIPSYearMonth


In [37]:
#checking for nulls
max_temp20.isnull().sum()

Month            0
maxtemp          0
FIPS             0
Year             0
FIPSYearMonth    0
dtype: int64

## Reformatting min_temp
---

In [38]:
#Checking the two digit number that splits the FIPS and year values for min_temp
min_temp

Unnamed: 0,FIPS28Year,01,02,03,04,05,06,07,08,09,10,11,12
0,1001281895,34.2,27.7,43.4,51.8,59.3,67.4,69.7,70.3,67.1,46.9,42.1,32.5
1,1001281896,34.4,37.2,42.6,57.0,65.0,67.9,71.4,71.7,65.0,52.2,46.1,35.9
2,1001281897,33.2,41.5,51.2,50.9,56.8,69.2,71.4,69.3,64.4,53.4,41.7,37.7
3,1001281898,39.6,34.4,49.1,47.1,60.4,69.1,70.2,69.6,65.7,50.6,38.6,32.7
4,1001281899,33.6,29.6,44.3,51.3,64.1,68.4,69.9,70.4,61.1,54.8,43.2,34.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290282018,-9.1,-1.1,6.1,16.1,34.4,44.4,49.9,44.0,35.6,23.6,6.8,-3.7
400632,50290282019,-9.0,1.2,16.8,19.8,36.8,48.0,51.9,42.8,36.0,21.6,5.1,-11.4
400633,50290282020,-22.8,-16.7,-1.6,18.4,36.3,45.9,47.1,44.8,34.9,18.3,-0.9,-3.2
400634,50290282021,-3.2,-17.2,-6.3,11.0,33.4,46.3,49.2,43.9,31.0,20.4,-7.4,-7.8


In [39]:
#Convert to string and add leading zeros
min_temp['FIPS28Year'] = min_temp['FIPS28Year'].astype(str)
min_temp['FIPS28Year'] = min_temp['FIPS28Year'].str.zfill(11)
min_temp

Unnamed: 0,FIPS28Year,01,02,03,04,05,06,07,08,09,10,11,12
0,01001281895,34.2,27.7,43.4,51.8,59.3,67.4,69.7,70.3,67.1,46.9,42.1,32.5
1,01001281896,34.4,37.2,42.6,57.0,65.0,67.9,71.4,71.7,65.0,52.2,46.1,35.9
2,01001281897,33.2,41.5,51.2,50.9,56.8,69.2,71.4,69.3,64.4,53.4,41.7,37.7
3,01001281898,39.6,34.4,49.1,47.1,60.4,69.1,70.2,69.6,65.7,50.6,38.6,32.7
4,01001281899,33.6,29.6,44.3,51.3,64.1,68.4,69.9,70.4,61.1,54.8,43.2,34.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290282018,-9.1,-1.1,6.1,16.1,34.4,44.4,49.9,44.0,35.6,23.6,6.8,-3.7
400632,50290282019,-9.0,1.2,16.8,19.8,36.8,48.0,51.9,42.8,36.0,21.6,5.1,-11.4
400633,50290282020,-22.8,-16.7,-1.6,18.4,36.3,45.9,47.1,44.8,34.9,18.3,-0.9,-3.2
400634,50290282021,-3.2,-17.2,-6.3,11.0,33.4,46.3,49.2,43.9,31.0,20.4,-7.4,-7.8


In [40]:
#gather the month data into one column
min_temp = gather(df=min_temp,
       key='Month',
       value='mintemp',
       cols=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'])
min_temp

Unnamed: 0,FIPS28Year,Month,mintemp
0,01001281895,01,34.2
1,01001281896,01,34.4
2,01001281897,01,33.2
3,01001281898,01,39.6
4,01001281899,01,33.6
...,...,...,...
4807627,50290282018,12,-3.7
4807628,50290282019,12,-11.4
4807629,50290282020,12,-3.2
4807630,50290282021,12,-7.8


In [41]:
#extracting and adding newly split-up columns
min_temp[['FIPS', '28', 'Year']] = min_temp['FIPS28Year'].str.extract('(.{5})(.{2})(.{4})')
min_temp

Unnamed: 0,FIPS28Year,Month,mintemp,FIPS,28,Year
0,01001281895,01,34.2,01001,28,1895
1,01001281896,01,34.4,01001,28,1896
2,01001281897,01,33.2,01001,28,1897
3,01001281898,01,39.6,01001,28,1898
4,01001281899,01,33.6,01001,28,1899
...,...,...,...,...,...,...
4807627,50290282018,12,-3.7,50290,28,2018
4807628,50290282019,12,-11.4,50290,28,2019
4807629,50290282020,12,-3.2,50290,28,2020
4807630,50290282021,12,-7.8,50290,28,2021


In [42]:
#dropping the extra columns, adding in the joining column
min_temp = min_temp.drop(columns=['FIPS28Year', '28'])
min_temp['FIPSYearMonth'] = min_temp['FIPS'] + min_temp['Year'] + min_temp['Month']
min_temp

Unnamed: 0,Month,mintemp,FIPS,Year,FIPSYearMonth
0,01,34.2,01001,1895,01001189501
1,01,34.4,01001,1896,01001189601
2,01,33.2,01001,1897,01001189701
3,01,39.6,01001,1898,01001189801
4,01,33.6,01001,1899,01001189901
...,...,...,...,...,...
4807627,12,-3.7,50290,2018,50290201812
4807628,12,-11.4,50290,2019,50290201912
4807629,12,-3.2,50290,2020,50290202012
4807630,12,-7.8,50290,2021,50290202112


In [43]:
min_temp['Month'] = min_temp['Month'].astype(int)
min_temp['Year'] = min_temp['Year'].astype(int)
min_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4807632 entries, 0 to 4807631
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   Month          int32  
 1   mintemp        float64
 2   FIPS           object 
 3   Year           int32  
 4   FIPSYearMonth  object 
dtypes: float64(1), int32(2), object(2)
memory usage: 146.7+ MB


In [44]:
#setting up a subsegment for 2002 through 2021
min_temp20 = min_temp[(min_temp['Year']>2001) & (min_temp['Year']!=2022)].copy()
min_temp20

Unnamed: 0,Month,mintemp,FIPS,Year,FIPSYearMonth
107,1,36.1,01001,2002,01001200201
108,1,29.2,01001,2003,01001200301
109,1,33.8,01001,2004,01001200401
110,1,38.1,01001,2005,01001200501
111,1,40.1,01001,2006,01001200601
...,...,...,...,...,...
4807626,12,8.0,50290,2017,50290201712
4807627,12,-3.7,50290,2018,50290201812
4807628,12,-11.4,50290,2019,50290201912
4807629,12,-3.2,50290,2020,50290202012


In [45]:
#checking for missing-value sentinels in min_temp20 (any value at or below -99.9)
min_temp20[min_temp20['mintemp']<=-99.9]

Unnamed: 0,Month,mintemp,FIPS,Year,FIPSYearMonth


In [46]:
#checking for nulls
min_temp20.isnull().sum()

Month            0
mintemp          0
FIPS             0
Year             0
FIPSYearMonth    0
dtype: int64

## Reformatting avg_temp
---

In [47]:
#Checking the two digit number that splits the FIPS and year values for avg_temp
avg_temp

Unnamed: 0,FIPS02Year,01,02,03,04,05,06,07,08,09,10,11,12
0,1001021895,44.0,38.2,55.5,64.1,70.6,78.3,80.4,80.4,79.0,61.4,54.4,45.3
1,1001021896,44.3,49.0,54.0,69.3,76.8,78.0,81.7,83.1,77.9,64.7,58.0,47.3
2,1001021897,43.7,52.3,61.3,63.0,70.0,82.4,82.4,79.6,76.6,67.4,54.9,48.2
3,1001021898,50.1,46.8,60.1,59.6,75.0,81.5,80.8,79.2,76.2,62.1,50.2,44.2
4,1001021899,44.6,41.5,56.6,62.3,76.7,81.0,81.0,81.5,74.3,66.6,55.7,45.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290022018,-2.2,6.1,15.1,25.6,43.6,54.6,60.0,51.0,43.6,30.4,11.7,2.7
400632,50290022019,-1.8,9.4,25.0,29.2,47.0,59.3,62.0,51.7,43.9,27.5,10.7,-4.9
400633,50290022020,-16.2,-7.4,7.9,27.3,47.7,55.9,55.9,55.1,42.1,24.8,5.9,3.1
400634,50290022021,3.6,-9.0,4.4,22.9,44.3,56.4,58.0,51.1,39.2,25.7,-1.6,-0.1


In [48]:
#Convert to string and add leading zeros
avg_temp['FIPS02Year'] = avg_temp['FIPS02Year'].astype(str)
avg_temp['FIPS02Year'] = avg_temp['FIPS02Year'].str.zfill(11)
avg_temp

Unnamed: 0,FIPS02Year,01,02,03,04,05,06,07,08,09,10,11,12
0,01001021895,44.0,38.2,55.5,64.1,70.6,78.3,80.4,80.4,79.0,61.4,54.4,45.3
1,01001021896,44.3,49.0,54.0,69.3,76.8,78.0,81.7,83.1,77.9,64.7,58.0,47.3
2,01001021897,43.7,52.3,61.3,63.0,70.0,82.4,82.4,79.6,76.6,67.4,54.9,48.2
3,01001021898,50.1,46.8,60.1,59.6,75.0,81.5,80.8,79.2,76.2,62.1,50.2,44.2
4,01001021899,44.6,41.5,56.6,62.3,76.7,81.0,81.0,81.5,74.3,66.6,55.7,45.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
400631,50290022018,-2.2,6.1,15.1,25.6,43.6,54.6,60.0,51.0,43.6,30.4,11.7,2.7
400632,50290022019,-1.8,9.4,25.0,29.2,47.0,59.3,62.0,51.7,43.9,27.5,10.7,-4.9
400633,50290022020,-16.2,-7.4,7.9,27.3,47.7,55.9,55.9,55.1,42.1,24.8,5.9,3.1
400634,50290022021,3.6,-9.0,4.4,22.9,44.3,56.4,58.0,51.1,39.2,25.7,-1.6,-0.1


In [49]:
#gather the month data into one column
avg_temp = gather(df=avg_temp,
       key='Month',
       value='avgtemp',
       cols=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'])
avg_temp

Unnamed: 0,FIPS02Year,Month,avgtemp
0,01001021895,01,44.0
1,01001021896,01,44.3
2,01001021897,01,43.7
3,01001021898,01,50.1
4,01001021899,01,44.6
...,...,...,...
4807627,50290022018,12,2.7
4807628,50290022019,12,-4.9
4807629,50290022020,12,3.1
4807630,50290022021,12,-0.1


In [50]:
#extracting and adding newly split-up columns
avg_temp[['FIPS', '02', 'Year']] = avg_temp['FIPS02Year'].str.extract('(.{5})(.{2})(.{4})')
avg_temp

Unnamed: 0,FIPS02Year,Month,avgtemp,FIPS,02,Year
0,01001021895,01,44.0,01001,02,1895
1,01001021896,01,44.3,01001,02,1896
2,01001021897,01,43.7,01001,02,1897
3,01001021898,01,50.1,01001,02,1898
4,01001021899,01,44.6,01001,02,1899
...,...,...,...,...,...,...
4807627,50290022018,12,2.7,50290,02,2018
4807628,50290022019,12,-4.9,50290,02,2019
4807629,50290022020,12,3.1,50290,02,2020
4807630,50290022021,12,-0.1,50290,02,2021


In [51]:
#dropping the extra columns, adding in the joining column
avg_temp = avg_temp.drop(columns=['FIPS02Year', '02'])
avg_temp['FIPSYearMonth'] = avg_temp['FIPS'] + avg_temp['Year'] + avg_temp['Month']
avg_temp

Unnamed: 0,Month,avgtemp,FIPS,Year,FIPSYearMonth
0,01,44.0,01001,1895,01001189501
1,01,44.3,01001,1896,01001189601
2,01,43.7,01001,1897,01001189701
3,01,50.1,01001,1898,01001189801
4,01,44.6,01001,1899,01001189901
...,...,...,...,...,...
4807627,12,2.7,50290,2018,50290201812
4807628,12,-4.9,50290,2019,50290201912
4807629,12,3.1,50290,2020,50290202012
4807630,12,-0.1,50290,2021,50290202112


In [52]:
avg_temp['Month'] = avg_temp['Month'].astype(int)
avg_temp['Year'] = avg_temp['Year'].astype(int)
avg_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4807632 entries, 0 to 4807631
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   Month          int32  
 1   avgtemp        float64
 2   FIPS           object 
 3   Year           int32  
 4   FIPSYearMonth  object 
dtypes: float64(1), int32(2), object(2)
memory usage: 146.7+ MB


In [53]:
#setting up a subsegment for 2002 through 2021
avg_temp20 = avg_temp[(avg_temp['Year']>2001) & (avg_temp['Year']!=2022)].copy()
avg_temp20

Unnamed: 0,Month,avgtemp,FIPS,Year,FIPSYearMonth
107,1,47.6,01001,2002,01001200201
108,1,41.0,01001,2003,01001200301
109,1,45.0,01001,2004,01001200401
110,1,49.9,01001,2005,01001200501
111,1,52.0,01001,2006,01001200601
...,...,...,...,...,...
4807626,12,14.2,50290,2017,50290201712
4807627,12,2.7,50290,2018,50290201812
4807628,12,-4.9,50290,2019,50290201912
4807629,12,3.1,50290,2020,50290202012


In [54]:
#checking for missing values in avg_temp20
avg_temp20[avg_temp20['avgtemp']==-99.9]

Unnamed: 0,Month,avgtemp,FIPS,Year,FIPSYearMonth


In [55]:
#checking for nulls
avg_temp20.isnull().sum()

Month            0
avgtemp          0
FIPS             0
Year             0
FIPSYearMonth    0
dtype: int64
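
The reformatting steps in this section are the same ones applied to each of the four weather files. A minimal sketch of wrapping them in one function is below. `tidy_weather`, `MONTH_COLS`, and `raw_demo` are hypothetical names that do not appear in the notebook, and `DataFrame.melt` stands in for the notebook's `gather` helper, which is assumed here to behave like a melt.

```python
import pandas as pd

# Month column labels '01' through '12', as in the wide tables
MONTH_COLS = [f"{m:02d}" for m in range(1, 13)]

def tidy_weather(raw, id_col, value_name):
    """Hypothetical helper: reshape one wide NOAA table into tidy rows."""
    df = raw.copy()
    # Restore the leading zero lost when the ID was read as an integer
    df[id_col] = df[id_col].astype(str).str.zfill(11)
    # melt is swapped in for gather(): one row per ID and month
    df = df.melt(id_vars=id_col, value_vars=MONTH_COLS,
                 var_name="Month", value_name=value_name)
    # Split the ID into FIPS (5 chars), element code (2, discarded), Year (4)
    parts = df[id_col].str.extract(r"^(.{5})(.{2})(.{4})$")
    df["FIPS"], df["Year"] = parts[0], parts[2]
    # Build the join key while Month and Year are still zero-padded strings
    df["FIPSYearMonth"] = df["FIPS"] + df["Year"] + df["Month"]
    df = df.drop(columns=[id_col])
    df["Month"] = df["Month"].astype(int)
    df["Year"] = df["Year"].astype(int)
    return df

# A single row copied from the avg_temp output above
raw_demo = pd.DataFrame(
    [[1001021895, 44.0, 38.2, 55.5, 64.1, 70.6, 78.3,
      80.4, 80.4, 79.0, 61.4, 54.4, 45.3]],
    columns=["FIPS02Year"] + MONTH_COLS,
)
tidy_demo = tidy_weather(raw_demo, "FIPS02Year", "avgtemp")
```

The key is built before the integer conversion on purpose: once `Month` is an int, the leading zero of '01' is gone and the key would no longer be 11 characters wide.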

# Combining Dataframes
---
We now have four tidied and abridged dataframes that can be combined into one dataframe, **weather**.

Each dataframe has a FIPSYearMonth column that serves as the merge key, so we can drop the Month, FIPS, and Year columns from all but one of them; keeping them everywhere would leave the merged dataframe with duplicate, suffixed copies of those columns. Luckily, none of the dataframes has missing data so far.

We'll keep precip20 whole and drop the columns from the temperature dataframes.
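
Because `pd.merge` performs an inner join by default, any key present in one dataframe but not the other is dropped without warning. A minimal sketch of guarding against that is below, using toy stand-ins (`precip_demo`, `avg_demo`, hypothetical names) built from two rows of the notebook's output rather than the real dataframes.

```python
import pandas as pd

# Toy stand-ins for precip20 and avg_temp20 (two rows each, for illustration)
precip_demo = pd.DataFrame({
    "FIPSYearMonth": ["01001200201", "01001200301"],
    "precip": [4.70, 2.62],
})
avg_demo = pd.DataFrame({
    "FIPSYearMonth": ["01001200201", "01001200301"],
    "avgtemp": [47.6, 41.0],
})

# validate="one_to_one" raises a MergeError if a key repeats on either side
merged_demo = pd.merge(precip_demo, avg_demo, on="FIPSYearMonth",
                       how="inner", validate="one_to_one")

# An inner join drops unmatched keys, so compare row counts afterwards
rows_lost = len(precip_demo) - len(merged_demo)
```

If `rows_lost` is zero and no error is raised, every county-month matched exactly once.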

In [56]:
#dropping the unnecessary columns
avg_temp20 = avg_temp20.drop(columns=['Month', 'Year', 'FIPS'])
min_temp20 = min_temp20.drop(columns=['Month', 'Year', 'FIPS'])
max_temp20 = max_temp20.drop(columns=['Month', 'Year', 'FIPS'])

In [57]:
#combining precip and avg_temp dataframes on the 'FIPSYearMonth' columns
weather = pd.merge(precip20, avg_temp20, on='FIPSYearMonth')
weather

Unnamed: 0,Month,precip,FIPS,Year,FIPSYearMonth,avgtemp
0,1,4.70,01001,2002,01001200201,47.6
1,1,2.62,01001,2003,01001200301,41.0
2,1,2.60,01001,2004,01001200401,45.0
3,1,3.08,01001,2005,01001200501,49.9
4,1,5.69,01001,2006,01001200601,52.0
...,...,...,...,...,...,...
752875,12,0.82,50290,2017,50290201712,14.2
752876,12,0.88,50290,2018,50290201812,2.7
752877,12,0.58,50290,2019,50290201912,-4.9
752878,12,0.52,50290,2020,50290202012,3.1


In [58]:
#Adding the min temp dataframe
weather = pd.merge(weather, min_temp20, on='FIPSYearMonth')
weather

Unnamed: 0,Month,precip,FIPS,Year,FIPSYearMonth,avgtemp,mintemp
0,1,4.70,01001,2002,01001200201,47.6,36.1
1,1,2.62,01001,2003,01001200301,41.0,29.2
2,1,2.60,01001,2004,01001200401,45.0,33.8
3,1,3.08,01001,2005,01001200501,49.9,38.1
4,1,5.69,01001,2006,01001200601,52.0,40.1
...,...,...,...,...,...,...,...
752875,12,0.82,50290,2017,50290201712,14.2,8.0
752876,12,0.88,50290,2018,50290201812,2.7,-3.7
752877,12,0.58,50290,2019,50290201912,-4.9,-11.4
752878,12,0.52,50290,2020,50290202012,3.1,-3.2


In [59]:
#adding the max temp dataframe
weather = pd.merge(weather, max_temp20, on='FIPSYearMonth')
weather

Unnamed: 0,Month,precip,FIPS,Year,FIPSYearMonth,avgtemp,mintemp,maxtemp
0,1,4.70,01001,2002,01001200201,47.6,36.1,59.2
1,1,2.62,01001,2003,01001200301,41.0,29.2,52.6
2,1,2.60,01001,2004,01001200401,45.0,33.8,56.2
3,1,3.08,01001,2005,01001200501,49.9,38.1,61.6
4,1,5.69,01001,2006,01001200601,52.0,40.1,63.8
...,...,...,...,...,...,...,...,...
752875,12,0.82,50290,2017,50290201712,14.2,8.0,20.4
752876,12,0.88,50290,2018,50290201812,2.7,-3.7,9.1
752877,12,0.58,50290,2019,50290201912,-4.9,-11.4,1.6
752878,12,0.52,50290,2020,50290202012,3.1,-3.2,9.4


In [60]:
#rearranging the columns
weather = weather[['FIPSYearMonth', 'FIPS', 'Year', 'Month', 'precip', 'mintemp', 'maxtemp', 'avgtemp']]
weather

Unnamed: 0,FIPSYearMonth,FIPS,Year,Month,precip,mintemp,maxtemp,avgtemp
0,01001200201,01001,2002,1,4.70,36.1,59.2,47.6
1,01001200301,01001,2003,1,2.62,29.2,52.6,41.0
2,01001200401,01001,2004,1,2.60,33.8,56.2,45.0
3,01001200501,01001,2005,1,3.08,38.1,61.6,49.9
4,01001200601,01001,2006,1,5.69,40.1,63.8,52.0
...,...,...,...,...,...,...,...,...
752875,50290201712,50290,2017,12,0.82,8.0,20.4,14.2
752876,50290201812,50290,2018,12,0.88,-3.7,9.1,2.7
752877,50290201912,50290,2019,12,0.58,-11.4,1.6,-4.9
752878,50290202012,50290,2020,12,0.52,-3.2,9.4,3.1


In [61]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 752880 entries, 0 to 752879
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   FIPSYearMonth  752880 non-null  object 
 1   FIPS           752880 non-null  object 
 2   Year           752880 non-null  int32  
 3   Month          752880 non-null  int32  
 4   precip         752880 non-null  float64
 5   mintemp        752880 non-null  float64
 6   maxtemp        752880 non-null  float64
 7   avgtemp        752880 non-null  float64
dtypes: float64(4), int32(2), object(2)
memory usage: 46.0+ MB


In [62]:
#double checking in case we created null values
weather.isnull().sum()
#bless the NOAA and their impeccably complete data

FIPSYearMonth    0
FIPS             0
Year             0
Month            0
precip           0
mintemp          0
maxtemp          0
avgtemp          0
dtype: int64

All of the weather data is clean, and the 20-year period we will be focusing on (2002 through 2021) is assembled into the **weather** dataframe. Now to export the four full dataframes and the **weather** dataframe to fresh CSV files.

We can remake the FIPSYearMonth column for each of the four full dataframes if need be, so prior to exporting we can save a bit of space and drop that column from them. The **weather** dataframe keeps it.
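
Rebuilding the key after a round trip through CSV takes a little care, since FIPS codes with a leading zero are read back as integers by default and Month is now stored as an int. A minimal sketch is below; it uses an in-memory string (`csv_text`, a hypothetical stand-in laid out like one of the exported files) instead of reading a file from disk.

```python
import io
import pandas as pd

# Stand-in for an exported file such as precip.csv (two illustrative rows)
csv_text = "Month,precip,FIPS,Year\n1,4.70,01001,2002\n1,2.62,01001,2003\n"

# Read FIPS as a string; default type inference would turn 01001 into 1001
reloaded = pd.read_csv(io.StringIO(csv_text), dtype={"FIPS": str})

# Rebuild the 11-character key: 5-digit FIPS + 4-digit year + 2-digit month
reloaded["FIPSYearMonth"] = (
    reloaded["FIPS"]
    + reloaded["Year"].astype(str)
    + reloaded["Month"].astype(str).str.zfill(2)
)
```

The same `dtype={"FIPS": str}` argument is worth passing whenever any of the exported files is read back in.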

In [63]:
#dropping the FIPSYearMonth column from the four full dataframes
precip = precip.drop(columns=['FIPSYearMonth'])
avg_temp = avg_temp.drop(columns=['FIPSYearMonth'])
min_temp = min_temp.drop(columns=['FIPSYearMonth'])
max_temp = max_temp.drop(columns=['FIPSYearMonth'])

In [64]:
#spit out the full cleaned dataframes and the combined weather dataframe
avg_temp.to_csv('avg_temp.csv', index=False)
min_temp.to_csv('min_temp.csv', index=False)
max_temp.to_csv('max_temp.csv', index=False)
precip.to_csv('precip.csv', index=False)
weather.to_csv('weather.csv', index=False)