# Examine Air Temperature and Count Freezing Days

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mrahnis/nb-streamgage/blob/main/Weather-02--Air-Temperature.ipynb)

In [1]:
# if the notebook is running in colab we'll get the data from github
HOST_IS_COLAB = 'google.colab' in str(get_ipython())

if HOST_IS_COLAB:
    path = 'https://github.com/mrahnis/nb-streamgage/blob/main'
    params = '?raw=true'
else:
    path = '.'
    params = ''

## Preliminaries

In [2]:
from datetime import datetime
import pytz
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load weather data

Historical weather data is available from NOAA Climate Data Online. This data is from the station at the Lancaster Filtration Plant (GHCND:USC00364763):
https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USC00364763/detail

We load the CSV file downloaded from NOAA, convert the date strings into datetime objects localized to EST (a fixed UTC-05:00 offset with no daylight saving time), and set them as the index.

In [3]:
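# build the location from `path` and `params` so the file also loads from GitHub in Colab
wx = pd.read_csv(f'{path}/data/noaa-daily-3080214.csv{params}')

# localize the whole column at once instead of mapping over each timestamp
wx['DATETIME'] = pd.to_datetime(wx['DATE']).dt.tz_localize('EST')
wx.set_index('DATETIME', inplace=True)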

## Count the subzero days for each year

A day counts as subzero when its daily maximum temperature (`TMAX`) is at or below zero, so `mask` is `True` on those days. We group the `True` days by calendar year and count the days in each year.

In [4]:
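# make a boolean (True/False) mask: True where the daily maximum is at or below zero
mask = (wx['TMAX'] <= 0).rename('SUBZERO_DAYS')

# keep only the True days, then count them per calendar year
subzero = mask[mask]
by_calendaryear = subzero.groupby(subzero.index.year).count()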
by_calendaryear.index.rename('YR', inplace=True)
by_calendaryear

YR
2008    12
2009    25
2010    22
2011    17
2012     5
2013    17
2014    31
2015    26
2016    17
2017    20
2018    16
2019    11
2020     5
2021     9
2022    12
Name: SUBZERO_DAYS, dtype: int64
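
Filtering to the `True` days before grouping drops any year that has no subzero days at all. As a hedged alternative, summing the boolean mask within each year counts the `True` values and keeps zero-count years in the result. The sketch below uses a small made-up series (`demo_mask` and its dates are hypothetical, not the NOAA data).

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `mask`: two days in each of three years
demo_idx = pd.to_datetime(
    ['2019-12-30', '2019-12-31', '2020-01-01', '2020-01-02', '2021-01-01', '2021-01-02']
).tz_localize('EST')
demo_mask = pd.Series(
    [True, True, False, False, True, False], index=demo_idx, name='SUBZERO_DAYS'
)

# True counts as 1 and False as 0, so the sum per year is the number of subzero days
demo_counts = demo_mask.groupby(demo_mask.index.year).sum()
demo_counts.index.rename('YR', inplace=True)
print(demo_counts)
```

Here 2020 appears with a count of 0 instead of being absent, which can be convenient when plotting or comparing years.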

Grouping by calendar year, however, splits each winter season across two years. We are more interested in how many subzero days occurred during each winter. The USGS uses the notion of a water year, which encompasses the Northern Hemisphere winter and spring months during which snowfall and snowmelt occur.

A Northern Hemisphere water year runs from October 1 to September 30 of the following calendar year. The water year is designated by the calendar year in which it ends. In the Southern Hemisphere the water year lasts from July 1 to June 30.

We can count by water year by shifting the dates three months into the future, which moves October through December into the following calendar year, and then grouping by the year of the shifted dates.

In [5]:
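# add 3 calendar months so that October through December land in the following year;
# a month-end frequency shift would leave most October dates in the same calendar year
subzero = mask[mask]
by_wateryear = subzero.groupby((subzero.index + pd.DateOffset(months=3)).year).count()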
by_wateryear.index.rename('WY', inplace=True)
by_wateryear

WY
2008     8
2009    23
2010    23
2011    22
2012     5
2013    12
2014    34
2015    28
2016    14
2017    14
2018    22
2019    13
2020     4
2021    11
2022    12
Name: SUBZERO_DAYS, dtype: int64
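
As a cross-check on the shift-based grouping, the water-year rule described above can also be written out explicitly: dates in October through December belong to the next calendar year's water year, and all other dates keep their own year. This is a minimal sketch; `water_year` and `demo_dates` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def water_year(index):
    """Return the Northern Hemisphere water year for each timestamp in a DatetimeIndex."""
    # October-December belong to the water year that ends the following September
    return pd.Index(np.where(index.month >= 10, index.year + 1, index.year), name='WY')

# Dates on either side of the October 1 boundary
demo_dates = pd.to_datetime(
    ['2009-09-30', '2009-10-01', '2009-12-31', '2010-01-01']
).tz_localize('EST')
demo_wy = water_year(demo_dates)
print(list(demo_wy))
```

Under the same assumptions, the notebook's counts could be reproduced by grouping the `True` days with `water_year(...)` applied to their index.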

Let's print the mask for January 2010. It contains runs of several consecutive days at or below freezing, so we can investigate how long these freezing periods last.

In [6]:
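# build the endpoints in the same fixed-offset zone as the index; passing a pytz zone
# through tzinfo= would apply its local-mean-time offset instead of -05:00
mask[pd.Timestamp('2010-01-01', tz='EST'):pd.Timestamp('2010-01-31', tz='EST')]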

DATETIME
2010-01-01 00:00:00-05:00    False
2010-01-02 00:00:00-05:00     True
2010-01-03 00:00:00-05:00     True
2010-01-04 00:00:00-05:00     True
2010-01-05 00:00:00-05:00     True
2010-01-06 00:00:00-05:00    False
2010-01-07 00:00:00-05:00    False
2010-01-08 00:00:00-05:00     True
2010-01-09 00:00:00-05:00     True
2010-01-10 00:00:00-05:00     True
2010-01-11 00:00:00-05:00     True
2010-01-12 00:00:00-05:00     True
2010-01-13 00:00:00-05:00     True
2010-01-14 00:00:00-05:00    False
2010-01-15 00:00:00-05:00    False
2010-01-16 00:00:00-05:00    False
2010-01-17 00:00:00-05:00    False
2010-01-18 00:00:00-05:00    False
2010-01-19 00:00:00-05:00    False
2010-01-20 00:00:00-05:00    False
2010-01-21 00:00:00-05:00    False
2010-01-22 00:00:00-05:00    False
2010-01-23 00:00:00-05:00    False
2010-01-24 00:00:00-05:00    False
2010-01-25 00:00:00-05:00    False
2010-01-26 00:00:00-05:00    False
2010-01-27 00:00:00-05:00    False
2010-01-28 00:00:00-05:00    False
2010-01-29 

## Label consecutive days of subzero/above-zero temperature

In [7]:
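# walk through the mask one day at a time and number each run of identical states
tab = pd.DataFrame()
label = 0
runlength = 0
prev_state = False
for ix, state in mask.items():
    if state == prev_state:
        # same state as the previous day: extend the current run
        runlength += 1
    else:
        # the state flipped: start a new run with a new label
        label += 1
        runlength = 1
    tab.loc[ix, 'mask'] = state
    tab.loc[ix, 'label'] = label
    tab.loc[ix, 'runlength'] = runlength
    prev_state = state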

In [8]:
tab

| | mask | label | runlength |
|---|---|---|---|
| 2008-01-01 00:00:00-05:00 | False | 0.0 | 1.0 |
| 2008-01-02 00:00:00-05:00 | False | 0.0 | 2.0 |
| 2008-01-03 00:00:00-05:00 | True | 1.0 | 1.0 |
| 2008-01-04 00:00:00-05:00 | False | 2.0 | 1.0 |
| 2008-01-05 00:00:00-05:00 | False | 2.0 | 2.0 |
| ... | ... | ... | ... |
| 2022-09-06 00:00:00-05:00 | False | 248.0 | 132.0 |
| 2022-09-07 00:00:00-05:00 | False | 248.0 | 133.0 |
| 2022-09-08 00:00:00-05:00 | False | 248.0 | 134.0 |
| 2022-09-09 00:00:00-05:00 | False | 248.0 | 135.0 |
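
Filling a DataFrame row by row can be slow on long records. The same labels and run lengths can be sketched with vectorized pandas operations: compare each value with the previous one to find where runs start, take a cumulative sum to number the runs, and use a cumulative count within each run for the run length. The example uses a short made-up series (`demo` is hypothetical) and assumes, like the loop, that the state before the first day is `False`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `mask`
demo = pd.Series([False, False, True, False, False, True, True])

# A run starts wherever the state differs from the previous day;
# fill_value=False mirrors the loop's initial prev_state
starts = demo != demo.shift(fill_value=False)

# Numbering the starts cumulatively gives one label per run
label = starts.cumsum()

# Position within each run, counted from 1
runlength = demo.groupby(label).cumcount() + 1

demo_tab = pd.DataFrame({'mask': demo, 'label': label, 'runlength': runlength})
print(demo_tab)
```

Unlike the loop, this version keeps `label` and `runlength` as integers rather than floats.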


## Count consecutive subzero days and sort by descending length

In [9]:
periods = tab[tab['mask']==True].groupby('label').size().reset_index(name='size')
periods.sort_values(by='size', ascending=False)

| | label | size |
|---|---|---|
| 89 | 179.0 | 13 |
| 73 | 147.0 | 9 |
| 30 | 61.0 | 8 |
| 78 | 157.0 | 7 |
| 23 | 47.0 | 6 |
| ... | ... | ... |
| 54 | 109.0 | 1 |
| 49 | 99.0 | 1 |
| 48 | 97.0 | 1 |
| 47 | 95.0 | 1 |


In [10]:
# check to see if this longest period looks right
tab[tab['label']==179]

| | mask | label | runlength |
|---|---|---|---|
| 2017-12-27 00:00:00-05:00 | True | 179.0 | 1.0 |
| 2017-12-28 00:00:00-05:00 | True | 179.0 | 2.0 |
| 2017-12-29 00:00:00-05:00 | True | 179.0 | 3.0 |
| 2017-12-30 00:00:00-05:00 | True | 179.0 | 4.0 |
| 2017-12-31 00:00:00-05:00 | True | 179.0 | 5.0 |
| 2018-01-01 00:00:00-05:00 | True | 179.0 | 6.0 |
| 2018-01-02 00:00:00-05:00 | True | 179.0 | 7.0 |
| 2018-01-03 00:00:00-05:00 | True | 179.0 | 8.0 |
| 2018-01-04 00:00:00-05:00 | True | 179.0 | 9.0 |
| 2018-01-05 00:00:00-05:00 | True | 179.0 | 10.0 |


## A nicer summary: Grouping and aggregating labeled periods

In [11]:
periods = tab.reset_index().groupby('label').agg(
    firstday=('index', 'first'),
    lastday=('index', 'last'),
    mask=('mask','first'),
    runlength=('runlength','max')
)
# the duration is inclusive of the last day, so we add a timedelta of 1 day
periods['duration'] = periods['lastday'] - periods['firstday'] + pd.Timedelta(days=1)

In [12]:
periods[periods['mask']==True].sort_values(by='duration', ascending=False)

| label | firstday | lastday | mask | runlength | duration |
|---|---|---|---|---|---|
| 179.0 | 2017-12-27 00:00:00-05:00 | 2018-01-08 00:00:00-05:00 | True | 13.0 | 13 days |
| 147.0 | 2015-02-13 00:00:00-05:00 | 2015-02-21 00:00:00-05:00 | True | 9.0 | 9 days |
| 61.0 | 2011-01-07 00:00:00-05:00 | 2011-01-14 00:00:00-05:00 | True | 8.0 | 8 days |
| 157.0 | 2016-01-18 00:00:00-05:00 | 2016-01-24 00:00:00-05:00 | True | 7.0 | 7 days |
| 47.0 | 2010-01-08 00:00:00-05:00 | 2010-01-13 00:00:00-05:00 | True | 6.0 | 6 days |
| ... | ... | ... | ... | ... | ... |
| 109.0 | 2014-02-04 00:00:00-05:00 | 2014-02-04 00:00:00-05:00 | True | 1.0 | 1 days |
| 99.0 | 2013-12-25 00:00:00-05:00 | 2013-12-25 00:00:00-05:00 | True | 1.0 | 1 days |
| 97.0 | 2013-12-17 00:00:00-05:00 | 2013-12-17 00:00:00-05:00 | True | 1.0 | 1 days |
| 95.0 | 2013-12-14 00:00:00-05:00 | 2013-12-14 00:00:00-05:00 | True | 1.0 | 1 days |
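
The summary carries two measures of each period: `runlength` counts records, while `duration` measures the calendar span. They agree only when the daily record has no missing days, so comparing them is one possible way to flag gaps. The sketch below assumes a small made-up table in the same shape as `tab` (`demo_tab` and its dates are hypothetical), with January 4 deliberately left out.

```python
import pandas as pd

# Hypothetical labeled table shaped like `tab`, with 2010-01-04 missing from the record
demo_idx = pd.to_datetime(
    ['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-05']
).tz_localize('EST')
demo_tab = pd.DataFrame(
    {'mask': [False, True, True, True], 'label': [0, 1, 1, 1], 'runlength': [1, 1, 2, 3]},
    index=demo_idx,
)

# Same aggregation pattern as the summary above
demo_periods = demo_tab.reset_index().groupby('label').agg(
    firstday=('index', 'first'),
    lastday=('index', 'last'),
    runlength=('runlength', 'max'),
)
demo_periods['duration'] = (
    demo_periods['lastday'] - demo_periods['firstday'] + pd.Timedelta(days=1)
)

# A period whose calendar span exceeds its record count contains missing days
has_gap = demo_periods['duration'].dt.days != demo_periods['runlength']
print(demo_periods[has_gap])
```

In this example the second period spans four calendar days but holds only three records, so it is flagged.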
