# Fetch 'daylight' data

_Primary Author: Peter King_

This notebook generates a pandas DataFrame showing daylight hours on the Winter Solstice in calendar year 2022 (21 Dec 2022) for the 50 U.S. states, plus Puerto Rico and Washington, D.C.  We will use this dataset in combination with data from the Behavioral Risk Factor Surveillance System (BRFSS) 2022 to investigate the correlation between winter daylight hours (which shorten with increasing latitude in the northern hemisphere) and self-reported feelings of depression, as measured by a novel "depression index" (DI).

## Methodology

We start by using the 'requests' library to fetch two plain text files that contain a list of states and territories, along with their capitals and associated latitudes and longitudes.  We then feed the latitudes and longitudes for the capitals into the U.S. Naval Observatory's (USNO) [Application Programming Interface](https://aa.usno.navy.mil/data/api) (API), to get sunrise and sunset times on the Winter Solstice for each geographic location.  Finally, we take the difference between sunset and sunrise to be the number of daylight hours available.

In [1]:
import requests
import json
import datetime
import time
import re
import numpy as np
import pandas as pd

PATH = 'data/'

## Data Wrangling -- Dealing with plain text data

We first get a list of all 50 states (plus Puerto Rico and Washington, D.C.), their capitals, and the lat/long of each capital.  We start by requesting raw data in the form of two text files.

To process the text files, we first split each file on the newline character ('\n').  For the capitals file, we split each line on a double space to separate the state code from the quoted capital name.  For the coordinates file, we use a regular expression to capture three fields from each line: the state code, latitude, and longitude.  We store each field in a pandas Series indexed by state code and then compile the Series into a DataFrame.  Pandas' indexing capability ensures consistency in data organization and facilitates joining this dataset with our primary dataset, the BRFSS 2022.
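As a minimal sketch of the regular-expression step, the snippet below applies the same character-class pattern to two made-up lines.  The sample strings are an assumption about the file layout (a state code followed by whitespace-separated latitude and longitude), not copies of the actual files.

```python
import re

import pandas as pd

# Hypothetical sample lines; the real file layout is assumed, not verified.
sample_rows = [
    "AL   32.361538   -86.279118",
    "AK   58.301935  -134.419740",
]

# One token = a run of capital letters, digits, dots, or minus signs.
token = re.compile(r"[A-Z.0-9\-]+")

records = {}
for row in sample_rows:
    code, lat, long = token.findall(row)
    records[code] = {"Lat": float(lat), "Long": float(long)}

# Rows are keyed by state code, which later serves as the join key.
sample_df = pd.DataFrame.from_dict(records, orient="index")
print(sample_df)
```

Converting to `float` is a choice made for this sketch; the notebook keeps the values as strings because they are later concatenated into the query URL.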

In [2]:
## Request data
URL_capital = 'https://people.sc.fsu.edu/~jburkardt/datasets/states/state_capitals_name.txt'
URL_lat_long = 'https://people.sc.fsu.edu/~jburkardt/datasets/states/state_capitals_ll.txt'
capital_data = requests.get(URL_capital)
lat_long_data = requests.get(URL_lat_long)

## Convert raw text data from URL request into pandas DataFrame

# Split on newline to create python lists
st_cap_list = capital_data.text.split('\n')
st_ll_list = lat_long_data.text.split('\n')
st_ll_list.pop()  # for some reason there is a dangling empty string

# For each list, create a pandas Series
state, capital = [], []
for row in st_cap_list:
    element = row.split('  ')
    state.append(element[0])
    capital.append(element[1].strip('"'))
capital_Ser = pd.Series(data=capital, index=state, name='Capital')

# Use a regular expression to convert each row of free text to a list of values
value = re.compile(r'[A-Z.0-9\-]+')  # raw string avoids an invalid-escape warning for '\-'
state, lat, long = [], [], []
for row in st_ll_list:
    element = value.findall(row)
    state.append(element[0])
    lat.append(element[1])
    long.append(element[2])
lat_Ser = pd.Series(data=lat, index=state, name='Lat')
long_Ser = pd.Series(data=long, index=state, name='Long')

# Assemble Series into DataFrame
geo_df = pd.DataFrame([capital_Ser, lat_Ser, long_Ser]).T
# Drop duplicated row for Washington, DC
geo_df.drop('US', inplace=True)

geo_df.head()

| | Capital | Lat | Long |
|---|---|---|---|
| AL | Montgomery | 32.361538 | -86.279118 |
| AK | Juneau | 58.301935 | -134.41974 |
| AZ | Phoenix | 33.448457 | -112.073844 |
| AR | Little Rock | 34.736009 | -92.331122 |
| CA | Sacramento | 38.555605 | -121.468926 |


In [3]:
len(geo_df.index) # Includes 50 states + Puerto Rico (PR) and District of Columbia (DC)

52

## Data Wrangling -- Working with the USNO API

The [USNO API](https://aa.usno.navy.mil/data/api) returns sunrise/sunset data in JSON format, so here we use the 'json' standard module to facilitate information capture.  Sunrise and sunset times are provided as strings, so we use the 'datetime' standard module to convert these to Python's 'datetime' data type.  Using this data type makes it easy to calculate daylight hours as a simple difference: daylight = sunset - sunrise.

We again use a pandas Series to capture the information for each state, and then we compile them into a DataFrame.
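To illustrate the conversion described above, here is a small sketch that parses a hand-written JSON string shaped like the fields the notebook reads (`properties`, then `data`, then `sundata`, where each entry has `phen` and `time`).  The payload is an assumption for illustration, with times taken from the Alabama row of the notebook output; a real response may contain additional fields.

```python
import json
from datetime import datetime

# Hypothetical, trimmed-down payload; only the keys used here are included.
payload = """
{"properties": {"data": {"sundata": [
    {"phen": "Begin Civil Twilight", "time": "04:15"},
    {"phen": "Rise", "time": "04:42"},
    {"phen": "Upper Transit", "time": "09:43"},
    {"phen": "Set", "time": "14:44"},
    {"phen": "End Civil Twilight", "time": "15:11"}
]}}}
"""

date = "2022-12-21"
sundata = json.loads(payload)["properties"]["data"]["sundata"]

# Map each phenomenon name to a full datetime on the solstice date.
events = {
    item["phen"]: datetime.strptime(f"{date} {item['time']}", "%Y-%m-%d %H:%M")
    for item in sundata
}

# Subtracting two datetimes yields a timedelta.
daylight = events["Set"] - events["Rise"]
print(daylight)  # 10:02:00
```

This sketch uses `datetime.strptime` from the standard library, whereas the notebook calls `pd.to_datetime`; either produces values that can be subtracted directly.  Looking phenomena up by name rather than by position avoids relying on the order of the entries.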

Note: Since the terms of use for the USNO API limit query frequency to 1 query per second, we use the 'time' standard module to enforce a 1-second delay between queries in our 'for' loop.  **This code block will take ~2 minutes to run due to the enforced time delay.**
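One way to keep a loop under a request-rate limit is to measure elapsed time and sleep only for the remainder of the interval.  The helper below is a hypothetical sketch (the name `throttled` and its `min_interval` parameter are ours, not part of any library); the notebook itself simply calls `time.sleep(1)` after each request.

```python
import time


def throttled(items, min_interval=1.0):
    """Yield items no faster than one per `min_interval` seconds."""
    last = None
    for item in items:
        if last is not None:
            # Sleep only for whatever is left of the interval.
            remaining = min_interval - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        last = time.monotonic()
        yield item


# Demonstration with a short interval so the example finishes quickly.
start = time.monotonic()
seen = [code for code in throttled(["AL", "AK", "AZ"], min_interval=0.05)]
elapsed = time.monotonic() - start
print(seen, round(elapsed, 2))
```

With this approach, time already spent waiting on a slow response counts toward the interval, so the loop does not pause longer than necessary.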

In [4]:
### Use data from geo_df to make a separate DataFrame for Daylight information

## For each state, query the USNO API for Sunrise/Sunset Data
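date = '2022-12-21'
daylight = []
for state in geo_df.index:
    # tz=-8 applies one fixed UTC offset to every location, so the clock times are
    # not local times; the Set - Rise difference does not depend on the offset
    URL = ('https://aa.usno.navy.mil/api/rstt/oneday?date=' + date + '&coords='
           + geo_df['Lat'].loc[state] + ',' + geo_df['Long'].loc[state]
           + '&tz=-8&dst=false'
          )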
    response = requests.get(URL)
    # Convert response data to json and then to pandas Series
    data_json = json.loads(response.text)
    sundata = data_json['properties']['data']['sundata']
    index = [sundata[i]['phen'] for i in range(5)]
    data = [pd.to_datetime(date + ' ' + sundata[i]['time']) for i in range(5)]
    row = pd.Series(data=data, index=index, name=state)
    # Add a column for Daylight Hours as the difference between Sunset and Sunrise
    row['Daylight Hours'] = row['Set'] - row['Rise']
    daylight.append(row)
    # Free API access is limited to one query per second
    time.sleep(1)

daylight_df = pd.DataFrame(daylight)

In [5]:
daylight_df.head()

| | Begin Civil Twilight | Rise | Upper Transit | Set | End Civil Twilight | Daylight Hours |
|---|---|---|---|---|---|---|
| AL | 2022-12-21 04:15:00 | 2022-12-21 04:42:00 | 2022-12-21 09:43:00 | 2022-12-21 14:44:00 | 2022-12-21 15:11:00 | 0 days 10:02:00 |
| AK | 2022-12-21 08:52:00 | 2022-12-21 09:45:00 | 2022-12-21 12:56:00 | 2022-12-21 16:07:00 | 2022-12-21 17:00:00 | 0 days 06:22:00 |
| AZ | 2022-12-21 06:01:00 | 2022-12-21 06:28:00 | 2022-12-21 11:26:00 | 2022-12-21 16:25:00 | 2022-12-21 16:52:00 | 0 days 09:57:00 |
| AR | 2022-12-21 04:45:00 | 2022-12-21 05:13:00 | 2022-12-21 10:07:00 | 2022-12-21 15:02:00 | 2022-12-21 15:30:00 | 0 days 09:49:00 |
| CA | 2022-12-21 06:50:00 | 2022-12-21 07:20:00 | 2022-12-21 12:04:00 | 2022-12-21 16:48:00 | 2022-12-21 17:18:00 | 0 days 09:28:00 |


Finally, we save the 'daylight' DataFrame to disk as a CSV file so that the analysis and visualization notebooks associated with this project can load it easily.  Note that we developed our code in Google's Colab environment.  Since our project needs to function in the Jupyter environment provided for SIADS 593 via Coursera, we provided a second option for the PATH string.

In [6]:
daylight_df.to_csv(PATH + 'daylight.csv', index_label='State')
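When the CSV is read back in another notebook, the `Daylight Hours` column arrives as text such as `0 days 10:02:00`.  A possible way to restore it and derive a numeric column is sketched below.  The in-memory CSV stands in for `daylight.csv` and holds two rows taken from the notebook output (with only a subset of the columns), and the column name `Daylight (h)` is our own.

```python
import io

import pandas as pd

# Stand-in for daylight.csv, using two rows from the notebook output.
csv_text = """State,Rise,Set,Daylight Hours
AL,2022-12-21 04:42:00,2022-12-21 14:44:00,0 days 10:02:00
AK,2022-12-21 09:45:00,2022-12-21 16:07:00,0 days 06:22:00
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="State",
    parse_dates=["Rise", "Set"],
)

# Restore the timedelta dtype, then express it as fractional hours.
df["Daylight Hours"] = pd.to_timedelta(df["Daylight Hours"])
df["Daylight (h)"] = df["Daylight Hours"].dt.total_seconds() / 3600

print(df["Daylight (h)"].round(2))
```

A plain numeric column of hours is convenient when computing a correlation against another numeric measure such as the depression index.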

## Record Dependencies

In [7]:
%load_ext watermark
%watermark
%watermark --iversions

Last updated: 2025-02-16T18:10:57.617775+00:00

Python implementation: CPython
Python version       : 3.10.11
IPython version      : 8.17.2

Compiler    : GCC 11.3.0
OS          : Linux
Release     : 6.5.0-1020-aws
Machine     : x86_64
Processor   : x86_64
CPU cores   : 64
Architecture: 64bit

re      : 2.2.1
json    : 2.0.9
requests: 2.31.0
pandas  : 2.0.2
numpy   : 1.24.3

