This is the fourth iteration of the Urban Heat Island (UHI) Effect project. This notebook will include data exploration, analysis, and predictive modeling using location and temperature data from various sources. The main key in this project is to use the SNURD (**S**ummer **N**ighttime **U**rban **R**ural **D**ifferential) which gives a score of whether an urbanized city is at a higher or lower risk of being affected by the UHI effect. [Here](https://www.sciencedirect.com/science/article/pii/S0303243418304653) is a paper where researchers created the metric using urban and rural temperature data spanning 40 days. The way I will be sorting this information is through percentiles. The SNURD metric will be sorted and will tell the user if their city's SNURD metric is in a certain percentile based on their score.

In [1]:
import seaborn as sns
import numpy as np 
import pandas as pd 
import csv

with open('hottest_cities.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(',') for line in stripped if line)
    with open('hottest_cities.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('City', 'State'))
        writer.writerows(lines)

I created a dataset based on this article from [Climate Central's hottest cities article from 2014](https://www.climatecentral.org/news/urban-heat-islands-threaten-us-health-17919#more). For the rural data, I am using the dataset provided by [Consumer Finance's rural area list](https://www.consumerfinance.gov/policy-compliance/guidance/rural-and-underserved-counties-list/). Said list also has a key feature, FIPS (**F**ederal **I**nformation **P**rocessing **S**tandards) code, which will be key in getting the temperature data from NOAA (**N**ational **O**ceanic and **A**tmospheric **A**ssociation). 

In [106]:
hottest_cities_df = pd.read_csv('hottest_cities.csv')
hottest_cities_df

Unnamed: 0,City,State
0,Albuquerque,NM
1,Columbus,OH
2,Denver,CO
3,Kansas City,MO
4,Las Vegas,NV
5,Louisville,KY
6,Minneapolis,IN
7,Portland,OR
8,Seattle,WA
9,Washington D.C.,MD


In [3]:
rural_places = pd.read_csv('cfpb_rural-list_2019.csv')
rural_places

Unnamed: 0,FIPS Code,County Name,State
0,1005,Barbour County,AL
1,1011,Bullock County,AL
2,1013,Butler County,AL
3,1019,Cherokee County,AL
4,1023,Choctaw County,AL
...,...,...,...
1603,56045,Weston County,WY
1604,72049,Culebra Municipio,PR
1605,72083,Las Marias Municipio,PR
1606,72093,Maricao Municipio,PR


In [4]:
hottest_cities_states = np.unique(hottest_cities_df['State'])
rural_places_filtered = rural_places[rural_places['State'].isin(hottest_cities_states)]
rural_places_filtered

Unnamed: 0,FIPS Code,County Name,State
0,1005,Barbour County,AL
1,1011,Bullock County,AL
2,1013,Butler County,AL
3,1019,Cherokee County,AL
4,1023,Choctaw County,AL
...,...,...,...
1579,55123,Vernon County,WI
1580,55125,Vilas County,WI
1581,55129,Washburn County,WI
1582,55135,Waupaca County,WI


I'll be attempting to do some further data processing at this stage. We want to ideally get the counties from the rural dataset that are the closest to the hottest cities. I'm using the geopy library and referencing the haversine distance from this [link](https://kanoki.org/2019/02/14/how-to-find-distance-between-two-points-based-on-latitude-and-longitude-using-python-and-sql/)
 in order to achieve this. 

In [22]:
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

geo_locator = Nominatim()
lat_long_hottest_cities = []
lat_long_rural = []
country = "USA"

def find_lat_long(city, country): 
    try:
        return geo_locator.geocode(city+','+country)
    except GeocoderTimedOut:
        return find_lat_long(city, country)
    
for city in np.array(hottest_cities_df['City']):
    location = find_lat_long(city[0], country)
    lat_long_hottest_cities.append((location.longitude, location.latitude))   

for county in np.array(rural_places_filtered['County Name']):
    location = find_lat_long(county[0], country)
    lat_long_rural.append((location.longitude, location.latitude)) 

  after removing the cwd from sys.path.


In [212]:
from math import radians, cos, sin, asin, sqrt

def haversine_distance(long_1, lat_1, long_2, lat_2):
    long_1, lat_1, long_2, lat_2 = map(radians, [long_1, lat_1, long_2, lat_2])
    
    difference_long = long_2 - long_1
    difference_lat = lat_2 - lat_1
    
    area = sin(difference_lat/2)**2 + cos(lat_1) * cos(lat_2) * sin(difference_long/2)**2
    circumference = 2 * asin(sqrt(area))
    radius = 6371 #radius of the earth in kilometers
    return circumference * radius
    
rural_distances = []
hottest_cities_df[['Longitude', 'Latitude']] = lat_long_hottest_cities
hottest_cities_unique = hottest_cities_df.drop_duplicates(['State'])
rural_places_filtered[['Longitude', 'Latitude']] = lat_long_rural
rural_places_filtered_1 = rural_places_filtered.dropna(subset=['Longitude', 'Latitude'])

for state_city, long_city, lat_city in zip(hottest_cities_unique['State'], hottest_cities_unique['Longitude'], 
                                           hottest_cities_unique['Latitude']):
    city_lat_long = (long_city, lat_city)
    for state_county, long_rural, lat_rural in zip(rural_places_filtered_1['State'], rural_places_filtered_1['Longitude'], 
                            rural_places_filtered_1['Latitude']): 
        if state_city == state_county:
            distance = haversine_distance(long_city, lat_city, long_rural, lat_rural)
            rural_distances.append((state_county, distance))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [213]:
rural_distances_sorted = sorted(rural_distances, key=lambda tup: tup[0])
rural_distances = [x[1] for x in rural_distances_sorted]

rural_places_filtered_1['Distance_KM'] = rural_distances
rural_places_filtered_1

Unnamed: 0,FIPS Code,County Name,State,Longitude,Latitude,Distance_KM
0,1005,Barbour County,AL,-93.610677,32.497136,0.000000
1,1011,Bullock County,AL,-93.610677,32.497136,0.000000
2,1013,Butler County,AL,-93.610677,32.497136,0.000000
3,1019,Cherokee County,AL,-96.867569,37.241315,605.343232
4,1023,Choctaw County,AL,-96.867569,37.241315,605.343232
...,...,...,...,...,...,...
1579,55123,Vernon County,WI,-78.492772,37.123224,98.524152
1580,55125,Vilas County,WI,-78.492772,37.123224,98.524152
1581,55129,Washburn County,WI,-97.093347,31.803481,1897.983986
1582,55135,Waupaca County,WI,-97.093347,31.803481,1897.983986


In [214]:
idx = rural_places_filtered_1.groupby(['State'], sort=False)['Distance_KM'].transform(min) == rural_places_filtered_1['Distance_KM']
rural_places_filtered_min_d = rural_places_filtered_1[idx]
rural_places_filtered_min_d

Unnamed: 0,FIPS Code,County Name,State,Longitude,Latitude,Distance_KM
0,1005,Barbour County,AL,-93.610677,32.497136,0.00000
1,1011,Bullock County,AL,-93.610677,32.497136,0.00000
2,1013,Butler County,AL,-93.610677,32.497136,0.00000
54,4001,Apache County,AZ,-84.394811,33.858778,2511.91238
114,6035,Lassen County,CA,-80.804549,41.234330,0.00000
...,...,...,...,...,...,...
1488,51135,Nottoway County,VA,-75.844995,43.156168,0.00000
1510,51720,Norton city,VA,-75.844995,43.156168,0.00000
1520,53055,San Juan County,WA,-83.155544,41.295156,0.00000
1568,55077,Marquette County,WI,-77.480540,37.493203,0.00000


Now, I am going to be making API calls to NOAA's climate database to get the data from rural areas first. Afterwards, I'll be doing the same for the hottest cities dataset. I'll be getting the data from the Daily Summaries portion of the DB, and I'll be using the TMIN, TMAX, and TAVG from this data to use the min, max, and average temperatures respectively for further analysis. 

**NOTE**: Based on previous experience using NOAA's temperature data, there will be datasets that come up empty on one, some, or all metrics. I will only be choosing one rural area per state, so based on the results, I'll be choosing the rural areas with the most information. 

In [217]:
import requests 
from requests.utils import quote
import time


url = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data'
headers = {'token':'YIPVZHyparqBDvvqVDyfzLxMVXpwpFjf'}
json_data = []
r = None
for county, state, fips_code in zip(np.array(rural_places_filtered_min_d['County Name']), 
                                    np.array(rural_places_filtered_min_d['State']), 
                                    np.array(rural_places_filtered_min_d['FIPS Code'])):
    query_string = {"datasetid":"GHCND", "datatypeid":"TAVG", "datatypeid":"TMIN", "datatypeid":"TMAX", 
                    "locationid":quote("FIPS:"+str(fips_code), safe=":"), "startdate":"2019-06-23", 
                    "enddate":"2019-09-23", "units":"standard", "limit":100}

    try: 
        r = requests.get(url, headers=headers, params=query_string)
    except requests.exceptions.ConnectionError:
        print('ConnectionError: Too many requests! Try again soon.')
        time.sleep(10)
        print('Okay, continue requesting data!')
        continue
    if 'json' in r.headers.get('Content-Type'):
        temperature_data = r.json()
        if temperature_data != {}:
            temperature_data_with_county_and_state = {county+"_"+state:temperature_data['results']}
            json_data.append(temperature_data_with_county_and_state)
    else:
        print("Response is not in JSON format.")
        temperature_data = 'spam'

Response is not in JSON format.
Response is not in JSON format.
Response is not in JSON format.
Response is not in JSON format.


In [245]:
len(json_data)

44

In [263]:
finalized_rural_dataset = pd.DataFrame(columns=["County_State", "Date", "TMIN", "TMAX", "TAVG"])
index = []
date, tmin, tmax, tavg = [], [], [], []
for data_1 in json_data:
    for county_state, data_2 in data_1.items(): 
        for data_3 in data_2: 
            datatype = data_3['datatype']
            datetime = data_3['date']
            temp = data_3['value']
            index.append(county_state)
            date.append(datetime)
            if datatype == "TMAX":
                tmax.append(temp)
            elif datatype == "TMIN":
                tmin.append(temp)
            elif datatype == "TAVG":
                tavg.append(temp)
            #print(data_3['datatype'])
finalized_rural_dataset["County_State"] = index
finalized_rural_dataset["Date"] = date
#finalized_rural_dataset["TMIN"] = tmin
finalized_rural_dataset["TMAX"] = tmax
#finalized_rural_dataset["TAVG"] = tavg
finalized_rural_dataset

Unnamed: 0,County_State,Date,TMIN,TMAX,TAVG
0,Benewah County_ID,2019-06-23T00:00:00,,68.0,
1,Benewah County_ID,2019-06-23T00:00:00,,72.0,
2,Benewah County_ID,2019-06-24T00:00:00,,65.0,
3,Benewah County_ID,2019-06-24T00:00:00,,70.0,
4,Benewah County_ID,2019-06-25T00:00:00,,66.0,
...,...,...,...,...,...
4196,Monroe County_WI,2019-07-25T00:00:00,,82.0,
4197,Monroe County_WI,2019-07-25T00:00:00,,84.0,
4198,Monroe County_WI,2019-07-25T00:00:00,,85.0,
4199,Monroe County_WI,2019-07-26T00:00:00,,81.0,
