
# Motivation and problem statement
I want to analyze the temperatures and demographics of big cities to see if there is evidence of the urban heat island (UHI) effect. I'm doing a research project on UHIs for another class, so I thought this would be a good opportunity to conduct my own research. The UHI effect occurs when urbanized areas experience higher temperatures than outlying areas. Structures such as buildings, roads, and other infrastructure absorb and re-emit the sun’s heat more than natural landscapes such as forests and water bodies (EPA). This [NPR article](https://www.npr.org/2019/09/03/754044732/as-rising-heat-bakes-u-s-cities-the-poor-often-feel-it-most) shows that, in many U.S. cities, poorer areas are more exposed to heat.

From a scientific and practical perspective, UHI is an especially relevant topic given climate change. There's no doubt that our planet is warming, so it'll be interesting to see the magnitude of temperature changes for different areas. From a human-centered perspective, many people living in cities are already facing the effects of extreme heat. Heat equity is an issue that needs to be addressed because communities that are poorer or that belong to certain demographic groups (e.g., by age or ethnicity) do not have the same access to healthcare or cooling centers. These people are at higher risk for health issues. This issue is related to human-centered design because it directly affects people. Understanding who is affected will help policymakers better design cities and allocate resources to maximize the benefits.

I hope to learn to apply the technical data analysis skills I've learned so far. Mainly, I want to explore the correlation between the demographics of cities and temperature to see if I can come to my own conclusion about who is most vulnerable to the UHI effect.

# Data selected for analysis
I will use [this dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1F72FB) from a [research paper](https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2021EF002016) about UHI published by the American Geophysical Union. It is under the CC0 1.0 Universal (CC0 1.0) Public Domain license, so I can use it freely without needing permission. The data was extracted from the US Census Bureau 2014 5-year American Community Survey (ACS) for all census tracts in the contiguous US. It includes 1,056 counties that have at least 10 built-up census tracts. This sample includes both large cities and small towns, and covers more than 300 million people.

Some of the fields in the data set are:
- Census tract number 
- Day and night time temperature changes
- Share of White, Black, Asian, and Hispanic residents
- nonUS: share of individuals who are not US citizens
- Education: share of individuals with a high school diploma (includes high school equivalency) or less
- Age: share of individuals 75 years of age or older
- Income: median income in the past 12 months in 2014 inflation-adjusted US$

This dataset will help my research because I can choose specific cities to focus on, and it already includes the demographic data, so I can see whether temperature correlates with other factors. For more detail about the fields in the data set, see the paper's [supporting information](https://agupubs.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1029/2021EF002016&file=2021EF002016-sup-0001-Supporting%20Information%20SI-S01.pdf).

One ethical consideration is that American Indians are not represented in the dataset. This group might not have been collected in the Census ACS, so this research is not inclusive of all communities. Other than that, the data is public and there's no personally identifiable information, so privacy shouldn't be a huge concern.

# Unknowns and dependencies
I'm not sure which cities to focus on yet or how many I should analyze. I'm not sure if I need to clean up any of the data or if it's ready to be analyzed. I'm not too confident in my data analysis skills or statistics so I might have to spend more time trying to figure out how to code things.



# Research Questions
In this project, I want to determine which populations and regions are exposed to the highest temperatures. By identifying who is being impacted the most by the UHI effect, we can design better heat mitigation strategies. Below are questions to guide my research that extend beyond the current research that I have read about.

1. For the cities reported to have a strong correlation between income and temperature, how strong is that correlation?
2. Which counties have the highest correlation between age/race/education and temperature? Is one factor stronger than the others?
3. Which counties have the largest daytime temperature difference? Which counties have the largest nighttime temperature difference?


# Background
The urban heat island effect is a well-documented phenomenon. There are quite a few studies on the correlation between income and heat. For example, [this NPR article](https://www.npr.org/2019/09/03/754044732/as-rising-heat-bakes-u-s-cities-the-poor-often-feel-it-most) identified 9 cities with the strongest correlations between heat and income. NPR analyzed 97 of the most populous U.S. cities using the median household income from U.S. Census Bureau data and thermal satellite images from NASA and the U.S. Geological Survey. The article didn't list the exact strength of the correlation for the cities, so I want to calculate that in this project (relates to Research Question 1).

In the study I got the dataset from, the author found that neighborhoods with lower incomes and higher shares of non-white residents experience significantly more extreme surface urban heat than their wealthier, whiter counterparts. In Research Question 2, I want to explore how each demographic factor (age, race, and education level) correlates with temperature.

In Research Question 3, I want to explore which counties experience the greatest urban heat island effect. This question will allow me to understand which parts of our country experience the highest temperatures.

# Methodology
Before starting my analysis, I will clean up my data table because I noticed there are cells with missing data. 
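
A minimal sketch of how this cleaning step might look in pandas. The column names (`tract`, `income`, `temp_day`) and the toy values are placeholders I made up for illustration; the real dataset's field names may differ, and dropping incomplete rows is only one possible way to handle missing cells.

```python
import numpy as np
import pandas as pd


def clean_tracts(df, required_cols):
    """Count missing values per column, then drop rows missing any required field."""
    missing_counts = df[required_cols].isna().sum()
    cleaned = df.dropna(subset=required_cols).reset_index(drop=True)
    return cleaned, missing_counts


# Toy table with hypothetical column names and made-up values.
toy = pd.DataFrame({
    "tract": ["53033000100", "53033000200", "53033000300", "53033000400"],
    "income": [52000.0, np.nan, 87000.0, 61000.0],
    "temp_day": [2.1, 1.4, np.nan, 1.8],
})

cleaned, missing_counts = clean_tracts(toy, ["income", "temp_day"])
print(missing_counts)  # how many cells are missing in each required column
print(cleaned)         # only the rows with complete data remain
```

Reporting the missing counts before dropping anything should help me judge whether removing rows would bias the sample toward certain places.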

For RQ1, I will calculate the correlation coefficient between income and temperature for each of the 9 cities listed in the NPR article. The coefficient measures how strongly two variables move together, from -1 (a perfect negative relationship) to 1 (a perfect positive relationship). To present this data, I plan on showing a table with the cities sorted from highest to lowest correlation. My dataset contains the Census tract number, so I need to find which tracts belong to the cities I'm analyzing.
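
A rough sketch of the RQ1 calculation, assuming the tracts have already been labeled with a city. The `city` column, the other column names, and the numbers are all hypothetical; the real dataset only has tract numbers, so the city label would have to come from a separate tract-to-city lookup. I use the Pearson coefficient here, which is the default for pandas `Series.corr`, and sort by absolute value so that strong negative correlations rank first.

```python
import pandas as pd


def income_temp_correlation(df, city_col="city", income_col="income", temp_col="temp_day"):
    """Pearson correlation between income and temperature, one row per city."""
    rows = []
    for city, group in df.groupby(city_col):
        r = group[income_col].corr(group[temp_col])  # Pearson by default
        rows.append({"city": city, "pearson_r": r, "n_tracts": len(group)})
    result = pd.DataFrame(rows)
    # Rank by strength of the relationship, regardless of sign.
    return result.sort_values(
        "pearson_r", key=lambda s: s.abs(), ascending=False
    ).reset_index(drop=True)


# Made-up tract values for two placeholder cities.
toy = pd.DataFrame({
    "city": ["City A"] * 4 + ["City B"] * 4,
    "income": [30, 40, 50, 60, 30, 40, 50, 60],
    "temp_day": [4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 1.0, 2.0],
})

result = income_temp_correlation(toy)
print(result)
```

In the toy data, City A's temperature falls steadily as income rises, so it ranks above City B, where the relationship is much weaker.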

For RQ2, I will create 3 visualizations, one for each factor. Each visualization will have income on the x-axis and the dependent variable on the y-axis. I will plot each county and calculate the correlation coefficient for each of the variables to see if one has a stronger influence than the others. These graphs are the most appropriate way to convey this information because the viewer will be able to see if there are any outliers and how strongly correlated the data points are.
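
One way I might compute the per-county correlations behind these plots is sketched below. It correlates each factor directly with temperature, which is a simplification of the plotting plan above. The column names (`county`, `age75`, `education`, `temp_day`) and the values are invented for illustration, and comparing absolute Pearson coefficients is only a rough way to judge which factor is "stronger".

```python
import pandas as pd


def factor_correlations(df, factors, temp_col="temp_day", county_col="county"):
    """Correlate each demographic factor with temperature, one row per county."""
    records = {}
    for county, group in df.groupby(county_col):
        # corrwith computes the Pearson correlation of every factor column with temperature.
        records[county] = group[factors].corrwith(group[temp_col])
    return pd.DataFrame(records).T  # rows: counties, columns: factors


# Made-up tract values for two placeholder counties.
toy = pd.DataFrame({
    "county": ["X"] * 4 + ["Y"] * 4,
    "temp_day": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "age75": [0.10, 0.08, 0.06, 0.04, 0.05, 0.07, 0.04, 0.06],
    "education": [0.2, 0.5, 0.3, 0.6, 0.3, 0.4, 0.5, 0.6],
})

table = factor_correlations(toy, ["age75", "education"])
strongest = table.abs().idxmax(axis=1)  # factor with the largest |r| in each county
print(table)
print(strongest)
```

The `strongest` series gives a quick answer to "is one factor stronger than the others?" for each county, which I can then check visually against the scatter plots.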

For RQ3, I will create 2 maps: one showing the census tracts and their daytime temperature differences and one showing the nighttime differences. I want to see whether the same counties stand out in both conditions. The data is all in the table; the hardest part will be connecting the data to an area on the map. A map will be helpful because the viewer will be able to easily see which regions experience the most extreme UHI effect.
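
A possible first step toward linking tracts to map areas, assuming the dataset's tract number is the standard 11-digit Census GEOID (2-digit state code, 3-digit county code, 6-digit tract code). I have not confirmed that the dataset stores it this way, and the column names and values below are placeholders. The sketch pulls the 5-digit county code out of each tract ID and averages the temperature differences per county.

```python
import pandas as pd


def county_means(df, tract_col="tract", value_cols=("temp_day", "temp_night")):
    """Average tract-level values per county, using the first 5 GEOID digits as the county key."""
    # Restore leading zeros that are lost if the IDs were read as integers.
    geoid = df[tract_col].astype(str).str.zfill(11)
    with_county = df.assign(county_fips=geoid.str[:5])
    return with_county.groupby("county_fips")[list(value_cols)].mean()


# Made-up values; the tract IDs are stored as integers on purpose to show the zero-padding.
toy = pd.DataFrame({
    "tract": [6037101110, 6037101122, 53033000100, 53033000200],
    "temp_day": [3.0, 5.0, 1.0, 2.0],
    "temp_night": [1.0, 2.0, 0.5, 1.5],
})

means = county_means(toy)
top_day = means["temp_day"].idxmax()      # county with the largest daytime difference
top_night = means["temp_night"].idxmax()  # county with the largest nighttime difference
print(means)
print(top_day, top_night)
```

The same padded ID could serve as the join key to tract or county boundary files when I build the maps, and comparing `top_day` with `top_night` is a simple check on whether the same counties lead in both conditions.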