<a href="https://colab.research.google.com/github/kimlouisev/data-visualization/blob/main/Exploring_HIV_Incidence_and_Service_Access_in_New_York_City_(2016%E2%80%932021).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring HIV Incidence and Service Access in New York City (2016–2021)




## I. Introduction

🔴 **What is HIV and AIDS?**

The World Health Organization (WHO) defines the [human immunodeficiency virus (HIV)](https://www.who.int/news-room/fact-sheets/detail/hiv-aids) as a virus that attacks the body's immune system by targeting its white blood cells. This makes it easier to get sick with diseases such as tuberculosis, other infections, and some cancers. HIV can be prevented and treated with antiretroviral therapy (ART), but left untreated, it can progress to the most advanced stage of infection, acquired immunodeficiency syndrome (AIDS). With effective treatment, people with HIV can have a near-normal life expectancy; without treatment, life expectancy can be significantly shortened, often leading to death within a few years of developing AIDS.

🔴 **HIV in NYC, Then**

In the 1980s, New York City was at the epicenter of the AIDS crisis in the United States. By 1983, the city had already reported 1,000 AIDS cases, and by 1987, an estimated 70,000 New Yorkers were living with the disease, accounting for nearly 30% of all U.S. cases at the time. The epidemic disproportionately affected gay men, leading to widespread stigma and discrimination. Dubbed “gay pneumonia” due to misconceptions about its origins, AIDS fueled violence, job loss, and housing instability for many in the LGBTQ+ community. It also significantly affected minority groups, namely Black, Hispanic, and poor individuals. [More on the history of HIV and AIDS in NYC here.](https://blogs.shu.edu/nyc-history/aids-crisis/) Fear and marginalization defined the epidemic’s early years, but activism drove progress in research, treatment, and policy.

In [None]:
from IPython.display import display, HTML

html_code = """
<div style="text-align: center;">
    <img src="https://blogs.shu.edu/nyc-history/files/2017/12/index-8-682x500.jpg" width="400" height="300"/>
    <p><strong>Figure 1: An advertisement in the 80s that calls attention to minority groups that were suffering from AIDS in America. </strong></p>
</div>
"""

display(HTML(html_code))

🔴 **HIV in NYC, Now**

HIV remains a public health concern, but advancements in treatment and prevention have reduced overall cases. This blog examines NYC’s latest HIV data, highlighting the most affected populations and areas for improved support.

[*Note: This blog focuses on HIV cases rather than AIDS for brevity. However, since HIV leads to AIDS, the insights remain relevant for AIDS prevention efforts as well.*]

## II. Research Questions


In this blog, I will try to answer the following questions using recent data available in New York City:

- Which boroughs and neighborhoods in NYC had the highest number of HIV cases from the years 2016 to 2021?
- Which sex and race/ethnicity in NYC had the highest number of HIV cases from the years 2016 to 2021?
- Is poverty correlated with HIV cases?
- Are high-burden areas in close proximity to organizations that offer HIV services (diagnostics, treatment, education, and counseling)?

## III. Hypothesis

Given the history of HIV in New York City and the broader United States, I hypothesize that people living in the Bronx or Brooklyn, men, and Black individuals will have the highest number of HIV cases. I expect men to have a higher incidence than women because HIV has historically been more common among gay men. I also suspect that the incidence of HIV increases as poverty increases, although I am not certain how strong this association is. Finally, given four decades of activism around detecting and treating HIV in the city, I posit that organizations providing HIV services are generally located in or near high-burden areas.

## IV. About the Data

To answer the above questions, I will use three datasets from NYC Open Data and the NYC.gov Environment and Health Data Portal.

- [Diagnoses of HIV/AIDS by Year, Neighborhood, Sex, and Race/Ethnicity](https://data.cityofnewyork.us/Health/HIV-AIDS-Diagnoses-by-Neighborhood-Sex-and-Race-Et/ykvb-493p/about_data) - This dataset from NYC Open Data includes data on new diagnoses of HIV and AIDS in NYC for the calendar years 2016 through 2021. Reported cases and case rates (per 100,000 population) are stratified by neighborhood, sex, and race/ethnicity.

- [Neighborhood poverty by UHF 42](https://a816-dohbesp.nyc.gov/IndicatorPublic/data-explorer/economic-conditions/?id=103#display=summary) - Data from the NYC.gov Environment and Health Data Portal. This dataset shows the percent of households with incomes below the federal poverty level from 2017-2021 at the neighborhood level.

- [DOHMH HIV Service Directory](https://data.cityofnewyork.us/dataset/DOHMH-HIV-Service-Directory/pwts-g83w/about_data) - This dataset from NYC Open Data was compiled as a guide for persons living with HIV in New York City seeking HIV medical and supportive services. It compiles different organizations that receive CDC and Ryan White Part-A funding and their location throughout New York City. These organizations provide a range of services, including: Targeted-Testing among Priority Populations, Food and Nutrition Services, Health Education and Risk Reduction Services, Harm Reduction Services, Legal Services, Mental Health Services, Case Management and Care Coordination Services, and Supportive Counseling Services.


🔴 **Notes**

In this blog, I use the **number of HIV cases per 100,000 people** rather than the total number of HIV cases. This helps provide a clearer comparison between different populations or areas, as it accounts for the size of each population, making it easier to understand the rate of HIV in each group. In epidemiology, using cases per 100,000 is a common practice to standardize data across different population sizes. It allows for more meaningful comparisons between regions or groups, regardless of their population size, and helps highlight areas with higher or lower rates of disease.
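To make the standardization concrete, here is a minimal sketch of how a rate per 100,000 is computed. The function name `rate_per_100k` and the two areas with their case counts and populations are hypothetical and chosen only for illustration; they do not come from the datasets.

```python
def rate_per_100k(cases: int, population: int) -> float:
    """Return the number of cases per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population

# Hypothetical areas: B reports more cases, but A has the higher rate.
area_a = rate_per_100k(cases=50, population=100_000)
area_b = rate_per_100k(cases=120, population=400_000)

print(f"Area A: {area_a:.1f} per 100,000")
print(f"Area B: {area_b:.1f} per 100,000")
```

In this toy comparison, ranking by raw counts would put Area B first, while ranking by rate puts Area A first, which is why the rest of the blog works with rates.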

Throughout the blog, I will use the **United Hospital Fund (UHF) neighborhoods** in New York City to spatially disaggregate data. These are a set of geographic areas used for health data collection and analysis. These neighborhoods were defined by the United Hospital Fund, a New York-based nonprofit organization focused on health care and public health, to help examine health disparities across different areas of the city.

## V. Methodology

My methodology consists of three main steps:
1.   Loading the datasets
2.   Cleaning the datasets
3.   Performing analysis and visualization


### A. Loading the three datasets



First, I loaded the HIV/AIDS case count data from 2016-2021 from NYC Open Data.

In [None]:
# setup
import pandas as pd
import plotly.express as px

# mount google drive
from google.colab import drive
drive.mount('/content/drive')

# load hiv data
df_hiv = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity_20250223.csv")

# display
df_hiv.head()



Unnamed: 0,YEAR,Borough,Neighborhood (U.H.F),SEX,RACE/ETHNICITY,TOTAL NUMBER OF HIV DIAGNOSES,"HIV DIAGNOSES PER 100,000 POPULATION",TOTAL NUMBER OF CONCURRENT HIV/AIDS DIAGNOSES,PROPORTION OF CONCURRENT HIV/AIDS DIAGNOSES AMONG ALL HIV DIAGNOSES,TOTAL NUMBER OF AIDS DIAGNOSES,"AIDS DIAGNOSES PER 100,000 POPULATION"
0,2010,,Greenpoint,Male,Black,6,330.4,0,0.0,5,275.3
1,2011,,Stapleton - St. George,Female,Native American,0,0.0,0,0.0,0,0.0
2,2010,,Southeast Queens,Male,All,23,25.4,5,21.7,14,15.4
3,2012,,Upper Westside,Female,Unknown,0,0.0,0,0.0,0,0.0
4,2013,,Willowbrook,Male,Unknown,0,0.0,0,0.0,0,0.0


Next, I loaded the NYC.gov's neighborhood poverty rates.







In [None]:
# load neighborhood poverty data
df_pov_neigh = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NYC EH Data Portal - Neighborhood poverty (uhf42).csv")

# display
df_pov_neigh.head()

Unnamed: 0,TimePeriod,GeoTypeDesc,GeoID,GeoRank,Geography,Number,Percent
0,2017-21,UHF 42,101,4,Kingsbridge - Riverdale,14713,16.2
1,2017-21,UHF 42,102,4,Northeast Bronx,30694,14.8
2,2017-21,UHF 42,103,4,Fordham - Bronx Pk,73579,27.7
3,2017-21,UHF 42,104,4,Pelham - Throgs Neck,61654,20.3
4,2017-21,UHF 42,105,4,Crotona -Tremont,73064,33.6


Finally, I loaded the DOHMH HIV Service Directory dataset from NYC Open Data.

In [None]:
# load hiv_service
df_hiv_serv = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/DOHMH_HIV_Service_Directory_20250223.csv")

# display
df_hiv_serv.head(5)

Unnamed: 0,FacilityName,Service Type,Address,Address 2,Borough,ZipCode,Website,Email,Contacts' Email,Secondary Email,Organization Phone,Contact Phone,Latitude,Longitude,Community Board,City Council District,BIN,BBL,Census Tract (2020),Neighborhood Tabulation Area (NTA) (2020)
0,African Services Committee,"Supportive Counseling Services, Legal Services...",429 West 127th Street,,Manhattan,10027,http://africanservices.org/,info@africanservices.org,,,2122223882,"mulusewb@africanservices.org, paolar@africanse...",40.813002,-73.953849,109.0,7.0,1084104.0,1019670000.0,21303.0,MN0902
1,After Hours Project,"Short-Term Housing, Short-Term Rental Assistan...",1204 Broadway,,Brooklyn,11221,www.afterhoursproject.org/,,,,7182490755,"emosquera@afterhoursproject.org, pbriones@afte...",40.692107,-73.926657,303.0,41.0,3399422.0,3016130000.0,387.0,BK0302
2,AIDS Center of Queens County,"Harm Reduction, Mental Health, Supportive Coun...",161-21 Jamaica Avenue,,Queens,11432,http://acqc.org/,,(718)-896-2500 x5736,,7188962500,"lanhalt@acqc.org, rlopez@acqc.org, evasquez@ac...",40.704249,-73.797921,412.0,27.0,4208875.0,4097608000.0,44601.0,QN1201
3,APICHA Community Health Center,Care Coordination,400 Broadway,,Manhattan,10013,https://apicha.org/,,646-744-2591,,2123347940,"tau@apicha.org, vvacharakitja@apicha.org",40.718702,-74.002464,101.0,1.0,1002334.0,1001960000.0,31.0,MN0102
4,Argus Community,Care Coordination,760 East 160th Street,,Bronx,10456,http://www.arguscommunity.org/,info@arguscommunity.org,(718) 401-5715,,7184015715,marodriguez@arguscommunity.org,40.820438,-73.904737,201.0,17.0,2004691.0,2026560000.0,77.0,BX0102


### B. Cleaning the datasets

After loading the data, I cleaned the HIV case and neighborhood poverty rate datasets so that I could merge them later on.


First, I cleaned the HIV/AIDS case count data from 2016-2021 by removing aggregate rows and cleaning the values in the rows and columns.

In [None]:
#-------------------------#
# Removing aggregate rows #
#-------------------------#

# clean df_hiv by removing rows where "borough" is blank, "All" and NaN
df_hiv_clean = df_hiv[(df_hiv["Borough"] != "") & (df_hiv["Borough"] != "All") & (df_hiv["Borough"].notna())]

# clean df_hiv by removing rows where Neighborhood (U.H.F.) is "All"
df_hiv_clean = df_hiv_clean[df_hiv_clean["Neighborhood (U.H.F)"] != "All"]

# clean df_hiv by removing rows where Sex is "All"
df_hiv_clean = df_hiv_clean[df_hiv_clean["SEX"] != "All"]

# clean df_hiv by removing rows where Race/Ethnicity is "All"
df_hiv_clean = df_hiv_clean[df_hiv_clean["RACE/ETHNICITY"] != "All"]

#--------------------------------#
# Cleaning values in cols & rows #
#--------------------------------#

# rename values "Staten\nIsland" under borough
df_hiv_clean["Borough"] = df_hiv_clean["Borough"].replace("Staten\nIsland", "Staten Island")

# change all column titles to be lowercase
df_hiv_clean.columns = df_hiv_clean.columns.str.lower()

# convert the count and rate columns to a numeric dtype
df_hiv_clean["total number of hiv diagnoses"] = pd.to_numeric(df_hiv_clean["total number of hiv diagnoses"])
df_hiv_clean["hiv diagnoses per 100,000 population"] = pd.to_numeric(df_hiv_clean["hiv diagnoses per 100,000 population"])
df_hiv_clean["total number of aids diagnoses"] = pd.to_numeric(df_hiv_clean["total number of aids diagnoses"])
df_hiv_clean["aids diagnoses per 100,000 population"] = pd.to_numeric(df_hiv_clean["aids diagnoses per 100,000 population"])

# remove instances of "\n" in neighborhood (u.h.f) column
df_hiv_clean["neighborhood (u.h.f)"] = df_hiv_clean["neighborhood (u.h.f)"].str.replace("\n", " ", regex=False)

# display
df_hiv_clean.head()

Unnamed: 0,year,borough,neighborhood (u.h.f),sex,race/ethnicity,total number of hiv diagnoses,"hiv diagnoses per 100,000 population",total number of concurrent hiv/aids diagnoses,proportion of concurrent hiv/aids diagnoses among all hiv diagnoses,total number of aids diagnoses,"aids diagnoses per 100,000 population"
2971,2016,Bronx,Crotona - Tremont,Female,Asian/Pacific\nIslander,0.0,0.0,0,,1.0,63.3
2972,2016,Bronx,Crotona - Tremont,Female,Black,10.0,38.7,0,0.0,10.0,38.7
2973,2016,Bronx,Crotona - Tremont,Female,Latino/Hispanic,14.0,22.0,2,14.3,12.0,18.9
2974,2016,Bronx,Crotona - Tremont,Female,Other/Unknown,0.0,0.0,0,,0.0,0.0
2975,2016,Bronx,Crotona - Tremont,Female,White,2.0,130.6,0,0.0,0.0,0.0
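The preview above still shows an embedded line break in the race/ethnicity values (`Asian/Pacific\nIslander`), similar to the ones removed from the borough and neighborhood columns. One possible way to handle all text columns in one pass is sketched below on a toy DataFrame. The frame `df` and the list `text_cols` are illustrative stand-ins, and the sketch assumes the artifacts are real newline characters, as the cleaning code above does.

```python
import pandas as pd

# Toy frame mimicking the line-break artifacts seen in the source data.
df = pd.DataFrame({
    "borough": ["Staten\nIsland", "Bronx"],
    "race/ethnicity": ["Asian/Pacific\nIslander", "Black"],
    "total number of hiv diagnoses": [0.0, 10.0],
})

# Columns that hold text labels (listed explicitly for this sketch).
text_cols = ["borough", "race/ethnicity"]

# Replace embedded newlines with a single space in each text column.
for col in text_cols:
    df[col] = df[col].str.replace("\n", " ", regex=False)

cleaned_races = df["race/ethnicity"].tolist()
print(cleaned_races)
```

Applying the same replacement to every label column keeps category names consistent in chart legends and in later group-by steps.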


Next, I cleaned the neighborhood poverty rate dataset by retaining only the relevant columns for my analysis and renaming the headers & some values to make merging easier.

In [None]:
# remove columns 'TimePeriod', 'GeoTypeDesc', 'GeoID', 'GeoRank'
df_pov_neigh = df_pov_neigh.drop(columns=['TimePeriod', 'GeoTypeDesc', 'GeoID', 'GeoRank'])

# renaming columns
df_pov_neigh = df_pov_neigh.rename(columns={'Geography': 'neighborhood (u.h.f)'})
df_pov_neigh = df_pov_neigh.rename(columns={'Number': 'no. of people in poverty (2017-2021)'})
df_pov_neigh = df_pov_neigh.rename(columns={'Percent': 'poverty rate (2017-2021)'})

# renaming values in columns to merge with df_hiv_clean
df_pov_neigh["neighborhood (u.h.f)"] = df_pov_neigh["neighborhood (u.h.f)"].replace("Fordham - Bronx Pk", "Fordham - Bronx Park", regex=False)
df_pov_neigh["neighborhood (u.h.f)"] = df_pov_neigh["neighborhood (u.h.f)"].replace("Crotona -Tremont", "Crotona - Tremont", regex=False)


# display
df_pov_neigh.head()

Unnamed: 0,neighborhood (u.h.f),no. of people in poverty (2017-2021),poverty rate (2017-2021)
0,Kingsbridge - Riverdale,14713,16.2
1,Northeast Bronx,30694,14.8
2,Fordham - Bronx Park,73579,27.7
3,Pelham - Throgs Neck,61654,20.3
4,Crotona - Tremont,73064,33.6


### C. Merging the datasets

In this section, I combine the cleaned HIV/AIDS dataset with the cleaned neighborhood poverty dataset. This will be my main dataset for analysis and visualization.

Adding poverty data to the HIV dataset

In [None]:
# left merge: keep every row of df_hiv_clean and attach poverty data on "neighborhood (u.h.f)"
df_merged = pd.merge(df_hiv_clean, df_pov_neigh, on="neighborhood (u.h.f)", how="left")

# display
df_merged.head()

Unnamed: 0,year,borough,neighborhood (u.h.f),sex,race/ethnicity,total number of hiv diagnoses,"hiv diagnoses per 100,000 population",total number of concurrent hiv/aids diagnoses,proportion of concurrent hiv/aids diagnoses among all hiv diagnoses,total number of aids diagnoses,"aids diagnoses per 100,000 population",no. of people in poverty (2017-2021),poverty rate (2017-2021)
0,2016,Bronx,Crotona - Tremont,Female,Asian/Pacific\nIslander,0.0,0.0,0,,1.0,63.3,73064,33.6
1,2016,Bronx,Crotona - Tremont,Female,Black,10.0,38.7,0,0.0,10.0,38.7,73064,33.6
2,2016,Bronx,Crotona - Tremont,Female,Latino/Hispanic,14.0,22.0,2,14.3,12.0,18.9,73064,33.6
3,2016,Bronx,Crotona - Tremont,Female,Other/Unknown,0.0,0.0,0,,0.0,0.0,73064,33.6
4,2016,Bronx,Crotona - Tremont,Female,White,2.0,130.6,0,0.0,0.0,0.0,73064,33.6
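Because the merge matches on neighborhood names, any spelling difference between the two datasets leaves the poverty columns empty for that neighborhood. The sketch below shows one way to surface such rows with the `indicator` option of `pd.merge`. The frames `hiv` and `pov` are toy stand-ins for `df_hiv_clean` and `df_pov_neigh`. The mismatched spellings mirror the ones renamed later in the mapping cells, while the rates other than those for Crotona - Tremont are made up.

```python
import pandas as pd

# Toy stand-ins for df_hiv_clean and df_pov_neigh (most values are invented).
hiv = pd.DataFrame({
    "neighborhood (u.h.f)": [
        "Crotona - Tremont",
        "Greenwich Village - Soho",
        "Washington Heights - Inwood",
    ],
    "hiv diagnoses per 100,000 population": [38.7, 50.0, 20.0],
})
pov = pd.DataFrame({
    "neighborhood (u.h.f)": [
        "Crotona - Tremont",
        "Greenwich Village - SoHo",
        "Washington Heights",
    ],
    "poverty rate (2017-2021)": [33.6, 10.0, 15.0],
})

# indicator=True adds a "_merge" column recording where each row found a match.
checked = pd.merge(hiv, pov, on="neighborhood (u.h.f)", how="left", indicator=True)

# Rows tagged "left_only" have no poverty rate after the merge.
unmatched = checked.loc[
    checked["_merge"] == "left_only", "neighborhood (u.h.f)"
].tolist()
print(unmatched)
```

Running a check like this on the real frames before the analysis would show which neighborhood names need harmonizing so that they are not silently dropped when grouping by poverty rate.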


### D. Analysis

In this section, I investigate whether my hypotheses about the research questions hold, given the 2016-2021 data.

#### a. How many HIV cases per 100,000 by borough (total and per year)?

🔴 From 2016 to 2021, **Manhattan** had the highest rate of HIV diagnoses per 100,000 people, followed by Brooklyn, the Bronx, Queens, and Staten Island in that order.

In [None]:
# aggregate HIV cases per 100,000 by borough
hiv_bor = df_merged.groupby('borough')['hiv diagnoses per 100,000 population'].sum()

# display
hiv_bor

# creating a bar chart summarizing hiv cases per 100,000 by borough
fig_hiv_bor = px.bar(
    hiv_bor,
    x=hiv_bor.index,
    y='hiv diagnoses per 100,000 population',
    title = 'Total HIV cases per 100,000 (2016-2021) in NYC')

fig_hiv_bor.show()

🔴 When we disaggregate the number of HIV diagnoses per 100,000 from 2016-2021, we can see two striking observations:
- Manhattan is the borough that consistently had the highest share of HIV diagnoses per 100,000.
- There was an uptick in new HIV diagnoses in 2020.

In [None]:
# aggregate HIV cases per 100,000 by year and borough
hiv_bor_yr = df_merged.groupby(['year', 'borough'])['hiv diagnoses per 100,000 population'].sum().reset_index()

# display
hiv_bor_yr

# creating a stacked bar chart of hiv cases per 100,000 by year, colored by borough
fig_hiv_bor_yr = px.bar(
    hiv_bor_yr,
    x='year',
    y='hiv diagnoses per 100,000 population',
    color= 'borough',
    title = 'Total HIV cases per 100,000 (2016-2021) by borough by year in NYC')

fig_hiv_bor_yr.show()

#### b. How many HIV cases per 100,000 by race/ethnicity?

🔴 From 2016 to 2021, **Black** individuals had the highest rate of HIV diagnoses per 100,000 people, making up 46% of all cases in that period. They are followed by Latino/Hispanic, Other, White, then Asian/Pacific Islander.

In [None]:
# aggregate HIV cases per 100,000 by race/ethnicity
hiv_race = df_merged.groupby('race/ethnicity')['hiv diagnoses per 100,000 population'].sum().reset_index()

# sort in descending order
hiv_race = hiv_race.sort_values(by='hiv diagnoses per 100,000 population', ascending=False)

# display
hiv_race

# creating a pie chart summarizing hiv cases per 100,000 by race/ethnicity
fig_hiv_race = px.pie(
    hiv_race,
    values='hiv diagnoses per 100,000 population',
    names='race/ethnicity',
    title='Total HIV cases per 100,000 (2016-2021) by race/ethnicity in NYC')

# display
fig_hiv_race.show()

🔴 When we disaggregate these figures across 2016-2021, we see that this trend is consistent across all years: Black individuals make up the largest share of new HIV diagnoses per 100,000. It's worth noting that the rate among White individuals was higher than among Other/Unknown in 2020 and 2021.

In [None]:
# aggregate HIV cases per 100,000 by race/ethnicity by year
hiv_race_yr = df_merged.groupby(['year', 'race/ethnicity'])['hiv diagnoses per 100,000 population'].sum().reset_index()

# display
hiv_race_yr

# creating a stacked bar chart of hiv cases per 100,000 by year, colored by race/ethnicity
fig_hiv_race_yr = px.bar(
    hiv_race_yr,
    x='year',
    y='hiv diagnoses per 100,000 population',
    color='race/ethnicity',
    title='Total HIV cases per 100,000 (2016-2021) by race/ethnicity by year in NYC')


fig_hiv_race_yr.show()

#### c. How many HIV cases per 100,000 by sex (total and per year)?

🔴 From 2016 to 2021, **men** had a much higher rate of HIV diagnoses per 100,000 people than women, accounting for 85% of the total compared with 15% for women.

In [None]:
# aggregate HIV cases per 100,000 by sex
hiv_sex = df_merged.groupby('sex')['hiv diagnoses per 100,000 population'].sum().reset_index()

# display
hiv_sex

# creating a pie chart summarizing hiv cases per 100,000 by sex
fig_hiv_sex = px.pie(
    hiv_sex,
    values='hiv diagnoses per 100,000 population',
    names='sex',
    title='Total HIV cases per 100,000 (2016-2021) by gender in NYC')

fig_hiv_sex.show()

🔴 When we disaggregate across 2016-2021, we see that men consistently have higher incidence of HIV than women.

In [None]:
# aggregate HIV cases per 100,000 by sex by year
hiv_sex_yr = df_merged.groupby(['year', 'sex'])['hiv diagnoses per 100,000 population'].sum().reset_index()

# display
hiv_sex_yr

# creating a stacked bar chart of hiv cases per 100,000 by year, colored by sex
fig_hiv_sex_yr = px.bar(
    hiv_sex_yr,
    x='year',
    y='hiv diagnoses per 100,000 population',
    color= 'sex',
    title='Total HIV cases per 100,000 (2016-2021) by gender by year in NYC')


fig_hiv_sex_yr.show()

#### d. Are HIV cases per 100,000 correlated with poverty rate?

🔴 Next, I examine the correlation between HIV diagnoses per 100,000 and the poverty rate by neighborhood, using the 2021 diagnoses. The results indicate a small positive correlation: as the poverty rate increases, the rate of HIV diagnoses also tends to rise. However, this relationship is weak. The R² value in the regression analysis measures the share of the variation in HIV rates that the trendline explains. With an R² of 0.091, poverty explains only about 9% of the variation across neighborhoods (an R² of 1.00 would indicate a perfect fit). Additionally, the coefficient for the poverty rate (x1) has a p-value of 0.066, which exceeds the conventional 5% significance threshold, so the association is not statistically significant at that level. Visually, the weak association is also apparent, as the data points in the scatterplot are widely scattered around the trendline.



In [None]:
# load statsmodel
#!pip install statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotly.express as px

#-------

# filter only data in 2021
hiv_2021 = df_merged[df_merged["year"] == 2021]

# display
hiv_2021.head()

# aggregate hiv diagnoses by neighborhood, keeping each neighborhood's poverty rate
hiv_pov_2021 = hiv_2021.groupby(['neighborhood (u.h.f)', 'poverty rate (2017-2021)'])['hiv diagnoses per 100,000 population'].sum().reset_index()

# display
hiv_pov_2021.head()

Unnamed: 0,neighborhood (u.h.f),poverty rate (2017-2021),"hiv diagnoses per 100,000 population"
0,Bayside - Little Neck,7.8,0.0
1,Bedford Stuyvesant - Crown Heights,21.6,351.9
2,Bensonhurst - Bay Ridge,14.4,73.9
3,Borough Park,21.8,138.5
4,Canarsie - Flatlands,11.6,122.8


In [None]:
# scatterplot with y=hiv cases by neighborhood in 2021, x=poverty rate by neighborhood (2017-2021)
fig_hiv_pov_2021 = px.scatter(
    hiv_pov_2021,
    x="poverty rate (2017-2021)",
    y="hiv diagnoses per 100,000 population",
    color="neighborhood (u.h.f)",
    title="Poverty rate vs. HIV cases per 100,000 (2021) by neighborhood in NYC",
    trendline="ols",
    trendline_scope="overall"
)

# display
fig_hiv_pov_2021.show()

In [None]:
# get regression results
trend_results = px.get_trendline_results(fig_hiv_pov_2021).iloc[0, 0]
trend_results.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.091
Model:,OLS,Adj. R-squared:,0.065
Method:,Least Squares,F-statistic:,3.585
Date:,"Tue, 04 Mar 2025",Prob (F-statistic):,0.0664
Time:,01:54:35,Log-Likelihood:,-241.32
No. Observations:,38,AIC:,486.6
Df Residuals:,36,BIC:,489.9
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,147.9607,52.436,2.822,0.008,41.616,254.305
x1,5.2828,2.790,1.893,0.066,-0.376,10.942

0,1,2,3
Omnibus:,16.49,Durbin-Watson:,2.343
Prob(Omnibus):,0.0,Jarque-Bera (JB):,19.539
Skew:,1.378,Prob(JB):,5.72e-05
Kurtosis:,5.179,Cond. No.,42.8
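In a regression with a single predictor, R² equals the square of the Pearson correlation coefficient, so the reported R² of 0.091 corresponds to a correlation of roughly 0.30. The sketch below illustrates this relationship in plain Python. The helper `simple_ols` is a hypothetical name, and the fit uses only the five neighborhood rows previewed earlier, so its numbers will not match the full regression summary.

```python
import math

def simple_ols(xs, ys):
    """Fit y = a + b*x by least squares; return (intercept, slope, r, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    # R^2 computed independently from the residuals, to compare with r**2
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_res / syy
    return intercept, slope, r, r_squared

# Poverty rates and HIV rates from the five rows of hiv_pov_2021 shown earlier
poverty = [7.8, 21.6, 14.4, 21.8, 11.6]
hiv_rate = [0.0, 351.9, 73.9, 138.5, 122.8]
intercept, slope, r, r_squared = simple_ols(poverty, hiv_rate)

# Correlation implied by the reported R-squared (the slope is positive)
implied_r = math.sqrt(0.091)

print(f"five-row fit: r = {r:.3f}, R^2 = {r_squared:.3f}")
print(f"r implied by R^2 = 0.091: {implied_r:.2f}")
```

Reading the result as a correlation of about 0.30 can be easier to interpret than R² alone: the direction is positive, but most of the variation in HIV rates is left unexplained by poverty.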


#### e. How many HIV cases per 100,000 by neighborhood?

🔴 Although we know that **Manhattan** has the highest rate of HIV diagnoses per 100,000, it is useful to know which neighborhoods have the highest incidence. When we disaggregate the 2016-2021 figures, we can see that **Chelsea - Clinton** and **Greenwich Village - Soho** have the highest incidence. The next highest are parts of the Bronx (**Crotona - Tremont** and **Hunts Point - Mott Haven**) and Brooklyn (**Bedford Stuyvesant - Crown Heights** and **East New York**). The map of poverty rates by neighborhood (2017-2021) shows that these Bronx and Brooklyn neighborhoods are among those with the highest poverty rates, which is consistent with the earlier finding that HIV cases are weakly correlated with the poverty rate. These patterns appear in both the bar chart by neighborhood and the maps of HIV diagnoses and poverty rate by neighborhood.

In [None]:
# aggregate HIV cases per 100,000 by UHF neighborhood
hiv_uhf = df_merged.groupby(['borough', 'neighborhood (u.h.f)'])['hiv diagnoses per 100,000 population'].sum().reset_index()

# sort in descending order based on hiv diagnoses column
hiv_uhf = hiv_uhf.sort_values(by='hiv diagnoses per 100,000 population', ascending=False)

# display
hiv_uhf.head()

Unnamed: 0,borough,neighborhood (u.h.f),"hiv diagnoses per 100,000 population"
19,Manhattan,Chelsea - Clinton,4408.8
22,Manhattan,Greenwich Village - Soho,3771.3
0,Bronx,Crotona - Tremont,2911.8
3,Bronx,Hunts Point - Mott Haven,2790.3
7,Brooklyn,Bedford Stuyvesant - Crown Heights,2683.6


In [None]:
# rename values under neighborhood (u.h.f) column for mapping
hiv_uhf["neighborhood (u.h.f)"] = hiv_uhf["neighborhood (u.h.f)"].replace("Downtown - Heights - Park Slope", "Downtown  - Heights - Slope", regex=False)
hiv_uhf["neighborhood (u.h.f)"] = hiv_uhf["neighborhood (u.h.f)"].replace("Gramercy Park - Murray Hill", "Gramercy Park -  Murray Hill", regex=False)
hiv_uhf["neighborhood (u.h.f)"] = hiv_uhf["neighborhood (u.h.f)"].replace("Port Richmond", "Port  Richmond", regex=False)

# creating a bar chart of hiv cases per 100,000 by neighborhood, colored by borough
fig_hiv_uhf = px.bar(
    hiv_uhf,
    x='neighborhood (u.h.f)',
    y='hiv diagnoses per 100,000 population',
    color="borough",
    title='Total HIV cases per 100,000 (2016-2021) in NYC')

fig_hiv_uhf.show()

In [None]:
import requests

response = requests.get(
    "https://raw.githubusercontent.com/nycehs/NYC_geography/refs/heads/master/UHF42.geo.json"
)
# stop early with a clear error if the download failed
response.raise_for_status()
geojson = response.json()

geojson['features'][1]['properties']

{'GEOCODE': 102, 'GEONAME': 'Northeast Bronx', 'BOROUGH': 'Bronx'}
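The choropleth maps match each row to a shape through the `GEONAME` property, so a neighborhood whose name differs from the GeoJSON spelling will not be drawn. The sketch below shows one way to list such names before plotting. The helper `names_missing_from_geojson` and the `toy_geojson` dictionary are hypothetical; the double-spaced "Port  Richmond" entry is inferred from the renaming cell above rather than read from the actual file.

```python
# Toy GeoJSON with the same property layout as the UHF42 file.
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"GEOCODE": 102, "GEONAME": "Northeast Bronx", "BOROUGH": "Bronx"},
            "geometry": None,
        },
        {
            "type": "Feature",
            # Two spaces, mirroring the spelling targeted by the renaming cell
            "properties": {"GEONAME": "Port  Richmond", "BOROUGH": "Staten Island"},
            "geometry": None,
        },
    ],
}

def names_missing_from_geojson(names, geojson, key="GEONAME"):
    """Return the names that have no feature with a matching property value."""
    known = {feature["properties"][key] for feature in geojson["features"]}
    return sorted(set(names) - known)

# Names as they might appear in the data before renaming
data_names = ["Northeast Bronx", "Port Richmond"]
missing = names_missing_from_geojson(data_names, toy_geojson)
print(missing)
```

With the real objects, the call would look like `names_missing_from_geojson(hiv_uhf["neighborhood (u.h.f)"], geojson)`, and an empty result would suggest that every neighborhood has a shape to draw.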

In [None]:
fig_hiv_uhf_map = px.choropleth_mapbox(hiv_uhf,
                           geojson=geojson,
                           locations='neighborhood (u.h.f)',
                           featureidkey='properties.GEONAME',
                           color='hiv diagnoses per 100,000 population',
                           hover_data=['neighborhood (u.h.f)'],
                           title='Total HIV cases per 100,000 by neighborhood (u.h.f)',
                           center = {'lat': 40.73, 'lon': -73.98},
                           zoom=10,
                           color_continuous_scale="YlOrRd",
                           mapbox_style='carto-positron')

fig_hiv_uhf_map.update_layout(height=1000)
fig_hiv_uhf_map.show()

Poverty rate by neighborhoods

In [None]:
# rename values under neighborhood (u.h.f) column for mapping
df_pov_neigh["neighborhood (u.h.f)"] = df_pov_neigh["neighborhood (u.h.f)"].replace("Downtown - Heights - Slope", "Downtown  - Heights - Slope", regex=False)
df_pov_neigh["neighborhood (u.h.f)"] = df_pov_neigh["neighborhood (u.h.f)"].replace("Gramercy Park - Murray Hill", "Gramercy Park -  Murray Hill", regex=False)
df_pov_neigh["neighborhood (u.h.f)"] = df_pov_neigh["neighborhood (u.h.f)"].replace("Port Richmond", "Port  Richmond", regex=False)
df_pov_neigh["neighborhood (u.h.f)"] = df_pov_neigh["neighborhood (u.h.f)"].replace("Washington Heights", "Washington Heights - Inwood", regex=False)
df_pov_neigh["neighborhood (u.h.f)"] = df_pov_neigh["neighborhood (u.h.f)"].replace("Greenwich Village - SoHo", "Greenwich Village - Soho", regex=False)


# graphing
fig_pov_neigh_map = px.choropleth_mapbox(df_pov_neigh,
                           geojson=geojson,
                           locations='neighborhood (u.h.f)',
                           featureidkey='properties.GEONAME',
                           color='poverty rate (2017-2021)',
                           hover_data=['neighborhood (u.h.f)'],
                           title='Poverty rate (2017-2021) by neighborhood',
                           center = {'lat': 40.73, 'lon': -73.98},
                           zoom=10,
                           color_continuous_scale="YlOrRd",
                           mapbox_style='carto-positron')

fig_pov_neigh_map.update_layout(height=1010)

#display
fig_pov_neigh_map.show()

#### f. Where are HIV service centers located? Are they near neighborhoods that have high HIV case counts?


🔴 Finally, since it would help the city to know how best to serve HIV-positive populations, I looked into the location of different organizations offering HIV services. I first mapped them out in New York City, and then laid these points on top of the earlier map with HIV diagnoses per 100,000 by neighborhood.

On this two-layer map, we can see that most organizations providing HIV services are situated in neighborhoods with a high burden of HIV cases. However, a few neighborhoods are underserved (meaning no organizations offering HIV services are located in the area) despite having more than 2,000 HIV diagnoses per 100,000, namely:
- Washington Heights
- Ridgewood - Forest Hills
- East New York

These are areas which these organizations can expand into to offer HIV services.
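Proximity can also be expressed as a distance rather than judged by eye. The sketch below computes the great-circle (haversine) distance from a reference point to the nearest facility. The facility coordinates are copied from the service directory preview shown earlier, while the functions `haversine_km` and `nearest_facility` and the reference point are hypothetical; the point is not an official neighborhood centroid.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Facility coordinates copied from the service directory preview
facilities = {
    "African Services Committee": (40.813002, -73.953849),
    "After Hours Project": (40.692107, -73.926657),
    "APICHA Community Health Center": (40.718702, -74.002464),
}

def nearest_facility(lat, lon, facilities):
    """Return (name, distance_km) of the facility closest to the given point."""
    return min(
        (
            (name, haversine_km(lat, lon, f_lat, f_lon))
            for name, (f_lat, f_lon) in facilities.items()
        ),
        key=lambda pair: pair[1],
    )

# Hypothetical reference point (not an official neighborhood centroid)
point = (40.85, -73.93)
name, dist_km = nearest_facility(*point, facilities)
print(f"{name}: {dist_km:.1f} km")
```

Repeating this for a representative point in each neighborhood would give a nearest-facility distance per neighborhood, which could then be compared against the HIV rate to identify underserved areas more systematically.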

In [None]:
# Create the map
fig_hiv_serv = px.scatter_mapbox(df_hiv_serv,
                        lat="Latitude",
                        lon="Longitude",
                        hover_name="FacilityName",
                        zoom=10)

# Use open-street-map as base map
fig_hiv_serv.update_layout(mapbox_style="open-street-map")
fig_hiv_serv.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

# Show the figure in browser
fig_hiv_serv.show()

Mapping HIV service centers and HIV cases per 100,000 (2016-2021) on NYC map

In [None]:
# import other plotly setup
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create a new figure that will contain both maps
fig = go.Figure()

# Add the choropleth layer
fig.add_trace(go.Choroplethmapbox(
    geojson=geojson,
    locations=hiv_uhf['neighborhood (u.h.f)'],
    z=hiv_uhf['hiv diagnoses per 100,000 population'],
    featureidkey='properties.GEONAME',
    colorscale="YlOrRd",
    marker_opacity=0.5,
    marker_line_width=0.5,
    colorbar=dict(
        title="HIV diagnoses<br>per 100,000",
        x=0.01,
        xpad=0
    ),
    hovertemplate='<b>%{location}</b><br>HIV diagnoses per 100,000: %{z}<extra></extra>'
))

# Add the scatter mapbox layer for HIV service facilities
fig.add_trace(go.Scattermapbox(
    lat=df_hiv_serv["Latitude"],
    lon=df_hiv_serv["Longitude"],
    mode='markers',
    marker=dict(
        size=8,
        color='blue',
        opacity=0.8
    ),
    text=df_hiv_serv["FacilityName"],
    hoverinfo='text',
    hovertemplate='<b>%{text}</b><extra></extra>',
    name='HIV Services'
))

# Update layout with proper styling
fig.update_layout(
    title="HIV Diagnoses per 100,000 by Neighborhood and Service Locations in NYC (2016-2021)",
    mapbox_style="carto-positron",
    mapbox=dict(
        center=dict(lat=40.73, lon=-73.98),
        zoom=10
    ),
    height=1100,
    margin={"r":0, "t":40, "l":0, "b":0},
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

# Show the figure
fig.show()

## VI. Results and Revisiting Hypotheses


🔴  **1. HIV Rates are highest in parts of Manhattan, the Bronx and Brooklyn**

My hypothesis that the Bronx and Brooklyn, as well as Black individuals, would have the highest number of HIV cases was largely confirmed, with one slight deviation. **Manhattan** turned out to have the highest rate of HIV diagnoses per 100,000 people, which wasn't what I expected. This is largely driven by high cases in Chelsea - Clinton and Greenwich Village - Soho, areas which have a large population of gay men. However, both **Brooklyn** and the **Bronx** showed significant rates of HIV diagnoses, especially in neighborhoods like **Bedford Stuyvesant - Crown Heights** and **Hunts Point - Mott Haven**, which aligns with my prediction for these boroughs.

🔴  **2. HIV Rates are highest among men and Black and Latino individuals**

Regarding demographics, I was correct in predicting that **Black individuals** would have the highest rate of HIV diagnoses per 100,000, which was consistent with my hypothesis. This trend supports the historical and social factors that have disproportionately affected Black communities. As for gender, I was also right in expecting that **men** would have a higher incidence of HIV than women. The finding that men accounted for 85% of new HIV diagnoses confirms the historical trend of HIV being more common among men, particularly gay men.

🔴 **3. There is a weak positive association between HIV and Poverty**

I suspected there would be a correlation between HIV and poverty, but the relationship turned out to be weaker than I anticipated. While there was indeed a **small positive correlation** between the two, the **weak strength** of the association was surprising. The R² value of 0.091 suggests that while there may be some connection between higher poverty rates and increased HIV diagnoses, poverty explains only a small part of the variation across neighborhoods. This indicates that other factors likely play a role, so my hypothesis that HIV incidence rises with poverty was only partially supported.

🔴 **4. HIV Services are generally available in high-burden areas, with a few exceptions**

My hypothesis that HIV services are generally available in high-burden areas was mostly confirmed. The analysis showed that most HIV service organizations are located in neighborhoods with high rates of HIV diagnoses, which supports my assumption. However, I was surprised to find that some areas, like **Washington Heights** and **Ridgewood - Forest Hills**, had relatively high HIV rates but were **underserved** in terms of HIV services. This indicates that while services are typically located in high-burden areas, there are still some gaps, especially in neighborhoods that might be emerging or harder to access for healthcare providers.

### Summary

In the end, most of my hypotheses were confirmed, particularly regarding the higher rates of HIV among Black individuals and men, as well as the general presence of HIV services in high-burden areas. However, the weak correlation between HIV and poverty suggests that factors beyond poverty, such as sex, race, and neighborhood, contribute to the epidemic. Further, the final map is a call to improve access to HIV services in underserved neighborhoods with a relatively high burden of HIV, such as Washington Heights, Ridgewood - Forest Hills, and East New York.