# Unveiling Health Disparities: A Data-Driven Exploration of Socioeconomic Factors, Healthcare Accessibility, Poverty Rate, and Mortality Rates in Different Communities

Written by Grayce (Mingxuan) Yang, Ruotong Xu, Nov 12, 2023

Visualization can be accessed here: https://deepnote.com/workspace/datathon-f9fa-d0d2d7af-6292-4865-8f7d-4d1e6386bb0f/project/Datathon-2853eace-95aa-42bd-bb8c-3e961b96ca5d/notebook/Notebook%201-2-299ba54f9938480d898d87f599a749aa?duplicate=true 

## Background 

Public health outcomes are deeply intertwined with socioeconomic factors and access to healthcare. Our project uses data science to examine the connections between socioeconomic factors, access to healthcare, and mortality rates. Leveraging public datasets from California, we aim to provide a quantitative perspective on disparities in health outcomes, particularly among marginalized communities.

Communities grappling with higher poverty rates often experience limited access to healthcare providers, leading to delayed diagnoses and inadequate medical interventions. This healthcare inequity can contribute to higher mortality rates within these vulnerable populations. The importance of understanding these connections is underscored by various incidents across different regions.

Disparities in healthcare access have been brought to the forefront in various urban and rural areas, where socioeconomic challenges intersect with the availability of medical resources. In some instances, residents face barriers to accessing essential healthcare services due to financial constraints or geographic isolation. This lack of access not only exacerbates existing health conditions but can also contribute to a higher overall mortality rate.

The impact of socioeconomic factors on health outcomes is a multifaceted issue that requires comprehensive attention. By investigating the intricate relationship between poverty rates, healthcare provider accessibility, and mortality rates, public health initiatives can be tailored to address the specific needs of underserved communities. This approach is vital for fostering a more equitable healthcare system that ensures everyone, regardless of socioeconomic status, has access to the resources necessary for a healthier and more resilient society.

## The Data

Our data science project draws data from CalData's authoritative sources in California. We utilize three key datasets: 

(1) Infectious diseases categorized by type, county, year, and sex: https://data.ca.gov/dataset/infectious-diseases-by-disease-county-year-and-sex

(2) Primary care shortage areas in the state: https://data.ca.gov/dataset/primary-care-shortage-areas-in-california

(3) Hospital inpatient mortality rates: https://data.ca.gov/dataset/california-hospital-inpatient-mortality-rates-and-quality-ratings

These datasets serve as the foundation for our analysis of the relationships between socioeconomic factors, healthcare accessibility, and mortality rates. Using grouped aggregations, visualizations, and ordinary least squares trendlines, we look for patterns and correlations that can inform public health research and policy.

In [None]:
#importing necessary packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from sklearn.preprocessing import OneHotEncoder

## Insights from Infectious Disease Rate Data

In [None]:
infec = pd.read_csv('/work/Infectious_Diseases.csv')

In [None]:
# Keep only the rows that aggregate both sexes; copy so the slice can be modified safely
infec_no_sex = infec[infec['Sex'] == 'Total'].copy()
# Treat '-' as 0 and strip a trailing '*' before converting the rate to float
infec_no_sex['Rate'] = infec_no_sex['Rate'].replace({'-': '0'}).str.replace(r'\*$', '', regex=True).astype(float)
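
The cleaning step in the cell above can be illustrated on a small, made-up Series. This is only a sketch: the values are hypothetical, and the reading of '-' as a missing rate and of a trailing '*' as a flag on the estimate is an assumption, not something the dataset documentation is quoted on here.

```python
import pandas as pd

# Hypothetical raw values mimicking the 'Rate' column (assumption: '-' marks a
# missing rate and a trailing '*' flags an estimate)
raw_rate = pd.Series(['29.007', '-', '7.448*', '0.000'])

clean_rate = (
    raw_rate
    .replace({'-': '0'})                  # whole-cell '-' becomes '0'
    .str.replace(r'\*$', '', regex=True)  # drop a trailing asterisk
    .astype(float)                        # convert the cleaned text to numbers
)
print(clean_rate.tolist())
```

Passing `regex=True` explicitly matters because recent pandas versions treat the pattern as a literal string by default.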


In [None]:
total = infec_no_sex.groupby('Disease')[['Cases']].sum()
total['Cases']= total['Cases'].astype(int)
total.sort_values(by = 'Cases', ascending = False).head()

Disease,Cases
Campylobacteriosis,306830
Salmonellosis,213144
Coccidioidomycosis,193448
Giardiasis,95938
Shigellosis,89608


Create a selected dataframe consisting of Malaria, Dengue Virus Infection, Campylobacteriosis, Salmonellosis, Coccidioidomycosis, Giardiasis, Shigellosis.

### Average of Infectious Disease Rate Progression

In [None]:
DeepnoteChart(infec_no_sex, """{"layer":[{"layer":[{"mark":{"clip":true,"type":"circle","tooltip":true},"encoding":{"x":{"sort":null,"type":"quantitative","field":"Year","scale":{"type":"linear","zero":false},"format":{"type":"default","decimals":null},"formatType":"numberFormatFromNumberType"},"y":{"axis":{"title":"Average Infectious Disease Rate (per 100,000)"},"sort":null,"type":"quantitative","field":"Rate","scale":{"type":"linear","zero":false},"format":{"type":"default","decimals":null},"aggregate":"average","formatType":"numberFormatFromNumberType"},"color":{"sort":null,"type":"quantitative","scale":{"scheme":"redpurple"},"aggregate":"count"}}}]}],"title":"Relationship between Year and Average Infectious Disease Rate","config":{"legend":{}},"$schema":"https://vega.github.io/schema/vega-lite/v5.json","encoding":{}}""")


The average infectious disease rate rises as the years progress. This upward trend points to an ongoing public health problem that requires attention.
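
One way to put a number on such a trend is to average the rate per year and fit a straight line through the yearly means. The sketch below uses made-up rates, not the notebook's data; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rates for illustration only
toy = pd.DataFrame({
    'Year': [2017, 2017, 2018, 2018, 2019, 2019],
    'Rate': [1.0, 3.0, 2.0, 4.0, 3.0, 5.0],
})

# Average rate per year, mirroring the "average" aggregate in the chart
yearly_mean = toy.groupby('Year')['Rate'].mean()

# Center the years before fitting to keep the least-squares problem well conditioned
years = yearly_mean.index.to_numpy(dtype=float)
slope, intercept = np.polyfit(years - years.mean(), yearly_mean.to_numpy(), deg=1)
print(slope)  # change in the average rate per year
```

A positive slope corresponds to an increasing average rate; its size gives the average change per year.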

In [None]:
# Select the diseases of interest: the five with the most cases, plus Malaria and Dengue Virus Infection
selected_df = infec_no_sex[infec_no_sex['Disease'].isin(['Malaria', 'Dengue Virus Infection', 'Campylobacteriosis', 'Salmonellosis', 'Coccidioidomycosis', 'Giardiasis', 'Shigellosis'])]

# Create a scatter plot of rate against year, colored by disease, using Plotly Express
fig = px.scatter(selected_df, x='Year', y='Rate', color='Disease',
                 labels={'Rate': 'Infectious Disease Rate'},
                 title='Infectious Disease Rate vs. Year by Disease',
                 width=800, height=500,
                 opacity=0.7)  # Set the opacity to control transparency

# Show the plot
fig.show()


Among the diseases with the highest case counts in the dataset, coccidioidomycosis shows a higher rate than the others. This suggests that coccidioidomycosis should be a priority when addressing public health problems.

## Insights from Primary Care Shortage Data

In [None]:
care = pd.read_csv('/work/primary-care-shortage-areas.csv')
care

Unnamed: 0,MSSA_COUNTY,MSSA_ID,MSSA_NAME,Total_Population,EST_Physicians,EST_FNPPA,EST_Providers,Provider_Ratio,Score_Provider_Ratio,Pop_100FPL,PCT_100FPL,Score_Poverty,Score_Total,PCSA,Effective Date
0,Alameda,1.1,Livermore Central and West/Spring Town,58273,46.0,5.0,49.8,1172.5,1,3149,0.054,1,2,No,1/30/2020
1,Alameda,1.2,Altamont/Livermore East/Midway/Mountain House/...,39930,13.5,2.7,15.5,2576.1,4,1774,0.044,0,4,No,1/30/2020
2,Alameda,2a,Berkeley South and West/Emeryville/Oakland Nor...,86595,116.0,78.7,175.0,494.8,0,20908,0.241,4,4,No,1/30/2020
3,Alameda,2b,Albany/Berkeley East and North/Claremont/Cragm...,110451,273.0,39.4,302.6,365.0,0,8701,0.079,1,1,No,1/30/2020
4,Alameda,2c,Oakland West Central,88757,361.0,48.1,397.0,223.6,0,22684,0.256,5,5,Yes,1/30/2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
537,Yolo,246.1,Woodland,60630,61.0,35.3,87.5,692.9,0,7451,0.123,2,2,No,1/30/2020
538,Yolo,246.2,Knights Landing,4250,0.0,0.0,0.0,,5,558,0.131,2,7,Yes,1/30/2020
539,Yuba,247,Browns Valley/Brownsville/Dobbins/Oregon House,4365,0.0,2.3,1.7,2567.6,4,979,0.224,4,8,Yes,1/30/2020
540,Yuba,248,Wheatland,4492,0.0,0.0,0.0,,5,439,0.098,1,6,Yes,1/30/2020


In [None]:
def one_hot_encode(data):
    """
    Return the input DataFrame with the 'PCSA' column one-hot encoded.

    Parameters
    -----------
    data: A DataFrame that includes the categorical 'PCSA' column.

    Returns
    -----------
    A copy of the DataFrame in which 'PCSA' is replaced by one indicator
    column per category (e.g. 'PCSA_No', 'PCSA_Yes').

    """
    enc = ['PCSA']
    oh_enc = OneHotEncoder()
    oh_enc.fit(data[enc])

    edata = oh_enc.transform(data[enc]).toarray()
    cat_df = pd.DataFrame(data=edata, columns=oh_enc.get_feature_names_out(), index=data.index)
    return data.join(cat_df).drop(columns=enc)

carenew = one_hot_encode(care)

# The mean of a 0/1 indicator is the share of MSSAs in each county with that PCSA status
carenew_no = carenew.groupby('MSSA_COUNTY')['PCSA_No'].mean()
carenew_yes = carenew.groupby('MSSA_COUNTY')['PCSA_Yes'].mean()

# Create a DataFrame 'result_df' based on the condition
result_df = pd.DataFrame({'MSSA_COUNTY': carenew_yes.index, 'Yes': carenew_yes.values, 'No': carenew_no.values})

# Melt the DataFrame to have a 'Condition' column
result_df = pd.melt(result_df, id_vars=['MSSA_COUNTY'], value_vars=['Yes', 'No'], var_name='Condition', value_name='Percentage')

# Create a bar plot using Plotly Express
fig = px.bar(result_df, x='MSSA_COUNTY', y='Percentage', color='Condition', barmode='group',
             labels={'Percentage': 'Percentage', 'Condition': 'Condition'},
             title="Percentage of 'Yes' and 'No' by County")

# Show the plot
fig.show()

<img src="image-20231112-134710.png" width="50%" align="" />

We examined the data and utilized a geographical graph to display primary care shortage areas, offering an overview of specific counties currently facing healthcare shortages.
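
The per-county share of shortage areas can also be computed without one-hot encoding, by averaging a boolean indicator. The following is a minimal sketch on made-up rows; only the column names `MSSA_COUNTY` and `PCSA` come from the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy_care = pd.DataFrame({
    'MSSA_COUNTY': ['Alameda', 'Alameda', 'Yuba', 'Yuba'],
    'PCSA': ['No', 'Yes', 'Yes', 'Yes'],
})

# True where the MSSA is a primary care shortage area; the mean of a boolean
# column is the fraction of True values within each county
share_yes = toy_care['PCSA'].eq('Yes').groupby(toy_care['MSSA_COUNTY']).mean()
share_no = 1 - share_yes
print(share_yes)
```

Note that these values are proportions between 0 and 1; multiplying by 100 would give percentages.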

In [None]:
caredip1 = care.groupby('MSSA_COUNTY')['Pop_100FPL'].agg(np.sum)
caredip2 = care.groupby('MSSA_COUNTY')['Total_Population'].agg(np.sum)
povertyratio = caredip1/caredip2

In [None]:
Caredip1 = care.groupby('MSSA_COUNTY')['EST_Providers'].agg(np.sum)
Caredip2 = care.groupby('MSSA_COUNTY')['Total_Population'].agg(np.sum)
Provideratio = Caredip1/Caredip2
# Name each series before concatenating so every column carries the matching label
Combo = pd.concat([povertyratio.rename('Poverty_Ratio'), Provideratio.rename('Provider_Ratio')], axis=1).reset_index()
Combo

Unnamed: 0,MSSA_COUNTY,Poverty_Ratio,Provider_Ratio
0,Alameda,0.11308,0.001562
1,Alpine,0.197811,0.0
2,Amador,0.106195,0.001079
3,Butte,0.20488,0.000867
4,Calaveras,0.127505,0.000476
5,Colusa,0.139964,0.000352
6,Contra Costa,0.097502,0.001235
7,Del Norte,0.232416,0.00078
8,El Dorado,0.098168,0.000548
9,Fresno,0.254357,0.000705


## Comparison of Poverty Rate and Infectious Disease Rate

We set up the following data frame for the year 2019 to compare the infectious disease rate with the poverty rate and to examine the association between the two variables, which come from different datasets.

In [None]:
selected_df = infec_no_sex[infec_no_sex['Disease'].isin(['Malaria', 'Dengue Virus Infection', 'Campylobacteriosis', 'Salmonellosis', 'Coccidioidomycosis', 'Giardiasis', 'Shigellosis'])]
selected_df_2019 = selected_df[selected_df['Year'] == 2019]
selected_df_2019

Unnamed: 0,Disease,County,Year,Sex,Cases,Population,Rate,Lower_95__CI,Upper_95__CI
27314,Campylobacteriosis,Alameda,2019,Total,487.0,1678926,29.007,26.488,31.701
27380,Campylobacteriosis,Alpine,2019,Total,0.0,1205,0.000,0.000,305.663
27446,Campylobacteriosis,Amador,2019,Total,3.0,40279,7.448,1.536,21.765
27512,Campylobacteriosis,Butte,2019,Total,103.0,226714,45.432,37.084,55.096
27578,Campylobacteriosis,Calaveras,2019,Total,9.0,45285,19.874,9.088,37.724
...,...,...,...,...,...,...,...,...,...
153362,Shigellosis,Tulare,2019,Total,12.0,470619,2.550,1.318,4.454
153428,Shigellosis,Tuolumne,2019,Total,0.0,55316,0.000,0.000,6.669
153494,Shigellosis,Ventura,2019,Total,58.0,845474,6.860,5.209,8.868
153560,Shigellosis,Yolo,2019,Total,17.0,216073,7.868,4.583,12.597


In [None]:
rate_selected_df = selected_df_2019.groupby('County')['Rate'].mean().sort_values(ascending=False).reset_index()
povertydf = povertyratio.sort_values().to_frame().rename(columns={0: 'Poverty Ratio'}).reset_index()

In [None]:
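finaldf = pd.merge(rate_selected_df, povertydf, right_on = 'MSSA_COUNTY', left_on = 'County').drop('MSSA_COUNTY', axis = 1)
# Drop the last row (the lowest mean rate, since rows are sorted in descending order); copy so new columns can be added safely
finaldf = finaldf.iloc[:-1,:].copy()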

In [None]:
finaldf['Rate'] = finaldf['Rate'].replace(0, 0.000001)
finaldf['Log Infectious Rate'] = np.log(finaldf['Rate'])
finaldf

Unnamed: 0,County,Rate,Poverty Ratio,Log Infectious Rate
0,Kern,60.882143,0.225636,4.10894
1,Kings,26.557429,0.208739,3.27931
2,Tulare,19.639714,0.270889,2.977554
3,San Luis Obispo,18.854429,0.137888,2.936748
4,Imperial,18.567429,0.237504,2.921409
5,San Francisco,18.295143,0.116509,2.906636
6,Fresno,17.100571,0.254357,2.839112
7,San Benito,15.125,0.097226,2.716349
8,Madera,14.903,0.220586,2.701563
9,Ventura,14.615714,0.10259,2.682097


In [None]:
# Create a scatter plot with an OLS trendline using Plotly Express
fig = px.scatter(finaldf, x='Poverty Ratio', y='Log Infectious Rate', trendline='ols',
                 labels={'Poverty Ratio': 'Poverty Ratio', 'Log Infectious Rate': 'Log Infectious Disease Rate'},
                 opacity=0.7, size_max=10)

fig.update_layout(
    title_text="Association between Poverty Ratio and Log Infectious Disease Rate",
    title_x=0.5  # Set the title's x position to the center of the figure
)

# Show the plot
fig.show()

In [None]:
ols_results = px.get_trendline_results(fig).iloc[0]
slope = ols_results['px_fit_results'].params[1]
slope

0.8136265881432208

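The fitted trendline has a slope of about 0.81, which indicates a positive linear relationship between the Poverty Ratio and the Log Infectious Disease Rate: counties with a higher poverty ratio tend to have a higher log infectious disease rate.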

This correlation implies that areas with higher poverty rates may experience a correspondingly elevated incidence of infectious diseases. Understanding this connection is crucial for public health planning and intervention strategies, as it highlights the interconnectedness between socio-economic factors and health outcomes. Addressing disparities in resource allocation, healthcare access, and socio-economic conditions may be essential to effectively mitigate infectious disease burdens in vulnerable communities.
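
The slope describes the direction and size of the association, while its strength is usually summarized by the correlation coefficient or by R². As a rough sketch on made-up numbers (not the county data above), both can be computed with NumPy:

```python
import numpy as np

# Hypothetical county values for illustration only
poverty = np.array([0.10, 0.15, 0.20, 0.25])
log_rate = np.array([2.6, 2.9, 2.8, 3.1])

# Least-squares line: log_rate is approximately slope * poverty + intercept
slope, intercept = np.polyfit(poverty, log_rate, deg=1)

# Pearson correlation; for a simple linear regression, R^2 equals r squared
r = np.corrcoef(poverty, log_rate)[0, 1]
r_squared = r ** 2
print(slope, r_squared)
```

With Plotly's own fit, the same quantity should be available from the statsmodels results object returned by `px.get_trendline_results`, through its `rsquared` attribute.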

## Comparison of Poverty Rate and Healthcare Provider Access Rate

In [None]:
# Create a scatter plot with an OLS trendline using Plotly Express
fig = px.scatter(
    Combo,
    x='Poverty_Ratio',
    y='Provider_Ratio',
    color_discrete_sequence=["orange"],
    trendline="ols",
)
fig.update_layout(
    width=800,  # Adjust the width as needed
    height=500,  # Adjust the height as needed
    title_text="Association between Poverty Ratio and Provider Ratio",
    title_x=0.5,  # Set the title's x position to the center of the figure
    xaxis_title="Poverty Ratio",
    yaxis_title="Provider Ratio",
)

# Show the plot
fig.show()

From this scatterplot, we observe a negative correlation between the Provider Ratio and the Poverty Ratio: as the poverty ratio increases, the Provider Ratio tends to decrease. This inverse relationship implies that areas with higher levels of poverty tend to have reduced access to healthcare providers.

The negative correlation underscores a concerning pattern where communities grappling with elevated poverty rates face challenges in accessing healthcare resources. As the poverty ratio climbs, the availability of healthcare providers diminishes, potentially leading to increased difficulties in obtaining timely medical attention and interventions. This trend highlights a critical issue in healthcare accessibility, particularly in areas burdened by socioeconomic disparities.

## Insights from Mortality Data

In [None]:
death = pd.read_csv('/work/2016-2021-imi-results-long-view.csv')
death

Unnamed: 0,YEAR,COUNTY,HOSPITAL,OSHPDID,Procedure/Condition,Risk Adjuested Mortality Rate,# of Deaths,# of Cases,Hospital Ratings,LONGITUDE,LATITUDE
0,2016,AAAA,STATEWIDE,,AAA Repair Unruptured,1.3,30,2358,,,
1,2016,AAAA,STATEWIDE,,AMI,6.1,3178,52167,,,
2,2016,AAAA,STATEWIDE,,Acute Stroke,9.1,5482,60184,,,
3,2016,AAAA,STATEWIDE,,Acute Stroke Hemorrhagic,21.1,2580,12210,,,
4,2016,AAAA,STATEWIDE,,Acute Stroke Ischemic,5,2258,45141,,,
...,...,...,...,...,...,...,...,...,...,...,...
53211,2021,Yuba,Adventist Health and Rideout,106580996,Heart Failure,2.2,15,799,As Expected,-121.593602,39.138805
53212,2021,Yuba,Adventist Health and Rideout,106580996,Hip Fracture,3.6,4,138,As Expected,-121.593602,39.138805
53213,2021,Yuba,Adventist Health and Rideout,106580996,PCI,5.2,9,194,As Expected,-121.593602,39.138805
53214,2021,Yuba,Adventist Health and Rideout,106580996,Pancreatic Resection,,,,,-121.593602,39.138805


In [None]:
# Convert death counts to numbers (missing or non-numeric entries become 0) so they can be summed
death['# of Deaths'] = pd.to_numeric(death['# of Deaths'], errors='coerce').fillna(0)
death['YEAR'] = death['YEAR'].astype(str)
# Group by 'YEAR' and calculate the sum of deaths
sum_deaths_by_year = death.groupby('YEAR')['# of Deaths'].sum().reset_index()

# Create a bar chart using Plotly Express
fig = px.bar(sum_deaths_by_year, x='YEAR', y='# of Deaths',
             labels={'# of Deaths': 'Sum of Death Cases', 'YEAR': 'Year'},
             title='Sum of Death Cases by Year',
             color='YEAR',  # Add color to differentiate bars by year
             height=500)
# Show the plot
fig.show()
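
Summing deaths by year only gives meaningful totals when the counts are numeric. The sketch below shows one way to coerce a text column to numbers before grouping; the rows are made up, and the use of `None` and `'.'` as missing markers is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy_death = pd.DataFrame({
    'YEAR': ['2016', '2016', '2016', '2017'],
    '# of Deaths': ['30', None, '.', '12'],
})

# errors='coerce' turns anything that is not a number into NaN, which is then set to 0
deaths = pd.to_numeric(toy_death['# of Deaths'], errors='coerce').fillna(0)

# Numeric sum per year
totals = deaths.groupby(toy_death['YEAR']).sum()
print(totals)
```

If the counts were left as strings, a grouped sum would join the text instead of adding the values.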

In [None]:
# 'YEAR' was converted to strings above, so compare against '2019'; exclude the statewide summary rows
deatht = death[(death['YEAR'] == '2019') & (death['COUNTY'] != 'AAAA')].dropna().copy()
deathhos = deatht.groupby('COUNTY')[['HOSPITAL']].agg(len).iloc[:,0:8]
# Deaths per case for each hospital and procedure/condition
deatht['death ratio'] = deatht['# of Deaths'].astype(int) / deatht['# of Cases'].astype(int)
deatht2 = deatht.sort_values(by = 'death ratio', ascending = True)
death2 = deatht2.iloc[:,[0,1,2, 6,7,8,11]]
# Average the death ratio by county, then drop the lowest and highest county
death3 = death2.groupby('COUNTY')[['death ratio']].mean().sort_values(by = 'death ratio').reset_index().iloc[1:-1,:]
death3

Unnamed: 0,COUNTY,death ratio
1,Mariposa,0.008065
2,Glenn,0.017857
3,Kings,0.020049
4,Modoc,0.021053
5,Calaveras,0.021718
6,Siskiyou,0.023733
7,Madera,0.023911
8,San Joaquin,0.030356
9,San Benito,0.033178
10,Merced,0.033751
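
The county death ratio above is an unweighted mean of per-record ratios, so a record with few cases counts as much as one with many. A pooled ratio (total deaths divided by total cases) is a common alternative. The sketch below compares the two on made-up numbers; it is not the notebook's method, and the county labels are hypothetical.

```python
import pandas as pd

# Hypothetical records for illustration only
toy = pd.DataFrame({
    'COUNTY': ['A', 'A', 'B'],
    '# of Deaths': [1, 9, 2],
    '# of Cases': [10, 300, 100],
})

# Approach used in the notebook: ratio per record, then the mean per county
toy['death ratio'] = toy['# of Deaths'] / toy['# of Cases']
mean_of_ratios = toy.groupby('COUNTY')['death ratio'].mean()

# Alternative: sum deaths and cases per county first, then divide
sums = toy.groupby('COUNTY')[['# of Deaths', '# of Cases']].sum()
pooled_ratio = sums['# of Deaths'] / sums['# of Cases']

print(pd.DataFrame({'mean_of_ratios': mean_of_ratios, 'pooled_ratio': pooled_ratio}))
```

The two measures agree when a county has a single record and can differ noticeably when case counts vary widely, so which one is reported affects the county ranking.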


In [None]:
# Build the county-level provider ratio table from the Provideratio series
Provideratiodf = Provideratio.to_frame().reset_index().rename(columns = {0:'Provider_Ratio'})
Provideratiodf

Unnamed: 0,MSSA_COUNTY,Provider_Ratio
0,Alameda,0.001562
1,Alpine,0.0
2,Amador,0.001079
3,Butte,0.000867
4,Calaveras,0.000476
5,Colusa,0.000352
6,Contra Costa,0.001235
7,Del Norte,0.00078
8,El Dorado,0.000548
9,Fresno,0.000705


## Comparison of Death Rate and Provider Rate

In [None]:
# Pair every 2019 hospital record with the MSSA rows of its county
merged_df2 = pd.merge(deatht, care, left_on='COUNTY', right_on='MSSA_COUNTY', how='inner').drop('MSSA_COUNTY', axis=1)
# Calculate the death ratio: deaths divided by the population of the matched MSSA
merged_df2['dratio'] = merged_df2['# of Deaths'].astype(int) / merged_df2['Total_Population'].astype(int)

# County-level provider ratio (estimated providers per resident)
Provideratiodf = Provideratio.to_frame().reset_index()
# Merge with 'Provideratiodf' and average the numeric columns by county
merged_df3 = merged_df2.merge(Provideratiodf, left_on='COUNTY', right_on='MSSA_COUNTY', how='inner').groupby('COUNTY').mean(numeric_only=True)
merged_df3 = merged_df3.rename(columns = {0:'pratio'})

In [None]:
merged_df3 = merged_df3.rename(columns = {0:'pratio'})
# Create a scatter plot with a regression line using Plotly Express
fig = px.scatter(merged_df3, x='pratio', y='dratio', trendline='ols', labels={'pratio': 'Provider Ratio', 'dratio': 'Death Ratio'})

# Update layout to add titles and set the figure size
fig.update_layout(
    title_text="Association between Death Ratio and Provider Ratio",
    title_x=0.5,  # Set the title's x position to the center of the figure
    xaxis_title="Provider Ratio",
    yaxis_title="Death Ratio",
    width=800,  # Adjust the width as needed
    height=500,  # Adjust the height as needed
)
# Show the plot
fig.show()

The scatterplot reveals a negative correlation between the Provider Ratio and the Death Ratio. As the Provider Ratio decreases, indicating more limited access to healthcare resources, the Death Ratio tends to increase. In other words, the lower the availability of healthcare providers, the higher the mortality rate within a given population.

This negative correlation underscores a relationship between healthcare accessibility and mortality outcomes. Areas with a reduced presence of healthcare providers may experience challenges in delivering timely and effective medical interventions, leading to a potential escalation in mortality rates. The implications of this pattern emphasize the importance of addressing healthcare disparities, especially in regions where the Provider Ratio is insufficient to meet the healthcare needs of the population. Efforts to enhance healthcare accessibility can contribute to mitigating adverse outcomes and fostering healthier communities.

Building on insights from the preceding graphs, it becomes evident that marginalized communities, characterized by lower income levels, tend to exhibit a diminished number of healthcare providers. The confluence of limited financial resources and reduced healthcare accessibility contributes to an increased mortality rate within these communities. The absence of access to healthcare services becomes a potentially important factor influencing the higher mortality rates observed in such circumstances.
