# Analyzing the Role of Socioeconomic Factors in Transportation & Health Outcomes:<br>
### _An NLP Approach_

## Introduction

### Background

* The social determinants of health (SDOH) are non-medical factors that can influence public health (WHO). <br>
* Factors based on <a href="https://www.annualreviews.org/content/journals/10.1146/annurev-publhealth-031210-101218">The Social Determinants of Health: Coming of Age</a> are: <br>
    * <u>Neighbourhood Conditions:</u> social + physical environments; individual health
    * <u>Working Conditions:</u> employment-related factors (the physical aspects of working as well as the work environment; psychosocial aspects; opportunities & resources)
    * <u>Education:</u> educational opportunities and the social & psychological factors associated
    * <u>Income & Wealth:</u> socioeconomic status, the environment one grows up with, and the resources they have access to
    * <u>Race & Racism:</u> racial disparities in health & healthcare
    * <u>Stress:</u> psychological stressors (their causes & effects)
* I also identified themes that recur across these classifications: 

    * <u>Environment</u>: the physical landscape of the environment, services available, individual health, the physical aspects of work (e.g. muscle strain, back pain), the work environment (e.g. collaborative, desk space), area demographics, economic hardship, racial discrimination, relative social status
    * <u>Social/Psychological</u>: social relationships, individual health, stress, subjective social status, self-perception
    * <u>Opportunities & Resources</u>: literacy, employment opportunities, income/wealth, education, social safety nets, Medicare, etc.
    * <u>Biological Effects</u>: release of cortisol & cytokines, ageing, disease
* Socioeconomic factors (SEFs) such as income, education, and employment play a crucial role in shaping transportation accessibility and health outcomes. 
* From the list above, we focus on **accessibility and transportation**, which fall under **Neighbourhood Conditions** because they are part of the social and physical environment people live in and shape people's mobility and daily activities.
* Transportation accessibility is also closely tied to the other themes: socioeconomic status affects vehicle ownership and access, a lack of transportation can cause psychological stress and add to daily burdens, and limited transportation restricts access to healthcare, a solid education, and a stable job. All these factors affect people's overall well-being.

### Research Question

**Research Question**: Are research studies on public health more likely to mention transportation accessibility compared to other topics, and how is this relationship framed?

**Purpose**: To examine how accessibility to transportation is discussed in relation to public health, using NLP techniques to identify key themes, sentiment, and the frequency of associations in textual data from research studies.

**Objectives:** <br>
1) To use NLP techniques to extract and categorize recurring themes in research studies discussing transportation accessibility and public health
2) To compare how often transportation is mentioned relative to other public health concerns (i.e. employment, education, income & wealth, race, and stress)
3) To analyze whether accessibility is discussed as a positive, negative, or neutral factor in health outcomes
4) To identify patterns in how transportation is framed in relation to health (e.g. as a benefit that improves healthcare access, or as a barrier that worsens health disparities)

**Hypothesis**: Public health research studies are more likely to mention transportation accessibility compared to other social determinants of health, and they predominantly frame it as a positive factor for health outcomes.
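
Objective 2 can be made concrete as each category's share of all keyword mentions in a document. The block below is only a minimal sketch of that idea, not the method used in this notebook: `CATEGORY_KEYWORDS` and `category_shares` are hypothetical names introduced here, and the keyword sets are illustrative assumptions rather than a validated lexicon.

```python
from collections import Counter

# Hypothetical keyword sets, one per SDOH category (illustrative assumption).
CATEGORY_KEYWORDS = {
    "transportation": {"transportation", "transport", "bus", "transit"},
    "employment": {"employment", "job", "work"},
    "education": {"education", "school", "literacy"},
    "income": {"income", "wealth", "poverty"},
    "race": {"race", "racism", "discrimination"},
    "stress": {"stress", "anxiety"},
}

def category_shares(tokens, category_keywords=CATEGORY_KEYWORDS):
    """Return each category's share of all keyword mentions in `tokens`."""
    counts = Counter()
    for token in tokens:
        for category, keywords in category_keywords.items():
            if token in keywords:
                counts[category] += 1
    total = sum(counts.values())
    # Guard against division by zero when no keyword occurs at all.
    return {c: (counts[c] / total if total else 0.0) for c in category_keywords}

# Toy input: already lowercased and tokenized.
sample = "lack of transportation limits employment and raises stress bus routes".split()
shares = category_shares(sample)
print(shares)
```

Reporting shares rather than raw counts keeps long and short articles comparable, which matters when the hypothesis is about relative frequency.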

In [1]:
# import packages
import re
import string
import xml.etree.ElementTree as ET
from collections import Counter

import nltk
import pandas as pd
import scipy
import spacy

# NLP Packages
import gensim
import gensim.corpora as corpora
from gensim.models.ldamodel import LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer



In [2]:
nlp = spacy.load("en_core_web_sm")

## Data Collection

### Data Acquisition Methods

* Collected 50 research articles about public health from the Internet and stored them in Mendeley Reference Manager
* Exported the references as a .xml file
* Stored the text of each research article in its own .txt file in an articles folder

## Data Processing and Preprocessing

### Text Data Preprocessing

In [3]:
# divide text into sections by label
def split_by_label(text, labels):
    pattern = '|'.join(re.escape(label) for label in labels)
    sections = re.split(pattern, text)
    return sections

In [4]:
# get the section labels for Article 1 (headings are the non-blank, non-indented lines)
labels = []
with open('./articles/article1.txt', 'r') as file:
    for line in file:
        if line.strip() and not line.startswith((" ", "\t")):  # line is not indented
            labels.append(line.strip())
labels

['Abstract',
 'Multifaceted Nature of Transportation Insecurity Among Patients With Cancer',
 'Prevalence of Transportation Insecurity Among Patients With Cancer',
 'Consequences of Transportation Insecurity Among Patients With Cancer',
 'Screening for Transportation Insecurity Among Patients With Cancer',
 'Efforts to Address Transportation Insecurity for Patients With Cancer',
 'Policy Agenda for Addressing Transportation Insecurity for Patients With Cancer',
 'Data Infrastructure Research Agenda to Address Transportation Insecurity for Patients With Cancer']
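
`split_by_label` returns the section bodies but discards which heading each body belongs to. As a hedged sketch of one way to keep that pairing, the hypothetical helper `sections_by_label` below (not part of the original notebook) uses a capturing group so that `re.split` keeps the matched headings. It assumes each heading string occurs in the text only as a heading; a heading phrase repeated in the body would split the text there as well.

```python
import re

def sections_by_label(text, labels):
    """Map each heading in `labels` to the text that follows it."""
    # Try longer labels first so a heading that is a prefix of another cannot shadow it.
    ordered = sorted(labels, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(label) for label in ordered) + ")"
    # With a capturing group, re.split keeps the matched headings in the result:
    # [preamble, heading1, body1, heading2, body2, ...]
    parts = re.split(pattern, text)
    return {heading: body.strip() for heading, body in zip(parts[1::2], parts[2::2])}

# Toy text in the same layout as the article files: unindented headings, indented bodies.
sample_text = "Abstract\n  Transportation insecurity is common.\nMethods\n  We ran interviews.\n"
sections = sections_by_label(sample_text, ["Abstract", "Methods"])
print(sections)
```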

In [5]:
# split each non-blank line of the article into a list of words
with open('./articles/article1.txt', 'r') as f:
    no_blanks = [line.split() for line in f if line.split()]

In [6]:
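# Stop Words (the NLTK corpus file is named "english", in lowercase)
nltk.download("stopwords", quiet=True)  # no-op if the corpus is already installed
stop_words = set(stopwords.words("english"))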

In [7]:
# Clean a single word - remove stopwords, punctuation, and special characters
def preprocess(word):
    word = re.sub(r"[^\w\s]", " ", word)  # replace punctuation and special characters with a space
    word = re.sub(r"\d+", "", word)  # remove numbers
    word = word.lower()
    # Note: a word with internal punctuation (e.g. "and/or" -> "and or") no longer
    # matches any single stop word, so its parts are kept.
    if word not in stop_words:
        return word
    else:
        return ''

In [8]:
# cleaned text as a single whitespace-normalized string
article_text = ' '.join(preprocess(word) for lst in no_blanks for word in lst)
article_text = ' '.join(article_text.split())
article_text

'abstract health care related transportation insecurity common united states patients cancer especially vulnerable cancer care episodic nature occurs prolonged period marked frequent clinical encounters requires intense treatments results substantial financial hardship result transportation insecurity patients cancer may forego miss delay alter and or prematurely terminate necessary care limited data suggest alterations care potential increase rates cancer recurrence mortality exacerbate disparities cancer incidence severity outcomes transportation insecurity also negatively impacts informal caregiver provider health system societal levels recognizing transportation critical determinant outcomes patients cancer ongoing efforts develop evidence based protocols identify at risk patients address transportation insecurity federal policy health system not for profit industry levels national cancer policy forum national academies science engineering medicine sponsored series webinars address

### Named Entity Recognition

Next, we look at how often the words in our text fall into the social determinants of health and related topics. The topics we investigate are:<br>
* **Social Determinants of Health:** the non-medical factors that influence a person's health (WHO)<br>
* **Health:** a person's physical, mental, and emotional well-being in a general sense, not necessarily related to the social determinants of health<br>
* **Privilege:** likely to be mentioned often if the SDOH are found to favor some people over others<br>
* **Socioeconomic status:** a measure of where people stand socially or economically; this may be crucial to the social determinants of health<br>
* **Transportation:** we are interested in whether an association exists between transportation and the social determinants of health. Looking at the positive/negative sentiment around the word, together with topics related to privilege and socioeconomic status, gives a better understanding of how inequalities in transportation accessibility affect public health.<br>

In [9]:
sdoh_words = [
    "access", "poverty", "education", "employment", "housing", "nutrition", "environment", 
    "equity", "income", "stress", "safety", "transport", "literacy", "insurance", 
    "community", "resources", "disparity", "inequality", "lifestyle", "socioeconomic", 
    "healthcare", "discrimination", "segregation", "neighborhood", "accessibility", 
    "prevention", "opportunity", "advocacy", "wellness", "infrastructure", "support", 
    "inclusion", "participation", "mobility", "stability", "vulnerability", "diversity", 
    "integration", "outreach", "screening", "services", "policy", "networks", "empowerment", 
    "collaboration", "demographics", "risk", "culture", "barriers", "awareness"
]
sdoh_lemma = [word.lemma_ for word in nlp(' '.join(sdoh_words))]

In [10]:
transportation_words = [
    'transportation', 'mobility', 'infrastructure', 'commute', 'traffic', 'roadway',
    'vehicle', 'automobile', 'rail', 'bus', 'subway', 'cycle', 'carpool', 'pedestrian',
    'congestion', 'accessibility', 'logistics', 'freight', 'aviation', 'ports', 'shipping',
    'train', 'route', 'sustainability', 'emissions', 'efficiency', 'network', 'policy',
    'green', 'fuels', 'electric', 'self-driving', 'urban', 'system', 'equity', 'safety',
    'cyclist', 'carsharing', 'ridesharing', 'demand', 'cost', 'vehicle-sharing', 'transport',
    'motion', 'access', 'distance', 'trip', 'platform', 'traffic-jam', 'driver',
    'passenger', 'hub'
]


In [11]:
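def find_mentions(text, keywords):
    """Return (keyword tokens found in text, GPE/ORG/PERSON entities found in text)."""
    doc = nlp(text)
    keywords = set(keywords)  # set membership is faster than scanning a list
    mentions = [token.text for token in doc if token.lower_ in keywords]
    entities = [ent.text for ent in doc.ents if ent.label_ in ['GPE', 'ORG', 'PERSON']]
    return mentions, entities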

In [12]:
Counter(find_mentions(article_text, transportation_words)[0])

Counter({'transportation': 234,
         'cost': 39,
         'system': 30,
         'policy': 30,
         'safety': 10,
         'access': 9,
         'infrastructure': 8,
         'distance': 7,
         'demand': 3,
         'bus': 2,
         'urban': 2,
         'equity': 1,
         'vehicle': 1,
         'transport': 1,
         'trip': 1,
         'logistics': 1,
         'passenger': 1,
         'mobility': 1,
         'ridesharing': 1,
         'efficiency': 1})

In [13]:
# index [1] selects the named entities, which do not depend on the keyword list passed in
Counter(find_mentions(article_text, sdoh_words)[1])

Counter({'medicaid': 16,
         'medicare': 12,
         'nemt': 5,
         'mtm': 5,
         'united states': 2,
         'emphysema lung': 1,
         'social determinants health disparities': 1,
         'continuum examine multilevel': 1,
         'continuum screening': 1,
         'american academy': 1,
         'illinois health s program': 1,
         'partnership university illinois health kaizen': 1})

### LDA: Latent Dirichlet Allocation

In [14]:
# split each non-blank line of Article 2 into a list of words
with open('./articles/article2.txt', 'r') as f:
    no_blanks2 = [line.split() for line in f if line.split()]

In [15]:
# cleaned text of Article 2 (the first LDA model is fitted with just 2 articles)
article_text2 = ' '.join(preprocess(word) for lst in no_blanks2 for word in lst)
article_text2 = ' '.join(article_text2.split())

In [16]:
Counter(find_mentions(article_text2, transportation_words)[0])

Counter({'transportation': 88,
         'urban': 31,
         'access': 28,
         'bus': 18,
         'system': 8,
         'policy': 8,
         'mobility': 6,
         'trip': 5,
         'transport': 4,
         'cost': 4,
         'vehicle': 3,
         'infrastructure': 3,
         'driver': 2,
         'distance': 2,
         'route': 1})

In [17]:
Counter(find_mentions(article_text2, sdoh_words)[0])

Counter({'services': 38,
         'access': 28,
         'community': 27,
         'healthcare': 11,
         'barriers': 10,
         'policy': 8,
         'support': 7,
         'mobility': 6,
         'housing': 4,
         'transport': 4,
         'resources': 3,
         'income': 3,
         'infrastructure': 3,
         'education': 1,
         'employment': 1,
         'neighborhood': 1,
         'advocacy': 1,
         'stress': 1,
         'risk': 1,
         'opportunity': 1})
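
The two keyword lists overlap, so a word such as `access` is counted once as a transportation mention and again as an SDOH mention. The sketch below, which simply re-enters the Article 2 counts printed above, shows one way to separate the shared terms before comparing totals. The variable names are introduced here for illustration.

```python
from collections import Counter

# Counts copied from the two Article 2 outputs above.
transport_counts = Counter({
    'transportation': 88, 'urban': 31, 'access': 28, 'bus': 18, 'system': 8,
    'policy': 8, 'mobility': 6, 'trip': 5, 'transport': 4, 'cost': 4,
    'vehicle': 3, 'infrastructure': 3, 'driver': 2, 'distance': 2, 'route': 1,
})
sdoh_counts = Counter({
    'services': 38, 'access': 28, 'community': 27, 'healthcare': 11,
    'barriers': 10, 'policy': 8, 'support': 7, 'mobility': 6, 'housing': 4,
    'transport': 4, 'resources': 3, 'income': 3, 'infrastructure': 3,
    'education': 1, 'employment': 1, 'neighborhood': 1, 'advocacy': 1,
    'stress': 1, 'risk': 1, 'opportunity': 1,
})

# Words that appear in both keyword lists and are therefore counted twice.
shared = set(transport_counts) & set(sdoh_counts)
shared_total = sum(transport_counts[w] for w in shared)

# Mentions that belong to only one of the two lists.
transport_only = sum(transport_counts.values()) - shared_total
sdoh_only = sum(sdoh_counts.values()) - shared_total

print(sorted(shared), shared_total, transport_only, sdoh_only)
```

Whether shared terms should count toward transportation, toward SDOH, or toward neither is a design decision that directly affects the comparison in the hypothesis, so it is worth stating explicitly.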

In [18]:
docs = [article_text, article_text2]


In [19]:
def preprocess_text(text):
    text = text.lower()  # lowercase
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    tokens = simple_preprocess(text)  # tokenize (also drops very short and very long tokens)
    tokens = [word for word in tokens if word not in stop_words]  # remove stop words
    return tokens

In [20]:
little_processed_docs = [preprocess_text(doc) for doc in docs]

In [21]:
little_dictionary = corpora.Dictionary(little_processed_docs)

In [22]:
corpus = [little_dictionary.doc2bow(doc) for doc in little_processed_docs]

In [23]:
little_lda_model = LdaModel(
    corpus=corpus,
    id2word=little_dictionary,
    num_topics=5,
    random_state=42,
    passes=10,
    per_word_topics=True
)


In [24]:
topics = little_lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.028*"health" + 0.021*"transportation" + 0.021*"rural" + 0.019*"older" + 0.018*"care"')
(1, '0.001*"health" + 0.001*"transportation" + 0.001*"care" + 0.001*"rural" + 0.001*"adults"')
(2, '0.002*"transportation" + 0.002*"health" + 0.001*"cancer" + 0.001*"care" + 0.001*"insecurity"')
(3, '0.042*"transportation" + 0.035*"cancer" + 0.025*"insecurity" + 0.022*"patients" + 0.020*"health"')
(4, '0.002*"transportation" + 0.002*"cancer" + 0.001*"health" + 0.001*"care" + 0.001*"patients"')
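
With only two documents and `num_topics=5`, three of the five topics printed above have top-word weights of about 0.001 to 0.002, which suggests they received almost no probability mass. The sketch below parses the printed topic strings and flags such topics. The `top_weight` helper and the 0.005 cutoff are assumptions made for illustration, not gensim features.

```python
import re

# Topic strings copied from the print_topics output above.
topic_strings = {
    0: '0.028*"health" + 0.021*"transportation" + 0.021*"rural" + 0.019*"older" + 0.018*"care"',
    1: '0.001*"health" + 0.001*"transportation" + 0.001*"care" + 0.001*"rural" + 0.001*"adults"',
    2: '0.002*"transportation" + 0.002*"health" + 0.001*"cancer" + 0.001*"care" + 0.001*"insecurity"',
    3: '0.042*"transportation" + 0.035*"cancer" + 0.025*"insecurity" + 0.022*"patients" + 0.020*"health"',
    4: '0.002*"transportation" + 0.002*"cancer" + 0.001*"health" + 0.001*"care" + 0.001*"patients"',
}

def top_weight(topic_string):
    """Return the largest word weight in a gensim-style topic string."""
    weights = [float(w) for w in re.findall(r'(\d+\.\d+)\*"', topic_string)]
    return max(weights)

# Hypothetical cutoff: a topic is "near-empty" if no word exceeds this weight.
THRESHOLD = 0.005
near_empty = [tid for tid, s in topic_strings.items() if top_weight(s) < THRESHOLD]
print(near_empty)
```

The two well-populated topics line up with the two input articles (rural older adults and patients with cancer), so a smaller `num_topics` may be a better fit for a corpus of this size.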


In [25]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

little_lda_display = gensimvis.prepare(little_lda_model, corpus, little_dictionary)
pyLDAvis.display(little_lda_display)

In [26]:
import os
all_docs = []

# extract and clean the text of every article in the folder
for article in os.listdir('./articles/'):
    if not article.endswith('.txt'):
        continue  # skip anything that is not an article text file
    with open(f'./articles/{article}', 'r') as f:
        text = [line.split() for line in f if line.split()]

    article_text_str = ' '.join(preprocess(word) for lst in text for word in lst)
    article_text_str = ' '.join(article_text_str.split())
    all_docs.append(article_text_str)

In [27]:
all_docs

['abstract background australia aboriginal people underserved transport system less able easily get places need go others part larger pattern exclusion inequity aboriginal people affects health wellbeing social participation guided decolonising framework research explored older aboriginal people whose pivotal roles families communities require mobility experience transportation system providing indigenous centred view accessibility transportation options society methods interviews drawing yarning technique conducted ten older aboriginal people living greater western sydney analysed qualitatively results addition cognitive labour required decipher rules transport system organise commitments match scheduling transport services older aboriginal people study experienced stigmatising attitudes condescending treatment service professionals public traveling conclusions study suggests three potential ways current trajectory underserves older aboriginal people could disrupted relating service d

In [28]:
processed_docs = [preprocess_text(doc) for doc in all_docs]

In [29]:
dictionary = corpora.Dictionary(processed_docs)

In [30]:
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [31]:
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,
    random_state=42,
    passes=10,
    per_word_topics=True
)

In [32]:
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.033*"transportation" + 0.024*"health" + 0.020*"cancer" + 0.017*"care" + 0.014*"insecurity"')
(1, '0.001*"transportation" + 0.001*"cancer" + 0.001*"health" + 0.001*"traffic" + 0.001*"social"')
(2, '0.026*"traffic" + 0.017*"exposure" + 0.011*"et" + 0.011*"al" + 0.009*"health"')
(3, '0.017*"transport" + 0.017*"people" + 0.016*"aboriginal" + 0.013*"participants" + 0.007*"community"')
(4, '0.019*"cancer" + 0.017*"sdoh" + 0.008*"health" + 0.008*"interventions" + 0.007*"care"')


In [33]:
lda_display = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_display)

In [34]:
from gensim.models import CoherenceModel

coherence_model_score = CoherenceModel(
    model=lda_model, texts=processed_docs, dictionary=dictionary, coherence='u_mass'
)
coherence_model_score.get_coherence()

-0.6838758638638988

In [35]:
print(f'Perplexity Score: {lda_model.log_perplexity(corpus)}')
# Note: log_perplexity returns a per-word log-likelihood bound, not the perplexity itself

Perplexity Score: -7.191121614117823
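
The number printed above is the per-word likelihood bound that gensim's `log_perplexity` returns, which is why it is negative. According to the gensim documentation, the corresponding perplexity is `2 ** (-bound)`. A short sketch of the conversion, using the printed value:

```python
# Per-word likelihood bound copied from the output above.
bound = -7.191121614117823

# Perplexity as gensim defines it: 2 raised to the negative bound.
perplexity = 2 ** (-bound)
print(f"Perplexity: {perplexity:.1f}")
```

Lower perplexity is better, which corresponds to a bound closer to zero.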


### Sentiment Analysis

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Example of sentiment analysis on a single sentence

analyzer = SentimentIntensityAnalyzer()
text = "Accessible public transportation greatly improves patient access to healthcare services."
score = analyzer.polarity_scores(text)
score


{'neg': 0.0, 'neu': 0.744, 'pos': 0.256, 'compound': 0.4754}


In [50]:
# Like preprocess, but keeps punctuation so that sentence structure is preserved
def preprocess_sentence(word):
    word = re.sub(r"\d+", "", word) # remove numbers
    word = word.lower()
    if word not in stop_words:
        return word
    else:
        return ''

In [51]:
article_text2_sentences = ' '.join(preprocess_sentence(word) for lst in no_blanks2 for word in lst)
article_text2_sentences = ' '.join(article_text2_sentences.split())
article_text2_sentences

'abstract well recognized transportation barrier healthcare access rural-dwelling residents, particularly older adults. healthcare restructuring initiatives seldom take consideration complexity transportation, acts barrier appropriate timely access healthcare services older adults rural com\xad munities. article presents findings qualitative research study explored complex nature transportation challenges rural-dwelling older adults experience western canada trying access primary community care services. data derived larger study service user views healthcare restructuring initiative intended facilitate aging-in-place. conducted focus groups interviews diverse sample older adults living one urban centre nine rural small rural towns british columbia (bc)’s interior. used content analysis determine codes derive themes. study findings showed transportation top priority improving primary community care. older adult participants identified range transportation challenges trying get healthca

In [52]:
score1 = analyzer.polarity_scores(article_text2_sentences)
score1


{'neg': 0.031, 'neu': 0.831, 'pos': 0.138, 'compound': 0.9999}

In [53]:
article_text_sentences = ' '.join(preprocess_sentence(word) for lst in no_blanks for word in lst)
article_text_sentences = ' '.join(article_text_sentences.split())
article_text_sentences

'abstract health-care–related transportation insecurity common united states. patients cancer especially vulnerable cancer care episodic nature, occurs prolonged period, marked frequent clinical encounters, requires intense treatments, results substantial financial hardship. result transportation insecurity, patients cancer may forego, miss, delay, alter, and/or prematurely terminate necessary care. limited data suggest alterations care potential increase rates cancer recurrence mortality exacerbate disparities cancer incidence, severity, outcomes. transportation insecurity also negatively impacts informal caregiver, provider, health system, societal levels. recognizing transportation critical determinant outcomes patients cancer, ongoing efforts develop evidence-based protocols identify at-risk patients address transportation insecurity federal policy, health system, not-for-profit, industry levels. , national cancer policy forum national academies science, engineering, medicine spons

In [54]:
analyzer.polarity_scores(article_text_sentences)

{'neg': 0.241, 'neu': 0.639, 'pos': 0.119, 'compound': -1.0}
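
Both whole-article compound scores sit at the extremes (0.9999 and -1.0). A likely reason is that VADER sums the valence of every scored word and then squashes the sum into [-1, 1], so any sufficiently long text saturates. The sketch below re-implements that squashing step as I understand it from the vaderSentiment source (`x / sqrt(x² + α)` with `α = 15`). It does not call the library, and the raw sums are made-up inputs.

```python
import math

def normalize(score, alpha=15):
    """VADER-style squashing of a summed valence score into [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))

# Made-up raw valence sums of increasing size, as a longer text would produce.
for raw in (1, 5, 50, 500, -500):
    print(raw, round(normalize(raw), 4))
```

Because of this saturation, the compound score of a whole article mostly reflects its length and the sign of the summed valence. Scoring sentence by sentence and then averaging is a common alternative for long documents.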

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

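# Toy example: two hand-labelled sentences
texts = ["Transportation barriers limit healthcare access.", "Public transit improves patient mobility."]
labels = [0, 1]  # 0 = Negative, 1 = Positive

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)

# None of the words in this test sentence occur in the two training sentences, so the
# prediction falls back on the (equal) class priors and the tie resolves to class 0.
print(model.predict(vectorizer.transform(["Lack of transport worsens health outcomes."])))  # Output: [0]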


[0]
