# Analyzing the Role of Socioeconomic Factors in Transportation & Health Outcomes:<br>
### _An NLP Approach_

## Introduction

### Background

* The social determinants of health (SDOH) are non-medical factors that can influence public health (WHO). <br>
* Factors based on <a href="https://www.annualreviews.org/content/journals/10.1146/annurev-publhealth-031210-101218">The Social Determinants of Health: Coming of Age</a> are: <br>
    * <u>Neighbourhood Conditions:</u> social + physical environments; individual health
    * <u>Working Conditions:</u> employment-related factors (the physical aspects of working as well as the work environment; psychosocial aspects; opportunities & resources)
    * <u>Education:</u> educational opportunities and the social & psychological factors associated
    * <u>Income & Wealth:</u> socioeconomic status, the environment one grows up with, and the resources they have access to
    * <u>Race & Racism:</u> racial disparities in health & healthcare
    * <u>Stress:</u> psychological stressors (their causes & effects)
* I have also identified common themes between these classifications, and made note of them, as follows: 

    * <u>Environment</u>: the physical landscape of the environment, services available, individual health, the physical aspects of work (eg muscle strain, back pain), the work environment (eg collaborative, desk space, etc), area demographics, economic hardship, racial discrimination, relative social status
    * <u>Social/Psychological</u>: social relationships, individual health, stress, subjective social status, self-perception
    * <u>Opportunities & Resources</u>: literacy, employment opportunities, literacy, income/wealth, education, social safety nets, Medicare, etc.
    * <u>Biological Effects</u>: release of cortisol & cytokines, ageing, disease
* Socioeconomic factors (SEFs) such as income, education, and employment play a crucial role in shaping transportation accessibility and health outcomes. 
* From the list above, we are interested in investigating **accessibility and transportation** which falls under the category of **Neighbourhood Conditions** because they are related to the social and physical environment people live in, which affects people's mobility and daily life activities
* Accessibility to transportation is also deeply tied with the other themes since socioeconomic status can affect vehicle ownership and access. Lack of transportation can cause psychological stress and increase daily burdens. Having limited transportation also limits access to healthcare, a solid education, and a stable job. All these factors affect people's overall well-being.

### Research Question

**Research Question**: Are research studies on public health more likely to mention transportation accessibility compared to other topics, and how is this relationship framed?

**Purpose**: to examine how accessibility to transportation is discussed in relation to public health by using NLP techniques to identify key themes, sentiment, and the frequency of associations in textual data from research studies

**Objectives:** <br>
1) To use NLP techniques to extract and categorize recurring themes in research studies discussing transportation accessibility and public health
2) To compare how often transportation is mentioned relative to other public health concerns (ie employment, education, income & wealth, race, and stress)
3) To analyze whether accessibility is discussed as a positive,  negative, or neutral facator in health outcomes
4) To identify patterns in how transportation is framed in relation to health (eg beneficial and improving healthcare access, or as a barrier and worsening disparities in health)

**Hypothesis**: Public health research studies are more likely to mention transportation accessibility compared to other social determinants of health, and they predominantly frame it as a positive factor for health outcomes.

In [90]:
# import packages
import spacy, re, nltk
import pandas as pd
import xml.etree.ElementTree as ET
import scipy

from collections import Counter

# NLP Packages
import gensim
import string
import re
import gensim.corpora as corpora
from scipy.linalg import triu
from gensim.models.ldamodel import LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer



In [91]:
nlp = spacy.load("en_core_web_sm")

## Data Collection

### Data Acquisition Methods

* Collected 50 research articles about public health from the Internet and stored them in Mendeley Reference Manager
* Exported the references as a .xml file
* Stored the contents of the research articles each in a .txt file in an articles folder

## Data Processing and Preprocessing

### Text Data Preprocessing

In [92]:
# divide text into sections by label
def split_by_label(text, labels):
    pattern = '|'.join(re.escape(label) for label in labels)
    sections = re.split(pattern, text)
    return sections

In [93]:
# get the section labels for Article 1
labels = []
with open ('./articles/article1.txt', 'r') as file:
    for line in file:
        if line.strip() and not line.startswith((" ", "\t")):  # line is not indented
            labels.append(line.strip())
labels

['Abstract',
 'Multifaceted Nature of Transportation Insecurity Among Patients With Cancer',
 'Prevalence of Transportation Insecurity Among Patients With Cancer',
 'Consequences of Transportation Insecurity Among Patients With Cancer',
 'Screening for Transportation Insecurity Among Patients With Cancer',
 'Efforts to Address Transportation Insecurity for Patients With Cancer',
 'Policy Agenda for Addressing Transportation Insecurity for Patients With Cancer',
 'Data Infrastructure Research Agenda to Address Transportation Insecurity for Patients With Cancer']

In [94]:
# placing the words from the article in a list / string
no_blanks = []
with open ('./articles/article1.txt', 'r') as f:
    lines = f.readlines()
    stripped_words = [[word.strip() for word in line.split() if len(word) > 0] for line in lines]
    for lst in stripped_words:
        if len(lst) > 0:
            no_blanks.append(lst)

In [95]:
# Stop Words
stop_words = set(stopwords.words("English"))

In [None]:
# Clean text - remove stopwords, punctuation, and special characters
def preprocess(word):
    word = re.sub(r"[^\w\s]", " ", word) # remove punctuation and special characters (anything that is not word or whitespace)
    word = re.sub(r"\d+", "", word) # remove numbers
    word = word.lower()
    if word not in stop_words:
        return word
    else:
        return ''

In [97]:
# cleaned text as a string
article_text = ' '.join([' '.join([preprocess(word) for word in lst if type(preprocess(word)) != list]) for lst in no_blanks])
article_text = ' '.join([word.strip() for word in article_text.split()])
article_text

'abstract health care related transportation insecurity common united states patient cancer especially vulnerable cancer care episodic nature occurs prolonged period marked frequent clinical encounters requires intense treatments result substantial financial hardship result transportation insecurity patient cancer may forego miss delay alter and or prematurely terminate necessary care limited data suggest alteration care potential increase rate cancer recurrence mortality exacerbate disparity cancer incidence severity outcomes transportation insecurity also negatively impact informal caregiver provider health system societal levels recognizing transportation critical determinant outcome patient cancer ongoing effort develop evidence based protocol identify at risk patient address transportation insecurity federal policy health system not for profit industry levels national cancer policy forum national academy science engineering medicine sponsored series webinars addressing key social 

### Named Entity Recognition

Let's investigate how common words fall in the categories of the social determinants of health and its related topics. The topics we will investigate in our text are:<br>
* **Social Determinants of Health:** the non-medical factors that influence a person's health (WHO)<br>
* **Health:** a person's physical, mental, and emotional well-being in a general sense, not necessarily related to the social determinants of health<br>
* **Privilege:** this may be mentioned a lot if SDOH is found to favor some people over others<br>
* **Socioeconomic status:** a measurement of where people stand socially or economically; this may be crucial to the social determinants of health<br>
* **Transportation:** we are interested in whether an association exists between transportation and the social determinants of health. By looking at positive/negative sentiment of the word along with topics related to privilege and socioeconomic status, we can get a better understanding of how inequalities in accessibility to transportation affect public health.<br>

In [98]:
sdoh_words = [
    "access", "poverty", "education", "employment", "housing", "nutrition", "environment", 
    "equity", "income", "stress", "safety", "transport", "literacy", "insurance", 
    "community", "resources", "disparity", "inequality", "lifestyle", "socioeconomic", 
    "healthcare", "discrimination", "segregation", "neighborhood", "accessibility", 
    "prevention", "opportunity", "advocacy", "wellness", "infrastructure", "support", 
    "inclusion", "participation", "mobility", "stability", "vulnerability", "diversity", 
    "integration", "outreach", "screening", "services", "policy", "networks", "empowerment", 
    "collaboration", "demographics", "risk", "culture", "barriers", "awareness"
]
sdoh_lemma = [word.lemma_ for word in nlp(' '.join(sdoh_words))]

In [99]:
transportation_words = [
    'transportation', 'mobility', 'infrastructure', 'commute', 'traffic', 'roadway',
    'vehicle', 'automobile', 'rail', 'bus', 'subway', 'cycle', 'carpool', 'pedestrian',
    'congestion', 'accessibility', 'logistics', 'freight', 'aviation', 'ports', 'shipping',
    'train', 'route', 'sustainability', 'emissions', 'efficiency', 'network', 'policy',
    'green', 'fuels', 'electric', 'self-driving', 'urban', 'system', 'equity', 'safety',
    'cyclist', 'carsharing', 'ridesharing', 'demand', 'cost', 'vehicle-sharing', 'transport',
    'motion', 'access', 'distance', 'trip', 'congestion', 'platform', 'traffic-jam', 'driver',
    'passenger', 'hub'
]


In [100]:
def find_mentions(text, keywords):
    doc =  nlp(text)
    mentions = [token.text for token in doc if token.lower_ in keywords]
    entities = [ent.text for ent in doc.ents if ent.label_ in ['GPE', 'ORG', 'PERSON']]
    return mentions, entities
    

In [101]:
Counter(find_mentions(article_text, transportation_words)[0])

Counter({'transportation': 234,
         'cost': 51,
         'system': 32,
         'policy': 30,
         'safety': 10,
         'access': 9,
         'distance': 8,
         'infrastructure': 8,
         'platform': 8,
         'trip': 3,
         'demand': 3,
         'bus': 2,
         'vehicle': 2,
         'urban': 2,
         'equity': 1,
         'transport': 1,
         'logistics': 1,
         'hub': 1,
         'driver': 1,
         'automobile': 1,
         'passenger': 1,
         'mobility': 1,
         'ridesharing': 1,
         'efficiency': 1})

In [102]:
Counter(find_mentions(article_text, sdoh_words)[1])

Counter({'medicaid': 17,
         'medicare': 9,
         'nemt': 8,
         'mtm': 5,
         'united states': 2,
         'national academy science engineering': 2,
         'emphysema lung': 1,
         'continuum examine multilevel': 1,
         'continuum screening': 1,
         'american academy': 1,
         'family physician': 1,
         'illinois health s program': 1,
         'partnership university illinois health kaizen': 1})

### LDA: Latent Dirichlet Allocation

In [103]:
no_blanks2 = []
with open ('./articles/article2.txt', 'r') as f:
    lines = f.readlines()
    stripped_words = [[word.strip() for word in line.split() if len(word) > 0] for line in lines]
    for lst in stripped_words:
        if len(lst) > 0:
            no_blanks2.append(lst)

In [104]:
# with just 2 articles
article_text2 = ' '.join([' '.join([preprocess(word) for word in lst if type(preprocess(word)) != list]) for lst in no_blanks2])
article_text2 = ' '.join([word.strip() for word in article_text2.split()])
article_text2

'abstract well recognized transportation barrier healthcare access rural dwelling residents particularly older adults healthcare restructuring initiative seldom take consideration complexity transportation act barrier appropriate timely access healthcare service older adult rural com munities article present finding qualitative research study explored complex nature transportation challenge rural dwelling older adult experience western canada trying access primary community care services data derived larger study service user view healthcare restructuring initiative intended facilitate aging in place conducted focus group interview diverse sample older adult living one urban centre nine rural small rural town british columbia bc s interior used content analysis determine code derive themes study finding showed transportation wa top priority improving primary community care older adult participant identified range transportation challenge trying get healthcare service care provider gett

In [105]:
Counter(find_mentions(article_text2, transportation_words)[0])

Counter({'transportation': 88,
         'urban': 31,
         'access': 28,
         'bus': 18,
         'cost': 12,
         'trip': 10,
         'policy': 9,
         'system': 8,
         'mobility': 6,
         'driver': 4,
         'transport': 4,
         'distance': 4,
         'vehicle': 4,
         'infrastructure': 3,
         'hub': 1,
         'route': 1})

In [106]:
Counter(find_mentions(article_text2, sdoh_words)[0])

Counter({'community': 44,
         'access': 28,
         'services': 12,
         'healthcare': 11,
         'policy': 9,
         'support': 8,
         'mobility': 6,
         'housing': 4,
         'transport': 4,
         'income': 3,
         'infrastructure': 3,
         'resources': 1,
         'education': 1,
         'employment': 1,
         'neighborhood': 1,
         'advocacy': 1,
         'disparity': 1,
         'barriers': 1,
         'stress': 1,
         'risk': 1,
         'opportunity': 1})

In [107]:
docs = [article_text, article_text2]


In [None]:
def preprocess_text(text):
    text = text.lower() 
    text = re.sub(r'\d+', '', text)  
    text = text.translate(str.maketrans('', '', string.punctuation)) 
    tokens = simple_preprocess(text)  # this function includes tokenize text!
    tokens = [word for word in tokens if word not in stop_words] 
    return tokens

In [110]:
little_processed_docs = [preprocess_text(doc) for doc in docs]

In [111]:
little_dictionary = corpora.Dictionary(little_processed_docs)

In [112]:
corpus = [little_dictionary.doc2bow(doc) for doc in little_processed_docs]

In [113]:
little_lda_model = LdaModel(
    corpus=corpus,
    id2word=little_dictionary,
    num_topics=5,
    random_state=42,
    passes=10,
    per_word_topics=True
)


In [114]:
topics = little_lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.028*"health" + 0.021*"transportation" + 0.021*"rural" + 0.019*"older" + 0.018*"care"')
(1, '0.001*"transportation" + 0.001*"health" + 0.001*"care" + 0.001*"cancer" + 0.001*"rural"')
(2, '0.002*"transportation" + 0.002*"patient" + 0.001*"care" + 0.001*"cancer" + 0.001*"insecurity"')
(3, '0.001*"transportation" + 0.001*"cancer" + 0.001*"health" + 0.001*"patient" + 0.001*"need"')
(4, '0.042*"transportation" + 0.035*"cancer" + 0.028*"patient" + 0.024*"insecurity" + 0.020*"health"')


In [115]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

little_lda_display = gensimvis.prepare(little_lda_model, corpus, little_dictionary)
pyLDAvis.display(little_lda_display)

In [116]:
import os
all_docs = []

# extract the text for all the articles
for article in os.listdir('./articles/'):
    text = []
    with open (f'./articles/{article}', 'r') as f:
        lines = f.readlines()
        stripped_words = [[word.strip() for word in line.split() if len(word) > 0] for line in lines]
        for lst in stripped_words:
            if len(lst) > 0:
                text.append(lst)
        
    article_text_str = ' '.join([' '.join([preprocess(word) for word in lst if type(preprocess(word)) != list]) for lst in text])
    article_text_str = ' '.join([word.strip() for word in article_text_str.split()])
    all_docs.append(article_text_str)

In [117]:
all_docs

['abstract background australia aboriginal people underserved transport system le able easily get place need go others part larger pattern exclusion inequity aboriginal people affect health wellbeing social participation guided decolonising framework research explored older aboriginal people whose pivotal role family community require mobility experience transportation system providing indigenous centred view accessibility transportation option society methods interview drawing yarning technique conducted ten older aboriginal people living greater western sydney analysed qualitatively results addition cognitive labour required decipher rule transport system organise commitment match scheduling transport services older aboriginal people study experienced stigmatising attitude condescending treatment service professional public traveling conclusions study suggests three potential way current trajectory underserves older aboriginal people could disrupted relating service design diversity 

In [118]:
processed_docs = [preprocess_text(doc) for doc in all_docs]

In [119]:
dictionary = corpora.Dictionary(processed_docs)

In [120]:
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [145]:
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=15,
    random_state=42,
    passes=10,
    per_word_topics=True
)

In [142]:
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.016*"transport" + 0.015*"people" + 0.015*"aboriginal" + 0.011*"participant" + 0.009*"wa"')
(1, '0.025*"traffic" + 0.017*"exposure" + 0.011*"et" + 0.011*"al" + 0.009*"health"')
(2, '0.005*"health" + 0.005*"transportation" + 0.004*"traffic" + 0.003*"wa" + 0.003*"people"')
(3, '0.030*"transportation" + 0.022*"health" + 0.021*"cancer" + 0.016*"care" + 0.015*"patient"')
(4, '0.002*"transportation" + 0.002*"health" + 0.002*"care" + 0.001*"cancer" + 0.001*"social"')


In [143]:
lda_display = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_display)

In [144]:
from gensim.models import CoherenceModel

coherence_model_score = CoherenceModel(model = lda_model, texts = processed_docs, dictionary = dictionary, coherence = 'u_mass')
coherence_model_score.get_coherence()

-0.6482832138920955