# Analyzing the Role of Socioeconomic Factors in Transportation & Health Outcomes:<br>
### _An NLP Approach_

**Purpose:** to determine whether accessibility to transportation has a positive impact on public health

**Objectives:** <br>
1) to observe how often socioeconomic factors, such as income, education, and employment, are mentioned in relation to transportation and health outcomes
2) to identify the type of association that exists between socioeconomic factors (particularly transportation and accessibility) and health


In [1]:
# import packages
import re
import string
import xml.etree.ElementTree as ET
from collections import Counter
from string import punctuation

import nltk
import pandas as pd
import scipy
import spacy
from nltk.corpus import stopwords
from scipy import stats
from scipy.stats import pearsonr
from textblob import TextBlob

# NLP packages
from scipy.linalg import triu
import gensim
import gensim.corpora as corpora
from gensim.models.ldamodel import LdaModel
from gensim.utils import simple_preprocess

In [None]:
nlp = spacy.load("en_core_web_sm")

## Data Collection

### Data Acquisition Methods

_describe how i got the data here_

## Data Processing and Preprocessing

### Text Data Preprocessing

In [3]:
# divide text into sections by label
def split_by_label(text, labels):
    pattern = '|'.join(re.escape(label) for label in labels)
    sections = re.split(pattern, text)
    return sections
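
As a quick check of how `split_by_label` behaves, here is a minimal sketch on a toy string. The text and labels are made up for illustration, and the pairing of labels to sections assumes each label occurs exactly once and in the listed order.

```python
import re

def split_by_label(text, labels):
    pattern = '|'.join(re.escape(label) for label in labels)
    return re.split(pattern, text)

# Hypothetical two-section document (not taken from the articles)
toy_text = "Abstract\n  Transport matters.\nMethods\n  We read articles.\n"
toy_labels = ["Abstract", "Methods"]

sections = split_by_label(toy_text, toy_labels)
# sections[0] holds whatever precedes the first label (empty here),
# so the labelled sections start at index 1.
by_label = dict(zip(toy_labels, (s.strip() for s in sections[1:])))
print(by_label)
```

If one label were a prefix of another, the order of the alternatives in the pattern would matter; sorting the labels by length (longest first) before joining is one way to guard against that.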

In [4]:
# get the section labels for Article 1
# (any non-empty line that is not indented is treated as a section heading)
labels = []
with open('./articles/article1.txt', 'r') as file:
    for line in file:
        if line.strip() and not line.startswith((" ", "\t")):  # line is not indented
            labels.append(line.strip())
labels

['Abstract',
 'Multifaceted Nature of Transportation Insecurity Among Patients With Cancer',
 'Prevalence of Transportation Insecurity Among Patients With Cancer',
 'Consequences of Transportation Insecurity Among Patients With Cancer',
 'Screening for Transportation Insecurity Among Patients With Cancer',
 'Efforts to Address Transportation Insecurity for Patients With Cancer',
 'Policy Agenda for Addressing Transportation Insecurity for Patients With Cancer',
 'Data Infrastructure Research Agenda to Address Transportation Insecurity for Patients With Cancer']

In [5]:
# place the words from the article in a list of word lists, one per non-empty line
no_blanks = []
with open('./articles/article1.txt', 'r') as f:
    for line in f:
        words = line.split()  # split() already strips whitespace and drops empty strings
        if words:
            no_blanks.append(words)

In [6]:
# Stop words (the NLTK corpus file is named "english", in lowercase)
stop_words = set(stopwords.words("english"))

In [7]:
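# Clean text - remove stopwords, punctuation, and special characters
def preprocess(word):
    """Clean a single whitespace-delimited word; return '' if it is a stop word."""
    word = re.sub(r"[^\w\s]", " ", word)  # replace punctuation and special characters with a space
    word = re.sub(r"\d+", "", word)  # remove numbers
    word = word.lower()
    # Note: the stop-word check sees the whole string, so "and/or" becomes
    # "and or" and is kept even though "and" and "or" are stop words.
    if word not in stop_words:
        return word
    return ''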
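
The cleaned text further down still contains fragments such as "and or" and "not for profit", because `preprocess` checks the whole string against the stop-word list after punctuation has been turned into spaces. A possible token-level variant is sketched below. It is not the function used in this notebook, and `DEMO_STOP_WORDS` is a tiny stand-in for the NLTK list so that the example runs on its own.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption for this demo)
DEMO_STOP_WORDS = {"and", "or", "not", "for", "at", "the", "s"}

def preprocess_tokens(word, stop_words=DEMO_STOP_WORDS):
    """Clean one whitespace-delimited word and return the pieces that survive."""
    word = re.sub(r"[^\w\s]", " ", word)  # punctuation -> space
    word = re.sub(r"\d+", "", word)       # drop digits
    pieces = word.lower().split()         # "and or" -> ["and", "or"]
    return [p for p in pieces if p not in stop_words]

slash_result = preprocess_tokens("and/or")
hyphen_result = preprocess_tokens("not-for-profit")
comma_result = preprocess_tokens("Cancer,")
print(slash_result, hyphen_result, comma_result)
```

Because this variant returns a list of pieces rather than a single string, the joining code that calls it would need to flatten the result.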

In [18]:
# cleaned text as a single string
article_text = ' '.join(preprocess(word) for lst in no_blanks for word in lst)
article_text = ' '.join(article_text.split())  # collapse the gaps left by removed words
article_text

'abstract health care related transportation insecurity common united states patients cancer especially vulnerable cancer care episodic nature occurs prolonged period marked frequent clinical encounters requires intense treatments results substantial financial hardship result transportation insecurity patients cancer may forego miss delay alter and or prematurely terminate necessary care limited data suggest alterations care potential increase rates cancer recurrence mortality exacerbate disparities cancer incidence severity outcomes transportation insecurity also negatively impacts informal caregiver provider health system societal levels recognizing transportation critical determinant outcomes patients cancer ongoing efforts develop evidence based protocols identify at risk patients address transportation insecurity federal policy health system not for profit industry levels national cancer policy forum national academies science engineering medicine sponsored series webinars address

### Named Entity Recognition

Let's investigate how often words associated with the social determinants of health (SDOH) and related topics appear in the text. The topics we will investigate are:<br>
* **Social Determinants of Health:** the non-medical factors that influence a person's health (WHO)<br>
* **Health:** a person's physical, mental, and emotional well-being in a general sense, not necessarily related to the social determinants of health<br>
* **Privilege:** likely to be mentioned often if the SDOH are found to favor some people over others<br>
* **Socioeconomic status:** a measure of where people stand socially and economically; this may be crucial to the social determinants of health<br>
* **Transportation:** we are interested in whether an association exists between transportation and the social determinants of health. By looking at the positive or negative sentiment around the word, along with topics related to privilege and socioeconomic status, we can get a better understanding of how inequalities in access to transportation affect public health.<br>
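
One way to quantify how often socioeconomic terms are mentioned "in relation to" transportation is a co-occurrence window: count the SDOH words that fall within a few tokens of a transportation word. The sketch below is only an illustration of that idea. The function name, the window size, and the toy sentence are all assumptions, not part of the analysis in this notebook.

```python
from collections import Counter

def cooccurrences(tokens, anchors, targets, window=5):
    """Count target words that appear within `window` tokens of any anchor word."""
    anchors, targets = set(anchors), set(targets)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok not in anchors:
            continue
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i and tokens[j] in targets:
                counts[tokens[j]] += 1
    return counts

# Hypothetical cleaned sentence (not taken from the articles)
demo = "low income patients lack transportation access rural poverty limits bus service".split()
near_transport = cooccurrences(
    demo, {"transportation", "bus"}, {"income", "poverty", "access"}, window=3
)
print(near_transport)
```

A target word that sits near two different anchor words is counted once per anchor, which is why "poverty" is counted twice in the toy sentence.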

In [19]:
sdoh_words = [
    "access", "poverty", "education", "employment", "housing", "nutrition", "environment", 
    "equity", "income", "stress", "safety", "transport", "literacy", "insurance", 
    "community", "resources", "disparity", "inequality", "lifestyle", "socioeconomic", 
    "healthcare", "discrimination", "segregation", "neighborhood", "accessibility", 
    "prevention", "opportunity", "advocacy", "wellness", "infrastructure", "support", 
    "inclusion", "participation", "mobility", "stability", "vulnerability", "diversity", 
    "integration", "outreach", "screening", "services", "policy", "networks", "empowerment", 
    "collaboration", "demographics", "risk", "culture", "barriers", "awareness"
]
sdoh_lemma = [word.lemma_ for word in nlp(' '.join(sdoh_words))]

In [20]:
transportation_words = [
    'transportation', 'mobility', 'infrastructure', 'commute', 'traffic', 'roadway',
    'vehicle', 'automobile', 'rail', 'bus', 'subway', 'cycle', 'carpool', 'pedestrian',
    'congestion', 'accessibility', 'logistics', 'freight', 'aviation', 'ports', 'shipping',
    'train', 'route', 'sustainability', 'emissions', 'efficiency', 'network', 'policy',
    'green', 'fuels', 'electric', 'self-driving', 'urban', 'system', 'equity', 'safety',
    'cyclist', 'carsharing', 'ridesharing', 'demand', 'cost', 'vehicle-sharing', 'transport',
    'motion', 'access', 'distance', 'trip', 'congestion', 'platform', 'traffic-jam', 'driver',
    'passenger', 'hub'
]
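
Hand-built keyword lists are easy to get subtly wrong. Because `preprocess` replaces punctuation with spaces, hyphenated entries such as 'self-driving' cannot appear as a single token in the cleaned text, and a repeated entry adds nothing. A small audit like the sketch below can flag both cases; `audit_keywords` is a hypothetical helper and the demo list is only a subset of the terms above.

```python
import re
from collections import Counter

def audit_keywords(keywords):
    """Return (duplicated keywords, keywords containing non-word characters)."""
    duplicates = sorted(k for k, n in Counter(keywords).items() if n > 1)
    unmatchable = sorted({k for k in keywords if re.search(r"[^\w]", k)})
    return duplicates, unmatchable

# Subset of the transportation terms, used here only for illustration
demo_keywords = ["congestion", "self-driving", "traffic-jam", "congestion", "bus"]
duplicates, unmatchable = audit_keywords(demo_keywords)
print(duplicates, unmatchable)
```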


In [21]:
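def find_mentions(text, keywords):
    """Return (keyword mentions, named entities) found in `text`."""
    keywords = set(keywords)  # set membership is faster than scanning a list
    doc = nlp(text)
    mentions = [token.text for token in doc if token.lower_ in keywords]
    # entities are limited to places, organizations and people; they do not depend on `keywords`
    entities = [ent.text for ent in doc.ents if ent.label_ in ['GPE', 'ORG', 'PERSON']]
    return mentions, entities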

In [22]:
Counter(find_mentions(article_text, transportation_words)[0])

Counter({'transportation': 234,
         'cost': 39,
         'system': 30,
         'policy': 30,
         'safety': 10,
         'access': 9,
         'infrastructure': 8,
         'distance': 7,
         'demand': 3,
         'bus': 2,
         'urban': 2,
         'equity': 1,
         'vehicle': 1,
         'transport': 1,
         'trip': 1,
         'logistics': 1,
         'passenger': 1,
         'mobility': 1,
         'ridesharing': 1,
         'efficiency': 1})

In [None]:
# index 1 selects the named entities, which do not depend on the keyword list
Counter(find_mentions(article_text, sdoh_words)[1])

Counter({'medicaid': 16,
         'medicare': 12,
         'nemt': 5,
         'mtm': 5,
         'united states': 2,
         'emphysema lung': 1,
         'social determinants health disparities': 1,
         'continuum screening': 1,
         'american academy': 1,
         'illinois health s program': 1,
         'partnership university illinois health kaizen': 1})

### LDA: Latent Dirichlet Allocation

In [24]:
# read Article 2 into a list of word lists, one per non-empty line
no_blanks2 = []
with open('./articles/article2.txt', 'r') as f:
    for line in f:
        words = line.split()
        if words:
            no_blanks2.append(words)

In [65]:
# cleaned text for Article 2 (the first LDA run below uses just these 2 articles)
article_text2 = ' '.join(preprocess(word) for lst in no_blanks2 for word in lst)
article_text2 = ' '.join(article_text2.split())
article_text2

'abstract well recognized transportation barrier healthcare access rural dwelling residents particularly older adults healthcare restructuring initiatives seldom take consideration complexity transportation acts barrier appropriate timely access healthcare services older adults rural com munities article presents findings qualitative research study explored complex nature transportation challenges rural dwelling older adults experience western canada trying access primary community care services data derived larger study service user views healthcare restructuring initiative intended facilitate aging in place conducted focus groups interviews diverse sample older adults living one urban centre nine rural small rural towns british columbia bc s interior used content analysis determine codes derive themes study findings showed transportation top priority improving primary community care older adult participants identified range transportation challenges trying get healthcare services car

In [66]:
Counter(find_mentions(article_text2, transportation_words)[0])

Counter({'transportation': 88,
         'urban': 31,
         'access': 28,
         'bus': 18,
         'system': 8,
         'policy': 8,
         'mobility': 6,
         'trip': 5,
         'transport': 4,
         'cost': 4,
         'vehicle': 3,
         'infrastructure': 3,
         'driver': 2,
         'distance': 2,
         'route': 1})

In [67]:
Counter(find_mentions(article_text2, sdoh_words)[0])

Counter({'services': 38,
         'access': 28,
         'community': 27,
         'healthcare': 11,
         'barriers': 10,
         'policy': 8,
         'support': 7,
         'mobility': 6,
         'housing': 4,
         'transport': 4,
         'resources': 3,
         'income': 3,
         'infrastructure': 3,
         'education': 1,
         'employment': 1,
         'neighborhood': 1,
         'advocacy': 1,
         'stress': 1,
         'risk': 1,
         'opportunity': 1})
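
The raw counts for the two articles (for example, 234 versus 88 mentions of "transportation") are hard to compare directly because the articles differ in length. A common remedy is to express each count as a rate per 1,000 tokens. The sketch below shows the idea on toy token lists; `rate_per_1000` is a hypothetical helper, and the toy documents stand in for `article_text.split()` and `article_text2.split()`.

```python
from collections import Counter

def rate_per_1000(tokens, keywords):
    """Keyword hits per 1,000 tokens, so documents of different lengths can be compared."""
    keywords = set(keywords)
    if not tokens:
        return {}
    hits = Counter(tok for tok in tokens if tok in keywords)
    return {word: round(1000 * n / len(tokens), 1) for word, n in hits.items()}

# Hypothetical toy documents (not the real articles)
long_doc = ["transportation"] * 4 + ["care"] * 36
short_doc = ["transportation"] * 2 + ["bus"] * 1 + ["care"] * 7

long_rates = rate_per_1000(long_doc, ["transportation", "bus"])
short_rates = rate_per_1000(short_doc, ["transportation", "bus"])
print(long_rates, short_rates)
```

In the toy data the longer document has more raw mentions of "transportation" but the lower rate, which is the kind of reversal that raw counts can hide.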

In [32]:
docs = [article_text, article_text2]


In [34]:
def preprocess_text(text):
    text = text.lower()  # normalize case
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # strip punctuation
    tokens = simple_preprocess(text)  # gensim helper: lowercases and tokenizes the text
    tokens = [word for word in tokens if word not in stop_words]  # drop stop words
    return tokens

In [None]:
little_processed_docs = [preprocess_text(doc) for doc in docs]

In [None]:
little_dictionary = corpora.Dictionary(little_processed_docs)

In [None]:
corpus = [little_dictionary.doc2bow(doc) for doc in little_processed_docs]
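
The dictionary and `doc2bow` steps turn each document into a bag of words: a list of (token id, count) pairs. The plain-Python analogue below is meant only to show the shape of that representation. It is not gensim's implementation, the ids it assigns need not match gensim's, and `build_vocab` and `to_bow` are hypothetical names.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Represent a tokenized document as sorted (token id, count) pairs."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Hypothetical tokenized documents (not the real articles)
demo_docs = [["transportation", "cancer", "transportation"], ["rural", "transportation"]]
vocab = build_vocab(demo_docs)
demo_corpus = [to_bow(doc, vocab) for doc in demo_docs]
print(vocab)
print(demo_corpus)
```

Word order is discarded in this representation; LDA only sees how often each token occurs in each document.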

In [None]:
little_lda_model = LdaModel(
    corpus=corpus,
    id2word=little_dictionary,
    num_topics=15,  # many more topics than documents here; see how flat most printed topics are
    random_state=42,
    passes=10,
    per_word_topics=True
)


In [None]:
topics = little_lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.001*"health" + 0.001*"transportation" + 0.001*"care" + 0.001*"older" + 0.001*"insecurity"')
(1, '0.001*"transportation" + 0.001*"health" + 0.001*"care" + 0.001*"adults" + 0.001*"older"')
(2, '0.002*"transportation" + 0.002*"cancer" + 0.002*"health" + 0.001*"insecurity" + 0.001*"care"')
(3, '0.044*"transportation" + 0.037*"cancer" + 0.026*"insecurity" + 0.023*"patients" + 0.021*"health"')
(4, '0.003*"transportation" + 0.002*"cancer" + 0.002*"care" + 0.001*"patients" + 0.001*"health"')
(5, '0.001*"transportation" + 0.001*"health" + 0.001*"rural" + 0.001*"older" + 0.001*"care"')
(6, '0.001*"health" + 0.001*"transportation" + 0.001*"care" + 0.001*"patients" + 0.001*"cancer"')
(7, '0.002*"transportation" + 0.002*"health" + 0.001*"cancer" + 0.001*"insecurity" + 0.001*"care"')
(8, '0.002*"transportation" + 0.002*"cancer" + 0.002*"patients" + 0.002*"health" + 0.001*"care"')
(9, '0.001*"transportation" + 0.001*"care" + 0.001*"health" + 0.001*"cancer" + 0.001*"insecurity"')
(10, '0.002*"t

In [None]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

little_lda_display = gensimvis.prepare(little_lda_model, corpus, little_dictionary)
pyLDAvis.display(little_lda_display)

In [79]:
import os
all_docs = []

# extract the text for all the articles
for article in sorted(os.listdir('./articles/')):  # sorted so the document order is reproducible
    text = []
    with open(f'./articles/{article}', 'r') as f:
        for line in f:
            words = line.split()  # split() already strips whitespace and drops empty strings
            if words:
                text.append(words)

    article_text_str = ' '.join(preprocess(word) for lst in text for word in lst)
    article_text_str = ' '.join(article_text_str.split())  # collapse the gaps left by removed words
    all_docs.append(article_text_str)

In [80]:
all_docs

['abstract health care related transportation insecurity common united states patients cancer especially vulnerable cancer care episodic nature occurs prolonged period marked frequent clinical encounters requires intense treatments results substantial financial hardship result transportation insecurity patients cancer may forego miss delay alter and or prematurely terminate necessary care limited data suggest alterations care potential increase rates cancer recurrence mortality exacerbate disparities cancer incidence severity outcomes transportation insecurity also negatively impacts informal caregiver provider health system societal levels recognizing transportation critical determinant outcomes patients cancer ongoing efforts develop evidence based protocols identify at risk patients address transportation insecurity federal policy health system not for profit industry levels national cancer policy forum national academies science engineering medicine sponsored series webinars addres

In [81]:
processed_docs = [preprocess_text(doc) for doc in all_docs]

In [84]:
dictionary = corpora.Dictionary(processed_docs)

In [85]:
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [86]:
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=15,
    random_state=42,
    passes=10,
    per_word_topics=True
)

In [87]:
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.030*"health" + 0.022*"transportation" + 0.022*"rural" + 0.020*"older" + 0.019*"care"')
(1, '0.001*"transportation" + 0.001*"health" + 0.001*"care" + 0.001*"cancer" + 0.001*"older"')
(2, '0.004*"transportation" + 0.003*"cancer" + 0.003*"patients" + 0.003*"insecurity" + 0.002*"health"')
(3, '0.043*"transportation" + 0.036*"cancer" + 0.025*"insecurity" + 0.023*"patients" + 0.021*"health"')
(4, '0.001*"traffic" + 0.001*"health" + 0.001*"exposure" + 0.001*"transportation" + 0.001*"cancer"')
(5, '0.002*"transportation" + 0.001*"health" + 0.001*"care" + 0.001*"cancer" + 0.001*"patients"')
(6, '0.002*"transportation" + 0.001*"cancer" + 0.001*"health" + 0.001*"insecurity" + 0.001*"patients"')
(7, '0.029*"traffic" + 0.018*"exposure" + 0.012*"al" + 0.012*"et" + 0.010*"health"')
(8, '0.001*"transportation" + 0.001*"health" + 0.001*"cancer" + 0.001*"insecurity" + 0.001*"care"')
(9, '0.001*"traffic" + 0.001*"transportation" + 0.001*"health" + 0.001*"cancer" + 0.001*"exposure"')
(10, '0.001*"t
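
In the printed topics, only a few have word weights well above 0.001; the rest are nearly flat and carry little information. To pick out the informative topics programmatically, the strings returned by `print_topics` can be parsed into (word, weight) pairs. The sketch below does this with a regular expression on three of the topic strings shown above. The 0.01 cut-off and the helper name `parse_topic` are assumptions chosen for illustration.

```python
import re

def parse_topic(topic_string):
    """Turn '0.043*"transportation" + ...' into [('transportation', 0.043), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Three topic strings copied from the output above
sample_topics = [
    (1, '0.001*"transportation" + 0.001*"health" + 0.001*"care" + 0.001*"cancer" + 0.001*"older"'),
    (3, '0.043*"transportation" + 0.036*"cancer" + 0.025*"insecurity" + 0.023*"patients" + 0.021*"health"'),
    (7, '0.029*"traffic" + 0.018*"exposure" + 0.012*"al" + 0.012*"et" + 0.010*"health"'),
]

MIN_TOP_WEIGHT = 0.01  # assumed threshold for the heaviest word in a topic
informative = {
    topic_id: parse_topic(s)
    for topic_id, s in sample_topics
    if parse_topic(s)[0][1] >= MIN_TOP_WEIGHT
}
print(sorted(informative))
```

Topic 7 also shows "et" and "al" among its top words, which suggests citation text is leaking into the corpus; adding such tokens to the stop-word list is one possible follow-up.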

In [88]:
lda_display = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_display)