# Exploring text analysis for BRAC

This notebook explores how text analysis of news articles might be used to study the uncertainty effect of base closures.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

## Sample cities with BRAC DoD proposal amounts

Data copied from 2005_BRAC_proposals.csv, collected from the 2005 BRAC commission report, Appendix O.

In [2]:
data = [['Atlanta', -6820, 124, -6764],
        ['Baltimore', -4414, 11619, 7727],
        ['Lubbock', -7, 0, -7],
        ['Norfolk', -10399, 10400, 1],
        ['Norwich', -8460, 0, -8460],
        ['Richmond', -812, 7821, 7009],
        ['San Antonio', -8129, 10946, 2817],
        ['Indianapolis', -283, 3592, 3309]]
df_brac = pd.DataFrame(data, columns=['city', 'jobs_lost', 'jobs_gained', 'net'])
df_brac

Unnamed: 0,city,jobs_lost,jobs_gained,net
0,Atlanta,-6820,124,-6764
1,Baltimore,-4414,11619,7727
2,Lubbock,-7,0,-7
3,Norfolk,-10399,10400,1
4,Norwich,-8460,0,-8460
5,Richmond,-812,7821,7009
6,San Antonio,-8129,10946,2817
7,Indianapolis,-283,3592,3309


# Loading the sample news articles from AWN

There are ten articles from places representing a variety of proposed BRAC outcomes, with Baltimore and Norwich represented twice.

In [3]:
def load_article(fname):
    #Note: We don't actually know the structure of the text files we would get from purchased
    #AWN access, so this function both handles the trivial example data, and serves as a
    #placeholder for whatever the future structure looks like.
    with open(os.path.join('articles', fname), 'r') as ifile:
        text = ifile.read()

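    #Based on the body of every article beginning after the "Words" count, return only the body.
    #If the marker is missing, return the full text instead of slicing from a bogus offset.
    marker = text.find('Words')
    if marker == -1:
        return text
    return text[marker+len('Words')+2:]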

articles = {fname[:-4]:load_article(fname) for fname in os.listdir(r'articles')}
{city:text[:100] for city, text in articles.items()} #display first 100 characters

{'Atlanta': 'Local officials are preparing to lobby members of the federal Base Realignment and Closure Commissio',
 'Baltimore 1': 'While officials in Maryland have publicly welcomed the prospect of gaining 6,600 federal jobs throug',
 'Baltimore 2': 'Like many federal workers driving from the Baltimore suburbs into Washington every day, Marshall Hud',
 'Indianapolis': 'Gov. Mitch Daniels expressed relief this morning that the Pentagon plans to keep open the Naval Surf',
 'Lubbock': 'HAWTHORNE, Nev. (AP) - For more than 50 years, this struggling desert town that proudly calls itself',
 'Norfolk': 'FORT MONROE - There was good news in the bad news.\n\nAfter centuries of standing guard, Fort Monroe m',
 'Norwich 1': 'NORWICH - The task is simple: Do the math and find the flaw.\n\nWhen the Groton submarine base was tar',
 'Norwich 2': 'WASHINGTON - While the initial focus of the state has been on the Groton Submarine Base, it is not t',
 'Richmond': "The Prince George Comfort Inn, a half

# Basic word presence

This mirrors the behavior we can get through the AWN interface directly, without any text analysis. This does not require purchasing access to the text of articles.

In [4]:
def brac_keywords(text):
    acronym = 'brac' in text.lower()
    full = 'base realignment and closure' in text.lower()
    return acronym or full
brac_articles = {city: brac_keywords(text) for city, text in articles.items()}
print('Is BRAC discussed? (should be all True):')
brac_articles

Is BRAC discussed? (should be all True):


{'Atlanta': True,
 'Baltimore 1': True,
 'Baltimore 2': True,
 'Indianapolis': True,
 'Lubbock': True,
 'Norfolk': True,
 'Norwich 1': True,
 'Norwich 2': True,
 'Richmond': True,
 'San Antonio': True}

In [5]:
def unc_keywords(text):
    unc = 'uncertain' in text.lower()
    econ = 'econom' in text.lower()
    return unc and econ
#use a separate name so the BRAC keyword results from the previous cell are not overwritten
unc_articles = {city: unc_keywords(text) for city, text in articles.items()}
print('Is economic uncertainty discussed?:')
unc_articles

Is economic uncertainty discussed?:


{'Atlanta': False,
 'Baltimore 1': False,
 'Baltimore 2': False,
 'Indianapolis': False,
 'Lubbock': False,
 'Norfolk': True,
 'Norwich 1': False,
 'Norwich 2': False,
 'Richmond': False,
 'San Antonio': False}
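
One caveat with plain substring checks like the two above is that `'brac' in text` also matches words such as "bracket". A minimal sketch of a stricter version using word boundaries follows; the pattern names and function names are my own for illustration and are not part of the AWN interface or the cells above.

```python
import re

# Hypothetical patterns (not from the cells above). \b anchors a match at a
# word boundary, so "brac" no longer matches inside "bracket".
BRAC_PATTERN = re.compile(r"\b(?:brac|base realignment and closure)\b", re.IGNORECASE)
UNCERTAIN_PATTERN = re.compile(r"\buncertain\w*", re.IGNORECASE)
ECON_PATTERN = re.compile(r"\beconom\w*", re.IGNORECASE)


def mentions_brac(text):
    """Return True if 'BRAC' or its full name appears as whole words."""
    return BRAC_PATTERN.search(text) is not None


def mentions_economic_uncertainty(text):
    """Return True if both an 'uncertain*' word and an 'econom*' word appear."""
    return (UNCERTAIN_PATTERN.search(text) is not None
            and ECON_PATTERN.search(text) is not None)
```

Like the original checks, this only tests that both word stems occur somewhere in the article; it does not require them to appear near each other.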

# Text Analysis

First we insert the sentiment scoring into the pipeline, then ingest the data into spaCy to tokenize the contents. Sentiment scores range from 1 (positive) to -1 (negative).

In [6]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob');

docs = {city:nlp(text) for city, text in articles.items()} #turning plain text into spaCy Doc objects

In [7]:
doc_polarity = {city:round(doc._.blob.polarity, 4)  for city, doc in docs.items()}
print('Overall sentiment (higher = more positive)')
sorted(doc_polarity.items(), key=lambda item: item[1], reverse=True) #sort by score, not by name

Overall sentiment (higher = more positive)


[('Lubbock', 0.0835),
 ('Baltimore 2', 0.0673),
 ('San Antonio', 0.0527),
 ('Norfolk', 0.0394),
 ('Richmond', 0.0348),
 ('Indianapolis', -0.0015),
 ('Baltimore 1', -0.0108),
 ('Norwich 1', -0.0903),
 ('Atlanta', -0.1481),
 ('Norwich 2', -0.1697)]

To compare the sentiment scores to the actual proposed job changes, I average the scores for the cities with two articles and merge the results into the BRAC table. The extremes align fairly well: Atlanta and Norwich both have large proposed net job losses, and both are at the bottom of the sentiment rankings. This is fairly rudimentary, but might be enough text analysis for this paper.

In [8]:
#combine the two articles for Baltimore and Norwich into an average for each
doc_polarity['Baltimore'] = round(float(np.mean([doc_polarity['Baltimore 1'], doc_polarity['Baltimore 2']])), 4)
doc_polarity['Norwich'] = round(float(np.mean([doc_polarity['Norwich 1'], doc_polarity['Norwich 2']])), 4)
del doc_polarity['Baltimore 1'], doc_polarity['Baltimore 2'], doc_polarity['Norwich 1'], doc_polarity['Norwich 2']

#convert to a dataframe in order to merge and display neatly
polarity = pd.DataFrame({'city':doc_polarity.keys(), 'doc_sentiment':doc_polarity.values()})
df_brac = df_brac.merge(polarity, on='city')

In [9]:
df_brac.sort_values('doc_sentiment', ascending=False)

Unnamed: 0,city,jobs_lost,jobs_gained,net,doc_sentiment
2,Lubbock,-7,0,-7,0.0835
6,San Antonio,-8129,10946,2817,0.0527
3,Norfolk,-10399,10400,1,0.0394
5,Richmond,-812,7821,7009,0.0348
1,Baltimore,-4414,11619,7727,0.0282
7,Indianapolis,-283,3592,3309,-0.0015
4,Norwich,-8460,0,-8460,-0.13
0,Atlanta,-6820,124,-6764,-0.1481
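
To put a rough number on how well the two rankings line up, one option is a Spearman rank correlation between `net` and `doc_sentiment`. The sketch below is a plain-Python illustration, assuming no tied values (true for the eight rows above); the helper names are my own, and the values are copied by hand from the table.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Values copied from the merged table above (same city order in both lists).
net = [-7, 2817, 1, 7009, 7727, 3309, -8460, -6764]
doc_sentiment = [0.0835, 0.0527, 0.0394, 0.0348, 0.0282, -0.0015, -0.13, -0.1481]

rho = spearman_rho(net, doc_sentiment)
print(round(rho, 3))
```

On these eight rows the coefficient comes out to about 0.24. The two cities with the largest net losses sit at the bottom of both rankings, but the ordering among the remaining six matches only loosely, which is worth keeping in mind with such a small sample.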


# Exploring sentiment about specific topics

Text analysis can do named entity recognition based on pre-trained models. ORG=organizations, GPE=geopolitical entities, LOC=non-GPE locations. This is an area where we may want to train our own model, at least in part, since we can do things like link different entities together (e.g. "base realignment and closure" linked to "BRAC", or all military base names linked together).

In [10]:
#Focusing on just ORGs identified in the article for Indianapolis
orgs = [(entity, entity.label_) for entity in docs['Indianapolis'].ents if entity.label_ == 'ORG']
orgs

[(Pentagon, 'ORG'),
 (the Naval Surface Warfare Center at Crane, 'ORG'),
 (Defense Finance and Accounting Service, 'ORG'),
 (Base Realignment and Closure, 'ORG'),
 (Crane, 'ORG'),
 (Defense Finance and Accounting Service, 'ORG'),
 (DFAS, 'ORG'),
 (the Department of Defense, 'ORG'),
 (Soldier Support Center, 'ORG'),
 (Army, 'ORG'),
 (Crane, 'ORG'),
 (Daniels, 'ORG'),
 (Crane, 'ORG'),
 (Crane, 'ORG'),
 (Crane, 'ORG'),
 (Pentagon, 'ORG'),
 (Crane, 'ORG'),
 (Sodrel, 'ORG'),
 (CB Richard Ellis, 'ORG'),
 (the West Gate Technology Park at, 'ORG'),
 (Busch, 'ORG'),
 (the Hulman Regional Airport Air Guard Station, 'ORG'),
 (Navy Marine Corps, 'ORG'),
 (the Newport Chemical Depot, 'ORG'),
 (the Navy Reserve Center, 'ORG'),
 (the U.S. Army Reserve Center, 'ORG'),
 (the U.S. Army Reserve Center, 'ORG'),
 (Congress, 'ORG'),
 (Daniels, 'ORG')]

Nearly all of these appear to be military related, though a few are misclassified (e.g. "Daniels" is the governor's surname, not an organization), so we would need to do more refinement to make sure. That is not technically difficult, but would take some time.
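
As a rough idea of what that refinement could look like, the sketch below collapses different surface forms of the same organization into one label before counting. The alias table is hypothetical and hand-written for illustration (in particular, treating "Crane" as shorthand for the Naval Surface Warfare Center at Crane is my assumption); it does not use spaCy and is not a trained entity linker.

```python
from collections import Counter

# Hypothetical alias table, written by hand for illustration only.
ALIASES = {
    "dfas": "defense finance and accounting service",
    "brac": "base realignment and closure",
    "naval surface warfare center at crane": "crane",  # assumed equivalence
}


def normalize_org(name):
    """Lowercase, drop a leading 'the ', and map known aliases to one label."""
    cleaned = name.lower().strip()
    if cleaned.startswith("the "):
        cleaned = cleaned[len("the "):]
    return ALIASES.get(cleaned, cleaned)


# A few of the ORG strings from the Indianapolis output above.
mentions = [
    "the Naval Surface Warfare Center at Crane",
    "Defense Finance and Accounting Service",
    "Crane",
    "DFAS",
    "Base Realignment and Closure",
    "Pentagon",
]
org_counts = Counter(normalize_org(m) for m in mentions)
```

A hand-written table like this would not scale to the full article set, but it shows the kind of linking a custom model or rule set would need to reproduce.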

We can get the average sentiment relating only to organizations by filtering on entities, though the results are rough without more work on entity identification.

In [11]:
#Sentiment around all organizations in the articles
docs_org_sentiment = {city:float(round(np.mean([
    entity._.blob.polarity for entity in doc.ents if entity.label_ == 'ORG']), 4))
      for city, doc in docs.items()}

In [12]:
#combine the two articles for Baltimore and Norwich into an average for each
docs_org_sentiment['Baltimore'] = round(float(np.mean([docs_org_sentiment['Baltimore 1'], docs_org_sentiment['Baltimore 2']])), 4)
docs_org_sentiment['Norwich'] = round(float(np.mean([docs_org_sentiment['Norwich 1'], docs_org_sentiment['Norwich 2']])), 4)
del docs_org_sentiment['Baltimore 1'], docs_org_sentiment['Baltimore 2'], docs_org_sentiment['Norwich 1'], docs_org_sentiment['Norwich 2']

#convert to a dataframe in order to merge and display neatly
polarity = pd.DataFrame({'city':docs_org_sentiment.keys(), 'org_sentiment':docs_org_sentiment.values()})
df_brac = df_brac.merge(polarity, on='city')
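
The averaging step here repeats the pattern used earlier for document sentiment. If more cities end up with multiple articles, a small helper could replace the hard-coded city names. This is only a sketch: the function name is my own, and it assumes article keys differ from the city name only by a trailing number such as "Baltimore 1".

```python
import re
from collections import defaultdict


def average_by_city(scores, ndigits=4):
    """Average scores whose keys differ only by a trailing article number."""
    grouped = defaultdict(list)
    for key, value in scores.items():
        city = re.sub(r"\s+\d+$", "", key)  # 'Baltimore 1' -> 'Baltimore'
        grouped[city].append(value)
    return {city: round(sum(vals) / len(vals), ndigits)
            for city, vals in grouped.items()}


# Document-level polarity values copied from the output earlier in the notebook.
example = {"Baltimore 1": -0.0108, "Baltimore 2": 0.0673, "Lubbock": 0.0835,
           "Norwich 1": -0.0903, "Norwich 2": -0.1697}
averaged = average_by_city(example)
```

Cities with a single article pass through unchanged, so the same call would work for both `doc_polarity` and `docs_org_sentiment`.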

In [13]:
df_brac.sort_values('org_sentiment', ascending=False)

Unnamed: 0,city,jobs_lost,jobs_gained,net,doc_sentiment,org_sentiment
3,Norfolk,-10399,10400,1,0.0394,0.0
1,Baltimore,-4414,11619,7727,0.0282,-0.014
7,Indianapolis,-283,3592,3309,-0.0015,-0.0448
5,Richmond,-812,7821,7009,0.0348,-0.0598
4,Norwich,-8460,0,-8460,-0.13,-0.0645
2,Lubbock,-7,0,-7,0.0835,-0.0654
0,Atlanta,-6820,124,-6764,-0.1481,-0.0667
6,San Antonio,-8129,10946,2817,0.0527,-0.0731


I think sentiment specifically about military bases is going to be our best bet, but one advantage to also looking at GPEs is that some articles are about places other than where they are published. For example, the Lubbock, TX article is ranked poorly now, but looking at the text makes it clear it is about the BRAC process affecting a town in Nevada, even though it was published in Lubbock.

In [14]:
docs_gpes = {city:[entity for entity in doc.ents if entity.label_ == 'GPE'] for city, doc in docs.items()}
docs_gpes['Lubbock']

[HAWTHORNE,
 Nev.,
 America,
 Nevada,
 Reno,
 Nevada,
 Iraq,
 Hawthorne,
 U.S.,
 Utah,
 Hawthorne,
 El Capitan,
 Hawthorne]

Here we see the GPEs in the Lubbock article. If we eliminate the country-level names "America", "U.S.", and "Iraq", we're left with:

In [15]:
lubbock_gpes = [gpe.text.lower() for gpe in docs_gpes['Lubbock'] if gpe.text not in ['U.S.', 'America', 'Iraq']]
lubbock_gpes

['hawthorne',
 'nev.',
 'nevada',
 'reno',
 'nevada',
 'hawthorne',
 'utah',
 'hawthorne',
 'el capitan',
 'hawthorne']

In [16]:
from collections import Counter
Counter(lubbock_gpes)

Counter({'hawthorne': 4,
         'nevada': 2,
         'nev.': 1,
         'reno': 1,
         'utah': 1,
         'el capitan': 1})

The article is about Hawthorne, Nevada, meaning 7 of the 10 GPEs mentioned direct us to the proper location. Of the remaining GPEs, "reno" is used to describe the location of Hawthorne in relation to a major city, "utah" is mentioned as the destination for the jobs Hawthorne is losing, and "El Capitan" is the name of a casino (the second largest employer in Hawthorne) that the entity parser is incorrectly identifying as the location of the mountain by the same name.

When we look at the top three results for all of the articles:

In [17]:
{city:Counter([
    gpe.text.lower() for gpe in doc if gpe.text not in ['U.S.', 'America', 'Iraq']
        ]).most_common(3) for city, doc in docs_gpes.items()}

{'Atlanta': [('clayton county', 2), ('georgia', 2), ('atlanta', 1)],
 'Baltimore 1': [('washington', 6), ('maryland', 5), ('baltimore', 4)],
 'Baltimore 2': [('maryland', 4), ('washington', 3), ('va.', 3)],
 'Indianapolis': [('indiana', 8), ('lawrence', 2), ('indianapolis', 2)],
 'Lubbock': [('hawthorne', 4), ('nevada', 2), ('nev.', 1)],
 'Norfolk': [('virginia', 4), ('fort', 2), ('hampton', 2)],
 'Norwich 1': [('norfolk', 3), ('ga.', 1), ('virginia', 1)],
 'Norwich 2': [('connecticut', 4), ('washington', 1), ('middletown', 1)],
 'Richmond': [('petersburg', 2), ('richmond', 1), ('fort', 1)],
 'San Antonio': [("san antonio's", 2), ('pennsylvania', 1), ('lackland', 1)]}

It seems that a simple heuristic, like finding the most referred-to geographies, will generally discover the correct geographic location the article is focused on. We would have to refine our geographic linking so that we identify, for example, that "clayton county", "georgia", and "atlanta" are all referring to the same place for the purposes of the first article.
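
A minimal sketch of that kind of linking is below. Both lookup tables are hypothetical and hand-written for illustration; a real version would need a gazetteer or similar reference data, and mapping "georgia" to "atlanta" only makes sense as a per-article assumption.

```python
from collections import Counter

# Hypothetical lookup tables, hand-written for illustration only.
ABBREVIATIONS = {"nev.": "nevada", "ga.": "georgia", "va.": "virginia"}
PLACE_TO_AREA = {
    "clayton county": "atlanta",
    "georgia": "atlanta",  # per-article assumption, not a general rule
    "atlanta": "atlanta",
}


def link_places(gpes, place_to_area=None):
    """Expand abbreviations, then optionally map each place to a focus area."""
    place_to_area = place_to_area or {}
    linked = []
    for gpe in gpes:
        name = ABBREVIATIONS.get(gpe.lower(), gpe.lower())
        linked.append(place_to_area.get(name, name))
    return Counter(linked)


# Top GPEs reported above for the Atlanta article, repeated by their counts.
atlanta_gpes = ["clayton county"] * 2 + ["georgia"] * 2 + ["atlanta"]
atlanta_counts = link_places(atlanta_gpes, PLACE_TO_AREA)

# The filtered Lubbock list from earlier: 'nev.' folds into 'nevada'.
lubbock_gpes = ["hawthorne", "nev.", "nevada", "reno", "nevada",
                "hawthorne", "utah", "hawthorne", "el capitan", "hawthorne"]
lubbock_counts = link_places(lubbock_gpes)
```

With the abbreviation folded in, the Lubbock article's top two places become Hawthorne (4) and Nevada (3), and the five Atlanta mentions collapse into a single label.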

Looking at GPEs in articles also opens the door to studying spillover effects and economic linkages.

# Conclusions and Issues

There is always a tradeoff in text analysis between accuracy and work time/computing power. Effort spent parsing the meaning out of text has sharply diminishing returns, but I think it is accurate to say that the return only asymptotically approaches zero for any article of sufficient size. In my experience there is always going to be one more corner case that we could correct for, but at some point it is better to leave them as noise. For example, it may be rare for articles on BRAC to refer only to other places (like the Lubbock, TX article on Hawthorne, NV), so it may not be worth our time to put a lot of effort into correcting a very small number of articles.

A second, related issue that frequently arises in the text processing space is that there are many tools and many judgment calls involved, often with very limited ability to defend to a referee that one approach is right over another. For example, here I used a pre-trained sentiment algorithm from TextBlob, which is a popular choice. But we could also use another library, like Hugging Face, which would give us access to a whole new set of sentiment algorithms. We could also put in the effort to train our own custom algorithm, though in this case I do not think that would be worth our time, even if we had the resources for it.

The many possible approaches make our results harder to defend, and we may have to offer multiple versions of the results using different algorithms that hopefully align in their conclusions. Ideally we would assess our machine-driven results by comparing them to manual classification done by RAs, but that would be a substantial task.
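
If we did collect manual labels, one common way to summarize agreement between the machine and the RAs (or between two algorithms) is Cohen's kappa, which corrects raw agreement for chance. The sketch below is a plain-Python illustration; the function name is my own, and the labels are made up purely to show the call, not results from the articles.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels purely to show the call; these are NOT results from the articles.
machine = ["pos", "pos", "neg", "neg"]
manual = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(machine, manual)
```

Because kappa works on categorical labels, continuous polarity scores would first need to be binned (e.g. positive, neutral, negative), and the choice of cutoffs is itself another judgment call.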