# Exploring text analysis for BRAC

This notebook explores how text analysis of news articles might be used to study the uncertainty effect of base closures.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

## Sample cities with BRAC DoD proposal amounts

Data copied from 2005_BRAC_proposals.csv, collected from the 2005 BRAC commission report, Appendix O.

In [2]:
data = [['Atlanta', -6820, 124, -6764],
        ['Baltimore', -4414, 11619, 7727],
        ['Lubbock', -7, 0, -7],
        ['Norfolk', -10399, 10400, 1],
        ['Norwich', -8460, 0, -8460],
        ['Richmond', -812, 7821, 7009],
        ['San Antonio', -8129, 10946, 2817],
        ['Indianapolis', -283, 3592, 3309]]
df_brac = pd.DataFrame(data, columns=['city', 'loss', 'gain', 'net'])
df_brac

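| | city | loss | gain | net |
|---|---|---|---|---|
| 0 | Atlanta | -6820 | 124 | -6764 |
| 1 | Baltimore | -4414 | 11619 | 7727 |
| 2 | Lubbock | -7 | 0 | -7 |
| 3 | Norfolk | -10399 | 10400 | 1 |
| 4 | Norwich | -8460 | 0 | -8460 |
| 5 | Richmond | -812 | 7821 | 7009 |
| 6 | San Antonio | -8129 | 10946 | 2817 |
| 7 | Indianapolis | -283 | 3592 | 3309 |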


# Loading the sample news articles from AWN

There are ten articles from places representing a variety of proposed BRAC outcomes, with Baltimore and Norwich represented twice.

In [3]:
def load_article(fname):
    #Note: We don't actually know the structure of the text files we would get from purchased
    #AWN access, so this function both handles the trivial example data, and serves as a
    #placeholder for whatever the future structure looks like.
    with open(os.path.join('articles', fname), 'r') as ifile:
        text = ifile.read()

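    #The body of every sample article begins after the "Words" count, so return only the body.
    #If the marker is missing, return the full text instead of slicing at a meaningless offset.
    marker = text.find('Words')
    if marker == -1:
        return text
    return text[marker+len('Words')+2:]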

articles = {fname[:-4]:load_article(fname) for fname in os.listdir(r'articles')}
articles

{'Atlanta': 'Local officials are preparing to lobby members of the federal Base Realignment and Closure Commission, who are expected in Atlanta next month.\n\nCommission members will be in town June 30 to hear the pitch to keep Fort Gillem open.\n\n"Everyone\'s pretty much got one shot at it," said Forest Park Mayor Chuck Hall. "After that, it\'s up to the commission."\n\nDefense Secretary Donald Rumsfeld recommends that Fort Gillem in northeastern Clayton County, Fort McPherson in East Point, the Naval Air Station in Cobb County and the Naval Supply Corps School in Athens be closed. They are among the 33 major military facilities nationwide that may be shut down as part of the 2005 BRAC process.\n\nDefense officials are hoping to streamline the number of military bases around the country as they reorganize the nation\'s military. Four other BRAC rounds -- 1998, 1995, 1993 and 1991 -- closed 97 bases.\n\nThis year\'s BRAC round would close Fort Gillem, with six Army groups now at the b

# Basic word presence

This mirrors the behavior we can get through the AWN interface directly, without any text analysis. This does not require purchasing access to the text of articles.

In [4]:
def brac_keywords(text):
    acronym = 'brac' in text.lower()
    full = 'base realignment and closure' in text.lower()
    return acronym or full
brac_articles = {city: brac_keywords(text) for city, text in articles.items()}
print('Is BRAC discussed? (should be all True):')
brac_articles

Is BRAC discussed? (should be all True):


{'Atlanta': True,
 'Baltimore 1': True,
 'Baltimore 2': True,
 'Indianapolis': True,
 'Lubbock': True,
 'Norfolk': True,
 'Norwich 1': True,
 'Norwich 2': True,
 'Richmond': True,
 'San Antonio': True}

In [5]:
def unc_keywords(text):
    unc = 'uncertain' in text.lower()
    econ = 'econom' in text.lower()
    return unc and econ
#Use a separate name so the BRAC keyword results above are not overwritten
unc_articles = {city: unc_keywords(text) for city, text in articles.items()}
print('Is economic uncertainty discussed?:')
unc_articles

Is economic uncertainty discussed?:


{'Atlanta': False,
 'Baltimore 1': False,
 'Baltimore 2': False,
 'Indianapolis': False,
 'Lubbock': False,
 'Norfolk': True,
 'Norwich 1': False,
 'Norwich 2': False,
 'Richmond': False,
 'San Antonio': False}
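
The check above works at the document level, so it flags an article even when "uncertain" and "econom" appear paragraphs apart. A stricter variant, sketched below, requires both stems in the same sentence. The function name `unc_keywords_same_sentence` and the regex-based sentence splitter are illustrative assumptions, not something the AWN interface offers; with full article text we could use spaCy's sentence boundaries instead. The two example strings are made up.

```python
import re

def unc_keywords_same_sentence(text):
    """Return True if 'uncertain' and 'econom' occur in the same sentence."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.lower())
    return any('uncertain' in s and 'econom' in s for s in sentences)

# Made-up example sentences, not taken from the AWN articles
same = "The closure creates economic uncertainty for the town. Officials will meet."
apart = "The mayor is uncertain about the vote. The local economy is strong."
print(unc_keywords_same_sentence(same))
print(unc_keywords_same_sentence(apart))
```

The document-level `unc_keywords` would return True for both strings; this version separates them, at the cost of missing articles where the two ideas span adjacent sentences.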

# Text Analysis

First we add the sentiment scoring component to the pipeline, then ingest the data into spaCy to tokenize the contents. Sentiment scores range from 1 (positive) to -1 (negative).

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob');

docs = {city:nlp(text) for city, text in articles.items()} #turning plain text into Spacy tokens

In [13]:
doc_polarity = {city:round(doc._.blob.polarity, 4)  for city, doc in docs.items()}
print('Overall sentiment (higher = more positive)')
sorted(doc_polarity.items(), key=lambda item: item[1], reverse=True) #'item' avoids shadowing the docs dict

Overall sentiment (higher = more positive)


[('Lubbock', 0.0835),
 ('Baltimore 2', 0.0673),
 ('San Antonio', 0.0527),
 ('Norfolk', 0.0394),
 ('Richmond', 0.0348),
 ('Indianapolis', -0.0015),
 ('Baltimore 1', -0.0108),
 ('Norwich 1', -0.0903),
 ('Atlanta', -0.1481),
 ('Norwich 2', -0.1697)]

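Comparing the sentiment scores to the proposed net job changes, the two align fairly well at the bottom of the ranking: Atlanta and Norwich both face large net losses, and their articles have the most negative sentiment. The fit is looser elsewhere: Lubbock ranks highest on sentiment despite a net change of -7, and the two Baltimore articles, for the city with the largest net gain, rank second and seventh.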

In [16]:
df_brac.sort_values('net', ascending=False)

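| | city | loss | gain | net |
|---|---|---|---|---|
| 1 | Baltimore | -4414 | 11619 | 7727 |
| 5 | Richmond | -812 | 7821 | 7009 |
| 7 | Indianapolis | -283 | 3592 | 3309 |
| 6 | San Antonio | -8129 | 10946 | 2817 |
| 3 | Norfolk | -10399 | 10400 | 1 |
| 2 | Lubbock | -7 | 0 | -7 |
| 0 | Atlanta | -6820 | 124 | -6764 |
| 4 | Norwich | -8460 | 0 | -8460 |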
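
One way to quantify how well sentiment tracks the proposed job changes is a rank correlation. The sketch below copies the polarity scores and net figures printed above, averages the two Baltimore and the two Norwich articles (an assumption; other pooling rules are equally defensible), and computes Spearman's rho by hand so it needs only the standard library. The helper names `ranks` and `spearman` are illustrative. With eight cities the statistic is suggestive at best.

```python
from collections import defaultdict

# Values copied from the outputs above
doc_polarity = {'Lubbock': 0.0835, 'Baltimore 2': 0.0673, 'San Antonio': 0.0527,
                'Norfolk': 0.0394, 'Richmond': 0.0348, 'Indianapolis': -0.0015,
                'Baltimore 1': -0.0108, 'Norwich 1': -0.0903, 'Atlanta': -0.1481,
                'Norwich 2': -0.1697}
net_jobs = {'Atlanta': -6764, 'Baltimore': 7727, 'Lubbock': -7, 'Norfolk': 1,
            'Norwich': -8460, 'Richmond': 7009, 'San Antonio': 2817,
            'Indianapolis': 3309}

# Average polarity per city; 'Baltimore 1' -> 'Baltimore' (assumed pooling rule)
scores = defaultdict(list)
for article, polarity in doc_polarity.items():
    scores[article.rstrip(' 0123456789')].append(polarity)
city_polarity = {city: sum(v) / len(v) for city, v in scores.items()}

def ranks(values):
    """Rank from 1 (smallest); assumes no ties, which holds for this data."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

cities = sorted(net_jobs)
rho = spearman([city_polarity[c] for c in cities], [net_jobs[c] for c in cities])
print(round(rho, 3))
```

Pooling by city is only one choice; keeping all ten articles and repeating the city's net figure would create ties, which the simple formula here does not handle.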


# Exploring sentiment about specific topics

We begin with named entity recognition. In spaCy's label scheme, ORG = organizations, GPE = geopolitical entities (countries, cities, states), and LOC = non-GPE locations. This is an area where we may want to train our own model, since we could then link different entities together (e.g. "base realignment and closure" linked to "BRAC", or all military base names linked together).

In [None]:
#Focusing on just ORGs identified in the article for Indianapolis
orgs = [(entity, entity.label_) for entity in docs['Indianapolis'].ents if entity.label_ == 'ORG']
orgs

[(Pentagon, 'ORG'),
 (the Naval Surface Warfare Center at Crane, 'ORG'),
 (Defense Finance and Accounting Service, 'ORG'),
 (Base Realignment and Closure, 'ORG'),
 (Crane, 'ORG'),
 (Defense Finance and Accounting Service, 'ORG'),
 (DFAS, 'ORG'),
 (the Department of Defense, 'ORG'),
 (Soldier Support Center, 'ORG'),
 (Army, 'ORG'),
 (Crane, 'ORG'),
 (Daniels, 'ORG'),
 (Crane, 'ORG'),
 (Crane, 'ORG'),
 (Crane, 'ORG'),
 (Pentagon, 'ORG'),
 (Crane, 'ORG'),
 (Sodrel, 'ORG'),
 (CB Richard Ellis, 'ORG'),
 (the West Gate Technology Park at, 'ORG'),
 (Busch, 'ORG'),
 (the Hulman Regional Airport Air Guard Station, 'ORG'),
 (Navy Marine Corps, 'ORG'),
 (the Newport Chemical Depot, 'ORG'),
 (the Navy Reserve Center, 'ORG'),
 (the U.S. Army Reserve Center, 'ORG'),
 (the U.S. Army Reserve Center, 'ORG'),
 (Congress, 'ORG'),
 (Daniels, 'ORG')]

Nearly all of these appear to be military-related, but we would need to refine the matching to be sure. That is not technically difficult, but it would take some time.
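
As a rough illustration of that refinement, the sketch below links surface forms to a canonical name with a hand-written alias table. The `ALIASES` entries and the `canonical_org` helper are hypothetical; only the entity strings come from the Indianapolis output above. A trained entity linker would replace the lookup table.

```python
from collections import Counter

# Hypothetical alias table: lowercase surface form -> canonical name
ALIASES = {
    'base realignment and closure': 'BRAC',
    'brac': 'BRAC',
    'defense finance and accounting service': 'DFAS',
    'dfas': 'DFAS',
    'the naval surface warfare center at crane': 'Crane',
    'crane': 'Crane',
}

def canonical_org(name):
    """Map an entity string to its canonical name, or return it unchanged."""
    return ALIASES.get(name.strip().lower(), name)

# A few ORG strings copied from the Indianapolis entity list
orgs = ['Pentagon', 'the Naval Surface Warfare Center at Crane',
        'Defense Finance and Accounting Service', 'Base Realignment and Closure',
        'Crane', 'DFAS', 'Crane']
linked = Counter(canonical_org(o) for o in orgs)
print(linked.most_common())
```

Unmatched names such as "Pentagon" pass through unchanged, so the table can be grown incrementally as we review more articles.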

We can get the average sentiment relating to only organizations using filtering on entities. It yields similar results to the overall document sentiment, since we haven't refined the organization matching yet.

In [34]:
#Sentiment around all organizations in the articles
docs_org_sentiment = {
    city: float(round(np.mean([ent._.blob.polarity for ent in doc.ents if ent.label_ == 'ORG']), 4))
    for city, doc in docs.items()
}
sorted(docs_org_sentiment.items(), key=lambda item: item[1], reverse=True)

[('Baltimore 1', 0.0123),
 ('Norfolk', 0.0),
 ('Norwich 1', -0.032),
 ('Baltimore 2', -0.0403),
 ('Indianapolis', -0.0448),
 ('Richmond', -0.0598),
 ('Lubbock', -0.0654),
 ('Atlanta', -0.0667),
 ('San Antonio', -0.0731),
 ('Norwich 2', -0.0969)]

I think sentiment specifically about military bases is going to be our best bet, but one advantage of also looking at GPEs is that some articles are about places other than where they were published. For example, the Lubbock, TX article is about the BRAC process affecting a town in Nevada that is nowhere near Texas.

In [41]:
docs_gpes = {city:[entity for entity in doc.ents if entity.label_ == 'GPE'] for city, doc in docs.items()}
docs_gpes['Lubbock']

[HAWTHORNE,
 Nev.,
 America,
 Nevada,
 Reno,
 Nevada,
 Iraq,
 Hawthorne,
 U.S.,
 Utah,
 Hawthorne,
 El Capitan,
 Hawthorne]

Here we see the GPEs in the Lubbock article. If we eliminate the country-level names America, U.S., and Iraq, we're left with:

In [44]:
lubbock_gpes = [gpe.text.lower() for gpe in docs_gpes['Lubbock'] if gpe.text not in ['U.S.', 'America', 'Iraq']]
lubbock_gpes

['hawthorne',
 'nev.',
 'nevada',
 'reno',
 'nevada',
 'hawthorne',
 'utah',
 'hawthorne',
 'el capitan',
 'hawthorne']

In [45]:
from collections import Counter
Counter(lubbock_gpes)

Counter({'hawthorne': 4,
         'nevada': 2,
         'nev.': 1,
         'reno': 1,
         'utah': 1,
         'el capitan': 1})

The article is about Hawthorne, Nevada, meaning 7 of the 10 GPEs mentioned direct us to the proper location. Of the remaining GPEs, "reno" is used to describe the location of Hawthorne in relation to a major city, "utah" is mentioned as the destination for the jobs Hawthorne is losing, and "El Capitan" is the name of a casino, which the entity recognizer mistakes for the mountain of the same name.

When we look at the top three results for all of the articles:

In [52]:
country_names = ['U.S.', 'America', 'Iraq'] #country-level names to ignore
{city:Counter([gpe.text.lower() for gpe in gpes if gpe.text not in country_names]).most_common(3) for city, gpes in docs_gpes.items()}

{'Atlanta': [('clayton county', 2), ('georgia', 2), ('atlanta', 1)],
 'Baltimore 1': [('washington', 6), ('maryland', 5), ('baltimore', 4)],
 'Baltimore 2': [('maryland', 4), ('washington', 3), ('va.', 3)],
 'Indianapolis': [('indiana', 8), ('lawrence', 2), ('indianapolis', 2)],
 'Lubbock': [('hawthorne', 4), ('nevada', 2), ('nev.', 1)],
 'Norfolk': [('virginia', 4), ('fort', 2), ('hampton', 2)],
 'Norwich 1': [('norfolk', 3), ('ga.', 1), ('virginia', 1)],
 'Norwich 2': [('connecticut', 4), ('washington', 1), ('middletown', 1)],
 'Richmond': [('petersburg', 2), ('richmond', 1), ('fort', 1)],
 'San Antonio': [("san antonio's", 2), ('pennsylvania', 1), ('lackland', 1)]}

It seems that finding the most frequently mentioned geographies will generally identify the location an article is focused on, as long as we use geographic linking to recognize that, for example, "clayton county", "georgia", and "atlanta" all refer to the same place for the purposes of the first article.
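
A minimal sketch of that geographic linking, assuming a hand-built lookup from place mentions to a focal area. The `PLACE_TO_AREA` table and the `focal_area` function are hypothetical; in practice the mapping might come from a gazetteer or a county-to-metro crosswalk. The mention counts are the Lubbock results shown earlier.

```python
from collections import Counter

# Hypothetical lookup: lowercase GPE mention -> focal area it belongs to.
# Mapping a state name to a single town is only sensible within one article.
PLACE_TO_AREA = {
    'clayton county': 'Atlanta, GA', 'georgia': 'Atlanta, GA', 'atlanta': 'Atlanta, GA',
    'hawthorne': 'Hawthorne, NV', 'nevada': 'Hawthorne, NV', 'nev.': 'Hawthorne, NV',
}

def focal_area(gpe_counts):
    """Pool mention counts by linked area; return the top area and its share."""
    pooled = Counter()
    for place, count in gpe_counts.items():
        pooled[PLACE_TO_AREA.get(place, place)] += count
    area, hits = pooled.most_common(1)[0]
    return area, hits / sum(pooled.values())

# Counts copied from the Lubbock article output above
lubbock = Counter({'hawthorne': 4, 'nevada': 2, 'nev.': 1,
                   'reno': 1, 'utah': 1, 'el capitan': 1})
print(focal_area(lubbock))
```

Returning the share alongside the area gives a crude confidence measure: an article whose top area covers only a small share of mentions may need manual review.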

# Conclusions and Issues

There is always a tradeoff in text analysis between accuracy and time/computing power. Effort spent parsing meaning out of text has sharply diminishing returns, although I think it is accurate to say that the marginal return only approaches zero asymptotically for any article of sufficient size. For example, there may not be many articles that refer only to other places (like the Lubbock, TX article on Hawthorne, NV), so it may not be worth our time to put a lot of effort into correcting a very small number of articles.

A second, related issue that frequently arises in text processing is that there are many tools and many judgment calls involved, often with very limited ability to "prove" that one approach is better than another. For example, here I used a pre-trained sentiment algorithm from TextBlob, which is a popular choice. But we could also use another library, like Hugging Face, which would give us access to a whole new set of sentiment algorithms. We could also put in the effort to train our own custom algorithm, though in this case I do not think that would be worth our time, even if we had the resources for it. This makes our results harder to defend to referees, and we may have to offer multiple versions of the results using different algorithms.

This leads to another issue: the further we go beyond the simplest approaches to text analysis, the less likely it is that the average economist will be able to easily understand what we're doing and critique our approach. Whether buying access to the text of AWN articles will be worth our time is difficult to judge.