# Extracting Keywords

The goal of this notebook is to extract annual keywords from the Council on Foreign Relations' Preventive Priorities Survey, between 2011 and 2019. This will be the starting point of our per-year analysis of Google Trends and news media outlets.

In [1]:
## Imports

import pandas as pd

import nltk
from nltk.corpus import stopwords

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

import itertools 
import string
import re

In [2]:
# Creating pandas dataframe from CSV
df = pd.read_csv("PPS_all.csv")

## Data Exploration

In [3]:
df.head()

Unnamed: 0,Year,Tier,Issue/Country,Impact,Likelihood,Description
0,2019,1,cyberattack,high,moderate,A highly disruptive cyberattack on US critical...
1,2019,1,North Korea,high,moderate,Renewed tensions on the Korean Peninsula follo...
2,2019,1,Iran,high,moderate,An armed confrontation between Iran and the Un...
3,2019,1,South China Sea,high,moderate,An armed confrontation over disputed maritime ...
4,2019,1,terrorist attack,high,moderate,A mass casualty terrorist attack on the US hom...


In [4]:
df.shape

(267, 6)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 267 entries, 0 to 266
Data columns (total 6 columns):
Year             267 non-null int64
Tier             267 non-null int64
Issue/Country    267 non-null object
Impact           210 non-null object
Likelihood       210 non-null object
Description      267 non-null object
dtypes: int64(2), object(4)
memory usage: 12.6+ KB


In [6]:
# Setting the weight of the combined Impact and Likelihood
# List of category columns to adjust, for iteration
category_cols = ["Impact", "Likelihood"]
# Defining the numbers to assign for each importance
# Including 0 since we'll need to fill in some null values
importance = {"high": 3, "moderate": 2, "low": 1, 0:0}

# Iterating over the two columns, filling nulls and replacing
# the importance with a number, 0-3
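for col in category_cols:
    # Assigning back instead of using inplace=True on a column selection,
    # which newer pandas versions warn about and may not apply to df
    df[col] = df[col].fillna(0)
    df[col] = [importance[item] for item in df[col]]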

# Creating a new weight column, combining Impact and Likelihood
df["Weight"] = df["Impact"] + df["Likelihood"]

# Now, we don't actually want a weight of zero, since we might
# be multiplying by it, so let's reset it to one
df["Weight"] = df["Weight"].replace(to_replace=0, value=1)

In [7]:
df.tail()

Unnamed: 0,Year,Tier,Issue/Country,Impact,Likelihood,Description,Weight
262,2011,3,Myanmar,0,0,Violent instability in Burma/Myanmar,1
263,2011,3,Kyrgyzstan,0,0,Political instability/resurgent ethnic violenc...,1
264,2011,3,Uganda,0,0,Electoral violence in Uganda,1
265,2011,3,Cote d'Ivoire,0,0,Political crisis devolves to armed violence in...,1
266,2011,3,Thailand,0,0,Violent instability in Thailand,1
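
The weighting step can also be written with `Series.map`. The following is a minimal sketch on a small made-up frame (the rows are illustrative, not taken from `PPS_all.csv`); it assumes the same `importance` scale and the same rule of resetting a zero weight to one.

```python
import pandas as pd

# Toy rows for illustration only; not the real survey data
toy = pd.DataFrame({
    "Impact": ["high", "moderate", None],
    "Likelihood": ["moderate", "low", None],
})

importance = {"high": 3, "moderate": 2, "low": 1}

for col in ["Impact", "Likelihood"]:
    # Values missing from the mapping (including nulls) become NaN, then 0
    toy[col] = toy[col].map(importance).fillna(0).astype(int)

# Combined weight, with 0 reset to 1 so it can be used as a multiplier
toy["Weight"] = (toy["Impact"] + toy["Likelihood"]).replace(0, 1)

print(toy["Weight"].tolist())
```

Mapping first and filling afterwards means the placeholder `0` never has to appear as a key in the dictionary.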


## Data Cleaning
One problem is that our algorithms don't share our understanding of conflicts. For example, 'Central African Republic' is a single entity, so when we break the descriptions down into tokens, it shouldn't be split into 'Central', 'African' and 'Republic' with each word treated separately. Similarly, "China", "PRC" and "Sino" should all be treated as the same thing. We therefore preprocess the words in each description before tokenizing, so that the keyword analysis later on is meaningful.
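
To make the problem concrete, here is a small sketch using only the standard library. The sample sentence and the single replacement rule are illustrative; the merged token `CARep` is the one used later in this notebook.

```python
import re

sentence = "Escalating violence in the Central African Republic"

# Splitting on letters alone treats the country name as three unrelated words
naive_tokens = re.findall(r"[a-zA-Z]+", sentence.lower())

# Collapsing the name into a single token first keeps it together
merged = sentence.replace("Central African Republic", "CARep")
merged_tokens = re.findall(r"[a-zA-Z]+", merged.lower())

print(naive_tokens)
print(merged_tokens)
```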

In [11]:
# Behold, a function to replace tricky concepts, like acronyms
# Needs to be done before tokenization since some are multi-word
# Because our data was small, I just went through and did this
# by hand as I decided what was important to bring together
def replace_tricky_concepts(text):
    # Input: expects a string (so, not a whole column of text)
    # The parameter is named `text` so it doesn't shadow the
    # imported `string` module

    # Defining a dictionary of tricky concepts, where the keys
    # are what we'll replace with, the values are what to replace
    # Note: str.replace matches substrings, so a short value like
    # "Indo" will also match inside a longer word like "Indonesia"
    tricky_concepts = {
    "SouthChinaSea" : "South China Sea",
    "SEAsia" : "Southeast Asian",
    "CentralAm" : "Central America",
    "DRCongo" : "Democratic Republic of Congo",
    "SouthSudan" : "South Sudan",
    "CARep" : "Central African Republic",
    "EastChinaSea" : "East China Sea",
    # Longer phrase first, so it is matched before its prefix is replaced
    "ISIS" : ["Islamic State of Iraq and Syria", "Islamic State"],
    "ICBM" : "intercontinental ballistic missile",
    "MiddleEast" : "Middle East",
    "Qaeda" : "AQAP",
    "China" : ["Sino", "PRC"],
    "nuclear" : "denuclearization",
    "Europe" : "EU",
    "India" : "Indo"
    }

    # Creating a copy of the string input
    output = text
    # Iterating over the dictionary
    for replacement, concept in tricky_concepts.items():
        # Checking for lists, since some values are lists
        if isinstance(concept, list):
            for item in concept:
                output = output.replace(item, replacement)
        else:
            output = output.replace(concept, replacement)
    return output
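
Because `str.replace` matches substrings, a short pattern such as "Indo" would also match inside "Indonesia". If that is a concern, one option is whole-word matching with `re.sub`. The sketch below is an assumption about how that could look, not the approach used in this notebook; `replace_whole_words` and the two-entry mapping are hypothetical.

```python
import re

def replace_whole_words(text, concepts):
    # concepts maps each phrase to find to the token that replaces it
    # Longest phrases first, so a long phrase is handled before any
    # shorter phrase that is a prefix of it
    for phrase in sorted(concepts, key=len, reverse=True):
        # \b anchors the match at word boundaries on both sides
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, concepts[phrase], text)
    return text

# Hypothetical two-entry mapping, for illustration
concepts = {"Indo": "India", "Sino": "China"}

sample = "Indo-Pakistani tensions and Sino-Japanese disputes near Indonesia"
print(replace_whole_words(sample, concepts))
```

With plain `str.replace`, the same sample would end in "Indianesia"; with word boundaries, "Indonesia" is left alone.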

In [12]:
# Applying the function to every row of our text column to
# replace our multi-word concepts
df["Description"] = df["Description"].map(replace_tricky_concepts)

In [13]:
# Now we can work on some other pre-processing
# Defining our basic English stopwords to remove, including punctuation
# (requires the NLTK stopwords corpus: nltk.download("stopwords"))

stopwords_list = stopwords.words("english")
stopwords_list += list(string.punctuation)

In [14]:
# Defining a function to tokenize our words, using RegEx, as well as
# make all words lowercase
def process_text(text):
    # Input: expects a string (again, not a column of text)
    # Defining our regex pattern to grab individual words
    pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
    # Using NLTK's regex tokenize function
    tokens_raw = nltk.regexp_tokenize(text, pattern)
    # Setting all words to lowercase now
    tokens_lower = [w.lower() for w in tokens_raw]
    # Only returning tokens if they are not stopwords or digits
    return [t for t in tokens_lower if t not in stopwords_list and not t.isdigit()]

In [94]:
# Testing our tokenization function
process_text(df["Description"][1])

['renewed',
 'tensions',
 'korean',
 'peninsula',
 'following',
 'collapse',
 'nuclear',
 'negotiations']
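
One detail of the pattern worth checking is the apostrophe. The optional `(?:'[a-z]+)?` group only recognizes the straight apostrophe, while some descriptions (for example "Iran’s") use a typographic one. Below is a quick sketch with `re.findall`, which is assumed here to behave like the NLTK tokenizer for this pattern.

```python
import re

# Same pattern as in process_text
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"

straight = re.findall(pattern, "Iran's involvement")
curly = re.findall(pattern, "Iran’s involvement")

print(straight)  # the possessive stays attached to the word
print(curly)     # the word is split at the typographic apostrophe
```

A stray single-letter token produced this way only disappears later if it happens to be in the stopword list, so it may be worth normalizing apostrophes before tokenizing.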

In [17]:
# Tokenizing every description and storing the tokens in a new column
df["Keywords"] = df["Description"].map(process_text)

In [18]:
df.head()

Unnamed: 0,Year,Tier,Issue/Country,Impact,Likelihood,Description,Weight,Keywords
0,2019,1,cyberattack,3,2,A highly disruptive cyberattack on US critical...,5,"[highly, disruptive, cyberattack, us, critical..."
1,2019,1,North Korea,3,2,Renewed tensions on the Korean Peninsula follo...,5,"[renewed, tensions, korean, peninsula, followi..."
2,2019,1,Iran,3,2,An armed confrontation between Iran and the Un...,5,"[armed, confrontation, iran, united, states, o..."
3,2019,1,South China Sea,3,2,An armed confrontation over disputed maritime ...,5,"[armed, confrontation, disputed, maritime, are..."
4,2019,1,terrorist attack,3,2,A mass casualty terrorist attack on the US hom...,5,"[mass, casualty, terrorist, attack, us, homela..."


In [96]:
# Building the overall vocabulary: the unique tokens across all descriptions
vocabulary = set()
for obs in df.index:
    words = process_text(df["Description"][obs])
    vocabulary.update(words)
vocabulary = list(vocabulary)

## Arriving at Keywords

In [65]:
# Let's go through and see what keywords TF-IDF picks out of each description
# process_text is passed in as the tokenizer, so the same cleaning rules apply

tfidf = TfidfVectorizer(tokenizer=process_text)

In [70]:
tfidf.fit_transform(df["Description"])

<267x530 sparse matrix of type '<class 'numpy.float64'>'
	with 2720 stored elements in Compressed Sparse Row format>
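
The matrix above has one row per description and one column per vocabulary term. To show what the stored values mean, the sketch below recomputes TF-IDF by hand for three toy documents, using the smoothed IDF and L2 normalization that scikit-learn documents as its defaults. The documents and the helper name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize so each document vector has unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = [
    ["political", "instability", "venezuela"],
    ["political", "instability", "thailand"],
    ["cyberattack", "infrastructure"],
]

rows = tfidf_rows(docs)
for row in rows[:2]:
    # Highest-weighted term in each of the first two documents
    print(max(row, key=row.get))
```

Terms shared by several documents ("political", "instability") get a lower weight than terms unique to one document, which is why the country names come out on top here.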

In [None]:
# get_feature_names() was removed in scikit-learn 1.2
tfidf.get_feature_names_out()

In [137]:
from nltk.stem.snowball import SnowballStemmer  # Assuming we're working with English

stemmer = SnowballStemmer("english")

In [132]:
# Collecting every token from every description into one flat list
words = []
for obs in df.index:
    words.extend(process_text(df["Description"][obs]))

In [133]:
# Reducing each token to its stem, so inflected forms are counted together
stems = [stemmer.stem(word) for word in words]

In [134]:
from nltk import FreqDist

fdist = FreqDist(stems)

In [135]:
fdist

FreqDist({'violenc': 99, 'polit': 91, 'instabl': 91, 'militari': 57, 'civil': 42, 'attack': 40, 'increas': 36, 'confront': 34, 'crisi': 32, 'result': 32, ...})

In [17]:
# # Our CFR PPS data ranges from 2011 to 2019
# years = list(range(2011, 2020))

# # Creating a dictionary to hold the text per year
# annual_dict = {}

# for year in years:
#     # Creating a dataframe for each year
#     year_df = df.loc[df["Year"] == year]
#     # Chaining together the column of descriptions
#     year_list = list(itertools.chain(year_df["Description"]))
#     # Joining that chain into one long string
#     year_str_all = ' '.join(year_list)
#     # Setting the key as the year, value as the long descriptions
#     annual_dict[year] = year_str_all

In [18]:
# for year in annual_dict.keys():
#     annual_dict[year] = process_text(annual_dict[year])

In [19]:
# annual_features = {}

# for year in annual_dict.keys():
#     tfidf = TfidfVectorizer(max_features=10, lowercase=False)
#     year_input = list(annual_dict[year])
#     tfidf.fit_transform(year_input)
#     year_features = tfidf.get_feature_names()
#     annual_features[year] = year_features
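
The commented-out cells above sketch a per-year keyword extraction. One simpler alternative, assuming plain token counts are enough as a first pass, is to group tokens by year with `collections.Counter`. The records below are made up for illustration, and `top_keywords_by_year` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def top_keywords_by_year(records, n=2):
    # records: iterable of (year, list_of_tokens) pairs
    counts = defaultdict(Counter)
    for year, tokens in records:
        counts[year].update(tokens)
    # Keep the n most frequent tokens for each year
    return {year: [tok for tok, _ in c.most_common(n)] for year, c in counts.items()}

# Made-up records, for illustration only
records = [
    (2011, ["violent", "instability", "thailand"]),
    (2011, ["violent", "instability", "myanmar"]),
    (2019, ["cyberattack", "infrastructure"]),
    (2019, ["cyberattack", "networks"]),
]

result = top_keywords_by_year(records)
print(result)
```

With the real data, the pairs could come from the `Year` and `Keywords` columns of `df`.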

In [85]:
# Pulling the descriptions out of the dataframe as a plain list of strings
desc_list = df["Description"].tolist()
desc_list

['A highly disruptive cyberattack on US critical infrastructure and networks',
 'Renewed tensions on the Korean Peninsula following a collapse of the nuclear negotiations',
 'An armed confrontation between Iran and the United States or one of its allies over Iran’s involvement in regional conflicts and support of militant proxy groups',
 'An armed confrontation over disputed maritime areas in the SouthChinaSea between China and one or more SEAsia claimants (Brunei, Malaysia, Philippines, Taiwan, and Vietnam)',
 'A mass casualty terrorist attack on the US homeland or a treaty ally by either (a) foreign or homegrown terrorist(s)',
 'Continued violent reimposition of government control in Syria leading to further civilian casualties and heightened tensions among external parties to the conflict',
 'Deepening economic crisis and political instability in Venezuela leading to violent civil unrest and increased refugee outflows',
 'Worsening of the humanitarian crisis in Yemen, exacerbated by

In [21]:
# all_tokens = process_text(str_all)

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(lowercase=True, stop_words=stopwords_list)
cv_all = cv.fit_transform(desc_list)
# all_features = cv.get_feature_names()

# all_features

In [22]:
cv_all.shape

(267, 540)
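
The `CountVectorizer` vocabulary has 540 terms, while the TF-IDF vectorizer built on `process_text` had 530. One likely contributor (an assumption, not verified against the data) is tokenization: `CountVectorizer` falls back to its default token pattern, which keeps digits and drops one-character tokens, whereas `process_text` keeps letters only. The sketch compares the two patterns on a made-up sentence using `re.findall`.

```python
import re

# Made-up sentence, for illustration only
sample = "Attacks by al-Qaeda rose 40 percent in 2015"

# Default token pattern documented for scikit-learn's text vectorizers
default_pattern = r"(?u)\b\w\w+\b"
# Letters-only pattern used in process_text
letters_pattern = r"[a-zA-Z]+(?:'[a-z]+)?"

default_tokens = re.findall(default_pattern, sample.lower())
letters_tokens = re.findall(letters_pattern, sample.lower())

# Tokens that only the default pattern produces
only_default = sorted(set(default_tokens) - set(letters_tokens))
print(only_default)
```

Passing `tokenizer=process_text` to `CountVectorizer` as well would make the two vocabularies directly comparable.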