# Extracting Keywords

This notebook extracts annual keywords from the Council on Foreign Relations' Preventive Priorities Survey (PPS) for the years 2011 through 2019. These keywords are the starting point for our per-year analysis of Google Trends and news media outlets.

In [46]:
## Imports

import pandas as pd

import nltk
from nltk.corpus import stopwords

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

import itertools
import string
import re

In [47]:
# Creating pandas dataframe from CSV
df = pd.read_csv("PPS_all.csv")

## Data Exploration

In [48]:
df.head()

Unnamed: 0,Year,Tier,Issue/Country,Impact,Likelihood,Description
0,2019,1,cyberattack,high,moderate,A highly disruptive cyberattack on US critical...
1,2019,1,North Korea,high,moderate,Renewed tensions on the Korean Peninsula follo...
2,2019,1,Iran,high,moderate,An armed confrontation between Iran and the Un...
3,2019,1,South China Sea,high,moderate,An armed confrontation over disputed maritime ...
4,2019,1,terrorist attack,high,moderate,A mass casualty terrorist attack on the US hom...


In [49]:
df.shape

(267, 6)

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 267 entries, 0 to 266
Data columns (total 6 columns):
Year             267 non-null int64
Tier             267 non-null int64
Issue/Country    267 non-null object
Impact           210 non-null object
Likelihood       210 non-null object
Description      267 non-null object
dtypes: int64(2), object(4)
memory usage: 12.6+ KB


In [51]:
# Setting the weight of the combined Impact and Likelihood
# List of category columns to adjust, for iteration
category_cols = ["Impact", "Likelihood"]
# Defining the numbers to assign for each importance
# Including 0 since we'll need to fill in some null values
importance = {"high": 3, "moderate": 2, "low": 1, 0:0}

# Iterating over the two columns, filling nulls with 0 and mapping
# each importance label to a number, 0-3
# (assigning the result back avoids in-place fillna on a column selection)
for col in category_cols:
    df[col] = df[col].fillna(0).map(importance)

# Creating a new weight column, combining Impact and Likelihood
df["Weight"] = df["Impact"] + df["Likelihood"]

## Data Cleaning
One problem is that our algorithms don't understand conflicts the way we do. For example, 'Central African Republic' is a single entity, so when we tokenize the descriptions it should not be split into 'Central', 'African', and 'Republic' and treated as three unrelated words. Likewise, "China", "PRC", and "Sino" should all count as the same thing. We therefore preprocess the words in each description so that the keyword search later produces sensible results.
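To illustrate the tokenization problem described above, here is a minimal standalone sketch. The sample sentence and the `tokenize` helper are hypothetical and use only the standard library; the notebook's own tokenizer is defined further down.

```python
import re

# Hypothetical sample sentence, not taken from the survey data
sample = "Escalating violence in the Central African Republic"

def tokenize(text):
    """Split text into lowercase runs of letters."""
    return [w.lower() for w in re.findall(r"[a-zA-Z]+", text)]

# Without preprocessing, the country name becomes three separate tokens
naive_tokens = tokenize(sample)

# Collapsing the phrase into one placeholder first keeps it as one token
merged_tokens = tokenize(sample.replace("Central African Republic", "CARep"))

print(naive_tokens)
print(merged_tokens)
```

The placeholder `CARep` matches the key used for this phrase in the dictionary below.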

In [55]:
# First, let's define our dictionaries of concepts that might trip up
# our algorithm as we search for keywords
multi_word_concepts = {
    "SouthChinaSea" : "South China Sea",
    "SEAsia" : "Southeast Asian",
    "CentralAm" : "Central America",
    "BokoHaram" : "Boko Haram",
    "DRCongo" : "Democratic Republic of Congo",
    "SouthSudan" : "South Sudan",
    "CARep" : "Central African Republic",
    "BosniaHerzegovina" : "Bosnia and Herzegovina",
    "NKorea" : ["North Korean", "North Korea"],
    "EastChinaSea" : "East China Sea",
    # Longer phrase first, so "Islamic State" doesn't match inside it
    "ISIS" : ["Islamic State of Iraq and Syria", "Islamic State"],
    "Saudi" : ["Saudi Arabia"],
    "NagornoKarabakh" : "Nagorno Karabakh",
    "ICBM" : "intercontinental ballistic missile",
    "MiddleEast" : "Middle East"
}

single_word_concepts = {
    "Qaeda" : "AQAP",
    "China" : ["Sino", "PRC"],
    "Syria" : "Syrian",
    "Turkey" : "Turkish",
    "Pakistan" : "Pakistani",
    "nuclear" : "denuclearization",
    "Iran" : "Iranian",
    "Russia" : "Russian",
    "Ukraine" : "Ukrainian",
    "Israel" : ["Israelis", "Israeli"],
    "Palestine" : ["Palestinians", "Palestinian"],
    "Europe" : ["EU", "European"],
    "Iraq" : "Iraqi",
    "India" : ["Indian", "Indo"]
}
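The notebook does not show how `single_word_concepts` gets applied to tokens. One possible approach, sketched here under assumptions, is to invert the dictionary into a variant-to-canonical lookup and apply it token by token. The function names `build_variant_lookup` and `replace_single_word_concepts` are hypothetical, and only a small subset of the dictionary is repeated so the example runs on its own.

```python
# Assumption: a small subset of single_word_concepts, repeated here
# so that the example is self-contained
concepts = {
    "China": ["Sino", "PRC"],
    "Iran": "Iranian",
    "Israel": ["Israelis", "Israeli"],
}

def build_variant_lookup(concepts):
    """Invert {canonical: variant(s)} into {lowercased variant: canonical}."""
    lookup = {}
    for canonical, variants in concepts.items():
        if isinstance(variants, str):
            variants = [variants]
        for variant in variants:
            lookup[variant.strip().lower()] = canonical
    return lookup

def replace_single_word_concepts(tokens, lookup):
    """Swap each token for its canonical form when one is defined."""
    return [lookup.get(tok.lower(), tok) for tok in tokens]

lookup = build_variant_lookup(concepts)
tokens = ["Iranian", "support", "for", "PRC", "claims"]
normalized = replace_single_word_concepts(tokens, lookup)
print(normalized)
```

Because the lookup works on whole tokens, it would need to run after tokenization, unlike the multi-word replacement, which has to run on the raw string.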

In [66]:
# Behold, a function to replace tricky multi-word concepts
# Needs to be done before tokenization since they are multi-word
def replace_multi_word_concepts(text):
    # Parameter is named `text` so it doesn't shadow the imported `string` module
    output = text
    for k, v in multi_word_concepts.items():
        if isinstance(v, list):
            for item in v:
                output = output.replace(item, k)
        else:
            output = output.replace(v, k)
    return output

In [57]:
# Defining our basic English stopwords to remove, including punctuation and numbers

stopwords_list = stopwords.words("english")
stopwords_list += list(string.punctuation)
str_numbs = list(string.digits)
stopwords_list += str_numbs

In [67]:
df["Description"] = df["Description"].map(replace_multi_word_concepts)
df.head()

Unnamed: 0,Year,Tier,Issue/Country,Impact,Likelihood,Description,Weight
0,2019,1,cyberattack,3,2,A highly disruptive cyberattack on US critical...,5
1,2019,1,North Korea,3,2,Renewed tensions on the Korean Peninsula follo...,5
2,2019,1,Iran,3,2,An armed confrontation between Iran and the Un...,5
3,2019,1,South China Sea,3,2,An armed confrontation over disputed maritime ...,5
4,2019,1,terrorist attack,3,2,A mass casualty terrorist attack on the US hom...,5


In [68]:
# Defining a function to tokenize our words, using RegEx, as well as
# make all words lowercase
def process_text(text):
    # Runs of letters, optionally followed by an apostrophe suffix (e.g. "state's")
    pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
    tokens_raw = nltk.regexp_tokenize(text, pattern)
    # Single-word concept replacement is currently disabled:
    # tokens_replaced = replace_tricky_concepts(single_word_concepts, tokens_raw)
    tokens = [w.lower() for w in tokens_raw]
    return tokens

In [69]:
process_text(df["Description"][0])

['a',
 'highly',
 'disruptive',
 'cyberattack',
 'on',
 'us',
 'critical',
 'infrastructure',
 'and',
 'networks']

In [14]:
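# Note: tricky_concepts_dict is not defined above. This cell and its output
# appear to be left over from an earlier version of the notebook, before the
# concepts were split into multi_word_concepts and single_word_concepts.
tricky_concepts_dict.values()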

dict_values(['South China Sea', 'denuclearization', 'Southeast Asian', 'Iranian', 'Russian', 'Ukrainian', ['Israelis', 'Israeli'], ['Palestinians', 'Palestinian'], 'Central America', 'Boko Haram', 'Democratic Republic of Congo', 'South Sudan', 'Central African Republic', 'Bosnia and Herzegovina', ['North Korea', 'Korean', 'North Korean'], ['EU', 'European'], 'East China Sea', 'Iraqi', ['Indian', 'Indo'], ['Islamic State', 'Islamic State of Iraq and Syria'], ['Saudi, Saudi Arabia'], 'Nagorno Karabakh', 'Syrian', 'Turkish', ['Pakistani', 'Pak'], 'intercontinental ballistic missile', ['Sino', 'PRC'], 'Middle East', 'AQAP'])

In [36]:
df.head()

Unnamed: 0,Year,Tier,Issue/Country,Impact,Likelihood,Description,Weight
0,2019,1,cyberattack,3,2,A highly disruptive cyberattack on U.S. critic...,5
1,2019,1,North Korea,3,2,Renewed tensions on the Korean Peninsula follo...,5
2,2019,1,Iran,3,2,An armed confrontation between Iran and the Un...,5
3,2019,1,South China Sea,3,2,An armed confrontation over disputed maritime ...,5
4,2019,1,terrorist attack,3,2,A mass casualty terrorist attack on the U.S. h...,5


In [10]:
df["Description"][0]

'A highly disruptive cyberattack on U.S. critical infrastructure and networks'

In [11]:
process_text(df["Description"][0])

['a',
 'highly',
 'disruptive',
 'cyberattack',
 'on',
 'u',
 's',
 'critical',
 'infrastructure',
 'and',
 'networks']
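The output above splits "U.S." into the two tokens 'u' and 's', while other outputs in this notebook show the text as "US". The notebook does not show how that change was made. As one possible approach, the abbreviation could be normalized before tokenizing; the helper `normalize_us` below is hypothetical, and the tokenizer is a standard-library stand-in that uses the same pattern as `process_text`.

```python
import re

def normalize_us(text):
    """Rewrite 'U.S.' as 'US' so it survives tokenization as one token."""
    return re.sub(r"\bU\.S\.", "US", text)

def tokenize(text):
    # Same pattern as process_text, applied with the standard library
    return [w.lower() for w in re.findall(r"[a-zA-Z]+(?:'[a-z]+)?", text)]

raw = "A highly disruptive cyberattack on U.S. critical infrastructure and networks"
tokens = tokenize(normalize_us(raw))
print(tokens)
```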

## Arriving at Keywords

In [52]:
# # Our CFR PPS data ranges from 2011 to 2019
# years = list(range(2011, 2020))

# # Creating a dictionary to hold the text per year
# annual_dict = {}

# for year in years:
#     # Creating a dataframe for each year
#     year_df = df.loc[df["Year"] == year]
#     # Chaining together the column of descriptions
#     year_list = list(itertools.chain(year_df["Description"]))
#     # Joining that chain into one long string
#     year_str_all = ' '.join(year_list)
#     # Setting the key as the year, value as the long descriptions
#     annual_dict[year] = year_str_all

In [56]:
# for year in annual_dict.keys():
#     annual_dict[year] = process_text(annual_dict[year])

In [60]:
# annual_features = {}

# for year in annual_dict.keys():
#     tfidf = TfidfVectorizer(max_features=10, lowercase=False)
#     year_input = list(annual_dict[year])
#     tfidf.fit_transform(year_input)
#     year_features = tfidf.get_feature_names()
#     annual_features[year] = year_features
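The commented-out cells above outline a per-year TF-IDF. As a rough illustration of that idea, the sketch below treats each year's combined descriptions as one document and ranks terms by a plain, unsmoothed TF-IDF score computed by hand. The two sample texts and the function `top_terms_per_year` are hypothetical. `TfidfVectorizer` smooths the inverse document frequency and normalizes each row by default, so its scores would differ from these.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus: one combined description string per year
docs_by_year = {
    2018: "cyberattack on infrastructure armed confrontation with Iran",
    2019: "cyberattack on networks renewed tensions with NKorea",
}

def tokenize(text):
    return [w.lower() for w in re.findall(r"[a-zA-Z]+", text)]

def top_terms_per_year(docs_by_year, n=3):
    """Return the n highest-scoring terms for each year."""
    tokens_by_year = {year: tokenize(text) for year, text in docs_by_year.items()}
    n_docs = len(tokens_by_year)

    # Document frequency: in how many years does each term appear?
    doc_freq = Counter()
    for tokens in tokens_by_year.values():
        doc_freq.update(set(tokens))

    result = {}
    for year, tokens in tokens_by_year.items():
        counts = Counter(tokens)
        # Term frequency times inverse document frequency (no smoothing)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[year] = ranked[:n]
    return result

top_terms = top_terms_per_year(docs_by_year)
print(top_terms)
```

Terms that appear in every year, such as "cyberattack" in this sample, score zero and drop out of the ranking, which is the property that makes TF-IDF useful for finding year-specific keywords.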

In [59]:
# itertools.chain over a single Series adds nothing, so convert directly
desc_list = df["Description"].tolist()
desc_list

['A highly disruptive cyberattack on US critical infrastructure and networks',
 'Renewed tensions on the Korean Peninsula following a collapse of the denuclearization negotiations',
 'An armed confrontation between Iran and the United States or one of its allies over Iran’s involvement in regional conflicts and support of militant proxy groups',
 'An armed confrontation over disputed maritime areas in the SouthChinaSea between China and one or more SEAsia claimants (Brunei, Malaysia, Philippines, Taiwan, and Vietnam)',
 'A mass casualty terrorist attack on the US homeland or a treaty ally by either (a) foreign or homegrown terrorist(s)',
 'Continued violent reimposition of government control in Syria leading to further civilian casualties and heightened tensions among external parties to the conflict',
 'Deepening economic crisis and political instability in Venezuela leading to violent civil unrest and increased refugee outflows',
 'Worsening of the humanitarian crisis in Yemen, exace

In [91]:
# all_tokens = process_text(str_all)

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(lowercase=True, stop_words=stopwords_list)
cv_all = cv.fit_transform(desc_list)
# all_features = cv.get_feature_names()

# all_features

In [93]:
cv_all.shape

(267, 541)