# Keyword expansion

In this exercise we use the keyword expansion technique proposed in `Computer-Assisted Keyword and Document Set Discovery from Unstructured Text` by King, Lam and Roberts (2017) to label a dataset of tweets according to whether or not they are related to COVID-19.

The idea is to use an initial list of keywords to label the data, and then use supervised learning to expand the list of keywords to get a better sense of how people talk about a topic. The approach is iterative: you start with a list of keywords, expand it, run the procedure again, and so on until the list saturates. The approach also emphasises that you should read some of the text that you label, in order to ensure correct labelling.


This exercise is a Python translation of Gregory Eady's R exercise, heavily inspired by the replication material found here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FMJDCD.

If interested, you can also see Greg's walk-through of the R version of this code in his video here:
https://gregoryeady.com/SocialMediaDataCourse/readings/Keywords/

### Read in required packages

In [40]:
import pandas as pd
#import pyreadr #package to allow us to read in .rds data files (native R datafile)
from nltk.stem.snowball import SnowballStemmer
import re
from tqdm import tqdm
from collections import OrderedDict
from collections import defaultdict
from collections import namedtuple
import numpy as np
from nltk.stem import PorterStemmer
from nltk.tokenize import WhitespaceTokenizer
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import random
from math import lgamma
from sklearn import linear_model
import matplotlib.pyplot as plt
import datetime

# 1. Load the data

Download the "MOC-tweets" data from the course module on Absalon, and load the data.

In [41]:
df = pd.read_csv("data/MOC_tweets.csv")

print(df.shape)
df.head()

(1615238, 15)


Unnamed: 0.1,Unnamed: 0,user_id,num_tweets,raw_url,url,url_tweet_part,tweet_type,date,tweet_id,retries,text_user_id,text,geography,affiliation,nominate_name
0,1,5558312.0,430143,none,none,none,authored,20190319,1.108e+18,999999,5558312.0,Federal government employees are dedicated pub...,AR,Republican,"BOOZMAN, John"
1,2,5558312.0,430143,none,none,none,authored,20170803,8.929066e+17,999999,5558312.0,Congrats to @SenTomCotton's Sand Lizards on th...,AR,Republican,"BOOZMAN, John"
2,3,5558312.0,430143,https://twitter.com/60Minutes/status/656077372...,https://twitter.com/60Minutes/status/656077372...,author,quote,20151019,6.560929e+17,0,5558312.0,WATCH: I applaud Northeast #Arkansas residents...,AR,Republican,"BOOZMAN, John"
3,4,5558312.0,430143,none,none,none,authored,20181004,1.04795e+18,999999,5558312.0,After reviewing the FBI supplemental backgroun...,AR,Republican,"BOOZMAN, John"
4,5,5558312.0,430143,none,none,none,authored,20180607,1.004722e+18,999999,5558312.0,Mack McLarty is a dedicated public servant who...,AR,Republican,"BOOZMAN, John"


# 1.1. Preprocessing

Due to time constraints, the preprocessing code is given below, ready to be run. Take a look at the code to understand what is being done.


Subset the data by removing tweets before 2019 (we are only interested in tweets that may reference COVID-19).

In [42]:
df = df[df.date  >= 20190101] # Subset to 2019 and later because we'll look at COVID-19 over time
df = df.loc[df.tweet_id.drop_duplicates().index] # removing duplicate observations (tweets)

df.head()

Unnamed: 0.1,Unnamed: 0,user_id,num_tweets,raw_url,url,url_tweet_part,tweet_type,date,tweet_id,retries,text_user_id,text,geography,affiliation,nominate_name
0,1,5558312.0,430143,none,none,none,authored,20190319,1.108e+18,999999,5558312.0,Federal government employees are dedicated pub...,AR,Republican,"BOOZMAN, John"
7,8,5558312.0,12531,https://www.vlm.cem.va.gov/?utm_source=Veteran...,https://www.vlm.cem.va.gov/?utm_source=Veteran...,author,authored,20190913,1.172596e+18,0,5558312.0,.@DeptVetAffairs recently rolled out a new dig...,AR,Republican,"BOOZMAN, John"
10,11,5558312.0,12531,none,none,none,authored,20190912,1.172256e+18,999999,5558312.0,I know the importance of empowering women in t...,AR,Republican,"BOOZMAN, John"
17,18,5558312.0,430143,https://www.kffb.com/us-senator-john-boozman-m...,https://www.kffb.com/us-senator-john-boozman-m...,author,authored,20190322,1.109101e+18,0,5558312.0,It was great to spend some time with leaders i...,AR,Republican,"BOOZMAN, John"
18,19,5558312.0,12531,none,none,none,authored,20190918,1.174356e+18,999999,5558312.0,"For 72 years, @usairforce has been blazing the...",AR,Republican,"BOOZMAN, John"


In [43]:
df.reset_index(inplace = True, drop = True)

df.head()

Unnamed: 0.1,Unnamed: 0,user_id,num_tweets,raw_url,url,url_tweet_part,tweet_type,date,tweet_id,retries,text_user_id,text,geography,affiliation,nominate_name
0,1,5558312.0,430143,none,none,none,authored,20190319,1.108e+18,999999,5558312.0,Federal government employees are dedicated pub...,AR,Republican,"BOOZMAN, John"
1,8,5558312.0,12531,https://www.vlm.cem.va.gov/?utm_source=Veteran...,https://www.vlm.cem.va.gov/?utm_source=Veteran...,author,authored,20190913,1.172596e+18,0,5558312.0,.@DeptVetAffairs recently rolled out a new dig...,AR,Republican,"BOOZMAN, John"
2,11,5558312.0,12531,none,none,none,authored,20190912,1.172256e+18,999999,5558312.0,I know the importance of empowering women in t...,AR,Republican,"BOOZMAN, John"
3,18,5558312.0,430143,https://www.kffb.com/us-senator-john-boozman-m...,https://www.kffb.com/us-senator-john-boozman-m...,author,authored,20190322,1.109101e+18,0,5558312.0,It was great to spend some time with leaders i...,AR,Republican,"BOOZMAN, John"
4,19,5558312.0,12531,none,none,none,authored,20190918,1.174356e+18,999999,5558312.0,"For 72 years, @usairforce has been blazing the...",AR,Republican,"BOOZMAN, John"


Save the original text and lowercase the text column.

In [44]:
df['text_original'] = df['text']
df['text'] = df['text'].str.lower()

Do some (but not all) preprocessing by removing tweet elements that we do not care about.


In [45]:
# Remove all @mentions (e.g. "@some_user_name"), wherever they occur in the tweet
df['text'] = df['text'].str.replace("\\B@\\w+|^@\\w+", "", regex = True)
# Change ampersands to "and"
df['text'] = df['text'].str.replace("&amp;", "and")
# Remove the "RT" and "via" (old retweet style)
df['text'] = df['text'].str.replace("(^RT|^via)((?:\\b\\W*@\\w+)+)","", regex=True, case=False)
# Remove URLs
df['text'] = df['text'].str.replace("(https|http)?:\\/\\/(\\w|\\.|\\/|\\?|\\=|\\&|\\%)*\\b", "", regex = True)
# Keep ASCII only (removes Cyrillic, Japanese characters, etc.)
df['text'] = df['text'].str.replace("[^ -~]", "", regex = True)
# Remove double+ spaces (e.g. "build   the wall" to "build the wall")
df['text'] = df['text'].str.replace("\\s+", " ", regex = True)
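
To see what the cleaning steps above do to a single tweet, the sketch below applies the same regular expressions with the standard-library `re` module instead of pandas. The function name `clean_tweet` and the example tweet are made up for illustration; the patterns and their order are copied from the cell above.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the notebook's cleaning steps, in the same order, to one string."""
    text = text.lower()
    # Remove @mentions
    text = re.sub(r"\B@\w+|^@\w+", "", text)
    # Change HTML-escaped ampersands to "and"
    text = text.replace("&amp;", "and")
    # Remove old-style "RT"/"via" prefixes that are followed by mentions
    text = re.sub(r"(^RT|^via)((?:\b\W*@\w+)+)", "", text, flags=re.IGNORECASE)
    # Remove URLs
    text = re.sub(r"(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b", "", text)
    # Keep printable ASCII only
    text = re.sub(r"[^ -~]", "", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text

# Hypothetical example tweet, not taken from the dataset
example = "Thanks @SenTomCotton &amp; team! Read more: https://t.co/abc123   ✅"
cleaned = clean_tweet(example)
print(repr(cleaned))
```

Note that mentions are removed before the "RT"/"via" pattern runs, so that pattern (which requires a following mention) will rarely find anything left to match; this mirrors the order used in the cell above.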

With our mostly preprocessed tweets, let us begin building our classifier from chosen keywords.

# 2. Define inclusion and exclusion keywords

You should now define the initial keywords that you want to include and exclude. Keywords to include should reference COVID-19, e.g. "covid19" and/or "coronavirus". We will use these initial keywords to find more keywords relevant to the topic.

1. Define 4 lists: the **first** should contain a seed reference word to be included, the **second** should contain the expanded list of reference words to include (empty to begin with), the **third** should contain a seed reference word to be excluded (can be left empty), and the **fourth** should contain the expanded list of reference words to exclude (empty to begin with).

2. Using `.join`, collapse the two inclusion and exclusion lists, respectively, into strings that can be used as regex OR-operations. The result should be in the form \['dog', 'cat'\] --> 'dog|cat'

3. Use this regex string to create a bool column indicating whether the tweet contains one of your keywords.

4. If you have any exclusions, also find the tweets that contain the excluded keywords (the exclusion list can be left empty).

5. Define a variable that is either 0 or 1, where 1 shows that the tweet contains one or more of your inclusion keywords _and_ does not contain any exclusion keywords. Create a bool column with this.

6. See how many tweets you have labelled as related to COVID-19 so far (how many 0s and how many 1s).

7. Sample 10 tweets labelled as COVID-19, and read the text in them (in the text_original column).
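
The labelling logic in steps 2 to 5 can be sketched on a toy list of strings before running it on the full dataframe. This is a minimal illustration using only the standard library: the helper names `build_pattern` and `label`, the toy tweets, and the exclusion keyword "corona beer" are all hypothetical. The sketch also wraps each keyword in `re.escape`, which is an addition to the exercise's plain `'|'.join(...)`: it only matters if a keyword contains regex metacharacters such as `.` or `+`.

```python
import re

def build_pattern(keywords):
    """Join keywords into one case-insensitive OR-pattern; None if the list is empty."""
    if not keywords:
        return None
    return re.compile("|".join(re.escape(k) for k in keywords), flags=re.IGNORECASE)

def label(text, include, exclude):
    """Return 1 if text matches an inclusion keyword and no exclusion keyword, else 0."""
    inc = build_pattern(include)
    exc = build_pattern(exclude)
    included = inc is not None and inc.search(text) is not None
    excluded = exc is not None and exc.search(text) is not None
    return int(included and not excluded)

# Hypothetical toy tweets
toy_tweets = [
    "New #Coronavirus guidance is out",
    "Infrastructure week is back",
    "Corona beer sales and the coronavirus",
]

labels = [label(t, ["covid19", "coronavirus"], ["corona beer"]) for t in toy_tweets]
print(labels)
```

An empty exclusion list yields no pattern at all, which avoids the pitfall that an empty regex string matches every tweet.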

In [46]:
# define lists for seed keywords and excluded words
reference_words_seeds = ["covid19", "coronavirus"] # initial
reference_words_expanded = [] # expanded

reference_words_excluded = [] # initial
reference_words_excluded_expanded = [] # expanded

In [47]:
# join lists into regex strings
reference_words = '|'.join(reference_words_seeds + reference_words_expanded)
reference_words_excluded = '|'.join(reference_words_excluded + reference_words_excluded_expanded)

In [48]:
reference_words

'covid19|coronavirus'

In [49]:
excluded_words = bool(reference_words_excluded.strip("|"))

In [50]:
# define variable to show whether inclusion keywords and no exclusion words are included
df['included'] = df['text'].str.contains(reference_words, regex=True, case=False)

df['excluded'] = False
if excluded_words:
    df['excluded'] = df['text'].str.contains(reference_words_excluded, regex = True, case = False)

In [51]:
df['reference_set'] = (df['included'] & ~df['excluded']).astype(int)

In [52]:
df['reference_set'].value_counts()

reference_set
0    576881
1      2939
Name: count, dtype: int64

In [53]:
# sample 10 tweets with included keywords
sample_10 = df.loc[df['reference_set']==1,'text_original'].sample(10)
for i in sample_10:
    print("Tweet:",i)

Tweet: Though the Trump Administration’s request for emergency coronavirus funding is woefully insufficient to protect Ame… https://t.co/Caez0wk8EP
Tweet: Like many Americans, I have questions about what the #Coronavirus means for my community.   I had the opportunity t… https://t.co/sIQ685augq
Tweet: House Passes Bipartisan Coronavirus Response: "Be Prepared, Not Scared," is still the appropriate message. To that… https://t.co/C1Pl2RG2nG
Tweet: Completely unserious—and yet, emblematic of how most Washington Democrats are approaching the Coronavirus problem.… https://t.co/RFyEZm0NxU
Tweet: .@barronsonline discovered that the official number of Chinese coronavirus deaths could be predicted using a simple… https://t.co/Utqjtuj57V
Tweet: America must always be ahead of the game on viral threats like the #Coronavirus, and I’m proud to say that our gove… https://t.co/xSFDu26oWC
Tweet: Last night we learned of the first two presumptively-positive cases of coronavirus in Florida, including on

# 3. Further preprocessing and vectorizing

Next, we need to tokenize the data and preprocess the tokens (as opposed to the preprocessing on the full string as earlier).

We will also remove all the keywords that demarcate exclusion and inclusion from the covid-19 theme. This is because we want the model to learn to predict the topic using other, new keywords.

1. Create a new column that equals the `text` column but with the keywords removed; the code below names it `text_preprossed` (Hint: use `.str.replace()` with `regex = True`).

-----

To spend less time on lessons you have already been through, code for further preprocessing is provided. This code may take a few minutes to run. The steps are:

2. Tokenizing. A whitespace tokenizer is used, since we want to keep words with '-'.

3. Removing any tokens that are only numbers (you can remove more types of tokens if you want - up to you).

4. Remove any empty strings.

5. Stemming.

6. Re-joining the stemmed tokens using a whitespace.

7. Creating a column with the preprocessed sentences.

-----

8. Now you have a column  of sentences made out of stemmed and preprocessed tokens. Use a CountVectorizer to make a document term matrix based on this column. Set `min_df = 10` and `max_df = 0.999`, as well as `stop_words = 'english'` and set an appropriate `ngram_range`.

NB: Do not try to make this DTM into a dataframe or np array, as you will most likely run out of memory. It is a sparse matrix that you can work with in the same way as an np.array.



In [54]:
# remove keywords (inclusion & exclusion) so the model can predict the topic using other keywords
if excluded_words:
    all_current_keywords = reference_words + "|" + reference_words_excluded
else:
    all_current_keywords = reference_words

df['text_preprossed'] = df['text'].str.replace(all_current_keywords, "", regex=True, case=False)


In [55]:
tokenizer = WhitespaceTokenizer()
ps = PorterStemmer()

In [None]:
pre_prossed_sents =[]
for sent in tqdm(df['text_preprossed']): #tqdm adds progress bar
    words = tokenizer.tokenize(sent)
    words = [re.sub(r'\d+', '', word) for word in words] #stripping digits from every token; tokens that were only numbers become empty strings
    words = [x for x in words if x] #removing empty strings
    sent_stem = [ps.stem(word) for word in words]

    sent_done = " ".join(sent_stem)
    pre_prossed_sents.append(sent_done)


100%|██████████| 579820/579820 [05:52<00:00, 1643.61it/s]


In [57]:
df['text_pre_stem'] = pre_prossed_sents

In [None]:
# Consider saving a csv at this time (index = False avoids writing an extra "Unnamed: 0" column)
df.to_csv('data/MOC_Tweets_preprocessed.csv', index = False)

In [59]:
#if needed
#df = pd.read_csv('data/MOC_Tweets_preprocessed.csv')

In [60]:
vectorizer = CountVectorizer(min_df = 10,
            max_df = 0.999,
            stop_words='english',
            ngram_range = (1,2)) #set for larger n-grams

corpus = vectorizer.fit_transform(df['text_pre_stem'])

In [None]:
# create document-term matrix of counts
DTM_dict = {"DTM":corpus,
               "labels":list(df['reference_set'])}

# 4. Sample training data and make predictions

Let us sample some tweets we will use to train our classifier.

1) Define two lists of indices: one containing the indices of the tweets in the reference set (those labelled as belonging to the covid-19 topic), and another containing a random sample of N tweets not in the reference set (N should be either 2x the number of tweets in the reference set or 50000, whichever is smaller).

2) You now have 2 lists of indices – use these to subset the Document Term Matrix (where each row represents a tweet, and each column a token) and the reference set column in the dataframe (the labels). Define a train DTM and  a train labels object.

3) Fit a cross validated lasso logistic regression, using the DTM subset as input (X) and the reference subset as labels (y). This means that we are trying to predict whether a tweet is in the reference set using the term frequencies. (Hint:  use the sample code with sklearn's `linear_model.LogisticRegressionCV()`). This may take some time to run (approx. 5 min, depending on the size of your train data), change some of the hyperparameters if necessary.

4) Use the fitted model to make predictions on the full DTM, and create a column in the dataframe called `predicted_raw` based on this. (Remember that the rows in the DTM correspond to the rows in the dataframe).

5) The model outputs probabilities rather than classes, so check the standard deviation of the `predicted_raw` column. This is just a sanity check that the predictions actually vary.

6) Set a threshold of 0.25, and assign 1 or 0 to a new column called `predicted`, depending on whether the probability in `predicted_raw` is >= the threshold. (Note: Keep the threshold low if you want more tweets to get into the target set).

7) Create a column called `set_var`. This variable should be == "Reference" if the observation is in the reference set (our original covid-19 labels), "Target" if it is _predicted_ to be a covid-19 related tweet (1) and "Not target" if it is _predicted_ not to be (0).

8) Create a crosstab of `predicted` and `set_var` to see how your model does (hint: use `pd.crosstab()`). Examine the crosstab - what do the different entries mean?

In [68]:
# Determine how many tweets to sample for the training data
n_search = min(sum(df['reference_set']==1)* 2, 50000)
print(n_search)

5878


In [74]:
# list containing indices of tweets in reference set
ref_ids = df[df['reference_set']==1].index.tolist()

# random sample of N tweets not in the reference set (N is 2x the number of tweets in the reference set or 50000, whichever is smaller)
n_search = min(sum(df['reference_set']==1)* 2, 50000)
print(n_search)

search_ids = df[df['reference_set']==0].index.tolist()

search_ids_sample = random.sample(search_ids, n_search)


5878


In [77]:
# subset dtm and labels to define train dtm and train labels objects
ids = ref_ids + search_ids_sample

print(len(ref_ids),len(search_ids_sample),len(ids))

X_train = corpus[ids]
y_train = df.loc[ids, 'reference_set']


2939 5878 8817


In [80]:
# Defining the classifier

#logistic reg with lasso penalty
clf = linear_model.LogisticRegressionCV(
    penalty="l1", 
    n_jobs = -1, 
    solver = "saga", 
    max_iter=10000, 
    verbose = 1)

In [None]:
# Fitting the model. This takes some time.
clf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.


convergence after 1 epochs took 0 seconds
convergence after 1 epochs took 0 seconds
convergence after 1 epochs took 0 seconds
convergence after 1 epochs took 0 seconds
convergence after 1 epochs took 0 seconds
convergence after 1 epochs took 0 seconds
convergence after 1 epochs took 0 seconds
convergence after 1 epochs took 0 seconds
convergence after 1 epochs took 0 seconds
convergence after 1 epochs took 0 seconds
convergence after 356 epochs took 20 seconds
convergence after 32 epochs took 1 seconds
convergence after 425 epochs took 22 seconds
convergence after 444 epochs took 23 seconds
convergence after 372 epochs took 23 seconds
convergence after 29 epochs took 1 seconds
convergence after 30 epochs took 1 seconds
convergence after 29 epochs took 1 seconds
convergence after 437 epochs took 24 seconds
convergence after 31 epochs took 2 seconds
convergence after 214 epochs took 9 seconds
convergence after 306 epochs took 10 seconds
convergence after 319 epochs took 11 seconds
conver
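
Steps 4 to 8 of this section can be sketched on a toy dataframe. To keep the example self-contained, the probabilities below are made-up stand-ins; on the real data they would plausibly come from something like `clf.predict_proba(corpus)[:, 1]`, i.e. the predicted probability of class 1 for every row of the full DTM. The dataframe `toy` and the variable `table` are hypothetical names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "reference_set": [1, 1, 0, 0, 0, 0],
    # Made-up stand-in for clf.predict_proba(corpus)[:, 1]
    "predicted_raw": [0.90, 0.20, 0.60, 0.30, 0.10, 0.05],
})

# Step 5: sanity check that the predictions vary at all
print("std of predicted_raw:", toy["predicted_raw"].std())

# Step 6: a low threshold lets more tweets into the target set
threshold = 0.25
toy["predicted"] = (toy["predicted_raw"] >= threshold).astype(int)

# Step 7: the reference set takes priority; the rest is split by prediction
toy["set_var"] = np.where(
    toy["reference_set"] == 1,
    "Reference",
    np.where(toy["predicted"] == 1, "Target", "Not target"),
)

# Step 8: rows are predicted classes, columns are the three sets
table = pd.crosstab(toy["predicted"], toy["set_var"])
print(table)
```

In the crosstab, a reference tweet with `predicted == 0` is one that the model would not have found without the seed keywords, since those keywords were removed from the text before training.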

# 5. Calculate the log likelihood as in the paper

1) Create 3 sets of indices based on the `set_var` column: one for "Target", one for "Not target" and one for "Reference".

2) Create 3 objects for the target, not_target and reference sets, based on the DTM. These should be, for each token: how often the token occurs in the set, how many documents in the set contain the token, and the proportion of documents in the set containing the token. (Hint: see the sample code for the target set. If you want a list rather than a matrix object, you can use `.tolist()[0]`.)

3) Create a new dataframe, where each row is a token from the DTM (you can use `vectorizer.get_feature_names_out()`; older scikit-learn versions call this `get_feature_names()`), with one column for each of the 9 objects you just created.

4) Subset the dataset by removing any observations where the terms do not appear in either the target or not_target set, thus keeping only tokens that were in the original search set (step (a) on page 979).

5) Keywords go in the target list if their proportion is higher among those documents estimated to be in the target set than among those that are not; e.g. if for the word "pandemic", 15% of documents predicted as target contain the word "pandemic" versus only 2% among those in the not_target set (step (b) on page 979). Therefore: create a new column that should be True if the token has a higher or equal proportion in the target set than in the not_target set.

6) Examine the `llik` function provided and look in the paper - what does it do?

7) Calculate the number of documents in the target and the not_target set.

8) Use the provided function to calculate the log likelihood for each token. Assign this to a new column in the dataframe created in step 3.
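
Steps 1 to 5 above can be sketched on a toy DTM with four documents and three tokens. Everything here is illustrative: the matrix, the token names, the helper `set_stats`, and the column name `in_target_list` are assumptions rather than the exercise's reference solution, and integer row indices are used instead of boolean lists.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy DTM: 4 documents (rows) x 3 tokens (columns)
toy_dtm = csr_matrix(np.array([[1, 0, 2],
                               [0, 1, 0],
                               [1, 1, 0],
                               [0, 0, 1]]))
toy_tokens = ["pandemic", "bill", "viru"]
toy_set_var = np.array(["Target", "Not target", "Target", "Reference"])

def set_stats(dtm, mask):
    """Per-token frequency, document count and document share within one set."""
    rows = np.flatnonzero(mask)
    sub = dtm[rows]
    freq = np.asarray(sub.sum(axis=0)).ravel()
    num_docs = np.asarray((sub > 0).sum(axis=0)).ravel()
    prop = num_docs / max(len(rows), 1)
    return freq, num_docs, prop

stats = {}
for name, label in [("target", "Target"), ("not_target", "Not target"), ("ref", "Reference")]:
    freq, num_docs, prop = set_stats(toy_dtm, toy_set_var == label)
    stats[f"{name}_freq"] = freq
    stats[f"{name}_num_docs"] = num_docs
    stats[f"{name}_num_docs_prop"] = prop

token_df = pd.DataFrame(stats, index=toy_tokens)

# Step 4: keep tokens that occur somewhere in the search set (target or not_target)
in_search = (token_df["target_num_docs"] + token_df["not_target_num_docs"]) > 0
token_df = token_df[in_search].copy()

# Step 5: a token goes in the target list if its share there is at least as high
token_df["in_target_list"] = (
    token_df["target_num_docs_prop"] >= token_df["not_target_num_docs_prop"]
)
print(token_df)
```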


In [None]:
# Creating 3 lists of indices

target_ids = list(df['set_var'] == 'Target') # boolean mask built from the set_var column created in task 4.7
not_target_ids =
ref_ids =

In [None]:
# Creating statistics for the target, not_target and reference sets

target_freq = np.sum(DTM_dict['DTM'][target_ids,:],0) #how many times is each token used in the target documents
target_num_docs = np.sum(DTM_dict['DTM'][target_ids,:] > 0, axis = 0) #how many target documents does each token appear in
target_num_docs_prop =  target_num_docs / sum(target_ids) #proportion of target docs with each token


not_target_freq =
not_target_num_docs =
not_target_num_docs_prop =


ref_freq =
ref_num_docs =
ref_num_docs_prop =


# Saving the above in a dict

d = {'target_freq' :target_freq.tolist()[0],
    'target_num_docs': target_num_docs.tolist()[0],
    'target_num_docs_prop': target_num_docs_prop.tolist()[0],
    'not_target_freq':
    'not_target_num_docs':
    'not_target_num_docs_prop':
    'ref_freq':
    'ref_num_docs':
    'ref_num_docs_prop': }


In [None]:
# Likelihood function

def llik(target_num_docs, nottarget_num_docs, target_num_docs_total, nottarget_num_docs_total):
    '''No docstring - you need to see what it does :) '''
    x1 = ((lgamma(target_num_docs + 1) + lgamma(nottarget_num_docs + 1)) -
           lgamma(target_num_docs + nottarget_num_docs + 1 + 1))
    x2 = ((lgamma(target_num_docs_total - target_num_docs + 1) +
           lgamma(nottarget_num_docs_total - nottarget_num_docs + 1)) -
           lgamma(target_num_docs_total - target_num_docs +
          nottarget_num_docs_total - nottarget_num_docs + 1 + 1))
    llik = x1 + x2
    return llik
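
One way to get a feel for the function is to call it on made-up counts. Algebraically, each of `x1` and `x2` has the form `lgamma(a) + lgamma(b) - lgamma(a + b)`, which is the log of the Beta function, so the score is `log B(t + 1, n + 1) + log B(T - t + 1, N - n + 1)`, where `t` and `n` are the document counts for the token and `T` and `N` are the set sizes; how this maps onto the paper's notation is left for you to check. The sketch below repeats the function so that it runs on its own, and the counts are hypothetical.

```python
from math import lgamma, isclose

def llik(target_num_docs, nottarget_num_docs, target_num_docs_total, nottarget_num_docs_total):
    """Same computation as the function above, repeated so this sketch runs on its own."""
    x1 = ((lgamma(target_num_docs + 1) + lgamma(nottarget_num_docs + 1)) -
          lgamma(target_num_docs + nottarget_num_docs + 1 + 1))
    x2 = ((lgamma(target_num_docs_total - target_num_docs + 1) +
           lgamma(nottarget_num_docs_total - nottarget_num_docs + 1)) -
          lgamma(target_num_docs_total - target_num_docs +
                 nottarget_num_docs_total - nottarget_num_docs + 1 + 1))
    return x1 + x2

# Hypothetical set sizes: 100 target and 100 not_target documents
n_target, n_not_target = 100, 100

skewed = llik(30, 0, n_target, n_not_target)    # token appears only in target documents
even = llik(15, 15, n_target, n_not_target)     # token is spread evenly across both sets
mirrored = llik(0, 30, n_target, n_not_target)  # token appears only in not_target documents

print(f"skewed:   {skewed:.2f}")
print(f"even:     {even:.2f}")
print(f"mirrored: {mirrored:.2f}")
```

The token concentrated in one set scores higher than the evenly spread one, and with equal set sizes the score is the same whichever set the token is concentrated in. The score alone therefore does not say which list a token belongs to, which is why the proportion comparison from step 5 is needed alongside it.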

# 6. Examine new keywords

1) Show the top 25 keywords based on highest log likelihood, where the share of documents in the target set is higher than in the not_target set (see task 5.5). These are the tokens that are most likely to differentiate between the target and not_target sets (meaning that they help the model predict covid-19 related tweets).

2) Do the same with the not_target - what are these terms representative of?

3) Are there any of these tokens that you want to include in the keywords? Choose 1-3 keywords that you want to include or exclude.

4) For the 1-3 keywords you have found, find tweets that contain the given keyword in the original tweet text in the original dataframe. Read some tweets where the keyword is used in context - do you still want to include or exclude the keyword?

5) Optional: add the new keywords to the original lists defined at the beginning of this exercise (task 2.1), and rerun the exercise up to this point, now including the new keywords. This is how computer-assisted keyword discovery is used iteratively.
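
For step 4, reading a candidate keyword in context can be sketched with a small helper. The function name `keyword_in_context`, its `window` and `max_hits` parameters, and the toy tweets are hypothetical; the sketch uses only the standard library and returns a snippet of text around the first match in each tweet.

```python
import re

def keyword_in_context(texts, keyword, window=40, max_hits=5):
    """Return up to max_hits snippets showing keyword with `window` characters of context."""
    pattern = re.compile(re.escape(keyword), flags=re.IGNORECASE)
    hits = []
    for text in texts:
        match = pattern.search(text)
        if match:
            start = max(match.start() - window, 0)
            end = min(match.end() + window, len(text))
            hits.append(text[start:end])
        if len(hits) >= max_hits:
            break
    return hits

# Hypothetical toy tweets standing in for df['text_original']
toy_texts = [
    "We are in a global pandemic and need testing now.",
    "Happy birthday to our great state!",
    "The PANDEMIC relief bill passed the House today.",
]

hits = keyword_in_context(toy_texts, "pandemic", window=15)
for snippet in hits:
    print("...", snippet, "...")
```

On the real data the first argument could be a sample of the `text_original` column, so that different tweets are shown on each run.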


In [None]:
# [your code here]

# 7. Optional: Use your new classifier for downstream tasks

1) Assign a `final_classification` boolean column in the original dataframe, which should be 1 if the tweet contains any of the keywords in the new, complete list and if it does not contain any of the exclusion keywords.

2) Examine the value counts of the political affiliation variable. Assign "Democrat" to the tweets labelled "Independent" (look up who the Independent members are to see why).

3) Plot the share of tweets labelled as covid-19 relevant by your classifier (y), grouped on days (x) for each party - meaning two lines of covid-19 share across time.

**Hints:** <br>
The pandas `groupby` functionality may be of help to you. <br>
You can also turn the date ints into so-called datetime objects using this:

`dates = [datetime.datetime.strptime(str(d), "%Y%m%d") for d in x]`

where x is a list of the unique dates as int.
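
The grouping in step 3 can be sketched on a toy dataframe with the same column names as the tweet data. The rows below are made up, and `pd.to_datetime` is used as an alternative to the list comprehension in the hint; the plotting call is left as a comment so that the sketch runs without a display.

```python
import pandas as pd

# Hypothetical toy data with the columns used in this section
toy = pd.DataFrame({
    "date": [20200301, 20200301, 20200301, 20200302, 20200302, 20200302],
    "affiliation": ["Democrat", "Republican", "Independent",
                    "Democrat", "Republican", "Republican"],
    "final_classification": [1, 0, 1, 1, 1, 0],
})

# Step 2: fold Independents into the Democrat group
toy["affiliation"] = toy["affiliation"].replace({"Independent": "Democrat"})

# Turn the integer dates (YYYYMMDD) into datetime objects
toy["day"] = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")

# Step 3: the mean of a 0/1 column is the share of tweets labelled as covid-19 related
daily_share = (
    toy.groupby(["day", "affiliation"])["final_classification"]
       .mean()
       .unstack("affiliation")
)
print(daily_share)

# daily_share.plot() would then draw one line per party (needs matplotlib)
```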


In [None]:
# [ your code here]