# Candidate merging and related preprocessing


**Necessary files:**
 - event_df = the df\_[event]\_clean.csv file, an event dataframe containing unique tweets only
 
 _The goal of this notebook is to tag all tweets in event_df and extract all noun phrases from them. The noun phrases serve as candidates: the pipeline function categorises them, and only the unique, cleaned candidates are saved to the event_cands dataframe._

In [6]:
#python libraries
import stanza

import numpy as np
import pandas as pd
import os
import re
from tqdm import tqdm
import time
from collections import Counter, defaultdict


# self written modules
import preprocessing
import candidate_processing as cand_prep
import candidate_extraction as cand_ex

import pickle

def pickle_file(file_name, file_to_dump):
    directory_path = os.getcwd() + "/../../../../"
    folder_name = file_name.split('_')[0]
    file_path = directory_path +  fr"Dropbox (CBS)/Master thesis data/Candidate Data/{folder_name}/{file_name}"
    with open(file_path, 'wb') as fp:
        pickle.dump(file_to_dump, fp)

def load_pickle(file_name):
    directory_path = os.getcwd() + "/../../../../"
    folder_name = file_name.split('_')[0]
    #folder_name = re.sub(r'[12]', '', folder_name)
    file_path = directory_path + fr"Dropbox (CBS)/Master thesis data/Candidate Data/{folder_name}/{file_name}"
    with open(file_path, "rb") as input_file:
        return pickle.load(input_file)
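
Both helpers take the target folder from the part of `file_name` before the first underscore, and `pickle_file` assumes that this folder already exists. A minimal sketch of the same round trip against a temporary directory, where the `candidate_path` helper and the sample data are hypothetical and only serve the illustration:

```python
import os
import pickle
import tempfile


def candidate_path(base_dir, file_name):
    # hypothetical helper: same folder rule as pickle_file/load_pickle above,
    # i.e. the folder is the part of the file name before the first underscore
    folder_name = file_name.split('_')[0]
    return os.path.join(base_dir, folder_name, file_name)


with tempfile.TemporaryDirectory() as base_dir:
    path = candidate_path(base_dir, 'rohingya_noun_phrases')
    # the folder has to exist before open(..., 'wb') is called
    os.makedirs(os.path.dirname(path), exist_ok=True)

    with open(path, 'wb') as fp:
        pickle.dump([['the camp'], []], fp)  # made-up sample data

    with open(path, 'rb') as fp:
        restored = pickle.load(fp)

    folder = os.path.basename(os.path.dirname(path))
```

If a new event is added, its folder under the candidate data directory needs to be created first, otherwise `pickle_file` fails when opening the file.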


## 1. Importing the data

In [2]:
greece_url = r"Dropbox (CBS)/Master thesis data/Event Dataframes/Clean/df_greece_clean.csv" # for Greece
tigray_url = r"Dropbox (CBS)/Master thesis data/Event Dataframes/Clean/df_tigray_clean.csv" # for Tigray
rohingya_url = r"Dropbox (CBS)/Master thesis data/Event Dataframes/Clean/df_rohingya_clean.csv" # for Rohingya
channel_url = r"Dropbox (CBS)/Master thesis data/Event Dataframes/Clean/df_channel_clean.csv" # for channel

def read_event_df(data_url):
    directory_path = os.getcwd() + "/../../../../" + data_url 
    event_df = pd.read_csv(directory_path, index_col=0)
    event_df.reset_index(drop=True, inplace=True)
    print(f'loaded {event_df.shape[0]} tweets!')
    return event_df

# pick the df 
event_df = read_event_df(channel_url)
event_df.head()

loaded 173758 tweets!


Unnamed: 0,source,text,lang,id,created_at,author_id,retweet_count,reply_count,like_count,quote_count,...,refugee,migrant,immigrant,asylum_seeker,other,text_coherent,retweet_count_sum,count,text_alphanum,text_stm
0,WordPress.com,CHANNEL MIGRANT CRISIS – TODAYS VIDEOS FROM DO...,en,1284639846930227200,2020-07-19 00:03:01+00:00,1039171425364520960,0,0,0,0,...,False,True,False,False,False,CHANNEL MIGRANT CRISIS TODAYS VIDEOS FROMDOVER.,0,1,channel migrant crisis todays videos fromdover.,channel crisis today video fromdover
1,Twitter Web App,“Chinese immorality [and] eccentricities … are...,en,1284640070855729163,2020-07-19 00:03:55+00:00,153438157,22,1,37,0,...,False,False,True,False,False,Chinese immorality [and] eccentricities are ab...,22,1,chinese immorality and eccentricities are abho...,chinese immorality eccentricity abhorrent arya...
2,Twitter for iPhone,@chrisgregson123 @VeuveK @CharlieHicks90 @Rudy...,en,1284640230499328000,2020-07-19 00:04:33+00:00,503070765,0,0,0,1,...,False,False,False,False,False,O / c Leavers voted for what they believed was...,0,1,o c leavers voted for what they believed was ...,leaver voted believed best england wale howeve...
3,Twitter for iPhone,@SkyNews It never will if uk keeps bring in hu...,en,1284640911788576770,2020-07-19 00:07:15+00:00,1276420769384402944,1,1,3,1,...,False,True,False,True,False,It never will if uk keeps bring in hundreds of...,1,1,it never will if uk keeps bring in hundreds of...,never keep bring hundred asylum seeker giving ...
4,Twitter for iPhone,How many illegal immigrants this week in #Dove...,en,1284641481576402945,2020-07-19 00:09:31+00:00,755084846783950848,0,0,0,0,...,False,False,True,False,False,How many illegal immigrants this week in dover...,0,1,how many illegal immigrants this week in dover...,many illegal week dover


## First, extracting noun phrases

In [3]:
# this cell takes roughly 13h per 100k tweets
event_df = read_event_df(rohingya_url)
from stanza.server import CoreNLPClient

#use "with" so the client stops properly after finished
with CoreNLPClient(annotators=["tokenize,ssplit,pos,parse"], timeout=6000000, memory='8G') as client:
        print('extracting noun phrases...')
        tqdm.pandas()
        # get noun phrases with tregex using get_noun_phrases function
        event_df['noun_phrases'] = event_df['text_coherent'].progress_apply(cand_ex.get_noun_phrases,args=(client,"tokenize,ssplit,pos,parse"))


#len(np_list)

2021-06-23 07:54:07 INFO: Writing properties to tmp file: corenlp_server-e0ed75ae247c4879.props
2021-06-23 07:54:07 INFO: Starting server with command: java -Xmx8G -cp C:\Users\nikodemicek\stanza_corenlp\* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 6000000 -threads 5 -maxCharLength 100000 -quiet False -serverProperties corenlp_server-e0ed75ae247c4879.props -annotators tokenize,ssplit,pos,parse -preload -outputFormat serialized
  0%|                                                                                        | 0/29432 [00:00<?, ?it/s]

loaded 29432 tweets!
extracting noun phrases...


100%|██████████████████████████████████████████████████████████████████████████| 29432/29432 [2:39:32<00:00,  3.07it/s]
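
`cand_ex.get_noun_phrases` is not shown in this notebook; according to the comment in the cell it selects noun phrases from the CoreNLP constituency parse with tregex. As a rough pure-Python illustration of the idea (not the module's actual code), the sketch below reads a bracketed parse string and collects the words under every `NP` node. The function names and the example parse are made up for the illustration.

```python
def parse_bracketed(text):
    """Parse a Penn-Treebank-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words below a node, left to right."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def noun_phrases(node):
    """Collect the text of every NP node, outer phrases before the ones nested in them."""
    label, children = node
    found = []
    if label == "NP":
        found.append(" ".join(leaves(node)))
    for child in children:
        if not isinstance(child, str):
            found.extend(noun_phrases(child))
    return found


# made-up example parse
tree = parse_bracketed(
    "(ROOT (S (NP (DT the) (JJ small) (NN boat)) "
    "(VP (VBD crossed) (NP (DT the) (NN channel)))))"
)
nps = noun_phrases(tree)  # ['the small boat', 'the channel']
```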



In [4]:
np_list = list(event_df['noun_phrases'])
np_list = cand_prep.remove_char(np_list,'@') # remove the @ sign from all mentions
np_list = cand_prep.remove_child_nps(np_list) # remove the sub NPs if they were found in longer NPs in the same tree
pickle_file('rohingya_noun_phrases',np_list)

#np_list = load_pickle("moria_short_noun_phrases")

removing child NP candidates...
Removed 50798 child NP candidates!
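
The two `cand_prep` helpers are imported from a self-written module and their code is not part of this notebook. The sketch below only approximates what the comments describe: it strips a character from every phrase and then drops phrases that occur inside a longer phrase of the same tweet. The real `remove_child_nps` works on noun phrases from the same parse tree, whereas this version uses plain token containment, which is an assumption; all function names here are hypothetical.

```python
def contains_tokens(longer, shorter):
    """True if the token list `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def strip_char(nps_per_tweet, char):
    """Remove `char` from every phrase, e.g. the '@' of mentions."""
    return [[phrase.replace(char, '') for phrase in phrases] for phrases in nps_per_tweet]


def drop_child_nps(nps_per_tweet):
    """Drop phrases that are contained in a longer phrase of the same tweet."""
    cleaned, removed = [], 0
    for phrases in nps_per_tweet:
        tokenised = [phrase.split() for phrase in phrases]
        kept = []
        for i, tokens in enumerate(tokenised):
            is_child = any(
                j != i and len(other) > len(tokens) and contains_tokens(other, tokens)
                for j, other in enumerate(tokenised)
            )
            if is_child:
                removed += 1
            else:
                kept.append(phrases[i])
        cleaned.append(kept)
    return cleaned, removed


# made-up example: one tweet with nested phrases, one tweet with a mention
example = [
    ["the small boat in the channel", "the small boat", "the channel"],
    ["@BBCNews"],
]
cleaned_nps, removed = drop_child_nps(strip_char(example, '@'))
```

Keeping only the longest phrase per nesting chain reduces the number of candidates that have to be tagged and merged later.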


In [8]:
# this cell takes roughly 13h per 100k tweets
event_df = read_event_df(tigray_url)
from stanza.server import CoreNLPClient

#use "with" so the client stops properly after finished
with CoreNLPClient(annotators=["tokenize,ssplit,pos,parse"], timeout=6000000, memory='8G') as client:
        print('extracting noun phrases...')
        tqdm.pandas()
        # get noun phrases with tregex using get_noun_phrases function
        event_df['noun_phrases'] = event_df['text_coherent'].progress_apply(cand_ex.get_noun_phrases,args=(client,"tokenize,ssplit,pos,parse"))


#len(np_list)

2021-06-23 18:07:39 INFO: Writing properties to tmp file: corenlp_server-ea054eb6faf04dcb.props
2021-06-23 18:07:39 INFO: Starting server with command: java -Xmx8G -cp C:\Users\nikodemicek\stanza_corenlp\* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 6000000 -threads 5 -maxCharLength 100000 -quiet False -serverProperties corenlp_server-ea054eb6faf04dcb.props -annotators tokenize,ssplit,pos,parse -preload -outputFormat serialized
  0%|                                                                                        | 0/42853 [00:00<?, ?it/s]

loaded 42853 tweets!
extracting noun phrases...


100%|██████████████████████████████████████████████████████████████████████████| 42853/42853 [6:53:56<00:00,  1.73it/s]


In [9]:
np_list = list(event_df['noun_phrases'])
np_list = cand_prep.remove_char(np_list,'@') # remove the @ sign from all mentions
np_list = cand_prep.remove_child_nps(np_list) # remove the sub NPs if they were found in longer NPs in the same tree
pickle_file('tigray_noun_phrases',np_list)

removing child NP candidates...
Removed 78471 child NP candidates!


## Tag tweets with the stanza module to get the NER and POS tags of each tweet. The pipeline is configured with a large NER batch size (`ner_batch_size=4096`) to speed things up.

In [3]:
#
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ needed when running first time ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#stanza.download("en")
#stanza.install_corenlp()

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# loading the pipeline
en_nlp = stanza.Pipeline("en",  
                         tokenize_pretokenized=False,
                         ner_batch_size=4096,
                         processors = "tokenize,pos,lemma,depparse,ner")

2021-06-20 22:20:33 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| lemma     | combined  |
| depparse  | combined  |
| ner       | ontonotes |

2021-06-20 22:20:33 INFO: Use device: cpu
2021-06-20 22:20:33 INFO: Loading: tokenize
2021-06-20 22:20:33 INFO: Loading: pos
2021-06-20 22:20:33 INFO: Loading: lemma
2021-06-20 22:20:33 INFO: Loading: depparse
2021-06-20 22:20:34 INFO: Loading: ner
2021-06-20 22:20:36 INFO: Done loading processors!


In [None]:

event_df = read_event_df(tigray_url)
event_tagged_tweets = [en_nlp(tweet) for tweet in tqdm(list(event_df['text_coherent']))]
pickle_file('tigray_tagged_tweets',event_tagged_tweets)
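
The tagging cells call the pipeline once per tweet. A chunked variant could look roughly like the sketch below, which only shows the chunking logic in plain Python. `tag_batch` is a hypothetical callable that tags a whole chunk at once; how to hand a chunk to stanza in a single call depends on the installed version and is not shown here. `fake_tagger` is a stand-in so that the example runs without any model.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def tag_in_batches(texts, tag_batch, batch_size):
    """Apply `tag_batch` chunk by chunk and keep the results in the input order."""
    tagged = []
    for chunk in batches(texts, batch_size):
        tagged.extend(tag_batch(chunk))
    return tagged


def fake_tagger(chunk):
    # stand-in for a real tagger: returns one result per input text
    return [text.upper() for text in chunk]


tweets = [f"tweet {i}" for i in range(10)]  # made-up data
sizes = [len(chunk) for chunk in batches(tweets, 4)]
tagged = tag_in_batches(tweets, fake_tagger, batch_size=4)
```

Chunking also makes it possible to pickle intermediate results after every chunk, so a long run does not have to restart from the first tweet after a crash.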

In [8]:
event_df = read_event_df(rohingya_url)
event_tagged_tweets = [en_nlp(tweet_batch) for tweet_batch in tqdm(list(event_df['text_coherent']))]
pickle_file('rohingya_tagged_tweets',event_tagged_tweets)

100%|██████████████████████████████████████████████████████████████████████████| 22966/22966 [4:58:17<00:00,  1.28it/s]


In [9]:
event_df = read_event_df(greece_url)
event_tagged_tweets = [en_nlp(tweet_batch) for tweet_batch in tqdm(list(event_df['text_coherent']))]
pickle_file('greece_tagged_tweets',event_tagged_tweets)

In [None]:
event_df = read_event_df(channel_url)
event_tagged_tweets = [en_nlp(tweet) for tweet in tqdm(list(event_df['text_coherent']))]
pickle_file('channel_tagged_tweets',event_tagged_tweets)

In [3]:
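# the channel tags are stored in two files; each variable is named after the part it holds
channel_part2 = load_pickle('channel_tagged_tweets2')
channel_part1 = load_pickle('channel_tagged_tweets1.pickle')

In [5]:
# concatenate part 1 first, then part 2
channel_tags = channel_part1 + channel_part2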
len(channel_tags)

173758

In [9]:
pickle_file('channel_tagged_tweets',channel_tags)

## Pipeline for candidate identification

**Necessary files:**
 - event_np_list = pickled file of list of noun phrases
 - event_tagged_tweets = pickled file with NER and POS tags for all tweets

In [2]:
def load_event_data(event_name):
    assert event_name in ['greece','tigray','rohingya','moria','channel'], f"Oh no! We do not analyze {event_name} event"
    
    print(f'Loading {event_name} data...')
    try:
        #sample = 100
        
        event_np_list = load_pickle(event_name + '_noun_phrases')#[:sample]
        event_tagged_tweets = load_pickle(event_name + '_tagged_tweets')#[:sample]
        return event_np_list,event_tagged_tweets
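    except FileNotFoundError:
        print(f'The {event_name} files were not found! First extract noun phrases and tag the tweets of {event_name}_df')
        # re-raise: callers unpack two values, so returning None would only fail later with a less clear error
        raise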


In [7]:
def pipeline(event_name):

    ####~~~~~~~~~~~~~~~~~~~~~ 1. LOAD THE DATA ~~~~~~~~~~~~~~~~~~~~~
    event_np_list,event_tagged_tweets = load_event_data(event_name)
    ####  ~~~~~~~~~~~~~~~~~~~~~ 2. GET POS AND NER TAGS ~~~~~~~~~~~~~~~~~~~~~
    # get list of tuples (POS-tags of each word, NER-tags of each named entity)
    tweet_tags = cand_prep.get_tweet_tags(event_tagged_tweets) 

    
    ####  ~~~~~~~~~~~~~~~~~~~~~ 3. PREPROCESS CANDIDATES ~~~~~~~~~~~~~~~~~~~~~
    # ~~~~~~~~~~~~ processing of noun phrases ~~~~~~~~~~~~~~~~~~~~~
    print(f'Processing {event_name} noun phrase candidates...')
    
    event_np_list = [['no_candidate'] if len(noun_ps)==0 else noun_ps for noun_ps in event_np_list ]
    
    #print(event_np_list)
    print(f'Tagging {event_name} noun phrase candidates...')
    #tag all tweets and save them in a list    
    
    # loading the pipeline - for candidates use flag tokenize_pretokenized
    en_nlp = stanza.Pipeline("en",  
                             tokenize_pretokenized=True,
                             ner_batch_size=4096,
                             processors = "tokenize,pos,lemma,depparse,ner",
                             verbose=False)
    
    # reuse the cached tagged candidates of this event if they exist, otherwise tag them now (slow)
    try:
        tagged_np_cands = load_pickle(event_name + '_tagged_cands')
    except FileNotFoundError:
        tagged_np_cands = [en_nlp('\n\n'.join(tweet_batch)) for tweet_batch in tqdm(event_np_list)]
        pickle_file(event_name + '_tagged_cands', tagged_np_cands)
    
    np_cand_heads = [cand_prep.get_cand_heads(tweet_cands) for tweet_cands in tqdm(tagged_np_cands)]

    np_and_cand_list = cand_prep.get_cand_type(event_np_list,np_cand_heads, tweet_tags)
    #print(event_np_list)
          
          
    # ~~~~~~~~~~~~~~~~~~~~ combining candidate lists ~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # unpack the per-tweet lists of noun phrase candidates into one flat list
    nps_cands = [cand for cands in np_and_cand_list for cand in cands]

    # only noun phrase candidates are used here (coreference candidates are not included)
    candidate_list = nps_cands
          
    nps_tagged = [sent for tagged_cand in tagged_np_cands for sent in tagged_cand.sentences ]

    all_cands_tagged = nps_tagged
    
        
    #print(len(candidate_list),'vs', len(all_cands_tagged))
    cand_df = pd.DataFrame(
        {'candidates': candidate_list,
         'cand_tags': all_cands_tagged
        })
    
    
    cand_df['cand_text'] = cand_df.candidates.apply(lambda x: x[0])
    cand_df['cand_head'] = cand_df.candidates.apply(lambda x: x[1])
    cand_df['cand_type'] = cand_df.candidates.apply(lambda x: x[2])
    cand_df['cand_len'] = cand_df.cand_text.apply(lambda x: len(x.split()))


    count_cands = Counter(cand_df['candidates'])
    cand_df['cand_freq'] = cand_df["candidates"].map(count_cands)
    
    #count_cands[cand_df['cand_text']]
    #count_sorted = sorted(count_cands.items(),key=lambda x: x[1],reverse=True)
    cand_df.columns = cand_df.columns.str.strip()
    
          
    # we sort the candidates by their frequency (most frequent first)
    cand_df.sort_values('cand_freq', ascending=False,inplace=True)
    cand_df.reset_index(drop=True, inplace = True)
    #remove dummy candidates that were used to avoid errors

    
    cand_df = cand_df[cand_df.cand_text != 'candidate_to_be_removed']
    cand_df = cand_df[cand_df.cand_text != 'no_candidate']
    print(len(cand_df))    
    cand_df.reset_index(drop=True,inplace=True)
      
    return cand_df
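
The bookkeeping at the end of `pipeline` (flattening the per-tweet candidates, counting how often each candidate tuple occurs, sorting by frequency and dropping the dummy entries) can be sketched without pandas as follows. The candidate tuples and the `misc` type label are made-up sample values; the actual tuples come from `cand_prep.get_cand_type`, which is not shown in this notebook.

```python
from collections import Counter

# made-up sample: one list of (text, head, type) tuples per tweet
np_and_cand_list = [
    [("refugee camp", "camp", "misc"), ("the boat", "boat", "misc")],
    [("no_candidate", "no_candidate", "misc")],  # placeholder for a tweet without NPs
    [("refugee camp", "camp", "misc")],
]
DUMMIES = {"no_candidate", "candidate_to_be_removed"}

# flatten the per-tweet lists into one list of candidates
flat = [cand for cands in np_and_cand_list for cand in cands]

# count identical candidate tuples
freq = Counter(flat)

rows = [
    {
        "cand_text": text,
        "cand_head": head,
        "cand_type": cand_type,
        "cand_len": len(text.split()),            # length in tokens
        "cand_freq": freq[(text, head, cand_type)],
    }
    for (text, head, cand_type) in flat
]

# most frequent candidates first; list.sort is stable, so ties keep their original order
rows.sort(key=lambda row: row["cand_freq"], reverse=True)

# remove the placeholders that were only added to keep the lists aligned
rows = [row for row in rows if row["cand_text"] not in DUMMIES]
```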


### Candidates identified by the stanza library still contain a lot of noise. Cleaner candidates merge better, and discarding duplicate candidates and candidates without useful information speeds up merging.

In [8]:
#Finally the candidates are cleaned before storing in a file prior to merging

from nltk.corpus import stopwords

def clean_cands(event_cands):
    """
    Clean the candidates and drop the ones that carry no useful information:
     1. lowercase the candidate text and the candidate head and keep only alphanumeric chars
     2. remove candidates that are stopwords
     3. remove candidates that are only numeric
     4. remove candidates that are empty or only 1 char long
     5. drop duplicate candidates (same cand_text and cand_type)
     """

    #stopwords
    tqdm.pandas()
    event_cands_clean = event_cands.copy()
        
    event_cands_clean['cand_text'] = event_cands_clean['cand_text'].progress_apply(lambda x:re.sub(r'[^A-Za-z0-9 ]+', '', x.lower()).strip())
    event_cands_clean['cand_head'] = event_cands_clean['cand_head'].progress_apply(lambda x:re.sub(r'[^A-Za-z0-9]+', '', x.lower()).strip())

    event_cands_clean = event_cands_clean[~event_cands_clean['cand_text'].isin(stopwords.words('english'))]
    event_cands_clean['pure_chars'] = event_cands_clean['cand_text'].progress_apply(lambda x: x.replace(' ', ''))
    event_cands_clean = event_cands_clean[~event_cands_clean['pure_chars'].str.isnumeric()]
    event_cands_clean.drop('pure_chars',axis=1,inplace=True)
    
    event_cands_clean['string_len'] = event_cands_clean['cand_text'].progress_apply(len)
    event_cands_clean = event_cands_clean[event_cands_clean['string_len']>1]
    event_cands_clean = event_cands_clean.drop_duplicates(subset = ["cand_text","cand_type"])
    event_cands_clean.reset_index(drop=True, inplace=True)
    print(f'The event has  {len(event_cands_clean)} unique candidates after cleaning')
    return event_cands_clean
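
To make the effect of the individual filters easier to follow, here is a small standalone sketch that applies the same rules to a plain list of `(text, type)` pairs. It uses a tiny stand-in stopword set instead of the NLTK list, and the sample candidates and type labels are made up.

```python
import re

STOPWORDS = {"the", "it", "they"}  # tiny stand-in for stopwords.words('english')


def clean_candidates(cands):
    """Apply the cleaning rules to (text, type) pairs and keep the first of each duplicate."""
    seen, cleaned = set(), []
    for text, cand_type in cands:
        # lowercase and keep only letters, digits and spaces
        text = re.sub(r'[^A-Za-z0-9 ]+', '', text.lower()).strip()
        if text in STOPWORDS:
            continue  # candidate is a stopword
        if text.replace(' ', '').isnumeric():
            continue  # candidate consists of digits only
        if len(text) <= 1:
            continue  # empty or single-character candidate
        if (text, cand_type) in seen:
            continue  # duplicate of an earlier candidate
        seen.add((text, cand_type))
        cleaned.append((text, cand_type))
    return cleaned


sample = [
    ("The Refugees!", "person"),
    ("the refugees", "person"),
    ("They", "person"),
    ("2 000", "misc"),
    ("#", "misc"),
    ("x", "misc"),
    ("Dover", "loc"),
]
cleaned = clean_candidates(sample)
```

Only `the refugees` and `dover` survive here: the second spelling of the refugees candidate is a duplicate after normalisation, and the remaining entries are a stopword, a number, an empty string and a single character.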


In [21]:
#run the pipeline for tigray
event_cands = pipeline('tigray')
event_cands_clean = clean_cands(event_cands)
pickle_file('tigray_cands', event_cands_clean)

Loading tigray data...


100%|█████████████████████████████████████████████████████████████████████████| 42853/42853 [00:01<00:00, 26712.11it/s]


Processing tigray noun phrase candidates...
Tagging tigray noun phrase candidates...


100%|██████████████████████████████████████████████████████████████████████████| 42853/42853 [4:42:02<00:00,  2.53it/s]
100%|█████████████████████████████████████████████████████████████████████████| 42853/42853 [00:02<00:00, 20607.46it/s]
100%|████████████████████████████████████████████████████████████████████████████| 42853/42853 [10:07<00:00, 70.51it/s]
  0%|                                                                                       | 0/386607 [00:00<?, ?it/s]

386607


100%|██████████████████████████████████████████████████████████████████████| 386607/386607 [00:01<00:00, 343355.60it/s]
100%|██████████████████████████████████████████████████████████████████████| 386607/386607 [00:01<00:00, 357068.81it/s]
100%|██████████████████████████████████████████████████████████████████████| 340063/340063 [00:00<00:00, 815415.71it/s]
100%|██████████████████████████████████████████████████████████████████████| 336875/336875 [00:00<00:00, 891096.30it/s]


The event has  105841 unique candidates after cleaning


In [22]:
#run the pipeline for rohingya
event_cands = pipeline('rohingya')
event_cands_clean = clean_cands(event_cands)
pickle_file('rohingya_cands', event_cands_clean)

Loading rohingya data...


100%|█████████████████████████████████████████████████████████████████████████| 29432/29432 [00:00<00:00, 32846.50it/s]


Processing rohingya noun phrase candidates...
Tagging rohingya noun phrase candidates...


100%|██████████████████████████████████████████████████████████████████████████| 29432/29432 [3:04:03<00:00,  2.67it/s]
100%|█████████████████████████████████████████████████████████████████████████| 29432/29432 [00:01<00:00, 29140.69it/s]
100%|████████████████████████████████████████████████████████████████████████████| 29432/29432 [05:18<00:00, 92.55it/s]


237318


100%|██████████████████████████████████████████████████████████████████████| 237318/237318 [00:00<00:00, 345833.61it/s]
100%|██████████████████████████████████████████████████████████████████████| 237318/237318 [00:00<00:00, 346957.46it/s]
100%|██████████████████████████████████████████████████████████████████████| 212903/212903 [00:00<00:00, 791350.69it/s]
100%|██████████████████████████████████████████████████████████████████████| 210592/210592 [00:00<00:00, 806872.79it/s]


The event has  67008 unique candidates after cleaning


In [27]:
#run the pipeline for greece
event_cands = pipeline('greece')
event_cands_clean = clean_cands(event_cands)
pickle_file('greece_cands', event_cands_clean)

100%|████████████████████████████████████████████████████████████████████| 1191823/1191823 [00:05<00:00, 201997.66it/s]
100%|████████████████████████████████████████████████████████████████████| 1191823/1191823 [00:04<00:00, 242652.50it/s]
100%|████████████████████████████████████████████████████████████████████| 1004413/1004413 [00:02<00:00, 444368.56it/s]
100%|██████████████████████████████████████████████████████████████████████| 995833/995833 [00:01<00:00, 787400.01it/s]


The event has  291113 unique candidates after cleaning


In [None]:
#run the pipeline for channel
event_cands = pipeline('channel')
event_cands_clean = clean_cands(event_cands)
pickle_file('channel_cands', event_cands_clean)

Loading channel data...
