# Discourse Similarity Clustering Analyses for Climate Tweets

This notebook constructs, tests, and explains the Discourse Similarity Clustering analyses used in the research project on how climate tweets differ between expert and general-population users. The method has the following steps:

1. **Prepare Texts**: 
    
    Each user's tweets are concatenated into a single text, and all tokens containing numbers, links, or @-mentions are dropped. 

2. **Similarity Scores**: 
    
    For each pair of users, their combined texts are compared and a similarity score is calculated. This score can be calculated in many ways, but here we first use "cosine similarity" on TF-IDF word vectors, a standard similarity measure in NLP. We then perform a robustness check by first using a sentence transformer to extract key phrases and then computing cosine similarity on the phrase lists of all pairs of users. 

3. **Graph Clustering**: 

    With the similarity scores between each user pair, we can create a Similarity Graph where each dot represents a user and each link between two dots indicates that the pair of users have similar tweets. 
    
    The strength of each link can represent the level of similarity, and a cut-off can be applied to keep only the stronger links (a more rigorous approach would be to use the Bayesian Information Criterion, but a common heuristic is to keep $ \frac{N^2}{6} $ links in a graph of $N$ nodes). 
    
    Once the graph is created, we can use graph partitioning algorithms to see if there are structurally separated communities, i.e., clusters of users whose tweets are more similar to each other than to those of other users. By checking whether the communities differ in characteristics such as opinions, we can see if the population is segregated into "echo chambers". 
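
As a rough illustration of the cut-off heuristic in step 3, the sketch below keeps only the strongest $N^2/6$ links of a small similarity matrix. The function name `strongest_links` and the toy matrix are assumptions made for this example; the clustering itself is done in R in this project.

```python
def strongest_links(sim, n_links=None):
    """Return the n_links strongest user pairs as (i, j, score) tuples.

    sim is a symmetric N x N similarity matrix given as nested lists.
    If n_links is None, the N^2 / 6 heuristic is used (rounded down).
    """
    n = len(sim)
    if n_links is None:
        n_links = n * n // 6
    # list each unordered pair once (upper triangle, no self-links)
    pairs = [(i, j, sim[i][j]) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs[:n_links]

# hypothetical similarity scores for four users
toy_sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
links = strongest_links(toy_sim)
print(links)
```

With four users the heuristic keeps 2 of the 6 possible links, here the links joining users 0 and 1 and users 2 and 3, which already suggests two separate communities.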

In [1]:
# initialisation
import pandas as pd

# read in file and reformat
# if there's an error when run on a windows machine, try changing the encoding parameter
Raw_Tweets = pd.read_csv("data/anonhayhoetw.csv", encoding='mac_roman',usecols=['tweet_id','author_id','text2'])
Raw_Tweets.columns = ['tweet_id','author_id','text']

# alternatively, if the text has already been cleaned, load the cleaned column instead
# (uncomment the line below and skip the "Text Cleaning" chapter):
# Raw_Tweets = pd.read_csv("data/anonhayhoetw.csv", usecols=['tweet_id', 'author_id', 'clean_text'])


In [2]:
# since there are lots of texts to process, we execute some code through multiprocessing using the ray package
import ray
# change this to the number of cores your computer has
ray.init(num_cpus=8) 

2022-12-19 18:51:30,182	INFO worker.py:1518 -- Started a local Ray instance.


Python version: 3.10.7
Ray version: 2.0.0


# Text Cleaning

Skip to the next chapter if you have already done this!

In [None]:
# function for cleaning texts
def clean_text(text):
    words = str(text).replace('\n',' ').replace('\r',' ').replace(r"\\",'').split(' ')
    words = [
        word for word in words
        if not any(ch.isdigit() for ch in word)
        and '@' not in word
        # we keep the # because hashtags are meaningful tokens
        and 'http' not in word
    ]
    return ' '.join(words)

# one example
clean_text(Raw_Tweets.text.values[3])
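
To see what the filter removes, here is the same function applied to a made-up tweet. The example text is hypothetical and not taken from the dataset; the function body repeats the one above so that the snippet runs on its own.

```python
def clean_text(text):
    # flatten line breaks, then split on single spaces
    words = str(text).replace('\n', ' ').replace('\r', ' ').replace(r"\\", '').split(' ')
    words = [
        word for word in words
        if not any(ch.isdigit() for ch in word)  # drop tokens containing digits
        and '@' not in word                      # drop mentions
        and 'http' not in word                   # drop links
    ]
    return ' '.join(words)

example = "Check https://t.co/abc @user 2022 was hot #climate\nact now"
cleaned = clean_text(example)
print(cleaned)
# → Check was hot #climate act now
```

The hashtag survives, while the link, the mention, and the year are removed as whole tokens.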

In [None]:
# a remote wrapper function for bundled execution
@ray.remote
def ray_clean_texts(bdl_range, bdl_texts):
    output = []
    start, end = bdl_range
    for i in range(start,end):
        text = bdl_texts[i-start]
        text = clean_text(str(text))
        output.append([i, text])
    return output

# ray bundles
# choose the size so that the number of bundles is a multiple of num_cpus
# e.g.: I initiated 8 CPUs, so I want 8, 16, 24, etc. bundles
BDL_SIZE = 48_000
bundles = [(x*BDL_SIZE, (x+1)*BDL_SIZE) for x in range(len(Raw_Tweets)//BDL_SIZE)]
bundles.append((bundles[-1][1], len(Raw_Tweets)))
print(f"We have {len(bundles)} bundles for ray")
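
The bundle construction above assumes there is at least one full bundle: with fewer rows than `BDL_SIZE`, `bundles[-1]` raises an `IndexError`, and when the row count is an exact multiple of `BDL_SIZE` the last bundle is empty. One possible way to cover both cases is sketched below; `make_bundles` is a hypothetical helper, not part of the original notebook.

```python
def make_bundles(n_items, bundle_size):
    """Split range(n_items) into consecutive (start, end) index pairs."""
    return [
        (start, min(start + bundle_size, n_items))
        for start in range(0, n_items, bundle_size)
    ]

print(make_bundles(10, 4))  # three bundles, the last one shorter
print(make_bundles(8, 4))   # exact multiple: no empty trailing bundle
print(make_bundles(3, 4))   # fewer items than one bundle
```

The resulting list can be used in place of `bundles` without changing the rest of the cell.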

In [None]:
# execute on ray
all_texts = Raw_Tweets.text.values

print('Sending bundled tasks to ray')
ray_handles = []
for bdl_range in bundles:
    start, end = bdl_range
    bdl_texts = all_texts[start:end]
    ray_handles.append(ray_clean_texts.remote(bdl_range, bdl_texts))

done_handles = []
while len(ray_handles):
    done, ray_handles = ray.wait(ray_handles)
    done_handles.append(done)
    print(len(ray_handles)+1, end = " ")

print('')
print('Getting from ray to df')

del all_texts
# fetch every finished bundle, so none is dropped when the bundle count is not a multiple of 8
rows = []
for i, handle in enumerate(done_handles):
    # each entry of done_handles is a one-element list of object refs
    rows.extend(ray.get(handle)[0])
    print(len(done_handles)-i, end=' ')
results = pd.DataFrame(rows, columns=[0, 1])

print('')
print('Sorting result df and re-merge')
# ray finishes tasks in a shuffled order, so the results must be re-aligned by row index
results = results.sort_values(0)
Raw_Tweets['clean_text'] = results[1].values

print('Done!')
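
Each output row carries its original index so that results arriving out of order can be put back in place. The sketch below shows the same idea using the standard library's `concurrent.futures` as a stand-in for ray (an assumption made so the example runs without a ray cluster; the notebook itself uses `ray.wait` and `ray.get`). The function `process_bundle` and the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_bundle(bdl_range, bdl_texts):
    """Stand-in for the remote task: return [row_index, processed_text] rows."""
    start, end = bdl_range
    return [[i, bdl_texts[i - start].upper()] for i in range(start, end)]

texts = ["a", "b", "c", "d", "e"]
bundles = [(0, 2), (2, 4), (4, 5)]

rows = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [
        pool.submit(process_bundle, rng, texts[rng[0]:rng[1]])
        for rng in bundles
    ]
    # as_completed yields futures in completion order, not submission order
    for future in as_completed(futures):
        rows.extend(future.result())

# sorting on the carried row index restores the original order
rows.sort(key=lambda row: row[0])
ordered = [text for _, text in rows]
print(ordered)
```

Whatever order the bundles finish in, the final list lines up with the input rows again.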

In [None]:
# join up the tweets of each user
User_Tweets = pd.DataFrame(columns=['author_id', 'text'])
User_Tweets.author_id = list(set(Raw_Tweets.author_id.values.astype(int)))
User_Tweets = User_Tweets[~User_Tweets.author_id.astype(str).isin(['nan','0'])]
User_Tweets.set_index('author_id', inplace=True)

for user in User_Tweets.index.values:
    User_Tweets.loc[user,'text'] = ' '.join([
        str(text) for text in
        Raw_Tweets.clean_text.values[Raw_Tweets.author_id == user]
    ])
    if user % 100 == 0: print(user, end = ' ')

User_Tweets.to_csv("results/user_tweets.csv")
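
The loop above rescans every tweet once per user, which can become slow on a large dataset. A single-pass grouping is one possible alternative; the sketch below uses plain Python with made-up rows, and `join_user_tweets` is a hypothetical helper rather than part of the notebook.

```python
from collections import defaultdict

def join_user_tweets(author_ids, texts):
    """Concatenate each author's tweets in a single pass over the rows."""
    grouped = defaultdict(list)
    for author, text in zip(author_ids, texts):
        grouped[author].append(str(text))
    return {author: " ".join(parts) for author, parts in grouped.items()}

# hypothetical rows: author_id and clean_text columns
authors = [11, 22, 11, 33, 22]
tweets = ["warming is real", "nice weather", "sea levels rise", "#climate", "so cold today"]
joined = join_user_tweets(authors, tweets)
print(joined)
```

Tweets stay in their original order within each user, as in the loop version.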

# Method 1: Cosine Similarity

In this method, a cosine similarity is computed directly between the TF-IDF vectors of each user's combined tweets, and a graph is constructed from the resulting similarity matrix. Clustering analysis on the graph is then done in R, since it is easier there.
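
To make the computation concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is a simplification under stated assumptions: whitespace tokenisation, no stop-word removal, and one common smoothed form of the inverse document frequency. scikit-learn's `TfidfVectorizer`, used in the cell below, differs in details such as tokenisation. The helper names and the three example texts are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build L2-normalised TF-IDF vectors (as dicts) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # number of documents each token appears in
    doc_freq = Counter(token for tokens in tokenised for token in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # smoothed inverse document frequency: rarer words get larger weights
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

users = [
    "climate warming emissions",
    "climate warming policy",
    "football match tonight",
]
vecs = tfidf_vectors(users)
print(round(cosine(vecs[0], vecs[1]), 3))  # partial overlap: between 0 and 1
print(cosine(vecs[0], vecs[2]))            # no shared words: 0
```

Users with no words in common score 0, identical texts score 1, and shared rare words contribute more than shared common ones.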

In [None]:
# read in file if starting midway after cleaning is already done
User_Tweets = pd.read_csv("results/user_tweets.csv")

In [None]:
# cosine similarity between each user
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# create user-word matrix (TF-IDF weights)
all_tweets = User_Tweets.text.values
tweet_word_matrix = TfidfVectorizer(stop_words='english').fit_transform(all_tweets)
# should be (number of users) x (number of distinct tokens)
print(tweet_word_matrix.shape)

# get cosine similarity
sim_matrix = cosine_similarity(tweet_word_matrix, tweet_word_matrix)
# should be square: (number of users) x (number of users)
print(sim_matrix.shape)

pd.DataFrame(sim_matrix).to_csv('private/cosine_sim_matrix.csv')

# Method 2: Key Phrase Similarity

In this alternative method, key phrases are first extracted from each user's combined tweets using KeyBERT with a pre-trained sentence transformer. A user-level pairwise cosine similarity is then computed on TF-IDF vectors of the extracted phrases, so two users score highly when their phrase lists share many words. (This is related to, but not the same as, "the ratio between the intersection and union of the two lists of items", which is the Jaccard index.) A graph is then constructed on these similarity scores and the clustering analyses are performed in R. 

Compared to Method 1, this method adds a layer of "phrases" instead of using the entire corpus. It takes much longer to run: hours instead of seconds on a sample of 1000 users on my MacBook Pro M1. In my personal experience, the two methods should produce similar results, so I am including this one as a robustness check. 
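
To make the difference between overlap measures concrete, the sketch below compares the intersection-over-union ratio (the Jaccard index) with a cosine similarity computed on the same two phrase lists treated as sets. Both functions and the phrase lists are hypothetical illustrations; the cells below use TF-IDF cosine similarity from scikit-learn instead.

```python
import math

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def binary_cosine(a, b):
    """Cosine similarity between two sets viewed as 0/1 vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

kw_a = ["climate change", "sea level", "carbon tax"]
kw_b = ["sea level", "carbon tax", "heat wave"]
print(jaccard(kw_a, kw_b))        # 2 shared / 4 distinct phrases
print(binary_cosine(kw_a, kw_b))  # 2 shared / sqrt(3 * 3)
```

The two scores rank pairs similarly but are not equal, so it matters which one a description refers to.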

In [3]:
# read in file if starting midway after cleaning is already done
User_Tweets = pd.read_csv("results/user_tweets.csv")

In [4]:
# a sentence transformer model for keyword extractions
from keybert import KeyBERT
keyword_model = KeyBERT(model='all-MiniLM-L6-v2')

# function for top keywords
def user_keywords(their_tweets, cutoff):
    their_keywords = keyword_model.extract_keywords(their_tweets, keyphrase_ngram_range = (1, 3), top_n = 20)
    # naming the columns at construction also works if no keywords are returned
    their_keywords = pd.DataFrame(their_keywords, columns=['kw','score'])
    their_keywords = their_keywords.kw.values[their_keywords.score > cutoff]
    return ' '.join(their_keywords)

In [5]:
# ray function for speed
all_tweets = User_Tweets.text.values

@ray.remote
def ray_user_kw(bdl_range, bdl_stories):
    output = []
    start, end = bdl_range
    for i in range(start,end):
        tweets = bdl_stories[i-start]
        output.append([i, user_keywords(tweets, cutoff = 0.3)])
    return output

# ray bundles
BDL_SIZE = 14
bundles = [(x*BDL_SIZE, (x+1)*BDL_SIZE) for x in range(len(all_tweets)//BDL_SIZE)]
bundles.append((bundles[-1][1], len(all_tweets)))
print(f"We have {len(bundles)} bundles for ray")

We have 72 bundles for ray


In [6]:
# execute on ray
print('Sending bundled tasks to ray')
bdl_handles = []
for i, bdl_range in enumerate(bundles):
    start, end = bdl_range
    bdl_stories = all_tweets[start:end]
    bdl_handles.append(ray_user_kw.remote(bdl_range, bdl_stories))
    print(len(bdl_handles), end = " ")

print('')
print('Getting from ray to df')
done_bundles = []
while len(bdl_handles):
    done, bdl_handles = ray.wait(bdl_handles)
    done_bundles.append(ray.get(done))
    print(len(bdl_handles)+1, end = " ")

print('')
print('Sorting result df and remerging...', end = ' ')
all_keywords = [row for bdl in done_bundles for row in bdl[0]]
all_keywords = pd.DataFrame(all_keywords)
all_keywords.columns = ['i', 'keyword']
all_keywords = all_keywords.sort_values('i').reset_index(drop=True)
User_Tweets['keywords'] = all_keywords.keyword.values
print('Done')

Sending bundled tasks to ray
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 
Getting from ray to df
72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 
Sorting result df and remerging... Done


In [9]:
User_Tweets.to_csv("results/user_tweets.csv")

In [11]:
# cosine similarity between each user
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# create user-keyword matrix (TF-IDF weights)
all_tweets = User_Tweets.keywords.values
tweet_word_matrix = TfidfVectorizer(stop_words='english').fit_transform(all_tweets)
# should be (number of users) x (number of distinct tokens)
print(tweet_word_matrix.shape)

# get cosine similarity
sim_matrix = cosine_similarity(tweet_word_matrix, tweet_word_matrix)
# should be square: (number of users) x (number of users)
print(sim_matrix.shape)

pd.DataFrame(sim_matrix).to_csv('private/phrasic_sim_matrix.csv')

(1000, 7348)
(1000, 1000)
