## Month-on-Month Topic Modelling using LDA

### Data Preparation

In [15]:
import pandas as pd

In [16]:
# Checking the current working directory, where the .csv files are read from and saved to
import os
print(os.getcwd())

C:\Users\Utilizador\AppData\Local\Programs\Microsoft VS Code


In [17]:
# Checking date columns of the preprocessed .csv

  # 1) Reading both .csv files
df_r = pd.read_csv('df_r.csv')
df_w = pd.read_csv('df_w.csv')

  # 2) Getting the unique date values from each dataset
months_df_r = df_r['month'].value_counts().sort_index()
months_df_w = df_w['month'].value_counts().sort_index()

  # 3) Printing results
print("Unique months and respective counts in Russian tweets dataset:")
print(months_df_r)

print("\nUnique months and respective counts in Western tweets dataset:")
print(months_df_w)

Unique months and respective counts in Russian tweets dataset:
2    1612
3    8960
4    9722
5    2308
Name: month, dtype: int64

Unique months and respective counts in Western tweets dataset:
2    1780
3    7501
4    6179
5    1267
Name: month, dtype: int64


Both datasets cover the same four months: 2 (February), 3 (March), 4 (April), and 5 (May). We will use these four months for the monthly LDA comparisons.
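
The overlap can also be checked programmatically rather than by reading the two printouts. The snippet below is a minimal sketch on toy data: `toy_r` and `toy_w` are hypothetical stand-ins for `df_r` and `df_w`, and only the `month` column name is taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for df_r and df_w (hypothetical values, not the real tweet data).
toy_r = pd.DataFrame({"month": [2, 3, 3, 4, 5, 6]})
toy_w = pd.DataFrame({"month": [1, 2, 3, 4, 4, 5]})

# Months that appear in both datasets, in ascending order.
months_r = set(toy_r["month"].unique().tolist())
months_w = set(toy_w["month"].unique().tolist())
common_months = sorted(months_r & months_w)

print(common_months)
```

Months that exist in only one dataset (1 and 6 in the toy data) drop out of the intersection, so only months available for both sides are compared.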

In [18]:
# Splitting each dataset into four new dataframes, one per month of the tweets

df_r_2 = df_r[df_r['month'] == 2]
df_r_3 = df_r[df_r['month'] == 3]
df_r_4 = df_r[df_r['month'] == 4]
df_r_5 = df_r[df_r['month'] == 5]

df_w_2 = df_w[df_w['month'] == 2]
df_w_3 = df_w[df_w['month'] == 3]
df_w_4 = df_w[df_w['month'] == 4]
df_w_5 = df_w[df_w['month'] == 5]

# Checking the month values of the new dataframes to confirm if the split was effective
print("Unique months in the Feb Russian tweets dataset:", df_r_2['month'].unique())
print("Unique months in the March Russian tweets dataset:", df_r_3['month'].unique())
print("Unique months in the April Russian tweets dataset:", df_r_4['month'].unique())
print("Unique months in the May Russian tweets dataset:", df_r_5['month'].unique())
print("")
print("Unique months in the Feb Western tweets dataset:", df_w_2['month'].unique())
print("Unique months in the March Western tweets dataset:", df_w_3['month'].unique())
print("Unique months in the April Western tweets dataset:", df_w_4['month'].unique())
print("Unique months in the May Western tweets dataset:", df_w_5['month'].unique())

Unique months in the Feb Russian tweets dataset: [2]
Unique months in the March Russian tweets dataset: [3]
Unique months in the April Russian tweets dataset: [4]
Unique months in the May Russian tweets dataset: [5]

Unique months in the Feb Western tweets dataset: [2]
Unique months in the March Western tweets dataset: [3]
Unique months in the April Western tweets dataset: [4]
Unique months in the May Western tweets dataset: [5]


Now that we have confirmed that the split by month worked, we can save each subset as a separate .csv file for the monthly LDA comparisons.
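
For reference, the same split-and-save step can be written as a loop instead of eight hand-written variables. This is only a sketch under assumptions: `toy_r` is a hypothetical stand-in for `df_r`, and the files go to a temporary directory rather than the working directory used in the notebook.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_r (hypothetical values).
toy_r = pd.DataFrame({
    "month": [2, 3, 3, 4, 5],
    "cleaned_tokens": ["a b", "c d", "e f", "g h", "i j"],
})

# One dataframe per month, keyed by the month number.
monthly_r = {int(month): frame for month, frame in toy_r.groupby("month")}

# Save each month under the same naming scheme as the notebook (df_r_<month>.csv),
# but inside a temporary directory so the sketch leaves no files behind.
out_dir = tempfile.mkdtemp()
for month, frame in monthly_r.items():
    frame.to_csv(os.path.join(out_dir, f"df_r_{month}.csv"), index=False)

saved_files = sorted(os.listdir(out_dir))
print(saved_files)
```

Keeping the monthly frames in a dictionary means a fifth month would need no new code, at the cost of accessing them as `monthly_r[3]` instead of `df_r_3`.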

In [19]:
# Saving the split datasets as new .csv files to be easily accessible for further LDA analysis

df_r_2.to_csv('df_r_2.csv', index = False)
df_r_3.to_csv('df_r_3.csv', index = False)
df_r_4.to_csv('df_r_4.csv', index = False)
df_r_5.to_csv('df_r_5.csv', index = False)

df_w_2.to_csv('df_w_2.csv', index = False)
df_w_3.to_csv('df_w_3.csv', index = False)
df_w_4.to_csv('df_w_4.csv', index = False)
df_w_5.to_csv('df_w_5.csv', index = False)

In [20]:
# Checking if the new .csv files were successfully saved

df_r_2 = pd.read_csv('df_r_2.csv')
df_r_3 = pd.read_csv('df_r_3.csv')
df_r_4 = pd.read_csv('df_r_4.csv')
df_r_5 = pd.read_csv('df_r_5.csv')

df_w_2 = pd.read_csv('df_w_2.csv')
df_w_3 = pd.read_csv('df_w_3.csv')
df_w_4 = pd.read_csv('df_w_4.csv')
df_w_5 = pd.read_csv('df_w_5.csv')

# Checking if the number of rows in the new dataframes matches the value counts of each month
print("Number of rows in the Feb Russian tweets dataset:", df_r_2.shape[0])
print("Number of rows in the March Russian tweets dataset:", df_r_3.shape[0])
print("Number of rows in the April Russian tweets dataset:", df_r_4.shape[0])
print("Number of rows in the May Russian tweets dataset:", df_r_5.shape[0])
print("")
print("Number of rows in the Feb Western tweets dataset:", df_w_2.shape[0])
print("Number of rows in the March Western tweets dataset:", df_w_3.shape[0])
print("Number of rows in the April Western tweets dataset:", df_w_4.shape[0])
print("Number of rows in the May Western tweets dataset:", df_w_5.shape[0])

Number of rows in the Feb Russian tweets dataset: 1612
Number of rows in the March Russian tweets dataset: 8960
Number of rows in the April Russian tweets dataset: 9722
Number of rows in the May Russian tweets dataset: 2308

Number of rows in the Feb Western tweets dataset: 1780
Number of rows in the March Western tweets dataset: 7501
Number of rows in the April Western tweets dataset: 6179
Number of rows in the May Western tweets dataset: 1267


The number of rows in each of the new .csv files matches the value counts of each month in the preprocessed dataframes, indicating that the new files are now ready to be reused in the LDA analysis.
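
The comparison between row counts and value counts can also be automated so that a mismatch is flagged rather than spotted by eye. The sketch below uses a hypothetical toy dataframe and an in-memory buffer in place of the real .csv files.

```python
import io

import pandas as pd

# Toy stand-in for one preprocessed dataframe (hypothetical values).
toy = pd.DataFrame({"month": [2, 3, 3, 4, 4, 4, 5]})
expected_counts = toy["month"].value_counts().sort_index()

mismatches = []
for month, expected in expected_counts.items():
    subset = toy[toy["month"] == month]

    # Write the subset to an in-memory CSV and read it back,
    # mimicking the to_csv / read_csv round trip.
    buffer = io.StringIO()
    subset.to_csv(buffer, index=False)
    buffer.seek(0)
    reloaded = pd.read_csv(buffer)

    if len(reloaded) != expected:
        mismatches.append(month)

print("Months with mismatched row counts:", mismatches)
```

An empty list means every reloaded subset has as many rows as its month had in the original dataframe.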

### TF-IDF corpus

To see whether the topics discussed evolved month on month (MoM), we will run LDA topic modelling on each of the newly created dataframes, one per month of tweets. This requires a separate TF-IDF corpus for each month's data. We opted for TF-IDF over raw counts because it down-weights words that appear in most tweets and up-weights the more distinctive ones, which should make the derived insights more reliable.
Since the new .csv files were created from the already preprocessed dataframes, we will use the cleaned_tokens column to build these TF-IDF corpora.
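
One detail worth checking before vectorizing is what type the `cleaned_tokens` entries have after the .csv round trip. The sketch below assumes, hypothetically, that the column held Python lists of tokens before saving; the notebook does not show this, so treat it as an illustration rather than a description of the actual data.

```python
import ast
import io

import pandas as pd

# Assumption: cleaned_tokens originally stores a list of tokens per tweet.
toy = pd.DataFrame({"cleaned_tokens": [["war", "peace"], ["sanction", "gas"]]})

# Round trip through CSV, using an in-memory buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, each entry is the string form of the list, not a list.
first_entry = reloaded["cleaned_tokens"].iloc[0]
print(type(first_entry).__name__, repr(first_entry))

# Option 1: parse the strings back into real lists.
as_lists = reloaded["cleaned_tokens"].apply(ast.literal_eval)

# Option 2: turn each list into a whitespace-separated document,
# which is the kind of input TfidfVectorizer tokenizes by default.
as_documents = as_lists.apply(" ".join).tolist()
print(as_documents)
```

With its default settings, `TfidfVectorizer` expects strings and ignores punctuation when tokenizing, so the string form of a token list still vectorizes. Explicit parsing only matters if later steps need the tokens as actual lists.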

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
# Creating monthly TF-IDF corpora to pass on to the LDA models

  ## 1) Adding the cleaned_tokens entries for each month into a list
cleaned_tokens_r_2 = df_r_2['cleaned_tokens'].tolist()
cleaned_tokens_r_3 = df_r_3['cleaned_tokens'].tolist()
cleaned_tokens_r_4 = df_r_4['cleaned_tokens'].tolist()
cleaned_tokens_r_5 = df_r_5['cleaned_tokens'].tolist()

cleaned_tokens_w_2 = df_w_2['cleaned_tokens'].tolist()
cleaned_tokens_w_3 = df_w_3['cleaned_tokens'].tolist()
cleaned_tokens_w_4 = df_w_4['cleaned_tokens'].tolist()
cleaned_tokens_w_5 = df_w_5['cleaned_tokens'].tolist()

# The result is one list per month with one entry per tweet (row). Because the dataframes were read back
# from .csv, each entry should be a string, which is the input type TfidfVectorizer expects by default

  ## 2) Initializing a separate TF-IDF Vectorizer for each side (Russian vs. Western)
tfidf_vectorizer_r = TfidfVectorizer()
tfidf_vectorizer_w = TfidfVectorizer()

# Aggregating tokens by side, so that we can fit the vectorizers on the aggregated data
cleaned_tokens_r = (cleaned_tokens_r_2 + cleaned_tokens_r_3 + cleaned_tokens_r_4 + cleaned_tokens_r_5)
cleaned_tokens_w = (cleaned_tokens_w_2 + cleaned_tokens_w_3 + cleaned_tokens_w_4 + cleaned_tokens_w_5)


  ## 3) Fitting the TF-IDF Vectorizers on each side's aggregated data
tfidf_corpus_r = tfidf_vectorizer_r.fit_transform(cleaned_tokens_r)
tfidf_corpus_w = tfidf_vectorizer_w.fit_transform(cleaned_tokens_w)
  
  ## 4) Building the corpora with each month's data to be used in LDA
tfidf_corpus_r_2 = tfidf_vectorizer_r.transform(cleaned_tokens_r_2)
tfidf_corpus_r_3 = tfidf_vectorizer_r.transform(cleaned_tokens_r_3)
tfidf_corpus_r_4 = tfidf_vectorizer_r.transform(cleaned_tokens_r_4)
tfidf_corpus_r_5 = tfidf_vectorizer_r.transform(cleaned_tokens_r_5)

tfidf_corpus_w_2 = tfidf_vectorizer_w.transform(cleaned_tokens_w_2)
tfidf_corpus_w_3 = tfidf_vectorizer_w.transform(cleaned_tokens_w_3)
tfidf_corpus_w_4 = tfidf_vectorizer_w.transform(cleaned_tokens_w_4)
tfidf_corpus_w_5 = tfidf_vectorizer_w.transform(cleaned_tokens_w_5)


### Training the LDA models

In [31]:
# Installing the necessary packages and libraries
# Comments must sit on their own line: anything after a %pip command is passed to pip, and a trailing
# comment on the pandas line caused "ERROR: Invalid requirement: '#'"
# pandas is not pinned to 1.5.3 here, because the pyLDAvis release that pip resolved (3.4.1) requires pandas>=2.0.0
%pip install pyLDAvis
# pyLDAvis.gensim is not a separate distribution on PyPI (pip reported "No matching distribution found for
# pyLDAvis.gensim"); the gensim helpers ship inside pyLDAvis itself (as pyLDAvis.gensim_models in 3.x releases)
# If pip fails with "[WinError 5] Access denied" while replacing the pandas files, this is likely because the
# running kernel has pandas loaded: restart the kernel and retry, or use the --user option as pip suggests
