### Social Media Analysis: Exploring Political Discourse on Twitter through NLP
#### Meier Chen

### Introduction

This project delves into social media analysis focusing on political discourse, particularly on Twitter. Utilizing Natural Language Processing (NLP) techniques, we aim to conduct topic modeling and sentiment analysis on tweets related to political content.

The dataset used in this project is from [Kaggle: Joe Biden Tweets (2007 - 2020)](https://www.kaggle.com/datasets/rohanrao/joe-biden-tweets?select=JoeBidenTweets.csv). It contains 6062 of Joe Biden's tweets, posted between 24th October 2007 and 31st October 2020.

This project has six parts: introduction, preparation, data preprocessing, descriptive statistics, topic modeling and sentiment analysis.

__Preparation__

In this section, the packages and data are imported. I then sample 100 tweets posted during the campaign for this project.

__Data Preprocessing__

Data preprocessing includes (1) removing URLs and HTML entities, (2) removing punctuation, (3) lowercasing, (4) removing stopwords, (5) lemmatization and (6) tokenization.

__Descriptive Statistics__

To explore the fundamental characteristics of the data, a descriptive analysis will be done in this section.

__Topic Modeling__

The aim of this project is to analyze the patterns in Biden's tweets during the 2020 presidential campaign. I therefore apply the LDA (Latent Dirichlet Allocation) algorithm, a widely used topic-modeling method, and visualize the result with the pyLDAvis package.

__Sentiment Analysis__

To explore further, sentiment analysis is performed with Hugging Face Transformers. I then display the top 10 bigrams for each sentiment.


__References__

* Kedia, A. (2020). Hands-on python natural language processing: Explore tools and techniques to analyze and process text with a view to building real-world nlp applications. Packt Publishing Limited.
* Pérez, J. M., Giudici, J. C., & Luque, F. (2021). pysentimiento: A python toolkit for sentiment analysis and socialnlp tasks. arXiv preprint arXiv:2106.09462.

### Preparation

In [None]:
# import packages
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import pandas as pd

import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.decomposition import TruncatedSVD

import numpy as np
import collections
import seaborn as sns
import matplotlib.pyplot as plt

import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore
from gensim.models.coherencemodel import CoherenceModel

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
from gensim.models.ldamodel import LdaModel

from transformers import pipeline
from wordcloud import WordCloud, STOPWORDS

In [None]:
# import data
df = pd.read_csv("JoeBidenTweets.csv")

# drop columns that contain missing values, then drop duplicate tweets
df.dropna(axis = "columns", inplace = True)
df.drop_duplicates(inplace = True, subset = "tweet")

df.head()

My aim is to explore the patterns in Biden's tweets during the 2020 presidential campaign. I therefore extract the tweets posted on or after 25th of April, 2019, the date on which Biden posted a video officially announcing his candidacy. First, I convert the timestamp column to DateTime format. Then, I check the time of the last tweet in this dataset and the number of tweets posted during the campaign to ensure there is enough data for analysis.

In [None]:
# convert column timestamp to DateTime format
df['timestamp'] = pd.to_datetime(df['timestamp'])

# the last tweet
df.sort_values('timestamp', ascending = True, inplace = True)
last_tweet = df.timestamp.iloc[-1]
print(last_tweet)

# the number of tweets posted during the campaign
df_cam = df.loc[df['timestamp'] >= '2019-04-25']
len(df_cam.tweet)

The last tweet was posted on 2020.11.01, and 4722 tweets were posted during the campaign. For this project, I randomly sample 100 of them for analysis.

In [None]:
# sample 100 tweets
sample_df = df_cam.sample(n = 100, ignore_index=True, random_state = 100)
# copy so that columns added later do not trigger SettingWithCopyWarning
tweets = sample_df[['tweet']].copy()
tweets

### Data Preprocessing

In [None]:
# take a look at the data
with pd.option_context('display.max_rows',5, 'display.max_colwidth', None): 
    display(tweets)

As shown above, the tweets contain URLs and HTML entities (such as &amp). I therefore remove these first and save the result for sentiment analysis. After that, I preprocess the data by removing punctuation, lowercasing, removing stopwords and lemmatizing. I keep @ and # because I also want to look at the users and hashtags in the tweets. Finally, I tokenize the tweets into bigrams for topic modeling.
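
To make the cleaning steps concrete, here is a minimal sketch applied to a single made-up tweet (not taken from the dataset). The function names `strip_links_and_entities` and `strip_punctuation` are hypothetical, and unlike the notebook's own functions this version works on the whole string at once and has no keep list.

```python
import re

def strip_links_and_entities(text):
    """Drop URLs and the HTML-escaped ampersand (&amp;)."""
    text = re.sub(r'http\S+', '', text)
    return re.sub(r'&amp;?', '', text)

def strip_punctuation(text):
    """Replace everything except letters, digits, @ and # with a space,
    lowercase the result, and collapse repeated whitespace."""
    text = re.sub(r'[^a-zA-Z0-9@#]', ' ', text)
    return ' '.join(text.lower().split())

# Made-up tweet for illustration only
example = "Thank you, @Example! Jobs &amp; health care matter. #Vote https://t.co/abc123"

no_links = strip_links_and_entities(example)
print(no_links)                     # text that would be kept for sentiment analysis
print(strip_punctuation(no_links))  # text that would go on to stopword removal
```

The first printed string still has its punctuation and casing, which is why it is the version saved for the sentiment classifier; the second keeps the mention and the hashtag intact.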

In [None]:
# remove URLs and HTML-escaped ampersands (&amp;)
def remove_links(tweet):
    tweet = re.sub(r'http\S+', '', str(tweet))
    # the optional ';' avoids leaving a stray semicolon behind
    tweet = re.sub(r'&amp;?', '', tweet)
    return tweet
    
tweets['cleaned_tweets'] = tweets['tweet'].apply(remove_links)
tweets['cleaned_tweets_sa'] = tweets['cleaned_tweets']

In [None]:
corpus = pd.Series(tweets.cleaned_tweets.tolist()).astype(str)

# remove punctuation (except for @ and #) and lowercase
def text_clean(corpus, keep_list):
    cleaned_rows = []
    for row in corpus:
        qs = []
        for word in row.split():
            if word not in keep_list:
                # replace every character other than letters, digits, @ and # with a space
                p1 = re.sub(pattern='[^a-zA-Z0-9@#]', repl=' ', string=word)
                qs.append(p1.lower())
            else:
                qs.append(word)
        cleaned_rows.append(' '.join(qs))
    # Series.append was removed in pandas 2.0, so collect the rows in a list first
    return pd.Series(cleaned_rows, dtype=str)

In [None]:
# remove stopwords
def stopwords_removal(corpus):
    # keep the wh-words, which NLTK lists as stopwords
    wh_words = {'who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom'}
    stop = set(stopwords.words('english')) - wh_words
    return [[word for word in doc.split() if word not in stop] for doc in corpus]

In [None]:
# lemmatization
def lemmatize(corpus):
    lem = WordNetLemmatizer()
    # pos='v' lemmatizes every token as a verb (e.g. 'running' -> 'run')
    return [[lem.lemmatize(word, pos='v') for word in doc] for doc in corpus]

In [None]:
def preprocess(corpus, keep_list, cleaning=True, lemmatization=True, remove_stopwords=True):
    if cleaning:
        corpus = text_clean(corpus, keep_list)

    # both branches turn each document into a list of tokens
    if remove_stopwords:
        corpus = stopwords_removal(corpus)
    else:
        corpus = [doc.split() for doc in corpus]

    if lemmatization:
        corpus = lemmatize(corpus)

    # join the tokens back into one string per document
    return [' '.join(doc) for doc in corpus]

keep_list = ['U.S.A', 'U.S.']

In [None]:
cleaned_corpus = preprocess(corpus, keep_list, cleaning = True, lemmatization = True, remove_stopwords = True)
tweets['cleaned_tweets'] = pd.Series(cleaned_corpus)

In [None]:
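# tokenization: bigrams
# build the stopword set once rather than once per word
# (this is the full NLTK list, so the wh-words kept earlier are dropped here)
STOP_WORDS = set(stopwords.words('english'))

def generate_bigrams(text, ngram=2):
    words = [word for word in text.split() if word not in STOP_WORDS]
    # zip the token list against shifted copies of itself to form n-grams
    shifted = [words[i:] for i in range(ngram)]
    return [' '.join(gram) for gram in zip(*shifted)]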

tweets['bigrams'] = tweets['cleaned_tweets'].apply(generate_bigrams)
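
The shifted-zip idiom used for the bigrams may be easier to follow in isolation. The sketch below is a stand-alone illustration with a made-up token list; the function name `ngrams` is hypothetical and it skips the stopword filtering done in the notebook.

```python
def ngrams(tokens, n=2):
    """Return the n-grams of a token list as space-joined strings."""
    # tokens[i:] for i = 0..n-1 are copies of the list shifted by i positions;
    # zip stops at the shortest copy, so every tuple is a complete n-gram.
    shifted = [tokens[i:] for i in range(n)]
    return [' '.join(gram) for gram in zip(*shifted)]

tokens = "we can do this".split()
print(ngrams(tokens, 2))    # ['we can', 'can do', 'do this']
print(ngrams(tokens, 3))    # ['we can do', 'can do this']
print(ngrams(["vote"], 2))  # [] because one token cannot form a bigram
```

A tweet with fewer than two remaining tokens therefore contributes an empty list, which is worth keeping in mind when a sample is as small as 100 tweets.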

In [None]:
tweets

### Descriptive Statistics

In [None]:
# word count
def word_count(text):
   
    return len(text.split())

tweets['original_tweet_word_count'] = tweets['tweet'].apply(word_count)
tweets['cleaned_tweet_word_count'] = tweets['cleaned_tweets'].apply(word_count)
tweets['bigram_count'] = tweets['bigrams'].str.len()

tweets.head(3).T

The table above shows the word count for original tweets, tweets after preprocessing and bigrams. To save space, only the first three tweets are displayed.

In [None]:
# top 10 bigrams - all sample data
def get_top_ngram(corpus, n=2):
    # ngram_range needs integers, so the default is 2 (bigrams) rather than None
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # total count of each n-gram across all documents
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]

top_10_bigrams = get_top_ngram(tweets['cleaned_tweets'],2)[:10] 
x,y = map(list,zip(*top_10_bigrams)) 
sns.barplot(x = y,y = x)
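
One detail to keep in mind when reading the chart: with its default `token_pattern`, `CountVectorizer` only keeps tokens of two or more word characters, so the @ and # symbols preserved during preprocessing are not part of the counted bigrams. As a rough cross-check, the sketch below counts whitespace-delimited bigrams with `collections.Counter` instead, which leaves mentions and hashtags intact. The helper name `top_bigrams` and the three documents are made up for illustration.

```python
from collections import Counter

def top_bigrams(docs, k=10):
    """Count whitespace-delimited bigrams across documents; return the k most common."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        # pair each token with its successor
        counts.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(k)

# Made-up, already-cleaned documents
docs = ["@example vote early", "vote early #vote", "please vote early"]
print(top_bigrams(docs, k=1))  # [('vote early', 3)]
```

Because the two approaches tokenize differently, their rankings can diverge for tweets that contain many mentions or hashtags.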

The bar chart shows the top 10 bigrams in the sample data. Biden mentions President Donald Trump frequently, which makes sense given that they were opponents in the election. Next, I explore Biden's tweets further.

### Topic Modeling

In [None]:
# LDA model by bigrams
id2word = Dictionary(tweets['bigrams'])

# Term Document Frequency
corpus_bigrams = [id2word.doc2bow(text) for text in tweets['bigrams']]

[[(id2word[i], freq) for i, freq in doc] for doc in corpus_bigrams[:1]]

# Build LDA model
lda_model_bigrams = LdaModel(corpus=corpus_bigrams,
                   id2word=id2word,
                   num_topics=2, 
                   random_state=100,
                   update_every=1,
                   chunksize=100,
                   alpha='auto',
                   per_word_topics=True)

coherence_model_lda_bigrams = CoherenceModel(model=lda_model_bigrams, texts=tweets['bigrams'], dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda_bigrams.get_coherence()
print('\nCoherence Score: ', coherence_lda)
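
For readers unfamiliar with the bag-of-words input that the LDA model consumes, the following plain-Python sketch mimics the shape of that representation: each document becomes a list of (token id, count) pairs. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, not gensim's implementation, and the ids they assign will generally differ from those in `id2word`.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Made-up documents, each already tokenized into bigrams
docs = [["vote early", "early voting"],
        ["vote early", "vote early", "health care"]]
vocab = build_vocab(docs)
print(vocab)                   # {'vote early': 0, 'early voting': 1, 'health care': 2}
print(to_bow(docs[1], vocab))  # [(0, 2), (2, 1)]
```

Word order inside a document is discarded at this stage, so the only sequence information the model sees is what the bigram tokens themselves carry.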

In [None]:
# display LDA model
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model_bigrams, corpus_bigrams, id2word)
vis

Both topics relate to Donald Trump, but some differences remain. Topic 1 seems to present America from the perspective of a nation ('american people' and 'american history' appear often), drawing on culture and history. Topic 2 is closer to the perspective of the state (it includes 'covid 19' and 'government work'), characterising the country through formal institutions and policy.

### Sentiment Analysis

In [None]:
# load the sentiment classifier and apply it to each lightly cleaned tweet
classifier = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
sentiment_analysis = tweets.cleaned_tweets_sa.apply(classifier)

In [None]:
# save sentiment analysis result
labels=[]
scores=[]
for sentiment in sentiment_analysis:
    labels.append(sentiment[0]['label'])
    scores.append(round(sentiment[0]['score'],4))
    
tweets['labels'] = labels
tweets['scores'] = scores
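
The loop above relies on the classifier returning, for each tweet, a list holding one dictionary with a `label` and a `score`. The sketch below reproduces that unpacking on hand-written outputs, so it runs without downloading the model; the three results are made up and only imitate that structure.

```python
from collections import Counter

# Hypothetical classifier outputs, one list per tweet
outputs = [
    [{'label': 'POS', 'score': 0.9812}],
    [{'label': 'NEG', 'score': 0.7534}],
    [{'label': 'NEG', 'score': 0.5121}],
]

# Same unpacking as in the notebook: first (and only) dict of each result
labels = [out[0]['label'] for out in outputs]
scores = [round(out[0]['score'], 4) for out in outputs]

# Tally the labels and average the confidence score per label
label_counts = Counter(labels)
mean_score = {
    label: sum(s for l, s in zip(labels, scores) if l == label) / label_counts[label]
    for label in label_counts
}
print(label_counts)
print(mean_score)
```

Looking at the mean score per label alongside the counts gives a sense of how confident the classifier was, which matters when only 100 tweets are split across three classes.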

In [None]:
# sentiment analysis result 
print(tweets.groupby(['labels'])['labels'].count())

neg_tweets = tweets[tweets['labels'] == 'NEG']
neu_tweets = tweets[tweets['labels'] == 'NEU']
pos_tweets = tweets[tweets['labels'] == 'POS']

In [None]:
# top 10 bigrams - Negative
top_10_neg = get_top_ngram(neg_tweets['cleaned_tweets'],2)[:10] 
x1,y1 = map(list,zip(*top_10_neg)) 
sns.barplot(x = y1,y = x1)

In [None]:
# top 10 bigrams - Neutral
top_10_neu = get_top_ngram(neu_tweets['cleaned_tweets'],2)[:10] 
x2,y2 = map(list,zip(*top_10_neu)) 
sns.barplot(x = y2,y = x2)

In [None]:
# top 10 bigrams - Positive
top_10_pos = get_top_ngram(pos_tweets['cleaned_tweets'],2)[:10] 
x3,y3 = map(list,zip(*top_10_pos)) 
sns.barplot(x = y3,y = x3)

'donald trump' remains frequent across all sentiments, but it is not the most frequent bigram among the positive tweets. The tweets labelled as negative seem to concern Trump's term of office (they include 'supreme court' and 'repeal aca'), while the positive tweets are closer to Biden himself (they include 'vice president'). The neutral tweets mention public issues such as covid-19.

To sum up, this project presents a descriptive and exploratory analysis of Biden's campaign strategy on social media. Explaining these results would require more theoretical grounding in research on social media and political campaigns.