# Project 1

## Overview

For this project, we used two Python packages (NLTK's VADER analyzer and the `transformers` pipeline) to analyze Twitter sentiment toward Joe Biden, Donald Trump, and Kamala Harris. We chose political topics because the upcoming election makes them increasingly relevant on Twitter. We wanted to see how Twitter users view each of these candidates to gain some insight into possible election outcomes.

## Importing Data

We could not get our Twitter API access to work, so we pulled our data from https://www.kaggle.com/datasets/fastcurious/twitter-dataset-february-2024?resource=download. We load the data from a JSON file so it can be processed for sentiment analysis.

In [19]:
# Import data from json file
import json
with open(r"C:\Users\JakeBeckman\Downloads\dataset_tweet-scraper_2024-03-04_15-49-10-416.json",'r',encoding='utf-8') as f:
    content = json.load(f)

In [21]:
# Function that cleans data to include only text from tweets
def get_text(tweets:list)->list:
    text = []
    for tweet in tweets:
        if tweet['type']=='tweet':
            text.append(tweet['text'])
    return text
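
As a quick check of the filtering logic, the sketch below runs the same idea on a few hand-made records. The sample records, the `'retweet'` type value, and the name `get_text_safe` are hypothetical; only the `type` and `text` keys come from the code above. Using `dict.get` is one possible way to skip malformed records instead of raising a `KeyError`.

```python
def get_text_safe(tweets: list) -> list:
    """Return the text of every record typed 'tweet', skipping malformed records."""
    return [
        record["text"]
        for record in tweets
        if isinstance(record, dict)
        and record.get("type") == "tweet"
        and "text" in record
    ]

# Hypothetical records that mimic the shape the notebook relies on
sample_records = [
    {"type": "tweet", "text": "First tweet"},
    {"type": "retweet", "text": "Not an original tweet"},  # hypothetical type value
    {"type": "tweet"},  # missing text, so it is skipped
]
sample_text = get_text_safe(sample_records)
print(sample_text)
```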

In [23]:
# Function that pulls tweets with the specified topic
def pull_topic(tweets:list,topic:str)->list:
    tweets_ontopic=[]
    for tweet in tweets:
        if topic.lower() in str(tweet).lower():
            tweets_ontopic.append(tweet)
    return tweets_ontopic
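
Substring matching is simple, but it also matches longer words that happen to contain the topic (for example, "trump" inside "trumpet"). A minimal sketch of a whole-word alternative, assuming a regular expression with word boundaries is acceptable; `pull_topic_words` and the example tweets are hypothetical and not part of our original analysis.

```python
import re

def pull_topic_words(tweets: list, topic: str) -> list:
    """Keep tweets where `topic` appears as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(topic)}\b", re.IGNORECASE)
    return [tweet for tweet in tweets if pattern.search(str(tweet))]

# Hypothetical example tweets
examples = [
    "Trump held a rally",
    "She played a trumpet solo",
    "what about TRUMP?",
    "@BidensWins poll",
]
matched = pull_topic_words(examples, "trump")
print(matched)
```

The trade-off is that whole-word matching is stricter: handles such as `@BidensWins` or hashtags such as `#Trump2024` no longer count as mentions, whereas the substring version keeps them.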

In [25]:
# View first 4 tweets in the dataset
text = get_text(content)
print(text[:4])

['#Demographics is destiny. The future ONLY belongs to those able and willing to conduct #massdeportations, STOP the #BorderInvasion.🇺🇸', '“@Gemini failed to conduct due diligence on an unregulated third party, later accused of massive fraud, harming Earn customers who were suddenly unable to access their assets after Genesis Global Capital experienced a financial meltdown.” https://t.co/fb9qSOHSqT', '@BidensWins 🤣 did Hunter conduct the poll ?', '@visegrad24 So he can give them more orders to conduct October 7th type attacks.']


## Processing the Data

In [27]:
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

In [28]:

stopwords_list = list(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')

lemmatizer = WordNetLemmatizer()

tokenized_tweets = [tokenizer.tokenize(i) for i in text]

processed_tokenized_tweets=[]
for tweet in tokenized_tweets:
    processed_tokenized_tweets.append([lemmatizer.lemmatize(word.lower()) for word in tweet if word.lower() not in stopwords_list])

final_tweets = [' '.join(words) for words in processed_tokenized_tweets]
final_tweets

['demographic destiny future belongs able willing conduct massdeportations stop borderinvasion',
 'gemini failed conduct due diligence unregulated third party later accused massive fraud harming earn customer suddenly unable access asset genesis global capital experienced financial meltdown http co fb9qsohsqt',
 'bidenswins hunter conduct poll',
 'visegrad24 give order conduct october 7th type attack',
 'anniebecky haha conduct mongoose status lub ummmm lady tiiiiimeeee',
 'robertkennedyjr western intelligence asset got caught trying conduct color revolution behest mi6 cia nothing noble courageous far right nationalist racist yet regurgitate champion democracy',
 'schiefertomtom think want excuse appalling conduct good people narrative',
 'unapologt haaretzcom fighting terrorist organisation appeased would conduct terrorist attack take hostage',
 'paulkrugman uh airport asia europe lightyears beyond term cleanliness conduct quality food restaurant etc',
 'opinion establishment need imp

## Sentiment Analysis with `Vader`

In [31]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
def analyze_sentiment(tweets, topic):
    print(f'Analyzing sentiment for topic "{topic}"')
    tweet_list = pull_topic(tweets, topic)
    sid = SentimentIntensityAnalyzer()
    # Create empty lists we can take the average of for each
    neg_list = []
    pos_list = []
    neu_list = []
    # Loop through each tweet, score it, and append each score to its matching list
    for tweet in tweet_list:
        sentiment_scores = sid.polarity_scores(tweet)
        neg_list.append(sentiment_scores['neg'])
        neu_list.append(sentiment_scores['neu'])
        pos_list.append(sentiment_scores['pos'])
    print(f'Average positive sentiment: {sum(pos_list) / len(pos_list)}')
    print(f'Average neutral sentiment: {sum(neu_list) / len(neu_list)}')
    print(f'Average negative sentiment: {sum(neg_list) / len(neg_list)}')

In [33]:
analyze_sentiment(text, 'Trump')

Analyzing sentiment for topic "Trump"
Average positive sentiment: 0.07876190476190476
Average neutral sentiment: 0.8376785714285714
Average negative sentiment: 0.08354761904761905


In [35]:
analyze_sentiment(text, 'Biden')

Analyzing sentiment for topic "Biden"
Average positive sentiment: 0.07158620689655172
Average neutral sentiment: 0.8245
Average negative sentiment: 0.10387931034482759


In [37]:
analyze_sentiment(text, 'Kamala')

Analyzing sentiment for topic "Kamala"
Average positive sentiment: 0.12122222222222222
Average neutral sentiment: 0.8407777777777778
Average negative sentiment: 0.038


After analyzing Trump, Biden, and Kamala with the VADER sentiment analysis package, we can conclude that most of the tweet text was scored as neutral for all three candidates. We doubt this is entirely accurate, because tweets tend to be especially polarized during elections. We suspect that VADER is not picking up the full sentiment of each tweet. By using something more advanced, we hope to gain more insight into how each candidate is viewed.
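
One possible reason the neutral averages are so high is that VADER's `neu` score measures the proportion of a text that is neutral, not whether the tweet as a whole is neutral. `polarity_scores` also returns a `compound` score between -1 and 1, and the VADER authors suggest cutoffs of 0.05 and -0.05 for labeling a text positive or negative. The sketch below applies those cutoffs; the function names and sample scores are hypothetical, and in the notebook each value would come from `sid.polarity_scores(tweet)['compound']`.

```python
from collections import Counter

def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a label using symmetric cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def label_proportions(compounds: list) -> dict:
    """Return the share of scores that fall under each label."""
    counts = Counter(label_from_compound(c) for c in compounds)
    total = len(compounds)
    return {
        label: (counts[label] / total if total else 0.0)
        for label in ("positive", "neutral", "negative")
    }

# Hypothetical compound scores, standing in for real VADER output
sample_compounds = [0.62, -0.48, 0.0, 0.03, -0.71]
summary = label_proportions(sample_compounds)
print(summary)
```

Counting labels per tweet in this way gives proportions that can be compared directly across candidates.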

## Advanced Sentiment Analysis With Transformers

We used the `transformers` sentiment-analysis pipeline to conduct a more advanced sentiment analysis. With this pipeline, we did not need to clean our data, because the DistilBERT model behind the pipeline comes with its own tokenizer, which handles special characters and white space. The pipeline also works on the raw text, so we did not remove stop words beforehand.

In [39]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


RuntimeError: Failed to import transformers.models.distilbert.modeling_tf_distilbert because of the following error (look up to see its traceback):
Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.

The following functions filter for the topic we are interested in, then run a sentiment analysis using the pipeline. `conduct_sentiment_analysis()` calls the transformers sentiment-analysis pipeline, while `pull_sentiments()` ties the pieces together: it takes all tweets and a topic as parameters and returns the sentiment proportions.

In [None]:
def conduct_sentiment_analysis(tweets:list):
    # Guard against an empty list so the proportions below never divide by zero
    if not tweets:
        print("No tweets to analyze.")
        return 0.0,0.0,0.0
    sentiments = []
    for tweet in tweets:
        # Truncate long tweets to 512 characters as a rough guard for the model's input limit
        sentiments.append(classifier(tweet[:512])[0]['label'])
    # Proportion of tweets that received each label
    positive_prop = sentiments.count('POSITIVE')/len(tweets)
    neutral_prop = sentiments.count('NEUTRAL')/len(tweets)
    negative_prop = sentiments.count('NEGATIVE')/len(tweets)
    print(f"Positive Proportion: {positive_prop}, Negative Proportion: {negative_prop}, Neutral Proportion: {neutral_prop}")
    return positive_prop,negative_prop,neutral_prop

In [None]:
def pull_sentiments(tweets:list,topic:str):
    tweets_ontopic = pull_topic(tweets=tweets,topic=topic)
    print(f"{len(tweets_ontopic)} tweets mention {topic}.")
    pos,neg,neu = conduct_sentiment_analysis(tweets_ontopic)
    return pos,neg,neu
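
Because the counting logic does not depend on the model itself, it can be checked without downloading anything. The sketch below is a stand-in under stated assumptions: `fake_classifier` is a hypothetical stub that mimics the list-of-dicts shape the pipeline returns (`[{'label': ..., 'score': ...}]`), and `sentiment_proportions` is an illustrative name, not part of `transformers`. Note that the default SST-2 model named in the log above only produces `POSITIVE` and `NEGATIVE` labels, so a neutral proportion computed from it will always be zero.

```python
from collections import Counter

def sentiment_proportions(tweets: list, classify) -> dict:
    """Return the share of tweets per label, using any callable shaped like the pipeline."""
    if not tweets:
        return {}
    labels = Counter(classify(tweet[:512])[0]["label"] for tweet in tweets)
    return {label: count / len(tweets) for label, count in labels.items()}

def fake_classifier(text: str) -> list:
    # Hypothetical stub: labels anything containing "great" as positive
    label = "POSITIVE" if "great" in text.lower() else "NEGATIVE"
    return [{"label": label, "score": 1.0}]

demo_tweets = ["Great speech", "terrible debate", "what a mess", "great turnout"]
demo = sentiment_proportions(demo_tweets, fake_classifier)
print(demo)
```

In the notebook, the real `classifier` object could be passed in place of `fake_classifier` once the pipeline loads successfully.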

In our project, we were interested in how certain political figures were depicted by Twitter users. We focused on Joe Biden, Donald Trump, Kamala Harris, and the White House. We wanted to see whether some politicians were depicted more favorably than others, and whether the overall sentiment toward politics (using the keyword "white house" as our baseline) reflected the average sentiment toward our candidates of interest.

In [None]:
pull_sentiments(text,"biden")

In [None]:
# Run the analysis once per topic and unpack the result instead of calling it twice
pos_biden,neg_biden,_ = pull_sentiments(tweets=text,topic="biden")

In [None]:
pos_trump,neg_trump,_ = pull_sentiments(tweets=text,topic="trump")

In [None]:
pos_kam,neg_kam,_ = pull_sentiments(tweets=text,topic="kamala")

In [None]:
pos_wh,neg_wh,_ = pull_sentiments(tweets=text,topic="white house")

Our sentiment analysis revealed that all politicians, regardless of party, are depicted in a negative light. This is consistent with our baseline: every tweet containing the phrase "white house" received a negative sentiment label.

In [None]:
import matplotlib.pyplot as plt
topics = ["Biden",'Trump','Kamala','White House']
# Positive proportions extend upward; negative proportions are negated so they extend downward
plt.bar(height=[pos_biden,pos_trump,pos_kam,pos_wh],x=topics,label="Positive")
plt.bar(height=[-neg_biden,-neg_trump,-neg_kam,-neg_wh],x=topics,label="Negative")
plt.ylabel("Proportion of Tweets (Negative Shown Below Zero)")
plt.xlabel("Topic or Person")
plt.legend()

In [45]:
type(text)

list

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_and_filter_relevance(texts, n_clusters=5, n_key_words=10):
    """
    Perform K-means clustering on a list of texts and collect the top keywords for each cluster.
    
    :param texts: List of texts to cluster.
    :param n_clusters: Number of clusters to create.
    :param n_key_words: Number of top keywords to report for each cluster.
    
    :return: List of (text, cluster label) tuples and a dictionary mapping each cluster to its keywords.
    """
    # Vectorize the text data
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(texts)
    
    # Perform K-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    labels = kmeans.labels_
    
    # Initialize a dictionary to store the keywords for each cluster
    cluster_keywords = {}
    order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names_out()
    
    # Iterate through each cluster and store the top n keywords
    for i in range(n_clusters):
        cluster_keywords[i] = [terms[ind] for ind in order_centroids[i, :n_key_words]]
    
    # Create a list of tuples containing text and its corresponding cluster label
    clustered_texts = [(text, label) for text, label in zip(texts, labels)]
    
    # Return both the list of clustered texts and the cluster_keywords dictionary
    return clustered_texts, cluster_keywords


In [61]:
# `text` is already a list of tweet strings (see type(text) above), so pass it directly
texts = text

# Ask for 10 clusters; K-means needs at least as many texts as clusters
clusters, keywords = cluster_and_filter_relevance(texts, n_clusters=10)

# Display the top keywords for each cluster
print("\nCluster Keywords:")
for group, key_words in keywords.items():
    print(f"Cluster {group} keywords: {', '.join(key_words)}")
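
K-means cannot form more clusters than there are texts, and scikit-learn raises a `ValueError` when `n_samples` is smaller than `n_clusters`. A minimal sketch of a guard that clamps the requested cluster count before clustering; `choose_n_clusters` is a hypothetical helper, not part of scikit-learn.

```python
def choose_n_clusters(requested: int, n_samples: int) -> int:
    """Clamp the requested cluster count to the number of available texts (at least 1)."""
    if n_samples < 1:
        raise ValueError("need at least one text to cluster")
    return max(1, min(requested, n_samples))

print(choose_n_clusters(10, 1))    # a single text can only form one cluster
print(choose_n_clusters(10, 500))  # plenty of texts, so the requested value is kept
```

With the list of tweets, this could be used as `cluster_and_filter_relevance(text, n_clusters=choose_n_clusters(10, len(text)))`.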

## Works Cited

stackoverflow.com. Assistance given to the author, written explanation. Our group used a post from Stack Overflow that explained how to create a tokenizer that keeps only words and removes special characters for us. West Point, NY, 1SEP2024.