# Analyzing Political Tweets on a Depression Prediction ML Model
### Sam Spell, James Tipton

Political rhetoric and discussion appear to have become more polarized in recent years. Being able to vote and take part in politics on reaching adulthood is an important part of a stable and healthy society. This project uses machine learning to build a model that predicts depression from a string of text from Twitter. Once developed, the model can be used to analyze political messages posted online and to draw out patterns in the texts that it classifies as showing signs of depression. A further goal is to connect those text patterns to patterns of political messaging, if any exist, and to compare them over time. Given changing views on polarized politics, it will be interesting to test whether the prevalence of messages classified as “depression” changes across different political periods.


#### Step 1: Clean the datasets to prepare for the model
Import libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import numpy as np
from numpy import savetxt
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer

Run these downloads once (uncomment on first use)

In [2]:
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

Filter out stopwords

In [3]:
# isolate text column of dataset
d = pd.read_csv("depression.csv")
text = d["clean_text"]

# determine stopwords
stop_words = set(stopwords.words('english'))

In [4]:
# define function to remove stopwords
def remove_stopwords(text):
    tokens = nltk.word_tokenize(text)
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    filtered_text = " ".join(filtered_tokens)
    return filtered_text

text = text.apply(remove_stopwords)

Lemmatize and stem each Reddit post in the dataset

In [5]:
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    tokens = nltk.word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    lemmatized_text = " ".join(lemmatized_tokens)
    return lemmatized_text

text = text.apply(lemmatize_text)

In [6]:
stemmer = PorterStemmer()

def stem_text(text):
    tokens = nltk.word_tokenize(text)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    stemmed_text = " ".join(stemmed_tokens)
    return stemmed_text

text = text.apply(stem_text)
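
Each of the three cleaning cells above tokenizes the text and joins it back together. As a minimal sketch of an alternative, the steps could be chained over one tokenization pass, assuming each step can be written as a per-token function. Everything named here (`preprocess`, `TOY_STOPWORDS`, `drop_stopwords`, `toy_stem`, and the regex tokenizer) is a hypothetical stand-in so the example runs without NLTK data; in the notebook the steps would be the stopword check, `lemmatizer.lemmatize`, and `stemmer.stem`, and results may differ slightly from re-tokenizing between steps.

```python
import re
from typing import Callable, List, Optional

# A step maps a token to a new token, or to None to drop it.
TokenStep = Callable[[str], Optional[str]]

def preprocess(text: str, steps: List[TokenStep]) -> str:
    """Tokenize once, then apply each per-token step in order."""
    tokens = re.findall(r"\w+", text)  # stand-in for nltk.word_tokenize
    for step in steps:
        tokens = [out for out in (step(tok) for tok in tokens) if out is not None]
    return " ".join(tokens)

# Hypothetical toy stopword list (the notebook uses NLTK's English list).
TOY_STOPWORDS = {"i", "am", "the", "a"}

def drop_stopwords(tok: str) -> Optional[str]:
    return None if tok.lower() in TOY_STOPWORDS else tok

def toy_stem(tok: str) -> str:
    # Crude suffix stripping, standing in for PorterStemmer.stem.
    return tok[:-3] if tok.endswith("ing") and len(tok) > 5 else tok

cleaned = preprocess("I am feeling the sadness", [drop_stopwords, toy_stem])
print(cleaned)
```

Keeping the steps in one function also makes it easier to apply exactly the same cleaning to any new text before it is vectorized.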

Split the cleaned text into training and test sets, then vectorize

In [7]:
x_train, x_test, y_train, y_test = train_test_split(text, d['is_depression'], test_size=0.2, random_state=42)

# convert phrases into numerical vectors using TF-IDF
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)
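
To make the TF-IDF step more concrete, here is a simplified sketch of the weighting in plain Python. It assumes the smoothed form `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is how scikit-learn documents its default settings; the whitespace tokenizer and the `tfidf` helper are simplifications introduced for illustration, not the library's implementation.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + count)) + 1.0 for t, count in df.items()}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts in this document
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

toy_docs = ["sad tired sad", "happy tired"]
toy_vectors = tfidf(toy_docs)
for doc, vec in zip(toy_docs, toy_vectors):
    print(doc, "->", vec)
```

In the first toy document, "sad" ends up with a larger weight than "tired" because it occurs twice there and appears in only one of the two documents.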

In [8]:
# train SVM model
clf = SVC(kernel='linear', C=1.0)
clf.fit(x_train, y_train)

In [9]:
# evaluate SVM model
y_pred = clf.predict(x_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))

Accuracy: 0.9521654815772462
Precision: 0.9563492063492064
Recall: 0.9463350785340314
F1 score: 0.9513157894736843
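
For reference, the four scores above can be derived from the confusion counts. The following is a minimal sketch assuming binary 0/1 labels with 1 as the positive class; `binary_metrics` is a hypothetical helper and the toy labels are made up for illustration. The last lines check that the F1 score reported above is the harmonic mean of the reported precision and recall.

```python
from typing import Dict, Sequence

def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("need at least one label")
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels, not taken from the dataset.
toy = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(toy)

# Recompute F1 from the precision and recall printed by the notebook.
reported_precision = 0.9563492063492064
reported_recall = 0.9463350785340314
f1_from_reported = (
    2 * reported_precision * reported_recall / (reported_precision + reported_recall)
)
print(f1_from_reported)
```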


In [29]:
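p = pd.read_csv("political.csv")
# drop rows with missing text
new_text = p["Text"].dropna()

# NOTE: unlike the training text, new_text is vectorized here without the
# stopword, lemmatization, and stemming steps applied above
x_new = vectorizer.transform(new_text)

# predict a depression label (0 or 1) for each political post
y_new_pred = clf.predict(x_new)

print(y_new_pred)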

[0 0 0 ... 0 0 0]
0       0
1       0
2       0
3       0
4       0
       ..
2423    0
2424    0
2425    0
2426    0
2427    0
Length: 2428, dtype: int64
2        Who watched the state of the union last night ...
11       I have fallen for this trap several times and ...
20       One of the things I have noticed in todays wor...
42       [https://kites-journal.org/2022/03/01/between-...
54       ***"Axe tax"*** aka ***"Hammer tax"*** aka ***...
                               ...                        
12829    "Well now, which is the longest river in Afric...
12837    Let's say there are private courts/resolution ...
12838    Basically, the above. It can be a very effecti...
12842    Last week I would've considered myself a liber...
12853    I go to the mises.org and listen to the writin...
Name: Text, Length: 2428, dtype: object
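
The vectorizer was fitted on text that had been stopword-filtered, lemmatized, and stemmed, so new text generally needs the same cleaning before `vectorizer.transform`. One way to keep the two in sync, sketched here under stated assumptions, is to put the cleaning function inside the vectorizer and wrap everything in a scikit-learn `Pipeline`. The `normalize` function and the toy texts and labels are hypothetical stand-ins; in the notebook the function would chain `remove_stopwords`, `lemmatize_text`, and `stem_text`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def normalize(text: str) -> str:
    """Hypothetical stand-in for the stopword/lemmatize/stem chain."""
    return text.lower()

# The same cleaning runs automatically during both fit and predict.
model = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=normalize)),
    ("svm", SVC(kernel="linear", C=1.0)),
])

# Toy training data, made up for illustration (1 = depression, 0 = not).
train_texts = [
    "I feel hopeless and empty",
    "Nothing matters anymore, I feel empty",
    "Great game last night",
    "Excited for the weekend trip",
]
train_labels = [1, 1, 0, 0]
model.fit(train_texts, train_labels)

# Raw strings go straight in; no separate cleaning step to forget.
new_posts = ["Who watched the game last night", "I feel so empty lately"]
preds = model.predict(new_posts)
print(preds)
```

With this layout, raw text from `political.csv` could be passed directly to `model.predict`, and the training-time cleaning would be applied to it automatically.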


In [45]:
# share of political posts classified as showing signs of depression
print(f"{(sum(y_new_pred) / len(y_new_pred)) * 100:.2f}%")

2.14%
