# Analyzing Political Tweets on a Depression Prediction ML Model
### Sam Spell, James Tipton

Political rhetoric and discussion appear to have become more polarized in recent years. Being able to vote and take part in politics is an important part of a stable and healthy society, both historically and for each person reaching adulthood. This project uses machine learning to build a model that predicts depression from a string of text from Twitter. Once the model is trained, it can be used to analyze political messages posted online and to draw out patterns in the texts that the model classifies as showing signs of depression. A further goal is to connect those patterns to patterns of political messaging, if they exist, and to compare them over time. With views on polarized politics changing, it will be interesting to test whether the prevalence of messages classified as “depression” changes across different political periods.


#### Step 1: Clean the datasets to prepare for the model
Import libraries

In [77]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import numpy as np
from numpy import savetxt
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer

Run these downloads once (uncomment them the first time the notebook is run)

In [78]:
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

Filter out stopwords

In [79]:
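# load both datasets and isolate their text columns
d = pd.read_csv("depression.csv")
p = pd.read_csv("political.csv")
# political posts: join title and body, treating missing values as empty strings
comb = p["Title"].fillna('') + ' ' + p['Text'].fillna('')
comb2 = comb  # keep the raw text for comparison after cleaning
text = d["clean_text"]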

# determine stopwords
stop_words = set(stopwords.words('english'))

In [80]:
# define function to remove stopwords
def remove_stopwords(text):
    tokens = nltk.word_tokenize(text)
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    filtered_text = " ".join(filtered_tokens)
    return filtered_text

text = text.apply(remove_stopwords)
comb = comb.apply(remove_stopwords)

Lemmatize and stem each Reddit post in the dataset

In [81]:
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    tokens = nltk.word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    lemmatized_text = " ".join(lemmatized_tokens)
    return lemmatized_text

text = text.apply(lemmatize_text)
comb = comb.apply(lemmatize_text)

In [82]:
stemmer = PorterStemmer()

def stem_text(text):
    tokens = nltk.word_tokenize(text)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    stemmed_text = " ".join(stemmed_tokens)
    return stemmed_text

text = text.apply(stem_text)
comb = comb.apply(stem_text)
print(comb.head())
print(comb2.head())

0    matter someon , look like , languag speak , we...
1        biden speech draw 38.2 million u.s. tv viewer
2    state union watch state union last night opini...
3                                give poor peopl money
4                                                  dew
dtype: object
0    No matter who someone is, how they look like, ...
1     Biden speech draws 38.2 million U.S. TV viewers 
2    State of the union Who watched the state of th...
3               We Should Just Give Poor People Money 
4                                   Do it for the Dew 
dtype: object


The first block of output above shows the cleaned text; the second shows the raw text it was derived from.

In [83]:
x_train, x_test, y_train, y_test = train_test_split(text, d['is_depression'], test_size=0.2, random_state=42)

# convert phrases into numerical vectors using TF-IDF
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)
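
To make the vectorization step concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, L2-normalized rows). The toy documents and the `tfidf` helper are hypothetical, and tokenization is simplified to whitespace splitting, so this illustrates the formula rather than reproducing scikit-learn's tokenizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses scikit-learn's default weighting:
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # an empty document has no terms and therefore an empty vector
        vectors.append({term: w / norm for term, w in raw.items()} if norm else {})
    return vectors

# hypothetical, already-cleaned toy documents (not taken from the datasets)
docs = ["feel sad tire", "feel happi today", "vote elect today"]
vectors = tfidf(docs)
```

Terms that occur in fewer documents, such as `sad` here, receive a larger weight than terms shared across documents, such as `feel`.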

In [84]:
# train SVM model
clf = SVC(kernel='linear', C=1.0)
clf.fit(x_train, y_train)

SVC(kernel='linear')
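
One way to keep the vectorizer and the classifier together is scikit-learn's `make_pipeline`, so that new text is always transformed with the fitted vocabulary before prediction. This is a sketch on hypothetical toy data, not the notebook's own code; only the `TfidfVectorizer()` and `SVC(kernel='linear', C=1.0)` settings are taken from the cells above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# hypothetical toy data: 1 = depression-like, 0 = other
train_texts = [
    "feel hopeless tire everyth",
    "cant sleep feel empti",
    "hopeless empti alon",
    "great game last night",
    "new recip dinner tonight",
    "sunni weather walk park",
]
train_labels = [1, 1, 1, 0, 0, 0]

# same settings as the notebook: default TF-IDF and a linear SVM with C=1.0
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1.0))
model.fit(train_texts, train_labels)

# the pipeline vectorizes and classifies in a single call
new_texts = ["feel empti hopeless", "walk park tonight"]
predictions = model.predict(new_texts)
```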

In [85]:
# evaluate SVM model
y_pred = clf.predict(x_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))

Accuracy: 0.9521654815772462
Precision: 0.9563492063492064
Recall: 0.9463350785340314
F1 score: 0.9513157894736843
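
For reference, the four scores above are all derived from the confusion-matrix counts (true/false positives and negatives). The sketch below recomputes them by hand; the `binary_metrics` helper and the toy labels are hypothetical and are not the notebook's data.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# hypothetical toy labels: 3 TP, 1 FN, 1 FP, 3 TN
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
scores = binary_metrics(y_true, y_pred)
```

Precision answers "of the posts flagged as depression, how many really were?", while recall answers "of the posts that really were, how many did the model flag?".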


In [86]:


# vectorize the cleaned political posts with the fitted vocabulary
x_new = vectorizer.transform(comb)
y_new_pred = clf.predict(x_new)

print(y_new_pred)

[1 0 0 ... 0 0 0]


In [87]:
print(f"{(sum(y_new_pred) / len(comb)) * 100:.2f}%")

10.56%


In [93]:
print(len(y_new_pred))
print(sum(y_new_pred))

12854
1358


In [95]:
conserv_count = 0
liberal_count = 0

# count posts predicted as depression (label 1) for each political lean
for i in range(len(y_new_pred)):
    if y_new_pred[i] == 1:
        if p.loc[i, 'Political Lean'] == "Conservative":
            conserv_count += 1
        elif p.loc[i, 'Political Lean'] == "Liberal":
            liberal_count += 1
            

            

In [104]:
# each count is divided by the total number of posts with the same lean
print("Conservative Count")
print(str(conserv_count) + " out of " + str(len(p[p['Political Lean'] == 'Conservative'])))
print(f"{conserv_count / len(p[p['Political Lean'] == 'Conservative']) * 100:.2f}%")
print("Liberal Count")
print(str(liberal_count) + " out of " + str(len(p[p['Political Lean'] == 'Liberal'])))
print(f"{liberal_count / len(p[p['Political Lean'] == 'Liberal']) * 100:.2f}%")

Conservative Count
463 out of 4535
10.21%
Liberal Count
895 out of 8319
10.76%
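
The per-lean tally can also be written with a pandas `groupby`, which keeps each count paired with the total for its own group. This is a sketch on a hypothetical toy frame; the `pred` column name and the sample rows are assumptions, not the notebook's data.

```python
import pandas as pd

# hypothetical stand-in for the political dataset plus the model's predictions
posts = pd.DataFrame({
    "Political Lean": ["Liberal", "Liberal", "Liberal", "Conservative", "Conservative"],
    "pred": [1, 0, 0, 1, 0],
})

# per lean: number of flagged posts, total posts, and flagged share in percent
summary = posts.groupby("Political Lean")["pred"].agg(flagged="sum", total="count")
summary["percent"] = (summary["flagged"] / summary["total"] * 100).round(2)
print(summary)
```

In the notebook, the equivalent frame could be built with `p.assign(pred=y_new_pred)`, assuming `p` still has its original row order so that predictions line up with posts.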
