# Political Alignment Analysis in 140 Characters 
#### CX 4240, Spring 2019 
#### Jessica Buzzelli, Jarad Hosking, Aakanksha Patil 

# I: Problem

Twitter is a social media platform where users' status updates (tweets) can shape how their followers perceive events, [especially in the realm of politics](https://www.nytimes.com/2019/04/11/us/politics/on-politics-twitter-democrats.html). In this project, we wanted to see how accurately we could identify users' political affiliations, with the goal of building a model that could identify like-minded public figures for a given user.

Especially with [Twitter's ongoing decline in monthly users](https://www.bloomberg.com/news/articles/2018-07-27/twitter-projects-users-to-decline-profit-short-of-estimates), we hypothesize that such a tool could be used to further establish the site as a more specialized hub for political news and debate and, in turn, drive platform-unique content and reinvigorate overall site traffic. Another high potential use case could involve matching individuals with similarly-aligned local politicians in order to inspire more people to participate in non-federal elections.

# II: Approach

In order to meaningfully visualize our model projections, we chose to use a Nolan Chart as our frame of reference. We extracted ground-truth information on training users (politicians) from [OnTheIssues.org](), and solved for unknown scores of our test users. 

An example of a Nolan Chart from our source mapping Donald Trump to a point at (.8,.2):

<img src="report_imgs/donald_trump.gif" width="400"/>

Our problem lies somewhere between a clustering and a classification problem:
1. We want to know which users are most similar to a test user, but
1. Our classification "labels" are non-discrete, and
1. We want to project onto a space where we can bring prior knowledge of political parties and schools of thought into our interpretation of the data

# III: Dataset

Since Twitter frowns upon (but allows) its data being used to identify users based on sensitive characteristics such as political alignment, we have limited our test set to users whose public presences are built around political commentary; we refer to them collectively in this project as pundits.

Using a [Twitter API Python client](http://www.tweepy.org/) and an [SQLite](https://www.sqlite.org/index.html) database, we were able to pull tweets from a collection of politicians and pundits with the following characteristics.

__NOTE:__ We were not able to obtain as many training users (politicians) as would be ideal, because few politicians have Twitter accounts active enough to have posted more than 2,000 tweets since the 2016 presidential election (a cutoff we saw as necessary in case of changes in political affiliation over time). We had no problem finding active pundits on the platform, but restricted our test numbers to scale well with our set of politicians.
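The activity filter described in the note can be sketched with the standard-library `sqlite3` module. The table layout (`tweets` with `author_handle`, `created_at`, `text`), the function name `active_authors`, and the tiny threshold used in the demo are assumptions for illustration, not the project's actual schema or code.

```python
import sqlite3

ELECTION_DAY = "2016-11-08"  # ISO date strings compare correctly as text


def active_authors(conn, min_tweets, since=ELECTION_DAY):
    """Return handles with at least `min_tweets` tweets on or after `since`."""
    rows = conn.execute(
        """
        SELECT author_handle
        FROM tweets
        WHERE created_at >= ?
        GROUP BY author_handle
        HAVING COUNT(*) >= ?
        ORDER BY author_handle
        """,
        (since, min_tweets),
    )
    return [handle for (handle,) in rows]


# Hypothetical in-memory table standing in for the project's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (author_handle TEXT, created_at TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [
        ("user_a", "2017-01-05", "first"),
        ("user_a", "2018-03-09", "second"),
        ("user_b", "2015-06-01", "too old"),
        ("user_b", "2017-02-02", "only one recent"),
    ],
)
print(active_authors(conn, min_tweets=2))
```

With the report's cutoff, the call would use `min_tweets=2000` instead of the demo value.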

Nolan Chart breakdown of users in our dataset:

<img src="report_imgs/newflows.png" width="700"/>

__NOTE:__ We were unable to find any good examples of populists for our dataset, and therefore would ideally use a different visualization convention if unbiased ground-truth data were similarly available.

Before composing our feature matrix from the tweets in our database, we cleaned each tweet as follows:

```python
from preproccess_tweets import Preprocessor

test_tweet = """RT @user: I really like using Twitter to follow #trends;
                I can't stop reading about my favorite politicians! """

print(Preprocessor().preprocess(test_tweet))
```

```
['twitter', 'follow', 'trend', 'stop', 'read', 'favorit', 'politician']
```
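The internals of `Preprocessor` are not shown in this report, so the following is only a rough sketch of the kind of cleaning the output above suggests: lowercasing, dropping the retweet marker, mentions and links, and removing stop words. The function name `clean_tweet` and the short stop-word list are hypothetical, and the stemming visible in the real output (`favorit`, `read`) is deliberately left out.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "about", "an", "and", "can", "i", "is", "my", "t", "the", "to"}


def clean_tweet(text):
    """Lowercase a tweet, strip Twitter-specific noise, and drop stop words."""
    text = text.lower()
    text = re.sub(r"^\s*rt\s+", "", text)       # leading retweet marker
    text = re.sub(r"https?://\S+", " ", text)   # links
    text = re.sub(r"@\w+:?", " ", text)         # mentions such as "@user:"
    tokens = re.findall(r"[a-z]+", text)        # letters only, so "#trends" -> "trends"
    return [t for t in tokens if t not in STOP_WORDS]


print(clean_tweet("RT @user: I can't stop reading about #trends!"))
```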


Afterward, we used [Term Frequency x Inverse Document Frequency (TF-IDF) scores](https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html) trained on the politicians' combined corpus to build a feature matrix.

This worked by 1) excluding words not used by the politicians' accounts and 2) reducing the weight of words that are common across the politicians' corpus and so do little to distinguish one account from another.
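To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF with a vocabulary fixed by the training corpus. It uses a smoothed inverse document frequency and skips vector normalization; the function names and the toy documents are assumptions for illustration, not the project's code. In this standard formulation, a term that appears in more training documents receives a lower weight, and a term that never appears in training is dropped.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and IDF weights from tokenized training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    # Smoothed IDF: terms found in many documents get weights closer to 1.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Weight one tokenized document; terms outside the training vocabulary are dropped."""
    counts = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


train = [["tax", "cut", "vote"], ["tax", "border"], ["tax", "health", "vote"]]
idf = fit_idf(train)
vec = tfidf_vector(["tax", "health", "selfie"], idf)
print(sorted(vec))  # "selfie" is absent because no training document used it
```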

Our feature matrices had __378,013 training records__, __67,088 testing records__, __65,376 unique features__ (same number of features in training and testing matrices), and looked something like this:

<img src="report_imgs/vecs2.png" width="550"/>

# IV: Results

### Attempt 1: Linear Regression

A [similar problem](https://medium.com/linalgo/predict-political-bias-using-python-b8575eedef13) used logistic regression and clustering to predict the bias of newspaper headlines, but we used linear regression to get hard "classifications" of our training users' scores instead of clustering the data and reducing the dimensions down to a graphable output.

Knowing that linear regression estimates become much less stable once the number of features exceeds the number of data points, we applied Principal Component Analysis to bring our xtrain and xtest matrices down to __75 features__ (a number chosen by trial with varying numbers of components). The improvement in results was significant:

<img src="report_imgs/PCA.png" width="1300"/>
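The PCA-then-regression idea can be sketched with NumPy alone. This is an illustration under assumptions, not the contents of `linear_regression_model`: the data is random stand-in data, the helper names are hypothetical, and a dense SVD is used for brevity even though a matrix the size of ours would call for a truncated or sparse method.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return the column means and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def add_intercept(Z):
    """Append a column of ones so the regression can learn an offset."""
    return np.hstack([Z, np.ones((Z.shape[0], 1))])


def fit_linear(Z, Y):
    """Ordinary least squares from reduced features Z to score pairs Y."""
    coef, *_ = np.linalg.lstsq(add_intercept(Z), Y, rcond=None)
    return coef


def predict(X, mean, components, coef):
    """Project X onto the learned components and apply the regression."""
    return add_intercept((X - mean) @ components.T) @ coef


rng = np.random.default_rng(0)
X_train = rng.random((40, 200))   # stand-in for the TF-IDF training matrix
Y_train = rng.random((40, 2))     # stand-in (economic, social) scores
X_test = rng.random((5, 200))

mean, components = fit_pca(X_train, n_components=10)
coef = fit_linear((X_train - mean) @ components.T, Y_train)
Y_pred = predict(X_test, mean, components, coef)
print(Y_pred.shape)
```

Note that the test matrix is centered and projected with the mean and components learned from the training matrix, so both end up in the same reduced space.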

The PCA model (the winner) had an average error of 24.0324 units of distance on the Nolan Chart (roughly the difference of one quadrant), with a standard deviation of 5 units.
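For reference, an error of this kind can be computed as the distance between each predicted point and its ground-truth point on the chart. The sketch below assumes Euclidean distance, which this report does not state explicitly, and uses made-up coordinates purely to show the arithmetic.

```python
import math
import statistics


def nolan_errors(predicted, actual):
    """Euclidean distance between each predicted and true (economic, social) point."""
    return [math.dist(p, a) for p, a in zip(predicted, actual)]


# Hypothetical points, not results from the project.
predicted = [(60.0, 40.0), (30.0, 70.0)]
actual = [(63.0, 44.0), (30.0, 60.0)]

errors = nolan_errors(predicted, actual)
print(statistics.mean(errors), statistics.pstdev(errors))
```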

Our model's pundit estimations:

```python
import linear_regression_model
%matplotlib inline

linear_regression_model.main()
```

```
Number of features (post-processing): 24972

Regression results:
                economic_score_estimate  social_score_estimate  \
author_handle                                                    
PostOpinions                   0.603092               0.453282   
ObsoleteDogma                  0.569767               0.565621   
demsocialists                  0.271233               0.696731   
JonathanLKrohn                 0.531447               0.430418   
JoeNBC                         0.602881               0.440418   
IngrahamAngle                  0.666145               0.275623   
GlennBeck                      0.637169               0.390862   
RedState                       0.731189               0.247807   
Heritage                       0.795789               0.321089   
MichelleMalkin                 0.708965               0.250840   
KellyannePolls                 0.719816               0.392904   
lizmair                        0.495508               0.479168   
reason     
```


### Attempt 2: Feature Engineering via Latent Dirichlet Allocation and Sentiment Data

Unlike [similar projects](https://medium.com/analytics-vidhya/twitter-sentiment-analysis-for-the-2019-election-8f7d52af1887) [using Twitter data for sentiment analysis](https://ieeexplore.ieee.org/document/6897213), we decided that a person's political affiliation is a combination of:

1. The topics they care about (or in this case, tweet about), and
1. Their sentiments (in this case, positive, negative, or neutral) towards those topics

Using the [VADER sentiment analyzer](https://www.nltk.org/_modules/nltk/sentiment/vader.html), we proceeded as follows:

<img src="report_imgs/ldaflow.png" width="1000"/>
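One way to combine the two ingredients (topics and sentiment) into per-user features is sketched below. It is an assumption about how such a combination could look, not a transcription of the flow in the figure: each tweet is taken to have a list of topic weights (for example from LDA) and a single sentiment score in [-1, 1] (for example VADER's compound score), and the user's feature for a topic is the average of weight times sentiment. The function name and sample numbers are hypothetical.

```python
def topic_sentiment_features(topic_weights, sentiments):
    """Average each topic's weight, signed by tweet sentiment, over a user's tweets.

    topic_weights: one list of per-topic weights per tweet
    sentiments:    one score in [-1, 1] per tweet
    """
    n_tweets = len(topic_weights)
    n_topics = len(topic_weights[0])
    totals = [0.0] * n_topics
    for weights, score in zip(topic_weights, sentiments):
        for k, w in enumerate(weights):
            totals[k] += w * score
    return [t / n_tweets for t in totals]


# Three made-up tweets over two topics.
tweets_topics = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
tweets_sentiment = [0.5, -1.0, 1.0]
features = topic_sentiment_features(tweets_topics, tweets_sentiment)
print(features)
```

A positive feature then means the user tends to tweet favorably about that topic, and a value near zero means either little coverage or mixed sentiment.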

Of course, we also ran a "one vs. rest" cross validation on an LDA model that did not use the sentiment approach detailed above:

<img src="report_imgs/lda_comparisons2.png" width="1500"/>

Cross validation results of the sentiment-enabled model had a higher mean error and estimation variance than the linear regression model, so we consider this approach a dead-end in its current state.
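The cross validation above can be read as holding out one user at a time and training on the rest; that reading is an assumption, as are the helper names below. The sketch accepts any `fit` and `predict` pair, and the demo uses a placeholder model that always predicts the mean training score.

```python
import math


def leave_one_out_errors(users, fit, predict):
    """Hold out each user in turn, train on the rest, and record the prediction error.

    users: list of (features, (economic, social)) pairs
    """
    errors = []
    for i, (features, target) in enumerate(users):
        rest = users[:i] + users[i + 1:]
        model = fit(rest)
        errors.append(math.dist(predict(model, features), target))
    return errors


def fit_mean(train):
    """Placeholder model: the mean (economic, social) score of the training users."""
    n = len(train)
    return (sum(t[0] for _, t in train) / n, sum(t[1] for _, t in train) / n)


def predict_mean(model, features):
    """Ignore the features and return the stored mean."""
    return model


# Made-up users with one feature each.
users = [([0.1], (0.0, 0.0)), ([0.2], (1.0, 0.0)), ([0.3], (0.0, 1.0))]
errors = leave_one_out_errors(users, fit_mean, predict_mean)
print(errors)
```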

# V: Conclusion

Considerations:
* Sentiment data may not have been the most accurate given sarcasm, passive aggressiveness aimed at other users, etc.

* Very skeptical about the scalability of the regression model in Attempt 1 due to potential overfitting from our choice of number of components from PCA; could also have high variation due to tweets' small character limit making individual tweets poor reflections of a user's entire Twitter timeline

* Topic-based models (Attempt 2) did not perform better when using feature reduction via Latent Semantic Indexing (LSI, or LDA with truncated SVD); had extremely high variances when set to fewer than 6 or more than 15 "topics"/components

* Concerned about the possibility of users not tweeting as we would expect given a certain alignment; can only really counteract this by expanding our set of training users

* Regression to project users onto the chart was tricky since we lacked enough datapoints at the extremes of the chart to be able to place users towards the corners -- especially in the populists' quadrant

An example of using the linear regression models to suggest users to follow:

```python
from linear_regression_model import returnRecommendations

returnRecommendations('realDonaldTrump')
```
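How `returnRecommendations` ranks candidates is not shown in this report. One plausible approach, sketched here as an assumption, is to return the users whose estimated chart positions are closest to the given user's. The function name `recommend` is hypothetical; the sample scores are taken from the regression output in Attempt 1.

```python
import math


def recommend(handle, scores, k=3):
    """Return the k handles closest to `handle` on the chart, nearest first."""
    origin = scores[handle]
    others = [
        (math.dist(origin, point), other)
        for other, point in scores.items()
        if other != handle
    ]
    return [other for _, other in sorted(others)[:k]]


# (economic, social) estimates from the regression output in Attempt 1.
scores = {
    "RedState": (0.731189, 0.247807),
    "MichelleMalkin": (0.708965, 0.250840),
    "IngrahamAngle": (0.666145, 0.275623),
    "demsocialists": (0.271233, 0.696731),
}
print(recommend("RedState", scores, k=2))
```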