### Capstone Project : Political Ideology persona 
#### Author : Kumud Chauhan
#### Professor : Randi Griffin

## INTRODUCTION

In this assignment, we train a political ideology persona. The persona reads a message and returns an interest score between 0 and 1: a score close to 0 means the text leans towards Democrats, and a score close to 1 means it leans towards Republicans.
To train the persona, we use a Twitter dataset from Kaggle that contains 86,460 tweets labeled as Republican or Democrat. (Data source: https://www.kaggle.com/kapastor/democratvsrepublicantweets#ExtractedTweets.csv)



We perform several tasks in this assignment, which are as follows:


1.   Train a persona that assigns an interest score to text that may be of interest to a Democrat persona.
2.   Run a lift analysis that shows how the interest score changes when a word is deleted.
3.   Populate a table with the top 5 words that raise the interest score and the worst 5 words that lower it.



In [0]:
# import necessary libraries and packages
import numpy as np
import pandas as pd
import re
from nltk.tokenize import TweetTokenizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
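# sklearn.feature_extraction.stop_words was removed in recent scikit-learn releases;
# the text module exposes the same ENGLISH_STOP_WORDS set
from sklearn.feature_extraction import text as stop_words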
import tensorflow as tf
from tensorflow import keras
tf.random.set_seed(2)

In [2]:
from google.colab import files
uploaded = files.upload()

#from sklearn.model_selection import train_test_split
# data source: https://www.kaggle.com/kapastor/democratvsrepublicantweets#ExtractedTweets.csv
import io
df = pd.read_csv(io.BytesIO(uploaded['ExtractedTweets.csv']))

Saving ExtractedTweets.csv to ExtractedTweets.csv


In [4]:
# Check how our data looks
df.head()

| | Party | Handle | Tweet |
|---|---|---|---|
| 0 | Democrat | RepDarrenSoto | Today, Senate Dems vote to #SaveTheInternet. P... |
| 1 | Democrat | RepDarrenSoto | RT @WinterHavenSun: Winter Haven resident / Al... |
| 2 | Democrat | RepDarrenSoto | RT @NBCLatino: .@RepDarrenSoto noted that Hurr... |
| 3 | Democrat | RepDarrenSoto | RT @NALCABPolicy: Meeting with @RepDarrenSoto ... |
| 4 | Democrat | RepDarrenSoto | RT @Vegalteno: Hurricane season starts on June... |


In [5]:
# check for class bias
df.groupby('Party').count()

| Party | Handle | Tweet |
|---|---|---|
| Democrat | 42068 | 42068 |
| Republican | 44392 | 44392 |


#### Data preprocessing

In [0]:
# process the tweets for training the persona
tweet_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

def clean_tweet(tweet):
   tweet = re.sub(r'http\S+', '', tweet) #remove links
   tweet = re.sub("[^a-zA-Z]", " ", tweet) #remove all characters except letters
   tweet = " ".join(tweet_tokenizer.tokenize(tweet)) #join the clean text as string
   return tweet 

# apply the function on raw data to get processed data
df['clean'] = df['Tweet'].apply(clean_tweet)
processed_df = df[df['clean'].apply(len) > 0]
le = LabelEncoder()
labels = le.fit_transform(processed_df['Party'])
## Data selection for training and testing
X_train, X_test, y_train, y_test = train_test_split( processed_df['clean'], labels , test_size=0.10, random_state=1)

## Training Political Ideology persona model 

This section is divided into two parts, in which we explore different methods for this task and compare them to choose the best model for the rest of the project.
### Part a: Training a logistic-regression-based persona model
In this part, we train the persona with traditional ML methods: `tfidf` features and logistic regression.
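
To make the vectorizer settings used in this part concrete, here is a minimal pure-Python sketch of what `ngram_range=(1, 3)` and `min_df=3` mean: every run of one to three consecutive tokens becomes a candidate feature, and a feature is kept only if it occurs in at least three documents. The toy corpus and the helper names `ngrams` and `build_vocabulary` are illustrative assumptions, not part of the notebook, and whitespace splitting stands in for the vectorizer's real tokenization. `TfidfVectorizer` additionally weights the surviving features by tf-idf, which is not shown here.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams with n_min <= n <= n_max as strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_vocabulary(docs, min_df=3):
    """Keep the n-grams that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(): count each n-gram once per document (document frequency)
        doc_freq.update(set(ngrams(doc.lower().split())))
    return sorted(g for g, df in doc_freq.items() if df >= min_df)

# Hypothetical toy corpus, invented for this example
toy_docs = [
    "protect public health care",
    "public health care now",
    "fund public health care",
    "cut taxes now",
]
vocab = build_vocabulary(toy_docs, min_df=3)
print(vocab)
```

Only the n-grams shared by the first three toy documents survive the filter, which is how `min_df` keeps the feature space manageable on the real tweets.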


In [9]:
# Vectorize with TfidfVectorizer, using unigrams up to trigrams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1,3),min_df = 3)
X_train_tfidf = tf_idf_vectorizer.fit_transform(X_train)
X_train_tfidf.shape

(77549, 137068)

In [13]:
### Interest scores predictions using Traditional ML method: Logistic Regression
clf = LogisticRegression(C=2, max_iter=200).fit(X_train_tfidf, y_train)
X_test_tfidf = tf_idf_vectorizer.transform(X_test)
predicted = clf.predict(X_test_tfidf)
print("Test accuracy",(accuracy_score(y_test, predicted)))

Test accuracy 0.8045723569687826


We achieved about 80.46% accuracy on the test dataset.

### Part b: Training an LSTM-based persona model
In this part, we train the persona with deep learning methods: `word embeddings` and an LSTM neural network.
First, we train a bidirectional LSTM model with randomly initialized embeddings; then we train an LSTM model with Twitter pre-trained `GloVe embeddings`.
### Training LSTM model with randomly initialized embeddings



In [0]:
MAXLEN= 50
tokenizer = keras.preprocessing.text.Tokenizer(25000)
tokenizer.fit_on_texts(X_train)
x_train_seq = tokenizer.texts_to_sequences(X_train)
x_test_seq = tokenizer.texts_to_sequences(X_test)
x_train_seq_padded = keras.preprocessing.sequence.pad_sequences(x_train_seq, maxlen=MAXLEN)
x_test_seq_padded = keras.preprocessing.sequence.pad_sequences(x_test_seq,maxlen=MAXLEN)
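
The cell above turns each tweet into a fixed-length integer sequence. The sketch below imitates that step in plain Python under simplifying assumptions: the helper names and toy sentences are hypothetical, whitespace splitting replaces the Keras tokenizer's filtering, ties in word frequency are broken alphabetically, and only the default `pad_sequences` behaviour (zeros added at the front, overly long sequences cut from the front) is reproduced.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank: 1 = most frequent; 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ranked, start=1)}

def to_padded_sequence(text, word_index, maxlen):
    """Encode a text, drop unknown words, then left-pad or left-truncate to maxlen."""
    seq = [word_index[w] for w in text.lower().split() if w in word_index]
    seq = seq[-maxlen:]                      # keep the last maxlen tokens
    return [0] * (maxlen - len(seq)) + seq   # pad with zeros at the front

# Hypothetical toy training texts
word_index = fit_word_index(["we fight for health", "we fight for tax cuts"])
print(to_padded_sequence("we fight for jobs", word_index, maxlen=5))
```

The word "jobs" never occurred in the toy training texts, so it is dropped before padding, just as out-of-vocabulary words are dropped when no OOV token is configured.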

In [16]:
##### LSTM model
lstm_model = keras.Sequential([ 
    keras.layers.Embedding(25000,100, input_length = MAXLEN),
    keras.layers.Bidirectional(keras.layers.LSTM(512)),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile model
opt = keras.optimizers.Adam(learning_rate=0.005, beta_1=0.9, beta_2=0.999, amsgrad=False)
lstm_model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

lstm_model.fit(x_train_seq_padded, y_train, epochs=8, batch_size=512)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<tensorflow.python.keras.callbacks.History at 0x7f2172f38278>

In [17]:
print("Test accuracy of lstm_model",lstm_model.evaluate(x_test_seq_padded, y_test, verbose=0)[1])

Test accuracy of lstm_model 0.7773006558418274


### LSTM model with pre-trained twitter embedding

First, download and unzip the pre-trained embeddings  

In [19]:
! wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
! unzip glove.twitter.27B.zip

--2020-04-01 16:01:35--  http://nlp.stanford.edu/data/glove.twitter.27B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.twitter.27B.zip [following]
--2020-04-01 16:01:35--  https://nlp.stanford.edu/data/glove.twitter.27B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.twitter.27B.zip [following]
--2020-04-01 16:01:35--  http://downloads.cs.stanford.edu/nlp/data/glove.twitter.27B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1520408563 (1.4G) [appli

In [20]:
## using pre-trained glove embedding
embeddings_index = {}
# GloVe files are UTF-8 text: each line holds a word followed by its vector
with open('glove.twitter.27B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        word_embeddings = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = word_embeddings

print('Found %s word vectors.' % len(embeddings_index))
# creating embedding matrix for the words that exists in our dataset 
# from the pre-trained glove embedding
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

Found 1193514 word vectors.


In [21]:
##### LSTM model with pre-trained Glove embeddings 
lstm_glove_model = keras.Sequential([ 
    keras.layers.Embedding(len(tokenizer.word_index) + 1,
                           100, 
                           weights=[embedding_matrix],
                           input_length=MAXLEN
                           ),
    keras.layers.Bidirectional(keras.layers.LSTM(512)),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile model
#opt = keras.optimizers.Adam(learning_rate=0.005, beta_1=0.9, beta_2=0.999, amsgrad=False)
lstm_glove_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_glove_model.fit(x_train_seq_padded, y_train, epochs=8, batch_size=512)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<tensorflow.python.keras.callbacks.History at 0x7f20a2b66780>

In [22]:
print("Test accuracy of lstm_model with glove embeddings",lstm_glove_model.evaluate(x_test_seq_padded, y_test, verbose=0)[1])

Test accuracy of lstm_model with glove embeddings 0.8006266951560974


### Model comparison

On comparing the methods, we conclude that traditional and deep learning methods give similar performance on this dataset: about 80% test accuracy for both logistic regression and the LSTM with GloVe embeddings, and about 78% for the LSTM with randomly initialized embeddings. Since the dataset consists of tweets that are both political and apolitical, we consider 80% accuracy reasonable. We expect better accuracy with a larger dataset. The same codebase can be adapted to any text-based binary classification task.

### PART 1: Persona Score
 Here, we define persona score prediction functions for both logistic regression as well as LSTM based persona models. 

In [0]:
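def persona_score_one_hot_models(model,text,tf_idf_vectorizer):
    text_tfidf = tf_idf_vectorizer.transform([text])
    interest_score = model.predict_proba(text_tfidf)
    # column 1 is P(Republican): LabelEncoder maps Democrat -> 0, Republican -> 1,
    # so this matches the convention of the LSTM score (close to 1 = Republican)
    return interest_score[0][1]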

In [0]:
def persona_score_LSTM(model, text, tokenizer):
    x_seq = tokenizer.texts_to_sequences([text])
    x_seq_padded = keras.preprocessing.sequence.pad_sequences(x_seq, MAXLEN)
    interest_score = model.predict(x_seq_padded)
    return interest_score[0][0]

In [0]:
### Two example tweets from Twitter (links provided)
## We use these examples for the lift analysis and for explaining the results

# https://twitter.com/SpeakerPelosi/status/1244295869152854018
dem_tweet = "The #CARESAct was just a down payment in the fight against the coronavirus. We can and will do more to help state & local governments as they fight this public health crisis."

# https://twitter.com/senatemajldr/status/1217613653106679808
rep_tweet = "Democrats’ impeachment has been nakedly partisan from the beginning. Pelosi admits it was in the making years before events with Ukraine. Schumer says that whatever happens, if it helps him politically, it’s a “win-win.” They are playing political games with the Constitution."


In [0]:
def predict_score(text):
  return persona_score_LSTM(lstm_glove_model, text, tokenizer)

In [185]:
predict_score(rep_tweet)

0.8473714

In [186]:
predict_score(dem_tweet)

0.4072026

The model predicts an interest score of 0.85 for rep_tweet and 0.41 for dem_tweet. Both scores fall on the expected side of 0.50, which suggests that the model behaves sensibly on these examples.

As mentioned earlier, a score greater than 0.50 means that the tweet shows Republican ideology, and a score below 0.50 means that it shows Democratic (left) ideology.
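
A small helper makes this decision rule explicit. It is an illustrative sketch only: the function name `ideology_label` and the use of 0.50 as a hard cut-off are assumptions made for the example. The class order follows from `LabelEncoder`, which sorts class names so that Democrat becomes 0 and Republican becomes 1.

```python
def ideology_label(score, threshold=0.5):
    """Map an interest score in [0, 1] to the party it leans towards."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return "Republican" if score > threshold else "Democrat"

# Scores reported above for the two example tweets
print(ideology_label(0.8473714))
print(ideology_label(0.4072026))
```

Scores close to 0.50 are better read as undecided than as a firm label, so in practice the raw score is more informative than the label alone.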

We scored the same test message with both models and found that the Republican tweet receives a very low Democrat interest score (that is, a score close to 1), which is expected since it is a Republican tweet.
In the next part, we will see which words can shift the score towards the Democratic side.

## Part 2: LIFT ANALYSIS
In this part, we delete the tokens of a message one at a time and compute an alternative score for each shortened message. The change of the alternative score with respect to the baseline score is recorded as that token's contribution. We use the same examples as above for the analysis.
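
The procedure can be sketched without the trained model. In the example below, `toy_score` is a hypothetical stand-in for the persona (it is not the LSTM, and its word weights are invented for illustration), and `leave_one_out_lift` records the lift per token position so that repeated words keep separate entries. Unlike the notebook's pipeline, this sketch does not skip stop words.

```python
def toy_score(text):
    """Hypothetical scorer: start at 0.5 and shift by invented per-word weights."""
    weights = {"taxes": 0.2, "border": 0.1, "health": -0.2, "public": -0.1}
    score = 0.5 + sum(weights.get(w, 0.0) for w in text.lower().split())
    return min(1.0, max(0.0, score))

def leave_one_out_lift(predict_fn, text):
    """Return (position, token, lift), where lift = score without the token - baseline."""
    tokens = text.lower().split()
    baseline = predict_fn(text)
    lifts = []
    for i, token in enumerate(tokens):
        # drop the i-th token and re-score the shortened text
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        lifts.append((i, token, round(predict_fn(reduced) - baseline, 2)))
    return lifts

for position, token, lift in leave_one_out_lift(toy_score, "public health and taxes"):
    print(position, token, lift)
```

A positive lift means the score rises once the token is removed, so the token itself was pulling the message towards 0 (Democrat); a negative lift means the token was pulling it towards 1 (Republican).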

In [0]:
## message optimization pipeline
def lift_analysis(predict_fn, text, print_baseline=True, print_alternative=True):
    # score of the unmodified text
    baseline = predict_fn(text)
    if print_baseline:
        print("The baseline score for the given text is", baseline)
    tokens = text.lower().split()
    lift_score = {}
    for i in range(len(tokens)):
        # stop words are skipped, so they never get a lift score
        if tokens[i] in stop_words.ENGLISH_STOP_WORDS:
            continue
        # drop the i-th token and re-score the shortened text
        subtokens = tokens[:i] + tokens[i+1:]
        updated_text = ' '.join(subtokens)
        alternative = predict_fn(updated_text)
        if print_alternative:
            print("Dropping '{}', updated score: {}".format(tokens[i], alternative))
        # a repeated token keeps only the lift of its last occurrence
        lift_score[tokens[i]] = round(alternative - baseline, 2)
    return lift_score
    

In [213]:
lift_scores = lift_analysis(predict_score, dem_tweet, print_baseline = True, 
                            print_alternative = True )

The baseline score for the given text is 0.4072026
Dropping '#caresact', updated score: 0.4072026014328003
Dropping 'just', updated score: 0.37688830494880676
Dropping 'payment', updated score: 0.5570924878120422
Dropping 'fight', updated score: 0.49595290422439575
Dropping 'coronavirus.', updated score: 0.4072026014328003
Dropping 'help', updated score: 0.38538673520088196
Dropping 'state', updated score: 0.34755393862724304
Dropping '&', updated score: 0.4072026014328003
Dropping 'local', updated score: 0.4402707517147064
Dropping 'governments', updated score: 0.2475307732820511
Dropping 'fight', updated score: 0.5116938948631287
Dropping 'public', updated score: 0.5888792872428894
Dropping 'health', updated score: 0.5408839583396912
Dropping 'crisis.', updated score: 0.223998561501503


The dem_tweet example above shows that the updated score shifts towards Republican when we delete the word "payment", "public" or "health" from the tweet. These tokens are therefore important for the predicted score: dropping any one of them moves the score above 0.50, which turns the tweet into a right-leaning one for the model.

In [214]:
lift_scores = lift_analysis(predict_score, rep_tweet, print_baseline = True, 
                            print_alternative = True)

The baseline score for the given text is 0.8473714
Dropping 'democrats’', updated score: 0.8473713994026184
Dropping 'impeachment', updated score: 0.9993457198143005
Dropping 'nakedly', updated score: 0.8473713994026184
Dropping 'partisan', updated score: 0.9303543567657471
Dropping 'beginning.', updated score: 0.5395645499229431
Dropping 'pelosi', updated score: 0.456940621137619
Dropping 'admits', updated score: 0.978834867477417
Dropping 'making', updated score: 0.7624854445457458
Dropping 'years', updated score: 0.7941723465919495
Dropping 'events', updated score: 0.9983073472976685
Dropping 'ukraine.', updated score: 0.11966124922037125
Dropping 'schumer', updated score: 0.002879971405491233
Dropping 'says', updated score: 0.9889532327651978
Dropping 'happens,', updated score: 0.33020299673080444
Dropping 'helps', updated score: 0.7244032025337219
Dropping 'politically,', updated score: 0.9885335564613342
Dropping 'it’s', updated score: 0.8473713994026184
Dropping '“win-win.”', up

The Republican tweet above shows that if we delete the word 'impeachment', 'partisan' or 'says' from the tweet, the model becomes more confident that it is a Republican tweet. However, if we delete the term 'schumer' or 'ukraine.', the updated score shifts towards the Democratic party.

## PART 3: Top words that lift the persona score and worst words that lower it


In [0]:
# Print top 5 words that lifts the score
def top_5_words_contributing_persona_score(text):
    lift_score = lift_analysis(predict_score, text, print_baseline = False, 
                               print_alternative = False)
    lift_df = pd.DataFrame([lift_score], columns=lift_score.keys())
    lift_df = lift_df.T
    lift_df.columns = ['lift_score']
    return lift_df.sort_values(by=['lift_score'], ascending=False)[:5]

In [0]:
# Print worst 5 words that affects the score negatively
def worst_5_words_affecting_persona_score(text):
    lift_score = lift_analysis(predict_score, text, print_baseline = False, 
                               print_alternative = False)
    lift_df = pd.DataFrame([lift_score], columns=lift_score.keys())
    lift_df = lift_df.T
    lift_df.columns = ['lift_score']
    return lift_df.sort_values(by=['lift_score'], ascending=True)[:5]

The lift analysis shows which words in a tweet matter for the model's prediction of the political party. We now look at the top 5 words whose deletion raises the score the most and the worst 5 words whose deletion lowers it the most.
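
The ranking itself needs no model once the per-token scores are known. As a hedged illustration, the sketch below re-derives the two rankings for `dem_tweet` from the updated scores printed in Part 2, using plain Python instead of pandas. The names `updated_scores`, `lifts`, `top_5` and `worst_5` are introduced here for the example only. As in `lift_analysis`, a repeated token (here 'fight') keeps only the score of its last occurrence.

```python
baseline = 0.4072026  # baseline score printed for dem_tweet

# Updated scores printed by lift_analysis for dem_tweet
updated_scores = {
    "#caresact": 0.4072026014328003,
    "just": 0.37688830494880676,
    "payment": 0.5570924878120422,
    "coronavirus.": 0.4072026014328003,
    "help": 0.38538673520088196,
    "state": 0.34755393862724304,
    "&": 0.4072026014328003,
    "local": 0.4402707517147064,
    "governments": 0.2475307732820511,
    "fight": 0.5116938948631287,  # last occurrence of 'fight'
    "public": 0.5888792872428894,
    "health": 0.5408839583396912,
    "crisis.": 0.223998561501503,
}

# lift = updated score - baseline, rounded as in lift_analysis
lifts = {token: round(score - baseline, 2) for token, score in updated_scores.items()}
ranked = sorted(lifts.items(), key=lambda item: item[1], reverse=True)
top_5 = ranked[:5]          # largest positive lift first
worst_5 = ranked[::-1][:5]  # most negative lift first
print(top_5)
print(worst_5)
```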

In [196]:
 top_5_words_contributing_persona_score(rep_tweet)

| | lift_score |
|---|---|
| events | 0.15 |
| impeachment | 0.15 |
| politically, | 0.14 |
| says | 0.14 |
| admits | 0.13 |


In [215]:
worst_5_words_affecting_persona_score(rep_tweet)

| | lift_score |
|---|---|
| schumer | -0.84 |
| ukraine. | -0.73 |
| happens, | -0.52 |
| games | -0.43 |
| pelosi | -0.39 |


From the two tables above for the Republican tweet, we observe that deleting the words "events" and "impeachment" raises the score the most, making the tweet even more Republican-leaning. As expected, the words "schumer" and "pelosi" affect the score negatively: when we delete them, the tweet shifts towards the Democratic (or neutral) side.

In [199]:
 top_5_words_contributing_persona_score(dem_tweet)

| | lift_score |
|---|---|
| public | 0.18 |
| payment | 0.15 |
| health | 0.13 |
| fight | 0.1 |
| local | 0.03 |


In [198]:
worst_5_words_affecting_persona_score(dem_tweet)

| | lift_score |
|---|---|
| crisis. | -0.18 |
| governments | -0.16 |
| state | -0.06 |
| just | -0.03 |
| help | -0.02 |


Similarly, from the two tables above for the Democratic tweet, we observe that "public", "payment" and "health" have the largest positive lift: if we delete these words, the tweet shifts towards Republican. If we instead delete the word "crisis." or "governments", the score drops, that is, the tweet becomes more Democratic-leaning.

## Conclusion


*   In this project, we explored both traditional and deep learning methods to train a political ideology persona.
*  We observed that pre-trained Twitter GloVe embeddings perform slightly better than randomly initialized word embeddings.
*  Since the dataset also contains some neutral tweets, such as congratulatory messages and festival or New Year wishes, we believe that training on a more politically focused dataset will lead to a better persona.
* Since we use deep learning models (specifically bidirectional LSTMs), which capture semantic meaning, and the meaning of a word depends on the context in which it appears, we did not remove stopwords. Removing stopwords could discard important information that the sequence model relies on. That is why some common words appear in our top and worst word lists.
*  In the lift analysis, we drop tokens one at a time and compute the change in score. This loses some contextual (semantic and syntactic) information, because the sentence fed to the LSTM model may be grammatically incorrect.


### Future work


*   In future, we would like to train the persona on a much larger corpus that includes political speeches, Reddit discussions and news articles as well as Twitter data. To improve the interest score, we will also try a more complex classifier such as BERT.
*   Further, recent advancements in NLP such as text style transfer could be used to change a message and convert it into a Democratic tweet, and vice versa.



