# Sentiment Analysis of 2020 Democratic Field

**Work in progress**

This notebook takes an early look at how the 2020 Democratic field is faring. Most primary candidates have announced their candidacy and are competing for broader name recognition and early momentum that can carry them deep into the debate season and, eventually, to the nomination. The notebook relies on articles published on Politico.com and runs sentiment analysis on the sentences that mention these candidates by name.

As far as scope is concerned, the basic framework for sentiment analysis on any given day is in place. Extending it to a time series, so that the ebb and flow can be analyzed across weeks and months, is still pending. See the full list of pending enhancements and limitations at the bottom of the page.

## Setup

In [1]:
import nltk
import pickle
import gensim
import pandas as pd
import requests
import string
import datetime

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger
from nltk.stem.porter import *
from gensim.models.word2vec import Word2Vec
from gensim.models.phrases import Phraser, Phrases
from gensim import corpora
from sklearn.manifold import TSNE
from bs4 import BeautifulSoup

%matplotlib inline



In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Retrieve URLs for News Articles

In [3]:
baseUrl1 = "https://www.politico.com/story" # Supposedly contains all Politico stories. Need to test and confirm
baseUrl2 = "https://www.politico.com/news/2020-elections" # Focused on 2020 elections and therefore our base target URL but doesn't have ALL 2020 published stories 
#urlWapo = "https://www.washingtonpost.com/politics/"

baseUrls = []
baseUrls.append(baseUrl1)
baseUrls.append(baseUrl2)
#baseUrls.append(urlFoxN)
#baseUrls.append(urlWapo)

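urls = []
for baseUrl in baseUrls:
  basePage = requests.get(baseUrl).text
  soup = BeautifulSoup(basePage, "lxml")
  # The first link inside each <div class="summary"> is taken as the article URL
  for div in soup.find_all("div", {"class": "summary"}):
    link = div.find('a')
    if link is None or not link.get('href'):
      continue  # skip summary blocks that carry no link
    p = link['href']
    if p not in urls:
      urls.append(p)


len(urls)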

57

In [4]:
page = ''
tagsList = []
tagTypes = ['p']
for url in urls: 
  page = requests.get(url).text 
  newssoup = BeautifulSoup(page, "lxml")  
  tagsList.append(newssoup.find_all(tagTypes))
  
  
len(tagsList)

57

In [0]:
origTxt = ''
for tagsSinglePage in tagsList:
  for tag in tagsSinglePage:
    origTxt += tag.text
#origTxt
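
The loop above concatenates paragraph text with no separator, which is the likely reason sentence boundaries such as `Times.Story Continued Below` run together in the tokenized output. The following is a minimal sketch of one way around that, using hypothetical paragraph strings in place of the real `tag.text` values; joining with a newline should give a sentence tokenizer a whitespace boundary to work with.

```python
# Hypothetical paragraph strings standing in for the `tag.text` values.
paragraphs = [
    "The issuance of the subpoenas was first reported by The New York Times.",
    "Story Continued Below",
    "Democrats were expected to issue additional subpoenas soon.",
]

# What repeated `origTxt += tag.text` produces: paragraphs fused end to end.
fused = "".join(paragraphs)

# Alternative: keep one paragraph per line, skipping empty paragraphs.
separated = "\n".join(p.strip() for p in paragraphs if p.strip())

print("Times.Story" in fused)      # True: the boundary is lost
print("Times.Story" in separated)  # False: a newline now separates the two
```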

# Input Cleanup

## Steps

1. Tokenize at sentence level.
2. Clean the sentences by removing punctuation. Stop words are kept so that phrases stay intact.
3. Manually create list of candidates.
4. Purge sentences that don't have reference to any candidate name.


The term 'candidate' is used somewhat loosely here. It covers those who have officially launched their 2020 candidacy as well as those who have only launched an exploratory committee and have not yet officially declared.
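
The steps above can be sketched end to end on a toy string. This is a simplified, standard-library-only illustration and not the notebook's implementation: the regex splitter is a naive stand-in for `sent_tokenize`, punctuation is stripped per character rather than per token, and the sample text and two-name list are made up for the example.

```python
import re
import string

# Toy input and a shortened, illustrative candidate list.
text = ("Warren took in over $6 million. The weather was mild. "
        "Better than Klobuchar? Voters will decide.")
last_names = ["Warren", "Klobuchar"]

# Step 1: split into sentences (naive stand-in for nltk's sent_tokenize).
sentences = re.split(r"(?<=[.?!])\s+", text.strip())

# Step 2: remove punctuation characters, keeping stop words.
strip_punct = str.maketrans("", "", string.punctuation)
cleaned = [s.translate(strip_punct) for s in sentences]

# Steps 3 and 4: keep only sentences that mention a listed last name.
kept = [s for s in cleaned if any(name in s for name in last_names)]
print(kept)
```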

In [6]:
from nltk.tokenize import PunktSentenceTokenizer
sents_tokenized = sent_tokenize(origTxt)
sents_tokenized[:5]

['\n                House Financial Services Committee Chairwoman Maxine Waters said in a statement that the potential use of the U.S. financial system for illicit purposes was a “very serious concern."',
 '| Zach Gibson/Getty Images\nFinancial Services\n                    By ZACHARY WARMBRODT and JOHN BRESNAHAN04/15/2019 08:20 PM EDTHouse Democrats on Monday issued a subpoena to the German lender Deutsche Bank seeking information on President Donald Trump’s finances, a major escalation of their investigation into his business dealings.In addition to the Deutsche Bank subpoena, House Democrats subpoenaed other banks — including JPMorgan Chase, Bank of America and Citigroup — seeking information on Russian money laundering.',
 'The issuance of the subpoenas was first reported by The New York Times.Story Continued BelowDemocrats were expected to issue additional subpoenas targeting Trump’s finances soon, Democratic aides said.',
 'The House Intelligence and Financial Services Committees

In [7]:
import itertools
stopWords = set(stopwords.words('english') + list(string.punctuation))
for i, val in enumerate(itertools.islice(stopWords, 10)): # print sample from stopWords to ensure the set is populated
  print(val)

you're
myself
were
no
been
#
each
too
she's
aren't


Now, generate sentences with the punctuation removed (stop words are kept) and list a sample at the end.

In [8]:
sents_ClnTknzd = []
punctuations = list(string.punctuation) # Only remove punctuation. Keep stop words for phrases and unabbreviated forms

for sent in sents_tokenized:
  # Drop punctuation-only tokens and rejoin the remaining tokens with single spaces
  tokens = [token for token in nltk.word_tokenize(sent) if token not in punctuations]
  sents_ClnTknzd.append(' '.join(tokens))

sents_ClnTknzd[:5]

["House Financial Services Committee Chairwoman Maxine Waters said in a statement that the potential use of the U.S. financial system for illicit purposes was a “ very serious concern ''",
 'Zach Gibson/Getty Images Financial Services By ZACHARY WARMBRODT and JOHN BRESNAHAN04/15/2019 08:20 PM EDTHouse Democrats on Monday issued a subpoena to the German lender Deutsche Bank seeking information on President Donald Trump ’ s finances a major escalation of their investigation into his business dealings.In addition to the Deutsche Bank subpoena House Democrats subpoenaed other banks — including JPMorgan Chase Bank of America and Citigroup — seeking information on Russian money laundering',
 'The issuance of the subpoenas was first reported by The New York Times.Story Continued BelowDemocrats were expected to issue additional subpoenas targeting Trump ’ s finances soon Democratic aides said',
 'The House Intelligence and Financial Services Committees are ratcheting up their coordinated inves

In [9]:
candidates = ['Joe Biden', 'Cory Booker', 'Pete Buttigieg',
              'Julian Castro', 'John Delaney', 'Tulsi Gabbard',
              'Kirsten Gillibrand', 'Kamala Harris', 'John Hickenlooper',
              'Jay Inslee', 'Amy Klobuchar', 'Beto O’Rourke',
              'Bernie Sanders', 'Howard Schultz', 'Elizabeth Warren', 
              'Marianne Williamson', 'Andrew Yang']

# The news articles will rarely mention full name. Use last names to capture references to candidates. 
lastNames = []
for candidate in candidates:
    lastNames.append(candidate.split()[1])

lastNames

['Biden',
 'Booker',
 'Buttigieg',
 'Castro',
 'Delaney',
 'Gabbard',
 'Gillibrand',
 'Harris',
 'Hickenlooper',
 'Inslee',
 'Klobuchar',
 'O’Rourke',
 'Sanders',
 'Schultz',
 'Warren',
 'Williamson',
 'Yang']

Now, create a filter that keeps only the sentences referring to at least one member of this list and removes sentences that don't mention a candidate's name directly. The downside is that we lose sentences where a candidate is referred to only by a pronoun. That may be incorporated later with NLP analysis that can capture this level of contextual information.

In [10]:

sentsNameReferences = []
for sent in sents_ClnTknzd:
  for lName in lastNames:
    if lName in sent:
      #print(sent)
      sentsNameReferences.append(sent)

sentsUnqRefs = set(sentsNameReferences)
for i, val in enumerate(itertools.islice(sentsUnqRefs, 5)):# print sample to ensure set is populated
  print(val)

Buttigieg who is openly gay has gained notice for his criticism of Vice President Mike Pence a former Indiana governor and an opponent of same-sex marriage
Read More » The early fundraising reveals that Sen. Elizabeth Warren is spending like a front-runner but isn ’ t yet raising money like one
Warren took in over 6 million and Sens
In addition to exceeding basement-level expectations for his candidacy Buttigieg has fashioned a headline-grabbing foil out of his fellow mild-mannered Hoosier Vice President Mike Pence the former Indiana governor with whom Buttigieg has had a long and not always contentious relationship
Better than Klobuchar


In [11]:
len(sentsUnqRefs)

212

# Analysis

At this point we have a list of sentences with clean text. Next we capture the available information in a DataFrame: each candidate's last name, the number of sentences in the target corpus that mention that name, the sentiment values for those sentences, and so on.

In [12]:
# Start every candidate's reference count at zero
mapCandCounts = dict.fromkeys(lastNames, 0)
dictCandidatesAndSents = {}

for lName in mapCandCounts:
  lstTemp = [] 
  for sent in sentsUnqRefs:
    if lName in sent:
      mapCandCounts[lName] += 1
      lstTemp.append(sent)
      
  dictCandidatesAndSents[lName] = lstTemp
      

mapCandCounts
dfrm = pd.DataFrame(list(mapCandCounts.items()), columns=['Candidate', 'RefCounts'])
dfrm

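|   | Candidate | RefCounts |
|---|-----------|-----------|
| 0 | Biden | 13 |
| 1 | Booker | 12 |
| 2 | Buttigieg | 73 |
| 3 | Castro | 9 |
| 4 | Delaney | 0 |
| 5 | Gabbard | 0 |
| 6 | Gillibrand | 13 |
| 7 | Harris | 28 |
| 8 | Hickenlooper | 0 |
| 9 | Inslee | 13 |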


## Sentiment Analysis

We will use the TextBlob package to score the sentiment of each sentence and group the results by candidate. First, we import TextBlob and write two simple lambda functions that return its polarity (a float in [-1.0, 1.0]) and subjectivity (a float in [0.0, 1.0]) for a piece of text.

In [0]:
import textblob
from textblob import TextBlob
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

In [14]:
# Generate zero-filled placeholders, one per candidate. Convert the lists into pandas Series, merge them into dfrm, and print dfrm.
polarity = [0.0] * len(lastNames)
subjectivity = [0.0] * len(lastNames)

polSeries = pd.Series(data=polarity, name="Polarity")
subSeries = pd.Series(data=subjectivity, name="Subjectivity")
dfrm = pd.concat([dfrm, polSeries, subSeries], axis = 1)
dfrm

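|   | Candidate | RefCounts | Polarity | Subjectivity |
|---|-----------|-----------|----------|--------------|
| 0 | Biden | 13 | 0.0 | 0.0 |
| 1 | Booker | 12 | 0.0 | 0.0 |
| 2 | Buttigieg | 73 | 0.0 | 0.0 |
| 3 | Castro | 9 | 0.0 | 0.0 |
| 4 | Delaney | 0 | 0.0 | 0.0 |
| 5 | Gabbard | 0 | 0.0 | 0.0 |
| 6 | Gillibrand | 13 | 0.0 | 0.0 |
| 7 | Harris | 28 | 0.0 | 0.0 |
| 8 | Hickenlooper | 0 | 0.0 | 0.0 |
| 9 | Inslee | 13 | 0.0 | 0.0 |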


Now we go through the list of sentences tagged for each candidate, calculate the average sentiment scores, and put those scores in our DataFrame.

In [15]:
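for lName, sents in dictCandidatesAndSents.items():
  index = dfrm.index[dfrm['Candidate'] == lName]
  polarity = 0.0; subjectivity = 0.0; count = 0
  for sent in sents:
      polarity += pol(sent)
      # Accumulate with += so the mean covers every sentence. With a plain "=" only the
      # last sentence's score is kept, which is what the Subjectivity column shown below reflects.
      subjectivity += sub(sent)
      count += 1
  if (count > 0):
    dfrm.loc[index, 'Polarity'] = polarity / count
    dfrm.loc[index, 'Subjectivity'] = subjectivity / count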

dfrm

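|   | Candidate | RefCounts | Polarity | Subjectivity |
|---|-----------|-----------|----------|--------------|
| 0 | Biden | 13 | -0.040237 | 0.017308 |
| 1 | Booker | 12 | 0.138828 | 0.038889 |
| 2 | Buttigieg | 73 | 0.1051 | 0.0 |
| 3 | Castro | 9 | 0.063194 | 0.077778 |
| 4 | Delaney | 0 | 0.0 | 0.0 |
| 5 | Gabbard | 0 | 0.0 | 0.0 |
| 6 | Gillibrand | 13 | 0.065571 | 0.028846 |
| 7 | Harris | 28 | 0.123319 | 0.01824 |
| 8 | Hickenlooper | 0 | 0.0 | 0.0 |
| 9 | Inslee | 13 | 0.094188 | 0.005128 |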
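
The averaging step amounts to a per-candidate mean over sentence-level scores, with polarity and subjectivity each summed over every sentence before dividing. Here is a minimal sketch of that arithmetic; the `toy_scores` dictionary and its numbers are made up for illustration and are not TextBlob output.

```python
from statistics import mean

# Made-up (polarity, subjectivity) pairs per sentence; not real TextBlob scores.
toy_scores = {
    "Biden": [(-0.2, 0.4), (0.1, 0.6)],
    "Delaney": [],  # no sentence mentioned this name
}

averages = {}
for name, pairs in toy_scores.items():
    if pairs:
        avg_polarity = mean(p for p, _ in pairs)
        avg_subjectivity = mean(s for _, s in pairs)
        averages[name] = (avg_polarity, avg_subjectivity)
    else:
        # Mirror the notebook's default of 0.0 for candidates with no sentences.
        averages[name] = (0.0, 0.0)

print(averages)
```

A candidate with no matching sentences keeps a score of 0.0, which is indistinguishable from a genuinely neutral average, so `RefCounts` should be read alongside the scores.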


## Pending / Possible Enhancements:
#### Priority Items
- P0: Change the analysis to be time-series based to see ebb and flow across time. 

- P0: Beto O'Rourke's name is coming up as "Beto O ’ Rourke" because of the way data is assembled back again after cleanup. Bring that in line with how the articles publish his name.

- P0: How can we confirm that the sentiment is about the candidate herself rather than about a topic she is addressing or is mentioned in the context of?

- P0: There is a reporter, John F. Harris, who shares a last name with Kamala Harris, which results in false positives. Such sentences need to be excluded unless they also contain a genuine reference to a candidate.
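
The two name-matching items above (the split "O ’ Rourke" and the John F. Harris collision) could be approached along the following lines. This is only a sketch under stated assumptions: the helper `candidates_mentioned` and the `not_candidates` exclusion list are hypothetical names introduced here, the last-name list is shortened, and the exact form in which the reporter's name appears in the cleaned sentences has not been verified.

```python
import re

last_names = ["Harris", "O’Rourke", "Warren"]  # shortened list for illustration
# Hypothetical exclusion list: non-candidates whose names collide with a last name.
not_candidates = ["John F. Harris"]

def candidates_mentioned(sentence):
    """Return the last names from `last_names` that the sentence refers to."""
    # Rejoin apostrophes that tokenization split apart: "O ’ Rourke" -> "O’Rourke".
    text = re.sub(r"\s+([’'])\s+", r"\1", sentence)
    # Blank out colliding non-candidate names before matching.
    for other in not_candidates:
        text = text.replace(other, " ")
    # Word boundaries keep "Harris" from matching inside a longer word.
    return [n for n in last_names
            if re.search(r"\b" + re.escape(n) + r"\b", text)]

print(candidates_mentioned("Beto O ’ Rourke and Warren spoke"))
print(candidates_mentioned("By John F. Harris"))
print(candidates_mentioned("John F. Harris interviewed Kamala Harris"))
```

A sentence is kept only when the returned list is non-empty, so a line that mentions the reporter alone drops out while one that also names a candidate survives.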



#### Low Priority Items
- P1: Rather than calculating sentiment of the entire corpus in one go, let us check the sentiment by sentence. 
- P1: Topic modeling to assess which topics generate the highest sentiment scores for a candidate. This can help show which topics a candidate is most comfortable messaging on.
- P1: As part of data cleanup, all sentences without a direct reference to a candidate's last name were removed. This included sentences where candidates were referred to by their respective gendered pronouns. Need to find ways to capture that level of contextual information. Check whether BERT can account for something like this.
- P1: Extend this to Twitter and other major political messaging platforms. 
- P1: Compare Politico's sentiment scores for a candidate to those from left- and right-leaning news sources to check whether Politico's reporting is in the middle.
- P1: Retrieve data from secondary sources like Fox News, CNN: 

      """
      urlFoxN = "https://www.foxnews.com/politics"

      Fox News will require "m" class divs and then appending the base HTTP address since links lack that. 
      if (baseUrl == 'https://www.foxnews.com/politics'): 
          divs = soup.findAll("div", {"class": "m"})

      CNN will require first HTML and then JSON parsing for list of urls

      """

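For the Fox News note above about links lacking the base address, the standard library's `urllib.parse.urljoin` is one way to resolve relative links against the page URL. The `hrefs` values below are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

base_url = "https://www.foxnews.com/politics"
# Hypothetical href values: one relative, one already absolute.
hrefs = [
    "/politics/example-story",
    "https://www.foxnews.com/politics/other-story",
]

# urljoin resolves relative paths against the base and leaves absolute URLs as-is.
absolute = [urljoin(base_url, h) for h in hrefs]
print(absolute)
```

Using `urljoin` rather than string concatenation avoids doubled slashes and handles pages that mix relative and absolute links.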

#### Other Data Sources
- https://www.kaggle.com/benhamner/2016-us-election
- https://www.kaggle.com/c/can-we-predict-voting-outcomes/data
