# Sentiment Analysis of 2020 Democratic Field

This notebook will take an early look into how the 2020 Democratic field is faring. Majority of primary candidates have announced their candidature and are jostling for making early in-roads for broader name recognition and gaining momentum initialy to last deep into debate time and eventually to become the candidate. This notebook will rely on articles published on Politico.com and run sentiment analysis against the names of these candidates. 

## Setup

In [0]:
import nltk
import pickle
import gensim
import pandas as pd
import requests
import string
import datetime

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk import word_tokenize, sent_tokenize
from nltk.stem.porter import *
from gensim.models.word2vec import Word2Vec
from gensim.models.phrases import Phraser, Phrases
from gensim import corpora
from sklearn.manifold import TSNE
from bs4 import BeautifulSoup

%matplotlib inline

In [0]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Retrieve URLs for News Articles

In [0]:
baseUrl1 = "https://www.politico.com/story" # Supposedly contains all Politico stories. Need to test and confirm
baseUrl2 = "https://www.politico.com/news/2020-elections" # Focused on 2020 elections and therefore our base target URL but doesn't have ALL 2020 published stories 
#urlWapo = "https://www.washingtonpost.com/politics/"

baseUrls = []
baseUrls.append(baseUrl1)
baseUrls.append(baseUrl2)
baseUrls.append(urlFoxN)
#baseUrls.append(urlWapo)

urls = []
divs = []
for baseUrl in baseUrls:
  basePage = requests.get(baseUrl).text
  soup = BeautifulSoup(basePage, "lxml")  
  divs = soup.findAll("div", {"class": "summary"})
  for div in divs:
    p = div.find('a')['href']
    if (p not in urls):
      urls.append(p)
      
      
len(urls)
urls

['https://www.politico.com/story/2019/04/01/trump-budget-cut-special-olympics-1308298',
 'https://www.politico.com/story/2019/04/01/trump-disaster-aid-bill-1246740',
 'https://www.politico.com/story/2019/04/01/kirstjen-nielsen-asylum-seekers-mexico-1309229',
 'https://www.politico.com/story/2019/04/01/senate-democrats-electoral-college-1246521',
 'https://www.politico.com/story/2019/04/01/lujan-running-senate-new-mexico-1246500',
 'https://www.politico.com/story/2019/04/01/white-house-security-clearance-problems-1246432',
 'https://www.politico.com/story/2019/04/01/trump-census-citizenship-1246499',
 'https://www.politico.com/story/2019/04/01/trump-democrats-mueller-1246401',
 'https://www.politico.com/story/2019/04/01/house-judiciary-subpoenas-mueller-report-1246206',
 'https://www.politico.com/story/2019/04/01/china-fentanyl-after-opioid-epidemic-1308684',
 'https://www.politico.com/story/2019/04/01/pete-buttigieg-2020-fundraising-1246204',
 'https://www.politico.com/story/2019/04/01

In [0]:
page = ''
tagsList = []
tagTypes = ['p']
for url in urls: 
  page = requests.get(url).text 
  newssoup = BeautifulSoup(page, "lxml")  
  tagsList.append(newssoup.find_all(tagTypes))
  
  
len(tagsList)

58

In [0]:
# Find all paragraph tags as these are the ones that contain data in our documents. 


In [0]:
origTxt = ''
for tagsSinglePage in tagsList:
  for tag in tagsSinglePage:
    origTxt += tag.text
origTxt



# Input Cleanup

## Steps

1. Tokenize at sentence level.
2. Clean the sentences removing stop words and punctuations
3. Manually create list of candidates.
4. Purge sentences that don't have reference to any candidate name.


The term 'candidate' is somewhat loose here. It includes genuine candidates who have offocially launched their 2020 candidates, but also those who have only launched an exploratory committee but not offically launched their candidacy just yet.

In [0]:
from nltk.tokenize import PunktSentenceTokenizer
sents_tokenized = sent_tokenize(origTxt)
sents_tokenized

['\n                President Donald Trump on Thursday backpedaled on his budget proposal to eliminate the program after two days of intense criticism on social media.',
 '| AP Photo/Damian Dovarganes, File\neducation\n                    By NICOLE GAUDIANO04/01/2019 02:31 PM EDTThe Trump administration’s now-abandoned plan to eliminate $17.6 million in federal funding for the Special Olympics is looking like a boon for the nonprofit.Rather than cutting off the Special Olympics from all federal money, lawmakers are weighing a possible increase for the fifth year in a row as high-profile advocates rally behind the organization and its programs in thousands of schools.',
 "The social media outcry over the threat to its funding stream is boosting the Special Olympics' message — and private donations are expected to surge.",
 'It also shows the political risk for the Trump administration when it targets an educational program beloved by Democrats and Republicans alike.Story Continued Below

In [0]:
stopWords = set(stopwords.words('english') + list(string.punctuation))
stopWords

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'need

Now, generate sentences after stopwords have been removed and list them at the end.

In [0]:
sents_ClnTknzd = []
punctuations = list(string.punctuation) # only remove punctuations. Keep stop words for phrases and un-abbreviated forms

for sent in sents_tokenized:
  tempStr = ''
  tempTokens = nltk.word_tokenize(sent)
  for token in tempTokens:
    if token not in punctuations:
        tempStr += (token)
        tempStr += ' '
        #cleanTokens.append(' ') # Need to append a single space for cases where words are losing space in between
  
  sents_ClnTknzd.append(tempStr.strip())

sents_ClnTknzd

['President Donald Trump on Thursday backpedaled on his budget proposal to eliminate the program after two days of intense criticism on social media',
 'AP Photo/Damian Dovarganes File education By NICOLE GAUDIANO04/01/2019 02:31 PM EDTThe Trump administration ’ s now-abandoned plan to eliminate 17.6 million in federal funding for the Special Olympics is looking like a boon for the nonprofit.Rather than cutting off the Special Olympics from all federal money lawmakers are weighing a possible increase for the fifth year in a row as high-profile advocates rally behind the organization and its programs in thousands of schools',
 'The social media outcry over the threat to its funding stream is boosting the Special Olympics message — and private donations are expected to surge',
 'It also shows the political risk for the Trump administration when it targets an educational program beloved by Democrats and Republicans alike.Story Continued Below “ It ’ s a little bit overwhelming in the best

In [0]:
candidates = ['Joe Biden', 'Cory Booker', 'Pete Buttigieg',
              'Julian Castro', 'John Delaney', 'Tulsi Gabbard',
              'Kirsten Gillibrand', 'Kamala Harris', 'John Hickenlooper',
              'Jay Inslee', 'Amy Klobuchar', 'Beto O’Rourke',
              'Bernie Sanders', 'Howard Schultz', 'Elizabeth Warren', 
              'Marianne Williamson', 'Andrew Yang']

# The news articles will rarely mention full name. Use last names to capture references to candidates. 
lastNames = []
for candidate in candidates:
    lastNames.append(candidate.split()[1])

lastNames

['Biden',
 'Booker',
 'Buttigieg',
 'Castro',
 'Delaney',
 'Gabbard',
 'Gillibrand',
 'Harris',
 'Hickenlooper',
 'Inslee',
 'Klobuchar',
 'O’Rourke',
 'Sanders',
 'Schultz',
 'Warren',
 'Williamson',
 'Yang']

Now, create a filter to keep only those sentences that have reference to at-least one member in this list. Remove sentences that don't directly refer to candidates names. There is a downside here though where we are losing information where pronouns are being used to refer to a candidate. May look to incorporate that as well later on with NLP analysis that can capture that level of contextual information.

In [0]:
sentsNameReferences = []
for sent in sents_ClnTknzd:
  for lName in lastNames:
    if lName in sent:
      print(sent)
      sentsNameReferences.append(sent)

sentsUnqRefs = set(sentsNameReferences)
sentsUnqRefs

Dick Durbin of Illinois Dianne Feinstein of California and Kirsten Gillibrand of New York who is running for president
By Nelson W. Cunningham By Michael Grunwald By Susan Page By Sen. Mitch McConnell © 2019 POLITICO LLC `` This is just a preliminary analysis but our team ’ s initial report shows we raised over 7 million dollars in Q1 of this year '' South Bend Indiana Mayor Pete Buttigieg wrote on Twitter
Justin Sullivan/Getty Images 2020 Elections By DANIEL STRAUSS04/01/2019 07:42 AM EDT Updated 04/01/2019 09:22 AM EDTDemocrat Pete Buttigieg said Monday that his presidential exploratory committee raised over 7 million in the first quarter of 2019.That fundraising haul covers slightly more than two months from January 23 to the end of March since the 37-year-old mayor of South Bend Indiana announced his longshot campaign for president.Story Continued Below '' This is just a preliminary analysis but our team ’ s initial report shows we raised over 7 million dollars in Q1 of this year '

{'... We hope he gets in the race and we hope he ’ s the Democrats ’ nominee. ” Republicans and Democrats alike are laying the groundwork to face Biden by examining the business deals of his son Hunter Biden when Joe Biden was vice president',
 '2020 elections By CARLA MARINUCCIButtigieg was invited to appear on Ellen DeGeneres ’ daytime talk show the same stage that launched his 2020 rival then-Texas Senate candidate Beto O ’ Rourke into the fundraising stratosphere in 2018',
 '2020 elections By DAVID SIDERSDuring her first term she ’ s focused on issues affecting veterans and women pairing with Sen. Kirsten Gillibrand D-N.Y. on a bill to require sexual assault prevention training for new members of the military and teaming with Sen. Dianne Feinstein D-Calif. on reauthorization of the Violence Against Women Act',
 '2020 elections By KATIE GALIOTOStill the rise of the MeToo movement and a trove of videos and photos have placed Biden under fresh scrutiny',
 "AP Photo/Domenico Stinellis 

In [0]:
len(sentsUnqRefs)

130

# Analysis

At this point, we do have a list of sentences with clean text. Now we will capture all available information in a DataFrame. This DataFrame will capture candidate's last names, the number of times each candidate's name appears across target corpus, sentiment value for those sentences and so on...

In [0]:
counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

mapCandCounts = dict(zip(lastNames, counts))
dictCandidatesAndSents = {}

for lName in mapCandCounts:
  lstTemp = [] 
  for sent in sentsUnqRefs:
    if lName in sent:
      mapCandCounts[lName] += 1
      lstTemp.append(sent)
      
  dictCandidatesAndSents[lName] = lstTemp
      

mapCandCounts
dfrm = pd.DataFrame(list(mapCandCounts.items()), columns=['Candidate', 'RefCounts'])
dfrm

Unnamed: 0,Candidate,RefCounts
0,Biden,74
1,Booker,4
2,Buttigieg,34
3,Castro,3
4,Delaney,0
5,Gabbard,0
6,Gillibrand,3
7,Harris,9
8,Hickenlooper,3
9,Inslee,0


## Sentiment Analysis

We will use textblob package to generate sentiment analysis by each sentence and group the output by each candidate. Next, we import textblob and write simple Lambda functions to generate polarity and subjectivity value output from textblob.

In [0]:
import textblob
from textblob import TextBlob
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

In [0]:
# Generate placeholder matching the total count of candidates. Conver list into Pandad series, merge that to our DFRM and print the DFRM. 
polarity = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
subjectivity = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

polSeries = pd.Series(data=polarity, name="Polarity")
subSeries = pd.Series(data=subjectivity, name="Subjectivity")
dfrm = pd.concat([dfrm, polSeries, subSeries], axis = 1)
dfrm

Unnamed: 0,Candidate,RefCounts,Polarity,Subjectivity
0,Biden,74,0.0,0.0
1,Booker,4,0.0,0.0
2,Buttigieg,34,0.0,0.0
3,Castro,3,0.0,0.0
4,Delaney,0,0.0,0.0
5,Gabbard,0,0.0,0.0
6,Gillibrand,3,0.0,0.0
7,Harris,9,0.0,0.0
8,Hickenlooper,3,0.0,0.0
9,Inslee,0,0.0,0.0


Now, we will go through list of sentences tagged for each candidate, calculate average sentiment scors, and put those scores in our DFRM.

In [0]:
for lName, sents in dictCandidatesAndSents.items():
  index = dfrm.index[dfrm['Candidate'] == lName]
  polarity = 0.0; subjectivity = 0.0; count = 0
  for sent in sents:
      polarity += pol(sent)
      subjectivity= sub(sent)
      count += 1
  if (count > 0):
    dfrm.loc[index, 'Polarity'] = polarity / count
    dfrm.loc[index, 'Subjectivity'] = subjectivity / count

dfrm

Unnamed: 0,Candidate,RefCounts,Polarity,Subjectivity
0,Biden,74,0.041381,0.006498
1,Booker,4,0.002794,0.0
2,Buttigieg,34,0.155078,0.012255
3,Castro,3,-0.094841,0.140079
4,Delaney,0,0.0,0.0
5,Gabbard,0,0.0,0.0
6,Gillibrand,3,0.123295,0.143434
7,Harris,9,0.116008,0.046693
8,Hickenlooper,3,0.0,0.0
9,Inslee,0,0.0,0.0


Rather than calculating sentiment of the entire corpus in one go, let us check the sentiment by sentence. 

## Pending / Possible Enhancements:
#### Priority Items
- P0: Change the analysis to be time-series based to see ebb and flow across time. 

- P0: Beto O'Rourke's name is coming up as "Beto O ’ Rourke" because of the way data is assembled back again after cleanup. Bring that in line with how the articles publish his name.

- P0: How can we confirm that the sentinment is for the candidate herself versus for a topic she is addressing or has been referred in the context of?

- P0: There is a reporter John F. Harris who shares last name with Kamala Harris which results in a list of false positives. Need to exclude such instances except for there is a genuine reference for any other candidate in that sentence.



#### Low Priority Items
- P1: Topic Modeling to assess which topics are generating highest sentiment scores for a candidate. This can help depict the topics a candidate is more at home for messaging. 
- P1: As part of data cleanup, all sentences that did not have direct references to candidates' last names were removed. This included sentences where candidates were referred to by their respctive gender based pronouns. Need to find ways to capture that level of contextual information. Check whether BERT can account for something like this.
- P1: Extend this to Twitter and other major political messaging platforms. 
- P1: Compare Politico's sentiment scores for a candidate to those from left and right leaning news sources to check whether Politico's reporting is in the middle
- P1: Retrieve data from secondary sources like Fox News, CNN: 

      """
      urlFoxN = "https://www.foxnews.com/politics"

      For News will require "m" class divs and then appending the base HTTP address since links lack that. 
      if (baseUrl == 'https://www.foxnews.com/politics'): 
          divs = soup.findAll("div", {"class": "m"})

      CNN will require first HTML and then JSON parsing for list of urls

      """


#### Other Data Sources
- https://www.kaggle.com/benhamner/2016-us-election
- https://www.kaggle.com/c/can-we-predict-voting-outcomes/data
