# Data Cleaning for Text Project: Sentiment Analysis on Today's NYTimes

### Isaac Newell

This notebook pulls all the articles from today's New York Times (the webpage actually updates constantly). It retrieves the text from all of them and then processes that text. Here are the steps:
1. Find all the "a" tags of articles on the today's-paper homepage
2. Use requests to get the content of these articles
3. Split each article into sentences
4. Perform sentiment analysis on each sentence
5. Split these sentences into words and get the count for each word
6. Output this processed data to a .json file

------

First, we'll import these two libraries:
* BeautifulSoup, for parsing the DOM of webpages we load in
* requests, for making HTTP requests to the NYTimes website

In [1]:
from bs4 import BeautifulSoup
import requests

Before we address the real task, let's just test BeautifulSoup and requests on one article to retrieve its text:

In [2]:
r = requests.get("https://www.nytimes.com/2017/10/27/world/africa/burundi-international-criminal-court.html?rref=collection%2Fsectioncollection%2Fworld&action=click&contentCollection=world&region=stream&module=stream_unit&version=latest&contentPlacement=1&pgtype=sectionfront")

html_doc = r.text

soup = BeautifulSoup(html_doc, 'html.parser')

ps = soup.find_all("p")

for p in ps:
    tex = p.get_text()
    if tex != "Advertisement":
        print(tex)


By JINA MOOREOCT. 27, 2017

NAIROBI, Kenya — One month after a scathing United Nations report that called for a criminal investigation likely to lead back to its leaders, Burundi has withdrawn from the International Criminal Court, becoming the first country in the world to do so.
A United Nations Commission of Inquiry on Burundi reported in September that it had found evidence of extrajudicial killings, disappearances, arbitrary arrests and detentions, torture and sexual violence in the two-and-a-half years since Burundi’s president, Pierre Nkurunziza, muscled his way to a third term in office.
Burundi announced its intention to withdraw last year, at a time when the court was deeply unpopular with African leaders. Gambia and South Africa were also threatening to pull out, and the continent’s top intelligence officials signed a statement accusing the court of being “hijacked by powerful western countries” and “acting as a proxy” for foreign-led government change. But Mr. Nkurunziza ca

Since loading that article worked properly, we now move on to the NYTimes todayspaper page. That page has every article in the paper arranged by section and hyperlinked. By inspecting the webpage structure, I figured out where in the DOM all of the articles are. The front page is in a separate div, so it had to be retrieved separately from the other sections.

In [3]:
# Makes a dictionary with the urls of all articles in the current issue of the NYTimes
# Output will contain a key for each section, whose value is a list of urls for all articles
def get_url_dict():
    r = requests.get("http://www.nytimes.com/pages/todayspaper/index.html")

    html_doc = r.text

    soup = BeautifulSoup(html_doc, "html.parser")
    
    # Parse the DOM
    main = soup.find("div",attrs={"id":"main"})

    front_page_div = main.find("div",class_="aColumn")
    
    # The front page is split into 2 separate columns
    fp_col1 = front_page_div.find("div",class_="columnGroup first")

    fp_col2 = front_page_div.find("div",class_="columnGroup singleRule last")

    col1_stories = fp_col1.find_all("div",class_="story")

    # Create dictionary to store article urls in
    urls = {}
    
    # Add the urls for the frontpage from both of the columns
    urls["frontpage"] = []
    for story in col1_stories:
        h3 = story.find("h3")
        a = h3.find("a")
        urls["frontpage"].append(a.get("href"))

    col2_stories = fp_col2.find_all("a")
    for story in col2_stories:
        urls["frontpage"].append(story.get("href"))
    
    # Now find the other sections, all stored in a separate div
    other_section_container = main.find("div",attrs={"id":"SpanABMiddleRegion"})
    secs = other_section_container.find_all("div",class_="columnGroup")
    
    # Iterate through the divs with class "columnGroup".
    # Usually they alternate: one holds a section's articles, and
    # the next is a "jump to" menu, which contains nothing we want
    for i,sec in enumerate(secs):
        if i == len(secs)-1:
            break
        if len(sec.find_all("div",class_="jumpToModule")) == 0:
            sec_name = sec.find("h3",class_="sectionHeader").find("a").get("name")
            #print(sec_name)
            urls[sec_name] = []
            artic_list = sec.find("ul").find_all("a")
            for artic in artic_list:
                urls[sec_name].append(artic.get("href"))
    return urls

# Now test this function
urls = get_url_dict()
# Print the first three urls in each section
for k in urls.keys():
    print(urls[k][:3])

['https://www.nytimes.com/2017/11/18/nyregion/new-york-subway-system-failure-delays.html?ref=todayspaper', 'https://www.nytimes.com/2017/11/18/us/politics/ron-johnson-senate-tax-cut.html?ref=todayspaper', 'https://www.nytimes.com/2017/11/18/us/roy-moore-alabama.html?ref=todayspaper']
['https://www.nytimes.com/2017/11/18/world/americas/rio-de-janeiro-brazil-violent-crime-security.html?ref=todayspaper', 'https://www.nytimes.com/2017/11/18/world/middleeast/hariri-france-saudi-lebanon.html?ref=todayspaper', 'https://www.nytimes.com/2017/11/18/world/americas/mexico-city-airport-enrique-pena-nieto.html?ref=todayspaper']
['https://www.nytimes.com/2017/11/18/business/trump-wants-more-big-infrastructure-projects-the-obstacles-can-be-big-too.html?ref=todayspaper', 'https://www.nytimes.com/2017/11/18/us/politics/republican-governors-trump-backlash-2018.html?ref=todayspaper', 'https://www.nytimes.com/2017/11/18/nyregion/he-fled-myanmar-on-a-deathtrap-now-hes-the-luckiest-man-alive.html?ref=todaysp

Now that we have the urls for everything, we can make a separate HTTP request for each one. From there we can get the title and text of each article. That is what this function does.

In [4]:
def get_article_title_and_text(url):
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, "html.parser")
    
    # Extract all the <p> tags
    ps = soup.find_all("p",class_="story-body-text story-content")
    
    # Attach the content of all of them to one big string
    # Use BeautifulSoup's get_text method
    # Replace curly quotes with straight quotes
    # Curly quotes were messing up the sentence boundary finding (done later)
    text = ""
    for p in ps:
        text = text+" "+p.get_text().replace('“','"').replace('”','"')
    
    # Get the article title, inside an <h1> with id="headline"
    title = soup.find_all("h1",attrs={"id":"headline"})
    if len(title) > 0:
        title = title[0].get_text()
    else:
        title = ""
    return {"title": title,
            "text": text}

Now we can call get_article_title_and_text on all the urls. The function below does that and collects the results into a dictionary with a key for each section. The corresponding value is a list of articles, each represented by a dictionary containing the title, url, and text content of that article.

In [5]:
def get_all_articles_dict():
    urls = get_url_dict()
    articles = {}
    for k in urls.keys():
        articles[k] = []
        sec_urls = urls[k]
        for url in sec_urls:
            tt = get_article_title_and_text(url)
            tt["url"] = url
            articles[k].append(tt)
    return articles

In [7]:
articles = get_all_articles_dict()
for k in articles.keys():
    first_article = articles[k][0]
    title = first_article["title"]
    text = first_article["text"]
    print(k,title,text[:200])
print(articles)

frontpage How Politics and Bad Decisions Starved New York’s Subways  After a drumbeat of transit disasters this year, it became impossible to ignore the failures of the New York City subway system. A rush-hour Q train careened off the rails in southern Brooklyn. A tra
world In Rio de Janeiro, ‘Complete Vulnerability’ as Violence Surges  RIO DE JANEIRO — For teachers in this seaside megacity, Rio de Janeiro’s surge in violence has meant making a life-or-death judgment call with unnerving frequency: deciding whether to cancel classes 
us Trump Wants More Big Infrastructure Projects. The Obstacles Can Be Big, Too.  President Trump says he is frustrated with the slow pace of major construction projects like highways, ports and pipelines. Last summer, he pledged to use the power of the presidency to jump start bu
obituaries Jeremy Hutchinson, a Top Lawyer in High-Profile Cases, Dies at 102  LONDON — Jeremy Hutchinson, a British barrister whose sometimes theatrical courtroom tactics and rhet
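
The curly-quote replacement inside get_article_title_and_text can be tried on its own. This is only a small sketch and not part of the original notebook: `straighten_quotes` and `QUOTE_MAP` are names made up here, and it uses `str.translate` in place of the chained `replace` calls. Like the original, it leaves curly apostrophes alone.

```python
# Hypothetical helper: same effect as .replace('“','"').replace('”','"')
QUOTE_MAP = str.maketrans({"“": '"', "”": '"'})

def straighten_quotes(text):
    # Swap curly double quotes for straight ones; everything else is untouched
    return text.translate(QUOTE_MAP)

sample = "accusing the court of being “hijacked by powerful western countries”"
print(straighten_quotes(sample))
```

A translation table is handy if more characters need to be normalized later, since they can be added to the mapping in one place.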

# Now we need to split the articles into sentences

Splitting into sentences will allow us to perform sentiment analysis separately on each sentence. Sentence boundary disambiguation (SBD) is a more complicated task than it might sound (i.e. just finding periods), since abbreviations, question and exclamation marks, and quotes all get in the way. See the Wikipedia article on SBD for more.

For this we will use the punkt sentence tokenizer from nltk.
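
To see why "just finding periods" isn't enough, here is a naive splitter built on a regular expression. It is a sketch for illustration only: `naive_split` is a made-up name, and the sample text is invented rather than taken from an article.

```python
import re

def naive_split(text):
    # Split wherever ., ! or ? is followed by whitespace
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Invented sample with two real sentences and two abbreviations
sample = "But Mr. Nkurunziza stayed in office. The U.N. report came out in September."
for piece in naive_split(sample):
    print(repr(piece))
```

The two sentences come back as four pieces, because "Mr." and "U.N." look like sentence endings to this rule. Handling cases like these is the job of a trained tokenizer.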

In [8]:
import nltk

In [None]:
# nltk.download_shell()
# Uncomment this if you need to download the packages necessary for this.
# I just downloaded everything, and then didn't need this line any more.
# For some reason the regular nltk.download() wasn't working for me, but this did.

Test nltk.sent_tokenize on a short passage:

In [9]:
s = "My name is Isaac. I live in a dorm at Andover called Stu. Herbie also lives there."

nltk.sent_tokenize(s)

['My name is Isaac.',
 'I live in a dorm at Andover called Stu.',
 'Herbie also lives there.']

That seems to have worked well. Now we'll work with splitting the sentences into words and doing further analysis there. 

Initially I tried using a stemmer (which reduces a word to a root form, e.g. "being" goes to "be"). However, the stemmer gets a lot of things wrong and makes a lot of fake words, e.g. "flying" goes to "fli". Lemmatizing is an alternative, which guarantees that the output is a real word. However, it really only works when you know the part of speech of the input, which is a difficult problem in itself. Thus I decided to abandon that idea.

In [10]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

In [11]:
def get_wordcounts(wordlist):
    counts = {}
    for word in wordlist:
        stem = word #stemmer.stem(word)
        if stem in counts:
            counts[stem] += 1
        else:
            counts[stem] = 1
    return counts

In [12]:
wl1 = ["to", "be", "or", "not", "to", "being"]
get_wordcounts(wl1)

{'be': 1, 'being': 1, 'not': 1, 'or': 1, 'to': 2}
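
As a cross-check, the standard library's `collections.Counter` produces the same counts as the hand-written loop when no stemming is applied. This is a small sketch; the variable names are made up here.

```python
from collections import Counter

words = ["to", "be", "or", "not", "to", "being"]

# Counter does the "increment or initialize" bookkeeping for us
counts = dict(Counter(words))
print(counts)
```

Keeping the explicit loop in get_wordcounts still makes sense if a stemmer or other per-word transformation is going to be slotted back in.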

We will use nltk's SentimentIntensityAnalyzer to perform sentiment analysis on all of our sentences. It comes from the VADER package (which stands for "Valence Aware Dictionary and sEntiment Reasoner", not Darth Vader), a lexicon- and rule-based tool that works out of the box, with no training on our part. It outputs three scores (positive, neutral, and negative), each between 0 and 1, plus a compound score between -1 and 1. The compound score is what we will use. VADER is tuned for social media text, which could introduce a bias when it is applied to news writing. Social media can be a nasty place, and when I actually finished this data visualization I noticed that many sentences that are clearly negative got classified as positive or neutral. It seems like anything that is not some nasty little message like a Trump tweet filled with words like "sad" and "failing" is biased towards positive. In any case, the algorithm isn't perfect and NLP is hard.

In [13]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer



Test the sentiment analyzer on a few sentences to make sure it's working:

In [14]:
test_sents = ["Phillips Academy is a wonderful place.",
             "Although Andover can work you half to death, overall its fun",
             "I HATE Andover with a fiery passion!",
             "Trump can’t get his bad ideas through Congress, but he can use the power of the presidency to sabotage or even sink Obama’s signature deeds.",
             "Our country is being ruined by Trump",
             "Donald Trump is Making America Great Again!,",
             "We faked the moon landing"]
sia = SentimentIntensityAnalyzer()
for sent in test_sents:
    print(sent)
    ss = sia.polarity_scores(sent)
    for k in ss:
        print("{0}: {1}, ".format(k, ss[k]), end="")
    print()

Phillips Academy is a wonderful place.
neg: 0.0, neu: 0.519, pos: 0.481, compound: 0.5719, 
Although Andover can work you half to death, overall its fun
neg: 0.241, neu: 0.556, pos: 0.204, compound: -0.1531, 
I HATE Andover with a fiery passion!
neg: 0.588, neu: 0.165, pos: 0.247, compound: -0.628, 
Trump can’t get his bad ideas through Congress, but he can use the power of the presidency to sabotage or even sink Obama’s signature deeds.
neg: 0.229, neu: 0.771, pos: 0.0, compound: -0.7814, 
Our country is being ruined by Trump
neg: 0.341, neu: 0.659, pos: 0.0, compound: -0.4767, 
Donald Trump is Making America Great Again!,
neg: 0.0, neu: 0.577, pos: 0.423, compound: 0.6588, 
We faked the moon landing
neg: 0.0, neu: 1.0, pos: 0.0, compound: 0.0, 


Nice! It seems to work decently well, and classifies the varying sentences about Andover correctly.
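
If a coarse label is ever more convenient than the raw number, the compound score can be bucketed. The sketch below is an assumption layered on top of this notebook, which only stores the compound score: the `label` function is made up here, and the 0.05 cutoff is a commonly used convention rather than anything taken from the analysis above. The scores are copied from the printed output.

```python
def label(compound, threshold=0.05):
    # threshold is an assumed cutoff, not a value used elsewhere in this notebook
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the test output above
scored = {
    "Phillips Academy is a wonderful place.": 0.5719,
    "I HATE Andover with a fiery passion!": -0.628,
    "We faked the moon landing": 0.0,
}
for sentence, compound in scored.items():
    print(label(compound), "|", sentence)
```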

We also want to keep track of the counts of words and their presence or absence in a given sentence for our later analysis. We will use collections.Counter to do this.

In [15]:
from collections import Counter

We also want to filter the words a bit. We'll use nltk's stopwords list, which contains common, mostly insignificant words like "i", "your", and other pronouns.

In [16]:
from nltk.corpus import stopwords
print(stopwords.words('english')[:15], len(stopwords.words('english')[:15]))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him'] 15


Also filter out punctuation.

In [17]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [18]:
# Build the stopword set once; calling stopwords.words() on every check is slow
STOPWORDS = set(stopwords.words('english'))

# Checks if a word is valid with the three criteria:
# 1. It is not in the stopwords
# 2. It is at least 4 characters long
# 3. It doesn't start with punctuation
def is_valid(word):
    if word in STOPWORDS:
        return False
    if len(word) < 4:
        return False
    if word[0] in string.punctuation:
        return False
    return True
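
One thing to watch: the stopword list is all lowercase, so a capitalized word such as "They" passes the check (it shows up in the top words printed later). The sketch below shows a case-insensitive variant. It is not what this notebook runs: `is_valid_ci` is a made-up name, and `STOPWORDS_SAMPLE` is a tiny stand-in for nltk's list so that the example runs without nltk.

```python
import string

# Stand-in for nltk's stopword list (assumption: only a few entries, for the demo)
STOPWORDS_SAMPLE = {"they", "them", "their", "this", "that"}

def is_valid_ci(word, stop=STOPWORDS_SAMPLE):
    # Compare the lowercased word so "They" and "they" are treated alike
    if word.lower() in stop:
        return False
    if len(word) < 4:
        return False
    if word[0] in string.punctuation:
        return False
    return True

tokens = ["They", "said", "that", "the", "government", "'s", "plan", "failed", "."]
kept = [t for t in tokens if is_valid_ci(t)]
print(kept)
```

Whether to lowercase is a judgment call: it merges "They" with "they", but it would also require care with proper nouns like "Baker" if the words themselves were lowercased rather than just the comparison.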

Now we are ready to finish processing and store our text data. We will store it in a dictionary. That dictionary will have a key for every section, which will correspond to a list of articles, each represented by a dictionary. Each article has a title, a url, and its content. Its content is a list of dictionaries, one for each sentence within the article. Those dictionaries contain the words of the sentence with their counts, the raw text of the sentence, and the calculated sentiment. We will also save a separate data structure that counts significant words globally across a selection of sections (frontpage, world, us, opinion, nyregion, business, sundayreview). These two data structures will be output into separate JSON files.
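
To make the nesting concrete, here is a hand-written example of the shape described above, with one section, one article, and one sentence. The title and url are placeholders, and the word counts were typed by hand for illustration; the sentence and its compound score come from the sentiment test earlier.

```python
import json

example = {
    "frontpage": [                       # section name -> list of articles
        {
            "title": "Placeholder title",
            "url": "https://example.com/placeholder",
            "content": [                 # one dictionary per sentence
                {
                    "sentence": "Phillips Academy is a wonderful place.",
                    "words": {"Phillips": 1, "Academy": 1, "is": 1, "a": 1,
                              "wonderful": 1, "place": 1, ".": 1},
                    "sentiment": 0.5719,
                }
            ],
        }
    ]
}

# The structure is plain dicts, lists, strings and numbers, so it survives JSON as-is
round_tripped = json.loads(json.dumps(example, ensure_ascii=False))
print(round_tripped == example)
```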

In [19]:
data = {}
dictionary = Counter()

for k in articles.keys():
    sec = articles[k]
    sec_content = []
    for artic in sec:
        artic_content = []
        artic_sentences = nltk.sent_tokenize(artic["text"])
         
        for sent in artic_sentences:
            sent_obj = {}
            sent_obj["sentence"] = sent
            sent_words = nltk.word_tokenize(sent)
            sent_obj["words"] = get_wordcounts(sent_words)
            if k in ["frontpage","world","us","opinion","nyregion","business","sundayreview"]:
                useful_words = [w for w in sent_words if is_valid(w)]
                dictionary.update(useful_words)
            sent_obj["sentiment"] = sia.polarity_scores(sent)["compound"]
            artic_content.append(sent_obj)
        sec_content.append({"title": artic["title"],
                            "url": artic["url"],
                           "content": artic_content})
    data[k] = sec_content       
    

Now we need to write the data to a JSON file:

In [20]:
import json

In [21]:
with open("nytimes_sentiment.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

Also we will output the dictionary of wordcounts to a JSON file. Some selection of the top words will be displayed on the final visualization.

Let's just print out the 20 most common words, to make sure it's working:

In [22]:
print(dictionary.most_common(20))

[('said', 654), ('would', 253), ('people', 165), ('years', 159), ('Baker', 157), ('could', 141), ('year', 135), ('like', 130), ('time', 125), ('state', 123), ('They', 113), ('government', 108), ('also', 107), ('first', 106), ('work', 100), ('family', 99), ('home', 89), ('many', 87), ('back', 87), ('even', 82)]


Side note: interestingly, "said" is always the number 1 word, by far, every day that I've run this script. It makes sense, since the NYTimes is a newspaper, which means that it frequently quotes people. If you look at just the opinion section, however, there are almost no quotes and the top word is almost always "Trump". It's also interesting how much this list of words varies over time with current stories, e.g. "Weinstein" has recently been consistently in the top words because that story is big news right now. Some words, like "Trump", stay near the top more stably.

Now output that into a JSON file:

In [23]:
with open("top_words.json", "w", encoding="utf-8") as f:
    json.dump({"counts": dictionary.most_common(1000)}, f, ensure_ascii=False)
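
When this file is read back, note that `most_common` returns a list of (word, count) tuples, and JSON has no tuple type, so they come back as two-element lists. The sketch below demonstrates this with a tiny made-up Counter and a temporary file; the file name and the counts are invented for the demo.

```python
import json
import os
import tempfile
from collections import Counter

# Made-up counts, just to have something to write
demo_counts = Counter(["said", "said", "said", "Trump", "Trump", "year"])

path = os.path.join(tempfile.mkdtemp(), "top_words_demo.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({"counts": demo_counts.most_common(2)}, f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded["counts"])  # each entry is a [word, count] list, not a tuple
```

Whatever consumes the file (the visualization, in this project) should therefore index each entry by position.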