The goal of this notebook is to establish the utility of NLTK's sentence tokenizer for our current work. A particular concern is the effect that "Dr." and other abbreviations might have in producing false positives: strings returned as sentences that are not, in fact, sentences.

As always, the first thing we need to do is load our data. Here I'm using version **6d**. (I'm reverting to 6d because I don't need the overall valence for each talk, and it looks like the first talk by Al Gore got dropped in **6e**.)

In the code below, `header=0` tells pandas to use the first row in the CSV as the header row. If you run `df.head()`, you will see that the index saved to the CSV gets re-imported as an unnamed column. (I haven't figured out how to keep this from happening, or, once it has, how to make pandas ignore the first column on import.)

In [1]:
import pandas

df = pandas.read_csv('../data/talks_6d.csv', header=0)
talks = df.text.tolist()
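On the unnamed column mentioned above: a minimal sketch of two options that should help, assuming the extra column really is the index that pandas saved. The tiny CSV string and its column names are made up for illustration; they are not the real `talks_6d.csv`.

```python
import io
import pandas

# Hypothetical stand-in for the real CSV: the first, unnamed column is the saved index.
csv_text = ",title,text\n0,Talk A,First talk.\n1,Talk B,Second talk.\n"

# Default import: the saved index comes back as a column named "Unnamed: 0".
plain = pandas.read_csv(io.StringIO(csv_text), header=0)
print(list(plain.columns))

# Option 1: index_col=0 tells pandas to use the first column as the index.
indexed = pandas.read_csv(io.StringIO(csv_text), header=0, index_col=0)
print(list(indexed.columns))

# Option 2: write with index=False so the index is never saved in the first place.
out = io.StringIO()
indexed.to_csv(out, index=False)
print(out.getvalue())
```

Option 1 fixes things at import time without touching the file; option 2 only helps the next time the CSV is written.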

All the talks are now in a list, as usual. I am also going to create a text to test the NLTK tokenizer.

In [1]:
test = """
Mrs. Brown loves chocolate. 
When she heard the news that Donald Trump has been 
elected president. She ate an entire plate of brownies. 
She doesn't feel the same way about chocolate."""

In [2]:
test = test.replace('\n', ' ').replace('\r', '')
print(test)

# From https://stackoverflow.com/questions/16566268/remove-all-line-breaks-from-a-long-string-of-text

 Mrs. Brown loves chocolate.  When she heard the news that Donald Trump has been  elected president. She ate an entire plate of brownies.  She doesn't feel the same way about chocolate.


In [3]:
from nltk import tokenize

test_sent = tokenize.sent_tokenize(test)
print(len(test_sent), test_sent)

LookupError: 
**********************************************************************
  Resource 'tokenizers/punkt/PY3/english.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/Users/katiek/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
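The `LookupError` above means the Punkt sentence model is not installed on this machine. Below is a minimal sketch of a guard that downloads it only when it is missing. `ensure_nltk_resource` is a hypothetical helper name, and the resource path and package name in the usage line are read off the error message. Newer NLTK releases may ask for a differently named resource, in which case the error message names the one to pass.

```python
def ensure_nltk_resource(resource_path, package):
    """Download an NLTK package only if its resource is not already installed."""
    # Imported here so the helper can be defined before NLTK is actually needed.
    import nltk

    try:
        # Raises LookupError if the resource is not in any nltk_data directory.
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package)


# Usage (left commented out because the download needs network access):
# ensure_nltk_resource('tokenizers/punkt', 'punkt')
```

After the download finishes once, rerunning the tokenizer cell should no longer raise the error.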

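Success! Once the Punkt model is installed (the `LookupError` above is what NLTK raises when it is missing), the abbreviation "Mrs." does not generate its own sentence.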

Now you want five talks converted into a list of strings, each string a sentence. **>>>** I had some difficulties with this: the two code blocks at the end of this notebook show what I started and where I got stuck. Until then, I can offer some code blocks that work on a given set of texts.

For reference, when I've done sentiment analysis (I've used the Afinn, TextBlob, and Indico libraries), I have used the following code. For each text it returns the sentiment values as a list with one value per sentence, so the list is as long as the number of sentences.

In [4]:
# Using Afinn here. Just replace Afinn code with hedonometer code
def sentiment(text):
    from afinn import Afinn
    afinn = Afinn()
    sentences = tokenize.sent_tokenize(text)
    sentiments = []
    for sentence in sentences:
        sentsent = afinn.score(sentence)
        sentiments.append(sentsent)
    return sentiments

In [5]:
# Here I rewrote the "for" loop as a list comprehension:
def sentiment_2(text):
    from afinn import Afinn
    afinn = Afinn()
    sentences = tokenize.sent_tokenize(text)
    sentiments = [afinn.score(sentence) for sentence in sentences]
    return sentiments

In [6]:
sentiment(talks[0]) == sentiment_2(talks[0])

True
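Since the only Afinn-specific part of these functions is the scoring call, one way to make the "just replace the Afinn code" step easier is to pass the splitter and the scorer in as arguments. This is only a sketch: `sentence_scores`, `toy_split`, `toy_score`, and `TOY_LEXICON` are hypothetical names, and the toy splitter and lexicon are stand-ins for illustration, not replacements for NLTK or Afinn.

```python
import re


def sentence_scores(text, split_sentences, score_sentence):
    """Return one score per sentence, using whichever splitter and scorer are passed in."""
    return [score_sentence(sentence) for sentence in split_sentences(text)]


# Toy stand-ins (assumptions, for illustration only): a naive regex splitter
# and a scorer that sums values from a tiny hand-made lexicon.
def toy_split(text):
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


TOY_LEXICON = {'loves': 3, 'hate': -3}


def toy_score(sentence):
    words = re.findall(r"[a-z']+", sentence.lower())
    return float(sum(TOY_LEXICON.get(word, 0) for word in words))


print(sentence_scores("She loves chocolate. I hate mondays. It rains.",
                      toy_split, toy_score))
# prints [3.0, -3.0, 0.0]
```

In this notebook the real call would presumably look like `sentence_scores(talks[0], tokenize.sent_tokenize, Afinn().score)`, and swapping in another library means changing only the last argument.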

In [7]:
# Proof of concept: using print here to make the output easier to read
print(sentiment(talks[0]))

[2.0, 10.0, 6.0, 0.0, 0.0, 0.0, 2.0, 4.0, -2.0, 0.0, 2.0, -1.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 0.0, 1.0, -2.0, -1.0, 0.0, 0.0, 12.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, -2.0, 0.0, -1.0, 2.0, 1.0, -1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, -7.0, 0.0, 8.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, -1.0, 1.0, 2.0, 0.0, -2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, -1.0, -1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 4.0]


To get the values for the first five talks, which came from the dataframe but are now sitting in the `talks` list, I would just use a for loop and the version of the function you have.

In [8]:
for talk in talks[0:5]:
    print(sentiment(talk))

[2.0, 10.0, 6.0, 0.0, 0.0, 0.0, 2.0, 4.0, -2.0, 0.0, 2.0, -1.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 0.0, 1.0, -2.0, -1.0, 0.0, 0.0, 12.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, -2.0, 0.0, -1.0, 2.0, 1.0, -1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, -7.0, 0.0, 8.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, -1.0, 1.0, 2.0, 0.0, -2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, -1.0, -1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 4.0]
[1.0, 2.0, -5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 3.0, 0.0, 0.0, 2.0, 5.0, 4.0, 0.0, 0.0, 1.0, 3.0, -2.0, 0.0, -1.0, -2.0, 1.0, 3.0, 1.0, 3.0, 7.0, 2.0, 3.0, 3.0, -2.0, -2.0, -4.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 2.0, -3.0, 0.0, 0.0, 2.0, -2.0, 0.0, -2.0, 0.0, 0.0, 2.0, 4.0, -2.0, -3.0,

In [15]:
import csv

with open('../data/sentenized_talks.csv', 'wb') as f:
    wr = csv.writer(f)
    wr.writerows([tokenize.sent_tokenize(talk) for talk in talks[0:5]])

TypeError: 'str' does not support the buffer interface
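The `TypeError` above is most likely caused by the `'wb'` mode: in Python 3 the `csv` module writes strings, so the file has to be opened in text mode. Here is a minimal sketch of the fix, using a small made-up list of sentences and a temporary file rather than the real talks and `../data/sentenized_talks.csv`.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the tokenized talks: one list of sentences per talk.
sentenced_demo = [
    ["Mrs. Brown loves chocolate.", "She ate an entire plate of brownies."],
    ["A second, shorter talk."],
]

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'sentenized_demo.csv')

# Python 3: open in text mode ('w'), not binary ('wb'); newline='' lets the
# csv module control line endings itself.
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(sentenced_demo)  # one row per talk, one cell per sentence

# Read the file back to confirm the round trip.
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))

print(rows == sentenced_demo)
```

Note that the rows will have different lengths, since talks have different numbers of sentences. The `csv` module accepts that, but some tools that read the file later may not.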

In [19]:
sentenced = [tokenize.sent_tokenize(talk) for talk in talks[0:5]]
print(len(sentenced), type(sentenced))

5 <class 'list'>
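As the output shows, `sentenced` is a list of five lists, one per talk. If what is wanted is a single flat list of sentence strings, the inner lists can be chained together. This is a sketch with made-up data; `sentenced_demo` is a hypothetical stand-in for `sentenced`.

```python
from itertools import chain

# Hypothetical stand-in for `sentenced`: one list of sentences per talk.
sentenced_demo = [
    ["Mrs. Brown loves chocolate.", "She ate an entire plate of brownies."],
    ["A second, shorter talk."],
]

# Flatten the list of lists into one list of sentence strings,
# keeping the talks (and their sentences) in their original order.
flat = list(chain.from_iterable(sentenced_demo))

print(len(sentenced_demo), len(flat))
# prints 2 3
```

Flattening loses the talk boundaries, so if those matter later it may be better to keep the nested list and flatten only at the point of use.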
