The goal of this notebook is to establish whether NLTK's sentence tokenizer is useful for our current work. A particular concern is that "Dr." and other abbreviations might produce false positives: strings returned as sentences that are not, in fact, sentences.

As always, the first thing we need to do is load our data. Here I'm using version **6d**. (I'm reverting to 6d because I don't need the overall valence for each talk, and it looks like the first talk by Al Gore got dropped in **6e**.)

In the code below, `header=0` tells **`pandas`** to use the first row of the CSV as the header row. If you run `df.head()`, however, you will see that the index saved to the CSV gets re-imported as an unnamed column.

In [2]:
import pandas

df = pandas.read_csv('../data/talks_6d.csv', header=0)
talks = df.text.tolist()
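
One way to avoid the unnamed column, sketched here on a small in-memory CSV rather than the real file, is to pass `index_col=0` so that pandas treats the first column as the index. The sample data and the names `sample_csv`, `with_extra`, and `with_index` are made up for illustration.

```python
import io
import pandas

# A tiny stand-in for a CSV that was saved with its index (made-up data).
sample_csv = ",speaker,text\n0,A,First talk.\n1,B,Second talk.\n"

# Default behaviour: the saved index comes back as an "Unnamed: 0" column.
with_extra = pandas.read_csv(io.StringIO(sample_csv), header=0)

# With index_col=0, the first column becomes the index instead.
with_index = pandas.read_csv(io.StringIO(sample_csv), header=0, index_col=0)

print(list(with_extra.columns))
print(list(with_index.columns))
```

Since only the `text` column is used in this notebook, the extra column is harmless here; the option matters mostly when the dataframe is saved again.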

All the talks are now in a list, as usual. I am also going to create a text to test the NLTK tokenizer.

In [3]:
test = """
Mrs. Brown loves chocolate. When she heard the news that Donald Trump has been 
elected president. She ate an entire plate of brownies. She doesn't feel the same way 
about chocolate."""

print(test)


Mrs. Brown loves chocolate. When she heard the news that Donald Trump has been 
elected president. She ate an entire plate of brownies. She doesn't feel the same way 
about chocolate.


In [10]:
from nltk import tokenize

test_sent = tokenize.sent_tokenize(test)
print(len(test_sent), test_sent)

4 ['\nMrs. Brown loves chocolate.', 'When she heard the news that Donald Trump has been \nelected president.', 'She ate an entire plate of brownies.', "She doesn't feel the same way \nabout chocolate."]


Success! The abbreviation "Mrs." does not get split off as its own sentence.
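
To check a longer list of abbreviations, one option is to scan the tokenizer's output for sentences that end in a known abbreviation, which would suggest a false split. The sketch below is plain Python and works on any list of sentence strings; the helper name `suspicious_splits` and the abbreviation list are assumptions for illustration, not part of NLTK.

```python
# Abbreviations to watch for (an illustrative list, extend as needed).
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "St.")

def suspicious_splits(sentences, abbreviations=ABBREVIATIONS):
    """Return the sentences whose last word is a known abbreviation.

    A sentence ending in "Dr." usually means the tokenizer broke the
    text right after the abbreviation.
    """
    flagged = []
    for sentence in sentences:
        words = sentence.split()
        if words and words[-1] in abbreviations:
            flagged.append(sentence)
    return flagged

# Hand-written examples standing in for tokenizer output.
good = ["Mrs. Brown loves chocolate.", "She ate an entire plate of brownies."]
bad = ["Yesterday I saw Dr.", "Brown at the market."]

print(suspicious_splits(good))
print(suspicious_splits(bad))
```

Applied to the cell above, `suspicious_splits(test_sent)` should come back empty, since none of the four sentences ends in an abbreviation.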

For reference, when I've done sentiment analysis (I've used the Afinn, TextBlob, and Indico libraries), I have used the following code. It returns one sentiment value per sentence, so each text yields a list as long as its number of sentences.

In [None]:
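def afinn_sentiment(filename):
    """Return one Afinn score per sentence for the text in `filename`."""
    from afinn import Afinn
    afinn = Afinn()
    # Read the whole file and flatten line breaks into spaces.
    with open(filename, "r") as myfile:
        text = myfile.read().replace('\n', ' ')
    # Split into sentences (uses `tokenize` imported from nltk above),
    # then score each sentence.
    sentences = tokenize.sent_tokenize(text)
    sentiments = []
    for sentence in sentences:
        sentiments.append(afinn.score(sentence))
    return sentiments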

To get the values for the first five talks, which came from the dataframe but now sit in the `talks` list, I would just use a for loop with whichever version of the function you have. This example simply prints the number of sentences in each of the first five talks.

In [12]:
for talk in talks[0:5]:
    print(len(tokenize.sent_tokenize(talk)))

132
247
250
192
46
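
The same loop can collect per-sentence scores instead of counts. What follows is a minimal sketch, assuming the splitter and the scorer are passed in as functions so that it runs without NLTK or Afinn installed. `score_sentences`, `toy_split`, and `toy_score` are hypothetical names, and the toy functions are crude stand-ins, not real tokenizers or sentiment models.

```python
def score_sentences(texts, split, score):
    """For each text, return a list with one score per sentence."""
    return [[score(sentence) for sentence in split(text)] for text in texts]

# Toy stand-ins so the sketch is self-contained (not real NLP tools).
def toy_split(text):
    # Split on ". " only; a real run would use tokenize.sent_tokenize.
    return [part for part in text.split(". ") if part]

def toy_score(sentence):
    # Count "good" minus "bad"; a real run would use afinn.score.
    words = sentence.lower().split()
    return words.count("good") - words.count("bad")

sample_talks = ["Good day. Bad day. Good good day", "Nothing here"]
scores = score_sentences(sample_talks, toy_split, toy_score)
print(scores)
```

In the notebook itself this would be called along the lines of `score_sentences(talks[:5], tokenize.sent_tokenize, Afinn().score)`.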
