# Text Analysis using TextBlob

In this lesson, we'll explore text analysis using [TextBlob](http://textblob.readthedocs.io/en/dev/). TextBlob is a Python library for processing textual data. It provides a simple API (*application programming interface*) for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

*Note: a "library" (or "module") is just programming-speak for an extension. It's something we can add to our Python installation in order to get special abilities.* 

The process to `import` a library into your Jupyter Notebook is very simple, but in order to do so, we must first install the library on the computer.

##### Installing TextBlob on your computer
To [install TextBlob](http://textblob.readthedocs.io/en/dev/install.html) on your computer, open the Anaconda3 folder in the Start menu (Start -> All Programs -> Anaconda3 (64-bit)), and select the Anaconda Prompt.

In the window that opens, type `pip install textblob` and hit enter. You should see the installation progress, followed by a message such as "Successfully installed textblob-0.15.1" when the installation is complete (the version number you see may differ).

##### Installing the NLTK corpora on your computer

Now, we also need to install the NLTK corpora that we will use with TextBlob. To do this, in the Anaconda Prompt window, type: `python -m textblob.download_corpora` and hit enter.

##### Importing TextBlob into your Notebook

Now that we've installed TextBlob on our computer, we can start using it in our Jupyter Notebook. To do this, we first need to `import` the library:

In [None]:
# run the code block below to import TextBlob 
from textblob import TextBlob

Note that it might take a few moments for TextBlob to be imported. 

Now, let's test TextBlob to make sure it's working properly:

In [None]:
blob = TextBlob("It was the best of times, it was the worst of times...")
print(blob)

This small piece of code takes a string ("It was the..."), and uses the TextBlob library we imported to create a special TextBlob object we have named "blob". 

Now that we know TextBlob is working, let's try using one of the built-in functions:

In [None]:
# what do you think this will do?

print(blob.translate(to="es"))

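That last line of code takes our `blob` object and translates it into Spanish by sending the text to [Google Translate](https://cloud.google.com/translate/), so it needs an internet connection. Note that `translate` was deprecated in TextBlob 0.16.0 and removed in 0.18.0; if the call fails with an `AttributeError`, you have a newer version installed than the 0.15.1 used in this lesson.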

### Quiz 1: In the space below write a function that takes a string as an argument and translates it into another language.

Hint: you may want to refer to the documentation [here](http://textblob.readthedocs.io/en/dev/quickstart.html#translation-and-language-detection).

In [None]:
# add your code here



------

Before we dive in deeper, let's write a quick function that will make it easy to retrieve TextBlob objects based on a selected file. This will make it easier to use TextBlob to analyze the novels we have saved in our `data/` directory.

Remember how to read files? Each time, we had to open the file, read its contents, and then close the file. Since we will often need to repeat this series of steps, we can write a single function that does it all for us. Below is a small utility function, `read_file(filename)`, that reads the specified file and returns its contents as a TextBlob object that we can work with.

In [None]:
def read_file(filename):
    "Read the contents of FILENAME and return as a TextBlob object."
    infile = open(filename, encoding="utf8")
    contents = infile.read()
    blob = TextBlob(contents) 
    infile.close()
    return blob

Now, instead of having to open a file, read the contents, and close the file, we can just call `read_file` to do all that:

In [None]:
text = read_file("data/OliverTwist.txt")
#print(text)

At this point, you might be a little confused about what we mean by "TextBlob objects." Without going into too much detail, this is a feature of object-oriented programming, where we interact with special data structures called *objects*.

All you really need to know for now is that a TextBlob object is a piece of text that has been converted into a TextBlob. This means we can call the built-in TextBlob functions (or *methods*) on it, as we do with the line `blob.translate(to="es")`, where we call the method `translate` on the TextBlob object named `blob`.

> Now that we have an easy-to-use function for creating our TextBlobs, we can start learning some new functionality.

# Word & Noun Phrase Frequencies

TextBlob makes it easy to calculate the [frequencies of words and noun phrases](http://textblob.readthedocs.io/en/dev/quickstart.html#get-word-and-noun-phrase-frequencies) without needing to build our own functions, as we did in the previous lessons. 

For example, to calculate how many times the word "Oliver" (case insensitive) appears in our `text` TextBlob, we can use the following line of code:

In [None]:
# find single word count, using lowercase for our search
text.word_counts['oliver']

>**Wait, didn't we receive a different word count in our previous exercise?**
Why do you think this could be?

We can also use the same basic functionality to retrieve frequencies of **all** words in the form of a dictionary:

In [None]:
text.word_counts

We can do something similar with noun phrases:

In [None]:
# to find the frequency of a single noun phrase:
text.np_counts['mr. bumble']

The code below retrieves the frequencies of **all** noun phrases in the form of a dictionary. Note that this may take a minute or two to run!

In [None]:
# to retrieve the frequencies of all noun phrases as a dictionary

text.np_counts

### Quiz 2: write a function that takes a list of terms and a TextBlob to search and returns a dictionary with each term as a key and the frequency of each term as the value.

In [None]:
def word_counter(blob, mylist):
    "return dictionary of word frequencies"
    # Enter your code here:


# Sentiment Analysis

From [Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis): 

Sentiment analysis "refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information...

A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level—whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral." 
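To make the idea of polarity concrete, here is a deliberately tiny sketch of a lexicon-based scorer. It is an illustration only: the word list, the scores, and the name `toy_polarity` are all made up for this example, and this is not how TextBlob itself computes sentiment.

```python
# Toy lexicon: hand-picked words with made-up scores, for illustration only.
TOY_LEXICON = {"best": 1.0, "good": 0.7, "worst": -1.0, "bad": -0.7}

def toy_polarity(sentence):
    "Average the scores of the words found in TOY_LEXICON (0.0 if none match)."
    # Split on whitespace, then strip punctuation and lowercase each word
    words = [w.strip(".,;:!?\"'").lower() for w in sentence.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

print(toy_polarity("It was the best of times."))
print(toy_polarity("It was the worst of times."))
print(toy_polarity("It was the best of times, it was the worst of times."))
```

Notice that the third sentence scores as neutral because its positive and negative words cancel out. Real sentiment tools are more sophisticated, but averaging can hide strong feelings in the same way, which is worth remembering when we score a whole novel at once.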

In [None]:
# What do you think this will return?
text.sentiment


The [sentiment](http://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.TextBlob.sentiment) property returns two values: polarity and subjectivity. The polarity score is a float within the range [-1.0, 1.0], where -1.0 is very negative and 1.0 is very positive.

The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Note that these values are returned in the form of a tuple, which is very similar to a list, except that its contents cannot be changed.

We can also pull just the polarity or the subjectivity scores:

In [None]:
# try running the code below

print("The polarity is " + str(text.sentiment[0]))
print("The subjectivity is " + str(text.sentiment[1]))

In this example, we print the individual polarity and subjectivity values by referring to their respective indices and changing them to strings with the `str()` function.
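The tuple that TextBlob returns is a *named* tuple, so its values can also be read by name rather than by position. The sketch below uses a stand-in built with Python's `collections.namedtuple` so that it runs without a novel loaded; the `example` variable and its numbers are made up for illustration.

```python
from collections import namedtuple

# Stand-in for the value returned by text.sentiment; the numbers are made up.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])
example = Sentiment(polarity=0.1, subjectivity=0.5)

# Reading a value by index and by name gives the same result
print(example[0] == example.polarity)

# A tuple can also be unpacked into separate variables in one step
polarity, subjectivity = example
print("The polarity is " + str(polarity))
print("The subjectivity is " + str(subjectivity))
```

With a real TextBlob, the equivalent would be `text.sentiment.polarity` and `text.sentiment.subjectivity`, which can be easier to read than `[0]` and `[1]`.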

-----

### Quiz 3: In the cell below, determine the polarity and subjectivity of at least two Dickens novels. See if you can find the most positive and most negative.

In [None]:
# enter your code below


---

## What if we want to look at individual sentences? 

[TextBlob also supports tokenization](http://textblob.readthedocs.io/en/dev/quickstart.html#tokenization), or breaking a text into words or sentences. 

For sentence tokenization, TextBlob gives us a property called `sentences` that breaks a larger text into individual sentences. This can be used on its own, but we can also combine it with the sentiment analysis skills we've been building.

In [None]:
# try running the code below

for sentence in text.sentences:
    print(sentence.sentiment)

Okay...that's a lot of information. What we've done here is use TextBlob's built-in `sentences` property to break the text up into individual sentences. Our `for` loop then iterates over these sentences, printing the `sentiment` for each.

But what can we do with this information? Now that we have computed a range of general statistics for our text, it would be nice to have a better way to visualize them. We could, for example, plot the polarity of each sentence and graph the change in sentiment over the course of the novel.

Python is quite good at graphing or plotting. This isn't something that we'll tackle here due to time constraints, but if you're interested in experimenting after class, the plotting library *matplotlib* (see [here](http://matplotlib.org)) is very well supported and allows us to produce all kinds of graphs. 

What we'll do here instead is learn how to export data from Python for use in a third-party program that we may be more comfortable with, such as MS Excel or Google Sheets.

> ### Exporting Data

To do this, let's first create a dictionary. This will allow us to add the polarity values of each sentence in the text to this dictionary, with the sentence number as the key and the polarity as the value. 

In [None]:
# showing change in sentiment over the duration of the novel

sentiment_polarity_by_sentence = {}
counter = 0
for sentence in text.sentences:
    sentiment_polarity_by_sentence[counter] = sentence.sentiment[0]
    counter += 1
    
print(sentiment_polarity_by_sentence)
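As an aside, Python's built-in `enumerate` function can do the counting for us. The sketch below uses a short list of made-up polarity values (`example_polarities`) in place of the real sentences so that it runs on its own.

```python
# Made-up polarity scores standing in for the sentences of a novel
example_polarities = [0.5, -0.25, 0.0]

polarity_by_sentence = {}
# enumerate yields (0, first item), (1, second item), and so on
for number, polarity in enumerate(example_polarities):
    polarity_by_sentence[number] = polarity

print(polarity_by_sentence)
```

With a real TextBlob, the loop would read `for number, sentence in enumerate(text.sentences):` and store `sentence.sentiment[0]`.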

Now, let's try exporting our dictionary to a format that we can use with other programs. A `csv` file is a good option here, as it can handle tabular data of this sort well and can be used with many different types of applications.

In [None]:
# exporting to CSV
import csv

with open('sentiment_polarity_by_sentence.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for key, value in sentiment_polarity_by_sentence.items():
        writer.writerow([key, value])

Don't worry too much about the syntax here. Basically what we are doing is importing a new Python library (`csv`), and using it to write our dictionary, `sentiment_polarity_by_sentence`, to a new CSV file that will be saved in the same directory as our Jupyter Notebook files. 

It's worth noting that with minimal alterations, you could re-use the same block of code to export word or noun phrase counts as csv files.
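One way to make that re-use easier is to wrap the export steps in a function. This is a minimal sketch: the function name `export_counts_to_csv`, the header labels, and the example data are all made up for illustration.

```python
import csv

def export_counts_to_csv(counts, filename, header=("term", "count")):
    "Write the dictionary COUNTS to FILENAME as a two-column CSV with a header row."
    with open(filename, "w", newline="", encoding="utf8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for key, value in counts.items():
            writer.writerow([key, value])

# Made-up counts standing in for a dictionary of word or noun phrase counts
example_counts = {"oliver": 3, "mr. bumble": 2}
export_counts_to_csv(example_counts, "example_counts.csv")
```

The header row is optional, but it gives the columns readable names when the file is opened in a spreadsheet program.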

> ### Graphing with Google Sheets

Now, we can open this file up in our preferred application. In this example, we'll use Google Sheets. 

To do this, first log in to your Bates Gmail account, and use the small grid menu on the upper right to open your Google Drive.

Next, simply drag and drop your new CSV file into your Drive account. Once it's finished uploading, right-click the file and select Open With --> Google Sheets.

Once you've created your new Google Sheet, highlight the two columns with our data, and select Insert --> Chart.

This will give us a nice line chart that we can customize and even publish online.

-------------

## Polarity scores by novel

Let's try another example. What if we wanted to plot the polarity score for each of Dickens's novels?

In [None]:
# First, let's create a list that contains our novels
novels = ['ATaleOfTwoCities','BarnabyRudge','BleakHouse','DavidCopperfield','DombeyandSon','GreatExpectations','HardTimes','LittleDorrit','MartinChuzzlewit','NicholasNickleby','OliverTwist', 'OurMutualFriend', 'TheMysteryofEdwinDrood','TheOldCuriosityShop','ThePickwickPapers']

# Now, let's create an empty dictionary to store our data:
sentiment_polarity_by_novel = {}

# Finally, let's iterate through our list of novels, create a TextBlob of each using our read_file function
# and add the polarity of each to our dictionary as values
for novel in novels:
    newBlob = read_file("data/"+novel+".txt")
    #newBlob = TextBlob(novel)
    s_score = newBlob.sentiment[0]  # index 0 is polarity; index 1 is subjectivity
    sentiment_polarity_by_novel[novel] = s_score

# Just to confirm we are getting what we want:
print(sentiment_polarity_by_novel)

Now we can use the same process as above to export this as a CSV file, and then to use that CSV file in Google Sheets to generate a data visualization:

In [None]:
# exporting to CSV
with open('sentiment_polarity_by_novel.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for key, value in sentiment_polarity_by_novel.items():
        writer.writerow([key, value])
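Before moving to a spreadsheet, we can also ask Python directly which novel scored highest or lowest. The sketch below uses a small dictionary of made-up titles and scores (`example_scores`) in place of the real results so that it runs on its own.

```python
# Made-up scores standing in for the polarity-by-novel dictionary
example_scores = {"NovelA": 0.08, "NovelB": 0.12, "NovelC": 0.05}

# max and min compare the keys by their values when given key=...get
most_positive = max(example_scores, key=example_scores.get)
most_negative = min(example_scores, key=example_scores.get)
print("Most positive: " + most_positive)
print("Most negative: " + most_negative)

# sorted returns (title, score) pairs, here ordered from highest to lowest score
ranked = sorted(example_scores.items(), key=lambda item: item[1], reverse=True)
for title, score in ranked:
    print(title, score)
```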

---

### Final Quiz

User choice - use the tools and techniques we've reviewed during these lessons to generate a data visualization of your choice using at least one novel from our Dickens corpus. 