# Calculating TTR using TextBlob

### Task
Annotate each of the following lines with a comment explaining what the line is doing. 

In [None]:
from textblob import TextBlob

whitman_text = open('song_of_myself.txt', encoding = 'utf-8').read() 

whitman_blob = TextBlob(whitman_text) 

### Question

a) What does the `.word_counts` TextBlob property return?

b) What do `whitman_counts.keys()` and `whitman_counts.values()` return?

c) What do the values `sum(whitman_counts.values())` and `len(whitman_counts.keys())` represent?

(Note: `.word_counts` is a property of a TextBlob object rather than a method, which is why it doesn't take the ending `()`. `.keys()` and `.values()` are built-in Python methods that work with dictionaries.)

In [None]:
whitman_counts = whitman_blob.word_counts
whitman_counts

In [None]:
whitman_counts.keys()

In [None]:
whitman_counts.values()

In [None]:
print(sum(whitman_counts.values()))

print(len(whitman_counts.keys()))

song_of_myself_ttr = len(whitman_counts.keys()) / sum(whitman_counts.values()) * 100
print(song_of_myself_ttr)
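
The same type-token ratio arithmetic can be sketched without TextBlob. This is only an illustration: `type_token_ratio` is a hypothetical helper, and the regex tokenizer is an assumption that will not split words exactly the way TextBlob does, so the numbers for a full text may differ slightly.

```python
import re
from collections import Counter

def type_token_ratio(text):
    """Return (number of types, number of tokens, TTR) for a text.

    Assumption: words are runs of letters/apostrophes, lowercased.
    TextBlob's own tokenizer may split some words differently.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)          # maps each distinct word to its count
    n_types = len(counts)             # distinct words (types)
    n_tokens = sum(counts.values())   # all words (tokens)
    return n_types, n_tokens, n_types / n_tokens

sample = "I celebrate myself, and sing myself, and what I assume you shall assume"
n_types, n_tokens, ttr = type_token_ratio(sample)
print(n_types, n_tokens, ttr)
```

Here `len(counts)` plays the role of `len(whitman_counts.keys())` and `sum(counts.values())` plays the role of `sum(whitman_counts.values())`.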

# Creating a dataframe with rolling average data

### Task

Let's go through and annotate this code block.

In [None]:
from textblob import TextBlob

whitman_text = open('song_of_myself.txt', encoding = 'utf-8').read()

whitman_blob = TextBlob(whitman_text)

whitman_sentences_blob = whitman_blob.sentences

whitman_polarities = [] 

for sentence in whitman_sentences_blob:
    whitman_polarities.append(sentence.polarity) 
    
whitman_subjectivities = []
for sentence in whitman_sentences_blob:
    whitman_subjectivities.append(sentence.subjectivity)
    
whitman_sentences = []
for sentence in whitman_sentences_blob:
    whitman_sentences.append(" ".join(sentence.words))
    
whitman_ttrs = []
for sentence in whitman_sentences_blob:
    sentence_counts = sentence.word_counts
    whitman_ttrs.append((len(sentence_counts)/sum(sentence_counts.values())))
    
import pandas as pd
whitman_sentences_df = pd.DataFrame({
    'sentence': whitman_sentences,
    'polarity': whitman_polarities,
    'subjectivity': whitman_subjectivities,
    'TTR': whitman_ttrs
})    

whitman_sentences_df

### Task

Plot a line graph showing how the TTRs change over the course of the poem. 

### Question

What does the below line of code do? 

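Note: `center=True` means each rolling value is computed from the 10 items centered around that row, rather than from the 10 items ending at that row. For example, the value at index 5 in `whitman_sentences_df['polarity'].rolling(window=10, center=True).mean()` is the mean polarity of the first 10 sentences (indices 0 through 9). The values before it are `NaN` because a full window isn't available there.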

In [None]:
whitman_sentences_df['rolling_10_polarity'] = \
    whitman_sentences_df['polarity'].rolling(window=10, center=True).mean()


In [None]:
whitman_sentences_df['polarity'].rolling(window=10, center=True).mean()[:25]
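
To see where a centered window sits, it can help to try `.rolling()` on a small made-up Series where the means are easy to check by hand. The `toy` data below is an assumption for illustration only and has nothing to do with the poem.

```python
import pandas as pd

# Toy data: the numbers 0.0 through 19.0
toy = pd.Series(range(20), dtype=float)

# Centered window: each value is the mean of the 10 items around that row
centered = toy.rolling(window=10, center=True).mean()

# Default (trailing) window: each value is the mean of the 10 items ending at that row
trailing = toy.rolling(window=10).mean()

# Rows without a full window are NaN, so the first valid index differs
first_valid_centered = centered.first_valid_index()
first_valid_trailing = trailing.first_valid_index()

print(first_valid_centered, centered[first_valid_centered])
print(first_valid_trailing, trailing[first_valid_trailing])
```

Both first valid values are the mean of items 0 through 9; `center=True` only changes which row that mean is attached to.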

In [None]:
whitman_sentences_df['rolling_10_polarity'].plot(kind='line', figsize=(20,8))

### Task

a) Alter the below lines of code to create a new column in the `whitman_sentences_df` DataFrame called `rolling_20_polarity`, which contains the average polarity of every group of 20 consecutive items. Then, create a line plot representing the data in the `rolling_10_polarity` and `rolling_20_polarity` columns. 


In [None]:
whitman_sentences_df['rolling_10_polarity'] = \
    whitman_sentences_df['polarity'].rolling(window=10, center=True).mean()

.plot()

b) Alter the below line of code to create a new column in the `whitman_sentences_df` DataFrame called `rolling_10_ttr`, which contains the average TTR of every group of 10 consecutive items. Then plot the result. 


In [None]:
whitman_sentences_df['rolling_10_polarity'] = \
    whitman_sentences_df['polarity'].rolling(window=10, center=True).mean()

# Creating scatter plots

Scatter plots can be used if you want to see if two variables are correlated. 

In [None]:
whitman_sentences_df

In [None]:
whitman_sentences_df.plot(kind='scatter', x = 'polarity', y='subjectivity')

In [None]:
whitman_sentences_df.plot(kind='scatter', x = 'polarity', y='subjectivity', marker = '*', color='fuchsia')
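
A scatter plot shows correlation visually; as a possible numeric companion, pandas can also compute a correlation coefficient between two columns with `.corr()`. The sketch below uses made-up columns (`x`, `y_up`, `y_down`) rather than the Whitman data.

```python
import pandas as pd

# Toy data: y_up rises with x, y_down falls as x rises
toy_df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y_up': [2, 4, 6, 8],
    'y_down': [8, 6, 4, 2],
})

# Pearson correlation: close to 1 means the columns rise together,
# close to -1 means one falls as the other rises, near 0 means no linear trend
r_up = toy_df['x'].corr(toy_df['y_up'])
r_down = toy_df['x'].corr(toy_df['y_down'])

print(r_up, r_down)
```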

# DataFrame review

In [None]:
whitman_sentences_df

## Isolating columns

### Question

How can I isolate just the `sentence` and `rolling_10_polarity` columns from `whitman_sentences_df`?  

In [None]:
#Isolating the sentence and rolling 10 polarity columns:
whitman_sentences_df
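
As a reminder of the general pattern, here is a minimal sketch on a made-up DataFrame. The column names and values in `toy_df` are placeholders, not real scores.

```python
import pandas as pd

# Toy data with placeholder values
toy_df = pd.DataFrame({
    'sentence': ['first sentence', 'second sentence'],
    'polarity': [0.0, 0.3],
    'subjectivity': [0.0, 0.5],
})

# A list of column names inside the brackets returns a DataFrame
two_columns = toy_df[['sentence', 'polarity']]

# A single column name returns a Series
one_column = toy_df['polarity']

print(two_columns)
```

Note the double square brackets: the outer pair indexes the DataFrame, and the inner pair is an ordinary Python list of column names.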

## Filtering rows

You can isolate rows that correspond to certain criteria.

### Question

The below line of code, copy-pasted from a lecture, was used to get a dataframe containing only the works written by Toni Morrison. 

Alter the line to get a dataframe containing only the sentences in `whitman_sentences_df` with zero polarity. 

In [None]:
nyt_df[nyt_df['author'] == 'Toni Morrison']

Now, write a line of code to isolate rows with sentences whose polarity value is positive:

In [None]:
whitman_sentences_df

We can put the "filter" in a separate variable, then apply it to the DataFrame.

In [None]:
polarity_filter = 
whitman_sentences_df[polarity_filter]
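
The filter pattern can be sketched on a small made-up DataFrame. The values in `toy_df` are placeholders, and the example filters for negative polarity so that the questions above are still left for you.

```python
import pandas as pd

# Toy data with placeholder values
toy_df = pd.DataFrame({
    'sentence': ['a', 'b', 'c', 'd'],
    'polarity': [-0.4, 0.0, 0.25, 0.0],
})

# The comparison produces a boolean Series: one True/False per row
negative_filter = toy_df['polarity'] < 0

# Indexing with that Series keeps only the rows marked True
negative_rows = toy_df[negative_filter]

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses
nonzero_rows = toy_df[(toy_df['polarity'] < 0) | (toy_df['polarity'] > 0)]

print(negative_rows)
print(nonzero_rows)
```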

## Grouping by a single column

The code below adds a new column called "polarity_category" that reads "Positive" if the sentence has positive polarity, "Negative" if the sentence has negative polarity, and "Neutral" if the sentence has zero polarity. 

I want to see if, on average, positive or negative sentences are *more subjective* than neutral sentences. 

In [None]:
whitman_text = open('song_of_myself.txt', encoding = 'utf-8').read() 

whitman_sentences_blob = TextBlob(whitman_text).sentences

polarity_categories = []

for sentence in whitman_sentences_blob:
    if sentence.polarity < 0:
        polarity_categories.append("Negative")
    elif sentence.polarity > 0:
        polarity_categories.append("Positive")
    else:
        polarity_categories.append("Neutral")

whitman_sentences_df["polarity_category"] = polarity_categories
whitman_sentences_df

### Question

The below line of code found the mean VADER score across each category of test sentence. 

Alter the line to group `whitman_sentences_df` by the polarity category, then calculate the mean subjectivity in each category. 

In [None]:
results_df.groupby('Category')['VADER Score'].mean()
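
The group-then-aggregate pattern can be sketched on a made-up DataFrame. The values in `toy_df` are placeholders, and the example averages a `TTR` column rather than subjectivity.

```python
import pandas as pd

# Toy data with placeholder values
toy_df = pd.DataFrame({
    'polarity_category': ['Positive', 'Neutral', 'Positive', 'Negative', 'Neutral'],
    'TTR': [1.0, 0.8, 0.9, 1.0, 0.6],
})

# Group rows by category, pick one column, then average it within each group
mean_ttr = toy_df.groupby('polarity_category')['TTR'].mean()

print(mean_ttr)
```

The result is a Series indexed by the category labels, with one mean per category.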

## Grouping by multiple columns

Let's recall how to group by multiple columns.

I can use `.groupby()` to group according to Column1, then Column2; use `.size()` to find the total number of instances of Column2 items in each Column1 category; then use `.unstack()` (with `fill_value=0`) to make a nice dataframe from this data. 

I created a new column in `whitman_sentences_df` called `subjectivity_category`, which reads "Objective" if subjectivity is zero, "Somewhat subjective" if subjectivity is between 0 and 0.5, "Very subjective" if subjectivity is between 0.5 and 1, and "Subjective" if subjectivity is 1. 

I want to find the number of instances of each subjectivity category among positive, negative, and neutral sentences. 

In [None]:
whitman_text = open('song_of_myself.txt', encoding = 'utf-8').read() 

whitman_sentences_blob = TextBlob(whitman_text).sentences

subjectivity_categories = []

for sentence in whitman_sentences_blob:
    
    if sentence.subjectivity == 0:
        subjectivity_categories.append("Objective")
        
    elif 0 < sentence.subjectivity <= 0.5:
        subjectivity_categories.append("Somewhat subjective")
        
    elif 0.5 < sentence.subjectivity < 1:
        subjectivity_categories.append("Very subjective")
        
    else:
        subjectivity_categories.append("Subjective")

whitman_sentences_df["subjectivity_category"] = subjectivity_categories
whitman_sentences_df

### Task

We used the line 

`nytg_df.groupby(['year', 'gender_signal']).size().unstack(fill_value=0)`

in lecture. Alter the line to do the following task:

Group by `polarity_category`, then by `subjectivity_category`, then use the `.size()` and `.unstack()` methods to create a dataframe containing the number of instances of each subjectivity category within each polarity category. Put this dataframe in a variable called `whitman_grouped_df`. Then plot the data. 

In [None]:
whitman_grouped_df = nytg_df.groupby(['year', 'gender_signal']).size().unstack(fill_value=0)
whitman_grouped_df
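
Here is a minimal sketch of the `.groupby()` / `.size()` / `.unstack()` chain on made-up data. The `length_category` column and all values in `toy_df` are hypothetical, chosen only so the counts are easy to check by hand.

```python
import pandas as pd

# Toy data with placeholder values
toy_df = pd.DataFrame({
    'polarity_category': ['Positive', 'Positive', 'Neutral', 'Negative', 'Positive'],
    'length_category': ['Short', 'Long', 'Short', 'Short', 'Short'],
})

# .size() counts the rows in each (polarity, length) combination
counts = toy_df.groupby(['polarity_category', 'length_category']).size()

# .unstack() moves the second grouping level into columns;
# fill_value=0 replaces combinations that never occur with 0 instead of NaN
table = counts.unstack(fill_value=0)

print(table)
```

Each row of `table` is a polarity category and each column is a length category, so the cell at a row and column holds the count for that combination.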

## Sorting columns

### Question

I copy-pasted the following lines from lecture notebooks. Alter them accordingly to answer these questions about `whitman_sentences_df`.

In [None]:
# Find the top 10 sentences with lowest subjectivity. 
# (You need to slice the first ten items of the dataframe. 
# The syntax is the same as slicing lists.)

csal_ttr_df.sort_values(by='Standardized TTR', ascending=False)

In [None]:
#Find the sentence with highest combined polarity and subjectivity.

csal_ttr_df.sort_values(by=['Standardized TTR', "Overall TTR"], ascending=[True, True])

In [None]:
#Find the 5 sentences with highest polarity. 

checkouts_df.nlargest(10, 'Checkouts')
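
The sorting and slicing patterns can be sketched on a made-up DataFrame. The values in `toy_df` are placeholders, and the example sorts by a `TTR` column rather than the columns asked about above.

```python
import pandas as pd

# Toy data with placeholder values
toy_df = pd.DataFrame({
    'sentence': ['a', 'b', 'c', 'd', 'e'],
    'TTR': [0.9, 0.5, 1.0, 0.7, 0.6],
})

# Sort from lowest to highest, then slice the first two rows (same syntax as lists)
lowest_two = toy_df.sort_values(by='TTR', ascending=True)[:2]

# .nlargest(n, column) is a shortcut for the n rows with the highest values
highest_two = toy_df.nlargest(2, 'TTR')

print(lowest_two)
print(highest_two)
```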

# Final project and exam tips!!

project:
- Start early! If you run into problems with any coding, you don't want to be debugging at the last minute. 
- Try not to be too ambitious; you can create an interesting project within a neighbourhood of the class material. If you want to do something outside of the scope of the class, make sure you start early and ask for help from the teaching team along the way. 


exam:
- Go through the labs and homework, clear your previous work, and redo them. 
- Create a new notebook where you can copy-paste important bits of code so you can refer to them easily. (I believe you should have access to anything uploaded to jupyterhub during the exam, but please confirm with the instructors.) Knowing how to copy-paste and alter code to your needs is a useful skill!

I hope you all can see how far you've come from the first weeks of this class! I'm really proud of all of you and I wish you all the best of luck with the rest of your classes!