![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fmisterhay%2FInteresting-Problems&branch=master&subPath=analysing-text-statistics.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Analysing Text Statistics

Let's try out some statistical analysis of text, including [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing), using a [public domain](https://en.wikipedia.org/wiki/Public_domain) book from [Project Gutenberg](http://www.gutenberg.org).

The example we'll use is the [most downloaded](http://www.gutenberg.org/browse/scores/top) book, [Pride and Prejudice by Jane Austen](http://www.gutenberg.org/ebooks/1342). Running this first code cell will import and display the contents of the book.

Feel free to change the link in the following code cell if you'd like to explore another book, but make sure you are using the `Plain Text UTF-8` link.

In [None]:
gutenberg_text_link = 'http://gutenberg.org/files/1342/1342-0.txt'

import requests
r = requests.get(gutenberg_text_link)
r.encoding = 'utf-8'
text = r.text.split('***')[2]
text = text.replace("’","'").replace("“",'"').replace("”",'"')
print(text)
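
Project Gutenberg plain-text files place the book between a `*** START OF ... ***` line and an `*** END OF ... ***` line, which is why splitting on `***` and taking the piece at index `2` returns the body. Here is a minimal sketch on a made-up sample string. The sample text and variable names are hypothetical, and real files may lay out their markers differently.

```python
# Hypothetical miniature of a Project Gutenberg plain-text file.
sample = (
    "Header and licence notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Chapter 1\nIt is a truth universally acknowledged...\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer and licence text\n"
)

pieces = sample.split('***')
# pieces[0] is the header, pieces[1] the START marker text,
# pieces[2] the book body, pieces[3] the END marker text,
# and pieces[4] the footer.
body = pieces[2].strip()
print(len(pieces))
print(body)
```

If you switch to another book and the output looks wrong, printing `len(pieces)` is a quick way to check whether the file follows the same layout.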

## Making a DataFrame

Now that we have the text of the book, let's split it into chapters. We'll use the Python [library](https://en.wikipedia.org/wiki/Library_(computing)) called [pandas](https://pandas.pydata.org) to create a [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) that includes the text and length of each chapter (number of characters, including spaces and punctuation).

In [None]:
import pandas as pd
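chapter_rows = []  # one dictionary per chapter
for chapter in text.split('Chapter'):
    if len(chapter)>500:  # skip short fragments such as the front matter
        # replace line breaks with spaces so words at line ends are not joined together
        chapter = chapter.replace('\r','').replace('\n',' ')
        chapter_rows.append({'Chapter Text':chapter, 'Length':len(chapter)})
df = pd.DataFrame(chapter_rows)  # DataFrame.append was removed in pandas 2.0
df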
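
To see what the split and the length filter do, here is a small sketch using only built-in Python on a made-up string. The sample text and the `min_length` value are hypothetical (the notebook uses `500` because real chapters are much longer).

```python
# Hypothetical sample; a real book is far longer.
sample_text = (
    "Preface Chapter 1 It was a bright day. "
    "Chapter 2 The rain came, and stayed for a week."
)
min_length = 20  # hypothetical threshold for this tiny sample

chapters = []
for piece in sample_text.split('Chapter'):
    # Short pieces (here, the preface) are treated as non-chapters and skipped.
    if len(piece) > min_length:
        chapters.append({'Chapter Text': piece, 'Length': len(piece)})

for row in chapters:
    print(row['Length'], row['Chapter Text'].strip())
```

A list of dictionaries like `chapters` can be passed directly to `pd.DataFrame` to get one row per chapter.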

## Visualizing Chapter Lengths

From that dataframe we can create a bar graph of chapter lengths using the [cufflinks](https://github.com/santosjorge/cufflinks) library.

In [None]:
import cufflinks as cf
cf.go_offline()
df.iplot(kind='bar', y='Length', title='Chapter Lengths', yTitle='Length (characters)', xTitle='Chapter')

## Counting Words by Type

We'll use the [spaCy](https://spacy.io) natural language processing library to identify the [parts of speech](https://spacy.io/api/annotation#pos-tagging) in the text. For this example we'll just look at adjectives, verbs, nouns, and proper nouns, but you can add to the list on the first line in the code cell.

This will take a while to run, and will result in a dataframe containing the number of each of those parts of speech in each chapter.

In [None]:
word_types = ['ADJ', 'VERB', 'NOUN', 'PROPN'] # https://spacy.io/api/annotation#pos-tagging

#!pip install spacy --user
#!python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()
pos_rows = []  # one dictionary of counts per chapter
for i in range(len(df)):
    parts_of_speech_list = []
    for token in nlp(df['Chapter Text'][i]):
        part_of_speech = token.pos_
        if part_of_speech in word_types:
            parts_of_speech_list.append(part_of_speech)
    word_type_count = {}
    for word_type in word_types:
        word_type_count.update({word_type:parts_of_speech_list.count(word_type)})
    pos_rows.append(word_type_count)
parts_of_speech_df = pd.DataFrame(pos_rows, columns=word_types)
parts_of_speech_df['Chapter'] = parts_of_speech_df.index+1
parts_of_speech_df = parts_of_speech_df.set_index('Chapter')  # set_index returns a new dataframe
parts_of_speech_df

## Most Common Verbs

To get an idea of the most common words in the text we can look at a part of speech, verbs for example, and count which are the most frequent.

This will also take some time to run.

In [None]:
word_type = 'VERB'

from collections import Counter
word_counts = []  # one Counter of verb lemmas per chapter
for i in range(len(df)):
    word_list = []
    for token in nlp(df['Chapter Text'][i]):
        if token.pos_ == word_type:
            word_list.append(token.lemma_.strip().lower())
    word_counts.append(Counter(word_list))
words_df = pd.DataFrame(word_counts)  # one row per chapter, one column per verb
words_df.sum().sort_values(ascending=False)

In our example text there are 1233 unique verbs. Let's look at the `10` most common verbs.

In [None]:
words_df.sum().sort_values(ascending=False).head(10)
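
The same "count per chapter, then total and rank" idea can be sketched with `Counter` alone. The verb lists below are made up for illustration. In the notebook they come from spaCy.

```python
from collections import Counter

# Hypothetical per-chapter verb lemmas (the notebook gets these from spaCy).
chapter_verbs = [
    ['say', 'be', 'say', 'think'],
    ['be', 'say', 'go'],
    ['be', 'be', 'know'],
]

per_chapter = [Counter(verbs) for verbs in chapter_verbs]
totals = sum(per_chapter, Counter())  # adds the counts verb by verb
top_two = totals.most_common(2)       # like sort_values(ascending=False).head(2)

print(len(totals))  # number of unique verbs
print(top_two)
```

Here `len(totals)` plays the role of the unique-verb count, and `most_common` plays the role of sorting the column sums.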

We can also choose a verb and plot its frequency by chapter.

In [None]:
word = 'say'

words_df.iplot(y=word, title='Frequency of the Word "'+word+'"', yTitle='Frequency', xTitle='Chapter')

## Most Common Names

We can also look at character names and how often they occur in each chapter. The spaCy library does a fairly good job of identifying names, but you may see some false positives (words that aren't actually character names).

In [None]:
name_counts = []  # one Counter of names per chapter
for i in range(len(df)):
    names_list = []
    for token in nlp(df['Chapter Text'][i]):
        #if token.pos_ == 'PROPN':
        if token.ent_type_ == 'PERSON':
            names_list.append(token.text)
    name_counts.append(Counter(names_list))
names_df = pd.DataFrame(name_counts)  # one row per chapter, one column per name
names_df

### List of Character Names

We can check out the list of words identified as names.

In [None]:
for name in names_df.columns:
    print(name)

### Cleaning Data

Some of these words are probably not character names. To remove columns that were likely categorized incorrectly, we can drop any column with fewer than five occurrences in the whole book.

In [None]:
for column in names_df.columns:
    if names_df[column].sum() < 5:
        names_df.drop(columns=column, inplace=True)
names_df
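
The effect of the threshold is easy to check by hand on a small example. This sketch uses made-up counts, with `'Dear'` standing in for a word wrongly tagged as a name. The variable names are hypothetical.

```python
from collections import Counter

# Hypothetical per-chapter name counts; 'Dear' is a stand-in false positive.
chapter_names = [
    Counter({'Elizabeth': 4, 'Darcy': 2, 'Dear': 1}),
    Counter({'Elizabeth': 3, 'Darcy': 3}),
    Counter({'Darcy': 1, 'Dear': 1}),
]
min_mentions = 5  # same cut-off as the cell above

totals = sum(chapter_names, Counter())
# Keep only names mentioned at least min_mentions times across all chapters.
kept = {name: count for name, count in totals.items() if count >= min_mentions}

# Per-chapter counts for the names we kept (a missing name counts as 0).
per_chapter = {name: [c[name] for c in chapter_names] for name in kept}
print(kept)
print(per_chapter)
```

A fixed cut-off is a blunt tool: it can also remove minor characters who are mentioned only a few times, so it is worth scanning the list before and after.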

### Visualization of Name Frequencies

Let's make a bar graph of the top `20` most frequently mentioned characters.

In [None]:
names_df.sum().sort_values(ascending=False).head(20).iplot(kind='bar', yTitle='Frequency', title='Character Names')

### Name Frequencies over Time

Since we have the text divided into chapters, let's visualize how often the top `3` character names are mentioned per chapter.

In [None]:
main_character_names = names_df.sum().sort_values(ascending=False).head(3).index
main_characters = names_df[main_character_names]
main_characters.iplot(yTitle='Frequency', xTitle='Chapter', title='Frequency of Character Mentions by Chapter')

## Sentiment Analysis

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is the process of categorizing text by its tone (negative, neutral, or positive).

For this we will use the [vaderSentiment](https://github.com/cjhutto/vaderSentiment) library, then visualize the positive and negative sentiment by chapter.

In [None]:
!pip install vaderSentiment --user
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

sentiment_rows = []  # one dictionary of scores per chapter
for i in range(len(df)):
    senti = analyzer.polarity_scores(df['Chapter Text'][i])
    sentiment_rows.append(senti)
sentiment_df = pd.DataFrame(sentiment_rows)
sentiment_df.iplot(y=['neg', 'pos'], title='Sentiment Analysis by Chapter', xTitle='Chapter')

## Readability

The last library we'll introduce is [textstat](https://github.com/shivam5992/textstat), which checks the readability, complexity, and grade level of text. It includes a [number of functions](https://github.com/shivam5992/textstat#list-of-functions), but we'll only use a few of them.

In [None]:
#!pip install --user textstat
import textstat
readability_rows = []  # one dictionary of scores per chapter
for i in range(len(df)):
    chapter_text = df['Chapter Text'][i]  # separate name so the full book in `text` is not overwritten
    readability_data = {'Flesch-Kincaid Grade':textstat.flesch_kincaid_grade(chapter_text),
                        'Gunning Fog Index':textstat.gunning_fog(chapter_text),
                        'Linsear Write Formula':textstat.linsear_write_formula(chapter_text),
                        'Readability':textstat.text_standard(chapter_text, float_output=True)}
    #print(readability_data)
    readability_rows.append(readability_data)
readability = pd.DataFrame(readability_rows)
readability

Now that we have a dataframe of readability information, we can describe the statistics.

In [None]:
readability.describe()
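
Two of these scores are simple formulas over word, sentence, and syllable counts. The sketch below writes out the standard published formulas for the Flesch-Kincaid grade and the Gunning fog index so you can see what goes into them. The function names and the counts are hypothetical, and textstat's own results may differ slightly because it has to estimate syllables and sentences from raw text.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level computed from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    """Gunning fog index; complex words have three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

# Hypothetical counts for a short passage.
fk = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
fog = gunning_fog(words=100, sentences=5, complex_words=10)
print(round(fk, 2), round(fog, 2))
```

Both scores rise with longer sentences and with longer words, which is why they tend to move together from chapter to chapter.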

# Conclusion

Hopefully this was an interesting introduction to text statistics. You can also analyse the text from any other online document using similar code.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)