# Introduction to Text Analysis, with a side of Dictionaries

Now that we've learned some Python basics, we'll move to applying our tools to do analyses beyond graphing numbers. Our primary example: text analysis.

To do so, we often don't want lists or dataframes, but we want certain elements to be associated with values (we covered this a bit last week). A person will have an income, for example, or a novel will use the word `humanistic` a certain number of times. Today we'll think through data types that can help us store these associations. This form of data is called linked data.


# Tuples, Dictionaries, and List and Dictionary Comprehension

A *tuple* is a collection of objects which is ordered and unchangeable. In Python tuples are written with round brackets.
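
As a minimal sketch of what "ordered and unchangeable" means in practice (the variable name `point` is made up for illustration):

```python
# A tuple is written with round brackets (parentheses).
point = (3, 4)

# Ordered: items can be read by position, just like a list.
first = point[0]

# Unchangeable: assigning to a position raises a TypeError.
try:
    point[0] = 10
    changed = True
except TypeError:
    changed = False

print(first, changed)
```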

A *dictionary* in Python is a collection of key:value pairs, used to store data like a map. Unlike data types that hold a single value per element, each element of a *dictionary* pairs a key with a value.

Like lists, dictionaries can easily be changed, can be shrunk and grown ad libitum at run time. They shrink and grow without the necessity of making copies. Dictionaries can be contained in lists and vice versa. 
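
A small sketch of nesting and of growing and shrinking a dictionary in place. The names and values here (`people`, `words_by_letter`, the incomes) are invented for illustration:

```python
# A list that contains dictionaries: one dictionary per (made-up) person.
people = [
    {'name': 'Ada', 'income': 100},
    {'name': 'Grace', 'income': 250},
]

# A dictionary that contains lists: each key maps to a list of values.
words_by_letter = {'t': ['the', 'there'], 'c': ['chaos']}

# Dictionaries grow and shrink in place, without making a copy.
people[0]['age'] = 24      # add a key/value pair
del people[0]['income']    # remove one

print(people[0])
print(words_by_letter['t'][1])
```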

But what's the difference between lists and dictionaries? Lists are ordered sequences of objects accessed by position, whereas items in dictionaries are accessed via keys. (Since Python 3.7 a dictionary does remember the order in which its keys were inserted, but you still look items up by key, not by position.) A dictionary is an associative array (also known as a hash). Any key of the dictionary is associated (or mapped) to a value. The values of a dictionary can be any Python data type; the keys must be hashable, such as strings, numbers, or tuples.

*List comprehension* is a syntactic construct available in some programming languages for creating a list based on existing lists. It condenses what we did before, looping through lists, down to one line. (It's way more powerful, but we'll just get a taste here.)


# Self-Defined Functions

So far, we have only been using the functions that come with Python, but it is also possible to add new functions. A function definition specifies the name of a new function and the sequence of statements that execute when the function is called. Once we define a function, we can reuse the function over and over throughout our program.

## Defining your own function


Here is an example:

In [23]:
def print_lyrics():
    print("I'm a lumberjack, and I'm okay.")
    print('I sleep all night and I work all day.')

`def` is a keyword that indicates that this is a function definition. The name of the function is `print_lyrics`. The rules for function names are the same as for variable names: letters, numbers, and underscores are legal, but the first character can't be a number. You can't use a keyword as the name of a function, and you should avoid having a variable and a function with the same name.

The empty parentheses after the name indicate that this function doesn't take any arguments. Later we will build functions that take arguments as their inputs.

The first line of the function definition is called the *header*; the rest is called the *body*. The header has to end with a colon and the body has to be indented. By convention, the indentation is always four spaces. The body can contain any number of statements.

The strings in the print statements are enclosed in quotes. Single quotes and double quotes do the same thing; most people use single quotes except in cases like this where a single quote (which is also an apostrophe) appears in the string.

The syntax for calling the new function is the same as for built-in functions:

In [24]:
print_lyrics()

I'm a lumberjack, and I'm okay.
I sleep all night and I work all day.


Once you have defined a function, you can use it inside another function. For example, to repeat the previous refrain, we could write a function called `repeat_lyrics`:

In [25]:
def repeat_lyrics():
    print_lyrics()
    print_lyrics()

In [26]:
repeat_lyrics()

I'm a lumberjack, and I'm okay.
I sleep all night and I work all day.
I'm a lumberjack, and I'm okay.
I sleep all night and I work all day.


## Parameters and Arguments

Some of the built-in functions we have seen require arguments.

Inside the function, the arguments are assigned to variables called parameters. Here is an example of a user-defined function that takes an argument:

In [27]:
def phrase_length(p):
    print(len(p))
    print(p)

This function assigns the argument to a parameter named `p`. When the function is called, it prints the length of the value of the parameter (whatever it is), and then the value itself.

This function works with any value that can be an argument for the `len` function.

In [28]:
phrase_length("In the beginning")
phrase_length("Call me Ishmael")
phrase_length("I heard a Fly buzz – when I died –  \
The Stillness in the Room \
Was like the Stillness in the Air –   \
Between the Heaves of Storm – \
\
The Eyes around – had wrung them dry –  \
And Breaths were gathering firm \
For that last Onset – when the King \
Be witnessed – in the Room –  \
\
I willed my Keepsakes – Signed away \
What portions of me be \
Assignable – and then it was \
There interposed a Fly –  \
\
With Blue – uncertain stumbling Buzz –  \
Between the light – and me –  \
And then the Windows failed – and then \
I could not see to see – ")

16
In the beginning
15
Call me Ishmael
516
I heard a Fly buzz – when I died –  The Stillness in the Room Was like the Stillness in the Air –   Between the Heaves of Storm – The Eyes around – had wrung them dry –  And Breaths were gathering firm For that last Onset – when the King Be witnessed – in the Room –  I willed my Keepsakes – Signed away What portions of me be Assignable – and then it was There interposed a Fly –  With Blue – uncertain stumbling Buzz –  Between the light – and me –  And then the Windows failed – and then I could not see to see – 


## Why Functions?

It may not be clear why it is worth the trouble to divide a program into functions. There are several reasons:

* Creating a new function gives you an opportunity to name a group of statements, which makes your program easier to read, understand, and debug.

* Functions can make a program smaller by eliminating repetitive code. Later, if you make a change, you only have to make it in one place.

* Dividing a long program into functions allows you to debug the parts one at a time and then assemble them into a working whole.

* Well-designed functions are often useful for many programs. Once you write and debug one, you can reuse it.

Throughout the rest of the course, often we will use a function definition to explain a concept. Part of the skill of creating and using functions is to have a function properly capture an idea such as "find the smallest value in a list of values". Later we will show you code that finds the smallest in a list of values and we will present it to you as a function named min which takes a list of values as its argument and returns the smallest value in the list.
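
To make the "find the smallest value" idea concrete, here is one possible sketch. We call it `find_smallest` (a name of our own choosing) so that it does not replace Python's built-in `min`. It also uses `return`, which hands a value back to the caller instead of printing it:

```python
def find_smallest(values):
    """Return the smallest item in a non-empty list of values."""
    smallest = values[0]          # start with the first item
    for value in values[1:]:      # compare it with every other item
        if value < smallest:
            smallest = value
    return smallest               # hand the result back to the caller

lengths = [16, 15, 516]
print(find_smallest(lengths))
```

Because the function returns its result, we can store it in a variable or use it in a larger expression, which we could not do with the printed output of `phrase_length`.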

## Tuples and Dictionaries

In [29]:
my_tuple = [('education', 'high school'), ('income', 100)]

print(type(my_tuple))
print(type(my_tuple[0]))

<class 'list'>
<class 'tuple'>


In [30]:
#You can loop through a list of tuples, but you need to assign multiple variables (one per tuple item) when you loop through it:
for key, value in my_tuple:
    print("The key is: ")
    print(key)
    print("The value is: ")
    print(value)
    print('\n')

The key is: 
education
The value is: 
high school


The key is: 
income
The value is: 
100




In [31]:
my_dict = dict(my_tuple)
type(my_dict)

dict

In [32]:
my_dict

{'education': 'high school', 'income': 100}

The key is before the colon, the value is after the colon. 

Find all the keys from the dictionary, and then all the values.

In [5]:
my_dict.keys()

dict_keys(['education', 'income'])

In [6]:
my_dict.values()

dict_values(['high school', 100])

We can access values using the bracket syntax. We've seen this before (remember columns in Pandas?). The input is a dictionary key, the output is the key's value.

In [7]:
my_dict['education']

'high school'

In [8]:
my_dict['income']

100

We can add key/value pairs using the bracket syntax and the assignment operator. Notice that the position of a key/value pair does not matter the way it does in lists and strings: we always look a value up by its key.

In [9]:
my_dict['age'] = 24
my_dict

{'education': 'high school', 'income': 100, 'age': 24}

In [10]:
#list comprehension!

mylist = ['In', 'the', 'beginning', 'there', 'was', 'chaos']

# Use a loop to filter out words that begin with 't':

t_words = []

for word in mylist:
    if word.lower().startswith('t'):
        t_words.append(word)

print(t_words)

['the', 'there']


In [11]:
#now do it with list comprehension

t_words = [word for word in mylist if word.lower().startswith('t')]
t_words

['the', 'there']
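
Dictionaries have a comprehension syntax too. The sketch below builds a dictionary that maps each word to its length, first with a loop and then with a *dictionary comprehension*. The name `word_lengths` is ours, chosen for illustration:

```python
mylist = ['In', 'the', 'beginning', 'there', 'was', 'chaos']

# Build a dictionary with a loop: each word (key) maps to its length (value).
word_lengths = {}
for word in mylist:
    word_lengths[word] = len(word)

# The same thing as a dictionary comprehension: {key: value for item in list}
word_lengths = {word: len(word) for word in mylist}

print(word_lengths)
```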

# Example: Counting Words

We have been looking at different features of "words" (or, as Python knows them, elements in a string separated by white space). What if we want to find the number of times each word occurs in a text? Python has a `Counter` class (in the `collections` module) for this, but here we will build the counts ourselves using dictionaries and another data type, tuples. Let's walk through an example.

One of the most frequent tasks in computational text analysis is quickly summarizing the content of text. In this lesson we will learn how to summarize text by counting frequent words in the text. In the process we'll learn to think about features, which words are important, and we'll cover some common pre-processing steps. 

This technique fits under the umbrella of Natural Language Processing, a term that incorporates many techniques and methods to process, analyze, and understand natural languages (as opposed to artificial languages like logics, or Python).


## Outline:
- Tokenizing Text and Type-Token Ratio
    * Number of words
    * Type-Token Ratio
- Most frequent words
- Pre-processing


## Key Terms:

* *stop words*: 
    * The most common words in a language.
* *token*:
    *  A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
* *type*:
    * A type is the class of all tokens containing the same character sequence.

## Let's begin!

First, we assign a sample sentence, our "text", to a variable called "sentence".

Note: This sentence is a quote about what digital humanities means, from digital humanist Kathleen Fitzpatrick. Source: "On Scholarly Communication and the Digital Humanities: An Interview with Kathleen Fitzpatrick", *In the Library with the Lead Pipe*

In [12]:
#assign the desired sentence to the variable called 'sentence.'
sentence = "For me it has to do with the work that gets done at the crossroads of \
digital media and traditional humanistic study. And that happens in two different ways. \
On the one hand, it's bringing the tools and techniques of digital media to bear \
on traditional humanistic questions; on the other, it's also bringing humanistic modes \
of inquiry to bear on digital media."

#print the content
print(sentence)

For me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study. And that happens in two different ways. On the one hand, it's bringing the tools and techniques of digital media to bear on traditional humanistic questions; on the other, it's also bringing humanistic modes of inquiry to bear on digital media.


## Type-Token Ratio

One quick calculation we can do on the text is determine its type-token ratio.

We know what a token is. But many tokens are repeated in a text. For example, in this sentence, the token "the" appears 5 times. "The" is a type. The 5 "the"s in the sentence are tokens. The TTR is simply the number of types divided by the number of tokens. A high TTR indicates a large amount of lexical variation or lexical diversity and a low TTR indicates relatively little lexical variation. The type-token ratio of speech, for example, is less than that of written language. 

To get a subset of our list that only contains one element of each type, we can use the `set` function:

In [13]:
sentence_list = sentence.split()
sentence_list

['For',
 'me',
 'it',
 'has',
 'to',
 'do',
 'with',
 'the',
 'work',
 'that',
 'gets',
 'done',
 'at',
 'the',
 'crossroads',
 'of',
 'digital',
 'media',
 'and',
 'traditional',
 'humanistic',
 'study.',
 'And',
 'that',
 'happens',
 'in',
 'two',
 'different',
 'ways.',
 'On',
 'the',
 'one',
 'hand,',
 "it's",
 'bringing',
 'the',
 'tools',
 'and',
 'techniques',
 'of',
 'digital',
 'media',
 'to',
 'bear',
 'on',
 'traditional',
 'humanistic',
 'questions;',
 'on',
 'the',
 'other,',
 "it's",
 'also',
 'bringing',
 'humanistic',
 'modes',
 'of',
 'inquiry',
 'to',
 'bear',
 'on',
 'digital',
 'media.']

In [14]:
set(sentence_list)

{'And',
 'For',
 'On',
 'also',
 'and',
 'at',
 'bear',
 'bringing',
 'crossroads',
 'different',
 'digital',
 'do',
 'done',
 'gets',
 'hand,',
 'happens',
 'has',
 'humanistic',
 'in',
 'inquiry',
 'it',
 "it's",
 'me',
 'media',
 'media.',
 'modes',
 'of',
 'on',
 'one',
 'other,',
 'questions;',
 'study.',
 'techniques',
 'that',
 'the',
 'to',
 'tools',
 'traditional',
 'two',
 'ways.',
 'with',
 'work'}

In [15]:
#type-token ratio

len(set(sentence_list)) 
len(set(sentence_list)) / len(sentence_list)

0.6666666666666666
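
Since we will want this ratio for other texts as well, it may be convenient to wrap the calculation in a function. This is only a sketch: the name `type_token_ratio` and the guard for an empty list are our own additions.

```python
def type_token_ratio(tokens):
    """Return (number of distinct tokens) / (total number of tokens)."""
    if len(tokens) == 0:          # avoid dividing by zero on empty input
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy example: 5 tokens, 4 types ('the' occurs twice).
toy_tokens = 'the cat saw the dog'.split()
print(type_token_ratio(toy_tokens))
```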

## Counting Words

We are often also interested in the most frequent words, which can help us quickly summarize a text. We can do this by looping through our `sentence_list` variable and creating a `counts` dictionary.

Let's walk through this code slowly.

In [15]:
counts = dict()
for word in sentence_list:
    if word not in counts:
        counts[word] = 1
    else:
        counts[word] += 1
counts

{'For': 1,
 'me': 1,
 'it': 1,
 'has': 1,
 'to': 3,
 'do': 1,
 'with': 1,
 'the': 5,
 'work': 1,
 'that': 2,
 'gets': 1,
 'done': 1,
 'at': 1,
 'crossroads': 1,
 'of': 3,
 'digital': 3,
 'media': 2,
 'and': 2,
 'traditional': 2,
 'humanistic': 3,
 'study.': 1,
 'And': 1,
 'happens': 1,
 'in': 1,
 'two': 1,
 'different': 1,
 'ways.': 1,
 'On': 1,
 'one': 1,
 'hand,': 1,
 "it's": 2,
 'bringing': 2,
 'tools': 1,
 'techniques': 1,
 'bear': 2,
 'on': 3,
 'questions;': 1,
 'other,': 1,
 'also': 1,
 'modes': 1,
 'inquiry': 1,
 'media.': 1}

In [16]:
# Now we can get the count (value) associated with any word (key)
counts['humanistic']

3
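
Bracket lookup raises a `KeyError` if the word was never counted. One way around that is the dictionary method `.get()`, which returns a default value instead. The sketch below uses a small made-up dictionary rather than the full `counts` from above, and `counts2` is just an illustrative name:

```python
counts = {'humanistic': 3, 'digital': 3, 'media': 2}

# .get(key, default) returns the default when the key is missing.
print(counts.get('humanistic', 0))
print(counts.get('chaos', 0))

# The same idea shortens the counting loop: no if/else needed.
counts2 = {}
for word in ['to', 'bear', 'on', 'to']:
    counts2[word] = counts2.get(word, 0) + 1
print(counts2)
```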

## Most Frequent Words

We'll have to creatively combine dictionaries and tuples to find the most frequent words in our sentence.

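The dictionary method `.items()` returns a view of the dictionary's contents as `(key, value)` tuples (in Python 3 this is a `dict_items` object that we can loop over like a list). This will eventually allow us to sort the tuples.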

A `tuple` is a sequence of values much like a list. The values stored in a tuple can be any type, and they are indexed by integers. The important difference is that tuples are immutable. Tuples are also comparable and hashable so we can sort lists of them and use tuples as key values in Python dictionaries.
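
A short sketch of what "comparable and hashable" buys us. The example tuples are made up, and `bigram_counts` is a hypothetical name:

```python
# Tuples compare element by element: the first items decide,
# and later items are only used to break ties.
print((3, 'to') > (2, 'that'))    # 3 > 2, so True
print((3, 'to') > (3, 'on'))      # tie on 3; 'to' sorts after 'on', so True

# That is why sorting a list of (count, word) tuples orders by count first.
pairs = [(1, 'me'), (5, 'the'), (3, 'on'), (3, 'to')]
pairs.sort(reverse=True)
print(pairs)

# Tuples are hashable, so they can be dictionary keys (lists cannot).
bigram_counts = {('digital', 'media'): 3}
print(bigram_counts[('digital', 'media')])
```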

Syntactically, a tuple is a comma-separated sequence of values, usually enclosed in parentheses. Each item in the output below is a `(word, count)` tuple:

In [17]:
counts.items()

dict_items([('For', 1), ('me', 1), ('it', 1), ('has', 1), ('to', 3), ('do', 1), ('with', 1), ('the', 5), ('work', 1), ('that', 2), ('gets', 1), ('done', 1), ('at', 1), ('crossroads', 1), ('of', 3), ('digital', 3), ('media', 2), ('and', 2), ('traditional', 2), ('humanistic', 3), ('study.', 1), ('And', 1), ('happens', 1), ('in', 1), ('two', 1), ('different', 1), ('ways.', 1), ('On', 1), ('one', 1), ('hand,', 1), ("it's", 2), ('bringing', 2), ('tools', 1), ('techniques', 1), ('bear', 2), ('on', 3), ('questions;', 1), ('other,', 1), ('also', 1), ('modes', 1), ('inquiry', 1), ('media.', 1)])

In [18]:
#we can loop through these values like we might in a list, but notice the syntax here!
for key, value in counts.items():
    print(key, value)

For 1
me 1
it 1
has 1
to 3
do 1
with 1
the 5
work 1
that 2
gets 1
done 1
at 1
crossroads 1
of 3
digital 3
media 2
and 2
traditional 2
humanistic 3
study. 1
And 1
happens 1
in 1
two 1
different 1
ways. 1
On 1
one 1
hand, 1
it's 2
bringing 2
tools 1
techniques 1
bear 2
on 3
questions; 1
other, 1
also 1
modes 1
inquiry 1
media. 1


In [19]:
freq_words = []

for key, val in counts.items():
    freq_words.append((val, key))

freq_words

[(1, 'For'),
 (1, 'me'),
 (1, 'it'),
 (1, 'has'),
 (3, 'to'),
 (1, 'do'),
 (1, 'with'),
 (5, 'the'),
 (1, 'work'),
 (2, 'that'),
 (1, 'gets'),
 (1, 'done'),
 (1, 'at'),
 (1, 'crossroads'),
 (3, 'of'),
 (3, 'digital'),
 (2, 'media'),
 (2, 'and'),
 (2, 'traditional'),
 (3, 'humanistic'),
 (1, 'study.'),
 (1, 'And'),
 (1, 'happens'),
 (1, 'in'),
 (1, 'two'),
 (1, 'different'),
 (1, 'ways.'),
 (1, 'On'),
 (1, 'one'),
 (1, 'hand,'),
 (2, "it's"),
 (2, 'bringing'),
 (1, 'tools'),
 (1, 'techniques'),
 (2, 'bear'),
 (3, 'on'),
 (1, 'questions;'),
 (1, 'other,'),
 (1, 'also'),
 (1, 'modes'),
 (1, 'inquiry'),
 (1, 'media.')]

In [20]:
freq_words.sort(reverse=True)
freq_words

[(5, 'the'),
 (3, 'to'),
 (3, 'on'),
 (3, 'of'),
 (3, 'humanistic'),
 (3, 'digital'),
 (2, 'traditional'),
 (2, 'that'),
 (2, 'media'),
 (2, "it's"),
 (2, 'bringing'),
 (2, 'bear'),
 (2, 'and'),
 (1, 'work'),
 (1, 'with'),
 (1, 'ways.'),
 (1, 'two'),
 (1, 'tools'),
 (1, 'techniques'),
 (1, 'study.'),
 (1, 'questions;'),
 (1, 'other,'),
 (1, 'one'),
 (1, 'modes'),
 (1, 'media.'),
 (1, 'me'),
 (1, 'it'),
 (1, 'inquiry'),
 (1, 'in'),
 (1, 'has'),
 (1, 'happens'),
 (1, 'hand,'),
 (1, 'gets'),
 (1, 'done'),
 (1, 'do'),
 (1, 'different'),
 (1, 'crossroads'),
 (1, 'at'),
 (1, 'also'),
 (1, 'On'),
 (1, 'For'),
 (1, 'And')]

In [21]:
# each tuple in freq_words is (count, word), so name the loop variables accordingly
for count, word in freq_words[:10]:
    print(count, word)

5 the
3 to
3 on
3 of
3 humanistic
3 digital
2 traditional
2 that
2 media
2 it's
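
For comparison, the `Counter` class from the standard library's `collections` module does the counting and ranking in two lines. This is an alternative sketch on a small made-up token list, not the method used in the rest of this lesson:

```python
from collections import Counter

tokens = ['the', 'work', 'the', 'digital', 'media', 'the', 'digital']

# Counter is a dictionary subclass that does the counting loop for us.
counts = Counter(tokens)
print(counts['the'])

# most_common(n) returns a list of (word, count) tuples, highest count first.
print(counts.most_common(2))
```

Note that `most_common` gives `(word, count)` tuples, the reverse of our `(count, word)` tuples. Words with equal counts are kept in the order they first appeared, so ties may be ordered differently than in our sorted list.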


In [33]:
#let's save as a function!

def word_counts(text_list):
    # count how many times each word occurs
    counts = dict()
    for word in text_list:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

    # turn the dictionary into a list of (count, word) tuples
    freq_words = []
    for key, val in counts.items():
        freq_words.append((val, key))

    # sort from most to least frequent
    freq_words.sort(reverse=True)

    return freq_words

## Tokenizing Text and Preprocessing

But what's the issue here? First, capitalization and punctuation are messing with our word counts: `media` and `media.` are counted as different words, as are `and` and `And`. Second, the most frequent words, *the*, *to*, *on*, *of*, don't actually tell us much about the text. This is typical. Stop words, or very common words that don't convey much content, make up a large share of most texts.

Before doing any text analysis, we thus must do lots of preprocessing. The exact preprocessing steps you take will depend on what you're planning on doing. I'll go through common steps here, but think carefully about what steps you want to take when you do your own analysis.

In [34]:
#lowercase

sentence_lc = sentence.lower()

sentence_lc

"for me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study. and that happens in two different ways. on the one hand, it's bringing the tools and techniques of digital media to bear on traditional humanistic questions; on the other, it's also bringing humanistic modes of inquiry to bear on digital media."

In [35]:
#remove punctuation
#For punctuation use the list from the string library
import string
punct_list = string.punctuation

punct_list

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [36]:
sentence_nopunct = ''.join([e for e in sentence_lc if e not in punct_list])
sentence_nopunct

'for me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study and that happens in two different ways on the one hand its bringing the tools and techniques of digital media to bear on traditional humanistic questions on the other its also bringing humanistic modes of inquiry to bear on digital media'
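
An alternative way to strip punctuation is the string method `translate`, sketched below on a shortened, made-up sentence. It should give the same result as the character-by-character join above:

```python
import string

text = "on the one hand, it's bringing the tools; on the other, it's also bringing modes."

# str.maketrans('', '', chars) builds a table that deletes every character in chars.
remove_punct = str.maketrans('', '', string.punctuation)
text_nopunct = text.translate(remove_punct)

print(text_nopunct)
```

Either way, `string.punctuation` only covers ASCII punctuation, so characters such as the en dashes in the Dickinson poem earlier, or curly quotation marks, would be left in place.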

In [37]:
#often, we want to remove stopwords

stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 
                     'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 
                     'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 
                     'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 
                     'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 
                     'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 
                     'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 
                     'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 
                     'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 
                     'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 
                     'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 'can', 'will',
                     'just', 'dont', 'should', 'aint', 'arent', 'couldn', 'could', 'would', 'much', 'must',
                     'didnt', 'doesnt', 'hadnt', 'hasnt', 'havent', 'isnt', 'mightnt', 'mustnt', 'neednt', 'shan',
                     'shouldnt', 'wasnt', 'werent', 'wont', 'wouldnt']

In [38]:
sentence_tokens = sentence_nopunct.split()
sentence_clean = [word for word in sentence_tokens if word not in stop_words]
sentence_clean

['work',
 'gets',
 'done',
 'crossroads',
 'digital',
 'media',
 'traditional',
 'humanistic',
 'study',
 'happens',
 'two',
 'different',
 'ways',
 'one',
 'hand',
 'bringing',
 'tools',
 'techniques',
 'digital',
 'media',
 'bear',
 'traditional',
 'humanistic',
 'questions',
 'also',
 'bringing',
 'humanistic',
 'modes',
 'inquiry',
 'bear',
 'digital',
 'media']
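
Checking `word not in stop_words` against a list means Python scans the list for every token. For a book-length text it is usually faster to convert the stop words to a `set` first. A sketch with a short made-up stop list (`stop_list` and `stop_set` are illustrative names):

```python
# A short, made-up stop list for illustration (the full list is defined above).
stop_list = ['the', 'of', 'and', 'to', 'on']
stop_set = set(stop_list)    # same words, stored for fast membership tests

tokens = ['the', 'crossroads', 'of', 'digital', 'media']
clean = [word for word in tokens if word not in stop_set]
print(clean)
```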

In [39]:
#let's save this as a function
def word_tokenize(text):
    text = text.lower()
    text_clean = ''.join([e for e in text if e not in punct_list])
    text_token =  text_clean.split()
    text_token_clean = [word for word in text_token if word not in stop_words]
    return text_token_clean

In [40]:
#apply our new function to the sentence
sentence_tokens = word_tokenize(sentence)
sentence_tokens

['work',
 'gets',
 'done',
 'crossroads',
 'digital',
 'media',
 'traditional',
 'humanistic',
 'study',
 'happens',
 'two',
 'different',
 'ways',
 'one',
 'hand',
 'bringing',
 'tools',
 'techniques',
 'digital',
 'media',
 'bear',
 'traditional',
 'humanistic',
 'questions',
 'also',
 'bringing',
 'humanistic',
 'modes',
 'inquiry',
 'bear',
 'digital',
 'media']

In [41]:
#total number of words
len(sentence_tokens)

32

In [42]:
# let's count again! (remember our function)

word_count = word_counts(sentence_tokens)
word_count

[(3, 'media'),
 (3, 'humanistic'),
 (3, 'digital'),
 (2, 'traditional'),
 (2, 'bringing'),
 (2, 'bear'),
 (1, 'work'),
 (1, 'ways'),
 (1, 'two'),
 (1, 'tools'),
 (1, 'techniques'),
 (1, 'study'),
 (1, 'questions'),
 (1, 'one'),
 (1, 'modes'),
 (1, 'inquiry'),
 (1, 'happens'),
 (1, 'hand'),
 (1, 'gets'),
 (1, 'done'),
 (1, 'different'),
 (1, 'crossroads'),
 (1, 'also')]

In [43]:
# reminder: word counts before preprocessing:

old_count = word_counts(sentence.split())
old_count

[(5, 'the'),
 (3, 'to'),
 (3, 'on'),
 (3, 'of'),
 (3, 'humanistic'),
 (3, 'digital'),
 (2, 'traditional'),
 (2, 'that'),
 (2, 'media'),
 (2, "it's"),
 (2, 'bringing'),
 (2, 'bear'),
 (2, 'and'),
 (1, 'work'),
 (1, 'with'),
 (1, 'ways.'),
 (1, 'two'),
 (1, 'tools'),
 (1, 'techniques'),
 (1, 'study.'),
 (1, 'questions;'),
 (1, 'other,'),
 (1, 'one'),
 (1, 'modes'),
 (1, 'media.'),
 (1, 'me'),
 (1, 'it'),
 (1, 'inquiry'),
 (1, 'in'),
 (1, 'has'),
 (1, 'happens'),
 (1, 'hand,'),
 (1, 'gets'),
 (1, 'done'),
 (1, 'do'),
 (1, 'different'),
 (1, 'crossroads'),
 (1, 'at'),
 (1, 'also'),
 (1, 'On'),
 (1, 'For'),
 (1, 'And')]

## Reading in Text Files

In [44]:
# Read in a text file saved in your data folder

with open("../data/Austen_PrideAndPrejudice.txt", encoding='utf-8') as myfile:
    #print(myfile)
    mytext = myfile.read()

mytext[:200]
len(mytext)

685139

# Exercises!

On the text we just read in, do the following. You are in control of what pre-processing steps you might want to take, if any. Use any of the functions we defined above if you want.

1. How long is the text?
2. Calculate the type-token ratio.
3. Print the 20 most frequent words.
4. How many short words are in the text? Short words == three characters or less.
5. How many long words are in the text? Long words == seven characters or more.
6. What is the (approximate) average word length in the text?
7. What is the (approximate) average sentence length of the text?
8. How long is the longest sentence?
9. How long is the shortest sentence?

In [37]:
#1. How long is the text?

len(mytext.split())

121793

In [38]:
# 2. Calculate the type-token ratio.

len(set(mytext.split()))/len(mytext.split())

0.10627047531467326

In [41]:
# 3. Print the 20 most frequent words.

mytext_tokens = word_tokenize(mytext)
austen_wc = word_counts(mytext_tokens)
austen_wc[:20]

[(772, 'mr'),
 (585, 'elizabeth'),
 (396, 'said'),
 (365, 'darcy'),
 (342, 'mrs'),
 (292, 'bennet'),
 (288, 'one'),
 (283, 'every'),
 (281, 'miss'),
 (258, 'jane'),
 (253, 'bingley'),
 (235, 'know'),
 (222, 'though'),
 (221, 'never'),
 (218, 'soon'),
 (212, 'well'),
 (211, 'think'),
 (210, 'now'),
 (201, 'time'),
 (200, 'might')]

In [54]:
# 4. How many short words are in the text? Short words == three characters or less.

len([x for x in mytext.split() if len(x)<=3])

52447

In [47]:
# 5. How many long words are in the text? Long words == seven characters or more.

long_words = 0

for e in mytext_tokens:
    if len(e)>=7:
        long_words = long_words + 1
print(long_words)

23066


In [55]:
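#6. What is the (approximate) average word length in the text?

# approximation 1: total characters divided by the number of spaces
print(len(mytext) / mytext.count(' '))

# approximation 2: total characters divided by the number of whitespace-separated tokens
# (both overestimate slightly, because whitespace and punctuation count as characters)
len(mytext) / len(mytext.split())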

5.725953783795077


5.625438243577217

In [50]:
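#7. What is the (approximate) average sentence length of the text?

import numpy as np

# rough sentence split: break the text at every period
# (abbreviations such as "Mr." will also be split, so this is only approximate)
sentences = mytext.split('.')
np.mean([len(x.split()) for x in sentences])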

19.41136543014996

In [57]:
#or, without numpy:

sentences = mytext.split('.')
sum([len(x.split()) for x in sentences]) / len(sentences)

19.41136543014996

In [58]:
# 8. How long is the longest sentence?

print([len(x.split()) for x in sentences][:10])
max([len(x.split()) for x in sentences])

[5, 3, 7, 2, 1, 3, 2, 23, 48, 3]


121

In [59]:
# 9. How long is the shortest sentence?
min([len(x.split()) for x in sentences])

0
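
A shortest sentence of length 0 is an artifact of the splitting. One likely cause is that `split('.')` leaves an empty string after a final period (or between consecutive periods). The sketch below, on a made-up sample string rather than the novel, splits on `.`, `!` and `?` with the `re` module and drops the empty pieces:

```python
import re

sample = "It is a truth universally acknowledged. Is it? Yes."

# Splitting on '.' alone leaves an empty string after the final period,
# which then counts as a "sentence" of length 0.
naive = sample.split('.')

# Split on '.', '!' or '?' and drop pieces that contain no words.
pieces = re.split(r'[.!?]', sample)
sentences = [p.strip() for p in pieces if p.strip()]

lengths = [len(s.split()) for s in sentences]
print(sentences)
print(min(lengths), max(lengths))
```

This is still approximate: abbreviations such as "Mr." will be split into their own "sentences", which matters for a text with as many of them as this one.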