
# Week 3: Power, Carrying Out a TTR Experiment, and Further Adventures in Python


## Part One: Calculating TTRs with web tools

DC will next provide an overview of how to carry out a TTR experiment using web resources and web tools.

* finding texts
* saving them as text files
* cleaning them
* calculating TTRs for total texts
* calculating standardized TTRs
* gathering data in a table


## Part Two: "The Power Chapter"

KM will deliver a brief lecture on this week's reading, D'Ignazio and Klein's "The Power Chapter" from *Data Feminism*.

The above will also introduce our first "dataset": **Project Gutenberg**, which we will discuss in the terms introduced by D'Ignazio and Klein.

## Part Three: Indexing and Slicing Strings, String Methods, Tokenizing, Lists, and Loading Files into Python

In this section of the lecture, included below, we explore some new methods (pun) and a new data type in Python.

This week's lecture draws on Melanie Walsh's chapters on [string methods](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/06-String-Methods.html), [lists and loops](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/09-Lists-Loops-Part1.html), and [files and character encoding](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/07-Files-Character-Encoding.html).

## Lab and Homework

As always, your weekly lab will be released shortly after lecture on Tuesday, and is due Wednesday at 10pm. Your weekly homework will be released on Thursday after tutorial and is due the following Monday at 10pm.

---

# 3: A Bit of Review

Let's start with a few hopefully fun exercises to remind us of some of what we learned last time... (And teach us one new thing!)

In [None]:
# How do you explain these potentially surprising results?

print(3 + 3)
print("3" + "3")

In [None]:
name = "Dasg"
print(name * 3)
print(name * 3.0)

In [None]:
print("True" + "False")
print(True + False)

In [None]:
True * 3.14159

In [None]:
True / False

# Our Coding Task Today

Today we're going to learn how to do a hugely important part of our Type/Token Ratio experiment: 
* Load a text
* Break that text into words (**"tokenize"** it)
* Count the number of words (or "tokens") in the text

Let's start by doing this task manually. 

Our sample text will be the opening sentence of Jane Austen's *Pride and Prejudice*:

> It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

How many words are in this text? 

# 3a: Indexing and Slicing strings

Let's all get our knives out and slice some strings!

![Knife slicing string](knife-string.jpg)

A string, as we know, is some text strung together. 

Let's do the English Lit equivalent of `"Hello world!"` and load that first line of *Pride and Prejudice* into a string variable called `text`.

In [None]:
text = "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."

In [None]:
type(text)

In [None]:
text

In [None]:
print(text)

Now let's say we didn't want to read that *whole* opening line (I mean, it's sort of long for modern tastes!), and wanted to chop it up a little bit, say just those famous first six words. Sure, we could *retype them* and put them into a new variable, but that would be a lot of work. I mean, they're already **there**, so how do we access only them?

Thankfully, Python has some handy tools for "indexing" and "slicing" strings. These are both achieved by putting **square brackets** `[ ]` directly after the variable name (or the directly inputted text) that you want to "slice up." 

## Indexing

Let's start with the below. It will display "unit number one" from the `text` variable. What do you expect to see?

In [None]:
text[1]

I don't know about you, but that's totally not what I was expecting to get back. 
* What did *you* expect to see?
* How do you explain what we see here? 

Can we figure it out if we enter a few more?


In [None]:
text[2]

In [None]:
text[3]

In [None]:
text[4]

In [None]:
text[5]

In [None]:
text[6]

In [None]:
text[7]

In [None]:
text[8]

In [None]:
text[9]

In [None]:
text[10]

In [None]:
text[11]

In [None]:
text[12]

What have we learned about how indexing works on strings?
* Strings are broken up or "indexed" by *character*, not by word. So we have learned that if you just want to pull a single character out of a string — which is called **indexing** a string — the syntax is `string[character_number]`.
* **AND PYTHON DOESN'T START COUNTING AT 1, IT STARTS COUNTING AT ZERO!!!!!!!!!!!!!!**

![hands start counting at zero](start_at_zero.jpg)

The above is another one of those unintuitive things for which there is a perfectly good explanation.

![misleading list](zero_smartest.jpg)

Just so we don't forget this, I'm going to put it in big type:
# >>>Python doesn't start counting at 1; it starts counting at 0<<<

In [None]:
text[0]
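Here is a small sketch of the counting-from-zero idea, using a shortened, made-up variable called `snippet` and a hand-typed `ruler` of index positions (both names are just for illustration). The ruler shows only the last digit of each position, so the final `0`, `1`, `2` stand for 10, 11, and 12.

```python
snippet = "It is a truth"
# Index positions typed out by hand, ones digit only (so "0" reappears at 10).
ruler = "0123456789012"

print(snippet)
print(ruler)

print(snippet[0])   # the character sitting above the first 0
print(snippet[8])   # the character sitting above the 8
print(snippet[12])  # the character sitting above the final 2, i.e. position 12
```

Reading straight up from the ruler tells you which character each index will hand back.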

## Slicing

If you want to pull multiple characters out of a string, that's called **slicing**, and the syntax is as follows: 

> `string[start:stop:step]`, where `start`, `stop`, and `step` are "index positions" in the string you want to slice. 

Again be careful not to use different types of brackets, different kinds of colons, or add any spaces. What kind of reader is the Python Interpreter, again? Right: "**extraordinarily dutiful but uncompromisingly literal.**"

Now, we don't have to include all three instructions — `start`, `stop`, and `step` — so let's start with just two, `start` and `stop`.

Okay, let's try to slice that string so that we only get the first six words, "It is a truth universally acknowledged."

But let's start with just the first FOUR words, which we know go up to `text[12]`. So maybe...

In [None]:
text[0:12]

Nope! That doesn't work! We lost our "h"! **We have not *truth* but *trut*, and this is entirely unsatisfying!**

Let me try to explain, and David and/or Mary can offer their thoughts. 

Yes, `text[12]` produces the terminal "h" of "truth", but that "h" is not the twelfth character in the string; it is the **thirteenth**, since Python starts counting at 0. So at least we can sensibly interpret the command `text[0:13]` as meaning "Show me every character from the start up to and including the thirteenth."

In [None]:
text[0:13]

Note that we can also write this as follows:

In [None]:
text[:13]

And while we're at it, we may as well try:

In [None]:
text[13:]

What further insights does this provide us into the way that Python indexes (or indices, I suppose) work? 

`text[:13]` tells Python, "show me everything from the start of the string all the way up to, **but not including**, `text[13]`."

`text[13:]` tells Python, "show me everything from `text[13]` -- **including `text[13]` itself** -- all the way to the end."
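One handy consequence of the "up to but not including" rule, sketched below: slicing at the same position from both sides never loses or duplicates a character. The names `first_part` and `rest` are made up for this illustration.

```python
text = "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."

first_part = text[:13]  # everything before position 13
rest = text[13:]        # everything from position 13 onward

print(first_part)
print(rest)

# Gluing the two halves back together gives us the original string.
print(first_part + rest == text)
```

The character at position 13 lands in `rest` and only in `rest`, which is why the two pieces fit back together exactly.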

Okay, let's collectively make some random guesses until we get those famous first six words...

In [None]:
text[:]

Now let's play around a little bit more with the other parts of the `string[start:stop:step]` syntax.

In [None]:
text[0:13:2]

In [None]:
text[:13:2]

In [None]:
text[::2]

In [None]:
text[-1]

In [None]:
text[-5:]

In [None]:
text[::-1]
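If the output of those last few cells was puzzling, it may help to try the same moves on a shorter string where you can check the answer by eye. This is only a sketch, and `alphabet` is a made-up variable.

```python
alphabet = "abcdefg"

print(alphabet[::2])   # step of 2: every second character  -> aceg
print(alphabet[-1])    # negative index: count from the end -> g
print(alphabet[-3:])   # the last three characters          -> efg
print(alphabet[::-1])  # step of -1: walk backwards         -> gfedcba
```

A negative index counts from the end of the string, and a negative `step` walks through the string in reverse.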

# 3b: String methods

Let's now meet **methods**, which are a *special kind of function* that only belong to certain **data types** and which are written out differently than functions.

As you'll recall, the syntax of a function is `function(argument)`, which we said was analogous to `verb(noun)`: functions are verb-y in that they **perform actions** on noun-like arguments.

But not all kinds of actions are applicable or appropriate to all kinds of nouns, as you will perhaps have discovered in your madlibs exercises last week (Can you really be "eaten alive" by an "uproarious smoke"?).

**And not all functions are appropriate to all kinds of Python data types**.

It makes sense to `print()` anything or check the `type()` of anything — but whereas it certainly makes sense to **convert a `str` to all lowercase letters**, it makes absolutely none to convert an `int` or a `float` or a `bool` to lowercase letters, since they are not composed of letters.

This is not a hypothetical example: Python has a built-in way to turn strings into lowercase letters. But because it is something that only makes sense for one data type (`str`s), it is a **method** rather than a **function**.

The syntax for methods is as follows: 

> `data.method(argument)`

As you can see, they look quite a bit like functions, but they come **after** the data that you want to perform some action on, and they are "attached" to that data, as it were, by a period (`.`). 

Think of methods like **suffixes**, or, if you speak an inflected language like French or Polish, like grammatical word endings such as the bolded bits in "Tu aim**es**" vs. "Nous aim**ons**."
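To see that a method really does "belong" to a data type, here is a rough sketch using Python's built-in `hasattr()` function, which we haven't covered in class: it simply asks whether a value has something by a given name attached to it. The variables `author` and `birth_year` are made up for the example.

```python
author = "Woolf"
birth_year = 1882

# A function goes in front of the data: function(argument)
print(type(author))

# A method comes after the data, attached by a period: data.method()
print(author.upper())

# Strings come with a .upper method attached; integers do not.
print(hasattr(author, "upper"))
print(hasattr(birth_year, "upper"))
```

If you actually tried `birth_year.upper()`, Python would stop with an `AttributeError`, its way of saying "this noun doesn't take that verb."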

Python has lots of cool string-specific methods. Let's explore a few:
* `.lower()`: make lowercase
* `.upper()`: make uppercase
* `.title()`: make title case
* `.replace()`: replace some text with other text (like "Find and Replace" in Word)
* `.split()`: break a string into separate units, such as words

Let's try out the make-it-lowercase method, `.lower()`:

In [None]:
"AAAAAAAAAAAAAAAAAAAAAH".lower()

A powerful method! A scream becomes a sigh!!

Note that `.lower()` **does not require an argument**, which is also true of `.upper()` and `.title()`. 

Again, verbs provide a nice analogy. 
* Some verbs in certain contexts **require *objects***, like "I am going to take... ***all of your oatmeal***". 
* Some verbs in certain contexts **do not require objects**, like "Oatmeal sucks."

Some functions and methods require arguments, and some do not.
* "Lowercase this string, Python Interpreter!" / "Aaaaaaaaaaaaaaaah. Sure thing, boss!"
* "Replace this string, Python Interpreter!" / "AAAAAAAAAAAAAAAAAAAAAAAAH! What do you want me to replace?? And with what?????"

In [None]:
text.lower()

In [None]:
text.upper()

In [None]:
text.title()

Note that running the above three commands ^^ doesn't actually alter the contents of the `text` variable. 

In [None]:
text

If we **wanted** to replace the value of `text` with a fully lowercased version of itself, how would we do it?

In [None]:
#
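Here is one possible approach, sketched on a made-up variable called `shout` so that our `text` variable stays untouched. The idea is that a string method hands back a *new* string, so to keep the result we have to assign it to a name, and that name can be the one we started with.

```python
shout = "AAAAAH"

shout.lower()          # makes a lowercase copy, which is immediately thrown away
print(shout)           # still uppercase

shout = shout.lower()  # catch the lowercase copy and store it under the same name
print(shout)           # now lowercase
```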

Below is a look at the `.replace()` method, which as you can see takes **two arguments**, the text to replace with something else, and the "something else" to replace it with.

`string.replace("text to remove", "text to insert in its place")`

Note that the **two arguments** are separated by a comma.

In [None]:
"Now approaching Ossington... Ossington Station".replace("Ossington", "Christie")

You can use `names_of_string_variables` rather than `"directly inputted text"` at any position here. For example,

In [None]:
pre_foucault = "truth universally acknowledged"
post_foucault = "spurious bias held in certain highly localized socio-political configurations"

text.replace(pre_foucault, post_foucault)
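Note that `.replace()` swaps out *every* occurrence of the text it is looking for. It also accepts an optional third argument, a number that caps how many replacements get made. A quick sketch, with `announcement` as a made-up variable name:

```python
announcement = "Now approaching Ossington... Ossington Station"

# Two arguments: every "Ossington" is replaced.
replace_all = announcement.replace("Ossington", "Christie")
print(replace_all)

# Optional third argument: replace only the first occurrence.
replace_first = announcement.replace("Ossington", "Christie", 1)
print(replace_first)
```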

As exciting as the above undoubtedly is, perhaps the most exciting string method for the purposes of our TTR experiments is 

> `.split()`

— which takes a string and breaks it up into chunks.

`.split()` doesn't **require** an argument. What does it do below?

In [None]:
"April is the cruelest month, breeding / Lilacs out of the dead land".split()

The default, argument-less version of `split()` **"splits on whitespace."** That is to say, any time it encounters any number of consecutive characters that Python (or the people who made Python) interpret as "empty," it sharpens its fangs and cuts them out.

Whitespace characters include:
* "` `": spaces (yes, spaces are characters!)
* "`\t`": tabs (yes, tabs are characters, and are represented as `\t`)
* "`\n`": newlines or "Return"s (yes, "Return" or "Enter" is a character, and is represented as `\n` — among other ways!)
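To watch all three kinds of whitespace get chewed up at once, here is a small sketch. The string `messy` is invented for the occasion, with runs of spaces, a tab, and a newline typed in deliberately.

```python
# Three spaces, then a tab, then a newline, then four spaces between the words.
messy = "April   is\tthe\ncruelest    month"

words = messy.split()
print(words)
```

However much whitespace sits between two words, and whatever kind it is, the argument-less `.split()` treats it as a single cut.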

But `.split()` will **accept** an argument, if we want to split a string up by something other than whitespace.

For instance, we could split Eliot up by "`/`" to divide this poem into lines...

In [None]:
waste_land = "April is the cruelest month, breeding / Lilacs out of the dead land"

waste_land.split("/")

What happens to our splitting character, `/`, in the above command? Where does it go?

And where does the output of the command go? Is it stored anywhere? How could we store it?

Let's now split up our beloved `text` variable, from *Pride and Prejudice*:

In [None]:
text.split()

But I'm interested in that output — I want to catch it in my butterfly net! Let's grab it and stick it in a variable called `text_words`.

In [None]:
text_words = text.split()

In [None]:
print(text_words)

What a fascinating output! All those words, surrounded by quotation marks `'`, separated by commas `,`, and wrapped up in square brackets `[ ]`. **What on earth *is* this output?** Is it a **data type** we already know — *or is it perhaps something entirely new??*

In [None]:
type(text_words)

# 3c: Lists!

Is it weird to have a favourite data type? Maybe it is, maybe it isn't, but I/AH have one, and it's lists.

Lists respond to the dreadful situation we described last class: *What if you want to put more than one thing in an envelope*? Lists are the data type that allows you to put as many things as you want in an envelope. Those individual "things" can be any of the data types we've met thus far: `str`s, `int`s, `float`s, `bool`s, whatever! In that way, they are sort of like a "meta data-type," in that they store or contain lots of other sub-data-types.


The best real-world equivalent I can think of for a list is... a list. 

Lists have names, and they contain multiple items.

> **Grocery List**:
> * Chocolate bar
> * Chips
> * Chocolate milk
> * Another bag of chips

The Python syntax for creating a new list variable is the following: `name = [item1, item2, item3, item4]`

If I wanted to create a Python equivalent of the above list, I might do something like...

In [None]:
grocery_list = ["Chocolate bar", "Chips", "Chocolate milk", "Another bag of chips"]

In [None]:
grocery_list

In [None]:
print(grocery_list)

In [None]:
type(grocery_list)

Let's try to take all the glorious values we created last week for our `LITERATURE`, `Woolf`, `wolf`, and `pi` variables and stick them into a new `list` variable called `bunch_o_stuff`.

## Indexing and Slicing Lists

Good news! Both work exactly the same as for strings. Except that strings break down into characters, whereas lists break down into items or **elements**.

If you want to pull out an individual item or **element** from a list, you can **index** it just like a string.

In [None]:
grocery_list[0]

What do you think will happen when we do the below?

In [None]:
type(grocery_list[0])

In [None]:
grocery_list[0] + grocery_list[3]

In [None]:
print(f"Boy I sure am hungry for some {grocery_list[2]} and some {grocery_list[1]}")

You can also **slice** `list`s in the same ways as you can `str`s.

In [None]:
grocery_list[:2]

In [None]:
type(grocery_list[:2])

In [None]:
grocery_list[-2:]

In [None]:
grocery_list[::-1]

The `grocery_list` is cool and all, but I'm more interested in that list we created earlier, `text_words`. Let's have another look at it.

In [None]:
text_words

## `len()`

Believe it or not, we're one mere tiny function away from being able to do something really exciting and absolutely essential to our TTR experiment: **count the number of tokens in a text**. 

**`.split()`**, you see, already got us to our first goal: it **"tokenized"** the text, i.e., **split it into individual words**. It didn't do a perfect job, of course, but it did pretty well. 
* What problems do you see with its tokenization?
* Did it count the same number of words as you did?

All we need now is a way to actually count the number of words in our list... which, blessedly, Python provides us in the form of a function called **`len()`**.

**`len()`** (don't call him `leonard()` or `lenny()`, just `len()`) calculates the length of a variety of data types. For instance, let's try him out on a string, then a list.

In [None]:
print(text)
len(text)

In [None]:
print(text_words)
len(text_words)

Sadly, `len()` doesn't know what it means to count the length of an `int`, a `float`, or a `bool`.

In [None]:
len(232)

In [None]:
len(3.14159)

In [None]:
len(True)

When `len()` meets a `list`, he doesn't count anything within the actual items in the list; he only counts the number of elements or items in the list. We could, of course, ask him to count how long an individual element is, too, if it's the sort of thing that can be counted...

In [None]:
print(text_words[4])
len(text_words[4])
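To keep the three different "lengths" straight, here is a sketch on a different, shorter opening line. The names `opening` and `tokens` are made up for the example.

```python
opening = "Call me Ishmael."
tokens = opening.split()

print(len(opening))    # length of the string: counts characters, spaces included
print(tokens)
print(len(tokens))     # length of the list: counts items
print(len(tokens[2]))  # length of one item: counts that item's characters
```

Notice that the last token is `'Ishmael.'`, period and all, so `len()` counts eight characters for a seven-letter name. That clinging punctuation is one of the imperfections of tokenizing with `.split()` alone.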

## The `.join()` method

Before we move on to counting the actual number of words in an actual novel (!!!!), there's one more string method to introduce, now that we know about `list`s: the `.join()` method.

It is used, perhaps unsurprisingly, to **join things together**: more specifically, to join together the items of a list into a single string. The syntax is as follows:

`string.join(list)`, where the `string` is whatever you want to mash **between** the items being joined and `list` is the list that you want to collapse into a single string.

I've always found the syntax of this one pretty odd, but it is what it is, and it's definitely a useful method. 

In [None]:
" Mississippi, ".join(["One", "two", "three", "four"])

Notice that the `string` is **only** stuck in **between** items, so that in my application I'm left with a weirdly hanging four.
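One way to think about `.join()` is as the mirror image of `.split()`: one glues a list together with a separator, the other cuts a string apart at that separator. A minimal sketch, with `numbers` and `joined` as made-up names:

```python
numbers = ["One", "two", "three", "four"]

# The separator goes between items only: three separators for four items.
joined = ", ".join(numbers)
print(joined)

# Splitting on that same separator gets us back to the list we started with.
print(joined.split(", "))
print(joined.split(", ") == numbers)
```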

Totally optional, but maybe on your own time you can try to figure out how the below works...

In [None]:
divider = " Mississauga"
(divider + ", ").join(["One", "two", "three", "four"]) + divider + "."

Here's a more practical application of `.join()`:

In [None]:
" ".join(text_words)

... and here we bring *Pride and Prejudice* into the 21st century...

In [None]:
", like, ".join(text_words)

# 3e: Loading a real-life text file into Python... and tokenizing it!

We now have pretty much all the tools we need to perform an important part of our TTR exercise:
* We can **tokenize** a string by `.split()`ting it into words (more or less) and producing lists of words (more or less)
* And we can count how many words there are in those lists with `len()`

What we haven't learned to do, however, is actually load a novel into Python.

Well, believe it or not, for that we only need a single line of code. 

It starts with the `open()` function, which we will provide with two arguments:
* The **path** to the file we want to open, entered as a `str` (so with `""`s around it). This tells Python *where* that file is, relative to the notebook you currently have open. For today's example, we've already placed some **plain text** files on your JupyterHub, and they're in the same folder as this notebook. So the path is pretty simple: the name of the file, in quotation marks.
* The type of **character encoding** that that file uses, a topic which requires its own subject heading.

## Character encoding

It's important to specify what kind of "character encoding" you're using because... computers don't actually understand what text *is*: they are just sign-manipulators who do what they're told.

In [None]:
A = "P"
d = "o"
a = "o"
m = "p"
print(A+d+a+m)
A+d+a+m == "Poop"

Although it would be handy if there were only one way of encoding text — one master system of signifiers and signifieds — there are in fact many systems. Among these are
* The OG system, "[ASCII](https://en.wikipedia.org/wiki/ASCII)" — first devised in the 1960s! — which has pretty limited support for anything beyond English characters A-Z, a-z, 0-9, with some punctuation and special characters allowed. 
* The main encoding system we'll be using, "[UTF-8](https://en.wikipedia.org/wiki/UTF-8)" (derived from [Unicode](https://en.wikipedia.org/wiki/Unicode)), which is a lot better, but still — as Aditya Mukerjee points out in his essay “I Can Text You A Pile of Poo, But I Can’t Write My Name,” assigned this week — lacks characters that are essential to the Bengali alphabet as well as to many other non-English languages, when it has plenty of space to include these characters.

Let's load a file that we've put in all of your JupyterHubs. It's called `"sample_character_encoding.txt"` and it is encoded in the UTF-8 standard.

In the below command, we'll use the `open()` function to open the file, and then immediately apply the `.read()` method to that file. This is because the `open()` function produces a "file object" which only becomes what we want it to be — a `str` — once we apply this method to it.


In [None]:
sample = open("sample_character_encoding.txt", encoding="utf-8").read()
print(sample)
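The one-liner above is all we need in this class. For the curious, here is a sketch of another pattern you will often see in other people's code: a `with` block, which closes the file automatically once the indented lines are done. So that the sketch can run anywhere, it first creates its own throwaway file; the file name `hypothetical_sample.txt` and the variable names are invented for this illustration and have nothing to do with the files on your JupyterHub.

```python
import os
import tempfile

# Make a throwaway folder and a made-up file path inside it.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "hypothetical_sample.txt")

# Write a little UTF-8 text into the file ("w" means "open for writing").
with open(path, "w", encoding="utf-8") as outfile:
    outfile.write("Café, naïve, façade")

# Read it back: open() gives us a file object, .read() turns it into a str.
with open(path, encoding="utf-8") as infile:
    sample_text = infile.read()

print(type(sample_text))
print(sample_text)
```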

If we try to open this same exact file with a different encoding system called ISO-8859-1, we get a bit of a mess.

In [None]:
sample = open("sample_character_encoding.txt", encoding="iso-8859-1").read()
print(sample)

If we try to open this same exact file using Ye Olde Fashionede ASCII encoding, we just get an error.

In [None]:
sample = open("sample_character_encoding.txt", encoding="ascii").read()
print(sample)
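You can reproduce all three outcomes in miniature without any file at all. The sketch below uses two string tools we haven't covered, `.encode()` (text to raw bytes) and `.decode()` (raw bytes back to text), plus a `try`/`except` block so the error doesn't stop the cell. The variable names are made up.

```python
word = "café"

# Turn the text into raw bytes using UTF-8. The é takes up two bytes.
data = word.encode("utf-8")
print(data)

# Read the bytes back with the right codebook: all is well.
print(data.decode("utf-8"))

# Read the same bytes with the wrong codebook: each byte of the é
# is misread as its own character.
garbled = data.decode("iso-8859-1")
print(garbled)

# ASCII has no entry at all for those bytes, so Python gives up.
ascii_failed = False
try:
    data.decode("ascii")
except UnicodeDecodeError as error:
    ascii_failed = True
    print("ASCII could not decode this:", error)
```

Same bytes, three different readings: the encoding is the codebook that tells Python how to turn stored numbers back into characters.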

Now let's try loading something a bit longer, something we've actually all read... ***The Sign of the Four***!

We've already conveniently put a copy of it — `sign-of-four.txt`, sourced from Project Gutenberg and encoded in UTF-8 — in your JupyterHubs, in the same folder as where this notebook lives. Let's load it into a variable called `sot4`.

In [None]:
sot4 = open("sign-of-four.txt", encoding="utf-8").read()

In [None]:
type(sot4)

In [None]:
len(sot4)

Let's have a look inside!

In [None]:
sot4

Talk about **Literature and Data**! Here is our precious **literature** being **treated as data**.
* How does it make you feel?
* How does looking at the text in this way compare to looking at it in the pages of a book?



Now, you might be wondering what all those `\n` things are. It was briefly mentioned above — but in case you've forgotten, have a look at the [original text file on Project Gutenberg](https://www.gutenberg.org/cache/epub/2097/pg2097.txt). Does that provide any clues?

In [None]:
sot4[:200]

Notice that if we `print()` this out, we get quite different output!

In [None]:
print(sot4[:200])
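The difference between those two outputs comes down to this: when a notebook displays a bare string, it shows the string as you would have to *type* it, quotation marks, `\n`s and all, whereas `print()` actually *carries out* the newline. Python's built-in `repr()` function gives us that "as typed" view on demand. A small sketch with a made-up two-line string:

```python
two_lines = "Chapter I\nThe Science of Deduction"

# The "as typed" view: quotation marks included, \n shown as two visible characters.
print(repr(two_lines))

# The printed view: the \n becomes an actual line break.
print(two_lines)

# Either way, the string contains exactly one newline character.
print(two_lines.count("\n"))
```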

Anyway, let's try **tokenizing** this as is and look at the output.

In [None]:
sot4_words = sot4.split()

In [None]:
sot4_words

And then... just one more step to get our number of tokens!

In [None]:
len(sot4_words)

Now, this notably **does not** agree with the number I had on the slides last week, which was **43,520**. What do you think could explain the difference?



In terms of the TTR project, we now know how to:
* load a file
* split it into words
* count the number of words

We don't have the tools yet to:
* remove punctuation 
* automatically go into a folder full of lots and lots of files, load them all up, and count their lengths
* count unique types
* automatically standardize our sample size

To do all the above, we'll need to learn a bit about iteration and loops. Which we'll do next class...