# Library Carpentry: Tools for Humanists

## Python Lesson

### Part 1: Learning to Read Code

This series of lessons introduces the Python programming language as a tool for working with text as data. Pedagogically, it is modeled on the [Library Carpentry](https://librarycarpentry.org/) curriculum, with additional inspiration from the following sources:
 - Melanie Walsh's [Introduction to Cultural Analytics & Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/)
 - Mark Guzdial's work on [learner-centered design](http://www.morganclaypool.com/doi/10.2200/S00684ED1V01Y201511HCI033) in computing education
 
In this first part, you'll get acquainted with the basic elements of programming and start to get a feel for the syntax of Python. Together we will write some code to perform a few small tasks -- probably nothing that you couldn't do with other tools at your disposal. But we'll intentionally go slowly, asking you to think about what's happening at every step. This work -- for those humanists out there, think of it as a kind of close reading -- will help you develop your conceptual model of Python (and of programming languages more generally).

#### Objectives
 - Explore the Python interpreter via a Jupyter notebook
 - Create variables to hold different types of data
 - Read data in from a file
 - Transform data from one type to another
 - Work with conditionals and iterative structures to process data efficiently
 - Store output in a persistent and portable format

#### 1. What is runnable in Python?
 - Code cells vs. markdown cells
 - Parts of a notebook:
   - Browser interface
   - Python kernel

##### Exercise 1.1

Execute the following code cells and note how Python responds. Discuss with your neighbor: what kinds of input are valid for Python? What do you notice about the output you receive?

In [None]:
"Hello world"

In [None]:
9 + 3

In [None]:
"9 + 3"

In [None]:
Hello world

In [None]:
Hello

In [None]:
file_name = 'newton_opticks.txt'

In [None]:
print(file_name)

##### Notes
 - Any characters enclosed in single or double quotes constitute a string.
 - Numerals outside of quotes are evaluated as numbers.
 - Sequences of other characters outside of quotes are _names_.
 - Names must a) be defined and b) follow certain syntactical rules.
 - Whitespace cannot be part of a name. (By convention, spaces in names are represented by underscores, e.g., `hello_world` instead of `hello world`.)
 - _Functions_ are names that operate on other names, numbers, strings, and other Python data types.
 - Some operations (like `print`) produce output. Others (like assignment with the `=`) do not.
 - Invalid operations produce error messages.
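
The notes above can be seen side by side in a short sketch. The variable names `greeting`, `total`, and `not_a_sum` are made up for illustration, and the `try`/`except` lines (not otherwise covered in this lesson) are only there so that the error message is printed instead of stopping the cell.

```python
# A quoted sequence of characters is a string.
greeting = "Hello world"
# Unquoted numerals are numbers, so this expression is evaluated.
total = 9 + 3
# Inside quotes, the same characters are just text; nothing is computed.
not_a_sum = "9 + 3"

print(greeting)
print(total)
print(not_a_sum)

# An unquoted sequence of letters is a name. Using a name that has
# not been defined is an invalid operation and raises a NameError.
try:
    Hello
except NameError as error:
    print(f"Python reported: {error}")
```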

#### 2. Working with data in Python

 - Reading a text file into Python
 - Using Python objects & methods
 - Exploring Python types
 - Working with strings

In order to work with a file in Python, it needs to be available to the Python interpreter. Since we're using Python in the cloud, our first step is to put a file into our local environment.

It's possible to upload files into Google Colab, but in this case, it's easier just to fetch the file from the web.

The following command uses the Unix `wget` command to retrieve a text file from our workshop website. The exclamation point at the beginning of the command is important: this is NOT a Python command, but a command we're issuing to the underlying operating system.

You can copy the URL from the Etherpad, under the `Links` section. Make sure you're using the URL for `Newton's Opticks`.


In [None]:
!wget https://raw.githubusercontent.com/gwu-libraries/2022-07-14-gwu/gh-pages/files/newton_opticks.txt

We downloaded the file to our local environment; now we have to make it available to the Python interpreter.

Here we use Python's `open` function to work with an external text file. The file is stored in the same directory as our notebook. (If the file were located elsewhere, we would need to specify a path to it.)


In [None]:
f = open(file_name)

The assignment creates a variable named `f` and associates it with the output of the `open` function. We supply that function with another variable, which we created above, that is associated with the name of a text file. What does the name `f` refer to now? 

In [None]:
f

Every name in Python is associated with what's called an _object_. Some objects are simple values, like strings and numbers. Other objects are functions, like `open` and `print`. Still other objects are more complex. But you don't need to have a firm grasp of what _objects are_; what's important is knowing _how to use them_.

The object associated with `f` has a _method_ called `read`. A method is a special kind of function that "belongs" to another object. We can access it by writing _name of the object DOT name of the function_.  

In [None]:
f.read

But to use a method or function, as opposed to just exposing its underlying object, we need to use _parentheses_. We used the `print` and `open` functions with _arguments_: names or values between the parens that represent objects the function should operate on, or that determine how the function behaves.

Some functions and methods don't accept or don't require arguments. In these cases, we invoke them with empty parens.

In [None]:
f.read()
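
The difference between naming a method and calling it can be sketched without the workshop file. Here `io.StringIO` from the standard library stands in for an open text file; that substitution, and the name `sample_file`, are assumptions made for this illustration only.

```python
import io

# io.StringIO behaves like an open text file held in memory.
sample_file = io.StringIO("Opticks\nBook One")

# Without parentheses we get the method object itself; nothing is read yet.
method = sample_file.read
print(callable(method))

# With parentheses the method runs and returns the contents as a string.
contents = sample_file.read()
print(repr(contents))
```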

##### Exercise 1.2

Take a few minutes with a neighbor to look at the output of `f.read()`. What does this data represent? And what do you notice about how it appears as Python output?

##### Notes
 - text file of Newton's _Opticks_
 - special characters: `\n`, `\ufeff`
 - treated by Python as a single string

Usually, we're working with _sequences_ of Python statements to accomplish a specific task. A complete sequence for reading a text file looks like this. Here we open the file, assign the contents to a new variable, and close the file.

In [None]:
f = open('newton_opticks.txt')
text = f.read()
f.close()

The name `text` lets us refer to our data in memory for as long as this Python session lasts.

We can examine the _type_ that Python has assigned to our data.

In [None]:
type(text)

Every object in Python has a type. A type delimits the kinds of operations the object can participate in. (Similar to how, in English, a noun can be the subject of a sentence, and an adjective can modify a noun.)

One operation available with strings is to evaluate the character in a given position.

In [None]:
text[0]

##### Exercise 1.3

Evaluate the characters at various positions in our `text` string. What do you notice?

##### Notes
 - Some "characters" represented by more than one literal character on screen (Unicode)
 - Escaped characters (`\n`) count as single characters, too
 - First character position at 0

##### Exercise 1.4

Run the following lines of code and discuss with your neighbor. What do these operations do?

In [None]:
text[:100]

In [None]:
len(text)

In [None]:
text.split()

##### Notes
 - Slice vs. single index
 - `len()` function gives the length in _characters_
 - `split()` separates a string on _white space_

We can invoke `split` on `text` because Python strings are objects with special methods of their own. But `split` is an example of a method that transforms one Python type into another. What's the type of the output of `split`?

In [None]:
words = text.split()
type(words)

##### Exercise 1.5

Repeat the operations we performed on `text`, this time with our `words` variable. How do lists behave differently from strings?

In [None]:
words[0]

In [None]:
words[:100]

In [None]:
len(words)

In [None]:
words.split()

##### Notes
 - Lists are ordered collections and can contain _any_ Python types, not just strings
 - Indexing/slicing refers to elements in the list
 - Elements here are multi-character strings
 - Lists lack the `split` method
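
As a compact illustration of these notes, the sketch below uses a short made-up list. The names `words_sample` and `mixed` are assumptions, and `try`/`except` is used only so that the error is printed instead of stopping the cell.

```python
words_sample = 'Of Light and Colours'.split()

# An element of this list is a whole string, not a single character.
print(words_sample[0])

# Slicing a list returns a shorter list.
print(words_sample[:2])

# Length is counted in elements.
print(len(words_sample))

# A list can hold values of different types.
mixed = ['light', 3, 2.5]
print(mixed)

# Lists have no split method, so calling it raises an AttributeError.
try:
    words_sample.split()
except AttributeError as error:
    print(f'Python reported: {error}')
```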

#### 3. Test, repeat, test, repeat
How many times does the word "light" appear in Newton's _Opticks_? 

Let's make a logical plan:
1. Assume our `words` list is a good-enough representation of the words in the text
2. Evaluate each word in our list: is it "light"?
   - If so, increase the count by one
3. Print the total count of instances of "light"

In [None]:
n = 0
for word in words:
    if word == 'light':
        n += 1
print(f"The word 'light' appears {n} times in Newton's Opticks.")

Line-by-line analysis:
1. Create a new variable, initially set to 0, to keep count
2. Start a `for` loop: the variable `word`, our "loop variable," will get the value of each element in our `words` list, in order
3. The colon at the end of the line creates a _code block_
4. Indented lines are treated as part of the block -- they will happen _each time_ `word` receives a new value
5. The `if` statement introduces another code block
6. The _double equals_ (`==`) is not a variable assignment. It evaluates one object in terms of another, and returns `True` if both objects are "the same" 
7. The code within the `if` block (double indented) is executed _only if_ the expression in the `if` statement evaluates to `True`
8. The `+=` operator increments a numeric variable by what's on the right side
9. The `f` before the opening quotation mark introduces a special kind of string (an _f-string_). What appears between the curly braces is interpreted not as part of the string but as a Python name -- in this case, our counter variable -- and the value with which it's associated is inserted into the string.

##### Exercise 1.6

"Truth" in a program depends on a) the structure of the language and b) decisions made by the programmer.

How does `word == 'light'` in our little program define truth? Are there situations where our test would _not_ catch instances that we might want to count as true? Are there situations where it might flag as true instances that we would want to count as false?

##### Notes
 - Because we've defined a _word_ as whatever is separated by white space, our test will fail on strings like the following: `light,`, `light.`, `light).`, etc.
 - We've also defined our word as lowercase, excluding cases where the word begins a sentence.
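
To see these misses concretely, we can run the same test over a handful of hand-picked strings. The `candidates` list is invented for this sketch; it is not drawn from the text.

```python
candidates = ['light', 'light,', 'light.', 'Light', 'delight']

n = 0
for candidate in candidates:
    # == is True only when both strings match character for character.
    if candidate == 'light':
        n += 1
    print(candidate, candidate == 'light')

print(f'{n} of {len(candidates)} candidates passed the test.')
```

Only the first candidate passes, even though a human reader would probably count the first four as instances of the word.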

#### 4. Improving our program
 - Using regexes in Python
 - Importing modules
 - Dealing with `None`

A **module** is an external set of Python objects that you can load when you need additional functionality.

Modules that form part of the basic Python installation belong to the _standard library_.

Other modules can be installed as needed; we'll see how to do that later in our workshop.

For now, we'll import the `re` module, which lets us use regular expressions in Python.

In [None]:
import re

The first step is to define a regular expression. We put it between quotes in order to create a string.

To keep it simple, we'll create a pattern that matches the letters `light` (with either a lowercase or an uppercase `l`), optionally followed by one additional _non-alphabetic_ character. We'll also include the `^` and `$` metacharacters at the beginning and end of the regex, respectively, to signal that this pattern should occupy the whole word. In other words, we don't want false matches on words like `lightning` or `delight`.

In [None]:
pattern = '^[lL]ight[^A-Za-z]?$'

Now let's test it. We can use the `re.match` function to apply a regular expression to another Python string.

Unlike the functions and methods we've used up to now, `match` takes _two_ arguments: the first should be a regular expression, the second a string to be evaluated.

In [None]:
re.match(pattern, 'light')

In [None]:
re.match(pattern, 'Light')

In [None]:
re.match(pattern, 'lightning')

In [None]:
re.match(pattern, 'delight')

Notice how the behavior of the function differs when it finds a match and when it doesn't.

Every function in Python has a _return_ value. A function can return any type of value; if it doesn't explicitly return a value, by default it returns a special _null_ value called `None`. 

In [None]:
type(re.match(pattern, 'light?'))

In [None]:
type(re.match(pattern, 'lightly'))

In an `if` statement, we can evaluate the return value of a function; if the function returns `None`, this will be treated as `False`.
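
A minimal sketch of that behavior, using a few hand-picked test strings (the list and the names `candidate` and `result` are made up for illustration). The `else` branch runs whenever `re.match` returns `None`.

```python
import re

pattern = '^[lL]ight[^A-Za-z]?$'

for candidate in ['light', 'Light,', 'lightning', 'delight']:
    result = re.match(pattern, candidate)
    # A match object counts as True in a conditional; None counts as False.
    if result:
        print(f'{candidate}: matched')
    else:
        print(f'{candidate}: no match (re.match returned {result})')
```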

##### Exercise 1.7

Can you modify the previous `for` loop to use the regular expression to check for the presence of the word `light` in Newton's text?

Drag the lines from the left-hand window into the right-hand window, arranging them in the correct order and with the correct indentation to solve the exercise. Note that not all lines of code are necessary to the solution.

http://parsons.problemsolving.io/puzzle/53cbfc9d50364664897a3cffc914c0a6

##### Answer
```
n = 0
for word in words:
    if re.match(pattern, word):
        n += 1
print(f'The word "light" appears in Newton\'s Opticks {n} times.')
```

In [None]:
n = 0
for word in words:
    if re.match(pattern, word):
        n += 1
print(f'The word "light" appears in Newton\'s Opticks {n} times.')

#### 5. More word counts
To find the frequency of a single term in a document, we could use `Ctrl-F` in a word processor or text editor.

But what about a table showing the frequency of occurrence of _every word_ in the document? Word frequencies are the basis of many methods of quantitative text analysis, including topic modeling. While Python libraries exist that can automatically compute these for us, we'll walk through a basic version of code to accomplish this.

##### Exercise 1.8

Working with a neighbor, develop a logical plan for this task, using the `words` list we created above. 

##### Notes
 - A collection mapping each word (string) to its frequency (number)
 - Loop through the list of words
 - Whenever we encounter a new word, add it to the collection and set the count to 1
 - If we've seen the word before, increment the count 

##### Collections: Lists v. Dictionaries
 - A collection consists of a container and the elements contained in it
 - Lists: 
   - container = square brackets
   - elements = any Python objects, separated by commas
   - elements accessed by position
 - Dictionaries: 
   - container = curly braces
   - elements = pairs of Python objects
     - First element (key): a string, a number, or certain other kinds of objects
     - Second element (value): any Python object
     - Multiple elements are separated by commas
   - elements accessed by key
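
The contrast can be sketched with two tiny collections. The names `titles` and `years` are made up for this example; the values are the publication years of the two books.

```python
# A list: square brackets, elements reached by position.
titles = ['Opticks', 'Principia']
print(titles[0])

# A dictionary: curly braces, key-value pairs, values reached by key.
years = {'Opticks': 1704, 'Principia': 1687}
print(years['Opticks'])
```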


Here are two ways to create a dictionary with two key-value pairs.

In [None]:
word_counts = {'telescope': 4, 'light': 3}

In [None]:
word_counts = {}
word_counts['telescope'] = 4
word_counts['light'] = 3

These ways of creating the dictionary are equivalent. To access the value associated with the key `light`, we write this:

In [None]:
word_counts['light']

In [None]:
word_counts

Like lists, dictionaries can be looped over using `for`. To loop over the key-value pairs, we can use the special method `items`, which returns both key and value.

In [None]:
for word, count in word_counts.items():
    print(f'Key is {word}')
    print(f'Value is {count}')

We can also check for the presence of a key (but not a value) in a dictionary by using the `in` keyword in a conditional statement.

In [None]:
if 'light' in word_counts:
    print('Key is present')

##### Exercise 1.9

Let's try to put together loops, conditionals, dictionaries, and lists in order to create a dictionary of word frequencies in Newton's _Opticks_. 

See if you can put the lines of code in the correct order: http://parsons.problemsolving.io/puzzle/e6751f6b5b8941e9b72c3a7128a33124

In [None]:
word_counts = {}
for word in words:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
for word, count in word_counts.items():
    print(f'{word} -- {count}')

##### Notes
 - 2 `for` loops: 
   - to build the dictionary of word counts
   - to print the keys/values in the completed dictionary
 - Need to create a key and assign it a value before you can increment the value
 - `if`/`else` for branching conditionals
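
For comparison, the standard library can build the same kind of table for us. The sketch below uses `collections.Counter` on a short made-up sentence rather than on Newton's text; it is offered as one possible shortcut, not as a replacement for understanding the loop above.

```python
from collections import Counter

sample_words = 'the light of the sun and the light of a candle'.split()

# Counter builds a dictionary-like mapping of each word to its count.
counts = Counter(sample_words)
print(counts['light'])

# most_common(n) returns the n most frequent (word, count) pairs.
print(counts.most_common(1))
```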

#### 6. Saving our work
Our final step in this lesson is to save our dictionary of word counts to disk. We'll use _comma-separated value_ (CSV) format, which is very portable and will create a human-readable table of results.

 - the `csv` module handles details of the CSV format, including separators (commas by default), quoting text that contains the separator, etc.
 - import the `csv` module
 - define our header row: each row will be a Python list
 - the `with open()` syntax allows us to open a file without having to remember to close it (best practice)
 - the `'w'` argument to `open` signals that we want to write to this file
 - create a `writer` object from the `csv` module
 - use the `writerow` method first to create the header row (we do this once, outside the loop)
 - loop over the _items_ in our dictionary
 - create a list, using the loop variables
 - call the `writerow` method within the loop to create one row per word count

In [None]:
import csv

In [None]:
header_row = ['word', 'count']

In [None]:
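# newline='' lets the csv module manage line endings itself (recommended in the csv documentation)
with open('newton-opticks-wordcount.csv', 'w', newline='') as f: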
    writer = csv.writer(f)
    writer.writerow(header_row)
    for word, count in word_counts.items():
        row = [word, count]
        writer.writerow(row)
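
One way to check the result is to read the file back in. The sketch below is self-contained: it writes a tiny made-up dictionary to a temporary file (the path and the name `sample_counts` are hypothetical) and then loads the rows with `csv.reader`. Note that the counts come back as strings, because CSV stores everything as text.

```python
import csv
import os
import tempfile

sample_counts = {'telescope': 4, 'light': 3}
path = os.path.join(tempfile.mkdtemp(), 'sample-wordcount.csv')

# Write a header row, then one row per word.
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'count'])
    for word, count in sample_counts.items():
        writer.writerow([word, count])

# Read the rows back; each row is a list of strings.
with open(path, newline='') as f:
    rows = list(csv.reader(f))

print(rows)
```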

To download this file, open the `Files` panel on the left-hand side of your notebook window by clicking the folder icon. Locate your CSV file, click the three dots beside the file name, and select `Download`.