# DH Downunder: Distant Reading

## Notebook 1: Welcome!

Welcome to the Distant Reading stream of *DH Downunder 2018*. In this first session, we are going to cover some of the basics of programming in Python, particularly using the Jupyter Notebooks interface.

At the end of this notebook, you will know how to use Jupyter's features to code interactively in Python, you will know how to import and export data, and you will have conducted your first distant reading of a corpus of texts.

## Section 1: Programming in Python using Jupyter notebooks

Jupyter Notebooks provide an interactive programming environment, where you can write code, run it to test its output, and annotate it so it is easier for a human to interpret.

A notebook consists of cells, which can be of two main types:
* **Markdown cells:** Markdown cells (like this one) contain text and images. They are where you can keep your 'notes' about the program you are writing. 'Markdown' is an extremely simple markup language for laying out documents. If you double-click on this cell, you can see and edit the underlying markdown. Simply press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> to return to the rendered view.
* **Code cells:** Code cells contain Python code you can execute. If you want to run some Python code, create or click on a code cell, write the code if it's not already there, then press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> and you will see the output just below.

### Exercise 1.1: Edit this markdown cell

Visit [this GitHub guide](https://guides.github.com/features/mastering-markdown/) to find the basic syntax of markdown. Then double-click on this cell to edit it, and see if you can do the following:

Turn me into a header 2

Put the word 'bongo' in italics and the phrase 'topknot pigeon' in bold.

Make
us
into
a
bullet
point
list

Turn these fruits into a numbered list:
Durian
Pawpaw
Breadfruit

Have fun! If you want a more advanced challenge, you might try to include a code block, a blockquote, a hyperlink or a table. If you want to see the results of your completed markdown cell at any time, just press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.

### Exercise 1.2: Your first Python program

Well, this may or may not be your first Python program, but it is definitely your first one for this class. The cell below is a Python cell. There are two parts to the exercise:
1. **Make your computer say something cool:** use the `print()` function to make your computer say something cool. Simply type `print()`, and then inside the parentheses, type a phrase in inverted commas or quotation marks, e.g. `'I am Inigo Montoya'`. Then hit <kbd>Ctrl</kbd> + <kbd>Enter</kbd> to see the output of your program.
2. **Describe your program with a comment:** use a hash symbol (`#`) to indicate comments, e.g. `#this is a comment`. It is good practice to annotate your programs with comments, so you can understand what you've done when you return to them.

In [None]:
# YOUR CODE HERE:


# END OF YOUR CODE

### Exercise 1.3: Adding a new cell

For this exercise, you will add a new cell and write a slightly more complex program.

* **Add a cell:** Click on the <kbd>+</kbd> button at the top of the screen. Make sure it is a 'code' cell. If the cell appears in the wrong place, just use the <kbd>↑</kbd> and <kbd>↓</kbd> buttons at the top of the screen to move it to where you would like it to be. 

For this new program, you are going to define a variable and print it as part of a sentence.
* **Define a variable:** Simply type a name for your variable, e.g. `my_var`, then type `=`, and then type the value you wish to assign to your variable. This can be an integer, e.g. `1`, a floating point number, e.g. `3.14`, or a string, e.g. `'Songs of a Sentimental Bloke'`.
* **Print it as part of a sentence:** Use a [formatted string](https://docs.python.org/3/reference/lexical_analysis.html#f-strings). To use a formatted string, you need to add an `f` just before the string begins. This lets you include your variable in the string by using curly braces: `{}`, e.g. `print(f'I wish I had {my_var} dollars!')`.
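
As a minimal sketch of these two steps together, here is one way they might look. The variable names `my_title` and `copies` are made up for illustration, and you should choose your own:

```python
# Define two variables: a string and an integer (both names are illustrative).
my_title = 'Songs of a Sentimental Bloke'
copies = 2

# The f before the opening quote lets the curly braces pull in each variable's value.
sentence = f'I own {copies} copies of {my_title}.'
print(sentence)
```

Note that the variables go inside the curly braces without any quotation marks of their own.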

## Section 2: First steps in distant reading

Now that you are a master of Python and of Jupyter notebooks, it is time to start working with some text data.

From a computer's perspective, a text is simply a string of characters: `We slept in what had once been the gymnasium.`. In this and the workshops to follow, we will learn how to take a string of characters, and transform it so a computer can find interesting patterns in it.

### Exercise 2.1: Tokenisation

The most basic operation in text analysis is 'tokenisation', or splitting a string into individual words or 'tokens'. Before you execute the cell below, try to predict what the output will be:

In [None]:
my_string = 'We slept in what had once been the gymnasium.'
my_tokens = my_string.split()
print(f'String before tokenisation: {my_string}')
print(f'String after tokenisation: {my_tokens}')

You can use the `.split()` method in this way to tokenise any string in Python. The square brackets `[...]` mean that your tokenised sentence is now a list of strings. You can get a particular word from a list using indexing. Python, like many programming languages, counts from 0. If you had a list of five words called `my_five_words`, you could get the first word by typing `my_five_words[0]`, and the fifth word by typing `my_five_words[4]`. You can also count backwards from the end of the list using negative numbers. So you could also get the fifth word by typing `my_five_words[-1]`.
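
The indexing rules can be sketched with a small example. The contents of `my_five_words` here are invented purely for illustration:

```python
# A hypothetical five-word list, using the name from the explanation above.
my_five_words = ['call', 'me', 'by', 'my', 'name']

first = my_five_words[0]    # counting starts at 0
fifth = my_five_words[4]    # so the fifth word sits at index 4
last = my_five_words[-1]    # -1 counts back from the end of the list

print(first, fifth, last)
```

In a five-word list, `[4]` and `[-1]` point at the same item, so `fifth` and `last` are identical here.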

In the cell below, type in your own sentence (maybe your favourite line of poetry, a line from your favourite film, or a quote from your favourite poststructuralist theorist), tokenise it, then get the first word, the third word and the last word.

When you're done, try putting different sentences into `my_string`. In particular, try a sentence that has extra spaces between the words or unorthodox punctuation. Can you see any limitations of using `.split()` to tokenise a string?

In [None]:
# YOUR CODE HERE:
my_string = 
my_tokens = 

first_word = 
third_word = 
last_word = 
# END OF YOUR CODE

print(f'My sentence is: {my_string}')
print(f'The first word is "{first_word}".')
print(f'The third word is "{third_word}".')
print(f'The last word is "{last_word}".')

### Exercise 2.2: Capitalisation

In distant reading, it is also very common to strip out capital letters. This is because as far as a computer is concerned, `q` and `Q` are completely different letters. You can check this by running the cell below. The `==` operator means 'is the same as'.

If you like, you can try typing in some different words and see what Python thinks. What does it think of the two French words `ou` (or) and `où` (where)?

In [None]:
'Enterprise' == 'enterprise'

Luckily there is an easy solution to this problem, and in the cell below you are going to find it. You have two tasks:
* **Put the supplied string into lower case:** Visit the [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) page of the Python documentation and find a method like `.split()` that will put all the characters in lower case.
* **Count the number of times 'the' appears in the string:** Visit the [list methods](https://docs.python.org/3/tutorial/datastructures.html) page of the Python documentation and find a method that will let you count the number of times the word 'the' appears in your list of tokens.

When you have completed these two tasks, you can try:
* Finding methods to put your string into UPPERCASE or Title Case.
* Counting how many times 'the' appears if you don't put the string in lower case.

In [None]:
sentence = 'He—for there could be no doubt of his sex, though the fashion of the time did something to disguise it—was in the act of slicing at the head of a Moor which swung from the rafters.'

# YOUR CODE HERE

# Put the sentence in lower case:
lower_case = 

# Tokenise the new lower-case sentence:
token_list = 

# Count the number of times 'the' appears in the list of tokens:
num_the = 

# END OF YOUR CODE
print(f'The original sentence was:\n"{sentence}"\n')
print(f'After lowering the case, it became:\n"{lower_case}"\n')
print(f'After tokenisation, it became:\n{token_list}\n')
print(f'The word "the" appears {num_the} times in the sentence.')

## Section 3: Importing data, analysing text

In the folder for this session, I have provided the .txt files of two novels, Charlotte Brontë's *Jane Eyre* and the Australian classic, *Such is Life* by [Joseph Furphy](http://adb.anu.edu.au/biography/furphy-joseph-6261).

It is quite easy to import a whole novel as a single long string into Python. In this final part of the notebook, you are going to import these two novels and compare some patterns in their language.

### Exercise 3.1: Import the two novels

To import the novels, you will need to use the `open()` function. This command tells Python to connect to a file so it can read or write to it. Usually this is done using a `with` construction. This is somewhat complicated. For now, just trust me that it works.

The `open()` function requires you to give it two pieces of information. It needs to know which file you want it to open, and what you would like Python to do with the file. For example, if you had done some analysis, and wanted to save it to a file called 'analysis_results.txt', you would use the command:
```python
open('analysis_results.txt', mode = 'w')
```
The `mode = 'w'` argument tells Python that you want to *write* to this file. You can [check the documentation](), or maybe just guess, what you would put if you want Python to *read* the file instead.
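
To give a feel for the `with` construction, here is a rough sketch of *writing* to a file. It assumes a throwaway temporary folder (created with the standard `tempfile` module) so that no real files are touched, and the names `folder`, `results_path` and `results_file` are invented for this example:

```python
import os
import tempfile

# Work in a throwaway folder so no real files are affected (an assumption for this sketch).
folder = tempfile.mkdtemp()
results_path = os.path.join(folder, 'analysis_results.txt')

# The with block opens the file, and closes it again automatically when the block ends.
with open(results_path, mode='w', encoding='utf-8') as results_file:
    results_file.write('My analysis results go here.')

# Confirm that the file now exists on disk.
print(os.path.exists(results_path))
```

Everything indented under the `with` line happens while the file is open. Once the indentation ends, Python closes the file for you.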

Your first exercise is to use `open()` to find the novels and read in the text:
* **Find the novels:** The novels are stored in the files `jane_eyre.txt` and `such_is_life.txt`.
* **Read in the text:** Set `open()` to the correct mode so it *reads* rather than *writes* the files. *If you put `mode = 'w'`, Python will erase the contents of the text files. If you do this accidentally, I can provide a backup.*

In [None]:
# YOUR CODE HERE

# Import Jane Eyre
with open( , mode = , encoding = 'utf-8') as file1:
    jane_eyre = file1.read()

# Import Such is Life
with open( , mode = , encoding = 'utf-8') as file2:
    such_is_life = file2.read()

# END OF YOUR CODE

print(f'jane_eyre is a string {len(jane_eyre):,} characters long.')
print(f'such_is_life is a string {len(such_is_life):,} characters long.')

### Exercise 3.2: Decapitalise and tokenise the novels

Looking at your code above, put the two novels in lower case and then tokenise them.

When you have done this, use the `len()` function on the tokenised version of each novel to calculate their word counts. You can find out about the function [here](https://docs.python.org/3/library/functions.html#len). *NB: `len()` is a 'function' like `open()` or `print()`, not a 'method' like `.split()`.*
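
The difference between a function and a method is easiest to see side by side. This is only a sketch on a made-up sentence, with the invented names `toy_tokens` and `toy_word_count`:

```python
# A method is attached to an object with a dot: here .split() is called on a string.
toy_tokens = 'the cat sat on the mat'.split()

# A function stands alone: the list goes inside the parentheses of len().
toy_word_count = len(toy_tokens)

print(toy_word_count)
```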

In [None]:
# YOUR CODE HERE

# Pre-process Jane Eyre
jane_eyre_lower = 
jane_eyre_tokens = 
jane_eyre_word_count = 

# Pre-process Such is Life
such_is_life_lower = 
such_is_life_tokens = 
such_is_life_word_count = 

# END OF YOUR CODE

print(f'Jane Eyre is {jane_eyre_word_count:,} words long.')
print(f'Such is Life is {such_is_life_word_count:,} words long.')

### Exercise 3.3: Gendered language

Now that you have converted each novel into a list of words, you can begin to do some basic analysis. Charlotte Brontë is one of the great feminist novelists of the nineteenth century, while Joseph Furphy is more famous for his democratic nationalism. Both suffered from typical nineteenth-century attitudes towards race. Can we find any evidence of this using basic distant reading?

In this part of the notebook, we are going to use the `pyplot` module from the package `matplotlib`. You should have installed `matplotlib` as part of your preparation for this workshop. This module allows you to create many common kinds of graphs with ease.

The basic way to load a module from a package is like this:
```python
from package import module as nickname
```
As we will see in later sessions, you can also just import an entire package if you like, which requires a slightly different command. It is usually a bit more convenient just to import the modules you need, for reasons we will also see.
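
To see the pattern in action without touching `matplotlib`, here is a small sketch using `os` from Python's standard library, so nothing needs installing. Strictly speaking `os` is a module rather than a package, but the import syntax is the same. The nickname `p` and the file path `novels/jane_eyre.txt` are invented for illustration:

```python
# Import the path module from os, under the nickname p.
from os import path as p

# Functions inside the module are then reached through the nickname.
file_name = p.basename('novels/jane_eyre.txt')
print(file_name)
```

Once the nickname is set, you type `p.` instead of the module's full name every time you use it.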

Use the next cell to import the `pyplot` module of `matplotlib`, and give it the nickname `plt` for convenience.

In [None]:
# YOUR CODE HERE

# END OF YOUR CODE

# This line is sometimes necessary to allow plots to appear in Jupyter Notebooks
%matplotlib inline

plt.xlim((0,10))
plt.ylim((0,10))
plt.scatter([5,5,4,3,6,6,7,9,1,1,2,3,5,6,5], [9,8,7,6,5,4,3,4,5,6,5,4,2,3,4])
plt.title("A totally random scatterplot.")
plt.xlabel("Meaningless variable 1")
plt.ylabel("Meaningless variable 2")
plt.show()

Now that you have imported `pyplot` and both the novels, we can start to investigate gendered language in the two novels, and display the results graphically.

In the next cell, find the number of times that Brontë and Furphy use the words 'he' and 'she'. Fill in the curly braces `{...}` in the f-strings to display the results in a human-readable format.

In [None]:
# YOUR CODE HERE

# Count 'he' and 'she' in Jane Eyre:
bronte_he = 
bronte_she = 

# Count 'he' and 'she' in Such is Life:
furphy_he = 
furphy_she = 

print(f'Charlotte Brontë uses the word "he" { } times, and "she" { } times.')
print(f'Joseph Furphy uses the word "he" { } times, and "she" { } times.')

# END OF YOUR CODE

Now you can display the results in a bar graph using `plt.bar()`. This function requires two arguments in the parentheses:
1. A list of numbers giving the height for each bar. You have four variables that you wish to plot. In the next cell, put the four of them into a list by enclosing them in square brackets `[number_1, number_2, ...]`. Then feed this list to `plt.bar()` using the keyword `height = `.
2. A list of labels for the bars, which you can feed to `plt.bar()` using the keyword `tick_label = `. You should make a list of labels by using square brackets and typing the string for each label into the list, e.g. `["label 1", "label 2", ...]` Make sure that your list of variables and your list of labels are in the same order, so that the plot is correct!

Finally, give your plot a title, and label the y axis using `plt.title()` and `plt.ylabel()`.

In [None]:
# YOUR CODE HERE:

variable_list = []
label_list = []

plt.bar(x = [1,2,3,4], height = , tick_label = )
plt.title()
plt.ylabel()

# END OF YOUR CODE

plt.show()

Of course, this comparison could be unfair, because *Such is Life* has fewer words than *Jane Eyre*. It would be more accurate to compare the relative frequency of the words 'he' and 'she' in each novel. An easy way to do this is to divide the raw word counts by the total words in each novel. You have already calculated the total words for each novel above. You can use `/` to do the division.
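
The arithmetic can be sketched with made-up numbers. The counts below belong to an imaginary novel, not to either of the two texts, and all the `toy_` names are invented for this example:

```python
# Made-up counts for an imaginary novel (not taken from either text).
toy_he_count = 50
toy_total_words = 1000

# Relative frequency: the share of all the words in the novel that are 'he'.
toy_he_relative = toy_he_count / toy_total_words

# The :.1% format displays the fraction as a percentage with one decimal place.
print(f'"he" makes up {toy_he_relative:.1%} of the words.')
```

Because the result is a proportion rather than a raw count, it can be compared fairly between novels of different lengths.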

In [None]:
# YOUR CODE HERE:

# Divide 'he' and 'she' in Jane Eyre by the total words in Jane Eyre:
bronte_he_relative = 
bronte_she_relative = 

# Divide 'he' and 'she' in Such is Life by the total words in Such is Life:
furphy_he_relative = 
furphy_she_relative = 

# Create lists for bar graph:
relative_variable_list = []
relative_label_list = [] # HINT: if you have put the variables in the same order, you can use your old label_list

# Create your new graph:
plt.bar(x = [1,2,3,4], height = , tick_label = )
plt.title()
plt.ylabel()

# END OF YOUR CODE

plt.show()