# JupyterHub

### Accessing JupyterHub

- Go to https://jhub.dartmouth.edu/  
- Log in using your university credentials  
- Click "start my server"  
- Select "COLT 70 -- Spring 2022"  
- It may take some time for the server to start up  
- The notebooks are in the "notebooks" directory on the left  
- Double click the notebook you want to open

### Editing a Notebook

At this point you can only view a notebook. In order to be able to edit the notebook:  
 - right click on the notebook file you want to edit (listed on the left)  
 - select "copy"  
 
 ![right-click](right-click.png)  
 
 
 - double click on your home directory (by clicking the directory icon above the list of notebooks)  
 
 ![home-directory](home-directory.png)  
 
 - paste the notebook  

You should now be able to edit and run the cells of the notebook.
For the Introduction to Python below, also copy over the "gilman.txt" file. 

N.B. You can check whether a notebook is in read-only or editable (writable) mode by hovering your cursor over the name of the notebook file. This lists some information about the notebook: its name, size, path (i.e. where the notebook is located), date of creation, the last time it was modified, and the type of kernel it is running on. If it lists “true” next to “Writable” then the notebook is editable (in _writable_ mode). 

### Uploading files

To upload files to JupyterHub:  
 - make sure you are in your home directory (where you paste the writable/editable notebooks)  
 - click the upload icon  
 
 ![uploadscreenshot](uploadscreenshot.png)  
 
 - select your files to upload

-------------------
# Contents of this notebook  

[Introduction to Notebooks](#section-1)
- [Editing and running code in notebooks](#section-2)  
- [Clearing outputs: interrupting and restarting the kernel](#section-3)  

[Introduction to Python](#section-4)
- [Imports](#section-5)
- [Functions](#section-6)
- [Variables](#section-7)
- [Data types](#section-8)
- [Methods](#section-9)
- - [List methods](#section-10)
- - [String methods](#section-11)
- [For loops](#section-12)
- [Conditionals and Boolean values](#section-13)
- [Flow of execution and Comments](#section-14)
- [Working with text files](#section-15)
- - [Opening and reading text files](#section-16)
- - [File paths](#section-17)
- - [Working with multiple files](#section-18)
- [Wrap up](#section-19)
- [Debugging](#section-20)
-------------------

<a id='section-1'></a>
# Introduction to Notebooks  

This is a notebook! A notebook is composed of different cells which can be used to write markdown text or computer code.

If you double click on this text, you will see that this is a "Markdown" cell for writing text, inserting images, videos, links etc. 

In [None]:
## This is a code cell

N.B. You can select the cell type at the top of this document with the drop down menu. 

Jupyter notebooks therefore allow you to combine code, text, images, visualizations etc. all in one place. You can edit and run code in a notebook, which makes it an ideal place to play around with and test out code. The possibilities of combining code and text afforded by notebooks are not only useful as a pedagogical and learning environment, but can also offer a way to make our analytical process explicit and to reflect on it:

> "Notebooks are theory — not merely code as theory but theory as thoughtful engagement with the theoretical work and implications of the code itself. Disciplinary norms— including contextual framing, theory, and self or auto-critique— need to accompany, supplement, and inform any computational criticism. Revealing as much of the code, data, and methods as possible is essential to enable the ongoing disciplinary conversation. Compiling these together in a single object, one that can be exported, shared, examined, and executed by others, produces a dynamic type of theorization that is modular yet tightly bound up with its object." (Dobson, James E. _Critical Digital Humanities: The Search for a Methodology_. Urbana-Champaign: University of Illinois Press (2019) p. 40)

<a id='section-2'></a>
### Editing and running code in notebooks

To edit a cell double click the cell. 

To run a code cell (or render a Markdown cell):  
> - select the cell (a blue line will appear on the left when the cell is selected)  
> - click the "play" icon at the top of the notebook OR press shift+enter.

Note that the code cells have a pair of square brackets with a colon next to them [ ]:  
Once you run a code cell a number will appear in the brackets. The number counts cell executions since the kernel was started, so it goes up by one each time any cell is run. This can help you keep track of which cells were run and in what order.  

Try running the code cell above a couple of times and see the number in square brackets change.  

Now run the cell below:

In [None]:
print('Waiting 5 seconds...')
import time
time.sleep(5)
print('Done')

Did you notice the asterisk in the brackets when running the cell? An asterisk displays whilst code is still busy executing, and the number appears when it has finished executing.  

Beneath the code cell, any output from running the code will appear. In the case of the cell above "Waiting 5 seconds..." and "Done" were printed beneath the cell.

<a id='section-3'></a>
### Clearing outputs: interrupting and restarting the kernel

If your code seems to be getting stuck, or if the flow of execution of the cells has become mixed up, it’s a good idea to make a fresh start and clear all the outputs. 

To clear the outputs of a single cell:  
> select the cell > click Edit tab > select clear output  

To clear the output from all cells:  
> Edit tab > Clear All Outputs

Interrupt and restart the kernel:  
> You can click the “stop” and “restart” icons at the top of the notebook (next to the “play” icon)
or  
> Select Kernel tab and select the options you want (Interrupt Kernel, Restart Kernel).  

Clearing outputs removes the displayed results of executed code. Restarting the kernel restarts the backend component that actually runs the code written in the notebook, so any variables, imports and functions defined so far are forgotten and the cells need to be run again. Notebooks are browser-based documents: you view and edit them through your browser, while the code itself runs in a kernel on a Jupyter server (with JupyterHub, that server is started for you when you click "start my server", which is why it sometimes takes some time to load). Because you work on your own copy of a notebook, you can edit it without altering the “original” notebook. If you restart the kernel it refreshes the environment which runs your notebook.

<a id='section-4'></a>
# Introduction to Python

Writing code involves describing a series of steps (detailed instructions) to go through in order to perform a task. You _express_ the instructions you want carried out in a programming language, the computer _evaluates_ those instructions, and this results in an _output_. Whilst programming languages are in some ways very rigid (the *exact* syntax and instructions need to be defined in order for the code to work), this doesn’t mean there isn’t some degree of flexibility. There are multiple ways of achieving something through code, and people have different coding styles that might differ according to their goals (e.g. is the code efficient? is the code readable? etc.).

### Anatomy of a Python Script

Here is an example of a programme. The end-goal (output) of this programme is to count and display the most frequent words in a given text. We're going to go through this example script and break it down to look at some basics of the Python programming language.

In [None]:
#Importing the libraries and modules we'll need
import re
from collections import Counter

#Defining a function to split our text into words
def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

#Defining variables we're going to need
#Giving the path to the text we're going to analyze
filepath_of_text = "gilman.txt"

number_of_desired_words = 40

#Defining a list of stopwords
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']

#Opening and Reading the text to analyze it
full_text = open(filepath_of_text, encoding="utf-8").read()

#Splitting the text into words by using our previously defined function
all_the_words = split_into_words(full_text)

#Making a list of words with our stopwords removed
meaningful_words = []
for word in all_the_words:
    if word not in stopwords:
        meaningful_words.append(word)

""" Another way of writing this for loop
meaningful_words = [word for word in all_the_words if word not in stopwords] """

#Counting how many times each word in our "meaningful_words" list appears
meaningful_words_tally = Counter(meaningful_words)

#Pulling out the top 40 most frequently occurring words from our complete tally using the "most_common" method
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)

#Display these top 40 most frequent words
most_frequent_meaningful_words

If you encounter a `FileNotFoundError` when you run the program, check that you have copied and pasted the "gilman.txt" file from the notebooks repository into your home directory (just as you did with the notebook). The "gilman.txt" file and the notebook need to be in the same place (more on this in the File paths section below in Working with text files).

<a id='section-5'></a>
#### Imports

There is a lot of code that has already been written, and writing your own code can mostly involve adapting code already written by others to your own needs.  

At the top of a program, it is common to import any libraries or modules you might need later on in the program.  

Python includes many libraries (importable packages of already written code) that can do different things. In this case we are importing the module `re` (which will allow us to use regular expressions to clean up our text) and from the `collections` module we’re importing `Counter` (a class which will allow us to count things).

In [None]:
import re
from collections import Counter
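
As a rough illustration of what `Counter` does once it is imported, here is a minimal sketch. The list `words` is made up for this example and is not taken from "gilman.txt".

```python
from collections import Counter

# A tiny, made-up list of words (an assumption for illustration only)
words = ["paper", "wall", "paper", "yellow", "wall", "paper"]

# Counter tallies how many times each item appears
tally = Counter(words)

print(tally["paper"])        # the count for one particular word
print(tally.most_common(2))  # the two most frequent words, with their counts
```

The `most_common()` method used here is the same one the example program calls on its full tally of words.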

Remember the “Zen of Python” that we read when we were reading manifestos? The “Easter Egg” it referred to in that manifesto is that there is a module in Python (called `this`) that prints the “Zen of Python” when you import it.

In [None]:
import this

<a id='section-6'></a>
#### Functions

Functions are bundles of code that perform a particular task. Often a function involves giving some kind of input to the function (called “arguments” that you pass to the function in brackets), the function then performs its task on the input and returns an output (the result of its task on this particular input).

Python has a number of prewritten (or built-in) functions. For example, the function `print()` will print to the screen the argument you pass to it.  

Run the cell below. And then try adding your name to “Hello!” and run the cell again. 

In [None]:
print("Hello!")

Functions perform different tasks. Notice how the same input with a different function returns a different result (can you guess what this `len()` function does from the returned output?):

In [None]:
len("Hello!")

You can also write your own functions. This is called defining a function. 

It is useful to write your own functions if you want to bundle together a series of steps to achieve a particular goal. This makes the code neater and easier to manage. It also means you can re-use that function anywhere in the code: you just have to “call” it by its name. 

In this case we have defined a function called `split_into_words` that will split our text into individual words and that takes `any_chunk_of_text` as its argument.  
- To define a function we write a function definition statement: starting with the keyword `def` (short for define) then the name we want to give the function and in brackets any parameters we might need to pass to the function (or nothing in brackets if no parameters are necessary). All this is followed by a colon `:`.  
- Then there is an indented block of code that specifies what the function does. Indentation is important in Python as it defines where the block of code starts and ends. Anything indented will be part of the function definition. If it is not indented it will be considered not part of the function definition.  
- The `return` statement at the end of the function sends the value that results from the function execution out to where the function was called from.  

We will go into the details of what the function `split_into_words` is doing further below in the Methods section.

In [None]:
def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words
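
To see the definition in action, the function has to be called. The sketch below repeats the definition so that it runs on its own, and calls it on a short made-up string (an assumption for illustration, not the text of "gilman.txt").

```python
import re

def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

# Calling the function: the string in brackets is the argument,
# and the returned list is stored in the variable sample_words
sample_words = split_into_words("The Yellow Wallpaper")
print(sample_words)
```

Whatever string is passed in takes the place of `any_chunk_of_text` inside the function for that call.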

<a id='section-7'></a>
#### Variables

Next in our program we see we have defined some “variables”. 

In [None]:
filepath_of_text = "gilman.txt"

number_of_desired_words = 40

Variables are like containers that store information. You assign to a variable a value that you want to use later on. Then you only need to use the name of the variable later on. The value of a variable can be overwritten if you assign a new value to the same name. 

You can give a variable almost any name you want, but there are a few limitations. The name can’t start with a number (it needs to start with a letter or an underscore), there can be no spaces in the name and no punctuation apart from the underscore (`_`), and it cannot be a Python reserved word (i.e. a word that already has a specific meaning in the Python language). 

For example here we assign the value `"Hello!"` to a variable called `greeting`. And then we can pass that variable to a function such as `print()` instead of writing out “Hello!” again.

In [None]:
greeting = "Hello!"
print(greeting)
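
The two points made above about variables (overwriting a value, and reserved words) can be sketched as follows. The `keyword` module is part of Python's standard library. It is not used in the example program and is only brought in here to check names.

```python
import keyword

greeting = "Hello!"
greeting = "Good morning!"   # assigning again overwrites the earlier value
print(greeting)

# keyword.iskeyword() reports whether a word is reserved by Python
print(keyword.iskeyword("for"))       # "for" is reserved, so it cannot be a variable name
print(keyword.iskeyword("greeting"))  # "greeting" is not reserved
```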

<a id='section-8'></a>
#### Data Types

Notice how the variables we have assigned in our program look a bit different. One is a number that appears in green. The other are some words that appear in red in between quotation marks. 

This is because numbers and words are different data types in Python: depending on what type of data something is there are different rules about how to write them into code, and different possibilities about what you can do with them and how you can use them.

In [None]:
filepath_of_text = "gilman.txt"

number_of_desired_words = 40

The function `type()` returns what type of data something is:

In [None]:
type("Hello!")

Let’s start by looking at four basic data types: 

*String*: strings store characters such as letters or numbers. Strings are essentially text. To specify something is a string you need to write it in “quotation marks” (either ‘single’ or “double” quotation marks work).

Try finding out what data type these examples are below using the `type()` function:

In [None]:
example = 'forty'
another_example = "40"
a_final_example = 40
type(example)

As you can see, the final example is not a string (or text): it was not written in quotation marks. It is an *integer* (i.e. a whole number). Decimal numbers in Python are called *floats* (or floating-point numbers). 

In [None]:
integer_example = type(40)
float_example = type(4.5)
print(integer_example)
print(float_example)
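
To make concrete how the data type changes what you can do with a value, here is a small sketch using made-up variable names. The same `+` sign behaves differently for strings and for integers, and `int()` converts a string of digits into an integer.

```python
text_number = "40"
whole_number = 40

print(text_number + text_number)     # strings are joined end to end
print(whole_number + whole_number)   # integers are added together

converted = int(text_number)         # int() turns the string "40" into the integer 40
print(converted + whole_number)

division_result = 40 / 8             # dividing with / gives a float, even for whole results
print(type(division_result))
```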

How would you find out what type of data our variables are below using `type()`?

In [None]:
filepath_of_text = "gilman.txt"

number_of_desired_words = 40

We’ve also assigned another kind of data type to the variable “stopwords”.

In [None]:
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']

type(stopwords)

The `type()` function tells us this is a _list_. A list is one of Python’s collection data types. Collection types allow us to store many values together in a single variable. They differ in how items are stored and how we can retrieve or access them. 

A list stores a collection of items (written between square brackets and separated by commas) in a single variable. Here we are listing a collection of strings and integers:

In [None]:
list_items = [4, "apples", 1, "pear", 5, "oranges", 6, "melons", 23, "kiwis"]

Another kind of collection type in Python is a _dictionary_. In a dictionary you can list a collection of items (between curly brackets and separated by commas) with particular values assigned to each of those items. If we were to rewrite our previous list as a dictionary we could write:

In [None]:
dictionary_items = {"apples": 4, "pear": 1, "oranges": 5, "melons": 6, "kiwis": 23}

If we wanted to retrieve an item from a list, we can do this using its index (or location in the sequence of items in the list).

In [None]:
list_items[1]

Note how when we retrieve the item at index “1” in the list it gives us the second item. That is because Python counts from 0, so the first item in the list is at index `[0]`. Try retrieving different items from the list using different index numbers. 

You can also use a negative index to count from the end of the list, for example to get the last item without knowing how many items are in the list.

In [None]:
list_items[-1]

You can also take a slice of items in your list by using index numbers and colons: 

- You can slice from the start of the list (i.e. index [0]) up to a given index (excluding index given):

In [None]:
list_items[:2]

- You can slice from a given index until the end of the list (a negative index counts back from the end): 

In [None]:
list_items[2:]

In [None]:
list_items[-2:]

- You can also slice by stepping over a given number of items. For example, starting at the second item in the list (index 1) and going until the end of the list, stepping by 2 (missing every other item).

In [None]:
list_items[1::2]

Here is a notation to formalize these different slicing options:

In [None]:
# General pattern (a template, not runnable as written): your_list[start:stop:step]
list_items[0:6:2]
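
The `start:stop:step` pattern can be tried out on a short list. The list `fruit` below is made up for this sketch so that it runs on its own.

```python
fruit = ["apples", "pear", "oranges", "melons", "kiwis"]

first_two = fruit[0:2]        # start at index 0, stop before index 2
middle = fruit[1:4]           # start at index 1, stop before index 4
every_other = fruit[0:5:2]    # step of 2: indexes 0, 2 and 4
reversed_copy = fruit[::-1]   # a step of -1 walks backwards through the list

print(first_two)
print(middle)
print(every_other)
print(reversed_copy)
```

Leaving out `start`, `stop` or `step` falls back on the defaults: the beginning of the list, the end of the list, and a step of 1.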

In a dictionary, items are not retrieved according to their position in a sequence, as in a list. They are stored as _key-value_ pairs: each key has a value associated with it. We can therefore retrieve values by using their associated key, rather than using the index number in the sequence (as we did in a list).

In [None]:
dictionary_items["apples"]
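
Beyond looking up a value by its key, a dictionary can also be changed and looped over. The following is a minimal sketch with a made-up dictionary called `fruit_counts`. The `.get()` and `.items()` methods are standard dictionary methods that the example program itself does not use.

```python
fruit_counts = {"apples": 4, "pear": 1, "oranges": 5}

fruit_counts["kiwis"] = 23     # a new key adds a new key-value pair
fruit_counts["apples"] = 10    # an existing key has its value replaced

# .get() returns a fallback value instead of raising a KeyError for a missing key
melon_count = fruit_counts.get("melons", 0)
print(melon_count)

# .items() gives each key together with its value
for fruit, count in fruit_counts.items():
    print(fruit, count)
```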

<a id='section-9'></a>
#### Methods 

Different data types therefore allow us to structure different kinds of information in different ways, and allow us to do different things with the information. There are things that you can only do with strings or with lists. The way you interact and do things with data types is through _methods_.

<a id='section-10'></a>
#### List methods

The syntax for writing methods is an input (i.e. what you’re applying the method to) followed by a dot and the name of the method and brackets (in which we can pass the method any arguments, any particular specifications, just like we could with a function). 

For example, there are list methods for adding items to a list, removing items from a list or sorting items in a list.

- Add an item to a list with the `.append()` method

In [None]:
list_items = ["apples", "pear", "oranges", "melons", "kiwis"]
list_items.append("blueberries")
list_items

- Remove the first instance of an item with the `.remove()` method

In [None]:
list_items.remove("blueberries")
list_items

- Sort the order in which the items in a list appear with the `sort()` method.

In [None]:
list_items.sort()
list_items

The default is for items to be sorted in ascending order (from small to large number or in alphabetical order).  
`list.sort(reverse=False)`  
You can change this by setting the `reverse` parameter to `True` to sort in descending or reverse order.

In [None]:
list_items.sort(reverse=True)
list_items
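
One detail worth knowing is that `.sort()` changes the list itself and does not hand back a new list. The sketch below contrasts it with the built-in function `sorted()`, which is not used elsewhere in this notebook, on a made-up list.

```python
fruit = ["pear", "apples", "kiwis"]

alphabetical = sorted(fruit)   # sorted() builds a new list and leaves fruit untouched
print(alphabetical)
print(fruit)

result = fruit.sort()          # .sort() reorders fruit itself and returns None
print(result)
print(fruit)
```

This is why the cells above call `list_items.sort()` on one line and display `list_items` on the next, rather than assigning the result of `.sort()` to a variable.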

- If you want to add multiple items you could extend the list by adding another list to the list. 

In [None]:
new_items = ["blueberries", "grapes"]
list_items.extend(new_items)
list_items
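
To see why `.extend()` rather than `.append()` is used for adding several items, compare the two on made-up lists:

```python
appended = ["apples", "pear"]
appended.append(["grapes", "kiwis"])   # the whole list is added as ONE item

extended = ["apples", "pear"]
extended.extend(["grapes", "kiwis"])   # each item is added separately

print(appended)
print(extended)
print(len(appended), len(extended))
```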

<a id='section-11'></a>
#### String Methods
String methods provide different ways of interacting with and doing things with strings (text). 

For example, you could turn all characters in a string sequence to upper case or lower case using the `upper()` and `lower()` string methods. (Why this is important will become clearer later when we talk about pre-processing).

In [None]:
example_string = "Here is an example."

example_string = example_string.upper()
print(example_string)

example_string = example_string.lower()
print(example_string)

Note how the variable `example_string` gets rewritten with new content. 

Another useful string method to know about is `replace()`. This method replaces a defined string with a new string. For example, sometimes there are encoding issues when scraping text and we might have our text look like this. 

In [None]:
replace_example = "Here\xa0is\xa0an\xa0example."

“\xa0” is the Unicode code-point for a non-breaking space (a space that prevents a line break at that position). We would probably want to replace this Unicode code-point with an ordinary space and re-encode our text. 

What's a Unicode code-point?  
Encodings are systems that define relations between two systems of representation, for example between numbers and letters or characters. Unicode is a standard that aspires to define a unique ID number (or “code-point”) for every character in all kinds of different languages around the world (and even emoji). This unique code-point can then be used to encode words into bits in order to represent text digitally (through Unicode encodings such as utf-8).

In [None]:
# .encode() returns bytes rather than text, so the result displays with a b'...' prefix
replace_example = replace_example.replace("\xa0", " ").encode("utf-8")
replace_example
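
The relation between characters, code-points and encoded bytes can be explored with the built-in functions `ord()` and `chr()` and the methods `.encode()` and `.decode()`. This is only a small sketch. The characters chosen are arbitrary examples.

```python
print(ord("A"))      # the code-point of the letter A, as a number
print(chr(65))       # and back from the code-point to the character
print(ord(" "))      # an ordinary space
print(ord("\xa0"))   # a non-breaking space has a different code-point

as_bytes = "\xa0".encode("utf-8")    # text -> bytes
print(as_bytes)
as_text = as_bytes.decode("utf-8")   # bytes -> text again
print(as_text == "\xa0")
```

Because the two kinds of space have different code-points, Python treats them as different characters even though they look the same on screen.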

In our example program for counting words, we used the string method `lower()` to lower case our text and then the function `re.split()` to split the text.

In [None]:
def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

The `split()` method is a string method that splits any string it is given according to a specified pattern or “delimiter”. It returns a list of substrings that have been split according to the specified delimiter. The default delimiter (i.e. if we were to leave the brackets blank) is any run of whitespace (spaces, tabs or line breaks).

In [None]:
split_example = "I’m wondering how different split methods going to deal with-and split-text in different ways."
split_example = split_example.split()
split_example 

In our example program, we use a _regular expression_ to specify a particular pattern or delimiter according to which our text must be split. Regular expressions (or regex) are sequences of characters that specify patterns in text. They’re like a code for referring to particular patterns in language in order to make searching for those patterns more concise and precise. In the case of our program for example, we can use a regular expression to specify that we want to split our text every time we encounter one or more characters that are NOT word characters (word characters being letters, digits and the underscore). This is specified with the regular expression `\W+`.  

Compare the output of splitting with the regular expression to our previous method of using the default parameter of `split()` and splitting the text at every space. Which would you prefer to use in your program?

In [None]:
split_example = "I’m wondering how different split methods going to deal with-and split-text in different ways."
split_example = re.split(r"\W+", split_example)
split_example
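
One side effect of splitting on `\W+` is that punctuation at the very start or end of the text leaves an empty string in the list. The sketch below, on a made-up sentence, shows this and two possible ways around it. `re.findall()` is a standard function of the `re` module, but it is not what the example program uses.

```python
import re

sentence = "Well-read, isn't it?"

split_result = re.split(r"\W+", sentence)
print(split_result)   # note the empty string at the end, left by the final "?"

# Option 1: drop any empty strings after splitting
without_empties = [word for word in split_result if word != ""]
print(without_empties)

# Option 2: collect the runs of word characters instead of splitting on the gaps
found = re.findall(r"\w+", sentence)
print(found)
```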

<a id='section-12'></a>
#### For Loops

Computers aren’t good at manipulating unstructured data such as text. In order to be able to manipulate text programmatically, we often need to restructure it in a way that makes it manipulable, such as in a list, as above, when we split our text into a list of words. 

Now that the text is structured as a list, we can programmatically manipulate it in different ways. Often, we will want to do something with every item in the list. To do this, we will need to iterate through every item in our list by using a `for` loop.

In [None]:
sample_text = "It was the best of times, it was the worst of times."
tokenized = sample_text.split()

for word in tokenized:
    print(word)

To construct a `for` loop: you use the keyword `for` then a variable name that stands in for each item (here we used the word `word` but we could’ve used `item` or even `potato`), then you use the keyword `in` and the name of the list you are looping through followed by a colon `:`. Just like with a function, the indented body of code after the colon specifies what task is to be performed on every item in the list. 
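
A `for` loop does not have to print anything: the indented body can also build up a result step by step. As a small sketch, the loop below adds up the length of every item in the list. The variable name `total_characters` is made up for this example.

```python
sample_text = "It was the best of times, it was the worst of times."
tokenized = sample_text.split()

total_characters = 0
for word in tokenized:
    # runs once per item; word stands in for the current item
    total_characters = total_characters + len(word)

print(len(tokenized))      # number of items in the list
print(total_characters)    # characters across all items (punctuation included, spaces excluded)
```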

<a id='section-13'></a>
#### Conditionals and Boolean values

In our program example, we use a `for` loop to remove stopwords from our text.

What are stopwords?  
We will talk about this more later in the course, but when we’re analyzing text computationally we are often relying on counting words and on word frequencies. However, some very frequent words might not be very significant or relevant for our analyses. In that case, we remove them from our analyses. Words removed in this way are called stopwords.

In order to remove the stopwords from our text (which we have restructured as a list through our `split_into_words(any_chunk_of_text)` function using `re.split()`), we use a `for` loop to iterate over every item in our list and consider whether it is included in the stopwords list or not. If it is not included in the stopword list, we append it to a new list which will contain all words from our text that are not in the stopword list. 

In [None]:
meaningful_words = []
for word in all_the_words:
    if word not in stopwords:
        meaningful_words.append(word)

We assess whether a word is in the stopword list or not by using a _conditional statement_. 

Conditional statements (or `if` statements) follow a similar syntax to that of the `for` loop (i.e. with keyword, colon and indented block of code that performs a particular task).  

A conditional statement defines a condition and specifies an action to be taken depending on the state of the condition. In the case of our program example above, if a word is not in the stopword list, then it is appended to the new list `meaningful_words`. Conditional statements therefore allow us to change the program's behavior based on whether some condition is met or not. 

To define different kinds of conditions we can use comparison operators such as: 
- equal to: `x == y`
- not equal to: `x != y`
- greater than: `x > y`
- less than: `x < y`
- greater than or equal to: `x >= y`
- less than or equal to: `x <= y`
- object identity: `x is y`
- negated object identity: `x is not y`
- membership: `x in y`
- negated membership: `x not in y` (the test used in our program)

If the condition is met (or `True`) then one course of action is followed, and if the condition is not met (or `False`) then another course of action is followed. These evaluations of `True` and `False` are called _Boolean values_ and are another kind of data type in Python.

For example, we could write a `for` loop that iterates through a list of words and evaluates whether each word is longer than two characters. If this is `True`, then the word is printed, if it is not true (`False`) then it is not.

In [None]:
sample_text = "It was the best of times, it was the worst of times."
tokenized = sample_text.split()

for word in tokenized:
    if len(word) > 2:
        print(word)
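
When there is more than one possible course of action, an `if` statement can be followed by `elif` (short for "else if") and `else` branches. These keywords are not used in the example program. The sketch below is only meant to show how the branches are chosen.

```python
sample_text = "It was the best of times, it was the worst of times."
tokenized = sample_text.split()

short_words = []
long_words = []
for word in tokenized:
    if len(word) <= 2:
        short_words.append(word)    # runs when the first condition is True
    elif len(word) >= 5:
        long_words.append(word)     # checked only if the first condition was False
    else:
        pass                        # words of 3 or 4 characters are skipped

print(short_words)
print(long_words)

is_longer = len("best") > 2         # a comparison evaluates to a Boolean value
print(is_longer, type(is_longer))
```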

<a id='section-14'></a>
#### Flow of execution and Comments

Python code runs from left to right and from top to bottom, one line at a time. For code to work, things need to happen in a certain order. Statements such as conditional statements can control this flow of execution, i.e. the order in which code runs. They can define whether a program follows one path of action or another.

This idea of a flow of execution might be a little counter-intuitive in Notebooks because Notebooks allow you to run cells in any order, but if you start by running a cell from the bottom of a notebook that relies on a previous cell then you will probably get an error. The output of any cell that you run in a notebook will be carried over to the next, but you need to have run the cell and in the right order. If you get errors, check that you have run the previous cells by checking the numbers in \[brackets\] next to the code cells.
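
The typical symptom of running cells out of order is a `NameError`: Python is asked for a name that has not been defined yet. The sketch below provokes one on purpose with a made-up name, and uses `try` and `except` (not covered in this notebook) only so that the error message can be printed instead of stopping the cell.

```python
try:
    print(not_yet_defined)    # this name has not been assigned anywhere
except NameError as error:
    message = str(error)
    print("NameError:", message)
```

If you see this kind of error in a notebook, running the earlier cell that defines the missing name usually resolves it.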

You might have noticed that there are some lines in our program that start with a hash mark (`#`).

In [None]:
#Defining variables we're going to need
#Giving the path to the text we're going to analyze
filepath_of_text = "gilman.txt"

These are _comments_. Comments allow us to insert notes directly in the code without disrupting the running of the code (everything from the hash mark to the end of the line is ignored by Python). Comments can be used to explain what different aspects of the code are doing or as a way to switch different parts of the code on and off.

In [None]:
marx = "All that is solid melts into air, all that is holy is profaned, and man is at last compelled to face with sober senses his real conditions of life and his relations with his kind."

#tokenized_marx = re.split(r"\W+", marx)
tokenized_marx = marx.split()
print(tokenized_marx)

marx_without_punctuation = []
for word in tokenized_marx:
    if word.isalpha(): #the .isalpha() method checks whether a string is composed of alphabetic characters only
        marx_without_punctuation.append(word)
print(marx_without_punctuation)


You can see the flow of execution at work: code and comment can both be on a single line but the code has to come first to be executed. You could also toggle between the different splitting methods by commenting and uncommenting the different lines of code.

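Longer comments can be written between pairs of three quotation marks (strictly speaking this creates a multi-line string that Python evaluates and then discards, but it is commonly used as a block comment): 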

In [None]:
marx = "All that is solid melts into air, all that is holy is profaned, and man is at last compelled to face with sober senses his real conditions of life and his relations with his kind."

#tokenized_marx = re.split(r"\W+", marx)
tokenized_marx = marx.split()
print(tokenized_marx)

marx_without_punctuation = []
for word in tokenized_marx:
    if word.isalpha(): #the .isalpha() method checks whether a string is composed of alphabetic characters only
        marx_without_punctuation.append(word)

""" Another way of writing this for loop is by writing a list comprehension. This is just a more compact and precise way of writing
marx_without_punctuation = [word for word in tokenized_marx if word.isalpha()] """

print(marx_without_punctuation)

<a id='section-15'></a>
#### Working with Text Files in Python

<a id='section-16'></a>
**Opening and reading files**  

In order to analyze text with Python you need to open the text in your program. This is done by using the built-in function `open()` together with one of a number of different _modes_, depending on what you want to do with the file (read it to analyze it, write to it, append to it, etc.).

- Reading a file with `.read()`

When you open a file using the `open()` function, this creates a _file object_. Python is an _object-oriented_ programming language: it is built around entities called _objects_ that contain both data and methods for manipulating the data in the object. Once you create an object you can then manipulate it and have it interact with other objects. However, `open()` simply creates an object from your file; you need to specify which mode you are opening it in so that you can manipulate it accordingly. In our case, we open the file in read mode and then use the `.read()` method to read the file object as text.

Here are two ways of opening and reading a file. Note how we assign the text that we have opened and read to a variable, so that we can conveniently refer to it by the variable name later.

In [None]:
raw_text = open("gilman.txt", mode="r", encoding="utf-8").read()
raw_text

In [None]:
with open("gilman.txt", encoding="utf-8") as f:
    raw_text = f.read()
raw_text

Opening a file with `with` means you don’t have to worry about closing the file (it will automatically close when the block is exited). 
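
As a minimal sketch of what the file object looks like and of this automatic closing, the cell below works on a throwaway file in a temporary folder (the name `with_demo.txt` and its contents are made up for illustration, so nothing in your home directory is touched):

```python
import tempfile
from pathlib import Path

# Create a small throwaway file in a temporary folder (hypothetical name and contents)
demo_path = Path(tempfile.mkdtemp()) / "with_demo.txt"
demo_path.write_text("Hello file", encoding="utf-8")

with open(demo_path, mode="r", encoding="utf-8") as f:
    print(type(f).__name__)   # the kind of object that open() created
    print(f.mode)             # the mode the file was opened in
    demo_text = f.read()      # .read() returns the file's contents as a string
    closed_inside = f.closed  # the file is still open inside the with block

closed_after = f.closed       # the file was closed when the block was exited
print(demo_text, closed_inside, closed_after)
```

The `.closed` attribute is `False` inside the block and `True` after it, without `f.close()` ever being called.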

- Writing to a file with `.write()`

We could also open our file objects in write mode in order to write to a file. Note that writing to a file will overwrite the file if it already exists. If the file does not already exist, it creates a new file for writing. 

Here are two ways of opening a file in write mode and writing “Hello file” in it.

In [None]:
f = open("example_text.txt", mode="w")
f.write("Hello file")
f.close()

Can you see the newly created file in the list on the left? Double click to open it and see its contents.

In [None]:
with open("example_text.txt", mode="w") as f: 
    write_file = f.write("Hello file")

For example, we could add a couple lines of code at the end of our program to write out the most frequent words to a text file.

In [None]:
with open("most-frequent-words-gilman.txt", mode="w") as file_object:
    file_object.write(str(most_frequent_meaningful_words))

- Appending to a file with append mode (`mode="a"`)

If you want to write to a file, but you don’t want to overwrite its existing contents, you can open it in append mode (`mode="a"`) and then use the same `.write()` method as before. There is no separate `.append()` method for files. Text written in append mode is added at the end of the existing file. If you append to a file that does not already exist, Python simply creates a new file for writing.

Let’s add some text to our previous "example_text.txt".

In [None]:
f = open("example_text.txt", mode="a")
f.write("\n" + "how are you?")
f.close()

In [None]:
with open("example_text.txt", mode="a") as f:
    append_file = f.write("\n" + "how are you?")
append_file

Double click the file "example_text.txt" to open it and see its contents. What happens if you run the append code multiple times?

N.B. you can also simply write `"a"` or `"w"` or `"r"` instead of `mode="a"`, `mode="w"`, `mode="r"`.
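
To see the three modes side by side, here is a small sketch that uses the shorthand form. It assumes a throwaway file in a temporary folder (`mode_demo.txt` is a made-up name) so that your "example_text.txt" is left alone:

```python
import tempfile
from pathlib import Path

# Hypothetical throwaway file in a temporary folder
mode_demo_path = Path(tempfile.mkdtemp()) / "mode_demo.txt"

with open(mode_demo_path, "w", encoding="utf-8") as f:  # "w" creates the file
    f.write("Hello file")

with open(mode_demo_path, "a", encoding="utf-8") as f:  # "a" adds to the end
    f.write("\n" + "how are you?")

with open(mode_demo_path, "r", encoding="utf-8") as f:  # "r" reads the file
    after_append = f.read()

with open(mode_demo_path, "w", encoding="utf-8") as f:  # "w" again wipes the old contents
    f.write("Hello again")

with open(mode_demo_path, "r", encoding="utf-8") as f:
    after_overwrite = f.read()

print(after_append)
print(after_overwrite)
```

After the second `"w"` the earlier two lines are gone, which is the difference between writing and appending.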

<a id='section-17'></a>
**File paths**

In order to work with files, you need to be able to tell your program where the files are located. The location of a file is called its _path_. Inside the parentheses of `open()`, you insert the filepath of the file to be opened in quotation marks, and you can also specify which mode and encoding you want to open the file in. 

In [None]:
with open("gilman.txt", mode="r", encoding="utf-8") as f:
    raw_text = f.read()

If the file is in the same folder as the program (as is the case in our example program) then the path to the file is simply the file name and extension (`gilman.txt`). 

However, if your file is located somewhere else you will need to pass the path of the file to the `open()` function. 

For example, create a new directory (i.e. folder) in your home directory (where this notebook and the gilman text are located). Call it `texts`, for example, and copy the gilman text into it. If you double click the "texts" directory, you can see under the search box that there is the icon of your home directory, then a slash, and then "texts" (the name of the new directory where you are currently located). This is the path to the contents of the "texts" directory. If you click back to the home directory, you move back up one layer in the file structure to your home directory. Now let’s try specifying the path to this copy of the gilman text in the "texts" directory. 

In [None]:
filepath_of_text = "texts/gilman.txt"
full_text = open(filepath_of_text, encoding="utf-8").read()
full_text

Computers follow hierarchical systems of organization. Folders (directories) contain files and other folders, which in turn contain further files and folders, and so on. The filepath specifies the pathway through these folders to a particular location. 

Paths can be _relative_ or _absolute_. Absolute filepaths provide the entire path to a file starting from the top of the file system (the root directory). Relative filepaths provide the location of the file based on where you are already located, i.e. based on the current directory from which you are specifying the path. 

If you see notations such as `../` these are symbols to specify that you move up one folder from your current folder location (e.g. when we moved from "texts" back to the home directory).
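
The difference can be sketched with `Path` from Python's built-in `pathlib` library. This is only an illustration: it builds and compares paths without opening anything, so it runs whether or not a "texts" folder exists, and it assumes "texts" is an ordinary folder rather than a shortcut to somewhere else.

```python
from pathlib import Path

relative_path = Path("texts/gilman.txt")     # relative: starts from the current folder
absolute_path = Path.cwd() / relative_path   # absolute: current folder + relative path

print(Path.cwd())                    # the folder Python is currently working in
print(absolute_path)                 # the full path, starting from the root
print(relative_path.is_absolute())   # False
print(absolute_path.is_absolute())   # True

# "../" moves up one folder: into "texts", back up again, then to gilman.txt
up_and_back = (Path.cwd() / "texts" / ".." / "gilman.txt").resolve()
same_location = up_and_back == (Path.cwd() / "gilman.txt").resolve()
print(same_location)
```

The exact paths printed depend on where your notebook is stored, so they will differ from one machine to another.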

Tip: You can find the path to a file by right clicking on it and selecting “copy path”. The copied path may not always be the right one for your code (it might give you the absolute path when you only need the relative path).

<a id='section-18'></a>
**Working with multiple files**

So far we have worked with only one text file, but we’ll want to work with many different text files in the future. You can work with multiple files with the help of the `pathlib` library (which provides the `Path` class) or the `os` library.

- Working with multiple files within a directory using `Path` from the `pathlib` library:

In [None]:
from pathlib import Path
directory_path = "gilman_corpus"

#Loop through any file with glob and the asterisk wildcard
for filepath in Path(directory_path).glob("*"):
    print(filepath)

In [None]:
#Loop through and print the path of any file ending in .txt
for filepath in Path(directory_path).glob("*.txt"): 
    print(filepath)

In [None]:
#Loop through all files ending in .txt and open and read them
for filepath in Path(directory_path).glob("*.txt"): 
    corpus = open(filepath, encoding="utf-8").read()
    print(corpus[:20]) #This prints a slice of the first 20 characters of each file in the gilman_corpus directory

- Working with multiple files using the `os` library:

In [None]:
import os
# Loop through and print the name of files in a given directory
gilman_corpus = os.listdir("gilman_corpus")
for file in gilman_corpus:
    print(file)

In [None]:
import os
# Loop through the files ending in .txt (i.e. only text files) in a given directory, then open and read each one
gilman_corpus = os.listdir("gilman_corpus")
for file in gilman_corpus:
    if file.endswith(".txt"):
        corpus = open("gilman_corpus/"+file, encoding="utf-8").read()
        print(corpus[:20])#This prints a slice of the first 20 characters of each file in the gilman_corpus directory
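
In the loops above, `corpus` is replaced on every pass, so only the last file's text is left at the end. One possible way to keep all of the texts, sketched here under the assumption of a tiny stand-in corpus created in a temporary folder (the file names and contents are made up), is to store each text in a dictionary under its file name:

```python
import tempfile
from pathlib import Path

# Build a tiny stand-in corpus in a temporary folder (hypothetical names and contents)
demo_corpus_path = Path(tempfile.mkdtemp())
(demo_corpus_path / "story_one.txt").write_text("First sample text.", encoding="utf-8")
(demo_corpus_path / "story_two.txt").write_text("Second sample text.", encoding="utf-8")
(demo_corpus_path / "notes.md").write_text("Not a .txt file.", encoding="utf-8")

# Read every .txt file and store its text under its file name
texts_by_filename = {}
for filepath in sorted(demo_corpus_path.glob("*.txt")):
    with open(filepath, encoding="utf-8") as f:
        texts_by_filename[filepath.name] = f.read()

# Print the first 20 characters of each stored text
for filename, text in texts_by_filename.items():
    print(filename, "->", text[:20])
```

To try this on the real corpus, you could replace `demo_corpus_path` with `Path("gilman_corpus")` and drop the lines that create the stand-in files.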

<a id='section-19'></a>
#### Wrap up

The aim of this course is that you become comfortable reading code and adapting it to your needs. It loosely follows the ideas of “literate programming” or “literate computing” put forward by Donald Knuth in the 1970s that inspired the creation of Jupyter Notebooks. The idea of literate programming is that code can be written and presented in ways that highlight it as a human-written cultural artifact: instead of prioritizing code as it is read by computers, literate programming looks for ways of writing and reading code that make it more readable and understandable to human beings and that highlight the writer’s own thought processes. These ideas motivated the creation of Jupyter Notebooks to provide environments that can combine human prose with executable code. (cf. Knuth, Donald. _Literate Programming_. (1992) and the Programming Historian's [Introduction to Jupyter Notebooks](https://programminghistorian.org/en/lessons/jupyter-notebooks).)

Read through our word counting program. Are you able to read through it and more or less follow what is going on now? How does it compare to when you first saw the program, are you able to understand more of it now?  

Run the program and inspect the results. Do you find these most frequent words insightful or meaningful? Are some words strange or unexpected? What might account for some of these most frequent words? Open and skim through the source text (the `gilman.txt` file). Do some frequent words make more sense now?

Try modifying and playing around with the program. Are there any words you might want to filter out further from this list of most frequent words? How would you go about that?

In [None]:
#Importing the libraries and modules we'll need
import re
from collections import Counter

#Defining a function to split our text into words
def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

#Defining variables we're going to need
#Giving the path to the text we're going to analyze
filepath_of_text = "gilman.txt"

number_of_desired_words = 40

#Defining a list of stopwords
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']

#Opening and Reading the text to analyze it
full_text = open(filepath_of_text, encoding="utf-8").read()

#Splitting the text into words by using our previously defined function
all_the_words = split_into_words(full_text)

#Making a list of words with our stopwords removed
meaningful_words = []
for word in all_the_words:
    if word not in stopwords:
        meaningful_words.append(word)

""" Another way of writing this for loop
meaningful_words = [word for word in all_the_words if word not in stopwords] """

#Counting how many times each word in our "meaningful_words" list appears
meaningful_words_tally = Counter(meaningful_words)

#Pulling out the top 40 most frequently occurring words from our complete tally using the "most_common" method
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)

#Display these top 40 most frequent words
most_frequent_meaningful_words

#Writing the most frequent words to a file called "most-frequent-words-gilman.txt"
#with open("most-frequent-words-gilman.txt", mode="w") as file_object:
    #file_object.write(str(most_frequent_meaningful_words))

<a id='section-20'></a>
### Debugging

Just as you will become more and more familiar with reading code, you can also become more and more familiar with reading the errors that Python flags up when code doesn’t work. Python errors contain information to help us identify and fix the problem; they are a helpful resource for working out what is going wrong. 

It often takes a series of attempts and some trial and error to get a program working. Errors can help us debug our programs along the way. 

There are different types of errors that give us an indication as to the type of problem we are dealing with:

- _SyntaxError_: something is wrong with the Python syntax, the rules of writing Python code (i.e. the arrangement of words and punctuation in your code). It could be, for example, that you forgot to close a quotation mark, or that you forgot the colon at the end of the first line of a `for` loop. If you are copying and pasting code, check that the quotation marks are the straight kind that Python recognizes (`"` or `'`) rather than curly ones. The error message will point out where the error occurred using a caret (^). 

- _NameError_: this error points to a problem with the name of a function or variable. For example, if a variable name cannot be found (maybe we forgot to run the cell that defines the variable, or maybe we misspelled the variable).

- _TypeError_: this flags up issues of data type, when we are trying to perform an operation on the wrong kind of data. Python might tell you which data type is required in the error message.

- _AttributeError_: this error means that we are trying to access something from an object that the object doesn’t possess, or do something with an object that the object cannot do. For example, trying to use a string method on a file object. 

- _FileNotFoundError_: this error means that the file you have named cannot be located. This may be a problem of spelling the name wrong, or of giving the wrong path to the file. 
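
To get a feel for these error types, the sketch below deliberately provokes each one and records its name instead of letting the program stop. The helper function `name_of_error`, the variable `undefined_variable_name` and the file name `this_file_does_not_exist.txt` are all made up for illustration (the example assumes no file of that name exists in your current folder).

```python
def name_of_error(action):
    """Run a small piece of code and return the name of the error it raises."""
    try:
        action()
    except Exception as error:
        return type(error).__name__
    return "no error"

errors_seen = {
    # A for loop written without its colon (compile() checks the syntax of the text)
    "missing colon": name_of_error(
        lambda: compile("for word in ['a', 'b']\n    print(word)", "<example>", "exec")
    ),
    # A variable that was never defined
    "undefined variable": name_of_error(lambda: undefined_variable_name),
    # Adding a string and a number together
    "text plus number": name_of_error(lambda: "3" + 4),
    # Using a string method (.lower) on a list
    "string method on a list": name_of_error(lambda: ["a", "b"].lower()),
    # Opening a file that is not there
    "missing file": name_of_error(lambda: open("this_file_does_not_exist.txt")),
}

for situation, error_name in errors_seen.items():
    print(situation, "->", error_name)
```

In your own notebooks you will usually just read the error message that Python prints; catching errors with `try` and `except` is used here only so that all five can be shown in a single cell.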


When errors flag up, read through the code carefully, experiment with different things, talk it through with someone, and try googling the error message and looking it up on forums (e.g. Stack Overflow). There is a vibrant community around Python and you are probably not the first person to encounter the error you are encountering.

_Acknowledgements_: This notebook is heavily inspired by Melanie Walsh’s [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/00-Python.html) and also drew on the Constellate team's [tutorials](https://constellate.org/tutorials/), the Programming Historian's [tutorials](https://programminghistorian.org/en/lessons/working-with-text-files), and Ryan Heuser's [Literary text mining course](https://github.com/quadrismegistus/literarytextmining).