<a href="https://colab.research.google.com/github/hlab-repo/learning/blob/main/Getting_Started_with_Coding_for_Humanities_Scholars_by_Micah_Saxton_and_Michael_Hemenway.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with Coding for Humanities Scholars

## Introduction

### Description

This workshop is an introduction to coding in Python for humanities research. We will start by showcasing a few kinds of Python projects and brainstorming about how principles or insights from these projects could be implemented in your own research. Following that, you will learn a few fundamentals to make you comfortable getting started with Python. Finally, we will provide a number of resources you can use to learn Python on your own.

### Learning Goals
1. To become aware of how coding can be useful to humanities scholars.
2. To learn a few fundamental Python data types and commands.

## Getting Started

### Using Google's Colab Notebooks
The webpage you are looking at is an example of a Colab notebook. Notebooks are a convenient way to write and execute Python code. Google's Colab notebooks provide the extra benefit of installing packages and running code in the cloud rather than on your own computer.

### Before You Begin:
- Navigate to "File" and select "Save a copy in Drive."
- Make sure you are working in your own copy of the notebook.

### Creating code and text cells
Colab notebooks are divided into cells which can contain either text or Python code. Although I have created all the cells we will be using for this workshop, it may be helpful to learn how to add cells of your own.

If you hover your mouse at the top or bottom of an already existing cell, you will have an option of adding a new code or text cell. Additionally, you can select the three dots on the right side of a cell for more options.

### Running code cells
There are two ways to run code cells:
- Click the "play" button on the left side of the code cell
- Press <kbd>SHIFT</kbd>+<kbd>RETURN</kbd> (or <kbd>SHIFT</kbd>+<kbd>ENTER</kbd>)

## Showcase One: Using a Scatterplot to Compare Textual Corpora

In [None]:
%%capture
# Setup: %%capture must be the first line of the cell; it hides the installer output
!pip install scattertext

In [None]:
import scattertext as st
import spacy
import requests
from IPython.display import HTML
import pandas as pd

In [None]:
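# spaCy 3 removed the 'en' shortcut, so load the small English model by its full name.
# If the model is missing, run: !python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')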

In [None]:
# Melville's Moby Dick
response = requests.get('https://raw.githubusercontent.com/msaxton/sc_workshop/master/melville.txt')
text = response.text
melville = text[11994:1209637]  # <first character of text>:<last character of text>
melville_paras = melville.split('\n\n')

In [None]:
# Austen's Sense and Sensibility
response = requests.get('https://raw.githubusercontent.com/msaxton/sc_workshop/master/austin.txt')
text = response.text
austin = text[709:674322]  # <first character of text>:<last character of text>
austin_paras = austin.split('\n\n')

In [None]:
melville_df = pd.DataFrame(data={'author': 'Melville', 'text': melville_paras})
austin_df = pd.DataFrame(data={'author': 'Austen', 'text': austin_paras})

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([melville_df, austin_df], ignore_index=True)

In [None]:
# this will take a few minutes
corpus = st.CorpusFromPandas(df, category_col='author', text_col='text', nlp=nlp).build()

In [None]:
html = st.produce_scattertext_explorer(corpus, category='Melville',
                                       category_name='Melville',
                                       not_category_name='Austen',
                                       width_in_pixels=900)
HTML(html)

## Showcase Two: Exploring a Corpus with Natural Language Processing

This showcase comes from the incredibly rich resource [Programming Historian](https://programminghistorian.org). We will explore some useful ways to read a corpus with machines, following along with Matthew J. Lavin's [Analyzing Documents with TF-IDF](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf).

Specifically, we will look at the most __important__ terms in a corpus of obituaries from the New York Times. You might be wondering, "How do we decide whether a term is important or not?"

### Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF is a very common approach used in analyzing texts and in preprocessing texts for use in natural language processing tasks. It is also used in many search algorithms to help locate relevant pages related to your search terms.

![](https://static.wixstatic.com/media/1cd646_3e5b7f0d10e34c04ba293144c637e1eb~mv2.jpg/v1/fill/w_779,h_412,al_c,q_90/1cd646_3e5b7f0d10e34c04ba293144c637e1eb~mv2.jpg)

Image comes from a great simple [explanation of TF-IDF](https://keetmalin.wixsite.com/keetmalin/post/2017/06/05/tf-idf-in-the-field-of-information-retrieval) by Keet Malin Sugathadasa.
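
To make the idea concrete, here is a minimal sketch of one common textbook formulation, term count times log(N / document frequency), computed by hand. The three tiny "documents" and the function name `tf_idf` are invented for illustration. Scikit-learn's `TfidfVectorizer`, used in the cells that follow, applies a smoothed variant of the IDF term, so its scores will not match these numbers exactly.

```python
import math

# Three tiny invented "documents" (illustrative only)
docs = [
    'the whale swam in the sea',
    'the captain chased the whale',
    'the sisters walked in the garden',
]
tokenized = [doc.split() for doc in docs]

def tf_idf(term, doc_tokens, all_docs):
    """Raw count of term in one document times log(N / document frequency).

    Assumes the term occurs in at least one document.
    """
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in all_docs if term in tokens)
    return tf * math.log(len(all_docs) / df)

# Score three terms in the second document
for term in ['the', 'whale', 'captain']:
    print(term, round(tf_idf(term, tokenized[1], tokenized), 3))
```

In this sketch "the" scores 0 because it appears in every document, while "captain" scores highest because it appears in only one.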

In [None]:
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile
import re
import pandas as pd

# import the TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
zipurl = 'https://programminghistorian.org/assets/tf-idf/lesson-files.zip'
doc_texts = []
doc_names = []
with urlopen(zipurl) as zipresp:
    with ZipFile(BytesIO(zipresp.read())) as zfile:
        # load file metadata into dataframe
        with zfile.open('lesson-files/metadata.csv') as m:
            doc_metadata = pd.read_csv(m)

        # load document contents
        filenames = zfile.namelist()
        filenames.sort()
        for file in filenames:
            if re.search(r'\/\d{4}\.txt', file):
                with zfile.open(file) as f:
                    doc_names.append(file)
                    doc_texts.append(f.read().decode('utf-8', 'ignore'))
print(doc_metadata.shape)
len(doc_texts)

In [None]:
print(doc_metadata.loc[125])
doc_texts[125]

In [None]:
vectorizer = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None)
# For a description of max_df and min_df, see this Stack Overflow answer: https://stackoverflow.com/a/35615151
transformed_documents = vectorizer.fit_transform(doc_texts)

In [None]:
transformed_documents_as_array = transformed_documents.toarray()
# use this line of code to verify that the numpy array represents the same number of documents that we have in the file list
len(transformed_documents_as_array)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out()[1000:1030]

In [None]:
list(transformed_documents_as_array[0][1000:1030])

In [None]:
# loop over each item in transformed_documents_as_array, using enumerate to keep track of the current position
doc_dfs = []
# get_feature_names() was removed in scikit-learn 1.2; fetch the terms once, outside the loop
feature_names = vectorizer.get_feature_names_out()

for counter, doc in enumerate(transformed_documents_as_array):
    # construct a dataframe
    tf_idf_tuples = list(zip(feature_names, doc))
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)
    one_doc_as_df['filename'] = doc_names[counter]
    one_doc_as_df['year'] = doc_metadata.loc[counter]['year']
    one_doc_as_df['metadata_id'] = counter
    doc_dfs.append(one_doc_as_df)

# put all of the document dataframes together
# (DataFrame.append was removed in pandas 2.0; pd.concat keeps each document's own row numbering)
all_docs = pd.concat(doc_dfs)

In [None]:
all_docs.head()

### Top 20 words in a certain obituary
Here is where a better naming convention for files would have helped, or having the identity of the person as a part of the metadata file. Some example obituaries would be:

* Isaac Asimov: 1
* Virginia Woolf: 24
* Albert Einstein: 73
* Billie Holiday: 97

In [None]:
# Top words in a certain obituary
def get_obit_top_words(obit_id, num_words=20):

    return all_docs.loc[all_docs['metadata_id'] == obit_id][:num_words]

In [None]:
get_obit_top_words(1, 20)

In [None]:
# Top 20 words in a date range
all_docs.loc[(all_docs['year'] >= 1960) & (all_docs['year'] < 1970)].nlargest(20, 'score')

### Comparative Example from Lavin's Programming Historian Tutorial
Lavin reminds us that this raw use of TF-IDF is best suited to generating more informed research questions, rather than to providing definitive answers. As an example, he compares some socially minded writers to look for interesting questions. See this in the context of the tutorial at [Analyzing Documents with TF-IDF](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#interpreting-word-lists-best-practices-and-cautionary-notes).

* Nellie Bly: 125
* Willa Cather: 341
* W.E.B. Du Bois: 53
* Upton Sinclair: 263
* Ida Tarbell: 309

In [None]:
# Top 20 words for PH comparative
compare_obits = [('Bly', 125), ('Cather', 341), ('Du Bois', 53), ('Sinclair', 263), ('Tarbell', 309)]
n_words = 20
series_list = []
for name, obit_id in compare_obits:  # obit_id avoids shadowing Python's built-in id()
    top_df = get_obit_top_words(obit_id, n_words)
    terms = top_df['term'].rename(name)
    series_list.append(terms)

compare_df = pd.concat(series_list, axis=1)
compare_df

## A Brief Introduction to Python

### Your First Python Program
When learning a new programming language, it is traditional to start by writing a program called "Hello, World!" This is the first program we will write together.

**Code along**
1. Click into the code cell below these directions
2. Type `message = 'Hello, World!'`
3. On the next line type `print(message)`
4. Run the code cell

This very simple program introduces us to some fundamental Python concepts:
- `message` is an example of a **variable**. Think of variables as buckets into which we can put information for later use.
- `'Hello, World!'` is an example of a **string**. Think of strings as text information. Python knows this is a string because of the quotation marks.
- `=` assigns the string `'Hello, World!'` to the variable `message`.
- `print` is a **function**. Think of functions as a set of instructions. Functions often need more information to carry out their tasks; this information is placed in parentheses and is called an **argument**. Here, we are giving the variable `message` (which itself contains the string `'Hello, World!'`) as an argument to `print` so that it knows what to print.
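
As a small illustration of the "bucket" idea, the sketch below uses a made-up variable name to show that a variable can be given a new value later, and that Python's built-in `type` function reports what kind of data the variable currently holds.

```python
greeting_text = 'Hello, World!'   # a string stored in a variable
print(type(greeting_text))        # ask Python what kind of data this is

greeting_text = 'Hello, again!'   # the same variable can be reassigned
print(greeting_text)
```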

### Strings
Strings can be enclosed in either a single quote (') or a double quote ("). Consider the following:

In [None]:
example_1 = 'Micah has a pet cat.'  # single quote
print(example_1)
example_2 = "Micah's cat's name is Cat."  # double quote
print(example_2)
example_3 = "Micah's cat says, \"These are terrible examples.\""  # bonus: escape character
print(example_3)

Micah has a pet cat.
Micah's cat's name is Cat.
Micah's cat says, "These are terrible examples."


Strings can be manipulated using a number of built-in methods (functions that belong to the string and are called with a period):

In [None]:
example_4 = 'this is a string'
print(example_4.upper())
print(example_4.title())


THIS IS A STRING
This Is A String


**Pro Tip:** If you want to know what functions strings (or other Python objects) have, type the name of the variable followed by a period (.) and Colab will show you your options. In the code cell below, type `example_4.` and see what functions are available.
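
If the pop-up list does not appear, Python's built-in `dir` function gives similar information in code. The sketch below uses an illustrative variable named `sample`. It filters out names that begin with an underscore, which are mostly for Python's internal use, and then tries one of the listed functions, `replace`.

```python
sample = 'this is a string'

# dir() lists everything attached to the object; keep only the everyday names
method_names = [name for name in dir(sample) if not name.startswith('_')]
print(method_names)

# try one of them: replace() swaps one piece of text for another
print(sample.replace('string', 'sentence'))
```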

Strings can also be joined together (this is called **concatenation**):

In [None]:
example_5 = 'The quick brown fox'
example_6 = ' jumps over the lazy dog.'  # notice the leading blank space
print(example_5 + example_6)

The quick brown fox jumps over the lazy dog.


Finally, variables may be inserted in a string with something called an **f-string**:

In [None]:
name = 'micah'
occupation = 'librarian'
print(f'Hello, my name is {name.title()} and I work as a {occupation}.')

Hello, my name is Micah and I work as a librarian.


In [None]:
# Exercise: Create a variable called favorite_movie, assign it a value, and print an f-string using the variable.

### Integers
**Integers** are one of the ways that Python handles numbers (**floats** are another, not covered here). Integers are common in Python even if you are working primarily with texts.

In [None]:
print(10 + 5)  # addition
print(10 - 5)  # subtraction
print(10 * 5)  # multiplication
print(10 / 5)  # division (Technically this returns a float, not an integer)

15
5
50
2.0


Python is really fast with numbers:

In [None]:
73950383 * 39282

2904918945006

Integers and strings are different data types in Python, and it is important to remember whether you are dealing with a string or an integer. In the following, we use the `==` operator, which tests whether two values are equal, to show this.

In [None]:
print('2' == '2')
print(2 == 2)
print('2' == 2)

True
True
False
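
Because `'2'` and `2` are different types, text that looks like a number has to be converted before you can do arithmetic with it. Here is a minimal sketch using the built-in `int` and `str` functions. The variable names are made up for illustration.

```python
year_text = '1851'              # a number stored as a string
year = int(year_text)           # convert the string to an integer
print(year + 1)                 # arithmetic now works
print(str(year) == year_text)   # converting back gives the same string
```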


### Lists
**Lists** are an important way to store information in Python (**dictionaries** are another, not covered here). Lists are indicated by square brackets ([]), and items in the list are separated by commas (,). Lists are "zero-indexed," so the first item is item 0.

In [None]:
pets = ['dog', 'cat', 'fish', 'turtle']
nums = [2, 4, 6, 8]
print(pets)
print(nums)

['dog', 'cat', 'fish', 'turtle']
[2, 4, 6, 8]


In [None]:
# Exercise: Create your own list containing 3 items and give it an appropriate name. Then, print the list

Sometimes you may need to find specific items in a list.

In [None]:
dog = pets[0]
print(dog)

dog
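
Beyond looking up a single position, a few other built-in tools help locate items. The sketch below uses an illustrative list of author names to show zero-based and negative indexing, `len`, the `in` keyword, and the list's `index` function.

```python
authors = ['Melville', 'Austen', 'Woolf', 'Du Bois']

print(authors[1])               # second item, because counting starts at 0
print(authors[-1])              # negative numbers count from the end
print(len(authors))             # how many items are in the list
print('Woolf' in authors)       # check whether an item is present
print(authors.index('Woolf'))   # find the position of an item
```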


In [None]:
# Exercise: Create a variable called item_2 and assign the appropriate item from your list. Print the variable.

Sometimes you will need to edit lists. Here are some simple examples:

In [None]:
pets.append('emu')
print(pets)
nums.remove(2)
print(nums)

['dog', 'cat', 'fish', 'turtle', 'emu']
[4, 6, 8]
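
A few other common edits, sketched here with a made-up list of colors, are replacing an item by position, inserting an item at a chosen position, and removing the last item with `pop`.

```python
colors = ['red', 'green', 'blue']

colors[0] = 'crimson'        # replace the item at position 0
colors.insert(1, 'yellow')   # insert a new item at position 1
last = colors.pop()          # remove the last item and keep it in a variable

print(colors)
print(last)
```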


In [None]:
# Exercise: Add an item to your list. Then, print the list.

### For Loops

Sometimes you may need to do something to each item in your list. Python has something called a **for loop** which allows us to do just that. In the example below, pay attention to the indentation of the second line.

In [None]:
for pet in pets:
  print(pet)

dog
cat
fish
turtle
emu


In [None]:
for num in nums:
  half_num = num / 2
  print(half_num)

2.0
3.0
4.0


In [None]:
# Exercise: Use a for loop to iterate through your list and print each item.

We can even get a little more complex and add conditional logic. Consider the following complex example which contains some new content. Can you guess what it will do?

In [None]:
my_pets = ['dog', 'cat']
for pet in pets:
  if pet in my_pets:
    print(f'I have a pet {pet}.')
  else:
    print(f'I do not have a pet {pet}.')


I have a pet dog.
I have a pet cat.
I do not have a pet fish.
I do not have a pet turtle.
I do not have a pet emu.
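
When there are more than two possibilities, Python's `elif` keyword (short for "else if") adds further checks between `if` and `else`. The sketch below is self-contained and uses made-up words and length cut-offs purely for illustration.

```python
labels = []
for word in ['a', 'whale', 'extraordinary']:
    if len(word) <= 2:
        label = 'short'
    elif len(word) <= 6:   # checked only when the first test fails
        label = 'medium'
    else:                  # runs when no earlier test matched
        label = 'long'
    labels.append(label)
    print(f'{word} is a {label} word.')
```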


**Pro Tip:** You can turn a string into a list and a list into a string.

In [None]:
# Turn a string into a list
s = 'This string is about to become a list.'
l = s.split(' ')
print(l)

['This', 'string', 'is', 'about', 'to', 'become', 'a', 'list.']


In [None]:
# Turn a list into a string
l = ['This', 'list', 'is', 'about', 'to', 'become', 'a', 'string', '.']
s = ' '.join(l)
print(s)

This list is about to become a string .
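
One detail worth knowing is that `split(' ')` splits on every single space, so repeated spaces produce empty strings, while `split()` with no argument splits on any run of whitespace. Here is a short sketch with an invented example line:

```python
line = 'Call   me  Ishmael.'   # note the extra spaces

print(line.split(' '))         # splitting on one space leaves empty strings
print(line.split())            # no argument: split on any run of whitespace
print('-'.join(line.split()))  # join the clean list with a different separator
```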


### Functions


A **function** in Python is merely a named set of instructions which can be used over and over again. Here are a few simple examples:

In [None]:
# function
def greeting(name):
  print(f'Hello, {name}! How are you today?')

There is no output after running this code, but the computer has stored this function under the name `greeting` which can be called later.

In [None]:
greeting(name="Michael")

Hello, Michael! How are you today?


Sometimes it is more useful to store the output of a function into its own variable for later use. Here's an example:

In [None]:
def times_two(num):
  new_num = num * 2
  return new_num  #  notice the keyword "return."

In [None]:
result = times_two(num=5)

In [None]:
print(result)

10
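
Functions that return a value combine naturally with lists and for loops. As a sketch, the hypothetical function `count_words` below returns the number of words in a string and is then called once for each title in a list.

```python
def count_words(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())

titles = ['Moby Dick', 'Sense and Sensibility']
for title in titles:
    n = count_words(title)   # store the returned value for later use
    print(f'"{title}" has {n} words.')
```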


## Final Code-Along (Time Permitting)

Let's build a function that removes stop words from a string. There are a lot of ways to do this, but here is an example of just one. Each comment below represents something our function needs to do.

In [None]:
#  name the function and provide required arguments
#  remove punctuation
#  lowercase all the letters
#  convert string to list
#  create an empty list for our results
#  loop through the list
#  examine each word
#  if the word is NOT in the list of stop words, add it to the results list
#  change results list back to a string
#  return the string
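
For reference, here is one possible way to fill in those steps. It is a sketch rather than the only solution. The function name `remove_stop_words` and the very short `STOP_WORDS` list are assumptions made for illustration, and real projects typically use a much longer stop word list.

```python
import string

# A tiny, made-up stop word list (illustrative only)
STOP_WORDS = ['a', 'an', 'and', 'is', 'of', 'the', 'to']

def remove_stop_words(text, stop_words=STOP_WORDS):
    # remove punctuation
    for mark in string.punctuation:
        text = text.replace(mark, '')
    # lowercase all the letters
    text = text.lower()
    # convert string to list
    words = text.split()
    # create an empty list for our results
    results = []
    # loop through the list and examine each word
    for word in words:
        # keep the word only if it is NOT a stop word
        if word not in stop_words:
            results.append(word)
    # change results list back to a string and return it
    return ' '.join(results)

print(remove_stop_words('The whale is the king of the sea, and a terror to sailors.'))
```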

## Resources for Learning Python

#### [Free Code Camp: Python Tutorial on YouTube](https://youtu.be/rfscVS0vtbw)
Free Code Camp provides high-quality training for beginning coders. This is a four-hour YouTube tutorial for beginners.

#### [The Programming Historian Python Lessons](https://programminghistorian.org/en/lessons/introduction-and-installation)
The Programming Historian contains many coding lessons geared specifically toward digital humanities projects. The set of lessons linked to here provides an introduction to Python centered on gathering and using web pages.

#### [Python Practice Book](https://anandology.com/python-practice-book/index.html)
Python Practice Book by Anand Chitipothu is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

#### [Codecademy: Learn Python 3](https://www.codecademy.com/learn/learn-python-3)
Codecademy provides both free and paid options for learning to code. 