Code used in the course Race, Gender, and Literary Digital Humanities

jdportercode/184D

184D

The purpose of these files was primarily to demonstrate some basic Python concepts and text-mining functions, so they aren't necessarily optimized for day-to-day use. Still, they work, and the comments have pedagogical value.

word_counter.py has functions for reading a .txt file, cleaning the words (stripping non-alphabetic characters from the ends of strings, making everything lowercase, etc.), creating a dictionary with counts for each word, and writing the results to a .tsv file. As written, it runs on an entire directory; note that the results can get pretty unwieldy if your corpus is more than a few dozen files.
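
To make the cleaning and counting steps concrete, here is a minimal sketch of that pipeline. It is not the code in word_counter.py: the function names (clean_word, count_words, write_counts) and the column headers are assumptions chosen for illustration, and it works on a string rather than walking a directory.

```python
from collections import Counter


def clean_word(token):
    """Lowercase a token and strip non-alphabetic characters from both ends.

    Hypothetical helper: characters inside the word (e.g. the apostrophe
    in "don't") are left alone, only the ends are trimmed.
    """
    token = token.lower()
    start, end = 0, len(token)
    while start < end and not token[start].isalpha():
        start += 1
    while end > start and not token[end - 1].isalpha():
        end -= 1
    return token[start:end]


def count_words(text):
    """Split text on whitespace, clean each token, and count what remains."""
    counts = Counter()
    for token in text.split():
        word = clean_word(token)
        if word:  # skip tokens that were all punctuation or digits
            counts[word] += 1
    return counts


def write_counts(counts, out_path):
    """Write word counts to a tab-separated file, most frequent first."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("word\tcount\n")  # assumed header names
        for word, n in counts.most_common():
            out.write(f"{word}\t{n}\n")
```

To cover a whole directory, one could call count_words on each file's contents and write one .tsv per file, or merge the counters first; which of these the original script does is not stated here.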

KWIC.py is designed to take a list of target words (e.g., ['sky', 'sea']), look through a directory of texts, and return each occurrence of the target words in context, with a set number of words (the window) on either side. (KWIC stands for Key Words In Context.) For instance, a window of 6 might produce output like "the brain is wider than the sky for put them side by side". The program takes the results and writes them to a .tsv, with columns for the target word and the file in which each KWIC was found.
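
The windowing idea can be sketched in a few lines. This is an illustration under assumptions, not the code in KWIC.py: the function name kwic and its parameters are hypothetical, and it takes an already-cleaned list of words instead of reading files from a directory.

```python
def kwic(words, targets, window=6):
    """Return (target, context) pairs for each target word found in words.

    words   -- a list of already-cleaned tokens from one text
    targets -- the words to look for, e.g. ['sky', 'sea']
    window  -- how many words to keep on each side of a hit
    """
    targets = set(targets)
    hits = []
    for i, word in enumerate(words):
        if word in targets:
            # Slices are clipped at the edges of the text, so a hit near
            # the start or end simply gets a shorter context.
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            hits.append((word, " ".join(left + [word] + right)))
    return hits


line = "the brain is wider than the sky for put them side by side"
for target, context in kwic(line.split(), ["sky", "sea"], window=6):
    print(target, context, sep="\t")
```

Adding the source filename to each pair before writing would give the file column described above.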

get_mdw.py is more complicated. In short, this one takes a corpus and a metadata table that divides the corpus into two or more subcorpora. For instance, you could take a few dozen 19th-century novels and divide them into "American" and "British" corpora. The program counts the words in each corpus and uses a Fisher's exact test to determine the significance of the counts relative to an "expected" even distribution across all corpora. It then writes out the significant results (using a threshold of p = .05 by default, but this can be changed easily) to a .tsv, with columns for the word, the corpus, the observations (how many times the word showed up), the p-value, and the observed-over-expected value.
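
For readers who want to see what the test is doing, here is a rough standard-library sketch of a two-sided Fisher's exact test and an observed-over-expected ratio for a single word. It rests on assumptions: the 2x2 table layout (the word vs. all other words, in one subcorpus vs. the rest of the corpus), the function names, and the example numbers are all illustrative, and get_mdw.py may build its table or compute its p-value differently.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout for one word and one subcorpus:
        a = count of the word in the subcorpus
        b = count of all other words in the subcorpus
        c = count of the word in the rest of the corpus
        d = count of all other words in the rest of the corpus
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table at least as unlikely as the
    # observed one (small tolerance for floating-point ties).
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def observed_over_expected(word_in_sub, sub_size, word_total, corpus_size):
    """Observed count divided by the count expected under an even spread."""
    expected = word_total * sub_size / corpus_size
    return word_in_sub / expected


# Made-up numbers: a word appears 30 times in a 1,000-word subcorpus
# and 40 times in the 4,000-word corpus as a whole.
p = fisher_exact_two_sided(30, 970, 10, 2990)
ratio = observed_over_expected(30, 1000, 40, 4000)
print(f"p = {p:.3g}, observed/expected = {ratio:.2f}")
```

Exact binomial coefficients get slow for corpora with millions of words, so a library routine such as scipy.stats.fisher_exact is probably the more practical choice at that scale.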
