Text Analysis of the Indian Claims Commission Decisions
out
table
text
.gitignore
LICENSE
README.md
icc-cluster.html
icc-data.Rproj
icc-topics.pdf
icc.txt
load.r
makefile
packages.js
r2d3.r
style.css
table.r
tesseract-config
topics.json
topics.r


## Text Analysis of the Indian Claims Commission Decisions

The Indian Claims Commission (ICC) was a legal body that adjudicated hundreds of claims that Indian Tribes had against the United States for past wrongs. It produced 43 volumes of decisions over more than 30 years of work. Though the ICC tried cases to legal standards, it was a product of its time and reflected changing attitudes toward Native Americans. This work examines the Commission's place in Federal-Indian policy and analyzes how it used historical knowledge to arrive at legal decisions. It is also a case study in using text mining to computationally explore a large corpus of legal documents in its entirety (n=100%).

This analysis collected the Decisions from Oklahoma State University, then performed OCR on the PDFs using tesseract and Lincoln Mullen's make recipe from Civil-Procedure-Codes.

Update: OKstate has since changed its ICC file structure, which no longer allows downloading with wget. The "data" for the analysis are the raw txt files of the ICC decisions in "text". The "code" is run with topics.r only. Skip load.r (which cleans previously OCR'd text) and table.r (an experiment in parsing HTML tables to provide decisions with metadata). The other files are for configuration (e.g. icc.txt is the topic modeling stoplist).

I used the makefile to perform each of the tasks: download, collect PDFs, OCR, and collect tables/plaintiff tribes. I'd highly recommend running the OCR in parallel using make ocr -j2.

The rest of the work consists of R scripts that process and analyze the textual data. Use load.r and topics.r to perform the work. table.r is a script that collects the plaintiff tribe names for the stoplist. Best practice is to use the curated stoplist on GitHub, as manual changes have been made to it.
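
As a rough illustration of the stoplist step described above, here is a minimal sketch in base R. It is not the code in topics.r: the helper names `read_corpus` and `tokenize`, the demo directory, and the sample sentences are all hypothetical, and the tiny stoplist merely stands in for the curated icc.txt.

```r
# Hypothetical helper: read every .txt file in a directory into a
# named character vector, one element per document.
read_corpus <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  docs <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = " "),
    character(1)
  )
  names(docs) <- basename(files)
  docs
}

# Hypothetical helper: lowercase a document, split it on anything that
# is not a letter, then drop empty tokens and stoplist words.
tokenize <- function(doc, stoplist) {
  tokens <- strsplit(tolower(doc), "[^a-z]+")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stoplist)]
}

# Demo data in a temporary directory (stands in for the "text" folder).
demo_dir <- file.path(tempdir(), "icc-demo-text")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("The Commission finds that the Tribe held title.",
           file.path(demo_dir, "a.txt"))
writeLines("The Tribe ceded the land by treaty.",
           file.path(demo_dir, "b.txt"))

# Illustrative stoplist; the real one would come from icc.txt.
stoplist <- c("the", "that", "by", "tribe")

docs <- read_corpus(demo_dir)
tokens <- lapply(docs, tokenize, stoplist = stoplist)

# Term frequencies across the whole demo corpus.
term_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
print(term_counts)
```

To try this against the repository's files, one would presumably point `read_corpus` at the "text" directory and load the stoplist with `readLines("icc.txt")`, assuming icc.txt holds one term per line; that format is an assumption and should be checked against the file itself.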

This project was originally created as a final class project for CLIO3 (Hist 698) at George Mason University.

Visualizations are available at Petercarrjones.com