In [1]:
import sys
print(sys.version)

3.6.3 (default, Nov 13 2017, 08:48:07) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]


In [2]:
import gensim

I'm going to work with the CourtListener Supreme Court opinions dataset, found at [https://www.courtlistener.com/api/bulk-data/opinions/scotus.tar.gz](https://www.courtlistener.com/api/bulk-data/opinions/scotus.tar.gz). They're a nonprofit without a lot of bandwidth, so leave a good half an hour to download it even on a fast connection. I've already downloaded it into supcourt and unpacked it into a bunch of individual case JSONs (there are a lot of them).

If you look at the data, you'll see that each JSON has an "html" field with an HTML-ized version of the opinion, as well as a "plain_text" field... but the latter is sometimes empty. So to extract texts, we'll try plain_text first; if there's no plain text, we'll use Beautiful Soup to extract from html (or from several other alternate html fields).

In [13]:
from bs4 import BeautifulSoup
def extract_text(html):
    soup = BeautifulSoup(html, "lxml")
    for crap in soup(["script", "style", "meta"]):
        crap.extract()
    return soup.get_text()

In [14]:
# let's do an experiment and make sure this works right
import requests
response = requests.get("http://rulelaw.net")
print(extract_text(response.text))

The Rule of Law in the Real WorldThe Rule of Law in the Real WorldPaul Gowder"Paul Gowder's masterpiece articulates a new vision of the rule of the law that protects the disempowered and marginalized, and that demands that the nation-state rationalize its coercive power. He relentlessly attacks irrational social, economic and political hierarchies, particularly those that give continued vitality to racial inequality in the US today. As such his rule of law is firmly rooted in notions of human rights, and looks askance at soaring inequality in the US. This rule of law protects real people in need of real protection, rather than serving as just another rhetorical instrument in the arsenal of the wealthy and powerful. It also serves, incidentally, as a bedrock foundation for economic development and a tool that can help avoid financial disruptions like that seen in 2008."Steven A. Ramirez, author of Lawless Capitalism"The Rule of Law in the Real World explores and connects legal philosoph

Actually, this is ugly: text from adjacent elements runs together with no separator ("Real WorldThe Rule of Law"). After the data finishes downloading, if cases come out this ugly too, I might break out jsoup or some other parser that does a better job getting text out of these things.
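One cheap thing to try before reaching for a different library: a minimal sketch, using only the standard library's `html.parser`, that joins text nodes with a space so words from neighboring elements don't get glued together. This is not jsoup and not what the rest of this notebook uses; `SpacedTextExtractor` and `extract_text_spaced` are names I made up for illustration, and the sample markup is modeled on the case HTML further down.

```python
from html.parser import HTMLParser

class SpacedTextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text_spaced(html):
    parser = SpacedTextExtractor()
    parser.feed(html)
    parser.close()
    # Joining on a space is the whole point: element boundaries become gaps.
    return " ".join(parser.chunks)

sample = ('<p class="parties">MORRISDALE COAL CO.<br>v.<br>UNITED STATES.</p>'
          '<script>var x = 1;</script>')
print(extract_text_spaced(sample))
```

The trade-off is that inline tags in the middle of a word would also pick up a space, so this is probably only good enough for a bag-of-words style pipeline.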

In [15]:
import glob
jsons = list(glob.glob("supcourt/*.json"))

In [16]:
len(jsons)

63967

In [17]:
import json

In [19]:
with open(jsons[0]) as j:
    example = json.load(j)

In [20]:
print(example)

{'resource_uri': 'http://www.courtlistener.com/api/rest/v3/opinions/100000/', 'absolute_url': '/opinion/100000/morrisdale-coal-co-v-united-states/', 'cluster': 'http://www.courtlistener.com/api/rest/v3/clusters/100000/', 'author': 'http://www.courtlistener.com/api/rest/v3/people/1501/', 'joined_by': [], 'author_str': '', 'per_curiam': False, 'date_created': '2010-04-28T16:47:22Z', 'date_modified': '2017-03-24T04:07:08.420443Z', 'type': '010combined', 'sha1': 'f966678c479af550803b000aecea9d1f16897a6a', 'page_count': None, 'download_url': None, 'local_path': None, 'plain_text': '', 'html': '<p class="case_cite">259 U.S. 188</p>\n    <p class="case_cite">42 S.Ct. 481</p>\n    <p class="case_cite">66 L.Ed. 892</p>\n    <p class="parties">MORRISDALE COAL CO.<br>v.<br>UNITED STATES.</p>\n    <p class="docket">No. 65.</p>\n    <p class="date">Argued Jan. 6-9, 1922.</p>\n    <p class="date">Decided May 29, 1922.</p>\n    <div class="prelims">\n      <p class="indent">Messrs. Gibbs L. Baker and

In [21]:
print(example.keys())

dict_keys(['resource_uri', 'absolute_url', 'cluster', 'author', 'joined_by', 'author_str', 'per_curiam', 'date_created', 'date_modified', 'type', 'sha1', 'page_count', 'download_url', 'local_path', 'plain_text', 'html', 'html_lawbox', 'html_columbia', 'html_with_citations', 'extracted_by_ocr', 'opinions_cited'])
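Since plain_text is only sometimes filled in, it might be worth counting which text-bearing fields are actually populated before committing to an extraction order. Here's a rough sketch; the field list is taken from the keys printed above, `field_coverage` is a hypothetical helper, and the toy dicts are made up so the block runs on its own.

```python
from collections import Counter

# Text-bearing fields, as listed in the example's keys.
TEXT_FIELDS = ["plain_text", "html", "html_lawbox",
               "html_columbia", "html_with_citations"]

def field_coverage(cases):
    """Count, per field, how many case dicts have a non-empty value."""
    counts = Counter()
    for case in cases:
        for field in TEXT_FIELDS:
            if case.get(field):  # missing, None, and "" all count as empty
                counts[field] += 1
    return counts

# Made-up stand-ins for loaded case JSONs.
toy_cases = [
    {"plain_text": "", "html": "<p>a</p>"},
    {"plain_text": "some text", "html": ""},
    {"plain_text": "", "html": "<p>b</p>", "html_lawbox": "<p>b</p>"},
]
print(field_coverage(toy_cases))
```

On the real data, the same function could be fed a generator that loads a sample of the case files rather than all of them.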


In [22]:
print(example["plain_text"])




In [23]:
print(extract_text(example["html"]))

259 U.S. 188
42 S.Ct. 481
66 L.Ed. 892
MORRISDALE COAL CO.v.UNITED STATES.
No. 65.
Argued Jan. 6-9, 1922.
Decided May 29, 1922.

Messrs. Gibbs L. Baker and Karl Knox Gartner, both of Washington, D. C., for appellant.
Mr. Assistant Attorney General Riter, for the United States.
Mr. Justice HOLMES delivered the opinion of the Court.


1
This is an appeal from a judgment of the Court of Claims dismissing the appellant's petition upon demurrer. The petition alleges that the claimant had outstanding contracts calling for more than the actual production of its mines for the months of June and following through November, 1918, at a price of $4.50 per gross ton; that the Fuel Administration appointed by the President during the war 'requisitioned and compelled petitioner to divert 12,823.29 tons of coal' during the period mentioned; that the price received for this coal was $3.304 per gross ton, and that the claimant thereby suffered a loss of $15,337.37, for which loss it asks judgment agains

In [24]:
import re
example_text = extract_text(example["html"])

In [26]:
test_despace = re.sub(r'\s+', " ", example_text)

In [27]:
print(test_despace)

259 U.S. 188 42 S.Ct. 481 66 L.Ed. 892 MORRISDALE COAL CO.v.UNITED STATES. No. 65. Argued Jan. 6-9, 1922. Decided May 29, 1922. Messrs. Gibbs L. Baker and Karl Knox Gartner, both of Washington, D. C., for appellant. Mr. Assistant Attorney General Riter, for the United States. Mr. Justice HOLMES delivered the opinion of the Court. 1 This is an appeal from a judgment of the Court of Claims dismissing the appellant's petition upon demurrer. The petition alleges that the claimant had outstanding contracts calling for more than the actual production of its mines for the months of June and following through November, 1918, at a price of $4.50 per gross ton; that the Fuel Administration appointed by the President during the war 'requisitioned and compelled petitioner to divert 12,823.29 tons of coal' during the period mentioned; that the price received for this coal was $3.304 per gross ton, and that the claimant thereby suffered a loss of $15,337.37, for which loss it asks judgment against t

In [30]:
test_despace.encode("ascii", errors="ignore").decode("ascii")

"259 U.S. 188 42 S.Ct. 481 66 L.Ed. 892 MORRISDALE COAL CO.v.UNITED STATES. No. 65. Argued Jan. 6-9, 1922. Decided May 29, 1922. Messrs. Gibbs L. Baker and Karl Knox Gartner, both of Washington, D. C., for appellant. Mr. Assistant Attorney General Riter, for the United States. Mr. Justice HOLMES delivered the opinion of the Court. 1 This is an appeal from a judgment of the Court of Claims dismissing the appellant's petition upon demurrer. The petition alleges that the claimant had outstanding contracts calling for more than the actual production of its mines for the months of June and following through November, 1918, at a price of $4.50 per gross ton; that the Fuel Administration appointed by the President during the war 'requisitioned and compelled petitioner to divert 12,823.29 tons of coal' during the period mentioned; that the price received for this coal was $3.304 per gross ton, and that the claimant thereby suffered a loss of $15,337.37, for which loss it asks judgment against 

In [41]:
import string
transdict = {ord(x): " " for x in string.punctuation + string.digits}
def squish_spaces(text):
    return re.sub(r'\s+', " ", text)

def asciify(text):
    return text.encode("ascii", errors="ignore").decode("ascii")

def remove_nonletter(text):
    return text.translate(transdict)

def cleanup(text):
    return squish_spaces(remove_nonletter(asciify(text))).lower()
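One thing to be aware of with the ASCII step: encoding with `errors="ignore"` drops accented letters entirely rather than replacing them with their base letter. If that matters, a possible variation (a sketch, not what the rest of this notebook uses) is to decompose the text with `unicodedata.normalize("NFKD", ...)` first, so only the accent mark is dropped. `asciify_keep_base` and `cleanup_variant` are hypothetical names, and the variant also strips leading and trailing spaces, which the original pipeline leaves in.

```python
import re
import string
import unicodedata

# Same idea as transdict: punctuation and digits become spaces.
_table = {ord(ch): " " for ch in string.punctuation + string.digits}

def asciify_keep_base(text):
    # NFKD splits an accented letter into base letter + combining mark;
    # ignoring non-ASCII bytes afterwards removes only the mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

def cleanup_variant(text):
    text = asciify_keep_base(text).translate(_table)
    return re.sub(r"\s+", " ", text).strip().lower()

# Made-up sample string, just to show the difference.
sample = "Café No. 65.\nDecided May 29, 1922."
print(cleanup_variant(sample))
print(sample.encode("ascii", errors="ignore").decode("ascii"))
```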

In [42]:
print(cleanup(example_text))

 u s s ct l ed morrisdale coal co v united states no argued jan decided may messrs gibbs l baker and karl knox gartner both of washington d c for appellant mr assistant attorney general riter for the united states mr justice holmes delivered the opinion of the court this is an appeal from a judgment of the court of claims dismissing the appellant s petition upon demurrer the petition alleges that the claimant had outstanding contracts calling for more than the actual production of its mines for the months of june and following through november at a price of per gross ton that the fuel administration appointed by the president during the war requisitioned and compelled petitioner to divert tons of coal during the period mentioned that the price received for this coal was per gross ton and that the claimant thereby suffered a loss of for which loss it asks judgment against the united states the petition does not allege or mean that the united states took the coal to its own use the meani

In [47]:
def extract_clean_text(casedict):
    if casedict["plain_text"]:
        text = casedict["plain_text"]
    else:
        text = extract_text(casedict["html"])
    return cleanup(text)
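The function above only falls back to "html". A sketch of how the other alternate html fields could be tried in turn is below. The field names come from the example's keys, but the order they're tried in is my assumption, and `pick_raw_text` is a hypothetical helper; I haven't checked how often the extra fallbacks would actually be needed in this dataset.

```python
# Field names from the example's keys; the priority order is an assumption.
HTML_FIELDS = ["html", "html_lawbox", "html_columbia", "html_with_citations"]

def pick_raw_text(casedict):
    """Return (field_name, raw_value) for the first non-empty text field.

    plain_text is tried first, then each HTML field in order.
    Returns (None, "") if every field is missing or empty.
    """
    for field in ["plain_text"] + HTML_FIELDS:
        value = casedict.get(field)
        if value and value.strip():
            return field, value
    return None, ""

# Made-up case dict: no plain text, no primary html, but a lawbox version.
example_case = {"plain_text": "", "html": None,
                "html_lawbox": "<p>Opinion text</p>"}
print(pick_raw_text(example_case))
```

The caller would then pass the value through the HTML extractor only when the returned field name isn't "plain_text".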

In [48]:
with open(jsons[1]) as j:
    print(extract_clean_text(json.load(j)))

 u s s ct l ed pine hill coal co inc v united states no argued jan decided may mr henry s drinker jr of philadephia pa for appellant argument of counsel from pages intentionally omitted mr assistant attorney general riter for the united states mr justice holmes delivered the opinion of the court this case like morrisdale coal co v united states u s sup ct l ed is a claim based upon the action of the fuel administration under the act of august c stat comp st comp st ann supp q fixing prices for coal the allegations and arguments however are different the transactions of the claimant from and including september through january are set forth in detail they embrace large sales at government prices and smaller sales at other than those prices it is alleged that the prices fixed for the claimant s coal were unjust and unreasonable and did not afford just compensation and that as a result of keeping to them as the claimant did the receipts were actually less than the cost of production on th

In [50]:
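import os
from os import path
# let's make some texts now.
os.makedirs("texts", exist_ok=True)  # the output directory has to exist before we write into it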
for jfile in jsons:
    with open(jfile) as j:
        text = extract_clean_text(json.load(j))
    filename = path.split(jfile)[-1]
    outfile = "texts/" + filename.partition(".")[0] + ".txt"
    with open(outfile, "w") as o:
        o.write(text)

In [51]:
texts = list(glob.glob("texts/*.txt"))

In [52]:
len(texts)

63967

In [54]:
with open(texts[0]) as t:
    print(t.read())

 u s s ct l ed morrisdale coal co v united states no argued jan decided may messrs gibbs l baker and karl knox gartner both of washington d c for appellant mr assistant attorney general riter for the united states mr justice holmes delivered the opinion of the court this is an appeal from a judgment of the court of claims dismissing the appellant s petition upon demurrer the petition alleges that the claimant had outstanding contracts calling for more than the actual production of its mines for the months of june and following through november at a price of per gross ton that the fuel administration appointed by the president during the war requisitioned and compelled petitioner to divert tons of coal during the period mentioned that the price received for this coal was per gross ton and that the claimant thereby suffered a loss of for which loss it asks judgment against the united states the petition does not allege or mean that the united states took the coal to its own use the meani