## Read in file (text format)

In [None]:
with open("litigation.txt", "r") as f:
    raw_text_lines = f.readlines()

In [None]:
raw_text_lines[:200]

Because the cases in this file are not yet separated, the first step is always to cut the raw text down into individual documents. To figure out how best to do this, look for distinct patterns in the transitions between cases. When you read through it as a human, what tells you that a new case is beginning? In data wrangling there is no one right answer!

## Parse out separate cases

Two features struck me immediately: 1) the header is always centered, and 2) it ends with a date, maybe more than one, but at least one.

I'm going to first write a regex to match these dates:

In [None]:
import re
date_re = re.compile(r"(January|February|March|April|May|June|July|August|September|October|November|December)\s[1-3]?[0-9]+,\s[0-9]+")
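
As a quick sanity check, I like to try the pattern on a few strings before running it over the whole file. The sample lines below are invented for illustration and are not taken from `litigation.txt`. The last one shows that the pattern is permissive about digit counts, which is fine for finding header dates but would not validate them.

```python
import re

# The pattern from the cell above, repeated so this snippet runs on its own.
date_re = re.compile(r"(January|February|March|April|May|June|July|August|September|October|November|December)\s[1-3]?[0-9]+,\s[0-9]+")

# Hypothetical lines, made up for this check.
samples = [
    "        March 3, 1999, Argued",
    "        June 15, 2001, Decided",
    "        UNITED STATES COURT OF APPEALS",
    "        March 345, 7",  # not a real date, but the loose digit classes accept it
]

matches = [bool(date_re.search(s)) for s in samples]
print(matches)
```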

Now I iterate through the lines of the file searching for centered text. To check that this is a header, I look to make sure it begins in capital letters. The end of the header is marked with the date. I use the `binary` variable to turn on and off my collection of text.

In [None]:
# to collect strings
all_headers = []
body_indices = []

# to form sub strings
header = ""
binary = 0
start = 0
for i, l in enumerate(raw_text_lines):
    if l.startswith("        "):
        if binary == 0:
            if l.strip()[:2].isupper() and l.strip()[:2].isalpha():
                binary = 1
                body_indices.append((start, i))

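        # look ahead safely: the last line of the file has no successor
        next_line = raw_text_lines[i + 1] if i + 1 < len(raw_text_lines) else ""
        # the header ends on the last of a run of date lines
        if re.search(date_re, l) and re.search(date_re, next_line) is None:
            header += l
            all_headers.append(header)
            header = ""
            binary = 0
            start = i + 1

        if binary == 1:
            header += l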
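
To see what this loop produces, here is a minimal sketch of the same on/off logic wrapped in a function and run on a toy input. The function name `split_headers` and the sample lines are my own, invented for illustration. The eight-space indent test and the date lookahead follow the cell above.

```python
import re

date_re = re.compile(r"(January|February|March|April|May|June|July|August|September|October|November|December)\s[1-3]?[0-9]+,\s[0-9]+")

def split_headers(lines, indent="        "):
    """Return (headers, body_ranges) using the same flag logic as the loop above."""
    headers, body_ranges = [], []
    header, collecting, start = "", False, 0
    for i, line in enumerate(lines):
        if not line.startswith(indent):
            continue  # only centered (indented) lines can belong to a header
        first_two = line.strip()[:2]
        if not collecting and first_two.isupper() and first_two.isalpha():
            collecting = True
            body_ranges.append((start, i))  # text since the previous header
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if date_re.search(line) and not date_re.search(next_line):
            headers.append(header + line)   # last date line closes the header
            header, collecting, start = "", False, i + 1
        elif collecting:
            header += line
    return headers, body_ranges

# Hypothetical miniature file with two cases.
lines = [
    "Some preamble\n",
    "        ACME CORP. v. WIDGET INC.\n",
    "        March 3, 1999, Argued\n",
    "        June 15, 1999, Decided\n",
    "Body of the first case.\n",
    "        FOO LLC v. BAR LTD.\n",
    "        May 1, 2000, Decided\n",
    "Body of the second case.\n",
]

headers, body_ranges = split_headers(lines)
print(len(headers), body_ranges)
```

Note that the loop itself only records a body range when the next header begins, so the text after the final header is not captured unless a closing `(start, len(lines))` range is added.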

Now we can pair each header with the body text that follows it.

In [None]:
head_body = []

for i, pair in enumerate(body_indices[1:]):
    head_body.append((all_headers[i], ''.join(raw_text_lines[pair[0]:pair[1]])))

In [None]:
print(len(head_body))

In [None]:
print(head_body[2][0])

In [None]:
print(head_body[0][1])

## Extracting Information

### Entities

The simplest method to extract entities is to use [Stanford's NER](http://nlp.stanford.edu/software/CRF-NER.shtml#Download) tagger, with the NLTK wrapper. Because the headers consist of phrases rather than full sentences, accuracy will be lower than on ordinary running text.

In [None]:
from nltk.tag.stanford import StanfordNERTagger
from nltk import word_tokenize

ner_tag = StanfordNERTagger(
        '/Users/chench/Documents/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz',
        '/Users/chench/Documents/stanford-ner-2015-12-09/stanford-ner.jar')

In [None]:
from itertools import groupby

NER = {"LOCATION": [],
       "ORGANIZATION": [],
       "PERSON": []}

In [None]:
print(head_body[1][0].split('\n'))

In [None]:
for l in head_body[1][0].split("\n"):
    NER_line = ner_tag.tag(word_tokenize(l))
    print(NER_line)
    for tag, chunk in groupby(NER_line, lambda x: x[1]):
        if tag != "O":
            NER_word = " ".join(w for w, t in chunk)  # join consecutive chunks
            NER[tag].append(NER_word)
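
The `groupby` step is easier to follow without the tagger in the way. This sketch uses hand-written `(token, tag)` pairs, invented for illustration rather than real tagger output, to show how consecutive tokens with the same tag are merged into one entity.

```python
from itertools import groupby

# Hypothetical tagger output: (token, tag) pairs made up for this example.
tagged = [
    ("Apple", "ORGANIZATION"), ("Computer", "ORGANIZATION"), (",", "O"),
    ("Inc.", "ORGANIZATION"), ("v.", "O"),
    ("John", "PERSON"), ("Smith", "PERSON"),
]

entities = {"LOCATION": [], "ORGANIZATION": [], "PERSON": []}
for tag, chunk in groupby(tagged, key=lambda pair: pair[1]):
    if tag != "O":
        # join the run of adjacent tokens that share this tag
        entities[tag].append(" ".join(token for token, _ in chunk))

print(entities)
```

Because `groupby` only merges adjacent items, the comma tagged `O` splits "Apple Computer" and "Inc." into two separate organizations.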

In [None]:
print(NER["ORGANIZATION"])

### Time and Dates

The `datetime` module makes it easy to turn dates into quantitative data:

In [None]:
import datetime

datetime.datetime.now()

We can parse all of the headers and find the dates buried in them:

In [None]:
print(head_body[0][0])

In [None]:
from dateutil import parser
my_birthday = parser.parse("My birthday is 18. Jul 1989.", fuzzy=True)

In [None]:
my_birthday.strftime("%Y-%m-%d")

In [None]:
from dateutil import parser
for l in head_body[0][0].split("\n"):
    try:
        date = parser.parse(l, fuzzy=True)
        date_string = date.strftime("%Y-%m-%d")
        if date_string != datetime.datetime.now().strftime("%Y-%m-%d") and 1900 < date.year < 2020:
            print(l)
            print(date)
    except ValueError:
        pass
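
The checks against today's date and the year range are there because, as far as I understand `dateutil`, fuzzy parsing fills in any missing fields from a default date (today), so stray numbers can turn into plausible-looking dates. A stricter alternative, sketched here with only the standard library, is to pull out "Month D, YYYY" strings with a regex and parse them with `strptime`. The sample line is hypothetical, and `%B` assumes English month names in the current locale.

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
strict_date_re = re.compile(r"(?:%s)\s[0-9]{1,2},\s[0-9]{4}" % MONTHS)

def find_dates(line):
    """Return every 'Month D, YYYY' date found in a line as datetime objects."""
    return [datetime.strptime(m.group(0), "%B %d, %Y")
            for m in strict_date_re.finditer(line)]

# Hypothetical header line, invented for this example.
line = "March 3, 1999, Argued; June 15, 1999, Decided"
dates = find_dates(line)
print([d.strftime("%Y-%m-%d") for d in dates])
```

Unlike the fuzzy parser, this returns all dates on a line, but it will miss any date written in a different format and raises `ValueError` on impossible dates such as "February 31".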

Let's write a function to get the difference in days between cases "argued" and "decided":

In [None]:
def get_diff_arg_dec(header):
    argued = None
    decided = None
    for l in header.split('\n'):
        if "Argued" in l:
            try:
                argued = parser.parse(l, fuzzy=True)
            except ValueError:
                pass
        elif "Decided" in l:
            try:
                decided = parser.parse(l, fuzzy=True)
            except ValueError:
                pass

    if argued and decided:
        days_diff = (decided-argued).days
        if days_diff > 0:
            return days_diff 
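
The arithmetic inside the function is plain `datetime` subtraction, which yields a `timedelta`. A small example with made-up dates:

```python
from datetime import datetime

# Hypothetical dates, chosen only to illustrate the subtraction.
argued = datetime(1999, 3, 3)
decided = datetime(1999, 6, 15)

delta = decided - argued          # a datetime.timedelta
print(delta.days)                 # whole days between the two dates

reversed_delta = argued - decided
print(reversed_delta.days)        # negative when the order is swapped
```

Since the function only returns positive differences, a header where the two dates fall on the same day, or parse in the wrong order, gives `None` and is skipped later.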

In [None]:
get_diff_arg_dec(head_body[0][0])

In [None]:
arg_dec_diffs = []
for h in head_body:
    diff = get_diff_arg_dec(h[0])
    if diff:
        arg_dec_diffs.append(diff)

In [None]:
arg_dec_diffs

In [None]:
import numpy as np
print(len(arg_dec_diffs), np.mean(arg_dec_diffs), np.median(arg_dec_diffs), np.std(arg_dec_diffs))

What about "filed" and "decided"?

In [None]:
def get_diff_fil_dec(header):
    decided = None
    filed = None
    for l in header.split('\n'):
        if "Decided" in l:
            try:
                decided = parser.parse(l, fuzzy=True)
            except ValueError:
                pass
        elif "Filed" in l:
            try:
                filed = parser.parse(l, fuzzy=True)
            except ValueError:
                pass

    if decided and filed:
        days_diff = (filed-decided).days
        if days_diff > 0:
            return days_diff 

In [None]:
fil_dec_diffs = []
for h in head_body:
    diff = get_diff_fil_dec(h[0])
    if diff:
        fil_dec_diffs.append(diff)

In [None]:
print(len(fil_dec_diffs), np.mean(fil_dec_diffs), np.median(fil_dec_diffs), np.std(fil_dec_diffs))

In [None]:
fil_dec_diffs

Since all cases have a "decided" line, let's see how productive the courts are being:

In [None]:
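decided = []
for h in head_body:
    d_date = None  # reset per case so a failed parse cannot reuse the previous date
    for l in h[0].split('\n'):
        if "Decided" in l:
            try:
                d_date = parser.parse(l, fuzzy=True)
            except ValueError:
                pass
    if d_date is not None:
        decided.append(d_date)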

In [None]:
decided[:10]

In [None]:
print(len(decided), len(set(decided)), len(head_body))

When were the most cases decided?

In [None]:
from collections import Counter
pop_dates = Counter(decided).most_common()
pop_dates
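
Counting exact dates can be quite granular. A variation, sketched here on a few invented dates, is to count by month instead by formatting each date as `YYYY-MM` before handing it to `Counter`.

```python
from collections import Counter
from datetime import datetime

# Hypothetical decision dates, made up for illustration.
decided_dates = [
    datetime(1999, 6, 15),
    datetime(1999, 6, 15),
    datetime(1999, 6, 30),
    datetime(2000, 1, 4),
]

by_month = Counter(d.strftime("%Y-%m") for d in decided_dates)
print(by_month.most_common())
```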

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

x = [date for date, count in pop_dates]
y = [count for date, count in pop_dates]

ax = plt.subplot(111)
ax.bar(x, y, width=10)
ax.xaxis_date()

plt.show()

### Patent co-referencing

Some text has something like the following: U.S. Patent No. 5,301,105 ('105 patent).

How can we find all of these pairs?

In [None]:
pairs = []
pattern = re.compile(r"Patent No\. (?P<number>[0-9\,]+)\s\((?:the\s)?'(?P<alt>[0-9]+) patent\)")
for x in head_body:
    pairs.extend(re.findall(pattern, x[1]))

In [None]:
print(len(pairs))
print(pairs[0])

In [None]:
id_dict = {int(k.replace(",", "")): int(v.replace(",", "")) for (k, v) in pairs}
print(id_dict[5337753])
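
Here is the same pattern run on a short string so the shape of the result is visible. The first reference is the example quoted above. The second, with its "the '753 patent" wording, is constructed for illustration.

```python
import re

pattern = re.compile(r"Patent No\. (?P<number>[0-9\,]+)\s\((?:the\s)?'(?P<alt>[0-9]+) patent\)")

# First reference is from the text above; the second is a made-up variant.
text = ("U.S. Patent No. 5,301,105 ('105 patent) and "
        "U.S. Patent No. 5,337,753 (the '753 patent).")

pairs = pattern.findall(text)     # one (number, alt) tuple per match
print(pairs)

id_dict = {int(num.replace(",", "")): int(alt) for num, alt in pairs}
print(id_dict)
```

One thing to watch: converting the short form with `int` drops leading zeros, so a short form like '053 would become 53. Keeping the short form as a string avoids that.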

### Outcomes

We can extract just the outcome paragraph of each case, tag its parts of speech, and then print the verbs:

In [None]:
from nltk.tag import pos_tag
from nltk import word_tokenize

pos_outcomes = []
for i in head_body[:10]:
    body = i[1]
    outcome = body[body.find("OUTCOME:"):].split('\n\n')[0]
    pos_outcome = pos_tag(word_tokenize(outcome)[2:])
    pos_outcomes.append(pos_outcome)
    print(pos_outcome)
    print()
    
for o in pos_outcomes:
    for w in o:
        if w[1].startswith("VB"):
            print(w[0])
    print()
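
The slicing above relies on `str.find`, which returns -1 when the marker is missing, so `body[-1:]` quietly yields the last character of the body. A small helper with an explicit guard might look like this. The name `extract_outcome` and the sample text are my own, for illustration only.

```python
def extract_outcome(body, marker="OUTCOME:"):
    """Return the paragraph following `marker`, or None if the marker is absent."""
    idx = body.find(marker)
    if idx == -1:
        return None
    # keep text up to the next blank line, without the marker itself
    return body[idx + len(marker):].split("\n\n")[0].strip()

# Hypothetical case body, invented for this example.
sample = ("PROCEDURAL POSTURE: Appeal from a district court judgment.\n\n"
          "OUTCOME: The judgment was affirmed.\n\n"
          "CORE TERMS: patent, claim")

print(extract_outcome(sample))
print(extract_outcome("no marker in this text"))
```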

## Classification

First create the `X` (case bodies) and `y` (labels: 1 for reversed, 0 for affirmed) lists:

In [None]:
X = []
y = []


for i in head_body:
    body = i[1]
    outcome = body[body.find("OUTCOME:"):].split('\n\n')[0]
    if "revers" in outcome:
        X.append(body)
        y.append(1)
        
    elif "affirm" in outcome:
        X.append(body)
        y.append(0)

print(len(X), len(y))
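
The labeling rule can be written as a small function, which makes its edge cases easier to test. This is a sketch with a hypothetical name, `label_outcome`. Lowercasing the text first is my own assumption and is not part of the cell above, which matches case-sensitively.

```python
def label_outcome(outcome):
    """Return 1 for a reversal, 0 for an affirmance, None if neither word appears."""
    text = outcome.lower()  # assumption: treat "Reversed" and "reversed" alike
    if "revers" in text:
        return 1
    if "affirm" in text:
        return 0
    return None

# Hypothetical outcome sentences, invented for illustration.
examples = [
    "The judgment was affirmed.",
    "Affirmed in part and reversed in part.",
    "The case was remanded.",
]
labels = [label_outcome(e) for e in examples]
print(labels)
```

Because "revers" is checked first, a mixed outcome counts as a reversal, and outcomes mentioning neither word are left out of the dataset.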

In [None]:
y[:500]

Now import the necessary `scikit-learn` code:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
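# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides the same functions
from sklearn import model_selection as cross_validation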

Training and testing data:

In [None]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=40)

In [None]:
y_train[0]

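Build the TF-IDF features for train and test, fitting the vocabulary on the training set only so that nothing from the test set leaks into the model:

In [None]:
tfidf = TfidfVectorizer()
tfidf.fit(X_train)
X_train = tfidf.transform(X_train)
X_test = tfidf.transform(X_test)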

In [None]:
X_train

Create the model and get score:

In [None]:
svc_class = LinearSVC()
model = svc_class.fit(X_train, y_train)
model.score(X_test, y_test)
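
An accuracy score is only meaningful relative to how unbalanced the labels are. A quick baseline is the accuracy you would get by always predicting the most common class. The labels below are invented for illustration. In the notebook you would pass `y_test` instead.

```python
from collections import Counter

# Hypothetical test labels (1 = reversed, 0 = affirmed), made up for this example.
y_test_demo = [0, 0, 1, 0, 1, 0, 0, 1]

counts = Counter(y_test_demo)
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(y_test_demo)
print(majority_label, baseline_accuracy)
```

If the classifier does not beat this number, it has not learned much beyond the class proportions.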

The same idea as a `scikit-learn` pipeline, this time with unigram and bigram counts and evaluated with k-fold cross-validation (`cv=2`):

In [None]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(random_state=0)),
])
scores = cross_validation.cross_val_score(text_clf, X, y, cv=2)
print(scores, np.mean(scores))

Extract the features with the largest positive weights, i.e. the terms most strongly associated with label 1 (reversal):

In [None]:
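# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
feature_names = tfidf.get_feature_names_out()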
top10 = np.argsort(model.coef_[0])[-10:]
print(list(feature_names[j] for j in top10))
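
The negative end of the weight vector is informative too, since those terms push toward label 0 (affirmed). This sketch shows the sorting idea in plain Python with invented feature names and weights. With the real model you would sort `model.coef_[0]` the same way.

```python
# Hypothetical feature names and weights, made up for illustration.
feature_names_demo = ["affirm", "error", "remand", "reverse", "uphold"]
coefs_demo = [-1.2, 0.4, 0.9, 1.5, -0.7]

# indices sorted by weight, ascending (the same ordering np.argsort gives)
order = sorted(range(len(coefs_demo)), key=lambda j: coefs_demo[j])

top_reverse = [feature_names_demo[j] for j in order[-2:]]  # largest positive weights
top_affirm = [feature_names_demo[j] for j in order[:2]]    # most negative weights
print(top_reverse, top_affirm)
```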

For more on using text features in models, see the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html) for feature extraction.