# Predict who wrote the Supreme Court opinions that don't have authors.

These are called [per curiam](https://en.wikipedia.org/wiki/Per_curiam_decision) decisions. Using the language in opinions that *do* have authors, we can probably predict who wrote them.

You should **probably use the Classification Template we used in class**; this notebook doesn't give you much to start from. Pay particular attention to the part of the template about **custom features**. I hear sentence length is an important one!

Also, the `ny-doctors` assignment is a little more of a walkthrough on the same topic. Might be helpful!
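Since sentence length came up as a possible custom feature, here is a minimal sketch of one way to compute it. The helper name `avg_sentence_length` and the toy text are made up for illustration, and the split on `.`, `!` and `?` is deliberately naive: legal abbreviations like "U. S." will be counted as sentence breaks, so treat the numbers as rough.

```python
import re
import pandas as pd

def avg_sentence_length(text):
    """Average words per sentence, using a naive split on . ! ? (hypothetical helper)."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

# Toy stand-in for the `content` column (made-up text, not real opinions)
toy = pd.DataFrame({'content': [
    "We affirm. The judgment below is correct.",
    "The petition is denied!",
]})
toy['avg_sentence_length'] = toy['content'].apply(avg_sentence_length)
print(toy)
```

On the real data, the same `.apply` call on the `content` column would give one extra numeric column that could sit alongside the n-gram counts.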

In [134]:
import pandas as pd

# Truncate long text columns so the opinions don't flood the output
pd.set_option('display.max_colwidth', 40)
df = pd.read_csv("supreme-court-opinions.csv")
df.head()

| | content | case | author |
|---|---|---|---|
| 0 | NOTICE: This opinion is subject to f... | 09-958 | ROBERTS |
| 1 | NOTICE: This opinion is subject to f... | 09-958 | BREYER |
| 2 | No.\n10–1001 LUIS MARIANO MARTINEZ, ... | 10-1001 | SCALIA |
| 3 | NOTICE: This opinion is subject to f... | 10-1001 | KENNEDY |
| 4 | No.\n10–1016 DANIEL COLEMAN, PETITIO... | 10-1016 | GINSBURG |


### Creating a df with the known authors and another one with the unknown

In [135]:
known_df = df[~df['author'].isnull()].copy()
unknown_df = df[df['author'].isnull()].copy()

### Step 1. Labeling the known opinions / categorizing

In [136]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
known_df['label'] = le.fit_transform(known_df['author'])
known_df.head()

| | content | case | author | label |
|---|---|---|---|---|
| 0 | NOTICE: This opinion is subject to f... | 09-958 | ROBERTS | 6 |
| 1 | NOTICE: This opinion is subject to f... | 09-958 | BREYER | 1 |
| 2 | No.\n10–1001 LUIS MARIANO MARTINEZ, ... | 10-1001 | SCALIA | 7 |
| 3 | NOTICE: This opinion is subject to f... | 10-1001 | KENNEDY | 5 |
| 4 | No.\n10–1016 DANIEL COLEMAN, PETITIO... | 10-1016 | GINSBURG | 2 |
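The integer labels are easier to read with a lookup table next to them. `LabelEncoder` assigns codes in sorted order of the class names and stores them in `classes_`, so a small sketch like this (using a toy author list and a separate `le_demo` encoder, both made up here) shows how to see which number means which justice:

```python
from sklearn.preprocessing import LabelEncoder

# Toy list of authors, not the full dataset
authors = ['ROBERTS', 'BREYER', 'SCALIA', 'BREYER']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(authors)

# classes_ is sorted alphabetically; position in the array is the integer code
lookup = dict(enumerate(le_demo.classes_))
print(list(codes))
print(lookup)
```

With the notebook's own encoder, `dict(enumerate(le.classes_))` would give the same kind of table for all the justices.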


### Step 2.1 Vectorize the contents

In [137]:
from sklearn.feature_extraction.text import CountVectorizer

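# Keep the stop words: function words are part of an author's writing style
vec = CountVectorizer(
    ngram_range=(3, 4),
    max_features=8000
)

# Strip digits, underscores, and two stray characters before counting n-grams.
# regex=True is needed in pandas 2.0+, where str.replace defaults to literal matching.
matrix = vec.fit_transform(known_df.content.str.replace(r'[\d_ܩݐ]', '', regex=True))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())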
features_df.head()

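| | ability or inability | ability or inability to | ability to elect | about how to | about the meaning | about the meaning of | about the scope | about the scope of | about whether the | absence of any | ... | year mandatory minimum | year statute of | year statute of limitations | years after the | years ago in | years of the | yet the court | yet the majority | yulee florida bar | zone of interests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |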
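To see what `ngram_range=(3, 4)` actually produces, it can help to run the vectorizer on a single short sentence. This is a toy sketch with a made-up sentence and a separate `toy_vec`, not the vectorizer fitted above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# One made-up sentence with seven words
toy_docs = ["the court holds that the statute applies"]

toy_vec = CountVectorizer(ngram_range=(3, 4))
toy_vec.fit(toy_docs)

# vocabulary_ maps each n-gram to its column index; the keys are the features
vocab = sorted(toy_vec.vocabulary_)
for phrase in vocab:
    print(phrase)
```

Every feature is a run of three or four consecutive words, which is why the column names above look like "about the meaning of". Single words and two-word phrases never become features with this setting.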


### Step 3. Train a model using the contents

In [138]:
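from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Seems to be the best option here
clf = DecisionTreeClassifier()

# Hold out 20% of the known opinions for testing. No random_state is set,
# so the split (and the scores below) will vary from run to run.
X_train, X_test, y_train, y_test = train_test_split(
    features_df.values,
    known_df.label,
    test_size=0.2)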

### Step 4: Test the classifier

In [139]:
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.94999999999999996

In [140]:
clf.score(X_train, y_train)

1.0
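A training score of 1.0 is what a fully grown decision tree usually gives, since it can memorize its training data, and a single random 20% test split can come out lucky or unlucky. One way to get a steadier estimate is cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in (the names `X_demo` and `y_demo` are made up here), so the numbers it prints say nothing about the Supreme Court data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 20 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5,
    n_classes=3, random_state=0)

# Five different train/test splits instead of one
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5)

print(scores)
print("mean:", scores.mean(), "std:", scores.std())
```

On this notebook's data, the equivalent call would pass `features_df.values` and `known_df.label` in place of the synthetic arrays.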

### Step 5: Vectorize the unclassified contents and make a prediction

In [143]:
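# Clean the unknown opinions the same way as the training text before vectorizing
matrix = vec.transform(unknown_df.content.str.replace(r'[\d_ܩݐ]', '', regex=True))
clf.predict(matrix)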

array([2, 8, 5, 5, 5, 1, 5, 5, 5, 5, 2, 1, 2, 8, 5, 8, 8, 2, 8, 2, 5, 5, 8,
       2, 5, 1, 5, 5, 5, 5, 5, 0, 8, 5, 5, 5, 5, 0, 0, 7, 2, 5, 5, 2, 5, 5,
       5, 0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 5, 8, 2, 8, 9, 9, 8, 5, 2, 9, 5,
       8, 2, 5])
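A raw array of codes is hard to read, and it is worth checking whether one label dominates. A small sketch of tallying predictions by name, using a toy encoder and a made-up list of predicted codes rather than the output above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy encoder and toy predictions, for illustration only
le_demo = LabelEncoder().fit(['ALITO', 'BREYER', 'GINSBURG'])
predicted = [2, 1, 2, 2, 0]

# Turn codes back into names, then count how often each name appears
names = le_demo.inverse_transform(predicted)
counts = pd.Series(names).value_counts()
print(counts)
```

With the real objects, `pd.Series(le.inverse_transform(clf.predict(matrix))).value_counts()` would show how many per curiam opinions were assigned to each justice.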

### Step 6: Update the dataframe

In [144]:
unknown_df['author'] = le.inverse_transform(clf.predict(matrix))
unknown_df.head()

| | content | case | author |
|---|---|---|---|
| 20 | JAVIER CAVAZOS, ACTING WARDEN v. SHI... | 10-1115 | GINSBURG |
| 43 | KPMG LLP v. ROBERT COCCHI ET AL.\nAg... | 10-1521 | SOTOMAYOR |
| 44 | DAVID BOBBY, WARDEN v. ARCHIE DIXON ... | 10-1540 | KENNEDY |
| 79 | Slip Opinion NOTICE: This opinion is... | 10-7081b2d | KENNEDY |
| 122 | v. LORENZO JOHNSON Respondent Lorenz... | 11-1053 | KENNEDY |
