# Statistical Learning Final Project: Predicting Supreme Court Outcomes
_by Miranda Seitz-McLeese_

Author's note: I have chosen to do this project in an IPython notebook. My project makes heavy use of [pandas](http://pandas.pydata.org) and [scikit-learn](http://scikit-learn.org/stable/index.html).

## Customer
For this project I was inspired by friends in law school, who showed me some of the data sets available in that area. My imagined customer is therefore a law firm that wants to tell its clients accurately how an argument went and what the likely outcome is. A law firm might also want to know what effective advocates do to increase the likelihood of securing a victory for their clients. Finally, a law firm, or any lawyer engaged in legal research, might be interested in finding cases that deal with similar facts.


## Objective
I had three objectives for this analysis, which I briefly mentioned above:
1. Cluster cases based on facts to allow legal researchers to find similar cases.
2. Predict the outcome of a case (for this analysis, restricted to Supreme Court cases, because of the data available).
3. Analyze feature importance based on my model for the second objective to see what makes for an effective argument.

## Data
I mined my data from two sources. From [The Oyez Project](https://www.oyez.org) I got the transcripts, some justice voting data, and the facts of each case, as well as some other data that I ended up not using for this analysis. From [The Supreme Court Database](http://supremecourtdatabase.org/) I got additional decision data, as well as metadata about procedural history and parties that I also did not end up using.

[The Supreme Court Database](http://supremecourtdatabase.org/) provides downloads in comma-separated value (CSV) format, and I used [scrapy](http://scrapy.org) to scrape the data from [The Oyez Project](https://www.oyez.org). I combined these sources in an SQL database.

To perform my analysis, I wrote a function that pulls the data from my SQL database into a [pandas](http://pandas.pydata.org) DataFrame and then performs some basic cleaning and transformations to consolidate the data so that I have only one row for each case.
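
To give a sense of what that kind of consolidation looks like, here is a minimal sketch. It is not the function from the project: the `turns` table, its columns, and the docket numbers are hypothetical stand-ins, and an in-memory SQLite database replaces the real one. It reads one row per speaking turn and pivots it into one row per case, with columns named by a `statistic_speaker` prefix convention.

```python
import sqlite3

import pandas as pd

# Hypothetical schema (an assumption): one row per speaking turn.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE turns (docket TEXT, speaker TEXT, seconds REAL, text TEXT);
    INSERT INTO turns VALUES
        ('12-345', 'SCALIA',   12.5, 'What is the standard?'),
        ('12-345', 'SCALIA',    4.0, 'I see.'),
        ('12-345', 'ADVOCATE', 30.0, 'The standard is strict scrutiny.'),
        ('13-678', 'GINSBURG',  8.0, 'Is that in the record?');
""")

# Pull the raw rows into a DataFrame, then release the connection.
turns = pd.read_sql_query("SELECT * FROM turns", conn)
conn.close()

# Aggregate to one row per (case, speaker) pair.
per_speaker = turns.groupby(["docket", "speaker"]).agg(
    length=("seconds", "sum"),  # total seconds spoken
    turns=("text", "count"),    # number of speaking turns
)

# Move speakers into columns so that each case occupies exactly one row.
# Speakers who did not appear in a case get 0.
wide = per_speaker.unstack("speaker", fill_value=0)
wide.columns = [f"{stat}_{speaker}" for stat, speaker in wide.columns]

print(wide)
```

This is only an illustration of the reshaping step; the project's own function is the one the rest of this notebook relies on.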

Below I use this function to read in the data. The full text for the function can be found in the learn submodule of the scotus module source code. 

In [1]:
from scotus.learn.vote_predict import lines_data
data = lines_data()

I have 254 different features in this DataFrame, as well as some columns containing metadata, and my vote and outcome data. For ease of reading throughout the rest of this document I have split the features into categories below, so that I can access the ones I need for each analysis. Comments in the code below give a brief description of each category.

In [3]:
column_names = data.columns.values

facts = ['facts']  # a one to two paragraph summary of the facts of the case.

speaker = ['speaker'] # this is a paragraph that lists the names of the speakers in the order they
                      # spoke during the argument.

turn_text = [x for x in column_names if x.split('_')[0]=='text'] # this is multiple features, one text document
                                                                 # per justice/advocate per party to the case
                                                                 # with all statements from that justice or advocate
                                                                 # during that party's speaking time.

count_features = ['turns', 'question', 'interrupted', 'interruption', 'humor']

count = [x for x in column_names if x.split('_')[0] in count_features] # this is multiple features, one for each
                                                                       # statistic per justice/advocate per party
                                                                       # counting the number of occurrences during
                                                                       # that party's speaking time.

length = [x for x in column_names if x.split('_')[0] == 'length']      # this is multiple features,
                                                                       # one per justice/advocate per party
                                                                       # containing the number of seconds that 
                                                                       # justice/advocate spoke during that party's 
                                                                       # speaking time.


## Clustering
In this section I will discuss the analysis and results for the first goal I had, namely to cluster cases based on facts to allow legal researchers to find similar cases.

### Techniques
This is an unsupervised problem: I do not have a labeled training set, but rather I am interested in finding patterns and clusters in the data. I used two techniques to achieve different facets of my goal: k-means clustering to find the clusters, and a k-nearest-neighbors search to look up similar cases. I chose these techniques because neither needs labels. Additionally, because I am looking for "similar" cases, it makes sense to choose features and a distance metric such that cases that are "similar" are "near" each other in my feature space. I chose k-means clustering largely for computational reasons, and my results were good.
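
As a rough sketch of how these two pieces fit together in scikit-learn, the example below clusters a small synthetic matrix with `KMeans` and then looks up the nearest neighbors of one row with `NearestNeighbors`. The data, the number of clusters, and the number of neighbors are all assumptions chosen for illustration; they are not the settings or the features used in this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for case feature vectors (an assumption):
# two well-separated groups of 20 "cases" in 5 dimensions.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 5))
group_b = rng.normal(loc=5.0, scale=0.1, size=(20, 5))
X = np.vstack([group_a, group_b])

# k-means assigns every case to one of k clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Nearest-neighbor search: find the cases closest to a query case.
# The default metric is Euclidean distance.
nn = NearestNeighbors(n_neighbors=4).fit(X)
distances, indices = nn.kneighbors(X[:1])

# The query row is part of the index, so its own entry comes back
# first at distance zero; drop it to keep only the other cases.
similar = indices[0][1:]

print("cluster of case 0:", labels[0])
print("cases most similar to case 0:", similar)
```

Both steps use the same notion of distance, which is why the choice of feature space matters so much for the quality of the results.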

### Considerations 
The first thing I had to contend with in this analysis was the curse of dimensionality. My features here were word counts from the two-paragraph description of the facts of the case. This made my feature space sparse and gave it a very high dimensionality. To deal with this I used Latent Semantic Analysis to project my feature space down to a smaller space of only 100 dimensions. This also helps with overfitting. Additionally, to minimize the risk of bias or overfitting, I required words to show up in at least three documents before I included them in my vocabulary, and I limited the number of clusters.
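
The pipeline described above can be sketched with scikit-learn as word counts from `CountVectorizer` followed by `TruncatedSVD`, which is how scikit-learn performs Latent Semantic Analysis on a sparse count matrix. The toy documents below are invented for illustration, and because the toy vocabulary is tiny the sketch keeps 2 components rather than the 100 used in this analysis. Only the `min_df=3` threshold mirrors the text.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Invented stand-ins for the fact summaries (an assumption).
docs = [
    "police searched the car without a warrant",
    "police searched the home without a warrant",
    "officers searched the car after the arrest",
    "the company fired the employee after the complaint",
    "the employee sued the company for discrimination",
    "the company denied the discrimination complaint",
]

# min_df=3 keeps only words that appear in at least three documents.
vectorizer = CountVectorizer(min_df=3)
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Project the count matrix onto a small number of latent dimensions.
# n_components must not exceed the vocabulary size, hence 2 here.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(counts)

print("vocabulary:", sorted(vectorizer.vocabulary_))
print("count matrix shape:", counts.shape)
print("reduced shape:", reduced.shape)
```

On a corpus this small the document-frequency threshold removes almost every word, which shows how aggressively it prunes rare terms; on the full set of cases far more of the vocabulary survives.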

### Results


In [None]:
import numpy as np
import pandas as pd