# Statistical Learning Final Project: Predicting Supreme Court Outcomes
_by Miranda Seitz-McLeese_

## Customer
For this project I was inspired by friends in law school, who showed me some of the data sets available in that field. My imagined customer is therefore a law firm that wants to tell its clients accurately how an argument went and what the likely outcome is. A law firm might also want to know what effective advocates do to increase the likelihood of securing a victory for their clients. Finally, a law firm, or any lawyer engaged in legal research, might be interested in finding cases that deal with similar facts.


## Objective
I had three objectives for this analysis, which I briefly mentioned above:
1. Cluster cases based on facts to allow legal researchers to find similar cases.
2. Predict the outcome of a case (for this analysis, restricted to Supreme Court cases, because of the data available).
3. Analyze feature importance based on my prediction model from the second objective to see what makes for an effective argument.
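
To make the first objective concrete, here is a minimal sketch of finding the case most similar to a given one by comparing fact summaries. It uses a plain TF-IDF weighting with cosine similarity (a nearest-neighbor lookup, not necessarily the clustering method used in this project). The fact texts and the helper names `tfidf_vectors` and `cosine` are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({term: count * math.log(float(n_docs) / doc_freq[term])
                        for term, count in counts.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up fact summaries, purely for illustration.
facts = [
    "police searched the car without a warrant",
    "officers searched the home without a warrant",
    "the company challenged the patent on software",
]
vectors = tfidf_vectors(facts)

# Index of the case whose facts are closest to the first case.
most_similar_to_first = max(range(1, len(facts)),
                            key=lambda i: cosine(vectors[0], vectors[i]))
```

Terms that appear in every document (here, "the") get zero weight, so similarity is driven by the less common words that two fact summaries share.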

## Data
I drew my data from two sources. From [The Oyez Project](https://www.oyez.org) I got the transcripts, some justice voting data, and case facts, along with other data that I ended up not using for this analysis. From [The Supreme Court Database](http://supremecourtdatabase.org/) I got additional decision data, as well as metadata about procedural history and parties that I also did not end up using.

[The Supreme Court Database](http://supremecourtdatabase.org/) provides downloads as comma-separated value (CSV) files, and I used [scrapy](http://scrapy.org) to scrape the data from [The Oyez Project](https://www.oyez.org). I combined these sources in an SQL database.
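
As a rough illustration of combining a CSV download with scraped records in one SQL database, the sketch below loads both into an in-memory SQLite database and joins them on a shared case identifier. The table names, column names, and sample rows are all hypothetical and do not reflect the actual schema of the `scotus` database.

```python
import csv
import io
import sqlite3

# Stand-in for a downloaded CSV file; the column names are hypothetical.
csv_text = "case_id,winning_side\n1,petitioner\n2,respondent\n"

# Stand-in for items yielded by a scraper; the fields are hypothetical.
scraped_items = [
    {"case_id": 1, "facts": "made-up facts for case 1"},
    {"case_id": 2, "facts": "made-up facts for case 2"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE decisions (case_id INTEGER PRIMARY KEY, winning_side TEXT)")
conn.execute("CREATE TABLE case_facts (case_id INTEGER PRIMARY KEY, facts TEXT)")

# Load the CSV rows into the decisions table.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [(int(row["case_id"]), row["winning_side"]) for row in reader]
conn.executemany("INSERT INTO decisions VALUES (?, ?)", rows)

# Load the scraped items into the case_facts table using named placeholders.
conn.executemany("INSERT INTO case_facts VALUES (:case_id, :facts)", scraped_items)
conn.commit()

# Join the two sources on the shared case identifier.
combined = conn.execute(
    "SELECT d.case_id, d.winning_side, f.facts "
    "FROM decisions d JOIN case_facts f ON f.case_id = d.case_id "
    "ORDER BY d.case_id"
).fetchall()
```

In practice the two sources rarely share a clean key, so matching cases (for example by docket number) is usually the hard part. This sketch assumes that matching has already been done.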

To perform my analysis, I wrote a function that pulls the data from my SQL database into a [pandas](http://pandas.pydata.org) DataFrame and then performs some basic cleaning and transformations to consolidate the data so that there is only one row for each case.
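
The general pattern of reading SQL rows into pandas and collapsing them to one row per case can be sketched as follows. This is a simplified stand-in, not the project's actual function: the `turns` table, its columns, and the sample rows are assumptions made for the example.

```python
import sqlite3

import pandas as pd

# Hypothetical table with one row per speaking turn in an oral argument.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE turns (case_id INTEGER, speaker TEXT, text TEXT)")
conn.executemany("INSERT INTO turns VALUES (?, ?, ?)", [
    (1, "advocate", "May it please the Court"),
    (1, "roberts", "Counsel, a question"),
    (2, "advocate", "Thank you"),
])
conn.commit()

# Pull the rows into a DataFrame, keeping insertion order.
turns = pd.read_sql_query(
    "SELECT case_id, speaker, text FROM turns ORDER BY rowid", conn)

# Consolidate to one row per case: count the turns and concatenate the text.
per_case = turns.groupby("case_id").agg(
    {"speaker": "count", "text": lambda x: " ".join(x)}
).rename(columns={"speaker": "turns"})
```

Each aggregator decides how many rows become one value: counts and sums for numeric features, string concatenation for transcript text.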

Below I use this function to read in the data. The full source of the function can be found in the learn submodule of the scotus module source code. The function returns a DataFrame, and the cell below prints its shape.

In [None]:
from scotus.learn.vote_predict import lines_data
data = lines_data()
print(data.shape)

In [1]:
from scotus.learn.vote_predict import *
from scotus.db import DB
from scotus.settings import DEFAULT_DB
from sqlalchemy.sql import select, and_
from pandas.io import sql
from scotus.db.models import *
import numpy as np
import pandas
db = DB(DEFAULT_DB)
#data = fetch_data(db) 
transcript_data = get_transcript_data(db)
case_data = get_case_data(db)
vote_data = get_vote_data(db)
session = db.Session()
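try:
    # Reduce each justice's full name to a lowercase last name: strip trailing
    # 'I' characters (e.g. a "II" suffix), drop anything after a comma, and
    # keep the last remaining word.
    names = Justice.by_name(session).iteritems()
    justice_name = [{'id': justice_id,
                     'speaker': name.rstrip('I').lower().split(',')[0].split()[-1]}
                    for name, justice_id in names]
finally:
    session.close()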
name_data = pandas.DataFrame.from_records(justice_name, index='id')
data = pandas.merge(vote_data, transcript_data, on=['case_id', 'justice_id'], how='outer').join(name_data, on='justice_id', how='left')
data = data.join(case_data, on=['case_id'], how='left')
#data.loc[data['justice_id'].apply(j_gender, axis=1) , 'gender'] = '0'
data.loc[data['kind']=='advocate', 'speaker'] = 'advocate'
data.rename(columns={'dec_date': 'date_decided',
                   'date':'date_argued',
                   'dec_type':'decision_type',
                   'kind':'turns'}, inplace=True)
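# For per curiam decisions, treat any vote not labeled majority or minority as a majority vote.
unlabeled_vote = (data['vote'] != 'majority') & (data['vote'] != 'minority')
data.loc[unlabeled_vote & (data['decision_type'] == 'per curiam'), 'vote'] = u'majority'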
data['vote_side'] = data.apply(get_vote_side_numeric, axis=1)
data.dropna(subset=['facts', 'speaker'], inplace=True)
# Take one vote per (case, speaker), then count majority and minority votes per case.
first_votes = data.groupby(['case_id', 'speaker'], as_index=False)['vote'].agg('first')
majority = first_votes.groupby('case_id').agg(lambda x: len(x[x == 'majority']))['vote']
minority = first_votes.groupby('case_id').agg(lambda x: len(x[x == 'minority']))['vote']
# data = data[(data['speaker'] == 'advocate') | ((data['justice_id'] < 11) & (data['justice_id'] != 8))] 
cases = reshape(data,
              groups=['case_id'],
              aggregators={'facts': 'first',
                           'name':'first',
                           'winning_side': 'first',
                           'date_decided': 'first',
                           'decision_type': 'first',
                           'date_argued': 'first',
                           'speaker': lambda x: ' '.join(x)
                           })
votes = reshape(data,
              groups=['case_id', 'speaker'],
              aggregators={'vote_side':'first'},
              stack=['speaker'],
              drop=['vote_side_advocate'])
speakers = reshape(data,
                 groups=['case_id', 'side', 'speaker'],
                 aggregators={'turns':'count',
                              'interrupted': 'sum',
                              'interruption': 'sum',
                              'length': 'sum',
                              'humor': 'sum',
                              'choppiness': 'sum',
                              'question': 'sum',
                              'text': lambda x: ' '.join(x)},
                stack=['speaker', 'side'])
data = cases.join([votes, speakers])
data['majority'] = majority
data['minority'] = minority
data.dropna(subset=['date_argued'], inplace=True)

KeyError: 'date_decided'
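
One possible cause of a `KeyError` like the one above is that a column expected by a later step was never created. `DataFrame.rename` silently skips labels that are not present, so a missing source column goes unnoticed until something tries to use the new name. The sketch below demonstrates this on a toy frame and shows an up-front check. The frame and its columns are invented, and this is only a guess at the failure, not a confirmed diagnosis.

```python
import pandas as pd

# Toy frame that lacks the source column 'dec_date'.
frame = pd.DataFrame({"case_id": [1, 2],
                      "date": ["2015-10-05", "2015-10-06"]})

# rename() skips labels that are absent, so no error is raised here.
renamed = frame.rename(columns={"dec_date": "date_decided",
                                "date": "date_argued"})

# Checking for required columns up front gives a clearer failure
# than a KeyError raised later during aggregation.
required = ["date_decided", "date_argued"]
missing = [column for column in required if column not in renamed.columns]
if missing:
    print("Missing columns: {}".format(", ".join(missing)))
```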

In [4]:
import datetime

2016-10-02


In [None]:
import numpy as np
import pandas as pd