## **Supreme Court Transcripts Database Design**

### Contents:
 1. Finding Justices Present
 2. Date, Year
 3. Appearances
 4. Sentiment Analysis 
 5. Building the DataFrame
 
## Note to everyone: upload the text files from this folder before running anything: https://drive.google.com/drive/folders/1aepNIVRUS0rwu-m_fqK7KnWEqhyJFBvW (my directory path may be different from yours, so check it in the cells below!)

In [4]:
import pandas as pd
import regex as re
import numpy as np
import os
from os import listdir
from os.path import isfile,join

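# Read in every plain-text transcript in the project folder
folder = '/home/jovyan/Liberating Archives Project'
files = []
for i in os.listdir(folder):
    if i.endswith('.txt'):
        # Join with the folder so the file opens from any working directory
        with open(join(folder, i), encoding='utf-8') as f:
            files.append(f.read())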

In [99]:
files[0]

<_io.TextIOWrapper name='SEC v. Zandford.pdf.txt' mode='r' encoding='UTF-8'>

In [None]:
cleaned = []
for txt in files:
    clean = re.sub('\xad', '', txt)    # drop soft hyphens
    clean = re.sub('\n', '', clean)    # drop line breaks
    clean = re.sub(r'\\', '', clean)   # drop backslashes
    cleaned.append(clean)

### **Finding Justices Present**

In [18]:
def unique(lst):
    uni = []
    for i in lst:
        if i not in uni:
            uni += [i]
    return uni

def case_no(txt):
    # Escape the dot so only a literal "No." matches; raw string avoids escape warnings
    return unique(re.findall(r'No\.\s\d+-*\d+', txt))

for i in cleaned:
    print(case_no(i))

In [21]:
def justices(texts):
    d = {}
    for txt in texts:
        num = case_no(txt)[0]
        j = re.findall(r'JUSTICE[A-Z\s]+:', txt)
        justice = sorted(unique(j))
        cleaned_list = [name[:-1] for name in justice]  # strip the trailing colon
        d[num] = cleaned_list
    return d

## **Date, Year**

In [22]:
def date(text):
    return re.findall(r'\w+\s+\d+,\s+\d{4}', text)[0]

In [23]:
def year(text):
    return re.findall(r'\d{4}', date(text))

In [24]:
# This doesn't work yet; titles are taken from the filenames instead (see Building the DataFrame)
def title(text):
    s = re.findall('[-\s]*[\w\n\s\d.,\/#!$%\^&\*;:{}=\-_`~()]*[-\s]*',text)[0]
    s1 = re.findall('\w+\s\w+\s\w+',s)
    first = re.findall('\s\w+',s1[0])
    first = ''.join(first)
    last = re.findall('\s\w+',s1[1])
    last = ''.join(last)
    title = first + ' v.'+last
    return title

## **Appearances**

In [25]:
def appearances(text):
    # Grab the block between "APPEARANCES:" and the next "Reporting"
    app = re.findall(r'APPEARANCES:[\s\S]*?Reporting', text)[0]
    # Each entry runs up to a semicolon and then on to the next period
    app = re.findall(r'[\s\S]*?;[\s\S]*?\.', app)
    app = [re.sub(r'\d', '', a) for a in app]       # drop digits
    app[0] = re.sub(r'APPEARANCES:\s', '', app[0])  # drop the section label
    app = [re.sub(r'\s+', ' ', a) for a in app]     # collapse whitespace
    return app

In [26]:
for i in cleaned:
    app = appearances(i)
    for j in app:
        print(j)
    print('-----------------------')

JAMES W. DABNEY, ESQ., New York, N.Y.; on behalf of Petitioner.
 THOMAS G. HUNGAR, ESQ., Deputy Solicitor General, Department of Justice, Washington, D.C.; on behalf of the United States, as amicus curiae, supporting Petitioner.
-----------------------
THEODORE B. OLSON, ESQ., Washington, D.C.; on behalf of Petitioner.
 DARYL JOSEFFER, ESQ., Assistant to the Solicitor General, Department of Justice, Washington, D.C.; On behalf of the United States, as amicus curiae, supporting Petitioner.
 SETH P. WAXMAN, ESQ., Washington, D.C.; on behalf of Respondent.
-----------------------
KEVIN K. RUSSELL, ESQ., Washington, D.C.; on behalf of the Petitioner.
 GLEN D. NAGER, ESQ., Washington, D.C.; on behalf of the Respondent.
 IRVING L. GORNSTEIN, ESQ., Assistant to the Solicitor General, Department of Justice, Washington, D.C.; as amicus curiae on behalf of the Respondent.
-----------------------
JAMES R. MILKEY, ESQ., Assistant Attorney General, Boston, Mass; on behalf of Petitioners.
 GREGORY C

## **Sentiment Analysis**

**Checkpoint 10/22**

* Still need to find a regex pattern that captures the text between speakers.
* General pattern: "SPEAKER: anything they say until the next SPEAKER:"
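
One way the general pattern above could be sketched is to locate every speaker label first and then slice the text between consecutive labels, so no label is consumed by the previous match. This is only a sketch: the `SPEAKER` pattern, the list of label prefixes (`MR.`, `JUSTICE`, and so on), the name `speaker_turns`, and the sample text are all assumptions for illustration, not something taken from the transcripts themselves.

```python
import re

# Assumption: every speaker label starts with one of these prefixes and is
# followed by a single upper-case surname and a colon, e.g. "MR. DABNEY:".
SPEAKER = re.compile(
    r"\b((?:CHIEF JUSTICE|JUSTICE|MR\.|MS\.|MRS\.|GENERAL)\s[A-Z][A-Z'\-]+):"
)

def speaker_turns(text):
    """Return a list of (speaker, speech) pairs in transcript order."""
    matches = list(SPEAKER.finditer(text))
    turns = []
    # Pair each label with the next one; the last label runs to the end.
    for cur, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(text)
        speech = " ".join(text[cur.end():end].split())  # collapse whitespace
        turns.append((cur.group(1), speech))
    return turns

# Made-up sample text, shaped like a cleaned transcript.
sample = (
    "CHIEF JUSTICE ROBERTS: We'll hear argument next. Mr. Dabney. "
    "MR. DABNEY: Mr. Chief Justice, and may it please the Court. "
    "JUSTICE GINSBURG: What is the test?"
)
for speaker, speech in speaker_turns(sample):
    print(speaker, "->", speech)
```

Restricting labels to known prefixes keeps all-caps headings such as "ORAL ARGUMENT OF ..." from being swallowed into a speaker name, at the cost of missing any speaker whose label uses a prefix not in the list.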

### STEPS:
1. Extract the sentences spoken by each speaker.
2. Write a sentiment function. The notebook at the GitHub link below has a very good one that you could model yours on.
3. Test it on several transcripts to make sure it generalizes.
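
To make the steps above concrete, here is a deliberately tiny sketch of step 2. It assumes the speeches are already available as `(speaker, speech)` pairs, and it scores them with a hypothetical hand-made word list; `POSITIVE`, `NEGATIVE`, `sentiment_score`, and `score_by_speaker` are placeholder names, and a real analysis would swap in a proper lexicon or a trained model.

```python
import re

# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"agree", "correct", "good", "clear", "right"}
NEGATIVE = {"wrong", "problem", "unclear", "difficult", "disagree"}

def sentiment_score(speech):
    """Return (positive words - negative words) / total words, or 0.0 if empty."""
    words = re.findall(r"[a-z']+", speech.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

def score_by_speaker(turns):
    """Average the per-speech scores for each speaker."""
    scores = {}
    for speaker, speech in turns:
        scores.setdefault(speaker, []).append(sentiment_score(speech))
    return {s: sum(v) / len(v) for s, v in scores.items()}
```

Averaging per speaker rather than per transcript keeps a long-winded advocate from dominating the score, which matters for step 3 when transcripts differ a lot in length.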

### Note to Amal: Please look at the material linked below on performing sentiment analysis. It is from my IEOR class, and the techniques are very useful. Let me know if you would like to go over it together.

https://github.com/ikhlaqsidhu/data-x/blob/master/07a-tools-nlp-sentiment_add_missing_si/notebook-nlp-sentiment-analysis-imdb-afo_v2.ipynb

In [30]:
## Attempt at extracting each speaker's turn from a transcript.

def dialogue(text):
    # Matches "SPEAKER: ... NEXT:". Each match also consumes the next speaker's
    # label, so the turn that follows it is not captured on its own.
    sents = re.findall(r'[A-Z\s]+:[\s\S]+?[A-Z]+?:', text)
    sents = [re.sub(r'\d', '', i) for i in sents]    # drop digits
    sents = [re.sub(r'\s+', ' ', i) for i in sents]  # collapse whitespace
    return sents

# Alternative pattern that also allows periods in the label (e.g. "MR. DABNEY:")
other_regex_pattern = r'[A-Z\s.]+:[\s\S]+?[A-Z\s]+?:'

In [48]:
for i in re.findall(r'(MR\. DABNEY:)([\s\S]+?)([A-Z][A-Z\s]*:)', cleaned[0]):
    print(i)
    print('')

In [49]:
dialogue(cleaned[0])[1]

" CHIEF JUSTICE ROBERTS: We'll hear argument next in No. -, KSR International versus Teleflex, Incorporated. Mr. Dabney. ORAL ARGUMENT OF JAMES W. DABNEY ON BEHALF OF THE PETITIONER MR. DABNEY:"

## **Building the DataFrame**

In [95]:
##Running functions over files

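all_justices = justices(cleaned)  # build the lookup once instead of once per case
cases = [case_no(i)[0] for i in cleaned]
justice = [all_justices.get(num) for num in cases]
dates = [date(i) for i in cleaned]
years = [year(i)[0] for i in cleaned]
people = [appearances(i) for i in cleaned]
# Keep only the .txt files, matching the filter used when reading them in;
# an unfiltered listing can have more entries than the other columns.
folder = '/home/jovyan/Liberating Archives Project'
title = [f for f in listdir(folder) if f.endswith('.txt') and isfile(join(folder, f))]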

In [96]:
data = pd.DataFrame({'Title':title,'Case No': cases,'Justices':justice,'Date': dates,'Year':years,'Appearances':people})

ValueError: arrays must all be same length
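
One way to avoid mismatched column lengths altogether is to build one row per file, so the title and every extracted field always come from the same transcript. The sketch below uses only the standard library; `build_rows` and its `extractors` argument are hypothetical names, and it assumes each extractor takes the file's text and may raise `IndexError` when its pattern finds nothing (as the `[0]` lookups in this notebook do).

```python
import os

def build_rows(directory, extractors):
    """Return one dict per .txt file: its filename plus each extracted field."""
    rows = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith('.txt'):
            continue  # skip PDFs, notebooks, and anything else in the folder
        with open(os.path.join(directory, name), encoding='utf-8') as fh:
            text = fh.read()
        row = {'Title': name}
        for column, func in extractors.items():
            try:
                row[column] = func(text)
            except IndexError:
                row[column] = None  # pattern not found in this transcript
        rows.append(row)
    return rows
```

A list of dicts like this can be passed straight to `pd.DataFrame(...)`, for example `pd.DataFrame(build_rows(folder, {'Date': date, 'Appearances': appearances}))`. Note that this sketch passes the raw text to each extractor, so the cleaning step would need to be folded in first.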

In [42]:
pd.to_datetime(data['Date'])

0    2006-11-28
1    2007-02-21
2    2006-11-27
3    2006-11-29
4    2008-12-02
5    2009-02-24
6    2009-04-29
7    2012-01-10
8    2012-03-21
9    2014-01-14
10   2016-01-12
11   2018-03-27
Name: Date, dtype: datetime64[ns]