## **Supreme Court Transcripts Database Design**

### Contents:
 1. Finding Justices Present
 2. Date, Year
 3. Appearances
 4. Sentiment Analysis 
 5. Building the DataFrame
 
## Note to Everyone: Make sure you've uploaded the text files from this folder: https://drive.google.com/drive/folders/1aepNIVRUS0rwu-m_fqK7KnWEqhyJFBvW (my directory might be different from yours, so make sure to check the path before running)
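
Since everyone's directory will differ, here is a minimal sketch of loading the transcripts from a configurable folder. The helper name `load_transcripts` and the `TRANSCRIPT_DIR` location are hypothetical (they are not part of the notebook), and sorting by file name is an added choice.

```python
from pathlib import Path

def load_transcripts(folder):
    """Return the contents of every .txt file in `folder`, sorted by file name."""
    folder = Path(folder)
    return [p.read_text(encoding="utf8") for p in sorted(folder.glob("*.txt"))]

# Hypothetical location: replace with wherever you saved the text files.
TRANSCRIPT_DIR = Path.home() / "supremecourt" / "textfiles"

if TRANSCRIPT_DIR.is_dir():
    files = load_transcripts(TRANSCRIPT_DIR)
    print(len(files))
else:
    print(f"Folder not found: {TRANSCRIPT_DIR}")
```

Keeping the path in one variable means only that line needs to change from machine to machine.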

In [1]:
import pandas as pd
import regex as re
import numpy as np
import os
from os import listdir
from os.path import isfile,join

# Read in every plain-text transcript in the folder
files = []
path = r"C:\Users\Avena Cheng\Desktop\Liberating Archives\supremecourt\textfiles"
for i in os.listdir(path):
    if i.endswith('.txt'):
        # the with-block closes each file after reading
        with open(join(path, i), encoding="utf8") as f:
            files.append(f.read())

In [2]:
len(files)

1302

In [3]:
cleaned = []
for txt in files:
    clean = re.sub('\xad','',txt)      # drop soft hyphens
    clean = re.sub('\n','',clean)      # drop newlines
    clean = re.sub('\\\\','',clean)    # drop backslashes
    cleaned += [clean]

## **Finding Justices Present**

In [4]:
def unique(lst):
    uni = []
    for i in lst:
        if i not in uni:
            uni += [i]
    return uni

def case_no(txt):
    # escape the dot so only a literal "No." matches
    return unique(re.findall(r'No\.\s*\d+[-]*\d+',txt))

In [5]:
def justices(txt):
    j = re.findall('JUSTICE[A-Z\s]+:',txt)
    justice = sorted(unique(j))
    cleaned_list = [justice[i][:-1] for i in range(len(justice))]
    return cleaned_list

In [6]:
#def justices(texts):
 #   d = {}
  #  for txt in texts:
   #     num = case_no(txt)[0]
    #    j = re.findall('JUSTICE[A-Z\s]+:',txt)
     #   justice = sorted(unique(j))
      #  cleaned_list = [justice[i][:-1] for i in range(len(justice))]
       # d[num] = cleaned_list
    #return d

## **Date, Year**

In [7]:
def date(text):
    return re.findall('\W\w+,*\s+\d+\w*\w*,\s+\d{4}',text)

In [8]:
def year(text):
    # date() returns a list of matches, so take the year from the first one
    return re.findall(r'\d\d\d\d',date(text)[0])

## **Appearances**

In [9]:
def appearances(text):
    app = re.findall('APPEARANCES:[\s\S]*?Reporting',text)[0]
    app = re.findall('[\s\S]*?;[\s\S]*?\.',app)
    app = [re.sub('\d','',app[i]) for i in range(len(app))]
    remove_appearance = re.sub('APPEARANCES:\s','',app[0])
    app[0] = remove_appearance
    app = [re.sub('[\s\s]+',' ',app[i]) for i in range(len(app))]
    return app

In [10]:
#for i in cleaned:
 #   app = appearances(i)
  #  for j in app:
   #     print(j)
    #print('-----------------------')

## **Sentiment Analysis**

**Checkpoint 10/22**

* need to work on finding the regex pattern between speakers
* General pattern: "SPEAKER: anything they say until the next SPEAKER:"

### STEPS:
1. Extract the sentences from each speaker.
2. Develop a function (actually there's one written in the github link below that you could model yours from; it's very good)
3. Test it on various transcripts to ensure it's generalized
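
As a starting point for step 1, here is a minimal sketch of the "SPEAKER: anything they say until the next SPEAKER:" idea. It assumes speaker labels look like `MR. X:`, `MS. X:`, `JUSTICE X:`, `CHIEF JUSTICE X:` or `QUESTION:`; that list of label shapes is an assumption and will probably need extending. The names `SPEAKER_LABEL` and `speaker_turns` are hypothetical, and the sketch uses the standard-library `re` module rather than `regex`.

```python
import re

# Assumed label shapes; extend this alternation as new label styles turn up.
SPEAKER_LABEL = re.compile(
    r"\b((?:MR|MS|MRS)\. [A-Z'\-]+|(?:CHIEF )?JUSTICE [A-Z'\-]+|QUESTION):"
)

def speaker_turns(text):
    """Split a transcript into (speaker, speech) pairs."""
    # With one capturing group, split() returns
    # [preamble, label1, speech1, label2, speech2, ...]
    parts = SPEAKER_LABEL.split(text)
    return [(label, speech.strip()) for label, speech in zip(parts[1::2], parts[2::2])]

sample = ("CHIEF JUSTICE ROBERTS: We'll hear argument next. "
          "MR. SALVATORE: Thank you, Mr. Chief Justice. "
          "JUSTICE GINSBURG: Why?")
for speaker, speech in speaker_turns(sample):
    print(speaker, "->", speech)
```

Anchoring on known label prefixes keeps all-caps section headings (such as "ORAL ARGUMENT OF ...") from being read as speaker names, although such headings still end up at the tail of the preceding speaker's text.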

### Note to Amal: Please look at these slides on performing sentiment analysis. They are from my IEOR class, and the techniques are very useful. Please let me know if you would like to go over them together.

https://github.com/ikhlaqsidhu/data-x/blob/master/07a-tools-nlp-sentiment_add_missing_si/notebook-nlp-sentiment-analysis-imdb-afo_v2.ipynb

In [11]:
## I was trying to extract the speakers from each text. 

def dialogue(text):
    sents = re.findall('[A-Z\s]+:[\s\S]+?[A-Z]+?:',text) #regex pattern to find all instances
    sents = [re.sub('\d','',i) for i in sents] # cleaning transcript
    sents = [re.sub('[\s\s]+',' ',i) for i in sents] #cleaning transcript
    return sents

# alternative pattern to try (also allows periods in the speaker label)
other_regex_pattern = r'[A-Z\s.]+:[\s\S]+?[A-Z\s]+?:'

In [12]:
for i in re.findall(r'(MR\. DABNEY:)([\s\S]+?)([A-Z][A-Z\s]*:)',cleaned[0]):
    print(i)
    print('')

In [13]:
dialogue(cleaned[0])[1]

" CHIEF JUSTICE ROBERTS: We'll hear argument next in Case -, Penn Plaza LLC v. Pyett. Mr. Salvatore. ORAL ARGUMENT OF PAUL SALVATORE ON BEHALF OF THE PETITIONERS MR. SALVATORE:"

## **Building the DataFrame**

#### Justices!
`#print(cleaned[4])`

Running the line above prints the transcript at index 4.

Notice that instead of the justices' names, the speaker label is just "QUESTION". This seems to happen in the transcripts where only one justice appears by name, and so far that justice is always Rehnquist.
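
One way to flag these transcripts is to compare how many anonymous "QUESTION:" labels a text contains with how many named justice labels it contains. This is only a sketch: the function name `count_speaker_labels` is hypothetical, and it uses the standard-library `re` module rather than `regex`.

```python
import re

def count_speaker_labels(txt):
    """Return (number of 'QUESTION:' labels, number of 'JUSTICE NAME:' labels)."""
    anonymous = len(re.findall(r'\bQUESTION:', txt))
    named = len(re.findall(r'JUSTICE [A-Z]+:', txt))
    return anonymous, named

sample = ("CHIEF JUSTICE REHNQUIST: We'll hear argument. "
          "QUESTION: Why? MR. SMITH: Because. QUESTION: I see.")
print(count_speaker_labels(sample))  # (2, 1)
```

A transcript with many anonymous labels and only one named justice is probably one of the "QUESTION" cases described above.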

In [14]:
justice = [justices(i) for i in cleaned]
for i in range(len(justice)):
    justice[i] = ", ".join(justice[i])

In [15]:
for i in range(len(justice)):
    justice[i] = re.sub('JUSTICE','',justice[i])

#### Dates!

In [16]:
date(cleaned[0])[0]

' December 1, 2008'

In [17]:
dates = []
for i in cleaned:
    dates += [date(i)[0]]

In [18]:
len(dates)

1302

#### Year!

In [19]:
years = []
for i in dates:
    years += re.findall('\d\d\d\d',i)

In [20]:
len(years)

1302

#### Case Numbers!

In [21]:
case_no(cleaned[0])

['No. 07-581']

In [22]:
cases = []
for i in cleaned:
    cases += [case_no(i)[0]]

In [23]:
len(cases)

1302

#### Appearances!

In [24]:
def appearances(text):
    app = re.findall('APPEARANCES:[\s\S]*?Reporting.*?',text)
    if len(app) == 0:
        app = re.findall('APPEARANCES:[\s\S]*?REPORTING.*?',text)
        if len(app) == 0:
            app = re.findall('APPEARANCES:[\s\S]*?C O N T E N T S.*?',text)
            if len(app) == 0:
                app = re.findall('APPEARANCES:[\s\S]*?CONTENTS.*?',text)
    app = app[0]
    app = re.findall('[\s\S]*?;[\s\S]*?\.',app)
    app = [re.sub('\d','',app[i]) for i in range(len(app))]
    remove_appearance = re.sub('APPEARANCES:\s','',app[0])
    app[0] = remove_appearance
    app = [re.sub('[\s\s]+',' ',app[i]) for i in range(len(app))]
    return app

In [25]:
def diff(text):
    app = re.findall('APPEARANCES:*[\s\S]*?Reporting.*?',text)
    if len(app) == 0 or len(app)>400:
        app = re.findall('APPEARANCES:[\s\S]*?REPORTING.*?',text)
        if len(app) == 0:
            app = re.findall('APPEARANCES:[\s\S]*?C O N T E N T S.*?',text)
            if len(app) == 0:
                app = re.findall('APPEARANCES:[\s\S]*?CONTENTS.*?',text)
    app = app[0]
    app = re.sub('\d','',app)
    remove_appearance = re.sub('APPEARANCES:*\s','',app)
    app = re.sub('[\s\s]+',' ',remove_appearance)
    return app

In [26]:
ugh = []
for i in cleaned:
    try:
        ugh += [diff(i)]
    except Exception:
        # diff() fails (e.g. IndexError) when none of the patterns match
        ugh += ["ERROR"]

In [27]:
for i in ugh:
    if i == "ERROR":
        print(i)

ERROR


In [28]:
#diff(cleaned[1023])

In [29]:
appearances(cleaned[1])

[' SCOTT A. KELLER, Solicitor General of Texas, Austin, Texas; on behalf of the Appellants.',
 ' EDWIN S. KNEEDLER, Deputy Solicitor General, Department of Justice, Washington, D.C.; on behalf of Appellee United States, in support of the Appellants.',
 ' MAX RENEA HICKS, ESQ., Austin, Texas; on behalf of the Appellees in No.',
 ' -. ALLISON J. RIGGS, ESQ., Durham, North Carolina; on behalf of the Appellees in No.']

In [30]:
ugh[1023]

'ERROR'

In [31]:
for i in range(len(ugh)):
    if i == 1023:
        ugh[i] = 'ignore'
    else:
        ugh[i] = re.sub('in No. -','',ugh[i])

In [32]:
reporting_pattern = '[A-Z][a-z]+\sReporting'
REPORTING_pattern = '[A-Z]+\sREPORTING'
contents_pattern = 'CONTENTS'
CONTENTS_pattren = 'C O N T E N T S'

In [33]:
re.sub(re.findall(reporting_pattern,ugh[0])[0],'',ugh[0])

'PAUL SALVATORE, ESQ., New York, N.Y.; on behalf of the Petitioners. DAVID C. FREDERICK, ESQ., Washington, D.C.; on behalf of the Respondents. CURTIS E. GANNON, ESQ., Assistant to the Solicitor General, Department of Justice, Washington, D.C.; on behalf of the United States, as amicus curiae, supporting the Respondents. '

In [34]:
for i in range(len(ugh)):
    if i == 1023:
        # assign (=) rather than compare (==) so the entry is actually replaced
        ugh[i] = 'JASON D. HAWKINS, ESQ., Assistant Federal Public Defender, Dallas, Texas; for Petitioner. WILLIAM M. JAY, ESQ., Assistant to the Solicitor General, Department of Justice, Washington, D.C.; for  Respondent, in support of Petitioner. EVAN A. YOUNG, ESQ., Austin, Texas; for amicus curiae, in support of the judgment below; appointed by this Court.'
    else:
        report1 = re.findall(reporting_pattern,ugh[i])
        report2 = re.findall(REPORTING_pattern,ugh[i])
        content1 = re.findall(contents_pattern,ugh[i])
        content2 = re.findall(CONTENTS_pattren,ugh[i])
        if len(report1) > 0:
            ugh[i]=re.sub(report1[0],'',ugh[i])
        elif len(report2) > 0:
            ugh[i]=re.sub(report2[0],'',ugh[i])
        elif len(content1) > 0:
            ugh[i]=re.sub(content1[0],'',ugh[i])
        else:
            ugh[i]=re.sub(content2[0],'',ugh[i])

In [35]:
bad = 'JASON D. HAWKINS, ESQ., Assistant Federal Public Defender, Dallas, Texas; for Petitioner. WILLIAM M. JAY, ESQ., Assistant to the Solicitor General, Department of Justice, Washington, D.C.; for  Respondent, in support of Petitioner. EVAN A. YOUNG, ESQ., Austin, Texas; for amicus curiae, in support of the judgment below; appointed by this Court.'

In [36]:
#ugh

#### Title!

In [37]:
title = [f for f in listdir(r"C:\Users\Avena Cheng\Desktop\Liberating Archives\supremecourt\textfiles") if isfile(join(r"C:\Users\Avena Cheng\Desktop\Liberating Archives\supremecourt\textfiles", f))]
for i in range(len(title)):
    # escape the dots and anchor at the end so only the file extension is stripped
    title[i] = re.sub(r'\.pdf\.txt$','',title[i])
title.remove('DatabaseDesign1.ipynb')

### ACTUAL DATAFRAME!!!

**Note:** Some cases list only one justice, Rehnquist, but this is because the rest of the questions appear as "QUESTION:" rather than under the justices' names.

In [38]:
#make sure they're all the same length
print(len(justice),
len(title),
len(ugh),
len(years),
len(cases))

1302 1302 1302 1302 1302


In [39]:
data = pd.DataFrame({'Title':title,'Case No': cases,'Date':dates,'Year':years,'Justices':justice,'Appearances':ugh})

In [40]:
#data.to_csv(r"C:\Users\Avena Cheng\Desktop\Liberating Archives\supremecourt\textfiles\data.csv")

In [41]:
data['Appearances'] = data['Appearances'].replace('ignore',bad)

#### People!

In [42]:
people = []
for i in range(len(ugh)):
    p = re.findall('[A-Z]+\s[A-Z\.\s]*[A-Z]+',ugh[i])
    people += [p]
    
for i in range(len(people)):
    people[i] = ", ".join(people[i])
    
data['People'] = people

In [74]:
people[0]

'PAUL SALVATORE, DAVID C. FREDERICK, CURTIS E. GANNON'

In [44]:
check_unique = [justices(i) for i in cleaned]


In [46]:
#unique(justice)

In [54]:
import regex as re
import numpy
import os

def unique(texts):
    return list(set(texts))

# Return every bill name mentioned in the text (duplicates kept).
# The pattern captures only the two capitalized words before "Act",
# so "National Labor Relations Act" is recorded as "Labor Relations Act".
def bill(texts):
    no_num = re.sub(r'\d+','',texts)
    no_n = re.sub(r'\n','',no_num)
    evenly_spaced = re.sub(r'\s+',' ',no_n)
    b = re.findall(r"[A-Z][a-z]+\s[A-Z][a-z]+\sAct", evenly_spaced)
    return b

# Return the unique bill names mentioned in the text.
def unibill(texts):
    return unique(bill(texts))

# Return one "name : count" string per unique bill.
def count(texts):
    bills = bill(texts)
    ret = []
    for b in bills:
        n = bills.count(b)
        ret.append(str(b) + " " + ":" + ' ' + str(n))
    return unique(ret)


In [56]:
unibill(cleaned[0])
count(cleaned[0])

['Labor Relations Act : 2', 'Labor Standards Act : 1']

In [64]:
bills_mentioned = []
for i in range(len(cleaned)):
    bills_mentioned += [count(cleaned[i])]

In [68]:
len(bills_mentioned)

1302

In [69]:
data['Bills Mentioned'] = bills_mentioned

| | Title | Case No | Date | Year | Justices | Appearances | People | Bills Mentioned |
|---|---|---|---|---|---|---|---|---|
| 0 | 14 Penn Plaza LLC v. Pyett | No. 07-581 | December 1, 2008 | 2008 | ALITO, BREYER, GINSBURG, KENNEDY, ROBERTS... | PAUL SALVATORE, ESQ., New York, N.Y.; on behal... | PAUL SALVATORE, DAVID C. FREDERICK, CURTIS E. ... | [Labor Relations Act : 2, Labor Standards Act ... |
| 1 | Abbott v. Perez | No. 17-586 | April 24, 2018 | 2018 | ALITO, BREYER, GINSBURG, GORSUCH, KAGAN, ... | SCOTT A. KELLER, Solicitor General of Texas, ... | SCOTT A. KELLER, EDWIN S. KNEEDLER, MAX RENEA ... | [Voting Rights Act : 3] |
| 2 | Abbott v. United States | No. 09-479 | October 4, 2010 | 2010 | ALITO, BREYER, GINSBURG, ROBERTS, SCALIA,... | DAVID L. HORAN, ESQ., Dallas, Texas; on behalf... | DAVID L. HORAN, JAMES E. RYAN | [Career Criminal Act : 3] |
| 3 | Abdul-Kabir v. Quarterman | No. 05-11284 | January 17, 2006 | 2006 | ALITO, BREYER, GINSBURG, KENNEDY, ROBERTS... | ROBERT C. OWEN, ESQ., Austin, Tex.; on behalf... | ROBERT C. OWEN, EDWARD L. MARSHALL | [] |
| 4 | Abdur_Rahman v. Bell | No. 01-9094 | November 6, 2002 | 2002 | REHNQUIST | JAMES S. LIEBMAN, ESQ., New York, New York; on... | JAMES S. LIEBMAN, PAUL G. SUMMERS, PAUL J. ZID... | [] |
| 5 | Abramski v. United States | No. 121493 | January 22, 2014 | 2014 | ALITO, BREYER, GINSBURG, KAGAN, KENNEDY, ... | RICHARD D. DIETZ, ESQ., WinstonSalem, North Ca... | RICHARD D. DIETZ, JOSEPH R. PALMORE | [Gun Control Act : 12, Owners Protection Act : 2] |
| 6 | Abuelhawa v. United States | No. 08-192 | March 4, 2009 | 2009 | ALITO, BREYER, GINSBURG, KENNEDY, ROBERTS... | SRI SRINIVASAN, ESQ., Washington, D.C.; on beh... | SRI SRINIVASAN, ERIC D. MILLER | [The Travel Act : 1, Controlled Substances Act... |
| 7 | Adams v. Florida Power Corp. | No. 01-584 | March 20, 2002 | 2002 | REHNQUIST | JOHN J. CRABTREE, ESQ., Key Biscayne, Florida;... | JOHN J. CRABTREE, GLEN D. NAGER | [Age Discrimination Act : 3, Civil Rights Act ... |
| 8 | Adarand Constructors, Inc. v. Mineta | No. 00-730 | October 31, 2001 | 2001 | REHNQUIST | WILLIAM P. PENDLEY, ESQ., President and Chief ... | WILLIAM P. PENDLEY, THEODORE B. OLSON | [Small Business Act : 3] |
| 9 | Adoptive Couple v. Baby Girl | No. 12-399 | April 16, 2013 | 2013 | ALITO, BREYER, GINSBURG, KAGAN, KENNEDY, ... | LISA S. BLATT, ESQ., Washington, D.C.; on beha... | LISA S. BLATT, PAUL D. CLEMENT, CHARLES A. ROT... | [Child Welfare Act : 1] |


In [77]:
bills_db = data[['Title','Case No','Date','Bills Mentioned']]

In [80]:
bills_db.to_csv('bills_db.csv')