## Scraping, downloading, and converting many PDFs
Here we are extending the last part of our homework so that we can actually download PDFs of the transcripts.

Here I demo how to download PDFs using Python, and then I use Tika to convert the PDFs to text. Again, you do not need to run this part; I am keeping it in here so you can use it in the future. If you do want to use Tika, you need to install it, and because Tika is Java-based, you also need to have Java installed on your computer.
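
If you are not sure whether Java is installed, a quick check from Python looks roughly like this. It is only a sketch: the helper name `java_available` is made up for illustration, and it only checks that a `java` executable is on your PATH, not which version it is.

```python
import shutil

def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None

if java_available():
    print("Java found, so Tika should be able to start.")
else:
    print("Java not found. Install a Java runtime before using Tika.")
```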

I have provided the downloaded and converted court documents for the homework. You can **skip to the homework part** if you want to start trying to process those transcripts using regular expressions.


In [22]:
import re
import requests
from bs4 import BeautifulSoup
#the homework but here I am just getting the pdf link
my_url = "https://www.supremecourt.gov/oral_arguments/argument_transcript/2023"
raw_html = requests.get(my_url).content
court_doc = BeautifulSoup(raw_html, "html.parser")


In [23]:
# getting all of the court info
all_rows = court_doc.find_all('tr')
each_case = []
for row in all_rows[1:]:
    if row.td:
        this_row = {}
        each_cell=row.find_all('td')
        this_row['docket_num'] = each_cell[0].span.text
        #I am using split below to get just the pdf name out of the file structure
        #so instead of ../argument_transcripts/2023/22-6389_8n59.pdf I get 22-6389_8n59.pdf
        this_row['pdf_link'] = each_cell[0].a['href'].split('/')[-1]
        this_row['case_name'] = each_cell[0].find_all('span')[1].text
        this_row['date'] = each_cell[1].text
        each_case.append(this_row)
each_case


[{'docket_num': '23-108',
  'pdf_link': '23-108_o7jp.pdf',
  'case_name': 'Snyder v. United States',
  'date': '04/15/24'},
 {'docket_num': '23-50',
  'pdf_link': '23-50_g3bh.pdf',
  'case_name': 'Chiaverini v. City of Napoleon',
  'date': '04/15/24'},
 {'docket_num': '23-5572',
  'pdf_link': '23-5572_l537.pdf',
  'case_name': 'Fischer v. United States',
  'date': '04/16/24'},
 {'docket_num': '22-982',
  'pdf_link': '22-982_m64n.pdf',
  'case_name': 'Thornell v. Jones',
  'date': '04/17/24'},
 {'docket_num': '23-175',
  'pdf_link': '23-175_20f4.pdf',
  'case_name': 'City of Grants Pass v. Johnson',
  'date': '04/22/24'},
 {'docket_num': '22-1218',
  'pdf_link': '22-1218_h3ci.pdf',
  'case_name': 'Smith v. Spizzirri',
  'date': '04/22/24'},
 {'docket_num': '23-334',
  'pdf_link': '23-334_ifjm.pdf',
  'case_name': 'Dept. of State v. Munoz',
  'date': '04/23/24'},
 {'docket_num': '23-367',
  'pdf_link': '23-367_5he6.pdf',
  'case_name': 'Starbucks Corp. v. McKinney',
  'date': '04/23/24'}

In [24]:
len(each_case)

61

Next I used the **requests** library to download all of the PDFs to a folder on my computer.

In [None]:
# base URL for the PDF files:
# https://www.supremecourt.gov/oral_arguments/argument_transcripts/2023/

In [None]:
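import os
import time
import requests

os.makedirs("court_docs", exist_ok=True)  # make sure the output folder exists
for case in each_case:
    time.sleep(2)  # pause between requests so we do not hammer the server
    link = 'https://www.supremecourt.gov/oral_arguments/argument_transcripts/2023/' + case['pdf_link']
    file_name = "court_docs/" + case['pdf_link']
    r = requests.get(link, stream=True)
    r.raise_for_status()  # stop here if the server sent back an error instead of a PDF
    with open(file_name, 'wb') as pdf_file:
        # the default chunk size is 1 byte, so ask for bigger chunks
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:
                pdf_file.write(chunk)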
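
If the download loop gets interrupted, you probably do not want to fetch every PDF again. One way to handle that, sketched below, is to skip files that are already on disk. The helper name `needs_download` is hypothetical, and treating an empty file as "not downloaded" is my assumption about how a failed download would look.

```python
import os

def needs_download(file_name):
    """True when the file is missing or empty, so it should be fetched."""
    return not (os.path.exists(file_name) and os.path.getsize(file_name) > 0)

# Hypothetical use inside the download loop:
# for case in each_case:
#     file_name = "court_docs/" + case['pdf_link']
#     if not needs_download(file_name):
#         continue  # already have this one
#     ...download the file as in the loop above...
```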


Here I use **tika** to extract the text from the pdfs and write txt files to my computer.

In [None]:
import tika
from tika import parser
import time
for case in each_case:
    time.sleep(2)
    print(case['pdf_link'])
    file_name = "court_docs/" + case['pdf_link']
    parsed_pdf = parser.from_file(file_name) 
    txt_data = parsed_pdf['content']
    txt_name = case['pdf_link'].split('.')[0] + ".txt"
    print(txt_name)
    file_out ="court_docs/" + txt_name
    with open(file_out, 'w') as outfile:
        outfile.write(txt_data or "")  # write an empty file if no text was extracted


    

In [16]:
#Open a text file from your computer
#We are going to use this one.
with open('court_docs/22-1079_4357.txt', 'r') as f:
    sample_transcript = f.read()


### NOW THE HOMEWORK!!
**How in the world are you going to clean this up?**
Take a close look and think first about what needs to be removed, and then about what needs to be isolated. You'll probably need a combination of regular expressions. You will need to use re.sub(), which is a regex replace, and re.split(), where you split the text at a given point and keep just the part of the text that you want.



**GET RID OF THE HEADER FIRST**

Try to find the pattern that appears at the beginning of every case, when the arguments begin.

Make a regular expression that will find that, and use an re.split() to split the transcript into a list with two parts, the header and the rest.

Save the rest! (That is, hint hint, the [1] element in that list.)
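
Here is a rough sketch of that split on a tiny made-up transcript. The `toy` string is invented, and the pattern assumes the arguments always start with the spaced-out "P R O C E E D I N G S" line followed by a time in parentheses. Checking the length of the list before grabbing `[1]` saves you from an IndexError when a transcript does not fit the pattern.

```python
import re

# Invented miniature transcript, for illustration only.
toy = ("SUPREME COURT\n\nP R O C E E D I N G S\n\n(10:04 a.m.)\n\n"
       "CHIEF JUSTICE ROBERTS: We'll hear argument.")

header_pattern = r"P R O C E E D I N G S\s+[(]\d[^)]+[)]"
parts = re.split(header_pattern, toy)

if len(parts) > 1:
    header, rest = parts[0], parts[1]  # keep the part after the header
else:
    header, rest = "", toy             # pattern not found, nothing removed

print(repr(rest))
```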


In [1]:
#remove the header
import re
regex1 = r"P R O C E E D I N G S\s+[(]\d[^)]+[)]" #not grouped
header_list=re.split(regex1,sample_transcript)


**NEXT GET RID OF THE FOOTER** 

Do the same thing you did above: find a pattern that appears at the end of every transcript.

Do an re.split() and save the part with the arguments.


In [19]:
#remove the footer
cleaned_header = header_list[1]
#(Whereupon, at 12:45 p.m., the case \n\nwas submitted.)
regex1 = r"[(]Whereupon,\s+at\s+\d" #not grouped
#regex2 = r"(and)" #grouped
footer_list=re.split(regex1,cleaned_header)
footer_list[0]#but keep footer_list[0]

'\n\n CHIEF JUSTICE ROBERTS:  We\'ll hear\n\n argument next in Case 22-1079, Truck Insurance\n\n Exchange versus Kaiser Gypsum Company.\n\n Ms. Ho.\n\n ORAL ARGUMENT OF ALLYSON N. HO\n\n ON BEHALF OF THE PETITIONER\n\n MS. HO: Thank you, Mr. Chief Justice, \n\nand may it please the Court: \n\nIf anyone is a party in interest \n\nentitled to be heard in this Chapter 11 case, \n\nit\'s the insurer, Truck, who will pay virtually \n\nevery dollar the debtors owe the asbestos \n\nclaimants. \n\nYet, the Fourth Circuit\'s rule denies \n\nthat insurer a voice.  That rule, which my \n\nfriends barely defend, violates the text, \n\ncontext, and history of 1109(b). \n\nIt also defies the practical reality \n\nthat Chapter 11 cases are, as this Court has \n\nrecognized, collaborative, working best when all \n\nstakeholders come together at the outset to hash \n\nthings out. \n\nCongress recognized that reality and \n\nHeritage Reporting Corporation \n\n\n\n  \n \n\n \n\n  \n\n \n                 

**REMOVE THE PAGE BREAKS**

OK, the page breaks are very messy: they have tons of numbers and other text. Try to write an expression that captures all of that!

When you do that, use re.sub() to replace all of those page break messes with " ".
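
A small sketch of the idea on an invented snippet. I am assuming here that each page break starts with "Heritage Reporting Corporation", runs through a column of line numbers, and ends with "Official - Subject to Final Review"; the real layout can differ between transcripts, so check yours before trusting the pattern.

```python
import re

# Invented snippet with one page break in the middle of a sentence.
toy = ("reality and \n\nHeritage Reporting Corporation \n\n1 \n\n2 \n\n"
       "Official - Subject to Final Review \n\nspoke")

# [\s\d-]+ swallows the whitespace, line numbers, and dashes in between.
page_break = r"Heritage Reporting Corporation[\s\d-]+Official(?: - Subject to Final Review)?"
cleaned = re.sub(page_break, " ", toy)

print(repr(cleaned))
```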

In [20]:
#remove the page break
clean_middle = footer_list[0]
clean_middle
#Heritage Reporting Corporation Official - Subject to Final Review and all those 1 2 3...
regex1 = r"Heritage Reporting Corporation[\s\d-]+Official" #not grouped
cleaner_middle = re.sub(regex1,"",clean_middle)
cleaner_middle

'\n\n CHIEF JUSTICE ROBERTS:  We\'ll hear\n\n argument next in Case 22-1079, Truck Insurance\n\n Exchange versus Kaiser Gypsum Company.\n\n Ms. Ho.\n\n ORAL ARGUMENT OF ALLYSON N. HO\n\n ON BEHALF OF THE PETITIONER\n\n MS. HO: Thank you, Mr. Chief Justice, \n\nand may it please the Court: \n\nIf anyone is a party in interest \n\nentitled to be heard in this Chapter 11 case, \n\nit\'s the insurer, Truck, who will pay virtually \n\nevery dollar the debtors owe the asbestos \n\nclaimants. \n\nYet, the Fourth Circuit\'s rule denies \n\nthat insurer a voice.  That rule, which my \n\nfriends barely defend, violates the text, \n\ncontext, and history of 1109(b). \n\nIt also defies the practical reality \n\nthat Chapter 11 cases are, as this Court has \n\nrecognized, collaborative, working best when all \n\nstakeholders come together at the outset to hash \n\nthings out. \n\nCongress recognized that reality and \n\n \n\n spoke expansively in 1109(b) to extend the right\n\n to be heard to any i

**SPLIT THE DIALOGUE BY SPEAKER AND SPEECH (SPEAKER/WORDS)**

This is the toughest part. You need to write a regular expression that accurately captures who is speaking like:
```
MR. GORE:
```
OR
```
JUSTICE KAGAN:
```
If you do a split using groups (as I demoed in the advanced regular expressions notebook), it will split by the pattern but keep the pattern in the result instead of discarding it (which is what a default split does). That way you will get a list where every other element is the speaker. Like this (this is taken from inside the transcript):

```
'JUSTICE KAGAN',
 '  Did your expert \n\npresent an alternative study which did control \n\nfor geography and reached a different result? ',
 'MR. GORE',
 " He did not try to mirror \n\nDr. Ragusa's study --",
 'JUSTICE KAGAN',
 '  Because that would \n\nhave been the easiest way to undermine the \n\ntheory.  I mean, as I understand it, this was \n\nhardly touched upon by -- by -- by -- by the \n\nstate below.  And, certainly, the state did not \n\ndo what would seem to be the -- the normal thing \n\nif you were really concerned about this, which \n\nis to say: Look at our study.  We controlled \n\n \n\nfor geography. The results are entirely\n\n different.',
 'MR. GORE',
 " We did raise objections to\n\n Dr. Ragusa's methodology, and as I was \n\nexplaining, it is a flawed methodology and not\n\n reliable.\n\n Moreover, the state presented direct\n\n testimony from the map drawer to explain which\n\n VTDs were chosen and why.  That direct evidence \n\nshowed, like all the other direct evidence, that \n\ndecisions were made based on politics and \n\ntraditional principles and not using race at \n\nall. ",
 'JUSTICE SOTOMAYOR',
 " I think you end up \n\nin a very poor starting point under clear error \n\narguing the substance of believability of one \n\nexpert over another, because credibility \n\nfindings under clear error standard must be \n\ndeferred to to the district court. \n\nI understand your points about -- your \n\npoint about Dr. Ragusa, but I just point out \n\nthat other experts before the court and he \n\nhimself said that geography was very much \n\nembedded as part of the structure of his \n\nanalysis. \n\n \n\nYou may disagree with that.  It's\n\n going to be very hard for you to show that no \n\nfact finder could credit that understanding of\n\n his testimony.\n\n But I think what I'm really troubled \n\nby is, going back to Justice Thomas's question, \n\nwhat's the legal error and what's the clear\n\n error? Just tick them off for me.",
 'MR. GORE',
 ' There are several legal \n\nerrors, Justice Sotomayor. ',
 'JUSTICE SOTOMAYOR',
 '  Not facts.  I want \n\nlegal errors or clear errors beyond -- under our \n\nstandard. ',
 'MR. GORE',
 ' The first legal error is a \n\nfailure to enforce the alternative map \n\nrequirement. ',
 'JUSTICE KAGAN',
 "  Okay.  I'm going to \n\nbutt in.  And I'm sorry, Justice Sotomayor. ",
 'JUSTICE SOTOMAYOR',
 '  Yes, you can --\n\nyou can start there. ',
```

If you get this, you are done!!
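
To see the difference the parentheses make, here is a sketch on a two-line invented exchange. The speaker pattern is one possible guess (two capital letters, then at least three more capitals, spaces, periods, or hyphens, then a colon) and will not cover every name.

```python
import re

# Invented two-turn exchange.
toy = "\nJUSTICE KAGAN:  Did your expert present a study?\nMR. GORE: He did not."

# Without a group, the speaker names are thrown away.
plain = re.split(r"\n\s*[A-Z]{2}[A-Z .-]{3,}:", toy)

# With a group, re.split keeps whatever the parentheses matched.
grouped = re.split(r"\n\s*([A-Z]{2}[A-Z .-]{3,}):", toy)

print(plain)
print(grouped)
```

The first element of `grouped` is an empty string (the text before the first speaker), so the speakers sit at the odd positions and their words at the even ones.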

In [21]:
regex1 = r"\n\s*([A-Z]{2}[A-Z .-]{3,}):" #grouped, so re.split keeps the speaker names
#regex2 = r"(and)" #grouped
speaker_speech=re.split(regex1,cleaner_middle)
speaker_speech

['',
 'CHIEF JUSTICE ROBERTS',
 "  We'll hear\n\n argument next in Case 22-1079, Truck Insurance\n\n Exchange versus Kaiser Gypsum Company.\n\n Ms. Ho.\n\n ORAL ARGUMENT OF ALLYSON N. HO\n\n ON BEHALF OF THE PETITIONER",
 'MS. HO',
 " Thank you, Mr. Chief Justice, \n\nand may it please the Court: \n\nIf anyone is a party in interest \n\nentitled to be heard in this Chapter 11 case, \n\nit's the insurer, Truck, who will pay virtually \n\nevery dollar the debtors owe the asbestos \n\nclaimants. \n\nYet, the Fourth Circuit's rule denies \n\nthat insurer a voice.  That rule, which my \n\nfriends barely defend, violates the text, \n\ncontext, and history of 1109(b). \n\nIt also defies the practical reality \n\nthat Chapter 11 cases are, as this Court has \n\nrecognized, collaborative, working best when all \n\nstakeholders come together at the outset to hash \n\nthings out. \n\nCongress recognized that reality and \n\n \n\n spoke expansively in 1109(b) to extend the right\n\n to be heard to

In [15]:
speaker_speech[-5]

' Thank you, \n\ncounsel. \n\nRebuttal, Mr. Weir?\n\n REBUTTAL ARGUMENT OF BRYAN K. WEIR\n\n ON BEHALF OF THE PETITIONER '

In [31]:
all_transcripts = []
for case in each_case:
    try:
        filename = 'court_docs/' + case['pdf_link'].split('.')[0] + '.txt'
        print(case['docket_num'])
        f = open(filename, 'r')
        transcript = f.read()
        regex1 = r"P R O C E E D I N G S\s+[(]\d[^)]+[)]" #not grouped 
        #23-719 breaks here
        header_list=re.split(regex1,transcript)
        cleaned_header = header_list[1]
        #(Whereupon, at 12:45 p.m., the case \n\nwas submitted.)
        regex1 = r"[(]Whereupon,\s+at\s+\d" #not grouped
        footer_list=re.split(regex1,cleaned_header)
        clean_middle = footer_list[0]
        #Heritage Reporting Corporation Official - Subject to Final Review and all those 1 2 3...
        regex1 = r"Heritage Reporting Corporation[\s\d-]+Official" #not grouped
        cleaner_middle = re.sub(regex1,"",clean_middle)
        cleaner_middle
        #first attempt, without accents: r"\n\s*([A-Z]{2}[A-Z .]{3,}):"
        regex1 = r"[\n]\s*([A-Z]{2}[A-Z\u00C0-\u00DCca .]{3,}):" #grouped version with ACCENTS
        speaker_speech=re.split(regex1,cleaner_middle)
        for x in range(1,len(speaker_speech),2):
            dialogue = {}
            dialogue['case'] = case['docket_num']
            dialogue['speaker'] = speaker_speech[x]
            dialogue['words'] = speaker_speech[x+1]
            all_transcripts.append(dialogue)
    except Exception: #e.g. an IndexError when the header pattern is not found
        print("------ERROR--------")
        print(case['docket_num'])
        print("------ERROR--------")


23-108
23-50
23-5572
22-982
23-175
22-1218
23-334
23-367
23-726
23-939
23-411
22-842
23-14
22-1079
22-1025
141-Orig
23-250
23-21
23-235
23-370
23-146
23-719
------ERROR--------
23-719
------ERROR--------
22-1008
23-51
23A349
22-1078
22-277
22-555
22-7386
22-529
22-976
23-3
22-674
22-1178
22-1074
22-1238
22-899
22-1165
22-913
22-451
22-1219
22-6389
22-721
22-666
22-859
23-124
22-800
22-193
22-585
22-324
22-611
22-704
22-846
22-915
22-888
22-340
22-448
22-429
22-660
22-500
22-807
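
As an aside, the `range(1, len(speaker_speech), 2)` loop can also be written with slices and `zip`, which some people find easier to read. This is just a sketch on an invented list; the docket number `'00-000'` is a placeholder.

```python
# Invented result of a grouped re.split: '', speaker, words, speaker, words...
speaker_speech = ['', 'JUSTICE KAGAN', '  Did your expert present a study?',
                  'MR. GORE', ' He did not.']

speakers = speaker_speech[1::2]   # odd positions
speeches = speaker_speech[2::2]   # even positions, skipping the leading ''

turns = [{'case': '00-000', 'speaker': s, 'words': w}
         for s, w in zip(speakers, speeches)]
print(turns)
```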


In [36]:
for case in each_case:
    print(case)

{'docket_num': '23-108', 'pdf_link': '23-108_o7jp.pdf', 'case_name': 'Snyder v. United States', 'date': '04/15/24'}
{'docket_num': '23-50', 'pdf_link': '23-50_g3bh.pdf', 'case_name': 'Chiaverini v. City of Napoleon', 'date': '04/15/24'}
{'docket_num': '23-5572', 'pdf_link': '23-5572_l537.pdf', 'case_name': 'Fischer v. United States', 'date': '04/16/24'}
{'docket_num': '22-982', 'pdf_link': '22-982_m64n.pdf', 'case_name': 'Thornell v. Jones', 'date': '04/17/24'}
{'docket_num': '23-175', 'pdf_link': '23-175_20f4.pdf', 'case_name': 'City of Grants Pass v. Johnson', 'date': '04/22/24'}
{'docket_num': '22-1218', 'pdf_link': '22-1218_h3ci.pdf', 'case_name': 'Smith v. Spizzirri', 'date': '04/22/24'}
{'docket_num': '23-334', 'pdf_link': '23-334_ifjm.pdf', 'case_name': 'Dept. of State v. Munoz', 'date': '04/23/24'}
{'docket_num': '23-367', 'pdf_link': '23-367_5he6.pdf', 'case_name': 'Starbucks Corp. v. McKinney', 'date': '04/23/24'}
{'docket_num': '23-726', 'pdf_link': '23-726_ggco.pdf', 'case_

In [32]:
len(all_transcripts)

22508

In [33]:
all_transcripts[21000:21010]

[{'case': '22-429',
  'speaker': 'MR. UNIKOWSKY',
  'words': '  I don\'t think "I may \n\nsomeday" is enough. That kind of sounds like\n\n the allegations --'},
 {'case': '22-429',
  'speaker': 'JUSTICE GORSUCH',
  'words': '  "I will someday." '},
 {'case': '22-429',
  'speaker': 'MR. UNIKOWSKY',
  'words': '  "I will" -- I think "I \n\nwill" -- "someday" probably is not enough \n\neither. '},
 {'case': '22-429',
  'speaker': 'JUSTICE GORSUCH',
  'words': '  "Someday" not good \n\nenough? '},
 {'case': '22-429',
  'speaker': 'MR. UNIKOWSKY',
  'words': '  I don\'t think -- under \n\nLujan case, the Court held that "someday" plans \n\naren\'t good enough for standing. '},
 {'case': '22-429',
  'speaker': 'JUSTICE GORSUCH',
  'words': '  In the next decade? \n\n(Laughter.) '},
 {'case': '22-429',
  'speaker': 'MR. UNIKOWSKY',
  'words': "  I think it's got to be \n\nconcrete plans.  If you're -- if you're going \n\nto Wells next summer and you're trying to make \n\na reservation at Coas

In [34]:
#MR. LaCOUR
#MR. McCOLLOCH:
#MR. AGUIÑAGA:
for line in all_transcripts:
    if re.search(r"[A-Z]:",line['words']):
        print(line['words'])
        

 And -- and that would be 

the time at which you could say, at least for

 now, here is the class of states that are out, 

and so you, EPA, rather than comment on, as 

Justice Kagan was pointing out, the -- what

 would happen in the possibly millions of 

permutations of some states being in or out, at 

that point, they could have said to EPA: These

 are the specific states that are out.  We don't 

think the plan makes sense as to the remaining 

states. 
  Thank you, Mr. 

Chief Justice, and may it please the Court: 

Throughout this litigation and at 

 

times this morning, Petitioners have sought to 

characterize this case as presenting a 

fundamental question of the separation of powers

 and a test of Article III:  Will courts continue 

to say what the law is?

 But I think, stepping back, I want to 

make sure that what doesn't get lost in the

 shuffle is that Petitioners have made an

 important concession that I think illustrates 

that the issue here is actually fa
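
The three names in the comments above are the tricky ones: lowercase letters inside a name and an accented capital. Here is a sketch that checks the accent-aware speaker pattern against them, plus one ordinary line that happens to contain a capital letter followed by a colon. The sample lines are invented.

```python
import re

# Same idea as the accent-aware pattern used in the loop: capitals, accented
# capitals (\u00C0-\u00DC), the lowercase letters c and a, spaces, periods.
speaker_pattern = r"[\n]\s*([A-Z]{2}[A-Z\u00C0-\u00DCca .]{3,}):"

samples = [
    "\nMR. LaCOUR: Yes.",
    "\nMR. McCOLLOCH: No.",
    "\nMR. AGUIÑAGA: Thank you.",
    "\nthat point, they could have said to EPA: These",  # not a speaker
]

names = []
for s in samples:
    m = re.search(speaker_pattern, s)
    names.append(m.group(1) if m else None)
print(names)
```

The last sample is left alone because the pattern only accepts a colon when the line starts with capital letters, which is why lines like the ones printed above are not split.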

In [35]:
for line in all_transcripts:
    if (re.search("heritage",line['words'],re.IGNORECASE)):
        print(line)

{'case': '22-277', 'speaker': 'JUSTICE KAGAN', 'words': "  But why is it \n\ndifferent?  You -- you know, when we talked --\n\nwhen we had the parade case, we said they don't \n\nhave a lot of rules, but they have some rules, \n\nand we're going to respect the rules that they \n\ndo have.  Even though they let a lot of people \n\nHeritage Reporting Corporation \n\nhttps://Democrats.com\n\n\n  \n \n\n \n\n  \n\n \n                                                                  \n \n \n                 \n \n                 \n \n              \n \n                 \n \n                 \n \n                  \n \n                \n \n               \n \n              \n \n               \n \n              \n \n                \n \n              \n \n               \n \n               \n \n             \n \n                 \n \n                \n \n               \n \n               \n \n                \n \n             \n \n                \n \n                \n \n               \n 

In [36]:
for line in all_transcripts:
    if (re.search("heritage",line['words'],re.IGNORECASE)):
        regex1 = r"Heritage Reporting Corporation.+?Official" #not grouped
        line["words"] = re.sub(regex1,"",line["words"],flags=re.DOTALL)
        print(line)

{'case': '22-277', 'speaker': 'JUSTICE KAGAN', 'words': "  But why is it \n\ndifferent?  You -- you know, when we talked --\n\nwhen we had the parade case, we said they don't \n\nhave a lot of rules, but they have some rules, \n\nand we're going to respect the rules that they \n\ndo have.  Even though they let a lot of people \n\n \n\ncome in, they don't let a few people come in, \n\nand that seems to be quite important to them.\n\n And similarly here, I mean, Facebook, \n\nYouTube, these are the paradigmatic social media \n\ncompanies that this law applies to, and they \n\nhave rules about content. They say, you know,\n\n you can't have hate speech on this site.  They\n\n say you can't have misinformation with respect\n\n to particular subject matter areas. \n\nAnd they seem to take those rules and \n\n-- I mean, you know, somebody can say maybe they \n\nshould enforce them even more than they do, but \n\nthey do seem to take them seriously.  They have \n\nthousands and thousands of e

In [37]:
all_transcripts[1000:1015]

[{'case': '23-5572',
  'speaker': 'MR. GREEN',
  'words': ' Availability it says too, \n\nbut, as I mentioned earlier, simply delaying the \n\narrival of evidence at the courthouse --'},
 {'case': '23-5572',
  'speaker': 'JUSTICE JACKSON',
  'words': "  No, not delay. \n\nLet's say the person steals the envelope and \n\ntakes it away. "},
 {'case': '23-5572',
  'speaker': 'MR. GREEN',
  'words': " Then it gets harder, I\n\n agree. If they steal the envelope, they take it \n\naway, they rip up, all of those things, which is\n\n certainly not what happened here, and it's not \n\nin the indictment, the -- the ballots or the --\n\nthe vote count is not even in the indictment."},
 {'case': '23-5572',
  'speaker': 'JUSTICE JACKSON',
  'words': "  Well, we -- we\n\n wouldn't have to decide that."},
 {'case': '23-5572', 'speaker': 'MR. GREEN', 'words': ' Right. '},
 {'case': '23-5572',
  'speaker': 'JUSTICE JACKSON',
  'words': "  We could send it \n\nback if we clarified that that is what the

In [38]:
#clean line breaks...

for line in all_transcripts:
    line["words"] = re.sub(r"\n"," ",line["words"])
    line["words"] = re.sub(r"\s+"," ",line["words"])
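
A note on the two substitutions: `\s` already matches newlines, so the `\s+` replacement on its own gives the same result. The sketch below checks that on one utterance copied from the output above, and adds a `.strip()` for the leading and trailing spaces (that last step is my addition, not part of the cell).

```python
import re

raw = '  I don\'t think "I may \n\nsomeday" is enough. That kind of sounds like\n\n the allegations --'

one_pass = re.sub(r"\s+", " ", raw)
two_pass = re.sub(r"\s+", " ", re.sub(r"\n", " ", raw))

tidy = one_pass.strip()  # optional: drop the spaces at both ends
print(tidy)
```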
        

In [39]:
for line in all_transcripts:
    if (re.search("kafka",line['words'],re.IGNORECASE)):
        print(line)
        

{'case': '23-108', 'speaker': 'MS. BLATT', 'words': ' Mr. Chief Justice, and may it please the Court: Section 666 applies to 19 million state, local, and tribal officials and anyone else whose employer receives federal benefits, including 14 million Medicare-funded healthcare workers. Congress did not plausibly subject all of these people to 10 years in prison just for accepting gifts, especially when federal officials face only two years for accepting gifts under 201(c). 666 punishes corruptly receiving anything of value intending to be influenced or rewarded. "Corruptly [...] intending to be influenced" covers classic bribes, where officials get upfront payments in exchange for official conduct, while "corruptly [...] intending to be rewarded" covers bribes paid after the fact and to officials who aren\'t actually influenced. The government argues "corruptly" under 666 means wrongful, immoral, depraved, or evil. But the government tried this case and countless others on the theory th

In [40]:
#bad transcript
f = open('court_docs/23-719_2jf3.txt', 'r')
sample_transcript = f.read()
sample_transcript

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n\n  \n \n\n  \n \n\n \n \n\n \n  \n\n \n \n\n        \n \n \n                 \n \n\n             \n \n                      \n \n                       \n \n\n \n \n                      \n \n                 \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n   \n \n\n \n \n\n \n\n- - - - - - - - - - - - - - - - -\n\n- - - - - - - - - - - - - - - - -\n\nSUPREME COURT \nOF THE UNITED STATES \n\nIN THE SUPREME COURT OF THE UNITED STATES \n\nDONALD J. TRUMP, ) \n\nPetitioner, ) \n\nv. ) No. 23-719 \n\nNORMA ANDERSON, ET AL., ) \n\nRespondents. ) \n\nPages: 1 through 141 \n\nPlace: Washington, D.C. \n\nDate: February 8, 2024 \n\nHERITAGE REPORTING CORPORATION \nOfficial Reporters \n\n1220 L Street, N.W., Suite 206 \nWashington, D.C.  20005 \n\n(202) 628-4888 \nwww.hrccourtreporters.com \n\nwww.hrccourtreporters.com\n\n\n  \n \n\n \n\n  \n\n \n                                                          

In [41]:
#23-719
bad_case = []
for case in each_case:
    if case['docket_num'] == "23-719":
        print(case)
        filename = 'court_docs/' + case['pdf_link'].split('.')[0] + '.txt'
        print(case['docket_num'])
        f = open(filename, 'r')
        transcript = f.read()
        regex1 = r"\n\d\d?" #not grouped
        fix_nums = re.sub(regex1,"",transcript)
        regex1 = r"P R O C E E D I N G S\s+[(]\d[^)]+[)]" #not grouped 
        #23-719 breaks here
        header_list=re.split(regex1,fix_nums)
        cleaned_header = header_list[1]
        #(Whereupon, at 12:45 p.m., the case \n\nwas submitted.)
        regex1 = r"[(]Whereupon,\s+at\s+\d" #not grouped
        footer_list=re.split(regex1,cleaned_header)
        clean_middle = footer_list[0]
        #Heritage Reporting Corporation Official - Subject to Final Review and all those 1 2 3...
        regex1 = r"Heritage Reporting Corporation[\s\d-]+Official" #not grouped
        cleaner_middle = re.sub(regex1,"",clean_middle)
        cleaner_middle
        regex1 = r"[\n]\s*([A-Z][A-Zca .]{3,}):" #grouped, so re.split keeps the speaker names
        # regex1 = r"[\n]\s*([A-Z][A-Z\u00C0-\u00DCca .]{3,}):" #not grouped version with ACCENTS
        speaker_speech=re.split(regex1,cleaner_middle)
        for x in range(1,len(speaker_speech),2):
            dialogue = {}
            dialogue['case'] = case['docket_num']
            dialogue['speaker'] = speaker_speech[x]
            dialogue['words'] = speaker_speech[x+1]
            bad_case.append(dialogue)

{'docket_num': '23-719', 'pdf_link': '23-719_2jf3.pdf', 'case_name': 'Trump v. Anderson', 'date': '02/08/24'}
23-719
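
One thing to watch with `r"\n\d\d?"`: it removes any one or two digits at the start of a line, including digits that belong to the text. The sketch below compares it with a slightly stricter variant that only removes digits followed by whitespace and keeps the newline. Both the toy text and the stricter pattern are my assumptions about what the line numbers look like, so test on the real file before swapping it in.

```python
import re

# Invented text with line numbers, plus a line that starts with a docket number.
toy = ("\n1   CHIEF JUSTICE ROBERTS:  We'll hear argument"
       "\n2   this morning in Case"
       "\n23-719, Trump versus Anderson.")

greedy = re.sub(r"\n\d\d?", "", toy)               # pattern used in the cell above
careful = re.sub(r"\n\d{1,2}(?=\s)", "\n", toy)    # stricter variant (assumption)

print(repr(greedy))
print(repr(careful))
```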


In [42]:
bad_case

[{'case': '23-719',
  'speaker': 'CHIEF JUSTICE ROBERTS',
  'words': " We'll hear argument \n this morning in Case 23-719, Trump versus Anderson. \n Mr. Mitchell. \n ORAL ARGUMENT OF JONATHAN F. MITCHELL \n ON BEHALF OF THE PETITIONER "},
 {'case': '23-719',
  'speaker': 'MR. MITCHELL',
  'words': '  Mr. Chief Justice, and may \n it please the Court: \n The Colorado Supreme Court held that \n   President Donald J. Trump is constitutionally \n   disqualified from serving as president under \n Section 3 of the Fourteenth Amendment.  The Colorado \n Supreme Court\'s decision is wrong and should be \n   reversed for numerous independent reasons. \n The first reason is that President Trump is \n not covered by Section 3 because the president is not \n "an officer of the United States" as that term is \n used throughout the Constitution.  "Officer of the \n   United States" refers only to appointed officials, \n and it does not encompass elected individuals, such \n as the President or membe

In [43]:
for line in bad_case:
    line["words"] = re.sub(r"\n"," ",line["words"])
    line["words"] = re.sub(r"\s+"," ",line["words"])
bad_case

[{'case': '23-719',
  'speaker': 'CHIEF JUSTICE ROBERTS',
  'words': " We'll hear argument this morning in Case 23-719, Trump versus Anderson. Mr. Mitchell. ORAL ARGUMENT OF JONATHAN F. MITCHELL ON BEHALF OF THE PETITIONER "},
 {'case': '23-719',
  'speaker': 'MR. MITCHELL',
  'words': ' Mr. Chief Justice, and may it please the Court: The Colorado Supreme Court held that President Donald J. Trump is constitutionally disqualified from serving as president under Section 3 of the Fourteenth Amendment. The Colorado Supreme Court\'s decision is wrong and should be reversed for numerous independent reasons. The first reason is that President Trump is not covered by Section 3 because the president is not "an officer of the United States" as that term is used throughout the Constitution. "Officer of the United States" refers only to appointed officials, and it does not encompass elected individuals, such as the President or members of Congress. This is clear from the Commissions Clause, the Im

In [44]:
all_transcripts.extend(bad_case)

In [45]:
len(all_transcripts)

23218

In [46]:
import numpy as np
import pandas as pd

In [47]:
df = pd.DataFrame(all_transcripts)
df.head()

| | case | speaker | words |
|---|---|---|---|
| 0 | 23-108 | CHIEF JUSTICE ROBERTS | We will hear argument first this morning in C... |
| 1 | 23-108 | MS. BLATT | Mr. Chief Justice, and may it please the Cour... |
| 2 | 23-108 | CHIEF JUSTICE ROBERTS | Ms. Blatt, if I find a lost pet and return it... |
| 3 | 23-108 | MS. BLATT | So, yes, divorced from, you know, a crime tha... |
| 4 | 23-108 | JUSTICE KAGAN | But if -- if -- I -- I -- I would think that ... |


In [48]:
df2 = df.groupby('speaker')['words'].nunique().reset_index()
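
Keep in mind that `nunique()` counts distinct strings, so two identical short turns by the same speaker (say, " Right. " twice) count once. If you want the number of turns instead, `count()` or `size()` on the same groupby would do that. The plain-Python sketch below shows the difference on a few invented rows; it uses `collections` rather than pandas only so it runs on its own.

```python
from collections import defaultdict

# Invented rows in the same shape as all_transcripts.
rows = [
    {'speaker': 'MR. GREEN', 'words': ' Right. '},
    {'speaker': 'MR. GREEN', 'words': ' Right. '},
    {'speaker': 'MR. GREEN', 'words': ' Then it gets harder.'},
    {'speaker': 'JUSTICE JACKSON', 'words': ' No, not delay.'},
]

turns = defaultdict(int)      # every turn counts
distinct = defaultdict(set)   # identical turns collapse, like nunique()
for row in rows:
    turns[row['speaker']] += 1
    distinct[row['speaker']].add(row['words'])

distinct_counts = {name: len(words) for name, words in distinct.items()}
print(dict(turns))
print(distinct_counts)
```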

In [112]:
df2.sort_values(by='words', ascending=False).head(15)

| | speaker | words |
|---|---|---|
| 8 | JUSTICE GORSUCH | 1864 |
| 9 | JUSTICE JACKSON | 1648 |
| 13 | JUSTICE SOTOMAYOR | 1362 |
| 11 | JUSTICE KAVANAUGH | 1276 |
| 10 | JUSTICE KAGAN | 1151 |
| 6 | JUSTICE ALITO | 1124 |
| 1 | CHIEF JUSTICE ROBERTS | 1062 |
| 7 | JUSTICE BARRETT | 961 |
| 3 | GENERAL PRELOGAR | 776 |
| 14 | JUSTICE THOMAS | 479 |


In [49]:
df2.sort_values(by='speaker').head(60)


| | speaker | words |
|---|---|---|
| 0 | CHIEF JUSTICE ROBERT | 1 |
| 1 | CHIEF JUSTICE ROBERTS | 1062 |
| 2 | GENERAL GORSUCH | 1 |
| 3 | GENERAL PRELOGAR | 776 |
| 4 | JUDGE KAVANAUGH | 1 |
| 5 | JUST SOTOMAYOR | 1 |
| 6 | JUSTICE ALITO | 1124 |
| 7 | JUSTICE BARRETT | 961 |
| 8 | JUSTICE GORSUCH | 1864 |
| 9 | JUSTICE JACKSON | 1648 |


In [145]:
df2.sort_values(by='speaker').tail(60)


| | speaker | words |
|---|---|---|
| 58 | MR. McCLOUD | 99 |
| 59 | MR. McCOLLOCH | 111 |
| 60 | MR. McDOWELL | 61 |
| 61 | MR. McNAMARA | 57 |
| 62 | MR. NIELSON | 135 |
| 63 | MR. PETRANY | 68 |
| 64 | MR. RAYNOR | 141 |
| 65 | MR. ROSENKRANZ | 24 |
| 66 | MR. SAMUELS | 51 |
| 67 | MR. SANTHANAM | 48 |


In [50]:
#so many misspellings!!

df['speaker'] = df['speaker'].str.replace(r'CHIEF JUSTICE ROBERT$','CHIEF JUSTICE ROBERTS', regex=True)
df['speaker'] = df['speaker'].str.replace('JUDGE KAVANAUGH','JUSTICE KAVANAUGH')
df['speaker'] = df['speaker'].str.replace('GENERAL GORSUCH','JUSTICE GORSUCH')
df['speaker'] = df['speaker'].str.replace('JUSTICE PRELOGAR','GENERAL PRELOGAR')
df['speaker'] = df['speaker'].str.replace('JUST SOTOMAYOR','JUSTICE SOTOMAYOR')
df['speaker'] = df['speaker'].str.replace('MS ADEN','MS. ADEN')
df['speaker'] = df['speaker'].str.replace('MR UNIKOWSKY','MR. UNIKOWSKY')
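
When the list of fixes grows, it can be easier to keep them in one dictionary. The sketch below applies such a mapping to plain strings; the helper name `fix_speaker` is made up. It matches whole names exactly, so 'CHIEF JUSTICE ROBERT' is fixed without touching 'CHIEF JUSTICE ROBERTS'.

```python
# Misspelling -> correct label, taken from the replacements above.
corrections = {
    'CHIEF JUSTICE ROBERT': 'CHIEF JUSTICE ROBERTS',
    'JUDGE KAVANAUGH': 'JUSTICE KAVANAUGH',
    'GENERAL GORSUCH': 'JUSTICE GORSUCH',
    'JUSTICE PRELOGAR': 'GENERAL PRELOGAR',
    'JUST SOTOMAYOR': 'JUSTICE SOTOMAYOR',
    'MS ADEN': 'MS. ADEN',
    'MR UNIKOWSKY': 'MR. UNIKOWSKY',
}

def fix_speaker(name):
    """Return the corrected label, or the name unchanged if it is not listed."""
    return corrections.get(name, name)

print([fix_speaker(n) for n in ['CHIEF JUSTICE ROBERT', 'CHIEF JUSTICE ROBERTS', 'MS ADEN']])
```

With a DataFrame, passing the same dictionary to `df['speaker'].replace(corrections)` should do the equivalent exact-match replacement in one call.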



In [51]:
df2 = df.groupby('speaker')['words'].nunique().reset_index()
df2.sort_values(by='speaker').head(60)

| | speaker | words |
|---|---|---|
| 0 | CHIEF JUSTICE ROBERTS | 1063 |
| 1 | GENERAL PRELOGAR | 777 |
| 2 | JUSTICE ALITO | 1124 |
| 3 | JUSTICE BARRETT | 961 |
| 4 | JUSTICE GORSUCH | 1865 |
| 5 | JUSTICE JACKSON | 1648 |
| 6 | JUSTICE KAGAN | 1151 |
| 7 | JUSTICE KAVANAUGH | 1277 |
| 8 | JUSTICE SOTOMAYOR | 1363 |
| 9 | JUSTICE THOMAS | 479 |


In [52]:
df2.sort_values(by='words').head(60)

| | speaker | words |
|---|---|---|
| 16 | MR. BROWN | 1 |
| 38 | MR. KAGAN | 1 |
| 76 | MR. WECHSLER | 12 |
| 110 | MS. VALE | 19 |
| 60 | MR. ROSENKRANZ | 24 |
| 111 | MS. WOLD | 28 |
| 108 | MS. STETSON | 31 |
| 36 | MR. HARRIS | 32 |
| 10 | MR. ABBAS | 32 |
| 109 | MS. STEVENSON | 32 |


In [53]:
df[df['speaker'] == "MR. KAGAN"]
###????

| | case | speaker | words |
|---|---|---|---|
| 14049 | 22-1219 | MR. KAGAN | -- apply the best interpretation -- |
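
A speaker label that shows up only once in a case is a good candidate for a transcription slip, so it can help to list those and check them by hand. The sketch below does that with a `Counter` on invented (case, speaker) pairs; the threshold of exactly one turn is an arbitrary choice on my part.

```python
from collections import Counter

# Invented (case, speaker) pairs standing in for the rows of the DataFrame.
rows = [
    ('22-1219', 'JUSTICE KAGAN'),
    ('22-1219', 'MR. MARTINEZ'),
    ('22-1219', 'JUSTICE KAGAN'),
    ('22-1219', 'MR. KAGAN'),
    ('22-1219', 'JUSTICE KAGAN'),
    ('22-1219', 'MR. MARTINEZ'),
]

counts = Counter(rows)
suspects = sorted(pair for pair, n in counts.items() if n == 1)
print(suspects)
```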


In [152]:
pd.set_option('display.max_colwidth', None)
df[13965:13975]

Unnamed: 0,case,speaker,words
13965,22-1219,CHIEF JUSTICE ROBERTS,Justice -- Justice Kagan?
13966,22-1219,JUSTICE KAGAN,"Mr. Martinez, I want you to think of this from Congress's perspective. So I was thinking what is the next big piece of legislation on the horizon and who knows, don't have a crystal ball, but I'm going to say -- I'm going to guess that it's artificial intelligence. So let's imagine Congress enacts an artificial intelligence bill and it has all kinds of delegations, maybe it creates an agency for the purpose or maybe it uses existing agencies and it has all kinds of delegations to that agency or agencies about how to regulate artificial intelligence so that this nation can capture the -- the -- the opportunities but also meet the challenges of that. And then, just by the nature of things and especially the nature of the subject, there are going to be all kinds of places where, although there's not an explicit delegation, Congress has, in effect, left a gap. It has created an ambiguity. And what Congress is thinking is, do we want courts to fill that gap, or do we want an agency to fill that gap? When the normal techniques of legal interpretation have run out, on the matter of artificial intelligence, what does Congress want, Mr. Martinez?"
13967,22-1219,MR. MARTINEZ,I think Congress wants courts to interpret the best interpretation of their --
13968,22-1219,JUSTICE KAGAN,Congress doesn't know
13969,22-1219,MR. KAGAN,-- apply the best interpretation --
13970,22-1219,JUSTICE KAGAN,"-- what that answer means. Congress knows that there are going to be gaps because Congress can hardly see a week in the future with respect to this subject, let alone a year or a decade in the future. And Congress knows that there are going to be things that it writes that it's just not going to be clear how this will apply or what it will mean with respect to countless factual situations that this country will have to address. Does the Congress want this Court to decide those questions, policy-laden questions, of artificial intelligence?"
13971,22-1219,MR. MARTINEZ,"I -- I don't think Congress wants the Court to do policy. I think Congress wants the Court to do its ordinary function, which is interpret the law and figure -- and apply the best understanding of the law. And I think that the implication of your question is that this is some sort of intentional delegation by Congress, that Chevron deference is -- is this implicit delegation. But I -- I don't think that's right. I think many people, including a very insightful article that -- that you wrote 20 years ago, make clear that this is fictional. This delegation is fictional."
13972,22-1219,JUSTICE KAGAN,"Fictional just means -- is like academic speak for presumed. We are indeed presuming congressional intent. The congressional intent, it -- you know, the -- the delegation is not explicit on the face of this statute, but what we're thinking is Congress knows things about different institutions, about what they know, about what they're competent with respect to, and Congress knows that this Court and lower courts are not competent with respect to deciding all the questions about AI that are going to come up in the future. And what Congress wants, we presume, is for people who actually know about AI to decide those questions. And also, those same people who know about AI are people who, to some degree in some way, are accountable to the political process. They have constituencies. They have fact-finding abilities. They are obligated to go consult with people. They report to a president, who needs to be elected. In all kinds of ways, both with -- with respect to expertise and with respect to their connections to the public and to other policymaking entities, those are the people Congress wants to decide questions about AI. We don't even know what the questions are about AI, let alone the answers to them, ""we"" being the Court."
13973,22-1219,MR. MARTINEZ,"Justice Kagan, I think, if we're trying to figure out what the -- what the reasonable thing to infer that Congress has presumed, I think the far more reasonable presumption and the one that's most consistent with our constitutional structure is that Congress is going to presume that courts are going to do law, not policy, they're going to pick the best interpretation and enforce the best interpretation as to this statute in the exact same way that they would do it with respect to any other -- any other statute. And I think this case actually -- you know, AI is a trickier example --"
13974,22-1219,JUSTICE KAGAN,"I mean, but it's --"


In [54]:
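# Row 13969 is mislabeled "MR. KAGAN"; the speaker there is Mr. Martinez.
# regex=False makes the "." a literal period rather than a regex wildcard.
df['speaker'] = df['speaker'].str.replace('MR. KAGAN', 'MR. MARTINEZ', regex=False)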
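
The fix above is meant to change only the literal text `MR. KAGAN`. Here is a minimal sketch, on made-up speaker labels, of how the `regex` argument of `str.replace` changes what gets matched. The default for `regex` was `True` before pandas 2.0 and is `False` from 2.0 on, so spelling it out avoids surprises. The `speakers` Series is an assumption for illustration, not part of the homework data.

```python
import pandas as pd

# Made-up labels standing in for the real `speaker` column.
speakers = pd.Series(["MR. KAGAN", "MRS KAGAN", "JUSTICE KAGAN"])

# Literal match: only the exact text "MR. KAGAN" is replaced.
literal = speakers.str.replace("MR. KAGAN", "MR. MARTINEZ", regex=False)

# Regex match: the unescaped "." matches any character,
# so "MRS KAGAN" is replaced as well.
as_regex = speakers.str.replace("MR. KAGAN", "MR. MARTINEZ", regex=True)

print(literal.tolist())   # ['MR. MARTINEZ', 'MRS KAGAN', 'JUSTICE KAGAN']
print(as_regex.tolist())  # ['MR. MARTINEZ', 'MR. MARTINEZ', 'JUSTICE KAGAN']
```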

In [55]:
pd.set_option('display.max_colwidth', 30)
df[13965:13975]

Unnamed: 0,case,speaker,words
13965,22-1219,JUSTICE GORSUCH,"-- and then, if I might j..."
13966,22-1219,MR. MARTINEZ,No.
13967,22-1219,JUSTICE GORSUCH,What happens?
13968,22-1219,MR. MARTINEZ,I think the agency can ov...
13969,22-1219,JUSTICE GORSUCH,And I'm struck on that sc...
13970,22-1219,MR. MARTINEZ,Right.
13971,22-1219,JUSTICE GORSUCH,And then the next adminis...
13972,22-1219,MR. MARTINEZ,That's -- that -- that's ...
13973,22-1219,JUSTICE GORSUCH,-- where we started.
13974,22-1219,MR. MARTINEZ,"That's exactly right, Jus..."


In [56]:
# nunique() counts the distinct utterances per speaker, not individual words
df2 = df.groupby('speaker')['words'].nunique().reset_index()
df2.sort_values(by='speaker').head(60)


Unnamed: 0,speaker,words
0,CHIEF JUSTICE ROBERTS,1063
1,GENERAL PRELOGAR,777
2,JUSTICE ALITO,1124
3,JUSTICE BARRETT,961
4,JUSTICE GORSUCH,1865
5,JUSTICE JACKSON,1648
6,JUSTICE KAGAN,1151
7,JUSTICE KAVANAUGH,1277
8,JUSTICE SOTOMAYOR,1363
9,JUSTICE THOMAS,479
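
The `words` column in the table above holds a count of distinct utterances, which is easy to misread as a word count. A small sketch on a made-up transcript (the `toy` frame and its speakers are assumptions) shows three different tallies side by side: distinct utterances, all turns, and a rough whitespace-based word total.

```python
import pandas as pd

# Made-up mini-transcript with the same column names as `df`.
toy = pd.DataFrame({
    "speaker": ["JUSTICE A", "JUSTICE A", "JUSTICE A", "MR. B"],
    "words": ["Thank you, counsel.", "Thank you, counsel.", "Why?", "Right."],
})

# Distinct utterance strings per speaker (what nunique() reports).
unique_turns = toy.groupby("speaker")["words"].nunique()

# Every turn per speaker, repeats included.
all_turns = toy.groupby("speaker").size()

# Rough number of words spoken: split each utterance on whitespace.
toy["n_words"] = toy["words"].str.split().str.len()
total_words = toy.groupby("speaker")["n_words"].sum()

print(unique_turns["JUSTICE A"], all_turns["JUSTICE A"], total_words["JUSTICE A"])  # 2 3 7
```

Which tally is the right one depends on the question you are asking; the whitespace split is only an approximation, since it counts tokens like `--` as words.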


In [57]:
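from collections import Counter

# Counter docs: https://docs.python.org/3/library/collections.html#collections.Counter
# Per speaker: concatenate all utterances, lowercase them, pull out words of
# 7+ word characters, and keep the 12 most common.
wordcount = (
    df.groupby("speaker")["words"].sum()
    .map(lambda words: Counter(re.findall(r"\b\w{7,}\b", words.lower())).most_common(12))
    .reset_index()
)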
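
One thing to keep in mind: `.sum()` on a column of strings concatenates them with nothing in between, so when an utterance ends without punctuation, its last word can fuse with the first word of the next one. The sketch below uses a made-up two-row frame and a hypothetical helper, `top_long_words`, to compare `.sum()` with joining on a space. Treat it as an optional variation, not as the method used for the outputs shown here.

```python
import re
from collections import Counter

import pandas as pd

# Made-up rows; the first utterance ends without punctuation.
toy = pd.DataFrame({
    "speaker": ["MR. B", "MR. B"],
    "words": ["Congress doesn't know", "whether the statute applies."],
})

fused = toy.groupby("speaker")["words"].sum()            # no separator
spaced = toy.groupby("speaker")["words"].agg(" ".join)   # one space between turns


def top_long_words(text, n=12):
    """Hypothetical helper: most common words of 7+ word characters."""
    return Counter(re.findall(r"\b\w{7,}\b", text.lower())).most_common(n)


print(top_long_words(fused["MR. B"]))
# [('congress', 1), ('knowwhether', 1), ('statute', 1), ('applies', 1)]
print(top_long_words(spaced["MR. B"]))
# [('congress', 1), ('whether', 1), ('statute', 1), ('applies', 1)]
```

Most turns in a transcript end in punctuation, so the effect on the counts is likely small, but `" ".join` removes the edge case entirely.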


In [60]:
wordcount.head(10)

Unnamed: 0,speaker,words
0,CHIEF JUSTICE ROBERTS,"[(justice, 1030), (counsel..."
1,GENERAL PRELOGAR,"[(congress, 397), (because..."
2,JUSTICE ALITO,"[(question, 211), (argumen..."
3,JUSTICE BARRETT,"[(because, 217), (justice,..."
4,JUSTICE GORSUCH,"[(government, 159), (under..."
5,JUSTICE JACKSON,"[(because, 342), (understa..."
6,JUSTICE KAGAN,"[(because, 172), (question..."
7,JUSTICE KAVANAUGH,"[(question, 154), (because..."
8,JUSTICE SOTOMAYOR,"[(because, 201), (question..."
9,JUSTICE THOMAS,"[(government, 50), (argume..."


In [61]:
# row 5 is JUSTICE JACKSON
wordcount.iloc[5]['words']

[('because', 342),
 ('understand', 265),
 ('government', 262),
 ('question', 237),
 ('whether', 208),
 ('statute', 190),
 ('justice', 187),
 ('congress', 186),
 ('different', 143),
 ('situation', 133),
 ('argument', 132),
 ('thought', 126)]

In [62]:
# keep only the words (not the counts) of the ten most common, joined into one string
wordcount["string"] = wordcount.words.apply(lambda x: ', '.join([i[0] for i in x[0:10]]))

In [63]:
wordcount

Unnamed: 0,speaker,words,string
0,CHIEF JUSTICE ROBERTS,"[(justice, 1030), (counsel...","justice, counsel, argument..."
1,GENERAL PRELOGAR,"[(congress, 397), (because...","congress, because, statute..."
2,JUSTICE ALITO,"[(question, 211), (argumen...","question, argument, whethe..."
3,JUSTICE BARRETT,"[(because, 217), (justice,...","because, justice, question..."
4,JUSTICE GORSUCH,"[(government, 159), (under...","government, understand, qu..."
...,...,...,...
106,MS. SRIDHARAN,"[(federal, 15), (methodolo...","federal, methodology, cons..."
107,MS. STETSON,"[(question, 28), (justice,...","question, justice, emissio..."
108,MS. STEVENSON,"[(candidate, 12), (colorad...","candidate, colorado, amend..."
109,MS. VALE,"[(sources, 18), (downwind,...","sources, downwind, control..."


In [64]:

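# Capitalized words (3+ characters) that come right after four word/space characters.
# Words that open a sentence follow punctuation, so they are mostly skipped.
wordcount2 = df.groupby("speaker")["words"].sum().map(lambda words: Counter(re.findall(r"[\w ]{4}\b([A-Z]\w{2,})\b", words)).most_common(12)).reset_index()
wordcount2.tail(10)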

Unnamed: 0,speaker,words
101,MS. PETTIT,"[(Compact, 23), (New, 21),..."
102,MS. REAVES,"[(Court, 38), (Nieves, 20)..."
103,MS. ROSS,"[(Court, 37), (Justice, 8)..."
104,MS. SANTOS,"[(Honor, 28), (Court, 21),..."
105,MS. SINZDAK,"[(Court, 45), (Section, 26..."
106,MS. SRIDHARAN,"[(EPA, 30), (Honor, 19), (..."
107,MS. STETSON,"[(EPA, 19), (Court, 12), (..."
108,MS. STEVENSON,"[(Court, 14), (Colorado, 9..."
109,MS. VALE,"[(EPA, 6), (Court, 3), (Go..."
110,MS. WOLD,"[(Fourth, 37), (Court, 13)..."
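
To see what the capitalized-word pattern actually picks up, here is a minimal sketch on one made-up sentence pair (the `text` string is an assumption, not transcript data). As I read the pattern: `[\w ]{4}` requires four word-or-space characters immediately before the word, `\b` forces the last of those to be a space, and `([A-Z]\w{2,})` captures a capitalized word of three or more characters.

```python
import re

# Made-up text for illustration only.
text = (
    "I think the Court said so. The United States argued "
    "that Congress and the EPA disagreed."
)

pattern = r"[\w ]{4}\b([A-Z]\w{2,})\b"
found = re.findall(pattern, text)

print(found)  # ['Court', 'United', 'Congress', 'EPA']
```

Two side effects are worth noticing. `The` is skipped because the four characters before it include a period, which is how most sentence openers get filtered out. `States` is skipped too: the match for `United` already consumed the text in front of it, and `re.findall` returns non-overlapping matches, so the second word of a two-word proper name is usually dropped.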


In [184]:
# row 5 is JUSTICE JACKSON
wordcount2.iloc[5]['words']

[('Congress', 154),
 ('Justice', 153),
 ('Court', 63),
 ('First', 42),
 ('Section', 30),
 ('Board', 30),
 ('Seventh', 30),
 ('Constitution', 29),
 ('Idaho', 22),
 ('United', 21),
 ('Sacklers', 16),
 ('Barker', 16)]

In [185]:
# row 3 is JUSTICE BARRETT
wordcount2.iloc[3]['words']

[('Justice', 151),
 ('Congress', 34),
 ('Texas', 25),
 ('Court', 21),
 ('Article', 17),
 ('Idaho', 15),
 ('Facebook', 15),
 ('NTA', 14),
 ('Robinson', 13),
 ('Honor', 11),
 ('Sixth', 10),
 ('Fifth', 10)]

### So...what to do from here?

That is the challenge!

There are certainly more investigations you can do into the words here. There are fancier ways of analyzing natural language, but for the purposes of the next week I don't think they are necessary unless it's something that you understand and are interested in doing. I believe that doing your own analyses through word searches can be just as powerful as, if not more powerful than, engaging ML/AI (though there are certainly useful approaches via those tools).

The number one priority is to merge this with another data set. We have some that we have scraped already, and I highly recommend using those.
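
A minimal sketch of such a merge, assuming the second data set is a case list shaped like the `each_case` records scraped earlier (`docket_num`, `case_name`, `date`). The `transcripts` rows are made up, and the names `transcripts`, `cases`, `merged`, and `unmatched` are placeholders for illustration.

```python
import pandas as pd

# Stand-in for the transcript frame: same columns as `df`, made-up rows.
transcripts = pd.DataFrame({
    "case": ["23-108", "23-108", "23-50", "22-1219"],
    "speaker": ["JUSTICE A", "MR. B", "JUSTICE A", "MR. C"],
    "words": ["First question.", "Yes.", "Second question.", "No."],
})

# Stand-in for a scraped case list, shaped like `each_case`.
cases = pd.DataFrame([
    {"docket_num": "23-108", "case_name": "Snyder v. United States", "date": "04/15/24"},
    {"docket_num": "23-50", "case_name": "Chiaverini v. City of Napoleon", "date": "04/15/24"},
])

# A left merge keeps every transcript row; indicator=True adds a `_merge`
# column that flags rows with no match in the case list.
merged = transcripts.merge(
    cases, left_on="case", right_on="docket_num", how="left", indicator=True
)

unmatched = merged.loc[merged["_merge"] == "left_only", "case"].unique()
print(list(unmatched))  # ['22-1219']
```

Checking `unmatched` right after the merge is a quick way to catch docket numbers that are formatted differently in the two sources.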

You could get information from the opinions (though I don't recommend engaging with the opinion docs):

https://www.supremecourt.gov/opinions/slipopinion/23

You could also use the individual docket information.

The docket information does include the lower court where the case originated, so if you did want to map, you could go in that direction. But the main challenge is to think about what the map would then show, either by case or by lower court.

Again, I believe that potentially the strongest approach would be to come up with categories for all of these cases. That would entail doing your own research. As I mentioned, do not look up a source for standard categories. Actually look at the cases and come up with your own categories, and add them to your main data frame, so that they are attached to the dockets. I would say somewhere between six and ten categories is a fine guideline. The point here would be to group/aggregate these transcripts based on the categories you choose, and then do analyses in pandas/plots (or map by lower courts).
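
A minimal sketch of attaching categories to dockets and aggregating by them. Everything here is a placeholder: the `transcripts` rows are made up, and the labels in `categories` ("category 1", "category 2") stand in for whatever categories you come up with from your own reading of the cases.

```python
import pandas as pd

# Stand-in for the transcript frame: same columns as `df`, made-up rows.
transcripts = pd.DataFrame({
    "case": ["23-108", "23-108", "23-50", "22-1219", "22-1219"],
    "speaker": ["JUSTICE A", "MR. B", "JUSTICE A", "JUSTICE A", "MR. C"],
    "words": ["What is the rule?", "It depends.", "Why?", "Is that policy?", "No."],
})

# Hand-built mapping from docket number to your own category label.
categories = {
    "23-108": "category 1",
    "23-50": "category 1",
    "22-1219": "category 2",
}
transcripts["category"] = transcripts["case"].map(categories)

# Any docket missing from the mapping shows up as NaN, so check for gaps.
missing = transcripts["category"].isna().sum()

# Rough words spoken per speaker within each category.
transcripts["n_words"] = transcripts["words"].str.split().str.len()
summary = (
    transcripts.groupby(["category", "speaker"])["n_words"]
    .sum()
    .unstack(fill_value=0)
)
print(summary)
```

Keeping the mapping in a plain dictionary (or a small CSV you edit by hand) makes it easy to revise categories as you read more of the cases.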

But, of course, there are many other interpretive approaches, and you might want to just come up with your own. That is the main goal of these projects: to create your own point of view/interpretive lens by building a data set.