## Homework 7.1: Scraping, Downloading, Converting, and Transforming Supreme Court Transcripts
Here we extend our recent homework so that we can actually download PDFs of the transcripts.

Here I demo how to download PDFs using Python, and then I use Tika (which you would need to install if you wanted to use it) to convert the PDFs to text. You **should not** run this part; I am keeping it in here so you can use it in the future.

I have provided the downloaded and converted court documents for the homework. Please **skip to the homework part** to start transforming those transcripts using regular expressions.


In [None]:
import requests
from bs4 import BeautifulSoup
# same scrape as the homework, but here I am just getting the pdf link
my_url = "https://www.supremecourt.gov/oral_arguments/argument_transcript/2023"
raw_html = requests.get(my_url).content
court_doc = BeautifulSoup(raw_html, "html.parser")


In [2]:
# getting all of the court info
all_rows = court_doc.find_all('tr')
each_case = []
for row in all_rows[1:]:
    if row.td:
        this_row = {}
        each_cell=row.find_all('td')
        this_row['docket_num'] = each_cell[0].span.text
        #I am using split below to get just the pdf name out of the file structure
        #so instead of ../argument_transcripts/2023/22-6389_8n59.pdf I get 22-6389_8n59.pdf
        this_row['pdf_link'] = each_cell[0].a['href'].split('/')[-1]
        this_row['case_name'] = each_cell[0].find_all('span')[1].text
        this_row['date'] = each_cell[1].text
        each_case.append(this_row)
each_case


[{'docket_num': '23-108',
  'pdf_link': '23-108_o7jp.pdf',
  'case_name': 'Snyder v. United States',
  'date': '04/15/24'},
 {'docket_num': '23-50',
  'pdf_link': '23-50_g3bh.pdf',
  'case_name': 'Chiaverini v. City of Napoleon',
  'date': '04/15/24'},
 {'docket_num': '23-5572',
  'pdf_link': '23-5572_l537.pdf',
  'case_name': 'Fischer v. United States',
  'date': '04/16/24'},
 {'docket_num': '22-982',
  'pdf_link': '22-982_m64n.pdf',
  'case_name': 'Thornell v. Jones',
  'date': '04/17/24'},
 {'docket_num': '23-175',
  'pdf_link': '23-175_20f4.pdf',
  'case_name': 'City of Grants Pass v. Johnson',
  'date': '04/22/24'},
 {'docket_num': '22-1218',
  'pdf_link': '22-1218_h3ci.pdf',
  'case_name': 'Smith v. Spizzirri',
  'date': '04/22/24'},
 {'docket_num': '23-334',
  'pdf_link': '23-334_ifjm.pdf',
  'case_name': 'Dept. of State v. Munoz',
  'date': '04/23/24'},
 {'docket_num': '23-367',
  'pdf_link': '23-367_5he6.pdf',
  'case_name': 'Starbucks Corp. v. McKinney',
  'date': '04/23/24'}

Next I used the **requests** library to download all of the PDFs to a folder on my computer.

In [None]:
#folder hierarchy
#https://www.supremecourt.gov/oral_arguments/argument_transcripts/2023/

In [3]:
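import os
import time
import requests

# make sure the output folder exists before writing to it
os.makedirs("court_docs", exist_ok=True)
for case in each_case:
    time.sleep(2)  # pause between requests to be polite to the server
    link = 'https://www.supremecourt.gov/oral_arguments/argument_transcripts/2023/' + case['pdf_link']
    file_name = "court_docs/" + case['pdf_link']
    r = requests.get(link, stream=True)
    # write the PDF to disk in chunks instead of loading it all into memory
    with open(file_name, 'wb') as pdf_file:
        # without chunk_size, iter_content() yields one byte at a time
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:
                pdf_file.write(chunk)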

In [4]:
len(each_case)

61


Here I use **tika** to extract the text from the pdfs and write txt files to my computer.

In [5]:
import tika
from tika import parser
import time
for case in each_case:
    time.sleep(0.5)
    print(case['pdf_link'])
    file_name = "court_docs/" + case['pdf_link']
    parsed_pdf = parser.from_file(file_name)
    # 'content' can be None when Tika finds no text, so fall back to an empty string
    txt_data = parsed_pdf['content'] or ""
    txt_name = case['pdf_link'].split('.')[0] + ".txt"
    print(txt_name)
    file_out ="court_docs/" + txt_name
    with open(file_out, 'w') as outfile:
        outfile.write(txt_data)


    

23-108_o7jp.pdf
23-108_o7jp.txt
23-50_g3bh.pdf
23-50_g3bh.txt
23-5572_l537.pdf
23-5572_l537.txt
22-982_m64n.pdf
22-982_m64n.txt
23-175_20f4.pdf
23-175_20f4.txt
22-1218_h3ci.pdf
22-1218_h3ci.txt
23-334_ifjm.pdf
23-334_ifjm.txt
23-367_5he6.pdf
23-367_5he6.txt
23-726_ggco.pdf
23-726_ggco.txt
23-939_f2qg.pdf
23-939_f2qg.txt
23-411_5367.pdf
23-411_5367.txt
22-842_1823.pdf
22-842_1823.txt
23-14_885f.pdf
23-14_885f.txt
22-1079_4357.pdf
22-1079_4357.txt
22-1025_k13j.pdf
22-1025_k13j.txt
141-orig_2_5okl.pdf
141-orig_2_5okl.txt
23-250_9o6b.pdf
23-250_9o6b.txt
23-21_7m4e.pdf
23-21_7m4e.txt
23-235_g71f.pdf
23-235_g71f.txt
23-370_5368.pdf
23-370_5368.txt
23-146_5pd4.pdf
23-146_5pd4.txt
23-719_2jf3.pdf
23-719_2jf3.txt
22-1008_f3b7.pdf
22-1008_f3b7.txt
23-51_869d.pdf
23-51_869d.txt
23a349_iie0.pdf
23a349_iie0.txt
22-1078_775f.pdf
22-1078_775f.txt
22-277_5924.pdf
22-277_5924.txt
22-555_6a77.pdf
22-555_6a77.txt
22-7386_ipdh.pdf
22-7386_ipdh.txt
22-529_f64i.pdf
22-529_f64i.txt
22-976_l07n.pdf
22-976_l07

In [7]:
#Open a text file from your computer
#We are going to use this one.

# the with block closes the file automatically once it has been read
with open('court_docs/22-1178_97m5.txt', 'r') as f:
    sample_transcript = f.read()

In [9]:
sample_transcript

"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n\n  \n \n\n  \n \n\n \n \n\n \n  \n\n \n \n\n        \n \n                  \n \n\n   \n \n\n                  \n \n               \n \n                   \n \n\n       \n \n               \n \n                  \n \n \n \n \n \n \n \n \n \n \n \n \n\n   \n \n\n \n \n\n \n\n- - - - - - - - - - - - - - - - - -\n\n- - - - - - - - - - - - - - - - - -\n\nSUPREME COURT \nOF THE UNITED STATES \n\nIN THE SUPREME COURT OF THE UNITED STATES \n\nFEDERAL BUREAU OF INVESTIGATION,  ) \n\nET AL.,          ) \n\nPetitioners,  ) \n\nv. ) No. 22-1178 \n\nYONAS FIKRE,                ) \n\nRespondent.  ) \n\nPages: 1 through 88 \n\nPlace: Washington, D.C. \n\nDate: January 8, 2024 \n\nHERITAGE REPORTING CORPORATION \nOfficial Reporters \n\n1220 L Street, N.W., Suite 206 \nWashington, D.C.  20005 \n\n(202) 628-4888 \nwww.hrccourtreporters.com \n\nwww.hrccourtreporters.com\n\n\n  \n \n\n \n\n  \n\n \n \n             

### NOW THE HOMEWORK!!
**How in the world are you going to clean this up?**
Take a close look and think first about what needs to be removed, and then about what needs to be isolated. You'll probably need a combination of regular expressions. You will need to use re.sub() -- which is a regex replace -- and re.split() -- where you split the text at a given point and keep just the part of the text that you want.



In [11]:
import re
# the with block closes the file automatically once it has been read
with open('court_docs/22-1178_97m5.txt', 'r') as f:
    sample_transcript = f.read()
sample_transcript

"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n\n  \n \n\n  \n \n\n \n \n\n \n  \n\n \n \n\n        \n \n                  \n \n\n   \n \n\n                  \n \n               \n \n                   \n \n\n       \n \n               \n \n                  \n \n \n \n \n \n \n \n \n \n \n \n \n\n   \n \n\n \n \n\n \n\n- - - - - - - - - - - - - - - - - -\n\n- - - - - - - - - - - - - - - - - -\n\nSUPREME COURT \nOF THE UNITED STATES \n\nIN THE SUPREME COURT OF THE UNITED STATES \n\nFEDERAL BUREAU OF INVESTIGATION,  ) \n\nET AL.,          ) \n\nPetitioners,  ) \n\nv. ) No. 22-1178 \n\nYONAS FIKRE,                ) \n\nRespondent.  ) \n\nPages: 1 through 88 \n\nPlace: Washington, D.C. \n\nDate: January 8, 2024 \n\nHERITAGE REPORTING CORPORATION \nOfficial Reporters \n\n1220 L Street, N.W., Suite 206 \nWashington, D.C.  20005 \n\n(202) 628-4888 \nwww.hrccourtreporters.com \n\nwww.hrccourtreporters.com\n\n\n  \n \n\n \n\n  \n\n \n \n             

**GET RID OF THE HEADER FIRST**

Try to find the pattern that appears at the beginning of every case, when the arguments begin.

Make a regular expression that will find that, and use an re.split() to split the transcript into a list with two parts, the header and the rest.

Save the rest! (That is, hint hint, the [1] element in that list.)
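
As a rough illustration of the split-and-keep-`[1]` idea, here is a minimal sketch on a made-up string. The marker `BEGIN ARGUMENT` and the sample text are invented for this example; the real transcripts use a different pattern, which you still need to find yourself.

```python
import re

# Toy text: the marker "BEGIN ARGUMENT" is made up for this example
toy_text = "Title page\nReporter info\nBEGIN ARGUMENT\nMR. SMITH: Good morning."

# maxsplit=1 splits only at the first match, so the list has exactly two parts
parts = re.split(r"BEGIN\s+ARGUMENT\s*", toy_text, maxsplit=1)

header = parts[0]   # everything before the marker
body = parts[1]     # everything after the marker: the part to keep
print(body)
```

Using `maxsplit=1` is one way to guarantee a two-element list even if the marker text happens to appear again later in the document.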


In [None]:
#remove the header


**NEXT GET RID OF THE FOOTER** 

Do the same thing you did above: find a pattern that appears at the end of every transcript.

Do an re.split() and save the part with the arguments.


In [None]:
#remove the footer


**REMOVE THE PAGE BREAKS**

OK, the page breaks are very messy: they have tons of numbers and other text. Try to write an expression that captures all of that!

When you do that use re.sub() to replace all of those page break messes with " ".
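
Here is a small sketch of that kind of replacement on a toy string. The page-break format below (a page number followed by a company line) is invented for illustration, so the pattern will not work on the real transcripts as written.

```python
import re

# Toy text: this page-break layout is an assumption, not the real transcript format
toy_body = "the question is whether\n\n12\n\nToy Reporting Company\n\nthe statute applies."

# \s* soaks up the surrounding whitespace and newlines; \d+ matches the page number
page_break = r"\s*\d+\s+Toy Reporting Company\s*"

# every match of the pattern is replaced with a single space
cleaned = re.sub(page_break, " ", toy_body)
print(cleaned)
```

Because `\s` also matches newlines, one pattern can swallow a page break that is spread over several lines.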

In [None]:
#remove the page breaks

**SPLIT THE DIALOGUE BY SPEAKER AND SPEECH (SPEAKER/WORDS)**

This is the toughest part. You need to write a regular expression that accurately captures who is speaking like:
```
MR. GORE:
```
OR
```
JUSTICE KAGAN:
```
If you do a split using groups (as I demoed in the advanced regular expressions notebook), it will split by the pattern but keep the pattern instead of discarding it (as a default split does). That way you will get a list where every other element is the speaker. Like this (taken from inside a transcript):

```
'JUSTICE KAGAN',
 '  Did your expert \n\npresent an alternative study which did control \n\nfor geography and reached a different result? ',
 'MR. GORE',
 " He did not try to mirror \n\nDr. Ragusa's study --",
 'JUSTICE KAGAN',
 '  Because that would \n\nhave been the easiest way to undermine the \n\ntheory.  I mean, as I understand it, this was \n\nhardly touched upon by -- by -- by -- by the \n\nstate below.  And, certainly, the state did not \n\ndo what would seem to be the -- the normal thing \n\nif you were really concerned about this, which \n\nis to say: Look at our study.  We controlled \n\n \n\nfor geography. The results are entirely\n\n different.',
 'MR. GORE',
 " We did raise objections to\n\n Dr. Ragusa's methodology, and as I was \n\nexplaining, it is a flawed methodology and not\n\n reliable.\n\n Moreover, the state presented direct\n\n testimony from the map drawer to explain which\n\n VTDs were chosen and why.  That direct evidence \n\nshowed, like all the other direct evidence, that \n\ndecisions were made based on politics and \n\ntraditional principles and not using race at \n\nall. ",
 'JUSTICE SOTOMAYOR',
 " I think you end up \n\nin a very poor starting point under clear error \n\narguing the substance of believability of one \n\nexpert over another, because credibility \n\nfindings under clear error standard must be \n\ndeferred to to the district court. \n\nI understand your points about -- your \n\npoint about Dr. Ragusa, but I just point out \n\nthat other experts before the court and he \n\nhimself said that geography was very much \n\nembedded as part of the structure of his \n\nanalysis. \n\n \n\nYou may disagree with that.  It's\n\n going to be very hard for you to show that no \n\nfact finder could credit that understanding of\n\n his testimony.\n\n But I think what I'm really troubled \n\nby is, going back to Justice Thomas's question, \n\nwhat's the legal error and what's the clear\n\n error? Just tick them off for me.",
 'MR. GORE',
 ' There are several legal \n\nerrors, Justice Sotomayor. ',
 'JUSTICE SOTOMAYOR',
 '  Not facts.  I want \n\nlegal errors or clear errors beyond -- under our \n\nstandard. ',
 'MR. GORE',
 ' The first legal error is a \n\nfailure to enforce the alternative map \n\nrequirement. ',
 'JUSTICE KAGAN',
 "  Okay.  I'm going to \n\nbutt in.  And I'm sorry, Justice Sotomayor. ",
 'JUSTICE SOTOMAYOR',
 '  Yes, you can --\n\nyou can start there. ',
```

If you get this, you are done!!
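
To show how a capture group changes what `re.split()` returns, here is a minimal sketch on a made-up exchange. The speaker names and the pattern are assumptions for illustration; this pattern is deliberately simple and will miss speakers in the real transcripts.

```python
import re

# Toy dialogue with invented speakers
toy_dialogue = "JUSTICE DOE: Is that right? MR. ROE: Yes, it is. JUSTICE DOE: Thank you."

# The outer parentheses form a capture group, so re.split() keeps each speaker label.
# (?: ... ) groups the alternatives without capturing them separately.
speaker_pattern = r"((?:JUSTICE|MR\.|MS\.) [A-Z]+):"

pieces = re.split(speaker_pattern, toy_dialogue)

# pieces[0] is whatever came before the first speaker (an empty string here),
# so speakers sit at the odd indexes and their words at the even ones
speakers = pieces[1::2]
words = [w.strip() for w in pieces[2::2]]
print(list(zip(speakers, words)))
```

Slicing with `[1::2]` and `[2::2]` is one way to pair each speaker with what they said; a loop over index pairs would work just as well.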

In [None]:
#split speakers!!

In [None]:
#go back and try this on a few other transcripts by changing the name of the text file, and see if it works on a few more!!



### Now, loop through all of the 2023 text files and get one list of dictionaries

Each dictionary should have these entries:
docketnum:
speaker:
text:
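
One possible shape for that loop is sketched below. It rests on several assumptions: the helper name `transcript_to_rows` is hypothetical, the speaker pattern is a placeholder for the one you write yourself, and the docket number is guessed from the part of the file name before the first underscore (names like `141-orig_2_5okl.txt` may need special handling).

```python
import glob
import os
import re

def transcript_to_rows(docket_num, cleaned_text):
    """Turn one cleaned transcript into a list of docketnum/speaker/text dicts."""
    # Placeholder speaker pattern (an assumption); swap in your own
    pieces = re.split(r"((?:CHIEF JUSTICE|JUSTICE|MR\.|MS\.) [A-Z]+):", cleaned_text)
    rows = []
    # pieces[0] is the text before the first speaker, so pairs start at index 1
    for speaker, words in zip(pieces[1::2], pieces[2::2]):
        rows.append({"docketnum": docket_num, "speaker": speaker, "text": words.strip()})
    return rows

all_speeches = []
for path in sorted(glob.glob("court_docs/*.txt")):
    # Assumption: the docket number is the file name up to the first "_"
    docket_num = os.path.basename(path).split("_")[0]
    with open(path, "r") as f:
        text = f.read()
    # Your header, footer, and page-break cleanup would go here before splitting
    all_speeches.extend(transcript_to_rows(docket_num, text))
```

Keeping the per-transcript work in a function makes it easy to test on one file before running the whole folder.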

### Finally, bring that into pandas!
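
A minimal sketch of that last step, assuming you already have a list of dictionaries with the three keys. The rows below are made up so the example runs on its own.

```python
import pandas as pd

# Made-up rows in the docketnum / speaker / text shape
toy_rows = [
    {"docketnum": "00-000", "speaker": "JUSTICE DOE", "text": "Is that right?"},
    {"docketnum": "00-000", "speaker": "MR. ROE", "text": "Yes, it is."},
]

# A list of dictionaries becomes one row per dictionary and one column per key
df = pd.DataFrame(toy_rows)
print(df.head())
```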