## Supreme Court Project Guide

The ultimate goal of this project is to build a database of Supreme Court cases for 2020 (or a different range of years) that includes the dialogue from the oral arguments of each case. As we have seen in class, the arguments were scraped from this page: https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx

See if you can follow that guide to downloading and transforming PDFs to text (don't be shy on Slack!)

Once you have a folder of text transcripts, there are three primary programmatic steps that you need to complete:

**Please note:** Step 3 is the most challenging--if you want to spend some time coding, you can skip Steps 1 and 2 and get to work on Step 3

**STEP 1:** scrape all of the case information available on this page: https://www.supremecourt.gov/oral_arguments/argument_transcript/2020

This should include case name, docket number, etc--and most importantly the name of the PDF file. All of the text files share the exact same name as the PDF files they came from. This file name will allow you to connect your transcription data with your case data. 

It is up to you what kind of data structure you want to build. But it is likely to be a list of lists or a list of dictionaries: for each case you will have a list or dictionary of the information you scrape from the webpage.
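As a minimal sketch of the list-of-dictionaries option: the keys and sample values below are placeholders made up for illustration, not scraped data. It also shows how the shared file name lets you derive the text file that belongs to each case.

```python
from pathlib import Path

# Hypothetical records: the keys and values are illustrative placeholders,
# not real scraped data. Use whatever fields you actually collect.
cases = [
    {"docket": "00-0001", "case_name": "Example v. Sample",
     "date_argued": "01/01/20", "pdf_name": "00-0001_abcd.pdf"},
    {"docket": "00-0002", "case_name": "Placeholder v. Demo",
     "date_argued": "01/02/20", "pdf_name": "00-0002_efgh.pdf"},
]

# The text files share the PDF's name, so swapping the extension
# gives the transcript file that belongs to each case.
for case in cases:
    case["txt_name"] = Path(case["pdf_name"]).with_suffix(".txt").name

print(cases[0]["txt_name"])
```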

**STEP 2:** find secondary source(s) to scrape/integrate with your case data. The information on the Supreme Court page is very limited. You need to find a source or group of sources that add information. The most important information would likely be: the decision, who voted for and against, and the district court origin of the case (for geocoding). You might think of other great things to put in there too! This information needs to be merged with the data you have from STEP 1.
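One possible way to do that merge, sketched with plain dictionaries keyed by docket number. Every value and field name here is invented for illustration; your secondary sources will decide which fields exist.

```python
# Hypothetical data: all values are made up for illustration.
cases = [
    {"docket": "00-0001", "case_name": "Example v. Sample"},
    {"docket": "00-0002", "case_name": "Placeholder v. Demo"},
]

# Secondary information keyed by docket number (e.g. from another scrape).
secondary = {
    "00-0001": {"decision": "affirmed", "lower_court": "Example Circuit"},
}

# Add the extra fields to each case; cases with no match get None
# so every record ends up with the same keys.
extra_fields = ["decision", "lower_court"]
for case in cases:
    match = secondary.get(case["docket"], {})
    for field in extra_fields:
        case[field] = match.get(field)

print(cases)
```

Filling unmatched cases with None makes the gaps in your secondary data easy to spot later.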

**STEP 3:** use regular expressions to clean up and parse the text files so that you have a searchable data structure containing the dialog from the transcripts. 

**Data Architecture** 
You will need to think about how you will set up, separate, and join the different tables that you create. The initial scraping will give you a very simple dataframe: the columns will be docket, case name, date argued, and PDF name. The regex work on the transcripts should result in a very simple table (or just a list of tuples) of speaker name and dialogue.

`[('MR. BERGERON'," Yes. That's essentially the same thing"),('JUSTICE SOTOMAYOR',' So how do you deal with Chambers?')]`

But make sure you attach the docket number or pdf filename to each set of arguments you transform using regex. Your secondary sources and information should be linked by docket number, but the question is how to set up those data frames, join them, aggregate them, and narrow them to the fields necessary for presentation.
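A rough sketch of attaching the docket number to each piece of dialogue, using the sample pairs shown above. The docket value and the row field names are assumptions chosen for illustration.

```python
# Hypothetical docket number; the dialogue is the sample shown above.
docket = "00-0001"
dialogue = [
    ("MR. BERGERON", " Yes. That's essentially the same thing"),
    ("JUSTICE SOTOMAYOR", " So how do you deal with Chambers?"),
]

# One row per utterance, each carrying the docket so it can be joined later.
rows = [
    {"docket": docket, "order": i, "speaker": speaker, "text": text.strip()}
    for i, (speaker, text) in enumerate(dialogue)
]

# Because every row is tagged, filtering stays simple.
justice_rows = [r for r in rows if r["speaker"].startswith("JUSTICE")]
print(justice_rows)
```

Keeping an `order` field preserves the sequence of the argument even after rows from many cases are combined into one table.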

Go step-by-step through this, and DM me on Slack whenever you get stuck, and I will help. If you complete all the steps before Friday, Slack me if you want to go further.

**Interpretive Architecture**
Also consider what kind of interpretive categories you can add through your reading and research. At the very least, it is recommended that you come up with categories for the kinds of cases that are before the court: human clustering for meaning is always more effective than computational clustering. Try to come up with perhaps 8 to 10 domains that groups of cases might belong to. But also think of other ways of categorizing these cases or these decisions--by politics, by consequences on citizens (you could make a scale from 1 to 10), even an aggregated index of consequences/effects on different types of communities, sectors, regions, etc. 

You are the researcher; these categories are ways of expressing your point of view.



### STEP 1
Scrape all of the necessary information from:

https://www.supremecourt.gov/oral_arguments/argument_transcript/2020

This should result in a list of dictionaries, with one dictionary for each case.
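Here is a small sketch of pulling PDF names out of HTML. It uses Python's built-in `html.parser` rather than a third-party scraping library, and it runs on a made-up HTML string: the real page's markup is not reproduced here and may be organized differently, so inspect it before adapting this. The class name `TranscriptLinkParser` is my own.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for the real page; the actual markup on
# supremecourt.gov may be organized differently, so inspect it first.
SAMPLE_HTML = """
<table>
  <tr><td><a href="../argument_transcripts/2020/00-0001_abcd.pdf">00-0001</a></td>
      <td>Example v. Sample</td><td>01/01/20</td></tr>
  <tr><td><a href="../argument_transcripts/2020/00-0002_efgh.pdf">00-0002</a></td>
      <td>Placeholder v. Demo</td><td>01/02/20</td></tr>
</table>
"""

class TranscriptLinkParser(HTMLParser):
    """Collect one dictionary per link that points at a PDF."""

    def __init__(self):
        super().__init__()
        self.cases = []
        self._current = None  # the PDF link we are inside, if any

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.endswith(".pdf"):
            # Keep only the file name, which links the case to its text file.
            self._current = {"pdf_name": href.split("/")[-1], "docket": ""}

    def handle_data(self, data):
        # Text inside the link is treated as the docket number.
        if self._current is not None:
            self._current["docket"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.cases.append(self._current)
            self._current = None

parser = TranscriptLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.cases)
```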

In [None]:
###Import your scraping libraries



In [None]:
###Write your scraping code here



In [None]:
###Print out your list of lists or dictionaries here

### STEP 2 
Scrape the additional source(s)

For this you need to do research and try to find sources that will give you useful information that you can add to the table/dictionary you created in Step 1.

Here are some recommended sources that you can scrape and add to your data. You do not need to scrape all of these, and you may want to look for other sources that are useful.

Geographical locations:
https://system.uslegal.com/us-courts-of-appeals/

Transcripts by year
https://www.supremecourt.gov/oral_arguments/argument_transcript/2017

Dockets by circuit court (I recommend at least this one):
https://www.supremecourt.gov/orders/ordersbycircuit/ordercasebycircuit/061118OrderCasesByCircuit

Docket information by case:
https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/17-7919.html

Opinions (as seen in Homework 3):
https://www.supremecourt.gov/opinions/slipopinion/17

In [None]:
#Code away!




### STEP 3
Here we go: the text files that were extracted from the PDFs are quite messy. You do not need to get them perfect, but you need to clean them up enough that you can zero in on the arguments themselves. Below I take you step-by-step through what you need to do. In the end you want to have a separate list for each case that contains each speaker and the dialogue attached to that speaker.

**Step 1:** Download the text files from courseworks.

Make sure they are locally on your computer. 

Open up the text files in a text editor like Sublime Text, and look carefully at the problems with the files. How will you clean this up?

**Step 2:** Eventually you will want to loop through all of the text files and run the cleanup on all of them. But first just select one text file to open up and begin cleaning up.

In [1]:
#Import the regular expression library
import re

In [2]:
#Open a text file from your computer

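#Using "with" closes the file automatically once it has been read
with open('/Users/thirkield/Documents/Columbia2020/final_projects/2020pdfs/18-217_5hdk.txt', 'r') as f:
    sample_transcript = f.read()

#Replace the path above with your own, for example:
#'/Users/YOU/Documents/columbia_syllabus/pdf/15-777_1b82.txt'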

In [3]:
#Take a look at the text file
sample_transcript

'SUPREME COURT OF THE UNITED STATES\nIN THE SUPREME COURT OF THE UNITED STATES - - - - - - - - - - - - - - - - - RANDALL MATHENA, WARDEN, Petitioner, v. LEE BOYD MALVO, Respondent. ) ) ) No. 18-217 ) )\n\n- - - - - - - - - - - - - - - - - -\n\nPages: Place: Date:\n\n1 through 70 Washington, D.C. October 16, 2019\n\nHERITAGE REPORTING CORPORATION\nOfficial Reporters 1220 L Street, N.W., Suite 206 Washington, D.C. 20005 (202) 628-4888 www.hrccourtreporters.com\n\n\x0cOfficial 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 TOBY J. HEYTENS, Solicitor General, Richmond, Virginia; on behalf of the Petitioner. ERIC J. FEIGIN, Assistant to the Solicitor General, Department of Justice, Washington, D.C.; for the United States, as amicus curiae, supporting the Petitioner. DANIELLE SPINELLI, ESQ., Washington, D.C.; on behalf of the Respondent. APPEARANCES: The above-entitled matter came on for oral argument before the Supreme Court of the United States at 1:00 p.m. IN THE SUPR

**How in the world are you going to clean this up?**
Take a close look and think first about what needs to be removed, and then about what needs to be isolated. You'll probably need a combination of regular expressions (especially re.sub(), which is a regex replace) and simple splits, where you split the text at a given point and keep just the part of the text that you want. If you want to figure this out on your own, don't read any further--if you're starting to get stuck, go a few cells down and follow my hints.

Also take a look at the hint below--it might come in very handy...


In [4]:
#A note on regex splits:
# look at the difference between regex1 and regex2
#A split using groups keeps the groups!!!!

string = "Tomorrow and tomorrow and tomorrow"
regex1 = r"and" #not grouped
regex2 = r"(and)" #grouped
re.split(regex2,string)

['Tomorrow ', 'and', ' tomorrow ', 'and', ' tomorrow']
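To see the difference directly, here are both splits side by side. This is the same example as the cell above, with variable names of my own choosing.

```python
import re

string = "Tomorrow and tomorrow and tomorrow"

ungrouped = re.split(r"and", string)    # the separator is thrown away
grouped = re.split(r"(and)", string)    # the separator is kept in the list

print(ungrouped)  # ['Tomorrow ', ' tomorrow ', ' tomorrow']
print(grouped)    # ['Tomorrow ', 'and', ' tomorrow ', 'and', ' tomorrow']
```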

In [None]:
##Try to do everything yourself

















### Cleaning comes first

A step-by-step way of cleaning up this mess.

Step 1. You might notice that every page has:

`Heritage Reporting Corporation

Official 2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25`

(Note: in earlier years it was:
`Alderson Reporting Company

Official - Subject to Final Review`
If you choose to transform arguments from earlier years, please Slack me and I will send you the instructions for earlier versions of these PDFs.)

You want to get rid of that. I would use a regex sub().
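A minimal sketch of that substitution, run on a made-up snippet that imitates one page break. The pattern is an assumption about how the header looks in your files: spacing and capitalization may differ, and it would also swallow a number that opens the next line of dialogue, so test it on a real transcript.

```python
import re

# Made-up snippet imitating one page break; real files may differ in
# spacing or capitalization, so check the pattern against your own text.
line_numbers = " ".join(str(n) for n in range(1, 26))
sample = (
    "MR. EXAMPLE: That is correct. Heritage Reporting Corporation\n\n"
    "\x0cOfficial 2 " + line_numbers + " JUSTICE EXAMPLE: Why?"
)

# Company name, then "Official", then the page number and the run of
# line numbers that follows it. \s+ also matches newlines and form feeds.
page_header = re.compile(r"Heritage Reporting Corporation\s+Official(?:\s+\d+)+")

no_headers = page_header.sub(" ", sample)
# Collapse the leftover runs of whitespace into single spaces.
no_headers = re.sub(r"\s+", " ", no_headers)

print(no_headers)
```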

Step 2 and 3. **chop off the beginning/ chop off the end**: now it would be very helpful to get rid of all of the text that comes before the arguments begin, and all the text that comes after the arguments end (each transcript has a really annoying index at the end that you don't want to be searching through). Look for words or phrases that reliably appear, and appear only once, at the beginning and at the end of the arguments. The easiest way to isolate the arguments is to do a simple split() on one of those phrases, and keep the half of the split you want. (Am I being too cryptic here? A good split should give you a list with two elements, and you keep one of them.) Think about it and email me.
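The two chops can be sketched like this on a made-up string. The marker phrases `START` and `END` are assumptions: open your own files and pick phrases that really occur exactly once where you need them.

```python
# Made-up text; START and END are assumed marker phrases. Open your own
# files and pick phrases that really occur exactly once where you need them.
START = "P R O C E E D I N G S"
END = "(Whereupon"

sample = (
    "APPEARANCES: ... cover page and contents ... "
    "P R O C E E D I N G S "
    "CHIEF JUSTICE EXAMPLE: We will hear argument. "
    "(Whereupon, the case was submitted.) index of words ..."
)

# split() on a phrase that occurs once returns a two-element list:
# [everything before the phrase, everything after it].
after_start = sample.split(START)[1]      # keep the second half
body = after_start.split(END)[0].strip()  # keep the first half

print(body)
```

If a marker turns up more than once in a file, the split returns more than two elements, which is a useful signal that you need a more specific phrase.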

Try to get these 3 cleaning actions to work step-by-step in the 4 cells below. As you go, I would assign each cleaner version of the text to a new variable. 

In [None]:
#1. Heritage company stuff, and numbers



In [None]:
#2. Chop off the beginning before the dialogue begins


In [None]:
#3. Chop off the end after the dialogue ends



In [None]:
#Check your new variable to make sure it is clean

### Get your dialogue list
Now this transcription should be clean enough to get a list with every speaker, and what the speaker said. The pattern for the speakers is fairly obvious--my recommendation is to do a split using groups (like the example I show above with "tomorrow and tomorrow").

If you write your regular expression correctly, you should get a single list in which each element is either a speaker or what was said.
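One possible shape for that split, shown on a two-line sample. The list of titles in the pattern is an assumption and is almost certainly incomplete, so extend it after looking at the speakers in your own transcripts.

```python
import re

# Short sample of already-cleaned text. The titles in the pattern are an
# assumption; extend the list if your transcripts contain others.
clean_text = (
    "MR. BERGERON: Yes. That's essentially the same thing. "
    "JUSTICE SOTOMAYOR: So how do you deal with Chambers?"
)

# The parentheses make a capturing group, so re.split keeps the speaker.
# The colon sits outside the group and is dropped.
speaker_pattern = r"((?:CHIEF JUSTICE|JUSTICE|MR\.|MS\.|GENERAL) [A-Z]+):"

parts = re.split(speaker_pattern, clean_text)
print(parts)
```

Note that the first element is whatever came before the first speaker, which here is an empty string.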

In [None]:
#get a list of speaker and speech




### Make it a list of pairs
If you got your list the way I recommended, it is just a single list with element after element--you need to figure out how to change it so that you pair each speaker with what is said. Give it some thought; there are a few ways to do this. If you made it this far, you're doing great!
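One of those ways, sketched under the assumption that the flat list starts with a leftover chunk and then alternates speaker, speech, speaker, speech:

```python
# A flat list in the shape a grouped split produces: an optional leading
# chunk, then speaker, speech, speaker, speech, ...
parts = [
    "",
    "MR. BERGERON", " Yes. That's essentially the same thing",
    "JUSTICE SOTOMAYOR", " So how do you deal with Chambers?",
]

# Speakers sit at positions 1, 3, 5, ... and speeches at 2, 4, 6, ...
# so zipping the two slices together forms the pairs.
pairs = list(zip(parts[1::2], parts[2::2]))
print(pairs)
```

A loop that steps through the list two elements at a time would work just as well.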

In [None]:
#make it a list of pairs of speaker and speech



### Loop through all texts
If you made it this far--congratulations! 
The only thing left is to set up a loop that goes through all the texts and runs the cleanup and parsing on each one. You will need to have completed Step 1 in order to do this loop, because you will need the names of the PDFs. (Each final list should also contain the PDF name, so you can reference it from your case database.)
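A skeleton for that loop, assuming your Step 1 data is a list of dictionaries with a `pdf_name` key. Both function names are hypothetical, and `parse_transcript` is only a placeholder for your own cleanup and splitting code. The demo at the bottom runs on a throwaway folder with one made-up file.

```python
import tempfile
from pathlib import Path

def parse_transcript(text):
    """Placeholder: swap in your own cleanup and speaker-splitting steps."""
    return [("SPEAKER", text.strip())]

def parse_folder(folder, cases):
    """Parse the transcript of every case; key the results by PDF name."""
    results = {}
    for case in cases:
        txt_name = Path(case["pdf_name"]).with_suffix(".txt").name
        txt_path = Path(folder) / txt_name
        if not txt_path.exists():
            continue  # no transcript text for this case
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        results[case["pdf_name"]] = parse_transcript(text)
    return results

# Demo with a throwaway folder and made-up case records.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "00-0001_abcd.txt").write_text(" Hello. ", encoding="utf-8")
    cases = [{"pdf_name": "00-0001_abcd.pdf"}, {"pdf_name": "00-0002_efgh.pdf"}]
    all_dialogue = parse_folder(folder, cases)

print(all_dialogue)
```

Keying the results by PDF name keeps each parsed transcript attached to its case record.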

In [None]:
# you could try here--Or email me with questions...