## Supreme Court Project Guide

The ultimate goal of this project is to build a database of Supreme Court cases for 2016 that includes the dialogue from the oral arguments of each case. As we have seen in class, the arguments were scraped from this page: https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx

I have already downloaded and transformed the PDFs of the transcripts into text documents which you can download from courseworks: supreme_court_pdfs_txt.zip

There are three steps that you need to complete:

**Please note:** Step 3 is the most challenging--if you want to spend some time coding, you can skip Steps 1 and 2 and get to work on Step 3

**STEP 1:** scrape all of the case information available on this page: https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx 

This should include case name, docket number, etc.--and most importantly the name of the PDF file. All of the text files share the exact same name as the PDF files they came from. This file name will allow you to connect your transcription data with your case data.

It is up to you what kind of data structure you want to build, but it is likely to be a list of lists or a list of dictionaries--for each case you will have a list or dictionary of the information you scrape from the webpage.
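As a rough sketch of what a list of dictionaries could look like: the field names and values below are made up for illustration, not taken from the court's page, so use whatever the page actually gives you.

```python
# Hypothetical field names and made-up values, for illustration only.
cases = [
    {"case_name": "Example v. Sample", "docket": "00-0001", "pdf_name": "00-0001_abcd"},
    {"case_name": "Doe v. Roe", "docket": "00-0002", "pdf_name": "00-0002_efgh"},
]

# Index the cases by PDF name so a transcript file can find its case record.
cases_by_pdf = {case["pdf_name"]: case for case in cases}
print(cases_by_pdf["00-0002_efgh"]["case_name"])
```

Keeping the PDF name in every record is what makes the later link to the transcript text possible.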

**STEP 2:** find a secondary source to scrape/integrate with your case data. The information on the Supreme Court page is very limited. You need to find a source or group of sources that add information. The most important information would likely be: the decision, who voted for and against, and the state of origin of the case (for geocoding). You might think of other great things to put in there too! This information needs to be merged with the data you have from STEP 1.
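One possible way to do the merge, assuming both sources share a key such as the docket number. The records and field names below are hypothetical, and your secondary source may need a different key:

```python
# Hypothetical records: field names and values are made up for illustration.
cases = [
    {"docket": "00-0001", "case_name": "Example v. Sample"},
    {"docket": "00-0002", "case_name": "Doe v. Roe"},
]

# Secondary-source data, keyed by docket number (an assumed shared key).
extra = {
    "00-0001": {"decision": "affirmed", "state": "NY"},
}

merged = []
for case in cases:
    # extra.get returns an empty dict when the secondary source has no entry,
    # so cases without a match keep only their original fields.
    combined = {**case, **extra.get(case["docket"], {})}
    merged.append(combined)

print(merged)
```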

**STEP 3:** use regular expressions to clean up and parse the text files so that you have a searchable data structure containing the dialog from the transcripts. 

From a data architecture perspective, you probably want to have a separate list for each case and in each list a data structure that pairs the speaker with what she/he says. Like:

`[['MR. BERGERON'," Yes. That's essentially the same thing"],[ 'JUSTICE SOTOMAYOR',' So how do you deal with Chambers?']]`

This is a list of lists--it could also be a list of dictionaries if you want it to be. The real programmatic challenge here is to clean up the text files and parse them successfully. Most of the instructions below are devoted to this, but Steps 1 and 2 are also extremely important.
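To show why this structure is searchable, here is a small sketch that turns the example pairs into dictionaries and pulls out everything one speaker said. The keys `speaker` and `speech` are my own choice, not a requirement:

```python
dialogue = [
    ["MR. BERGERON", " Yes. That's essentially the same thing"],
    ["JUSTICE SOTOMAYOR", " So how do you deal with Chambers?"],
]

# Same data as a list of dictionaries (key names are an assumption).
dialogue_dicts = [
    {"speaker": speaker, "speech": speech.strip()}
    for speaker, speech in dialogue
]

# Search: collect every line spoken by one speaker.
sotomayor = [
    entry["speech"]
    for entry in dialogue_dicts
    if entry["speaker"] == "JUSTICE SOTOMAYOR"
]
print(sotomayor)
```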

Go step-by-step through this, and email me whenever you get stuck, and I will help. If you complete all the steps before Tuesday, email me if you want to go further.



### STEP 1
Scrape all of the necessary information from:

https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx 

You should end up with a list of dictionaries, one for each case.

In [None]:
###Import your scraping libraries



In [None]:
###Write your scraping code here



In [None]:
###Print out your list of lists or dictionaries here

### STEP 2 
Scrape the additional source(s)

For this you need to do research and try to find a source that will give you useful information that you can add to the dictionary you created in Step 1.

In [None]:
#Code away!




### STEP 3
Here we go: the text files that were extracted from the PDFs are quite messy. You do not need to get them perfect, but you need to clean them up enough so that you can zero in on the arguments themselves. Below I take you step-by-step through what you need to do. In the end you want to have a separate list for each case that contains the speaker and the dialogue attached to that speaker.

**Step 1:** Download the text files from courseworks.
Make sure they are saved locally on your computer.
Open up the text files in a text editor like Sublime Text, and carefully look at the problems with the files. How will you clean this up?

**Step 2:** Eventually you will want to loop through all of the text files and run the cleanup on all of them. But first just select one text file to open up and begin cleaning up.

In [None]:
#Import the regular expression library

In [None]:
#Open a text file from your computer
with open('/Users/YOU/Documents/columbia_syllabus/pdf/15-777_1b82.txt', 'r') as f:
    sample_transcript = f.read() #the with block closes the file for you

In [None]:
#Take a look at the text file
sample_transcript

**How in the world are you going to clean this up?**
Take a close look and think first about what needs to be removed, and then about what needs to be isolated. You'll probably need a combination of regular expressions (especially sub() -- which is a regex replace) and simple splits -- where you split the text at a given point, and just keep the part of the text that you want. If you want to figure this out on your own, don't read any further--if you're starting to get stuck, go a few cells down and follow my hints.

Also take a look at the hint below--it might come in very handy...


In [None]:
#A note on regex splits:
# look at the difference between regex1 and regex2
#A split using groups keeps the groups!!!!
import re

string = "Tomorrow and tomorrow and tomorrow"
regex1 = r"and" #not grouped: the separators are dropped
regex2 = r"(and)" #grouped: the separators stay in the list
print(re.split(regex1, string))
print(re.split(regex2, string))

In [None]:
##Try to do everything yourself

















### Cleaning comes first

A step-by-step way of cleaning up this mess.

Step 1. You might notice that every page has:

`Alderson Reporting Company`

`Official - Subject to Final Review`
 
You want to get rid of that. I would use a regex sub() 

Step 2. **Line Numbers:** you might also notice these annoying line numbers going from 1 - 25 everywhere. I would use the regex sub() to get rid of these too -- but be very careful, you don't want to get rid of all the numbers in there. The cleaning doesn't have to be perfect, but try to get as many of them as you can without deleting other numbers.

Step 3 and 4. **Chop off the beginning / chop off the end:** now it would be very helpful to get rid of all of the text that comes before the argument begins, and all the text that comes after the argument ends (each transcript has a really annoying index at the end that you don't want to be searching through). Look for words or phrases that consistently appear at the beginning and at the end of the arguments. The easiest way to isolate the argument is to do a simple split() on one of those phrases, and keep the half of the split you want. (Am I being too cryptic here?--a good split should give you a list with two elements, and you want to keep one of them.) Think about it and email me.

Try to get these 4 cleaning actions to work step-by-step in the 4 cells below. As you go, I would assign each cleaner version of the text to a new variable. 
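If you want to see the four actions chained together before trying them on a real file, here is a minimal sketch on a made-up string. The markers `START-MARKER` and `END-MARKER` are placeholders I invented, and I am assuming the line numbers sit alone on their own lines. The real transcripts will need their own phrases and probably a more careful pattern.

```python
import re

# Toy text imitating a messy transcript (entirely made up for illustration).
raw = (
    "front matter\n"
    "START-MARKER\n"
    "1\n"
    "MR. SMITH: There were 12 jurors.\n"
    "2\n"
    "Alderson Reporting Company\n"
    "Official - Subject to Final Review\n"
    "3\n"
    "JUSTICE DOE: Thank you.\n"
    "END-MARKER\n"
    "index 4:1\n"
)

# 1. Remove the repeated page header/footer lines.
step1 = re.sub(
    r"Alderson Reporting Company\n|Official - Subject to Final Review\n", "", raw
)

# 2. Remove lines that contain only a number from 1 to 25.
#    Anchoring to the whole line leaves the "12" inside the sentence alone.
step2 = re.sub(r"^(?:[1-9]|1\d|2[0-5])\n", "", step1, flags=re.MULTILINE)

# 3. Split on the (hypothetical) start marker and keep the second half.
step3 = step2.split("START-MARKER\n")[1]

# 4. Split on the (hypothetical) end marker and keep the first half.
step4 = step3.split("END-MARKER\n")[0]

print(step4)
```

Assigning each stage to its own variable makes it easy to inspect what every step removed.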

In [None]:
#1. Alderson company stuff



In [None]:
#2. Line numbers 1 - 25



In [None]:
#3. Chop off the beginning before the dialogue begins



In [None]:
#4. Chop off the end after the dialogue ends



In [None]:
#Check your new variable to make sure it is clean

### Get your dialogue list
Now this transcription should be clean enough to get a list with every speaker, and what the speaker said. The pattern for the speakers is fairly obvious--my recommendation is to do a split using groups (like the example I show above with "tomorrow and tomorrow").

If you write your regular expression correctly, you should get a single list in which each element is either a speaker or what was said.
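Here is a sketch of a grouped split on a made-up, already-clean string. The speaker pattern is an assumption on my part (a title, a capitalized last name, then a colon), so check it against the real files before relying on it. Names with apostrophes or spaces would slip through, for example.

```python
import re

# Made-up, already-cleaned text for illustration.
clean = "MR. SMITH: There were 12 jurors. JUSTICE DOE: Thank you. MR. SMITH: Of course."

# Assumed speaker format: title + capitalized surname + colon.
# The parentheses form a group, so re.split keeps the speaker names.
speaker_pattern = r"((?:MR\.|MS\.|JUSTICE|CHIEF JUSTICE) [A-Z]+):"

parts = re.split(speaker_pattern, clean)
print(parts)
```

Note that the list begins with whatever text came before the first speaker, which is an empty string in this toy case.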

In [None]:
#get a list of speaker and speech




### Make it a list of pairs
If you got your list the way I recommended, it is just a single list with element after element--you need to figure out how to change it so you pair the speaker with what is said. Give it some thought, there are a few ways to try to do this. If you made it this far, you're doing great!
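One of several ways to do the pairing, sketched on a made-up flat list. I am assuming the list starts with a leftover chunk of text before the first speaker and then alternates speaker, speech, speaker, speech:

```python
# Made-up flat list, shaped like the output of a grouped split.
parts = ["", "MR. SMITH", " There were 12 jurors. ", "JUSTICE DOE", " Thank you."]

# Drop the text that precedes the first speaker.
body = parts[1:]

# Even positions are speakers, odd positions are what they said.
pairs = [[speaker, speech] for speaker, speech in zip(body[0::2], body[1::2])]
print(pairs)
```

A plain loop that steps through the list two elements at a time would work just as well.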

In [None]:
#make it a list of pairs of speaker and speech



### Loop through all texts
If you made it this far--congratulations! 
The only thing left is to set up a loop that goes through all the texts and runs the cleanup and parsing on each one. You will need to have completed Step 1 in order to be able to do this loop because you will need the names of the PDFs to do it. (Each final list should also contain the PDF name, so you can reference it from your case database.)
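A rough sketch of the loop's shape, under some assumptions: `parse_transcript` is a stand-in for your own cleanup and parsing code, the PDF names are invented, and the files are written to a temporary folder only so that the example runs anywhere. In your version the folder would be wherever you unzipped the text files.

```python
import tempfile
from pathlib import Path


def parse_transcript(text):
    """Placeholder for your own cleanup and parsing steps."""
    return [["SPEAKER", text.strip()]]


# Hypothetical PDF names, standing in for the ones you scraped in Step 1.
pdf_names = ["00-0001_abcd", "00-0002_efgh"]

with tempfile.TemporaryDirectory() as folder:
    # Create stand-in text files so the sketch is self-contained.
    for name in pdf_names:
        Path(folder, name + ".txt").write_text("hello from " + name)

    # Key each parsed dialogue by its PDF name so it can be matched
    # back to the case database.
    transcripts = {}
    for name in pdf_names:
        text = Path(folder, name + ".txt").read_text()
        transcripts[name] = parse_transcript(text)

print(transcripts)
```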

In [None]:
# you could try here--Or email me with questions...