# Kate Cough

#### June 29 2017


## Supreme Court Project Guide

The ultimate goal of this project is to build a database of Supreme Court cases for 2016 that includes the dialogue from the oral arguments of each case. As we have seen in class, the arguments were scraped from this page: https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx 

I have already downloaded and transformed the PDFs of the transcripts into text documents which you can download from courseworks: supreme_court_pdfs_txt.zip

There are three steps that you need to complete:

**Please note:** Step 3 is the most challenging--if you want to spend some time coding, you can skip Steps 1 and 2 and get to work on Step 3

**STEP 1:** scrape all of the case information available on this page: https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx 

This should include case name, docket number, etc--and most importantly the name of the PDF file. All of the text files share the exact same name as the PDF files they came from. This file name will allow you to connect your transcription data with your case data. 

It is up to you what kind of data structure you want to build, but it is likely to be a list of lists or a list of dictionaries: for each case you will have a list or dictionary of the information you scrape from the webpage.

**STEP 2:** find a secondary source to scrape/integrate with your case data. The information on the Supreme Court page is very limited. You need to find a source or group of sources that add information. The most important information would likely be: the decision, who voted for and against, and the state of origin of the case (for geocoding). You might think of other great things to put in there too! This information needs to be merged with the data you have from STEP 1.

**STEP 3:** use regular expressions to clean up and parse the text files so that you have a searchable data structure containing the dialog from the transcripts. 

From a data architecture perspective, you probably want to have a separate list for each case and in each list a data structure that pairs the speaker with what she/he says. Like:

`[['MR. BERGERON'," Yes. That's essentially the same thing"],[ 'JUSTICE SOTOMAYOR',' So how do you deal with Chambers?']]`

This is a list of lists; it could also be a list of dictionaries if you want it to be. The real programmatic challenge here is to clean up the text files and parse them successfully. Most of the instructions below are devoted to this, but Steps 1 and 2 are also extremely important.

Go step-by-step through this, and email me whenever you get stuck, and I will help. If you complete all the steps before Tuesday, email me if you want to go further.



### STEP 1
Scrape all of the necessary information from:

https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx 

You should end up with a list of dictionaries, one dictionary for each case.

In [1]:
###Import your scraping libraries

from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests



In [2]:
###Write your scraping code here
url = 'https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx'
raw_html = urlopen(url).read()
doc = BeautifulSoup(raw_html, 'html.parser')

result = requests.get(url)
#store the response in the variable 'result' (an alternative to urlopen above)
# result.text

In [3]:
### Print out your list of lists or dictionaries here
doc = BeautifulSoup(raw_html, 'html.parser')
#define the variable doc
table = doc.find(class_ = 'table datatables')
#the info is in the table (defined by the class table datatables) and in the table rows.
cases = table.find_all('tr')
#define the variable for the table rows

supreme_court_list_all = []
#make a list to store the info

for each_case in cases:
    
    current = {}
    #make a dictionary for each case.
    #each dictionary will have four key : value pairs: 'Text', 'Case Name', 'Date Argued' and 'Docket Number'
    #using beautiful soup and the tags, we'll find each one. remember we're already inside the tr tag
    
    link = each_case.find_all('td')[0].find('a')
    name = each_case.find('span')
    date = each_case.find_all('td')[1].string
    docket_number = each_case.find_all('td')[0].find(target = '_blank')
    
    if name:
        
        current['Text'] = link['href'].split('/')[-1]
        current['Case Name'] = name.string    
        current['Date Argued'] = date
        current['Docket Number'] = docket_number.string.strip()
    
    #note: rows without a case name (the first row here) are appended as an empty dictionary
    supreme_court_list_all.append(current)
    
supreme_court_list_all


[{},
 {'Case Name': 'Perry v. Merit Systems Protection Bd.',
  'Date Argued': '04/17/17',
  'Docket Number': '16-399.',
  'Text': '16-399_3f14.pdf'},
 {'Case Name': 'Town of Chester v. Laroe Estates, Inc.',
  'Date Argued': '04/17/17',
  'Docket Number': '16-605.',
  'Text': '16-605_2dp3.pdf'},
 {'Case Name': "California Public Employees' Retirement System v. ANZ Securities, Inc.",
  'Date Argued': '04/17/17',
  'Docket Number': '16-373.',
  'Text': '16-373_4e46.pdf'},
 {'Case Name': 'Kokesh v. SEC',
  'Date Argued': '04/18/17',
  'Docket Number': '16-529.',
  'Text': '16-529_21p3.pdf'},
 {'Case Name': 'Henson v. Santander Consumer USA Inc.',
  'Date Argued': '04/18/17',
  'Docket Number': '16-349.',
  'Text': '16-349_e29g.pdf'},
 {'Case Name': 'Trinity Lutheran Church of Columbia, Inc. v. Comer',
  'Date Argued': '04/19/17',
  'Docket Number': '15-577.',
  'Text': '15-577_l64n.pdf'},
 {'Case Name': 'Weaver v. Massachusetts',
  'Date Argued': '04/19/17',
  'Docket Number': '16-240.',
 

In [4]:
import pandas as pd
df = pd.DataFrame(supreme_court_list_all)
df.to_csv("supreme_court_list_all.csv", index=False)
supreme_court_list_all = pd.read_csv('supreme_court_list_all.csv')
supreme_court_list_all.head()

| Unnamed: 0 | Case Name | Date Argued | Docket Number | Text |
| --- | --- | --- | --- | --- |
| 0 | | | | |
| 1 | Perry v. Merit Systems Protection Bd. | 04/17/17 | 16-399. | 16-399_3f14.pdf |
| 2 | Town of Chester v. Laroe Estates, Inc. | 04/17/17 | 16-605. | 16-605_2dp3.pdf |
| 3 | California Public Employees' Retirement System... | 04/17/17 | 16-373. | 16-373_4e46.pdf |
| 4 | Kokesh v. SEC | 04/18/17 | 16-529. | 16-529_21p3.pdf |


### STEP 2 
Scrape the additional source(s)

For this you need to do research and try to find a source that will give you useful information that you can add to the dictionary you created in Step 1.
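
Once you have a secondary source, the merge itself can be fairly small. Here is a minimal sketch, assuming you have turned the secondary source into a dictionary keyed by docket number. The helper names (`normalize_docket`, `merge_cases`), the extra field names (`decision`, `state`) and the `PLACEHOLDER` values are all made up for illustration; they are not real case outcomes. Note that the docket numbers scraped in Step 1 end with a period, so the sketch strips it before matching.

```python
def normalize_docket(docket):
    """Strip whitespace and the trailing period so keys from both sources match."""
    return docket.strip().rstrip('.')


def merge_cases(cases, extra_by_docket):
    """Return new dictionaries combining Step 1 data with any matching extra data."""
    merged = []
    for case in cases:
        key = normalize_docket(case['Docket Number'])
        combined = dict(case)                            # copy, so the original is untouched
        combined.update(extra_by_docket.get(key, {}))    # add nothing if there is no match
        merged.append(combined)
    return merged


# Two cases in the shape produced in Step 1
cases = [
    {'Case Name': 'Kokesh v. SEC', 'Docket Number': '16-529.', 'Text': '16-529_21p3.pdf'},
    {'Case Name': 'Henson v. Santander Consumer USA Inc.', 'Docket Number': '16-349.', 'Text': '16-349_e29g.pdf'},
]

# Hypothetical secondary data: field names and values are placeholders
extra_by_docket = {
    '16-529': {'decision': 'PLACEHOLDER', 'state': 'PLACEHOLDER'},
}

merged = merge_cases(cases, extra_by_docket)
```

Cases with no match in the secondary source simply keep their Step 1 fields, which makes it easy to see afterwards which ones still need data.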

In [5]:
#Code away!




### STEP 3
Here we go. The text files that were extracted from the PDFs are quite messy. You do not need to get them perfect, but you need to clean them up enough that you can zoom in on the arguments themselves. Below I take you step by step through what you need to do. In the end you want a separate list for each case that contains each speaker and the dialogue attached to that speaker.

**Step 1:** Download the text files from courseworks and make sure they are stored locally on your computer.
Open the text files in a text editor like Sublime Text and look carefully at the problems with the files. How will you clean this up?

**Step 2:** Eventually you will want to loop through all of the text files and run the cleanup on all of them. But first just select one text file to open up and begin cleaning up.

In [6]:
#Import the regular expression library
import re

In [7]:
#Open a text file from your computer
#the with block closes the file automatically once it has been read
with open('/Users/kaitlincough/Documents/Lede/thirkield/python_notebooks_thirkield/pdfs/15-777_1b82.txt', 'r') as f:
    sample_transcript = f.read()

In [8]:
# !cat /Users/kaitlincough/Documents/Lede/thirkield/python_notebooks_thirkield/pdfs/15-777_1b82.txt

In [9]:
#Take a look at the text file
sample_transcript

'Official - Subject to Final Review\n1 1 IN THE SUPREME COURT OF THE UNITED STATES\n\n2 -----------------x\n\n3 SAMSUNG ELECTRONICS CO.,\n\n:\n\n4 LTD., ET AL.,\n\n:\n\n5\n\nPetitioners\n\n: No. 15-777\n\n6 v.\n\n:\n\n7 APPLE, INC.,\n\n:\n\n8\n\nRespondent.\n\n:\n\n9 -----------------x\n\n10 Washington, D.C.\n\n11 Tuesday, October 11, 2016\n\n12\n\n13 The above-entitled matter came on for oral\n\n14 argument before the Supreme Court of the United States\n\n15 at 10:05 a.m.\n\n16 APPEARANCES:\n\n17 KATHLEEN M. SULLIVAN, ESQ., New York, N.Y.; on behalf of\n\n18 the Petitioners.\n\n19 BRIAN H. FLETCHER, ESQ., Assistant to the Solicitor\n\n20 General, Department of Justice, Washington, D.C.;\n\n21 for United States, as amicus curiae, supporting\n\n22 neither party.\n\n23 SETH P. WAXMAN, ESQ., Washington, D.C.; on behalf of the\n\n24 Respondent.\n\n25\n\nAlderson Reporting Company\n\n\x0cOfficial - Subject to Final Review\n1 CONTENTS 2 ORAL ARGUMENT OF 3 KATHLEEN M. SULLIVAN, ESQ. 4 On beha

### Cleaning comes first

A step-by-step way of cleaning up this mess.

Step 1. You might notice that every page has:

`Alderson Reporting Company

Official - Subject to Final Review`
 
You want to get rid of that. I would use `re.sub()`.

Step 2. **Line Numbers:** you might also notice the line numbers running from 1 to 25 on every page. I would use `re.sub()` to get rid of these too, but be very careful: you don't want to get rid of all the numbers in there. The cleaning doesn't have to be perfect, but try to remove as many line numbers as you can without deleting other numbers.

Step 3 and 4. **Chop off the beginning / chop off the end:** now it would be very helpful to get rid of all of the text that comes before the argument begins, and all the text that comes after the argument (each file has a really annoying index at the end that you don't want to be searching through). Look for words or phrases that appear only at the beginning and at the end of the arguments. The easiest way to isolate the argument is to do a simple `split()` on one of those phrases and keep the half of the split you want. (Am I being too cryptic here? A good split should give you a list with two elements, of which you want to keep one.) Think about it and email me.

Try to get these 4 cleaning actions to work step-by-step in the 4 cells below. As you go, I would assign each cleaner version of the text to a new variable. 
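
If you get stuck, here is one possible shape for the four cleaning actions, as a minimal sketch run on a tiny made-up sample rather than a real transcript. The sample text, the function name `clean_transcript` and the exact patterns are assumptions, and the real files will need adjusting. Two choices are worth noticing: line numbers are only removed at the start of a line, so numbers such as "289" or "15-777" inside the text survive, and line breaks are collapsed to a single space so that words do not run together.

```python
import re

# Made-up sample imitating the layout of the extracted text files
sample = (
    "Official - Subject to Final Review\n"
    "1 PROCEEDINGS\n"
    "2 (10:05 a.m.)\n"
    "3 CHIEF JUSTICE ROBERTS: We'll hear argument\n"
    "4 first this morning in Case No. 15-777.\n"
    "5 MS. SULLIVAN: Section 289 does not\n"
    "6 require that result.\n"
    "Alderson Reporting Company\n"
    "\x0cOfficial - Subject to Final Review\n"
    "1 (Whereupon, at 11:06 a.m., the case was submitted.)\n"
    "Alderson Reporting Company\n"
)

# A number from 1 to 25 at the start of a line, followed by a space/tab or the line end
LINE_NUMBER = re.compile(r'^[ \t]*(?:1\d|2[0-5]|[1-9])(?=[ \t]|$)[ \t]*', re.MULTILINE)


def clean_transcript(text):
    # 1. remove the header and footer repeated on every page
    text = re.sub(r'Alderson Reporting Company|Official - Subject to Final Review', '', text)
    # 2. remove the page-break characters, then the line numbers at line starts
    text = text.replace('\x0c', '')
    text = LINE_NUMBER.sub('', text)
    # collapse every run of whitespace (including newlines) into one space
    text = re.sub(r'\s+', ' ', text).strip()
    # 3. chop off the beginning: keep what follows the PROCEEDINGS marker
    text = re.split(r'PROCEEDINGS \(\d{1,2}:\d\d [ap]\.m\.\)', text, maxsplit=1)[-1]
    # 4. chop off the end: keep what comes before the closing note
    text = text.split('(Whereupon', 1)[0]
    return text.strip()


cleaned = clean_transcript(sample)
```

Taking `[-1]` after the first split means that a file without the start marker is kept whole rather than raising an error, which is handy when you later loop over many files.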

In [10]:
#getting rid of the beginning
remove_beginning = re.split(r'\bPROCEEDINGS \d \(\d\d\:\d\d \w\.m\.\)', sample_transcript)
remove_beginning[1]

' 3 CHIEF JUSTICE ROBERTS: We\'ll hear argument 4 first this morning in Case No. 15-777, Samsung 5 Electronics v. Apple, Incorporated. 6 Ms. Sullivan. 7 ORAL ARGUMENT OF KATHLEEN M. SULLIVAN 8 ON BEHALF OF THE PETITIONERS 9 MS. SULLIVAN: Mr. Chief Justice, and may it 10 please the Court: 11 A smartphone is smart because it contains 12 hundreds of thousands of the technologies that make it 13 work. But the Federal Circuit held that Section 289 of 14 the Patent Act entitles the holder of a single design 15 patent on a portion of the appearance of the phone to 16 total profit on the entire phone. 17 That result makes no sense. A single design 18 patent on the portion of the appearance of a phone 19 should not entitle the design-patent holder to all the 20 profit on the entire phone. 21 Section 289 does not require that result, 22 and as this case comes to the Court on the briefing, 23 Apple and the government now agree that Section 289 does 24 not require that result. We respectfully ask 

In [11]:
#getting rid of the Alderson Reporting Company
remove_alderson = re.sub('Alderson Reporting Company|Official - Subject to Final Review', '', remove_beginning[1])
remove_alderson

' 3 CHIEF JUSTICE ROBERTS: We\'ll hear argument 4 first this morning in Case No. 15-777, Samsung 5 Electronics v. Apple, Incorporated. 6 Ms. Sullivan. 7 ORAL ARGUMENT OF KATHLEEN M. SULLIVAN 8 ON BEHALF OF THE PETITIONERS 9 MS. SULLIVAN: Mr. Chief Justice, and may it 10 please the Court: 11 A smartphone is smart because it contains 12 hundreds of thousands of the technologies that make it 13 work. But the Federal Circuit held that Section 289 of 14 the Patent Act entitles the holder of a single design 15 patent on a portion of the appearance of the phone to 16 total profit on the entire phone. 17 That result makes no sense. A single design 18 patent on the portion of the appearance of a phone 19 should not entitle the design-patent holder to all the 20 profit on the entire phone. 21 Section 289 does not require that result, 22 and as this case comes to the Court on the briefing, 23 Apple and the government now agree that Section 289 does 24 not require that result. We respectfully ask 

In [12]:
remove_end = re.split(r'Whereupon', remove_alderson)
remove_end[0]

' 3 CHIEF JUSTICE ROBERTS: We\'ll hear argument 4 first this morning in Case No. 15-777, Samsung 5 Electronics v. Apple, Incorporated. 6 Ms. Sullivan. 7 ORAL ARGUMENT OF KATHLEEN M. SULLIVAN 8 ON BEHALF OF THE PETITIONERS 9 MS. SULLIVAN: Mr. Chief Justice, and may it 10 please the Court: 11 A smartphone is smart because it contains 12 hundreds of thousands of the technologies that make it 13 work. But the Federal Circuit held that Section 289 of 14 the Patent Act entitles the holder of a single design 15 patent on a portion of the appearance of the phone to 16 total profit on the entire phone. 17 That result makes no sense. A single design 18 patent on the portion of the appearance of a phone 19 should not entitle the design-patent holder to all the 20 profit on the entire phone. 21 Section 289 does not require that result, 22 and as this case comes to the Court on the briefing, 23 Apple and the government now agree that Section 289 does 24 not require that result. We respectfully ask 

In [13]:
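#strip the line numbers (0-29) that sit between a newline or space and a following space, plus triple newlines
#note: replacing with '' also removes the surrounding whitespace, so words at line breaks run together
remove_numbers = re.sub(r'[\n ][12]?\d |\n\n\n', '', remove_end[0])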
remove_numbers

'CHIEF JUSTICE ROBERTS: We\'ll hear argumentfirst this morning in Case No. 15-777, SamsungElectronics v. Apple, Incorporated.Ms. Sullivan.ORAL ARGUMENT OF KATHLEEN M. SULLIVANON BEHALF OF THE PETITIONERSMS. SULLIVAN: Mr. Chief Justice, and may itplease the Court:A smartphone is smart because it containshundreds of thousands of the technologies that make itwork. But the Federal Circuit held that Section 289 ofthe Patent Act entitles the holder of a single designpatent on a portion of the appearance of the phone tototal profit on the entire phone.That result makes no sense. A single designpatent on the portion of the appearance of a phoneshould not entitle the design-patent holder to all theprofit on the entire phone.Section 289 does not require that result,and as this case comes to the Court on the briefing,Apple and the government now agree that Section 289 doesnot require that result. We respectfully ask that theCourt hold that when a design patent claims a design\x0c1 that is applied

In [14]:
remove_x0 = re.sub(r'[\x0c]*', '', remove_numbers)
remove_x0

'CHIEF JUSTICE ROBERTS: We\'ll hear argumentfirst this morning in Case No. 15-777, SamsungElectronics v. Apple, Incorporated.Ms. Sullivan.ORAL ARGUMENT OF KATHLEEN M. SULLIVANON BEHALF OF THE PETITIONERSMS. SULLIVAN: Mr. Chief Justice, and may itplease the Court:A smartphone is smart because it containshundreds of thousands of the technologies that make itwork. But the Federal Circuit held that Section 289 ofthe Patent Act entitles the holder of a single designpatent on a portion of the appearance of the phone tototal profit on the entire phone.That result makes no sense. A single designpatent on the portion of the appearance of a phoneshould not entitle the design-patent holder to all theprofit on the entire phone.Section 289 does not require that result,and as this case comes to the Court on the briefing,Apple and the government now agree that Section 289 doesnot require that result. We respectfully ask that theCourt hold that when a design patent claims a design1 that is applied to 

In [15]:
remove_n = re.sub(r'\n', '', remove_x0)
remove_n

'CHIEF JUSTICE ROBERTS: We\'ll hear argumentfirst this morning in Case No. 15-777, SamsungElectronics v. Apple, Incorporated.Ms. Sullivan.ORAL ARGUMENT OF KATHLEEN M. SULLIVANON BEHALF OF THE PETITIONERSMS. SULLIVAN: Mr. Chief Justice, and may itplease the Court:A smartphone is smart because it containshundreds of thousands of the technologies that make itwork. But the Federal Circuit held that Section 289 ofthe Patent Act entitles the holder of a single designpatent on a portion of the appearance of the phone tototal profit on the entire phone.That result makes no sense. A single designpatent on the portion of the appearance of a phoneshould not entitle the design-patent holder to all theprofit on the entire phone.Section 289 does not require that result,and as this case comes to the Court on the briefing,Apple and the government now agree that Section 289 doesnot require that result. We respectfully ask that theCourt hold that when a design patent claims a design1 that is applied to 

In [16]:
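#remove a single digit surrounded by spaces, or a space followed by a period or digit
#note: this also clips numbers inside the text ("No. 15-777" becomes "No.5-777", "Section 289" becomes "Section89")
remove_digits = re.sub(r' \d | [.\d]', '', remove_n)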
remove_digits

'CHIEF JUSTICE ROBERTS: We\'ll hear argumentfirst this morning in Case No.5-777, SamsungElectronics v. Apple, Incorporated.Ms. Sullivan.ORAL ARGUMENT OF KATHLEEN M. SULLIVANON BEHALF OF THE PETITIONERSMS. SULLIVAN: Mr. Chief Justice, and may itplease the Court:A smartphone is smart because it containshundreds of thousands of the technologies that make itwork. But the Federal Circuit held that Section89 ofthe Patent Act entitles the holder of a single designpatent on a portion of the appearance of the phone tototal profit on the entire phone.That result makes no sense. A single designpatent on the portion of the appearance of a phoneshould not entitle the design-patent holder to all theprofit on the entire phone.Section89 does not require that result,and as this case comes to the Court on the briefing,Apple and the government now agree that Section89 doesnot require that result. We respectfully ask that theCourt hold that when a design patent claims a design1 that is applied to a compon

In [17]:
# re.findall(r'[A-Z]+: [\S\s]*?=', sample_transcript)
#look ahead
#split on speaker labels: a run of 8 or more uppercase letters, periods or whitespace, followed by a colon
#the parentheses capture the label, so re.split keeps each speaker in the resulting list
lines = re.split(r'([A-Z.\s]{8,}):', remove_digits)
lines

['',
 'CHIEF JUSTICE ROBERTS',
 " We'll hear argumentfirst this morning in Case No.5-777, SamsungElectronics v. Apple, Incorporated.Ms. Sullivan",
 '.ORAL ARGUMENT OF KATHLEEN M. SULLIVANON BEHALF OF THE PETITIONERSMS. SULLIVAN',
 ' Mr. Chief Justice, and may itplease the Court:A smartphone is smart because it containshundreds of thousands of the technologies that make itwork. But the Federal Circuit held that Section89 ofthe Patent Act entitles the holder of a single designpatent on a portion of the appearance of the phone tototal profit on the entire phone.That result makes no sense. A single designpatent on the portion of the appearance of a phoneshould not entitle the design-patent holder to all theprofit on the entire phone.Section89 does not require that result,and as this case comes to the Court on the briefing,Apple and the government now agree that Section89 doesnot require that result. We respectfully ask that theCourt hold that when a design patent claims a design1 that is a

In [21]:
#get rid of these two things:
#do the regular expression on lines[3] and lines[24]
#
# .REBUTTAL ARGUMENT OF KATHLEEN M. SULLIVANON BEHALF OF THE PETITIONERSMS.
#
# .ORAL ARGUMENT OF KATHLEEN M. SULLIVANON BEHALF OF THE PETITIONERSMS.

In [22]:
remove_petitioners = re.sub(r'([A-Z.\s]{60,})','',lines[3])
remove_petitioners

''

In [19]:
remove_petitioners = re.sub(r'REBUTTAL ARGUMENT OF KATHLEEN M. SULLIVANON BEHALF OF THE PETITIONERS','',lines[-24])
remove_petitioners

r'([A-Z.\s]{60,})'

'.MS. SULLIVAN'

In [20]:
remove_petitioners

'.MS. SULLIVAN'

In [None]:
lines[1::20]
#change the step number. what's happening?
#pass the content to a dataframe and say: here's all the content for one column and then for another column 

**How in the world are you going to clean this up?**
Take a close look and think first about what needs to be removed, and then about what needs to be isolated. You'll probably need a combination of regular expressions (especially `re.sub()`, which is a regex replace) and simple splits, where you split the text at a given point and keep just the part you want. If you want to figure this out on your own, don't read any further; if you're starting to get stuck, go a few cells down and follow my hints.

Also take a look at the hint below--it might come in very handy...


In [None]:
#A note on regex splits:
#look at the difference between regex1 and regex2
#A split using groups keeps the groups!

string = "Tomorrow and tomorrow and tomorrow"
regex1 = r"and" #not grouped: the separators are dropped
regex2 = r"(and)" #grouped: the separators are kept in the list
print(re.split(regex1, string))
print(re.split(regex2, string))

In [None]:
# transcript_lines_list = sample_transcript.split('\n')
# transcript_lines_list

### Get your dialogue list
Now this transcription should be clean enough to get a list with every speaker, and what the speaker said. The pattern for the speakers is fairly obvious--my recommendation is to do a split using groups (like the example I show above with "tomorrow and tomorrow").

If you write your regular expression correctly, you should get a single list in which each element is either a speaker or what was said.
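
As a hint, here is a minimal sketch of a grouped split on speaker labels. It assumes the text has been cleaned so that words are still separated by spaces, and that every speaker label is a title followed by an uppercase surname and a colon. The list of titles in `speaker_pattern` and the sample string are assumptions; check your own transcripts for other titles.

```python
import re

# Made-up snippet of cleaned text
cleaned = ("CHIEF JUSTICE ROBERTS: We'll hear argument first this morning. "
           "MS. SULLIVAN: Mr. Chief Justice, and may it please the Court: "
           "A smartphone is smart. "
           "JUSTICE SOTOMAYOR: So how do you deal with Chambers?")

# The outer parentheses form the capturing group, so the speaker labels are kept.
# (?: ... ) groups the titles without capturing them separately.
speaker_pattern = r"((?:CHIEF JUSTICE|JUSTICE|MR\.|MS\.|MRS\.|GENERAL) [A-Z][A-Z'\-]+):"

parts = re.split(speaker_pattern, cleaned)
```

Because the pattern is case-sensitive and requires a title, "Mr. Chief Justice" and "the Court:" inside a speech do not trigger a split. The first element of `parts` is whatever came before the first speaker, which here is an empty string.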

In [None]:
#get a list of speaker and speech




### Make it a list of pairs
If you got your list the way I recommended, it is just a single list with element after element. You need to figure out how to change it so that each speaker is paired with what is said. Give it some thought; there are a few ways to do this. If you made it this far, you're doing great!
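
One of those ways, sketched under the assumption that your flat list starts with a leftover element (an empty string or preamble) and then alternates speaker, speech, speaker, speech: slice out the odd and even positions and zip them together. The function name `pair_up` is just an illustration.

```python
# Flat list in the shape a grouped re.split produces
parts = ['', 'MR. BERGERON', " Yes. That's essentially the same thing",
         'JUSTICE SOTOMAYOR', ' So how do you deal with Chambers?']


def pair_up(parts):
    """Pair each speaker with the speech that follows it."""
    speakers = parts[1::2]   # elements 1, 3, 5, ...
    speeches = parts[2::2]   # elements 2, 4, 6, ...
    return [[speaker, speech.strip()] for speaker, speech in zip(speakers, speeches)]


dialogue = pair_up(parts)
```

If you would rather have a list of dictionaries, the same loop can build `{'speaker': speaker, 'speech': speech.strip()}` instead of a two-element list.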

In [None]:
#make it a list of pairs of speaker and speech



### Loop through all texts
If you made it this far--congratulations! 
The only thing left is to set up a loop that goes through all the texts and runs the cleanup and parsing on each one. You will need to have completed Step 1 to write this loop, because you will need the names of the PDFs. (Each final list should also contain the PDF name, so you can reference it from your case database.)
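
A possible skeleton for that loop, as a sketch only: it assumes the Step 1 dictionaries store the PDF file name under the key `'Text'`, that the text files sit together in one folder, and that you pass in your own function that cleans and parses a single transcript. The names `txt_name_for`, `parse_all` and `parse_one` are hypothetical.

```python
from pathlib import Path


def txt_name_for(pdf_name):
    """Swap the .pdf extension for .txt; the rest of the name is identical."""
    return Path(pdf_name).with_suffix('.txt').name


def parse_all(cases, folder, parse_one):
    """Run parse_one on every transcript, keyed by the PDF name from Step 1."""
    results = {}
    for case in cases:
        pdf_name = case.get('Text')
        if not pdf_name:
            continue  # skip empty dictionaries
        path = Path(folder) / txt_name_for(pdf_name)
        if not path.exists():
            continue  # no text file for this case
        text = path.read_text(encoding='utf-8', errors='replace')
        results[pdf_name] = parse_one(text)
    return results
```

You would call it with something like `parse_all(supreme_court_list_all, 'pdfs', my_parser)`, where the folder name and `my_parser` are yours to fill in. Keying the results by PDF name gives you the link back to the case database.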

In [None]:
# you could try here--Or email me with questions...