# Kate Cough

#### June 29 2017


## Supreme Court Project Guide

The ultimate goal of this project is to build a database of Supreme Court cases for 2016 that includes the dialogue from the oral arguments of each case. As we have seen in class the arguments were scraped from this page: https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx 

I have already downloaded and transformed the PDFs of the transcripts into text documents which you can download from courseworks: supreme_court_pdfs_txt.zip

There are three steps that you need to complete:

**Please note:** Step 3 is the most challenging--if you want to spend some time coding, you can skip Steps 1 and 2 and get to work on Step 3

**STEP 1:** scrape all of the case information available on this page: https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx 

This should include case name, docket number, etc--and most importantly the name of the PDF file. All of the text files share the exact same name as the PDF files they came from. This file name will allow you to connect your transcription data with your case data. 

It is up to you what kind of data structure you want to build, but it is likely to be a list of lists or a list of dictionaries: for each case you will have a list or dictionary of the information you scrape from the webpage.
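
As a rough illustration of the list-of-dictionaries option, here is a minimal sketch. The case names and dates are made-up placeholders, and `index_by_file` is a hypothetical helper, not something the assignment requires. It shows how the shared file name can act as the lookup key between case data and transcripts.

```python
# Each case is one dictionary; the whole table is a list of them.
# Case names and dates below are placeholders, not real scraped data.
cases = [
    {"Case Name": "Example v. Sample", "Docket Number": "15-777",
     "Date Argued": "01/01/17", "Text": "15-777_1b82"},
    {"Case Name": "Another v. Placeholder", "Docket Number": "14-1538",
     "Date Argued": "01/02/17", "Text": "14-1538_j4ek"},
]

def index_by_file(case_list):
    """Map each transcript file name to its case dictionary."""
    return {case["Text"]: case for case in case_list}

cases_by_file = index_by_file(cases)
print(cases_by_file["15-777_1b82"]["Docket Number"])
```

With an index like this, any parsed transcript can find its case record in one step from its file name.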

**STEP 2:** find a secondary source to scrape/integrate with your case data. The information on the Supreme Court page is very limited. You need to find a source or group of sources that add information. The most important information would likely be: the decision, who voted for and against, and the state of origin of the case (for geocoding). You might think of other great things to put in there too! This information needs to be merged with the data you have from STEP 1.

**STEP 3:** use regular expressions to clean up and parse the text files so that you have a searchable data structure containing the dialog from the transcripts. 

From a data architecture perspective, you probably want to have a separate list for each case and in each list a data structure that pairs the speaker with what she/he says. Like:

`[['MR. BERGERON'," Yes. That's essentially the same thing"],[ 'JUSTICE SOTOMAYOR',' So how do you deal with Chambers?']]`

This is a list of lists --it could also be a list of dictionaries if you want it to be. The real programmatic challenge here is to clean up the text files and parse them successfully. Most of the instructions below are devoted to this, but Steps 1 and 2 are also extremely important.
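
To make the two shapes concrete, here is a small sketch that starts from the example pairs above. The `lines_by` helper and the dictionary keys `speaker` and `words` are assumptions chosen for illustration.

```python
# Dialogue for one case as a list of [speaker, speech] pairs (from the example above).
dialogue = [
    ["MR. BERGERON", " Yes. That's essentially the same thing"],
    ["JUSTICE SOTOMAYOR", " So how do you deal with Chambers?"],
]

# The same data as a list of dictionaries.
dialogue_dicts = [{"speaker": s, "words": w.strip()} for s, w in dialogue]

def lines_by(speaker, pairs):
    """Return everything one speaker said, in order."""
    return [words.strip() for name, words in pairs if name == speaker]

print(lines_by("JUSTICE SOTOMAYOR", dialogue))
```

Either shape is searchable; the dictionary version is a little easier to load into a data frame later because the keys become column names.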

Go step-by-step through this, and email me whenever you get stuck, and I will help. If you complete all the steps before Tuesday, email me if you want to go further.



### STEP 1
Scrape all of the necessary information from:

https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx 

You should end up with a list of dictionaries, one for each case.

In [None]:
###Import your libraries and all other things

from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
import re
import numpy as np
import geopandas as gpd

from shapely.geometry import Point

from collections import Counter

%matplotlib inline

In [None]:
###Write your scraping code here
url = 'https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx'
raw_html = urlopen(url).read()
doc = BeautifulSoup(raw_html, 'html.parser')

result = requests.get('https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx')
#store the result of the request in the variable 'result'
# result.text

In [None]:
### Print out your list of lists or dictionaries here
#define the variable doc
#the info is in the table (defined by the class table datatables) and in the table rows.
#define the variable for the table rows
#make a list to store the info

table = doc.find('table', class_ = 'table datatables')
cases = table.find_all('tr')

supreme_court_list_all = []

for each_case in cases:
    
    current = {}
    #make a dictionary. for each entry in the dictionary 
    #there will be four key : value pairs: link, name, date and docket_number, defined below
    #using beautiful soup and the tags, we'll find each one. remember we're already inside the tr tag
    
    link = each_case.find_all('td')[0].find('a')
    
    name = each_case.find('span')
    
    date = each_case.find_all('td')[1].string
    
    docket_number = each_case.find_all('td')[0].find(target = '_blank')
    
    if link:
        current['Text'] = link['href'].split('/')[-1]
    if name:
        current['Case Name'] = name.string
    if date:    
        current['Date Argued'] = date
    if docket_number:
        current['Docket Number'] = docket_number.string.strip()
    
    supreme_court_list_all.append(current)
    
supreme_court_list_all



In [None]:
scraped_supreme_court = pd.DataFrame(supreme_court_list_all)
scraped_supreme_court.shape


In [None]:
# df.to_csv("supreme_court_list_all.csv", index=False)
# supreme_court_list_all = pd.read_csv('supreme_court_list_all.csv')
# supreme_court_list_all.head()


In [None]:
scraped_supreme_court.head()

In [None]:
scraped_supreme_court.dropna(inplace=True)

In [None]:
array = scraped_supreme_court['Text'].unique()
array.sort()
array

In [None]:
#Open a text file from your computer
f = open('/Users/kaitlincough/Documents/Lede/thirkield/final_project_supreme_court/pdfs/15-777_1b82.txt', 'r')
sample_transcript = f.read()

In [None]:
# !cat /Users/kaitlincough/Documents/Lede/thirkield/python_notebooks_thirkield/pdfs/15-777_1b82.txt

In [None]:
#Take a look at the text file
sample_transcript

### Cleaning comes first

A step-by-step way of cleaning up this mess.

Step 1. You might notice that every page has:

`Alderson Reporting Company

Official - Subject to Final Review`
 
You want to get rid of that. I would use a regex sub() 

Step 2. **Line Numbers:** you might also notice these annoying line numbers going from 1 - 25 everywhere: I would use the regex sub() to get rid of this too -- but be very careful, you don't want to get rid of all the numbers in there. The cleaning doesn't have to be perfect, but try to get as many of them as you can without deleting other numbers.

Step 3 and 4. **chop off the beginning/ chop off the end**: now it would be very helpful to get rid of all of the text that comes before the argument begins, and all the text that comes after the argument (each transcript has a really annoying index at the end that you don't want to be searching through). Look for words or phrases that uniquely repeat at the beginning and at the end of the arguments. The easiest way to isolate the argument is to do a simple split() on one of those phrases, and keep the half of the split you want. (Am I being too cryptic here?--a good split should give you a list with two elements, and you want to keep one of them.) Think about it and email me.

Try to get these 4 cleaning actions to work step-by-step in the 4 cells below. As you go, I would assign each cleaner version of the text to a new variable. 
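
The four cleaning actions can be sketched on a toy page like this. The sample text is made up to mimic the transcript layout, and the `PROCEEDINGS` and `(Whereupon` markers are assumptions about what brackets the argument; check them against your own files.

```python
import re

# A tiny made-up page in the same layout as the transcripts (not real transcript text).
sample = (
    "Official - Subject to Final Review\n"
    "1 IN THE SUPREME COURT\n"
    "2 PROCEEDINGS\n"
    "3 MR. SMITH: There are 12 reasons.\n"
    "4 JUSTICE DOE: Name 2 of them.\n"
    "25 (Whereupon, the case was submitted.)\n"
    "Alderson Reporting Company\n"
)

# 1. Drop the repeated header/footer lines.
no_boiler = re.sub(
    r"Alderson Reporting Company|Official - Subject to Final Review", "", sample
)
# 2. Drop a 1-25 line number only when it starts a line, so "12 reasons" survives.
no_numbers = re.sub(r"^(?:[1-9]|1\d|2[0-5]) ", "", no_boiler, flags=re.MULTILINE)
# 3. Keep what follows the opening marker.
after_start = no_numbers.split("PROCEEDINGS", 1)[1]
# 4. Keep what precedes the closing marker.
argument = after_start.split("(Whereupon", 1)[0].strip()

print(argument)
```

Anchoring the line-number pattern to the start of a line is one way to avoid deleting numbers inside the dialogue; it will still miss line numbers that the PDF conversion placed mid-line.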

In [None]:
#getting rid of the beginning
remove_beginning = re.split(r'\bPROCEEDINGS \d \(\d\d\:\d\d \w\.m\.\)', sample_transcript)
remove_beginning[1]

In [None]:
#getting rid of the Alderson Reporting Company
remove_alderson = re.sub('Alderson Reporting Company|Official - Subject to Final Review', '', remove_beginning[1])
remove_alderson

In [None]:
remove_end = re.split(r'Whereupon', remove_alderson)
remove_end[0]

In [None]:
#remove the number on the left hand side of the page
remove_numbers = re.sub(r'[\n ][12]?\d |\n\n\n|', '', remove_end[0])
remove_numbers

In [None]:
#remove more numbers
remove_nums = re.sub(r" \d ", "", remove_numbers)
remove_nums

In [None]:
#remove the katherine sullivan line
remove_argument = re.sub(r"\w+ ARGUMENT [^a-z]+ (PETITIONER|RESPONDENT)S?", "", remove_nums)
remove_argument

In [None]:
#get rid of the x0c
remove_x0 = re.sub(r'[\x0c]*', '', remove_argument)
remove_x0

In [None]:
#and the ns
remove_n = re.sub(r'\n', '', remove_x0)
remove_n

In [None]:
#remove the timestamp before roberts
remove_time = re.split(r"([A-Z.\s]+:)", remove_n)
remove_time[1:]

In [None]:
remove_time = re.split(r"([A-Z.\s]+:)", remove_argument)
remove_time[1:]

### Get your dialogue list
Now this transcription should be clean enough to get a list with every speaker, and what the speaker said. The pattern for the speakers is fairly obvious--my recommendation is to do a split using groups (like the example I show above with "tomorrow and tomorrow").

If you write your regular expression correctly: you should get a single list in which each element is either a speaker, or what was said.
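
Here is a minimal sketch of a split that uses a group. The text is made up, and the speaker pattern (a title followed by a capitalized surname and a colon) is an assumption that is narrower than what real transcripts may need.

```python
import re

# Made-up cleaned text; speaker labels are capitals ending in a colon.
text = "MR. SMITH: May it please the Court. JUSTICE DOE: Go ahead. MR. SMITH: Thank you."

# The outer parentheses form a capturing group, so re.split keeps the speaker
# labels in the result instead of throwing them away.
speaker_pattern = r"((?:MR\.|MS\.|JUSTICE|CHIEF JUSTICE) [A-Z]+):"
parts = re.split(speaker_pattern, text)

# parts[0] is whatever came before the first speaker (empty here), so skip it.
flat = [p.strip() for p in parts[1:]]
print(flat)
```

Without the capturing group, the same split would return only the speeches and the speaker names would be lost.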

In [None]:
#get a list of speaker and speech

speech1 = remove_time[1:]
speech1

### Make it a list of pairs
If you got your list the way I recommended, it is just a single list with element after element. You need to figure out how to change it so that each speaker is paired with what is said. Give it some thought, there are a few ways to try to do this. If you made it this far, you're doing great!
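
Two possible ways to do the pairing are sketched below on a made-up flat list. Both assume the list strictly alternates speaker, speech, speaker, speech.

```python
# A flat list alternating speaker, speech, speaker, speech (made-up content).
flat = ["MR. SMITH", "May it please the Court.", "JUSTICE DOE", "Go ahead."]

# Option 1: slice out the even and odd positions and zip them together.
pairs = list(zip(flat[0::2], flat[1::2]))

# Option 2: step through the list two items at a time.
pairs_loop = [[flat[i], flat[i + 1]] for i in range(0, len(flat) - 1, 2)]

print(pairs)
```

The zip version produces tuples and the loop version produces lists; if a speaker at the very end has no speech, both quietly drop that last unpaired element.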

In [None]:
#make it a list of pairs of speaker and speech

speech2 = list(zip(speech1[0::2], speech1[1::2]))
speech2

In [None]:
remove_beginning = re.split(r'\bPROCEEDINGS \d \(\d\d\:\d\d \w\.m\.\)', sample_transcript)
#keep the text after the PROCEEDINGS marker, as in the step-by-step cells ([-1] also works if the marker is missing)
remove_alderson = re.sub('Alderson Reporting Company|Official - Subject to Final Review', '', remove_beginning[-1])
remove_end = re.split(r'Whereupon', remove_alderson)
remove_numbers = re.sub(r'[\n ][12]?\d |\n\n\n|', '', remove_end[0])
remove_nums = re.sub(r" \d ", "", remove_numbers)
remove_argument = re.sub(r"\w+ ARGUMENT [^a-z]+ (PETITIONER|RESPONDENT)S?", "", remove_nums)
remove_x0 = re.sub(r'[\x0c]*', '', remove_argument)
remove_n = re.sub(r'[\n]', '', remove_x0)
remove_time = re.split(r"([A-Z.\s]+:)", remove_n)
#rebuild speech1 from this run instead of reusing the one from the earlier cell
speech1 = remove_time[1:]
#these two fixes are specific to this transcript (15-777_1b82)
speech1[2] = re.sub(r"\.\s[A-Z.\s]{67}", "", speech1[2])
speech1[-24] = re.sub(r"\.\s[A-Z.\s]{71}", "", speech1[-24])
speech2 = list(zip(speech1[0::2], speech1[1::2]))


In [None]:
speech2

In [None]:
def court_text(lines):
    remove_beginning = re.split(r'\bPROCEEDINGS \d \(\d\d\:\d\d \w\.m\.\)', lines)
    #keep the text after the PROCEEDINGS marker; [-1] falls back to the whole text if the marker is missing
    remove_alderson = re.sub('Alderson Reporting Company|Official - Subject to Final Review', '', remove_beginning[-1])
    remove_end = re.split(r'Whereupon', remove_alderson)
    remove_numbers = re.sub(r'[\n ][12]?\d |\n\n\n|', '', remove_end[0])
    remove_nums = re.sub(r" \d ", "", remove_numbers)
    remove_argument = re.sub(r"\w+ ARGUMENT [^a-z]+ (PETITIONER|RESPONDENT)S?", "", remove_nums)
    remove_x0 = re.sub(r'[\x0c]*', '', remove_argument)
    remove_n = re.sub(r'[\n]', '', remove_x0)
    remove_time = re.split(r"([A-Z.\s]+:)", remove_n)
    #build the speaker/speech list from this transcript, not from the global speech1
    speech1 = remove_time[1:]
    #the speech1[2] and speech1[-24] fixes in the cell above only apply to 15-777_1b82, so they are left out here
    speech2 = list(zip(speech1[0::2], speech1[1::2]))
    
    return speech2 

In [None]:
court_text(sample_transcript)
#run the function on sample_transcript

### Loop through all texts
If you made it this far--congratulations! 
The only thing left is to set up a loop that goes through all the texts and runs the cleanup and parsing on each one. You will need to have completed Step 1 in order to do this loop, because you will need the names of the PDFs. (Each final list should also contain the PDF name, so you can reference it from your case database.)
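
The shape of such a loop can be sketched as follows. `parse_transcript` is a hypothetical stand-in for your own cleaning and parsing function, `parse_folder` is an illustrative name, and the demo writes a made-up transcript to a temporary folder so the example runs on its own.

```python
import tempfile
from pathlib import Path

def parse_transcript(text):
    """Hypothetical stand-in for the real cleaning and parsing function."""
    pairs = []
    for line in text.splitlines():
        speaker, _, words = line.partition(": ")
        pairs.append([speaker, words])
    return pairs

def parse_folder(folder, file_names):
    """Parse each <name>.txt in folder and tag every row with its file name."""
    rows = []
    for name in file_names:
        text_path = Path(folder) / f"{name}.txt"
        if not text_path.exists():
            continue  # skip transcripts that were not downloaded
        for speaker, words in parse_transcript(text_path.read_text()):
            rows.append([speaker, words, name])
    return rows

# Demo with a throwaway folder and a made-up transcript.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "15-777_1b82.txt").write_text("MR. SMITH: Hello.\nJUSTICE DOE: Go on.")
    all_rows = parse_folder(folder, ["15-777_1b82", "missing_file"])

print(all_rows)
```

Carrying the file name on every row is what lets the dialogue be joined back to the case data later.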

In [None]:
# all_cases = ['14-1538_j4ek', '14-9496_feah']
#create an empty list for the parsed dialogue (this reuses the name of the scraped list from Step 1)
supreme_court_list_all = []
path = '/Users/kaitlincough/Documents/Lede/thirkield/final_project_supreme_court/pdfs'
# bad files
bad_files = {'15-1358_7648', '15-577_l64n', '14-1055_h3dj', '15-866_j426', '16-32_mlho', '16-466_4g15', '16-529_21p3'}
for file_name in array:
    print(file_name)
    if file_name not in bad_files:
        #open this case's text file (assumes file_name has no extension, as the bad file names suggest)
        with open(path + '/' + file_name + '.txt', 'r') as f:
            sample_transcript = f.read()
        #remember court_text was our function, and it's being applied to sample_transcript
        this_list = court_text(sample_transcript)
        #create another empty list
        better_list = []
        #add the file name to every speaker/speech pair
        for each in this_list:
            entry = list(each)
            entry.append(file_name)
            better_list.append(entry)
        supreme_court_list_all.extend(better_list)

In [None]:
# supreme_court_list_all
# col_names = ['speaker', 'words', 'case_id']
# supreme_df = pd.DataFrame(supreme_court_list_all, columns=col_names)
# supreme_df.head()

## After this we can take the information from our scraped file and this loop through all the texts and merge them.

There's also a third data frame with information gathered from elsewhere. It's called cases_df.
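
Before merging, it can help to check that the keys on both sides really line up. The sketch below uses placeholder keys; `docket_from_file` and `clean_docket` are hypothetical helpers, and the assumption is that transcript file names end in an underscore plus a four-character code while scraped docket numbers may carry stray periods or spaces.

```python
import re

def docket_from_file(file_name):
    """Strip the trailing _xxxx code from a transcript file name."""
    return re.sub(r"_\w{4}$", "", file_name)

def clean_docket(docket):
    """Remove periods and surrounding spaces from a scraped docket number."""
    return docket.replace(".", "").strip()

# Placeholder keys in the two shapes the merge has to reconcile.
transcript_ids = ["15-777_1b82", "14-1538_j4ek"]
scraped_dockets = ["15-777. ", "16-32"]

left = {docket_from_file(name) for name in transcript_ids}
right = {clean_docket(d) for d in scraped_dockets}

print(sorted(left & right))  # keys that will match in the merge
print(sorted(left ^ right))  # keys that appear on only one side
```

Keys that show up on only one side are the rows an outer merge will fill with missing values, so they are worth inspecting first.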

In [None]:
#let's look at the scraped file and remind ourselves what's in there
scraped_supreme_court.head()

In [None]:
#delete Text column, clean up Docket Number so it'll match later
scraped_supreme_court['Docket Number'] = scraped_supreme_court['Docket Number'].replace(r'\.', '', regex=True)
del scraped_supreme_court['Text']
scraped_supreme_court.head()

In [None]:
#import the clean df with the information from looping through the texts
supreme_df_clean = pd.read_csv('/Users/kaitlincough/Documents/Lede/thirkield/cases_data.csv')
supreme_df_clean.head()

In [None]:
#get rid of the last four digits at the end of case id so you can join it with the cases df
supreme_df_clean['case_id'] = supreme_df_clean['case_id'].replace(r'\_\w\w\w\w', '', regex=True)

In [None]:
# cases_clean_supreme = cases_df.merge(supreme_df_clean, left_on='Docket Number', right_on='case_id', how='outer')

df_all_info = scraped_supreme_court.merge(supreme_df_clean, left_on='Docket Number', right_on='case_id', how='outer' )

In [None]:
df_all_info.head()

In [None]:
#clean up the columns a little, delete duplicates

df_all_info.rename(columns={'words': 'Text', 'speaker': 'Speaker', 'Date Argued_x':'Date Argued', 'Area*': 'Area'}, inplace=True)

df_all_info['Speaker'] = df_all_info['Speaker'].replace(r'\:', '', regex=True)

#get rid of the case_id column, it's redundant
del df_all_info['case_id']

# del df_all_info['case_id']
# del df_all_info['Date Argued_y']

In [None]:
df_all_info.head()

## Now let's bring in the data frame we all created as a group

In [None]:
cases_df = pd.read_csv('/Users/kaitlincough/Documents/Lede/thirkield/final_project_supreme_court/supreme_court_info_created.csv')
cases_df.rename(columns={'Area*': 'Area'}, inplace=True)

cases_df.head()

In [None]:
df_all_info.head()

In [None]:
#merge the df we created (cases_df) with the df made after looping through all the texts and adding the scraped df--
#so merge df_all_info with cases_df. The common column is Docket Number

final_df = cases_df.merge(df_all_info, left_on='Docket Number', right_on='Docket Number', how='outer')
final_df.head()

In [None]:
#drop the rows with no Text, otherwise the word count below fails on missing values
final_df = final_df.dropna(subset=['Text'])

In [None]:
#counting the words in the text column and creating a new column with the count
final_df['Word Count'] = final_df['Text'].apply(lambda x: len(x.split()))
final_df.head()

In [None]:
#get rid of the NA values

final_df.dropna(inplace=True)

In [None]:
#make a DF with ONLY the Justices
justices_only_df = final_df[final_df['Speaker'].str.contains('JUSTICE')]
justices_only_df.head(2)

In [None]:
#let's get rid of the columns we don't need
# justices_only_df.drop(['State', 'Date Argued_x', 'Decision', 'Status','Court Leaning', 'Previous Court', 'Date Argued_y'], axis=1, inplace=True)

In [None]:
justices_only_df.head()

In [None]:
justices_only_df.dropna()

In [None]:
justices_only_df_new = justices_only_df.groupby(['Area','Speaker', 'Case Name', 'Latitude', 'Longitude'])['Word Count'].sum()

justices_only_df_new = justices_only_df_new.reset_index()
justices_only_df_new.head()


In [None]:
# ax = justices_only_df_new.groupby(['Area', 'Speaker'])['Word Count'].sum().sort_values(ascending=True).plot(x='Speaker',y='Word Count', kind='barh', figsize=(50,50))

In [None]:
justices_only_df_new

In [None]:
#turn the lat/lon columns into a new column, geometry, that is geometric points
#Point takes x (longitude) first, then y (latitude)
justices_only_df_new['geometry'] = justices_only_df_new.apply(lambda row: Point(row.Longitude, row.Latitude), axis=1)

In [None]:
justices_only_df_new.head()

In [None]:
justices_only_df_new.rename(columns={'Area': 'area', 'Speaker': 'speaker', 'Case Name':'case_name', 'Latitude':'latitude','Longitude': 'longitude', 'Word Count':'word_count'}, inplace=True)


In [None]:
justices_only_df_new

In [None]:
justices_only_df_new = gpd.GeoDataFrame(justices_only_df_new)
type(justices_only_df_new)


In [None]:
justices_only_df_new.to_file('justices_only_df_new.json', driver='GeoJSON')

In [None]:
#for each area, who is the most frequent speaker?
justices_only_df.plot(x='Speaker', y='Word Count', figsize=(20,10))

## Now we have a data frame (area_grouped_df) where we can look for most frequent words. 
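
The counting idea can be tried on plain text first. This is a minimal sketch with made-up text standing in for one group's combined speech; `top_words` is a hypothetical helper, and the minimum word length of 5 mirrors the `\w{5,}` pattern used in the cells below.

```python
import re
from collections import Counter

def top_words(text, min_length=5, n=2):
    """Count words of at least min_length characters and return the n most common."""
    words = re.findall(r"\b\w{%d,}\b" % min_length, text.lower())
    return Counter(words).most_common(n)

def unpack(freq_words):
    """Turn [(word, count), ...] into a readable 'word: count' string."""
    return " ".join(f"{word}: {count}" for word, count in freq_words)

# Made-up text standing in for one group's combined speech.
speech = "The patent covers the design. The design patent statute says design."
print(unpack(top_words(speech)))
```

Once this works on a single string, the same function can be applied to the Text column of each group.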

In [None]:
#convert the Text column to a string so you can count 
#the most frequent words (there are some numbers in there)


area_grouped_df['Text'] = area_grouped_df['Text'].astype(str)

In [None]:
justices_only_df.groupby('Speaker')['Text'].nunique()
wordcount = justices_only_df.groupby(["Speaker"]).sum().applymap(lambda words: Counter(re.findall(r"\b\w{12,}\b",words.lower())).most_common())
wordcount

In [None]:
# freqwords = area_grouped_df.head().applymap(lambda x:Counter(" ".join(area_grouped_df["Text"]).split()).most_common())
# freqwords

In [None]:
# freqwords = area_grouped_df.applymap(lambda x: Counter(re.findall(r"\b\w{5,}\b",x.lower())).most_common())

# wordcount = df.groupby(["speaker"]).sum().applymap(lambda words: Counter(re.findall(r"\b\w{12,}\b",words.lower())).most_common())


freqwords = area_grouped_df.groupby(["Area"]).sum().applymap(lambda words: Counter(re.findall(r"\b\w{5,}\b",words.lower())).most_common())
freqwords


In [None]:
def unpack(freq_words):
    string = ""
    for pair in freq_words[0:10]:
        string += pair[0] + ": " + str(pair[1]) + " "
    return string
freqwords

In [None]:
# freqwords = area_grouped_df.applymap(lambda x:Counter(" ".join(area_grouped_df["Text"]).split()).most_common())


In [None]:
#functions that semi-work:

# Counter(" ".join(df_all_info["Text"]).split()).most_common()

freqwords = area_grouped_df.applymap(lambda x: Counter(re.findall(r"\b\w{5,}\b",x.lower())).most_common())

#find the old df with just


def unpack(freq_words):
    string = ""
    for pair in freq_words[0:10]:
        string += pair[0] + ": " + str(pair[1]) + " "
    return string
freqwords

In [None]:
#THIS ONE WORKS
# freqwords = supreme_words.groupby(['Docket Number']).sum().applymap(lambda x:Counter(" ".join(df_all_info["Text"]).split()).most_common())


In [None]:
#group by case and see what you want to get out
#most frequent words per case. most frequent words per state
#what case had the longest transcript?
#pick a particular justice or two to track
#which justice spoke the most from case to case
#longest word count by any justice in each case and have that be printed as the pull quote
#you could focus on ten cases in one sector
#get a data frame and compress it and export it
#groupby case, get a valuecount for each word in the text
#groupby state, count the number of occurrences of speech for every justice, make that into df that has it grouped by
#case and speaker
#group down to the case and run things on the case and see what you can pull out of the case

## Now let's look at the geojson file. Eventually we'll have to merge it right?
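
For reference, this is roughly what a GeoJSON point file contains. The sketch builds one by hand with the standard library; the row is a placeholder and `to_feature` is a hypothetical helper. The detail worth noticing is that GeoJSON coordinates are written as longitude first, then latitude.

```python
import json

def to_feature(row):
    """Build one GeoJSON point; coordinates go in [longitude, latitude] order."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point",
                     "coordinates": [row["Longitude"], row["Latitude"]]},
        "properties": {k: v for k, v in row.items()
                       if k not in ("Latitude", "Longitude")},
    }

# Placeholder row; the coordinates are roughly Washington, DC.
rows = [{"Docket Number": "15-777", "Latitude": 38.9, "Longitude": -77.0}]
collection = {"type": "FeatureCollection",
              "features": [to_feature(r) for r in rows]}

print(json.dumps(collection, indent=2))
```

This is the same ordering that `Point(row.Longitude, row.Latitude)` relies on when the data frame is exported with geopandas.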

In [None]:
# geo_file = "/Users/kaitlincough/Documents/Lede/thirkield/final_project_supreme_court/map_templatesUPDATED/states_data_all.geojson"
# states = gpd.read_file(geo_file)
# states.head(100)

In [None]:

geo_points = pd.read_csv('/Users/kaitlincough/Documents/Lede/thirkield/final_project_supreme_court/supreme_lat_lon.csv')
geo_points = geo_points[['Docket Number', 'Latitude', 'Longitude']]

# freqwords = freqwords.merge(geo_points, left_on='Docket Number', right_on='Docket Number', how='outer')

# freqwords['geometry'] = freqwords.apply(lambda row: Point(row.Longitude, row.Latitude), axis=1)

# #turn final data frame into geojson file
# freqwords = gpd.GeoDataFrame(freqwords)
# type(freqwords)
# freqwords.to_file('WHATEVER.json', driver='GeoJSON')

# must have lat/lon/docket number 

# #read in the csv file we made
# geo_points = pd.read_csv('/Users/kaitlincough/Documents/Lede/thirkield/final_project_supreme_court/supreme_lat_lon.csv')
# #only import these columns
# geo_points = geo_points[['Docket Number', 'Latitude', 'Longitude']]

#merge the data frame we made with the final data frame
# freqwords = freqwords.merge(geo_points, left_on='Docket Number', right_on='Docket Number', how='outer')

#turn the lat/lon columns into a new column, geometry, that is geometric points
# freqwords['geometry'] = freqwords.apply(lambda row: Point(row.Longitude, row.Latitude), axis=1)

#change the colors based on a definition
# freqwords.loc[freqwords['word_count'] > 400, 'color'] = 'black'

#change the filename in the html file
#paste all of your json file into the points.json or whatever it is in the maps template 

In [None]:
geo_points.head()

In [None]:
freqwords.head()

In [None]:
#geo_points only has Docket Number, Latitude and Longitude, so join on Docket Number
freqwords = freqwords.merge(geo_points, left_on='Docket Number', right_on='Docket Number', how='outer')

# freqwords['geometry'] = freqwords.apply(lambda row: Point(row.Longitude, row.Latitude), axis=1)

# #turn final data frame into geojson file
# freqwords = gpd.GeoDataFrame(freqwords)
# type(freqwords)
# freqwords.to_file('WHATEVER.json', driver='GeoJSON')

In [None]:

# must have lat/lon/docket number 

#read in the csv file we made
geo_points = pd.read_csv('/Users/kaitlincough/Documents/Lede/thirkield/final_project_supreme_court/supreme_lat_lon.csv')
#only import these columns
geo_points = geo_points[['Docket Number', 'Latitude', 'Longitude']]

#merge the data frame we made with the final data frame
final_data = final_data.merge(geo_points, left_on='Docket Number', right_on='Docket Number', how='outer')

#turn the lat/lon columns into a new column, geometry, that is geometric points
final_data['geometry'] = final_data.apply(lambda row: Point(row.Longitude, row.Latitude), axis=1)

#change the colors based on a definition
final_data.loc[final_data['word_count'] > 400, 'color'] = 'black'

#turn final data frame into geojson file (do this last, once the geometry column exists)
final_data = gpd.GeoDataFrame(final_data)
type(final_data)
final_data.to_file('WHATEVER.json', driver='GeoJSON')

#change the filename in the html file
#paste all of your json file into the points.json or whatever it is in the maps template 