Part One: The ultimate goal of this project is to build a database of Supreme Court cases for the 2016 term that includes the dialogue from the oral arguments of each case, and then to create a visualization project based on this data plus a secondary data source. As we saw in class, the arguments are scraped from this page: https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx
The first step is to make a list of dictionaries, one for each case scraped from the website.

## Scraping with BeautifulSoup to get list of dictionaries

In [1]:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx')
doc = BeautifulSoup(response.text, 'html.parser')

In [2]:
table = doc.find(class_="table datatables")
cases = table.find_all('tr')
cases

[<tr><td align="left" scope="col"><b>Argument Session: April 17, 2017 - April 26, 2017</b></td><td scope="col" style="text-align:center;"><b>Date Argued</b></td></tr>,
 <tr>
 <td style="text-align:left">    <a href="argument_transcripts/2016/16-399_3f14.pdf" id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile" target="_blank">16-399. </a> <span id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_lblCName">Perry v. Merit Systems Protection Bd.</span></td>
 <td style="text-align:center">04/17/17</td>
 </tr>,
 <tr>
 <td style="text-align:left">    <a href="argument_transcripts/2016/16-605_2dp3.pdf" id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl02_hypFile" target="_blank">16-605. </a> <span id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl02_lblCName">Town of Chester v. Laroe Estates, Inc.</span></td>
 <td style="text-align:center">04/17/17</td>
 </tr>,
 <tr>
 <td style="text-align:left">    <a href="argument_transcripts/2016/16-373_4e46.pdf" 

In [147]:
#Build a list of dictionaries with each case's name, date argued, docket number, and PDF file id.
#Loop over every row of the scraped table; session header rows have no link and are skipped.
all_2016 = []
for row in cases:
    current = {}
    cells = row.find_all('td')
    link = cells[0].find('a')    #<a> tag: its href is the PDF path, its text is the docket number
    name = row.find('span')      #<span> tag holding the case name
    date = cells[1].string

    if name:
        current['Case Name'] = name.string

    if link:
        #'argument_transcripts/2016/16-399_3f14.pdf' -> '16-399_3f14'
        this_file = link['href'].split('/')[-1]
        current['case_id'] = this_file.split('.')[0]

        current['Date Argued'] = date
        current['Docket Number'] = link.string.strip()
        all_2016.append(current)
all_2016

[{'Case Name': 'Perry v. Merit Systems Protection Bd.',
  'Date Argued': '04/17/17',
  'Docket Number': '16-399.',
  'case_id': '16-399_3f14'},
 {'Case Name': 'Town of Chester v. Laroe Estates, Inc.',
  'Date Argued': '04/17/17',
  'Docket Number': '16-605.',
  'case_id': '16-605_2dp3'},
 {'Case Name': "California Public Employees' Retirement System v. ANZ Securities, Inc.",
  'Date Argued': '04/17/17',
  'Docket Number': '16-373.',
  'case_id': '16-373_4e46'},
 {'Case Name': 'Kokesh v. SEC',
  'Date Argued': '04/18/17',
  'Docket Number': '16-529.',
  'case_id': '16-529_21p3'},
 {'Case Name': 'Henson v. Santander Consumer USA Inc.',
  'Date Argued': '04/18/17',
  'Docket Number': '16-349.',
  'case_id': '16-349_e29g'},
 {'Case Name': 'Trinity Lutheran Church of Columbia, Inc. v. Comer',
  'Date Argued': '04/19/17',
  'Docket Number': '15-577.',
  'case_id': '15-577_l64n'},
 {'Case Name': 'Weaver v. Massachusetts',
  'Date Argued': '04/19/17',
  'Docket Number': '16-240.',
  'case_id':
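
The records above still carry the docket number with its trailing period and the date as a plain string. A minimal sketch of tidying one record follows; `clean_case` is a hypothetical helper name, not part of the original notebook, and it assumes every date on the page uses the month/day/two-digit-year form seen in the output.

```python
from datetime import datetime

def clean_case(record):
    """Return a copy of one scraped record with a parsed date and a tidy docket number."""
    cleaned = dict(record)
    # '16-399.' -> '16-399'
    cleaned['Docket Number'] = record['Docket Number'].rstrip('.')
    # Assumption: dates always look like 04/17/17 (month/day/two-digit year).
    cleaned['Date Argued'] = datetime.strptime(record['Date Argued'], '%m/%d/%y').date()
    return cleaned
```

It could be applied to the whole list with `[clean_case(c) for c in all_2016]`, which leaves the original dictionaries untouched.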

In [158]:
#Now that we have a list of dictionaries, create a DataFrame with pandas
import pandas as pd
df = pd.DataFrame(all_2016)
df.shape

(64, 4)

In [159]:
df.columns

Index(['Case Name', 'Date Argued', 'Docket Number', 'case_id'], dtype='object')

In [160]:
df.head()

Unnamed: 0,Case Name,Date Argued,Docket Number,case_id
0,Perry v. Merit Systems Protection Bd.,04/17/17,16-399.,16-399_3f14
1,"Town of Chester v. Laroe Estates, Inc.",04/17/17,16-605.,16-605_2dp3
2,California Public Employees' Retirement System...,04/17/17,16-373.,16-373_4e46
3,Kokesh v. SEC,04/18/17,16-529.,16-529_21p3
4,Henson v. Santander Consumer USA Inc.,04/18/17,16-349.,16-349_e29g


In [162]:
array = df['case_id'].unique()
array.sort()
array

array(['14-1055_h3dj', '14-1538_j4ek', '14-9496_feah', '15-1031_6647',
       '15-1039_bqm1', '15-1111_ca7d', '15-1189_6468', '15-118_3e04',
       '15-1191_igdj', '15-1194_0861', '15-1204_k536', '15-1248_2dq3',
       '15-1251_q86b', '15-1256_d1o2', '15-1262_l537', '15-1293_o7jp',
       '15-1358_7648', '15-1391_5315', '15-1406_d1of', '15-1498_m647',
       '15-1500_5g68', '15-1503_3f14', '15-214_l6hn', '15-423_pnk0',
       '15-457_gfbh', '15-497_4g15', '15-513_k5fm', '15-537_ljgm',
       '15-577_l64n', '15-5991_21p3', '15-606_5iel', '15-628_p86a',
       '15-649_l5gm', '15-680_n648', '15-7250_3eah', '15-777_1b82',
       '15-797_f2q3', '15-8049_4f15', '15-827_gfbh', '15-8544_c1o2',
       '15-866_j426', '15-9260_bq7c', '15-927_6j37', '16-142_4gc5',
       '16-149_bodg', '16-240_nkp1', '16-254_7lio', '16-309_b97c',
       '16-327_d18e', '16-32_mlho', '16-341_8njq', '16-348_2cp3',
       '16-349_e29g', '16-369_8nka', '16-373_4e46', '16-399_3f14',
       '16-405_9olb', '16-466_4g15', 

### STEP 2 
Bringing in an additional source of Supreme Court data that can be visualized.
Below, pandas is used to go through a CSV I found online covering every case since 1946: http://scdb.wustl.edu/data.php
My final project goal: organize the cases by Supreme Court era and determine the location of origin for each case.

In [56]:
import pandas as pd
df = pd.read_csv('history_supreme_court.csv')
pd.set_option('display.max_columns', 80 )
df.head()

Unnamed: 0,3 judge dc?,admin action.id,admin action.state,agency,area,authority 1,authority 2,case,case issues,case.disposition,chief,date argued.day,date argued.full,date argued.month,date argued.year,date reargued.day,date reargued.full,date reargued.month,date reargued.year,date.day,date.full,date.month,date.year,decision.direction,decision.type,disagreement?,disposition,dissent agrees,docket,end.day,end.full,end.month,end.year,id.docket,issue.id,jurisdiction,laws.id,laws.type,led,lexis,lower court.direction,majority,majority assigner.id,majority assigner.long name,majority assigner.name,majority writer.id,majority writer.long name,majority writer.name,minority,name,natural court.id,origin.id,origin.name,origin.state,period,petitioner.entity,petitioner.id,petitioner.state,precedent altered?,reasons,respondent.entity,respondent.id,respondent.state,sct,source.id,source.name,source.state,split on second,start.day,start.full,start.month,start.year,term,text,unclear,unconstitutional,unusual,us,vote,winning party
0,False,-1,,unknown,Economic Activity,statutory construction,,1946-001,1946-001-01-01,reversed,Vinson,9,1/9/1946,1,1946,23,10/23/1946,10,1946,18,11/18/1946,11,1946,liberal,court opinion,False,affirmed,False,24,23,August/23/1949,8,1949,1946-001-01,80180,rehearing or reargument,6,Infrequently litigated statutes,91 L. Ed. 3,1946 U.S. LEXIS 1724,conservative,8,78,"Black, Hugo ( 08/19/1937 - 09/17/1971 )",HLBlack,78,"Black, Hugo ( 08/19/1937 - 09/17/1971 )",HLBlack,1,HALLIBURTON OIL WELL CEMENTING CO. v. WALKER e...,1301,51,California Southern U.S. District Court,,1,"oil company, or natural gas producer",198,,True,to resolve question presented,"inventor, patent assigner, trademark owner or ...",172,,67 S. Ct. 6,29,"U.S. Court of Appeals, Ninth Circuit",,False,24,June/24/1946,6,1946,1946,patents and copyrights: patent,False,no unconstitutionality,False,329 U.S. 1,1946-001-01-01-01,favorable disposition for petitioning party
1,False,-1,,unknown,Criminal Procedure,statutory construction,,1946-002,1946-002-01-01,affirmed (includes modified),Vinson,10,10/10/1945,10,1945,17,10/17/1946,10,1946,18,11/18/1946,11,1946,conservative,court opinion,False,affirmed,False,12,23,August/23/1949,8,1949,1946-002-01,10500,cert,6,Infrequently litigated statutes,91 L. Ed. 12,1946 U.S. LEXIS 1725,conservative,6,87,"Vinson, Fred ( 06/24/1946 - 09/08/1953 )",FMVinson,81,"Douglas, William ( 04/17/1939 - 11/12/1975 )",WODouglas,3,CLEVELAND v. UNITED STATES,1301,123,Utah U.S. District Court,,1,"person accused, indicted, or suspected of crime",100,,False,putative conflict,United States,27,,67 S. Ct. 13,30,"U.S. Court of Appeals, Tenth Circuit",,False,24,June/24/1946,6,1946,1946,statutory construction of criminal laws: Mann ...,False,no unconstitutionality,False,329 U.S. 14,1946-002-01-01-01,no favorable disposition for petitioning party
2,True,66,,Interstate Commerce Commission,Economic Activity,judicial review (national level),,1946-003,1946-003-01-01,affirmed (includes modified),Vinson,8,11/8/1945,11,1945,18,10/18/1946,10,1946,18,11/18/1946,11,1946,liberal,court opinion,False,unknown,False,21,23,August/23/1949,8,1949,1946-003-01,80250,appeal,2,Constitutional Amendment,91 L. Ed. 22,1946 U.S. LEXIS 3037,liberal,5,78,"Black, Hugo ( 08/19/1937 - 09/17/1971 )",HLBlack,84,"Jackson, Robert ( 07/11/1941 - 10/09/1954 )",RHJackson,4,CHAMPLIN REFINING CO. v. UNITED STATES ET AL.,1301,107,Oklahoma Western U.S. District Court,,1,pipe line company,209,,False,case did not arise on cert or cert not granted,United States,27,,67 S. Ct. 1,107,Oklahoma Western U.S. District Court,,False,24,June/24/1946,6,1946,1946,federal and some few state regulation of trans...,False,no unconstitutionality,False,329 U.S. 29,1946-003-01-01-01,no favorable disposition for petitioning party
3,False,67,,Indian Claims Commission,Civil Rights,statutory construction,,1946-004,1946-004-01-01,affirmed (includes modified),Vinson,31,1/31/1946,1,1946,25,10/25/1946,10,1946,25,11/25/1946,11,1946,liberal,court judgement,False,unknown,False,26,23,August/23/1949,8,1949,1946-004-01,20150,cert,6,Infrequently litigated statutes,91 L. Ed. 29,1946 U.S. LEXIS 1696,liberal,5,87,"Vinson, Fred ( 06/24/1946 - 09/08/1953 )",FMVinson,87,"Vinson, Fred ( 06/24/1946 - 09/08/1953 )",FMVinson,3,UNITED STATES v. ALCEA BAND OF TILLAMOOKS ET AL.,1301,3,"U.S. Court of Claims, Court of Federal Claims",,1,United States,27,,False,to resolve important or significant question,"Indian, including Indian tribe or nation",170,,67 S. Ct. 167,3,"U.S. Court of Claims, Court of Federal Claims",,False,24,June/24/1946,6,1946,1946,Indians (other than pertains to state jurisdic...,False,no unconstitutionality,False,329 U.S. 40,1946-004-01-01-01,no favorable disposition for petitioning party
4,False,-1,,unknown,Economic Activity,federal common law,,1946-005,1946-005-01-01,reversed,Vinson,25,10/25/1946,10,1946,25,10/25/1946,10,1946,25,11/25/1946,11,1946,liberal,court opinion,False,unknown,False,50,23,August/23/1949,8,1949,1946-005-01,80060,cert,-1,unknown,91 L. Ed. 44,1946 U.S. LEXIS 2997,liberal,6,87,"Vinson, Fred ( 06/24/1946 - 09/08/1953 )",FMVinson,78,"Black, Hugo ( 08/19/1937 - 09/17/1971 )",HLBlack,3,"UNITED STATES v. HOWARD P. FOLEY CO., INC.",1301,3,"U.S. Court of Claims, Court of Federal Claims",,1,United States,27,,False,federal court conflict,government contractor,176,,67 S. Ct. 154,3,"U.S. Court of Claims, Court of Federal Claims",,False,24,June/24/1946,6,1946,1946,"liability, governmental: tort or contract acti...",False,no unconstitutionality,False,329 U.S. 64,1946-005-01-01-01,favorable disposition for petitioning party


In [57]:
#clean it up by keeping only the columns I care about
df = pd.read_csv('history_supreme_court.csv', usecols=['name', 'origin.name', 'area', 'case', 'date.year', 'case.disposition', 'petitioner.state', 'docket', 'source.name'])
df.head()

Unnamed: 0,area,case,case.disposition,date.year,docket,name,origin.name,petitioner.state,source.name
0,Economic Activity,1946-001,reversed,1946,24,HALLIBURTON OIL WELL CEMENTING CO. v. WALKER e...,California Southern U.S. District Court,,"U.S. Court of Appeals, Ninth Circuit"
1,Criminal Procedure,1946-002,affirmed (includes modified),1946,12,CLEVELAND v. UNITED STATES,Utah U.S. District Court,,"U.S. Court of Appeals, Tenth Circuit"
2,Economic Activity,1946-003,affirmed (includes modified),1946,21,CHAMPLIN REFINING CO. v. UNITED STATES ET AL.,Oklahoma Western U.S. District Court,,Oklahoma Western U.S. District Court
3,Civil Rights,1946-004,affirmed (includes modified),1946,26,UNITED STATES v. ALCEA BAND OF TILLAMOOKS ET AL.,"U.S. Court of Claims, Court of Federal Claims",,"U.S. Court of Claims, Court of Federal Claims"
4,Economic Activity,1946-005,reversed,1946,50,"UNITED STATES v. HOWARD P. FOLEY CO., INC.","U.S. Court of Claims, Court of Federal Claims",,"U.S. Court of Claims, Court of Federal Claims"


In [58]:
df.set_index('case', inplace = True)
df.head()

case,area,case.disposition,date.year,docket,name,origin.name,petitioner.state,source.name
1946-001,Economic Activity,reversed,1946,24,HALLIBURTON OIL WELL CEMENTING CO. v. WALKER e...,California Southern U.S. District Court,,"U.S. Court of Appeals, Ninth Circuit"
1946-002,Criminal Procedure,affirmed (includes modified),1946,12,CLEVELAND v. UNITED STATES,Utah U.S. District Court,,"U.S. Court of Appeals, Tenth Circuit"
1946-003,Economic Activity,affirmed (includes modified),1946,21,CHAMPLIN REFINING CO. v. UNITED STATES ET AL.,Oklahoma Western U.S. District Court,,Oklahoma Western U.S. District Court
1946-004,Civil Rights,affirmed (includes modified),1946,26,UNITED STATES v. ALCEA BAND OF TILLAMOOKS ET AL.,"U.S. Court of Claims, Court of Federal Claims",,"U.S. Court of Claims, Court of Federal Claims"
1946-005,Economic Activity,reversed,1946,50,"UNITED STATES v. HOWARD P. FOLEY CO., INC.","U.S. Court of Claims, Court of Federal Claims",,"U.S. Court of Claims, Court of Federal Claims"


In [59]:
#find the different categories that cases fall into 
df['area'].value_counts()

Criminal Procedure      1949
Economic Activity       1676
Civil Rights            1402
Judicial Power          1186
First Amendment          667
Federalism               395
Unions                   353
Due Process              337
Federal Taxation         310
Privacy                  112
Attorneys                 98
Interstate Relations      95
unknown                   28
Miscellaneous             21
Private Action             2
Name: area, dtype: int64

In [60]:
df['petitioner.state'].value_counts().head(10)

California       208
New York         123
Illinois          88
Texas             87
United States     84
Ohio              75
Pennsylvania      64
Arizona           58
Michigan          56
Missouri          51
Name: petitioner.state, dtype: int64

In [61]:
df['case.disposition'].value_counts()

affirmed (includes modified)                               2563
reversed and remanded                                      2302
reversed                                                   1958
vacated and remanded                                       1015
petition denied or appeal dismissed                         341
affirmed and reversed (or vacated) in part and remanded     160
unknown                                                     124
affirmed and reversed (or vacated) in part                   76
stay, petition, or motion granted                            44
vacated                                                      34
certification to or from a lower court                       14
Name: case.disposition, dtype: int64

In [62]:
df['source.name'].value_counts()

State Supreme Court                                                                                                                                                                                 1634
U.S. Court of Appeals, Ninth Circuit                                                                                                                                                                1013
U.S. Court of Appeals, Fifth Circuit                                                                                                                                                                 654
U.S. Court of Appeals, Second Circuit                                                                                                                                                                608
U.S. Court of Appeals, District of Columbia Circuit (includes the Court of Appeals for the District of Columbia but not the District of Columbia Court of Appeals, which has local jurisdiction)    

In [63]:
df.groupby('petitioner.state')['area'].value_counts()

petitioner.state  area                
Alabama           Civil Rights            18
                  Criminal Procedure       7
                  Judicial Power           6
                  Federalism               3
                  Economic Activity        2
                  First Amendment          2
                  Due Process              1
Alaska            Economic Activity        6
                  Criminal Procedure       2
                  Federalism               2
                  First Amendment          2
                  Judicial Power           2
                  Civil Rights             1
                  Unions                   1
Arizona           Criminal Procedure      19
                  Civil Rights            13
                  Judicial Power          10
                  Interstate Relations     6
                  Federalism               5
                  Due Process              2
                  Economic Activity        2
                

In [64]:
df_ga = df[df['petitioner.state'] == 'Georgia']
df_ga['area'].value_counts()

Civil Rights            16
Criminal Procedure       8
Federalism               4
First Amendment          2
Judicial Power           2
Privacy                  1
Interstate Relations     1
Due Process              1
Economic Activity        1
Attorneys                1
Name: area, dtype: int64

In [65]:
df_ca = df[df['petitioner.state'] == 'California']
df_ca['area'].value_counts()

Criminal Procedure      67
Civil Rights            45
Judicial Power          32
Economic Activity       18
First Amendment         15
Federalism              12
Due Process             10
Interstate Relations     5
Unions                   2
Attorneys                1
unknown                  1
Name: area, dtype: int64

In [66]:
df_ny = df[df['petitioner.state'] == 'New York']
df_ny['area'].value_counts()

Criminal Procedure      30
Civil Rights            30
First Amendment         21
Judicial Power          15
Due Process              8
Economic Activity        6
Federalism               6
Privacy                  3
Attorneys                2
Federal Taxation         1
Interstate Relations     1
Name: area, dtype: int64
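
The per-state cells above repeat the same filter with a different state each time. As a rough sketch, the same counts can be collected into one table with `pd.crosstab`; `area_counts_by_state` is a hypothetical helper name, and it assumes the frame has the `petitioner.state` and `area` columns used in this notebook.

```python
import pandas as pd

def area_counts_by_state(frame, states):
    """Count cases per issue area for each of the given petitioner states."""
    # Keep only the rows whose petitioner state is one we asked for.
    subset = frame[frame['petitioner.state'].isin(states)]
    # Rows are issue areas, columns are states, cells are case counts.
    return pd.crosstab(subset['area'], subset['petitioner.state'])
```

Calling `area_counts_by_state(df, ['Georgia', 'California', 'New York'])` should give one column per state, with 0 where a state has no cases in an area, which may be easier to chart than three separate Series.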

In [67]:
df[df['petitioner.state'] == 'Georgia']

case,area,case.disposition,date.year,docket,name,origin.name,petitioner.state,source.name
1946-131,Economic Activity,affirmed (includes modified),1947,385,"ATLANTIC COAST LINE RAILROAD CO. v. PHILLIPS, ...",State Trial Court,Georgia,State Supreme Court
1949-037,Federalism,affirmed (includes modified),1950,83,REGENTS OF THE UNIVERSITY SYSTEM OF GEORGIA v....,State Trial Court,Georgia,State Appellate Court
1951-032,Judicial Power,reversed and remanded,1952,1,"GEORGIA RAILROAD & BANKING CO. v. REDWINE, STA...",Georgia Northern U.S. District Court,Georgia,Georgia Northern U.S. District Court
1961-015,Federalism,affirmed (includes modified),1961,42,"CAMPBELL, COMMISSIONER OF AGRICULTURE OF GEORG...",Georgia Southern U.S. District Court,Georgia,Georgia Southern U.S. District Court
1964-028,Civil Rights,reversed,1965,178,"FORTSON, SECRETARY OF STATE OF GEORGIA v. DORS...",Georgia Northern U.S. District Court,Georgia,Georgia Northern U.S. District Court
1964-037,Civil Rights,vacated and remanded,1965,300,"FORTSON, SECRETARY OF STATE OF GEORGIA, et al....",Georgia Northern U.S. District Court,Georgia,Georgia Northern U.S. District Court
1965-133,Judicial Power,affirmed (includes modified),1966,147,GEORGIA v. RACHEL et al.,Georgia Northern U.S. District Court,Georgia,"U.S. Court of Appeals, Fifth Circuit"
1966-022,Civil Rights,reversed,1966,800,"FORTSON, SECRETARY OF STATE OF GEORGIA v. MORR...",Georgia Northern U.S. District Court,Georgia,Georgia Northern U.S. District Court
1970-013,Criminal Procedure,reversed and remanded,1970,10,"DUTTON, WARDEN v. EVANS",Georgia Northern U.S. District Court,Georgia,"U.S. Court of Appeals, Fifth Circuit"
1970-084,Civil Rights,reversed,1971,420,"MCDANIEL, SUPERINTENDENT OF SCHOOLS, et al. v....",State Trial Court,Georgia,State Supreme Court


In [90]:
(df['date.year'] == '2015').value_counts()

False    8570
True       61
Name: date.year, dtype: int64

In [93]:
df.groupby('date.year')['petitioner.state'].value_counts(ascending = True)

date.year  petitioner.state    
1946       Alaska                  1
           Illinois                1
1947       Georgia                 1
           New Jersey              1
           Ohio                    1
           Oklahoma                1
           Rhode Island            1
           United States           1
           Wisconsin               1
           New York                2
1948       Massachusetts           1
           Oklahoma                1
           United States           4
1949       Hawaii                  1
           Illinois                1
           Louisiana               1
           New York                1
           Ohio                    1
           Oklahoma                1
           South Carolina          1
           United States           1
           West Virginia           1
           Wisconsin               1
           California              2
           Michigan                2
1950       District of Columbia    1
      

In [86]:
df.dtypes

area                object
case.disposition    object
date.year           object
docket               int64
name                object
origin.name         object
petitioner.state    object
source.name         object
Court Era             bool
dtype: object

In [85]:
#Organizing by Era
#Warren Court - 1953-1969
#Burger Court - 1969-1986
#Rehnquist Court - 1986-2005
#Roberts Court - 2005-present
#(organize by date.year and make a separate column saying which era each case falls under)
#http://strftime.org/
#df[df.country == 'Angola']['continent'] = 'Africa'

#Earlier attempts (commented out):
#df['Court Era'] = False
#df.loc[df['date.year'].astype(int).between(1953, 1969), 'Court Era'] = True
#df.head()

#warren_court = [1953]
#warren = df[df['date.year'].isin(warren_court)]

#Convert the year to a string, then count the cases decided in each year
df['date.year'] = df['date.year'].astype(str)
df['date.year'].value_counts()

#df['Court Era'] = False
#df.loc[df['date.year'].str.contains('1953'), 'Court Era'] = "Warren Court"


1976    189
1973    187
1985    175
1982    174
1972    173
1977    171
1964    171
1968    171
1984    171
1963    170
1967    166
1974    163
1971    163
1983    162
1980    161
1987    160
1986    160
1978    157
1988    155
1975    155
1979    153
1981    152
1958    150
1989    150
1965    145
1960    143
1990    143
1961    137
1966    137
1957    137
       ... 
1953    107
1952    107
1956    103
1951     98
1997     98
1994     96
1998     96
1955     95
1999     94
1995     93
1950     91
1996     91
1954     88
2010     87
2011     87
2001     87
2009     86
2002     86
2013     84
2000     84
2004     83
2005     81
2003     81
2006     80
2012     76
2007     76
2014     75
2008     72
2015     61
1946     25
Name: date.year, Length: 70, dtype: int64
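
One possible way to fill a `Court Era` column from the year ranges listed in the comments above is a small lookup function. This is only a sketch under stated assumptions: `court_era` is a hypothetical name, a boundary year such as 1969 is assigned to the incoming court (calendar years cannot split a mid-year change), and years before 1953 are labeled "Vinson Court", which is not in the list above but matches the `chief` value shown for the 1946 rows.

```python
from bisect import bisect_right

# First calendar year of each era, taken from the comment block in the cell above.
ERA_STARTS = [1953, 1969, 1986, 2005]
# Assumption: anything before 1953 is labeled "Vinson Court".
ERA_NAMES = ["Vinson Court", "Warren Court", "Burger Court",
             "Rehnquist Court", "Roberts Court"]

def court_era(year):
    """Return the era label for a year given as an int or a numeric string."""
    # bisect_right counts how many era start years are <= the given year.
    return ERA_NAMES[bisect_right(ERA_STARTS, int(year))]
```

Since `date.year` is stored as a string here, `df['Court Era'] = df['date.year'].map(court_era)` should work without another conversion. The full CSV also has a `chief` column (visible in the first `df.head()` output), so keeping that column in `usecols` may be a simpler route to the same grouping.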

In [69]:
df['source.name'].value_counts()

State Supreme Court                                                                                                                                                                                 1634
U.S. Court of Appeals, Ninth Circuit                                                                                                                                                                1013
U.S. Court of Appeals, Fifth Circuit                                                                                                                                                                 654
U.S. Court of Appeals, Second Circuit                                                                                                                                                                608
U.S. Court of Appeals, District of Columbia Circuit (includes the Court of Appeals for the District of Columbia but not the District of Columbia Court of Appeals, which has local jurisdiction)    

### STEP 3
Use regular expressions to clean up and parse the text files/PDFs so that you have a searchable data structure containing the dialogue from the transcripts. Make sure the PDFs are downloaded locally on your computer, and read through the text in Sublime first to check how it looks.
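
The download step itself is not shown in this notebook. As a hedged sketch, the relative links from the scraped table can be turned into full URLs and matching local file names like this; `transcript_url` and `local_paths` are hypothetical helper names, and the `pdfs` folder name is taken from the path opened in the cells below.

```python
from urllib.parse import urljoin

# The page the table was scraped from; relative hrefs resolve against it.
BASE_URL = 'https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx'

def transcript_url(href):
    """Turn a relative href from the table into an absolute PDF URL."""
    return urljoin(BASE_URL, href)

def local_paths(case_id, folder='pdfs'):
    """Return the (pdf, txt) file names to use for one case_id."""
    return f'{folder}/{case_id}.pdf', f'{folder}/{case_id}.txt'
```

Each URL could then be fetched with `requests.get(url).content` and written to the PDF path. Converting the PDF to text is a separate step that this sketch leaves out.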


In [26]:
#Import the regular expression library
import re
!pwd

/Users/MaijaLiisaEhlinger/Desktop/supreme_court


In [113]:
#Open a text file from your computer (the with block closes the file when done)
with open('pdfs/14-1055_h3dj.txt', 'r') as f:
    sample_transcript = f.read()

In [114]:
#Take a look at the text file
sample_transcript

'Official - Subject to Final Review\n1 1 IN THE SUPREME COURT OF THE UNITED STATES\n\n2 -----------------x\n\n3 CRYSTAL MONIQUE\n\n:\n\n4 LIGHTFOOT, ET AL.,\n\n:\n\n5\n\nPetitioners\n\n: No. 14-1055\n\n6 v.\n\n:\n\n7 CENDANT MORTGAGE\n\n:\n\n8 CORPORATION, DBA PHH\n\n:\n\n9 MORTGAGE, ET AL.,\n\n:\n\n10\n\nRespondents.\n\n:\n\n11 - - - - - - - - - - - - - - - - - x\n\n12 Washington, D.C.\n\n13 Tuesday, November 8, 2016\n\n14\n\n15 The above-entitled matter came on for oral\n\n16 argument before the Supreme Court of the United States\n\n17 at 11:04 a.m.\n\n18 APPEARANCES:\n\n19 E. JOSHUA ROSENKRANZ, ESQ., New York, N.Y.; on behalf of\n\n20 the Petitioners.\n\n21 ANN O\'CONNELL, ESQ., Assistant to the Solicitor General,\n\n22 Department of Justice, Washington, D.C.; for United\n\n23 States, as amicus curiae, supporting the Petitioners.\n\n24 BRIAN P. BROOKS, ESQ., Washington, D.C.; on behalf of\n\n25 the Respondents.\n\nAlderson Reporting Company\n\n\x0cOfficial - Subject to Final Review\

In [29]:
speaker_list = sample_transcript.splitlines()
speaker_list


['Official - Subject to Final Review',
 '1 1 IN THE SUPREME COURT OF THE UNITED STATES',
 '',
 '2 -----------------x',
 '',
 '3 CRYSTAL MONIQUE',
 '',
 ':',
 '',
 '4 LIGHTFOOT, ET AL.,',
 '',
 ':',
 '',
 '5',
 '',
 'Petitioners',
 '',
 ': No. 14-1055',
 '',
 '6 v.',
 '',
 ':',
 '',
 '7 CENDANT MORTGAGE',
 '',
 ':',
 '',
 '8 CORPORATION, DBA PHH',
 '',
 ':',
 '',
 '9 MORTGAGE, ET AL.,',
 '',
 ':',
 '',
 '10',
 '',
 'Respondents.',
 '',
 ':',
 '',
 '11 - - - - - - - - - - - - - - - - - x',
 '',
 '12 Washington, D.C.',
 '',
 '13 Tuesday, November 8, 2016',
 '',
 '14',
 '',
 '15 The above-entitled matter came on for oral',
 '',
 '16 argument before the Supreme Court of the United States',
 '',
 '17 at 11:04 a.m.',
 '',
 '18 APPEARANCES:',
 '',
 '19 E. JOSHUA ROSENKRANZ, ESQ., New York, N.Y.; on behalf of',
 '',
 '20 the Petitioners.',
 '',
 "21 ANN O'CONNELL, ESQ., Assistant to the Solicitor General,",
 '',
 '22 Department of Justice, Washington, D.C.; for United',
 '',
 '23 States, as ami

In [30]:
full = re.split(r"JUSTICE",sample_transcript)
full


["Official - Subject to Final Review\n1 1 IN THE SUPREME COURT OF THE UNITED STATES\n\n2 -----------------x\n\n3 CRYSTAL MONIQUE\n\n:\n\n4 LIGHTFOOT, ET AL.,\n\n:\n\n5\n\nPetitioners\n\n: No. 14-1055\n\n6 v.\n\n:\n\n7 CENDANT MORTGAGE\n\n:\n\n8 CORPORATION, DBA PHH\n\n:\n\n9 MORTGAGE, ET AL.,\n\n:\n\n10\n\nRespondents.\n\n:\n\n11 - - - - - - - - - - - - - - - - - x\n\n12 Washington, D.C.\n\n13 Tuesday, November 8, 2016\n\n14\n\n15 The above-entitled matter came on for oral\n\n16 argument before the Supreme Court of the United States\n\n17 at 11:04 a.m.\n\n18 APPEARANCES:\n\n19 E. JOSHUA ROSENKRANZ, ESQ., New York, N.Y.; on behalf of\n\n20 the Petitioners.\n\n21 ANN O'CONNELL, ESQ., Assistant to the Solicitor General,\n\n22 Department of Justice, Washington, D.C.; for United\n\n23 States, as amicus curiae, supporting the Petitioners.\n\n24 BRIAN P. BROOKS, ESQ., Washington, D.C.; on behalf of\n\n25 the Respondents.\n\nAlderson Reporting Company\n\n\x0cOfficial - Subject to Final Review\

In [31]:
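#Try splitting wherever a line starts with a digit.
#Without flags=re.MULTILINE, ^ only matches at the very start of the text, so nothing is split here.
regex1 = r"^(\d)"
re.split(regex1, sample_transcript)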

['Official - Subject to Final Review\n1 1 IN THE SUPREME COURT OF THE UNITED STATES\n\n2 -----------------x\n\n3 CRYSTAL MONIQUE\n\n:\n\n4 LIGHTFOOT, ET AL.,\n\n:\n\n5\n\nPetitioners\n\n: No. 14-1055\n\n6 v.\n\n:\n\n7 CENDANT MORTGAGE\n\n:\n\n8 CORPORATION, DBA PHH\n\n:\n\n9 MORTGAGE, ET AL.,\n\n:\n\n10\n\nRespondents.\n\n:\n\n11 - - - - - - - - - - - - - - - - - x\n\n12 Washington, D.C.\n\n13 Tuesday, November 8, 2016\n\n14\n\n15 The above-entitled matter came on for oral\n\n16 argument before the Supreme Court of the United States\n\n17 at 11:04 a.m.\n\n18 APPEARANCES:\n\n19 E. JOSHUA ROSENKRANZ, ESQ., New York, N.Y.; on behalf of\n\n20 the Petitioners.\n\n21 ANN O\'CONNELL, ESQ., Assistant to the Solicitor General,\n\n22 Department of Justice, Washington, D.C.; for United\n\n23 States, as amicus curiae, supporting the Petitioners.\n\n24 BRIAN P. BROOKS, ESQ., Washington, D.C.; on behalf of\n\n25 the Respondents.\n\nAlderson Reporting Company\n\n\x0cOfficial - Subject to Final Review

In [32]:
#Remove pieces of text that we know repeat and that we don't want to show up
remove_alderson = re.sub(r"Alderson Reporting Company", "", sample_transcript, flags = re.IGNORECASE)
remove_alderson

remove_official = re.sub(r'Official - Subject to Final Review', "", remove_alderson, flags = re.IGNORECASE)
remove_official


'\n1 1 IN THE SUPREME COURT OF THE UNITED STATES\n\n2 -----------------x\n\n3 CRYSTAL MONIQUE\n\n:\n\n4 LIGHTFOOT, ET AL.,\n\n:\n\n5\n\nPetitioners\n\n: No. 14-1055\n\n6 v.\n\n:\n\n7 CENDANT MORTGAGE\n\n:\n\n8 CORPORATION, DBA PHH\n\n:\n\n9 MORTGAGE, ET AL.,\n\n:\n\n10\n\nRespondents.\n\n:\n\n11 - - - - - - - - - - - - - - - - - x\n\n12 Washington, D.C.\n\n13 Tuesday, November 8, 2016\n\n14\n\n15 The above-entitled matter came on for oral\n\n16 argument before the Supreme Court of the United States\n\n17 at 11:04 a.m.\n\n18 APPEARANCES:\n\n19 E. JOSHUA ROSENKRANZ, ESQ., New York, N.Y.; on behalf of\n\n20 the Petitioners.\n\n21 ANN O\'CONNELL, ESQ., Assistant to the Solicitor General,\n\n22 Department of Justice, Washington, D.C.; for United\n\n23 States, as amicus curiae, supporting the Petitioners.\n\n24 BRIAN P. BROOKS, ESQ., Washington, D.C.; on behalf of\n\n25 the Respondents.\n\n\n\n\x0c\n1 CONTENTS 2 ORAL ARGUMENT OF 3 E. JOSHUA ROSENKRANZ, ESQ. 4 On behalf of the Petitioners\n\n

In [33]:
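#2. Line numbers 1 - 25
#Note: \b matches any standalone 1-25, so this also strips numbers that belong to the text itself
#(for example the 14 in "No. 14-1055" and the 8 in "November 8, 2016").
remove_numbers = re.sub(r"\b([1-9]|1[0-9]|2[0-5])\b", " ", remove_official)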
remove_numbers


'\n    IN THE SUPREME COURT OF THE UNITED STATES\n\n  -----------------x\n\n  CRYSTAL MONIQUE\n\n:\n\n  LIGHTFOOT, ET AL.,\n\n:\n\n \n\nPetitioners\n\n: No.  -1055\n\n  v.\n\n:\n\n  CENDANT MORTGAGE\n\n:\n\n  CORPORATION, DBA PHH\n\n:\n\n  MORTGAGE, ET AL.,\n\n:\n\n \n\nRespondents.\n\n:\n\n  - - - - - - - - - - - - - - - - - x\n\n  Washington, D.C.\n\n  Tuesday, November  , 2016\n\n \n\n  The above-entitled matter came on for oral\n\n  argument before the Supreme Court of the United States\n\n  at  :04 a.m.\n\n  APPEARANCES:\n\n  E. JOSHUA ROSENKRANZ, ESQ., New York, N.Y.; on behalf of\n\n  the Petitioners.\n\n  ANN O\'CONNELL, ESQ., Assistant to the Solicitor General,\n\n  Department of Justice, Washington, D.C.; for United\n\n  States, as amicus curiae, supporting the Petitioners.\n\n  BRIAN P. BROOKS, ESQ., Washington, D.C.; on behalf of\n\n  the Respondents.\n\n\n\n\x0c\n  CONTENTS   ORAL ARGUMENT OF   E. JOSHUA ROSENKRANZ, ESQ.   On behalf of the Petitioners\n\n  \n\n ORAL ARGUME
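
The output above shows the docket number, the date, and the time losing digits along with the line numbers. One possible refinement, sketched here as an assumption rather than a tested fix for every transcript, is to remove a number only when it stands alone between whitespace, so that `14-1055`, `8,` and `11:04` are left intact. `strip_line_numbers` is a hypothetical name.

```python
import re

# A number from 1 to 25 with whitespace (or the text boundary) on both sides.
# (?<!\S) means "not preceded by a non-space"; (?!\S) means "not followed by one".
LINE_NUMBER = re.compile(r"(?<!\S)(?:[1-9]|1[0-9]|2[0-5])(?!\S)")

def strip_line_numbers(text):
    """Remove standalone line numbers but keep digits attached to punctuation."""
    return LINE_NUMBER.sub("", text)
```

This would still delete a genuine standalone number that a speaker says (for example a section number), so the result is worth spot-checking against the PDF.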

In [34]:
#Split on the word PROCEEDINGS, which marks where the oral argument itself begins
new_split = re.split(r"PROCEEDINGS", remove_numbers)
new_split

["\n    IN THE SUPREME COURT OF THE UNITED STATES\n\n  -----------------x\n\n  CRYSTAL MONIQUE\n\n:\n\n  LIGHTFOOT, ET AL.,\n\n:\n\n \n\nPetitioners\n\n: No.  -1055\n\n  v.\n\n:\n\n  CENDANT MORTGAGE\n\n:\n\n  CORPORATION, DBA PHH\n\n:\n\n  MORTGAGE, ET AL.,\n\n:\n\n \n\nRespondents.\n\n:\n\n  - - - - - - - - - - - - - - - - - x\n\n  Washington, D.C.\n\n  Tuesday, November  , 2016\n\n \n\n  The above-entitled matter came on for oral\n\n  argument before the Supreme Court of the United States\n\n  at  :04 a.m.\n\n  APPEARANCES:\n\n  E. JOSHUA ROSENKRANZ, ESQ., New York, N.Y.; on behalf of\n\n  the Petitioners.\n\n  ANN O'CONNELL, ESQ., Assistant to the Solicitor General,\n\n  Department of Justice, Washington, D.C.; for United\n\n  States, as amicus curiae, supporting the Petitioners.\n\n  BRIAN P. BROOKS, ESQ., Washington, D.C.; on behalf of\n\n  the Respondents.\n\n\n\n\x0c\n  CONTENTS   ORAL ARGUMENT OF   E. JOSHUA ROSENKRANZ, ESQ.   On behalf of the Petitioners\n\n  \n\n ORAL ARGUME

In [35]:
#remove the top messy part of the text we don't really care about. We've created a list above, so look at each element 
#It's the second element in the list, so [1]
new_clean = new_split[1]
new_clean

'   ( :04 a.m.)   CHIEF JUSTICE ROBERTS: We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz.   ORAL ARGUMENT OF E. JOSHUA ROSENKRANZ   ON BEHALF OF THE PETITIONERS   MR. ROSENKRANZ: Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And 

In [36]:
#take off bottom part of the dialogue 
edit_bottom = re.split(r"The case is submitted", new_clean)
edit_bottom

['   ( :04 a.m.)   CHIEF JUSTICE ROBERTS: We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz.   ORAL ARGUMENT OF E. JOSHUA ROSENKRANZ   ON BEHALF OF THE PETITIONERS   MR. ROSENKRANZ: Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And

In [37]:
isolated_middle = edit_bottom[0]
isolated_middle

'   ( :04 a.m.)   CHIEF JUSTICE ROBERTS: We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz.   ORAL ARGUMENT OF E. JOSHUA ROSENKRANZ   ON BEHALF OF THE PETITIONERS   MR. ROSENKRANZ: Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And 
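The trimming in the last few cells (keep what follows PROCEEDINGS, drop what follows the closing phrase) can also be sketched with `str.partition`, which never raises when a marker is absent. This is an illustration under assumptions: `extract_argument` is a hypothetical name, and unlike `re.split(...)[1]` it keeps everything after the first PROCEEDINGS even if the word appears again later.

```python
def extract_argument(text, start="PROCEEDINGS", end="The case is submitted"):
    """Return the text between the start and end markers, or None if start is missing."""
    _, marker, after = text.partition(start)
    if not marker:
        # the start marker never appeared, so there is nothing to extract
        return None
    # if the end marker is missing, partition returns all of `after` as the body
    body, _, _ = after.partition(end)
    return body

sample = ("APPEARANCES: counsel list  PROCEEDINGS  "
          "CHIEF JUSTICE ROBERTS: We will hear argument.  "
          "The case is submitted. (Whereupon the case was submitted.)")
middle = extract_argument(sample)
print(repr(middle))
```

Returning `None` for a missing marker lets a calling loop decide what to do with that transcript instead of stopping on an `IndexError`.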

In [38]:
#Remove the \ns that are lingering from the PDF-to-text conversion
edit_n = re.sub(r"\n", " ", isolated_middle)
edit_n

'   ( :04 a.m.)   CHIEF JUSTICE ROBERTS: We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz.   ORAL ARGUMENT OF E. JOSHUA ROSENKRANZ   ON BEHALF OF THE PETITIONERS   MR. ROSENKRANZ: Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And 

In [39]:
#Remove the \x0c (form feed) characters, which mark the page breaks in the converted text
edit_x = re.sub(r"\x0c", "", edit_n) 
edit_x

'   ( :04 a.m.)   CHIEF JUSTICE ROBERTS: We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz.   ORAL ARGUMENT OF E. JOSHUA ROSENKRANZ   ON BEHALF OF THE PETITIONERS   MR. ROSENKRANZ: Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And 
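The two steps above turn newlines into spaces and delete form feeds, which still leaves runs of several spaces in the text. As a small sketch (the name `collapse_whitespace` is hypothetical), `str.split()` with no arguments splits on any run of whitespace, including `\n` and `\x0c`, so joining the pieces back with single spaces handles all three at once.

```python
def collapse_whitespace(text):
    """Collapse every run of whitespace (spaces, newlines, form feeds) to one space."""
    return " ".join(text.split())

sample = "Thank you, Mr. Chief\n\n  Justice,\x0c\n and may it please"
print(collapse_whitespace(sample))
```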

In [40]:
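#Thought this would get rid of the \' in don't, isn't, etc...
#edit_slash = re.sub("\\'", "'", edit_x)
#edit_slash
#Turns out the backslashes are not in the text at all. The notebook displays the string's repr, which is
#wrapped in single quotes and so has to escape every '. print(edit_x) shows plain apostrophes.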

In [41]:
#We are going to split the text on UPPER CASE NAMES followed by a colon.
#So first remove the upper-case section headings, which would otherwise be absorbed into the next speaker label
replace1 = re.sub(r"ORAL ARGUMENT OF E. JOSHUA ROSENKRANZ   ON BEHALF OF THE PETITIONERS", "", edit_x)
replace1

'   ( :04 a.m.)   CHIEF JUSTICE ROBERTS: We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz.      MR. ROSENKRANZ: Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And the only way to find out whether a   court is a "court of competent 

In [42]:
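#REBUTTAL ARGUMENT OF E. JOSHUA ROSENKRANZ    ON BEHALF OF THE PETITIONERS
#Caveat: .* is greedy, so this removes everything from the first REBUTTAL through the last PETITIONERS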
replace2 = re.sub(r"REBUTTAL.*PETITIONERS", "", replace1)
replace2

'   ( :04 a.m.)   CHIEF JUSTICE ROBERTS: We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz.      MR. ROSENKRANZ: Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And the only way to find out whether a   court is a "court of competent 

In [43]:
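#Remove any remaining ORAL ARGUMENT ... PETITIONERS heading (same greedy caveat: first ORAL through last PETITIONERS)
replace3 = re.sub(r"ORAL.*PETITIONERS", "", replace2)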
replace3

'   ( :04 a.m.)   CHIEF JUSTICE ROBERTS: We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz.      MR. ROSENKRANZ: Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And the only way to find out whether a   court is a "court of competent 

In [44]:
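#Remove the respondents' heading. Note the pattern also swallows the MR. BROOKS label that follows it, leaving only the colon
replace4 = re.sub(r"ORAL ARGUMENT OF BRIAN P. BROOKS   ON BEHALF OF THE RESPONDENTS   MR. BROOKS", "", replace3)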
replace4

'   ( :04 a.m.)   CHIEF JUSTICE ROBERTS: We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz.      MR. ROSENKRANZ: Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And the only way to find out whether a   court is a "court of competent 
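The four replacements above are tied to one case's counsel names and rely on greedy `.*`. A more general version might look like the sketch below. It assumes, based only on the headings seen in this transcript, that every heading starts with ORAL or REBUTTAL ARGUMENT OF and ends with PETITIONER(S) or RESPONDENT(S); the name `remove_headings` and the sample text are made up for illustration. The lazy `.*?` stops at the first closing word, so the dialogue between two headings is kept.

```python
import re

# Assumed heading shape (not confirmed for every transcript):
# "ORAL|REBUTTAL ARGUMENT OF <name> ... PETITIONER(S)|RESPONDENT(S)"
HEADING = re.compile(r"(?:ORAL|REBUTTAL) ARGUMENT OF .*?(?:PETITIONERS?|RESPONDENTS?)")

def remove_headings(text):
    """Delete each argument heading without touching the dialogue between headings."""
    return HEADING.sub("", text)

sample = (
    "Mr. Rosenkranz.   ORAL ARGUMENT OF E. JOSHUA ROSENKRANZ   ON BEHALF OF THE PETITIONERS   "
    "MR. ROSENKRANZ: Thank you.   ORAL ARGUMENT OF BRIAN P. BROOKS   ON BEHALF OF THE RESPONDENTS   "
    "MR. BROOKS: Mr. Chief Justice.   REBUTTAL ARGUMENT OF E. JOSHUA ROSENKRANZ   "
    "ON BEHALF OF THE PETITIONERS   MR. ROSENKRANZ: Two points."
)

lazy = remove_headings(sample)
greedy = re.sub(r"ORAL.*PETITIONERS", "", sample)  # the greedy form, for comparison

print("MR. BROOKS" in lazy)    # lazy version keeps the respondents' dialogue
print("MR. BROOKS" in greedy)  # greedy version deletes it along with the headings
```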

### Get your dialogue list
The cleaned transcript can now be made into a list of speakers and the words that they speak. 

In [100]:
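#Get a flat list of speakers and speeches
#The capture group keeps each SPEAKER NAME: label in the result. The first three elements are leftovers from
#the ( :04 a.m.) line, so drop them to start where Chief Justice Roberts begins speaking.
#After that, even indices hold the speaker labels and odd indices hold what was said.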
speakers = re.split(r"([A-Z. ]+:)", replace3)
del speakers[:3]
speakers

['   CHIEF JUSTICE ROBERTS:',
 ' We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz',
 '.      MR. ROSENKRANZ:',
 ' Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And the only way to find out whether a   court is a "court of competen
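The labels in the output above pick up stray characters ('.      MR. ROSENKRANZ:' begins with the period that ended the previous sentence) because the character class `[A-Z. ]` also matches periods and spaces. One way around this, sketched here under assumptions, is to anchor each label on a title. The title list and the name `split_speakers` are not from the notebook and would need extending for other kinds of speakers.

```python
import re

# Assumed titles; extend as needed (this list is not taken from the notebook).
SPEAKER = re.compile(r"((?:CHIEF JUSTICE|JUSTICE|MR\.|MS\.|GENERAL) [A-Z][A-Z'\-]*:)")

def split_speakers(text):
    """Return a flat list: [text before first label, label, speech, label, speech, ...]."""
    return SPEAKER.split(text)

sample = (
    "CHIEF JUSTICE ROBERTS: We will hear argument. Mr. Rosenkranz.   "
    "MR. ROSENKRANZ: Thank you, Mr. Chief Justice.   "
    "JUSTICE GINSBURG: Does that include --"
)
parts = split_speakers(sample)
labels = parts[1::2]    # odd indices are the captured labels
speeches = parts[2::2]  # the element after each label is what was said
print(labels)
```

Because the pattern is case-sensitive, "Mr. Rosenkranz" and "Mr. Chief Justice" inside a speech are not mistaken for labels, and each sentence keeps its closing period.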

In [46]:
#zip is a built-in function that pairs up elements from two sequences. Here speakers[0::2] holds the
#speaker labels and speakers[1::2] holds the words spoken, so each pair is (speaker, speech).
#Create an easy-to-read list of (speaker, speech) tuples
full_list = list(zip(speakers[0::2], speakers[1::2]))
full_list

[('   CHIEF JUSTICE ROBERTS:',
  ' We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz'),
 ('.      MR. ROSENKRANZ:',
  ' Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And the only way to find out whether a   court is a "court of com

In [101]:
#Use pandas to create a dataframe of every line spoken
#import pandas as pd
#col_names = ['Speaker','Words']
#df = pd.DataFrame.from_records(full_list, columns=col_names)
#df


In [102]:
#df['Speaker'].value_counts()

### Make it a list of pairs
If you got your list the way I recommended, it is just a single list with one element after another. You need to figure out how to change it so that each speaker is paired with what is said. Give it some thought; there are a few ways to do this. If you made it this far, you're doing great!
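As one illustration of the pairing step, the sketch below turns a flat list that alternates speaker, speech, speaker, speech into a list of dictionaries. The function name `pair_up` and the keys `speaker` and `speech` are choices made for this example only.

```python
def pair_up(flat):
    """Pair each speaker label (even index) with the speech that follows it (odd index)."""
    # zip stops at the shorter slice, so a trailing label with no speech is dropped
    return [
        {"speaker": speaker, "speech": speech}
        for speaker, speech in zip(flat[0::2], flat[1::2])
    ]

flat = [
    "   CHIEF JUSTICE ROBERTS:", " We will hear argument next.",
    ".      MR. ROSENKRANZ:", " Thank you, Mr. Chief Justice",
    "   JUSTICE GINSBURG:",
]
pairs = pair_up(flat)
print(len(pairs))
```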

In [97]:
#done above


### Loop through all texts
If you made it this far--congratulations! 
The only thing left is to set up a loop that goes through all the texts and runs the cleanup and parsing on each one. You will need to have completed Step 1 to do this, because the loop needs the names of the PDFs. (Each final list should also contain the PDF name, so you can reference it from your case database.)
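A loop of this kind can be sketched as follows. It is only an outline under assumptions: `parse_all` and `toy_parser` are hypothetical names, the toy parser stands in for the real cleanup function, and the two files written to a temporary folder are made-up test data. Instead of stopping on the first transcript that does not fit the expected layout, the sketch records its name and moves on.

```python
import tempfile
from pathlib import Path

def parse_all(folder, parser):
    """Run parser on every .txt file in folder; return (rows, failed_names)."""
    rows, failed = [], []
    for txt_path in sorted(Path(folder).glob("*.txt")):
        text = txt_path.read_text()
        try:
            pairs = parser(text)
        except IndexError:
            # e.g. an expected marker was missing, so indexing a split result failed
            failed.append(txt_path.stem)
            continue
        for speaker, speech in pairs:
            # tag every row with the file name so it can be joined to the case list
            rows.append([speaker, speech, txt_path.stem])
    return rows, failed

def toy_parser(text):
    """Stand-in parser: one (speaker, speech) pair taken from after PROCEEDINGS."""
    body = text.split("PROCEEDINGS")[1]  # raises IndexError if the marker is absent
    speaker, _, speech = body.partition(":")
    return [(speaker.strip() + ":", speech.strip())]

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "14-1055_h3dj.txt").write_text("header PROCEEDINGS MR. ROSENKRANZ: Thank you.")
    Path(tmp, "no-marker.txt").write_text("this file has no marker")
    rows, failed = parse_all(tmp, toy_parser)

print(rows)
print(failed)
```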

In [98]:
# you could try here--Or email me with questions...

In [186]:
#f = open('pdfs/14-1055_h3dj.txt', 'r')
#sample_transcript = f.read()
speaker_list = sample_transcript
remove_alderson = re.sub(r"Alderson Reporting Company", "", speaker_list, flags = re.IGNORECASE)
remove_official = re.sub(r'Official - Subject to Final Review', "", remove_alderson, flags = re.IGNORECASE)
remove_numbers = re.sub(r"\b([1-9]|1[0-9]|2[0-5])\b", " ", remove_official) 
new_split = re.split(r"PROCEEDINGS", remove_numbers)
new_clean = new_split[1]
edit_bottom = re.split(r"above-entitled", new_clean)
isolated_middle = edit_bottom[0]
edit_n = re.sub(r"\n", " ", isolated_middle)
edit_x = re.sub(r"\x0c", "", edit_n) 
replace1 = re.sub(r"ORAL ARGUMENT OF E. JOSHUA ROSENKRANZ   ON BEHALF OF THE PETITIONERS", "", edit_x)
replace2 = re.sub(r"REBUTTAL.*PETITIONERS", "", replace1)
replace3 = re.sub(r"ORAL.*PETITIONERS", "", replace2)
replace4 = re.sub(r"ORAL ARGUMENT OF BRIAN P. BROOKS   ON BEHALF OF THE RESPONDENTS   MR. BROOKS", "", replace3)
speakers = re.split(r"([A-Z. ]+:)", replace4)
del speakers[:3]
full_list = list(zip(speakers[0::2], speakers[1::2]))
full_list

[('   CHIEF JUSTICE ROBERTS:',
  ' We will hear   argument first -- first this morning in Case  -74,   Advocate Health Care Network v. Stapleton and the   consolidated case.   Ms. Blatt'),
 ('.      MR. STEWART:',
  ' Mr. Chief Justice, and may it   please the Court:   I\'d like to first to pick up on a point that   Ms. Blatt alluded to when she was describing the -- the   history of the statute and its amendment. I think the   statute in its current form is probably not the type of   provision that Congress would draft if it were doing the   whole thing in one fell swoop. But it\'s important to   understand that the text of the -- the current provision   is the combination of things that were done in 1974 and   things that were done in 1980.   Congress enacted the original church plan   provision. Presumably, it had in mind particular plans   that were established and maintained by churches and it   covered those; and pretty quickly, problems came to   light. Other types of plans were

In [187]:
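def parse_transcript(the_text):
    """Clean one transcript and return a list of (speaker, speech) tuples."""
    #strip the reporting-company and draft banners repeated on every page
    remove_alderson = re.sub(r"Alderson Reporting Company", "", the_text, flags = re.IGNORECASE)
    remove_official = re.sub(r'Official - Subject to Final Review', "", remove_alderson, flags = re.IGNORECASE)
    #strip line numbers 1-25 (this also hits other standalone 1-25 values in the text)
    remove_numbers = re.sub(r"\b([1-9]|1[0-9]|2[0-5])\b", " ", remove_official) 
    #keep only the text between PROCEEDINGS and the closing "above-entitled" line
    new_split = re.split(r"PROCEEDINGS", remove_numbers)
    new_clean = new_split[1]
    edit_bottom = re.split(r"above-entitled", new_clean)
    isolated_middle = edit_bottom[0]
    #flatten newlines and drop form feeds (page breaks)
    edit_n = re.sub(r"\n", " ", isolated_middle)
    edit_x = re.sub(r"\x0c", "", edit_n) 
    #remove argument headings; replace1 and replace4 are hard-coded to the counsel names in 14-1055
    replace1 = re.sub(r"ORAL ARGUMENT OF E. JOSHUA ROSENKRANZ   ON BEHALF OF THE PETITIONERS", "", edit_x)
    replace2 = re.sub(r"REBUTTAL.*PETITIONERS", "", replace1)
    replace3 = re.sub(r"ORAL.*PETITIONERS", "", replace2)
    replace4 = re.sub(r"ORAL ARGUMENT OF BRIAN P. BROOKS   ON BEHALF OF THE RESPONDENTS   MR. BROOKS", "", replace3)
    #split on UPPER CASE NAME: labels, drop the first three elements, then pair speaker with speech
    speakers = re.split(r"([A-Z. ]+:)", replace4)
    del speakers[:3]
    full_list = list(zip(speakers[0::2], speakers[1::2]))
    return full_list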


In [188]:
parse_transcript(sample_transcript)

[('   CHIEF JUSTICE ROBERTS:',
  ' We will hear   argument first -- first this morning in Case  -74,   Advocate Health Care Network v. Stapleton and the   consolidated case.   Ms. Blatt'),
 ('.      MR. STEWART:',
  ' Mr. Chief Justice, and may it   please the Court:   I\'d like to first to pick up on a point that   Ms. Blatt alluded to when she was describing the -- the   history of the statute and its amendment. I think the   statute in its current form is probably not the type of   provision that Congress would draft if it were doing the   whole thing in one fell swoop. But it\'s important to   understand that the text of the -- the current provision   is the combination of things that were done in 1974 and   things that were done in 1980.   Congress enacted the original church plan   provision. Presumably, it had in mind particular plans   that were established and maintained by churches and it   covered those; and pretty quickly, problems came to   light. Other types of plans were

In [197]:
import pandas as pd
col_names = ['Speaker','Speech', 'PDF']
df = pd.DataFrame.from_records(list_of_cases, columns=col_names)
df


Unnamed: 0,Speaker,Speech,PDF
0,CHIEF JUSTICE ROBERTS:,We will hear argument next in Case No. -10...,14-1055_h3dj
1,. MR. ROSENKRANZ:,"Thank you, Mr. Chief Justice, and may it pl...",14-1055_h3dj
2,. JUSTICE GINSBURG:,Does that include - you -- you said subject-...,14-1055_h3dj
3,MR. ROSENKRANZ:,I -- I am not limiting it to subject-matter...,14-1055_h3dj
4,. JUSTICE GINSBURG:,What did you do -- what does Justice Soute...,14-1055_h3dj
5,. MR. ROSENKRANZ:,"Understood, Justice Ginsburg. And I think t...",14-1055_h3dj
6,. JUSTICE BREYER:,"It's tough. I mean, I find this pretty toug...",14-1055_h3dj
7,. MR. ROSENKRANZ:,"Well, Your Honor --",14-1055_h3dj
8,JUSTICE BREYER:,It comes out the other way. And the Red Cro...,14-1055_h3dj
9,. MR. ROSENKRANZ:,"No, Your Honor",14-1055_h3dj


In [190]:
#Loop through every transcript, parse it, and tag each (speaker, speech) pair with its PDF name
#These six transcripts are skipped
skip_files = {'15-1358_7648', '15-577_l64n', '15-866_j426', '16-32_mlho', '16-466_4g15', '16-529_21p3'}
list_of_cases = []
path = 'pdfs/'
for file_name in array:
    print(file_name)
    if file_name not in skip_files:
        #the with block closes the file once it has been read
        with open(path + file_name + '.txt', 'r') as f:
            sample_transcript = f.read()
        this_list = parse_transcript(sample_transcript)
        #turn each (speaker, speech) tuple into [speaker, speech, file_name]
        better_list = []
        for each in this_list:
            entry = list(each)
            entry.append(file_name)
            better_list.append(entry)
        list_of_cases.extend(better_list)

14-1055_h3dj
14-1538_j4ek
14-9496_feah
15-1031_6647
15-1039_bqm1
15-1111_ca7d
15-1189_6468
15-118_3e04
15-1191_igdj
15-1194_0861
15-1204_k536
15-1248_2dq3
15-1251_q86b
15-1256_d1o2
15-1262_l537
15-1293_o7jp
15-1358_7648
15-1391_5315
15-1406_d1of
15-1498_m647
15-1500_5g68
15-1503_3f14
15-214_l6hn
15-423_pnk0
15-457_gfbh
15-497_4g15
15-513_k5fm
15-537_ljgm
15-577_l64n
15-5991_21p3
15-606_5iel
15-628_p86a
15-649_l5gm
15-680_n648
15-7250_3eah
15-777_1b82
15-797_f2q3
15-8049_4f15
15-827_gfbh
15-8544_c1o2
15-866_j426
15-9260_bq7c
15-927_6j37
16-142_4gc5
16-149_bodg
16-240_nkp1
16-254_7lio
16-309_b97c
16-327_d18e
16-32_mlho
16-341_8njq
16-348_2cp3
16-349_e29g
16-369_8nka
16-373_4e46
16-399_3f14
16-405_9olb
16-466_4g15
16-5294_g314
16-529_21p3
16-54_7l48
16-605_2dp3
16-6219_7mio
16-74_p8k0


In [191]:
list_of_cases

[['   CHIEF JUSTICE ROBERTS:',
  ' We will hear   argument next in Case No.  -1055, Lightfoot v. Cendant   Mortgage Corporation.   Mr. Rosenkranz',
  '14-1055_h3dj'],
 ['.      MR. ROSENKRANZ:',
  ' Thank you, Mr. Chief   Justice, and may it please the Court:   There is only one natural way to read the   language at issue here. A "court of competent   jurisdiction" is a court that has an independent source   of subject-matter jurisdiction. That is what this Court   has held five times those words mean. So let\'s start   with the plain language.   The statute grants Freddie, quote, "The   power in its corporate name to sue and be sued in any   \'court of competent jurisdiction,\' State or Federal."   The only reference to jurisdiction in that passage is to   say that you don\'t get to go to any Federal court or any   State court, but rather, you have to choose a court,   State or Federal, that must be a "court of competent   jurisdiction." And the only way to find out whether a   court 

In [196]:
#import pandas as pd
#col_names = ['Speaker','Words',]
#df = pd.DataFrame.from_records(full_list, columns=col_names)
#df

In [194]:
len(list_of_cases)

11626

In [171]:
df['Speaker'].value_counts().head(10)

    MR. EISENHAMMER:          12
.   CHIEF JUSTICE ROBERTS:    11
.   MR. EISENHAMMER:          11
   MR. EISENHAMMER:           10
   MR. SCODRO:                 9
.   MR. SCODRO:                9
    MR. SCODRO:                9
.    MS. EISENSTEIN:           9
    JUSTICE KENNEDY:           8
.   JUSTICE KENNEDY:           8
Name: Speaker, dtype: int64

In [182]:
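#Note: str.contains returns a new True/False Series, so this replace never touches df
#and the counts below are unchanged; the same speaker still appears under several spellings
df.Speaker.str.contains('KENNEDY').replace("KENNEDY", "JUSTICE KENNNEDY", inplace = True)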
df['Speaker'].value_counts()

    MR. EISENHAMMER:                                                                  12
.   CHIEF JUSTICE ROBERTS:                                                            11
.   MR. EISENHAMMER:                                                                  11
   MR. EISENHAMMER:                                                                   10
   MR. SCODRO:                                                                         9
.   MR. SCODRO:                                                                        9
    MR. SCODRO:                                                                        9
.    MS. EISENSTEIN:                                                                   9
    JUSTICE KENNEDY:                                                                   8
.   JUSTICE KENNEDY:                                                                   8
  JUSTICE SOTOMAYOR:                                                                   7
.    JUSTICE BREYER: 
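The counts above split each speaker across several spellings that differ only in leading periods and spaces. A possible cleanup, shown here in plain Python as a sketch (the name `clean_speaker` and the short label list are made up for the example), collapses whitespace and trims periods, colons, and spaces from both ends before counting.

```python
import re
from collections import Counter

def clean_speaker(label):
    """Normalize a raw label such as '.   MR. SCODRO:' to 'MR. SCODRO'."""
    collapsed = re.sub(r"\s+", " ", label)
    # strip only trims the ends, so the period inside "MR." is kept
    return collapsed.strip(" .:")

raw_labels = [
    "    MR. SCODRO:", ".   MR. SCODRO:", "   MR. SCODRO:",
    ".   JUSTICE KENNEDY:", "    JUSTICE KENNEDY:",
]
counts = Counter(clean_speaker(label) for label in raw_labels)
print(counts.most_common())
```

With the DataFrame above, the same function could be applied to the column with `df['Speaker'].map(clean_speaker)` before calling `value_counts()`.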

In [None]:
#start a new python notebook for a clean 