## Federal Courts Project Guide

The ultimate goal of this project is to build a centralized database of federal judgeships across the 13 federal courts of appeals and/or the 94 district courts in the United States. Because of the wealth of data involved, and because much of it is scattered across many pages and sites, the first step is to research the domain and decide on the focus and range of data you want to obtain and make available.

Here are three possible angles:

1. Current judgeships, vacancies, and nomination proceedings: with this focus you would download tables of the recent vacancies and appointments, and go further into nomination procedures and Q&As. This would entail a combination of scraping, converting PDFs to text, and using regular expressions to parse the converted text (this is tough).

2. Historical judgeships: with this focus you examine changes in federal judgeships over a certain period of time (perhaps 10 to 20 years). This would entail mainly the scraping of many pages and the integration of data about specific judges, ordered by district.

3. Recent nominations and confirmations: this would focus specifically on judges newly nominated or appointed under the current administration. The focus would be more directly on the nomination hearings (Q&As), as well as the search for other data sources regarding the judges: news articles, opinions, and writings by the judges.



Your primary goal by Thursday is to come up with a specific research question: what kind of knowledge do you want to investigate, build, and make available through this project? What are the central units of analysis? What do you want to reveal about the federal courts?

Your secondary goal is to view the primary source pages and begin scraping. You do not have to have your central research question right at the beginning of the scraping, but it may help to have a direction.

Your goal by Friday is to have a finalized architecture for your dataframe(s) and a finalized list of sources that you will scrape or otherwise obtain.

**Data Architecture**
The question of architecture is central to this project. Because of the many possible angles, and the highly decentralized state of the primary source data, there is a wide range of possible designs for tables, rows, and columns. You may want to begin scraping some of the main pages to get more familiar with what kind of rows and columns might be involved.

**Interpretive architecture**
This depends on how focused your data frame will be. If you pick specific districts, judges, and/or confirmation hearings, you may want to do more human reading to assess different ways of framing the political/legal perspective of the judge or the district's decisions. If you choose to cast a wider net for your data, then you will want to focus on more quantitative categories: judge's age, district, background, length of appointment, length of vacancy, number of vacancies, etc.
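
As one way to make the quantitative categories concrete, here is a minimal sketch of a single row of such a table. The class name `JudgeshipRecord`, its field names, and the sample values are all hypothetical; your actual columns will depend on your research question and on what the source pages provide.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional


@dataclass
class JudgeshipRecord:
    """One row of a hypothetical judgeship table (field names are assumptions)."""
    judge_name: str
    court: str
    vacancy_date: Optional[date] = None
    nomination_date: Optional[date] = None
    confirmation_date: Optional[date] = None

    def vacancy_days(self) -> Optional[int]:
        """Days the seat stayed open, or None if either date is missing."""
        if self.vacancy_date is None or self.confirmation_date is None:
            return None
        return (self.confirmation_date - self.vacancy_date).days


# Made-up example values, not real data
record = JudgeshipRecord(
    judge_name="Jane Doe (hypothetical)",
    court="Example District",
    vacancy_date=date(2021, 1, 1),
    nomination_date=date(2021, 6, 1),
    confirmation_date=date(2021, 12, 31),
)

# Flatten to a plain dict and attach the derived column
row = {**asdict(record), "vacancy_days": record.vacancy_days()}
print(row["vacancy_days"])
```

A list of such dicts can be passed straight to `pd.DataFrame(...)`, so settling the fields first gives you the column layout before any scraping is finished.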



### Coding considerations:
While there is a great amount of data available, much of it is distributed across multiple pages, sometimes in inconsistent formats. If you're interested in scraping nominations and downloading PDFs, you may want to consider using **selenium** for part of it. If you want to use beautiful soup, you will have to collect the links and then loop through multiple pages to get a complete data set, unless your focus is more specific.
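
The "collect the links, then loop" pattern can be sketched roughly as follows. This version uses only the standard library so it runs without extra installs; with Beautiful Soup the equivalent loop would iterate over `doc.find_all("a", href=True)`. The names `LinkCollector` and `collect_links`, the sample HTML, and the `example.gov` base URL are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect an absolute URL for every <a> tag that has an href."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they came from
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page fragment standing in for a real listing page
SAMPLE_HTML = """
<ul>
  <li><a href="/hearings/example-hearing">Nominations</a></li>
  <li><a href="docs/example.pdf">Questionnaire</a></li>
  <li><a>No link here</a></li>
</ul>
"""

links = collect_links(SAMPLE_HTML, "https://www.example.gov/listing/")
pdf_links = [url for url in links if url.lower().endswith(".pdf")]
print(links)
print(pdf_links)
```

Each URL in `links` would then be fetched in turn and passed through the same function, which is the loop the paragraph above describes.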

### STEP 1
Scrape the first page of judicial vacancies:

http://www.uscourts.gov/judges-judgeships/judicial-vacancies/current-judicial-vacancies

In [2]:
###Import your scraping libraries
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests
from playwright.async_api import async_playwright


To avoid inventing a crazy loop, I just run the same code for each of the 12 pages of results.

In [3]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=1")

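# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output1.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()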

In [4]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=2")

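# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output2.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()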

In [5]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=3")

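# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output3.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()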

In [6]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=4")

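# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output4.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()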

In [7]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=5")

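# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output5.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()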

In [8]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=6")

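# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output6.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()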

In [9]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=7")

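# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output7.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()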

In [10]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=8")

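# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output8.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()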

In [11]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=9")

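# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output9.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()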

In [12]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=10")

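# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output10.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()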

In [13]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=11")

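# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output11.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Shut down this browser session before the next cell starts a new one
await browser.close()
await playwright.stop()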

In [14]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.congress.gov/search?q=%7B%22source%22%3A%22nominations%22%2C%22senate-committee%22%3A%22Judiciary%22%2C%22congress%22%3A%5B%22107%22%2C%22108%22%2C%22109%22%2C%22110%22%2C%22111%22%2C%22112%22%2C%22113%22%2C%22114%22%2C%22115%22%2C%22116%22%2C%22117%22%5D%7D&pageSize=250&page=12")

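# Parse the rendered page and save a local copy
doc = BeautifulSoup(await page.content(), 'html.parser')

with open('output12.html', 'w', encoding='utf-8') as file:
    file.write(str(doc))

# Close the browser and stop Playwright once the page is saved
await browser.close()
await playwright.stop()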
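
The cells above differ only in the `page` parameter of the URL, so the same work could probably be done in one loop. The sketch below is one possible version, not the code that was actually run: `build_url`, `save_all_pages`, and `CONGRESSES` are hypothetical names, and it writes the raw HTML returned by `page.content()` rather than passing it through BeautifulSoup first. The Playwright import sits inside the function so that the URL helper can be tried on its own.

```python
import json
from urllib.parse import quote

# Congresses 107 through 117, as in the search URLs above
CONGRESSES = [str(n) for n in range(107, 118)]
BASE = "https://www.congress.gov/search"


def build_url(page_number, page_size=250):
    """Rebuild the congress.gov search URL for one page of results."""
    query = {
        "source": "nominations",
        "senate-committee": "Judiciary",
        "congress": CONGRESSES,
    }
    # Compact JSON, then percent-encode every reserved character
    encoded = quote(json.dumps(query, separators=(",", ":")), safe="")
    return f"{BASE}?q={encoded}&pageSize={page_size}&page={page_number}"


async def save_all_pages(last_page=12):
    """Visit each results page in one browser session and save its HTML."""
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        for n in range(1, last_page + 1):
            await page.goto(build_url(n))
            html = await page.content()
            with open(f"output{n}.html", "w", encoding="utf-8") as file:
                file.write(html)
        await browser.close()


# In a notebook cell you would run:
# await save_all_pages()
```

Reusing one browser for all pages avoids opening twelve separate windows, and `last_page=12` is simply the page count mentioned above, so adjust it if the number of results changes.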

### STEP 2
Scrape the first page of judicial confirmations:

http://www.uscourts.gov/judges-judgeships/judicial-vacancies/confirmation-listing


### STEP 3
Investigate the Senate Judiciary Committee's confirmation postings:

https://www.judiciary.senate.gov/nominations/confirmed

This is relatively straightforward, except that the most interesting information is possibly the PDFs of the questionnaires for each candidate. To get the PDFs you need to use selenium (see step 4), but first look at this data and assess whether you think it will be useful to you. If it is, you can then convert the PDFs to text and parse them using regular expressions.

In [39]:
#Don't necessarily code here
#Think about where you're going first
#And read below

### STEP 4
Investigate the Senate Judiciary Committee's hearings on nominees:

https://www.judiciary.senate.gov/hearings

This one is very tricky. It is where you can find PDFs with Q&As from confirmation hearings. It is a multi-page scrape just to get links to the various nomination pages, which then have links to PDFs, which in turn redirect to the actual PDF downloads (you have to use selenium here).

But before you do the scrape, just go through the hearings pages by hand and click where it says "Nominations". Look at the different Q&As available and see if you think they will be useful to you. If they will be, I can give you most of the code you will need to get the PDFs. Also, I have uploaded a file on Slack with one hearing's PDFs along with text conversions of them. Take a look at the text conversions, because you'll need to parse them using regular expressions.
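
To give a feel for the regular-expression step, here is a minimal sketch that pulls question/response pairs out of converted text. The layout it expects (a numbered question followed by a line starting with `Response:`) is an assumption made for illustration; check it against the actual text conversions and adjust the pattern. `SAMPLE`, `PAIR`, and `parse_qa` are hypothetical names, and the sample text is invented.

```python
import re

# Invented sample text; the real conversions may be laid out differently
SAMPLE = """\
1. Please describe your judicial philosophy.
Response: I would apply the law as written.
2. Have you ever argued before a court of appeals?
Response: Yes, on several occasions.
"""

# A numbered question, then "Response:", then everything up to the
# next numbered question or the end of the text
PAIR = re.compile(
    r"^(\d+)\.\s+(.*?)\nResponse:\s+(.*?)(?=^\d+\.\s|\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse_qa(text):
    """Return one dict per question/response pair found in the text."""
    return [
        {"number": int(num), "question": q.strip(), "response": r.strip()}
        for num, q, r in PAIR.findall(text)
    ]


pairs = parse_qa(SAMPLE)
for pair in pairs:
    print(pair["number"], pair["question"], "->", pair["response"])
```

Each dict maps naturally onto one row of a dataframe, with the nominee and hearing added as extra columns once you know which PDF the text came from.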

If you are interested in more historical data, look into the information on these links:

Archives of vacancies/confirmations (if you want to build more historical data):
http://www.uscourts.gov/judges-judgeships/judicial-vacancies/archive-judicial-vacancies

Present and past judges including resumes:

Appeals courts:
https://www.fjc.gov/history/courts/u.s.-court-appeals-district-columbia-circuit-justices-and-judges

District courts:
https://www.fjc.gov/history/courts/u.s.-district-courts-and-federal-judiciary

In [40]:
#Think about your focus and what your ultimate architecture should be

In [41]:
#More to come...