## Scraping the White House Press Release page for Biden's judicial nominees

I initially used Playwright to scrape the press release pages, but I found inconsistencies in the URLs, the page formats, and even the titles of the press releases. After examining the pages, I switched to BeautifulSoup, since the pages are mostly text and easy to parse. There should be 42 press releases, one for each round of judicial nominee announcements, but only 38 were on the website.

https://www.whitehouse.gov/?s=judicial+nominees

## Examples of inconsistencies in the format of the webpages

![webpage-samples.png](attachment:02b4e916-7228-460f-8205-642dc7b35a02.png)

In [4]:
import pandas as pd
from glob import glob
import re
import os
import requests
from bs4 import BeautifulSoup

### Extract using BeautifulSoup

In [7]:
url = 'https://www.whitehouse.gov/briefing-room/presidential-actions/2023/11/15/president-biden-names-forty-second-round-of-judicial-nominees/'

response = requests.get(url)
soupdoc = BeautifulSoup(response.text, 'html.parser')

In [8]:
data = soupdoc.find_all('u')
for datum in data:
    print(datum.text)

United States Circuit Court Announcements
Nicole G. Berner: Nominee for the United States Court of Appeals for the Fourth Circuit
Adeel A. Mangi: Nominee for the United States Court of Appeals for the Third Circuit
United States District Court Announcements
Judge Amy M. Baggio: Nominee for the United States District Court for the District of Oregon
Judge Cristal C. Brisco: Nominee for the United States District Court for the Northern District of Indiana
Judge Gretchen S. Lund: Nominee for the United States District Court for the Northern District of Indiana
District of Columbia Superior Court Announcements
Judge Sherri Beatty-Arthur: Nominee for the District of Columbia Superior Court
Erin C. Johnston: Nominee for the District of Columbia Superior Court
Ray D. McKenzie: Nominee for the District of Columbia Superior Court


### Get text and parse using regex
To keep only the nominee lines, I deleted the three subheads from the extracted text: United States Circuit Court Announcements, United States District Court Announcements, and District of Columbia Superior Court Announcements.
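
The subhead removal described above can also be done in code instead of by hand. This is a minimal sketch, assuming every nominee line contains the phrase `: Nominee for` (true for the page shown here, but not guaranteed for every release); `keep_nominee_lines` is a hypothetical helper name, not part of the original notebook.

```python
def keep_nominee_lines(lines):
    """Return only the lines that look like 'Name: Nominee for ...'."""
    # Assumption: subheads never contain ': Nominee for'; nominee lines always do.
    return [line.strip() for line in lines if ": Nominee for" in line]


# A few lines copied from the scraped output, subheads included.
scraped = [
    "United States Circuit Court Announcements",
    "Nicole G. Berner: Nominee for the United States Court of Appeals for the Fourth Circuit",
    "United States District Court Announcements",
    "Judge Amy M. Baggio: Nominee for the United States District Court for the District of Oregon",
    "District of Columbia Superior Court Announcements",
    "Ray D. McKenzie: Nominee for the District of Columbia Superior Court",
]

nominee_lines = keep_nominee_lines(scraped)
print(nominee_lines)
```

In the notebook, the same helper could be applied directly to the scraped tags, for example `keep_nominee_lines([datum.text for datum in data])`.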

In [9]:
text = '''Nicole G. Berner: Nominee for the United States Court of Appeals for the Fourth Circuit
Adeel A. Mangi: Nominee for the United States Court of Appeals for the Third Circuit
Judge Amy M. Baggio: Nominee for the United States District Court for the District of Oregon
Judge Cristal C. Brisco: Nominee for the United States District Court for the Northern District of Indiana
Judge Gretchen S. Lund: Nominee for the United States District Court for the Northern District of Indiana
Judge Sherri Beatty-Arthur: Nominee for the District of Columbia Superior Court
Erin C. Johnston: Nominee for the District of Columbia Superior Court
Ray D. McKenzie: Nominee for the District of Columbia Superior Court'''

In [10]:
# Use a name other than `list` so the built-in type is not shadowed
lines = text.splitlines()
lines


['Nicole G. Berner: Nominee for the United States Court of Appeals for the Fourth Circuit',
 'Adeel A. Mangi: Nominee for the United States Court of Appeals for the Third Circuit',
 'Judge Amy M. Baggio: Nominee for the United States District Court for the District of Oregon',
 'Judge Cristal C. Brisco: Nominee for the United States District Court for the Northern District of Indiana',
 'Judge Gretchen S. Lund: Nominee for the United States District Court for the Northern District of Indiana',
 'Judge Sherri Beatty-Arthur: Nominee for the District of Columbia Superior Court',
 'Erin C. Johnston: Nominee for the District of Columbia Superior Court',
 'Ray D. McKenzie: Nominee for the District of Columbia Superior Court']

In [11]:

results = []

for line in lines:

    # Name: everything up to and including the colon (the colon is stripped later)
    name_match = re.search(r".*:", line)
    name = name_match.group() if name_match else None

    # District: "<Direction> District of ..." or plain "District of ..." through the end of the line
    district_match = re.search(r"((Southern|Northern|Middle|Eastern|Western)\sDistrict\sof\s\w.*)|(District\sof\s\w.*)", line)
    district = district_match.group() if district_match else None

    # Court type; fall back to None instead of raising if no court is found
    court_match = re.search(r"Court\sof\sAppeals|District\sCourt|Superior\sCourt", line)
    court = court_match.group() if court_match else None

    data = {
        "name": name,
        "district": district,
        "court": court
    }

    results.append(data)

print(results)


[{'name': 'Nicole G. Berner:', 'district': None, 'court': 'Court of Appeals'}, {'name': 'Adeel A. Mangi:', 'district': None, 'court': 'Court of Appeals'}, {'name': 'Judge Amy M. Baggio:', 'district': 'District of Oregon', 'court': 'District Court'}, {'name': 'Judge Cristal C. Brisco:', 'district': 'Northern District of Indiana', 'court': 'District Court'}, {'name': 'Judge Gretchen S. Lund:', 'district': 'Northern District of Indiana', 'court': 'District Court'}, {'name': 'Judge Sherri Beatty-Arthur:', 'district': 'District of Columbia Superior Court', 'court': 'Superior Court'}, {'name': 'Erin C. Johnston:', 'district': 'District of Columbia Superior Court', 'court': 'Superior Court'}, {'name': 'Ray D. McKenzie:', 'district': 'District of Columbia Superior Court', 'court': 'Superior Court'}]
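
In the output above, the Superior Court rows carry the whole string `District of Columbia Superior Court` in `district`, and the Court of Appeals rows have no location at all. A hedged alternative sketch follows, assuming each line has the shape `Name: Nominee for [the] <seat>`. The `parse_nominee` name, the extra `circuit` field, and the added `Central` direction are illustrative choices, not part of the original notebook.

```python
import re

# Split "Name: Nominee for the <seat>" into a name and a seat description.
LINE_RE = re.compile(r"^(?P<name>.+?):\s*Nominee for (?:the )?(?P<seat>.+)$")
COURT_RE = re.compile(r"Court of Appeals|District Court|Superior Court")
# Optional direction + "District of <place>", stopping before " Superior Court".
# Assumption: "Central" is added here; the original pattern does not include it.
DISTRICT_RE = re.compile(
    r"(?:(?:Southern|Northern|Middle|Eastern|Western|Central) )?"
    r"District of [A-Z][\w. ]*?(?= Superior Court|$)"
)
CIRCUIT_RE = re.compile(r"for the (?P<circuit>[\w ]+ Circuit)$")


def parse_nominee(line):
    """Return a dict of name, court, district, circuit, or None for non-nominee lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    seat = match.group("seat")
    court_match = COURT_RE.search(seat)
    court = court_match.group() if court_match else None
    district = circuit = None
    if court == "Court of Appeals":
        circuit_match = CIRCUIT_RE.search(seat)
        circuit = circuit_match.group("circuit") if circuit_match else None
    else:
        district_match = DISTRICT_RE.search(seat)
        district = district_match.group() if district_match else None
    return {"name": match.group("name"), "court": court,
            "district": district, "circuit": circuit}


for sample in [
    "Nicole G. Berner: Nominee for the United States Court of Appeals for the Fourth Circuit",
    "Judge Cristal C. Brisco: Nominee for the United States District Court for the Northern District of Indiana",
    "Judge Sherri Beatty-Arthur: Nominee for the District of Columbia Superior Court",
]:
    print(parse_nominee(sample))
```

Because the name group stops before the colon, the later `str.replace(':', '')` step would not be needed with this variant. Titles such as `Judge` are left in the name, as in the original.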


In [13]:

df = pd.DataFrame(results, columns=["name", "district", "court"])

df['name'] = df['name'].str.replace(':','') 

df


| | name | district | court |
|---|---|---|---|
| 0 | Nicole G. Berner | None | Court of Appeals |
| 1 | Adeel A. Mangi | None | Court of Appeals |
| 2 | Judge Amy M. Baggio | District of Oregon | District Court |
| 3 | Judge Cristal C. Brisco | Northern District of Indiana | District Court |
| 4 | Judge Gretchen S. Lund | Northern District of Indiana | District Court |
| 5 | Judge Sherri Beatty-Arthur | District of Columbia Superior Court | Superior Court |
| 6 | Erin C. Johnston | District of Columbia Superior Court | Superior Court |
| 7 | Ray D. McKenzie | District of Columbia Superior Court | Superior Court |


In [14]:
#Save .csv file to folder
#folder_path = 'judicial-nominees'
#csv_file_path = os.path.join(folder_path, '#_df.csv')
#df.to_csv(csv_file_path, index=False)

#print(f"CSV file saved at: {csv_file_path}")
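
The commented-out cell above saves each page's DataFrame by hand. Below is a minimal sketch of the same step as a reusable function. The `save_round` name and the `round_<NN>.csv` file-name scheme are assumptions for illustration, not the naming used in the original project. The demo writes to a temporary folder so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd


def save_round(df, round_number, folder_path="judicial-nominees"):
    """Write one page's nominees to <folder_path>/round_<NN>.csv and return the path."""
    # Create the output folder if it does not exist yet.
    os.makedirs(folder_path, exist_ok=True)
    csv_file_path = os.path.join(folder_path, f"round_{round_number:02d}.csv")
    df.to_csv(csv_file_path, index=False)
    return csv_file_path


# Demo with one row taken from the table above.
demo = pd.DataFrame(
    [{"name": "Judge Amy M. Baggio", "district": "District of Oregon", "court": "District Court"}]
)

with tempfile.TemporaryDirectory() as tmp:
    path = save_round(demo, 42, folder_path=tmp)
    round_trip = pd.read_csv(path)  # read it back to confirm the file is valid

print(f"CSV file saved at: {path}")
print(round_trip)
```

Numbering the files by round would also make it easier to see which of the 42 rounds are missing from the folder.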

### Combining all .csv files from each individual page

In [15]:
folder_path = '/Users/katrinaventura/Documents/Columbia/04-databases/final-project/final/judicial-nominees'

# Used list comprehension to read all CSV files in the folder into a list of DataFrames

dfs = [pd.read_csv(os.path.join(folder_path, file)) for file in os.listdir(folder_path) if file.endswith('.csv')]
dfs

[                        name                         district  \
 0     Bridget Meehan Brennan        Northern District of Ohio   
 1     Victoria Marie Calvert     Northern District of Georgia   
 2               John H. Chun   Western District of Washington   
 3        Samantha D. Elliott        District of New Hampshire   
 4      Charles Esque Fleming        Northern District of Ohio   
 5   Sarah Elisabeth Geraghty     Northern District of Georgia   
 6                 Dale E. Ho    Southern District of New York   
 7                Linda Lopez  Southern District of California   
 8               Jinsook Ohta  Southern District of California   
 9        David Augustin Ruiz        Northern District of Ohio   
 10          Loren L. AliKhan             District of Columbia   
 11    Adrienne Jennings Noti             District of Columbia   
 12            Ebony M. Scott             District of Columbia   
 13              D.W. Tunnage             District of Columbia   
 
        

In [16]:
len(dfs)

38

In [18]:
nominees = pd.concat(dfs, ignore_index=True)
nominees

| | name | district | court |
|---|---|---|---|
| 0 | Bridget Meehan Brennan | Northern District of Ohio | District Court |
| 1 | Victoria Marie Calvert | Northern District of Georgia | District Court |
| 2 | John H. Chun | Western District of Washington | District Court |
| 3 | Samantha D. Elliott | District of New Hampshire | District Court |
| 4 | Charles Esque Fleming | Northern District of Ohio | District Court |
| ... | ... | ... | ... |
| 226 | Judge Ana Isabel de Alba | Eastern District of California | District Court |
| 227 | Robert Steven Huie | Southern District of California | District Court |
| 228 | Natasha C. Merle | Eastern District of New York | District Court |
| 229 | Jennifer H. Rearden | Southern District of New York | District Court |
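
A hedged sketch of the combining step that also records which file each row came from, which can help when comparing the 38 files found against the 42 expected rounds. It uses `glob` (imported at the top of the notebook) with `sorted` so the file order is reproducible. The `combine_csvs` name and the `source_file` column are assumptions added for illustration, and the demo uses two small stand-in files rather than the real folder.

```python
import os
import tempfile
from glob import glob

import pandas as pd


def combine_csvs(folder_path):
    """Concatenate every .csv in folder_path, tagging each row with its file name."""
    frames = []
    for path in sorted(glob(os.path.join(folder_path, "*.csv"))):
        frame = pd.read_csv(path)
        frame["source_file"] = os.path.basename(path)  # hypothetical extra column
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No .csv files found in {folder_path!r}")
    return pd.concat(frames, ignore_index=True)


# Demo: two one-row files with rows copied from the combined table above.
rows = {
    "a.csv": {"name": "Bridget Meehan Brennan", "district": "Northern District of Ohio", "court": "District Court"},
    "b.csv": {"name": "Natasha C. Merle", "district": "Eastern District of New York", "court": "District Court"},
}

with tempfile.TemporaryDirectory() as tmp:
    for file_name, row in rows.items():
        pd.DataFrame([row]).to_csv(os.path.join(tmp, file_name), index=False)
    combined = combine_csvs(tmp)

print(combined)
print(combined["source_file"].nunique(), "files combined")
```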


In [None]:
#folder_path = ''

#csv_file_path = os.path.join(folder_path, 'biden_nominees.csv')
#nominees.to_csv(csv_file_path, index=False)