## LIN350 Course Project - The Language of Immigration Politics: Terminology Differences Across Party Lines in Congressional Speeches

The way I usually run Jupyter notebooks is to open the Anaconda Prompt terminal and run the command *jupyter notebook*. From there I go to Visual Studio and click Select Kernel -> Existing Jupyter Server -> localhost (or you can copy and paste the URL of the tab that opened when you ran *jupyter notebook*), then click Python, and that should be it.

To keep track of the work we're doing together, we can use a GitHub repository to share changes and keep our work in sync. The usual workflow should be:
1. Any changes you have on your laptop can be staged with "git add ./" from a terminal opened in the folder the notebook is in.
2. After adding the files and changes, run "git commit -m 'message here'". The message can be anything, but make sure it is in quotation marks.
3. After adding and committing, run "git push", which pushes your changes to the repository.
4. If there are changes in the repository that are not on your laptop, you can download them with "git pull".

Some other setup you might need to do is set an environment variable on your local computer, since we don't want to share the API key in the repository for privacy reasons. To do this, you run commands in your notebook:
1. Running "%env" in a code cell shows all the environment variables in the Jupyter environment.
2. To set the environment variable for our project, run "%env API_KEY=apikeyfromourgoogledocs", using the key from our Google Doc. Clear or delete that cell before committing so the key does not end up in the repository.
3. After that, running the first code cell reads the key into the variable API_KEY.
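
If the variable is missing, the first cell fails with a bare KeyError, which is not very helpful. A possible safeguard is sketched below. This is only an illustration: `load_api_key` is a made-up helper name and is not part of the notebook.

```python
import os


def load_api_key(env=os.environ, name="API_KEY"):
    """Return the API key stored under `name`, or raise a readable error.

    `load_api_key` is a hypothetical helper, not something the notebook defines.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Run '%env {name}=<your key>' in a cell first."
        )
    return key
```

With this in place, the first cell could use `API_KEY = load_api_key()` instead of `os.environ["API_KEY"]`.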


### Congressional Record Data Collector - Very simple for now, simple text data collection

In [None]:
import requests
import json
import os
import pandas as pd
import time
from datetime import datetime, timedelta
from tqdm.notebook import tqdm

# create directories for data storage
os.makedirs('data/congressional_record', exist_ok=True)

# read the API key from the environment (get one from https://api.data.gov/signup/)
API_KEY = os.environ["API_KEY"]

# Define date ranges for your study (immigration debates 2018-2023)
date_ranges = [
    # 2018 - Family separation policy debates (commented out for now to test collection on January 2019 data only)
    #("2018-06-01", "2018-06-30"),
    # 2019 - Border wall government shutdown (the only range enabled for now)
    ("2019-01-01", "2019-01-31"),
    # 2020 - Selected periods
    #("2020-02-01", "2020-02-15"),
    # 2021 - Biden immigration policy
    #("2021-02-01", "2021-02-15"),
    # 2022 - Border discussions
    #("2022-05-01", "2022-05-15")
]



Will download Congressional Record data for 31 dates


  0%|          | 0/31 [00:00<?, ?it/s]

No Congressional Record available for 2019-01-01 (Status: 404)
Successfully downloaded Congressional Record for 2019-01-02 (72 granules)
No Congressional Record available for 2019-01-03 (Status: 404)
Successfully downloaded Congressional Record for 2019-01-04 (100 granules)
No Congressional Record available for 2019-01-05 (Status: 404)
No Congressional Record available for 2019-01-06 (Status: 404)
No Congressional Record available for 2019-01-07 (Status: 404)
Successfully downloaded Congressional Record for 2019-01-08 (100 granules)
Successfully downloaded Congressional Record for 2019-01-09 (100 granules)
Successfully downloaded Congressional Record for 2019-01-10 (100 granules)
Successfully downloaded Congressional Record for 2019-01-11 (100 granules)
No Congressional Record available for 2019-01-12 (Status: 404)
No Congressional Record available for 2019-01-13 (Status: 404)
Successfully downloaded Congressional Record for 2019-01-14 (100 granules)
Successfully downloaded Congression

In [None]:
def get_dates_in_range(start_date, end_date):
    """Return every date from start_date to end_date (inclusive) as a list of 'YYYY-MM-DD' strings."""
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    
    date_list = []
    current = start
    while current <= end:
        date_list.append(current.strftime("%Y-%m-%d"))
        current += timedelta(days=1)
    return date_list

In [None]:
def get_congressional_record(date):
    """
    Get Congressional Record data using the GovInfo API (this worked, thankfully).

    Args:
        date (str): Date in YYYY-MM-DD format
    Returns:
        bool: True if the package was found and downloaded, False otherwise
    """
    package_id = f"CREC-{date}"
    
    # first check if the package exists for this date 
    package_url = f"https://api.govinfo.gov/packages/{package_id}/summary"
    
    params = {
        'api_key': API_KEY
    }
    
    try:
        # check if the package exists
        response = requests.get(package_url, params=params)
        
        # if package doesn't exist or other error
        if response.status_code != 200:
            print(f"No Congressional Record available for {date} (Status: {response.status_code})")
            return False
        
        # save the package summary
        with open(f"data/congressional_record/{package_id}-summary.json", 'w') as f:
            json.dump(response.json(), f)
        
        # get granules (speeches and entries) 
        granules_url = f"https://api.govinfo.gov/packages/{package_id}/granules"
        granules_params = {
            'api_key': API_KEY,
            'offset': 0,
            'pageSize': 100  # Max page size
        }
        
        # get the first page of granules only (at most 100, so longer days are cut off here)
        granules_response = requests.get(granules_url, params=granules_params)
        
        if granules_response.status_code != 200:
            print(f"Failed to get granules for {date} (Status: {granules_response.status_code})")
            return False
            
        # save the granules list
        with open(f"data/congressional_record/{package_id}-granules.json", 'w') as f:
            json.dump(granules_response.json(), f)
            
        # download content for each granule
        granules = granules_response.json().get('granules', [])
        
        for granule in granules:
            granule_id = granule.get('granuleId')
            
            # skip if no granule ID
            if not granule_id:
                continue
            
            # get the HTML content
            content_url = f"https://api.govinfo.gov/packages/{package_id}/granules/{granule_id}/htm"
            content_response = requests.get(content_url, params=params)
            
            if content_response.status_code == 200:
                # save the HTML content
                with open(f"data/congressional_record/{package_id}-{granule_id}.html", 'w', encoding='utf-8') as f:
                    f.write(content_response.text)
            
            # respect rate limits
            time.sleep(0.5)
            
        print(f"Successfully downloaded Congressional Record for {date} ({len(granules)} granules)")
        return True
        
    except Exception as e:
        print(f"Error retrieving data for {date}: {e}")
        return False
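
The function above requests a single page of at most 100 granules, and several days in the printed output report exactly 100, so longer days are probably being cut off. One possible way to collect every page is sketched below. It assumes that each JSON response carries a `nextPage` field with the URL of the next page (empty or missing on the last page); please check that against the GovInfo API documentation before relying on it. `collect_all_granules` and `fetch_json` are hypothetical names, and the HTTP call is passed in as a function so the paging logic can be tried without a network connection.

```python
def collect_all_granules(first_url, fetch_json, max_pages=50):
    """Follow page links and return the combined list of granule records.

    Assumption: each page is a dict with a 'granules' list and an optional
    'nextPage' URL. `fetch_json(url)` must return that dict for a given URL.
    """
    granules = []
    url = first_url
    pages_read = 0
    # max_pages is a safety cap so a bad 'nextPage' value cannot loop forever
    while url and pages_read < max_pages:
        page = fetch_json(url)
        granules.extend(page.get("granules", []))
        url = page.get("nextPage")  # None or missing on the last page
        pages_read += 1
    return granules
```

In the notebook, `fetch_json` could be a small wrapper around `requests.get(url, params={'api_key': API_KEY}).json()`, with a short `time.sleep` between pages to stay within the rate limit.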

In [None]:
def main():
    """Main function to download Congressional Record data for every date in date_ranges."""
    all_dates = []
    
    # generate all dates in the specified ranges
    for start_date, end_date in date_ranges:
        dates = get_dates_in_range(start_date, end_date)
        all_dates.extend(dates)
    
    print(f"Will download Congressional Record data for {len(all_dates)} dates")
    
    # download data for each date
    successful_downloads = 0
    for date in tqdm(all_dates):
        success = get_congressional_record(date)
        if success:
            successful_downloads += 1
        
        # wait between requests to avoid rate limiting
        time.sleep(1)
    
    print("\nData collection complete!")
    print(f"Successfully downloaded data for {successful_downloads} out of {len(all_dates)} dates")
    print("Data saved to: data/congressional_record/")
    
    # create a simple summary file with immigration keywords to help with later analysis
    with open('data/immigration_keywords.txt', 'w') as f:
        keywords = [
            'immigration', 'immigrant', 'migrant', 'migration', 'asylum', 
            'refugee', 'border', 'wall', 'undocumented', 'illegal alien',
            'daca', 'dreamer', 'deportation', 'visa', 'citizenship',
            'family separation', 'child detention', 'border security',
            'border crisis', 'caravan', 'amnesty', 'path to citizenship'
        ]
        f.write('\n'.join(keywords))

if __name__ == "__main__":
    main()

In [11]:
import os
import glob
from bs4 import BeautifulSoup
import pandas as pd

# folder that holds the downloaded .html files (change this to the location on your own machine)
path = r"C:\Users\kevin barcenas\Documents\GitHub\LIN350Project\congressional_record"
html_files = glob.glob(os.path.join(path, "*.html"))

# Expanded immigration terms list
immigration_terms = [
    'immigration', 'immigrant', 'border', 'asylum', 'refugee', 
    'undocumented', 'illegal alien', 'migrant', 'caravan',
    'wall', 'daca', 'dreamer', 'deportation',
    'family separation', 'mexico', 'visa', 'citizenship'
]

# Search ALL files (this might take a few minutes)
immigration_files = []
for file in html_files:
    with open(file, 'r', encoding='utf-8') as f:
        try:
            content = f.read().lower()
            found_terms = [term for term in immigration_terms if term in content]
            if found_terms:
                immigration_files.append({
                    'file': file,
                    'terms': found_terms
                })
        except Exception as e:
            print(f"Error reading {file}: {e}")

print(f"Found {len(immigration_files)} files with immigration content")

# If we found any, save the list for reference
if immigration_files:
    df = pd.DataFrame(immigration_files)
    df.to_csv(os.path.join(path, "immigration_files.csv"), index=False)
    print("First 5 files with immigration content:")
    for item in immigration_files[:5]:
        print(f"  - {os.path.basename(item['file'])}: {', '.join(item['terms'])}")

Found 206 files with immigration content
First 5 files with immigration content:
  - CREC-2019-01-02-CREC-2019-01-02-pt1-PgD1334-2.html: wall, mexico
  - CREC-2019-01-02-CREC-2019-01-02-pt1-PgD1335-3.html: visa
  - CREC-2019-01-02-CREC-2019-01-02-pt1-PgH10607-6.html: mexico
  - CREC-2019-01-02-CREC-2019-01-02-pt1-PgH10607-8.html: mexico
  - CREC-2019-01-02-CREC-2019-01-02-pt1-PgH10608.html: wall
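
One caveat with the search above: `term in content` matches substrings, so 'wall' also matches 'wallet' and 'visa' also matches 'advisable'. It also scans the raw HTML, not the visible text. A minimal sketch of whole-word matching with the standard `re` module follows. `find_terms` is a hypothetical helper, and allowing an optional trailing 's' for plurals is my own assumption about what should count as a hit.

```python
import re


def find_terms(text, terms):
    """Return the terms that occur in `text` as whole words or phrases.

    Matching is case-insensitive; an optional trailing 's' is allowed so
    that 'immigrant' also matches 'immigrants' (an assumption, see above).
    """
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"s?\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


sample = "It is advisable to fund the border wall; immigrants need a wallet."
print(find_terms(sample, ["wall", "visa", "immigrant", "border"]))
# prints ['wall', 'immigrant', 'border']
```

In the loop above, `found_terms = find_terms(content, immigration_terms)` could stand in for the list comprehension; the file count will likely drop once the substring hits are gone.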
