## LIN350 Course Project - The Language of Immigration Politics: Terminology Differences Across Party Lines in Congressional Speeches

The way I usually run Jupyter notebooks: open the Anaconda Prompt terminal and run the command *jupyter notebook*. From there, go to Visual Studio Code and click Select Kernel -> Existing Jupyter Server -> localhost (or copy and paste the URL of the browser tab that the *jupyter notebook* command opened), then click on Python, and that should be it.

To keep track of the work we're doing together, we can use a GitHub repository to update changes and sync up our work. The usual workflow for this should be:
1. Any changes you have on your laptop can be staged for the repository with "git add ./" from the terminal, in the folder the notebook is in.
2. After adding the files and changes, you can use "git commit -m 'message here'". Make sure the message is in quotation marks; it can be anything.
3. After adding and committing, you can "git push", which pushes your changes to the repository.
4. If there are changes in the repository that are not on your laptop, you can fetch them with "git pull".

Some other setup you might need to do is setting environment variables on your local computer, since we don't want to share the API key in the repository for privacy reasons. To do this, you run commands in your notebook:
1. Running "%env" in a code cell will show you all the environment variables in the Jupyter environment.
2. To set up the environment variable for our project, run the command "%env API_KEY=apikeyfromourgoogledocs".
3. After that, running the first cell of code will set up the API key to be used as API_KEY.
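
As a rough sketch of what step 3 relies on, the snippet below reads a variable from the environment and fails loudly if it is missing. The helper name `require_env`, the variable `DEMO_API_KEY`, and its placeholder value are made up for illustration so the example doesn't touch the real `API_KEY`; none of them are part of the notebook.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; run '%env {name}=...' in a cell first")
    return value

# hypothetical demo variable with a placeholder value; the real key stays out of the repository
os.environ["DEMO_API_KEY"] = "placeholder-key"
print(require_env("DEMO_API_KEY"))
```

Raising an error instead of returning None is just one option; the notebook's own `setup_api_key` prints a warning and keeps going.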


### Congressional Record Data Collector - Very simple for now, simple text data collection

In [12]:
%pip install Xlsxwriter

Note: you may need to restart the kernel to use updated packages.


#### SECTION 1: INTRODUCTION AND SETUP


In [3]:


"""

Research Questions:
1. What statistically significant differences exist in the frequency of immigration-related 
   terminology (e.g., "undocumented" vs. "illegal") between political parties?
2. How do these terminological choices correlate with specific policy positions or votes?
3. Has the terminology used by each party shifted over the past decade (2015-2025), 
   and if so, in what direction?

The project analyzes Congressional Record speeches to investigate how politicians from
different parties use immigration-related terminology, following the methodologies
covered in the LIN350 course.
"""

import requests
import json
import os
import pandas as pd
import time
from datetime import datetime, timedelta
from tqdm.notebook import tqdm
import glob
import re
from bs4 import BeautifulSoup
import xlsxwriter
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
from nltk.tokenize import word_tokenize, sent_tokenize
from scipy.stats import chi2_contingency


def setup_directories():
    # create all necessary directories for the project. returns a dictionary of important paths.
    base_dir = os.getcwd()
    
    # main data directories
    data_dir = os.path.join(base_dir, "data")
    raw_data_dir = os.path.join(data_dir, "congressional_record")
    processed_dir = os.path.join(base_dir, "processed_data")
    samples_dir = os.path.join(processed_dir, "speech_samples")
    figures_dir = os.path.join(processed_dir, "figures")  
    
    # create all directories
    for directory in [data_dir, raw_data_dir, processed_dir, samples_dir, figures_dir]:
        os.makedirs(directory, exist_ok=True)
    
    # return dictionary of paths for easy reference
    return {
        "base_dir": base_dir,
        "data_dir": data_dir,
        "raw_data_dir": raw_data_dir,
        "processed_dir": processed_dir,
        "samples_dir": samples_dir,
        "figures_dir": figures_dir
    }

def setup_api_key():
    # set up the API key for accessing the GovInfo API (api.govinfo.gov). returns the API key, or None if it is not set.

    # uncomment and run this line to set the API key in the notebook environment
    # %env API_KEY=your_api_key_here
    
    try:
        API_KEY = os.environ.get("API_KEY")
        if not API_KEY:
            print("Warning: API_KEY environment variable not found.")
            print("Please run '%env API_KEY=your_api_key' in a cell.")
            return None
        return API_KEY
    except Exception as e:
        print(f"Error accessing API key: {e}")
        return None

# define constants for data collection
def define_constants():   
    date_ranges = [
        # 2019 - Border wall government shutdown
        ("2019-01-01", "2019-01-31"),
        
        # Government shutdown over border wall funding
        ("2018-12-15", "2018-12-31"),

        # DACA debates
        ("2017-09-01", "2017-10-15"),
        ("2018-01-15", "2018-02-15"),
        
        # Border surge discussions
        ("2019-03-01", "2019-04-15"),
        
        # Election year immigration discussions
        ("2020-01-15", "2020-02-15"),
        ("2020-09-01", "2020-10-15"),
        
        # Biden administration policy changes
        ("2021-01-20", "2021-03-01")
    ]
    
    # immigration-related term pairs for analysis
    term_pairs = [
        ("undocumented", "illegal", "unauthorized"),  # Status descriptors
        ("asylum seeker", "refugee", "migrant"),      # Migration categories
        ("border security", "border crisis", "border management"),  # Border framing
        ("path to citizenship", "amnesty"),           # Legal status solutions
        ("dreamers", "daca recipients"),              # Youth beneficiaries 
        ("family separation", "child detention"),      # Child policy framing
        ("chain migration", "family reunification"),  # Family immigration framing
        ("alien", "foreign national", "noncitizen", "undocumented"),  # Legal designation terms
        ("deportation", "removal"),                   # Enforcement terminology
        ("sanctuary cities", "non-cooperative jurisdictions"),  # Local policy framing
        ("border wall", "border barrier", "border infrastructure")  # Border infrastructure
    ]
    
    # immigration-related terms with more precise matching
    immigration_terms = {
        # regular terms - can appear within other words
        'immigration': r'immigration',
        'immigrant': r'immigrant',
        'migrant': r'migrant',
        'citizenship': r'citizenship',
        'deportation': r'deportation',
        
        # terms that need word boundary checks
        'border': r'\b(?:border|borders)\b',
        'asylum': r'\basylum\b',
        'refugee': r'\b(?:refugee|refugees)\b',
        'undocumented': r'\bundocumented\b',
        'illegal alien': r'\billegal\s+alien',
        'unauthorized': r'\bunauthorized\b',
        'wall': r'\bwall\b',
        'daca': r'\bdaca\b',
        'dreamer': r'\b(?:dreamer|dreamers)\b',
        'visa': r'\bvisa\b',
        'detention': r'\bdetention\b',
        
        # phrases
        'family separation': r'family\s+separation',
        'child detention': r'child\s+detention',
        'border security': r'border\s+security',
        'border crisis': r'border\s+crisis',
        'path to citizenship': r'path\s+to\s+citizenship',
        'amnesty': r'\bamnesty\b',
        'caravan': r'\bcaravan\b',
        
        # specific entities
        'mexico': r'\bmexico\b',
        'ice': r'\b(?:ice|immigration and customs enforcement)\b',  # Only match whole word "ice"
        'cbp': r'\b(?:cbp|customs and border protection)\b'
    }

    return {
        "date_ranges": date_ranges,
        "term_pairs": term_pairs,
        "immigration_terms": immigration_terms,
    }

# initialize the project
def initialize_project():
   
    print("Initializing project...\n")
    
    directories = setup_directories()
    print(f"Directory structure created:")
    for name, path in directories.items():
        print(f"  - {name}: {path}")
    
    api_key = setup_api_key()
    if api_key:
        print(f"API key configured")
    
    constants = define_constants()
    print(f"Constants defined:")
    print(f"  - Date ranges: {len(constants['date_ranges'])} periods")
    print(f"  - Term pairs: {len(constants['term_pairs'])} pairs/groups")
    print(f"  - Immigration terms: {len(constants['immigration_terms'])} terms")
    
    config = {
        "directories": directories,
        "api_key": api_key,
        "constants": constants
    }
    
    print("\nProject initialization complete!")
    return config

# run initialization
config = initialize_project()


Initializing project...

Directory structure created:
  - base_dir: c:\Users\Kevin\Downloads\LIN350Project
  - data_dir: c:\Users\Kevin\Downloads\LIN350Project\data
  - raw_data_dir: c:\Users\Kevin\Downloads\LIN350Project\data\congressional_record
  - processed_dir: c:\Users\Kevin\Downloads\LIN350Project\processed_data
  - samples_dir: c:\Users\Kevin\Downloads\LIN350Project\processed_data\speech_samples
  - figures_dir: c:\Users\Kevin\Downloads\LIN350Project\processed_data\figures
Please run '%env API_KEY=your_api_key' in a cell.
Constants defined:
  - Date ranges: 8 periods
  - Term pairs: 11 pairs/groups
  - Immigration terms: 26 terms

Project initialization complete!
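
To see why some entries in `immigration_terms` carry `\b` word boundaries and others don't, here is a small sketch using two of the patterns from the cell above. The helper `found_terms` and the two test sentences are invented for illustration.

```python
import re

# two patterns copied from immigration_terms above
patterns = {
    "migrant": r"migrant",                                      # no word boundary
    "ice": r"\b(?:ice|immigration and customs enforcement)\b",  # whole word only
}

def found_terms(text, patterns):
    """Return the sorted keys whose pattern matches the lowercased text."""
    text = text.lower()
    return sorted(term for term, pattern in patterns.items() if re.search(pattern, text))

print(found_terms("Police met the immigrants at the station.", patterns))  # ['migrant']
print(found_terms("ICE agents were present.", patterns))                   # ['ice']
```

Without the boundary, "ice" would also fire on "police"; the unbounded "migrant" pattern is what lets "immigrants" count as a hit.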


#### SECTION 2: DATA COLLECTION


In [None]:

# function to generate all dates in a given range
def get_dates_in_range(start_date, end_date):
    # start_date (str): Start date in format 'YYYY-MM-DD'
    # end_date (str): End date in format 'YYYY-MM-DD'
        
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    
    date_list = []
    current = start
    while current <= end:
        date_list.append(current.strftime("%Y-%m-%d"))
        current += timedelta(days=1)
    return date_list

def verify_api_key(api_key):

    test_url = "https://api.govinfo.gov/collections"
    params = {
        'api_key': api_key
    }

    try:
        print("Testing API key with GovInfo API...")
        # timeout keeps the notebook from hanging forever if the server does not answer
        response = requests.get(test_url, params=params, timeout=30)
        
        if response.status_code == 200:
            print("Success! Your API key is valid for the GovInfo API.")
            print(f"Status code: {response.status_code}")
            
            # show the first few collections to confirm we got real data
            collections = response.json().get('collections', [])
            if collections:
                print("\nAvailable collections:")
                for collection in collections[:5]:
                    print(f"- {collection.get('collectionName', 'Unknown')}")
            return True
            
        elif response.status_code == 401 or response.status_code == 403:
            print("Authentication failed. Your API key appears to be invalid.")
            print(f"Status code: {response.status_code}")
            print(f"Response: {response.text}")
            return False
        else:
            print(f"Received unexpected status code: {response.status_code}")
            print(f"Response: {response.text}")
            return False
            
    except Exception as e:
        print(f"Error occurred while testing the API key: {e}")
        return False

# function to get Congressional Record data using the GovInfo API
def get_congressional_record(date, api_key, raw_data_dir):
    """
    Args:
        date (str): Date in format 'YYYY-MM-DD'
        api_key (str): API key for the GovInfo API
        raw_data_dir (str): Directory to save raw data  
    """
    package_id = f"CREC-{date}"
    package_url = f"https://api.govinfo.gov/packages/{package_id}/summary"
    params = {
        'api_key': api_key
    }
    try:
        # check if the package exists
        response = requests.get(package_url, params=params, timeout=30)
    
        # if package doesn't exist or other error
        if response.status_code != 200:
            print(f"No Congressional Record available for {date} (Status: {response.status_code})")
            return False
        
        # save the package summary
        with open(os.path.join(raw_data_dir, f"{package_id}-summary.json"), 'w') as f:
            json.dump(response.json(), f)
        
        # get granules (speeches and entries) 
        granules_url = f"https://api.govinfo.gov/packages/{package_id}/granules"
        granules_params = {
            'api_key': api_key,
            'offset': 0,
            'pageSize': 100  # Max page size
        }
        
        # get first page of granules
        granules_response = requests.get(granules_url, params=granules_params, timeout=30)
        
        if granules_response.status_code != 200:
            print(f"Failed to get granules for {date} (Status: {granules_response.status_code})")
            return False
            
        # save the granules list
        with open(os.path.join(raw_data_dir, f"{package_id}-granules.json"), 'w') as f:
            json.dump(granules_response.json(), f)
            
        # download content for each granule
        granules = granules_response.json().get('granules', [])
        
        for granule in granules:
            granule_id = granule.get('granuleId')
            
            # skip if no granule ID
            if not granule_id:
                continue
            
            # get the HTML content
            content_url = f"https://api.govinfo.gov/packages/{package_id}/granules/{granule_id}/htm"
            content_response = requests.get(content_url, params=params, timeout=30)
            
            if content_response.status_code == 200:
                # save the HTML content
                with open(os.path.join(raw_data_dir, f"{package_id}-{granule_id}.html"), 'w', encoding='utf-8') as f:
                    f.write(content_response.text)
            
            # respect rate limit
            time.sleep(0.5)
            
        print(f"Successfully downloaded Congressional Record for {date} ({len(granules)} granules)")
        return True
        
    except Exception as e:
        print(f"Error retrieving data for {date}: {e}")
        return False

# main function to download Congressional Record data
def collect_congressional_data(config):
    """
    Args: config (dict): Project configuration
    Returns: int: Number of successfully downloaded dates
    """
    api_key = config["api_key"]
    date_ranges = config["constants"]["date_ranges"]
    raw_data_dir = config["directories"]["raw_data_dir"]
    
    if not verify_api_key(api_key):
        print("Cannot proceed with data collection due to invalid API key.")
        return 0
    
    all_dates = []
    
    # generate all dates in the specified ranges
    for start_date, end_date in date_ranges:
        dates = get_dates_in_range(start_date, end_date)
        all_dates.extend(dates)
    
    print(f"Will download Congressional Record data for {len(all_dates)} dates")
    
    # download data for each date; the loop is commented out because we already collected the data
    successful_downloads = 0
    # for date in tqdm(all_dates, desc="Downloading Congressional Records"):
    #     success = get_congressional_record(date, api_key, raw_data_dir)
    #     if success:
    #         successful_downloads += 1
        
    #     # Wait between requests to avoid rate limiting
    #     time.sleep(1)
    
    print(f"\nData collection complete!")
    print(f"Successfully downloaded data for {successful_downloads} out of {len(all_dates)} dates")
    print(f"Data saved to: {raw_data_dir}")
    
    return successful_downloads

# run data collection (the download loop inside collect_congressional_data is currently commented out)
successful_downloads = collect_congressional_data(config)
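
One thing to keep in mind: `get_congressional_record` only requests the first page of granules (`offset` 0, `pageSize` 100), so a day with more than 100 granules would be cut short. Below is a minimal sketch of offset-based paging. `fetch_page` is a hypothetical stand-in for the HTTP request, and `fake_fetch` simulates it with made-up records; whether the GovInfo endpoint accepts increasing `offset` values exactly like this is an assumption to check against its documentation.

```python
def collect_all_granules(fetch_page, page_size=100):
    """Call fetch_page(offset, page_size) until a short or empty page comes back."""
    granules = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        granules.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return granules

# hypothetical stand-in for the API: 250 fake granule records
FAKE_GRANULES = [{"granuleId": f"granule-{i}"} for i in range(250)]

def fake_fetch(offset, page_size):
    """Return one slice of the fake records, the way a paged endpoint would."""
    return FAKE_GRANULES[offset:offset + page_size]

all_granules = collect_all_granules(fake_fetch)
print(len(all_granules))  # 250
```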


In [15]:
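# build a df of legislators from the @unitedstates GitHub data (2015-2025)
# os.path.join keeps the paths working on Windows, macOS, and Linux
legislator_data_dir = os.path.join("data", "legislator_data", "unitedstates.github.io")
current_file = os.path.join(legislator_data_dir, "legislators-current.json")
historical_file = os.path.join(legislator_data_dir, "legislators-historical.json")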
def build_legislators_dataframe(current_file=current_file, 
                               historical_file=historical_file):
   
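    # read as UTF-8 explicitly so accented names load correctly on Windows too
    with open(current_file, 'r', encoding='utf-8') as f:
        current = json.load(f)
    
    with open(historical_file, 'r', encoding='utf-8') as f:
        historical = json.load(f)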
    
    all_legislators = current + historical
    legislator_records = []
    
    study_start = datetime.strptime('2015-01-01', '%Y-%m-%d')
    study_end = datetime.strptime('2025-12-31', '%Y-%m-%d')
    
    for legislator in all_legislators:
        # legislator info
        legislator_id = legislator.get('id', {}).get('bioguide', '')
        first_name = legislator.get('name', {}).get('first', '')
        last_name = legislator.get('name', {}).get('last', '')
        
        # process each term to see if any fall within our study period
        for term in legislator.get('terms', []):
            term_start = datetime.strptime(term.get('start', '1900-01-01'), '%Y-%m-%d')
            term_end = datetime.strptime(term.get('end', '2100-01-01'), '%Y-%m-%d')
            
            # check if this term overlaps with our study period
            if (term_start <= study_end and term_end >= study_start):
                record = {
                    'bioguide_id': legislator_id,
                    'first_name': first_name,
                    'last_name': last_name,
                    'last_name_upper': last_name.upper(),  # For easier matching
                    'full_name': legislator.get('name', {}).get('official_full', f"{first_name} {last_name}"),
                    'state': term.get('state', ''),
                    'party': term.get('party', ''),
                    'type': term.get('type', ''),  # 'sen' or 'rep'
                    'term_start': term.get('start', ''),
                    'term_end': term.get('end', ''),
                    'state_rank': term.get('state_rank', '')  # 'junior' or 'senior' for senators
                }
                legislator_records.append(record)
    
    df = pd.DataFrame(legislator_records)
    df.to_csv('legislators_2015_2025.csv', index=False)
    print(f"Created DataFrame with {len(df)} records from {len(set(df['bioguide_id']))} unique legislators")
    
    return df

legislators_df = build_legislators_dataframe()

Created DataFrame with 3463 records from 1030 unique legislators
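
The output above has more records than legislators because the function writes one row per term that overlaps the study period, not one row per person. The overlap check itself can be sketched on its own as below; the helper name `term_overlaps_study` and the example dates are invented for illustration.

```python
from datetime import datetime

def term_overlaps_study(term_start, term_end,
                        study_start="2015-01-01", study_end="2025-12-31"):
    """True if [term_start, term_end] shares at least one day with the study period."""
    fmt = "%Y-%m-%d"
    ts, te = datetime.strptime(term_start, fmt), datetime.strptime(term_end, fmt)
    ss, se = datetime.strptime(study_start, fmt), datetime.strptime(study_end, fmt)
    # two closed intervals overlap when each one starts before the other ends
    return ts <= se and te >= ss

print(term_overlaps_study("2013-01-03", "2015-01-03"))  # True
print(term_overlaps_study("2009-01-06", "2011-01-03"))  # False
print(term_overlaps_study("2025-01-03", "2031-01-03"))  # True
```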


In [5]:
# extract speaker information and determine party from Congressional Record text
# returns party and details about how the match was made
def get_party_from_speech(speech_text, legislators_df):
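    # speaker pattern: title, last name in capitals, optional "of <State>"
    # the first alternative captures two-word states such as "New York" or "West Virginia"
    speaker_match = re.search(r'(?:Mr\.|Mrs\.|Ms\.) ([A-Z]+)(?:\s+of\s+((?:New|North|South|West|Rhode)\s+[A-Za-z]+|[A-Za-z]+))?', speech_text)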
    
    if not speaker_match:
        return None, "No speaker pattern found"
    
    last_name = speaker_match.group(1)
    state_name = speaker_match.group(2)
    
    matches = legislators_df[legislators_df['last_name_upper'] == last_name]
    
    if len(matches) == 0:
        return None, f"No match found for {last_name}"
    
    if len(matches) == 1:
        # single match - straightforward case
        return matches.iloc[0]['party'], "Unique last name match"
    
    # multiple matches - try to narrow down with state
    if state_name:
        # convert full state name to abbreviation
        state_abbrev = state_name_to_abbrev(state_name)
        state_matches = matches[matches['state'] == state_abbrev]
        
        if len(state_matches) == 1:
            return state_matches.iloc[0]['party'], f"Resolved with state ({state_abbrev})"
        elif len(state_matches) > 1:
            # still multiple matches with same state
            # sort by term_end to get the most recent/current legislator
            recent_match = state_matches.sort_values('term_end', ascending=False).iloc[0]
            return recent_match['party'], f"Multiple matches with state, using most recent ({recent_match['full_name']})"
    
    # no state or state didn't narrow it down - use most recent term
    recent_match = matches.sort_values('term_end', ascending=False).iloc[0]
    return recent_match['party'], f"Multiple matches, using most recent ({recent_match['full_name']})"

# helper function to get state name to abbrev
def state_name_to_abbrev(state_name):
    states = {
        'alabama': 'AL', 'alaska': 'AK', 'arizona': 'AZ', 'arkansas': 'AR', 'california': 'CA',
        'colorado': 'CO', 'connecticut': 'CT', 'delaware': 'DE', 'florida': 'FL', 'georgia': 'GA',
        'hawaii': 'HI', 'idaho': 'ID', 'illinois': 'IL', 'indiana': 'IN', 'iowa': 'IA',
        'kansas': 'KS', 'kentucky': 'KY', 'louisiana': 'LA', 'maine': 'ME', 'maryland': 'MD',
        'massachusetts': 'MA', 'michigan': 'MI', 'minnesota': 'MN', 'mississippi': 'MS', 'missouri': 'MO',
        'montana': 'MT', 'nebraska': 'NE', 'nevada': 'NV', 'new hampshire': 'NH', 'new jersey': 'NJ',
        'new mexico': 'NM', 'new york': 'NY', 'north carolina': 'NC', 'north dakota': 'ND', 'ohio': 'OH',
        'oklahoma': 'OK', 'oregon': 'OR', 'pennsylvania': 'PA', 'rhode island': 'RI', 'south carolina': 'SC',
        'south dakota': 'SD', 'tennessee': 'TN', 'texas': 'TX', 'utah': 'UT', 'vermont': 'VT',
        'virginia': 'VA', 'washington': 'WA', 'west virginia': 'WV', 'wisconsin': 'WI', 'wyoming': 'WY'
    }
    
    return states.get(state_name.lower(), state_name)

# count term-pair usage in a single speech and attribute it to the speaker's party
def analyze_speech_by_party(speech_text, legislators_df, term_pairs):
    """
    Analyze usage of immigration term pairs in a speech text
    and attribute them to the speaker's party
    """
    party, match_details = get_party_from_speech(speech_text, legislators_df)
    
    if not party:
        return None, f"Could not determine party: {match_details}"
    
    # initialize dictionary to store a count for each term
    term_counts = {term: 0 for pair in term_pairs for term in pair}
    
    # count occurrences of each term in the speech
    for pair in term_pairs:
        for term in pair:
            # count how many times the term appears in the speech (case insensitive)
            count = len(re.findall(r'\b' + re.escape(term) + r'\b', speech_text, re.IGNORECASE))
            term_counts[term] = count
    
    return party, term_counts
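
Here is a small standalone sketch of the counting step in `analyze_speech_by_party`, without the party lookup. The helper name `count_terms` and the sample sentence are invented for illustration; the sentence is not taken from the Congressional Record.

```python
import re

def count_terms(text, term_groups):
    """Count case-insensitive, whole-word occurrences of every term in every group."""
    counts = {}
    for group in term_groups:
        for term in group:
            pattern = r"\b" + re.escape(term) + r"\b"
            counts[term] = len(re.findall(pattern, text, re.IGNORECASE))
    return counts

# invented sentence for demonstration
sample = "Dreamers deserve a path to citizenship, not amnesty. A path to citizenship is fair."
groups = [("path to citizenship", "amnesty"), ("dreamers", "daca recipients")]
print(count_terms(sample, groups))
# {'path to citizenship': 2, 'amnesty': 1, 'dreamers': 1, 'daca recipients': 0}
```

Because of the `\b` on both sides, inflected forms are not counted: "alien" will not match "aliens", and "illegal" will not match "illegally". That may or may not be what we want for the frequency comparison.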



#### SECTION 3: DATA PROCESSING


In [6]:

# function to identify immigration-related files
def identify_immigration_files(config):
    """
    searches through all HTML files and identifies those containing immigration-related terms.
    
    Args:
        config (dict): Project configuration
        
    Returns:
        pandas.DataFrame: DataFrame of immigration-related files
    """
    raw_data_dir = config["directories"]["raw_data_dir"]
    processed_dir = config["directories"]["processed_dir"]
    immigration_terms = config["constants"]["immigration_terms"]
    
    # list of all HTML files
    html_files = glob.glob(os.path.join(raw_data_dir, "*.html"))
    total_files = len(html_files)
    
    print(f"Found {total_files} HTML files in {raw_data_dir}")
    
    # check the first few files to make sure we can access them
    if total_files > 0:
        print("\nSample filenames:")
        for file in html_files[:5]:
            print(f"  - {os.path.basename(file)}")
        
        # try to open one file to verify access
        try:
            with open(html_files[0], 'r', encoding='utf-8') as f:
                first_chars = f.read(200)
            print("\nSuccessfully read first file. First 200 characters:")
            print(first_chars.replace('\n', ' ')[:200])
        except Exception as e:
            print(f"\nError reading file: {e}")
    
    # search for immigration-related content
    immigration_files = []
    print(f"Searching {total_files} files for immigration content...")

    for file in tqdm(html_files, desc="Searching files for immigration terms"):
        try:
            with open(file, 'r', encoding='utf-8') as f:
                content = f.read().lower()
                
            # check each term with its specific regex pattern
            found_terms = []
            for term, pattern in immigration_terms.items():
                if re.search(pattern, content):
                    found_terms.append(term)
            
            if found_terms:
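                # extract date from filename (format: CREC-YYYY-MM-DD-<granule_id>.html)
                filename = os.path.basename(file)
                date_parts = filename.split('-')
                if len(date_parts) >= 4:
                    # join year, month, and day; date_parts[1] alone is only the year
                    date = '-'.join(date_parts[1:4])
                else:
                    date = "Unknown"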
                
                immigration_files.append({
                    'file': file,
                    'date': date,
                    'terms': ', '.join(found_terms)  # convert list to string
                })
        except Exception as e:
            print(f"Error processing {os.path.basename(file)}: {e}")

    # save results to CSV
    if immigration_files:
        immigration_df = pd.DataFrame(immigration_files)
        csv_path = os.path.join(processed_dir, "immigration_files.csv")
        immigration_df.to_csv(csv_path, index=False)
        
        print(f"\nFound {len(immigration_files)} files with immigration content")
        print(f"List saved to: {csv_path}")
        
        # show sample of found files
        print("\nSample immigration-related files:")
        for file_info in immigration_files[:5]:
            print(f"  - {os.path.basename(file_info['file'])}: {file_info['terms']}")
    else:
        print("No immigration-related files found.")
        immigration_df = pd.DataFrame()
    
    return immigration_df

# function to extract structured data from HTML
def parse_congressional_record(file_path):
    """
    Parse a Congressional Record HTML file to extract structured data,
    making better use of HTML structure and BeautifulSoup capabilities.
   
    Args:
        file_path (str): Path to the HTML file
       
    Returns:
        dict: Dictionary of extracted data, or None if parsing failed
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
       
        soup = BeautifulSoup(content, 'html.parser')
        
        # grab the pre element which contains all the content
        pre_content = soup.find('pre')
        if not pre_content:
            return None
            
        # get the title from the <title> tag instead of regex if available
        title_tag = soup.find('title')
        page_title = title_tag.get_text() if title_tag else "Unknown"
        
        # extract the full text
        full_text = pre_content.get_text()
        
        # parse header information and date from the congressional record header
        header_text = pre_content.contents[0] if pre_content.contents else ""
        date_match = re.search(r'\[Congressional Record Volume \d+, Number \d+ \(([^)]+)\)\]', str(header_text))
        date = date_match.group(1) if date_match else "Unknown"
        
        # determine chamber from the header section (more reliable than searching the whole text)
        chamber_lines = [line for line in str(header_text).split('\n') if '[House]' in line or '[Senate]' in line]
        chamber = "House" if chamber_lines and '[House]' in chamber_lines[0] else \
                 "Senate" if chamber_lines and '[Senate]' in chamber_lines[0] else "Unknown"
        
        # extract links if present, could be useful metadata
        links = []
        for a_tag in pre_content.find_all('a'):
            href = a_tag.get('href', '')
            text = a_tag.get_text()
            links.append({"href": href, "text": text})
        
        # extract speaker information, look specifically for the parenthetical permission section
        # regex find
        speaker_section = re.search(r'\(((?:Mr\.|Mrs\.|Ms\.|Senator|Representative)\s+[A-Z]+[^)]*)\)', full_text)
        
        speaker_full = "Unknown"
        speaker_last = "Unknown"
        
        if speaker_section:
            speaker_text = speaker_section.group(1)
            # extract the actual speaker name from this section
            speaker_match = re.search(r'((?:Mr\.|Mrs\.|Ms\.|Senator|Representative)\s+([A-Z]+))', speaker_text)
            if speaker_match:
                speaker_full = speaker_match.group(1)
                speaker_last = speaker_match.group(2)
        
        # extract title - look for content after the speaker declaration
        # In Congressional Record format, typically the title/topic appears after the speaker is introduced
        title = "Unknown"
        title_section = re.search(r'to address the House[^.]*\.\)\s+([A-Z][A-Z\s\'",.()-]+?)\s*\n', full_text)
        if title_section:
            title = title_section.group(1).strip()
        
        # get granule ID from filename
        filename = os.path.basename(file_path)
        granule_id = filename.replace(".html", "")
        
        # extract page number which is often important for citation
        page_match = re.search(r'\[Page ([^\]]+)\]', full_text)
        page_number = page_match.group(1) if page_match else "Unknown"
        
        return {
            'file_id': granule_id,
            'date': date,
            'chamber': chamber,
            'speaker_full': speaker_full,
            'speaker_last': speaker_last,
            'title': title,
            'page_number': page_number,
            'links': links,
            'page_title': page_title,
            'full_text': full_text
        }
    except Exception as e:
        print(f"Error parsing {os.path.basename(file_path)}: {e}")
        return None
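
The header date that `parse_congressional_record` pulls out is a long-form string, which later needs converting to `YYYY-MM-DD`. The sketch below combines both steps on a single illustrative header line. The helper name `header_date_to_iso` is invented, and the weekday and month names assume an English locale.

```python
import re
from datetime import datetime

# same header pattern as in parse_congressional_record
HEADER_RE = re.compile(r"\[Congressional Record Volume \d+, Number \d+ \(([^)]+)\)\]")

def header_date_to_iso(header_text):
    """Extract the parenthesised date from a header line and return it as YYYY-MM-DD, or None."""
    match = HEADER_RE.search(header_text)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1), "%A, %B %d, %Y").strftime("%Y-%m-%d")
    except ValueError:
        return None

# illustrative header line in the format the regex expects
header = "[Congressional Record Volume 165, Number 1 (Thursday, January 3, 2019)]"
print(header_date_to_iso(header))  # 2019-01-03
```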



In [7]:
# function to process all immigration-related files
def process_immigration_files(config, immigration_df=None):
    """
    Process all immigration-related files to extract structured data.
    
    Args:
        config (dict): Project configuration
        immigration_df (pandas.DataFrame, optional): df of immigration-related files
            If None, the function will try to load it from a file
            
    Returns:
        pandas.DataFrame: df of processed immigration speeches
    """
    processed_dir = config["directories"]["processed_dir"]
    
    # if no df is provided, try to load it from a file
    if immigration_df is None:
        immigration_files_csv = os.path.join(processed_dir, "immigration_files.csv")
        if not os.path.exists(immigration_files_csv):
            print(f"Error: Immigration files list not found at {immigration_files_csv}")
            return None
        
        immigration_df = pd.read_csv(immigration_files_csv)
    
    print(f"Processing {len(immigration_df)} immigration-related files...")
    
    # process each file in the immigration list
    parsed_data = []
    for _, row in tqdm(immigration_df.iterrows(), total=len(immigration_df), desc="Parsing HTML files"):
        file_path = row['file']
        extracted_data = parse_congressional_record(file_path)
        
        if extracted_data:
            # add the immigration terms found
            extracted_data['immigration_terms'] = row['terms']
            parsed_data.append(extracted_data)
    
    # create a df and save to CSV
    if parsed_data:
        parsed_df = pd.DataFrame(parsed_data)
        csv_path = os.path.join(processed_dir, "immigration_speeches.csv")
        parsed_df.to_csv(csv_path, index=False)
        
        print(f"\nSuccessfully parsed {len(parsed_data)} files")
        print(f"Data saved to: {csv_path}")
        
        # print summary of speakers found
        speaker_counts = parsed_df['speaker_last'].value_counts()
        print(f"\nTop 10 speakers in the dataset:")
        print(speaker_counts.head(10))
        
        # print example of first record
        print("\nExample of parsed data (first record):")
        for key, value in parsed_data[0].items():
            if key == 'full_text':
                print(f"{key}: {value[:200]}...") # Print only first 200 chars of text
            else:
                print(f"{key}: {value}")
    else:
        print("No data could be parsed from the files.")
        parsed_df = pd.DataFrame()
    
    return parsed_df


In [8]:
# function to clean and improve data
def clean_data(df, config):
    """
    Clean and enhance the parsed data.
    
    Args:
        df (pandas.DataFrame): DataFrame of parsed speeches
        config (dict): Project configuration
        
    Returns:
        tuple: (DataFrame of all cleaned records, DataFrame of actual speeches only)
    """
    # create a copy to avoid modifying the original
    cleaned_df = df.copy()
    
    # 1. Convert dates to standard format
    def standardize_date(date_str):
        try:
            if pd.isna(date_str) or date_str == "Unknown":
                return None
            # parse date string to datetime object
            date_obj = datetime.strptime(date_str, "%A, %B %d, %Y")
            # convert to standard format
            return date_obj.strftime("%Y-%m-%d")
        except (ValueError, TypeError):
            # leave unparseable dates unchanged so they can be inspected later
            return date_str
    
    cleaned_df['date_standard'] = cleaned_df['date'].apply(standardize_date)
    
    # 2. Identify real speeches vs. procedural text
    def is_real_speech(row):
        # check if it's likely a speech by a member of Congress
        # if speaker is Unknown, probably not a speech
        if row['speaker_last'] == "Unknown":
            return False
        # check for procedural titles
        procedural_titles = ['HOUSE', 'SENATE', 'PRAYER', 'PLEDGE', 'ADJOURNMENT', 
                            'RECESS', 'AMENDMENT', 'RECORD', 'MOTION', 'RESOLUTION']
        if any(title in row['title'] for title in procedural_titles):
            return False
        # check for very short texts (likely not speeches)
        if len(row['full_text']) < 500:
            return False
            
        return True
    
    cleaned_df['is_speech'] = cleaned_df.apply(is_real_speech, axis=1)
    
    # 3. Categorize speech type
    def categorize_speech(row):
        text = row['full_text'].lower()
        
        if not row['is_speech']:
            return "procedural"
            
        categories = {
            "border_security": ["border security", "border wall", "border crisis"],
            "legal_status": ["undocumented", "illegal alien", "unauthorized", "amnesty", "path to citizenship"],
            "children": ["daca", "dreamer", "child", "family separation"],
            "asylum": ["asylum", "refugee", "humanitarian"],
            "general": ["immigration", "immigrant", "migrant"]
        }
        
        for category, terms in categories.items():
            if any(term in text for term in terms):
                return category
                
        return "other"
    
    cleaned_df['speech_category'] = cleaned_df.apply(categorize_speech, axis=1)
    
    # 4. Extract a summary from the full text (first 300 characters)
    def extract_summary(text):
        # remove header content in square brackets
        text = re.sub(r'\[.*?\]', '', text)
        # remove whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        # take first 300 characters
        return text[:300] + "..." if len(text) > 300 else text
    
    cleaned_df['speech_summary'] = cleaned_df['full_text'].apply(extract_summary)
    
    # 5. Add party information using the legislator matching
    def get_party_info(row):
        """Get party information for the speech using the speaker matching function"""
        # legislators_df is not passed in; it is expected to be defined in an earlier cell
        party, _match_details = get_party_from_speech(row['full_text'], legislators_df)
        return party if party else None
    
    cleaned_df['party'] = cleaned_df.apply(get_party_info, axis=1)
    
    # 6. Count tokens (words) in each speech
    def count_tokens(text):
        try:
            # simple tokenization: count runs of word characters
            return len(re.findall(r'\b\w+\b', text))
        except TypeError:
            # non-string values (e.g. NaN) count as zero tokens
            return 0
    
    cleaned_df['token_count'] = cleaned_df['full_text'].apply(count_tokens)
    
    # 7. Count sentences in each speech
    def count_sentences(text):
        try:
            # simplistic sentence splitting (may not be perfect)
            return len(re.findall(r'[.!?]+', text)) + 1
        except TypeError:
            return 0
    
    cleaned_df['sentence_count'] = cleaned_df['full_text'].apply(count_sentences)
    
    # 8. Add page number from the improved parser
    if 'page_number' in df.columns:
        cleaned_df['page_number'] = df['page_number']
    
    # 9. Extract links from the improved parser if available
    if 'links' in df.columns:
        cleaned_df['links'] = df['links']
    
    # 10. Keep only relevant columns in a useful order
    columns_order = [
        'file_id', 'date_standard', 'chamber', 'speaker_full', 'speaker_last', 
        'party', 'title', 'is_speech', 'speech_category', 'speech_summary', 
        'token_count', 'sentence_count', 'page_number', 'immigration_terms', 'full_text'
    ]
    
    # filter columns that actually exist in the DataFrame
    columns_order = [col for col in columns_order if col in cleaned_df.columns]
    
    # return only the columns we want and create filtered dataset with only actual speeches
    cleaned_df = cleaned_df[columns_order]
    speeches_only = cleaned_df[cleaned_df['is_speech']]
    
    return (cleaned_df, speeches_only)
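
One detail of `categorize_speech` is easy to miss: it returns the first category whose term list has a substring hit, so the order of the dictionary decides ties, and a short term such as "child" also matches "children". The sketch below copies the category lists from the function above and runs them on three made-up sentences. The helper name `first_matching_category` and the sentences are assumptions for illustration only.

```python
# Same category lists as in categorize_speech; dict order decides ties.
CATEGORIES = {
    "border_security": ["border security", "border wall", "border crisis"],
    "legal_status": ["undocumented", "illegal alien", "unauthorized", "amnesty", "path to citizenship"],
    "children": ["daca", "dreamer", "child", "family separation"],
    "asylum": ["asylum", "refugee", "humanitarian"],
    "general": ["immigration", "immigrant", "migrant"],
}


def first_matching_category(text):
    """Return the first category whose term list has a substring hit in text."""
    lowered = text.lower()
    for category, terms in CATEGORIES.items():
        if any(term in lowered for term in terms):
            return category
    return "other"


# Hypothetical sentences, written for illustration only.
examples = [
    "We must protect Dreamers and fund the border wall.",
    "Refugee children deserve a hearing.",
    "This bill renames a post office.",
]
labels = [first_matching_category(s) for s in examples]
print(labels)
```

The second sentence mentions refugees but is labeled by the earlier "children" list, so a speech gets exactly one label even when several categories apply.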

*The following runs the data processing pipeline*

In [20]:
# run the data processing pipeline
def run_data_processing(config):
    """
    Run the data processing pipeline, focusing only on preprocessing.
    
    Args:
        config (dict): Project configuration
        
    Returns:
        tuple: (DataFrame of all cleaned records, DataFrame of speeches only)
    """
    # Step 1: Identify immigration-related files
    immigration_df = identify_immigration_files(config)
    
    # Step 2: Process immigration-related files
    speeches_df = process_immigration_files(config, immigration_df)
    
    # Step 3: Clean and enhance the data
    if speeches_df is not None and not speeches_df.empty:
        cleaned_df, speeches_only = clean_data(speeches_df, config)
        
        # save
        processed_dir = config["directories"]["processed_dir"]
        cleaned_df.to_csv(os.path.join(processed_dir, "immigration_data_clean.csv"), index=False)
        speeches_only.to_csv(os.path.join(processed_dir, "immigration_speeches_clean.csv"), index=False)
        
        # summary
        print(f"\nCreated cleaned dataset with {len(cleaned_df)} records")
        print(f"Created filtered dataset with {len(speeches_only)} actual speeches")

        # party distribution
        if 'party' in speeches_only.columns:
            party_counts = speeches_only['party'].value_counts()
            print("\nParty distribution in speeches:")
            print(party_counts)

        # category distribution
        category_counts = speeches_only['speech_category'].value_counts()
        print("\nSpeech category distribution:")
        print(category_counts)
        
        return (cleaned_df, speeches_only)
    else:
        print("No data to clean.")
        return (None, None)

# run data processing (comment out the next line to skip)
cleaned_df, speeches_only = run_data_processing(config)

Found 14629 HTML files in c:\Users\Kevin\Downloads\LIN350Project\data\congressional_record

Sample filenames:
  - CREC-2017-09-01-CREC-2017-09-01-pt1-PgD909-2.html
  - CREC-2017-09-01-CREC-2017-09-01-pt1-PgD909-3.html
  - CREC-2017-09-01-CREC-2017-09-01-pt1-PgD909-4.html
  - CREC-2017-09-01-CREC-2017-09-01-pt1-PgD909-5.html
  - CREC-2017-09-01-CREC-2017-09-01-pt1-PgD909-6.html

Successfully read first file. First 200 characters:
<html> <head> <title>Congressional Record, Volume 163 Issue 141 (Friday, September 1, 2017)</title> </head> <body><pre> [Congressional Record Volume 163, Number 141 (Friday, September 1, 2017)] [Daily
Searching 14629 files for immigration content...


Searching files for immigration terms:   0%|          | 0/14629 [00:00<?, ?it/s]


Found 1785 files with immigration content
List saved to: c:\Users\Kevin\Downloads\LIN350Project\processed_data\immigration_files.csv

Sample immigration-related files:
  - CREC-2017-09-01-CREC-2017-09-01-pt1-PgD909-6.html: visa
  - CREC-2017-09-01-CREC-2017-09-01-pt1-PgE1151-4.html: refugee
  - CREC-2017-09-01-CREC-2017-09-01-pt1-PgE1152-3.html: immigration, immigrant, migrant, citizenship, deportation, undocumented, daca, dreamer, visa
  - CREC-2017-09-01-CREC-2017-09-01-pt1-PgE1154-4.html: undocumented, mexico
  - CREC-2017-09-01-CREC-2017-09-01-pt1-PgH6632-6.html: mexico
Processing 1785 immigration-related files...


Parsing HTML files:   0%|          | 0/1785 [00:00<?, ?it/s]


Successfully parsed 1785 files
Data saved to: c:\Users\Kevin\Downloads\LIN350Project\processed_data\immigration_speeches.csv

Top 10 speakers in the dataset:
speaker_last
Unknown    1145
C            58
S            54
B            43
R            42
M            39
H            36
G            27
T            26
E            24
Name: count, dtype: int64

Example of parsed data (first record):
file_id: CREC-2017-09-01-CREC-2017-09-01-pt1-PgD909-6
date: Friday, September 1, 2017
chamber: Unknown
speaker_full: Unknown
speaker_last: Unknown
title: Unknown
page_number: D910
links: [{'href': 'https://www.gpo.gov', 'text': 'www.gpo.gov'}, {'href': 'http://www.govinfo.gov', 'text': 'www.govinfo.gov'}, {'href': 'mailto:contactcenter@gpo.gov', 'text': 'contactcenter@gpo.gov'}]
page_title: Congressional Record, Volume 163 Issue 141 (Friday, September 1, 2017)
full_text: 
[Congressional Record Volume 163, Number 141 (Friday, September 1, 2017)]
[Daily Digest]
[Pages D909-D910]
From the Congressi

In [None]:
# clean the full_text column in a CSV file to normalize whitespace
def clean_whitespace_in_csv(input_file, output_file=None):

    # determine output filename if not provided
    if output_file is None:
        base, ext = os.path.splitext(input_file)
        output_file = f"{base}_cleaned{ext}"
    
    print(f"Reading CSV file: {input_file}")
    
    try:
        df = pd.read_csv(input_file, low_memory=False)
        if 'full_text' not in df.columns:
            print("Warning: 'full_text' column not found in CSV. Available columns:")
            print(", ".join(df.columns))
            return None
        
        total_rows = len(df)
        print(f"Processing {total_rows} rows...")
        
        def clean_text(text):
            if pd.isna(text):
                return text
                
            # collapse every run of whitespace (including newlines) into one space
            cleaned = re.sub(r'\s+', ' ', str(text)).strip()
            return cleaned
        
        print("Cleaning full_text column...")
        df['full_text'] = df['full_text'].apply(clean_text)
        
        print(f"Saving cleaned data to: {output_file}")
        df.to_csv(output_file, index=False)
        
        # print sample rows for verification
        print("\nSample of cleaned text:")
        for i, row in df.head(2).iterrows():
            print(f"Row {i+1} (first 100 chars): {str(row['full_text'])[:100]}...")
        
        print(f"\nSuccessfully processed {total_rows} rows.")
        print(f"Cleaned CSV saved to: {output_file}")
        
        return output_file
        
    except Exception as e:
        print(f"Error processing CSV file: {e}")
        return None

# example usage:
clean_whitespace_in_csv('processed_data/immigration_speeches.csv')
clean_whitespace_in_csv('processed_data/immigration_data_clean.csv')
clean_whitespace_in_csv('processed_data/immigration_speeches_clean.csv')
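
A quick way to see what the whitespace normalization does: collapsing `\s+` to a single space also removes every newline, so a newline-based pattern applied afterwards finds nothing to match. The sample string below is a made-up stand-in for a Congressional Record header, not real file content.

```python
import re

# Hypothetical snippet mimicking the layout of a Congressional Record page.
raw = "[House]\n[Pages H6638-H6641]\n\n\n   BOB DOLE   CONGRESSIONAL GOLD MEDAL ACT\n"

# Collapse every run of spaces, tabs, and newlines into one space.
collapsed = re.sub(r"\s+", " ", raw).strip()
print(collapsed)

# No newline survives the substitution above, so a blank-line pattern
# applied afterwards has nothing left to match.
remaining_blank_lines = re.findall(r"\n\s*\n", collapsed)
print(len(remaining_blank_lines))
```

This also means paragraph boundaries are lost in the cleaned files, which matters if a later step needs them.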


1. `immigration_speeches.csv`:
   - The raw parsed data from your Congressional Record HTML files
   - Contains all the immigration-related speeches and procedural text
   - Includes metadata like date, speaker, chamber, etc., along with the full text extracted from HTML files
   - This is the initial dataset created by the `process_immigration_files` function

2. `immigration_data_clean.csv`:
   - Contains all records (both speeches and procedural text) with cleaned and enhanced data
   - Includes additional columns like standardized dates, party information, speech categorization
   - Adds summary text and metrics like token count and sentence count
   - This is the complete dataset after basic preprocessing

3. `immigration_speeches_clean.csv`:
   - A filtered subset of `immigration_data_clean.csv` containing only actual speeches (no procedural text)
   - Uses the `is_speech` flag to filter out non-speech content
   - This is the dataset you'd use for analyzing actual Congressional speeches about immigration


#### SECTION 4: LINGUISTIC TEXT ANALYSIS

In [None]:
# explore and visualize the data to get a small understanding of it
df = pd.read_csv(os.path.join("processed_data", "immigration_speeches_clean.csv"))
first_row = df.iloc[0]
print(first_row["full_text"])


[Congressional Record Volume 163, Number 142 (Tuesday, September 5, 2017)]
[House]
[Pages H6638-H6641]
From the Congressional Record Online through the Government Publishing Office [www.gpo.gov]




                 BOB DOLE CONGRESSIONAL GOLD MEDAL ACT

  Mr. HULTGREN. Mr. Speaker, I move to suspend the rules and pass the 
bill (S. 1616) to award the Congressional Gold Medal to Bob Dole, in 
recognition for his service to the nation as a soldier, legislator, and 
statesman.
  The Clerk read the title of the bill.
  The text of the bill is as follows:

                                S. 1616

       Be it enacted by the Senate and House of Representatives of 
     the United States of America in Congress assembled,

     SECTION 1. SHORT TITLE.

       This Act may be cited as the ``Bob Dole Congressional Gold 
     Medal Act''.

     SEC. 2. FINDINGS.

       Congress finds the following:
       (1) Bob Dole was born on July 22, 1923, in Russell, Kansas.
       (2) Growing up during 

In [None]:
df = pd.read_csv("processed_data/no_whitespace/immigration_speeches_clean_cleaned.csv")

print("Number of rows:", len(df))
print("Party value counts:")
print(df["party"].value_counts(dropna=False))

# NOTE: the party column stores "Democrat", not "Democratic", so this filter returns no rows
print("\nExample Democratic rows:")
print(df[df["party"] == "Democratic"][["full_text"]].head(3))


Number of rows: 640
Party value counts:
party
NaN            244
Democrat       199
Republican     194
Independent      3
Name: count, dtype: int64

Example Democratic rows:
Empty DataFrame
Columns: [full_text]
Index: []
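
The empty result comes from a label mismatch: the file stores "Democrat", while the filter asks for "Democratic". The sketch below rebuilds the label column from the counts printed above (with `None` standing in for NaN) and shows the effect of mapping the short label to the long one before filtering. The names `LABEL_MAP` and `normalized` are assumptions for illustration.

```python
from collections import Counter

# Label counts copied from the value_counts() output above; None stands in for NaN.
labels = ["Democrat"] * 199 + ["Republican"] * 194 + ["Independent"] * 3 + [None] * 244

# Filtering on "Democratic" matches nothing because the stored label is "Democrat".
before = sum(1 for label in labels if label == "Democratic")

# Map the short label to the long one; leave every other value untouched.
LABEL_MAP = {"Democrat": "Democratic"}
normalized = [LABEL_MAP.get(label, label) for label in labels]
after = Counter(normalized)

print(before, after["Democratic"], after["Republican"])
```

In pandas the same mapping is a one-line `replace` on the column, which is what the term frequency analysis later does before counting.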


##### Proposed analysis pipeline

1. Term Frequency (tf-df) Analysis
- Use the cleaned speeches dataset (immigration_speeches_clean.csv)
- Calculate relative frequencies of key immigration terms for each party
- Perform chi-square tests to determine statistical significance of term usage differences

2. Contextual Analysis (TBD)
- Extract 5-word windows around key terms
- Use TF-IDF to identify distinctive contextual words for each party
- Conduct collocation analysis to measure significant word co-occurrences

3. Temporal Trend Analysis
- Analyze terminology usage across different time periods (2018-2021)
- Track shifts in terminology for each party over time
- Visualize changes using time series plots

4. Party Comparison
- Compare terminology usage between:
  - Democrats vs. Republicans
  - Border state representatives vs. non-border state representatives
  - Senators vs. Representatives

5. Advanced Text Analysis Techniques
- Topic modeling (LDA) to identify immigration-related speech topics
- Analyze topic distribution between parties
- Track topic evolution over time

6. Statistical Validation
- Chi-square tests for word usage differences
- Correlation analysis between terminology and voting patterns
- Time series analysis of terminology shifts
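
Step 2 of the plan (context windows around key terms) is still marked TBD, so here is a minimal sketch of one way it could work, assuming single-word terms and the same word-character tokenization used for `token_count`. The function name `context_windows`, its parameters, and the sample sentence are all hypothetical.

```python
import re


def context_windows(text, pattern, window=5):
    """Return up to `window` tokens on each side of every token matching `pattern`."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    windows = []
    for i, token in enumerate(tokens):
        if re.fullmatch(pattern, token):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            windows.append((left, token, right))
    return windows


# Hypothetical sentence, written for illustration only.
sample = "We will secure the border and protect every Dreamer who calls this country home."
result = context_windows(sample, r"dreamers?", window=3)
for left, hit, right in result:
    print(left, hit, right)
```

Multi-word terms such as "border wall" would need matching on the token sequence rather than on single tokens, which this sketch does not attempt.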

##### 1. Term Frequency (tf-df) Analysis

In [40]:
# count occurrences of each term using provided regex patterns
def count_term_occurrences(text, term_patterns):
    term_counts = {}
    for term, pattern in term_patterns.items():
        term_counts[term] = len(re.findall(pattern, str(text), re.IGNORECASE))
    return term_counts

# analyze term frequency by party using regex-based term patterns
def analyze_party_term_frequency(speeches_df, term_patterns):
    terms = list(term_patterns.keys())
    
    # add precomputed matches to the DataFrame for all speeches
    speeches_df['term_hits'] = speeches_df['full_text'].apply(lambda text: count_term_occurrences(text, term_patterns))
    # normalize party labels ("Democrat" -> "Democratic"; other labels are unchanged)
    speeches_df['party'] = speeches_df['party'].replace({'Democrat': 'Democratic'})

    # Initialize totals
    party_term_counts = {
        'Democratic': {term: 0 for term in terms},
        'Republican': {term: 0 for term in terms}
    }
    party_speech_counts = {'Democratic': 0, 'Republican': 0}

    # aggregate counts
    for _, row in speeches_df.iterrows():
        party = row['party']
        if party in party_term_counts:
            party_speech_counts[party] += 1
            for term in terms:
                party_term_counts[party][term] += row['term_hits'][term]

    term_freq_df = pd.DataFrame(party_term_counts).fillna(0)

    # chi-square calculations
    chi_square_results = {}
    for term in terms:
        dem_with_term = sum(1 for _, row in speeches_df[speeches_df['party'] == 'Democratic'].iterrows()
                            if row['term_hits'][term] > 0)
        rep_with_term = sum(1 for _, row in speeches_df[speeches_df['party'] == 'Republican'].iterrows()
                            if row['term_hits'][term] > 0)

        contingency_table = np.array([
            [dem_with_term, party_speech_counts['Democratic'] - dem_with_term],
            [rep_with_term, party_speech_counts['Republican'] - rep_with_term]
        ])

        try:
            chi2, p_value, dof, expected = chi2_contingency(contingency_table)
            chi_square_results[term] = {
                'chi2': chi2,
                'p_value': p_value,
                'significant': p_value < 0.10  # Use 0.05 for strict, 0.10 for exploratory
            }
        except ValueError:
            chi_square_results[term] = {
                'chi2': None,
                'p_value': None,
                'significant': False
            }

    # additional metrics
    term_freq_df['Total'] = term_freq_df['Democratic'] + term_freq_df['Republican']
    term_freq_df['Dem_Proportion'] = term_freq_df['Democratic'] / term_freq_df['Total']
    term_freq_df['Rep_Proportion'] = term_freq_df['Republican'] / term_freq_df['Total']
    term_freq_df['Chi2_Statistic'] = [chi_square_results[term]['chi2'] for term in terms]
    term_freq_df['P_Value'] = [chi_square_results[term]['p_value'] for term in terms]
    term_freq_df['Statistically_Significant'] = [chi_square_results[term]['significant'] for term in terms]

    return term_freq_df, chi_square_results


#  create separate visualizations of term frequency and proportions
def visualize_term_frequency(term_freq_df):
    """
    Save two bar charts: raw term counts by party, and each party's share of mentions per term.
    """
    plot_df = term_freq_df.reset_index()
    party_cols = [col for col in ['Democratic', 'Republican', 'Independent'] if col in term_freq_df.columns]

    freq_df = plot_df.melt(id_vars=['index'], value_vars=party_cols,
                           var_name='Party', value_name='Count')
    plt.figure(figsize=(14, 7))
    sns.barplot(x='index', y='Count', hue='Party', data=freq_df)
    plt.title('Immigration Term Frequency by Party')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig('term_frequency_by_party.png', bbox_inches='tight')
    plt.close()

    # proportional bar plot; terms with no mentions get NaN instead of a divide-by-zero
    proportion_df = plot_df.copy()
    row_totals = proportion_df[party_cols].sum(axis=1).replace(0, np.nan)
    for party in party_cols:
        proportion_df[f'{party}_Proportion'] = proportion_df[party] / row_totals

    proportion_df = proportion_df.melt(id_vars='index',
        value_vars=[f'{p}_Proportion' for p in party_cols],
        var_name='Party', value_name='Proportion')

    proportion_df['Party'] = proportion_df['Party'].str.replace('_Proportion', '')

    plt.figure(figsize=(14, 7))
    sns.barplot(x='index', y='Proportion', hue='Party', data=proportion_df)
    plt.title('Proportional Term Usage by Party')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig('proportional_term_usage.png', bbox_inches='tight')
    plt.close()


def run_term_frequency_analysis(config):
    speeches_path = os.path.join(config['directories']['processed_dir'], 'immigration_speeches_clean.csv')
    speeches_df = pd.read_csv(speeches_path)

    term_patterns = config['constants']['immigration_terms']

    term_freq_df, chi_square_results = analyze_party_term_frequency(speeches_df, term_patterns)
    visualize_term_frequency(term_freq_df)

    term_freq_df.to_csv(os.path.join(config['directories']['processed_dir'], 'party_term_frequency.csv'), index=True)

    print("Term Frequency Analysis Summary:")
    print("\nTop Statistically Significant Terms:")
    print(term_freq_df[term_freq_df['Statistically_Significant']]
        .sort_values('P_Value')[['Democratic', 'Republican', 'Chi2_Statistic', 'P_Value']]
        .head(10))
    significant_terms = term_freq_df[term_freq_df['Statistically_Significant']]
    print(significant_terms[['Democratic', 'Republican', 'Chi2_Statistic', 'P_Value']])

    return term_freq_df, chi_square_results

# run the term frequency analysis (comment out the next line to skip)
results_df, chi_square_results = run_term_frequency_analysis(config)


Term Frequency Analysis Summary:

Top Statistically Significant Terms:
             Democratic  Republican  Chi2_Statistic   P_Value
dreamer             347         768        9.311942  0.002277
immigration         386         761        4.904218  0.026791
ice                  48          65        3.547258  0.059644
wall                298         445        2.901433  0.088501
             Democratic  Republican  Chi2_Statistic   P_Value
immigration         386         761        4.904218  0.026791
wall                298         445        2.901433  0.088501
dreamer             347         768        9.311942  0.002277
ice                  48          65        3.547258  0.059644


---

1. Statistically Significant Terminology Differences (RQ1)

| Term | Dem % | Rep % | p-value | Interpretation |
|------|-------|-------|---------|----------------|
| **immigration** | 33.7% | 66.3% | 0.0268 | Republican speeches account for about two-thirds of raw mentions; significant at p < 0.05 |
| **dreamer** | 31.1% | 68.9% | 0.0023 | Strongest result in the table; most raw mentions come from Republican speeches, possibly surprising given the term's pro-immigrant connotation |
| **wall** | 40.1% | 59.9% | 0.0885 | Border wall rhetoric leans Republican, but only at the exploratory p < 0.10 threshold |
| **ice** | 42.5% | 57.5% | 0.0596 | ICE mentions lean Republican, but only at the exploratory p < 0.10 threshold |
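
The Dem % and Rep % columns are each party's share of the combined raw mentions, not a rate per speech or per token. The p-values come from a different quantity: the chi-square test in `analyze_party_term_frequency` compares how many speeches contain the term at least once. The sketch below recomputes the shares from the counts printed in the output above; the helper name `party_shares` is an assumption.

```python
# Raw mention counts copied from the analysis output above: (Democratic, Republican).
raw_counts = {
    "immigration": (386, 761),
    "dreamer": (347, 768),
    "wall": (298, 445),
    "ice": (48, 65),
}


def party_shares(dem, rep):
    """Return each party's percentage share of the combined raw mentions."""
    total = dem + rep
    return round(100 * dem / total, 1), round(100 * rep / total, 1)


shares = {term: party_shares(dem, rep) for term, (dem, rep) in raw_counts.items()}
for term, (dem_pct, rep_pct) in shares.items():
    print(f"{term:12s} Dem {dem_pct}%  Rep {rep_pct}%")
```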

#### Key Takeaways:

- **Republican speeches account for more raw mentions** of all four flagged terms, including broad umbrella terms like *“immigration”* and *“dreamer.”* These counts are not adjusted for speech length.
- Despite expectations, even **"dreamer"**, often seen as sympathetic, appears more in **Republican** discourse, potentially due to **criticisms or calls for reform**.
- No term skews **statistically** toward Democrats, but they **lean higher** in *“detention”* (66.9%) and *“cbp”* (62%) usage, likely in **critical or oversight contexts**.
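
To make the test behind these p-values concrete, here is a standard-library sketch of a chi-square test on a 2x2 table of speeches with and without a term. It applies Yates' continuity correction, which is also SciPy's default for 2x2 tables in `chi2_contingency`. The counts are made up for illustration and are not taken from the project data; the function name `chi2_2x2` is an assumption.

```python
import math


def chi2_2x2(table, yates=True):
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # shrink each deviation by 0.5, but never past zero
                diff = max(diff - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    # survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are parties, columns are
# (speeches containing the term, speeches without it).
table = [[30, 70], [50, 50]]
chi2, p_value = chi2_2x2(table)
print(round(chi2, 4), round(p_value, 4))
```

Because the test only asks whether a speech contains the term, a party can have more raw mentions without having a larger share of speeches that use it.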

---

#### 2. Temporal Trend Analysis

In [None]:

# Step 1: Compute average normalized frequency (per 1,000 tokens) by year/party/term
def count_yearly_party_term_frequency_normalized(df, term_patterns):
    df = df.copy()
    df['year'] = pd.to_datetime(df['date_standard']).dt.year

    # normalize party labels ("Democrat" -> "Democratic"; other labels are unchanged)
    df['party'] = df['party'].replace({'Democrat': 'Democratic'})

    # count term matches in each full_text
    df['term_hits'] = df['full_text'].apply(lambda text: count_term_occurrences(text, term_patterns))

    # flatten structure and normalize by token count
    rows = []
    for _, row in df.iterrows():
        token_count = row['token_count'] if row['token_count'] > 0 else 1  # avoid division by zero
        for term, raw_count in row['term_hits'].items():
            normalized = (raw_count / token_count) * 1000  # per 1,000 tokens
            rows.append({
                'year': row['year'],
                'party': row['party'],
                'term': term,
                'count_per_1000': normalized
            })

    norm_df = pd.DataFrame(rows)

    # aggregate by year, party, and term (average across speeches)
    grouped = norm_df.groupby(['year', 'party', 'term'])['count_per_1000'].mean().reset_index()
    return grouped

# Step 2: Plot usage trend over time with annotations for events
def plot_term_trends_over_time(grouped_df, highlight_events=True):
    terms = grouped_df['term'].unique()

    for term in terms:
        plt.figure(figsize=(10, 5))
        term_df = grouped_df[grouped_df['term'] == term]
        sns.lineplot(data=term_df, x='year', y='count_per_1000', hue='party', marker='o')

        plt.title(f"Normalized Usage Over Time: '{term}'")
        plt.xlabel("Year")
        plt.ylabel("Mentions per 1,000 Tokens")
        plt.xticks(sorted(grouped_df['year'].unique()))
        plt.legend(title='Party')

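        if highlight_events:
            years = grouped_df['year']
            for year, label in [
                (2018, 'Family Separation'),
                (2020, 'Election Year'),
                (2021, 'Biden Inauguration'),
                (2023, 'End of Title 42')
            ]:
                # skip events outside the plotted years so the x-axis is not stretched
                if not (years.min() <= year <= years.max()):
                    continue
                plt.axvline(x=year, linestyle='--', color='gray', alpha=0.6)
                plt.text(year + 0.1, plt.ylim()[1] * 0.8, label, rotation=90, fontsize=9, alpha=0.7)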

        plt.tight_layout()
        plt.savefig(f'term_trend_{term}.png')
        plt.close()


In [None]:
speeches_df = pd.read_csv(os.path.join(config['directories']['processed_dir'], 'immigration_speeches_clean.csv'))
term_patterns = config['constants']['immigration_terms']

# run analysis and plot
grouped_df = count_yearly_party_term_frequency_normalized(speeches_df, term_patterns)
grouped_df.to_csv("normalized_term_trends_by_year.csv", index=False)
plot_term_trends_over_time(grouped_df)


The code below visualizes trends for the six terms with the largest average party divergence.

In [None]:
df = pd.read_csv("results/RQ2/normalized_term_trends_by_year.csv")

# filter to major parties
df = df[df["party"].isin(["Democratic", "Republican"])]

# compute top 6 divergent terms
pivot = df.pivot_table(index=["year", "term"], columns="party", values="count_per_1000").fillna(0)
pivot["abs_diff"] = (pivot["Democratic"] - pivot["Republican"]).abs()
top_terms = pivot.groupby("term")["abs_diff"].mean().sort_values(ascending=False).head(6).index.tolist()

output_dir = "topterm_trend_plots"
os.makedirs(output_dir, exist_ok=True)

sns.set_theme(style="whitegrid")

# one PNG per term
for term in top_terms:
    term_df = df[df["term"] == term]

    plt.figure(figsize=(8, 4))
    sns.lineplot(data=term_df, x="year", y="count_per_1000", hue="party", marker="o")

    # event annotations
    for year, label in [(2018, 'Family Separation'), (2020, 'Election'), (2021, 'Biden Inauguration')]:
        plt.axvline(x=year, linestyle='--', color='gray', alpha=0.5)
        plt.text(year + 0.1, plt.ylim()[1] * 0.85, label, rotation=90, fontsize=8, alpha=0.7)

    plt.title(f"Usage Over Time: '{term}'")
    plt.xlabel("Year")
    plt.ylabel("Mentions per 1,000 Tokens")
    plt.legend(title="Party")
    plt.tight_layout()

    filename = os.path.join(output_dir, f"{term.replace(' ', '_')}_trend.png")
    plt.savefig(filename)
    plt.close()

print(f"Saved individual trend plots to: {output_dir}")

Saved individual trend plots to: term_trend_plots
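
The ranking step in the cell above (pivot by party, take the absolute gap, average per term) can be illustrated on a few made-up numbers without pandas. All rows below are hypothetical. As with `fillna(0)` in the cell, a party with no row for a given year is treated as zero usage, which can inflate the gap for sparsely covered terms.

```python
from collections import defaultdict

# Hypothetical rows: (year, term, party, mentions per 1,000 tokens).
rows = [
    (2018, "border", "Democratic", 1.0), (2018, "border", "Republican", 2.0),
    (2019, "border", "Democratic", 2.0), (2019, "border", "Republican", 4.0),
    (2018, "visa", "Democratic", 0.5), (2018, "visa", "Republican", 0.4),
    (2019, "visa", "Democratic", 0.3),  # no Republican row: treated as 0
]

# Build {(year, term): {party: rate}}, the same shape as the pivot table.
pivot = defaultdict(dict)
for year, term, party, rate in rows:
    pivot[(year, term)][party] = rate

# Absolute party gap per (year, term), collected per term.
gaps = defaultdict(list)
for (year, term), rates in pivot.items():
    gap = abs(rates.get("Democratic", 0.0) - rates.get("Republican", 0.0))
    gaps[term].append(gap)

# Mean gap per term, then the terms with the largest mean gap.
mean_gap = {term: sum(values) / len(values) for term, values in gaps.items()}
top_terms = sorted(mean_gap, key=mean_gap.get, reverse=True)[:1]
print(mean_gap, top_terms)
```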


Conclusions from the results so far.

1. Immigration is a Consistently Polarizing Topic

    The top divergent terms highlight persistent framing differences, even across years with different administrations. This supports the hypothesis that immigration remains a highly partisan issue in U.S. politics.

2. Political Events Drive Terminology Spikes

    Event annotations (e.g. Family Separation, Elections) align with notable spikes in specific term frequencies. This is consistent with the method picking up real shifts and points to the role of external events in shaping political discourse.

3. Normalized Frequencies Reveal Subtle Trends

    Because frequencies are measured per 1,000 tokens, the analysis captures rhetorical shifts even when overall speech volume changes, allowing for fair year-to-year comparisons.
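
As a worked example of the normalization, the sketch below computes mentions per 1,000 tokens for three made-up speeches and then averages the per-speech rates, which is how the yearly values are aggregated in `count_yearly_party_term_frequency_normalized`. The function name `per_thousand` and the numbers are assumptions for illustration.

```python
def per_thousand(mentions, token_count):
    """Mentions per 1,000 tokens; a zero-length text is treated as one token."""
    return mentions / max(token_count, 1) * 1000


# Hypothetical speeches as (mentions of a term, total tokens).
speeches = [(4, 2000), (1, 250), (0, 800)]

rates = [per_thousand(m, n) for m, n in speeches]
mean_of_rates = sum(rates) / len(rates)

# Pooling all mentions and tokens gives a different number, because the mean of
# per-speech rates weights a short speech as heavily as a long one.
pooled_rate = per_thousand(sum(m for m, _ in speeches), sum(n for _, n in speeches))

print(rates, mean_of_rates, pooled_rate)
```

The gap between the two summaries is worth keeping in mind when comparing these trends with raw-count tables.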

Based on the visualizations of term usage from 2017–2021:

1. **“immigration”**
- **Republican** usage peaked in 2018 (likely a response to the Family Separation policy), then dropped sharply through 2021.
- **Democratic** usage declined steadily.
- **Interpretation**: The term “immigration” became less central after 2018, possibly replaced by more specific framing (e.g., “migrant” or “border”) or due to fatigue and shifting focus during COVID.

---

2. **“migrant”**
- Consistent **Democratic** dominance in usage, with a steady decline.
- **Republican** usage was always lower and declined similarly.
- **Interpretation**: “Migrant” is more common in humanitarian or individual-focused discourse, aligning with Democratic framing. Decreased usage over time could reflect reduced legislative attention or shifting terminology.

---

3. **“border”**
- Sharp **Republican** spike in 2019, aligning with the *Wall* and *caravan crisis* narratives.
- Both parties saw major drops post-2019.
- **Interpretation**: “Border” surged with Trump-era policies and media focus but faded post-2020, possibly due to shifting public priorities and political strategy changes under Biden.

---

4. **“daca”**
- Massive early **Democratic** emphasis (esp. 2017), tied to efforts to protect DACA recipients.
- Near-zero usage from 2019 onward by both parties.
- **Interpretation**: Declining mentions reflect stalled legislative movement and the issue becoming less central post-Trump.

---

5. **“mexico”**
- Gradual **Republican** increase, overtaking Democrats by 2021.
- 2020 spike from both parties, possibly linked to the “Remain in Mexico” policy or campaign rhetoric.
- **Interpretation**: Republicans increasingly invoke “Mexico” in enforcement/geopolitical contexts; Democrats’ interest spiked in 2020 but declined post-election.

---

6. **“dreamer”**
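- Overwhelmingly a **Democratic** term in early years (2017–2018) when measured per 1,000 tokens, even though the raw-count table in the term frequency analysis shows more total Republican mentions.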
- Usage collapsed by 2019.
- **Interpretation**: “Dreamer” reflects advocacy for undocumented youth. Declining mentions may reflect reduced legislative hope or strategic deprioritization.

---

#### Cross-Cutting Themes & Final Takeaways

1. **Term Selection Reflects Party Priorities**
- Democrats gravitate toward *people-centered* terms (“dreamer”, “migrant”, “daca”).
- Republicans lean into *security and enforcement* terms (“border”, “immigration”, “mexico”).

2. **General Decline in Mentions**
- Most terms declined after 2019, regardless of party.
- This suggests:
  - **Issue fatigue** or less policy activity.
  - The dominance of **COVID-19** discourse starting in 2020.
  - Fewer landmark immigration developments post-2019 until the end of Title 42.

3. **Diminished Polarization by 2021**
- In 2021, some terms converge in usage (e.g., “immigration” and “migrant”).
- This may signal either bipartisan disengagement or a momentary cooling of public rhetoric under Biden.

---


#### In-Depth Term Analysis

1. **“border”**
Observed Trend:
- Steady Republican usage in 2017–2018, with a **dramatic spike in 2019** (over 4 mentions per 1,000 tokens).
- Democratic mentions also peaked in 2019 but at a lower level.
- Sharp decline for both parties after 2019, stabilizing at low levels by 2021.

Interpretation:
This term's prominence in 2019 aligns with several key political developments:
- **Trump's National Emergency Declaration (Feb 2019)** to fund the border wall likely fueled Republican talking points centered on “border security.”
- **Caravan migration narratives** dominated media cycles, especially on conservative outlets, influencing speech patterns.
- Democrats also ramped up border-related discussions during this time — often in opposition, emphasizing humanitarian concerns or the implications of militarizing the border.

Party Framing:
- **Republicans:** “Border” was used as a symbol of national security and legal enforcement.
- **Democrats:** Framed around human impact, with critiques of detention centers and border patrol practices.

---

2. **“dreamer”**
Observed Trend:
- Extremely high usage by Democrats in 2017–2018 (peaking around 3.8–4.0 per 1,000 tokens).
- Virtually disappears after 2019.
- Republicans used this term only sparingly throughout.

Interpretation:
- The spike reflects intense Democratic advocacy following **Trump’s 2017 rescission of the DACA program**, which protected undocumented immigrants brought to the U.S. as children.
- Democratic leaders pushed for the DREAM Act during 2017–2018, making “Dreamers” a centerpiece of their immigration messaging.
- After **multiple failed legislative efforts** and **Supreme Court delays**, the term fell out of use, likely due to issue fatigue and legislative gridlock.

Party Framing:
- **Democrats:** Framed “dreamers” as blameless, high-achieving young people who deserved permanent protection.
- **Republicans:** Rarely invoked the term, possibly due to its **sympathetic connotation**, instead framing immigration more broadly around legality and borders.

---

3. **“immigration”**
Observed Trend:
- **Republican** usage peaked in 2018 (2.0 per 1,000 tokens), coinciding with national debates on enforcement.
- **Democratic** mentions steadily declined over the years.
- By 2021, both parties nearly converged at minimal usage.

Interpretation:
- The **2018 surge** coincides with the **family separation (“zero tolerance”) policy**, which became a flashpoint in immigration discourse.
- Republican usage often reflected defense or justification of harsher enforcement under the Trump administration.
- As **COVID-19** and **economic concerns** took over from 2020 onward, immigration lost attention in congressional discourse, which would explain the drop in usage across both parties.

Party Framing:
- **Republicans:** Focused on “immigration” as a security and law enforcement issue.
- **Democrats:** Gradually de-emphasized the term itself in favor of more humanizing alternatives like “migrant,” “asylum,” or issue-specific terms like “DACA.”

---

#### Final Thoughts
These terms capture broader partisan strategies:
- Republicans centered enforcement and sovereignty (e.g. “border”).
- Democrats emphasized vulnerable populations and individual stories (e.g. “dreamer”).

Their rises and falls in frequency track real-world policy fights, executive actions, and moments of national attention, underscoring how **language mirrors power, policy, and public pressure**.