# Data Extraction from Meeting Minute PDFs

The PDFs we extract from are UCSB AS F&B meeting minutes, publicly available at [AS F&B Committee Minutes](https://asfb.as.ucsb.edu/minutes2018-2019/).
We use the minutes from the Fall 2024 and Winter 2025 quarters.

In [1]:
import pdfplumber
import re
import pandas as pd
import logging
import rapidfuzz

## Motion Extraction

Convert each PDF to text and collect every motion the committee passed.
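
To make the extraction step concrete, here is a minimal standalone sketch of the pattern matching, run on a made-up snippet of minutes text. The sample text and the helper names `extract_passed_motions` and `parse_amount` are illustrative assumptions and are not part of the notebook; the two regular expressions mirror the ones used in the cell that follows.

```python
import re

# Hypothetical sample text, loosely shaped like the minutes; not taken from a real PDF.
sample_text = """Action Items
Motion Language: Motion to fully fund Example Club for $1,250.50
Action: Passed
Motion Language: Motion to table Another Club for $300
Action: Passed
"""

# Same patterns as the notebook cell, compiled once for reuse.
MOTION_RE = re.compile(r"motion language:(.*?)action: passed", flags=re.DOTALL)
DETAILS_RE = re.compile(r"motion to\s+(.*?)\s*\$\s*([\d,]+(?:\.\d{2})?)")


def extract_passed_motions(text):
    """Return the lowercased body of every passed motion found in `text`."""
    flattened = text.replace("\n", " ").lower()
    return [body.strip() for body in MOTION_RE.findall(flattened)]


def parse_amount(amount_text):
    """Convert a string such as '1,250.50' to a float."""
    return float(amount_text.replace(",", ""))


motions = extract_passed_motions(sample_text)
for motion in motions:
    match = DETAILS_RE.search(motion)
    if match:
        # group(1) is the raw organization phrase, group(2) the dollar amount
        print(match.group(1), parse_amount(match.group(2)))
```

Converting the amount to a number at this point is a design choice of the sketch; the notebook cell keeps the amount as a string and prefixes `-` for strike motions.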

In [2]:
# Ignore non-critical warnings from pdfminer through pdfplumber
logging.getLogger("pdfminer").setLevel(logging.ERROR)

# Folder of pdfs, from UCSB AS F&B Meeting Minutes, publicly available, see above
pdf_folder = "meeting-mins-pdfs/"

with open("orgs-ucsb.txt", "r") as all_orgs_text:
    all_orgs = [line.strip() for line in all_orgs_text]
    
# Converts PDF pages to text, keeping only the pages from the "action items" header onward
# (motions to fund appear only after that header), and returns the passed motions found there
def motions_text_from_pdf(pdf_path):

    collecting = False
    out = ''
    
    with pdfplumber.open(pdf_path) as pdf:
         
        for page in pdf.pages: 
            
            # Fall back to an empty string in case a page yields no text
            text = page.extract_text() or ""

            if not collecting:

                if "action items" in text.lower():
                    collecting = True
                    
            if collecting:
                # Separate pages with a newline so words at a page break are not fused together
                out += text + "\n"
                
        return find_motions(out)

# Looks for and returns list of motions found in the text
def find_motions(text):

    pattern = r"motion language:(.*?)action: passed"
    motions = re.findall(pattern, text.replace("\n", " ").lower(), flags=re.DOTALL)
    return motions
    

# Parses motions, takes normalized club name and dollar amount
def find_motion_details(motions):

    for motion in motions:
        
        pattern = r"motion to\s+(.*?)\s*\$\s*([\d,]+(?:\.\d{2})?)"

        details = re.findall(pattern, motion.lower())

        if not details or "reaffirm" in motion or "forward" in motion or "table" in motion:
            continue

        raw_org_name, amount = details[0]

        # Cleans most of the words that confuse fuzzy matching
        cleaner_org_name = re.sub(r"\b( ucsb|fully fund|partially fund|strike| at|motion|fund| to| of| the )\b", "",
                              raw_org_name, flags=re.IGNORECASE).strip()
        
        # A motion to strike means we will want to undo an existing funding motion 
        if 'strike' in motion.lower():

            amount = '-' + amount

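        # extractOne returns None when no organization scores at least 50,
        # which avoids an IndexError on an empty result list
        match = rapidfuzz.process.extractOne(cleaner_org_name, [org.lower() for org in all_orgs],
                                             scorer=rapidfuzz.fuzz.ratio,
                                             score_cutoff=50)

        if match is None:
            logging.warning("No organization match for %r", cleaner_org_name)
            continue

        _, _, org_index = match

        print("RECORDED:" + all_orgs[org_index] + ' funded ' + amount)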
        

    
find_motion_details(motions_text_from_pdf(r'meeting-mins-pdfs/10.07.2024 Finance Committee Meeting Minutes.pdf'))

RECORDED:Debate funded 5,000
RECORDED:Untitled Dance Company funded 1,225
RECORDED:Sigma Alpha Zeta Multicultural funded 3,200
RECORDED:Association for Computing Machinery funded 537.50
RECORDED:Model United Nations funded 5,000
RECORDED:UCSBreakin' funded 850
RECORDED:Pre-Law Society funded 250
RECORDED:Pre-Law Society funded -250
RECORDED:Mock Trial funded 4,728
RECORDED:Moot Court funded 4,935.32
RECORDED:Undergraduate Diversity and Inclusion in Physics funded 257.28
RECORDED:Pre-Law Society funded 590
RECORDED:Sociology Association funded 82
RECORDED:Association for Computing Machinery funded 140
RECORDED:REALITY funded 250
RECORDED:Cube Club funded 6,051.90
RECORDED:gauchoCatholic funded 8,145.50
RECORDED:Finance Connection funded 2,170.64
RECORDED:Gaucho Gaming funded 3,603.86
RECORDED:Laughology funded 4,650
RECORDED:Taara funded 825
RECORDED:Society of Cosmetic Chemist:  Chapter funded 200
RECORDED:Collegiate Chapter of SAE International AKA Gaucho Racing funded 14,869
RECORDED