<a href="https://colab.research.google.com/github/lizaoh/smp_program_data/blob/main/smp2010_extract_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Top of Script

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [130]:
!pip install pymupdf
!pip install pymupdf-layout
!pip install pymupdf4llm
!pip install rapidfuzz
import glob
import os
import pathlib
import pymupdf
import pymupdf.layout
import pymupdf4llm
import re
import pandas as pd
import unicodedata
from rapidfuzz import process, fuzz



In [3]:
pdfs_path = '/content/drive/MyDrive/math_psych_work/Conference Programs/'

# Functions
Created with help from GPT 5.2; some of these are my own code refactored into functions.

In [114]:
def clean_text(text):
    if not text:
        return text

    text = re.sub(r' \n\n\d{1,3} \n\n', ' ', text)
    text = re.sub(r'\s*\n\s*', ' ', text)    # Replace newlines with spaces

    text = re.sub(r'-\s+', '', text)         # Get rid of "- "; will fix actual
                                             # hyphenated words manually

    text = re.sub(r'\s{2}', ' ', text)       # Collapse two adjacent spaces into one

    text = re.sub(r'\.\s*##.*$', '.', text,
                  flags=re.DOTALL)           # Drops extraneous text (from a "##"
                                             # heading onward) after the last sentence
    text = text.strip()
    text = fix_ligatures(text)

    return text
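
To make the order of the regex steps concrete, here is a minimal sketch that runs the same substitutions on a small made-up string. The helper name `clean_text_demo` and the sample text are hypothetical; the ligature step is left out so the block stands on its own.

```python
import re

def clean_text_demo(text):
    """Hypothetical stand-in for clean_text, without the ligature step."""
    text = re.sub(r' \n\n\d{1,3} \n\n', ' ', text)   # stray page number between blank lines
    text = re.sub(r'\s*\n\s*', ' ', text)             # newlines -> single spaces
    text = re.sub(r'-\s+', '', text)                  # rejoin words split by "- "
    text = re.sub(r'\s{2}', ' ', text)                # double space -> single space
    text = re.sub(r'\.\s*##.*$', '.', text,
                  flags=re.DOTALL)                    # drop text from a "##" heading onward
    return text.strip()

# Made-up sample: a hyphenated line break, a page number, and a trailing heading
raw = "Deci- \nsion making under \n\n12 \n\nrisk.  Next sentence. \n\n## **Posters**"
cleaned = clean_text_demo(raw)
print(cleaned)  # Decision making under risk. Next sentence.
```

Because the `-\s+` step removes every hyphen that is followed by whitespace, a genuinely hyphenated word that happens to break across lines loses its hyphen too, which is why those cases still need a manual pass.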

In [5]:
LIGATURE_MAP = {
    "ﬁ": "fi",
    "ﬂ": "fl",
    "ﬃ": "ffi",
    "ﬄ": "ffl",
    "ﬀ": "ff",
    "ﬅ": "ft",
    "ﬆ": "st",
    "Æ": 'ffi'
}

def fix_ligatures(text):
    # Replace known ligatures
    for bad, good in LIGATURE_MAP.items():
        text = text.replace(bad, good)

    # Replace any remaining character whose Unicode name contains "LIGATURE"
    # (unnamed characters, e.g. private-use code points, are left unchanged)
    cleaned_chars = []
    for ch in text:
        name = unicodedata.name(ch, "")
        if "LIGATURE" in name.upper():
            # Use the part of the name after "LIGATURE", without spaces and
            # lowercased, e.g. "LATIN SMALL LIGATURE FI" -> "fi"
            base = name.split("LIGATURE")[-1]
            base = base.replace(" ", "").lower()
            cleaned_chars.append(base)
        else:
            cleaned_chars.append(ch)

    return "".join(cleaned_chars)
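
The name-based fallback can be tried in isolation. This is a small sketch, not the function above: `expand_by_name` is a hypothetical helper, and the sample string is made up.

```python
import unicodedata

def expand_by_name(ch):
    """Hypothetical helper: expand one character using its Unicode name."""
    name = unicodedata.name(ch, "")   # "" when the character has no name
    if "LIGATURE" not in name:
        return ch
    # "LATIN SMALL LIGATURE FFI" -> " FFI" -> "ffi"
    return name.split("LIGATURE")[-1].replace(" ", "").lower()

sample = "e\ufb03cient \ufb01t"       # contains the "ffi" and "fi" ligatures
expanded = "".join(expand_by_name(ch) for ch in sample)
print(expanded)  # efficient fit

# A private-use code point has no Unicode name, so it passes through unchanged
private_use = expand_by_name("\ue000")
```

The fallback only makes sense for Latin ligatures whose names end in the letters they stand for. Characters from other scripts with "LIGATURE" in their names would be replaced by a fragment of the name, so it is worth checking the output if the PDF contains non-Latin text.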

# Program

144 entries total (22 posters, 93 talks, 26 symposium presentations, and 3 plenary talks). One paper was split into two presentations (parts I and II), each presented by one of its two authors. All of the other information is the same for both parts, so I'm dropping one of those rows from the final spreadsheet.

## Grab text from the pdf

In [6]:
year = '2010'
program = pymupdf.open(pdfs_path + f'smp{year}_program.pdf')

In [67]:
# Extracting as markdown to more easily get title, authors, and affiliations
# Title is bold, affiliations are italic, etc.
program_text = pymupdf4llm.to_markdown(program)

In [131]:
program_text[:1000]

'**==> picture [47 x 47] intentionally omitted <==**\n\n# 2010 \n\n# Society for Mathematical Psychology 43rd Annual Conference Portland, OR \n\n**==> picture [157 x 305] intentionally omitted <==**\n\nProgram & Information \n\n**==> picture [40 x 39] intentionally omitted <==**\n\n## **Welcome** \n\nWelcome to MP2010, the 43rd annual meeting of the _Society for Mathematical Psychology_ , being held in the Embassy Suites Portland in Downtown Portland. \n\nThis year’s conference features plenary addresses from Mike Jordan, Larry Maloney, and William K. Estes Early Career Award winner Thomas Griffiths. There are 4 invited symposia, 122 accepted talks and 22 poster presentations. \n\n## **Registration** \n\nYou can register from 6pm–9pm in the Colonel Lindbergh room (overlapping with the reception from 7pm–9pm), or during the conference. \n\nYou will receive a printed copy of this program, including the abstracts and schedule, as well as your name tag, which is stamped to indicate banquet

## Separate out presentation entries

In [102]:
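all_abstracts = re.split(r'Abstracts', program_text, maxsplit=1)[1]
# Split on the bold day/time stamps; the group is non-capturing so the day
# names are not returned as separate list items
split_abstracts = re.split(r'\*\*(?:Sunday|Monday|Tuesday), \d{1,2}:\d{2}\*\*',
                           all_abstracts)
# Keep only pieces that contain a session label and strip stray page numbers
abstract_entries = [re.sub(r' \n\n\d{1,2} \n\n', '', entry.strip())
                    for entry in split_abstracts if "Session" in entry]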
# Get rid of text following last poster entry
abstract_entries[-1] = abstract_entries[-1].split('##')[0]
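
The splitting step can be sketched on a short made-up string. The sample text below is invented for illustration and only imitates the layout of the real program. The sketch also shows that a capturing group in the split pattern makes `re.split` return the day names as extra list items, which the `"Session"` filter then discards.

```python
import re

# Made-up text that imitates the program layout
sample = (
    "Program front matter Abstracts "
    "**Sunday, 9:00** (Session A) **First title.** A. Author _Univ_ . First body. "
    "**Monday, 10:30** (Session B) **Second title.** B. Author _Univ_ . Second body."
)

stamp = r'\*\*(?:Sunday|Monday|Tuesday), \d{1,2}:\d{2}\*\*'            # non-capturing
capturing_stamp = r'\*\*(Sunday|Monday|Tuesday), \d{1,2}:\d{2}\*\*'    # capturing

# Everything after the first "Abstracts" marker
body = re.split(r'Abstracts', sample, maxsplit=1)[1]

pieces = re.split(stamp, body)
entries = [p.strip() for p in pieces if "Session" in p]

# With a capturing group, the matched day names are interleaved in the result
pieces_with_days = re.split(capturing_stamp, body)

print(len(entries))         # 2
print(pieces_with_days[1])  # Sunday
```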

In [132]:
abstract_entries[:2]

['(Session B) **Movement Planning under Risk, Decision Making under Risk.** Larry Maloney _NYU_ . In executing any speeded movement, there is uncertainty about the outcome due to motor variability. I’ll describe recent experiments in which subjects carried out speeded motor tasks that were equivalent to economic decisions. The outcome of each movement earned an explicit monetary reward or penalty. In one game, for example, subjects attempted to reach out and touch briefly presented reward disks while avoiding nearby, overlapping penalty disks. The task for the subject was to trade off the risk of missing the reward disk against the risk of hitting the penalty disk. The optimum tradeoff depended on the magnitudes of penalty and reward and the subject’s own motor error. For each subject, in each game, we could estimate the ideal movement strategy and maximum expected gain possible. We compared subjects’ movements and winnings to this ideal. In many of these experiments, subjects consiste

## Sort title, authors, affiliations, and abstract

In [125]:
parsed_entries = []

for entry in abstract_entries:
  after_session_text = clean_text(entry.split(')', 1)[1].strip())
  info_text, abstract = after_session_text.split(' . ', 1)

  # Extracts title
  title_parts = re.findall(r'\*\*(.*?)\*\*', info_text)
  title = ' '.join(a.strip() for a in title_parts) if title_parts else None

  # Extracts all affiliations in entry
  affiliation_parts = re.findall(r'_(.*?)_', info_text)
  affiliations = '; '.join(a.strip() for a in affiliation_parts) if affiliation_parts else None

  # Removes title and affiliation from info_text to get authors
  authors_text = info_text

  for t in title_parts:
    authors_text = authors_text.replace(f'**{t}**', '')

  for a in affiliation_parts:
    authors_text = authors_text.replace(f'_{a}_', '')

  # Cleans up punctuation & whitespace
  authors = authors_text.strip().split(',')
  list_authors = [a.strip() for a in authors if a.strip()]
  cleaned_authors = ', '.join(list_authors)

  parsed_entries.append({
    'year': year,
    'author(s)': cleaned_authors,
    'affiliation(s)': affiliations,
    'title': title.strip('.') if title else None,
    'type': '',
    'abstract': abstract
  })
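
As a check on the parsing logic, the same steps can be applied to a single entry. The sketch below wraps them in a hypothetical helper, `parse_entry`, and feeds it a shortened copy of the first entry shown earlier; it assumes the entry has already been cleaned.

```python
import re

def parse_entry(entry):
    """Hypothetical helper: split one cleaned entry into its fields."""
    after_session = entry.split(')', 1)[1].strip()         # drop "(Session X)"
    info_text, abstract = after_session.split(' . ', 1)    # " . " ends the info part

    title_parts = re.findall(r'\*\*(.*?)\*\*', info_text)  # bold -> title
    affiliation_parts = re.findall(r'_(.*?)_', info_text)  # italic -> affiliations

    # Whatever is left after removing title and affiliations is the author list
    authors_text = info_text
    for t in title_parts:
        authors_text = authors_text.replace(f'**{t}**', '')
    for a in affiliation_parts:
        authors_text = authors_text.replace(f'_{a}_', '')
    authors = ', '.join(p.strip() for p in authors_text.split(',') if p.strip())

    title = ' '.join(p.strip() for p in title_parts) if title_parts else None
    return {
        'author(s)': authors,
        'affiliation(s)': '; '.join(a.strip() for a in affiliation_parts) or None,
        'title': title.strip('.') if title else None,
        'abstract': abstract,
    }

entry = ('(Session B) **Movement Planning under Risk, Decision Making under Risk.** '
         'Larry Maloney _NYU_ . In executing any speeded movement, there is '
         'uncertainty about the outcome due to motor variability.')
parsed = parse_entry(entry)
print(parsed['title'])      # Movement Planning under Risk, Decision Making under Risk
print(parsed['author(s)'])  # Larry Maloney
```

The split on `' . '` relies on the space that the markdown extraction leaves between the closing italic marker and the period. An entry without that pattern raises a `ValueError` at the unpacking step, which is a useful signal that the entry needs manual attention.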

In [134]:
parsed_entries[:3]

[{'year': '2010',
  'author(s)': 'Larry Maloney',
  'affiliation(s)': 'NYU',
  'title': 'Movement Planning under Risk, Decision Making under Risk',
  'type': '',
  'abstract': 'In executing any speeded movement, there is uncertainty about the outcome due to motor variability. I’ll describe recent experiments in which subjects carried out speeded motor tasks that were equivalent to economic decisions. The outcome of each movement earned an explicit monetary reward or penalty. In one game, for example, subjects attempted to reach out and touch briefly presented reward disks while avoiding nearby, overlapping penalty disks. The task for the subject was to trade off the risk of missing the reward disk against the risk of hitting the penalty disk. The optimum tradeoff depended on the magnitudes of penalty and reward and the subject’s own motor error. For each subject, in each game, we could estimate the ideal movement strategy and maximum expected gain possible. We compared subjects’ moveme

# Create df and convert to csv

In [127]:
df = pd.DataFrame(parsed_entries, columns=["year", "author(s)", "affiliation(s)", "title", "type", "abstract"])

In [128]:
df.head()

| | year | author(s) | affiliation(s) | title | type | abstract |
|---|---|---|---|---|---|---|
| 0 | 2010 | Larry Maloney | NYU | Movement Planning under Risk, Decision Making ... | | In executing any speeded movement, there is un... |
| 1 | 2010 | Mike Jordan | UC; Berkeley | Recent Developments in Nonparametric Bayesian ... | | Bayesian inference is often viewed as an assum... |
| 2 | 2010 | Thomas Griffiths | UC Berkeley | Using Probabilistic Models of Cognition to Ide... | | People are remarkably good at acquiring comple... |
| 3 | 2010 | Luke Maurits, Amy Perfors, Daniel Navarro | University of Adelaide | New computational approaches to word learning | | All language learners must face the task of in... |
| 4 | 2010 | Mark Andrews, Gabriella Vigliocco | NTU and UCL; UCL | Learning Word Meanings from Sequential and Syn... | | A promising recent approach to the problem of ... |


In [129]:
df.to_csv(f"/content/drive/MyDrive/math_psych_work/csv/smp{year}_program.csv", index=False)