# NYU Course Scraper

This project uses regular expressions to extract information from the NYU Linguistics Department's course descriptions for the coming Spring semester.

Concepts:
- Extract information from semi-structured text using regular expressions
- Modify and tweak regular expressions to improve their recall
- Make an informed decision about which linguistics courses to enroll in for the coming semester
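
Recall, in this setting, is the share of lines containing the target information that a pattern actually matches. The following is a minimal sketch and not part of the original notebook: it compares a strict instructor pattern with a more permissive one on three lines taken from the course list (whitespace simplified). The names `sample_lines`, `strict`, `permissive`, and `recall` are hypothetical.

```python
import re

# Three instructor lines from the course list (whitespace simplified by hand).
sample_lines = [
    "LING-UA 1-005 Professor Gary Thoms M/W, 9:30AM - 10:45AM",
    "LING-UA 13-001 Professor Chris Collins M/W, 9:30AM - 10:45AM",
    "LING-UA 15-001 Professors Lisa Davidson & Laurel MacKenzie T/R, 11:00AM - 12:15PM",
]

strict = "Professor ([A-Za-z]+ [A-Za-z]+)"        # misses the plural "Professors"
permissive = "Professors? ([A-Za-z]+ [A-Za-z]+)"  # optional "s" also covers co-taught courses

def recall(pattern, lines):
    """Fraction of lines (all of which name an instructor) that the pattern matches."""
    hits = sum(1 for line in lines if re.search(pattern, line))
    return hits / len(lines)

print(recall(strict, sample_lines))
print(recall(permissive, sample_lines))
```

The strict pattern matches two of the three lines, while the permissive one matches all three. That is the kind of improvement the second concept refers to.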

## Processing the course list with regular expressions

In [1]:
import re
import pandas
from copy import deepcopy
from pprint import pprint

In [2]:
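# read in course descriptions file
# (assumes the file is UTF-8 encoded, since it contains characters such as "➣")
with open('course_descriptions.txt', 'r', encoding='utf-8') as f:
    filecontent = f.read().strip().split('\n')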

for line in filecontent:
    print(line)

LINGUISTICS
Spring 2022

Language
LING-UA 1-001	                            Professor Lucas Champollion                                 M/W, 3:30PM - 4:45PM
Satisfies Introductory course requirement and the Societies and Social Science component of the College Core Curriculum

➣ This course is an introductory survey of the field of linguistics—the scientific study of language. During the semester, we will look at questions like the following: Is speaking an instinctual or a learned behavior? Why do children acquire language so much faster and easier than adults, and what are the stages of acquisition? What do the native speakers of a language know about the language’s word structure, sentence structure, sentence meaning, and pronunciation? How is language processed in the brain? How and why did language evolve into such a complex system? What is the relationship between language, social class, and race? The course will approach these questions from a scientific perspective, incorporati

In [3]:
# loop through file and extract lines that contain the string "Professor"
for line in filecontent:
    match = re.search('Professor', line)
    if match:
        print(line)
        print()

LING-UA 1-001	                            Professor Lucas Champollion                                 M/W, 3:30PM - 4:45PM

LING-UA 1-005	 Professor Gary Thoms	       M/W, 9:30AM - 10:45AM 

LING-UA 5-001/PSYCH-UA 56-001  		Professor Brian McElree			   F, 1:00PM - 3:30PM         

LING-UA 9-001	Professor Gillian Gallagher	           M, 12:30PM - 3:15PM

LING-UA 10-001			Professor Stephanie Harves			        T/R, 9:30AM - 10:45AM

LING-UA 12-001	Professor Juliet Stanton	     M/W, 3:30PM - 4:45PM

LING-UA 13-001	Professor Chris Collins	   M/W, 9:30AM -  10:45AM

LING-UA 15-001		Professors Lisa Davidson & Laurel MacKenzie 	 	     T/R, 11:00AM - 12:15PM   

LING-UA 19-001	Professor Anna Szabolcsi	     T/R, 2:00PM -  3:15PM

LING-UA 29-001	Professor Maria Gouskova	M/W, 11:00AM - 12:15PM

LING-UA 30-001/SPAN-UA 403-001/LATC-UA 361-003	Professor Gregory Guy	T/R, 9:30AM - 10:45AM

LING-UA 52-001/DS-UA 203-001	Professor Sam Bowman	W, 2:00PM - 3:15PM 

LING-UA 55-001/LING-GA 1029-001	Professors M

In [4]:
# create a template for the information that corresponds to a course
course_template = {
    "title": "blank",
    "course_number": "blank",
    "section_number": "blank",
    "crosslisted_course_number_OPTIONAL": "blank",
    "crosslisted_section_number_OPTIONAL": "blank",
    "days": "blank",
    "start_time": "blank",
    "end_time": "blank",
    "instructor": "blank",
    "second_instructor": "blank",
    "prerequisites_OPTIONAL": "blank",
    "satisfies_OPTIONAL": "blank",
    "description": "blank"
}

In [5]:
# title (the whole match, i.e. group 0; this pattern has no capturing groups)
# any non-empty line that doesn't start with a ➣ character
pattern1 = "^[^➣].*"

# course (group 1) and section number (group 2)
# e.g. "LING-UA 13-001": group 1 matches "LING-UA 13" and group 2 matches "001"
pattern2 = "(LING-UA [0-9]{1,3})-([0-9]{3})" 

# crosslisted course (group 1) and section number (group 3)--usually blank
# group 2 captures only the "UA" or "GA" suffix and is not used
# e.g. in "/PSYCH-UA 56-001" group 1 matches "PSYCH-UA 56" and group 3 matches "001"
pattern3 = "/ ?([A-Z]{1,5}-(UA|GA) [0-9]{1,4})-?([0-9]{3})?"

# days of the week (group 1), start time (group 2), end time (group 3)
# e.g. in "T/R, 9:00AM - 9:15AM" - group 1 should match "T/R", group 2 "9:00AM" and group 3 should match "9:15AM"
pattern4 = "([MTWRF/]{1,3}), ([0-9]{1,2}:[0-9]{2}[AP]M) *- *([0-9]{1,2}:[0-9]{2}[AP]M)"

# instructor (group 1) and second instructor where available (group 2)
# e.g. in "Professors Stephanie Harves & Richard Kayne" - group 1 should match "Stephanie Harves" and group 2 "Richard Kayne"
pattern5 = "Professor[s]* ([A-Za-z]+ [A-Za-z]+) ?&? ?([A-Za-z]+ [A-Za-z]+)?"

# instructor not yet known (TBD = to be determined, TBA = to be announced)
pattern6 = "(TBD)|(TBA)"

# prerequisites (group 1)
# e.g. "PREREQUISITE:  LING-UA 11 OR Permission of the Instructor"
pattern7 = "PREREQUISITE: +([^\t]*)"

# satisfies (group 1)
# e.g. "Satisfies Syntax requirement"
pattern8 = "(Satisfies|This course satisfies) ([^\t]*)"

# description (group 1)
# e.g. "➣ This course is an introductory survey of the field of linguistics—the scientific study of language." (etc.)
pattern9 = "➣ (.*)"
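
One way to check the group numbering is to run a few of the patterns on individual lines and inspect `match.groups()`. The sketch below is an illustration, not part of the original notebook: it repeats the expressions from `pattern2` to `pattern5` under hypothetical names (`course_re`, `crosslist_re`, `time_re`, `instructor_re`) so that it runs on its own, and applies them to two lines copied from the course list.

```python
import re

# Sample lines copied from the course list (tabs written as \t).
line_a = "LING-UA 5-001/PSYCH-UA 56-001  \t\tProfessor Brian McElree\t\t\t   F, 1:00PM - 3:30PM"
line_b = "LING-UA 15-001\t\tProfessors Lisa Davidson & Laurel MacKenzie \t \t     T/R, 11:00AM - 12:15PM"

# Same expressions as pattern2 to pattern5, repeated so the block is self-contained.
course_re = "(LING-UA [0-9]{1,3})-([0-9]{3})"
crosslist_re = "/ ?([A-Z]{1,5}-(UA|GA) [0-9]{1,4})-?([0-9]{3})?"
time_re = "([MTWRF/]{1,3}), ([0-9]{1,2}:[0-9]{2}[AP]M) *- *([0-9]{1,2}:[0-9]{2}[AP]M)"
instructor_re = "Professor[s]* ([A-Za-z]+ [A-Za-z]+) ?&? ?([A-Za-z]+ [A-Za-z]+)?"

# groups() returns one entry per capturing group; unmatched optional groups are None.
print(re.search(course_re, line_a).groups())
print(re.search(crosslist_re, line_a).groups())
print(re.search(time_re, line_a).groups())
print(re.search(instructor_re, line_a).groups())
print(re.search(instructor_re, line_b).groups())
```

For the crosslisting pattern the tuple has three entries, because `(UA|GA)` is itself a capturing group. The section number therefore arrives in group 3 rather than group 2.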

In [6]:
# loop through entire file, line by line, to match with previously defined regular expression patterns

courses = []
previous_line_is_blank = False

for line in filecontent:

    if line.startswith(("Spring", "LINGUISTICS")):
        continue # we are in a header line -- go to next iteration of for loop

    if line.strip() == "":
        previous_line_is_blank = True
        continue # go to next iteration of the for loop
        
    match = re.search(pattern1, line)
    if match and previous_line_is_blank:
        course = deepcopy(course_template) # deep copy means we create a completely new record separate from the other records
        course["title"] = match.group(0) # group(0) is the whole pattern
        courses.append(course)
    
    # course and section number
    match = re.search(pattern2, line)    
    if match:
        course["course_number"] = match.group(1)
        course["section_number"] = match.group(2)

    # crosslisted course and section number (usually blank)
    match = re.search(pattern3, line)
    if match:
        course["crosslisted_course_number_OPTIONAL"] = match.group(1)
        if match.group(3) is not None:
            course["crosslisted_section_number_OPTIONAL"] = match.group(3)

    # days of the week, start time, end time
    match = re.search(pattern4, line)
    if match:
        course["days"] = match.group(1)
        course["start_time"] = match.group(2)
        course["end_time"] = match.group(3)
                
    # instructor and (if available) second instructor
    match = re.search(pattern5, line)
    if match:
        course["instructor"] = match.group(1)
        if match.group(2) is not None:
            course["second_instructor"] = match.group(2)
    
    # instructor TBD or TBA (not yet determined)
    match = re.search(pattern6, line)
    if match:
        course["instructor"] = match.group(0)
        
    # prerequisites
    match = re.search(pattern7, line)
    if match:
        course["prerequisites_OPTIONAL"] = match.group(1)
        
    # satisfies
    match = re.search(pattern8, line)
    if match:
        course["satisfies_OPTIONAL"] = match.group(2)
    
    # description
    match = re.search(pattern9, line)
    if match:
        course["description"] = match.group(1)

    # we remember whether we've encountered a blank line, since 
    # this signals the beginning of a new course on the next line
    previous_line_is_blank = False
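
The record-boundary logic can be looked at in isolation: a line opens a new course only if it follows a blank line and does not start with the description marker. The sketch below is a simplified illustration under that reading, with a hypothetical helper `find_titles` and a hand-abbreviated excerpt in the same layout as the course list.

```python
import re

def find_titles(lines):
    """Return the lines that open a new course record."""
    titles = []
    previous_line_is_blank = False
    for line in lines:
        if line.startswith(("LINGUISTICS", "Spring")):
            continue  # header line
        if line.strip() == "":
            previous_line_is_blank = True
            continue
        # A title follows a blank line and does not start with "➣".
        if previous_line_is_blank and re.search("^[^➣].*", line):
            titles.append(line)
        previous_line_is_blank = False
    return titles

# Abbreviated excerpt: the description is shortened and whitespace simplified.
excerpt = [
    "LINGUISTICS",
    "Spring 2022",
    "",
    "Language",
    "LING-UA 1-001 Professor Lucas Champollion M/W, 3:30PM - 4:45PM",
    "Satisfies Introductory course requirement",
    "",
    "➣ This course is an introductory survey of the field of linguistics.",
    "",
    "Language",
    "LING-UA 1-005 Professor Gary Thoms M/W, 9:30AM - 10:45AM",
]
print(find_titles(excerpt))
```

The description line also follows a blank line, which is why the title pattern has to exclude lines beginning with "➣". Without that exclusion, every description would start a new record.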

In [7]:
# print courses as a table

pandas.DataFrame(courses).transpose()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| title | Language | Language | Introduction to Psycholinguistics | Indigenous Languages of the Americas | Structure of the Russian Language | Phonological Analysis | Grammatical Analysis | Language and Society | Advanced Semantics | Sex, Gender, and Language | Morphology | Language in Latin America | Machine Learning for Language Understanding | Introduction to Morphology at an Advanced Level | First Language Acquisition |
| course_number | LING-UA 1 | LING-UA 1 | LING-UA 5 | LING-UA 9 | LING-UA 10 | LING-UA 12 | LING-UA 13 | LING-UA 15 | LING-UA 19 | LING-UA 21 | LING-UA 29 | LING-UA 30 | LING-UA 52 | LING-UA 55 | LING-UA 59 |
| section_number | 001 | 005 | 001 | 001 | 001 | 001 | 001 | 001 | 001 | 001 | 001 | 001 | 001 | 001 | 001 |
| crosslisted_course_number_OPTIONAL | blank | blank | PSYCH-UA 56 | blank | blank | blank | blank | blank | blank | SCA-UA 712 | blank | SPAN-UA 403 | DS-UA 203 | LING-GA 1029 | PSYCH-UA 59 |
| crosslisted_section_number_OPTIONAL | blank | blank | 001 | blank | blank | blank | blank | blank | blank | blank | blank | 001 | 001 | 001 | 001 |
| days | M/W | M/W | F | M | T/R | M/W | M/W | T/R | T/R | M/W | M/W | T/R | W | M/W | T/R |
| start_time | 3:30PM | 9:30AM | 1:00PM | 12:30PM | 9:30AM | 3:30PM | 9:30AM | 11:00AM | 2:00PM | 4:55PM | 11:00AM | 9:30AM | 2:00PM | 12:30PM | 12:30PM |
| end_time | 4:45PM | 10:45AM | 3:30PM | 3:15PM | 10:45AM | 4:45PM | 10:45AM | 12:15PM | 3:15PM | 6:10PM | 12:15PM | 10:45AM | 3:15PM | 1:45PM | 1:45PM |
| instructor | Lucas Champollion | Gary Thoms | Brian McElree | Gillian Gallagher | Stephanie Harves | Juliet Stanton | Chris Collins | Lisa Davidson | Anna Szabolcsi | TBD | Maria Gouskova | Gregory Guy | Sam Bowman | Maria Gouskova | Ailis Cournane |
| second_instructor | blank | blank | blank | blank | blank | blank | blank | Laurel MacKenzie | blank | blank | blank | blank | blank | Alec Marantz | blank |


In [10]:
# print out all the records

for course in courses:
    pprint(course, sort_dicts=False)

{'title': 'Language',
 'course_number': 'LING-UA 1',
 'section_number': '001',
 'crosslisted_course_number_OPTIONAL': 'blank',
 'crosslisted_section_number_OPTIONAL': 'blank',
 'days': 'M/W',
 'start_time': '3:30PM',
 'end_time': '4:45PM',
 'instructor': 'Lucas Champollion',
 'second_instructor': 'blank',
 'prerequisites_OPTIONAL': 'blank',
 'satisfies_OPTIONAL': 'Introductory course requirement and the Societies and '
                       'Social Science component of the College Core '
                       'Curriculum',
 'description': 'This course is an introductory survey of the field of '
                'linguistics—the scientific study of language. During the '
                'semester, we will look at questions like the following: Is '
                'speaking an instinctual or a learned behavior? Why do '
                'children acquire language so much faster and easier than '
                'adults, and what are the stages of acquisition? What do the '
              

## Reading and searching through the course list using regular expressions

In [11]:
# search a given field across all courses for any strings that match a given regular expression

def course_search(regex, field):
    # Ellipsis is what a literal ... evaluates to, so these checks catch arguments left as ...
    if regex is Ellipsis and field is Ellipsis:
        print("Error: regex and field not specified")
        return
    if regex is Ellipsis:
        print("Error: regex not specified")
        return
    if field is Ellipsis:
        print("Error: field not specified")
        return
    if field not in course_template:
        print("Error: there is no field '" + field + "' in the course template!")
        return
    # collect one line of output per course whose field matches the regex anywhere
    result = ""
    for course in courses:
        match = re.search(regex, course[field])
        if match:
            result += "Course '" + course["course_number"] + " " + course["title"] + "' -- " + field + ": " + course[field] + "\n"
    return result
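
Returning a formatted string is convenient for printing, but a list of matching records is easier to process further. The sketch below shows one possible variant and is not the notebook's own function: `find_courses` and `sample_courses` are hypothetical names, and the two records are hand-copied with only the fields needed here.

```python
import re

# Two hand-copied records, reduced to the fields used in this example.
sample_courses = [
    {"course_number": "LING-UA 9", "title": "Indigenous Languages of the Americas", "days": "M"},
    {"course_number": "LING-UA 10", "title": "Structure of the Russian Language", "days": "T/R"},
]

def find_courses(regex, field, records):
    """Return the records whose `field` value matches `regex` anywhere."""
    return [record for record in records if re.search(regex, record[field])]

for record in find_courses("M", "days", sample_courses):
    print(record["course_number"], record["title"])
```

Because the result is a list of dictionaries, it can be counted, sorted, or passed to another search without parsing text.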

In [12]:
# search for courses that take place on Mondays and Wednesdays.

courses_on_monday_and_wednesday = course_search("M/W", "days")
print(courses_on_monday_and_wednesday)

Course 'LING-UA 1 Language' -- days: M/W
Course 'LING-UA 1 Language' -- days: M/W
Course 'LING-UA 12 Phonological Analysis' -- days: M/W
Course 'LING-UA 13 Grammatical Analysis' -- days: M/W
Course 'LING-UA 21 Sex, Gender, and Language' -- days: M/W
Course 'LING-UA 29 Morphology' -- days: M/W
Course 'LING-UA 55 Introduction to Morphology at an Advanced Level' -- days: M/W



In [13]:
# search for courses that take place on Mondays or Wednesdays, or both.

courses_on_monday_or_wednesday = course_search("M|W", "days")
print(courses_on_monday_or_wednesday)

Course 'LING-UA 1 Language' -- days: M/W
Course 'LING-UA 1 Language' -- days: M/W
Course 'LING-UA 9 Indigenous Languages of the Americas' -- days: M
Course 'LING-UA 12 Phonological Analysis' -- days: M/W
Course 'LING-UA 13 Grammatical Analysis' -- days: M/W
Course 'LING-UA 21 Sex, Gender, and Language' -- days: M/W
Course 'LING-UA 29 Morphology' -- days: M/W
Course 'LING-UA 52 Machine Learning for Language Understanding' -- days: W
Course 'LING-UA 55 Introduction to Morphology at an Advanced Level' -- days: M/W
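
For single characters, the alternation `M|W` and the character class `[MW]` select the same values, and `re.fullmatch` can be used when the whole field has to equal the pattern. A minimal sketch on the distinct `days` values that occur in the records (the variable names are hypothetical):

```python
import re

# The distinct values of the "days" field in the extracted records.
day_values = ["M/W", "F", "M", "T/R", "W"]

either = [d for d in day_values if re.search("M|W", d)]         # alternation
char_class = [d for d in day_values if re.search("[MW]", d)]    # character class, same result here
monday_only = [d for d in day_values if re.fullmatch("M", d)]   # whole value must be "M"

print(either)
print(char_class)
print(monday_only)
```

`re.search` succeeds if the pattern occurs anywhere in the value, which is why `"M"` alone would also select `"M/W"`. `re.fullmatch` rules that out.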



In [14]:
# search for all of the instructors (or first-mentioned instructors in co-taught courses) of all of the courses

instructors = course_search(".*", "instructor")
print(instructors)

Course 'LING-UA 1 Language' -- instructor: Lucas Champollion
Course 'LING-UA 1 Language' -- instructor: Gary Thoms
Course 'LING-UA 5 Introduction to Psycholinguistics' -- instructor: Brian McElree
Course 'LING-UA 9 Indigenous Languages of the Americas' -- instructor: Gillian Gallagher
Course 'LING-UA 10 Structure of the Russian Language' -- instructor: Stephanie Harves
Course 'LING-UA 12 Phonological Analysis' -- instructor: Juliet Stanton
Course 'LING-UA 13 Grammatical Analysis' -- instructor: Chris Collins
Course 'LING-UA 15 Language and Society' -- instructor: Lisa Davidson
Course 'LING-UA 19 Advanced Semantics' -- instructor: Anna Szabolcsi
Course 'LING-UA 21 Sex, Gender, and Language' -- instructor: TBD
Course 'LING-UA 29 Morphology' -- instructor: Maria Gouskova
Course 'LING-UA 30 Language in Latin America' -- instructor: Gregory Guy
Course 'LING-UA 52 Machine Learning for Language Understanding' -- instructor: Sam Bowman
Course 'LING-UA 55 Introduction to Morphology at an Advanc

In [15]:
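# search for all of the second instructors wherever that field does not contain "blank"
# (the negative lookahead rejects exactly the string "blank"; chaining [^b][^l][^a][^n][^k]
# would also reject other values, such as any string shorter than five characters)

second_instructors = course_search("^(?!blank$).+", "second_instructor")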
print(second_instructors)

Course 'LING-UA 15 Language and Society' -- second_instructor: Laurel MacKenzie
Course 'LING-UA 55 Introduction to Morphology at an Advanced Level' -- second_instructor: Alec Marantz
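
Excluding a placeholder value such as "blank" can be written in more than one way. The sketch below compares a chain of negated character classes, `[^b][^l][^a][^n][^k].*`, with a negative lookahead on a few values. "Kim" and "bravo" are made-up values added for illustration; the other two occur in the records.

```python
import re

negated_classes = "[^b][^l][^a][^n][^k].*"  # five positions, each excluding one letter
lookahead = "^(?!blank$).+"                 # any non-empty string other than exactly "blank"

# "Kim" and "bravo" are hypothetical; "blank" and "Laurel MacKenzie" come from the records.
values = ["blank", "Laurel MacKenzie", "Kim", "bravo"]

for value in values:
    print(value,
          bool(re.search(negated_classes, value)),
          bool(re.search(lookahead, value)))
```

Both patterns reject "blank" and accept "Laurel MacKenzie". Only the lookahead accepts "Kim", which is too short for five consecutive positions, and "bravo", whose only five-character window starts with "b".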



In [17]:
# search for all the courses where the field "prerequisites_OPTIONAL" has the value "blank" (that is, the five-letter string "blank", not the empty string)

prerequisites_OPTIONAL = course_search("blank", "prerequisites_OPTIONAL")
print(prerequisites_OPTIONAL)

Course 'LING-UA 1 Language' -- prerequisites_OPTIONAL: blank
Course 'LING-UA 1 Language' -- prerequisites_OPTIONAL: blank
Course 'LING-UA 9 Indigenous Languages of the Americas' -- prerequisites_OPTIONAL: blank
Course 'LING-UA 15 Language and Society' -- prerequisites_OPTIONAL: blank
Course 'LING-UA 21 Sex, Gender, and Language' -- prerequisites_OPTIONAL: blank
Course 'LING-UA 30 Language in Latin America' -- prerequisites_OPTIONAL: blank



In [18]:
# search for all the courses you can take with the instructor's permission, even if the only linguistics course you have taken is Patterns in Language

prerequisites_OPTIONAL = course_search("Permission|permission", "prerequisites_OPTIONAL") 
print(prerequisites_OPTIONAL)

Course 'LING-UA 5 Introduction to Psycholinguistics' -- prerequisites_OPTIONAL: LING-UA 1 OR LING-UA 3 OR Permission of the Instructor
Course 'LING-UA 10 Structure of the Russian Language' -- prerequisites_OPTIONAL: LING-UA 13 OR Permission of the Instructor
Course 'LING-UA 12 Phonological Analysis' -- prerequisites_OPTIONAL: LING-UA 11 OR Permission of the Instructor Satisfies Phonology requirement
Course 'LING-UA 13 Grammatical Analysis' -- prerequisites_OPTIONAL: LING-UA 1 OR LING-UA 3 OR Permission of the Instructor 
Course 'LING-UA 19 Advanced Semantics' -- prerequisites_OPTIONAL: LING-UA 4 OR Permission of the Instructor 
Course 'LING-UA 29 Morphology' -- prerequisites_OPTIONAL: LING-UA 1 or LING-UA 3 OR Permission of the Instructor
Course 'LING-UA 52 Machine Learning for Language Understanding' -- prerequisites_OPTIONAL: at least one course with a substantial Python programming component, such as Introduction to Computer Programming (No Prior Experience) (CSCI-UA 2) or Introduct

In [19]:
# search for all the courses that start at noon or later (start times ending in "PM")

start_time = course_search("PM", "start_time") 
print(start_time)

Course 'LING-UA 1 Language' -- start_time: 3:30PM
Course 'LING-UA 5 Introduction to Psycholinguistics' -- start_time: 1:00PM
Course 'LING-UA 9 Indigenous Languages of the Americas' -- start_time: 12:30PM
Course 'LING-UA 12 Phonological Analysis' -- start_time: 3:30PM
Course 'LING-UA 19 Advanced Semantics' -- start_time: 2:00PM
Course 'LING-UA 21 Sex, Gender, and Language' -- start_time: 4:55PM
Course 'LING-UA 52 Machine Learning for Language Understanding' -- start_time: 2:00PM
Course 'LING-UA 55 Introduction to Morphology at an Advanced Level' -- start_time: 12:30PM
Course 'LING-UA 59 First Language Acquisition' -- start_time: 12:30PM
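
Matching on "PM" cannot distinguish an early-afternoon course from a late one. One option is to convert the time strings to numbers first. The sketch below is an illustration with a hypothetical helper `minutes_since_midnight`, assuming times are always written like "3:30PM".

```python
import re

def minutes_since_midnight(time_string):
    """Convert a string such as "3:30PM" to minutes after midnight."""
    match = re.fullmatch("([0-9]{1,2}):([0-9]{2})([AP]M)", time_string)
    if match is None:
        raise ValueError("unrecognized time: " + time_string)
    hour = int(match.group(1)) % 12  # maps 12 to 0, so 12AM is hour 0 and 12PM becomes 12 below
    if match.group(3) == "PM":
        hour += 12
    return hour * 60 + int(match.group(2))

# A few start times that occur in the records.
start_times = ["9:30AM", "12:30PM", "3:30PM", "4:55PM"]

# Keep only the courses that start at 3:00PM or later.
late = [t for t in start_times if minutes_since_midnight(t) >= 15 * 60]
print(late)
```

With numeric values, any cutoff can be expressed as a simple comparison instead of a new regular expression.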

