### Reading PDF Files is Painful

What does a student need to transfer?

They come to our website with goals (target schools) and leave with a plan to meet those goals.
The user provides every degree program they seek to transfer into, and we give them a list of classes they need to fulfill those requirements.
Let's make an example schedule for them that accounts for unit load, prerequisites, and corequisites.
OR
Let's inform the user of all the classes they need to take to fulfill requirements.
OR
Let's audit a user's schedule to see if it will meet all requirements.
HOW
Do we structure our backend database to afford these options?
WHY
Should a user use our website instead of doing it themselves?
WHAT
Format do we export the information in for the user? (Excel, Google Sheets, CSV, TSV, etc.)
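
To make the audit option a bit more concrete: the parser further down writes each agreement as an `away` list plus a `home` list of alternatives, where every alternative is a list of courses that must all be taken. Assuming that shape holds, a minimal sketch of the check could look like the following. The function name `is_satisfied` and the sample entry are hypothetical, and placeholder strings such as "No Course Articulated" are not handled here.

```python
def is_satisfied(home, completed):
    """Return True if any alternative in `home` is fully covered by `completed`.

    `home` is a list of alternatives (joined by Or); each alternative is a
    list of course strings that must all be taken (joined by And).
    """
    completed = set(completed)
    return any(all(course in completed for course in option) for option in home)


# hypothetical entry in the same shape the parser writes out
entry = {
    "away": ["MATH 1A - Calculus (4.00)"],
    "home": [
        ["MATH 1A - Calculus (5.00)", "MATH 1B - Calculus (5.00)"],
        ["MATH 1AH - Calculus - HONORS (5.00)", "MATH 1BH - Calculus - HONORS (5.00)"],
    ],
}

# only half of the first alternative has been taken
print(is_satisfied(entry["home"], ["MATH 1A - Calculus (5.00)"]))  # False
# the whole first alternative has been taken
print(is_satisfied(entry["home"], ["MATH 1A - Calculus (5.00)", "MATH 1B - Calculus (5.00)"]))  # True
```

Matching on the full course string is the simplest choice; a real audit would probably want to match on the course code alone.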

SCOPE - De Anza College
UCs first
CSUs

In [16]:
import pandas as pd
import re
import json
from pypdf import PdfReader 

class Parser:
    """
    A class used to parse agreement pdf files from assist.org

    ...

    Attributes
    ----------
    id : str
        the id of the agreement being parsed
    filename : str
        the location and name of the .pdf file to be parsed
    reader : PdfReader
        PdfReader object from the pypdf library
    ...
    
    Methods
    -------
    
    """

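    def __init__(self, id=None):
        # define the attributes up front so they exist even when no id is given
        self._id = None
        self._filename = None
        self._reader = None
        if id is not None:
            # store the id as a string, since parse() concatenates it into a path
            self._id = str(id)
            self._filename = "./pdfs/" + self._id + ".pdf"
            self._reader = PdfReader(self._filename)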
            
    def parse(self):
        """
        Parses the agreement PDF and writes the away/home course pairs
        as JSON lines to ./parsed/<id>.txt
        """
        parts = []
        conjunctions = ['←', 'And', 'Or']
        
        def visitor_body(text, cm, tm, font_dict, font_size):
            # strip zero-width spaces and surrounding whitespace
            text = text.replace('\u200b', '').strip()
            if not text:
                return
            # tm[4] is the horizontal offset in the text matrix; under 500 is treated as the left column
            side = 0 if tm[4] < 500 else 1
            font = font_dict['/BaseFont']
            # if font == '/SegoeUIBold' and re.match(r'^(\S+\s)+\d+\w*$', text):
            #     parts.append([text, 0])
            if font == '/SegoeUIBold' and font_size == 19.0:
                parts.append([text, 0, side])
            # raw string avoids invalid-escape warnings for \S, \s, and \d
            elif font == '/SegoeUIRegular' and re.match(r'^(\S+\s)+\(\d+\.\d+\)$', text):
                parts.append([text, 2, side])
            elif font == '/SegoeUIRegular' and font_size == 19.0:
                parts.append([text, 1, side])
            elif text in conjunctions:
                parts.append([text, 3, side])
        
        for page in self._reader.pages:
            page.extract_text(visitor_text=visitor_body)
        
        """
        Creates a list separated by side switches of lists of cleaned text
        
        step : 
            {0: "title", 1: "description", 2: "description end", 3: "conjunction"}
        side :
            {0: "left", 1: "right"}
        """
        step = 0
        side = 0 
        temp = []
        separated = []
        
        for i in parts:
            val = i[0]
            newstep = i[1]
            newside = i[2]
            # removes double conjunction errors
            if step == 3 and newstep == 3:
                temp = temp[:-1]
            # creates a new entry
            elif side != newside:
                separated.append(temp[:])
                temp = [val]
                step = newstep
            # continues
            else:
                temp.append(val)
                step = newstep
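            # switches to (or stays on) newside
            side = newside

        # the loop only appends a group when the side switches, so flush the last group
        if temp:
            separated.append(temp[:])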

        """
        Creates a list separated by side switches of lists grouped logically by the And and Or conjunctions
        """
        agreements = []
        
        for i in separated:
            course = []
            courses = []
            sets = []
            for j in i:
                # if element is not a conjunction, append to courses
                if j not in conjunctions:
                    course.append(j)
                # if element is a conjunction, append and stringify course to courses, clear course, then check if element is 'Or'
                else:
                    courses.append(" ".join(course))
                    course = []
                    # if element is 'Or', append courses to the set of courses, clear courses
                    if j == 'Or':
                        sets.append(courses[:])
                        courses = []
            # in the event that the list does not end in a conjunction, append course to courses
            if len(course) > 0:
                courses.append(" ".join(course))
            # an element of separated will never end in 'Or', so append the final courses list to sets
            sets.append(courses[:])
            # finally, append the sets of courses to agreements
            agreements.append(sets[:])
            
        # print(*agreements, sep='\n')

        """
        Creating pairs of agreements and writing them in JSON to a .txt file
        """
        with open('./parsed/' + self._id + '.txt', 'w') as f:
            for i in range(len(agreements)//2):
                entry = {}
                entry["away"] = agreements[i*2][0]
                entry["home"] = agreements[i*2+1]
                # the with block closes the file, so no explicit close() is needed
                f.write(json.dumps(entry) + '\n')
            
            
    def set_id(self, id):
        """
        Setter function for id parameter
        
        ...
        
        Parameters
        ----------
        id : str
            id number for desired agreement

        ...
        
        Raises
        ------
        TypeError
            If type of id is not str
        """
        if isinstance(id, str):
            self._id = id
            self._filename = "./pdfs/" + str(id) +".pdf"
            self._reader = PdfReader(self._filename)
        else:
            raise TypeError("This ID is not a string.")

thing = Parser()
thing.set_id("26274157")
thing.parse()

[['MATH 1A - Calculus (4.00)']]
[['MATH 1A - Calculus (5.00)', 'MATH 1B - Calculus (5.00)'], ['MATH 1AH - Calculus - HONORS (5.00)', 'MATH 1BH - Calculus - HONORS (5.00)']]
[['MATH 10A - Methods of Mathematics: Calculus, Statistics, and Combinatorics (4.00)']]
[['No Course Articulated']]
[['MATH 16A - Analytic Geometry and Calculus (3.00)']]
[['MATH 1A - Calculus (5.00)'], ['MATH 1AH - Calculus - HONORS (5.00)']]
[['COMPSCI 61A - The Structure and Interpretation of Computer Programs (4.00)']]
[['No Course Articulated']]
[['ENGIN 7 - Introduction to Computer Programming for Scientists and Engineers (MATLAB) (4.00)']]
[['No Course Articulated']]
[['COMPSCI C88C - Computational Structures in Data Science (3.00) Same-As: DATA C88C']]
[['This course must be taken at the university after transfer']]
[['COMPSCI C8 - Foundations of Data Science (4.00) Same-As: STAT C8, INFO C8, DATA C8']]
[['No Course Articulated']]
[['STAT 2 - Introduction to Statistics (4.00)']]
[['MATH 10 - Introductory Sta
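
The conjunction-grouping step inside `parse` can be tried on its own, without a PDF. The sketch below lifts that loop into a hypothetical standalone function, `group_by_conjunctions`, and feeds it hand-written tokens modeled on the output above. It is meant only to show how And and Or turn a flat list into alternatives, not as a replacement for the method.

```python
def group_by_conjunctions(tokens, conjunctions=('←', 'And', 'Or')):
    """Group a flat token list into Or-separated alternatives of And-joined courses."""
    course, courses, sets = [], [], []
    for token in tokens:
        if token not in conjunctions:
            # still collecting the pieces of one course description
            course.append(token)
        else:
            # any conjunction ends the current course
            courses.append(" ".join(course))
            course = []
            # 'Or' additionally ends the current alternative
            if token == 'Or':
                sets.append(courses)
                courses = []
    # a trailing course with no conjunction after it
    if course:
        courses.append(" ".join(course))
    sets.append(courses)
    return sets


# hand-written tokens, not extracted from a real PDF
tokens = [
    'MATH 1A - Calculus (5.00)', 'And', 'MATH 1B - Calculus (5.00)',
    'Or',
    'MATH 1AH - Calculus - HONORS (5.00)', 'And', 'MATH 1BH - Calculus - HONORS (5.00)',
]
for alternative in group_by_conjunctions(tokens):
    print(alternative)
```

Each printed list is one way to satisfy the requirement: every course in the list is needed, and any one list is enough.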

In [18]:
thing = Parser()
with open('./data/uc_agreement_keys.txt', 'r') as f:
    for entry in f:
        temp = json.loads(entry)
        if temp['school'] == "University of California, Berkeley": 
            thing.set_id(str(temp['key']))
            thing.parse()

[['ENGLISH 17 - Shakespeare (4.00)']]
[['ELIT 17 - Introduction to Shakespeare (4.00)'], ['ELIT 17H - Introduction to Shakespeare - HONORS (4.00)']]
[['ENGLISH 45A - Literature in English (Through Milton) (4.00)']]
[['ELIT 46A - Major British Writers (Medieval and Renaissance) (4.00)', 'ELIT 46B - Major British Writers (Neo-Classical and Romantic) (4.00)'], ['ELIT 46AH - Major British Writers (Medieval and Renaissance) - HONORS (4.00)', 'ELIT 46BH - Major British Writers (Neo-Classical and Romantic) - HONORS (4.00)']]
[['ENGLISH 45B - Literature in English (Late 17th through the mid-19th Century) (4.00)']]
[['ELIT 46B - Major British Writers (Neo-Classical and Romantic) (4.00)', 'ELIT 48A - Major American Writers (Colonial and Romantic, 1620-1865) (4.00)'], ['ELIT 46BH - Major British Writers (Neo-Classical and Romantic) - HONORS (4.00)', 'ELIT 48AH - Major American Writers (Colonial to Romantic, 1620-1865) - HONORS (4.00)']]
[['ENGLISH 45C - Literature in English (Mid-19th through the

KeyboardInterrupt: 
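
Since the run above was interrupted partway through, one possible way to make the batch resumable is to skip any agreement whose output file already exists. The sketch below assumes the key file is JSON lines with `school` and `key` fields (as in the cell above) and that output lands at `<parsed_dir>/<key>.txt` (as in `parse`). The helper name `pending_keys` and the demo keys are hypothetical.

```python
import json
import os
import tempfile


def pending_keys(lines, school, parsed_dir):
    """Yield agreement keys for `school` that have no output file in `parsed_dir` yet."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record['school'] != school:
            continue
        key = str(record['key'])
        if not os.path.exists(os.path.join(parsed_dir, key + '.txt')):
            yield key


# demo with made-up keys in a temporary directory
with tempfile.TemporaryDirectory() as parsed_dir:
    # pretend agreement 111 was already parsed
    open(os.path.join(parsed_dir, '111.txt'), 'w').close()
    lines = [
        '{"school": "University of California, Berkeley", "key": 111}',
        '{"school": "University of California, Berkeley", "key": 222}',
        '{"school": "University of California, Davis", "key": 333}',
    ]
    todo = list(pending_keys(lines, "University of California, Berkeley", parsed_dir))
    print(todo)  # ['222']
```

In the notebook, the open key file could be passed in directly as `lines`, with `set_id` and `parse` called on each yielded key. One caveat: a file left half-written by an interrupt would also be skipped, so it may be worth deleting the most recent output before resuming.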