
#ToDo:

This program scans York's category pages to compile a list of all AQ/ABQ courses offered, then checks each course page to see whether the course is offered in the specified term. The name and URL of each course offered that term are then saved locally as a web page.

GitHub URL: https://github.com/pbeens/python/blob/master/York_AQ_ABQ_Courses.ipynb

Colab URL: https://colab.research.google.com/drive/1BKyub3iYTUgvGmmami1hl1YEQHPjP6Zs

York has a main page with links to the category pages:
https://www.yorku.ca/edu/professional-learning/aq-abq-pqp-courses/

* 3 Part AQs
* ABQs
* Schedule C AQs
* Honour Specialist AQs
* PQPs

Each category page links to the individual course pages. Every offering of a course has its own page, so we identify the offerings for a given term by searching each page for that term's registration deadline. The course name is extracted from the page's H1 tag.

Using this tutorial for guidance:
https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/
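
To make the page structure described above concrete, here is a minimal sketch that pulls the H1 text and the link targets out of a small HTML fragment. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs. The fragment, the example.org URL, and the class name `CoursePageParser` are illustrative assumptions, not York's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical fragment standing in for a course page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <h1>English - Senior Division</h1>
  <a href="https://example.org/pdis/course/abq/english-senior-division/xx00abc1">Register</a>
  <a>No href here</a>
</body></html>
"""


class CoursePageParser(HTMLParser):
    """Collects the text of the first <h1> and every href on the page."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title_done = False
        self.title_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1' and not self.title_done:
            self.in_h1 = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip anchors that have no href attribute
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'h1' and self.in_h1:
            self.in_h1 = False
            self.title_done = True

    def handle_data(self, data):
        if self.in_h1:
            self.title_parts.append(data)

    @property
    def title(self):
        return ''.join(self.title_parts).strip()


parser = CoursePageParser()
parser.feed(SAMPLE_HTML)
print(parser.title)
print(parser.links)
```

The cells below do the same job with `soup.find('h1')` and a loop over the `a` tags, which is shorter once BeautifulSoup is available.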

In [None]:
# imports
from bs4 import BeautifulSoup
import urllib.request

In [None]:
main_url = 'https://www.yorku.ca/edu/professional-learning/aq-abq-pqp-courses/'

page = urllib.request.urlopen(main_url)

#
# ### warm fuzzy stuff - comment out as appropriate ###
#
# return_code = page.getcode()
# print(f'return_code = {return_code}')

# content = page.read()  # note: reading here consumes the response before BeautifulSoup sees it
# print(content)

# convert page to soup object
soup = BeautifulSoup(page, "html.parser")

# ### warm fuzzy stuff - comment out as appropriate ###
# print(soup.prettify())
# print(list(soup.children))
# print([type(item) for item in list(soup.children)])



In [None]:
#
# Get the category links 
#
category_links = []

print(f'Finding category links in {main_url}... ')

for link in soup.find_all('a', href=True):  # find_all replaces the deprecated findAll
    s = link['href']
    if 'pdis/web/' in s:
        category_links.append(s)

category_links.sort()

# warm fuzzy
for link in category_links:
    print(link)

print('Category links found.')

Finding category links in https://www.yorku.ca/edu/professional-learning/aq-abq-pqp-courses/... 
https://apps.edu.yorku.ca/pdis/web/3_part_aqs.html
https://apps.edu.yorku.ca/pdis/web/abqs.html
https://apps.edu.yorku.ca/pdis/web/honour_specialist_aqs.html
https://apps.edu.yorku.ca/pdis/web/pqps.html
https://apps.edu.yorku.ca/pdis/web/schedule_c_aqs.html
Category links found.
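
The hrefs printed above are already absolute, so they can be passed to `urlopen` as they are. If a page ever used relative links, they would need to be resolved against the page URL first. A small sketch of that step with `urllib.parse.urljoin`; the second and third href values are hypothetical and only illustrate the two relative forms.

```python
from urllib.parse import urljoin

main_url = 'https://www.yorku.ca/edu/professional-learning/aq-abq-pqp-courses/'

# One absolute href (from the output above) plus two hypothetical relative ones.
hrefs = [
    'https://apps.edu.yorku.ca/pdis/web/abqs.html',
    '/pdis/web/pqps.html',   # root-relative (assumed for illustration)
    'details.html',          # relative to the current page (assumed)
]

# urljoin leaves absolute URLs untouched and resolves the others against main_url.
resolved = [urljoin(main_url, href) for href in hrefs]
for url in resolved:
    print(url)
```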


In [None]:
# 
# Get the course links from each category link
#
course_links = []

for category_link in category_links:
    page = urllib.request.urlopen(category_link)
    soup = BeautifulSoup(page, "html.parser")
    for link in soup.find_all('a', href=True):
        s = link['href']
        if '/pdis/course/' in s and not s.endswith('/'):
            course_links.append(s)

course_links = sorted(set(course_links))  # drop duplicates, then sort

# warm fuzzy
for link in course_links:
    print(link)

print('Course links found.')

http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/dramatic-arts-intermediate-division/vf22ind1
http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/english-intermediate-division/yf22ine1
http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/english-senior-division/yf22sbe1
http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/first-nations-m-tis-and-inuit-studies-intermediate-division/yf22inns1
http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/first-nations-m-tis-and-inuit-studies-senior-division/yf22sns1
http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/health-and-physical-education-intermediate-division/vf22inp1
http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/health-physical-education-senior-division/vf22sps1
http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/history-intermediate-division/yf22inh1
http://apps.edu.yorku.ca/pdis/course/additional
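
The next cell requests every course page in turn, so a single timeout or HTTP error would stop the whole run. One possible guard is a small wrapper that returns `None` on failure and lets the loop skip that page. The helper name `fetch_html` and the 10-second timeout are assumptions for this sketch; the `data:` URL is used only so the example runs without network access.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except OSError as err:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        print(f'Skipping {url}: {err}')
        return None


# Offline demonstration: a data: URL succeeds, an unknown scheme fails cleanly.
print(fetch_html('data:text/html,<h1>Demo</h1>'))
print(fetch_html('nosuch://example'))
```

In a crawl loop, the caller would check for `None` and `continue` before handing the text to `BeautifulSoup`.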

In [None]:
#
# Process all the course links to find which courses are running
#
# Set the deadline string below to match the term you are searching for
#
deadline = 'Registration Deadline: Sep 27, 2022'
courses_this_term = [] # url[]
count = 0

print('Finding courses this term...')

for course_link in course_links:
    # print(course_link)
    page = urllib.request.urlopen(course_link)
    soup = BeautifulSoup(page, "html.parser")
    text = soup.get_text()
    if deadline in text:
        courses_this_term.append(course_link)
        count += 1
        print(count, course_link)

print(f'\nFound {count} courses this term.')

Finding courses this term...
1 http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/dramatic-arts-intermediate-division/vf22ind1
2 http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/english-intermediate-division/yf22ine1
3 http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/english-senior-division/yf22sbe1
4 http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/first-nations-m-tis-and-inuit-studies-intermediate-division/yf22inns1
5 http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/first-nations-m-tis-and-inuit-studies-senior-division/yf22sns1
6 http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/health-and-physical-education-intermediate-division/vf22inp1
7 http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/health-physical-education-senior-division/vf22sps1
8 http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/history-intermediate-division/yf22inh1
9 
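
Matching the full deadline string means editing the code every term, and a small change in the page wording would silently produce no matches. An alternative sketch extracts whatever date follows "Registration Deadline:" and parses it into a `date`, which could then be compared against a term's date range. The helper name `find_deadline` and the sample text are assumptions; the date format is inferred from the string used in the cell above, and `%b` relies on English month abbreviations in the current locale.

```python
import re
from datetime import date, datetime

# Pattern inferred from the string 'Registration Deadline: Sep 27, 2022'.
DEADLINE_RE = re.compile(r'Registration Deadline:\s*([A-Za-z]{3} \d{1,2}, \d{4})')


def find_deadline(text):
    """Return the first registration deadline in text as a date, or None."""
    match = DEADLINE_RE.search(text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), '%b %d, %Y').date()


# Hypothetical page text; real pages contain much more around the deadline.
sample_text = 'Course details ... Registration Deadline: Sep 27, 2022 ...'
parsed = find_deadline(sample_text)
print(parsed)
print(parsed == date(2022, 9, 27))
```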

In [None]:
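# Map "<course name>-<section code>" to the course URL
term_course_dict = {}
for url in courses_this_term:
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    subject = soup.find('h1').get_text()  # course name comes from the H1 tag
    course_section = url.split('/')[-1]  # last path segment, e.g. 'vf22ind1'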
    course_plus_section = subject + '-' + course_section
    term_course_dict[course_plus_section] = url
    print(f'{course_plus_section}\n\t{url}')
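
Taking the last element of `url.split('/')` works for the URLs collected here, but it would return an empty string or a query fragment if a link ever ended in `/` or carried a query string. A slightly more defensive sketch using `urllib.parse.urlparse`; the helper name `section_code` and the `?ref=home` suffix are hypothetical.

```python
from urllib.parse import urlparse


def section_code(url):
    """Return the last non-empty path segment of url, ignoring any query string."""
    path = urlparse(url).path
    segments = [part for part in path.split('/') if part]
    return segments[-1] if segments else ''


url = ('http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/'
       'english-senior-division/yf22sbe1')
print(section_code(url))
print(section_code(url + '/?ref=home'))  # trailing slash and query are ignored
```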

In [None]:
print(term_course_dict)

{'Dramatic Arts - Intermediate Division-vf22ind1': 'http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/dramatic-arts-intermediate-division/vf22ind1', 'English - Intermediate Division-yf22ine1': 'http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/english-intermediate-division/yf22ine1', 'English - Senior Division-yf22sbe1': 'http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/english-senior-division/yf22sbe1', 'First Nations, Métis and Inuit Studies - Intermediate Division-yf22inns1': 'http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/first-nations-m-tis-and-inuit-studies-intermediate-division/yf22inns1', 'First Nations, Métis and Inuit Studies - Senior Division-yf22sns1': 'http://apps.edu.yorku.ca/pdis/course/additional-basic-qualifications/first-nations-m-tis-and-inuit-studies-senior-division/yf22sns1', 'Health and Physical Education - Intermediate Division-vf22inp1': 'http://apps.edu.yorku.ca/pdis/course/addition

In [None]:
# test section to check the contents of term_course_dict
for (k, v) in term_course_dict.items():
  print(f'{k}: {v}')

In [None]:
# create html file with course listings
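file = './York-courses.html'
with open(file, 'w', encoding='utf-8') as f:  # UTF-8 so names such as "Métis" are written correctly
  f.write('<HTML>\n<HEAD>\n\t<meta charset="utf-8">\n\t<TITLE>York Courses</TITLE>\n</HEAD>\n<BODY>\n')
  for course, url in term_course_dict.items():
    f.write(f'\t<a href="{url}">{course}</a><br>\n')
  f.write('</BODY>\n</HTML>')  # closing tag was previously written as <HTML>
# the with block closes the file, so no explicit f.close() is needed
print(f'{file} created.')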

./York-courses.html created.
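
Course names and URLs are written into the page unescaped, so a title containing `&` or `<` would produce invalid markup. A possible refinement builds the page as a string and passes each value through the standard library's `html.escape`. The function name `build_course_page` and the sample dictionary are illustrative assumptions.

```python
from html import escape


def build_course_page(courses, title='York Courses'):
    """Return a minimal HTML page linking each course name to its URL."""
    lines = [
        '<html>',
        '<head>',
        '\t<meta charset="utf-8">',
        f'\t<title>{escape(title)}</title>',
        '</head>',
        '<body>',
    ]
    for name, url in courses.items():
        # quote=True also escapes double quotes so the href attribute stays intact.
        lines.append(f'\t<a href="{escape(url, quote=True)}">{escape(name)}</a><br>')
    lines += ['</body>', '</html>']
    return '\n'.join(lines) + '\n'


# Hypothetical entry chosen to contain characters that need escaping.
sample = {'Health & Physical Education-xx00abc1': 'https://example.org/course?id=1&term=f'}
html_text = build_course_page(sample)
print(html_text)
```

The returned string could then be written with `open(file, 'w', encoding='utf-8')` in place of the separate `f.write` calls.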
