***
# Fundamentals of Data Analytics Assessment - CAO Points Analysis

***

"The purpose of the Central Applications Office (CAO) is to process centrally applications for undergraduate courses in Irish Higher Education Institutions" [1]. 

<br>

[Regular Expression Documentation](https://docs.python.org/3/library/re.html)

#### Importing Libraries 

In [1]:

# Convenient HTTP requests.
import requests as rq

# Regular expressions.
import re

# Dates and times.
import datetime as dt

# Data frames.
import pandas as pd

# For downloading.
import urllib.request as urlrq

<br>

In [2]:
# Get the current date and time.
now = dt.datetime.now()

# Format as a string.
nowstr = now.strftime('%Y%m%d_%H%M%S')
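
As a quick illustration of the format string, the sketch below formats a fixed timestamp instead of the current time. The fixed value is an assumption chosen only so that the result is reproducible.

```python
import datetime as dt

# A fixed timestamp, used here only so the result is reproducible.
example = dt.datetime(2021, 11, 23, 20, 39, 22)

# Year, month, day, an underscore, then hours, minutes and seconds.
example_str = example.strftime('%Y%m%d_%H%M%S')
print(example_str)
```

Because the fields run from largest to smallest unit, file names built from this string sort chronologically.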

<br>

****

## 2021 CAO Points
[2021 CAO points](http://www2.cao.ie/points/l8.php)
***


<br>

#### Server Request

In [3]:
# Fetch the CAO points URL.
resp = rq.get('http://www2.cao.ie/points/l8.php')

# 200 = ok. 404 = error: not found
resp

<Response [200]>

<br>

<br>

### Save Original Dataset

***

In [4]:
# Create a file path for the original data.
pathhtml = 'datasets/cao2021_' + nowstr + '.html'

Error on server

Technically, the server says we should decode as per:

    Content-Type: text/html; charset=iso-8859-1

However, one line uses \x96, which isn't defined in iso-8859-1.

Therefore we use cp1252, a very similar encoding that does define \x96.
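
A minimal sketch of the difference, using only the standard library. Note that Python's iso-8859-1 codec does not raise an error on this byte; it maps it to an invisible control character, whereas cp1252 maps it to an en dash.

```python
import unicodedata

# The single byte that causes the trouble.
raw = b'\x96'

# Decode it with each of the two encodings.
as_cp1252 = raw.decode('cp1252')
as_latin1 = raw.decode('iso-8859-1')

# cp1252 gives a printable en dash.
print(repr(as_cp1252), unicodedata.name(as_cp1252))

# iso-8859-1 gives the control character U+0096.
print(repr(as_latin1), unicodedata.category(as_latin1))
```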

In [5]:
# The server reports the wrong encoding; keep a note of it before overriding it.
original_encoding = resp.encoding

# Change to cp1252.
resp.encoding = 'cp1252'

In [6]:
# Save the original html file.
with open(pathhtml, 'w') as f:
    f.write(resp.text)

## Use regular expressions to select lines we want

In [7]:
# Compile the regular expression for matching lines.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)')
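
A short sketch of what this pattern accepts. Both sample lines are hypothetical, made up for illustration rather than copied from the CAO page.

```python
import re

re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)')

# Hypothetical lines, made up for illustration.
course_line = 'AL801  Software Design for Virtual Reality and Gaming  300'
other_line = '<p>Level 8 courses</p>'

# fullmatch succeeds only if the whole line fits the pattern.
m = re_course.fullmatch(course_line)
print(m.group(1))

# A line that does not start with a course code gives None.
print(re_course.fullmatch(other_line))
```

The first group captures the five-character course code and the second captures the rest of the line.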

## Loop through the lines of the response

In [8]:
# The file path for the csv file.
path2021 = 'datasets/cao2021_csv_' + nowstr + '.csv'

In [9]:
# Keep track of how many courses we process.
no_lines = 0

# Open the csv file for writing, using the same encoding it is later read with.
with open(path2021, 'w', encoding='cp1252') as f:
    # Write a header row.
    f.write(','.join(['code', 'title', 'pointsR1', 'pointsR2']) + '\n')
    # Loop through lines of the response.
    for line in resp.iter_lines():
        # Decode the line as cp1252 rather than the encoding the server reports.
        dline = line.decode('cp1252')
        # Match only the lines representing courses.
        if re_course.fullmatch(dline):
            # Add one to the lines counter.
            no_lines = no_lines + 1
            # The course code.
            course_code = dline[:5]
            # The course title.
            course_title = dline[7:57].strip()
            # Round one and round two points, separated by runs of spaces.
            course_points = re.split(' +', dline[60:])
            # Keep only the first two fields if there are extra ones.
            if len(course_points) != 2:
                course_points = course_points[:2]
            # Collect the fields for this course.
            linesplit = [course_code, course_title, course_points[0], course_points[1]]
            # Join the fields with commas and write the row.
            f.write(','.join(linesplit) + '\n')

# Print the total number of processed lines.
print(f"Total number of lines is {no_lines}.")

Total number of lines is 949.


NB: As of 03/11/2021, it was verified that the CAO 2021 points list contained exactly 949 courses.
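
Joining fields with `','.join` produces a malformed row if a course title itself contains a comma. A minimal sketch of one alternative, assuming the same column offsets as the loop above, uses the standard `csv` module, which quotes such fields. The helper name `parse_course_line`, the course code `XX999` and the sample title are hypothetical.

```python
import csv
import io
import re


def parse_course_line(dline):
    """Split one fixed-width course line into code, title, R1 and R2 points."""
    course_code = dline[:5]
    course_title = dline[7:57].strip()
    # Pad with empty strings so a missing round still yields two fields.
    points = (re.split(' +', dline[60:].strip()) + ['', ''])[:2]
    return [course_code, course_title, points[0], points[1]]


# Hypothetical line: code, two spaces, title padded to 50 characters, then points.
title = 'Arts, Joint Honours'
sample = 'XX999' + '  ' + title.ljust(50) + '   ' + '300' + '     ' + '298'

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['code', 'title', 'pointsR1', 'pointsR2'])
writer.writerow(parse_course_line(sample))

# Read the text back to confirm the title survives as a single field.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1])
```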

In [10]:
df2021 = pd.read_csv(path2021, encoding='cp1252')

In [11]:
df2021

|  | code | title | pointsR1 | pointsR2 |
|---|---|---|---|---|
| 0 | AL801 | Software Design for Virtual Reality and Gaming | 300 |  |
| 1 | AL802 | Software Design in Artificial Intelligence for... | 313 |  |
| 2 | AL803 | Software Design for Mobile Apps and Connected ... | 350 |  |
| 3 | AL805 | Computer Engineering for Network Infrastructure | 321 |  |
| 4 | AL810 | Quantity Surveying | 328 |  |
| ... | ... | ... | ... | ... |
| 944 | WD211 | Creative Computing | 270 |  |
| 945 | WD212 | Recreation and Sport Management | 262 |  |
| 946 | WD230 | Mechanical and Manufacturing Engineering | 230 | 230 |
| 947 | WD231 | Early Childhood Care and Education | 266 |  |


## 2020 Points
https://www.cao.ie/index.php?page=points&p=2020

In [12]:
url2020 = 'http://www2.cao.ie/points/CAOPointsCharts2020.xlsx'

Save Original File

In [13]:
# Create a file path for the original data.
pathxlsx = 'datasets/cao2020_' + nowstr + '.xlsx'

In [14]:
urlrq.urlretrieve(url2020, pathxlsx)

('datasets/cao2020_20211123_203922.xlsx',
 <http.client.HTTPMessage at 0x28e3405fc10>)

Load Spreadsheet using pandas

In [15]:
# Download and parse the excel spreadsheet.
df2020 = pd.read_excel(url2020, skiprows=10)


In [16]:
df2020

|  | CATEGORY (i.e.ISCED description) | COURSE TITLE | COURSE CODE2 | R1 POINTS | R1 Random \* | R2 POINTS | R2 Random\* | EOS | EOS Random \* | EOS Mid-point | ... | avp | v | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 | Column8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Business and administration | International Business | AC120 | 209 |  |  |  | 209 |  | 280 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | Humanities (except languages) | Liberal Arts | AC137 | 252 |  |  |  | 252 |  | 270 | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | Arts | First Year Art & Design (Common Entry,portfolio) | AD101 | #+matric |  |  |  | #+matric |  | #+matric | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | Arts | Graphic Design and Moving Image Design (portfo... | AD102 | #+matric |  |  |  | #+matric |  | #+matric | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | Arts | Textile & Surface Design and Jewellery & Objec... | AD103 | #+matric |  |  |  | #+matric |  | #+matric | ... |  |  |  |  |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1459 | Manufacturing and processing | Manufacturing Engineering | WD208 | 188 |  |  |  | 188 |  | 339 | ... |  |  |  |  |  |  |  |  |  |  |
| 1460 | Information and Communication Technologies (ICTs) | Software Systems Development | WD210 | 279 |  |  |  | 279 |  | 337 | ... |  |  |  |  |  |  |  |  |  |  |
| 1461 | Information and Communication Technologies (ICTs) | Creative Computing | WD211 | 271 |  |  |  | 271 |  | 318 | ... |  |  |  |  |  |  |  |  |  |  |
| 1462 | Personal services | Recreation and Sport Management | WD212 | 270 |  |  |  | 270 |  | 349 | ... |  |  |  |  |  |  |  |  |  |  |


In [17]:
# Spot check an arbitrary row.
df2020.iloc[1000]

CATEGORY (i.e.ISCED description)    Engineering and engineering trades
COURSE TITLE                                    Mechanical Engineering
COURSE CODE2                                                     SG333
R1 POINTS                                                          216
R1 Random *                                                        NaN
R2 POINTS                                                          NaN
R2 Random*                                                         NaN
EOS                                                                216
EOS Random *                                                       NaN
EOS Mid-point                                                      347
LEVEL                                                                7
HEI                                     Institute of Technology, Sligo
Test/Interview #                                                   NaN
avp                                                                NaN
v     

In [18]:
# Spot check the last row.
df2020.iloc[-1]

CATEGORY (i.e.ISCED description)          Engineering and engineering trades
COURSE TITLE                        Mechanical and Manufacturing Engineering
COURSE CODE2                                                           WD230
R1 POINTS                                                                253
R1 Random *                                                              NaN
R2 POINTS                                                                NaN
R2 Random*                                                               NaN
EOS                                                                      253
EOS Random *                                                             NaN
EOS Mid-point                                                            369
LEVEL                                                                      8
HEI                                        Waterford Institute of Technology
Test/Interview #                                                         NaN

In [19]:
# Create a file path for the pandas data.
path2020 = 'datasets/cao2020_' + nowstr + '.csv'

In [20]:
# Save pandas data frame to disk.
df2020.to_csv(path2020)

## 2019 Points

https://www.cao.ie/index.php?page=points&p=2019

In [71]:
import camelot

In [72]:
pdf = camelot.read_pdf('datasets/2019_points.pdf', pages='all')

In [73]:
print(type(pdf))

<class 'camelot.core.TableList'>


In [74]:
# Checking the number of tables, should be 18
pdf

<TableList n=18>

In [75]:
# Exporting tables into a csv file
pdf.export('datasets/2019_points.csv', f='csv', compress=True)

In [76]:
# checking to make sure it worked
pdf[1].parsing_report

{'accuracy': 100.0, 'whitespace': 2.73, 'order': 1, 'page': 2}

In [102]:
# Unzipping the export - multiple tables are exported as a zip
from zipfile import ZipFile

with ZipFile('datasets/2019_points.zip', 'r') as zf:
    # Extract all the contents of the zip file into the 2019_points directory
    zf.extractall('2019_points')

In [103]:
import os, glob

# Collect the exported 2019 csv files and concatenate them into one data frame.
path = "FofData-assessment/datasets/"
filelist = glob.glob(path + '2019_*.csv')
frame = pd.concat([pd.read_csv(file, names=['Course Code', 'Title', 'EOS', 'Mid']) for file in filelist])
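
The unzip and glob steps only line up if both point at the same directory. A minimal, self-contained sketch of that round trip follows, using a temporary directory and a tiny stand-in archive. The member file names are an assumption about how the export names its files, and the standard `csv` module stands in for `pd.read_csv` and `pd.concat`.

```python
import csv
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    zip_path = tmp / '2019_points.zip'

    # Build a tiny stand-in archive; the member names are assumed.
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.writestr('2019_points-page-1-table-1.csv',
                    'AL801,Software Design with Virtual Reality and Gaming,304,328\n')
        zf.writestr('2019_points-page-2-table-1.csv',
                    'AL802,Software Design with Cloud Computing,301,306\n')

    # Extract everything into one directory.
    out_dir = tmp / '2019_points'
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extractall(out_dir)

    # Glob from that same directory, sorted so the order is reproducible.
    filelist = sorted(out_dir.glob('2019_*.csv'))

    # Read every file and collect the rows into one list.
    rows = []
    for file in filelist:
        with open(file, newline='') as f:
            rows.extend(csv.reader(f))

print(len(filelist), rows)
```

Sorting the file list matters because `glob` does not guarantee any particular order, so the concatenated rows could otherwise come out in a different page order from run to run.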

In [104]:
frame

|  | Course Code | Title | EOS | Mid |
|---|---|---|---|---|
| 0 | Course Code INSTITUTION and COURSE |  | EOS | Mid |
| 1 |  | Athlone Institute of Technology |  |  |
| 2 | AL801 | Software Design with Virtual Reality and Gaming | 304 | 328 |
| 3 | AL802 | Software Design with Cloud Computing | 301 | 306 |
| 4 | AL803 | Software Design with Mobile Apps and Connected... | 309 | 337 |
| ... | ... | ... | ... | ... |
| 50 | TR032 | Engineering | 487* | 520.0 |
| 51 | TR033 | Computer Science | 465* | 488.0 |
| 52 | TR034 | Management Science and Information Systems Stu... | 589* | 602.0 |
| 53 | TR035 | Theoretical Physics | 554 | 601.0 |


### Webserver Error

***

Web server error: the server says to decode as:

    Content-Type: text/html; charset=iso-8859-1

However, one line uses \x96, which isn't defined in iso-8859-1.

Therefore, the similar encoding cp1252 was used, which does include \x96.

In [18]:
# Fixing the web server's wrong encoding
original_encoding = response.encoding

# Changing to cp1252
response.encoding = 'cp1252'

In [19]:
# Saving the original html file.
with open(path, 'w') as f: # Opening path in write mode
    f.write(response.text)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\katie\\Documents\\FofData-assessment\\datasets\\cao_data/cao2021_20211107_215405.html'
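
This `FileNotFoundError` is consistent with the target directory not existing when `open` is called, since `open` creates files but not folders. A minimal sketch, assuming that is the cause, creates the parent directories first. The temporary base directory and the file name `cao2021_example.html` are hypothetical stand-ins for the real path.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as base:
    # Hypothetical nested target, standing in for the path in the traceback.
    target = Path(base) / 'datasets' / 'cao_data' / 'cao2021_example.html'

    # Create any missing parent directories; no error if they already exist.
    target.parent.mkdir(parents=True, exist_ok=True)

    # Opening the file for writing now succeeds.
    with open(target, 'w', encoding='utf-8') as f:
        f.write('<html></html>')

    # Read it back to confirm the save worked.
    saved = target.read_text(encoding='utf-8')

print(saved)
```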

### Getting relevant data using Regular expressions

***

To get the relevant lines from the response, we use a regular expression. Compiling it once is more efficient than recompiling the expression every time it is used.

In [7]:
# Compile the regular expression for matching lines.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)([0-9]{3})(\*?) *') # The r prefix marks a raw string
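
A short sketch of what the four groups capture and how `sub` turns a matching line into a comma-separated row. The sample line, including the course code `XX999`, is hypothetical.

```python
import re

re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)([0-9]{3})(\*?) *')

# Hypothetical line, made up for illustration.
line = 'XX999  Example Course Title          487*  '

# The four groups: course code, title, points, optional asterisk.
m = re_course.fullmatch(line)
code, title, points, star = m.groups()
print(code, repr(title), points, star)

# Replace the whole match with the groups joined by commas.
csv_version = re_course.sub(r'\1,\2,\3,\4', line)
print(csv_version)
```

The title group keeps its trailing padding, so it may need `.strip()` before use. Because `(.*)` is greedy, a line carrying two points values would leave the first value inside the title group and capture only the last one as points.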

Loop through the response for matches 

In [8]:
# Path to csv file
path = 'cao_data/cao2021_csv_' + string_now + '.csv'

In [2]:
# Path to csv file
path = 'cao_data/cao2021_csv_' + string_now + '.csv'

# Keep track of how many courses we process.
no_lines = 0

# Open the csv file for writing.
with open(path, 'w') as f:
    # Loop through lines of the response.
    for line in response.iter_lines():
        # Decode the line as cp1252 rather than the encoding the server reports.
        dline = line.decode('cp1252')
        # Match only the lines representing courses.
        if re_course.fullmatch(dline):
            # Add one to the lines counter.
            no_lines = no_lines + 1
            # Rewrite the matched line as comma-separated fields.
            csv_version = re_course.sub(r'\1,\2,\3,\4', dline)
            #print(csv_version) # Print CSV style
            f.write(csv_version + '\n')

# Number of courses matched
print(f'Total number of lines matched: {no_lines}.')

NameError: name 'string_now' is not defined

## References

## End