***
# Fundamentals of Data Analytics Assessment - CAO Points Analysis

***

"The purpose of the Central Applications Office (CAO) is to process centrally applications for undergraduate courses in Irish Higher Education Institutions" [1]. 

<br>

#### Importing Libraries 

In [78]:

# Convenient HTTP requests.
import requests as rq

# Regular expressions.
import re

# Dates and times.
import datetime as dt

# Data frames.
import pandas as pd

# For downloading.
import urllib.request as urlrq

<br>

In [79]:
# Get the current date and time.
now = dt.datetime.now()

# Format as a string.
nowstr = now.strftime('%Y%m%d_%H%M%S')

<br>

****

## 2021 CAO Points
[2021 CAO points]('http://www2.cao.ie/points/l8.php')
***


### Level 8

#### Server Request

In [80]:
# Fetch the CAO points URL.
resp = rq.get('http://www2.cao.ie/points/l8.php')

# 200 = ok. 404 = error: not found
resp

<Response [200]>

<br>

<br>

### Save Orignal Dataset

***

<br>

In [81]:
# Create a file path for the original data.
pathhtml = 'datasets/cao2021_' + nowstr + '.html'

<br>

<br>

### Webserver Error

***

<br>

<br>

Webserver error - server says decode as:

    Content-Type: text/html; charset=iso-8859-1
However, one line uses \x96 which isn't defined in iso-8859-1. 

Therefore, a similar decoding standard cp1252 was used. It is similar but inculdes #\x96

<br>

<br>

In [82]:
# Fixing the webserves wrong encoding
original_encoding = resp.encoding

# Changing to cp1252
resp.encoding = 'cp1252'

<br>

<br>

In [83]:
# Save the original html file.
with open(pathhtml, 'w') as f: # Opening path in write mode
    f.write(resp.text)

<br>

<br>

### Getting relevant data using Regular expressions

***

[Regular Expression Documentation]('https://docs.python.org/3/library/re.html')

<br>

To get the relevant lines from the response request, we use a regular expression. It is more efficent that recalling the expression everytime.

<br>

<br>

In [84]:
# Compile regular expression for matching lines.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)') # r for raw data

<br>

<br>

Loop through the response for matches 

<br>

<br>

In [85]:
# Path to csv file
path2021 = 'datasets/cao2021_csv_' + nowstr + '.csv'

<br>

<br>

In [86]:
# KTracking number of courses matched
no_lines = 0



# Open the csv file for writing.
with open(path2021, 'w') as f:
    # Write a header row.
    f.write(','.join(['code', 'title', 'pointsR1', 'pointsR2']) + '\n')
    # Loop through lines of the response.
    for line in resp.iter_lines():
        # Decode the line, using the wrong encoding!
        dline = line.decode('cp1252')
        # Match only the lines representing courses.
        if re_course.fullmatch(dline):
            # Add one to the lines counter.
            no_lines = no_lines + 1
            # The course code.
            course_code = dline[:5]
            # The course title.
            course_title = dline[7:57].strip()
            # Round one points.
            course_points = re.split(' +', dline[60:])
            if len(course_points) != 2:
                course_points = course_points[:2]
            # Join the fields using a comma.
            linesplit = [course_code, course_title, course_points[0], course_points[1]]
            # Rejoin the substrings with commas in between.
            f.write(','.join(linesplit) + '\n')


            
# Number of courses matched
print(f"Total number of lines is {no_lines}.")

Total number of lines is 949.


<br>

<br>

NB: it was verified as of 03/11/2021 that there were 949 courses exactly in the CAO 2021 points list.

<br>

<br>

In [87]:
# Reading dataframe
df2021 = pd.read_csv(path2021, encoding='cp1252')

<br>

<br>

In [88]:
# Checking dataframe
df2021

Unnamed: 0,code,title,pointsR1,pointsR2
0,AL801,Software Design for Virtual Reality and Gaming,300,
1,AL802,Software Design in Artificial Intelligence for...,313,
2,AL803,Software Design for Mobile Apps and Connected ...,350,
3,AL805,Computer Engineering for Network Infrastructure,321,
4,AL810,Quantity Surveying,328,
...,...,...,...,...
944,WD211,Creative Computing,270,
945,WD212,Recreation and Sport Management,262,
946,WD230,Mechanical and Manufacturing Engineering,230,230
947,WD231,Early Childhood Care and Education,266,


<br>

<br>

<br>

### Level 6/7 Courses

<br>

<br>

#### Server Request

<br>

In [100]:
# Fetch the CAO points URL.
resp = rq.get('http://www2.cao.ie/points/l76.php')

# 200 = ok. 404 = error: not found
resp

<Response [200]>

<br>

<br>

### Save Orignal Dataset

***

<br>

In [101]:
# Create a file path for the original data.
pathhtml = 'datasets/cao2021_2' + nowstr + '.html'

<br>

<br>

### Webserver Error

***

<br>

<br>

Webserver error - server says decode as:

    Content-Type: text/html; charset=iso-8859-1
However, one line uses \x96 which isn't defined in iso-8859-1. 

Therefore, a similar decoding standard cp1252 was used. It is similar but inculdes #\x96

<br>

<br>

In [102]:
# Fixing the webserves wrong encoding
original_encoding = resp.encoding

# Changing to cp1252
resp.encoding = 'cp1252'

<br>

<br>

In [103]:
# Save the original html file.
with open(pathhtml, 'w') as f: # Opening path in write mode
    f.write(resp.text)

<br>

<br>

### Getting relevant data using Regular expressions

***

[Regular Expression Documentation]('https://docs.python.org/3/library/re.html')

<br>

To get the relevant lines from the response request, we use a regular expression. It is more efficent that recalling the expression everytime.

<br>

<br>

In [104]:
# Compile regular expression for matching lines.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)') # r for raw data

<br>

<br>

Loop through the response for matches 

<br>

<br>

In [105]:
# Path to csv file
path2021_2 = 'datasets/cao2021_2_csv_' + nowstr + '.csv'

<br>

<br>

In [106]:
# KTracking number of courses matched
no_lines = 0



# Open the csv file for writing.
with open(path2021_2, 'w') as f:
    # Write a header row.
    f.write(','.join(['code', 'title', 'pointsR1', 'pointsR2']) + '\n')
    # Loop through lines of the response.
    for line in resp.iter_lines():
        # Decode the line, using the wrong encoding!
        dline = line.decode('cp1252')
        # Match only the lines representing courses.
        if re_course.fullmatch(dline):
            # Add one to the lines counter.
            no_lines = no_lines + 1
            # The course code.
            course_code = dline[:5]
            # The course title.
            course_title = dline[7:57].strip()
            # Round one points.
            course_points = re.split(' +', dline[60:])
            if len(course_points) != 2:
                course_points = course_points[:2]
            # Join the fields using a comma.
            linesplit = [course_code, course_title, course_points[0], course_points[1]]
            # Rejoin the substrings with commas in between.
            f.write(','.join(linesplit) + '\n')


            
# Number of courses matched
print(f"Total number of lines is {no_lines}.")

Total number of lines is 416.


<br>

<br>

NB: it was verified as of 03/11/2021 that there were 949 courses exactly in the CAO 2021 points list.

<br>

<br>

In [107]:
# Reading dataframe
df2021_2 = pd.read_csv(path2021_2, encoding='cp1252')

<br>

<br>

In [108]:
# Checking dataframe
df2021_2

Unnamed: 0,code,title,pointsR1,pointsR2
0,AL605,Music and Instrument Technology,211,
1,AL630,Pharmacy Technician,308,
2,AL631,Dental Nursing,311,
3,AL632,Applied Science,297,
4,AL650,Business,AQA,AQA
...,...,...,...,...
411,WD188,Applied Health Care,220,
412,WD205,Molecular Biology with Biopharmaceutical Science,AQA,262v
413,WD206,Electronic Engineering,180,
414,WD207,Mechanical Engineering,172,


<br>

<br>

<br>

<br>

<br>

***

## 2020 Points
***

https://www.cao.ie/index.php?page=points&p=2020

<br>

<br>

Level 6, 7 and 8 courses are inculded in the same excel file

<br>

<br>

In [109]:
# Getting url 
url2020 = 'http://www2.cao.ie/points/CAOPointsCharts2020.xlsx'

<br>

<br>

Save Original File

<br>

<br>

In [110]:
# Create a file path for the original data.
pathxlsx = 'datasets/cao2020_' + nowstr + '.xlsx'

<br>

<br>

In [111]:
# Fetching data
urlrq.urlretrieve(url2020, pathxlsx)

('datasets/cao2020_20211125_221246.xlsx',
 <http.client.HTTPMessage at 0x20418d07d00>)

<br>

<br>

Load Spreadsheet using pandas

<br>

<br>

In [112]:
# Download and parse the excel spreadsheet. First few rows where a blurb
df2020 = pd.read_excel(url2020, skiprows=10)


In [113]:
df2020

Unnamed: 0,CATEGORY (i.e.ISCED description),COURSE TITLE,COURSE CODE2,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,...,avp,v,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,Business and administration,International Business,AC120,209,,,,209,,280,...,,,,,,,,,,
1,Humanities (except languages),Liberal Arts,AC137,252,,,,252,,270,...,,,,,,,,,,
2,Arts,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
3,Arts,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
4,Arts,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1459,Manufacturing and processing,Manufacturing Engineering,WD208,188,,,,188,,339,...,,,,,,,,,,
1460,Information and Communication Technologies (ICTs),Software Systems Development,WD210,279,,,,279,,337,...,,,,,,,,,,
1461,Information and Communication Technologies (ICTs),Creative Computing,WD211,271,,,,271,,318,...,,,,,,,,,,
1462,Personal services,Recreation and Sport Management,WD212,270,,,,270,,349,...,,,,,,,,,,


<br>

<br>

<br>

In [17]:
# Checking random row
df2020.iloc[1000]

CATEGORY (i.e.ISCED description)    Engineering and engineering trades
COURSE TITLE                                    Mechanical Engineering
COURSE CODE2                                                     SG333
R1 POINTS                                                          216
R1 Random *                                                        NaN
R2 POINTS                                                          NaN
R2 Random*                                                         NaN
EOS                                                                216
EOS Random *                                                       NaN
EOS Mid-point                                                      347
LEVEL                                                                7
HEI                                     Institute of Technology, Sligo
Test/Interview #                                                   NaN
avp                                                                NaN
v     

<br>

<br>

In [18]:
# -1 is always last row/element
df2020.iloc[-1]

CATEGORY (i.e.ISCED description)          Engineering and engineering trades
COURSE TITLE                        Mechanical and Manufacturing Engineering
COURSE CODE2                                                           WD230
R1 POINTS                                                                253
R1 Random *                                                              NaN
R2 POINTS                                                                NaN
R2 Random*                                                               NaN
EOS                                                                      253
EOS Random *                                                             NaN
EOS Mid-point                                                            369
LEVEL                                                                      8
HEI                                        Waterford Institute of Technology
Test/Interview #                                                         NaN

<br>

<br>

In [19]:
# Create a file path for the pandas data.
path2020 = 'datasets/cao2020_' + nowstr + '.csv'

<br>

<br>

In [20]:
# Save pandas data frame to disk.
df2020.to_csv(path2020)

<br>

<br>

<br>

<br>

***

## 2019 Points
***

https://www.cao.ie/index.php?page=points&p=2019

<br>

<br>

In [21]:
# For scraping data from pdf
import camelot

<br>

<br>

In [22]:
# Checking all pages of pdf for data
pdf = camelot.read_pdf('datasets/2019_points.pdf', pages='all')

<br>

<br>

In [23]:
# Checking the type
print(type(pdf))

<class 'camelot.core.TableList'>


<br>

<br>

In [24]:
# Checking the number of tables, should be 18
pdf

<TableList n=18>

<br>

<br>

In [25]:
# Exporting tables into a csv file
pdf.export('datasets/2019_points.csv', f='csv', compress=True)

<br>

<br>

In [26]:
# checking to make sure it worked
pdf[1].parsing_report

{'accuracy': 100.0, 'whitespace': 2.73, 'order': 1, 'page': 2}

<br>

<br>

In [27]:
# Unzipping the folder - multiple tables are exported as a zip
from zipfile import ZipFile

# Loop through zipped folder for files
with ZipFile('datasets/2019_points.zip', 'r') as df:
   # Extract all the contents of zip file in current directory
   df.extractall('2019_points')

<br>

<br>

In [42]:
# To merge csv files into one
import os, glob

# Creating a path
path = '2019_points/'

<br>

<br>

In [50]:
# Find csvs which match this pattern
filelist = glob.glob(path + '2019_*.csv')

# Merge them together with these column headings
df2019 = pd.concat([pd.read_csv(file, names=['Course Code','Title','Points','Median' ]) for file in filelist])

<br>

<br>

In [51]:
# Checking pandas dataframe
df2019

Unnamed: 0,Course Code,Title,Points,Median
0,Course Code INSTITUTION and COURSE,,EOS,Mid
1,,Athlone Institute of Technology,,
2,AL801,Software Design with Virtual Reality and Gaming,304,328
3,AL802,Software Design with Cloud Computing,301,306
4,AL803,Software Design with Mobile Apps and Connected...,309,337
...,...,...,...,...
50,TR032,Engineering,487*,520.0
51,TR033,Computer Science,465*,488.0
52,TR034,Management Science and Information Systems Stu...,589*,602.0
53,TR035,Theoretical Physics,554,601.0


<br>

<br>

In [52]:
# Checking last 5 rows
df2019.tail()

Unnamed: 0,Course Code,Title,Points,Median
50,TR032,Engineering,487*,520.0
51,TR033,Computer Science,465*,488.0
52,TR034,Management Science and Information Systems Stu...,589*,602.0
53,TR035,Theoretical Physics,554,601.0
54,TR038,Engineering with Management,499,543.0


<br>

<br>

<br>

<br>

### Level 6 & 7 Courses

<br>

In [114]:
# Checking all pages of pdf for data
pdf_2 = camelot.read_pdf('datasets/2019_points_2.pdf', pages='all')

<br>

<br>

In [115]:
# Checking the type
print(type(pdf_2))

<class 'camelot.core.TableList'>


<br>

<br>

In [119]:
# Checking the number of tables, should be 10
pdf_2

<TableList n=10>

<br>

<br>

In [120]:
# Exporting tables into a csv file
pdf.export('datasets/2019_points_2.csv', f='csv', compress=True)

<br>

<br>

In [121]:
# checking to make sure it worked
pdf_2[3].parsing_report

{'accuracy': 100.0, 'whitespace': 9.26, 'order': 1, 'page': 4}

<br>

<br>

In [122]:
# Unzipping the folder - multiple tables are exported as a zip
# Loop through zipped folder for files
with ZipFile('datasets/2019_points_2.zip', 'r') as df:
   # Extract all the contents of zip file in current directory
   df.extractall('2019_points_2')

<br>

<br>

In [124]:
# Creating a path
path = '2019_points_2/'

<br>

<br>

In [125]:
# Find csvs which match this pattern
filelist = glob.glob(path + '2019_*.csv')

# Merge them together with these column headings
df2019_2 = pd.concat([pd.read_csv(file, names=['Course Code','Title','Points','Median' ]) for file in filelist])

<br>

<br>

In [126]:
# Checking pandas dataframe
df2019_2

Unnamed: 0,Course Code,Title,Points,Median
0,Course Code INSTITUTION and COURSE,,EOS,Mid
1,,Athlone Institute of Technology,,
2,AL801,Software Design with Virtual Reality and Gaming,304,328
3,AL802,Software Design with Cloud Computing,301,306
4,AL803,Software Design with Mobile Apps and Connected...,309,337
...,...,...,...,...
50,TR032,Engineering,487*,520.0
51,TR033,Computer Science,465*,488.0
52,TR034,Management Science and Information Systems Stu...,589*,602.0
53,TR035,Theoretical Physics,554,601.0


<br>

<br>

In [130]:
# Checking random row
df2019_2.iloc[25]

Course Code                                     AL861
Title          Animation and Illustration (portfolio)
Points                                           #615
Median                                            899
Name: 25, dtype: object

<br>

***
### Cleaning Dataframes & Data 
***

<br>

<br>

## References

# End