***
# Fundamentals of Data Analytics Assessment - CAO Points Analysis

***

<br>

<br>

![img](images/cao_logo.png)

<br>

<br>

Students who wish to study at third level in the Republic of Ireland apply to colleges and universities through the Central Applications Office (CAO) [1]. "The purpose of the Central Applications Office (CAO) is to process centrally applications for undergraduate courses in Irish Higher Education Institutions" [2].
 
“The CAO awards points to students based on their achievements in the Leaving Certificate examination. A student's points are calculated according to these tables (see below), counting their best six subjects only” [3]. A student can study a subject at higher or ordinary level - more points are awarded for higher level papers. Since 2012, the maximum number of points a student can receive is 625 points [3].
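
As a rough illustration of the "best six" rule quoted above, the sketch below sums the six highest subject scores. The subject names and point values are invented for the example and are not taken from the CAO tables, and any bonus points are ignored.

```python
def best_six_total(subject_points):
    """Return the sum of the six highest values in a {subject: points} mapping."""
    top_six = sorted(subject_points.values(), reverse=True)[:6]
    return sum(top_six)


# Hypothetical results for a student who sat seven subjects.
example_results = {
    'English': 88,
    'Irish': 77,
    'Maths': 100,
    'French': 66,
    'Biology': 88,
    'History': 56,
    'Geography': 46,
}

total = best_six_total(example_results)
print(total)  # 475: the lowest score (46) is not counted
```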



<br>

<br>

![img](images/points.jpg)

<br>

The entry points needed for a course depend on demand, so the points requirement for a course varies from year to year.

<br>

This notebook will focus on comparing the CAO points for 2019, 2020 and 2021 and provide an overview of how to load the points from the CAO website into a pandas dataframe.

<br>

***
***

<br>

### Importing Libraries 

<br>

In [None]:
# Convenient HTTP requests.
import requests as rq

# Regular expressions.
import re

# Dates and times.
import datetime as dt

# Data frames.
import pandas as pd

# Working with data
import numpy as np

# For downloading.
import urllib.request as urlrq

# For scraping data from pdf
import camelot

# Unzipping folders 
from zipfile import ZipFile

# To merge csv files into one
import os, glob

# Common string operations (not aliased, so the built-in str stays usable)
import string

<br>

In [None]:
# Visualisation
import matplotlib.pyplot as plt

# Visualisation
import seaborn as sns

# Standard plot size.
plt.rcParams['figure.figsize'] = (15, 10)

# Selecting a colour scheme.
plt.style.use('ggplot')

# Configures matplotlib to show figures embedded in the notebook. 
%matplotlib inline

<br>

In [None]:
# Get the current date and time.
now = dt.datetime.now()

# Format as a string.
nowstr = now.strftime('%Y%m%d_%H%M%S')

<br>

<br>

***
****

<br>

## 2021 CAO Points


<br>

***

<br>

### Level 8  
[Link to 2021 CAO points (Level 8)](http://www2.cao.ie/points/l8.php)

<br>

<br>

As the 2021 points are stored as an HTML page on the CAO website, we must make a request to the server they are stored on. We can do this with the Python `requests` package, which makes HTTP requests simple and "human friendly". "The HTTP request returns a Response Object with all the response data (content, encoding, status, etc)" [4].
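
To show which parts of a response are worth checking first, here is a minimal sketch. The helper name `summarise_response` is my own, and a stand-in object is used in place of a real `requests` response so the example runs without a network connection; it only assumes the `status_code`, `encoding` and `content` attributes.

```python
from types import SimpleNamespace


def summarise_response(resp):
    """Collect the attributes of a response that are worth checking first."""
    return {
        'status_code': resp.status_code,
        'ok': resp.status_code == 200,
        'encoding': resp.encoding,
        'n_bytes': len(resp.content),
    }


# Stand-in for a real response object (hypothetical values).
fake_resp = SimpleNamespace(
    status_code=200,
    encoding='ISO-8859-1',
    content=b'<html></html>',
)

summary = summarise_response(fake_resp)
print(summary)
```

With a real request, the same function could be called on the object returned by `rq.get(...)`.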

<br>

<br>

#### Server Request

<br>

In [None]:
# Get the CAO points URL
resp = rq.get('http://www2.cao.ie/points/l8.php')

# 200 = ok. 404 = error: not found
resp

<br>

<br>

#### Save Original Dataset


<br>

To save the original HTML file from the request we need to create a path in our repository. The timestamp string created earlier with the datetime module is included in the filename, so the saved file records when the request was made.
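
The naming scheme can be sketched as a small function. The helper `timestamped_path` is hypothetical and only illustrates how the `strftime` format used in this notebook turns a date and time into part of a filename.

```python
import datetime as dt


def timestamped_path(prefix, extension, when=None):
    """Build a file path that embeds a timestamp such as 20211103_140509."""
    when = when or dt.datetime.now()
    return f"{prefix}{when.strftime('%Y%m%d_%H%M%S')}{extension}"


# A fixed date and time makes the result reproducible.
example = timestamped_path(
    'datasets/cao2021_', '.html', dt.datetime(2021, 11, 3, 14, 5, 9)
)
print(example)  # datasets/cao2021_20211103_140509.html
```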

<br>

In [None]:
# Create a file path for the original data.
pathhtml = 'datasets/cao2021_' + nowstr + '.html'

<br>

<br>

### Webserver Error


<br>

The CAO web server declares the wrong encoding. It tells us to decode the page as iso-8859-1, but some of the lines include the byte \x96, which is not a printable character in iso-8859-1. To get around this we use the similar decoding standard cp1252, in which \x96 is an en dash.

<br>

Webserver error - server says decode as:

    Content-Type: text/html; charset=iso-8859-1

However, one line uses \x96, which isn't defined in iso-8859-1.

Therefore, the similar decoding standard cp1252 was used, which does include \x96.
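
The difference between the two encodings can be checked directly on a byte string. The course title below is made up; only the byte \x96 matters.

```python
# Hypothetical course title containing the byte 0x96.
raw = b'Arts \x96 Joint Honours'

as_latin1 = raw.decode('iso-8859-1')
as_cp1252 = raw.decode('cp1252')

# iso-8859-1 maps 0x96 to an invisible control character (U+0096).
print(repr(as_latin1))
# cp1252 maps 0x96 to an en dash (U+2013).
print(repr(as_cp1252))
```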

<br>

<br>

In [None]:
# Fixing the web server's wrong encoding
original_encoding = resp.encoding

# Changing to cp1252
resp.encoding = 'cp1252'

<br>

In [None]:
# Save the original html file.
with open(pathhtml, 'w') as f: # Opening path in write mode
    f.write(resp.text)

<br>

<br>

<br>

### Getting The Data


[Link To Regular Expression Documentation](https://docs.python.org/3/library/re.html)

<br>

To get the relevant lines from the response, we employ a regular expression. A regular expression "...is a sequence of characters that specifies a search pattern" [5]. You can think of it as a search and find function. We compile the regular expression once because that is more efficient than rebuilding the same pattern every time it is used.
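
As a small sketch of how such a pattern behaves, the example below applies a course-code pattern (two capital letters followed by three digits, then anything) to a few sample lines. The sample lines are invented and only approximate what the CAO page contains.

```python
import re

# Two capital letters, three digits, then the rest of the line.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)')

sample_lines = [
    'AL801  Software Design for Virtual Reality and Gaming      300',
    '<p>Some explanatory text</p>',
    'Course Code INSTITUTION and COURSE',
]

# fullmatch only succeeds when the whole line fits the pattern.
matches = [line for line in sample_lines if re_course.fullmatch(line)]
codes = [re_course.fullmatch(line).group(1) for line in matches]
print(codes)  # ['AL801']
```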

<br>

In [None]:
# Compile regular expression for matching lines.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)') # the r prefix marks a raw string

<br>

<br>

We will then use this expression to loop through the response from earlier for any matches of the regular expression and save these matches in a CSV file. 

NB: verified as of 03/11/2021 that there were 949 courses on the CAO Level 8 2021 points list.

<br>

<br>

In [None]:
# Path to csv file
path2021 = 'datasets/cao2021_csv_' + nowstr + '.csv'

<br>

In [None]:
# Tracking number of courses matched
no_lines = 0

# Open the csv file for writing, using the same encoding it is read back with.
with open(path2021, 'w', encoding='cp1252') as f:
    # Write a header row.
    f.write(','.join(['code', 'title', 'pointsR1', 'pointsR2']) + '\n')
    # Loop through lines of the response.
    for line in resp.iter_lines():
        # Decode the line as cp1252 rather than the encoding the server declares.
        dline = line.decode('cp1252')
        # Match only the lines representing courses.
        if re_course.fullmatch(dline):
            # Add one to the lines counter.
            no_lines = no_lines + 1
            # The course code, i.e. the first 5 characters.
            course_code = dline[:5]
            # The course title; strip() gets rid of surrounding whitespace.
            course_title = dline[7:57].strip()
            # Round one and round two points: split into substrings from index 60.
            course_points = re.split(' +', dline[60:])
            # Keep only the first two substrings (the last course has an extra one).
            if len(course_points) != 2:
                course_points = course_points[:2]
            # Collect the fields for this course.
            linesplit = [course_code, course_title, course_points[0], course_points[1]]
            # Join the fields with commas and write the row.
            f.write(','.join(linesplit) + '\n')

# Number of courses matched
print(f"Total number of lines is {no_lines}.")

<br>
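
The slicing used in the loop can be tried on a single line in isolation. The sketch below is not the code used in this notebook: the helper `parse_course_line` and the sample line are hypothetical, and the row is written with the standard `csv` module instead of a manual join, which would also keep a title containing a comma in one field.

```python
import csv
import io
import re


def parse_course_line(line):
    """Split one fixed-width course line into code, title and two points fields."""
    code = line[:5]
    title = line[7:57].strip()
    # Everything from index 60 onwards holds the points, separated by spaces.
    points = re.split(' +', line[60:].strip())[:2]
    # Pad with empty strings when a round has no value.
    points += [''] * (2 - len(points))
    return [code, title, points[0], points[1]]


# Hypothetical line laid out like the CAO page: code, padded title, points.
sample = 'AL801' + '  ' + 'Arts, Humanities'.ljust(50) + '   ' + '300  295'

buffer = io.StringIO()
csv.writer(buffer).writerow(parse_course_line(sample))
print(buffer.getvalue())
```

The title is quoted in the output because it contains a comma, so the row still has four fields when it is read back.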

<br>

<br>

<br>

In [None]:
# Reading dataframe
df2021 = pd.read_csv(path2021, encoding='cp1252')

In [None]:
df2021

<br>

<br>

As we know these are all level 8 courses, we will also insert a new column into the dataframe to show this.

<br>

<br>

In [None]:
# Inserting a new column at position 2 with the value '8'
df2021.insert(2, 'level', '8')
df2021

<br>

<br>

<br>

***

<br>

## Level 6/7 Courses

<br>

To retrieve the 2021 level 6 and 7 courses we repeat this process.

<br>

### Server Request

<br>

In [None]:
# Fetch the CAO points URL.
resp = rq.get('http://www2.cao.ie/points/l76.php')

# 200 = ok. 404 = error: not found
resp

<br>

<br>

### Save Original Dataset



<br>

In [None]:
# Create a file path for the original data.
pathhtml_2 = 'datasets/cao2021_2' + nowstr + '.html'

<br>

<br>

### Webserver Error


<br>

In [None]:
# Fixing the web server's wrong encoding
original_encoding = resp.encoding

# Changing to cp1252
resp.encoding = 'cp1252'

<br>

In [None]:
# Save the original html file.
with open(pathhtml_2, 'w') as f: # Opening path in write mode
    f.write(resp.text)

<br>

<br>

### Getting The Data


<br>

In [None]:
# Compile regular expression for matching lines.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)') # the r prefix marks a raw string

<br>

<br>

In [None]:
# Path to csv file
path2021_2 = 'datasets/cao2021_2_csv_' + nowstr + '.csv'

<br>

In [None]:

# Tracking number of courses matched
no_lines = 0

# Open the csv file for writing, using the same encoding it is read back with.
with open(path2021_2, 'w', encoding='cp1252') as f:
    # Write a header row.
    f.write(','.join(['code', 'title', 'pointsR1', 'pointsR2']) + '\n')
    # Loop through lines of the response.
    for line in resp.iter_lines():
        # Decode the line as cp1252 rather than the encoding the server declares.
        dline = line.decode('cp1252')
        # Match only the lines representing courses.
        if re_course.fullmatch(dline):
            # Add one to the lines counter.
            no_lines = no_lines + 1
            # The course code, i.e. the first 5 characters.
            course_code = dline[:5]
            # The course title.
            course_title = dline[7:57].strip()
            # Round one and round two points: split into substrings from index 60.
            course_points = re.split(' +', dline[60:])
            # Keep only the first two substrings.
            if len(course_points) != 2:
                course_points = course_points[:2]
            # Collect the fields for this course.
            linesplit = [course_code, course_title, course_points[0], course_points[1]]
            # Join the fields with commas and write the row.
            f.write(','.join(linesplit) + '\n')

# Number of courses matched
print(f"Total number of lines is {no_lines}.")

<br>

<br>

In [None]:
# Reading dataframe
df2021_2 = pd.read_csv(path2021_2, encoding='cp1252')

<br>

Unfortunately, the 2021 level 6 and 7 courses aren't distinguished from each other. So, I've decided to insert them as '6/7' under a new column heading 'level'. This will distinguish them from the courses we know for certain are level 8.

<br>

In [None]:
# Inserting a new column at position 2 with the value '6/7'
df2021_2.insert(2, 'level', '6/7')
df2021_2

<br>

<br>

<br>

***

## 2020 Points
***

https://www.cao.ie/index.php?page=points&p=2020

<br>

Luckily the 2020 points are already available in an Excel file, which means a simpler workload! Additionally, levels 6, 7 and 8 are all included and clearly marked.

<br>

<br>

In [None]:
# Saving the URL as a variable
url2020 = 'http://www2.cao.ie/points/CAOPointsCharts2020.xlsx'

<br>

<br>

### Saving Original Data

<br>

We must save a copy of the original file. The timestamp created with the datetime module is included in the filename to record when the request was made.

<br>

In [None]:
# Create a file path for original data
pathxlsx = 'datasets/cao2020_' + nowstr + '.xlsx'

<br>

<br>

### Retrieve Data

<br>

To get the data we can use the `urllib` package. `urllib` is a "...URL handling module for python. It is used to fetch URLs" [6]. The `urlretrieve` function downloads "...the remote data directly to the local [disk]" [7].
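
A minimal sketch of `urlretrieve` that runs offline is shown below. Instead of the CAO URL it copies a temporary local file through a `file://` URL, which is an assumption made purely so the example does not depend on the website being reachable.

```python
import tempfile
import urllib.request as urlrq
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # A small local file stands in for the remote spreadsheet.
    source = Path(tmp) / 'source.txt'
    source.write_text('code,points\nAL801,300\n')

    # urlretrieve(url, filename) saves the resource at `filename`.
    target = Path(tmp) / 'copy.txt'
    saved_path, headers = urlrq.urlretrieve(source.as_uri(), str(target))

    copied_text = target.read_text()

print(copied_text)
```

The function returns the path it wrote to along with the response headers, which is why the notebook cell displays a tuple.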

In [None]:
# Getting data
urlrq.urlretrieve(url2020, pathxlsx)

<br>

<br>

### Loading The Spreadsheet

<br>

We can now load the data into the notebook by using pandas. The first 10 lines of the file include a blurb about the CAO, so we will skip the first 10 rows to avoid pulling them into the newly created dataframe.
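
To illustrate what `skiprows` does, here is a small sketch that uses `pd.read_csv` on in-memory text rather than the real Excel file. The data is invented and has a three-line blurb instead of ten; `read_excel` accepts a `skiprows` argument in the same way.

```python
import io

import pandas as pd

# Three blurb lines followed by the real header and data (hypothetical values).
raw = (
    "CAO points 2020\n"
    "Explanatory note line one\n"
    "Explanatory note line two\n"
    "COURSE CODE,R1 POINTS\n"
    "AC120,209\n"
    "AC137,252\n"
)

# Skipping the blurb makes the fourth line the header row.
df = pd.read_csv(io.StringIO(raw), skiprows=3)
print(df)
```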

<br>

In [None]:
# Download and parse the Excel spreadsheet. The first 10 rows are a blurb, so skip them.
df2020 = pd.read_excel(url2020, skiprows=10)

In [None]:
# Checking the dataframe
df2020

<br>

<br>

<br>

In [None]:
# Spot-checking a single row
df2020.iloc[1000]

<br>

<br>

In [None]:
# -1 is always last row/element
df2020.iloc[-1]

<br>

<br>

<br>

As mentioned earlier, the 2020 points do provide the level for each course, saved as an integer. Because we had to add our own levels for 2021 (and 2019), where the string '6/7' was used, I am going to convert the level column values to strings to match 2019 and 2021.
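
The effect of the conversion can be seen on two small, made-up Series: once the integer levels are cast to strings, they can sit in the same column as the '6/7' labels without mixing types.

```python
import pandas as pd

# Hypothetical level values: integers for 2020, strings for 2021.
levels_2020 = pd.Series([8, 7, 6], name='LEVEL')
levels_2021 = pd.Series(['8', '6/7', '6/7'], name='level')

# Cast the integers to strings so both years use the same type.
as_text = levels_2020.astype(str)
combined = pd.concat([as_text, levels_2021], ignore_index=True)
print(combined.tolist())  # ['8', '7', '6', '8', '6/7', '6/7']
```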

<br>

In [None]:
# Converting level column values into strings
df2020['LEVEL'] = df2020['LEVEL'].astype(str)

<br>

<br>

### Storing The Dataframe

<br>

Storing the dataframe we created with the other files in the project directory.

<br>

In [None]:
# Creating a path for dataframe
path2020 = 'datasets/cao2020_' + nowstr + '.csv'

<br>

In [None]:
# Saving dataframe to disk
df2020.to_csv(path2020)

<br>

<br>

<br>

<br>

***

## 2019 Points
***

https://www.cao.ie/index.php?page=points&p=2019

<br>

The 2019 CAO points are stored in PDF files. Only round 1 points are available for 2019.

<br>

I decided to use the package Camelot to extract the data from the files. [The Documentation for Camelot can be found here](https://camelot-py.readthedocs.io/en/master/user/quickstart.html).

Camelot has a number of [dependencies](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) which must be installed before the package can be used.  I found [this](https://www.youtube.com/watch?v=LoiHI-IB3lY&t=308s) video extremely useful in getting Camelot up and running.

<br>

<br>

### Getting The Data

<br>

In [None]:
# Checking all pages of pdf for data
pdf = camelot.read_pdf('datasets/2019_points.pdf', pages='all')

<br>

In [None]:
# Checking the type
print(type(pdf))

<br>

<br>

Note: The number of pages in the 2019 level 8 courses PDF is 18.

<br>

In [None]:
# Checking the number of tables found
pdf

<br>

In [None]:
# Checking the accuracy of the extraction
pdf[1].parsing_report

<br>

From the parsing report we can see the table was extracted correctly. You can check this with all pages.

<br>

<br>

### Exporting Data

<br>

The next step is to export the tables found to CSV files so we can convert them into a dataframe. By passing `compress=True`, camelot will create a zipped folder in the specified path. This avoids storing one loose CSV file per table (18 in this case).

<br>

In [None]:
# Exporting tables into a csv file
pdf.export('datasets/2019_points.csv', f='csv', compress=True)

<br>

<br>

To unzip the CSV files we use the `zipfile` package [8].
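
As a self-contained sketch of the same idea, the example below builds a small archive in a temporary directory and extracts it again. The file names and contents are invented and do not reflect how camelot names its exports.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / 'tables.zip'

    # Create a small archive with two hypothetical CSV files.
    with ZipFile(archive, 'w') as zf:
        zf.writestr('tables-page-1.csv', 'AL801,Software Design,300\n')
        zf.writestr('tables-page-2.csv', 'AL802,Game Design,320\n')

    # List the contents, then extract everything into a sub-directory.
    with ZipFile(archive, 'r') as zf:
        names = zf.namelist()
        zf.extractall(tmp / 'extracted')

    extracted = sorted(p.name for p in (tmp / 'extracted').iterdir())

print(extracted)
```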

<br>

In [None]:
# Open the zipped folder
with ZipFile('datasets/2019_points.zip', 'r') as zf:
    # Extract all the contents of the zip file into the 2019_points directory
    zf.extractall('2019_points')

<br>

<br>

### Merging CSV Files

<br>

As camelot exports each table as a single file, we need to merge them all into one dataframe. We can do this by using the Glob module. Glob, which stands for global, is part of the standard Python library. 

It is used to "...search for a specific file pattern, or perhaps more usefully, search for files where the filename matches a certain pattern by using wildcard characters" [9].
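
A quick way to see wildcard matching in action is to create a few empty files and search them. The file names below are hypothetical; only the ones that start with `2019_` and end in `.csv` should be picked up.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a mix of files, only some of which match the pattern.
    for name in ['2019_page_1.csv', '2019_page_2.csv', 'notes.txt', '2020_page_1.csv']:
        with open(os.path.join(tmp, name), 'w') as f:
            f.write('placeholder\n')

    # The * wildcard matches any run of characters in the file name.
    pattern = os.path.join(tmp, '2019_*.csv')
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)  # ['2019_page_1.csv', '2019_page_2.csv']
```

`glob.glob` does not guarantee any particular order, so sorting the result is a reasonable habit when the order of the merged rows matters.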

<br>

<br>

In [None]:
# The path to use
path = '2019_points/'

<br>

In [None]:
# Find csvs which match this pattern
filelist = glob.glob(path + '2019_*.csv')

# Merge them together with these column headings
df2019 = pd.concat([pd.read_csv(file, names=['code','title','points','median' ]) for file in filelist])

<br>

<br>

<br>

<br>

<br>

As we know these are all level 8 courses, we will also insert a new column into the dataframe to show this. We will also clean up the dataframe a bit by dropping the old heading and by getting rid of the rows which only contain the name of the third level institutes.
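
One possible way to do this clean-up, sketched on invented data, is to keep only the rows whose code looks like a course code (two capital letters and three digits). This is an assumption about how the rows can be told apart and not necessarily the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical rows: an old heading, an institution name and two courses.
raw = pd.DataFrame({
    'code': ['Course Code', 'Athlone Institute of Technology', 'AL801', 'AL802'],
    'title': ['INSTITUTION and COURSE', None, 'Software Design', 'Game Design'],
    'points': ['EOS', None, '304', '301'],
})

# True only where the whole value is a course code; missing values count as False.
is_course = raw['code'].str.fullmatch(r'[A-Z]{2}[0-9]{3}', na=False)

clean = raw[is_course].reset_index(drop=True)
# Mark every remaining row as a level 8 course.
clean.insert(2, 'level', '8')
print(clean)
```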

<br>

<br>

<br>

In [None]:
# Checking last 5 rows
df2019.tail()

<br>

<br>

<br>

<br>

<br>

<br>

***

<br>

### Level 6 and 7 Courses

http://www2.cao.ie/points/lvl76_19.pdf

<br>

To retrieve the 2019 level 6 and 7 courses we repeat this process. 

<br>

### Getting The Data

<br>

There are 10 pages in the level 6 and 7 courses PDF.

<br>

In [None]:
# Dropping old heading (drop returns a new dataframe, so assign it back)
df2019 = df2019.drop(0)

<br>

<br>


### Exporting The Data

<br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>

### Merging CSV Files

<br>

<br>

<br>

<br>

Unfortunately, the 2019 level 6 and 7 courses aren't distinguished from each other. In line with 2021, I decided to insert them as '6/7' under a new column heading 'level'. This will distinguish them from the courses we know for certain are level 8. Additionally, we will get rid of the rows which only contain the name of the third level institutes.

<br>

<br>

<br>

In [None]:
# Creating a path
path = '2019_points_2/'

<br>

In [None]:
# Find csvs which match this pattern
filelist = glob.glob(path + '2019_*.csv')

# Merge them together with these column headings. 
df2019_2 = pd.concat([pd.read_csv(file, names=['code','title','points','median'], skiprows=10) for file in filelist])

<br>

<br>

<br>

<br>

<br>

***
## Cleaning Up Dataframes & Data 
***

<br>

### 2021

<br>

<br>

<br>

<br>

<br>

<br>

<br>

In [None]:
# Selecting columns level 6/7
level_6_2021 = df2021_2[['code', 'title', 'level',  'pointsR1', 'pointsR2']]

<br>

<br>

### 2020

<br>

<br>

In [None]:
# Search for duplicates
total2021[total2021.duplicated()]

<br>

<br>

<br>

<br>

<br>

### 2019

<br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>

In [None]:
# Search for duplicates
total2019[total2019.duplicated()]

<br>

<br>

<br>

<br>

<br>

<br>

<br>

### All Courses

<br>

<br>

<br>

<br>

<br>

In [None]:
courses2021 = total2021[['code', 'title', 'level']]

<br>

In [None]:
courses2021

<br>

<br>

In [None]:
# Merging all years 
allcourses = pd.concat([courses2021, courses2020, courses2019], ignore_index=True)

In [None]:
# Displaying the courses sorted by code (sort_values returns a sorted copy)
allcourses.sort_values('code')

<br>

<br>


***
## Joining Points and Levels
***

<br>

<br>

<br>

<br>

<br>

<br>

<br>

In [None]:
# Setting the index as the code column
total2021.set_index('code', inplace=True)

# Column headings
total2021.columns = ['title', 'level', 'points_r1_2021', 'points_r2_2021']

<br>

<br>

<br>

<br>

<br>

<br>

In [None]:
allcourses = allcourses.join(total2021[['points_r1_2021', 'points_r2_2021']])

In [None]:
allcourses

<br>

<br>

In [None]:
# Selecting the 2020 columns; copy() avoids modifying a view of df2020
df2020_r1 = df2020[['COURSE CODE2', 'LEVEL', 'R1 POINTS', 'R2 POINTS']].copy()
df2020_r1.columns = ['code', 'level' ,'points_r1_2020', 'points_r2_2020']
df2020_r1

<br>

<br>

<br>

In [None]:
# Set the index to the code column.
df2020_r1.set_index('code', inplace=True)
df2020_r1

<br>

<br>


In [None]:
# Join 2019 points to allcourses.
allcourses = allcourses.join(df2019_r1, rsuffix='2019')
allcourses

<br>

<br>

<br>

<br>

In [None]:
# Dropping columns not using
allcourses = allcourses.drop(['level2020', 'level2019', 'points_r2_2021', 'points_r2_2020'], axis=1)

<br>

<br>


***
## Data
***

<br>

<br>

<br>

<br>

In [None]:
# How many of each value
allcourses['level_2019'].value_counts()

<br>

<br>


<br>

In [None]:
# Standard-library string module (not aliased, so the built-in str stays usable)
import string

<br>

In [None]:
# Function to remove special characters so we can convert to numeric values
def special_char(series):
    # str.replace returns a new Series, so return the result
    return series.str.replace(r'\W', '', regex=True)
    

<br>

<br>

In [None]:
# 2021 points
allcourses['points_r1_2021_special'] = special_char(allcourses['points_r1_2021'])
#special_char(allcourses['points_r2_2021'])

# 2020 points
allcourses['points_r1_2020_special'] = special_char(allcourses['points_r1_2020'])
#special_char(allcourses['points_r2_2020'])

# 2019 points
allcourses['points_r1_2019_special'] = special_char(allcourses['points_r1_2019'])

<br>
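
Once the special characters are gone, the remaining text still has to be turned into numbers. A minimal sketch of that step is below; the sample values are invented, and the choice to keep only digits and turn everything else into a missing value is an assumption about what the analysis needs.

```python
import pandas as pd

# Hypothetical points values with markers such as # and *, plus a text entry.
raw_points = pd.Series(['300', '#451', '566*', 'AQA', None])

# Keep only the digits in each value.
digits_only = raw_points.str.replace(r'\D', '', regex=True)

# Convert to numbers; anything that is not a number becomes NaN.
numeric_points = pd.to_numeric(digits_only, errors='coerce')
print(numeric_points)
```

With numeric columns, methods such as `mean()` and `describe()` summarise the points as numbers and skip the missing values.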

<br>


In [None]:
# Average number of points 2020
allcourses['points_r1_2020'].mean()

<br>

<br>

<br>

The average increase of points in 2020 can be put down to predicted grades as a result of the Covid-19 pandemic, leading to a rise in points for the majority of courses.

<br>

<br>

<br>

<br>

<br>

<br>

In [None]:
# Generating a KDE plot
a.plot.kde()

<br>

<br>

In [None]:
# Top 5 courses by 2021 round one points
allcourses.sort_values(by=['points_r1_2021', 'title'], ascending=False).head()

<br>

<br>

In [None]:
# Top 5 courses by 2020 round one points
allcourses.sort_values(by=['points_r1_2020', 'title'], ascending=False).head()

<br>

<br>

In [None]:
# Top 5 courses by 2019 round one points
allcourses.sort_values(by=['points_r1_2019', 'title'], ascending=False).head()

<br>

<br>

In [None]:
# Summary statistics per level, using the string names pandas recommends
x = allcourses.groupby(['level_2020'])[cols].agg(['mean', 'std', 'min', 'max'])

In [None]:
x

<br>

In [None]:
# Generating a barplot for the above pivot table
x.plot.barh()

<br>

<br>

<br>

In [None]:
# Overview of 2021 points
allcourses['points_r1_2021'].describe()

<br>

<br>

In [None]:
# Most points in 2021
allcourses['points_r1_2021'].max()

<br>

<br>

In [None]:
# Least points in 2021
allcourses['points_r1_2021'].min()

<br>

<br>

<br>

In [None]:
# Searching for that course
allcourses[allcourses['points_r1_2021'] == 625]

<br>

<br>

An increase of 80 points in 2 years. 

<br>

<br>

In [None]:
# Searching for that course
allcourses[allcourses['points_r1_2021'] == 100]

<br>

A massive 134 point drop from 2020.

<br>

<br>

In [None]:
# Overview of 2020 points
allcourses['points_r1_2020'].describe()

<br>

<br>

In [None]:
# Most points in 2020
allcourses['points_r1_2020'].max()

<br>

<br>

In [None]:
# Least points in 2020
allcourses['points_r1_2020'].min()

<br>

<br>

<br>

In [None]:
# Searching for that course
allcourses[allcourses['points_r1_2020'] == 1088]

<br>

<br>

The course with the most points was Popular Music at CIT Cork School of Music. It appears to have been introduced in 2020 and was also offered through the CAO in 2021.

<br>

<br>

In [None]:
# Searching for that course
allcourses[allcourses['points_r1_2020'] == 55]

<br>

<br>

The course with the lowest points is misleading as it requires a GAMSAT*. 

*Graduate Medical School Admissions Test (GAMSAT) "is a standardised exam designed, scored and developed by ACER to assist in the admissions of students to graduate-entry programmes (medicine, dentistry, optometry, pharmacy, podiatry) open to graduates of any discipline. [...] There are 4 medical schools in Ireland which require the GAMSAT as part of their admissions criteria."

<br>

<br>

In [None]:
# Overview of 2019 points
allcourses['points_r1_2019'].describe()

<br>

<br>

In [None]:
# Most points in 2019
allcourses['points_r1_2019'].max()

<br>

<br>

In [None]:
# Least points in 2019
allcourses['points_r1_2019'].min()

<br>

<br>

<br>

In [None]:
# Searching for that course
allcourses[allcourses['points_r1_2019'] == 601]

<br>

<br>

Interestingly, the points went down from 2020, which we can assume was the first year of Popular Music at CIT Cork School of Music, as it is not present in 2019.

<br>

<br>

In [None]:
# Searching for that course
allcourses[allcourses['points_r1_2019'] == 100]

<br>

<br>

In [None]:
# Standard plot size.
plt.rcParams['figure.figsize'] = (15, 10)


<br>

In [None]:
# Generating a kde plot for points 
allcourses['points_r1_2021'].plot.kde(label='2021')
allcourses['points_r1_2020'].plot.kde(label='2020')
allcourses['points_r1_2019'].plot.kde(label='2019')

# Adding a legend
plt.legend()

<br>

<br>

In [None]:
# Getting skew and kurtosis for 2021 points
print("2021 points skewness: %f" % allcourses['points_r1_2021'].skew())
print("2021 points kurtosis: %f" % allcourses['points_r1_2021'].kurt())

# Formatting
print('******************************')

# Getting skew and kurtosis for 2020 points
print("2020 points skewness: %f" % allcourses['points_r1_2020'].skew())
print("2020 points kurtosis: %f" % allcourses['points_r1_2020'].kurt())

# Formatting
print('******************************')

# Getting skew and kurtosis for 2019 points
print("2019 points skewness: %f" % allcourses['points_r1_2019'].skew())
print("2019 points kurtosis: %f" % allcourses['points_r1_2019'].kurt())

<br>

<br>

Let's create a scatterplot to compare 2020 and 2021 points. I'm going to use the 2020 level of the course because the CAO distinguished between the levels for 2020, whereas for 2021 (and 2019) they didn't specify which courses were level 6 and which were level 7.

<br>

In [None]:
# Generating a scatterplot between 2020 and 2021 points.
sns.scatterplot(x=allcourses['points_r1_2021'], y=allcourses['points_r1_2020'], hue=allcourses['level_2020'], palette='plasma')

# Adding labels to axis
plt.xlabel('points_r1_2021')
plt.ylabel('points_r1_2020')

# Adding a grid
plt.grid(color='black',ls='--')

<br>

<br>

In [None]:
# Generating a pairplot
sns.pairplot(data=allcourses )

<br>

<br>

<br>

<br>

Similar to Excel, we can create pivot tables. The `pivot_table` method takes the following parameters:

- values – a list of variables to calculate statistics for.
- index – a list of variables to group data by.
- aggfunc – what statistics we need to calculate for groups, e.g. sum, mean, maximum, minimum or something else.
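
The three parameters can be seen together in a small sketch. The courses and points below are invented; the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical courses with a level and two years of points.
courses = pd.DataFrame({
    'level': ['8', '8', '7', '6'],
    'points_r1_2020': [500, 300, 250, 200],
    'points_r1_2021': [520, 310, 260, 190],
})

# Highest points per level for each year.
summary = courses.pivot_table(
    values=['points_r1_2020', 'points_r1_2021'],
    index=['level'],
    aggfunc='max',
)
print(summary)
```

Each row of the result is one level and each column is one of the `values`, so swapping `aggfunc` for `'min'`, `'mean'` or `'median'` changes only the statistic shown.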

<br>

<br>

In [None]:
allcourses.pivot_table(['points_r1_2021', 'points_r1_2020', 'points_r1_2019'],['level_2020'], aggfunc='max')

<br>

<br>

In [None]:
allcourses.pivot_table(['points_r1_2021', 'points_r1_2020', 'points_r1_2019'],['level_2020'], aggfunc='min')

<br>

<br>

In [None]:
allcourses.pivot_table(['points_r1_2021', 'points_r1_2020', 'points_r1_2019'],['level_2020'], aggfunc='mean')

<br>

<br>

<br>

In [None]:
allcourses.pivot_table(['points_r1_2021', 'points_r1_2020', 'points_r1_2019'],['level_2020'], aggfunc='median')

<br>

<br>

<br>

<br>

In [None]:
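# Correlation between the numeric columns only (text columns such as title are skipped)
corr = allcourses.corr(numeric_only=True)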

In [None]:
corr

In [None]:
sns.heatmap(corr, cmap='coolwarm', annot=True)

<br>

<br>

In [None]:
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats

<br>

<br>

In [None]:
sns.distplot(allcourses[allcourses['points_r1_2021']>0]['points_r1_2021'], fit=norm)


In [None]:

stats.probplot(allcourses[allcourses['points_r1_2021']>0]['points_r1_2021'], plot=plt)

In [None]:
# nrows, ncols, the first plot
plt.subplot(1,3,1)
(allcourses['points_r1_2021']).plot.box()

# nrows, ncols, the second plot
plt.subplot(1,3,2)
(allcourses['points_r1_2020']).plot.box()

# nrows, ncols, the third plot
plt.subplot(1,3,3)
(allcourses['points_r1_2019']).plot.box()


<br>

<br>

In [None]:

plt.plot(allcourses["points_r1_2021"], label='2021')
plt.plot(allcourses["points_r1_2020"], label='2020')
plt.plot(allcourses["points_r1_2019"], label='2019')

plt.legend()

<br>

<br>

"`.plot()` is a wrapper for `pyplot.plot()`, and the result is a graph identical to the one you produced with Matplotlib" ([Real Python](https://realpython.com/pandas-plot-python/)).

<br>

<br>

In [None]:
# Round one points for 2020
points_2020 = allcourses["points_r1_2020"]

In [None]:
points_2020.plot(kind='hist')

In [None]:
# Round one points for 2019
points_2019 = allcourses["points_r1_2019"]

In [None]:
points_2019.plot(kind='hist')

<br>

<br>

<br>

<br>

<br>

<br>

## References

#### Images

- [CAO Logo](https://www.icareer.ie/wp-content/uploads/2016/12/Untitled-1.jpg)
- [Points](https://www.irishtimes.com/polopoly_fs/1.3596890!/image/image.jpg)

# End