# UKRI Funding and University League Score data cleaning

This notebook reads the required data from the UKRI and Complete University Guide websites, then cleans and formats it into a dataframe for later use by machine learning models.

- The university league tables are well formatted and can be parsed with BeautifulSoup fairly easily.
- The UKRI grants data was harder to work with: each month is stored separately, with different encodings, inconsistent names and no uniform data format.

In [19]:
import numpy as np  # needed for np.reshape when building the league table
import pandas as pd
import requests
import urllib.request
import regex as re
from bs4 import BeautifulSoup

## University League data 2022
The university league table data could be pulled directly from "www.thecompleteuniversityguide.co.uk" using BeautifulSoup. The table is stored as a series of single-column lists (all called "col_one"), so it was not possible to load the table directly. Instead, the individual values were pulled out using soup.find_all with the appropriate filters.

In [28]:
uni_ranking_page = requests.get('https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?tabletype=full-table')

In [29]:
soup = BeautifulSoup(uni_ranking_page.content, 'html.parser')

In [30]:
institute_names = []
for name in soup.find_all("a", {"class" : "uni_lnk"}):
    institute_names.append(name["data-ga-label"])

In [31]:
institute_scores = []
for item in soup.find_all("div", {"class" : "segtxt"}):
    institute_scores.append(item.findChildren()[0].text)

In [32]:
col_titles = []
for item in soup.find_all("span", {"class" : "hdrtbl"}):
    col_titles.append(item.text)

In [33]:
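# the scores are scraped column by column, so reshape in column-major order;
# the shape (130 x 11 at the time of scraping) is taken from the scraped lists
institute_scores_df = pd.DataFrame(data=np.reshape(institute_scores,
                                                   (len(institute_names), len(col_titles)),
                                                   order='F'),
                                   index=institute_names,
                                   columns=col_titles)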

In [34]:
institute_scores_df = institute_scores_df.apply(pd.to_numeric, errors='coerce') #convert strings to numbers and replace "n/a" with NaN
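
The reshape relies on the scraped scores arriving one column at a time, so a column-major (`order='F'`) reshape rebuilds the rows. The sketch below shows the idea on a toy table. The names `names`, `titles` and `flat_scores` and all values are illustrative placeholders, not scraped data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the scraped lists (values are illustrative only).
names = ["Uni A", "Uni B", "Uni C"]
titles = ["Overall score", "Entry standards"]
# Values arrive column by column: all overall scores, then all entry standards.
flat_scores = ["1000", "989", "963", "200", "n/a", "177"]

# Column-major reshape puts the first len(names) values into column 0, and so on.
table = np.reshape(flat_scores, (len(names), len(titles)), order="F")
scores_df = pd.DataFrame(table, index=names, columns=titles)

# Non-numeric placeholders such as "n/a" become NaN.
scores_df = scores_df.apply(pd.to_numeric, errors="coerce")
print(scores_df)
```

Taking the shape from the list lengths rather than fixed numbers means the reshape should still work if the number of ranked institutions changes.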

In [35]:
institute_scores_df

Unnamed: 0,Overall score,Entry standards,Student satisfaction,Research quality,Research intensity,Academic services spend,Facilities spend,Degree completion,Student -staff ratio,Graduate prospects – outcomes,Graduate prospects – on track
University of Oxford,1000,200,,3.34,0.87,2842,599,99.1,10.1,90.4,84.7
University of Cambridge,989,205,,3.33,0.95,2718,1043,99.1,11.4,90.0,86.0
"London School of Economics and Political Science, University of London",963,177,3.98,3.35,0.85,2051,853,96.5,12.4,90.6,83.3
University of St Andrews,947,208,4.30,3.13,0.82,2650,746,95.7,11.1,79.9,79.6
Imperial College London,895,194,3.99,3.36,0.92,2982,755,97.5,11.1,95.1,86.7
...,...,...,...,...,...,...,...,...,...,...,...
University of Suffolk,395,110,3.96,,,1961,478,65.0,16.5,74.4,80.9
University of East London,374,97,4.00,2.71,0.23,1105,756,76.5,21.9,56.5,69.6
"Glyndwr University, Wrexham",364,102,4.14,2.15,0.16,1334,559,73.3,21.9,62.9,68.6
Ravensbourne University London,333,113,3.80,,,1163,322,79.7,22.6,70.0,77.5


## UK Research and Innovation (UKRI) funding data for the years 2018-2020
- UKRI is a new organisation that consolidates a large number of separate funding bodies, so there is no aggregated data from before 2018.

In [97]:
UKRI_URL = 'https://www.ukri.org/about-us/what-we-do/financial-data/'
UKRI_HTML = requests.get(url = UKRI_URL).text

In [98]:
def download_UKRI_expenditure(HTML, download_signature):
    """
    Find the csv links in an HTML page that match a REGEX identifier and download them.
        
    Returns a dict mapping each matched relative URL to a pd.DataFrame of that csv.
    """
    data_URL_set = re.findall(download_signature, HTML)
    data = {}
    for url in data_URL_set:
        with urllib.request.urlopen('https://www.ukri.org/'+url) as html_doc:
            # everything is read as strings; dtypes are fixed after cleaning
            data[url] = pd.read_csv(html_doc,
                                    encoding='cp1252', 
                                    dtype=str,
                                    low_memory=False)
    return data

The CSVs were in poor shape: a non-uniform set of columns and column names, byte order marks in some files but not others, and trailing white space in some instances. BeautifulSoup could have been used to find the download links, but regex was chosen as an easier way to select just the expenditure data.

- utf-8-sig decoding could not be used to remove the byte order mark, as the dataset contains non-UTF-8 characters that cause decoding errors, so a manual replacement was used.
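
A minimal sketch of why the byte order mark ends up as `ï»¿` in the column names. The bytes below are a made-up example (a UTF-8 byte order mark followed by cp1252-encoded text), not a real UKRI file.

```python
# Hypothetical CSV bytes: a UTF-8 byte order mark followed by cp1252 text.
raw = b"\xef\xbb\xbfSupplier ,Amount\r\nCaf\xe9 Ltd,\xa3170\r\n"

# utf-8-sig would strip the mark, but the cp1252 bytes are not valid UTF-8.
try:
    raw.decode("utf-8-sig")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# cp1252 decodes every byte here, but the mark survives as three characters.
text = raw.decode("cp1252")
header = text.splitlines()[0].split(",")

# Same clean-up as applied to the DataFrame columns: drop the mark, trim spaces.
clean_header = [name.replace("ï»¿", "").strip() for name in header]
print(utf8_ok, header, clean_header)
```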

In [99]:
# raw string so the regex escapes are passed through unchanged; the dot before "csv" is escaped
data_dict = download_UKRI_expenditure(UKRI_HTML, r'/wp-content/uploads/20\d+/\d+/UKRI-\d+-\D+\d+-AllExpenditure\.csv')
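
To illustrate how the download signature picks out only the expenditure files, here is a small sketch run against an invented HTML fragment. The file names in `sample_html` are hypothetical, and the standard-library `re` module is used in place of `regex` since the pattern needs nothing beyond it.

```python
import re

# Hypothetical fragment of the financial-data page; the file names are invented.
sample_html = '''
<a href="/wp-content/uploads/2021/01/UKRI-150121-December2020-AllExpenditure.csv">All expenditure</a>
<a href="/wp-content/uploads/2021/01/UKRI-150121-December2020-Summary.pdf">Summary</a>
<a href="/wp-content/uploads/2020/02/UKRI-100220-January2020-AllExpenditure.csv">All expenditure</a>
'''

# Raw string so \d and \D reach the regex engine unchanged; the dot is escaped.
signature = r"/wp-content/uploads/20\d+/\d+/UKRI-\d+-\D+\d+-AllExpenditure\.csv"

# findall returns the matched relative URLs in page order.
matches = re.findall(signature, sample_html)
for url in matches:
    print(url)
```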

In [100]:
for df in data_dict.values():
    df.columns = df.columns.str.replace('ï»¿', '', regex=False) #removes byte order mark
    df.columns = df.columns.str.strip() #removes leading/trailing white space

In [101]:
UKRI_spending = pd.concat(data_dict.values())
UKRI_spending.dropna(how='all', axis=1, inplace=True) #remove blank columns
UKRI_spending.head()

Unnamed: 0,Department Family,Entity,Date,Expense Type,Expense Area,Supplier,Transaction Number,Amount,Item Description,Unnamed: 11
0,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),GCRF - Programme Delivery,Non Employee Expenses Supplier,98856194,170.0,Fees (Including College Members),
1,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),GCRF - Programme Delivery,Non Employee Expenses Supplier,98856195,170.0,Fees (Including College Members),
2,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),Cultural Value EDI,Non Employee Expenses Supplier,98856196,340.0,Fees (Including College Members),
3,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),GCRF - Programme Delivery,Non Employee Expenses Supplier,98856197,170.0,Fees (Including College Members),
4,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),Executive Office,Non Employee Expenses Supplier,98856198,500.0,Fees (Including College Members),


In [102]:
UKRI_spending[UKRI_spending["Unnamed: 11"].notnull()]

Unnamed: 0,Department Family,Entity,Date,Expense Type,Expense Area,Supplier,Transaction Number,Amount,Item Description,Unnamed: 11
50640,BEIS,UKRI - Innovate UK,01/01/2020,Assessor Fees,Healthy Ageing,CATON BELL LTD,60895,1710.0,Assessor Fees,...


In [103]:
UKRI_spending.drop(["Unnamed: 11"], axis=1, inplace=True) #only has one entry

In [104]:
UKRI_spending.head()

Unnamed: 0,Department Family,Entity,Date,Expense Type,Expense Area,Supplier,Transaction Number,Amount,Item Description
0,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),GCRF - Programme Delivery,Non Employee Expenses Supplier,98856194,170.0,Fees (Including College Members)
1,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),GCRF - Programme Delivery,Non Employee Expenses Supplier,98856195,170.0,Fees (Including College Members)
2,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),Cultural Value EDI,Non Employee Expenses Supplier,98856196,340.0,Fees (Including College Members)
3,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),GCRF - Programme Delivery,Non Employee Expenses Supplier,98856197,170.0,Fees (Including College Members)
4,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),Executive Office,Non Employee Expenses Supplier,98856198,500.0,Fees (Including College Members)


In [105]:
UKRI_spending.isna().sum()

Department Family     269654
Entity                269654
Date                  269663
Expense Type          270457
Expense Area          273204
Supplier              269655
Transaction Number    269654
Amount                269654
Item Description      270455
dtype: int64

In [106]:
UKRI_spending.dropna(how="all", inplace=True)
UKRI_spending.isna().sum()

Department Family        0
Entity                   0
Date                     9
Expense Type           803
Expense Area          3550
Supplier                 1
Transaction Number       0
Amount                   0
Item Description       801
dtype: int64

### Enough entries remain that we can drop the rows with remaining NaNs

In [107]:
UKRI_spending.dropna(inplace=True)
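
The two drops above behave differently: `how="all"` removes only the 269,654 completely empty rows, while the default `dropna()` removes any row with at least one missing field. A small sketch with made-up rows (the supplier and area names are placeholders):

```python
import numpy as np
import pandas as pd

# Illustrative rows: one complete, one with a single gap, one entirely empty.
toy = pd.DataFrame({
    "Supplier": ["Supplier A", "Supplier B", np.nan],
    "Expense Area": ["Area 1", np.nan, np.nan],
    "Amount": ["170.0", "340.0", np.nan],
})

# how="all" removes only rows where every column is missing.
no_blank_rows = toy.dropna(how="all")

# The default (how="any") removes rows with at least one missing value.
complete_rows = toy.dropna()

print(len(toy), len(no_blank_rows), len(complete_rows))
```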

In [108]:
UKRI_spending.head()

Unnamed: 0,Department Family,Entity,Date,Expense Type,Expense Area,Supplier,Transaction Number,Amount,Item Description
0,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),GCRF - Programme Delivery,Non Employee Expenses Supplier,98856194,170.0,Fees (Including College Members)
1,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),GCRF - Programme Delivery,Non Employee Expenses Supplier,98856195,170.0,Fees (Including College Members)
2,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),Cultural Value EDI,Non Employee Expenses Supplier,98856196,340.0,Fees (Including College Members)
3,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),GCRF - Programme Delivery,Non Employee Expenses Supplier,98856197,170.0,Fees (Including College Members)
4,BEIS,UKRI - AHRC,02/12/2020,Fees (Including College Members),Executive Office,Non Employee Expenses Supplier,98856198,500.0,Fees (Including College Members)


In [109]:
UKRI_spending.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1576642 entries, 0 to 53771
Data columns (total 9 columns):
 #   Column              Non-Null Count    Dtype 
---  ------              --------------    ----- 
 0   Department Family   1576642 non-null  object
 1   Entity              1576642 non-null  object
 2   Date                1576642 non-null  object
 3   Expense Type        1576642 non-null  object
 4   Expense Area        1576642 non-null  object
 5   Supplier            1576642 non-null  object
 6   Transaction Number  1576642 non-null  object
 7   Amount              1576642 non-null  object
 8   Item Description    1576642 non-null  object
dtypes: object(9)
memory usage: 120.3+ MB


In [113]:
UKRI_spending.describe().transpose()

Unnamed: 0,count,unique,top,freq
Department Family,1576642,1,BEIS,1576642
Entity,1576642,16,UKRI - STFC,400559
Date,1576642,905,20/06/2018,9296
Expense Type,1576642,760,T&S UK - Public Transport,181866
Expense Area,1576642,3106,Unspecified,110315
Supplier,1576642,21752,Unspecified,196629
Transaction Number,1576642,665062,98357351,3799
Amount,1576642,304572,2.50,32680
Item Description,1576642,1504,T&S UK - Public Transport,175407


### Fixing the dtypes of the different columns, as the import is string only. This involves removing some formatting.

In [None]:
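# dates are written day first (e.g. 20/06/2018), so parse them explicitly as such
UKRI_spending["Date"] = pd.to_datetime(UKRI_spending["Date"], dayfirst=True)
# strip thousands separators and parentheses before converting to float
# (note: if the parentheses mark negative amounts, the sign is discarded here)
UKRI_spending["Amount"] = UKRI_spending["Amount"].str.replace(",","", regex=False)
UKRI_spending["Amount"] = UKRI_spending["Amount"].str.replace("(","", regex=False)
UKRI_spending["Amount"] = UKRI_spending["Amount"].str.replace(")","", regex=False).astype(float)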
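
A short sketch of the conversion on two made-up rows. The `signed` variant is an assumption rather than part of the original cleaning: it treats a parenthesised amount as negative, which is a common accounting convention but has not been confirmed for this dataset.

```python
import pandas as pd

# Made-up rows in the same string format as the raw data.
raw = pd.DataFrame({
    "Date": ["02/12/2020", "20/06/2018"],
    "Amount": ["1,710.00", "(170.00)"],
})

# Day-first parsing: 02/12/2020 is 2 December, not 12 February.
dates = pd.to_datetime(raw["Date"], dayfirst=True)

# Remove commas and parentheses in one pass, then convert to float.
stripped = raw["Amount"].str.replace(r"[,()]", "", regex=True).astype(float)

# Assumed variant: keep the sign if parentheses denote negative amounts.
is_negative = raw["Amount"].str.startswith("(")
signed = stripped.where(~is_negative, -stripped)

print(dates.dt.month.tolist(), stripped.tolist(), signed.tolist())
```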

### Only the "Research Grant Expenditure" expense type will be used, as the other types are too vague

In [128]:
UKRI_spending['Expense Type'].value_counts().head(10)

T&S UK - Public Transport            181866
Chemicals                            166449
Research Grant Expenditure           103637
Research Consumables                  67633
T&S UK - Accommodation                64939
Mobile Phone Costs                    55769
Projects CIP                          49364
T&S UK - Subsistence                  47347
Purchase of non capital equipment     43791
Resource grants                       43456
Name: Expense Type, dtype: int64

In [129]:
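# .copy() so the later column assignments act on an independent frame, not a view of the original
UKRI_spending = UKRI_spending[UKRI_spending['Expense Type'] == 'Research Grant Expenditure'].copy()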

### Removing GRANT and (GRANT) from institute names as they refer to the same suppliers

In [130]:
UKRI_spending['Supplier'] = UKRI_spending['Supplier'].str.replace(" GRANT","", regex=False)
UKRI_spending['Supplier'] = UKRI_spending['Supplier'].str.replace(" (GRANT)","", regex=False)
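
One possible refinement, sketched under the assumption that the marker only ever appears at the end of a supplier name: a single regex anchored to the end of the string removes both forms and leaves names that merely contain the letters "GRANT" untouched. The supplier names below are hypothetical examples.

```python
import pandas as pd

# Hypothetical supplier names covering both suffix styles and a look-alike.
suppliers = pd.Series([
    "UNIVERSITY OF OXFORD GRANT",
    "UNIVERSITY OF OXFORD (GRANT)",
    "UNIVERSITY OF OXFORD",
    "GRANTHAM COLLEGE",
])

# Strip a trailing "GRANT" or "(GRANT)" plus the white space around it.
cleaned = suppliers.str.replace(r"\s*\(?GRANT\)?\s*$", "", regex=True)

print(cleaned.tolist())
```

After this step the first three entries share one name, so their spending can be grouped under a single supplier.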