### Load Data Up To 2021

Source: the Graduate Employment Survey collection on [data.gov.sg](https://beta.data.gov.sg/collections/415/view)

In [2]:
import pandas as pd

ori_df = pd.read_csv("GraduateEmploymentSurveyNTUNUSSITSMUSUSSSUTD.csv")
ori_df.shape

(1121, 12)

In [3]:
ori_df["university"].unique()

array(['Nanyang Technological University',
       'National University of Singapore',
       'Singapore Management University',
       'Singapore Institute of Technology',
       'Singapore University of Technology and Design',
       'Singapore University of Social Sciences'], dtype=object)

Handle an SMU quirk in the data.gov.sg dataset: `(4-years programme)` is appended to the end of degree names, so strip it.

In [12]:
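from table_parser import magic_clean_strings

# Suffix that the data.gov.sg dataset appends to SMU degree names.
SMU_SUFFIX = "(4-years programme)"

def clean_smu(name):
    # Drop the suffix, then trim any whitespace left behind.
    return name.replace(SMU_SUFFIX, "").strip()

ori_df["degree"] = ori_df["degree"].apply(magic_clean_strings).apply(clean_smu)
ori_df["school"] = ori_df["school"].apply(magic_clean_strings)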
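
The dataset spells the suffix in two ways, `(4-years programme)` and `(4-year programme)`, sometimes followed by a footnote asterisk. A plain `str.replace` on one spelling misses the other. The following is a minimal sketch of a regex-based alternative; `strip_programme_suffix` is a hypothetical helper, not part of `table_parser`, and whether the suffix should also be removed from school names is a choice left to the reader.

```python
import re

# Matches "(4-year programme)" or "(4-years programme)" at the end of a
# string, with an optional trailing footnote asterisk.
_SUFFIX = re.compile(r"\s*\(4-years? programme\)\s*\*?\s*$")

def strip_programme_suffix(name: str) -> str:
    """Remove the programme-length suffix from a degree or school name."""
    return _SUFFIX.sub("", name).strip()

# Sample values copied from the dataset output in this notebook.
examples = [
    "School of Accountancy (4-years programme) *",
    "School of Business (4-year programme) *",
    "Faculty of Law",
]
cleaned = [strip_programme_suffix(s) for s in examples]
```

It could be applied in place of `clean_smu`, for example `ori_df["degree"].apply(strip_programme_suffix)`.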

In [10]:
ori_df.school.unique()

array(['College of Business (Nanyang Business School)',
       'College of Engineering',
       'College of Humanities, Arts & Social Sciences',
       'College of Sciences', 'National Institute of Education (NIE)',
       'Faculty of Arts & Social Sciences', 'NUS Business School',
       'School of Computing', 'Faculty of Dentistry',
       'School of Design & Environment', 'Faculty of Engineering',
       'Faculty of Law', 'YLL School of Medicine',
       'Yong Siew Toh Conservatory of Music', 'Faculty of Science',
       'School of Accountancy (4-years programme) *',
       'School of Business (4-years programme) *',
       'School of Economics (4-years programme) *',
       'School of Information Systems (4-years programme) *',
       'School of Social Sciences (4-years programme) *',
       'School of Law (4-years programme) *',
       'School of Accountancy (4-year programme) *',
       'School of Business (4-year programme) *',
       'School of Economics (4-year programme) *',


### Parse 2022 -> Present

Download PDFs from MOE

URL Pattern: `https://www.moe.gov.sg/-/media/files/post-secondary/ges-{YYYY}/web-publication-{nus/sit/ntu/sutd/suss}-ges-{YYYY}.ashx`
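
The pattern above can be expanded into concrete URLs with a small helper. This is a sketch only: `build_ges_urls` and the constant names are made up for illustration, the year `2022` is taken from the section heading, and it has not been checked that every generated URL actually exists on the MOE site.

```python
# Template following the URL pattern quoted above.
URL_TEMPLATE = (
    "https://www.moe.gov.sg/-/media/files/post-secondary/"
    "ges-{year}/web-publication-{uni}-ges-{year}.ashx"
)

# University slugs listed in the pattern.
UNIVERSITIES = ["nus", "sit", "ntu", "sutd", "suss"]

def build_ges_urls(years):
    """Return one candidate URL per (year, university) pair."""
    return [
        URL_TEMPLATE.format(year=year, uni=uni)
        for year in years
        for uni in UNIVERSITIES
    ]

urls = build_ges_urls([2022])
```

Each URL can then be fetched with any HTTP client and saved under `in/` so that the parsing cell below picks it up.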

In [5]:
from glob import glob
from table_parser import GESTableParser
parser = GESTableParser()

# Parse every downloaded PDF into its own DataFrame.
dfs = [parser.extract_tab_from_one_pdf(f) for f in glob("in/*.pdf")]



### Concat

In [13]:
merged_df = pd.concat([ori_df, *dfs], axis=0, ignore_index=True)
merged_df.shape

(1367, 12)
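
`pd.concat` aligns frames by column name and, by default, fills any column missing from a frame with `NaN`, so a mislabelled column in one parsed PDF would silently add extra columns instead of raising an error. The shape above still shows 12 columns, which suggests the frames line up. A quick check before concatenating could look like the sketch below; `column_mismatches` is a hypothetical helper and the `faculty` column is invented purely to show a mismatch.

```python
def column_mismatches(reference, others):
    """Compare each column list in `others` against `reference`.

    Returns {index: {"missing": [...], "extra": [...]}} for every
    list that differs; an empty dict means all lists match.
    """
    ref = set(reference)
    report = {}
    for i, cols in enumerate(others):
        missing = sorted(ref - set(cols))
        extra = sorted(set(cols) - ref)
        if missing or extra:
            report[i] = {"missing": missing, "extra": extra}
    return report

# Illustrative column lists; the second one has a renamed column.
reference = ["year", "university", "school", "degree"]
parsed = [
    ["year", "university", "school", "degree"],
    ["year", "university", "faculty", "degree"],
]
report = column_mismatches(reference, parsed)
```

With the notebook's variables, the call would be `column_mismatches(list(ori_df.columns), [list(d.columns) for d in dfs])`.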

In [14]:
merged_df.school.unique()

array(['College of Business (Nanyang Business School)',
       'College of Engineering',
       'College of Humanities, Arts & Social Sciences',
       'College of Sciences', 'National Institute of Education (NIE)',
       'Faculty of Arts & Social Sciences', 'NUS Business School',
       'School of Computing', 'Faculty of Dentistry',
       'School of Design & Environment', 'Faculty of Engineering',
       'Faculty of Law', 'YLL School of Medicine',
       'Yong Siew Toh Conservatory of Music', 'Faculty of Science',
       'School of Accountancy (4-years programme)',
       'School of Business (4-years programme)',
       'School of Economics (4-years programme)',
       'School of Information Systems (4-years programme)',
       'School of Social Sciences (4-years programme)',
       'School of Law (4-years programme)',
       'School of Accountancy (4-year programme)',
       'School of Business (4-year programme)',
       'School of Economics (4-year programme)',
       'School of 

**Warning:**

Naming is inconsistent across the years. For example, the same name shows up as:
- `Trinity College Dublin / Singapore Institute of Technology-Trinity College Dublin`
- `Singapore Institute of Technology -Trinity College Dublin / Trinity College Dublin`
- `SIT-Trinity College Dublin / Trinity College Dublin`

More manual cleaning is required here if you want `groupby` and aggregations to work nicely 😢
This is left as an exercise 😁
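
As a starting point for that cleanup, one common approach is an alias table that maps every known variant to a single canonical label. The sketch below uses the three variants listed above; the canonical label `SIT-Trinity College Dublin` and the helper names are arbitrary choices for illustration, and the real dataset will need many more entries.

```python
import re

# Hypothetical canonical label chosen for this example.
CANONICAL = "SIT-Trinity College Dublin"

# Variants copied from the warning above.
VARIANTS = [
    "Trinity College Dublin / Singapore Institute of Technology-Trinity College Dublin",
    "Singapore Institute of Technology -Trinity College Dublin / Trinity College Dublin",
    "SIT-Trinity College Dublin / Trinity College Dublin",
]

def normalise_key(name: str) -> str:
    """Lower-case, collapse whitespace, and remove spaces around hyphens."""
    key = re.sub(r"\s+", " ", name).strip().lower()
    return re.sub(r"\s*-\s*", "-", key)

# Lookup table from normalised variant to canonical label.
ALIASES = {normalise_key(v): CANONICAL for v in VARIANTS}

def canonical_name(name: str) -> str:
    """Return the canonical label if the name is a known variant, else the name unchanged."""
    return ALIASES.get(normalise_key(name), name)
```

Assuming these names live in the `school` column (check which column they appear in first), the mapping could be applied with `merged_df["school"].apply(canonical_name)` before grouping.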

In [19]:
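# Make sure the year is an integer so sorting is numeric.
merged_df["year"] = merged_df["year"].astype(int)
# Replace missing values with the string 'na', then order rows by year and university.
merged_df = merged_df.fillna('na').sort_values(["year", "university"])
merged_df.to_csv("2024.csv", index=False)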