In [9]:
import pandas as pd

In [10]:
from datetime import datetime

# Load

In [12]:
# Load raw dataset
df = pd.read_csv('datasets/GovernmentProcurementviaGeBIZ.csv')

In [13]:
df.shape #how big the dataset is - 18021 rows, 7 columns

(18021, 7)

In [14]:
df.info() # get summary of DataFrame, including data types. alternative way: using `df.dtypes` 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18021 entries, 0 to 18020
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   tender_no             18021 non-null  object 
 1   tender_description    18021 non-null  object 
 2   agency                18021 non-null  object 
 3   award_date            18021 non-null  object 
 4   tender_detail_status  18021 non-null  object 
 5   supplier_name         18021 non-null  object 
 6   awarded_amt           18021 non-null  float64
dtypes: float64(1), object(6)
memory usage: 985.7+ KB


In [15]:
df.head(3) #first 3 rows

Unnamed: 0,tender_no,tender_description,agency,award_date,tender_detail_status,supplier_name,awarded_amt
0,ACR000ETT20300002,INVITATION TO TENDER FOR THE PROVISION OF SERV...,Accounting And Corporate Regulatory Authority,10/11/2020,Awarded by Items,DELOITTE & TOUCHE ENTERPRISE RISK SERVICES PTE...,285000.0
1,ACR000ETT20300002,INVITATION TO TENDER FOR THE PROVISION OF SERV...,Accounting And Corporate Regulatory Authority,10/11/2020,Awarded by Items,KPMG SERVICES PTE. LTD.,90000.0
2,ACR000ETT20300003,PROVISION OF AN IT SECURITY CONTROLS AND OPERA...,Accounting And Corporate Regulatory Authority,9/12/2020,Awarded to Suppliers,ERNST & YOUNG ADVISORY PTE. LTD.,182400.0


# Transform

In [16]:
# catch if the file already has missing values before any transformations
print("Checking for null values before cleaning:\n", df.isna().sum())

Checking for null values before cleaning:
 tender_no               0
tender_description      0
agency                  0
award_date              0
tender_detail_status    0
supplier_name           0
awarded_amt             0
dtype: int64


## 1) `tender_no` column

- Alphanumeric ID
- `tender_no` is NOT UNIQUE. Same tenders shows up on multiple rows as a tender may be split across **multiple** suppliers. 
- A tender amy also have multiple line items (e.g. laptops, printers, services). 
- A tender may even have some items awarded or not. There's one tender with `FINVITETT20300009` tendor ID  which appears 131 times in the dataset.
- Transformation(s)
    - Strip whitespace

In [17]:
df['tender_no'] = df['tender_no'].str.strip() #strip whitespace 

In [18]:
# tender IDs are not unique. Use .nunique() to tell us how many UNIQUE tender IDs there are over the past years. ERROR
print(df['tender_no'].nunique())   # ✅ 11915

11915


## 2) `tender_description`

- Long-text with varied casing (e.g. some descriptions are in ALL CAPS while some are Mixed Case)
- Transformation(s)
    - Strip whitespace
    - Convert to lowercase (for future enhancements - text classification) 

In [19]:
df['tender_description_clean'] = df['tender_description'].str.strip().str.lower()
print("Sample tender_description:\n", df['tender_description_clean'].head(20))

Sample tender_description:
 0     invitation to tender for the provision of serv...
1     invitation to tender for the provision of serv...
2     provision of an it security controls and opera...
3     conceptualization, design, build, set-up of ne...
4     design, development, customization, delivery, ...
5     supply, delivery, installation, testing, commi...
6     invitation to tender for the application softw...
7     for provision of the application software main...
8     invitation to tender for the provision of the ...
9     invitation to tender for the provision of serv...
10    invitation to tender for the provision of serv...
11    supply, delivery, support of database activity...
12    development of the bodies of knowledge (boks) ...
13    invitation to tender for the provision of chan...
14    invitation to tender for the provision of chan...
15    invitation to tender for the provision of chan...
16    itt for the provisioning and maintenance of an...
17    itt for the pr

## 3) `agency`

- Values are mixed casing but consistent. E.g. `Ministry of Education`, `Attorney-General's Chambers`.
    - Did not convert to uppercase as you'll lose readability. 
- For easier querying in DB by analysts, map it to a controlled vocabulary of "shortform" names for respective agencies. E.g. `Ministry of Education` -> `MOE`
- **Transformation(s)**
    - Trim whitespace, keep title case
    - **Challenges:**
        - Singapore has dozens of staturatory boards, ministries, councils, etc.
            - Core Ministries (MOE, MOH, MINDEF, etc)
            - Statutory Boards (ACRA, BCA, LTD, HDB, etc)
        - Formatting differences: `Ministry of Education - Schools` vs `Ministry of Education`, `Supreme Court - State Courts` vs `Supreme Court of Singapore`
        - Some procuring entity is often listed as the sub-unit of a ministry. E.g. Instead of `Ministry of Education`, tender listed as `Temasek Polytechnic`. This means schools (or even polytechnics, junior colleges, statutory boards) procure directly, but they are still under MOE's umbrella. So the `agency` column is not just ministries/stat boards — it sometimes includes schools, IHLs, or specific institutions.
        - Options
            - **1) Leave as-is**
                - Pros: Preserves raw truth, most accurate
                - Cons: Have 100s of unique agencies (MOE schools, hospitals, etc) making aggregation harder for analysts down the ro
            - **2) Group under parent agency**
                - E.g. Map all schools/tertiary education back to "Ministry of Education". Map all hospitals to "Ministry of Health"
                - Pros:
                    - Cleaner analysis at ministry/stat board level
                - Cons:
                    - Lose granularity (can't see which specific school procured)
            - **Decision**
                - Keep raw `agency` column untouched to avoid losing details
                - Create dictionary of known/official stat boards & ministry names.
                    - Cons: Does not handle sub-departments or variations. E.g. Ministry of Finance - Vital vs Ministry of Finance
                - Rule-based mapping.
                    - Pros
                        - Easy to extend. Just add another keyword rule when a new type shows up.
                        - Reduces maintenance costs (no need to update dictionary every time)
                    - Cons
                        - Slight risk of false positives if a keyword matches unintended strings
                - Create column `agency_parent` for roll-ups.
                    - Schools -> `MOE`
                    - Hospitals -> `MOH`
                    - Anything with no clear acronym or misc entities like `Istana`, leave as-is (full name)
                - This way, queries in DB can work at either:
                    - granular level (school-level spend). E.g. "Which school procured the most in 2022?"
                    - parent level (ministry-level spend). E.g. "Which ministry spends the most?"

Most data teams combine both:

Dictionary for exact names

Covers statutory boards, IHLs, and agencies that don’t vary much.

E.g. "HDB", "NEA", "MAS", "Temasek Polytechnic".

Rules for families / messy cases

Ministries and their sub-units.

Schools, hospitals, PMO sub-offices.

Catch-alls like "contains('School') → MOE".

This hybrid approach gives you:

Clean explicit mappings where possible.

Robust handling for all the messy variants.

In [52]:
df['agency'] = df['agency'].str.strip().str.title() #remove spaces and put into title case

In [53]:
# By default, if no rule matches, the agency is its own parent. agency_sub stays blank unless we explicitly split it
df['agency_parent'] = df['agency']
df['agency_sub'] = None 

# Case 1: Split on " - " or "-"
split_mask = df['agency'].str.contains(r'\s*-\s*', regex=True)
df.loc[split_mask, ['agency_parent', 'agency_sub']] = (
    df.loc[split_mask, 'agency'].str.split(r'\s*-\s*', n=1, expand=True)
)

# Case 2: Judiciary
# E.g. Judiciary - Supreme Court -> parent: Judiciary, sub -> Supreme Court. Extensible → if we later have "Family Justice Courts", this rule will help split. 
jud_mask = df['agency'].str.startswith("Judiciary", na=False)
df.loc[jud_mask, 'agency_parent'] = "Judiciary"
df.loc[jud_mask, 'agency_sub'] = df.loc[jud_mask, 'agency'].str.replace("Judiciary[-]?", "", regex=True).str.strip()

# Case 3: Prime Minister's Office
pmo_mask = df['agency'].str.startswith("Prime Minister", na=False)
df.loc[pmo_mask, 'agency_parent'] = "Prime Minister's Office"
df.loc[pmo_mask, 'agency_sub'] = df.loc[pmo_mask, 'agency'].str.replace("Prime Minister'S Office[-]?", "", regex=True).str.strip()

# Case 4: Schools / IHLs (MOE roll-up)
edu_keywords = ["School", "Institution", "Polytechnic", "University", "College"]
edu_mask = df['agency'].str.contains('|'.join(edu_keywords), case=False, na=False)
df.loc[edu_mask, 'agency_parent'] = "Ministry Of Education"
df.loc[edu_mask, 'agency_sub'] = df.loc[edu_mask, 'agency']

# Case 5: Hospitals / Polyclinics (MOH roll-up)
health_keywords = ["Hospital", "Polyclinic"]
health_mask = df['agency'].str.contains('|'.join(health_keywords), case=False, na=False)
df.loc[health_mask, 'agency_parent'] = "Ministry Of Health"
df.loc[health_mask, 'agency_sub'] = df.loc[health_mask, 'agency'] #E.g. KK Hospital will be preserved

# Case 6: Standalone. Parliament / Istana as self-contained parents. No sub/child as they aren't split further
special = ["Parliament", "Istana"]
spec_mask = df['agency'].isin(special)
df.loc[spec_mask, 'agency_parent'] = df.loc[spec_mask, 'agency']
df.loc[spec_mask, 'agency_sub'] = None

In [54]:
# Validation

# 1. Spot check random rows
df[['agency','agency_parent','agency_sub']].sample(5)

# 2. Compare parent vs raw counts - parents should be fewer than raw agencies. If =, roll-up did not wrok
print("Unique raw agencies:", df['agency'].nunique())
print("Unique agency_parent:", df['agency_parent'].nunique())

#3 . Find where parent == raw (i.e. standalone agencies). Should show statutory boards like "HDB"
standalone = df.loc[df['agency'] == df['agency_parent'], 'agency'].unique()
print("Standalone agencies (no split, not rolled up):")
print(standalone[:100])

Unique raw agencies: 111
Unique agency_parent: 61
Standalone agencies (no split, not rolled up):
['Accounting And Corporate Regulatory Authority'
 'Building And Construction Authority' 'Central Provident Fund Board'
 'Board Of Architects' 'Civil Aviation Authority Of Singapore'
 'Competition And Consumer Commission Of Singapore (Cccs)'
 'Council For Estate Agencies'
 'Gambling Regulatory Authority Of Singapore (Gra)' 'Ministry Of Defence'
 'Defence Science And Technology Agency'
 'Singapore Examinations And Assessment Board'
 'Economic Development Board' 'Housing And Development Board'
 'Energy Market Authority Of Singapore'
 'Ministry Of Sustainability And The Environment' 'Enterprise Singapore'
 'Ministry Of Foreign Affairs' 'Government Technology Agency  (Govtech)'
 'Health Promotion Board' 'Home Team Science And Technology Agency'
 'Land Transport Authority' 'Health Sciences Authority'
 'Intellectual Property Office Of Singapore'
 'Inland Revenue Authority Of Singapore' 'Istana'
 '

In [55]:
# Find where parent != raw (i.e. split worked)
# e.g. #e.g. Ministry of Finance - Vital -> parent: Ministry of Finance, sub: Vital
split_examples = df.loc[df['agency'] != df['agency_parent'], 
                        ['agency','agency_parent','agency_sub']].sample(20) 
print(split_examples)

                                                  agency  \
17773  Ministry Of Trade & Industry-Ministry Headquarter   
437    Ministry Of Culture, Community And Youth - Min...   
4885             Ministry Of Health-Ministry Headquarter   
15562                               Republic Polytechnic   
2311                         Ministry Of Finance - Vital   
702    Ministry Of Social And Family Development - Mi...   
10340  Ministry Of Manpower - Occupational Safety & H...   
4935             Ministry Of Health-Ministry Headquarter   
2546                         Ministry Of Finance - Vital   
9237   Ministry Of Home Affairs - Ministry Headquarter 1   
42                           Attorney-General'S Chambers   
5057       Ministry Of Home Affairs-Ministry Headquarter   
625    Ministry Of Social And Family Development - Mi...   
2408                         Ministry Of Finance - Vital   
17771  Ministry Of Trade & Industry-Ministry Headquarter   
2343                         Ministry Of

In [50]:
# Check school roll-ups
schools = df[df['agency_parent'] == "Ministry Of Education"]
print("Examples of schools under MOE:\n", schools[['agency','agency_sub']].sample(10))

Examples of schools under MOE:
                       agency             agency_sub
17689    Temasek Polytechnic    Temasek Polytechnic
13355    Nanyang Polytechnic    Nanyang Polytechnic
13441    Nanyang Polytechnic    Nanyang Polytechnic
9498   Ministry Of Education                   None
16519  Singapore Polytechnic  Singapore Polytechnic
15548   Republic Polytechnic   Republic Polytechnic
10011  Ministry Of Education                   None
17716    Temasek Polytechnic    Temasek Polytechnic
1188   Civil Service College  Civil Service College
10102  Ministry Of Education                   None


In [56]:
# Check hospital roll-ups
hospitals = df[df['agency_parent'] == "Ministry Of Health"]
print("Examples of healthcare institutions:\n", hospitals[['agency','agency_sub']].head(10))

Examples of healthcare institutions:
 Empty DataFrame
Columns: [agency, agency_sub]
Index: []


In [57]:
# Where parent is missing (shouldn’t happen unless bug)
missing_parent = df[df['agency_parent'].isna()]
print("Missing parents:", missing_parent)

Missing parents:           id          tender_no  \
13        13  AGC000ETT19300016   
14        14  AGC000ETT19300016   
15        15  AGC000ETT19300016   
16        16  AGC000ETT20300002   
17        17  AGC000ETT20300003   
...      ...                ...   
17779  17779  TRAHQ0ETT23000002   
17780  17780  TRAHQ0ETT23000002   
17781  17781  TRAHQ0ETT23000002   
17782  17782  TRAHQ0ETT24000001   
17783  17783  TRAHQ0ETT24000002   

                                      tender_description  \
13     INVITATION TO TENDER FOR THE PROVISION OF CHAN...   
14     INVITATION TO TENDER FOR THE PROVISION OF CHAN...   
15     INVITATION TO TENDER FOR THE PROVISION OF CHAN...   
16     ITT for the Provisioning and Maintenance of an...   
17     ITT FOR THE PROVISION OF DESIGN AND DELIVERY S...   
...                                                  ...   
17779  Provision of ICT Professional Services for an ...   
17780  Provision of ICT Professional Services for an ...   
17781  Provision of IC

## 4) award_date

- Currently, `award_date` is in `dd/mm/yyyy` or `d/m/yyyy` formats
- Transformation(s):
    - Convert to `datetime`.
        - There’s no separate `date` dtype that works nicely with vectorized operations.
        - Even if your data only has dates (no times), Pandas stores it as datetime because its `datetime64` is the standard for date-like data.
        - You could technically convert to Python’s datetime.date, but then your column becomes object dtype (slower, clunkier, not recommended).
        - Any invalid/missing dates will become `NaT`
- Note: In SQLAlchemy load step, map it to `DATE`. 

In [24]:
# dayfirst = True because dates are in dd/mm/yyyy format. with errors='coerce', if pandas sees an invalid date, it will set it as NaT (Not a Time) 
df['award_date'] = pd.to_datetime(df['award_date'], dayfirst=True, errors='coerce')
# confirming that it's now datetime64, not object (string)
print("Award date dtype:", df['award_date'].dtype)

Award date dtype: datetime64[ns]


## 5) `tender_detail_status`

**Has 4 different statuses:**
- `Awarded to Suppliers` - Supplier was chosen and amount is recorded.
- `Awarded by Items` - Instead of one supplier getting the whole thing, individual line items were awarded to different suppliers. 
- `Awarded to No Suppliers` - Tender closed but nobody was awarded. Hence corresponding column `awarded_amt = 0`
- `Award by Interface` - System generated status. Usually means the award details were inserted by an automated interface instead of manual entry. Functionally, it's still an award.

**Status phrasing is inconsistent.**
- `Awarded by items` vs `Awarded to Suppliers` mix `"by"` vs `"to"`
- `Award by interface record` does not follow the same tense/grammar pattern like the other statuses.
- if we want to run `SELECT COUNT(*) WHERE tender_detail_status LIKE `Awarded%`, we will miss `Award by interface record`. The subtle difference makes grouping harder.

- Ideally we want **categorical columns** to be standardized into a controlled vocabulary (e.g. enum or lookup table) Why?
    - Consistency across rows
    - Easier to group/aggregate in SQL
    - Makes schema design more meaningful (e.g. a dimension table for tender status could have 4 values).


- **Transformation(s):**
    - Step 1 - Trim whitespace
    - Step 2 - Map to a controlled vocabulary for easier querying in DB (e.g. `"AWARDED_ITEMS"`, `"AWARDED_SUPPLIERS"`)

In [25]:
df['tender_detail_status'] = df['tender_detail_status'].str.strip() #strip whitespace

In [26]:
# Create a dictionary to raw text into clean categories
status_map = {
    "Award by interface record": "AWARDED_INTERFACE",
    "Awarded by Items": "AWARDED_ITEMS",
    "Awarded to No Suppliers": "NO_SUPPLIERS",
    "Awarded to Suppliers": "AWARDED_SUPPLIERS"
}

In [27]:
# Create a new column with the clean values
df['tender_detail_status_clean'] = df['tender_detail_status'].map(status_map)

In [28]:
# Check distribution of statuses after cleaning
print("Tender detail status counts:\n", df['tender_detail_status_clean'].value_counts())

Tender detail status counts:
 tender_detail_status_clean
AWARDED_SUPPLIERS    9174
AWARDED_ITEMS        7590
NO_SUPPLIERS          699
AWARDED_INTERFACE     558
Name: count, dtype: int64


In [29]:
# TEST - Tender status consistency check. 
unexpected_status = df.loc[~df['tender_detail_status'].isin(status_map.keys())]
# empty dataframe as dataset only has 4 statuses we already mapped. 
# While there are no surprise values in this dataset, it's just a safeguard for future updates.
# E.g. imagine a new status "Cancelled" appears. this check will flag it
print(unexpected_status) 

Empty DataFrame
Columns: [tender_no, tender_description, agency, award_date, tender_detail_status, supplier_name, awarded_amt, tender_description_clean, agency_parent, agency_sub, tender_detail_status_clean]
Index: []


## 6) `supplier_name`

- Supplier names are inconsistent. Messy casing & suffixes due to different capitalization style.
    - E.g. `KPMG SERVICES PTE. LTD`, `CRIMSONLOGIC PTE LTD`, `Checkbox Technology Pte Ltd`
- If we don’t standardize, queries like "GROUP BY supplier" will count them separately.
- Important as we want to avoid duplicates in SQL when grouping by supplier

- **Transformations:**
    - Strip whitespace
    - Identify rows with double spaces. Why?
        - If you do a `GROUP BY supplier_name` in SQL, `BGPROTECT  LTD` and `BGPROTECT LTD` will be treated as two different suppliers.
    - Normalize suffixes (e.g. `PTE LTD, PTE. LTD -> PTE LTD)
    - Remove trailing periods

In [30]:
# Remove leading/trailing spaces make everything uppercase for consistency
df['supplier_name'] = df['supplier_name'].str.strip()

In [31]:
# Identify rows with double spaces in supplier_name column values
mask = df['supplier_name'].str.contains(r'\s{2,}', na=False) #2 or more double spaces, na=False ignore NaN values if any
df[mask].head(10)
print("Rows with double spaces:", mask.sum()) #if 0, means it's clean. otherwise, it means we have supplier names with double spaces

Rows with double spaces: 6


In [32]:
# show distinct supplier names that have double spaces 
df.loc[mask, 'supplier_name'].unique() 

array(['BGProtect  LTD',
       'Taisei Corporation-China State Construction  Engineering Corporation Limited Singapore Branch Joint Venture',
       'SUEZ (SINGAPORE) SERVICES  PTE. LTD.', 'KASTURI  PRODUCTION',
       'CHEMICALS TESTING & CALIBRATION  LABORATORY',
       'Winnie Manikam  Krishnaveni mrs Winnie Ubbink'], dtype=object)

In [33]:
# ensure supplier names are normalized. so no duplicates due to space issues
df['supplier_name'] = df['supplier_name'].str.replace(r'\s+', ' ', regex=True) 
print("Rows with double spaces:", mask.sum()) #if 0, means it's clean. 

Rows with double spaces: 6


In [34]:
# Normalizing common suffixes using regex (pattern matching)
df['supplier_name'] = df['supplier_name'].str.replace(r'PTE\.?', 'PTE', regex=True)
df['supplier_name'] = df['supplier_name'].str.replace(r'LTD\.?', 'LTD', regex=True)

In [35]:
# Remove trailing periods
df['supplier_name'] = df['supplier_name'].str.replace(r'\.\s*$', '', regex=True)

In [36]:
print("Sample suppliers:\n", df['supplier_name'].drop_duplicates().head(10)) #to check if cleaning works as expected

Sample suppliers:
 0     DELOITTE & TOUCHE ENTERPRISE RISK SERVICES PTE...
1                                 KPMG SERVICES PTE LTD
2                        ERNST & YOUNG ADVISORY PTE LTD
3                       D' PERCEPTION SINGAPORE PTE LTD
4                                   ALPHA ZETTA PTE LTD
5                         ACCENTURE SG SERVICES PTE LTD
6                                  CRIMSONLOGIC PTE LTD
7                              FPT ASIA PACIFIC PTE LTD
11                         NXGEN COMMUNICATIONS PTE LTD
12         PRICEWATERHOUSECOOPERS RISK SERVICES PTE LTD
Name: supplier_name, dtype: object


## 7) `awarded_amt` column

- No currency symbol cleanup is needed
- Ensure the award amount is numeric.
- If tender was closed with no suppliers, this column should be 0.
- **Transformation(s):**
    - Convert to numeric explicitly
    - Enforce rule: if status == `No Suppliers`, check awarded_amt = 0

In [37]:
# Convert to numeric, force errors to NaN (null).
df['awarded_amt'] = pd.to_numeric(df['awarded_amt'], errors='coerce')

In [38]:
# Business rule check - "No Suppliers" tenders should ALWAYS HAVE awarded_amt = 0
mask_no_suppliers = df['tender_detail_status_clean'] == "NO_SUPPLIERS"
if not (df.loc[mask_no_suppliers, 'awarded_amt'] == 0).all():
    print("WARNING: Some NO_SUPPLIERS tenders have non-zero awarded_amt")

In [39]:
# Using describe() to give us min, max, mean, percentiles. Good for spotting outliers.
print("Awarded_amt stats:\n", df['awarded_amt'].describe())

Awarded_amt stats:
 count    1.802100e+04
mean     6.100677e+06
std      4.120565e+07
min      0.000000e+00
25%      7.490000e+03
50%      1.709200e+05
75%      8.930000e+05
max      1.493179e+09
Name: awarded_amt, dtype: float64


In [40]:
# For outlier sanity check experiment

In [41]:
# 1) look for abnormal spikes by percentiles
df['awarded_amt'].describe(percentiles=[.5, .9, .99])

count    1.802100e+04
mean     6.100677e+06
std      4.120565e+07
min      0.000000e+00
50%      1.709200e+05
90%      5.157000e+06
99%      1.541576e+08
max      1.493179e+09
Name: awarded_amt, dtype: float64

In [42]:
# 2) Check for negative amounts
print("Negative amounts:", df[df['awarded_amt'] < 0]) #should be empty anyway. 

Negative amounts: Empty DataFrame
Columns: [tender_no, tender_description, agency, award_date, tender_detail_status, supplier_name, awarded_amt, tender_description_clean, agency_parent, agency_sub, tender_detail_status_clean]
Index: []


In [43]:
# 3) Check for suspiciously large/extreme amounts (e.g. > 1 billion SGD)
high_values = df[df['awarded_amt'] > 1e9]
print("Suspiciously large amounts:", high_values) 

Suspiciously large amounts:                tender_no                                 tender_description  \
7759   LTA000ETT19300138       Design and Construction of Changi East Depot   
7798   LTA000ETT19300194  Bus Contracting - Bulim and Sembawang-Yishun B...   
8054   LTA000ETT21000084  Design and Construction of Riviera Interchange...   
10819  NEA000ETT19300029  Proposed Erection of Integrated Waste Manageme...   

                            agency award_date  tender_detail_status  \
7759      Land Transport Authority 2021-05-28  Awarded to Suppliers   
7798      Land Transport Authority 2020-09-30  Awarded to Suppliers   
8054      Land Transport Authority 2022-09-30  Awarded to Suppliers   
10819  National Environment Agency 2020-04-14  Awarded to Suppliers   

                                           supplier_name   awarded_amt  \
7759   CHINA JINGYE ENGINEERING CORPORATION LIMITED (...  1.050500e+09   
7798                     TOWER TRANSIT SINGAPORE PTE LTD  1.025103e+09  

In [44]:
# Check for Nan/Nulls explicitly. Why? After .to_datetime() and .to_numeric(), some rows may have become NaT or NaN.
print("Checking for null values after cleaning:\n", df.isna().sum())

Checking for null values after cleaning:
 tender_no                         0
tender_description                0
agency                            0
award_date                        0
tender_detail_status              0
supplier_name                     0
awarded_amt                       0
tender_description_clean          0
agency_parent                  2783
agency_sub                    16177
tender_detail_status_clean        0
dtype: int64


# Load

- Primary Key
    - Since `tender_no` is not unique, we can either have a composite key (e.g. tender_no, supplier_name) or surrogate key `id` before loading. Surrogate key seems more futureproof.
    - E.g. Of making a surrogate key - `id BIGSERIAL PRIMARY KEY`

- Use PostgreSQL

# Data Cleaning

In [5]:
# Clean text data by removing non-word characters, extra spaces and converting text to lowercase

In [6]:
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Example preprocessing: cleaning text
def clean_text(text):
    text = re.sub(r'\W', ' ', text)  # Remove non-word characters
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = text.lower()  # Convert to lowercase
    return text

data['cleaned_description'] = data['tender_description'].apply(clean_text)

In [7]:
# Labelling the data 

In [9]:
def label_category(description):
    if 'construction' in description:
        return 'Construction'
    elif 'medical' in description:
        return 'Medical'
    else:
        return 'Other'

#This code adds a Category column to the DataFrame.
data['Category'] = data['cleaned_description'].apply(label_category)

In [10]:
# This code splits the data into 80% training and 20% testing sets
X = data['cleaned_description']
y = data['Category']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
# convert text data into numerical features using TF-IDF vectorization. This helps to represent the text data numerically
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

In [12]:
# train a random forest classifier to predict the categories
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model = RandomForestClassifier()
model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = model.predict(X_test_tfidf)

In [13]:
# add predicted categories to the dataframe
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

Construction       1.00      1.00      1.00        51
     Medical       1.00      1.00      1.00        38
       Other       1.00      1.00      1.00      3516

    accuracy                           1.00      3605
   macro avg       1.00      1.00      1.00      3605
weighted avg       1.00      1.00      1.00      3605



In [14]:
# Predict categories for the entire dataset
data_tfidf = vectorizer.transform(data['cleaned_description'])
data['Predicted_Category'] = model.predict(data_tfidf)

# Display the updated dataset
data.head()

Unnamed: 0,tender_no,tender_description,agency,award_date,tender_detail_status,supplier_name,awarded_amt,cleaned_description,Category,Predicted_Category
0,ACR000ETT20300002,INVITATION TO TENDER FOR THE PROVISION OF SERV...,Accounting And Corporate Regulatory Authority,10/11/2020,Awarded by Items,DELOITTE & TOUCHE ENTERPRISE RISK SERVICES PTE...,285000.0,invitation to tender for the provision of serv...,Other,Other
1,ACR000ETT20300002,INVITATION TO TENDER FOR THE PROVISION OF SERV...,Accounting And Corporate Regulatory Authority,10/11/2020,Awarded by Items,KPMG SERVICES PTE. LTD.,90000.0,invitation to tender for the provision of serv...,Other,Other
2,ACR000ETT20300003,PROVISION OF AN IT SECURITY CONTROLS AND OPERA...,Accounting And Corporate Regulatory Authority,9/12/2020,Awarded to Suppliers,ERNST & YOUNG ADVISORY PTE. LTD.,182400.0,provision of an it security controls and opera...,Other,Other
3,ACR000ETT20300004,"CONCEPTUALIZATION, DESIGN, BUILD, SET-UP OF NE...",Accounting And Corporate Regulatory Authority,9/3/2021,Awarded to Suppliers,D' PERCEPTION SINGAPORE PTE. LTD.,3071056.4,conceptualization design build set up of new o...,Other,Other
4,ACR000ETT21000001,"DESIGN, DEVELOPMENT, CUSTOMIZATION, DELIVERY, ...",Accounting And Corporate Regulatory Authority,6/9/2021,Awarded to Suppliers,ALPHA ZETTA PTE. LTD.,2321600.0,design development customization delivery inst...,Other,Other


In [16]:
# filter non-other predicted categories, we filter out the rows where the predicted category is "Other"
non_other_data = data[data['Predicted_Category'] != 'Other']
# Display the filtered dataset
non_other_data.head()

Unnamed: 0,tender_no,tender_description,agency,award_date,tender_detail_status,supplier_name,awarded_amt,cleaned_description,Category,Predicted_Category
79,BCA000ETT20300019,PROVISION OF LEGAL SERVICES TO THE BUILDING AN...,Building and Construction Authority,17/9/2020,Awarded by Items,ADVOCATUS LAW LLP,1010.0,provision of legal services to the building an...,Construction,Construction
80,BCA000ETT20300019,PROVISION OF LEGAL SERVICES TO THE BUILDING AN...,Building and Construction Authority,17/9/2020,Awarded by Items,BIRD & BIRD ATMD LLP,4000.0,provision of legal services to the building an...,Construction,Construction
81,BCA000ETT20300019,PROVISION OF LEGAL SERVICES TO THE BUILDING AN...,Building and Construction Authority,17/9/2020,Awarded by Items,CENTRAL CHAMBERS LAW CORPORATION,1890.0,provision of legal services to the building an...,Construction,Construction
82,BCA000ETT20300019,PROVISION OF LEGAL SERVICES TO THE BUILDING AN...,Building and Construction Authority,17/9/2020,Awarded by Items,DENTONS RODYK & DAVIDSON LLP,4910.0,provision of legal services to the building an...,Construction,Construction
83,BCA000ETT20300019,PROVISION OF LEGAL SERVICES TO THE BUILDING AN...,Building and Construction Authority,17/9/2020,Awarded by Items,DONALDSON & BURKINSHAW LLP,2000.0,provision of legal services to the building an...,Construction,Construction


In [18]:
non_other_data.to_csv('predictedcategory.csv', index=False)

# Reference

In [46]:

agency_map = {
    # Ministries
    "Ministry Of Defence": "MINDEF",
    "Ministry Of Education": "MOE",
    "Ministry Of Finance": "MOF",
    "Ministry Of Foreign Affairs": "MFA",
    "Ministry Of Health": "MOH",
    "Ministry Of Home Affairs": "MHA",
    "Ministry Of Law": "MinLaw",
    "Ministry Of Manpower": "MOM",
    "Ministry Of National Development": "MND",
    "Ministry Of Social And Family Development": "MSF",
    "Ministry Of Sustainability And The Environment": "MSE",
    "Ministry Of Trade & Industry": "MTI",
    "Ministry Of Transport": "MOT",
    "Ministry Of Communications And Information": "MCI",

    # Statutory board
    "Accounting And Corporate Regulatory Authority": "ACRA",
    "Building And Construction Authority": "BCA",
    "Board Of Architects": "BOA",
    "Civil Aviation Authority Of Singapore": "CAAS",
    "Competition And Consumer Commission Of Singapore (Cccs)": "CCCS",
    "Council For Estate Agencies": "CEA",
    "Defence Science And Technology Agency": "DSTA",
    "Economic Development Board": "EDB",
    "Energy Market Authority Of Singapore": "EMA",
    "Enterprise Singapore": "ESG",
    "Gambling Regulatory Authority Of Singapore (Gra)": "GRA",
    "Health Promotion Board": "HPB",
    "Health Sciences Authority": "HSA",
    "Home Team Science And Technology Agency": "HTX",
    "Housing And Development Board": "HDB",
    "Infocomm Media Development Authority": "IMDA",
    "Inland Revenue Authority Of Singapore": "IRAS",
    "Intellectual Property Office Of Singapore": "IPOS",
    "Jurong Town Corporation": "JTC",
    "Land Transport Authority": "LTA",
    "Majlis Ugama Islam Singapura": "MUIS",
    "Maritime And Port Authority Of Singapore": "MPA",
    "National Environment Agency": "NEA",
    "National Heritage Board": "NHB",
    "National Library Board": "NLB",
    "National Parks Board": "NPARKS",
    "People'S Association": "PA",
    "Professional Engineers Board": "PEB",
    "Public Utilities Board": "PUB",
    "Public Transport Council": "PTC",
    "Science Centre Board": "Science Centre",   # <- not a strict acronym
    "Sentosa Development Corporation": "SDC",
    "Singapore Civil Defence Force": "SCDF",
    "Singapore Examinations And Assessment Board": "SEAB",
    "Singapore Food Agency": "SFA",
    "Singapore Land Authority": "SLA",
    "Singapore Labour Foundation": "SLF",
    "Singapore Medical Council": "SMC",
    "Singapore Nursing Board": "SNB",
    "Singapore Police Force": "SPF",
    "Singapore Prison Service": "SPS",
    "Singapore Tourism Board": "STB",
    "Skillsfuture Singapore": "SSG",
    "Urban Redevelopment Authority": "URA",
    "Workforce Singapore": "WSG",
    "Yellow Ribbon Singapore": "YRSG",
    "Civil Service College": "CSC",
    "Iseas - Yusof Ishak Institute": "ISEAS",

    #Councils
    "National Arts Council": "NAC",
    "National Council Of Social Service": "NCSS",
    "National Youth Council": "NYC",

    #IHLs
    "Institute Of Technical Education": "ITE",
    "Temasek Polytechnic": "TP",
    "Nanyang Polytechnic": "NYP",
    "Ngee Ann Polytechnic": "NP",
    "Republic Polytechnic": "RP",
    "Singapore Polytechnic": "SP",
    "National University Of Singapore": "NUS",
    "Nanyang Technological University": "NTU",
    "Singapore Management University": "SMU",
    "Singapore Institute Of Technology": "SIT",
    "Singapore University Of Technology And Design": "SUTD"
}