# Malaysia Job Market Analysis 2024
## Part 1: Data Cleaning & Feature Engineering

**Dataset:** 69,024 JobStreet Malaysia job postings  
**Goal:** Clean raw data, extract salary features, and prepare dataset for SQL analysis  
**Tools:** Python, Pandas, Regex  
**Author:** Priyanshu Singh  
**GitHub:** [malaysia-job-market-analysis](https://github.com/maanajipriyanshu/malaysia-job-market-analysis)

## 1. Import Libraries
Importing all required libraries for data loading, cleaning, and feature engineering.

In [31]:
import pandas as pd
import numpy as np
import re
import ast

## 2. Load Dataset
Loading the raw JobStreet Malaysia dataset and taking a first look at its shape, columns, and data types.

In [32]:
df = pd.read_csv('../data/raw/jobstreet_all_job_dataset.csv')
print(df.shape)
print('\nColumns:', df.columns.tolist())
print('\nData Types:')
print(df.dtypes)
print('\nFirst 5 rows:')
df.head()

(69024, 11)

Columns: ['job_id', 'job_title', 'company', 'descriptions', 'location', 'category', 'subcategory', 'role', 'type', 'salary', 'listingDate']

Data Types:
job_id          float64
job_title        object
company          object
descriptions     object
location         object
category         object
subcategory      object
role             object
type             object
salary           object
listingDate      object
dtype: object

First 5 rows:


| | job_id | job_title | company | descriptions | location | category | subcategory | role | type | salary | listingDate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 74630583.0 | Procurement Executive (Contract) | Coca-Cola Bottlers (Malaysia) Sdn Bhd | Position Purpose\nManage aspects of procuremen... | Negeri Sembilan | Manufacturing, Transport & Logistics | Purchasing, Procurement & Inventory | procurement-executive | Contract/Temp | | 2024-03-21T05:58:35Z |
| 1 | 74660602.0 | Account Executive/ Assistant | Acoustic & Lighting System Sdn Bhd | We are looking for a Account Executive/ Assist... | Petaling | Accounting | Bookkeeping & Small Practice Accounting | executive-assistant | Full time | RM 2,800 – RM 3,200 per month | 2024-03-22T06:52:57Z |
| 2 | 74655679.0 | Data Analyst - Asset Management, SPX Express | Shopee Mobile Malaysia Sdn Bhd | Performs detailed data analysis on existing sp... | Klang District | Manufacturing, Transport & Logistics | Analysis & Reporting | asset-management-analyst | Full time | | 2024-03-22T04:22:43Z |
| 3 | 74657624.0 | Service Engineer | Sun Medical Systems Sdn Bhd | You are important for troubleshooting, install... | Petaling | Engineering | Electrical/Electronic Engineering | services-engineer | Full time | RM 3,000 – RM 3,500 per month | 2024-03-22T05:32:09Z |
| 4 | 74679363.0 | Purchasing Executive | Magnet Security & Automation Sdn. Bhd. | MAG is a trailblazer in the industry, boasting... | Hulu Langat | Manufacturing, Transport & Logistics | Purchasing, Procurement & Inventory | purchasing-executive | Full time | RM 2,800 – RM 3,500 per month | 2024-03-23T03:56:39Z |


## 3. Missing Value Analysis
Identifying which columns have null values to understand data quality issues before cleaning.

In [33]:
# Missing values per column
print('Missing values per column:')
print(df.isnull().sum())

# Salary availability
print(f'\n{df["salary"].isna().sum()} jobs have NO salary listed')
print(f'{df["salary"].notna().sum()} jobs HAVE salary listed')
print(f'\nSalary transparency rate: {round(df["salary"].notna().sum() / len(df) * 100, 1)}%')

Missing values per column:
job_id              0
job_title           0
company             0
descriptions        0
location            0
category            0
subcategory         0
role             2252
type                0
salary          37430
listingDate         0
dtype: int64

37430 jobs have NO salary listed
31594 jobs HAVE salary listed

Salary transparency rate: 45.8%


### Finding
- `salary` has **37,430 missing values (54%)**: most postings do not publicly list a salary
- `role` has 2,252 missing values (about 3%), which is minor and manageable
- All other columns are complete

> **Insight:** The 54% salary gap is itself a key market insight: salary transparency is low in Malaysia's job market.
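
The percentages quoted above follow directly from the counts in the cell output. As a quick check of the arithmetic, here is a plain-Python sketch that uses only the figures printed above (the variable names are illustrative, not from the notebook):

```python
# Counts copied from the missing-value output above
total_postings = 69024
salary_missing = 37430
role_missing = 2252

salary_listed = total_postings - salary_missing
missing_rate = salary_missing / total_postings * 100
transparency_rate = salary_listed / total_postings * 100
role_missing_rate = role_missing / total_postings * 100

print(f"Salary listed: {salary_listed}")
print(f"Salary missing: {missing_rate:.1f}%")
print(f"Salary transparency: {transparency_rate:.1f}%")
print(f"Role missing: {role_missing_rate:.1f}%")
```

The two salary rates sum to 100%, so either one can be reported; the notebook uses the missing share (54%) in its findings and the listed share (45.8%) as the transparency rate.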

## 4. Job Type Distribution
Understanding how job types are stored in the raw data before cleaning.

In [34]:
print('Raw job type values:')
print(df['type'].value_counts())

Raw job type values:
type
Full time                                      53790
['Full time']                                   9030
Contract/Temp                                   4976
['Contract/Temp']                                584
Part time                                        289
Casual/Vacation                                  251
['Part time']                                     78
['Casual/Vacation']                               22
['Contract/Temp', 'Full time']                     2
['Full time', 'Part time']                         1
['Contract/Temp', 'Full time', 'Part time']        1
Name: count, dtype: int64


### Finding
The `type` column stores the same 4 job types in 11 inconsistent variations:
- `Full time` and `['Full time']` are the same thing stored differently
- Some entries are stored as Python list strings, and a few of those list more than one type

This is a classic real-world data quality issue that needs standardization.

## 5. Job Type Cleaning
Standardizing the 11 inconsistent job type variations into 4 clean categories by parsing the list-like strings with `ast.literal_eval`. Entries that list several types keep only the first one.

In [35]:
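def clean_type(val):
    """Unwrap list-like strings such as "['Full time']" into a plain label."""
    try:
        parsed = ast.literal_eval(val)
    except (ValueError, SyntaxError):
        # Plain labels such as 'Full time' are not Python literals; keep as-is
        return val
    if isinstance(parsed, list) and parsed:
        # Multi-type listings keep only their first entry
        return parsed[0]
    return val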

df['type_clean'] = df['type'].apply(clean_type)

print(df['type_clean'].value_counts())

type_clean
Full time          62821
Contract/Temp       5563
Part time            367
Casual/Vacation      273
Name: count, dtype: int64


### Finding
Successfully consolidated 11 variations into 4 clean job types:
- **Full time: 62,821 (91%)**, the overwhelming majority
- Contract/Temp: 5,563
- Part time: 367
- Casual/Vacation: 273

> **Insight:** About 9 out of 10 Malaysian job postings are full time, which suggests relatively stable employment prospects for job seekers.
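
The consolidated counts can be reproduced from the raw `value_counts()` output in section 4. The sketch below re-implements the unwrapping step on plain strings (no pandas) and tallies the 11 raw variations. `unwrap_type` is an illustrative name, not a function from the notebook.

```python
import ast
from collections import Counter

# Raw counts copied from the value_counts() output in section 4
raw_counts = {
    "Full time": 53790,
    "['Full time']": 9030,
    "Contract/Temp": 4976,
    "['Contract/Temp']": 584,
    "Part time": 289,
    "Casual/Vacation": 251,
    "['Part time']": 78,
    "['Casual/Vacation']": 22,
    "['Contract/Temp', 'Full time']": 2,
    "['Full time', 'Part time']": 1,
    "['Contract/Temp', 'Full time', 'Part time']": 1,
}

def unwrap_type(value):
    """Return the first label of a list-like string, else the value itself."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # plain label such as 'Full time'
    if isinstance(parsed, list) and parsed:
        return parsed[0]
    return value

clean_counts = Counter()
for raw_value, count in raw_counts.items():
    clean_counts[unwrap_type(raw_value)] += count

for label, count in clean_counts.most_common():
    print(f"{label:<16} {count}")
```

Note that the 4 postings listing more than one type are counted under their first listed type only, so a posting stored as `['Contract/Temp', 'Full time']` ends up as `Contract/Temp`. With so few such rows the effect on the totals is negligible.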

## 6. Salary Feature Engineering
The salary column contains unstructured text like `RM 2,800 – RM 3,200 per month`.  
Using Regex to extract numeric min, max, and average salary values from 31,594 records.

In [36]:
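def extract_salary(val):
    """Return (min, max) parsed from a salary string; (None, None) if missing."""
    if pd.isna(val):
        return None, None

    # Drop thousands separators, then collect every run of digits
    numbers = re.findall(r'\d+', val.replace(',', ''))
    # Keep values with 3+ digits only; this skips stray small numbers
    # (and would also skip any pay rate below RM 100)
    numbers = [int(n) for n in numbers if len(n) >= 3]

    if len(numbers) >= 2:
        return numbers[0], numbers[1]
    elif len(numbers) == 1:
        return numbers[0], numbers[0]
    return None, None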

df['salary_min'], df['salary_max'] = zip(*df['salary'].apply(extract_salary))
df['salary_avg'] = (df['salary_min'] + df['salary_max']) / 2
print('Sample salary extraction:')
print(df[['salary', 'salary_min', 'salary_max', 'salary_avg']].dropna().head(10))
print("\nAverage salary across all jobs with salary listed:")
print(f"RM {df['salary_avg'].mean():,.0f} per month")

Sample salary extraction:
                           salary  salary_min  salary_max  salary_avg
1   RM 2,800 – RM 3,200 per month      2800.0      3200.0      3000.0
3   RM 3,000 – RM 3,500 per month      3000.0      3500.0      3250.0
4   RM 2,800 – RM 3,500 per month      2800.0      3500.0      3150.0
6   RM 3,000 – RM 4,500 per month      3000.0      4500.0      3750.0
9   RM 2,000 – RM 3,000 per month      2000.0      3000.0      2500.0
10  RM 3,000 – RM 3,200 per month      3000.0      3200.0      3100.0
12  RM 1,600 – RM 2,000 per month      1600.0      2000.0      1800.0
14  RM 3,500 – RM 4,000 per month      3500.0      4000.0      3750.0
15  RM 3,000 – RM 4,000 per month      3000.0      4000.0      3500.0
18  RM 3,500 – RM 5,000 per month      3500.0      5000.0      4250.0

Average salary across all jobs with salary listed:
RM 4,777 per month


### Finding
Successfully extracted salary data from **31,594 records** using Regex.  
- Created 3 new numeric columns: `salary_min`, `salary_max`, `salary_avg`
- **Average advertised salary: RM 4,777/month** across all job postings with salary data (mean of `salary_avg`, treating every listed figure as a monthly amount)

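
The extraction above keeps only the numbers and does not look at the pay period. As a hedged sketch of how the period could be captured alongside the range, the function below also reads the `per ...` suffix. `parse_salary` is a hypothetical helper, and the existence of non-monthly or decimal salary strings in this dataset is an assumption that is not verified here.

```python
import re

# A number with optional thousands separators and optional decimals
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")
# Pay period suffix; the set of periods is an assumption, not taken from the data
PERIOD = re.compile(r"per\s+(hour|day|week|month|year)", re.IGNORECASE)

def parse_salary(text):
    """Return (low, high, period) from a salary string, or Nones if unparseable."""
    if not isinstance(text, str):
        return None, None, None  # covers NaN / missing values
    amounts = [float(m.replace(",", "")) for m in NUMBER.findall(text)]
    period_match = PERIOD.search(text)
    period = period_match.group(1).lower() if period_match else None
    if not amounts:
        return None, None, period
    low = amounts[0]
    high = amounts[1] if len(amounts) > 1 else amounts[0]
    return low, high, period

# String format taken from the sample output above
print(parse_salary("RM 2,800 – RM 3,200 per month"))
# Hypothetical hourly string, for illustration only
print(parse_salary("RM 25 per hour"))
print(parse_salary(None))
```

With the period available, a monthly average could be restricted to rows where `period == "month"`, which would avoid mixing hourly or annual figures into a per-month number if such rows exist.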
## 7. Exploratory Analysis
Quick exploration of salary patterns by category and location before saving.

In [37]:
# Average salary by category ('count' counts only postings with a parsed salary)
salary_by_category = df.groupby('category')['salary_avg'].agg(['mean', 'count']).reset_index()
salary_by_category.columns = ['category', 'avg_salary', 'job_count']
salary_by_category = salary_by_category[salary_by_category['job_count'] >= 50]
salary_by_category = salary_by_category.sort_values('avg_salary', ascending=False)

print('Average salary by industry:')
print(salary_by_category.to_string(index=False))

Average salary by industry:
                              category   avg_salary  job_count
              CEO & General Management 22741.101695         59
                Real Estate & Property  6992.121069        318
Information & Communication Technology  6422.375369       2708
                 Consulting & Strategy  5844.116883         77
          Banking & Financial Services  5784.311966        702
                                 Sales  5708.160523       2984
                  Science & Technology  5616.800000        195
                          Construction  5424.653892        835
                           Engineering  5126.113754       2712
                  Healthcare & Medical  4996.263303        545
         Human Resources & Recruitment  4891.383681       1728
                                 Legal  4794.037143        175
                            Accounting  4670.759358       5637
            Mining, Resources & Energy  4658.125000         56
            Marketing & Com

In [38]:
# Average salary by top 10 locations
top_locations = df['location'].value_counts().head(10).index
salary_by_location = df[df['location'].isin(top_locations)].groupby('location')['salary_avg'].mean().sort_values(ascending=False)

print('Average salary by location:')
print(salary_by_location.round(0))

Average salary by location:
location
Penang                      7216.0
Kuala Lumpur City Centre    6217.0
Kuala Lumpur                5468.0
Selangor                    4971.0
Johor Bahru District        4701.0
Petaling                    4612.0
Seberang Perai              4510.0
Klang District              4422.0
Penang Island               4404.0
Shah Alam/Subang            4324.0
Name: salary_avg, dtype: float64


In [39]:
# top 20 companies hiring the most
top_companies = df['company'].value_counts().head(20)
print(top_companies)

# ICT category: posting volume, average salary, and top hiring companies
ict_jobs = df[df['category'] == 'Information & Communication Technology']
print(f"\nTotal ICT jobs: {len(ict_jobs)}")
print(f"Average ICT salary: RM {ict_jobs['salary_avg'].mean():,.0f}")
print("\nTop companies hiring ICT roles:")
print(ict_jobs['company'].value_counts().head(15))

company
Private Advertiser                                    2230
AGENSI PEKERJAAN JS STAFFING SERVICES SDN BHD          288
Agensi Pekerjaan PERSOLKELLY Malaysia Sdn Bhd          239
Agensi Pekerjaan Hays (Malaysia) Sdn Bhd               238
RHB Banking Group                                      228
Michael Page International (Malaysia) Sdn Bhd          223
Intel Technology Sdn. Bhd.                             218
DKSH Malaysia Sdn Bhd                                  194
PERSOLKELLY Workforce Solutions Malaysia Sdn Bhd       191
Sunway Berhad                                          174
Standard Chartered Bank                                166
AmBank Group                                           162
SEEK                                                   148
Ambition Group Malaysia Sdn Bhd                        134
Malayan Banking Berhad (Maybank)                       133
Huawei Technologies (Malaysia) Sdn. Bhd                130
Flash Express                                   

In [40]:
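# Top hiring companies (agencies removed)
# Keywords are matched as case-insensitive substrings, so broad terms such as
# 'consulting' or 'talent' may also drop genuine employers.
# 'private' already covers 'private advertiser'; 'staffing' covers 'sdn bhd staffing'.
agency_keywords = ['agensi', 'staffing', 'recruitment', 'michael page',
                   'hays', 'persolkelly', 'manpower', 'ambition',
                   'seek', 'private', 'tribehired', 'pekerjaan', 'outsource',
                   'consulting', 'headhunt', 'talent', 'hr solutions',
                   'workforce', 'executive search', 'placement']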

mask = ~df['company'].str.contains('|'.join(agency_keywords), case=False, na=False)
df_companies = df[mask]

print('Top 20 real employers:')
print(df_companies['company'].value_counts().head(20))

Top 20 real employers:
company
RHB Banking Group                                     228
Intel Technology Sdn. Bhd.                            218
DKSH Malaysia Sdn Bhd                                 194
Sunway Berhad                                         174
Standard Chartered Bank                               166
AmBank Group                                          162
Malayan Banking Berhad (Maybank)                      133
Huawei Technologies (Malaysia) Sdn. Bhd               130
Flash Express                                         130
China Communications Construction (ECRL) Sdn. Bhd.    128
ExxonMobil Malaysia                                   125
CBRE                                                  118
TDCX Malaysia                                         118
Teleperformance Malaysia Sdn Bhd                      118
MumsMe Sdn Bhd                                        115
Marriott International                                112
Shopee Mobile Malaysia Sdn Bhd           
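
Because the filter above matches keywords as substrings, a short keyword such as `seek` can match inside longer words. One possible refinement, sketched here with the standard `re` module and a shortened keyword list, is to escape each keyword and require word boundaries. `is_agency` is an illustrative helper, and the last company name is hypothetical.

```python
import re

# A shortened keyword list for illustration; the notebook's list is longer
agency_keywords = ["agensi", "staffing", "michael page", "hays", "seek",
                   "private advertiser"]

# re.escape guards against regex metacharacters; \b restricts matches to whole words
agency_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in agency_keywords) + r")\b",
    re.IGNORECASE,
)

def is_agency(company):
    """True if any keyword appears in the company name as a whole word."""
    return bool(agency_pattern.search(company))

companies = [
    "Agensi Pekerjaan Hays (Malaysia) Sdn Bhd",  # from the output above
    "SEEK",                                      # from the output above
    "RHB Banking Group",                         # from the output above
    "Seekers Cafe Sdn Bhd",                      # hypothetical name, not in the dataset
]
for name in companies:
    print(f"{name}: {'agency' if is_agency(name) else 'employer'}")
```

The same pattern string could be passed to `str.contains` in pandas in place of the plain `'|'.join(...)`. Whether whole-word matching removes or keeps the right companies in this dataset would still need a manual check of the filtered names.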

### Finding
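- **Postings labelled `Penang` pay 32% more on average than those labelled `Kuala Lumpur`** (RM 7,216 vs RM 5,468/month), a non-obvious result. Location labels overlap, though: `Penang Island` averages RM 4,404 and `Kuala Lumpur City Centre` RM 6,217, so the gap depends on which labels are compared.
- **ICT is the #3 highest-paying category** (among categories with at least 50 salaried postings) at RM 6,422/month, with 8,675 openings, 2,708 of which list a salary
- **Accounting has the most jobs** (11,308) but ranks #13 in salary: high volume, average pay
- Top real employers: RHB, Intel, Maybank, Huawei, Shopee, ExxonMobil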
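
The 32% figure is a relative difference between two averages from the location output. A small sketch of that arithmetic, using only the printed values (`pct_more` is an illustrative helper):

```python
# Average salaries (RM/month) copied from the location output above
avg_salary = {
    "Penang": 7216,
    "Kuala Lumpur City Centre": 6217,
    "Kuala Lumpur": 5468,
}

def pct_more(a, b):
    """Percentage by which a exceeds b."""
    return (a - b) / b * 100

penang_vs_kl = pct_more(avg_salary["Penang"], avg_salary["Kuala Lumpur"])
penang_vs_klcc = pct_more(avg_salary["Penang"], avg_salary["Kuala Lumpur City Centre"])

print(f"Penang vs Kuala Lumpur: {penang_vs_kl:.0f}%")
print(f"Penang vs Kuala Lumpur City Centre: {penang_vs_klcc:.0f}%")
```

Against the `Kuala Lumpur City Centre` label the gap is roughly half as large, which is why the comparison is sensitive to how location labels are grouped.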

## 8. Save Cleaned Dataset
Saving the cleaned dataset as a CSV for the SQL analysis. The cell below writes only the full dataset; the agency-filtered `df_companies` frame used for company-level analysis is not saved here.

In [42]:

df_clean = df[['job_id', 'job_title', 'company', 'location', 'category', 
               'subcategory', 'role', 'type_clean', 'salary_min', 
               'salary_max', 'salary_avg', 'listingDate']].copy()

df_clean = df_clean.rename(columns={'type_clean': 'job_type'})

# convert date properly
df_clean['listingDate'] = pd.to_datetime(df_clean['listingDate'])
df_clean['month'] = df_clean['listingDate'].dt.month
df_clean['month_name'] = df_clean['listingDate'].dt.strftime('%B')

df_clean.to_csv('../data/cleaned/jobstreet_cleaned.csv', index=False)
print(f"Saved! Shape: {df_clean.shape}")
print(df_clean.dtypes)

Saved! Shape: (69024, 14)
job_id                     float64
job_title                   object
company                     object
location                    object
category                    object
subcategory                 object
role                        object
job_type                    object
salary_min                 float64
salary_max                 float64
salary_avg                 float64
listingDate    datetime64[ns, UTC]
month                        int32
month_name                  object
dtype: object
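
The `listingDate` strings end in `Z`, so they are UTC timestamps (the dtype above confirms this), and `month` is therefore the UTC month. The sketch below uses the standard library to show the parsing step and how a posting close to a month boundary would land in a different month in Malaysia time (UTC+8). `parse_listing_date` is an illustrative helper and the second timestamp is hypothetical; whether any real postings fall in that window is not checked here.

```python
from datetime import datetime, timedelta, timezone

MYT = timezone(timedelta(hours=8), name="MYT")  # Malaysia Time, UTC+8

def parse_listing_date(text):
    """Parse a timestamp like '2024-03-21T05:58:35Z' as timezone-aware UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

# Timestamp from row 0 of the dataset preview
first = parse_listing_date("2024-03-21T05:58:35Z")
print(first.month, first.strftime("%B"))

# Hypothetical timestamp late on the last day of March (UTC)
edge = parse_listing_date("2024-03-31T18:00:00Z")
local = edge.astimezone(MYT)
print(edge.month, local.month)
```

If monthly trends are reported in local terms, converting to Malaysia time before extracting the month would be the safer choice; for coarse month-level counts the difference is likely small.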


## Summary

### Data Cleaning Complete 

| Step | Action | Result |
|------|--------|--------|
| Missing values | Identified nulls in salary and role | 54% salary missing -> market insight |
| Job type | Standardized 11 variations -> 4 categories | 91% full-time jobs |
| Salary | Extracted min/max/avg using Regex | RM 4,777/month average |
| Companies | Removed recruitment agencies | Clean employer list |
| Dates | Parsed to datetime, extracted month | Ready for time analysis |

### Key Insights Found
- **RM 4,777/month** - average advertised salary across postings with salary data
- **54%** of postings don't list a salary publicly
- **91%** of jobs are full time
- **Penang** pays 32% more than Kuala Lumpur
- **ICT** is #3 highest paying with 8,675 openings

### Next Step
Proceed to `02_sql_analysis.ipynb` for deeper analysis using SQL CTEs and window functions.