# **STINTSY Machine Project**

## Section 1. Introduction

Each group should select one real-world dataset from the list of datasets provided for the project. Each dataset is accompanied by a description file, which also contains a detailed description of each feature.

The target task (i.e., classification or regression) should be properly stated as well.

This project utilizes the **Labor Force Survey (LFS), April 2016**, a nationwide household survey conducted quarterly to collect data on demographic and socio-economic characteristics of the population. The dataset provides statistics on levels and trends of employment, unemployment, and underemployment for the country as a whole.

The objective of this study is to develop a **classification model** that predicts an individual's employment status based on various socio-economic and demographic factors. The target variable, **New Employment Criteria**, categorizes individuals into three groups: Employed, Unemployed, and Not in the Labor Force. Several features, including age, sex, education level, occupation, and work history, are considered to determine their relationship with employment status.

To achieve this, we explore multiple machine learning models—**Logistic Regression (Multinomial)**, **Neural Networks**, and **Naïve Bayes**—to classify individuals based on their employment status. By evaluating the performance of these models, we aim to identify the most effective approach for employment prediction, which could aid in labor policy formulation and workforce development initiatives.

## Section 2. Description of the dataset

In this section of the notebook, you must fulfill the following:

- State a brief description of the dataset.
- Provide a description of the collection process executed to build the dataset. Discuss the implications of the data collection method on the generated conclusions and insights. Note that you may need to look at relevant sources related to the dataset to acquire necessary information for this part of the project.
- Describe the structure of the dataset file.
- What does each row and column represent?
- How many instances are there in the dataset?
- How many features are there in the dataset?
- If the dataset is composed of different files that you will combine in the succeeding steps, describe the structure and the contents of each file.
- Discuss the features in each dataset file. What does each feature represent? All features, even those which are not used for the study, should be described to the reader. The purpose of each feature in the dataset should be clear to the reader of the notebook without having to go through an external link.


### Brief Description of the Dataset

The dataset originates from the Labor Force Survey (LFS) conducted by the Philippine Statistics Authority (PSA) in April 2016. The LFS is a nationwide household survey conducted quarterly to gather data on the demographic and socio-economic characteristics of the population. Its primary objective is to estimate employment, unemployment, and underemployment levels in the country. 

### Data Collection Process

The dataset was collected through a sample survey method. The survey involved 42,768 sample households (or 42,576 households excluding Batanes) selected to provide precise and reliable labor force estimates at the national and regional levels. The data collection focused on private households, excluding institutional populations.

Supervision of the data collection process was rigorous, involving Regional Directors (RDs), Provincial Statistics Officers (PSOs), and field supervisors. The process included:

1. Observation of interviews to ensure data quality.

2. Review of accomplished questionnaires for completeness and consistency.

3. Discussions with interviewers to correct errors and refine data collection techniques.

The supervisors conducted spot-checks, re-interviews, and field verifications to ensure the accuracy of the reported information. Findings and errors were documented and submitted to the central office for further validation. This methodology ensured the reliability of employment and labor market statistics derived from the survey.

### Structure of the Dataset

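In the dataset, each row represents an individual household member, and each column corresponds to a specific feature of that person. The raw file includes members of all ages; the study itself is restricted to persons aged 15 years and over, a filter applied during cleaning.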

- Number of Instances (Rows): 180862

- Number of Features (Columns): 50

## Section 3. List of requirements

List all the Python libraries and modules that you used.

In [387]:
import pandas as pd
import numpy as np

## Section 4. Data preprocessing and cleaning

Perform necessary steps before using the data. In this section of the notebook, please take note of the following:
- If needed, perform preprocessing techniques to transform the data to the appropriate representation. This may include binning, log transformations, conversion to one-hot encoding, normalization, standardization, interpolation, truncation, and feature engineering, among others. There should be a correct and proper justification for the use of each preprocessing technique used in the project.
- Make sure that the data is clean, especially features that are used in the project. This may include checking for misrepresentations, checking the data type, dealing with missing data, dealing with duplicate data, and dealing with outliers, among others. There should be a correct and proper justification for the application (or non-application) of each data cleaning method used in the project. Clean only the variables utilized in the study.


### Loading the dataset

In [388]:
df = pd.read_csv("LFS PUF April 2016.csv") 

In [389]:
display(df)

Unnamed: 0,PUFREG,PUFPRV,PUFPRRCD,PUFHHNUM,PUFURB2K10,PUFPWGTFIN,PUFSVYMO,PUFSVYYR,PUFPSU,PUFRPL,...,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR,PUFC43_QKB,PUFNEWEMPSTAT
0,1,28,2800,1,2,405.2219,4,2016,217,1,...,,,,,,,,1,01,1
1,1,28,2800,1,2,388.8280,4,2016,217,1,...,,,,,,,,1,01,1
2,1,28,2800,1,2,406.1194,4,2016,217,1,...,,,,,,,,1,01,1
3,1,28,2800,2,2,405.2219,4,2016,217,1,...,,,,,,,,1,01,1
4,1,28,2800,2,2,384.3556,4,2016,217,1,...,,,,,,,,1,96,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180857,17,59,5900,40880,2,239.4341,4,2016,258,1,...,,,,,,,,1,50,1
180858,17,59,5900,40880,2,189.8885,4,2016,258,1,...,,8,,,,2,,,,3
180859,17,59,5900,40880,2,207.7395,4,2016,258,1,...,,,,,,,,,,
180860,17,59,5900,40880,2,207.7395,4,2016,258,1,...,,,,,,,,,,


### Dropping Unwanted Columns

In [390]:
features = [
    "PUFC04_SEX", "PUFC05_AGE", "PUFC06_MSTAT", "PUFC07_GRADE", "PUFC09_GRADTECH", 
    "PUFC11_WORK", "PUFC12_JOB", "PUFC14_PROCC", "PUFC17_NATEM", "PUFC18_PNWHRS", 
    "PUFC19_PHOURS", "PUFC23_PCLASS", "PUFC25_PBASIC", "PUFC30_LOOKW", "PUFC32_JOBSM", 
    "PUFC33_WEEKS", "PUFC34_WYNOT", "PUFC35_LTLOOKW", "PUFC36_AVAIL", "PUFC37_WILLING", 
    "PUFC38_PREVJOB", "PUFC40_POCC", "PUFC41_WQTR"
]

df = df[features].copy()  # explicit copy so later in-place edits do not raise SettingWithCopyWarning

### Initial Exploration

In [391]:
display(df)

Unnamed: 0,PUFC04_SEX,PUFC05_AGE,PUFC06_MSTAT,PUFC07_GRADE,PUFC09_GRADTECH,PUFC11_WORK,PUFC12_JOB,PUFC14_PROCC,PUFC17_NATEM,PUFC18_PNWHRS,...,PUFC30_LOOKW,PUFC32_JOBSM,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR
0,1,49,2,350,2,1,,61,1,08,...,,,,,,,,,,1
1,2,61,2,350,2,1,,92,2,04,...,,,,,,,,,,1
2,1,19,1,350,2,1,,92,2,08,...,,,,,,,,,,1
3,1,48,2,320,2,1,,61,1,04,...,,,,,,,,,,1
4,2,41,2,350,2,1,,91,1,12,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180857,1,29,2,350,2,1,,13,1,08,...,,,,,,,,,,1
180858,2,29,2,830,2,2,2,,,,...,2,,,8,,,,2,,
180859,2,4,,,,,,,,,...,,,,,,,,,,
180860,2,2,,,,,,,,,...,,,,,,,,,,


### Check Duplicate Rows

In [392]:
df.duplicated().sum()

np.int64(86345)

In [393]:
# Check size before dropping duplicates
print("Before:", df.shape)

Before: (180862, 23)


In [394]:
df.drop_duplicates(inplace=True)

In [395]:
# Check size after dropping duplicates
print("After:", df.shape)

After: (94517, 23)
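One caveat is worth checking here: `duplicated()` is evaluated after the identifier columns (such as `PUFHHNUM`) were dropped, so two different respondents who gave identical answers on the 23 retained features are counted as duplicates. The sketch below uses made-up values (only the column names come from the dataset) to show the effect; whether removing such rows is appropriate depends on the modeling goal.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 1 are different households
# whose members happen to share the same sex and age.
toy = pd.DataFrame({
    "PUFHHNUM":   [1, 2, 2],   # household identifier
    "PUFC04_SEX": [1, 1, 2],
    "PUFC05_AGE": [30, 30, 41],
})

# With the identifier present, no row duplicates another.
dups_with_id = int(toy.duplicated().sum())

# After keeping only the feature columns, rows 0 and 1 collide.
features_only = toy[["PUFC04_SEX", "PUFC05_AGE"]]
dups_without_id = int(features_only.duplicated().sum())

print(dups_with_id, dups_without_id)
```

If distinct respondents should be preserved, one option is to run the duplicate check before dropping the identifier columns.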


### Check Missing Values

In [396]:
df.replace("", np.nan, inplace=True)  # Convert empty strings to NaN
df.replace(" ", np.nan, inplace=True)  # Convert whitespace to NaN

print(df.isnull().sum())

PUFC04_SEX             0
PUFC05_AGE             0
PUFC06_MSTAT          10
PUFC07_GRADE           0
PUFC09_GRADTECH      516
PUFC11_WORK         1770
PUFC12_JOB         69455
PUFC14_PROCC           0
PUFC17_NATEM       26099
PUFC18_PNWHRS          0
PUFC19_PHOURS          0
PUFC23_PCLASS      26099
PUFC25_PBASIC          0
PUFC30_LOOKW       70694
PUFC32_JOBSM       92256
PUFC33_WEEKS           0
PUFC34_WYNOT       72955
PUFC35_LTLOOKW     92990
PUFC36_AVAIL       88701
PUFC37_WILLING     88701
PUFC38_PREVJOB     70694
PUFC40_POCC            0
PUFC41_WQTR         7014
dtype: int64
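Note that the two `replace` calls above match only the empty string and a single space. Columns such as `PUFC14_PROCC` and `PUFC18_PNWHRS` report zero missing values here, yet their `unique()` output later in this section contains two-space blanks. A hedged alternative, sketched on toy data, is to match any whitespace-only string with a regular expression:

```python
import numpy as np
import pandas as pd

# Toy column mixing real codes with blanks of different widths.
toy = pd.DataFrame({"code": ["61", "", " ", "  ", "92"]})

# Exact-match replacement: catches "" and " " but misses the two-space blank.
exact = toy.replace(["", " "], np.nan)

# Regex replacement: any string made up only of whitespace becomes NaN.
by_pattern = toy.replace(r"^\s*$", np.nan, regex=True)

missing_exact = int(exact["code"].isna().sum())
missing_regex = int(by_pattern["code"].isna().sum())
print(missing_exact, missing_regex)
```

Running the regex version before `isnull().sum()` would make the missing-value counts reflect every blank field, regardless of its width.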


### Check Data Types

In [397]:
print(df.dtypes)

PUFC04_SEX          int64
PUFC05_AGE          int64
PUFC06_MSTAT       object
PUFC07_GRADE       object
PUFC09_GRADTECH    object
PUFC11_WORK        object
PUFC12_JOB         object
PUFC14_PROCC       object
PUFC17_NATEM       object
PUFC18_PNWHRS      object
PUFC19_PHOURS      object
PUFC23_PCLASS      object
PUFC25_PBASIC      object
PUFC30_LOOKW       object
PUFC32_JOBSM       object
PUFC33_WEEKS       object
PUFC34_WYNOT       object
PUFC35_LTLOOKW     object
PUFC36_AVAIL       object
PUFC37_WILLING     object
PUFC38_PREVJOB     object
PUFC40_POCC        object
PUFC41_WQTR        object
dtype: object


### Exploring Each Feature

**PUFC04_SEX** - Sex

In [398]:
print(df['PUFC04_SEX'].isna().sum())

0


In [399]:
df['PUFC04_SEX'].unique() 

array([1, 2])

In the dataset, code 1 represents Male and code 2 represents Female. We use the `map` function to replace these integer codes with their corresponding labels for better readability.

In [400]:
df['PUFC04_SEX'] = df['PUFC04_SEX'].map({1: 'Male', 2: 'Female'}) 
display(df)  

Unnamed: 0,PUFC04_SEX,PUFC05_AGE,PUFC06_MSTAT,PUFC07_GRADE,PUFC09_GRADTECH,PUFC11_WORK,PUFC12_JOB,PUFC14_PROCC,PUFC17_NATEM,PUFC18_PNWHRS,...,PUFC30_LOOKW,PUFC32_JOBSM,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR
0,Male,49,2,350,2,1,,61,1,08,...,,,,,,,,,,1
1,Female,61,2,350,2,1,,92,2,04,...,,,,,,,,,,1
2,Male,19,1,350,2,1,,92,2,08,...,,,,,,,,,,1
3,Male,48,2,320,2,1,,61,1,04,...,,,,,,,,,,1
4,Female,41,2,350,2,1,,91,1,12,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180850,Male,34,2,000,2,1,,62,1,05,...,,,,,,,,,,1
180851,Female,32,2,000,2,1,,61,1,04,...,,,,,,,,,,1
180857,Male,29,2,350,2,1,,13,1,08,...,,,,,,,,,,1
180858,Female,29,2,830,2,2,2,,,,...,2,,,8,,,,2,,


**PUFC05_AGE** - Age

In [401]:
print(df['PUFC05_AGE'].isna().sum())

0


In [402]:
df['PUFC05_AGE'].unique() 

array([49, 61, 19, 48, 41, 20, 15, 59, 11,  2, 51, 26, 23, 71, 54, 27, 46,
       18,  5, 80, 43, 38, 73, 35, 72, 74, 39, 16, 14, 13,  8, 22,  3, 50,
       44, 34, 40,  9, 17, 53, 45, 10,  7, 77, 81, 37, 25,  6, 52, 55, 68,
       56, 30, 95, 70, 32, 65, 62, 36, 92, 33, 28, 24, 21, 29,  1, 89, 31,
        0, 57, 12, 47, 42, 63, 86,  4, 60, 76, 79, 88, 58, 64, 69, 66, 75,
       78, 67, 85, 84, 87, 82, 83, 91, 93, 90, 96, 94, 97, 98, 99])

Since the study focuses on participants aged 15 and over, we remove records of individuals younger than 15 to ensure the dataset aligns with the target population.

In [403]:
df = df[df['PUFC05_AGE'] >= 15].copy()  # copy so later column assignments act on an independent frame
df['PUFC05_AGE'].unique() 

array([49, 61, 19, 48, 41, 20, 15, 59, 51, 26, 23, 71, 54, 27, 46, 18, 80,
       43, 38, 73, 35, 72, 74, 39, 16, 22, 50, 44, 34, 40, 17, 53, 45, 77,
       81, 37, 25, 52, 55, 68, 56, 30, 95, 70, 32, 65, 62, 36, 92, 33, 28,
       24, 21, 29, 89, 31, 57, 47, 42, 63, 86, 60, 76, 79, 88, 58, 64, 69,
       66, 75, 78, 67, 85, 84, 87, 82, 83, 91, 93, 90, 96, 94, 97, 98, 99])

**PUFC06_MSTAT** - Marital Status

In [404]:
print(df['PUFC06_MSTAT'].isna().sum())

0


In [405]:
df['PUFC06_MSTAT'].unique() 

array(['2', '1', '3', '4', '6', '5'], dtype=object)

The dataset stores marital status as string codes ("1" to "6"). For readability, we replace the codes with their corresponding labels.

In [406]:
df['PUFC06_MSTAT'] = df['PUFC06_MSTAT'].map({'1': 'Single', '2': 'Married/Living Together', '3': 'Widowed', '4': 'Divorced/Separated', '5': 'Annulled', '6': 'Unknown'})
display(df)

Unnamed: 0,PUFC04_SEX,PUFC05_AGE,PUFC06_MSTAT,PUFC07_GRADE,PUFC09_GRADTECH,PUFC11_WORK,PUFC12_JOB,PUFC14_PROCC,PUFC17_NATEM,PUFC18_PNWHRS,...,PUFC30_LOOKW,PUFC32_JOBSM,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR
0,Male,49,Married/Living Together,350,2,1,,61,1,08,...,,,,,,,,,,1
1,Female,61,Married/Living Together,350,2,1,,92,2,04,...,,,,,,,,,,1
2,Male,19,Single,350,2,1,,92,2,08,...,,,,,,,,,,1
3,Male,48,Married/Living Together,320,2,1,,61,1,04,...,,,,,,,,,,1
4,Female,41,Married/Living Together,350,2,1,,91,1,12,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180850,Male,34,Married/Living Together,000,2,1,,62,1,05,...,,,,,,,,,,1
180851,Female,32,Married/Living Together,000,2,1,,61,1,04,...,,,,,,,,,,1
180857,Male,29,Married/Living Together,350,2,1,,13,1,08,...,,,,,,,,,,1
180858,Female,29,Married/Living Together,830,2,2,2,,,,...,2,,,8,,,,2,,


**PUFC07_GRADE** - Highest Grade Completed

In [407]:
print(df['PUFC07_GRADE'].isna().sum())

0


In [408]:
df['PUFC07_GRADE'].unique() 

array(['350', '320', '622', '672', '240', '220', '614', '330', '280',
       '632', '900', '820', '589', '572', '250', '830', '810', '634',
       '230', '686', '581', '681', '552', '534', '840', '658', '000',
       '548', '310', '648', '210', '652', '662', '601', '642', '562',
       '685', '631', '684', '340', '584', '621', '410', '010', '260',
       '420', '664', '676', '521', '638', '554', '646', '689', '522',
       '654', '644', '532', '531', '514', '558', '501', '586', '542',
       '576', '544', '585', '564'], dtype=object)

Here, we replace each known grade code with its description. Any code that is not in the mapping is labeled "Unknown".

In [409]:
df['PUFC07_GRADE'] = df['PUFC07_GRADE'].map({
    '000': 'No grade completed',
    '010': 'Preschool',
    '210': 'Grade 1',
    '220': 'Grade 2',
    '230': 'Grade 3',
    '240': 'Grade 4',
    '250': 'Grade 5',
    '260': 'Grade 6',
    '280': 'Elementary Graduate',
    '310': 'High School - First Year',
    '320': 'High School - Second Year',
    '330': 'High School - Third Year',
    '350': 'High School Graduate',
    '410': 'Post Secondary - First Year',
    '420': 'Post Secondary - Second Year',
    '810': 'College - First Year',
    '820': 'College - Second Year',
    '830': 'College - Third Year',
    '840': 'College - Fourth Year',
    '900': 'Post Baccalaureate'
}).fillna('Unknown')

display(df)

Unnamed: 0,PUFC04_SEX,PUFC05_AGE,PUFC06_MSTAT,PUFC07_GRADE,PUFC09_GRADTECH,PUFC11_WORK,PUFC12_JOB,PUFC14_PROCC,PUFC17_NATEM,PUFC18_PNWHRS,...,PUFC30_LOOKW,PUFC32_JOBSM,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR
0,Male,49,Married/Living Together,High School Graduate,2,1,,61,1,08,...,,,,,,,,,,1
1,Female,61,Married/Living Together,High School Graduate,2,1,,92,2,04,...,,,,,,,,,,1
2,Male,19,Single,High School Graduate,2,1,,92,2,08,...,,,,,,,,,,1
3,Male,48,Married/Living Together,High School - Second Year,2,1,,61,1,04,...,,,,,,,,,,1
4,Female,41,Married/Living Together,High School Graduate,2,1,,91,1,12,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180850,Male,34,Married/Living Together,No grade completed,2,1,,62,1,05,...,,,,,,,,,,1
180851,Female,32,Married/Living Together,No grade completed,2,1,,61,1,04,...,,,,,,,,,,1
180857,Male,29,Married/Living Together,High School Graduate,2,1,,13,1,08,...,,,,,,,,,,1
180858,Female,29,Married/Living Together,College - Third Year,2,2,2,,,,...,2,,,8,,,,2,,
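Because `.map()` returns NaN for every code missing from the dictionary, the `.fillna('Unknown')` step folds all of those codes into a single category. Before accepting that, it may help to list the unmapped codes and measure how many rows they cover. The sketch below uses a toy series and a deliberately small subset of the mapping, so the numbers are illustrative only:

```python
import pandas as pd

# Toy grade codes; "622" and "672" are absent from the partial mapping below.
grades = pd.Series(["350", "622", "000", "672", "350"])
grade_map = {"000": "No grade completed", "350": "High School Graduate"}

mapped = grades.map(grade_map)

# Codes that would silently become "Unknown", and the share of rows affected.
unmapped_codes = sorted(grades[mapped.isna()].unique())
unknown_share = float(mapped.isna().mean())

labeled = mapped.fillna("Unknown")
print(unmapped_codes, unknown_share)
```

A large `unknown_share` would suggest extending the dictionary rather than relying on the catch-all label.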


**PUFC09_GRADTECH** - Graduate of technical/vocational course

In [410]:
print(df['PUFC09_GRADTECH'].isna().sum())

0


In [411]:
display(df['PUFC09_GRADTECH'])

0         2
1         2
2         2
3         2
4         2
         ..
180850    2
180851    2
180857    2
180858    2
180861    2
Name: PUFC09_GRADTECH, Length: 94001, dtype: object

In [412]:
df['PUFC09_GRADTECH'].unique()

array(['2', '1'], dtype=object)

In the dataset, code "1" (Yes) indicates that the member is a graduate of a technical/vocational course, and code "2" (No) indicates otherwise.

In [413]:
df['PUFC09_GRADTECH'] = df['PUFC09_GRADTECH'].map({'1': 'Yes', '2': 'No'})
display(df)

Unnamed: 0,PUFC04_SEX,PUFC05_AGE,PUFC06_MSTAT,PUFC07_GRADE,PUFC09_GRADTECH,PUFC11_WORK,PUFC12_JOB,PUFC14_PROCC,PUFC17_NATEM,PUFC18_PNWHRS,...,PUFC30_LOOKW,PUFC32_JOBSM,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR
0,Male,49,Married/Living Together,High School Graduate,No,1,,61,1,08,...,,,,,,,,,,1
1,Female,61,Married/Living Together,High School Graduate,No,1,,92,2,04,...,,,,,,,,,,1
2,Male,19,Single,High School Graduate,No,1,,92,2,08,...,,,,,,,,,,1
3,Male,48,Married/Living Together,High School - Second Year,No,1,,61,1,04,...,,,,,,,,,,1
4,Female,41,Married/Living Together,High School Graduate,No,1,,91,1,12,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180850,Male,34,Married/Living Together,No grade completed,No,1,,62,1,05,...,,,,,,,,,,1
180851,Female,32,Married/Living Together,No grade completed,No,1,,61,1,04,...,,,,,,,,,,1
180857,Male,29,Married/Living Together,High School Graduate,No,1,,13,1,08,...,,,,,,,,,,1
180858,Female,29,Married/Living Together,College - Third Year,No,2,2,,,,...,2,,,8,,,,2,,


**PUFC11_WORK** - Work Indicator

In [414]:
print(df['PUFC11_WORK'].isna().sum())

1760


In [415]:
df['PUFC11_WORK'].unique()

array(['1', '2', nan], dtype=object)

In [416]:
df['PUFC11_WORK'] = df['PUFC11_WORK'].map({'1': 'Yes', '2': 'No'})
display(df)

Unnamed: 0,PUFC04_SEX,PUFC05_AGE,PUFC06_MSTAT,PUFC07_GRADE,PUFC09_GRADTECH,PUFC11_WORK,PUFC12_JOB,PUFC14_PROCC,PUFC17_NATEM,PUFC18_PNWHRS,...,PUFC30_LOOKW,PUFC32_JOBSM,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR
0,Male,49,Married/Living Together,High School Graduate,No,Yes,,61,1,08,...,,,,,,,,,,1
1,Female,61,Married/Living Together,High School Graduate,No,Yes,,92,2,04,...,,,,,,,,,,1
2,Male,19,Single,High School Graduate,No,Yes,,92,2,08,...,,,,,,,,,,1
3,Male,48,Married/Living Together,High School - Second Year,No,Yes,,61,1,04,...,,,,,,,,,,1
4,Female,41,Married/Living Together,High School Graduate,No,Yes,,91,1,12,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180850,Male,34,Married/Living Together,No grade completed,No,Yes,,62,1,05,...,,,,,,,,,,1
180851,Female,32,Married/Living Together,No grade completed,No,Yes,,61,1,04,...,,,,,,,,,,1
180857,Male,29,Married/Living Together,High School Graduate,No,Yes,,13,1,08,...,,,,,,,,,,1
180858,Female,29,Married/Living Together,College - Third Year,No,No,2,,,,...,2,,,8,,,,2,,


**PUFC12_JOB** - Job Indicator

In [417]:
print(df['PUFC12_JOB'].isna().sum())

69115


In [418]:
df['PUFC12_JOB'].unique()

array([nan, '2', '1'], dtype=object)

In [419]:
df['PUFC12_JOB'] = df['PUFC12_JOB'].map({'1': 'Yes', '2': 'No'})
display(df)

Unnamed: 0,PUFC04_SEX,PUFC05_AGE,PUFC06_MSTAT,PUFC07_GRADE,PUFC09_GRADTECH,PUFC11_WORK,PUFC12_JOB,PUFC14_PROCC,PUFC17_NATEM,PUFC18_PNWHRS,...,PUFC30_LOOKW,PUFC32_JOBSM,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR
0,Male,49,Married/Living Together,High School Graduate,No,Yes,,61,1,08,...,,,,,,,,,,1
1,Female,61,Married/Living Together,High School Graduate,No,Yes,,92,2,04,...,,,,,,,,,,1
2,Male,19,Single,High School Graduate,No,Yes,,92,2,08,...,,,,,,,,,,1
3,Male,48,Married/Living Together,High School - Second Year,No,Yes,,61,1,04,...,,,,,,,,,,1
4,Female,41,Married/Living Together,High School Graduate,No,Yes,,91,1,12,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180850,Male,34,Married/Living Together,No grade completed,No,Yes,,62,1,05,...,,,,,,,,,,1
180851,Female,32,Married/Living Together,No grade completed,No,Yes,,61,1,04,...,,,,,,,,,,1
180857,Male,29,Married/Living Together,High School Graduate,No,Yes,,13,1,08,...,,,,,,,,,,1
180858,Female,29,Married/Living Together,College - Third Year,No,No,No,,,,...,2,,,8,,,,2,,


**PUFC14_PROCC** - Primary Occupation


In [420]:
print(df['PUFC14_PROCC'].isna().sum())

0


In [421]:
print(df['PUFC14_PROCC'].dtype) 

object


In [422]:
df['PUFC14_PROCC'].unique()

array(['61', '92', '91', '52', '  ', '53', '13', '54', '93', '12', '71',
       '11', '83', '14', '51', '33', '44', '75', '42', '34', '96', '22',
       '62', '23', '72', '26', '41', '21', '43', '24', '74', '31', '82',
       '35', '94', '73', '25', '32', '81', '95', '02', '01', '03', '63'],
      dtype=object)

In [423]:
df['PUFC14_PROCC'] = df['PUFC14_PROCC'].replace("  ", np.nan)
df['PUFC14_PROCC'] = pd.to_numeric(df['PUFC14_PROCC'], errors='coerce')
print(df['PUFC14_PROCC'].dtype) 

float64


In [424]:
occupation_map = {
    11: "Chief executives, senior officials and legislators",
    12: "Administrative and commercial managers",
    13: "Production and specialized services managers",
    14: "Hospitality, retail and other services managers",
    21: "Science and engineering professionals",
    22: "Health professionals",
    23: "Teaching professionals",
    24: "Business and administration professionals",
    25: "Information and communication technology professionals",
    26: "Legal, social and cultural professionals",
    31: "Science and engineering associate professionals",
    32: "Health associate professionals",
    33: "Business and administration associate professionals",
    34: "Legal, social, cultural and related associate professionals",
    35: "Information and communications technicians",
    41: "General and keyboard clerks",
    42: "Customer service clerks",
    43: "Numerical and material recording clerks",
    44: "Other clerical support workers",
    51: "Personal service workers",
    52: "Sales workers",
    53: "Personal care workers",
    54: "Protective services workers",
    61: "Market-oriented skilled agricultural workers",
    62: "Market-oriented skilled forestry, fishery and hunting workers",
    63: "Subsistence farmers, fishers, hunters and gatherers",
    71: "Building and related trades workers, excluding electricians",
    72: "Metal, machinery and related trades workers",
    73: "Handicraft and printing workers",
    74: "Electrical and electronics trades workers",
    75: "Food processing, wood working, garment and other craft and related trades workers",
    81: "Stationary plant and machine operators",
    82: "Assemblers",
    83: "Drivers and mobile plant operators",
    91: "Cleaners and helpers",
    92: "Agricultural, forestry and fishery laborers",
    93: "Laborers in mining, construction, manufacturing and transport",
    94: "Food preparation assistants",
    95: "Street and related sales and service workers",
    96: "Refuse workers and other elementary workers",
    1: "Commissioned armed forces officers",
    2: "Non-commissioned armed forces officers",
    3: "Armed forces occupations, other ranks"
}
df['PUFC14_PROCC'] = df['PUFC14_PROCC'].map(occupation_map)


In [425]:
display(df[['PUFC14_PROCC']].head())

Unnamed: 0,PUFC14_PROCC
0,Market-oriented skilled agricultural workers
1,"Agricultural, forestry and fishery laborers"
2,"Agricultural, forestry and fishery laborers"
3,Market-oriented skilled agricultural workers
4,Cleaners and helpers
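A detail that may look surprising: after `pd.to_numeric`, the column is `float64` (blank entries became NaN), while the keys of `occupation_map` are integers. The lookup still succeeds because `61.0` and `61` compare equal, and codes with leading zeros such as `'01'` parse to `1.0`. A small sketch with a two-entry subset of the mapping:

```python
import pandas as pd

# Toy codes: a regular code, a two-space blank, and a zero-padded code.
codes = pd.Series(["61", "  ", "01"])

# Blank strings cannot be parsed, so errors="coerce" turns them into NaN.
numeric = pd.to_numeric(codes, errors="coerce")

# Integer keys match the float values; NaN stays NaN.
labels = numeric.map({
    61: "Market-oriented skilled agricultural workers",
    1: "Commissioned armed forces officers",
})
print(numeric.dtype)
print(labels.tolist())
```

Rows left as NaN after this step are respondents with no recorded primary occupation.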


**PUFC17_NATEM** - Nature of Employment

In [426]:
print(df['PUFC17_NATEM'].isna().sum())

25583


In [427]:
print(df['PUFC17_NATEM'].dtype) 

object


In [428]:
df['PUFC17_NATEM'].unique()

array(['1', '2', nan, '3'], dtype=object)

In [429]:
df[df['PUFC17_NATEM'] == '2']

Unnamed: 0,PUFC04_SEX,PUFC05_AGE,PUFC06_MSTAT,PUFC07_GRADE,PUFC09_GRADTECH,PUFC11_WORK,PUFC12_JOB,PUFC14_PROCC,PUFC17_NATEM,PUFC18_PNWHRS,...,PUFC30_LOOKW,PUFC32_JOBSM,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR
1,Female,61,Married/Living Together,High School Graduate,No,Yes,,"Agricultural, forestry and fishery laborers",2,04,...,,,,,,,,,,1
2,Male,19,Single,High School Graduate,No,Yes,,"Agricultural, forestry and fishery laborers",2,08,...,,,,,,,,,,1
5,Male,20,Single,High School Graduate,No,Yes,,Sales workers,2,08,...,,,,,,,,,,1
14,Male,23,Single,High School Graduate,No,Yes,,Protective services workers,2,12,...,,,,,,,,,,1
15,Male,71,Married/Living Together,Grade 4,No,Yes,,"Laborers in mining, construction, manufacturin...",2,08,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180805,Female,31,Single,College - Second Year,No,Yes,,Personal care workers,2,08,...,,,,,,,,,,1
180836,Male,28,Married/Living Together,High School Graduate,No,Yes,,Drivers and mobile plant operators,2,08,...,,,,,,,,,,1
180837,Female,26,Married/Living Together,High School Graduate,No,Yes,,Personal service workers,2,08,...,,,,,,,,,,1
180840,Male,68,Married/Living Together,Unknown,No,Yes,,"Chief executives, senior officials and legisla...",2,08,...,,,,,,,,,,1


In [430]:
df['PUFC17_NATEM'] = df['PUFC17_NATEM'].map({
    '1': 'Permanent job/business/unpaid family work',
    '2': 'Short-term or seasonal or casual job/business/unpaid family work',
    '3': 'Worked for different employer on day to day or week to week basis'
})
display(df)

Unnamed: 0,PUFC04_SEX,PUFC05_AGE,PUFC06_MSTAT,PUFC07_GRADE,PUFC09_GRADTECH,PUFC11_WORK,PUFC12_JOB,PUFC14_PROCC,PUFC17_NATEM,PUFC18_PNWHRS,...,PUFC30_LOOKW,PUFC32_JOBSM,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR
0,Male,49,Married/Living Together,High School Graduate,No,Yes,,Market-oriented skilled agricultural workers,Permanent job/business/unpaid family work,08,...,,,,,,,,,,1
1,Female,61,Married/Living Together,High School Graduate,No,Yes,,"Agricultural, forestry and fishery laborers",Short-term or seasonal or casual job/business/...,04,...,,,,,,,,,,1
2,Male,19,Single,High School Graduate,No,Yes,,"Agricultural, forestry and fishery laborers",Short-term or seasonal or casual job/business/...,08,...,,,,,,,,,,1
3,Male,48,Married/Living Together,High School - Second Year,No,Yes,,Market-oriented skilled agricultural workers,Permanent job/business/unpaid family work,04,...,,,,,,,,,,1
4,Female,41,Married/Living Together,High School Graduate,No,Yes,,Cleaners and helpers,Permanent job/business/unpaid family work,12,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180850,Male,34,Married/Living Together,No grade completed,No,Yes,,"Market-oriented skilled forestry, fishery and ...",Permanent job/business/unpaid family work,05,...,,,,,,,,,,1
180851,Female,32,Married/Living Together,No grade completed,No,Yes,,Market-oriented skilled agricultural workers,Permanent job/business/unpaid family work,04,...,,,,,,,,,,1
180857,Male,29,Married/Living Together,High School Graduate,No,Yes,,Production and specialized services managers,Permanent job/business/unpaid family work,08,...,,,,,,,,,,1
180858,Female,29,Married/Living Together,College - Third Year,No,No,No,,,,...,2,,,8,,,,2,,


In [431]:
df[df['PUFC17_NATEM'] == 'Short-term or seasonal or casual job/business/unpaid family work']

Unnamed: 0,PUFC04_SEX,PUFC05_AGE,PUFC06_MSTAT,PUFC07_GRADE,PUFC09_GRADTECH,PUFC11_WORK,PUFC12_JOB,PUFC14_PROCC,PUFC17_NATEM,PUFC18_PNWHRS,...,PUFC30_LOOKW,PUFC32_JOBSM,PUFC33_WEEKS,PUFC34_WYNOT,PUFC35_LTLOOKW,PUFC36_AVAIL,PUFC37_WILLING,PUFC38_PREVJOB,PUFC40_POCC,PUFC41_WQTR
1,Female,61,Married/Living Together,High School Graduate,No,Yes,,"Agricultural, forestry and fishery laborers",Short-term or seasonal or casual job/business/...,04,...,,,,,,,,,,1
2,Male,19,Single,High School Graduate,No,Yes,,"Agricultural, forestry and fishery laborers",Short-term or seasonal or casual job/business/...,08,...,,,,,,,,,,1
5,Male,20,Single,High School Graduate,No,Yes,,Sales workers,Short-term or seasonal or casual job/business/...,08,...,,,,,,,,,,1
14,Male,23,Single,High School Graduate,No,Yes,,Protective services workers,Short-term or seasonal or casual job/business/...,12,...,,,,,,,,,,1
15,Male,71,Married/Living Together,Grade 4,No,Yes,,"Laborers in mining, construction, manufacturin...",Short-term or seasonal or casual job/business/...,08,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180805,Female,31,Single,College - Second Year,No,Yes,,Personal care workers,Short-term or seasonal or casual job/business/...,08,...,,,,,,,,,,1
180836,Male,28,Married/Living Together,High School Graduate,No,Yes,,Drivers and mobile plant operators,Short-term or seasonal or casual job/business/...,08,...,,,,,,,,,,1
180837,Female,26,Married/Living Together,High School Graduate,No,Yes,,Personal service workers,Short-term or seasonal or casual job/business/...,08,...,,,,,,,,,,1
180840,Male,68,Married/Living Together,Unknown,No,Yes,,"Chief executives, senior officials and legisla...",Short-term or seasonal or casual job/business/...,08,...,,,,,,,,,,1


**PUFC18_PNWHRS** – Normal Working Hours per Day

In [432]:
print(df['PUFC18_PNWHRS'].isna().sum())

0


In [433]:
df['PUFC18_PNWHRS'].unique()

array(['08', '04', '12', '  ', '10', '02', '03', '06', '09', '07', '05',
       '01', '13', '15', '14', '11', '16'], dtype=object)

In [434]:
print(df['PUFC18_PNWHRS'].dtype) 

object


In [447]:
df['PUFC18_PNWHRS'] = pd.to_numeric(df['PUFC18_PNWHRS'], errors='coerce')  # blank strings become NaN
df['PUFC18_PNWHRS'] = df['PUFC18_PNWHRS'].fillna(0).astype(int)  # treat missing hours as 0, then cast to int
print(df['PUFC18_PNWHRS'].dtype) 

int64


In [448]:
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)

In [449]:
df['PUFC18_PNWHRS'].unique()

array([ 8,  4, 12,  0, 10,  2,  3,  6,  9,  7,  5,  1, 13, 15, 14, 11, 16])

In [450]:
display(df)

| | PUFC04_SEX | PUFC05_AGE | PUFC06_MSTAT | PUFC07_GRADE | PUFC09_GRADTECH | PUFC11_WORK | PUFC12_JOB | PUFC14_PROCC | PUFC17_NATEM | PUFC18_PNWHRS | ... | PUFC30_LOOKW | PUFC32_JOBSM | PUFC33_WEEKS | PUFC34_WYNOT | PUFC35_LTLOOKW | PUFC36_AVAIL | PUFC37_WILLING | PUFC38_PREVJOB | PUFC40_POCC | PUFC41_WQTR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 49 | Married/Living Together | High School Graduate | No | Yes | | Market-oriented skilled agricultural workers | Permanent job/business/unpaid family work | 8 | ... | | | | | | | | | | 1 |
| 1 | Female | 61 | Married/Living Together | High School Graduate | No | Yes | | Agricultural, forestry and fishery laborers | Short-term or seasonal or casual job/business/... | 4 | ... | | | | | | | | | | 1 |
| 2 | Male | 19 | Single | High School Graduate | No | Yes | | Agricultural, forestry and fishery laborers | Short-term or seasonal or casual job/business/... | 8 | ... | | | | | | | | | | 1 |
| 3 | Male | 48 | Married/Living Together | High School - Second Year | No | Yes | | Market-oriented skilled agricultural workers | Permanent job/business/unpaid family work | 4 | ... | | | | | | | | | | 1 |
| 4 | Female | 41 | Married/Living Together | High School Graduate | No | Yes | | Cleaners and helpers | Permanent job/business/unpaid family work | 12 | ... | | | | | | | | | | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 180850 | Male | 34 | Married/Living Together | No grade completed | No | Yes | | Market-oriented skilled forestry, fishery and ... | Permanent job/business/unpaid family work | 5 | ... | | | | | | | | | | 1 |
| 180851 | Female | 32 | Married/Living Together | No grade completed | No | Yes | | Market-oriented skilled agricultural workers | Permanent job/business/unpaid family work | 4 | ... | | | | | | | | | | 1 |
| 180857 | Male | 29 | Married/Living Together | High School Graduate | No | Yes | | Production and specialized services managers | Permanent job/business/unpaid family work | 8 | ... | | | | | | | | | | 1 |
| 180858 | Female | 29 | Married/Living Together | College - Third Year | No | No | No | | | 0 | ... | 2 | | | 8 | | | | 2 | | |


**PUFC19_PHOURS** – Total Number of Hours Worked during the past week

In [439]:
print(df['PUFC19_PHOURS'].isna().sum())

25583


In [440]:
print(df['PUFC19_PHOURS'].dtype) 

object


In [441]:
df['PUFC19_PHOURS'].unique()

array(['024', '008', '020', '072', '048', nan, '010', '060', '016', '040',
       '070', '004', '032', '045', '030', '015', '003', '002', '042',
       '063', '054', '036', '007', '028', '006', '014', '012', '018',
       '056', '025', '000', '091', '035', '021', '009', '084', '090',
       '050', '005', '049', '066', '044', '077', '057', '098', '105',
       '052', '064', '065', '059', '027', '055', '112', '022', '038',
       '096', '078', '033', '001', '089', '058', '075', '071', '053',
       '062', '039', '031', '011', '034', '043', '026', '103', '094',
       '047', '108', '041', '046', '013', '080', '100', '017', '092',
       '029', '102', '082', '088', '076', '093', '051', '019', '074',
       '068', '061', '067', '069', '073', '101', '086', '023', '037',
       '081', '079', '085'], dtype=object)

In [455]:
df['PUFC19_PHOURS'] = pd.to_numeric(df['PUFC19_PHOURS'], errors='coerce')  # parse zero-padded strings; missing or invalid values become NaN
df['PUFC19_PHOURS'] = df['PUFC19_PHOURS'].fillna(0).astype(int)  # missing hours become 0, then cast to int
print(df['PUFC19_PHOURS'].dtype) 

int64


In [456]:
df['PUFC19_PHOURS'].unique()

array([ 24,   8,  20,  72,  48,   0,  10,  60,  16,  40,  70,   4,  32,
        45,  30,  15,   3,   2,  42,  63,  54,  36,   7,  28,   6,  14,
        12,  18,  56,  25,  91,  35,  21,   9,  84,  90,  50,   5,  49,
        66,  44,  77,  57,  98, 105,  52,  64,  65,  59,  27,  55, 112,
        22,  38,  96,  78,  33,   1,  89,  58,  75,  71,  53,  62,  39,
        31,  11,  34,  43,  26, 103,  94,  47, 108,  41,  46,  13,  80,
       100,  17,  92,  29, 102,  82,  88,  76,  93,  51,  19,  74,  68,
        61,  67,  69,  73, 101,  86,  23,  37,  81,  79,  85])

In [457]:
display(df['PUFC19_PHOURS'])

0         24
1          8
2         24
3         20
4         72
          ..
180850    30
180851    28
180857    40
180858     0
180861    28
Name: PUFC19_PHOURS, Length: 94001, dtype: int64
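Note that `'000'` already appears among the reported values, so after `fillna(0)` a respondent with no recorded hours can no longer be told apart from one who reported zero hours. The following is a minimal sketch of one alternative, assuming a nullable integer column is acceptable for later steps. `raw_hours` is a made-up sample in the same format, not LFS data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same zero-padded string format as PUFC19_PHOURS.
raw_hours = pd.Series(["024", "008", np.nan, "000", "072"])

# Parse to numbers; missing or unparseable entries become NaN (float64 at this point).
parsed = pd.to_numeric(raw_hours, errors="coerce")

# The nullable integer dtype keeps missing values as <NA> instead of turning them into 0.
hours = parsed.astype("Int64")

print(hours.dtype)
print(int(hours.isna().sum()), "missing;", int((hours == 0).sum()), "reported zero")
```

Whether to keep `<NA>` or fill with 0 depends on the model: most scikit-learn estimators need the gaps filled eventually, in which case a separate indicator column can preserve the distinction.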

**PUFC23_PCLASS** – Class of Worker

In [460]:
print(df['PUFC23_PCLASS'].isna().sum())

25583


In [461]:
print(df['PUFC23_PCLASS'].dtype) 

object


In [462]:
df['PUFC23_PCLASS'].unique()

array(['3', '6', '1', '0', nan, '2', '4', '5'], dtype=object)

In [463]:
pclass_mapping = {
    "0": "Worked for private household",
    "1": "Worked for private establishment",
    "2": "Worked for government/government-controlled corporation",
    "3": "Self-employed without any paid employee",
    "4": "Employer in own family-operated farm or business",
    "5": "Worked with pay in own family-operated farm or business",
    "6": "Worked without pay in own family-operated farm or business"
}
df['PUFC23_PCLASS'] = df['PUFC23_PCLASS'].map(pclass_mapping)
print(df['PUFC23_PCLASS'].unique())

['Self-employed without any paid employee'
 'Worked without pay in own family-operated farm or business'
 'Worked for private establishment' 'Worked for private household' nan
 'Worked for government/government-controlled corporation'
 'Employer in own family-operated farm or business'
 'Worked with pay in own family-operated farm or business']


In [464]:
display(df['PUFC23_PCLASS'])

0                   Self-employed without any paid employee
1         Worked without pay in own family-operated farm...
2                          Worked for private establishment
3                   Self-employed without any paid employee
4                              Worked for private household
                                ...                        
180850              Self-employed without any paid employee
180851              Self-employed without any paid employee
180857     Employer in own family-operated farm or business
180858                                                  NaN
180861              Self-employed without any paid employee
Name: PUFC23_PCLASS, Length: 94001, dtype: object
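`Series.map` with a dictionary returns NaN for any value that is not one of the keys, so a code outside the mapping would be silently merged with the genuinely missing entries. The sketch below shows one way to guard against that. `map_codes` is a hypothetical helper, and the sample uses made-up values with two of the labels from `pclass_mapping`.

```python
import numpy as np
import pandas as pd

def map_codes(codes: pd.Series, mapping: dict) -> pd.Series:
    """Map codes to labels, raising if a non-missing code has no label (hypothetical helper)."""
    unmapped = set(codes.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"codes without a label: {sorted(unmapped)}")
    return codes.map(mapping)

# Made-up sample using two of the labels from pclass_mapping above.
sample_mapping = {
    "0": "Worked for private household",
    "1": "Worked for private establishment",
}
sample = pd.Series(["1", "0", np.nan, "1"])

labels = map_codes(sample, sample_mapping)
print(labels.isna().sum())  # only the originally missing entry stays NaN
```

Comparing the NaN count before and after mapping is an equivalent check: here both are 25583 in the cells above only if every code had a label.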

**PUFC25_PBASIC** – Basic Pay per Day

**PUFC30_LOOKW** – Looked for Work or Tried to Establish Business during the past week


**PUFC32_JOBSM** – Job Search Method

**PUFC33_WEEKS** – Number of Weeks Spent Looking for Work

**PUFC34_WYNOT** – Reason for Not Looking for Work

**PUFC35_LTLOOKW** – When Last Looked for Work

**PUFC36_AVAIL** – Available for Work

**PUFC37_WILLING** – Willingness to take up work during the past week or within two weeks

**PUFC38_PREVJOB** – Previous Job Indicator

**PUFC40_POCC** – Previous Occupation

**PUFC41_WQTR** – Did work or had a job during the past quarter
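The columns listed above have no cleaning cells yet. The three checks used for the earlier columns (missing count, dtype, unique values) could be gathered in a single pass, as in the sketch below. `summarize_columns` is a hypothetical helper, and the `demo` frame holds made-up values under two of the column names above.

```python
import numpy as np
import pandas as pd

def summarize_columns(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return missing count, dtype, and number of distinct non-missing values per column."""
    rows = []
    for col in columns:
        rows.append({
            "column": col,
            "missing": int(frame[col].isna().sum()),
            "dtype": str(frame[col].dtype),
            "n_unique": int(frame[col].nunique(dropna=True)),
        })
    return pd.DataFrame(rows).set_index("column")

# Made-up values; only the column names come from the list above.
demo = pd.DataFrame({
    "PUFC30_LOOKW": ["2", np.nan, "1", "2"],
    "PUFC33_WEEKS": [np.nan, "004", np.nan, "012"],
})
summary = summarize_columns(demo, ["PUFC30_LOOKW", "PUFC33_WEEKS"])
print(summary)
```

On the real data the call would take `df` and the list of remaining `PUFC` column names.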

## Export cleaned data

In [None]:
# Export is currently disabled; uncomment the lines below to write the cleaned data and reload it.
# df.to_csv("cleansed_labor_dataset.csv", index=False)
# cleansed_labor_dataset = pd.read_csv("cleansed_labor_dataset.csv")
# cleansed_labor_dataset.head()
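One thing to keep in mind when the export is enabled: with `index=False` the row labels are not written, so the non-contiguous index seen above (for example 180858) is replaced by a fresh 0-based index on reload. The sketch below illustrates this with a made-up frame and an in-memory buffer standing in for the CSV file; `small` and its values are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up frame with a non-contiguous index, similar to df after rows were dropped.
small = pd.DataFrame(
    {"PUFC19_PHOURS": [24, 8, 0], "PUFC04_SEX": ["Male", "Female", "Male"]},
    index=[0, 1, 180858],
)

buffer = io.StringIO()            # stands in for the CSV file on disk
small.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.index.tolist())    # original row labels are not written with index=False
print(restored.dtypes)            # dtypes are re-inferred from the text on read
```

If the original row labels are needed later, for instance to join back to the raw file, writing with the index and reading with `index_col=0` would preserve them.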