# STINTSY Project

### Group 4 (S13)
CAPAROS, MIGUEL ANTONIO <br> 
MARTINEZ, AZELIAH <br>
PAREDES, BILL JETHRO <br>
VILLANUEVA, KEISHA LEIGH <br>

# I. Introduction to the problem and dataset

Select one real-world dataset from the list of datasets provided for the project. Each dataset is accompanied by a description file, which also contains a detailed description of each feature.

The target task (i.e., classification or regression) should be properly stated as well.

The Labor Force Survey (LFS) is a nationwide quarterly survey conducted by the Philippine Statistics Authority (PSA). It aims to gather data on the demographic and socio-economic characteristics of the labor force, providing insights into employment, unemployment, and underemployment trends in the country.

# II. Description of the dataset

• State a brief description of the dataset.

• Provide a description of the collection process executed to build the dataset. Discuss the implications of the data collection method on the generated conclusions and insights. Note that you may need to look at relevant sources related to the dataset to acquire necessary information for this part of the project.

• Describe the structure of the dataset file.
- What does each row and column represent?
- How many instances are there in the dataset?
- How many features are there in the dataset?
- If the dataset is composed of different files that you will combine in the succeeding steps, describe the structure and the contents of each file.

• Discuss the features in each dataset file. What does each feature represent? All features, even those which are not used for the study, should be described to the reader. The purpose of each feature in the dataset should be clear to the reader of the notebook without having to go through an external link.

## Dataset Overview

The Labor Force Survey (LFS) dataset contains information on individuals' demographic profiles, educational attainment, occupation types, work status, and income levels. 

## Data Collection Process

The PSA collects the LFS data through quarterly household surveys using face-to-face interviews. The survey employs a multi-stage sampling design to ensure that the sample accurately represents the population. The key steps in the data collection process include:

1. Sampling frame creation: The PSA uses a master sample list derived from the 2015 Census of Population and Housing (CPH) as the sampling frame.
2. Random selection of households: Households are randomly selected within each stratum (region or province) to create a representative sample.
3. Face-to-face interviews: Field interviewers visit the sampled households and collect data through structured questionnaires.
4. Data validation and processing: The collected data undergoes validation and consistency checks before being aggregated and published.

The multi-stage sampling design of the Labor Force Survey (LFS) is intended to make the sample representative across regions, but certain limitations exist. Non-response, recall bias, and misreporting in self-reported answers may introduce bias and reduce data reliability, particularly in income-related features. Because each survey round covers a single quarter, its figures reflect seasonal employment conditions, so insights drawn from one round may not generalize to longer periods. Despite these limitations, the LFS remains a valuable source for policy-making and labor market analysis, providing key insights into the employment landscape of the Philippines.
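
Because the sample comes from a multi-stage design, population-level estimates are usually computed with survey weights rather than raw row counts. The sketch below shows one way a weighted share could be computed. It assumes that `PUFPWGTFIN` holds the final person weight, which is an assumption based on the column name and should be checked against the description file. The helper `weighted_share` and the toy values are hypothetical and are not taken from the LFS file.

```python
import pandas as pd


def weighted_share(df: pd.DataFrame, column: str, weight: str = "PUFPWGTFIN") -> pd.Series:
    """Return the weighted share of each category found in `column`."""
    # Sum the weights of the rows in each category, then normalize.
    totals = df.groupby(column)[weight].sum()
    return totals / totals.sum()


# Toy rows for illustration only (assumed values, not survey data).
toy = pd.DataFrame({
    "PUFC04_SEX": [1, 2, 2, 1],
    "PUFPWGTFIN": [100.0, 300.0, 100.0, 100.0],
})

shares = weighted_share(toy, "PUFC04_SEX")
print(shares)
```

With weights applied, a category can carry more or less influence than its row count suggests, which is why unweighted proportions from the raw file may differ from published figures.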

## Dataset Structure

## Dataset Features

# III. List of requirements

List all the Python libraries and modules that you used.

## Import Libraries

add here

In [3]:
# Data manipulation
import numpy as np
import pandas as pd

# Standard-library utilities
import os
import pickle
import random

# Visualization
import matplotlib.pyplot as plt     # For creating plots and visualizations
import seaborn as sns               # For statistical visualizations

# Makes matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline

# IV.  Data preprocessing and cleaning

Perform necessary steps before using the data. In this section of the notebook, please take note of the following:

• If needed, perform preprocessing techniques to transform the data to the appropriate representation. This may include binning, log transformations, conversion to one-hot encoding, normalization, standardization, interpolation, truncation, and feature engineering, among others. There should be a correct and proper justification for the use of each preprocessing technique used in the project.

• Make sure that the data is clean, especially features that are used in the project. This may include checking for misrepresentations, checking the data type, dealing with missing data, dealing with duplicate data, and dealing with outliers, among others. There should be a correct and proper justification for the application (or non-application) of each data cleaning method used in the project. Clean only the variables utilized in the study.

## Reading the Dataset

Our first step is to load the dataset using pandas, which will import the data into a pandas `DataFrame`. We use the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to accomplish this.

In [5]:
laborforce_df = pd.read_csv('Labor Force Survey 2016.csv')
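
As a small optional variation on the cell above, the load step can be wrapped in a helper that fails with a clear message when the file is not in the working directory. This is only a sketch: `load_survey` is a hypothetical helper name, and passing `low_memory=False` (so pandas infers each column's type from the whole file rather than chunk by chunk) is a choice made here, not something the notebook requires.

```python
from pathlib import Path

import pandas as pd


def load_survey(path: str) -> pd.DataFrame:
    """Read the survey CSV, raising a descriptive error if it is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at {csv_path.resolve()}")
    # low_memory=False reads the file in one pass for consistent dtype inference.
    return pd.read_csv(csv_path, low_memory=False)


DATA_FILE = "Labor Force Survey 2016.csv"
if Path(DATA_FILE).is_file():
    laborforce_df = load_survey(DATA_FILE)
    print(laborforce_df.shape)
```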

When loading a new dataset, it is advisable to utilize the [`info`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) function, as it displays general information regarding the dataset's structure and attributes.

In [6]:
laborforce_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180862 entries, 0 to 180861
Data columns (total 50 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   PUFREG           180862 non-null  int64  
 1   PUFPRV           180862 non-null  int64  
 2   PUFPRRCD         180862 non-null  int64  
 3   PUFHHNUM         180862 non-null  int64  
 4   PUFURB2K10       180862 non-null  int64  
 5   PUFPWGTFIN       180862 non-null  float64
 6   PUFSVYMO         180862 non-null  int64  
 7   PUFSVYYR         180862 non-null  int64  
 8   PUFPSU           180862 non-null  int64  
 9   PUFRPL           180862 non-null  int64  
 10  PUFHHSIZE        180862 non-null  int64  
 11  PUFC01_LNO       180862 non-null  int64  
 12  PUFC03_REL       180862 non-null  int64  
 13  PUFC04_SEX       180862 non-null  int64  
 14  PUFC05_AGE       180862 non-null  int64  
 15  PUFC06_MSTAT     180862 non-null  object 
 16  PUFC07_GRADE     180862 non-null  obje
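
The output above lists `PUFC06_MSTAT` and `PUFC07_GRADE` as `object` columns. One possible follow-up is to check which non-numeric columns contain only number-like text, since those may be coded values that were read as strings. The sketch below uses a hypothetical helper, `numeric_like_text_columns`, and a toy frame whose values are assumed for illustration (the `LABEL` column does not exist in the dataset).

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def numeric_like_text_columns(df: pd.DataFrame) -> list:
    """List non-numeric columns whose non-blank values all parse as numbers."""
    candidates = []
    for name in df.columns:
        if is_numeric_dtype(df[name]):
            continue  # already numeric, nothing to check
        # Ignore missing values and blank or whitespace-only strings.
        text = df[name].dropna().astype(str).str.strip()
        text = text[text != ""]
        if len(text) > 0 and pd.to_numeric(text, errors="coerce").notna().all():
            candidates.append(name)
    return candidates


# Toy frame for illustration only (assumed values, hypothetical LABEL column).
toy = pd.DataFrame({
    "PUFC05_AGE": [49, 61, 19],
    "PUFC06_MSTAT": ["2", " ", "1"],
    "LABEL": ["a", "b", "c"],
})

print(numeric_like_text_columns(toy))
```

Columns flagged this way are only candidates for conversion; whether to convert them depends on how blanks should be treated.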

We will use the [`head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) function to quickly view the first few rows of our dataset.

In [7]:
laborforce_df.head()

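| | PUFREG | PUFPRV | PUFPRRCD | PUFHHNUM | PUFURB2K10 | PUFPWGTFIN | PUFSVYMO | PUFSVYYR | PUFPSU | PUFRPL | ... | PUFC33_WEEKS | PUFC34_WYNOT | PUFC35_LTLOOKW | PUFC36_AVAIL | PUFC37_WILLING | PUFC38_PREVJOB | PUFC40_POCC | PUFC41_WQTR | PUFC43_QKB | PUFNEWEMPSTAT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 28 | 2800 | 1 | 2 | 405.2219 | 4 | 2016 | 217 | 1 | ... |  |  |  |  |  |  |  | 1 | 1 | 1 |
| 1 | 1 | 28 | 2800 | 1 | 2 | 388.828 | 4 | 2016 | 217 | 1 | ... |  |  |  |  |  |  |  | 1 | 1 | 1 |
| 2 | 1 | 28 | 2800 | 1 | 2 | 406.1194 | 4 | 2016 | 217 | 1 | ... |  |  |  |  |  |  |  | 1 | 1 | 1 |
| 3 | 1 | 28 | 2800 | 2 | 2 | 405.2219 | 4 | 2016 | 217 | 1 | ... |  |  |  |  |  |  |  | 1 | 1 | 1 |
| 4 | 1 | 28 | 2800 | 2 | 2 | 384.3556 | 4 | 2016 | 217 | 1 | ... |  |  |  |  |  |  |  | 1 | 96 | 1 |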


## Handling Missing Data

Detecting and managing missing values is crucial for data analysis. To identify missing data within our DataFrame, we will use the [`isnull`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) function in combination with [`sum`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html). This approach shows the number of missing values in each column, which informs the data cleaning and preprocessing strategy.

In [8]:
missing_data = laborforce_df.isnull().sum()
print("Missing data:\n", missing_data)

Missing data:
 PUFREG             0
PUFPRV             0
PUFPRRCD           0
PUFHHNUM           0
PUFURB2K10         0
PUFPWGTFIN         0
PUFSVYMO           0
PUFSVYYR           0
PUFPSU             0
PUFRPL             0
PUFHHSIZE          0
PUFC01_LNO         0
PUFC03_REL         0
PUFC04_SEX         0
PUFC05_AGE         0
PUFC06_MSTAT       0
PUFC07_GRADE       0
PUFC08_CURSCH      0
PUFC09_GRADTECH    0
PUFC10_CONWR       0
PUFC11_WORK        0
PUFC12_JOB         0
PUFC14_PROCC       0
PUFC16_PKB         0
PUFC17_NATEM       0
PUFC18_PNWHRS      0
PUFC19_PHOURS      0
PUFC20_PWMORE      0
PUFC21_PLADDW      0
PUFC22_PFWRK       0
PUFC23_PCLASS      0
PUFC24_PBASIS      0
PUFC25_PBASIC      0
PUFC26_OJOB        0
PUFC27_NJOBS       0
PUFC28_THOURS      0
PUFC29_WWM48H      0
PUFC30_LOOKW       0
PUFC31_FLWRK       0
PUFC32_JOBSM       0
PUFC33_WEEKS       0
PUFC34_WYNOT       0
PUFC35_LTLOOKW     0
PUFC36_AVAIL       0
PUFC37_WILLING     0
PUFC38_PREVJOB     0
PUFC40_POCC        
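
The counts above are all zero, yet the `head` output shows empty cells in columns such as `PUFC33_WEEKS`. One plausible explanation, stated here as an assumption, is that unanswered items are stored as blank or whitespace-only strings, which `isnull` does not treat as missing. The sketch below counts such values and converts them to `NaN`. The helper `count_blank_strings` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def count_blank_strings(df: pd.DataFrame) -> pd.Series:
    """Count values per column that are empty or whitespace-only strings."""
    counts = {}
    for name in df.columns:
        is_blank = df[name].map(lambda v: isinstance(v, str) and v.strip() == "")
        counts[name] = int(is_blank.sum())
    return pd.Series(counts, dtype="int64")


# Toy frame for illustration only (assumed values, not survey data).
toy = pd.DataFrame({
    "PUFC11_WORK": ["1", " ", "2"],
    "PUFC05_AGE": [30, 41, 52],
})

blank_counts = count_blank_strings(toy)
print(blank_counts)

# Replace blank strings with NaN so that isnull() can detect them.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
missing_after = cleaned.isnull().sum()
print(missing_after)
```

If the same pattern holds in the real file, running `isnull().sum()` after the replacement would give a more realistic picture of missingness than the raw counts.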

# V. Exploratory data analysis

Perform exploratory data analysis comprehensively to gain a good understanding of your dataset. In this section of the notebook, you must present relevant numerical summaries and visualizations. Make sure that each code is accompanied by a brief explanation. The whole process should be supported with verbose textual descriptions of your procedures and findings.

# VI.  Initial model training

Use machine learning models to accomplish your chosen task (i.e., classification or regression) for the dataset. In this section of the notebook, please take note of the following:

• The project should train and evaluate at least 3 different kinds of machine learning models. The models should not be multiple variations of the same model, e.g., three neural network models with different number of neurons.

• Each model should be appropriate in accomplishing the chosen task for the dataset. There should be a clear and correct justification on the use of each machine learning model.

• Make sure that the values of the hyperparameters of each model are mentioned. At the minimum, the optimizer, the learning rate, and the learning rate schedule should be discussed per model.

• The report should show that the models are not overfitting nor underfitting.

# VII.  Error analysis

Perform error analysis on the output of all models used in the project. In this section of the notebook, you should:

• Report and properly interpret the initial performance of all models using appropriate evaluation metrics.

• Identify difficult classes and/or instances. For classification tasks, these are classes and/or instances that are difficult to classify. Hint: You may use a confusion matrix for this. For regression tasks, these are instances that produce high error.
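
As a rough illustration of the confusion-matrix hint, the sketch below builds a confusion table with `pd.crosstab` and ranks classes by recall so that the hardest classes appear first. The labels are made up for illustration, and the helpers `confusion_table` and `per_class_recall` are hypothetical; no model from this notebook is involved.

```python
import pandas as pd


def confusion_table(y_true, y_pred) -> pd.DataFrame:
    """Cross-tabulate actual labels (rows) against predicted labels (columns)."""
    return pd.crosstab(
        pd.Series(y_true, name="actual"),
        pd.Series(y_pred, name="predicted"),
    )


def per_class_recall(table: pd.DataFrame) -> pd.Series:
    """Return recall per actual class, sorted from lowest to highest."""
    recall = {}
    for label in table.index:
        # A class that was never predicted has no column, so it has 0 correct.
        correct = table.loc[label, label] if label in table.columns else 0
        recall[label] = correct / table.loc[label].sum()
    return pd.Series(recall).sort_values()


# Made-up labels for illustration only.
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 2, 1]

table = confusion_table(y_true, y_pred)
recall = per_class_recall(table)
print(table)
print(recall)
```

Classes at the top of the sorted recall series are the ones most often misclassified and are natural candidates for closer inspection.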

# VIII.  Improving model performance

Perform grid search or random search to tune the hyperparameters of each model. You should also tune each model to reduce the error in difficult classes and/or instances. In this section of the notebook, please take note of the following:

• Make sure to elaborately explain the method of hyperparameter tuning.

• Explicitly mention the different hyperparameters and their range of values. Show the corresponding performance of each configuration.

• Report the performance of all models using appropriate evaluation metrics and visualizations.

• Properly interpret the results based on relevant evaluation metrics.

# IX. Model performance summary

Present a summary of all model configurations. In this section of the notebook, do the following:

• Discuss each algorithm and the best set of values for its hyperparameters. Identify the best model configuration and discuss its advantage over other configurations.

• Discuss how tuning each model helped in reducing its error in difficult classes and/or instances.

# X. Insights and conclusions

Clearly state your insights and conclusions from training a model on the data. Why did some models produce better results? Summarize your conclusions to explain the performance of the models. Discuss recommendations to improve the performance of the model.

# XI. References

Cite relevant references that you used in your project. All references must be cited, including:

• Scholarly Articles – Cite in APA format and put a description of how you used it for your work.

• Online references, blogs, articles that helped you come up with your project – Put the website, blog, or article title, link, and how you incorporated it into your work.

• Artificial Intelligence (AI) Tools – Put the model used (e.g., ChatGPT, Gemini), the complete transcript of your conversations with the model (including your prompts and its responses), and a description of how you used it for your work.