# MCO 1 - 2012 Family Income and Expenditure Survey (FIES)
In this notebook, we explore income and expenditure behavior across Filipino households using the 2012 Family Income and Expenditure Survey (FIES) dataset. We focus on statistical inference, particularly confidence intervals and hypothesis tests for means, and also apply unsupervised learning techniques such as clustering to reveal patterns in household spending.

We aim to understand how households from different income groups allocate their spending across essential categories like food, education, and utilities.

The dataset, provided in the file `FIES_PUF_2012_Vol.1.CSV`, comes from the Philippine Statistics Authority. It contains anonymized microdata on household income from various sources (such as salaries, businesses, and remittances) and categorized expenditures (including food, housing, education, health, and utilities), along with demographic and geographic variables such as region and urban/rural classification. Household characteristics such as household size and number of earners are also included.

# Dataset Description

## Overview

The **Family Income and Expenditure Survey (FIES) 2012 Volume 1** is a comprehensive household-level dataset collected by the Philippine Statistics Authority (PSA). This dataset provides detailed information about Filipino families' income sources, expenditure patterns, and socio-demographic characteristics, serving as a critical resource for understanding household economic behavior and living standards in the Philippines.

## Data Collection Methodology

The FIES 2012 was conducted as a nationwide survey using a stratified multi-stage sampling design:

- **Survey Period**: 2012
- **Coverage**: National scope covering all regions of the Philippines
- **Sampling Method**: Stratified multi-stage cluster sampling with Primary Sampling Units (PSUs)
- **Data Collection**: Two-visit approach with structured questionnaires administered to selected households
- **Weighting**: Base weights and final weights (RFACT) provided to ensure national representativeness

The sampling design is captured in the dataset by the following variables:
- Regional stratification (W_REGN)
- Urban/Rural classification (URB)
- Stratum coding (RSTR)
- PSU identification for cluster sampling
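
Because each household carries a final weight (`RFACT`), population-level estimates are normally computed as weighted statistics rather than plain sample averages. The snippet below is a minimal sketch of that idea; the income and weight values are made up for illustration and are not taken from the FIES file.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented for illustration, not FIES records.
sample = pd.DataFrame({
    "TOINC": [100_000, 200_000, 400_000],  # total household income
    "RFACT": [300.0, 500.0, 200.0],        # final weight: households represented
})

# Plain sample mean treats every surveyed household equally.
unweighted_mean = sample["TOINC"].mean()

# Weighted mean scales each household by the number of households it represents.
weighted_mean = np.average(sample["TOINC"], weights=sample["RFACT"])

print(f"Unweighted mean: {unweighted_mean:,.2f}")
print(f"Weighted mean:   {weighted_mean:,.2f}")
```

The two figures differ whenever the weights are not uniform, which is why unweighted results should be read as describing the sample rather than the population.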

## Potential Implications and Limitations

### Sampling Implications
- **Representativeness**: The stratified sampling design ensures national and regional representativeness when proper weights are applied
- **Temporal Limitations**: Data reflects 2012 economic conditions and may not capture more recent economic changes
- **Reference Period Bias**: Income and expenditure data are based on recall periods (past six months for some variables), which may introduce recall bias
- **Seasonal Variations**: Data collection timing may not fully capture seasonal income fluctuations, particularly for agricultural households

### Analytical Considerations
- **Self-reporting Bias**: Income and expenditure data rely on household self-reporting, potentially leading to underreporting of income or misclassification of expenses
- **Informal Economy**: May underrepresent informal economic activities common in developing economies
- **Cultural Sensitivity**: Some expenditure categories (e.g., alcohol, tobacco) may be subject to social desirability bias

## Data Structure

### Basic Structure
- **Data Format**: Tabular/CSV format
- **Unit of Analysis**: Individual households
- **Number of Observations**: 40,171 households (rows) across 119 columns, as reported by `fies_df.info()`
- **Data Type**: Cross-sectional survey data

### Row and Column Representation
- **Rows**: Each row represents a unique household in the survey
- **Columns**: Each column represents a specific variable measuring household characteristics, income sources, or expenditure categories
- **Unique Identifiers**: Households are identified through multiple ID variables (W_REGN, W_OID, W_SHSN, W_HCN)
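
One way to confirm that the four ID variables jointly identify a household is to look for repeated key combinations. The sketch below uses a small toy frame; the last row is a deliberately invented duplicate so the check has something to find, and `n_duplicates` and `is_unique_key` are illustrative names.

```python
import pandas as pd

id_cols = ["W_REGN", "W_OID", "W_SHSN", "W_HCN"]

# Toy frame; the fourth row repeats the third on purpose (hypothetical data).
toy = pd.DataFrame({
    "W_REGN": [14, 14, 14, 14],
    "W_OID": [101001000, 101001000, 101001000, 101001000],
    "W_SHSN": [2, 3, 4, 4],
    "W_HCN": [25, 43, 62, 62],
})

# duplicated() marks every repeat of an already-seen key combination.
n_duplicates = int(toy.duplicated(subset=id_cols).sum())
is_unique_key = n_duplicates == 0

print(f"Duplicate key rows: {n_duplicates}")
print(f"Composite key is unique: {is_unique_key}")
```

On the real data, the same check could be run as `fies_df.duplicated(subset=id_cols).sum()`; a non-zero result would mean the key needs another column.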

## Key Attribute Categories

### 1. Identification and Sampling Variables
- **W_REGN**: Region code
- **W_OID**: Other unique identifier
- **W_SHSN**: Sample household serial number
- **W_HCN**: Household control number
- **URB**: Urban/Rural classification
- **BWEIGHT, RFACT**: Base and final sampling weights

### 2. Income Variables (17 categories)
**Employment Income:**
- **AGRI_SAL**: Agricultural sector wages and salaries
- **NONAGRI_SAL**: Non-agricultural sector wages and salaries
- **WAGES**: Combined agricultural and non-agricultural wages

**Other Income Sources:**
- **NETSHARE**: Net share from crops, livestock, and fishing
- **CASH_ABROAD**: Remittances and assistance from abroad
- **CASH_DOMESTIC**: Domestic cash assistance and support
- **RENTALS_REC**: Rental income from properties
- **INTEREST**: Interest from deposits and loans
- **PENSION**: Pension and retirement benefits
- **DIVIDENDS**: Investment dividends

**Entrepreneurial Income (11 categories):**
- **NET_CFG**: Crop farming and gardening
- **NET_LPR**: Livestock and poultry raising
- **NET_FISH**: Fishing activities
- **NET_RET**: Wholesale and retail trade
- **NET_MFG**: Manufacturing
- And 6 additional entrepreneurial categories

**Derived Income:**
- **EAINC**: Total entrepreneurial income
- **TOINC**: Total household income

### 3. Expenditure Variables (20+ categories)
**Food Expenditure (14 detailed categories):**
- **T_BREAD**: Bread and cereals
- **T_MEAT**: Meat products
- **T_FISH**: Fish and seafood
- **T_MILK**: Milk, cheese, and eggs
- **T_FRUIT, T_VEG**: Fruits and vegetables
- **T_FOOD_HOME**: Total food consumed at home
- **T_FOOD_OUTSIDE**: Food consumed outside home
- **T_FOOD**: Total food expenditure

**Non-Food Expenditure:**
- **T_CLOTH**: Clothing and footwear
- **T_HOUSING_WATER**: Housing, utilities, and water
- **T_TRANSPORT**: Transportation
- **T_HEALTH**: Healthcare
- **T_EDUCATION**: Education
- **T_RECREATION**: Recreation and culture
- **T_COMMUNICATION**: Communication services

**Derived Expenditure:**
- **T_TOTEX**: Total expenditure
- **T_TOTDIS**: Total disbursements

### 4. Household Demographics (15 variables)
- **FSIZE**: Family size
- **SEX, AGE, MS**: Sex, age, and marital status of the household head
- **HGC**: Education level of household head
- **MEMBERS**: Total family members
- **AGELESS5, AGE5_17**: Number of family members under 5 years old and aged 5 to 17
- **EMPLOYED_PAY, EMPLOYED_PROF**: Employment status of family members

### 5. Housing and Assets (20+ variables)
**Housing Characteristics:**
- **BLDG_TYPE**: Type of building
- **ROOF, WALLS**: Construction materials
- **TENURE**: Housing tenure status
- **TOILET**: Toilet facilities
- **ELECTRIC**: Electricity access
- **WATER**: Water source

**Asset Ownership (Quantities):**
- **TV_QTY, RADIO_QTY**: Entertainment devices
- **REF_QTY, WASH_QTY**: Household appliances
- **CAR_QTY, MOTORCYCLE_QTY**: Transportation assets
- **PC_QTY, CELLPHONE_QTY**: Technology assets

### 6. Derived Analysis Variables
- **PCINC**: Per capita income
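- **NATPC, NATDC**: National decile classifications (per capita income decile and total income decile, respectively)
- **REGPC, REGDC**: Regional decile classifications (per capita income decile and total income decile, respectively)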
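
To make the derived variables more concrete, the sketch below shows one way a per capita income and a decile rank could be computed. It assumes per capita income is total income divided by family size and ranks households without sampling weights, so it will not necessarily reproduce the published `PCINC` or decile columns. The data and the column names `pcinc_sketch` and `decile_sketch` are hypothetical.

```python
import pandas as pd

# Toy data: invented values chosen so every per capita income is distinct.
toy = pd.DataFrame({
    "TOINC": [60_000, 90_000, 120_000, 160_000, 200_000,
              300_000, 320_000, 400_000, 600_000, 1_000_000],
    "FSIZE": [6.0, 5.0, 4.0, 4.0, 4.0, 5.0, 4.0, 4.0, 3.0, 2.0],
})

# Assumption: per capita income = total household income / family size.
toy["pcinc_sketch"] = toy["TOINC"] / toy["FSIZE"]

# Unweighted deciles: qcut returns bin codes 0..9, shifted here to 1..10.
toy["decile_sketch"] = pd.qcut(toy["pcinc_sketch"], q=10, labels=False) + 1

print(toy)
```

A weighted decile assignment would rank households by cumulative `RFACT` instead of by row count, which is likely closer to how the official deciles are built.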

## Dataset Significance

This dataset is significant for the following:
- **Poverty Analysis**: Income and expenditure patterns for poverty measurement
- **Consumer Behavior Studies**: Detailed expenditure breakdowns across categories
- **Regional Economic Analysis**: Geographic variations in household economics
- **Policy Research**: Evidence base for social and economic policy development
- **Inequality Studies**: Income distribution and household welfare analysis

The comprehensive nature of the FIES 2012 dataset makes it a valuable resource for understanding household economic behavior, consumption patterns, and living standards in the Philippines during the 2012 period.

## Research Questions

### General Research Question:
What are the key differences in expenditure allocation (e.g., food, education, utilities) across income groups?

#### Supporting Research Questions:
1. What are the average and median incomes in each income group?
2. Which expenditure category takes up the largest portion of total expenses for each group?
3. Do wealthier households spend a higher or lower percentage of their income on basic needs like food and utilities?
4. Are low-income households more likely to prioritize essential expenses over discretionary (e.g., entertainment, travel) ones?
5. How does the ratio of entertainment spending to income change as income increases?
6. How does the ratio of education spending to income change as income increases?
7. Is there a statistically significant difference in food expenditure between the lowest and highest income groups?
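
Several of these questions come down to comparing expenditure shares across income groups. The following is a minimal sketch of that computation, assuming `NATPC` is used as the income-group label and shares are taken relative to `T_TOTEX`. The values are invented, and the choice of categories is only illustrative.

```python
import pandas as pd

# Toy data: two households in the lowest decile, two in the highest (invented values).
toy = pd.DataFrame({
    "NATPC": [1, 1, 10, 10],
    "T_FOOD": [60_000, 50_000, 150_000, 250_000],
    "T_EDUCATION": [2_000, 3_000, 40_000, 60_000],
    "T_TOTEX": [100_000, 100_000, 500_000, 1_000_000],
})

categories = ["T_FOOD", "T_EDUCATION"]

# Share of total expenditure per household, one column per category.
shares = toy[categories].div(toy["T_TOTEX"], axis=0)
shares["NATPC"] = toy["NATPC"]

# Average share per income group, and the largest category in each group.
mean_shares = shares.groupby("NATPC")[categories].mean()
largest_category = mean_shares.idxmax(axis=1)

print(mean_shares)
print(largest_category)
```

Averaging household-level shares gives each household equal influence; dividing group totals instead would give larger spenders more influence, so the two approaches can answer slightly different questions.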



# Import Libraries

For the statistical functions, we will be using `scipy`, specifically, the `stats` submodule. The [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/stats.html) module provides a number of probability distribution functions, summary and frequency statistics, correlation functions, statistical tests, and more.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import ttest_ind
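
As a rough sketch of how `norm` and `ttest_ind` might be used later, the example below builds a normal-approximation confidence interval for a mean and runs Welch's two-sample t-test. The two samples are simulated stand-ins for food expenditure in a low and a high income group, and `mean_ci` is a hypothetical helper rather than part of SciPy.

```python
import numpy as np
from scipy.stats import norm, ttest_ind

# Simulated stand-ins for two income groups (not FIES data).
rng = np.random.default_rng(0)
low = rng.normal(loc=50_000, scale=10_000, size=200)
high = rng.normal(loc=150_000, scale=40_000, size=200)


def mean_ci(x, confidence=0.95):
    """Normal-approximation confidence interval for the mean of x."""
    x = np.asarray(x, dtype=float)
    z = norm.ppf(1 - (1 - confidence) / 2)         # two-sided critical value
    margin = z * x.std(ddof=1) / np.sqrt(x.size)   # z * standard error
    return x.mean() - margin, x.mean() + margin


ci_low = mean_ci(low)
ci_high = mean_ci(high)

# Welch's t-test: does not assume the two groups share a variance.
t_stat, p_value = ttest_ind(low, high, equal_var=False)

print(f"95% CI, low group:  ({ci_low[0]:,.0f}, {ci_low[1]:,.0f})")
print(f"95% CI, high group: ({ci_high[0]:,.0f}, {ci_high[1]:,.0f})")
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

Setting `equal_var=False` is a cautious default here, since spending in high income groups is likely to vary more than in low income groups.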

## Family Income and Expenditure Data


In [5]:
fies_df = pd.read_csv('./Dataset/FIES_PUF_2012_Vol.1.CSV')
fies_df.head()

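| | W_REGN | W_OID | W_SHSN | W_HCN | URB | RSTR | PSU | BWEIGHT | RFACT | FSIZE | ... | PC_QTY | OVEN_QTY | MOTOR_BANCA_QTY | MOTORCYCLE_QTY | POP_ADJ | PCINC | NATPC | NATDC | REGDC | REGPC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14 | 101001000 | 2 | 25 | 2 | 21100 | 415052 | 138.25 | 200.6576 | 3.0 | ... | 1.0 | 1.0 | | | 0.946172 | 108417.0 | 9 | 8 | 8 | 9 |
| 1 | 14 | 101001000 | 3 | 43 | 2 | 21100 | 415052 | 138.25 | 200.6576 | 12.5 | ... | | 1.0 | | 1.0 | 0.946172 | 30631.6 | 5 | 9 | 9 | 4 |
| 2 | 14 | 101001000 | 4 | 62 | 2 | 21100 | 415052 | 138.25 | 200.6576 | 2.0 | ... | | 1.0 | | | 0.946172 | 86992.5 | 9 | 6 | 6 | 8 |
| 3 | 14 | 101001000 | 5 | 79 | 2 | 21100 | 415052 | 138.25 | 200.6576 | 4.0 | ... | | 1.0 | | | 0.946172 | 43325.75 | 6 | 6 | 6 | 6 |
| 4 | 14 | 101001000 | 10 | 165 | 2 | 21100 | 415052 | 138.25 | 200.6576 | 5.0 | ... | | | | 1.0 | 0.946172 | 37481.8 | 6 | 6 | 6 | 5 |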


Call the [`info()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method to check the number of rows and columns, the column dtypes, and the memory usage.

In [6]:
fies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40171 entries, 0 to 40170
Columns: 119 entries, W_REGN to REGPC
dtypes: float64(5), int64(92), object(22)
memory usage: 36.5+ MB
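
The `info()` output reports 22 `object` columns, which usually means those columns hold text rather than numbers. One plausible cause, treated here as an assumption, is that missing entries are stored as blank or whitespace-only strings. The sketch below shows how such columns could be coerced to numeric on a toy frame; the values are invented.

```python
import pandas as pd

# Toy frame: quantity columns stored as text with blank entries (assumed pattern).
toy = pd.DataFrame({
    "PC_QTY": pd.Series(["1", " ", "2"], dtype=object),
    "OVEN_QTY": pd.Series(["1", "1", ""], dtype=object),
    "TOINC": [100_000, 200_000, 300_000],
})

# Pick out the text-typed columns and coerce them; unparseable entries become NaN.
object_cols = toy.select_dtypes(include=["object"]).columns
cleaned = toy.copy()
cleaned[object_cols] = cleaned[object_cols].apply(pd.to_numeric, errors="coerce")

# Count how many values per column ended up missing after coercion.
missing_counts = cleaned[object_cols].isna().sum()

print(cleaned.dtypes)
print(missing_counts)
```

Before applying this to the real data, it is worth inspecting the distinct values of each `object` column, since some may be genuinely categorical and should not be coerced.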


Call the [`describe()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) method to compute summary statistics for the numeric columns.

In [7]:
fies_df.describe()

| | W_REGN | W_OID | W_SHSN | W_HCN | URB | RSTR | PSU | BWEIGHT | RFACT | FSIZE | ... | HSE_ALTERTN | TOILET | ELECTRIC | WATER | POP_ADJ | PCINC | NATPC | NATDC | REGDC | REGPC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | ... | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 | 40171.0 |
| mean | 13.01989 | 4210536000.0 | 9.633666 | 1563.601753 | 1.617311 | 21547.277215 | 258123.702099 | 340.330363 | 533.363298 | 4.699223 | ... | 1.94033 | 1.71813 | 1.131563 | 3.18603 | 0.942329 | 54324.33 | 5.233303 | 5.238306 | 5.445769 | 5.455129 |
| std | 11.995555 | 2285729000.0 | 6.198442 | 2977.363506 | 0.486049 | 3520.981146 | 112143.268816 | 112.377931 | 209.996517 | 2.19405 | ... | 0.236877 | 1.539145 | 0.338019 | 2.405758 | 0.038631 | 73721.11 | 2.874581 | 2.856486 | 2.866703 | 2.864137 |
| min | 1.0 | 101001000.0 | 1.0 | 1.0 | 1.0 | 2475.0 | 100010.0 | 92.25 | 126.1643 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.876132 | 2979.2 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 6.0 | 2239012000.0 | 4.0 | 95.0 | 1.0 | 21100.0 | 116384.0 | 271.5 | 399.615 | 3.0 | ... | 2.0 | 1.0 | 1.0 | 1.0 | 0.92445 | 19968.03 | 3.0 | 3.0 | 3.0 | 3.0 |
| 50% | 10.0 | 4112005000.0 | 9.0 | 204.0 | 2.0 | 22100.0 | 216212.0 | 329.75 | 509.8749 | 4.5 | ... | 2.0 | 1.0 | 1.0 | 3.0 | 0.940724 | 33369.75 | 5.0 | 5.0 | 5.0 | 5.0 |
| 75% | 14.0 | 6210006000.0 | 14.0 | 393.0 | 2.0 | 23200.0 | 316519.0 | 428.71 | 634.1608 | 6.0 | ... | 2.0 | 2.0 | 1.0 | 4.0 | 0.961401 | 61758.67 | 8.0 | 8.0 | 8.0 | 8.0 |
| max | 42.0 | 9804035000.0 | 30.0 | 8026.0 | 2.0 | 29000.0 | 416581.0 | 1630.2 | 2895.8149 | 20.5 | ... | 2.0 | 7.0 | 2.0 | 12.0 | 1.058416 | 3231120.0 | 10.0 | 10.0 | 10.0 | 10.0 |


Start here for Data Cleaning

Start here for Q1

Start here for Q2

Start here for Q3

Start here for Q4

Start here for Q5

Start here for Q6

Start here for Q7