# Happiness Survey Analysis Pipeline

This notebook runs the end-to-end pipeline for processing and analyzing the FPT Polytechnic student happiness survey data.

## 1. Imports and Setup

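First, we import the necessary libraries and add both the `src` directory and the project root to `sys.path`, so that the custom `etl` and `analytics` modules can be imported regardless of where the notebook is launched from.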

In [7]:
import os
import sys
from pathlib import Path

# Resolve project root for cross-platform, device-independent paths
_cwd = Path.cwd()
if (_cwd / "main.ipynb").exists():
    PROJECT_ROOT = _cwd.parent
    SRC_ROOT = _cwd
elif (_cwd / "src" / "main.ipynb").exists():
    PROJECT_ROOT = _cwd
    SRC_ROOT = _cwd / "src"
else:
    PROJECT_ROOT = _cwd
    SRC_ROOT = _cwd

# Ensure paths for imports: src for etl/analytics, project root for src.config
for p in (str(SRC_ROOT), str(PROJECT_ROOT)):
    if p not in sys.path:
        sys.path.insert(0, p)

from etl.processor import DataProcessor
from analytics.analyzer import DataAnalyzer

## 2. Run ETL Process

This step loads the raw data, cleans it, applies transformations (such as renaming columns and converting data types), and saves the processed data to the `data/processed` directory.
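
The internals of `DataProcessor` are not shown in this notebook. As a rough sketch of the rename, convert, and clean sequence described above, the steps might look like the following. It assumes pandas; `COLUMN_MAP` and `run_etl_sketch` are hypothetical names used only for illustration, not part of the project's code.

```python
import pandas as pd

# Hypothetical mapping from raw survey headers to short names (illustration only).
COLUMN_MAP = {"Timestamp": "timestamp", "GPA": "dem_gpa"}


def run_etl_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, convert dtypes, and drop rows that fail conversion."""
    df = raw.rename(columns=COLUMN_MAP)
    # Unparseable values become NaT / NaN instead of raising an exception.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df["dem_gpa"] = pd.to_numeric(df["dem_gpa"], errors="coerce")
    # Keep only rows where both conversions succeeded.
    return df.dropna(subset=["timestamp", "dem_gpa"]).reset_index(drop=True)


raw = pd.DataFrame(
    {
        "Timestamp": ["2026-01-20 10:01:33", "not a date", "2026-01-20 11:41:38"],
        "GPA": ["8.5", "7.0", "n/a"],
    }
)
clean = run_etl_sketch(raw)
print(clean)
```

A final `clean.to_csv(output_path, index=False)` call would cover the save step; whether the real `process` method writes the index is not visible from this notebook.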

In [8]:
# --- ETL Step ---
print("--- Starting ETL Process ---")

# Define file paths (Path objects work on Windows, macOS, Linux)
raw_data_path = PROJECT_ROOT / 'data' / 'raw' / 'fpoly_survey.csv'
processed_data_path = PROJECT_ROOT / 'data' / 'processed' / 'fpoly_survey_processed.csv'

if not raw_data_path.exists():
    print(f"ERROR: Raw data file not found at {raw_data_path}")
else:
    # Instantiate the processor and run the ETL process
    data_processor = DataProcessor(file_path=raw_data_path)
    # The 'process' method will load, clean, transform, and save the data.
    processed_df_head = data_processor.process(output_path=processed_data_path)
    print("--- ETL Process Finished ---")
    print("Data preview 👀:")
    display(processed_df_head)

--- Starting ETL Process ---
Loading data...
Renaming columns...
Cleaning data...
🚀 Starting the ETL pipeline...
🔄 Reversed scores for column: aca_deadline_pressure
🔄 Reversed scores for column: fin_living_cost_worry
✅ ETL complete. Clean data ready: 124 rows.
📂 Preparing to save data to: c:\Users\OMEN\Desktop\Learn\FPT\Working_skill\data\processed\fpoly_survey_processed.csv...
✅ Data saved successfully.
--- ETL Process Finished ---
Data preview 👀:


  self.data['timestamp'] = pd.to_datetime(self.data['timestamp'], errors='coerce')


Unnamed: 0,timestamp,dem_major,dem_semester,dem_gpa,dem_residence,hap_general_satisfaction,hap_school_energy,hap_meaningful_life,hap_loyalty_choice,aca_curriculum_fit,...,env_utilities,env_dynamic_culture,soc_friendship_support,soc_activity_integration,soc_family_support,fin_tuition_value,fin_living_cost_worry,fin_job_prospects,wish,Column 27
0,2026-01-20 10:01:33,Ngành Công Nghệ Thông Tin,5,8.5,Ở với gia đình,4,3,4,4,4,...,3,4,5,3,4,3,2,3,,
1,2026-01-20 10:14:55,Thiết kế đồ họa,5,8.5,KTX,4,2,2,1,3,...,3,4,3,2,4,3,3,1,trường bớt drama lại. ngoại trừ việc dạy kiến ...,
3,2026-01-20 11:41:38,Ngành Công Nghệ Thông Tin,5,7.5,Ở trọ,3,4,3,3,3,...,3,3,4,3,4,3,1,3,,
4,2026-01-20 11:42:55,Khác,5,8.5,Ở với gia đình,4,4,5,5,5,...,3,4,5,2,5,2,2,3,Trường siết chặt hơn về đánh giá năng lực của ...,
6,2026-01-20 12:37:46,Ngành Công Nghệ Thông Tin,5,9.5,Ở với gia đình,5,5,4,4,3,...,4,4,4,3,5,4,3,4,Có nhiều cuộc thi hơn cho ngành cntt nói chung...,
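
The ETL log reports that scores were reversed for `aca_deadline_pressure` and `fin_living_cost_worry`, which are negatively worded items. The exact formula used by `DataProcessor` is not shown; a common approach, assuming a 1-5 Likert scale (an assumption consistent with the values in the preview, but not confirmed by the source), is to subtract each answer from the sum of the scale endpoints. `reverse_score` is a hypothetical helper name.

```python
import pandas as pd

# Assumption: survey items are rated on a 1-5 Likert scale.
SCALE_MIN, SCALE_MAX = 1, 5


def reverse_score(series: pd.Series) -> pd.Series:
    """Flip a Likert item so that a higher value always means a better outcome."""
    return (SCALE_MIN + SCALE_MAX) - series


df = pd.DataFrame(
    {"aca_deadline_pressure": [1, 2, 5], "fin_living_cost_worry": [4, 3, 1]}
)
for col in ("aca_deadline_pressure", "fin_living_cost_worry"):
    df[col] = reverse_score(df[col])

print(df)
```

After reversal, a high value on either column indicates low pressure or low worry, so these items can be averaged together with the positively worded ones.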


## 3. Run Analysis

Now that we have clean data, this step runs the various analyses defined in the `DataAnalyzer` class and prints the final report.
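
The report produced by this step contains keys such as `factor_scores` and `correlations`. How `DataAnalyzer` computes them is not shown here. The sketch below is one plausible reading, under two loudly stated assumptions: each factor is the row-wise mean of the columns sharing a prefix (`aca_`, `env_`, `soc_`, `fin_`), and each correlation is the Pearson correlation between a factor and the mean of the `hap_` columns. `FACTOR_PREFIXES`, `factor_frame`, and `summarize` are hypothetical names.

```python
import pandas as pd

# Assumption: factors are identified by column-name prefix.
FACTOR_PREFIXES = {
    "Academic": "aca_",
    "Environment": "env_",
    "Social": "soc_",
    "Finance": "fin_",
}


def factor_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one column per factor plus an overall 'happiness' column."""
    out = pd.DataFrame(index=df.index)
    for name, prefix in FACTOR_PREFIXES.items():
        # Row-wise mean over every column whose name starts with the prefix.
        out[name] = df.filter(regex=f"^{prefix}").mean(axis=1)
    out["happiness"] = df.filter(regex="^hap_").mean(axis=1)
    return out


def summarize(df: pd.DataFrame) -> dict:
    """Mean score per factor and its Pearson correlation with happiness."""
    factors = factor_frame(df)
    names = list(FACTOR_PREFIXES)
    return {
        "factor_scores": factors[names].mean().round(2).to_dict(),
        "correlations": factors[names]
        .corrwith(factors["happiness"])
        .round(2)
        .to_dict(),
    }


# Tiny made-up sample, not taken from the survey data.
sample = pd.DataFrame(
    {
        "hap_general_satisfaction": [4, 2, 3, 5],
        "aca_curriculum_fit": [4, 2, 3, 5],
        "env_utilities": [3, 3, 4, 4],
        "soc_family_support": [5, 4, 4, 5],
        "fin_tuition_value": [2, 3, 3, 1],
    }
)
report = summarize(sample)
print(report)
```

Grouping by prefix keeps the sketch independent of the exact item list, which is elided in the data preview.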

In [9]:
# --- Analysis Step ---
import pprint


print("--- Starting Analysis Process ---")

if not processed_data_path.exists():
    print(f"ERROR: Processed data file not found at {processed_data_path}")
else:
    # Instantiate the analyzer with the processed data
    data_analyzer = DataAnalyzer(file_path=processed_data_path)

    # Run the analysis
    analysis_report = data_analyzer.analysis()

    # Print the final report
    print("--- Analysis Report ---")
    pprint.pprint(analysis_report)
    print("--- Analysis Finished ---")

--- Starting Analysis Process ---
📊 Analyzing happiness indicators...
✅ Analysis complete.
--- Analysis Report ---
{'ahs_overall': np.float64(3.7),
 'correlations': {'Academic': np.float64(0.6),
                  'Environment': np.float64(0.55),
                  'Finance': np.float64(0.68),
                  'Social': np.float64(0.54)},
 'factor_scores': {'Academic (X1)': np.float64(3.32),
                   'Environment (X2)': np.float64(3.72),
                   'Finance (X4)': np.float64(3.37),
                   'Social (X3)': np.float64(3.86)},
 'gpa_happiness_correlation': {'5.0-6.5': 3.57,
                               '6.5-8.0': 3.7,
                               '<5.0': 3.25,
                               '>8.0': 3.76},
 'nhs_percentage': 40.32,
 'residence_stress_index': {'KTX': 2.62,
                            'Nhà riêng': 2.5,
                            'Ở trọ': 2.28,
                            'Ở với gia đình': 2.84},
 'retenti