# <center>Life Expectancy (WHO) Dataset</center>

# Project Objective
Analyze and predict average life expectancy based on economic, health, and social factors from the World Health Organization (WHO) dataset.

---

# Student Information

**Student 1:**
- Full Name: Cao Trần Bá Đạt
- Student ID: 23127168

**Student 2:**
- Full Name: Trần Hoài Thiện Nhân
- Student ID: 23127238

**Student 3:**
- Full Name: Bùi Nam Việt
- Student ID: 23127516

**Class:** 23KHDL

---
# Table of Contents
1. [Project Summary](#c6)
    - [1.1 Key Findings](#c61)
    - [1.2 Limitations](#c62)
    - [1.3 Future Directions (If You Had More Time)](#c63)
    - [1.4 Individual Reflections](#c64)
        - [1.4.1 Student 1 - Cao Trần Bá Đạt](#c641)
        - [1.4.2 Student 2 - Trần Hoài Thiện Nhân](#c642)
        - [1.4.3 Student 3 - Bùi Nam Việt](#c643)
2. [References](#c7)

---

<a id="c6"></a>
# 1. Project Summary

<a id="c61"></a>
## 1.1 Key Findings

*List 3-5 most important insights from your analysis:*
- Poverty is not Destiny: Our analysis of RQ1 identified a distinct group of "Low GDP Overachievers." Despite having low income levels comparable to "Underachievers," these nations attained life expectancies on par with High-Income countries. The key differentiator was strong public health performance, in particular a vaccination coverage of ~88%, close to the 91% benchmark of developed nations.
- GDP impacts life expectancy: An increase in GDP is associated with higher life expectancy in low-income countries, but it has no significant impact in high-income countries.
- The Education Lag Paradox: Data revealed an interesting paradox: The "Stagnant" group actually possessed better baseline Schooling metrics than the "Rapid Improvers" in the year 2000, yet they still fell behind in life expectancy. This suggests that within a short 15-year window, Education acts as a long-term structural driver, whereas Preventive Medicine (Vaccines, ARVs) acts as an immediate survival mechanism that creates instant impact.
- Disease control: Analyzing growth drivers (2000-2015) revealed that "Rapid Improvers" (countries gaining >10 years of life expectancy) were not defined by economic booms, but by a successful "dual-strategy" of health interventions: Suppressing the HIV/AIDS epidemic (sharp reduction in incidents) and Expanding vaccine coverage (increasing by an average of 13%). "Stagnant" countries failed to implement these critical measures effectively.

*Highlight the most interesting or surprising discovery:*
- The most striking discovery from our Mortality Matrix analysis was the revelation that a poor nation can statistically "outperform" a significantly richer one in health outcomes through effective immunization.
- Specifically: The group of "Low Income but High Vaccination (>90%)" countries recorded a child mortality rate of 57.0, which was lower than the "Medium Income but Low Vaccination (<80%)" group at 62.1.
- This finding shatters the conventional "Wealth equals Health" paradigm. It demonstrates that vaccines serve as the "Great Equalizer," allowing developing nations to leapfrog the middle-income trap and achieve advanced health standards immediately, without waiting for economic prosperity.

<a id="c62"></a>
## 1.2 Limitations

*Dataset Limitations:*

- **Sample size**: With roughly 2,900 observations (179 countries $\times$ 16 years, 2000-2015), the dataset is relatively small for high-dimensional machine learning. This increases the risk that the model memorized specific country profiles rather than learning generalizable global health rules.
- **Biases**: The data relies on self-reporting from various nations. Developing countries (often those with lower life expectancy) may have less rigorous data collection infrastructure, leading to potential underreporting of specific diseases or mortality rates compared to developed nations.
- **Missing data**: To handle missing values in features like *Hepatitis B* or *GDP*, we used mean/median imputation. This artificial filling of data reduces variance and may smooth over critical outliers that represent genuine health crises in specific years.
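
The variance-shrinking effect of imputation mentioned above can be illustrated with a minimal sketch. The column below is synthetic (a hypothetical stand-in for a feature such as *Hepatitis B* coverage), and the 30% missing rate is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for a coverage-style feature.
rng = np.random.default_rng(0)
coverage = pd.Series(rng.normal(loc=80, scale=10, size=200))

# Knock out 30% of the values to mimic missing reports (assumed rate).
missing_pos = rng.choice(len(coverage), size=60, replace=False)
with_gaps = coverage.copy()
with_gaps.iloc[missing_pos] = np.nan

# Median imputation: every gap receives the same value.
imputed = with_gaps.fillna(with_gaps.median())

std_observed = with_gaps.std()  # NaNs are skipped by default
std_imputed = imputed.std()
print(f"std of observed values: {std_observed:.2f}")
print(f"std after imputation:   {std_imputed:.2f}")
```

Because all filled values sit at the median, the spread of the imputed column is smaller than the spread of the values that were actually observed.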

*Analysis Limitations:*

- **Methodology constraints**: While Random Forest excellently identifies *correlations* (e.g., HIV predicts low life expectancy), it cannot prove *causality*. We cannot definitively claim that increasing GDP will *cause* life expectancy to rise without controlling for confounding variables (like government stability or culture).
- **Unanswered aspects**: The model connects high GDP/Expenditure to high life expectancy, but it fails to answer why some lower-income nations (like Vietnam or Cuba) achieve "First World" health outcomes with "Third World" budgets. The model captures resources, but leaves the question of resource efficiency unanswered.

*Scope Limitations:*

- What we couldn't address:
    * **Granularity (The "Average" Fallacy):** The analysis operates at a **national level**, which masks internal inequality. A country with a high average life expectancy might still have specific regions with very poor health outcomes, which this model cannot detect.
    * **Temporal Dynamics Ignored:** We treated each year as an independent observation (Cross-Sectional approach). The current scope did not account for **Time-Series** trends (e.g., how the life expectancy of a country in 2014 is dependent on its state in 2013).
    * **Omission of Qualitative Factors:** The model is limited to quantitative socio-economic metrics. It does not account for qualitative factors such as political stability, war, cultural diet habits, or the quality of the healthcare system (only the *expenditure* on it).
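
One way the temporal dependence noted above could be brought into the data is a per-country lag feature. The sketch below uses a tiny made-up panel; the column names `Country`, `Year`, and `Life_expectancy` are assumptions, not necessarily the names used in the notebook.

```python
import pandas as pd

# Hypothetical long-format panel: one row per country and year.
panel = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2013, 2014, 2015, 2013, 2014, 2015],
    "Life_expectancy": [70.0, 70.5, 71.0, 55.0, 56.0, 57.5],
})

# Sort so that shift(1) really means "previous year of the same country".
panel = panel.sort_values(["Country", "Year"]).reset_index(drop=True)
panel["Life_expectancy_prev"] = panel.groupby("Country")["Life_expectancy"].shift(1)
panel["Yearly_change"] = panel["Life_expectancy"] - panel["Life_expectancy_prev"]
print(panel)
```

The first year of each country has no predecessor and stays `NaN`, so those rows would need to be dropped or handled before modeling.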

<a id="c63"></a>
## 1.3 Future Directions (If You Had More Time)

*Additional Questions to Explore:*
- The "Quality" vs. "Quantity" of Life: Does living longer mean living healthier? We would explore HALE ([Health-adjusted Life Expectancy](https://www.who.int/data/gho/indicator-metadata-registry/imr-details/7752)) to see if "Rapid Improvers" are just extending years of life or actually improving the quality of those years.
- The Gender Gap: How do the drivers of life expectancy differ between men and women? Do vaccines benefit both genders equally, or are there cultural barriers affecting women in certain low-income regions?

*Deeper Analysis:*
- Lag-Effect Analysis (Time-Series Analysis): Currently, we compare 2000 vs. 2015 directly. A deeper analysis would use Cross-Correlation Functions to determine exactly how many years it takes for a 10% increase in GDP or Education to translate into a 1-year increase in life expectancy.
- Cluster Analysis (Unsupervised Learning): Instead of manually categorizing "Overachievers" based on quartiles, we would use K-Means Clustering to let the data naturally group countries with similar trajectories. This could reveal hidden patterns or subgroups we missed.
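
As a rough illustration of the lag-effect idea above, the following sketch correlates an outcome with a driver shifted by several candidate lags and picks the strongest one. The data are synthetic with a 3-year lag built in by assumption; real series would first need differencing and per-country handling.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_years = 40
true_lag = 3  # assumption built into the toy data

# Hypothetical yearly changes in a driver (e.g. schooling growth).
driver = pd.Series(rng.normal(size=n_years))
# The outcome reacts to the driver `true_lag` years later, plus small noise.
outcome = driver.shift(true_lag) + rng.normal(scale=0.1, size=n_years)

# Correlate the outcome with the driver shifted by each candidate lag.
# Series.corr ignores pairs where either value is NaN.
lag_corr = {lag: outcome.corr(driver.shift(lag)) for lag in range(0, 8)}
best_lag = max(lag_corr, key=lag_corr.get)
print(best_lag, round(lag_corr[best_lag], 3))
```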

*Alternative Methods/Approaches:*
- Causal Inference ([Propensity Score Matching](https://www.datacamp.com/tutorial/propensity-score)): To strictly prove that vaccines cause mortality reduction (rather than just correlating), we would use Propensity Score Matching. We could match countries with similar GDP/Education but different vaccination rates to isolate the true effect of the vaccine policy.
- Geospatial Analysis: Visualizing the "Rapid Improvers" on a map to detect regional contagion effects. For example, did the success of Rwanda influence its neighbors in East Africa?

*Additional Data to Seek:*
- Healthcare Expenditure Efficiency: Not just how much money is spent, but how it is spent. Data on Out-of-pocket expenditure vs. Government expenditure would help understand if financial protection for the poor is a key driver.
- Infrastructure Metrics: Access to Clean Water and Sanitation is a major confounder. Adding this data would sharpen our comparison between GDP and Vaccines, as water infrastructure is often funded by GDP growth.

*Project Expansion/Improvement:*
- Policy Simulation Tool: Convert the analysis into an interactive dashboard (using Streamlit or PowerBI) where policymakers can adjust sliders (e.g., "Increase Vaccine by 10%") to see the predicted impact on Life Expectancy based on our regression models.

<a id="c64"></a>
## 1.4 Individual Reflections

<a id="c641"></a>
### 1.4.1 Student 1: Cao Trần Bá Đạt

**Challenges & Difficulties Encountered:**

*Specific obstacles faced:*
- Technical: Handling the data transformation from a "Long format" (yearly rows) to a "Wide format" (columns for 2000 and 2015) was tricky. Specifically, flattening the MultiIndex columns after using the pivot() function in Pandas required learning new syntax.
- Analytical: Deciding how to categorize countries ("Overachievers" vs. "Rapid Improvers") was not straightforward. Setting arbitrary thresholds could lead to bias, so we experimented with quantiles (Top 25% vs. Bottom 25%) so that the group boundaries came from the data itself rather than from hand-picked cutoffs.
- Conceptual: Grasping the concept of "Policy Latency" (Time Lag) was intellectually challenging. I initially struggled to reconcile why Education—a universally accepted pillar of development—showed a weaker correlation with Life Expectancy growth compared to Vaccination in our 15-year window. It felt counter-intuitive.

*How I overcame them:*
- To solve the data structure issue, I used the `pivot` method combined with a list comprehension to flatten the column names (e.g., creating Life_2015 and Life_2000). For the categorization, instead of looking at raw values, I calculated the "Delta" ($\Delta$), the absolute change between 2000 and 2015, which allowed a fair comparison of growth speed regardless of the starting point.

- For the conceptual challenge, I refined my framework to distinguish between "Immediate Survival Drivers" (like vaccines and HIV drugs, which save lives almost instantly) and "Long-term Structural Investments" (like Education, which takes a generation to yield health benefits). This shift in perspective allowed me to interpret the data accurately: Education isn't ineffective; it's just slower acting.
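
The long-to-wide reshaping described above might look roughly like the sketch below. The tiny table and the column names (`Life`, `Schooling`) are made up for illustration and are not the notebook's actual data.

```python
import pandas as pd

# Hypothetical long-format slice: one row per country and year.
long_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2015, 2000, 2015],
    "Life": [50.0, 62.0, 75.0, 78.0],
    "Schooling": [5.0, 7.5, 11.0, 12.0],
})

# Pivoting several value columns at once yields MultiIndex columns
# such as ("Life", 2000) and ("Life", 2015).
wide = long_df.pivot(index="Country", columns="Year", values=["Life", "Schooling"])

# Flatten the MultiIndex with a list comprehension: ("Life", 2015) -> "Life_2015".
wide.columns = [f"{name}_{year}" for name, year in wide.columns]

# Delta: absolute change between the two snapshot years.
wide["Life_delta"] = wide["Life_2015"] - wide["Life_2000"]
print(wide)
```

Comparing `Life_delta` rather than the raw 2015 level puts a country that started at 50 years and one that started at 75 years on the same footing.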

*Most challenging aspect and why:*
- The most challenging part was moving from correlation to implied causality in the GDP vs. Vaccine analysis. It is easy to plot a scatter plot, but proving that Vaccines are more important than GDP required designing a specific visualization: the Binned Heatmap (Matrix of Mortality). This was difficult because it required segmenting continuous data into logical categories (bins) to make a direct comparison possible.
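
A binned matrix of this kind can be sketched as follows. All numbers are invented for illustration, and the bin edges and labels are assumptions; the notebook's actual thresholds may differ.

```python
import pandas as pd

# Hypothetical country-level values, purely illustrative (not project data).
df = pd.DataFrame({
    "GDP_per_capita": [400, 650, 900, 1500, 3000, 4200, 9000, 15000, 800, 5000, 30000],
    "Vaccination": [95, 70, 92, 75, 93, 60, 96, 97, 85, 88, 99],
    "Under_five_deaths": [55, 110, 60, 90, 30, 65, 8, 5, 80, 35, 4],
})

# Continuous variables -> ordered categories (edges and labels are assumptions).
df["Income_bin"] = pd.qcut(df["GDP_per_capita"], q=3, labels=["Low", "Medium", "High"])
df["Vaccine_bin"] = pd.cut(df["Vaccination"], bins=[0, 80, 90, 100],
                           labels=["<=80%", "80-90%", ">90%"])

# Each cell holds the mean mortality of the countries in that bin pair.
matrix = df.pivot_table(index="Income_bin", columns="Vaccine_bin",
                        values="Under_five_deaths", aggfunc="mean", observed=False)
print(matrix)
```

The resulting table can be passed to a heatmap function so that cells from different income rows are compared directly.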


**Learning & Growth:**

*What I learned:*
- Technical skills: I mastered advanced Pandas techniques, specifically pivot_table, qcut for binning data, and creating complex multi-panel visualizations (Heatmaps + Bubble Charts) using Seaborn.
- Analytical approaches: I learned the "Dual-Layer Comparison" strategy: first comparing Overachievers vs. Underachievers (internal), then comparing Overachievers vs. High Income (external). This provides a much more holistic view than simple regression.
- Domain knowledge:
    - Development Economics: I gained a nuanced understanding of the limits of GDP. I learned that economic growth does not automatically translate to better health outcomes ("trickle-down" health). Instead, targeted Public Health Interventions (like the Expanded Programme on Immunization) act as a much more efficient "equalizer" for poor nations.

    - Epidemiology Context: I understood the devastating historical context of the HIV/AIDS epidemic in the early 2000s and how the introduction of antiretroviral therapies created the "Phoenix Effect" (rapid life expectancy recovery) we observed in the data.

*What surprised me most:*
- I was genuinely surprised by the "Education Lag." I assumed education would be the primary driver of life expectancy, but the data showed that HIV control and Vaccination had a much faster, more immediate impact in the 2000-2015 period. Also, seeing that "Poor but Vaccinated" countries had lower child mortality than richer but poorly vaccinated countries changed my perspective on the power of GDP.

*How this project shaped my understanding of data science:*
- This project taught me that Data Science is not just about coding; it is about Evidence Design. The code is just a tool to tell a story. The ability to calculate a simple number (like the -0.43 correlation for vaccines vs. -0.11 for GDP) can carry more weight in policy-making than complex machine learning models if presented correctly.

<a id="c642"></a>
### 1.4.2 Student 2: Trần Hoài Thiện Nhân

**Challenges & Difficulties Encountered:**

*Specific obstacles faced:*

**Technical:**
- **Dataset Search & Selection**: Had to evaluate multiple WHO life expectancy datasets on Kaggle. The original dataset had significant issues with missing values and inconsistencies. After comparing several versions, I selected "Life Expectancy WHO Updated" by Lasha Gochiashvili, which corrected errors from KumarRajarshi's original dataset.
- **Single Notebook Integration**: Unlike typical projects with separate files, all code had to coexist in one `main.ipynb`. This created two challenges: ensuring the cell execution order was foolproof for teammates, and coordinating with them to avoid duplicate preprocessing code.

**Analytical:**
- **Preston Curve Modeling (Q4)**: Implementing `scipy.optimize.curve_fit` for exponential models was new to me.
- **Confounding Variables (Q6 & Q7)**: Difficult to isolate individual factor effects. In Q6, wealthy countries have both high vaccination rates AND high GDP, so which one drives infant mortality reduction? Had to learn multiple regression with control variables, but interpreting coefficients in the presence of multicollinearity was tricky.
- **Counter-Intuitive Findings (Q7)**: Encountered a paradox where developed nations showed positive correlations between risk factors (Alcohol, BMI) and life expectancy. Initially suspecting a spurious correlation or data error, I performed a literature review which revealed healthcare quality as a massive confounding variable. I learned that advanced medical systems in wealthy nations effectively "mask" the mortality risks of lifestyle diseases (diseases of affluence).

**Conceptual:**
- **Team Plan & Work Distribution**: As project coordinator, I created the team structure using Google Docs:
  - Defined clear task ownership
  - Set milestone deadlines with buffer time for integration
- **Skeleton/Template Design**: Created the notebook structure for the team to follow:
  - Difficulty: Making it detailed enough to be helpful but flexible enough to allow creativity

*How I overcame them:*

**Technical Solutions:**
- **Dataset Validation**:
  - Compared sample country data against WHO official statistics
  - Validated that the "corrected" columns had fewer nulls than the original, or none at all
  - Documented data source and license in Section 1 for transparency
- **Notebook Organization Strategy**:
  - Used clear markdown headers as section dividers
  - Added "reset kernel and run all" checks before sharing with teammates
- **Coordination via Google Docs**: Established a central command hub to manage the complex workflow:
  - **Live Task Board**: Implemented the "Task | Who | Deadline | Status" table to track real-time progress and dependencies
  - **Sprint Documentation**: Recorded meeting minutes and strategic direction for each Scrum cycle to keep the team aligned on technical goals

**Analytical Strategies:**
- **Iterative Modeling for Q4**:
  - Researched "Preston Curve" economic theory
  - Tried multiple transformations: log(GDP), polynomial, exponential
  - Used residual plots and $R^2$ to select best model.
- **Stratified Analysis for Q6 & Q7**:
  - Split dataset by Development Status BEFORE regression
  - Ran separate models for Developed vs Developing countries
  - Compared coefficients to understand context-dependent effects
  - Example: Vaccination impact stronger in Developing nations (where baseline is lower)
- **Literature Review**: Read published papers to validate findings:
  - Preston Curve papers showed the GDP effect plateaus around \$15k-20k per capita (matched my analysis)
  - Public health studies confirmed infant mortality is a leverage point for life expectancy
  - Epidemiology journals explained the affluence paradox (healthcare offsets bad habits)
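
The iterative modeling for Q4 listed above relied on `scipy.optimize.curve_fit`. A minimal sketch of such a fit is shown below; the saturating-exponential form, its parameter names, and the synthetic data are all assumptions for illustration, not the notebook's actual model or results.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_exp(gdp_k, ceiling, gap, rate):
    """Life expectancy approaches `ceiling` as GDP (thousand USD) grows."""
    return ceiling - gap * np.exp(-rate * gdp_k)

# Synthetic data generated from assumed parameters, NOT the project's dataset.
rng = np.random.default_rng(0)
gdp_k = np.linspace(0.5, 60, 120)
life = saturating_exp(gdp_k, 80.0, 25.0, 0.15) + rng.normal(scale=0.5, size=gdp_k.size)

# p0 gives the optimiser a sensible starting point.
params, _ = curve_fit(saturating_exp, gdp_k, life, p0=[75.0, 20.0, 0.1])
ceiling, gap, rate = params

# Coefficient of determination for the fitted curve.
pred = saturating_exp(gdp_k, *params)
r2 = 1 - np.sum((life - pred) ** 2) / np.sum((life - life.mean()) ** 2)
print(f"ceiling={ceiling:.1f}, gap={gap:.1f}, rate={rate:.3f}, R^2={r2:.3f}")
```

The fitted `ceiling` plays the role of the plateau: beyond the GDP level where the exponential term is negligible, extra income adds almost nothing to predicted life expectancy.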

*Most challenging aspect and why:*

**Formulating Meaningful Research Questions**

**Why this was the hardest challenge:**

While technical implementation and code integration presented their difficulties, the most intellectually demanding aspect was crafting research questions that were simultaneously meaningful, sufficiently deep, and practically useful.

**The core difficulties:**

1. **Balancing Depth vs Accessibility**: Questions needed to be sophisticated enough to demonstrate analytical skills, yet interpretable enough to yield actionable insights.

2. **Avoiding "Correlation Theater"**: Easy to find statistical relationships that mean nothing. The real challenge was asking questions that could reveal *causal mechanisms* or *policy-relevant patterns*, not just "X correlates with Y."

3. **Domain Knowledge Gap**: Understanding what questions *matter* in public health epidemiology required extensive background research:
   - What have researchers already established?
   - What debates exist in the field?
   - What would WHO policymakers actually care about?

4. **Data Limitations Constraint**: Had to reverse-engineer questions based on available variables. Many interesting hypotheses (healthcare spending efficiency, genetic factors, climate impact) were unanswerable with this dataset.

**Examples of the iterative refinement process:**

- **Initial idea**: "Does GDP affect life expectancy?" 
  - **Problem**: Too basic, well-established in literature
  - **Refined to**: "At what GDP threshold does the Preston Curve plateau, and how do developing nations compare?" (Q4)

- **Initial idea**: "What factors correlate with life expectancy?"
  - **Problem**: Meaningless fishing expedition
  - **Refined to**: "Among controllable health interventions (vaccination, BMI, alcohol), which has highest ROI for developing vs developed nations?" (Q6 & Q7)

**What I learned:**

The meta-skill of "knowing what to ask" is harder to learn than pandas or scikit-learn. It requires:
- Reading beyond the dataset (literature review)
- Thinking like the end-user (WHO policymaker lens)
- Iterating through multiple question drafts before coding anything

This experience taught me that data science is not just technical execution - the conceptual framing phase deserves as much time as the implementation phase.

**Learning & Growth:**

*What I learned:*

**Technical skills:**

1. **Dataset Evaluation & Selection**:
   - Learned to cross-reference dataset claims with original sources (WHO, World Bank)
   - Understood importance of data provenance and licensing for academic work

2. **Advanced Statistical Modeling**:
   - **Non-linear regression**: Implemented exponential curve fitting using `scipy.optimize.curve_fit` for Preston Curve analysis to capture the diminishing returns of GDP on life expectancy
   - **Model selection**: Learned to compare models using $R^2$, AIC, residual plots
   - **Multiple regression**: Used `statsmodels` for multivariate analysis with control variables
   - **Stratified analysis**: Discovered how to segment data by categorical variables to reveal heterogeneous effects

3. **Jupyter Notebook Best Practices**:
   - **Organization**: Used markdown headers, table of contents with anchor links, clear cell labeling
   - **Documentation**: Wrote markdown explanations between code cells, not just code comments
   - **Collaboration**: Learned to write code that others can understand and modify
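
The model-selection step mentioned under Advanced Statistical Modeling can be sketched with plain NumPy. The example compares a fit that is linear in GDP with one that is linear in log(GDP) on synthetic data; the data-generating formula is an assumption, and the AIC shown is the Gaussian least-squares form up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(1)
gdp = rng.uniform(500, 50_000, size=200)
# Synthetic target that is linear in log(GDP); an assumption for illustration.
life = 20 + 5.5 * np.log(gdp) + rng.normal(scale=1.5, size=gdp.size)

def fit_and_score(x, y):
    """Fit y = a*x + b by least squares and return (R^2, AIC)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    rss = np.sum((y - (slope * x + intercept)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    n, k = y.size, 2  # two fitted parameters
    aic = n * np.log(rss / n) + 2 * k  # Gaussian-error AIC up to a constant
    return 1 - rss / tss, aic

r2_lin, aic_lin = fit_and_score(gdp, life)
r2_log, aic_log = fit_and_score(np.log(gdp), life)
print(f"linear: R^2={r2_lin:.3f}, AIC={aic_lin:.1f}")
print(f"log:    R^2={r2_log:.3f}, AIC={aic_log:.1f}")
```

A higher $R^2$ together with a lower AIC points to the log model here; residual plots would still be needed to check for leftover structure.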

**Analytical approaches:**

1. **From Exploration to Hypothesis-Driven Analysis**:
   - Initially: Plotted everything hoping to find patterns
   - Learned: Start with research question → review literature → form hypothesis → design targeted analysis

2. **Thinking About Subgroups**:
   - Key insight: "Average effect" can be misleading
   - Always ask: "Does this pattern hold across different contexts?"

3. **Iterative Refinement**:
   - First analysis attempt is rarely correct
   - Process: Analyze → Visualize → Critique → Research domain → Re-analyze
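
The point about subgroups can be made concrete with a small stratified fit. In the sketch below the effect sizes, group sizes, and column names are assumptions baked into synthetic data; it only shows the mechanics of fitting one slope per development status instead of a single pooled slope.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 150
status = np.array(["Developing"] * n + ["Developed"] * n)
vaccination = np.concatenate([rng.uniform(40, 95, n), rng.uniform(85, 99, n)])
# Assumed effect sizes: strong in Developing, weak in Developed.
true_slope = np.where(status == "Developing", 0.30, 0.05)
baseline = np.where(status == "Developing", 40.0, 74.0)
life = baseline + true_slope * vaccination + rng.normal(scale=1.0, size=2 * n)
df = pd.DataFrame({"Status": status, "Vaccination": vaccination, "Life": life})

def slope(group):
    """Least-squares slope of Life on Vaccination within one group."""
    return np.polyfit(group["Vaccination"], group["Life"], deg=1)[0]

pooled_slope = slope(df)
slopes_by_status = {name: slope(g) for name, g in df.groupby("Status")}
print(f"pooled: {pooled_slope:.3f}")
print({name: round(value, 3) for name, value in slopes_by_status.items()})
```

The pooled slope mixes the within-group effect with the gap between the two groups, which is why the per-group slopes are the ones worth interpreting.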

**Domain knowledge:**

1. **Global Health Concepts**:
   - **Preston Curve**: Logarithmic GDP-life expectancy relationship, diminishing returns above ~$10k GDP per capita
   - **Infant mortality as leverage point**: Reducing child deaths has outsized impact on population life expectancy (mathematical effect + reflects overall health system quality)
   - **Epidemiological transition**: As countries develop, mortality shifts from infectious diseases (children) to chronic diseases (elderly)

2. **Development Economics**:
   - **Education vs Wealth**: Schooling empowers health literacy, women's education delays childbearing, creates virtuous cycle

3. **Public Health Policy Insights**:
   - **Vaccination ROI**: Immunization programs are cost-effective interventions (prevent expensive future treatments)
   - **"Diseases of Affluence"**: Wealthy nations face alcohol, obesity, sedentary lifestyle diseases
   - **Healthcare as buffer**: Why developed countries tolerate lifestyle risks - advanced medical systems (ICU capacity, preventive screening, chronic disease management) mitigate consequences

*What surprised me most:*

**The Lifestyle-Longevity Paradox in Q7**

**Initial assumption**: Wealthy countries = healthier behaviors (better diet, less alcohol, optimal BMI)

**Reality discovered**:
- Developed countries had HIGHER alcohol consumption (social drinking culture, affordability)
- BMI showed two extremes: underweight in poor countries (malnutrition), overweight in rich countries (obesity)
- YET developed countries maintained 10+ years longer life expectancy despite "worse" lifestyle metrics

**Why this shocked me**:
- Challenged my naive belief that "wealth automatically leads to optimal health choices"
- Realized life expectancy isn't just about behavior - it's about **surviving** your mistakes

**Deeper understanding gained**:
- **Healthcare infrastructure is the key differentiator**: 
  - Rich countries have cardiac surgery, diabetes management, cancer screening, emergency response
  - Poor countries: Same lifestyle risk → death; Rich countries: Risk → treatment → survival
- **Complexity of "health"**: Life expectancy measures survival, not quality of life or wellness
- **Policy implication**: Can't just tell developing countries to "copy developed world behaviors" - need healthcare system capacity first

*How this project shaped my understanding of data science:*

**1. Data Science is Team Coordination, Not Solo Genius**

**Key lesson from coordinator role**: The best model is worthless if:
- Teammates can't reproduce it (poor documentation)
- Stakeholders don't understand it (technical jargon without translation)
- It doesn't integrate with team's workflow (works in isolation but breaks the whole)

**2. Real Projects Are Messy - Embrace Imperfection**

**Academic exercises**: Clean CSV, clear objectives, known solutions

**This project**:
- Dataset had been "cleaned" by someone else - had to trust their judgment (risky!)
- Multiple valid analytical approaches for same question (which is "correct"?)
- Trade-offs everywhere (time vs depth, complexity vs interpretability)
- Worked with teammates of different skill levels (had to accommodate)

**Learning**: Perfect is the enemy of done. Sometimes "good enough with proper caveats" is better than "perfect but late". Document limitations honestly.

**3. Narrative Matters More Than Numbers**

**Initially**: Obsessed with achieving high $R^2$, low p-values, fancy algorithms

**Realized**: Stakeholders (professors, policymakers, general audience) care about:
- "What does this MEAN in the real world?"
- "What should we DO with this information?"
- "Can you show me visually?"

**Skill to develop**: Translating regression outputs into actionable insights and compelling stories

---

This project taught me that **data science is an interdisciplinary craft** - blending statistics, programming, domain knowledge, communication, and teamwork. It's not enough to be technically proficient; you must understand context, explain clearly, and collaborate effectively. The challenges I faced (dataset selection, team coordination, code integration, analytical complexity) gave me practical skills no lecture could teach. I now appreciate that becoming a good data scientist requires continuous learning across multiple dimensions, not just technical depth.

<a id="c643"></a>
### 1.4.3 Student 3: Bùi Nam Việt

**Challenges & Difficulties Encountered:**

*Specific obstacles faced:*

- **Technical**: Multicollinearity Masking Feature Importance
    * **The Issue:** In the initial Random Forest, the "Feature Importance" chart showed **Schooling** and **GDP** as having almost 0% importance. This was technically incorrect. Because `Adult_mortality` was so mathematically dominant, the tree-building algorithm (greedy approach) always picked it first, "masking" the contribution of socio-economic factors.

- **Analytical**: 
    - **Extreme Data Skewness**:

        * **The Issue:** The distribution analysis in Question 3 revealed that key variables did not follow a normal (bell curve) distribution.

            * **GDP per capita:** Highly right-skewed (Skewness 2.38).
            * **HIV Incidents:** Extreme right-skewness (Skewness 4.98) with massive outliers (Kurtosis 28.64).
            * This skewness distorts statistical relationships, making it hard for models to detect patterns for "average" countries versus "extreme" ones.

    - **The Limitations of Linear Models**:
        * **The Issue:** Linear Regression performed worse than Random Forest due to its assumption of linear relationships.

- **Conceptual**: The "Proxy Trap" (Causal Confusion)

    * **The Issue:** In the first model, the Random Forest achieved near-perfect accuracy ($R^2 \approx 99.7\%$). However, it was "cheating" by using **Mortality Rates** (Adult Mortality, Infant Deaths) to predict **Life Expectancy**. This is conceptually flawed because death rates are essentially a restatement of life expectancy (life expectancy is itself computed from mortality rates), not the *cause* of it. The model wasn't learning *why* people live longer; it was just calculating *when* they died.

*How I overcame them:*

- **The Ablation Strategy:** To solve the "Proxy Trap," we deliberately removed the highly correlated mortality variables (`Adult_mortality`, `Infant_deaths`, `Under_five_deaths`). We forced the model to "re-learn" the world based only on socio-economic and health inputs. This technically "unlocked" the model, allowing **Incidents_HIV (45%)** and **GDP (23%)** to surface as the true top predictors in the new Feature Importance chart.

- **Distribution Analysis:** We utilized Skewness and Kurtosis metrics to identify which variables (GDP, Population) require Log Transformation in future steps to improve model performance further.

- I determined that Linear Regression was limited because it assumes linear relationships between features and the target. I overcame this by **Model Selection**, choosing the Random Forest algorithm which can capture non-linear patterns and feature interactions.
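
The distribution check mentioned in the list above can be sketched with pandas. The log-normal sample is a synthetic stand-in for a right-skewed variable such as GDP per capita; note that pandas' `kurt()` reports excess kurtosis, which may or may not match the convention used for the figures quoted earlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Log-normal draws as a stand-in for a right-skewed variable (assumption).
gdp = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=2000))

skew_raw = gdp.skew()
kurt_raw = gdp.kurt()  # excess kurtosis (0 for a normal distribution)

# log1p compresses the long right tail.
log_gdp = np.log1p(gdp)
skew_log = log_gdp.skew()

print(f"raw: skew={skew_raw:.2f}, kurtosis={kurt_raw:.2f}")
print(f"log: skew={skew_log:.2f}")
```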

*Most challenging aspect and why:*

- **Sacrificing Accuracy for Interpretability:** The hardest decision was removing the mortality variables. It is counter-intuitive for a data scientist to intentionally lower the model's accuracy (RMSE increased from ~0.5 to ~1.0). However, this was necessary to transform the model from a "calculator" into a "decision-support tool" for policymakers.
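
The masking effect and the ablation step can be reproduced on synthetic data. In the sketch below the column names mirror the report, but the values, coefficients, and the way the leaky `Adult_mortality` proxy is constructed are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# Synthetic drivers; names mirror the report, values are made up.
X = pd.DataFrame({
    "Incidents_HIV": rng.exponential(1.0, n),
    "GDP_log": rng.normal(8.0, 1.0, n),
    "Schooling": rng.normal(8.0, 3.0, n),
})
life = (60 - 4.0 * X["Incidents_HIV"] + 2.0 * (X["GDP_log"] - 8)
        + 0.5 * (X["Schooling"] - 8) + rng.normal(scale=1.0, size=n))
# A leaky proxy: almost a deterministic function of the target itself.
X["Adult_mortality"] = 600 - 7.0 * life + rng.normal(scale=2.0, size=n)

def importances(features):
    """Fit a forest on the chosen columns and return its feature importances."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[features], life)
    return pd.Series(model.feature_importances_, index=features)

full = importances(list(X.columns))
ablated = importances(["Incidents_HIV", "GDP_log", "Schooling"])
print(full.round(3))
print(ablated.round(3))
```

With the proxy present, nearly all importance lands on `Adult_mortality`; once it is removed, the remaining drivers share the importance and their relative ranking becomes visible.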

**Learning & Growth:**

*What I learned:*

- **Technical skills**:

    * **Feature Selection:** High accuracy does not always mean a good model. I learned to identify **Data Leakage** (using the target variable's symptoms to predict the target).

    * **Model Comparison:** I learned why Random Forest (tree-based) handles non-linear relationships better than Linear Regression, which assumes linear relationships and additive feature effects.

- **Analytical approaches**:

    * **Ablation Study:** The technique of removing features to test model robustness and uncover hidden feature importance.

    * **Multicollinearity:** Understanding that when one feature is a near-proxy for the target and is itself strongly correlated with the other features (e.g., Adult Mortality, which closely tracks Life Expectancy), the model leans on it almost exclusively and ignores other important but subtle features like GDP or Schooling.

- **Domain knowledge**:

    - **Social Determinants of Health (SDOH)**: I learned that Life Expectancy is not merely a biological result but strongly tied to socio-economic structures. Schooling is not just about education; it serves as a proxy for "Health Literacy" (the ability to make informed health decisions). Similarly, GDP represents the capacity for healthcare infrastructure, not just personal wealth.

    - **Epidemiological Impact**: The dominance of Incidents_HIV (after removing leakage) highlighted a critical domain insight: infectious disease outbreaks can act as "statistical shocks," overriding standard economic progress and becoming the primary limiter of life expectancy in specific regions.

    - **Causality vs. Correlation**: I learned to distinguish between outcomes (like Adult Mortality) and drivers (like GDP/Schooling). Using an outcome to predict another outcome (Life Expectancy) creates a closed loop, whereas using drivers allows for genuine policy-making predictions.

*What surprised me most:*

* **The "Hidden" Killer:** In the first model, "Schooling" seemed irrelevant. In the second model (after removing mortality proxies), **"Incidents_HIV"** suddenly jumped to the #1 most important feature (45.0%). It was surprising to see that a specific disease could outweigh general economic factors (GDP at 23.1%) in determining life expectancy globally.
* **Robustness of Socio-Economics:** Even after removing the direct death statistics, the model maintained an $R^2$ of **0.9877**. It was shocking that economy, education, and HIV rates alone can explain roughly 98% of the variance in life expectancy, without knowing actual mortality rates.

*How this project shaped my understanding of data science:*

* **Context over Code:** I realized that knowing how to code `RandomForestRegressor` is easy, but knowing *what to feed it* is hard. Data Science is not just about minimizing RMSE; it is about understanding the domain (e.g., the relationship between GDP and Health) to ensure the model reflects reality.
* **The Danger of "Perfect" Scores:** A model that is "too good to be true" (like the first one with 99.9% training accuracy) is usually flawed. I learned to be skeptical of perfect metrics and to always investigate *why* the model is making its decisions.

<a id="c7"></a>
# 2. References

- Dataset source: **World Health Organization (WHO) - Global Health Observatory (GHO):**

    - **Data Repository:** [Life Expectancy (WHO)](https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates)
    - **Description:** A dataset tracking health status and socio-economic factors for 179 countries between the years 2000-2015.
    - **Usage:** Provided the ground truth for target variables (`Life expectancy`) and features (`GDP`, `Schooling`, `Immunization coverage`) used to train and validate the regression models.
- Libraries documentation:

    * **Scikit-Learn (User Guide & API):**
        * **Focus:** `sklearn.ensemble.RandomForestRegressor` and `sklearn.linear_model.LinearRegression`.
        * **Application:** Consulted to understand hyperparameter tuning (e.g., `n_estimators`, `max_depth`) and to implement the `feature_importances_` attribute for model interpretation.
    * **Pandas & NumPy:**
        * **Focus:** Data manipulation and statistical operations.
        * **Application:** Used for cleaning data (handling null values), performing vectorization, and calculating correlation matrices to detect multicollinearity.
    * **Matplotlib & Seaborn:**
        * **Focus:** Data visualization techniques.
        * **Application:** Used to generate the Heatmap (for correlation analysis) and Bar Charts (for visualizing Feature Importance).

- Research papers:

    * **World Health Organization (WHO)**. *Global Health Observatory Data Repository: Life Expectancy and Healthy Life Expectancy*.
        * Accessed from: [https://www.who.int/data/gho](https://www.who.int/data/gho)
        * *Description:* A dataset recording health indicators, immunization coverage, and socio-economic factors for 193 countries during the period 2000-2015.

    * **Department of Data Science**. *Lecture Slides: Introduction to Data Science*. Internal course materials.
        * Link: [Google Drive Folder](https://drive.google.com/drive/folders/1DW67RvXnyR2cTaASpEwTz0wvlZc5nYCb)
    * **Department of Data Science**. *Lecture Slides: Data Science Programming*. Internal course materials.
        * Link: [Google Drive Folder](https://drive.google.com/drive/folders/1FyzNTCs_xpx-CUVBw_VwXlEt73tf8ywX)

    * **Vu Huu Tiep**. *Machine Learning Co Ban - TabML Book: Random Forest*.
        * Accessed from: [https://machinelearningcoban.com/tabml_book/ch_model/random_forest.html](https://machinelearningcoban.com/tabml_book/ch_model/random_forest.html)
        * *Reference Content:* The mechanism of Bagging and constructing multiple Decision Trees to mitigate overfitting.
    * **Vu Huu Tiep**. *Machine Learning Co Ban - Linear Regression*.
        * Accessed from: [https://machinelearningcoban.com/2016/12/28/linearregression/](https://machinelearningcoban.com/2016/12/28/linearregression/)
        * *Reference Content:* Theory regarding linear regression, cost functions, and gradient descent optimization.

    * **Scikit-learn Developers**. *User Guide: Ensemble Methods & Linear Models*.
        * Accessed from: [https://scikit-learn.org/stable/modules/ensemble.html](https://scikit-learn.org/stable/modules/ensemble.html)
        * *Usage:* Technical reference for parameters in `RandomForestRegressor` and `LinearRegression`.
    * **Pandas Documentation**. *User Guide: Missing Data & IO Tools*.
        * *Usage:* Methods for handling missing data (Imputation) and reading/writing CSV files.