# Probabilistic Machine Learning - Project Report
**Course:** Probabilistic Machine Learning (SoSe 2025)  
**Lecturer:** Alvaro Diaz-Ruelas  
**Student(s) Name(s):**  Timm Nicklaus  
**GitHub Username(s):**  t1mmb0  
**Date:**  02.07.25  
**PROJECT-ID:** 13-3NTXXXX  

---


## 1. Introduction



### Motivation
- Social indicators are important metrics collected by the Federal Republic of Germany to draw conclusions about societal structures.
- This project investigates the relationship between sociodemographic characteristics and income, using probabilistic models to analyze and interpret these relationships.
- The focus lies on identifying which characteristics of the main income earner significantly influence the income level and therefore the social position of the entire household.

### Dataset
- CAMPUS file from the 2010 Microcensus (fully anonymized and designed for academic/student use)
- The CAMPUS file is a 3.5% sample of the 2010 Microcensus, containing data on 23,374 individuals from 11,494 households. In total, 427 of the original 828 features are included in the dataset.
- From these, 13 features were selected that relate to the main income earner, along with the regional context (East/West Germany).
- The analyzed features include:

  1. Gender, marital status, education  
  2. Employment status, occupation, sector, type of employment  
  3. Nationality, housing situation, household role

- **Target variable:**  
  Income of the main income earner.

### Hypothesis
- Income is largely influenced by factors over which individuals have limited control – such as gender, nationality, or level of education.
- Individuals with higher educational qualifications, stable employment histories, and certain demographic characteristics (e.g., living in a Western German state, German citizenship) are more likely to earn above-average incomes.



## 2. Data Loading and Transformation



- The data was loaded from the CAMPUS file of the 2010 Microcensus using the `data_load()` function.
- The script `transformation.py` performs the data transformation.

### The following steps were carried out:

1. Column names were replaced with descriptive labels.
2. A dataset with human-readable feature values was created and saved as *df_labels.csv*.
   - This version retains missing values to allow analysis of main income earners without income in the preprocessing section.
3. Normalization and removal of missing values.
4. Removal of all samples with the following income classes, as they are not comparable:
   - 50: *Self-employed farmer*
   - 90: *No income*
   - 99: *Not specified*
5. Application of `LabelEncoder` from the Scikit-learn library:
   - *Encode target labels with values between 0 and n_classes-1.*
6. Application of `train_test_split` from the Scikit-learn library:
   - *test_size = 0.3 → Train: (10,376 × 13) – Test: (4,447 × 13)*
7. Saving of the transformed datasets and mappings in the `/data/` directory.
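
Steps 5 and 6 can be sketched as follows. This is a minimal illustration on synthetic data, not the contents of `transformation.py`: the column values, the per-column encoder dictionary, and `random_state=42` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the cleaned dataset (values are illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender": rng.choice(["female", "male"], size=100),
    "federal_state": rng.choice(["east", "west"], size=100),
    "income": rng.choice(["lowest", "middle", "highest"], size=100),
})

# One LabelEncoder per column; the encoders are kept so that the integer
# codes can later be mapped back to readable labels.
encoders = {}
encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df[col])

X = encoded.drop(columns="income")
y = encoded["income"]

# 70/30 split as in step 6; random_state is an assumption for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

`LabelEncoder` assigns codes in sorted label order, so the codes of ordinal features only reflect the true ordering if the labels happen to sort that way.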



## 3. Data Exploration

- All results in this chapter are derived from `exploration.ipynb` and can be verified there.


### Basic Statistics:



| Feature                     | Count | Type     | Missing Values |
| -------------------------- | ----- | -------- | -------------- |
| `federal_state`            | 23,374| nominal  | 0              |
| `gender`                   | 23,107| nominal  | 267            |
| `citizenship`              | 23,107| nominal  | 267            |
| `marital_status`           | 23,107| nominal  | 267            |
| `employment_status`        | 23,107| nominal  | 267            |
| `employment_sector`        | 15,655| nominal  | 7,722          |
| `job`                      | 15,651| nominal  | 7,726          |
| `employment_position`      | 15,655| nominal  | 7,722          |
| `livelihood`               | 23,107| nominal  | 267            |
| `income`                   | 23,107| ordinal  | 267            |
| `educational_qualification`| 23,107| ordinal  | 267            |
| `highest_qualification`    | 23,079| ordinal  | 295            |
| `primary_residence`        | 23,107| metric   | 267            |
| `household_relationship`   | 23,107| nominal  | 267            |


### Analysis of the Missing Values:

- All missing values for `gender` — and consequently the 267 missing values in all other features — originate from individuals living in collective accommodations (*Gemeinschaftsunterkünfte*), according to the source documentation.
- Removing these samples is justified, as they provide no useful information for the analysis.
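
A minimal pandas sketch of this cleaning step, using a small made-up frame: missing values are counted per feature, and only rows without `gender` are dropped, so that rows which merely lack employment-related features are kept. The example values are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the real data comes from the CAMPUS file.
df = pd.DataFrame({
    "federal_state": ["west", "east", "west", "east"],
    "gender": ["female", np.nan, "male", "female"],
    "job": ["teacher", np.nan, np.nan, "nurse"],
})

# Missing values per feature, analogous to the "Missing Values" column above.
missing_per_feature = df.isna().sum()
print(missing_per_feature)

# Drop only the rows without `gender`; rows lacking just `job` remain.
df_clean = df.dropna(subset=["gender"])
print(len(df_clean))
```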



<img src="plots/Countplots with and without employment.png" width="1500">


- A clear pattern emerges regarding marital status: individuals without employment are disproportionately widowed.  
→ This indicates that many of them are elderly and live on their pension income without being employed.


<img src="plots/Livelihood by gender and employment.png" width="1200">

- This evaluation further confirms that individuals without income live off pensions or unemployment benefits.

### Reducing the Target Variable to Fewer Income Classes


- `create_income_classes.ipynb` reduces the original 24 income classes to 5.
- This leads to improved and more informative results.

#### Creating Income Classes with 1D K-Means Clustering
- Uses `KMeans` from the Scikit-learn library.
- Caveat: K-means assumes metric distances, so it is not ideally suited for ordinal features such as the income classes.
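
The clustering idea can be sketched as follows, assuming the 24 income class codes are clustered directly as a one-dimensional numeric feature. The synthetic codes, `n_init`, `random_state`, and the re-ranking of cluster ids are assumptions and not taken from `create_income_classes.ipynb`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic income class codes 1..24 (illustrative only).
rng = np.random.default_rng(0)
income_codes = rng.integers(1, 25, size=500)

# KMeans expects a 2D array, so the single feature is reshaped to (n, 1).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(income_codes.reshape(-1, 1))

# Cluster ids are arbitrary; rank the clusters by their center so that
# level 0 is the lowest income group and level 4 the highest.
order = np.argsort(kmeans.cluster_centers_.ravel())
rank_of_cluster = np.empty_like(order)
rank_of_cluster[order] = np.arange(5)
income_level = rank_of_cluster[cluster_ids]

for level in range(5):
    codes = income_codes[income_level == level]
    print(level, codes.min(), codes.max())
```

Because the raw cluster ids carry no order, a re-ranking step like this (or a manual mapping) is needed before the clusters can be read as ordered income groups.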




<img src="plots/Income classes.png" width="1000">

---

| Cluster | Income (approx.)  | Classes | Typical Meaning                  |
| ------- | ----------------- | ------- | -------------------------------- |
| 0       | €2,000 – €2,900   | 11–13   | **Middle-income earners**        |
| 1       | €1,300 – €2,000   | 8–10    | **Lower end of employees**     |
| 2       | €2,900 – €4,500   | 14–17   | **Higher earners**               |
| 3       | below €1,300      | 1–7     | **At risk of poverty**           |
| 4       | above €4,500      | 18–24   | **Top earners**                  |




The income thresholds derived from KMeans clustering align well with established socio-statistical benchmarks. This supports the plausibility of the clusters as meaningful analytical income groups. 

Although the data is from 2010, the resulting classes still reasonably reflect the income distribution in Germany:

- [The poverty line in Germany is approximately €1,300](https://biaj.de/archiv-materialien/2026-eurostat-armutsgefaehrdung-vor-und-nach-sozialleistungen-in-der-bundesrepublik-deutschland-2023.html)  
- [The average income in Germany is around €2,500](https://www.bpb.de/kurz-knapp/zahlen-und-fakten/sozialbericht-2024/553205/einkommen-und-einkommensverteilung/)





### Chi² Test


- Implementation: `categorical_nb.ipynb`

| Rank | Feature                     | Chi² Score | p-Value |
|------|-----------------------------|------------|---------|
| 1    | `job`                       | 38,144.87  | 0.00    |
| 2    | `livelihood`                | 4,476.60   | 0.00    |
| 3    | `highest_qualification`     | 1,305.22   | 0.00    |
| 4    | `educational_qualification` | 599.30     | 0.00    |
| 5    | `gender`                    | 592.04     | 0.00    |
| 6    | `federal_state`             | 545.83     | 0.00    |
| 7    | `employment_position`       | 435.50     | 0.00    |
| 8    | `household_relationship`    | 368.51     | 0.00    |
| 9    | `employment_sector`         | 211.19     | 0.00    |
| 10   | `citizenship`               | 67.96      | 0.00    |
| 11   | `primary_residence`         | 55.50      | 0.00    |
| 12   | `employment_status`         | 46.53      | 0.00    |
| 13   | `marital_status`            | 3.11       | 0.54    |

- Every feature is statistically significant except for `marital_status` (p = 0.54).  
However, the Chi² test only measures the individual (marginal) association of each feature with the target.  
Therefore, `marital_status` might still have an **indirect** effect on income through interactions with other features.
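
A ranking of this kind can be produced with `sklearn.feature_selection.chi2`. The sketch below assumes this function was used on integer-encoded features, which the report does not state explicitly; the data is synthetic.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=1000)          # five income classes
X = np.column_stack([
    y,                                      # feature fully determined by the class
    rng.integers(0, 5, size=1000),          # feature unrelated to the class
])

# chi2 returns one score and one p-value per feature column.
scores, p_values = chi2(X, y)
for name, score, p in zip(["dependent", "independent"], scores, p_values):
    print(f"{name:12s} chi2={score:10.2f} p={p:.4f}")
```

Note that `chi2` treats feature values as non-negative frequencies, so for label-encoded categorical features the score depends on how the codes were assigned. A contingency-table test (for example `scipy.stats.chi2_contingency` on a `pd.crosstab`) does not have this dependence and could serve as a cross-check.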




### Crosstab Analysis


- The crosstabs display conditional probabilities between each feature and the income class.

- There are significant associations between many features and the target variable. One example is the relationship between `Gender` and `Income`:
- Women earn less on average: they account for higher shares of the *lowest* and *low* income classes.

<img src="plots/Income Level by gender.png" width="800">

- Another interesting pattern emerges for `Federal State` and `Income`:
- Households in the new federal states (former East Germany) show significantly lower income levels.

<img src="plots/Income Level by state.png" width="800">

- ``Educational Qualification`` is also strongly related to ``income``:

<img src="plots/Income Level by edqual.png" width="500">
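
Conditional distributions like the ones plotted in this section can be computed with `pd.crosstab`. The sketch below assumes row-wise normalization, i.e. the distribution of income classes within each feature value; the data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "female", "female", "female",
               "male", "male", "male", "male"],
    "income": ["lowest", "lowest", "middle", "middle",
               "middle", "high", "high", "lowest"],
})

# normalize="index" divides each row by its total, giving P(income | gender).
cond = pd.crosstab(df["gender"], df["income"], normalize="index")
print(cond)
```

Each row of `cond` sums to 1, so the rows can be compared directly even when the groups differ in size.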

## 4. Probabilistic Modeling

### Categorical Naive Bayes Classifier

- A simple classification model that assumes the features are conditionally independent given the class.  
  Implementation: `categorical_nb.ipynb`
- Conceptually similar to the Chi² test in that each feature is related to the target separately, but used as a classification model.
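
A minimal sketch of such a classifier with scikit-learn's `CategoricalNB`, trained on synthetic integer-encoded features. The data, `min_categories`, and the split settings are assumptions for illustration and do not reproduce the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 3, size=n)

# Feature 0 agrees with the class about 80% of the time, feature 1 is noise.
informative = np.where(rng.random(n) < 0.8, y, rng.integers(0, 3, size=n))
noise = rng.integers(0, 4, size=n)
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# alpha is the Laplace smoothing strength; min_categories fixes the number of
# categories per feature so that unseen codes at test time do not raise errors.
model = CategoricalNB(alpha=1.0, min_categories=[3, 4])
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
proba = model.predict_proba(X_test)   # one probability per class and instance
print(f"accuracy: {accuracy:.2f}")
```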

Basic Model with `alpha: 1.0`:

**Accuracy: 0.49**

| Class        | Precision | Recall | F1-Score | Support |
|---------------|-----------|--------|----------|---------|
| high          | 0.39      | 0.54   | 0.45     | 650     |
| highest       | 0.41      | 0.43   | 0.42     | 240     |
| lower middle  | 0.50      | 0.65   | 0.57     | 1530    |
| lowest        | 0.65      | 0.49   | 0.56     | 852     |
| middle        | 0.48      | 0.28   | 0.35     | 1175    |
| **Accuracy**      |           |        | **0.49**    | 4447    |
| **Macro avg**     | 0.49      | 0.48   | 0.47     | 4447    |
| **Weighted avg**  | 0.50      | 0.49   | 0.48     | 4447    |

- The classifier outperforms random guessing and the majority-class baseline (1530/4447 ≈ 0.34), but an accuracy of 0.49 cannot be considered good.  
→ Visualization of typical misclassifications:

<img src="plots/Confusion Matrix.png" width="500">

- The model is capable of coarse classification but fails at fine-grained categorization. It distinguishes well between high and low incomes, but it cannot reliably separate neighboring classes.

- Hyperparameter tuning with Grid Search `(alpha: [0.01, 0.1, 0.5, 1.0, 2.0])`:

| param_alpha | mean_test_score |
|-------------|-----------------|
| 0.01        | 0.497687        |
| 0.10        | 0.497205        |
| 0.50        | 0.494989        |
| 1.00        | 0.494121        |
| 2.00        | 0.490652        |

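- No meaningful improvement: the mean test score only varies between 0.491 and 0.498 across all values of `alpha`.  
→ Smoothing is not the limiting factor; the model itself is not well suited to this kind of data.

- The *naive* assumption that features are conditionally independent given the class is too strong and does not hold well in this dataset.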
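
The grid search over `alpha` can be sketched with `GridSearchCV`. The synthetic data, `cv=5`, and `min_categories` are assumptions; only the `alpha` grid is taken from the report.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 3, size=n)
informative = np.where(rng.random(n) < 0.8, y, rng.integers(0, 3, size=n))
noise = rng.integers(0, 4, size=n)
X = np.column_stack([informative, noise])

alphas = [0.01, 0.1, 0.5, 1.0, 2.0]
search = GridSearchCV(
    CategoricalNB(min_categories=[3, 4]),  # fixed category counts per feature
    param_grid={"alpha": alphas},
    cv=5,                                   # assumed number of folds
    scoring="accuracy",
)
search.fit(X, y)

# cv_results_ lists the candidates in the order of the parameter grid.
for alpha, score in zip(alphas, search.cv_results_["mean_test_score"]):
    print(f"alpha={alpha:<5} mean_test_score={score:.4f}")
print("best:", search.best_params_)
```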




### Analysis of Uncertainty

#### Connection between max(p) and entropy

<img src="plots/connection between p and entropy.png" width="500">

- ``max_probability``: the highest predicted class probability for each instance
- ``entropy``: a measure of the model’s prediction uncertainty
- Misclassifications (blue) are concentrated at lower probabilities (< 0.6).
- These predictions often have high entropy (> 1.0), reflecting uncertainty.

However, the accuracy on confident predictions (entropy < 0.7 and max_p > 0.5) is ``66.35%``.
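
Both quantities can be derived from the predicted class probabilities. The sketch below assumes Shannon entropy with the natural logarithm (maximum ln 5 ≈ 1.61 for five classes) and uses hand-written probabilities instead of real model output.

```python
import numpy as np

# Hypothetical predicted probabilities for three instances and five classes.
proba = np.array([
    [0.90, 0.05, 0.03, 0.01, 0.01],   # confident prediction
    [0.20, 0.20, 0.20, 0.20, 0.20],   # maximally uncertain prediction
    [0.55, 0.40, 0.03, 0.01, 0.01],   # two competing classes
])
y_true = np.array([0, 3, 1])

max_probability = proba.max(axis=1)

# Shannon entropy per instance; clipping avoids log(0) for zero probabilities.
entropy = -np.sum(proba * np.log(np.clip(proba, 1e-12, 1.0)), axis=1)

# Accuracy restricted to predictions that pass both confidence thresholds.
y_pred = proba.argmax(axis=1)
confident = (entropy < 0.7) & (max_probability > 0.5)
accuracy_confident = np.mean(y_pred[confident] == y_true[confident])

print(entropy.round(3), confident, accuracy_confident)
```

The third instance shows why both thresholds are used: its top probability exceeds 0.5, but the entropy is still above 0.7 because a second class is almost as likely.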


<img src="plots/box_plots entropy.png" width="500">

- Lowest entropy for class ``lowest``: the model shows consistently lower entropy for instances labeled as ``lowest``, i.e. it is most confident when predicting the lowest income group.
- This may imply that the lowest class has distinctive feature patterns, making it easier to separate from the others.

<img src="plots/entropy misclassification.png" width="500">

- The heatmap shows mean entropy for misclassified instances (true ≠ predicted).

- High values (e.g. ``highest → middle``, ``lowest → middle``) indicate uncertain misclassifications.

- Low entropy despite a wrong prediction (e.g. ``high → lowest``, entropy = 0.53) means the model was confidently wrong, which points to potentially systematic confusion.

### Bayesian Network
*pass*


## 5. Rule-Based Models


### Decision Tree



- A simple rule-based model that is capable of modelling non-linear relationships.
- Implementation: `decision_tree.ipynb`
- Baseline model: `Decision Tree without pruning or max_depth`  
→ depth: 30

| Metric                | Value |
| --------------------- | ---- |
| **Train Accuracy**    | 0.90 |
| **Test Accuracy**     | 0.76 |
| **Balanced Accuracy** | 0.75 |

- The gap between train accuracy (0.90) and test accuracy (0.76) indicates `overfitting`.

| Class           | Precision | Recall | F1-Score | Support |
| ---------------- | --------- | ------ | -------- | ------- |
| **high**         | 0.74      | 0.70   | 0.72     | 650     |
| **highest**      | 0.72      | 0.76   | 0.74     | 240     |
| **lower middle** | 0.77      | 0.81   | 0.79     | 1530    |
| **lowest**       | 0.85      | 0.75   | 0.79     | 852     |
| **middle**       | 0.72      | 0.74   | 0.73     | 1175    |
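
A baseline of this kind and the metrics reported above can be sketched as follows. The data is synthetic and `random_state` is an assumption, so the numbers will differ from those in `decision_tree.ipynb`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
X = rng.integers(0, 6, size=(n, 5))

# The target depends on an interaction of two features, plus 20% label noise.
y = (X[:, 0] + X[:, 1]) % 3
flip = rng.random(n) < 0.2
y[flip] = rng.integers(0, 3, size=flip.sum())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# No max_depth and no pruning: the tree grows until the leaves are pure.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
balanced_acc = balanced_accuracy_score(y_test, tree.predict(X_test))
depth = tree.get_depth()

print(f"depth={depth} train={train_acc:.2f} test={test_acc:.2f} "
      f"balanced={balanced_acc:.2f}")
```

Balanced accuracy is the mean of the per-class recalls, which makes it less sensitive to the unequal class sizes than plain accuracy.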

- Parameter variation (`max_depth` and `ccp_alpha`):

<img src="plots/parameter variation decision tree.png" width="1700">

- ``No pruning (ccp_alpha = 0):`` Highest accuracy on the test dataset, but increased risk of overfitting.

- ``Marginal pruning (ccp_alpha = 0.0005 / 0.001):`` High test accuracy with reduced risk of overfitting.

- ``Max depth 10–15:`` Deeper models tend to perform better; this range offers a good balance between complexity and generalization.

- Hyperparameter tuning with Grid Search:

- Final model:  
→ ``{'ccp_alpha': 0.0005, 'criterion': 'entropy', 'max_depth': 10, 'max_features': None}``


| Metric                | Value |
| --------------------- | ---- |
| **Train Accuracy**    | 0.61 |
| **Test Accuracy**     | 0.56 |
| **Balanced Accuracy** | 0.53 |

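- The smaller gap between train and test accuracy indicates a `reduced risk of overfitting`, although the test accuracy is well below that of the unpruned baseline (0.76).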
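
The grid search over the tree parameters can be sketched as follows. The parameter names match the final model above, but the candidate values, `cv=5`, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.integers(0, 6, size=(n, 5))
y = (X[:, 0] + X[:, 1]) % 3

# Candidate values are illustrative; ccp_alpha controls cost-complexity pruning.
param_grid = {
    "max_depth": [5, 10, 15, None],
    "ccp_alpha": [0.0, 0.0005, 0.001],
    "criterion": ["gini", "entropy"],
    "max_features": [None, "sqrt"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                 # assumed number of folds
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
n_candidates = len(search.cv_results_["params"])
print(n_candidates, best_params, round(search.best_score_, 3))
```

The grid search selects by mean cross-validated score on the training folds, so the chosen parameters do not necessarily maximize accuracy on a separate test set.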

### Random Forest

## 6. Results

- Present key findings
- Comparison of models if multiple approaches were used



## 7. Discussion

- Interpretation of results
- Limitations of the approach
- Possible improvements or extensions



## 8. Conclusion

- Summary of main outcomes



## 9. References

- Cite any papers, datasets, or tools used