# Fairness Metrics Comparison and Discussion

## Comparison of Fairness Across Datasets

We evaluate the fairness of three recommendation models on two distinct datasets, **MovieLens** and **E-commerce**, using a comprehensive set of fairness metrics: **SVD++** (both datasets), **UserKNN** (MovieLens only), and **ItemKNN** (E-commerce only). While both datasets support recommender system research, they differ in scope, structure, and granularity.

### Dataset Differences

- **MovieLens** provides a structured dataset with explicit ratings, clear user demographics (gender, age, occupation), and consistent rating behavior, making it highly suitable for recommender system evaluation.
- The **E-commerce** dataset offers a different domain perspective, with product categories and more diverse gender identities, providing insight into commercial recommendation scenarios.
- The domain differences (movies vs. products) impact recommendation behavior and fairness interpretation across metrics.

---

### Key Metric Observations

| Metric                         | MovieLens Insights                                               | E-commerce Insights                                              |
|-------------------------------|-------------------------------------------------------------------|------------------------------------------------------------------|
| **RMSE (Accuracy)**           | SVD++ performs best (0.8830); UserKNN is slightly higher (0.8908).   | SVD++: 1.1357, ItemKNN: 0.9094. ItemKNN has the lower error, by roughly 20%. |
| **Gender-Based RMSE**         | RMSE is higher for **female users** in both models, indicating potential bias (SVD++: F 0.9111 vs. M 0.8738). | SVD++: Female RMSE 1.2206 (+7.5% relative to the overall RMSE of 1.1357), Male 1.1211. Non-binary & fluid identities show significantly lower RMSEs (e.g., Non-binary: 0.5772). ItemKNN shows similar trends: Female RMSE 0.9955 (+9.5% relative to the overall RMSE of 0.9094), Male 0.8437. |
| **Gini Coefficient**          | Moderate inequality in exposure: SVD++ (0.1228), UserKNN slightly better (0.1184). | SVD++: Moderate inequality in exposure (Gini = 0.1847) across 24 product categories. ItemKNN shows lower inequality (0.1200), though that evaluation covered only 5 categories and a smaller user base, which limits its generalizability. |
| **Bottom-N Individual Fairness** | Notable disparity: SVD++ bottom 1% avg = 2.3021 vs overall = 3.6623 → Δ = 1.3602 (37.1%). UserKNN shows even larger gap (43.5%). | SVD++: Bottom 1% utility is 3.4625 vs overall 3.7420 (−7.5%). Disparities are consistent across gender (F: −7.7%, M: −7.3%). ItemKNN has extreme disparity: bottom 1% utility is 1.2000 vs overall 3.7262 (−67.8%), with females more affected (−66%). |
| **N1-Norm Group Fairness (Gender)** | High RMSE distribution divergence for females in both models (e.g., SVD++ F: 0.4269 vs M: 0.1557). | SVD++: RMSE and MAE distributions show high divergence across gender. Female N1-norms: RMSE 5.04, MAE 4.70. Male N1-norms: RMSE 5.96, MAE 5.56. ItemKNN also shows gender disparity, with Female RMSE N1-norm = 5.23 and MAE N1-norm = 7.41 — suggesting greater prediction inconsistencies for female users. |
| **KL Divergence (Gender)**    | Low but present divergence in prediction and error distributions (e.g., SVD++ F: 0.0096). | SVD++: Moderate KL divergence in prediction errors across gender (RMSE KL: M = 0.4731, F = 0.3203; MAE KL: M = 0.4651, F = 0.3465). ItemKNN only shows divergence for female users (RMSE KL = 0.0499), as male users did not dominate any categories in the smaller sample. This highlights fairness limitations due to low coverage. |
| **Individual Fairness Variance** | SVD++ shows low variance across users (mean: 0.0143); stable predictions. UserKNN exhibits higher mean variance (0.1168) and outliers (max: 351.03). Younger users tend to have more variability. | SVD++: Very low variance across users (mean: 0.0082, max: 0.0542) shows stable recommendations. ItemKNN has higher mean variance (0.4807), and 23 users had extreme variance > 1.0, indicating less fairness in prediction consistency. |
| **Mean Average Envy for Individual Fairness** | SVD++ shows moderate envy (mean: 0.1826); lowest for older groups, highest for Under 18. UserKNN has higher envy (mean: 0.2378), with greatest unfairness for younger users. | SVD++: Envy is very low and consistent (mean: 0.0606, max: 0.2408) with no zero-envy users removed. ItemKNN shows significantly higher envy (mean: 0.5434, max: 2.0381), and 789 zero-envy users were removed. 27 users had extreme envy > 1.0. |
| **Fraction of Satisfied Users (FSU)** | Fairly consistent across models and groups. SVD++: 0.5332 overall, with slight gender/age variation. UserKNN: 0.5282 overall, slightly higher satisfaction for older users. | SVD++: Overall satisfaction across groups is moderate (avg FSU ranges from ~0.38 to 0.48). Most consistent satisfaction found among users aged 31–40 and users with Bachelor’s education. Minor variation observed across gender identities. ItemKNN: Slightly higher satisfaction overall, especially in groups like Non-binary (0.67), Polygender (0.67), and 18–25 age group (0.55). However, small group sizes inflate variability. Income and education also show clearer disparities (e.g., Master’s: 0.48 vs Bachelor’s: 0.53). |
| **Absolute Difference (AD) for Group Fairness** | SVD++: Avg AD = 0.2238 (valid for 10.2% of categories); UserKNN higher at 0.2878. Gender imbalance in users: ~2.74:1 (M:F). | SVD++: Avg AD = 0.0373, showing high group fairness across product categories. ItemKNN is significantly less fair with Avg AD = 0.3413. Largest disparities found in “Health Care” (1.19), “Baby Products” (0.83), and “Beauty & Personal Care” (0.72). |
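
The percentage figures in the table can be reproduced from the reported values. The sketch below is only an arithmetic check, not part of the evaluation code; the helper name `relative_gap` is hypothetical. It assumes each percentage is the difference from a reference value divided by that reference. Under that assumption, the E-commerce gender-based RMSE percentages match when the reference is the overall RMSE rather than the male RMSE.

```python
def relative_gap(value: float, reference: float) -> float:
    """Return the difference from `reference`, as a percentage of `reference`."""
    return (value - reference) / reference * 100.0


# Bottom-N individual fairness: bottom 1% utility vs. overall utility.
bottom_movielens_svdpp = relative_gap(2.3021, 3.6623)
bottom_ecommerce_svdpp = relative_gap(3.4625, 3.7420)
bottom_ecommerce_itemknn = relative_gap(1.2000, 3.7262)

# Gender-based RMSE on E-commerce: female RMSE vs. overall RMSE (assumed reference).
female_svdpp = relative_gap(1.2206, 1.1357)
female_itemknn = relative_gap(0.9955, 0.9094)

for name, gap in [
    ("MovieLens SVD++ bottom 1%", bottom_movielens_svdpp),
    ("E-commerce SVD++ bottom 1%", bottom_ecommerce_svdpp),
    ("E-commerce ItemKNN bottom 1%", bottom_ecommerce_itemknn),
    ("E-commerce SVD++ female RMSE", female_svdpp),
    ("E-commerce ItemKNN female RMSE", female_itemknn),
]:
    print(f"{name}: {gap:+.1f}%")
```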

---

### Discussion

The MovieLens dataset provides a structured foundation for recommender system evaluation with explicit ratings and defined demographics, while the E-commerce dataset introduces complexity through broader product categories and diverse user identities. This domain difference significantly influences fairness outcomes.

Gender consistently impacts fairness across both datasets, with female users experiencing higher error rates (MovieLens, SVD++: F 0.9111 vs. M 0.8738; E-commerce: female RMSE 7.5% above the overall RMSE for SVD++ and 9.5% above it for ItemKNN). This suggests systematic bias, possibly stemming from imbalanced training data or implicit rating behaviors.
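
A per-group RMSE of this kind can be computed by splitting squared errors by the sensitive attribute. The following is a minimal sketch under stated assumptions: the function name `rmse_by_group` and the toy ratings are illustrative only and do not come from the evaluated datasets or the original pipeline.

```python
import math
from collections import defaultdict


def rmse_by_group(y_true, y_pred, groups):
    """Compute RMSE separately for each group label."""
    squared_errors = defaultdict(list)
    for truth, pred, group in zip(y_true, y_pred, groups):
        squared_errors[group].append((truth - pred) ** 2)
    # Mean of squared errors per group, then square root.
    return {g: math.sqrt(sum(errs) / len(errs)) for g, errs in squared_errors.items()}


# Toy data (hypothetical): true ratings, model predictions, user gender.
ratings = [4.0, 3.0, 5.0, 2.0]
predictions = [3.5, 3.0, 4.0, 3.0]
genders = ["F", "M", "F", "M"]

group_rmse = rmse_by_group(ratings, predictions, genders)
print(group_rmse)
```

Comparing each group's value against the overall RMSE, or against another group, gives the kind of gap reported above.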

SVD++, as a model-based approach, demonstrates stronger fairness consistency across both datasets, with low individual fairness variance (MovieLens: 0.0143, E-commerce: 0.0082). For the E-commerce dataset, ItemKNN replaced UserKNN, which was infeasible there due to extreme sparsity in the user-item matrix and insufficient user similarity patterns. ItemKNN showed greater fairness disparities than SVD++ on that dataset, with a higher average Absolute Difference (0.3413 vs. 0.0373) and extreme envy values (max: 2.0381 vs. 0.2408).

Gender representation imbalances are noteworthy in both datasets (MovieLens M:F ratio of ~2.74:1). The E-commerce dataset's inclusion of non-binary identities revealed these groups often experienced lower error rates, though with limited statistical confidence due to smaller sample sizes.

A significant limitation of the E-commerce evaluation is that ItemKNN was assessed on only 5 product categories, versus 24 for SVD++, so the two models' category-level metrics are not directly comparable. The extreme ItemKNN disparities in specific categories such as "Health Care" (AD: 1.19) and "Baby Products" (AD: 0.83) suggest fairness considerations may need to be domain-specific.
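
The category count matters for exposure metrics such as the Gini coefficient, because the attainable range depends on the number of categories. The sketch below uses the standard sample Gini formula over sorted values; the source does not state which Gini variant was used, so this is an assumption, and the exposure vectors are toy values rather than results from the evaluation.

```python
def gini(exposures):
    """Gini coefficient of non-negative exposure counts (0 = perfectly equal)."""
    xs = sorted(exposures)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one category with non-zero exposure")
    # Rank-weighted sum over ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


uniform = gini([10, 10, 10, 10])      # equal exposure across categories
max_5 = gini([0, 0, 0, 0, 1])         # all exposure in 1 of 5 categories
max_24 = gini([0] * 23 + [1])         # all exposure in 1 of 24 categories

print(uniform, max_5, max_24)
```

With this formula the maximum for `n` categories is `(n - 1) / n`, so a value computed over 5 categories and one computed over 24 categories sit on slightly different scales.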

### Conclusion

Among the recommendation models evaluated, SVD++ demonstrates the best overall balance between predictive accuracy and fairness, performing consistently well across both MovieLens and E-commerce datasets. Its model-based approach provides more stable recommendations with lower variance and more equitable treatment across demographic groups. ItemKNN shows stronger performance in certain metrics for the E-commerce dataset, particularly in accuracy (RMSE: 0.9094 vs SVD++'s 1.1357), but exhibits extreme disparities in fairness measures like Bottom-N Individual Fairness (67.8% utility gap for bottom 1% of users).

The inability to apply UserKNN to the E-commerce dataset highlights the challenges posed by sparse, high-dimensional commercial data, where user-similarity computations become ineffective or computationally prohibitive. This limitation reinforces the importance of algorithm selection based on dataset characteristics.

Gender-related disparities persist across both datasets, with female users consistently experiencing higher error rates and lower recommendation quality. This reinforces the need to integrate fairness-aware design into recommendation pipelines. All models exhibit some sensitivity to gender-based differences, but the distributional evidence is mixed: on MovieLens, the RMSE distribution divergence is higher for females (SVD++ N1-norm F: 0.4269 vs. M: 0.1557), whereas on E-commerce the SVD++ divergence is higher for males (RMSE N1-norm M: 5.96 vs. F: 5.04; RMSE KL divergence M: 0.4731 vs. F: 0.3203).
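
For reference, a distribution divergence of this kind can be measured with the KL divergence between two discrete distributions, for example a group's binned error histogram against a reference histogram. This is a minimal sketch only: the source does not describe how the distributions were binned or which reference distribution was used, and the function name `kl_divergence` and the toy histograms are illustrative assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two discrete distributions given as weights."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for p_i, q_i in zip(p, q):
        p_i, q_i = p_i / p_sum, q_i / q_sum  # normalize to probabilities
        if p_i > 0:
            # eps guards against division by zero when Q has an empty bin.
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


# Toy two-bin histograms (hypothetical values).
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
forward = kl_divergence([0.5, 0.5], [0.9, 0.1])
reverse = kl_divergence([0.9, 0.1], [0.5, 0.5])

print(same, forward, reverse)
```

KL divergence is asymmetric, so per-group values are only comparable when the same reference distribution and direction are used for every group.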

The structure and information richness of the dataset play critical roles in how fairness manifests and can be measured. MovieLens, with its well-structured ratings and clear demographics, provides a more controlled environment for fairness evaluation. The E-commerce dataset, with its diverse product categories and broader gender identities, offers insights into more complex real-world scenarios but introduces additional challenges in fairness assessment.