Enhancing Intersectional Fairness in Education-Focused Machine Learning Using Synthetic Data
- Python 3.9+
- Packages: pandas, numpy, scikit-learn, aif360, ctgan, sdgx, statsmodels
- For LLM generation, set OPENAI_API_KEY (used by generate_sdg.generate_by_llm).
- dataset_folder: https://zenodo.org/records/17933909. Test data is expected at <dataset_folder>/<dataset_name>/test_<dataset_name>.csv.
- Generated synthetic files are merged into the file name passed via --merged-output-file-name.
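
The layout described above can be checked before launching a run. The following is a minimal sketch, not part of the repository: the helper names `expected_test_csv` and `preflight` are hypothetical, and the only facts taken from this README are the test CSV path pattern and the OPENAI_API_KEY requirement for LLM generation.

```python
import os
from pathlib import Path


def expected_test_csv(dataset_folder: str, dataset_name: str) -> Path:
    """Build <dataset_folder>/<dataset_name>/test_<dataset_name>.csv."""
    return Path(dataset_folder) / dataset_name / f"test_{dataset_name}.csv"


def preflight(dataset_folder: str, dataset_name: str, generator: str) -> list[str]:
    """Return a list of readable problems; an empty list means ready to run."""
    problems = []
    test_csv = expected_test_csv(dataset_folder, dataset_name)
    if not test_csv.is_file():
        problems.append(f"missing test data: {test_csv}")
    # Only the LLM generator needs the API key, per the requirements above.
    if generator == "LLM" and not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set (needed for LLM generation)")
    return problems


if __name__ == "__main__":
    # Hypothetical values; substitute your own dataset folder and name.
    for problem in preflight("./dataset", "student_dropout", "LLM"):
        print(problem)
```

Running this script prints one line per detected problem and nothing when the test CSV and, for LLM generation, the API key are in place.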
python fairedu_plus.py \
--dataset-name student_dropout \
--dataset-folder /home/ad/m4do/proj/fairedu_plus/original_dataset \
--generator LLM \
--merged-output-file-name merged_output.csv \
--seed 42

- --dataset-name: one of student_dropout, student_oulad, student_performance, DNU
- --dataset-folder: path to the dataset root used to locate the test CSV
- --generator: choose LLM or CTGAN
- --merged-output-file-name: name for the merged synthetic output
- --run-splitted-file / --no-run-splitted-file: choose split vs. combined training files
- --seed: random seed for reproducibility
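
For readers who want to see how flags of this shape behave, here is a rough argparse sketch. It is an illustration only and is not taken from fairedu_plus.py: the flag names and choices come from this README, while every default value and the `build_parser` helper are assumptions. The paired --run-splitted-file / --no-run-splitted-file flags are modeled with `argparse.BooleanOptionalAction`, which is available from Python 3.9.

```python
import argparse

# Dataset names as listed in this README.
DATASETS = ["student_dropout", "student_oulad", "student_performance", "DNU"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags; defaults are assumed."""
    parser = argparse.ArgumentParser(prog="fairedu_plus.py")
    parser.add_argument("--dataset-name", choices=DATASETS, required=True)
    parser.add_argument("--dataset-folder", default="./dataset")  # assumed default
    parser.add_argument("--generator", choices=["LLM", "CTGAN"], default="LLM")  # assumed default
    parser.add_argument("--merged-output-file-name", default="merged_output.csv")  # assumed default
    # Generates both --run-splitted-file and --no-run-splitted-file.
    parser.add_argument(
        "--run-splitted-file",
        action=argparse.BooleanOptionalAction,
        default=True,  # assumed default
    )
    parser.add_argument("--seed", type=int, default=42)  # assumed default
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

With this sketch, passing --no-run-splitted-file sets `args.run_splitted_file` to False, and omitting both forms leaves the assumed default in place.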
- CTGAN without split files:
python fairedu_plus.py --dataset-name student_dropout --generator CTGAN --no-run-splitted-file
- Using the OULAD dataset with an explicit dataset folder:
python fairedu_plus.py --dataset-name student_oulad --dataset-folder ./dataset

If you use this work, please cite the SSRN preprint:
SSRN Scholarly Paper 5290738. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5290738
This work builds upon the following open-source projects: