This project contains a Python script (sis_clean.py) that performs data quality analysis and cleaning on the SIS Faculty dataset.
It prepares the dataset for use in Machine Learning tasks by addressing missing values, invalid identifiers, inconsistent categories, and redundant fields.
- Normalises column names and trims whitespace
- Parses date fields with multiple formats
- Validates and corrects identifiers (ID)
- Drops columns with >95% missing values or constant values
- Standardises qualification labels (e.g., Ph.D → PhD)
- Imputes missing values (mode for categorical, median for numeric)
- Removes duplicate rows
- Saves the cleaned dataset as a new CSV file
- Prints before/after summaries of data quality
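
As a rough illustration of how several of the steps above might look in pandas, here is a minimal sketch. It is not the code in sis_clean.py: the function names (`parse_dates`, `clean_frame`), the `qualification` column name, and the date formats are all assumptions made for the example. The only label mapping used is the Ph.D → PhD one named in this README, and identifier validation is left out because its rules are not described here.

```python
import pandas as pd

# Assumed values for illustration only; the real script may differ.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%b-%Y")
QUALIFICATION_MAP = {"Ph.D": "PhD"}  # the one mapping this README mentions


def parse_dates(series: pd.Series) -> pd.Series:
    """Try each assumed format in turn; values matching none become NaT."""
    parsed = None
    for fmt in DATE_FORMATS:
        attempt = pd.to_datetime(series, format=fmt, errors="coerce")
        # Keep earlier successful parses, fill the gaps with this attempt.
        parsed = attempt if parsed is None else parsed.fillna(attempt)
    return parsed


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a simplified version of the cleaning steps listed above."""
    df = df.copy()

    # Normalise column names: trim, lower-case, spaces to underscores.
    df.columns = (
        df.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )

    # Trim whitespace in text cells; non-string values pass through unchanged.
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

    # Drop columns that are >95% missing or hold at most one distinct value.
    sparse = df.isna().mean() > 0.95
    constant = df.nunique(dropna=True) <= 1
    df = df.loc[:, ~(sparse | constant)].copy()

    # Standardise qualification labels (hypothetical column name).
    if "qualification" in df.columns:
        df["qualification"] = df["qualification"].replace(QUALIFICATION_MAP)

    # Impute: median for numeric columns, mode for everything else.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            modes = df[col].mode(dropna=True)
            if not modes.empty:
                df[col] = df[col].fillna(modes.iloc[0])

    # Remove exact duplicate rows.
    return df.drop_duplicates().reset_index(drop=True)
```

Imputing before removing duplicates means rows that only differed by a missing value can collapse into one; whether that ordering is desirable depends on the dataset.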
- Python 3.9+
- Libraries (pandas, numpy):
  pip install pandas numpy
├── sis_clean.py                 # Python script for cleaning the dataset
├── SIS_Faculty-List.csv         # Raw dataset (input)
├── SIS_Faculty-List_clean.csv   # Cleaned dataset (output, generated by script)
└── README.md                    # Instructions for setup and usage
- Place your raw dataset file (e.g. SIS_Faculty-List.csv) in the same folder as sis_clean.py.
- Open a terminal (or PowerShell on Windows) and navigate to the folder:
  cd path/to/folder
- Run the script with the default input/output names:
  python sis_clean.py
This will:
- Load SIS_Faculty-List.csv
- Generate SIS_Faculty-List_clean.csv in the same folder
- Print before/after data quality metrics in the terminal
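
For a sense of what a before/after summary could contain, here is a small sketch. The metrics chosen (row count, column count, missing cells, duplicate rows) and the function names `quality_summary` and `format_comparison` are assumptions for illustration; the script's actual output may report different figures.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Collect a few simple data quality counts (assumed metrics)."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def format_comparison(before: dict, after: dict) -> str:
    """Lay out two summaries side by side as plain text."""
    lines = [f"{'metric':<16}{'before':>8}{'after':>8}"]
    for key in before:
        lines.append(f"{key:<16}{before[key]:>8}{after[key]:>8}")
    return "\n".join(lines)
```

In a workflow like the one above, `quality_summary` would be called once on the frame returned by `pd.read_csv("SIS_Faculty-List.csv")` and once on the cleaned frame, with the result of `format_comparison` passed to `print`.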