prime97/CSCK503_Assignment_1

SIS Faculty List — Data Quality Cleaning

This project contains a Python script (sis_clean.py) to perform data quality analysis and cleaning on the SIS Faculty dataset.
It prepares the dataset for use in Machine Learning tasks by addressing missing values, invalid identifiers, inconsistent categories, and redundant fields.


Features

  • Normalises column names and trims whitespace
  • Parses date fields with multiple formats
  • Validates and corrects identifiers (ID)
  • Drops columns that are more than 95% missing or that hold a single constant value
  • Standardises qualification labels (e.g., Ph.D → PhD)
  • Imputes missing values (mode for categorical, median for numeric)
  • Removes duplicate rows
  • Saves the cleaned dataset as a new CSV file
  • Prints before/after summaries of data quality
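
The structural steps in the list above (column normalisation, dropping uninformative columns, imputation, de-duplication) can be sketched roughly as follows. This is a minimal illustration, not the code in sis_clean.py: the function names, the sample column names, and the exact definition of "constant" are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def normalise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Snake-case the column names and trim whitespace in text cells."""
    out = df.copy()
    out.columns = (
        out.columns.astype(str)
        .str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
    )
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
            out[col] = s.map(lambda v: v.strip() if isinstance(v, str) else v)
    return out


def drop_uninformative_columns(df: pd.DataFrame, max_missing: float = 0.95) -> pd.DataFrame:
    """Drop columns that are mostly missing or that never vary."""
    missing_share = df.isna().mean()
    too_sparse = list(missing_share[missing_share > max_missing].index)
    # Assumption: "constant" means exactly one distinct value, counting NaN as a value.
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    to_drop = sorted(set(too_sparse) | set(constant))
    return df.drop(columns=to_drop)


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the median and other gaps with the mode."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if not s.isna().any():
            continue
        if pd.api.types.is_numeric_dtype(s):
            out[col] = s.fillna(s.median())
        else:
            modes = s.mode(dropna=True)
            if not modes.empty:
                out[col] = s.fillna(modes.iloc[0])
    return out


# Hypothetical sample data; the real dataset's columns may differ.
raw = pd.DataFrame(
    {
        " Name ": [" Ann ", "Bob", "Bob", "Cy"],
        "Highest Qualification": ["PhD", None, None, "PhD"],
        "Age": [40.0, np.nan, np.nan, 50.0],
        "Campus": ["Main", "Main", "Main", "Main"],  # constant, so it is dropped
    }
)

clean = (
    raw.pipe(normalise_columns)
    .pipe(drop_uninformative_columns)
    .pipe(impute_missing)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(clean)
```

The sketch imputes before removing duplicates, following the order of the feature list; rows that only become identical after imputation are therefore collapsed as well.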

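The value-level features listed above (multi-format date parsing and qualification label standardisation) might look something like the sketch below. The list of date formats, the label mapping, and the function names are all assumptions for illustration; the formats and labels that sis_clean.py actually handles are not documented here.

```python
import re

import pandas as pd

# Assumed candidate formats, tried in order; adjust to the real data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")

# Assumed canonical labels, keyed by the label with case and punctuation removed.
CANONICAL_QUALIFICATIONS = {"phd": "PhD"}


def parse_dates(values: pd.Series, formats=DATE_FORMATS) -> pd.Series:
    """Try each format in turn and keep the first successful parse per cell."""
    parsed = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        attempt = pd.to_datetime(values, format=fmt, errors="coerce")
        parsed = parsed.fillna(attempt)
    return parsed


def standardise_qualification(label):
    """Map variants such as 'Ph.D' or ' PHD ' to one canonical spelling."""
    if not isinstance(label, str):
        return label  # leave missing values untouched
    key = re.sub(r"[^a-z]", "", label.lower())
    return CANONICAL_QUALIFICATIONS.get(key, label.strip())


dates = pd.Series(["2021-03-05", "05/03/2021", "12-01-2020", "not a date", None])
parsed = parse_dates(dates)
print(parsed)

labels = pd.Series(["Ph.D", " PHD ", "Masters", None])
print(labels.map(standardise_qualification))
```

Values that match none of the formats come back as `NaT`, so they can be counted as missing and handled by the imputation step rather than silently guessed.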
Requirements

  • Python 3.9+
  • Libraries:
    pip install pandas numpy
    

Project Structure

    ├── sis_clean.py                  # Python script for cleaning the dataset
    ├── SIS_Faculty-List.csv          # Raw dataset (input)
    ├── SIS_Faculty-List_clean.csv    # Cleaned dataset (output, generated by script)
    └── README.md                     # Instructions for setup and usage

How to Use

  1. Place your raw dataset file (e.g. SIS_Faculty-List.csv) in the same folder as sis_clean.py.

  2. Open a terminal (or PowerShell on Windows) and navigate to the folder:

    cd path/to/folder
    
  3. Run the script with the default input/output names:

    python sis_clean.py
    

This will:

  • Load SIS_Faculty-List.csv

  • Generate SIS_Faculty-List_clean.csv in the same folder

  • Print before/after data quality metrics in the terminal
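
As a rough idea of what a before/after data quality report can contain, the sketch below computes a few basic metrics with pandas. The chosen metrics, the function names, and the sample data are assumptions; the script's actual output may report different figures.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Collect a few simple data quality metrics for one DataFrame."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def before_after_report(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Put the two summaries side by side, one metric per row."""
    return pd.DataFrame(
        {"before": quality_summary(before), "after": quality_summary(after)}
    )


# Hypothetical data standing in for the raw and cleaned CSV files.
before = pd.DataFrame(
    {"id": [1, 2, 2, None], "dept": ["CS", "IT", "IT", None]}
)
after = before.dropna().drop_duplicates()

report = before_after_report(before, after)
print(report.to_string())
```

With the real files, `before` and `after` would come from `pd.read_csv("SIS_Faculty-List.csv")` and `pd.read_csv("SIS_Faculty-List_clean.csv")` respectively.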

About

Machine Learning in Practice; 1st assignment
