# CleanOps
### A Lightweight Toolkit for Dataset Inspection and Cleaning 

**Presented by:**  
Angni, Sodais M. / Magtrayo, Harold Hope / Odchigue, Jave Melchor P. / Padillo, Reymart / Ruiz, Rynzo Rapheal R.

## Project Overview

CleanOps is a lightweight toolkit that simplifies dataset preparation.

It provides tools for:

- Missing value detection  
- Duplicate row detection  
- Outlier detection  
- Automatic cleaning  
- Data organization  
- Multi-format exporting  
- Report generation  
- End-to-end pipelines  

**Purpose:** Make preprocessing fast and repeatable.


## What's Inside the Project
**data_getter.py**

- Loads CSV and text files using pathlib

- Encapsulation: Uses protected attribute _base_path to hide implementation details
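
The loader described above can be pictured with a minimal sketch. This is not the CleanOps source: the class name `FolderReader` and its methods `read_text` and `read_rows` are hypothetical, and the sketch uses only the standard library. It shows the general idea of keeping the base folder in a protected attribute so that callers pass file names only.

```python
import csv
import tempfile
from pathlib import Path


class FolderReader:
    """Hypothetical stand-in for a DataGetter-style loader (not the CleanOps source)."""

    def __init__(self, base_path):
        # Protected by convention: callers pass file names, not full paths.
        self._base_path = Path(base_path)

    def _resolve(self, filename):
        """Join the file name onto the base folder and check that it exists."""
        path = self._base_path / filename
        if not path.is_file():
            raise FileNotFoundError(f"No such file: {path}")
        return path

    def read_text(self, filename):
        """Return the whole file as one string."""
        return self._resolve(filename).read_text(encoding="utf-8")

    def read_rows(self, filename):
        """Return CSV rows as a list of dicts keyed by the header row."""
        with self._resolve(filename).open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))


# Usage with a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "sample.csv").write_text("id,price\n1,10\n2,12\n", encoding="utf-8")
    reader = FolderReader(folder)
    rows = reader.read_rows("sample.csv")
    text = reader.read_text("sample.csv")

print(rows)
```

Because the folder lives in `_base_path`, the storage location can change without touching any code that calls the reader.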

**data_preprocessor.py**

- DataInspector – detects missing values, duplicates, and outliers

    - Encapsulation: _data and _issues are protected; accessed via methods

- DataCleaner – inherits from DataInspector (Inheritance)

    - Overrides some methods to actually fix data (Polymorphism)

    - Logs all fixes

- DataOrganizer – standalone class for sorting rows/columns

    - Uses protected _data attribute (Encapsulation)
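
The inspector/cleaner relationship above can be sketched roughly as follows. The names `MiniInspector`, `MiniCleaner`, `find_duplicates`, and `handle_duplicates` are hypothetical and chosen for illustration; rows are assumed to be plain tuples, and the real CleanOps classes may be structured differently.

```python
class MiniInspector:
    """Hypothetical inspector: reports problems but never changes the rows."""

    def __init__(self, rows):
        self._data = list(rows)  # protected: read through methods
        self._issues = {}

    def find_duplicates(self):
        """Count rows that repeat an earlier row."""
        seen, repeats = set(), 0
        for row in self._data:
            if row in seen:
                repeats += 1
            seen.add(row)
        self._issues["duplicates"] = repeats
        return repeats

    def handle_duplicates(self):
        """Base behaviour: report only."""
        return f"{self.find_duplicates()} duplicate row(s) found"

    def get_data(self):
        """Return a copy so callers cannot mutate the protected list."""
        return list(self._data)


class MiniCleaner(MiniInspector):
    """Hypothetical cleaner: same interface, but the override fixes the data."""

    def __init__(self, rows):
        super().__init__(rows)
        self._fix_log = []

    def handle_duplicates(self):
        """Override: drop repeats (keeping first occurrences) and log the fix."""
        removed = self.find_duplicates()
        self._data = list(dict.fromkeys(self._data))  # order-preserving de-duplication
        message = f"{removed} duplicate row(s) removed"
        self._fix_log.append(message)
        return message

    def get_fix_log(self):
        return list(self._fix_log)


rows = [("a", 1), ("b", 2), ("a", 1)]
# Same call, different behaviour depending on the class (polymorphism).
results = [tool.handle_duplicates() for tool in (MiniInspector(rows), MiniCleaner(rows))]
print(results)
```

The calling code never checks which class it holds; the subclass decides whether the call only reports or also repairs.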

**data_output.py**

- DataExporter – exports cleaned data to CSV, Excel, and JSON

- ReportGenerator – creates TXT summary reports

- DataOutput – aggregates DataExporter and ReportGenerator (Composition)
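
Composition here means one object owns the others and delegates to them. A minimal sketch, assuming rows are a non-empty list of dicts: `MiniExporter`, `MiniReporter`, `MiniOutput`, and `save_all` are hypothetical names, Excel export is left out to stay within the standard library, and none of this is taken from the CleanOps source.

```python
import csv
import json
import tempfile
from pathlib import Path


class MiniExporter:
    """Hypothetical exporter: writes rows (a non-empty list of dicts) to disk."""

    def __init__(self, rows):
        self._rows = rows

    def to_csv(self, path):
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(self._rows[0]))
            writer.writeheader()
            writer.writerows(self._rows)

    def to_json(self, path):
        Path(path).write_text(json.dumps(self._rows, indent=2), encoding="utf-8")


class MiniReporter:
    """Hypothetical reporter: builds a plain-text summary."""

    def __init__(self, rows):
        self._rows = rows

    def report(self):
        return f"Rows: {len(self._rows)}\nColumns: {', '.join(self._rows[0])}"


class MiniOutput:
    """Composition: owns an exporter and a reporter and delegates to them."""

    def __init__(self, rows):
        self.exporter = MiniExporter(rows)
        self.reporter = MiniReporter(rows)

    def save_all(self, folder):
        """Write the CSV, JSON, and TXT outputs and return the file names created."""
        folder = Path(folder)
        self.exporter.to_csv(folder / "cleaned.csv")
        self.exporter.to_json(folder / "cleaned.json")
        (folder / "report.txt").write_text(self.reporter.report(), encoding="utf-8")
        return sorted(p.name for p in folder.iterdir())


rows = [{"id": 1, "price": 10.0}, {"id": 2, "price": 12.5}]
with tempfile.TemporaryDirectory() as folder:
    written = MiniOutput(rows).save_all(folder)
print(written)
```

Neither helper class knows about the other, so either can be swapped or tested on its own.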

**data_pipeline.py**

- Runs full workflow: Diagnose → Clean → Log fixes → Export → Generate report

- Uses composition to combine cleaner, exporter, and reporter

- Demonstrates polymorphism: the pipeline works with the cleaner through one interface, whichever dataset it wraps

**Summary of OOP Concepts in CleanOps:**

- Encapsulation: _data, _issues, _base_path

- Inheritance: DataCleaner → DataInspector

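- Polymorphism: Overridden methods in DataCleaner, and the pipeline handling multiple classes uniformly

- Composition: DataOutput combines DataExporter and ReportGenerator; the pipeline combines the cleaner, exporter, and reporter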

# CleanOps Package Demo

This notebook demonstrates **CleanOps**, a Python package for inspecting, cleaning, and exporting datasets.
We will go **step by step**: detecting duplicates, missing values, and outliers, cleaning them, and then exporting the results and generating a report.


## Step 1: Install CleanOps
If you haven't installed CleanOps yet, run this:

In [None]:
!pip install --upgrade cleanops

## Step 2: Import Modules

Import the required classes from CleanOps.

In [None]:
import pandas as pd
import os
from cleanops import (
    DataGetter,
    DataInspector,
    DataCleaner,
    DataExporter,
    ReportGenerator,
    DataPipeline
)

## Step 3: Load Dataset

Load your CSV file using `DataGetter`.

In [None]:
# Replace this with the path to the folder that contains your dataset
getter = DataGetter(r"C:\Users\jorda\Documents\GitHub\CleanOps\datasets")
df = getter.read_csv("hotel_reservations_codeonly.csv")
df.head()

## Step 4: Inspect Data

Detect duplicates, missing values, and outliers using `DataInspector`.

In [None]:
inspector = DataInspector(df)

# Detect duplicates
duplicates = inspector.detect_duplicates()
print("Duplicates Detected:")
for col, val in duplicates.items():
    print(f"- {col}: {val}")

# Detect missing values
missing = inspector.detect_missing()
print("\nMissing Values Detected:")
for col, val in missing.items():
    if val > 0:
        print(f"- {col}: {val} missing")

# Detect outliers
outliers = inspector.detect_outliers()
print("\nOutliers Detected:")
for col, val in outliers.items():
    if val > 0:
        print(f"- {col}: {val} outliers")
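
For readers who want to see what checks of this kind can look like underneath, here is a sketch in plain pandas. It is an assumption, not the CleanOps implementation: this document does not say which outlier rule `DataInspector` applies, so the common 1.5 × IQR rule is used as a stand-in, and the function names `count_missing`, `count_duplicate_rows`, and `count_outliers` are hypothetical.

```python
import pandas as pd


def count_missing(df):
    """Missing values per column."""
    return df.isna().sum().to_dict()


def count_duplicate_rows(df):
    """Rows that exactly repeat an earlier row."""
    return int(df.duplicated().sum())


def count_outliers(df, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    counts = {}
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return counts


# Small made-up frame: one repeated row, one gap, one extreme price.
sample = pd.DataFrame(
    {
        "price": [10.0, 11.0, 12.0, 11.0, 500.0, 13.0],
        "nights": [1.0, 2.0, None, 2.0, 3.0, 4.0],
    }
)
print(count_missing(sample))
print(count_duplicate_rows(sample))
print(count_outliers(sample))
```

Missing values never count as outliers in this sketch, because comparisons against `NaN` evaluate to false.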

## Step 5: Clean Data

Remove duplicates, then treat missing values and outliers using `DataCleaner`.

In [None]:
cleaner = DataCleaner(df)
# Automatically remove duplicates per column
cleaner.fix_duplicates()

# Now treat only missing values and outliers
cleaner.treat(treat_duplicates=False)

# Show fix log
print("\nFix Log:")
for log in cleaner.get_fix_log():
    print("-", log)

# Display cleaned data (this reads the protected _data attribute directly)
cleaner._data.head()
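
To make the idea of "treating" data concrete, here is one possible cleaning pass in plain pandas. The strategies are assumptions chosen for illustration (median fill for missing numeric values, clipping to the 1.5 × IQR fences for outliers); this document does not state what `DataCleaner.treat` does internally, and `clean_frame` is a hypothetical helper.

```python
import pandas as pd


def clean_frame(df, k=1.5):
    """Return a cleaned copy of df plus a log of what changed (illustrative only)."""
    log = []
    cleaned = df.drop_duplicates().reset_index(drop=True)
    log.append(f"Removed {len(df) - len(cleaned)} duplicate row(s)")

    for col in cleaned.select_dtypes(include="number").columns:
        # Assumed strategy: fill gaps with the column median.
        n_missing = int(cleaned[col].isna().sum())
        if n_missing:
            cleaned[col] = cleaned[col].fillna(cleaned[col].median())
            log.append(f"Filled {n_missing} missing value(s) in '{col}' with the median")

        # Assumed strategy: clip values to the k*IQR fences.
        q1, q3 = cleaned[col].quantile(0.25), cleaned[col].quantile(0.75)
        low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
        n_out = int(((cleaned[col] < low) | (cleaned[col] > high)).sum())
        if n_out:
            cleaned[col] = cleaned[col].clip(lower=low, upper=high)
            log.append(f"Clipped {n_out} outlier(s) in '{col}'")

    return cleaned, log


# Small made-up frame: one repeated row, one gap, one extreme price.
raw = pd.DataFrame(
    {
        "price": [10.0, 11.0, 12.0, 11.0, 500.0, 13.0],
        "nights": [1.0, 2.0, None, 2.0, 3.0, 4.0],
    }
)
cleaned, fix_log = clean_frame(raw)
for entry in fix_log:
    print("-", entry)
```

The function returns a new frame and leaves `raw` unchanged, which keeps the original available for comparison against the fix log.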

## Step 6: Export Cleaned Data

Save the cleaned dataset in multiple formats using `DataExporter`.

In [None]:
exporter = DataExporter(cleaner._data)
exporter.to_csv("cleaned_data.csv")
exporter.to_excel("cleaned_data.xlsx")
exporter.to_json("cleaned_data.json")
print("Cleaned data exported successfully.")

## Step 7: Generate a Report

Summarize all data issues and export a report using `ReportGenerator`.

In [None]:
reporter = ReportGenerator(cleaner._data)
report_summary = reporter.report()
reporter.export_report("cleaning_report.txt")
print("Report generated.")
print(report_summary)

## Step 8: Run Full Pipeline

Combine cleaning, exporting, and reporting in a single `DataPipeline`.

In [None]:
pipeline = DataPipeline(cleaner=cleaner, exporter=exporter, reporter=reporter)
pipeline.run()


# Demo Complete

You have successfully inspected, cleaned, exported, and reported on your dataset using CleanOps!