# CSI4142 - Group 48 - Assignment 2 - Part 1

---

## Introduction
In this report, we explore methods to identify and address common data quality issues, focusing on ten types of errors: data type errors, range errors, format errors, consistency errors, uniqueness errors, presence errors, length errors, look-up errors, exact duplicate errors, and near duplicate errors. By implementing our Clean Data Checker, we provide an automated approach to detecting these errors, allowing users to specify validation rules and parameters.

Data quality is a critical factor in ensuring the reliability and usability of information stored in databases. Poor data quality can lead to incorrect analyses, flawed decision-making, and inefficiencies in various domains, including business, healthcare, and research. As organizations increasingly rely on large datasets, maintaining high-quality data through systematic validation and cleaning techniques becomes essential.

Our analysis uses the Café Sales dataset and an altered version of it. The heading of each test states which of the two is used.



#### Group 48 Members
- Ali Bhangu - 300234254
- Justin Wang - 300234186

<br>

---

## Dataset Descriptions

### Café Sales Dataset

- **Dataset Name:** Dirty Café Sales Dataset
- **Author:** Ahmed Mohamed (Kaggle)
- **Purpose:** This dataset was created for data cleaning training, containing real-world transactional data with common data quality issues such as missing values, duplicates, and inconsistent formats.

#### Dataset Shape
- **Rows:** 10,000
- **Columns:** 8

#### Features & Descriptions
| Feature Name       | Data Type  | Category    | Description |
|--------------------|-----------|------------|-------------|
| `Transaction ID`  | String     | Categorical | Unique identifier for each transaction |
| `Item`            | String     | Categorical | Name of the purchased item |
| `Quantity`        | String    | Numerical   | Number of units purchased |
| `Price Per Unit`  | Float      | Numerical   | Cost per single unit of the item |
| `Total Spent`     | Float      | Numerical   | Total amount spent on the transaction (Quantity × Price Per Unit) |
| `Payment Method`  | String     | Categorical | Payment type (e.g., Cash, Credit Card) |
| `Location`        | String     | Categorical | Café branch where the transaction took place |
| `Transaction Date`| String       | Categorical | Date when the transaction occurred |

---

In [205]:
# Importing the required Python libraries
import numpy as npy
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from fuzzywuzzy import fuzz
import os
import re

In [111]:
#Download Function for the Cafe Dataset

# Define paths
zip_path = "cafe-sales-dirty-data.zip"
csv_path = "dirty_cafe_sales.csv"

# Delete existing CSV if present
if os.path.exists(csv_path):
    print(f"Existing {csv_path} found. Deleting and re-extracting...")
    os.remove(csv_path)

# Download dataset using curl (Bash command in Jupyter Notebook)
!curl -L -o {zip_path} https://www.kaggle.com/api/v1/datasets/download/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training

# Extract the ZIP file in the current folder
print("Extracting dataset...")
!unzip -o {zip_path} -d .

# Verify that the CSV exists after extraction
if not os.path.exists(csv_path):
    raise FileNotFoundError(f"Dataset not found: {csv_path}. Ensure the ZIP file was correctly extracted.")

# Load dataset
cafeSet = pd.read_csv(csv_path)
print("Dataset loaded successfully.")
cafeSet.head()

Existing dirty_cafe_sales.csv found. Deleting and re-extracting...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  110k  100  110k    0     0   234k      0 --:--:-- --:--:-- --:--:--  234k
Extracting dataset...
Archive:  cafe-sales-dirty-data.zip
  inflating: ./dirty_cafe_sales.csv  
Dataset loaded successfully.


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
2,TXN_4271903,Cookie,4,1.0,ERROR,Credit Card,In-store,2023-07-19
3,TXN_7034554,Salad,2,5.0,10.0,UNKNOWN,UNKNOWN,2023-04-27
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11


In [172]:
# Setting up the altered dataset for the Cafe Sales Dataset

# Changes to the dataset:
# 1. Duplicated Transaction ID Values
# 2. Duplicated some rows of the dataset for the exact duplicate check. 
alteredCafeSet = pd.read_csv("altered_dirty_cafe_sales.csv")
alteredCafeSet.head()

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,Ali,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
2,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
3,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
4,TXN_4271903,Cookie,4,1.0,ERROR,Credit Card,In-store,2023-07-19


---
## Data Type Error - Using Cafe Dataset
Our first check is a data type check, which verifies that every value in a column matches the data type expected for that column.
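
To make the rule concrete, here is a minimal plain-Python sketch of how a single cell can be classified. It assumes raw values arrive as strings or numbers and parses them with `ast.literal_eval`; the helper name `matches_type` and the sample values are illustrative and are not part of the notebook.

```python
import ast
import math


def matches_type(value, expected_type):
    """Return True if `value` parses as a Python literal of `expected_type`.

    Hypothetical helper, for illustration only.
    """
    # None and float NaN are treated as missing, which counts as a type error here.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return False
    try:
        parsed = ast.literal_eval(str(value))
    except (ValueError, SyntaxError):
        # Text such as 'ERROR' or 'UNKNOWN' is not a valid Python literal.
        return False
    return isinstance(parsed, expected_type)


samples = ["2.0", "4", "ERROR", None, 3.5]
print([matches_type(v, float) for v in samples])
# [True, False, False, False, True]
```

Note that `"4"` is rejected because it parses as an `int`. Whether integers should be accepted in a float column is a design choice; passing `(int, float)` as `expected_type` would accept both.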

### How To Use:
1. Enter and run parameters
2. Run function with comment "Data Type Test" 
3. See the results below the code block mentioned above

### Parameters: 

In [None]:
attributes = ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']
print(cafeSet.dtypes)
# Please enter the various attributes below to perform the data cleaning process on the dataset:
 
# Input your column from the above list. 
testColumn = attributes[3]
# Change the expected data type of the column
expectedType = float

Transaction ID      float64
Item                 object
Quantity             object
Price Per Unit      float64
Total Spent          object
Payment Method       object
Location             object
Transaction Date     object
dtype: object


In [None]:
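# Data Type Test
import ast  # safe literal parsing, used instead of eval

def data_type_checker(df, column, expected_type):
    # Parse each value as a Python literal and test its type; nothing is converted in place.
    def is_expected_type(value):
        if pd.isna(value):
            return False  # missing values are counted as type errors
        try:
            return isinstance(ast.literal_eval(str(value)), expected_type)
        except (ValueError, SyntaxError, TypeError):
            return False  # not a valid literal, e.g. 'ERROR' or 'UNKNOWN'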

    # This bit identifies the incorrect entries, making a new dataframe. 
    incorrect_types = df[~df[column].apply(is_expected_type)]

    # This right here controls the output for the reader of our report to see and understand. 
    print(f"Checking column: {column} (Expected type: {expected_type.__name__})")
    if incorrect_types.empty:
        print(f"The Data Type Checker suggests all values in '{column}' match the expected data type.")
    else:
        # This outputs using the values set as parameters in the sentence. 
        print(f"The Data Type Checker found {len(incorrect_types)} incorrect entries in '{column}'. \nFor Example, here are some of the problem entries:")
        display(incorrect_types[[column]].head(5))  # Here we showcase some of the incorrect entries for the user.

    return incorrect_types

# This starts the program and runs the function
data_type_checker(cafeSet, testColumn, expectedType)

Checking column: Price Per Unit (Expected type: float)
The Data Type Checker found 533 incorrect entries in 'Price Per Unit'. 
For Example, here are some of the problem entries:


Unnamed: 0,Price Per Unit
56,
65,
68,
85,
104,


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
56,,Cake,5.0,,15.0,,Takeaway,2023-06-27
65,,Sandwich,3.0,,,,In-store,2023-10-20
68,,Salad,2.0,,10.0,,In-store,2023-10-27
85,,Tea,3.0,,4.5,Cash,UNKNOWN,2023-10-29
104,,Juice,2.0,,6.0,,,
...,...,...,...,...,...,...,...,...
9924,,Juice,2.0,,6.0,Digital Wallet,,2023-12-24
9926,,Cake,4.0,,12.0,Digital Wallet,Takeaway,2023-11-09
9961,,Tea,2.0,,3.0,Cash,,2023-12-29
9996,,,3.0,,3.0,Digital Wallet,,2023-06-02


---

## Range Error - Using Cafe Sales Dataset

In this test, we determine whether the numerical values in a specific column fall within a given range. The range is defined by the minimum and maximum values the attribute can take, and any value outside it is deemed incorrect.
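
As a rough illustration of the rule, the sketch below labels individual raw values against a range. It is a plain-Python stand-in rather than the notebook's DataFrame-based checker, and it reports values that cannot be parsed as numbers in their own bucket instead of skipping them. The function name `classify_range` and the sample values are assumptions made for this example.

```python
import math


def classify_range(value, minimum, maximum):
    """Label a raw value as 'ok', 'below', 'above', or 'not numeric'."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # None, 'ERROR', 'UNKNOWN' and similar text cannot be range-checked.
        return "not numeric"
    if math.isnan(number):
        return "not numeric"
    if number < minimum:
        return "below"
    if number > maximum:
        return "above"
    return "ok"


samples = ["1.0", "3.0", "5.0", "ERROR", None]
labels = [classify_range(v, 2, 4) for v in samples]
print(labels)
# ['below', 'ok', 'above', 'not numeric', 'not numeric']
```

Both bounds are inclusive here, so a value equal to the minimum or the maximum is accepted.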

### How To Use:
1. Enter parameters in the code block below
2. Then run the code block. 
3. After that run the function annotated with the comment "Range Checker Test" and see the results outputted. 

### Parameters: 

In [57]:
# List of NUMERICAL columns from the dataset, please select below which one you would like to perform the range check on:
attributes = ['Quantity', 'Price Per Unit', 'Total Spent']

# Please enter the column you would like to run the range check on:
test_attribute = attributes[1]

# Please specify the minimum and maximum values for the range check: 
minimum = 2
maximum = 4

In [None]:
# Range Checker Test 
def range_checker(df, column, minimum, maximum):
    # Convert the column to numeric; values that cannot be parsed become NaN and are skipped below.
    numeric_col = pd.to_numeric(df[column], errors='coerce')
    
    # Filters here using the values inputted by the user to then comb through the dataset. 
    below_min = df.loc[(numeric_col < minimum) & numeric_col.notna()]
    above_max = df.loc[(numeric_col > maximum) & numeric_col.notna()]
    
    # these variables will help with the output statements 
    total_below = below_min.shape[0]
    total_above = above_max.shape[0]
    
    # General Print Statement: 
    print(f"There are {total_below} data points with {column} less than {minimum}, and {total_above} data points with {column} over {maximum}.")
    
    # Specific statements to print examples:
    if total_below > 0:
        print("\nFor Example: Rows with values below minimum:")
        print(below_min.head(2))  # Show first 2 rows
    
    if total_above > 0:
        print("\nFor Example: Rows with values above maximum:")
        print(above_max.head(2))  # Show first 2 rows

# Running the function to showcase usage
range_checker(cafeSet, test_attribute, minimum, maximum)

There are 2276 data points with Price Per Unit less than 2, and 1204 data points with Price Per Unit over 4.

Examples below minimum:
   Transaction ID    Item Quantity Price Per Unit Total Spent Payment Method  \
2     TXN_4271903  Cookie        4            1.0       ERROR    Credit Card   
13    TXN_9437049  Cookie        5            1.0         5.0            NaN   

    Location Transaction Date  
2   In-store       2023-07-19  
13  Takeaway       2023-06-01  

Examples above maximum:
   Transaction ID   Item Quantity Price Per Unit Total Spent Payment Method  \
3     TXN_7034554  Salad        2            5.0        10.0        UNKNOWN   
10    TXN_2548360  Salad        5            5.0        25.0           Cash   

    Location Transaction Date  
3    UNKNOWN       2023-04-27  
10  Takeaway       2023-11-07  


---
## Format Errors - Using Cafe Sales Dataset

In this section, we test for format errors. A format check ensures that the data is in an acceptable format, for example that dates are written as YYYY-MM-DD or DD-MM-YYYY. If the format is violated, the checker prints a summary and some of the violating rows.
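
One limitation worth keeping in mind is that a pattern such as `\d{4}-\d{2}-\d{2}` only checks the shape of the text, not whether it is a real calendar date. The sketch below, which is a plain-Python illustration and not part of the notebook, combines the shape check with `datetime.strptime`. The helper names and sample strings are assumptions made for this example.

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def has_date_shape(text):
    """True if the whole string looks like YYYY-MM-DD."""
    return DATE_SHAPE.fullmatch(text) is not None


def is_valid_date(text):
    """True if the string has the YYYY-MM-DD shape and is a real calendar date."""
    if not has_date_shape(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


samples = ["2023-09-08", "2023-13-45", "ERROR"]
print([has_date_shape(s) for s in samples])  # [True, True, False]
print([is_valid_date(s) for s in samples])   # [True, False, False]
```

The second sample passes the regex but fails the calendar check, which is the kind of value a purely pattern-based format check would let through.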

### How To Use:
1. Enter Parameters in Code Block below
2. Enter Regex Pattern for the Format Pattern 
3. Run the Parameters Code Block
4. Run the code block, annotated with "Format Check Test". 

### Parameters: 

In [None]:
# Please enter the various attributes below to perform the data cleaning process on the dataset. 
attributes = ['Transaction Date', 'Transaction Time', 'Card Number', 'Transaction ID']

# Input your column 
column = attributes[0]

# Please enter the regex pattern you would like to check for
pattern = r'^\d{4}-\d{2}-\d{2}$'

In [None]:
# Format Check Test: 

def format_checker(df, column, pattern):
    # creating the regex pattern
    regex = re.compile(pattern)
    
    # Applying the regex pattern to the column and filters the rows that don't match
    mismatched_rows = df[~df[column].astype(str).apply(lambda x: bool(regex.match(x)))]
    total_mismatched = mismatched_rows.shape[0]
    
    # Printing the results of the format check. 
    print(f"There are {total_mismatched} data points in {column} that do not match the format {pattern}. \nSee below for examples if there are mismatched rows:")
    
    # Outputting the mismatched rows for the user to see
    if total_mismatched > 0:
        print("\nHere are some of the rows of mismatched format:")
        print(mismatched_rows.head(3))  

# Running the function with parameters defined above:
format_checker(cafeSet, column, pattern)

There are 460 data points in Transaction Date that do not match the format ^\d{4}-\d{2}-\d{2}$. 
See below for examples if there are mismatched rows:

Here are some of the rows of mismatched format:
   Transaction ID      Item  Quantity  Price Per Unit  Total Spent  \
11    TXN_3051279  Sandwich       2.0             4.0          8.0   
29    TXN_7640952      Cake       4.0             3.0         12.0   
33    TXN_7710508   UNKNOWN       5.0             1.0          5.0   

    Payment Method  Location Transaction Date  expected_value  expected_total  
11     Credit Card  Takeaway            ERROR             8.0             8.0  
29  Digital Wallet  Takeaway            ERROR            12.0            12.0  
33            Cash       NaN            ERROR             5.0             5.0  


---
## Consistency Errors - Using Cafe Sales Data Set

In this section we run a consistency check, a logical check that ensures related values agree with one another and make sense. An example is checking that the date a show was added is after its release date. The test is in the code block annotated with "Consistency Checker Test".
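
When reading the results of a check such as Quantity × Price Per Unit = Total Spent, it helps to remember that a missing value (NaN) never compares equal to anything, so rows with a missing operand are counted as inconsistent by a plain `!=` comparison. The sketch below is a plain-Python illustration of separating the two cases; the function name `check_total`, the tolerance, and the sample rows are assumptions made for this example.

```python
import math


def check_total(quantity, price, total, tolerance=1e-9):
    """Return 'missing', 'consistent', or 'inconsistent' for one row."""
    values = (quantity, price, total)
    # A row with any missing operand cannot be verified either way.
    if any(v is None or math.isnan(v) for v in values):
        return "missing"
    # Compare with a small tolerance to avoid floating-point noise.
    if math.isclose(quantity * price, total, abs_tol=tolerance):
        return "consistent"
    return "inconsistent"


rows = [(4.0, 1.0, float("nan")), (2.0, 1.5, 3.0), (2.0, 2.0, 5.0)]
results = [check_total(*row) for row in rows]
print(results)
# ['missing', 'consistent', 'inconsistent']

# Why missing totals look like errors under a plain comparison:
print(float("nan") != 4.0)  # True
```

Reporting missing operands separately keeps the consistency count focused on rows where all three values are present but disagree.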

### How To Use:
1. Enter Parameters in Code Block below
2. Run the parameters code block 
3. Run the code block for the function that will test the consistency. 

### Parameters: 



In [65]:
# Please enter the various attributes below to perform the data cleaning process on the dataset. 
attributes = ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']

# The consistency checker can either compare two columns with each other, or compare the product of two columns against a third. 
# Example usage: Checking if Quantity x Price Per Unit = Total Spent
test_attribute_1 = attributes[2]
test_attribute_2 = attributes[3]
test_attribute_3 = attributes[4]

# If True, checks if test_attribute_1 == test_attribute_2.
# If False, checks if test_attribute_1 * test_attribute_2 == test_attribute_3.
compare_two_columns = False


In [None]:
# Consistency Checker Test
def consistency_checker(df, column_1, column_2, expected_column=None, compare_two=False):
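    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Convert the specified columns to numeric; values that cannot be parsed become NaN
    df[column_1] = pd.to_numeric(df[column_1], errors='coerce')
    df[column_2] = pd.to_numeric(df[column_2], errors='coerce')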
    
    # Check if comparing two columns or two columns against one
    if compare_two:
        # Compare two columns directly for equality
        inconsistent = df[df[column_1] != df[column_2]]
        # Formatting the string check type for the output
        check_type = f"{column_1} != {column_2}"
    
    # Otherwise, compare the product of the two columns against the expected column
    else:
        # An expected column is required in this mode
        if expected_column is None:
            raise ValueError("You must provide an expected column when compare_two=False.")
        
        # Convert the expected column to numeric, then check whether it equals the product of the two columns
        df[expected_column] = pd.to_numeric(df[expected_column], errors='coerce')
        df['expected_value'] = df[column_1] * df[column_2]
        inconsistent = df[df[expected_column] != df['expected_value']]
        check_type = f"{column_1} * {column_2} != {expected_column}"
    
    # Again, here we display results according to the findings
    if inconsistent.empty:
        print(f"No consistency errors found for check: {check_type}.")
    else:
        print(f"There are {len(inconsistent)} consistency errors for check: {check_type}. See for example the following rows:")
        display_cols = [column_1, column_2]
        if not compare_two:
            display_cols.append(expected_column)
            display_cols.append('expected_value')
        display(inconsistent[display_cols].head())  

    return inconsistent

# Running Function
consistency_checker(cafeSet, test_attribute_1, test_attribute_2, test_attribute_3, compare_two_columns)

There are 1456 consistency errors for check: Quantity * Price Per Unit != Total Spent. See for example the following rows:


Unnamed: 0,Quantity,Price Per Unit,Total Spent,expected_value
2,4.0,1.0,,4.0
20,,4.0,20.0,
25,3.0,4.0,,12.0
31,2.0,1.0,,2.0
42,2.0,1.5,,3.0


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date,expected_value,expected_total
2,TXN_4271903,Cookie,4.0,1.0,,Credit Card,In-store,2023-07-19,4.0,4.0
20,TXN_3522028,Smoothie,,4.0,20.0,Cash,In-store,2023-04-04,,
25,TXN_7958992,Smoothie,3.0,4.0,,UNKNOWN,UNKNOWN,2023-12-13,12.0,12.0
31,TXN_8927252,UNKNOWN,2.0,1.0,,Credit Card,ERROR,2023-11-06,2.0,2.0
42,TXN_6650263,Tea,2.0,1.5,,,Takeaway,2023-01-10,3.0,3.0
...,...,...,...,...,...,...,...,...,...,...
9984,TXN_3142496,Smoothie,,4.0,4.0,Cash,Takeaway,2023-07-27,,
9988,TXN_9594133,Cake,5.0,3.0,,ERROR,,,15.0,15.0
9993,TXN_4766549,Smoothie,2.0,4.0,,Cash,,2023-10-20,8.0,8.0
9996,TXN_9659401,,3.0,,3.0,Digital Wallet,,2023-06-02,,


---

## Uniqueness Errors - Using Altered Cafe Sales Dataset

In this section, we test for uniqueness errors. A uniqueness error occurs when a value that should be unique, such as an ID or an e-mail address, appears more than once in a dataset. Our tester below, annotated with the comment "Uniqueness Checker Test", verifies that no value is entered in the chosen column more than once. 

This section uses an <strong>altered version</strong> of the cafe sales dataset, which contains duplicated transaction IDs and some duplicated rows. 
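
The idea can be sketched in plain Python by counting how often each value occurs and reporting every occurrence of a repeated value, which mirrors the behaviour of `keep=False` in the checker. The function name `find_repeated` and the short list of IDs are assumptions made for this example.

```python
from collections import Counter


def find_repeated(values):
    """Return (position, value) for every occurrence of a value seen more than once."""
    counts = Counter(values)
    return [(i, v) for i, v in enumerate(values) if counts[v] > 1]


ids = ["TXN_1961373", "TXN_4977031", "TXN_1961373", "TXN_4271903"]
repeated = find_repeated(ids)
print(repeated)
# [(0, 'TXN_1961373'), (2, 'TXN_1961373')]
```

Because every occurrence is reported, the number of flagged entries is larger than the number of distinct duplicated IDs; in this sample, two entries are flagged for one repeated ID.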

### How To Use:
1. Input parameters and run code block
2. Run code block with Comment: "Uniqueness Checker Test" 
3. See results 

### Parameters: 

In [None]:
# attributes to choose from: 
attributes = ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']

# Please enter the column you would like to perform the uniqueness check on from the above list: 
testColumn = attributes[0]

# Specify the dataset you would like to perform the uniqueness check on (cafeSet or alteredCafeSet):
dataFrame = alteredCafeSet

In [114]:
# Uniqueness Checker Test
def uniqueness_checker(df, column):
    # Find duplicates in the specified column
    duplicates = df[df.duplicated(subset=[column], keep=False)]
    
    # Output results
    if duplicates.empty:
        print(f"All values in '{column}' are unique.")
    else:
        print(f"Found {len(duplicates)} duplicate entries in the '{column}' column. Here are some examples of the duplicate entries:")
        print(duplicates[[column]].head(len(duplicates)))  
        
    return duplicates

# Example usage:
duplicates = uniqueness_checker(dataFrame, testColumn)

Found 14 duplicate entries in the 'Transaction ID' column. Here are some examples of the duplicate entries:
   Transaction ID
0     TXN_1961373
3     TXN_1961373
10    TXN_2548360
21    TXN_2548360
27    TXN_5695074
36    TXN_6855453
37    TXN_1080432
38    TXN_1080432
43    TXN_5695074
46    TXN_6855453
47    TXN_8078640
48    TXN_8201146
50    TXN_8201146
52    TXN_8078640


---
## Presence Errors - Using Cafe Dataset

This section focuses on presence errors. A presence check ensures that no mandatory field is left blank. Our checker takes a column as input and runs the presence check on that attribute. 
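
A blank field is not always an empty cell; datasets often use placeholder text instead. The plain-Python sketch below shows one way to treat both as missing. The set of placeholders is an assumption made for this example (it includes `error`, which the notebook's checker does not treat as missing), and `is_missing` is a hypothetical helper.

```python
import math

# Assumed placeholder tokens, compared case-insensitively after trimming.
PLACEHOLDERS = {"", "unknown", "error"}


def is_missing(value):
    """True for None, NaN, whitespace-only text, or a known placeholder token."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in PLACEHOLDERS


samples = ["Coffee", "UNKNOWN", None, float("nan"), "ERROR", "  "]
flags = [is_missing(v) for v in samples]
print(flags)
# [False, True, True, True, True, True]
```

Which tokens count as missing depends on the dataset, so the placeholder set should be reviewed before it is reused elsewhere.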

### How To Use: 
1. Input desired attribute from list
2. Run attribute code block
3. Run the code block annotated with this comment: "Presence Checker Test" 

### Parameters:

In [82]:
attributes = ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']

# Please specify the column from the above list:
testColumn = attributes[1]

In [None]:
# Presence Checker Test
def presence_checker(df, column):
    
    # Flag missing values (NaN) and the 'UNKNOWN' placeholder used in this dataset.
    # astype(str) keeps the comparison working on non-string columns as well.
    missing = df[df[column].isna() | (df[column].astype(str).str.lower() == 'unknown')]
    
    # If nothing is missing
    if missing.empty:
        print(f"The results of the presence checker indicate that there are no missing values found in '{column}'.")
    else:
        # Showcase the results in the dataset. 
        print(f"The results of the presence check are as follows: There are {len(missing)} missing values in '{column}'. \nFor Example:")
        print(missing[[column]].head(len(missing)))  
    
    return missing


# Running the function:
missing_values = presence_checker(cafeSet, testColumn)

The results of the presence check are as follows: There are 677 missing values in 'Item'. 
For Example:
         Item
6     UNKNOWN
8         NaN
30        NaN
31    UNKNOWN
33    UNKNOWN
...       ...
9876      NaN
9885      NaN
9946  UNKNOWN
9994  UNKNOWN
9996      NaN

[677 rows x 1 columns]


---
## Length Errors - Using Cafe Dataset
This section covers the length check, which determines whether the right number of characters has been entered into a field. The parameters below can be altered: a user can specify the desired attribute and the `testLength`. 
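
One subtlety with measuring length on stringified values is that a missing value becomes the text `nan`, which is three characters long and can therefore pass a length-3 check by accident. The plain-Python sketch below contrasts a naive check with one that rejects missing values first; both helper names are assumptions made for this example.

```python
import math


def naive_length_ok(value, length):
    """Measure the length of the stringified value, whatever it is."""
    return len(str(value)) == length


def strict_length_ok(value, length):
    """Same check, but missing values never satisfy a length requirement."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return False
    return len(str(value)) == length


missing = float("nan")
print(str(missing), len(str(missing)))  # nan 3
print(naive_length_ok(missing, 3))      # True
print(strict_length_ok(missing, 3))     # False
print(strict_length_ok("4.0", 3), strict_length_ok("12.0", 3))  # True False
```

Whether missing values should fail the length check or be left to the presence check is a design choice; the strict variant simply makes that choice explicit.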

### How To Use
1. Enter parameters and run the code block
2. Run the code block with the comment "Length Checker Test" 
3. See results in the cell below the code block. 

### Parameters:

In [192]:
attributes = ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']

# Select from the list above the column you would like to perform the length check on:
testColumn = attributes[4]

# Enter the length you would like to check for:
testLength = 3

In [None]:
# Length Checker Test: 
def length_checker(df, column, length):
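    # Convert the column to string and compare each value's length to the target.
    # Note: astype(str) turns a missing value into the text 'nan', which has length 3.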
    invalid_length = df[df[column].astype(str).str.len() != length]
    
    # Formatting the findings of the function to be results for the reader. 
    if invalid_length.empty:
        print(f"The length checker test suggests that all values in '{column}' meet the length requirement of {length}.")
    else:
        print(f"The length checker test indicates that there are {len(invalid_length)} entries in '{column}' that do not meet the length requirement of {length}. \nFor Example:")
        print(invalid_length[[column]].head(5))  # Display first 5 invalid entries
    
    return invalid_length

# Running the test: 
invalid_length = length_checker(cafeSet, testColumn, testLength)

The length checker test indicates that there are 3975 entries in 'Total Spent' that do not meet the length requirement of 3. 
For Example:
  Total Spent
1        12.0
2       ERROR
3        10.0
5        20.0
7        16.0


---
## Look-Up Errors - Using Cafe Dataset

This section is devoted to look-up errors. Our look-up check takes the desired test column and a list of valid values for that column. It then looks through the dataset and returns the entries that are not in that list.
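
A look-up check is essentially a set-membership test. The sketch below shows a plain-Python version that also normalises case and whitespace before comparing, which is an assumption made for this example: an exact-match look-up would flag `"credit card "` as invalid. The helper names `normalise` and `lookup_ok` are hypothetical.

```python
def normalise(value):
    """Collapse whitespace and ignore case so trivial variants compare equal."""
    return " ".join(str(value).split()).casefold()


def lookup_ok(value, valid_values):
    """True if the normalised value is one of the normalised valid values."""
    allowed = {normalise(v) for v in valid_values}
    return value is not None and normalise(value) in allowed


valid = ["Credit Card", "Cash", "Digital Wallet"]
samples = ["Cash", "credit card ", "UNKNOWN", None]
checks = [lookup_ok(v, valid) for v in samples]
print(checks)
# [True, True, False, False]
```

Normalising first separates genuine look-up errors from formatting differences, which are arguably the job of the format check.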

### How To Use
1. Enter parameters and run the code block 
2. Run the code block labelled with "Look-Up Test" 
3. See results in the cell below the labelled code block. 

### Parameters: 

In [88]:
attributes = ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']

# Please enter the desired column from the list above to perform the data cleaning process on the dataset.
testColumn = attributes[5]

# Please enter the list of valid values for the selected column:
validValues = ["Credit Card", "Cash", "Digital Wallet"]


In [None]:
# Look-Up Test
def lookup_checker(df, column, valid_values):
    # Check for invalid values in the column
    invalid_values = df[~df[column].isin(valid_values)]
    
    # Formatting the results for the invalid_values, either if its empty or if there were found
    if invalid_values.empty:
        print(f"The look-up test indicates that all values in '{column}' are among the provided values {valid_values}.")
    else:
        print(f"Found {len(invalid_values)} invalid entries in '{column}' for these specified values: {valid_values}. \nFor Example:")
        print(invalid_values[[column]].head(6))  # Display first 6 invalid entries
    
    return invalid_values


# Run the function: 
lookup_errors = lookup_checker(cafeSet, testColumn, validValues)

Found 3178 invalid entries in 'Payment Method' for these specified values: ['Credit Card', 'Cash', 'Digital Wallet']. 
For Example:
   Payment Method
3         UNKNOWN
6           ERROR
8             NaN
9             NaN
13            NaN
14            NaN


---

## Exact Duplicate Errors - Using Altered Cafe Dataset

This section focuses on exact duplicates at the row (record) level, meaning records that are exactly the same across the selected columns.  
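
Which rows count as exact duplicates depends on the columns being compared. The plain-Python sketch below groups rows by a tuple of the selected columns and keeps the groups with more than one member. The function name `duplicate_groups` is hypothetical, and the three sample rows are made up for illustration (the third row does not come from the dataset).

```python
from collections import defaultdict


def duplicate_groups(rows, subset):
    """Map each repeated key (values of the `subset` columns) to its row positions."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[col] for col in subset)
        groups[key].append(index)
    return {key: idx for key, idx in groups.items() if len(idx) > 1}


rows = [
    {"Transaction ID": "TXN_1961373", "Payment Method": "Credit Card"},
    {"Transaction ID": "TXN_1961373", "Payment Method": "Credit Card"},
    {"Transaction ID": "TXN_1961373", "Payment Method": "Cash"},  # made-up row
]
print(duplicate_groups(rows, ["Transaction ID", "Payment Method"]))
# {('TXN_1961373', 'Credit Card'): [0, 1]}
print(duplicate_groups(rows, ["Transaction ID"]))
# {('TXN_1961373',): [0, 1, 2]}
```

Comparing on fewer columns flags more rows, so the choice of columns should reflect what "the same record" means for the dataset.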

### How To Use:
1. Input parameters below
2. Run the parameters code block
3. Navigate to the code block below the parameters one, identifiable by the comment at the top: "Exact Duplicate Checker Test" 
4. Run the code block and see the results. 

### Parameters: 

In [None]:
attributes = ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']


# Please input the columns you would like to check for an exact duplicate checker. 
columns_to_check = [attributes[0], attributes[5]]

# Select the dataset. 
dataFrame = alteredCafeSet

In [None]:
# Exact Duplicate Checker Test
def exact_duplicate_checker(df, subset_columns):
    if not subset_columns:
        raise ValueError("You must specify at least one column to check for exact duplicates.")

    # Find exact duplicates based on the selected columns
    duplicates = df[df.duplicated(subset=subset_columns, keep=False)]

    if duplicates.empty:
        # If there are no duplicates, this will be the output:
        print(f"No exact duplicates found based on columns: {subset_columns}.")
    else:
        # Creating a neat output for the results of the function. 
        print(f"The checker indicates that there are {len(duplicates)} exact duplicate rows based on columns: {subset_columns}. \n For example, find some of the duplicate rows below:")
        display(duplicates[subset_columns].head()) 

    return duplicates

exact_duplicate_checker(dataFrame, columns_to_check)

The checker indicates that there are 2 exact duplicate rows based on columns: ['Transaction ID', 'Payment Method']. 
 For example, find some of the duplicate rows below:


Unnamed: 0,Transaction ID,Payment Method
0,TXN_1961373,Credit Card
1,TXN_1961373,Credit Card


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08


---

## Near Duplicate Errors - Using Cafe Dataset 
This is the final test in our report. It checks for near duplicate errors, which are records that are similar but not completely identical, for example because of typos or missing values. Our implementation lets the user choose the columns to check and a similarity threshold they can tweak. 
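
To show how an average similarity score behaves, the sketch below scores two rows column by column and averages the results. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for FuzzyWuzzy, so scores are not guaranteed to match `fuzz.ratio` in general; the helper names are assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings on a 0-100 scale (stdlib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def average_similarity(row1, row2, columns):
    """Average the per-column similarity of two rows, comparing values as text."""
    scores = [similarity(str(row1[c]), str(row2[c])) for c in columns]
    return sum(scores) / len(scores)


row_a = {"Item": "Salad", "Total Spent": "25.0"}
row_b = {"Item": "Salad", "Total Spent": "5.0"}
score = average_similarity(row_a, row_b, ["Item", "Total Spent"])
print(score)
# 93.0
```

The two totals differ by a factor of five, yet as text they share most of their characters, so the pair clears a threshold of 90. Numeric columns compared as strings can therefore produce near duplicate matches that are not typos at all.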

### How To Use:
1. Set parameters and run the code block
2. Run the code block labelled with the comment "Near Dupe Checker"
3. See results outputted in real time below that code block. 

### Parameters:

In [None]:
attributes = ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']

# Please enter the columns you would like to compare in the near duplicate check.
columns_to_check = [attributes[1], attributes[4]]

# Please enter a threshold value for the near duplicate checker, this will specify the percentage of similarity between the columns
threshold = 90

In [207]:
# Near Dupe Checker: 

def near_duplicate_checker(df, columns, threshold):
    # Sorting to bring similar rows together
    sorted_df = df.sort_values(by=columns).reset_index(drop=True)  
    near_duplicates = []
    
    # Loop through adjacent rows of the sorted dataset and find near duplicates based on the threshold specified
    for i in range(len(sorted_df) - 1):
        # astype(str) already turns NaN into the text 'nan', so the trailing fillna('') has no effect
        row1 = sorted_df.iloc[i][columns].astype(str).fillna('')
        row2 = sorted_df.iloc[i + 1][columns].astype(str).fillna('')
        
        # Skipping exact duplicates
        if row1.equals(row2):  
            continue
        
        similarity_scores = [fuzz.ratio(row1[col], row2[col]) for col in columns]
        avg_similarity = sum(similarity_scores) / len(columns)
        
        if avg_similarity >= threshold:
            near_duplicates.append((i, i + 1, avg_similarity, row1.to_dict(), row2.to_dict()))
    
    if not near_duplicates:
        print("No near duplicates found.")
    else:
        print(f"The near duplicate test program found {len(near_duplicates)} near duplicate row pairs for the columns {columns}. \nHere are some examples:")
        for pair in near_duplicates[:5]:  # Show first 5 pairs
            print(f"\nRow {pair[0]} ~ Row {pair[1]} (Similarity: {pair[2]:.2f}%)")
            print("Row 1:", pair[3])  # Print first row
            print("Row 2:", pair[4])  # Print second row
    
    return near_duplicates

# running the function here
near_duplicate_checker(cafeSet, columns_to_check, threshold)


The near duplicate test program found 4 near duplicate row pairs for the columns ['Item', 'Total Spent']. 
Here are some examples:

Row 3496 ~ Row 3497 (Similarity: 93.00%)
Row 1: {'Item': 'ERROR', 'Total Spent': '2.0'}
Row 2: {'Item': 'ERROR', 'Total Spent': '20.0'}

Row 5739 ~ Row 5740 (Similarity: 93.00%)
Row 1: {'Item': 'Salad', 'Total Spent': '25.0'}
Row 2: {'Item': 'Salad', 'Total Spent': '5.0'}

Row 9430 ~ Row 9431 (Similarity: 93.00%)
Row 1: {'Item': 'UNKNOWN', 'Total Spent': '2.0'}
Row 2: {'Item': 'UNKNOWN', 'Total Spent': '20.0'}

Row 9800 ~ Row 9801 (Similarity: 93.00%)
Row 1: {'Item': 'nan', 'Total Spent': '2.0'}
Row 2: {'Item': 'nan', 'Total Spent': '20.0'}


[(3496,
  3497,
  93.0,
  {'Item': 'ERROR', 'Total Spent': '2.0'},
  {'Item': 'ERROR', 'Total Spent': '20.0'}),
 (5739,
  5740,
  93.0,
  {'Item': 'Salad', 'Total Spent': '25.0'},
  {'Item': 'Salad', 'Total Spent': '5.0'}),
 (9430,
  9431,
  93.0,
  {'Item': 'UNKNOWN', 'Total Spent': '2.0'},
  {'Item': 'UNKNOWN', 'Total Spent': '20.0'}),
 (9800,
  9801,
  93.0,
  {'Item': 'nan', 'Total Spent': '2.0'},
  {'Item': 'nan', 'Total Spent': '20.0'})]

---

## Conclusion 

Overall, this clean data checker provides a structured approach to data validation and data cleaning. It integrates ten essential checks: data type verification, range validation, format enforcement, consistency analysis, uniqueness detection, presence checks, length constraints, look-up validation, and exact and near duplicate detection. The functions in the notebook identify and highlight issues that are common in data science. This part of the assignment helped develop our skills as data scientists and grow our Python skills. 

### References
1. Winter2025-CSI4142-Week4-DataQuality-Cleaning-Part1
2. Winter2025-CSI4142-Week4-DataQuality-Cleaning-Part2
   - From these two slide decks of class content, we learned the different types of checks and general ideas about how to implement them.
3. https://medium.com/@alphaiterations/fuzzy-matching-with-fuzzywuzzy-a-comprehensive-guide-04873f07de31
   - This article provided information about FuzzyWuzzy, the Python library we used in the near duplicate check.

