# CSI4142 - Group 48 - Assignment 3 - Part 2

---

## Introduction
In this report, we perform an empirical study that evaluates a decision tree approach on a classification task. The study proceeds in the following steps:

1. Clean the data
2. (Optional) Group numerical values into bins or buckets
3. Conduct an EDA (Exploratory Data Analysis) to visualize the data and find outliers in the features using LOF (Local Outlier Factor)
4. Explore the `DecisionTreeClassifier` method in scikit-learn and choose a baseline setting by examining its parameters (splitting criterion (`gini`, `entropy`), `max_depth`, `min_samples_split`, etc.)
5. Program a feature aggregator to create 2 additional features
6. Conduct an empirical study
7. Analyze the results
8. Discuss the outliers and feature aggregation, as well as how the results on the unseen test set compare to the cross-validation results
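
Steps 3, 4 and 6 can be sketched in a few lines. The snippet below is a minimal illustration, not the configuration used in this report: it assumes scikit-learn's bundled copy of Iris in place of the Kaggle CSV, flags outliers with LOF, and scores a baseline decision tree with 5-fold cross-validation. The values `n_neighbors=20`, `max_depth=3`, `cv=5` and `random_state=0` are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for the Kaggle CSV.
X, y = load_iris(return_X_y=True)

# Step 3: LOF labels each row 1 (inlier) or -1 (outlier).
lof = LocalOutlierFactor(n_neighbors=20)
outlier_labels = lof.fit_predict(X)
print("Rows flagged as outliers:", int((outlier_labels == -1).sum()))

# Steps 4 and 6: a baseline tree scored with 5-fold cross-validation.
# For a classifier, cross_val_score reports accuracy by default.
baseline_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
cv_scores = cross_val_score(baseline_tree, X, y, cv=5)
print("Mean CV accuracy:", round(cv_scores.mean(), 3))
```

Fixing `random_state` keeps the tree reproducible between runs, which makes it easier to compare later settings against the baseline.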

#### Group 48 Members
- Ali Bhangu - 300234254
- Justin Wang - 300234186

<br>

## Dataset Description: Iris Dataset

- **Dataset Name:** Iris Dataset  
- **Author:** Himanshu Nakrani
- **Purpose:** This dataset was obtained from Kaggle and is widely used in data science projects. In this report, it serves as the dataset for Assignment 3 Part 2.

---

### Dataset Shape
- **Rows:** 150  
- **Columns:** 5  

---

### Features & Descriptions  

| Feature Name    | Data Type  | Category    | Description |
|----------------|------------|-------------|-------------|
| `sepal_length` | Float      | Numerical   | Length of the sepal in cm |
| `sepal_width`  | Float      | Numerical   | Width of the sepal in cm |
| `petal_length` | Float      | Numerical   | Length of the petal in cm |
| `petal_width`  | Float      | Numerical   | Width of the petal in cm |
| `species`      | String     | Categorical | The species of the iris flower (Setosa, Versicolor, Virginica) |

In [1]:
import numpy as npy
import pandas as pd
from fuzzywuzzy import fuzz
import os
import re
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt



In [4]:
# Define paths
zip_path = "iris-dataset.zip"
csv_path = "iris.csv"  # Adjust this if the extracted file has a different name

# Delete existing CSV if present
if os.path.exists(csv_path):
    print(f"Existing {csv_path} found. Deleting and re-extracting...")
    os.remove(csv_path)

# Download dataset using curl (Bash command in Jupyter Notebook)
!curl -L -o {zip_path} https://www.kaggle.com/api/v1/datasets/download/himanshunakrani/iris-dataset

# Extract the ZIP file in the current folder
print("Extracting dataset...")
!unzip -o {zip_path} -d .

# Verify that the CSV exists after extraction
if not os.path.exists(csv_path):
    raise FileNotFoundError(f"Dataset not found: {csv_path}. Ensure the ZIP file was correctly extracted.")

# Load dataset
irisSet = pd.read_csv(csv_path)
print("Dataset loaded successfully.")
irisSet.head()
irisSet.info()



Existing iris.csv found. Deleting and re-extracting...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1006  100  1006    0     0   3350      0 --:--:-- --:--:-- --:--:--  3350
Extracting dataset...
Archive:  iris-dataset.zip
  inflating: ./iris.csv              
Dataset loaded successfully.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


---
### a) Clean Data

In this section, we clean the Iris dataset. The dataset has no missing values, so no imputation or row removal for missing data is needed. We run the following checks:

- Data Type Check
- Consistency Check
- Exact Duplicate Check

Beyond these checks, the dataset is already clean and ready for classification.
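
The statement that no values are missing can be confirmed directly with pandas. The sketch below uses a small hypothetical frame (`sample`) with the same column names as the Iris data; in the notebook, the same call would be made on `irisSet`.

```python
import pandas as pd

# Hypothetical stand-in for irisSet: same columns, three rows.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
    "petal_length": [1.4, 1.4, 6.0],
    "petal_width": [0.2, 0.2, 2.5],
    "species": ["setosa", "setosa", "virginica"],
})

# Count missing values per column; all zeros means nothing to impute or drop.
missing_per_column = sample.isna().sum()
print(missing_per_column)
print("Any missing values:", bool(missing_per_column.any()))
```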

In [None]:
import ast

# Data Type Test
def data_type_checker(df, attributes, expected_type):
    """Report, for each column in `attributes`, the entries that do not parse as `expected_type`."""

    def is_expected_type(value):
        # Missing values never match the expected type.
        if pd.isna(value):
            return False
        try:
            # literal_eval parses the text form of the value without executing arbitrary code.
            return isinstance(ast.literal_eval(str(value)), expected_type)
        except (ValueError, SyntaxError):
            return False

    for column in attributes:
        # Collect the rows whose entry in this column has the wrong type.
        incorrect_types = df[~df[column].apply(is_expected_type)]

        # Report the result for the reader of the report.
        print(f"Checking column: {column} (Expected type: {expected_type.__name__})")
        if incorrect_types.empty:
            print(f"The Data Type Checker suggests all values in '{column}' match the expected data type.")
        else:
            print(f"The Data Type Checker found {len(incorrect_types)} incorrect entries in '{column}'. \nFor example, here are some of the problem entries:")
            display(incorrect_types[[column]].head(5))  # Show a few of the incorrect entries.

        print("\n")

# Only the four measurement columns are expected to be floats; 'species' is checked separately.
irisAttributes = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
data_type_checker(irisSet, irisAttributes[:4], float)

Checking column: sepal_length (Expected type: float)
The Data Type Checker suggests all values in 'sepal_length' match the expected data type.


Checking column: sepal_width (Expected type: float)
The Data Type Checker suggests all values in 'sepal_width' match the expected data type.


Checking column: petal_length (Expected type: float)
The Data Type Checker suggests all values in 'petal_length' match the expected data type.


Checking column: petal_width (Expected type: float)
The Data Type Checker suggests all values in 'petal_width' match the expected data type.




In [13]:
def consistency_checker(df, column, valid_values):
    """Return the rows of `df` whose value in `column` is not in `valid_values`."""

    # Find inconsistent values
    inconsistent = df[~df[column].isin(valid_values)]

    # Display results
    if inconsistent.empty:
        print(f"No consistency errors found in column '{column}'.")
    else:
        print(f"Found {len(inconsistent)} inconsistent values in column '{column}':")
        display(inconsistent[[column]])

    return inconsistent

# Define valid species values
valid_species = {"setosa", "versicolor", "virginica"}

# Run the check on the species column
inconsistent_species = consistency_checker(irisSet, "species", valid_species)


No consistency errors found in column 'species'.
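
The consistency check is an exact match, so a label that differs only in capitalization or surrounding whitespace would be reported as inconsistent. If that is not wanted, labels can be normalized before the check. The sketch below shows one possible approach on hypothetical data; `normalize_labels` is an illustrative helper and is not part of this notebook.

```python
import pandas as pd

def normalize_labels(series):
    """Strip surrounding whitespace and lowercase each label."""
    return series.str.strip().str.lower()

# Hypothetical labels: formatting noise plus one genuinely invalid value.
raw = pd.Series(["setosa", " Versicolor", "VIRGINICA ", "setsoa"])
valid_species = {"setosa", "versicolor", "virginica"}

cleaned = normalize_labels(raw)

# Only labels that are still invalid after normalization remain.
still_invalid = cleaned[~cleaned.isin(valid_species)]
print(still_invalid.tolist())
```

With this approach, formatting differences are repaired and only genuine misspellings are left for manual review.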


In [15]:
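# Inspect the dataset before removing duplicates
irisSet.info()

# Drop exact duplicate rows, keeping the first occurrence of each
irisSetNew = irisSet.drop_duplicates(keep="first").reset_index(drop=True)
print("Duplicates removed successfully.")

# Inspect the dataset after removing duplicates
irisSetNew.info()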

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Duplicates removed successfully.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  147 non-null    float64
 1   sepal_width   147 non-null    float64
 2   petal_length  147 non-null    float64
 3   petal_width   147 non-null    float64
 4   species       147 non-null    object 
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
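
The row count drops from 150 to 147, so three rows were exact copies of earlier rows. Before dropping them, it can be useful to look at the duplicated rows. The following is a minimal sketch on hypothetical data, where `sample` stands in for `irisSet`:

```python
import pandas as pd

# Hypothetical stand-in for irisSet, with two repeated rows.
sample = pd.DataFrame({
    "sepal_length": [4.9, 5.8, 4.9, 5.8, 6.4],
    "sepal_width": [3.1, 2.7, 3.1, 2.7, 2.8],
    "petal_length": [1.5, 5.1, 1.5, 5.1, 5.6],
    "petal_width": [0.1, 1.9, 0.1, 1.9, 2.1],
    "species": ["setosa", "virginica", "setosa", "virginica", "virginica"],
})

# keep=False marks every member of a duplicate group, not just the repeats.
duplicate_groups = sample[sample.duplicated(keep=False)]
print(duplicate_groups)

# With the default keep="first", duplicated() marks exactly the rows
# that drop_duplicates(keep="first") would remove.
n_removed = int(sample.duplicated().sum())
deduplicated = sample.drop_duplicates(keep="first").reset_index(drop=True)
print(f"Removed {n_removed} duplicate rows; {len(deduplicated)} remain.")
```

Whether identical measurements are true duplicates or distinct flowers with the same dimensions is a judgment call; this report treats them as duplicates.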
