**Part 2: Data Preprocessing**

In this part, we apply a series of data preprocessing techniques to prepare the dataset for accurate and reliable analysis. The techniques used include Discretization, Noise Removal, Handling Missing Values, and Normalization. These techniques were chosen based on the structure of the dataset and the analytical requirements.

For each technique, we provide an explanation of why it was necessary, how it was implemented, and which attributes it was applied to. We also include a brief description of the results, outlining how the dataset improved as a result of these transformations. These preprocessing steps help reduce inconsistencies, minimize the impact of noise, balance feature scales, and improve interpretability for downstream tasks such as K-Means clustering and Decision Tree classification.

Finally, snapshots of both the raw dataset and the preprocessed dataset are provided to clearly demonstrate the changes made.

**Noise Removal**

In [14]:
# 1. Check minimum value before cleaning
min_before = data['targeted_productivity'].min()
print(f"Lowest targeted_productivity before cleaning: {min_before}")

# 2. Remove unrealistic or noisy values (below 0.1)
data_cleaned = data[data['targeted_productivity'] >= 0.1].reset_index(drop=True)

# 3. Check minimum value after cleaning
min_after = data_cleaned['targeted_productivity'].min()
print(f"Lowest targeted_productivity after cleaning: {min_after}")

# 4. Summary
removed_count = len(data) - len(data_cleaned)
print(f"\nNoise removal complete. {removed_count} data point removed.")

Lowest targeted_productivity before cleaning: 0.07
Lowest targeted_productivity after cleaning: 0.35

Noise removal complete. 1 data point removed.


In [15]:
# 1. Check maximum value before cleaning
max_before = data['over_time'].max()
print(f"Highest over_time before cleaning: {max_before}")

# 2. Remove values greater than 25,000 (noise)
data_cleaned = data[data['over_time'] <= 25000].reset_index(drop=True)

# 3. Check maximum value after cleaning
max_after = data_cleaned['over_time'].max()
print(f"Highest over_time after cleaning: {max_after}")

# 4. Summary
removed_count = len(data) - len(data_cleaned)
print(f"\nNoise removal complete. {removed_count} data points removed.")

Highest over_time before cleaning: 25920
Highest over_time after cleaning: 15120

Noise removal complete. 1 data points removed.


**Explanation of the Technique (why and how it was applied and which attributes were selected for it):**

Noise removal was applied to eliminate unrealistic values that could distort the analysis and lead to incorrect conclusions. This technique was chosen after exploratory data analysis (EDA) revealed anomalies in the over_time and targeted_productivity attributes. One record showed an over_time value of 25,920 minutes, which is unrealistic even when distributed across all 54 team members, and one entry in targeted_productivity was 0.07, which is implausibly low and likely due to a data entry error. To maintain accuracy, these records were removed by keeping only entries where over_time ≤ 25000 and targeted_productivity ≥ 0.1.
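Each cell above builds data_cleaned from data, so the second assignment starts again from the unfiltered frame rather than from the result of the first filter. A minimal sketch, assuming both conditions should hold in the final frame, combines them into a single boolean mask. The frame `toy` and its values are invented for illustration and are not rows from the dataset.

```python
import pandas as pd

# Toy frame with invented values (hypothetical, not taken from the real dataset).
toy = pd.DataFrame({
    "targeted_productivity": [0.80, 0.07, 0.75, 0.35],
    "over_time": [6960, 3600, 25920, 15120],
})

# One boolean mask so that both thresholds apply to the same result.
valid = (toy["targeted_productivity"] >= 0.1) & (toy["over_time"] <= 25000)
toy_cleaned = toy[valid].reset_index(drop=True)

removed = len(toy) - len(toy_cleaned)
print(f"Rows kept: {len(toy_cleaned)}, rows removed: {removed}")
```

The same effect could be had by filtering data_cleaned instead of data in the second cell; the single mask simply keeps both thresholds visible in one place.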

**Description of Preprocessing Results (and how this technique improved the dataset):**

After the removal of unrealistic records, both attributes reflect feasible and consistent ranges. The lowest targeted_productivity value became 0.35, which is a realistic and achievable productivity level in factory conditions. Similarly, the highest over_time value now stands at 15,120 minutes, equivalent to 252 hours, which is a plausible cumulative overtime duration for a full team of 30 employees. Each filter removed one record. This reduced data distortion and ensured that the dataset represents realistic production targets and overtime, supporting more accurate insights and more reliable modeling outcomes.
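The overtime conversion can be checked with a few lines of arithmetic. This sketch uses the maximum of 15120 minutes reported in the cell output and the team size of 30 quoted in the text; the per-worker figure is only as reliable as that team size.

```python
# Maximum over_time after cleaning, as printed in the cell output above.
max_over_time_minutes = 15120
# Team size quoted in the text (an assumption for this check, not read from the data).
team_size = 30

total_hours = max_over_time_minutes / 60
hours_per_worker = total_hours / team_size

print(f"Total overtime: {total_hours:.0f} hours")
print(f"Per worker: {hours_per_worker:.1f} hours")
```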

**Discretization**

In [17]:
# Define number of bins and labels
num_bins = 3
bin_labels = ['Low', 'Medium', 'High']

# Apply discretization
data_cleaned['discretized_actual_productivity'] = pd.cut(
    data_cleaned['actual_productivity'],
    bins=num_bins,
    labels=bin_labels,
    include_lowest=True
)

# Print summary
print('-------------------------------------------------------')
print('Discretization complete: actual_productivity → discretized_actual_productivity')
print('-------------------------------------------------------')
print('First few values:')
print(data_cleaned[['actual_productivity', 'discretized_actual_productivity']].head())
print('-------------------------------------------------------')
print('Number of instances for each label:')
print('-------------------------------------------------------')
print('Class  -- Count ---------------------------------------')
print(data_cleaned['discretized_actual_productivity'].value_counts())
print('-------------------------------------------------------')

-------------------------------------------------------
Discretization complete: actual_productivity → discretized_actual_productivity
-------------------------------------------------------
First few values:
   actual_productivity discretized_actual_productivity
0             0.940725                            High
1             0.886500                            High
2             0.800570                          Medium
3             0.800570                          Medium
4             0.800382                          Medium
-------------------------------------------------------
Number of instances for each label:
-------------------------------------------------------
Class  -- Count ---------------------------------------
discretized_actual_productivity
Medium    691
High      344
Low       161
Name: count, dtype: int64
-------------------------------------------------------


**Explanation of the Technique (why and how it was applied and which attributes were selected for it):**

Discretization was applied to convert the continuous actual_productivity feature into categorical labels suited to classification algorithms such as Decision Trees. A classification tree needs a discrete target, and discrete classes yield clear, interpretable rule-based outputs. The process divided actual_productivity into three categories (Low, Medium, and High) using the pd.cut() function with bins=3, which produces equal-width bins. This covers the full range of productivity values while keeping the labels simple and interpretable.
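To see where equal-width binning places its boundaries, pd.cut() can also return the computed edges through its retbins parameter. The sketch below uses invented scores (the Series `scores` is hypothetical and not taken from the dataset) chosen to sit away from the bin edges.

```python
import pandas as pd

# Invented productivity scores for illustration only.
scores = pd.Series([0.0, 0.1, 0.4, 0.5, 0.9, 1.0])

labels, edges = pd.cut(
    scores,
    bins=3,                            # three equal-width bins over [min, max]
    labels=["Low", "Medium", "High"],
    include_lowest=True,
    retbins=True,                      # also return the computed bin edges
)

print(edges)            # four edges delimiting the three bins
print(labels.tolist())  # label assigned to each score
```

With an integer bins argument, pandas slightly extends the outer range so that the minimum and maximum values fall inside a bin, which is why the first edge is a little below the smallest score.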

**Description of Preprocessing Results (and how this technique improved the dataset):**

After discretization, the new column discretized_actual_productivity grouped the data into three categories: Medium (691 instances), High (344 instances), and Low (161 instances). The classes are not evenly sized, since equal-width binning follows the distribution of the values rather than balancing the counts. This transformation made the target variable suitable for Decision Tree classification, as the model can now work with clearly defined classes instead of continuous values. The categorization also improves interpretability by allowing a direct comparison across productivity levels and reduces sensitivity to small fluctuations in continuous measurements.
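The share of each class follows directly from the counts printed by value_counts() in the cell output. A small sketch, with the counts copied from that output rather than recomputed from the data:

```python
# Counts copied from the value_counts() output above.
counts = {"Medium": 691, "High": 344, "Low": 161}

total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} instances ({share:.1%})")
```

On the DataFrame itself, data_cleaned['discretized_actual_productivity'].value_counts(normalize=True) should give the same proportions.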