## 01 - Data Classification

September 18, 2025

### **Content**

- Understand concepts such as a model and a confusion matrix.
- Evaluate the performance of an algorithm using cross-validation or holdout.
- Understand the difference between a generic model and an overfitted model.
- Problems in data classification: rare classes, unknown attributes, curse of dimensionality.
- Some of the main types of algorithms: Naive Bayes, Decision Trees.

### **Classification**

- In classification, we want to discover the class of an instance.
- By convention, the class is stored as the last attribute of the relation.

### **Why use machine learning algorithms?**


The text contrasts two types of algorithms—a deterministic calculation (print utilization) and a machine learning algorithm (loan prediction)—to highlight their fundamental differences.

* **Print Algorithm (Deterministic):** This algorithm calculates the optimal way to use paper. It does not require historical data; it only needs the measurements of the current job. Once created, it works forever, on any printer, and is expected to deliver perfect performance (100% utilization).

* **Loan Algorithm (Machine Learning):** This algorithm predicts whether a customer will be a good or bad payer. It is fundamentally different because it **must learn from historical data** (income, credit history, number of children, etc.) to identify patterns.

The key differences arising from this are:

1.  **Changing Performance:** The loan algorithm's performance can **degrade over time** as customer profiles change (due to social or economic factors), requiring it to be retrained. The print algorithm is static.
2.  **Not Universal:** A loan model trained for a private bank may **not perform well** at a public bank, as their customer profiles differ. The print algorithm is universal.
3.  **Expected Performance:** A 100% success rate is not expected from the machine learning algorithm. The text notes that **90% accuracy would be considered exceptional**, whereas 100% is the standard for the print calculation.

In machine learning, an algorithm learns from historical data to build what we call a model. This model is used to predict or classify new, unseen data. There are different types of models depending on the algorithm family. To illustrate this concept, a simple example is given with a table showing people's ages and whether they paid back a loan or not.

Using the historical data from that table, we can create a model to predict whether new clients will be good or bad payers based on their age. The classifier processes this information and generates a model with age ranges and corresponding outcomes (paid or not).

Once created, this model can be used to predict new cases - for example, a new client aged 37. The algorithm looks at the model built from historical data and returns whether this new client would be a good payer. Although this is a very simple example, it is useful for understanding how machine learning works.
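
The idea above can be sketched in a few lines. This is only an illustration: the ages and outcomes in `history` are invented (the original table is not reproduced here), and the single-threshold rule is an assumed stand-in for whatever age ranges a real classifier would produce.

```python
# Hypothetical historical data: (age, paid_back). Values are invented for illustration.
history = [(22, False), (25, False), (29, False), (33, True),
           (38, True), (45, True), (52, True), (61, True)]

def train_age_threshold(rows):
    """Return the age cut-off that classifies the most historical rows correctly.

    The resulting 'model' is one rule: predict 'paid' when age >= threshold.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({age for age, _ in rows}):
        correct = sum((age >= threshold) == paid for age, paid in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def predict(threshold, age):
    """Apply the learned rule to a new client."""
    return age >= threshold

model = train_age_threshold(history)   # learning from historical data
prediction = predict(model, 37)        # new client aged 37
print(f"threshold={model}, predicted good payer={prediction}")
```

The learning step only looks at past cases; the prediction step only looks at the model. Retraining means running the first step again on newer data.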

### **Evaluating the trained model**

When building a model using historical data, it's not enough to simply feed the data to the algorithm and use it for predictions. As mentioned earlier, a machine learning algorithm is not expected to be 100% accurate. On the other hand, an algorithm that is correct only 50% of the time on a two-class problem performs no better than a random guess. Therefore, before deploying a model, we must evaluate its performance - that is, measure how effective it is at predicting data it has never seen before.

In model evaluation, the simplest method is to test a model using the same historical data it was trained on. However, a more reliable evaluation comes from testing the model on new, unseen data.

We can do this by setting aside part of the data to build the model and part to test it, so that the instances used in the test are not used to create the model. The most common way to make this separation is the hold-out method.

#### **Hold-out Method**

This technique splits the historical data into two separate parts, a training set and a test set. The process has two phases:

1. Learning Phase:

   - Training Set (70%): The larger portion, used to build and train the model.

   - Test Set (30%): The smaller portion, used to test the model's performance. The model makes predictions on the test data, and those predictions are then compared with the known classes to evaluate performance.

2. Production Phase:

   - New data is fed into the finalized model to generate new predictions.

Evaluating a classifier's performance involves comparing its predictions against the actual, real-world data. If the performance is deemed satisfactory, the model can be deployed for practical use (in production).

For example, if a model is tested on 5 instances and is correct on 3 of them, its accuracy is 60%.
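
A minimal sketch of a 70/30 hold-out split and the accuracy calculation, using only the standard library. The function names, the seed, and the example labels are assumptions made for illustration.

```python
import random

def holdout_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into disjoint training and test sets."""
    shuffled = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Ten hypothetical instances, identified here just by an index.
train, test = holdout_split(list(range(10)))

# Five test instances, three predicted correctly, as in the example above.
actual    = ["good", "good", "bad", "good", "bad"]
predicted = ["good", "bad",  "bad", "good", "good"]
acc = accuracy(actual, predicted)
print(len(train), len(test), acc)
```

Shuffling before the cut matters when the historical data is ordered (for example by date or by class), since an unshuffled split could put one kind of instance entirely in the test set.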

#### **Confusion Matrix**

Classifier evaluation involves more than just the overall accuracy. For a two-class problem, each prediction falls into one of four possible outcomes:

1.  **True Positives (TP):** Correct positive predictions (e.g., a good payer correctly predicted as good).
2.  **False Negatives (FN):** Errors where a positive case is predicted as negative (e.g., a good payer incorrectly predicted as bad).
3.  **True Negatives (TN):** Correct negative predictions (e.g., a bad payer correctly predicted as bad).
4.  **False Positives (FP):** Errors where a negative case is predicted as positive (e.g., a bad payer incorrectly predicted as good).

True Positives and True Negatives represent the model's correct predictions (the accuracy rate). False Negatives and False Positives represent the errors. A table that organizes the frequency of these outcomes is known as a **confusion matrix**.
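
The four counts can be tallied directly from paired lists of actual and predicted classes. The sketch below assumes "good" is the positive class and uses invented labels.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="good"):
    """Count TP, FN, TN and FP for a two-class problem."""
    counts = Counter({"TP": 0, "FN": 0, "TN": 0, "FP": 0})
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["TP"] += 1      # good payer predicted as good
        elif a == positive:
            counts["FN"] += 1      # good payer predicted as bad
        elif p == positive:
            counts["FP"] += 1      # bad payer predicted as good
        else:
            counts["TN"] += 1      # bad payer predicted as bad
    return counts

actual    = ["good", "good", "bad", "good", "bad"]
predicted = ["good", "bad",  "bad", "good", "good"]
matrix = confusion_counts(actual, predicted)

# Print as a 2x2 table: rows are actual classes, columns are predicted classes.
print("            pred good  pred bad")
print(f"actual good {matrix['TP']:9d} {matrix['FN']:9d}")
print(f"actual bad  {matrix['FP']:9d} {matrix['TN']:9d}")
```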


#### **Cross Validation**

Suppose a model reaches an accuracy rate of 70% on a single hold-out split. That figure depends on which instances happened to land in the test set, so it is inherently subject to a margin of error.

Cross-validation reduces this uncertainty by evaluating the model multiple times. The data is divided into partitions (also called folds); in each run, one partition is held out for testing and the model is built from the remaining ones. The final performance metric is the arithmetic mean of all these separate evaluations.
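
A minimal sketch of k-fold cross-validation, one common form of the method. The "classifier" here is an assumed placeholder that always predicts the majority class of its training data, and the labels are invented; in practice the rows would be shuffled first.

```python
from collections import Counter

def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for each of the k partitions."""
    indices = list(range(n_instances))
    # Spread any remainder over the first partitions so sizes differ by at most 1.
    sizes = [n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        yield indices[:start] + indices[start + size:], indices[start:start + size]
        start += size

def cross_validate(labels, k):
    """Return the mean accuracy and the per-partition accuracies."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        # Placeholder model: predict the most frequent class in the training part.
        majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        hits = sum(labels[i] == majority for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores), scores

labels = ["good"] * 7 + ["bad"] * 3
mean_accuracy, per_fold = cross_validate(labels, k=5)
print(per_fold, mean_accuracy)
```

The per-partition scores vary even though the data and the algorithm are fixed, which is the margin of error a single split would hide.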




#### **Generalization vs. Overfitting**

When a constructed model performs well on production data, it is considered a "generic" or generalized model. The goal of any classifier is to create generalized models.

The opposite of a generalized model is an overfitted model. An overfitted model performs well on the data used to build and test it but poorly on production data.

Several factors can cause overfitting. The primary cause is training data that does not adequately represent the production data. This can happen because the training data is different (e.g., outdated) or because there is too little of it to be representative.

#### **Rare Class Problem**

The text presents a scenario involving the classification of student data into two classes: "approved" or "failed". From a total of 100,000 instances, a random sample of 4,000 records is extracted to train a model.

The problem arises because only 1% of the students "failed," making it a rare class. Consequently, the 4,000-record training sample will likely be highly imbalanced, containing approximately 40 "failed" instances and 3,960 "approved" instances.

The probable outcome is that the model will learn the characteristics of the "approved" student well, but it will fail to correctly classify new instances of "failed" students because that class was so rare in the training data.

The most common solution to this problem (class imbalance) is a technique called **stratification**.

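Following the previous example, instead of taking a random sample from the 100,000 records, stratification would involve extracting 2,000 records of "approved" students and 2,000 records of "failed" students. Note that 1% of 100,000 is only 1,000 "failed" records, so reaching 2,000 requires either drawing some of them more than once or settling for a smaller balanced sample.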

This method creates a balanced dataset of 4,000 records, ensuring class equilibrium for training the model.
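
The contrast between the two sampling strategies can be sketched as follows. The population is synthetic, and `balanced_sample` is a hypothetical helper. One assumption is made explicit in the code: when a class has fewer records than requested (only 1,000 "failed" records exist here), it is drawn with replacement.

```python
import random

rng = random.Random(42)

# Synthetic population: 1% "failed" (the rare class), 99% "approved".
population = ["failed"] * 1_000 + ["approved"] * 99_000

# Plain random sample: the rare class stays rare (around 40 of 4,000).
random_sample = rng.sample(population, 4_000)
n_failed_random = random_sample.count("failed")

def balanced_sample(labels, per_class, rng):
    """Return indices giving the same number of records for every class."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    chosen = []
    for indices in by_class.values():
        if len(indices) >= per_class:
            chosen += rng.sample(indices, per_class)       # without replacement
        else:
            # Assumption: too few records, so draw with replacement.
            chosen += rng.choices(indices, k=per_class)
    return chosen

balanced = balanced_sample(population, 2_000, rng)
n_failed_balanced = sum(population[i] == "failed" for i in balanced)
print(n_failed_random, n_failed_balanced, len(balanced))
```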

#### **Curse of Dimensionality**

It might seem that the more attributes (characteristics) a dataset has, the better the model a classifier can create.

However, for many classifiers, the opposite is true: too many attributes can cause **overfitting**, making the model less accurate on new data. This effect is known as the **curse of dimensionality**.
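
One way to build intuition for this (an illustration added here, not taken from the course material) is to count how many distinct attribute combinations exist versus how many a fixed-size dataset could possibly contain. The sketch assumes binary attributes and 1,000 instances.

```python
def max_coverage(n_instances, n_binary_attributes):
    """Upper bound on the fraction of attribute combinations a dataset can contain."""
    n_combinations = 2 ** n_binary_attributes
    return min(n_instances, n_combinations) / n_combinations

coverage = {d: max_coverage(1_000, d) for d in (5, 10, 20, 30)}
for d, fraction in coverage.items():
    print(f"{d:2d} attributes: at most {fraction:.6%} of combinations seen")
```

With 5 attributes the data could cover every combination; with 20 it can cover less than 0.1% of them, so most predictions are made for combinations the model has never seen.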

The text also mentions that attribute selection techniques will be studied later. These techniques help discover which attributes are truly important for achieving better model generalization.

#### **Cost-Sensitive Evaluation**

A high overall accuracy rate (e.g., ~70%) can be deceptive. A simple accuracy metric treats all errors equally, which may hide underlying problems.

A deeper analysis of the confusion matrix is required because different types of errors have different real-world *costs*.

For instance, a model might demonstrate:
* A high probability of **False Positives** (incorrectly predicting a positive outcome, e.g., ~25%). This type of error could be extremely costly.
* A low probability of **False Negatives** (incorrectly predicting a negative outcome, e.g., 5%).

In such scenarios, it becomes more important to evaluate and improve performance on specific, high-cost errors rather than focusing only on the total accuracy rate.
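
A small sketch of how unequal error costs can change which model looks better. All numbers are invented: the counts are per 100 test instances, and the assumption that a False Positive costs ten times as much as a False Negative is purely illustrative.

```python
def accuracy(counts):
    """Overall accuracy from confusion-matrix counts."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

def total_cost(counts, cost_per_outcome):
    """Sum of each outcome count multiplied by its cost (missing outcomes cost 0)."""
    return sum(n * cost_per_outcome.get(outcome, 0) for outcome, n in counts.items())

# Hypothetical costs: a False Positive is ten times as expensive as a False Negative.
costs = {"FP": 10, "FN": 1}

# Model A: 70% accuracy, 25% False Positives, 5% False Negatives.
model_a = {"TP": 60, "TN": 10, "FP": 25, "FN": 5}
# Model B: lower accuracy, but far fewer of the expensive errors.
model_b = {"TP": 50, "TN": 15, "FP": 10, "FN": 25}

acc_a, cost_a = accuracy(model_a), total_cost(model_a, costs)
acc_b, cost_b = accuracy(model_b), total_cost(model_b, costs)
print(f"A: accuracy={acc_a:.2f} cost={cost_a}")
print(f"B: accuracy={acc_b:.2f} cost={cost_b}")
```

Under these assumed costs, the model with the lower accuracy is the cheaper one to deploy.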

#### **Model Metrics**

A model's performance can be initially measured by its overall accuracy, which is the total number of correct predictions (True Positives + True Negatives) divided by the total instances. A result like ~76% might seem good on the surface.

However, high overall accuracy can be deceptive. A model may be "practically useless" if it fails to reliably predict a specific, crucial class. For instance, even with 76% accuracy, the model might be very poor at identifying "failed" cases, making its predictions for that class unreliable and rendering the model unfit for its purpose.

This demonstrates that accuracy and error rates alone are insufficient for a complete evaluation. A comprehensive assessment requires a wider set of performance metrics to understand the model's true behavior.

Other essential metrics include:
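* **Precision:** The fraction of positive predictions that are correct: TP / (TP + FP).
* **Recall (or True Positive Rate):** The fraction of actual positive cases that were correctly identified: TP / (TP + FN).
* **Negative Predictive Value (listed as "True Negatives"):** The fraction of negative predictions that are correct: TN / (TN + FN).
* **False Positive Rate:** The fraction of actual negative cases that were incorrectly predicted as positive: FP / (FP + TN).
* **False Negative Rate:** The fraction of actual positive cases that were incorrectly predicted as negative: FN / (FN + TP).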
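
These metrics all derive from the four confusion-matrix counts. The sketch below uses the standard definitions; the counts are invented so that overall accuracy comes out at 76%, with "approved" assumed to be the positive class and "failed" the negative one.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute common metrics from two-class confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Avoid division by zero when a class or prediction never occurs.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy":  ratio(tp + tn, tp + fn + tn + fp),
        "precision": ratio(tp, tp + fp),   # correct among positive predictions
        "recall":    ratio(tp, tp + fn),   # true positive rate
        "npv":       ratio(tn, tn + fn),   # correct among negative predictions
        "fpr":       ratio(fp, fp + tn),   # false positive rate
        "fnr":       ratio(fn, fn + tp),   # false negative rate
    }

# Hypothetical counts: 75 "approved" (positive) and 25 "failed" (negative) students.
metrics = classification_metrics(tp=70, fn=5, tn=6, fp=19)
for name, value in metrics.items():
    print(f"{name:9s} {value:.3f}")
```

With these counts, accuracy is 76%, yet the false positive rate is also 76%: most "failed" students are predicted as "approved", which is the kind of weakness overall accuracy hides.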