### Summary of Alternative Classification Techniques

This document details several alternative methods for data classification beyond decision trees, focusing on rule-based classifiers, nearest-neighbor classifiers, and Bayesian classifiers.

#### Rule-Based Classifiers

This technique uses a set of "if-then" rules to classify records. A rule consists of an *antecedent* (the "if" part, containing attribute tests) and a *consequent* (the "then" part, which assigns a class).

* **Rule Quality:** Rules are evaluated using two main metrics:
    * **Coverage:** The fraction of records in the dataset that satisfy the rule's antecedent.
    * **Precision:** The fraction of records covered by the rule that actually belong to the predicted class.

* **Conflict Resolution:** A rule set can fail in two ways: a record may be covered by no rule (the set is not exhaustive), or by several rules that predict different classes (the rules are not mutually exclusive). An uncovered record is usually assigned a default class. Conflicts between rules are resolved in one of two ways:
    * **Ordered Rules (Decision List):** Rules are ranked by priority (e.g., by precision). The highest-priority rule that fires is used to classify the record.
    * **Unordered Rules:** All rules that fire are allowed to "vote" for their respective classes. The class with the most votes (potentially weighted by rule precision) is assigned.

* **Rule Extraction Methods:**
    * **Direct Methods:** Rules are extracted directly from the data. The most common approach is the **sequential covering algorithm**. This algorithm iteratively finds the single "best" rule (e.g., highest precision or information gain), adds it to the list, removes the training records covered by that rule, and repeats the process on the remaining data. The **RIPPER** algorithm is a well-known implementation of this method.
    * **Indirect Methods:** Rules are extracted from another model. For example, a decision tree can be converted into a rule set, where each path from the root to a leaf becomes a rule. These rules are often then pruned and simplified to improve generality.
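
The rule metrics and the ordered-rule strategy described above can be sketched in a few lines. This is a minimal illustration, not a full rule learner: the `Rule` class, the helper names, and the five toy records are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class Rule:
    antecedent: Callable[[Record], bool]  # the "if" part: a test on attributes
    consequent: str                       # the "then" part: the predicted class


def coverage(rule: Rule, records: List[Record]) -> float:
    """Fraction of all records that satisfy the rule's antecedent."""
    covered = [r for r in records if rule.antecedent(r)]
    return len(covered) / len(records)


def precision(rule: Rule, records: List[Record], label_key: str = "class") -> float:
    """Fraction of covered records whose class matches the consequent."""
    covered = [r for r in records if rule.antecedent(r)]
    if not covered:
        return 0.0
    correct = sum(1 for r in covered if r[label_key] == rule.consequent)
    return correct / len(covered)


def classify_ordered(rules: List[Rule], record: Record, default: str) -> str:
    """Decision list: the first rule that fires decides; otherwise use the default class."""
    for rule in rules:
        if rule.antecedent(record):
            return rule.consequent
    return default


# Hypothetical toy data, invented for this sketch.
records = [
    {"refund": "yes", "status": "single", "class": "no"},
    {"refund": "no", "status": "married", "class": "no"},
    {"refund": "no", "status": "single", "class": "yes"},
    {"refund": "no", "status": "single", "class": "no"},
    {"refund": "yes", "status": "married", "class": "no"},
]

single_no_refund = Rule(lambda r: r["refund"] == "no" and r["status"] == "single", "yes")
refund_rule = Rule(lambda r: r["refund"] == "yes", "no")

print(coverage(single_no_refund, records))   # 2 of 5 records are covered
print(precision(single_no_refund, records))  # 1 of the 2 covered records is "yes"

decision_list = [refund_rule, single_no_refund]
print(classify_ordered(decision_list, {"refund": "no", "status": "single"}, default="no"))
print(classify_ordered(decision_list, {"refund": "no", "status": "married"}, default="no"))
```

The last call shows the default class in action: neither rule fires for that record, so the decision list falls back to `"no"`.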

#### Bayesian Classifiers

These classifiers use probability to model the non-deterministic relationship between attributes and a class, based on **Bayes' Theorem**.

* **Core Concept:** The goal is to find the class $Y$ that has the highest *posterior probability* $P(Y|X)$, given the attributes $X$.
* **Bayes' Theorem:** The posterior probability is calculated as:
    $P(Y|X) = [P(X|Y) \times P(Y)] / P(X)$
    * $P(Y)$: The *prior probability* of the class (its frequency in the training data).
    * $P(X|Y)$: The *class conditional probability* (the probability of observing attributes $X$ *given* that the class is $Y$).
    * $P(X)$: The probability of the attributes (the *evidence*). It is the same for every class, so it acts as a normalizing constant and can be ignored when comparing classes.
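
As a small sketch of the formula above, the function below turns priors and class conditional probabilities into posteriors. The function name and the numbers are hypothetical, chosen only to show that dividing by $P(X)$ rescales the scores without changing which class wins.

```python
from fractions import Fraction


def posteriors(priors, likelihoods):
    """Apply Bayes' Theorem: P(Y|X) = P(X|Y) * P(Y) / P(X) for every class Y."""
    # Numerator of Bayes' Theorem for each class.
    joint = {y: priors[y] * likelihoods[y] for y in priors}
    # P(X) is the sum of the numerators over all classes.
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


# Hypothetical values, not taken from the text.
priors = {"yes": Fraction(3, 10), "no": Fraction(7, 10)}
likelihoods = {"yes": Fraction(1, 2), "no": Fraction(1, 10)}  # P(X|Y)

post = posteriors(priors, likelihoods)
print(post)                     # yes: 15/22, no: 7/22
print(max(post, key=post.get))  # the class with the highest posterior
```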

##### Naive Bayes Classifier

Estimating $P(X|Y)$ directly is impractical: it requires a probability for every combination of attribute values, and most combinations never appear in a realistic training set. The **Naive Bayes** classifier simplifies this by making a "naive" assumption of **conditional independence**: all attributes are assumed to be independent of each other, given the class.

* **Simplified Calculation:** This assumption allows the class conditional probability to be calculated as the product of the individual probabilities for each attribute:
    $P(X|Y) = P(X_1|Y) \times P(X_2|Y) \times ... \times P(X_d|Y)$

* **Handling Attribute Types:**
    * **Categorical Attributes:** Probabilities are estimated as the simple fraction of times that attribute value appears with that class in the training data (e.g., $P(\text{Marital Status} = \text{Single} | \text{Class} = \text{No}) = 2/7$).
    * **Continuous Attributes:** Typically handled by assuming the attribute follows a **Gaussian (Normal) distribution**. The algorithm calculates the mean ($\mu$) and variance ($\sigma^2$) for that attribute for each class, and then uses the Gaussian probability density function to find $P(X_i|Y)$.

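* **m-Estimate (Laplace Correction):** This technique handles the zero-probability problem. If a specific attribute-value pair never appears in the training set for a class (e.g., $P(\text{Marital Status} = \text{Married} | \text{Class} = \text{Yes}) = 0$), the product in the Naive Bayes formula, and with it the entire posterior probability, becomes zero. The m-estimate prevents this by adding $m$ "virtual" examples: $P(x_i|y) = (n_c + m \times p) / (n + m)$, where $n$ is the number of training records of class $y$, $n_c$ is the number of those records with value $x_i$, and $p$ is a prior estimate of the probability. The Laplace correction is the special case with one virtual example per possible attribute value.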

* **Characteristics:** Naive Bayes is robust to noise and irrelevant attributes. However, its performance can suffer if the attributes are highly correlated, as this violates its fundamental assumption.
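
The two estimation techniques above, the Gaussian density for continuous attributes and the m-estimate for categorical ones, can be sketched as follows. The function names and all numeric inputs are assumptions for illustration. The m-estimate is written in its standard form $(n_c + m \times p) / (n + m)$.

```python
import math


def gaussian_pdf(x: float, mean: float, variance: float) -> float:
    """Gaussian density, used as P(X_i = x | Y) for a continuous attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def m_estimate(n_c: int, n: int, m: float, p: float) -> float:
    """Smoothed estimate of P(x_i | y).

    n_c: records of class y that have the attribute value x_i
    n:   all records of class y
    m:   number of "virtual" examples (equivalent sample size)
    p:   prior estimate of the probability
    """
    return (n_c + m * p) / (n + m)


# Hypothetical case: the value never occurs among 3 records of the class.
print(m_estimate(n_c=0, n=3, m=0, p=1 / 3))  # no smoothing: the estimate is zero
print(m_estimate(n_c=0, n=3, m=3, p=1 / 3))  # smoothed: (0 + 1) / (3 + 3)

# Density of a standard normal at its mean, 1 / sqrt(2 * pi).
print(gaussian_pdf(0.0, mean=0.0, variance=1.0))
```

Note that `gaussian_pdf` returns a density, not a probability, so it can exceed 1 when the variance is small. That is harmless here, since only the comparison between classes matters.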


### Bayes' Theorem

Bayes' Theorem is a statistical principle that combines prior knowledge with new evidence gathered from data.

Given two random variables, X and Y, their relationship can be described by:
* **Joint Probability, $P(X,Y)$:** The probability that X takes a specific value *and* Y takes a specific value.
* **Conditional Probability, $P(Y|X)$:** The probability that Y takes a specific value *given that* X's value is known.

These probabilities are linked by the formula:
$$P(X,Y) = P(Y|X) \times P(X) = P(X|Y) \times P(Y)$$

Dividing this relationship by $P(X)$ gives Bayes' Theorem:
$$P(Y|X) = \frac{P(X,Y)}{P(X)} = \frac{P(X|Y) \times P(Y)}{P(X)}$$
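
The link between joint and conditional probabilities can be checked numerically. The sketch below uses a small joint table with illustrative values (an assumption, not data from the text) and verifies both factorizations of $P(X,Y)$.

```python
from fractions import Fraction as F

# Illustrative joint distribution P(X, Y); the four entries sum to 1.
joint = {
    ("a", "d"): F(4, 10),
    ("a", "f"): F(1, 10),
    ("b", "d"): F(2, 10),
    ("b", "f"): F(3, 10),
}


def p_x(x):
    """Marginal P(X = x): sum the joint over all values of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def p_y(y):
    """Marginal P(Y = y): sum the joint over all values of X."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def p_y_given_x(y, x):
    return joint[(x, y)] / p_x(x)


def p_x_given_y(x, y):
    return joint[(x, y)] / p_y(y)


for (x, y), p in joint.items():
    # P(X,Y) = P(Y|X) * P(X) = P(X|Y) * P(Y)
    assert p == p_y_given_x(y, x) * p_x(x) == p_x_given_y(x, y) * p_y(y)

print(p_y_given_x("d", "a"))  # 4/5
```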

### Summary of Naive Bayes Classification Problem

This document presents a classification problem solved using the Naive Bayes algorithm. The goal is to build a predictive model to estimate whether a third-year college student will graduate within five years ("D" - within the deadline) or take longer ("F" - outside the deadline).

The model is built using a dataset of 10 graduated students. The attributes, all of which can be "A" or "B," are:
* **DD:** Performance in courses.
* **DA:** Performance in complementary activities.
* **DE:** Performance in internships.

The solution demonstrates the steps to build the model and classify new instances.

#### a) Building the Naive Bayes Model

1.  **Prior Probabilities (P(Class)):**
    * $P(D) = 6/10$
    * $P(F) = 4/10$

2.  **Class Conditional Probabilities (P(Attribute=Value | Class)):**
    * $P(DD=A|D) = 4/6$
    * $P(DD=B|D) = 2/6$
    * $P(DA=A|D) = 3/6$
    * $P(DA=B|D) = 3/6$
    * $P(DE=A|D) = 4/6$
    * $P(DE=B|D) = 2/6$
    * $P(DD=A|F) = 1/4$
    * $P(DD=B|F) = 3/4$
    * $P(DA=A|F) = 3/4$
    * $P(DA=B|F) = 1/4$
    * $P(DE=A|F) = 2/4$
    * $P(DE=B|F) = 2/4$
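
Each value above is a relative frequency counted from the training table. The sketch below shows one way such counting could be done. The function name `fit_naive_bayes` and the four toy records are invented for illustration. They are not the 10 student records of the exercise, which are not reproduced in this summary.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def fit_naive_bayes(records, label_key):
    """Estimate priors and class conditional probabilities by counting."""
    class_counts = Counter(r[label_key] for r in records)
    priors = {c: Fraction(n, len(records)) for c, n in class_counts.items()}

    # (class, attribute) -> how often each attribute value occurs in that class
    value_counts = defaultdict(Counter)
    for r in records:
        for attr, value in r.items():
            if attr != label_key:
                value_counts[(r[label_key], attr)][value] += 1

    conditionals = {
        (cls, attr): {v: Fraction(n, class_counts[cls]) for v, n in counts.items()}
        for (cls, attr), counts in value_counts.items()
    }
    return priors, conditionals


# Hypothetical toy records, NOT the data set from the exercise.
toy = [
    {"DD": "A", "DA": "B", "class": "D"},
    {"DD": "A", "DA": "A", "class": "D"},
    {"DD": "B", "DA": "A", "class": "D"},
    {"DD": "B", "DA": "A", "class": "F"},
]

priors, conditionals = fit_naive_bayes(toy, "class")
print(priors)                     # D: 3/4, F: 1/4
print(conditionals[("D", "DD")])  # A: 2/3, B: 1/3
```

Values that never occur with a class (such as `DD = "A"` for class `"F"` in the toy data) are simply absent from the result. This is the zero-probability situation that smoothing is meant to address.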

#### b) Classification Test for Student $X = (DD=A, DA=B, DE=B)$

1.  **Calculate $P(X | Class)$:**
    * $P(X|D) = P(DD=A|D) \times P(DA=B|D) \times P(DE=B|D) = (4/6) \times (3/6) \times (2/6) = 1/9$
    * $P(X|F) = P(DD=A|F) \times P(DA=B|F) \times P(DE=B|F) = (1/4) \times (1/4) \times (2/4) = 1/32$

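2.  **Compare Posterior Probabilities $P(Class | X)$:** The denominator $P(X)$ is the same for both classes and is omitted, so the values below are proportional to the posteriors rather than equal to them.
    * $P(D|X) \propto P(D) \times P(X|D) = (6/10) \times (1/9) = 1/15$
    * $P(F|X) \propto P(F) \times P(X|F) = (4/10) \times (1/32) = 1/80$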

**Result:** Since $1/15 > 1/80$, the student (A, B, B) is classified as "D" (will graduate within the deadline).
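
The calculation in part b) can be reproduced with exact fractions. In this sketch the probabilities are copied from part a), while the names `PRIORS`, `COND`, `scores`, and `classify` are illustrative choices.

```python
from fractions import Fraction as F

# Model from part a).
PRIORS = {"D": F(6, 10), "F": F(4, 10)}
COND = {
    "D": {
        "DD": {"A": F(4, 6), "B": F(2, 6)},
        "DA": {"A": F(3, 6), "B": F(3, 6)},
        "DE": {"A": F(4, 6), "B": F(2, 6)},
    },
    "F": {
        "DD": {"A": F(1, 4), "B": F(3, 4)},
        "DA": {"A": F(3, 4), "B": F(1, 4)},
        "DE": {"A": F(2, 4), "B": F(2, 4)},
    },
}


def scores(student):
    """Return P(Class) * P(X | Class) for each class (posterior up to 1 / P(X))."""
    result = {}
    for cls, prior in PRIORS.items():
        score = prior
        for attr, value in student.items():
            score *= COND[cls][attr][value]  # conditional independence assumption
        result[cls] = score
    return result


def classify(student):
    s = scores(student)
    return max(s, key=s.get)


student = {"DD": "A", "DA": "B", "DE": "B"}
s = scores(student)
print(s)                  # D: 1/15, F: 1/80
print(classify(student))  # D

# Dividing by P(X) = sum of the scores gives the actual posteriors.
total = sum(s.values())
print({cls: v / total for cls, v in s.items()})  # D: 16/19, F: 3/19
```

Normalizing does not change the decision, but it shows that the model assigns this student roughly an 84% probability of graduating within the deadline.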

#### c) Classification Test for Student (B, B, B)

The student is classified as "F" (outside the deadline) because the unnormalized posterior $P(F) \times P(X|F) = 0.0375$ is greater than $P(D) \times P(X|D) \approx 0.033$.
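
The two values compared here follow from the same steps as in part b), using the probabilities from part a):

$$
\begin{aligned}
P(D) \times P(X|D) &= \frac{6}{10} \times \frac{2}{6} \times \frac{3}{6} \times \frac{2}{6} = \frac{6}{10} \times \frac{1}{18} = \frac{1}{30} \approx 0.033 \\
P(F) \times P(X|F) &= \frac{4}{10} \times \frac{3}{4} \times \frac{1}{4} \times \frac{2}{4} = \frac{4}{10} \times \frac{3}{32} = \frac{3}{80} = 0.0375
\end{aligned}
$$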

#### d) Classification Test for Student (A, A, A)

The student is classified as "D" (within the deadline) because the unnormalized posterior $P(D) \times P(X|D) \approx 0.133$ is greater than $P(F) \times P(X|F) = 0.0375$.
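
As in part b), both values are products of the prior and the three class conditional probabilities from part a):

$$
\begin{aligned}
P(D) \times P(X|D) &= \frac{6}{10} \times \frac{4}{6} \times \frac{3}{6} \times \frac{4}{6} = \frac{6}{10} \times \frac{2}{9} = \frac{2}{15} \approx 0.133 \\
P(F) \times P(X|F) &= \frac{4}{10} \times \frac{1}{4} \times \frac{3}{4} \times \frac{2}{4} = \frac{4}{10} \times \frac{3}{32} = \frac{3}{80} = 0.0375
\end{aligned}
$$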