# Machine Learning Overview
This notebook covers common machine learning algorithms, the metrics used to evaluate them, and the differences between supervised and unsupervised learning.

## 1. Supervised vs Unsupervised Learning

### Supervised Learning
- **Data is labeled**: We know the output for every input.
- **Goal**: Learn a function that maps input to output.
- **Examples**: Classification (e.g. spam detection), Regression (e.g. price prediction)

### Unsupervised Learning
- **Data is unlabeled**: No outputs provided.
- **Goal**: Find structure in the data (e.g., group similar items).
- **Examples**: Clustering, Dimensionality Reduction

### Comparison

| Feature    | Supervised Learning                 | Unsupervised Learning |
|------------|-------------------------------------|-----------------------|
| Labels     | Required                            | Not required          |
| Goal       | Predict outputs                     | Find hidden patterns  |
| Algorithms | KNN, SVM, Logistic Regression, etc. | KMeans, PCA, etc.     |

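The difference in the table shows up directly in the scikit-learn API: a supervised estimator is fit on inputs and labels, an unsupervised one on inputs only. The following is a minimal sketch on the iris data; the variable names and the `n_init=10` setting are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the estimator is fit on inputs AND labels.
clf = LogisticRegression(max_iter=200).fit(X, y)
supervised_pred = clf.predict(X[:5])

# Unsupervised: the estimator only ever sees the inputs.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.predict(X[:5])

print("Predicted classes:", supervised_pred)
print("Cluster ids:", cluster_ids)
```

Note that the cluster ids are arbitrary integers. They identify groups, but they are not guaranteed to line up with the class labels.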

## 2. Algorithms Overview and Differences

### Classification Algorithms

| Algorithm         | Type           | Pros                             | Cons                                             |
|-------------------|----------------|----------------------------------|--------------------------------------------------|
| **KNN**           | Classification | Simple, intuitive                | Prediction is slow on large datasets             |
| **SVM**           | Classification | Good for high-dimensional data   | Hard to tune, sensitive to noisy data            |
| **Naive Bayes**   | Classification | Fast, good with text data        | Assumes features are independent given the class |
| **Logistic Reg.** | Classification | Probabilistic output, simple     | Linear decision boundaries only                  |
| **Decision Tree** | Classification | Easy to visualize, interpretable | Can overfit easily                               |
| **Random Forest** | Classification | Powerful, reduces overfitting    | Slower than a single tree, harder to interpret   |

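One way to compare the classifiers in the table on equal footing is k-fold cross-validation, which averages accuracy over several train/test splits instead of relying on a single one. This is a sketch under assumptions: the `models` dictionary, the hyperparameters, and `cv=5` are illustrative and not taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical selection of models; hyperparameters are illustrative.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Logistic Reg.": LogisticRegression(max_iter=200),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validation: mean accuracy and spread per model.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name:14s} mean={scores.mean():.3f} std={scores.std():.3f}")
```

On a small, easy dataset such as iris the models tend to score close together, so the trade-offs in the Pros and Cons columns matter more than the raw accuracy.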
### Regression

| Algorithm           | Type      | Use Case                     |
|--------------------|-----------|------------------------------|
| **Linear Regression** | Regression | Predicting continuous values |


In [13]:
from sklearn.datasets import load_iris, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import (
    accuracy_score, classification_report, mean_squared_error,
    r2_score, silhouette_score
)

import numpy as np


## 3. Model Examples

In [14]:
# Logistic Regression
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print("Logistic Regression Accuracy:", model.score(X_test, y_test))


Logistic Regression Accuracy: 0.9333333333333333


In [15]:
# KNN
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print("KNN Accuracy:", model.score(X_test, y_test))

KNN Accuracy: 0.9333333333333333


In [16]:
# SVM
model = SVC()
model.fit(X_train, y_train)
print("SVM Accuracy:", model.score(X_test, y_test))

SVM Accuracy: 0.9666666666666667


In [17]:
# Naive Bayes
model = GaussianNB()
model.fit(X_train, y_train)
print("Naive Bayes Accuracy:", model.score(X_test, y_test))

Naive Bayes Accuracy: 0.9333333333333333


In [18]:
# Decision Tree
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print("Decision Tree Accuracy:", model.score(X_test, y_test))

Decision Tree Accuracy: 0.9333333333333333

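The interpretability of decision trees can be seen by printing the learned rules as text. The sketch below uses `sklearn.tree.export_text`; the `max_depth=2` limit is an illustrative choice to keep the output short, and the tree is fit on the full iris data rather than on the split used above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (max_depth=2 is an illustrative choice) keeps the rules short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Render the split thresholds and leaf classes as indented text.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Limiting the depth also acts as a simple guard against overfitting, at the cost of a coarser model.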

In [19]:
# Random Forest
model = RandomForestClassifier()
model.fit(X_train, y_train)
print("Random Forest Accuracy:", model.score(X_test, y_test))

Random Forest Accuracy: 0.9333333333333333


In [20]:
# Linear Regression
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = LinearRegression()
model.fit(X_train, y_train)
print("R² Score (Linear Regression):", model.score(X_test, y_test))

R² Score (Linear Regression): 0.3454668312171736


In [21]:
# Clustering
X = load_iris().data
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)
print("Cluster centers:", kmeans.cluster_centers_)

Cluster centers: [[5.88360656 2.74098361 4.38852459 1.43442623]
 [5.006      3.428      1.462      0.246     ]
 [6.85384615 3.07692308 5.71538462 2.05384615]]


## 4. Performance Metrics

### Classification

| Metric        | Description                                         |
|---------------|-----------------------------------------------------|
| **Accuracy**  | Fraction of predictions that are correct            |
| **Precision** | True positives / (true positives + false positives) |
| **Recall**    | True positives / (true positives + false negatives) |
| **F1 Score**  | Harmonic mean of precision and recall               |

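To make the formulas in the table concrete, the sketch below computes each metric by hand from true/false positive and negative counts and checks the result against scikit-learn. The labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score
)

# Made-up binary labels, purely for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives: 3
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives: 1
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives: 1
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives: 5

accuracy = (tp + tn) / len(y_true)                   # 8 / 10 = 0.8
precision = tp / (tp + fp)                           # 3 / 4 = 0.75
recall = tp / (tp + fn)                              # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75

print(accuracy, precision, recall, f1)
# The hand-computed values agree with scikit-learn's implementations.
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```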

In [24]:
# Classification metrics with Logistic Regression
from sklearn.linear_model import LogisticRegression
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)

print("Classification Report:")
print(classification_report(y_test, clf.predict(X_test)))


Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       1.00      1.00      1.00        11
           2       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



### Regression

| Metric       | Description                                                 |
|--------------|-------------------------------------------------------------|
| **MSE**      | Mean of squared errors; penalizes large errors heavily      |
| **R² Score** | Proportion of variance in the target explained by the model |

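Both regression metrics can be worked out by hand on a small example. The sketch below uses made-up targets and predictions and compares the manual results with scikit-learn; the names `y_hat`, `ss_res`, and `ss_tot` are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat                      # [ 1.  0. -1. -2.]
mse = np.mean(residuals ** 2)                   # (1 + 0 + 1 + 4) / 4 = 1.5

ss_res = np.sum(residuals ** 2)                 # residual sum of squares: 6.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0
r2 = 1 - ss_res / ss_tot                        # 1 - 6 / 20 = 0.7

print("MSE:", mse, "R²:", r2)
print("sklearn:", mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```

Squaring the residuals is why a single error of 2 contributes four times as much to the MSE as an error of 1.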

In [None]:
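# Regression metrics. Refit on the diabetes data: X_test and y_test were
# overwritten by the iris split in the classification-metrics cell.
X_d, y_d = load_diabetes(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_d, y_d)
reg = LinearRegression().fit(Xd_train, yd_train)
y_pred = reg.predict(Xd_test)
print("MSE:", mean_squared_error(yd_test, y_pred))
print("R²:", r2_score(yd_test, y_pred))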

MSE: 3349.225519027615
R²: 0.4469357891154532


### Clustering

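| Metric               | Use Case                                                                                |
|----------------------|-----------------------------------------------------------------------------------------|
| **Silhouette Score** | Measures how cohesive and well separated clusters are (range -1 to 1, higher is better) |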

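A common use of the silhouette score is to compare different numbers of clusters. The sketch below does this for KMeans on the iris data; the range of `k` values and the `n_init=10` setting are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_iris = load_iris().data

# Fit KMeans for several cluster counts and record the silhouette score.
sil_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_iris)
    sil_by_k[k] = silhouette_score(X_iris, labels)
    print(f"k={k}: silhouette={sil_by_k[k]:.3f}")

best_k = max(sil_by_k, key=sil_by_k.get)
print("Best k by silhouette:", best_k)
```

The score only measures geometric separation, so the `k` it favors does not have to match the number of real-world classes in the data.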

In [None]:
# Clustering score
print("Silhouette Score:", silhouette_score(X, kmeans.labels_))

Silhouette Score: 0.551191604619592


## 5. Summary Table

| Task             | Recommended Algorithm | Reason                                    |
|------------------|-----------------------|-------------------------------------------|
| Classification   | Random Forest / SVM   | Accurate and robust                       |
| Regression       | Linear Regression     | Simple, fast                              |
| Clustering       | KMeans                | Popular, interpretable                    |
| Text data        | Naive Bayes           | Fast, works well with word-count features |
| Interpretability | Decision Tree         | Easy to visualize                         |

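As an illustration of the text-data row, the sketch below classifies short messages with Naive Bayes on word counts. It makes some assumptions: it uses `MultinomialNB` (the variant usually paired with count features) rather than the `GaussianNB` used earlier in this notebook, and the tiny corpus and its labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only (1 = spam, 0 = not spam).
texts = [
    "win a free prize now",
    "free money claim your prize",
    "limited offer win cash now",
    "meeting moved to monday morning",
    "please review the attached report",
    "lunch with the project team",
]
labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer turns each message into word counts; MultinomialNB models them.
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(texts, labels)

new_messages = ["claim your free cash prize", "team meeting on monday"]
predictions = text_clf.predict(new_messages)
print(predictions)
```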