**Scikit-Learn** is one of the most popular Python libraries for **machine learning**. 

It provides tools (methods and functions) for:
- Loading datasets
- Data preprocessing
- Training models (supervised and unsupervised)
- Making predictions
- Evaluating performance

---

### 3.1.1. Importing Scikit-Learn

We will work with the Iris dataset.

The Iris dataset is a classic dataset for classification tasks in machine learning.
It contains:
- 150 samples of iris flowers

With the following numeric features:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width 

The samples are divided into 3 classes:
- Iris-setosa
- Iris-versicolor
- Iris-virginica

Load the Iris dataset with `from sklearn.datasets import load_iris`.

You can import other datasets in the same way, as long as they are maintained and provided by Scikit-Learn.

In [None]:
# import Iris dataset
from sklearn.datasets import load_iris
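
As a quick check of the description above, the sketch below loads the dataset and prints its shape, feature names, and class names. It assumes only the standard `load_iris` loader; note that scikit-learn stores the class names without the `Iris-` prefix.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The returned object is a Bunch: a dict-like container with named fields.
print(iris.data.shape)           # (n_samples, n_features)
print(iris.feature_names)        # names of the four measurements
print(list(iris.target_names))   # class names, indexed by the integer label
```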

In machine learning, we usually separate the learning (training) process from the testing process.

This is achieved with the `train_test_split` function.

In [None]:
# import the train_test_split function
from sklearn.model_selection import train_test_split
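
A minimal sketch of how such a split might look on the Iris data. The `stratify=y` argument is an optional addition, not something this notebook requires; it keeps the class proportions the same in both sets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing; random_state makes the split repeatable.
# stratify=y (optional) keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

The model never sees the held-out test samples during training, so they give an estimate of how it behaves on new data.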

The recorded observations can be skewed or span a wide range of values.

Best practice is to normalize or standardize the observations before training.

Import the class for standardization, `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler
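
To make the effect of standardization concrete, here is a small sketch on made-up numbers (the arrays are illustrative only, not taken from Iris). The scaler learns the mean and standard deviation from the training data and reuses them for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # approximately 1 for each feature
print(X_test_scaled)
```

Fitting the scaler on the training set only avoids leaking information about the test set into the preprocessing step.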

We will use the Logistic Regression model.
It is a machine learning algorithm used for classification tasks, which means it predicts the category or class of an observation rather than a continuous number.

For example, given features of a flower (like petal length and width), Logistic Regression can predict whether the flower is Iris-setosa, Iris-versicolor, or Iris-virginica.
It works by estimating the probability that a given observation belongs to each class and assigning it to the most likely one.

In [None]:
from sklearn.linear_model import LogisticRegression
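
The probability-based behaviour described above can be seen with `predict_proba`. The sketch below fits on the full, unscaled dataset purely for illustration (the experiment later in this notebook uses a proper train/test split), and `max_iter=1000` is an arbitrary choice to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Fit on the full dataset purely to illustrate the probability output.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

proba = model.predict_proba(X[:3])   # one row per sample, one column per class
labels = model.predict(X[:3])        # class with the highest probability

print(np.round(proba, 3))
print(labels)
```

Each row of `proba` sums to 1, and `predict` returns the index of the largest entry in that row.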

To measure the quality and performance of the model, we will use several mathematical measures (metrics).

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
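
As a small illustration of what these two metrics return, the sketch below uses made-up label lists (not model output). Accuracy is the fraction of correct predictions; the confusion matrix counts samples by true class (rows) and predicted class (columns).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

acc = accuracy_score(y_true, y_hat)   # fraction of correct predictions
cm = confusion_matrix(y_true, y_hat)  # rows = true class, columns = predicted class

print(acc)
print(cm)
```

Here five of the six labels match, and the single mistake (a true class 1 predicted as class 2) appears as the off-diagonal entry in row 1, column 2.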

---

### 3.1.2. Experiment

#### Load Dataset

In [None]:
# Load the Iris dataset
iris = load_iris()

# Features and target
X = iris.data      # shape = (150, 4)
y = iris.target    # shape = (150,)

# Optional: view first 5 samples
print("Features (first 5 rows):\n", X[:5])
print("Target (first 5 rows):", y[:5])

#### Splitting Data into Training and Test Sets

In [None]:
# TODO: Split the data into 80% training and 20% testing
# Use train_test_split(X, y, test_size=0.2, random_state=42)
# Explanation: This separates the dataset into training and test sets
X_train, X_test, y_train, y_test = None, None, None, None  # TODO

#### Preprocessing (Standardization)

In [None]:
scaler = StandardScaler()

# TODO: Fit the scaler on the training data and transform both training and test sets
# Use scaler.fit_transform(X_train) for training data
# Use scaler.transform(X_test) for test data
# Explanation: Standardization scales each feature to mean=0 and std=1, which often improves model performance
X_train_scaled = None  # TODO
X_test_scaled = None   # TODO

#### Training a Model (Logistic Regression)

---
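- *Brief Math Behind Linear Regression*

Logistic Regression starts from the same linear score as Linear Regression, so the linear model is reviewed first. Linear Regression predicts a continuous value by fitting a **straight line** (or hyperplane in higher dimensions) to the training data. Logistic Regression instead passes the linear score through a sigmoid (or softmax for several classes) to obtain class probabilities, and it is trained by minimizing the log loss rather than the MSE shown below.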

 **Model equation**:  

$$
\hat{y} = \mathbf{w}^\top \mathbf{x} + b
$$  

where:  
- $\hat{y}$ = predicted value  
- $\mathbf{x}$ = feature vector  
- $\mathbf{w}$ = weights (coefficients)  
- $b$ = bias (intercept)  

With a single feature, $\mathbf{w}$ reduces to the slope of the line and $b$ is the intercept (the offset along the y-axis).

 **Learning rule**: The weights are chosen to minimize the **Mean Squared Error (MSE)**:  

$$
\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} \big(y_i - \hat{y}_i\big)^2
$$  

where $i$ is the sample index and $m$ is the number of samples.
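
The linear model and its MSE can be checked numerically. The sketch below uses made-up one-dimensional data and NumPy's `np.polyfit` as one convenient least-squares solver; it is an illustration of the formulas, not the method scikit-learn uses internally.

```python
import numpy as np

# Made-up 1-D data for illustration: the points lie exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Least-squares fit of a degree-1 polynomial: returns (slope w, intercept b).
w, b = np.polyfit(x, y, deg=1)

y_hat = w * x + b                 # model equation: y_hat = w * x + b
mse = np.mean((y - y_hat) ** 2)   # mean of the squared residuals

print(w, b, mse)
```

Because the points lie exactly on a line, the fitted slope and intercept recover 2 and 1, and the MSE is zero up to floating-point error.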

---


In [None]:
model = LogisticRegression(max_iter=200)

# TODO: Train the model on the scaled training data
# Use model.fit(X_train_scaled, y_train)
# Explanation: The model learns the patterns in the training data
# Hint: Check documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


#### Making Predictions

In [None]:
# TODO: Predict on the test set
# Use model.predict(X_test_scaled)
# Explanation: The model predicts labels for unseen data
y_pred = None  # TODO

#### Evaluating the Model

In [None]:
# TODO: Calculate accuracy using accuracy_score(y_test, y_pred)
accuracy = None  # TODO
print("Accuracy:", accuracy)

# TODO: Confusion matrix using confusion_matrix(y_test, y_pred)
cm = None  # TODO
print("Confusion Matrix:\n", cm)

#### Visualizing the Confusion Matrix

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# TODO: Use seaborn heatmap to visualize 'cm'
# Use sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
# Explanation: Heatmap helps understand which classes are predicted correctly or incorrectly
sns.heatmap(None, annot=True, fmt="d", cmap="Blues")  # TODO
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix Heatmap")
plt.show()


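#### Visualize Regression Line

The classifier above predicts class labels, so it has no regression line to draw. To illustrate the linear regression math, the plots below fit a separate linear regression of petal width on petal length for the Setosa samples.

In [None]:
from sklearn.linear_model import LinearRegression

# Setosa samples only (class 0): petal length is column 2, petal width is column 3
setosa = X[y == 0]
X_petal = setosa[:, [2]]   # kept 2-D, as scikit-learn expects
y_petal = setosa[:, 3]
y_petal_pred = LinearRegression().fit(X_petal, y_petal).predict(X_petal)

plt.scatter(X_petal, y_petal, color='blue', label='Actual Data')
plt.plot(X_petal, y_petal_pred, color='red', label='Regression Line')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('Linear Regression: Petal Width vs Petal Length (Setosa)')
plt.legend()
plt.show()


#### Residuals Plot

It shows whether the model under-predicts or over-predicts the values.

In [None]:
# TODO: The residual is the difference between the actual value and the predicted value
# Use y_petal - y_petal_pred
residuals = None  # TODO

plt.scatter(X_petal, residuals, color='purple')
plt.axhline(0, color='black', linestyle='--')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Residual (y - y_pred)')
plt.title('Residuals Plot')
plt.show()


#### Squared Error Plot

Highlights which points contribute most to the MSE and shows potential outliers.

In [None]:
# The squared error is the squared difference between the actual and the predicted value.
squared_errors = (y_petal - y_petal_pred) ** 2

# TODO: Plot one bar per sample
# Use plt.bar(range(len(squared_errors)), squared_errors, color='orange')
plt.bar(None, None, color='orange')  # TODO
plt.xlabel('Sample Index')
plt.ylabel('Squared Error')
plt.title('Squared Errors per Sample')
plt.show()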