<a href="https://colab.research.google.com/github/pranavinemalikanti/BayesLabAssignments/blob/main/Copy_of_MLConcepts_Day2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from ipywidgets import interact, widgets

# Ensure visuals are inline
%matplotlib inline


# Section 1: Introduction to Logistic Regression


#### **What is Logistic Regression?**
---


##### Imagine This:

You're trying to predict if a student will pass an exam. The answer is either **Yes (1)** or **No (0)**. You can't draw a straight line like in Linear Regression because the output is not continuous; it's a yes or no.

That's where **Logistic Regression** comes in. It's used for **classification** problems.



##### Key Idea:
Instead of predicting a number, Logistic Regression predicts a **probability**. This probability is then converted into a class (0 or 1).

- **Probability > 0.5?** Predict 1 (Yes)
- **Probability ≤ 0.5?** Predict 0 (No)
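
As a minimal sketch of the thresholding rule above, the snippet below converts a few probabilities into classes. The probability values are made up for illustration and do not come from any model in this notebook.

```python
import numpy as np

# Hypothetical predicted probabilities for five students (illustrative values only)
probabilities = np.array([0.10, 0.45, 0.50, 0.51, 0.93])

# Probability > 0.5 -> class 1 (Yes); probability <= 0.5 -> class 0 (No)
threshold = 0.5
predicted_class = (probabilities > threshold).astype(int)

print(predicted_class)  # [0 0 0 1 1]
```

Note that a probability of exactly 0.5 falls on the "No" side under this rule.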



##### The Sigmoid Function
The sigmoid function maps any real number to a value between 0 and 1, which can be read as a probability:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where:
- $z = w \cdot x + b$ (like the line equation in linear regression)
- $e$ is Euler's number (about 2.718)
- $\sigma(z)$ outputs values between 0 and 1 (probabilities!)
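
To make the formula concrete, here is a small worked example that computes the linear score and passes it through the sigmoid. The weight, bias, and input are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight, bias, and input (hours studied); illustrative values only
w, b = 1.2, -6.0
x = 6.0

z = w * x + b      # linear score, about 1.2 for these values
p = sigmoid(z)     # probability of the positive class

print(f"z = {z:.2f}, sigmoid(z) = {p:.3f}")
print(sigmoid(0.0))  # 0.5: a score of zero sits exactly on the threshold
```

A positive score gives a probability above 0.5, a negative score gives one below 0.5.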


# Section 2: Visualizing the Sigmoid Function

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

In [None]:
# Plotting the Sigmoid Curve
z = np.linspace(-10, 10, 100)
s = sigmoid(z)

plt.figure(figsize=(8, 5))
plt.plot(z, s, color='red')
plt.title('Sigmoid Function', fontsize=14)
plt.xlabel('z')
plt.ylabel('Sigmoid(z)')
plt.grid()
plt.show()

In [None]:
# --- Interactive Visualization ---
def interactive_sigmoid(weight=1.0, bias=0.0):
    x = np.linspace(-10, 10, 100)
    s = sigmoid(weight * x + bias)  # z = weight * x + bias
    plt.figure(figsize=(8, 5))
    plt.plot(x, s, color='blue')
    plt.title(f'Sigmoid Function (Weight={weight}, Bias={bias})')
    plt.xlabel('x')
    plt.ylabel('Sigmoid(weight * x + bias)')
    plt.grid()
    plt.show()

interact(interactive_sigmoid, weight=(-5.0, 5.0, 0.1), bias=(-5.0, 5.0, 0.1))
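
While moving the sliders, it can help to know what to look for. The sketch below computes two quantities for a few slider settings: the input where the curve crosses 0.5 (at `-bias / weight`) and the slope of the curve at that point (`weight / 4`). The helper name `midpoint` is introduced here for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def midpoint(weight, bias):
    """Input at which sigmoid(weight * x + bias) equals 0.5 (requires weight != 0)."""
    return -bias / weight

for weight, bias in [(1.0, -2.0), (1.0, 2.0), (3.0, -6.0)]:
    x_mid = midpoint(weight, bias)
    # Derivative of sigmoid(weight * x + bias) with respect to x at the midpoint
    slope_at_mid = weight / 4.0
    print(f"weight={weight}, bias={bias}: midpoint x={x_mid:.2f}, slope there={slope_at_mid:.2f}")
```

In short, the bias shifts the curve left or right, and a larger weight makes the transition steeper.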

# Section 3: Binary Classification with Logistic Regression

#### **Example: Predicting if a Student Passes**

- **Features (X):** Hours Studied
- **Target (y):** Pass (1) or Fail (0)

Let's create a simple dataset.

In [None]:
# Sample Data
np.random.seed(0)
hours = np.random.randint(1, 10, 20)
pass_fail = (hours + np.random.randn(20) > 5).astype(int)

In [None]:
# Data Visualization
plt.figure(figsize=(8, 5))
plt.scatter(hours, pass_fail, color='purple', s=100, alpha=0.7)
plt.xlabel('Hours Studied')
plt.ylabel('Pass (1) / Fail (0)')
plt.title('Pass/Fail Based on Study Hours')
plt.grid()
plt.show()

In [None]:
# Model Training
X = hours.reshape(-1, 1)
y = pass_fail

log_reg = LogisticRegression()
log_reg.fit(X, y)

In [None]:
# Predictions
y_pred = log_reg.predict(X)

In [None]:
# Evaluation (on the training data itself, so these scores are optimistic)
print("Accuracy:", accuracy_score(y, y_pred))
print("\nClassification Report:\n", classification_report(y, y_pred))

In [None]:
# Decision Boundary Visualization
x_range = np.linspace(0, 10, 100).reshape(-1, 1)
probabilities = log_reg.predict_proba(x_range)[:, 1]

plt.figure(figsize=(8, 5))
plt.scatter(hours, pass_fail, color='purple', s=100, alpha=0.7, label='Data')
plt.plot(x_range, probabilities, color='green', label='Sigmoid Curve')
plt.axhline(0.5, color='red', linestyle='--', label='Decision Boundary (0.5)')
plt.xlabel('Hours Studied')
plt.ylabel('Probability of Passing')
plt.title('Logistic Regression: Decision Boundary')
plt.legend()
plt.grid()
plt.show()
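
The point where the green curve crosses the 0.5 line can also be computed directly from the fitted parameters, since the probability is 0.5 exactly when `w * x + b = 0`. The sketch below recreates the synthetic data from this section and reads the weight and bias from scikit-learn's `coef_` and `intercept_` attributes; the variable name `boundary_hours` is introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Recreate the synthetic study-hours data used in this section
np.random.seed(0)
hours = np.random.randint(1, 10, 20)
pass_fail = (hours + np.random.randn(20) > 5).astype(int)

log_reg = LogisticRegression()
log_reg.fit(hours.reshape(-1, 1), pass_fail)

w = log_reg.coef_[0, 0]      # learned weight for hours studied
b = log_reg.intercept_[0]    # learned bias
boundary_hours = -b / w      # hours at which P(pass) = 0.5

# Sanity check: the predicted probability at the boundary should be 0.5
p_at_boundary = log_reg.predict_proba([[boundary_hours]])[0, 1]
print(f"w = {w:.3f}, b = {b:.3f}, boundary at {boundary_hours:.2f} hours")
print(f"P(pass) at the boundary = {p_at_boundary:.3f}")
```

Students with more study hours than `boundary_hours` are predicted to pass; those with fewer are predicted to fail.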

# Section 4: Multiclass Classification with Logistic Regression


#### **Example: Classifying Iris Flowers**
We'll classify Iris flowers into three species:
- Setosa
- Versicolor
- Virginica


In [None]:
# Load Iris Dataset
iris = load_iris()
X_iris = iris.data[:, :2]  # Using two features for visualization
y_iris = iris.target

In [None]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

In [None]:
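# Model Training (one-vs-rest; the multi_class argument is deprecated since scikit-learn 1.5)
from sklearn.multiclass import OneVsRestClassifier

log_reg_multi = OneVsRestClassifier(LogisticRegression())
log_reg_multi.fit(X_train, y_train)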

In [None]:
# Predictions
y_pred_multi = log_reg_multi.predict(X_test)
print(y_pred_multi)

In [None]:
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_multi))
print("\nClassification Report:\n", classification_report(y_test, y_pred_multi))
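
Accuracy alone does not show which species get confused with each other. The imports at the top of this notebook include `confusion_matrix`, so one possible follow-up is sketched below. It is self-contained and, as an assumption, uses `LogisticRegression` with its default multiclass handling and `max_iter=1000` rather than the exact model trained above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, :2]   # the same two sepal features as above
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Assumption: default multiclass handling, not necessarily the setting used above
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1, 2])
print(cm)
```

Off-diagonal entries count misclassifications; with only the two sepal features, most of them are likely to fall between versicolor and virginica.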


In [None]:
# Visualization
plt.figure(figsize=(8, 5))
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_multi, cmap='viridis', edgecolors='k')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('Multiclass Classification with Logistic Regression')
plt.grid()
plt.show()

## **Section 5: Assignments**

### Beginner Assignment: Predicting Pass/Fail Based on Study Hours

Task:
- Use Logistic Regression to predict if a student will pass based on study hours.
- Visualize the data and decision boundary.
- Use real-world data from Kaggle.

**Hint:** Use a Kaggle dataset and adapt the code above.

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mrsimple07/student-exam-performance-prediction")

print("Path to dataset files:", path)

In [None]:
import os

# Check files inside the downloaded dataset folder
print("Files in dataset folder:", os.listdir(path))

In [None]:
csv_file_path = os.path.join(path, "student_exam_data_new.csv")  # Replace with the actual file name
df = pd.read_csv(csv_file_path)
print(df.head())  # Display first few rows

In [None]:
# Remove the Previous Exam Score column so that Study Hours is the only feature
df.drop(columns=["Previous Exam Score"], inplace=True)
print(df.head())

In [None]:
# Data Visualization
plt.figure(figsize=(8, 5))
plt.scatter(df['Study Hours'], df['Pass/Fail'], color='blue', s=100, alpha=0.7)
plt.xlabel('Hours Studied')
plt.ylabel('Pass (1) / Fail (0)')
plt.title('Pass/Fail Based on Study Hours')
plt.grid()
plt.show()

In [None]:
# Model Training
X = df[['Study Hours']]  # feature: study hours only (the target must not be part of X)
y = df['Pass/Fail']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

In [None]:
# Decision Boundary Visualization

# Generate smooth values over the range of the single feature column
feature = X_test.columns[0]
X_range = pd.DataFrame({feature: np.linspace(X_test[feature].min(), X_test[feature].max(), 100)})
# Predict probabilities for these values
y_prob_range = model.predict_proba(X_range)[:, 1]

plt.figure(figsize=(8, 5))
plt.scatter(X_test[feature], y_test, color='purple', s=100, alpha=0.7, label='Data')
plt.plot(X_range[feature], y_prob_range, color='blue', label='Sigmoid Curve')
plt.axhline(0.5, color='red', linestyle='--', label='Decision Boundary (0.5)')
plt.xlabel('Hours Studied')
plt.ylabel('Probability of Passing')
plt.title('Logistic Regression: Decision Boundary')
plt.legend()
plt.grid()
plt.show()

### Advanced Assignment: Classifying Flowers

Task:
- Use the Iris dataset from Kaggle.
- Train Logistic Regression for multiclass classification.
- Visualize the classification results with decision boundaries.

**Hint:** Experiment with different features and observe how performance changes.
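
One way to act on this hint is to loop over every pair of features and compare test accuracy. The sketch below is an assumption-laden illustration: it loads Iris from scikit-learn's `load_iris` instead of the Kaggle CSV, uses a stratified split, and trains `LogisticRegression` with `max_iter=1000`.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train one model per pair of feature columns and record its test accuracy
results = {}
for i, j in combinations(range(X.shape[1]), 2):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[:, [i, j]], y_train)
    pair = (iris.feature_names[i], iris.feature_names[j])
    results[pair] = clf.score(X_test[:, [i, j]], y_test)

# Print the pairs from best to worst
for pair, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pair[0]} + {pair[1]}: accuracy = {acc:.3f}")
```

With a test set of only 30 flowers, small differences between pairs should be read with caution.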


In [None]:
path = kagglehub.dataset_download("himanshunakrani/iris-dataset")
print("Path to dataset files:", path)

In [None]:
print(os.listdir(path))

In [None]:
csv_file_path = os.path.join(path, "iris.csv")  # Replace with the actual file name
df = pd.read_csv(csv_file_path)
print(df.head())

In [None]:
X_iris = df.drop(columns=['species', 'petal_length', 'petal_width'])
y_iris = df['species']
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

In [None]:
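from sklearn.multiclass import OneVsRestClassifier  # multi_class='ovr' is deprecated since scikit-learn 1.5
model = OneVsRestClassifier(LogisticRegression())
model.fit(X_train, y_train)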

In [None]:
y_pred = model.predict(X_test)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_numeric = label_encoder.fit_transform(y_iris)  # Convert 'setosa', 'versicolor', etc. -> 0, 1, 2
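
For reference, here is a small standalone sketch of how `LabelEncoder` assigns codes and how to map them back to names. The `species` list is illustrative and not read from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

species = ['setosa', 'versicolor', 'virginica', 'setosa']  # illustrative labels
encoder = LabelEncoder()
codes = encoder.fit_transform(species)

print(codes)                              # [0 1 2 0]
print(encoder.classes_.tolist())          # ['setosa', 'versicolor', 'virginica']
print(encoder.inverse_transform([2, 0]))  # ['virginica' 'setosa']
```

The codes follow the sorted order of the class names, so the integers carry no meaning beyond identifying a class.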

In [None]:
# Visualization: the full dataset, colored by true species
plt.figure(figsize=(8, 5))
plt.scatter(df['sepal_length'], df['sepal_width'], c=y_numeric, cmap='viridis', edgecolors='k')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.title('Iris Species by Sepal Length and Width')
plt.grid()
plt.show()
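
The task also asks for decision boundaries. One common way to draw them is to predict a class for every point on a fine grid and shade the resulting regions. The sketch below is a rough illustration under stated assumptions: it uses scikit-learn's `load_iris` instead of the Kaggle CSV, fits `LogisticRegression` with default multiclass handling on all 150 samples, and uses a 200 x 200 grid.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, :2]   # sepal length, sepal width
y = iris.target

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Build a grid covering the feature range and predict a class at every grid point
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
grid_pred = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(8, 5))
plt.contourf(xx, yy, grid_pred, alpha=0.3, cmap='viridis')           # shaded decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolors='k')  # true labels
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('Decision Regions (sketch)')
plt.show()
```

Points whose color differs from the region they sit in are the ones the model misclassifies.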


