# Code Summary – Explaining the ML Cycle (Classification)

## Importing Python Libraries for Data Handling and Visualization

In this step, we import essential libraries for data analysis and visualization:

- `pandas` for handling tabular data, and `numpy` for numerical operations.
- `plotly.express` for creating interactive plots, and `ipywidgets` for adding UI controls like sliders.


In [None]:
#Libraries for importing and visualizing data
import pandas as pd
import numpy as np
import plotly.express as px
from ipywidgets import interact, FloatSlider

## Mounting Google Drive to Access the Dataset

- We use `drive.mount()` to connect Google Drive to Colab and access stored files.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
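The mount step only works inside Colab, because `google.colab` cannot be imported elsewhere. The following is a minimal sketch of a guard for running the notebook in other environments. The helper name `in_colab` and the local fallback folder `./data` are assumptions made for illustration, not part of the original notebook.

```python
def in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only importable inside Colab)
        return True
    except ImportError:
        return False

if in_colab():
    from google.colab import drive
    drive.mount('/content/drive')
    data_root = '/content/drive/MyDrive'  # Drive root used by the paths in this notebook
else:
    data_root = './data'                  # hypothetical local folder (assumption)

print(f"Reading data from: {data_root}")
```

With a guard like this, later cells could build file paths from `data_root` instead of hard-coding the Colab location.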



## Importing and Viewing the Student Retention Dataset

We load the student academic dataset and explore its structure:

- `pd.set_option('display.max_columns', None)` displays all columns when viewing data.
- `pd.read_csv()` loads the CSV file from Google Drive into a DataFrame called `df`.
- `df.head()` shows the first 5 rows to preview the dataset structure.
- `print(df.shape)` displays the dataset dimensions.



In [None]:
# Show every column when displaying the DataFrame, then preview the first rows and the shape
pd.set_option('display.max_columns', None)
df = pd.read_csv('/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/data/student_academics_data.csv')
display(df.head())  # display() is needed because df.head() is not the last expression in the cell
print(df.shape)

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/data/student_academics_data.csv'
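A `FileNotFoundError` like the one above usually means that Drive is not mounted or that the folder path differs from the one in the code. As a minimal sketch, a small wrapper can check the path before reading and fail with a more specific message. The helper name `load_csv_checked` and the temporary demo file are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_csv_checked(path):
    """Hypothetical helper: read a CSV, failing early with a clearer message."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Is Drive mounted, and does the folder path match?"
        )
    return pd.read_csv(path)

# Demo on a throwaway file so the sketch runs anywhere (made-up values)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("COHORT,GPA_1\nFall 2022,3.2\nFall 2021,2.9\n")
    demo_df = load_csv_checked(demo_path)

print(demo_df.shape)  # (2, 2)
```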


- We filter the dataset to include only students from the Fall 2022 cohort to ensure consistent analysis across the same starting group.


In [None]:
df = df[df['COHORT']=='Fall 2022']
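Before filtering, it can help to check which cohort labels exist, since a label that differs even slightly (for example in spacing or capitalization) would silently produce an empty DataFrame. The sketch below uses a small made-up table. Only the column name `COHORT` and the value `'Fall 2022'` come from the notebook.

```python
import pandas as pd

# Toy data with made-up values, for illustration only
toy = pd.DataFrame({
    'COHORT': ['Fall 2022', 'Fall 2021', 'Fall 2022', 'Spring 2022'],
    'GPA_1': [3.1, 2.8, 3.6, 3.0],
})

# Inspect the available cohort labels and their counts
print(toy['COHORT'].value_counts())

# Boolean-mask filter; .copy() gives an independent DataFrame
fall_2022 = toy[toy['COHORT'] == 'Fall 2022'].copy()
print(len(fall_2022))  # 2
```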

- We sample 150 records with key academic features (GPA, DFW units, and retention status) and then drop any incomplete rows to keep the data clean and manageable for visualization. Because rows are dropped after sampling, the result can contain fewer than 150 records.


In [None]:
df_acad_perf = df[['HS_GPA','GPA_1','GPA_2','DFW_UNITS_1','DFW_UNITS_2','SEM_3_STATUS']].sample(n=150,random_state=5).dropna()
df_acad_perf.head()

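| (row index) | HS_GPA | GPA_1 | GPA_2 | DFW_UNITS_1 | DFW_UNITS_2 | SEM_3_STATUS |
|---|---|---|---|---|---|---|
| 4804 | 3.125 | 2.214286 | 2.571429 | 4.0 | 0.0 | E |
| 7628 | 4.061 | 3.714286 | 4.0 | 0.0 | 0.0 | D |
| 17427 | 4.079 | 3.5 | 3.733333 | 0.0 | 0.0 | E |
| 10057 | 4.089 | 4.0 | 3.75 | 0.0 | 0.0 | E |
| 10721 | 4.097 | 4.0 | 3.785714 | 0.0 | 0.0 | E |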
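The order of `.sample()` and `.dropna()` affects how many rows survive. The sketch below compares the two orders on a made-up table. The toy values and the variable names `sample_then_drop` and `drop_then_sample` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: 8 rows, 2 of them with a missing GPA (made-up values)
toy = pd.DataFrame({
    'GPA_1': [3.0, np.nan, 2.5, 3.8, np.nan, 3.2, 2.9, 3.5],
    'SEM_3_STATUS': ['E', 'E', 'D', 'E', 'D', 'E', 'D', 'E'],
})

# Order used in the notebook: sample first, then drop incomplete rows.
# Any sampled row with a missing value is lost, so the result may shrink.
sample_then_drop = toy.sample(n=6, random_state=5).dropna()

# Alternative order: drop incomplete rows first, then sample.
# This returns exactly n rows as long as enough complete rows exist.
drop_then_sample = toy.dropna().sample(n=6, random_state=5)

print(len(sample_then_drop), len(drop_then_sample))
```

If an exact sample size matters, dropping incomplete rows first is the safer order. It does select a different set of rows than the notebook's cell.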


## Visualizing Retention Patterns: HS GPA vs First-Semester GPA

This scatter plot shows the relationship between:

- **HS_GPA** (X-axis) and **GPA_1** (Y-axis)
- Colored by **SEM_3_STATUS**:
  - `E` = Enrolled (retained)
  - `D` = Dropped (not retained)

- Each point represents a student. The chart helps us observe how early academic performance relates to retention and highlights the role of choosing meaningful feature combinations when building classification models.



In [None]:
# size_max is omitted: it only has an effect when a `size` column is also passed
px.scatter(data_frame=df_acad_perf, x='HS_GPA', y='GPA_1', color='SEM_3_STATUS')

## Exploring Feature Separation: DFW Units vs Second-Semester GPA


This scatter plot shows the relationship between:

- **DFW_UNITS_2** (X-axis) and **GPA_2** (Y-axis)
- Colored by **SEM_3_STATUS**:
   - `E` = Enrolled (retained)
   - `D` = Dropped (not retained)

- Each point represents a student. This feature combination provides clearer separation between retained and dropped students, showing how academic performance in semester 2 relates to retention.

- The plot illustrates how the right feature choices can improve class distinction, laying the groundwork for training models like Support Vector Machines (SVMs).


In [None]:
# size_max is omitted: it only has an effect when a `size` column is also passed
px.scatter(data_frame=df_acad_perf, x='DFW_UNITS_2', y='GPA_2', color='SEM_3_STATUS')
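The visual impression of "clearer separation" can be checked with a rough numeric summary: the gap between the class means of each feature, expressed in units of that feature's standard deviation. This is only a sketch on made-up values, and the measure is a simple heuristic chosen for illustration rather than the notebook's method. The column names match the notebook, but the numbers do not come from the real dataset.

```python
import pandas as pd

# Toy data with made-up values, for illustration only
toy = pd.DataFrame({
    'HS_GPA':       [3.9, 3.8, 3.7, 3.9, 3.8, 3.6],
    'GPA_2':        [3.6, 3.4, 3.8, 1.9, 2.2, 1.5],
    'DFW_UNITS_2':  [0.0, 0.0, 3.0, 6.0, 9.0, 7.0],
    'SEM_3_STATUS': ['E', 'E', 'E', 'D', 'D', 'D'],
})

features = ['HS_GPA', 'GPA_2', 'DFW_UNITS_2']

# Mean of each feature within each retention class
group_means = toy.groupby('SEM_3_STATUS')[features].mean()

# Absolute gap between class means, scaled by each feature's overall standard deviation
gap_in_sd = (group_means.loc['E'] - group_means.loc['D']).abs() / toy[features].std()

print(gap_in_sd.sort_values(ascending=False))
```

Features with a larger scaled gap tend to separate the classes better on their own. On the real data, the same lines could be applied to `df_acad_perf`.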

## Installing and Enabling Interactive Visualization Support

This step installs the required packages for interactive visualizations:

- `ipywidgets` enables UI elements like sliders.
- `plotly` supports interactive plotting.
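- `jupyter nbextension enable` activates the widget extension in the classic Jupyter Notebook. Colab generally renders widgets without this step, and newer Jupyter versions may not provide the `nbextension` command.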



In [None]:
!pip install ipywidgets plotly
!jupyter nbextension enable --py widgetsnbextension

Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 30.6 MB/s eta 0:00:00
Installing collected packages: jedi
Successfully installed jedi-0.19.2
Enabling notebook extension jupyter-js-widgets/extension...
Paths used for configuration of notebook: 
    	/root/.jupyter/nbconfig/notebook.json
Paths used for configuration of notebook: 
    	
      - Validating: OK
Paths used for configuration of notebook: 
    	/root/.jupyter/nbconfig/notebook.json


## Interactive SVM Decision Boundary and the Hyperparameter `C`

This code builds an interactive visualization to show how an SVM decision boundary responds to different values of the hyperparameter `C`.

### Imports:
Imports libraries for data handling (`numpy`), plotting (`matplotlib`), SVM modeling (`SVC`), and interactive controls (`ipywidgets`, `interact`, `FloatLogSlider`).

### Step 1: Generate Synthetic Data
- Creates 100 random 2-D points, splits them evenly into two classes, and shifts each class by a different offset so that the two clusters overlap.
- Adds Gaussian noise to increase the overlap and mimic the imperfect class separation of real-world data.
-----
### Step 2: Define the SVM Plotting Function
- A function `plot_svm_decision_boundary(C)` trains an SVM with the given `C` using an RBF kernel.
- It builds a meshgrid and uses `matplotlib` to plot decision boundaries and data points.
--------
### Step 3: Add Interactive Control
- A `FloatLogSlider` allows dynamic adjustment of `C` from 0.01 to 1000.
- The `interact()` function links the slider to the plotting function, updating the boundary in real time.
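
For reference, `C` enters the standard soft-margin SVM objective as the weight on the slack terms. This is the general textbook form, with $\phi$ the feature map implied by the kernel and the labels coded as $y_i \in \{-1, +1\}$. It is not specific to this notebook:

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,\bigl(w^\top\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\qquad \xi_i\ge 0 .
$$

A small `C` makes slack cheap, so the optimizer prefers a wide margin and tolerates misclassified training points. A large `C` makes slack expensive, so the boundary bends to classify more training points correctly.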



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from ipywidgets import interact, FloatLogSlider
from IPython.display import display

# 1. Generate overlapping synthetic data
np.random.seed(42) # for reproducibility
n_samples = 100
X = np.random.rand(n_samples, 2) * 5 # uniform points in the square [0, 5) x [0, 5)

# Shift each half by a different offset to form two nearby, overlapping clusters
X[:n_samples//2, :] += [1, 1]  # first half (class 0)
X[n_samples//2:, :] += [3, 3]  # second half (class 1)

y = np.array([0] * (n_samples//2) + [1] * (n_samples//2))

# Add Gaussian noise (standard deviation 1.0) to increase the overlap
X += np.random.randn(n_samples, 2) * 1.0

# 2. Define a function to train and plot SVM with varying C
def plot_svm_decision_boundary(C):
    # Train the SVM model with an RBF kernel
    # Smaller C tolerates more misclassified points (softer, wider margin);
    # larger C penalizes misclassifications more heavily
    model = SVC(kernel='rbf', C=C)
    model.fit(X, y)

    # Create a meshgrid to plot the decision boundary
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the results
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.RdPu) # Using RdPu colormap
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdPu, edgecolors='k') # Using RdPu colormap for points
    plt.title(f'SVM Decision Boundary (C={C:.2f})')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.show()

# 3. Use ipywidgets to create an interactive slider
# FloatLogSlider is good for C as it often spans orders of magnitude
c_slider = FloatLogSlider(
    value=1.0,
    min=-2, # equivalent to 10^-2 = 0.01
    max=3,  # equivalent to 10^3 = 1000 (increased max to see more effect of high C)
    step=0.1,
    description='C:',
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='.2f',
)

interact(plot_svm_decision_boundary, C=c_slider);

interactive(children=(FloatLogSlider(value=1.0, description='C:', max=3.0, min=-2.0, readout_format='.2f'), Ou…
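
The slider output cannot be shown in a static export, so the effect of `C` can also be summarized numerically. The sketch below refits the same kind of RBF-kernel SVM for several values of `C` and records the number of support vectors and the training accuracy. The names `X_demo`, `y_demo`, and `results` are assumptions for illustration. The data recipe mirrors the cell above but uses a local random generator.

```python
import numpy as np
from sklearn.svm import SVC

# Rebuild synthetic data with the same recipe as above, using a local generator
rng = np.random.RandomState(42)
n_samples = 100
X_demo = rng.rand(n_samples, 2) * 5
X_demo[:n_samples // 2] += [1, 1]
X_demo[n_samples // 2:] += [3, 3]
X_demo += rng.randn(n_samples, 2)
y_demo = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

results = []
for C in [0.01, 0.1, 1, 10, 100, 1000]:
    model = SVC(kernel='rbf', C=C).fit(X_demo, y_demo)
    n_sv = int(model.n_support_.sum())        # support vectors across both classes
    train_acc = model.score(X_demo, y_demo)   # accuracy on the training data only
    results.append((C, n_sv, train_acc))
    print(f"C={C:>7}: support vectors={n_sv:3d}, training accuracy={train_acc:.2f}")
```

As `C` grows, training accuracy often rises and the number of support vectors often falls, but the printed values should be checked rather than assumed. Training accuracy alone also says nothing about performance on new data, which would require a held-out set.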