# Introduction

# What is Machine Learning

Simply put, machine learning (ML) is the process of training computers to discover patterns and relationships among a set of variables, with the goal of generalizing that knowledge to unseen examples. The term was coined by IBM's Arthur Samuel in his work on computers playing checkers. He stated that machine learning is a “field of study that gives computers the ability to learn without being explicitly programmed.” Thus ML is a specific kind of artificial intelligence (AI). The foundations of the methods used are rooted in statistical and mathematical theory, but it is the accessibility and rapid increase of computing power that have driven the growth of the field. The greatest accelerant of machine learning's rise to ubiquity in the modern world is the tech industry, which has reshaped every aspect of life, from the nature of human social interaction to how we monitor our health to the entertainment we are exposed to.

In the modules that follow, we use the UPAD (Understand, Prepare, Analyze, Deploy) paradigm to get us from raw higher ed data to machine learning results that can impact policy and procedures in the university setting.

**FIGURE 1:** The UPAD model of the machine learning cycle

<img src="../public/UPAD_2-2.png" width="300" height="300">


# The Machine Learning Cycle

The machine learning cycle is a description of the steps that need to be taken to apply machine learning models to solve complex contextualized scenarios, and subsequently interrogate the implications of the solutions to the data generating process. Informed with this new feedback, the analyst may restart the cycle with deeper insight that can be used to refine the model and improve predictions.

## Understand

Machine learning in higher ed is not merely an academic curiosity. There are significant challenges that institutions and the individuals who comprise them have faced for decades. As one of the most important tools driving success in the business sector, machine learning holds great promise for implementation in the higher ed domain. But the existence of a solution by no means implies its appropriateness or utility in any given context, and extra care must be taken when handling potentially sensitive student records. In the article *Improving Student Retention in Institutions of Higher Education through Machine Learning: A Sustainable Approach*, the authors caution: "The collection and use of student data raises questions about data security and informed consent. Ensuring the impartiality of algorithms and equity in interventions is essential for a responsible and ethical approach." Determining the applicability and appropriateness of machine learning tools necessitates a clear understanding of the challenges and opportunities facing the institution of higher education. Building this understanding is the first step of the machine learning cycle.

#### Higher Ed Challenge/Opportunity

What is the single most difficult/pressing challenge facing your unit/division/university? Is it accreditation or a pending external review? Is it a threatened strike by faculty and/or staff due to noncompetitive wages? Are your retention and graduation rates low? Take some time to think about it. On the other end of things maybe the systems in place at your institution are all flowing well and you're just seeking a competitive edge to make a good thing great. In any case, the tools of machine learning have the potential to add rich insights to your process.

To initiate the machine learning cycle for higher ed data, a few key facts must be laid out:

1. **Target Population:** Who are the individual observation units relevant to the issue you are trying to understand?

  EXAMPLES: Students; Faculty; Departments; Administrators; etc


2. **Explanatory Features:** What characteristics of the population are available for you to mine for exploratory or predictive purposes?

  EXAMPLES: Demographic data; Academic data prior to application/admission; Data generated while at institution

If your goal is simply to characterize and organize the population according to these features, you are embarking on an **unsupervised learning** task.


3. **Target Variable:** What is the information you are seeking to learn or predict about this population?

  EXAMPLES: Retained in semester 3? Yes or No; Graduated in 4 years? Yes or No; GPA in semester 3

If your goal is to use the explanatory features to learn their relationship to the target from historical data, and use that insight to predict the target for unseen examples, you are embarking on a **supervised learning** task.

Whether the target variable is *continuous* or *categorical* will have implications for how we proceed.

A **continuous variable** measures a numerical quantity that in theory can take on any value in an interval of real numbers, or can reasonably be thought of as doing so. Please note this definition is a tad more general than the definition from classical statistics.

  EXAMPLES: Student GPA in semester 3; University Admission yield

There are some cases in which variables that don't fit this description can be treated as continuous, such as FTES or headcount.

A **categorical variable** names a quality of interest possessed, or a category that each observation falls under.

  EXAMPLE: Institutional characteristics: public or private, 2-year or 4-year

4. **Research Question:** An actionable research question concerning the target population frames a pressing higher ed challenge or opportunity as an unsupervised or supervised machine learning task related to available variables. Machine Learning in higher ed is always practical and driven by real challenges with real implications.
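The four facts above can be written down in code before any modeling begins. The sketch below is only an illustration on made-up data: the column names and the helper `describe_task` are hypothetical, and the rule "numeric with more than two distinct values means continuous" is a rough assumption rather than a formal test.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data: one row per student (the target population)
students = pd.DataFrame({
    "HS_GPA": [3.9, 3.1, 2.8, 3.5],                 # explanatory feature
    "FIRST_GEN": ["Yes", "No", "Yes", "No"],        # explanatory feature
    "RETAINED_SEM_3": ["Yes", "Yes", "No", "Yes"],  # categorical target
    "GPA_SEM_3": [3.7, 3.0, 1.9, 3.2],              # continuous target
})

def describe_task(df, target=None):
    """Name the learning task implied by the choice of target variable."""
    if target is None:
        # No target: we only characterize and organize the population
        return "unsupervised learning"
    # Rough assumption: numeric targets with more than two values are continuous
    if is_numeric_dtype(df[target]) and df[target].nunique() > 2:
        return "supervised learning: regression"
    return "supervised learning: classification"

print(describe_task(students))
print(describe_task(students, "RETAINED_SEM_3"))
print(describe_task(students, "GPA_SEM_3"))
```

In practice the decision is made by the analyst, not by a data type check; a headcount stored as an integer, for example, may still be reasonable to treat as continuous.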


## Prepare

Preparing the data for downstream analysis is a process that involves critical steps. It is a central aspect that makes data science truly a science about data, whereas it is often seen as a peripheral necessity in classical statistics. The preparation process involves data ingestion and curation, wrangling, splitting, exploratory data analysis, and feature engineering.

#### Data Ingestion and Curation

Once the research question is formulated, it is imperative to identify data sources that can provide the variables we need to derive the insight we seek. Oftentimes this data comes directly from our institution, but there are instances where we may need to request data from another institution or from local, state or federal governmental agencies. Sometimes the data may be freely accessible on the web (such as secondary school performance data or IPEDS data). At this stage, our focus is identifying, extracting, reading and merging data from multiple sources to curate a custom data set with information on our target population. This curation looks at the data from the 30,000 foot view, and addresses the big picture to ensure everything is set up to maximize our ability to answer the research question.
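As a small illustration of merging sources, the sketch below joins two made-up extracts with pandas. The tables, the `STUDENT_ID` key and all values are hypothetical; real sources will rarely line up this neatly.

```python
import pandas as pd

# Hypothetical extract from an admissions system
admissions = pd.DataFrame({
    "STUDENT_ID": [101, 102, 103],
    "HS_GPA": [3.8, 3.2, 3.5],
})

# Hypothetical extract from the registrar
registrar = pd.DataFrame({
    "STUDENT_ID": [101, 102, 104],
    "GPA_1": [3.6, 2.9, 3.1],
})

# A left join keeps every admitted student; indicator=True adds a _merge
# column showing which rows found a match in the registrar extract
curated = admissions.merge(registrar, on="STUDENT_ID", how="left", indicator=True)
print(curated)
```

Inspecting the `_merge` column is a quick way to spot students who appear in one source but not the other before any analysis begins.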

#### Data Wrangling

Unfortunately, the only place you will find data that is truly ready to be analyzed out-of-the-box is in a textbook. In "the wild", data are untidy, unwieldy, and oftentimes uncooperative, like an animal we are trying to tame for domesticated use. The process of taming the animal is not overnight; it takes time, patience and persistence. But for those that persevere, the rewards are great. The same is true concerning data. To get our data machine learning ready, there are quite a number of considerations to address. The Oxford Dictionary defines "wrangle" as *"round up, herd, or take charge of (livestock)"*, and that is an apt description of the multiple steps we'll need to take to "take charge" of our data so that it is cleaned up and ready for machine learning algorithms to learn from.  
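To make "taking charge" concrete, here is a minimal sketch of three common wrangling steps on a made-up extract: removing duplicate rows, standardizing text labels, and converting text to numbers. The data and column names are hypothetical, and real data will need more steps than these.

```python
import pandas as pd

# Hypothetical raw extract with typical problems
raw = pd.DataFrame({
    "STUDENT_ID": [101, 102, 102, 103],
    "COHORT": [" Fall 2022", "fall 2022", "fall 2022", "Fall 2022 "],
    "HS_GPA": ["3.8", "3.2", "3.2", "N/A"],
})

clean = (
    raw.drop_duplicates()  # remove the repeated row for student 102
       .assign(
           # standardize whitespace and capitalization in the cohort label
           COHORT=lambda d: d["COHORT"].str.strip().str.title(),
           # convert text to numbers; unparseable entries become NaN
           HS_GPA=lambda d: pd.to_numeric(d["HS_GPA"], errors="coerce"),
       )
       .reset_index(drop=True)
)

print(clean)
print(clean.isna().sum())  # count of missing values per column
```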

#### Data Splitting

**The Vertical Split: Label from Features**

Scikit-learn requires us to separate our target variable from the explanatory variables. This makes a lot of sense, because the downstream preprocessing steps will be very different for each.

**FIGURE 2:** Separating label from features

<img src="../public/Xy_pic_2-2.png" width="600" height="400">
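The vertical split can be sketched in two lines of pandas. The toy DataFrame below is made up for illustration; it borrows column names from the course data, and the status value `N` is a hypothetical stand-in for a student who was not retained.

```python
import pandas as pd

# Made-up data for illustration only
df_demo = pd.DataFrame({
    "HS_GPA": [3.9, 3.1, 2.8, 3.5],
    "GPA_1": [3.7, 3.0, 1.9, 3.2],
    "SEM_3_STATUS": ["E", "E", "N", "E"],
})

target = "SEM_3_STATUS"
X = df_demo.drop(columns=[target])  # features: everything except the label
y = df_demo[target]                 # label: the single target column

print(X.shape, y.shape)
```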

**The Horizontal Split: Toward Generalization**

What comes to mind when you think of "learning"? The word most likely takes you back to your time as a student in K-12 or college/university. If you were a halfway responsible student, your study process before a final exam may have looked something like this:

Train your brain on the data: Build a knowledgebase by studying lecture notes, course readings/assignments, quizzes, exams, etc with the goal of trying to learn the underlying principles, frameworks and problem-solving strategies that undergird the course contents and lecturer's approach.

Validate your knowledge: A couple of days before the final, simulate the actual exam by taking the allotted two-hour timeframe to take a first look at the practice exam and attempt it. If this doesn't go well, you can always go back and refine your training process, and reattempt the practice exam.

Test day: You've never seen the test before, so this will be a true indication of your understanding of course concepts. You expect your performance to be slightly better than your initial performance on the practice exam.

This example does a decent job of explaining some of the practical considerations of how machine learning works. As mentioned before, our goal is to build models that learn from existing data (build set) in a way that generalizes to unseen data (validation and test sets). In many applications, we will not have the luxury of collecting data for an independent test set; instead we take our existing DataFrame and simulate the build, validate and test sets by splitting our full data into those three parts, as shown in the diagram below.

**FIGURE 3:** The build, validate and test sets: an example of a 60-20-20 split.

<img src="../public/Build_Val_Test_2-2.png" width="800" height="600">
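One way to obtain a 60-20-20 split is to call scikit-learn's `train_test_split` twice: first hold out 20% for testing, then hold out a quarter of the remaining 80% for validation. The sketch below uses placeholder arrays; the `stratify` and `random_state` settings are illustrative choices, not requirements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 examples, 2 features, balanced binary label
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Step 1: hold out 20% of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Step 2: hold out 25% of the remaining 80% (20% overall) for validation
X_build, X_val, y_build, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_build), len(X_val), len(X_test))
```

Stratifying on the label keeps the class proportions similar across the three sets, which matters when one class (for example, students not retained) is rare.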

**Generalizing and Algorithmic Bias**

Splitting data is a critical step to avoid overfitting and measure our model's ability to generalize to unseen data. That being said, there is another sinister foe we must be vigilant against. In our analogy, it is important to remember that a student could move past memorization, truly understand course content, excel on all exams, and still not have true, generalizable knowledge of the content. While most instructors endeavor to teach the topics objectively and comprehensively, we all have implicit biases (and explicit ones too) that influence the nuances of our course presentation. Thus any evaluation is in truth an assessment of how well the student has learned the instructor's representation of the subject matter. In a similar way, our ML algorithms can only learn from the data they can see. If they are provided with biased data, they will draw biased conclusions and generalizations. Thus it is always important to interrogate the processes generating the data, so that modifications can be made toward eliminating bias, or conclusions are clearly represented as relevant only to the subpopulation reflected in the training data.

#### Exploratory Data Analysis

The strength of supervised statistical and machine learning models is directly connected to the nature of the underlying relationship between the features and our target variable. Indeed, our assumption is that such a relationship genuinely exists, and our main task is to approximate it as accurately as possible. Thus it will be useful to use the build set to generate numerical summaries, tables and graphs to explore and identify possible clues to the *nature* of the relationship between our features and label. Here are a few possibilities based on the types of variables:

**Quantitative Explanatory and Qualitative Response**

- Side by side boxplots

- Comparative histograms


**Qualitative Explanatory and Qualitative Response**

- Two way tables

- Side by side bar graphs

- Stacked bar graphs
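The graphs listed above each have a numerical counterpart that is quick to compute on the build set. The sketch below uses made-up data with hypothetical columns: a grouped summary stands in for side by side boxplots, and a row-normalized two way table stands in for stacked bar graphs.

```python
import pandas as pd

# Made-up build set for illustration
build = pd.DataFrame({
    "GPA_1": [3.8, 3.5, 2.1, 3.0, 1.8, 3.6],
    "FIRST_GEN": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "SEM_3_STATUS": ["E", "E", "N", "E", "N", "E"],
})

# Quantitative explanatory, qualitative response:
# summary statistics of GPA_1 within each status group
gpa_by_status = build.groupby("SEM_3_STATUS")["GPA_1"].describe()
print(gpa_by_status)

# Qualitative explanatory, qualitative response:
# two way table showing the share of each status within each FIRST_GEN group
two_way = pd.crosstab(build["FIRST_GEN"], build["SEM_3_STATUS"], normalize="index")
print(two_way)
```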


#### Feature Engineering

To build a model with maximum predictive ability, it is necessary to transform existing variables so that they are in a format scikit-learn models can use effectively. There are three closely intertwined considerations:

**Domain Distinctives** - Are some of the variables in our data inherently devoid of predictive value? Are there new variables we can create that are likely to be predictive?

**Statistical Subtleties** - Are there further transformations that need to take place on existing features to minimize the possibility of poor model performance?

**Coding Customization** - Is the data presented in a format tailored to scikit-learn's predictive modeling infrastructure to readily move through prescribed train-validate-test workflows?

## Analyze

Up to this point, it could be argued that we have not engaged in machine learning proper. We've done some statistics and data science as prelude, but the heart of machine learning is engaged in the *Analyze* stage.

#### Model Specification

One of the best things about machine learning is that it builds on decades of statistics and computer science research, and an equal amount of time in implementation and improvement in a number of practical contexts in industry. Therefore there are myriad models to choose from, with most having some characteristic that makes them preferable to others in specified contexts. It's often said that there is "no free lunch" in modeling: no single ML model will outperform every other model across learning tasks. In this certificate program, you will be exposed to some of the top models used for classification, including logistic regression, elastic net, k nearest neighbors, support vector machine and random forest. We'll give gentle introductions into each of these as we progress through the modules.

#### Hyperparameter Tuning and Model Selection

Every ML model has characteristic numerical values that dictate how it makes predictions. For linear and logistic regression, these numerical characteristics can be directly estimated from the data we collect. These parameters are *what* these algorithms learn, and their statistical properties enable us to make the "best" predictions under the strict modeling assumptions that drive their derivation. In contrast, a model like k nearest neighbors operates with hyperparameters that dictate *how* the algorithm learns. In our course analogy, this is a choice of study strategy: as we review examples covered by the teacher during class, are we going to focus on memorizing the steps she took line by line, or are we going to endeavor to read between the lines to discover the rationale behind the steps? The build and validation sets are used iteratively to optimize the learning strategy.
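A bare-bones version of this iteration is sketched below for k nearest neighbors: fit on the build set with several candidate values of the hyperparameter `n_neighbors`, score each on the validation set, and keep the best. The synthetic data and the candidate values are arbitrary assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for prepared build and validation data
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_build, X_val, y_build, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

val_scores = {}
for k in [1, 5, 15, 45]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_build, y_build)                # learn from the build set
    val_scores[k] = knn.score(X_val, y_val)  # accuracy on held-out data

# Keep the candidate with the highest validation accuracy
best_k = max(val_scores, key=val_scores.get)
print(val_scores)
print("best k:", best_k)
```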

#### Bias Variance Tradeoff

In our course taking analogy to machine learning, suppose there are 20 possible topics that could show up on the first exam, which is happening in 5 days. Consider two study strategies.

Strategy #1: Randomly select 10 topics and study all lecture notes, homework, etc that relate to these topics. Study 2 topics per day. It is reasonable to assume you could get to a point where you know those topics so well that you can answer any question about them posed during lecture or in previous homework assignments. What are some deficiencies in this study strategy?

Strategy #2: You attempt to learn all 20 topics. Spend your time studying 4 topics from the class per day and study them as best as you can, emphasizing general themes and broad understanding. What are some deficiencies in this study strategy?

Every machine learning algorithm has the ability to learn the data used to build it so well that the number and/or size of errors made is relatively small.  If we imagine obtaining new data, and attempt to learn from it using the same strategy, that would lead to a highly variable, unstable learning process that does a consistent job learning what it studied. This is a high variance, low bias learning strategy, corresponding to strategy #1 above. Performing well on test day is not likely if you are tested on deep understanding of enough course topics you didn't study.

Conversely, ML algorithms can be calibrated to learn the "loud signals" in the data and deemphasize the "noise". This would likely lead to larger errors on the build set. However, it would lead to consistency on new data, since the overall structural patterns in the sample data are governed by the process that generates it. On the flip side, there is more separation between what is learned and the details of the individual topics. This is a low variance, high bias learning strategy, corresponding to strategy #2 above. Performing well on test day is not likely if you are tested on deep understanding of the course topics, as you opted to keep it shallow.

Strategy #3: As an attentive student, you have been identifying topics that were emphasized by the professor during lecture and weigh your study times accordingly giving more depth to emphasized topics. Moreover, you move beyond studying lists of facts and seek to understand how and why those facts were applied in various case studies.

This third strategy balances the first two, emphasizes understanding over memorization, and is more likely to lead to results on exam day. That being said, a cost is paid on the front end to construct a better learning approach, but it will pay dividends.

This section illustrates the bias-variance tradeoff that is a critical component of the model fitting process in machine learning.



**Table 1:** Impact of number of features in multiple regression models on bias and variance:

| #features too Large - Overfitting | Optimal #features | #features too Small - Underfitting |
|---|---|---|
|Low Bias, High Variance | Low Bias, Low Variance | High Bias, Low Variance |
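The pattern in Table 1 can be explored with a small synthetic experiment. The sketch below is an assumption-laden illustration, not part of the course data: it fits regressions with a growing number of polynomial features to noisy points from a sine curve and records the error on a build set and a validation set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a smooth signal plus random noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=60)
X = x.reshape(-1, 1)

# Simple build/validation split of the synthetic points
X_build, y_build = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

errors = {}
for degree in [1, 3, 12]:
    # A higher degree means more features in the regression
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_build, y_build)
    errors[degree] = (
        mean_squared_error(y_build, model.predict(X_build)),  # build error
        mean_squared_error(y_val, model.predict(X_val)),      # validation error
    )
    print(degree, errors[degree])
```

Build error can only go down as features are added, because each larger model contains the smaller ones. Validation error typically falls and then rises again once the model starts fitting noise, which is the signature of the tradeoff.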


In the following code, we use a sample of the GPA DataFrame to visualize the bias-variance tradeoff.

In [None]:
#Libraries for importing and visualizing data
import pandas as pd
import numpy as np
import plotly.express as px
from ipywidgets import interact, FloatSlider

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Let's import and view the retention DataFrame:

In [None]:
#Let's ensure that we can view all columns of the dataframe, along with a head and tail look at the data
pd.set_option('display.max_columns',None)
df = pd.read_csv('/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/data/student_academics_data.csv')
df.head()
print(df.shape)

(25263, 22)


Select the Fall 2022 cohort:

In [None]:
df = df[df['COHORT']=='Fall 2022']

Select a random sample of 150 examples, reduce the features to selected academic performance indicators, and drop rows with missing values.

In [None]:
df_acad_perf = df[['HS_GPA','GPA_1','GPA_2','DFW_UNITS_1','DFW_UNITS_2','SEM_3_STATUS']].sample(n=150,random_state=5).dropna()
df_acad_perf.head()

| | HS_GPA | GPA_1 | GPA_2 | DFW_UNITS_1 | DFW_UNITS_2 | SEM_3_STATUS |
|---|---|---|---|---|---|---|
| 15376 | 4.0 | 3.8 | 3.692308 | 0.0 | 0.0 | E |
| 18773 | 4.031 | 3.538462 | 3.692308 | 0.0 | 0.0 | E |
| 7273 | 3.892 | 3.4 | 3.0 | 0.0 | 0.0 | E |
| 7873 | 3.593 | 3.125 | 4.0 | 0.0 | 0.0 | E |
| 9802 | 3.393 | 3.2 | 3.4375 | 0.0 | 0.0 | E |


Let's get a visual on how the points are dispersed, and how the *feature space* is an important consideration when identifying how to classify observations.

In [None]:
px.scatter(data_frame=df_acad_perf, x='HS_GPA', y='GPA_1',color='SEM_3_STATUS',  size_max=100)

Notice the location of the "Not Retained" students (red points) above. How would you describe the values of HS_GPA and GPA_1 that characterize the neighborhood where these points live? In contrast, suppose we look at the same points through the lens of DFW_UNITS_2 and GPA_2, as below. There seems to be a better differentiation of the red points. This indicates how important it is to include the right features in strategic ways to enhance performance of our classifier.

In [None]:
px.scatter(data_frame=df_acad_perf, x='DFW_UNITS_2', y='GPA_2',color='SEM_3_STATUS', size_max=100)

So we can clearly see the importance of including features that assist our classifier in discriminating between classes. But how do we draw the "lines" between the points to create a classification rule that

1. Accurately distinguishes between students retained and those not in our training data, and
2. Generalizes well to unseen test data?

That is the role of our machine learning algorithm, with its mathematical complexities and customizable hyperparameters. Keep it locked; we'll learn a lot about these over the remainder of the certificate.

For visualization purposes, let's take a sneak peek at how one of our classification methods, the support vector machine, can be used to create a family of classifiers based on the value of the regularization hyperparameter C. We can use gen AI to create a dynamic plot with a prompt such as: `include code that uses ipywidgets to create a slider that allows user to interactively change the kernel window parameter C and dynamically displays the resulting plot`:

In [None]:
!pip install ipywidgets plotly
!jupyter nbextension enable --py widgetsnbextension



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from ipywidgets import interact, FloatLogSlider
from IPython.display import display

# 1. Generate more mixed synthetic data
np.random.seed(42) # for reproducibility
n_samples = 100
X = np.random.rand(n_samples, 2) * 5 # Keep the overall range

# Create two clusters that are closer and more overlapping
X[:n_samples//2, :] += [1, 1]  # Shift first cluster less
X[n_samples//2:, :] += [3, 3]  # Shift second cluster less

y = np.array([0] * (n_samples//2) + [1] * (n_samples//2))

# Add more noise to increase overlap
X += np.random.randn(n_samples, 2) * 1.0 # Increased scale of noise

# 2. Define a function to train and plot SVM with varying C
def plot_svm_decision_boundary(C):
    # Train the SVM model with RBF kernel
    # Smaller C allows more margin, larger C penalizes misclassifications more
    model = SVC(kernel='rbf', C=C)
    model.fit(X, y)

    # Create a meshgrid to plot the decision boundary
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the results
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.RdPu) # Using RdPu colormap
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdPu, edgecolors='k') # Using RdPu colormap for points
    plt.title(f'SVM Decision Boundary (C={C:.2f})')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.show()

# 3. Use ipywidgets to create an interactive slider
# FloatLogSlider is good for C as it often spans orders of magnitude
c_slider = FloatLogSlider(
    value=1.0,
    min=-2, # equivalent to 10^-2 = 0.01
    max=3,  # equivalent to 10^3 = 1000 (increased max to see more effect of high C)
    step=0.1,
    description='C:',
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='.2f',
)

interact(plot_svm_decision_boundary, C=c_slider);

*(Interactive output: the SVM decision boundary plot with a slider for C.)*

**How to observe overfitting and underfitting with C:**

- **Small C values (e.g., 0.01 to 1)**: When C is small, the SVM is more tolerant of misclassifications (points on the wrong side of the margin or boundary). This leads to a wider margin and a smoother decision boundary. This can sometimes result in underfitting, where the model is too simple and doesn't capture the nuances of the data, even if it's generally separable. The decision boundary might not perfectly separate all points.

- **Large C values (e.g., 10 to 100)**: When C is large, the SVM heavily penalizes misclassifications. It will try hard to classify every training point correctly, even if it means creating a very narrow margin and a potentially complex or jagged decision boundary. On linearly separable (or near-linearly separable) data, a very large C will push the decision boundary to lie closer to the points on the margin. While it might achieve perfect accuracy on the training data, this high sensitivity to individual points can lead to overfitting, where the model performs poorly on unseen data because it has learned the noise or specific positions of the training points rather than the underlying general pattern.

By moving the slider, you can visually see how the decision boundary changes. With low C, you'll see a wider, simpler boundary. With high C, you'll see the boundary trying to tightly fit around the training points.

Thus our objective is to strike a balance between overfitting and underfitting on the build set. We will use the *validation* set, as discussed below, to help us determine the best learning strategy. This process revolves around the *fit-predict-score* cycle.

**Fit.**
Learn from the training data: estimate the parameters of the model using the training data. At this stage, if there are hyperparameters in the model we'll just use sklearn's default values. Analogous to studying material given to us by the professor.

**Predict.**
Based on the parameter estimates obtained in the fit stage, predict the labels for observations in the training set. This is the \\(\hat{y}_{Train}\\) vector. Analogous to taking an exam that is the exact same material we were given to study, verbatim.

**Score.**
Our predicted labels were generated from parameters that minimized the objective function. But there are often other criteria that more readily describe how well our model fits the training data (and test data later). These criteria are referred to as *scores*. In scikit-learn, the default score for classification problems is *accuracy*. We discussed it and others in 1.2 Introduction to Classification in Higher Ed. Analogous to the grade we earned on the exam.
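The three steps map directly onto three scikit-learn calls. The sketch below runs one pass of the cycle with logistic regression on synthetic data; the data and the choice of model are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a prepared training set
X_train, y_train = make_classification(n_samples=200, n_features=4, random_state=1)

# Fit: estimate the model parameters, using default hyperparameter values
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict: generate predicted labels for the training observations
y_hat_train = model.predict(X_train)

# Score: summarize how well the predictions match the true labels
train_accuracy = accuracy_score(y_train, y_hat_train)
print(train_accuracy)
```

For classifiers, `model.score(X_train, y_train)` returns the same accuracy in a single call.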

#### Model Validation and Testing

As mentioned above, the validation set is a held out fraction of the data that can be used to get a feel for how well our model generalizes to unseen data. *k-fold Cross validation* is a powerful way to get multiple reps of the Fit-Predict-Score cycle to strengthen our model's usefulness for unseen data. We can iteratively modify model characteristics and use the validation set to help us approximate generalization performance.
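With scikit-learn, k-fold cross validation can be run in one call to `cross_val_score`, which repeats the fit-predict-score cycle once per fold. The sketch below uses synthetic data and five folds, both of which are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a prepared build set
X_build, y_build = make_classification(n_samples=200, n_features=4, random_state=2)

# Each of the 5 folds takes one turn as the validation set
# while the model is fit on the other 4
scores = cross_val_score(LogisticRegression(), X_build, y_build, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average as an estimate of generalization performance
```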

## Deploy

As seen in the Understand step, higher ed machine learning is driven by real world challenges and opportunities that need data driven solutions. Machine learning provides a way to supply reliable solutions to a wide variety of problems. It is important that we close each iteration of the ML cycle by reporting the results of implementation, and investigating their implications for subsequent iterations through the cycle.

#### Model Implementation for New Examples

This is what it's all about: put-up-or-shut-up time for our models, as they make predictions on new examples. This new data typically does **not** contain the associated label for each row, so we often use domain knowledge and other experience to assess the reasonableness of predictions made.
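In code, this step amounts to calling a fitted model on rows that have features but no label. The sketch below is a hypothetical illustration with synthetic data; in practice the new rows must pass through the same preparation steps as the data used to build the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic historical data with known labels
X_hist, y_hist = make_classification(n_samples=200, n_features=4, random_state=3)
model = LogisticRegression().fit(X_hist, y_hist)

# New examples arrive with features only (made-up values here)
rng = np.random.default_rng(3)
X_new = rng.normal(size=(5, 4))

pred_labels = model.predict(X_new)       # predicted class for each new row
pred_probs = model.predict_proba(X_new)  # estimated probability of each class

print(pred_labels)
print(pred_probs.round(2))
```

Predicted probabilities give reviewers something to sanity-check against domain knowledge, since a bare label hides how confident the model is.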

#### Addressing the Higher Ed Challenge

If your question was well posed, and the machine learning solution genuinely applicable, then you will be able to derive insight on the original context that could illuminate the avenues for action. These interventions can have lasting impact for individuals and institutions.  

#### Tracking Performance to Inform the Next Cycle Iteration

What are the consequences of your findings?

Assuming our trained model is deployed for prediction on incoming data over a period of time, we'll have the opportunity to monitor its performance. Based on some minimum performance threshold representing the maximum risk we're willing to incur, we can identify when it's time to pull the plug and iterate through the cycle one more time.
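A monitoring rule of this kind can be very simple. The sketch below is one hypothetical version: the function name `needs_retraining`, the 0.80 accuracy threshold and the two-period patience are all assumptions chosen for illustration, and a real threshold should reflect the risk your institution is willing to accept.

```python
def needs_retraining(period_accuracies, threshold=0.80, patience=2):
    """Return True once accuracy stays below the threshold
    for `patience` consecutive monitoring periods."""
    below = 0
    for acc in period_accuracies:
        # Count consecutive periods under the threshold; reset on recovery
        below = below + 1 if acc < threshold else 0
        if below >= patience:
            return True
    return False

# A single dip does not trigger a new cycle
print(needs_retraining([0.86, 0.84, 0.79, 0.83]))

# Two consecutive periods below the threshold do
print(needs_retraining([0.86, 0.79, 0.78, 0.81]))
```

Computing these accuracies requires the true outcomes, which usually become available only after some delay (for example, once semester 3 enrollment is known).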

# Conclusion

The machine learning cycle is what takes us from a higher ed challenge to raw data to data driven insights. In this section we discussed the steps and challenges associated with the cycle. We are now ready to embark on the most time and code intensive portion of the cycle - preparing the data.