<a href="https://colab.research.google.com/github/noe2001/Mybetterworld/blob/master/20240621_ml_mentorship.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What MUST you (as an organization and as an individual) have to succeed with AI/ML?

# Types of Machine Learning

## Supervised

Use labeled data to train models to predict outcomes. This is the classic approach, but producing a labeled dataset can take a lot of time.

**Example**: Predicting if a unit will pass inspection.

**Tip**: Sometimes labels can be efficiently generated with hindsight. For example, a product failing inspection can be recorded and then used to build a dataset with values from when the product was being manufactured.
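
As a rough sketch of the tip above, the cell below labels manufacturing measurements after the fact from a later failure log. The table names, column names, and values are all hypothetical, invented only for illustration.

```python
import pandas as pd

# Hypothetical process measurements logged while each unit was being made
measurements = pd.DataFrame({
    "unit_id": [101, 102, 103, 104],
    "temperature": [71.2, 70.8, 75.9, 71.0],
    "pressure": [1.01, 0.99, 1.12, 1.00],
})

# Hypothetical inspection log recorded later; it only lists units that failed
failures = pd.DataFrame({"unit_id": [103]})

# Label with hindsight: a unit counts as failed if it appears in the failure log
labeled = measurements.copy()
labeled["failed"] = labeled["unit_id"].isin(failures["unit_id"]).astype(int)
print(labeled)
```

The result is a labeled dataset whose features come from manufacturing time and whose labels come from inspection time.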

## Unsupervised

Analyze unlabeled data to detect patterns imperceptible to the naked eye. Can be difficult to evaluate in practice.

**Example**: Grouping similar batches using a nearest-neighbours approach on various measurements taken during each batch's creation.
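
A minimal sketch of the example above, assuming scikit-learn's `NearestNeighbors` and a made-up table of batch measurements. It finds the most similar batches for each batch; it does not assign cluster labels.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical data: 6 batches x 3 measurements, drawn around two operating points
batches = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(3, 3)),
    rng.normal(loc=5.0, scale=0.1, size=(3, 3)),
])

# For each batch, look up its 3 nearest batches (the batch itself is one of them)
nn_index = NearestNeighbors(n_neighbors=3).fit(batches)
distances, neighbours = nn_index.kneighbors(batches)
print(neighbours)
```

No labels are involved anywhere: the similarity comes purely from the measurements.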

## Semi-Supervised

A combination of the previous two methods: we augment a small, valuable collection of labeled data with a larger, cheaper set of unlabeled data.

**Example**: Given a small collection of labeled machine statuses (correct vs faulty), augment the data with a large number of unlabeled machine statuses to improve machine status classification.
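
One possible way to try this, sketched with scikit-learn's `LabelSpreading`. The data is synthetic, and the kernel and `gamma` value are assumptions chosen for this toy example rather than recommendations.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Hypothetical stand-in for machine readings: two well-separated groups
X, y_true = make_blobs(
    n_samples=200, centers=[[0, 0], [6, 6]], cluster_std=1.0, random_state=0
)

# Pretend only 5 readings per class were labeled.
# scikit-learn marks unlabeled rows with -1.
y_partial = np.full_like(y_true, -1)
labeled_idx = np.concatenate([np.where(y_true == c)[0][:5] for c in (0, 1)])
y_partial[labeled_idx] = y_true[labeled_idx]

model = LabelSpreading(kernel="rbf", gamma=0.5)
model.fit(X, y_partial)

# transduction_ holds the label inferred for every row, labeled or not
agreement = (model.transduction_ == y_true).mean()
print(f"Agreement with the hidden true labels: {agreement:.2f}")
```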

## A Note: Regression vs. Classification

**Regression** is predicting continuous values as a function of the inputs, whereas **classification** is concerned with predicting a group or "class" that an input belongs to. Many types of ML models can be adapted to either task.
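
To make the distinction concrete, here is a small sketch that fits the same family of model (a decision tree) to both kinds of target. The temperatures, cycle times, and pass/fail labels are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical data: one input (temperature) and two kinds of target
temperature = np.array([[60.0], [65.0], [70.0], [75.0], [80.0], [85.0]])
cycle_time = np.array([12.1, 12.4, 13.0, 13.9, 15.2, 16.8])  # continuous value
passed = np.array([1, 1, 1, 1, 0, 0])                        # class label

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(temperature, cycle_time)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(temperature, passed)

print(reg.predict([[72.0]]))  # regression: a continuous number
print(clf.predict([[72.0]]))  # classification: one of the known classes
```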

## A Note: Reinforcement Learning

A class of ML where an agent learns to act in an environment via trial and error, imitation, or other techniques.

# Making a Model

Machine learning and AI are tools to leverage data and domain knowledge to produce actionable information or automatically act within a system.


## Machine Learning Libraries

- `numpy`: A performant mathematical library written in C.
- `pandas`: A library for creating and manipulating tables of data called dataframes.
- `scikit-learn` (import as `sklearn`): A library with a large array of data science and machine learning tools.
- `torch`, `tensorflow`: Graph computation libraries for neural networks. Can exploit GPUs to train large models efficiently.

In [None]:
# A mathematical library, written in C for speed, that works on N-dimensional arrays
import numpy as np

# A library for creating and manipulating tables of data
import pandas as pd

# A library that implements various AI/ML and data science algorithms
# Usually not imported outright
# from sklearn import ...

# Libraries for neural networks that use computational graphs
# for efficient processing, and can also use the GPU!
# import tensorflow as tf
import torch
torch.cuda.is_available(), torch.cuda.device_count()
# Also mps backend for macs!

We can set random seeds for reproducibility in teaching. Generally, you won't do this in practice.

In [None]:
np.random.seed(20240621)
torch.manual_seed(20240621)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(20240621)

We'll want this later, so we can install it now.

In [None]:
!pip install torchviz

## Identify an opportunity

Seek out areas in your business where you have data that is not being fully leveraged. Find something that would be valuable to know in advance. Understand how and why failures happen in production, even when there's no pattern obvious to the naked eye.

## Gather a dataset

Given a source of data and one or more problems to solve relating to that data,
can we:
- Solve or model those problems using AI/ML?
- Use ML to capture patterns or trends in the data relating to those problems?

We'll just generate some sample data, but this could be connecting to a historian, loading image files, or any other data input.

In [None]:
from sklearn.datasets import make_classification

raw_x, raw_y = make_classification(
    n_samples=500,
    n_features=10, n_redundant=1, n_repeated=1, n_informative=8,
    random_state=20240621
)
feature_names = [f'x{i}' for i in range(raw_x.shape[1])]
df = pd.DataFrame(raw_x, columns=feature_names)
df['y'] = raw_y
df.head()

That was easy - but in practice, this can be one of the most labor-intensive steps. Sometimes, the labeling can take more work than everything else combined.

## Exploratory Data Analysis

This could be a dozen lectures in itself, but we'll sum it up as follows:

Exploratory data analysis is the process of gaining a general understanding of your dataset.
- Data types
  - Categorical: machine states, part types
  - Discrete: counts, levels
  - Continuous: measurements, rates
- Data quality
  - Missing values?
  - Inconsistent formats? (Common for human-entered data!)
- Data properties
  - Correlations
  - Anomalies
  - Outliers
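
A few of the checks listed above can be sketched with plain `pandas`. The small hand-entered log below is hypothetical; its column names and values exist only to show a missing value and inconsistent formats.

```python
import numpy as np
import pandas as pd

# Hypothetical hand-entered log with a missing value and inconsistent labels
log = pd.DataFrame({
    "machine_state": ["Running", "running ", "IDLE", "idle", "Running"],
    "part_count": [10, 12, np.nan, 0, 11],
    "temperature": [71.2, 70.8, 69.5, 68.9, 71.0],
})

print(log.dtypes)        # which columns are numeric, which are text?
print(log.isna().sum())  # missing values per column

# Normalize human-entered categories before counting them
log["machine_state"] = log["machine_state"].str.strip().str.lower()
print(log["machine_state"].value_counts())
```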

In [None]:
df.describe()

It's always useful to know how many of each class we have in our training data.

In [None]:
df["y"].value_counts()

We can examine correlation between pairs of variables easily.

In [None]:
# A powerful but quick to use plotting library
import matplotlib.pyplot as plt
# Seaborn is a matplotlib wrapper that has a lot of handy-dandy features for data science!
import seaborn as sns

# Pairplot can be informative, but it can take a while to run!
# sns.pairplot(df, hue="y")

correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

In [None]:
sns.scatterplot(x="x1", y="x5", hue="y", data=df)

**Tip:** some large language models, like ChatGPT, have features useful for exploratory data analysis!

## Data Transformations and Feature Engineering

Now that we understand our dataset, we can transform it to make it more useful for our models. Often, domain experts will have practical knowledge of useful transformations or derived features!

- Converting continuous variables to discrete or categorical
  - Categorical bins: Maybe we don't need to worry about a machine's precise temperature, and can just split it into low/medium/high/critical ranges
  - Binarization: Maybe a rate can be converted into active/inactive
- More complex derived features can be extracted
  - Temporal features: What day of the week was a given date?
  - Total related features: 4 machines each producing something - what is their total output?
  - Mathematical transformations: Logarithms, roots, polynomials, and so much more!
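
Several of the transformations listed above can be sketched in a few lines of `pandas`. The column names, bin edges, and readings here are made up for illustration; real thresholds should come from your domain experts.

```python
import pandas as pd

# Hypothetical readings; column names and bin edges are invented for this sketch
readings = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-06-17", "2024-06-19", "2024-06-21", "2024-06-22"]
    ),
    "temperature": [55.0, 72.0, 91.0, 118.0],
    "rate": [0.0, 3.5, 4.1, 0.0],
})

# Categorical bins: precise temperature -> low/medium/high/critical
readings["temp_band"] = pd.cut(
    readings["temperature"],
    bins=[0, 60, 85, 110, float("inf")],
    labels=["low", "medium", "high", "critical"],
)

# Binarization: rate -> active/inactive
readings["active"] = readings["rate"] > 0

# Temporal feature: what day of the week was it?
readings["day_of_week"] = readings["timestamp"].dt.day_name()

print(readings)
```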

In [None]:
df_engineered = df.copy()
df_engineered["e1"] = df_engineered[["x1", "x8"]].sum(axis=1)
correlation_matrix = df_engineered.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

## Preparing the data for modeling

For supervised and semi-supervised machine learning, we'll want training and testing data.

In practice, there are lots of ways to handle this - look up **cross validation** to learn about a powerful way to get more bang for your buck from your data.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_engineered.drop("y", axis=1), df_engineered["y"], test_size=0.2, random_state=20240621)
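
For comparison, here is a minimal sketch of the cross validation mentioned above, using scikit-learn's `cross_val_score`. It regenerates the sample data with the same generator settings so the cell stands on its own; the choice of 5 folds and of a shallow decision tree is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same generator settings as the sample data used in this notebook
X, y = make_classification(
    n_samples=500,
    n_features=10, n_redundant=1, n_repeated=1, n_informative=8,
    random_state=20240621,
)

# 5-fold cross validation: every row is used for testing exactly once
model = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```

Each fold gives one accuracy score, so the spread of the scores also hints at how stable the model is.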

### Pitfalls

- Much manufacturing data is **temporal**. We can accidentally "cheat" by putting data from one moment in the training set and the next moment in the test set, when the two moments are almost the same - we've essentially trained using test data!
- Make sure you have diverse scenarios captured in both your training and test sets!
- **Class imbalance** is when one class is rarer than another. Some modeling techniques and metrics aren't resilient to this.
  - If 1% of my products fail QA, a prediction that every part will pass QA is 99% accurate but 0% useful!
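
The class imbalance arithmetic above can be checked directly. This sketch uses made-up QA outcomes (10 failures in 1000 parts) and a "model" that always predicts a pass; recall and balanced accuracy are shown as two examples of metrics that expose the problem.

```python
import numpy as np
from sklearn import metrics

# Hypothetical QA outcomes: 1 = failed QA, 0 = passed; 10 failures in 1000 parts
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that predicts every part will pass
y_always_pass = np.zeros(1000, dtype=int)

accuracy = metrics.accuracy_score(y_true, y_always_pass)
failure_recall = metrics.recall_score(y_true, y_always_pass)
balanced = metrics.balanced_accuracy_score(y_true, y_always_pass)

print(f"accuracy={accuracy:.2f}, recall on failures={failure_recall:.2f}, "
      f"balanced accuracy={balanced:.2f}")
```

Accuracy looks excellent, while recall on the failing class shows that not a single failure was caught.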

## First contact: basic models

When possible, I like to start with some basic `scikit-learn` models - they are surprisingly powerful, and sufficient for many tasks! It's tempting to use deep learning or neural networks for everything, but that is often impractical and less effective (large, super-powerful models require a lot of data to train effectively).

In [None]:
from sklearn.tree import DecisionTreeClassifier

# We limit max depth to prevent the tree from simply memorizing the training data!
tree1 = DecisionTreeClassifier(max_depth=3, random_state=20240621)
tree1.fit(X_train, y_train)

Visualize the tree we learned. Trees are explainable models - what kind of value could be gained by analyzing an explainable model?

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize=(10, 10))
_ = plot_tree(tree1, feature_names=X_train.columns)

Use the tree to predict labels for our testing data.

In [None]:
y_pred = tree1.predict(X_test)
y_pred

### Model Evaluation

There are [a lot](https://scikit-learn.org/stable/modules/model_evaluation.html) of possible ways to evaluate a model's performance. Some tasks may require very specialized ways to score.

Here, we have balanced classes for a basic binary prediction task, so our usual suspects should suffice.

In [None]:
from sklearn import metrics

print(metrics.classification_report(y_test, y_pred))

The confusion matrix is valuable to know what *kind* of errors our model is making.

In [None]:
metrics.ConfusionMatrixDisplay(metrics.confusion_matrix(y_test, y_pred), display_labels=tree1.classes_).plot()

Not great, but that's okay - it's just a first experiment!

### Other Models

We can try other models (scikit-learn has [a lot of supervised learning techniques](https://scikit-learn.org/stable/supervised_learning.html) available), or we could perform a [hyperparameter search](https://scikit-learn.org/stable/modules/grid_search.html). (Or both! Experimental mindset!)

In [None]:
# SVMs are a versatile class of supervised models that can be used for
# classification and regression!
from sklearn.svm import SVC

svm1 = SVC(random_state=20240621)
svm1.fit(X_train, y_train)

Let's run our metrics again.

In [None]:
y_pred = svm1.predict(X_test)
print(metrics.classification_report(y_test, y_pred))

In [None]:
metrics.ConfusionMatrixDisplay(metrics.confusion_matrix(y_test, y_pred), display_labels=svm1.classes_).plot()

Better!
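
The hyperparameter search mentioned above can be sketched with scikit-learn's `GridSearchCV`. The cell regenerates the sample data so it runs on its own, and the grid values are illustrative guesses rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Same generator settings and split as the sample data used in this notebook
X, y = make_classification(
    n_samples=500,
    n_features=10, n_redundant=1, n_repeated=1, n_informative=8,
    random_state=20240621,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=20240621
)

# Try every combination in the grid, scoring each with 5-fold cross validation
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the best model on held-out data
```

The search only ever sees the training split, so the final score on the held-out split remains an honest estimate.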

## Neural Networks

Neural networks are a class of model that, at their simplest, consist of a graph of nodes arranged into layers, with weighted edges between layers whose weights are learnable. Each node's value is multiplied by the weights of its outgoing edges and propagated through the network to the output. Of course, in practice, there are many more details and modifications, but those are beyond the scope of this lecture.

Often, neural networks are overkill, but in **computer vision** and **natural language**, they are consistently the most effective tool in the ML scientist's toolkit.

Neural networks tend to require a lot of data, and can be efficiently trained on GPUs. We'll just examine a small, toy neural network that we can train quickly.
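
To make the "weighted edges between layers" description concrete, here is a rough `numpy` sketch of a single forward pass through a tiny network. The layer sizes, random weights, bias terms, and activation functions are assumptions picked for illustration, and no learning happens here.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.5, -1.2, 3.0])  # values of the 3 input nodes

# One weight per edge: 3 inputs -> 4 hidden nodes -> 1 output node
W1 = rng.normal(size=(3, 4))
b1 = np.zeros(4)                # biases: one of the "many more details"
W2 = rng.normal(size=(4, 1))
b2 = np.zeros(1)

# Each layer: weighted sum of the incoming values, then an activation function
hidden = np.maximum(0, x @ W1 + b1)             # ReLU activation
output = 1 / (1 + np.exp(-(hidden @ W2 + b2)))  # sigmoid squashes to (0, 1)

print(hidden)
print(output)
```

Training consists of nudging `W1`, `b1`, `W2`, and `b2` so that `output` moves closer to the desired label.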

### Rescaling

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

from sklearn.preprocessing import StandardScaler

# Neural networks are sensitive to the scale of their inputs
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Data Handling for Neural Network Libraries

Torch uses `Dataset`s and `DataLoader`s to manage inputs to its models. There are many different kinds of each - read the documentation to find the best fit for your application.

In [None]:
# A tensor is just a multi-dimensional array
# A generalization of scalars, vectors, and matrices
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)

y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32)

# A dataset is used to load and preprocess data.
train_dataset = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = torch.utils.data.TensorDataset(X_test_tensor, y_test_tensor)

# The dataloaders handle iteration, batching, shuffling, and parallelization
batch_size = 4
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

X_train_tensor

We define our own structure for a neural network. In torch, we'll typically do this by extending `nn.Module`.

In [None]:
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(X_train_tensor.shape[1], 10),
            nn.ReLU(), # A ReLU is a type of activation function
            nn.Linear(10, 10),
            nn.ReLU(),
            nn.Linear(10, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        # The final Sigmoid layer means these are probabilities, not raw logits
        probs = self.linear_relu_stack(x)
        return probs


We can visualize how the model is structured with the torchviz library we installed earlier.

In [None]:
from torchviz import make_dot

dummy_nn = SimpleNN()
make_dot(dummy_nn(torch.randn(1, X_train_tensor.shape[1])), params=dict(list(dummy_nn.named_parameters())))

In [None]:
torch.manual_seed(271828)

nn1 = SimpleNN()
criterion = nn.BCELoss()
optimizer = optim.Adam(nn1.parameters(), lr=0.001)

num_epochs = 100
interval = 5
losses, val_losses = [], []
for epoch in range(1, num_epochs+1):
    nn1.train()
    total_loss, n_batches = 0, 0
    for X_batch, y_batch in train_loader:
        y_batch = y_batch.unsqueeze(1)

        # make a prediction for this batch
        out = nn1(X_batch)
        # find out how wrong we are
        loss = criterion(out, y_batch)

        # resets the gradients
        optimizer.zero_grad()
        # computes the new gradients for this pass by working backwards
        loss.backward()
        # updates the weights with the new gradients
        optimizer.step()

        total_loss += loss.item()
        n_batches += 1

    with torch.no_grad():
        nn1.eval()
        total_val_loss, n_val_batches = 0, 0
        for X_batch, y_batch in test_loader:
            y_batch = y_batch.unsqueeze(1)
            out = nn1(X_batch)
            loss = criterion(out, y_batch)
            total_val_loss += loss.item()
            n_val_batches += 1

    avg_val_loss = total_val_loss/n_val_batches
    avg_loss = total_loss/n_batches
    losses.append(avg_loss)
    val_losses.append(avg_val_loss)

    if epoch % interval == 0:
        print(f"Epoch {epoch}/{num_epochs}, Average Train Loss: {avg_loss:.4f}, Average Val Loss: {avg_val_loss:.4f}")

Watch out for overfitting!

In [None]:
df_loss = pd.DataFrame({"loss": losses, "val_loss": val_losses})
df_loss.plot()
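
One common guard against overfitting is early stopping: stop training once the validation loss has not improved for a while. The helper below is a hypothetical sketch, not part of torch, and the loss values are made up to show a curve that improves and then creeps back up.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_since_best = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best:
            best = val_loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch
    return None


# Made-up validation losses: improving at first, then getting worse
example_val_losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.52]
print(early_stopping_epoch(example_val_losses, patience=3))
```

The same function could be applied to a recorded list of validation losses to see where extra epochs stopped helping.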

### A Note: Transfer Learning

Transfer learning is the use of a pretrained model to speed up training. Transfer learning is incredibly powerful, and can massively reduce the time and data required to train a new model.

An example of transfer learning you can play with today is [Amazon Rekognition Custom Labels](https://aws.amazon.com/rekognition/custom-labels-features/), which can learn from just a handful of images.
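
The mechanics often used for transfer learning in torch can be sketched as "freeze the pretrained part, train a new head". In this toy cell the `backbone` is only a stand-in: it is randomly initialized here, whereas in real use it would be loaded with pretrained weights. All layer sizes are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained feature extractor (randomly initialized here;
# a real one would be loaded with weights learned on another task)
backbone = nn.Sequential(nn.Linear(10, 16), nn.ReLU())

# Freeze the "pretrained" weights so training leaves them untouched
for param in backbone.parameters():
    param.requires_grad = False

# New task-specific head: the only part that will be trained
head = nn.Linear(16, 1)
model = nn.Sequential(backbone, head)

# Hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=0.001)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"Training {n_trainable} of {n_total} parameters")
```

Because so few parameters are updated, far less data and time are needed than when training everything from scratch.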

# How do I deploy it? How do I maintain it?

Unfortunately, there's no clean answer - every architecture and need is different. Additionally, as the world evolves, your model might **drift** - cease to be accurate to the new reality.
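
One simple way to watch for drift in a single input feature is to compare its recent values against the training data, for example with a two-sample Kolmogorov-Smirnov test from `scipy`. The data below is synthetic and the 0.01 threshold is an arbitrary choice for this sketch; this checks whether the inputs have shifted, not whether the model's accuracy has dropped.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical values of one feature: at training time vs. in production later
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
live_feature = rng.normal(loc=1.0, scale=1.0, size=1000)  # the mean has shifted

result = ks_2samp(train_feature, live_feature)

# A tiny p-value suggests the two samples come from different distributions
drifted = result.pvalue < 0.01
print(result.statistic, result.pvalue, drifted)
```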

Look into https://madewithml.com/#course for a detailed course on machine learning for production. There are a lot of great courses at https://www.deeplearning.ai/courses/ and https://fullstackdeeplearning.com/.

Remember, your goal should be to reach the skill level needed for your interests or tasks, then to always improve your skills and knowledge bit by bit.

# Any questions?

# Go be experimental!