# **Fundamentals of Supervised and Unsupervised ML**

Objectives:
- Understand the importance of ML in mechanics and materials
- Understand what we need to learn a model
- Understand how to use learning curves to assess model performance

In any ML model, it is important that we have the **right type of data**. The data must be relevant and of high quality, and we need to check that it isn't biased.

The **manifold hypothesis** states that many real-world high-dimensional datasets actually lie along a low-dimensional latent manifold inside that high-dimensional space. 

ML algorithms can learn this low-dimensional structure directly from the data, which is generally not feasible for humans to do by inspection when the data is high-dimensional.
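
As a toy illustration of the manifold hypothesis (a hypothetical example, not a real dataset), the sketch below generates points in 3-D space that all lie on a helix. Every point has three coordinates, but a single latent coordinate `t` is enough to describe it. The function name `helix_point` and the helix itself are assumptions chosen for illustration.

```python
import math

def helix_point(t):
    """Map one latent coordinate t to a point in 3-D space (a helix)."""
    return (math.cos(t), math.sin(t), 0.5 * t)

# 50 samples of the 1-D latent coordinate.
latent = [0.1 * k for k in range(50)]

# The "high-dimensional" observations: each is a 3-D point.
points = [helix_point(t) for t in latent]

# Because the data lies on a 1-D manifold, one number per point is enough
# to recover the latent coordinate (here it can be read off the z value).
recovered = [2.0 * p[2] for p in points]
```

Here the low-dimensional structure is known in advance. In practice it is unknown, and the model has to learn it from the data.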

## **Machine Learning Algo Workflow**

1. **Gather Data**  
   Collect data from relevant sources.

2. **Data Processing & Cleaning**  
   Preprocess the data, handle missing values, and clean it for analysis.

3. **Build the Model**  
   Choose an appropriate machine learning algorithm and train the model on the data.

4. **Extract Insights**  
   Analyse the model's results. What does it tell us about the data?

5. **Data Visualization**  
   Visualize the findings to communicate insights effectively.


**Data cleaning** is typically the most important and most time-consuming step in this process. Here are some important processes in cleaning a dataset.
- *Data standardisation* -> convert data into the same format (same units, remove punctuation etc.)
- *Removing unwanted observations* -> get rid of duplicates or redundant data. Consider what is 'valid' for your model
- *Handling missing data* -> dealing with unknown data (e.g. NaN). You may ignore them, set them to 0 or try to predict them
- *Structural error solving* -> errors in the setup of the data (e.g. mislabelled classes)
- *Outlier management* -> dealing with values that don't belong in our dataset (e.g. we might solve this by defining min and max values)
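
Some of the cleaning steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the records are hypothetical `(sample_id, value)` pairs, missing values are simply ignored (one of the options listed above), and outliers are clipped to an assumed valid range `[lower, upper]`. The function name `clean_records` is illustrative only.

```python
import math

def clean_records(records, lower, upper):
    """Drop missing values and exact duplicates, then clip values to [lower, upper]."""
    cleaned = []
    seen = set()
    for sample_id, value in records:
        # Handling missing data: ignore entries that are None or NaN.
        if value is None or math.isnan(value):
            continue
        # Removing unwanted observations: skip exact duplicate records.
        if (sample_id, value) in seen:
            continue
        seen.add((sample_id, value))
        # Outlier management: clip to the assumed valid range.
        cleaned.append((sample_id, min(max(value, lower), upper)))
    return cleaned

# Hypothetical raw measurements with a duplicate, missing values and outliers.
raw = [
    ("s1", 2.0), ("s1", 2.0), ("s2", float("nan")), ("s3", 3.5),
    ("s4", None), ("s5", 250.0), ("s6", -4.0),
]
cleaned = clean_records(raw, lower=0.0, upper=10.0)
```

Whether to clip, drop or keep an outlier depends on what counts as 'valid' for the model, so the range used here would need to come from domain knowledge.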

**Data in materials and mechanics applications** can be:
- expensive to obtain
- difficult to measure
- noisy (if it's obtained from measurements)
- deterministic (if it's simulated)
- heterogeneous (comes from different sources)
- multi-modal 

## **Feature Engineering**

**Feature engineering is the process of transforming raw data into features that are suitable for machine learning models**. 

For example, in a dataset of housing prices, features could include the number of bedrooms, square footage, the location, and the age of the property. If we have a dataset of customers, features could include age, gender and occupation.

Features can be:
- Quantitative/qualitative
- Visible/latent
- Deterministic/statistical

Why do we do feature engineering?
1) To reduce the complexity of the data
2) To identify relevant features or design meaningful transformations (requires domain expertise)

Typically, in deep learning architectures, features are automatically extracted.

### Feature Scaling

Quantitative features can have varying magnitudes. Feature scaling is the process of normalising the values of the features in your dataset. This means that no feature dominates or skews the model too much. This helps improve model convergence. It also means that we can input features into the model as a single vector.

**Min-Max Scaling (Normalisation)**
- All features range from **0 to 1**
- Useful when distribution of the data is unknown (or heavily non-Gaussian)
- Sensitive to outliers
- May not preserve the relationship between datapoints

$$
x_{scaled, j} = \frac{x_j - \min{(\mathbf{x})}}{\max{(\mathbf{x})} - \min{(\mathbf{x})}}
$$
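
A minimal sketch of min-max scaling for a single feature, following the formula above. The function name `min_max_scale` and the sample values are assumptions for illustration.

```python
def min_max_scale(x):
    """Scale a list of feature values linearly so they lie in [0, 1]."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # All values are identical, so the formula would divide by zero.
        raise ValueError("min-max scaling is undefined for a constant feature")
    return [(v - lo) / (hi - lo) for v in x]

feature = [10.0, 20.0, 25.0, 50.0]
scaled = min_max_scale(feature)  # [0.0, 0.25, 0.375, 1.0]
```

The smallest value maps to 0 and the largest to 1, which is why one extreme value can squeeze the remaining points into a narrow part of the range.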

**Standardisation**
- Transforming the distribution into **zero mean, unit variance**
- Values are unbounded: they can lie anywhere in ($-\infty$, $+\infty$)
- Useful when the distribution is Gaussian or close to Gaussian
- Less sensitive to outliers than min-max scaling
- Preserves the relationships between datapoints

$$
x_{standard, j} = \frac{x_{j} - \text{mean}(\mathbf{x})}{\text{std}(\mathbf{x})}
$$
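
A minimal sketch of standardisation for a single feature, following the formula above. It assumes the population standard deviation (`statistics.pstdev`), since the notes don't specify whether the population or sample estimate is intended. The function name `standardise` is illustrative.

```python
import statistics

def standardise(x):
    """Shift and scale feature values to zero mean and unit variance."""
    mean = statistics.fmean(x)
    std = statistics.pstdev(x)  # assumption: population standard deviation
    if std == 0.0:
        raise ValueError("standardisation is undefined for a constant feature")
    return [(v - mean) / std for v in x]

feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
standardised = standardise(feature)
```

After the transformation the values have mean 0 and standard deviation 1, but unlike min-max scaling they aren't confined to a fixed interval.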

## **Machine Learning Models**

### Supervised and Unsupervised Learning

**Supervised Learning Strategies**

Classification -> *predicting a categorical variable*
- Naive Bayes
- Artificial neural networks
- Support vector machines
- Decision trees

Regression -> *predicting a numerical value*
- Linear regression
- Gaussian process


**Unsupervised Learning Strategies**

Clustering -> *grouping a set of objects*
- K-means
- Hierarchical clustering
- Gaussian mixture

Association -> *discovering relations between variables in large datasets*
- Autoencoders
- Artificial neural networks

### Training and Testing

A data-driven model is parameterised in some way. During training, we learn the values of these parameters using the dataset. This is an optimisation problem because we need to find the parameter values that minimise the model prediction error.

Note that the mean squared error (MSE), defined below, works well when the errors are small. When the errors are larger, the squared term magnifies their impact, which can accelerate the minimisation during training. However, it also amplifies the effect of outliers.

The objective/cost/loss function tells us the error in a model. This function must be chosen well to represent the design goals. The process of minimising the error and finding the optimal model parameters is known as **training**.

One way of computing the error is finding the **mean squared error (MSE)**:

$$
\text{loss} = \frac{1}{N} \sum_{i=1}^{N} \left(z_i - y_i(\mathbf{p})\right)^2
$$

where,
- $z_i$ is the true output for datapoint $i$
- $y_i$ is the model output for datapoint $i$
- $\mathbf{p}$ is the vector of model parameters
- $N$ is the number of datapoints
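
A minimal sketch of the MSE in plain Python: each residual $z_i - y_i$ is squared before averaging. The function name `mean_squared_error` and the sample values are assumptions for illustration.

```python
def mean_squared_error(z, y):
    """Average of the squared differences between true and model outputs."""
    if len(z) != len(y):
        raise ValueError("z and y must have the same length")
    return sum((zi - yi) ** 2 for zi, yi in zip(z, y)) / len(z)

z_true = [1.0, 2.0, 3.0, 4.0]   # true outputs z_i
y_model = [1.0, 2.5, 2.0, 6.0]  # model outputs y_i(p)

# Squared residuals are 0, 0.25, 1 and 4, so the mean is 5.25 / 4.
loss = mean_squared_error(z_true, y_model)  # 1.3125
```

The last datapoint has the largest error and contributes 4 of the 5.25 total, which shows how squaring lets large errors dominate the loss.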

Some alternative loss functions for regression include the mean absolute error (MAE) and the root mean squared error (RMSE). For classification, cross-entropy is a common loss function, while accuracy and ROC-AUC are usually used as evaluation metrics rather than minimised directly.
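
For comparison, here is a minimal sketch of the MAE and RMSE on a small hypothetical set of outputs. The function names are illustrative, not taken from any particular library.

```python
import math

def mean_absolute_error(z, y):
    """Average of the absolute differences between true and model outputs."""
    return sum(abs(zi - yi) for zi, yi in zip(z, y)) / len(z)

def root_mean_squared_error(z, y):
    """Square root of the mean squared error, in the same units as the outputs."""
    mse = sum((zi - yi) ** 2 for zi, yi in zip(z, y)) / len(z)
    return math.sqrt(mse)

z_true = [1.0, 2.0, 3.0, 4.0]
y_model = [1.0, 2.5, 2.0, 6.0]

mae = mean_absolute_error(z_true, y_model)       # (0 + 0.5 + 1 + 2) / 4 = 0.875
rmse = root_mean_squared_error(z_true, y_model)  # sqrt(1.3125), roughly 1.146
```

The RMSE is larger than the MAE here because the single error of 2 is weighted more heavily once it is squared.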

Increasing the model complexity doesn't necessarily make the model better: a model that is too simple underfits, while a model that is too complex overfits. Preferring the simplest model that explains the data well is known as **Occam's razor**. If the model underfits, accuracy is low on both the training and validation data. We can tackle underfitting by increasing the model complexity, training for more iterations or changing the learning rate. If the model overfits, accuracy is high on the training data but low on the validation data. We can tackle overfitting using regularisation, feature selection or early stopping.

Note that classification problems usually don't use learning curves in the same way, because the performance of the model at each iteration is computed differently. Examples of performance measures are a confusion matrix, a classification score, or the number of false positives and false negatives.
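
A minimal sketch of how the entries of a binary confusion matrix could be counted. The labels below are hypothetical, and the function name `confusion_counts` is an assumption for illustration.

```python
def confusion_counts(true_labels, predicted_labels, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == positive and t == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif t == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
pred_labels = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(true_labels, pred_labels)
# Accuracy is the fraction of correct predictions: (tp + tn) / total.
accuracy = (counts["tp"] + counts["tn"]) / len(true_labels)  # 6 / 8 = 0.75
```

Keeping the four counts separate is useful because two classifiers with the same accuracy can make very different numbers of false positives and false negatives.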

Once we have a model with optimal parameters, we can test its performance using test data. This is known as the **testing** phase.
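
Test data is normally kept separate from the data used for training. A minimal sketch of such a split is shown below. The 20% test fraction, the fixed seed and the function name `split_train_test` are assumptions for illustration, not values prescribed by these notes.

```python
import random

def split_train_test(data, test_fraction=0.2, seed=0):
    """Randomly split a list of datapoints into training and test subsets."""
    indices = list(range(len(data)))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_fraction * len(data)))
    test_indices = indices[:n_test]
    train_indices = indices[n_test:]
    train = [data[i] for i in train_indices]
    test = [data[i] for i in test_indices]
    return train, test

data = list(range(10))  # stand-in for 10 datapoints
train_data, test_data = split_train_test(data)
```

The model parameters are learned from `train_data` only, and `test_data` is used afterwards to estimate how well the model performs on data it hasn't seen.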

The learning curve is the graph of the loss at each iteration of the optimisation scheme. Ideally, this should converge to around 0.
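
A learning curve can be produced by recording the loss at every iteration of training. The sketch below does this for a hypothetical linear model $y = p_0 + p_1 x$ trained by gradient descent on the MSE. The choice of model, optimiser, learning rate and data is an assumption for illustration.

```python
def train_linear_model(x, z, learning_rate=0.05, n_iterations=1000):
    """Fit y = p0 + p1 * x by gradient descent on the MSE, recording the loss."""
    p0, p1 = 0.0, 0.0
    n = len(x)
    loss_history = []
    for _ in range(n_iterations):
        # Residuals z_i - y_i(p) for the current parameters.
        residuals = [zi - (p0 + p1 * xi) for xi, zi in zip(x, z)]
        # MSE at this iteration: one point on the learning curve.
        loss_history.append(sum(r * r for r in residuals) / n)
        # Gradients of the MSE with respect to p0 and p1.
        grad_p0 = -2.0 * sum(residuals) / n
        grad_p1 = -2.0 * sum(r * xi for r, xi in zip(residuals, x)) / n
        # Gradient descent update.
        p0 -= learning_rate * grad_p0
        p1 -= learning_rate * grad_p1
    return (p0, p1), loss_history

# Noise-free synthetic data generated from z = 1 + 2x.
x_data = [0.0, 1.0, 2.0, 3.0, 4.0]
z_data = [1.0, 3.0, 5.0, 7.0, 9.0]

(p0, p1), loss_history = train_linear_model(x_data, z_data)
```

Plotting `loss_history` against the iteration number gives the learning curve. Because this data is noise-free and the model can represent it exactly, the loss approaches 0. With noisy data the curve would level off at a value above 0.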