<div align="center">
    <p><font size="6">DS-210: Programming for Data Science</font></p>
    <p><font size="6">Lecture 23</font></p>
</div>



<div align="center"> 
    <p><font size="5">Linear Regression, Loss, and Bias</font></p>
</div>

> Note: A1 Lectures 35 and 36


# In-Class Poll

https://piazza.com/class/m5qyw6267j12cj/post/476 

# Linear regression

### Simplest setting

**Input:** set of points $(x_i,y_i)$ in $\mathbb R \times \mathbb R$

* What function $f:\mathbb R \rightarrow \mathbb R$ explains the relationship of $x_i$'s with $y_i$'s?

* What linear function $f(x) = a x + b$ describes it best? 

### Multivariate version
**Input:** set of points $(X_i,y_i)$ in $\mathbb R^d \times \mathbb R$

Find the linear function $f(x_1,x_2,\ldots,x_d) = a_1 x_1 + \cdots + a_d x_d + b$ that best describes the $y_i$'s in terms of the $X_i$'s.


### Why linear regression?
* Have to assume something!
* Models hidden linear relationship + noise

## Typical objective: minimize square error


* Points can rarely be described exactly by a linear relationship

* How to decide between several non-ideal options?

* Typically want to find $f$ that minimizes total square error:

$$\sum_{i} \left(f(x_i) - y_i\right)^2$$
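
For the 1-D case, the line that minimizes this total square error has a closed form. The following is a minimal sketch using only the standard library; the names `fit_line` and `total_square_error` are made up for illustration and are not part of any crate used in this lecture.

```rust
/// Fit y ≈ a * x + b by minimizing the total square error.
/// Returns None if the inputs differ in length, have fewer than
/// two points, or all x values are identical.
fn fit_line(xs: &[f64], ys: &[f64]) -> Option<(f64, f64)> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let x_mean = xs.iter().sum::<f64>() / n;
    let y_mean = ys.iter().sum::<f64>() / n;
    // Sum of (x - x_mean)(y - y_mean) and sum of (x - x_mean)^2
    let sxy: f64 = xs.iter().zip(ys).map(|(x, y)| (x - x_mean) * (y - y_mean)).sum();
    let sxx: f64 = xs.iter().map(|x| (x - x_mean).powi(2)).sum();
    if sxx == 0.0 {
        return None;
    }
    let a = sxy / sxx; // slope
    let b = y_mean - a * x_mean; // intercept
    Some((a, b))
}

/// Total square error of the line f(x) = a * x + b on the given points.
fn total_square_error(xs: &[f64], ys: &[f64], a: f64, b: f64) -> f64 {
    xs.iter().zip(ys).map(|(x, y)| (a * x + b - y).powi(2)).sum()
}
```

For the points (0, 1), (1, 3), (2, 5), (3, 7), `fit_line` returns slope 2 and intercept 1, and the total square error of that line is 0.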

## What about Categorical Data?

* Convert to numerical first
* One way is to simply convert to unique integers, but this imposes an artificial ordering and spacing on the categories
* A better way is to create N new columns (one for each category) and make them boolean (is_x, is_y, is_z etc) -- **one hot encoding**

For example with 3 categories:

$$
\begin{bmatrix}
2 & 0 & 1
\end{bmatrix}
\rightarrow
\begin{bmatrix}
0 \\
0 \\
1 \\
\end{bmatrix}
\begin{bmatrix}
1 \\
0 \\
0 \\
\end{bmatrix}
\begin{bmatrix}
0 \\
1 \\
0 \\
\end{bmatrix}
$$
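
The encoding above can be sketched in a few lines of Rust. This version assumes the categories have already been mapped to integers `0..num_categories`; the function name `one_hot` is hypothetical.

```rust
/// One-hot encode integer category labels in 0..num_categories.
/// Each label becomes a row with a single 1.0 in the column of its category.
fn one_hot(labels: &[usize], num_categories: usize) -> Vec<Vec<f64>> {
    labels
        .iter()
        .map(|&label| {
            assert!(label < num_categories, "label out of range");
            let mut row = vec![0.0; num_categories];
            row[label] = 1.0;
            row
        })
        .collect()
}
```

Calling `one_hot(&[2, 0, 1], 3)` reproduces the three vectors shown above: `[0, 0, 1]`, `[1, 0, 0]`, and `[0, 1, 0]`.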

## 1-D Example

We'll start with an example from the 
[Rust Machine Learning Book](https://rust-ml.github.io/book/5_linear_regression.html).

In [None]:
// Use a crate built from source on github
:dep linfa-book = { git = "https://github.com/rust-ml/book" }
:dep ndarray = { version = "^0.15.6" }

// create_curve implemented at https://github.com/rust-ml/book/blob/main/src/lib.rs#L52
use linfa_book::create_curve;
use ndarray::Array2;
use ndarray::s;

fn generate_data(output: bool) -> Array2<f64> {

    /*
     * Generate a dataset of x and y values
     * 
     * - Randomly generate 50 points between 0 and 7
     * - Calculate y = m * x^power + b +noise
     *   where noise is uniformly random between -0.5 and 0.5
     * 
     * m = 1.0  (slope)
     * power = 1.0  (straight line)
     * b = 0.0 (y-intercept)
     * num_points = 50
     * x_range = [0.0, 7.0]
     * 
     * This produces a 50x2 Array2<f64> with the first column being x and the
     * second being y
     */
    let array: Array2<f64> = linfa_book::create_curve(1.0, 1.0, 0.0, 50, [0.0, 7.0]);

    // Converting from an array to a Linfa Dataset can be the trickiest part of this process
    // The first part has to be an array of arrays even if they have a single entry for the 1-D case
    // The underlying type is kind of ugly but thankfully the compiler figures it out for us
    // let data: ArrayBase<OwnedRepr<f64>, Ix2> = ....  is the actual type
    // The second part has to be an array of values.
    // let targets: Array1<f64> is the actual type


    let data = array.slice(s![.., 0..1]).to_owned();
    let targets = array.column(1).to_owned();
    if output {
        println!("The original array is:");
        println!("{:.2?}", array);

        println!("The data is:");
        println!("{:.2?}", data);

        println!("The targets are:");
        println!("{:.2?}", targets);

    }
    return array;
}

generate_data(true);

The original array is:
[[6.24, 5.83],
 [4.81, 4.54],
 [0.48, 0.90],
 [2.10, 2.03],
 [1.94, 2.07],
 [4.06, 4.07],
 [2.63, 2.81],
 [6.79, 6.93],
 [2.16, 2.37],
 [2.71, 2.32],
 [3.39, 3.24],
 [2.01, 1.75],
 [1.27, 1.11],
 [6.15, 6.01],
 [2.59, 3.06],
 [6.85, 6.63],
 [0.67, 0.22],
 [6.08, 6.17],
 [3.32, 3.37],
 [1.79, 1.80],
 [5.53, 5.57],
 [3.30, 3.21],
 [4.54, 4.83],
 [3.82, 3.64],
 [5.69, 6.18],
 [4.30, 4.55],
 [3.36, 2.89],
 [0.33, 0.15],
 [1.93, 1.91],
 [3.32, 3.25],
 [4.88, 4.71],
 [0.64, 0.66],
 [1.29, 1.15],
 [1.65, 1.32],
 [2.39, 2.67],
 [3.72, 3.43],
 [2.57, 2.92],
 [5.31, 5.23],
 [2.77, 3.01],
 [1.10, 0.94],
 [2.98, 2.74],
 [3.87, 3.55],
 [6.19, 6.12],
 [3.11, 3.15],
 [1.65, 1.81],
 [6.56, 6.94],
 [5.00, 5.39],
 [3.72, 3.59],
 [3.90, 3.63],
 [2.84, 2.39]], shape=[50, 2], strides=[2, 1], layout=Cc (0x5), const ndim=2
The data is:
[[6.24],
 [4.81],
 [0.48],
 [2.10],
 [1.94],
 [4.06],
 [2.63],
 [6.79],
 [2.16],
 [2.71],
 [3.39],
 [2.01],
 [1.27],
 [6.15],
 [2.59],
 [6.85],
 [0.67],

## Let's plot these points to see what they look like


In [7]:
// 2021
//:dep plotters={version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"]}

// 2024
:dep plotters={version = "^0.3.0", default-features = false, features = ["evcxr", "all_series"]}

:dep ndarray = { version = "^0.15.6" }

extern crate plotters;
use plotters::prelude::*;

{
let array: Array2<f64> = generate_data(false);

let x_values = array.column(0);
let y_values = array.column(1);

evcxr_figure((640, 480), |root| {
    let mut chart = ChartBuilder::on(&root)
    // the caption for the chart
        .caption("2-D Plot", ("Arial", 20).into_font())
        .x_label_area_size(40)
        .y_label_area_size(40)
   // the X and Y coordinates spaces for the chart
        .build_cartesian_2d(0f64..8f64, 0f64..8f64)?;
    chart.configure_mesh()
        .x_desc("X Values")
        .y_desc("Y Values")
        .draw()?;

    chart.draw_series(
        x_values.iter().zip(y_values.iter()).map(|(&x, &y)| {
            Circle::new((x, y), 3, RED.filled())
        }),
    )?;
    Ok(())
})
}


## Let's fit a linear regression to it

In [5]:
// There are multiple packages in this git repository so we have to declare
// all the ones we care about
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = { version = "^0.15.6" }

// Now we need to declare which function we are going to use
use linfa::Dataset;
use linfa_linear::LinearRegression;
use linfa_linear::FittedLinearRegression;
use linfa::prelude::*;

fn fit_model(array: &Array2<f64>) -> FittedLinearRegression<f64> {
    // Split the array into features (data) and targets
    let data = array.slice(s![.., 0..1]).to_owned();
    let targets = array.column(1).to_owned();

    // And finally let's fit a linear regression to it
    println!("Data: {:?} \n Targets: {:?}", data, targets);
    let dataset = Dataset::new(data, targets);
    let lin_reg: LinearRegression = LinearRegression::new();
    let model: FittedLinearRegression<f64> = lin_reg.fit(&dataset).unwrap();
    
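    // Mean absolute error of the fit on the training data
    // (`mean()` returns an Option, which is None for an empty array)
    let ypred = model.predict(&dataset);
    let loss = (dataset.targets() - ypred)
        .mapv(|x| x.abs())
        .mean();

    println!("{:?}", loss);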
    return model;
}

let array: Array2<f64> = generate_data(false);
let model = fit_model(&array);
println!("{:?}", model);


Data: [[3.594498536715058],
 [5.018369642738123],
 [3.9549510162761425],
 [5.6000945555425545],
 [6.871522202349424],
 [3.776979425924523],
 [0.8263909853857814],
 [1.0158189970802722],
 [0.6485679162386087],
 [3.9063222520000354],
 [1.9062980840469828],
 [0.7092662567147963],
 [1.0180221151201896],
 [2.8013671090064345],
 [4.138474701278274],
 [6.504604649063631],
 [2.3261634181116664],
 [1.846173945844946],
 [6.487483313167521],
 [2.149000886670419],
 [6.715647835175662],
 [6.111196805646161],
 [3.772657077130999],
 [2.603263786394396],
 [2.264228234039776],
 [0.9533764747166664],
 [6.786117495521262],
 [1.9487396442040545],
 [6.903161703514631],
 [5.518937121304132],
 [1.9186201467117683],
 [0.43456025187150815],
 [5.301681393839736],
 [6.620435644145145],
 [3.480368520525597],
 [6.971070657889326],
 [4.9768171107282],
 [6.526563267956613],
 [6.915326072397274],
 [4.893081117526469],
 [4.391843900223079],
 [5.366712230402502],
 [0.7484489866462081],
 [3.726275954222591],
 [1.4817990

## Finally let's put everything together and plot the data points and model line

In [8]:
extern crate plotters;
use plotters::prelude::*;

{    
let array: Array2<f64> = generate_data(false);
let model = fit_model(&array);
println!("{:?}", model);

let x_values = array.column(0);
let y_values = array.column(1);

evcxr_figure((640, 480), |root| {
    let mut chart = ChartBuilder::on(&root)
    // the caption for the chart
        .caption("Linear Regression", ("Arial", 20).into_font())
        .x_label_area_size(40)
        .y_label_area_size(40)
    // the X and Y coordinates spaces for the chart
        .build_cartesian_2d(0f64..8f64, 0f64..8f64)?;
    chart.configure_mesh()
        .x_desc("X Values")
        .y_desc("Y Values")
        .draw()?;

    chart.draw_series(
        x_values.iter().zip(y_values.iter()).map(|(&x, &y)| {
            Circle::new((x, y), 3, RED.filled())
        }),
    )?;
    // One point per integer x in 0..8, so reserve room for 8 points
    let mut line_points = Vec::with_capacity(8);
    for i in 0..8i32 {
        line_points.push((i as f64, (i as f64 * model.params()[0]) + model.intercept()));
    }
    // We can configure the rounded precision of our result here
    let precision = 2;
    let label = format!(
        "y = {:.2$}x + {:.2$}",
        model.params()[0],
        model.intercept(),
        precision
    );
    chart.draw_series(LineSeries::new(line_points, &BLACK))
        .unwrap()
        .label(&label);

    Ok(())
})
    
}

Data: [[0.6380189123434981],
 [0.6799208075594287],
 [4.521788480938348],
 [0.5915966036325138],
 [5.1750453811004755],
 [4.2487829544260105],
 [4.522841291436206],
 [5.238602321465335],
 [6.93067176255613],
 [0.8575903628460368],
 [4.869844006299996],
 [0.5989087679365952],
 [2.63465506214517],
 [5.896676700949484],
 [4.688513424302119],
 [3.5091103817728566],
 [3.4628037183994227],
 [1.3254944732625764],
 [5.812585673346609],
 [5.66529598518652],
 [0.5952611595716057],
 [4.3931706641848915],
 [6.343103295990133],
 [5.519086741608909],
 [3.2382316654137835],
 [0.5173429542684593],
 [1.4138136366869134],
 [3.44714501752033],
 [2.690083559373025],
 [1.8763705852906722],
 [0.49399855465875553],
 [6.742088832340754],
 [2.4600054344809514],
 [3.8152111900769707],
 [1.2350112032355123],
 [6.02743220407584],
 [3.447608272249459],
 [6.475917584032839],
 [5.508843971002741],
 [2.4404781600505943],
 [4.235130483928617],
 [6.601052952841419],
 [6.909770631665306],
 [3.8070656111586],
 [4.4047623

The fitted model corresponds to

$$f(x) = ax + b$$

where

 * `params[0]`${}= a$
 * `intercept`${}= b$



## Coefficient of determination (or $R^2$)

* How good is my function $f$?

* **Input:** points $\{(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\}$

* **Idea:** Compare variance of $y_i$'s to the deviation of $y_i$'s from $f(x_i)$'s

* **Formally:** 

$$
1 - \frac{\sum_i \left(y_i - f(x_i)\right)^2}{\sum_i \left(y_i - \bar y\right)^2}
= 1 - \frac{\frac{1}{n}\sum_i \left(y_i - f(x_i)\right)^2}{\textrm{Var}(y_i)}
$$

where $\bar y = \frac{1}{n}\sum_i y_i$

* **Range:** $(-\infty,1]$ &nbsp;&nbsp;&nbsp;&nbsp;(in $[0,1]$ for a least-squares linear fit with an intercept, evaluated on its own training data)

## Let's compute for our data

In [14]:
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = { version = "^0.15.6" }
:dep ndarray-stats = { version = "^0.5.1" }

let array: Array2<f64> = generate_data(false);
let data = array.slice(s![.., 0..1]).to_owned();
let targets = array.column(1).to_owned();

// And finally let's fit a linear regression to it
let dataset = Dataset::new(data, targets).with_feature_names(vec!["x"]);
let lin_reg: LinearRegression = LinearRegression::new();
let model: FittedLinearRegression<f64> = lin_reg.fit(&dataset).unwrap();    
let ypred = model.predict(&dataset);
// R^2 = 1 - MSE / Var(y); var(0.0) is the population variance (ddof = 0)
let variance:f64 = dataset.targets().var(0.0);
let mse:f64 = (dataset.targets() - ypred)
    .mapv(|x| x.powi(2))
        .mean().unwrap();
let r2 = 1.0 - mse/variance;
println!("variance = {:.3}, mse = {:.3}, R2 = {:.3}", variance, mse, r2);

variance = 3.443, mse = 0.076, R2 = 0.978


## Multivariable linear regression

What if you have multiple input variables and one output variable?

It's actually quite simple: use the same code as above, but give each row of your X data one value per input variable. The model computes one coefficient per variable (column) it sees.

In [15]:
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = { version = "^0.15.6" }

use linfa::Dataset;
use linfa::traits::Fit;
use ndarray::{Array1, Array2, array};
use linfa_linear::LinearRegression;
use linfa_linear::FittedLinearRegression;
use linfa::prelude::*;


fn main() {
    // Example data: 4 samples with 3 features each
    let x: Array2<f64> = array![[1.0, 2.0, 3.0],
                                //[2.0, 3.0, 1.0],
                                [2.0, 3.0, 4.0],
                                [3.0, 4.0, 5.0],
                                [4.0, 5.0, 6.0],
                                //[6.0, 11.0, 12.0]
                                ];
    // Target values
    let y: Array1<f64> = array![6.0, 
                                //6.0, 
                                9.0, 
                                12.0, 
                                15.0, 
                                //29.0,
        ];

    // Create dataset
    let dataset = Dataset::new(x.clone(), y.clone());

    // Fit linear regression model
    let lin_reg = LinearRegression::new();
    let model = lin_reg.fit(&dataset).unwrap();

    // Print coefficients
    println!("Coefficients: {:.3?}", model.params());
    println!("Intercept: {:.3?}", model.intercept());

    // Predict using the fitted model
    let predictions = model.predict(&x);
    println!("Predictions: {:.3?}", predictions);
}

main();


Coefficients: [22.667, -8.333, -11.333], shape=[3], strides=[1], layout=CFcf (0xf), const ndim=1
Intercept: 34.000
Predictions: [6.000, 9.000, 12.000, 15.000], shape=[4], strides=[1], layout=CFcf (0xf), const ndim=1

Note that the three feature columns here are perfectly collinear (the second and third columns are the first plus a constant), so infinitely many coefficient vectors reproduce the targets exactly. The solver returns one of them; coefficients $[1, 1, 1]$ with intercept $0$ fit equally well.


## General Least Squares Fit


We can generalize the ordinary least squares regression problem by trying,
for example, to find the $\beta_j$ in an equation like

$$ 
\hat{y}_i = \beta_0  + \beta_1 f_1(x_i) + \beta_2 f_2(x_i) + ... + \beta_n f_n(x_i)
$$

to minimize the differences between some targets $y_i$ and the $\hat{y}_i$.

> Note that these are still linear in the parameters.


We can rewrite this in matrix form:

$$
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_m \\
\end{bmatrix}
=
\begin{bmatrix}
1 & f_1(x_1) & ... & f_n(x_1) \\
1 & f_1(x_2) & ... & f_n(x_2) \\
\vdots & \vdots & & \vdots \\
1 & f_1(x_m) & ... & f_n(x_m) \\
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_n
\end{bmatrix}
+
\begin{bmatrix}
\varepsilon_1 \\
\varepsilon_2 \\
\vdots \\
\varepsilon_m \\
\end{bmatrix}
$$

where the matrix of $f$'s is sometimes called the
[Design Matrix](https://en.wikipedia.org/wiki/Design_matrix).

* We are going to use a different library, familiar to us from the ndarray lecture.

* ndarray-linalg can compute the parameters that fit an arbitrary function, as long as you have some idea of what form the function might take.
  * e.g. express it as a design matrix

* Then solve the resulting linear system in the least-squares sense, with the design matrix on the left-hand side and our observed values on the right-hand side.

* The result is the missing parameters of our assumed function.
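
As a sketch of what the solver minimizes, write $A$ for the design matrix, $\beta$ for the parameter vector, and $y$ for the vector of observed values (the name $A$ is introduced here for illustration only). The least-squares fit is

$$
\hat{\beta} = \arg\min_{\beta} \lVert y - A\beta \rVert_2^2
$$

and, when the columns of $A$ are linearly independent, $\hat{\beta}$ is the unique solution of the normal equations

$$
A^\top A \, \hat{\beta} = A^\top y
$$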

In [18]:
:dep ndarray = { version = "^0.15.6" }

// See ./README.md for ndarray-linalg prereqs

// This is the version for MAC 
:dep ndarray-linalg = { version = "^0.16", features = ["openblas-system"] }

// Alternative for Mac if you installed netlib
//:dep ndarray-linalg = { version = "^0.14" , features = ["netlib"]}

// This works for linux
// :dep ndarray-linalg = { version = "^0.14" , features = ["openblas"]}


use ndarray::{Array1, Array2, ArrayView1};
use ndarray_linalg::Solve;
use ndarray::array;
use ndarray::Axis;
use ndarray_linalg::LeastSquaresSvd;

// Define an arbitrary function (e.g., a quadratic function)
fn arbitrary_function(x: f64, params: &ArrayView1<f64>) -> f64 {
    params[0] + params[1] * x + params[2] * x * x
}

// Compute the design matrix for the arbitrary function
fn design_matrix(x: &Array1<f64>) -> Array2<f64> {
    let mut dm = Array2::<f64>::zeros((x.len(), 3));
    for (i, &xi) in x.iter().enumerate() {
        dm[(i, 0)] = 1.0;
        dm[(i, 1)] = xi;
        dm[(i, 2)] = xi * xi;
    }
    dm
}

// Perform least squares fit
fn least_squares_fit(x: &Array1<f64>, y: &Array1<f64>) -> Array1<f64> {
    let dm = design_matrix(x);
    let y_col = y.to_owned().insert_axis(Axis(1)); // Convert y to a column vector
    let params = dm.least_squares(&y_col).unwrap().solution; // Use least squares solver
    params.column(0).to_owned() // Convert params back to 1D array
}


fn main() {
    // Example data
    let x: Array1<f64> = array![1.0, 2.0, 3.0, 4.0];
    let y: Array1<f64> = array![1.0, 4.0, 9.0, 16.0];  // y = x^2 for this example

    // Perform least squares fit
    let params = least_squares_fit(&x, &y);
    println!("Fitted parameters: {:.3}", params);

    // Predict using the fitted parameters
    let predictions: Array1<f64> = x.mapv(|xi| arbitrary_function(xi, &params.view()));
    println!("Predictions: {:.3?}", predictions);
}

main();

Fitted parameters: [0.000, -0.000, 1.000]
Predictions: [1.000, 4.000, 9.000, 16.000], shape=[4], strides=[1], layout=CFcf (0xf), const ndim=1


### Example General Function

Imagine a case where you have a function like:

$$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 \log(x)$$

Then set up your functions like this:

```Rust
// Define an arbitrary function (here: a quadratic plus a log term; requires x > 0)
fn arbitrary_function(x: f64, params: &ArrayView1<f64>) -> f64 {
    params[0] + params[1] * x + params[2] * x * x + params[3] * x.ln()
}

// Compute the design matrix for the arbitrary function
fn design_matrix(x: &Array1<f64>) -> Array2<f64> {
    let mut dm = Array2::<f64>::zeros((x.len(), 4));
    for (i, &xi) in x.iter().enumerate() {
        dm[(i, 0)] = 1.0;
        dm[(i, 1)] = xi;
        dm[(i, 2)] = xi * xi;
        dm[(i, 3)] = xi.ln();
    }
    dm
}
```

It gets messier when a parameter appears non-linearly, for example with $x$ in an exponent as in $e^{a x}$. That is outside the scope of this lecture, but it is possible to do a non-linear least squares fit for completely arbitrary functions!

## Train/Test Splits

To test how well your model generalizes, split your data into
- a training dataset (e.g. 80% of the data)
- a testing dataset (e.g. 20% of the data)

Then train on the training dataset and evaluate on the test dataset.

More on this below.

This is a bit more cumbersome in Rust than with scikit-learn in Python, but in the end it is not that hard.

The `smartcore` crate implements this split for you, and you can use it as follows:

In [None]:
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = {version = "0.15"}
:dep smartcore = {version = "0.2", features=["ndarray-bindings"]}

use linfa::Dataset;
use linfa::traits::Fit;
use linfa_linear::LinearRegression;
use ndarray::{Array1, Array2, array};
use linfa::DatasetBase;
use smartcore::model_selection::train_test_split;  // This is the function we need
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
use smartcore::linalg::BaseMatrix;


fn main() {
    // Example data: 4 samples with 3 features each
    let x: Array2<f64> = array![[1.0, 2.0, 3.0],
                                [2.0, 3.0, 4.0],
                                [3.0, 4.0, 5.0],
                                [4.0, 5.0, 6.0]];
    // Target values
    let y: Array1<f64> = array![6.0, 9.0, 12.0, 15.0];

    // Split the data into training and testing sets
    let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.5, true);

    let train_dataset = Dataset::new(x_train.clone(), y_train.clone());
    let test_dataset = Dataset::new(x_test.clone(), y_test.clone());

    // Fit linear regression model to the training data
    let lin_reg = LinearRegression::new();
    let model = lin_reg.fit(&train_dataset).unwrap();

    // Print coefficients
    println!("Coefficients: {:.3?}", model.params());
    println!("Intercept: {:.3?}", model.intercept());

    // Predict from the test set
    let predictions = model.predict(&x_test);
    println!("X_test: {:.3?}, Predictions: {:.3?}", x_test, predictions);
}

main();

Coefficients: [1.000, 1.000, 1.000], shape=[3], strides=[1], layout=CFcf (0xf), const ndim=1
Intercept: -0.000
X_test: [[4.000, 5.000, 6.000],
 [1.000, 2.000, 3.000]], shape=[2, 3], strides=[3, 1], layout=Cc (0x5), const ndim=2, Predictions: [15.000, 6.000], shape=[2], strides=[1], layout=CFcf (0xf), const ndim=1


# Loss Functions, Bias & Cross-Validation

## Reminders

Typical predictive data analysis pipeline:

* **Very important:** split your data into a training and test part
* Train your model on the training part
* Use the testing part to evaluate accuracy

## Measuring errors for regression

* Usually, the predictor is not perfect.
* How do I evaluate different options and choose the best one?

**Mean Squared Error (or $L_2$ loss):**

$$\frac{1}{n}\sum_{i = 1}^n \left(f(x_i) - y_i\right)^2$$

**Mean Absolute Error (or $L_1$ loss):**

$$\frac{1}{n}\sum_{i = 1}^n \left| f(x_i) - y_i\right|$$
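
Both losses are straightforward to compute directly. Here is a minimal standard-library sketch; the helper names `mse` and `mae` are chosen for illustration and take plain slices rather than `ndarray` arrays.

```rust
/// Mean squared error (L2 loss) between predictions and targets.
fn mse(predictions: &[f64], targets: &[f64]) -> f64 {
    assert_eq!(predictions.len(), targets.len());
    let n = predictions.len() as f64;
    predictions
        .iter()
        .zip(targets)
        .map(|(p, y)| (p - y).powi(2))
        .sum::<f64>()
        / n
}

/// Mean absolute error (L1 loss) between predictions and targets.
fn mae(predictions: &[f64], targets: &[f64]) -> f64 {
    assert_eq!(predictions.len(), targets.len());
    let n = predictions.len() as f64;
    predictions
        .iter()
        .zip(targets)
        .map(|(p, y)| (p - y).abs())
        .sum::<f64>()
        / n
}
```

With predictions `[1, 2, 3, 10]` and targets `[1, 2, 3, 4]`, a single error of 6 gives an MSE of 9 but an MAE of only 1.5.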

### Plots of MSE and MAE

In [2]:
// 2021
//:dep plotters={version = "^0.3.0", default_features = false, features = ["evcxr", "all_series","full_palette"]}

// 2024
:dep plotters={version = "^0.3.0", default-features = false, features = ["evcxr", "all_series","full_palette"]}

:dep ndarray = { version = "^0.16.0" }
:dep ndarray-rand = { version = "0.15.0" }
:dep linfa = { version = "^0.7.0" }
:dep linfa-linear = { version = "^0.7.0" }
:dep linfa-datasets = { version = "^0.7.0" }
:dep ndarray-stats = { version = "^0.5.1" }

In [3]:
// Helper code for plotting
extern crate plotters;
use plotters::prelude::*;
use ndarray::Array1;
use plotters::evcxr::SVGWrapper;
use plotters::style::colors::full_palette;

fn plotter_scatter(sizes: (u32, u32), x_range: (f64, f64), y_range: (f64, f64), scatters: &[(&Array1<f64>, &Array1<f64>, &RGBColor, &str, &str)], lines: &[(&Array1<f64>, &Array1<f64>, &str, &RGBColor)]) -> SVGWrapper {
    evcxr_figure((sizes.0, sizes.1), |root| {
    let mut chart = ChartBuilder::on(&root)
    // the caption for the chart
        .caption("2-D Plot", ("Arial", 20).into_font())
        .x_label_area_size(40)
        .y_label_area_size(40)
   // the X and Y coordinates spaces for the chart
        .build_cartesian_2d(x_range.0..x_range.1, y_range.0..y_range.1)?;
    chart.configure_mesh()
        .x_desc("X Values")
        .y_desc("Y Values")
        .draw()?;
        
        for scatter in scatters {
            if scatter.3 == "x" {
              chart.draw_series(
                scatter.0.iter().zip(scatter.1.iter()).map(|(&a, &b)| {
                  Cross::new((a, b), 3, scatter.2.filled())   
                }),
              )?
              .label(scatter.4)
              .legend(|(x, y)| Cross::new((x, y), 3, scatter.2.filled()));
            } else {
              chart.draw_series(
                scatter.0.iter().zip(scatter.1.iter()).map(|(&a, &b)| {
                  Circle::new((a, b), 3, scatter.2.filled())   
                }),
              )?
              .label(scatter.4)
              .legend(|(x, y)| Circle::new((x, y), 3, scatter.2.filled()));
            }
        }
        
        if lines.len() > 0 {
            for line in lines {
              chart
                .draw_series(LineSeries::new(line.0.iter().cloned().zip(line.1.iter().cloned()), line.3))?
                .label(line.2)
                .legend(|(x, y)| PathElement::new(vec![(x, y), (x + 20, y )], line.3));
            }
            // Configure and draw the legend
        
        }
        chart
              .configure_series_labels()
              .background_style(&WHITE.mix(0.8))
              .border_style(&BLACK)
              .draw()?;
    Ok(())
  })
}

In [4]:
use ndarray::Array;
    
let xs = Array::linspace(-1.3, 1.3, 300);
let abs_xs = xs.mapv(f64::abs);
let xs_squared = xs.mapv(|a| a.powi(2));

plotter_scatter(
  (500, 400), (-1.5,1.5), (0.,1.75), //plot size and ranges
  &[], //any scatters
  &[(&xs,&abs_xs, "MAE", &full_palette::RED), (&xs,&xs_squared, "MSE", &full_palette::BLUE)] //any lines
)

Observe that MAE penalizes small differences (errors below 1) more than MSE does, while MSE penalizes big differences more.

## Definition of an outlier

<br>
<br><br><br>
<div align="center">
    <b>A point or small set of points that are <i>"different"</i></b>
</div>


<br>
<br>
<br>
<br>
<br>
<b>Important difference between error measures:</b> different attention to outliers

## Linear Regression: higher powers of absolute error ($L_p$ loss)
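
A compact way to write this family of losses, as a sketch in the notation of the MSE and MAE formulas above (the exponent $p \ge 1$ is the only new symbol):

$$\frac{1}{n}\sum_{i = 1}^n \left| f(x_i) - y_i\right|^p$$

Setting $p = 1$ gives MAE and $p = 2$ gives MSE. Larger $p$ puts more weight on the largest errors, since

$$\left(\sum_{i = 1}^n \left| f(x_i) - y_i\right|^p\right)^{1/p} \rightarrow \max_i \left| f(x_i) - y_i\right| \qquad\textrm{as} \quad p \rightarrow \infty$$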

## In the limit...

* As $p \to \infty$, minimizing the $L_p$ loss converges to minimizing the maximum difference between $f(x_i) = c_0 x_i + c_1$ and $y_i$

* This is called: $L_\infty$ loss

* **Another way to express it:** minimize $z$ such that 

$$ |c_0 x_i + c_1 - y_i| \le z \qquad\textrm{for all} \quad i $$

* **Linear programming formulation:** minimize $z$ such that

$$ c_0 x_i + c_1 - y_i \le z \qquad\textrm{for all} \quad i $$

and

$$ -(c_0 x_i + c_1 - y_i) \le z \qquad\textrm{for all} \quad i $$



## 1D insight into the outlier sensitivity

* **Input:** set of points in $\mathbb R$

* What point minimizes MSE, MAE, ... as a representative of these points?

* MSE aka $L_2$: mean

* MAE aka $L_1$: median

* $L_\infty$: the mean of the maximum and minimum
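
The three representatives above can be compared on a small example to see how each reacts to an outlier. This is a minimal sketch with hypothetical helper names (`mean`, `median`, `midrange`), assuming a non-empty slice with no NaN values.

```rust
/// Minimizer of the L2 loss: the arithmetic mean.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Minimizer of the L1 loss: the median.
fn median(xs: &[f64]) -> f64 {
    let mut sorted = xs.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("no NaN values"));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

/// Minimizer of the L-infinity loss: the mean of the minimum and maximum.
fn midrange(xs: &[f64]) -> f64 {
    let min = xs.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    (min + max) / 2.0
}
```

For `[1, 2, 3, 4, 5]` all three return 3. Replacing 5 with the outlier 100 leaves the median at 3, moves the mean to 22, and moves the midrange to 50.5.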



## Something in between MSE and MAE?

* $L_p$ loss for $p \in (1,2)$?

* Huber loss:
  - quadratic for small distances
  - linear for large distances
  
  $$
  L_\delta(f(x),y) = 
  \begin{cases}
  \frac{1}{2}(y-f(x))^2& \textrm{if $|y-f(x)| \le \delta$}\\
  \delta|y-f(x)| - \frac{1}{2}\delta^2& \textrm{otherwise}\\
  \end{cases}$$

<div align="center">
  <img src="./Huber_loss.svg" width="60%">
</div>
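
The piecewise definition of the Huber loss translates directly into code. A minimal sketch for a single prediction, with the hypothetical function name `huber`:

```rust
/// Huber loss for one prediction: quadratic when the absolute error is
/// at most `delta`, linear beyond that.
fn huber(prediction: f64, target: f64, delta: f64) -> f64 {
    let r = (target - prediction).abs();
    if r <= delta {
        0.5 * r * r
    } else {
        delta * r - 0.5 * delta * delta
    }
}
```

With $\delta = 1$, an error of 0.5 costs 0.125 (quadratic branch) and an error of 3 costs 2.5 (linear branch). The two branches agree at an error of exactly $\delta$, so the loss is continuous.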

## Underfitting

* Model not expressive enough to capture the problem
* Or a solution found does not match the problem

Possible solutions:
* Try harder to find a better solution
* Add more parameters
* Try a different model that captures the solution

## Overfitting

* Predictions adjusted too well to training data
* Error on test data $\ggg$ error on training data

**Possible solutions:**
* Don't optimize the model on the training data too much
* Remove features that are too noisy
* Add more training data
* Reduce model complexity

## Bias and variance

$$\textrm{Total learning error} = \text{Bias}^2 + \text{Variance} + \text{noise}$$

### Bias:  

The error introduced by the simplifying assumptions the model makes so that the target function is easier to approximate.  

Mathematically it is the difference between the average prediction and the true function:

$$E[\hat{f}(x) - f(x)]$$

### Variance: 

The amount that the output of the model changes given different training data.  

Mathematically it is the expected squared deviation of the prediction from its expected value:  

$$E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$

**Bias:** error due to the model being unable to match the complexity of the problem

**Variance:** how much the prediction will vary in response to different training data

**Overfitting:** high variance, low bias

**Underfitting:** high bias, low variance

**Important in practice:**

* detecting the source of problems: variance vs. bias, overfitting vs. underfitting

* navigating the trade-off and finding the sweet spot  

## Some examples


|<font size="5">Algorithm</font>|<font size="5">Bias</font>|<font size="5">Variance</font>|
|:-:|:-:|:-:|
|<font size="5">Linear Regression</font>|<font size="5">High</font>|<font size="5">Low</font>|
|<font size="5">Decision Tree</font>|<font size="5">Low</font>|<font size="5">High</font>|
|<font size="5">Random Forest</font>|<font size="5">Low</font>|<font size="5">High (less than tree)</font>|



## Terminology

**Parameters**
* Variables fixed in a specific instantiation of a model
* Examples:
  * coefficients in linear regression
  * decision tree structure and thresholds
  * weights and thresholds in a neural network


**Hyperparameters**
* Settings chosen before training that control the model's structure or the training procedure, rather than being fitted to the data
* Examples:
  * number of leaves in a decision tree
  * number of layers and their structure in a neural network
  * degree of a polynomial


**Hyperparameter tuning**
* Adjusting hyperparameters before training the final model


**Model selection**
* Deciding on the type of model to be used (linear regression? decision trees? ...)



## Decision Tree discussion

* Tree structure and thresholds for splits are parameters (learned by the algorithm)  
* Many of the other settings are hyperparameters:
  * split_quality: sets the metric used to decide the feature on which to split a node
  * min_impurity_decrease: how much reduction in Gini impurity (or another metric) we must see before we allow a split
  * min_weight_split: sets the minimum weight of samples required to split a node
  * min_weight_leaf: sets the minimum weight of samples that a split has to place in each leaf
  * max_depth: limits the depth of the tree, which affects its structure and how elements can be assigned to nodes

All are documented at great length in the [`DecisionTreeParams` documentation](https://docs.rs/linfa-trees/latest/linfa_trees/struct.DecisionTreeParams.html).


## Challenges of training and cross-validation

**Big goal:** train a model that can be used for predicting

**Intermediate goal:** select the right model and hyperparameters

<div align="center">
    <b>How about trying various options and seeing how they perform on the test set?</b>
</div>


**Information leak danger!**

* If we do it adaptively, information from the test set could affect the model selection

<div align="center">
    <h2><b>Cross–validation</b> attempts to solve this problem </h2>
</div>

Tune your hyperparameters using portions of the training set, and preserve the test set for a final evaluation only.

<div align="center">
  <img src="./cross-validation-flowchart.png" width="60%">
</div>

## Holdout method

* Partition the training data again: training set and validation set

* Use the validation part to estimate accuracy whenever needed


**Pros:**
* Very efficient
* Fine with lots of data when losing a fraction is not a problem

**Cons:**
* Yet another part of data not used for training
* Problematic when the data set is small
* The held-out validation part could contain important information that the model never trains on

## $k$–fold cross–validation

* Partition the training set into $k$ folds at random
* Repeat $k$ times, holding out a different fold each time:
  - train on the other $k-1$ folds
  - estimate the accuracy on the held-out fold
* Return the mean of the $k$ estimates
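
The procedure above can be sketched with the standard library alone. This sketch makes two simplifying assumptions: the data is already shuffled (index `i` simply goes to fold `i % k` instead of a random fold), and the "model" is a hypothetical stand-in that predicts the mean of the training targets, scored by mean squared error. The function names are illustrative only.

```rust
/// Split indices 0..n into k folds; index i goes to fold i % k.
/// Assumes the data was shuffled beforehand (no randomness here).
fn k_fold_indices(n: usize, k: usize) -> Vec<Vec<usize>> {
    assert!(k >= 2 && k <= n, "need 2 <= k <= n");
    let mut folds = vec![Vec::new(); k];
    for i in 0..n {
        folds[i % k].push(i);
    }
    folds
}

/// k-fold cross-validation of a stand-in model that predicts the mean of
/// the training targets, scored by mean squared error on the held-out fold.
fn cross_validate_mean_model(ys: &[f64], k: usize) -> f64 {
    let folds = k_fold_indices(ys.len(), k);
    let mut scores = Vec::with_capacity(k);
    for held_out in &folds {
        // Train on every point that is not in the held-out fold
        let train: Vec<f64> = (0..ys.len())
            .filter(|i| !held_out.contains(i))
            .map(|i| ys[i])
            .collect();
        let prediction = train.iter().sum::<f64>() / train.len() as f64;
        // Evaluate on the held-out fold
        let fold_mse = held_out
            .iter()
            .map(|&i| (ys[i] - prediction).powi(2))
            .sum::<f64>()
            / held_out.len() as f64;
        scores.push(fold_mse);
    }
    // Return the mean of the k estimates
    scores.iter().sum::<f64>() / k as f64
}
```

In practice you would replace the mean predictor with the model you are tuning, and shuffle the indices before assigning folds.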



**Pros:**
* Every data point used for training most of the time
* Less variance in the estimate

**Cons:**
* $k$ times slower

## LOOCV: Leave–one–out cross–validation

* Extreme case of the previous approach: separate fold for each data point


* For each data point $q$:
  - train on data without $q$
  - estimate the accuracy on $q$
* Return the mean of accuracies
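
As a small illustration, here is LOOCV for a hypothetical stand-in model that predicts the mean of the training values, scored by squared error. For this particular model the "retrain without $q$" step is cheap, because the mean without $q$ follows from the overall sum; most models need a full refit per point.

```rust
/// Leave-one-out cross-validation for a stand-in model that predicts the
/// mean of the training values, scored by squared error.
fn loocv_mean_model(ys: &[f64]) -> f64 {
    assert!(ys.len() >= 2, "need at least two points");
    let n = ys.len() as f64;
    let total: f64 = ys.iter().sum();
    let mut sum_sq_err = 0.0;
    for &q in ys {
        // "Train" on the data without q: the mean of all other points
        let prediction = (total - q) / (n - 1.0);
        // Evaluate on q alone
        sum_sq_err += (q - prediction).powi(2);
    }
    // Return the mean over all left-out points
    sum_sq_err / n
}
```

For the values `[1, 2, 3]` the three left-out errors are 2.25, 0, and 2.25, so the LOOCV estimate is 1.5.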


**Cons:**
* Even more expensive

## Many other options

* Generalization: leave–$p$–out cross–validation enumerates over $\binom{n}{p}$ subsets

* Sampling instead of trying all options

* A variation that ensures all classes are evenly distributed across folds (stratified cross–validation)

* ... 



## Training and Cross Validation with Iris Dataset

Classic dataset

* 3 Iris species
* 50 examples in each class
* 4 features (sepal length/width, petal length/width) for each sample

In [None]:
:dep linfa = {version = "0.7.0"}
:dep linfa-datasets = { version = "0.7.0", features = ["iris"] }
:dep linfa-trees = { version = "0.7.0" }
:dep ndarray = { version = "0.16.1" }
:dep ndarray-rand = {version = "0.15.0" }
:dep rand = { version = "0.8.5", features = ["small_rng"] }
:dep smartcore = { version = "0.3.2", features = ["datasets", "ndarray-bindings"] }

use linfa::prelude::*;
use linfa_trees::DecisionTree;
use ndarray_rand::rand::SeedableRng;
use rand::rngs::SmallRng;
use linfa_trees::DecisionTreeParams;

fn crossvalidate() -> Result<(), Box<dyn std::error::Error>> {

    // Load the Iris dataset
    let mut iris = linfa_datasets::iris();

    let mut rng = SmallRng::seed_from_u64(42);

    // Split the data into training and testing sets
    let (train, test) = iris.clone()
        .shuffle(&mut rng)
        .split_with_ratio(0.8);

    // Extract the features (X) and target (y) for training and testing sets
    let X_train = train.records();
    let y_train = train.targets();
    let X_test = test.records();
    let y_test = test.targets();

    // Print the shape of the training and testing sets
    println!("X_train shape: ({}, {})", X_train.nrows(), X_train.ncols());
    println!("y_train shape: ({})", y_train.len());
    println!("X_test shape: ({}, {})", X_test.nrows(), X_test.ncols());
    println!("y_test shape: ({})", y_test.len());

    // Train the model on the training data
    let model = DecisionTree::params()
        .max_depth(Some(3))
        .fit(&train)?;

    // Evaluate the model's accuracy on the training set
    let train_accuracy = model.predict(&train)
        .confusion_matrix(&train)?
        .accuracy();
    println!("Training accuracy: {:.2}%", train_accuracy * 100.0);

    // Evaluate the model's accuracy on the test set
    let test_accuracy = model.predict(&test)
        .confusion_matrix(&test)?
        .accuracy();
    println!("Test accuracy: {:.2}%", test_accuracy * 100.0);
    
    // Define two models with depths 3 and 2
    let dt_params1 = DecisionTree::params().max_depth(Some(3));
    let dt_params2 = DecisionTree::params().max_depth(Some(2));

    // Create a vector of models
    let models = vec![dt_params1, dt_params2];

    // Train and cross-validate using the models
    let scores = iris.cross_validate_single(
        5, 
        &models, 
        |prediction, truth|
            Ok(prediction.confusion_matrix(truth.to_owned())?.accuracy()))?;
    println!("Cross-validation scores: {:?}", scores);

    // Perform cross-validation using fold
    let scores: Vec<_> = iris.fold(5).into_iter().map(|(train, valid)| {
       let model = DecisionTree::params()
          .max_depth(Some(3))
          .fit(&train).unwrap();
       let accuracy = model.predict(&valid).confusion_matrix(&valid).unwrap().accuracy();
       accuracy
    }).collect();
    
    println!("Cross-validation scores general: {:?} {}", scores, scores.iter().sum::<f32>()/scores.len() as f32);

    Ok(())
}

crossvalidate();

X_train shape: (120, 4)
y_train shape: (120)
X_test shape: (30, 4)
y_test shape: (30)
Training accuracy: 96.67%
Test accuracy: 90.00%
Cross-validation scores: [0.93999994, 0.9133333], shape=[2], strides=[1], layout=CFcf (0xf), const ndim=1
Cross-validation scores general: [1.0, 1.0, 1.0, 0.93333334, 1.0] 0.9866667


Let's take a closer look at

```rust
    // Evaluate the model's accuracy on the training set
    let train_accuracy = model.predict(&train)
        .confusion_matrix(&train)?
        .accuracy();
    println!("Training accuracy: {:.2}%", train_accuracy * 100.0);
```

We're creating the confusion matrix, for example

```
                Predicted
                Class1    Class2    Class3
Actual  Class1    TP1       FP12      FP13
        Class2    FP21      TP2       FP23
        Class3    FP31      FP32      TP3
```

Where:

* TP1, TP2, TP3 are True Positives for each class (correct predictions)
* FPxy are misclassifications: the model predicted class y when the sample is actually class x (a false positive for class y and a false negative for class x)

And then calculating accuracy as:

`Accuracy = (TP1 + TP2 + TP3) / Total Predictions`
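
The formula amounts to summing the diagonal and dividing by the sum of all entries. Here is a minimal std-only sketch; the `accuracy` function below is a hypothetical helper written for illustration (it is not linfa's implementation), and it assumes the matrix is indexed as `cm[actual][predicted]`.

```rust
/// Accuracy from a square confusion matrix where cm[actual][predicted]
/// counts samples. Returns None when the matrix contains no samples.
fn accuracy(cm: &[Vec<usize>]) -> Option<f64> {
    // Diagonal entries are the correct predictions (TP1 + TP2 + TP3)
    let correct: usize = cm
        .iter()
        .enumerate()
        .map(|(i, row)| row.get(i).copied().unwrap_or(0))
        .sum();
    // All entries together are the total number of predictions
    let total: usize = cm.iter().flatten().sum();
    if total == 0 {
        None
    } else {
        Some(correct as f64 / total as f64)
    }
}
```

For a made-up matrix `[[10, 0, 0], [0, 9, 1], [0, 2, 8]]`, 27 of 30 predictions lie on the diagonal, so the accuracy is 0.9.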

For the second training experiment:

```rust
    // Define two models with depths 3 and 2
    let dt_params1 = DecisionTree::params().max_depth(Some(3));
    let dt_params2 = DecisionTree::params().max_depth(Some(2));

    // Create a vector of models
    let models = vec![dt_params1, dt_params2];

    // Train and cross-validate using the models
    let scores = iris.cross_validate_single(5, &models, |prediction, truth|
        Ok(prediction.confusion_matrix(truth.to_owned())?.accuracy()))?;
    println!("Cross-validation scores: {:?}", scores);
```

`dt_params1` and `dt_params2` are just parameter configurations, not yet trained models.

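`cross_validate_single` splits the dataset it is called on into 5 folds. Note that the code calls it on the full `iris` dataset, not on `train`: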

```
Initial Split: [Fold1, Fold2, Fold3, Fold4, Fold5]
```

Then train on 4 of the folds and validate on the 5th:

```
For Model 1 (max_depth=3):
Iteration 1: Train on [Fold2, Fold3, Fold4, Fold5], Validate on Fold1
Iteration 2: Train on [Fold1, Fold3, Fold4, Fold5], Validate on Fold2
Iteration 3: Train on [Fold1, Fold2, Fold4, Fold5], Validate on Fold3
Iteration 4: Train on [Fold1, Fold2, Fold3, Fold5], Validate on Fold4
Iteration 5: Train on [Fold1, Fold2, Fold3, Fold4], Validate on Fold5
```

For Model 2 (`max_depth=2`), the same process is repeated with the same folds.
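
The printed result holds one mean score per parameter configuration, so a natural next step is to pick the configuration with the highest score. The sketch below works on a plain slice of `f32` values; `best_config` is a hypothetical helper, not part of linfa.

```rust
/// Index of the highest score, or None if the slice is empty.
/// On ties, the last of the tied entries is returned.
fn best_config(scores: &[f32]) -> Option<usize> {
    scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

With the scores printed above, `best_config(&[0.93999994, 0.9133333])` selects index 0, i.e. the configuration with `max_depth=3`.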

And the last training experiment:

```rust
// Perform cross-validation using fold
let scores: Vec<_> = iris.fold(5).into_iter().map(|(train, valid)| {
   let model = DecisionTree::params()
      .max_depth(Some(3))
      .fit(&train).unwrap();
   let accuracy = model.predict(&valid).confusion_matrix(&valid).unwrap().accuracy();
   accuracy
}).collect();
```

This manually implements 5-fold cross-validation.

Step-by-step description courtesy of Cursor Agent.

1. **Creating the Folds**:
```rust
iris.fold(5)
```
- This splits the Iris dataset into 5 folds
- Returns a `Vec` of `(train, valid)` tuples where:
  - `train` contains 4/5 of the data
  - `valid` contains 1/5 of the data
- Each iteration will use a different fold as the validation set

2. **The Main Loop**:
```rust
.into_iter().map(|(train, valid)| {
```
- Converts the folds into an iterator
- For each iteration, we get a training set and validation set

3. **Model Training**:
```rust
let model = DecisionTree::params()
    .max_depth(Some(3))
    .fit(&train).unwrap();
```
- Creates a decision tree model with max depth of 3
- Trains the model on the current training fold
- Uses `unwrap()`, which panics if training fails rather than handling the error (in production code, you'd want proper error handling)

4. **Model Evaluation**:
```rust
let accuracy = model.predict(&valid)
    .confusion_matrix(&valid)
    .unwrap()
    .accuracy();
```
- Makes predictions on the validation set
- Creates a confusion matrix comparing predictions to true labels
- Calculates the accuracy from the confusion matrix

5. **Collecting Results**:
```rust
}).collect();
```
- Collects all 5 accuracy scores into a vector
- Each score represents the model's performance on a different validation fold
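
As a sketch of what better error handling than `unwrap()` could look like, an iterator of `Result` values can be collected into a single `Result<Vec<_>, _>`, which stops at the first error. To keep the example self-contained it does not use linfa: `score_fold` is a hypothetical stand-in for "train on a fold and compute its accuracy", taking the number of correct predictions and the fold size.

```rust
/// Hypothetical stand-in for "train on a fold and compute its accuracy";
/// fails when the validation fold is empty.
fn score_fold(correct: usize, total: usize) -> Result<f32, String> {
    if total == 0 {
        return Err("empty validation fold".to_string());
    }
    Ok(correct as f32 / total as f32)
}

/// Score every fold, returning the first error instead of panicking.
fn score_all(folds: &[(usize, usize)]) -> Result<Vec<f32>, String> {
    folds
        .iter()
        .map(|&(correct, total)| score_fold(correct, total))
        .collect()
}
```

Inside a function that returns a `Result`, the caller could then write `let scores = score_all(&folds)?;` and let the error propagate.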

The key differences between this and the previous `cross_validate_single` method are:
- This is a more manual implementation
- It only evaluates one model configuration (max_depth=3)
- It gives you direct control over the training and evaluation process
- The results are collected into a simple vector of accuracy scores

This approach is useful when you want more control over the cross-validation process or when you need to implement custom evaluation metrics.
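
With the per-fold scores in a plain vector, summary statistics are easy to compute. The sketch below reports the mean together with the spread across folds; `summarize` is a hypothetical helper, and it uses the population standard deviation (dividing by the number of folds), which is one of several reasonable choices.

```rust
/// Mean and (population) standard deviation of per-fold scores.
/// Returns None for an empty slice.
fn summarize(scores: &[f32]) -> Option<(f32, f32)> {
    if scores.is_empty() {
        return None;
    }
    let n = scores.len() as f32;
    let mean = scores.iter().sum::<f32>() / n;
    // Average squared deviation from the mean
    let variance = scores.iter().map(|s| (s - mean).powi(2)).sum::<f32>() / n;
    Some((mean, variance.sqrt()))
}
```

For the fold scores printed earlier, `[1.0, 1.0, 1.0, 0.93333334, 1.0]`, the mean is about 0.987 and the standard deviation about 0.027, which gives a sense of how much a single fold's score can vary.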
