# Machine Learning: Programming Exercise 1 - Rust

## Linear Regression
In this exercise, you will implement linear regression and get to see it work on data. Before starting on this programming exercise, we strongly recommend watching the video lectures and completing the review questions for the associated topics.

To get started with the exercise, you will need to download the starter code and unzip its contents to the directory where you wish to complete the exercise. If needed, change to this directory before starting this exercise.

Prepare the Rust environment.

In [4]:
:dep polars = { version = "0.20.0", features = ["lazy", "csv-file", "ndarray"] }
:dep plotly = { version = "0.7.0", features = ["plotly_ndarray"] }
:dep ndarray = "0.15.4"

In [3]:
use polars::prelude::*;

Error: failed to resolve: use of undeclared crate or module `polars`

## Linear regression with one variable

In this part of this exercise, you will implement linear regression with one variable to predict profits for a food truck. Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities. You would like to use this data to help you select which city to expand to next.

In [2]:
// rust load data from ex1data1.txt
use std::sync::Arc;

let mut schema = Schema::new();
schema.with_column("population".to_string(), DataType::Float64);
schema.with_column("profit".to_string(), DataType::Float64);

let mut df = LazyCsvReader::new("./ex1data1.txt".into()).with_schema(Arc::new(dbg!(schema))).has_header(false).finish()?.collect()?;

// dbg!(&df);
dbg!(df.head(Some(3)));

Error: failed to resolve: use of undeclared type `Schema`

Error: failed to resolve: use of undeclared type `DataType`

Error: failed to resolve: use of undeclared type `DataType`

Error: failed to resolve: use of undeclared type `LazyCsvReader`

### Plotting the data

Before starting on any task, it is often useful to understand the data by visualizing it. For this dataset, you can use a scatter plot to visualize the data, since it has only two properties to plot (profit and population). Many other problems that you will encounter in real life are multi-dimensional and can't be plotted on a 2-d plot.

In [76]:
// rust code to plot the data

use plotly::{Plot,Scatter};
use plotly::common::{Mode, Title};
use plotly::ndarray::ArrayTraces;
use ndarray::prelude::*;

let mut plot = Plot::new();
let ndarray = df.to_ndarray::<Float64Type>()?;
let trace = Scatter::from_array(ndarray.slice(s![..,0]).into_owned(),ndarray.slice(s![..,1]).into_owned()).mode(Mode::Markers);
plot.add_trace(trace);
// let layout = Layout::new().height(800);
// plot.set_layout(layout);
plot.lab_display();

### Gradient Descent

In this section, you will fit the linear regression parameters to our dataset using gradient descent.

#### Update Equations

The objective of linear regression is to minimize the cost function

$$
J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^m ( h_{\theta}(x^{(i)})-y^{(i)} )^{2}
$$

where the hypothesis $h_\theta(x)$ is given by the linear model

$$
h_\theta(x) = \theta^Tx =\theta_{0}+\theta_{1}x_1
$$

The training examples are stored in X row-wise, as follows:

$$
X =
\begin{bmatrix}
  x^{(1)}_0 & x^{(1)}_1 \\
  x^{(2)}_0 & x^{(2)}_1 \\
  x^{(3)}_0 & x^{(3)}_1
\end{bmatrix},
\quad
\theta =
\begin{bmatrix}
  \theta_0 \\
  \theta_1
\end{bmatrix}
$$

You can calculate the hypothesis as a column vector of size $m \times 1$ with:

$$
{{h}_{\theta }}\left( X \right)=X{\theta }
$$

Recall that the parameters of your model are the $\theta$ values. These are the values you will adjust to minimize cost $J(\theta)$. One way to do this is to use the batch gradient descent algorithm. In batch gradient descent, each iteration performs the update

$$
\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \quad (\text{simultaneously update } \theta_j \text{ for all } j)
$$

With each step of gradient descent, your parameters $\theta_j$ come closer to the optimal values that will achieve the lowest cost $J(\theta)$.

**Implementation Note:** We store each example as a row in the X matrix. To take into account the intercept term ($\theta_0$), we add an additional first column to **X** and set it to all ones. This allows us to treat $\theta_0$ as simply another 'feature'.

##### Vectorized Implementation

Vectorization is the act of replacing the loops in a computer program with matrix operations. A good linear algebra library (such as ndarray in Rust or NumPy in Python) can then run optimized routines for those operations. Mathematically, the loop-based function and the vectorized function compute the same result.

Cost function vectorized: $J(\theta) = \frac{1}{2m}(X\theta - \vec{y})^T(X\theta-\vec{y})$

#### Implementation

In this script, we have already set up the data for linear regression. In the following lines, we add another dimension to our data to accommodate the $\theta_0$ intercept term. Run the code below to initialize the parameters to 0 and the learning rate alpha to 0.01.
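The setup described above can be sketched as follows. This is a minimal, dependency-free illustration that uses plain `Vec<f64>` rows instead of the notebook's ndarray types; the names `add_intercept` and `initial_parameters` are hypothetical helpers, not part of the starter code.

```rust
/// Prepend a column of ones to every example so that theta_0 acts as the intercept.
/// (Hypothetical helper; rows are plain vectors rather than an ndarray matrix.)
fn add_intercept(features: &[Vec<f64>]) -> Vec<Vec<f64>> {
    features
        .iter()
        .map(|row| {
            let mut with_bias = Vec::with_capacity(row.len() + 1);
            with_bias.push(1.0); // x_0 = 1 for every example
            with_bias.extend_from_slice(row);
            with_bias
        })
        .collect()
}

/// Parameters initialized to zero, plus the learning rate alpha = 0.01 named in the text.
fn initial_parameters(n_columns: usize) -> (Vec<f64>, f64) {
    (vec![0.0; n_columns], 0.01)
}
```

For the single-variable dataset, `add_intercept` turns each one-element row into a two-element row, so `initial_parameters(2)` yields one parameter for the intercept and one for the population feature.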

#### Computing the cost $J(\theta)$

As you perform gradient descent to minimize the cost function $J(\theta)$, it is helpful to monitor the convergence by computing the cost. In this section, you will implement a function to calculate $J(\theta)$ so you can check the convergence of your gradient descent implementation.

**Exercise:** Write a vectorized implementation of the cost function.
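One possible shape for the cost function is sketched below. It assumes the examples are stored as rows that already include the leading 1 for the intercept, and it uses iterators over plain slices rather than ndarray so that it has no dependencies. The function names `predict` and `compute_cost` are illustrative choices.

```rust
/// Hypothesis h_theta(x) = theta^T x for one example row (intercept column included).
fn predict(row: &[f64], theta: &[f64]) -> f64 {
    row.iter().zip(theta).map(|(x, t)| x * t).sum()
}

/// J(theta) = 1/(2m) * sum over all examples of (h_theta(x) - y)^2.
fn compute_cost(x: &[Vec<f64>], y: &[f64], theta: &[f64]) -> f64 {
    let m = y.len() as f64;
    let squared_error: f64 = x
        .iter()
        .zip(y)
        .map(|(row, target)| (predict(row, theta) - target).powi(2))
        .sum();
    squared_error / (2.0 * m)
}
```

A quick sanity check is to call `compute_cost` with parameters that fit a tiny hand-made dataset exactly; the cost should then be zero.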

In [None]:
// rust code to compute the cost function

#### Gradient Descent

Next, you will implement gradient descent. The loop structure has been written for you, and you only need to supply the updates to $\theta$ within each iteration.

As you program, make sure you understand what you are trying to optimize and what is being updated. Keep in mind that the cost $J(\theta)$ is parameterized by the vector $\theta$, not $X$ and $y$. That is, we minimize the value of $J(\theta)$ by changing the values of the vector $\theta$, not by changing $X$ or $y$. Refer to the equations given earlier and to the video lectures if you are uncertain.

A good way to verify that gradient descent is working correctly is to look at the value of $J$ and check that it is decreasing with each step. If it is not, you may need to decrease the learning rate $\alpha$.

After you are finished, run this section. The code below will use your final parameters to plot the linear fit. The result should look something like Figure 2 below:

Your final values for $\theta$ will also be used to make predictions on profits in areas of 35,000 and 70,000 people.

We want our hypothesis $h_\theta(x)$ to fit the data as well as possible. Therefore, we want to minimize the cost function $J(\theta)$. Gradient descent is an algorithm used to do that.

The formal definition of gradient descent:

$$
\begin{aligned}
&\text{repeat} \ \{ \\
&\quad \theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i = 1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \\
&\}
\end{aligned}
$$

An illustration of gradient descent on a single variable:

![gradientdescent](https://github.com/rickwierenga/CS229-Python/raw/8c899e94c7bdf60031ce0b80402285820ce1cb44/ex1/notes/gradientdescent.png)

**Exercise:** Implement the gradient descent algorithm in Rust.
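The update rule above could be sketched as follows. This is an illustrative, dependency-free version on plain slices, not the notebook's ndarray solution; the function name `gradient_descent` and the choice to return a cost history are assumptions made for the example. Rows of `x` are expected to include the leading 1 for the intercept.

```rust
/// Batch gradient descent over plain slices.
/// Returns the final theta and the cost measured at the start of each iteration.
fn gradient_descent(
    x: &[Vec<f64>],
    y: &[f64],
    mut theta: Vec<f64>,
    alpha: f64,
    iterations: usize,
) -> (Vec<f64>, Vec<f64>) {
    let m = y.len() as f64;
    let mut cost_history = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        // Residuals h_theta(x^(i)) - y^(i), all computed from the current theta.
        let residuals: Vec<f64> = x
            .iter()
            .zip(y)
            .map(|(row, target)| {
                let prediction: f64 = row.iter().zip(&theta).map(|(xi, t)| xi * t).sum();
                prediction - target
            })
            .collect();
        // Track J(theta) so convergence can be monitored.
        cost_history.push(residuals.iter().map(|r| r * r).sum::<f64>() / (2.0 * m));
        // Simultaneous update: every theta_j is moved using the same residuals.
        for (j, t) in theta.iter_mut().enumerate() {
            let gradient: f64 = x.iter().zip(&residuals).map(|(row, r)| r * row[j]).sum();
            *t -= alpha * gradient / m;
        }
    }
    (theta, cost_history)
}
```

Because the residuals are computed once per iteration before any parameter changes, the update is simultaneous for all $j$. Inspecting the returned cost history is one way to check that $J(\theta)$ decreases from step to step.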

In [None]:
// rust code to compute the gradient

## Linear regression with multiple variables - Multivariate Linear Regression

In this part, you will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices.

The file `ex1data2.txt` contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house. Run this section now to preview the data.

In [None]:
// rust code to load ex1data2.txt data

### Feature Normalization

This section of the script will start by loading and displaying some values from this dataset. By looking at the values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, first performing feature scaling can make gradient descent converge much more quickly.

Your task here is to complete the code to:

- Subtract the mean value of each feature from the dataset.
- After subtracting the mean, additionally scale (divide) the feature values by their respective "standard deviations".

The standard deviation is a way of measuring how much variation there is in the range of values of a particular feature (most data points will lie within $\pm 2$ standard deviations of the mean).

You will do this for all the features and your code should work with datasets of all sizes (any number of features / examples). Note that each column of the matrix **X** corresponds to one feature.

**Implementation Note:** When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations. After learning the parameters from the model, we often want to predict the prices of houses we have not seen before. Given a new x value (house size and number of bedrooms), we must first normalize x using the mean and standard deviation that we had previously computed from the training set.

**Add the bias term**
Now that we have normalized the features, we again add a column of ones corresponding to $\theta_0$ to the data matrix X.

Formally, each feature is normalized as:

$$
x := \frac{x - \mu}{\sigma}
$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

**Important:** It is crucial to store $\mu$ and $\sigma$ if you want to make predictions using the model later.

**Exercise:** Perform feature normalization on the following dataset.
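A minimal sketch of feature normalization is shown below, under two stated assumptions: the standard deviation is the sample standard deviation (dividing by $m - 1$), and no feature column is constant. The helper names `feature_normalize` and `normalize_with` are hypothetical, and the code works on plain vectors rather than ndarray.

```rust
/// Column-wise normalization. Returns (normalized rows, mu, sigma).
/// Assumes at least two rows and no constant column (sigma would be zero).
/// Sigma is the sample standard deviation (divides by m - 1), an assumption of this sketch.
fn feature_normalize(x: &[Vec<f64>]) -> (Vec<Vec<f64>>, Vec<f64>, Vec<f64>) {
    let m = x.len() as f64;
    let n = x[0].len();
    let mu: Vec<f64> = (0..n)
        .map(|j| x.iter().map(|row| row[j]).sum::<f64>() / m)
        .collect();
    let sigma: Vec<f64> = (0..n)
        .map(|j| {
            let sum_sq: f64 = x.iter().map(|row| (row[j] - mu[j]).powi(2)).sum();
            (sum_sq / (m - 1.0)).sqrt()
        })
        .collect();
    let normalized: Vec<Vec<f64>> = x.iter().map(|row| normalize_with(row, &mu, &sigma)).collect();
    (normalized, mu, sigma)
}

/// Apply stored mu and sigma to one example, e.g. a new house before prediction.
fn normalize_with(sample: &[f64], mu: &[f64], sigma: &[f64]) -> Vec<f64> {
    sample
        .iter()
        .enumerate()
        .map(|(j, v)| (v - mu[j]) / sigma[j])
        .collect::<Vec<f64>>()
}
```

Returning `mu` and `sigma` alongside the normalized data keeps them available for later predictions, where `normalize_with` applies the training-set statistics to an unseen example.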

In [None]:
// rust code to normalize the data

### Gradient Descent

Previously, you implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.

You should complete the code to implement the cost function and gradient descent for linear regression with multiple variables. If your code in the previous part (single variable) already supports multiple variables, you can use it here too.

Make sure your code supports any number of features and is well-vectorized. With an ndarray matrix, you can call its `ncols()` method to find out how many columns are present in the dataset.

We have provided starter code below that runs gradient descent with a particular learning rate (alpha). Your task is to first make sure that your functions computeCost and gradientDescent already work with this starter code and support multiple variables.

**Implementation Note:** In the multivariate case, the cost function can also be written in the following vectorized form:

$$
J(\theta)=\frac{1}{2m}\left(X\theta-\vec{y}\right)^T\left(X\theta-\vec{y}\right)
$$

where

$$
X = \begin{bmatrix}
    - \ (x^{(1)})^T \ - \\
    - \ (x^{(2)})^T \ - \\
    \vdots \\
    - \ (x^{(m)})^T \ - \\
    \end{bmatrix},
\ \ 
\vec{y} = \begin{bmatrix}
    y^{(1)} \\
    y^{(2)} \\
    \vdots \\
    y^{(m)} \\
    \end{bmatrix}
$$

The vectorized version is efficient when you're working with numerical computing libraries such as ndarray. If you are an expert with matrix operations, you can prove to yourself that the two forms are equivalent.

Finally, you should complete and run the code below to predict the price of a 1650-square-foot, 3-bedroom house using the value of $\theta$ obtained above.

**Hint:** At prediction, make sure you do the same feature normalization. Recall that the first column of X is all ones. Thus, it does not need to be normalized.

Remember the algorithm for gradient descent:

$$
\begin{aligned}
&\text{repeat} \ \{ \\
&\quad \theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i = 1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \\
&\}
\end{aligned}
$$

The vectorization for multivariate gradient descent:

$$
\theta := \theta - \frac{\alpha}{m}X^T(X\theta - \vec{y})
$$

**Exercise:** Implement gradient descent for multiple features. Make sure your solution is vectorized and supports any number of features.

In [None]:
// rust code to implement gradient descent

### Normal Equations

In the lecture videos, you learned that the closed-form solution to linear regression is

$$
\theta = \left(X^T X\right)^{-1} X^T \vec{y} 
$$

Using this formula does not require any feature scaling, and you will get an exact solution in one calculation: there is no "loop until convergence" like in gradient descent.

Complete the code to use the formula above to calculate $\theta$, then run the code in this section. Remember that while you don't need to scale your features, you still need to add a column of 1's to the X matrix to have an intercept term ($\theta_0$). Note that the code below will add the column of 1's to X for you.

**Optional (ungraded) exercise:** Now, once you have found $\theta$ using this method, use it to make a price prediction for a 1650-square-foot house with 3 bedrooms. You should find that it gives the same predicted price as the value you obtained using the model fit with gradient descent.
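The closed-form solution can be sketched without any linear algebra crate by solving the system $(X^T X)\theta = X^T \vec{y}$ directly instead of forming the inverse. The version below swaps in Gaussian elimination with partial pivoting for the explicit inverse in the formula; the function name `normal_equation` and the singularity tolerance `1e-12` are assumptions of this example. Rows of `x` are expected to include the leading 1.

```rust
/// Solve (X^T X) theta = X^T y by Gaussian elimination with partial pivoting.
/// Returns None when X^T X is numerically singular (tolerance is an arbitrary choice).
fn normal_equation(x: &[Vec<f64>], y: &[f64]) -> Option<Vec<f64>> {
    let n = x.first()?.len();
    // Augmented matrix [X^T X | X^T y], one row per parameter.
    let mut a = vec![vec![0.0; n + 1]; n];
    for (row, target) in x.iter().zip(y) {
        for j in 0..n {
            for k in 0..n {
                a[j][k] += row[j] * row[k];
            }
            a[j][n] += row[j] * target;
        }
    }
    for col in 0..n {
        // Partial pivoting: bring the largest remaining entry onto the diagonal.
        let pivot = (col..n).max_by(|&p, &q| a[p][col].abs().total_cmp(&a[q][col].abs()))?;
        if a[pivot][col].abs() < 1e-12 {
            return None;
        }
        a.swap(col, pivot);
        // Eliminate the entries below the pivot.
        for r in (col + 1)..n {
            let factor = a[r][col] / a[col][col];
            for k in col..=n {
                let delta = factor * a[col][k];
                a[r][k] -= delta;
            }
        }
    }
    // Back substitution on the upper-triangular system.
    let mut theta = vec![0.0; n];
    for j in (0..n).rev() {
        let known: f64 = ((j + 1)..n).map(|k| a[j][k] * theta[k]).sum();
        theta[j] = (a[j][n] - known) / a[j][j];
    }
    Some(theta)
}
```

Returning `Option` makes the degenerate case explicit: when two columns of X are linearly dependent, $X^T X$ has no inverse and the sketch returns `None` rather than dividing by zero.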

In [None]:
// rust code to implement normal equation
