# Project 2 Report - Nick Vega

# Task 2A

In this task, we will use a perceptron to deal with a 2D classification problem.

Code was provided for generating the training set, test set, and plot.

![Given Data](https://raw.githubusercontent.com/nickmvega/machine-learning/651cd6232136ddccbf6935ff38af8827c4095c44/images_p2/given_data.png)

## Task 2A.1 Perceptron

In this section, we are going to implement a binary classifier with the Perceptron class.

`class Perceptron(object)`

The methods in this class include the following:
- `__init__(self, T=1)`: In this method, the number of iterations T is defined.

- `fit(self, X, y)`: In this method, we train the perceptron model on data X with labels y for T iterations. We are given initialization code for the number of samples, the number of features, a weight vector, and a bias variable. In the code we implemented, an outer loop runs T times and an inner loop runs over the samples. In the inner loop, we check whether the predicted class for X[i] differs from y[i]. If it does, we add y[i] * X[i] to the weight vector and y[i] to the bias.

- `project(self, X)`: In this method, we project data X onto the learned hyperplane with weights w and bias b. The code computes the dot product of X and the weight vector and adds the bias to the result.

- `predict(self, X)`: In this method, we predict class labels for samples in X using the project method defined above, which returns a projection vector. We loop through each element of that vector: if the value is greater than or equal to 0, we set the prediction to 1, otherwise we set it to -1. We then return the predictions.
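
The three methods above can be sketched as follows. This is a minimal illustration under the assumption that labels are in {-1, +1}; the toy arrays `X_toy` and `y_toy` are hypothetical and are not the dataset used in this report, and the vectorized `np.where` in `predict` stands in for the element-wise loop described above.

```python
import numpy as np


class Perceptron:
    """Minimal perceptron sketch; labels are assumed to be in {-1, +1}."""

    def __init__(self, T=1):
        self.T = T  # number of passes over the training data

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.T):
            for i in range(n_samples):
                # Update only when sample i is misclassified.
                if self.predict(X[i:i + 1])[0] != y[i]:
                    self.w += y[i] * X[i]
                    self.b += y[i]
        return self

    def project(self, X):
        # Signed score of each row of X with respect to the hyperplane.
        return X @ self.w + self.b

    def predict(self, X):
        # Scores >= 0 map to +1, everything else to -1.
        return np.where(self.project(X) >= 0, 1, -1)


# Hypothetical toy data: two linearly separable clusters.
X_toy = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -1.0], [-1.5, -3.0]])
y_toy = np.array([1, 1, -1, -1])
clf = Perceptron(T=5).fit(X_toy, y_toy)
print(clf.predict(X_toy))
```

On separable data such as this toy set, the updates stop as soon as one full pass produces no mistakes, so additional iterations leave the weights unchanged.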


Using the Perceptron class defined above, we set T=5 and fit the model on the given Xtrain and ytrain sets. We then predict on the given Xtest. We achieve an accuracy of 100.00% and plot the decision boundary below.

![Initial Perceptron](https://raw.githubusercontent.com/nickmvega/machine-learning/651cd6232136ddccbf6935ff38af8827c4095c44/images_p2/initial_perceptron.png)


This dataset is linearly separable, and the perceptron classifies the test set with 100% accuracy.

## Task 2A.2 Kernel Trick

In this section, we will build kernel functions and a kernel perceptron. We will then plot the decision boundary of the kernel perceptron for each kernel function.

The decision function for the Kernel Perceptron is given by 

$$
f(\mathbf{x}) = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x})\right)
$$

where $k$ is the kernel function, $y_i$ are the labels, and $\alpha_i$ are the learned weights.

The kernel (Gram) matrix induced by the kernel function k over n data points is defined as 

$$
\mathbf{K}=
\left(\begin{array}{ccc} 
k(\mathbf{x}_1,\mathbf{x}_1) & \dots & k(\mathbf{x}_1,\mathbf{x}_n)\\
\vdots & \ddots & \vdots \\
k(\mathbf{x}_n,\mathbf{x}_1) & \dots & k(\mathbf{x}_n,\mathbf{x}_n)
\end{array}\right)
$$ 

Given a test data point **x**, the predicted label is

$$
\hat{y} = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x})\right)
$$

We are given the following generated dataset below:

![Given Data 2](https://raw.githubusercontent.com/nickmvega/machine-learning/651cd6232136ddccbf6935ff38af8827c4095c44/images_p2/given_data_2.png)


### KernelPerceptron

We first created a Kernel Perceptron class described below:

`class KernelPerceptron(object)`

The methods in this class include the following:

`__init__(self, kernel=PolynomialKernel(p = 1), T=1)`: In this method, we define a kernel, the number of iterations T, alpha, Xtrain, ytrain. 

`fit(self, X, y)`: In this method, we fill the Gram matrix defined below
$$
\mathbf{K}=
\left(\begin{array}{ccc} 
k(\mathbf{x}_1,\mathbf{x}_1) & \dots & k(\mathbf{x}_1,\mathbf{x}_n)\\
\vdots & \ddots & \vdots \\
k(\mathbf{x}_n,\mathbf{x}_1) & \dots & k(\mathbf{x}_n,\mathbf{x}_n)
\end{array}\right)
$$ 
by calling the kernel object that is passed in when the KernelPerceptron is initialized.

We then set the alpha values using an outer loop over the T iterations and an inner loop over the samples. In the inner loop, we check whether the function defined below

$$
f(\mathbf{x}) = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x})\right)
$$

is not equal to y[i]. If this is the case, we add 1 to alpha[i].

`project(self, X)`: In this method, we are creating a vector that stores the following result

$$
\text{projection} = \left(\sum_{i=1}^{n} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x})\right)
$$

In the code, we achieve this using two loops: the outer loop runs over the rows of X and the inner loop runs over the rows of Xtrain. In each inner iteration, we compute the following

$$
\alpha_i y_i k(\mathbf{x}_i, \mathbf{x})
$$

and add the result to projection[i].

`predict(self, X)`: In this method, we call the project method we have just described above with X. We then take the resulting projection vector and iterate through each element. For each element, we check if the projection of that element is greater than or equal to 0 and if so set the prediction for this element in our y_hat vector to be 1. Otherwise, we set the prediction for this element in our y_hat vector to be -1. 
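
A compact sketch of the class described above is given below. It is an illustration under stated assumptions, not the exact code of this project: labels are assumed to be in {-1, +1}, the helper `quadratic_kernel` and the XOR-style toy data are hypothetical, and the kernel is passed in explicitly instead of defaulting to `PolynomialKernel(p = 1)`.

```python
import numpy as np


def quadratic_kernel(x, y):
    # Hypothetical stand-in kernel for this sketch: (1 + x.y)^2.
    return (1.0 + np.dot(x, y)) ** 2


class KernelPerceptron:
    def __init__(self, kernel, T=1):
        self.kernel = kernel
        self.T = T

    def fit(self, X, y):
        n = X.shape[0]
        self.Xtrain, self.ytrain = X, y
        self.alpha = np.zeros(n)
        # Gram matrix: K[i, j] = k(x_i, x_j).
        K = np.array([[self.kernel(X[i], X[j]) for j in range(n)]
                      for i in range(n)])
        for _ in range(self.T):
            for i in range(n):
                score = np.sum(self.alpha * y * K[:, i])
                y_hat = 1 if score >= 0 else -1
                if y_hat != y[i]:
                    self.alpha[i] += 1  # count one more mistake on sample i
        return self

    def project(self, X):
        projection = np.zeros(X.shape[0])
        for i in range(X.shape[0]):
            for j in range(self.Xtrain.shape[0]):
                projection[i] += (self.alpha[j] * self.ytrain[j]
                                  * self.kernel(self.Xtrain[j], X[i]))
        return projection

    def predict(self, X):
        return np.where(self.project(X) >= 0, 1, -1)


# Hypothetical XOR-style data, which no linear boundary can separate.
X_toy = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y_toy = np.array([1, 1, -1, -1])
model = KernelPerceptron(kernel=quadratic_kernel, T=5).fit(X_toy, y_toy)
print(model.predict(X_toy))
```

Precomputing the Gram matrix in `fit` avoids re-evaluating the kernel for the same training pair in every iteration.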

### Kernel Functions

We will now create three kernel classes: `PolynomialKernel`, `GaussianKernel`, and `LaplaceKernel`. Each class and its methods are described below.

`PolynomialKernel`. The polynomial kernel function has two methods:

`__init__(self, p=1)`: In this method, p is the degree of the polynomial and set to 1 when not specified

`__call__(self, x, y)`: In this method, we implement the polynomial kernel function which is defined below

$$
k_{\text{poly}}(\mathbf{x},\mathbf{x}', d) = (1+\mathbf{x}^\top \mathbf{x}')^d
$$

In the code, the dot product of x and y is added to 1, and the result is raised to the power p (the degree d in the formula above).
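
A minimal sketch of this kernel, assuming x and y are 1D NumPy arrays of the same length:

```python
import numpy as np


class PolynomialKernel:
    def __init__(self, p=1):
        self.p = p  # degree of the polynomial

    def __call__(self, x, y):
        # (1 + x^T y)^p
        return (1.0 + np.dot(x, y)) ** self.p


poly = PolynomialKernel(p=2)
poly_value = poly(np.array([1.0, 2.0]), np.array([3.0, 0.5]))  # (1 + 3 + 1)^2
print(poly_value)
```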

`GaussianKernel`. The gaussian kernel function has two methods:

`__init__(self, sigma=1)`: In this method, sigma is the bandwidth parameter of the RBF kernel function and defaults to 1 when not specified, as in the signature.

`__call__(self, x, y)`: In this method, we implement the gaussian kernel function which is defined below 

$$
k_{\text{RBF}}(\mathbf{x},\mathbf{x}', \sigma) = \exp\left(-\frac{\lVert \mathbf{x}-\mathbf{x'} \rVert^2_2}{2\sigma^2}\right)
$$

In the code, np.linalg.norm computes the Euclidean norm of x - y. The squared norm is then divided by 2 * sigma ** 2, and np.exp of the negated result is returned.
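
A minimal sketch of this kernel, following the formula above and assuming x and y are 1D NumPy arrays:

```python
import numpy as np


class GaussianKernel:
    def __init__(self, sigma=1):
        self.sigma = sigma  # bandwidth of the RBF kernel

    def __call__(self, x, y):
        # Squared Euclidean distance between x and y.
        sq_dist = np.linalg.norm(x - y) ** 2
        return np.exp(-sq_dist / (2 * self.sigma ** 2))


rbf = GaussianKernel(sigma=5)
rbf_value = rbf(np.array([0.0, 0.0]), np.array([3.0, 4.0]))  # exp(-25 / 50)
print(rbf_value)
```

The kernel equals 1 when both arguments coincide and decays toward 0 as the distance grows relative to sigma.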

`LaplaceKernel`. The Laplace Kernel function has two methods: 

`__init__(self, sigma=1)`: In this method, sigma is the scale parameter of the Laplace kernel function and defaults to 1 when not specified, as in the signature.

`__call__(self, x, y)`: In this method, we implement the laplace kernel function which is defined below 

$$
k_{\text{laplace}}(\mathbf{x},\mathbf{x}', \sigma) = \exp\left(-\frac{\lVert \mathbf{x}-\mathbf{x'} \rVert _1}{\sigma}\right)
$$

In the code, np.linalg.norm with ord=1 computes the L1 norm of x - y. The norm is then divided by sigma, and np.exp of the negated result is returned.
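
A minimal sketch of this kernel, again assuming x and y are 1D NumPy arrays:

```python
import numpy as np


class LaplaceKernel:
    def __init__(self, sigma=1):
        self.sigma = sigma  # scale of the Laplace kernel

    def __call__(self, x, y):
        # L1 (Manhattan) distance between x and y.
        l1_dist = np.linalg.norm(x - y, ord=1)
        return np.exp(-l1_dist / self.sigma)


laplace = LaplaceKernel(sigma=3)
laplace_value = laplace(np.array([1.0, 2.0]), np.array([2.0, 0.0]))  # exp(-3 / 3)
print(laplace_value)
```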

### Kernel Perceptron Results

#### Is the data linearly separable?

We will now check whether the given data is linearly separable by using the KernelPerceptron class with the PolynomialKernel function where p=1. We train the KernelPerceptron on Xtrain and ytrain and predict on Xtest.

Our model achieved the following accuracy:
- Accuracy: 47.50%

This is close to chance level for a binary problem, which suggests that a linear decision boundary does not separate this data. We plot the decision boundary on the dataset below.

![Is Lin Sep?](https://raw.githubusercontent.com/nickmvega/machine-learning/35e74c44e8e9f45b2b819425ddfc9a6fe8831522/images_p2/is%20lin%20sep.png)

#### More powerful kernels

We will now perform classification with the Polynomial Kernel, Gaussian Kernel, and Laplace Kernel functions.

**Polynomial Kernel**

We performed classification with the Kernel Perceptron class given a Polynomial Kernel and varied the degree and number of epochs. The accuracy results were the following:


- Accuracy with 2 epochs and degree 2 : 47.50%
- Accuracy with 2 epochs and degree 3 : 97.50%
- Accuracy with 2 epochs and degree 4 : 95.00%

- Accuracy with 4 epochs and degree 2 : 57.50%
- Accuracy with 4 epochs and degree 3 : 97.50%
- Accuracy with 4 epochs and degree 4 : 95.00%

- Accuracy with 6 epochs and degree 2 : 95.00%
- Accuracy with 6 epochs and degree 3 : 97.50%
- Accuracy with 6 epochs and degree 4 : 92.50%

- Accuracy with 8 epochs and degree 2 : 97.50%
- Accuracy with 8 epochs and degree 3 : 97.50%
- Accuracy with 8 epochs and degree 4 : 97.50%

- Accuracy with 10 epochs and degree 2 : 92.50%
- Accuracy with 10 epochs and degree 3 : 97.50%
- Accuracy with 10 epochs and degree 4 : 97.50%

For a degree of 2, the accuracy first reaches 97.50% at 8 epochs but drops back to 92.50% at 10 epochs, so it does not plateau within the tested range.

For a degree of 3, the accuracy reaches 97.50% after 2 epochs and stays there.

For a degree of 4, it takes about 8 epochs for the accuracy to plateau at 97.50%.

Degrees 2, 3, and 4 were each able to achieve an accuracy of 97.50% for some number of epochs. An example of a decision boundary for a polynomial kernel with p=3 and epochs=10 is seen below.

![Poly P3](https://raw.githubusercontent.com/nickmvega/machine-learning/35e74c44e8e9f45b2b819425ddfc9a6fe8831522/images_p2/poly%20p3.png)

**Gaussian Kernel**

We performed classification with the Kernel Perceptron class given a Gaussian Kernel and varied the sigma and number of epochs. The accuracy results were the following:

- Accuracy with 2 epochs and sigma 1 : 97.50%
- Accuracy with 2 epochs and sigma 2 : 97.50%
- Accuracy with 2 epochs and sigma 3 : 97.50%
- Accuracy with 2 epochs and sigma 4 : 97.50%

- Accuracy with 4 epochs and sigma 1 : 95.00%
- Accuracy with 4 epochs and sigma 2 : 97.50%
- Accuracy with 4 epochs and sigma 3 : 97.50%
- Accuracy with 4 epochs and sigma 4 : 97.50%

- Accuracy with 6 epochs and sigma 1 : 97.50%
- Accuracy with 6 epochs and sigma 2 : 97.50%
- Accuracy with 6 epochs and sigma 3 : 97.50%
- Accuracy with 6 epochs and sigma 4 : 97.50%

- Accuracy with 8 epochs and sigma 1 : 97.50%
- Accuracy with 8 epochs and sigma 2 : 97.50%
- Accuracy with 8 epochs and sigma 3 : 97.50%
- Accuracy with 8 epochs and sigma 4 : 97.50%

- Accuracy with 10 epochs and sigma 1 : 97.50%
- Accuracy with 10 epochs and sigma 2 : 97.50%
- Accuracy with 10 epochs and sigma 3 : 97.50%
- Accuracy with 10 epochs and sigma 4 : 97.50%

For a sigma of 1, it takes about 6 epochs for the accuracy to plateau at 97.50%.

For a sigma of 2, it takes about 2 epochs for the accuracy to plateau at 97.50%.

For a sigma of 3 it takes about 2 epochs for the accuracy to plateau at 97.50%.

For a sigma of 4 it takes about 2 epochs for the accuracy to plateau at 97.50%.

Each sigma value of 1, 2, 3, and 4 was able to achieve an accuracy of 97.50% given a certain number of epochs. An example of a decision boundary for a Gaussian kernel with sigma=3 and epochs=10 is seen below.

![Gaus S3](https://raw.githubusercontent.com/nickmvega/machine-learning/6b5a6551694ff624757ee8dd08561f6e8ca96114/images_p2/gaus%20s3.png)

**Laplace Kernel**

We performed classification with the Kernel Perceptron class given a Laplace Kernel and varied the sigma and number of epochs. The accuracy results were the following:

- Accuracy with 2 epochs and sigma 1 : 95.00%
- Accuracy with 2 epochs and sigma 3 : 95.00%
- Accuracy with 2 epochs and sigma 5 : 95.00%

- Accuracy with 4 epochs and sigma 1 : 97.50%
- Accuracy with 4 epochs and sigma 3 : 95.00%
- Accuracy with 4 epochs and sigma 5 : 95.00%

- Accuracy with 6 epochs and sigma 1 : 97.50%
- Accuracy with 6 epochs and sigma 3 : 95.00%
- Accuracy with 6 epochs and sigma 5 : 47.50%

- Accuracy with 8 epochs and sigma 1 : 97.50%
- Accuracy with 8 epochs and sigma 3 : 95.00%
- Accuracy with 8 epochs and sigma 5 : 95.00%

- Accuracy with 10 epochs and sigma 1 : 97.50%
- Accuracy with 10 epochs and sigma 3 : 95.00%
- Accuracy with 10 epochs and sigma 5 : 95.00%

For a sigma of 1, it takes about 4 epochs for the accuracy to plateau at 97.50%.

For a sigma of 3, it takes about 2 epochs for the accuracy to plateau at 95.00%.

For a sigma of 5 it takes about 8 epochs for the accuracy to plateau at 95.00%.

Sigma=1 reached an accuracy of 97.50% from 4 epochs onward, while sigma=3 and sigma=5 reached only 95.00%. This suggests that smaller sigma values suit this dataset better. Values below 1 were not tested here, so whether they would improve the accuracy further remains open.

An example of a decision boundary for a Laplace kernel with sigma=1 and epochs=10 is seen below.

![Laplace S1](https://raw.githubusercontent.com/nickmvega/machine-learning/6b5a6551694ff624757ee8dd08561f6e8ca96114/images_p2/laplace%20s1.png)

# Task 2B: Real-World Data Analysis: Seoul Bike Rental Data

In this task, we will analyze the SeoulBikeData.csv dataset, which provides information about bike rentals in Seoul. The dataset includes:
- **6 Features**: Weather-related conditions like temperature, humidity, and wind speed.
- **1 Time Feature**: Hour of the day.
- **Target**: The number of rented bikes, with the objective of predicting whether `Rented Bike Count > 500`.

## Step 1

1. **Load and Explore the Dataset**:
   - Load the `SeoulBikeData.csv` file using `pandas`.
   - Display descriptive statistics and visualize feature distributions (e.g., histograms, pair plots).


In this step, we loaded the SeoulBikeData.csv file using pandas and used the given code to binarize y and perform a train/test split.

We display descriptive statistics and visualize feature distributions below:

**DF Sample**

| Index | Rented Bike Count | Hour | Temperature (deg C) | Humidity(%) | Wind speed (m/s) | Visibility (10m) | Dew point temperature (deg C) | Solar Radiation (MJ/m2) | Rainfall(mm) | Snowfall (cm) |
|---|---|---|---|---|---|---|---|---|---|---|
| 8673 | 800 | 9 | 5.0 | 75 | 0.3 | 390 | 0.9 | 0.12 | 0.0 | 0.0 |
| 7244 | 0 | 20 | 18.0 | 67 | 0.7 | 2000 | 11.7 | 0.00 | 0.0 | 0.0 |
| 4194 | 2825 | 18 | 22.6 | 39 | 2.8 | 1655 | 7.9 | 1.10 | 0.0 | 0.0 |
| 5727 | 776 | 15 | 35.3 | 52 | 1.9 | 822 | 23.9 | 2.15 | 0.0 | 0.0 |
| 7083 | 458 | 3 | 18.8 | 90 | 1.0 | 1191 | 17.1 | 0.00 | 0.0 | 0.0 |

**DF description**

| Statistic | Rented Bike Count | Hour | Temperature (deg C) | Humidity(%) | Wind speed (m/s) | Visibility (10m) | Dew point temperature (deg C) | Solar Radiation (MJ/m2) | Rainfall(mm) | Snowfall (cm) |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.00000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | 737.44300 | 11.424000 | 13.256600 | 58.108000 | 1.72070 | 1457.935000 | 4.439100 | 0.573570 | 0.114600 | 0.071400 |
| std | 669.37817 | 6.886774 | 12.066944 | 19.731439 | 1.03171 | 594.772765 | 13.109437 | 0.853094 | 0.846501 | 0.405939 |
| min | 0.00000 | 0.000000 | -16.200000 | 0.000000 | 0.00000 | 54.000000 | -30.500000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 191.00000 | 5.000000 | 3.850000 | 44.000000 | 0.90000 | 977.000000 | -4.600000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 552.00000 | 12.000000 | 14.100000 | 57.000000 | 1.50000 | 1717.000000 | 5.300000 | 0.030000 | 0.000000 | 0.000000 |
| 75% | 1118.75000 | 17.250000 | 23.325000 | 73.000000 | 2.40000 | 2000.000000 | 15.700000 | 0.940000 | 0.000000 | 0.000000 |
| max | 3556.00000 | 23.000000 | 37.900000 | 98.000000 | 6.90000 | 2000.000000 | 26.100000 | 3.490000 | 15.500000 | 5.100000 |

Feature Histograms:  

![Feature Histograms]()

Correlation Matrix:

![Correlation Matrix]()

## Step 2

2. **Preprocessing**:
   - Convert `Rented Bike Count` into a binary target (`1` if > 500, else `0`).
   - Normalize the numerical features using min-max scaling or standardization.

In this step, we convert the Rented Bike Count into a binary target, shown below:

| Index | Rented Bike Count | Hour | Temperature (deg C) | Humidity(%) | Wind speed (m/s) | Visibility (10m) | Dew point temperature (deg C) | Solar Radiation (MJ/m2) | Rainfall(mm) | Snowfall (cm) |
|---|---|---|---|---|---|---|---|---|---|---|
| 8673 | 1 | 9 | 5.0 | 75 | 0.3 | 390 | 0.9 | 0.12 | 0.0 | 0.0 |
| 7244 | 0 | 20 | 18.0 | 67 | 0.7 | 2000 | 11.7 | 0.00 | 0.0 | 0.0 |
| 4194 | 1 | 18 | 22.6 | 39 | 2.8 | 1655 | 7.9 | 1.10 | 0.0 | 0.0 |
| 5727 | 1 | 15 | 35.3 | 52 | 1.9 | 822 | 23.9 | 2.15 | 0.0 | 0.0 |
| 7083 | 0 | 3 | 18.8 | 90 | 1.0 | 1191 | 17.1 | 0.00 | 0.0 | 0.0 |

We also normalize the numerical features using min-max scaling shown below: 

| Index | Rented Bike Count | Hour | Temperature (deg C) | Humidity(%) | Wind speed (m/s) | Visibility (10m) | Dew point temperature (deg C) | Solar Radiation (MJ/m2) | Rainfall(mm) | Snowfall (cm) |
|---|---|---|---|---|---|---|---|---|---|---|
| 8673 | 1.0 | 0.391304 | 0.391867 | 0.765306 | 0.043478 | 0.172662 | 0.554770 | 0.034384 | 0.0 | 0.0 |
| 7244 | 0.0 | 0.869565 | 0.632163 | 0.683673 | 0.101449 | 1.000000 | 0.745583 | 0.000000 | 0.0 | 0.0 |
| 4194 | 1.0 | 0.782609 | 0.717190 | 0.397959 | 0.405797 | 0.822713 | 0.678445 | 0.315186 | 0.0 | 0.0 |
| 5727 | 1.0 | 0.652174 | 0.951941 | 0.530612 | 0.275362 | 0.394656 | 0.961131 | 0.616046 | 0.0 | 0.0 |
| 7083 | 0.0 | 0.130435 | 0.646950 | 0.918367 | 0.144928 | 0.584275 | 0.840989 | 0.000000 | 0.0 | 0.0 |
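
The two preprocessing operations in this step can be sketched with NumPy as follows. The helper names `binarize_count` and `min_max_scale` are hypothetical, and the sketch works on plain arrays instead of the pandas DataFrame used in this project. The inputs are the five sample rows shown above, and the Hour bounds (0 and 23) come from the descriptive statistics in Step 1.

```python
import numpy as np


def binarize_count(counts, threshold=500):
    # 1 if the rented bike count exceeds the threshold, else 0.
    return (np.asarray(counts) > threshold).astype(int)


def min_max_scale(values, lo, hi):
    # Map values from [lo, hi] onto [0, 1].
    values = np.asarray(values, dtype=float)
    return (values - lo) / (hi - lo)


counts = [800, 0, 2825, 776, 458]   # sample rows from the table above
hours = [9, 20, 18, 15, 3]

labels = binarize_count(counts)
hours_scaled = min_max_scale(hours, lo=0, hi=23)
print(labels, np.round(hours_scaled, 6))
```

Passing `lo` and `hi` explicitly makes it possible to reuse the bounds computed on one split when scaling another, which keeps the two splits on the same scale.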

## Step 3

3. **Kernel-Based Modeling**:

In this step, we look for a suitable kernel function for the classification task.

We identify and implement a sigmoid kernel function defined below:

$$
k_{\text{sigmoid}}(\mathbf{x}, \mathbf{x}', \alpha, c) = \tanh(\alpha \cdot \mathbf{x}^\top \mathbf{x}' + c)
$$

`NewKernel(object)` This Kernel class represents a sigmoid kernel function. The following class has two methods:
- `__init__(self, alpha=1, c=0)`: In this method, we initialize alpha with a default of 1 and c with a default of 0. These are the two parameters of the sigmoid kernel function shown above.
- `__call__(self, x, y)`: In this method, we implement the sigmoid kernel function above: we multiply alpha by the dot product of x and y, add c, and apply np.tanh to the result.
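
A minimal sketch of this class, assuming x and y are 1D NumPy arrays:

```python
import numpy as np


class NewKernel:
    """Sigmoid kernel: tanh(alpha * x^T y + c)."""

    def __init__(self, alpha=1, c=0):
        self.alpha = alpha
        self.c = c

    def __call__(self, x, y):
        return np.tanh(self.alpha * np.dot(x, y) + self.c)


sigmoid = NewKernel(alpha=0.5, c=0)
sigmoid_value = sigmoid(np.array([1.0, 1.0]), np.array([1.0, 1.0]))  # tanh(1)
print(sigmoid_value)
```

Because tanh saturates, large values of alpha push most kernel values toward -1 or +1, which is one reason the choice of alpha matters for this kernel.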

We will now optimize the hyperparameters of each kernel, namely the Polynomial Kernel, Gaussian Kernel, Laplace Kernel, and Sigmoid Kernel functions. We optimized the hyperparameters using the following steps:

- Pick a range of parameter values depending on the type of kernel function.
- For each value, create a model with that value, fit it on a training portion of the given Xtrain and ytrain, and evaluate it on an additional validation split held out from them.
- Pick the best-performing value for the kernel function and evaluate the model with that value on the test set.
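
The selection loop in these steps can be sketched as follows. The function `select_hyperparameter`, the `make_model` factory, and the `ThresholdModel` stand-in are all hypothetical names introduced for illustration; the sketch also assumes the training and validation arrays have already been split.

```python
import numpy as np


def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def select_hyperparameter(candidates, make_model, Xtr, ytr, Xval, yval):
    """Return the candidate with the highest validation accuracy."""
    best_param, best_acc = None, -1.0
    for param in candidates:
        model = make_model(param)
        model.fit(Xtr, ytr)
        acc = accuracy(yval, model.predict(Xval))
        if acc > best_acc:  # strict comparison: ties keep the earliest candidate
            best_param, best_acc = param, acc
    return best_param, best_acc


class ThresholdModel:
    """Hypothetical stand-in model: predicts +1 when feature 0 exceeds t."""

    def __init__(self, t):
        self.t = t

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return np.where(X[:, 0] > self.t, 1, -1)


Xval = np.array([[0.1], [0.4], [0.6], [0.9]])
yval = np.array([-1, -1, 1, 1])
best_param, best_acc = select_hyperparameter(
    [0.0, 0.5, 1.0], ThresholdModel, Xval, yval, Xval, yval)
print(best_param, best_acc)
```

With a strict comparison, candidates that tie on validation accuracy resolve to the first one tried, so the order of the candidate list affects which value is reported as optimal.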


**Optimal p in Polynomial Kernel**

Following the steps above, we used a range of [1,2,3,4,5] for the degrees of the polynomial kernel. We chose this range because any degree greater than 5 would give an extremely complex polynomial and would take a long time to compute. We felt that quadratic and cubic polynomials were standard choices for a polynomial kernel and extended that logic to degrees of 4 and 5.

The results were the following:

- Accuracy of polynomial 1 : 75.00%
- Accuracy of polynomial 2 : 46.88%
- Accuracy of polynomial 3 : 76.56%
- Accuracy of polynomial 4 : 46.88%
- Accuracy of polynomial 5 : 71.88%
- Optimal polynomial parameter: 3
- Optimal polynomial accuracy: 76.56%
- Accuracy of optimal polynomial on test set: 72.50%

We found that a polynomial of degree 3 was the optimal hyperparameter, with an accuracy of 72.50% on the test set. This is about 4 percentage points lower than the 76.56% validation accuracy obtained during the hyperparameter search for the same degree. Given that the difference between the two accuracies is not large, the model does not appear to be overfitting and generalizes reasonably well to data it has not seen before.

**Optimal sigma in Gaussian Kernel**

Following the steps above, we used a range of [0.00095, 0.00098, 0.001, 0.00102, 0.00105] for sigma values. We chose this range because values over 1 performed much worse than values below 1. We found this range to be the most useful for locating the optimal hyperparameter; the accuracy no longer changes for sigma values below 0.001.

The results were the following: 

- Accuracy of Gaussian Kernel sigma parameter 0.00095 : 53.12%
- Accuracy of Gaussian Kernel sigma parameter 0.00098 : 53.12%
- Accuracy of Gaussian Kernel sigma parameter 0.001 : 53.12%
- Accuracy of Gaussian Kernel sigma parameter 0.00102 : 53.12%
- Accuracy of Gaussian Kernel sigma parameter 0.00105 : 53.12%
- Optimal Gaussian parameter: 0.00095
- Optimal Gaussian accuracy: 53.12%
- Accuracy of optimal sigma on test set: 47.50%

We found that a sigma of 0.00095 was selected as the optimal hyperparameter, with an accuracy of 47.50% on the test set. All five candidate values gave the same validation accuracy of 53.12%, so the choice among them is effectively arbitrary. One likely explanation is that with sigma this small the kernel value between any two distinct points is close to zero, so unseen samples all receive the same prediction. Given the low accuracy, the Gaussian Kernel may not be a good choice for this dataset in this sigma range. The accuracy during the hyperparameter search was 53.12% compared to 47.50% on the testing data. In this case, our model is not overfitting but instead underfitting, as it performs poorly on both.

**Optimal sigma in Laplace Kernel**

Following the steps above, we used a range of [0.0013, 0.0014, 0.0015, 0.001505, 0.0016] for sigma values. We chose this range because we saw that integer values performed poorly in the previous section, and that values below 1 were going to perform best. Given that, we found that this range gave us the highest accuracy during both the hyperparameter search and testing.

The results were the following:

- Accuracy of Laplace kernel sigma parameter 0.0013 : 69.53%
- Accuracy of Laplace kernel sigma parameter 0.0014 : 73.44%
- Accuracy of Laplace kernel sigma parameter 0.0015 : 75.78%
- Accuracy of Laplace kernel sigma parameter 0.001505 : 75.78%
- Accuracy of Laplace kernel sigma parameter 0.0016 : 75.78%
- Optimal Laplace parameter: 0.0015
- Optimal Laplace accuracy: 75.78%
- Accuracy of optimal sigma on test set: 69.00%

We found that a sigma of 0.0015 was optimal for the Laplace Kernel, with a validation accuracy of 75.78% during the hyperparameter search and 69.00% on the testing set. Given that the difference between the accuracies is not that large, we cannot say our model is overfitting, as it generalizes reasonably well to data it has not seen before.

**Optimal alpha and c in Sigmoid Kernel**

Following the steps above, we used a range of [0.01,0.1,1,10,100] for alpha values and [0, 0.25, 0.5,0.75,1] for c values. We chose these ranges based on research we found online. For c, we found that values between 0 and 1 would be most suitable given the tanh function in the Sigmoid Kernel function. For alpha, we found online that suitable values can vary across datasets, so we decided to try both small and large alpha values in combination with the c values to find the optimal hyperparameters.

The results were the following: 

- Accuracy of sigmoid with alpha 0.01 and c 0 : 71.88%
- Accuracy of sigmoid with alpha 0.01 and c 0.25 : 46.88%
- Accuracy of sigmoid with alpha 0.01 and c 0.5 : 46.88%
- Accuracy of sigmoid with alpha 0.01 and c 0.75 : 46.88%
- Accuracy of sigmoid with alpha 0.01 and c 1 : 46.88%
- Accuracy of sigmoid with alpha 0.1 and c 0 : 71.88%
- Accuracy of sigmoid with alpha 0.1 and c 0.25 : 54.69%
- Accuracy of sigmoid with alpha 0.1 and c 0.5 : 46.88%
- Accuracy of sigmoid with alpha 0.1 and c 0.75 : 46.88%
- Accuracy of sigmoid with alpha 0.1 and c 1 : 46.88%
- Accuracy of sigmoid with alpha 1 and c 0 : 60.16%
- Accuracy of sigmoid with alpha 1 and c 0.25 : 67.19%
- Accuracy of sigmoid with alpha 1 and c 0.5 : 76.56%
- Accuracy of sigmoid with alpha 1 and c 0.75 : 75.00%
- Accuracy of sigmoid with alpha 1 and c 1 : 75.00%
- Accuracy of sigmoid with alpha 10 and c 0 : 65.62%
- Accuracy of sigmoid with alpha 10 and c 0.25 : 63.28%
- Accuracy of sigmoid with alpha 10 and c 0.5 : 58.59%
- Accuracy of sigmoid with alpha 10 and c 0.75 : 60.16%
- Accuracy of sigmoid with alpha 10 and c 1 : 59.38%
- Accuracy of sigmoid with alpha 100 and c 0 : 65.62%
- Accuracy of sigmoid with alpha 100 and c 0.25 : 67.19%
- Accuracy of sigmoid with alpha 100 and c 0.5 : 67.19%
- Accuracy of sigmoid with alpha 100 and c 0.75 : 67.19%
- Accuracy of sigmoid with alpha 100 and c 1 : 67.19%
- Optimal sigmoid parameters: (1, 0.5)
- Optimal sigmoid accuracy: 76.56%
- Accuracy of optimal sigmoid on test set: 83.00%

We found that the optimal alpha was 1 and the optimal c was 0.5. With these hyperparameters, the sigmoid kernel had a validation accuracy of 76.56% during the hyperparameter search and 83.00% on the testing dataset. Here the model performs better on the test data than on the data used for selection. We do not believe that the model is overfitting, because the difference between the accuracies is not large and it performs better on data it has not seen before.

**Across all Kernel Functions Comparison**

Across all kernel functions, the sigmoid kernel with its optimal hyperparameters had the highest accuracy on the testing data, while the Gaussian kernel with its optimal hyperparameter had the lowest. Because the sigmoid kernel has two hyperparameters, we were able to test 25 combinations instead of the 5 values tested for each of the other kernels, which might have helped us find better hyperparameters.

## Step 4

4. **Evaluation and Analysis**:

In the previous step, we found that for the Laplace kernel, the optimal hyperparameter sigma was 0.0015. We will now visualize the decision boundary for the Laplace Kernel with the optimal sigma, using principal component analysis to project the data from the higher-dimensional feature space down to 3 dimensions.

We will then plot the decision boundary in the same graph to visualize. 

We found that 79.76% of the variance is explained by the 3 dominant principal components for the Laplace kernel with the optimal hyperparameter sigma.
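
As an illustration of how an explained-variance figure of this kind can be computed, the sketch below runs a plain PCA through the singular value decomposition. The function name `pca_project` and the random `X_demo` matrix are hypothetical, and this is not necessarily the exact procedure used to produce the number above.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project X onto its top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    Z = X_centered @ Vt[:n_components].T
    return Z, explained_ratio[:n_components]


# Hypothetical random data with 9 features, standing in for the real features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 9))
Z, ratio = pca_project(X_demo, n_components=3)
print(Z.shape, ratio.sum())
```

The sum of the returned ratios is the fraction of total variance retained by the 3D projection.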

The 3D PCA graph for the Laplace Kernel with the optimal sigma is below:

![3d](https://raw.githubusercontent.com/nickmvega/machine-learning/a9c711578934cf4cf6be59131bb018af7b17c5f0/images_p2/3d.png)

# Task 2C Feature Selection

In this task, we will implement and analyze 2 feature selection methods
- Greedy forward selection
- L-1 SVM

## Task 2C.1 Greedy Forward Selection

In this task, we will use the Greedy Forward Selection Algorithm and analyze its feature selection capabilities. 

We first reimport the dataset in a given codeblock. The dataset is reimported with the following features:
- Rented Bike Count	
- Hour	
- Temperature (deg C)	
- Humidity(%)	
- Wind speed (m/s)	
- Visibility (10m)	
- Dew point temperature (deg C)	
- Solar Radiation (MJ/m2)
- Rainfall(mm)	
- Snowfall (cm)

The Greedy forward feature selection algorithm (using accuracy):

Initialize $S = \emptyset$ and $A_0 = -1$. Then, for $i = 1, \dots, d$, find the best feature to add,

$$
s_i = \arg\max_{j \notin S} A_{cv}(S \cup \{j\})
$$

with the corresponding maximum accuracy $A_i = A_{cv}(S \cup \{s_i\})$. If $A_i < A_{i-1}$, break; otherwise set $S = S \cup \{s_i\}$.

We first implement a function for cross validation using accuracy as the metric. 

`accuracy_cross_validation(Xtrain, ytrain, model, k = 10)`

The function can be described as follows.

We partition the data into k folds. Then for each fold, we create a train index and validation index. Then, we use the train index and validation index to create the Xtrain_folds, ytrain_folds with the train index and Xval_folds, yval_folds with the validation index. The function has a model parameter which we now fit using the Xtrain_folds and ytrain_folds we obtained. We then make predictions on the Xval_folds set. We then compute the accuracy and append the accuracy to the accuracy list. The function then returns the mean of the accuracy list. 
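
The procedure described above can be sketched as follows. This version assumes contiguous folds without shuffling and a model object exposing `fit` and `predict`; the `MajorityClassifier` is a hypothetical stand-in used only to make the example runnable.

```python
import numpy as np


def accuracy_cross_validation(Xtrain, ytrain, model, k=10):
    n = Xtrain.shape[0]
    folds = np.array_split(np.arange(n), k)  # k contiguous index blocks
    accuracies = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), val_idx)
        model.fit(Xtrain[train_idx], ytrain[train_idx])
        y_pred = model.predict(Xtrain[val_idx])
        accuracies.append(np.mean(y_pred == ytrain[val_idx]))
    return float(np.mean(accuracies))


class MajorityClassifier:
    """Hypothetical stand-in: always predicts the most common training label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.label = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(X.shape[0], self.label)


X_cv = np.zeros((10, 1))
y_cv = np.array([1] * 8 + [-1] * 2)
cv_score = accuracy_cross_validation(X_cv, y_cv, MajorityClassifier(), k=5)
print(cv_score)
```

If the rows are ordered by class or by time, shuffling the indices before splitting would give folds that better represent the whole dataset.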

We will now implement the greedy forward feature selection algorithm. 

`greedy_forward(Xtrain, ytrain, model, k = 5)`

The function can be described as follows.

We initialize an empty list S, a list V holding the column indices of Xtrain, a best accuracy, and an empty best accuracy list. We then iterate up to the number of columns in Xtrain. In each iteration, we loop over all the indices in V and check that the index is not in S. We create a temporary list containing S together with the current index. We then call the accuracy_cross_validation function described above with Xtrain sliced to all rows and the columns in this temporary list, ytrain, the given model parameter, and the given k parameter. We check whether the accuracy is better than the best accuracy we have stored; if it is, we replace the best accuracy with the current accuracy and remember the index. Once the inner loop ends, we add the best index to S and remove it from V. We also append the best accuracy of that iteration to the best accuracy list. The function returns S and the best accuracy list.
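
A compact sketch of this procedure is given below. It includes the early stop from the algorithm statement above, which is an assumption about the implementation; the `NearestCentroid` model, the `cv_accuracy` helper, and the toy data are hypothetical stand-ins that keep the example self-contained.

```python
import numpy as np


class NearestCentroid:
    """Hypothetical stand-in model exposing fit/predict."""

    def fit(self, X, y):
        self.labels = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.labels])
        return self

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return self.labels[np.argmin(d, axis=1)]


def cv_accuracy(X, y, model, k=5):
    # k-fold accuracy on contiguous, unshuffled folds.
    idx = np.arange(X.shape[0])
    scores = []
    for val_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, val_idx)
        model.fit(X[train_idx], y[train_idx])
        scores.append(np.mean(model.predict(X[val_idx]) == y[val_idx]))
    return float(np.mean(scores))


def greedy_forward(Xtrain, ytrain, model, k=5):
    S, V = [], list(range(Xtrain.shape[1]))
    best_accuracies = []
    prev_best = -1.0
    while V:
        best_acc, best_j = -1.0, None
        for j in V:
            # Score the already selected features plus candidate j.
            acc = cv_accuracy(Xtrain[:, S + [j]], ytrain, model, k)
            if acc > best_acc:
                best_acc, best_j = acc, j
        if best_acc < prev_best:
            break  # adding any remaining feature lowers the accuracy
        S.append(best_j)
        V.remove(best_j)
        best_accuracies.append(best_acc)
        prev_best = best_acc
    return S, best_accuracies


# Hypothetical toy data: feature 0 equals the label, feature 1 is uninformative.
y_toy = np.tile([1, -1], 10)
X_toy = np.column_stack([y_toy.astype(float), np.tile([0.0, 0.0, 1.0, 1.0], 5)])
selected, history = greedy_forward(X_toy, y_toy, NearestCentroid(), k=5)
print(selected, history)
```

The returned accuracy list has one entry per selected feature, which is the series plotted against the number of features added.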


We now plot the cross validated accuracy against the number of features added. The plot is below

## Task 2C.2 L-1 SVM