<h1><b><i>Principal Component Analysis</i> (<i>PCA</i>)</b></h1> <p>In this exercise you will study the <b><i>Principal Component Analysis</i></b> (<b><i>PCA</i></b>) algorithm, implemented with the <b><i>covariance method</i></b>. To see why the method is useful, you will train a <b><i>logistic regression</i></b> model on a dataset and evaluate its accuracy before and after applying <b><i>PCA</i></b>. More information about this method can be found <a href="https://ourarchive.otago.ac.nz/handle/10523/7534">here</a>.</p> <p>The exercise includes <b><i>two</i></b> <i>Python</i> programs: (a) the first reads a dataset in <i>.csv</i> format, applies <b><i>PCA</i></b>, and writes the file <b><i>foo.csv</i></b>, which contains the dataset projected onto the principal components that the user chose to keep, and (b) the second reads a <i>.csv</i> file, splits the dataset into a <i>training set</i> and a <i>test set</i>, trains a <b><i>logistic regression</i></b> model on the <i>training set</i>, and counts the model errors on the <i>test set</i>.</p> <p>The dataset you will use is provided in two forms: (a) <b><i><a href="https://github.com/netmode/Stochastic-Processes-and-Optimization-in-Machine-Learning-Lab/blob/main/lab2/demo3a.csv">demo3a.csv</a></i></b> and (b) <b><i><a href="https://github.com/netmode/Stochastic-Processes-and-Optimization-in-Machine-Learning-Lab/blob/main/lab2/demo3b.csv">demo3b.csv</a></i></b>, which omits the first column of <b><i><a href="https://github.com/netmode/Stochastic-Processes-and-Optimization-in-Machine-Learning-Lab/blob/main/lab2/demo3a.csv">demo3a.csv</a></i></b>, i.e., the <i>labels</i> that correspond to each input. These datasets are a simplified form of the dataset that can be found <a href="https://archive.ics.uci.edu/ml/datasets/wine">here</a>.</p>
<h3><b><i>Principal Component Analysis</i></b></h3> <p>First, you will load the libraries required by the program that decomposes the dataset <b><i><a href="https://github.com/netmode/Stochastic-Processes-and-Optimization-in-Machine-Learning-Lab/blob/main/lab2/demo3b.csv">demo3b.csv</a></i></b> into its principal components.</p>




In [None]:
!pip install numpy
#https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/
from numpy import genfromtxt
from numpy import mean
from numpy import cov
from numpy.linalg import eig
import numpy as np

<p>Next, you will load the dataset <b><i>demo3b.csv</i></b>.</p>

In [None]:
# Use the raw-file URL; the /blob/ page returns an HTML page rather than the CSV itself
data = genfromtxt('https://raw.githubusercontent.com/netmode/Stochastic-Processes-and-Optimization-in-Machine-Learning-Lab/main/lab2/demo3b.csv', delimiter=',')

<p>Then, you will compute the mean of each dataset column (feature) and normalize each feature by subtracting its mean (mean centering).</p>

In [None]:
M = mean(data.T, axis=1)
data_normal = data - M
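
<p>As a quick sanity check of the centering step, the sketch below applies the same operation to a small made-up array (the values are illustrative and are not taken from <i>demo3b.csv</i>) and confirms that every column mean becomes zero. Note that this step only shifts the features; it does not rescale them to unit variance.</p>

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), 2 features (columns)
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

# Column means, computed the same way as in the cell above
toy_mean = np.mean(toy.T, axis=1)

# Broadcasting subtracts each column's mean from every row
toy_centered = toy - toy_mean

print(toy_mean)                    # mean of each feature
print(toy_centered.mean(axis=0))   # every entry should be numerically zero
```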

<p>Next, you will compute the <b><i>covariance matrix</i></b> of the mean-centered dataset.</p>

In [None]:
covariance = cov(data_normal.T)
print("The covariance matrix of the normalized data is the following: ")
print(covariance)
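
<p>For a centered data matrix with one sample per row, the sample covariance matrix is the product of the transposed matrix with itself, divided by the number of samples minus one. The sketch below checks this against <i>numpy.cov</i> on randomly generated data; the array and the variable names are assumptions made for illustration only.</p>

```python
import numpy as np

# Hypothetical data: 50 samples, 3 features, generated with a fixed seed
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 3))
toy_centered = toy - toy.mean(axis=0)

# Manual sample covariance: X^T X / (n - 1)
n_samples = toy_centered.shape[0]
manual_cov = toy_centered.T @ toy_centered / (n_samples - 1)

# numpy.cov expects one variable per row, hence the transpose
library_cov = np.cov(toy_centered.T)

print(manual_cov.shape)
print(np.allclose(manual_cov, library_cov))
```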

<p>The next step is to compute the <b><i>eigenvalues</i></b> and <b><i>eigenvectors</i></b> of the covariance matrix.</p>

In [None]:
values, vectors = eig(covariance)
print("The eigenvalues of the normalized data are the following: ")
print(values)
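
<p>When reading the output of <i>eig</i>, keep in mind that the eigenvectors are returned as the <i>columns</i> of the second array, so column <i>i</i> pairs with eigenvalue <i>i</i>. The sketch below verifies this on a small symmetric matrix chosen for illustration (it is not the covariance matrix of the exercise).</p>

```python
import numpy as np
from numpy.linalg import eig

# Hypothetical 2x2 symmetric matrix with eigenvalues 1 and 3
sym = np.array([[2.0, 1.0],
                [1.0, 2.0]])

vals, vecs = eig(sym)

# For every i, column i of vecs should satisfy  sym @ v = lambda_i * v
column_checks = [np.allclose(sym @ vecs[:, i], vals[i] * vecs[:, i])
                 for i in range(len(vals))]

print(vals)
print(column_checks)
```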

<p>Next, you will select the most important (largest) <b><i>eigenvalues</i></b> and keep only the corresponding <b><i>eigenvectors</i></b>.</p>

In [None]:
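# eig does not return the eigenvalues in any particular order, so sort them in descending order
order = np.argsort(values)[::-1]
new_values = values[order[0:3]]
print("The most important eigenvalues are the following: ")
print(new_values)
# The eigenvectors are the columns of `vectors`; store the selected ones as rows
new_vectors = vectors[:, order[0:3]].T
print("The most important eigenvectors are the following: ")
print(new_vectors)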
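
<p>One common way to judge how many components are worth keeping is the fraction of the total variance that each eigenvalue accounts for. The sketch below computes this ratio and its running total for a made-up set of eigenvalues; the numbers are hypothetical and do not come from the exercise dataset.</p>

```python
import numpy as np

# Hypothetical eigenvalues in arbitrary order (they sum to 8.0)
toy_values = np.array([0.5, 4.0, 0.25, 2.0, 1.25])

# Indices that sort the eigenvalues from largest to smallest
order = np.argsort(toy_values)[::-1]
sorted_values = toy_values[order]

# Share of the total variance carried by each component, and its running total
explained_ratio = sorted_values / sorted_values.sum()
cumulative_ratio = np.cumsum(explained_ratio)

print(sorted_values)
print(explained_ratio)
print(cumulative_ratio)
```

<p>In this toy case the three largest eigenvalues already account for more than 90% of the total variance, which is the kind of pattern to look for when comparing the leading eigenvalues with the rest.</p>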

<p>Now, you will project the mean-centered dataset onto the selected <b><i>eigenvectors</i></b> in order to obtain the new, lower-dimensional dataset.</p>

In [None]:
new_data = new_vectors.dot(data_normal.T)
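
<p>A useful property to check after the projection is that the new features are uncorrelated: their covariance matrix should be diagonal, with the kept eigenvalues on the diagonal. The sketch below illustrates this on synthetic data. It is a minimal example under stated assumptions: the data are randomly generated, and it uses <i>numpy.linalg.eigh</i> (a routine intended for symmetric matrices) instead of the <i>eig</i> call used in the exercise.</p>

```python
import numpy as np
from numpy.linalg import eigh

# Hypothetical correlated data: 200 samples, 4 features
rng = np.random.default_rng(1)
toy = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
toy_centered = toy - toy.mean(axis=0)

# Eigen-decomposition of the (symmetric) covariance matrix
vals, vecs = eigh(np.cov(toy_centered.T))

# Keep the two eigenvectors with the largest eigenvalues, stored as rows
order = np.argsort(vals)[::-1][:2]
components = vecs[:, order].T                 # shape (2, 4)

# Same projection as in the cell above: components times centered data
projected = components.dot(toy_centered.T)    # shape (2, 200)

# Covariance of the projected data should equal diag(kept eigenvalues)
projected_cov = np.cov(projected)
print(projected.shape)
print(np.allclose(projected_cov, np.diag(vals[order])))
```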

<p>Save the new dataset to a <i>csv</i> file.</p>

In [None]:
np.savetxt("foo.csv", new_data.T, delimiter=",")

<h4><b><i>Questions</i></b></h4> <ul> <li>Study the above program and briefly describe the steps followed by the <b><i>PCA</i></b> algorithm, implemented using the <b><i>covariance</i></b> method. Include the mathematical operations as well.</li> <li>Apply the <b><i>PCA</i></b> algorithm to the data in the file <i><a href="https://github.com/netmode/Stochastic-Processes-and-Optimization-in-Machine-Learning-Lab/blob/main/lab2/demo3b.csv">demo3b.csv</a></i>. Then, record the dataset’s <i>covariance matrix</i> and the eigenvalues of this matrix. What do the positive values and what do the negative values of the <i>covariance matrix</i> indicate? Sort the <i>eigenvalues</i> in descending order. What do you observe about the first three compared to the rest? How many <i>principal components</i> does the algorithm choose to keep?</li> </ul>

<h3><b><i>Logistic Regression</i></b></h3> <p>First, you will load the necessary libraries.</p>

In [None]:
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install scikit-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

<p>Next, you will load the datasets that will be used to train the <b><i>logistic regression</i></b> model. In the first case, you will train the model using the file <i><a href="https://github.com/netmode/Stochastic-Processes-and-Optimization-in-Machine-Learning-Lab/blob/main/lab2/demo3a.csv">demo3a.csv</a></i>. In the second case, you will load the file <i>foo.csv</i> that you obtained as output from the previous code section of the exercise (<i>Principal Component Analysis</i>), after inserting as its first column the labels found in the first column of the file <a href="https://github.com/netmode/Stochastic-Processes-and-Optimization-in-Machine-Learning-Lab/blob/main/lab2/demo3a.csv"><i>demo3a.csv</i></a>.</p>

In [None]:
# Use the raw-file URL; the /blob/ page returns an HTML page rather than the CSV itself
df = pd.read_csv("https://raw.githubusercontent.com/netmode/Stochastic-Processes-and-Optimization-in-Machine-Learning-Lab/main/lab2/demo3a.csv")
#df = pd.read_csv("foo.csv")
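
<p>The label column can be added to the reduced dataset programmatically rather than by hand. The sketch below shows one possible way with <i>numpy.column_stack</i>; the arrays and the output file name <i>foo_labelled.csv</i> are hypothetical stand-ins for the labels of <i>demo3a.csv</i> and the contents of <i>foo.csv</i>.</p>

```python
import os
import tempfile
import numpy as np

# Hypothetical labels (first column of the labelled dataset)
labels = np.array([1, 2, 3, 1])

# Hypothetical reduced dataset: 4 samples, 3 principal components
reduced = np.array([[0.1, 0.2, 0.3],
                    [1.1, 1.2, 1.3],
                    [2.1, 2.2, 2.3],
                    [3.1, 3.2, 3.3]])

# Put the labels in front of the component columns
labelled = np.column_stack((labels, reduced))

# Write to a temporary location (hypothetical file name) and read it back
out_path = os.path.join(tempfile.mkdtemp(), "foo_labelled.csv")
np.savetxt(out_path, labelled, delimiter=",")
reloaded = np.genfromtxt(out_path, delimiter=",")

print(reloaded.shape)
print(reloaded[:, 0])
```

<p>A file written by <i>numpy.savetxt</i> has no header row, whereas <i>pandas.read_csv</i> treats the first row as a header by default; passing <i>header=None</i> when loading such a file keeps the first sample in the data.</p>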

<p>Then, you will split the dataset into a <i>training</i> and a <i>test set</i>.</p>

In [None]:
# Separate the input features from the target variable
x = df.iloc[:,1:13].values
y = df.iloc[:,0].values
# Split the dataset into train and test set
Xtrain,Xtest,Ytrain,Ytest = train_test_split(x,y, test_size = 0.2)
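
<p>Because <i>train_test_split</i> shuffles the data randomly, the error count can change from run to run, which makes it harder to compare the two input files. The sketch below shows how the optional <i>random_state</i> and <i>stratify</i> parameters give a repeatable, class-balanced split. The toy arrays and the seed value are assumptions for illustration; the exercise itself does not fix a seed.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, two classes of 10 samples each
toy_x = np.arange(40, dtype=float).reshape(20, 2)
toy_y = np.array([0] * 10 + [1] * 10)

# Same seed and same inputs give the same split; stratify keeps class proportions
split_a = train_test_split(toy_x, toy_y, test_size=0.2, random_state=0, stratify=toy_y)
split_b = train_test_split(toy_x, toy_y, test_size=0.2, random_state=0, stratify=toy_y)

X_train, X_test, y_train, y_test = split_a

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, split_b[1]))   # identical test sets across calls
print(np.bincount(y_test))                  # samples per class in the test set
```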

<p>Then, you will train the <b><i>Logistic Regression</i></b> model.</p>

In [None]:
classifier = LogisticRegression(max_iter = 1000)
classifier.fit(Xtrain,Ytrain)

<p>Finally, you will obtain the predictions of the model you trained on the test set and compute the total number of errors.</p>

In [None]:
# Get the predictions on the test set
prediction = classifier.predict(Xtest)

# Calculate the total number of errors on the test set
errors = 0
for index in range(len(prediction)):  # iterate over every prediction, including the last one
	if prediction[index] != Ytest[index]:
		errors += 1

print("Total errors on the test dataset")
print(errors)
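
<p>The same count can be obtained without an explicit loop by comparing the two arrays element-wise, and the accuracy follows directly from it. The sketch below uses made-up prediction and label arrays, not output of the model above.</p>

```python
import numpy as np

# Hypothetical predictions and true labels for 5 test samples
toy_prediction = np.array([1, 2, 3, 1, 2])
toy_truth = np.array([1, 2, 1, 1, 3])

# Element-wise comparison gives a boolean array; summing it counts the mismatches
toy_errors = int(np.sum(toy_prediction != toy_truth))

# Accuracy is the fraction of test samples that were classified correctly
toy_accuracy = 1.0 - toy_errors / len(toy_truth)

print(toy_errors)
print(toy_accuracy)
```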

<h4><b><i>Question</i></b></h4> <p>Run the above code segments using as input the files (a) <i><a href="https://github.com/netmode/Stochastic-Processes-and-Optimization-in-Machine-Learning-Lab/blob/main/lab2/demo3a.csv">demo3a.csv</a></i> and (b) <i>foo.csv</i>. What do you observe about the model’s accuracy in the two cases? Also try the case where we keep (a) 1 and (b) 2 <i>principal components</i>. What do you observe?</p>

<h3><b><i>Additional Questions</i></b></h3> <ul> <li>What is the usefulness of the <b><i>PCA</i></b> algorithm with respect to the ability to <i>visualize</i> the dataset’s data?</li> <li>What is the usefulness of the <b><i>PCA</i></b> algorithm with respect to the training speed of the logistic regression model? Base your answer on the following two code segments.</li> </ul>