<a href="https://colab.research.google.com/github/m-mwangi/alu-machine_learning/blob/main/Formative_Assignment_PCA_%5BMarion_Wandia_Mwangi%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to reduce the number of features in a dataset while preserving as much information as possible. It does this by examining how the existing features relate to one another. The crux of the algorithm is to find new axes, called principal components, each of which is a linear combination of the existing features, and then to quantify how relevant each principal component is. The principal components are used to transform high-dimensional data into lower-dimensional data while preserving as much information as possible. For a principal component to be relevant, it needs to capture a large share of the variance in the features. We can determine the relationships between features using covariance.

In [83]:
#import necessary package
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


In [84]:

data = np.array([
    [   1,   2,  -1,   4,  10],
    [   3,  -3,  -3,  12, -15],
    [   2,   1,  -2,   4,   5],
    [   5,   1,  -5,  10,   5],
    [   2,   3,  -3,   5,  12],
    [   4,   0,  -3,  16,   2],
])

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




Explain why we need to handle the data on the same scale.

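Standardizing puts every feature on the same scale, so each one contributes equally to the analysis. Without it, features with larger numeric ranges have larger variances and would dominate the principal components.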
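
To make the effect of scale concrete, the sketch below (an illustration, not part of the original exercise) compares each feature's share of the total variance before and after standardization, using the same `data` array as above. The names `raw_share` and `standardized_variances` are chosen here for illustration only.

```python
import numpy as np

# Same toy dataset as in the notebook (6 samples, 5 features)
data = np.array([
    [1, 2, -1, 4, 10],
    [3, -3, -3, 12, -15],
    [2, 1, -2, 4, 5],
    [5, 1, -5, 10, 5],
    [2, 3, -3, 5, 12],
    [4, 0, -3, 16, 2],
])

# Share of the total variance carried by each raw feature
raw_variances = data.var(axis=0, ddof=1)
raw_share = raw_variances / raw_variances.sum()

# After standardization every feature has variance 1 (ddof=0 matches np.std's default)
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
standardized_variances = standardized.var(axis=0)

print(np.round(raw_share, 3))
print(np.round(standardized_variances, 3))
```

On the raw data the fifth feature alone accounts for roughly three quarters of the total variance, so an unstandardized PCA would mostly follow that one column.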

In [85]:
# Calculate the mean (μ) and standard deviation (σ) for each feature (column-wise)
mean = np.mean(data, axis = 0)
standard_dev = np.std(data, axis = 0)

standardized_data = (data - mean) / standard_dev
# Print the standardized data
print(standardized_data)

[[-1.36438208  0.70710678  1.5109662  -0.99186978  0.77802924]
 [ 0.12403473 -1.94454365 -0.13736056  0.77145428 -2.06841919]
 [-0.62017367  0.1767767   0.68680282 -0.99186978  0.20873955]
 [ 1.61245155  0.1767767  -1.78568733  0.33062326  0.20873955]
 [-0.62017367  1.23743687 -0.13736056 -0.77145428  1.00574511]
 [ 0.86824314 -0.35355339 -0.13736056  1.65311631 -0.13283426]]


![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariance of the different features to determine how they relate to each other. With these covariances, our goal is to group features that show similar patterns. Intuitively, we can relate features if they have similar covariances with other features.
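
As a minimal sketch of what the covariance matrix contains, the block below builds it by hand from the centered data and checks the result against `np.cov`. The name `manual_cov` is introduced here for illustration; the division by `n_samples - 1` mirrors `np.cov`'s default.

```python
import numpy as np

data = np.array([
    [1, 2, -1, 4, 10],
    [3, -3, -3, 12, -15],
    [2, 1, -2, 4, 5],
    [5, 1, -5, 10, 5],
    [2, 3, -3, 5, 12],
    [4, 0, -3, 16, 2],
])

n_samples = data.shape[0]

# Subtract each column's mean so every feature is centered at zero
centered = data - data.mean(axis=0)

# Entry (i, j) is the average product of deviations of feature i and feature j
manual_cov = centered.T @ centered / (n_samples - 1)

# np.cov with rowvar=False treats columns as features and uses the same n - 1 divisor
print(np.allclose(manual_cov, np.cov(data, rowvar=False)))
```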

### Step 2: Calculate the Covariance Matrix



In [86]:
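# Note: this computes the covariance of the raw `data`, not `standardized_data` from Step 1.
# Pass `standardized_data` instead to run PCA on the standardized features (the output below would change).
covariant_matrix = np.cov(data, rowvar=False)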

print(covariant_matrix)

[[  2.16666667  -1.06666667  -1.76666667   5.5         -4.36666667]
 [ -1.06666667   4.26666667   0.46666667  -6.6         19.66666667]
 [ -1.76666667   0.46666667   1.76666667  -3.3          2.36666667]
 [  5.5         -6.6         -3.3         24.7        -27.9       ]
 [ -4.36666667  19.66666667   2.36666667 -27.9         92.56666667]]


### Step 3: Eigendecomposition on the Covariance Matrix


In [87]:


eigenvalues, eigenvectors = np.linalg.eig(covariant_matrix)

print(eigenvalues)
print(eigenvectors)


[107.22475075  16.18237882   1.93173735   0.12757974   0.00022   ]
[[-0.05817655 -0.2631212   0.57237125  0.6292347  -0.45148374]
 [ 0.19774895 -0.03283879  0.06849106 -0.60720902 -0.7657827 ]
 [ 0.0328828   0.17887983 -0.75671562  0.45776292 -0.42983171]
 [-0.33200499 -0.88598416 -0.30234056 -0.11461168  0.01609676]
 [ 0.91989252 -0.33574235 -0.06059523  0.11259736  0.15724145]]
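
A quick way to sanity-check an eigendecomposition is to confirm that each pair satisfies `A v = λ v`. The sketch below does this with `np.linalg.eigh`, which is designed for symmetric matrices such as a covariance matrix and returns real eigenvalues in ascending order. Using `eigh` here is a substitution made for this example; the notebook itself uses `np.linalg.eig`.

```python
import numpy as np

data = np.array([
    [1, 2, -1, 4, 10],
    [3, -3, -3, 12, -15],
    [2, 1, -2, 4, 5],
    [5, 1, -5, 10, 5],
    [2, 3, -3, 5, 12],
    [4, 0, -3, 16, 2],
])

cov = np.cov(data, rowvar=False)

# eigh exploits symmetry; eigenvalues come back sorted from smallest to largest
eigvals, eigvecs = np.linalg.eigh(cov)

# Each column of eigvecs is an eigenvector: cov @ v should equal lambda * v
for i in range(len(eigvals)):
    v = eigvecs[:, i]
    assert np.allclose(cov @ v, eigvals[i] * v)

# The eigenvalues sum to the total variance (the trace of the covariance matrix)
print(np.isclose(eigvals.sum(), np.trace(cov)))
```

Because `eigh` returns ascending order, the sorting step that follows matters even more when it is used.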


### Step 4: Sort the Principal Components

In [88]:
# np.argsort can only provide lowest to highest; use [::-1] to reverse the list

importance_order = np.argsort(eigenvalues)[::-1]
print ( 'the order of importance is :\n {}'.format(importance_order))

# utilize the sort order to sort eigenvalues and eigenvectors
sorted_eigenvalues = eigenvalues[importance_order]
print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))

sorted_eigenvectors = eigenvectors[:, importance_order] # sort the columns
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [0 1 2 3 4]


 sorted eigen values:
[107.22475075  16.18237882   1.93173735   0.12757974   0.00022   ]


 The sorted eigen vector matrix is: 
 [[-0.05817655 -0.2631212   0.57237125  0.6292347  -0.45148374]
 [ 0.19774895 -0.03283879  0.06849106 -0.60720902 -0.7657827 ]
 [ 0.0328828   0.17887983 -0.75671562  0.45776292 -0.42983171]
 [-0.33200499 -0.88598416 -0.30234056 -0.11461168  0.01609676]
 [ 0.91989252 -0.33574235 -0.06059523  0.11259736  0.15724145]]


Question:

1. Why do we order eigenvalues and eigenvectors?

Ordering ranks the principal components by how much variance each one explains, so we can keep the most important components and drop the rest.



2. Is it true we would consider the lowest eigenvalue compared to the highest? Defend your answer.

No. We prefer components with higher eigenvalues over lower ones because they capture more of the variance in the data, which allows a better representation of the data in fewer dimensions.
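
In the example above the eigenvalues happened to come back already sorted, so the reordering is invisible. The small sketch below uses made-up eigenvalues (hypothetical values, chosen only for illustration) to show that the same index array must reorder both the eigenvalues and the eigenvector columns so that pairs stay matched.

```python
import numpy as np

# Hypothetical, deliberately unsorted eigenvalues with identity eigenvectors
eigenvalues = np.array([0.5, 3.0, 1.2])
eigenvectors = np.eye(3)  # column i belongs to eigenvalues[i]

# argsort is ascending, so reverse it to get largest-first
order = np.argsort(eigenvalues)[::-1]

sorted_values = eigenvalues[order]
sorted_vectors = eigenvectors[:, order]  # reorder columns, not rows

print(order)
print(sorted_values)
print(sorted_vectors[:, 0])  # eigenvector paired with the largest eigenvalue
```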


You want to see what percentage of information each eigenvalue holds. Print the percentage for each eigenvalue using the formula:



> (sorted eigenvalues / sum of all sorted eigenvalues) * 100



In [89]:
# use sorted_eigenvalues to ensure the explained variances correspond to the eigenvectors

explained_variance = (sorted_eigenvalues / np.sum(sorted_eigenvalues)) * 100
explained_variance =["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['85.46%', '12.90%', '1.54%', '0.10%', '0.00%']
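
One common way to use these percentages is to accumulate them and keep the smallest number of components that reaches a target. The sketch below reuses the sorted eigenvalues printed above; the 95% `threshold` is an assumption made for this example, not a value from the exercise.

```python
import numpy as np

# Sorted eigenvalues as printed in the notebook output
sorted_eigenvalues = np.array(
    [107.22475075, 16.18237882, 1.93173735, 0.12757974, 0.00022]
)

explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

threshold = 0.95  # assumed target share of variance to retain

# First position where the running total reaches the threshold, converted to a count
k = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 4))
print(k)
```

With these eigenvalues the first component alone falls short of 95%, while the first two together exceed it, which is consistent with the choice of k = 2 in the next step.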


Initialize the number of principal components, k, then multiply the standardized data by the first k sorted eigenvectors. For example, k = 3 keeps 3 principal components.




> The resulting matrix (with reduced data) = standardized data * eigenvector matrix with the first k columns

See the expected output for k = 2.



In [90]:
k = 2

reduced_data = np.matmul(standardized_data, sorted_eigenvectors[:, :k])# transform the original data

In [91]:
print(reduced_data)

[[ 1.31389845  1.22362226]
 [-2.55511419  0.01760889]
 [ 0.61494463  1.08892909]
 [-0.03531847 -1.11250845]
 [ 1.45756867  0.44379893]
 [-0.7959791  -1.66145072]]


In [92]:
print(reduced_data.shape)

(6, 2)
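
As a hedged cross-check of the projection step, the sketch below runs the whole pipeline on the same data with one difference: the covariance is computed from the standardized data, so every step works on the same scale. Under that assumption the projected columns should be uncorrelated, with variances equal to the top k eigenvalues. Its numbers will therefore differ from the output printed above.

```python
import numpy as np

data = np.array([
    [1, 2, -1, 4, 10],
    [3, -3, -3, 12, -15],
    [2, 1, -2, 4, 5],
    [5, 1, -5, 10, 5],
    [2, 3, -3, 5, 12],
    [4, 0, -3, 16, 2],
])

# Step 1: standardize each feature
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Steps 2-4: covariance of the standardized data, eigendecomposition, sort descending
cov = np.cov(standardized, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project onto the first k principal components
k = 2
projected = standardized @ eigvecs[:, :k]

# Covariance of the projected data should be diagonal with the top-k eigenvalues
projected_cov = np.cov(projected, rowvar=False)
print(projected.shape)
print(np.allclose(projected_cov, np.diag(eigvals[:k])))
```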


# What are 2 positive effects and 2 negative effects of PCA?

Give 2 benefits and 2 limitations.

Positive effects:

* Helps identify patterns and clusters in the data.
* Reduces the number of features in a dataset while retaining most of the important information.


Negative effects:

* Principal components are linear combinations of the original features, which makes them difficult to interpret.
* PCA assumes that the relationships in the data are linear, so it can miss nonlinear structure.
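
The trade-off between fewer features and retained information can be measured by reconstructing the data from k components and looking at what is lost. The sketch below is an illustration on the toy dataset from the first part; `reconstruction_error` is a hypothetical helper written for this example, and the covariance is taken from the standardized data.

```python
import numpy as np

data = np.array([
    [1, 2, -1, 4, 10],
    [3, -3, -3, 12, -15],
    [2, 1, -2, 4, 5],
    [5, 1, -5, 10, 5],
    [2, 3, -3, 5, 12],
    [4, 0, -3, 16, 2],
])

standardized = (data - data.mean(axis=0)) / data.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(standardized, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]


def reconstruction_error(k):
    """Sum of squared differences after projecting to k components and back."""
    components = eigenvectors[:, :k]
    reconstructed = standardized @ components @ components.T
    return float(np.sum((standardized - reconstructed) ** 2))


errors = [reconstruction_error(k) for k in range(1, 6)]
print(np.round(errors, 4))
```

The error shrinks as k grows and vanishes when all five components are kept; for any k it equals (n - 1) times the sum of the discarded eigenvalues.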







## **Second PCA with Data from Previous Assignment**

In [93]:
#import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


**The first ten rows of the data, listed by feature, are shown below:**


cylinders: [6, 4, 6, 6, 4, 6, 6, 4, 4, 6]

displ: [3.8, 2.0, 3.6, 3.6, 2.4, 3.5, 3.5, 2.0, 2.0, 3.8]

pv2: [79, 94, 94, 94, 0, 0, 0, 89, 89, 89]

pv4: [0, 0, 0, 0, 95, 99, 99, 0, 0, 0]

city: [16.4596, 21.8706, 17.4935, 16.9415, 24.7726, 19.4325, 18.5752, 17.4460, 20.6741, 16.4675]

UCity: [20.2988, 26.9770, 21.2000, 20.5000, 31.9796, 24.1499, 23.5261, 21.7946, 26.2000, 20.4839]

highway: [22.5568, 31.0367, 26.5716, 25.2190, 35.5340, 28.2234, 26.3573, 26.6295, 29.2741, 24.5605]

UHighway: [30.1798, 42.4936, 35.1000, 33.5000, 51.8816, 38.5000, 36.2109, 37.6731, 41.8000, 34.4972]

comb: [18.7389, 25.2227, 20.6716, 19.8774, 28.6813, 22.6002, 21.4213, 20.6507, 23.8235, 19.3344]

co2: [471, 349, 429, 446, 310, 393, 412, 432, 375, 461]

feScore: [4, 6, 5, 5, 8, 6, 5, 5, 6, 4]

ghgScore: [4, 6, 5, 5, 8, 6, 5, 5, 6, 4]

In [94]:
# Show all the data we'll be working with
data_array = np.array([
    [6, 3.8, 79, 0, 16.4596, 20.2988, 22.5568, 30.1798, 18.7389, 471, 4, 4],
    [4, 2.0, 94, 0, 21.8706, 26.9770, 31.0367, 42.4936, 25.2227, 349, 6, 6],
    [6, 3.6, 94, 0, 17.4935, 21.2000, 26.5716, 35.1000, 20.6716, 429, 5, 5],
    [6, 3.6, 94, 0, 16.9415, 20.5000, 25.2190, 33.5000, 19.8774, 446, 5, 5],
    [4, 2.4, 0, 95, 24.7726, 31.9796, 35.5340, 51.8816, 28.6813, 310, 8, 8],
    [6, 3.5, 0, 99, 19.4325, 24.1499, 28.2234, 38.5000, 22.6002, 393, 6, 6],
    [6, 3.5, 0, 99, 18.5752, 23.5261, 26.3573, 36.2109, 21.4213, 412, 5, 5],
    [4, 2.0, 89, 0, 17.4460, 21.7946, 26.6295, 37.6731, 20.6507, 432, 5, 5],
    [4, 2.0, 89, 0, 20.6741, 26.2000, 29.2741, 41.8000, 23.8235, 375, 6, 6],
    [6, 3.8, 89, 0, 16.4675, 20.4839, 24.5605, 34.4972, 19.3344, 461, 4, 4]
])




Standardization of the data

In [95]:
# Calculate the mean (μ) and standard deviation (σ) for each feature (column-wise)
mean = np.mean(data_array, axis=0)
std_dev = np.std(data_array, axis=0)

standardized_array = (data_array - mean) / std_dev

# Print the standardized data
print(standardized_array)

[[ 0.81649658  1.01928133  0.39205993 -0.65447944 -0.99013236 -0.9544673
  -1.44377411 -1.38775261 -1.15515073  1.29947657 -1.25723711 -1.25723711]
 [-1.22474487 -1.33290635  0.75507839 -0.65447944  1.10783734  0.91357742
   0.98565031  0.74729142  1.07175924 -1.20900668  0.53881591  0.53881591]
 [ 0.81649658  0.75792714  0.75507839 -0.65447944 -0.58926545 -0.70238112
  -0.2935656  -0.53465741 -0.49135035  0.43590037 -0.3592106  -0.3592106 ]
 [ 0.81649658  0.75792714  0.75507839 -0.65447944 -0.8032886  -0.89818707
  -0.68107482 -0.81207547 -0.76412433  0.78544312 -0.3592106  -0.3592106 ]
 [-1.22474487 -0.81019798 -1.51983728  1.46755287  2.23300978  2.31291864
   2.27409127  2.37504185  2.25964148 -2.01089887  2.33486893  2.33486893]
 [ 0.81649658  0.62725005 -1.51983728  1.5569016   0.16252965  0.12277313
   0.17966207  0.05485595  0.17104185 -0.3043078   0.53881591  0.53881591]
 [ 0.81649658  0.62725005 -1.51983728  1.5569016  -0.16986537 -0.05171795
  -0.35496086 -0.34204259 -0.2338

Calculate the Covariance Matrix

In [96]:
cov_matrix = np.cov(standardized_array, rowvar=False)

print(cov_matrix)

[[ 1.11111111  1.09068453 -0.11417017  0.1124691  -0.7659399  -0.76811256
  -0.78552487 -0.83029691 -0.77659384  0.7703947  -0.69250027 -0.69250027]
 [ 1.09068453  1.11111111 -0.13128082  0.11721238 -0.71216827 -0.70740493
  -0.73717032 -0.76722045 -0.72446233  0.74498667 -0.64934493 -0.64934493]
 [-0.11417017 -0.13128082  1.11111111 -1.10522472 -0.50985875 -0.55411124
  -0.46331841 -0.4704149  -0.49715869  0.49863808 -0.57037961 -0.57037961]
 [ 0.1124691   0.11721238 -1.10522472  1.11111111  0.52469965  0.56280202
   0.49311586  0.48942627  0.51734632 -0.52768481  0.59464923  0.59464923]
 [-0.7659399  -0.71216827 -0.50985875  0.52469965  1.11111111  1.1056822
   1.08282664  1.06981029  1.10774696 -1.09829822  1.05673021  1.05673021]
 [-0.76811256 -0.70740493 -0.55411124  0.56280202  1.1056822   1.11111111
   1.07387467  1.07676575  1.10106192 -1.08589405  1.05054107  1.05054107]
 [-0.78552487 -0.73717032 -0.46331841  0.49311586  1.08282664  1.07387467
   1.11111111  1.09604077  1.0989
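
The diagonal of this covariance matrix is 1.1111 rather than 1. A likely explanation, given that the standardization cell uses `np.std` with its default `ddof=0`, is that `np.cov` divides by n - 1 by default, so each diagonal entry becomes n / (n - 1) = 10 / 9. The sketch below checks this on random data (generated here purely for illustration) with the same number of rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))  # 10 samples, like the notebook's data_array

# Standardize with the population std (ddof=0), as the notebook does
z_population = (X - X.mean(axis=0)) / X.std(axis=0)

# Standardize with the sample std (ddof=1), matching np.cov's divisor
z_sample = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

diag_population = np.diag(np.cov(z_population, rowvar=False))
diag_sample = np.diag(np.cov(z_sample, rowvar=False))

print(np.round(diag_population, 4))  # n / (n - 1) for every feature
print(np.round(diag_sample, 4))      # exactly 1 for every feature
```

The mismatch only rescales all eigenvalues by the same factor, so the explained-variance percentages and the principal directions are unaffected.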

Eigendecomposition on the Covariance Matrix

In [97]:
eigen_values, eigen_vectors = np.linalg.eig(cov_matrix)

print(eigen_values)
print(eigen_vectors)

[10.24529721  2.45883383  0.40355715  0.12099646  0.06134573  0.03848733
  0.0040349   0.00052959  0.00025113  0.          0.          0.        ]
[[-0.2385727   0.42295567  0.44585843 -0.00129212 -0.27763527  0.25545673
   0.4013366   0.06447963 -0.02677638 -0.23011816  0.49198276 -0.1298563 ]
 [-0.22588056  0.42860263  0.55979702 -0.12128813  0.33671958 -0.17690511
  -0.26492033 -0.0081714   0.026108    0.21281766 -0.45650817  0.12066398]
 [-0.16037148 -0.56179021  0.41406145  0.0440683  -0.08640073  0.18905914
   0.17711502 -0.61414281 -0.04203024  0.06813831 -0.16900194  0.04724347]
 [ 0.16609825  0.55940646 -0.37232833  0.04343202 -0.07087124  0.24376461
   0.0626357  -0.63861208 -0.06853702  0.07526127 -0.18523942  0.05164329]
 [ 0.32465753 -0.00041249  0.15641464 -0.3764967  -0.15490509 -0.2567781
  -0.17464064 -0.1197934  -0.51182376 -0.57000703  0.31203392  0.02010883]
 [ 0.32485624  0.01863884  0.06357044 -0.40892013  0.10351187 -0.38884158
   0.3870572  -0.17670874  0.610749

Sort the Principal Components

In [98]:
# np.argsort can only provide lowest to highest; use [::-1] to reverse the list

order_of_importance = np.argsort(eigen_values)[::-1]
print('The order of importance is:\n{}'.format(order_of_importance))

# Utilize the sort order to sort eigenvalues and eigenvectors
sorted_eigen_values = eigen_values[order_of_importance]
print('\n\nSorted Eigenvalues:\n{}'.format(sorted_eigen_values))

sorted_eigen_vectors = eigen_vectors[:, order_of_importance]  # Sort the columns
print('\n\nThe sorted eigenvector matrix is:\n{}'.format(sorted_eigen_vectors))



The order of importance is:
[ 0  1  2  3  4  5  6  7  8 10  9 11]


Sorted Eigenvalues:
[10.24529721  2.45883383  0.40355715  0.12099646  0.06134573  0.03848733
  0.0040349   0.00052959  0.00025113  0.          0.          0.        ]


The sorted eigenvector matrix is:
[[-0.2385727   0.42295567  0.44585843 -0.00129212 -0.27763527  0.25545673
   0.4013366   0.06447963 -0.02677638  0.49198276 -0.23011816 -0.1298563 ]
 [-0.22588056  0.42860263  0.55979702 -0.12128813  0.33671958 -0.17690511
  -0.26492033 -0.0081714   0.026108   -0.45650817  0.21281766  0.12066398]
 [-0.16037148 -0.56179021  0.41406145  0.0440683  -0.08640073  0.18905914
   0.17711502 -0.61414281 -0.04203024 -0.16900194  0.06813831  0.04724347]
 [ 0.16609825  0.55940646 -0.37232833  0.04343202 -0.07087124  0.24376461
   0.0626357  -0.63861208 -0.06853702 -0.18523942  0.07526127  0.05164329]
 [ 0.32465753 -0.00041249  0.15641464 -0.3764967  -0.15490509 -0.2567781
  -0.17464064 -0.1197934  -0.51182376  0.31203392 -0.5700070

You want to see what percentage of information each eigenvalue holds.

In [99]:
explained_variance = (sorted_eigen_values / np.sum(sorted_eigen_values)) * 100
explained_variance =["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['76.84%', '18.44%', '3.03%', '0.91%', '0.46%', '0.29%', '0.03%', '0.00%', '0.00%', '0.00%', '0.00%', '0.00%']
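
The last three eigenvalues are zero, which means three directions carry no variance at all. Two properties of this dataset would account for that: the `feScore` and `ghgScore` columns are identical, and with only 10 samples the centered data can span at most 9 dimensions even though there are 12 features. The sketch below checks both; the `tolerance` used to decide what counts as zero is an assumption for this example.

```python
import numpy as np

data_array = np.array([
    [6, 3.8, 79, 0, 16.4596, 20.2988, 22.5568, 30.1798, 18.7389, 471, 4, 4],
    [4, 2.0, 94, 0, 21.8706, 26.9770, 31.0367, 42.4936, 25.2227, 349, 6, 6],
    [6, 3.6, 94, 0, 17.4935, 21.2000, 26.5716, 35.1000, 20.6716, 429, 5, 5],
    [6, 3.6, 94, 0, 16.9415, 20.5000, 25.2190, 33.5000, 19.8774, 446, 5, 5],
    [4, 2.4, 0, 95, 24.7726, 31.9796, 35.5340, 51.8816, 28.6813, 310, 8, 8],
    [6, 3.5, 0, 99, 19.4325, 24.1499, 28.2234, 38.5000, 22.6002, 393, 6, 6],
    [6, 3.5, 0, 99, 18.5752, 23.5261, 26.3573, 36.2109, 21.4213, 412, 5, 5],
    [4, 2.0, 89, 0, 17.4460, 21.7946, 26.6295, 37.6731, 20.6507, 432, 5, 5],
    [4, 2.0, 89, 0, 20.6741, 26.2000, 29.2741, 41.8000, 23.8235, 375, 6, 6],
    [6, 3.8, 89, 0, 16.4675, 20.4839, 24.5605, 34.4972, 19.3344, 461, 4, 4],
])

n_samples, n_features = data_array.shape
standardized = (data_array - data_array.mean(axis=0)) / data_array.std(axis=0)
cov = np.cov(standardized, rowvar=False)

# eigvalsh returns real eigenvalues for a symmetric matrix
eigenvalues = np.linalg.eigvalsh(cov)

tolerance = 1e-10  # assumed cutoff below which an eigenvalue is treated as zero
n_nonzero = int(np.sum(eigenvalues > tolerance))

# feScore (column 10) and ghgScore (column 11) carry the same values
duplicate_scores = np.array_equal(data_array[:, 10], data_array[:, 11])

print(n_nonzero, n_samples - 1, duplicate_scores)
```

In practice this means that at most 9 components can carry information for this sample, regardless of how many features are listed.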


Initialize the number of principal components, k, then perform matrix multiplication of the standardized data with the first k sorted eigenvectors.

In [100]:
k = 3
reduced_data = np.matmul(standardized_array, sorted_eigen_vectors[:, :k])# transform the original data

print(reduced_data)

[[-3.75141298  0.15205974 -0.03096503]
 [ 2.66996284 -1.8780594   0.2387649 ]
 [-1.81597178 -0.13622403  0.85823159]
 [-2.36734833 -0.11341548  0.63917468]
 [ 6.83406369  0.92171383  0.35461622]
 [ 0.83462079  2.39002074 -0.15970948]
 [-0.46678299  2.3235411  -0.79209793]
 [-0.66178285 -1.84183155 -1.24612696]
 [ 1.93464617 -1.78716937 -0.14057736]
 [-3.20999458 -0.03063557  0.27868937]]


In [101]:
print(reduced_data.shape)

(10, 3)
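
The notebook imports `sklearn.decomposition.PCA` but performs the reduction by hand. Library implementations such as that one typically obtain the components from a singular value decomposition of the centered data rather than from the covariance matrix. The NumPy-only sketch below, run on random data generated for illustration, suggests the two routes agree up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 4))  # illustrative data, not the notebook's dataset
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n_samples = Z.shape[0]
k = 2

# Route 1: eigendecomposition of the covariance matrix, sorted largest-first
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
projected_eig = Z @ eigvecs[:, :k]

# Route 2: SVD of the centered data; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
projected_svd = Z @ Vt[:k].T

# Singular values relate to eigenvalues by lambda = s^2 / (n - 1)
print(np.allclose(S ** 2 / (n_samples - 1), eigvals))

# Projections match except for a possible sign flip per component
print(np.allclose(np.abs(projected_eig), np.abs(projected_svd)))
```

The sign ambiguity is expected: an eigenvector multiplied by -1 is still an eigenvector, so reduced data from two implementations can differ by column signs without either being wrong.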
