# Principal Component Analysis

## On a pizza dataset
Link to data (from data.world): https://data.world/sdhilip/pizza-datasets

Or (from Google Drive): https://drive.google.com/file/d/1w1x2r2FckkdVX9Pte9lTcbjyFTG35T6C/view?usp=sharing

Or (from GitHub): https://github.com/pauldubois98/RefresherMaths2023/blob/main/ExercisesSet5/pizza.csv

![pizza database illustration](pizza.png)

Step -1: Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 0: Read data

In [2]:
df = pd.read_csv('pizza.csv')
df.head()

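|   | brand | id | mois | prot | fat | ash | sodium | carb | cal |
|---|-------|-------|-------|-------|-------|------|--------|------|------|
| 0 | A | 14069 | 27.82 | 21.43 | 44.87 | 5.11 | 1.77 | 0.77 | 4.93 |
| 1 | A | 14053 | 28.49 | 21.26 | 43.89 | 5.34 | 1.79 | 1.02 | 4.84 |
| 2 | A | 14025 | 28.35 | 19.99 | 45.78 | 5.08 | 1.63 | 0.80 | 4.95 |
| 3 | A | 14016 | 30.55 | 20.15 | 43.13 | 4.79 | 1.61 | 1.38 | 4.74 |
| 4 | A | 14005 | 30.49 | 21.28 | 41.65 | 4.82 | 1.64 | 1.76 | 4.67 |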


Put the columns of interest (the seven numeric measurements) in a matrix `X`

In [3]:
X = df[['mois', 'prot', 'fat', 'ash', 'sodium', 'carb', 'cal']]
X.shape

(300, 7)

Step 1: Standardize data

In [7]:
means = X.mean(axis=0)
stds = X.std(axis=0)
X = (X - means) / stds
X.head()

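|   | mois | prot | fat | ash | sodium | carb | cal |
|---|-----------|----------|----------|----------|----------|-----------|----------|
| 0 | -1.369526 | 1.252089 | 2.745255 | 1.950635 | 2.971721 | -1.225463 | 2.675659 |
| 1 | -1.299391 | 1.225669 | 2.636070 | 2.131776 | 3.025723 | -1.211598 | 2.530505 |
| 2 | -1.314046 | 1.028292 | 2.846640 | 1.927007 | 2.593708 | -1.223800 | 2.707915 |
| 3 | -1.083752 | 1.053158 | 2.551397 | 1.698611 | 2.539707 | -1.191630 | 2.369224 |
| 4 | -1.090033 | 1.228777 | 2.386506 | 1.722238 | 2.620709 | -1.170554 | 2.256327 |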
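
A quick sanity check for the standardization step can be sketched as follows. The DataFrame `toy` and its values are made up for illustration and are not the pizza data; the point is only that, after subtracting the column means and dividing by the column standard deviations, each column should have mean close to 0 and standard deviation close to 1. Note that pandas' `std` uses the sample standard deviation (`ddof=1`) by default.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the pizza measurements (values are made up for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(loc=[30.0, 20.0, 5.0], scale=[5.0, 2.0, 0.5], size=(50, 3)),
    columns=["mois", "prot", "ash"],
)

# Same recipe as Step 1: subtract the column mean, divide by the column std.
# pandas uses the sample standard deviation (ddof=1) by default.
toy_std = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Each standardized column should now have mean ~0 and sample std ~1.
col_means = toy_std.mean(axis=0)
col_stds = toy_std.std(axis=0)
print(col_means.round(6))
print(col_stds.round(6))
```

The same check can be run on the real `X` by calling `X.mean(axis=0)` and `X.std(axis=0)` after Step 1.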


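Step 2: Compute the covariance matrix

Because `X` was standardized in Step 1, this covariance matrix is also the correlation matrix of the original columns, which is why the diagonal entries are all 1.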

In [9]:
C = np.cov(X.T)
C.round(2)

array([[ 1.  ,  0.36, -0.17,  0.27, -0.1 , -0.59, -0.76],
       [ 0.36,  1.  ,  0.5 ,  0.82,  0.43, -0.85,  0.07],
       [-0.17,  0.5 ,  1.  ,  0.79,  0.93, -0.64,  0.76],
       [ 0.27,  0.82,  0.79,  1.  ,  0.81, -0.9 ,  0.33],
       [-0.1 ,  0.43,  0.93,  0.81,  1.  , -0.62,  0.67],
       [-0.59, -0.85, -0.64, -0.9 , -0.62,  1.  , -0.02],
       [-0.76,  0.07,  0.76,  0.33,  0.67, -0.02,  1.  ]])
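
The ones on the diagonal are no accident: for standardized data, the covariance matrix coincides with the correlation matrix of the original data. A minimal sketch of this, on synthetic data rather than the pizza dataset (the names `raw`, `Z`, `C_np`, `C_manual` and `R` are illustrative only), also shows what `np.cov` computes under the hood when the columns have zero mean:

```python
import numpy as np

# Synthetic data: 40 observations of 3 variables (not the pizza data).
rng = np.random.default_rng(1)
raw = rng.normal(size=(40, 3))
raw[:, 2] += 0.8 * raw[:, 0]  # introduce some correlation between columns

# Standardize with the sample standard deviation (ddof=1), as pandas does.
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
n = Z.shape[0]

C_np = np.cov(Z.T)            # np.cov expects variables in rows, hence the transpose
C_manual = Z.T @ Z / (n - 1)  # valid because the columns of Z have zero mean
R = np.corrcoef(raw.T)        # correlation matrix of the unstandardized data

# Both comparisons should hold up to floating-point error.
print(np.allclose(C_np, C_manual), np.allclose(C_np, R))
```

An equivalent call is `np.cov(Z, rowvar=False)`, which avoids the explicit transpose.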