Factor Analysis (FA) is a linear factor model that assumes the observables are a linear combination of factors (or latent variables) plus noise, and additionally that they follow a Gaussian distribution. The observed variables are also assumed to be conditionally independent given the latent variables.

* Fewer factors than original features in data space. 
* Different types of methods and solutions.
* More elaborate framework than PCA

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

import os, sys
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import scale

In [3]:
datasource = "datasets/winequality-red.csv"
print(os.path.exists(datasource))

True


In [4]:
df = pd.read_csv(datasource).sample(frac = 1).reset_index(drop = True)

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,728,8.2,0.59,0.0,2.5,0.093,19.0,58.0,1.0002,3.5,0.65,9.3,6
1,640,7.3,0.365,0.49,2.5,0.088,39.0,106.0,0.9966,3.36,0.78,11.0,5
2,919,10.2,0.54,0.37,15.4,0.214,55.0,95.0,1.00369,3.18,0.77,9.0,6
3,1112,7.7,0.965,0.1,2.1,0.112,11.0,22.0,0.9963,3.26,0.5,9.5,5
4,468,9.4,0.395,0.46,4.6,0.094,3.0,10.0,0.99639,3.27,0.64,12.2,7


In [6]:
del df["Unnamed: 0"]

In [7]:
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,8.2,0.59,0.0,2.5,0.093,19.0,58.0,1.0002,3.5,0.65,9.3,6
1,7.3,0.365,0.49,2.5,0.088,39.0,106.0,0.9966,3.36,0.78,11.0,5
2,10.2,0.54,0.37,15.4,0.214,55.0,95.0,1.00369,3.18,0.77,9.0,6
3,7.7,0.965,0.1,2.1,0.112,11.0,22.0,0.9963,3.26,0.5,9.5,5
4,9.4,0.395,0.46,4.6,0.094,3.0,10.0,0.99639,3.27,0.64,12.2,7


In [8]:
X = np.array(df.iloc[:, :-1])

In [9]:
y = np.array(df["quality"])

In [10]:
df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


## Factor Analysis with sklearn

In [11]:
fa = FactorAnalysis(n_components = 5)

In [12]:
X_features = fa.fit_transform(X)

In [13]:
print("Features shape \n", X_features.shape)

Features shape 
 (1599, 5)


## Estimation of the factor model
Factor analysis proposes the following explanation for the structure of the observables: they are linear combinations of latent variables plus noise.
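
Written out, the model could look as follows. The symbols $\mu$, $z$, $\epsilon$, $\Psi$, $p$ and $k$ are notation assumed here for illustration; $L$ is the loadings matrix, $p$ the number of observed variables and $k$ the number of factors.

$$
x = \mu + L z + \epsilon, \qquad z \sim \mathcal{N}(0, I_k), \qquad \epsilon \sim \mathcal{N}(0, \Psi), \qquad \Psi = \operatorname{diag}(\psi_1, \dots, \psi_p)
$$

Marginalizing over $z$ gives

$$
x \sim \mathcal{N}(\mu,\; L L^{\top} + \Psi)
$$

The diagonal form of $\Psi$ is what encodes the conditional independence of the observed variables given the latent variables. Note that sklearn stores the loadings transposed: `fa.components_` has shape $(k, p)$, so it corresponds to $L^{\top}$ in this notation.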

### Factor loadings
The factor loadings form the matrix L, which maps the latent variables to the observables X minus their mean and the noise.

In [14]:
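def FactorLoadings(components, n_components = 5):
    """Wrap the loadings matrix in a labeled DataFrame for pretty printing.

    `components` is fa.components_ with shape (n_components, n_features);
    it is transposed so that rows are features and columns are factors.
    """
    return pd.DataFrame(components.T, 
                       columns = ['Factor {}'.format(i + 1) for i in range(n_components)],
                       index = df.columns[: -1])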

FactorLoadings(fa.components_)

Unnamed: 0,Factor 1,Factor 2,Factor 3,Factor 4,Factor 5
fixed acidity,-0.161692,-1.145026,0.525767,0.412258,0.964023
volatile acidity,0.013626,-0.005764,-0.041384,-0.011233,-0.072485
citric acid,0.007791,-0.065046,0.057872,0.058579,0.108214
residual sugar,0.306013,-0.417179,0.445379,0.131728,-0.280869
chlorides,0.002664,-0.009901,-0.004634,-0.002355,0.006367
free sulfur dioxide,7.338024,1.2539,4.149618,-5.979583,0.106642
total sulfur dioxide,32.795243,1.303346,-0.671834,1.656218,-0.013853
density,0.000209,-0.001862,0.000187,6.8e-05,-3.2e-05
pH,-0.011372,0.054082,0.005221,-0.020063,-0.117808
sulphates,0.008098,-0.020796,0.035236,0.015193,0.026729
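
The loadings above are expressed in the raw units of each column, so columns with large variance (the two sulfur dioxide columns) dominate the table. The notebook imports `scale` but does not use it; the sketch below shows, on synthetic data, how standardizing before fitting might change the picture. The data, the names `X_demo`, `loadings_raw` and `loadings_std`, and the choice of two factors are all assumptions made for illustration, not part of the analysis above.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import scale

# Synthetic stand-in for the wine data (an assumption for illustration):
# two latent factors drive six observed columns on very different scales.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
W = rng.normal(size=(2, 6))
X_demo = Z @ W + 0.3 * rng.normal(size=(500, 6))
X_demo = X_demo * np.array([1.0, 0.01, 0.1, 1.0, 10.0, 100.0])

# Loadings fitted on raw data are expressed in each column's own units.
loadings_raw = FactorAnalysis(n_components=2, random_state=0).fit(X_demo).components_.T

# Standardize every column to zero mean and unit variance, then refit.
X_std = scale(X_demo)
loadings_std = FactorAnalysis(n_components=2, random_state=0).fit(X_std).components_.T

# Largest absolute loading per feature, before and after standardizing.
print(np.abs(loadings_raw).max(axis=1).round(3))
print(np.abs(loadings_std).max(axis=1).round(3))
```

After standardizing, the loadings are on a comparable scale across features, which makes them easier to read side by side. Whether standardizing is appropriate depends on whether the original units carry meaning for the analysis.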


In [16]:
# Diagonal entries of the noise covariance matrix: one noise variance per observed feature
fa.noise_variance_

array([  3.16564061e-01,   2.47304841e-02,   1.51411722e-02,
         1.42437402e+00,   2.04105829e-03,   9.41611024e-01,
         1.00434986e+00,   9.56654549e-09,   6.45761177e-03,
         2.60298530e-02,   1.25044888e-02])
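
Taken together, the loadings and the noise variances determine the covariance matrix that the fitted model implies for the observables. The sketch below rebuilds that matrix by hand on synthetic data and compares it with the sample covariance. The synthetic data and the names `X_demo`, `fa_demo`, `L_hat` and `Psi_hat` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data from a two-factor model (an assumption for illustration).
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 2))
W = rng.normal(size=(2, 5))
X_demo = Z @ W + 0.5 * rng.normal(size=(1000, 5))

fa_demo = FactorAnalysis(n_components=2, random_state=0).fit(X_demo)

# components_ has shape (n_components, n_features), so its transpose
# plays the role of the loadings matrix L.
L_hat = fa_demo.components_.T
Psi_hat = np.diag(fa_demo.noise_variance_)

# Model-implied covariance: L L^T + Psi.
implied_cov = L_hat @ L_hat.T + Psi_hat

# Compare with the sample covariance of the data.
sample_cov = np.cov(X_demo, rowvar=False)
print("Largest absolute gap:", np.abs(implied_cov - sample_cov).max())
```

The hand-built matrix should match what `fa_demo.get_covariance()` returns. How close it is to the sample covariance gives a rough sense of how well the chosen number of factors captures the correlation structure.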

## Reconstruction of data space
Factor analysis models the observables X as a linear combination of factors plus noise. It is therefore instructive to reconstruct the data space directly from that formulation: map the factor scores back through the loadings, then add the mean and the noise.

In [17]:
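print("Factors shapes:", fa.components_.shape)
# Sample Gaussian noise with the fitted diagonal covariance. Using the feature means
# as the noise mean also adds back the mean that FactorAnalysis subtracts during fitting.
noise = np.random.multivariate_normal(np.mean(X, axis = 0), np.diag(fa.noise_variance_), X.shape[0])
# Map the factor scores back to data space through the loadings, then add the noise
X_reconstructed = np.dot(X_features, fa.components_) + noise
print("Reconstructed dataset shape", X_reconstructed.shape)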

Factors shapes: (5, 11)
Reconstructed dataset shape (1599, 11)


In [18]:
# reconstructed dataset is an approximation of the original dataset
pd.DataFrame(X_reconstructed, columns = df.columns[:-1])[:5]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,8.609748,0.82798,0.174472,6.363364,0.135995,18.866443,59.744205,0.999991,3.427278,0.546082,9.224201
1,7.222211,0.441915,0.269315,3.113013,0.191209,39.217567,105.26724,0.996744,3.510475,0.599658,10.913251
2,11.126393,0.770194,0.447762,6.063744,0.159268,53.461154,94.19588,1.003466,3.433062,0.44581,9.009941
3,7.431258,0.376553,-0.011596,1.49625,-0.012565,11.158543,21.758231,0.996306,3.34196,0.655933,9.5573
4,10.587377,0.478076,0.440523,3.635231,0.048903,3.192271,10.962647,0.996372,3.233714,1.032874,12.100942


In [19]:
df.iloc[0:5, :-1]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,8.2,0.59,0.0,2.5,0.093,19.0,58.0,1.0002,3.5,0.65,9.3
1,7.3,0.365,0.49,2.5,0.088,39.0,106.0,0.9966,3.36,0.78,11.0
2,10.2,0.54,0.37,15.4,0.214,55.0,95.0,1.00369,3.18,0.77,9.0
3,7.7,0.965,0.1,2.1,0.112,11.0,22.0,0.9963,3.26,0.5,9.5
4,9.4,0.395,0.46,4.6,0.094,3.0,10.0,0.99639,3.27,0.64,12.2


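The comparison shows that the reconstructed values can differ substantially from the originals, although they follow the same general pattern. In the rows shown, the high-variance columns (free and total sulfur dioxide) are recovered closely, while columns such as residual sugar and chlorides are not.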
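
Part of the discrepancy comes from the random noise added during reconstruction. A minimal sketch, on synthetic data, of how one might separate the two effects by also reconstructing without sampled noise, using only the factor scores, the loadings and the fitted mean. The data and the names `X_demo`, `fa_demo`, `X_mean_recon` and `X_noisy_recon` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data from a two-factor model with a nonzero mean
# (an assumption for illustration).
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
W = rng.normal(size=(2, 6))
X_demo = 5.0 + Z @ W + 0.1 * rng.normal(size=(500, 6))

fa_demo = FactorAnalysis(n_components=2, random_state=0)
Z_hat = fa_demo.fit_transform(X_demo)  # estimated factor scores

# Noise-free reconstruction: loadings applied to the scores, plus the fitted mean.
X_mean_recon = Z_hat @ fa_demo.components_ + fa_demo.mean_

# Reconstruction with sampled zero-mean noise, one variance per feature.
noise = rng.normal(scale=np.sqrt(fa_demo.noise_variance_), size=X_demo.shape)
X_noisy_recon = X_mean_recon + noise

# Mean squared errors against the original data.
mse_mean = np.mean((X_demo - X_mean_recon) ** 2)
mse_noisy = np.mean((X_demo - X_noisy_recon) ** 2)
mse_baseline = np.mean((X_demo - fa_demo.mean_) ** 2)  # predicting the mean only
print("noise-free:", mse_mean, "with noise:", mse_noisy, "mean only:", mse_baseline)
```

The noise-free version answers "how well do the factors alone explain each row", while the noisy version produces samples that resemble draws from the fitted model. Which one is appropriate depends on the purpose of the reconstruction.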