---
title: "Multiple Factor Analysis in R"
author: "Yanli Fan, Jonathan Ackerman, Dario Cantore, Josiah Davis"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to mfa}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
library(mfa)
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 6, fig.align = "center")

The mfa package performs Multiple Factor Analysis (MFA, also called multiple factorial analysis) on a dataset or matrix. The centerpiece of the package is the mfa function. Multiple factor analysis consists of two steps. In step one, the dataset is split into several smaller tables, a principal component analysis is run on each of them, and each table is then normalized by its first singular value. In step two, the normalized tables are concatenated into one large table, which is again analyzed using principal component analysis. This analysis yields, among other things, factor scores and variable loadings. A thorough description can be found in WIREs Comput Stat 2013, doi: 10.1002/wics.1246.
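
The two steps can be sketched in base R as follows. This is a minimal illustration on simulated data, not the package's implementation: it assumes plain column centering and an unweighted SVD, and the names `X`, `sets`, `blocks` and `scores` are made up for the example.

```r
set.seed(1)

# Simulated data: 12 observations, 6 variables split into two tables
X <- matrix(rnorm(12 * 6), nrow = 12)
sets <- list(1:3, 4:6)

# Step 1: center each table and divide it by its first singular value
blocks <- lapply(sets, function(cols) {
  Z <- scale(X[, cols], center = TRUE, scale = FALSE)
  Z / svd(Z)$d[1]
})

# Step 2: concatenate the normalized tables and decompose the result
Z_all <- do.call(cbind, blocks)
dec <- svd(Z_all)

scores <- dec$u %*% diag(dec$d)  # factor scores (one row per observation)
loadings <- dec$v                # variable loadings (one row per variable)
eigenvalues <- dec$d^2           # squared singular values
```

After step 1 every table has a first singular value of 1, so no single table can dominate the analysis in step 2.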

#Installing the package

To install the package, download the GitHub repository, open RStudio, and use setwd to move to the root folder (the folder that contains the GitHub repository).

library(devtools)
devtools::install_github("fussballball/stat243FinalProject/mfa", 
                         force_deps = FALSE)

mfa()

The mfa function takes data, sets, ncomps, center and scale as parameters. It returns an object of class mfa.

- data must be either a data.frame or a matrix and contains all the data to be analyzed.
- sets is a list in which each element gives the column names or column numbers that form one of the smaller tables. If, for example, columns 1 to 3 hold the observations of one expert, the first element of sets is 1:3.
- ncomps is the number of factors (components) to extract. By default all factors are extracted.
- center is TRUE by default and indicates whether the columns are centered in step 1 (as outlined above). By default, each column is centered so that the sum of its elements is zero.
- scale is TRUE by default and indicates whether the columns are scaled in step 1 (as outlined above). By default, each column is scaled so that the sum of its squared elements is 1.

For further information on centering and scaling, see the help page of the base scale function, ?scale. Note that scale = TRUE means different things in base R and in the mfa function: with the base function the squared values of each column sum to nrow(data) - 1 (sample SD = 1), whereas here they sum to 1.
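
The difference between the two scaling conventions can be checked with base R alone. The snippet below is a sketch on simulated data; `unit_scaled` only mimics the convention described above and is not taken from the package source.

```r
set.seed(42)
x <- matrix(rnorm(20), nrow = 10)

# Base R convention: each column ends up with sample SD 1,
# so its squared values sum to nrow(x) - 1
base_scaled <- scale(x)
colSums(base_scaled^2)

# Convention described for mfa: center, then divide each column by the
# square root of its sum of squares, so the squared values sum to 1
centered <- scale(x, center = TRUE, scale = FALSE)
unit_scaled <- sweep(centered, 2, sqrt(colSums(centered^2)), "/")
colSums(unit_scaled^2)
```

The two results differ only by the constant factor `sqrt(nrow(x) - 1)` per column.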

The function returns an mfa object with attributes P, Q, $\lambda$, F, and Fk, where P and Q are the left and right singular vectors, respectively, resulting from the singular value decomposition in step 2 above. $\lambda$ is the vector of eigenvalues, which are the squared singular values, $\Delta^2$. F holds the common factor scores and Fk the partial factor scores. For details on their computation see WIREs Comput Stat 2013, doi: 10.1002/wics.1246, p. 8.

#Wine data

The wine data is described in https://www.utdallas.edu/~herve/abdi-WiresCS-mfa-2013.pdf and is loaded with the package; it is used throughout the vignette.

wine[,1:15]

Rows correspond to different varieties of Sauvignon Blanc (a white wine); columns correspond to ratings of the wine along various dimensions of taste (e.g., mineral, passion fruit, cat pee) given by 10 different wine experts.

library(mfa)
SETS <- list(2:7, 8:13, 14:19, 20:24, 25:30, 31:35, 36:39, 40:45 , 46:50, 51:54)
mfa1 <- mfa(data = wine, sets = SETS)
mfa1

#Generic Methods

print.mfa()

The print.mfa function takes an mfa object as an input and returns a basic summary of it. This summary consists of the name of the object, the number of smaller tables which make up the entire dataset (i.e. the number of experts rating the wines), the eigenvalues of the mfa object, as well as the variance explained by the first 2 eigenvalues. Also given are the first 2 factor scores.

plot.mfa()

The plot function for mfa objects automatically plots the compromise, the partial factor scores, and the variable loadings, using the plot_compromise, plot_partial_factor and plot_variable_loadings functions. It accepts a color argument, which should be a list of vectors. Each vector in the list denotes the observations (i.e. rows) of the data table that will be displayed in the same color.

par(mar = rep(1, 4))
plot(mfa1)

plot_compromise()

The plot_compromise function for mfa objects plots the factor scores (factor scores and compromise are used interchangeably here) of the object.

plot_compromise(mfa1)

plot_partial_factor()

The plot_partial_factor function for mfa objects plots the partial factor scores for each table. It therefore produces k plots, where k is the number of smaller tables, i.e. the length of the sets list passed to the mfa function. The user can choose which of the k plots to display by setting the sets argument to a vector in the function call. By default all k plots are displayed.

par(mar = rep(1, 4))
plot_partial_factor(mfa1)

plot_variable_loadings()

The plot_variable_loadings function for mfa objects plots the variable loadings for each variable.

# plot_variable_loadings(mfa1)

#Related Methods and Functions

summary_eigenvalues()

The summary_eigenvalues function takes an mfa object as an input and returns a summary of its eigenvalues. This summary consists of the singular values, the eigenvalues, the cumulative sum of the eigenvalues, the % of inertia explained, and the cumulative % of inertia explained.

summary_eigenvalues(mfa1)
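
The columns of such a summary follow directly from the singular values. The sketch below rebuilds them in base R from three made-up singular values; the column names are illustrative and may not match what summary_eigenvalues prints.

```r
# Hypothetical singular values, chosen only to keep the arithmetic simple
singular_values <- c(3, 2, 1)
eigenvalues <- singular_values^2

summary_tbl <- data.frame(
  singular_value = singular_values,
  eigenvalue     = eigenvalues,
  cumulative     = cumsum(eigenvalues),
  pct_inertia    = 100 * eigenvalues / sum(eigenvalues),
  cumulative_pct = 100 * cumsum(eigenvalues) / sum(eigenvalues)
)
summary_tbl
```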

contribution_obs_dim() (Contribution of an Observation to a Dimension)

The contribution_obs_dim function also operates on an mfa object. It returns the contribution of an observation $i$ to a dimension $l$. It takes as input an mfa object and l_max, the number of dimensions to return. It returns a matrix in which entry $(i, l)$ is the contribution of observation $i$ to dimension $l$:

$ctr_{i,l} = \frac{m_i f_{i,l}^2}{\lambda_l}$, where $\lambda_l$ is the eigenvalue of dimension $l$, $m_i$ is the mass of the $i$-th observation and $f_{i,l}$ is the factor score of the $i$-th observation on the $l$-th dimension.

contribution_obs_dim(mfa = mfa1, l_max = 3)
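
A useful property of these contributions is that, for each dimension, they sum to 1 over the observations. The base R sketch below shows this on simulated data. It assumes equal masses $m_i = 1/n$ and a mass-weighted SVD, and it does not call the package; `F_scores` and `ctr` are names invented for the example.

```r
set.seed(7)
n <- 8
X <- scale(matrix(rnorm(n * 4), nrow = n), center = TRUE, scale = FALSE)

m <- rep(1 / n, n)                  # equal observation masses (assumption)

# Mass-weighted SVD: multiply row i by sqrt(m_i) before decomposing
dec <- svd(sqrt(m) * X)
lambda <- dec$d^2                   # eigenvalues

# Factor scores: undo the row weighting, then scale by the singular values
F_scores <- (dec$u / sqrt(m)) %*% diag(dec$d)

# Contribution of observation i to dimension l: m_i * f_il^2 / lambda_l
ctr <- sweep(m * F_scores^2, 2, lambda, "/")
colSums(ctr)                        # each dimension sums to 1
```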

contribution_var_dim() (Contribution of a Variable to a Dimension)

The contribution_var_dim function operates on an mfa object. It returns the contribution of a variable $j$ to a dimension $l$. It takes as input an object of class mfa and l_max, the number of dimensions to return. It returns a matrix in which entry $(j, l)$ is:

$ctr_{j,l} = a_j q_{j,l}^2$, where $a_j$ is the $\alpha$ weight of the $j$-th variable (the inverse of the squared first singular value of the table that contains the variable, from step 1 of the mfa function), and $q_{j,l}$ is the loading of the $j$-th variable on the $l$-th dimension.

contribution_var_dim(mfa = mfa1, l_max = 2)
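
The weights and loadings entering this quantity can be reproduced in base R. The sketch below uses simulated data with two made-up tables and a column-weighted SVD; it illustrates the definition under those assumptions and is not the package's code.

```r
set.seed(11)
X <- scale(matrix(rnorm(10 * 5), nrow = 10), center = TRUE, scale = FALSE)
sets <- list(1:2, 3:5)              # two hypothetical tables

# One alpha weight per table: inverse of its squared first singular value
alpha <- sapply(sets, function(cols) 1 / svd(X[, cols])$d[1]^2)

# Expand to one weight per variable (tables are contiguous and in order)
a <- rep(alpha, times = lengths(sets))

# Column-weighted SVD: multiply column j by sqrt(a_j) before decomposing
dec <- svd(sweep(X, 2, sqrt(a), "*"))
Q <- dec$v / sqrt(a)                # loadings, row j divided by sqrt(a_j)

# Contribution of variable j to dimension l: a_j * q_jl^2
ctr_var <- a * Q^2
colSums(ctr_var)                    # each dimension sums to 1
```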

contribution_table_dim() (Contribution of a Table to a Dimension)

The contribution_table_dim function operates on an mfa object. It returns the contribution of a table $k$ to a dimension $l$. It takes as input an object of class mfa and l_max, the number of dimensions to return. It returns a matrix in which entry $(k, l)$ is:

$ctr_{k,l} = \sum_{j \in J_k} ctr_{j,l}$, where $ctr_{j,l}$ is the contribution of variable $j$ to dimension $l$ and $J_k$ is the set of variables in table $k$. The contribution of a table $k$ to a dimension $l$ is therefore just the sum of the contributions of the variables that make up the table.

contribution_table_dim(mfa = mfa1, l_max = 4)
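
The aggregation from variables to tables is a grouped column sum. The sketch below uses a small matrix of invented variable contributions (four variables, two dimensions) and two invented tables; none of the numbers come from the wine data.

```r
# Hypothetical variable contributions; each column (dimension) sums to 1
ctr_var <- matrix(c(0.10, 0.20, 0.30, 0.40,
                    0.25, 0.25, 0.40, 0.10), nrow = 4)

sets <- list(1:2, 3:4)   # variables 1-2 form table 1, variables 3-4 table 2

# Sum the variable contributions within each table, dimension by dimension
ctr_table <- t(sapply(sets, function(cols) {
  colSums(ctr_var[cols, , drop = FALSE])
}))
ctr_table                # rows are tables, columns are dimensions
```

Because the variable contributions sum to 1 within each dimension, the table contributions do as well.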

#Escoufier's RV Coefficient

RV()

The RV function takes two tables as input and returns a single value, Escoufier's RV coefficient, which measures how similar the structure of the two tables is. It can be seen as the multivariate generalization of the squared Pearson correlation coefficient. Given two tables, it returns the scalar $R_V$:

$R_{V_{k,k'}} = \frac{trace[(X_{[k]} X_{[k]}^T) (X_{[k']} X_{[k']}^T)]}{\sqrt{trace[(X_{[k]} X_{[k]}^T) (X_{[k]} X_{[k]}^T)] \times trace[(X_{[k']} X_{[k']}^T) (X_{[k']} X_{[k']}^T)]}}$ where the square root is taken over the entire denominator, juxtaposition denotes ordinary matrix multiplication, and $\times$ denotes the product of the two scalar traces.

RV(wine[,SETS[[1]]], wine[,SETS[[2]]])
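
The coefficient is easy to compute directly from its definition. The function below is an independent base R sketch (named `rv_sketch` to avoid clashing with the package's RV); it relies on the identity trace(AB) = sum(A * B) for symmetric matrices.

```r
rv_sketch <- function(X1, X2) {
  S1 <- tcrossprod(as.matrix(X1))   # X1 %*% t(X1)
  S2 <- tcrossprod(as.matrix(X2))   # X2 %*% t(X2)
  # For symmetric matrices, trace(S1 %*% S2) equals sum(S1 * S2)
  sum(S1 * S2) / sqrt(sum(S1 * S1) * sum(S2 * S2))
}

set.seed(3)
A <- matrix(rnorm(10 * 3), nrow = 10)
B <- matrix(rnorm(10 * 2), nrow = 10)

rv_sketch(A, A)   # a table compared with itself gives 1
rv_sketch(A, B)   # value between 0 and 1
```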

RV_table()

The RV_table function returns a matrix of RV coefficients. It takes as input an mfa object and a list of vectors. Each vector gives the column indices that form one table, and the RV coefficient is calculated between every pair of these tables in the same way as in the RV function. The output is a matrix M, where $M_{ij}$ is the RV coefficient between table $i$ and table $j$. Table $i$ is created by taking the $i$-th element of the list and extracting the corresponding columns from the dataset or matrix.

RV_table(mfa = mfa1, SETS[1:4])
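
The pairwise structure of such a table can be sketched in base R with a double loop over the sets. The helper `rv_pair` and the simulated matrix `X` are invented for this example; the package function may be organized differently.

```r
# RV coefficient between two tables, written out from its definition
rv_pair <- function(X1, X2) {
  S1 <- tcrossprod(X1)
  S2 <- tcrossprod(X2)
  sum(S1 * S2) / sqrt(sum(S1 * S1) * sum(S2 * S2))
}

set.seed(3)
X <- matrix(rnorm(10 * 6), nrow = 10)
sets <- list(1:2, 3:4, 5:6)         # three hypothetical tables
k <- length(sets)

# M[i, j] holds the RV coefficient between table i and table j
M <- matrix(NA_real_, nrow = k, ncol = k)
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    M[i, j] <- rv_pair(X[, sets[[i]], drop = FALSE],
                       X[, sets[[j]], drop = FALSE])
  }
}
M
```

The result is symmetric with ones on the diagonal, since every table is perfectly similar to itself.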

Lg()

The Lg function takes two tables as input and returns a single value, the $L_g$ coefficient. The $L_g$ coefficient measures the richness of the common structure between the two tables: the larger the $L_g$ coefficient, the larger the common structure (Analyzing Sensory Data with R, Sébastien Lê, p. 290).

It is calculated as follows: $L_g = \frac{trace[(X_1 X_1^T) (X_2 X_2^T)]}{\gamma_{1,1}^2 \times \gamma_{1,2}^2}$ where the notation is the same as for the RV function and $\gamma_{1,k}$ denotes the first singular value of table $k$.

Lg(wine[,SETS[[1]]], wine[,SETS[[2]]])
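
As with the RV coefficient, the definition translates directly into a few lines of base R. The function `lg_sketch` below is an independent illustration on simulated data and is not the package's Lg.

```r
lg_sketch <- function(X1, X2) {
  X1 <- as.matrix(X1)
  X2 <- as.matrix(X2)
  S1 <- tcrossprod(X1)              # X1 %*% t(X1)
  S2 <- tcrossprod(X2)              # X2 %*% t(X2)
  gamma1 <- svd(X1)$d[1]            # first singular value of table 1
  gamma2 <- svd(X2)$d[1]            # first singular value of table 2
  # trace(S1 %*% S2) equals sum(S1 * S2) for symmetric matrices
  sum(S1 * S2) / (gamma1^2 * gamma2^2)
}

set.seed(5)
A <- matrix(rnorm(10 * 3), nrow = 10)
B <- matrix(rnorm(10 * 2), nrow = 10)

lg_sketch(A, B)
lg_sketch(A, A)   # at least 1, since the first squared eigenvalue alone gives 1
```

Unlike the RV coefficient, $L_g$ is not bounded above by 1: a table with several strong dimensions has a larger $L_g$ with itself than a one-dimensional table.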

Lg_table()

The Lg_table function relates to the Lg function in exactly the same way as the RV_table function relates to the RV function. It takes as input an mfa object and a list of vectors. Each vector gives the column indices that form one table, and the Lg coefficient is calculated between every pair of these tables in the same way as in the Lg function. The output is a matrix M, where $M_{ij}$ is the Lg coefficient between table $i$ and table $j$. Table $i$ is created by taking the $i$-th element of the list and extracting the corresponding columns from the dataset or matrix.

Lg_table(mfa = mfa1, SETS[1:4])

bootstrap()

The bootstrap function takes an mfa object, a positive integer B and, optionally, a seed. It is used to estimate the stability of the compromise factor scores. It returns the input mfa object with a new attribute that holds the bootstrap samples of the compromise factor scores.

mfa2 <- bootstrap(mfa1,  B = 1000, seed = 12)
mfa2
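
One common way to bootstrap compromise factor scores, described in the WIREs article cited above, is to resample the tables with replacement and average their partial factor scores. The base R sketch below follows that idea on simulated partial scores. It is an assumption that the package works this way, and all object names here are invented for the example.

```r
set.seed(12)
K <- 5   # number of tables
n <- 4   # number of observations
L <- 2   # number of dimensions kept

# Simulated partial factor scores: one n x L matrix per table
partial <- replicate(K, matrix(rnorm(n * L), nrow = n), simplify = FALSE)

B <- 200
boot <- array(NA_real_, dim = c(n, L, B))
for (b in seq_len(B)) {
  idx <- sample.int(K, K, replace = TRUE)     # resample tables
  boot[, , b] <- Reduce(`+`, partial[idx]) / K # bootstrapped compromise
}

# Bootstrap ratio: mean over samples divided by standard deviation
boot_mean <- apply(boot, c(1, 2), mean)
boot_sd <- apply(boot, c(1, 2), sd)
boot_ratio <- boot_mean / boot_sd
boot_ratio
```

Large absolute ratios indicate factor scores that stay on the same side of the origin across resamples, which is the usual reading of stability in this setting.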