## Exploratory Factor Analysis

All of this work is performed in the R statistical environment rather than Python. If I could get the R kernel for Jupyter notebooks working, I'd run it right here, but I've had no luck on that front so far. So, while the code is provided (and discussed) here, everything is cut from this page and pasted into an R session to make the statistical magic happen.

In [None]:
# first, let's load the data into an R dataframe:

fn='/Users/paulparis/Documents/Projects/csi/data/vector/factor_analysis2.csv'
dat<-read.csv(file=fn, header=FALSE)

X<-as.matrix(dat)
colnames(X)<-c('northing','area','depth_max','depth_mean','depth_std','slope_mean','slope_std')

In [None]:
# ## do the principal components solution:

R<-cor(X)                  # correlation matrix for obs. in X
eig<-eigen(R)              # eigen-decomposition of R (not of X, which is not square)
e<-eig$vectors
lambda<-eig$values

# Compute the first four (4) Factors:
F1<-sqrt(lambda[1])*e[,1]
F2<-sqrt(lambda[2])*e[,2]
F3<-sqrt(lambda[3])*e[,3]
F4<-sqrt(lambda[4])*e[,4]

# combine the four factor loading vectors (each one a column vector) into a single matrix L.pc
L.pc<-cbind(F1,F2,F3,F4)
rownames(L.pc)<-colnames(X)
round(L.pc, 3)

 Factors:<br>
               F1     F2     F3     F4<br>
northing   -0.426 -0.703  0.525 -0.149<br>
area        0.404 -0.765 -0.421  0.248<br>
depth_max   0.853 -0.074  0.343  0.171<br>
depth_mean  0.812  0.403  0.116  0.043<br>
depth_std  -0.882  0.016 -0.258 -0.207<br>
slope_mean -0.921  0.098  0.161  0.223<br>
slope_std  -0.883  0.170  0.082  0.364<br>
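The four separate `F1` to `F4` assignments above can also be written as one matrix product: each eigenvector is scaled by the square root of its eigenvalue. Here is a minimal sketch of that idea. It runs on simulated data because the CSV is not available here, and the names `X_demo`, `R_demo`, and `pc_loadings` are hypothetical helpers, not part of the original session.

```r
# Simulated stand-in for X (an assumption; the real data live in the CSV above)
set.seed(42)
X_demo <- matrix(rnorm(200 * 5), ncol = 5)
X_demo[, 2] <- X_demo[, 1] + 0.5 * X_demo[, 2]   # induce some correlation

R_demo <- cor(X_demo)
eig <- eigen(R_demo)          # decompose once, reuse both vectors and values

# Hypothetical helper: loadings for the first m factors,
# i.e. each eigenvector scaled by the square root of its eigenvalue
pc_loadings <- function(eig, m) {
  eig$vectors[, 1:m, drop = FALSE] %*% diag(sqrt(eig$values[1:m]), nrow = m)
}

L_two <- pc_loadings(eig, 2)                # first two factors only
L_all <- pc_loadings(eig, ncol(X_demo))     # every factor retained

# With all factors retained, L %*% t(L) rebuilds the correlation matrix
recon_err <- max(abs(L_all %*% t(L_all) - R_demo))
recon_err
```

The reconstruction error should be at the level of floating-point noise, which is a handy sanity check that the loadings were built from the eigen-decomposition of `R` and not of something else.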

Some initial interpretation:
Factor 1 looks like it accounts for variance associated with the general morphology of a continental shelf. Here, the slope and depth conspire to define the gentle sloping "plain" that is the shelf province along the continental margin.

Factor 2 looks to account for variation seen in shelf morphology and extent in association with latitude. It is widely held that shelves tend to increase in areal extent, and to be much deeper at the break, as one moves north or south toward high latitudes and the poles. Interestingly, there is only a weak depth signal here. I would have expected depth to load more heavily (mean depth reaches only 0.403) as a foil to northing and area.

Factor 3 has only one significant variable: northing. This might further reflect the strong influence that latitude is thought to have on the overall geometry and morphology of the shelf. 

Factor 4 is insignificant, likely accounting for very little of the overall variance seen in the data. We'll find out for sure, in a bit...

In [None]:
# ## compute the communalities and the proportion of the overall variance in X explained by 
# ## the four factors

p<-ncol(X)
apply(L.pc^2, 1, sum)       # communalities
1 - apply(L.pc^2,1,sum)     # specific variances
apply(L.pc^2,2,sum)/p       # proportion of variance explained by each factor

 communalities:<br>
 northing       area  depth_max depth_mean  depth_std slope_mean  slope_std <br>
 0.9726989  0.9871647  0.8800511  0.8362416  0.8870848  0.9330808  0.9471448 
 
 specific variances<br>
  northing       area  depth_max depth_mean  depth_std slope_mean  slope_std <br>
 0.0273011  0.0128353  0.1199489  0.1637584  0.1129152  0.0669192  0.0528552
 
 proportion of variance explained by each factor<br>
         F1         F2         F3         F4 <br>
0.59070509 0.18375704 0.09752663 0.04850648 

Interpretation: based on these results, the first two factors (F1 and F2) account for about 77% (F1 59% and F2 18%) of the total variance seen in X, which is plenty good. The remaining pair of factors, F3 and F4, add a bit (F3 about 10% and F4 about 5%), but carrying them further in the analysis would likely add more complication, in the form of additional latent variables, than their contributions are worth. So, from now on, attention will focus on F1 and F2 only.
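As a cross-check on the numbers above, the same three quantities can be recomputed from the printed loadings with `rowSums()` and `colSums()`. This is only a sketch: the matrix `L_pc` is typed in from the rounded table, so results will agree with the session output to about three decimals, not exactly. The names `L_pc`, `communality`, `specific`, and `prop_var` are mine.

```r
# Unrotated loadings copied from the printed table (rounded to 3 decimals)
L_pc <- matrix(c(
  -0.426, -0.703,  0.525, -0.149,
   0.404, -0.765, -0.421,  0.248,
   0.853, -0.074,  0.343,  0.171,
   0.812,  0.403,  0.116,  0.043,
  -0.882,  0.016, -0.258, -0.207,
  -0.921,  0.098,  0.161,  0.223,
  -0.883,  0.170,  0.082,  0.364),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("northing", "area", "depth_max", "depth_mean",
      "depth_std", "slope_mean", "slope_std"),
    c("F1", "F2", "F3", "F4")))

communality <- rowSums(L_pc^2)               # variance each variable shares with the factors
specific    <- 1 - communality               # variance left over for each variable
prop_var    <- colSums(L_pc^2) / nrow(L_pc)  # share of total variance per factor

round(communality, 3)
round(specific, 3)
round(cumsum(prop_var), 3)                   # running total across F1..F4
```

Note that each communality and its specific variance must sum to 1, since the variables are standardized through the correlation matrix.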

In [None]:
# ## the total variance accounted for by all the factors
lambda/p
sum(apply(L.pc^2,2,sum)/p)

0.9204952   # about 92%. That's pretty good!

In [None]:
# ## in an effort to gain some additional clarity with the data, we'll next apply an axes 
# ## rotation to enhance the interpretation of the factors, and perhaps better connect them
# ## to the actual observations in X

# do a varimax rotation
varimax(L.pc, normalize=FALSE)

$loadings

Loadings:<br>
           F1     F2     F3     F4  <br>  
northing   -0.105         0.969  0.132<br>
area        0.103 -0.961        -0.216<br>
depth_max   0.856 -0.142        -0.354<br>
depth_mean  0.671  0.141 -0.455 -0.398<br>
depth_std  -0.847  0.167  0.140  0.349<br>
slope_mean -0.442  0.279  0.253  0.772<br>
slope_std  -0.394  0.224  0.110  0.854<br>

                  F1    F2    F3    F4<br>
SS loadings    2.273 1.124 1.250 1.796<br>
Proportion Var 0.325 0.161 0.179 0.257<br>
Cumulative Var 0.325 0.485 0.664 0.920<br>

$rotmat<br>
          [,1]       [,2]       [,3]       [,4]<br>
[1,] 0.6923532 -0.2342007 -0.2966098 -0.6146704<br>
[2,] 0.0613042  0.7142349 -0.6854275  0.1276690<br>
[3,] 0.5681203  0.5215039  0.6206331  0.1417304<br>
[4,] 0.4405997 -0.4037972 -0.2388017  0.7653714<br>

Interpretation:
Following the rotation things changed a bit. Well, actually, things changed a lot. The first factor now (post-rotation) accounts for only 33% of the common factor variance (it was 59% before), and depth is its key set of observed variables. There is a clear distinction (no dilution) between depth and the other observed variables.

Factor 2, interestingly, which now accounts for 16% of the common factor variance, is associated here with shelf surface area. So, some 16% of the overall observed variance in the data is tied to variation in shelf planform area. Again, interesting.

I know I said there would be no more mention of factors 3 and 4, but with the rotation, I must renege. Factor 3, which just moments ago was headed for the trash heap, is now back, and important. Note that this third latent variable now accounts for more variation than does F2! F3, BTW, points to latitude's influence. There is a weak tie-in with mean depth over shelves (-0.455), but the depth variables are not truly significant here. This is where latitude dominates. 

Factor 4 is also worthy of mention for its emphasis on the shelf slope's contribution to the variance mix. The slope, now isolated from its earlier association with depths, contributes a surprising amount to the overall variance accounting. F4, as per the tables above, once associated with < 5% of variance, now accounts for more than 25%! Only depth, at 33%, contributes more to our explaining the data in X. 

Overall, with the rotation, the cumulative variance accounted for remains at 92%; however, its distribution across the factors changed dramatically. We now must consider all four factors. Factors 1 and 4 are clearly important. Of Factors 2 and 3, on the other hand, one could be dropped to simplify things, probably F2.
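Why does the total stay at 92% while the shares move around? Varimax is an orthogonal rotation, so it redistributes variance among the factors without changing any variable's communality. The sketch below illustrates this with the printed `$rotmat` and the unrotated northing loadings; both are typed in from the rounded output, so agreement is approximate. The names `T_rot`, `l_northing`, and `l_rotated` are hypothetical.

```r
# Rotation matrix copied from the varimax() output ($rotmat)
T_rot <- matrix(c(
  0.6923532, -0.2342007, -0.2966098, -0.6146704,
  0.0613042,  0.7142349, -0.6854275,  0.1276690,
  0.5681203,  0.5215039,  0.6206331,  0.1417304,
  0.4405997, -0.4037972, -0.2388017,  0.7653714),
  ncol = 4, byrow = TRUE)

# An orthogonal rotation satisfies t(T) %*% T = I (up to print rounding)
orth_err <- max(abs(crossprod(T_rot) - diag(4)))

# Unrotated loadings for one variable (northing), from the first table
l_northing <- c(-0.426, -0.703, 0.525, -0.149)

# Rotated loadings are the unrotated row times the rotation matrix
l_rotated <- drop(l_northing %*% T_rot)
round(l_rotated, 3)

# The communality (sum of squared loadings) is unchanged by the rotation
h2_before <- sum(l_northing^2)
h2_after  <- sum(l_rotated^2)
c(before = h2_before, after = h2_after)
```

The rotated row should show northing loading almost entirely on the third factor, in line with the rotated table above.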

In [None]:
# Data correlations:
R<-cor(X)
R
             northing        area  depth_max  depth_mean  depth_std slope_mean  slope_std
northing    1.0000000  0.12315212 -0.1944637 -0.51229893  0.2657059  0.3572749  0.2706418
area        0.1231521  1.00000000  0.2879927  0.02510581 -0.2919584 -0.4548484 -0.4252346
depth_max  -0.1944637  0.28799266  1.0000000  0.64744512 -0.8002891 -0.6775925 -0.7021580
depth_mean -0.5122989  0.02510581  0.6474451  1.00000000 -0.6964975 -0.6907670 -0.5816764
depth_std   0.2657059 -0.29195841 -0.8002891 -0.69649751  1.0000000  0.7329513  0.6919666
slope_mean  0.3572749 -0.45484837 -0.6775925 -0.69076696  0.7329513  1.0000000  0.8680701
slope_std   0.2706418 -0.42523457 -0.7021580 -0.58167642  0.6919666  0.8680701  1.0000000

---

In [None]:
# ## Next, let's try the more sophisticated Maximum Likelihood approach to a factor
# ## solution:

# here's a short, home-grown function designed to standardize an array (or matrix) of observations
stand<-function(v){
    de<-sd(v)
    if(de==0) de<-1
    (v-mean(v)) / de
}

# ## now, use the new function to standardize the observations in X:
Z<-apply(X,2,stand)

# now, do the maximum likelihood analysis: 
# ## NOTE: since 4 factors are too many for a dataset with only 7 observed variables, we go with 3 
# ## factors in the model. Also, note that these loadings are not (yet) rotated...
factanal(Z, factors=3, scores='Bartlett', rotation='none')

Call:<br>
factanal(x = Z, factors = 3, scores = "Bartlett", rotation = "none")<br>

Uniquenesses:<br>
  northing       area  depth_max depth_mean  depth_std slope_mean  slope_std <br>
     0.613      0.542      0.097      0.120      0.242      0.005      0.207 <br>

Loadings:<br>
           Factor1 Factor2 Factor3<br>
northing    0.362  -0.125   0.490 <br>
area       -0.447  -0.250   0.442 <br>
depth_max  -0.705   0.574   0.276 <br>
depth_mean -0.711   0.454  -0.411 <br>
depth_std   0.754  -0.426         <br>
slope_mean  0.997                 <br>
slope_std   0.874          -0.155 <br>

               Factor1 Factor2 Factor3<br>
SS loadings      3.658   0.802   0.714<br>
Proportion Var   0.523   0.115   0.102<br>
Cumulative Var   0.523   0.637   0.739<br>

Test of the hypothesis that 3 factors are sufficient.<br>
The chi square statistic is 3.61 on 3 degrees of freedom.<br>
The p-value is 0.307 
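A side note on the standardization step: for columns that are not constant, the home-grown `stand()` should give the same result as base R's `scale()`. The sketch below checks that on simulated data (`X_demo` is a made-up matrix, not the shelf data). As far as I understand it, `factanal()` works from the correlation matrix anyway, so standardizing first should not change the loadings.

```r
# The home-grown standardizer from the cell above
stand <- function(v) {
  de <- sd(v)
  if (de == 0) de <- 1     # guard against constant columns
  (v - mean(v)) / de
}

# Simulated stand-in for X (an assumption; any numeric matrix will do)
set.seed(7)
X_demo <- matrix(rnorm(60, mean = 10, sd = 3), ncol = 3)

Z_home  <- apply(X_demo, 2, stand)   # column-by-column, as in the notebook
Z_scale <- scale(X_demo)             # base R: center and divide by sd

# Compare values only; scale() attaches extra attributes we do not care about
same <- isTRUE(all.equal(Z_home, Z_scale, check.attributes = FALSE))
same
```

The one difference is the guard for constant columns: `stand()` returns zeros there, whereas `scale()` would divide by zero.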

Interpretation: Here, as with the unrotated principal components solution, the first factor (Factor 1) accounts for the largest share of the variance: about 52%. Again, this factor centers on depths and slopes, which we interpreted earlier, and continue to interpret now, as the overall morphology of the shelf surface (depth and slope across the shelf planform). 

Factor 2 is weakly tied to maximum depth over the shelf [break]. 

Factor 3 is not particularly interesting, but if we set a minimum variance accountability threshold at 70%, we must include Factor 3 to reach it (~74%). 

Finally, take note of the hypothesis test for factor sufficiency. The hypotheses are:

H0: 3 factors are sufficient for the model
H1: 3 factors are not enough

The p-value of 0.307 is well above a conventional significance level such as 0.05, so we fail to reject H0 and conclude that 3 factors are good enough.

Cool!
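Both the "4 factors are too many" remark and the reported p-value can be reproduced by hand. A sketch, assuming the usual degrees-of-freedom formula for the maximum likelihood factor model, ((p - m)^2 - (p + m)) / 2, and a significance level of 0.05; `fa_df` and `alpha` are names I made up for illustration.

```r
# Degrees of freedom for a p-variable, m-factor maximum likelihood model
fa_df <- function(p, m) ((p - m)^2 - (p + m)) / 2

fa_df(7, 3)   # 3, matching the factanal() output
fa_df(7, 4)   # -1: negative, so 4 factors are too many for 7 variables

# Upper-tail chi-square probability for the reported statistic (3.61)
p_value <- pchisq(3.61, df = fa_df(7, 3), lower.tail = FALSE)
round(p_value, 3)             # about 0.307, as reported

alpha <- 0.05                 # conventional significance level (an assumption)
reject_H0 <- p_value < alpha  # FALSE here: no evidence that 3 factors fall short
reject_H0
```

Keep in mind that failing to reject H0 does not prove that 3 factors are right; it only says the data do not contradict a 3-factor model.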

In [None]:
# ## Now, let's rotate the max. likelihood model:

factanal(Z, factors=3,scores='Bartlett', rotation='varimax')

Call:<br>
factanal(x = Z, factors = 3, scores = "Bartlett", rotation = "varimax")<br>

Uniquenesses:<br>
  northing       area  depth_max depth_mean  depth_std slope_mean  slope_std <br>
     0.613      0.542      0.097      0.120      0.242      0.005      0.207 <br>

Loadings:<br>
           Factor1 Factor2 Factor3<br>
northing    0.100           0.614 <br>
area       -0.139  -0.644   0.156 <br>
depth_max  -0.882  -0.307  -0.172 <br>
depth_mean -0.555          -0.750 <br>
depth_std   0.719   0.350   0.344 <br>
slope_mean  0.406   0.745   0.524 <br>
slope_std   0.495   0.661   0.332 <br>

               Factor1 Factor2 Factor3<br>
SS loadings      2.044   1.633   1.497<br>
Proportion Var   0.292   0.233   0.214<br>
Cumulative Var   0.292   0.525   0.739<br>

Test of the hypothesis that 3 factors are sufficient.<br>
The chi square statistic is 3.61 on 3 degrees of freedom.<br>
The p-value is 0.307 


Interpretation: 
Depth still dominates Factor 1, and Factor 1 dominates, if only by a little bit, the variance accounting associated with the 3 common factors. Factor 1 can speak to about 29% of the variance. 

Factor 2 is not so clearly defined anymore. Rotation has moved things around enough that both slope and area are important contributors. What does this mean? Slope and area load with opposite signs, so we can say that, as the mean slope steepens, the shelf's planform area diminishes. That makes intuitive sense. Factor 2 accounts for 23% of the common factor variance. 

Factor 3 is dominated by mean depth, latitude, and, weakly, mean slope. Latitude and slope load with the same sign, so we might state that, as we move north or south toward the poles, there is a general expectation that shelf slopes will be, on average, steeper. This again is a weak association. Depth is inversely related to both latitude and slope. So, as we move north/south (increasing latitude) we find smaller depth values (i.e., deeper water). So, we might conclude that we find the deepest shelves at higher latitudes. This we, in general, already surmise. Factor 3, BTW, accounts for a bit more than 21% of the common factor variance. 

Overall, however, only about 74% of the total variance measured in the observations (in X) is accounted for by the common factors. This is down from 92% across 4 factors in the principal components solution. 
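The 74% figure can be recovered directly from the printed uniquenesses, since each communality is one minus the uniqueness. The sketch below does that and also puts the two solutions on a like-for-like footing by summing only the first three principal component proportions. All inputs are copied from the outputs above; the variable names are mine.

```r
# Uniquenesses copied from the factanal() output
uniq <- c(northing = 0.613, area = 0.542, depth_max = 0.097, depth_mean = 0.120,
          depth_std = 0.242, slope_mean = 0.005, slope_std = 0.207)

h2_ml    <- 1 - uniq        # communalities under the 3-factor ML model
total_ml <- mean(h2_ml)     # share of total variance explained, about 0.739

# Proportions of variance from the principal components solution (F1..F4)
prop_pc   <- c(0.59070509, 0.18375704, 0.09752663, 0.04850648)
total_pc4 <- sum(prop_pc)         # all four factors, about 0.92
total_pc3 <- sum(prop_pc[1:3])    # first three only, a fairer comparison

round(c(ml_3 = total_ml, pc_3 = total_pc3, pc_4 = total_pc4), 3)
round(sort(h2_ml), 3)             # northing and area are the least well explained
```

Part of the drop from 92% simply reflects comparing three factors against four. The rest comes from the ML model leaving northing and area largely unexplained, which is visible in their high uniquenesses.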