# Cluster Analysis using SAS code

#### PROC CLUSTER performs hierarchical clustering.
    
#### PROC FASTCLUS performs k-means clustering.
    
#### PROC VARCLUS clusters variables rather than observations; by default it uses Principal Component Analysis (PCA) to do so.
    

https://subhasis4analytics.blogspot.com/2014/09/cluster-analysis-with-sas.html

### Content
####    Fields in the dataset:  https://www.kaggle.com/datasets/crawford/80-cereals

        Name: Name of cereal
        mfr: Manufacturer of cereal
            A = American Home Food Products
            G = General Mills
            K = Kelloggs
            N = Nabisco
            P = Post
            Q = Quaker Oats
            R = Ralston Purina
        type:
            cold
            hot
        calories: calories per serving
        protein: grams of protein
        fat: grams of fat
        sodium: milligrams of sodium
        fiber: grams of dietary fiber
        carbo: grams of complex carbohydrates
        sugars: grams of sugars
        potass: milligrams of potassium
        vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended
        shelf: display shelf (1, 2, or 3, counting from the floor)
        weight: weight in ounces of one serving
        cups: number of cups in one serving
        rating: a rating of the cereals (Possibly from Consumer Reports?)

In [60]:
try:
    import saspy
except ImportError:
    # install SASPy on first use, then import it
    !pip install saspy
    import saspy

In [61]:
# check the version of SASPy
saspy.__version__

'5.4.4'

In [62]:
# Start a SAS session, SAS is running in UTF-8

sas_sess = saspy.SASsession()

Using SAS Config named: oda
SAS Connection established. Subprocess id is 14204



In [63]:
%%SAS sas_sess

FILENAME REFFILE '/home/u61624884/cereal.csv';
   
PROC IMPORT DATAFILE=REFFILE
DBMS=CSV
OUT=WORK.IMPORT;
GETNAMES=YES;
RUN;


/* print first 10 observations */
proc print data=WORK.IMPORT (obs=10);
run;

/* Checking the contents of the datasets */
proc means data=WORK.IMPORT N Nmiss mean median max min;
run;

Obs,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
1,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
2,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
3,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
4,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
5,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
6,Apple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1,1.0,0.75,29.509541
7,Apple Jacks,K,C,110,2,0,125,1.0,11.0,14,30,25,2,1.0,1.0,33.174094
8,Basic 4,G,C,130,3,2,210,2.0,18.0,8,100,25,3,1.33,0.75,37.038562
9,Bran Chex,R,C,90,2,1,200,4.0,15.0,6,125,25,1,1.0,0.67,49.120253
10,Bran Flakes,P,C,90,3,0,210,5.0,13.0,5,190,25,3,1.0,0.67,53.313813

Variable,N,N Miss,Mean,Median,Maximum,Minimum
calories,77,0,106.8831169,110.0000000,160.0000000,50.0000000
protein,77,0,2.5454545,3.0000000,6.0000000,1.0000000
fat,77,0,1.0129870,1.0000000,5.0000000,0
sodium,77,0,159.6753247,180.0000000,320.0000000,0
fiber,77,0,2.1519481,2.0000000,14.0000000,0
carbo,77,0,14.5974026,14.0000000,23.0000000,-1.0000000
sugars,77,0,6.9220779,7.0000000,15.0000000,-1.0000000
potass,77,0,96.0779221,90.0000000,330.0000000,-1.0000000
vitamins,77,0,28.2467532,25.0000000,100.0000000,0
shelf,77,0,2.2077922,2.0000000,3.0000000,1.0000000
weight,77,0,1.0296104,1.0000000,1.5000000,0.5000000
cups,77,0,0.8210390,0.7500000,1.5000000,0.2500000
rating,77,0,42.6657050,40.4002080,93.7049120,18.0428510


In [64]:
%%SAS sas_sess

/* explore the relationship between fat and sugars across brands (manufacturers) */
PROC sgscatter  DATA = WORK.IMPORT;
   PLOT fat*sugars / datalabel = mfr;

RUN; 


In [65]:
%%SAS sas_sess
proc sgplot data=WORK.IMPORT ;
   scatter x=sugars y=fat / group=mfr;
run;

### Check for missing data

In [66]:
%%SAS sas_sess
proc format;
 value $missing_char
              ' '  = 'Missing'
              other = 'Present'
    ;
 value missing_num
    . = 'Missing'
    other = 'Present'
    ;
run;

proc freq data = WORK.IMPORT;
 tables _all_ /missing;
 format _character_ $missing_char. _numeric_ missing_num.;
run;  

name,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

mfr,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

type,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

calories,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

protein,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

fat,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

sodium,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

fiber,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

carbo,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

sugars,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

potass,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

vitamins,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

shelf,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

weight,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

cups,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0

rating,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Present,77,100.0,77,100.0
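
The frequency tables above report no missing values, yet the PROC MEANS output shows a minimum of -1 for carbo, sugars, and potass, which cannot be a real nutrient amount. One plausible reading (an assumption; the source does not say so) is that -1 is a sentinel for a missing value, which the format-based check cannot detect. A minimal Python sketch of counting such sentinels, using three rows copied from the PROC PRINT output; the helper name `count_sentinels` is hypothetical:

```python
# Three rows copied from the PROC PRINT output (name, carbo, sugars, potass).
rows = [
    {"name": "100% Bran", "carbo": 5.0, "sugars": 6, "potass": 280},
    {"name": "Almond Delight", "carbo": 14.0, "sugars": 8, "potass": -1},
    {"name": "Apple Jacks", "carbo": 11.0, "sugars": 14, "potass": 30},
]


def count_sentinels(records, columns, sentinel=-1):
    """Count how often `sentinel` appears in each of the given columns."""
    return {col: sum(1 for r in records if r[col] == sentinel) for col in columns}


# Assumption: -1 marks a missing measurement rather than a real value.
sentinel_counts = count_sentinels(rows, ["carbo", "sugars", "potass"])
print(sentinel_counts)
```

If the assumption holds, those values would need to be recoded to missing before any distance-based analysis, since -1 would otherwise be treated as a genuine measurement.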


In [67]:
%%SAS sas_sess

proc sort data=WORK.IMPORT  out=raw_t (keep= calories sodium potass mfr);
    by mfr;
run;

proc transpose data=raw_t out=raw_t2;
    by mfr;
run;

data raw_t2 ;
    set raw_t2 ;
    label _name_ = "Variable";
    label col1 = "Value";
run;

proc sgplot data=raw_t2;
    vbox col1 / group=_name_ ;
run;


In [68]:
%%SAS sas_sess

proc sort data=WORK.IMPORT  out=raw_t (keep= protein fat fiber carbo sugars vitamins weight mfr);
    by mfr;
run;

proc transpose data=raw_t out=raw_t2;
    by mfr;
run;

data raw_t2 ;
    set raw_t2 ;
    label _name_ = "Variable";
    label col1 = "Value";
run;

proc sgplot data=raw_t2;
    vbox col1 / group=_name_ ;
run;


In [69]:
%%SAS sas_sess

proc sgplot data=WORK.IMPORT;
    vbox fat / group=mfr;
run;

proc sgplot data=WORK.IMPORT;
    vbox vitamins / group=mfr;
run;

proc sgplot data=WORK.IMPORT;
    vbox weight / group=mfr;
run;

#### There are large differences in fat, vitamins, and weight among the different brands (manufacturers)

In [70]:
%%SAS sas_sess
/* from https://gist.github.com/statgeek/566dda98173a5a28a6e4147317547959 
       https://learn.sas.com/pluginfile.php/7660/mod_resource/content/2/apsoln.htm 
    */
/* Use the pearson data set to print only the correlations whose absolute values are 0.70 and above, 
    or note them with an asterisk in the full correlation table.
    */
    
    
%let big=0.7;
%let interval = protein fat sodium fiber carbo sugars potass vitamins;

proc corr data=WORK.IMPORT 
          nosimple 
          best=5
          out=pearson;  /* output pearson correlations result */
   var &interval;
   title "Correlations of Predictors";
run;

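proc format;
    /* background colour by correlation strength; the first range starts at -0.2 so that
       small negative correlations are covered, and -1 is excluded from the last range
       so that it does not overlap with the separate entry for 1 and -1 */
    value CorrSignif -0.2-<0.2 = "red"
        0.2-<0.4, -0.4-<-0.2 = "orange"
        0.4-<0.6, -0.6-<-0.4 = "yellow"
        0.6-<0.8, -0.8-<-0.6 = "lightgreen"
        0.8-<1.0, -1.0<-<-0.8 = "forestgreen"
        1, -1 = "White";
    /* prefix correlations of &big and above (in absolute value) with an asterisk;
       without a "-" prefix a picture format prints absolute values, and it truncates rather than rounds */
    picture correlations &big -< 1 = '009.99' (prefix="*")
                         -1 <- -&big = '009.99' (prefix="*")
                         -&big <-< &big = '009.99';
run;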



proc print data=pearson style(column)={backgroundcolor= CorrSignif.} noobs;
    var _NAME_ &interval;
    where _type_="CORR";
    format &interval correlations.;
run;

0,1
8 Variables:,protein fat sodium fiber carbo sugars potass vitamins

"Pearson Correlation Coefficients, N = 77; Prob > |r| under H0: Rho=0 (best=5: the five largest correlations per variable, each followed by its p-value)"
protein,protein 1.00000,potass 0.54941 <.0001,fiber 0.50033 <.0001,sugars -0.32914 0.0035,fat 0.20843 0.0689
fat,fat 1.00000,carbo -0.31804 0.0048,sugars 0.27082 0.0172,protein 0.20843 0.0689,potass 0.19328 0.0921
sodium,sodium 1.00000,vitamins 0.36148 0.0012,carbo 0.35598 0.0015,sugars 0.10145 0.3800,fiber -0.07068 0.5413
fiber,fiber 1.00000,potass 0.90337 <.0001,protein 0.50033 <.0001,carbo -0.35608 0.0015,sugars -0.14121 0.2206
carbo,carbo 1.00000,fiber -0.35608 0.0015,sodium 0.35598 0.0015,potass -0.34969 0.0018,sugars -0.33167 0.0032
sugars,sugars 1.00000,carbo -0.33167 0.0032,protein -0.32914 0.0035,fat 0.27082 0.0172,fiber -0.14121 0.2206
potass,potass 1.00000,fiber 0.90337 <.0001,protein 0.54941 <.0001,carbo -0.34969 0.0018,fat 0.19328 0.0921
vitamins,vitamins 1.00000,sodium 0.36148 0.0012,carbo 0.25815 0.0234,sugars 0.12514 0.2782,fiber -0.03224 0.7807

_NAME_,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins
protein,1.0,0.2,0.05,0.50,0.13,0.32,0.54,0.0
fat,0.2,1.0,0.0,0.01,0.31,0.27,0.19,0.03
sodium,0.05,0.0,1.0,0.07,0.35,0.1,0.03,0.36
fiber,0.5,0.01,0.07,1,0.35,0.14,*0.90,0.03
carbo,0.13,0.31,0.35,0.35,1.0,0.33,0.34,0.25
sugars,0.32,0.27,0.1,0.14,0.33,1.0,0.02,0.12
potass,0.54,0.19,0.03,*0.90,0.34,0.02,1,0.02
vitamins,0.0,0.03,0.36,0.03,0.25,0.12,0.02,1.0


In [71]:
%%SAS sas_sess

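/* Check multicollinearity using the variance inflation factor (VIF). */
/* VIF between 1 and 5: moderate multicollinearity (not a major concern).
    VIF above 5: high multicollinearity (may lead to unreliable coefficient estimates).
    VIF depends only on the predictors, so the choice of response (weight) does not affect it. */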
        
PROC REG DATA=WORK.IMPORT;
   MODEL weight = protein fat sodium fiber carbo sugars potass vitamins / VIF ;
RUN;

0,1
Number of Observations Read,77
Number of Observations Used,77

Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance,Analysis of Variance
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,8,1.13989,0.14249,16.68,<.0001
Error,68,0.58099,0.00854,,
Corrected Total,76,1.72089,,,

0,1,2,3
Root MSE,0.09243,R-Square,0.6624
Dependent Mean,1.02961,Adj R-Sq,0.6227
Coeff Var,8.97756,,

Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates,Parameter Estimates
Variable,DF,Parameter Estimate,Standard Error,t Value,Pr > |t|,Variance Inflation
Intercept,1,0.37445,0.07201,5.2,<.0001,0.0
protein,1,0.03591,0.01331,2.7,0.0088,1.88959
fat,1,0.01257,0.01262,1.0,0.3229,1.43592
sodium,1,9.493e-05,0.00014624,0.65,0.5185,1.33693
fiber,1,0.00752,0.01267,0.59,0.5549,8.10799
carbo,1,0.01909,0.0035,5.45,<.0001,1.99611
sugars,1,0.02333,0.00334,6.99,<.0001,1.95666
potass,1,0.00068392,0.00042931,1.59,0.1158,8.33115
vitamins,1,0.00048951,0.00052824,0.93,0.3574,1.23903


#### fiber and potass are highly correlated with each other (r = 0.90, VIF above 8), so both are excluded from the analysis below
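
To make the link between the correlation table and the VIF column concrete, here is a small Python sketch. It relies on the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. Using only the fiber-potass correlation gives a lower bound on the VIF, because adding more predictors can only raise R². The function name `vif_from_r2` is hypothetical.

```python
def vif_from_r2(r_squared):
    """VIF = 1 / (1 - R^2), with R^2 from regressing one predictor on the others."""
    return 1.0 / (1.0 - r_squared)


# Pearson correlation between fiber and potass, as reported by PROC CORR.
r_fiber_potass = 0.90337

# Lower bound on the VIF of either variable, from this single pairwise correlation.
vif_pair = vif_from_r2(r_fiber_potass ** 2)
print(round(vif_pair, 2))

# Inverting the relation: the reported VIF of 8.10799 for fiber implies this R^2
# when fiber is regressed on the seven other predictors.
r2_fiber = 1.0 - 1.0 / 8.10799
print(round(r2_fiber, 3))
```

The pairwise correlation alone already pushes the VIF above 5, which is consistent with the values above 8 that PROC REG reports once the other predictors are included.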

### Standardize Continuous Variables

#### If the data are not standardized, variables measured on larger scales (e.g., sodium in milligrams) will dominate the computed dissimilarity, and variables measured on smaller scales (e.g., fat in grams) will contribute very little.
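
The effect of scale can be seen on a handful of values. The sketch below standardizes the sodium and fat values of the first five cereals from the PROC PRINT output using z-scores (subtract the mean, divide by the sample standard deviation), which is what `method=std` is understood to do. The helper name `standardize` is hypothetical, and five rows are used only for illustration.

```python
from statistics import mean, stdev

# First five cereals from the PROC PRINT output.
sodium = [130, 15, 260, 140, 200]   # milligrams
fat = [1, 5, 1, 0, 2]               # grams


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]


z_sodium = standardize(sodium)
z_fat = standardize(fat)

# Raw ranges differ by orders of magnitude; standardized ranges are comparable.
print(max(sodium) - min(sodium), max(fat) - min(fat))
print(round(max(z_sodium) - min(z_sodium), 2), round(max(z_fat) - min(z_fat), 2))
```

After standardization each variable has mean 0 and standard deviation 1, so a Euclidean distance weights sodium and fat equally.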

In [72]:
%%SAS sas_sess

%let interval=protein fat sodium carbo sugars vitamins; /* exclude fiber and potass */

proc stdize data=WORK.IMPORT out=outdata method=std;
    var &interval;
run;


### Principal Components Analysis (PCA)

In [73]:
%%SAS sas_sess

/* find principal components – linear combinations of the predictor variables 
– that explain a large portion of the variation in a dataset*/


proc princomp data=outdata out=out_data outstat=stats;  /* Default Method : Based on Correlation Matrix. */
    var &interval;
run;

0,1
Observations,77
Variables,6

Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics
Unnamed: 0_level_1,protein,fat,sodium,carbo,sugars,vitamins
Mean,0.0,0.0,0.0,0.0,0.0,0.0
StD,1.0,1.0,1.0,1.0,1.0,1.0

Correlation Matrix,Correlation Matrix,Correlation Matrix,Correlation Matrix,Correlation Matrix,Correlation Matrix,Correlation Matrix
Unnamed: 0_level_1,protein,fat,sodium,carbo,sugars,vitamins
protein,1.0,0.2084,-0.0547,-0.1309,-0.3291,0.0073
fat,0.2084,1.0,-0.0054,-0.318,0.2708,-0.0312
sodium,-0.0547,-0.0054,1.0,0.356,0.1015,0.3615
carbo,-0.1309,-0.318,0.356,1.0,-0.3317,0.2581
sugars,-0.3291,0.2708,0.1015,-0.3317,1.0,0.1251
vitamins,0.0073,-0.0312,0.3615,0.2581,0.1251,1.0

Eigenvalues of the Correlation Matrix,Eigenvalues of the Correlation Matrix,Eigenvalues of the Correlation Matrix,Eigenvalues of the Correlation Matrix,Eigenvalues of the Correlation Matrix
Unnamed: 0_level_1,Eigenvalue,Difference,Proportion,Cumulative
1,1.77394528,0.30488623,0.2957,0.2957
2,1.46905905,0.26825986,0.2448,0.5405
3,1.20079919,0.51759228,0.2001,0.7406
4,0.68320691,0.14649646,0.1139,0.8545
5,0.53671045,0.20043133,0.0895,0.944
6,0.33627912,,0.056,1.0

Eigenvectors,Eigenvectors,Eigenvectors,Eigenvectors,Eigenvectors,Eigenvectors,Eigenvectors
Unnamed: 0_level_1,Prin1,Prin2,Prin3,Prin4,Prin5,Prin6
protein,-0.148753,-0.349619,0.734449,-0.159831,-0.301734,0.446805
fat,-0.394398,0.317591,0.483343,0.423136,0.555059,-0.1511
sodium,0.463384,0.36214,0.284752,0.469363,-0.54682,-0.231803
carbo,0.629027,-0.140478,-0.012388,0.294499,0.459373,0.535429
sugars,-0.21801,0.703139,-0.17682,-0.096363,-0.159325,0.626204
vitamins,0.405455,0.362792,0.338322,-0.692166,0.255825,-0.2121


#### The table in the output titled Eigenvalues of the Correlation Matrix shows exactly what percentage of the total variation is explained by each principal component:

    The first 5 principal components account for 94.40% of the total variation in the dataset; the first 3, which are the only ones with an eigenvalue above 1, account for 74.06%.
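
The Proportion and Cumulative columns follow directly from the eigenvalues. As a quick check, this Python sketch recomputes them from the six eigenvalues copied from the PROC PRINCOMP output; the variable names are illustrative only.

```python
from itertools import accumulate

# Eigenvalues copied from the "Eigenvalues of the Correlation Matrix" table.
eigenvalues = [1.77394528, 1.46905905, 1.20079919, 0.68320691, 0.53671045, 0.33627912]

# For a correlation matrix the eigenvalues sum to the number of variables (6 here).
total = sum(eigenvalues)
proportion = [ev / total for ev in eigenvalues]
cumulative = list(accumulate(proportion))

# Eigenvalue-greater-than-1 rule: count the components that would be kept.
kaiser_keep = sum(1 for ev in eigenvalues if ev > 1)

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"Prin{i}: proportion={p:.4f} cumulative={c:.4f}")
print("components with eigenvalue > 1:", kaiser_keep)
```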

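#### Each eigenvector gives the weights (coefficients) of the standardized variables in a component; a weight that is large in absolute value means the variable contributes strongly to that component. Multiplying a weight by the square root of the component's eigenvalue gives the correlation between the variable and the component. A common rule is to keep components with an eigenvalue > 1, or enough components to reach a cumulative proportion of variance above 80%.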

    from https://www.listendata.com/2015/04/principal-component-analysis-with-sas.html
    
    Why Eigenvalues > 1

Each observed variable contributes one unit of variance to the total variance in the data set. Any component that displays an eigenvalue greater than 1.0 is accounting for a greater amount of variance than was contributed by one variable. Such a component is therefore accounting for a meaningful amount of variance and is worthy of being retained. On the other hand, a component with an eigenvalue less than 1.0 is accounting for less variance than had been contributed by one variable.

#### If a principal component analysis on a set of 5 variables shows that the first component explains 85% of the variance, the variables are highly correlated with each other. In other words, the variables exhibit multicollinearity.
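
For reading the Eigenvectors table, it helps to separate weights from correlations. In a PCA on the correlation matrix, the correlation between a variable and a component (often called a loading) is the eigenvector entry times the square root of that component's eigenvalue. The sketch below applies this to Prin1, with values copied from the PROC PRINCOMP output; the names are illustrative.

```python
import math

# Prin1 eigenvalue and eigenvector, copied from the PROC PRINCOMP output.
eigenvalue_prin1 = 1.77394528
eigenvector_prin1 = {
    "protein": -0.148753, "fat": -0.394398, "sodium": 0.463384,
    "carbo": 0.629027, "sugars": -0.218010, "vitamins": 0.405455,
}

# loading = eigenvector entry * sqrt(eigenvalue)
scale = math.sqrt(eigenvalue_prin1)
loadings_prin1 = {var: w * scale for var, w in eigenvector_prin1.items()}

# Eigenvectors have unit length, so the squared weights sum to 1
# (and the squared loadings sum to the eigenvalue).
unit_length = sum(w * w for w in eigenvector_prin1.values())

for var, loading in loadings_prin1.items():
    print(f"{var}: weight={eigenvector_prin1[var]:+.3f} loading={loading:+.3f}")
```

On this reading, carbo has the strongest association with Prin1, followed by sodium and vitamins, with fat loading in the opposite direction.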

### Biplot to Visualize Results

In [74]:
%%SAS sas_sess
/* displays every observation in a dataset on a plane that is formed by the first two principal components. */

/*create dataset with column called obs to represent row numbers of original data*/
data biplot_data;
   set out_data;
   obs=_n_;
run;

/*create biplot using values from first two principal components*/
proc sgplot data=biplot_data;
    scatter x=Prin1 y=Prin2 / datalabel=obs;
run;

#### Observations that are next to each other on the plot have similar scores on the first two principal components, and so tend to have similar values across the 6 variables. For example, on the far right side of the plot we can see that observations #44 and #72 are extremely close to each other.

### Principal Factor Analysis with Parallel Analysis

#### Parallel analysis compares each of the eigenvalues of the input data correlation matrix to an empirical distribution of eigenvalues obtained by simulation. Each eigenvalue from the input correlation matrix that exceeds a critical value (based on a one-sided α-level) in the corresponding empirical distribution suggests a factor to be retained.

#### The scree plot for a parallel analysis contains two lines. The first line is a traditional scree plot and shows the eigenvalues of the sample correlation matrix, sorted in descending order. The second line is constructed from the critical values that are obtained from the parallel analysis. The suggested number of factors to retain is indicated by the intersection of these two lines.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_factor_examples09.htm
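
The retention rule described above can be sketched in a few lines of Python. The observed eigenvalues and simulated critical values are copied from the Parallel Analysis table in the PROC FACTOR output of this notebook. Stopping at the first eigenvalue that fails the comparison is an assumption made here to mirror the "intersection of the two lines" reading, and `n_factors_to_retain` is a hypothetical helper name.

```python
# Observed eigenvalues and simulated critical values (alpha=0.01),
# copied from the "Parallel Analysis" table in the PROC FACTOR output.
observed = [1.7739, 1.4691, 1.2008, 0.6832, 0.5367, 0.3363]
critical = [1.6446, 1.3597, 1.1714, 1.0302, 0.9115, 0.7919]


def n_factors_to_retain(obs, crit):
    """Count leading eigenvalues that exceed their critical value.

    Assumption: counting stops at the first eigenvalue that does not exceed
    its critical value, i.e. where the two scree lines cross.
    """
    kept = 0
    for o, c in zip(obs, crit):
        if o <= c:
            break
        kept += 1
    return kept


n_retained = n_factors_to_retain(observed, critical)
print(n_retained)
```

The count agrees with the three dimensions that PROC FACTOR marks with an asterisk.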

In [75]:
%%SAS sas_sess

proc factor data=outdata corr

   nfactors= parallel(alpha=0.01 nsims=10000 seed=20170229) map  /* Parallel Analysis */
   priors=smc msa residual
   rotate=promax reorder
   outstat=fact_all
   plots=(scree initloadings preloadings loadings parallel map);
    var &interval;
run;

0,1
Input Data Type,Raw Data
Number of Records Read,77
Number of Records Used,77
N for Significance Tests,77

Correlations,Correlations,Correlations,Correlations,Correlations,Correlations,Correlations
Unnamed: 0_level_1,protein,fat,sodium,carbo,sugars,vitamins
protein,1.0,0.20843,-0.05467,-0.13086,-0.32914,0.00734
fat,0.20843,1.0,-0.00541,-0.31804,0.27082,-0.03116
sodium,-0.05467,-0.00541,1.0,0.35598,0.10145,0.36148
carbo,-0.13086,-0.31804,0.35598,1.0,-0.33167,0.25815
sugars,-0.32914,0.27082,0.10145,-0.33167,1.0,0.12514
vitamins,0.00734,-0.03116,0.36148,0.25815,0.12514,1.0

Partial Correlations Controlling all other Variables,Partial Correlations Controlling all other Variables,Partial Correlations Controlling all other Variables,Partial Correlations Controlling all other Variables,Partial Correlations Controlling all other Variables,Partial Correlations Controlling all other Variables,Partial Correlations Controlling all other Variables
Unnamed: 0_level_1,protein,fat,sodium,carbo,sugars,vitamins
protein,1.0,0.27603,0.04601,-0.24286,-0.46579,0.13742
fat,0.27603,1.0,0.06283,-0.17007,0.26437,-0.04471
sodium,0.04601,0.06283,1.0,0.34895,0.17892,0.24654
carbo,-0.24286,-0.17007,0.34895,1.0,-0.41941,0.22101
sugars,-0.46579,0.26437,0.17892,-0.41941,1.0,0.21187
vitamins,0.13742,-0.04471,0.24654,0.22101,0.21187,1.0

Kaiser's Measure of Sampling Adequacy: Overall MSA = 0.45629545,Kaiser's Measure of Sampling Adequacy: Overall MSA = 0.45629545,Kaiser's Measure of Sampling Adequacy: Overall MSA = 0.45629545,Kaiser's Measure of Sampling Adequacy: Overall MSA = 0.45629545,Kaiser's Measure of Sampling Adequacy: Overall MSA = 0.45629545,Kaiser's Measure of Sampling Adequacy: Overall MSA = 0.45629545
protein,fat,sodium,carbo,sugars,vitamins
0.31545199,0.54749069,0.55096123,0.49253617,0.3705051,0.54955401

Average Partial Correlations Controlling Principal Components,Average Partial Correlations Controlling Principal Components,Average Partial Correlations Controlling Principal Components
N Prin Comp Partialled,Squared,Fourth- Powered
0,0.0538*,0.0053*
1,0.0846,0.0126
2,0.1256,0.0260
3,0.2192,0.1233
4,0.4528,0.3140
5,1.0000,1.0000
* MAP = Minimum Values in Columns,* MAP = Minimum Values in Columns,* MAP = Minimum Values in Columns

Parallel Analysis: NSims=10000 Seed=20170229,Parallel Analysis: NSims=10000 Seed=20170229,Parallel Analysis: NSims=10000 Seed=20170229
Unnamed: 0_level_1,Observed Eigenvalue,Simulated Crit Val
1,1.7739,1.6446*
2,1.4691,1.3597*
3,1.2008,1.1714*
4,0.6832,1.0302
5,0.5367,0.9115
6,0.3363,0.7919
"* Retained Dimension (Obs > Crit, alpha=0.01)","* Retained Dimension (Obs > Crit, alpha=0.01)","* Retained Dimension (Obs > Crit, alpha=0.01)"

Prior Communality Estimates: SMC,Prior Communality Estimates: SMC,Prior Communality Estimates: SMC,Prior Communality Estimates: SMC,Prior Communality Estimates: SMC,Prior Communality Estimates: SMC
protein,fat,sodium,carbo,sugars,vitamins
0.2568228,0.20311859,0.2408119,0.37859303,0.38132242,0.18998477

Eigenvalues of the Reduced Correlation Matrix: Total = 1.65065351 Average = 0.27510892,Eigenvalues of the Reduced Correlation Matrix: Total = 1.65065351 Average = 0.27510892,Eigenvalues of the Reduced Correlation Matrix: Total = 1.65065351 Average = 0.27510892,Eigenvalues of the Reduced Correlation Matrix: Total = 1.65065351 Average = 0.27510892,Eigenvalues of the Reduced Correlation Matrix: Total = 1.65065351 Average = 0.27510892
Unnamed: 0_level_1,Eigenvalue,Difference,Proportion,Cumulative
1,1.07120513,0.29508732,0.649,0.649
2,0.77611781,0.33885928,0.4702,1.1191
3,0.43725853,0.52907399,0.2649,1.384
4,-0.09181546,0.11412447,-0.0556,1.3284
5,-0.20593993,0.13023266,-0.1248,1.2037
6,-0.33617258,,-0.2037,1.0

Factor Pattern,Factor Pattern,Factor Pattern,Factor Pattern
Unnamed: 0_level_1,Factor1,Factor2,Factor3
carbo,0.70462,-0.03893,-0.02333
sodium,0.42571,0.34199,0.22313
vitamins,0.34943,0.31563,0.23906
fat,-0.39892,0.17201,0.32891
sugars,-0.31529,0.63445,-0.06942
protein,-0.11333,-0.35483,0.4656

Variance Explained by Each Factor,Variance Explained by Each Factor,Variance Explained by Each Factor
Factor1,Factor2,Factor3
1.0712051,0.7761178,0.4372585

Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581
protein,fat,sodium,carbo,sugars,vitamins
0.35552417,0.29690431,0.34797469,0.49854539,0.5067586,0.2788743

Residual Correlations With Uniqueness on the Diagonal,Residual Correlations With Uniqueness on the Diagonal,Residual Correlations With Uniqueness on the Diagonal,Residual Correlations With Uniqueness on the Diagonal,Residual Correlations With Uniqueness on the Diagonal,Residual Correlations With Uniqueness on the Diagonal,Residual Correlations With Uniqueness on the Diagonal
Unnamed: 0_level_1,protein,fat,sodium,carbo,sugars,vitamins
protein,0.64448,0.07112,0.01103,-0.05396,-0.10743,0.04762
fat,0.07112,0.7031,0.0322,-0.02259,0.05874,-0.02469
sodium,0.01103,0.0322,0.65203,0.07454,0.03419,0.05144
carbo,-0.05396,-0.02259,0.07454,0.50145,-0.08642,0.0298
sugars,-0.10743,0.05874,0.03419,-0.08642,0.49324,0.05165
vitamins,0.04762,-0.02469,0.05144,0.0298,0.05165,0.72113

Root Mean Square Off-Diagonal Residuals: Overall = 0.05651183,Root Mean Square Off-Diagonal Residuals: Overall = 0.05651183,Root Mean Square Off-Diagonal Residuals: Overall = 0.05651183,Root Mean Square Off-Diagonal Residuals: Overall = 0.05651183,Root Mean Square Off-Diagonal Residuals: Overall = 0.05651183,Root Mean Square Off-Diagonal Residuals: Overall = 0.05651183
protein,fat,sodium,carbo,sugars,vitamins
0.06618379,0.04618444,0.04588978,0.05888185,0.07252227,0.04261218

Partial Correlations Controlling Factors,Partial Correlations Controlling Factors,Partial Correlations Controlling Factors,Partial Correlations Controlling Factors,Partial Correlations Controlling Factors,Partial Correlations Controlling Factors,Partial Correlations Controlling Factors
Unnamed: 0_level_1,protein,fat,sodium,carbo,sugars,vitamins
protein,1.0,0.10565,0.01702,-0.09492,-0.19055,0.06986
fat,0.10565,1.0,0.04756,-0.03804,0.09975,-0.03467
sodium,0.01702,0.04756,1.0,0.13036,0.06029,0.07501
carbo,-0.09492,-0.03804,0.13036,1.0,-0.17377,0.04956
sugars,-0.19055,0.09975,0.06029,-0.17377,1.0,0.08661
vitamins,0.06986,-0.03467,0.07501,0.04956,0.08661,1.0

Root Mean Square Off-Diagonal Partials: Overall = 0.09761105,Root Mean Square Off-Diagonal Partials: Overall = 0.09761105,Root Mean Square Off-Diagonal Partials: Overall = 0.09761105,Root Mean Square Off-Diagonal Partials: Overall = 0.09761105,Root Mean Square Off-Diagonal Partials: Overall = 0.09761105,Root Mean Square Off-Diagonal Partials: Overall = 0.09761105
protein,fat,sodium,carbo,sugars,vitamins
0.11104132,0.07214266,0.07590282,0.10963938,0.13235444,0.06582579

Orthogonal Transformation Matrix,Orthogonal Transformation Matrix,Orthogonal Transformation Matrix,Orthogonal Transformation Matrix
Unnamed: 0_level_1,1,2,3
1,-0.74989,0.66148,-0.01001
2,0.50295,0.56022,-0.65818
3,0.42977,0.49859,0.7528

Rotated Factor Pattern,Rotated Factor Pattern,Rotated Factor Pattern,Rotated Factor Pattern
Unnamed: 0_level_1,Factor1,Factor2,Factor3
fat,0.52701,-0.00352,0.13837
sugars,0.5257,0.11226,-0.46668
carbo,-0.558,0.43265,0.00101
sodium,-0.05134,0.58444,-0.06138
vitamins,-0.00054,0.52716,-0.03127
protein,0.10662,-0.0416,0.58517

Variance Explained by Each Factor,Variance Explained by Each Factor,Variance Explained by Each Factor
Factor1,Factor2,Factor3
0.87946972,0.82099718,0.58411457

Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581
protein,fat,sodium,carbo,sugars,vitamins
0.35552417,0.29690431,0.34797469,0.49854539,0.5067586,0.2788743

Target Matrix for Procrustean Transformation,Target Matrix for Procrustean Transformation,Target Matrix for Procrustean Transformation,Target Matrix for Procrustean Transformation
Unnamed: 0_level_1,Factor1,Factor2,Factor3
fat,1.0,-0.0,0.01733
sugars,0.44512,0.00394,-0.29807
carbo,-0.5455,0.23128,0.0
sodium,-0.00073,0.97766,-0.00119
vitamins,-0.0,1.0,-0.00022
protein,0.00632,-0.00034,1.0

Procrustean Transformation Matrix,Procrustean Transformation Matrix,Procrustean Transformation Matrix,Procrustean Transformation Matrix
Unnamed: 0_level_1,1,2,3
1,1.26207,0.21151,0.15117
2,0.12804,1.55194,0.15324
3,0.14713,0.22264,1.30588

Normalized Oblique Transformation Matrix,Normalized Oblique Transformation Matrix,Normalized Oblique Transformation Matrix,Normalized Oblique Transformation Matrix
Unnamed: 0_level_1,1,2,3
1,-0.70964,0.57914,-0.01999
2,0.50121,0.55474,-0.55661
3,0.58945,0.69056,0.89715

Inter-Factor Correlations,Inter-Factor Correlations,Inter-Factor Correlations,Inter-Factor Correlations
Unnamed: 0_level_1,Factor1,Factor2,Factor3
Factor1,1.0,-0.19549,-0.18433
Factor2,-0.19549,1.0,-0.22201
Factor3,-0.18433,-0.22201,1.0

Rotated Factor Pattern (Standardized Regression Coefficients),Rotated Factor Pattern (Standardized Regression Coefficients),Rotated Factor Pattern (Standardized Regression Coefficients),Rotated Factor Pattern (Standardized Regression Coefficients)
Unnamed: 0_level_1,Factor1,Factor2,Factor3
fat,0.56318,0.09152,0.20731
sugars,0.50082,0.12142,-0.40912
carbo,-0.53329,0.37037,-0.01335
sodium,0.00083,0.59035,0.00131
vitamins,0.05115,0.54255,0.0318
protein,0.17703,0.05905,0.61747

Reference Axis Correlations,Reference Axis Correlations,Reference Axis Correlations,Reference Axis Correlations
Unnamed: 0_level_1,Factor1,Factor2,Factor3
Factor1,1.0,0.2467,0.23816
Factor2,0.2467,1.0,0.26771
Factor3,0.23816,0.26771,1.0

Reference Structure (Semipartial Correlations),Reference Structure (Semipartial Correlations),Reference Structure (Semipartial Correlations),Reference Structure (Semipartial Correlations)
Unnamed: 0_level_1,Factor1,Factor2,Factor3
fat,0.53642,0.08648,0.19632
sugars,0.47702,0.11473,-0.38743
carbo,-0.50795,0.34996,-0.01264
sodium,0.00079,0.55782,0.00124
vitamins,0.04872,0.51266,0.03012
protein,0.16862,0.0558,0.58474

Variance Explained by Each Factor Eliminating Other Factors,Variance Explained by Each Factor Eliminating Other Factors,Variance Explained by Each Factor Eliminating Other Factors
Factor1,Factor2,Factor3
0.80411983,0.72021099,0.53163363

Factor Structure (Correlations),Factor Structure (Correlations),Factor Structure (Correlations),Factor Structure (Correlations)
Unnamed: 0_level_1,Factor1,Factor2,Factor3
fat,0.50707,-0.0646,0.08318
sugars,0.5525,0.11434,-0.52839
carbo,-0.60324,0.47758,0.00273
sodium,-0.11482,0.58989,-0.1299
vitamins,-0.06078,0.52549,-0.09808
protein,0.05166,-0.11264,0.57173

Variance Explained by Each Factor Ignoring Other Factors,Variance Explained by Each Factor Ignoring Other Factors,Variance Explained by Each Factor Ignoring Other Factors
Factor1,Factor2,Factor3
0.94581866,0.8821324,0.6394957

Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581,Final Communality Estimates: Total = 2.284581
protein,fat,sodium,carbo,sugars,vitamins
0.35552417,0.29690431,0.34797469,0.49854539,0.5067586,0.2788743


### If the data are appropriate for the common factor model, the partial correlations (controlling all other variables) should be small compared to the original correlations.

Before conducting a principal components analysis, you want to check the correlations between the variables.  If any of the correlations are too high (say above .9), you may need to remove one of the variables from the analysis, as the two variables seem to be measuring the same thing.  Another alternative would be to combine the variables in some way (perhaps by taking the average).  If the correlations are too low, say below .1, then one or more of the variables might load only onto one factor (in other words, make its own factor).  This is not helpful, as the whole point of the analysis is to reduce the number of items (variables). 
    https://stats.oarc.ucla.edu/sas/output/factor-analysis/

Kaiser’s MSA is a summary, for each variable and for all variables together, of how much smaller the partial correlations are than the original correlations. Values of 0.8 or 0.9 are considered good, while MSAs below 0.5 are unacceptable. Here protein (0.32), sugars (0.37), and carbo (0.49) fall below 0.5.

Fat, sodium, and vitamins are only slightly better, at about 0.55. The overall MSA of 0.46 is sufficiently poor that additional variables should be included in the analysis to better define the common factors. 
A commonly used rule is that there should be at least three variables per factor.
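
The overall MSA can be reproduced from the two correlation tables in the PROC FACTOR output. The sketch below uses Kaiser's definition: the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations. The values are the upper triangles of the two tables, copied row by row; `overall_msa` is a hypothetical helper name.

```python
# Upper-triangle entries (row by row: protein, fat, sodium, carbo, sugars) copied from
# "Correlations" and "Partial Correlations Controlling all other Variables".
corr = [0.20843, -0.05467, -0.13086, -0.32914, 0.00734,
        -0.00541, -0.31804, 0.27082, -0.03116,
        0.35598, 0.10145, 0.36148,
        -0.33167, 0.25815,
        0.12514]
partial = [0.27603, 0.04601, -0.24286, -0.46579, 0.13742,
           0.06283, -0.17007, 0.26437, -0.04471,
           0.34895, 0.17892, 0.24654,
           -0.41941, 0.22101,
           0.21187]


def overall_msa(r, q):
    """Kaiser's overall MSA from off-diagonal correlations r and partial correlations q."""
    sum_r2 = sum(x * x for x in r)
    sum_q2 = sum(x * x for x in q)
    return sum_r2 / (sum_r2 + sum_q2)


msa = overall_msa(corr, partial)
print(round(msa, 4))
```

Because the partial correlations here are about as large as the raw correlations, the ratio stays below 0.5, which is what the low MSA is signalling.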

### K-means clustering with proc fastclus

In [76]:
%%SAS sas_sess

/* Note : If you want a complete convergence (i.e. no relative change in cluster seeds), 
    set converge = 0 and a large value for the MAXITER option. */
    
proc fastclus data=outdata /* use the standardized data */
    maxclusters=6  /* maximum number of clusters for the k-means algorithm to form */
    maxiter=100 
    converge=0
    mean=mean     /* output data set with the cluster means and other statistics for each cluster */
    out=clust;    /* output data set with the original variables plus two new variables, cluster and distance */
    var &interval;
run;

/* Canonical discriminant analysis:
   finds linear combinations of the numeric variables that provide maximal separation between the clusters. */
proc candisc data=clust out=egclustcan;
    var &interval;
    class cluster;
run;

/*Plots the two canonical variables generated from PROC CANDISC, can1 and can2. */

proc sgplot data=egclustcan;
    scatter y=can2 x=can1 / group=cluster;
run; 

Initial Seeds,Initial Seeds,Initial Seeds,Initial Seeds,Initial Seeds,Initial Seeds,Initial Seeds
Cluster,protein,fat,sodium,carbo,sugars,vitamins
1,0.415189725,-1.006472559,-0.950413256,1.49629886,-1.557313026,-1.26425981
2,0.415189725,3.961372766,-1.72577077,-1.541825194,0.242508407,-1.26425981
3,3.15544191,0.980665571,1.554587942,0.561491458,-1.332335347,-0.14531722
4,-0.49822767,-0.012903494,0.481016,1.49629886,-0.882379989,3.211510551
5,2.242024515,0.980665571,-1.904699427,-3.645141847,-1.782290706,-1.26425981
6,-0.49822767,-0.012903494,-1.069699027,-1.308123344,1.817352162,-0.14531722

0,1
Minimum Distance Between Initial Seeds =,4.461978

Iteration History (columns 1-6: Relative Change in Cluster Seeds, per cluster)
Iteration,Criterion,1,2,3,4,5,6
1,0.9734,0.367,0.4065,0.4302,0.266,0,0.3958
2,0.6713,0.0331,0.1916,0.1988,0.0,0,0.0461
3,0.6531,0.0,0.0,0.0,0.0,0,0.0

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 0.6531

Cluster Summary,Cluster Summary,Cluster Summary,Cluster Summary,Cluster Summary,Cluster Summary,Cluster Summary
Cluster,Frequency,RMS Std Deviation,Maximum Distance from Seed to Observation,Radius Exceeded,Nearest Cluster,Distance Between Cluster Centroids
1,25,0.7846,2.7372,,6,2.3641
2,5,0.6894,2.3517,,6,2.8048
3,7,0.7770,2.5078,,1,2.665
4,6,0.6030,1.9211,,6,3.7968
5,1,.,0.0,,3,4.1238
6,33,0.5791,2.2165,,1,2.3641

Statistics for Variables,Statistics for Variables,Statistics for Variables,Statistics for Variables,Statistics for Variables
Variable,Total STD,Within STD,R-Square,RSQ/(1-RSQ)
protein,1.0,0.6973,0.545758,1.20147
fat,1.0,0.69141,0.5534,1.239139
sodium,1.0,0.94683,0.1625,0.19403
carbo,1.0,0.69327,0.551,1.227174
sugars,1.0,0.58307,0.682397,2.148586
vitamins,1.0,0.30744,0.911699,10.324869
OVER-ALL,1.0,0.68018,0.567792,1.313702

Pseudo F Statistic = 18.65

Approximate Expected Over-All R-Squared = 0.49631

Cubic Clustering Criterion = 4.911

Cluster Means,Cluster Means,Cluster Means,Cluster Means,Cluster Means,Cluster Means,Cluster Means
Cluster,protein,fat,sodium,carbo,sugars,vitamins
1,-0.096324016,-0.688530459,-0.256170067,0.757801013,-0.900378203,-0.413863441
2,0.780556683,2.371662262,-0.771484599,-0.466796683,0.287503943,-0.369105738
3,1.850559917,-0.012903494,0.370250641,-0.773947686,-0.689541978,-0.14531722
4,0.11071726,-0.178498339,0.580420809,0.834143617,-0.132454391,3.211510551
5,2.242024515,0.980665571,-1.904699427,-3.645141847,-1.782290706,-1.26425981
6,-0.525906985,0.167745427,0.184608932,-0.380397817,0.862901402,-0.14531722

Cluster Standard Deviations,Cluster Standard Deviations,Cluster Standard Deviations,Cluster Standard Deviations,Cluster Standard Deviations,Cluster Standard Deviations,Cluster Standard Deviations
Cluster,protein,fat,sodium,carbo,sugars,vitamins
1,0.650175267,0.553195843,1.357326719,0.803018964,0.489983274,0.487735767
2,0.500299312,0.888675188,0.649258767,0.836117158,0.663588817,0.500406339
3,0.891404102,0.811245745,0.803467796,1.093783630,0.572531307,0.000000000
4,0.471686714,0.405622872,0.672317578,0.580346524,1.002771391,0.000000000
5,.,.,.,.,.,.
6,0.739418592,0.763762616,0.602453089,0.467348370,0.548290219,0.000000000

Total Sample Size,77,DF Total,76
Variables,6,DF Within Classes,71
Classes,6,DF Between Classes,5

Number of Observations Read,77
Number of Observations Used,77

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
CLUSTER,Variable Name,Frequency,Weight,Proportion
1,1,25,25.0,0.324675
2,2,5,5.0,0.064935
3,3,7,7.0,0.090909
4,4,6,6.0,0.077922
5,5,1,1.0,0.012987
6,6,33,33.0,0.428571

Multivariate Statistics and F Approximations,Multivariate Statistics and F Approximations,Multivariate Statistics and F Approximations,Multivariate Statistics and F Approximations,Multivariate Statistics and F Approximations,Multivariate Statistics and F Approximations
S=5 M=0 N=32,S=5 M=0 N=32,S=5 M=0 N=32,S=5 M=0 N=32,S=5 M=0 N=32,S=5 M=0 N=32
Statistic,Value,F Value,Num DF,Den DF,Pr > F
Wilks' Lambda,0.00192879,33.44,30,266,<.0001
Pillai's Trace,2.96376251,16.98,30,350,<.0001
Hotelling-Lawley Trace,24.02875462,51.89,30,165.2,<.0001
Roy's Greatest Root,17.67309718,206.19,6,70,<.0001
NOTE: F Statistic for Roy's Greatest Root is an upper bound.,NOTE: F Statistic for Roy's Greatest Root is an upper bound.,NOTE: F Statistic for Roy's Greatest Root is an upper bound.,NOTE: F Statistic for Roy's Greatest Root is an upper bound.,NOTE: F Statistic for Roy's Greatest Root is an upper bound.,NOTE: F Statistic for Roy's Greatest Root is an upper bound.

Canonical Correlations (Eigenvalue through Cumulative: Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq); Likelihood Ratio through Pr > F: Test of H0: The canonical correlations in the current row and all that follow are zero)
,Canonical Correlation,Adjusted Canonical Correlation,Approximate Standard Error,Squared Canonical Correlation,Eigenvalue,Difference,Proportion,Cumulative,Likelihood Ratio,Approximate F Value,Num DF,Den DF,Pr > F
1,0.972855,0.969635,0.006143,0.946447,17.6731,14.5949,0.7355,0.7355,0.00192879,33.44,30,266.0,<.0001
2,0.86879,0.844347,0.028127,0.754796,3.0782,0.7184,0.1281,0.8636,0.03601657,19.24,20,223.16,<.0001
3,0.838072,.,0.034141,0.702365,2.3598,1.582,0.0982,0.9618,0.14688437,15.99,12,180.2,<.0001
4,0.661442,0.654453,0.064523,0.437505,0.7778,0.638,0.0324,0.9942,0.49350572,9.74,6,138.0,<.0001
5,0.350212,0.346881,0.100639,0.122648,0.1398,,0.0058,1.0,0.87735156,4.89,2,70.0,0.0103

Total Canonical Structure,Total Canonical Structure,Total Canonical Structure,Total Canonical Structure,Total Canonical Structure,Total Canonical Structure
Variable,Can1,Can2,Can3,Can4,Can5
protein,0.058747,-0.434385,0.722476,0.12404,0.466772
fat,-0.110145,0.482404,0.501075,0.657346,0.084256
sodium,0.171455,0.175704,-0.043785,-0.411655,0.540942
carbo,0.320532,-0.307485,-0.691585,0.186033,0.505353
sugars,-0.123073,0.928527,-0.02128,-0.185664,0.124637
vitamins,0.953103,0.236666,0.034915,-0.125935,0.123481

Between Canonical Structure,Between Canonical Structure,Between Canonical Structure,Between Canonical Structure,Between Canonical Structure,Between Canonical Structure
Variable,Can1,Can2,Can3,Can4,Can5
protein,0.077363,-0.510846,0.819606,0.111059,0.221277
fat,-0.144043,0.563387,0.564502,0.584475,0.039665
sodium,0.413783,0.378677,-0.091029,-0.675459,0.469954
carbo,0.420092,-0.359885,-0.78082,0.16577,0.238424
sugars,-0.144941,0.976542,-0.021589,-0.148662,0.052839
vitamins,0.971096,0.21534,0.030645,-0.08724,0.04529

Pooled Within Canonical Structure,Pooled Within Canonical Structure,Pooled Within Canonical Structure,Pooled Within Canonical Structure,Pooled Within Canonical Structure,Pooled Within Canonical Structure
Variable,Can1,Can2,Can3,Can4,Can5
protein,0.020171,-0.31915,0.584819,0.138031,0.648706
fat,-0.038141,0.35745,0.409058,0.737724,0.118094
sodium,0.043356,0.095072,-0.026102,-0.337365,0.553662
carbo,0.110698,-0.227229,-0.563073,0.208222,0.706413
sugars,-0.050537,0.81586,-0.0206,-0.247084,0.207153
vitamins,0.742246,0.39438,0.064101,-0.317851,0.389228

Total-Sample Standardized Canonical Coefficients,Total-Sample Standardized Canonical Coefficients,Total-Sample Standardized Canonical Coefficients,Total-Sample Standardized Canonical Coefficients,Total-Sample Standardized Canonical Coefficients,Total-Sample Standardized Canonical Coefficients
Variable,Can1,Can2,Can3,Can4,Can5
protein,-0.12844314,-0.550950602,0.933926916,-0.106203364,0.834868195
fat,0.015497733,0.656288848,0.382685145,1.187591729,-0.024053153
sodium,-0.832048419,0.114132074,0.360580504,-0.785325019,0.360162054
carbo,0.179290149,-0.13346166,-1.305129673,0.805910405,0.796038678
sugars,-0.961864314,1.35873411,-0.342989682,-0.246654808,0.674885826
vitamins,4.35707221,0.30960679,0.316424875,-0.017818583,-0.299592821

Pooled Within-Class Standardized Canonical Coefficients,Pooled Within-Class Standardized Canonical Coefficients,Pooled Within-Class Standardized Canonical Coefficients,Pooled Within-Class Standardized Canonical Coefficients,Pooled Within-Class Standardized Canonical Coefficients,Pooled Within-Class Standardized Canonical Coefficients
Variable,Can1,Can2,Can3,Can4,Can5
protein,-0.089563731,-0.384179267,0.651229633,-0.074055878,0.582155732
fat,0.010715324,0.453766216,0.264593237,0.821115591,-0.016630647
sodium,-0.787804913,0.108063192,0.341406926,-0.743565992,0.341010726
carbo,0.124295955,-0.092524573,-0.904803417,0.55871114,0.551867397
sugars,-0.560833143,0.792235569,-0.199986608,-0.143816742,0.393504919
vitamins,1.339540296,0.095185655,0.09728181,-0.005478153,-0.092106956

Raw Canonical Coefficients,Raw Canonical Coefficients,Raw Canonical Coefficients,Raw Canonical Coefficients,Raw Canonical Coefficients,Raw Canonical Coefficients
Variable,Can1,Can2,Can3,Can4,Can5
protein,-0.12844314,-0.550950602,0.933926916,-0.106203364,0.834868195
fat,0.015497733,0.656288848,0.382685145,1.187591729,-0.024053153
sodium,-0.832048419,0.114132074,0.360580504,-0.785325019,0.360162054
carbo,0.179290149,-0.13346166,-1.305129673,0.805910405,0.796038678
sugars,-0.961864314,1.35873411,-0.342989682,-0.246654808,0.674885826
vitamins,4.35707221,0.30960679,0.316424875,-0.017818583,-0.299592821

Class Means on Canonical Variables,Class Means on Canonical Variables,Class Means on Canonical Variables,Class Means on Canonical Variables,Class Means on Canonical Variables,Class Means on Canonical Variables
CLUSTER,Can1,Can2,Can3,Can4,Can5
1,-0.58647758,-1.8806892,-1.25698474,0.23389041,-0.03654232
2,-1.3900414,1.37705941,1.75222559,2.89900131,0.24978207
3,-0.65463115,-1.86438111,3.05748079,-0.95369117,0.64071581
4,13.76981485,0.5911074,0.21734832,-0.03186971,-0.0817532
5,-3.13565991,-3.13562809,6.75102256,-0.05319196,-2.56354706
6,-1.61480802,1.59913892,-0.20587841,-0.40672783,-0.05352414
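
As a rough cross-check of the PROC CANDISC tables above, the following Python sketch recomputes the eigenvalues, their proportions, Wilks' Lambda, and Pillai's Trace from the squared canonical correlations. The inputs are copied by hand from the six-cluster output, the variable names are illustrative only, and the identities used are the standard textbook ones (for example, eigenvalue = CanRsq/(1-CanRsq), as stated in the table header). The results should agree with SAS only up to rounding.

```python
import math

# Squared canonical correlations copied from the six-cluster PROC CANDISC output.
can_rsq = [0.946447, 0.754796, 0.702365, 0.437505, 0.122648]

# Eigenvalue of Inv(E)*H for each canonical variable: CanRsq / (1 - CanRsq).
eigenvalues = [r / (1.0 - r) for r in can_rsq]

# Share of the summed eigenvalues carried by each canonical variable.
total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

# Wilks' Lambda is the product of (1 - CanRsq); Pillai's Trace is the sum of CanRsq.
wilks_lambda = math.prod(1.0 - r for r in can_rsq)
pillai_trace = sum(can_rsq)

for i, (ev, p) in enumerate(zip(eigenvalues, proportions), start=1):
    print(f"Can{i}: eigenvalue={ev:.4f} proportion={p:.4f}")
print(f"Wilks' Lambda  = {wilks_lambda:.8f}")
print(f"Pillai's Trace = {pillai_trace:.6f}")
```

The first canonical variable carries roughly three quarters of the summed eigenvalues, which is consistent with the very large Can1 class mean for cluster 4 in the table above.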


Note:
A CCC value greater than 2 or 3 indicates good clustering.
A CCC value between 0 and 2 indicates possible clusters, but these should be interpreted cautiously.

    The pseudo-F statistic is intended to capture the 'tightness' of clusters; it is in essence the ratio of the mean sum of squares between groups to the mean sum of squares within groups.
    Larger pseudo-F values usually indicate a better clustering solution. When pseudo-F is compared across values of k, the value of k at which it reaches a maximum, or the value immediately before that point, may be a candidate for k.
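
The ratio described above can be written in terms of the over-all R-square as pseudo-F = [R²/(k-1)] / [(1-R²)/(n-k)]. The short Python sketch below is illustrative only: the function name `pseudo_f` is made up for this example, and it plugs in the over-all R-square (0.567792), the sample size (77), and the number of clusters (6) from the PROC FASTCLUS output. It should land close to the reported value of 18.65.

```python
def pseudo_f(r_square: float, n_obs: int, n_clusters: int) -> float:
    """Pseudo-F = [R^2 / (k - 1)] / [(1 - R^2) / (n - k)]."""
    between = r_square / (n_clusters - 1)            # mean square between clusters (scaled)
    within = (1.0 - r_square) / (n_obs - n_clusters)  # mean square within clusters (scaled)
    return between / within


# Values taken from the six-cluster PROC FASTCLUS output.
f_six = pseudo_f(r_square=0.567792, n_obs=77, n_clusters=6)
print(f"Pseudo-F for k=6: {f_six:.2f}")
```

To compare candidate values of k, one would rerun PROC FASTCLUS with different MAXCLUSTERS values and compare the resulting statistics.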

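#### Each observation is assigned to the cluster with the nearest seed. The Approximate Expected Over-All R-Squared is 0.49631 and the observed over-all R-squared is 0.5678; both are below 0.7, so the model is not a good fit, even though the CCC of 4.911 is above the rule-of-thumb threshold. The maximum distance from seed to observation is largest in the 1st cluster (2.7372), compared with 1.9211 in the 4th cluster; the 5th cluster contains a single observation.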
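
The R-Square column of the "Statistics for Variables" table can be recovered from the Total STD and Within STD columns. The Python sketch below assumes the usual degrees-of-freedom convention (n-1 for the total, n-k for the pooled within-cluster standard deviation); the function name and the dictionary are hypothetical, and the inputs are copied from the table, so agreement is only up to rounding.

```python
def r_square_from_std(total_std, within_std, n_obs, n_clusters):
    """R^2 = 1 - SS_within / SS_total, with SS rebuilt from the standard deviations."""
    within_ss = (n_obs - n_clusters) * within_std ** 2
    total_ss = (n_obs - 1) * total_std ** 2
    return 1.0 - within_ss / total_ss


# (Total STD, Within STD) pairs copied from the PROC FASTCLUS output.
stats = {
    "protein": (1.0, 0.69730),
    "fat": (1.0, 0.69141),
    "sodium": (1.0, 0.94683),
    "carbo": (1.0, 0.69327),
    "sugars": (1.0, 0.58307),
    "vitamins": (1.0, 0.30744),
    "OVER-ALL": (1.0, 0.68018),
}

r2 = {name: r_square_from_std(t, w, n_obs=77, n_clusters=6) for name, (t, w) in stats.items()}
for name, value in r2.items():
    # RSQ/(1-RSQ) is the ratio of between-cluster to within-cluster variance.
    print(f"{name:9s} R-Square={value:.4f}  RSQ/(1-RSQ)={value / (1.0 - value):.3f}")
```

In this output vitamins is separated best by the clusters and sodium worst, which matches the R-Square column.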

## Cluster Analysis - Hierarchical Clustering (Centroid Method)

In [77]:
%%SAS sas_sess

/* Performing hierarchical cluster analysis with PROC CLUSTER (centroid method) */

proc cluster data = outdata 
    method = centroid /* using the CENTROID method */
    ccc print=6    /* CCC (Cubic Clustering Criterion) helps to find the optimum number of clusters; PRINT=6 displays the last 6 generations of the cluster history */
    outtree=Tree;
    var &interval;
run;

Eigenvalues of the Covariance Matrix,Eigenvalues of the Covariance Matrix,Eigenvalues of the Covariance Matrix,Eigenvalues of the Covariance Matrix,Eigenvalues of the Covariance Matrix
,Eigenvalue,Difference,Proportion,Cumulative
1,1.77394528,0.30488623,0.2957,0.2957
2,1.46905905,0.26825986,0.2448,0.5405
3,1.20079919,0.51759228,0.2001,0.7406
4,0.68320691,0.14649646,0.1139,0.8545
5,0.53671045,0.20043133,0.0895,0.944
6,0.33627912,,0.056,1.0

Root-Mean-Square Total-Sample Standard Deviation,1

Root-Mean-Square Distance Between Observations,3.464102

Cluster History,Cluster History,Cluster History,Cluster History,Cluster History,Cluster History,Cluster History,Cluster History,Cluster History,Cluster History
Number of Clusters,Clusters Joined,Clusters Joined.1,Freq,Semipartial R-Square,R-Square,Approximate Expected R-Square,Cubic Clustering Criterion,Norm Centroid Distance,Tie
6,CL16,CL8,57,0.0459,0.474,0.583,-6.1,0.7838,
5,CL6,CL13,67,0.1438,0.331,0.543,-9.8,0.8013,
4,CL5,CL7,73,0.1589,0.172,0.489,-12.0,1.0473,
3,CL4,CL9,75,0.0583,0.113,0.39,-10.0,1.0672,
2,OB2,OB58,2,0.0228,0.09,0.239,-6.1,1.3159,
1,CL3,CL2,77,0.0904,0.0,0.0,0.0,1.3282,


#### The first 5 eigenvalues account for about 94.40% of the total variance, which suggests trying 5 clusters. This can be cross-checked against the CCC plot: the CCC drops to its lowest value at 4 clusters (-12.0 in the cluster history above). Hence, the optimum number of clusters would be 4.
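
The Proportion and Cumulative columns of the eigenvalue table are simple ratios, which the Python sketch below reproduces from the eigenvalues copied out of the PROC CLUSTER output. As a side check, it also recomputes the root-mean-square distance between observations as the square root of twice the total variance, a standard identity assumed here rather than taken from the notebook. All names are illustrative.

```python
import math

# Eigenvalues of the covariance matrix, copied from the PROC CLUSTER output.
eigenvalues = [1.77394528, 1.46905905, 1.20079919, 0.68320691, 0.53671045, 0.33627912]

total = sum(eigenvalues)  # total variance (6 standardized variables)
proportions = [ev / total for ev in eigenvalues]

# Running share of the total variance.
cumulative = []
running = 0.0
for ev in eigenvalues:
    running += ev
    cumulative.append(running / total)

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"{i}: proportion={p:.4f} cumulative={c:.4f}")

# Assumed identity: mean squared distance between pairs = 2 * total variance.
rms_distance = math.sqrt(2.0 * total)
print(f"RMS distance between observations = {rms_distance:.6f}")
```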

In [78]:
%%SAS sas_sess

/* Retaining 3 clusters */
proc tree data=tree noprint ncl=3 out=out;  /* assign each observation to one of three clusters */
    copy &interval;
run;

/* Canonical variables for the scatterplot (DATA=out is the data set PROC TREE just created) */
proc candisc data = out out = can;
    class cluster;
    var &interval;
run;

proc sgplot data = can;
    title "Cluster Analysis for Cereal datasets";
    scatter y = can2 x = can1 / group = cluster;
run;
title;

Total Sample Size,77,DF Total,76
Variables,6,DF Within Classes,74
Classes,3,DF Between Classes,2

Number of Observations Read,77
Number of Observations Used,77

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
CLUSTER,Variable Name,Frequency,Weight,Proportion
1,1,75,75.0,0.974026
2,2,1,1.0,0.012987
3,3,1,1.0,0.012987

Multivariate Statistics and F Approximations,Multivariate Statistics and F Approximations,Multivariate Statistics and F Approximations,Multivariate Statistics and F Approximations,Multivariate Statistics and F Approximations,Multivariate Statistics and F Approximations
S=2 M=1.5 N=33.5,S=2 M=1.5 N=33.5,S=2 M=1.5 N=33.5,S=2 M=1.5 N=33.5,S=2 M=1.5 N=33.5,S=2 M=1.5 N=33.5
Statistic,Value,F Value,Num DF,Den DF,Pr > F
Wilks' Lambda,0.48811914,4.96,12,138,<.0001
Pillai's Trace,0.58699889,4.85,12,140,<.0001
Hotelling-Lawley Trace,0.89478735,5.09,12,104.29,<.0001
Roy's Greatest Root,0.66249431,7.73,6,70,<.0001
NOTE: F Statistic for Roy's Greatest Root is an upper bound.,NOTE: F Statistic for Roy's Greatest Root is an upper bound.,NOTE: F Statistic for Roy's Greatest Root is an upper bound.,NOTE: F Statistic for Roy's Greatest Root is an upper bound.,NOTE: F Statistic for Roy's Greatest Root is an upper bound.,NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.,NOTE: F Statistic for Wilks' Lambda is exact.,NOTE: F Statistic for Wilks' Lambda is exact.,NOTE: F Statistic for Wilks' Lambda is exact.,NOTE: F Statistic for Wilks' Lambda is exact.,NOTE: F Statistic for Wilks' Lambda is exact.

Canonical Correlations (Eigenvalue through Cumulative: Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq); Likelihood Ratio through Pr > F: Test of H0: The canonical correlations in the current row and all that follow are zero)
,Canonical Correlation,Adjusted Canonical Correlation,Approximate Standard Error,Squared Canonical Correlation,Eigenvalue,Difference,Proportion,Cumulative,Likelihood Ratio,Approximate F Value,Num DF,Den DF,Pr > F
1,0.631264,0.589379,0.068997,0.398494,0.6625,0.4302,0.7404,0.7404,0.48811914,4.96,12,138,<.0001
2,0.434171,0.400583,0.093085,0.188505,0.2323,,0.2596,1.0,0.81149529,3.25,5,70,0.0107

Total Canonical Structure,Total Canonical Structure,Total Canonical Structure
Variable,Can1,Can2
protein,-0.372188,-0.276833
fat,-0.591424,0.667672
sodium,0.471895,-0.051345
carbo,0.703409,0.269199
sugars,0.230941,0.339549
vitamins,0.326692,-0.058864

Between Canonical Structure,Between Canonical Structure,Between Canonical Structure
Variable,Can1,Can2
protein,-0.890269,-0.455435
fat,-0.789859,0.613288
sodium,0.997212,-0.074626
carbo,0.96706,0.254548
sugars,0.703146,0.711046
vitamins,0.992408,-0.122986

Pooled Within Canonical Structure,Pooled Within Canonical Structure,Pooled Within Canonical Structure
Variable,Can1,Can2
protein,-0.299267,-0.258545
fat,-0.520505,0.682516
sodium,0.383497,-0.048466
carbo,0.614104,0.27298
sugars,0.183089,0.312671
vitamins,0.259027,-0.05421

Total-Sample Standardized Canonical Coefficients,Total-Sample Standardized Canonical Coefficients,Total-Sample Standardized Canonical Coefficients
Variable,Can1,Can2
protein,0.0439887809,-0.2946165845
fat,-0.6941207058,0.9587202715
sodium,0.2185717073,-0.3412270439
carbo,0.8533567145,0.8431671802
sugars,0.7571841961,0.3509521069
vitamins,-0.000351286,-0.1706810033

Pooled Within-Class Standardized Canonical Coefficients,Pooled Within-Class Standardized Canonical Coefficients,Pooled Within-Class Standardized Canonical Coefficients
Variable,Can1,Can2
protein,0.0429988318,-0.2879863617
fat,-0.619896767,0.8562020867
sodium,0.211391692,-0.3300178375
carbo,0.7682579708,0.7590845609
sugars,0.7506741876,0.3479347418
vitamins,-0.0003482299,-0.169196134

Raw Canonical Coefficients,Raw Canonical Coefficients,Raw Canonical Coefficients
Variable,Can1,Can2
protein,0.0439887809,-0.2946165845
fat,-0.6941207058,0.9587202715
sodium,0.2185717073,-0.3412270439
carbo,0.8533567145,0.8431671802
sugars,0.7571841961,0.3509521069
vitamins,-0.000351286,-0.1706810033

Class Means on Canonical Variables,Class Means on Canonical Variables,Class Means on Canonical Variables
CLUSTER,Can1,Can2
1,0.129311271,-0.009489168
2,-4.240271066,3.265283821
3,-5.458074267,-2.553596206


##### Advantages and Disadvantages of K-means Clustering

    Advantages:

    1) Works well in practice even when some of its assumptions are violated.

    2) Simple and easy to implement.

    3) Clustering results are easy to interpret.

    4) Fast and efficient in terms of computational cost.

##### Disadvantages:

    1) Uniform effect: it often produces clusters of relatively uniform size even when the true clusters differ in size.

    2) It may perform poorly when clusters have different densities.

    3) Sensitive to outliers.

    4) The value of K must be chosen before running K-means clustering.

https://towardsdatascience.com/k-means-clustering-in-sas-9d19efd4fb1b
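
To make the outlier sensitivity and the need to fix K in advance concrete, here is a minimal Python sketch of plain Lloyd-style k-means on made-up 2-D points. It is not how PROC FASTCLUS selects seeds or iterates; the function `kmeans`, the points, and the initial seeds are all hypothetical and chosen only for illustration.

```python
import math


def kmeans(points, seeds, max_iter=100):
    """Assign each point to its nearest seed, then move each seed to its cluster mean."""
    seeds = [tuple(s) for s in seeds]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in seeds]
        for p in points:
            nearest = min(range(len(seeds)), key=lambda j: math.dist(p, seeds[j]))
            clusters[nearest].append(p)
        new_seeds = []
        for j, members in enumerate(clusters):
            if members:
                new_seeds.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_seeds.append(seeds[j])  # keep the seed of an empty cluster
        if new_seeds == seeds:  # no change in seeds: converged
            break
        seeds = new_seeds
    return seeds, clusters


# Two compact groups; K=2 is supplied up front through the two initial seeds.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
clean_seeds, clean_clusters = kmeans(points, seeds=[(0, 0), (10, 10)])
print("without outlier:", clean_seeds)

# A single extreme point drags the second centroid away from its group.
outlier_seeds, outlier_clusters = kmeans(points + [(50, 50)], seeds=[(0, 0), (10, 10)])
print("with outlier:   ", outlier_seeds)
```

With K fixed at 2, the outlier cannot form its own cluster, so it is absorbed into the nearest one and shifts that cluster's mean.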

### Variable Cluster Analysis

#### Variable cluster analysis (implemented in SAS through PROC VARCLUS) is another variable reduction method that often has distinct advantages over the traditional Factor Analysis (FA) approach. It borrows some ideas from Factor Analysis and some from hierarchical clustering, and produces either disjoint or hierarchical clusters.

https://apha.confex.com/apha/responses/134am/7422.ppt

In [79]:
%%SAS sas_sess

proc varclus data=outdata outtree=tree centroid maxclusters=6;
      var &interval;
run;

Observations,77,Proportion,1
Variables,6,Maxeigen,0

Clustering algorithm converged.

Cluster Summary for 1 Cluster,Cluster Summary for 1 Cluster,Cluster Summary for 1 Cluster,Cluster Summary for 1 Cluster,Cluster Summary for 1 Cluster
Cluster,Members,Cluster Variation,Variation Explained,Proportion Explained
1,6,6,1.16261,0.1938

Clustering algorithm converged.

Cluster Summary for 2 Clusters,Cluster Summary for 2 Clusters,Cluster Summary for 2 Clusters,Cluster Summary for 2 Clusters,Cluster Summary for 2 Clusters
Cluster,Members,Cluster Variation,Variation Explained,Proportion Explained
1,3,3,1.650405,0.5501
2,3,3,1.100072,0.3667

2 Clusters,2 Clusters,R-squared with,R-squared with,1-R**2 Ratio
Cluster,Variable,Own Cluster,Next Closest,1-R**2 Ratio
Cluster 1,sodium,0.5957,0.0005,0.4045
,carbo,0.5262,0.1846,0.5811
,vitamins,0.5298,0.0031,0.4717
Cluster 2,protein,0.2343,0.0064,0.7707
,fat,0.663,0.0254,0.3457
,sugars,0.2687,0.0022,0.7329

Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients
Cluster,1,2
protein,0.0,0.550464
fat,0.0,0.550464
sodium,0.449411,0.0
carbo,0.449411,0.0
sugars,0.0,0.550464
vitamins,0.449411,0.0

Cluster Structure,Cluster Structure,Cluster Structure
Cluster,1,2
protein,-0.080086,0.484017
fat,-0.159365,0.814274
sodium,0.771846,0.022772
carbo,0.725409,-0.429677
sugars,-0.047223,0.518359
vitamins,0.727878,0.055771

Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations
Cluster,1,2
1,1.0,-0.1578
2,-0.1578,1.0

Clustering algorithm converged.

Cluster Summary for 3 Clusters,Cluster Summary for 3 Clusters,Cluster Summary for 3 Clusters,Cluster Summary for 3 Clusters,Cluster Summary for 3 Clusters
Cluster,Members,Cluster Variation,Variation Explained,Proportion Explained
1,3,3,1.650405,0.5501
2,2,2,1.270819,0.6354
3,1,1,1.0,1.0

3 Clusters,3 Clusters,R-squared with,R-squared with,1-R**2 Ratio
Cluster,Variable,Own Cluster,Next Closest,1-R**2 Ratio
Cluster 1,sodium,0.5957,0.0036,0.4057
,carbo,0.5262,0.1661,0.5681
,vitamins,0.5298,0.0035,0.4718
Cluster 2,fat,0.6354,0.0434,0.3811
,sugars,0.6354,0.1083,0.4089
Cluster 3,protein,1.0,0.0064,0.0

Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients
Cluster,1,2,3
protein,0.0,0.0,1.0
fat,0.0,0.62725,0.0
sodium,0.44941,0.0,0.0
carbo,0.44941,0.0,0.0
sugars,0.0,0.62725,0.0
vitamins,0.44941,0.0,0.0

Cluster Structure,Cluster Structure,Cluster Structure,Cluster Structure
Cluster,1,2,3
protein,-0.08009,-0.07572,1.0
fat,-0.15936,0.79713,0.20843
sodium,0.77185,0.06024,-0.05467
carbo,0.72541,-0.40753,-0.13086
sugars,-0.04722,0.79713,-0.32914
vitamins,0.72788,0.05895,0.00734

Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations
Cluster,1,2,3
1,1.0,-0.12958,-0.08009
2,-0.12958,1.0,-0.07572
3,-0.08009,-0.07572,1.0

Clustering algorithm converged.

Cluster Summary for 4 Clusters,Cluster Summary for 4 Clusters,Cluster Summary for 4 Clusters,Cluster Summary for 4 Clusters,Cluster Summary for 4 Clusters
Cluster,Members,Cluster Variation,Variation Explained,Proportion Explained
1,2,2,1.361477,0.6807
2,2,2,1.270819,0.6354
3,1,1,1.0,1.0
4,1,1,1.0,1.0

4 Clusters,4 Clusters,R-squared with,R-squared with,1-R**2 Ratio
Cluster,Variable,Own Cluster,Next Closest,1-R**2 Ratio
Cluster 1,sodium,0.6807,0.1267,0.3656
,vitamins,0.6807,0.0666,0.3421
Cluster 2,fat,0.6354,0.1012,0.4056
,sugars,0.6354,0.11,0.4097
Cluster 3,protein,1.0,0.0171,0.0
Cluster 4,carbo,1.0,0.1661,0.0

Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients
Cluster,1,2,3,4
protein,0.0,0.0,1.0,0.0
fat,0.0,0.62725,0.0,0.0
sodium,0.60601,0.0,0.0,0.0
carbo,0.0,0.0,0.0,1.0
sugars,0.0,0.62725,0.0,0.0
vitamins,0.60601,0.0,0.0,0.0

Cluster Structure,Cluster Structure,Cluster Structure,Cluster Structure,Cluster Structure
Cluster,1,2,3,4
protein,-0.02869,-0.07572,1.0,-0.13086
fat,-0.02216,0.79713,0.20843,-0.31804
sodium,0.82507,0.06024,-0.05467,0.35598
carbo,0.37217,-0.40753,-0.13086,1.0
sugars,0.13732,0.79713,-0.32914,-0.33167
vitamins,0.82507,0.05895,0.00734,0.25815

Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations
Cluster,1,2,3,4
1,1.0,0.07223,-0.02869,0.37217
2,0.07223,1.0,-0.07572,-0.40753
3,-0.02869,-0.07572,1.0,-0.13086
4,0.37217,-0.40753,-0.13086,1.0

Clustering algorithm converged.

Cluster Summary for 5 Clusters,Cluster Summary for 5 Clusters,Cluster Summary for 5 Clusters,Cluster Summary for 5 Clusters,Cluster Summary for 5 Clusters
Cluster,Members,Cluster Variation,Variation Explained,Proportion Explained
1,2,2,1.361477,0.6807
2,1,1,1.0,1.0
3,1,1,1.0,1.0
4,1,1,1.0,1.0
5,1,1,1.0,1.0

5 Clusters,5 Clusters,R-squared with,R-squared with,1-R**2 Ratio
Cluster,Variable,Own Cluster,Next Closest,1-R**2 Ratio
Cluster 1,sodium,0.6807,0.1267,0.3656
,vitamins,0.6807,0.0666,0.3421
Cluster 2,sugars,1.0,0.11,0.0
Cluster 3,protein,1.0,0.1083,0.0
Cluster 4,carbo,1.0,0.1385,0.0
Cluster 5,fat,1.0,0.1012,0.0

Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients
Cluster,1,2,3,4,5
protein,0.0,0.0,1.0,0.0,0.0
fat,0.0,0.0,0.0,0.0,1.0
sodium,0.60601,0.0,0.0,0.0,0.0
carbo,0.0,0.0,0.0,1.0,0.0
sugars,0.0,1.0,0.0,0.0,0.0
vitamins,0.60601,0.0,0.0,0.0,0.0

Cluster Structure,Cluster Structure,Cluster Structure,Cluster Structure,Cluster Structure,Cluster Structure
Cluster,1,2,3,4,5
protein,-0.02869,-0.32914,1.0,-0.13086,0.20843
fat,-0.02216,0.27082,0.20843,-0.31804,1.0
sodium,0.82507,0.10145,-0.05467,0.35598,-0.00541
carbo,0.37217,-0.33167,-0.13086,1.0,-0.31804
sugars,0.13732,1.0,-0.32914,-0.33167,0.27082
vitamins,0.82507,0.12514,0.00734,0.25815,-0.03116

Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations
Cluster,1,2,3,4,5
1,1.0,0.13732,-0.02869,0.37217,-0.02216
2,0.13732,1.0,-0.32914,-0.33167,0.27082
3,-0.02869,-0.32914,1.0,-0.13086,0.20843
4,0.37217,-0.33167,-0.13086,1.0,-0.31804
5,-0.02216,0.27082,0.20843,-0.31804,1.0

Clustering algorithm converged.

Cluster Summary for 6 Clusters,Cluster Summary for 6 Clusters,Cluster Summary for 6 Clusters,Cluster Summary for 6 Clusters,Cluster Summary for 6 Clusters
Cluster,Members,Cluster Variation,Variation Explained,Proportion Explained
1,1,1,1,1.0
2,1,1,1,1.0
3,1,1,1,1.0
4,1,1,1,1.0
5,1,1,1,1.0
6,1,1,1,1.0

6 Clusters,6 Clusters,R-squared with,R-squared with,1-R**2 Ratio
Cluster,Variable,Own Cluster,Next Closest,1-R**2 Ratio
Cluster 1,vitamins,1.0,0.1307,0.0
Cluster 2,sugars,1.0,0.11,0.0
Cluster 3,protein,1.0,0.1083,0.0
Cluster 4,carbo,1.0,0.1267,0.0
Cluster 5,fat,1.0,0.1012,0.0
Cluster 6,sodium,1.0,0.1307,0.0

Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients,Standardized Scoring Coefficients
Cluster,1,2,3,4,5,6
protein,0.0,0.0,1.0,0.0,0.0,0.0
fat,0.0,0.0,0.0,0.0,1.0,0.0
sodium,0.0,0.0,0.0,0.0,0.0,1.0
carbo,0.0,0.0,0.0,1.0,0.0,0.0
sugars,0.0,1.0,0.0,0.0,0.0,0.0
vitamins,1.0,0.0,0.0,0.0,0.0,0.0

Cluster Structure,Cluster Structure,Cluster Structure,Cluster Structure,Cluster Structure,Cluster Structure,Cluster Structure
Cluster,1,2,3,4,5,6
protein,0.00734,-0.32914,1.0,-0.13086,0.20843,-0.05467
fat,-0.03116,0.27082,0.20843,-0.31804,1.0,-0.00541
sodium,0.36148,0.10145,-0.05467,0.35598,-0.00541,1.0
carbo,0.25815,-0.33167,-0.13086,1.0,-0.31804,0.35598
sugars,0.12514,1.0,-0.32914,-0.33167,0.27082,0.10145
vitamins,1.0,0.12514,0.00734,0.25815,-0.03116,0.36148

Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations,Inter-Cluster Correlations
Cluster,1,2,3,4,5,6
1,1.0,0.12514,0.00734,0.25815,-0.03116,0.36148
2,0.12514,1.0,-0.32914,-0.33167,0.27082,0.10145
3,0.00734,-0.32914,1.0,-0.13086,0.20843,-0.05467
4,0.25815,-0.33167,-0.13086,1.0,-0.31804,0.35598
5,-0.03116,0.27082,0.20843,-0.31804,1.0,-0.00541
6,0.36148,0.10145,-0.05467,0.35598,-0.00541,1.0

Number of Clusters,Total Variation Explained by Clusters,Proportion of Variation Explained by Clusters,Minimum Proportion Explained by a Cluster,Minimum R-squared for a Variable,Maximum 1-R**2 Ratio for a Variable
1,1.16261,0.1938,0.1938,0.0705,
2,2.750477,0.4584,0.3667,0.2343,0.7707
3,3.921224,0.6535,0.5501,0.5262,0.5681
4,4.632296,0.772,0.6354,0.6354,0.4097
5,5.361477,0.8936,0.6807,0.6807,0.3656
6,6.0,1.0,1.0,1.0,0.0


#### In the 4-cluster solution, the 1st cluster contains the variables sodium and vitamins, with carbo split off to form the 4th cluster. The 2nd cluster contains sugars and fat, with protein split off to form the 3rd cluster. Together the four clusters explain about 77.2% of the total variation.
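
The 1-R**2 Ratio column in the PROC VARCLUS tables is (1 - R² with own cluster) / (1 - R² with next closest cluster); smaller values mean a variable sits firmly in its own cluster. The Python sketch below recomputes it for the two multi-variable clusters of the 4-cluster solution, using values copied from the output above. The helper name and dictionary are hypothetical, and agreement with SAS is only up to rounding.

```python
def one_minus_r2_ratio(r2_own: float, r2_next: float) -> float:
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_next)


# (R-squared with own cluster, R-squared with next closest) from the 4-cluster table.
four_cluster = {
    "sodium": (0.6807, 0.1267),
    "vitamins": (0.6807, 0.0666),
    "fat": (0.6354, 0.1012),
    "sugars": (0.6354, 0.1100),
}

ratios = {name: one_minus_r2_ratio(own, nxt) for name, (own, nxt) in four_cluster.items()}
for name, ratio in ratios.items():
    print(f"{name:9s} 1-R**2 ratio = {ratio:.4f}")

# Proportion of variation explained = total variation explained / number of variables.
proportion_explained = 4.632296 / 6
print(f"Proportion explained by 4 clusters = {proportion_explained:.4f}")
```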

Other SAS procedures for cluster analysis: https://data-flair.training/blogs/stat-cluster-analysis/
<img src="SAS-STAT-CLUSTER-ANALYSIS-01.jpg" alt="CLUSTER-ANALYSIS image" />

    Reference:
        - https://www.listendata.com/2014/10/cluster-analysis-using-sas.html
        - https://www.statology.org/principal-components-analysis-in-sas/
        - https://stats.oarc.ucla.edu/sas/output/factor-analysis/
        - https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_factor_examples03.htm
        - https://www.sfu.ca/sasdoc/sashtml/stat/chap68/sect3.htm