# 1. Introduction

Factor analysis for mixed data (FAMD) is a principal component method that combines principal component analysis (PCA) for continuous variables and multiple correspondence analysis (MCA) for categorical variables. To learn more about FAMD, see an excellent [tutorial](http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/115-famd-factor-analysis-of-mixed-data-in-r-essentials/) using the `FactoMineR` package.

In this series, I will use three commonly used packages, two in R ([FactoMineR](https://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf) and [PCAmixdata](https://cran.r-project.org/web/packages/PCAmixdata/PCAmixdata.pdf)) and one in Python ([prince](https://github.com/MaxHalford/prince)), to perform FAMD on the Telco customer churn dataset and gain insights into the relationships between various aspects of customer behaviour.

Note, I used the [SoS kernel](https://vatlab.github.io/sos-docs/) for this analysis, which allows both Python and R analysis in the same notebook without using the Jupyter R magic commands. I found SoS handled certain aspects of R performance better than the R magics, so I encourage those who are interested to check it out. :)

# 2. Import and pre-process data

Here, I will import the cleaned Telco dataset in both R and Python. 

As `Calculated_TotalCharges` is highly correlated with `tenure` and `MonthlyCharges`, it will be excluded from analysis.

Also, all three packages automatically normalize the numerical variables, so I will not do so beforehand.
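To make that normalization concrete, here is a minimal Python sketch of the usual FAMD pre-processing as I understand it: numerical columns are z-scored, each category indicator is divided by the square root of its frequency and then centred, and an ordinary PCA is run on the combined matrix. This is not the internal code of any of the three packages; the function name `famd_eigenvalues` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def famd_eigenvalues(data: pd.DataFrame) -> np.ndarray:
    """Eigenvalues of a from-scratch FAMD (illustrative helper, not a package API)."""
    num = data.select_dtypes(include="number")
    cat = data.select_dtypes(exclude="number")

    # Numerical variables: centre and scale to unit variance (population sd)
    z_num = (num - num.mean()) / num.std(ddof=0)

    # Categorical variables: one indicator column per level,
    # divided by sqrt(level frequency), then centred
    dummies = pd.get_dummies(cat).astype(float)
    z_cat = dummies / np.sqrt(dummies.mean())
    z_cat = z_cat - z_cat.mean()

    # PCA on the combined matrix: eigenvalues are squared singular values / n
    z = np.hstack([z_num.to_numpy(), z_cat.to_numpy()])
    singular_values = np.linalg.svd(z, compute_uv=False)
    return singular_values**2 / len(data)


# Toy data (synthetic, NOT the Telco dataset): 2 numerical + 2 categorical columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=200),
    "MonthlyCharges": rng.uniform(20, 120, size=200),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=200),
    "PaperlessBilling": rng.choice(["Yes", "No"], size=200),
})

eig = famd_eigenvalues(toy)
print(eig.round(4))
print(eig.sum())  # total inertia
```

With this scaling, each numerical variable contributes 1 to the total inertia and each categorical variable contributes its number of levels minus 1, so the eigenvalues of the toy example sum to 2 + 2 + 1 = 5.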

In [1]:
df <- read.csv('https://github.com/nd823/data-cleaning/raw/master/telco_cleaned_Jun13.csv')

df <- within(df, rm('Calculated_TotalCharges'))

In [2]:
import pandas as pd

df = pd.read_csv('https://github.com/nd823/data-cleaning/raw/master/telco_cleaned_Jun13.csv')

df.drop('Calculated_TotalCharges', axis=1, inplace=True)

# 3. Factor analysis of mixed data (FAMD)

## 3.1 `FactoMineR` (R package)

`FactoMineR` provides a variety of functions for PCA, correspondence analysis (CA), multiple correspondence analysis (MCA) and FAMD.  

See [CRAN documentation](https://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf) for `FactoMineR`.

In [3]:
## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD
res.famd <- FAMD(df, 
                 sup.var = 19,  ## Set the target variable "Churn" as a supplementary variable, so it is not included in the analysis for now
                 graph = FALSE, 
                 ncp=25)

## Inspect principal components
get_eigenvalue(res.famd)


| | eigenvalue | variance.percent | cumulative.variance.percent |
|---|---|---|---|
| Dim.1 | 4.5098513842 | 20.49932447 | 20.49932 |
| Dim.2 | 2.8024012638 | 12.73818756 | 33.23751 |
| Dim.3 | 1.8247464408 | 8.294302 | 41.53181 |
| Dim.4 | 1.149081968 | 5.22309985 | 46.75491 |
| Dim.5 | 1.0495900326 | 4.77086378 | 51.52578 |
| Dim.6 | 1.0144020899 | 4.61091859 | 56.1367 |
| Dim.7 | 0.9993340111 | 4.54242732 | 60.67912 |
| Dim.8 | 0.9842737776 | 4.47397172 | 65.1531 |
| Dim.9 | 0.9039827483 | 4.10901249 | 69.26211 |
| Dim.10 | 0.8465510646 | 3.84795938 | 73.11007 |


To inspect the results in further detail, use the `summary(res.famd)` and `print(res.famd)` functions.
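As a quick sanity check on how to read the output above: the `variance.percent` column is each eigenvalue expressed as a share of the total inertia. The small Python sketch below only reuses numbers copied from the first three rows and backs out the total they imply; the variable names are my own.

```python
# Eigenvalues and variance percentages copied from the first three rows above
eigenvalue = [4.5098513842, 2.8024012638, 1.8247464408]
variance_percent = [20.49932447, 12.73818756, 8.294302]

# If percent = 100 * eigenvalue / total, then total = 100 * eigenvalue / percent
implied_total = [100 * e / p for e, p in zip(eigenvalue, variance_percent)]
print([round(t, 3) for t in implied_total])
```

Every row implies the same total inertia of 22, which is the denominator behind all the percentages in the output.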

## 3.2 `PCAmixdata` (R package)

According to its authors, `PCAmixdata` is "dedicated to multivariate analysis of mixed data where observations are described by a mixture of numerical and categorical variables" ([Chavent et al., 2017](https://arxiv.org/pdf/1411.4911.pdf)). As we will see in part 2 of this series, `PCAmixdata` provides a very useful function for performing (a generalized form of) varimax rotation that aids in interpreting the principal components identified.

See [CRAN documentation](https://cran.r-project.org/web/packages/PCAmixdata/PCAmixdata.pdf) for `PCAmixdata`.

In [14]:
## Import library
library(PCAmixdata)

## Split mixed dataset into quantitative and qualitative variables
split <- splitmix(df[1:18])  ## For now excluding the target variable "Churn", which will be added back later as a supplementary variable

## PCAmix (PCA of mixed data)
res.pcamix <- PCAmix(X.quanti=split$X.quanti,  
                     X.quali=split$X.quali, 
                     rename.level=TRUE, 
                     graph=FALSE, 
                     ndim=25)

## Inspect principal components
res.pcamix$eig

“Columns of class integer are considered as quantitative”

| | Eigenvalue | Proportion | Cumulative |
|---|---|---|---|
| dim 1 | 4.5098513842 | 20.49932447 | 20.49932 |
| dim 2 | 2.8024012638 | 12.73818756 | 33.23751 |
| dim 3 | 1.8247464408 | 8.294302 | 41.53181 |
| dim 4 | 1.149081968 | 5.22309985 | 46.75491 |
| dim 5 | 1.0495900326 | 4.77086378 | 51.52578 |
| dim 6 | 1.0144020899 | 4.61091859 | 56.1367 |
| dim 7 | 0.9993340111 | 4.54242732 | 60.67912 |
| dim 8 | 0.9842737776 | 4.47397172 | 65.1531 |
| dim 9 | 0.9039827483 | 4.10901249 | 69.26211 |
| dim 10 | 0.8465510646 | 3.84795938 | 73.11007 |


Similarly, to inspect the results in further detail, use the `summary(res.pcamix)` and `print(res.pcamix)` functions.

Thus far, we see that the results from `FactoMineR` and `PCAmixdata` are identical.

A little background: an **eigenvalue > 1** indicates that the principal component (PC) accounts for **more** variance than one of the original variables does in **standardized** data (**N.B. This holds true only when the data are standardized.**). This is commonly used as a cutoff for deciding which PCs to retain.

Therefore, interestingly, only the **first six** PCs have eigenvalues above 1 (the fifth and sixth only marginally, at about 1.05 and 1.01), and together they account for only 56.1% of the total variance in the data set. This suggests that the patterns between the variables are likely non-linear and complex.
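The cutoff is easy to apply programmatically. Here is a small Python sketch that uses only the eigenvalues and cumulative percentages printed above (the dictionary names are my own):

```python
# Eigenvalues and cumulative variance (%) copied from the output above
eigenvalues = {
    "Dim.1": 4.5098513842, "Dim.2": 2.8024012638, "Dim.3": 1.8247464408,
    "Dim.4": 1.149081968, "Dim.5": 1.0495900326, "Dim.6": 1.0144020899,
    "Dim.7": 0.9993340111, "Dim.8": 0.9842737776, "Dim.9": 0.9039827483,
    "Dim.10": 0.8465510646,
}
cumulative_percent = {
    "Dim.1": 20.49932, "Dim.2": 33.23751, "Dim.3": 41.53181,
    "Dim.4": 46.75491, "Dim.5": 51.52578, "Dim.6": 56.1367,
    "Dim.7": 60.67912, "Dim.8": 65.1531, "Dim.9": 69.26211,
    "Dim.10": 73.11007,
}

# Keep every dimension whose eigenvalue exceeds 1
retained = [dim for dim, value in eigenvalues.items() if value > 1]
print(retained)

# Cumulative variance explained by the retained dimensions
print(cumulative_percent[retained[-1]])
```

Because the eigenvalues are sorted in decreasing order, the cumulative percentage of the last retained dimension is the share of variance captured by the whole retained set.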

## 3.3 `prince` (Python package)

Like `FactoMineR`, `prince` can be used to perform a variety of factor analyses on purely numerical, purely categorical, or mixed-type datasets. Implemented in Python, this package uses a familiar `scikit-learn` API.

Unlike the two R packages above, there does not seem to be an option for adding in supplementary variables after FAMD.

For more detailed documentation, see the [GitHub repo](https://github.com/MaxHalford/prince).

In [20]:
## Import libraries
import prince

## Instantiate FAMD object
famd = prince.FAMD(
     n_components=25,
     n_iter=10,
     copy=True,
     check_input=True,
     engine='auto',       ## Can be "auto", 'sklearn', 'fbpca'
     random_state=42)

## Fit FAMD object to data 
famd = famd.fit(df.drop('Churn', axis=1)) ## Excluding the target variable "Churn"

## Inspect principal dimensions
famd.explained_inertia_            

[0.5374654125558428,
 0.08801867828903967,
 0.05728512443689753,
 0.03937333023240265,
 0.03127877295810108,
 0.02791225157585326,
 0.024703045349880843,
 0.02080730170414231,
 0.01893723097468581,
 0.01800539646894817,
 0.016560227742665183,
 0.015976757070266752,
 0.014945093457876387,
 0.013999456118270725,
 0.013763376399511602,
 0.013589916692549968,
 0.012208295034160321,
 0.011979368941411946,
 0.01133988994746592,
 0.007002655988383323,
 0.004847867736269992,
 5.485296596458002e-07,
 1.7957132325092428e-09,
 5.366065875277613e-33,
 5.366065875277613e-33]

Surprisingly, the results here differ greatly from the ones above. From my preliminary reading, I understand that "explained inertia" is synonymous with "explained variance", so terminology seems an unlikely cause of the discrepancy. I will keep digging, but as you will see in later parts of this series, FAMD performed using `prince` reaches nearly identical conclusions to the two R packages.
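One thing that can be ruled out quickly is a units mismatch: the R packages report percentages, while the `explained_inertia_` values look like fractions. The Python sketch below only reuses numbers printed above (the list names are my own). It checks that the `prince` values sum to 1 and puts the leading dimensions on the same scale.

```python
# explained_inertia_ values copied from the prince output above
prince_inertia = [
    0.5374654125558428, 0.08801867828903967, 0.05728512443689753,
    0.03937333023240265, 0.03127877295810108, 0.02791225157585326,
    0.024703045349880843, 0.02080730170414231, 0.01893723097468581,
    0.01800539646894817, 0.016560227742665183, 0.015976757070266752,
    0.014945093457876387, 0.013999456118270725, 0.013763376399511602,
    0.013589916692549968, 0.012208295034160321, 0.011979368941411946,
    0.01133988994746592, 0.007002655988383323, 0.004847867736269992,
    5.485296596458002e-07, 1.7957132325092428e-09,
    5.366065875277613e-33, 5.366065875277613e-33,
]

# variance.percent for the first three dimensions from the R output above
r_percent = [20.49932447, 12.73818756, 8.294302]

# The prince values are fractions of the total, so they should sum to 1
print(round(sum(prince_inertia), 6))

# Compare both on a 0-1 scale
for k, (r_pct, p_frac) in enumerate(zip(r_percent, prince_inertia), start=1):
    print(f"Dim {k}: R = {r_pct / 100:.3f}, prince = {p_frac:.3f}")
```

Even on the same scale, the first dimension carries more than twice the share in `prince` that it does in the R packages, so the discrepancy lies in the decomposition itself rather than in how the numbers are reported.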

Thank you so much for reading! This is the end of part 1 of this series. In the next post, I will introduce several concepts and approaches to better understand the PCs and their relevance to the data.

Til then! :)