# Analysis and Visualization of Complex Agro-Environmental Data
---
## Ordination: Correspondence Analysis and Multiple Correspondence Analysis

CA and MCA are the counterparts of PCA for nominal categorical variables. CA applies to two categorical variables, whereas MCA is used to analyse more than two.
Both methods are implemented in Python in the `prince` package: https://github.com/MaxHalford/prince

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import prince # https://github.com/MaxHalford/prince

## 1. Correspondence Analysis (CA)

Correspondence analysis is used to analyse the dependency between two categorical variables. It is based on contingency tables.
In the next example we will use the EFIplus table to relate fish species composition to a selection of five Portuguese catchments (Douro, Tejo, Minho, Mondego and Vouga). The first step is to produce a contingency table between fish species and catchment name, giving the number of sites where each fish species occurs in each catchment. We are therefore relating only two categorical variables: `fish species` and `catchment names`.

This analysis can be useful for example to answer the following questions:

* How do fish species associate with each other across the river catchments?

* How are the fish species associated with each river catchment?
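The contingency table described above can be illustrated on a toy example before turning to the real data. This is a minimal sketch: the site records and the column names `species_A` and `species_B` are hypothetical and are not taken from the EFIplus table.

```python
import pandas as pd

# Hypothetical site-level presence/absence records (NOT the EFIplus data)
sites = pd.DataFrame({
    "Catchment_name": ["Douro", "Douro", "Tejo", "Tejo", "Tejo", "Minho"],
    "species_A": [1, 1, 0, 0, 1, 1],
    "species_B": [1, 0, 1, 1, 1, 0],
})

# Sum presences per catchment: rows = catchments, columns = species.
# Each cell is the number of sites in that catchment where the species occurs.
contingency = sites.groupby("Catchment_name").sum()
print(contingency)
```

Each row of the result is a catchment profile and each column a species profile; CA compares these profiles.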

In [None]:
df = pd.read_csv('EFIplus_medit.zip', compression='zip', sep=";")
df = df.dropna() # remove all rows with missing data
# Keep only the sites (rows) located in the five selected catchments
catchments = ['Douro', 'Tejo', 'Minho', 'Mondego', 'Vouga']
dfsub = df[df['Catchment_name'].isin(catchments)]

In [None]:
df.info(verbose=True) # list every column with its dtype and non-null count

In [None]:
list_sp = [57, 68, 102, 108, 110, 144, 150] # column positions of the selected fish species
df_fish = dfsub.iloc[:, list_sp].copy() # table with the selected fish species only (copy avoids modifying a slice of dfsub)
df_fish.insert(0, 'Catchment_name', dfsub['Catchment_name'], True)
df_fish_ct = df_fish.groupby(['Catchment_name'], as_index=False).agg('sum') # contingency table: number of sites with each species per catchment

In [None]:
df_fish_ct.set_index('Catchment_name', drop=True, inplace=True) # convert catchment_name to index
df_fish_ct

In [None]:
ca = prince.CA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
ca = ca.fit(df_fish_ct)

Eigenvalues

In [None]:
ca.eigenvalues_summary
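To make the eigenvalue summary easier to read, the sketch below computes CA eigenvalues (principal inertias) by hand with NumPy. It follows the textbook definition of CA, not necessarily the internals of `prince`, and the 3 x 3 table `N` is hypothetical.

```python
import numpy as np

# Hypothetical contingency table (rows: catchments, columns: species)
N = np.array([[20.0, 5.0, 2.0],
              [4.0, 18.0, 6.0],
              [3.0, 7.0, 15.0]])

P = N / N.sum()      # correspondence matrix (relative frequencies)
r = P.sum(axis=1)    # row masses
c = P.sum(axis=0)    # column masses

# Standardized residuals: (observed - expected) / sqrt(expected)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# The squared singular values of S are the eigenvalues (principal inertias)
sv = np.linalg.svd(S, compute_uv=False)
eigenvalues = sv ** 2
total_inertia = eigenvalues.sum()
explained = eigenvalues / total_inertia  # share of inertia per axis

# Total inertia equals the chi-square statistic divided by the grand total
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
chi2 = ((N - expected) ** 2 / expected).sum()

print(np.round(explained, 3))
```

The last eigenvalue is numerically zero because a table with I rows and J columns has at most min(I, J) - 1 non-trivial axes.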

Coordinates

In [None]:
# row coordinates
ca.row_coordinates(df_fish_ct).head()

In [None]:
# column coordinates
ca.column_coordinates(df_fish_ct).head()

In [None]:
ca.plot(
    df_fish_ct,
    x_component=0,
    y_component=1,
    show_row_markers=True,
    show_column_markers=True,
    show_row_labels=True,
    show_column_labels=False
).properties(
    width=500,
    height=500,
)

Visualize only columns

In [None]:
ca.plot(
    df_fish_ct,
    x_component=0,
    y_component=1,
    show_row_markers=False,
    show_column_markers=False,
    show_row_labels=False,
    show_column_labels=True
).properties(
    width=500,
    height=500,
)

Visualize only rows

In [None]:
ca.plot(
    df_fish_ct,
    x_component=0,
    y_component=1,
    show_row_markers=False,
    show_column_markers=False,
    show_row_labels=True,
    show_column_labels=False
).properties(
    width=500,
    height=500,
)

Contributions

In [None]:
# Contribution of rows
ca.row_contributions_.head().style.format('{:.0%}')

In [None]:
# Contribution of columns
ca.column_contributions_.head().style.format('{:.0%}')
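The contributions above can be read as the share of each axis's inertia that is due to a given row or column. A minimal sketch of the standard CA definition follows (mass times squared principal coordinate, divided by the eigenvalue). The table `N` is hypothetical, and it is an assumption that `prince` reports the same quantity.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (rows: catchments, columns: species)
N = np.array([[20.0, 5.0, 2.0],
              [4.0, 18.0, 6.0],
              [3.0, 7.0, 15.0],
              [10.0, 9.0, 8.0]])

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)  # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

k = min(N.shape) - 1  # number of non-trivial axes
U, sv, Vt = U[:, :k], sv[:k], Vt[:k, :]

# Principal coordinates of rows and columns
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]

# Contribution of each point to each axis: mass * coordinate^2 / eigenvalue
row_contrib = r[:, None] * row_coords ** 2 / sv ** 2
col_contrib = c[:, None] * col_coords ** 2 / sv ** 2

print(np.round(row_contrib, 3))
```

Within each axis the contributions sum to 1, so a point contributing well above 1 / (number of points) is one of the main drivers of that axis.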

## 2. Multiple Correspondence Analysis (MCA)

MCA is an extension of simple correspondence analysis (CA) applicable to more than two categorical variables. In the following example we will use the EFIplus table to explore the relationships between sites and pressure variables. Pressure variables are coded as discrete ordinal variables; MCA treats each level as a separate category, which makes it a suitable ordination technique in this case. First we will select a table with pressure variables only.

This analysis can be useful to answer the following questions:

* How do the different pressures associate with each other across sites?

* How do different sites associate with each other according to the pressures that affect them?

* How can the set of pressures be summarized into a reduced number of dimensions that retain most of the pressure information?
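One way to see how MCA extends CA is to apply CA to the indicator (one-hot) matrix of the categorical variables. The sketch below illustrates that idea only. The pressure scores and the names `pressure_a`, `pressure_b` and `pressure_c` are hypothetical, and `prince` may apply additional options or corrections that are not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical pressure scores for six sites (NOT the EFIplus data)
press = pd.DataFrame({
    "pressure_a": [1, 1, 3, 3, 5, 5],
    "pressure_b": [1, 3, 3, 1, 5, 5],
    "pressure_c": [0, 0, 1, 1, 1, 0],
}).astype("category")

# Indicator matrix: one column per (variable, level) pair
Z = pd.get_dummies(press).to_numpy(dtype=float)

# Plain CA applied to the indicator matrix
P = Z / Z.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
eigenvalues = np.linalg.svd(S, compute_uv=False) ** 2

n_vars = press.shape[1]  # Q: number of categorical variables
n_levels = Z.shape[1]    # J: total number of categories
print(eigenvalues.sum(), n_levels / n_vars - 1)
```

For an indicator matrix the total inertia depends only on the coding (J / Q - 1), not on the data, which is why percentages of explained inertia in MCA tend to look low.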

In [None]:
df_press = dfsub.iloc[:, 33:53] # table with pressure variables only (column positions 33 to 52; the end of the slice is exclusive)
df_press = df_press.astype('category') # MCA expects categorical columns
df_press.info(verbose=True)

In [None]:
# instantiate MCA class
mca = prince.MCA(n_components = 2)

# fit the MCA to the pressure table
mca = mca.fit(df_press)

Get the eigenvalues

In [None]:
mca.eigenvalues_summary

Get the coordinates

In [None]:
mca.row_coordinates(df_press).head()

Visualization of rows and columns in the same plot

In [None]:
mca.plot(
    df_press,
    x_component=0,
    y_component=1,
    show_column_markers=True,
    show_row_markers=True,
    show_column_labels=False,
    show_row_labels=False,
).properties(
    width=500,
    height=500,
)

Visualize only columns

In [None]:
mca.plot(
    df_press,
    x_component=0,
    y_component=1,
    show_column_markers=False,
    show_row_markers=False,
    show_column_labels=True,
    show_row_labels=False,
).properties(
    width=500,
    height=500,
)

Visualize only rows

In [None]:
mca.plot(
    df_press,
    x_component=0,
    y_component=1,
    show_column_markers=False,
    show_row_markers=False,
    show_column_labels=False,
    show_row_labels=True,
).properties(
    width=500,
    height=500,
)


Contributions

In [None]:
# Contribution of rows
mca.row_contributions_.head().style.format('{:.0%}')

In [None]:
# Contribution of columns
mca.column_contributions_.head().style.format('{:.0%}')

## References
https://github.com/MaxHalford/prince/blob/master/README.md

https://maxhalford.github.io/prince/ca/

https://maxhalford.github.io/prince/mca/

https://medium.com/low-code-for-advanced-data-science/understanding-and-applying-correspondence-analysis-cbd0192dec4

https://www.kaggle.com/code/jiagengchang/heart-disease-multiple-correspondence-analysis