# Tetrachoric Correlations

The tetrachoric correlation is a special case of the polychoric correlation. As a reminder, the polychoric correlation measures the strength of the relationship between two ordinal variables that:
- Have 7 or fewer categories.
- Are assumed to reflect underlying latent variables that are continuous in nature.

The tetrachoric correlation, by comparison, measures the relationship between two binary variables.

Like the polychoric correlation, though, it has no closed-form equation. It is estimated by maximum likelihood estimation (MLE) on a 2x2 contingency table.
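To make the MLE step concrete, here is a minimal sketch of a two-step estimate in pure Python. It is an illustration under assumptions, not the code that `polycor` runs: the function name `tetrachoric_two_step` is hypothetical, the thresholds are fixed from the marginal proportions, and only the correlation is optimized. The example counts are the ones from the crosstab in the Python example later in this notebook.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal, norm


def tetrachoric_two_step(table):
    """Two-step tetrachoric estimate from a 2x2 table of counts (hypothetical helper)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()

    # Step 1: thresholds on the latent scale from the marginal share of 0s
    tau_row = norm.ppf(table[0, :].sum() / n)
    tau_col = norm.ppf(table[:, 0].sum() / n)

    def neg_log_lik(rho):
        # Probability that both latent variables fall below their thresholds
        p00 = multivariate_normal.cdf(
            [tau_row, tau_col], mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]]
        )
        p01 = norm.cdf(tau_row) - p00
        p10 = norm.cdf(tau_col) - p00
        p11 = 1.0 - p00 - p01 - p10
        probs = np.clip([[p00, p01], [p10, p11]], 1e-12, 1.0)
        return -(table * np.log(probs)).sum()

    # Step 2: maximize the likelihood over the correlation only
    result = minimize_scalar(neg_log_lik, bounds=(-0.999, 0.999), method="bounded")
    return result.x


# Rows are the first variable (0, 1); columns are the second variable (0, 1)
rho_hat = tetrachoric_two_step([[24, 31], [23, 22]])
print(f"Two-step tetrachoric estimate: {rho_hat:.4f}")
```

Fixing the thresholds first keeps the optimization one-dimensional, which is why a bounded scalar minimizer is enough here.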

__Interpretation__

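Tetrachoric correlations are interpreted similarly to polychoric correlations.
- Values fall within the range [-1, 1].
- A value of 0 indicates no correlation.
- The sign gives the direction of the relationship; the absolute value gives its strength.

The table below gives interpretations for the absolute value of the coefficient:

| Correlation Coefficient (absolute value) | Interpretation |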
|---------------------------|----------------|
| 0.00 – 0.10 | Negligible or trivial |
| 0.10 – 0.30 | Weak |
| 0.30 – 0.50 | Moderate |
| 0.50 – 1.00 | Strong |
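The table can be turned into a small lookup helper. This is only a sketch: the function name `interpret_strength` is hypothetical, and because the table's ranges share their endpoints, the code assumes each lower bound is inclusive.

```python
def interpret_strength(r):
    """Map a correlation to the labels in the table above (hypothetical helper)."""
    magnitude = abs(r)
    if magnitude > 1:
        raise ValueError("A correlation must lie within [-1, 1].")
    # Assumption: lower bounds are inclusive, upper bounds exclusive
    if magnitude < 0.10:
        return "Negligible or trivial"
    if magnitude < 0.30:
        return "Weak"
    if magnitude < 0.50:
        return "Moderate"
    return "Strong"


label = interpret_strength(-0.1172)
print(label)
```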

__Assumptions__
1. Both variables are binary (2 categories).
2. The underlying latent variables are assumed to be normally distributed.
3. The latent variables have a joint bivariate normal distribution.
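The assumptions above describe a data-generating process that can be simulated. The sketch below uses made-up values (a latent correlation of 0.6, thresholds at 0, and 100,000 draws are all assumptions for illustration). It draws two latent bivariate normal variables, cuts each at a threshold to produce binary variables, and shows that an ordinary Pearson correlation on the binary data understates the latent correlation. That gap is what the tetrachoric correlation is meant to recover.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed latent correlation, chosen only for illustration
latent_rho = 0.6
cov = [[1.0, latent_rho], [latent_rho, 1.0]]

# Draw the latent bivariate normal variables
latent = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000)

# Dichotomize each latent variable at a threshold of 0
binary = (latent > 0.0).astype(int)

# Pearson correlation on the binary data (the phi coefficient)
phi = np.corrcoef(binary[:, 0], binary[:, 1])[0, 1]

print(f"Latent correlation: {latent_rho:.2f}")
print(f"Pearson (phi) on the binary data: {phi:.2f}")
```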

## Python Example

Like polychoric correlations, tetrachoric correlations in Python are calculated using R's ```polycor``` package after bridging between Python and R via [rpy2](https://rpy2.github.io/doc/v3.6.x/html/index.html).

In [1]:
# Install rpy2 if you need it
# !pip install rpy2

If you install packages with conda, use this command instead:
```
conda install conda-forge::rpy2
```

In [2]:
# Import
import numpy as np
import pandas as pd

# Correlation coefficient
import scipy.stats as stats

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Necessary imports
import rpy2
import rpy2.robjects as ro
from rpy2.robjects.vectors import IntVector
from rpy2.robjects.packages import importr, isinstalled
from rpy2.robjects import pandas2ri

In [3]:
# Import the needed R libraries
utils = importr('utils')

# Skip the install if polycor is already available
if not isinstalled('polycor'):
  utils.install_packages('polycor')

# Set the import
polycor = importr('polycor')

So, imagine that you're evaluating a dashboard with 100 users. You have two binary variables:
- Utilization: 0 = low use, 1 = high use
- Frustration: 0 = low frustration, 1 = high frustration

Utilization reflects a latent, normally distributed motivation to use. Frustration reflects a latent, normally distributed negative emotional state.

In [4]:
# Set a seed
np.random.seed(123)

# Randomly generate a dataset
df = pd.DataFrame(data={'utilization': list(np.random.randint(0, 2, 100, dtype=int)),
                        'frustration':list(np.random.randint(0, 2, 100, dtype=int))})

In [11]:
# Check distribution
pd.crosstab(df.utilization, df.frustration, margins=True)

| utilization (rows) / frustration (columns) | 0 | 1 | All |
|---|---|---|---|
| 0 | 24 | 31 | 55 |
| 1 | 23 | 22 | 45 |
| All | 47 | 53 | 100 |


In [14]:
# Set as an R dataframe using context and a converter object
with (ro.default_converter + pandas2ri.converter).context():
  r_df = ro.conversion.get_conversion().py2rpy(df)

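To calculate the tetrachoric correlation, call ```polychor()``` from the ```polycor``` package as shown below. When both variables are binary, the polychoric estimate it returns is the tetrachoric correlation.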

In [15]:
# Check the correlation
r_corr = polycor.polychor(r_df.rx2('utilization'), r_df.rx2('frustration'))

# Show
print(f'Polychoric correlation: {r_corr[0]}')

Polychoric correlation: -0.11716496854676184
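As a quick sanity check on this value, the classic "cos-pi" approximation to the tetrachoric correlation can be computed directly from the four cell counts in the crosstab above. This closed-form shortcut is only an approximation and is not what ```polychor()``` computes; the variable names are chosen here for illustration.

```python
import math

# Cell counts from the crosstab: rows = utilization, columns = frustration
a, b = 24, 31  # utilization = 0: frustration = 0, frustration = 1
c, d = 23, 22  # utilization = 1: frustration = 0, frustration = 1

# Cos-pi approximation: r = cos(pi / (1 + sqrt(ad / bc)))
odds_ratio = (a * d) / (b * c)
r_approx = math.cos(math.pi / (1 + math.sqrt(odds_ratio)))

print(f"Cos-pi approximation: {r_approx:.4f}")
```

The approximation should land close to the estimate printed above, which suggests the R call is reading the data as intended.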


In [16]:
# Coerce numbers to ordered factor (levels 0,1 assumed)
r_df[r_df.names.index("utilization")] = ro.r["ordered"](r_df.rx2("utilization"), levels=IntVector([0,1]))
r_df[r_df.names.index("frustration")] = ro.r["ordered"](r_df.rx2("frustration"), levels=IntVector([0,1]))

# Estimate the correlation together with its standard error
r_hetcor = polycor.hetcor(r_df, use="complete.obs")

# Print
print(r_hetcor)


Two-Step Estimates

Correlations/Type of Correlation:
            utilization frustration
utilization           1  Polychoric
frustration     -0.1172           1

Standard Errors:
[1] ""       "0.1561"

n = 100 



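The two variables in our example have a weak inverse correlation (r = -0.1172). The value 0.1561 in the ```hetcor()``` output is the standard error of the estimate, not a p-value. Because the estimate lies less than one standard error from zero, the correlation is not statistically significant.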
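For an approximate p-value, one option is a Wald-style z test built from the estimate and the value listed under "Standard Errors" in the output above. This is a sketch that assumes the estimate is approximately normally distributed; it is not a test reported by ```hetcor()``` itself.

```python
from scipy.stats import norm

# Values copied from the hetcor output above
estimate = -0.1172
std_error = 0.1561

# Wald z statistic and two-sided p-value (normal approximation assumed)
z = estimate / std_error
p_value = 2 * norm.sf(abs(z))

print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

A p-value well above 0.05 is consistent with reading the correlation as not statistically significant.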