# Correlation: Relationships between columns

**TurboPanda** still has more to offer for a seamless flow of `pandas`-like operations, and onward into `scikit-learn`.

In [4]:
import sys
import numpy as np
import pandas as pd
sys.path.insert(0,"../")
# our main import
import turbopanda as turb

import matplotlib.pyplot as plt
%matplotlib inline

print(turb.__version__)


'0.2.2'

In [23]:
g = turb.read("../data/translation.csv", name="trl")
g

MetaPanda(trl(n=5216, p=14, mem=1.169MB, options=[]))

## Determining the relationship between columns in a DataFrame

In `pandas`, the basic correlation function `corr()` takes a method type,
among other things, and returns a correlation matrix. I found it clunky
and unreliable. It also offers few correlation methods and output formats,
which is limiting in the following cases:

1. *Case one*: when two features are **not both** continuous.
2. *Case two*: when features are not typecast properly because of `pandas`' poor
handling of missing data.
3. *Case three*: when you want to compare *between* two datasets (say a matrix and a vector).
 `DataFrame.corr()` only provides intra-correlations between features in a *single dataframe*.
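
For reference, here is a minimal sketch of the `DataFrame.corr()` baseline that the cases above are measured against. The data is synthetic and the column names (`height`, `weight`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: two continuous columns that are related by construction.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
weight = 0.5 * height + rng.normal(0, 5, size=200)
toy = pd.DataFrame({"height": height, "weight": weight})

# The caller picks ONE method, and it is applied to every pair of columns.
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

# The result is a square, symmetric matrix with ones on the diagonal.
print(pearson.shape)
print(pearson.round(3))
```

The output is a matrix of coefficients only. Quantities such as p-values or sample sizes are not part of it.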

***

Note that two matrices must overlap completely before their variables can be correlated;
otherwise the comparison is not meaningful.

| Use case | `pandas` response | `turbopanda` response |
| --------------------- | ----------------- | --------------- |
| One matrix $X$ | Correlates all using `method`<br>parameter (spearman, pearson)<br>returning Matrix | Correlates all using most<br>appropriate method (spearman,<br>pearson for continuous, biserial for boolean/cont)<br>returning list of interactions |
| Two vectors $x$,$y$ of same shape | Does not handle | Correlates using appropriate method<br>returning single value |
| One matrix $X$, one Vector $y$ | Does not handle | Correlates every column $X_i$ to vector $y$<br>using appropriate method returning<br>Vector $z$ |
| Two matrices $X$,$Y$ of same shape | Does not handle | Correlates column $X_i$ with $Y_i$ using <br>most appropriate method returning<br>Vector $z$ |
| Two matrices $X$,$Y$ of different shapes | Does not handle | Correlates every column $X_i$ with $Y_j$ using<br>most appropriate method<br>returning list of interactions |
 
***
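
To illustrate the "one matrix, one vector" row of the table, the following is a rough sketch of how a method could be picked per column pair. It is not `turbopanda`'s implementation: `pairwise_r` and `correlate_sketch` are hypothetical helpers, the dtype rule is an assumption, and the boolean/continuous case is computed as a point-biserial coefficient (Pearson's formula with the boolean coded as 0/1).

```python
import numpy as np
import pandas as pd


def pairwise_r(x: pd.Series, y: pd.Series) -> dict:
    """Hypothetical helper: correlate two columns after dropping incomplete rows."""
    pair = pd.concat([x, y], axis=1, keys=["x", "y"]).dropna()
    # Assumption: choose the label from the dtypes of the two columns.
    is_bool = [pd.api.types.is_bool_dtype(pair[c]) for c in ("x", "y")]
    method = "point-biserial" if any(is_bool) and not all(is_bool) else "pearson"
    a = pair["x"].astype(float).to_numpy()
    b = pair["y"].astype(float).to_numpy()
    r = float(np.corrcoef(a, b)[0, 1])
    return {"x": x.name, "y": y.name, "method": method, "r": r, "n": len(pair)}


def correlate_sketch(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Matrix-versus-vector case: one result row per column of X."""
    return pd.DataFrame([pairwise_r(X[col], y) for col in X.columns])


# Synthetic example with one continuous and one boolean column.
rng = np.random.default_rng(1)
X = pd.DataFrame({"dose": rng.normal(size=50), "treated": rng.random(50) < 0.5})
y = pd.Series(rng.normal(size=50), name="response")
y.iloc[:5] = np.nan  # missing values are dropped pair by pair

result = correlate_sketch(X, y)
print(result)
```

Returning one row per pair, rather than a square matrix, leaves room for per-pair fields such as the method used and the number of complete observations.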

In `turbopanda`, we have a dedicated `correlate()` function which handles missing values
 and heterogeneous datasets.
 
Note that there must be NO columns of object type or any other non-numeric data type
before calling:


In [26]:
corr = turb.correlate(g[float])
corr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 11 columns):
method        66 non-null object
CI95_lower    66 non-null float64
CI95_upper    66 non-null float64
adj_r2        66 non-null float64
n             66 non-null int64
p-val         66 non-null float64
power         66 non-null float64
r             66 non-null float64
r2            66 non-null float64
x             66 non-null object
y             66 non-null object
dtypes: float64(7), int64(1), object(3)
memory usage: 5.8+ KB


What we see is a list of correlation experiments, one row per pair of columns $x$ and $y$. Each row gives the method used,
the confidence intervals, $r$ and $r^2$, $n$, the p-value and the estimated power.
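
Because the result is an ordinary `DataFrame`, the usual `pandas` filtering applies. The sketch below uses a small hand-made table with a subset of the columns listed above; the values are invented for illustration and are not results from the dataset.

```python
import pandas as pd

# Made-up rows that mimic some of the columns shown by corr.info(); not real results.
demo = pd.DataFrame({
    "x": ["a", "a", "b"],
    "y": ["b", "c", "c"],
    "method": ["pearson", "spearman", "pearson"],
    "r": [0.82, -0.10, -0.45],
    "p-val": [0.001, 0.60, 0.03],
    "n": [100, 100, 98],
})

# "p-val" contains a hyphen, so use bracket indexing rather than attribute access.
significant = demo[demo["p-val"] < 0.05]

# Rank the remaining pairs by the strength of the relationship, ignoring sign.
order = significant["r"].abs().sort_values(ascending=False).index
ranked = significant.loc[order]
print(ranked[["x", "y", "method", "r", "p-val"]])
```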

This data...