# Mahalanobis Distance
The Mahalanobis distance measures how far a point P lies from a distribution D. It generalizes to multiple dimensions the idea of counting how many standard deviations P is from the mean of D. The distance is zero when P is at the mean of D and grows as P moves away from the mean along each principal component axis. If each of these axes is rescaled to unit variance, the Mahalanobis distance equals the ordinary Euclidean distance in the transformed space. The Mahalanobis distance is therefore unitless and scale-invariant, and it accounts for the correlations in the data set.
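
As a rough illustration of the definition above, the distance of a point $x$ from a distribution with mean $\mu$ and covariance $\Sigma$ can be written as $D_M(x) = \sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}$. The sketch below estimates $\mu$ and $\Sigma$ from a sample and checks the claim that the result matches the Euclidean distance after rescaling the principal axes to unit variance. The helper name `mahalanobis` and the synthetic sample are assumptions made for illustration; they are not part of `pynhanes` or of the NHANES data loaded later in this notebook.

```python
import numpy as np


def mahalanobis(points, data):
    """Mahalanobis distance of each row of `points` from the sample `data`.

    Hypothetical helper for illustration; expects `data` with at least
    two columns and more rows than columns.
    """
    data = np.asarray(data, dtype=float)
    points = np.atleast_2d(np.asarray(points, dtype=float))
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)  # sample covariance, columns are variables
    delta = points - mean
    # Solve cov @ x = delta.T rather than forming the explicit inverse.
    solved = np.linalg.solve(cov, delta.T).T
    return np.sqrt(np.einsum('ij,ij->i', delta, solved))


# Synthetic, correlated 2-D sample (assumed values, not NHANES data).
rng = np.random.default_rng(0)
true_cov = np.array([[4.0, 1.5], [1.5, 1.0]])
sample = rng.multivariate_normal(mean=[10.0, -3.0], cov=true_cov, size=500)

d_mean = mahalanobis(sample.mean(axis=0), sample)  # distance at the sample mean
d_all = mahalanobis(sample, sample)                # distance of every observation

# Rotate onto the principal axes and rescale each axis to unit variance;
# the Euclidean norm in that space should reproduce the Mahalanobis distance.
eigvals, eigvecs = np.linalg.eigh(np.cov(sample, rowvar=False))
whitened = (sample - sample.mean(axis=0)) @ eigvecs / np.sqrt(eigvals)
d_euclid = np.linalg.norm(whitened, axis=1)

# Changing the units of each column should leave the distances unchanged.
units = np.array([100.0, 0.01])
d_rescaled = mahalanobis(sample * units, sample * units)
```

Solving the linear system instead of calling `np.linalg.inv` is a design choice for numerical stability; with a nearly singular covariance matrix a pseudo-inverse or a regularized estimate may be more appropriate.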

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import sys

from scipy import stats

formatter = logging.Formatter(
    fmt='%(asctime)s.%(msecs)03d %(levelname)s [%(name)s] %(message)s',
    datefmt='%y%m%d@%H:%M:%S',
)

logger = logging.getLogger('pynhanes')
logger.setLevel(logging.DEBUG)
# f = logging.FileHandler('nhanes.log')
# f.setFormatter(formatter)
h = logging.StreamHandler(stream=sys.stdout)
h.setFormatter(formatter)

if not logger.hasHandlers():
    logger.addHandler(h)  # log to STDOUT or Jupyter
#     logger.addHandler(f)  # log to file

import pynhanes

210304@04:37:28.431 DEBUG [pynhanes] pynhanes package (re)loaded


In [3]:
dfs = pynhanes.data.load(datasets=['DEMO','ALQ'], years=(2015, 2018))

210304@04:37:29.588 INFO [pynhanes.data] read 9971 rows x 47 cols from https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT
210304@04:37:30.635 INFO [pynhanes.data] read 9254 rows x 46 cols from https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT
210304@04:37:30.647 INFO [pynhanes.data] combined dataset DEMO: 19225 rows x 52 cols
210304@04:37:30.929 INFO [pynhanes.data] read 5735 rows x 10 cols from https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/ALQ_I.XPT
210304@04:37:31.206 INFO [pynhanes.data] read 5533 rows x 10 cols from https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/ALQ_J.XPT
210304@04:37:31.211 INFO [pynhanes.data] combined dataset ALQ: 11268 rows x 18 cols


In [4]:
df = pd.merge(
    left=dfs['DEMO'],
    right=dfs['ALQ'],
    on=['SEQN', 'year'],
    how='inner',
)
df.shape

(11268, 68)
