| Question Type     | Example                                                                           |
| ----------------- | --------------------------------------------------------------------------------- |
| **Structural**    | What latent factors explain the most variability in my mixed dataset?             |
| **Variable**      | Which variables drive the first few principal components?                         |
| **Redundancy**    | Can I eliminate variables without losing important information?                   |
| **Clustering**    | Are there natural subgroups in my data? What defines them?                        |
| **Segmentation**  | How do different groups (e.g., high vs. low income) differ in structure?          |
| **Visualization** | Can I reduce 372 variables into 2-3 informative dimensions to visualize patterns? |


| Ordinal Variable Type                                        | How to Treat | Cast To       | FAMD Behavior        |
| ------------------------------------------------------------ | ------------ | ------------- | -------------------- |
| Non-numeric or symbolic (e.g., "Low", "Med", "High")         | Categorical  | `category`    | One-hot encoded      |
| Numeric but arbitrary labels (e.g., 1 = "bad", 2 = "medium") | Categorical  | `category`    | One-hot encoded      |
| Numeric with consistent intervals (e.g., 1–5 Likert scale)   | Continuous   | `float`/`int` | Included in PCA part |
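
The casts in the table can be sketched in pandas as follows. This is a minimal illustration on a toy frame; the column names (`severity`, `quality_code`, `likert`) are hypothetical and do not come from the study dataset.

```python
import pandas as pd

# Toy frame with one column per row of the table above (hypothetical names)
toy = pd.DataFrame({
    "severity": ["Low", "High", "Med", "Low"],   # symbolic levels
    "quality_code": [1, 2, 1, 3],                # numeric but arbitrary labels
    "likert": [1, 5, 3, 4],                      # numeric with consistent intervals
})

# Symbolic and arbitrary-label ordinals are cast to category
toy["severity"] = toy["severity"].astype("category")
toy["quality_code"] = toy["quality_code"].astype("category")

# Evenly spaced ordinals stay numeric so they enter the continuous (PCA) part
toy["likert"] = toy["likert"].astype(float)

print(toy.dtypes)
```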


1. Impute missing values (median for continuous, mode for categorical)
2. Treat all ordinal variables as nominal if spacing isn't meaningful
3. One-hot encode all categorical variables (ordinal + nominal)
4. Run MCA on this encoded categorical dataset
5. Optionally: PCA on standardized continuous-only dataset
6. Analyze latent factors separately or visualize in a shared space (e.g., via UMAP on MCA + PCA embeddings)
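
The steps above can be sketched end to end on a toy frame. This is a rough illustration under stated assumptions, not the pipeline used later in this notebook: the column names are hypothetical, MCA is written out by hand as a correspondence analysis of the one-hot indicator matrix (a library such as `prince` would normally do this), and PCA is computed with a plain SVD.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns
toy = pd.DataFrame({
    "severity": ["Low", "High", None, "Med", "Low", "High"],
    "sex": ["F", "M", "M", None, "F", "M"],
    "age": [34.0, 51.0, np.nan, 47.0, 29.0, 63.0],
    "los_days": [5.0, 21.0, 14.0, np.nan, 3.0, 30.0],
})
cat_cols = ["severity", "sex"]   # ordinal "severity" is treated as nominal (step 2)
num_cols = ["age", "los_days"]

# Step 1: mode for categorical columns, median for continuous columns
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])
for col in num_cols:
    toy[col] = toy[col].fillna(toy[col].median())

# Step 3: one-hot encode the categorical columns into an indicator matrix
Z = pd.get_dummies(toy[cat_cols]).to_numpy(dtype=float)

# Step 4: MCA as correspondence analysis of the indicator matrix
P = Z / Z.sum()                 # correspondence matrix
r = P.sum(axis=1)               # row masses
c = P.sum(axis=0)               # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, sing, _ = np.linalg.svd(S, full_matrices=False)
mca_coords = (U * sing) / np.sqrt(r)[:, None]        # principal row coordinates
mca_2d = mca_coords[:, :2]

# Step 5: PCA on the standardized continuous columns
X = toy[num_cols].to_numpy()
X = (X - X.mean(axis=0)) / X.std(axis=0)
Ux, sx, _ = np.linalg.svd(X, full_matrices=False)
pca_2d = (Ux * sx)[:, :2]       # principal component scores

# Step 6: concatenate both embeddings for joint inspection (e.g., as UMAP input)
embedding = np.hstack([mca_2d, pca_2d])
print(embedding.shape)
```

Keeping the two embeddings separate until the last step makes it easy to inspect the categorical and continuous structure independently before combining them.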


In [5]:
import pandas as pd
import os
import numpy as np
import json

#read in the data
filename = 'form_1_cleaned.csv'

filepath = os.path.join('..', 'Database', 'processed', filename)

df = pd.read_csv(filepath)


code_dict_filename = 'code_dict.json'

code_dic_path = os.path.join('..', 'Database', 'processed', code_dict_filename)

with open(code_dic_path, 'r') as file:
    code_dic = json.load(file)




In [None]:
# I separated the ordinal variables from the nominal ones by hand.
# If I were to do it again, I would remove the known non-value codes from each column and
# encode any column with more than two remaining levels (i.e., not binary) as ordinal.
ordinal_col = [
    'AgeGroup',
    'BMICat',
    'CongestiveHeartFailureTBIOnset',
    'DRINKCat',
    'DRSEmpA',
    'DRSEmpD',
    'DRSEyeA',
    'DRSEyeD',
    'DRSFeedA',
    'DRSFeedD',
    'DRSFuncA',
    'DRSFuncD',
    'DRSGroomA',
    'DRSGroomD',
    'DRSMotA',
    'DRSMotD',
    'DRSToiletA',
    'DRSToiletD',
    'DRSVerA',
    'DRSVerD',
    'DementiaTBIOnset',
    'DiabetesHighBloodSugarTBIOnset',
    'EDUCATION',
    'Earn',
    'EduYears',
    'EduYearsOld',
    'FIMBathA',
    'FIMBathD',
    'FIMBedTransA',
    'FIMBedTransD',
    'FIMBladAccA',
    'FIMBladAccD',
    'FIMBladAsstA',
    'FIMBladAsstD',
    'FIMBladMgtA',
    'FIMBladMgtD',
    'FIMBwlAccA',
    'FIMBwlAccD',
    'FIMBwlAsstA',
    'FIMBwlAsstD',
    'FIMBwlMgtA',
    'FIMBwlMgtD',
    'FIMCompA',
    'FIMCompD',
    'FIMDrsdwnA',
    'FIMDrsdwnD',
    'FIMDrupA',
    'FIMDrupD',
    'FIMExpressA',
    'FIMExpressD',
    'FIMFeedA',
    'FIMFeedD',
    'FIMGroomA',
    'FIMGroomD',
    'FIMLocoD',
    'FIMLocoModeD',
    'FIMMemA',
    'FIMMemD',
    'FIMProbSlvA',
    'FIMProbSlvD',
    'FIMSocialA',
    'FIMSocialD',
    'FIMStairsA',
    'FIMStairsD',
    'FIMToilTransA',
    'FIMToilTransD',
    'FIMToiletA',
    'FIMToiletD',
    'FIMTubTransA',
    'FIMTubTransD',
    'FIMWalkingA',
    'FIMwcA',
    'GCSCat',
    'GCSEye',
    'GCSMot',
    'GCSTot',
    'GCSVer',
    'HeartAttackTBIOnset',
    'HighBloodCholesterolTBIOnset',
    'HypertensionTBIOnset',
    'LiverDiseaseTBIOnset',
    'MOB12StepsA',
    'MOB12StepsD',
    'MOB1StepCurbA',
    'MOB1StepCurbD',
    'MOB4StepsA',
    'MOB4StepsD',
    'MOBCarTranA',
    'MOBCarTranD',
    'MOBChairTranA',
    'MOBChairTranD',
    'MOBLyingA',
    'MOBLyingD',
    'MOBPickUpA',
    'MOBPickUpD',
    'MOBRollA',
    'MOBRollD',
    'MOBSitA',
    'MOBSitD',
    'MOBSitStandA',
    'MOBToilettranA',
    'MOBToilettranD',
    'MOBWalk10ftA',
    'MOBWalk10ftD',
    'MOBWalk150ftA',
    'MOBWalk150ftD',
    'MOBWalkUnevenA',
    'MOBWalkUnevenD',
    'MOBWalkturnA',
    'MOBWalkturnD',
    'MOBWheel150ftA',
    'MOBWheel150ftD',
    'MOBWheel50ftA',
    'MOBWheel50ftD',
    'MostSevere',
    'MovementDisorderTBIOnset',
    'OsteoarthritisTBIOnset',
    'PRTHome',
    'PRTSchool',
    'PRTVol',
    'PRTWork',
    'PTSDTBIOnset',
    'PanicAttacksTBIOnset',
    'RheumatoidArthritisTBIOnset',
    'SCEatA',
    'SCEatD',
    'SCFootwearA',
    'SCFootwearD',
    'SCLBDressA',
    'SCLBDressD',
    'SCOralHygA',
    'SCShowerA',
    'SCShowerD',
    'SCToiletA',
    'SCToiletD',
    'SCUBDressA',
    'SCUBDressD',
    'SmkCig',
    'StrokeTBIOnset'

]

nominal_col = [col for col in code_dic['categorical_cols'] if col not in ordinal_col]

continuous_col = code_dic['numeric_cols']

record_cols = ['Mod1Id']

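# Replace the non-value codes (81-84 and 99) that I missed when preprocessing form 1,
# as well as any NaNs that were read in as strings.
# Note: this replaces the codes in every column, including the continuous ones.
df = df.replace([81.0, 82.0, 83.0, 84.0, 99.0], np.nan)
df = df.replace(r'^\s*(nan|NaN|null)\s*$', np.nan, regex=True)
# Strip surrounding whitespace from string cells
# (applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)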



#identify columns with mixed data types
mixed_cols = []

for col in df.columns:
    types = df[col].map(type).nunique()
    if types > 1:
        print(f"{col} has mixed types")
        mixed_cols.append(col)

# For any mixed columns, cast the non-missing values to string, then to category type
for col in mixed_cols:
    # keep NaN as missing; convert everything else to str
    col_vals = df[col].where(df[col].isna(), df[col].astype(str))
    df[col] = col_vals.astype('category')


print(mixed_cols)

# Cast nominal columns to unordered categoricals and ordinal columns to ordered ones
df[nominal_col] = df[nominal_col].astype('category')
for col in ordinal_col:
    # with no explicit categories, the sorted unique values become the category order
    df[col] = pd.Categorical(df[col], ordered=True)

form_1_column_dict = {
    "ordinal_col" : ordinal_col,
    "nominal_col" : nominal_col,
    "continuous_col" : continuous_col
}

print(form_1_column_dict)

with open(os.path.join('..','Database', 'processed', 'form_1_column_dict.json'), 'w') as f:
    json.dump(form_1_column_dict, f, indent=4)



DeathCause1 has mixed types
DeathCause2 has mixed types
DeathECode has mixed types
['DeathCause1', 'DeathCause2', 'DeathECode']
{'ordinal_col': ['AgeGroup', 'BMICat', 'CongestiveHeartFailureTBIOnset', 'DRINKCat', 'DRSEmpA', 'DRSEmpD', 'DRSEyeA', 'DRSEyeD', 'DRSFeedA', 'DRSFeedD', 'DRSFuncA', 'DRSFuncD', 'DRSGroomA', 'DRSGroomD', 'DRSMotA', 'DRSMotD', 'DRSToiletA', 'DRSToiletD', 'DRSVerA', 'DRSVerD', 'DementiaTBIOnset', 'DiabetesHighBloodSugarTBIOnset', 'EDUCATION', 'Earn', 'EduYears', 'EduYearsOld', 'FIMBathA', 'FIMBathD', 'FIMBedTransA', 'FIMBedTransD', 'FIMBladAccA', 'FIMBladAccD', 'FIMBladAsstA', 'FIMBladAsstD', 'FIMBladMgtA', 'FIMBladMgtD', 'FIMBwlAccA', 'FIMBwlAccD', 'FIMBwlAsstA', 'FIMBwlAsstD', 'FIMBwlMgtA', 'FIMBwlMgtD', 'FIMCompA', 'FIMCompD', 'FIMDrsdwnA', 'FIMDrsdwnD', 'FIMDrupA', 'FIMDrupD', 'FIMExpressA', 'FIMExpressD', 'FIMFeedA', 'FIMFeedD', 'FIMGroomA', 'FIMGroomD', 'FIMLocoD', 'FIMLocoModeD', 'FIMMemA', 'FIMMemD', 'FIMProbSlvA', 'FIMProbSlvD', 'FIMSocialA', 'FIMSocialD
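
One caveat with replacing non-value codes across the whole frame is that a continuous column may legitimately contain the same numbers (an age of 81, for example). A minimal sketch of restricting the replacement to coded columns is shown below; the column names and the `coded_cols` list are hypothetical, and whether any continuous column in this dataset is actually affected is an assumption that would need checking.

```python
import numpy as np
import pandas as pd

# Toy frame: "Age" is continuous, "EmpStatus" is a coded categorical (hypothetical names)
toy = pd.DataFrame({
    "Age": [81.0, 45.0, 99.0],
    "EmpStatus": [2.0, 82.0, 99.0],
})

non_value_codes = [81.0, 82.0, 83.0, 84.0, 99.0]
coded_cols = ["EmpStatus"]  # assumption: only these columns use the non-value codes

# Replace the codes only where they are codes, leaving real measurements intact
toy[coded_cols] = toy[coded_cols].replace(non_value_codes, np.nan)

print(toy)
```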

In [3]:
#mixed data types will cause problems for the dimensionality reduction.

mixed_cols = []

for col in df.columns:
    types = df[col].map(type).nunique()
    if types > 1:
        print(f"{col} has mixed types: {df[col].map(type).value_counts()}")
        mixed_cols.append(col)

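# For any mixed columns, cast them to string and then to category type.
# Caution: depending on the pandas version, astype(str) can turn NaN into the literal
# string 'nan', so missing values may end up as their own category here.
df[mixed_cols] = df[mixed_cols].astype(str).astype('category')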

df.to_csv('../Database/processed/form_1_cleaned_v2.csv')

drop_cols = ['CongestiveHeartFailure', 'Dementia', 'DiabetesHighBloodSugar',
 'HeartAttack', 'HighBloodCholesterol', 'Hypertension', 'LiverDisease',
 'MovementDisorder', 'Osteoarthritis', 'PTSDHlth', 'PanicAttacks',
 'RheumatoidArthritis', 'Stroke', 'Mod1Id','CongestiveHeartFailureTBIOnset', 'DementiaTBIOnset',
 'DiabetesHighBloodSugarTBIOnset', 'HeartAttackTBIOnset',
 'HighBloodCholesterolTBIOnset', 'HypertensionTBIOnset',
 'LiverDiseaseTBIOnset', 'MovementDisorderTBIOnset',
 'OsteoarthritisTBIOnset' ,'PRTVol' ,'PTSDTBIOnset', 'PanicAttacksTBIOnset',
 'RheumatoidArthritisTBIOnset', 'StrokeTBIOnset']

# Some columns contain no information. I got this list from an imputer error raised in the cell below.
# Cross-referenced against the raw dataset, these are practically empty columns.
df.drop(columns=drop_cols,inplace=True)

print(mixed_cols)


DeathCause1 has mixed types: DeathCause1
<class 'float'>    19559
<class 'str'>          1
Name: count, dtype: int64
DeathCause2 has mixed types: DeathCause2
<class 'float'>    19559
<class 'str'>          1
Name: count, dtype: int64
DeathECode has mixed types: DeathECode
<class 'float'>    19558
<class 'str'>          2
Name: count, dtype: int64
['DeathCause1', 'DeathCause2', 'DeathECode']
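
The counts above show columns that are almost entirely float `NaN` with one or two stray strings. A small sketch of detecting such a column and casting it to `category` while keeping the missing values missing is given below; the `code` column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column mixing float NaN, a string, and a float (hypothetical values)
toy = pd.DataFrame({"code": pd.Series([np.nan, "V01", np.nan, 12.0], dtype=object)})

# More than one Python type in the column means it is "mixed"
n_types = toy["code"].map(type).nunique()
print(f"distinct Python types: {n_types}")

# Convert only the non-missing values to str, then cast to category.
# where() keeps the original value (NaN) wherever the condition is True.
safe = toy["code"].where(toy["code"].isna(), toy["code"].astype(str)).astype("category")

print(safe.isna().sum(), list(safe.cat.categories))
```

Keeping `NaN` as missing matters downstream: a column that is nearly all missing can then still be caught by a missing-ratio filter instead of being treated as a mostly constant category.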


In [4]:
from sklearn.impute import SimpleImputer
import numpy as np

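def drop_mostly_empty_columns(df, threshold=0.9):
    """Drop columns whose fraction of missing values exceeds `threshold`.

    Returns the filtered DataFrame and the list of dropped column names.
    """
    missing_ratio = df.isna().mean()
    dropped_cols = missing_ratio[missing_ratio > threshold].index.tolist()
    df_filtered = df.drop(columns=dropped_cols)

    return df_filtered, dropped_cols

# Replace +/- inf with NaN so they are counted as missing
df.replace([np.inf, -np.inf], np.nan, inplace=True)

# Drop columns with more than 85% missing values
df_filtered, more_dropped_cols = drop_mostly_empty_columns(df, 0.85)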

drop_cols += more_dropped_cols

ordinal_col = [col for col in ordinal_col if col not in drop_cols]

nominal_col = [col for col in nominal_col if col not in drop_cols]

continuous_col = [col for col in continuous_col if col not in drop_cols]



In [None]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

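# Preprocessing pipelines, one per variable type

# Ordinal: impute with the mode, then encode the levels as integer codes
ordinal_pipeline = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OrdinalEncoder()
)

# Nominal: impute with the mode only; one-hot encoding is left to FAMD
categorical_pipeline = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    #OneHotEncoder(handle_unknown='ignore', sparse_output=False)
)

# Continuous: impute with the mean, then standardize
numerical_pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler()
)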

#combine
preprocessor = ColumnTransformer([
    ('ord', ordinal_pipeline, ordinal_col),
    ('cat', categorical_pipeline, nominal_col),
    ('num', numerical_pipeline, continuous_col)
])

encoded_df = preprocessor.fit_transform(df_filtered)

column_names = preprocessor.get_feature_names_out()
encoded_df = pd.DataFrame(encoded_df, columns=column_names)

cat_cols = [name for name in column_names if name.startswith('cat__')]
ord_cols = [name for name in column_names if name.startswith('ord__')]
num_cols = [name for name in column_names if name.startswith('num__')]


encoded_df[num_cols] = encoded_df[num_cols].astype(float)
encoded_df[cat_cols] = encoded_df[cat_cols].astype('category')

for col in ord_cols:
    encoded_df[col] = pd.Categorical(encoded_df[col], ordered=True)

#check for NaN values
assert not encoded_df.isna().any().any(), "Still contains NaNs"
assert np.isfinite(encoded_df.select_dtypes(include=[np.number])).all().all(), "Still contains infs"
print(f"Remaining rows: {df.shape[0]}")

print(encoded_df.shape)



Remaining rows: 19560
(19560, 227)


In [6]:
from prince import FAMD

famd = FAMD(n_components=2)
famd = famd.fit(encoded_df)

# Transform data
df_famd = famd.transform(encoded_df)
df_famd.columns = ['FAMD_1', 'FAMD_2']




In [7]:
import altair as alt

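# Essential for Altair to work with this many rows (it is finicky about large datasets)
alt.data_transformers.disable_max_rows()
alt.data_transformers.enable('default', max_rows=None)

# Optional: add labels from the original df
# (assignment aligns on the index, so rows removed by dropna() simply come back as NaN)
df_famd['label'] = df['FIMLocoModeD'].dropna()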

# Altair plot
chart = alt.Chart(df_famd.reset_index()).mark_circle().encode(
    x='FAMD_1',
    y='FAMD_2',
    color='label:N',
    tooltip='label'
).interactive()

chart
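
A note on the label assignment: pandas aligns a Series on the index when it is assigned as a column, so dropping missing values beforehand does not shorten or shift anything. A tiny sketch with made-up values:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [10, 20, 30]})
labels = pd.Series([1.0, np.nan, 3.0])

# dropna() removes index 1, but assignment aligns on the index,
# so row 1 of the new column is filled with NaN again
frame["label"] = labels.dropna()

print(frame)
```

To exclude unlabeled points from a plot, the rows would have to be filtered on the resulting frame, for example with `frame.dropna(subset=["label"])`.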

In [8]:

print("Top contributors to FAMD 01:")
famd.column_contributions_.sort_values(by=0,ascending=False).head(10).style.format('{:.3%}')

Top contributors to FAMD 01:


| Variable           | Component 0 | Component 1 |
| ------------------ | ----------- | ----------- |
| num__FIMTOTD       | 0.035%      | 0.000%      |
| num__FIMMOTD       | 0.030%      | 0.000%      |
| num__FIMTOTA       | 0.028%      | 0.003%      |
| num__FIMMOTA       | 0.027%      | 0.005%      |
| num__DRSd          | 0.027%      | 0.004%      |
| num__DRSdLow       | 0.027%      | 0.003%      |
| ord__FIMBedTransD  | 0.027%      | 0.043%      |
| ord__FIMToilTransD | 0.027%      | 0.043%      |
| num__DRSdHigh      | 0.027%      | 0.004%      |
| ord__FIMDrsdwnD    | 0.026%      | 0.041%      |


In [9]:
print("Top contributors to FAMD 02:")
top_famd_2 = famd.column_contributions_.sort_values(by=1,ascending=False).head(10).style.format('{:.3%}')
top_famd_2

Top contributors to FAMD 02:


| Variable           | Component 0 | Component 1 |
| ------------------ | ----------- | ----------- |
| ord__FIMToilTransD | 0.027%      | 0.043%      |
| ord__FIMBedTransD  | 0.027%      | 0.043%      |
| ord__FIMDrsdwnD    | 0.026%      | 0.041%      |
| ord__FIMDrupD      | 0.025%      | 0.040%      |
| ord__FIMGroomD     | 0.026%      | 0.039%      |
| ord__FIMBathD      | 0.025%      | 0.037%      |
| ord__FIMLocoD      | 0.022%      | 0.034%      |
| cat__ZipInj        | 0.015%      | 0.033%      |
| ord__FIMTubTransD  | 0.019%      | 0.029%      |
| ord__FIMStairsD    | 0.017%      | 0.025%      |


In [10]:
# contributions = famd.column_contributions_

# contributions.columns = contributions.columns.astype(str)

# contributions["total_contribution"] = (contributions['0']**2 + contributions['1']**2)

# top_variables = contributions['total_contribution'].sort_values(ascending=False).head(10)

# print("R squared highest contributing variables")
# print(top_variables)

# contributions["balanced_contribution"] = contributions[["0", "1"]].min(axis=1)

# #Find the variable with the highest balanced contribution
# top_balanced = contributions["balanced_contribution"].sort_values(ascending=False).head(10)

# print("Variable with the most balanced contribution to FAMD_1 and FAMD_2:")
# print(top_balanced)


In [26]:
import hdbscan

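clusterer = hdbscan.HDBSCAN(min_cluster_size=10)

df_famd['cluster'] = clusterer.fit_predict(df_famd[['FAMD_1','FAMD_2']])

# Set the label before filtering so the denoised copy carries it
df_famd['label'] = df['DRSdLow']

# Drop noise points (HDBSCAN labels them -1)
df_famd_denoised = df_famd[df_famd['cluster'] != -1].copy()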

# Altair plot
chart = alt.Chart(df_famd_denoised.reset_index()).mark_circle().encode(
    x='FAMD_1',
    y='FAMD_2',
    color='label:N',
    tooltip='label'
).properties(
    width=800,   # increase width
    height=600   # increase height
).interactive()

chart
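
When filtering out noise points, the order of column assignment and filtering matters: a boolean-filtered frame is a separate object, so columns assigned to the parent afterwards do not reach it. A small sketch with hard-coded cluster labels (toy values, no clustering is run here):

```python
import pandas as pd

points = pd.DataFrame({
    "FAMD_1": [0.1, 0.2, 5.0],
    "FAMD_2": [0.0, 0.1, 9.0],
    "cluster": [0, 0, -1],        # -1 marks a noise point
    "label": ["old", "old", "old"],
})

# Filtering first: the filtered frame keeps whatever label existed at that moment
stale = points[points["cluster"] != -1]
points["label"] = ["new", "new", "new"]
print(stale["label"].tolist())

# Assigning first, then filtering: the filtered frame carries the new label
fresh = points[points["cluster"] != -1].copy()
print(fresh["label"].tolist())
```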


