| Question Type     | Example                                                                           |
| ----------------- | --------------------------------------------------------------------------------- |
| **Structural**    | What latent factors explain the most variability in my mixed dataset?             |
| **Variable**      | Which variables drive the first few principal components?                         |
| **Redundancy**    | Can I eliminate variables without losing important information?                   |
| **Clustering**    | Are there natural subgroups in my data? What defines them?                        |
| **Segmentation**  | How do different groups (e.g., high vs. low income) differ in structure?          |
| **Visualization** | Can I reduce 372 variables into 2-3 informative dimensions to visualize patterns? |


| Ordinal Variable Type                                        | How to Treat | Cast To       | FAMD Behavior        |
| ------------------------------------------------------------ | ------------ | ------------- | -------------------- |
| Non-numeric or symbolic (e.g., "Low", "Med", "High")         | Categorical  | `category`    | One-hot encoded      |
| Numeric but arbitrary labels (e.g., 1 = "bad", 2 = "medium") | Categorical  | `category`    | One-hot encoded      |
| Numeric with consistent intervals (e.g., 1–5 Likert scale)   | Continuous   | `float`/`int` | Included in PCA part |
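
The casting rules in the table above can be sketched on a toy frame. The column names and values below are hypothetical and only illustrate the three rows of the table. This is a minimal example, not the preprocessing used later in the notebook.

```python
import pandas as pd

# Toy frame; the column names are hypothetical, not from the dataset.
df_demo = pd.DataFrame({
    "severity": ["Low", "High", "Med", "Low"],   # symbolic labels
    "quality_code": [1, 2, 1, 2],                # arbitrary numeric labels
    "likert": [1, 5, 3, 4],                      # consistent 1-5 intervals
})

# Rows 1 and 2 of the table: cast to category so the levels get one-hot encoded.
df_demo["severity"] = df_demo["severity"].astype("category")
df_demo["quality_code"] = df_demo["quality_code"].astype("category")

# Row 3 of the table: keep as a number so it joins the continuous (PCA) part.
df_demo["likert"] = df_demo["likert"].astype(float)

print(df_demo.dtypes)
```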


1. Impute missing values (median for continuous, mode for categorical)
2. Treat all ordinal variables as nominal if spacing isn't meaningful
3. One-hot encode all categorical variables (ordinal + nominal)
4. Run MCA on this encoded categorical dataset
5. Optionally, run PCA on the standardized continuous-only dataset
6. Analyze latent factors separately or visualize in a shared space (e.g., via UMAP on MCA + PCA embeddings)
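
Steps 1, 2, 3, and 5 above can be sketched with pandas and NumPy alone. The toy data and column names are hypothetical, and PCA is done here with a plain SVD rather than a library class. Step 4 (MCA) is left to a library such as `prince`, which takes the categorical columns as input.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; column names and values are hypothetical.
toy = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 47.0, 29.0],
    "los_days": [12.0, 30.0, np.nan, 18.0, 9.0],
    "sex": ["F", "M", None, "M", "M"],
    "rating": ["Low", "High", "Med", None, "Low"],  # ordinal, treated as nominal
})
num_cols = ["age", "los_days"]
cat_cols = ["sex", "rating"]

# Step 1: median for continuous columns, mode for categorical columns.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

# Steps 2-3: one-hot encode every categorical column (ordinal + nominal).
# This indicator matrix is the object that MCA (step 4) decomposes.
indicators = pd.get_dummies(toy[cat_cols]).astype(float)

# Step 5: PCA on the standardized continuous columns via SVD.
z = (toy[num_cols] - toy[num_cols].mean()) / toy[num_cols].std(ddof=0)
_, sing, vt = np.linalg.svd(z.to_numpy(), full_matrices=False)
pca_scores = z.to_numpy() @ vt.T            # row coordinates on each component
explained = sing**2 / np.sum(sing**2)       # share of variance per component

print(indicators.columns.tolist())
print(explained)
```

Each row of `indicators` sums to the number of categorical variables, which is a quick sanity check that every row has exactly one active level per variable.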


In [174]:
import pandas as pd
import os
import numpy as np
import json

#read in the data
filename = 'form_1_cleaned.csv'

filepath = os.path.join('..', 'Database', 'processed', filename)

df = pd.read_csv(filepath)


code_dict_filename = 'code_dict.json'

code_dic_path = os.path.join('..', 'Database', 'processed', code_dict_filename)

with open(code_dic_path, 'r') as file:
    code_dic = json.load(file)




In [175]:
#I manually separated the ordinal variables from the nominal ones.
#If I did this again, I would first strip the known non-value codes and then treat any
#variable with more than 2 remaining codes (i.e., not binary) as ordinal.
ordinal_col = [
    'AgeGroup',
    'BMICat',
    'CongestiveHeartFailureTBIOnset',
    'DRINKCat',
    'DRSEmpA',
    'DRSEmpD',
    'DRSEyeA',
    'DRSEyeD',
    'DRSFeedA',
    'DRSFeedD',
    'DRSFuncA',
    'DRSFuncD',
    'DRSGroomA',
    'DRSGroomD',
    'DRSMotA',
    'DRSMotD',
    'DRSToiletA',
    'DRSToiletD',
    'DRSVerA',
    'DRSVerD',
    'DementiaTBIOnset',
    'DiabetesHighBloodSugarTBIOnset',
    'EDUCATION',
    'Earn',
    'EduYears',
    'EduYearsOld',
    'FIMBathA',
    'FIMBathD',
    'FIMBedTransA',
    'FIMBedTransD',
    'FIMBladAccA',
    'FIMBladAccD',
    'FIMBladAsstA',
    'FIMBladAsstD',
    'FIMBladMgtA',
    'FIMBladMgtD',
    'FIMBwlAccA',
    'FIMBwlAccD',
    'FIMBwlAsstA',
    'FIMBwlAsstD',
    'FIMBwlMgtA',
    'FIMBwlMgtD',
    'FIMCompA',
    'FIMCompD',
    'FIMDrsdwnA',
    'FIMDrsdwnD',
    'FIMDrupA',
    'FIMDrupD',
    'FIMExpressA',
    'FIMExpressD',
    'FIMFeedA',
    'FIMFeedD',
    'FIMGroomA',
    'FIMGroomD',
    'FIMLocoD',
    'FIMLocoModeD',
    'FIMMemA',
    'FIMMemD',
    'FIMProbSlvA',
    'FIMProbSlvD',
    'FIMSocialA',
    'FIMSocialD',
    'FIMStairsA',
    'FIMStairsD',
    'FIMToilTransA',
    'FIMToilTransD',
    'FIMToiletA',
    'FIMToiletD',
    'FIMTubTransA',
    'FIMTubTransD',
    'FIMWalkingA',
    'FIMwcA',
    'GCSCat',
    'GCSEye',
    'GCSMot',
    'GCSTot',
    'GCSVer',
    'HeartAttackTBIOnset',
    'HighBloodCholesterolTBIOnset',
    'HypertensionTBIOnset',
    'LiverDiseaseTBIOnset',
    'MOB12StepsA',
    'MOB12StepsD',
    'MOB1StepCurbA',
    'MOB1StepCurbD',
    'MOB4StepsA',
    'MOB4StepsD',
    'MOBCarTranA',
    'MOBCarTranD',
    'MOBChairTranA',
    'MOBChairTranD',
    'MOBLyingA',
    'MOBLyingD',
    'MOBPickUpA',
    'MOBPickUpD',
    'MOBRollA',
    'MOBRollD',
    'MOBSitA',
    'MOBSitD',
    'MOBSitStandA',
    'MOBToilettranA',
    'MOBToilettranD',
    'MOBWalk10ftA',
    'MOBWalk10ftD',
    'MOBWalk150ftA',
    'MOBWalk150ftD',
    'MOBWalkUnevenA',
    'MOBWalkUnevenD',
    'MOBWalkturnA',
    'MOBWalkturnD',
    'MOBWheel150ftA',
    'MOBWheel150ftD',
    'MOBWheel50ftA',
    'MOBWheel50ftD',
    'MostSevere',
    'MovementDisorderTBIOnset',
    'OsteoarthritisTBIOnset',
    'PRTHome',
    'PRTSchool',
    'PRTVol',
    'PRTWork',
    'PTSDTBIOnset',
    'PanicAttacksTBIOnset',
    'RheumatoidArthritisTBIOnset',
    'SCEatA',
    'SCEatD',
    'SCFootwearA',
    'SCFootwearD',
    'SCLBDressA',
    'SCLBDressD',
    'SCOralHygA',
    'SCShowerA',
    'SCShowerD',
    'SCToiletA',
    'SCToiletD',
    'SCUBDressA',
    'SCUBDressD',
    'SmkCig',
    'StrokeTBIOnset'

]

nominal_col = [col for col in code_dic['categorical_cols'] if col not in ordinal_col]

continuous_col = code_dic['numeric_cols']

record_cols = ['Mod1Id']

#remove non-value codes 81-84 and 99. I missed these when preprocessing form 1
df = df.replace([81.0, 82.0, 83.0, 84.0, 99.0], np.nan)

#label columns
df[nominal_col] = df[nominal_col].astype('category')
for col in ordinal_col:
    df[col] = pd.Categorical(df[col], ordered=True)  #categories are inferred from the sorted unique values
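
The automatic split described in the comments at the top of this cell could look roughly like the sketch below. The helper `split_ordinal_nominal` is hypothetical, the demo values are made up, and the set of non-value codes is assumed to match the codes replaced with NaN in this cell.

```python
import pandas as pd

# Assumed non-value codes (same values the notebook replaces with NaN).
NON_VALUE_CODES = {81.0, 82.0, 83.0, 84.0, 99.0}

def split_ordinal_nominal(frame, categorical_cols, non_value_codes=NON_VALUE_CODES):
    """Hypothetical helper: more than 2 remaining levels -> ordinal, else nominal."""
    ordinal, nominal = [], []
    for col in categorical_cols:
        # Distinct observed values once the non-value codes are set aside.
        levels = set(frame[col].dropna().unique()) - set(non_value_codes)
        (ordinal if len(levels) > 2 else nominal).append(col)
    return ordinal, nominal

demo = pd.DataFrame({
    "FIMBathA": [1.0, 4.0, 7.0, 99.0],   # three real levels plus a non-value code
    "Stroke": [0.0, 1.0, 1.0, 99.0],     # binary plus a non-value code
})
ordinal_demo, nominal_demo = split_ordinal_nominal(demo, demo.columns)
print(ordinal_demo, nominal_demo)
```

In this toy case `FIMBathA` lands in the ordinal list and `Stroke` in the nominal list. The rule would also flag multi-level nominal variables (for example zip codes or cause-of-death codes) as ordinal, so the result still needs a manual review.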


In [176]:
#mixed data types will cause problems for the dimensionality reduction.

mixed_cols = []

for col in df.columns:
    types = df[col].map(type).nunique()
    if types > 1:
        print(f"{col} has mixed types: {df[col].map(type).value_counts()}")
        mixed_cols.append(col)

#for any mixed columns, cast them to string and then to category type
df[mixed_cols] = df[mixed_cols].astype(str).astype('category')

#some columns contain no information. I got this list from an imputer error message in the cell below.
df.drop(columns=['CongestiveHeartFailure', 'Dementia', 'DiabetesHighBloodSugar',
 'HeartAttack', 'HighBloodCholesterol', 'Hypertension', 'LiverDisease',
 'MovementDisorder', 'Osteoarthritis', 'PTSDHlth', 'PanicAttacks',
 'RheumatoidArthritis', 'Stroke', 'Mod1Id','CongestiveHeartFailureTBIOnset', 'DementiaTBIOnset',
 'DiabetesHighBloodSugarTBIOnset', 'HeartAttackTBIOnset',
 'HighBloodCholesterolTBIOnset', 'HypertensionTBIOnset',
 'LiverDiseaseTBIOnset', 'MovementDisorderTBIOnset',
 'OsteoarthritisTBIOnset' ,'PRTVol' ,'PTSDTBIOnset', 'PanicAttacksTBIOnset',
 'RheumatoidArthritisTBIOnset', 'StrokeTBIOnset'],inplace=True)

print(mixed_cols)


DeathCause1 has mixed types: DeathCause1
<class 'float'>    19559
<class 'str'>          1
Name: count, dtype: int64
DeathCause2 has mixed types: DeathCause2
<class 'float'>    19559
<class 'str'>          1
Name: count, dtype: int64
DeathECode has mixed types: DeathECode
<class 'float'>    19558
<class 'str'>          2
Name: count, dtype: int64
['DeathCause1', 'DeathCause2', 'DeathECode']


In [177]:
from sklearn.impute import SimpleImputer
import numpy as np

#replace inf with NaN so it is imputed along with the other missing values
df.replace([np.inf, -np.inf], np.nan, inplace=True)

#segment variable types
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
cat_cols = df.select_dtypes(include=['category', 'object']).columns
 
#instantiate imputer
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
#fit and transform
df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

In [178]:
#check for NaN values
assert not df.isna().any().any(), "Still contains NaNs"
assert np.isfinite(df.select_dtypes(include=[np.number])).all().all(), "Still contains infs"
print(f"Remaining rows: {df.shape[0]}")


Remaining rows: 19560


In [179]:
import prince

mca = prince.MCA(
    n_components=2,
    n_iter=10,
    copy=True,
    check_input=True,
    random_state=42,
    engine="sklearn",
)
mca_result = mca.fit_transform(df[cat_cols])
mca_result.columns = ['MCA1', 'MCA2']

print(mca_result.head())


       MCA1      MCA2
0 -0.053271  0.345846
1 -0.206256 -0.300468
2 -0.117989 -0.079379
3 -0.130342  0.158021
4 -0.072096  0.564311


In [188]:
import altair as alt

#lift Altair's default 5,000-row limit; the dataset has 19,560 rows
alt.data_transformers.disable_max_rows()


mca_result["label"] = df["FIMLocoModeD"].astype(str)
mca_result.reset_index(drop=True, inplace=True)

mca_result = mca_result.astype({"MCA1": "float", "MCA2": "float", "label": "str"})


chart = alt.Chart(mca_result).mark_circle(size=60).encode(
    x='MCA1:Q',
    y='MCA2:Q',
    color='label:N',
    tooltip=['label']
).interactive()

print(mca_result.dtypes)
print(mca_result.head())

chart


MCA1     float64
MCA2     float64
label     object
dtype: object
       MCA1      MCA2 label
0 -0.053271  0.345846   1.0
1 -0.206256 -0.300468   1.0
2 -0.117989 -0.079379   1.0
3 -0.130342  0.158021   1.0
4 -0.072096  0.564311   1.0


In [None]:

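#column_coordinates returns category coordinates, not contributions. Sorting in descending
#order shows the categories that sit furthest along each axis in the positive direction.
#Note: the full df is passed here rather than df[cat_cols]. If df still contains continuous
#columns, they are one-hot encoded value by value, which would explain entries such as
#B3TCOMP__-0.4103 in the output.
column_coords = mca.column_coordinates(df).reset_index()
column_coords.columns = ['category', 'MCA1', 'MCA2']

print("Largest positive coordinates on MCA1:")
print(column_coords.sort_values(by='MCA1',ascending=False).head(10))

print("\nLargest positive coordinates on MCA2:")
print(column_coords.sort_values(by='MCA2',ascending=False).head(10))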

                           category      MCA1      MCA2
8121                LOSRehab__172.0  5.940065  2.068095
13104               ZipInj__47340.0  5.878969  2.878416
402                B3TCOMP__-0.4103  5.817686 -1.408929
2027                 B3TEF__-1.0104  5.817686 -1.408929
3976   BackCountDigits_i_n__-1.8277  5.817686 -1.408929
...                             ...       ...       ...
12824               ZipInj__43766.0 -0.764050 -1.285237
12480               ZipInj__37856.0 -0.765542 -1.444577
12819               ZipInj__43755.0 -0.768078 -1.534691
14634               ZipInj__80217.0 -0.768628 -1.557798
14797               ZipInj__81025.0 -0.801908 -1.484975

[15681 rows x 3 columns]
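
The table above ranks categories by coordinate, which is not the same as contribution: a rare category can sit far from the origin and still contribute little to an axis. In correspondence analysis, the contribution of a category to an axis is its mass times its squared coordinate, divided by the axis eigenvalue. The sketch below works through that formula on hypothetical toy data with NumPy. It illustrates the standard formula and is not `prince`'s implementation.

```python
import numpy as np
import pandas as pd

# Toy categorical data; names and values are hypothetical.
toy = pd.DataFrame({
    "color": ["r", "r", "g", "g", "b", "b"],
    "size": ["S", "L", "S", "L", "S", "S"],
})
Z = pd.get_dummies(toy).astype(float)        # indicator (one-hot) matrix
P = Z.to_numpy() / Z.to_numpy().sum()        # correspondence matrix
r = P.sum(axis=1)                            # row masses
c = P.sum(axis=0)                            # column (category) masses

# Standardized residuals from independence, then SVD.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
_, sigma, vt = np.linalg.svd(S, full_matrices=False)

k = 2  # number of axes to keep
# Principal coordinates of the categories on the first k axes.
coords = (vt[:k].T * sigma[:k]) / np.sqrt(c)[:, None]
# Contribution = mass * coordinate^2 / eigenvalue (eigenvalue = sigma^2).
contrib = c[:, None] * coords**2 / sigma[:k]**2

result = pd.DataFrame(contrib, index=Z.columns, columns=["ctr_1", "ctr_2"])
print(result.round(3))
```

The contributions on each axis sum to 1, so they can be read as shares and ranked directly, unlike raw coordinates.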


In [190]:
from prince import FAMD

famd = FAMD(n_components=2)
famd = famd.fit(df)

# Transform data
df_famd = famd.transform(df)
df_famd.columns = ['FAMD_1', 'FAMD_2']




In [191]:
# Optional: add labels from original df
df_famd['label'] = df['FIMMOTD']

# Altair plot
chart = alt.Chart(df_famd.reset_index()).mark_circle().encode(
    x='FAMD_1',
    y='FAMD_2',
    color='label:N',
    tooltip='label'
).interactive()

chart

In [None]:

print("Top contributors to FAMD 01:")
famd.column_contributions_.sort_values(by=0,ascending=False).head(10).style.format('{:.3%}')

Top contributors to FAMD 01:


| variable     | component 0 | component 1 |
| ------------ | ----------- | ----------- |
| FIMLocoModeD | 0.019%      | 0.001%      |
| SCLBDressA   | 0.017%      | 0.002%      |
| SCUBDressA   | 0.016%      | 0.002%      |
| MOBLyingA    | 0.016%      | 0.001%      |
| SCLBDressD   | 0.016%      | 0.002%      |
| MOBRollA     | 0.015%      | 0.001%      |
| MOBSitA      | 0.015%      | 0.001%      |
| SCToiletD    | 0.015%      | 0.001%      |
| SCFootwearD  | 0.015%      | 0.001%      |
| SCUBDressD   | 0.015%      | 0.001%      |


In [200]:
print("Top contributors to FAMD 02:")
famd.column_contributions_.sort_values(by=1,ascending=False).head(10).style.format('{:.3%}')

Top contributors to FAMD 02:


| variable   | component 0 | component 1 |
| ---------- | ----------- | ----------- |
| FIMTOTD    | 0.000%      | 0.022%      |
| FIMMOTD    | 0.000%      | 0.019%      |
| DRSd       | 0.000%      | 0.018%      |
| DRSdHigh   | 0.000%      | 0.018%      |
| DRSdLow    | 0.000%      | 0.018%      |
| DRSaLow    | 0.000%      | 0.018%      |
| DRSa       | 0.000%      | 0.017%      |
| FIMTOTA    | 0.000%      | 0.017%      |
| DRSaHigh   | 0.000%      | 0.017%      |
| FIMToiletD | 0.003%      | 0.017%      |
