# Data Preprocessing for Perfect Partner Probability Website Backend

This Jupyter notebook preprocesses data from two surveys (`ASEC` and `NHANES`) for the backend of the **[Perfect Partner Probability](https://perfectpartner.netlify.app/)** website, which is built with Django REST Framework.

In [1]:
# Import the necessary packages
import pandas as pd 
import numpy as np
import sqlite3

## 2023 Annual Social and Economic Supplements (ASEC) by CPS (Current Population Survey)

### Data Dictionary ASEC

| Column Name | Description | Target Age Range | Values |
| --- | --- | --- | --- |
| A_AGE | Age of the participant | 0-79 years of age, 80-84 years of age, 85+ years of age | `00-79`: 0-79 years of age<br> `80`: 80-84 years of age<br> `85`: 85+ years of age |
| A_MARITL | Marital Status | All Persons | `1`: Married - civilian spouse present<br> `2`: Married - AF spouse present<br> `3`: Married - spouse absent (except separated)<br> `4`: Widowed<br> `5`: Divorced<br> `6`: Separated<br> `7`: Never married |
| A_SEX | Sex of the participant | All Persons | `1`: Male<br> `2`: Female |
| PRCITSHP | Citizenship Group | All Persons | `1`: Native, born in US<br> `2`: Native, born in PR or US outlying area<br> `3`: Native, born abroad of US parent(s)<br> `4`: Foreign born, US cit by naturalization<br> `5`: Foreign born, not a US citizen |
| PRDTRACE | Race | All Persons | `01`: White only<br> `02`: Black only<br> `03`: American Indian, Alaskan Native only (AI)<br> `04`: Asian only<br> `05`: Hawaiian/Pacific Islander only (HP)<br> `06`: White-Black<br> `07`: White-AI<br> `08`: White-Asian<br> `09`: White-HP<br> `10`: Black-AI<br> `11`: Black-Asian<br> `12`: Black-HP<br> `13`: AI-Asian<br> `14`: AI-HP<br> `15`: Asian-HP<br> `16`: White-Black-AI<br> `17`: White-Black-Asian<br> `18`: White-Black-HP<br> `19`: White-AI-Asian<br> `20`: White-AI-HP<br> `21`: White-Asian-HP<br> `22`: Black-AI-Asian<br> `23`: White-Black-AI-Asian<br> `24`: White-AI-Asian-HP<br> `25`: Other 3 race comb.<br> `26`: Other 4 or 5 race comb. |
| PTOTVAL | Total persons income | All Persons aged 15+ | `0`: none<br> Negative amount: Income (loss)<br> Positive amount: Income |


In [2]:
# Define the mapping of columns
column_to_keep = {
    "A_AGE": "age",
    "A_MARITL": "marital_status",
    "A_SEX": "sex",
    "PRCITSHP": "citizenship",
    "PRDTRACE": "race",
    "PTOTVAL": "income",
    "pwwgt0": "weights"
}

# Read the base statistical weights data
base_weights = pd.read_csv("asec_csv_repwgt_2023.csv", usecols=["PPPOS", "h_seq", "pwwgt0"])

# Read the persons data
ASEC = pd.read_csv("pppub23.csv")

# Merge persons data with base weights on the shared key columns (primary key / foreign key)
ASEC = ASEC.merge(base_weights, left_on=["PPPOS", "PH_SEQ"], right_on=["PPPOS", "h_seq"], validate="1:1")

# Select and reorder columns based on the mapping for the final dataset
ASEC = ASEC[list(column_to_keep)]
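
The merge above relies on `validate="1:1"` to guarantee that each person row matches exactly one weight row. A minimal sketch of that check on made-up toy frames (the values and the `weights_dup` frame are illustrative assumptions, not taken from the survey files):

```python
import pandas as pd

# Toy stand-ins for the person file and the weight file (values are made up).
persons = pd.DataFrame(
    {"PH_SEQ": [1, 1, 2], "PPPOS": [41, 42, 41], "A_AGE": [66, 68, 52]}
)
weights = pd.DataFrame(
    {"h_seq": [1, 1, 2], "PPPOS": [41, 42, 41], "pwwgt0": [1580.07, 1580.07, 1789.60]}
)

# Same key layout as the notebook cell: household sequence plus person position.
merged = persons.merge(
    weights,
    left_on=["PPPOS", "PH_SEQ"],
    right_on=["PPPOS", "h_seq"],
    validate="1:1",  # raises pandas.errors.MergeError if a key pair repeats on either side
)

# Hypothetical corrupted weight file: the first row appears twice.
weights_dup = pd.concat([weights, weights.iloc[[0]]], ignore_index=True)
try:
    persons.merge(
        weights_dup,
        left_on=["PPPOS", "PH_SEQ"],
        right_on=["PPPOS", "h_seq"],
        validate="1:1",
    )
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

Without `validate`, the duplicated key would silently produce an extra person row; with it, the merge fails early.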

In [3]:
ASEC

Unnamed: 0,A_AGE,A_MARITL,A_SEX,PRCITSHP,PRDTRACE,PTOTVAL,pwwgt0
0,66,4,2,1,1,12120,1580.074225
1,68,4,2,1,1,16800,1580.074225
2,52,1,2,1,1,8137,1789.597767
3,51,1,1,1,1,42000,1789.597767
4,78,4,2,1,1,14713,1307.917972
...,...,...,...,...,...,...,...
146128,17,7,2,1,4,0,379.434802
146129,15,7,1,1,4,0,410.095961
146130,59,1,1,1,15,33113,416.372671
146131,60,1,2,1,4,78092,416.372671
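
The `pwwgt0` column is kept because each row stands for a different number of people in the population. As a rough sketch of how such weights might be used downstream (the toy values below are made up, and this is not necessarily how the website backend computes its figures), a weighted share can differ noticeably from the raw row share:

```python
import numpy as np
import pandas as pd

# Toy sample with made-up values; A_MARITL code 7 means "Never married".
toy = pd.DataFrame(
    {
        "A_SEX": [1, 2, 2, 1],
        "A_MARITL": [7, 1, 7, 4],
        "pwwgt0": [100.0, 300.0, 100.0, 500.0],
    }
)

never_married = toy["A_MARITL"] == 7

# Share of rows vs. share of the weighted population.
unweighted_share = never_married.mean()
weighted_share = np.average(never_married, weights=toy["pwwgt0"])
```

Here half of the rows are never married, but they carry only 200 of the 1000 total weight units.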


## 2017-2020 National Health and Nutrition Examination Survey (NHANES)

### Data Dictionary NHANES (only the demographic fields that need clarification)

| Column Name | Description | Target Age Range | Values |
| --- | --- | --- | --- |
| RIAGENDR | Gender of the participant | 0 - 150 years | `1`: Male<br> `2`: Female<br> `.`: Missing |
| RIDRETH3 | Race/Hispanic Origin with NH Asian | 0 - 150 years | `1`: Mexican American<br> `2`: Other Hispanic<br> `3`: Non-Hispanic White<br> `4`: Non-Hispanic Black<br> `6`: Non-Hispanic Asian<br> `7`: Other Race - Including Multi-Racial<br> `.`: Missing |
| DMDMARTZ | Marital Status | 20 - 150 years | `1`: Married/Living with Partner<br> `2`: Widowed/Divorced/Separated<br> `3`: Never married<br> `77`: Refused<br> `99`: Don't Know<br> `.`: Missing |
| DMDBORN4 | Country of birth | 0 - 150 years | `1`: Born in 50 US states or Washington, DC<br> `2`: Others<br> `77`: Refused<br> `99`: Don't Know<br> `.`: Missing |


In [5]:
# Define a dictionary for column renaming
measures_to_keep = {
    "RIAGENDR": "sex",
    "RIDRETH3": "race/HispanicOrigin w/ NH Asian",
    "DMDMARTZ": "maritalstatus",
    "RIDAGEYR": "age",
    "BMXWT": "weight_kg",
    "BMXHT": "height_cm",
    "BMXBMI": "bmi",
    "WTMECPRP": "weights", # statistical weights for observations
    "DMDBORN4": "birth_place"
}

# Read and merge body measures with demographic data of NHANES for the final dataset
NHANES = (
    pd.read_sas("P_DEMO.XPT")
    .merge(pd.read_sas("P_BMX.XPT"), how="inner", on="SEQN", validate="1:1")
    [list(measures_to_keep)]
)

In [6]:
NHANES

Unnamed: 0,RIAGENDR,RIDRETH3,DMDMARTZ,RIDAGEYR,BMXWT,BMXHT,BMXBMI,WTMECPRP,DMDBORN4
0,1.0,6.0,,2.0,,,,8951.815567,1.0
1,2.0,1.0,,13.0,42.2,154.7,17.6,12271.157043,1.0
2,1.0,3.0,,2.0,12.0,89.3,15.0,16658.764203,1.0
3,2.0,6.0,3.0,29.0,97.1,160.2,37.8,8154.968193,2.0
4,1.0,2.0,,2.0,13.6,,,6848.271782,1.0
...,...,...,...,...,...,...,...,...,...
14295,1.0,4.0,1.0,40.0,108.8,168.7,38.2,21666.889837,1.0
14296,1.0,4.0,,2.0,15.4,93.7,17.5,1838.169709,1.0
14297,2.0,3.0,,7.0,22.9,123.3,15.1,16497.806674,1.0
14298,1.0,4.0,2.0,63.0,79.5,176.4,25.5,4853.430230,1.0


---
* The `harmonize_data` function recodes **race, marital status, and US-born status** in the two dataframes (`df1` for ASEC, `df2` for NHANES) to a shared set of numeric codes, so that both surveys can be analyzed consistently.

In [5]:
def harmonize_data(df1, df2):
    """
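    Harmonizes race, marital status, and born-in-US status across two dataframes using numeric codes.

    Parameters:
    df1 (DataFrame): ASEC dataset.
    df2 (DataFrame): NHANES dataset.

    Returns:
    tuple: The harmonized ASEC and NHANES dataframes, with the remaining columns renamed
    via column_to_keep and measures_to_keep.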
    """

    def harmonize_race(df1, df2):
        # Race mapping with numeric codes: 1 for White, 2 for Black, 3 for Asian, 4 for Other
        
        # Apply mapping to df1
        if 'PRDTRACE' in df1.columns:
            df1['race'] = df1['PRDTRACE'].map({1: 1, 2: 2, 4: 3}).fillna(4)  # Default to 4 (Other) for missing/unmapped values
        
        # Apply mapping to df2
        race_mapping_measurement = {3: 1, 4: 2, 6: 3, 1: 4, 2: 4, 7: 4}
        if 'RIDRETH3' in df2.columns:
            df2['race'] = df2['RIDRETH3'].map(race_mapping_measurement).fillna(4)  # Default to 4 (Other) for missing/unmapped values
        return df1, df2

    def harmonize_marital_status(df1, df2):
        # Marital status mapping: 1 for Married, 2 for Widowed/Divorced/Separated, 3 for Never married/Other
        
        # Apply mapping to df1
        df1['marital_status'] = df1['A_MARITL'].map({1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2}).fillna(3)  # Default to 3 (Other) for missing/unmapped values
        # Apply mapping to df2
        df2['marital_status'] = df2['DMDMARTZ'].map({1: 1, 2: 2}).fillna(3)  # Default to 3 (Other) for missing/unmapped values
        return df1, df2

    def harmonize_Born_US(df1, df2):
        # Born in US status mapping: 1 for Yes, 0 for No
        # Apply mapping to df1
        df1['born_us'] = df1['PRCITSHP'].map({1: 1}).fillna(0)  # Default to 0 (No) for missing/unmapped values
        # Apply mapping to df2
        df2['born_us'] = df2['DMDBORN4'].map({1: 1}).fillna(0)  # Default to 0 (No) for missing/unmapped values
        return df1, df2

    # Apply each harmonization function in sequence
    df1, df2 = harmonize_race(df1, df2)
    df1, df2 = harmonize_marital_status(df1, df2)
    df1, df2 = harmonize_Born_US(df1, df2)

    # Drop the original columns
    df1 = df1.drop(["PRDTRACE", "A_MARITL", "PRCITSHP"], axis=1, errors='ignore')
    df2 = df2.drop(["RIDRETH3", "DMDMARTZ", "DMDBORN4"], axis=1, errors='ignore')

    return df1.rename(columns=column_to_keep), df2.rename(columns=measures_to_keep)
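
The recoding inside `harmonize_data` follows one pattern throughout: `map` the known codes, then `fillna` with the catch-all category. A small sketch of that pattern on made-up race codes (the sample values are illustrative, and the `.astype(int)` step is an addition not present in the function above):

```python
import pandas as pd

# Made-up sample codes: ASEC race codes are integers, NHANES codes arrive as floats with NaN.
asec_race = pd.Series([1, 2, 4, 6, 15], name="PRDTRACE")
nhanes_race = pd.Series([3.0, 4.0, 6.0, 1.0, None], name="RIDRETH3")

# Shared target codes: 1 White, 2 Black, 3 Asian, 4 Other.
asec_harmonized = asec_race.map({1: 1, 2: 2, 4: 3}).fillna(4).astype(int)
nhanes_harmonized = (
    nhanes_race.map({3: 1, 4: 2, 6: 3, 1: 4, 2: 4, 7: 4})
    .fillna(4)       # missing or unmapped values fall into "Other"
    .astype(int)     # assumption: integer codes are preferred over the float result of map
)
```

Any code that is absent from the mapping dictionary, including the multi-race ASEC codes and missing NHANES values, ends up in category 4.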

---
* Write the filtered data (adults aged 18 and over) to SQLite so the Django backend of the **Perfect Partner Probability** project can use it.

In [8]:
asec, nhanes = harmonize_data(ASEC, NHANES)

conn = sqlite3.connect('myDB.sqlite3')
asec.query("age >= 18").to_sql('Asec', conn, if_exists='replace')
nhanes.query("age >= 18").to_sql('Nhanes', conn, if_exists='replace')
asec.query("sex == 1 & age >= 18").to_sql("AsecMale", conn, if_exists='replace')
asec.query("sex == 2 & age >= 18").to_sql("AsecFemale", conn, if_exists='replace')
nhanes.query("sex == 1 & age >= 18").to_sql("NhanesMale", conn, if_exists='replace')
nhanes.query("sex == 2 & age >= 18").to_sql("NhanesFemale", conn, if_exists='replace')
conn.close()
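
To check that the tables were written as expected, they can be read back with `pd.read_sql_query`. The sketch below uses an in-memory database and a made-up toy frame so it runs without the survey files; unlike the cell above, it passes `index=False`, which is an assumption about not wanting the DataFrame index stored as a column.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the harmonized ASEC frame (values are made up).
toy = pd.DataFrame(
    {"age": [17, 25, 40], "sex": [1, 2, 1], "weights": [379.4, 410.1, 416.4]}
)

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk

# Same filters and table names as the notebook cell, applied to the toy frame.
toy.query("age >= 18").to_sql("Asec", conn, if_exists="replace", index=False)
toy.query("sex == 1 & age >= 18").to_sql("AsecMale", conn, if_exists="replace", index=False)

# Read the tables back to confirm the row counts.
adults = pd.read_sql_query("SELECT * FROM Asec", conn)
male_count = pd.read_sql_query("SELECT COUNT(*) AS n FROM AsecMale", conn)
conn.close()
```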