# **Predictive Modeling of Chest Pain and Shortness of Breath in a Diverse Population:</br>A Multivariate Analysis of Health and Lifestyle Factors**


## **Mohammad Hossein Mahmoudi**

--- 
---

## **Project Content** <a id = 0></a>

### Zeroth Step: Dataset Management

1. [Introduction](#1)
2. [Load The Relevant Libraries and Packages](#2)
3. [Create a File Management System and Read the XPT Files](#3)
4. [Select the Important Features from the Datasets](#4)
5. [Monitor the Datasets and Merge them into a Final Object](#5)
6. [Prepare the Dataframe in a Superficial Way](#6)

### First Step: Data Preprocessing

7. [Load the Dataset from This Working Directory](#7)
8. [Deal with the Missing Values and Impute the Features](#8)

### Second Step: Exploratory Data Analysis

### Third and Final Step: Modeling and Examination

***

# Zeroth Step: Dataset Management

***

## 1. Introduction <a id = 1></a>

### **Problem Explanation**

<div style="text-align: justify">
text
</div>

</br>

<div style="text-align: justify">
text
</div>

[Project Content](#0)

## 2. Load The Relevant Libraries and Packages <a id = 2></a>

Import the libraries needed to manage directories, work with files, and load the datasets as DataFrames.

In [148]:
# Standard library: file and directory operations
import fnmatch
import os
import shutil

# Data manipulation
import numpy as np   # NumPy for numerical operations
import pandas as pd  # pandas for handling data in tabular form

# Third-party library for natural sorting of strings
from natsort import natsorted

# scikit-learn: splitting and encoding
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# scikit-learn: imputation
# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

[Project Content](#0)

## 3. Create a File Management System and Read the XPT Files <a id = 3></a>

Create a new directory to hold a copy of every dataset.

In [149]:
# Get the current working directory
current_directory = os.getcwd()

# Create the full path to the new folder.
tmp_folder_path = os.path.join(current_directory, "tmp")

# Remove the existing folder if it exists.
if os.path.exists(tmp_folder_path):
    shutil.rmtree(tmp_folder_path)

# Create the new folder.
os.makedirs(tmp_folder_path)

Create a list of all directories.

In [150]:
# List all directories in the current directory, skipping the Git metadata folder
directories = [d for d in os.listdir(current_directory)
               if os.path.isdir(os.path.join(current_directory, d)) and d != ".git"]

# Display the number of folders in the current directory
print(f"There is/are {len(directories)} folder(s) in the current directory:")
print()

# Print each directory name, numbered from 1, on a new line
for position, directory in enumerate(directories, start=1):
    print(f"{position}. {directory}")

There is/are 2 folder(s) in the current directory:

1. Datasets
2. tmp


That looks right. Now collect the paths of all XPT files under the current directory.

In [151]:
def find_xpt_files(root_dir):
    """
    Find .XPT files in a directory and its subdirectories.

    Parameters:
    - root_dir (str): The root directory to start searching for .XPT files.

    Returns:
    - List[str]: A list of full paths to .XPT files found in the specified directory and its subdirectories.
    """

    # Initialize an empty list to store the file paths
    xpt_files = []

    # Walk through the directory tree rooted at root_dir.
    for root, _dirs, files in os.walk(root_dir):

        # Add the full path of every file whose name ends in .XPT.
        # Note: on POSIX systems this match is case-sensitive.
        for filename in fnmatch.filter(files, "*.XPT"):
            xpt_files.append(os.path.join(root, filename))

    return xpt_files

In [152]:
# Specify the current directory as the root directory.
current_directory = os.getcwd()

# Call the find_xpt_files function with the current directory.
xpt_file_paths = find_xpt_files(current_directory)

# Print each found .XPT file path on a new line.
for xpt_file_path in xpt_file_paths:
    print(xpt_file_path)

/Users/shahriyar/Desktop/Study/SUT/Data Mining/Data-Mining-Project/Datasets/Questionnaire Data/Smoking - Cigarette Use/P_SMQ.XPT
/Users/shahriyar/Desktop/Study/SUT/Data Mining/Data-Mining-Project/Datasets/Questionnaire Data/Income/P_INQ.XPT
/Users/shahriyar/Desktop/Study/SUT/Data Mining/Data-Mining-Project/Datasets/Questionnaire Data/Diabetes/P_DIQ.XPT
/Users/shahriyar/Desktop/Study/SUT/Data Mining/Data-Mining-Project/Datasets/Questionnaire Data/Cardiovascular Health/P_CDQ.XPT
/Users/shahriyar/Desktop/Study/SUT/Data Mining/Data-Mining-Project/Datasets/Questionnaire Data/Physical Activity/P_PAQ.XPT
/Users/shahriyar/Desktop/Study/SUT/Data Mining/Data-Mining-Project/Datasets/Questionnaire Data/Alcohol Use/P_ALQ.XPT
/Users/shahriyar/Desktop/Study/SUT/Data Mining/Data-Mining-Project/Datasets/Examination Data/Body Measures/P_BMX.XPT
/Users/shahriyar/Desktop/Study/SUT/Data Mining/Data-Mining-Project/Datasets/Demographic Variables and Sample Weights/P_DEMO.XPT
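
The search above matches the pattern `*.XPT`, which on POSIX systems does not match a lower-case `.xpt` extension. The following is a minimal sketch of a case-insensitive variant, assuming that behavior is wanted; the function name `find_xpt_files_any_case` and the demo file names are hypothetical and not part of the original notebook.

```python
from pathlib import Path
import tempfile


def find_xpt_files_any_case(root_dir):
    """Return sorted paths of files whose extension is .xpt in any letter case."""
    root = Path(root_dir)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".xpt"
    )


# Self-check on a throwaway directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as demo_root:
    nested = Path(demo_root) / "Questionnaire Data" / "Income"
    nested.mkdir(parents=True)
    (nested / "P_INQ.XPT").touch()
    (nested / "p_demo.xpt").touch()
    (nested / "notes.txt").touch()
    demo_found = [Path(p).name for p in find_xpt_files_any_case(demo_root)]

print(demo_found)
```

Both spellings of the extension are picked up, while the text file is ignored.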


Now copy each dataset into the new directory, naming each copy after the folder that contains the original file.

In [153]:
datasets_path = os.path.join(current_directory, "tmp")

# Check if the folder exists
if os.path.exists(datasets_path):
    
    # Loop through each .XPT file in the list.
    for xpt_file_path in xpt_file_paths:

        # Use the name of the file's parent folder as the new file name.
        new_file_name = os.path.relpath(xpt_file_path, current_directory).split(os.path.sep)[-2]

        # Create the full path to the new location for the file within the new folder.
        new_file_path = os.path.join(datasets_path, new_file_name + ".XPT")

        # Copy the .XPT file to the new location.
        shutil.copy(xpt_file_path, new_file_path)

    print(f"Files copied to: {datasets_path}")

else:
    print(f"The '{datasets_path}' folder does not exist.")

Files copied to: /Users/shahriyar/Desktop/Study/SUT/Data Mining/Data-Mining-Project/tmp
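
Because each copy is named after its parent folder, two XPT files stored in the same folder would overwrite each other in the new directory. That does not happen with the files listed above, but a small check before copying can make the assumption explicit. This is only a sketch; `find_name_collisions` and the second Body Measures path are hypothetical.

```python
import os
from collections import Counter


def find_name_collisions(xpt_file_paths):
    """Return parent-folder names shared by more than one XPT file."""
    folder_names = [os.path.basename(os.path.dirname(path)) for path in xpt_file_paths]
    counts = Counter(folder_names)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical paths: the second Body Measures file is made up for the demo.
demo_paths = [
    os.path.join("Datasets", "Examination Data", "Body Measures", "P_BMX.XPT"),
    os.path.join("Datasets", "Examination Data", "Body Measures", "P_BMX_EXTRA.XPT"),
    os.path.join("Datasets", "Questionnaire Data", "Income", "P_INQ.XPT"),
]
demo_collisions = find_name_collisions(demo_paths)
print(demo_collisions)
```

An empty result means every copy gets a unique name.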


[Project Content](#0)

The copies are in place, so the files can now be read.</br>
Although loading everything at once is not memory-efficient, read all of the files into memory.

In [154]:
dataset_dict = {}

# Find the copied XPT files in the datasets directory.
xpt_file_paths = find_xpt_files(datasets_path)

# Read each XPT file into a DataFrame, keyed by its file name without the extension.
for xpt_file_path in xpt_file_paths:

    dataset_name = os.path.splitext(os.path.basename(xpt_file_path))[0]
    dataset_dict[dataset_name] = pd.read_sas(xpt_file_path, format="xport")
    
# Iterate through each dataset in the dictionary to show their information.
for key in dataset_dict.keys():

    print(key)
    print(dataset_dict[key].shape)
    print()

Diabetes
(14986, 28)

Demographic Variables and Sample Weights
(15560, 29)

Smoking - Cigarette Use
(11137, 16)

Alcohol Use
(8965, 10)

Cardiovascular Health
(6433, 17)

Income
(15560, 3)

Physical Activity
(9693, 17)

Body Measures
(14300, 22)
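
If memory were a concern, one option would be to read each file in chunks and keep only the columns that are needed. The sketch below shows just the column-collecting step on in-memory stand-ins; `collect_columns` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def collect_columns(chunks, columns):
    """Concatenate only the requested columns from an iterable of DataFrames."""
    kept = [chunk[columns] for chunk in chunks]
    return pd.concat(kept, ignore_index=True)


# Stand-in for the chunks a chunked reader would yield (toy values).
demo_chunks = [
    pd.DataFrame({"SEQN": [1.0, 2.0], "BMXBMI": [29.7, 30.2], "BMXWT": [80.0, 91.0]}),
    pd.DataFrame({"SEQN": [3.0], "BMXBMI": [26.6], "BMXWT": [70.0]}),
]
demo_small = collect_columns(demo_chunks, ["SEQN", "BMXBMI"])
print(demo_small.shape)
```

With a real file, the chunks could come from `pd.read_sas(path, format="xport", chunksize=5000)`, which returns an iterator of DataFrames when `chunksize` is given; the chunk size of 5000 is an arbitrary choice.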



[Project Content](#0)

## 4. Select the Important Features from the Datasets <a id = 4></a>

Before going further, review the values that each selected feature can take for a record.

### **Features**

- **Demographic Variables and Sample Weights**

    <span style="color: grey; font-size: 200%;">&bull;</span> SEQN - Respondent sequence number</br>
    <span style="color: grey; font-size: 200%;">&bull;</span> RIAGENDR - Gender</br>

    | Code or Value | Value Description |
    | ------------- | ----------------- |
    | 1             | Male              |
    | 2             | Female            |
    | ...           | Missing           |

    <span style="color: grey; font-size: 200%;">&bull;</span> RIDAGEYR - Age at Screening Adjudicated</br>

    | Code or Value | Value Description       |
    | ------------- | ----------------------- |
    | 0 to 79       | Range of Values         |
    | 80            | 80 years of age and over|
    | ...           | Missing                 |

    <span style="color: grey; font-size: 200%;">&bull;</span> RIDRETH1 - Race/Ethnicity</br>

    | Code or Value | Value Description                  |
    | ------------- | ---------------------------------- |
    | 1             | Mexican American                   |
    | 2             | Other Hispanic                     |
    | 3             | Non-Hispanic White                 |
    | 4             | Non-Hispanic Black                 |
    | 5             | Other Race - Including Multi-Racial|
    | ...           | Missing                            |

</br>

- **Examination Data: Body Measures**

    <span style="color: grey; font-size: 200%;">&bull;</span> SEQN - Respondent sequence number</br>
    <span style="color: grey; font-size: 200%;">&bull;</span> BMXBMI - Body Mass Index (kg/m**2)</br>

    | Code or Value | Value Description |
    | ------------- | ----------------- |
    | 11.9 to 92.3  | Range of Values   |
    | ...           | Missing           |

</br>

- **Questionnaire Data: Diabetes**

    <span style="color: grey; font-size: 200%;">&bull;</span> SEQN - Respondent sequence number</br>
    <span style="color: grey; font-size: 200%;">&bull;</span> DIQ010 - Doctor told you have diabetes</br>

    | Code or Value | Value Description |
    | ------------- | ----------------- |
    | 1             | Yes               |
    | 2             | No                |
    | 3             | Borderline        |
    | 7             | Refused           |
    | 9             | Don't know        |
    | ...           | Missing           |

</br>

- **Questionnaire Data: Physical Activity**

    <span style="color: grey; font-size: 200%;">&bull;</span> SEQN - Respondent sequence number</br>
    <span style="color: grey; font-size: 200%;">&bull;</span> PAQ605 - Vigorous work activity</br>

    | Code or Value | Value Description |
    | ------------- | ----------------- |
    | 1             | Yes               |
    | 2             | No                |
    | 7             | Refused           |
    | 9             | Don't know        |
    | ...           | Missing           |

</br>

- **Questionnaire Data: Smoking-Cigarette Use**

    <span style="color: grey; font-size: 200%;">&bull;</span> SEQN - Respondent sequence number</br>
    <span style="color: grey; font-size: 200%;">&bull;</span> SMQ020 - Smoked at least 100 cigarettes in life</br>

    | Code or Value | Value Description |
    | ------------- | ----------------- |
    | 1             | Yes               |
    | 2             | No                |
    | 7             | Refused           |
    | 9             | Don't know        |
    | ...           | Missing           |

    <span style="color: grey; font-size: 200%;">&bull;</span> SMQ040 - Do you now smoke cigarettes?</br>

    | Code or Value | Value Description |
    | ------------- | ----------------- |
    | 1             | Every day          |
    | 2             | Some days          |
    | 3             | Not at all         |
    | 7             | Refused            |
    | 9             | Don't know         |
    | ...           | Missing            |

</br>

- **Questionnaire Data: Income**

    <span style="color: grey; font-size: 200%;">&bull;</span> SEQN - Respondent sequence number</br>
    <span style="color: grey; font-size: 200%;">&bull;</span> INDFMMPC - Family monthly poverty level category</br>

    | Code or Value | Value Description                          |
    | ------------- | ------------------------------------------ |
    | 1             | Monthly poverty level index ≤ 1.30         |
    | 2             | 1.30 < Monthly poverty level index ≤ 1.85  |
    | 3             | Monthly poverty level index > 1.85         |
    | 7             | Refused                                    |
    | 9             | Don't know                                 |
    | ...           | Missing                                    |

</br>

- **Questionnaire Data: Alcohol Use**

    <span style="color: grey; font-size: 200%;">&bull;</span> SEQN - Respondent sequence number</br>
    <span style="color: grey; font-size: 200%;">&bull;</span> ALQ111 - Ever had a drink of any kind of alcohol</br>

    | Code or Value | Value Description |
    | ------------- | ----------------- |
    | 1             | Yes               |
    | 2             | No                |
    | 7             | Refused           |
    | 9             | Don't know        |
    | ...           | Missing           |

    <span style="color: grey; font-size: 200%;">&bull;</span> ALQ121 - Past 12 mo how often drink alcoholic bev</br>

    | Code or Value | Value Description             |
    | ------------- | ----------------------------- |
    | 0             | Never in the last year        |
    | 1             | Every day                     |
    | 2             | Nearly every day              |
    | 3             | 3 to 4 times a week           |
    | 4             | 2 times a week                |
    | 5             | Once a week                   |
    | 6             | 2 to 3 times a month          |
    | 7             | Once a month                  |
    | 8             | 7 to 11 times in the last year |
    | 9             | 3 to 6 times in the last year  |
    | 10            | 1 to 2 times in the last year  |
    | 77            | Refused                       |
    | 99            | Don't know                    |
    | ...           | Missing                       |

</br>

- **Questionnaire Data: Cardiovascular Health**

    <span style="color: grey; font-size: 200%;">&bull;</span> SEQN - Respondent sequence number</br>
    <span style="color: red; font-size: 200%;">&bull;</span> CDQ001 - SP ever had pain or discomfort in chest</br>

    | Code or Value | Value Description |
    | ------------- | ----------------- |
    | 1             | Yes               |
    | 2             | No                |
    | 7             | Refused           |
    | 9             | Don't know        |
    | ...           | Missing           |

    <span style="color: red; font-size: 200%;">&bull;</span> CDQ010 - Shortness of breath on stairs/inclines</br>

    | Code or Value | Value Description |
    | ------------- | ----------------- |
    | 1             | Yes               |
    | 2             | No                |
    | 7             | Refused           |
    | 9             | Don't know        |
    | ...           | Missing           |
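
The code tables above can also be kept in the notebook as dictionaries, so that numeric codes can be turned back into labels when inspecting the data. The following is a minimal sketch using two of the tables; the names `VALUE_LABELS` and `decode_column` are hypothetical, and the demo rows are made up.

```python
import pandas as pd

# Code-to-label maps copied from the tables above (subset shown).
VALUE_LABELS = {
    "RIAGENDR": {1: "Male", 2: "Female"},
    "CDQ001": {1: "Yes", 2: "No", 7: "Refused", 9: "Don't know"},
}


def decode_column(series, labels):
    """Map numeric codes to labels; codes without a label become None."""
    # dict.get treats 1.0 and 1 as the same key, so float codes read from XPT files work.
    return series.map(lambda code: labels.get(code))


demo = pd.DataFrame({"RIAGENDR": [1.0, 2.0, 2.0], "CDQ001": [1.0, 9.0, 2.0]})
decoded = pd.DataFrame(
    {column: decode_column(demo[column], VALUE_LABELS[column]) for column in demo.columns}
)
print(decoded)
```

Keeping the numeric codes in the modeling data and decoding only for display avoids changing the inputs of the later steps.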

First of all, create a dictionary that lists the important features of each dataset.

In [155]:
features_dict = {
    "Demographic Variables and Sample Weights" :
        ["SEQN",
         "RIAGENDR",
         "RIDAGEYR",
         "RIDRETH1"],
    "Body Measures" :
        ["SEQN",
         "BMXBMI"],
    "Diabetes" :
        ["SEQN",
         "DIQ010"],
    "Physical Activity" :
        ["SEQN",
         "PAQ605"],
    "Smoking - Cigarette Use" :
        ["SEQN",
         "SMQ020",
         "SMQ040"],
    "Income" :
        ["SEQN",
         "INDFMMPC"],
    "Alcohol Use" :
        ["SEQN",
         "ALQ111",
         "ALQ121"],
    "Cardiovascular Health" :
        ["SEQN",
         "CDQ001",
         "CDQ010"]
        }

Before removing the irrelevant features, reorder the datasets so that they follow the order of the keys in the features dictionary.

In [156]:
# Create a new dictionary whose keys follow the order of the keys in features_dict
ordered_dataset_dict = {}

# Iterate through the keys in features_dict
for key in features_dict.keys():

    # Add the key-value pair to the ordered_dataset_dict
    ordered_dataset_dict[key] = dataset_dict[key]

dataset_dict = ordered_dataset_dict

Now, we can delete the irrelevant features.

In [157]:
# Iterate through the keys in the features_dict (each key represents a dataset).
for key in features_dict.keys():
    
    # Get the DataFrame associated with the current dataset.
    df = dataset_dict[key]

    # Get the list of relevant features for this dataset from the features_dict.
    relevant_features = features_dict[key]

    # Create a new DataFrame containing only the relevant features.
    new_df = df[relevant_features]

    # Update the dataset_dict with the new DataFrame, containing only relevant features.
    dataset_dict[key] = new_df

[Project Content](#0)

## 5. Monitor the Datasets and Merge them into a Final Object <a id = 5></a>

Now, check the percentage of non-missing values for each feature in each dataset.

In [158]:
# Iterate through the keys in dataset_dict, in the order set by features_dict.
for key in dataset_dict.keys():
    
    # Get the DataFrame associated with the current key.
    df = dataset_dict[key]    
    
    # Print the dataset key.
    print(key)
    print()
    
    # Calculate the proportion of non-missing values for each feature in the DataFrame.
    # This is done by dividing the count of non-null values by the total number of rows and multiplying by 100.
    proportions = round(((df.count() / len(df)) * 100), 2)
    
    # Print the proportions of non-missing values for each feature in the DataFrame.
    print(proportions)

    print()
    
    # Check if it's the last key, and if not, print the line of '-' characters.
    if key != list(dataset_dict.keys())[-1]:
        print("-" * 40)

Demographic Variables and Sample Weights

SEQN        100.0
RIAGENDR    100.0
RIDAGEYR    100.0
RIDRETH1    100.0
dtype: float64

----------------------------------------
Body Measures

SEQN      100.00
BMXBMI     91.87
dtype: float64

----------------------------------------
Diabetes

SEQN      100.0
DIQ010    100.0
dtype: float64

----------------------------------------
Physical Activity

SEQN      100.0
PAQ605    100.0
dtype: float64

----------------------------------------
Smoking - Cigarette Use

SEQN      100.00
SMQ020     87.03
SMQ040     34.92
dtype: float64

----------------------------------------
Income

SEQN        100.00
INDFMMPC     91.63
dtype: float64

----------------------------------------
Alcohol Use

SEQN      100.00
ALQ111     93.36
ALQ121     83.69
dtype: float64

----------------------------------------
Cardiovascular Health

SEQN      100.0
CDQ001    100.0
CDQ010    100.0
dtype: float64



Inner-join the datasets on the SEQN key, so that only respondents present in every dataset are kept.

In [159]:
# Start from a DataFrame that holds only the 'SEQN' column of the first dataset
merged_df = pd.DataFrame({"SEQN": dataset_dict[list(dataset_dict.keys())[0]]["SEQN"]})

# Merge every DataFrame in dataset_dict into merged_df on the 'SEQN' column using an inner join.
for df in dataset_dict.values():
    merged_df = pd.merge(merged_df, df, on="SEQN", how="inner")

print("Dataset has the shape of:", merged_df.shape)

merged_df.head(5)

Dataset has the shape of: (5949, 14)


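|   | SEQN | RIAGENDR | RIDAGEYR | RIDRETH1 | BMXBMI | DIQ010 | PAQ605 | SMQ020 | SMQ040 | INDFMMPC | ALQ111 | ALQ121 | CDQ001 | CDQ010 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 109271.0 | 1.0 | 49.0 | 3.0 | 29.7 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 5.397605e-79 | 1.0 | 1.0 |
| 1 | 109274.0 | 1.0 | 68.0 | 5.0 | 30.2 | 1.0 | 1.0 | 2.0 | NaN | 1.0 | 1.0 | 4.0 | 2.0 | 2.0 |
| 2 | 109282.0 | 1.0 | 76.0 | 3.0 | 26.6 | 2.0 | 2.0 | 1.0 | 3.0 | 3.0 | 1.0 | 5.397605e-79 | 1.0 | 1.0 |
| 3 | 109284.0 | 2.0 | 44.0 | 1.0 | 39.1 | 2.0 | 2.0 | 2.0 | NaN | 7.0 | 2.0 | NaN | 2.0 | 2.0 |
| 4 | 109290.0 | 2.0 | 68.0 | 4.0 | 28.1 | 1.0 | 2.0 | 2.0 | NaN | 3.0 | 1.0 | 5.397605e-79 | 2.0 | 2.0 |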
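
An inner join keeps only the SEQN values found in every dataset, which is why the merged table has fewer rows than any single input. The same chain of merges can be written with `functools.reduce`, and `validate="one_to_one"` makes pandas raise an error if SEQN repeats within a dataset. This is a sketch on toy frames; `merge_on_seqn` is a hypothetical name, and the uniqueness of SEQN in the real files is an assumption the check would confirm or refute.

```python
from functools import reduce

import pandas as pd


def merge_on_seqn(frames):
    """Inner-join DataFrames on SEQN, raising if SEQN repeats within any frame."""
    return reduce(
        lambda left, right: pd.merge(
            left, right, on="SEQN", how="inner", validate="one_to_one"
        ),
        frames,
    )


# Toy frames: only respondents 2 and 3 appear in all three.
demo_frames = [
    pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [49, 68, 76]}),
    pd.DataFrame({"SEQN": [2, 3, 4], "BMXBMI": [30.2, 26.6, 39.1]}),
    pd.DataFrame({"SEQN": [2, 3], "CDQ001": [2, 1]}),
]
demo_merged = merge_on_seqn(demo_frames)
print(demo_merged)
```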


[Project Content](#0)

## 6. Prepare the Dataframe in a Superficial Way <a id = 6></a>

Rename the dataset's features to make the table more readable.</br>

In [160]:
new_features_names = {
    "SEQN" : "ID",
    "RIAGENDR" : "Gender",
    "RIDAGEYR" : "Age",
    "RIDRETH1" : "Race",
    "BMXBMI" : "BMI",
    "DIQ010" : "Have Diabetes",
    "PAQ605" : "Vigorous Work Activity",
    "SMQ020" : "100 Cigarettes in Life Experience",
    "SMQ040" : "Smoke Cigarettes",
    "INDFMMPC" : "Family Poverty Level",
    "ALQ111" : "Alcohol Drink Experience",
    "ALQ121" : "Past 12 Months Alcohol Drink",
    "CDQ001" : "Chest Pain",
    "CDQ010" : "Shortness of Breath"
}

final_df = merged_df.rename(columns=new_features_names)
final_df.drop("ID", axis=1, inplace=True)

final_df.head(5)

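|   | Gender | Age | Race | BMI | Have Diabetes | Vigorous Work Activity | 100 Cigarettes in Life Experience | Smoke Cigarettes | Family Poverty Level | Alcohol Drink Experience | Past 12 Months Alcohol Drink | Chest Pain | Shortness of Breath |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 49.0 | 3.0 | 29.7 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 5.397605e-79 | 1.0 | 1.0 |
| 1 | 1.0 | 68.0 | 5.0 | 30.2 | 1.0 | 1.0 | 2.0 | NaN | 1.0 | 1.0 | 4.0 | 2.0 | 2.0 |
| 2 | 1.0 | 76.0 | 3.0 | 26.6 | 2.0 | 2.0 | 1.0 | 3.0 | 3.0 | 1.0 | 5.397605e-79 | 1.0 | 1.0 |
| 3 | 2.0 | 44.0 | 1.0 | 39.1 | 2.0 | 2.0 | 2.0 | NaN | 7.0 | 2.0 | NaN | 2.0 | 2.0 |
| 4 | 2.0 | 68.0 | 4.0 | 28.1 | 1.0 | 2.0 | 2.0 | NaN | 3.0 | 1.0 | 5.397605e-79 | 2.0 | 2.0 |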


Reading the XPT files appears to leave an artifact: zeros show up as tiny values such as 5.397605e-79.</br>
Round the data to one decimal place so that these values become 0.0.

In [161]:
final_df = final_df.round(1)

final_df.head(5)

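|   | Gender | Age | Race | BMI | Have Diabetes | Vigorous Work Activity | 100 Cigarettes in Life Experience | Smoke Cigarettes | Family Poverty Level | Alcohol Drink Experience | Past 12 Months Alcohol Drink | Chest Pain | Shortness of Breath |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 49.0 | 3.0 | 29.7 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 1 | 1.0 | 68.0 | 5.0 | 30.2 | 1.0 | 1.0 | 2.0 | NaN | 1.0 | 1.0 | 4.0 | 2.0 | 2.0 |
| 2 | 1.0 | 76.0 | 3.0 | 26.6 | 2.0 | 2.0 | 1.0 | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 3 | 2.0 | 44.0 | 1.0 | 39.1 | 2.0 | 2.0 | 2.0 | NaN | 7.0 | 2.0 | NaN | 2.0 | 2.0 |
| 4 | 2.0 | 68.0 | 4.0 | 28.1 | 1.0 | 2.0 | 2.0 | NaN | 3.0 | 1.0 | 0.0 | 2.0 | 2.0 |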
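
Rounding the whole table also rounds every other numeric column. A narrower alternative, sketched here under the assumption that only near-zero artifacts need fixing, sets values below a small tolerance to zero and leaves everything else untouched. The helper name `zero_tiny_values` and the tolerance of `1e-12` are hypothetical choices.

```python
import numpy as np
import pandas as pd


def zero_tiny_values(frame, tolerance=1e-12):
    """Return a copy where entries with |value| below the tolerance are set to 0."""
    # NaN compares as False, so missing values are left as they are.
    return frame.mask(frame.abs() < tolerance, 0.0)


demo = pd.DataFrame({
    "Past 12 Months Alcohol Drink": [5.397605e-79, 4.0, np.nan],
    "BMI": [29.74, 30.2, 26.6],
})
cleaned = zero_tiny_values(demo)
print(cleaned)
```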


Save the final dataset.

In [162]:
final_df.to_csv("data.csv", index=False)

Before going further, remove the tmp directory.

In [163]:
shutil.rmtree(os.path.join(current_directory, "tmp"))

[Project Content](#0)

***

# First Step: Data Preprocessing

***

## 7. Load the Dataset from This Working Directory <a id = 7></a>

In [164]:
df = pd.read_csv("data.csv")

df.head(5)

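|   | Gender | Age | Race | BMI | Have Diabetes | Vigorous Work Activity | 100 Cigarettes in Life Experience | Smoke Cigarettes | Family Poverty Level | Alcohol Drink Experience | Past 12 Months Alcohol Drink | Chest Pain | Shortness of Breath |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 49.0 | 3.0 | 29.7 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 1 | 1.0 | 68.0 | 5.0 | 30.2 | 1.0 | 1.0 | 2.0 | NaN | 1.0 | 1.0 | 4.0 | 2.0 | 2.0 |
| 2 | 1.0 | 76.0 | 3.0 | 26.6 | 2.0 | 2.0 | 1.0 | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 3 | 2.0 | 44.0 | 1.0 | 39.1 | 2.0 | 2.0 | 2.0 | NaN | 7.0 | 2.0 | NaN | 2.0 | 2.0 |
| 4 | 2.0 | 68.0 | 4.0 | 28.1 | 1.0 | 2.0 | 2.0 | NaN | 3.0 | 1.0 | 0.0 | 2.0 | 2.0 |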


## 8. Deal with the Missing Values and Impute the Features <a id = 8></a>

Check the percentage of non-missing values for each feature.

In [165]:
# Calculate the non-missing values percentage for each feature
non_missing_percentage = 100 - (df.isnull().mean() * 100).round(1)

# Create a new DataFrame containing features and their non-missing values percentage
features_info = pd.DataFrame({
    "Feature": non_missing_percentage.index,
    "Non-Missing Percentage": non_missing_percentage.values
})

# Sort the DataFrame in decreasing order based on "Non-Missing Percentage"
features_info = features_info.sort_values(by="Non-Missing Percentage", ascending=False)

features_info

|   | Feature | Non-Missing Percentage |
| --- | --- | --- |
| 0 | Gender | 100.0 |
| 1 | Age | 100.0 |
| 2 | Race | 100.0 |
| 4 | Have Diabetes | 100.0 |
| 5 | Vigorous Work Activity | 100.0 |
| 6 | 100 Cigarettes in Life Experience | 100.0 |
| 11 | Chest Pain | 100.0 |
| 12 | Shortness of Breath | 100.0 |
| 3 | BMI | 97.7 |
| 9 | Alcohol Drink Experience | 93.5 |
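
The same report is produced several times in this chapter, so it could be wrapped in a small function. The following is a sketch of one way to do that; `non_missing_report` is a hypothetical name and the demo values are made up.

```python
import numpy as np
import pandas as pd


def non_missing_report(frame):
    """Return each column's percentage of non-missing values, highest first."""
    percentage = (frame.notna().mean() * 100).round(1)
    report = percentage.rename_axis("Feature").reset_index(name="Non-Missing Percentage")
    # A stable sort keeps the original column order among ties.
    return report.sort_values(by="Non-Missing Percentage", ascending=False, kind="stable")


demo = pd.DataFrame({
    "Age": [49.0, 68.0, 76.0, 44.0],
    "BMI": [29.7, np.nan, 26.6, 39.1],
    "Smoke Cigarettes": [1.0, np.nan, 3.0, np.nan],
})
demo_report = non_missing_report(demo)
print(demo_report)
```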


Some missing values are not stored as NaN but as special codes, such as "Refused" or "Don't know".</br>
They look different from ordinary missing values, but they will be handled in the same way.</br>
Checking the unique values of each categorical feature helps to find these codes.

In [166]:
categorical_features = ["Gender", "Race", "Have Diabetes",
                        "Vigorous Work Activity", "100 Cigarettes in Life Experience",
                        "Smoke Cigarettes", "Family Poverty Level", "Alcohol Drink Experience",
                        "Past 12 Months Alcohol Drink", "Chest Pain", "Shortness of Breath"]

numeric_features = ["Age", "BMI"]

In [167]:
# Create an empty DataFrame to store the results
summary_table = pd.DataFrame(columns=["Feature", "Unique_Values", "Percentage"])
    
# Loop through each specified feature
for feature in categorical_features:
    
    # Extract unique values and their counts for the current feature
    value_counts = (df[feature].value_counts()).sort_index()
        
    # Calculate percentages
    percentages = round((value_counts / len(df)) * 100, 2)
        
    # Append information to the DataFrame
    summary_table = pd.concat([summary_table, pd.DataFrame({
        "Feature": [feature] * len(value_counts),
        "Unique_Values": value_counts.index,
        "Percentage": percentages.values
    })], ignore_index=True)

# Remove duplicate feature names, leaving only one occurrence
summary_table["Feature"] = summary_table["Feature"].where(~summary_table["Feature"].duplicated(), "")

summary_table

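|   | Feature | Unique_Values | Percentage |
| --- | --- | --- | --- |
| 0 | Gender | 1.0 | 49.4 |
| 1 |  | 2.0 | 50.6 |
| 2 | Race | 1.0 | 10.3 |
| 3 |  | 2.0 | 10.17 |
| 4 |  | 3.0 | 36.46 |
| 5 |  | 4.0 | 27.06 |
| 6 |  | 5.0 | 16.0 |
| 7 | Have Diabetes | 1.0 | 21.21 |
| 8 |  | 2.0 | 75.05 |
| 9 |  | 3.0 | 3.68 |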
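
This summary is also rebuilt several times, so it is a candidate for a helper. The sketch below collects the per-feature pieces in a list and concatenates them once, which avoids growing a DataFrame inside the loop. The name `categorical_summary` is hypothetical, and unlike the cell above it repeats the feature name on every row.

```python
import pandas as pd


def categorical_summary(frame, features):
    """Tabulate each feature's observed values and their share of all rows (in %)."""
    parts = []
    for feature in features:
        counts = frame[feature].value_counts().sort_index()
        parts.append(pd.DataFrame({
            "Feature": feature,
            "Unique_Values": counts.index,
            "Percentage": (counts / len(frame) * 100).round(2).to_numpy(),
        }))
    return pd.concat(parts, ignore_index=True)


# Toy data: the value 9.0 stands for a "Don't know" code.
demo = pd.DataFrame({
    "Gender": [1.0, 2.0, 2.0, 1.0],
    "Chest Pain": [1.0, 2.0, 2.0, 9.0],
})
demo_summary = categorical_summary(demo, ["Gender", "Chest Pain"])
print(demo_summary)
```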


Based on the NHANES documentation, treat the following codes as missing and convert them to NaN.</br>
The affected features and their codes are:

- Have Diabetes:
  - Borderline: 3
  - Don't know: 9

- Vigorous Work Activity:
  - Refused: 7
  - Don't know: 9

- 100 Cigarettes in Life Experience:
  - Refused: 7
  - Don't know: 9

- Family Poverty Level:
  - Refused: 7
  - Don't know: 9

- Past 12 Months Alcohol Drink:
  - Refused: 77
  - Don't know: 99

- Chest Pain:
  - Don't know: 9

- Shortness of Breath:
  - Don't know: 9

Replace these values with NaNs.

In [168]:
# List of features and their corresponding values to replace with NaN
features_to_replace = {
    "Have Diabetes": [3, 9],
    "Vigorous Work Activity": [7, 9],
    "100 Cigarettes in Life Experience": [7, 9],
    "Family Poverty Level": [7, 9],
    "Past 12 Months Alcohol Drink": [77, 99],
    "Chest Pain": [9],
    "Shortness of Breath": [9],
}

# Replace specified values with NaN for each feature
for feature, values_to_replace in features_to_replace.items():
    df[feature] = df[feature].replace(values_to_replace, np.nan)
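
The replacement has to stay scoped to each column, because the same number can be a missing-value code in one feature and a valid answer in another: 9 means "Don't know" for Have Diabetes but "3 to 6 times in the last year" for Past 12 Months Alcohol Drink. The sketch below shows the same per-column replacement done in a single `DataFrame.replace` call with a nested dictionary; the toy frame and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Same structure as features_to_replace, shown for two features only.
codes_to_blank = {
    "Have Diabetes": [3, 9],
    "Past 12 Months Alcohol Drink": [77, 99],
}

demo = pd.DataFrame({
    "Have Diabetes": [1.0, 3.0, 2.0, 9.0],
    "Past 12 Months Alcohol Drink": [0.0, 77.0, 9.0, 99.0],
})

# Expand {column: [codes]} into {column: {code: NaN}} so one call handles all columns.
nested = {
    column: {code: np.nan for code in codes}
    for column, codes in codes_to_blank.items()
}
demo_clean = demo.replace(nested)
print(demo_clean)
```

The 9.0 in the alcohol column survives, while the 9.0 in the diabetes column becomes NaN.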

Now, check the unique values for each feature again.

In [169]:
# Create an empty DataFrame to store the results
summary_table = pd.DataFrame(columns=["Feature", "Unique_Values", "Percentage"])
    
# Loop through each specified feature
for feature in categorical_features:
    
    # Extract unique values and their counts for the current feature
    value_counts = (df[feature].value_counts()).sort_index()
        
    # Calculate percentages
    percentages = round((value_counts / len(df)) * 100, 2)
        
    # Append information to the DataFrame
    summary_table = pd.concat([summary_table, pd.DataFrame({
        "Feature": [feature] * len(value_counts),
        "Unique_Values": value_counts.index,
        "Percentage": percentages.values
    })], ignore_index=True)

# Remove duplicate feature names, leaving only one occurrence
summary_table["Feature"] = summary_table["Feature"].where(~summary_table["Feature"].duplicated(), "")

summary_table

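|   | Feature | Unique_Values | Percentage |
| --- | --- | --- | --- |
| 0 | Gender | 1.0 | 49.4 |
| 1 |  | 2.0 | 50.6 |
| 2 | Race | 1.0 | 10.3 |
| 3 |  | 2.0 | 10.17 |
| 4 |  | 3.0 | 36.46 |
| 5 |  | 4.0 | 27.06 |
| 6 |  | 5.0 | 16.0 |
| 7 | Have Diabetes | 1.0 | 21.21 |
| 8 |  | 2.0 | 75.05 |
| 9 | Vigorous Work Activity | 1.0 | 21.33 |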


Now, check out the non-missing values percentage for each feature.

In [170]:
# Calculate the non-missing values percentage for each feature
non_missing_percentage = 100 - (df.isnull().mean() * 100).round(1)

# Create a new DataFrame containing features and their non-missing values percentage
features_info = pd.DataFrame({
    "Feature": non_missing_percentage.index,
    "Non-Missing Percentage": non_missing_percentage.values
})

# Sort the DataFrame in decreasing order based on "Non-Missing Percentage"
features_info = features_info.sort_values(by="Non-Missing Percentage", ascending=False)

features_info

|   | Feature | Non-Missing Percentage |
| --- | --- | --- |
| 0 | Gender | 100.0 |
| 1 | Age | 100.0 |
| 2 | Race | 100.0 |
| 5 | Vigorous Work Activity | 99.9 |
| 6 | 100 Cigarettes in Life Experience | 99.9 |
| 11 | Chest Pain | 99.9 |
| 12 | Shortness of Breath | 99.7 |
| 3 | BMI | 97.7 |
| 4 | Have Diabetes | 96.3 |
| 9 | Alcohol Drink Experience | 93.5 |


This looks good.</br>
Remove the records that have missing values in the target features, **Chest Pain** and **Shortness of Breath**.</br>
Then, check the non-missing values percentage again.

In [171]:
df = df.dropna(subset=["Chest Pain", "Shortness of Breath"])

In [172]:

# Calculate the non-missing values percentage for each feature
non_missing_percentage = 100 - (df.isnull().mean() * 100).round(1)

# Create a new DataFrame containing features and their non-missing values percentage
features_info = pd.DataFrame({
    "Feature": non_missing_percentage.index,
    "Non-Missing Percentage": non_missing_percentage.values
})

# Sort the DataFrame in decreasing order based on "Non-Missing Percentage"
features_info = features_info.sort_values(by="Non-Missing Percentage", ascending=False)

features_info

|   | Feature | Non-Missing Percentage |
| --- | --- | --- |
| 0 | Gender | 100.0 |
| 1 | Age | 100.0 |
| 2 | Race | 100.0 |
| 11 | Chest Pain | 100.0 |
| 12 | Shortness of Breath | 100.0 |
| 5 | Vigorous Work Activity | 99.9 |
| 6 | 100 Cigarettes in Life Experience | 99.9 |
| 3 | BMI | 97.7 |
| 4 | Have Diabetes | 96.3 |
| 9 | Alcohol Drink Experience | 93.5 |


Since some records still contain missing values, imputation is necessary.</br>
To avoid losing informative records, several widely used imputation methods will be considered.</br>
The method for each feature will be chosen based on that feature's characteristics.

Before starting the imputation process, splitting the dataset into train and test sets is essential to prevent data leakage.

In [173]:
# Select features (X) and target variables (y)
X = df.drop(["Chest Pain", "Shortness of Breath"], axis=1)
y_chest_pain = df["Chest Pain"]
y_shortness_of_breath = df["Shortness of Breath"]

# Perform the train-test split
# Adjust the test_size parameter based on the percentage of data you want to allocate to the test set
X_train, X_test, y_chest_pain_train, y_chest_pain_test, y_shortness_of_breath_train, y_shortness_of_breath_test = train_test_split(
    X, y_chest_pain, y_shortness_of_breath, test_size=0.2, random_state=123
)

Check the datasets' shapes.

In [174]:
print("Train Set Info")
print("-"*45)
print("X_train shape:", X_train.shape)
print("y_chest_pain_train shape:", y_chest_pain_train.shape)
print("y_shortness_of_breath_train shape:", y_shortness_of_breath_train.shape)

print("\nTest Set Info")
print("-"*45)
print("X_test shape:", X_test.shape)
print("y_chest_pain_test shape:", y_chest_pain_test.shape)
print("y_shortness_of_breath_test shape:", y_shortness_of_breath_test.shape)

Train Set Info
---------------------------------------------
X_train shape: (4740, 11)
y_chest_pain_train shape: (4740,)
y_shortness_of_breath_train shape: (4740,)

Test Set Info
---------------------------------------------
X_test shape: (1185, 11)
y_chest_pain_test shape: (1185,)
y_shortness_of_breath_test shape: (1185,)
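
The split above is purely random. If the two targets turn out to be imbalanced, a stratified split is one way to keep their proportions similar in the train and test sets. The sketch below stratifies on a label that combines both targets; the synthetic data, the class probabilities, and the variable names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real table (values are made up).
rng = np.random.default_rng(123)
n_rows = 400
demo = pd.DataFrame({
    "Age": rng.integers(20, 80, size=n_rows).astype(float),
    "Chest Pain": rng.choice([1.0, 2.0], size=n_rows, p=[0.3, 0.7]),
    "Shortness of Breath": rng.choice([1.0, 2.0], size=n_rows, p=[0.35, 0.65]),
})

# One label per row that encodes both targets, e.g. "1.0_2.0".
joint_label = demo["Chest Pain"].astype(str) + "_" + demo["Shortness of Breath"].astype(str)

demo_train, demo_test = train_test_split(
    demo, test_size=0.2, random_state=123, stratify=joint_label
)

print((demo_train["Chest Pain"] == 1.0).mean(), (demo_test["Chest Pain"] == 1.0).mean())
```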


Now the imputation process can start.</br>
The following methods were considered for specific features, based on their characteristics:

- BMI, a numerical feature with a relatively high non-missing percentage: SimpleImputer with mean imputation, which suits features with a symmetric distribution and no significant outliers.
- Family Poverty Level, a categorical feature: SimpleImputer with the "most_frequent" strategy.
- Age, a numerical feature with potentially complex relationships to other features: IterativeImputer, which can capture non-linear dependencies when paired with a non-linear estimator.
- Vigorous Work Activity and 100 Cigarettes in Life Experience, which may depend on similar records: KNNImputer, which fills a missing value from neighboring instances.

The cell below uses IterativeImputer throughout, with a RandomForestRegressor estimator for the numerical features (BMI and Age) and a RandomForestClassifier estimator for the categorical features.</br>
Note that the imputers are fit on the full DataFrame df rather than on X_train alone, so the split above has to be repeated, or the imputers refit on the training rows only, before modeling to keep the test set free of leakage.

In [175]:
# Categorical features to impute
imp_categorical_features = [
    "Vigorous Work Activity",
    "100 Cigarettes in Life Experience",
    "Family Poverty Level",
    "Have Diabetes",
    "Alcohol Drink Experience",
    "Smoke Cigarettes",
    "Past 12 Months Alcohol Drink"
    ]

# IterativeImputer for the numerical features.
# Each column is passed on its own, so the imputer has no other feature to
# model it from and the result is the initial mean fill.
num_iterative_imputer = IterativeImputer(
    estimator=RandomForestRegressor(),
    initial_strategy="mean",
    max_iter=10,
    random_state=123)

df["BMI"] = num_iterative_imputer.fit_transform(df[["BMI"]])
df["Age"] = num_iterative_imputer.fit_transform(df[["Age"]])

# IterativeImputer for categorical features with potential dependencies.
# The categorical features are imputed jointly, each one predicted from the others.
cat_iterative_imputer = IterativeImputer(
    estimator=RandomForestClassifier(),
    initial_strategy="most_frequent",
    max_iter=10,
    random_state=123)

df[imp_categorical_features] = cat_iterative_imputer.fit_transform(df[imp_categorical_features])
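
To keep the test set out of the imputation, the fill values can be learned from the training rows and then reused on the test rows. The sketch below illustrates that pattern with SimpleImputer (mean for BMI, most frequent value for Family Poverty Level), which is a simpler technique than the IterativeImputer used in the cell above. The toy frames and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train and test sets with gaps (values are made up).
train = pd.DataFrame({
    "BMI": [20.0, 30.0, np.nan, 40.0],
    "Family Poverty Level": [1.0, 3.0, 3.0, np.nan],
})
test = pd.DataFrame({
    "BMI": [np.nan, 25.0],
    "Family Poverty Level": [np.nan, 2.0],
})

mean_imputer = SimpleImputer(strategy="mean")
mode_imputer = SimpleImputer(strategy="most_frequent")

train_imputed = train.copy()
test_imputed = test.copy()

# Learn the fill values from the training rows only.
train_imputed[["BMI"]] = mean_imputer.fit_transform(train[["BMI"]])
train_imputed[["Family Poverty Level"]] = mode_imputer.fit_transform(
    train[["Family Poverty Level"]]
)

# Reuse the learned fill values, unchanged, on the test rows.
test_imputed[["BMI"]] = mean_imputer.transform(test[["BMI"]])
test_imputed[["Family Poverty Level"]] = mode_imputer.transform(
    test[["Family Poverty Level"]]
)

print(test_imputed)
```

The missing test BMI is filled with the training mean and the missing poverty level with the most frequent training value, so no statistic is computed from the test rows. The same fit-on-train, transform-on-test pattern applies to IterativeImputer and KNNImputer.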



In [176]:
# Calculate the non-missing values percentage for each feature
non_missing_percentage = 100 - (df.isnull().mean() * 100).round(1)

# Create a new DataFrame containing features and their non-missing values percentage
features_info = pd.DataFrame({
    "Feature": non_missing_percentage.index,
    "Non-Missing Percentage": non_missing_percentage.values
})

# Sort the DataFrame in decreasing order based on "Non-Missing Percentage"
features_info = features_info.sort_values(by="Non-Missing Percentage", ascending=False)

features_info

|   | Feature | Non-Missing Percentage |
| --- | --- | --- |
| 0 | Gender | 100.0 |
| 1 | Age | 100.0 |
| 2 | Race | 100.0 |
| 3 | BMI | 100.0 |
| 4 | Have Diabetes | 100.0 |
| 5 | Vigorous Work Activity | 100.0 |
| 6 | 100 Cigarettes in Life Experience | 100.0 |
| 7 | Smoke Cigarettes | 100.0 |
| 8 | Family Poverty Level | 100.0 |
| 9 | Alcohol Drink Experience | 100.0 |


In [177]:
# Create an empty DataFrame to store the results
summary_table = pd.DataFrame(columns=["Feature", "Unique_Values", "Percentage"])
    
# Loop through each specified feature
for feature in categorical_features:
    
    # Extract unique values and their counts for the current feature
    value_counts = (df[feature].value_counts()).sort_index()
        
    # Calculate percentages
    percentages = round((value_counts / len(df)) * 100, 2)
        
    # Append information to the DataFrame
    summary_table = pd.concat([summary_table, pd.DataFrame({
        "Feature": [feature] * len(value_counts),
        "Unique_Values": value_counts.index,
        "Percentage": percentages.values
    })], ignore_index=True)

# Remove duplicate feature names, leaving only one occurrence
summary_table["Feature"] = summary_table["Feature"].where(~summary_table["Feature"].duplicated(), "")

summary_table

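|   | Feature | Unique_Values | Percentage |
| --- | --- | --- | --- |
| 0 | Gender | 1.0 | 49.42 |
| 1 |  | 2.0 | 50.58 |
| 2 | Race | 1.0 | 10.3 |
| 3 |  | 2.0 | 10.16 |
| 4 |  | 3.0 | 36.46 |
| 5 |  | 4.0 | 27.12 |
| 6 |  | 5.0 | 15.97 |
| 7 | Have Diabetes | 1.0 | 21.25 |
| 8 |  | 2.0 | 78.75 |
| 9 | Vigorous Work Activity | 1.0 | 21.37 |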


The imputation is complete.</br>
To build a better understanding of the data, the next chapter carries out the exploratory data analysis (EDA).

[Project Content](#0)

***

# Second Step: Exploratory Data Analysis

***

text

[Project Content](#0)

***

# Third and Final Step: Modeling and Examination

***

text

[Project Content](#0)