### Leveraging Diverse Factors for Diabetes Risk Prediction Using Machine Learning Techniques
#### Neha Reddy Yenugu, Prathyusha Elipay, Anuridhi Gupta, Ashritha Gugire
#### Team No. 03
##### Dr. Lindi Liao
AIT 614 001 – Big Data Essentials

<i> George Mason University

<i> November 28, 2023

####Objectives:
The purpose of this project is to investigate and develop machine learning algorithms for the early diagnosis and segmentation of diabetes, that is, to identify those who are at higher risk of developing diabetes due to a unique mix of factors, with the goal of improving patient outcomes and quality of life. Moreover, early diabetes prediction can improve healthcare resource allocation. By identifying at-risk individuals and the factors that contribute to their risk, healthcare systems can strategically allocate resources such as staff, facilities, and funding to combat the growing diabetes epidemic. This strategy aids population-level planning for preventive and therapeutic measures, ensuring that the appropriate level of care and treatment is accessible when and where it is most needed. It can also lead to focused public health initiatives and policies aimed at lowering diabetes prevalence, resulting in healthier communities and reduced overall healthcare expenditures.

####Approach:
In this project we try to understand the important interactions between the variables and to analyze which combinations of attributes contribute most to diabetes, so that we can understand the complex interactions between the features and their effect on the risk of diabetes among people with various lifestyles and health conditions.

####Libraries
Create a new cluster with the 13.3 LTS ML runtime and attach the notebook to it.

Install the libraries using the pip commands below if necessary

In [0]:
#pip install matplotlib

In [0]:
#pip install pyspark

In [0]:
#pip install imbalanced-learn

In [0]:
#pip install pandas

In [0]:
#pip install mlflow

In [0]:
#pip install numpy

If you installed new libraries, restart the Python process with the command below and re-attach the notebook to the cluster.

In [0]:
#dbutils.library.restartPython()

Import the below libraries

In [0]:
import matplotlib.pyplot as plt
from pyspark.ml.feature import VectorAssembler, PolynomialExpansion, StringIndexer, Interaction
from pyspark.ml.stat import Correlation
from pyspark.sql.functions import col, avg, max,round
from pyspark.sql.types import IntegerType, DoubleType
import pandas as pd
from imblearn.over_sampling import SMOTE
# imblearn.pipeline.Pipeline is not imported here: pyspark.ml.Pipeline (imported further down) would shadow it
from pyspark.sql import functions as F
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
import numpy as np
from pyspark.ml.classification import LogisticRegression, NaiveBayes, RandomForestClassifier
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
import mlflow

####Load the diabetes_012_health_indicators_BRFSS2015.csv dataset [3]

In [0]:
display(dbutils.fs.ls('dbfs:/FileStore/tables/'))

In [0]:
original_dataset = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/diabetes_012_health_indicators_BRFSS2015.csv")

In [0]:
display(original_dataset.limit(15))
print(f"Shape: {original_dataset.count()}, {len(original_dataset.columns)}")

In [0]:
original_dataset.printSchema()

In [0]:
# Convert every column to INTEGER in a single projection
original_dataset = original_dataset.select(
    [col(column).cast(IntegerType()).alias(column) for column in original_dataset.columns]
)

original_dataset.printSchema()

In [0]:
display(original_dataset.describe())
original_dataset.select('HighBP').describe().show()

In [0]:
display(original_dataset)

The first plot covers roughly 10k instances: about 8k indicate no diabetes (healthy), about 1.6k indicate diabetes, and very few indicate prediabetes.

The summary above shows that the dataset has no missing values but is highly imbalanced, and the data profile suggests that a few columns contain outliers. Before resampling, let's explore the dataset.

In the data profile, BMI and the self-reported health conditions appear to contain outliers; the box plots above give a clearer picture.

####Exploratory Data Analysis

In [0]:
# Define the columns for correlation analysis: the target plus every feature
target_column_name = "Diabetes_012"
feature_columns = [c for c in original_dataset.columns if c != target_column_name]

# Assemble the target and the features into a single vector column (target first)
assembler = VectorAssembler(inputCols=[target_column_name] + feature_columns, outputCol="features")
assembled_df = assembler.transform(original_dataset)

# Compute the Pearson correlation matrix
correlation_matrix = Correlation.corr(assembled_df, "features").collect()[0][0]

# Row 0 holds the target's correlations; skip column 0 (the target with itself)
correlations_diabetes = correlation_matrix.toArray()[0, 1:]

# Create a pandas DataFrame for easy plotting
correlation_df = pd.DataFrame(list(zip(feature_columns, correlations_diabetes)), columns=["Feature", "Correlation"])

# Plot the bar chart
correlation_df.sort_values(by="Correlation").plot(kind='bar', x="Feature", y="Correlation", legend=False)
plt.title('Correlation with Diabetes')
plt.show()
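
As a rough illustration of what the correlation step computes, the following sketch calculates the Pearson correlation between each feature and the target in plain Python. The toy rows and the `pearson` helper are hypothetical and are not part of the notebook; Spark's `Correlation.corr` applies the same kind of calculation (Pearson by default) to the full dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy rows (assumption: not taken from the BRFSS data)
toy = {
    "Diabetes_012": [0, 0, 2, 2, 0, 2],
    "HighBP":       [0, 0, 1, 1, 1, 1],
    "PhysActivity": [1, 1, 0, 1, 1, 0],
}

target = toy["Diabetes_012"]
toy_correlations = {
    name: pearson(values, target)
    for name, values in toy.items()
    if name != "Diabetes_012"
}

# Print the features from most negative to most positive correlation
for name, r in sorted(toy_correlations.items(), key=lambda kv: kv[1]):
    print(f"{name}: {r:+.3f}")
```

A positive value means the feature tends to be higher when the target is higher, and a negative value means the opposite; neither says anything about causation.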

High blood pressure (HighBP) and high cholesterol (HighChol) are known to lead to heart disease, which is one of the major factors associated with diabetes. The plot below shows how many respondents with HighBP and HighChol report heart disease or a heart attack.

More than 5,000 people have heart problems together with HighBP and HighChol and also have diabetes (class 2).

The bar plot below shows how many respondents with diabetes have a history of heart disease.

It is grouped by CholCheck. Among those without HighBP but with some heart disease, 1,170 have not had a cholesterol check in the past five years and around 2k have. Among those with HighBP, more than 2k have not had a check and more than 5k have.

In [0]:
original_dataset_eda = original_dataset.filter(original_dataset.Diabetes_012 ==2).select(original_dataset.HighBP, 
                                                                                         original_dataset.HighChol, 
                                                                                         original_dataset.HeartDiseaseorAttack, 
                                                                                         original_dataset.Diabetes_012, 
                                                                                         original_dataset.CholCheck)

display(original_dataset_eda)

Let's see which age groups have heart disease together with HighBP and HighChol.

Age group codes: 1 = 18 to 24, 2 = 25 to 29, 3 = 30 to 34, 4 = 35 to 39, 5 = 40 to 44, 6 = 45 to 49, 7 = 50 to 54, 8 = 55 to 59, 9 = 60 to 64, 10 = 65 to 69, 11 = 70 to 74, 12 = 75 to 79, 13 = 80 or older.
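
Because `Age` stores these group codes rather than years, it can help to keep the mapping next to the filters. The sketch below is one possible way to do that in plain Python; `AGE_GROUPS` and `age_label` are hypothetical names introduced only for illustration.

```python
# Age group codes as listed above; code 13 is open-ended ("80 or older")
AGE_GROUPS = {
    1: "18 to 24", 2: "25 to 29", 3: "30 to 34", 4: "35 to 39",
    5: "40 to 44", 6: "45 to 49", 7: "50 to 54", 8: "55 to 59",
    9: "60 to 64", 10: "65 to 69", 11: "70 to 74", 12: "75 to 79",
    13: "80 or older",
}


def age_label(code):
    """Translate an Age group code (1-13) into its age band."""
    return AGE_GROUPS[code]


# Codes that correspond to respondents aged 60 or older
codes_60_plus = [code for code in AGE_GROUPS if code >= 9]

# A condition written in years, such as Age < 30, matches every code,
# because the largest code is 13
matches_every_code = all(code < 30 for code in AGE_GROUPS)

print(age_label(10), codes_60_plus, matches_every_code)
```

The last check is a reminder that a filter written in years does not restrict the data when the column holds codes.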

Age groups 9 to 13 clearly have high blood pressure; these are the people who have had a cholesterol check in the past 5 years and are diagnosed with heart disease. As age increases, heart disease or a heart attack also becomes more likely.

In [0]:
original_dataset_heartdisease = original_dataset.filter((original_dataset.HighBP ==1 )&(original_dataset.HighChol == 1)& 
                                                        (original_dataset.HeartDiseaseorAttack == 1)).select(
                                                            original_dataset.Age, original_dataset.HighBP,
                                                            original_dataset.HighChol, original_dataset.HeartDiseaseorAttack,
                                                            original_dataset.Diabetes_012)
display(original_dataset_heartdisease)

The same data, shown graphically, gives the diabetes diagnoses for age groups 9 to 12.

Age groups 9, 10, and 11 vary only slightly: more than 50% are healthy even when diagnosed with HighBP, HighChol, and heart disease, while more than 40% are diagnosed with diabetes.

Group 12 (ages 75 to 79) is the exception: only 35% have diabetes and the others are healthy.

This suggests that HighBP, HighChol, and heart disease have a significant effect on diabetes for people older than 60.

In [0]:
# Select by column name so that every column comes from the filtered DataFrame
original_dataset_dia = original_dataset_heartdisease.filter(original_dataset_heartdisease.Age.between(9, 12)).select(
    "Diabetes_012", "HighBP", "HighChol", "HeartDiseaseorAttack", "Age")

display(original_dataset_dia)

Now let's check their BMI, income, and physical activity.

The line plots below show that as age increases, BMI rises toward overweight while income and physical activity decline. Age group 10 stands out: it has the peak BMI and income and reports some physical activity in the past 30 days, yet is also diagnosed with diabetes.

In [0]:
original_dataset_BMI = original_dataset.filter((original_dataset.Diabetes_012 == 2)).select(
    original_dataset.HeartDiseaseorAttack,original_dataset.Age,original_dataset.HighBP, original_dataset.HighChol, 
    original_dataset.BMI, original_dataset.Income, original_dataset.PhysActivity)

display(original_dataset_BMI)

The plot below shows, by age group, the males who exercised in the past 30 days and are diagnosed with diabetes (class 2).

In [0]:
# Note: Age holds group codes 1-13, so the Age < 30 condition keeps every age group
original_dataset_BMI = original_dataset.filter((original_dataset.PhysActivity == 1) & (original_dataset.Age < 30) & (original_dataset.Sex == 1) &
                                               (original_dataset.Diabetes_012 == 2)).select(original_dataset.Sex, original_dataset.Diabetes_012,
                                                                                            original_dataset.Age, original_dataset.PhysActivity)

display(original_dataset_BMI) 

Now let us focus on the diet and alcohol consumption of women who have diabetes.

The pie charts below show that almost 60% across all age groups eat at least one fruit and one vegetable each day.

Considering alcohol consumption, about 90% of those who eat at least one fruit and one vegetable a day do not drink heavily, possibly because of their diabetes; very few report heavy alcohol consumption.

In [0]:
# Women (Sex == 0) with diabetes (class 2); Age < 13 excludes only group 13 (80 or older)
original_dataset_diet = original_dataset.filter((original_dataset.Diabetes_012 == 2) & (original_dataset.Sex == 0) &
                                                (original_dataset.Age < 13)).select(original_dataset.Sex, original_dataset.Diabetes_012,
                                                                                    original_dataset.Fruits, original_dataset.Veggies,
                                                                                    original_dataset.Age, original_dataset.HvyAlcoholConsump)

display(original_dataset_diet)

Surprisingly, the plot below shows that only 296 women in this subset report heavy alcohol consumption, and all of them eat at least one fruit and one vegetable each day. Because the subset is already restricted to respondents with diabetes, every one of them has diabetes by construction. We read this as a sign that heavy alcohol consumption (HvyAlcoholConsump) may be a major factor in diabetes despite a good diet.

In [0]:
# Select by column name so that every column comes from the filtered DataFrame
original_dataset_alco = original_dataset_diet.filter(original_dataset_diet.HvyAlcoholConsump == 1).select(
    "Fruits", "Veggies", "Diabetes_012", "HvyAlcoholConsump", "Age")

display(original_dataset_alco)

Now let us find out whether self-reported health conditions matter.

Let's look at self-reported measures such as mental health and physical health, along with a few more attributes.

The pie chart below shows that every age group reports difficulty walking, especially age groups 9 and 10, whose BMI is also above the healthy range.

In [0]:
# Note: Age holds group codes 1-13, so the Age < 30 condition keeps every age group
original_dataset_health = original_dataset.filter((original_dataset.BMI > 20) & (original_dataset.Age < 30)
                                                  & (original_dataset.DiffWalk == 1)).select(original_dataset.BMI, original_dataset.Age,
                                                                                             original_dataset.DiffWalk)

display(original_dataset_health)

Let's take smokers' physical and mental health into consideration.

PhysHlth: for how many of the past 30 days was your physical health not good? (scale: 1 to 30 days)

The plot below shows no significant relationship between smoking and health conditions. It does indicate that many smokers report good health, with reported health worsening toward the 30-day end of the scale, but we could not extract further information from this.

In [0]:
original_dataset_physical_health= original_dataset.filter(original_dataset.HighBP ==1).select(original_dataset.Smoker, 
                                                                                              original_dataset.PhysHlth, 
                                                                                              original_dataset.MentHlth)
display(original_dataset_physical_health)

Let's see whether personal information such as education and income has any effect on diabetes.

Income codes: 1 = less than $10,000, 2 = $10,000 to less than $15,000, 3 = $15,000 to less than $20,000, 4 = $20,000 to less than $25,000, 5 = $25,000 to less than $35,000, 6 = $35,000 to less than $50,000, 7 = $50,000 to less than $75,000, 8 = $75,000 or more.

In [0]:
original_dataset_eduinc = original_dataset.filter(original_dataset.Diabetes_012 == 2).select(original_dataset.Diabetes_012, 
                                                                                             original_dataset.Education, original_dataset.Income)

display(original_dataset_eduinc) 

Now let's look at the health conditions of those with high income.

In [0]:
original_dataset_income = original_dataset.filter(original_dataset.Income > 5).select(original_dataset.HighBP, original_dataset.HighChol, 
                                                                                      original_dataset.HeartDiseaseorAttack, original_dataset.HvyAlcoholConsump, original_dataset.GenHlth, 
                                                                                      original_dataset.MentHlth)

display(original_dataset_income)

The EDA above points to a few factors that play a significant role: HighBP, HighChol, Income, HvyAlcoholConsump, BMI, PhysActivity, and a few health conditions. In particular, females and males in age group 10 with HighBP, a cholesterol check in past years, and high income tend to be more obese (high BMI), have poorer GenHlth and MentHlth, and are the most likely to get diabetes.

####Data Cleaning and Preparation

In [0]:
#Drop duplicate tuples

df = original_dataset

df = df.dropDuplicates()
print(f"Shape after removing duplicate tuples: {df.count()}, {len(df.columns)}")

In [0]:
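#Check for null values
# Count nulls per column on the cluster instead of collecting every row to the driver
null_counts = df.select(
    [F.sum(F.col(column).isNull().cast("int")).alias(column) for column in df.columns]
).first()
null_check = sum(null_counts[column] for column in df.columns)

print(f"Count of NULL values in the Dataset: {null_check}")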

In [0]:
# Categorical Variables
categorical_variables = ['HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke',
                         'HeartDiseaseorAttack', 'PhysActivity', 'Fruits','Veggies', 'HvyAlcoholConsump',
                         'AnyHealthcare', 'NoDocbcCost', 'Sex', 'DiffWalk']

# Ordinal Variables
ordinal_variables = ['Age', 'GenHlth', 'Education', 'Income']

# Numerical Variables
numerical_variables = ['BMI', 'MentHlth', 'PhysHlth']

# Target Variable
target_variable = 'Diabetes_012'

In [0]:
display(df.groupBy(target_variable).count().orderBy(target_variable))

####Train-Test Splitting

In [0]:
trainDF, testDF = df.randomSplit([0.8, 0.2], seed=12)

print(f"Train Dataset Instances: {trainDF.cache().count()}")
print(f"Test Dataset Instances: {testDF.count()}")

display(trainDF.groupBy(target_variable).count().orderBy(target_variable))

In [0]:
display(trainDF.describe())

#####OverSampling with SMOTE

In [0]:
# Collect the training data to pandas once and reuse it for features and target
train_pd = trainDF.toPandas()
X = train_pd.drop(target_variable, axis=1)
y = train_pd[target_variable]

X_resampled, y_resampled = SMOTE(sampling_strategy= 'all', k_neighbors=7).fit_resample(X, y)

resampledDF = pd.DataFrame({target_variable: y_resampled, **dict(X_resampled)})
resampledDF = spark.createDataFrame(resampledDF)

display(resampledDF.groupBy(target_variable).count().orderBy(target_variable))

In [0]:
display(resampledDF.describe())
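
To make the oversampling step more concrete, here is a simplified sketch of the interpolation idea behind SMOTE: a synthetic row is placed at a random position on the line segment between a minority-class row and one of its nearest neighbours. This is an illustration only, not the imbalanced-learn implementation; the real algorithm first searches for the `k_neighbors` nearest minority rows (7 in the cell above) and picks one at random. The helper name `smote_like_sample` and the toy rows are hypothetical.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between `point` and `neighbor`."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]


rng = random.Random(0)

# Hypothetical minority-class rows with two features: [HighBP, BMI]
minority_row = [1, 30]
nearest_neighbor = [0, 34]

synthetic = smote_like_sample(minority_row, nearest_neighbor, rng)
print(synthetic)
```

One consequence worth keeping in mind is that binary or ordinal columns such as HighBP can take fractional values in the synthetic rows whenever the two source rows disagree, which is why the summary of the resampled data may show non-integer values.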

####Model Building

#####Model Evaluation Function

In [0]:
def evaluate_multiclass_model(model, test_data, target_column):
    """
    Evaluate a multiclass classification model using various metrics.

    Parameters:
    - model (PipelineModel): The trained multiclass classification model.
    - test_data (DataFrame): The test DataFrame.
    - target_column (str): Name of the target column.

    Returns:
    - evaluation_results (dict): A dictionary containing evaluation metrics.
    """

    # Make predictions on the test data
    predictions = model.transform(test_data)

    # Create a MulticlassClassificationEvaluator
    multiclass_evaluator = MulticlassClassificationEvaluator(labelCol=target_column, predictionCol="prediction")

    # Calculate evaluation metrics
    accuracy = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "accuracy"})
    precision = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "weightedPrecision"})
    recall = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "weightedRecall"})
    f1_score = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "f1"})
    true_positive_rate = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "weightedTruePositiveRate"})
    false_positive_rate = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "weightedFalsePositiveRate"})
    f_measure = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "weightedFMeasure"})
    log_loss = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "logLoss"})
    hamming_loss = multiclass_evaluator.evaluate(predictions, {multiclass_evaluator.metricName: "hammingLoss"})
    
    # Create a confusion matrix
    confusion_matrix = predictions.groupBy(target_column, "prediction").count()

    # Convert confusion matrix to a Pandas DataFrame for better display
    confusion_matrix_pd = confusion_matrix.toPandas()

    # Calculate percentages for each class
    confusion_matrix_pd['percentage'] = confusion_matrix_pd.groupby(target_column)['count'].transform(lambda x: (x / x.sum() * 100).round(4))

    # Create a dictionary to store evaluation results
    evaluation_results = {
        "Accuracy": f"{accuracy:.4f}",
        "Weighted Precision": f"{precision:.4f}",
        "Weighted Recall": f"{recall:.4f}",
        "F1-Score": f"{f1_score:.4f}",
        "Weighted True Positive Rate": f"{true_positive_rate:.4f}",
        "Weighted False Positive Rate": f"{false_positive_rate:.4f}",
        "Weighted F-Measure": f"{f_measure:.4f}",
        "Log Loss": f"{log_loss:.4f}",
        "Hamming Loss": f"{hamming_loss:.4f}",
        "Confusion Matrix": confusion_matrix_pd
    }

    return evaluation_results
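
The weighted metrics returned by this function can be hard to interpret, so the following plain-Python sketch shows how weighted precision and weighted recall are typically derived from (label, prediction) pairs: each class's score is weighted by that class's share of the true labels. The helper and the toy counts are hypothetical, and treating the precision of a never-predicted class as 0 is an assumption of this sketch.

```python
from collections import defaultdict


def weighted_precision_recall(pairs):
    """pairs: iterable of (label, prediction). Returns (weighted precision, weighted recall)."""
    pairs = list(pairs)
    total = len(pairs)
    label_count = defaultdict(int)
    pred_count = defaultdict(int)
    correct = defaultdict(int)
    for label, pred in pairs:
        label_count[label] += 1
        pred_count[pred] += 1
        if label == pred:
            correct[label] += 1

    w_precision = 0.0
    w_recall = 0.0
    for label, n in label_count.items():
        weight = n / total  # share of this class among the true labels
        # Assumption: a class that is never predicted gets precision 0
        precision = correct[label] / pred_count[label] if pred_count[label] else 0.0
        recall = correct[label] / n
        w_precision += weight * precision
        w_recall += weight * recall
    return w_precision, w_recall


# Hypothetical predictions for 20 rows: class 1 is never predicted
toy_pairs = (
    [(0, 0)] * 12 + [(0, 2)] * 2 +
    [(1, 0)] * 2 +
    [(2, 0)] * 1 + [(2, 2)] * 3
)
wp, wr = weighted_precision_recall(toy_pairs)
toy_accuracy = sum(1 for label, pred in toy_pairs if label == pred) / len(toy_pairs)
print(wp, wr, toy_accuracy)
```

Note that weighted recall always equals accuracy, since the class weights cancel the per-class denominators. A model can therefore score well on these weighted metrics while never predicting the minority class, which is why the confusion matrix is reported alongside them.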

#####Logistic Regression

In [0]:
feature_columns = [col for col in trainDF.columns if col != target_variable]

# Create a vector assembler to assemble features into a single vector column
vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

# Create a Logistic Regression model
logistic_regression = LogisticRegression(labelCol=target_variable, featuresCol="features", maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Create a pipeline with the vector assembler and logistic regression stages
pipeline = Pipeline(stages=[vector_assembler, logistic_regression])

# Train the model
lm_m1 = pipeline.fit(resampledDF)

# Make predictions on the test set
lm_m1_eval = evaluate_multiclass_model(lm_m1, testDF, target_variable)
    
for key, value in lm_m1_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")
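
The `regParam` and `elasticNetParam` arguments used above control an elastic-net penalty that is added to the training loss. As documented for Spark ML, `regParam` plays the role of λ and `elasticNetParam` the role of α in α·λ·‖w‖₁ + (1 − α)·(λ/2)·‖w‖₂². The sketch below evaluates that penalty for a hypothetical coefficient vector; the function name and the weights are illustrative only.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared)


# Hypothetical coefficients (assumption: not taken from the fitted model)
example_weights = [0.5, -1.0, 0.0]

# Settings used for the first logistic regression model above
penalty = elastic_net_penalty(example_weights, reg_param=0.3, elastic_net_param=0.8)

# Pure ridge (L2 only) for comparison
ridge_penalty = elastic_net_penalty(example_weights, reg_param=0.3, elastic_net_param=0.0)

print(penalty, ridge_penalty)
```

With `elasticNetParam=0.8` most of the penalty comes from the L1 term, which tends to push small coefficients to exactly zero; combined with a fairly large `regParam`, this can shrink the model strongly.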

In [0]:
selected_columns = ['HighBP', 'HighChol', 'HvyAlcoholConsump', 'Diabetes_012']

# Select the specified columns from resampled_df
selected_df = resampledDF.select(selected_columns)

# Create a vector assembler to assemble selected features into a single vector column
vector_assembler = VectorAssembler(inputCols=selected_columns[:-1], outputCol="features")

# Create a PolynomialExpansion stage for interaction terms
poly_expansion = PolynomialExpansion(inputCol="features", outputCol="expanded_features", degree=2)

# Create a Logistic Regression model
logistic_regression = LogisticRegression(labelCol=selected_columns[-1], featuresCol="expanded_features", 
                                         maxIter=5, regParam=0.4, elasticNetParam=0.8)

# Create a pipeline with the vector assembler, polynomial expansion, and logistic regression stages
pipeline = Pipeline(stages=[vector_assembler, poly_expansion, logistic_regression])

# Train the model on the selected DataFrame
lm_m2 = pipeline.fit(selected_df)

# Select the same columns from the test set
selected_test_df = testDF.select(selected_columns)

# Evaluate the model on the test set
lm_m2_eval = evaluate_multiclass_model(lm_m2, selected_test_df, target_variable)
    
for key, value in lm_m2_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")
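
To see which terms a degree-2 polynomial expansion introduces for the three selected features, the following sketch enumerates them by name. It is a plain-Python illustration, not Spark's `PolynomialExpansion`; the helper name is hypothetical and the ordering of the terms here is not meant to match the order of Spark's output vector.

```python
from itertools import combinations_with_replacement


def degree2_terms(names):
    """List the linear terms plus all degree-2 products (squares and pairwise interactions)."""
    linear = list(names)
    quadratic = ["*".join(pair) for pair in combinations_with_replacement(names, 2)]
    return linear + quadratic


terms = degree2_terms(["HighBP", "HighChol", "HvyAlcoholConsump"])
for term in terms:
    print(term)
```

Three inputs yield nine expanded features: three linear terms, three squares, and three pairwise interactions. For 0/1 indicators the square of a feature equals the feature itself, so the pairwise products are the terms that add new information.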

In [0]:
#Hyper Parameter Tuning on the model lm_m2
predictions = lm_m2.transform(selected_test_df)
evaluator = MulticlassClassificationEvaluator(labelCol=selected_columns[-1], predictionCol="prediction", metricName="accuracy")

paramGrid = ParamGridBuilder() \
    .addGrid(logistic_regression.regParam, [0.1, 0.01, 0.001]) \
    .addGrid(logistic_regression.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)

lm_cv = crossval.fit(selected_df)

# Make predictions on the test set
lm_cv_eval = evaluate_multiclass_model(lm_cv, selected_test_df, target_variable)
    
for key, value in lm_cv_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")
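
Grid search with cross-validation can be expensive, so it is worth estimating the number of model fits before running it. The sketch below reproduces the grid from the cell above in plain Python and counts the fits; the variable names are illustrative, and the extra final fit reflects the fact that `CrossValidator` refits the best parameter setting on the full training data.

```python
from itertools import product

# Same values as the ParamGridBuilder above
reg_params = [0.1, 0.01, 0.001]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 5

# Every combination of the two hyperparameters
grid = [
    {"regParam": reg, "elasticNetParam": enet}
    for reg, enet in product(reg_params, elastic_net_params)
]

fits_during_search = len(grid) * num_folds
total_fits = fits_during_search + 1  # plus the final refit of the best setting

print(len(grid), fits_during_search, total_fits)
```

Each additional hyperparameter value multiplies the grid size, so widening the search quickly increases training time on a small cluster.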

In [0]:
# Model using mlflow
# Convert the diabetes label to double
data = resampledDF.withColumn("Diabetes_012", col("Diabetes_012").cast(DoubleType()))

# Split the data into training and test sets
(trainingData, testData) = data.randomSplit([0.7, 0.3], seed=1234)

# Set the hyperparameters for the logistic regression model
lr_params = {
    "regParam": 0.01,
    "elasticNetParam": 0.0,
    "maxIter": 100
}
token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
dbutils.fs.put("file:///root/.databrickscfg","[DEFAULT]\nhost=https://community.cloud.databricks.com\ntoken = "+token,overwrite=True)


# Train the model and make predictions on the test data
with mlflow.start_run():
    # The selected features are already numeric, so no StringIndexer stages are needed

    # Create the features vector using VectorAssembler
    F_assembler = VectorAssembler(inputCols=["HighBP", "HighChol", "BMI","HvyAlcoholConsump", "Income","MentHlth","HeartDiseaseorAttack"], outputCol="features")
    
    # Define the logistic regression model with the hyperparameters set in lr_params
    lr = LogisticRegression(featuresCol="features", labelCol="Diabetes_012", **lr_params)
  
    # Define the pipeline
    pipeline = Pipeline(stages=[F_assembler, lr])
    
    # Log the hyperparameters
    mlflow.log_params(lr_params)
    
    # Train the model
    models=pipeline.fit(trainingData)
    
    # Make predictions on the test data
    predictions = models.transform(testData)
    
    # Evaluate the model
    # Note: Diabetes_012 has three classes and "prediction" holds hard class labels,
    # so this binary AUC is only a rough indicator
    auc_evaluator = BinaryClassificationEvaluator(labelCol="Diabetes_012", rawPredictionCol="prediction", metricName="areaUnderROC")
    areaUnderROC = auc_evaluator.evaluate(predictions)
    print("Area under ROC curve: {}".format(areaUnderROC))
    acc_evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="Diabetes_012", metricName="accuracy")
    accuracy = acc_evaluator.evaluate(predictions)
    print("Accuracy: {:.2f}%".format(accuracy*100))
    
    # Evaluate the model trained in this run on the same held-out split
    mlflow_eval = evaluate_multiclass_model(models, testData, target_variable)
    
    for key, value in mlflow_eval.items():
        if key == "Confusion Matrix":
            print(f"{key} Combinations:")
            display(value)
        else:
            print(f"{key}: {value}")

    #Log the metrics
    mlflow.log_metric("Accuracy", accuracy*100)
    mlflow.log_metric("AUC", areaUnderROC) 

#####NaiveBayes

In [0]:
feature_columns = [col for col in trainDF.columns if col != target_variable]

# Create a vector assembler to assemble features into a single vector column
vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="feature")

# Create a Naive Bayes Model
nb_m1 = NaiveBayes(labelCol=target_variable, featuresCol="feature", smoothing = 1.5, modelType= "multinomial")

# Create a pipeline with the vector assembler and naive bayes stages
pipeline = Pipeline(stages=[vector_assembler, nb_m1])

# Train the model
nb_m1 = pipeline.fit(resampledDF)

# Make predictions on the test set
nb_m1_eval = evaluate_multiclass_model(nb_m1, testDF, target_variable)
    
for key, value in nb_m1_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")
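
The `smoothing` argument of the Naive Bayes model is an additive (Laplace/Lidstone) smoothing constant. As a minimal sketch of the standard multinomial formula, assuming per-class feature totals are already available, each conditional probability is (total + smoothing) / (sum of totals + smoothing × number of features). The helper name and the toy totals below are hypothetical.

```python
def smoothed_feature_probs(feature_totals, smoothing):
    """Additive smoothing for the per-class feature probabilities of multinomial Naive Bayes."""
    n_features = len(feature_totals)
    denominator = sum(feature_totals.values()) + smoothing * n_features
    return {
        name: (total + smoothing) / denominator
        for name, total in feature_totals.items()
    }


# Hypothetical feature sums within one class (assumption: not from the notebook)
class_totals = {"HighBP": 40, "HighChol": 60, "Stroke": 0}

probs = smoothed_feature_probs(class_totals, smoothing=1.5)
for name, p in probs.items():
    print(f"{name}: {p:.4f}")
```

Without smoothing, a feature that never occurs in a class would get probability zero and veto that class for any row where the feature is present; a positive smoothing value avoids this.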

In [0]:
selected_columns = ['HighBP', 'BMI', 'HvyAlcoholConsump', 'Diabetes_012']

# Select the specified columns from resampled_df
selected_df = resampledDF.select(selected_columns)

# Create a vector assembler to assemble selected features into a single vector column
vector_assembler = VectorAssembler(inputCols=selected_columns[:-1], outputCol="features")

# Create a PolynomialExpansion stage for interaction terms
poly_expansion = PolynomialExpansion(inputCol="features", outputCol="expanded_features", degree=2)

# Create a Naive Bayes Model
nb_m2 = NaiveBayes(labelCol=selected_columns[-1], featuresCol="expanded_features",   smoothing = 1.0, modelType= "multinomial")

# Create a pipeline with the vector assembler, polynomial expansion, and naive bayes
pipeline = Pipeline(stages=[vector_assembler, poly_expansion, nb_m2])

# Train the model on the selected DataFrame
nb_m2 = pipeline.fit(selected_df)

# Select the same columns from the test set
selected_test_df = testDF.select(selected_columns)

# Evaluate the model on the selected test columns
nb_m2_eval = evaluate_multiclass_model(nb_m2, selected_test_df, target_variable)
    
for key, value in nb_m2_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")

#####RandomForests

###### Function to fit the RandomForests models

In [0]:
def random_forest_model(train_data, feature_columns, target_column):
    """
    Train a Random Forest model using PySpark.

    Parameters:
    - train_data (DataFrame): The training DataFrame.
    - feature_columns (list): List of feature column names.
    - target_column (str): Name of the target column.

    Returns:
    - model (PipelineModel): The trained Random Forest model.
    """

    # Create a VectorAssembler to combine features into a vector column
    vec_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

    # Create a RandomForestClassifier with 15 trees and otherwise default parameters
    rf_classifier = RandomForestClassifier(labelCol=target_column, featuresCol="features", numTrees=15)

    # Create a pipeline with the VectorAssembler and RandomForestClassifier
    pipeline = Pipeline(stages=[vec_assembler, rf_classifier])

    # Fit the pipeline on the training data
    model = pipeline.fit(train_data)

    return model

In [0]:
# RF model using TrainDF 
feature_columns = [col for col in trainDF.columns if col != target_variable]

rf_m1 = random_forest_model(trainDF, feature_columns, target_variable)
rf_m1_eval = evaluate_multiclass_model(rf_m1, testDF, target_variable)
    
for key, value in rf_m1_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")

In [0]:
#RF model using resampledDF

feature_columns = [col for col in resampledDF.columns if col != target_variable]

rf_m2 = random_forest_model(resampledDF, feature_columns, target_variable)
rf_m2_eval = evaluate_multiclass_model(rf_m2, testDF, target_variable)
    
for key, value in rf_m2_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")

###### Subsampling 30% of resampledDF to reduce noise

In [0]:
# Subsample the DataFrame
subsampledDF = resampledDF.sample(fraction=0.3, seed=42)

# Check the size of the subsampled DataFrame
print("Original DataFrame count:", resampledDF.count())
print("Subsampled DataFrame count:", subsampledDF.count())
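
The subsampled count printed above will usually be close to, but not exactly, 30% of the original. Spark's `sample` without replacement keeps each row independently with the given probability, so the result size varies around the target. The following plain-Python sketch mimics that behaviour on a list; `bernoulli_sample` is a hypothetical helper, not a Spark API.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))
sampled = bernoulli_sample(rows, fraction=0.3, seed=42)

print("Original count:", len(rows))
print("Sampled count:", len(sampled))
```

Fixing the seed makes the subsample reproducible, which matters here because the models that follow are all trained on the same subsample.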

In [0]:
# RF model using subsampledDF

feature_columns = [col for col in trainDF.columns if col != target_variable]

# Create and train the RandomForestClassificationModel
rf_m3 = random_forest_model(subsampledDF, feature_columns, target_variable)

# Evaluate the model
rf_m3_eval = evaluate_multiclass_model(rf_m3, testDF, target_variable)

# Print the evaluation results
for key, value in rf_m3_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")

In [0]:
# Extract feature importances
feature_importances = rf_m3.stages[-1].featureImportances.toArray()

sorted_features = sorted(zip(feature_columns, feature_importances), key=lambda x: x[1], reverse=True)

# Feature importances:
print("Feature Importances:")
for i, (feature_name, importance) in enumerate(sorted_features):
    print(f"{feature_name}: {importance}")

No significantly important feature was observed in the above model.
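
One possible way to judge whether any feature stands out is to compare each importance with the uniform share 1/n that every feature would receive if all were equally important (Spark normalizes random forest importances to sum to 1). The sketch below uses hypothetical importance values, not the notebook's output, and `rank_importances` is an illustrative helper.

```python
def rank_importances(names, importances):
    """Sort features by importance and report each as a multiple of the uniform share 1/n."""
    uniform_share = 1.0 / len(names)
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return [(name, value, value / uniform_share) for name, value in ranked]


# Hypothetical importances that sum to 1 (assumption: not the values printed above)
names = ["HighBP", "GenHlth", "BMI", "Age"]
importances = [0.30, 0.28, 0.22, 0.20]

ranked = rank_importances(names, importances)
for name, value, ratio in ranked:
    print(f"{name}: {value:.2f} ({ratio:.2f}x the uniform share)")
```

Ratios that all sit close to 1 correspond to the situation described above, where no single feature dominates.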

###### Splitting the subsampledDF into three to implement binary classification models for better prediction of the target classes

In [0]:
# RF model with binary classification over '0' and '1' classes of the target_variable

subsampledDF_01 = subsampledDF.filter(col(target_variable) != 2)
testDF_01 = testDF.filter(col(target_variable) != 2)

feature_columns = [col for col in trainDF.columns if col != target_variable]

# Create and train the RandomForestClassificationModel
rf_m4 = random_forest_model(subsampledDF_01, feature_columns, target_variable)

# Evaluate the model
rf_m4_eval = evaluate_multiclass_model(rf_m4, testDF_01, target_variable)

# Print the evaluation results
for key, value in rf_m4_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")

In [0]:
# RF model with binary classification over '0' and '2' classes of the target_variable

subsampledDF_02 = subsampledDF.filter(col(target_variable) != 1)
testDF_02 = testDF.filter(col(target_variable) != 1)

# Create and train the RandomForestClassificationModel
rf_m5 = random_forest_model(subsampledDF_02, feature_columns, target_variable)

# Evaluate the model
rf_m5_eval = evaluate_multiclass_model(rf_m5, testDF_02, target_variable)

# Print the evaluation results
for key, value in rf_m5_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")

In [0]:
# RF model with binary classification over '1' and '2' classes of the target_variable

subsampledDF_12 = subsampledDF.filter(col(target_variable) != 0)
testDF_12 = testDF.filter(col(target_variable) != 0)

feature_columns = [col for col in trainDF.columns if col != target_variable]

# Create and train the RandomForestClassificationModel
rf_m6 = random_forest_model(subsampledDF_12, feature_columns, target_variable)

# Evaluate the model
rf_m6_eval = evaluate_multiclass_model(rf_m6, testDF_12, target_variable)

# Print the evaluation results
for key, value in rf_m6_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")

After examining the binary classification models, we concluded that target class "1" has no significant variables that differentiate it from classes "0" and "2". Further research in the domain taught us that Type 1 Diabetes is primarily attributed to genetic factors and an autoimmune reaction, neither of which is captured in the selected dataset. [2]

Hence, we will continue improving the rf_m5 model for classes "0" and "2".

We will test the models by introducing interaction variables to improve their predictive power.

##### Interaction Models

In [0]:
# Extract feature importances of rf_m5 model
feature_importances = rf_m5.stages[-1].featureImportances.toArray()

sorted_features = sorted(zip(feature_columns, feature_importances), key=lambda x: x[1], reverse=True)

# Feature importances:
print("Feature Importances:")
for feature_name, importance in sorted_features:
    print(f"{feature_name}: {importance}")

For the first interaction model, we choose GenHlth and HighBP as the interaction terms, based on the feature importances retrieved from model rf_m5.

From the feature importances, it is also clear that the features "Stroke", "HeartDiseaseorAttack", "AnyHealthcare", "Smoker" and "CholCheck" have little to no impact on classification. Hence, these features, along with "Sex", are excluded from the following models.
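
For two numeric features, an interaction term is their product. Spark's `Interaction` transformer emits that product as a vector column, while the sketch below uses plain numbers to show the arithmetic. The helper `add_interaction` and the rows are hypothetical, and it assumes GenHlth is an ordinal score and HighBP a 0/1 flag.

```python
def add_interaction(row, left, right, name):
    """Return a copy of `row` with the product of two features stored under `name`."""
    out = dict(row)
    out[name] = row[left] * row[right]
    return out

# Made-up rows for illustration only
rows = [
    {"GenHlth": 2, "HighBP": 0},
    {"GenHlth": 4, "HighBP": 1},
    {"GenHlth": 5, "HighBP": 1},
]

with_interaction = [
    add_interaction(r, "GenHlth", "HighBP", "Gen_BP_Interaction") for r in rows
]
for r in with_interaction:
    print(r)
```

With a 0/1 flag, the interaction equals GenHlth when HighBP is 1 and is 0 otherwise, so the new column isolates the general-health score of subjects with high blood pressure.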

In [0]:
interaction = Interaction(inputCols=['GenHlth', 'HighBP'], outputCol='Gen_BP_Interaction')

# Transform the data to include interaction terms
subsampledDF_interactions = interaction.transform(subsampledDF_02)

# List of variables to remove
variables_to_remove = ['Smoker', 'AnyHealthcare', 'Stroke', 'HeartDiseaseorAttack', 'Sex', 'CholCheck']

feature_columns = [col for col in trainDF.columns if col != target_variable and col not in variables_to_remove] + ['Gen_BP_Interaction']

# Create and train the RandomForestClassificationModel
rf_m7 = random_forest_model(subsampledDF_interactions, feature_columns, target_variable)

testDF_interactions1 = interaction.transform(testDF_02)

# Evaluate the model
rf_m7_eval = evaluate_multiclass_model(rf_m7, testDF_interactions1, target_variable)

# Print the evaluation results
for key, value in rf_m7_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")

For the next model, we test an interaction motivated by domain knowledge: obesity is known to be a major contributing factor for Type 2 Diabetes. The features selected are "HighChol" and "BMI", which represent the physical state of a subject. [1]

In [0]:
interaction = Interaction(inputCols=['HighChol', 'BMI'], outputCol='PhysState_Interaction')

# Transform the data to include interaction terms
subsampledDF_interactions = interaction.transform(subsampledDF_02)

# List of variables to remove
variables_to_remove = ['Smoker', 'AnyHealthcare', 'Stroke', 'HeartDiseaseorAttack', 'Sex', 'CholCheck']

feature_columns = [col for col in trainDF.columns if col != target_variable and col not in variables_to_remove] + ['PhysState_Interaction']


# Create and train the RandomForestClassificationModel
rf_m8 = random_forest_model(subsampledDF_interactions, feature_columns, target_variable)

testDF_interactions2 = interaction.transform(testDF_02)

# Evaluate the model
rf_m8_eval = evaluate_multiclass_model(rf_m8, testDF_interactions2, target_variable)

# Print the evaluation results
for key, value in rf_m8_eval.items():
    if key == "Confusion Matrix":
        print(f"{key} Combinations:")
        display(value)
    else:
        print(f"{key}: {value}")

###### Comparing the two interaction models by AUC-ROC

In [0]:
# Make predictions using rf_m7
predictions_rf_m7 = rf_m7.transform(testDF_interactions1)

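# Evaluate the performance of rf_m7
# Note: rawPredictionCol is set to the hard class predictions rather than class scores,
# so the reported area comes from a single operating point on the ROC curve.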
auc_evaluator_rf_m7 = BinaryClassificationEvaluator(labelCol="Diabetes_012", rawPredictionCol="prediction", metricName="areaUnderROC")
areaUnderROC_rf_m7 = auc_evaluator_rf_m7.evaluate(predictions_rf_m7)
print("Model rf_m7 - Area under ROC curve: {:.4f}".format(areaUnderROC_rf_m7))

# Make predictions using rf_m8
predictions_rf_m8 = rf_m8.transform(testDF_interactions2)

# Evaluate the performance of rf_m8
auc_evaluator_rf_m8 = BinaryClassificationEvaluator(labelCol="Diabetes_012", rawPredictionCol="prediction", metricName="areaUnderROC")
areaUnderROC_rf_m8 = auc_evaluator_rf_m8.evaluate(predictions_rf_m8)
print("Model rf_m8 - Area under ROC curve: {:.4f}".format(areaUnderROC_rf_m8))
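
Because the evaluators above are given the `prediction` column, the ROC curve is built from hard class predictions. Such a curve has a single operating point (FPR, TPR) between (0, 0) and (1, 1), and its trapezoidal area reduces to (1 + TPR - FPR) / 2. The sketch below checks that arithmetic in plain Python. The function name and the label lists are hypothetical, and the numbers are not the models' results.

```python
def auc_from_hard_predictions(labels, predictions, positive):
    """Area under a ROC curve that has one operating point (FPR, TPR)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    tpr = tp / (tp + fn)  # true positive rate (recall of the positive class)
    fpr = fp / (fp + tn)  # false positive rate
    return (1 + tpr - fpr) / 2

# Made-up labels and predictions, with 2 as the positive class as in Diabetes_012
labels      = [0, 0, 0, 0, 2, 2, 2, 2]
predictions = [0, 0, 0, 2, 2, 2, 2, 0]

auc = auc_from_hard_predictions(labels, predictions, positive=2)
print(f"AUC from hard predictions: {auc:.4f}")
```

An AUC computed this way is the average of sensitivity and specificity at the model's default decision threshold. It does not measure how well the class scores rank subjects across all thresholds.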

#### Interpretation and Conclusion

To sum up, we developed two predictive models, rf_m7 and rf_m8, to determine the likelihood of diabetes based on different feature interactions. The rf_m7 model uses the interaction between 'HighBP' and 'GenHlth', which we identified through feature importance. It has an accuracy of 72.20% and performs competently, with a weighted precision of 83.52% and a true positive prediction rate of 72.56% for class 0. On the other hand, the rf_m8 model is based on the interaction between 'HighChol' and 'BMI' using domain-specific features and has an accuracy of 70.94%. It excels at avoiding false positives, as evidenced by its lower weighted false positive rate of 26.13%. 
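
A weighted precision well above the accuracy can look surprising. The sketch below assumes that `evaluate_multiclass_model` (defined earlier in the notebook) reports the usual support-weighted averages, where each class's metric is weighted by its share of the true labels. The function `weighted_metrics` is hypothetical, and the confusion-matrix counts are made up and are not the models' results.

```python
def weighted_metrics(confusion):
    """confusion[actual][predicted] holds counts.

    Returns (weighted precision, weighted false positive rate).
    """
    labels = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    weighted_precision = 0.0
    weighted_fpr = 0.0
    for c in labels:
        actual_c = sum(confusion[c].values())               # true label is c
        predicted_c = sum(confusion[a][c] for a in labels)  # predicted as c
        tp = confusion[c][c]
        fp = predicted_c - tp
        precision = tp / predicted_c if predicted_c else 0.0
        fpr = fp / (total - actual_c) if total != actual_c else 0.0
        weight = actual_c / total                           # class share of true labels
        weighted_precision += weight * precision
        weighted_fpr += weight * fpr
    return weighted_precision, weighted_fpr

# Made-up, imbalanced counts: 100 rows of class 0 and 20 rows of class 2
confusion = {
    0: {0: 70, 2: 30},
    2: {0: 5, 2: 15},
}
wp, wfpr = weighted_metrics(confusion)
accuracy = (70 + 15) / 120
print(f"accuracy={accuracy:.4f}  weighted precision={wp:.4f}  weighted FPR={wfpr:.4f}")
```

In this toy case the majority class dominates the weighted averages, so the weighted precision is higher than the accuracy even though precision on the minority class is low.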

Both models have moderate AUC values of 71.39% for rf_m7 and 72.40% for rf_m8. The choice between these models may depend on specific objectives, such as prioritizing precision or recall, and the context in which the features are considered. Overall, these models offer valuable insights into diabetes prediction, providing a foundation for informed decision-making in healthcare scenarios.

####References
<i> [1] U.S. Department of Health and Human Services. Symptoms & causes of diabetes - NIDDK. National Institute of Diabetes and Digestive and Kidney Diseases. https://www.niddk.nih.gov/health-information/diabetes/overview/symptoms-causes#:~:text=Overweight%2C%20obesity%2C%20and%20physical%20inactivity,people%20with%20type%202%20diabetes.

<i> [2] Centers for Disease Control and Prevention. What is type 1 diabetes?. Centers for Disease Control and Prevention. https://www.cdc.gov/diabetes/basics/what-is-type-1-diabetes.html 
 
<i> [3] CDC Diabetes Health Indicators from UCI Machine Learning Repository https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators
