<a href="https://colab.research.google.com/github/nidhi0684/Project4-DiabetesPrediction/blob/main/Diabetes_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Diabetes Prediction using LogisticRegression ML Model, Binary classification**

## **Collaborators - Nidhi, Luis, Steven, Beauty, and Preeti**
## Instructions to run
*   Run all cells and locate the output of the last cell in PART 8.
*   Click the URL provided by Google Colab, which looks like *https://\<UUID\>.colab.googleusercontent.com/*.
*   The Diabetes Prediction UI will launch. It takes parameters such as gender, age, hypertension, heart disease, smoking history, BMI, HbA1c level, and blood glucose level.
*   Click the Predict button; a pop-up message shows the probability (as a percentage) that the subject is diabetic.

## Code workflow
* PART 1 through PART 3 cover initial setup, cloning the repository, dataset exploration, and cleanup.
* PART 4 through PART 7 cover building, training, testing, and saving the ML model.
* PART 8 builds a Flask app whose APIs render the UI home page and use the ML model to predict the probability of being diabetic from the parameters passed in through the web UI.


# PART 1 : Install Dependencies & Run Spark Session

In [1]:
# Install pyspark
! pip install pyspark

from pyspark.sql.functions import when, col

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     317.0/317.0 MB 2.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488493 sha256=64ba57a871eef9c5d16726384d3a8f80cf3843581fcaf0787c7f27160cef1e71
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
# Create a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("spark").getOrCreate()

# PART 2: Clone & Explore dataset

In [6]:
# Remove any previous clone of the repository and any previously saved model from /content
# so that the repository can be cloned fresh and the model can be saved again
import shutil
import os

# Directories to delete: the cloned repository and the saved model
directory_paths = ['/content/Project4-DiabetesPrediction', '/content/model']

for directory_path in directory_paths:
    # Check if the directory exists before attempting to delete it
    if os.path.exists(directory_path):
        shutil.rmtree(directory_path)
        print(f"The directory {directory_path} has been deleted.")
    else:
        print(f"The directory {directory_path} does not exist.")

The directory /content/Project4-DiabetesPrediction has been deleted.
The directory /content/model does not exist.


In [7]:
# Clone the GitHub repository that contains the diabetes dataset
! git clone https://github.com/nidhi0684/Project4-DiabetesPrediction

Cloning into 'Project4-DiabetesPrediction'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 16 (delta 1), reused 13 (delta 1), pack-reused 0
Receiving objects: 100% (16/16), 707.25 KiB | 5.48 MiB/s, done.
Resolving deltas: 

In [8]:
# Check if the dataset exists
! ls /content/Project4-DiabetesPrediction/dataset

diabetes_prediction_dataset.csv  diabetes_test_dataset.csv  new_test.csv


In [9]:
# Create spark dataframe
df_diabetes_data = spark.read.csv("/content/Project4-DiabetesPrediction/dataset/diabetes_prediction_dataset.csv", header=True, inferSchema=True)

In [10]:
# Display the dataframe
df_diabetes_data.show()

+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|gender| age|hypertension|heart_disease|smoking_history|  bmi|HbA1c_level|blood_glucose_level|diabetes|
+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|Female|80.0|           0|            1|          never|25.19|        6.6|                140|       0|
|Female|54.0|           0|            0|        No Info|27.32|        6.6|                 80|       0|
|  Male|28.0|           0|            0|          never|27.32|        5.7|                158|       0|
|Female|36.0|           0|            0|        current|23.45|        5.0|                155|       0|
|  Male|76.0|           1|            1|        current|20.14|        4.8|                155|       0|
|Female|20.0|           0|            0|          never|27.32|        6.6|                 85|       0|
|Female|44.0|           0|            0|          never|19.31|  

In [11]:
# Show the number of rows
df_diabetes_data.count()

100000

In [12]:
# Print the schema
df_diabetes_data.printSchema()

root
 |-- gender: string (nullable = true)
 |-- age: double (nullable = true)
 |-- hypertension: integer (nullable = true)
 |-- heart_disease: integer (nullable = true)
 |-- smoking_history: string (nullable = true)
 |-- bmi: double (nullable = true)
 |-- HbA1c_level: double (nullable = true)
 |-- blood_glucose_level: integer (nullable = true)
 |-- diabetes: integer (nullable = true)



In [13]:
# Print the dataframe shape, then count the rows in each class (1 = diabetes present, 0 = diabetes absent)
print((df_diabetes_data.count(), len(df_diabetes_data.columns)))
df_diabetes_data.groupBy('diabetes').count().show()

(100000, 9)
+--------+-----+
|diabetes|count|
+--------+-----+
|       1| 8500|
|       0|91500|
+--------+-----+
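
The class counts above show a strongly imbalanced dataset. As a rough sketch (the variable names are illustrative and not part of the notebook), the diabetic share and the accuracy of a trivial always-predict-0 baseline can be computed from the two counts:

```python
# Class counts copied from the groupBy('diabetes') output above
class_counts = {1: 8500, 0: 91500}

total = sum(class_counts.values())
positive_rate = class_counts[1] / total

# Accuracy of a trivial model that always predicts the majority class (0)
majority_baseline_accuracy = max(class_counts.values()) / total

print(f"Diabetic share: {positive_rate:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Because a model could score about as well as this baseline on plain accuracy without learning anything, a ranking metric such as area under the ROC curve is usually more informative for data like this.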



In [14]:
# Count the total no. of gender types
print((df_diabetes_data.count(), len(df_diabetes_data.columns)))
df_diabetes_data.groupBy('gender').count().show()

(100000, 9)
+------+-----+
|gender|count|
+------+-----+
|Female|58552|
| Other|   18|
|  Male|41430|
+------+-----+



In [15]:
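# Filter for rows with an empty 'gender' value. The filter is lazy, so this only displays the
# resulting DataFrame's schema; call .count() or .show() on it to see whether any rows match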
df_diabetes_data[df_diabetes_data['gender'] == '']

DataFrame[gender: string, age: double, hypertension: int, heart_disease: int, smoking_history: string, bmi: double, HbA1c_level: double, blood_glucose_level: int, diabetes: int]

In [16]:
# Get the summary statistics
df_diabetes_data.describe().show()

+-------+------+-----------------+------------------+------------------+---------------+-----------------+------------------+-------------------+-------------------+
|summary|gender|              age|      hypertension|     heart_disease|smoking_history|              bmi|       HbA1c_level|blood_glucose_level|           diabetes|
+-------+------+-----------------+------------------+------------------+---------------+-----------------+------------------+-------------------+-------------------+
|  count|100000|           100000|            100000|            100000|         100000|           100000|            100000|             100000|             100000|
|   mean|  NULL|41.88585600000013|           0.07485|           0.03942|           NULL|27.32076709999422|5.5275069999983275|          138.05806|              0.085|
| stddev|  NULL|22.51683987161704|0.2631504702289171|0.1945930169980986|           NULL|6.636783416648357|1.0706720918835468|  40.70813604870383|0.27888308976661896|
|   

# PART 3: Data Cleaning & Preparation

In [17]:
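# Check for null values
# (the loop variable is named column_name so it does not shadow pyspark.sql.functions.col)
for column_name in df_diabetes_data.columns:
  print(column_name + ":", df_diabetes_data[df_diabetes_data[column_name].isNull()].count())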

gender: 0
age: 0
hypertension: 0
heart_disease: 0
smoking_history: 0
bmi: 0
HbA1c_level: 0
blood_glucose_level: 0
diabetes: 0


In [19]:
# Function to count zero values in numeric columns where a zero would indicate a missing or invalid measurement
def count_zeros():
  columns_list = ["age", "bmi", "HbA1c_level", "blood_glucose_level"]
  for i in columns_list:
    print(i+":",df_diabetes_data[df_diabetes_data[i]==0].count())

In [20]:
count_zeros()

age: 0
bmi: 0
HbA1c_level: 0
blood_glucose_level: 0


In [21]:
# Display the dataframe
df_diabetes_data.show()

+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|gender| age|hypertension|heart_disease|smoking_history|  bmi|HbA1c_level|blood_glucose_level|diabetes|
+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|Female|80.0|           0|            1|          never|25.19|        6.6|                140|       0|
|Female|54.0|           0|            0|        No Info|27.32|        6.6|                 80|       0|
|  Male|28.0|           0|            0|          never|27.32|        5.7|                158|       0|
|Female|36.0|           0|            0|        current|23.45|        5.0|                155|       0|
|  Male|76.0|           1|            1|        current|20.14|        4.8|                155|       0|
|Female|20.0|           0|            0|          never|27.32|        6.6|                 85|       0|
|Female|44.0|           0|            0|          never|19.31|  

In [22]:
# Drop the 'Other' rows in the gender column
string_to_remove = "Other"
df_diabetes_data = df_diabetes_data[df_diabetes_data['gender'] != string_to_remove]

In [23]:
# Count the total no. of gender types
print((df_diabetes_data.count(), len(df_diabetes_data.columns)))
df_diabetes_data.groupBy('gender').count().show()

(99982, 9)
+------+-----+
|gender|count|
+------+-----+
|Female|58552|
|  Male|41430|
+------+-----+



In [24]:
# Count the total no. of smoker/non-smoker types
print((df_diabetes_data.count(), len(df_diabetes_data.columns)))
df_diabetes_data.groupBy('smoking_history').count().show()

(99982, 9)
+---------------+-----+
|smoking_history|count|
+---------------+-----+
|    not current| 6439|
|         former| 9352|
|        No Info|35810|
|        current| 9286|
|          never|35092|
|           ever| 4003|
+---------------+-----+



In [26]:
# Drop the 'No Info' rows in the smoking_history column
string_to_remove_1 = "No Info"
df_diabetes_data = df_diabetes_data[df_diabetes_data['smoking_history'] != string_to_remove_1]

In [27]:
# Count the total no. of smoker/non-smoker types
print((df_diabetes_data.count(), len(df_diabetes_data.columns)))
df_diabetes_data.groupBy('smoking_history').count().show()

(64172, 9)
+---------------+-----+
|smoking_history|count|
+---------------+-----+
|    not current| 6439|
|         former| 9352|
|        current| 9286|
|          never|35092|
|           ever| 4003|
+---------------+-----+
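
Dropping the 'No Info' rows removes a large part of the dataset. The following sanity check, written as a small sketch with illustrative variable names, reproduces the row counts shown above and reports the share of rows lost:

```python
# Counts copied from the outputs above
rows_before = 99982      # rows after removing gender == 'Other'
no_info_rows = 35810     # rows with smoking_history == 'No Info'

rows_after = rows_before - no_info_rows
dropped_share = no_info_rows / rows_before

# Remaining categories, as listed in the groupBy('smoking_history') output
remaining = {"not current": 6439, "former": 9352, "current": 9286,
             "never": 35092, "ever": 4003}

print(rows_after, sum(remaining.values()))
print(f"Share of rows dropped: {dropped_share:.1%}")
```

A little over a third of the rows are discarded by this step, which is worth keeping in mind when interpreting the model's results.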



In [28]:
# Count the total no. of gender types after smoking_history cleanup
print((df_diabetes_data.count(), len(df_diabetes_data.columns)))
df_diabetes_data.groupBy('gender').count().show()

(64172, 9)
+------+-----+
|gender|count|
+------+-----+
|Female|38852|
|  Male|25320|
+------+-----+



In [29]:
# Encode the 'gender' column: 'Female' = 0, 'Male' = 1
from pyspark.sql.functions import when, col
df_diabetes_data = df_diabetes_data.withColumn("gender",
    when(col("gender") == "Female", 0).
    when(col("gender") == "Male", 1).
    otherwise(col("gender"))
)
df_diabetes_data.show()

+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|gender| age|hypertension|heart_disease|smoking_history|  bmi|HbA1c_level|blood_glucose_level|diabetes|
+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|     0|80.0|           0|            1|          never|25.19|        6.6|                140|       0|
|     1|28.0|           0|            0|          never|27.32|        5.7|                158|       0|
|     0|36.0|           0|            0|        current|23.45|        5.0|                155|       0|
|     1|76.0|           1|            1|        current|20.14|        4.8|                155|       0|
|     0|20.0|           0|            0|          never|27.32|        6.6|                 85|       0|
|     0|44.0|           0|            0|          never|19.31|        6.5|                200|       1|
|     1|42.0|           0|            0|          never|33.64|  

In [30]:
# Encode the 'smoking_history' column: "never" = 0, "ever" = 1, "not current" = 2, "current" = 3, "former" = 4
df_diabetes_data = df_diabetes_data.withColumn("smoking_history",
    when(col("smoking_history") == "never", 0).
    when(col("smoking_history") == "ever", 1).
    when(col("smoking_history") == "not current", 2).
    when(col("smoking_history") == "current", 3).
    when(col("smoking_history") == "former", 4).
    otherwise(col("smoking_history"))
)
df_diabetes_data.show()

+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|gender| age|hypertension|heart_disease|smoking_history|  bmi|HbA1c_level|blood_glucose_level|diabetes|
+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|     0|80.0|           0|            1|              0|25.19|        6.6|                140|       0|
|     1|28.0|           0|            0|              0|27.32|        5.7|                158|       0|
|     0|36.0|           0|            0|              3|23.45|        5.0|                155|       0|
|     1|76.0|           1|            1|              3|20.14|        4.8|                155|       0|
|     0|20.0|           0|            0|              0|27.32|        6.6|                 85|       0|
|     0|44.0|           0|            0|              0|19.31|        6.5|                200|       1|
|     1|42.0|           0|            0|              0|33.64|  
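
The integer codes used for smoking_history imply an ordering (never < ever < not current < current < former) that a logistic regression will treat as a numeric magnitude. One common alternative is one-hot encoding. The sketch below is plain Python for illustration only; `encode_one_hot` is a hypothetical helper and not what the notebook does:

```python
# Integer codes the notebook assigns to each smoking_history category
SMOKING_CODES = {"never": 0, "ever": 1, "not current": 2, "current": 3, "former": 4}

def encode_ordinal(value):
    """Return the integer code the notebook assigns to a smoking_history value."""
    return SMOKING_CODES[value]

def encode_one_hot(value):
    """Hypothetical alternative: one 0/1 indicator per category, with no implied order."""
    return [1 if value == category else 0 for category in SMOKING_CODES]

print(encode_ordinal("current"))
print(encode_one_hot("current"))
```

In PySpark, a similar result would typically be obtained with `StringIndexer` followed by `OneHotEncoder` from `pyspark.ml.feature`; whether it improves this particular model is not something the notebook evaluates.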

# PART 4: Correlation Analysis & Feature Selection

In [31]:
# gender and smoking_history need to be cast to the float data type for the model to work
df_diabetes_data = df_diabetes_data.withColumn("gender", col("gender").cast('float'))
df_diabetes_data = df_diabetes_data.withColumn("smoking_history", col("smoking_history").cast('float'))
df_diabetes_data.show()


+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|gender| age|hypertension|heart_disease|smoking_history|  bmi|HbA1c_level|blood_glucose_level|diabetes|
+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|   0.0|80.0|           0|            1|            0.0|25.19|        6.6|                140|       0|
|   1.0|28.0|           0|            0|            0.0|27.32|        5.7|                158|       0|
|   0.0|36.0|           0|            0|            3.0|23.45|        5.0|                155|       0|
|   1.0|76.0|           1|            1|            3.0|20.14|        4.8|                155|       0|
|   0.0|20.0|           0|            0|            0.0|27.32|        6.6|                 85|       0|
|   0.0|44.0|           0|            0|            0.0|19.31|        6.5|                200|       1|
|   1.0|42.0|           0|            0|            0.0|33.64|  

In [32]:
# Find the correlation among the set of input & output variables
for i in df_diabetes_data.columns:
  print("Correlation to outcome for {} is {}".format(i, df_diabetes_data.stat.corr("diabetes",i)))

Correlation to outcome for gender is 0.05699689368565596
Correlation to outcome for age is 0.26084962459224337
Correlation to outcome for hypertension is 0.19222574901207254
Correlation to outcome for heart_disease is 0.16961397731730365
Correlation to outcome for smoking_history is 0.06472564826560573
Correlation to outcome for bmi is 0.20442115545137657
Correlation to outcome for HbA1c_level is 0.43889709468177335
Correlation to outcome for blood_glucose_level is 0.449697968864106
Correlation to outcome for diabetes is 1.0
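
`DataFrame.stat.corr` computes the Pearson correlation coefficient. As a minimal sketch of what that number means, the function below computes it in plain Python; the toy lists are made up for illustration and are not taken from the dataset:

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Toy data (illustrative only)
glucose = [85, 140, 155, 158, 200]
diabetes = [0, 0, 0, 0, 1]

print(pearson_correlation(glucose, diabetes))
# A column correlated with itself always gives 1.0, as in the 'diabetes' line above
print(pearson_correlation(diabetes, diabetes))
```

Pearson correlation only captures linear association, so a low value for an encoded categorical column such as smoking_history does not by itself show that the feature is uninformative.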


In [33]:
# Assemble the selected input columns into a single 'features' vector column
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols = ['gender', 'age', 'hypertension', 'heart_disease',
                                         'smoking_history', 'bmi', 'HbA1c_level', 'blood_glucose_level'], outputCol='features')
output_data = assembler.transform(df_diabetes_data)

In [34]:
# Print the schema
output_data.printSchema()

root
 |-- gender: float (nullable = true)
 |-- age: double (nullable = true)
 |-- hypertension: integer (nullable = true)
 |-- heart_disease: integer (nullable = true)
 |-- smoking_history: float (nullable = true)
 |-- bmi: double (nullable = true)
 |-- HbA1c_level: double (nullable = true)
 |-- blood_glucose_level: integer (nullable = true)
 |-- diabetes: integer (nullable = true)
 |-- features: vector (nullable = true)



In [35]:
# Display dataframe
output_data.show()

+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+--------------------+
|gender| age|hypertension|heart_disease|smoking_history|  bmi|HbA1c_level|blood_glucose_level|diabetes|            features|
+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+--------------------+
|   0.0|80.0|           0|            1|            0.0|25.19|        6.6|                140|       0|[0.0,80.0,0.0,1.0...|
|   1.0|28.0|           0|            0|            0.0|27.32|        5.7|                158|       0|[1.0,28.0,0.0,0.0...|
|   0.0|36.0|           0|            0|            3.0|23.45|        5.0|                155|       0|[0.0,36.0,0.0,0.0...|
|   1.0|76.0|           1|            1|            3.0|20.14|        4.8|                155|       0|[1.0,76.0,1.0,1.0...|
|   0.0|20.0|           0|            0|            0.0|27.32|        6.6|                 85|       0|(8,[1,5,6,7],[20....|
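
Some rows in the features column are printed as `[...]` (dense) and others as `(8,[1,5,6,7],[20....` (sparse). The sparse form lists the vector size, the indices of the non-zero entries, and their values. The sketch below expands the sparse row above into a dense list; `sparse_to_dense` is a hypothetical helper and the values are read from the corresponding row of the table:

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# Column order passed to VectorAssembler in the notebook
feature_names = ['gender', 'age', 'hypertension', 'heart_disease',
                 'smoking_history', 'bmi', 'HbA1c_level', 'blood_glucose_level']

# Row with gender=0, age=20, bmi=27.32, HbA1c_level=6.6, blood_glucose_level=85
row = sparse_to_dense(8, [1, 5, 6, 7], [20.0, 27.32, 6.6, 85.0])
print(dict(zip(feature_names, row)))
```

Both representations hold the same values; the model treats them identically.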


# PART 5: Split Dataset & Build the Model

In [36]:
# Create final data
from pyspark.ml.classification import LogisticRegression

final_data = output_data.select('features','diabetes')

In [37]:
# Print schema of final data
final_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- diabetes: integer (nullable = true)



In [38]:
# Split the dataset and build the model: roughly 70% is used for training and 30% for testing
# (no seed is passed to randomSplit, so the split differs between runs)
train, test = final_data.randomSplit([0.7, 0.3])
models = LogisticRegression(labelCol= 'diabetes')
model = models.fit(train)
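
`randomSplit` accepts an optional `seed` argument; without it, the rows that land in `train` and `test` change on every run. The plain-Python sketch below illustrates the idea of a seeded, approximately 70/30 split. It is not Spark's implementation, and `seeded_split` is a hypothetical helper:

```python
import random

def seeded_split(rows, train_fraction=0.7, seed=42):
    """Assign each row independently to train or test using a seeded random generator."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows

rows = list(range(1000))
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

print(len(train_a), len(test_a))   # sizes are only approximately 70/30
print(train_a == train_b)          # same seed, same split
```

Because each row is assigned independently, the split sizes are approximate rather than exact.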

In [39]:
# Summary of the model
summary = model.summary
summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|           diabetes|         prediction|
+-------+-------------------+-------------------+
|  count|              45013|              45013|
|   mean| 0.1106569213338369|0.08219847599582343|
| stddev|0.31371030178227927|0.27466991567918575|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+



# PART 6: Evaluate and Save the Model

In [41]:
# Test the model with test set reserved from the dataset
from pyspark.ml.evaluation import BinaryClassificationEvaluator

predictions = model.evaluate(test)

In [42]:
predictions.predictions.show(100)

+--------------------+--------+--------------------+--------------------+----------+
|            features|diabetes|       rawPrediction|         probability|prediction|
+--------------------+--------+--------------------+--------------------+----------+
|(8,[1,5,6,7],[0.4...|       0|[6.27758552810036...|[0.99812559067588...|       0.0|
|(8,[1,5,6,7],[0.4...|       0|[10.4145971919036...|[0.99997000941872...|       0.0|
|(8,[1,5,6,7],[0.6...|       0|[6.55446258252596...|[0.99857827636751...|       0.0|
|(8,[1,5,6,7],[0.7...|       0|[9.64552196406055...|[0.99993528947376...|       0.0|
|(8,[1,5,6,7],[0.7...|       0|[6.60744458886748...|[0.99865154253228...|       0.0|
|(8,[1,5,6,7],[0.8...|       0|[5.72092219564637...|[0.99673401385542...|       0.0|
|(8,[1,5,6,7],[1.2...|       0|[5.12868957687114...|[0.99411057209461...|       0.0|
|(8,[1,5,6,7],[1.2...|       0|[7.36279745441235...|[0.99936598102468...|       0.0|
|(8,[1,5,6,7],[1.3...|       0|[14.3748488022489...|[0.9999994284
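
For binary logistic regression, the probability column is the logistic (sigmoid) function applied to rawPrediction. As a quick check, the sketch below applies it to the first rawPrediction entry of the first row above. The input is truncated in the displayed output, so the result only matches to the digits shown:

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# First entry of rawPrediction in the first row (score for class 0, truncated in the output)
raw_class0 = 6.27758552810036

prob_class0 = sigmoid(raw_class0)
print(prob_class0)       # compare with the first entry of the probability column
print(1.0 - prob_class0)  # probability of class 1 (diabetic)
```

The two entries of each probability vector sum to 1, so the second entry is the probability of being diabetic.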

In [43]:
# Compute the area under the ROC curve (the default metric of BinaryClassificationEvaluator) on the test set
evaluator = BinaryClassificationEvaluator(rawPredictionCol= 'rawPrediction', labelCol='diabetes')
evaluator.evaluate(model.transform(test))

0.9547194674313324
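
The value above is the area under the ROC curve. One way to read it: it is the probability that a randomly chosen diabetic row receives a higher score than a randomly chosen non-diabetic row. The sketch below computes that quantity directly by comparing all positive/negative pairs. It is a toy illustration on made-up data, not how Spark computes the metric:

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy labels and scores (illustrative only)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(area_under_roc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of rows, so it is only suitable for small examples.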

In [44]:
# Save the model so that it can be loaded later, tested with an external dataset, and used by the Flask app. The model is saved under the /content folder of Colab.
model.save("model")

In [45]:
# Load saved model back to the environment
from pyspark.ml.classification import LogisticRegressionModel

model = LogisticRegressionModel.load('model')

# PART 7: Prediction on New Data with the saved model


In [46]:
# Create a new Spark dataframe from another, external dataset that will be evaluated against the ML model
test_df = spark.read.csv('/content/Project4-DiabetesPrediction/dataset/diabetes_test_dataset.csv', header=True, inferSchema=True)

In [47]:
# Print the schema
test_df.printSchema()

root
 |-- gender: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- hypertension: integer (nullable = true)
 |-- heart_disease: integer (nullable = true)
 |-- smoking_history: integer (nullable = true)
 |-- bmi: double (nullable = true)
 |-- HbA1c_level: double (nullable = true)
 |-- blood_glucose_level: integer (nullable = true)



In [48]:
# Add the assembled 'features' vector column to the test dataframe
test_data = assembler.transform(test_df)

In [49]:
# Print the schema
test_data.printSchema()

root
 |-- gender: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- hypertension: integer (nullable = true)
 |-- heart_disease: integer (nullable = true)
 |-- smoking_history: integer (nullable = true)
 |-- bmi: double (nullable = true)
 |-- HbA1c_level: double (nullable = true)
 |-- blood_glucose_level: integer (nullable = true)
 |-- features: vector (nullable = true)



In [50]:
# Use model to make predictions
results = model.transform(test_data)
results.printSchema()

root
 |-- gender: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- hypertension: integer (nullable = true)
 |-- heart_disease: integer (nullable = true)
 |-- smoking_history: integer (nullable = true)
 |-- bmi: double (nullable = true)
 |-- HbA1c_level: double (nullable = true)
 |-- blood_glucose_level: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [51]:
# Display the predictions and probability
results.select('features','probability','prediction').show()

+--------------------+--------------------+----------+
|            features|         probability|prediction|
+--------------------+--------------------+----------+
|[0.0,80.0,1.0,1.0...|[0.40850955819149...|       1.0|
|[1.0,28.0,0.0,0.0...|[0.99033854760961...|       0.0|
|[0.0,36.0,0.0,0.0...|[0.99857323933788...|       0.0|
|[1.0,76.0,1.0,1.0...|[0.97428170680530...|       0.0|
|(8,[1,5,6,7],[20....|[0.99629747898381...|       0.0|
|(8,[1,5,6,7],[44....|[0.83395120741798...|       0.0|
|[1.0,42.0,0.0,0.0...|[0.99738408449781...|       0.0|
|(8,[1,5,6,7],[32....|[0.99974089677736...|       0.0|
+--------------------+--------------------+----------+
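
Each probability vector holds the probability of class 0 (not diabetic) first and class 1 (diabetic) second; with the default threshold of 0.5, the prediction is 1.0 when the second entry exceeds the threshold. The sketch below turns one such vector into a label and a percentage. `summarize_prediction` is a hypothetical helper, and the input is the truncated value from the first row above:

```python
def summarize_prediction(probability, threshold=0.5):
    """Return the predicted label and the diabetic probability as a percentage."""
    p_diabetic = probability[1]
    label = 1.0 if p_diabetic > threshold else 0.0
    return {"prediction": label, "diabetic_percent": round(p_diabetic * 100, 2)}

# First row above: probability of class 0 (truncated in the displayed output)
p_not_diabetic = 0.40850955819149
first_row = summarize_prediction([p_not_diabetic, 1.0 - p_not_diabetic])
print(first_row)
```

This explains why the first row is the only one predicted as 1.0: it is the only row whose class-0 probability falls below 0.5.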



# PART 8: Host a Flask App that takes a subject's parameters and returns the probability and prediction of being diabetic, based on the ML model above

In [53]:
# Import dependencies to run the Flask app and host it on a publicly accessible Colab URL
from flask import Flask, render_template
from google.colab import output
from google.colab.output import eval_js

In [54]:
# Initialize Flask app
app=Flask(__name__, template_folder='/content/Project4-DiabetesPrediction/html')


In [55]:
# Render Home page for Diabetes Prediction to take subject parameters as input
@app.route('/')
def home():
    return render_template('index.html')



In [56]:
# Dynamic API that will take parameters of subject through Web UI and leverage ML Model above to return probability & prediction for being diabetic
@app.route('/api/v1.0/predict/<gender>/<age>/<hypertension>/<heart_disease>/<smoking_history>/<bmi>/<HbA1c_level>/<blood_glucose_level>')
def predict(gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level):
    # Build a one-row list of tuples from the input parameters, converting each to int or float because URL path parameters arrive as strings
    data = [(int(gender),int(age),int(hypertension),int(heart_disease),int(smoking_history),float(bmi),float(HbA1c_level),int(blood_glucose_level))]
    columns = ['gender','age','hypertension','heart_disease','smoking_history','bmi','HbA1c_level','blood_glucose_level']

    # Create spark dataframe for the input parameters
    test_df = spark.createDataFrame(data,columns)

    # Invoke ML Model and capture model output in results
    test_data = assembler.transform(test_df)
    results = model.transform(test_data)

    # Return the results after converting it to JSON
    return results.toJSON().first()
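
The route above expects all eight parameters as URL path segments, in a fixed order. The sketch below shows one way a client could build such a URL. `build_predict_url` is a hypothetical helper, and the base URL is the placeholder from the instructions rather than a real address:

```python
from urllib.parse import quote

def build_predict_url(base_url, gender, age, hypertension, heart_disease,
                      smoking_history, bmi, hba1c_level, blood_glucose_level):
    """Build the /api/v1.0/predict/... URL with parameters in the order the route expects."""
    values = [gender, age, hypertension, heart_disease,
              smoking_history, bmi, hba1c_level, blood_glucose_level]
    # Percent-encode each value so it stays a single path segment
    path = "/".join(quote(str(value), safe="") for value in values)
    return f"{base_url.rstrip('/')}/api/v1.0/predict/{path}"

# Placeholder base URL; replace <UUID> with the value Colab prints
url = build_predict_url("https://<UUID>.colab.googleusercontent.com/",
                        1, 28, 0, 0, 0, 27.32, 5.5, 158)
print(url)
```

Note that the route converts `age` with `int()`, so a fractional age such as `28.5` would raise a `ValueError` inside `predict`; only `bmi` and `HbA1c_level` are parsed as floats.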

In [57]:
# When run, the Flask app listens on a local address. Because this notebook runs in the cloud (Google Colab) rather than on a local machine, the local IP (127.0.0.1) is not reachable from the browser.
# The code below asks Colab for a publicly accessible URL that proxies port 5000

print(eval_js("google.colab.kernel.proxyPort(5000)"))
output.serve_kernel_port_as_window(5000)
if __name__ == '__main__':
    app.run(host='0.0.0.0',port=5000)

https://f44lm2ma47o-496ff2e9c6d22116-5000-colab.googleusercontent.com/


<IPython.core.display.Javascript object>

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://172.28.0.12:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [27/Feb/2024 04:48:43] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [27/Feb/2024 04:48:44] "GET /favicon.ico HTTP/1.1" 404 -
INFO:werkzeug:127.0.0.1 - - [27/Feb/2024 04:49:10] "GET /api/v1.0/predict/1/28/0/0/0/27.32/5.5/158 HTTP/1.1" 200 -
