## Advanced ML Q-2.
A chemist has two flasks, labeled 0 and 1, that contain two different chemicals.
He extracted three features from these chemicals in order to distinguish between
them. You are given the resulting measurements, and your task is to create a model
that labels a chemical as 0 or 1 given its three features, build it in Docker, and
use a library to display the result in a frontend.
Note: Use only PySpark.
Dataset: You can use this dataset for this question.

In [1]:
import pandas as pd
df = pd.read_csv("indian_liver_patient.csv")

In [2]:
df

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.90,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.00,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.40,1
...,...,...,...,...,...,...,...,...,...,...,...
578,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
579,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.10,1
580,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.00,1
581,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.00,1


In [3]:
df.Dataset.value_counts()

1    416
2    167
Name: Dataset, dtype: int64
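
The counts above show that the two classes are imbalanced, so it helps to know how well a model would do by always predicting the majority class. The following is a minimal sketch in plain Python that uses only the two counts printed above (taken before any rows are dropped); the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {1: 416, 2: 167}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy obtained by always predicting the majority class
baseline_accuracy = counts[majority_class] / total

print(f"total rows: {total}")
print(f"majority class: {majority_class}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy figure reported for the classifier should be read against this baseline, which is one reason the notebook evaluates with the area under the ROC curve instead.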

In [4]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql.functions import col



In [5]:
# Create a SparkSession
spark = SparkSession.builder.appName("ChemicalClassify").getOrCreate()

# Load the dataset
dataset_path = "indian_liver_patient.csv"
data = spark.read.csv(dataset_path, header=True, inferSchema=True)

data

DataFrame[Age: int, Gender: string, Total_Bilirubin: double, Direct_Bilirubin: double, Alkaline_Phosphotase: int, Alamine_Aminotransferase: int, Aspartate_Aminotransferase: int, Total_Protiens: double, Albumin: double, Albumin_and_Globulin_Ratio: double, Dataset: int]

In [6]:
# Drop any rows with missing values
data = data.dropna()
data

DataFrame[Age: int, Gender: string, Total_Bilirubin: double, Direct_Bilirubin: double, Alkaline_Phosphotase: int, Alamine_Aminotransferase: int, Aspartate_Aminotransferase: int, Total_Protiens: double, Albumin: double, Albumin_and_Globulin_Ratio: double, Dataset: int]

In [7]:
# Convert the Dataset column (values 1 and 2) to a binary label column (0 and 1)
data = data.withColumn("label", (col("Dataset") - 1))
data

DataFrame[Age: int, Gender: string, Total_Bilirubin: double, Direct_Bilirubin: double, Alkaline_Phosphotase: int, Alamine_Aminotransferase: int, Aspartate_Aminotransferase: int, Total_Protiens: double, Albumin: double, Albumin_and_Globulin_Ratio: double, Dataset: int, label: int]

In [8]:
from pyspark.ml.feature import StringIndexer

# Encode the Gender string column as a numeric index column
gender_indexer = StringIndexer(inputCol="Gender", outputCol="GenderIndex")
data = gender_indexer.fit(data).transform(data)
data

DataFrame[Age: int, Gender: string, Total_Bilirubin: double, Direct_Bilirubin: double, Alkaline_Phosphotase: int, Alamine_Aminotransferase: int, Aspartate_Aminotransferase: int, Total_Protiens: double, Albumin: double, Albumin_and_Globulin_Ratio: double, Dataset: int, label: int, GenderIndex: double]
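
By default, `StringIndexer` orders labels by descending frequency, so the most common string receives index 0.0. The sketch below mimics that ordering in plain Python to show what the `GenderIndex` values mean. It is an illustration, not Spark's implementation; `frequency_desc_index` and the toy `genders` list are hypothetical names and data, not rows from the dataset.

```python
from collections import Counter


def frequency_desc_index(values):
    """Map each distinct string to a float index, most frequent first.

    Ties are broken alphabetically in this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical toy column, not taken from the dataset
genders = ["Male", "Female", "Male", "Male", "Female"]

mapping = frequency_desc_index(genders)
encoded = [mapping[g] for g in genders]

print(mapping)
print(encoded)
```

Because the index depends on the label frequencies seen at fit time, the same fitted indexer should be reused when encoding new data for prediction.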

In [9]:
# Select the input features column and the label column
selected_columns = ["Age", "GenderIndex", "Total_Bilirubin", "Direct_Bilirubin", "Alkaline_Phosphotase", "Alamine_Aminotransferase",
                    "Aspartate_Aminotransferase", "Total_Protiens", "Albumin", "Albumin_and_Globulin_Ratio", "label"]
data = data.select(*selected_columns)

In [10]:
# Create a vector assembler to combine the features into a single vector column
assembler = VectorAssembler(inputCols=["Age", "GenderIndex", "Total_Bilirubin", "Direct_Bilirubin", "Alkaline_Phosphotase",
                                       "Alamine_Aminotransferase", "Aspartate_Aminotransferase", "Total_Protiens",
                                       "Albumin", "Albumin_and_Globulin_Ratio"], outputCol="features")

In [11]:
# Create a Random Forest classifier
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

In [12]:
# Create a pipeline to chain the vector assembler and the random forest classifier
pipeline = Pipeline(stages=[assembler, rf])

In [13]:
# Split the data into training and testing sets
(train_data, test_data) = data.randomSplit([0.8, 0.2], seed=42)
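
`randomSplit` assigns rows to splits by random sampling, so the resulting sizes are only approximately 80/20 rather than exact. The plain-Python sketch below illustrates the idea with one random draw per row; `bernoulli_split` is a hypothetical helper, and Spark's actual sampling and seeding differ.

```python
import random


def bernoulli_split(rows, train_fraction=0.8, seed=42):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Hypothetical stand-in for the rows of a DataFrame
rows = list(range(1000))
train, test = bernoulli_split(rows)

# Sizes are close to 800 and 200, but not guaranteed to match exactly
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is what `seed=42` is used for in the cell above.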

In [14]:
train_data

DataFrame[Age: int, GenderIndex: double, Total_Bilirubin: double, Direct_Bilirubin: double, Alkaline_Phosphotase: int, Alamine_Aminotransferase: int, Aspartate_Aminotransferase: int, Total_Protiens: double, Albumin: double, Albumin_and_Globulin_Ratio: double, label: int]

In [15]:
# Train the model
model = pipeline.fit(train_data)

In [17]:
# Make predictions on the test data
predictions = model.transform(test_data)
predictions.show()

+---+-----------+---------------+----------------+--------------------+------------------------+--------------------------+--------------+-------+--------------------------+-----+--------------------+--------------------+--------------------+----------+
|Age|GenderIndex|Total_Bilirubin|Direct_Bilirubin|Alkaline_Phosphotase|Alamine_Aminotransferase|Aspartate_Aminotransferase|Total_Protiens|Albumin|Albumin_and_Globulin_Ratio|label|            features|       rawPrediction|         probability|prediction|
+---+-----------+---------------+----------------+--------------------+------------------------+--------------------------+--------------+-------+--------------------------+-----+--------------------+--------------------+--------------------+----------+
|  6|        0.0|            0.6|             0.1|                 289|                      38|                        30|           4.8|    2.0|                       0.7|    1|[6.0,0.0,0.6,0.1,...|[10.8037159303027...|[0.54018579651513

In [18]:
# Evaluate the model using the area under the ROC curve
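# rawPredictionCol and metricName are set to their default values, spelled out for clarity
evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction",
                                          metricName="areaUnderROC")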
auc = evaluator.evaluate(predictions)

print("Area under ROC curve:", auc)

# Stop the SparkSession
spark.stop()

Area under ROC curve: 0.7388429752066117
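
The area under the ROC curve can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it that way in plain Python on a small hypothetical example. It is meant to explain the metric, not to reproduce how `BinaryClassificationEvaluator` computes it; `roc_auc`, `labels`, and `scores` are illustrative names and data.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores for the positive class
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = roc_auc(labels, scores)
print(auc_value)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the 0.74 reported above means the model ranks positives above negatives noticeably better than chance.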
