# <font color="red"> MBA in AI and Big Data</font>
## <span style="color:red">Course 03: Large-Scale Data Management and Parallel Processing</span>

### <span style="color:darkred">PySpark Classification with the Logistic Regression Algorithm and Multiple Classifiers</span>

*Prof. Dr. Jose Fernando Rodrigues Junior*<br>
*ICMC-USP São Carlos*

**LOGISTIC REGRESSION**

Logistic Regression is a supervised learning method for binary classification: it estimates the probability that an observation belongs to one of two classes. It models the relationship between the independent variables (features) and the binary dependent variable (class) with the logistic (sigmoid) function, which maps any real value into the interval between 0 and 1. The logistic function is defined as \( \sigma(z) = \frac{1}{1 + e^{-z}} \), where \( z \) is a linear combination of the input variables. Training adjusts the feature weights by minimizing a cost function, usually the logarithmic loss, which is equivalent to maximizing the likelihood of the observed data. The result is a model that outputs class probabilities, and a threshold decides which of the two classes an observation is assigned to. For problems with more than two classes, such as the three-class Iris dataset used below, the multinomial variant (`family="multinomial"`) replaces the sigmoid with the softmax function.
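
The pieces described above (linear score, sigmoid, log loss, and decision threshold) can be sketched in plain Python. This is only an illustration of the formulas, not how Spark implements them; the weights, bias, and observations are hypothetical values chosen for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """P(class = 1 | features) for the linear score z = w . x + b."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def log_loss(y_true, p_pred):
    """Average negative log-likelihood of binary labels."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    ) / len(y_true)

# Hypothetical weights and observations, for illustration only
weights, bias = [1.5, -2.0], 0.25
observations = [[0.2, 0.9], [1.4, 0.1], [0.8, 0.3]]
labels = [0, 1, 1]

probabilities = [predict_proba(weights, bias, x) for x in observations]
predicted = [1 if p >= 0.5 else 0 for p in probabilities]  # threshold of 0.5
loss = log_loss(labels, probabilities)

print(probabilities, predicted, loss)
```

Training searches for the weights and bias that make `loss` as small as possible on the training data; the threshold is applied only afterwards, when probabilities are turned into class labels.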

In [6]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [17]:
# Initialize Spark session
spark = SparkSession.builder.appName("LogisticRegressionIris").getOrCreate()

In [18]:
# Load the Iris dataset in libsvm format
data = spark.read.format("libsvm").load("iris.txt")

# Show the schema to verify
data.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



In [19]:
# Split the data into training and testing sets
train, test = data.randomSplit([0.7, 0.3], seed=1)
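
A random weighted split assigns each row to a partition by chance, so the 70/30 proportions are only approximate, and the seed is what makes the assignment repeatable. The sketch below illustrates that idea in plain Python; it is not Spark's implementation, the helper `random_split` is hypothetical, and the 150 rows assume the usual size of the Iris dataset.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row.

    Illustrative only: mimics weighted random assignment,
    not Spark's actual implementation.
    """
    total = sum(weights)
    # Cumulative upper bounds of each split after normalizing the weights
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split catches any rounding leftover
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(150))  # stand-in for the 150 Iris rows (assumed size)
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=1)
print(len(train_rows), len(test_rows))
```

Calling the function again with the same seed returns the same partition; changing the seed gives a different one, and with it a different test accuracy.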

In [20]:
# Create the LogisticRegression model
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10, family="multinomial")

# Train the model
lr_model = lr.fit(train)
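
With `family="multinomial"`, the model produces one linear score per class and converts the scores into probabilities with the softmax function. A minimal plain-Python sketch of that step follows; the scores are hypothetical, and the code illustrates the formula rather than Spark's internals.

```python
import math

def softmax(scores):
    """Turn one linear score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max improves numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores z_k = w_k . x + b_k for the three Iris classes
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)

# The predicted class is the one with the highest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```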

In [21]:
# Make the predictions using the test dataset
predictions = lr_model.transform(test)

In [22]:
# Evaluate the model's performance
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

# Print out the accuracy
print(f"Test set accuracy = {accuracy:.4f}")

Test set accuracy = 0.9250
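
The accuracy metric is the fraction of test rows whose prediction equals the label. The sketch below computes it in plain Python on hypothetical labels and predictions (not the notebook's data) to make the definition concrete.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical labels and predictions, for illustration only
labels = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 2.0, 1.0]
predictions = [0.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0]

print(f"accuracy = {accuracy(labels, predictions):.4f}")
```

Accuracy treats every error the same way, so it says nothing about which classes are being confused; inspecting individual predictions helps with that.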


In [23]:
# Show some predictions
predictions.select("features", "label", "prediction").show(truncate=50)

+--------------------------------------------------+-----+----------+
|                                          features|label|prediction|
+--------------------------------------------------+-----+----------+
|   (4,[0,1,2,3],[-0.166667,-0.416667,0.38983,0.5])|  0.0|       0.0|
|(4,[0,1,2,3],[-0.166667,-0.333333,0.38983,0.916...|  0.0|       0.0|
|   (4,[0,1,2,3],[0.111111,-0.583333,0.355932,0.5])|  0.0|       0.0|
|(4,[0,1,2,3],[0.111111,-0.416667,0.322034,0.416...|  0.0|       0.0|
|(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.1666...|  0.0|       2.0|
|  (4,[0,1,2,3],[0.111111,-0.25,0.559322,0.416667])|  0.0|       0.0|
|   (4,[0,1,2,3],[0.111111,0.0833333,0.694915,1.0])|  0.0|       0.0|
|(4,[0,1,2,3],[0.333333,0.0833333,0.59322,0.6666...|  0.0|       0.0|
|(4,[0,1,2,3],[0.444444,-0.0833334,0.38983,0.833...|  0.0|       0.0|
|(4,[0,1,2,3],[0.444444,-0.0833334,0.491525,0.66...|  0.0|       0.0|
|  (4,[0,1,2,3],[0.611111,-0.166667,0.627119,0.25])|  0.0|       2.0|
|(4,[0,1,2,3],[0.833

In [24]:
# Stop the Spark session
spark.stop()

**MULTIPLE CLASSIFIERS**

In [25]:
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [26]:
# Initialize Spark session
spark = SparkSession.builder.appName("MultipleClassifiersOnIris").getOrCreate()

In [27]:
# Load the Iris dataset in libsvm format
data = spark.read.format("libsvm").load("iris.txt")

# Show the schema to verify
data.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



In [28]:
# Split the data into training and testing sets
# (no seed is set here, so the split and the accuracies below change from run to run)
train, test = data.randomSplit([0.7, 0.3])

In [29]:
# List of classifiers to evaluate
classifiers = [
    LogisticRegression(maxIter=10, featuresCol="features", labelCol="label"),
    DecisionTreeClassifier(featuresCol="features", labelCol="label"),
    RandomForestClassifier(numTrees=10, featuresCol="features", labelCol="label"),
    MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], blockSize=128, seed=1234, featuresCol="features", labelCol="label")
]
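
In `layers=[4, 5, 4, 3]`, the first entry matches the 4 input features, the last matches the 3 classes, and the two middle entries are hidden layers. As a rough sketch, assuming a fully connected network with one bias per unit, the number of trainable parameters can be counted as follows; `mlp_parameter_count` is a hypothetical helper, not part of PySpark.

```python
def mlp_parameter_count(layers):
    """Weights plus biases of a fully connected network.

    Each pair of consecutive layers contributes (n_in + 1) * n_out
    parameters: n_in weights and 1 bias for each of the n_out units.
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

layers = [4, 5, 4, 3]  # 4 features -> 5 hidden -> 4 hidden -> 3 classes
n_params = mlp_parameter_count(layers)
print(n_params)  # (4+1)*5 + (5+1)*4 + (4+1)*3
```

A network this small is a reasonable match for a dataset with only a few dozen training rows per class; larger hidden layers would add parameters faster than the data can constrain them.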

In [30]:
# Initialize the evaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

In [31]:
# Iterate over the classifiers, fit the model, and evaluate accuracy
for classifier in classifiers:
    # Train the model
    model = classifier.fit(train)

    # Make predictions on the test set
    predictions = model.transform(test)

    # Evaluate the model's performance
    accuracy = evaluator.evaluate(predictions)

    # Print out the classifier name and its test set accuracy
    print(f"Test set accuracy with {classifier.__class__.__name__} = {accuracy:.4f}")

Test set accuracy with LogisticRegression = 0.9762
Test set accuracy with DecisionTreeClassifier = 0.9524
Test set accuracy with RandomForestClassifier = 0.9524
Test set accuracy with MultilayerPerceptronClassifier = 0.9762
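
The four accuracies take only two distinct values, which suggests a small test set where one misclassified row moves the metric visibly. The sketch below enumerates the test-set sizes for which both printed values are reachable as `correct / size` rounded to four decimals. It assumes the dataset has at most 150 rows (the usual Iris size), and the helper `consistent_sizes` is hypothetical.

```python
def consistent_sizes(reported, max_rows):
    """Test-set sizes n for which every reported accuracy equals
    some k / n when formatted with four decimals."""
    sizes = []
    for n in range(1, max_rows + 1):
        reachable = {f"{k / n:.4f}" for k in range(n + 1)}
        if all(acc in reachable for acc in reported):
            sizes.append(n)
    return sizes

reported = ["0.9762", "0.9524"]  # the two distinct accuracies printed above
candidates = consistent_sizes(reported, max_rows=150)
print(candidates)
```

Since a 30% split of about 150 rows holds roughly 45 rows, the smallest candidate is the plausible one. Under that reading, the gap between 0.9762 and 0.9524 is a single misclassified flower, so this run alone does not support ranking the classifiers.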


In [32]:
# Stop the Spark session
spark.stop()