# Classification

For our example, we’ll use the AI4I 2020 Predictive Maintenance Dataset
(2020), UCI Machine Learning Repository
(https://doi.org/10.24432/C5HS5C). This synthetic dataset is modeled after
an existing milling machine and consists of 10,000 data points stored as rows,
with 14 columns each.

In [1]:
import warnings
warnings.filterwarnings("ignore")

### Import the required libraries and create a Spark session
The following lines of code import SparkSession, create a Spark session, and
set the log level to ERROR:

In [2]:
#import the library and create spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ClassificationExample").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

25/11/13 18:17:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/13 18:17:53 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/11/13 18:17:53 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


### Load data and create dataframe
The following code snippet loads the dataset into a DataFrame, inferring the
column types from the CSV file, and displays the first 3 rows:

In [3]:
train_df = spark.read.csv("machine_failure_data.csv", header=True, inferSchema=True)
train_df.show(3, truncate=False)

+---+----------+----+-----------------+---------------------+--------------------+---------+-------------+---------------+-----------------+------------------------+-------------+------------------+---------------+
|UDI|Product ID|type|air_temperature_k|process_temperature_k|rotational_speed_rpm|torque_nm|tool_wear_min|machine_failure|tool_wear_failure|heat_dissipation_failure|power_failure|overstrain_failure|random_failures|
+---+----------+----+-----------------+---------------------+--------------------+---------+-------------+---------------+-----------------+------------------------+-------------+------------------+---------------+
|1  |M14860    |M   |298.1            |308.6                |1551                |42.8     |0            |0              |0                |0                       |0            |0                 |0              |
|2  |L47181    |L   |298.2            |308.7                |1408                |46.3     |3            |0              |0                |

### Preprocessing the data: Convert all integer columns to float, remove null values and print the schema

The following code snippet converts integer columns into floats, removes the
null values, and prints the data schema:


In [4]:
# Convert all integer columns in train_df to float
from pyspark.sql.types import IntegerType, FloatType
for col in train_df.columns:
    if train_df.schema[col].dataType == IntegerType():
        train_df = train_df.withColumn(col, train_df[col].cast(FloatType()))
# remove null values
train_df = train_df.dropna()
train_df.printSchema()

root
 |-- UDI: float (nullable = true)
 |-- Product ID: string (nullable = true)
 |-- type: string (nullable = true)
 |-- air_temperature_k: double (nullable = true)
 |-- process_temperature_k: double (nullable = true)
 |-- rotational_speed_rpm: float (nullable = true)
 |-- torque_nm: double (nullable = true)
 |-- tool_wear_min: float (nullable = true)
 |-- machine_failure: float (nullable = true)
 |-- tool_wear_failure: float (nullable = true)
 |-- heat_dissipation_failure: float (nullable = true)
 |-- power_failure: float (nullable = true)
 |-- overstrain_failure: float (nullable = true)
 |-- random_failures: float (nullable = true)



### Prepare the data by removing the ID columns and the failure-mode columns, retaining only the machine failure column


The following code snippet removes the columns that are not used for
training: the identifier columns (UDI and Product ID) and the five
failure-mode columns. Only machine_failure is kept as the label:


In [5]:
# drop the ID columns (UDI, Product ID) and the failure-mode columns (heat_dissipation_failure, power_failure, overstrain_failure, random_failures, tool_wear_failure)
train_df = train_df.drop("UDI","Product ID","heat_dissipation_failure","power_failure","overstrain_failure","random_failures","tool_wear_failure")
train_df.show(3,truncate=False) 

+----+-----------------+---------------------+--------------------+---------+-------------+---------------+
|type|air_temperature_k|process_temperature_k|rotational_speed_rpm|torque_nm|tool_wear_min|machine_failure|
+----+-----------------+---------------------+--------------------+---------+-------------+---------------+
|M   |298.1            |308.6                |1551.0              |42.8     |0.0          |0.0            |
|L   |298.2            |308.7                |1408.0              |46.3     |3.0          |0.0            |
|L   |298.1            |308.5                |1498.0              |49.4     |5.0          |0.0            |
+----+-----------------+---------------------+--------------------+---------+-------------+---------------+
only showing top 3 rows



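### Feature Engineering: Fill missing values, convert the type column to index

The following code snippet replaces any missing values with 0 and converts the
categorical type column into a numerical index: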

In [6]:
# replace any remaining missing values with 0 (a no-op here, because dropna() already removed rows with nulls)
train_df = train_df.fillna(0)
# convert the categorical type column to a numeric index
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="type", outputCol="type_index")
train_df = indexer.fit(train_df).transform(train_df)
train_df.show(3)


+----+-----------------+---------------------+--------------------+---------+-------------+---------------+----------+
|type|air_temperature_k|process_temperature_k|rotational_speed_rpm|torque_nm|tool_wear_min|machine_failure|type_index|
+----+-----------------+---------------------+--------------------+---------+-------------+---------------+----------+
|   M|            298.1|                308.6|              1551.0|     42.8|          0.0|            0.0|       1.0|
|   L|            298.2|                308.7|              1408.0|     46.3|          3.0|            0.0|       0.0|
|   L|            298.1|                308.5|              1498.0|     49.4|          5.0|            0.0|       0.0|
+----+-----------------+---------------------+--------------------+---------+-------------+---------------+----------+
only showing top 3 rows
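In the output above, L maps to 0.0 and M to 1.0. This is consistent with StringIndexer's default ordering, which assigns index 0 to the most frequent label. The plain-Python sketch below illustrates that ordering on a small made-up sample; `frequency_index` is a hypothetical helper, not part of PySpark, and the alphabetical tie-break is an assumption of the sketch.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up sample: L appears three times, M twice, H once.
sample = ["M", "L", "L", "L", "H", "M"]
mapping = frequency_index(sample)
print(mapping)  # {'L': 0.0, 'M': 1.0, 'H': 2.0}
```

Because the index reflects frequency rather than any natural order of the categories, a linear model treats it as an ordinary number; one-hot encoding is a common alternative when that is not desired.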



### Feature Engineering using VectorAssembler
The following code snippet uses VectorAssembler to combine the five numeric
columns and type_index into a single features vector column:


In [7]:
# apply vector assembler
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["air_temperature_k", "process_temperature_k", "rotational_speed_rpm", "torque_nm", "tool_wear_min","type_index"], outputCol="features")
train_df = assembler.transform(train_df)
train_df.select("features", "machine_failure").show(3)

+--------------------+---------------+
|            features|machine_failure|
+--------------------+---------------+
|[298.1,308.6,1551...|            0.0|
|[298.2,308.7,1408...|            0.0|
|[298.1,308.5,1498...|            0.0|
+--------------------+---------------+
only showing top 3 rows



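### Apply the standard scaler and show the scaled features

The following code snippet fits a StandardScaler on the features column and
writes the scaled vectors to a new scaledFeatures column: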

In [8]:
# apply standard scaler
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(train_df)
train_df = scalerModel.transform(train_df)
# show the scaled features
train_df.select("scaledFeatures", "machine_failure").show(3)

+--------------------+---------------+
|      scaledFeatures|machine_failure|
+--------------------+---------------+
|[149.030724148837...|            0.0|
|[149.080717682600...|            0.0|
|[149.030724148837...|            0.0|
+--------------------+---------------+
only showing top 3 rows
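The scaled air temperatures are close to 149 rather than close to 0. With its default settings, StandardScaler divides each feature by its sample standard deviation but does not subtract the mean (`withStd=True`, `withMean=False`), so 298.1 K mapping to about 149.03 implies a standard deviation of roughly 2 K. The sketch below reproduces that behavior in plain Python on made-up values; `scale_by_std` is a hypothetical helper, not a PySpark function.

```python
import math


def scale_by_std(values):
    """Divide each value by the sample standard deviation, without centering."""
    n = len(values)
    mean = sum(values) / n
    # Corrected sample variance: divide by n - 1 rather than n.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std = math.sqrt(variance)
    return [v / std for v in values]


# Made-up values whose sample standard deviation is exactly 2.0.
toy = [2.0, 4.0, 6.0]
scaled = scale_by_std(toy)
print(scaled)  # [1.0, 2.0, 3.0]
```

To center the features as well, StandardScaler accepts `withMean=True`, at the cost of producing dense vectors.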



### Split the data into train and test and select the features and label

In [9]:
# split the data 70/30 (no seed is set, so the split differs between runs)
train_df, test_df = train_df.randomSplit([0.7, 0.3])
# select the features and label
train_df = train_df.select("scaledFeatures", "machine_failure")
test_df = test_df.select("scaledFeatures", "machine_failure")
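In the steps above, the indexer and the scaler were fitted on the full dataset before it was split, so statistics from the test rows influenced the preprocessing. One alternative, sketched here under assumptions, is to split first and wrap the same stages in a `Pipeline` that is fitted on the training rows only. The helper names `build_pipeline` and `fit_on_split` and the seed value are hypothetical; the input is assumed to be the DataFrame as it stands after the unwanted columns are dropped, with the `type` column still present.

```python
# Column names are taken from the DataFrame shown earlier in this section.
FEATURE_COLS = [
    "air_temperature_k", "process_temperature_k", "rotational_speed_rpm",
    "torque_nm", "tool_wear_min", "type_index",
]
LABEL_COL = "machine_failure"


def build_pipeline():
    """Return an unfitted Pipeline: index -> assemble -> scale -> classify."""
    # Imported lazily; calling this function requires an active SparkSession.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StandardScaler, StringIndexer, VectorAssembler

    indexer = StringIndexer(inputCol="type", outputCol="type_index")
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
    lr = LogisticRegression(featuresCol="scaledFeatures", labelCol=LABEL_COL)
    return Pipeline(stages=[indexer, assembler, scaler, lr])


def fit_on_split(df, seed=42):
    """Split first, then fit every stage on the training rows only."""
    train, test = df.randomSplit([0.7, 0.3], seed=seed)
    model = build_pipeline().fit(train)
    # The fitted stages are applied to the held-out rows without refitting.
    return model, model.transform(test)
```

Passing a fixed `seed` to `randomSplit` makes the split repeatable for the same input DataFrame, which makes evaluation scores easier to compare across runs.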

### Train the model using Logistic Regression

The following code snippet trains a LogisticRegression model on the training
set, using scaledFeatures as the input and machine_failure as the label. The
model’s performance is evaluated in the next step.


In [10]:
# train the model
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol="scaledFeatures", labelCol="machine_failure")
lrModel = lr.fit(train_df)

25/11/13 18:20:53 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
25/11/13 18:20:53 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
25/11/13 18:20:53 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
25/11/13 18:20:54 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
25/11/13 18:20:54 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
25/11/13 18:20:55 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
25/11/13 18:20:55 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
25/11/13 18:20:55 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
25/11/13 18:20:55 ERROR LBFGS: Failure! Resetting history: breeze.optimize.First
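A fitted binary logistic regression model predicts a failure probability by applying the sigmoid function to a weighted sum of the features. The plain-Python sketch below shows that calculation; in PySpark the learned weights are available as `lrModel.coefficients` and `lrModel.intercept`, while the numbers used here are made up for illustration and `predict_probability` is a hypothetical helper.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Return P(label = 1) for one feature vector under a logistic model."""
    # Linear score: intercept plus the dot product of weights and features.
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    # The sigmoid squashes the score into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


# Made-up weights and inputs, not values learned from the dataset.
p_neutral = predict_probability([0.0, 0.0], [1.0, 2.0], 0.0)
p_high = predict_probability([1.0, 1.0], [1.0, 2.0], 0.0)
print(p_neutral, p_high)
```

A score of zero gives a probability of exactly 0.5; by default the model predicts class 1 when the probability exceeds 0.5.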

### Evaluating the model

In [11]:
# evaluate the model on the test set; the default metric is the area under the ROC curve
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(labelCol="machine_failure")
evaluator.evaluate(lrModel.transform(test_df))

0.8313237612372716
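BinaryClassificationEvaluator reports the area under the ROC curve (AUC) by default, so the value above is an AUC rather than an accuracy. One way to read it: the AUC is the probability that a randomly chosen failure receives a higher score than a randomly chosen non-failure. The sketch below computes that pairwise definition in plain Python on made-up scores; it illustrates the metric and is not how Spark computes it internally.

```python
def area_under_roc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up scores: three of the four positive/negative pairs are ordered correctly.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
auc = area_under_roc(scores, labels)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.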

### Summary
In this chapter, we explored the fundamental concepts and techniques of
classification within supervised learning. Classification is a pivotal task
in machine learning: it assigns each data point to one of a set of
predefined classes.

At this point, you should have a solid understanding of classification and
be well prepared to apply these techniques to your own problems.