<a href="https://colab.research.google.com/github/ldselvera/titanic_classification_spark/blob/main/titanic_spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark 

Apache Spark is a cluster-based distributed computing engine, and PySpark is its Python API. Spark ships with libraries for tasks including:
*   SQL querying
*   streaming data processing
*   machine learning

## Spark Installation

In [7]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
     |████████████████████████████████| 281.4 MB 34 kB/s
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 54.0 MB/s
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=eb8aaf73956cd515281bdbd2575d803aa38c0a73265759a2f31f428788344898
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Spark Session

In [31]:
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer

In [93]:
# spark = SparkSession.builder.getOrCreate()
spark = SparkSession.builder.appName("FirstSparkApplication").config("spark.executor.memory", "8g").getOrCreate()

# Exploratory Data Analysis

The “Titanic” dataset is used here; it can be downloaded from the Kaggle website [here](https://www.kaggle.com/c/titanic/data).

In [100]:
training_dataset = spark.read.format("csv").option("inferSchema", True).option("header", "true").load('/content/titanic_train.csv')
test_dataset = spark.read.format("csv").option("inferSchema", True).option("header", "true").load('/content/test.csv')

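training_dataset.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)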

In [101]:
print("Unique Passenger Counts")
training_dataset.agg(countDistinct("PassengerId")).show()

Unique Passenger Counts
+------------------+
|count(PassengerId)|
+------------------+
|               891|
+------------------+



In [102]:
print("Test Dataset Row Count")
test_dataset.count()

Test Dataset Row Count


418

In [104]:
training_dataset.show(n=10)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

In [105]:
training_dataset.show(n=2, truncate=False, vertical=True)

-RECORD 0----------------------------------------------------------
 PassengerId | 1                                                   
 Survived    | 0                                                   
 Pclass      | 3                                                   
 Name        | Braund, Mr. Owen Harris                             
 Sex         | male                                                
 Age         | 22.0                                                
 SibSp       | 1                                                   
 Parch       | 0                                                   
 Ticket      | A/5 21171                                           
 Fare        | 7.25                                                
 Cabin       | null                                                
 Embarked    | S                                                   
-RECORD 1----------------------------------------------------------
 PassengerId | 2                                

In [106]:
training_dataset.describe().show(3,vertical=True)

-RECORD 0--------------------------
 summary     | count               
 PassengerId | 891                 
 Survived    | 891                 
 Pclass      | 891                 
 Name        | 891                 
 Sex         | 891                 
 Age         | 714                 
 SibSp       | 891                 
 Parch       | 891                 
 Ticket      | 891                 
 Fare        | 891                 
 Cabin       | 204                 
 Embarked    | 889                 
-RECORD 1--------------------------
 summary     | mean                
 PassengerId | 446.0               
 Survived    | 0.3838383838383838  
 Pclass      | 2.308641975308642   
 Name        | null                
 Sex         | null                
 Age         | 29.69911764705882   
 SibSp       | 0.5230078563411896  
 Parch       | 0.38159371492704824 
 Ticket      | 260318.54916792738  
 Fare        | 32.2042079685746    
 Cabin       | null                
 Embarked    | null         

Next, we count the NaN, null, and non-null values in each column.

In [107]:
# Counting the number of null values
from pyspark.sql.functions import *

print ("NaN values\n")
training_dataset.select([count(when(isnan(item), item)).alias(item) for item in training_dataset.columns]).show(5)

print ("Null values\n")
training_dataset.select([count(when(col(item).isNull(), item)).alias(item) for item in training_dataset.columns]).show(5)

print ("Not Null values\n")
training_dataset.select([count(when(col(item).isNotNull(), item)).alias(item) for item in training_dataset.columns]).show(5)

NaN values

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+

Null values

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|177|    0|    0|     0|   0|  687|       2|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+

Not Null values

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|
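
As a quick cross-check, the null counts above should equal the total row count minus the non-null `count` row reported by `describe()` earlier. The following is a minimal plain-Python sketch of that arithmetic, with the numbers copied from the outputs above; the variable names are only illustrative.

```python
# Values copied from the describe() output above
total_rows = 891
non_null_counts = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Nulls per column = total rows - non-null rows
null_counts = {column: total_rows - count for column, count in non_null_counts.items()}

for column, nulls in null_counts.items():
    print(f"{column}: {nulls} nulls ({nulls / total_rows:.1%} of rows)")
```

The results agree with the `Null values` table: only `Age`, `Cabin`, and `Embarked` have missing entries.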

In [108]:
print("Renaming Column Name")
training_dataset = training_dataset.withColumnRenamed("Pclass","PassengerClasses").withColumnRenamed("Sex","Gender")
training_dataset

Renaming Column Name


DataFrame[PassengerId: int, Survived: int, PassengerClasses: int, Name: string, Gender: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string]

In [109]:
print("Counting the number of Passenger per Classes")
training_dataset.groupBy("PassengerClasses").count().sort("PassengerClasses").show()


print("Counting the number of Survivals by Classes")
training_dataset.groupBy("PassengerClasses",
                         "Gender",
                         "Survived").count().sort("PassengerClasses",
                                                  "Gender",
                                                  "Survived").show()

Counting the number of Passenger per Classes
+----------------+-----+
|PassengerClasses|count|
+----------------+-----+
|               1|  216|
|               2|  184|
|               3|  491|
+----------------+-----+

Counting the number of Survivals by Classes
+----------------+------+--------+-----+
|PassengerClasses|Gender|Survived|count|
+----------------+------+--------+-----+
|               1|female|       0|    3|
|               1|female|       1|   91|
|               1|  male|       0|   77|
|               1|  male|       1|   45|
|               2|female|       0|    6|
|               2|female|       1|   70|
|               2|  male|       0|   91|
|               2|  male|       1|   17|
|               3|female|       0|   72|
|               3|female|       1|   72|
|               3|  male|       0|  300|
|               3|  male|       1|   47|
+----------------+------+--------+-----+
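
The grouped counts are easier to read as survival rates. The sketch below redoes that arithmetic in plain Python with the numbers copied from the table above; the dictionary layout and variable names are assumptions made for illustration and are not part of the notebook.

```python
# (class, gender) -> (died, survived), copied from the grouped counts above
counts = {
    (1, "female"): (3, 91),
    (1, "male"): (77, 45),
    (2, "female"): (6, 70),
    (2, "male"): (91, 17),
    (3, "female"): (72, 72),
    (3, "male"): (300, 47),
}

# Survival rate within each (class, gender) group
survival_rates = {
    group: survived / (died + survived)
    for group, (died, survived) in counts.items()
}

total_passengers = sum(died + survived for died, survived in counts.values())
total_survivors = sum(survived for _, survived in counts.values())
overall_rate = total_survivors / total_passengers

for (passenger_class, gender), rate in survival_rates.items():
    print(f"class {passenger_class}, {gender}: {rate:.1%}")
print(f"overall: {overall_rate:.1%}")
```

The overall rate computed this way matches the mean of `Survived` reported by `describe()` earlier.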



# Feature Engineering

The `Name` column in the Titanic dataset also includes each passenger's title (for example, Mr, Mrs, or Miss). This information might help the model, so we extract it into a new `Title` column with `withColumn` and a regular expression.

In [110]:
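# Capture the run of letters that precedes the first period in Name, e.g. "Mr" (raw string avoids an invalid-escape warning)
training_dataset = training_dataset.withColumn("Title", regexp_extract(col("Name"), r"([A-Za-z]+)\.", 1))
training_dataset.select("Name","Title").show(10)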

+--------------------+------+
|                Name| Title|
+--------------------+------+
|Braund, Mr. Owen ...|    Mr|
|Cumings, Mrs. Joh...|   Mrs|
|Heikkinen, Miss. ...|  Miss|
|Futrelle, Mrs. Ja...|   Mrs|
|Allen, Mr. Willia...|    Mr|
|    Moran, Mr. James|    Mr|
|McCarthy, Mr. Tim...|    Mr|
|Palsson, Master. ...|Master|
|Johnson, Mrs. Osc...|   Mrs|
|Nasser, Mrs. Nich...|   Mrs|
+--------------------+------+
only showing top 10 rows
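
To see what the pattern captures, the same regular expression can be tried with Python's `re` module. This is only a plain-Python sketch, not the Spark call itself: the helper `extract_title` is a hypothetical name, and the last two names are invented for illustration. It returns an empty string when nothing matches, which is also what `regexp_extract` does.

```python
import re

TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")

def extract_title(name: str) -> str:
    """Return the first run of letters followed by a period, or "" if none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""

names = [
    "Braund, Mr. Owen Harris",        # from the dataset output above
    "Moran, Mr. James",               # from the dataset output above
    "Doe, the Countess. of Example",  # hypothetical name
    "No title here",                  # hypothetical name without a title
]
titles = [extract_title(name) for name in names]
print(titles)
```

Because only the letters directly before the period are captured, a multi-word title such as "the Countess" is reduced to `Countess`.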



In [111]:
training_dataset.groupBy("Title").count().show()

+--------+-----+
|   Title|count|
+--------+-----+
|     Don|    1|
|    Miss|  182|
|Countess|    1|
|     Col|    2|
|     Rev|    6|
|    Lady|    1|
|  Master|   40|
|     Mme|    1|
|    Capt|    1|
|      Mr|  517|
|      Dr|    7|
|     Mrs|  125|
|     Sir|    1|
|Jonkheer|    1|
|    Mlle|    2|
|   Major|    2|
|      Ms|    1|
+--------+-----+



Several titles are rare, or are variants of a more common one (for example, `Mlle` and `Ms` for `Miss`). They can be merged into a small number of groups with the `replace` function, as follows.

In [112]:
# Merge rare or variant titles into a few groups; subset limits the replacement to the Title column
feature_df = training_dataset.replace(
    ["Mme",
     "Mlle", "Ms",
     "Major", "Dr", "Capt", "Col", "Rev",
     "Lady", "Dona", "the Countess", "Countess", "Don", "Sir", "Jonkheer", "Master"],
    ["Mrs",
     "Miss", "Miss",
     "Ranked", "Ranked", "Ranked", "Ranked", "Ranked",
     "Royalty", "Royalty", "Royalty", "Royalty", "Royalty", "Royalty", "Royalty", "Royalty"],
    subset=["Title"])

feature_df.groupBy("Title").count().sort(desc("count")).show()

+-------+-----+
|  Title|count|
+-------+-----+
|     Mr|  517|
|   Miss|  185|
|    Mrs|  126|
|Royalty|   45|
| Ranked|   18|
+-------+-----+
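
The grouped counts can be verified by applying the same mapping to the per-title counts shown earlier. The following plain-Python sketch does this with a dictionary and a `Counter`; the names `raw_counts` and `title_groups` are illustrative and do not appear in the notebook.

```python
from collections import Counter

# Title counts before consolidation, copied from the groupBy output above
raw_counts = {
    "Don": 1, "Miss": 182, "Countess": 1, "Col": 2, "Rev": 6, "Lady": 1,
    "Master": 40, "Mme": 1, "Capt": 1, "Mr": 517, "Dr": 7, "Mrs": 125,
    "Sir": 1, "Jonkheer": 1, "Mlle": 2, "Major": 2, "Ms": 1,
}

# Same mapping as the replace() call above; titles not listed are left unchanged
title_groups = {
    "Mme": "Mrs",
    "Mlle": "Miss", "Ms": "Miss",
    "Major": "Ranked", "Dr": "Ranked", "Capt": "Ranked", "Col": "Ranked", "Rev": "Ranked",
    "Lady": "Royalty", "Dona": "Royalty", "the Countess": "Royalty", "Countess": "Royalty",
    "Don": "Royalty", "Sir": "Royalty", "Jonkheer": "Royalty", "Master": "Royalty",
}

grouped = Counter()
for title, count in raw_counts.items():
    grouped[title_groups.get(title, title)] += count

print(grouped.most_common())
```

Note that most of the `Royalty` group comes from the 40 `Master` entries, since the mapping assigns `Master` to `Royalty`.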



In [113]:
df = feature_df.select(
    "Survived",
    "PassengerClasses",
    "SibSp",
    "Parch")

df = df.dropna()
df = df.fillna(0)
df.dtypes

[('Survived', 'int'),
 ('PassengerClasses', 'int'),
 ('SibSp', 'int'),
 ('Parch', 'int')]

## String Indexing

Before the model is built, the type of every feature should be inspected. The classifier requires numerical inputs, so any string-formatted column must be converted to a numerical type in the final modeling dataset. `StringIndexer` does this by mapping each distinct value of a column to a numeric index.

In [114]:
from pyspark.ml.feature import StringIndexer
parchIndexer = StringIndexer(inputCol="Parch", outputCol="Parch_Ind").fit(df)
sibspIndexer = StringIndexer(inputCol="SibSp", outputCol="SibSp_Ind").fit(df)
passangerIndexer = StringIndexer(inputCol="PassengerClasses", outputCol="PassengerClasses_Ind").fit(df)
survivedIndexer = StringIndexer(inputCol="Survived", outputCol="Survived_Ind").fit(df)
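
By default, `StringIndexer` orders labels by descending frequency, so the most frequent value receives index 0. The plain-Python sketch below mimics that rule on the passenger class counts seen during the exploratory analysis. It is an illustration rather than the Spark implementation, `fit_frequency_index` is a hypothetical helper, and the alphabetical tie-break reflects my understanding of recent Spark versions.

```python
from collections import Counter

def fit_frequency_index(values):
    """Map each distinct value (as a string) to an index, most frequent first."""
    counts = Counter(str(value) for value in values)
    # Sort by descending count, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}

# Passenger class counts from the exploratory analysis above
passenger_classes = [1] * 216 + [2] * 184 + [3] * 491
class_index = fit_frequency_index(passenger_classes)
print(class_index)
```

Third class is the most common, so it is assigned index 0.0 even though its original value is 3; the indices carry no ordinal meaning.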

## Vector Assembler

At this point every column in the DataFrame is numerical, so the columns can be combined into a feature vector. `VectorAssembler` takes the listed input columns and merges them into a single vector column, here named `features`.

In [115]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
  inputCols = ["PassengerClasses","SibSp","Parch"],
  outputCol = "features")
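
Conceptually, the assembler concatenates the selected column values of each row into one vector. A minimal plain-Python sketch of that idea, applied to the first training row shown earlier, follows; `assemble_features` is a hypothetical helper and not part of the Spark API.

```python
def assemble_features(row, input_cols):
    """Concatenate the selected numeric columns of one row into a feature list."""
    return [float(row[name]) for name in input_cols]

# First training row from the output above (only the columns kept for modeling)
row = {"Survived": 0, "PassengerClasses": 3, "SibSp": 1, "Parch": 0}
input_cols = ["PassengerClasses", "SibSp", "Parch"]

features = assemble_features(row, input_cols)
print(features)
```

The label column `Survived` is deliberately left out of the vector, since it is what the classifier predicts.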

## Split Train/Test

The 'randomSplit' method can be used to divide the data into train and test sets. 

In [124]:
(train, test) = df.randomSplit([0.8, 0.2], seed = 345)
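
The split is random per row, so the resulting sizes are only approximately 80% and 20%, while a fixed seed makes the assignment repeatable. The sketch below illustrates both points in plain Python. It is a simplified stand-in for Spark's sampling, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by a seeded random draw against normalized weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last split catches any draw not claimed earlier
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits

rows = list(range(891))  # stand-in for the 891 training rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=345)
print(len(train_rows), len(test_rows))
```

Running the function again with the same seed returns the same two lists, which is the property that makes experiments reproducible.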

# Modeling with Spark MLlib

## Define Classifier

In [119]:
from pyspark.ml.classification import DecisionTreeClassifier
classifier = DecisionTreeClassifier(featuresCol="features", labelCol="Survived")
classifier

DecisionTreeClassifier_9e7c1bc1eef5

## Create Pipeline

In [120]:
from pyspark.ml import Pipeline
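# Indexers run first so that preprocessing precedes the classifier; the assembler still reads the raw numeric columns
pipeline = Pipeline(stages=[parchIndexer, sibspIndexer, passangerIndexer, survivedIndexer, assembler, classifier])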
pipeline

Pipeline_9208a01fc936


## Prepare training with ParamGridBuilder

Once the pipeline is created, the classifier's hyperparameters can be tuned. `ParamGridBuilder` builds the grid of candidate values: each `addGrid` call lists the values to try for one parameter, and `build()` returns every combination of them.

In [121]:
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import TrainValidationSplit
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

paramGrid = ParamGridBuilder() \
  .addGrid(classifier.maxDepth, [5, 10, 15, 20]) \
  .addGrid(classifier.maxBins, [25, 30]) \
  .build()
paramGrid

[{Param(parent='DecisionTreeClassifier_9e7c1bc1eef5', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 25,
  Param(parent='DecisionTreeClassifier_9e7c1bc1eef5', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5},
 {Param(parent='DecisionTreeClassifier_9e7c1bc1eef5', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 30,
  Param(parent='DecisionTreeClassifier_9e7c1bc1eef5', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5},
 {Param(parent='DecisionTreeClassifier_9e7c1bc1eef5', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must 
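
The grid is the Cartesian product of the listed values, so four depths and two bin settings give eight combinations. A plain-Python sketch with `itertools.product` shows the same expansion; the dictionary `grid_values` is illustrative, and the order of combinations may differ from the order Spark produces.

```python
from itertools import product

# Candidate values from the ParamGridBuilder cell above
grid_values = {
    "maxDepth": [5, 10, 15, 20],
    "maxBins": [25, 30],
}

# Every combination of one value per parameter
param_names = list(grid_values)
combinations = [
    dict(zip(param_names, values))
    for values in product(*(grid_values[name] for name in param_names))
]

print(len(combinations))
print(combinations[0])
```

Each combination leads to one model being trained, so the grid size directly multiplies the training time.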

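`TrainValidationSplit` then ties these pieces together: it takes the pipeline as the estimator, the parameter grid, and an evaluator configured with the label column, the prediction column, and the metric to optimize.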

In [122]:
tvs = TrainValidationSplit(
  estimator=pipeline,
  estimatorParamMaps=paramGrid,
  evaluator=MulticlassClassificationEvaluator(labelCol="Survived", predictionCol="prediction", metricName="weightedPrecision"),
  trainRatio=0.8)

tvs

TrainValidationSplit_93dc7ce09291
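
The `weightedPrecision` metric averages the per-class precision, weighting each class by its share of the true labels. The following plain-Python sketch computes it on a small made-up example. The label and prediction lists are hypothetical, and the function reflects my understanding of the metric rather than Spark's own code.

```python
from collections import Counter

def weighted_precision(labels, predictions):
    """Per-class precision averaged with weights equal to each class's share of true labels."""
    total = len(labels)
    label_counts = Counter(labels)
    predicted_counts = Counter(predictions)
    correct_counts = Counter(
        true for true, predicted in zip(labels, predictions) if true == predicted
    )
    score = 0.0
    for label, support in label_counts.items():
        predicted = predicted_counts[label]
        # Precision is taken as 0 when a class is never predicted
        precision = correct_counts[label] / predicted if predicted else 0.0
        score += precision * support / total
    return score

labels = [0, 0, 0, 1, 1]        # hypothetical true values of Survived
predictions = [0, 0, 1, 1, 0]   # hypothetical model predictions
print(weighted_precision(labels, predictions))
```

Here class 0 has precision 2/3 with weight 3/5, and class 1 has precision 1/2 with weight 2/5, which gives a weighted precision of about 0.6.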


## Model Fitting

Once `TrainValidationSplit` is configured, it can be fitted on the training set. Fitting trains one model per parameter combination and keeps the best one.

In [125]:
model_generated = tvs.fit(train)

## Model Evaluation

Print the validation metric (weighted precision) for each parameter combination.

In [127]:
list(zip(model_generated.validationMetrics, model_generated.getEstimatorParamMaps()))

[(0.7113061435209086,
  {Param(parent='DecisionTreeClassifier_9e7c1bc1eef5', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 25,
   Param(parent='DecisionTreeClassifier_9e7c1bc1eef5', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5}),
 (0.7113061435209086,
  {Param(parent='DecisionTreeClassifier_9e7c1bc1eef5', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 30,
   Param(parent='DecisionTreeClassifier_9e7c1bc1eef5', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5}),
 (0.6858601215725474,
  {Param(parent='DecisionTreeClassifier_9e7c1bc1eef5', name='maxBi
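
The raw output is hard to scan because each `Param` object prints its full documentation. As a sketch, the visible results can be rewritten with plain names and ranked in plain Python. Only the three entries shown above are included, and the parameters of the third are left as `None` because they are cut off in the output.

```python
# (validation metric, parameters) pairs copied from the visible part of the output above
results = [
    (0.7113061435209086, {"maxBins": 25, "maxDepth": 5}),
    (0.7113061435209086, {"maxBins": 30, "maxDepth": 5}),
    (0.6858601215725474, None),  # parameters are cut off in the output above
]

# Higher weighted precision is better; max() keeps the first entry on ties
best_metric, best_params = max(results, key=lambda pair: pair[0])
print(best_metric, best_params)
```

Among the visible entries, the two depth-5 settings tie, which suggests that `maxBins` makes no difference for these three low-cardinality features.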

# Model Serving with MLflow

Machine learning models built with PySpark can be packaged and served with MLflow.

In [85]:
!pip install mlflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mlflow
  Downloading mlflow-1.26.1-py3-none-any.whl (17.8 MB)
     |████████████████████████████████| 17.8 MB 476 kB/s
Collecting alembic
  Downloading alembic-1.7.7-py3-none-any.whl (210 kB)
     |████████████████████████████████| 210 kB 49.2 MB/s
Collecting prometheus-flask-exporter
  Downloading prometheus_flask_exporter-0.20.1-py3-none-any.whl (18 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 58.2 MB/s
Collecting docker>=4.0.0
  Downloading docker-5.0.3-py2.py3-none-any.whl (146 kB)
     |████████████████████████████████| 146 kB 49.1 MB/s
Collecting querystring-parser
  Downloading querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)
Collecting gunicorn
  Downloading gunicorn-20.1.0-py3-n

## Execution of MLflow

After importing MLflow, call `mlflow.start_run()` to open a run under which the fitted model is logged. Used as a context manager, it ends the run automatically when the block exits.

In [129]:
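import mlflow
import mlflow.spark  # import the submodule without rebinding the name `spark`, which holds the SparkSession

with mlflow.start_run():
    model = tvs.fit(train)
    # Log the model fitted inside this run; bestModel is the PipelineModel selected by TrainValidationSplit
    mlflow.spark.log_model(model.bestModel, "sparkML-model")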

Inferences can be obtained from the logged model through the `mlflow.pyfunc` module. First, set the model path and the dataset path. Then create a Spark UDF from the model path, read the dataset into a DataFrame, and finally add a new column by applying the UDF to the feature columns.

In [None]:
import mlflow.pyfunc

# Save the training set without the pandas index so the CSV holds only the original columns
train.toPandas().to_csv('dataset.csv', index=False)

# Adjust both paths to your own environment
model_path = '/Users/ersoyp/Documents/LAYER/ServingModelsWithApacheSpark/Scripts/mlruns/1/51ef199ab3b945e8a31b47cdfbf60912/artifacts/sparkML-model'
titanic_path = '/Users/ersoyp/Documents/LAYER/ServingModelsWithApacheSpark/Scripts/dataset.csv'
titanic_udf = mlflow.pyfunc.spark_udf(spark, model_path)

# to_csv writes comma-separated values, which is also Spark's default CSV delimiter
df = spark.read.format("csv").option("inferSchema", True).option("header", "true").load(titanic_path)

columns = ['PassengerClasses', 'SibSp', 'Parch']

# show() expects the row count first, so pass truncate by keyword
df.withColumn('Inferences', titanic_udf(*columns)).show(truncate=False)

In [None]:
mlflow.end_run()