**ML in Spark**
---

### Spark Context
SparkContext is the entry point that connects your program to a Spark cluster. Any operation you run goes through a SparkContext.

### Create a SparkContext

In [1]:
# First of all, you need to create a SparkContext.

import pyspark
from pyspark import SparkContext
sc = SparkContext()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/24 23:49:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/24 23:49:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Now that the SparkContext is ready, you can create a collection of data called an RDD (Resilient Distributed Dataset). Computation on an RDD is automatically parallelized across the cluster.

In [2]:
nums = sc.parallelize([1, 2, 3, 4])

# take(1) returns the first element, wrapped in a list
nums.take(1)

You can apply a transformation to the data with a lambda function. In the PySpark example below, you return the square of each element of nums. This is a map transformation.

In [3]:
squared = nums.map(lambda x: x*x).collect()
for num in squared:
    print('%i ' % (num))

1 
4 
9 
16 
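
Conceptually, `parallelize` splits the collection into partitions, `map` applies the function to each partition independently, and `collect` gathers the results back in order. The sketch below imitates that flow in plain Python with a thread pool. It is only an illustration of the idea, not how Spark is implemented, and `split_into_partitions` and `map_partition` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous chunks of roughly equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def map_partition(partition):
    """Apply the same function (here, squaring) to every element of one partition."""
    return [x * x for x in partition]


partitions = split_into_partitions([1, 2, 3, 4], num_partitions=2)

# Each partition is processed independently, then the results are gathered in order.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_partition, partitions))

squared = [x for part in mapped for x in part]
print(squared)
```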


                                                                                

### SQLContext

A more convenient way to work with data is the DataFrame. The SparkContext is already set up, so you can use it to create the DataFrame. You also need to declare the SQLContext.

SQLContext connects the engine to different data sources and gives access to the functionality of Spark SQL. In Spark 2.0 and later, SparkSession is the recommended entry point, and SQLContext is kept for backward compatibility.

In [4]:
from pyspark.sql import Row
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)



In [5]:
# Next, create a list of tuples. Each tuple contains a person's name and age.
# Four steps are required:

In [6]:
# Create the list of tuples with the information
list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]

In [7]:
# Build the RDD
rdd = sc.parallelize(list_p)

In [8]:
# Convert each tuple to a Row with named fields
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

In [9]:
# Create the DataFrame
DF_ppl = sqlContext.createDataFrame(ppl)

In [10]:
#If you want to access the type of each feature, you can use printSchema()

DF_ppl.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



In [11]:
DF_ppl.show(3)

+-----+---+
| name|age|
+-----+---+
| John| 19|
|Smith| 29|
| Adam| 35|
+-----+---+
only showing top 3 rows



# Machine Learning Example with PySpark
---

Now that you have a brief idea of Spark and SQLContext, you are ready to build your first Machine Learning program using Spark.

Following are the steps to build a Machine Learning program with PySpark:

1. Basic operation with PySpark
2. Data preprocessing
3. Build a data processing pipeline
4. Build the classifier: logistic
5. Train and evaluate the model
6. Tune the hyperparameter

Note that the dataset used here is not very big, so the computation may seem to take a long time for its size. Spark is designed to process considerable amounts of data, and its performance relative to other machine learning libraries improves as the dataset grows larger.

## Step 1: Basic operation with PySpark

In [12]:
import pandas as pd
url = "https://raw.githubusercontent.com/sadhana1002/PredictingSalaryClass-Classification/master/adult.csv"
df = sqlContext.createDataFrame(pd.read_csv(url, 
                                      names=['Age','workclass',
                                             'fnlwgt','education',
                                             'education_num',
                                             'marital',
                                             'occupation',
                                             'relationship','race',
                                             'sex','capital_gain',
                                             'capital_loss',
                                             'hours_week',
                                             'native_country','label']))
df.printSchema()

root
 |-- Age: long (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: long (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: long (nullable = true)
 |-- marital: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: long (nullable = true)
 |-- capital_loss: long (nullable = true)
 |-- hours_week: long (nullable = true)
 |-- native_country: string (nullable = true)
 |-- label: string (nullable = true)



In [13]:
df.show(5, truncate=False)

+---+-----------------+------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+----------+--------------+------+
|Age|workclass        |fnlwgt|education |education_num|marital            |occupation        |relationship  |race  |sex    |capital_gain|capital_loss|hours_week|native_country|label |
+---+-----------------+------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+----------+--------------+------+
|39 | State-gov       |77516 | Bachelors|13           | Never-married     | Adm-clerical     | Not-in-family| White| Male  |2174        |0           |40        | United-States| <=50K|
|50 | Self-emp-not-inc|83311 | Bachelors|13           | Married-civ-spouse| Exec-managerial  | Husband      | White| Male  |0           |0           |13        | United-States| <=50K|
|38 | Private         |215646| HS-grad  |9            | Divorced          | Hand

To convert the continuous variables to the right format, you can recast the columns. Use withColumn to tell Spark which column to apply the transformation to.

In [14]:
# Import the type used for the cast
from pyspark.sql.types import FloatType

# Write a custom function to convert the data type of DataFrame columns
def convertColumn(df, names, newType):
    for name in names: 
        df = df.withColumn(name, df[name].cast(newType))
    return df

# List of continuous features
CONTI_FEATURES  = ['Age', 'fnlwgt','capital_gain', 'education_num', 'capital_loss', 'hours_week']

# Convert the type
df = convertColumn(df, CONTI_FEATURES, FloatType())

# Check the dataset
df.printSchema()

root
 |-- Age: float (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: float (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: float (nullable = true)
 |-- marital: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: float (nullable = true)
 |-- capital_loss: float (nullable = true)
 |-- hours_week: float (nullable = true)
 |-- native_country: string (nullable = true)
 |-- label: string (nullable = true)



### Select columns

You can select columns with select and the names of the features, then display the rows with show. Below, age and fnlwgt are selected.

In [15]:
df.select('age', 'fnlwgt').show(5)

+----+--------+
| age|  fnlwgt|
+----+--------+
|39.0| 77516.0|
|50.0| 83311.0|
|38.0|215646.0|
|53.0|234721.0|
|28.0|338409.0|
+----+--------+
only showing top 5 rows



### Count by group

To count the number of occurrences per group, chain `groupBy()` and `count()` together. In the PySpark example below, you count the number of rows for each education level.

In [16]:
df.groupBy("education").count().sort("count",ascending=True).show()

+-------------+-----+
|    education|count|
+-------------+-----+
|    Preschool|   51|
|      1st-4th|  168|
|      5th-6th|  333|
|    Doctorate|  413|
|         12th|  433|
|          9th|  514|
|  Prof-school|  576|
|      7th-8th|  646|
|         10th|  933|
|   Assoc-acdm| 1067|
|         11th| 1175|
|    Assoc-voc| 1382|
|      Masters| 1723|
|    Bachelors| 5355|
| Some-college| 7291|
|      HS-grad|10501|
+-------------+-----+



### Describe the data

To get summary statistics of the data, you can use describe(). It computes the:

`count`\
`mean`\
`standard deviation`\
`min`\
`max`

In [17]:
df.describe().show()

23/05/24 23:49:39 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------+------------------+------------+------------------+-------------+------------------+---------+-----------------+------------+-------------------+-------+------------------+------------------+------------------+--------------+------+
|summary|               Age|   workclass|            fnlwgt|    education|     education_num|  marital|       occupation|relationship|               race|    sex|      capital_gain|      capital_loss|        hours_week|native_country| label|
+-------+------------------+------------+------------------+-------------+------------------+---------+-----------------+------------+-------------------+-------+------------------+------------------+------------------+--------------+------+
|  count|             32561|       32561|             32561|        32561|             32561|    32561|            32561|       32561|              32561|  32561|             32561|             32561|             32561|         32561| 32561|
|   mean| 38.58164675532078|    

                                                                                

In [18]:
#If you want the summary statistic of only one column, add the name of the column inside describe()

df.describe('age').show()

+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|             32561|
|   mean| 38.58164675532078|
| stddev|13.640432553581343|
|    min|              17.0|
|    max|              90.0|
+-------+------------------+
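
To make the five statistics concrete, here is a plain-Python sketch that computes them for a small list of ages (the first five ages shown in the table above). It assumes the reported standard deviation is the sample standard deviation; `describe_values` is a hypothetical helper name, not a Spark function.

```python
import statistics


def describe_values(values):
    """Compute count, mean, sample standard deviation, min and max for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (divides by n - 1)
        "min": min(values),
        "max": max(values),
    }


ages = [39.0, 50.0, 38.0, 53.0, 28.0]
summary = describe_values(ages)
for name, value in summary.items():
    print(name, value)
```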



### Drop column

Two commands remove data from a DataFrame:

`drop():` Drop a column\
`dropna():` Drop rows that contain null values

Below, you drop the column education_num. Note that drop() returns a new DataFrame; df itself is unchanged.

In [19]:
df.drop('education_num').columns

['Age',
 'workclass',
 'fnlwgt',
 'education',
 'marital',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital_gain',
 'capital_loss',
 'hours_week',
 'native_country',
 'label']

### Filter data

You can use `filter()` to apply descriptive statistics to a subset of the data. For instance, you can count the number of people above 40:

In [20]:
df.filter(df.Age > 40).count()

13443

### Descriptive statistics by group

Finally, you can group the data and compute statistics such as the mean for each group.

In [21]:
marital_status_cg = df.groupby('marital').agg({'capital_gain': 'mean'})

In [22]:
# Show the average capital gain for each marital status (values are not rounded)
marital_status_cg.show()

+--------------------+------------------+
|             marital| avg(capital_gain)|
+--------------------+------------------+
|             Widowed| 571.0715005035247|
| Married-spouse-a...| 653.9832535885167|
|   Married-AF-spouse| 432.6521739130435|
|  Married-civ-spouse|1764.8595085470085|
|            Divorced| 728.4148098131893|
|       Never-married|376.58831788823363|
|           Separated| 535.5687804878049|
+--------------------+------------------+



## Step 2: Data preprocessing

Data preprocessing is a critical step in machine learning. Once the garbage data is removed, you can draw some important insights.

For instance, you know that income is not a linear function of age. Young people usually earn less than people in mid-career, and after retirement a household draws on its savings, meaning income decreases. To capture this pattern, you can add the square of age as a feature.

### Add age square

To add a new feature, you need to:

`Select the column`\
`Apply the transformation and add it to the DataFrame`

In [23]:
# Add square of age feature

# import pyspark sql functions
from pyspark.sql.functions import *

#1 Select the column 
age_square = df.select(col('age')**2)

#2 Apply the transformation and add it to the DataFrame
df = df.withColumn('age_square', col('age')**2)

# Check the results
df.printSchema()

root
 |-- Age: float (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: float (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: float (nullable = true)
 |-- marital: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: float (nullable = true)
 |-- capital_loss: float (nullable = true)
 |-- hours_week: float (nullable = true)
 |-- native_country: string (nullable = true)
 |-- label: string (nullable = true)
 |-- age_square: double (nullable = true)
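
Why a squared term helps can be seen with a small worked example. With a positive coefficient on age and a negative coefficient on age squared, the linear predictor rises, peaks, and then falls. The coefficients below are hypothetical values chosen for illustration only; they are not fitted to this dataset.

```python
def linear_score(age, b_age, b_age_sq):
    """Contribution of age to a linear predictor that includes a squared term."""
    return b_age * age + b_age_sq * age ** 2


# Hypothetical coefficients, for illustration only
b_age, b_age_sq = 0.30, -0.003

# The parabola b_age * age + b_age_sq * age**2 peaks where its derivative is zero
peak_age = -b_age / (2 * b_age_sq)

for age in (20, 35, 50, 65, 80):
    print(age, round(linear_score(age, b_age, b_age_sq), 2))
print("peak at age", round(peak_age, 1))
```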



In [24]:
# You can change the order of the variables with select. Below, you move age_square right after age.
# Note that selecting 'age' also renames the column `Age` to `age` in the resulting DataFrame.

COLUMNS = ['age', 'age_square', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_week', 'native_country', 'label']
df = df.select(COLUMNS)
df.show(5)

+----+----------+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+----------+--------------+------+
| age|age_square|        workclass|  fnlwgt| education|education_num|            marital|        occupation|  relationship|  race|    sex|capital_gain|capital_loss|hours_week|native_country| label|
+----+----------+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+----------+--------------+------+
|39.0|    1521.0|        State-gov| 77516.0| Bachelors|         13.0|      Never-married|      Adm-clerical| Not-in-family| White|   Male|      2174.0|         0.0|      40.0| United-States| <=50K|
|50.0|    2500.0| Self-emp-not-inc| 83311.0| Bachelors|         13.0| Married-civ-spouse|   Exec-managerial|       Husband| White|   Male|         0.0|         0.0|      13.0| United-States| <=50K|
|38.0|    

### Exclude Holand-Netherlands

When a group within a feature has only one observation, it brings no information to the model. Worse, it can lead to an error during cross-validation.

Let's check the country of origin of each household.

In [25]:
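# The string values keep a leading space from the CSV (see the In [13] output), so trim before comparing
df.filter(trim(df.native_country) == 'Holand-Netherlands').count()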
df.groupby('native_country').agg({'native_country': 'count'}).sort(asc("count(native_country)")).show()

+--------------------+---------------------+
|      native_country|count(native_country)|
+--------------------+---------------------+
|  Holand-Netherlands|                    1|
|            Scotland|                   12|
|            Honduras|                   13|
|             Hungary|                   13|
| Outlying-US(Guam...|                   14|
|          Yugoslavia|                   16|
|                Laos|                   18|
|            Thailand|                   18|
|            Cambodia|                   19|
|     Trinadad&Tobago|                   19|
|                Hong|                   20|
|             Ireland|                   24|
|             Ecuador|                   28|
|              France|                   29|
|              Greece|                   29|
|                Peru|                   31|
|           Nicaragua|                   34|
|            Portugal|                   37|
|                Iran|                   43|
|         

In [26]:
# Remove Holand-Netherlands. The string values keep a leading space from the CSV
# (see the In [13] output), so trim before comparing.
df_remove = df.filter(trim(df.native_country) != 'Holand-Netherlands')

In [27]:
# Check the result on the filtered DataFrame
df_remove.groupby('native_country').agg({'native_country': 'count'}).sort(asc("count(native_country)")).show()

+--------------------+---------------------+
|      native_country|count(native_country)|
+--------------------+---------------------+
|            Scotland|                   12|
|            Honduras|                   13|
|             Hungary|                   13|
| Outlying-US(Guam...|                   14|
|          Yugoslavia|                   16|
|                Laos|                   18|
|            Thailand|                   18|
|            Cambodia|                   19|
|     Trinadad&Tobago|                   19|
|                Hong|                   20|
|             Ireland|                   24|
|             Ecuador|                   28|
|              France|                   29|
|              Greece|                   29|
|                Peru|                   31|
|           Nicaragua|                   34|
|            Portugal|                   37|
|                Iran|                   43|
|         

## Step 3: Build a data processing pipeline

Similar to scikit-learn, PySpark has a pipeline API.

A pipeline is a convenient way to maintain the structure of the data. You push the data into the pipeline, various operations are applied inside it, and the output is used to feed the algorithm.

For instance, one universal transformation in machine learning is one-hot encoding a string column, i.e., creating one column per group. A one-hot encoded matrix consists mostly of zeros.

The steps to transform the data are very similar to scikit-learn. You need to:

`Index the string to numeric`\
`Create the one hot encoder`\
`Transform the data`

Two APIs do the job: `StringIndexer, OneHotEncoder`

First, you select the string column to index. `inputCol` is the name of the column in the dataset, and `outputCol` is the name given to the transformed column.\
Next, you fit the indexer to the data and transform it.\
Finally, you create the new columns based on the groups. For instance, if there are 10 groups in the feature, the new matrix will have 10 columns, one for each group (with `dropLast=False`, as below; by default the encoder drops the last group).

In [28]:
### Example encoder
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

stringIndexer = StringIndexer(inputCol="workclass", outputCol="workclass_encoded")
model = stringIndexer.fit(df)
indexed = model.transform(df)
encoder = OneHotEncoder(dropLast=False, inputCol="workclass_encoded", outputCol="workclass_vec").fit(indexed)
encoded = encoder.transform(indexed)
encoded.show(2)

+----+----------+-----------------+-------+----------+-------------+-------------------+----------------+--------------+------+-----+------------+------------+----------+--------------+------+-----------------+-------------+
| age|age_square|        workclass| fnlwgt| education|education_num|            marital|      occupation|  relationship|  race|  sex|capital_gain|capital_loss|hours_week|native_country| label|workclass_encoded|workclass_vec|
+----+----------+-----------------+-------+----------+-------------+-------------------+----------------+--------------+------+-----+------------+------------+----------+--------------+------+-----------------+-------------+
|39.0|    1521.0|        State-gov|77516.0| Bachelors|         13.0|      Never-married|    Adm-clerical| Not-in-family| White| Male|      2174.0|         0.0|      40.0| United-States| <=50K|              4.0|(9,[4],[1.0])|
|50.0|    2500.0| Self-emp-not-inc|83311.0| Bachelors|         13.0| Married-civ-spouse| Exec-manage
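
The two steps can be sketched in plain Python to show what the output columns mean. The sketch assumes the indexer's default ordering (most frequent label gets index 0) with alphabetical tie-breaking, and it prints the encoded vector as a `(size, [active index], [value])` triple, mirroring the `(9,[4],[1.0])` notation in the table above. `fit_string_index` and `one_hot` are hypothetical helper names, and the sample values are made up.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties: alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, size, drop_last=True):
    """Return a sparse (size, [active index], [1.0]) triple for one category index."""
    width = size - 1 if drop_last else size
    i = int(index)
    if i >= width:
        # With drop_last=True the last category is encoded as an all-zero vector
        return (width, [], [])
    return (width, [i], [1.0])


# Made-up sample column, for illustration only
workclass = ["Private", "Private", "Private", "State-gov", "Self-emp", "State-gov"]
mapping = fit_string_index(workclass)
encoded = [one_hot(mapping[v], len(mapping), drop_last=False) for v in workclass]

print(mapping)
print(encoded[0], encoded[3], encoded[4])
```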

### Build the pipeline

You will build a pipeline to convert all the features and add them to the final dataset. The pipeline will have four operations, but feel free to add as many operations as you want.

`Encode the categorical data`\
`Index the label feature`\
`Add continuous variable`\
`Assemble the steps.`

Each step is stored in a list named stages. This list tells the Pipeline which operations to perform, and in which order.

**Encode the categorical data**\
This step is very similar to the above example, except that you loop over all the categorical features.

In [29]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder
CATE_FEATURES = ['workclass', 'education', 'marital', 'occupation', 'relationship', 'race', 'sex', 'native_country']
stages = [] # stages in our Pipeline
for categoricalCol in CATE_FEATURES:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
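
How a list of stages is executed can be sketched in plain Python. In the sketch, a stage with a `fit` method is first fitted on the current data, the resulting model transforms the data, and the transformed data is passed to the next stage. `ToyIndexer`, `ToyIndexerModel`, `fit_pipeline` and `transform_pipeline` are hypothetical names, and the toy indexer orders labels alphabetically for simplicity, which differs from StringIndexer's default.

```python
class ToyIndexer:
    """Estimator: learns a value -> index mapping for one column."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        labels = sorted({row[self.input_col] for row in rows})
        mapping = {label: i for i, label in enumerate(labels)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyIndexerModel:
    """Transformer produced by ToyIndexer.fit."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col, self.output_col, self.mapping = input_col, output_col, mapping

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]} for row in rows]


def fit_pipeline(stages, rows):
    """Fit each stage in order, feeding it the output of the previous stages."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


rows = [{"sex": "Male", "race": "White"}, {"sex": "Female", "race": "Black"}]
toy_stages = [ToyIndexer("sex", "sexIndex"), ToyIndexer("race", "raceIndex")]
fitted = fit_pipeline(toy_stages, rows)
print(transform_pipeline(fitted, rows))
```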

**Index the label feature**\
Spark, like many other libraries, does not accept string values for the label. You convert the label feature with StringIndexer and add it to the list stages.

In [30]:
# Convert label into label indices using the StringIndexer
label_stringIdx =  StringIndexer(inputCol="label", outputCol="newlabel")
stages += [label_stringIdx]

**Add continuous variable**\
The `inputCols` of the `VectorAssembler` is a list of columns. You can create a new list containing all the new columns. The code below populates the list with the encoded categorical features and the continuous features.

In [31]:
# The select in In [24] renamed `Age` to `age`, and VectorAssembler matches column
# names exactly, so use the lowercase names of the continuous features here.
assemblerInputs = [c + "classVec" for c in CATE_FEATURES] + [c.lower() for c in CONTI_FEATURES]

**Assemble the steps.**\
Finally, you pass all the input columns to the VectorAssembler.

In [32]:
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

In [33]:
# Create a Pipeline.
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df_remove)
model = pipelineModel.transform(df_remove)

In [34]:
model.take(1)

## Step 4: Build the classifier: logistic

To make the computation faster, we convert features to DenseVector type.

In [35]:
from pyspark.ml.linalg import DenseVector
input_data = model.rdd.map(lambda x: (x["newlabel"], DenseVector(x["features"])))

In [36]:
#You are ready to create the train data as a DataFrame. You use the sqlContext
df_train = sqlContext.createDataFrame(input_data, ["label", "features"])
df_train.show(2)

**Create a train/test set**\
Split the data 80/20 into a training set and a test set.

In [37]:
# Split the data into train and test sets
train_data, test_data = df_train.randomSplit([.8,.2],seed=1234)
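
A random split of this kind assigns each row to a split at random according to the weights, so the 80/20 proportions are approximate rather than exact. The plain-Python sketch below shows the idea; `random_split` is a hypothetical helper, and it will not reproduce the same rows Spark selects for the same seed.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split at random, with probability proportional to its weight."""
    total = sum(weights)
    bounds, cumulative = [], 0.0
    for w in weights:
        cumulative += w / total
        bounds.append(cumulative)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding in the bounds
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(list(range(10000)), [0.8, 0.2], seed=1234)
print(len(train), len(test))
```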

In [38]:
#Let's count how many people with income below/above 50k in both training and test set

train_data.groupby('label').agg({'label': 'count'}).show()

### Build the logistic regressor

Last but not least, you can build the classifier. PySpark has an API called LogisticRegression to perform logistic regression.

You initialize lr by indicating the label column and the features column. You set a maximum of 10 iterations and add a regularization parameter with a value of 0.3. Note that in the next section, you will use cross-validation with a parameter grid to tune the model.

In [39]:
# Import `LogisticRegression`
from pyspark.ml.classification import LogisticRegression

# Initialize `lr`
lr = LogisticRegression(labelCol="label",
                        featuresCol="features",
                        maxIter=10,
                        regParam=0.3)

# Fit the data to the model
linearModel = lr.fit(train_data)
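
What the fitted coefficients, the intercept, and the regularization parameter represent can be sketched in plain Python. The sketch assumes an L2 penalty (the default when no elastic-net mixing is set) and does not reproduce Spark's exact objective, which also involves feature standardization. The function names and the sample data are hypothetical.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, coefficients, intercept):
    """Probability of the positive class for one row."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)


def regularized_log_loss(rows, labels, coefficients, intercept, reg_param):
    """Mean log loss plus an L2 penalty that grows with the size of the coefficients."""
    loss = 0.0
    for x, y in zip(rows, labels):
        p = predict_proba(x, coefficients, intercept)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    penalty = 0.5 * reg_param * sum(w * w for w in coefficients)
    return loss / len(rows) + penalty


# Made-up rows and labels, for illustration only
rows = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
labels = [1, 0, 0]
coefficients, intercept = [0.8, 0.4], -0.2

print(predict_proba(rows[0], coefficients, intercept))
print(regularized_log_loss(rows, labels, coefficients, intercept, reg_param=0.3))
```

A larger `reg_param` penalizes large coefficients more heavily, which pulls the fitted model toward smaller weights.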

In [40]:
# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(linearModel.coefficients))
print("Intercept: " + str(linearModel.intercept))

## Step 5: Train and evaluate the model

To generate predictions for your test set, you can use linearModel with `transform()` on test_data.

In [41]:
# Make predictions on test data using the transform() method.
predictions = linearModel.transform(test_data)

In [42]:
predictions.printSchema()

In [43]:
selected = predictions.select("label", "prediction", "probability")
selected.show(20)

### Evaluate the model

You need to look at the accuracy metric to see how well (or badly) the model performs. BinaryClassificationEvaluator does not report accuracy; its default metric is the area under the ROC (receiver operating characteristic) curve, a different metric that takes into account the false positive rate. MulticlassClassificationEvaluator can report accuracy with `metricName="accuracy"`, but computing it by hand is instructive.

Before you look at the ROC, let's construct the accuracy measure, a metric you are probably more familiar with. Accuracy is the number of correct predictions divided by the total number of observations.

You create a DataFrame with the label and the prediction.

In [44]:
cm = predictions.select("label", "prediction")

In [45]:
#You can check the number of class in the label and the prediction
cm.groupby('label').agg({'label': 'count'}).show()

In [46]:
cm.groupby('prediction').agg({'prediction': 'count'}).show()

In [47]:
#You can compute the accuracy by computing the count when the label is correctly classified over the total number of rows.
cm.filter(cm.label == cm.prediction).count() / cm.count()

In [48]:
#You can wrap everything together and write a function to compute the accuracy.

def accuracy_m(model): 
    predictions = model.transform(test_data)
    cm = predictions.select("label", "prediction")
    acc = cm.filter(cm.label == cm.prediction).count() / cm.count()
    print("Model accuracy: %.3f%%" % (acc * 100)) 
accuracy_m(model = linearModel)
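
The same calculation on a handful of made-up labels and predictions, in plain Python, shows what the counts and the ratio mean. The data is invented for illustration; `confusion_counts` and `accuracy` are hypothetical helper names.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count how often each (label, prediction) pair occurs."""
    return Counter(zip(labels, predictions))


def accuracy(labels, predictions):
    """Share of rows where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up labels and predictions, for illustration only
labels = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
predictions = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

print(confusion_counts(labels, predictions))
print("accuracy:", accuracy(labels, predictions))

# Always predicting the majority class scores just as well on this sample,
# which is one reason to look at a second metric such as the ROC.
print("baseline:", accuracy(labels, [0.0] * len(labels)))
```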

### ROC metrics

The class BinaryClassificationEvaluator includes the ROC measures. The Receiver Operating Characteristic curve is another common tool used with binary classification. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve shows the true positive rate (i.e. recall) against the false positive rate. The false positive rate is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate. The true negative rate is also called specificity. Hence the ROC curve plots `sensitivity` (recall) versus 1 - `specificity`.

In [49]:
### Use ROC 
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
print(evaluator.evaluate(predictions))
print(evaluator.getMetricName())
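
The definition above can be worked through on a tiny example in plain Python: sweep a threshold over the scores, record the (false positive rate, true positive rate) point at each threshold, and integrate with the trapezoid rule. The labels and scores are made up, and `roc_points` and `area_under_curve` are hypothetical helpers, not the evaluator's implementation.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up labels (1 = positive) and predicted scores, for illustration only
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(points)
print("area under ROC:", area_under_curve(points))
```

In this sample, 8 of the 9 positive/negative pairs are ranked correctly, which matches the area of 8/9.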

## Step 6 (Stretch): Tune the hyperparameter

Finally, you can tune the hyperparameters. As in scikit-learn, you create a parameter grid and add the parameters you want to tune.

To reduce the computation time, you tune only the regularization parameter, with just two values.

In [50]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5])
             .build())
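
The cost of a grid search is easy to estimate: every parameter combination is trained once per fold. The plain-Python sketch below expands a grid and builds fold indices to count the fits. `build_param_grid` and `kfold_indices` are hypothetical helpers, and the folds here are assigned round-robin for simplicity, whereas a real cross-validator typically assigns rows to folds at random.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def kfold_indices(n_rows, num_folds):
    """Return (train indices, validation indices) pairs, one per fold."""
    folds = [list(range(i, n_rows, num_folds)) for i in range(num_folds)]
    all_rows = set(range(n_rows))
    return [(sorted(all_rows - set(fold)), fold) for fold in folds]


param_maps = build_param_grid({"regParam": [0.01, 0.5]})
splits = kfold_indices(n_rows=10, num_folds=5)

# One fit per (parameter combination, fold) pair during the search
fits_during_search = len(param_maps) * len(splits)
print(param_maps)
print("fits during search:", fits_during_search)
```

With two values and five folds, the search trains ten models, and the best setting is then typically refit once on the full training set.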

In [51]:
# Finally, you evaluate the model using cross-validation with 5 folds. It takes some time to train.
from time import time
start_time = time()

# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(train_data)
# likely take a fair amount of time
end_time = time()
elapsed_time = end_time - start_time
print("Time to train model: %.3f seconds" % elapsed_time)

In [52]:
#The best regularization hyperparameter is 0.01, with an accuracy of 85.316 percent.

accuracy_m(model = cvModel)

In [53]:
# You can extract the recommended parameter by chaining cvModel.bestModel with extractParamMap()

bestModel = cvModel.bestModel
bestModel.extractParamMap()
