<div style="line-height:0.4">
<h1 style="color:#0FCBC6"> PySpark 2: Datasets, models and regression </h1>
<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline;">Keywords:</h3> Encoding + LinearRegression
</span>
</div>

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.sql.functions import isnan, col, when, rand
from pyspark.ml.regression import LinearRegression

In [2]:
# Create a Spark session
spark = SparkSession.builder \
    .appName("MLExample2") \
    .getOrCreate()

23/10/20 23:09:36 WARN Utils: Your hostname, hpmint resolves to a loopback address: 127.0.1.1; using 192.168.1.81 instead (on interface eno1)
23/10/20 23:09:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/20 23:09:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
# Load data
dataset_covid = spark.read.csv("./datasets_for_pyspark/covid19-ita-province.csv", header=True, inferSchema=True)
dataset_covid.head()

Row(_c0=0, date=datetime.datetime(2020, 2, 24, 18, 0), state='ITA', region_code=13, region='Abruzzo', province_code=69, province='Chieti', province_ISO='CH', lat=42.35103167, long=14.16754574, total_cases=0, note_it=None, note_en=None)

In [140]:
# Remove the first ID column which is useless
dataset_covid = dataset_covid.drop("_c0")

In [141]:
dataset_covid.show()

+-------------------+-----+-----------+------------+-------------+--------------------+------------+-----------+------------------+-----------+-------+-------+
|               date|state|region_code|      region|province_code|            province|province_ISO|        lat|              long|total_cases|note_it|note_en|
+-------------------+-----+-----------+------------+-------------+--------------------+------------+-----------+------------------+-----------+-------+-------+
|2020-02-24 18:00:00|  ITA|         13|     Abruzzo|           69|              Chieti|          CH|42.35103167|       14.16754574|          0|   null|   null|
|2020-02-24 18:00:00|  ITA|         13|     Abruzzo|           66|            L'Aquila|          AQ|42.35122196|       13.39843823|          0|   null|   null|
|2020-02-24 18:00:00|  ITA|         13|     Abruzzo|           68|             Pescara|          PE|42.46458398|       14.21364822|          0|   null|   null|
|2020-02-24 18:00:00|  ITA|         13| 

<h3 style="color:#0FCBC6"> => Prepare features </h3>

In [142]:
""" Combine two columns into a single feature vector, appended as the last column.
N.B.
VectorAssembler is a transformer that takes a set of input columns and combines their values into a single vector column.
"""
assembler = VectorAssembler(inputCols=["region_code", "province_code"], outputCol="features")
dataset_ass = assembler.transform(dataset_covid)

dataset_ass.show()

+-------------------+-----+-----------+------------+-------------+--------------------+------------+-----------+------------------+-----------+-------+-------+------------+
|               date|state|region_code|      region|province_code|            province|province_ISO|        lat|              long|total_cases|note_it|note_en|    features|
+-------------------+-----+-----------+------------+-------------+--------------------+------------+-----------+------------------+-----------+-------+-------+------------+
|2020-02-24 18:00:00|  ITA|         13|     Abruzzo|           69|              Chieti|          CH|42.35103167|       14.16754574|          0|   null|   null| [13.0,69.0]|
|2020-02-24 18:00:00|  ITA|         13|     Abruzzo|           66|            L'Aquila|          AQ|42.35122196|       13.39843823|          0|   null|   null| [13.0,66.0]|
|2020-02-24 18:00:00|  ITA|         13|     Abruzzo|           68|             Pescara|          PE|42.46458398|       14.21364822|    

In [143]:
# Create an aliased reference to the DataFrame (DataFrames are immutable, so alias() does not deep-copy the data)
copied_covid = dataset_covid.alias("copying_data")
copied_covid.show()

+-------------------+-----+-----------+------------+-------------+--------------------+------------+-----------+------------------+-----------+-------+-------+
|               date|state|region_code|      region|province_code|            province|province_ISO|        lat|              long|total_cases|note_it|note_en|
+-------------------+-----+-----------+------------+-------------+--------------------+------------+-----------+------------------+-----------+-------+-------+
|2020-02-24 18:00:00|  ITA|         13|     Abruzzo|           69|              Chieti|          CH|42.35103167|       14.16754574|          0|   null|   null|
|2020-02-24 18:00:00|  ITA|         13|     Abruzzo|           66|            L'Aquila|          AQ|42.35122196|       13.39843823|          0|   null|   null|
|2020-02-24 18:00:00|  ITA|         13|     Abruzzo|           68|             Pescara|          PE|42.46458398|       14.21364822|          0|   null|   null|
|2020-02-24 18:00:00|  ITA|         13| 

In [144]:
""" Remove useless columns => note that names in the list that do not exist in the DataFrame (like _c0) are ignored. """
columns_to_exclude = ["_c0", "date", "note_it", "note_en"]
df_covid = copied_covid.drop(*columns_to_exclude)
df_covid.show(n=3)

+-----+-----------+-------+-------------+--------+------------+-----------+-----------+-----------+
|state|region_code| region|province_code|province|province_ISO|        lat|       long|total_cases|
+-----+-----------+-------+-------------+--------+------------+-----------+-----------+-----------+
|  ITA|         13|Abruzzo|           69|  Chieti|          CH|42.35103167|14.16754574|          0|
|  ITA|         13|Abruzzo|           66|L'Aquila|          AQ|42.35122196|13.39843823|          0|
|  ITA|         13|Abruzzo|           68| Pescara|          PE|42.46458398|14.21364822|          0|
+-----+-----------+-------+-------------+--------+------------+-----------+-----------+-----------+
only showing top 3 rows



<h3 style="color:#0FCBC6"> => Encoding </h3>
<div style="margin-top: -17px;">

+ StringIndexer assigns a unique numerical index to each distinct category in a categorical column.    
+ OneHotEncoder converts categorical variables into binary vectors (0s and 1s). 
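
To make the two bullets concrete, here is a minimal plain-Python sketch of what each step computes. It is an illustration only, not Spark's implementation: the helper names `fit_string_index` and `one_hot` are hypothetical, and the ordering rule (most frequent label first, ties broken alphabetically) is assumed to mirror StringIndexer's default `frequencyDesc` behaviour.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a dense binary vector with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[int(index)] = 1.0
    return vec


# Toy column (not taken from the dataset)
regions = ["Lombardia", "Abruzzo", "Lombardia", "Veneto", "Lombardia", "Veneto"]

mapping = fit_string_index(regions)        # label -> numeric index
indexed = [mapping[r] for r in regions]    # what StringIndexer would add as a new column
encoded = [one_hot(i, len(mapping)) for i in indexed]  # what OneHotEncoder would add

print(mapping)
print(encoded[1])
```

The index itself carries no meaning beyond identifying the category, which is why a one-hot vector is often preferred as input to a linear model.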

In [145]:
""" StringIndexer """
string_indexer = StringIndexer(inputCol="state", outputCol="stateIndex")
# Fit and transform the DataFrame with StringIndexer
indexed_df = string_indexer.fit(df_covid).transform(df_covid)
indexed_df.show()

+-----+-----------+------------+-------------+--------------------+------------+-----------+------------------+-----------+----------+
|state|region_code|      region|province_code|            province|province_ISO|        lat|              long|total_cases|stateIndex|
+-----+-----------+------------+-------------+--------------------+------------+-----------+------------------+-----------+----------+
|  ITA|         13|     Abruzzo|           69|              Chieti|          CH|42.35103167|       14.16754574|          0|       0.0|
|  ITA|         13|     Abruzzo|           66|            L'Aquila|          AQ|42.35122196|       13.39843823|          0|       0.0|
|  ITA|         13|     Abruzzo|           68|             Pescara|          PE|42.46458398|       14.21364822|          0|       0.0|
|  ITA|         13|     Abruzzo|           67|              Teramo|          TE| 42.6589177|       13.70439971|          0|       0.0|
|  ITA|         13|     Abruzzo|          979|In fase d

In [146]:
# Create a StringIndexer for multiple columns
columns_to_index = ["state", "region", "province", "province_ISO"]

# Create a list of StringIndexer stages
indexer_stages = [StringIndexer(inputCol=col, outputCol=col + "_index") for col in columns_to_index]

# Create a pipeline that applies the StringIndexer stages
pipeline = Pipeline(stages=indexer_stages)

# Fit and transform the DataFrame using the pipeline
indexed_df2 = pipeline.fit(df_covid).transform(df_covid)

indexed_df2.show(n=2)

+-----+-----------+-------+-------------+--------+------------+-----------+-----------+-----------+-----------+------------+--------------+------------------+
|state|region_code| region|province_code|province|province_ISO|        lat|       long|total_cases|state_index|region_index|province_index|province_ISO_index|
+-----+-----------+-------+-------------+--------+------------+-----------+-----------+-----------+-----------+------------+--------------+------------------+
|  ITA|         13|Abruzzo|           69|  Chieti|          CH|42.35103167|14.16754574|          0|        0.0|        12.0|          25.0|              22.0|
|  ITA|         13|Abruzzo|           66|L'Aquila|          AQ|42.35122196|13.39843823|          0|        0.0|        12.0|          43.0|               5.0|
+-----+-----------+-------+-------------+--------+------------+-----------+-----------+-----------+-----------+------------+--------------+------------------+
only showing top 2 rows



<h3 style="color:#0FCBC6"> Recap: </h3>
<div style="margin-top: -17px;">

Note that even though the code seems to work, showing the whole DataFrame raises an error because of null values:     
`ERROR TaskSetManager: Task 0 in stage 114.0 failed 1 times; aborting job`

In this case, we can replace the NULL values in the "province_ISO" column (they occur in rows whose province is "In fase di definition") with a dummy value.


In [147]:
# Find columns that contain NULLs (isNull() detects missing values, not floating-point NaN)
nan_columns = [col_name for col_name in df_covid.columns if df_covid.filter(col(col_name).isNull()).count() > 0]
print("Columns with NaN values:", nan_columns)

Columns with NaN values: ['province_ISO']


In [148]:
dummy_value = "NA"
df_covid_ok = df_covid.fillna(dummy_value, subset=["province_ISO"])

In [149]:
nan_columns = [col_name for col_name in df_covid_ok.columns if df_covid_ok.filter(col(col_name).isNull()).count() > 0]
print("Columns with NaN values:", nan_columns)

Columns with NaN values: []


<h3 style="color:#0FCBC6"> => Encoding categorical columns </h3>
<div style="margin-top: -17px;">
StringIndexer is an additive process. <br>
It adds new columns with indexed values, but it doesn't automatically drop the original columns. <br>
The original columns must be dropped manually.

In [150]:
""" OK now it works! """
# Create a StringIndexer for multiple columns
columns_to_index = ["state", "region", "province", "province_ISO"]

# Create a list of StringIndexer stages
indexer_stages = [StringIndexer(inputCol=col, outputCol=col + "_index") for col in columns_to_index]

# Create a pipeline that applies the StringIndexer stages
pipeline = Pipeline(stages=indexer_stages)

# Fit and transform the DataFrame using the pipeline
indexed_df2 = pipeline.fit(df_covid_ok).transform(df_covid_ok)

indexed_df2.show()

+-----+-----------+------------+-------------+--------------------+------------+-----------+------------------+-----------+-----------+------------+--------------+------------------+
|state|region_code|      region|province_code|            province|province_ISO|        lat|              long|total_cases|state_index|region_index|province_index|province_ISO_index|
+-----+-----------+------------+-------------+--------------------+------------+-----------+------------------+-----------+-----------+------------+--------------+------------------+
|  ITA|         13|     Abruzzo|           69|              Chieti|          CH|42.35103167|       14.16754574|          0|        0.0|        12.0|          25.0|              23.0|
|  ITA|         13|     Abruzzo|           66|            L'Aquila|          AQ|42.35122196|       13.39843823|          0|        0.0|        12.0|          43.0|               6.0|
|  ITA|         13|     Abruzzo|           68|             Pescara|          PE|42.46

In [151]:
df_to_use = indexed_df2.drop(*columns_to_index)
df_to_use.show()

+-----------+-------------+-----------+------------------+-----------+-----------+------------+--------------+------------------+
|region_code|province_code|        lat|              long|total_cases|state_index|region_index|province_index|province_ISO_index|
+-----------+-------------+-----------+------------------+-----------+-----------+------------+--------------+------------------+
|         13|           69|42.35103167|       14.16754574|          0|        0.0|        12.0|          25.0|              23.0|
|         13|           66|42.35122196|       13.39843823|          0|        0.0|        12.0|          43.0|               6.0|
|         13|           68|42.46458398|       14.21364822|          0|        0.0|        12.0|          69.0|              64.0|
|         13|           67| 42.6589177|       13.70439971|          0|        0.0|        12.0|          92.0|              91.0|
|         13|          979|        0.0|               0.0|          0|        0.0|        

In [152]:
""" OneHotEncoder.
Before applying it, you need to convert the categorical string column into a numeric column using StringIndexer.
"""
onehot_encoder = OneHotEncoder(inputCol="region_code", outputCol="region_code_ok")
onehot_encoder.setDropLast(False)
# Fit 
ohe = onehot_encoder.fit(df_to_use)
encoded_df = ohe.transform(df_to_use)

encoded_df.show()

+-----------+-------------+-----------+------------------+-----------+-----------+------------+--------------+------------------+---------------+
|region_code|province_code|        lat|              long|total_cases|state_index|region_index|province_index|province_ISO_index| region_code_ok|
+-----------+-------------+-----------+------------------+-----------+-----------+------------+--------------+------------------+---------------+
|         13|           69|42.35103167|       14.16754574|          0|        0.0|        12.0|          25.0|              23.0|(21,[13],[1.0])|
|         13|           66|42.35122196|       13.39843823|          0|        0.0|        12.0|          43.0|               6.0|(21,[13],[1.0])|
|         13|           68|42.46458398|       14.21364822|          0|        0.0|        12.0|          69.0|              64.0|(21,[13],[1.0])|
|         13|           67| 42.6589177|       13.70439971|          0|        0.0|        12.0|          92.0|              
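
The `region_code_ok` values above, such as `(21,[13],[1.0])`, are sparse vectors written as (size, indices, values). The following plain-Python sketch expands that notation into a dense list; `sparse_to_dense` is a hypothetical helper written for illustration, not a Spark function.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# The vector shown for region_code = 13 in the output above
dense = sparse_to_dense(21, [13], [1.0])
print(dense)
```

Because OneHotEncoder treats its input value as a category index, a `region_code` of 13 sets position 13 of the vector.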

In [153]:
# Move the label column to the last position

column_to_move = "total_cases"
# Select columns in the desired order
ordered_columns = [col for col in df_to_use.columns if col != column_to_move] + [column_to_move]
# Reorder the columns
df_to_use = df_to_use.select(*ordered_columns)
df_to_use.show()

+-----------+-------------+-----------+------------------+-----------+------------+--------------+------------------+-----------+
|region_code|province_code|        lat|              long|state_index|region_index|province_index|province_ISO_index|total_cases|
+-----------+-------------+-----------+------------------+-----------+------------+--------------+------------------+-----------+
|         13|           69|42.35103167|       14.16754574|        0.0|        12.0|          25.0|              23.0|          0|
|         13|           66|42.35122196|       13.39843823|        0.0|        12.0|          43.0|               6.0|          0|
|         13|           68|42.46458398|       14.21364822|        0.0|        12.0|          69.0|              64.0|          0|
|         13|           67| 42.6589177|       13.70439971|        0.0|        12.0|          92.0|              91.0|          0|
|         13|          979|        0.0|               0.0|        0.0|        12.0|       

In [154]:
# Get unique values of the specified column
col_to_look = "total_cases"
unique_values_total_case = df_to_use.select(col_to_look).distinct()

unique_values_list = unique_values_total_case.rdd.flatMap(lambda x: x).collect()

unique_values_list[:10]

                                                                                

[496, 148, 2122, 463, 833, 471, 1645, 1591, 2142, 1088]

In [155]:
# Split the DataFrame into train and test sets
train_ratio = 0.8  # Ratio for the train set
test_ratio = 1 - train_ratio  # Ratio for the test set

train_df, test_df = df_to_use.randomSplit([train_ratio, test_ratio], seed=42)

In [156]:
train_df.show()

+-----------+-------------+----------+-----------+-----------+------------+--------------+------------------+-----------+
|region_code|province_code|       lat|       long|state_index|region_index|province_index|province_ISO_index|total_cases|
+-----------+-------------+----------+-----------+-----------+------------+--------------+------------------+-----------+
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          0|
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          2|
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          3|
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          6|
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          7|
|          1|           

In [157]:
col_names_feat = train_df.columns
col_names_feat

['region_code',
 'province_code',
 'lat',
 'long',
 'state_index',
 'region_index',
 'province_index',
 'province_ISO_index',
 'total_cases']

In [158]:
# Remove the target column (total_cases) by name rather than by position
col_names_feat.remove("total_cases")
col_names_feat


['region_code',
 'province_code',
 'lat',
 'long',
 'state_index',
 'region_index',
 'province_index',
 'province_ISO_index']

<h3 style="color:#0FCBC6"> => Linear Regression </h3>
<div style="margin-top: -17px;">

In PySpark, LinearRegression reads its inputs through the featuresCol parameter, which must name a single column containing a vector of values.     
This matches Spark's DataFrame structure, where each row's feature vector is stored in one vector column.          
This approach is consistent with the way Spark handles distributed data processing and allows for more flexibility in data preparation and transformation.    

The outputCol you specify in the VectorAssembler is the name of the column that will contain the assembled feature vectors.                       
It does not need to exist in the DataFrame beforehand: the VectorAssembler creates it.
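
As a rough picture of what the assembler does to one row, consider this plain-Python sketch. The `assemble` helper and the dictionary-based row are assumptions made for illustration; Spark operates on distributed DataFrame columns rather than on Python dictionaries.

```python
def assemble(row, input_cols):
    """Collect the chosen columns of one row into a single list of floats."""
    return [float(row[name]) for name in input_cols]


# One row, using values from the first record of the dataset
row = {"region_code": 13, "province_code": 69, "lat": 42.35103167, "total_cases": 0}
input_cols = ["region_code", "province_code", "lat"]

# The new column is added next to the existing ones; nothing is dropped
assembled_row = dict(row, new_assembled_feature=assemble(row, input_cols))
print(assembled_row["new_assembled_feature"])
```

The target column `total_cases` stays outside the vector, since it is passed to the model separately through labelCol.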


In [159]:
assembler = VectorAssembler(inputCols=['region_code',
    'province_code',
    'lat',
    'long',
    'state_index',
    'region_index',
    'province_index',
    'province_ISO_index'],
    
    outputCol="new_assembled_feature")

X_train_assembled = assembler.transform(train_df)
X_train_assembled.show()

+-----------+-------------+----------+-----------+-----------+------------+--------------+------------------+-----------+---------------------+
|region_code|province_code|       lat|       long|state_index|region_index|province_index|province_ISO_index|total_cases|new_assembled_feature|
+-----------+-------------+----------+-----------+-----------+------------+--------------+------------------+-----------+---------------------+
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          0| [1.0,1.0,45.07327...|
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          2| [1.0,1.0,45.07327...|
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          3| [1.0,1.0,45.07327...|
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          6| [1.0,1.0,45.07

In [160]:
# Drop the original columns and keep only the features and target columns
selected_columns = ["new_assembled_feature", "total_cases"]
train_df_ok = X_train_assembled.select(selected_columns)
train_df_ok.show()

+---------------------+-----------+
|new_assembled_feature|total_cases|
+---------------------+-----------+
| [1.0,1.0,45.07327...|          0|
| [1.0,1.0,45.07327...|          2|
| [1.0,1.0,45.07327...|          3|
| [1.0,1.0,45.07327...|          6|
| [1.0,1.0,45.07327...|          7|
| [1.0,1.0,45.07327...|         11|
| [1.0,1.0,45.07327...|         19|
| [1.0,1.0,45.07327...|         34|
| [1.0,1.0,45.07327...|         49|
| [1.0,1.0,45.07327...|         55|
| [1.0,1.0,45.07327...|         89|
| [1.0,1.0,45.07327...|        111|
| [1.0,1.0,45.07327...|        159|
| [1.0,1.0,45.07327...|        187|
| [1.0,1.0,45.07327...|        305|
| [1.0,1.0,45.07327...|        359|
| [1.0,1.0,45.07327...|        542|
| [1.0,1.0,45.07327...|        749|
| [1.0,1.0,45.07327...|       1042|
| [1.0,1.0,45.07327...|       1556|
+---------------------+-----------+
only showing top 20 rows



In [161]:
""" Create a LinearRegression model """
# lr = LinearRegression(labelCol="total_cases")  # not enough: the features column must be passed as well as the target
lr = LinearRegression(featuresCol="new_assembled_feature", labelCol="total_cases")
# lr = LinearRegression(featuresCol="new_assembled_feature", labelCol="total_cases", regParam=0.01)

# Train the model on the training data
lr_model = lr.fit(train_df_ok)

23/08/16 16:24:25 WARN Instrumentation: [165f46f2] regParam is zero, which might cause numerical instability and overfitting.
23/08/16 16:24:26 WARN Instrumentation: [165f46f2] Cholesky solver failed due to singular covariance matrix. Retrying with Quasi-Newton solver.


In [162]:
# Make predictions on new data
new_data = spark.createDataFrame([(1, 2, 43.1325615, 8.680687483, 0.0, 5.0, 89.0, 88.0)], col_names_feat)
new_assembled_data = assembler.transform(new_data)
predictions = lr_model.transform(new_assembled_data)

# Show the predictions
predictions.show()

+-----------+-------------+----------+-----------+-----------+------------+--------------+------------------+---------------------+-----------------+
|region_code|province_code|       lat|       long|state_index|region_index|province_index|province_ISO_index|new_assembled_feature|       prediction|
+-----------+-------------+----------+-----------+-----------+------------+--------------+------------------+---------------------+-----------------+
|          1|            2|43.1325615|8.680687483|        0.0|         5.0|          89.0|              88.0| [1.0,2.0,43.13256...|1902.715534136977|
+-----------+-------------+----------+-----------+-----------+------------+--------------+------------------+---------------------+-----------------+



In [163]:
# Make predictions on the training data (test_df would first need to be assembled in the same way)
predictions = lr_model.transform(train_df_ok)

# Show the predictions
predictions.show()

+---------------------+-----------+------------------+
|new_assembled_feature|total_cases|        prediction|
+---------------------+-----------+------------------+
| [1.0,1.0,45.07327...|          0|1947.9802392847282|
| [1.0,1.0,45.07327...|          2|1947.9802392847282|
| [1.0,1.0,45.07327...|          3|1947.9802392847282|
| [1.0,1.0,45.07327...|          6|1947.9802392847282|
| [1.0,1.0,45.07327...|          7|1947.9802392847282|
| [1.0,1.0,45.07327...|         11|1947.9802392847282|
| [1.0,1.0,45.07327...|         19|1947.9802392847282|
| [1.0,1.0,45.07327...|         34|1947.9802392847282|
| [1.0,1.0,45.07327...|         49|1947.9802392847282|
| [1.0,1.0,45.07327...|         55|1947.9802392847282|
| [1.0,1.0,45.07327...|         89|1947.9802392847282|
| [1.0,1.0,45.07327...|        111|1947.9802392847282|
| [1.0,1.0,45.07327...|        159|1947.9802392847282|
| [1.0,1.0,45.07327...|        187|1947.9802392847282|
| [1.0,1.0,45.07327...|        305|1947.9802392847282|
| [1.0,1.0

<h3 style="color:#0FCBC6"> => Retry with Normalization </h3>

In [168]:
scaler = StandardScaler(inputCol="new_assembled_feature", outputCol="scaled_assembled_features", withStd=True, withMean=True)
type(scaler)

pyspark.ml.feature.StandardScaler
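
With `withMean=True` and `withStd=True`, each feature is centred on its mean and divided by its standard deviation. The sketch below reproduces that arithmetic for a single column in plain Python. It assumes the sample standard deviation (n - 1 in the denominator) and a 0.0 result for zero-variance columns, which is how I understand StandardScaler to behave; the `standardize` helper and the toy values are hypothetical.

```python
from statistics import mean, stdev


def standardize(column):
    """Return (x - mean) / std for each value; a constant column maps to 0.0."""
    mu = mean(column)
    sigma = stdev(column)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


region_code = [1, 1, 13, 13, 20]          # toy values, not the real column
scaled = standardize(region_code)
constant = standardize([0.0, 0.0, 0.0])   # e.g. a column where every row is 0.0

print(scaled)
print(constant)
```

After scaling, the column has mean 0 and standard deviation 1, so features measured on very different scales (codes, latitudes, indices) become comparable.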

In [169]:
pipeline = Pipeline(stages=[assembler, scaler])

In [170]:
normalized_df = pipeline.fit(train_df).transform(train_df)
normalized_df.show()

+-----------+-------------+----------+-----------+-----------+------------+--------------+------------------+-----------+---------------------+-------------------------+
|region_code|province_code|       lat|       long|state_index|region_index|province_index|province_ISO_index|total_cases|new_assembled_feature|scaled_assembled_features|
+-----------+-------------+----------+-----------+-----------+------------+--------------+------------------+-----------+---------------------+-------------------------+
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          0| [1.0,1.0,45.07327...|     [-1.5339445285816...|
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          2| [1.0,1.0,45.07327...|     [-1.5339445285816...|
|          1|            1|45.0732745|7.680687483|        0.0|         4.0|          94.0|              93.0|          3| [1.0,1.0,45.07327...|     [-

In [171]:
# Select only the relevant columns for regression
selected_columns = ["scaled_assembled_features", "total_cases"]
regression_df = normalized_df.select(selected_columns)

regression_df.show()

+-------------------------+-----------+
|scaled_assembled_features|total_cases|
+-------------------------+-----------+
|     [-1.5339445285816...|          0|
|     [-1.5339445285816...|          2|
|     [-1.5339445285816...|          3|
|     [-1.5339445285816...|          6|
|     [-1.5339445285816...|          7|
|     [-1.5339445285816...|         11|
|     [-1.5339445285816...|         19|
|     [-1.5339445285816...|         34|
|     [-1.5339445285816...|         49|
|     [-1.5339445285816...|         55|
|     [-1.5339445285816...|         89|
|     [-1.5339445285816...|        111|
|     [-1.5339445285816...|        159|
|     [-1.5339445285816...|        187|
|     [-1.5339445285816...|        305|
|     [-1.5339445285816...|        359|
|     [-1.5339445285816...|        542|
|     [-1.5339445285816...|        749|
|     [-1.5339445285816...|       1042|
|     [-1.5339445285816...|       1556|
+-------------------------+-----------+
only showing top 20 rows



<h3 style="color:#0FCBC6"> Recap: </h3>
<div style="margin-top: -20px;">
A solver is the optimization technique used to find the parameter values that minimize the chosen objective function, i.e. reduce the loss. <br>

For instance, a LinearRegression model can use the Limited-memory Broyden-Fletcher-Goldfarb-Shanno ("l-bfgs") solver, a quasi-Newton method, <br>
=> instead of the default "auto", which in the earlier run triggered the Cholesky warning. <br>
The "normal" solver, which solves the normal equations directly, is the other available option. <br>
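
The earlier warning mentioned a singular covariance matrix. One plausible contributor in this notebook is `state_index`, which is 0.0 in every row shown, so it has zero variance. The plain-Python sketch below shows, on toy data, how a constant column makes the covariance matrix singular (determinant 0). The helpers `covariance_matrix` and `det2` are hypothetical and only handle this small illustration.

```python
def covariance_matrix(columns):
    """Sample covariance matrix of a list of equally long columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [
        [
            sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
            for cb, mb in zip(columns, means)
        ]
        for ca, ma in zip(columns, means)
    ]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


region_index = [4.0, 4.0, 12.0, 12.0]   # varies between rows
state_index = [0.0, 0.0, 0.0, 0.0]      # constant column

cov = covariance_matrix([region_index, state_index])
print(cov)
print(det2(cov))
```

A matrix with determinant 0 cannot be inverted, which is why a direct solve fails and an iterative method is used instead. Dropping constant columns before assembling the features is one way to avoid the problem.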

In [174]:
""" Define LinearRegression model """
lr = LinearRegression(featuresCol="scaled_assembled_features", labelCol="total_cases", solver="l-bfgs")

# Train the model on the regression data
lr_model = lr.fit(regression_df)

# Make predictions on the regression data
predictions = lr_model.transform(regression_df)

# Show the predictions
predictions.show()

+-------------------------+-----------+-----------------+
|scaled_assembled_features|total_cases|       prediction|
+-------------------------+-----------+-----------------+
|     [-1.5339445285816...|          0|1947.980239281093|
|     [-1.5339445285816...|          2|1947.980239281093|
|     [-1.5339445285816...|          3|1947.980239281093|
|     [-1.5339445285816...|          6|1947.980239281093|
|     [-1.5339445285816...|          7|1947.980239281093|
|     [-1.5339445285816...|         11|1947.980239281093|
|     [-1.5339445285816...|         19|1947.980239281093|
|     [-1.5339445285816...|         34|1947.980239281093|
|     [-1.5339445285816...|         49|1947.980239281093|
|     [-1.5339445285816...|         55|1947.980239281093|
|     [-1.5339445285816...|         89|1947.980239281093|
|     [-1.5339445285816...|        111|1947.980239281093|
|     [-1.5339445285816...|        159|1947.980239281093|
|     [-1.5339445285816...|        187|1947.980239281093|
|     [-1.5339