<a href="https://colab.research.google.com/github/jtao/dswebinar/blob/master/pyspark/PySpark_MLlib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark MLlib

[Jian Tao](https://tx.ag/jtao), Texas A&M University

June 30, 2023

### 1. Set up the PySpark environment first

In [1]:
# In each new Google Colab session, run this cell first to make sure PySpark is installed.
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
  !pip install pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").config('spark.ui.port', '4050').getOrCreate()
spark
# !wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
# !unzip -o ngrok-stable-linux-amd64.zip
# get_ipython().system_raw('./ngrok http 4050 &')
# !curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(\"\nClick me to launch (give it a minute or two)\n\"); print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

23/07/03 06:26:33 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.1.33 instead (on interface enp4s0)
23/07/03 06:26:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/03 06:26:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### 2. Create a DataFrame by reading from a CSV/JSON file

`spark.read.csv` reads from file systems that Spark can access (local disk, HDFS, and so on), not directly from an HTTP(S) URL, so we have to download the CSV file first. We can use `SparkFiles` to do that, or use `pandas`. For CSV files with a header row, make sure to pass `header=True` to `spark.read.csv`. When the column data types are not known in advance, `inferSchema=True` asks Spark to infer them from the data, but the inference is not perfect. In our example, `horsepower` is inferred as a string rather than a numeric type.
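To see why one column can end up as a string, here is a toy model of schema inference in plain Python. It is only a sketch of the idea, not Spark's actual implementation: `infer_column_type` is a hypothetical helper, and the `?` placeholder is an assumption about what the `horsepower` column might contain.

```python
def infer_column_type(values):
    """Return the narrowest of 'integer', 'double', 'string' that fits every value."""
    def fits(cast):
        # A type fits only if every single value in the column parses with it.
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if fits(int):
        return "integer"
    if fits(float):
        return "double"
    return "string"


print(infer_column_type(["130", "165", "150"]))  # → integer
print(infer_column_type(["12.0", "11.5"]))       # → double
print(infer_column_type(["130", "?", "150"]))    # → string (one bad value is enough)
```

If that is what happens here, two options worth trying are passing `nullValue="?"` to `spark.read.csv`, or casting the column afterwards with `df.withColumn("horsepower", df["horsepower"].cast("double"))`.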

In [2]:
from pyspark import SparkFiles

csv_url = "https://raw.githubusercontent.com/jtao/AdvancedML/main/data/Auto.csv"
json_url = "https://raw.githubusercontent.com/jtao/dswebinar/master/pyspark/Auto.json"

spark.sparkContext.addFile(csv_url)
spark.sparkContext.addFile(json_url)

## One can also create a Spark DataFrame from a pandas DataFrame.
# import pandas as pd
# df = spark.createDataFrame(pd.read_csv(csv_url))

#df = spark.read.csv(SparkFiles.get("Auto.csv"), header=True, sep=",", inferSchema=False)
df = spark.read.csv(SparkFiles.get("Auto.csv"), header=True, sep=",", inferSchema=True)

df.printSchema()
df.show(5)

root
 |-- mpg: double (nullable = true)
 |-- cylinders: integer (nullable = true)
 |-- displacement: double (nullable = true)
 |-- horsepower: string (nullable = true)
 |-- weight: integer (nullable = true)
 |-- acceleration: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- origin: integer (nullable = true)
 |-- name: string (nullable = true)

+----+---------+------------+----------+------+------------+----+------+--------------------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|year|origin|                name|
+----+---------+------------+----------+------+------------+----+------+--------------------+
|18.0|        8|       307.0|       130|  3504|        12.0|  70|     1|chevrolet chevell...|
|15.0|        8|       350.0|       165|  3693|        11.5|  70|     1|   buick skylark 320|
|18.0|        8|       318.0|       150|  3436|        11.0|  70|     1|  plymouth satellite|
|16.0|        8|       304.0|       150|  3433|        12.0|  70|     1|

### 3. Create a Linear Regression Model with MLlib

First, we assemble the predictor columns into a single feature vector, then split the dataset into training (70%) and testing (30%) sets.

In [3]:
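from pyspark.ml.feature import VectorAssembler

# Combine the predictor columns into a single vector column named 'features'.
vectorAssembler = VectorAssembler(inputCols=['weight', 'displacement', 'acceleration', 'cylinders'], outputCol='features')
df = vectorAssembler.transform(df)
# Keep only the feature vector and the label.
df = df.select(['features', 'mpg'])
# Randomly split the rows into roughly 70% and 30%.
splits = df.randomSplit([0.7, 0.3])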

train_df = splits[0]
test_df = splits[1]
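The 70/30 figures are targets, not exact sizes, because each row is assigned to a split independently at random. The following plain-Python sketch mimics that behavior under stated assumptions: `random_split` is a hypothetical helper, the 400 row ids are made up, and the real `randomSplit` differs in its internals.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative normalized boundaries, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, b in enumerate(bounds):
            # The last split also catches any floating-point rounding in the boundaries.
            if u < b or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(400))  # hypothetical row ids
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # close to 280 and 120, but usually not exact
```

Fixing the seed makes the sketch reproducible. `DataFrame.randomSplit` also accepts a `seed` argument, which may help when the numbers need to be comparable between runs.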

In [4]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol = 'features', labelCol='mpg', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [-0.004944975012578205,-0.01846786927798166,0.06492313493694646,-0.15997363527275718]
Intercept: 41.66960364686994
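A fitted linear model predicts with the intercept plus the dot product of the coefficients and the feature vector. The sketch below reproduces that by hand using the numbers printed above; `predict_mpg` is a hypothetical helper, and the feature order is assumed to follow the `inputCols` list given to `VectorAssembler`.

```python
# Coefficients and intercept copied from the output above.
coefficients = [-0.004944975012578205, -0.01846786927798166,
                0.06492313493694646, -0.15997363527275718]
intercept = 41.66960364686994

# Assumed feature order: the VectorAssembler inputCols.
feature_names = ['weight', 'displacement', 'acceleration', 'cylinders']


def predict_mpg(features):
    """Linear prediction: intercept + sum of coefficient * feature value."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# First row shown by df.show(5): weight=3504, displacement=307.0,
# acceleration=12.0, cylinders=8 (recorded mpg: 18.0).
first_row = [3504, 307.0, 12.0, 8]
print(round(predict_mpg(first_row), 2))  # → 18.17
```

Whether that row landed in the training or the testing split is not known, so the close match is only an illustration, not an evaluation.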


In [5]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

RMSE: 4.383880
r2: 0.698442


In [6]:
train_df.describe().show()

+-------+-----------------+
|summary|              mpg|
+-------+-----------------+
|  count|              261|
|   mean|23.53103448275862|
| stddev|7.998470676104189|
|    min|             10.0|
|    max|             46.6|
+-------+-----------------+
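These summary statistics can be tied back to the training metrics, since R2 equals one minus the ratio of the residual sum of squares to the total sum of squares. The check below is a sketch that assumes `describe()` reports the sample standard deviation (divisor n - 1) and uses the RMSE as printed, rounded to six decimals.

```python
n = 261                     # count from describe()
stddev = 7.998470676104189  # standard deviation of mpg from describe()
rmse = 4.383880             # training RMSE as printed above

sse = n * rmse ** 2          # residual sum of squares
sst = (n - 1) * stddev ** 2  # total sum of squares around the mean
r2 = 1 - sse / sst

print(round(r2, 4))  # → 0.6984
```

The result agrees with the `r2` value reported by the training summary. Put another way, an RMSE of about 4.4 against a standard deviation of about 8.0 means the model removes roughly 70% of the variance in `mpg`.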



Check the results with the test data.

In [7]:
from pyspark.ml.evaluation import RegressionEvaluator

# Score the held-out test set and preview predictions next to the true mpg.
lr_predictions = lr_model.transform(test_df)
lr_predictions.select("prediction", "mpg", "features").show(5)

# Compute R2 on the test predictions.
lr_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="mpg", metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

+------------------+----+--------------------+
|        prediction| mpg|            features|
+------------------+----+--------------------+
| 31.60905771530126|33.0|[1795.0,91.0,17.5...|
|31.253796036987968|36.1|[1800.0,98.0,14.4...|
|31.402781176548245|27.0|[1834.0,97.0,19.0...|
| 31.49522090394108|26.0|[1835.0,97.0,20.5...|
|31.234383612254277|39.0|[1875.0,86.0,16.4...|
+------------------+----+--------------------+
only showing top 5 rows

R Squared (R2) on test data = 0.697544
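For reference, the two metrics used in this notebook can be written out in a few lines of plain Python. This is a minimal sketch with hypothetical helper names (`rmse`, `r2_score`), applied only to the five rows displayed above, not to the full test set.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# The five rows shown in the output above.
mpg = [33.0, 36.1, 27.0, 26.0, 39.0]
prediction = [31.60905771530126, 31.253796036987968, 31.402781176548245,
              31.49522090394108, 31.234383612254277]

print("RMSE:", rmse(mpg, prediction))
print("R2:  ", r2_score(mpg, prediction))
```

On these five rows the RMSE is roughly 5.2 and R2 comes out slightly negative, because the predictions barely vary while the actual values range from 26 to 39. A handful of rows is not representative, which is why the evaluator is run on the whole test set. To get the test RMSE from Spark, the same `RegressionEvaluator` can be created with `metricName="rmse"`.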


### 4. References:

[Spark SQL reference: ANSI compliance](https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html)