<a href="https://colab.research.google.com/github/jtao/dswebinar/blob/master/pyspark/PySpark_MLlib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark DataFrames and SQL

[Jian Tao](https://tx.ag/jtao), Texas A&M University

June 30, 2023

### 1. Set up the PySpark environment first

In [None]:
# Run this cell in every new Google Colab session to make sure that PySpark is installed.
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
  !pip install pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").config('spark.ui.port', '4050').getOrCreate()
spark
# !wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
# !unzip -o ngrok-stable-linux-amd64.zip
# get_ipython().system_raw('./ngrok http 4050 &')
# !curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(\"\nClick me to launch (give it a minute or two)\n\"); print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

### 2. Create a DataFrame by reading from a CSV/JSON file

`spark.read.csv` reads from file systems that Spark can access (the local disk, HDFS, and so on) rather than from an arbitrary web URL, so we have to download the CSV file first. We can use `SparkFiles` to do that, or load the file with `pandas`. For CSV files with a header row, make sure to pass `header=True` to `spark.read.csv`. When the data types of the columns are not known in advance, `inferSchema=True` asks Spark to infer them from the data, but the inference is not perfect. In our example, `Horsepower` is not correctly recognized.

In [None]:
from pyspark import SparkFiles

if IN_COLAB:
  csv_url = "https://raw.githubusercontent.com/jtao/dswebinar/master/pyspark/Auto.csv"
  json_url = "https://raw.githubusercontent.com/jtao/dswebinar/master/pyspark/Auto.json"
else:
  csv_url = "Auto.csv"  
  json_url = "Auto.json"
spark.sparkContext.addFile(csv_url)
spark.sparkContext.addFile(json_url)

# A Spark DataFrame can also be created from a pandas DataFrame.
# import pandas as pd
# df = spark.createDataFrame(pd.read_csv(csv_url))

#df = spark.read.csv(SparkFiles.get("Auto.csv"), header=True, sep=",", inferSchema=False)
df = spark.read.csv(SparkFiles.get("Auto.csv"), header=True, sep=",", inferSchema=True)

df.printSchema()
df.show(5)
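
A column that is mostly numeric is usually misrecognized because a few rows hold a non-numeric placeholder. The toy sketch below is not Spark's implementation; it only mimics the idea that a column counts as numeric when every non-missing value parses as a number. The sample values and the `?` placeholder are assumptions for illustration (check what your copy of `Auto.csv` actually contains), and `infer_column_type` is a hypothetical helper.

```python
def infer_column_type(values, null_token=None):
    """Toy type inference: 'int', 'double', or 'string' for a list of raw CSV strings.

    Values equal to null_token are treated as missing and skipped.
    """
    def parses(cast, text):
        try:
            cast(text)
            return True
        except ValueError:
            return False

    present = [v for v in values if v != null_token]
    if present and all(parses(int, v) for v in present):
        return "int"
    if present and all(parses(float, v) for v in present):
        return "double"
    return "string"


# Hypothetical horsepower values; "?" stands in for a missing measurement.
horsepower = ["130", "165", "?", "150"]

print(infer_column_type(horsepower))                  # one bad token forces 'string'
print(infer_column_type(horsepower, null_token="?"))  # treating "?" as missing gives 'int'
```

In PySpark, the comparable reader option is `nullValue`, for example `spark.read.csv(path, header=True, inferSchema=True, nullValue="?")`. Casting the column after loading is another option.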

### 3. Create a Linear Regression Model with MLlib

First, we assemble the predictor columns into a single `features` vector column, then split the dataset into training (70%) and testing (30%) datasets.

In [None]:
from pyspark.ml.feature import VectorAssembler

# Combine the predictor columns into a single vector column named 'features'.
vectorAssembler = VectorAssembler(inputCols=['weight', 'displacement', 'acceleration', 'cylinders'], outputCol='features')
df = vectorAssembler.transform(df)
df = df.select(['features', 'mpg'])

# Randomly split the rows into roughly 70% training and 30% testing data.
train_df, test_df = df.randomSplit([0.7, 0.3])
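
`randomSplit` assigns rows to the splits at random, so the 70/30 proportions hold only approximately and change from run to run unless a seed is passed, as in `df.randomSplit([0.7, 0.3], seed=42)`. The pure-Python sketch below illustrates the idea of a weighted random split on a plain list. `weighted_random_split` is a hypothetical helper, not a Spark API, and Spark's own sampling details differ.

```python
import random


def weighted_random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds in [0, 1], e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off

    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform draw in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train, test = weighted_random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # close to 700 and 300, but usually not exact
```

Fixing the seed makes the split repeatable, which helps when comparing models across runs.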

In [None]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol = 'features', labelCol='mpg', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))
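
The `regParam` and `elasticNetParam` arguments control how strongly the coefficients are penalized. As described in the Spark MLlib documentation, the penalty has the form `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`, so `elasticNetParam=0.8` puts most of the weight on the L1 (lasso) term. The sketch below only evaluates that penalty for a given coefficient vector. The coefficient values are made up, `elastic_net_penalty` is a hypothetical helper, and Spark's default feature standardization is ignored.

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Return reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2) for the coefficients."""
    l1 = sum(abs(w) for w in coefficients)
    l2_squared = sum(w * w for w in coefficients)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


# Made-up coefficients, with the same regParam and elasticNetParam values as the model above.
coefficients = [1.0, -2.0]
penalty = elastic_net_penalty(coefficients, reg_param=0.3, elastic_net_param=0.8)
print(round(penalty, 4))
```

Setting `elasticNetParam` to 0 leaves only the L2 (ridge) term, and setting it to 1 leaves only the L1 term.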

In [None]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
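
For reference, the two summary metrics can be reproduced by hand. The sketch below computes RMSE and R2 from plain lists using their standard definitions. The sample numbers are made up and are not taken from the Auto dataset.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [1.0, 2.0, 3.0]       # made-up mpg values
predictions = [1.0, 2.0, 4.0]  # made-up model outputs

print("RMSE: %f" % rmse(labels, predictions))
print("r2: %f" % r2(labels, predictions))
```

RMSE is expressed in the units of the label (mpg here), while R2 is unitless, with 1.0 meaning a perfect fit.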

In [None]:
train_df.describe().show()

Check the results with the test data.

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# Predict mpg for the held-out test rows.
lr_predictions = lr_model.transform(test_df)
lr_predictions.select("prediction", "mpg", "features").show(5)

# Score the predictions with R squared on the test data.
lr_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="mpg", metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

### 4. References

[SQL References](https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html)