# PySpark Exercises 4.c & 4.d 
---
Özgün Yargı
20811



## Install Dependencies

In [None]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=fdb84b930d5e9cc2036b1e98d217ea46e7a7b1e5e201d1bac6a65c7d288587ef
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Libraries

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, MinMaxScaler, PCA
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

## Get Data

In [None]:
DATA = "auto-mpg.data.txt"

In [None]:
spark = SparkSession.builder.appName('AUTO-MPG').getOrCreate()

df = spark.read.csv(DATA,header=True)

In [None]:
df.show()

+---+---------+------------+----------+------+------------+----------+------+--------------------+
|mpg|cylinders|displacement|horsepower|weight|acceleration|model_year|origin|                name|
+---+---------+------------+----------+------+------------+----------+------+--------------------+
| 18|        8|         307|       130|  3504|          12|        70|     1|chevrolet chevell...|
| 15|        8|         350|       165|  3693|        11.5|        70|     1|   buick skylark 320|
| 18|        8|         318|       150|  3436|          11|        70|     1|  plymouth satellite|
| 16|        8|         304|       150|  3433|          12|        70|     1|       amc rebel sst|
| 17|        8|         302|       140|  3449|        10.5|        70|     1|         ford torino|
| 15|        8|         429|       198|  4341|          10|        70|     1|    ford galaxie 500|
| 14|        8|         454|       220|  4354|           9|        70|     1|    chevrolet impala|
| 14|     

### Convert Object Types to Numeric Type
---
The CSV reader loads every column as a string. Since we need to do arithmetic on these features, we first cast the string columns to a numeric type. The last column, "name", stays a string.

In [None]:
[df.select(column) for column in df.columns]

[DataFrame[mpg: string],
 DataFrame[cylinders: string],
 DataFrame[displacement: string],
 DataFrame[horsepower: string],
 DataFrame[weight: string],
 DataFrame[acceleration: string],
 DataFrame[model_year: string],
 DataFrame[origin: string],
 DataFrame[name: string]]

In [None]:
colsToChange = df.columns[:-1]

for col in colsToChange:
  df = df.withColumn(col, df[col].cast("float"))

In [None]:
[df.select(column) for column in df.columns]

[DataFrame[mpg: float],
 DataFrame[cylinders: float],
 DataFrame[displacement: float],
 DataFrame[horsepower: float],
 DataFrame[weight: float],
 DataFrame[acceleration: float],
 DataFrame[model_year: float],
 DataFrame[origin: float],
 DataFrame[name: string]]
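
The cast above matters for the next step: with default (non-ANSI) settings, Spark turns a string that cannot be parsed as a number into null instead of raising an error. The following is a rough plain-Python analogy of that behavior, not Spark's implementation. The helper name `cast_to_float` and the sample values are hypothetical, and the `"?"` placeholder is an assumption about how a missing value might appear in the file.

```python
from typing import List, Optional


def cast_to_float(raw: Optional[str]) -> Optional[float]:
    """Return the value as a float, or None when it cannot be parsed."""
    if raw is None:
        return None
    try:
        return float(raw.strip())
    except ValueError:
        # Unparseable text becomes a missing value instead of an error.
        return None


# Hypothetical raw column; "?" stands in for a missing-value placeholder.
raw_horsepower: List[str] = ["130", "165", "?", "150"]
parsed = [cast_to_float(value) for value in raw_horsepower]
print(parsed)  # [130.0, 165.0, None, 150.0]
```

Under this reading, a column can be free of nulls as strings and still contain nulls after the cast, so the null check belongs after the conversion.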

### Identify Null values
---
Rows that contain null values cannot be used for training, so we first find which columns contain them.

In [None]:
for col in df.columns:
    # count() avoids collecting the matching rows to the driver
    if df.filter(df[col].isNull()).count() != 0:
        print(col)

horsepower


Only the "horsepower" column contains null values. We remove the rows where it is null.

In [None]:
df = df.na.drop()

for col in df.columns:
    # count() avoids collecting the matching rows to the driver
    if df.filter(df[col].isNull()).count() != 0:
        print(col)

Since nothing was printed, no null values remain in the dataset.
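
As a minimal plain-Python sketch of the same check-then-drop logic: count missing values per column, then keep only rows with no missing value, which is what `df.na.drop()` does with its default arguments. The function names and the rows are illustrative and not taken from the dataset.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def null_counts(rows: List[Row]) -> Dict[str, int]:
    """Count missing (None) values per column."""
    counts: Dict[str, int] = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


def drop_null_rows(rows: List[Row]) -> List[Row]:
    """Keep only rows in which every value is present."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Illustrative rows only.
rows: List[Row] = [
    {"mpg": 18.0, "horsepower": 130.0},
    {"mpg": 25.0, "horsepower": None},
    {"mpg": 15.0, "horsepower": 165.0},
]

print(null_counts(rows))  # {'mpg': 0, 'horsepower': 1}
clean = drop_null_rows(rows)
print(len(clean))  # 2
```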

### Split the Data
---
To measure how well the model generalizes, we split the dataset into two pieces: 70% for training and 30% for validation. A fixed seed keeps the split reproducible.

In [None]:
df_Train, df_Test = df.randomSplit([0.7,0.3],seed=42)
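
One detail worth keeping in mind: to my understanding, `randomSplit` assigns each row to a piece independently at random, so the resulting sizes only approximate the 70/30 weights. The sketch below shows that idea in plain Python. `bernoulli_split` is a hypothetical helper, the row ids are stand-ins, and Python's generator will not reproduce the rows Spark picks for the same seed.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def bernoulli_split(
    rows: Sequence[T], train_fraction: float, seed: int
) -> Tuple[List[T], List[T]]:
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train: List[T] = []
    test: List[T] = []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))  # stand-in row ids, not the real data
train, test = bernoulli_split(rows, 0.7, seed=42)

# The sizes are close to 700 and 300 but usually not exactly that.
print(len(train), len(test))
```

Every row lands in exactly one piece, and the same seed gives the same split on the same input.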

### One Hot Encode
---
Since "origin" is a categorical feature, we one-hot encode it before training. The encoder is fit on the training split and then applied to both splits. We do not use the "name" feature because it has too many unique values.

In [None]:
encoder = OneHotEncoder(inputCol='origin',outputCol='originEncoded')
encoderModel = encoder.fit(df_Train)
df_Train = encoderModel.transform(df_Train)
df_Test = encoderModel.transform(df_Test)

In [None]:
df_Train.show(truncate=False)

+----+---------+------------+----------+------+------------+----------+------+--------------------------------+-------------+
|mpg |cylinders|displacement|horsepower|weight|acceleration|model_year|origin|name                            |originEncoded|
+----+---------+------------+----------+------+------------+----------+------+--------------------------------+-------------+
|9.0 |8.0      |304.0       |193.0     |4732.0|18.5        |70.0      |1.0   |hi 1200d                        |(3,[1],[1.0])|
|10.0|8.0      |307.0       |200.0     |4376.0|15.0        |70.0      |1.0   |chevy c20                       |(3,[1],[1.0])|
|11.0|8.0      |318.0       |210.0     |4382.0|13.5        |70.0      |1.0   |dodge d200                      |(3,[1],[1.0])|
|11.0|8.0      |350.0       |180.0     |3664.0|11.0        |73.0      |1.0   |oldsmobile omega                |(3,[1],[1.0])|
|11.0|8.0      |400.0       |150.0     |4997.0|14.0        |73.0      |1.0   |chevrolet impala                |(3,[1],
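
The encoded value `(3,[1],[1.0])` is a sparse vector: length 3, with a 1.0 at index 1. A small plain-Python sketch of the encoding follows, assuming that "origin" takes the values 1, 2 and 3, that the encoder reads them as category indices (so it sees 4 categories, 0 to 3), and that the default `dropLast=True` is in effect. `one_hot_drop_last` is a hypothetical helper, not the library's code.

```python
from typing import List


def one_hot_drop_last(index: int, num_categories: int) -> List[float]:
    """Dense one-hot vector of length num_categories - 1.

    The last category is dropped, so it maps to the all-zero vector.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    vector = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1.0
    return vector


NUM_CATEGORIES = 4  # assumed: indices 0..3 because the largest origin value is 3

for origin in (1, 2, 3):
    print(origin, one_hot_drop_last(origin, NUM_CATEGORIES))
# 1 [0.0, 1.0, 0.0]
# 2 [0.0, 0.0, 1.0]
# 3 [0.0, 0.0, 0.0]
```

Under these assumptions slot 0 is never set, because no row has origin 0. Indexing the column with `StringIndexer` before encoding would probably give a more compact vector, though the notebook does not do that.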

### Normalize Columns
---
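Min-max scaling maps every numeric feature to a common [0, 1] range, so features with large raw magnitudes (such as weight) do not dominate those with small ones (such as cylinders). This matters for the PCA step later, which is sensitive to feature scale. The scaler is fit on the training split only and then applied to both splits.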

#### Vectorize

In [None]:
columnsToNormalize = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]

In [None]:
vec = VectorAssembler(inputCols=columnsToNormalize,
                      outputCol='numericFeatures')

df_Train = vec.transform(df_Train)
df_Test = vec.transform(df_Test)

df_Test.show(truncate=False)

+----+---------+------------+----------+------+------------+----------+------+----------------------------+-------------+----------------------------------+
|mpg |cylinders|displacement|horsepower|weight|acceleration|model_year|origin|name                        |originEncoded|numericFeatures                   |
+----+---------+------------+----------+------+------------+----------+------+----------------------------+-------------+----------------------------------+
|10.0|8.0      |360.0       |215.0     |4615.0|14.0        |70.0      |1.0   |ford f250                   |(3,[1],[1.0])|[8.0,360.0,215.0,4615.0,14.0,70.0]|
|11.0|8.0      |429.0       |208.0     |4633.0|11.0        |72.0      |1.0   |mercury marquis             |(3,[1],[1.0])|[8.0,429.0,208.0,4633.0,11.0,72.0]|
|12.0|8.0      |350.0       |180.0     |4499.0|12.5        |73.0      |1.0   |oldsmobile vista cruiser    |(3,[1],[1.0])|[8.0,350.0,180.0,4499.0,12.5,73.0]|
|12.0|8.0      |383.0       |180.0     |4955.0|11.5       

#### Normalize

In [None]:
scaler = MinMaxScaler(inputCol="numericFeatures", outputCol="normFeatures")
model = scaler.fit(df_Train)
df_Train = model.transform(df_Train)
df_Test = model.transform(df_Test)

df_Test.show(truncate=False)

+----+---------+------------+----------+------+------------+----------+------+----------------------------+-------------+----------------------------------+-------------------------------------------------------------------------------------------------------+
|mpg |cylinders|displacement|horsepower|weight|acceleration|model_year|origin|name                        |originEncoded|numericFeatures                   |normFeatures                                                                                           |
+----+---------+------------+----------+------+------------+----------+------+----------------------------+-------------+----------------------------------+-------------------------------------------------------------------------------------------------------+
|10.0|8.0      |360.0       |215.0     |4615.0|14.0        |70.0      |1.0   |ford f250                   |(3,[1],[1.0])|[8.0,360.0,215.0,4615.0,14.0,70.0]|[1.0,0.7545219638242895,0.9184782608695652,0.8871158392434988
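
For reference, the per-feature arithmetic behind `MinMaxScaler` with its default [0, 1] output range can be sketched in plain Python. The minimum (68) and maximum (455) used for displacement are assumptions about the training split, chosen because they are consistent with the value shown above. `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(value: float, col_min: float, col_max: float) -> float:
    """Rescale value to [0, 1] using the column's fitted min and max."""
    if col_max == col_min:
        # Constant column: the midpoint of the target range is used.
        return 0.5
    return (value - col_min) / (col_max - col_min)


# Assumed training-split range for displacement (not printed in the notebook).
DISPLACEMENT_MIN = 68.0
DISPLACEMENT_MAX = 455.0

scaled = min_max_scale(360.0, DISPLACEMENT_MIN, DISPLACEMENT_MAX)
print(scaled)  # about 0.7545, the displacement entry of the ford f250 row
```

Because the minimum and maximum come from the training split, a test row outside that range is scaled to a value below 0 or above 1.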

### Merge OneHotEncoded Column with Normalized Column
---
Concatenate the normalized numeric vector and the one-hot encoded vector into a single "features" column for training.

In [None]:
columnsToMerge = ["normFeatures", "originEncoded"]

In [None]:
vec = VectorAssembler(inputCols=columnsToMerge,
                      outputCol='features')

df_Train = vec.transform(df_Train)
df_Test = vec.transform(df_Test)

df_Test.show(truncate=False)

+----+---------+------------+----------+------+------------+----------+------+----------------------------+-------------+----------------------------------+-------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|mpg |cylinders|displacement|horsepower|weight|acceleration|model_year|origin|name                        |originEncoded|numericFeatures                   |normFeatures                                                                                           |features                                                                                                           |
+----+---------+------------+----------+------+------------+----------+------+----------------------------+-------------+----------------------------------+------------------------------------------------------------------------------------------

In [None]:
df_Train = df_Train.select("features", "mpg")
df_Test = df_Test.select("features", "mpg")

df_Test.show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------+----+
|features                                                                                                           |mpg |
+-------------------------------------------------------------------------------------------------------------------+----+
|[1.0,0.7545219638242895,0.9184782608695652,0.8871158392434988,0.35714287336180883,0.0,0.0,1.0,0.0]                 |10.0|
|[1.0,0.9328165374677003,0.8804347826086957,0.892434988179669,0.17857143668090442,0.16666666666666666,0.0,1.0,0.0]  |11.0|
|[1.0,0.7286821705426357,0.7282608695652174,0.8528368794326241,0.26785715502135665,0.25,0.0,1.0,0.0]                |12.0|
|[1.0,0.813953488372093,0.7282608695652174,0.9875886524822695,0.2083333427943885,0.08333333333333333,0.0,1.0,0.0]   |12.0|
|[1.0,0.6046511627906977,0.45108695652173914,0.4598108747044917,0.23809524890787256,0.41666666666666663,0.0,1.0,0.0]|13.0|
|[1.0,0.60465116

### PCA
---
Apply PCA to reduce the dimensionality of the feature vectors. Here we reduce the number of dimensions by one, from 9 to 8.

In [None]:
lengthOfFeas = len(df_Train[["features"]].take(1)[0][0])
lengthOfFeas

9

In [None]:
pca = PCA(inputCol="features",outputCol="pcaFeas",k=lengthOfFeas-1)
model = pca.fit(df_Train)

df_Train = model.transform(df_Train)
df_Test = model.transform(df_Test)

df_Test.show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------+----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                           |mpg |pcaFeas                                                                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------+----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1.0,0.7545219638242895,0.9184782608695652,0.8871158392434988,0.35714287336180883,0.0,0.0,1.0,0.0]                 |10.0|[-1.7191177297263276,-0

In [None]:
df_Train = df_Train.select("pcaFeas", "mpg")
df_Test = df_Test.select("pcaFeas", "mpg")
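
Whether dropping a component loses much information depends on how much variance that component explains. The sketch below illustrates the idea on a two-dimensional toy example in plain Python, using the closed-form eigenvalues of a 2x2 covariance matrix. The data and the helper `explained_variance_2d` are hypothetical. In PySpark, the fitted `PCAModel` exposes an `explainedVariance` attribute that serves the same purpose.

```python
import math
from typing import List, Tuple


def explained_variance_2d(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Share of total variance carried by each principal component (2-D only)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    mid = (var_x + var_y) / 2
    radius = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    first, second = mid + radius, mid - radius
    total = first + second
    if total == 0:
        raise ValueError("all points are identical")
    return first / total, second / total


# Toy data lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
ratios = explained_variance_2d(points)
print(ratios)  # the first ratio is very close to 1, the second very close to 0
```

When the trailing ratio is near zero, as in this toy case, dropping that component changes the data very little.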

### Training
---
Train a linear regression model on the training split only.

In [None]:
lRegressor = LinearRegression(featuresCol="pcaFeas", labelCol="mpg")
model = lRegressor.fit(df_Train)

### Evaluation
---
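Evaluate the model on the held-out split using root mean square error (RMSE). RMSE is an error measure, not an accuracy: lower is better, and it is expressed in the same units as the label (mpg).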

In [None]:
eva = RegressionEvaluator(metricName='rmse', predictionCol='prediction', labelCol='mpg')

resultDF = model.transform(df_Test)
rmse = eva.evaluate(resultDF)
print("Validation RMSE Error : ", rmse)

Validation RMSE Error :  3.3024401447823184
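
To make the metric concrete, here is a minimal plain-Python sketch of RMSE as the square root of the mean squared difference between labels and predictions. The labels are the first three mpg values of the test split shown earlier, while the predictions are made up for illustration. The local function `rmse` is a hypothetical helper and not the evaluator's code.

```python
import math
from typing import Sequence


def rmse(labels: Sequence[float], predictions: Sequence[float]) -> float:
    """Root mean square error between labels and predictions."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [10.0, 11.0, 12.0]
predictions = [13.0, 11.0, 12.0]  # made-up predictions, one off by 3 mpg
print(rmse(labels, predictions))  # sqrt(9 / 3) = sqrt(3), about 1.732
```

Read this way, the validation result above means the predictions are off by roughly 3.3 mpg in the root-mean-square sense.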
