<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


## Feature Extraction and Transformation using Spark


Estimated time needed: **30** minutes


<p style='color: red'>The purpose of this lab is to show you how to use Spark to extract and transform features.</p>


## __Table of Contents__

<ol>
  <li>
    <a href="#Objectives">Objectives
    </a>
  </li>
  <li>
    <a href="#Datasets">Datasets
    </a>
  </li>
  <li>
    <a href="#Setup">Setup
    </a>
    <ol>
      <li>
        <a href="#Installing-Required-Libraries">Installing Required Libraries
        </a>
      </li>
      <li>
        <a href="#Importing-Required-Libraries">Importing Required Libraries
        </a>
      </li>
    </ol>
  </li>
  <li> 
    <a href="#Examples">Examples
    </a>
    <ol>
    <li>
      <a href="#Task-1---Tokenizer">Task 1 - Tokenizer
      </a>
    </li>
    <li>
      <a href="#Task-2---CountVectorizer">Task 2 - CountVectorizer
      </a>
    </li>
    <li>
      <a href="#Task-3---TF-IDF">Task 3 - TF-IDF
      </a>
    </li>
    <li>
      <a href="#Task-4---StopWordsRemover">Task 4 - StopWordsRemover
      </a>
    </li>
    <li>
      <a href="#Task-5---StringIndexer">Task 5 - StringIndexer
      </a>
    </li>
    <li>
      <a href="#Task-6---StandardScaler">Task 6 - StandardScaler
      </a>
    </li>
    </ol>
  </li>
  <li>
    <a href="#Exercises">Exercises
    </a>
    <ol>
    <li>
      <a href="#Exercise-1---Tokenizer">Exercise 1 - Tokenizer
      </a>
    </li>
    <li>
      <a href="#Exercise-2---CountVectorizer">Exercise 2 - CountVectorizer
      </a>
    </li>
    <li>
      <a href="#Exercise-3---StringIndexer">Exercise 3 - StringIndexer
      </a>
    </li>
    <li>
      <a href="#Exercise-4---StandardScaler">Exercise 4 - StandardScaler
      </a>
    </li>
    </ol>
  </li>
</ol>


















## Objectives

After completing this lab you will be able to:

 - Use the feature extractor CountVectorizer
 - Use the feature extractor TF-IDF
 - Use the feature transformer Tokenizer
 - Use the feature transformer StopWordsRemover
 - Use the feature transformer StringIndexer
 - Use the feature transformer StandardScaler
 


## Datasets

In this lab you will use the following datasets:

 - A modified version of the car mileage dataset. The original dataset is available at https://archive.ics.uci.edu/ml/datasets/auto+mpg
 - A small set of proverbs (`proverbs.csv`), used in the exercises
 


----


## Setup


For this lab, we will be using the following libraries:

*   [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01) for connecting to the Spark Cluster


### Installing Required Libraries

A Spark cluster is pre-installed in the Skills Network Labs environment. However, you need libraries such as pyspark and findspark to connect to this cluster.

If you wish to download this Jupyter notebook and run it on your local computer, follow the instructions <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/labs/Connecting_to_spark_cluster_using_Skills_Network_labs.ipynb">here</a>.



The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [2]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q

### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [3]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

In [4]:
#Create SparkSession
#Ignore any warnings by SparkSession command

spark = SparkSession \
    .builder \
    .appName("Feature Extraction and Transformation using Spark") \
    .getOrCreate()

24/01/29 18:34:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## Task 1 - Tokenizer


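Tokenizer breaks a sentence into words. It lowercases the text and splits it on whitespace, so punctuation stays attached to the neighboring word (note the token `system.` in the output of this task).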


In [5]:
#import tokenizer
from pyspark.ml.feature import Tokenizer

In [6]:
#create a sample dataframe
sentenceDataFrame = spark.createDataFrame([
    (1, "Spark is a distributed computing system."),
    (2, "It provides interfaces for multiple languages"),
    (3, "Spark is built on top of Hadoop")
], ["id", "sentence"])

In [7]:
#display the dataframe
sentenceDataFrame.show(truncate=False)

                                                                                

+---+---------------------------------------------+
|id |sentence                                     |
+---+---------------------------------------------+
|1  |Spark is a distributed computing system.     |
|2  |It provides interfaces for multiple languages|
|3  |Spark is built on top of Hadoop              |
+---+---------------------------------------------+



In [8]:
#create a Tokenizer instance.
#specify the column to be tokenized as inputCol
#specify the column where the tokens are to be stored as outputCol
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

In [9]:
#tokenize
token_df = tokenizer.transform(sentenceDataFrame)

In [10]:
#display the tokenized data
token_df.show(truncate=False)

+---+---------------------------------------------+----------------------------------------------------+
|id |sentence                                     |words                                               |
+---+---------------------------------------------+----------------------------------------------------+
|1  |Spark is a distributed computing system.     |[spark, is, a, distributed, computing, system.]     |
|2  |It provides interfaces for multiple languages|[it, provides, interfaces, for, multiple, languages]|
|3  |Spark is built on top of Hadoop              |[spark, is, built, on, top, of, hadoop]             |
+---+---------------------------------------------+----------------------------------------------------+
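
To make the `words` column above concrete, here is a minimal plain-Python sketch that produces the same tokens for the first sentence. It is not Spark's implementation; it only assumes, based on the output above, that the text is lowercased and split on whitespace. The helper name `simple_tokenize` is made up for this illustration.

```python
import re

def simple_tokenize(sentence):
    """Lowercase the sentence and split it on whitespace characters."""
    # Punctuation is not treated specially, so "system." stays one token.
    return re.split(r"\s", sentence.lower())

tokens = simple_tokenize("Spark is a distributed computing system.")
print(tokens)  # ['spark', 'is', 'a', 'distributed', 'computing', 'system.']
```

If punctuation should be split off as well, `RegexTokenizer` from `pyspark.ml.feature` accepts a custom pattern instead of the fixed whitespace split.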



## Task 2 - CountVectorizer


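CountVectorizer converts tokenized text into a numerical format. It learns a vocabulary from all the documents and then represents each document as a vector that holds the count of each vocabulary word in that document.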


In [11]:
#import CountVectorizer
from pyspark.ml.feature import CountVectorizer

In [12]:
#create a sample dataframe and display it.
textdata = [(1, "I love Spark Spark provides Python API ".split()),
            (2, "I love Python Spark supports Python".split()),
            (3, "Spark solves the big problem of big data".split())]
textdata = spark.createDataFrame(textdata, ["id", "words"])
textdata.show(truncate=False)

+---+-------------------------------------------------+
|id |words                                            |
+---+-------------------------------------------------+
|1  |[I, love, Spark, Spark, provides, Python, API]   |
|2  |[I, love, Python, Spark, supports, Python]       |
|3  |[Spark, solves, the, big, problem, of, big, data]|
+---+-------------------------------------------------+



In [13]:
# Create a CountVectorizer object
# specify the column to be count vectorized as inputCol
# specify the column where the count vectors are to be stored as outputCol
cv = CountVectorizer(inputCol="words", outputCol="features")

In [14]:
# Fit the CountVectorizer model on the input data
model = cv.fit(textdata)

                                                                                

In [15]:
# Transform the input data to bag-of-words vectors
result = model.transform(textdata)

In [16]:
# display the dataframe
result.show(truncate=False)

+---+-------------------------------------------------+----------------------------------------------------+
|id |words                                            |features                                            |
+---+-------------------------------------------------+----------------------------------------------------+
|1  |[I, love, Spark, Spark, provides, Python, API]   |(13,[0,1,2,4,8,12],[2.0,1.0,1.0,1.0,1.0,1.0])       |
|2  |[I, love, Python, Spark, supports, Python]       |(13,[0,1,2,4,7],[1.0,2.0,1.0,1.0,1.0])              |
|3  |[Spark, solves, the, big, problem, of, big, data]|(13,[0,3,5,6,9,10,11],[1.0,2.0,1.0,1.0,1.0,1.0,1.0])|
+---+-------------------------------------------------+----------------------------------------------------+
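
Each entry in the `features` column above is a sparse vector written as `(size, [indices], [values])`: the vocabulary size, the vocabulary positions of the words present in the document, and their counts. The plain-Python sketch below rebuilds vectors of the same shape for the sample data. It is an illustration, not Spark's code: the vocabulary is ordered by descending corpus frequency as in Spark, but words with equal counts may land on different indices than in the output above, because the tie order here follows Python's `Counter`. The helper name `to_sparse` is made up for this example.

```python
from collections import Counter

docs = [
    "I love Spark Spark provides Python API".split(),
    "I love Python Spark supports Python".split(),
    "Spark solves the big problem of big data".split(),
]

# Vocabulary: every distinct word, most frequent across the corpus first.
corpus_counts = Counter(word for doc in docs for word in doc)
vocabulary = [word for word, _ in corpus_counts.most_common()]
index_of = {word: i for i, word in enumerate(vocabulary)}

def to_sparse(doc):
    """Return (vocabulary size, sorted indices, counts) for one document."""
    pairs = sorted((index_of[w], float(c)) for w, c in Counter(doc).items())
    return (len(vocabulary), [i for i, _ in pairs], [v for _, v in pairs])

sparse_vectors = [to_sparse(doc) for doc in docs]
print(vocabulary[:2])     # ['Spark', 'Python']
print(sparse_vectors[0])  # (13, [0, 1, 2, 3, 5, 6], [2.0, 1.0, 1.0, 1.0, 1.0, 1.0])
```

In PySpark, the learned word list is available as `model.vocabulary`, which lets you map each index in `features` back to its word.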



## Task 3 - TF-IDF


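Term Frequency-Inverse Document Frequency (TF-IDF) quantifies how important a word is to a document within a collection of documents. It is computed by multiplying the term frequency (the number of times the word occurs in the document) by the inverse document frequency of the word, which shrinks toward zero as the word appears in more of the documents.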


In [20]:
#import necessary classes for TF-IDF calculation
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

In [21]:
#create a sample dataframe and display it.
sentenceData = spark.createDataFrame([
        (1, "Spark supports python"),
        (2, "Spark is fast"),
        (3, "Spark is easy")
    ], ["id", "sentence"])

sentenceData.show(truncate = False)

+---+---------------------+
|id |sentence             |
+---+---------------------+
|1  |Spark supports python|
|2  |Spark is fast        |
|3  |Spark is easy        |
+---+---------------------+



In [22]:
#tokenize the "sentence" column and store in the column "words"
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.show(truncate=False)

+---+---------------------+-------------------------+
|id |sentence             |words                    |
+---+---------------------+-------------------------+
|1  |Spark supports python|[spark, supports, python]|
|2  |Spark is fast        |[spark, is, fast]        |
|3  |Spark is easy        |[spark, is, easy]        |
+---+---------------------+-------------------------+



In [24]:
# Create a HashingTF object
# mention the "words" column as input
# mention the "rawFeatures" column as output
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=10)
featurizedData = hashingTF.transform(wordsData)
featurizedData.show(truncate=False)

+---+---------------------+-------------------------+--------------------------+
|id |sentence             |words                    |rawFeatures               |
+---+---------------------+-------------------------+--------------------------+
|1  |Spark supports python|[spark, supports, python]|(10,[4,5,9],[1.0,1.0,1.0])|
|2  |Spark is fast        |[spark, is, fast]        |(10,[1,3,5],[1.0,1.0,1.0])|
|3  |Spark is easy        |[spark, is, easy]        |(10,[0,1,5],[1.0,1.0,1.0])|
+---+---------------------+-------------------------+--------------------------+
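
The `rawFeatures` vectors above come from the hashing trick: each word is hashed, and the hash modulo `numFeatures` becomes its index. The sketch below illustrates the idea in plain Python under a loud assumption: it uses `zlib.crc32` as a stand-in hash, whereas Spark's HashingTF uses MurmurHash3, so the indices it prints will not match the Spark output. The helper name `hashed_term_counts` is hypothetical.

```python
import zlib

NUM_FEATURES = 10  # same vector size as numFeatures in the cell above

def hashed_term_counts(words, num_features=NUM_FEATURES):
    """Count words per hash bucket (stand-in hash, not Spark's MurmurHash3)."""
    buckets = {}
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        buckets[index] = buckets.get(index, 0.0) + 1.0
    indices = sorted(buckets)
    return (num_features, indices, [buckets[i] for i in indices])

raw_features = hashed_term_counts(["spark", "is", "fast"])
print(raw_features)
```

Because no vocabulary is stored, two different words can fall into the same bucket and have their counts added together. A larger `numFeatures` makes such collisions less likely.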



In [27]:
# Create an IDF object
# mention the "rawFeatures" column as input
# mention the "features" column as output
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
tfidfData = idfModel.transform(featurizedData)

In [28]:
#display the tf-idf data
tfidfData.select("sentence", "features").show(truncate=False)

+---------------------+---------------------------------------------------------+
|sentence             |features                                                 |
+---------------------+---------------------------------------------------------+
|Spark supports python|(10,[4,5,9],[0.6931471805599453,0.0,0.6931471805599453]) |
|Spark is fast        |(10,[1,3,5],[0.28768207245178085,0.6931471805599453,0.0])|
|Spark is easy        |(10,[0,1,5],[0.6931471805599453,0.28768207245178085,0.0])|
+---------------------+---------------------------------------------------------+
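
The numbers above can be checked by hand. Spark's IDF uses the smoothed formula `idf = log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents that contain the word. The plain-Python sketch below recomputes the weights for the three sample sentences. It keys the results by word rather than by hash index, which is a simplification that only holds because no two words collide in this example.

```python
import math
from collections import Counter

docs = [
    ["spark", "supports", "python"],
    ["spark", "is", "fast"],
    ["spark", "is", "easy"],
]

num_docs = len(docs)
# Document frequency: in how many documents each word appears at least once.
doc_freq = Counter(word for doc in docs for word in set(doc))
# Smoothed inverse document frequency.
idf = {word: math.log((num_docs + 1) / (df + 1)) for word, df in doc_freq.items()}
# TF-IDF per document: raw term count times IDF.
tfidf = [
    {word: count * idf[word] for word, count in Counter(doc).items()}
    for doc in docs
]

for doc, weights in zip(docs, tfidf):
    print(doc, weights)
```

The word "spark" occurs in all three documents, so its weight is log(4/4) = 0. The word "is" occurs in two documents and gets log(4/3), about 0.2877, and every word that occurs in only one document gets log(4/2), about 0.6931.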



## Task 4 - StopWordsRemover


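StopWordsRemover is a transformer that filters out stop words such as "a", "an", and "the". By default the matching is case-insensitive, which is why both "IT" and "It" are removed in the example that follows.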


In [29]:
#import StopWordsRemover
from pyspark.ml.feature import StopWordsRemover

In [30]:
#create a dataframe with sample text and display it
textData = spark.createDataFrame([
    (1, ['Spark', 'is', 'an', 'open-source', 'distributed', 'computing', 'system']),
    (2, ['IT', 'has', 'interfaces', 'for', 'multiple', 'languages']),
    (3, ['It', 'has', 'a', 'wide', 'range', 'of', 'libraries', 'and', 'APIs'])
], ["id", "sentence"])

textData.show(truncate = False)

+---+------------------------------------------------------------+
|id |sentence                                                    |
+---+------------------------------------------------------------+
|1  |[Spark, is, an, open-source, distributed, computing, system]|
|2  |[IT, has, interfaces, for, multiple, languages]             |
|3  |[It, has, a, wide, range, of, libraries, and, APIs]         |
+---+------------------------------------------------------------+



In [31]:
# remove stopwords from "sentence" column and store the result in "filtered_sentence" column
remover = StopWordsRemover(inputCol="sentence", outputCol="filtered_sentence")
textData = remover.transform(textData)

In [32]:
# display the dataframe
textData.show(truncate=False)

+---+------------------------------------------------------------+----------------------------------------------------+
|id |sentence                                                    |filtered_sentence                                   |
+---+------------------------------------------------------------+----------------------------------------------------+
|1  |[Spark, is, an, open-source, distributed, computing, system]|[Spark, open-source, distributed, computing, system]|
|2  |[IT, has, interfaces, for, multiple, languages]             |[interfaces, multiple, languages]                   |
|3  |[It, has, a, wide, range, of, libraries, and, APIs]         |[wide, range, libraries, APIs]                      |
+---+------------------------------------------------------------+----------------------------------------------------+
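
The filtering shown above amounts to a case-insensitive membership test. Here is a minimal plain-Python sketch of that idea. The stop-word set is a small illustrative subset chosen for this example (Spark ships a much longer default English list), and the function name `remove_stop_words` is hypothetical.

```python
# Illustrative subset only; Spark's default English list is much longer.
STOP_WORDS = {"a", "an", "and", "for", "has", "is", "it", "of", "the"}

def remove_stop_words(words, stop_words=STOP_WORDS, case_sensitive=False):
    """Drop every word found in stop_words, keeping the original order."""
    if case_sensitive:
        return [w for w in words if w not in stop_words]
    return [w for w in words if w.lower() not in stop_words]

filtered = remove_stop_words(["IT", "has", "interfaces", "for", "multiple", "languages"])
print(filtered)  # ['interfaces', 'multiple', 'languages']
```

In PySpark, the corresponding switch is the `caseSensitive` parameter of `StopWordsRemover`, and a custom list can be supplied through its `stopWords` parameter.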



## Task 5 - StringIndexer


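StringIndexer converts a column of string labels into a column of numeric label indices. The indices are stored as doubles, and by default the most frequent label receives index 0.0.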


In [33]:
#import StringIndexer
from pyspark.ml.feature import StringIndexer

In [34]:
#create a dataframe with sample text and display it
colors = spark.createDataFrame(
    [(0, "red"), (1, "red"), (2, "blue"), (3, "yellow" ), (4, "yellow"), (5, "yellow")],
    ["id", "color"])

colors.show()

+---+------+
| id| color|
+---+------+
|  0|   red|
|  1|   red|
|  2|  blue|
|  3|yellow|
|  4|yellow|
|  5|yellow|
+---+------+



In [36]:
# index the strings in the column "color" and store their indexes in the column "colorIndex"
indexer = StringIndexer(inputCol="color", outputCol="colorIndex")
indexed = indexer.fit(colors).transform(colors)
indexed.show(truncate=False)

                                                                                

+---+------+----------+
|id |color |colorIndex|
+---+------+----------+
|0  |red   |1.0       |
|1  |red   |1.0       |
|2  |blue  |2.0       |
|3  |yellow|0.0       |
|4  |yellow|0.0       |
|5  |yellow|0.0       |
+---+------+----------+
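
The index assignment above follows label frequency: "yellow" occurs three times and gets 0.0, "red" occurs twice and gets 1.0, and "blue" occurs once and gets 2.0. The plain-Python sketch below reproduces that mapping. Breaking ties alphabetically is an assumption of this sketch and is not exercised by the sample data, which has no ties.

```python
from collections import Counter

colors = ["red", "red", "blue", "yellow", "yellow", "yellow"]

counts = Counter(colors)
# Most frequent label first; alphabetical order breaks ties (assumed tie rule).
ordered_labels = sorted(counts, key=lambda label: (-counts[label], label))
label_to_index = {label: float(i) for i, label in enumerate(ordered_labels)}

indexed = [(color, label_to_index[color]) for color in colors]
print(label_to_index)  # {'yellow': 0.0, 'red': 1.0, 'blue': 2.0}
```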



## Task 6 - StandardScaler



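StandardScaler standardizes each feature (each position in the feature vector) so that, across all rows, it has a mean of 0 and a standard deviation of 1. Centering is controlled by `withMean` and scaling by `withStd`; `withMean` is off by default, so the example sets it explicitly.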


In [38]:
#import StandardScaler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors

In [39]:
# Create a sample dataframe and display it
data = [(1, Vectors.dense([70, 170, 17])),
        (2, Vectors.dense([80, 165, 25])),
        (3, Vectors.dense([65, 150, 135]))]
df = spark.createDataFrame(data, ["id", "features"])
df.show()

+---+------------------+
| id|          features|
+---+------------------+
|  1| [70.0,170.0,17.0]|
|  2| [80.0,165.0,25.0]|
|  3|[65.0,150.0,135.0]|
+---+------------------+



In [40]:
# Define the StandardScaler transformer
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=True)

In [41]:
# Fit the scaler to the dataset, then scale the data with the fitted model
scalerModel = scaler.fit(df)
scaledData = scalerModel.transform(df)
scaledData.show(truncate=False)

                                                                                

+---+------------------+------------------------------------------------------------+
|id |features          |scaledFeatures                                              |
+---+------------------+------------------------------------------------------------+
|1  |[70.0,170.0,17.0] |[-0.218217890235993,0.8006407690254366,-0.6369487984517485] |
|2  |[80.0,165.0,25.0] |[1.0910894511799611,0.32025630761017515,-0.5156252177942725]|
|3  |[65.0,150.0,135.0]|[-0.8728715609439701,-1.120897076635609,1.152574016246021]  |
+---+------------------+------------------------------------------------------------+
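
The scaled values above can be verified with a few lines of plain Python: subtract the column mean from each value and divide by the column's standard deviation. The sketch assumes the sample standard deviation (dividing by n - 1), which is what matches the Spark output for this data.

```python
from statistics import mean, stdev

rows = [
    [70.0, 170.0, 17.0],
    [80.0, 165.0, 25.0],
    [65.0, 150.0, 135.0],
]

columns = list(zip(*rows))
col_means = [mean(col) for col in columns]
col_stds = [stdev(col) for col in columns]  # sample standard deviation (n - 1)

scaled = [
    [(value - m) / s for value, m, s in zip(row, col_means, col_stds)]
    for row in rows
]
for row in scaled:
    print(row)
```

For the first feature, the mean is about 71.67 and the sample standard deviation is about 7.64, so 70 becomes (70 - 71.67) / 7.64, which is roughly -0.218, in line with the first row of `scaledFeatures`.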



Stop Spark Session


In [42]:
spark.stop()

# Exercises


Create Spark Session


In [43]:
#Create SparkSession
#Ignore any warnings by SparkSession command
spark = SparkSession \
    .builder \
    .appName("Exercise - Feature Extraction and Transformation using Spark") \
    .getOrCreate()

Create Dataframes


In [44]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/proverbs.csv

--2024-01-29 18:53:33--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/proverbs.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 846 [text/csv]
Saving to: ‘proverbs.csv’


2024-01-29 18:53:33 (4.91 MB/s) - ‘proverbs.csv’ saved [846/846]



In [56]:
# Load proverbs dataset
textdata = spark.read.csv("proverbs.csv", header=True, inferSchema=True)
textdata.show(truncate=False)

+---+-----------------------------------------------------------+
|id |text                                                       |
+---+-----------------------------------------------------------+
|1  |When in Rome do as the Romans do.                          |
|2  |Do not judge a book by its cover.                          |
|3  |Actions speak louder than words.                           |
|4  |A picture is worth a thousand words.                       |
|5  |If at first you do not succeed try try again.              |
|6  |Practice makes perfect.                                    |
|7  |An apple a day keeps the doctor away.                      |
|8  |When the going gets tough the tough get going.             |
|9  |All is fair in love and war.                               |
|10 |Too many cooks spoil the broth.                            |
|11 |You can not make an omelette without breaking eggs.        |
|12 |The early bird catches the worm.                           |
|13 |Bette

In [46]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/mpg.csv

--2024-01-29 18:54:46--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/mpg.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13891 (14K) [text/csv]
Saving to: ‘mpg.csv.3’


2024-01-29 18:54:46 (26.0 MB/s) - ‘mpg.csv.3’ saved [13891/13891]



In [57]:
# Load mpg dataset
mpgdata = spark.read.csv("mpg.csv", header=True, inferSchema=True)
mpgdata.show()

+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|15.0|        8|      390.0|       190|  3850|       8.5|  70|American|
|21.0|        6|      199.0|        90|  2648|      15.0|  70|American|
|18.0|        6|      199.0|        97|  2774|      15.5|  70|American|
|16.0|        8|      304.0|       150|  3433|      12.0|  70|American|
|14.0|        8|      455.0|       225|  3086|      10.0|  70|American|
|15.0|        8|      350.0|       165|  3693|      11.5|  70|American|
|18.0|        8|      307.0|       130|  3504|      12.0|  70|American|
|14.0|        8|      454.0|       220|  4354|       9.0|  70|American|
|15.0|        8|      400.0|       150|  3761|       9.5|  70|American|
|10.0|        8|      307.0|       200|  4376|      15.0|  70|American|
|15.0|        8|      383.0|       170|  3563|      10.0|  70|Am

### Exercise 1 - Tokenizer


Write code to tokenize the "text" column of the "textdata" dataframe and store the tokens in the column "words".


In [48]:
from pyspark.ml.feature import Tokenizer

In [58]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
textdata = tokenizer.transform(textdata)
textdata.show(truncate=False)

+---+-----------------------------------------------------------+------------------------------------------------------------------------+
|id |text                                                       |words                                                                   |
+---+-----------------------------------------------------------+------------------------------------------------------------------------+
|1  |When in Rome do as the Romans do.                          |[when, in, rome, do, as, the, romans, do.]                              |
|2  |Do not judge a book by its cover.                          |[do, not, judge, a, book, by, its, cover.]                              |
|3  |Actions speak louder than words.                           |[actions, speak, louder, than, words.]                                  |
|4  |A picture is worth a thousand words.                       |[a, picture, is, worth, a, thousand, words.]                            |
|5  |If at first you do not

### Exercise 2 - CountVectorizer


Apply CountVectorizer to the "words" column of the "textdata" dataframe and store the result in the column "features".


In [59]:
from pyspark.ml.feature import CountVectorizer

In [61]:
cv = CountVectorizer(inputCol="words", outputCol="features")
model = cv.fit(textdata)
textdata = model.transform(textdata)
textdata.select("id", "words", "features").show(truncate=False)

+---+------------------------------------------------------------------------+----------------------------------------------------------------------------+
|id |words                                                                   |features                                                                    |
+---+------------------------------------------------------------------------+----------------------------------------------------------------------------+
|1  |[when, in, rome, do, as, the, romans, do.]                              |(99,[0,4,5,6,17,38,78,95],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                |
|2  |[do, not, judge, a, book, by, its, cover.]                              |(99,[1,3,4,19,21,22,62,70],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])               |
|3  |[actions, speak, louder, than, words.]                                  |(99,[9,16,58,66,89],[1.0,1.0,1.0,1.0,1.0])                                  |
|4  |[a, picture, is, worth, a, thousand, words.]               

### Exercise 3 - StringIndexer


Convert the string column "Origin" to a numeric column "OriginIndex" in the dataframe "mpgdata".


In [62]:
from pyspark.ml.feature import StringIndexer

In [63]:
indexer = StringIndexer(inputCol="Origin", outputCol="OriginIndex")
indexed = indexer.fit(mpgdata).transform(mpgdata)
indexed.orderBy(rand()).show()

+----+---------+-----------+----------+------+----------+----+--------+-----------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|OriginIndex|
+----+---------+-----------+----------+------+----------+----+--------+-----------+
|23.0|        4|      140.0|        78|  2592|      18.5|  75|American|        0.0|
|11.0|        8|      429.0|       208|  4633|      11.0|  72|American|        0.0|
|22.0|        6|      198.0|        95|  2833|      15.5|  70|American|        0.0|
|26.0|        4|      121.0|       113|  2234|      12.5|  70|European|        2.0|
|20.5|        6|      231.0|       105|  3425|      16.9|  77|American|        0.0|
|17.5|        6|      250.0|       110|  3520|      16.4|  77|American|        0.0|
|20.2|        6|      200.0|        85|  2965|      15.8|  78|American|        0.0|
|38.0|        4|       91.0|        67|  1965|      15.0|  82|Japanese|        1.0|
|17.0|        8|      302.0|       140|  3449|      10.5|  70|American|     

### Exercise 4 - StandardScaler



Create a single vector column named "features" from the columns "Cylinders", "Engine Disp", "Horsepower", and "Weight".


In [64]:
from pyspark.ml.feature import VectorAssembler

In [65]:
assembler = VectorAssembler(inputCols=["Cylinders", "Engine Disp", "Horsepower", "Weight"], outputCol="features")
mpg_transformed_data = assembler.transform(mpgdata)
mpg_transformed_data.select("MPG", "features").show(truncate=False)

+----+------------------------+
|MPG |features                |
+----+------------------------+
|15.0|[8.0,390.0,190.0,3850.0]|
|21.0|[6.0,199.0,90.0,2648.0] |
|18.0|[6.0,199.0,97.0,2774.0] |
|16.0|[8.0,304.0,150.0,3433.0]|
|14.0|[8.0,455.0,225.0,3086.0]|
|15.0|[8.0,350.0,165.0,3693.0]|
|18.0|[8.0,307.0,130.0,3504.0]|
|14.0|[8.0,454.0,220.0,4354.0]|
|15.0|[8.0,400.0,150.0,3761.0]|
|10.0|[8.0,307.0,200.0,4376.0]|
|15.0|[8.0,383.0,170.0,3563.0]|
|11.0|[8.0,318.0,210.0,4382.0]|
|10.0|[8.0,360.0,215.0,4615.0]|
|15.0|[8.0,429.0,198.0,4341.0]|
|21.0|[6.0,200.0,85.0,2587.0] |
|17.0|[8.0,302.0,140.0,3449.0]|
|9.0 |[8.0,304.0,193.0,4732.0]|
|14.0|[8.0,340.0,160.0,3609.0]|
|22.0|[6.0,198.0,95.0,2833.0] |
|14.0|[8.0,440.0,215.0,4312.0]|
+----+------------------------+
only showing top 20 rows



Use StandardScaler to scale the "features" column of the dataframe "mpg_transformed_data" and save the scaled data into the "scaledFeatures" column.


In [66]:
from pyspark.ml.feature import StandardScaler

In [67]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=True)
scalerModel = scaler.fit(mpg_transformed_data)
scaledData = scalerModel.transform(mpg_transformed_data)
scaledData.select(["features", "scaledFeatures"]).show(truncate=False)

+------------------------+-----------------------------------------------------------------------------------+
|features                |scaledFeatures                                                                     |
+------------------------+-----------------------------------------------------------------------------------+
|[8.0,390.0,190.0,3850.0]|[1.48205302652896,1.869079955831451,2.222084561602166,1.027093462353608]           |
|[6.0,199.0,90.0,2648.0] |[0.3095711165403583,0.043843985634147174,-0.37591456792553746,-0.38801882543985255]|
|[6.0,199.0,97.0,2774.0] |[0.3095711165403583,0.043843985634147174,-0.1940546288585982,-0.2396792678175763]  |
|[8.0,304.0,150.0,3433.0]|[1.48205302652896,1.0472459587792617,1.1828849097910845,0.5361601645084557]        |
|[8.0,455.0,225.0,3086.0]|[1.48205302652896,2.4902335582546176,3.131384256936862,0.12763773200901246]        |
|[8.0,350.0,165.0,3693.0]|[1.48205302652896,1.4868315851095026,1.57258477922024,0.8422576643639463]          |
|

Stop Spark Session


In [68]:
spark.stop()

Congratulations! You have completed this lab.<br>


## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0231ENSkillsNetwork866-2023-01-01)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-05-14|0.1|Ramesh Sannareddy|Initial Version Created|


Copyright © 2023 IBM Corporation. All rights reserved.
