 <hr />
 Before starting with the notebook, ensure that pyspark is installed and working. Install pyspark and findspark with pip (pip install pyspark findspark), then import them as shown in the cell below.
<hr />

In [1]:
import pyspark
import findspark
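
<hr />
If it is unclear whether both packages are installed, a quick check with the standard library can be run before the imports. This is only a minimal sketch: the helper name is_installed is hypothetical and is not part of pyspark or findspark.
<hr />

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be located for import."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two packages this notebook relies on.
for name in ("pyspark", "findspark"):
    if is_installed(name):
        print(name, "-> found")
    else:
        print(name, "-> missing, try: pip install", name)
```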

<hr />
The following cell prints the path of the Spark installation and then adds pyspark to sys.path at runtime, which is needed when pyspark is not on the system path by default.
<hr />

In [2]:
print(findspark.find())
findspark.init()

C:\Users\HrushikeshaShastryBS\miniconda3\envs\mlops1\lib\site-packages\pyspark


<hr />
Create a Spark Session
<hr />

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Pipeline") \
    .master('local[2]') \
    .getOrCreate()

<hr />
Create a DataFrame in which each row holds an identification value, a sentence, and a sentiment label (0: negative, 1: positive).
<hr />

In [4]:
training = spark.createDataFrame([
     (0, 'i like apple pie for dessert', 1.0),
     (1, 'i dont drive fast cars', 0.0),
     (2, 'data science is fun', 1.0),
     (3, 'chocolate is not my favorite', 0.0),
     (4, 'my favorite movie is predator', 1.0)],
     ['id', 'text', 'label'])

<hr />
Import the relevant pyspark classes: <br>
1. Pipeline : Chains the stages into a single training and prediction workflow. <br>
2. Tokenizer : Creates tokens from a sentence by converting the input string to lowercase and splitting it on whitespace. <br>
3. HashingTF : Generates features from the tokens by mapping a sequence of terms to their term frequencies using the hashing trick. <br>
4. LogisticRegression : Trains a classifier on the generated features. <br>
<hr />

In [5]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
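
<hr />
The tokenizing and hashing steps described above can be sketched in plain Python to show what the hashing trick does. This is an illustration under stated assumptions, not Spark's implementation: the function names tokenize and hashing_tf are hypothetical, and CRC32 stands in for the hash function Spark uses, so the bucket indices will differ from those in Spark's output. The number of buckets matches the numFeatures value of 262144 reported by the fitted model further down.
<hr />

```python
import zlib

NUM_FEATURES = 1 << 18  # 262144 buckets, the numFeatures value shown by the fitted model


def tokenize(sentence):
    """Lowercase the sentence and split it on whitespace."""
    return sentence.lower().split()


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket index and count how often each bucket is hit."""
    counts = {}
    for token in tokens:
        # Assumption: CRC32 is a stand-in hash, so indices differ from Spark's.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


tokens = tokenize("I like programming")
print(tokens)
print(hashing_tf(tokens))
```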

<hr />
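Initialize the Estimators and Transformers. Tokenizer and HashingTF are Transformers, while LogisticRegression is an Estimator that has to be fitted.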
<hr />

In [6]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01, featuresCol='features', labelCol='label')

<hr />
Create a Pipeline.
<hr />

In [7]:
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

<hr />
Call the fit function on the pipeline to run its stages on the training data and generate the trained model.
<hr />

In [8]:
model = pipeline.fit(training)
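
<hr />
To make the idea of fitting a pipeline more concrete, the following plain-Python sketch chains stages in order: a stage that has a fit method is fitted on the data seen so far and replaced by the result, and every stage then transforms the data for the next one. All class and function names here are hypothetical toy stand-ins and only approximate how a pipeline of Transformers and Estimators behaves.
<hr />

```python
class Lowercase:
    """Transformer-like stage: it only transforms rows."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabularyEncoder:
    """Fitted stage: maps known words to the integer ids learned during fit."""

    def __init__(self, vocabulary):
        self.index = {word: i for i, word in enumerate(vocabulary)}

    def transform(self, rows):
        return [[self.index[w] for w in row.split() if w in self.index] for row in rows]


class VocabularyLearner:
    """Estimator-like stage: fit() learns a vocabulary and returns a fitted stage."""

    def fit(self, rows):
        vocabulary = sorted({word for row in rows for word in row.split()})
        return VocabularyEncoder(vocabulary)


def fit_pipeline(stages, rows):
    """Fit estimator-like stages in order, passing transformed rows downstream."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)
        fitted.append(stage)
        rows = stage.transform(rows)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


fitted = fit_pipeline([Lowercase(), VocabularyLearner()], ["Data science", "science is fun"])
print(transform_pipeline(fitted, ["Fun data"]))
```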

<hr />
Display the stages of the fitted pipeline model.
<hr />

In [9]:
model.stages

[Tokenizer_470a5f5b54ab,
 HashingTF_3a6ec42310fd,
 LogisticRegressionModel: uid=LogisticRegression_29a18ab1cfd9, numClasses=2, numFeatures=262144]

<hr />
Initialize the test data.
<hr />

In [10]:
test = spark.createDataFrame([
     (5, 'I like programming'),
     (6, 'I dont eat grapes')],
     ["id", "text"])

<hr />
Use the fitted pipeline model to generate predictions for the test data.
<hr />

In [11]:
prediction = model.transform(test)

<hr />
Display the predictions.
<hr />

In [12]:
prediction.show(truncate=False, vertical=True)

-RECORD 0---------------------------------------------------------------
 id            | 5                                                      
 text          | I like programming                                     
 words         | [i, like, programming]                                 
 features      | (262144,[19036,154517,208258],[1.0,1.0,1.0])           
 rawPrediction | [-1.604878308915796,1.604878308915796]                 
 probability   | [0.16730090779697265,0.8326990922030273]               
 prediction    | 1.0                                                    
-RECORD 1---------------------------------------------------------------
 id            | 6                                                      
 text          | I dont eat grapes                                      
 words         | [i, dont, eat, grapes]                                 
 features      | (262144,[19036,87273,188981,202572],[1.0,1.0,1.0,1.0]) 
 rawPrediction | [0.4845259488940914,-0.48452594889
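
<hr />
The rawPrediction and probability columns for binary logistic regression are related through the logistic (sigmoid) function. The sketch below checks this against the values shown for record 0 above. The variable names are illustrative, and the 0.5 cutoff is assumed to be the threshold in effect, since none was set when the classifier was created.
<hr />

```python
import math


def sigmoid(margin):
    """Logistic function: maps a raw margin to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-margin))


# Second entry of rawPrediction for record 0 (the margin for the positive class).
raw_positive = 1.604878308915796

probability_positive = sigmoid(raw_positive)
print(probability_positive)  # agrees with the second entry of probability for record 0

# Assumption: a 0.5 threshold decides the predicted label.
predicted_label = 1.0 if probability_positive > 0.5 else 0.0
print(predicted_label)
```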

<hr />
Extract only the prediction column from the output of the pipeline and return the first row as a JSON string.
<hr />

In [15]:
prediction.select("prediction").toJSON().first()

'{"prediction":1.0}'
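
<hr />
The value returned above is a JSON string rather than a number. If the numeric label is needed, the string can be parsed with the standard json module, as in this small sketch that uses the string shown in the output. The variable names are illustrative.
<hr />

```python
import json

# JSON string as returned for the first row in the cell above.
first_row_json = '{"prediction":1.0}'

# Parse the string and pull out the numeric prediction.
prediction_value = json.loads(first_row_json)["prediction"]
print(prediction_value)
```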

<hr />
Stop the Spark Session.
<hr />

In [37]:
spark.stop()