# PySpark Hugging Face Inferencing
### Sentence Transformers

From: https://huggingface.co/sentence-transformers

In [1]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sentences to encode (a single example here).
sentence = ['This framework generates embeddings for each input sentence']

# Encode the sentences by calling model.encode().
embedding = model.encode(sentence)

In [2]:
embedding

array([[-1.76214471e-01,  1.20601244e-01, -2.93623894e-01,
        -2.29858100e-01, -8.22922513e-02,  2.37709180e-01,
         3.39984953e-01, -7.80964196e-01,  1.18127622e-01,
         1.63374022e-01, -1.37715399e-01,  2.40282685e-01,
         4.25125629e-01,  1.72417864e-01,  1.05279565e-01,
         5.18164217e-01,  6.22219704e-02,  3.99285674e-01,
        -1.81652322e-01, -5.85578740e-01,  4.49718200e-02,
        -1.72750384e-01, -2.68443257e-01, -1.47386223e-01,
        -1.89217940e-01,  1.92150623e-01, -3.83842438e-01,
        -3.96006882e-01,  4.30648863e-01, -3.15319866e-01,
         3.65949601e-01,  6.05160147e-02,  3.57325822e-01,
         1.59736469e-01, -3.00983846e-01,  2.63250262e-01,
        -3.94311219e-01,  1.84855536e-01, -3.99549156e-01,
        -2.67889529e-01, -5.45117259e-01, -3.13403755e-02,
        -4.30644304e-01,  1.33278236e-01, -1.74793959e-01,
        -4.35465515e-01, -4.77379203e-01,  7.12556094e-02,
        -7.37001672e-02,  5.69136977e-01, -2.82579780e-0
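
Embeddings like the one above are usually compared with cosine similarity. The following is a minimal sketch in plain Python. The short vectors `v1`, `v2`, and `v3` are made up for illustration and are not model output, and `cosine_similarity` is a hypothetical helper, not part of `sentence_transformers`.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real embeddings (assumed values).
v1 = [0.1, 0.3, -0.2, 0.5]
v2 = [0.2, 0.6, -0.4, 1.0]   # same direction as v1, twice the length
v3 = [-0.3, 0.1, 0.5, 0.2]   # dot product with v1 is zero

sim_parallel = cosine_similarity(v1, v2)
sim_orthogonal = cosine_similarity(v1, v3)
print(sim_parallel, sim_orthogonal)
```

With real output, each row of `embedding` could be converted with `.tolist()` and passed to the same helper.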

## PySpark

## Inference using Spark ML Model
Note: you can restart the kernel and run from this point to simulate running on a different node or in a different environment.

In [3]:
import sparkext

In [4]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [5]:
# Use only the first 100 rows, since inference is slow
df = spark.read.parquet("imdb_test").limit(100)

                                                                                

In [6]:
df.show(truncate=80)

+--------------------------------------------------------------------------------+
|                                                                           lines|
+--------------------------------------------------------------------------------+
|i came across this film on the net by fluke and i was horrified by its conten...|
|He who fights with monsters might take care lest he thereby become a monster....|
|We thought this was one of the worst movies ever. I had to volunteer to watch...|
|This movie, despite its list of B, C, and D list celebs, is a complete waste ...|
|i have one word: focus.<br /><br />well.<br /><br />IMDb wants me to use at l...|
|This movie would have been alright, indeed probably excellent, if the directo...|
|Disappointing heist movie indeed, I was actually expecting a pretty cool cat ...|
|THE BOX (2009) * Cameron Diaz, James Marsden, Frank Langella, James Rebhorn, ...|
|Just watched on UbuWeb this early experimental short film directed by William...|
|I w

                                                                                

In [7]:
my_model = sparkext.huggingface.SentenceTransformerModel(model) \
                .setInputCol("lines") \
                .setOutputCol("embedding")

In [8]:
embeddings = my_model.transform(df)

Using supplied SentenceTransformer


In [9]:
%%time
results = embeddings.collect()

CPU times: user 16.3 ms, sys: 3.39 ms, total: 19.7 ms
Wall time: 7.96 s
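
The gap between CPU time and wall time in the `%%time` output is worth a note: the CPU figures cover only the notebook's own process, while the wall time also includes waiting for work done elsewhere (here, presumably the Spark executors running the model). A minimal sketch of the same effect in plain Python follows. `timed` and `wait_for_remote_work` are hypothetical names, and `time.sleep` stands in for the wait on `collect()`.

```python
import time


def timed(fn):
    """Run fn() and return (result, cpu_seconds, wall_seconds) for this process."""
    cpu_start = time.process_time()    # CPU time used by this process only
    wall_start = time.perf_counter()   # elapsed real time
    result = fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, cpu, wall


def wait_for_remote_work():
    # Stand-in for collect(): the calling process mostly waits for results.
    time.sleep(0.2)
    return "done"


result, cpu, wall = timed(wait_for_remote_work)
print(f"CPU: {cpu:.3f} s, wall: {wall:.3f} s")
```

The sleeping call consumes almost no CPU in the measuring process, so the CPU figure stays far below the wall time, as in the cell above.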


                                                                                

In [10]:
embeddings.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                       lines|                                                   embedding|
+------------------------------------------------------------+------------------------------------------------------------+
|i came across this film on the net by fluke and i was hor...|[-0.05298701, -0.08726701, -0.23245229, -0.037297986, 0.0...|
|He who fights with monsters might take care lest he there...|[0.20020881, -0.06675269, -0.26941508, 0.15710223, -0.031...|
|We thought this was one of the worst movies ever. I had t...|[0.06361251, -0.16458587, 0.054284055, 0.12517855, 0.0279...|
|This movie, despite its list of B, C, and D list celebs, ...|[-0.01568622, -0.4015518, -0.09817645, -0.060246892, -0.0...|
|i have one word: focus.<br /><br />well.<br /><br />IMDb ...|[0.1225523, -0.19064504, -0.18919703, 0.0863426, 0.128941...|
|This mo

                                                                                

## Inference using Spark DL UDF
Note: you can restart the kernel and run from this point to simulate running on a different node or in a different environment.

### Using model instance on driver

In [11]:
from pyspark.sql.functions import col
from sparkext.huggingface import sentence_transformer_udf

In [12]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

In [13]:
# Use only the first 100 rows, since inference is slow
df = spark.read.parquet("imdb_test").limit(100)

In [14]:
df.schema

StructType(List(StructField(lines,StringType,true)))

In [15]:
encode = sentence_transformer_udf(model)

Using supplied SentenceTransformer


In [16]:
embeddings = df.withColumn("encoding", encode(col("lines")))

In [17]:
%%time
results = embeddings.collect()

CPU times: user 12.7 ms, sys: 421 µs, total: 13.1 ms
Wall time: 5.75 s


                                                                                

In [18]:
embeddings.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                       lines|                                                    encoding|
+------------------------------------------------------------+------------------------------------------------------------+
|i came across this film on the net by fluke and i was hor...|[-0.05298701, -0.08726701, -0.23245229, -0.037297986, 0.0...|
|He who fights with monsters might take care lest he there...|[0.20020881, -0.06675269, -0.26941508, 0.15710223, -0.031...|
|We thought this was one of the worst movies ever. I had t...|[0.06361251, -0.16458587, 0.054284055, 0.12517855, 0.0279...|
|This movie, despite its list of B, C, and D list celebs, ...|[-0.01568622, -0.4015518, -0.09817645, -0.060246892, -0.0...|
|i have one word: focus.<br /><br />well.<br /><br />IMDb ...|[0.1225523, -0.19064504, -0.18919703, 0.0863426, 0.128941...|
|This mo

                                                                                

### Using model_id string on driver

In [19]:
from pyspark.sql.functions import col
from sparkext.huggingface import sentence_transformer_udf

In [20]:
# Use only the first 100 rows, since inference is slow
df = spark.read.parquet("imdb_test").limit(100)

In [21]:
encode = sentence_transformer_udf("paraphrase-MiniLM-L6-v2")

Loading SentenceTransformer(paraphrase-MiniLM-L6-v2) on driver


In [22]:
embeddings = df.withColumn("encoding", encode(col("lines")))

In [23]:
%%time
results = embeddings.collect()

CPU times: user 11.5 ms, sys: 0 ns, total: 11.5 ms
Wall time: 5.71 s


                                                                                

In [24]:
embeddings.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                       lines|                                                    encoding|
+------------------------------------------------------------+------------------------------------------------------------+
|i came across this film on the net by fluke and i was hor...|[-0.05298701, -0.08726701, -0.23245229, -0.037297986, 0.0...|
|He who fights with monsters might take care lest he there...|[0.20020881, -0.06675269, -0.26941508, 0.15710223, -0.031...|
|We thought this was one of the worst movies ever. I had t...|[0.06361251, -0.16458587, 0.054284055, 0.12517855, 0.0279...|
|This movie, despite its list of B, C, and D list celebs, ...|[-0.01568622, -0.4015518, -0.09817645, -0.060246892, -0.0...|
|i have one word: focus.<br /><br />well.<br /><br />IMDb ...|[0.1225523, -0.19064504, -0.18919703, 0.0863426, 0.128941...|
|This mo

                                                                                

### Using model loader

In [25]:
from pyspark.sql.functions import col
from sparkext.huggingface import sentence_transformer_udf

In [26]:
# Use only the first 100 rows, since inference is slow
df = spark.read.parquet("imdb_test").limit(100)

In [27]:
def model_loader(model_name):
    # Import inside the function so the import runs wherever the loader is called.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)

In [28]:
encode = sentence_transformer_udf("paraphrase-MiniLM-L6-v2", model_loader=model_loader)

Deferring model loading to executors.
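
Deferring the load means the model is created where the loader is called rather than being built on the driver. One common way to avoid reloading it for every batch is a per-process cache. The sketch below illustrates that pattern in plain Python. It is not the `sparkext` implementation: `get_model`, `fake_model_loader`, `predict_batch`, and `"toy-model"` are hypothetical names, and the fake loader returns a trivial function instead of a real `SentenceTransformer`.

```python
# Per-process cache: each worker process would hold its own copy.
_MODEL_CACHE = {}


def get_model(model_name, model_loader):
    """Return the cached model, calling model_loader only on first use."""
    if model_name not in _MODEL_CACHE:
        _MODEL_CACHE[model_name] = model_loader(model_name)
    return _MODEL_CACHE[model_name]


load_calls = []  # records how many times the (expensive) load actually ran


def fake_model_loader(model_name):
    # Stand-in for an expensive load such as SentenceTransformer(model_name).
    load_calls.append(model_name)
    # The "model" maps each text to a one-element vector holding its length.
    return lambda texts: [[float(len(t))] for t in texts]


def predict_batch(texts):
    # Called once per batch; the model is loaded lazily on the first batch.
    model = get_model("toy-model", fake_model_loader)
    return model(texts)


out1 = predict_batch(["a", "bb"])
out2 = predict_batch(["ccc"])
print(out1, out2, load_calls)
```

The second batch reuses the cached model, so the loader runs only once per process in this sketch.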


In [29]:
embeddings = df.withColumn("encoding", encode(col("lines")))

In [30]:
%%time
results = embeddings.collect()

CPU times: user 13.2 ms, sys: 0 ns, total: 13.2 ms
Wall time: 11.6 s


                                                                                

In [31]:
embeddings.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                       lines|                                                    encoding|
+------------------------------------------------------------+------------------------------------------------------------+
|Cyber zone, as this DVD was sold in Oz, is about the wors...|[-0.16858114, -0.20849928, -0.13957737, -0.2384811, -0.00...|
|I watched this movie to see the direction one of the most...|[0.008758395, -0.008342016, -0.11909039, 0.025434582, -0....|
|I tried to be patient and open-minded but found myself in...|[0.24895968, -0.06459605, 0.20849153, -0.062761486, 0.214...|
|While the dog was cute, the film was not. It wasn't the p...|[0.088919066, -0.2933342, 0.09228413, -0.018055102, -0.07...|
|The opening scene makes you feel like you're watching a h...|[0.1402397, -0.34980178, 0.09537209, -0.13844037, -0.2392...|
|This mo

                                                                                