# PySpark Hugging Face Inferencing
### Text Classification using Pipelines

Based on: https://huggingface.co/docs/transformers/quicktour#pipeline-usage

In [1]:
import pandas as pd
import sparkext

from pyspark.sql.functions import col, pandas_udf
from sparkext.huggingface import pipeline_udf
from transformers import pipeline

In [2]:
pipe = pipeline("text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [3]:
pipe("What can I say that hasn't been said already. I think this place is totally worth the hype.")

[{'label': 'POSITIVE', 'score': 0.9994712471961975}]

In [4]:
pipe("I will not say much about this film, because there is not much to say, because there is not much there to talk about.")

[{'label': 'NEGATIVE', 'score': 0.9997401833534241}]

## Inference using Spark ML Model

In [5]:
# Keep only the first sentence of each IMDB review (the text before the first period).
@pandas_udf("string")
def first_sentence(text: pd.Series) -> pd.Series:
    return pd.Series([s.split(".")[0] for s in text])

# Load the test set, derive the "sentence" column, and keep 100 rows.
df = (
    spark.read.parquet("imdb_test")
    .withColumn("sentence", first_sentence(col("lines")))
    .select("sentence")
    .limit(100)
)
df.show(truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                sentence|
+------------------------------------------------------------------------------------------------------------------------+
|i came across this film on the net by fluke and i was horrified by its content of vivid abuse violence and torture sc...|
|                                            He who fights with monsters might take care lest he thereby become a monster|
|                                                                        We thought this was one of the worst movies ever|
|                              This movie, despite its list of B, C, and D list celebs, is a complete waste of 90 minutes|
|                                                                                                  i have one word: focus|
|This movie woul
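
The `first_sentence` UDF keeps everything before the first period. The following plain-Python sketch applies the same split rule so edge cases can be checked without a Spark session. The helper name `first_sentence_py` and the sample strings are illustrative assumptions, not taken from the dataset.

```python
def first_sentence_py(text: str) -> str:
    """Return the text before the first period, mirroring s.split(".")[0]."""
    return text.split(".")[0]


# Hypothetical inputs chosen to show typical and edge-case behaviour.
samples = [
    "We thought this was one of the worst movies ever. Truly awful.",
    "No period here",
    "Mr. Smith goes to town.",
]

results = [first_sentence_py(s) for s in samples]
for original, sentence in zip(samples, results):
    print(repr(sentence))
```

Note that abbreviations such as "Mr." end the sentence early under this rule, and that a null value in the column would raise an error in the UDF because `None` has no `split` method.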

                                                                                

In [6]:
# Wrap the pipeline as a Spark ML model; return_type describes the struct of each prediction.
my_model = (
    sparkext.huggingface.PipelineModel(pipe, return_type="label string, score float")
    .setInputCol("sentence")
    .setOutputCol("preds")
)

In [7]:
predictions = my_model.transform(df).select("sentence", "preds.*")

Using supplied Pipeline


In [8]:
predictions.show(truncate=80)

+--------------------------------------------------------------------------------+--------+----------+
|                                                                        sentence|   label|     score|
+--------------------------------------------------------------------------------+--------+----------+
|i came across this film on the net by fluke and i was horrified by its conten...|NEGATIVE|0.99958783|
|    He who fights with monsters might take care lest he thereby become a monster|NEGATIVE|0.99694073|
|                                We thought this was one of the worst movies ever|NEGATIVE|0.99978095|
|This movie, despite its list of B, C, and D list celebs, is a complete waste ...|NEGATIVE|0.99979585|
|                                                          i have one word: focus|POSITIVE| 0.9677835|
|This movie would have been alright, indeed probably excellent, if the directo...|NEGATIVE|0.99626476|
|Disappointing heist movie indeed, I was actually expecting a pretty cool

                                                                                

In [9]:
%%time
preds = predictions.collect()

CPU times: user 7.86 ms, sys: 3.89 ms, total: 11.7 ms
Wall time: 6.42 s

## Inference using Spark DL UDF

In [10]:
# Keep only the first sentence of each IMDB review (the text before the first period).
@pandas_udf("string")
def first_sentence(text: pd.Series) -> pd.Series:
    return pd.Series([s.split(".")[0] for s in text])

# Load the test set, derive the "sentence" column, and keep 100 rows.
df = (
    spark.read.parquet("imdb_test")
    .withColumn("sentence", first_sentence(col("lines")))
    .select("sentence")
    .limit(100)
)
df.show(truncate=80)

+--------------------------------------------------------------------------------+
|                                                                        sentence|
+--------------------------------------------------------------------------------+
|i came across this film on the net by fluke and i was horrified by its conten...|
|    He who fights with monsters might take care lest he thereby become a monster|
|                                We thought this was one of the worst movies ever|
|This movie, despite its list of B, C, and D list celebs, is a complete waste ...|
|                                                          i have one word: focus|
|This movie would have been alright, indeed probably excellent, if the directo...|
|Disappointing heist movie indeed, I was actually expecting a pretty cool cat ...|
|THE BOX (2009) * Cameron Diaz, James Marsden, Frank Langella, James Rebhorn, ...|
|Just watched on UbuWeb this early experimental short film directed by William...|
|I w

In [11]:
# Note: return_type must be specified manually to match the fields of the pipeline output shown earlier (label, score).
classify = pipeline_udf(pipe, return_type="label string, score float")

Using supplied Pipeline
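
The `return_type` string has to mirror the fields of one pipeline prediction. As a rough illustration, the sketch below derives such a string from a sample output dict. The helper `infer_return_type` and its type mapping are hypothetical and are not part of sparkext; the sample dict is copied from the pipeline output earlier in this notebook.

```python
def infer_return_type(sample: dict) -> str:
    """Build a Spark DDL-style schema string from one pipeline output dict."""
    # Assumed mapping from Python types to Spark SQL type names (illustrative only).
    type_names = {str: "string", float: "float", int: "int", bool: "boolean"}
    return ", ".join(
        f"{name} {type_names[type(value)]}" for name, value in sample.items()
    )


# One prediction as returned by pipe(...) earlier in this notebook.
sample = {"label": "POSITIVE", "score": 0.9994712471961975}
schema = infer_return_type(sample)
print(schema)  # label string, score float
```

Spark's `float` type is 32-bit, which is consistent with the shortened scores (for example `0.99958783`) in the prediction tables.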


In [12]:
# Note: "preds.*" expands the struct returned by the UDF into top-level columns (label, score).
predictions = df.withColumn("preds", classify(col("sentence"))).select("sentence", "preds.*")
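
To make the effect of `select("sentence", "preds.*")` concrete, here is a plain-Python analogy that flattens one nested row. The helper `expand_struct` is a hypothetical name used only for illustration; the row values are taken from the prediction table in this notebook.

```python
def expand_struct(row: dict, struct_col: str) -> dict:
    """Flatten one nested column into top-level keys, similar to select("preds.*")."""
    flat = {key: value for key, value in row.items() if key != struct_col}
    flat.update(row[struct_col])
    return flat


# One row before expansion: the prediction is nested under "preds".
row = {
    "sentence": "i have one word: focus",
    "preds": {"label": "POSITIVE", "score": 0.9677835},
}

flat_row = expand_struct(row, "preds")
print(flat_row)
```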

In [13]:
%%time
preds = predictions.collect()

CPU times: user 12.3 ms, sys: 269 µs, total: 12.6 ms
Wall time: 8.23 s

In [14]:
predictions.show(truncate=80)

+--------------------------------------------------------------------------------+--------+----------+
|                                                                        sentence|   label|     score|
+--------------------------------------------------------------------------------+--------+----------+
|Cyber zone, as this DVD was sold in Oz, is about the worst B-Grade junk I hav...|NEGATIVE|0.99978906|
|I watched this movie to see the direction one of the most promising young tal...|POSITIVE| 0.9994943|
|     I tried to be patient and open-minded but found myself in a coma-like state|NEGATIVE| 0.9994462|
|                                        While the dog was cute, the film was not|NEGATIVE| 0.9985183|
|        The opening scene makes you feel like you're watching a high school play|POSITIVE| 0.6598849|
|This movie starts off promisingly enough, but it gets a little to convoluted ...|POSITIVE| 0.9640631|
|                                       I was out-of-town, visiting an ol

                                                                                

### Using model loader

In [15]:
import pandas as pd

from pyspark.sql.functions import col, pandas_udf
from sparkext.huggingface import pipeline_udf

In [16]:
# Keep only the first sentence of each IMDB review (the text before the first period).
@pandas_udf("string")
def first_sentence(text: pd.Series) -> pd.Series:
    return pd.Series([s.split(".")[0] for s in text])

# Load the test set, derive the "sentence" column, and keep 100 rows.
df = (
    spark.read.parquet("imdb_test")
    .withColumn("sentence", first_sentence(col("lines")))
    .select("sentence")
    .limit(100)
)
df.show(truncate=80)

+--------------------------------------------------------------------------------+
|                                                                        sentence|
+--------------------------------------------------------------------------------+
|i came across this film on the net by fluke and i was horrified by its conten...|
|    He who fights with monsters might take care lest he thereby become a monster|
|                                We thought this was one of the worst movies ever|
|This movie, despite its list of B, C, and D list celebs, is a complete waste ...|
|                                                          i have one word: focus|
|This movie would have been alright, indeed probably excellent, if the directo...|
|Disappointing heist movie indeed, I was actually expecting a pretty cool cat ...|
|THE BOX (2009) * Cameron Diaz, James Marsden, Frank Langella, James Rebhorn, ...|
|Just watched on UbuWeb this early experimental short film directed by William...|
|I w

In [17]:
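def model_loader(task: str):
    # Import inside the function so the imports run on the executors, where the model is loaded.
    import torch
    from transformers import pipeline

    # Use the current CUDA device when a GPU is available; -1 selects the CPU.
    device_id = torch.cuda.current_device() if torch.cuda.is_available() else -1
    return pipeline(task, device=device_id)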

In [18]:
# Note: return_type must be specified manually to match the pipeline's output fields (label, score).
classify = pipeline_udf("text-classification", model_loader=model_loader, return_type="label string, score float")

Deferring model loading to executors.
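
The log message indicates that the pipeline is built on the executors rather than on the driver. How sparkext implements this is not shown in this notebook; the sketch below only illustrates the general idea of deferred loading, where a loader function is called on first use and its result is cached. The names `make_lazy_predictor` and `fake_loader` are hypothetical stand-ins.

```python
from typing import Any, Callable, List


def make_lazy_predictor(loader: Callable[[str], Any], task: str) -> Callable[[List[str]], Any]:
    """Return a predict function that calls loader(task) only on first use."""
    cache = {}

    def predict(batch: List[str]) -> Any:
        # Load the model the first time a batch arrives, then reuse it.
        if "model" not in cache:
            cache["model"] = loader(task)
        return cache["model"](batch)

    return predict


# Stand-in loader that records each load and returns a dummy classifier.
load_calls = []


def fake_loader(task: str):
    load_calls.append(task)
    return lambda batch: [{"label": "POSITIVE", "score": 1.0} for _ in batch]


predict = make_lazy_predictor(fake_loader, "text-classification")
calls_before = len(load_calls)   # nothing is loaded when the predictor is created
first = predict(["a", "b"])      # triggers the load
second = predict(["c"])          # reuses the cached model
print(calls_before, len(load_calls))
```

Deferring the load this way means only the small loader function has to be shipped to the workers, not the model object itself, at the cost of a one-time load on the first batch.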


In [19]:
predictions = df.withColumn("preds", classify(col("sentence"))).select("sentence", "preds.*")

In [20]:
%%time
preds = predictions.collect()

CPU times: user 11.6 ms, sys: 3.71 ms, total: 15.3 ms
Wall time: 13.3 s

In [21]:
predictions.show(truncate=80)

+--------------------------------------------------------------------------------+--------+----------+
|                                                                        sentence|   label|     score|
+--------------------------------------------------------------------------------+--------+----------+
|i came across this film on the net by fluke and i was horrified by its conten...|NEGATIVE|0.99958783|
|    He who fights with monsters might take care lest he thereby become a monster|NEGATIVE|0.99694073|
|                                We thought this was one of the worst movies ever|NEGATIVE|0.99978095|
|This movie, despite its list of B, C, and D list celebs, is a complete waste ...|NEGATIVE|0.99979585|
|                                                          i have one word: focus|POSITIVE| 0.9677835|
|This movie would have been alright, indeed probably excellent, if the directo...|NEGATIVE|0.99626476|
|Disappointing heist movie indeed, I was actually expecting a pretty cool

                                                                                