# PySpark Huggingface Inferencing
## Conditional generation

From: https://huggingface.co/docs/transformers/model_doc/t5

### Using PyTorch

In [1]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

max_source_length = 512
max_target_length = 128

task_prefix = "translate English to German: "

lines = [
    "The house is wonderful",
    "Welcome to NYC",
    "HuggingFace is a company"
]

input_sequences = [task_prefix + l for l in lines]

In [2]:
input_ids = tokenizer(input_sequences,
                      padding="longest",
                      max_length=max_source_length,
                      truncation=True,  # make truncation to max_length explicit
                      return_tensors="pt").input_ids
outputs = model.generate(input_ids)

In [3]:
[tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

['Das Haus ist wunderbar',
 'Willkommen in NYC',
 'HuggingFace ist ein Unternehmen']
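
As a rough illustration of what `padding="longest"` does in the tokenizer call above, the sketch below pads toy token-id lists to the length of the longest sequence in the batch. The ids, the default `pad_id` of 0, and the helper name `pad_to_longest` are assumptions for illustration only; the real tokenizer produces its own ids and also returns an attention mask alongside them.

```python
from typing import List, Tuple


def pad_to_longest(
    batch: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length in the batch.

    Returns the padded ids and a matching attention mask
    (1 = real token, 0 = padding).
    """
    longest = max((len(seq) for seq in batch), default=0)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return padded, mask


# Toy ids, not real T5 token ids.
toy_batch = [[11, 12, 13, 14], [21, 22], [31, 32, 33]]
padded, mask = pad_to_longest(toy_batch)
print(padded)
print(mask)
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor.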

In [4]:
model.framework

'pt'

### Using TensorFlow

In [5]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

In [6]:
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = TFT5ForConditionalGeneration.from_pretrained("t5-small")

max_source_length = 512
max_target_length = 128

task_prefix = "translate English to German: "

lines = [
    "The house is wonderful",
    "Welcome to NYC",
    "HuggingFace is a company"
]

input_sequences = [task_prefix + l for l in lines]

2022-04-07 13:27:05.477285: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-04-07 13:27:05.508477: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [7]:
input_ids = tokenizer(input_sequences,
                      padding="longest",
                      max_length=max_source_length,
                      truncation=True,  # make truncation to max_length explicit
                      return_tensors="tf").input_ids
outputs = model.generate(input_ids)

In [8]:
[tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

['Das Haus ist wunderbar',
 'Willkommen in NYC',
 'HuggingFace ist ein Unternehmen']

In [9]:
model.framework

'tf'

## PySpark

In [10]:
import os
from pathlib import Path
from torchtext.datasets import IMDB

In [11]:
# load IMDB reviews (test) dataset
data = IMDB(split='test')
len(data)

25000

In [12]:
# convert to a nested list of strings for pyspark (one single-element list per row)
lines = []
for label, text in data:
    # keep the full review text; the first sentence is extracted later in preprocess()
    lines.append([text])

### Create PySpark DataFrame

In [13]:
from pyspark.sql.types import *

In [14]:
df = spark.createDataFrame(lines, ['lines']).repartition(10)
df.schema

StructType(List(StructField(lines,StringType,true)))

In [15]:
df.take(1)

22/04/07 13:27:15 WARN TaskSetManager: Stage 0 contains a task of very large size (1302 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

[Row(lines='...But not this one! I always wanted to know "what happened" next. We will never know for sure what happened because GWTW was Margaret\'s baby. I am a lifelong fan of Gone With the Wind and I could not have been more repulsed by the movie. I did compare "Scarlett" to the original GWTW because any film worth following GWTW needed to be on the same quality level as the first. Rhett was cast beautifully, although NO ONE will ever compare to Mr. Gable. I am also a strict Vivien Leigh fan!! She WAS Scarlett. She fit the bill. Not another actress in this lifetime or another will ever fit the same shoes but with "Scarlett" the job could have been done better. Not enough thought went into finding the proper Scarlett, that was evident.<br /><br />Overall, something to look to but if you want to know the what happened to Scarlett and Rhett, I suggest writing it yourself or finding fan fiction. This movie is not worth the time.')]

### Save the test dataset as parquet files

In [16]:
df.write.mode("overwrite").parquet("imdb_test")

22/04/07 13:27:18 WARN TaskSetManager: Stage 3 contains a task of very large size (1302 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

### Check arrow memory configuration

In [17]:
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "512")
# This line will fail if the vectorized reader runs out of memory
assert len(df.head()) > 0, "`df` should not be empty"
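
To give a feel for the `maxRecordsPerBatch` setting above, this sketch counts how many Arrow record batches a partition would be split into when each batch holds at most 512 rows. The helper `num_batches` and the even split across partitions are illustrative assumptions; how the 25,000 rows are actually distributed over the 10 partitions is decided by Spark at run time.

```python
import math


def num_batches(rows_in_partition: int, max_records_per_batch: int = 512) -> int:
    """Number of batches needed to cover a partition of the given size."""
    if max_records_per_batch <= 0:
        raise ValueError("max_records_per_batch must be positive")
    return math.ceil(rows_in_partition / max_records_per_batch)


# Assumption: 25,000 rows spread evenly over 10 partitions.
rows_per_partition = 25_000 // 10
print(num_batches(rows_per_partition))  # 2500 rows -> 5 batches
print(num_batches(100))                 # 100 rows fit in a single batch
```

A smaller batch size lowers peak memory per batch at the cost of more calls into the Python worker.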

22/04/07 13:27:20 WARN TaskSetManager: Stage 6 contains a task of very large size (1302 KiB). The maximum recommended task size is 1000 KiB.


## Inference using Spark ML Model
Note: you can restart the kernel and run from this point to simulate running in a different node or environment.

In [18]:
import pandas as pd
import sparkext
from pyspark.sql.functions import col, pandas_udf

In [19]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [20]:
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

In [21]:
# only use first N examples, since this is slow
df = spark.read.parquet("imdb_test").limit(100)
df.show(truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   lines|
+------------------------------------------------------------------------------------------------------------------------+
|i came across this film on the net by fluke and i was horrified by its content of vivid abuse violence and torture sc...|
|He who fights with monsters might take care lest he thereby become a monster. And if you gaze for long into an abyss,...|
|We thought this was one of the worst movies ever. I had to volunteer to watch the end. The romance was not believable...|
|This movie, despite its list of B, C, and D list celebs, is a complete waste of 90 minutes. The plot, with its few pe...|
|i have one word: focus.<br /><br />well.<br /><br />IMDb wants me to use at least ten lines of text. okay. let's disc...|
|This movie woul

In [22]:
# only use first sentence and add prefix for conditional generation
def preprocess(text: pd.Series, prefix: str = "") -> pd.Series:
    @pandas_udf("string")
    def _preprocess(text: pd.Series) -> pd.Series:
        return pd.Series([prefix + s.split(".")[0] for s in text])
    return _preprocess(text)
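
The body of `_preprocess` can be tried outside Spark on a plain list of strings. The sketch below mirrors the `prefix + s.split(".")[0]` expression; the function name `first_sentence_with_prefix` and the sample strings are made up for illustration. Splitting on the first period is only a rough heuristic: an abbreviation such as "Mr." ends the "sentence" early, and text without a period is kept whole.

```python
from typing import List


def first_sentence_with_prefix(texts: List[str], prefix: str = "") -> List[str]:
    """Keep the text before the first period and prepend the task prefix."""
    return [prefix + s.split(".")[0] for s in texts]


samples = [
    "We thought this was one of the worst movies ever. I had to volunteer.",
    "Mr. Gable was great. The sequel was not.",
    "no period here",
]
prefixed = first_sentence_with_prefix(samples, "Translate English to German: ")
for line in prefixed:
    print(line)
```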

In [23]:
# add prefix, only use first 100 rows, since generation takes a while
df1 = df.withColumn("input", preprocess(col("lines"), "Translate English to German: ")).select("input")
df1.show(truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   input|
+------------------------------------------------------------------------------------------------------------------------+
|Translate English to German: i came across this film on the net by fluke and i was horrified by its content of vivid ...|
|               Translate English to German: He who fights with monsters might take care lest he thereby become a monster|
|                                           Translate English to German: We thought this was one of the worst movies ever|
| Translate English to German: This movie, despite its list of B, C, and D list celebs, is a complete waste of 90 minutes|
|                                                                     Translate English to German: i have one word: focus|
|Translate Engli

                                                                                

In [24]:
my_model = sparkext.huggingface.Model(model, tokenizer, 
                    max_length=128, padding="longest", return_tensors="pt", truncation=True, skip_special_tokens=True) \
                    .setInputCol("input") \
                    .setOutputCol("translation")

**Note**: Creating the model from a model name string, as in the snippet below, doesn't work here. `T5ForConditionalGeneration` adds a language modeling head on top of the base T5 model, whereas `AutoModel` loads only the base T5 model.
See: https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5ForConditionalGeneration
```
my_model = sparkext.huggingface.Model("t5-small")
```

In [25]:
predictions = my_model.transform(df1)

Using supplied Model and Tokenizer


In [26]:
%%time
predictions.write.mode("overwrite").parquet("imdb_translations")
results = predictions.collect()

CPU times: user 18.1 ms, sys: 7.47 ms, total: 25.5 ms
Wall time: 18.5 s


                                                                                

In [27]:
results[:5]

[Row(input='Translate English to German: Cyber zone, as this DVD was sold in Oz, is about the worst B-Grade junk I have seen', translation='Cyberzone, da diese DVD in Oz verkauft wurde, ist über die schlimmste B-Gra'),
 Row(input='Translate English to German: I watched this movie to see the direction one of the most promising young talents in movies was going', translation='Ich habe diesen Film gesehen, um zu sehen, in welche Richtung eines der viel'),
 Row(input='Translate English to German: I tried to be patient and open-minded but found myself in a coma-like state', translation='Ich habe versucht, geduldig und offen zu sein, aber ich habe mich in'),
 Row(input='Translate English to German: While the dog was cute, the film was not', translation='Während der Hund süß war, war der Film nicht engl.'),
 Row(input="Translate English to German: The opening scene makes you feel like you're watching a high school play", translation='Die Eröffnungsszene macht Sie fühlen sich wie ein High Scho

## Inference using Spark DL UDF (PyTorch)
Note: you can restart the kernel and run from this point to simulate running in a different node or environment.

In [28]:
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from sparkext.huggingface import model_udf

In [29]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [30]:
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

In [31]:
# only use first N examples, since this is slow
df = spark.read.parquet("imdb_test").limit(100)
df.show(truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   lines|
+------------------------------------------------------------------------------------------------------------------------+
|i came across this film on the net by fluke and i was horrified by its content of vivid abuse violence and torture sc...|
|He who fights with monsters might take care lest he thereby become a monster. And if you gaze for long into an abyss,...|
|We thought this was one of the worst movies ever. I had to volunteer to watch the end. The romance was not believable...|
|This movie, despite its list of B, C, and D list celebs, is a complete waste of 90 minutes. The plot, with its few pe...|
|i have one word: focus.<br /><br />well.<br /><br />IMDb wants me to use at least ten lines of text. okay. let's disc...|
|This movie woul

In [32]:
# only use first sentence and add prefix for conditional generation
def preprocess(text: pd.Series, prefix: str = "") -> pd.Series:
    @pandas_udf("string")
    def _preprocess(text: pd.Series) -> pd.Series:
        return pd.Series([prefix + s.split(".")[0] for s in text])
    return _preprocess(text)

In [33]:
# only use first 100 rows, since generation takes a while
df1 = df.withColumn("input", preprocess(col("lines"), "Translate English to German: ")).select("input").limit(100)

In [34]:
df1.show(truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   input|
+------------------------------------------------------------------------------------------------------------------------+
|        Translate English to German: Cyber zone, as this DVD was sold in Oz, is about the worst B-Grade junk I have seen|
|Translate English to German: I watched this movie to see the direction one of the most promising young talents in mov...|
|                Translate English to German: I tried to be patient and open-minded but found myself in a coma-like state|
|                                                   Translate English to German: While the dog was cute, the film was not|
|                   Translate English to German: The opening scene makes you feel like you're watching a high school play|
|Translate Engli

In [35]:
# note: default return_type is 'string'
generate = model_udf(model, tokenizer=tokenizer,
                     max_length=128, padding="longest", return_tensors="pt", truncation=True, skip_special_tokens=True)

Using supplied Model and Tokenizer
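
Conceptually, a UDF like `generate` receives one batch of input strings at a time and returns one output string per input. The sketch below shows that shape with stand-in components. `make_batch_fn`, `toy_tokenize`, `toy_generate`, and `toy_decode` are hypothetical names invented here, and this is not how `sparkext` implements `model_udf`.

```python
from typing import Callable, List

Tokens = List[str]


def make_batch_fn(
    tokenize_fn: Callable[[List[str]], List[Tokens]],
    generate_fn: Callable[[List[Tokens]], List[Tokens]],
    decode_fn: Callable[[Tokens], str],
) -> Callable[[List[str]], List[str]]:
    """Compose tokenize -> generate -> decode into one batch-in, batch-out function."""

    def predict(batch: List[str]) -> List[str]:
        token_batch = tokenize_fn(batch)
        output_batch = generate_fn(token_batch)
        return [decode_fn(tokens) for tokens in output_batch]

    return predict


# Stand-ins for a real tokenizer and model (assumptions for illustration only).
def toy_tokenize(batch: List[str]) -> List[Tokens]:
    return [text.split() for text in batch]


def toy_generate(token_batch: List[Tokens]) -> List[Tokens]:
    return [[tok.upper() for tok in tokens] for tokens in token_batch]


def toy_decode(tokens: Tokens) -> str:
    return " ".join(tokens)


predict = make_batch_fn(toy_tokenize, toy_generate, toy_decode)
print(predict(["the house is wonderful", "welcome"]))
```

The key property is that the output has exactly one entry per input row, in the same order, so Spark can attach it as a new column.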


In [36]:
predictions = df1.withColumn("preds", generate(col("input")))

In [37]:
predictions.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                       input|                                                       preds|
+------------------------------------------------------------+------------------------------------------------------------+
|Translate English to German: i came across this film on t...|ich erfuhr diesen Film im Netz durch Fluke und war entset...|
|Translate English to German: He who fights with monsters ...|Wer mit Monstern kämpft, kann sich um die ganze Sache küm...|
|Translate English to German: We thought this was one of t...|       Wir hielten es für einen der schlimmsten Filme jemals|
|Translate English to German: This movie, despite its list...|        Dieser Film, trotz seiner Liste von B, C, und D-Kelb|
|         Translate English to German: i have one word: focus|                                      i have one word: focus|
|Transla

                                                                                

In [38]:
%%time
preds = predictions.collect()

CPU times: user 9.46 ms, sys: 1.55 ms, total: 11 ms
Wall time: 5.29 s


                                                                                

In [39]:
# only use first 100 rows, since generation takes a while
df2 = df.withColumn("input", preprocess(col("lines"), "Translate English to French: ")).select("input").limit(100)

In [40]:
df2.show(truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   input|
+------------------------------------------------------------------------------------------------------------------------+
|Translate English to French: i came across this film on the net by fluke and i was horrified by its content of vivid ...|
|               Translate English to French: He who fights with monsters might take care lest he thereby become a monster|
|                                           Translate English to French: We thought this was one of the worst movies ever|
| Translate English to French: This movie, despite its list of B, C, and D list celebs, is a complete waste of 90 minutes|
|                                                                     Translate English to French: i have one word: focus|
|Translate Engli

In [41]:
predictions = df2.withColumn("preds", generate(col("input")))

In [42]:
predictions.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                       input|                                                       preds|
+------------------------------------------------------------+------------------------------------------------------------+
|Translate English to French: Cyber zone, as this DVD was ...|  La cyberzone, étant donné que ce DVD a été vendu à Oz, est|
|Translate English to French: I watched this movie to see ...|J’ai regardé ce film pour voir la direction d’un des jeun...|
|Translate English to French: I tried to be patient and op...|                J'ai essayé d'être patient et ouvert mais j'|
|Translate English to French: While the dog was cute, the ...|         Le chien était sain, mais le film n’était pas assez|
|Translate English to French: The opening scene makes you ...|La scène d'ouverture vous fait sentir que vous regardez u...|
|Transla

                                                                                

In [43]:
%%time
preds = predictions.collect()

CPU times: user 10.4 ms, sys: 3.55 ms, total: 13.9 ms
Wall time: 7.07 s


                                                                                

## Inference using Spark DL UDF (TensorFlow)
Note: you can restart the kernel and run from this point to simulate running in a different node or environment.

In [44]:
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from sparkext.huggingface import model_udf

In [45]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

In [46]:
# only use first N examples, since this is slow
df = spark.read.parquet("imdb_test").limit(100)
df.show(truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   lines|
+------------------------------------------------------------------------------------------------------------------------+
|i came across this film on the net by fluke and i was horrified by its content of vivid abuse violence and torture sc...|
|He who fights with monsters might take care lest he thereby become a monster. And if you gaze for long into an abyss,...|
|We thought this was one of the worst movies ever. I had to volunteer to watch the end. The romance was not believable...|
|This movie, despite its list of B, C, and D list celebs, is a complete waste of 90 minutes. The plot, with its few pe...|
|i have one word: focus.<br /><br />well.<br /><br />IMDb wants me to use at least ten lines of text. okay. let's disc...|
|This movie woul

In [47]:
# only use first sentence and add prefix for conditional generation
def preprocess(text: pd.Series, prefix: str = "") -> pd.Series:
    @pandas_udf("string")
    def _preprocess(text: pd.Series) -> pd.Series:
        return pd.Series([prefix + s.split(".")[0] for s in text])
    return _preprocess(text)

In [48]:
# only use first 100 rows, since generation takes a while
df1 = df.withColumn("input", preprocess(col("lines"), "Translate English to German: ")).select("input").limit(100)

In [49]:
df1.show(truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   input|
+------------------------------------------------------------------------------------------------------------------------+
|        Translate English to German: Cyber zone, as this DVD was sold in Oz, is about the worst B-Grade junk I have seen|
|Translate English to German: I watched this movie to see the direction one of the most promising young talents in mov...|
|                Translate English to German: I tried to be patient and open-minded but found myself in a coma-like state|
|                                                   Translate English to German: While the dog was cute, the film was not|
|                   Translate English to German: The opening scene makes you feel like you're watching a high school play|
|Translate Engli

In [50]:
# Use a model_loader here, since Spark doesn't serialize this model correctly
def model_loader(model_id):
    from transformers import TFT5ForConditionalGeneration, T5Tokenizer
    model = TFT5ForConditionalGeneration.from_pretrained(model_id)
    tokenizer = T5Tokenizer.from_pretrained(model_id)
    return model, tokenizer
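
One reason to pass a loader function instead of a model object is that each executor process can then build the model itself rather than receiving a serialized copy. A common companion pattern, sketched below with a stand-in loader, is to cache the loaded objects so that repeated batches in the same process reuse them. `fake_loader` and `get_model` are hypothetical names, and whether `sparkext` caches this way is an assumption, not something this notebook states.

```python
import functools

load_calls = []  # records each real load, to show the cache working


def fake_loader(model_id: str):
    """Stand-in for model_loader: returns placeholder (model, tokenizer) objects."""
    load_calls.append(model_id)
    return f"model:{model_id}", f"tokenizer:{model_id}"


@functools.lru_cache(maxsize=None)
def get_model(model_id: str):
    """Load once per process and model id; later calls hit the cache."""
    return fake_loader(model_id)


# e.g. three batches processed by the same worker process
for _ in range(3):
    model, tokenizer = get_model("t5-small")

print(model, tokenizer)
print(len(load_calls))
```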

In [51]:
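# note: default return_type for model_udf is 'string'
# define the tokenizer here so this section also runs after a kernel restart
tokenizer = T5Tokenizer.from_pretrained("t5-small")
generate = model_udf("t5-small", tokenizer=tokenizer, model_loader=model_loader,
                     max_length=128, padding="longest", return_tensors="tf", truncation=True, skip_special_tokens=True)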

Deferring model loading to executors.


All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [52]:
predictions = df1.withColumn("preds", generate(col("input")))

In [53]:
predictions.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                       input|                                                       preds|
+------------------------------------------------------------+------------------------------------------------------------+
|Translate English to German: Cyber zone, as this DVD was ...|Cyberzone, da diese DVD in Oz verkauft wurde, ist über di...|
|Translate English to German: I watched this movie to see ...|Ich habe diesen Film gesehen, um zu sehen, in welche Rich...|
|Translate English to German: I tried to be patient and op...|Ich habe versucht, geduldig und offen zu sein, aber ich h...|
|Translate English to German: While the dog was cute, the ...|          Während der Hund süß war, war der Film nicht engl.|
|Translate English to German: The opening scene makes you ...|Die Eröffnungsszene macht Sie fühlen sich wie ein High Sc...|
|Transla

                                                                                

In [54]:
%%time
preds = predictions.collect()

CPU times: user 14 ms, sys: 480 µs, total: 14.5 ms
Wall time: 15.3 s


                                                                                

In [55]:
# only use first 100 rows, since generation takes a while
df2 = df.withColumn("input", preprocess(col("lines"), "Translate English to French: ")).select("input").limit(100)

In [56]:
df2.show(truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                   input|
+------------------------------------------------------------------------------------------------------------------------+
|Translate English to French: i came across this film on the net by fluke and i was horrified by its content of vivid ...|
|               Translate English to French: He who fights with monsters might take care lest he thereby become a monster|
|                                           Translate English to French: We thought this was one of the worst movies ever|
| Translate English to French: This movie, despite its list of B, C, and D list celebs, is a complete waste of 90 minutes|
|                                                                     Translate English to French: i have one word: focus|
|Translate Engli

In [57]:
predictions = df2.withColumn("preds", generate(col("input")))

In [58]:
predictions.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                       input|                                                       preds|
+------------------------------------------------------------+------------------------------------------------------------+
|Translate English to French: Cyber zone, as this DVD was ...|  La cyberzone, étant donné que ce DVD a été vendu à Oz, est|
|Translate English to French: I watched this movie to see ...|J’ai regardé ce film pour voir la direction d’un des jeun...|
|Translate English to French: I tried to be patient and op...|                J'ai essayé d'être patient et ouvert mais j'|
|Translate English to French: While the dog was cute, the ...|         Le chien était sain, mais le film n’était pas assez|
|Translate English to French: The opening scene makes you ...|La scène d'ouverture vous fait sentir que vous regardez u...|
|Transla

                                                                                

In [59]:
%%time
preds = predictions.collect()

CPU times: user 7.08 ms, sys: 5.95 ms, total: 13 ms
Wall time: 14.7 s


                                                                                