# PySpark Huggingface Inferencing
### Conditional generation

From: https://huggingface.co/docs/transformers/model_doc/t5

In [1]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

max_source_length = 512
max_target_length = 128

task_prefix = "translate English to German: "

lines = [
    "The house is wonderful",
    "Welcome to NYC",
    "HuggingFace is a company"
]

input_sequences = [task_prefix + l for l in lines]

In [2]:
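# Tokenize the batch; truncation=True is needed for max_length to take effect
inputs = tokenizer(input_sequences,
                   padding="longest",
                   max_length=max_source_length,
                   truncation=True,
                   return_tensors="pt")
# Pass the attention mask so padded positions are ignored during generation
outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask)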

In [3]:
[tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

['Das Haus ist wunderbar',
 'Willkommen in NYC',
 'HuggingFace ist ein Unternehmen']

## PySpark

In [4]:
import os
from pathlib import Path
from torchtext.datasets import IMDB

In [5]:
# load IMDB reviews (test) dataset
data = IMDB(split='test')
len(data)

25000

In [6]:
# convert to a nested list of strings for PySpark (one single-element list per row)
lines = []
for label, text in data:
    # keep the full review text; the label is not needed for translation
    lines.append([text])
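
The cell above stores each full review as one row. Since the reviews are long, one option is to keep only the first sentence of each review before building the DataFrame. The following is a minimal sketch of that idea, not part of the original notebook: `first_sentence` is a hypothetical helper, and its naive rule (cut at the first `.`, `!` or `?` followed by whitespace) will end a sentence early on abbreviations such as "cf.".

```python
import re


def first_sentence(text: str) -> str:
    """Return the text up to and including the first sentence terminator.

    Hypothetical helper: looks for the first '.', '!' or '?' that is
    followed by whitespace or the end of the string.
    """
    text = text.strip()
    match = re.search(r"[.!?](?=\s|$)", text)
    return text if match is None else text[: match.end()]


reviews = [
    "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded.",
    "No terminator here",
]

# One single-element list per row, the same shape that createDataFrame receives
rows = [[first_sentence(r)] for r in reviews]
```

With this helper, `lines.append([first_sentence(text)])` would replace the append in the loop.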

### Create PySpark DataFrame

In [7]:
from pyspark.sql.types import *

In [8]:
df = spark.createDataFrame(lines, ['lines'])
df.schema

StructType(List(StructField(lines,StringType,true)))

In [9]:
df.take(1)

22/02/24 14:21:15 WARN TaskSetManager: Stage 0 contains a task of very large size (1302 KiB). The maximum recommended task size is 1000 KiB.

[Row(lines='I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish a

### Save the test dataset as parquet files

In [10]:
df.write.mode("overwrite").parquet("imdb_test")

22/02/24 14:21:17 WARN TaskSetManager: Stage 1 contains a task of very large size (1302 KiB). The maximum recommended task size is 1000 KiB.

### Check arrow memory configuration

In [11]:
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "512")
# This line will fail if the vectorized reader runs out of memory
assert len(df.head()) > 0, "`df` should not be empty"

22/02/24 14:21:19 WARN TaskSetManager: Stage 2 contains a task of very large size (1302 KiB). The maximum recommended task size is 1000 KiB.
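
`spark.sql.execution.arrow.maxRecordsPerBatch` caps the number of rows written to a single Arrow record batch, which in turn bounds how many reviews reach Python at once. The sketch below is plain Python rather than Spark: `batch_sizes` is a hypothetical helper, and it assumes all rows sit in one partition, whereas Spark applies the cap per partition.

```python
from typing import List


def batch_sizes(num_rows: int, max_records_per_batch: int) -> List[int]:
    """Split num_rows into consecutive batches of at most max_records_per_batch."""
    full, rest = divmod(num_rows, max_records_per_batch)
    return [max_records_per_batch] * full + ([rest] if rest else [])


# 25000 test reviews with the limit of 512 set above
sizes = batch_sizes(25000, 512)
print(len(sizes), sizes[-1])  # 49 batches, the last one holding 424 rows
```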


## Inference using Spark ML Model
Note: you can restart the kernel and run from this point to simulate running in a different node or environment.

In [12]:
import sparkext

In [13]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [14]:
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

In [15]:
# only use 100 examples, since inference is slow
df = spark.read.parquet("imdb_test").limit(100)

In [16]:
# use the same lowercase task prefix as in the T5 example above
my_model = sparkext.huggingface.Model(model, tokenizer, prefix="translate English to German: ")

In [17]:
predictions = my_model.transform(df)
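
The internals of `sparkext.huggingface.Model` are not shown in this notebook. As a rough illustration of the general pattern such a wrapper could follow (prepend the task prefix, run the model over fixed-size batches, return one prediction per input row), here is a framework-free sketch. `predict_in_batches`, `generate_fn` and `fake_generate` are hypothetical names; `fake_generate` only stands in for the tokenize, generate and decode steps from the first section.

```python
from typing import Callable, Iterable, Iterator, List


def predict_in_batches(
    texts: Iterable[str],
    generate_fn: Callable[[List[str]], List[str]],
    prefix: str = "",
    batch_size: int = 8,
) -> Iterator[str]:
    """Yield one prediction per input text, calling generate_fn once per batch."""
    batch: List[str] = []
    for text in texts:
        batch.append(prefix + text)
        if len(batch) == batch_size:
            yield from generate_fn(batch)
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield from generate_fn(batch)


def fake_generate(batch: List[str]) -> List[str]:
    """Stand-in for tokenizer + model.generate + tokenizer.decode."""
    return [s.upper() for s in batch]


preds = list(
    predict_in_batches(["a", "b", "c"], fake_generate, prefix="x: ", batch_size=2)
)
```

Keeping the batching logic separate from the model call makes it possible to test the row-in, row-out contract without loading T5.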



In [18]:
predictions.write.mode("overwrite").text("imdb_translations")
predictions.collect()

                                                                                

[Row(prediction='Dieser Film war in vielen Bereichen nicht erledigt, sondern'),
 Row(prediction='Dieser Film beginnt mit einem Mann, der scheinbar ein Sportfahrer ist, und'),
 Row(prediction='Nicht wirklich allzu viel zu diesem Film...entweder ein Stunt Racer oder'),
 Row(prediction='Dieser Film stützt sich nicht auf Geschichte, sondern viel Alkohol, Pot-Ra'),
 Row(prediction='Dass es krank ist, ist nicht krank, es'),
 Row(prediction='Ein einziger Schwertman, der in der Wüste lebt und als Agent für'),
 Row(prediction='"Ashes of Time" war ein übeles Projekt, aber'),
 Row(prediction='Ich frage mich, was schief ging, warum der Film überhaupt nicht funktioniert'),
 Row(prediction='Während meiner ersten Filme ins Ausland habe ich mich mit einer Vielzahl'),
 Row(prediction='Die ganze Szene entfaltet sich nicht, als ob der Film'),
 Row(prediction='Die hübsche Szenen sind merkwürdigerweise in einem jerk-'),
 Row(prediction='Die Trailer für diesen Film versprochen und dieser Film lieferte genau