In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from classifier import predict_series, make_pandas_udf
import pandas as pd




In [2]:
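spark = SparkSession.builder.appName('mysession').getOrCreate()
# Use Arrow to speed up toPandas() conversions
# (Spark 3.x renames this key to spark.sql.execution.arrow.pyspark.enabled).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")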

## Build Spark DataFrame

In [3]:
df_input = spark.read.parquet('data/input.parquet')

### Take Pandas Sample

In [4]:
pdf_sample = df_input.sample(False, fraction=0.10, seed=12345).toPandas()
pdf_sample.head(5)

| | input |
|---|---|
| 0 | is filestream lazy loaded in net |
| 1 | programmatically launching standalone adobe fl... |
| 2 | encoding problem classic asp |
| 3 | c # winforms datagridview/sql compact negative... |
| 4 | suspending and notifying threads when there is... |


### Test `predict_series` function

To define a UDF with `pandas_udf`, the wrapped Python function must map `pandas.Series` $\rightarrow$ `pandas.Series`. In our case, there are two ways to get fastText predictions for each sentence in a `pandas.Series` object.

Given the input `pandas.Series`, the two options are:

1. The **rowwise** way: use the `pandas.Series.apply` method to call the `classifier.get_predictions` function on each sentence, which yields a `pandas.Series` object. Under the hood, `pandas.Series.apply` is not vectorized; it loops over the elements of the `pandas.Series` one at a time. Select this with the `rowwise=True` option of the `predict_series` function.
2. The **native** way: convert the `pandas.Series` $\rightarrow$ `list` of `str` elements, pass the whole list to fastText's own method for predicting multiple sentences, and convert the result back into a `pandas.Series` object. Select this with the `rowwise=False` option of the `predict_series` function.

We don't know in advance which one is faster, so we test both.
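The two options can be sketched as follows. This is only an illustration, not the code in the `classifier` module: `StubModel`, `predict_one`, `predict_many`, and `predict_series_sketch` are hypothetical names, and the stub replaces the trained fastText model with a keyword lookup so the snippet runs on its own.

```python
import pandas as pd


class StubModel:
    """Hypothetical stand-in for a trained fastText model (not the real API)."""

    KEYWORDS = {"net": ".net", "asp": "asp.net", "threads": "java"}

    def predict_one(self, sentence):
        # Return the labels whose keyword appears as a word in the sentence.
        words = sentence.split()
        return [label for key, label in self.KEYWORDS.items() if key in words]

    def predict_many(self, sentences):
        # Batch entry point: one list of labels per sentence.
        return [self.predict_one(s) for s in sentences]


def predict_series_sketch(series, model, rowwise=False):
    if rowwise:
        # Rowwise: pandas calls the Python function once per element.
        return series.apply(model.predict_one)
    # Native: hand the whole batch to the model as a list of str,
    # then rebuild a Series aligned with the input index.
    return pd.Series(model.predict_many(series.tolist()), index=series.index)


sentences = pd.Series(
    ["is filestream lazy loaded in net", "encoding problem classic asp"],
    index=[10, 11],
)
native = predict_series_sketch(sentences, StubModel(), rowwise=False)
rowwise = predict_series_sketch(sentences, StubModel(), rowwise=True)
```

Both paths should return the same labels with the same index; they differ only in how many times Python code is invoked per batch.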

**Single prediction, native inference**

In [5]:
predict_series(pdf_sample.input, False, False).head(5).to_frame()

| | 0 |
|---|---|
| 0 | .net |
| 1 | |
| 2 | asp.net |
| 3 | c# |
| 4 | java |


**Multiple prediction, native inference**

In [6]:
predict_series(pdf_sample.input, True, False).head(5).to_frame()

| | 0 |
|---|---|
| 0 | [.net, c#, asp.net] |
| 1 | |
| 2 | [asp.net, asp.net-mvc] |
| 3 | [c#, .net] |
| 4 | [java, c#] |


**Single prediction, rowwise inference**

In [7]:
predict_series(pdf_sample.input, False, True).head(5).to_frame()

| | input |
|---|---|
| 0 | .net |
| 1 | |
| 2 | asp.net |
| 3 | c# |
| 4 | java |


**Multiple prediction, rowwise inference**

In [8]:
predict_series(pdf_sample.input, True, True).head(5).to_frame()

| | input |
|---|---|
| 0 | [.net, c#, asp.net] |
| 1 | |
| 2 | [asp.net, asp.net-mvc] |
| 3 | [c#, .net] |
| 4 | [java, c#] |


**Performance comparison: native vs. rowwise**

In [9]:
%timeit -n 20 predict_series(pdf_sample.input, True, False)

1.48 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


In [10]:
%timeit -n 20 predict_series(pdf_sample.input, True, True)

1.91 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
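Outside a notebook, where the `%timeit` magic is not available, a similar measurement can be sketched with the standard-library `timeit` module. The helper name `time_call` and the list-splitting workload are assumptions made for illustration; in practice the timed callable would wrap a `predict_series` call.

```python
import statistics
import timeit


def time_call(fn, number=20, repeat=7):
    """Return (mean, std) in seconds per call over `repeat` runs of `number` calls."""
    totals = timeit.repeat(fn, number=number, repeat=repeat)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


# Hypothetical workload standing in for a prediction call.
sentences = ["encoding problem classic asp"] * 1000
mean_s, std_s = time_call(lambda: [s.split() for s in sentences])
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop")
```

The defaults mirror the `-n 20` loops and 7 runs reported in the cells above, so the printed summary is comparable in shape to the notebook output.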


# Approach 2.1: Pandas UDFs with native inference
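The source of `make_pandas_udf` is not shown in this notebook. As a rough sketch of what such a factory could look like for native inference, the snippet below builds the `pandas.Series` $\rightarrow$ `pandas.Series` function and, separately, wraps it with `pandas_udf`. The names `make_series_predictor`, `to_spark_udf`, and `stub_batch` are hypothetical, and the stub replaces the fastText model.

```python
import pandas as pd


def make_series_predictor(predict_batch, multi_prediction=False):
    """Build a Series -> Series function from a batch predictor (hypothetical)."""

    def predict(series: pd.Series) -> pd.Series:
        # Native inference: one call on the whole batch of sentences.
        batches = predict_batch(series.tolist())
        if multi_prediction:
            # Keep every label; an empty prediction becomes None (null in Spark).
            values = [labels or None for labels in batches]
        else:
            # Keep only the first label, or None when nothing was predicted.
            values = [labels[0] if labels else None for labels in batches]
        return pd.Series(values, index=series.index, dtype=object)

    return predict


def to_spark_udf(predict, multi_prediction=False):
    """Wrap the function as a scalar pandas UDF; requires PySpark."""
    # Imported lazily so the rest of the sketch runs without PySpark installed.
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import ArrayType, StringType

    return_type = ArrayType(StringType()) if multi_prediction else StringType()
    return pandas_udf(predict, returnType=return_type)


def stub_batch(sentences):
    # Hypothetical model: tags a sentence ".net" when it contains the word "net".
    return [[".net"] if "net" in s.split() else [] for s in sentences]


sample = pd.Series(["sending email in net through gmail", "creating my own iterators"])
single = make_series_predictor(stub_batch)(sample)
multi = make_series_predictor(stub_batch, multi_prediction=True)(sample)
```

Keeping the plain pandas function separate from the Spark wrapper makes it possible to test the prediction logic on a local `pandas.Series` before running it on a cluster.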

## Single prediction

In [11]:
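# Native inference, one label per sentence.
udf_predict = make_pandas_udf(multi_prediction=False, rowwise=False)
df_output = df_input.withColumn("category", udf_predict(col("input")))
# Show 10 rows of a reproducible 10% sample without truncating columns.
df_output.sample(False, 0.10, 12345).show(10, truncate=False)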

+----------------------------------------------------------------------------+--------+
|input                                                                       |category|
+----------------------------------------------------------------------------+--------+
|is filestream lazy loaded in net                                            |.net    |
|programmatically launching standalone adobe flashplayer on linux/x11        |null    |
|encoding problem classic asp                                                |asp.net |
|c # winforms datagridview/sql compact negative integer in primary key column|c#      |
|suspending and notifying threads when there is work to do                   |java    |
|creating my own iterators                                                   |c#      |
|css `` see through '' background crazy navigation menu problem              |asp.net |
|sending email in net through gmail                                          |.net    |
|specify ordinals of c++ exporte

## Multiple prediction

In [12]:
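# Native inference, a list of labels per sentence.
udf_predict = make_pandas_udf(multi_prediction=True, rowwise=False)
df_output = df_input.withColumn("category", udf_predict(col("input")))
# Show 10 rows of a reproducible 10% sample without truncating columns.
df_output.sample(False, 0.10, 12345).show(10, truncate=False)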

+----------------------------------------------------------------------------+----------------------+
|input                                                                       |category              |
+----------------------------------------------------------------------------+----------------------+
|is filestream lazy loaded in net                                            |[.net, c#, asp.net]   |
|programmatically launching standalone adobe flashplayer on linux/x11        |null                  |
|encoding problem classic asp                                                |[asp.net, asp.net-mvc]|
|c # winforms datagridview/sql compact negative integer in primary key column|[c#, .net]            |
|suspending and notifying threads when there is work to do                   |[java, c#]            |
|creating my own iterators                                                   |[c#, .net]            |
|css `` see through '' background crazy navigation menu problem              |[asp

# Approach 2.2: Pandas UDFs with rowwise inference (using `pandas.Series.apply` method)

## Single prediction

In [13]:
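# Rowwise inference, one label per sentence.
udf_predict = make_pandas_udf(multi_prediction=False, rowwise=True)
df_output = df_input.withColumn("category", udf_predict(col("input")))
# Show 10 rows of a reproducible 10% sample without truncating columns.
df_output.sample(False, 0.10, 12345).show(10, truncate=False)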

+----------------------------------------------------------------------------+--------+
|input                                                                       |category|
+----------------------------------------------------------------------------+--------+
|is filestream lazy loaded in net                                            |.net    |
|programmatically launching standalone adobe flashplayer on linux/x11        |null    |
|encoding problem classic asp                                                |asp.net |
|c # winforms datagridview/sql compact negative integer in primary key column|c#      |
|suspending and notifying threads when there is work to do                   |java    |
|creating my own iterators                                                   |c#      |
|css `` see through '' background crazy navigation menu problem              |asp.net |
|sending email in net through gmail                                          |.net    |
|specify ordinals of c++ exporte

## Multiple prediction

In [14]:
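# Rowwise inference, a list of labels per sentence.
udf_predict = make_pandas_udf(multi_prediction=True, rowwise=True)
df_output = df_input.withColumn("category", udf_predict(col("input")))
# Show 10 rows of a reproducible 10% sample without truncating columns.
df_output.sample(False, 0.10, 12345).show(10, truncate=False)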

+----------------------------------------------------------------------------+----------------------+
|input                                                                       |category              |
+----------------------------------------------------------------------------+----------------------+
|is filestream lazy loaded in net                                            |[.net, c#, asp.net]   |
|programmatically launching standalone adobe flashplayer on linux/x11        |null                  |
|encoding problem classic asp                                                |[asp.net, asp.net-mvc]|
|c # winforms datagridview/sql compact negative integer in primary key column|[c#, .net]            |
|suspending and notifying threads when there is work to do                   |[java, c#]            |
|creating my own iterators                                                   |[c#, .net]            |
|css `` see through '' background crazy navigation menu problem              |[asp

# Performance comparison: native vs. rowwise
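Because `show` prints a table on every timed iteration, the timing cells in this section emit one table per loop. One way to keep the measurement while discarding the printed output is to redirect stdout during the timed calls. The sketch below uses only the standard library and assumes the printing goes through Python's stdout; `noisy_step` is a hypothetical stand-in for the Spark pipeline, and `timed_quietly` is an illustrative helper name.

```python
import contextlib
import io
import timeit


def noisy_step():
    # Hypothetical stand-in for a pipeline step that prints a table.
    print("+-----+\n|input|\n+-----+")
    return sum(range(1000))


def timed_quietly(fn, number=10):
    """Time `fn` while discarding anything it prints to stdout."""
    with contextlib.redirect_stdout(io.StringIO()):
        total = timeit.timeit(fn, number=number)
    return total / number


seconds_per_loop = timed_quietly(noisy_step)
print(f"{seconds_per_loop * 1e3:.3f} ms per loop")
```

The work done to format the table is still part of the measured time; only the display is suppressed.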

In [15]:
%%timeit -n 10
udf_predict = make_pandas_udf(multi_prediction=True, rowwise=False)
df_output = df_input.withColumn("category", udf_predict(col("input")))
df_output.sample(False, 0.10).show(10)

+--------------------+-----------------+
|               input|         category|
+--------------------+-----------------+
|build tar file fr...|            [php]|
|prevent long word...|        [asp.net]|
|vector shape on s...|             null|
|can sql server ex...|[sql-server, sql]|
|building flex pro...|           [flex]|
|the necessity of ...|            [c++]|
|implementing and ...|            [c++]|
|why learn perl py...| [c++, c, python]|
|post from one con...|             null|
|actionscript3 to ...|     [javascript]|
+--------------------+-----------------+
only showing top 10 rows

+--------------------+--------+
|               input|category|
+--------------------+--------+
|how can i send an...|   [php]|
|mac iwork/pages a...|    null|
|the necessity of ...|   [c++]|
|implementing and ...|   [c++]|
|best way to use a...|  [java]|
|what is the best ...|    null|
|what design patte...|  [java]|
|how would you att...|    [c#]|
|should i provide ...|  [java]|
|image archive v

+--------------------+-----------------+
|               input|         category|
+--------------------+-----------------+
|sql server and th...|[sql-server, sql]|
|how can i send an...|            [php]|
|gantt chart contr...|  [windows, .net]|
|prevent long word...|        [asp.net]|
|python beyond the...|         [python]|
|how to effectivel...|             null|
|programmatically ...|             null|
|implementing and ...|            [c++]|
|why learn perl py...| [c++, c, python]|
|post from one con...|             null|
+--------------------+-----------------+
only showing top 10 rows

+--------------------+--------+
|               input|category|
+--------------------+--------+
|python beyond the...|[python]|
|are incrementers ...|   [c++]|
|building flex pro...|  [flex]|
|the necessity of ...|   [c++]|
|programmatically ...|    null|
|how do you quickl...|    null|
|how can i determi...|  [java]|
|how would you att...|    [c#]|
|decoding chunked ...|    null|
|unix socket imp

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|how can i change ...|[jquery, javascri...|
|how can i send an...|               [php]|
|how to effectivel...|                null|
|programmatically ...|                null|
|what is the best ...|        [javascript]|
|creating my own i...|          [c#, .net]|
|actionscript3 to ...|        [javascript]|
|custom properties...|          [.net, c#]|
|is there an alter...|             [c#, c]|
|what does it mean...|     [c++, java, c#]|
+--------------------+--------------------+
only showing top 10 rows

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|build tar file fr...|               [php]|
|python beyond the...|            [python]|
|carbide / symbian...|               [c++]|
|doctype rss & htm...|[html, asp.net, css]|
|how to return a p...|[sql-server, c#, ...|
|encod

+--------------------+-----------------+
|               input|         category|
+--------------------+-----------------+
|how do you manage...|             [c#]|
|mac iwork/pages a...|             null|
|how to disable vi...|  [visual-studio]|
|while clause in t...|[sql, sql-server]|
|implementing and ...|            [c++]|
|converting svg to...|             [c#]|
|how would you att...|             [c#]|
|image archive vs ...|            [css]|
|numbering regex s...|            [php]|
|drawing a custom ...|             null|
+--------------------+-----------------+
only showing top 10 rows

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|vector shape on s...|                null|
|can sql server ex...|   [sql-server, sql]|
|authoritative sou...|     [.net, php, c#]|
|while clause in t...|   [sql, sql-server]|
|doctype rss & htm...|[html, asp.net, css]|
|how to generate u...|          [java, c#]|
|c #

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|getting odd error...|          [.net, c#]|
|post from one con...|                null|
|how can i create ...|          [c#, .net]|
|encoding problem ...|[asp.net, asp.net...|
|c # winforms data...|          [c#, .net]|
|what is the aspne...|                null|
|what is the best ...|        [javascript]|
|how can i determi...|              [java]|
|what s the term f...|                null|
|stopping msi from...|          [.net, c#]|
+--------------------+--------------------+
only showing top 10 rows

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|making an image g...|                null|
|how do i add cust...|                null|
|how does tracerou...|          [c#, .net]|
|can sql server ex...|   [sql-server, sql]|
|doctype rss & htm...|[html, asp.net, css]|
|db si

+--------------------+-------------------+
|               input|           category|
+--------------------+-------------------+
|sql server and th...|  [sql-server, sql]|
|c the definitive ...|            [c, c#]|
|is filestream laz...|[.net, c#, asp.net]|
|python beyond the...|           [python]|
|how does tracerou...|         [c#, .net]|
|how to implement ...|               [c#]|
|post from one con...|               null|
|converting svg to...|               [c#]|
|is it possible to...|         [c#, .net]|
|eclipse text comp...|    [eclipse, java]|
+--------------------+-------------------+
only showing top 10 rows

466 ms ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [16]:
%%timeit -n 10
udf_predict = make_pandas_udf(multi_prediction=True, rowwise=True)
df_output = df_input.withColumn("category", udf_predict(col("input")))
df_output.sample(False, 0.10).show(10)

+--------------------+-----------------+
|               input|         category|
+--------------------+-----------------+
|are incrementers ...|            [c++]|
|ms sql 2000 turn ...|[sql-server, sql]|
|implementing and ...|            [c++]|
|db side encryptio...|             null|
|c # winforms data...|       [c#, .net]|
|eclipse text comp...|  [eclipse, java]|
|should i provide ...|           [java]|
|image archive vs ...|            [css]|
|parsing t sql to ...|[sql, sql-server]|
|rendered pixel wi...|             null|
+--------------------+-----------------+
only showing top 10 rows

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|how to consume js...|              [.net]|
|how do you deal w...|           [asp.net]|
|best update metho...|[mysql, sql, data...|
|is it possible to...|          [c#, .net]|
|suspending and no...|          [java, c#]|
|what is the best ...|                null|
|how

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|prevent long word...|           [asp.net]|
|whats the best wa...|          [c#, .net]|
|python beyond the...|            [python]|
|how do i add cust...|                null|
|how to pass an un...|                null|
|how to disable vi...|     [visual-studio]|
|while clause in t...|   [sql, sql-server]|
|c # lambda expres...|                [c#]|
|db side encryptio...|                null|
|encoding problem ...|[asp.net, asp.net...|
+--------------------+--------------------+
only showing top 10 rows

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|how to effectivel...|                null|
|is it possible to...|          [c#, .net]|
|image archive vs ...|               [css]|
|css `` see throug...|           [asp.net]|
|what are the limi...|[ruby, ruby-on-ra...|
|drawi

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|are incrementers ...|               [c++]|
|is there a way to...|                null|
|percentages of su...|         [c++, java]|
|ant and the avail...|              [java]|
|vertical text wit...|[jquery, javascript]|
|creating my own i...|          [c#, .net]|
|class methods as ...|        [javascript]|
|casting array of ...|             [c#, c]|
|thotkey with win ...|                [c#]|
|is there an alter...|             [c#, c]|
+--------------------+--------------------+
only showing top 10 rows

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|how to pass an un...|                null|
|building flex pro...|              [flex]|
|best update metho...|[mysql, sql, data...|
|the necessity of ...|               [c++]|
|authoritative sou...|     [.net, php, c#]|
|c # l

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|sql server and th...|   [sql-server, sql]|
|build tar file fr...|               [php]|
|mac iwork/pages a...|                null|
|best update metho...|[mysql, sql, data...|
|how to disable vi...|     [visual-studio]|
|the necessity of ...|               [c++]|
|while clause in t...|   [sql, sql-server]|
|how can i create ...|          [c#, .net]|
|encoding problem ...|[asp.net, asp.net...|
|daemon threads ex...|                null|
+--------------------+--------------------+
only showing top 10 rows

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|transforming sele...|                null|
|how can i change ...|[jquery, javascri...|
|c the definitive ...|             [c, c#]|
|python beyond the...|            [python]|
|how do you deal w...|           [asp.net]|
|how t

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|game programming ...|                null|
|carbide / symbian...|               [c++]|
|best update metho...|[mysql, sql, data...|
|sql query count w...|[sql, sql-server,...|
|db side encryptio...|                null|
|eclipse text comp...|     [eclipse, java]|
|can you use an al...|             [mysql]|
|ant and the avail...|              [java]|
|setting the heigh...|                [c#]|
|css `` see throug...|           [asp.net]|
+--------------------+--------------------+
only showing top 10 rows

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|how do you deal w...|           [asp.net]|
|vector shape on s...|                null|
|can sql server ex...|   [sql-server, sql]|
|best update metho...|[mysql, sql, data...|
|c # lambda expres...|                [c#]|
|is it