In [1]:
import sys
sys.path.insert(0, "..")

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import pandas as pd

In [3]:
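spark = SparkSession.builder.master("local[4]").appName('mysession').getOrCreate()
# Enable Arrow-based data transfer between Spark and pandas.
# On Spark 3.0+ this key is deprecated in favor of "spark.sql.execution.arrow.pyspark.enabled".
spark.conf.set("spark.sql.execution.arrow.enabled", "true")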

In [4]:
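# Make the fastText model file available on every Spark node
spark.sparkContext.addFile('../models/ft_tuned.ftz')
# Ship the classifier module so that tasks can import it
spark.sparkContext.addPyFile('../classifier.py')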

In [5]:
from classifier import predict_series, make_pandas_udf




## Build Spark DataFrame

In [6]:
df_input = spark.read.parquet('../data/input.parquet')

### Take Pandas Sample

In [7]:
pdf_sample = df_input.sample(withReplacement=False, fraction=0.10, seed=12345).toPandas()
pdf_sample.head(5)

|   | input |
|---|-------|
| 0 | is filestream lazy loaded in net |
| 1 | programmatically launching standalone adobe fl... |
| 2 | encoding problem classic asp |
| 3 | c # winforms datagridview/sql compact negative... |
| 4 | suspending and notifying threads when there is... |


### Test `predict_series` function

To define a UDF with `pandas_udf`, the wrapped Python function must map `pandas.Series` $\rightarrow$ `pandas.Series`. In our case, there are two ways to get a fastText prediction for each sentence in a `pandas.Series` object.

Given the input `pandas.Series`, the options are:

1. Use the `pandas.Series.apply` method to apply the `classifier.get_predictions` function to each sentence, which returns a `pandas.Series`. Under the hood, `pandas.Series.apply` is not vectorized; it loops over the elements of the `pandas.Series` one by one (we call this the **rowwise** way). Select it with `rowwise=True` in the `predict_series` function.
2. Convert the `pandas.Series` to a `list` of `str` elements, pass it to fastText's own method for predicting multiple sentences at once (we call this the **native** way), and convert the result back into a `pandas.Series`. Select it with `rowwise=False` in the `predict_series` function.

We don't know a priori which one performs better, so we test both.
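The two options can be sketched with plain pandas as follows. `StubModel` is a hypothetical stand-in, not fastText's API or the code in `classifier.py`; it only mimics the idea of a `predict` method that accepts either one sentence or a list of sentences. The helper names `predict_rowwise` and `predict_native` are likewise assumptions made for illustration.

```python
import pandas as pd


class StubModel:
    """Hypothetical stand-in for a text classifier (not the real fastText model)."""

    def predict(self, text):
        # Accept either a single string or a list of strings.
        # The "label" is simply the first token of the sentence.
        if isinstance(text, list):
            return [self.predict(t) for t in text]
        tokens = text.split()
        return tokens[0] if tokens else None


def predict_rowwise(series, model):
    # Option 1: Series.apply calls model.predict once per element.
    return series.apply(model.predict)


def predict_native(series, model):
    # Option 2: convert to a list of str, predict everything in one call,
    # and rebuild a Series that keeps the original index.
    labels = model.predict(series.tolist())
    return pd.Series(labels, index=series.index)


sentences = pd.Series(["python beyond the basics", "sql delete suspended"])
model = StubModel()
rowwise_result = predict_rowwise(sentences, model)
native_result = predict_native(sentences, model)
```

Both helpers return a Series of the same length as the input, which a Series-to-Series pandas UDF requires.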

**Single prediction, native inference**

In [8]:
# Flags, in order: multiple prediction = False, rowwise inference = False
predict_series(pdf_sample.input,False,False).head(5).to_frame()

|   | 0 |
|---|---|
| 0 | .net |
| 1 |  |
| 2 | asp.net |
| 3 | c# |
| 4 | java |


**Multiple prediction, native inference**

In [9]:
# Flags, in order: multiple prediction = True, rowwise inference = False
predict_series(pdf_sample.input,True,False).head(5).to_frame()

|   | 0 |
|---|---|
| 0 | [.net, c#, asp.net] |
| 1 |  |
| 2 | [asp.net, asp.net-mvc] |
| 3 | [c#, .net] |
| 4 | [java, c#] |


**Single prediction, rowwise inference**

In [10]:
# Flags, in order: multiple prediction = False, rowwise inference = True
predict_series(pdf_sample.input,False,True).head(5).to_frame()

|   | input |
|---|-------|
| 0 | .net |
| 1 |  |
| 2 | asp.net |
| 3 | c# |
| 4 | java |


**Multiple prediction, rowwise inference**

In [11]:
# Flags, in order: multiple prediction = True, rowwise inference = True
predict_series(pdf_sample.input,True,True).head(5).to_frame()

|   | input |
|---|-------|
| 0 | [.net, c#, asp.net] |
| 1 |  |
| 2 | [asp.net, asp.net-mvc] |
| 3 | [c#, .net] |
| 4 | [java, c#] |


**Performance comparison: native vs. rowwise**

In [12]:
%timeit -n 20 predict_series(pdf_sample.input,True,False)

1.65 ms ± 84.4 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


In [13]:
%timeit -n 20 predict_series(pdf_sample.input,True,True)

1.99 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
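Outside IPython, a measurement of the same shape ("7 runs, 20 loops each") can be approximated with the standard library's `timeit.repeat`. The sketch below is only an illustration: `work` is a hypothetical stand-in for the `predict_series` call, and the population standard deviation is an assumption about how the spread is summarized, so the figures need not match what `%timeit` reports.

```python
import statistics
import timeit


def work():
    # Hypothetical stand-in for predict_series(pdf_sample.input, True, False).
    return sum(i * i for i in range(1000))


LOOPS = 20  # corresponds to "-n 20"
RUNS = 7    # corresponds to "7 runs"

# timeit.repeat returns one total time (in seconds) per run of LOOPS calls.
totals = timeit.repeat(work, number=LOOPS, repeat=RUNS)

# Convert each run's total into a per-loop time before summarizing.
per_loop = [total / LOOPS for total in totals]
mean_s = statistics.mean(per_loop)
std_s = statistics.pstdev(per_loop)

print(f"{mean_s * 1e6:.1f} µs ± {std_s * 1e6:.1f} µs per loop "
      f"(mean ± std. dev. of {RUNS} runs, {LOOPS} loops each)")
```

Replacing `work` with a call to each `predict_series` variant gives two summaries that can be compared directly; a gap smaller than the reported spread is not strong evidence that one variant is faster.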


# Approach 2.1: Pandas UDFs with native inference

## Single prediction

In [14]:
udf_predict = make_pandas_udf(multi_prediction=False,rowwise=False)
df_output = df_input.withColumn("category",udf_predict(col("input")))
df_output.sample(False,.10,12345).show(10,False)

+----------------------------------------------------------------------------+--------+
|input                                                                       |category|
+----------------------------------------------------------------------------+--------+
|is filestream lazy loaded in net                                            |.net    |
|programmatically launching standalone adobe flashplayer on linux/x11        |null    |
|encoding problem classic asp                                                |asp.net |
|c # winforms datagridview/sql compact negative integer in primary key column|c#      |
|suspending and notifying threads when there is work to do                   |java    |
|creating my own iterators                                                   |c#      |
|css `` see through '' background crazy navigation menu problem              |asp.net |
|sending email in net through gmail                                          |.net    |
|specify ordinals of c++ exporte

## Multiple prediction

In [15]:
udf_predict = make_pandas_udf(multi_prediction=True,rowwise=False)
df_output = df_input.withColumn("category",udf_predict(col("input")))
df_output.sample(False,.10,12345).show(10,False)

+----------------------------------------------------------------------------+----------------------+
|input                                                                       |category              |
+----------------------------------------------------------------------------+----------------------+
|is filestream lazy loaded in net                                            |[.net, c#, asp.net]   |
|programmatically launching standalone adobe flashplayer on linux/x11        |null                  |
|encoding problem classic asp                                                |[asp.net, asp.net-mvc]|
|c # winforms datagridview/sql compact negative integer in primary key column|[c#, .net]            |
|suspending and notifying threads when there is work to do                   |[java, c#]            |
|creating my own iterators                                                   |[c#, .net]            |
|css `` see through '' background crazy navigation menu problem              |[asp

# Approach 2.2: Pandas UDFs with rowwise inference (using `pandas.Series.apply` method)

## Single prediction

In [16]:
udf_predict = make_pandas_udf(multi_prediction=False,rowwise=True)
df_output = df_input.withColumn("category",udf_predict(col("input")))
df_output.sample(False,.10,12345).show(10,False)

+----------------------------------------------------------------------------+--------+
|input                                                                       |category|
+----------------------------------------------------------------------------+--------+
|is filestream lazy loaded in net                                            |.net    |
|programmatically launching standalone adobe flashplayer on linux/x11        |null    |
|encoding problem classic asp                                                |asp.net |
|c # winforms datagridview/sql compact negative integer in primary key column|c#      |
|suspending and notifying threads when there is work to do                   |java    |
|creating my own iterators                                                   |c#      |
|css `` see through '' background crazy navigation menu problem              |asp.net |
|sending email in net through gmail                                          |.net    |
|specify ordinals of c++ exporte

## Multiple prediction

In [17]:
udf_predict = make_pandas_udf(multi_prediction=True,rowwise=True)
df_output = df_input.withColumn("category",udf_predict(col("input")))
df_output.sample(False,.10,12345).show(10,False)

+----------------------------------------------------------------------------+----------------------+
|input                                                                       |category              |
+----------------------------------------------------------------------------+----------------------+
|is filestream lazy loaded in net                                            |[.net, c#, asp.net]   |
|programmatically launching standalone adobe flashplayer on linux/x11        |null                  |
|encoding problem classic asp                                                |[asp.net, asp.net-mvc]|
|c # winforms datagridview/sql compact negative integer in primary key column|[c#, .net]            |
|suspending and notifying threads when there is work to do                   |[java, c#]            |
|creating my own iterators                                                   |[c#, .net]            |
|css `` see through '' background crazy navigation menu problem              |[asp

# Performance comparison: native vs. rowwise

In [18]:
%%timeit -n 10 
udf_predict = make_pandas_udf(multi_prediction=True,rowwise=False)
df_output = df_input.withColumn("category",udf_predict(col("input")))
df_output.sample(False,.10).show(10)

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|python beyond the...|            [python]|
|how to effectivel...|                null|
|is it possible to...|          [c#, .net]|
|how do i extract ...|                null|
|what is the best ...|        [javascript]|
|is it true that t...|          [.net, c#]|
|what are the limi...|[ruby, ruby-on-ra...|
|mvc preview 4 no ...|                null|
|sql delete suspen...|   [sql, sql-server]|
|how do i fix a ne...|              [java]|
+--------------------+--------------------+
only showing top 10 rows


In [19]:
%%timeit -n 10 
udf_predict = make_pandas_udf(multi_prediction=True,rowwise=True)
df_output = df_input.withColumn("category",udf_predict(col("input")))
df_output.sample(False,.10).show(10)

+--------------------+--------------------+
|               input|            category|
+--------------------+--------------------+
|is filestream laz...| [.net, c#, asp.net]|
|sql query count w...|[sql, sql-server,...|
|eclipse text comp...|     [eclipse, java]|
|how to serialize ...|                [c#]|
|how do i calculat...|                null|
|what is the simpl...|            [python]|
|  using lists in c #|                [c#]|
|can you set or wh...|                [c#]|
|why is app_offlin...|          [.net, c#]|
|how can i convert...|                [c#]|
+--------------------+--------------------+
only showing top 10 rows

420 ms ± 6.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
