In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws
from classifier import make_udf, predict_serie, make_pandas_udf
import pandas as pd




Enable Arrow-based columnar data transfers.

In [2]:
spark = SparkSession.builder.appName('mysession').getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

To distribute the model file and the required Python module to the worker nodes, we would do something like:

```python
spark.sparkContext.addFile('models/ft_tuned.ftz')
spark.sparkContext.addPyFile('./classifier.py')
```
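Once a file has been shipped with `addFile`, code running on a worker has to look up its local copy. Below is a minimal sketch of how that lookup might be done with `SparkFiles.get`; the helper name `resolve_distributed_file` and the fallback to a local `models` directory are my own assumptions, not part of `classifier.py`.

```python
import os


def resolve_distributed_file(filename, local_dir="models"):
    """Return a usable path to a file shipped with SparkContext.addFile.

    Hypothetical helper: falls back to a local path when Spark is unavailable.
    """
    try:
        # Imported lazily so the helper also works where pyspark is not installed
        from pyspark import SparkFiles
        candidate = SparkFiles.get(filename)
        if os.path.exists(candidate):
            return candidate
    except Exception:
        # pyspark is missing, or there is no active SparkContext in this process
        pass
    return os.path.join(local_dir, filename)


# On a worker this would point at the copy distributed by addFile;
# on a machine without Spark it points at models/ft_tuned.ftz
model_path = resolve_distributed_file("ft_tuned.ftz")
```

Note that `SparkFiles.get` takes the bare file name, not the original `models/...` path passed to `addFile`.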

# Generating Test File

Both `data/test` and `data/spark_input` are present; the latter is just the unlabeled version of the former.

For the curious, here is how I converted one file into the other using Python generators.

This is useful when we have a labeled dataset and want to compare the predictions with the true labels in order to compute performance metrics. In that case, we first strip the labels from the dataset and then run inference on the result. In practice, however, inference is usually run on data that is unlabeled to begin with.
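As a rough sketch of the evaluation side of this workflow, the snippet below keeps the true labels while stripping them from the text, then scores a list of top-1 predictions against them. The helper names and the `predicted` values are illustrative assumptions; in a real run the predictions would come from the model.

```python
def split_labels_and_text(line, prefix="__label__"):
    """Split a fastText-formatted line into (labels, text)."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(prefix):
            labels.append(token[len(prefix):])
        else:
            words.append(token)
    return labels, " ".join(words)


def precision_at_1(true_labels, predicted):
    """Fraction of instances whose top prediction is one of the true labels."""
    if not predicted:
        return 0.0
    hits = sum(1 for labels, pred in zip(true_labels, predicted) if pred in labels)
    return hits / len(predicted)


labeled = [
    "__label__php __label__image making an image greyscale with gd library",
    "__label__eclipse transforming selected text with a hotkey",
    "__label__sql-server sql server and the guest account what is this for",
]
parsed = [split_labels_and_text(line) for line in labeled]
true_labels = [labels for labels, _ in parsed]
texts = [text for _, text in parsed]

# Illustrative top-1 predictions, one per line above (assumed, not computed here)
predicted = ["c#", "c#", "sql-server"]
score = precision_at_1(true_labels, predicted)  # 1 hit out of 3
```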

In [3]:
! head -10 data/test

__label__php __label__image making an image greyscale with gd library
__label__eclipse transforming selected text with a hotkey
__label__sql-server sql server and the guest account what is this for
__label__jquery __label__html how can i change html attribute names with jquery
__label__php __label__ajax how can i send an array to php through ajax
__label__c __label__cocoa c the definitive truth about rand random and arc4random
__label__winforms gantt chart controls on windows forms
__label__php __label__linux build tar file from directory in php without exec/passthru
__label__javascript __label__ajax how do you manage infragistics webgrid data from javascript/ajax code
__label__wcf how to consume json web services from a windows client


```python
def keep_sentence_field(line):
    '''
    Function to keep only the text input given a labeled instance with fastText format.
    Example
    Input:
    '__label__python __label__django help with unit testing in a python app using django'
    Output:
    'help with unit testing in a python app using django'
    '''
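    # Drop only tokens that start with the label prefix, so that ordinary words
    # which merely contain "__label__" are kept
    words = [x for x in line.split() if not x.startswith("__label__")]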
    output = ' '.join(words)
    return output

# Locations of the input (labeled) and output (unlabeled) files
inputFile = 'data/test'
outputFile = 'data/spark_input'

# Stream the input line by line, keep only the sentence field, and write it out.
# The generator processes one line at a time, so the file is never fully loaded.
with open(inputFile, encoding="ISO-8859-1") as infile, open(outputFile, 'w') as outfile:
    sentences = (keep_sentence_field(line) for line in infile)
    for sentence in sentences:
        outfile.write(sentence + '\n')
```

In [4]:
! head -10 data/spark_input

making an image greyscale with gd library
transforming selected text with a hotkey
sql server and the guest account what is this for
how can i change html attribute names with jquery
how can i send an array to php through ajax
c the definitive truth about rand random and arc4random
gantt chart controls on windows forms
build tar file from directory in php without exec/passthru
how do you manage infragistics webgrid data from javascript/ajax code
how to consume json web services from a windows client


# Build Spark DataFrame

Let's build a Spark DataFrame from the unlabeled text file. In practice, we might read a Parquet file instead.

In [5]:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([StructField("input", StringType())])

df_input = spark.read.csv('data/spark_input', header=False, schema=schema)

# Approach 1: Standard UDF
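The `make_udf` helper comes from `classifier.py`, which is not shown in this notebook. The sketch below is only my guess at what such a factory might look like, assuming a fastText model whose `predict(text, k=...)` returns `(labels, probabilities)`. The names `make_predictor`, `make_udf_sketch`, the `loader` parameter and `StubModel` are all hypothetical. The model is loaded on the first call rather than captured in the closure, so that each worker loads its own copy.

```python
def make_predictor(model_path, multi_prediction=False, k=3, loader=None):
    """Return a plain function text -> label (or list of k labels)."""
    state = {}  # filled on first call, i.e. once per worker process

    def predict(text):
        if "model" not in state:
            if loader is None:
                import fasttext  # only needed when no custom loader is given
                state["model"] = fasttext.load_model(model_path)
            else:
                state["model"] = loader(model_path)
        n_labels = k if multi_prediction else 1
        labels, _probabilities = state["model"].predict(text, k=n_labels)
        cleaned = [label.replace("__label__", "", 1) for label in labels]
        return cleaned if multi_prediction else cleaned[0]

    return predict


def make_udf_sketch(model_path, multi_prediction=False):
    """Wrap the row-at-a-time predictor as a standard Spark UDF."""
    from pyspark.sql.functions import udf  # imported lazily; requires pyspark
    from pyspark.sql.types import ArrayType, StringType
    return_type = ArrayType(StringType()) if multi_prediction else StringType()
    return udf(make_predictor(model_path, multi_prediction), return_type)


class StubModel:
    """Stand-in for a trained model so the predictor can be tried without fastText."""

    def predict(self, text, k=1):
        ranked = ("__label__php", "__label__html", "__label__jquery")
        return ranked[:k], [0.6, 0.3, 0.1][:k]


question = "how can i send an array to php through ajax"
single = make_predictor("unused-path", loader=lambda path: StubModel())
multi = make_predictor("unused-path", multi_prediction=True, loader=lambda path: StubModel())
single_result = single(question)  # a single label string
multi_result = multi(question)    # a list of k label strings
```

A standard UDF like this is called once per row, which is the overhead the Pandas UDF approach tries to reduce.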

## Single prediction

In [6]:
udf_predict = make_udf(multi_prediction=False)
df_output = df_input.withColumn("category",udf_predict(col("input")))
%timeit -n 10 df_output.sample(False,.10).show(10,False)

+--------------------------------------------------------------------------------------------------------+----------+
|input                                                                                                   |category  |
+--------------------------------------------------------------------------------------------------------+----------+
|making an image greyscale with gd library                                                               |c#        |
|transforming selected text with a hotkey                                                                |c#        |
|sql server and the guest account what is this for                                                       |sql-server|
|how do you deal with connection strings when deploying an asp net site                                  |asp.net   |
|can sql server express be used to effectively administrate a sql server standard/enterprise installation|sql-server|
|c # lambda expressions or delegates as a properties or 

+--------------------------------------------------------------------------------------+-------------+
|input                                                                                 |category     |
+--------------------------------------------------------------------------------------+-------------+
|build tar file from directory in php without exec/passthru                            |php          |
|prevent long word to add horizontal scroll to html view                               |asp.net      |
|how do you deal with connection strings when deploying an asp net site                |asp.net      |
|how to pass an unpersisted modified object from view back to controller without a form|c#           |
|how do you quickly find the url for a win32 api on msdn                               |python       |
|how can i create prototype methods like javascript in c # net                         |c#           |
|daemon threads explanation                                              

+-----------------------------------------------------------------------------------+----------+
|input                                                                              |category  |
+-----------------------------------------------------------------------------------+----------+
|sql server and the guest account what is this for                                  |sql-server|
|how can i change html attribute names with jquery                                  |jquery    |
|gantt chart controls on windows forms                                              |windows   |
|should i have one class for every database i use                                   |java      |
|getting odd error on net executenonquery                                           |.net      |
|best update method for mysql db                                                    |mysql     |
|authoritative source on xml sig                                                    |.net      |
|what is the best way to see w

+-------------------------------------------------------------------------+-----------+
|input                                                                    |category   |
+-------------------------------------------------------------------------+-----------+
|is filestream lazy loaded in net                                         |.net       |
|how can i get a list of available wireless networks on linux             |python     |
|daemon threads explanation                                               |c#         |
|is there a way to asynchronously filter an ilist                         |python     |
|how to serialize an object to xml without getting xmlns `` ``            |c#         |
|what is the simplest way to find the difference between 2 times in python|python     |
|can you set or where is the local document root                          |c#         |
|html over flash without stopping interaction with flash                  |asp.net-mvc|
|uiwebview within a scrollview d

+--------------------------------------------------------------+----------+
|input                                                         |category  |
+--------------------------------------------------------------+----------+
|how can i send an array to php through ajax                   |php       |
|how do i add custom column to existing wss list template      |c#        |
|how to effectively implement sessions in gae                  |ruby      |
|c # in linux environment                                      |c#        |
|daemon threads explanation                                    |c#        |
|creating my own iterators                                     |c#        |
|css `` see through '' background crazy navigation menu problem|asp.net   |
|sending email in net through gmail                            |.net      |
|actionscript3 to javascript communication best practices      |javascript|
|best way to manipulate pages while embedding webkit           |java      |
+-----------

+-----------------------------------------------------------------------------------+----------+
|input                                                                              |category  |
+-----------------------------------------------------------------------------------+----------+
|authoritative source on xml sig                                                    |.net      |
|ms sql 2000 turn off logging during stored procedure                               |sql-server|
|how can i get a list of available wireless networks on linux                       |python    |
|how do you quickly find the url for a win32 api on msdn                            |python    |
|c # in linux environment                                                           |c#        |
|how do i extract the version and path from an svn working copy into a nant variable|java      |
|drawing a custom label on a pie chart in yahoo s flash library astra               |ruby      |
|in c # or any language what i

+-------------------------------------------------------------------------------+----------+
|input                                                                          |category  |
+-------------------------------------------------------------------------------+----------+
|how do i add custom column to existing wss list template                       |c#        |
|best update method for mysql db                                                |mysql     |
|c # in linux environment                                                       |c#        |
|how do i call net code c # /vb net from vbscript                               |c#        |
|is it possible to define in a dependent dll s application config               |c#        |
|what is the best way to determine the number of days in a month with javascript|javascript|
|creating my own iterators                                                      |c#        |
|how do i focus a foreign window                                      

+----------------------------------------------------------------+----------+
|input                                                           |category  |
+----------------------------------------------------------------+----------+
|sql server and the guest account what is this for               |sql-server|
|how to consume json web services from a windows client          |.net      |
|game programming and event handlers                             |c#        |
|implementing and enforcing coding standards                     |c++       |
|c # lambda expressions or delegates as a properties or arguments|c#        |
|converting svg to png using c #                                 |c#        |
|be notified when visual/logical child added/removed             |c#        |
|css `` see through '' background crazy navigation menu problem  |asp.net   |
|sending email in net through gmail                              |.net      |
|how can i ban a whole company from my web site                 

+----------------------------------------------------------------------------------------------------------------+-----------+
|input                                                                                                           |category   |
+----------------------------------------------------------------------------------------------------------------+-----------+
|carbide / symbian c++ change application icon                                                                   |c++        |
|authoritative source on xml sig                                                                                 |.net       |
|ms sql 2000 turn off logging during stored procedure                                                            |sql-server |
|post from one controller action to another not redirect                                                         |asp.net    |
|db side encryption via nhibernate                                                                             

+----------------------------------------------------------------+--------+
|input                                                           |category|
+----------------------------------------------------------------+--------+
|transforming selected text with a hotkey                        |c#      |
|how can i send an array to php through ajax                     |php     |
|c the definitive truth about rand random and arc4random         |c       |
|build tar file from directory in php without exec/passthru      |php     |
|how does traceroute work                                        |c#      |
|getting odd error on net executenonquery                        |.net    |
|are incrementers / decrementers var++ var etc thread safe       |c++     |
|how can i get a list of available wireless networks on linux    |python  |
|c # lambda expressions or delegates as a properties or arguments|c#      |
|db side encryption via nhibernate                               |wcf     |
+-----------

## Multi prediction

In [7]:
udf_predict = make_udf(multi_prediction=True)
df_output = df_input.withColumn("category",udf_predict(col("input")))
%timeit -n 10 df_output.sample(False,.10).show(10,False)

+---------------------------------------------------------------+---------------------------+
|input                                                          |category                   |
+---------------------------------------------------------------+---------------------------+
|gantt chart controls on windows forms                          |[windows, .net, c#]        |
|python beyond the basics                                       |[python, c++, windows]     |
|how does traceroute work                                       |[c#, .net, java]           |
|while clause in t sql that loops forever                       |[sql, sql-server, mysql]   |
|how can i create prototype methods like javascript in c # net  |[c#, .net, asp.net]        |
|what is the aspnet_client folder for under the iis structure   |[.net, c#, c++]            |
|what is the best way to see what files are locked in subversion|[java, c#, .net]           |
|is there a way to asynchronously filter an ilist           

+--------------------------------------------------------------+-------------------------+
|input                                                         |category                 |
+--------------------------------------------------------------+-------------------------+
|prevent long word to add horizontal scroll to html view       |[asp.net, .net, c#]      |
|how to disable visual studio macro `` tip '' balloon          |[visual-studio, .net, c#]|
|how can i get a list of available wireless networks on linux  |[python, c#, .net]       |
|how to return a page of results from sql                      |[sql-server, c#, sql]    |
|c # in linux environment                                      |[c#, c, winforms]        |
|how do i call net code c # /vb net from vbscript              |[c#, .net, asp.net]      |
|eclipse hide paths in the `` open resource '' dialog          |[java, c++, c]           |
|percentages of subtotal in a report                           |[c++, java, c#]          |

+--------------------------------------------------------------------------------------------------------+----------------------------------+
|input                                                                                                   |category                          |
+--------------------------------------------------------------------------------------------------------+----------------------------------+
|transforming selected text with a hotkey                                                                |[c#, python, asp.net]             |
|vector shape on stage appears over dynamic textfield                                                    |[c++, c#, .net]                   |
|should i have one class for every database i use                                                        |[java, c#, sql-server]            |
|can sql server express be used to effectively administrate a sql server standard/enterprise installation|[sql-server, sql, sql-server-2005]|
|are i

+-------------------------------------------------------------------------------+----------------------------------+
|input                                                                          |category                          |
+-------------------------------------------------------------------------------+----------------------------------+
|how do you manage infragistics webgrid data from javascript/ajax code          |[c#, java, .net]                  |
|how to effectively implement sessions in gae                                   |[ruby, python, mysql]             |
|sql query count with 0 count                                                   |[sql, sql-server, sql-server-2005]|
|doctype rss & html entities                                                    |[html, asp.net, css]              |
|c # lambda expressions or delegates as a properties or arguments               |[c#, .net, asp.net]               |
|db side encryption via nhibernate                              

+-------------------------------------------------------------------------+----------------------------------+
|input                                                                    |category                          |
+-------------------------------------------------------------------------+----------------------------------+
|sql server and the guest account what is this for                        |[sql-server, sql, database]       |
|getting odd error on net executenonquery                                 |[.net, c#, asp.net]               |
|sql query count with 0 count                                             |[sql, sql-server, sql-server-2005]|
|should db layer members be static or instance                            |[java, asp.net, c#]               |
|what is the simplest way to find the difference between 2 times in python|[python, javascript, c++]         |
|why is app_offline failing to work as soon as you it starts loading dlls |[.net, c#, c++]                   |
|

+-------------------------------------------------------------------------------+---------------------------+
|input                                                                          |category                   |
+-------------------------------------------------------------------------------+---------------------------+
|prevent long word to add horizontal scroll to html view                        |[asp.net, .net, c#]        |
|how to effectively implement sessions in gae                                   |[ruby, python, mysql]      |
|should i have one class for every database i use                               |[java, c#, sql-server]     |
|best update method for mysql db                                                |[mysql, sql, database]     |
|how to implement a singleton in c #                                            |[c#, .net, c]              |
|converting svg to png using c #                                                |[c#, .net, winforms]       |
|encoding 

+----------------------------------------------------------------------+----------------------------+
|input                                                                 |category                    |
+----------------------------------------------------------------------+----------------------------+
|how can i send an array to php through ajax                           |[php, html, jquery]         |
|is filestream lazy loaded in net                                      |[.net, c#, asp.net]         |
|how do you deal with connection strings when deploying an asp net site|[asp.net, c#, asp.net-mvc]  |
|should i have one class for every database i use                      |[java, c#, sql-server]      |
|eclipse hide paths in the `` open resource '' dialog                  |[java, c++, c]              |
|ant and the available task what if something is not available         |[java, .net, c#]            |
|how to serialize an object to xml without getting xmlns `` ``         |[c#, .net,

+----------------------------------------------------------------------------------------------+-----------------------------+
|input                                                                                         |category                     |
+----------------------------------------------------------------------------------------------+-----------------------------+
|whats the best way to start using mylyn                                                       |[c#, .net, javascript]       |
|vector shape on stage appears over dynamic textfield                                          |[c++, c#, .net]              |
|game programming and event handlers                                                           |[c#, .net, asp.net]          |
|getting odd error on net executenonquery                                                      |[.net, c#, asp.net]          |
|why learn perl python ruby if the company is using c++ c # or java as the application language|[c++, c, python

+--------------------------------------------------------------------------------------------------------+----------------------------------+
|input                                                                                                   |category                          |
+--------------------------------------------------------------------------------------------------------+----------------------------------+
|transforming selected text with a hotkey                                                                |[c#, python, asp.net]             |
|can sql server express be used to effectively administrate a sql server standard/enterprise installation|[sql-server, sql, sql-server-2005]|
|db side encryption via nhibernate                                                                       |[wcf, flash, linq-to-sql]         |
|eclipse hide paths in the `` open resource '' dialog                                                    |[java, c++, c]                    |
|how d

+---------------------------------------------------------------------------------------------------------------------------+----------------------------------+
|input                                                                                                                      |category                          |
+---------------------------------------------------------------------------------------------------------------------------+----------------------------------+
|best way to use a db table as a message/job queue                                                                          |[java, sql, sql-server]           |
|c # winforms datagridview/sql compact negative integer in primary key column                                               |[c#, .net, winforms]              |
|class methods as event handlers in javascript                                                                              |[javascript, html, asp.net]       |
|parsing t sql to parameterize a q

+--------------------------------------------------------------------+-----------------------+
|input                                                               |category               |
+--------------------------------------------------------------------+-----------------------+
|c the definitive truth about rand random and arc4random             |[c, c#, c++]           |
|how to consume json web services from a windows client              |[.net, windows, c#]    |
|is filestream lazy loaded in net                                    |[.net, c#, asp.net]    |
|mac iwork/pages automation                                          |[flash, wcf, flex]     |
|programmatically launching standalone adobe flashplayer on linux/x11|[windows, .net, c#]    |
|post from one controller action to another not redirect             |[asp.net, html, css]   |
|best way to use a db table as a message/job queue                   |[java, sql, sql-server]|
|how do i call net code c # /vb net from vbscript 

+------------------------------------------------------------+-----------------------------+
|input                                                       |category                     |
+------------------------------------------------------------+-----------------------------+
|vector shape on stage appears over dynamic textfield        |[c++, c#, .net]              |
|should i have one class for every database i use            |[java, c#, sql-server]       |
|mac iwork/pages automation                                  |[flash, wcf, flex]           |
|how can i get a list of available wireless networks on linux|[python, c#, .net]           |
|c # in linux environment                                    |[c#, c, winforms]            |
|suspending and notifying threads when there is work to do   |[java, c#, .net]             |
|is there a way to asynchronously filter an ilist            |[python, java, .net]         |
|what design pattern to use for user authentication in java  |[java, e

# Approach 2: Pandas UDF (via PyArrow)

Note: We need to use `pyarrow==0.14.1`. See [this Stack Overflow question](https://stackoverflow.com/questions/58878848/java-lang-illegalargumentexception-when-applying-a-python-udf-to-a-spark-datafra).

Let's first test the `predict_serie` function (which is used in `make_pandas_udf`). It maps
`pandas.Series` $\rightarrow$ `pandas.Series`, which is why the `pandas_udf` built on top of it can be vectorized (that is, executed in chunks).
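To make the `pandas.Series` $\rightarrow$ `pandas.Series` contract concrete, here is a minimal sketch of what a function like `predict_serie` might do internally. It is not the actual implementation from `classifier.py`: `predict_series_sketch` and `StubModel` are hypothetical, and the stub only mimics the shape of a batch `predict` call that takes a list of texts and returns one group of labels per text.

```python
import pandas as pd


class StubModel:
    """Stand-in for a trained model; always returns the same ranking."""

    def predict(self, texts, k=1):
        ranked = ("__label__python", "__label__c#", "__label__java")
        labels = [ranked[:k] for _ in texts]
        probabilities = [[0.6, 0.3, 0.1][:k] for _ in texts]
        return labels, probabilities


def predict_series_sketch(texts: pd.Series, model, multi_prediction=False, k=3) -> pd.Series:
    """Predict labels for a whole Series with a single call to the model."""
    n_labels = k if multi_prediction else 1
    all_labels, _probabilities = model.predict(texts.tolist(), k=n_labels)
    cleaned = [
        [label.replace("__label__", "", 1) for label in labels]
        for labels in all_labels
    ]
    values = cleaned if multi_prediction else [labels[0] for labels in cleaned]
    # Reuse the input index so the output lines up row by row with the input
    return pd.Series(values, index=texts.index)


sample = pd.Series(["python beyond the basics", "creating my own iterators"])
single = predict_series_sketch(sample, StubModel())
multi = predict_series_sketch(sample, StubModel(), multi_prediction=True)
```

The key point is that the model is called once per Series rather than once per row.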

Take a sample to test.

In [8]:
pdf_sample = df_input.sample(False,fraction=0.10,seed=12345).toPandas()

Single prediction.

In [9]:
pd.concat([pdf_sample.input.head(10),predict_serie(pdf_sample.input,False).head(10).rename("category")],axis=1)

|   | input | category |
|---|-------|----------|
| 0 | is filestream lazy loaded in net | .net |
| 1 | programmatically launching standalone adobe fl... | windows |
| 2 | encoding problem classic asp | asp.net |
| 3 | c # winforms datagridview/sql compact negative... | c# |
| 4 | suspending and notifying threads when there is... | java |
| 5 | creating my own iterators | c# |
| 6 | css `` see through '' background crazy navigat... | asp.net |
| 7 | sending email in net through gmail | .net |
| 8 | specify ordinals of c++ exported functions in ... | c++ |
| 9 | in c # or any language what is/are your favour... | c# |


Multi prediction.

In [10]:
pd.concat([pdf_sample.input.head(10),predict_serie(pdf_sample.input,True).head(10).rename("category")],axis=1)

|   | input | category |
|---|-------|----------|
| 0 | is filestream lazy loaded in net | [.net, c#, asp.net] |
| 1 | programmatically launching standalone adobe fl... | [windows, .net, c#] |
| 2 | encoding problem classic asp | [asp.net, asp.net-mvc, css] |
| 3 | c # winforms datagridview/sql compact negative... | [c#, .net, winforms] |
| 4 | suspending and notifying threads when there is... | [java, c#, .net] |
| 5 | creating my own iterators | [c#, .net, asp.net] |
| 6 | css `` see through '' background crazy navigat... | [asp.net, javascript, html] |
| 7 | sending email in net through gmail | [.net, c#, asp.net] |
| 8 | specify ordinals of c++ exported functions in ... | [c++, c, java] |
| 9 | in c # or any language what is/are your favour... | [c#, .net, asp.net] |


Now let's make a `pandas_udf` built on top of the `predict_serie` function.
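Before using the real helper, here is a small sketch of what "executed in chunks" means for a function wrapped by `pandas_udf`. Spark hands the function one Arrow batch at a time as a `pandas.Series` and stitches the results back together. The snippet emulates that locally with plain pandas. `first_word_series` is a toy stand-in for `predict_serie`, and `run_in_chunks` and `wrap_as_pandas_udf` are hypothetical names of mine.

```python
import pandas as pd


def first_word_series(texts: pd.Series) -> pd.Series:
    """Toy Series -> Series function: 'predicts' the first word of each text."""
    return texts.str.split().str[0]


def run_in_chunks(series_fn, texts, chunk_size):
    """Emulate batch-wise execution: apply series_fn per chunk, then concatenate."""
    chunks = (
        texts.iloc[start:start + chunk_size]
        for start in range(0, len(texts), chunk_size)
    )
    return pd.concat([series_fn(chunk) for chunk in chunks])


def wrap_as_pandas_udf(series_fn, return_type):
    """Turn a Series -> Series function into a scalar Pandas UDF."""
    from pyspark.sql.functions import pandas_udf  # lazy import; needs pyspark and pyarrow
    return pandas_udf(series_fn, returnType=return_type)


texts = pd.Series([
    "sql server and the guest account",
    "gantt chart controls",
    "daemon threads explanation",
])
whole = first_word_series(texts)
chunked = run_in_chunks(first_word_series, texts, chunk_size=2)
```

Because the function treats each row independently, the chunked result matches the result on the whole Series, which is what makes this kind of function safe to wrap. With Spark available, `wrap_as_pandas_udf(first_word_series, "string")` would give a UDF usable in `withColumn`, and the batch size is governed by `spark.sql.execution.arrow.maxRecordsPerBatch`.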

## Single prediction

In [11]:
udf_predict = make_pandas_udf(multi_prediction=False)
df_output = df_input.withColumn("category",udf_predict(col("input")))
%timeit -n 10 df_output.sample(False,.10).show(10,False)

+--------------------------------------------------------------------------------------------------------+----------+
|input                                                                                                   |category  |
+--------------------------------------------------------------------------------------------------------+----------+
|making an image greyscale with gd library                                                               |c#        |
|sql server and the guest account what is this for                                                       |sql-server|
|c the definitive truth about rand random and arc4random                                                 |c         |
|vector shape on stage appears over dynamic textfield                                                    |c++       |
|can sql server express be used to effectively administrate a sql server standard/enterprise installation|sql-server|
|why learn perl python ruby if the company is using c++ 

+----------------------------------------------------------------------------------------------+-------------+
|input                                                                                         |category     |
+----------------------------------------------------------------------------------------------+-------------+
|while clause in t sql that loops forever                                                      |sql          |
|doctype rss & html entities                                                                   |html         |
|implementing and enforcing coding standards                                                   |c++          |
|why learn perl python ruby if the company is using c++ c # or java as the application language|c++          |
|can you use an alias in the where clause in mysql                                             |mysql        |
|what design pattern to use for user authentication in java                                    |java         |
|

+------------------------------------------------------------+----------+
|input                                                       |category  |
+------------------------------------------------------------+----------+
|vector shape on stage appears over dynamic textfield        |c++       |
|mac iwork/pages automation                                  |flash     |
|ms sql 2000 turn off logging during stored procedure        |sql-server|
|encoding problem classic asp                                |asp.net   |
|suspending and notifying threads when there is work to do   |java      |
|what is the aspnet_client folder for under the iis structure|.net      |
|what design pattern to use for user authentication in java  |java      |
|numbering regex submatches                                  |php       |
|can you set or where is the local document root             |c#        |
|actionscript3 to javascript communication best practices    |javascript|
+-------------------------------------

+---------------------------------------------------------------------+----------+
|input                                                                |category  |
+---------------------------------------------------------------------+----------+
|how do you manage infragistics webgrid data from javascript/ajax code|c#        |
|prevent long word to add horizontal scroll to html view              |asp.net   |
|python beyond the basics                                             |python    |
|how to effectively implement sessions in gae                         |ruby      |
|ms sql 2000 turn off logging during stored procedure                 |sql-server|
|encoding problem classic asp                                         |asp.net   |
|is there a way to asynchronously filter an ilist                     |python    |
|what design pattern to use for user authentication in java           |java      |
|specify ordinals of c++ exported functions in a dll                  |c++       |
|wha

+--------------------------------------------------------------------------------------------------------+----------+
|input                                                                                                   |category  |
+--------------------------------------------------------------------------------------------------------+----------+
|gantt chart controls on windows forms                                                                   |windows   |
|is filestream lazy loaded in net                                                                        |.net      |
|can sql server express be used to effectively administrate a sql server standard/enterprise installation|sql-server|
|mac iwork/pages automation                                                                              |flash     |
|post from one controller action to another not redirect                                                 |asp.net   |
|how do i call net code c # /vb net from vbscript       

+----------------------------------------------------------------------------------------------+--------+
|input                                                                                         |category|
+----------------------------------------------------------------------------------------------+--------+
|how to effectively implement sessions in gae                                                  |ruby    |
|implementing and enforcing coding standards                                                   |c++     |
|why learn perl python ruby if the company is using c++ c # or java as the application language|c++     |
|c # lambda expressions or delegates as a properties or arguments                              |c#      |
|post from one controller action to another not redirect                                       |asp.net |
|how can i create prototype methods like javascript in c # net                                 |c#      |
|how do i call net code c # /vb net from vbscr

+----------------------------------------------------------+--------+
|input                                                     |category|
+----------------------------------------------------------+--------+
|how can i change html attribute names with jquery         |jquery  |
|build tar file from directory in php without exec/passthru|php     |
|whats the best way to start using mylyn                   |c#      |
|how do i add custom column to existing wss list template  |c#      |
|vector shape on stage appears over dynamic textfield      |c++     |
|carbide / symbian c++ change application icon             |c++     |
|the necessity of hiding the salt for a hash               |c++     |
|how to implement a singleton in c #                       |c#      |
|doctype rss & html entities                               |html    |
|c # in linux environment                                  |c#      |
+----------------------------------------------------------+--------+
only showing top 10 

+-------------------------------------------------------------------------------+----------+
|input                                                                          |category  |
+-------------------------------------------------------------------------------+----------+
|how to effectively implement sessions in gae                                   |ruby      |
|how does traceroute work                                                       |c#        |
|carbide / symbian c++ change application icon                                  |c++       |
|authoritative source on xml sig                                                |.net      |
|how to implement a singleton in c #                                            |c#        |
|doctype rss & html entities                                                    |html      |
|how do you quickly find the url for a win32 api on msdn                        |python    |
|how do i call net code c # /vb net from vbscript                     

+-----------------------------------------------------------------------------------+--------+
|input                                                                              |category|
+-----------------------------------------------------------------------------------+--------+
|vector shape on stage appears over dynamic textfield                               |c++     |
|getting odd error on net executenonquery                                           |.net    |
|how do i extract the version and path from an svn working copy into a nant variable|java    |
|sending email in net through gmail                                                 |.net    |
|what are the limits of ruby on rails                                               |ruby    |
|patterns for the overlap of two objects                                            |java    |
|how to do unit testing with uncertainties                                          |java    |
|what fields should be indexed on a given table   

+----------------------------------------------------------------+-------------+
|input                                                           |category     |
+----------------------------------------------------------------+-------------+
|gantt chart controls on windows forms                           |windows      |
|getting odd error on net executenonquery                        |.net         |
|building flex projects in ant/nant                              |flex         |
|how to disable visual studio macro `` tip '' balloon            |visual-studio|
|how can i get a list of available wireless networks on linux    |python       |
|converting svg to png using c #                                 |c#           |
|is there a way to asynchronously filter an ilist                |python       |
|image archive vs image strip                                    |css          |
|parsing t sql to parameterize a query                           |sql          |
|rendered pixel width data f

+----------------------------------------------------------------------------------------------+--------+
|input                                                                                         |category|
+----------------------------------------------------------------------------------------------+--------+
|python beyond the basics                                                                      |python  |
|sql query count with 0 count                                                                  |sql     |
|why learn perl python ruby if the company is using c++ c # or java as the application language|c++     |
|how to generate unit test code for methods                                                    |java    |
|what is the aspnet_client folder for under the iis structure                                  |.net    |
|what is the best way to see what files are locked in subversion                               |java    |
|css `` see through '' background crazy naviga

+----------------------------------------------------------------------------+-------------+
|input                                                                       |category     |
+----------------------------------------------------------------------------+-------------+
|prevent long word to add horizontal scroll to html view                     |asp.net      |
|getting odd error on net executenonquery                                    |.net         |
|programmatically launching standalone adobe flashplayer on linux/x11        |windows      |
|how can i create prototype methods like javascript in c # net               |c#           |
|is it possible to define in a dependent dll s application config            |c#           |
|what is the best way to see what files are locked in subversion             |java         |
|can you set or where is the local document root                             |c#           |
|why is visual studio constantly crashing                             

## Multi prediction

In [12]:
udf_predict = make_pandas_udf(multi_prediction=True)
df_output = df_input.withColumn("category", udf_predict(col("input")))
%timeit -n 10 df_output.sample(withReplacement=False, fraction=0.10).show(10, truncate=False)

+----------------------------------------------------------------+-----------------------------+
|input                                                           |category                     |
+----------------------------------------------------------------+-----------------------------+
|how to consume json web services from a windows client          |[.net, windows, c#]          |
|whats the best way to start using mylyn                         |[c#, .net, javascript]       |
|how to effectively implement sessions in gae                    |[ruby, python, mysql]        |
|authoritative source on xml sig                                 |[.net, php, c#]              |
|while clause in t sql that loops forever                        |[sql, sql-server, mysql]     |
|how can i get a list of available wireless networks on linux    |[python, c#, .net]           |
|c # lambda expressions or delegates as a properties or arguments|[c#, .net, asp.net]          |
|eclipse text comparison order

+----------------------------------------------------------------------+--------------------------+
|input                                                                 |category                  |
+----------------------------------------------------------------------+--------------------------+
|how do you manage infragistics webgrid data from javascript/ajax code |[c#, java, .net]          |
|prevent long word to add horizontal scroll to html view               |[asp.net, .net, c#]       |
|how do you deal with connection strings when deploying an asp net site|[asp.net, c#, asp.net-mvc]|
|how to effectively implement sessions in gae                          |[ruby, python, mysql]     |
|getting odd error on net executenonquery                              |[.net, c#, asp.net]       |
|building flex projects in ant/nant                                    |[flex, ruby, silverlight] |
|how can i get a list of available wireless networks on linux          |[python, c#, .net]        |


+-----------------------------------------------------------------------------------+--------------------------+
|input                                                                              |category                  |
+-----------------------------------------------------------------------------------+--------------------------+
|how do you deal with connection strings when deploying an asp net site             |[asp.net, c#, asp.net-mvc]|
|how does traceroute work                                                           |[c#, .net, java]          |
|mac iwork/pages automation                                                         |[flash, wcf, flex]        |
|implementing and enforcing coding standards                                        |[c++, java, c#]           |
|post from one controller action to another not redirect                            |[asp.net, html, css]      |
|converting svg to png using c #                                                    |[c#, .net, 

+-------------------------------------------------------------------------+---------------------------+
|input                                                                    |category                   |
+-------------------------------------------------------------------------+---------------------------+
|how do you manage infragistics webgrid data from javascript/ajax code    |[c#, java, .net]           |
|is filestream lazy loaded in net                                         |[.net, c#, asp.net]        |
|how to implement a singleton in c #                                      |[c#, .net, c]              |
|c # in linux environment                                                 |[c#, c, winforms]          |
|best way to use a db table as a message/job queue                        |[java, sql, sql-server]    |
|should i provide a deep clone when implementing icloneable               |[java, c#, c++]            |
|numbering regex submatches                                     

+--------------------------------------------------------------------------------------------------------+----------------------------------+
|input                                                                                                   |category                          |
+--------------------------------------------------------------------------------------------------------+----------------------------------+
|vector shape on stage appears over dynamic textfield                                                    |[c++, c#, .net]                   |
|can sql server express be used to effectively administrate a sql server standard/enterprise installation|[sql-server, sql, sql-server-2005]|
|getting odd error on net executenonquery                                                                |[.net, c#, asp.net]               |
|sql query count with 0 count                                                                            |[sql, sql-server, sql-server-2005]|
|progr

+-----------------------------------------------------------------------------------+---------------------------+
|input                                                                              |category                   |
+-----------------------------------------------------------------------------------+---------------------------+
|sql server and the guest account what is this for                                  |[sql-server, sql, database]|
|how can i send an array to php through ajax                                        |[php, html, jquery]        |
|prevent long word to add horizontal scroll to html view                            |[asp.net, .net, c#]        |
|how does traceroute work                                                           |[c#, .net, java]           |
|getting odd error on net executenonquery                                           |[.net, c#, asp.net]        |
|building flex projects in ant/nant                                                 |[fl

+---------------------------------------------------------------------+---------------------------+
|input                                                                |category                   |
+---------------------------------------------------------------------+---------------------------+
|build tar file from directory in php without exec/passthru           |[php, c++, .net]           |
|ms sql 2000 turn off logging during stored procedure                 |[sql-server, sql, database]|
|how to return a page of results from sql                             |[sql-server, c#, sql]      |
|how do you quickly find the url for a win32 api on msdn              |[python, c++, c#]          |
|encoding problem classic asp                                         |[asp.net, asp.net-mvc, css]|
|how can i determine the ip of my router/gateway in java              |[java, c++, c#]            |
|how do i focus a foreign window                                      |[python, java, ruby]       |


+--------------------------------------------------------------------------------------+----------------------------------+
|input                                                                                 |category                          |
+--------------------------------------------------------------------------------------+----------------------------------+
|c the definitive truth about rand random and arc4random                               |[c, c#, c++]                      |
|prevent long word to add horizontal scroll to html view                               |[asp.net, .net, c#]               |
|python beyond the basics                                                              |[python, c++, windows]            |
|how to effectively implement sessions in gae                                          |[ruby, python, mysql]             |
|should i have one class for every database i use                                      |[java, c#, sql-server]            |
|how to 

+----------------------------------------------------------------------------------------------+---------------------------+
|input                                                                                         |category                   |
+----------------------------------------------------------------------------------------------+---------------------------+
|whats the best way to start using mylyn                                                       |[c#, .net, javascript]     |
|python beyond the basics                                                                      |[python, c++, windows]     |
|how do you deal with connection strings when deploying an asp net site                        |[asp.net, c#, asp.net-mvc] |
|how to implement a singleton in c #                                                           |[c#, .net, c]              |
|how can i get a list of available wireless networks on linux                                  |[python, c#, .net]         |


+-------------------------------------------------------------------------------+-----------------------------+
|input                                                                          |category                     |
+-------------------------------------------------------------------------------+-----------------------------+
|transforming selected text with a hotkey                                       |[c#, python, asp.net]        |
|how can i change html attribute names with jquery                              |[jquery, javascript, html]   |
|build tar file from directory in php without exec/passthru                     |[php, c++, .net]             |
|python beyond the basics                                                       |[python, c++, windows]       |
|how do you deal with connection strings when deploying an asp net site         |[asp.net, c#, asp.net-mvc]   |
|ms sql 2000 turn off logging during stored procedure                           |[sql-server, sql, datab

+--------------------------------------------------------------------------------------------------------+----------------------------------+
|input                                                                                                   |category                          |
+--------------------------------------------------------------------------------------------------------+----------------------------------+
|transforming selected text with a hotkey                                                                |[c#, python, asp.net]             |
|how can i change html attribute names with jquery                                                       |[jquery, javascript, html]        |
|can sql server express be used to effectively administrate a sql server standard/enterprise installation|[sql-server, sql, sql-server-2005]|
|is there a way to asynchronously filter an ilist                                                        |[python, java, .net]              |
|perce

+-----------------------------------------------------------------------------------+----------------------------------+
|input                                                                              |category                          |
+-----------------------------------------------------------------------------------+----------------------------------+
|are incrementers / decrementers var++ var etc thread safe                          |[c++, javascript, windows]        |
|while clause in t sql that loops forever                                           |[sql, sql-server, mysql]          |
|db side encryption via nhibernate                                                  |[wcf, flash, linq-to-sql]         |
|how can i create prototype methods like javascript in c # net                      |[c#, .net, asp.net]               |
|converting svg to png using c #                                                    |[c#, .net, winforms]              |
|best way to use a db table as a

+--------------------------------------------------------------------+---------------------------+
|input                                                               |category                   |
+--------------------------------------------------------------------+---------------------------+
|sql server and the guest account what is this for                   |[sql-server, sql, database]|
|game programming and event handlers                                 |[c#, .net, asp.net]        |
|building flex projects in ant/nant                                  |[flex, ruby, silverlight]  |
|the necessity of hiding the salt for a hash                         |[c++, c#, java]            |
|programmatically launching standalone adobe flashplayer on linux/x11|[windows, .net, c#]        |
|while clause in t sql that loops forever                            |[sql, sql-server, mysql]   |
|how to return a page of results from sql                            |[sql-server, c#, sql]      |
|how to ge

### From `array<string>` type to `string` type

In the multi prediction case, the **category** field has type `array<string>`, which Spark's CSV writer does not support. To persist the output Spark DataFrame as a CSV file, we first convert the field to `string` by joining its elements with `concat_ws`, as shown in the next cell.

In practice, however, we could persist a Parquet file instead, which keeps the **category** field as `array<string>`.
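As a rough row-level picture of what the `concat_ws` call in the next cell does, here is a minimal sketch in plain Python, with no Spark involved. The helper names `join_labels` and `split_labels` are hypothetical and exist only for this illustration. It joins each list of labels with `|`, writes the rows as CSV, and reads them back to recover the lists. It assumes that no label contains the separator character.

```python
import csv
import io


def join_labels(labels, sep="|"):
    """Row-level analogue of concat_ws(sep, array_col): join labels, skipping None."""
    return sep.join(label for label in labels if label is not None)


def split_labels(joined, sep="|"):
    """Inverse of join_labels; an empty string maps back to an empty list."""
    return joined.split(sep) if joined else []


# Two rows taken from the output shown in this notebook.
rows = [
    ("how can i send an array to php through ajax", ["php", "html", "jquery"]),
    ("sql server and the guest account what is this for", ["sql-server", "sql", "database"]),
]

# Write the flattened rows to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["input", "category"])
for text, labels in rows:
    writer.writerow([text, join_labels(labels)])

# Read the CSV back and rebuild the label lists.
buffer.seek(0)
restored = [(r["input"], split_labels(r["category"])) for r in csv.DictReader(buffer)]
```

The round trip only works while the separator never occurs inside a label, which is one reason a format with native array support, such as Parquet, can be the simpler choice.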

In [13]:
df_output.withColumn('category', concat_ws('|', 'category')).show(20, truncate=False)

+--------------------------------------------------------------------------------------+-----------------------+
|input                                                                                 |category               |
+--------------------------------------------------------------------------------------+-----------------------+
|making an image greyscale with gd library                                             |c#|php|html            |
|transforming selected text with a hotkey                                              |c#|python|asp.net      |
|sql server and the guest account what is this for                                     |sql-server|sql|database|
|how can i change html attribute names with jquery                                     |jquery|javascript|html |
|how can i send an array to php through ajax                                           |php|html|jquery        |
|c the definitive truth about rand random and arc4random                               |c|c#|c++