### Preliminary: Building fastText from source and installing the CLI tool

In [1]:
%%bash
chmod u+x install_fasttext.sh && ./install_fasttext.sh

-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /fasttext_inference/pipe/fastText/build
Scanning dependencies of target fasttext-static
[  2%] Building CXX object CMakeFiles/fasttext-static.dir/src/args.cc.o
[  4%] Building CXX object CMakeFiles/fasttext-static.dir/src/autotune.cc.o
[  6%] Building CXX object CMakeFiles/fasttext-static.dir/src/densematrix.cc.o
[  8%] Building

Cloning into 'fastText'...
Checking out files: 100% (517/517), done.


### Checking that the fastText CLI tool is ready to use

In particular, we are interested in the **predict** subcommand. Running it without arguments prints its usage.

In [2]:
! ./fastText/build/fasttext predict

usage: fasttext predict[-prob] <model> <test-data> [<k>] [<th>]

  <model>      model filename
  <test-data>  test data filename (if -, read from stdin)
  <k>          (optional; 1 by default) predict top k labels
  <th>         (optional; 0.0 by default) probability threshold

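The usage string above maps directly onto an argument list. As a minimal sketch, the helper below assembles such a list in Python; `build_predict_cmd` and the file name `questions.txt` are hypothetical names introduced for illustration, while the binary path and model path are the ones used in this notebook.

```python
from typing import List

FASTTEXT_BIN = "./fastText/build/fasttext"  # path of the CLI built above


def build_predict_cmd(model: str, test_data: str = "-", k: int = 1,
                      th: float = 0.0, with_prob: bool = False) -> List[str]:
    """Hypothetical helper: build `fasttext predict[-prob] <model> <test-data> <k> <th>`."""
    subcommand = "predict-prob" if with_prob else "predict"
    # "-" as test data means "read from stdin", per the usage message.
    return [FASTTEXT_BIN, subcommand, model, test_data, str(k), str(th)]


# "questions.txt" is a made-up file name used only for this example.
cmd = build_predict_cmd("../models/ft_tuned.ftz", "questions.txt", k=1, th=0.10)
print(" ".join(cmd))
```

A list like this could be handed to `subprocess.run`, which avoids shell quoting issues when paths contain spaces.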


### Imports & build SparkSession

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import col, when

In [4]:
spark = SparkSession.builder.master("local[4]").appName('mysession').getOrCreate()

## Build Spark DataFrame

In [5]:
df_input = spark.read.parquet('../data/input.parquet').repartition(8)

# Approach 4: RDD's pipe

## Single prediction only!

Let's take a look at what the `get_predictions.sh` script does.

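Note that the script asks for the top label only (`k` = 1) and pairs each input line with exactly one prediction field. Retrieving multiple predictions this way would require more sophisticated manipulation of **stdout**.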
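One possible way to handle several labels per input is to parse the prediction lines in Python rather than in the shell. The sketch below assumes that `predict-prob` with `k` > 1 prints alternating label and probability tokens, separated by spaces, on one line per input. That format, the helper name `parse_predict_prob_line`, and the sample values are assumptions made for illustration, not output of the model used here.

```python
from typing import List, Tuple

PREFIX = "__label__"


def parse_predict_prob_line(line: str) -> List[Tuple[str, float]]:
    """Hypothetical parser: turn 'label prob label prob ...' into (label, prob) pairs."""
    tokens = line.split()
    pairs = []
    # Even positions hold labels, odd positions hold probabilities (assumed format).
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        if label.startswith(PREFIX):
            label = label[len(PREFIX):]
        pairs.append((label, float(prob)))
    return pairs


# Illustrative input only; these numbers are invented.
sample = "__label__c# 0.62 __label__.net 0.21"
print(parse_predict_prob_line(sample))
```

An empty line, which is what a prediction below the threshold would produce under this assumption, parses to an empty list.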

In [6]:
%%bash
chmod u+x ./get_predictions.sh
cat ./get_predictions.sh

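#!/bin/bash

# Prefix for temporary files. Note that $RANDOM can collide across concurrent runs.
filename=$RANDOM

# Copy stdin (one input text per line) to a file, keeping each line verbatim.
while IFS= read -r LINE; do
   printf '%s\n' "$LINE"
done > "$filename.input"

# Predict the top label (k=1) with probability threshold 0.10 and strip the label prefix.
./fastText/build/fasttext predict ../models/ft_tuned.ftz "$filename.input" 1 0.10 | sed 's/__label__//g' > "$filename.preds"

# Join inputs and predictions line by line as "input,prediction", then clean up.
paste -d ',' "$filename.input" "$filename.preds" > "$filename.output" && rm "$filename.input" "$filename.preds"
cat "$filename.output"
rm "$filename.output"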
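Seen from Spark, the script is a plain filter: lines arrive on stdin and lines leave on stdout. A minimal local sketch of that contract, without Spark, is shown below. `pipe_lines` is a hypothetical helper, and `STAND_IN` is a stand-in command that appends a fixed `,demo` field to each line so the example runs without the model or the fastText binary.

```python
import subprocess
import sys
from typing import Iterable, List


def pipe_lines(lines: Iterable[str], cmd: List[str]) -> List[str]:
    """Write lines to cmd's stdin and return the lines it prints on stdout."""
    payload = "".join(line + "\n" for line in lines)
    result = subprocess.run(cmd, input=payload, capture_output=True,
                            text=True, check=True)
    return result.stdout.splitlines()


# Stand-in for ./get_predictions.sh: echoes each input line followed by ",demo".
STAND_IN = [sys.executable, "-c",
            "import sys\nfor line in sys.stdin:\n    print(line.rstrip('\\n') + ',demo')"]

out = pipe_lines(["sql query order by", "c++ reading from a file"], STAND_IN)
print(out)
```

Replacing `STAND_IN` with `["./get_predictions.sh"]` would exercise the real script on a handful of lines before it is run on the full dataset.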

In the following, we invoke the `get_predictions.sh` script through the RDD's `pipe` method. Note that every transformation uses standard RDD and DataFrame operations that don't depend on Python-specific features, so the same approach can be followed with Spark's Scala API.

In [7]:
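# 1. Concatenate each row's fields into one line of text.
# 2. Pipe the lines through the script, which returns "input,prediction" lines.
# 3. Split on the last comma only, so commas inside the input text are preserved.
# 4. Build a DataFrame and turn empty predictions (below threshold) into nulls.
df_output = df_input.rdd.map(lambda x: ''.join(list(x))) \
            .pipe("./get_predictions.sh") \
            .map(lambda line: line.rsplit(",", 1)) \
            .map(lambda line: Row(input=line[0],category=line[1])) \
            .toDF() \
            .withColumn("category", when(col("category") != "", col("category")))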
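The parsing done after `pipe` can be checked in isolation, without Spark. The sketch below is one way to express it in plain Python: the hypothetical helper `parse_output_line` separates the text from the prediction at the last comma and maps an empty prediction to `None`, mirroring what the `when` clause does with nulls. It assumes every line contains at least one comma, which holds for lines produced by `paste -d ','`.

```python
from typing import Optional, Tuple


def parse_output_line(line: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split 'input,prediction' at the last comma."""
    text, _, category = line.rpartition(",")
    # An empty prediction means no label passed the threshold.
    return text, (category or None)


print(parse_output_line("sql query order by,sql"))
print(parse_output_line("how much does it cost to develop an iphone application,"))
```

Splitting at the last comma keeps the input intact even when the text itself contains commas, for example `parse_output_line("a, b,c#")` yields `("a, b", "c#")`.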

In [8]:
df_output.count()

592

In [9]:
df_output.sample(False,.10,12345).show(10,False)

+--------+---------------------------------------------------------------------------------------------+
|category|input                                                                                        |
+--------+---------------------------------------------------------------------------------------------+
|null    |what deployment directories do you use for rails applications deploying to a debian box      |
|sql     |sql query order by                                                                           |
|c++     |c++ reading from a file blocks any further writing why                                       |
|null    |how much does it cost to develop an iphone application                                       |
|null    |what are some excellent examples of user sign up forms on the web                            |
|.net    |why doesn t backcolor work for tabcontrols in net                                            |
|c#      |c # compiler and caching of local variables  

# Performance

In [10]:
%timeit -n 10 df_output.sample(False,.10).show(10)

+--------+--------------------+
|category|               input|
+--------+--------------------+
|    .net|what s the best w...|
|    null|what deployment d...|
|    java|why don t my html...|
|      c#|how to serialize ...|
|     c++|initialize a cons...|
|    null|running multiple ...|
|     c++|why learn perl py...|
|     sql|sql query count w...|
|    null|how much does it ...|
|    null|windows domain ch...|
+--------+--------------------+
only showing top 10 rows

+--------+--------------------+
|category|               input|
+--------+--------------------+
|       c|wrapping visual c...|
|    null|is there an idiom...|
|      c#|how can i convert...|
|    null|apache/tomcat err...|
|      c#|c # reflection ge...|
|    null|what s all this b...|
|    null|vs2008 ide giving...|
|      c#|how to walk the m...|
|    null|transforming sele...|
|      c#|is there a fast w...|
+--------+--------------------+
only showing top 10 rows

+--------+--------------------+
|category|               input|
+--------+--------------------+
|    java|why don t my html...|
|    null|tetris piece rota...|
|      c#|how to get the de...|
|    null|ddd and asynchron...|
|    null|windows domain ch...|
|    null|what can cause in...|
|      c#|drag n drop one o...|
|    java|which javascript ...|
|    null|how should i impo...|
|    .net|implementing iint...|
+--------+--------------------+
only showing top 10 rows

+-------------+--------------------+
|     category|               input|
+-------------+--------------------+
|visual-studio|why is visual stu...|
|           c#|how do you manage...|
|         null|decoding chunked ...|
|         .net|persistent storag...|
|         java|deterministic dis...|
|            c|wrapping visual c...|
|   sql-server|sending e mail fr...|
|           c#|how to get the de...|
|         null|what is the aspne...|
|          c++|why learn perl py...|
+-------------+--------------------+
only showing top 10 rows

+----------+--------------------+
|  category|               input|
+----------+--------------------+
|        c#|find a private fi...|
|sql-server|sql server profil...|
|      null|running multiple ...|
|      null|how to detect whe...|
|        c#|how can i convert...|
|    python|need help variabl...|
|      ruby|what are the limi...|
|   windows|are there problem...|
|      null|what is the best ...|
|      null|release configura...|
+----------+--------------------+
only showing top 10 rows

+-----------+--------------------+
|   category|               input|
+-----------+--------------------+
|asp.net-mvc|asp net mvc beta ...|
|       .net|persistent storag...|
|         c#|is it possible to...|
|       java|why don t my html...|
|       null|delphi project ne...|
|        c++|the necessity of ...|
|       java|why do people use...|
|         c#|drag n drop one o...|
|         c#|add multiple user...|
|       null|how do i slice an...|
+-----------+--------------------+
only showing top 10 rows

+----------+--------------------+
|  category|               input|
+----------+--------------------+
|javascript|weird ie & javasc...|
|        c#|need help handlin...|
|      null|code golf combini...|
|    python|how can i execute...|
|        c#|  using lists in c #|
|      null|how do you quickl...|
|      .net|implementing iint...|
|        c#|filenotfound exce...|
|sql-server|how to justify mo...|
|   asp.net|how to make user ...|
+----------+--------------------+
only showing top 10 rows

+--------+--------------------+
|category|               input|
+--------+--------------------+
|    java|deterministic dis...|
|    null|how to effectivel...|
|    null|how much does it ...|
|    null|subsonic subsonic...|
|    null|how do you quickl...|
|    java|embedding xulrunn...|
|    java|what s the best w...|
|      c#|is it possible to...|
|    null|bat file to run a...|
|      c#|equivalent of jav...|
+--------+--------------------+
only showing top 10 rows

+-------------+--------------------+
|     category|               input|
+-------------+--------------------+
|visual-studio|why is visual stu...|
|         null|decoding chunked ...|
|   javascript|weird ie & javasc...|
|           c#|how to get the de...|
|         null|daemon threads ex...|
|          c++|c++ reading from ...|
|           c#|what exception sh...|
|         null|what can cause in...|
|           c#|  using lists in c #|
|           c#|c # reflection ge...|
+-------------+--------------------+
only showing top 10 rows

240 ms ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [11]:
df_output.rdd.getNumPartitions()

8