### Preliminary: Building fastText from source and installing the CLI tool

In [1]:
%%bash
chmod u+x install_fasttext.sh && ./install_fasttext.sh

-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /fasttext_inference/fastText/build
Scanning dependencies of target fasttext-static
[  2%] Building CXX object CMakeFiles/fasttext-static.dir/src/args.cc.o
[  4%] Building CXX object CMakeFiles/fasttext-static.dir/src/autotune.cc.o
[  6%] Building CXX object CMakeFiles/fasttext-static.dir/src/densematrix.cc.o
[  8%] Building CXX 

Cloning into 'fastText'...
Checking out files: 100% (517/517), done.


### Checking that the fastText CLI tool is ready to use

In particular, we are interested in using the **predict** command.

In [2]:
! ./fastText/build/fasttext predict

usage: fasttext predict[-prob] <model> <test-data> [<k>] [<th>]

  <model>      model filename
  <test-data>  test data filename (if -, read from stdin)
  <k>          (optional; 1 by default) predict top k labels
  <th>         (optional; 0.0 by default) probability threshold
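The usage line above maps directly onto an argument list. The following is a minimal sketch of assembling that list in Python, for example to hand to `subprocess.run`. The helper name `build_predict_cmd` and its parameter names are hypothetical; only the argument order and the defaults come from the usage text.

```python
def build_predict_cmd(model, test_data, k=1, th=0.0, with_prob=False,
                      binary="./fastText/build/fasttext"):
    """Assemble the argv list for `fasttext predict[-prob]`.

    Hypothetical helper: argument order and defaults follow the usage
    text printed by the CLI (<model> <test-data> [<k>] [<th>]).
    """
    subcommand = "predict-prob" if with_prob else "predict"
    # Pass "-" as test_data to make the tool read from stdin.
    return [binary, subcommand, model, test_data, str(k), str(th)]


cmd = build_predict_cmd("models/ft_tuned.ftz", "-", k=3)
print(cmd)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.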



### Imports & build Spark session

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql import Row

In [4]:
spark = SparkSession.builder.appName('mysession').getOrCreate()

## Build Spark DataFrame

In [5]:
df_input = spark.read.parquet('data/input.parquet').repartition(8)

# Approach 4: RDD's pipe

## Single prediction only!

Let's take a look at what the `get_predictions.sh` script does.

Note that the script requests a single prediction per input (`k` = 1). Retrieving multiple predictions this way would require more sophisticated parsing of the script's **stdout**.
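To give an idea of what that extra parsing could look like, here is a minimal sketch. It assumes that, for `k` > 1, the CLI prints all labels for one input on a single line separated by spaces, and that `predict-prob` alternates label and probability tokens. The function name `parse_topk` is hypothetical, and `__label__` is assumed to be the label prefix, as in the `sed` call of the script.

```python
def parse_topk(line, with_prob=False, prefix="__label__"):
    """Parse one output line of `fasttext predict` / `predict-prob`.

    Assumed formats (one line per input):
      predict:       "__label__a __label__b"
      predict-prob:  "__label__a 0.9 __label__b 0.05"
    """
    def strip_prefix(token):
        return token[len(prefix):] if token.startswith(prefix) else token

    tokens = line.split()
    if with_prob:
        # Even positions hold labels, odd positions hold probabilities.
        return [(strip_prefix(label), float(prob))
                for label, prob in zip(tokens[0::2], tokens[1::2])]
    return [strip_prefix(token) for token in tokens]


print(parse_topk("__label__python __label__c#"))
print(parse_topk("__label__python 0.9 __label__c# 0.05", with_prob=True))
```

With a probability threshold set, a line may carry fewer than `k` labels, so downstream code should not assume a fixed number of columns.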

In [6]:
%%bash
chmod u+x ./get_predictions.sh
cat ./get_predictions.sh

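#!/bin/bash

# Scratch-file prefix; the PID makes collisions between concurrent partitions unlikely
filename="$RANDOM.$$"

# Copy stdin (one record per line) to a scratch file, preserving each line verbatim
while IFS= read -r LINE; do
   printf '%s\n' "$LINE"
done > "$filename.input"

# Predict the top label for every line and strip the label prefix
./fastText/build/fasttext predict models/ft_tuned.ftz "$filename.input" 1 | sed 's/__label__//g' > "$filename.preds"

# Join each input line with its prediction, emit the result, and clean up
paste -d ',' "$filename.input" "$filename.preds" > "$filename.output" && rm "$filename.input" "$filename.preds"
cat "$filename.output"
rm "$filename.output"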

In the following, we invoke the `get_predictions.sh` script through the RDD's `pipe` method. Note that the pipeline uses only standard RDD operations (`map`, `pipe`, `toDF`) that do not depend on Python-specific features, so the same approach carries over to Spark's Scala API.
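To make the mechanics of `pipe` concrete: for each partition, Spark launches the external command once, writes the partition's elements to its stdin one per line, and turns each line of its stdout into an element of the resulting RDD. The sketch below imitates this with the standard library only, without Spark or fastText. The function `pipe_partition` and the stand-in child process, which appends a fixed label instead of predicting one, are hypothetical.

```python
import subprocess
import sys


def pipe_partition(lines, argv):
    """Send one partition's elements to an external command, one per line,
    and return the command's stdout lines as the new elements."""
    proc = subprocess.run(
        argv,
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()


# Hypothetical stand-in for get_predictions.sh: appends a fixed label to each line.
STAND_IN = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.rstrip('\\n') + ',python')",
]

partitions = [["how do i sort a list", "what is a decorator"],
              ["sql query order by"]]
piped = [pipe_partition(p, STAND_IN) for p in partitions]
print(piped)
```

Because the command runs once per partition and partitions may be processed concurrently, the script has to use a distinct scratch file for each invocation.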

In [7]:
# Each partition is streamed through the script, which emits "input,prediction" lines
df_output = df_input.rdd.map(lambda x: ''.join(list(x))) \
            .pipe("./get_predictions.sh") \
            .map(lambda line: line.rsplit(",", 1)) \
            .map(lambda line: Row(input=line[0],category=line[1])) \
            .toDF()
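The script emits each record as `input,prediction`, so a split on every comma would misplace the category whenever the input text itself contains a comma. A minimal sketch of a more defensive parser, which splits on the last comma only, is shown below. The names `Prediction` and `parse_output_line` are hypothetical, and the sketch assumes that labels never contain a comma.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    input: str
    category: str


def parse_output_line(line):
    """Split an 'input,prediction' line on its LAST comma.

    Assumes the predicted label contains no comma; the input text may.
    """
    text, _, category = line.rpartition(",")
    return Prediction(input=text, category=category)


print(parse_output_line("sql query order by,sql"))
print(parse_output_line("c++, c# and java interop,c#"))
```

A line without any comma ends up entirely in `category` with an empty `input`, which makes malformed records easy to filter out afterwards.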

In [8]:
df_output.count()

592

In [9]:
df_output.sample(False,.10,12345).show(10,False)

+--------+---------------------------------------------------------------------------------------------+
|category|input                                                                                        |
+--------+---------------------------------------------------------------------------------------------+
|ruby    |what deployment directories do you use for rails applications deploying to a debian box      |
|sql     |sql query order by                                                                           |
|c++     |c++ reading from a file blocks any further writing why                                       |
|.net    |how much does it cost to develop an iphone application                                       |
|asp.net |what are some excellent examples of user sign up forms on the web                            |
|.net    |why doesn t backcolor work for tabcontrols in net                                            |
|c#      |c # compiler and caching of local variables  

# Performance

In [10]:
%timeit -n 10 df_output.sample(False,.10).show(10)

+----------+--------------------+
|  category|               input|
+----------+--------------------+
|       c++|are incrementers ...|
|    python|cross platform ed...|
|      java|deterministic dis...|
|sql-server|data verification...|
|      .net|how much does it ...|
|       php|apache/tomcat err...|
|        c#|deployment of cus...|
|      java|which javascript ...|
|        c#|c # reflection ge...|
|        c#|where d my generi...|
+----------+--------------------+
only showing top 10 rows

+----------+--------------------+
|  category|               input|
+----------+--------------------+
|      ruby|what deployment d...|
|javascript|weird ie & javasc...|
|        c#|how to get the de...|
|        c#|find a private fi...|
|      .net|what is the aspne...|
|       c++|c++ reading from ...|
|      java|code golf combini...|
|    python|how do you quickl...|
|   asp.net|prevent long word...|
|   asp.net|fetch one row per...|
+----------+--------------------+
only showing top 10 rows

+-------------+--------------------+
|     category|               input|
+-------------+--------------------+
|visual-studio|why is visual stu...|
|  asp.net-mvc|asp net mvc beta ...|
|         java|why don t my html...|
|         java|deterministic dis...|
|          c++|initialize a cons...|
|          c++|c++ reading from ...|
|         .net|how much does it ...|
|          css|abstraction away ...|
|      asp.net|edit html meta ta...|
|         java|which javascript ...|
+-------------+--------------------+
only showing top 10 rows

+-------------+--------------------+
|     category|               input|
+-------------+--------------------+
|         java|ddd and asynchron...|
|          php|random image pick...|
|           c#|need help handlin...|
|        mysql|best update metho...|
|      asp.net|get performance c...|
|           c#|what exception sh...|
|      windows|windows domain ch...|
|           c#|what can cause in...|
|visual-studio|any have a visual...|
|           

+-------------+--------------------+
|     category|               input|
+-------------+--------------------+
|visual-studio|why is visual stu...|
|   javascript|weird ie & javasc...|
|          php|random image pick...|
|           c#|how do you bind i...|
|   sql-server|data verification...|
|      asp.net|get performance c...|
|         .net|how much does it ...|
|      asp.net|what are some exc...|
|           c#|subsonic subsonic...|
|          c++|how to check if f...|
+-------------+--------------------+
only showing top 10 rows

+--------+--------------------+
|category|               input|
+--------+--------------------+
|    ruby|similar thing to ...|
|    ruby|what deployment d...|
|    java|why don t my html...|
|      c#|how to get the de...|
|    java|ddd and asynchron...|
|      c#|daemon threads ex...|
|      c#|need help handlin...|
| asp.net|get performance c...|
|  python|how can i execute...|
| asp.net|asp net ajax text...|
+--------+--------------------+
only showing top 10 rows

+-----------+--------------------+
|   category|               input|
+-----------+--------------------+
|asp.net-mvc|asp net mvc beta ...|
|       ruby|what deployment d...|
|     python|cross platform ed...|
|         c#|daemon threads ex...|
|     python|how do i open off...|
|        sql|  sql query order by|
|         c#|what can cause in...|
|        c++|determine if type...|
|    asp.net|fetch one row per...|
|       .net|implementing iint...|
+-----------+--------------------+
only showing top 10 rows

+----------+--------------------+
|  category|               input|
+----------+--------------------+
|     mysql|best update metho...|
|      html|doctype rss & htm...|
|        c#|where d my generi...|
|   asp.net|vs2008 ide giving...|
|       php|php $ _get sort p...|
|       sql|how can i filter ...|
|      flex|iphone programmin...|
|       svn|release configura...|
|javascript|how to do a jquer...|
|      java|ant and the avail...|
+----------+--------------------+
only showing top 10 rows

+--------+--------------------+
|category|               input|
+--------+--------------------+
|     c++|c++ odd compile e...|
|    java|deterministic dis...|
|     php|random image pick...|
|    .net|how much does it ...|
|  python|how can i execute...|
|    java|java stringbuffer...|
|     c++|determine if type...|
|      c#|converting svg to...|
|     php|php $ _get sort p...|
| windows|are there problem...|
+--------+--------------------+
only showing top 10 rows

+-------------+--------------------+
|     category|               input|
+-------------+--------------------+
|          c++|are incrementers ...|
|          sql|  sql query order by|
|      eclipse|eclipse text comp...|
|visual-studio|how to script vis...|
|         ruby|how to effectivel...|
|         .net|authoritative sou...|
|      windows|windows domain ch...|
|         ruby|what are the limi...|
|      asp.net|asp net ajax text...|
|            c|mysql c api using...|
+-------------+--------------------+
only showing top 10 rows

In [11]:
df_output.rdd.getNumPartitions()

8