### Preliminary: Building fastText from source and installing the CLI tool

In [1]:
%%bash
chmod u+x install_fasttext.sh && ./install_fasttext.sh

-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /fasttext_inference/fastText/build
Scanning dependencies of target fasttext-static
[  2%] Building CXX object CMakeFiles/fasttext-static.dir/src/args.cc.o
[  4%] Building CXX object CMakeFiles/fasttext-static.dir/src/autotune.cc.o
[  6%] Building CXX object CMakeFiles/fasttext-static.dir/src/densematrix.cc.o
[  8%] Building CXX 

Cloning into 'fastText'...
Checking out files: 100% (517/517), done.


### Checking that the fastText CLI tool is ready to use

In particular, we are interested in the **predict** command.

In [2]:
! ./fastText/build/fasttext predict

usage: fasttext predict[-prob] <model> <test-data> [<k>] [<th>]

  <model>      model filename
  <test-data>  test data filename (if -, read from stdin)
  <k>          (optional; 1 by default) predict top k labels
  <th>         (optional; 0.0 by default) probability threshold



### Imports and SparkSession setup

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql import Row

In [4]:
spark = SparkSession.builder.appName('mysession').getOrCreate()

## Build Spark DataFrame

In [5]:
df_input = spark.read.parquet('data/input.parquet').repartition(8)

# Approach 4: RDD's pipe

## Single prediction only!

Let's take a look at what the `get_predictions.sh` script does.

Note that the script returns only the top prediction for each input. Retrieving multiple predictions this way would require more elaborate parsing of the tool's **stdout**.

In [6]:
%%bash
chmod u+x ./get_predictions.sh
cat ./get_predictions.sh

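#!/bin/bash

# Prefix for this invocation's temporary files; the PID avoids clashes between concurrent partitions
filename="$RANDOM.$$"

# Copy stdin (one input text per line) to a temporary file, preserving each line verbatim
while IFS= read -r LINE; do
   printf '%s\n' "$LINE"
done > "$filename.input"

# Top-1 prediction with a 0.10 probability threshold; strip the __label__ prefix
./fastText/build/fasttext predict models/ft_tuned.ftz "$filename.input" 1 0.10 | sed 's/__label__//g' > "$filename.preds"

# Join each input line with its prediction, emit the result, and clean up
paste -d ',' "$filename.input" "$filename.preds" > "$filename.output" && rm "$filename.input" "$filename.preds"
cat "$filename.output"
rm "$filename.output"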
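To illustrate the earlier note about multiple predictions, here is a minimal sketch of the extra parsing that would be needed. It assumes the tool is run as `predict-prob` with `k > 1` and prints, per input line, alternating `__label__<name>` and probability tokens separated by whitespace. The function name and the sample line are made up for illustration; they are not output from this notebook's model.

```python
LABEL_PREFIX = "__label__"


def parse_predict_prob_line(line, prefix=LABEL_PREFIX):
    """Parse one line of alternating label/probability tokens into (label, prob) pairs.

    Assumed input shape: "__label__a 0.7 __label__b 0.2". An empty line yields [].
    """
    tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected label/probability pairs, got: %r" % line)
    pairs = []
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        if not label.startswith(prefix):
            raise ValueError("unexpected label token: %r" % label)
        # Strip the prefix, as the sed command in the script does
        pairs.append((label[len(prefix):], float(prob)))
    return pairs


# Illustrative line only; the probabilities are invented
sample = "__label__c# 0.71 __label__.net 0.15"
print(parse_predict_prob_line(sample))
```

With several labels per line, a single `paste -d ','` column is no longer enough, which is why the script in this notebook sticks to one prediction per input.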

Next, we invoke the `get_predictions.sh` script through the RDD's `pipe` method. All transformations use standard operations that do not depend on Python-specific features, so the same approach also works with Spark's Scala API!

In [7]:
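# Each row is written as one line to the script's stdin; each stdout line comes back as "<input>,<category>".
# rsplit on the last comma keeps any commas inside the input text intact.
df_output = df_input.rdd.map(lambda x: ''.join(list(x))) \
            .pipe("./get_predictions.sh") \
            .map(lambda line: line.rsplit(",", 1)) \
            .map(lambda line: Row(input=line[0],category=line[1])) \
            .toDF()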
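As a rough mental model of what `pipe` does for one partition, the sketch below feeds a list of lines to an external process and reads its stdout back line by line, without needing Spark. The stand-in command is hypothetical: it imitates the `<input>,<category>` output shape of `get_predictions.sh` with a trivial rule instead of a fastText model. The empty category for the second line mirrors what presumably happens when no label reaches the probability threshold.

```python
import subprocess
import sys

# Hypothetical stand-in for ./get_predictions.sh: echoes each line followed by
# ",sql" if the text mentions "sql", or by "," alone (no label) otherwise.
STAND_IN_SCRIPT = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    text = line.rstrip('\\n')\n"
    "    print(text + ',' + ('sql' if 'sql' in text else ''))\n"
)
STAND_IN_ARGV = [sys.executable, "-c", STAND_IN_SCRIPT]


def pipe_partition(lines, argv):
    """Write one element per line to the command's stdin and return its stdout lines."""
    payload = "".join(line + "\n" for line in lines)
    result = subprocess.run(argv, input=payload, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def parse_output_line(line):
    """Split "<input>,<category>" on the last comma; a missing comma means no category."""
    text, sep, category = line.rpartition(",")
    if not sep:
        return {"input": line, "category": ""}
    return {"input": text, "category": category}


partition = ["sql query order by", "how much does it cost to develop an iphone application"]
rows = [parse_output_line(line) for line in pipe_partition(partition, STAND_IN_ARGV)]
for row in rows:
    print(row)
```

In Spark, one such process is started per partition, so the external command (and any model it loads) runs once for each of the partitions of `df_input`.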

In [8]:
df_output.count()

592

In [9]:
df_output.sample(False,.10,12345).show(10,False)

+--------+---------------------------------------------------------------------------------------------+
|category|input                                                                                        |
+--------+---------------------------------------------------------------------------------------------+
|        |what deployment directories do you use for rails applications deploying to a debian box      |
|sql     |sql query order by                                                                           |
|c++     |c++ reading from a file blocks any further writing why                                       |
|        |how much does it cost to develop an iphone application                                       |
|        |what are some excellent examples of user sign up forms on the web                            |
|.net    |why doesn t backcolor work for tabcontrols in net                                            |
|c#      |c # compiler and caching of local variables  

# Performance

In [10]:
%timeit -n 10 df_output.sample(False,.10).show(10)

+--------+--------------------+
|category|               input|
+--------+--------------------+
|        |cross platform ed...|
|      c#|how to serialize ...|
|     sql|parsing t sql to ...|
| asp.net|what are effectiv...|
|       c|mysql c api using...|
|  python|unicode vs str de...|
|        |can i use css in ...|
|     sql|normalizing a tab...|
|     sql|partial keyword s...|
|        |can someone point...|
+--------+--------------------+
only showing top 10 rows

+--------+--------------------+
|category|               input|
+--------+--------------------+
|    .net|what s the best w...|
|    .net|persistent storag...|
|        |cross platform ed...|
|        |tetris piece rota...|
|  python|extracting text f...|
|        |is there an idiom...|
|        |how to detect whe...|
|    .net|authoritative sou...|
|     sql|sql query count w...|
|  python|how can i execute...|
+--------+--------------------+
only showing top 10 rows

+-------------+--------------------+
|     category|               input|
+-------------+--------------------+
|         ruby|similar thing to ...|
|           c#|how do you manage...|
|   javascript|how do you execut...|
|   sql-server|sending e mail fr...|
|           c#|need help handlin...|
|visual-studio|how to script vis...|
|           c#|download files to...|
|          c++|c++ reading from ...|
|           c#|what exception sh...|
|           c#|deployment of cus...|
+-------------+--------------------+
only showing top 10 rows

+-------------+--------------------+
|     category|               input|
+-------------+--------------------+
|             |daemon threads ex...|
|             |what is the aspne...|
|   sql-server|sql server profil...|
|visual-studio|how to script vis...|
|        mysql|best update metho...|
|          c++|why learn perl py...|
|             |code golf combini...|
|             |subsonic subsonic...|
|          sql|parsing t sql to ...|
|         ja

+----------+--------------------+
|  category|               input|
+----------+--------------------+
|        c#|how do you manage...|
|      .net|what s the best w...|
|        c#|is it possible to...|
|      java|deterministic dis...|
|javascript|how do you execut...|
|          |is there an idiom...|
|        c#|what exception sh...|
|      java|which javascript ...|
|          |bat file to run a...|
|          |transforming sele...|
+----------+--------------------+
only showing top 10 rows

+-------------+--------------------+
|     category|               input|
+-------------+--------------------+
|         java|trim whitespace f...|
|             |daemon threads ex...|
|      eclipse|eclipse text comp...|
|           c#|how can i convert...|
|             |code golf combini...|
|         java|why do people use...|
|visual-studio|any have a visual...|
|      asp.net|asp net ajax text...|
|             |vs2008 ide giving...|
|           c#|is it possible to...|
+-------------+--------------------+

+--------+--------------------+
|category|               input|
+--------+--------------------+
|      c#|how would you att...|
|        |what are some exc...|
|        |code golf combini...|
|      c#|what exception sh...|
|      c#|add multiple user...|
| asp.net|what are effectiv...|
|      c#|is endian convers...|
|     c++|using boost share...|
|     php|transfer variable...|
|        |how to unit test ...|
+--------+--------------------+
only showing top 10 rows

+----------+--------------------+
|  category|               input|
+----------+--------------------+
|        c#|how would you att...|
|sql-server|sending e mail fr...|
|          |daemon threads ex...|
|   eclipse|eclipse text comp...|
|sql-server|sql server profil...|
|        c#|download files to...|
|   asp.net|edit html meta ta...|
|        c#|drag n drop one o...|
|        c#|add multiple user...|
|   windows|are there problem...|
+----------+--------------------+
only showing top 10 rows

+----------+--------------------+
|  category|               input|
+----------+--------------------+
|      .net|what s the best w...|
|javascript|how do you execut...|
|          |compact framework...|
|       c++|initialize a cons...|
|   asp.net|get performance c...|
|          |how to effectivel...|
|      .net|authoritative sou...|
|        c#|what exception sh...|
|       php|   virtual 360º s...|
|        c#|is it possible to...|
+----------+--------------------+
only showing top 10 rows

+-------------+--------------------+
|     category|               input|
+-------------+--------------------+
|visual-studio|how to disable vi...|
|          c++|c++ odd compile e...|
|         java|trim whitespace f...|
|       python|extracting text f...|
|           c#|need help handlin...|
|             |what is the aspne...|
|visual-studio|how to script vis...|
|         .net|authoritative sou...|
|             |what are some exc...|
|           c#|how can i convert...|
+-------------+--------------------+

In [11]:
df_output.rdd.getNumPartitions()

8