### Simple Text Processing and Classification with Apache Spark
---
The aim of this notebook is to practise basic text processing with Apache Spark, using the toxic comment text classification dataset. The machine learning and text processing used here are deliberately rudimentary. The main goal was to convert the `comment_text` column into a column of sparse vectors for use in a classification algorithm from the Spark `ml` library.

The `pyspark.ml` library is used for machine learning with Spark DataFrames. For machine learning with Spark RDDs, use the `pyspark.mllib` library instead.

In [52]:
import pandas as pd

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SQLContext
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
# from pyspark import SparkContext
# sc =SparkContext()
# sqlContext = SQLContext(sc)
from nltk.corpus import stopwords

In [53]:
# Build a Spark session (the underlying SparkContext is available as hc.sparkContext)
hc = (SparkSession.builder
                  .appName('Toxic Comment Classification')
                  .enableHiveSupport()
                  .config("spark.executor.memory", "4G")
                  .config("spark.driver.memory","18G")
                  .config("spark.executor.cores","7")
                  .config("spark.python.worker.memory","4G")
                  .config("spark.driver.maxResultSize","0")
                  .config("spark.sql.crossJoin.enabled", "true")
                  .config("spark.serializer","org.apache.spark.serializer.KryoSerializer")
                  .config("spark.default.parallelism","2")
                  .getOrCreate())

In [54]:
hc.sparkContext.setLogLevel('INFO')

In [55]:
hc.version
# SQLContext wraps a SparkContext, so pass hc.sparkContext and reuse the session
sqlContext = SQLContext(hc.sparkContext, hc)

Unfortunately, as much as I like the CSV reader added in Spark 2+ and the Databricks spark-csv package, I was unable to use either of them to parse multiline, multi-character quoted records in a CSV file. As a result, I loaded the data into a Pandas DataFrame and then converted it to a Spark DataFrame.
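To illustrate the kind of record that causes trouble, here is a minimal sketch of a quoted field that spans several lines. The sample data is made up for illustration, and the standard-library `csv` module stands in for pandas so that the example has no dependencies; `pd.read_csv` applies the same quoting conventions by default. Spark 2.2 and later also expose a `multiLine` option on the CSV reader that may handle such records, but that has not been checked against this dataset.

```python
import csv
import io

# Hypothetical sample: the comment_text field of the first record contains
# a newline and doubled quotes inside a quoted field.
SAMPLE = 'id,comment_text\n1,"first line\nsecond ""quoted"" line"\n2,plain text\n'


def parse_csv(text):
    """Parse CSV text into a list of dicts, honouring quoted multiline fields."""
    # newline='' leaves embedded newlines untouched for the csv module.
    reader = csv.DictReader(io.StringIO(text, newline=''))
    return list(reader)


rows = parse_csv(SAMPLE)
print(len(rows))
print(repr(rows[0]['comment_text']))
```

A naive line-by-line split would see three data lines here, while a quote-aware parser yields two records.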

In [56]:
def to_spark_df(fin):
    """
    Parse a filepath to a spark dataframe using the pandas api.
    
    Parameters
    ----------
    fin : str
        The path to the file on the local filesystem that contains the csv data.
        
    Returns
    -------
    df : pyspark.sql.dataframe.DataFrame
        A spark DataFrame containing the parsed csv data.
    """
    df = pd.read_csv(fin)
    df.fillna("", inplace=True)
    df = hc.createDataFrame(df)
    return df

# Load the train-test sets
# train = to_spark_df("../input/train.csv")
# test = to_spark_df("../input/test.csv")
train = (sqlContext.read
                   .format('com.databricks.spark.csv')
                   .options(header='true', inferschema='true')
                   .load('articles-articles.csv'))
drop_list = ['Dates', 'Topic', 'Page']
train = train.select([column for column in train.columns if column not in drop_list])
train.show(5)

+--------+--------------------+
|Category|                Body|
+--------+--------------------+
|politics|WITH THE ARRIVAL ...|
|business|TENS OF THOUSANDS...|
|politics|WASHINGTON  PRESI...|
|business|OMAHA  ELON MUSK ...|
|politics|REUTERS    THE TR...|
+--------+--------------------+
only showing top 5 rows



In [57]:
out_cols = [i for i in train.columns if i not in ["Category", "Body"]]

In [58]:
# Sadly the output is not as pretty as that of the pandas head() method
train.show(5)

+--------+--------------------+
|Category|                Body|
+--------+--------------------+
|politics|WITH THE ARRIVAL ...|
|business|TENS OF THOUSANDS...|
|politics|WASHINGTON  PRESI...|
|business|OMAHA  ELON MUSK ...|
|politics|REUTERS    THE TR...|
+--------+--------------------+
only showing top 5 rows



In [59]:
# View some toxic comments
# train.filter(F.col('toxic') == 1).show(5)

In [60]:
# Basic tokenizer: lowercases the text and splits it on whitespace
tokenizer = Tokenizer(inputCol="Body", outputCol="words")
wordsData = tokenizer.transform(train)

In [61]:
# Hash each token to a vector index and count its occurrences in the document (term frequencies)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
tf = hashingTF.transform(wordsData)

In [62]:
tf.select('rawFeatures').take(2)

[Row(rawFeatures=SparseVector(262144, {14: 1.0, 619: 1.0, 751: 1.0, 1461: 1.0, 1769: 44.0, 1854: 1.0, 2410: 1.0, 2437: 4.0, 4004: 1.0, 4081: 2.0, 4172: 1.0, 4366: 1.0, 4622: 1.0, 4672: 1.0, 4842: 1.0, 4869: 1.0, 5083: 2.0, 5232: 3.0, 5381: 1.0, 5476: 3.0, 6068: 1.0, 6079: 1.0, 6194: 1.0, 6258: 2.0, 6355: 1.0, 6369: 1.0, 6972: 1.0, 6981: 1.0, 7612: 9.0, 7838: 1.0, 8267: 1.0, 8630: 1.0, 8804: 2.0, 8928: 1.0, 9129: 2.0, 9155: 1.0, 9521: 1.0, 9616: 4.0, 9639: 29.0, 9916: 1.0, 10614: 1.0, 11104: 3.0, 11938: 1.0, 12109: 2.0, 12250: 1.0, 12710: 2.0, 12946: 2.0, 13142: 1.0, 14072: 1.0, 14280: 1.0, 14898: 1.0, 15889: 36.0, 15927: 2.0, 16332: 7.0, 17222: 6.0, 17353: 1.0, 17559: 1.0, 18748: 1.0, 19153: 1.0, 19208: 3.0, 19524: 1.0, 19635: 1.0, 19843: 1.0, 20998: 1.0, 21872: 1.0, 23574: 1.0, 23762: 1.0, 23776: 2.0, 23893: 1.0, 24145: 1.0, 24417: 7.0, 24698: 1.0, 24918: 1.0, 24980: 8.0, 25551: 4.0, 25570: 9.0, 25937: 1.0, 26445: 1.0, 27151: 2.0, 27353: 1.0, 27552: 1.0, 27584: 1.0, 28182: 1.0, 28402:
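The `SparseVector(262144, {...})` output above reflects the hashing trick: each token is hashed to an index in a fixed-size vector (262144 = 2^18 is the default `numFeatures` of `HashingTF`), and the value stored at that index counts how often the token occurs. A minimal pure-Python sketch of the idea follows. It uses `zlib.crc32` as a stand-in hash, whereas Spark uses MurmurHash3, so the indices will not match Spark's; the function name `hashing_tf` is hypothetical.

```python
import zlib
from collections import Counter

NUM_FEATURES = 1 << 18  # 262144, matching the vector size shown above


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map a list of tokens to a sparse {index: count} term-frequency dict."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash (assumption): Spark uses MurmurHash3, not CRC32.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(sorted(counts.items()))


tf = hashing_tf("the cat sat on the mat".split())
print(tf)
```

Distinct tokens can collide at the same index; a larger `num_features` lowers that risk at the cost of wider vectors.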

In [63]:
# Build the idf model and transform the original token frequencies into their tf-idf counterparts
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(tf) 
tfidf = idfModel.transform(tf)

In [64]:
tfidf.select("features").first()

Row(features=SparseVector(262144, {14: 1.893, 619: 2.203, 751: 5.5407, 1461: 4.8475, 1769: 86.3026, 1854: 5.0146, 2410: 3.3098, 2437: 7.3536, 4004: 5.6207, 4081: 3.6768, 4172: 6.0262, 4366: 5.2724, 4622: 3.8715, 4672: 5.4666, 4842: 2.1948, 4869: 3.0687, 5083: 4.1165, 5232: 4.8583, 5381: 0.5208, 5476: 9.3252, 6068: 3.122, 6079: 4.7044, 6194: 2.6291, 6258: 7.8018, 6355: 3.5838, 6369: 1.7272, 6972: 5.4666, 6981: 2.4357, 7612: 8.0864, 7838: 2.6291, 8267: 6.3138, 8630: 1.887, 8804: 2.4628, 8928: 5.7077, 9129: 3.7113, 9155: 5.7077, 9521: 2.2975, 9616: 4.7794, 9639: 0.5122, 9916: 2.0188, 10614: 2.9016, 11104: 4.8859, 11938: 2.7257, 12109: 6.5072, 12250: 4.3444, 12710: 3.4375, 12946: 7.8624, 13142: 0.9301, 14072: 2.749, 14280: 3.9009, 14898: 2.0465, 15889: 4.129, 15927: 11.8168, 16332: 0.4201, 17222: 15.3159, 17353: 3.3098, 17559: 3.2228, 18748: 3.5309, 19153: 2.0512, 19208: 2.1113, 19524: 3.8289, 19635: 2.0053, 19843: 4.2989, 20998: 1.5791, 21872: 1.6583, 23574: 2.1141, 23762: 5.7077, 23776: 
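The rescaling performed by `IDF` can be sketched in plain Python. Spark's documentation gives the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t; each term frequency is then multiplied by this weight. The toy documents and the function names below are made up for illustration.

```python
import math


def idf_weights(docs):
    """Compute smoothed IDF per index from a list of sparse {index: tf} dicts."""
    m = len(docs)
    doc_freq = {}
    for doc in docs:
        for index in doc:
            doc_freq[index] = doc_freq.get(index, 0) + 1
    return {i: math.log((m + 1) / (df + 1)) for i, df in doc_freq.items()}


def tfidf(doc, idf):
    """Scale each term frequency in a document by the IDF of its index."""
    return {i: tf * idf[i] for i, tf in doc.items()}


# Three toy documents: index 0 occurs in all of them, index 2 in only one.
docs = [{0: 2.0, 1: 1.0}, {0: 1.0, 2: 3.0}, {0: 1.0, 1: 1.0}]
idf = idf_weights(docs)
weighted = [tfidf(d, idf) for d in docs]
print(weighted[0])
```

A term that appears in every document gets a weight of zero. This is consistent with the output above, where index 9639 has a raw count of 29 in the first document but a TF-IDF value of only 0.5122, which suggests a term present in nearly all documents.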

Do a test run first to practise with the `LogisticRegression` class. I like to create instances of objects first to check their methods and docstrings and to figure out how to access their data.

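Build a logistic regression model for the binary `toxic` column.
Use the `features` column (the TF-IDF values) as the input vectors, `X`, and the `toxic` column as the output vector, `y`. In the cells below, which run against the articles data, the label is instead obtained by indexing the `Category` column with `StringIndexer`.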

In [69]:
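REG = 0.1  # regularisation strength passed to LogisticRegression as regParam
# Encode the string categories as numeric labels (0.0, 1.0, ...)
label_stringIdx = StringIndexer(inputCol="Category", outputCol="label")
pipeline = Pipeline(stages=[label_stringIdx])
pipelineFit = pipeline.fit(tfidf)
dataset = pipelineFit.transform(tfidf)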

In [70]:
lr = LogisticRegression(featuresCol="features", labelCol='label', regParam=REG)


In [72]:
dataset.show(5)

+--------+--------------------+--------------------+--------------------+--------------------+-----+
|Category|                Body|               words|         rawFeatures|            features|label|
+--------+--------------------+--------------------+--------------------+--------------------+-----+
|politics|WITH THE ARRIVAL ...|[with, the, arriv...|(262144,[14,619,7...|(262144,[14,619,7...|  0.0|
|business|TENS OF THOUSANDS...|[tens, of, thousa...|(262144,[511,513,...|(262144,[511,513,...|  1.0|
|politics|WASHINGTON  PRESI...|[washington, , pr...|(262144,[329,1769...|(262144,[329,1769...|  0.0|
|business|OMAHA  ELON MUSK ...|[omaha, , elon, m...|(262144,[408,1232...|(262144,[408,1232...|  1.0|
|politics|REUTERS    THE TR...|[reuters, , , , t...|(262144,[5280,538...|(262144,[5280,538...|  0.0|
+--------+--------------------+--------------------+--------------------+--------------------+-----+
only showing top 5 rows
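The `label` column above comes from `StringIndexer`, which by default assigns indices in order of descending label frequency, so the most frequent category becomes 0.0. The sketch below mimics that ordering in plain Python. The sample labels are made up, the function name `fit_string_index` is hypothetical, and ties are broken alphabetically as an assumption (tie-breaking may differ between Spark versions).

```python
from collections import Counter


def fit_string_index(labels):
    """Return {label: index}, with the most frequent label mapped to 0.0."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically for ties (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


labels = ["politics", "business", "politics", "business", "politics"]
mapping = fit_string_index(labels)
print(mapping)
print([mapping[label] for label in labels])
```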



In [73]:
lrModel = lr.fit(dataset.limit(5000))

IllegalArgumentException: 'Field "label_stringIdx" does not exist.'

In [18]:
res_train = lrModel.transform(tfidf)

In [19]:
res_train.select("id", "toxic", "probability", "prediction").show(20)

In [20]:
res_train.show(5)

#### Select the probability column
---
Create a user-defined function (UDF) that selects the second element of each `probability` vector, i.e. the predicted probability of the class with label 1.

In [21]:
extract_prob = F.udf(lambda x: float(x[1]), T.FloatType())

In [22]:
(res_train.withColumn("proba", extract_prob("probability"))
 .select("proba", "prediction")
 .show())

### Create the results DataFrame
---
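Convert the test text to TF-IDF vectors, reusing the tokenizer, the hashing transformer and the IDF model fitted on the training data.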

In [23]:
test_tokens = tokenizer.transform(test)
test_tf = hashingTF.transform(test_tokens)
test_tfidf = idfModel.transform(test_tf)

Initialize the new DataFrame with the id column

In [24]:
test_res = test.select('id')
test_res.head()

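Make predictions for each class: fit one logistic regression model per label column in `out_cols` and append its predicted probability to the results DataFrame.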

In [25]:
test_probs = []
for col in out_cols:
    print(col)
    lr = LogisticRegression(featuresCol="features", labelCol=col, regParam=REG)
    print("...fitting")
    lrModel = lr.fit(tfidf)
    print("...predicting")
    res = lrModel.transform(test_tfidf)
    print("...appending result")
    test_res = test_res.join(res.select('id', 'probability'), on="id")
    print("...extracting probability")
    test_res = test_res.withColumn(col, extract_prob('probability')).drop("probability")
    test_res.show(5)

In [26]:
test_res.show(5)

In [27]:
test_res.coalesce(1).write.csv('./results/spark_lr.csv', mode='overwrite', header=True)

The output is actually a directory and not a single CSV file. Within the directory there are one or more part files, which together make up the complete CSV results. I used the `cat` command to concatenate these part files.

In [28]:
!cat results/spark_lr.csv/part*.csv > spark_lr.csv
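Because the results were written with `header=True`, each part file starts with its own header row, so plain `cat` would repeat the header if there were more than one part (with `coalesce(1)` there is normally only one). A minimal Python sketch that merges part files while keeping a single header follows; the function name `merge_csv_parts` and the demo files are hypothetical.

```python
import glob
import os
import tempfile


def merge_csv_parts(part_dir, out_path):
    """Concatenate part*.csv files in part_dir into out_path, keeping one header."""
    parts = sorted(glob.glob(os.path.join(part_dir, "part*.csv")))
    with open(out_path, "w", newline="") as out:
        for n, part in enumerate(parts):
            with open(part, newline="") as src:
                header = src.readline()
                if n == 0:
                    out.write(header)  # keep the header from the first part only
                for line in src:
                    out.write(line)
    return len(parts)


# Demo with two small made-up part files in a temporary directory.
tmp = tempfile.mkdtemp()
for name, body in [("part-00000.csv", "id,toxic\n1,0.1\n"),
                   ("part-00001.csv", "id,toxic\n2,0.9\n")]:
    with open(os.path.join(tmp, name), "w", newline="") as f:
        f.write(body)

merged_path = os.path.join(tmp, "merged.csv")
n_parts = merge_csv_parts(tmp, merged_path)
with open(merged_path, newline="") as f:
    merged = f.read()
print(merged)
```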

In [29]:
ls

This submission scores 0.8797 on the public leaderboard.