###  Text Processing and Classification with Apache Spark
---
This notebook performs basic text processing with Apache Spark on the toxic comment classification dataset.

Import the required libraries.
`pyspark.ml` provides machine learning on Spark DataFrames; `pyspark.mllib` is the older API built on RDDs. This notebook uses `pyspark.ml`.

In [1]:
import pandas as pd

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

In [2]:
# Build a Spark session (the entry point to the DataFrame API)
hc = (SparkSession.builder
                  .appName('Toxic Comment Classification')
                  .enableHiveSupport()
                  .config("spark.executor.memory", "4G")
                  .config("spark.driver.memory","18G")
                  .config("spark.executor.cores","7")
                  .config("spark.python.worker.memory","4G")
                  .config("spark.driver.maxResultSize","0")
                  .config("spark.sql.crossJoin.enabled", "true")
                  .config("spark.serializer","org.apache.spark.serializer.KryoSerializer")
                  .config("spark.default.parallelism","2")
                  .getOrCreate())

In [3]:
hc.sparkContext.setLogLevel('INFO')

In [4]:
hc.version

'3.0.3'

Load the data into a pandas DataFrame, then convert it to a Spark DataFrame.

In [7]:
def to_spark_df(fin):
    """
    Parse a filepath to a spark dataframe using the pandas api.
    
    Parameters
    ----------
    fin : str
        The path to the file on the local filesystem that contains the csv data.
        
    Returns
    -------
    df : pyspark.sql.dataframe.DataFrame
        A spark DataFrame containing the parsed csv data.
    """
    df = pd.read_csv(fin)
    df.fillna("", inplace=True)
    df = hc.createDataFrame(df)
    return df

# Load the train-test sets
train = to_spark_df("train.csv")
test = to_spark_df("test.csv")

In [8]:
# Label columns: every column except the id and the raw comment text
out_cols = [i for i in train.columns if i not in ["id", "comment_text"]]

Show the top 5 rows of the training dataset.

In [9]:

train.show(5)

+----------------+--------------------+-----+------------+-------+------+------+-------------+
|              id|        comment_text|toxic|severe_toxic|obscene|threat|insult|identity_hate|
+----------------+--------------------+-----+------------+-------+------+------+-------------+
|0000997932d777bf|Explanation
Why t...|    0|           0|      0|     0|     0|            0|
|000103f0d9cfb60f|D'aww! He matches...|    0|           0|      0|     0|     0|            0|
|000113f07ec002fd|Hey man, I'm real...|    0|           0|      0|     0|     0|            0|
|0001b41b1c6bb37e|"
More
I can't ma...|    0|           0|      0|     0|     0|            0|
|0001d958c54c6e35|You, sir, are my ...|    0|           0|      0|     0|     0|            0|
+----------------+--------------------+-----+------------+-------+------+------+-------------+
only showing top 5 rows



Show 5 rows of toxic comments in the data.

In [10]:
# View some toxic comments
train.filter(F.col('toxic') == 1).show(5)

+----------------+--------------------+-----+------------+-------+------+------+-------------+
|              id|        comment_text|toxic|severe_toxic|obscene|threat|insult|identity_hate|
+----------------+--------------------+-----+------------+-------+------+------+-------------+
|0002bcb3da6cb337|COCKSUCKER BEFORE...|    1|           1|      1|     0|     1|            0|
|0005c987bdfc9d4b|Hey... what is it...|    1|           0|      0|     0|     0|            0|
|0007e25b2121310b|Bye! 

Don't look...|    1|           0|      0|     0|     0|            0|
|001810bf8c45bf5f|You are gay or an...|    1|           0|      1|     0|     1|            1|
|00190820581d90ce|FUCK YOUR FILTHY ...|    1|           0|      1|     0|     1|            0|
+----------------+--------------------+-----+------------+-------+------+------+-------------+
only showing top 5 rows



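Tokenize each comment. `Tokenizer` lowercases the text and splits it on whitespace, so punctuation stays attached to the words.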

In [11]:

tokenizer = Tokenizer(inputCol="comment_text", outputCol="words")
wordsData = tokenizer.transform(train)
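
As a rough illustration of what the tokenizer does, the plain-Python sketch below lowercases a string and splits it on whitespace. The helper name `simple_tokenize` and the sample sentence are hypothetical and not part of Spark or the dataset; the function only approximates `Tokenizer`, and edge cases such as runs of whitespace may behave differently.

```python
def simple_tokenize(text):
    """Approximate pyspark.ml.feature.Tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


# Hypothetical sample sentence, used only for illustration.
tokens = simple_tokenize("Hello World, it's ME")
print(tokens)
```

Note that punctuation is not stripped, which matches tokens such as `d'aww!` and `man,` in the `words` column of the notebook output.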

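Count the words in each comment. `HashingTF` hashes every token to an index in a fixed-size vector (262,144 entries by default) and stores the term counts at those indices.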

In [12]:

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
tf = hashingTF.transform(wordsData)
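
The hashing trick behind this step can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: Spark uses MurmurHash3, whereas the sketch assumes `zlib.crc32` as a stand-in hash, so the indices it produces will not match the ones in the notebook output. The function name `hashing_tf` is hypothetical.

```python
import zlib
from collections import Counter

NUM_FEATURES = 2 ** 18  # 262144, the vector size seen in the notebook output


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket index and count occurrences per bucket.

    Assumption: crc32 stands in for the MurmurHash3 function Spark uses.
    """
    counts = Counter()
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1.0
    # Return a sparse {index: count} mapping, sorted by index.
    return dict(sorted(counts.items()))


vec = hashing_tf(["a", "b", "a"])
print(vec)
```

Because distinct tokens can hash to the same index, a bucket's count may combine several words; a large vector size keeps such collisions rare.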

Display the raw feature vectors of the first two rows.

In [13]:
tf.select('rawFeatures').take(2)

[Row(rawFeatures=SparseVector(262144, {6240: 1.0, 7221: 1.0, 9420: 1.0, 10214: 1.0, 11680: 1.0, 15494: 1.0, 19036: 1.0, 19208: 1.0, 23032: 1.0, 25000: 1.0, 26144: 1.0, 66299: 1.0, 67416: 1.0, 72125: 1.0, 74944: 1.0, 77971: 1.0, 79300: 1.0, 79968: 1.0, 89833: 1.0, 94488: 1.0, 95889: 3.0, 97171: 1.0, 101169: 1.0, 103863: 1.0, 110427: 1.0, 110510: 1.0, 116767: 1.0, 140784: 1.0, 141086: 1.0, 145284: 1.0, 151536: 1.0, 151751: 1.0, 166368: 1.0, 187114: 1.0, 219915: 1.0, 223402: 1.0, 229137: 1.0, 231630: 1.0, 233967: 1.0, 240944: 1.0, 253170: 1.0})),
 Row(rawFeatures=SparseVector(262144, {2195: 1.0, 4714: 1.0, 13283: 1.0, 48234: 1.0, 85939: 1.0, 108541: 1.0, 119702: 1.0, 121320: 1.0, 137179: 1.0, 141086: 1.0, 159767: 1.0, 165258: 1.0, 169800: 1.0, 212492: 1.0, 218233: 1.0, 224255: 1.0, 224850: 1.0, 249180: 1.0}))]

Fit the IDF model and transform the raw term frequencies into their TF-IDF counterparts.

In [15]:

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(tf) 
tfidf = idfModel.transform(tf)
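
To make the weighting concrete, here is a small plain-Python sketch of the smoothed formula that Spark documents for `IDF`, `idf(t) = log((m + 1) / (df(t) + 1))`, where `m` is the number of documents and `df(t)` is the number of documents containing term `t`. The helper names `idf_weights` and `tf_idf` and the two toy documents are hypothetical.

```python
import math


def idf_weights(docs):
    """Compute idf per index from a list of sparse {index: count} documents."""
    m = len(docs)
    doc_freq = {}
    for doc in docs:
        for idx in doc:
            doc_freq[idx] = doc_freq.get(idx, 0) + 1
    return {idx: math.log((m + 1) / (df + 1)) for idx, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Scale each raw term count by the idf weight of its index."""
    return {idx: count * idf[idx] for idx, count in doc.items()}


# Two toy documents: index 0 occurs in both, index 1 only in the first.
docs = [{0: 1.0, 1: 3.0}, {0: 2.0}]
idf = idf_weights(docs)
weighted = [tf_idf(d, idf) for d in docs]
print(idf, weighted)
```

A term that appears in every document gets a weight of zero, while rarer terms are weighted up in proportion to their raw count.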


In [16]:
tfidf.select("features").first()

Row(features=SparseVector(262144, {6240: 8.7614, 7221: 2.2023, 9420: 3.1522, 10214: 6.4668, 11680: 5.0275, 15494: 3.4215, 19036: 0.7385, 19208: 2.2441, 23032: 5.0114, 25000: 5.6868, 26144: 3.5877, 66299: 7.7906, 67416: 1.1947, 72125: 2.2731, 74944: 2.5138, 77971: 7.6235, 79300: 6.672, 79968: 9.9008, 89833: 3.0516, 94488: 8.4249, 95889: 1.2127, 97171: 2.0161, 101169: 1.734, 103863: 6.8445, 110427: 2.1174, 110510: 5.6685, 116767: 6.0244, 140784: 3.0482, 141086: 2.4778, 145284: 8.0682, 151536: 2.2414, 151751: 9.0358, 166368: 2.0431, 187114: 1.7657, 219915: 0.6965, 223402: 3.3517, 229137: 4.5705, 231630: 9.4953, 233967: 3.102, 240944: 1.7538, 253170: 2.6999}))



Build a logistic regression model for the binary `toxic` column.
Use the `features` column (the TF-IDF values) as the input vectors, `X`, and the `toxic` column as the output vector, `y`.

In [17]:
REG = 0.1

In [18]:
lr = LogisticRegression(featuresCol="features", labelCol='toxic', regParam=REG)

In [19]:
tfidf.show(5)

+----------------+--------------------+-----+------------+-------+------+------+-------------+--------------------+--------------------+--------------------+
|              id|        comment_text|toxic|severe_toxic|obscene|threat|insult|identity_hate|               words|         rawFeatures|            features|
+----------------+--------------------+-----+------------+-------+------+------+-------------+--------------------+--------------------+--------------------+
|0000997932d777bf|Explanation
Why t...|    0|           0|      0|     0|     0|            0|[explanation, why...|(262144,[6240,722...|(262144,[6240,722...|
|000103f0d9cfb60f|D'aww! He matches...|    0|           0|      0|     0|     0|            0|[d'aww!, he, matc...|(262144,[2195,471...|(262144,[2195,471...|
|000113f07ec002fd|Hey man, I'm real...|    0|           0|      0|     0|     0|            0|[hey, man,, i'm, ...|(262144,[18700,27...|(262144,[18700,27...|
|0001b41b1c6bb37e|"
More
I can't ma...|    0|       

Fit the logistic regression model on a 5,000-row subset of the data, then show the actual and predicted values for the first 20 rows.

In [20]:
lrModel = lr.fit(tfidf.limit(5000))

In [21]:
res_train = lrModel.transform(tfidf)

In [22]:
res_train.select("id", "toxic", "probability", "prediction").show(20)

+----------------+-----+--------------------+----------+
|              id|toxic|         probability|prediction|
+----------------+-----+--------------------+----------+
|0000997932d777bf|    0|[0.98678683901695...|       0.0|
|000103f0d9cfb60f|    0|[0.98540708815661...|       0.0|
|000113f07ec002fd|    0|[0.95200211834013...|       0.0|
|0001b41b1c6bb37e|    0|[0.99387665346745...|       0.0|
|0001d958c54c6e35|    0|[0.96614023864236...|       0.0|
|00025465d4725e87|    0|[0.95579030992665...|       0.0|
|0002bcb3da6cb337|    1|[0.26953126376852...|       1.0|
|00031b1e95af7921|    0|[0.96328869796765...|       0.0|
|00037261f536c51d|    0|[0.98418771489899...|       0.0|
|00040093b2687caa|    0|[0.96511370913371...|       0.0|
|0005300084f90edc|    0|[0.99999200637123...|       0.0|
|00054a5e18b50dd4|    0|[0.97181127787588...|       0.0|
|0005c987bdfc9d4b|    1|[0.04245009766964...|       1.0|
|0006f16e4e9f292e|    0|[0.99702547438289...|       0.0|
|00070ef96486d6f9|    0|[0.9813

Show 5 rows of the full prediction output.

In [23]:
res_train.show(5)

+----------------+--------------------+-----+------------+-------+------+------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|              id|        comment_text|toxic|severe_toxic|obscene|threat|insult|identity_hate|               words|         rawFeatures|            features|       rawPrediction|         probability|prediction|
+----------------+--------------------+-----+------------+-------+------+------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|0000997932d777bf|Explanation
Why t...|    0|           0|      0|     0|     0|            0|[explanation, why...|(262144,[6240,722...|(262144,[6240,722...|[4.31324067049035...|[0.98678683901695...|       0.0|
|000103f0d9cfb60f|D'aww! He matches...|    0|           0|      0|     0|     0|            0|[d'aww!, he, matc...|(262144,[2195,471...|(262144,[2195,471...
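
The `rawPrediction` and `probability` columns are linked by the logistic (sigmoid) function. The sketch below, in plain Python, checks this for the first row of the output above; the value of `raw` is copied from the truncated display, so the result only agrees to the digits shown.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# First rawPrediction entry of the first row (truncated in the notebook output).
raw = 4.31324067049035
p_not_toxic = sigmoid(raw)    # first element of the probability vector
p_toxic = 1.0 - p_not_toxic   # second element of the probability vector
print(p_not_toxic, p_toxic)
```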

#### Select the probability column
---
Create a user-defined function (UDF) that selects the second element of each `probability` vector, i.e. the predicted probability of the positive class (`toxic = 1`).

In [24]:
extract_prob = F.udf(lambda x: float(x[1]), T.FloatType())

Show the extracted probability next to the prediction.

In [25]:
(res_train.withColumn("proba", extract_prob("probability"))
 .select("proba", "prediction")
 .show())

+------------+----------+
|       proba|prediction|
+------------+----------+
| 0.013213161|       0.0|
| 0.014592912|       0.0|
|  0.04799788|       0.0|
|0.0061233467|       0.0|
|  0.03385976|       0.0|
|  0.04420969|       0.0|
|  0.73046875|       1.0|
|   0.0367113|       0.0|
| 0.015812285|       0.0|
|  0.03488629|       0.0|
| 7.993629E-6|       0.0|
| 0.028188722|       0.0|
|   0.9575499|       1.0|
|0.0029745256|       0.0|
| 0.018628526|       0.0|
| 0.005099778|       0.0|
|   0.8103168|       1.0|
| 0.023980903|       0.0|
|   0.0208695|       0.0|
| 0.012392127|       0.0|
+------------+----------+
only showing top 20 rows
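
The relationship between the two columns can be reproduced with a simple threshold. The sketch assumes the default `threshold` of 0.5 for binary `LogisticRegression` (the notebook does not override it); the helper name `predict` is hypothetical, and the sample values are copied from the output above.

```python
THRESHOLD = 0.5  # assumed: Spark's default for binary LogisticRegression


def predict(p_toxic, threshold=THRESHOLD):
    """Return 1.0 when the positive-class probability exceeds the threshold."""
    return 1.0 if p_toxic > threshold else 0.0


# A few `proba` values taken from the table above.
sample = [0.013213161, 0.04799788, 0.73046875, 0.9575499, 0.8103168]
predictions = [predict(p) for p in sample]
print(predictions)
```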

