# ML : Tag classification

For this last part, we wanted to perform multilabel classification of the tags based on the text of each question.
The constraint was to use:
1. `pyspark.ml` library
2. `pyspark.mllib` library
3. A third party library in a spark pipeline

We were disappointed to learn that `pyspark.ml` and `pyspark.mllib` have no multilabel classification implementation for their algorithms.
Indeed, they can do multiclass classification but not multilabel. One way around this issue would have been to use a string indexer to turn our multilabel problem into a single-label one:
Example: if we have three classes $A, B, C$, the string indexer creates the new classes $AB, BC, AC, ABC$, so the model simply becomes a single-label multiclass problem.
There are several problems with this method:
- Dimensionality: we have far more than three tags in the sample. It would have required tremendous computation for really poor results.
- A tremendous loss of information: we would have lost part of the label correlation.
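
To make the dimensionality argument concrete, here is a minimal plain-Python sketch (not part of the original pipeline) that enumerates the classes such an encoding would have to distinguish. The function name `label_powerset_classes` is only illustrative.

```python
from itertools import combinations


def label_powerset_classes(tags):
    """Enumerate every non-empty tag combination, i.e. the classes that a
    single-label encoding of a multilabel problem has to distinguish."""
    return [
        "".join(combo)
        for size in range(1, len(tags) + 1)
        for combo in combinations(tags, size)
    ]


classes = label_powerset_classes(["A", "B", "C"])
print(classes)       # ['A', 'B', 'C', 'AB', 'AC', 'BC', 'ABC']
print(len(classes))  # 7, i.e. 2**3 - 1

# The number of possible classes roughly doubles with each additional tag.
for n_tags in (3, 10, 20):
    print(n_tags, 2 ** n_tags - 1)
```

With only 20 distinct tags the upper bound already exceeds one million classes, which is why this route was not practical on our sample.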

After some computationally expensive and inconclusive tests, we decided to do only multiclass classification on the 10 most used tags (which are often the most important ones, as we already noticed).

However, concerning the third part of the project, we decided to combine `tf.Keras` and `pyspark` (thanks to Elephas) to do this multilabel classification with a small neural network. 

In [1]:
import os
os.chdir(os.environ['HOME'])

import stack_overflow_functions.DataLoader as data_loader
import stack_overflow_functions.DataTransformation as data_transfo
from pycountry_convert import country_name_to_country_alpha3
from pyspark.sql import SparkSession
import patoolib
import gdown
# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql.types import StructField, StructType, StringType, ArrayType,IntegerType

import pyspark
import sparknlp
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pycountry_convert import country_alpha2_to_continent_code, country_name_to_country_alpha2
from geopy.geocoders import Nominatim
import pyspark.sql.functions as F
from pyspark.sql.types import LongType, StringType
import pandas as pd
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import geopandas as gpd 
import matplotlib.pyplot as plt
import seaborn as sns
import json


from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
import pyspark.ml.feature as sf
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

seed = 2020

In [2]:
spark = sparknlp.start()
conf = (pyspark
        .SparkConf()
        .set("spark.ui.showConsoleProgress", "true")
       )
sc = pyspark.SparkContext.getOrCreate(conf=conf)
sqlcontext = pyspark.SQLContext(sc)

### First part : clean the data
We use a pipeline similar to the one from our feature/target analysis in order to perform ML on its output. We chose not to put it into a module and call it, so that the steps we performed stay clear.
We also chose not to include those steps in the ML pipeline, since they are computationally costly and unlikely to change.

#### Reads the data

In [3]:
%%time
post_dir = "Data/sample/Posts"
posts = (sqlcontext
         .read
         .format("parquet")
         .option("header",True)
         .load(post_dir)
         .sample(False, 0.01)
         .select("Id",
                  F.concat_ws(' ',F.col('Title'),F.col('Body')).alias("full_text"),
                  "Tags"
                )
        )

CPU times: user 25.3 ms, sys: 4.69 ms, total: 30 ms
Wall time: 1min 13s


#### Splits tags

In [4]:
tags_split = F.regexp_replace(F.regexp_replace(
    F.regexp_replace(F.col('Tags'), '&lt;', ''), "&gt;", "<split_token>"), " ", "")

udf_drop = F.udf(lambda x: re.sub("'","",str(x[:-1])[1:-1]) if isinstance(x,list) else None,StringType())

posts = (
    posts
    .withColumn('Splitted_tags', tags_split)
    .withColumn('Splitted_tags', F.split(F.col("Splitted_tags"), "<split_token>"))
    .withColumn('Splitted_tags', udf_drop(F.col("Splitted_tags")))
    .withColumn('Splitted_tags', F.split(F.col("Splitted_tags"),","))
    .drop('Tags')
)
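
The chain of column operations above can be hard to follow, so here is a plain-Python sketch of the same steps on a single string. It assumes the raw `Tags` value looks like `&lt;tag1&gt;&lt;tag2&gt;` (inferred from the regular expressions used); `split_tags` is a hypothetical helper, not a function from the project.

```python
import re


def split_tags(raw):
    """Plain-Python mirror of the Spark column operations on one value."""
    # 1. Drop '&lt;', turn '&gt;' into a separator, remove spaces.
    s = raw.replace("&lt;", "").replace("&gt;", "<split_token>").replace(" ", "")
    # 2. Split on the separator: the last element is an empty string.
    parts = s.split("<split_token>")
    # 3. Same logic as udf_drop: drop the trailing empty element, render the
    #    list as a string, strip the brackets and the quotes.
    joined = re.sub("'", "", str(parts[:-1])[1:-1])
    # 4. Split again on commas to get back a list of tags.
    return joined.split(",")


result = split_tags("&lt;python&gt;&lt;pandas&gt;&lt;apache-spark&gt;")
print(result)  # ['python', ' pandas', ' apache-spark']
```

Note the leading space left on every tag but the first: it comes from the `', '` separator in the string form of the list, so the tags still need to be trimmed before they are counted or compared.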



#### Clean text

In [6]:
input_col = "full_text"
clean_up_patterns = [
                    "p&gt;"
                    ,"&.*?;\space"
                    ,'&.*?;'                
                    ,"/.*?;"
                    ,"/code"
                    ,"/pre"
                    ,'/p'
                    ,"/a"
                    ,"href="
                    ,"lt;"
                    ,"gt;"
                    ,"[^\w\s]"
                    ,r"\b\d+\b"
                  ]


# Document assembler : wraps the raw text into Spark NLP document annotations
documentAssembler = DocumentAssembler() \
    .setInputCol(input_col) \
    .setOutputCol('_intermediate_results')

# Document normalizer : normalize the document
# by lowercasing, removing non utf8 chars
# and removing the regex patterns defined above
doc_norm = DocumentNormalizer() \
    .setInputCols("_intermediate_results") \
    .setOutputCol(input_col + "_cleaned") \
    .setAction("clean") \
    .setPatterns(clean_up_patterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

# Tokenizer : splits the cleaned document into tokens
# and strips undesired characters (punctuation etc.),
# preparing the column for the stopwords
# remover
tokenizer = Tokenizer() \
    .setInputCols([input_col + "_cleaned"]) \
    .setOutputCol("token") \
    .setSplitChars(['-']) \
    .setContextChars(['(', ')', '?', '!']) \
    .setSplitPattern("'") \
    .setMinLength(0) \
    .setMaxLength(99999) \
    .setCaseSensitiveExceptions(False)


# StopWordsCleaner : remove
# the stopwords based on
# a predefined list
Stop_words_cleaner = (
    StopWordsCleaner()
    .pretrained("stopwords_en", "en")
    .setInputCols(["token"])
    .setOutputCol(input_col + "_without_stopwords") 
    .setCaseSensitive(False) 
    .setLazyAnnotator(False)
)

# Lemmatize the text
# using the lemma dictionary
# loaded below
Lemmatizer_cleaner = (
    Lemmatizer() 
    .setInputCols([input_col + "_without_stopwords"]) 
    .setOutputCol(input_col + "_lemmatized") 
    .setDictionary("./Data/lemmatizer/AntBNC_lemmas_ver_001.txt", value_delimiter ="\t", key_delimiter = "->") 
    .setLazyAnnotator(False)
)


# Create the pipeline
cleaning_pipeline = (
    Pipeline() 
    .setStages([
        documentAssembler,
        doc_norm,
        tokenizer,
        #Document_cleaner,
        Stop_words_cleaner,
        Lemmatizer_cleaner])
)


posts_ml = (
    cleaning_pipeline
    .fit(posts)
    .transform(posts)
    .select(F.col("Id"),
            F.col(input_col),
            F.col(input_col + "_lemmatized.result"),
            F.col("Splitted_tags")
           )
) 


stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]


### Second Part : `pyspark.ml` pipeline 

In [8]:
top = 10
tags = (posts_ml
        .select(F.explode('Splitted_tags').alias("tags"))
        .select(F.trim("tags").alias("tags"))
        .groupBy('tags')
        .count()
       ).toPandas()


In [14]:
top = 10
top10 = tags.sort_values(by ="count",ascending = False).head(top).tags.tolist()
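
For readers less familiar with the Spark calls, the explode / trim / groupBy / count / sort sequence above amounts to a frequency count over all tags. A minimal plain-Python sketch on toy data (the `sample` list and `top_tags` helper are made up for illustration):

```python
from collections import Counter


def top_tags(tag_lists, k):
    """Count every trimmed tag across all posts and return the k most frequent."""
    counts = Counter(tag.strip() for tags in tag_lists for tag in tags)
    return [tag for tag, _ in counts.most_common(k)]


# Toy stand-in for the Splitted_tags column (note the untrimmed tags).
sample = [
    ["python", " pandas"],
    ["java"],
    ["python", " numpy"],
    ["java", " spring"],
    ["python"],
]
print(top_tags(sample, 2))  # ['python', 'java']
```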

In [15]:
posts_ml = (
    posts_ml
    .withColumn('language_deduced',
                data_transfo.udf_detect_language(top10)(F.col('Splitted_tags'))))

In [20]:
posts_multi_class = (posts_ml
                     .select(F.col('result').alias("text"),
                            data_transfo.udf_detect_language(top10)(F.col('Splitted_tags')).alias("first_label"))
                     .where(F.col('first_label').isNotNull())
                    )
posts_multi_class.cache().count()

11243
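
The helper `data_transfo.udf_detect_language` lives in our own module and is not shown in this notebook. As a rough idea of the kind of logic involved, the sketch below keeps the first tag of a post that belongs to the list of allowed tags, and returns `None` otherwise (which is what the `isNotNull` filter above relies on). This is an assumption about its behaviour, and `first_matching_tag` is a hypothetical name, not the actual implementation.

```python
def first_matching_tag(tags, allowed):
    """Return the first (trimmed) tag found in `allowed`, or None.

    Hypothetical stand-in for the project's own helper.
    """
    if tags is None:
        return None
    allowed = set(allowed)
    for tag in tags:
        cleaned = tag.strip()
        if cleaned in allowed:
            return cleaned
    return None


top10_example = ["javascript", "java", "python"]  # toy stand-in for top10
print(first_matching_tag(["node.js", " javascript", " java"], top10_example))  # javascript
print(first_matching_tag(["haskell"], top10_example))  # None
```

In Spark, such a function would typically be wrapped with `F.udf(..., StringType())`, as done for `udf_drop` earlier.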

In [21]:
posts_multi_class.show()

+--------------------+-----------+
|                text|first_label|
+--------------------+-----------+
|[node, js, consid...| javascript|
|[webflux, applica...|       java|
|[make, scroll, bu...| javascript|
|[return, referenc...|        c++|
|[state, variable,...| javascript|
|[disable, screens...|     jquery|
|[onimport, exist,...| javascript|
|[net, framework, ...|         c#|
|[ie11, polyfill, ...| javascript|
|[race, condition,...|       java|
|[handle, empty, n...|         c#|
|[load, file, asse...|         c#|
|[append, insert, ...|        c++|
|[stripe, js, crea...| javascript|
|[read, file, line...|        c++|
|[python, panda, d...|     python|
|[problem, recycle...|       java|
|[response, reques...|     python|
|[show, opencv, vi...|        c++|
|[pass, reference,...|         c#|
+--------------------+-----------+
only showing top 20 rows



In [22]:
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
train_split, test_split = posts_multi_class.randomSplit(weights=[0.80, 0.20], seed=seed)

In [23]:
train_split.show()

+--------------------+-----------+
|                text|first_label|
+--------------------+-----------+
|[add, soft, const...|     python|
|[app, crash, send...|    android|
|[append, insert, ...|        c++|
|[asp, net, member...|         c#|
|[aw, sam, deploym...|     python|
|[break, image, gi...|       html|
|[call, dropdown, ...| javascript|
|[call, wordpress,...|        php|
|[code, coverage, ...|       java|
|[code, doesn, exe...| javascript|
|[convert, date, l...|       java|
|[delete, characte...|    android|
|[detect, http, re...|     python|
|[df, to_sql, expo...|     python|
|[disable, screens...|     jquery|
|[display, data, m...|     python|
|[force, return, s...|       java|
|[form, input, slo...| javascript|
|[fragment, disapp...|       java|
|[function, provid...|    android|
+--------------------+-----------+
only showing top 20 rows



In [25]:
features = "text"
label = "first_label"
# Configure an ML pipeline with four stages:
# CountVectorizer (TF), IDF, StringIndexer and LogisticRegression.
# TF
cv = sf.CountVectorizer(inputCol=features, outputCol="tf_features", vocabSize=100, minDF=0)

# IDF
idf = sf.IDF(inputCol="tf_features", outputCol="features")

# Label encoder 
label_string= sf.StringIndexer(inputCol=label, outputCol ="label")

# Logistic regression
lr = LogisticRegression(maxIter=10, regParam=0.001,family="multinomial")
pipeline = Pipeline(stages=[cv, idf, label_string, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(train_split)
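
To illustrate what the `CountVectorizer` and `IDF` stages compute, here is a small plain-Python sketch on toy token lists. It uses the smoothed formula documented for Spark's `IDF`, $\log\frac{m+1}{df(t)+1}$ with $m$ the number of documents, and ignores the `vocabSize` and `minDF` settings; the `tf_idf` helper and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF is the raw term count; IDF is log((m + 1) / (df + 1)).
    """
    m = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((m + 1) / (n + 1)) for term, n in df.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]


docs = [
    ["python", "panda", "dataframe"],
    ["java", "thread", "race"],
    ["python", "request", "python"],
]
weights = tf_idf(docs)
print(weights[2]["python"])  # 2 * log(4/3), about 0.575
print(weights[1]["java"])    # 1 * log(4/2), about 0.693
```

A term that appears in many documents, such as `python` here, gets a lower weight per occurrence than a rarer one such as `java`.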

In [27]:
features = "text"
label = "first_label"

# Logistic regression
lr = LogisticRegression(maxIter=20, regParam=0.001,family="multinomial")
pipeline = Pipeline(stages=[cv, idf, label_string, lr])
paramGrid = ParamGridBuilder() \
    .addGrid(cv.vocabSize, [10, 100, 1000, 5000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=4)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(train_split)
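
This search is expensive because every combination of the grid is fitted once per fold, and the best combination is then refitted on the whole training split. A quick plain-Python sketch of the bookkeeping, reusing the values from the cell above:

```python
from itertools import product

vocab_sizes = [10, 100, 1000, 5000]
reg_params = [0.1, 0.01]
num_folds = 4

# Cartesian product of the two parameter lists, as ParamGridBuilder builds it.
grid = [
    {"vocabSize": v, "regParam": r}
    for v, r in product(vocab_sizes, reg_params)
]

# One fit per (combination, fold), plus the final refit of the best combination.
n_fits = len(grid) * num_folds + 1

print(len(grid))  # 8
print(n_fits)     # 33
```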

In [34]:
import pandas as pd


def get_cv_results(crossval_model):
    params = [{p.name: v for p, v in m.items()} for m in crossval_model.getEstimatorParamMaps()]

    results = pd.DataFrame.from_dict([
        {crossval_model.getEvaluator().getMetricName(): metric, **ps} 
        for ps, metric in zip(params, crossval_model.avgMetrics)
    ])
    return results
get_cv_results(cvModel)

|   | f1       | vocabSize | regParam |
|---|----------|-----------|----------|
| 0 | 0.187703 | 10        | 0.1      |
| 1 | 0.238252 | 10        | 0.01     |
| 2 | 0.444845 | 100       | 0.1      |
| 3 | 0.533043 | 100       | 0.01     |
| 4 | 0.63172  | 1000      | 0.1      |
| 5 | 0.650068 | 1000      | 0.01     |


In [45]:
cvModel.transform(test_split).show()

+--------------------+-----------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|                text|first_label|         tf_features|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|[change, colour, ...|        c++|(1000,[0,20,26,45...|(1000,[0,20,26,45...|  6.0|[0.75101179459094...|[0.15558004455562...|       1.0|
|[count, number, c...| javascript|(1000,[0,1,6,11,2...|(1000,[0,1,6,11,2...|  0.0|[4.12350066461177...|[0.82773803083438...|       0.0|
|[difference, arra...|        c++|(1000,[0,6,19,32,...|(1000,[0,6,19,32,...|  6.0|[0.11159850564472...|[0.09126100108031...|       3.0|
|[fail, close, pro...|       java|(1000,[0,1,13,52,...|(1000,[0,1,13,52,...|  1.0|[-0.0134968624263...|[0.07976073056323...|       1.0|
|[file, operation,...|       java|(1000,[0,9,33,
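
The `label` and `prediction` columns in the output above are all that is needed to score the model. The `f1` metric reported by `MulticlassClassificationEvaluator` is a weighted F1: the F1 of each class, weighted by the share of that class among the true labels. Below is a minimal plain-Python sketch of that definition; the `weighted_f1` helper is illustrative, and the four pairs are only the rows visible above, used as toy input and not as a result.

```python
from collections import Counter


def weighted_f1(pairs):
    """Weighted F1 from an iterable of (prediction, label) pairs."""
    pairs = list(pairs)
    true_pos = Counter()
    pred_count = Counter()
    label_count = Counter()
    for pred, label in pairs:
        pred_count[pred] += 1
        label_count[label] += 1
        if pred == label:
            true_pos[label] += 1

    total = len(pairs)
    score = 0.0
    for label, support in label_count.items():
        predicted = pred_count[label]
        precision = true_pos[label] / predicted if predicted else 0.0
        recall = true_pos[label] / support
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight each class by its frequency among the true labels.
        score += f1 * support / total
    return score


# (prediction, label) for the four complete rows shown above.
toy_pairs = [(1.0, 6.0), (0.0, 0.0), (3.0, 6.0), (1.0, 1.0)]
print(weighted_f1(toy_pairs))  # 5/12, about 0.417
```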

In [38]:
from pyspark.mllib.evaluation import MulticlassMetrics



# Score the test set on the driver: calling model.transform inside an
# RDD map would require pickling the model, which raises a PicklingError
predictionAndLabels = (model
                       .transform(test_split)
                       .select("prediction", "label")
                       .rdd
                       .map(lambda row: (float(row.prediction), float(row.label))))

# Instantiate metrics object
metrics = MulticlassMetrics(predictionAndLabels)

# Overall statistics (for single-label predictions, micro-averaged
# precision, recall and F1 are all equal to the accuracy)
print("Summary Stats")
print("Accuracy = %s" % metrics.accuracy)

# Statistics by class
class_labels = predictionAndLabels.map(lambda pl: pl[1]).distinct().collect()
for class_label in sorted(class_labels):
    print("Class %s precision = %s" % (class_label, metrics.precision(class_label)))
    print("Class %s recall = %s" % (class_label, metrics.recall(class_label)))
    print("Class %s F1 Measure = %s" % (class_label, metrics.fMeasure(class_label, beta=1.0)))

# Weighted stats
print("Weighted recall = %s" % metrics.weightedRecall)
print("Weighted precision = %s" % metrics.weightedPrecision)
print("Weighted F(1) Score = %s" % metrics.weightedFMeasure())
print("Weighted F(0.5) Score = %s" % metrics.weightedFMeasure(beta=0.5))
print("Weighted false positive rate = %s" % metrics.weightedFalsePositiveRate)

In [29]:
cvModel.transform(train_split).show()

+--------------------+-----------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|                text|first_label|         tf_features|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|[add, soft, const...|     python|(1000,[20,23,26,4...|(1000,[20,23,26,4...|  2.0|[-1.0823854528858...|[0.01368223073054...|       2.0|
|[app, crash, send...|    android|(1000,[0,1,2,3,5,...|(1000,[0,1,2,3,5,...|  5.0|[-4.5261350499050...|[1.12895913799774...|       5.0|
|[append, insert, ...|        c++|(1000,[0,8,32,91,...|(1000,[0,8,32,91,...|  6.0|[0.86899504376574...|[0.14314067762818...|       6.0|
|[asp, net, member...|         c#|(1000,[0,1,2,3,6,...|(1000,[0,1,2,3,6,...|  3.0|[-1.1403656300174...|[1.05380527777151...|       3.0|
|[aw, sam, deploym...|     python|(1000,[0,3,13,

1231