<a href="https://colab.research.google.com/github/murat-gunay/NLP/blob/master/02_NLP_Projects/2-project_2_Turkish_sparkNLP_Classification/Turkish_Classificaiton_sparkNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Turkish Text Classification with Spark NLP

## Initializing PySpark on Colab

In [1]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp==2.6.2

openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)
Collecting pyspark==2.4.4
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130389 sha256=9b7c02bb713c48522fe428b99bfa20dc28fe7739af4a745834e48b4af40175e4
  Stored in directory: /root/.cache/pip/wheels/ab/09/4d/0d18423005

## Starting the Spark Session

In [2]:
import sparknlp
spark = sparknlp.start()
print("Version of SparkNLP:", sparknlp.version())
print("Version of Spark :", spark.version)

'2.4.4'

In [3]:
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

## Loading and Reading the Dataset

In [4]:
! wget https://raw.githubusercontent.com/murat-gunay/NLP/master/02_NLP_Projects/2-project_2_Turkish_sparkNLP_Classification/turkish_categorical_corpus.csv

--2020-10-06 07:20:07--  https://raw.githubusercontent.com/murat-gunay/NLP/master/02_NLP_Projects/2-project_2_Turkish_sparkNLP_Classification/turkish_categorical_corpus.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10627541 (10M) [text/plain]
Saving to: ‘turkish_categorical_corpus.csv’


2020-10-06 07:20:08 (54.5 MB/s) - ‘turkish_categorical_corpus.csv’ saved [10627541/10627541]



In [5]:
df_Spark = spark.read \
           .option("header", True) \
           .csv("turkish_categorical_corpus.csv")

In [None]:
df_Spark.show(5, truncate=55)

+--------+-------------------------------------------------------+
|category|                                                   text|
+--------+-------------------------------------------------------+
|siyaset | 3 milyon ile ön seçim vaadi mhp nin 10 olağan büyük...|
|siyaset | mesut_yılmaz yüce_divan da ceza alabilirdi prof dr ...|
|siyaset | disko lar kaldırılıyor başbakan_yardımcısı arınç di...|
|siyaset | sarıgül anayasa_mahkemesi ne gidiyor mustafa_sarıgü...|
|siyaset | erdoğan idamın bir haklılık sebebi var demek ki yer...|
+--------+-------------------------------------------------------+
only showing top 5 rows



In [None]:
df_Spark.groupBy("category").count().show()

+----------+-----+
|  category|count|
+----------+-----+
|   kultur |  700|
|  siyaset |  700|
|teknoloji |  700|
|   saglik |  700|
|  ekonomi |  700|
|     spor |  700|
|    dunya |  700|
+----------+-----+



## Removing extraneous underscores from the documents

In [None]:
df_Spark.take(2)

# The corpus joins multi-word names with "_" (underscores), e.g. "mesut_yılmaz", "koray_aydın". We replace them with spaces.

[Row(category='siyaset ', text=' 3 milyon ile ön seçim vaadi mhp nin 10 olağan büyük kurultayı nda konuşan genel başkan adayı koray_aydın seçimlerden önce partinin üye sayısının 3 milyona ulaştırılması hedefini koyarak ön seçim uygulaması vaadinde bulundu mhp nin 10 olağan büyük kurultayı nda konuşan genel başkan adayı koray_aydın seçimlerden önce partinin üye sayısının 3 milyona ulaştırılması hedefini koyarak ön seçim uygulaması vaadinde bulundu genel_başkan adayı koray_aydın kürsüye beklenirken yapılan tezahüratlar ve ıslıklamalar üzerine divan başkanı tuğrul_türkeş mhp nin genel başkanlığı da genel başkan adaylığı da saygıdeğer işlerdir bu salondaki herkes ciddiye almak zorundadır dedi ve taşkınlıklara izin verilmeyeceğini salonda sükunet sağlanmadan konuşmaların başlamayacağını vurguladı türkeş devlet_bahçeli nin kurultay açılışında konuştuğu için adaylık nedeniyle ikinci bir konuşma yapmayacağını açıkladı konuşmasında kurultayın mhp nin tek başına iktidarına vesile olmasını dileye

In [6]:
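# Import only what is needed; a wildcard import would shadow Python built-ins such as sum, min and max
from pyspark.sql.functions import regexp_replace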
df_Spark = df_Spark.withColumn('text', regexp_replace('text', '_', ' '))
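The effect of this replacement can be checked on a single string without Spark. The following is a minimal sketch using Python's `re` module; the helper name `remove_underscores` is hypothetical, and the sample phrase is taken from the first row shown above.

```python
import re


def remove_underscores(text: str) -> str:
    """Replace every underscore with a single space (hypothetical helper)."""
    return re.sub("_", " ", text)


# Sample phrase copied from the first document of the corpus.
sample = "genel başkan adayı koray_aydın"
cleaned = remove_underscores(sample)
print(cleaned)
```

Like the Spark call, this only swaps the character; it does not collapse repeated spaces or trim the leading space that the documents carry.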

In [None]:
df_Spark.show(5, truncate=100)

+--------+----------------------------------------------------------------------------------------------------+
|category|                                                                                                text|
+--------+----------------------------------------------------------------------------------------------------+
|siyaset | 3 milyon ile ön seçim vaadi mhp nin 10 olağan büyük kurultayı nda konuşan genel başkan adayı kor...|
|siyaset | mesut yılmaz yüce divan da ceza alabilirdi prof dr sacit adalı isviçre deki banka eski başbakan ...|
|siyaset | disko lar kaldırılıyor başbakan yardımcısı arınç disko diye tabir edilen disiplin koğuşlarının k...|
|siyaset | sarıgül anayasa mahkemesi ne gidiyor mustafa sarıgül ilçedeki sınır değişikliğine itiraz için an...|
|siyaset | erdoğan idamın bir haklılık sebebi var demek ki yeri geldiği zaman idamın bir haklılık sebebi de...|
+--------+----------------------------------------------------------------------------------------------------+

## Splitting the dataset into training and testing sets

In [7]:
train_news, test_news = df_Spark.randomSplit([0.8, 0.2], seed = 100)

In [None]:
train_news.count()

3889

In [None]:
test_news.count()

1011
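`randomSplit` assigns each row to a split at random, so the resulting sizes only approximate the requested 80/20 weights. The short sketch below recomputes the realized shares from the two counts printed above; nothing else is assumed.

```python
# Row counts reported by train_news.count() and test_news.count() above.
train_count = 3889
test_count = 1011

total = train_count + test_count
train_fraction = train_count / total
test_fraction = test_count / total

print(f"total rows  : {total}")
print(f"train share : {train_fraction:.4f}")
print(f"test share  : {test_fraction:.4f}")
```

The total matches the 7 categories of 700 documents each shown in the `groupBy` output.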

## Setting up the Pipeline for the ``LogisticRegression`` and ``NaiveBayes`` models

In [8]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

stop_words = StopWordsCleaner.pretrained('stopwords_tr', 'tr')\
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

lemmatizer = LemmatizerModel.pretrained("lemma", "tr") \
         .setInputCols(["cleanTokens"]) \
         .setOutputCol("lemma")

finisher = Finisher() \
    .setInputCols(["lemma"]) \
    .setOutputCols(["token_features"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)

stopwords_tr download started this may take some time.
Approximate size to download 2 KB
[OK!]
lemma download started this may take some time.
Approximate size to download 14.8 MB
[OK!]


In [9]:
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, IndexToString
from pyspark.ml.classification import LogisticRegression, NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

## Text Classification with `LogisticRegression`

In [44]:
hashTF = HashingTF(inputCol="token_features", outputCol="raw_features")

idf = IDF(inputCol="raw_features", outputCol="features", minDocFreq=5)

label_strIdx = StringIndexer(inputCol="category", outputCol="label")

logReg = LogisticRegression(maxIter=10)

label_Idxstr = IndexToString(inputCol="label", outputCol="article_class")

nlp_pipeline_lr = Pipeline(
        stages=[document, 
                sentence,
                token,
                stop_words, 
                lemmatizer, 
                finisher,
                hashTF,
                idf,
                label_strIdx,
                logReg,
                label_Idxstr])
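To make the `HashingTF` and `IDF` stages less opaque, here is a minimal pure-Python sketch of the same idea: tokens are hashed into a fixed number of buckets, counted, and then reweighted by a smoothed inverse document frequency, `log((m + 1) / (df + 1))`. The function names, the tiny bucket count, and the sample tokens are illustrative assumptions, and `zlib.crc32` is used only as a stand-in for the MurmurHash3 function that Spark applies.

```python
import math
import zlib
from collections import Counter


def hashed_tf(tokens, num_features=16):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in for illustration; Spark uses MurmurHash3.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1
    return dict(counts)


def idf_weights(tf_docs, min_doc_freq=0):
    """Smoothed IDF per bucket; buckets seen in too few documents get 0."""
    m = len(tf_docs)
    doc_freq = Counter()
    for tf in tf_docs:
        doc_freq.update(tf.keys())
    return {
        bucket: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for bucket, df in doc_freq.items()
    }


# Hypothetical lemmatized documents.
docs = [["seçim", "parti", "seçim"], ["maç", "gol"], ["parti", "maç"]]
tf_docs = [hashed_tf(doc) for doc in docs]
idf = idf_weights(tf_docs)
tfidf_docs = [{b: c * idf[b] for b, c in tf.items()} for tf in tf_docs]
print(tfidf_docs[0])
```

With few buckets, distinct tokens can collide and share a count, which is the trade-off a smaller `numFeatures` makes in exchange for a smaller feature vector.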

In [45]:
classification_model_lr = nlp_pipeline_lr.fit(train_news)

In [46]:
pred_lr = classification_model_lr.transform(test_news)

In [22]:
pred_lr.select("category", "label", "prediction").show(5)

+--------+-----+----------+
|category|label|prediction|
+--------+-----+----------+
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       3.0|
+--------+-----+----------+
only showing top 5 rows



- Evaluation of Classification (`LogisticRegression`)

In [23]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(pred_lr)
print("Accuracy = %g" % (accuracy))
print("Test Error = %g " % (1.0 - accuracy))

Accuracy = 0.900099
Test Error = 0.0999011 
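The accuracy metric is simply the fraction of rows whose `prediction` equals `label`. A minimal sketch of that computation follows; the `accuracy` helper and the five sample rows are hypothetical. The printed value above is consistent with 910 of the 1011 test rows being correct, which is an inference from the numbers rather than a count the notebook displays.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical rows, not taken from the test set.
labels = [3.0, 3.0, 1.0, 0.0, 6.0]
predictions = [3.0, 1.0, 1.0, 0.0, 6.0]

acc = accuracy(labels, predictions)
print("Accuracy = %g" % acc)
print("Test Error = %g" % (1.0 - acc))

# Inferred from the reported output: 910 correct out of 1011 test rows.
reported = 910 / 1011
print("Reported accuracy = %g" % reported)
```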


In [24]:
from sklearn.metrics import classification_report, accuracy_score

In [25]:
df_lr = classification_model_lr \
   .transform(test_news) \
   .select("category", "label", "prediction") \
   .toPandas()

In [26]:
df_lr.head()

Unnamed: 0,category,label,prediction
0,dunya,3.0,3.0
1,dunya,3.0,3.0
2,dunya,3.0,3.0
3,dunya,3.0,3.0
4,dunya,3.0,3.0


In [27]:
print(classification_report(df_lr.label, df_lr.prediction))

              precision    recall  f1-score   support

         0.0       0.94      0.88      0.91       135
         1.0       0.87      0.80      0.83       140
         2.0       0.83      0.90      0.86       142
         3.0       0.87      0.92      0.89       142
         4.0       0.91      0.94      0.93       144
         5.0       0.90      0.88      0.89       153
         6.0       0.99      0.97      0.98       155

    accuracy                           0.90      1011
   macro avg       0.90      0.90      0.90      1011
weighted avg       0.90      0.90      0.90      1011
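The report above uses the numeric indices produced by `StringIndexer`, which by default assigns index 0 to the most frequent label in the training data. The sketch below reconstructs a plausible index-to-category mapping under two assumptions: each category has 700 documents in total (from the `groupBy` output), and the per-category test supports are the ones listed in the name-labelled reports later in the notebook. Tie order is not reproduced reliably by this sketch.

```python
# Test-set support per category (assumed from the later name-labelled reports).
test_support = {
    "dunya": 142, "ekonomi": 140, "kultur": 144, "saglik": 135,
    "siyaset": 142, "spor": 155, "teknoloji": 153,
}

# Each category has 700 documents overall, so the rest went to training.
train_counts = {cat: 700 - n for cat, n in test_support.items()}

# Most frequent training label gets index 0.0, as with frequency-descending order.
ordered = sorted(train_counts, key=lambda cat: -train_counts[cat])
label_index = {cat: float(i) for i, cat in enumerate(ordered)}

for cat, idx in sorted(label_index.items(), key=lambda kv: kv[1]):
    print(idx, cat, train_counts[cat])

# "dunya" and "siyaset" tie at 558 training rows, so their order here is
# arbitrary; the notebook output shows "dunya" at 3.0, leaving 2.0 for "siyaset".
```

The supports per index in the report (135, 140, 142, 142, 144, 153, 155) line up with this ordering, which supports the reconstruction without proving it.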



## Text Classification with `NaiveBayes`

In [40]:
hashTF = HashingTF(inputCol="token_features", outputCol="raw_features", numFeatures=4096)

idf = IDF(inputCol="raw_features", outputCol="features", minDocFreq=5)

label_strIdx = StringIndexer(inputCol="category", outputCol="label")

bayes_class = NaiveBayes(smoothing=111)

label_Idxstr = IndexToString(inputCol="label", outputCol="article_class")

nlp_pipeline_bayes = Pipeline(
    stages=[document, 
            sentence,
            token,
            stop_words, 
            lemmatizer, 
            finisher,
            hashTF,
            idf,
            label_strIdx,
            bayes_class,
            label_Idxstr])
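The `smoothing` parameter is the additive (Laplace/Lidstone) constant added to every feature total before the per-class feature probabilities are normalized. A rough sketch of that formula, `(count + smoothing) / (total + smoothing * n_features)`, is given below; the helper name and the sample totals are hypothetical, and the real model works on the TF-IDF values rather than on these toy counts.

```python
def smoothed_probs(counts, smoothing):
    """Additively smoothed feature probabilities for a single class."""
    total = sum(counts)
    n_features = len(counts)
    return [(c + smoothing) / (total + smoothing * n_features) for c in counts]


# Hypothetical feature totals for one class.
counts = [30, 5, 0, 1]

for lam in (1.0, 111.0):
    probs = smoothed_probs(counts, lam)
    print(lam, [round(p, 3) for p in probs])
```

A large constant such as 111 pulls the estimates strongly toward a uniform distribution, which dampens the influence of rare hashed features.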

In [41]:
classification_model_bayes = nlp_pipeline_bayes.fit(train_news)

In [42]:
pred_bayes = classification_model_bayes.transform(test_news)

In [None]:
pred_bayes.select("category", "label", "prediction").show(5)

+--------+-----+----------+
|category|label|prediction|
+--------+-----+----------+
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       1.0|
|  dunya |  3.0|       1.0|
|  dunya |  3.0|       1.0|
+--------+-----+----------+
only showing top 5 rows



- Evaluation of Classification (`NaiveBayes`)



In [43]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(pred_bayes)
print("Accuracy = %g" % (accuracy))
print("Test Error = %g " % (1.0 - accuracy))

Accuracy = 0.872404
Test Error = 0.127596 


In [48]:
df_bayes = classification_model_bayes.transform(test_news).select("category", "label", "prediction").toPandas()

In [49]:
df_bayes.head()

Unnamed: 0,category,label,prediction
0,dunya,3.0,3.0
1,dunya,3.0,3.0
2,dunya,3.0,3.0
3,dunya,3.0,1.0
4,dunya,3.0,1.0


In [50]:
print(classification_report(df_bayes.label, df_bayes.prediction))

              precision    recall  f1-score   support

         0.0       0.92      0.94      0.93       135
         1.0       0.77      0.88      0.82       140
         2.0       0.76      0.92      0.83       142
         3.0       0.89      0.68      0.77       142
         4.0       0.90      0.94      0.92       144
         5.0       0.91      0.82      0.86       153
         6.0       0.99      0.94      0.96       155

    accuracy                           0.87      1011
   macro avg       0.88      0.87      0.87      1011
weighted avg       0.88      0.87      0.87      1011



## Classification with `BertSentenceEmbeddings` and GloVe `WordEmbeddingsModel`


## Setting the Pipeline for `Bert("labse")` & ``ClassifierDLApproach``

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings\
    .pretrained('labse', 'xx') \
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

classsifierdl = ClassifierDLApproach()\
   .setInputCols(["sentence_embeddings"])\
   .setOutputCol("class")\
   .setLabelColumn("category")\
   .setMaxEpochs(5)\
   .setEnableOutputLogs(True)

stopwords_tr download started this may take some time.
Approximate size to download 2 KB
[OK!]
lemma download started this may take some time.
Approximate size to download 14.8 MB
[OK!]
labse download started this may take some time.
Approximate size to download 1.7 GB
[OK!]


In [None]:
nlp_pipeline_bert = Pipeline(
    stages=[document, 
            sentence,
            token,
            stop_words, 
            lemmatizer, 
            embeddings,
            classsifierdl])

In [None]:
classification_model_bert = nlp_pipeline_bert.fit(train_news)

In [None]:
df_bert = classification_model_bert.transform(test_news).select("category", "text", "class.result").toPandas()

In [None]:
df_bert.head()

Unnamed: 0,category,text,result
0,dunya,140 araç birbirine girdi 2 ölü 80 yaralı abd ...,[dunya ]
1,dunya,150 araç birbirine girdi abd de yoğun sis ned...,[dunya ]
2,dunya,150 araç birbirine girdi teksas ta etkili ola...,[dunya ]
3,dunya,2 nükleer santralin daha açılmasını istiyor j...,[ekonomi ]
4,dunya,46 5 milyon dolarlık insani yardım aldı tacik...,[ekonomi ]


In [None]:
df_bert["result"].str[0].head()

0      dunya 
1      dunya 
2      dunya 
3    ekonomi 
4    ekonomi 
Name: result, dtype: object

- Evaluation of Classification (`ClassifierDLApproach` & `BertSentenceEmbeddings`)

In [None]:
print(classification_report(df_bert.category, df_bert.result.str[0]))

              precision    recall  f1-score   support

      dunya        0.86      0.80      0.82       142
    ekonomi        0.86      0.79      0.82       140
     kultur        0.89      0.94      0.92       144
     saglik        0.88      0.95      0.91       135
    siyaset        0.85      0.85      0.85       142
       spor        0.97      0.95      0.96       155
  teknoloji        0.84      0.88      0.86       153

    accuracy                           0.88      1011
   macro avg       0.88      0.88      0.88      1011
weighted avg       0.88      0.88      0.88      1011
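The last two rows of the report summarize the per-class scores in two ways: `macro avg` is the plain mean over classes, while `weighted avg` weights each class by its support. The sketch below recomputes both for the F1 column from the rounded values printed above, so the results only agree with the report to two decimals.

```python
# Per-class F1 and support copied from the report above (rounded values).
f1_scores = {
    "dunya": 0.82, "ekonomi": 0.82, "kultur": 0.92, "saglik": 0.91,
    "siyaset": 0.85, "spor": 0.96, "teknoloji": 0.86,
}
support = {
    "dunya": 142, "ekonomi": 140, "kultur": 144, "saglik": 135,
    "siyaset": 142, "spor": 155, "teknoloji": 153,
}

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: every class counts in proportion to its support.
total_support = sum(support.values())
weighted_f1 = sum(f1_scores[c] * support[c] for c in f1_scores) / total_support

print(f"macro avg F1    : {macro_f1:.2f}")
print(f"weighted avg F1 : {weighted_f1:.2f}")
```

Because the test set is nearly balanced, the two averages are almost identical here.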



## Setting the Pipeline for `glove_840B_300` & `ClassifierDLApproach`

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("sentence")

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

stop_words = StopWordsCleaner.pretrained('stopwords_tr', 'tr')\
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

lemmatizer = LemmatizerModel.pretrained("lemma", "tr") \
         .setInputCols(["cleanTokens"]) \
         .setOutputCol("lemma")

glove_embeddings = WordEmbeddingsModel.pretrained('glove_840B_300','xx')\
  .setInputCols(["sentence",'lemma'])\
  .setOutputCol("embeddings")\
  .setCaseSensitive(False)

embeddingsSentence = SentenceEmbeddings() \
      .setInputCols(["sentence", "embeddings"]) \
      .setOutputCol("sentence_embeddings") \
      .setPoolingStrategy("AVERAGE")

classsifierdl = ClassifierDLApproach()\
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")\
  .setLabelColumn("category")\
  .setBatchSize(8)\
  .setMaxEpochs(50)\
  .setLr(0.003)\
  .setEnableOutputLogs(True)
  #.setOutputLogsPath('logs')

stopwords_tr download started this may take some time.
Approximate size to download 2 KB
[OK!]
lemma download started this may take some time.
Approximate size to download 14.8 MB
[OK!]
glove_840B_300 download started this may take some time.
Approximate size to download 2.3 GB
[OK!]
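The `SentenceEmbeddings` stage with the `AVERAGE` pooling strategy collapses the per-token vectors of a document into a single vector by taking their element-wise mean. A minimal sketch of that pooling step follows, using a hypothetical `average_pool` helper and made-up 3-dimensional vectors in place of the 300-dimensional GloVe vectors.

```python
def average_pool(vectors):
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical token vectors for a three-token document.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
sentence_vector = average_pool(token_vectors)
print(sentence_vector)
```

Averaging discards word order, which is one possible reason this pipeline trails the sentence-level LaBSE embeddings on the same test set.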


In [None]:
nlp_pipeline_glove = Pipeline(
    stages=[document, 
            token,
            stop_words, 
            lemmatizer, 
            glove_embeddings,
            embeddingsSentence,
            classsifierdl])

In [None]:
classification_model_glove = nlp_pipeline_glove.fit(train_news)

In [None]:
df_glove = classification_model_glove.transform(test_news).select("category", "text", "class.result").toPandas()

In [None]:
df_glove.head()

Unnamed: 0,category,text,result
0,dunya,140 araç birbirine girdi 2 ölü 80 yaralı abd ...,[dunya ]
1,dunya,150 araç birbirine girdi abd de yoğun sis ned...,[dunya ]
2,dunya,150 araç birbirine girdi teksas ta etkili ola...,[dunya ]
3,dunya,2 nükleer santralin daha açılmasını istiyor j...,[dunya ]
4,dunya,46 5 milyon dolarlık insani yardım aldı tacik...,[ekonomi ]


- Evaluation of Classification (`ClassifierDLApproach` & GloVe embeddings)

In [None]:
print(classification_report(df_glove.category, df_glove.result.str[0]))

              precision    recall  f1-score   support

      dunya        0.78      0.63      0.70       142
    ekonomi        0.63      0.78      0.69       140
     kultur        0.83      0.86      0.84       144
     saglik        0.82      0.86      0.84       135
    siyaset        0.70      0.66      0.68       142
       spor        0.89      0.87      0.88       155
  teknoloji        0.82      0.76      0.79       153

    accuracy                           0.78      1011
   macro avg       0.78      0.77      0.77      1011
weighted avg       0.78      0.78      0.78      1011
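For a quick side-by-side view, the sketch below collects the test accuracies reported in the four evaluations and ranks them. The model descriptions used as dictionary keys are informal labels chosen here, and the two `ClassifierDLApproach` values are the two-decimal figures from their classification reports.

```python
# Test accuracies as reported in the evaluation outputs of this notebook.
accuracy_by_model = {
    "HashingTF + IDF + LogisticRegression": 0.900099,
    "HashingTF + IDF + NaiveBayes": 0.872404,
    "LaBSE + ClassifierDLApproach": 0.88,
    "GloVe 840B + ClassifierDLApproach": 0.78,
}

# Sort from best to worst accuracy.
ranked = sorted(accuracy_by_model.items(), key=lambda kv: kv[1], reverse=True)

for name, acc in ranked:
    print(f"{acc:.2f}  {name}")
```

All four models were scored on the same 1011-row test split, so the numbers are directly comparable, although they come from a single random split.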

