<a href="https://colab.research.google.com/github/murat-gunay/NLP/blob/master/Turkish_Classificaiton_sparkNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Turkish Text Classification with Spark NLP

## Initializing PySpark

In [3]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp

openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)
Collecting pyspark==2.4.4
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
     |████████████████████████████████| 215.7MB 59kB/s 
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
     |████████████████████████████████| 204kB 42.2MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130389 sha256=27f0076af58799f8f88a270142ef6648c131948b79e07ac7da7e3d522e8a6086
  Stored in directory: /root/.cache/pip/wheels/ab/09/4d/0d18423005

## Starting the Spark NLP Session

In [4]:
import sparknlp
spark = sparknlp.start()
sparknlp.version()
spark.version

'2.4.4'

In [5]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
#from sparknlp.embeddings import *

## Loading and Reading the DataSet

In [7]:
! wget https://raw.githubusercontent.com/murat-gunay/NLP/master/02_NLP_Projects/2-project_2_Turkish_sparkNLP_Classification/turkish_categorical_corpus.csv

--2020-10-01 19:03:20--  https://raw.githubusercontent.com/murat-gunay/NLP/master/02_NLP_Projects/2-project_2_Turkish_sparkNLP_Classification/turkish_categorical_corpus.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10627541 (10M) [text/plain]
Saving to: ‘turkish_categorical_corpus.csv’


2020-10-01 19:03:21 (19.7 MB/s) - ‘turkish_categorical_corpus.csv’ saved [10627541/10627541]



In [8]:
df_Spark = spark.read \
           .option("header", True) \
           .csv("turkish_categorical_corpus.csv")

In [10]:
df_Spark.show(5, truncate=100)

+--------+----------------------------------------------------------------------------------------------------+
|category|                                                                                                text|
+--------+----------------------------------------------------------------------------------------------------+
|siyaset | 3 milyon ile ön seçim vaadi mhp nin 10 olağan büyük kurultayı nda konuşan genel başkan adayı kor...|
|siyaset | mesut_yılmaz yüce_divan da ceza alabilirdi prof dr sacit adalı isviçre deki banka eski başbakan ...|
|siyaset | disko lar kaldırılıyor başbakan_yardımcısı arınç disko diye tabir edilen disiplin koğuşlarının k...|
|siyaset | sarıgül anayasa_mahkemesi ne gidiyor mustafa_sarıgül ilçedeki sınır değişikliğine itiraz için an...|
|siyaset | erdoğan idamın bir haklılık sebebi var demek ki yeri geldiği zaman idamın bir haklılık sebebi de...|
+--------+----------------------------------------------------------------------------------------------

In [12]:
df_Spark.groupBy("category").count().show()

+----------+-----+
|  category|count|
+----------+-----+
|   kultur |  700|
|  siyaset |  700|
|teknoloji |  700|
|   saglik |  700|
|  ekonomi |  700|
|     spor |  700|
|    dunya |  700|
+----------+-----+



## Removing extraneous underscores from the documents

In [53]:
df_Spark.take(2)

# Underscores join multi-word names (e.g. "mesut_yılmaz", "koray_aydın"); they should be replaced with spaces.

[Row(category='siyaset ', text=' 3 milyon ile ön seçim vaadi mhp nin 10 olağan büyük kurultayı nda konuşan genel başkan adayı koray aydın seçimlerden önce partinin üye sayısının 3 milyona ulaştırılması hedefini koyarak ön seçim uygulaması vaadinde bulundu mhp nin 10 olağan büyük kurultayı nda konuşan genel başkan adayı koray aydın seçimlerden önce partinin üye sayısının 3 milyona ulaştırılması hedefini koyarak ön seçim uygulaması vaadinde bulundu genel başkan adayı koray aydın kürsüye beklenirken yapılan tezahüratlar ve ıslıklamalar üzerine divan başkanı tuğrul türkeş mhp nin genel başkanlığı da genel başkan adaylığı da saygıdeğer işlerdir bu salondaki herkes ciddiye almak zorundadır dedi ve taşkınlıklara izin verilmeyeceğini salonda sükunet sağlanmadan konuşmaların başlamayacağını vurguladı türkeş devlet bahçeli nin kurultay açılışında konuştuğu için adaylık nedeniyle ikinci bir konuşma yapmayacağını açıkladı konuşmasında kurultayın mhp nin tek başına iktidarına vesile olmasını dileye

In [37]:
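# Import only what is needed; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import regexp_replace
df_Spark = df_Spark.withColumn('text', regexp_replace('text', '_', ' '))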

In [19]:
df_Spark.show(5, truncate=100)

+--------+----------------------------------------------------------------------------------------------------+
|category|                                                                                                text|
+--------+----------------------------------------------------------------------------------------------------+
|siyaset | 3 milyon ile ön seçim vaadi mhp nin 10 olağan büyük kurultayı nda konuşan genel başkan adayı kor...|
|siyaset | mesut yılmaz yüce divan da ceza alabilirdi prof dr sacit adalı isviçre deki banka eski başbakan ...|
|siyaset | disko lar kaldırılıyor başbakan yardımcısı arınç disko diye tabir edilen disiplin koğuşlarının k...|
|siyaset | sarıgül anayasa mahkemesi ne gidiyor mustafa sarıgül ilçedeki sınır değişikliğine itiraz için an...|
|siyaset | erdoğan idamın bir haklılık sebebi var demek ki yeri geldiği zaman idamın bir haklılık sebebi de...|
+--------+----------------------------------------------------------------------------------------------
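The cleaning step above can also be sketched in plain Python, which is handy for checking the rule on a few strings before running it on the whole DataFrame. The helper names `clean_text` and `clean_label` are hypothetical and not part of the notebook. Collapsing repeated whitespace and stripping the trailing space from category values such as `'siyaset '` are extra steps that the notebook itself does not perform.

```python
import re


def clean_text(text: str) -> str:
    """Replace underscores with spaces, then collapse runs of whitespace."""
    no_underscores = text.replace("_", " ")
    return re.sub(r"\s+", " ", no_underscores).strip()


def clean_label(label: str) -> str:
    """Strip surrounding whitespace, e.g. the trailing space in 'siyaset '."""
    return label.strip()


# Sample fragment taken from the second row shown earlier in the notebook
sample = " mesut_yılmaz yüce_divan da ceza alabilirdi"
print(clean_text(sample))
print(repr(clean_label("siyaset ")))
```

In Spark, the label clean-up would correspond to applying `trim` from `pyspark.sql.functions` to the `category` column.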

## Splitting the dataset into training and testing sets

In [20]:
train_news, test_news = df_Spark.randomSplit([0.8, 0.2], seed = 100)

In [14]:
train_news.count()

3889

In [15]:
test_news.count()

1011
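`randomSplit` assigns each row to a split at random according to the weights, so the resulting sizes only approximate the requested 80/20 ratio. A quick arithmetic check on the two counts reported above, using nothing beyond those numbers and the 7 x 700 category counts shown earlier:

```python
train_count = 3889  # from train_news.count()
test_count = 1011   # from test_news.count()

total = train_count + test_count
train_share = train_count / total
test_share = test_count / total

print(f"total={total}, train={train_share:.4f}, test={test_share:.4f}")
```

The total matches the 4900 documents in the corpus, and the shares land within about one percentage point of the requested weights.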

## Setting up the Turkish Language Pipeline

In [17]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

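stop_words = StopWordsCleaner.pretrained('stopwords_tr', 'tr')\
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# Note: the lemmatizer reads "token", not "cleanTokens", so the stop-word
# filtering above does not reach the features. Use "cleanTokens" here to apply it.
lemmatizer = LemmatizerModel.pretrained("lemma", "tr") \
        .setInputCols(["token"]) \
        .setOutputCol("lemma")

# Note: the POS tagger is defined here but is not added to the pipelines below.
pos = PerceptronModel.pretrained("pos_ud_imst", "tr") \
      .setInputCols(["document", "token"]) \
      .setOutputCol("pos")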

finisher = Finisher() \
    .setInputCols(["lemma"]) \
    .setOutputCols(["token_features"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)

stopwords_tr download started this may take some time.
Approximate size to download 2 KB
[OK!]
lemma download started this may take some time.
Approximate size to download 14.8 MB
[OK!]
pos_ud_imst download started this may take some time.
Approximate size to download 2.1 MB
[OK!]
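Because each annotator names its input and output columns as plain strings, it is easy to produce a column that no later stage reads. The following is a minimal plain-Python sketch of a wiring check. The `check_wiring` helper is hypothetical, and the stage list is transcribed by hand from the stages that the notebook places in its pipelines, so it is an illustration rather than something derived from the Spark objects.

```python
# (stage name, input columns, output column), transcribed from the cell above
stages = [
    ("document", ["text"], "document"),
    ("sentence", ["document"], "sentence"),
    ("token", ["sentence"], "token"),
    ("stop_words", ["token"], "cleanTokens"),
    ("lemmatizer", ["token"], "lemma"),
    ("finisher", ["lemma"], "token_features"),
]


def check_wiring(stages, source_cols, final_col):
    """Return (missing inputs, outputs that no later stage consumes)."""
    available = set(source_cols)
    consumed = set()
    missing = []
    for name, inputs, output in stages:
        for col in inputs:
            if col not in available:
                missing.append((name, col))
            consumed.add(col)
        available.add(output)
    produced = [output for _, _, output in stages]
    unused = [c for c in produced if c not in consumed and c != final_col]
    return missing, unused


missing, unused = check_wiring(stages, ["text"], "token_features")
print("missing inputs:", missing)
print("unused outputs:", unused)
```

With the wiring as written, `cleanTokens` is reported as unused: the lemmatizer reads `token`, so the stop-word cleaner's output never reaches `token_features`.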


In [16]:
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, IndexToString
from pyspark.ml.classification import LogisticRegression, NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

## Text Classification with `LogisticRegression`

In [21]:
hashTF = HashingTF(inputCol="token_features", outputCol="raw_features", numFeatures=4000)

idf = IDF(inputCol="raw_features", outputCol="features", minDocFreq=5)

label_strIdx = StringIndexer(inputCol="category", outputCol="label")

logReg = LogisticRegression(maxIter=10)

# Note: this maps the true "label" index back to its string, not the model's "prediction"
label_Idxstr = IndexToString(inputCol="label", outputCol="article_class")

nlp_pipeline_lr = Pipeline(
        stages=[document, 
                sentence,
                token,
                stop_words, 
                lemmatizer, 
                finisher,
                hashTF,
                idf,
                label_strIdx,
                logReg,
                label_Idxstr])
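The `HashingTF` and `IDF` stages turn each token list into a fixed-length weighted vector. Below is a small plain-Python sketch of the idea, not Spark's implementation. It assumes `zlib.crc32` as a stand-in hash (Spark uses MurmurHash3), uses 16 buckets instead of 4000 to keep the output readable, and runs on made-up toy documents. The IDF formula follows the one documented for Spark ML, `log((m + 1) / (df + 1))`, with a weight of 0 for buckets seen in fewer than `min_doc_freq` documents.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # illustrative; the notebook uses numFeatures=4000


def bucket(term, num_features=NUM_FEATURES):
    """Map a term to a feature index (crc32 is a stand-in for MurmurHash3)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def term_frequencies(tokens):
    """Count how many tokens of one document fall into each bucket."""
    return Counter(bucket(t) for t in tokens)


def idf_weights(doc_tfs, min_doc_freq=0):
    """IDF per bucket: log((n_docs + 1) / (doc_freq + 1)), or 0 below min_doc_freq."""
    n_docs = len(doc_tfs)
    doc_freq = Counter()
    for tf in doc_tfs:
        doc_freq.update(tf.keys())  # each bucket counted once per document
    return {
        b: math.log((n_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for b, df in doc_freq.items()
    }


# Toy documents (made up for illustration); "ve" appears in every one
docs = [
    ["seçim", "parti", "ve", "seçim"],
    ["maç", "gol", "ve"],
    ["seçim", "gol", "ve"],
]
doc_tfs = [term_frequencies(d) for d in docs]
idf = idf_weights(doc_tfs)
tfidf = [{b: count * idf[b] for b, count in tf.items()} for tf in doc_tfs]
print(tfidf[0])
```

A token that occurs in every document, such as `ve` here, gets an IDF of 0 and so contributes nothing to the final features. Different terms can collide in the same bucket, which is the trade-off that a small `numFeatures` makes.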

In [22]:
classification_model_lr = nlp_pipeline_lr.fit(train_news)

In [43]:
pred_lr = classification_model_lr.transform(test_news)

In [54]:
pred_lr.select("category", "label", "prediction").show(5)

+--------+-----+----------+
|category|label|prediction|
+--------+-----+----------+
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       3.0|
+--------+-----+----------+
only showing top 5 rows
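The numeric `label` values come from `StringIndexer`, which by default orders labels by descending frequency in the data it is fitted on, so the most frequent category receives index 0. A plain-Python sketch of that ordering on made-up toy labels follows. The alphabetical tie-break is an assumption made here for determinism, and `fit_label_index` is a hypothetical helper.

```python
from collections import Counter


def fit_label_index(labels):
    """Index labels by descending frequency; ties broken alphabetically (assumption)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


# Toy labels (made up); the trailing spaces mimic the values in this corpus
toy_labels = ["spor "] * 3 + ["dunya "] * 2 + ["ekonomi "] * 2 + ["saglik "]
label_index = fit_label_index(toy_labels)
print(label_index)
```

For the fitted pipeline itself, the actual mapping should be readable from the `labels` attribute of the fitted `StringIndexerModel` stage, where position `i` holds the category string for index `i`.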



- Evaluation of Classification (`LogisticRegression`)

In [24]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(pred_lr)
print("Accuracy = %g" % (accuracy))
print("Test Error = %g " % (1.0 - accuracy))

Accuracy = 0.893175
Test Error = 0.106825 


In [39]:
from sklearn.metrics import classification_report, accuracy_score

In [40]:
df_lr = classification_model_lr.transform(test_news).select("category", "label", "prediction").toPandas()

In [41]:
df_lr.head()

   category  label  prediction
0     dunya    3.0         3.0
1     dunya    3.0         3.0
2     dunya    3.0         3.0
3     dunya    3.0         3.0
4     dunya    3.0         3.0


In [42]:
print(classification_report(df_lr.label, df_lr.prediction))

              precision    recall  f1-score   support

         0.0       0.94      0.93      0.93       135
         1.0       0.86      0.81      0.83       140
         2.0       0.86      0.88      0.87       142
         3.0       0.82      0.89      0.86       142
         4.0       0.91      0.92      0.91       144
         5.0       0.88      0.88      0.88       153
         6.0       0.98      0.95      0.96       155

    accuracy                           0.89      1011
   macro avg       0.89      0.89      0.89      1011
weighted avg       0.89      0.89      0.89      1011
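To make the columns of the report concrete, here is a minimal plain-Python sketch of how per-class precision, recall, and F1 are derived from label and prediction pairs. The toy labels are made up for illustration, and `per_class_metrics` is a hypothetical helper rather than the scikit-learn implementation.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} from paired labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


# Toy data (made up): one class-0 document is misclassified as class 1
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
metrics = per_class_metrics(y_true, y_pred)
for cls, (p, r, f1) in metrics.items():
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The `support` column in the report is simply the number of true documents per class, and `accuracy` is the share of pairs where label and prediction agree.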



## Text Classification with `NaiveBayes`

In [45]:
hashTF = HashingTF(inputCol="token_features", outputCol="raw_features", numFeatures=4000)

idf = IDF(inputCol="raw_features", outputCol="features", minDocFreq=5)

label_strIdx = StringIndexer(inputCol="category", outputCol="label")

bayes_class = NaiveBayes(smoothing=111)

# Note: this maps the true "label" index back to its string, not the model's "prediction"
label_Idxstr = IndexToString(inputCol="label", outputCol="article_class")

nlp_pipeline_bayes = Pipeline(
    stages=[document, 
            sentence,
            token,
            stop_words, 
            lemmatizer, 
            finisher,
            hashTF,
            idf,
            label_strIdx,
            bayes_class,
            label_Idxstr])

In [46]:
classification_model_bayes = nlp_pipeline_bayes.fit(train_news)

In [47]:
pred_bayes = classification_model_bayes.transform(test_news)

In [55]:
pred_bayes.select("category", "label", "prediction").show(5)

+--------+-----+----------+
|category|label|prediction|
+--------+-----+----------+
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       3.0|
|  dunya |  3.0|       1.0|
|  dunya |  3.0|       1.0|
|  dunya |  3.0|       1.0|
+--------+-----+----------+
only showing top 5 rows



- Evaluation of Classification (`NaiveBayes`)

In [49]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(pred_bayes)
print("Accuracy = %g" % (accuracy))
print("Test Error = %g " % (1.0 - accuracy))

Accuracy = 0.866469
Test Error = 0.133531 


In [50]:
df_bayes = classification_model_bayes.transform(test_news).select("category", "label", "prediction").toPandas()

In [52]:
df_bayes.head()

   category  label  prediction
0     dunya    3.0         3.0
1     dunya    3.0         3.0
2     dunya    3.0         1.0
3     dunya    3.0         1.0
4     dunya    3.0         1.0


In [51]:
print(classification_report(df_bayes.label, df_bayes.prediction))

              precision    recall  f1-score   support

         0.0       0.91      0.93      0.92       135
         1.0       0.75      0.86      0.80       140
         2.0       0.77      0.93      0.84       142
         3.0       0.91      0.68      0.78       142
         4.0       0.90      0.92      0.91       144
         5.0       0.88      0.83      0.85       153
         6.0       0.99      0.91      0.95       155

    accuracy                           0.87      1011
   macro avg       0.87      0.87      0.87      1011
weighted avg       0.87      0.87      0.87      1011
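The two accuracy figures are easier to compare as document counts. Using only the numbers reported above (a test set of 1011 documents, accuracy 0.893175 for `LogisticRegression` and 0.866469 for `NaiveBayes`), a short calculation recovers how many test documents each model classified correctly:

```python
test_size = 1011

accuracy_lr = 0.893175     # LogisticRegression, as printed above
accuracy_bayes = 0.866469  # NaiveBayes, as printed above

# Accuracy is correct / total, so multiplying back recovers the counts
correct_lr = round(accuracy_lr * test_size)
correct_bayes = round(accuracy_bayes * test_size)

print("LogisticRegression correct:", correct_lr)
print("NaiveBayes correct:", correct_bayes)
print("difference:", correct_lr - correct_bayes)
```

The per-class reports show where the gap is largest: recall for class 3.0 is 0.68 with `NaiveBayes` versus 0.89 with `LogisticRegression`.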

