# Text Classification

We are using the UCI News Aggregator dataset (uci-news-aggregator.csv). The data consists of news headlines, and our objective is to predict the category of each headline from its title. We do this with text classification, a natural language processing task, using a predictive machine learning model (Naive Bayes) in PySpark.

In [1]:
import findspark
findspark.init()
import pyspark

In [2]:
#create SparkSession instance
from pyspark import SparkConf
from pyspark.sql import SparkSession
#conf = SparkConf()
#spark_session = SparkSession.builder.config(conf=conf).appName('NLTK').getOrCreate()

In [3]:
import numpy as np

#spark = SparkSession.builder.config('spark.executor.memory','6g').getOrCreate()
spark = SparkSession.builder.config('spark.driver.memory','6g').getOrCreate()

22/12/09 13:58:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


The Spark session is now started.

Next, import the modules required for the rest of the notebook.

In [4]:
from pyspark.ml import Pipeline 
from pyspark.ml.feature import CountVectorizer,StringIndexer, RegexTokenizer,StopWordsRemover
from pyspark.sql.functions import col, udf,regexp_replace,isnull
from pyspark.sql.types import StringType,IntegerType
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [5]:
#!wget --no-check-certificate 'https://drive.google.com/file/d/18ZrVjSk1L0R1Mllzg78pSloQDuiQItF7/view?usp=share_link' -O 'uci-news-aggregator.csv'

Now we are loading the dataset uci-news-aggregator.csv.

In [6]:
#read news csv dataset from the working directory
news_data = spark.read.csv('uci-news-aggregator.csv',header= True)
news_data.printSchema()
news_data.show()
#news_data = news_data.limit(300)
#news_data.cache

root
 |-- ID: string (nullable = true)
 |-- TITLE: string (nullable = true)
 |-- URL: string (nullable = true)
 |-- PUBLISHER: string (nullable = true)
 |-- CATEGORY: string (nullable = true)
 |-- STORY: string (nullable = true)
 |-- HOSTNAME: string (nullable = true)
 |-- TIMESTAMP: string (nullable = true)

+---+--------------------+--------------------+--------------------+--------+--------------------+--------------------+-------------+
| ID|               TITLE|                 URL|           PUBLISHER|CATEGORY|               STORY|            HOSTNAME|    TIMESTAMP|
+---+--------------------+--------------------+--------------------+--------+--------------------+--------------------+-------------+
|  1|Fed official says...|http://www.latime...|   Los Angeles Times|       b|ddUyU0VZz0BRneMio...|     www.latimes.com|1394470370698|
|  2|Fed's Charles Plo...|http://www.livemi...|            Livemint|       b|ddUyU0VZz0BRneMio...|    www.livemint.com|1394470371207|
|  3|US open: Stock

We can check the total number of rows in the dataset.

In [7]:
#count data items present in the set
news_data.count()

422937

We select the title of each news item and its corresponding category.

In [8]:
#select the news titles and their categories
title_category = news_data.select("TITLE","CATEGORY")
title_category.show()

+--------------------+--------+
|               TITLE|CATEGORY|
+--------------------+--------+
|Fed official says...|       b|
|Fed's Charles Plo...|       b|
|US open: Stocks f...|       b|
|Fed risks falling...|       b|
|Fed's Plosser: Na...|       b|
|Plosser: Fed May ...|       b|
|Fed's Plosser: Ta...|       b|
|Fed's Plosser exp...|       b|
|US jobs growth la...|       b|
|ECB unlikely to e...|       b|
|ECB unlikely to e...|       b|
|EU's half-baked b...|       b|
|Europe reaches cr...|       b|
|ECB FOCUS-Stronge...|       b|
|EU aims for deal ...|       b|
|Forex - Pound dro...|       b|
|Noyer Says Strong...|       b|
|EU Week Ahead Mar...|       b|
|ECB member Noyer ...|       b|
|Euro Anxieties Wa...|       b|
+--------------------+--------+
only showing top 20 rows



The following custom function counts the null values in each column.

In [9]:
#count the null values per column; returns a list of (column, null_count) tuples
def null_value_count(df):
  null_columns_counts = []
  for k in df.columns:
    #one Spark job per column: count the rows where this column is null
    nullRows = df.where(col(k).isNull()).count()
    if nullRows > 0:
      null_columns_counts.append((k, nullRows))
  return null_columns_counts

We apply the custom function to the data frame title_category.

In [10]:
null_columns_count_list = null_value_count(title_category)
#spark.createDataFrame(null_columns_count_list, ['Column_With_Null_Value', 'Null_Values_Count']).show()
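
Because the display call above is commented out, the result of the null count is never shown. As a rough illustration of the shape of that result, here is a plain-Python sketch that applies the same per-column logic to a list of dictionaries instead of a Spark data frame. The helper name `null_value_count_rows` and the sample rows are hypothetical and are not taken from the dataset.

```python
def null_value_count_rows(rows, columns):
    """Return (column, null_count) pairs for columns with at least one None."""
    result = []
    for name in columns:
        nulls = sum(1 for row in rows if row.get(name) is None)
        if nulls > 0:
            result.append((name, nulls))
    return result

# Hypothetical sample rows, not taken from the dataset
sample_rows = [
    {"TITLE": "Fed official says weak data caused by weather", "CATEGORY": "b"},
    {"TITLE": None, "CATEGORY": "b"},
    {"TITLE": "Business Highlights", "CATEGORY": None},
    {"TITLE": None, "CATEGORY": "e"},
]
null_counts = null_value_count_rows(sample_rows, ["TITLE", "CATEGORY"])
print(null_counts)
```

Columns without any null value are left out of the list, which is why an empty list would mean the data frame has no nulls at all.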

# Cleaning the dataset

Now we can drop the rows that contain null values.

In [11]:
#Drop null and not available values
title_category = title_category.dropna()
title_category.count()
title_category.show(truncate=False)

+---------------------------------------------------------------------------+--------+
|TITLE                                                                      |CATEGORY|
+---------------------------------------------------------------------------+--------+
|Fed official says weak data caused by weather, should not slow taper       |b       |
|Fed's Charles Plosser sees high bar for change in pace of tapering         |b       |
|US open: Stocks fall after Fed official hints at accelerated tapering      |b       |
|Fed risks falling 'behind the curve', Charles Plosser says                 |b       |
|Fed's Plosser: Nasty Weather Has Curbed Job Growth                         |b       |
|Plosser: Fed May Have to Accelerate Tapering Pace                          |b       |
|Fed's Plosser: Taper pace may be too slow                                  |b       |
|Fed's Plosser expects US unemployment to fall to 6.2% by the end of 2014   |b       |
|US jobs growth last month hit by weather:F

In [12]:
title_category.groupBy("TITLE").count().orderBy(col("count").desc()).show(truncate=False)



+----------------------------------------------------------------------------------+-----+
|TITLE                                                                             |count|
+----------------------------------------------------------------------------------+-----+
|The article requested cannot be found! Please refresh your browser or go back  ...|145  |
|Business Highlights                                                               |59   |
|Posted by Parvez Jabri                                                            |59   |
|Posted by Imaduddin                                                               |53   |
|Posted by Shoaib-ur-Rehman Siddiqui                                               |52   |
|(click the phrases to see a list)                                                 |51   |
|Business Wire                                                                     |41   |
|PR Newswire                                                                       |38   |

                                                                                

Now let us remove the digits from the TITLE column.

In [13]:
#remove digits from the titles (punctuation and other characters are kept)
title_category = title_category.withColumn("only_str",regexp_replace(col('TITLE'), r'\d+', ''))
title_category.select("TITLE","only_str").show(truncate=False)

+---------------------------------------------------------------------------+---------------------------------------------------------------------------+
|TITLE                                                                      |only_str                                                                   |
+---------------------------------------------------------------------------+---------------------------------------------------------------------------+
|Fed official says weak data caused by weather, should not slow taper       |Fed official says weak data caused by weather, should not slow taper       |
|Fed's Charles Plosser sees high bar for change in pace of tapering         |Fed's Charles Plosser sees high bar for change in pace of tapering         |
|US open: Stocks fall after Fed official hints at accelerated tapering      |US open: Stocks fall after Fed official hints at accelerated tapering      |
|Fed risks falling 'behind the curve', Charles Plosser says                 
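
To make the effect of the `\d+` pattern concrete, here is a small plain-Python sketch using the standard `re` module rather than Spark. The helper name `strip_digits` is hypothetical, and `re.ASCII` is passed on the assumption that only ASCII digits should match.

```python
import re

def strip_digits(title):
    """Remove every run of ASCII digits, leaving all other characters untouched."""
    return re.sub(r"\d+", "", title, flags=re.ASCII)

title = "Fed's Plosser expects US unemployment to fall to 6.2% by the end of 2014"
cleaned = strip_digits(title)
print(repr(cleaned))
```

Only the digits disappear: the decimal point and the percent sign of `6.2%` stay in the string, and removing `2014` leaves a trailing space. Those leftovers are handled by the tokenizer in the next step, which splits on non-word characters.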

Split the text into constituent words

In [14]:
#split the text to tokens using tokenizer
#https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.feature.RegexTokenizer.html
regex_tokenizer = RegexTokenizer(inputCol="only_str", outputCol="words", pattern="\\W")
raw_words = regex_tokenizer.transform(title_category)
raw_words.show()

+--------------------+--------+--------------------+--------------------+
|               TITLE|CATEGORY|            only_str|               words|
+--------------------+--------+--------------------+--------------------+
|Fed official says...|       b|Fed official says...|[fed, official, s...|
|Fed's Charles Plo...|       b|Fed's Charles Plo...|[fed, s, charles,...|
|US open: Stocks f...|       b|US open: Stocks f...|[us, open, stocks...|
|Fed risks falling...|       b|Fed risks falling...|[fed, risks, fall...|
|Fed's Plosser: Na...|       b|Fed's Plosser: Na...|[fed, s, plosser,...|
|Plosser: Fed May ...|       b|Plosser: Fed May ...|[plosser, fed, ma...|
|Fed's Plosser: Ta...|       b|Fed's Plosser: Ta...|[fed, s, plosser,...|
|Fed's Plosser exp...|       b|Fed's Plosser exp...|[fed, s, plosser,...|
|US jobs growth la...|       b|US jobs growth la...|[us, jobs, growth...|
|ECB unlikely to e...|       b|ECB unlikely to e...|[ecb, unlikely, t...|
|ECB unlikely to e...|       b|ECB unl
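
The tokens above show why `Fed's` becomes `fed` and `s`: the text is lowercased and then split wherever a non-word character occurs, so the apostrophe acts as a separator. A minimal plain-Python sketch of that behaviour follows. It is an approximation written with the standard `re` module, and the function name `regex_tokenize` and its parameters are hypothetical.

```python
import re

def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Lowercase the text, split on the pattern, and drop tokens that are too short."""
    pieces = re.split(pattern, text.lower(), flags=re.ASCII)
    # Consecutive separators produce empty strings, which the length check removes
    return [p for p in pieces if len(p) >= min_token_length]

tokens = regex_tokenize("Fed's Charles Plosser sees high bar for change in pace of tapering")
print(tokens)
```

The stray single-letter token `s` is not removed at this stage. It disappears later only because it is treated as a stop word.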

Remove the stop words from the tokenized list of words.

In [15]:
#Remove stop words such as for, by, in, etc. from the word list
#https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StopWordsRemover.html
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
words_df = remover.transform(raw_words)
words_df.select("words","filtered").show(truncate=False)

+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
|words                                                                                |filtered                                                                       |
+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
|[fed, official, says, weak, data, caused, by, weather, should, not, slow, taper]     |[fed, official, says, weak, data, caused, weather, slow, taper]                |
|[fed, s, charles, plosser, sees, high, bar, for, change, in, pace, of, tapering]     |[fed, charles, plosser, sees, high, bar, change, pace, tapering]               |
|[us, open, stocks, fall, after, fed, official, hints, at, accelerated, tapering]     |[us, open, stocks, fall, fed, official, hints, accelerated, tapering]    
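
Stop-word removal is an order-preserving filter over each token list. The sketch below reproduces the second row of the output above in plain Python. The `STOP_WORDS` set is a small illustrative subset chosen for this example, not the default list that the remover uses, and the function name is hypothetical.

```python
# Illustrative subset only; a real English stop-word list is much longer
STOP_WORDS = {"s", "by", "for", "in", "of", "should", "not", "after", "at"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep the tokens, in order, that are not in the stop-word set."""
    return [w for w in words if w.lower() not in stop_words]

words = ["fed", "s", "charles", "plosser", "sees", "high", "bar",
         "for", "change", "in", "pace", "of", "tapering"]
filtered = remove_stop_words(words)
print(filtered)
```

Note that negations such as `not` are dropped as well, as the first row of the output shows. That is usually harmless for topic classification but can matter for tasks such as sentiment analysis.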

The CATEGORY column can now be mapped to a numeric categoryIndex column.

In [16]:
#Map each category string to a numeric index
#https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html
indexer = StringIndexer(inputCol="CATEGORY", outputCol="categoryIndex")
feature_data = indexer.fit(words_df).transform(words_df)
feature_data.select("CATEGORY","categoryIndex").show()

                                                                                

+--------+-------------+
|CATEGORY|categoryIndex|
+--------+-------------+
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
|       b|          1.0|
+--------+-------------+
only showing top 20 rows
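
The indices are not assigned alphabetically: `b` maps to 1.0 here, and `e` maps to 0.0 in later outputs. This is consistent with StringIndexer's default ordering, which gives the lowest index to the most frequent label. A minimal plain-Python sketch of that ordering follows. The function name and the label counts are hypothetical, chosen only so that `e` is more frequent than `b`.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical label counts, not the real class distribution
labels = ["e"] * 5 + ["b"] * 4 + ["t"] * 3 + ["m"] * 1
index = fit_string_index(labels)
print(index)
```

Because the mapping depends on label frequencies, refitting the indexer on a different subset of the data can assign different indices to the same category.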



Convert text into vectors of token counts

In [17]:
#convert the filtered tokens into vectors of token counts
#https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.CountVectorizer.html
cv = CountVectorizer(inputCol="filtered", outputCol="features")
model = cv.fit(feature_data)
countVectorizer_feateures = model.transform(feature_data)
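
The features column produced here is printed further down in the form `(49043,[74,113,...],...)`: the vocabulary size, the indices of the terms present in the title, and their counts. The following plain-Python sketch shows how such a sparse count vector can be built from token lists. It is an illustration under stated assumptions, not Spark's implementation: the function names are hypothetical, and the alphabetical tie-breaking between equally frequent terms is an assumption made so the result is reproducible.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Order terms by total count across all documents, most frequent first."""
    totals = Counter(token for doc in documents for token in doc)
    # Alphabetical tie-breaking is an assumption made for reproducibility
    return sorted(totals, key=lambda term: (-totals[term], term))

def to_sparse_counts(doc, vocabulary):
    """Return (vocabulary size, sorted term indices, counts) for one document."""
    position = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(position[t] for t in doc if t in position)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]

docs = [
    ["fed", "official", "says", "weak", "data"],
    ["fed", "charles", "plosser", "sees", "high", "bar"],
    ["fed", "plosser", "taper", "pace"],
]
vocabulary = fit_vocabulary(docs)
sparse = to_sparse_counts(docs[2], vocabulary)
print(vocabulary[:2], sparse)
```

Frequent terms receive small indices, so low index values in the features column point to words that are common across the whole corpus.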

                                                                                

# Partition the dataset into training and test datasets


In [18]:
#partition the data set into training and test sets
(trainingData, testData) = countVectorizer_feateures.randomSplit([0.8, 0.2],seed = 11)
trainingData.show()
testData.show()

                                                                                

+--------------------+----------------+--------------------+--------------------+--------------------+-------------+--------------------+
|               TITLE|        CATEGORY|            only_str|               words|            filtered|categoryIndex|            features|
+--------------------+----------------+--------------------+--------------------+--------------------+-------------+--------------------+
|"'American Idol' ...|Reality TV World|"'American Idol' ...|[american, idol, ...|[american, idol, ...|         20.0|(49043,[74,113,39...|
|"'Bachelor' offen...|               e|"'Bachelor' offen...|[bachelor, offend...|[bachelor, offend...|          0.0|(49043,[167,553,9...|
|"'Divergent' auth...|               e|"'Divergent' auth...|[divergent, autho...|[divergent, autho...|          0.0|(49043,[4,942,109...|
|"'Hercules' First...|               e|"'Hercules' First...|[hercules, first,...|[hercules, first,...|          0.0|(49043,[6,53,162,...|
|"'Hercules' Star ...|            

+--------------------+----------------+--------------------+--------------------+--------------------+-------------+--------------------+
|               TITLE|        CATEGORY|            only_str|               words|            filtered|categoryIndex|            features|
+--------------------+----------------+--------------------+--------------------+--------------------+-------------+--------------------+
|"'Maleficent' Tra...|               e|"'Maleficent' Tra...|[maleficent, trai...|[maleficent, trai...|          0.0|(49043,[59,951,23...|
|"'Mr. Peabody & S...|               e|"'Mr. Peabody & S...|[mr, peabody, she...|[mr, peabody, she...|          0.0|(49043,[184,226,9...|
|"'The Voice' coac...|Reality TV World|"'The Voice' coac...|[the, voice, coac...|[voice, coaches, ...|         20.0|(49043,[153,325,5...|
|"'The Voice' dete...|Reality TV World|"'The Voice' dete...|[the, voice, dete...|[voice, determine...|         20.0|(49043,[153,325,1...|
|"'TheBachelor' st...|            
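
The output above contains the CATEGORY value `Reality TV World` with categoryIndex 20.0. That looks like a publisher name rather than a category code, which suggests that some rows were read with their columns shifted and that the label column holds more distinct values than intended. One possible sanity check is sketched below in plain Python. It assumes that the only valid codes are `b`, `t`, `e` and `m`, which is an assumption about the dataset and is not stated in this notebook. The function name is hypothetical.

```python
# Assumed valid codes (business, science/technology, entertainment, health);
# this set is an assumption and is not stated in the notebook
VALID_CATEGORIES = {"b", "t", "e", "m"}

def split_by_label_validity(rows):
    """Separate (title, category) pairs into valid and suspicious lists."""
    valid = [r for r in rows if r[1] in VALID_CATEGORIES]
    suspicious = [r for r in rows if r[1] not in VALID_CATEGORIES]
    return valid, suspicious

# Titles are truncated in the output above, so they are kept truncated here
rows = [
    ("\"'American Idol' ...", "Reality TV World"),
    ("\"'Bachelor' offen...", "e"),
    ("Fed official says...", "b"),
]
valid, suspicious = split_by_label_validity(rows)
print(len(valid), suspicious)
```

On the Spark data frame, a comparable filter could be written with `col("CATEGORY").isin("b", "t", "e", "m")`. Whether to drop such rows or to fix the parsing that produced them is a separate decision, and either choice would change the row count and the accuracy reported below.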

                                                                                

# Model Training and Prediction

## Naive Bayes Model

In [19]:
#https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.NaiveBayes.html
nb = NaiveBayes(modelType="multinomial",labelCol="categoryIndex", featuresCol="features")
nbModel = nb.fit(trainingData)
nb_predictions = nbModel.transform(testData)
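
To show what a multinomial Naive Bayes classifier computes from token counts, here is a small self-contained sketch in plain Python. It is a textbook version with additive (Laplace) smoothing of the term counts, not Spark's implementation, and the training titles, function names and parameter names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, smoothing=1.0):
    """Estimate log priors and smoothed log term probabilities per class."""
    vocabulary = sorted({t for doc in docs for t in doc})
    term_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    # Priors are plain class frequencies in this sketch
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_theta = {}
    for c in class_counts:
        total = sum(term_counts[c].values()) + smoothing * len(vocabulary)
        log_theta[c] = {t: math.log((term_counts[c][t] + smoothing) / total)
                        for t in vocabulary}
    return log_prior, log_theta

def predict(doc, log_prior, log_theta):
    """Return the class with the highest log score; unseen terms are ignored."""
    scores = {c: log_prior[c] + sum(log_theta[c][t] for t in doc if t in log_theta[c])
              for c in log_prior}
    return max(scores, key=scores.get)

# Hypothetical tokenized titles with their category codes
train_docs = [["fed", "taper", "pace"], ["ecb", "euro", "fed"],
              ["idol", "voice", "finale"], ["voice", "coaches", "idol"]]
train_labels = ["b", "b", "e", "e"]
log_prior, log_theta = train_multinomial_nb(train_docs, train_labels)
prediction = predict(["fed", "euro"], log_prior, log_theta)
print(prediction)
```

The scores are sums of logarithms of probabilities, which is why they are negative. The large negative numbers in the rawPrediction column below have the same origin.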

                                                                                

In [20]:
nb_predictions.show()

22/12/09 13:59:06 WARN DAGScheduler: Broadcasting large task binary with size 87.4 MiB

+--------------------+----------------+--------------------+--------------------+--------------------+-------------+--------------------+--------------------+--------------------+----------+
|               TITLE|        CATEGORY|            only_str|               words|            filtered|categoryIndex|            features|       rawPrediction|         probability|prediction|
+--------------------+----------------+--------------------+--------------------+--------------------+-------------+--------------------+--------------------+--------------------+----------+
|"'Maleficent' Tra...|               e|"'Maleficent' Tra...|[maleficent, trai...|[maleficent, trai...|          0.0|(49043,[59,951,23...|[-51.741662984709...|[0.99999974818180...|       0.0|
|"'Mr. Peabody & S...|               e|"'Mr. Peabody & S...|[mr, peabody, she...|[mr, peabody, she...|          0.0|(49043,[184,226,9...|[-63.027295380563...|[0.99495466743207...|       0.0|
|"'The Voice' coac...|Reality TV World|"'The 

                                                                                

In [21]:
nb_predictions1 = nb_predictions.select("prediction", "categoryIndex", "features")

In [22]:
nb_predictions1

DataFrame[prediction: double, categoryIndex: double, features: vector]

In [23]:
## Evaluating the model
evaluator = MulticlassClassificationEvaluator(labelCol="categoryIndex", predictionCol="prediction", metricName="accuracy")
nb_accuracy = evaluator.evaluate(nb_predictions1)
print("Accuracy of NB is = %g"% (nb_accuracy))

22/12/09 13:59:08 WARN DAGScheduler: Broadcasting large task binary with size 87.4 MiB

Accuracy of NB is = 0.926778
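
The accuracy metric is the fraction of test rows whose predicted index equals the true categoryIndex. A minimal plain-Python sketch of that calculation follows. The function name and the sample values are hypothetical and are not taken from the predictions above.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the predicted index equals the true index."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Hypothetical predicted and true category indices
predicted = [0.0, 0.0, 1.0, 20.0, 1.0]
actual = [0.0, 1.0, 1.0, 20.0, 1.0]
score = accuracy(predicted, actual)
print("Accuracy = %g" % score)
```

Accuracy weights every row equally, so a classifier can score well overall while performing poorly on rare categories. Per-class metrics would be needed to see that.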


                                                                                

In [25]:
spark.stop()