# Real-Time Twitter Hate Speech Detection

Unlike the notebooks we have seen so far, this one assumes that we work on our own local installation of Spark rather than the one (still local!) provided by a machine on the Google Cloud infrastructure.

The aim of this notebook is to implement a Spark streaming application to detect hate speech in tweets posted on Twitter in (nearly) real-time. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it.

Ultimately, the task is to identify racist or sexist tweets in an incoming stream of tweets. We will use a training dataset of tweets along with their associated labels, where label `1` denotes a racist/sexist tweet and label `0` indicates a legitimate tweet.

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/12/twitter-mute-filter.png)

Why is this a relevant task? Because social media platforms like Twitter receive a mammoth stream of data in the form of comments and status updates, and an application like this one helps moderate what is posted publicly.

More details on this task can be found here: [Practice Problem: Twitter Sentiment Analysis](https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/?utm_source=blog&utm_medium=streaming-data-pyspark-machine-learning-model).

# A Quick Note on the Dataset

The original dataset used for this task is available [here](https://github.com/lakshay-arora/PySpark/blob/master/spark_streaming/datasets/twitter_sentiments.csv), whilst a copy of it is also available on the course's [website](https://github.com/gtolomei/big-data-computing/raw/master/datasets/twitter-sentiments.csv.bz2).

The dataset contains **31,962** tweets: **29,720** of them (i.e., ~93%) are labelled as _negative_ (`0`), and the remaining **2,242** (i.e., ~7%) are labelled as _positive_ (`1`, i.e., hate speech). To cope with this strong class imbalance, the dataset has been split into **2** portions using **stratified random sampling**, which preserves the same class distribution in both portions: a _training set_ with 90% of the original instances and a _test set_ with the remaining 10%. The former is used to train a machine learning model for hate speech detection, whilst the latter is used to test the model by simulating a stream of incoming tweets.
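How the split was actually produced is not shown in this notebook. As a minimal sketch of what stratified random sampling does, the plain-Python function below (`stratified_split` is a hypothetical helper, and the labels are toy data rather than the real tweets) shuffles each class separately and keeps the same fraction of every class in the training portion:

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.9, seed=42):
    """Return (train_idx, test_idx), keeping ~train_fraction of EACH class in train."""
    rng = random.Random(seed)
    # Group the instance indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random sampling within the class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a 9:1 class imbalance (hypothetical data)
labels = [0] * 900 + [1] * 100
train_idx, test_idx = stratified_split(labels)
train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), train_pos)  # 900 90
print(len(test_idx), test_pos)    # 100 10
```

Both portions keep the original 10% share of positive instances, which a plain (non-stratified) random split would only match approximately.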

The training set is available [here](https://github.com/gtolomei/big-data-computing/raw/master/datasets/twitter-sentiments-train.csv.bz2), and the test set is available [here](https://github.com/gtolomei/big-data-computing/raw/master/datasets/twitter-sentiments-test.csv.bz2).

In addition, the test set has been further processed to easily simulate the incoming stream of tweets. Roughly, for each tweet in the test set a record is created as follows:

`TWEET "${TWEET_TEXT_HERE}"`

The token `TWEET` is used by the streaming application to delimit individual tweets when they are streamed out. This file is also available from [here](https://github.com/gtolomei/big-data-computing/raw/master/datasets/twitter-sentiments-stream.txt).
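The exact preprocessing script is not part of this notebook. The sketch below only illustrates the record format described above; `to_stream_record` is a hypothetical helper, and collapsing whitespace so that each record stays on one line is an assumption:

```python
def to_stream_record(tweet_text):
    """Format one tweet as a single-line `TWEET "<text>"` record."""
    # Assumption: newlines and repeated spaces are collapsed so that
    # one tweet always corresponds to exactly one line of the stream file
    single_line = " ".join(tweet_text.split())
    return 'TWEET "{}"'.format(single_line)

# Hypothetical tweets, used only to show the resulting format
tweets = ["first example tweet", "second\nexample   tweet"]
records = [to_stream_record(t) for t in tweets]
print("\n".join(records))
# TWEET "first example tweet"
# TWEET "second example tweet"
```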

# Setting up the Workflow

The development of our streaming application is composed of the following stages:

-  **Model Building:** This step is devoted to the (_offline_) training of a machine learning model that classifies tweets as containing hate speech or not; more precisely, we will build a **logistic regression** pipeline. Note that our focus here is not on building a very accurate classification model, but on seeing how to use _any_ model to make nearly real-time predictions on streaming data;
-  **Spark Streaming Context Initialization:** Once the model is built, we need to define the `hostname` and `port number` from which we get the streaming data;
-  **Streaming Data:** Next, we send the tweets through a [`netcat`](http://netcat.sourceforge.net/) server listening on the defined port, and the Spark Streaming API collects the received data at fixed time intervals (i.e., the _batch interval_);
-  **Real-Time Prediction:** Once we receive the tweet text, we pass it to the machine learning pipeline we created and return the sentiment predicted by the model.

Here's a neat illustration of our workflow:

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/12/overview.png)
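The streaming source in the workflow above is a `netcat` server. If `netcat` is not at hand, a small TCP server written with Python's standard library could play the same role. The following is only a sketch under that assumption: `serve_lines` and its parameters are hypothetical names, and it serves a single client (the Spark receiver) before returning.

```python
import socket
import time

def serve_lines(lines, host="localhost", port=9876, delay_secs=0.0, ready=None):
    """Accept one TCP client and send it each line, newline-terminated.

    `ready` is an optional callback that receives the bound port number
    (handy when `port=0` lets the operating system pick a free port).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        if ready is not None:
            ready(server.getsockname()[1])
        conn, _ = server.accept()  # blocks until a client connects
        with conn:
            for line in lines:
                # One record per line, encoded as UTF-8 text
                conn.sendall((line.rstrip("\n") + "\n").encode("utf-8"))
                time.sleep(delay_secs)  # throttle to mimic tweets arriving over time

# Possible usage, run in a separate process BEFORE starting the Spark application:
#   with open("twitter-sentiments-stream.txt", encoding="utf-8") as f:
#       serve_lines(f, host="localhost", port=9876, delay_secs=0.5)
```

The delay between lines spreads the records over several batch intervals, so that each micro-batch contains only a handful of tweets.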

## Global Constants

In [0]:
LOCAL_SPARK_HOME = "/usr/local/spark" # change this to your own local Spark installation directory
APP_NAME = "PySparkTweetHateSpeechDetector"
HOSTNAME = "localhost"
PORT = 9876
DATASET_FILENAME = "twitter-sentiments-train.csv.bz2"
GDRIVE_DATASET_DIR = "/Users/gabriele/Google Drive UniRoma [@di.uniroma1.it]/Teaching/2019-20-BDC/datasets" # change this to your own local path to the dataset file (no shell-style backslash escapes are needed inside a Python string)
DATASET_FILE_PATH = GDRIVE_DATASET_DIR + "/" + DATASET_FILENAME
BATCH_INTERVAL_SECS = 3
RANDOM_SEED = 42

## Import `findspark` to Locate Local Spark Installation

In [0]:
import findspark
findspark.init(LOCAL_SPARK_HOME)

## Import all the other Libraries

In [0]:
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.streaming import StreamingContext
import pyspark.sql.types as tp
from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, Word2Vec, RegexTokenizer
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row, Column
import sys

## Create (local) `SparkContext`

In [0]:
def create_spark_context():
    sc = SparkContext(master="local[*]", appName=APP_NAME)
    spark = SparkSession(sc)
    
    return sc, spark

## Load the Dataset (Training Set) of Tweets

In [0]:
def load_dataset(dataset_file_path):
    # Define the schema of the dataset
    tweets_schema = tp.StructType([tp.StructField(name="id", dataType=tp.IntegerType(), nullable=True), 
                                  tp.StructField(name="tweet", dataType=tp.StringType(), nullable=True), 
                                  tp.StructField(name="label", dataType=tp.IntegerType(), nullable=True)
                             ])
    # Loading the data set
    tweet_df = spark.read.csv(dataset_file_path, 
                              schema=tweets_schema, 
                              header=True)
    
    return tweet_df
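Before handing the file to Spark, it can be useful to sanity-check the class balance of the compressed CSV with the standard library alone. The sketch below assumes that the file has a header row containing a column named `label`; `count_labels` is a hypothetical helper, not part of the application.

```python
import bz2
import csv
from collections import Counter

def count_labels(csv_bz2_path, label_column="label"):
    """Count how many rows of a bz2-compressed CSV file fall into each label."""
    # bz2.open in text mode decompresses on the fly; csv handles quoted commas
    with bz2.open(csv_bz2_path, mode="rt", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        return Counter(row[label_column] for row in reader)

# Possible usage (labels are read as strings, e.g. "0" and "1"):
#   print(count_labels(DATASET_FILE_PATH))
```

If the training set was obtained by stratified sampling as described earlier, the two counts should be in roughly the same 93%/7% proportion as in the full dataset.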

## Defining the Stages of our Machine Learning Pipeline

Here we define the stages through which the data is transformed before our model produces the predicted label.

The pipeline is composed of **4 stages** in total:

1. [`RegexTokenizer`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.RegexTokenizer): In the first stage, we use a regular expression to convert each tweet (i.e., a text string) into a list of _proper_ words;
2. [`StopWordsRemover`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover): We then remove the stop words from the word list obtained before;
3. [`Word2Vec`](https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html#pyspark.mllib.feature.Word2Vec): We create _word vectors_ using [Word2Vec](https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#word2vec), which is a (shallow) neural-network-based model that maps each word to a vector of numbers. Word2Vec is a well-known, standard technique to extract features from raw text, and it is an alternative to (and often better performing than) traditional bag-of-words approaches such as TF-IDF. For more information on how Word2Vec works, please refer to this [source](https://code.google.com/archive/p/word2vec/);
4. [`LogisticRegression`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression): In the final stage, we use the word vectors extracted by Word2Vec to build a logistic regression model and predict whether a tweet contains hate speech or not.

**REMEMBER:** Our focus here is not on building a very accurate classification model, but rather on seeing how we can use a predictive model to get results on streaming data.

Below is an illustration of the pipeline just described:

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/12/pipeline_streaming.png)
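To make the four stages more concrete, here is a rough plain-Python approximation of what each of them does to a single tweet. It is not how Spark implements them: the stop-word list, the 2-dimensional `word_vectors`, and the logistic regression `weights` below are made-up toy values (Spark ships a much longer stop-word list and learns vectors and weights from the training data). The averaging step follows the documented behaviour of Spark's `Word2Vec` model, which represents a document by the average of its word vectors; treating unknown words as zero vectors is an assumption of this sketch.

```python
import math
import re

# Toy stop-word list (assumption; Spark's default list is much longer)
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "i"}

def tokenize(text):
    """Stage 1: lowercase, split on non-word characters, drop empty tokens."""
    return [tok for tok in re.split(r"\W", text.lower(), flags=re.ASCII) if tok]

def remove_stop_words(tokens):
    """Stage 2: keep only the tokens that are not stop words."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

def average_vector(tokens, word_vectors, size):
    """Stage 3: average the word vectors (unknown words count as zero vectors)."""
    if not tokens:
        return [0.0] * size
    total = [0.0] * size
    for tok in tokens:
        for i, value in enumerate(word_vectors.get(tok, [0.0] * size)):
            total[i] += value
    return [value / len(tokens) for value in total]

def predict(features, weights, bias, threshold=0.5):
    """Stage 4: logistic regression, i.e., sigmoid of a weighted sum."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = 1.0 / (1.0 + math.exp(-z))
    return (1 if probability > threshold else 0), probability

# Hypothetical 2-dimensional word vectors and model parameters
word_vectors = {"love": [1.0, 0.0], "hate": [0.0, 1.0]}
weights, bias = [-2.0, 2.0], 0.0

tokens = tokenize("I love the sunshine!")
filtered = remove_stop_words(tokens)
features = average_vector(filtered, word_vectors, size=2)
label, probability = predict(features, weights, bias)
print(tokens)    # ['i', 'love', 'the', 'sunshine']
print(filtered)  # ['love', 'sunshine']
print(features)  # [0.5, 0.0]
print(label)     # 0
```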

In [0]:
def ml_pipeline(train):
    
    print("***** Defining the pipeline stages *****\n")
    
    # define stage 1: tokenize the tweet text  
    stage_1 = RegexTokenizer(inputCol="tweet", outputCol="tokens", pattern="\\W")
    # define stage 2: remove the stop words
    stage_2 = StopWordsRemover(inputCol="tokens", outputCol="filtered_words")
    # define stage 3: map each tweet to a feature vector of size 100 (the average of its word vectors)
    stage_3 = Word2Vec(inputCol="filtered_words", outputCol="feature_vector", vectorSize=100)
    # define stage 4: Logistic Regression Model
    model = LogisticRegression(featuresCol="feature_vector", labelCol="label") 
    
    print("***** Create the corresponding pipeline *****\n")
    pipeline = Pipeline(stages=[stage_1, stage_2, stage_3, model])

    print("***** Fit the pipeline to the training data *****\n")
    pipeline_fit = pipeline.fit(train)
    
    return pipeline_fit

## Function Used to Get Online Predictions from Streamed Tweets

In [0]:
# Define the function to get the predicted sentiment on the data received
def get_prediction(pipeline_fit, tweet_text):
    try:
        # remove blank tweets
        tweet_text = tweet_text.filter(lambda x: len(x) > 0)
        # an empty micro-batch cannot be converted into a DataFrame, so skip it
        if tweet_text.isEmpty():
            print("\nWaiting for streaming data...\n")
            return
        # create the dataframe with each row containing the text of a tweet
        row_rdd = tweet_text.map(lambda w: Row(tweet=w))
        words_df = spark.createDataFrame(row_rdd)
        # get the prediction for each row
        pipeline_fit.transform(words_df).select("tweet", "prediction").show(truncate=False)
    except Exception as e:
        # report unexpected errors instead of silently swallowing them
        print("\nError while processing the current batch: {}\n".format(e))

## Main Application Entry Point

In [0]:
if __name__ == "__main__":
    
    if len(sys.argv) != 3:
        print("""Wrong number of input arguments!
Usage: 
> check-tweet-sentiment.py HOSTNAME PORT
where
    - HOSTNAME is the hostname of the TCP streaming data source (e.g., localhost)
    - PORT is the port number of the TCP streaming data source (e.g., 9999)""", 
      file=sys.stderr)
        
        sys.exit(-1)
    
    # Save the `hostname` and `port` of the TCP source, where the streaming data will come from
    hostname = sys.argv[1]
    port = int(sys.argv[2])
        
    # Create a local SparkContext that uses all the available cores (`local[*]`); at least 2 threads are needed (i.e., one for the receiver and the other for processing)
    sc, spark = create_spark_context()
    
    # Setting the locale to "en-US" 
    # By default, the locale is the one of the JVM where Spark is running 
    # (which, in turn, is the system locale of the host machine where such a JVM is running)
    locale = sc._jvm.java.util.Locale
    locale.setDefault(locale.forLanguageTag("en-US"))
    
    # Load the dataset of tweets
    print("***** Loading the training set of tweets... *****\n")
    tweet_train_df = load_dataset(DATASET_FILE_PATH)
    
    print("=> Dataset schema:\n")
    tweet_train_df.printSchema()
    print("=> Dataset excerpt (first 5 rows):\n")
    tweet_train_df.show(5, truncate=False)
    print("=> Removing null entries...\n")
    tweet_train_df = tweet_train_df.na.drop()
    print("=> Dataset size (n. of rows): {:d}\n".format(tweet_train_df.count()))
    
    # Fit the ML pipeline to the training set of tweets
    pipeline_fit = ml_pipeline(tweet_train_df)
    
    # Test the hate speech prediction model online with incoming streaming data
    print("***** Model for hate speech detection successfully trained! Now, waiting for the incoming streaming data to classify... *****\n")
    
    # Create the streaming context using the specified batch interval
    ssc = StreamingContext(sc, batchDuration=BATCH_INTERVAL_SECS)
    
    # Get streaming data from the source TCP socket 
    lines = ssc.socketTextStream(hostname, port)
    
    # Split each incoming line on the "TWEET" delimiter, so that each element of the resulting DStream is the text of a single tweet
    words = lines.flatMap(lambda line: line.split("TWEET"))
    
    # Send each RDD associated with this DStream to the model, and ask it for a prediction on each of them 
    words.foreachRDD(lambda tweet_rdd: get_prediction(pipeline_fit, tweet_rdd))

    # Start the computation
    ssc.start()

    # Wait for the computation to terminate
    ssc.awaitTermination()
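To see what the `flatMap` on the `"TWEET"` delimiter and the blank-tweet filter produce for one micro-batch, the same logic can be replayed on a plain Python list. This is only an illustration with made-up lines; `split_batch` is a hypothetical helper, not part of the application.

```python
def split_batch(lines, delimiter="TWEET"):
    """Mimic `flatMap(lambda line: line.split(delimiter))` plus the blank filter."""
    fragments = [frag for line in lines for frag in line.split(delimiter)]
    return [frag for frag in fragments if len(frag) > 0]

# Hypothetical lines received during one batch interval
batch = ['TWEET "first example tweet"', 'TWEET "second example tweet"', ""]
print(split_batch(batch))
# [' "first example tweet"', ' "second example tweet"']
```

Each fragment keeps its leading space and the surrounding double quotes; they are discarded later by the tokenizer, which splits on non-word characters. Note also that a tweet whose text itself contains the uppercase string `TWEET` would be split into several fragments.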