# Ant Man: Summary from VADER Sentiment Tool

In this notebook we compute daily summary data for the file 'Antman'.

We compute:
- total tweets per day
- number of positive, negative and neutral tweets per day
- summary stats of sentiment per day: mean, median, std dev, percentiles

We save the result as a CSV file.

In [3]:
!pip install findspark

Defaulting to user installation because normal site-packages is not writeable
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl.metadata (352 bytes)
Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [1]:
import findspark
findspark.init()

## Import Libraries needed:

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://lachlan-tower:7077").appName("MovieTweetExplorer").getOrCreate()

24/03/07 22:57:19 WARN Utils: Your hostname, lachlan-tower resolves to a loopback address: 127.0.1.1; using 137.56.70.23 instead (on interface enp0s31f6)
24/03/07 22:57:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/07 22:57:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 53048)
Traceback (most recent call last):
  File "/usr/lib/python3.8/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.8/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python3.8/socketserver.py", line 360, in finish_request
    sel

In [4]:
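# SparkSession has no `sqlContext` attribute (hence the error below); the option can be set through spark.conf instead:
#spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")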

AttributeError: 'SparkSession' object has no attribute 'sqlContext'

In [3]:
# pyspark functions
from pyspark.sql.functions import col, udf, avg
from pyspark.sql.types import DateType
from pyspark.sql.functions import mean, stddev, min, max, count

# sentiment analysis
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# classifying vader scores into bins
from pyspark.ml.feature import Bucketizer

## Declaring the functions we need

### Daily tweet counts:

In [4]:
def dailyTweetCount(dataset):
    """
    Counts the number of tweets posted per day for an individual dataset

    INPUTS:
    dataset = a Spark DataFrame
    
    OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION: 
    - None
    
    OBJECTS CREATED WITHIN THIS FUNCTION:
    - dailyCounts = Spark DataFrame, an aggregate count of the number of
                    tweets per day
    
    FILES CREATED BY THIS FUNCTION: None
    
    RETURNS: 
    - dailyCounts - a Spark DataFrame
    
    """
    
    dataset = dataset.withColumn('dateColumn', dataset['postedTime'].cast('date'))
    dailyCounts = dataset.groupby(dataset.dateColumn).count()
    dailyCounts = dailyCounts.withColumnRenamed("count", "totalTweets")
    return dailyCounts

### Individual Tweet Sentiment:

In [5]:
analyzer = SentimentIntensityAnalyzer()

def negativeScore(text):
    """
    The proportion of words in a tweet that are classified as negative

    INPUTS:
    text = a string, the body of a single tweet
    
    OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION: 
    -  vaderSentiment polarity analyzer: imported as SentimentIntensityAnalyzer()
    
    OBJECTS CREATED WITHIN THIS FUNCTION:
    - negativeScores = a float, the 'neg' entry of VADER's polarity scores
    
    FILES CREATED BY THIS FUNCTION: None
    
    RETURNS: 
    - negativeScores - a float; wrap this function with udf() to apply it
        to a Spark column
    
    """
    # udf() must wrap the function itself, not an already computed score,
    # so this function returns the plain Python value
    negativeScores = analyzer.polarity_scores(text).get('neg')
    return negativeScores

def positiveScore(text):
    """
    The proportion of words in a tweet that are classified as positive

    INPUTS:
    text = a string, the body of a single tweet
    
    OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION: 
    -  vaderSentiment polarity analyzer: imported as SentimentIntensityAnalyzer()
    
    OBJECTS CREATED WITHIN THIS FUNCTION:
    - positiveScores = a float, the 'pos' entry of VADER's polarity scores
    
    FILES CREATED BY THIS FUNCTION: None
    
    RETURNS: 
    - positiveScores - a float; wrap this function with udf() to apply it
        to a Spark column
    
    """
    # udf() must wrap the function itself, not an already computed score,
    # so this function returns the plain Python value
    positiveScores = analyzer.polarity_scores(text).get('pos')
    return positiveScores
    
def neutralScore(text):
    """
    The proportion of words in a tweet that are classified as neutral

    INPUTS:
    text = a string, the body of a single tweet
    
    OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION: 
    -  vaderSentiment polarity analyzer: imported as SentimentIntensityAnalyzer()
    
    OBJECTS CREATED WITHIN THIS FUNCTION:
    - neutralScores = a float, the 'neu' entry of VADER's polarity scores
    
    FILES CREATED BY THIS FUNCTION: None
    
    RETURNS: 
    - neutralScores - a float; wrap this function with udf() to apply it
        to a Spark column
    
    """
    # the original returned an undefined name (neutralScores_udf);
    # return the plain Python value instead
    neutralScores = analyzer.polarity_scores(text).get('neu')
    return neutralScores

def compoundScore(text):
    """
    The compound sentiment score of a tweet: a single normalized measure of
    its overall sentiment

    INPUTS:
    text = a string, the body of a single tweet
    
    OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION: 
    -  vaderSentiment polarity analyzer: imported as SentimentIntensityAnalyzer()
    
    OBJECTS CREATED WITHIN THIS FUNCTION:
    - compoundScores = a float, the sum of the valence scores of each word in
        the lexicon, normalized to be between -1 (most extreme negative) and
        +1 (most extreme positive)
    
    FILES CREATED BY THIS FUNCTION: None
    
    RETURNS: 
    - compoundScores - a float between -1 and +1; wrapped as compound_udf
        below so it can be applied to a Spark column
    """
    
    compoundScores = analyzer.polarity_scores(text).get('compound')
    return compoundScores

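# convert to udfs
# negative_udf = udf(negativeScore)
# positive_udf = udf(positiveScore)
# neutral_udf = udf(neutralScore)
# udf() defaults to a string return type; returnCompoundScore casts the result to double
compound_udf = udf(compoundScore)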

In [6]:
def returnCompoundScore(dataset, textColumn, outputColumn):
    sentiment = dataset.withColumn(outputColumn, compound_udf(col(textColumn)).cast('Double'))
    return sentiment

In [7]:
def vaderClassify(dataset, textColumn, thresholds):
    
    # return vader score from text as column 'vaderScore'
    outCol  = 'vaderScore'
    sentimentData = returnCompoundScore(dataset, textColumn, outCol)
    
    # classify using thresholds, returns a Classifier
    bucketizer = Bucketizer(splits = thresholds, inputCol = "vaderScore", outputCol = "vaderClassifier")
    
    print("Bucketizer output with %d buckets" % (len(bucketizer.getSplits())-1))
    
    bucketedData = bucketizer.transform(sentimentData)
    return bucketedData
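
To make the bucket codes concrete, here is a minimal pure-Python sketch of how a three-bucket split assigns a compound score to a bucket. It assumes Spark's documented `Bucketizer` convention: every bucket includes its lower bound and excludes its upper bound, except the last bucket, which includes both. `bucket_index` is a hypothetical helper, not part of this notebook or of Spark, and the sample scores are taken from the `vaderAnalyzed.show(20)` output.

```python
from bisect import bisect_right

def bucket_index(score, splits):
    """Return the bucket a score falls into, mimicking Bucketizer's rule (hypothetical helper)."""
    if not splits[0] <= score <= splits[-1]:
        raise ValueError("score lies outside the splits")
    if score == splits[-1]:
        # the last bucket is closed on both sides
        return float(len(splits) - 2)
    # every other bucket is [lower, upper)
    return float(bisect_right(splits, score) - 1)

thresholds = [-1.0, -0.05, 0.05, 1.0]
for score in [-0.1779, 0.0, 0.4215]:
    print(score, bucket_index(score, thresholds))
```

Under this rule 0.0 is the negative bucket, 1.0 the neutral bucket and 2.0 the positive bucket. A score of exactly -0.05 lands in the neutral bucket, because the lower bound of a bucket is inclusive.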

### Daily Sentiment Statistics

In [8]:
def vaderStats(dataset, textColumn):
    
    # get vaderscores
    outCol  = 'vaderScore'
    sentimentData = returnCompoundScore(dataset, textColumn, outCol)
    
    # aggregate functions
    aggStats = [mean, stddev, min, max, count]
    aggVariable = ["vaderScore"] 
    exprs = [iStat(col(iVariable)) for iStat in aggStats for iVariable in aggVariable]
    
    #dailyStats = dailyTweetCount(sentimentData)
    
    # summary stats 
    dailyStats = sentimentData.groupby('date').agg(*exprs)
    
    # rename cols: agg() yields the grouping column followed by one column
    # per statistic, in the order listed in aggStats
    newNames  = ["date", "avgScore", "stdDev", "minScore", "maxScore", "totalTweets"]
    
    # toDF renames all columns positionally; this avoids reduce and xrange,
    # which are not available as written under Python 3
    dailyStats = dailyStats.toDF(*newNames)
    
    dailyStats  = dailyStats.orderBy(['date'], ascending=False)
    
    return dailyStats
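
As a cross-check on what the grouped aggregation returns, here is a minimal pure-Python sketch of the same per-day statistics on a few made-up scores. `daily_stats` and the sample rows are hypothetical and not from the notebook. The sketch uses the sample standard deviation, on the assumption that Spark's `stddev` is the sample version, and adds a median purely for illustration, since `vaderStats` does not compute one.

```python
from collections import defaultdict
from statistics import mean, median, stdev

def daily_stats(rows):
    """Group (date, score) pairs by date and summarize each group (hypothetical helper)."""
    by_day = defaultdict(list)
    for day, score in rows:
        by_day[day].append(score)

    summary = {}
    for day, scores in by_day.items():
        summary[day] = {
            "avgScore": mean(scores),
            "medianScore": median(scores),
            # the sample standard deviation is undefined for a single tweet
            "stdDev": stdev(scores) if len(scores) > 1 else None,
            "minScore": min(scores),
            "maxScore": max(scores),
            "totalTweets": len(scores),
        }
    return summary

# made-up (date, vaderScore) pairs, for illustration only
rows = [
    ("2016-01-11", 0.5),
    ("2016-01-11", -0.5),
    ("2016-01-11", 0.0),
    ("2016-01-12", 0.25),
]
stats = daily_stats(rows)
for day in sorted(stats, reverse=True):
    print(day, stats[day])
```

Sorting the days in reverse mirrors the `orderBy(['date'], ascending=False)` call in `vaderStats`.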

In [9]:
def vaderCountsByClassification(dataset, textColumn, thresholds):
    
    classifiedData = vaderClassify(dataset, textColumn, thresholds)
    vaderCounts = classifiedData.groupby(classifiedData.date, classifiedData.vaderClassifier).count()
    vaderCounts = vaderCounts.withColumnRenamed("count", "nTweets")
    vaderCounts= vaderCounts.orderBy(['date', 'vaderClassifier'], ascending=False)
    return vaderCounts

## Import Data

In [10]:
#dataPath = '/twitter/movie/DeerAntMan/'
#dataPath = "/home/lachlan/from_zurich/all-twitter-data/twitter-chicago/DeerAntMan/"
dataPath = "/home/lachlan/from_zurich/all-twitter-data/twitter-chicago/DeerSpectre/"

In [11]:
def loadTwitterData(filePath):
    
    df = spark.read.option("mode", "DROPMALFORMED").json(filePath + '*.gz', multiLine=True)
    df2 = df.select('body', 'postedTime', 'retweetCount').na.drop()
    df2 = df2.withColumn('date', df2['postedTime'].cast('date'))
    
    return df2

In [12]:
df = loadTwitterData(dataPath)

24/03/07 22:58:01 WARN TaskSetManager: Stage 1 contains a task of very large size (1461 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

In [13]:
df.show(20)

                                                                                

+--------------------+--------------------+------------+----------+
|                body|          postedTime|retweetCount|      date|
+--------------------+--------------------+------------+----------+
|I liked a @YouTub...|2016-01-11T02:50:...|           0|2016-01-11|
|RT @BilgeEbiri: G...|2016-02-29T04:20:...|          16|2016-02-29|
|RT @JimsTweetings...|2015-10-26T19:20:...|         244|2015-10-26|
|It's the premiere...|2015-10-26T19:10:...|           0|2015-10-26|
|Watch the New 007...|2015-07-22T07:50:...|           0|2015-07-22|
|Sam Smith, James ...|2015-10-26T19:50:...|           0|2015-10-26|
|Full-length trail...|2015-07-22T08:00:...|           0|2015-07-22|
|Before #SPECTRE w...|2015-10-26T20:00:...|           0|2015-10-26|
|I hear George Clo...|2015-10-26T18:10:...|           0|2015-10-26|
|RT @JackHoward: T...|2015-10-26T18:00:...|          66|2015-10-26|
|RT @taran_adarsh:...|2015-07-22T07:40:...|          51|2015-07-22|
|RT @Independent: ...|2015-07-22T07:30:...|     

## Get VADER classifier results:

In [14]:
thresholds = [-1.0, -0.05, 0.05, 1.0]
textCol = 'body'

In [15]:
vaderAnalyzed = vaderClassify(df, textCol, thresholds)

Bucketizer output with 3 buckets


In [16]:
vaderAnalyzed.show(20)


+--------------------+--------------------+------------+----------+----------+---------------+
|                body|          postedTime|retweetCount|      date|vaderScore|vaderClassifier|
+--------------------+--------------------+------------+----------+----------+---------------+
|I liked a @YouTub...|2016-01-11T02:50:...|           0|2016-01-11|    0.4215|            2.0|
|RT @BilgeEbiri: G...|2016-02-29T04:20:...|          16|2016-02-29|    0.9593|            2.0|
|RT @JimsTweetings...|2015-10-26T19:20:...|         244|2015-10-26|    0.5994|            2.0|
|It's the premiere...|2015-10-26T19:10:...|           0|2015-10-26|   -0.1779|            0.0|
|Watch the New 007...|2015-07-22T07:50:...|           0|2015-07-22|       0.0|            1.0|
|Sam Smith, James ...|2015-10-26T19:50:...|           0|2015-10-26|       0.0|            1.0|
|Full-length trail...|2015-07-22T08:00:...|           0|2015-07-22|    0.4767|            2.0|
|Before #SPECTRE w...|2015-10-26T20:00:...|       

                                                                                

In [17]:
vaderClassified = vaderCountsByClassification(df, textCol, thresholds)

Bucketizer output with 3 buckets


In [18]:
vaderClassified.show(20)



+----------+---------------+-------+
|      date|vaderClassifier|nTweets|
+----------+---------------+-------+
|2016-05-07|            2.0|     29|
|2016-05-07|            1.0|     42|
|2016-05-07|            0.0|     25|
|2016-05-06|            2.0|     22|
|2016-05-06|            1.0|     44|
|2016-05-06|            0.0|     25|
|2016-05-05|            2.0|     31|
|2016-05-05|            1.0|     43|
|2016-05-05|            0.0|     19|
|2016-05-04|            2.0|     40|
|2016-05-04|            1.0|     37|
|2016-05-04|            0.0|      9|
|2016-05-03|            2.0|     42|
|2016-05-03|            1.0|     35|
|2016-05-03|            0.0|     12|
|2016-05-02|            2.0|     40|
|2016-05-02|            1.0|     36|
|2016-05-02|            0.0|      9|
|2016-05-01|            2.0|     33|
|2016-05-01|            1.0|     24|
+----------+---------------+-------+
only showing top 20 rows
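
The table above holds one row per day and bucket, whereas a daily summary usually wants one row per day. A minimal pure-Python sketch of that reshaping follows, using the 2016-05-07 and 2016-05-06 rows from the table. `pivot_counts` and `LABELS` are hypothetical names, and the mapping of bucket codes to labels is an assumption that follows from the thresholds `[-1.0, -0.05, 0.05, 1.0]`.

```python
from collections import defaultdict

# assumed meaning of the bucket codes, given the three-way split
LABELS = {0.0: "negative", 1.0: "neutral", 2.0: "positive"}

def pivot_counts(rows):
    """Turn (date, bucket, nTweets) rows into one dict of label counts per date."""
    wide = defaultdict(lambda: {label: 0 for label in LABELS.values()})
    for day, bucket, n_tweets in rows:
        wide[day][LABELS[bucket]] += n_tweets
    return dict(wide)

# rows copied from the vaderClassified.show(20) output
rows = [
    ("2016-05-07", 2.0, 29),
    ("2016-05-07", 1.0, 42),
    ("2016-05-07", 0.0, 25),
    ("2016-05-06", 2.0, 22),
    ("2016-05-06", 1.0, 44),
    ("2016-05-06", 0.0, 25),
]
wide = pivot_counts(rows)
for day in sorted(wide, reverse=True):
    print(day, wide[day])
```

In Spark itself, a similar reshaping could be attempted with `groupBy('date').pivot('vaderClassifier')` followed by a sum over `nTweets`, though that is not run in this notebook.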



                                                                                

In [16]:
#classifiedData.select("vaderClassifier").distinct().show()

## Daily Summary stats out of the sentiment index

In [17]:
dailyStatistics = vaderStats(df, 'body')

In [18]:
dailyStatistics.show(20)

+----------+--------------------+-------------------+--------+--------+-----------+
|      date|            avgScore|             stdDev|minScore|maxScore|totalTweets|
+----------+--------------------+-------------------+--------+--------+-----------+
|2016-01-16| 0.17615365853658535| 0.4218355230996798| -0.7152|  0.9022|         41|
|2016-01-15| 0.31129318181818183|0.38429921082511137| -0.3818|  0.9474|         44|
|2016-01-14|  0.1648294117647059|0.41890397768181553| -0.7118|  0.8475|         51|
|2016-01-13|-0.21386315789473684| 0.5240459775864724| -0.7118|  0.9167|         76|
|2016-01-12| 0.18567169811320758| 0.2949024723905531| -0.6114|  0.9151|         53|
|2016-01-11| 0.19425272727272727|0.35801861520738953| -0.5994|  0.8908|         55|
|2016-01-10| 0.11873928571428571|0.41725946219613497| -0.8039|  0.8402|         56|
|2016-01-09| 0.14811363636363636|0.40568942757101717| -0.8001|  0.9595|        154|
|2016-01-08| 0.03624892703862664|0.43017576841769883| -0.8095|  0.9519|     