# Ant Man: Summary from VADER Sentiment Tool

In this notebook we compute daily summary data for the 'Antman' tweet file.

We compute:
- Total tweets per day
- Number of positive, negative and neutral tweets per day
- Summary stats of sentiment per day: mean, median, std dev, percentiles

We save the result as a CSV file.

## Import Libraries needed:

In [1]:
# pyspark functions
from pyspark.sql.functions import col, udf, avg, date_format
from pyspark.sql.types import DateType
# aggregates used in vaderStats (note: min and max shadow Python's built-ins here)
from pyspark.sql.functions import mean, stddev, min, max, count

# sentiment analysis
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# classifying vader scores into bins
from pyspark.ml.feature import Bucketizer

## Declaring the functions we need

### Daily tweet counts:

In [2]:
def dailyTweetCount(dataset):
    """
    Counts the number of tweets posted per day for an individual dataset

    INPUTS:
    dataset = a Spark DataFrame
    
    OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION: 
    - None
    
    OBJECTS CREATED WITHIN THIS FUNCTION:
    - dailyCounts = Spark DataFrame, an aggregate count of the number of
                    tweets per day
    
    FILES CREATED BY THIS FUNCTION: None
    
    RETURNS: 
    - dailyCounts - a Spark DataFrame
    
    """
    
    dataset = dataset.withColumn('dateColumn', dataset['postedTime'].cast('date'))
    dailyCounts = dataset.groupby(dataset.dateColumn).count()
    dailyCounts = dailyCounts.withColumnRenamed("count", "totalTweets")
    return dailyCounts
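
As a plain-Python illustration of what `dailyTweetCount` computes (cast the timestamp to a date, group, count), here is a minimal sketch that runs without Spark. The sample timestamps and the name `daily_tweet_count` are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from datetime import datetime

# Hypothetical postedTime values in ISO 8601 form; not taken from the dataset.
posted_times = [
    "2015-04-13T13:30:05",
    "2015-04-13T18:02:41",
    "2015-04-14T09:15:00",
]

def daily_tweet_count(timestamps):
    """Count tweets per calendar day: parse, keep the date part, tally."""
    days = (datetime.fromisoformat(ts).date() for ts in timestamps)
    return Counter(days)

daily_counts = daily_tweet_count(posted_times)
for day, total in sorted(daily_counts.items()):
    print(day, total)
```

The Spark version does the same work in a distributed way: the `cast('date')` call plays the role of `.date()`, and `groupby(...).count()` plays the role of `Counter`.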

### Individual Tweet Sentiment:

In [3]:
analyzer = SentimentIntensityAnalyzer()

def negativeScore(text):
    """
    The proportion of a tweet's text that VADER classifies as negative

    INPUTS:
    text = a string containing the text of one tweet
    
    OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION: 
    -  vaderSentiment polarity analyzer: imported as SentimentIntensityAnalyzer()
    
    OBJECTS CREATED WITHIN THIS FUNCTION:
    - negativeScores = a float, the 'neg' entry of the VADER polarity scores
    
    FILES CREATED BY THIS FUNCTION: None
    
    RETURNS: 
    - negativeScores - a float between 0 and 1; wrap this function with udf()
        to apply it to a column of a Spark DataFrame
    
    """
    # return the plain score; udf() must wrap the function itself, not its result
    negativeScores = analyzer.polarity_scores(text).get('neg')
    return negativeScores

def positiveScore(text):
    """
    The proportion of a tweet's text that VADER classifies as positive

    INPUTS:
    text = a string containing the text of one tweet
    
    OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION: 
    -  vaderSentiment polarity analyzer: imported as SentimentIntensityAnalyzer()
    
    OBJECTS CREATED WITHIN THIS FUNCTION:
    - positiveScores = a float, the 'pos' entry of the VADER polarity scores
    
    FILES CREATED BY THIS FUNCTION: None
    
    RETURNS: 
    - positiveScores - a float between 0 and 1; wrap this function with udf()
        to apply it to a column of a Spark DataFrame
    
    """
    # return the plain score; udf() must wrap the function itself, not its result
    positiveScores = analyzer.polarity_scores(text).get('pos')
    return positiveScores
    
def neutralScore(text):
    """
    The proportion of a tweet's text that VADER classifies as neutral

    INPUTS:
    text = a string containing the text of one tweet
    
    OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION: 
    -  vaderSentiment polarity analyzer: imported as SentimentIntensityAnalyzer()
    
    OBJECTS CREATED WITHIN THIS FUNCTION:
    - neutralScores = a float, the 'neu' entry of the VADER polarity scores
    
    FILES CREATED BY THIS FUNCTION: None
    
    RETURNS: 
    - neutralScores - a float between 0 and 1; wrap this function with udf()
        to apply it to a column of a Spark DataFrame
    
    """
    # return the plain score; udf() must wrap the function itself, not its result
    neutralScores = analyzer.polarity_scores(text).get('neu')
    return neutralScores

def compoundScore(text):
    """
    The normalized overall sentiment score of a tweet

    INPUTS:
    text = a string containing the text of one tweet
    
    OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION: 
    -  vaderSentiment polarity analyzer: imported as SentimentIntensityAnalyzer()
    
    OBJECTS CREATED WITHIN THIS FUNCTION:
    - compoundScores = a float, the sum of the valence scores of each word
        found in the lexicon, normalized to be between 
        -1 (most extreme negative) and +1 (most extreme positive)
    
    FILES CREATED BY THIS FUNCTION: None
    
    RETURNS: 
    - compoundScores - the normalized compound score of the tweet
    """
    
    compoundScores = analyzer.polarity_scores(text).get('compound')
    return compoundScores

# convert to udfs (only the compound score is used below)
# negative_udf = udf(negativeScore)
# positive_udf = udf(positiveScore)
# neutral_udf = udf(neutralScore)
compound_udf = udf(compoundScore)
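
The docstring of `compoundScore` says the summed valence scores are normalized to lie between -1 and +1. The sketch below shows one way such a normalization can work. To the best of my knowledge it matches the formula in the vaderSentiment reference implementation (with `alpha = 15`), but treat both the formula and the constant as assumptions and check them against the installed library. The built-ins `min` and `max` are avoided on purpose, because this notebook imports Spark functions with the same names.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded sum of valence scores into the range [-1, 1].

    alpha is an assumed smoothing constant; larger values pull results toward 0.
    """
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    # Clamp explicitly instead of using min()/max(), which are shadowed above.
    if norm < -1.0:
        return -1.0
    if norm > 1.0:
        return 1.0
    return norm

for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize(raw), 4))
```

The function is odd and monotonic, so the sign of the raw sum is preserved and stronger raw scores always map to stronger compound scores.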

In [4]:
def returnCompoundScore(dataset, textColumn, outputColumn):
    sentiment = dataset.withColumn(outputColumn, compound_udf(col(textColumn)).cast('Double'))
    return sentiment

In [5]:
def vaderClassify(dataset, textColumn, thresholds):
    
    # return vader score from text as column 'vaderScore'
    outCol  = 'vaderScore'
    sentimentData = returnCompoundScore(dataset, textColumn, outCol)
    
    # bin the scores using thresholds; adds the bucket index as column 'vaderClassifier'
    bucketizer = Bucketizer(splits = thresholds, inputCol = outCol, outputCol = "vaderClassifier")
    
    print("Bucketizer output with %d buckets" % (len(bucketizer.getSplits())-1))
    
    bucketedData = bucketizer.transform(sentimentData)
    return bucketedData
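
To make the bucketing step concrete, here is a minimal plain-Python sketch of the rule that Spark's `Bucketizer` documents: each bucket is half-open, `[x, y)`, except the last one, which also includes its upper bound. The helper name `bucket_index` is hypothetical; the splits are the ones used later in this notebook.

```python
from bisect import bisect_right

def bucket_index(score, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) that holds score.

    The last bucket is closed on the right, so score == splits[-1] is accepted.
    """
    if score < splits[0] or score > splits[-1]:
        raise ValueError("score %r lies outside the splits" % (score,))
    if score == splits[-1]:
        return len(splits) - 2
    return bisect_right(splits, score) - 1

splits = [-1.0, -0.5, 0.5, 1.0]
for s in (-0.7, -0.5, 0.0, 0.5, 1.0):
    print(s, float(bucket_index(s, splits)))
```

With these splits, a score of exactly -0.5 falls into the middle bucket and a score of exactly 0.5 falls into the top bucket, which matters when reading the counts per bucket.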

### Daily Sentiment Statistics

In [6]:
def vaderStats(dataset, textColumn):
    
    # get vaderscores
    outCol  = 'vaderScore'
    sentimentData = returnCompoundScore(dataset, textColumn, outCol)
    
    # aggregate functions
    aggStats = [mean, stddev, min, max, count]
    aggVariable = ["vaderScore"] 
    exprs = [iStat(col(iVariable)) for iStat in aggStats for iVariable in aggVariable]
    
    #dailyStats = dailyTweetCount(sentimentData)
    
    # summary stats per day; expects a 'date' column in the dataset
    dailyStats = sentimentData.groupby('date').agg(*exprs)
    
    # rename cols: the grouping column comes first, then the aggregates in the order of aggStats
    newNames  = ["date", "avgScore", "stdDev", "minScore", "maxScore", "totalTweets"]
    dailyStats = dailyStats.toDF(*newNames)
    
    dailyStats  = dailyStats.orderBy(['date'], ascending=False)
    
    return dailyStats
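
The aggregation in `vaderStats` can be checked by hand on a small sample. The sketch below is a plain-Python equivalent under stated assumptions: the (date, score) pairs are hypothetical, `daily_stats` is a hypothetical helper, and the standard deviation is the sample standard deviation, which is what Spark's `stddev` computes. The `statistics` module is called through its module name so that nothing here collides with the Spark `mean`, `min` and `max` imported at the top of the notebook.

```python
import statistics
from collections import defaultdict

# Hypothetical (date, compound score) pairs; not taken from the dataset.
scored = [
    ("2016-01-15", 0.5),
    ("2016-01-15", -0.1),
    ("2016-01-16", 0.9),
    ("2016-01-16", 0.3),
    ("2016-01-16", 0.0),
]

def daily_stats(rows):
    """Group scores by day and compute mean, sample std dev, min, max, count."""
    by_day = defaultdict(list)
    for day, score in rows:
        by_day[day].append(score)
    stats = {}
    for day, scores in by_day.items():
        ordered = sorted(scores)
        stats[day] = {
            "avgScore": statistics.mean(ordered),
            # The sample standard deviation is undefined for a single tweet.
            "stdDev": statistics.stdev(ordered) if len(ordered) > 1 else None,
            "minScore": ordered[0],
            "maxScore": ordered[-1],
            "totalTweets": len(ordered),
        }
    return stats

stats = daily_stats(scored)
for day in sorted(stats, reverse=True):
    print(day, stats[day])
```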

In [7]:
def vaderCountsByClassification(dataset, textColumn, thresholds):
    
    classifiedData = vaderClassify(dataset, textColumn, thresholds)
    vaderCounts = classifiedData.groupby(classifiedData.date, classifiedData.vaderClassifier).count()
    vaderCounts = vaderCounts.withColumnRenamed("count", "nTweets")
    vaderCounts = vaderCounts.orderBy(['date', 'vaderClassifier'], ascending=False)
    return vaderCounts

## Import Data

In [8]:
dataPath = 'alluxio://master001:19998/twitter-chicago/DeerAntMan/'

In [36]:
from pyspark.sql.functions import to_utc_timestamp, window

In [70]:
def loadTwitterData(filePath):
    
    df = spark.read.json(filePath + '*.gz')
    df2 = df.select('body', 'postedTime', 'retweetCount').na.drop()
    #df2 = df2.withColumn('date', df2['postedTime'].cast('date'))
    #df2 = df2.withColumn('date_local', date_format('postedTime', 'dd-MM-yyy HH:mm:ss'))
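    # interpret postedTime as a time in the 'CEST' zone and shift it to UTC
    df2 = df2.withColumn('date_utc', to_utc_timestamp('postedTime', 'CEST'))
    # fixed 1-day and 1-minute windows; each is a struct with a start and an end
    df2 = df2.withColumn('date_window',  window("date_utc", "1 day"))
    df2 = df2.withColumn('minute_window', window("date_utc", "1 minute"))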
    
    #df2 = df2.withColumn('minute_window_start', 'minute_window.start')
    
    # expanding both structs yields two pairs of columns that are both named start and end
    df2 = df2.select('body', 'retweetCount', 'date_utc', 'date_window.*', 'minute_window.*')
    return df2
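
The `window` calls in `loadTwitterData` assign each tweet to a fixed-width time window. As a rough plain-Python picture of that assignment, the sketch below floors a timestamp to its window start, assuming windows aligned to the Unix epoch and half-open intervals `[start, end)`. The helper name `tumbling_window` and the sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def tumbling_window(ts, width):
    """Return (start, end) of the fixed-width window that contains ts.

    Windows are aligned to the Unix epoch; start is inclusive, end is exclusive.
    """
    offset = (ts - EPOCH) % width
    start = ts - offset
    return start, start + width

ts = datetime(2015, 4, 13, 13, 30, 25)  # hypothetical UTC timestamp
day_start, day_end = tumbling_window(ts, timedelta(days=1))
minute_start, minute_end = tumbling_window(ts, timedelta(minutes=1))
print(day_start, day_end)
print(minute_start, minute_end)
```

Grouping by the window start then gives one group per day or per minute, which is what the `start` columns are for.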

In [71]:
df = loadTwitterData(dataPath)

In [72]:
df.show(10)

+--------------------+------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                body|retweetCount|            date_utc|               start|                 end|               start|                 end|
+--------------------+------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|RT @Kotaku: The n...|          11|2015-04-13 13:30:...|2015-04-13 00:00:...|2015-04-14 00:00:...|2015-04-13 13:30:...|2015-04-13 13:31:...|
|RT @SuperheroFeed...|         428|2015-04-13 13:30:...|2015-04-13 00:00:...|2015-04-14 00:00:...|2015-04-13 13:30:...|2015-04-13 13:31:...|
|RT @SuperheroFeed...|         426|2015-04-13 13:30:...|2015-04-13 00:00:...|2015-04-14 00:00:...|2015-04-13 13:30:...|2015-04-13 13:31:...|
|Second Action-Pac...|           0|2015-04-13 13:30:...|2015-04-13 00:00:...|2015-04-14 00:00:...|2015-04-13 13:30:...|2015-04-13 13:31:...|
|RT @screenra

In [64]:
df.select('date_window.start').show(10)

+--------------------+
|               start|
+--------------------+
|2015-04-13 00:00:...|
|2015-04-13 00:00:...|
|2015-04-13 00:00:...|
|2015-04-13 00:00:...|
|2015-04-13 00:00:...|
|2015-04-13 00:00:...|
|2015-04-13 00:00:...|
|2015-04-13 00:00:...|
|2015-04-13 00:00:...|
|2015-04-13 00:00:...|
+--------------------+
only showing top 10 rows



In [62]:
df.show(20000)

+--------------------+------------+--------------------+--------------------+--------------------+
|                body|retweetCount|            date_utc|         date_window|       minute_window|
+--------------------+------------+--------------------+--------------------+--------------------+
|RT @Kotaku: The n...|          11|2015-04-13 13:30:...|[2015-04-13 00:00...|[2015-04-13 13:30...|
|RT @SuperheroFeed...|         428|2015-04-13 13:30:...|[2015-04-13 00:00...|[2015-04-13 13:30...|
|RT @SuperheroFeed...|         426|2015-04-13 13:30:...|[2015-04-13 00:00...|[2015-04-13 13:30...|
|Second Action-Pac...|           0|2015-04-13 13:30:...|[2015-04-13 00:00...|[2015-04-13 13:30...|
|RT @screenrant: '...|           8|2015-04-13 13:30:...|[2015-04-13 00:00...|[2015-04-13 13:30...|
|I added a video t...|           0|2015-04-13 13:30:...|[2015-04-13 00:00...|[2015-04-13 13:30...|
|Marvel's Ant-Man ...|           0|2015-04-13 13:30:...|[2015-04-13 00:00...|[2015-04-13 13:30...|
|Wasn't su

In [63]:
df.printSchema()

root
 |-- body: string (nullable = true)
 |-- retweetCount: long (nullable = true)
 |-- date_utc: timestamp (nullable = true)
 |-- date_window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- minute_window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)



In [20]:
spark.sql("select from_unixtime(unix_timestamp(), 'z')").show()

+--------------------------------------------------------------------------+
|from_unixtime(unix_timestamp(current_timestamp(), yyyy-MM-dd HH:mm:ss), z)|
+--------------------------------------------------------------------------+
|                                                                      CEST|
+--------------------------------------------------------------------------+



In [21]:
spark.conf.set("spark.sql.session.timeZone", "UTC")
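
The session time zone matters because the loader shifts `postedTime` from CEST to UTC. As a small illustration of what that shift means, here is a plain-Python sketch that treats CEST as a fixed offset of +02:00. That fixed offset is a simplifying assumption (it ignores daylight-saving transitions), and the helper name `cest_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CEST is modelled as a constant UTC+02:00 offset.
CEST = timezone(timedelta(hours=2), "CEST")

def cest_to_utc(naive_local):
    """Interpret a naive datetime as CEST wall-clock time and return naive UTC."""
    aware = naive_local.replace(tzinfo=CEST)
    return aware.astimezone(timezone.utc).replace(tzinfo=None)

print(cest_to_utc(datetime(2015, 4, 13, 15, 30)))
```

A tweet posted just after midnight CEST therefore lands on the previous UTC day, which changes the daily counts near day boundaries.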

## Get VADER classifier results:

In [11]:
thresholds = [-1.0, -0.5, 0.5, 1.0]
textCol = 'body'

In [12]:
#vaderAnalyzed = vaderClassify(df2, textCol, thresholds)

In [13]:
#vaderAnalyzed.show(5)

In [14]:
vaderClassified = vaderCountsByClassification(df, textCol, thresholds)

Bucketizer output with 3 buckets


In [15]:
vaderClassified.show(20)

+----------+---------------+-------+
|      date|vaderClassifier|nTweets|
+----------+---------------+-------+
|2016-01-16|            2.0|     10|
|2016-01-16|            1.0|     28|
|2016-01-16|            0.0|      3|
|2016-01-15|            2.0|     16|
|2016-01-15|            1.0|     28|
|2016-01-14|            2.0|     18|
|2016-01-14|            1.0|     30|
|2016-01-14|            0.0|      3|
|2016-01-13|            2.0|      9|
|2016-01-13|            1.0|     32|
|2016-01-13|            0.0|     35|
|2016-01-12|            2.0|      5|
|2016-01-12|            1.0|     47|
|2016-01-12|            0.0|      1|
|2016-01-11|            2.0|     11|
|2016-01-11|            1.0|     42|
|2016-01-11|            0.0|      2|
|2016-01-10|            2.0|     10|
|2016-01-10|            1.0|     39|
|2016-01-10|            0.0|      7|
+----------+---------------+-------+
only showing top 20 rows
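
The output above is in long form, one row per date and bucket, and a bucket with no tweets is simply missing (2016-01-15 has no 0.0 row). A minimal sketch of turning it into one record per day follows. The labels negative, neutral and positive for buckets 0.0, 1.0 and 2.0 are my reading of the thresholds `[-1.0, -0.5, 0.5, 1.0]`, and the helper name `pivot_counts` is hypothetical.

```python
# Rows copied from the first lines of the output above: (date, vaderClassifier, nTweets).
rows = [
    ("2016-01-16", 2.0, 10),
    ("2016-01-16", 1.0, 28),
    ("2016-01-16", 0.0, 3),
    ("2016-01-15", 2.0, 16),
    ("2016-01-15", 1.0, 28),
]

# Assumed label for each bucket index.
LABELS = {0.0: "negative", 1.0: "neutral", 2.0: "positive"}

def pivot_counts(long_rows):
    """Turn (date, bucket, count) rows into {date: {label: count}}, filling gaps with 0."""
    wide = {}
    for day, bucket, n_tweets in long_rows:
        counts = wide.setdefault(day, {label: 0 for label in LABELS.values()})
        counts[LABELS[bucket]] += n_tweets
    return wide

wide = pivot_counts(rows)
for day in sorted(wide, reverse=True):
    print(day, wide[day])
```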



In [16]:
#classifiedData.select("vaderClassifier").distinct().show()

## Daily Summary stats out of the sentiment index

In [17]:
dailyStatistics = vaderStats(df, 'body')

In [18]:
dailyStatistics.show(20)

+----------+--------------------+-------------------+--------+--------+-----------+
|      date|            avgScore|             stdDev|minScore|maxScore|totalTweets|
+----------+--------------------+-------------------+--------+--------+-----------+
|2016-01-16| 0.17615365853658535| 0.4218355230996798| -0.7152|  0.9022|         41|
|2016-01-15| 0.31129318181818183|0.38429921082511137| -0.3818|  0.9474|         44|
|2016-01-14|  0.1648294117647059|0.41890397768181553| -0.7118|  0.8475|         51|
|2016-01-13|-0.21386315789473684| 0.5240459775864724| -0.7118|  0.9167|         76|
|2016-01-12| 0.18567169811320758| 0.2949024723905531| -0.6114|  0.9151|         53|
|2016-01-11| 0.19425272727272727|0.35801861520738953| -0.5994|  0.8908|         55|
|2016-01-10| 0.11873928571428571|0.41725946219613497| -0.8039|  0.8402|         56|
|2016-01-09| 0.14811363636363636|0.40568942757101717| -0.8001|  0.9595|        154|
|2016-01-08| 0.03624892703862664|0.43017576841769883| -0.8095|  0.9519|     