## Tweets

A tweet consists of many data fields. [Here is an example](https://gist.github.com/arapat/03d02c9b327e6ff3f6c3c5c602eeaf8b). You can learn all about them in the Twitter API doc.

* `created_at`: Posted time of this tweet (time zone is included)
* `id_str`: Tweet ID - recommended using `id_str` over using `id` as Tweet IDs, becauase `id` is an integer and may bring some overflow problems.
* `text`: Tweet content
* `user`: A JSON object for information about the author of the tweet
    * `id_str`: User ID
    * `name`: User name (may contain spaces)
    * `screen_name`: User screen name (no spaces)
* `retweeted_status`: A JSON object for information about the retweeted tweet (i.e. this tweet is not original but retweeteed some other tweet)
    * All data fields of a tweet except `retweeted_status`
* `entities`: A JSON object for all entities in this tweet
    * `hashtags`: An array for all the hashtags that are mentioned in this tweet
    * `urls`: An array for all the URLs that are mentioned in this tweet

In [1]:
from pyspark import SparkContext

sc = SparkContext()

In [2]:
import json
import subprocess
import pickle

In [3]:
# reading file as RDD

file_path = []
with open("files_stored.txt") as file:
    file_path = [w.strip() for w in file.readlines() if w.strip()]

tweet_rdd = sc.textFile(file_path[0])
tweet_rdd.cache()

count = tweet_rdd.count()
print("Number of tweets : ", count)

Number of tweets :  2150


In [4]:
# filter broken tweets (invalid JSON strings)
# separate tweets from the contents (tweets + messages)
# messages don't have "created_at" field

def safe_parse(raw_json):
    try:
        json_obj = json.loads(raw_json)
        if json_obj.get("created_at") is None:
            return None
        else:
            return json_obj
    except ValueError as e:
        return None
    return True

# construct pair RDD of (user_id, text)

ut_rdd = tweet_rdd.map(safe_parse).filter(lambda ele: ele is not None).map(lambda json_obj: (json_obj["user"]["id_str"], json_obj["text"]))
ut_rdd

PythonRDD[3] at RDD at PythonRDD.scala:53

In [5]:
# evaluate number of unique users

users_count = len(ut_rdd.countByKey())
print("Number of unique users : ", users_count)

Number of unique users :  1748


In [6]:
# loading a pickle file
# contains a dictionary which represents a partition of over 452,743 twitter users
# users are partitioned into 7 groups

# proc = subprocess.Popen(["cat", "./users-partition.pickle"], shell=True, stdout=subprocess.PIPE)
# pickle_content = proc.communicate()[0]

partition = pickle.load(open("users-partition.pickle", "rb"))
print("Length of partition : ", len(partition))

Length of partition :  452743


In [15]:
# count the number of posts to each user partition
# count the number of posts from each group (0 - 6)
# assign users not in any group to group 7
# results -> (group_id, count)

def post_count(counts):
    for group_id, count in counts:
        print("group {} : {} tweets".format(group_id, count))

user_partition_rdd = ut_rdd.map(lambda user: (partition[user[0]] if user[0] in partition.keys() else 7, 1))\
                           .reduceByKey(lambda x, y: x + y)\
                           .sortByKey()

counts_per_partition = user_partition_rdd.collect()

post_count(counts_per_partition)

group 0 : 87 tweets
group 1 : 242 tweets
group 2 : 41 tweets
group 3 : 349 tweets
group 4 : 101 tweets
group 5 : 358 tweets
group 6 : 434 tweets
group 7 : 521 tweets


## Token Popularity in each User Partition

The number of mentions of a token is defined by $t$ in a specific user partition $k$ as the number of users from the user partition $k$ that ever mentioned the token $t$ in their tweets. Note that even if some users might mention a token $t$ multiple times or in multiple tweets, a user will contribute at most 1 to the counter of the token $t$. Also, the number of mentions of a token is equal to the number of users who mentioned this token but NOT the number of tweets that mentioned this token.

Let $N_t^k$ be the number of mentions of the token $t$ in the user partition $k$. Let $N_t^{all} = \sum_{i=0}^7 N_t^{i}$ be the number of total mentions of the token $t$.

We define the relative popularity of a token $t$ in a user partition $k$ as the log ratio between $N_t^k$ and $N_t^{all}$, i.e. 

\begin{equation}
p_t^k = \log \frac{N_t^k}{N_t^{all}}.
\end{equation}

In [16]:
"""
This code implements a basic, Twitter-aware tokenizer.

A tokenizer is a function that splits a string of text into words. In
Python terms, we map string and unicode objects into lists of unicode
objects.

There is not a single right way to do tokenizing. The best method
depends on the application.  This tokenizer is designed to be flexible
and this easy to adapt to new domains and tasks.  The basic logic is
this:

1. The tuple regex_strings defines a list of regular expression
   strings.

2. The regex_strings strings are put, in order, into a compiled
   regular expression object called word_re.

3. The tokenization is done by word_re.findall(s), where s is the
   user-supplied string, inside the tokenize() method of the class
   Tokenizer.

4. When instantiating Tokenizer objects, there is a single option:
   preserve_case.  By default, it is set to True. If it is set to
   False, then the tokenizer will downcase everything except for
   emoticons.
"""
######################################################################

import re
from html import entities 

######################################################################
# The following strings are components in the regular expression
# that is used for tokenizing. It's important that phone_number
# appears first in the final regex (since it can contain whitespace).
# It also could matter that tags comes after emoticons, due to the
# possibility of having text like
#
#     <:| and some text >:)
#
# Most imporatantly, the final element should always be last, since it
# does a last ditch whitespace-based tokenization of whatever is left.

# This particular element is used in a couple ways, so we define it
# with a name:
emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth      
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
    )"""

# The components of the tokenizer:
regex_strings = (
    # Phone numbers:
    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [\-\s.]*
      )?            
      (?:            # (area code)
        [\(]?
        \d{3}
        [\-\s.\)]*
      )?    
      \d{3}          # exchange
      [\-\s.]*   
      \d{4}          # base
    )"""
    ,
    # URLs:
    r"""http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"""
    ,
    # Emoticons:
    emoticon_string
    ,    
    # HTML tags:
     r"""<[^>]+>"""
    ,
    # Twitter username:
    r"""(?:@[\w_]+)"""
    ,
    # Twitter hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # Remaining word types:
    r"""
    (?:[a-z][a-z'\-_]+[a-z])       # Words with apostrophes or dashes.
    |
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?)  # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                     # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})            # Ellipsis dots. 
    |
    (?:\S)                         # Everything else that isn't whitespace.
    """
    )

######################################################################
# This is the core tokenizing regex:
    
word_re = re.compile(r"""(%s)""" % "|".join(regex_strings), re.VERBOSE | re.I | re.UNICODE)

# The emoticon string gets its own regex so that we can preserve case for them as needed:
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)

# These are for regularizing HTML entities to Unicode:
html_entity_digit_re = re.compile(r"&#\d+;")
html_entity_alpha_re = re.compile(r"&\w+;")
amp = "&amp;"

######################################################################

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        """
        Argument: s -- any string or unicode object
        Value: a tokenize list of strings; conatenating this list returns the original string if preserve_case=False
        """        
        # Try to ensure unicode:
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = s.encode('string_escape')
            s = str(s)
        # Fix HTML character entitites:
        s = self.__html2unicode(s)
        # Tokenize:
        words = word_re.findall(s)
        # Possible alter the case, but avoid changing emoticons like :D into :d:
        if not self.preserve_case:            
            words = map((lambda x : x if emoticon_re.search(x) else x.lower()), words)
        return words

    def tokenize_random_tweet(self):
        """
        If the twitter library is installed and a twitter connection
        can be established, then tokenize a random tweet.
        """
        try:
            import twitter
        except ImportError:
            print("Apologies. The random tweet functionality requires the Python twitter library: http://code.google.com/p/python-twitter/")
        from random import shuffle
        api = twitter.Api()
        tweets = api.GetPublicTimeline()
        if tweets:
            for tweet in tweets:
                if tweet.user.lang == 'en':            
                    return self.tokenize(tweet.text)
        else:
            raise Exception("Apologies. I couldn't get Twitter to give me a public English-language tweet. Perhaps try again")

    def __html2unicode(self, s):
        """
        Internal metod that seeks to replace all the HTML entities in
        s with their corresponding unicode characters.
        """
        # First the digits:
        ents = set(html_entity_digit_re.findall(s))
        if len(ents) > 0:
            for ent in ents:
                entnum = ent[2:-1]
                try:
                    entnum = int(entnum)
                    s = s.replace(ent, unichr(entnum))	
                except:
                    pass
        # Now the alpha versions:
        ents = set(html_entity_alpha_re.findall(s))
        ents = filter((lambda x : x != amp), ents)
        for ent in ents:
            entname = ent[1:-1]
            try:            
                s = s.replace(ent, unichr(entities.name2codepoint[entname]))
            except:
                pass                    
            s = s.replace(amp, " and ")
        return s

In [17]:
from math import log

tok = Tokenizer(preserve_case=False)

def get_rel_popularity(c_k, c_all):
    return log(1.0 * c_k / c_all) / log(2)


def print_tokens(tokens, gid = None):
    group_name = "overall"
    if gid is not None:
        group_name = "group %d" % gid
    print('=' * 5 + ' ' + group_name + ' ' + '=' * 5)
    for t, n in tokens:
        print("%s\t%.4f" % (t, n))
    print

In [23]:
# calculating the number of unique tokens

num_tokens = ut_rdd.flatMap(lambda ut: tok.tokenize(ut[1])).distinct()

# print(num_tokens.take(5))
print("Number of tokens : ", num_tokens.count())

Number of tokens :  7677


In [25]:
# filter tokens used by more than 100 users

user_token_rdd = ut_rdd.mapValues(lambda text: set(tok.tokenize(text)))\
                       .reduceByKey(lambda x, y: x|y)\
                       .flatMap(lambda user_token: ((user_token[0], token) for token in user_token[1]))\
                       .cache()
# print(user_token_rdd.take(5))

token_count_rdd = user_token_rdd.map(lambda user_token: (user_token[1], 1))\
                                .reduceByKey(lambda x, y: x + y)\
                                .filter(lambda user_token: user_token[1] > 100)\
                                .cache()
# print(token_count_rdd.take(5))

token_freq = token_count_rdd.count()
top_20 = token_count_rdd.sortBy(lambda x: x[1], ascending=False).take(20)

print("Token frequency : ", token_freq)
print_tokens(top_20)

Token frequency :  46
===== overall =====
:	1046.0000
rt	920.0000
.	767.0000
the	587.0000
trump	560.0000
…	520.0000
to	501.0000
,	497.0000
in	385.0000
a	383.0000
is	382.0000
of	300.0000
!	285.0000
for	275.0000
and	263.0000
on	218.0000
i	216.0000
he	191.0000
that	190.0000
"	181.0000


## Relative Popularity

Computing relative popularity for all the tokens mentioned by atleast 100 users.
If two tokens haev the same relative popularity, then choose alphabetically.
Let the relative popularity of a token $t$ be $p$. The order of the items will be satisfied by sorting them using (-p, t) as the key.

In [27]:
def get_partition_id(user):
    if user in partition.keys():
        return partition[user]
    return 7

token_dict = dict(token_count_rdd.collect())
token_word = list(token_dict.keys())

token_popularity_count_rdd = user_token_rdd.filter(lambda user_token: user_token[1] in token_word)\
.map(lambda user_token: ((get_partition_id(user_token[0]), user_token[1]), 1))\
.reduceByKey(lambda token_count_x, token_count_y: token_count_x + token_count_y)\
.map(lambda part_token_count: (part_token_count[0][0], (part_token_count[0][1], get_rel_popularity(part_token_count[1], token_dict[part_token_count[0][1]]))))\
.groupByKey()\
.mapValues(lambda token_pop_count: sorted(token_pop_count, key=(lambda token_pop_count: (-token_pop_count[1], token_pop_count[0])), reverse=False)[0:10])

popular_in_each_group = sorted(token_popularity_count_rdd.collect())

for grp in range(8):
    print_tokens(popular_in_each_group[grp][1], grp)

===== group 0 =====
with	-3.6088
cruz	-3.6554
his	-3.6582
amp	-3.8651
on	-3.9608
to	-4.0145
&	-4.0875
https	-4.1699
i	-4.1699
what	-4.1699
===== group 1 =====
sanders	-2.2854
gop	-2.4060
hillary	-2.4330
’	-2.4463
bernie	-2.4835
"	-2.6925
are	-2.7249
this	-2.7633
for	-2.8179
about	-2.8346
===== group 2 =====
with	-4.3458
donald	-4.5146
...	-4.7004
gop	-4.7279
i	-4.9475
on	-4.9608
he	-4.9925
…	-5.1155
https	-5.1699
what	-5.1699
===== group 3 =====
bernie	-1.5945
sanders	-1.6609
hillary	-2.2188
and	-2.5154
"	-2.5930
in	-2.6114
will	-2.6160
https	-2.6674
...	-2.7004
you	-2.7004
===== group 4 =====
what	-3.4330
have	-3.4725
bernie	-3.5380
this	-3.5518
it	-3.6881
?	-3.6912
for	-3.7110
about	-3.7415
hillary	-3.7549
that	-3.7625
===== group 5 =====
what	-1.8007
not	-1.8745
https	-2.0000
his	-2.0144
cruz	-2.0704
it	-2.1031
on	-2.1243
&	-2.1399
amp	-2.1489
;	-2.1592
===== group 6 =====
will	-1.3847
have	-1.4725
!	-1.5850
cruz	-1.6919
trump	-1.7199
https	-1.7549
-	-1.7673
;	-1.7807
be	-1.7952
amp