# Case Study 1 : Collecting Data from Twitter

Due Date: September 22, **before the beginning of class at 6:00pm**

* ------------

<img src="https://upload.wikimedia.org/wikipedia/en/thumb/9/9f/Twitter_bird_logo_2012.svg/220px-Twitter_bird_logo_2012.svg.png">

**TEAM Members:** Please EDIT this cell and add the names of all the team members in your team

    Brendan Foley
    
    Luis Castillo
    
    Yuhui Gong
    
    Mu Niu
       
    Josh Uzarski
    
    Matt Weiss
    

**Required Readings:** 
* Chapter 1 and Chapter 9 of the book [Mining the Social Web](http://www.learndatasci.com/wp-content/uploads/2015/08/Mining-the-Social-Web-2nd-Edition.pdf) 
* The codes for [Chapter 1](http://bit.ly/1qCtMrr) and [Chapter 9](http://bit.ly/1u7eP33)


** NOTE **
* Please don't forget to save the notebook frequently when working in IPython Notebook, otherwise the changes you made can be lost.

*----------------------

# Problem 1: Sampling Twitter Data with Streaming API about a certain topic

* Select a topic that you are interested in, for example, "WPI" or "Lady Gaga"
* Use Twitter Streaming API to sample a collection of tweets about this topic in real time. (It would be recommended that the number of tweets should be larger than 200, but smaller than 1 million.
* Store the tweets you downloaded into a local file (txt file or json file) 

In [2]:
# python modules
import sys
import datetime
import time
import twitter
import json
from collections import Counter
import pymongo
from functools import partial
import numpy as np
from prettytable import PrettyTable
import nltk

In [3]:
#---------------------------------------------
# Define a Function to Login Twitter API
def oauth_login():
    # Go to http://twitter.com/apps/new to create an app and get values
    # for these credentials that you'll need to provide in place of these
    # empty string values that are defined as placeholders.
    # See https://dev.twitter.com/docs/auth/oauth for more information 
    # on Twitter's OAuth implementation.
    
    CONSUMER_KEY = 'YBicfvPPR7YmsJdFtd8qGATmI'
    CONSUMER_SECRET ='YYXKSBw5n6b5dpbwmMfZTjqJcUnDQm6fxlNUEnf63k5idJsj2m'
    OAUTH_TOKEN = '774076984561430528-1ENguVW2E40PSue6YWnDuCBUyASjhGB'
    OAUTH_TOKEN_SECRET = 'ZcjrXboPxPmGEXzq2n1GQN1s7u8ZAXQeKWDSVIrZxQMkV'
    
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
    
    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api

#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

# define a function to search Twitter
def twitter_search(twitter_api, q, max_results=200, **kw):

    # See https://dev.twitter.com/docs/api/1.1/get/search/tweets
    search_results = twitter_api.search.tweets(q=q, count=100, **kw)
    
    statuses = search_results['statuses']
    
    # Iterate through batches of results by following the cursor until we
    # reach the desired number of results, keeping in mind that OAuth users
    # can "only" make 180 search queries per 15-minute interval. See
    # https://dev.twitter.com/docs/rate-limiting/1.1/limits
    # for details. A reasonable number of results is ~1000, although
    # that number of results may not exist for all queries.

    # Enforce a reasonable limit
    max_results = min(1000, max_results)
    
    for _ in range(10): # 10*100 = 1000
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError:
            break
    
        kwargs = dict([ kv.split('=')
                        for kv in next_results[1:].split("&") ])
    
        search_results = twitter_api.search.tweets(**kwargs)
        statuses += search_results['statuses']
    
        if len(statuses) > max_results:
            break
    
    return statuses

# define a function to return the most common strings in a list
def most_common( input_list ):
    
    return [ list_item 
                for item in [ input_list ]
                    for list_item in Counter(item).most_common()
            ]

# save data to mongo database
def save_to_mongo( data, mongo_db, mongo_db_coll, mongo_db_uri="localhost:27017"):

    # Connects to the MongoDB server running on
    # localhost:27017 by default
    client = pymongo.MongoClient(mongo_db_uri)

    # Get a reference to a particular database
    db = client[mongo_db]

    # Reference a particular collection in the database
    coll = db[mongo_db_coll]

    # Perform a bulk insert and return the IDs
    return coll.insert_many(data)

# load data from mongo database
def load_from_mongo(mongo_db, mongo_db_coll, mongo_db_uri="localhost:27017", 
                    return_cursor=False, criteria=None, projection=None):
    
    # Optionally, use criteria and projection to limit the data that is
    # returned as documented in
    # http://docs.mongodb.org/manual/reference/method/db.collection.find/
    # Consider leveraging MongoDB's aggregations framework for more
    # sophisticated queries.
    
    client = pymongo.MongoClient(mongo_db_uri)
    
    db = client[mongo_db]
    
    coll = db[mongo_db_coll]
    
    if criteria is None:
        criteria = {}
    
    if projection is None:
        cursor = coll.find(criteria)
    
    else:
        cursor = coll.find(criteria, projection)
        
    if return_cursor:
        return cursor
    else:
        return [ item for item in cursor ]
    
# Gets friends and followers of Twitter user specified by either screen name OR user ID
def get_friends_followers_ids(twitter_api, screen_name=None, user_id=None,
                              friends_limit=60000, followers_limit=100000):
    
    # Must have either screen_name or user_id (logical xor)
    assert (screen_name != None) != (user_id != None), \
    "Must have screen_name or user_id, but not both"
    
    # See https://dev.twitter.com/docs/api/1.1/get/friends/ids and
    # https://dev.twitter.com/docs/api/1.1/get/followers/ids for details
    # on API parameters
    
    get_friends_ids = partial(make_twitter_request, twitter_api.friends.ids, 
                              count=5000)
    get_followers_ids = partial(make_twitter_request, twitter_api.followers.ids, 
                                count=5000)

    friends_ids, followers_ids = [], []
    
    for twitter_api_func, limit, ids, label in [
                    [get_friends_ids, friends_limit, friends_ids, "friends"], 
                    [get_followers_ids, followers_limit, followers_ids, "followers"]
                ]:
        
        if limit == 0: continue
        
        cursor = -1
        while cursor != 0:
        
            # Use make_twitter_request via the partially bound callable...
            if screen_name: 
                response = twitter_api_func(screen_name=screen_name, cursor=cursor)
            else: # user_id
                response = twitter_api_func(user_id=user_id, cursor=cursor)

            if response is not None:
                ids += response['ids']
                cursor = response['next_cursor']
            #print('Fetched {0} total {1} ids for {2}'.format(len(ids), 
                                                    #label, (user_id or screen_name)), end="", file=sys.stderr)
            #print >> sys.stderr, 'Fetched {0} total {1} ids for {2}'.format(len(ids), 
                                                    #label, (user_id or screen_name))
        
            # XXX: You may want to store data during each iteration to provide an 
            # an additional layer of protection from exceptional circumstances
        
            if len(ids) >= limit or response is None:
                break

    # Do something useful with the IDs, like store them to disk...
    return friends_ids[:friends_limit], followers_ids[:followers_limit]

# Wrapper for making Twitter requests
def make_twitter_request(twitter_api_func, max_errors=10, *args, **kw): 
    
    # A nested helper function that handles common HTTPErrors. Return an updated
    # value for wait_period if the problem is a 500 level error. Block until the
    # rate limit is reset if it's a rate limiting issue (429 error). Returns None
    # for 401 and 404 errors, which requires special handling by the caller.
    def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):
    
        if wait_period > 3600: # Seconds
            print >> sys.stderr, 'Too many retries. Quitting.'
            raise e
    
        # See https://dev.twitter.com/docs/error-codes-responses for common codes
    ''' if e.e.code == 401:
            print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'
            return None
        elif e.e.code == 404:
            print >> sys.stderr, 'Encountered 404 Error (Not Found)'
            return None
        elif e.e.code == 429: 
            print >> sys.stderr, 'Encountered 429 Error (Rate Limit Exceeded)'
            if sleep_when_rate_limited:
                print >> sys.stderr, "Retrying in 15 minutes...ZzZ..."
                sys.stderr.flush()
                time.sleep(60*15 + 5)
                print >> sys.stderr, '...ZzZ...Awake now and trying again.'
                return 2
            else:
                raise e # Caller must handle the rate limiting issue
        elif e.e.code in (500, 502, 503, 504):
            print >> sys.stderr, 'Encountered %i Error. Retrying in %i seconds' % \
                (e.e.code, wait_period)
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        else:
            raise e '''
        

    # End of nested helper function 
    
    wait_period = 2 
    error_count = 0 

    while True:
        try:
            return twitter_api_func(*args, **kw)
        except twitter.api.TwitterHTTPError:
            error_count = 0 
            wait_period = handle_twitter_http_error(wait_period)
            if wait_period is None:
                return
        except URLError:
            error_count += 1
            print >> sys.stderr, "URLError encountered. Continuing."
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                raise
        except BadStatusLine:
            error_count += 1
            print >> sys.stderr, "BadStatusLine encountered. Continuing."
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                raise

# combine the same retweets
def DictListUpdate( list1, list2):
    for attributeInList1 in list1:
        if attributeInList1 not in list2:
            list2.append(attributeInList1)
    return list2

# get user profile(s) from user ids or screen names
def get_user_profile(twitter_api, screen_names=None, user_ids=None):
   
    # Must have either screen_name or user_id (logical xor)
    assert (screen_names != None) != (user_ids != None), \
    "Must have screen_names or user_ids, but not both"
    
    items_to_info = {}

    items = screen_names or user_ids
    
    while len(items) > 0:

        # Process 100 items at a time per the API specifications for /users/lookup.
        # See https://dev.twitter.com/docs/api/1.1/get/users/lookup for details.
        
        items_str = ','.join([str(item) for item in items[:100]])
        items = items[100:]

        if screen_names:
            response = make_twitter_request(twitter_api.users.lookup, 
                                            screen_name=items_str)
        else: # user_ids
            response = make_twitter_request(twitter_api.users.lookup, 
                                            user_id=items_str)
    
        for user_info in response:
            if screen_names:
                items_to_info[user_info['screen_name']] = user_info
            else: # user_ids
                items_to_info[user_info['id']] = user_info

    return items_to_info

# Various arrangements of the intersection between friends and follwers
#def setwise_friends_followers_analysis(screen_name, friends_ids, followers_ids):
    
#    friends_ids, followers_ids = set(friends_ids), set(followers_ids)
    
#    print ('{0} is following {1}'.format(screen_name, len(friends_ids)))

#    print ('{0} is being followed by {1}'.format(screen_name, len(followers_ids)))
    
#    print ('{0} of {1} are not following {2} back'.format(
#            len(friends_ids.difference(followers_ids)), 
#            len(friends_ids), screen_name))
    
#    print ('{0} of {1} are not being followed back by {2}'.format(
#            len(followers_ids.difference(friends_ids)), 
#            len(followers_ids), screen_name))
    
#    print ('{0} has {1} mutual friends'.format(
#            screen_name, len(friends_ids.intersection(followers_ids))))
    
def harvest_user_timeline(twitter_api, screen_name=None, user_id=None, max_results=1000):
    
    assert (screen_name != None) != (user_id != None), \
    "Must have screen_name or user_id, but not both"
    
    kw = { # Keyword args for the Twitter API call
        'count': 200,
        'trim_user': 'true',
        'include_rts' : 'true',
        'since_id' : 1
        }
    
    if screen_name:
        kw['screen_name'] = screen_name
    else:
        kw['user_id'] = user_id
    
    max_pages = 16
    results = []
    
    tweets = make_twitter_request(twitter_api.statuses.user_timeline, **kw)
    
    if tweets is None: # 401 (Not Authorized) - Need to bail out on loop entry
        tweets = []
    
    results += tweets
    
#     print >> sys.stderr, 'Fetched %i tweets' % len(tweets)
    
    page_num = 1
    
    # Many Twitter accounts have fewer than 200 tweets so you don't want to enter
    # the loop and waste a precious request if max_results = 200.

    # Note: Analogous optimizations could be applied inside the loop to try and
    # save requests. e.g. Don't make a third request if you have 287 tweets out of
    # a possible 400 tweets after your second request. Twitter does do some
    # post-filtering on censored and deleted tweets out of batches of 'count', though,
    # so you can't strictly check for the number of results being 200. You might get
    # back 198, for example, and still have many more tweets to go. If you have the
    # total number of tweets for an account (by GET /users/lookup/), then you could
    # simply use this value as a guide.
    
    if max_results == kw['count']:
        page_num = max_pages # Prevent loop entry
    
    while page_num < max_pages and len(tweets) > 0 and len(results) < max_results:
    
        # Necessary for traversing the timeline in Twitter's v1.1 API:
        # get the next query's max-id parameter to pass in.
        # See https://dev.twitter.com/docs/working-with-timelines.
        kw['max_id'] = min([ tweet['id'] for tweet in tweets]) - 1

        tweets = make_twitter_request(twitter_api.statuses.user_timeline, **kw)
        results += tweets

#         print >> sys.stderr, 'Fetched %i tweets' % (len(tweets),)

        page_num += 1

#     print >> sys.stderr, 'Done fetching tweets'
        
    return results[:max_results]

In [4]:
# connect to Twitter API
twitter_api = oauth_login()

In [5]:
# MongoDB information
mongo_db_name = 'ds501_casestudy1'
mongo_db_coll_stream = "pats"
mongo_db_uri = "mongodb://mlweiss:mlweiss@ds033046.mlab.com:33046/ds501_casestudy1"

In [6]:
# Load Tweets from MongoDB database
start_time = time.time()
raw_tweets = load_from_mongo(mongo_db_name, mongo_db_coll_stream, mongo_db_uri)
load_time = time.time() - start_time
print("It took %.4s seconds to load %i Tweets" % (load_time,len(raw_tweets)))

It took 38.9 seconds to load 7593 Tweets


In [7]:
# get tweet ids from original tweets collected in real time
# split raw_tweet_ids into sub-lists with a maximum of 100 elements per list
raw_tweet_ids = [ tweet['id'] for tweet in raw_tweets if 'id' in tweet ]
raw_tweet_ids_s = [raw_tweet_ids[x:x+100] for x in range(0, len(raw_tweet_ids), 100)]

In [8]:
# # Get updated copies of Tweets stored in MongoDB
start_time = time.time()
updated_tweets = list()
for tweet_ids in raw_tweet_ids_s:
    joined_ids_list = ','.join(str(i) for i in tweet_ids)
    updated_tweets += twitter_api.statuses.lookup(_id=joined_ids_list)
load_time = time.time() - start_time
print("It took %.4s seconds to update %i Tweets" % (load_time,len(raw_tweets)))

It took 92.1 seconds to update 7593 Tweets


### Report some statistics about the tweets you collected 

* The topic of interest: < Patriots>


* The total number of tweets collected:  < 7593>

*-----------------------

# Problem 2: Analyzing Tweets and Tweet Entities with Frequency Analysis

**1. Word Count:** 
* Use the tweets you collected in Problem 1, and compute the frequencies of the words being used in these tweets. 
* Plot a table of the top 30 words with their counts

In [9]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

# Download nltk packages used in this example - need to run this when using this code for the first time.
# Can comment after that.

nltk.download('stopwords')

nltk_eng_lower = nltk.corpus.stopwords.words('english')
nltk_eng_upper = [ word.capitalize() for word in nltk_eng_lower ]

# Filter out stop words
stop_words = nltk_eng_lower + nltk_eng_upper + [
'.',
',',
'--',
'\'s',
'?',
')',
'(',
':',
'\'',
'\'re',
'"',
'-',
'}',
'{',
u'—',
'RT',
'#Patriots',
'&amp;'
]

status_texts = [ results['text'] 
                 for results in updated_tweets ]

# Compute a collection of all words from all tweets
words = [ w 
          for t in status_texts 
              for w in t.split() ]

# generator function to get words not in stop words list
word_gen = [ word for word in words if word not in stop_words ]

for word in [word_gen]:
    c = Counter(word)
    # print(c.most_common()[:30]) # top 30

# plot a table of the top 30 words with their counts
Popwords = c.most_common()[:30]

pt = PrettyTable(field_names=['Words', 'Count']) 
[pt.add_row(kv) for kv in Popwords ]
pt.align['Words'], pt.align['Count'] = 'l', 'r' # Set column alignment
print(pt)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mniu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
+--------------------+-------+
| Words              | Count |
+--------------------+-------+
| #MIAvsNE           |  1366 |
| @Patriots:         |   905 |
| Jimmy              |   605 |
| Garoppolo          |   465 |
| #patriots          |   462 |
| game               |   401 |
| lead               |   384 |
| New                |   366 |
| QB                 |   353 |
| take               |   351 |
| England            |   340 |
| defense            |   327 |
| recovered          |   327 |
| Freeny             |   323 |
| forces             |   322 |
| fumble,            |   319 |
| line.              |   316 |
| Flowers.           |   313 |
| 32-yard            |   313 |
| https://t.co/nohT… |   309 |
| de                 |   304 |
| like               |   258 |
| hurdling           |   241 |
| #Dolphins          |   228 |
| Bl

**2. Find the most popular tweets in your collection of tweets**

Please plot a table of the top 10 tweets that are the most popular among your collection, i.e., the tweets with the largest number of retweet counts.


In [10]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary


retweets = [
            # Store out a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text']) 
            
            # ... for each status ...
            for status in updated_tweets 
            
            # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]

re = []
simpleretweets = DictListUpdate(retweets,re)

pt = PrettyTable(field_names=['Count', 'Screen Name', 'Text'])
[ pt.add_row(row) for row in sorted(simpleretweets,reverse=True)[:10]]
pt.max_width['Text'] = 50
pt.align= 'l'
print(pt)

+-------+-----------------+----------------------------------------------------+
| Count | Screen Name     | Text                                               |
+-------+-----------------+----------------------------------------------------+
| 5605  | TodaysLoop      | RT @TodaysLoop: Jarvis Landry is lowkey            |
|       |                 | disrespectful for this 😩💪🏾#Dolphins #Patriots      |
|       |                 | https://t.co/4p2BNikpPL                            |
| 5098  | Edelman11       | RT @Edelman11: great team win in the desert        |
|       |                 | #Patriots https://t.co/ZVqo9zUSuP                  |
| 2337  | Patriots        | RT @Patriots: TOUCHDOWN, #PATRIOTS                 |
|       |                 | https://t.co/UfQVoJUYkA                            |
| 2176  | GilletteStadium | RT @GilletteStadium: In honor of the #Patriots     |
|       |                 | home opener this Sunday, we're giving away pieces  |
|       |                 | 

**3. Find the most popular Tweet Entities in your collection of tweets**

Please plot a table of the top 10 hashtags, top 10 user mentions that are the most popular in your collection of tweets.

In [11]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

exclude_list = ['Patriots']

hashtags = [ hashtag['text'] 
             for status in updated_tweets
                 for hashtag in status['entities']['hashtags'] if hashtag['text'] not in exclude_list ]

# print(hashtags[0])

# plot table for top 10 popular hashtags
for item in [hashtags]:
    c = Counter(item)

pt = PrettyTable(field_names=['Hashtag', 'Count']) 
[pt.add_row(kv) for kv in c.most_common()[:10] ]
pt.align['Hashtag'], pt.align['Count'] = 'l', 'r' # Set column alignment
print(pt)

+----------------+-------+
| Hashtag        | Count |
+----------------+-------+
| MIAvsNE        |  1370 |
| patriots       |   465 |
| Dolphins       |   267 |
| PATRIOTS       |   113 |
| PatsNation     |   108 |
| PatriotsNation |   108 |
| NEvsMIA        |    95 |
| NFL            |    90 |
| DoYourJob      |    69 |
| FinsUp         |    67 |
+----------------+-------+


In [12]:
# plot table for top 10 popular screen names

screen_names = [ user_mention['screen_name'] 
                 for status in updated_tweets
                     for user_mention in status['entities']['user_mentions'] ]

for item in [screen_names]:
    c = Counter(item)

pt = PrettyTable(field_names=['User Metions', 'Count']) 
[pt.add_row(kv) for kv in c.most_common()[:10] ]
pt.align['User Metions'], pt.align['Count'] = 'l', 'r' # Set column alignment
print(pt)

+-----------------+-------+
| User Metions    | Count |
+-----------------+-------+
| Patriots        |  1035 |
| BeforeFamePics2 |   180 |
| SportsTalkJoe   |   142 |
| LG_Blount       |   116 |
| RapSheet        |   115 |
| FitzyGFY        |   112 |
| JimmyG_10       |    67 |
| PatriotsMexico  |    67 |
| PatriotsExtra   |    55 |
| CharlesRobinson |    48 |
+-----------------+-------+


* ------------------------

# Problem 3: Getting "All" friends and "All" followers of a popular user in twitter


* choose a popular twitter user who has many followers, such as "ladygaga".
* Get the list of all friends and all followers of the twitter user.
* Plot 20 out of the followers, plot their ID numbers and screen names in a table.
* Plot 20 out of the friends (if the user has more than 20 friends), plot their ID numbers and screen names in a table.

In [13]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary
# get list of all friends and followers IDs
celeb_screen_name = "JimmyG_10"
friends_ids, followers_ids = get_friends_followers_ids(twitter_api, screen_name=celeb_screen_name)

In [14]:
print (friends_ids[0:19])
print("\n")
print(followers_ids[0:19])

[376986779, 237527660, 460658453, 210567673, 3998034981, 327555798, 290809659, 183487809, 3115775917, 316901934, 3020277803, 252266942, 329385055, 829673054, 294318508, 2286022274, 1140790656, 220408962, 353990363]


[4787664095, 339424758, 408872052, 374290550, 543216473, 3239596882, 411108773, 180519645, 527713656, 260842202, 29781755, 812186172, 1435849448, 103027082, 835524102, 66703156, 137864396, 547115315, 379365195]


In [16]:
# get information about 20 friends and followers with their IDs and names

# get followers name

screen_names=[]
for j in range(20):
    screen_names.append(get_user_profile(twitter_api, user_ids=[followers_ids[j]])) 
screen_names_followers=[]
for j in range(20):
    screen_names_followers.append(screen_names[j][followers_ids[j]]["screen_name"])

# get friends name
screen_names=[]
for j in range(20):
    screen_names.append(get_user_profile(twitter_api, user_ids=[friends_ids[j]])) 
screen_names_friends=[]
for j in range(20):
    screen_names_friends.append(screen_names[j][friends_ids[j]]["screen_name"])

In [17]:
# plot

pt = PrettyTable()
pt.add_column("friend_IDs",friends_ids[:20])
pt.add_column("friend_names",screen_names_friends[:20])
pt.add_column("follower_IDs",followers_ids[:20])
pt.add_column("follower_names",screen_names_followers[:20])
print(pt)

+------------+-----------------+--------------+-----------------+
| friend_IDs |   friend_names  | follower_IDs |  follower_names |
+------------+-----------------+--------------+-----------------+
| 376986779  |  themarkuskuhn  |  4787664095  |     R_mas14     |
| 237527660  |  ChrisHogan_15  |  339424758   |     jjmann64    |
| 460658453  |     Npleva04    |  408872052   |    DakotaShoe   |
| 210567673  |  martyMcfly_33  |  374290550   | AdrianCorrales_ |
| 3998034981 |     BStork66    |  543216473   |  Kbmfan54rowdy  |
| 327555798  |  ItsChrisHarper |  3239596882  |   BarbABarnes   |
| 290809659  |   JackEasterby  |  411108773   |     BfitzP17    |
| 183487809  |    Josh_Boyce   |  180519645   |      P0rrr      |
| 3115775917 |    soldernate   |  527713656   |    Kid_Curley   |
| 316901934  |   BigPlayBene   |  260842202   |   stephen30999  |
| 3020277803 |     ninko50     |   29781755   |   BbwSofiaRose  |
| 252266942  |  Wes_Saunders88 |  812186172   |     CjMercik    |
| 32938505

* Compute the mutual friends within the two groups, i.e., the users who are in both friend list and follower list, plot their ID numbers and screen names in a table

In [18]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary
#setwise_friends_followers_analysis(celeb_screen_name, friends_ids, followers_ids)

intersection = list(set(friends_ids).intersection(set(followers_ids)))

screen_names=[]
for j in range(len(intersection)):
    screen_names.append(get_user_profile(twitter_api, user_ids=[intersection[j]])) 
screen_names_friends=[]
for j in range(len(intersection)):
    screen_names_friends.append(screen_names[j][intersection[j]]["screen_name"])

pt = PrettyTable()
pt.add_column("mutual_IDs",intersection)
pt.add_column("mutual_names",screen_names_friends)
print(pt)
    


+------------+-----------------+
| mutual_IDs |   mutual_names  |
+------------+-----------------+
| 2286022274 |    CrockettG7   |
| 220408962  | zeus30hightower |
| 563371267  |     anasisi_    |
| 376986779  |  themarkuskuhn  |
| 3998034981 |     BStork66    |
| 3020277803 |     ninko50     |
| 3115775917 |    soldernate   |
| 316901934  |   BigPlayBene   |
| 1486304563 |  PatrickChung23 |
| 252266942  |  Wes_Saunders88 |
| 183487809  |    Josh_Boyce   |
| 506561738  |  LaurenTaylor05 |
| 405291471  |     S_Siliga    |
| 327555798  |  ItsChrisHarper |
| 492366423  | FootbaIl_Tweets |
| 2336915160 |  James_Develin  |
| 353990363  |    LG_Blount    |
| 829673054  |  DannyAmendola  |
| 329385055  |    Tyms2Times   |
| 144548198  |  aarondobson17  |
| 1483855976 |    bbrowner27   |
| 237527660  |  ChrisHogan_15  |
| 2497367156 |   Tim_Wright83  |
| 210567673  |  martyMcfly_33  |
+------------+-----------------+


*------------------------

# Problem 4: Business question 

Run some additional experiments with your data to gain familiarity with the twitter data and twitter API.

* Come up with a business question that Twitter data could help answer.
* Decribe the business case.
* How could Twitter data help a company decide how to spend its resources.

In [19]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

# list to hold tweets from celeb's users
follower_tweets = list()

# # number of the celeb's followers to harvest tweets from
num_followers_to_harvest = min(400,len(followers_ids))

# # maximum number of tweets per follower
max_tweets_per_follower = 5

for follower_id in followers_ids[400:800]:
    follower_tweets += harvest_user_timeline(twitter_api, user_id=follower_id, max_results=max_tweets_per_follower)

NameError: name 'URLError' is not defined

In [21]:
len(followers_ids)

75000

In [22]:
print(len(follower_tweets))

630


In [26]:
# save follower tweets to Mongo DB
# MongoDB collection information
mongo_db_coll_followers = celeb_screen_name + "_followers2"
save_to_mongo( follower_tweets, mongo_db_name, mongo_db_coll_followers, mongo_db_uri )

<pymongo.results.InsertManyResult at 0x245489ca798>

In [27]:
# Load tweets of JimmyG_10's followers
followers_tweets_db = load_from_mongo(mongo_db_name, mongo_db_coll_followers, mongo_db_uri )

In [28]:
print(len(followers_tweets_db))

3521


*-----------------
# Done

All set! 

** What do you need to submit?**

* **Notebook File**: Save this IPython notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "ipython notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.


* **PPT Slides**: please prepare PPT slides (for 10 minutes' talk) to present about the case study . We will ask two teams which are randomly selected to present their case studies in class for this case study. 

* ** Report**: please prepare a report (less than 10 pages) to report what you found in the data.
    * What data you collected? 
    * Why this topic is interesting or important to you? (Motivations)
    * How did you analyse the data?
    * What did you find in the data? 
 
     (please include figures or tables in the report, but no source code)

Please compress all the files in a zipped file.


** How to submit: **

        Please submit through email to Prof. Paffenroth (rcpaffenroth@wpi.edu) *and* the TA Wen Liu (wliu3@wpi.edu).
        
** Note: Each team just needs to submits one submission **

# Grading Criteria:

** Totoal Points: 120 **


---------------------------------------------------------------------------
** Notebook:  **
    Points: 80


    -----------------------------------
    Qestion 1:
    Points: 20
    -----------------------------------
    
    (1) Select a topic that you are interested in.
    Points: 6 
    
    (2) Use Twitter Streaming API to sample a collection of tweets about this topic in real time. (It would be recommended that the number of tweets should be larger than 200, but smaller than 1 million. Please check whether the total number of tweets collected is larger than 200?
    Points: 10 
    
    
    (3) Store the tweets you downloaded into a local file (txt file or json file)
    Points: 4 
    
    
    -----------------------------------
    Qestion 2:
    Points: 20
    -----------------------------------
    
    1. Word Count

    (1) Use the tweets you collected in Problem 1, and compute the frequencies of the words being used in these tweets.
    Points: 4 

    (2) Plot a table of the top 30 words with their counts 
    Points: 4 
    
    2. Find the most popular tweets in your collection of tweets
    plot a table of the top 10 tweets that are the most popular among your collection, i.e., the tweets with the largest number of retweet counts.
    Points: 4 
    
    3. Find the most popular Tweet Entities in your collection of tweets

    (1) plot a table of the top 10 hashtags, 
    Points: 4 

    (2) top 10 user mentions that are the most popular in your collection of tweets.
    Points: 4 
    
    
    -----------------------------------
    Qestion 3:
    Points: 20
    -----------------------------------
    
    (1) choose a popular twitter user who has many followers, such as "ladygaga".
    Points: 4 

    (2) Get the list of all friends and all followers of the twitter user.
    Points: 4 

    (3) Plot 20 out of the followers, plot their ID numbers and screen names in a table.
    Points: 4 

    (4) Plot 20 out of the friends (if the user has more than 20 friends), plot their ID numbers and screen names in a table.
    Points: 4 
    
    (5) Compute the mutual friends within the two groups, i.e., the users who are in both friend list and follower list, plot their ID numbers and screen names in a table
    Points: 4 
  
    -----------------------------------
    Qestion 4:  Business question
    Points: 20
    -----------------------------------
        Novelty: 10
        Interestingness: 10
    -----------------------------------
    Run some additional experiments with your data to gain familiarity with the twitter data ant twitter API.  Come up with a business question and describe how Twitter data can help you answer that question.




---------------------------------------------------------------------------
** Report: communicate the results**
    Points: 20

(1) What data you collected?
    Points: 5 

(2) Why this topic is interesting or important to you? (Motivations)
    Points: 5 

(3) How did you analyse the data?
    Points: 5 

(4) What did you find in the data?
(please include figures or tables in the report, but no source code)
    Points: 5 



---------------------------------------------------------------------------
** Slides (for 10 minutes of presentation): Story-telling **
    Points: 20


1. Motivation about the data collection, why the topic is interesting to you.
    Points: 5 

2. Communicating Results (figure/table)
    Points: 10 

3. Story telling (How all the parts (data, analysis, result) fit together as a story?)
    Points: 5 

