*****************************************************
# The Social Web

## 2019-2020 Master Information Sciences

* Instructors: Davide Ceolin, Dayana Spagnuelo
* Teaching Assistants: Michael Accetto, Sarthak Gupta, Ayesha Noorain
* Exercises for Hands-on session 5
* 12 March 11:00 - 12:45
*****************************************************

Required Software: 
* Python 3 
* Python packages: twitter


In this session you are going to learn how to browse user profile information.
The scripts below will help you solve the exercises.
Make sure to run all of them before you start the exercises.


First, we set up the Twitter API permissions.
You need to fill in the empty strings with your credentials. You will need them for all the exercises.




In [None]:
pip install twitter

In [None]:
import twitter

def oauth_login():
	CONSUMER_KEY = ''
	CONSUMER_SECRET = ''
	OAUTH_TOKEN = ''
	OAUTH_TOKEN_SECRET = ''
	auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
	CONSUMER_KEY, CONSUMER_SECRET)
	twitter_api = twitter.Twitter(auth=auth)
	return twitter_api

In [None]:
def analyze_tweet_content(statuses): 
	if len(statuses) == 0: 
		print ("No statuses to analyze")
        
		return
	# A nested helper function for computing lexical diversity
	# (returns 0.0 for an empty list to avoid dividing by zero)
	def lexical_diversity(tokens): 
		return 1.0*len(set(tokens))/len(tokens) if tokens else 0.0
	# A nested helper function for computing the average number of words per tweet
	def average_words(statuses): 
		total_words = sum([ len(s.split()) for s in statuses ])
		return 1.0*total_words/len(statuses) 
	status_texts = [ status['text'] for status in statuses ] 
	screen_names, hashtags, urls, media, _ = extract_tweet_entities(statuses)
	# Compute a collection of all words from all tweets
	words = [ w 
		for t in status_texts 
			for w in t.split() ] 
	print ("Lexical diversity (words):", lexical_diversity(words))
	print ("Lexical diversity (screen names):", lexical_diversity(screen_names))
	print ("Lexical diversity (hashtags):", lexical_diversity(hashtags))
	print ("Average words per tweet:", average_words(status_texts))

In [None]:
def extract_tweet_entities(statuses):
# See https://dev.twitter.com/docs/tweet-entities for more details on tweet
# entities
    if len(statuses) == 0:
        return [], [], [], [], []
    screen_names = [ user_mention['screen_name']
        for status in statuses
            for user_mention in status['entities']['user_mentions'] ]
    hashtags = [ hashtag['text']
        for status in statuses
            for hashtag in status['entities']['hashtags'] ]
    urls = [ url['expanded_url']
        for status in statuses
            for url in status['entities']['urls'] ]
    symbols = [ symbol['text']
        for status in statuses
            for symbol in status['entities']['symbols'] ]
    # In some circumstances (such as search results), the media entity
    # may not appear, so only read it from statuses that have it
    media = [ m['url']
        for status in statuses
            if 'media' in status['entities']
                for m in status['entities']['media'] ]
    return screen_names, hashtags, urls, media, symbols

In [None]:
import sys
from sys import maxsize
from functools import partial 

def get_friends_followers_ids(screen_name=None, user_id=None, friends_limit=maxsize, followers_limit=maxsize):# Must have either screen_name or user_id (logical xor)
    
    twitter_api = oauth_login()
    
    assert (screen_name != None) != (user_id != None), \
    "Must have screen_name or user_id, but not both"

# See https://dev.twitter.com/docs/api/1.1/get/friends/ids and https://dev.twitter.com/docs/api/1.1/get/followers/ids for details on API parameters

    get_friends_ids = partial(make_twitter_request, twitter_api.friends.ids, count=5000)

    get_followers_ids = partial(make_twitter_request, twitter_api.followers.ids,count=5000) 
    
    friends_ids, followers_ids = [], [] 

    for twitter_api_func, limit, ids,label in [[get_friends_ids, friends_limit, friends_ids, "friends"],[get_followers_ids, followers_limit, followers_ids, "followers"]]:
         if limit == 0: continue 
         cursor = -1 
         while cursor != 0:
# Use make_twitter_request via the partially bound callable...
            if screen_name: 
                response = twitter_api_func(screen_name=screen_name, cursor=cursor)
            else: # user_id 
                response = twitter_api_func(user_id=user_id, cursor=cursor) 
            if response is not None: 
                ids += response['ids'] 
                cursor = response['next_cursor'] 
            print ('Fetched {0} total {1} ids for {2}'.format(len(ids), label, (user_id or screen_name)), file=sys.stderr)
            # XXX: You may want to store data during each iteration to provide an additional layer
            # of protection from exceptional circumstances
            if len(ids) >= limit or response is None: 
                break
	# Do something useful with the IDs, like store them to disk...
    return friends_ids[:friends_limit], followers_ids[:followers_limit]

In [None]:
def get_user_profile(screen_names=None, user_ids=None):# Must have either screen_name or user_id (logical xor)
	twitter_api = oauth_login()
	assert (screen_names != None) != (user_ids != None), \
	"Must have screen_names or user_ids, but not both"
	items_to_info = {}
	items = screen_names or user_ids
	while len(items) > 0: # Process 100 items at a time per the API specifications for /users/lookup
		items_str = ','.join([str(item) for item in items[:100]])
		items = items[100:]
		if screen_names:
			response = make_twitter_request(twitter_api.users.lookup,screen_name=items_str)
		else: # user_ids
			response = make_twitter_request(twitter_api.users.lookup,user_id=items_str)
		print (response)
		for user_info in response:
			if screen_names:
				items_to_info[user_info['screen_name']] = user_info
			else: # user_ids
				items_to_info[user_info['id']] = user_info
	print (items_to_info)

In [None]:
import sys

def harvest_user_timeline(screen_name=None, user_id=None, max_results=1000):
	twitter_api = oauth_login()
	assert (screen_name != None) != (user_id != None), \
	"Must have screen_name or user_id, but not both"
	kw = { # Keyword args for the Twitter API call
		'count': 200,
		'trim_user': 'true',
		'include_rts' : 'true',
		'since_id' : 1
		}

	if screen_name:
		kw['screen_name'] = screen_name
	else:
		kw['user_id'] = user_id
	
	max_pages = 16
	results = []
	
	tweets = make_twitter_request(twitter_api.statuses.user_timeline, **kw)
	if tweets is None: # 401 (Not Authorized) - Need to bail out on loop entry
		tweets = []
	results += tweets
	
	print ('Fetched %i tweets' % len(tweets), file=sys.stderr)
	
	page_num = 1
	
	# Many Twitter accounts have fewer than 200 tweets so you don't want to enter
	# the loop and waste a precious request if max_results = 200.
	# Note: Analogous optimizations could be applied inside the loop to try and
	# save requests. e.g. Don't make a third request if you have 287 tweets out of
	# a possible 400 tweets after your second request. Twitter does do some
	# post-filtering on censored and deleted tweets out of batches of 'count', though,
	# so you can't strictly check for the number of results being 200. You might get
	# back 198, for example, and still have many more tweets to go. If you have the
	# total number of tweets for an account (by GET /users/lookup/), then you could
	# simply use this value as a guide.
	
	if max_results == kw['count']:
		page_num = max_pages # Prevent loop entry
	while page_num < max_pages and len(tweets) > 0 and len(results) < max_results:
	
	# Necessary for traversing the timeline in Twitter's v1.1 API:
	# get the next query's max-id parameter to pass in.
	# See https://dev.twitter.com/docs/working-with-timelines.
		kw['max_id'] = min([ tweet['id'] for tweet in tweets]) - 1
		tweets = make_twitter_request(twitter_api.statuses.user_timeline, **kw)
		if tweets is None: # Request failed (e.g. 401 or 404), so stop paging
			tweets = []
		results += tweets
		print ('Fetched %i tweets' % (len(tweets),), file=sys.stderr)
		page_num += 1
	print ('Done fetching tweets', file=sys.stderr)
	return results[:max_results]	

In [None]:
import sys
import time
from urllib.error import URLError
from http.client import BadStatusLine
import json
import twitter

def make_twitter_request(twitter_api_func, max_errors=10, *args, **kw):
	# A nested helper function that handles common HTTPErrors. Returns an updated
	# value for wait_period if the problem is a 500-level error. Blocks until the
	# rate limit is reset if it's a rate limiting issue (429 error). Returns None
	# for 401 and 404 errors, which require special handling by the caller.
	def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):
		if wait_period > 3600: # Seconds
			print ('Too many retries. Quitting.', file=sys.stderr)
			raise e
		# See https://dev.twitter.com/docs/error-codes-responses for common codes
		if e.e.code == 401:
			print ('Encountered 401 Error (Not Authorized)', file=sys.stderr)
			return None
		elif e.e.code == 404:
			print ('Encountered 404 Error (Not Found)', file=sys.stderr)
			return None
		elif e.e.code == 429:
			print ('Encountered 429 Error (Rate Limit Exceeded)', file=sys.stderr)
			if sleep_when_rate_limited:
				print ("Retrying in 15 minutes...ZzZ...", file=sys.stderr)
				sys.stderr.flush()
				time.sleep(60*15 + 5)
				print ('...ZzZ...Awake now and trying again.', file=sys.stderr)
				return 2
			else:
				raise e # Caller must handle the rate limiting issue
		elif e.e.code in (500, 502, 503, 504):
			print ('Encountered %i Error. Retrying in %i seconds' %
				(e.e.code, wait_period), file=sys.stderr)
			time.sleep(wait_period)
			wait_period *= 1.5
			return wait_period
		else:
			raise e
	# End of nested helper function
	wait_period = 2
	error_count = 0
	while True:
		try:
			return twitter_api_func(*args, **kw)
		except twitter.api.TwitterHTTPError as e:
			error_count = 0
			wait_period = handle_twitter_http_error(e, wait_period)
			if wait_period is None:
				return
		except URLError as e:
			error_count += 1
			print ("URLError encountered. Continuing.", file=sys.stderr)
			if error_count > max_errors:
				print ("Too many consecutive errors...bailing out.", file=sys.stderr)
				raise
		except BadStatusLine as e:
			error_count += 1
			print ("BadStatusLine encountered. Continuing.", file=sys.stderr)
			if error_count > max_errors:
				print ("Too many consecutive errors...bailing out.", file=sys.stderr)
				raise


In [None]:
def setwise_friends_followers_analysis(screen_name, friends_ids, followers_ids):
	friends_ids, followers_ids = set(friends_ids), set(followers_ids) 
	print ('{0} is following {1}'.format(screen_name, len(friends_ids))) 
	print ('{0} is being followed by {1}'.format(screen_name, len(followers_ids)) )
	print ('{0} of {1} are not following {2} back'.format( 
		len(friends_ids.difference(followers_ids)), 
		len(friends_ids),screen_name)) 
	print ('{0} of {1} are not being followed back by {2}'.format(
		len(followers_ids.difference(friends_ids)), 
		len(followers_ids), screen_name)) 
	print ('{0} has {1} mutual friends'.format( 
		screen_name,len(friends_ids.intersection(followers_ids))))

In [None]:
def twitter_search(q, max_results=200, **kw):
	twitter_api = oauth_login()
	# See https://dev.twitter.com/docs/api/1.1/get/search/tweets and
	# https://dev.twitter.com/docs/using-search for details on advanced
	# search criteria that may be useful for keyword arguments
	search_results = twitter_api.search.tweets(q=q, count=100, **kw)
	statuses = search_results['statuses']
	# Iterate through batches of results by following the cursor until we
	# reach the desired number of results, keeping in mind that OAuth users
	# can "only" make 180 search queries per 15-minute interval. See
	# https://dev.twitter.com/docs/rate-limiting/1.1/limits
	# for details. A reasonable number of results is ~1000, although
	# that number of results may not exist for all queries.
	# Enforce a reasonable limit
	max_results = min(1000, max_results)
	for _ in range(10): # 10*100 = 1000
		try:
			next_results = search_results['search_metadata']['next_results']
		except KeyError as e: # No more results when next_results doesn't exist
			break
	# Create a dictionary from next_results, which has the following form:
	# ?max_id=313519052523986943&q=NCAA&include_entities=1
		kwargs = dict([ kv.split('=')
			for kv in next_results[1:].split("&") ])
		search_results = twitter_api.search.tweets(**kwargs)
		statuses += search_results['statuses']
		if len(statuses) > max_results:
			break
	return statuses


**Exercise 1:** Resolving user profile information (from example 9-17 in Mining the Social
Web).

Many APIs, such as GET friends/ids and GET followers/ids, return opaque ID values that
need to be resolved to usernames or other profile information for meaningful analysis.
Twitter provides a GET users/lookup API that can be used to resolve as many as 100 IDs or
usernames at a time, and a simple pattern can be employed to iterate over larger batches.

In [None]:
# you can substitute the strings with others you like more
get_user_profile(screen_names=["SocialWebMining", "ptwobrussell"])
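
The batching pattern described above can be tried without any network access. The following is a minimal sketch, not the notebook's own implementation: the helper name `chunked` and the ID values are hypothetical and chosen only for illustration. It splits a list of IDs into groups of at most 100 and joins each group into the comma-separated string that a lookup request would take.

```python
def chunked(items, size=100):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical user IDs, purely for illustration
user_ids = list(range(1, 251))

# One comma-separated string per request, each holding at most 100 IDs
batches = [",".join(str(i) for i in batch) for batch in chunked(user_ids)]

print(len(batches))                 # how many requests would be needed
print(len(batches[-1].split(",")))  # how many IDs the last request carries
```

With 250 IDs, this produces three batches, the last one holding the remaining 50 IDs.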

**Exercise 2:** Getting all friends or followers for a user (from example 9-19 in Mining the
Social Web).


In [None]:
# you can substitute the strings with others you are more interested in

screen_name="SocialWebMining"
friends_ids, followers_ids = get_friends_followers_ids(screen_name=screen_name,friends_limit=10,followers_limit=10)

print (friends_ids)
print (followers_ids)

**Exercise 3:** Analyzing a user’s friends and followers (from example 9-20 in Mining the
Social Web).

After harvesting all of a user’s friends and followers, you can conduct some primitive
analyses using only the ID values themselves with the help of setwise operations such as
intersection and difference, as shown in the following exercise. Given two sets, the
intersection of the sets returns the items that they have in common, whereas the
difference between the sets “subtracts” the items in one set from the other, leaving
behind the difference. Recall that intersection is a commutative operation, while
difference is not commutative. 

In the context of analyzing friends and followers, the
intersection of two sets can be interpreted as “mutual friends” or people you are
following who are also following you back, while the difference of two sets can be
interpreted as followers who you aren’t following back or people you are following who
aren’t following you back, depending on the order of the operands.

In [None]:
screen_name = "ptwobrussell"
friends_ids, followers_ids = get_friends_followers_ids(screen_name=screen_name)
setwise_friends_followers_analysis(screen_name, friends_ids, followers_ids)
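
To see how the order of the operands changes the meaning of a difference, here is a small sketch using Python's built-in `set` type. The ID values are made up for illustration and do not correspond to real accounts.

```python
friends = {1, 2, 3, 4}   # hypothetical IDs of accounts the user follows
followers = {3, 4, 5}    # hypothetical IDs of accounts following the user

# Intersection is commutative: the operand order does not matter
mutual = friends & followers
assert mutual == followers & friends

# Difference is not commutative: the operand order changes the meaning
not_following_back = friends - followers   # followed, but not following back
not_followed_back = followers - friends    # following, but not followed back

print(mutual)
print(not_following_back)
print(not_followed_back)
```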

**Exercise 4:** Harvesting a user’s tweets (from example 9-21 in Mining the Social Web).
Timelines are a fundamental concept in the Twitter developer ecosystem, and Twitter
provides a convenient API endpoint for the purpose of harvesting tweets by user through
the concept of a “user timeline.”


In [None]:
tweets = harvest_user_timeline(screen_name="SocialWebMining", max_results=200)
print (tweets)
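
The paging logic inside `harvest_user_timeline` relies on the `max_id` parameter: each request asks for tweets whose ID is at most one less than the smallest ID seen so far. The sketch below illustrates that idea offline. `FAKE_TIMELINE`, `fake_user_timeline` and `harvest` are hypothetical stand-ins written for this example, not part of the `twitter` package, and the real endpoint may behave differently in detail (rate limits, filtered tweets).

```python
# Hypothetical timeline of 25 tweets, newest first, standing in for the API
FAKE_TIMELINE = [{'id': i} for i in range(25, 0, -1)]

def fake_user_timeline(count, max_id=None):
    """Return up to `count` tweets whose id is <= max_id (inclusive)."""
    eligible = [t for t in FAKE_TIMELINE
                if max_id is None or t['id'] <= max_id]
    return eligible[:count]

def harvest(count=10):
    """Page backwards through the fake timeline using max_id."""
    results = []
    page = fake_user_timeline(count)
    while page:
        results += page
        # Subtract 1 because max_id is inclusive; otherwise the oldest
        # tweet of this page would be returned again in the next one
        max_id = min(t['id'] for t in page) - 1
        page = fake_user_timeline(count, max_id=max_id)
    return results

tweets_sketch = harvest()
print(len(tweets_sketch))
```

Every tweet is fetched exactly once, which is the property the `- 1` in the real function is meant to guarantee.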

**Exercise 5:** Analyzing tweet content (from example 9-23 in Mining the Social Web). 
You will be using simple statistics, such as lexical diversity and average number of words per
tweet, to gain elementary insight into what is being talked about as a first step in
sizing up the nature of the language being used. 

In [None]:
q = 'CrossFit' 
search_results = twitter_search(q) 
analyze_tweet_content(search_results)
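
The two statistics can be checked by hand on a tiny example. The sketch below uses three made-up tweet texts (an assumption for illustration only) and computes lexical diversity as the number of unique words divided by the total number of words, the same ratio that `analyze_tweet_content` uses.

```python
# Hypothetical tweet texts, purely for illustration
texts = ["the quick brown fox", "the lazy dog", "the fox"]

# Flatten all tweets into one list of whitespace-separated words
words = [w for t in texts for w in t.split()]

# Unique words divided by total words: 1.0 means no word is ever repeated
diversity = len(set(words)) / len(words)

# Total words divided by the number of tweets
avg_words = len(words) / len(texts)

print(diversity)
print(avg_words)
```

Here there are 9 words in total, 6 of them unique, so the lexical diversity is 6/9 and the average is 3 words per tweet.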

*****************************************************
TASK 1: Analyze all of the tweets that you (or another user you are interested in) have ever retweeted. Are you at all surprised about what you have retweeted or how your (or their) interests have evolved over time?
*****************************************************




*****************************************************
TASK 2: Write a recipe to identify followers that you are not following back but perhaps should follow back based upon the content of their tweets. A few similarity measurements that may make suitable starting points were introduced in Section 3.3.3 on page 112 in Mining the Social Web.
*****************************************************
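
As a possible starting point for Task 2, the sketch below computes the Jaccard similarity between two sets of words. Treat it as an assumption rather than the intended solution: Jaccard similarity is one common set-based measure, and whether it is the most suitable one for this task is for you to judge. The function name `jaccard_similarity` and the sample word lists are hypothetical.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # Convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical word lists, e.g. taken from two users' tweets
my_words = "python data mining twitter".split()
their_words = "twitter data analysis".split()

score = jaccard_similarity(my_words, their_words)
print(score)
```

The two lists share 2 words out of 5 distinct words overall, giving a score of 0.4. A score close to 1 would suggest that two users tweet about similar topics.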