#Mining the Social Web, 2nd Edition

##Chapter 1: Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More

This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from [_Mining the Social Web (2nd Edition)_](http://bit.ly/135dHfs). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.

In the somewhat unlikely event that you've somehow stumbled across this notebook outside of its context on GitHub, [you can find the full source code repository here](http://bit.ly/16kGNyb).

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

# Twitter API Access

Twitter implements OAuth 1.0A as its standard authentication mechanism, and in order to use it to make requests to Twitter's API, you'll need to go to https://dev.twitter.com/apps and create a sample application. There are four primary identifiers you'll need to note for an OAuth 1.0A workflow: consumer key, consumer secret, access token, and access token secret. Note that you will need an ordinary Twitter account in order to login, create an app, and get these credentials.

<img src="files/resources/ch01-twitter/images/Twitter-AppCredentials.png" width="600px">

If you are taking advantage of the virtual machine experience for this chapter that is powered by Vagrant, you should just be able to execute the code in this notebook without any worries whatsoever about installing dependencies. If you are running the code from your own development envioronment, however, be advised that these examples in this chapter take advantage of a Python package called [twitter](https://github.com/sixohsix/twitter) to make API calls. You can install this package in a terminal with [pip](https://pypi.python.org/pypi/pip) with the command `pip install twitter`, preferably from within a [Python virtual environment](https://pypi.python.org/pypi/virtualenv). 

Once installed, you should be able to open up a Python interpreter (or better yet, your [IPython](http://ipython.org/) interpreter) and get rolling.

## Example 1. Authorizing an application to access Twitter account data

In [1]:
import twitter

# XXX: Go to http://dev.twitter.com/apps/new to create an app and get values
# for these credentials, which you'll need to provide in place of these
# empty string values that are defined as placeholders.
# See https://dev.twitter.com/docs/auth/oauth for more information 
# on Twitter's OAuth implementation.

CONSUMER_KEY = 'LUDiw6UJwbcOpJxCUMwprQTjz'
CONSUMER_SECRET ='eRZxk3NObvhfdRPlqSZnjcAnilPCRkaV16616mI86Fzi6ENi35'
OAUTH_TOKEN = '1150071104-cBt0I6tLwweKg5LlMHRSN8sZS22lXsXPCGOpW4Z'
OAUTH_TOKEN_SECRET = 'aulODA2dhSCLgyiTLJ74adyPtJ2mShhhfWUx9FJkpJzXa'

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)

twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a
# defined variable

print twitter_api

<twitter.api.Twitter object at 0x106bc81d0>


In [2]:
results = twitter_api.statuses.user_timeline(screen_name = 'crunchbase', count=300)
#results = twitter.statuses.user_timeline(screen_name="billybob")


d = {}

print len(results)
#-----------------------------------------------------------------------
# loop through each status item, and print its content.
#-----------------------------------------------------------------------
for status in results:
    print "(%s) %s" % (status["created_at"], status["text"].encode("ascii", "ignore"))
    d[status["created_at"]] = status["text"].encode("ascii", "ignore")

200
(Wed Nov 25 20:42:41 +0000 2015) @_adonisk Hey Adonis! Can you send us an email from the address you want to unsubscribe with? We're happy to help! feedback@crunchbase.com
(Wed Nov 25 20:40:52 +0000 2015) @primdahl Hey Morton! Can you send us an email from the address you want to unsubscribe from? feedback@crunchbase.com Thanks!
(Wed Nov 25 16:29:14 +0000 2015) Today's #CBDaily: $2.2B in 203 rounds added &amp; more! https://t.co/MNUxay2bsB https://t.co/EIf130FjF8
(Tue Nov 24 18:31:02 +0000 2015) Today's #CBDaily: $2.1B in 354 rounds added &amp; more! https://t.co/AEaryWD0Vk https://t.co/mHh7rUSNpu
(Mon Nov 23 20:54:02 +0000 2015) Great to see @nyftlab featuring #FemaleFounders! Find application info for the new batch https://t.co/it7bO5J3tK https://t.co/FlDa4vZPu0
(Mon Nov 23 15:55:14 +0000 2015) Today's #CBDaily: $920M in 446 rounds added &amp; more! https://t.co/AEaryWD0Vk https://t.co/eLTAuPdtpo
(Sun Nov 22 23:07:38 +0000 2015) @julian_dunn Hey Julian, thanks for the alert! Can 

In [51]:
startuplist = [##"Uber",
##"Xiaomi",
##"Airbnb",
#"Palantir Technologies",
##"Snapchat",
##"Flipkart",
#"Didi Kuaidi",
##"SpaceX",
##"Pinterest",
##"Dropbox",
##"Lufax",
##"WeWork",
#"DJI Innovations",
##"Theranos",
##"Spotify",
##"Meituan",
#"Intarcia Therapeutics",
##"Stripe",
##"Olacabs",
##"Coupang",
##"Zenefits",
##"Cloudera",
####"Dianping",
#"Social Finance",
##"Tanium",
#"Credit Karma",
##"Atlassian",
##"Jawbone",
#"Delivery Hero",
#"Global Fashion Group",
##"Fanatics",
#"Legendary Entertainment",
##"Stemcentrx",
#"Ele.me",
##"VANCL",
##"DocuSign",
##"Moderna",
#"ContextLogic (dba. Wish)",
##"Hellofresh",
#"Slack Technologies",
##"slack",
#"Bloom Energy",
#"POWA Technologies",
##"Snapdeal",
##"Lyft",
##"Vice",# Media",
##"Mozido",
##"Adyen",
##"Houzz",
##"Klarna",
##"SurveyMonkey",
##"Evernote",
##"Avant",
##"NantHealth",
##"Nutanix",
#"One97 Communications",
##"Github",
#"Domo Technologies",
#"Trendy Group International",
##"Instacart",
#"Blue Apron",
#"Magic Leap",
#"Prosper Marketplace",
##"Zocdoc",
#"Oscar Health Insurance Co.",
#"The Honest Company",
##"MongoDB",
##"BlaBlaCar",
#"Insidesales.com",
#"Mu Sigma",
##"GuaHao",
##"MuleSoft",
##"Buzzfeed",
#"Jet.com",
##"ironSource",
#"Koudai Gouwu",
#"Jasper Technologies",
##"DraftKings",
##"Deem",
##"Thumbtack",
##"Lazada",
##"Medallia",
#"Ucar Group",
##"Appnexus",
#"Warby Parker",
#"Auto1 Group",
##"Infinidat",
##"Okta",
##"Sprinklr",
##"Automattic",
##"Twilio",
##"Uptake",
##"Udacity",
#"Proteus Digital Health",
##"Actifio",
##"TangoMe",
##"Nextdoor",
##"Docker",
#"Gilt Groupe",
##"Home24",
##"23andMe",
####"TransferWise",
##"Shazam",
#"Apus Group",
##"CloudFlare",
##"Eventbrite",
##"Lookout",
##"AppDynamics",
##"Farfetch",
####"SimpliVity",
##"Hootsuite",
##"Kabam",
#"Funding Circle",
##"Tujia",
##"Razer",
#"China Rapid Finance",
##"Fanli",
#"Zomato Media",
##"JustFab",
##"Qualtrics",
##"Mogujie",
##"Illumio",
##"GrabTaxi",
##"BeiBei",
##"Panshi",
#"Yello Mobile",
##"Pluralsight",
##"Quikr",
##"InMobi",
#"iwjw.com",
####"AVAST",# Software",
"MarkLogic",
#"Coupa Software",
"Fanduel",
#"Zeta Interactive",
"Zscaler",
"Vox",# Media",
#"Kik Interactive",
"Decolar",
"Carbon3D",
"Apttus",
"AppDirect",
"Kabbage",
"Datto",
###"TutorGroup"]

In [52]:
import tweepy #https://github.com/tweepy/tweepy
import csv

#Twitter API credentials
CONSUMER_KEY = 'LUDiw6UJwbcOpJxCUMwprQTjz'
CONSUMER_SECRET ='eRZxk3NObvhfdRPlqSZnjcAnilPCRkaV16616mI86Fzi6ENi35'
OAUTH_TOKEN = '1150071104-cBt0I6tLwweKg5LlMHRSN8sZS22lXsXPCGOpW4Z'
OAUTH_TOKEN_SECRET = 'aulODA2dhSCLgyiTLJ74adyPtJ2mShhhfWUx9FJkpJzXa'
consumer_key = CONSUMER_KEY
consumer_secret = CONSUMER_SECRET
access_key = OAUTH_TOKEN
access_secret = OAUTH_TOKEN_SECRET


def get_all_tweets(screen_name):
    #Twitter only allows access to a users most recent 3240 tweets with this method

    #authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth, retry_delay=61, retry_count=16, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    #initialize a list to hold all the tweepy Tweets
    alltweets = []	

    #make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name = screen_name,count=200)
    #print new_tweets
    #raise Exception()
    #save most recent tweets
    alltweets.extend(new_tweets)

    #save the id of the oldest tweet less one
    if len(alltweets) > 1:
        oldest = alltweets[-1].id - 1
    else:
        oldest = None

    #keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        print "getting tweets before %s" % (oldest)

        #all subsiquent requests use the max_id param to prevent duplicates
        new_tweets = None
        #while new_tweets == None:
#            try:
        new_tweets = api.user_timeline(screen_name = screen_name,count=200,max_id=oldest)
 #           except:# RateLimitError:
  #              for i in xrange(0,3):
   #                 print "sleeping for 5 minutes to get around rate limit"
    #                sleep(61 * 5) #15 minute sleep try for wait limit
     #           pass
            #except TweepError:
             #   print "tweeperr"
              #  break
               # pass
            #else:
             #   print "wtf else"
        #save most recent tweets
        alltweets.extend(new_tweets)

        #update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

#		break

        print "...%s tweets downloaded so far" % (len(alltweets))

    #transform the tweepy tweets into a 2D array that will populate the csv	
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text.encode("utf-8"), \
                 str(tweet.favorite_count), str(tweet.retweet_count), tweet.lang.encode("utf-8"),\
#                 tweet.quoted_status_id_str, \
                 tweet.in_reply_to_screen_name,tweet.in_reply_to_user_id_str,\
                 tweet.in_reply_to_status_id_str, \
                 tweet.place.country.encode("utf-8") if tweet.place else "", \
                 tweet.place.country_code.encode("utf-8") if tweet.place else "",\
                 tweet.place.full_name.encode("utf-8") if tweet.place else "",\
                 tweet.place.place_type.encode("utf-8") if tweet.place else "", \
                 str(tweet.coordinates['coordinates']).encode("utf-8") if tweet.coordinates else "", \
                 ] for tweet in alltweets]
#	print outtweets[2]
#	raise Exception()
    #write the csv	
    with open('%s_tweets.csv' % screen_name, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(["id","created_at","text","favorites","retweets","lang",\
                         #"quoting_status_id",\
                        "in_reply_to_user","in_reply_to_user_id","in_reply_to_status_id",\
                        "country","country_code","place_name","place_type","coordinates"])
        writer.writerows(outtweets)

    pass

for sta in startuplist:
    print sta
    get_all_tweets(sta)
#get_all_tweets('https://twitter.com/search-advanced')


MarkLogic
getting tweets before 650695877351178239
...400 tweets downloaded so far
getting tweets before 634387106865942527
...600 tweets downloaded so far
getting tweets before 620249759714971647
...800 tweets downloaded so far
getting tweets before 609440029987901439
...1000 tweets downloaded so far
getting tweets before 598526619376975871
...1200 tweets downloaded so far
getting tweets before 588404169179987967
...1400 tweets downloaded so far
getting tweets before 581169230424502271
...1599 tweets downloaded so far
getting tweets before 571742884904484863
...1798 tweets downloaded so far
getting tweets before 564070580500975615
...1998 tweets downloaded so far
getting tweets before 555081530758799359
...2198 tweets downloaded so far
getting tweets before 544911086432952319
...2398 tweets downloaded so far
getting tweets before 536187369838223360
...2598 tweets downloaded so far
getting tweets before 529772626080587777
...2798 tweets downloaded so far
getting tweets before 522473234

## Example 2. Retrieving trends

In [4]:
# The Yahoo! Where On Earth ID for the entire world is 1.
# See https://dev.twitter.com/docs/api/1.1/get/trends/place and
# http://developer.yahoo.com/geo/geoplanet/

WORLD_WOE_ID = 1
US_WOE_ID = 23424977

# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.

world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)

print world_trends
print
print us_trends

[{u'created_at': u'2015-11-24T20:44:53Z', u'trends': [{u'url': u'http://twitter.com/search?q=Roubaix', u'query': u'Roubaix', u'name': u'Roubaix', u'promoted_content': None}, {u'url': u'http://twitter.com/search?q=%23%D8%A8%D8%B1%D8%B4%D9%84%D9%88%D9%86%D9%87_%D8%B1%D9%88%D9%85%D8%A7', u'query': u'%23%D8%A8%D8%B1%D8%B4%D9%84%D9%88%D9%86%D9%87_%D8%B1%D9%88%D9%85%D8%A7', u'name': u'#\u0628\u0631\u0634\u0644\u0648\u0646\u0647_\u0631\u0648\u0645\u0627', u'promoted_content': None}, {u'url': u'http://twitter.com/search?q=Roma', u'query': u'Roma', u'name': u'Roma', u'promoted_content': None}, {u'url': u'http://twitter.com/search?q=%23GRAMMYs', u'query': u'%23GRAMMYs', u'name': u'#GRAMMYs', u'promoted_content': None}, {u'url': u'http://twitter.com/search?q=Bayern', u'query': u'Bayern', u'name': u'Bayern', u'promoted_content': None}, {u'url': u'http://twitter.com/search?q=%23NadaMejorQue', u'query': u'%23NadaMejorQue', u'name': u'#NadaMejorQue', u'promoted_content': None}, {u'url': u'http://twit

## Example 3. Displaying API responses as pretty-printed JSON

In [5]:
import json

print json.dumps(world_trends, indent=1)
print
print json.dumps(us_trends, indent=1)

[
 {
  "created_at": "2015-11-24T20:44:53Z", 
  "trends": [
   {
    "url": "http://twitter.com/search?q=Roubaix", 
    "query": "Roubaix", 
    "name": "Roubaix", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%23%D8%A8%D8%B1%D8%B4%D9%84%D9%88%D9%86%D9%87_%D8%B1%D9%88%D9%85%D8%A7", 
    "query": "%23%D8%A8%D8%B1%D8%B4%D9%84%D9%88%D9%86%D9%87_%D8%B1%D9%88%D9%85%D8%A7", 
    "name": "#\u0628\u0631\u0634\u0644\u0648\u0646\u0647_\u0631\u0648\u0645\u0627", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=Roma", 
    "query": "Roma", 
    "name": "Roma", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%23GRAMMYs", 
    "query": "%23GRAMMYs", 
    "name": "#GRAMMYs", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=Bayern", 
    "query": "Bayern", 
    "name": "Bayern", 
    "promoted_content": null
   }, 
   {
    "url": "http://twitter.com/search?q=%23NadaM

## Example 4. Computing the intersection of two sets of trends

In [6]:
world_trends_set = set([trend['name'] 
                        for trend in world_trends[0]['trends']])

us_trends_set = set([trend['name'] 
                     for trend in us_trends[0]['trends']]) 

common_trends = world_trends_set.intersection(us_trends_set)

print common_trends

set([u'Roubaix', u'#ThanksgivingWithBlackFamilies', u'Roma'])


## Example 5. Collecting search results

In [10]:
# Import unquote to prevent url encoding errors in next_results
from urllib import unquote

# XXX: Set this variable to a trending topic, 
# or anything else for that matter. The example query below
# was a trending topic when this content was being developed
# and is used throughout the remainder of this chapter.

#q = '#MentionSomeoneImportantForYou' 
q = '#marinecoin OR #startcoin OR #coinbase OR #bitstamp OR #bitfinex OR #altcoins OR #bitcoin OR #bitcoins OR #btc OR #dogecoin OR #crypto OR #cryptocurrency OR #blockchain' 

count = 100

# See https://dev.twitter.com/docs/api/1.1/get/search/tweets

search_results = twitter_api.search.tweets(q=q, count=count)

statuses = search_results['statuses']


# Iterate through 5 more batches of results by following the cursor

for _ in range(5):
    print "Length of statuses", len(statuses)
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError, e: # No more results when next_results doesn't exist
        break
        
    # Create a dictionary from next_results, which has the following form:
    # ?max_id=313519052523986943&q=NCAA&include_entities=1
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])
    
    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']

# Show one sample search result by slicing the list...
print json.dumps(statuses[0], indent=1)

Length of statuses 100
Length of statuses 200
Length of statuses 300
Length of statuses 400
Length of statuses 500
{
 "contributors": null, 
 "truncated": false, 
 "text": "RT @LYNDONJJSTOKES: How safe is your water? FREE Audit here from Macol Consulting  https://t.co/vrKCUEEKF8  #bromsgrovehour #crypto https:/\u2026", 
 "is_quote_status": false, 
 "in_reply_to_status_id": null, 
 "id": 669258955160465408, 
 "favorite_count": 0, 
 "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", 
 "retweeted": false, 
 "coordinates": null, 
 "entities": {
  "symbols": [], 
  "user_mentions": [
   {
    "id": 2359987412, 
    "indices": [
     3, 
     18
    ], 
    "id_str": "2359987412", 
    "screen_name": "LYNDONJJSTOKES", 
    "name": "TheBestof Bromsgrove"
   }
  ], 
  "hashtags": [
   {
    "indices": [
     108, 
     123
    ], 
    "text": "bromsgrovehour"
   }, 
   {
    "indices": [
     124, 
     131
    ], 
    "text": "crypto"
   }
  ],

In [21]:
print type(statuses[0])

<type 'dict'>


Note: Should you desire to do so, you can load the same set of search results that are illustrated in the text of _Mining the Social Web_ by executing the code below that reads a snapshot of the data and stores it into the same statuses variable as was defined above. Alternatively, you can choose to skip execution of this cell in order to follow along with your own data.

In [12]:
import json
#skip this for now, use our json data
#statuses = json.loads(open('resources/ch01-twitter/data/MentionSomeoneImportantForYou.json').read())

# The result of the list comprehension is a list with only one element that
# can be accessed by its index and set to the variable t
t = [ status 
      for status in statuses
          #if status['id'] == 316948241264549888 
    ][0]

# Explore the variable t to get familiarized with the data structure...

print t['retweet_count']
print t['retweeted_status']

# Can you find the most retweeted tweet in your search results? Try do do it!

3
{u'contributors': None, u'truncated': False, u'text': u'How safe is your water? FREE Audit here from Macol Consulting  https://t.co/vrKCUEEKF8  #bromsgrovehour #crypto https://t.co/Jrjm6NMlNt', u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 669255922674454529, u'favorite_count': 1, u'source': u'<a href="http://bufferapp.com" rel="nofollow">Buffer</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'user_mentions': [], u'hashtags': [{u'indices': [88, 103], u'text': u'bromsgrovehour'}, {u'indices': [104, 111], u'text': u'crypto'}], u'urls': [{u'url': u'https://t.co/vrKCUEEKF8', u'indices': [63, 86], u'expanded_url': u'http://bit.ly/1OquB2c', u'display_url': u'bit.ly/1OquB2c'}], u'media': [{u'expanded_url': u'http://twitter.com/LYNDONJJSTOKES/status/669255922674454529/photo/1', u'display_url': u'pic.twitter.com/Jrjm6NMlNt', u'url': u'https://t.co/Jrjm6NMlNt', u'media_url_https': u'https://pbs.twimg.com/media/CUmstzrWsAAvl-a.jpg', u'id_str':

## Example 6. Extracting text, screen names, and hashtags from tweets

In [13]:
status_texts = [ status['text'] 
                 for status in statuses ]

screen_names = [ user_mention['screen_name'] 
                 for status in statuses
                     for user_mention in status['entities']['user_mentions'] ]

hashtags = [ hashtag['text'] 
             for status in statuses
                 for hashtag in status['entities']['hashtags'] ]

# Compute a collection of all words from all tweets
words = [ w 
          for t in status_texts 
              for w in t.split() ]

# Explore the first 5 items for each...

print json.dumps(status_texts[0:5], indent=1)
print json.dumps(screen_names[0:5], indent=1) 
print json.dumps(hashtags[0:5], indent=1)
print json.dumps(words[0:5], indent=1)

[
 "RT @LYNDONJJSTOKES: How safe is your water? FREE Audit here from Macol Consulting  https://t.co/vrKCUEEKF8  #bromsgrovehour #crypto https:/\u2026", 
 "#CrowdActivism Think #BTC! : #ITALIA URGENTE #TERRORISMO CRUCIALE INTEL #ITALIANI https://t.co/Np5IpZI93g via @censorednewsnow", 
 "RT @Kuya_Marc: I wish for #Bitcoin #donations: https://t.co/D15lzV8G4u #encouragement It is now 04:47:34PHT.", 
 "#FBI #NSA @BarackObama #security #crypto @tim_cook #Apple https://t.co/vb1t54JhYG", 
 "RT @whaleclubco: $322.39 \u00b7 Some potential harmonic patterns evolving here. Fa... https://t.co/vvPyivt0Am #bitcoin https://t.co/Dp4BdfQmeR"
]
[
 "LYNDONJJSTOKES", 
 "censorednewsnow", 
 "Kuya_Marc", 
 "BarackObama", 
 "tim_cook"
]
[
 "bromsgrovehour", 
 "crypto", 
 "CrowdActivism", 
 "BTC", 
 "ITALIA"
]
[
 "RT", 
 "@LYNDONJJSTOKES:", 
 "How", 
 "safe", 
 "is"
]


## Example 7. Creating a basic frequency distribution from the words in tweets

In [14]:
from collections import Counter

for item in [words, screen_names, hashtags]:
    c = Counter(item)
    print c.most_common()[:10] # top 10
    print

[(u'#bitcoin', 328), (u'RT', 194), (u'#Bitcoin', 158), (u'to', 114), (u'the', 110), (u'Bitcoin', 108), (u'for', 95), (u'and', 91), (u'-', 84), (u'#dogecoin', 67)]

[(u'cryptocointalk', 13), (u'whaleclubco', 10), (u'coindesk', 9), (u'ProTipHQ', 8), (u'FGraillot', 8), (u'chain', 7), (u'HELPS_COIN', 7), (u'MadBitcoins', 7), (u'mPAtBit', 7), (u'ProjectCoin', 7)]

[(u'bitcoin', 345), (u'Bitcoin', 168), (u'dogecoin', 70), (u'crypto', 67), (u'btc', 58), (u'BTC', 58), (u'money', 56), (u'news', 54), (u'blockchain', 53), (u'love', 51)]



## Example 8. Using prettytable to display tuples in a nice tabular format

In [15]:
from prettytable import PrettyTable

for label, data in (('Word', words), 
                    ('Screen Name', screen_names), 
                    ('Hashtag', hashtags)):
    pt = PrettyTable(field_names=[label, 'Count']) 
    c = Counter(data)
    [ pt.add_row(kv) for kv in c.most_common()[:10] ]
    pt.align[label], pt.align['Count'] = 'l', 'r' # Set column alignment
    print pt

ImportError: No module named prettytable

## Example 9. Calculating lexical diversity for tweets

In [16]:
# A function for computing lexical diversity
def lexical_diversity(tokens):
    return 1.0*len(set(tokens))/len(tokens) 

# A function for computing the average number of words per tweet
def average_words(statuses):
    total_words = sum([ len(s.split()) for s in statuses ]) 
    return 1.0*total_words/len(statuses)

print lexical_diversity(words)
print lexical_diversity(screen_names)
print lexical_diversity(hashtags)
print average_words(status_texts)

0.345558739255
0.517520215633
0.178398710371
14.5416666667


## Example 10. Finding the most popular retweets

In [17]:
retweets = [
            # Store out a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text']) 
            
            # ... for each status ...
            for status in statuses 
            
            # ... so long as the status meets this condition.
                if status.has_key('retweeted_status')
           ]

# Slice off the first 5 from the sorted results and display each item in the tuple

pt = PrettyTable(field_names=['Count', 'Screen Name', 'Text'])
[ pt.add_row(row) for row in sorted(retweets, reverse=True)[:5] ]
pt.max_width['Text'] = 50
pt.align= 'l'
print pt

NameError: name 'PrettyTable' is not defined

## Example 11. Looking up users who have retweeted a status

In [18]:
# Get the original tweet id for a tweet from its retweeted_status node 
# and insert it here in place of the sample value that is provided
# from the text of the book

_retweets = twitter_api.statuses.retweets(id=317127304981667841)
print [r['user']['screen_name'] for r in _retweets]

[u'jyeee', u'Ceejaynatics', u'Majendalove', u'shameeennnn', u'EmilyearoRuiz', u'Oliviyaaay', u'ikaayyy_', u'RafaellaaaMae', u'LoveKyana18', u'iiaamcamillee', u'kidamgos', u'ShaaaronOng', u'asdfghjbl']


## Example 12. Plotting frequencies of words

In [19]:
word_counts = sorted(Counter(words).values(), reverse=True)

plt.loglog(word_counts)
plt.ylabel("Freq")
plt.xlabel("Word Rank")

NameError: name 'plt' is not defined

## Example 13. Generating histograms of words, screen names, and hashtags

In [None]:
for label, data in (('Words', words), 
                    ('Screen Names', screen_names), 
                    ('Hashtags', hashtags)):

    # Build a frequency map for each set of data
    # and plot the values
    c = Counter(data)
    plt.hist(c.values())
    
    # Add a title and y-label ...
    plt.title(label)
    plt.ylabel("Number of items in bin")
    plt.xlabel("Bins (number of times an item appeared)")
    
    # ... and display as a new figure
    plt.figure()

## Example 14. Generating a histogram of retweet counts

In [None]:
# Using underscores while unpacking values in
# a tuple is idiomatic for discarding them

counts = [count for count, _, _ in retweets]

plt.hist(counts)
plt.title("Retweets")
plt.xlabel('Bins (number of times retweeted)')
plt.ylabel('Number of tweets in bin')

print counts

Note: This histogram gives you an idea of how many times tweets are retweeted with the x-axis defining partitions for tweets that have been retweeted some number of times and the y-axis telling you how many tweets fell into each bin. For example, a y-axis value of 5 for the "15-20 bin" on the x-axis means that there were 5 tweets that were retweeted between 15 and 20 times.

Here's another variation that transforms the data using the (automatically imported from numpy) log function in order to improve the resolution of the plot.

In [None]:
# Using underscores while unpacking values in
# a tuple is idiomatic for discarding them

counts = [count for count, _, _ in retweets]

# Taking the log of the *data values* themselves can 
# often provide quick and valuable insight into the
# underlying distribution as well. Try it back on
# Example 13 and see if it helps.

plt.hist(log(counts))
plt.title("Retweets")
plt.xlabel('Log[Bins (number of times retweeted)]')
plt.ylabel('Log[Number of tweets in bin]')

print log(counts)