# Data Mining Twitter with tweepy

There are at least seven Python interfaces to the Twitter web Application Programming Interface (API).  We will use `tweepy`, since the [documentation is clear](http://www.tweepy.org/), and there are [interesting applications available to get started](http://adilmoujahid.com/posts/2014/07/twitter-analytics/).

## Getting started

First you will need to install tweepy.  The most straightforward way is through the `pip` installation tool.  This can be run from the command line using:

    pip install tweepy
    
or from within a Canopy IPython shell:

    !pip install tweepy
    
If you get this Exception:

    TypeError: parse_requirements() got an unexpected keyword argument 'session'


Make sure you upgrade pip to the newest version:

    pip install --upgrade pip


Twitter uses the [OAuth protocol](https://dev.twitter.com/oauth/overview/faq) for secure application development.  Many applications access Twitter on your behalf (for example, when you use your Twitter account to log in to a different website), and this protocol prevents information like your password from being passed through these intermediaries.  While this is a great security measure for intermediate client access, it adds an extra step before we can communicate directly with the API.  To access Twitter, you need to [create an app](https://apps.twitter.com); however, I've already created one that we can all use: `GWU_TEST_APP`.  To interact with `GWU_TEST_APP`, you'll need an access token.

<br>

[Request an access token here.](https://apps.twitter.com/app/7965526/keys)

<br>

Store your consumer key and consumer secret somewhere you'll remember them.  I'm storing mine in Python strings, but for security I'm not displaying the full values:

    consumer_key = 'jrCYD....'
    consumer_secret = '...' 
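
One way to keep the keys out of the notebook entirely is to read them from environment variables.  The sketch below is an assumption about how you might set this up, not part of the original workflow; the variable names `TWITTER_CONSUMER_KEY` and `TWITTER_CONSUMER_SECRET` and the helper `load_credentials` are hypothetical.

```python
import os

def load_credentials(environ=None):
    """Return (consumer_key, consumer_secret) read from environment variables.

    The variable names are hypothetical; pick whatever names you like.
    """
    if environ is None:
        environ = os.environ
    try:
        return environ['TWITTER_CONSUMER_KEY'], environ['TWITTER_CONSUMER_SECRET']
    except KeyError as missing:
        raise RuntimeError('Set %s before running the notebook' % missing)
```

After exporting both variables in your shell, `consumer_key, consumer_secret = load_credentials()` gives you the same two strings without ever typing them into a cell.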
    

Here is a discussion of the difference between the access token and the consumer key, although for our purposes it's not so important: http://stackoverflow.com/questions/20720752/whats-the-difference-between-twitter-consumer-key-and-access-token

```
The consumer key is for your application and client tokens are for end users in your application's context. If you want to call in just the application context, then consumer key is adequate. You'd be rate limited per application and won't be able to access user data that is not public. With the user token context, you'll be rate limited per token/user, this is desirable if you have several users and need to make more calls than application context rate limiting allows. This way you can access private user data. Which to use depends on your scenarios.
```

## Example 1: Read Tweets Appearing on Homepage

With the `consumer_key` and `consumer_secret` stored, let's try a Hello World example from Tweepy's docs.  This will access the tweets appearing on the user's home timeline, as if they had logged in to Twitter.  **For brevity, we'll only print the first three**.

In [1]:
import tweepy

# Paste your own credentials here (values elided for security)
consumer_key = 'jrCYD....'
consumer_secret = '...'

access_token = '...'
access_token_secret = '...'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

public_tweets = api.home_timeline()
for (idx, tweet) in enumerate(public_tweets[0:3]): #First 3 tweets in my public feed
    print 'TWEET %s:\n\n\t%s\n\n' % (idx, tweet.text)

TWEET 0:

	RT @Gawker: AccuWeather slams the NWS for missing a tornado AccuWeather didn't cover. http://t.co/2NUp31vJHa


TWEET 1:

	RT @SarcasticRover: This comic isn't about me, and yet, I still like it: http://t.co/ojdydJDDSG http://t.co/glVj7hRsM9


TWEET 2:

	Sad news tonight: No Hays eaglets in 2015 http://t.co/m2jULOTPCk http://t.co/pjnZsCRdxm




When we used `tweet.text`, we implicitly used a Python class defined by `tweepy`.

In [2]:
type(tweet)

tweepy.models.Status

There are many attributes associated with a `Status` object.  

In [3]:
tweet.__dict__.keys()

['contributors',
 'truncated',
 'text',
 'in_reply_to_status_id',
 'id',
 'favorite_count',
 '_api',
 'author',
 '_json',
 'coordinates',
 'entities',
 'in_reply_to_screen_name',
 'id_str',
 'retweet_count',
 'in_reply_to_user_id',
 'favorited',
 'source_url',
 'user',
 'geo',
 'in_reply_to_user_id_str',
 'possibly_sensitive',
 'possibly_sensitive_appealable',
 'lang',
 'created_at',
 'in_reply_to_status_id_str',
 'place',
 'source',
 'extended_entities',
 'retweeted']
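
Since a `Status` is an ordinary Python object, `getattr` can pull a handful of these attributes into a plain dictionary.  The sketch below uses a stand-in class (`FakeStatus`, hypothetical) rather than a live API call, and the helper name `summarize` is also an assumption; the attribute names are taken from the list above.

```python
class FakeStatus(object):
    """Hypothetical stand-in for tweepy.models.Status, for offline testing."""
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

def summarize(status, fields=('id', 'created_at', 'lang', 'retweet_count', 'text')):
    """Collect selected attributes into a dict; missing ones become None."""
    return dict((name, getattr(status, name, None)) for name in fields)

# Invented values, only to exercise the helper
example = FakeStatus(id=1, lang='en', retweet_count=3, text='hello world')
summary = summarize(example)
print(summary['text'])
```

With a real connection, the same call should work on any element of `public_tweets`.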

## Example 2: What's trending where?

According to the [tweepy API](http://tweepy.readthedocs.org/en/v3.2.0/api.html), we can return the top 10 trending topics for a specific location, where the location is a `WOEID (Yahoo Where on Earth ID)`. 

<br>

The WOEID is a unique identifier, similar to a zip code, but it covers locations worldwide.  For example, my hometown of Pittsburgh has a WOEID of 2473224.  You can search for WOEIDs here: http://woeid.rosselliot.co.nz/

<br>

Let's return the top ten trending topics in Pittsburgh:

In [4]:
top10 = api.trends_place(id=2473224)
top10

[{u'as_of': u'2015-03-28T00:59:24Z',
  u'created_at': u'2015-03-28T00:56:51Z',
  u'locations': [{u'name': u'Pittsburgh', u'woeid': 2473224}],
  u'trends': [{u'name': u'#CallMeCam',
    u'promoted_content': None,
    u'query': u'%23CallMeCam',
    u'url': u'http://twitter.com/search?q=%23CallMeCam'},
   {u'name': u'#WVUvsUK',
    u'promoted_content': None,
    u'query': u'%23WVUvsUK',
    u'url': u'http://twitter.com/search?q=%23WVUvsUK'},
   {u'name': u'#CarrotForANight',
    u'promoted_content': None,
    u'query': u'%23CarrotForANight',
    u'url': u'http://twitter.com/search?q=%23CarrotForANight'},
   {u'name': u'Ultra',
    u'promoted_content': None,
    u'query': u'Ultra',
    u'url': u'http://twitter.com/search?q=Ultra'},
   {u'name': u'Huggins',
    u'promoted_content': None,
    u'query': u'Huggins',
    u'url': u'http://twitter.com/search?q=Huggins'},
   {u'name': u'Pittsburgh',
    u'promoted_content': None,
    u'query': u'Pittsburgh',
    u'url': u'http://twitter.com/search

The result is a `JSON` object.  JSON is a standardized data-encoding format that is both human-readable and machine-readable.  

<br>

In Python, decoded JSON is represented as nested dictionaries and lists: a JSON object becomes a `dict` and a JSON array becomes a `list`.  JSON stands for JavaScript Object Notation because it is based on a subset of the JavaScript language; however, it is a data-encoding format implemented in many languages.  
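
To make that mapping concrete, here is a small sketch using the standard `json` module.  The sample string is a trimmed, hand-written imitation of the trends response above, not real API output.

```python
import json

# Hand-written sample shaped like the trends response (not real API output)
raw = ('[{"locations": [{"name": "Pittsburgh", "woeid": 2473224}], '
       '"trends": [{"name": "Pens", "promoted_content": null}]}]')

decoded = json.loads(raw)            # JSON array -> list, JSON object -> dict
first = decoded[0]
print(type(decoded).__name__)        # list
print(type(first).__name__)          # dict
print(first['locations'][0]['woeid'])
print(first['trends'][0]['promoted_content'])   # JSON null -> None

# Going the other way: Python objects back to a JSON string
encoded = json.dumps(first['locations'])
```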

<br>

Looking at this structure, we see that it's contained in a list; in fact, it's a list of one element.  Let's access the top ten trends:

In [5]:
top10[0]['trends']

[{u'name': u'#CallMeCam',
  u'promoted_content': None,
  u'query': u'%23CallMeCam',
  u'url': u'http://twitter.com/search?q=%23CallMeCam'},
 {u'name': u'#WVUvsUK',
  u'promoted_content': None,
  u'query': u'%23WVUvsUK',
  u'url': u'http://twitter.com/search?q=%23WVUvsUK'},
 {u'name': u'#CarrotForANight',
  u'promoted_content': None,
  u'query': u'%23CarrotForANight',
  u'url': u'http://twitter.com/search?q=%23CarrotForANight'},
 {u'name': u'Ultra',
  u'promoted_content': None,
  u'query': u'Ultra',
  u'url': u'http://twitter.com/search?q=Ultra'},
 {u'name': u'Huggins',
  u'promoted_content': None,
  u'query': u'Huggins',
  u'url': u'http://twitter.com/search?q=Huggins'},
 {u'name': u'Pittsburgh',
  u'promoted_content': None,
  u'query': u'Pittsburgh',
  u'url': u'http://twitter.com/search?q=Pittsburgh'},
 {u'name': u'Notre Dame',
  u'promoted_content': None,
  u'query': u'%22Notre+Dame%22',
  u'url': u'http://twitter.com/search?q=%22Notre+Dame%22'},
 {u'name': u'UCLA',
  u'promoted_con

As you can see, there's a lot of metadata attached to even a simple trend.  Let's cycle through each of these trends and print the `name` and URL of each.

In [6]:
for trend in top10[0]['trends']:
    print trend['name'], trend['url']

#CallMeCam http://twitter.com/search?q=%23CallMeCam
#WVUvsUK http://twitter.com/search?q=%23WVUvsUK
#CarrotForANight http://twitter.com/search?q=%23CarrotForANight
Ultra http://twitter.com/search?q=Ultra
Huggins http://twitter.com/search?q=Huggins
Pittsburgh http://twitter.com/search?q=Pittsburgh
Notre Dame http://twitter.com/search?q=%22Notre+Dame%22
UCLA http://twitter.com/search?q=UCLA
Trailer Park Boys http://twitter.com/search?q=%22Trailer+Park+Boys%22
Pens http://twitter.com/search?q=Pens
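
The `query` field is the URL-encoded form of the trend name (`%23` is `#`, `%22` is a double quote, and `+` is a space).  As a sketch, the standard library can decode it; the sample list below copies three entries from the output above, and the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import unquote_plus   # Python 3
except ImportError:
    from urllib import unquote_plus         # Python 2

# Three entries copied from the trends output above
sample_trends = [
    {'name': '#CallMeCam', 'query': '%23CallMeCam'},
    {'name': 'Notre Dame', 'query': '%22Notre+Dame%22'},
    {'name': 'Pens', 'query': 'Pens'},
]

decoded_queries = [unquote_plus(trend['query']) for trend in sample_trends]
for text in decoded_queries:
    print(text)
```

Multi-word trends such as Notre Dame decode to a quoted phrase, which is how the search URL asks for an exact match.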


## Example 3: Streaming and Data Mining

*This Streaming tutorial follows closely [Adil Moujahid's great tweepy examples](http://adilmoujahid.com/posts/2014/07/twitter-analytics/)*

<br>

Twitter offers a [Streaming API](https://dev.twitter.com/streaming/overview) to make it easier to query streams of tweets.  The Streaming API encapsulates some pain points of REST access to ensure that stream calls don't exceed the rate limit.  Think of streams as Twitter's suggested means for beginners to collect live data.  You don't have to use them, but they're recommended and will make life easier.  There are three stream types:

   - `Public Streams:` Streams of public data flowing through Twitter.  Suitable for following specific users or topics, or for data mining.
    
   - `User Streams:` Single-user streams, containing roughly all of the data corresponding to a single user's view of Twitter.
    
   - `Site Streams:`  The multi-user version of user streams.  
   
<br>
    
We'll resist the temptation to mess with our friends' Twitter accounts and focus solely on `Public Streams`.  Combining these streams with text filters will let us accumulate content.  For example, we could look for tweets involving the text *foxnews*.  `tweepy` and `Twitter's API` will configure the stream and filter to work nicely; you just provide the content tags you're interested in.  Finally, **remember that the more obscure the content, the longer it will take to find**.

<br>

**<font color='red'>The following snippet will run until `max_tweets` or `max_seconds` is reached.  If running in a notebook, it will hold up other cells until the limit is reached.  Therefore, for long runtimes, you may want to run it in an external Python program, which you can terminate at will.  I also recommend restarting the notebook kernel before running this cell multiple times...</font>**  

In [61]:
#Import the necessary methods from tweepy library
import sys
import time
import datetime

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#This is a basic listener that prints progress and writes received tweets to a file or stdout.
class StreamParser(StreamListener):
    """ Controls how streaming data is parsed. Pass an outfile, or data will be written to 
    sys.stdout (e.g. the screen)
    """
    def __init__(self, outfile=None, max_tweets=5, max_seconds=30):
        self.counter = 0
        self.start_time = time.time()
        # Set upper limits on maximum tweets or seconds before timeout
        self.max_tweets = max_tweets
        self.max_seconds = max_seconds
        if outfile:
            self.stdout = open(outfile, 'w')
        else:
            self.stdout = sys.stdout
    
    def on_data(self, data):
        """ Data is a JSON-formatted string.  Count it, check the limits, then write it out as-is"""
        self.counter += 1
        # time.time() returns a timestamp (seconds since the epoch) as a float
        current_time = time.time()
        run_time = current_time - self.start_time
                
        # A datetime is easier to read than a raw timestamp in the progress message
        formatted_time = datetime.datetime.now()
            
        # Technically, might not be the best place to put kill statements, but works well enough
        if self.max_tweets:
            if self.counter > self.max_tweets:
                self._kill_stdout()
                raise SystemExit('Max tweets of %s exceeded.  Killing stream... see %s' \
                             % (self.max_tweets, self.stdout))
  
        if self.max_seconds:
            if run_time > self.max_seconds:
                self._kill_stdout()
                raise SystemExit('Max time of %s seconds exceeded.  Killing stream... see %s' \
                                 % (self.max_seconds, self.stdout))

        print 'Tweet %s at %s.\nEllapsed: %.2f seconds\n' % \
             (self.counter, formatted_time, run_time)

        # Write to file; returning True keeps the stream running (returning False would stop it)
        self.stdout.write(data)
        return True

    def _kill_stdout(self):
        """ If self.stdout is a file, close it.  If sys.stdout, pass"""
        if self.stdout is not sys.stdout:
            self.stdout.close() 
    
    def on_error(self, status):
        print status


#This handles Twitter authentication and the connection to the Twitter Streaming API
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Stream 10 tweets, no matter the time it takes!
listener = StreamParser(outfile='test.txt', max_tweets=10, max_seconds=None)
stream = Stream(auth, listener)

#This line filters Twitter streams to capture data by the keywords: 'ipython', 'drugs', 'rockandroll'
stream.filter(track=['ipython', 'drugs', 'rockandroll'])

Tweet 1 at 2015-03-27 21:36:23.312925.
Ellapsed: 0.82 seconds

Tweet 2 at 2015-03-27 21:36:23.398453.
Ellapsed: 0.91 seconds

Tweet 3 at 2015-03-27 21:36:24.039299.
Ellapsed: 1.55 seconds

Tweet 4 at 2015-03-27 21:36:26.303319.
Ellapsed: 3.81 seconds

Tweet 5 at 2015-03-27 21:36:28.103006.
Ellapsed: 5.61 seconds

Tweet 6 at 2015-03-27 21:36:32.845835.
Ellapsed: 10.36 seconds

Tweet 7 at 2015-03-27 21:36:34.851696.
Ellapsed: 12.36 seconds

Tweet 8 at 2015-03-27 21:36:35.339390.
Ellapsed: 12.85 seconds

Tweet 9 at 2015-03-27 21:36:35.912185.
Ellapsed: 13.42 seconds

Tweet 10 at 2015-03-27 21:36:36.107684.
Ellapsed: 13.62 seconds



SystemExit: Max tweets of 10 exceeded.  Killing stream... see <closed file 'test.txt', mode 'w' at 0x7f0e00a57f60>

To exit: use 'exit', 'quit', or Ctrl-D.
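
The stopping rule inside `on_data` can be pulled out into a small pure function, which makes it easy to check without opening a stream.  This is a sketch of the same logic, not a change to the listener; the function name `limit_reached` is hypothetical.

```python
def limit_reached(counter, run_time, max_tweets=None, max_seconds=None):
    """Return True when either limit has been exceeded.

    A limit of None (or 0) is treated as 'no limit', mirroring the
    truthiness checks in StreamParser.on_data.
    """
    if max_tweets and counter > max_tweets:
        return True
    if max_seconds and run_time > max_seconds:
        return True
    return False
```

Because the counter is incremented before the check, `max_tweets=10` stops on the eleventh tweet, which is why exactly ten tweets end up in the file.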


### Load data into JSON

If only one tweet were saved, we could just use json.loads() to read it in right away, but
for a file with multiple tweets, we need to [read them in one at a time](http://stackoverflow.com/questions/21058935/python-json-loads-shows-valueerror-extra-data). 

<br>

Each tweet JSON object is one long line, so we can read the file line by line and parse each line on its own.  For example:

In [62]:
import json

tweets = []
with open('test.txt', 'r') as infile:
    for line in infile:
        tweets.append(json.loads(line))

In [63]:
len(tweets)

10
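
If the stream was interrupted mid-write, the last line of the file may be incomplete, and the file may also contain blank lines.  A slightly more defensive loader, sketched below, skips blank lines and stops at the first line that fails to parse.  The function name `load_tweets` is hypothetical.

```python
import json

def load_tweets(path):
    """Parse one JSON object per line; stop at the first unparseable line."""
    tweets = []
    with open(path, 'r') as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue          # skip blank lines
            try:
                tweets.append(json.loads(line))
            except ValueError:    # json raises ValueError (or a subclass) on bad input
                break
    return tweets
```

Calling `load_tweets('test.txt')` should give the same list as the loop above when the file is intact.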

The tweet text itself is stored in the `text` metadata field:

In [67]:
tweets[0]['text']

u'@_youhadonejob Also took the record amount of drugs. Unofficially.'

Check out all of the metadata you can get from a tweet! 

In [64]:
sorted(tweets[0].keys())

[u'contributors',
 u'coordinates',
 u'created_at',
 u'entities',
 u'favorite_count',
 u'favorited',
 u'filter_level',
 u'geo',
 u'id',
 u'id_str',
 u'in_reply_to_screen_name',
 u'in_reply_to_status_id',
 u'in_reply_to_status_id_str',
 u'in_reply_to_user_id',
 u'in_reply_to_user_id_str',
 u'lang',
 u'place',
 u'possibly_sensitive',
 u'retweet_count',
 u'retweeted',
 u'source',
 u'text',
 u'timestamp_ms',
 u'truncated',
 u'user']
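
Several of these fields need a little parsing before they are useful.  As one example, `created_at` arrives as a string; the sketch below converts a string in that format into a `datetime`.  The sample value is copied from a `created_at` field that appears in this notebook's output, the helper name `parse_created_at` is hypothetical, and the code assumes the UTC offset is always the literal `+0000`.

```python
from datetime import datetime

def parse_created_at(value):
    """Convert Twitter's 'created_at' string into a naive UTC datetime.

    Assumes the UTC offset is the literal '+0000'.
    """
    return datetime.strptime(value, '%a %b %d %H:%M:%S +0000 %Y')

created = parse_created_at('Wed Jan 11 13:44:33 +0000 2012')
print(created.isoformat())
```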

Within these fields, there's even more information.  For example, the `user` field provides information about the account that posted the tweet, and the `entities` field describes items embedded in the tweet, such as links, images, hashtags, and user mentions:

In [65]:
tweets[0]['user']

{u'contributors_enabled': False,
 u'created_at': u'Wed Jan 11 13:44:33 +0000 2012',
 u'default_profile': True,
 u'default_profile_image': False,
 u'description': u'Football lover, sun worshipper, accounts wizard, pool hustler, stargazer, runner.',
 u'favourites_count': 155,
 u'follow_request_sent': None,
 u'followers_count': 74,
 u'following': None,
 u'friends_count': 186,
 u'geo_enabled': False,
 u'id': 461128587,
 u'id_str': u'461128587',
 u'is_translator': False,
 u'lang': u'en',
 u'listed_count': 1,
 u'location': u'Eastbourne',
 u'name': u'Keith Axell',
 u'notifications': None,
 u'profile_background_color': u'C0DEED',
 u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png',
 u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png',
 u'profile_background_tile': False,
 u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/461128587/1384693366',
 u'profile_image_url': u'http://pbs.twimg.com/profile_images/43953

In [66]:
tweets[0]['entities']

{u'hashtags': [],
 u'symbols': [],
 u'trends': [],
 u'urls': [],
 u'user_mentions': [{u'id': 1653442136,
   u'id_str': u'1653442136',
   u'indices': [0, 14],
   u'name': u'You had one job',
   u'screen_name': u'_youhadonejob'}]}
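
Because `entities` is just nested dictionaries and lists, pulling out the mentioned accounts takes one list comprehension.  The sketch below uses a copy of the output above as its input; the helper name `mentioned_users` is hypothetical.

```python
def mentioned_users(tweet):
    """Return the screen names mentioned in a tweet dictionary."""
    mentions = tweet.get('entities', {}).get('user_mentions', [])
    return [mention['screen_name'] for mention in mentions]

# Copied from the tweet text and entities output shown above
sample_tweet = {
    'text': '@_youhadonejob Also took the record amount of drugs. Unofficially.',
    'entities': {
        'hashtags': [], 'symbols': [], 'trends': [], 'urls': [],
        'user_mentions': [{'id': 1653442136, 'id_str': '1653442136',
                           'indices': [0, 14], 'name': 'You had one job',
                           'screen_name': '_youhadonejob'}],
    },
}

names = mentioned_users(sample_tweet)
start, end = sample_tweet['entities']['user_mentions'][0]['indices']
print(names)
print(sample_tweet['text'][start:end])   # the mention as it appears in the text
```

The `indices` pair is a character range into `text`, so slicing with it recovers the mention including its `@`.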

### Alchemy

The Alchemy API is an artificial intelligence toolkit for machine learning tasks like facial recognition, sentiment analysis and so forth.  If I understand correctly, it's built in part on scikit-learn and has Python roots.  With a Python SDK, it's a great opportunity to do some serious analysis of these Twitter streams.

http://www.alchemyapi.com/

In [None]:
import os
os.chdir('/home/glue/Desktop/alchemyapi_python/')
from alchemyapi import AlchemyAPI as alcapi
from types import MethodType

ALCAPI = alcapi() #<-- Instantiate
print 'The available attributes and methods for Alchemy API are:\n' 
sorted(alcapi.__dict__.keys())

In [None]:
help(alcapi.sentiment)

In [None]:
FOXARTICLE = 'http://www.foxnews.com/us/2015/02/24/southern-california-commuter-train-crashes-into-truck-injuries-reported/'
GOODARTICLE = 'http://www.goodnewsnetwork.org/company-gives-employees-1000-job-well-done/'

badnews = ALCAPI.sentiment('url', FOXARTICLE)
goodnews = ALCAPI.sentiment('url', GOODARTICLE)

print 'Article from Fox News:\n\t', badnews['docSentiment']
print '\n'
print 'Article from Good News Network:\n\t', goodnews['docSentiment']

### Image Extraction

In [None]:
from IPython.display import Image
image_extract = ALCAPI.imageExtraction('url', GOODARTICLE)

# Use ipython's display system to render the image
Image(image_extract['image'])

Looks like it found an ad on the page, not the actual main image.  Let's try the "always infer" option, which is supposed to be more rigorous in how it picks the image (although I don't know how):

In [None]:
image_extract = ALCAPI.imageExtraction('url', GOODARTICLE, options=dict(extractMode='always-infer'))

# Use ipython's display system to render the image
Image(image_extract['image'])

This is an ad appearing at the bottom of the page!  The actual image we want is:

In [None]:
Image('http://www.goodnewsnetwork.org/wp-content/uploads/2015/02/Joseph-Beyer-giant-check-for-1000-submitted.jpg')

### Face Tagging

In [None]:
tagged = ALCAPI.faceTagging('url',
                   'http://www.goodnewsnetwork.org/wp-content/uploads/2015/02/Joseph-Beyer-giant-check-for-1000-submitted.jpg')
tagged

Notice that this gives an X and Y position for the face?  We can use these to crop the face out of the image with `scikit-image`.

In [None]:
import skimage.io as skio
%pylab inline

# Read into a skimage image
somedude = skio.imread('http://www.goodnewsnetwork.org/wp-content/uploads/2015/02/Joseph-Beyer-giant-check-for-1000-submitted.jpg')
imshow(somedude);

In [None]:
def _parseFace(attr):
    """ Shortcut for tagged['imageFaces'][0]['attr'] """
    return int(tagged['imageFaces'][0][attr])
    
X, Y, WIDTH, HEIGHT = _parseFace('positionX'), _parseFace('positionY'), _parseFace('width'), _parseFace('height')

# scikit-image arrays are indexed [row, column], i.e. [Y, X], the reverse of the X, Y order above
imshow(somedude[Y:Y+HEIGHT, X:X+WIDTH]);
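
The index order is the part that is easy to get wrong: image arrays are indexed `[row, column]`, which is `[Y, X]`.  The sketch below shows the same crop on a tiny nested list standing in for the image array (NumPy is deliberately left out so the snippet has no dependencies); `crop` and the 4x5 grid are made up for illustration.

```python
def crop(image, x, y, width, height):
    """Crop a list-of-rows image; rows are Y, columns are X."""
    return [row[x:x + width] for row in image[y:y + height]]

# 4 rows (Y) by 5 columns (X); each cell records its own (x, y) position
grid = [[(col, row) for col in range(5)] for row in range(4)]

patch = crop(grid, x=1, y=2, width=3, height=2)
for row in patch:
    print(row)
```

Every cell in `patch` has an x between 1 and 3 and a y of 2 or 3, confirming that the row slice selects Y and the column slice selects X.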

In [None]:
tagged.keys()

### Two faces

In [None]:
TWOPEEPS = 'http://media.northlandsnewscenter.com/images/400*264/tree_theft.jpg'
imshow(skio.imread(TWOPEEPS));

Let's see what AlchemyAPI detects in this image:

In [None]:
twopeeps = ALCAPI.faceTagging('url', TWOPEEPS)
twopeeps

### No Faces

In [None]:
TREE = 'http://higherperspective.com/wp-content/uploads/2014/08/oak-tree.jpg'
treepeeps = ALCAPI.faceTagging('url', TREE)

imshow(skio.imread(TREE))
treepeeps
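
Since a response may describe zero, one, or several faces, it is safer to loop over `imageFaces` than to index `[0]`.  The sketch below assumes the response layout used by `_parseFace` earlier (an `imageFaces` list whose entries hold `positionX`, `positionY`, `width`, and `height` as strings); the sample responses are hand-written, not real AlchemyAPI output, and `face_boxes` is a hypothetical helper.

```python
def face_boxes(response):
    """Return a list of (x, y, width, height) integer tuples, one per face."""
    boxes = []
    for face in response.get('imageFaces', []):
        boxes.append(tuple(int(face[key]) for key in
                           ('positionX', 'positionY', 'width', 'height')))
    return boxes

# Hand-written stand-ins for API responses (all values are invented)
two_faces = {'imageFaces': [
    {'positionX': '40', 'positionY': '30', 'width': '50', 'height': '50'},
    {'positionX': '210', 'positionY': '25', 'width': '48', 'height': '48'},
]}
no_faces = {'imageFaces': []}

print(face_boxes(two_faces))
print(face_boxes(no_faces))
```

Each tuple can be passed straight to the cropping slice used above, and an image without faces simply yields an empty list instead of an `IndexError`.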

This shit is pretty legit...
