<img src="https://datasciencedegree.wisconsin.edu/wp-content/themes/data-gulp/images/logo.svg" width="300">


# Lesson 13 Activity -- Using ```tweepy``` for Data Mining

This is an introduction to data collection from <a href="http://www.twitter.com/">Twitter</a> using the [`tweepy`](http://www.tweepy.org/) package.

---

## Getting set up -- things you do about once

### Install Tweepy

You must install the tweepy package from either Anaconda or the terminal on your computer before using it!
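
The cells in this lesson use calls from the tweepy 3.x line (`api.search`, `tweepy.StreamListener`, `tweepy.TweepError`). Tweepy 4.0 renamed or removed these, so if you want to run the cells exactly as written, one option is to install an older release, for example with `pip install "tweepy<4"` in a terminal. The sketch below is only a convenience check and is not part of the original activity; the helper name `tweepy_status` is made up for illustration.

```python
from importlib import metadata


def tweepy_status():
    """Return a short message describing the installed tweepy version, if any."""
    try:
        version = metadata.version("tweepy")
    except metadata.PackageNotFoundError:
        return "tweepy is not installed"
    # The major version is the part before the first dot, e.g. "3" in "3.10.0".
    major = int(version.split(".")[0])
    if major >= 4:
        return ("tweepy " + version + " found: api.search, StreamListener "
                "and TweepError from this lesson changed in 4.0")
    return "tweepy " + version + " found"


print(tweepy_status())
```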

### Make a Twitter account and an app

You'll need a Twitter account, and then you'll need to set up an app at <a href="https://apps.twitter.com/">apps.twitter.com</a>.

### Save your credentials to an external file

Make a plain text file on your computer called `twitter_credentials.py`, and put it anywhere but this directory.  I put mine in my home directory for my user.  It will look something like this:

    con_key = 'your consumer key goes here'
    con_secret = 'your consumer secret goes here'
    acc_token = 'your access token goes here'
    acc_secret = 'your access secret goes here'
    
* Save your consumer key, consumer secret, access token, and access secret there.
* Don't share these secrets with others!  
* It's also possible to generate access tokens and secrets from within an app, but now's not the right time for this.

---

## Preliminaries to using tweepy -- things you do once per session

Do these steps once per session.  If you close your notebook or restart the kernel, repeat them before using the Tweepy interface to the Twitter API again.

#### 1. Gain access to the Tweepy library

Import it as you would any other Python library.

In [None]:
import tweepy

#### 2. Load your credentials from the external file

Run the plain text Python source file that holds your credentials, which lives somewhere else on your computer.

In [None]:
%run ~/twitter_credentials.py
# this cell will evaluate silently 🙊, and not print anything.  
# This is desired, because a person with your keys can act as you on Twitter in literally every way 😟

🔐 If you need to check whether the four variables, such as `con_key`, have the correct values, insert a cell and print the value, then delete the cell.  Keep your credentials secret and safe!
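
As an alternative to printing a secret in full, here is a minimal sketch that reports whether each of the four variables is set and shows only its last few characters. The helper names `mask` and `summarize_credentials` are made up for this illustration, and the demo values are stand-ins, not real keys.

```python
REQUIRED = ("con_key", "con_secret", "acc_token", "acc_secret")


def mask(secret, visible=4):
    """Hide all but the last few characters of a secret."""
    secret = str(secret)
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


def summarize_credentials(namespace):
    """Map each required name to a masked value, or 'MISSING' if absent or empty."""
    summary = {}
    for name in REQUIRED:
        value = namespace.get(name)
        summary[name] = mask(value) if value else "MISSING"
    return summary


# Stand-in values for illustration only.
demo = {"con_key": "abcd1234", "con_secret": "s3cr3tvalue", "acc_token": ""}
print(summarize_credentials(demo))
```

In the notebook, after loading your credentials, you could call `summarize_credentials(globals())` instead of passing the demo dictionary.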

#### 3. Make an `API` object

The `tweepy.API` object handles construction of the Twitter API calls for you.  It's a convenience layer, but it's really dang convenient!

In [None]:
#Use tweepy.OAuthHandler to create an authentication using the given key and secret
auth = tweepy.OAuthHandler(consumer_key=con_key, consumer_secret=con_secret)
auth.set_access_token(acc_token, acc_secret)

#Connect to the Twitter API using the authentication
api = tweepy.API(auth)

--- 

## Using the API

Twitter offers two kinds of API:
* The [REST](https://en.wikipedia.org/wiki/Representational_state_transfer) [API](https://en.wikipedia.org/wiki/Application_programming_interface) allows you to _pull_ information from Twitter, or _push_ information back to Twitter.  For example,  
  💡 if I wanted to have a Python script that ran as a CRON job to automatically tweet for me under certain conditions, I would use the REST API.
* The Streaming API allows us to monitor Twitter in real time, grabbing tweets as they are made.  For example,  
  💡 if I wanted to make a little device powered by a Raspberry Pi that showed interesting tweets in real time on a tiny screen by my desk, I would use the streaming API.

### Method 1. The REST API

The REST API allows you to _pull_ information from Twitter, or _push_ information back to Twitter.  We'll use the REST API to run a specific search.  You could also use the REST API to make automatic tweets on Twitter, or get information about specific users.

In [None]:
#Use the REST API for a static search
#Our example finds recent tweets using the hashtag #datascience

tweet_list = api.search(q='#datascience') #tweepy URL-encodes the query for us, so '#' is sent as %23

See [twitter's search documentation](https://dev.twitter.com/rest/public/search) for examples of query operators.  Its example URLs show queries in URL-encoded form, where `#` is written as `%23`.  [This w3schools page](https://www.w3schools.com/tags/ref_urlencode.asp) explains what `%23` and other URL encodings mean.
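
To see what URL encoding does to a query string, you can encode and decode one with Python's standard library. This small illustration makes no Twitter calls; the query strings are just examples.

```python
from urllib.parse import quote, unquote

query = "#datascience"

# safe="" asks quote() to encode every reserved character, including '/'.
encoded = quote(query, safe="")
print(encoded)           # %23datascience

# unquote() reverses the encoding.
print(unquote(encoded))  # #datascience

# Spaces are encoded too.
print(quote("#data science", safe=""))  # %23data%20science
```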

The search returns a list-like `SearchResults` object that holds one `Status` object per tweet.  Each `Status` is full of data such as the language, the identity of the poster, etc.

In [None]:
tweet_list[0]

In [None]:
#We can use the dir command to view a list of the attributes of each tweet
dir(tweet_list[0])

In [None]:
#Let's display the text of each tweet we found.
[tweet.text for tweet in tweet_list]

By default, the REST API returns 15 tweets.  We can get up to 100 by using the `count` argument.

In [None]:
tweet_list = api.search(q='#datascience', count = 100)
len(tweet_list)

If we want more than 100 tweets, we can use a *while* loop.  The `max_id` argument restricts the search to tweets whose ID is less than or equal to the value given, so passing one less than the ID of the oldest tweet we've seen so far collects progressively older tweets.

The `try/except/else` structure lets us fail gracefully in case the API search returns an error (e.g., if we run up against Twitter's rate limits).

In [None]:
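num_needed = 200
tweet_list = []
last_id = None # id of the oldest tweet seen so far
while len(tweet_list) < num_needed:
    # max_id is inclusive, so ask for ids strictly below the oldest one seen.
    # On the first pass there is no oldest tweet yet, so max_id is left out.
    paging = {} if last_id is None else {'max_id': last_id - 1}
    try:
        new_tweets = api.search(q = '#datascience', count = 100, **paging)
    except tweepy.TweepError as e:
        print("Error", e)
        break
    else:
        if not new_tweets:
            print("Could not find any more tweets!")
            break
        tweet_list.extend(new_tweets)
        last_id = new_tweets[-1].id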

len(tweet_list)

Note that the free REST API restricts the number of tweets you can retrieve, and the dates: you may not be able to retrieve tweets that are more than a week old.  Pay attention to this restriction as you approach your final project topic!
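
The `max_id` paging idea can be tried offline, without spending any of your rate limit. The sketch below is an illustration only: `fake_search` is a made-up stand-in that returns dictionaries with ids 250 down to 1, newest first, and `collect` is a hypothetical helper that pages through it the same way the loop above pages through Twitter.

```python
def collect(search, num_needed, page_size=100):
    """Page backwards through results using max_id until num_needed are collected."""
    results = []
    last_id = None  # id of the oldest item seen so far
    while len(results) < num_needed:
        kwargs = {"count": page_size}
        if last_id is not None:
            # max_id is inclusive, so step one below the oldest id seen.
            kwargs["max_id"] = last_id - 1
        page = search(**kwargs)
        if not page:
            break  # nothing older is available
        results.extend(page)
        last_id = page[-1]["id"]
    return results[:num_needed]


def fake_search(count, max_id=None):
    """Made-up stand-in for a search call: items with ids 250..1, newest first."""
    newest = 250 if max_id is None else min(max_id, 250)
    return [{"id": i} for i in range(newest, max(newest - count, 0), -1)]


tweets = collect(fake_search, 200)
print(len(tweets), tweets[0]["id"], tweets[-1]["id"])
```

With real tweepy results you would read `.id` from each `Status` rather than indexing a dictionary. Tweepy also ships a `tweepy.Cursor` helper that handles this kind of paging, which may be worth exploring once the manual loop makes sense.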

### Method 2. The Streaming API

The Streaming API allows us to monitor Twitter in real time, grabbing tweets as they are made.

The ```tweepy``` package includes a class called ```StreamListener``` which monitors Twitter for us.  However, by default StreamListener does nothing with the tweets it collects.

In this demonstration, we'll modify ```StreamListener``` to make a class that prints each tweet we're interested in to the screen.  Later, you may wish to create your own class which saves information from tweets to a file.

In [None]:
#We create a subclass of tweepy.StreamListener to add a response to on_status

class PrintingStreamListener(tweepy.StreamListener):

    def on_status(self, status):
        print(status.text)
        
    #disconnect the stream if we receive an error message indicating we are overloading Twitter
    
    def on_error(self, status_code):
        if status_code == 420:
            #returning False from on_error disconnects the stream
            return False
 

Once we have created our subclass, we can set up our own Twitter stream.

In [None]:
#We create an instance of our new PrintingStreamListener class, and an authenticated stream that uses it

my_stream_listener = PrintingStreamListener()
my_stream = tweepy.Stream(auth = api.auth, listener=my_stream_listener)

We'll use the ```track``` parameter to look for tweets with a specific keyword.  You can read more about constructing searches with ```track``` in the <a href="https://dev.twitter.com/streaming/overview/request-parameters#track">Twitter streaming API documentation</a>.

In [None]:
# Now, we're ready to start streaming!  We'll look for new tweets which use the word "data".
# You can pause the display of tweets by interrupting the Python kernel (use the menu bar at the top)

my_stream.filter(track=['data'])

In [None]:
# Even if you pause the display of tweets, your stream is still connected to Twitter!
# To disconnect (for example, if you want to change which words you are searching for), 
# use the disconnect() method.

my_stream.disconnect()
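
If you would rather save tweets than print them, one possible approach is to append each tweet to a file as one JSON object per line. This is a minimal sketch, not part of the original activity: `append_status` and `read_records` are made-up helper names, the choice of fields is an assumption, and a `SimpleNamespace` stands in for a real tweet so the example runs without a connection.

```python
import json
import os
import tempfile
from types import SimpleNamespace


def append_status(path, status):
    """Append a few fields of one tweet to a file as a single JSON line."""
    record = {
        "id": status.id,
        "created_at": str(status.created_at),
        "text": status.text,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def read_records(path):
    """Read the file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


# Stand-in tweet and a temporary file, for illustration only.
demo_path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")
fake_status = SimpleNamespace(id=1, created_at="2020-01-01 00:00:00",
                              text="working with data")
append_status(demo_path, fake_status)
append_status(demo_path, fake_status)
print(len(read_records(demo_path)))
```

In a listener subclass, `on_status` could call `append_status` with a file name of your choosing in place of `print(status.text)`.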

---

## Suggestions for skills to learn

* Collect 1000 tweets matching a search, or all available in the current time window, whichever comes first.  (The figure of 1000 is arbitrary.)
* Extract just the fields you are most interested in from a search, and create a Pandas data frame
* Follow the graph of followers from a specific Twitter user
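
As a possible starting point for the second suggestion, the sketch below turns each tweet into a flat dictionary of chosen fields. The field selection and the helper name `tweet_to_row` are assumptions made for illustration, and `SimpleNamespace` objects stand in for real tweets so the example runs offline.

```python
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Pick out a handful of fields from one tweet as a flat dictionary."""
    return {
        "id": tweet.id,
        "created_at": tweet.created_at,
        "screen_name": tweet.user.screen_name,
        "lang": tweet.lang,
        "text": tweet.text,
    }


# Stand-in tweets for illustration only.
fake_tweets = [
    SimpleNamespace(id=2, created_at="2020-01-02", lang="en",
                    user=SimpleNamespace(screen_name="someone"),
                    text="hello #datascience"),
    SimpleNamespace(id=1, created_at="2020-01-01", lang="fr",
                    user=SimpleNamespace(screen_name="quelqu_un"),
                    text="bonjour #datascience"),
]

rows = [tweet_to_row(tweet) for tweet in fake_tweets]
print(rows[0]["screen_name"], len(rows))
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` to get one column per field.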

---

## Useful resources and links

* [the structure of the Status object of Tweepy](https://gist.github.com/dev-techmoe/ef676cdd03ac47ac503e856282077bf2)
* [Tweet Data Dictionary](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)
* [Standard Operators](https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators) -- premium operators cost money.
* [Twitter operators by product](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/operators-by-product) -- by product they mean *paid access level*
* [How to use Twitter’s Search REST API most effectively](https://www.karambelkar.info/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)
* [Collecting Tweets with Tweepy](http://www.dealingdata.net/2016/07/23/PoGo-Series-Tweepy/)
