<img src="https://datasciencedegree.wisconsin.edu/wp-content/themes/data-gulp/images/logo.svg" width="300">


# Lesson 13 Activity -- Using ```tweepy``` for Data Mining

This is an introduction to data collection from <a href="http://www.twitter.com/">Twitter</a> using the [`tweepy`](http://www.tweepy.org/) package.

---

## Getting set up -- things you do about once

### Install Tweepy

You must install the tweepy package from either Anaconda or the terminal on your computer before using it!

### Make a Twitter account and register an app

You'll need a Twitter account, and then you'll need to set up an app at <a href="https://apps.twitter.com/">apps.twitter.com</a> to get the credentials described below.

### Save your credentials to an external file

Make a plain text file on your computer called `twitter_credentials.py`, and put it anywhere but this directory.  I put mine in my home directory for my user.  It will look something like this:

    con_key = 'your consumer key goes here'
    con_secret = 'your consumer secret goes here'
    acc_token = 'your access token goes here'
    acc_secret = 'your access secret goes here'
    
* Save your consumer key, consumer secret, access token, and access secret there.
* Don't share these secrets with others!  
* It's also possible to generate access tokens and secrets from within an app, but now's not the right time for this.

---

## Preliminaries to using tweepy -- things you do once per session

You have to do these things about once per session.  If you close your notebook, or restart the kernel, then you have to do these things before you can again use the Tweepy interface to the Twitter API.

#### 1. Gain access to the Tweepy library

As with any other Python library, use `import`.

In [1]:
from pathlib import Path
home = str(Path.home())

In [2]:
Path.home()

PosixPath('/Users/markhanson')

In [3]:
import tweepy

#### 2. Load your credentials from the external file

Run the plain text Python source file that holds your credentials, which is located somewhere else on your computer.

In [4]:
%run ~/twitter_credentials.py
# this cell will evaluate silently 🙊, and not print anything.  
# This is desired, because a person with your keys can act as you on Twitter in literally every way 😟
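
Outside a notebook, where the `%run` magic is not available, one possible alternative is `runpy.run_path` from the standard library, which executes a file and returns its top-level variables as a dictionary. The sketch below is an illustration under assumptions: it writes a throwaway stand-in credentials file (placeholder values only) to a temporary directory so that it runs anywhere. In practice you would point `cred_path` at your real file, for example `Path.home() / 'twitter_credentials.py'`.

```python
import runpy
import tempfile
from pathlib import Path

# Stand-in for your real file; the values are placeholders, not real keys.
tmp_dir = Path(tempfile.mkdtemp())
cred_path = tmp_dir / "twitter_credentials.py"
cred_path.write_text(
    "con_key = 'placeholder-key'\n"
    "con_secret = 'placeholder-secret'\n"
    "acc_token = 'placeholder-token'\n"
    "acc_secret = 'placeholder-token-secret'\n"
)

# run_path executes the file and returns its top-level names as a dict.
creds = runpy.run_path(str(cred_path))

# Confirm that all four expected variables are present, without printing them.
required = ["con_key", "con_secret", "acc_token", "acc_secret"]
missing = [name for name in required if name not in creds]
print("missing credentials:", missing)
```

Checking for missing names, rather than printing values, keeps the secrets out of the notebook output.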

🔐 If you need to check whether the four variables, such as `con_key` have the correct value, insert a cell and print the value, then delete the cell.  Keep your credentials secret and safe!!!  
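
One way to check a value without exposing all of it is to print only its last few characters. The helper below is a hypothetical sketch (the name `mask` and the placeholder string are not part of the notebook); with your own variables you would call, for example, `mask(con_key)`.

```python
def mask(secret, visible=4):
    """Return the secret with all but the last `visible` characters starred out."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]

# Placeholder value for illustration only; substitute your own variable.
example = "placeholder-consumer-key"
print(mask(example))
```

A masked value is usually enough to confirm that the right key was loaded, and it is safe to leave in a saved notebook.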

In [5]:
print(con_key)
print(con_secret)
print(acc_token)
print(acc_secret)

[output redacted: the four credential values were printed here]


#### 3. Make an `API` object

The `tweepy.API` object handles construction of the Twitter API calls for you.  It's a convenience layer, but it's really dang convenient!

In [6]:
#Use tweepy.OAuthHandler to create an authentication using the given key and secret
auth = tweepy.OAuthHandler(consumer_key=con_key, consumer_secret=con_secret)
auth.set_access_token(acc_token, acc_secret)

#Connect to the Twitter API using the authentication
api = tweepy.API(auth)

--- 

## Using the API

Twitter offers two kinds of API:
* The [REST](https://en.wikipedia.org/wiki/Representational_state_transfer) [API](https://en.wikipedia.org/wiki/Application_programming_interface) allows you to _pull_ information from Twitter, or _push_ information back to Twitter.  For example,  
  💡 if I wanted to have a Python script that ran as a CRON job to automatically tweet for me under certain conditions, I would use the REST API.
* The Streaming API allows us to monitor Twitter in real time, grabbing tweets as they are made.  For example,  
  💡 if I wanted to make a little device powered by a Raspberry Pi that showed interesting tweets in real time on a tiny screen by my desk, I would use the streaming API.

### Method 1. The REST API

The REST API allows you to _pull_ information from Twitter, or _push_ information back to Twitter.  We'll use the REST API to run a specific search.  You could also use the REST API to make automatic tweets on Twitter, or get information about specific users.

In [7]:
#Use the REST API for a static search
#Our example finds recent tweets using the hashtag #givingTuesday

tweet_list = api.search(q='#givingTuesday')

See [Twitter's search documentation](https://dev.twitter.com/rest/public/search) for examples of query operators.  Pay attention to how to URL encode your query.  [This w3schools page](https://www.w3schools.com/tags/ref_urlencode.asp) has information on what `%23` and other URL encodings mean.
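
To see what the encoding looks like, here is a small sketch using `urllib.parse` from the standard library. It is meant only as an aid for reading encoded queries in the documentation. The notebook passes `'#givingTuesday'` to `api.search` unencoded, which suggests the library handles the encoding on your behalf.

```python
from urllib.parse import quote, unquote

query = "#givingTuesday"

# quote() percent-encodes characters that are not safe in a URL; '#' becomes %23.
encoded = quote(query)
print(encoded)

# unquote() reverses the encoding and restores the original query.
print(unquote(encoded))
```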

The search returns a `Status` object for each tweet, full of data such as the language, the identity of the poster, etc.

In [11]:
tweet_list[0]

Status(_api=<tweepy.api.API object at 0x7fb8343fbb90>, _json={'created_at': 'Mon Dec 07 03:13:47 +0000 2020', 'id': 1335784542360186890, 'id_str': '1335784542360186890', 'text': 'ICYMI - Chairman @GrassleyPress showing some love to Iowa charities for #GivingTuesday! Also works in a plug for th… https://t.co/XLZdwOeKm4', 'truncated': True, 'entities': {'hashtags': [{'text': 'GivingTuesday', 'indices': [72, 86]}], 'symbols': [], 'user_mentions': [{'screen_name': 'GrassleyPress', 'name': 'Sen. Grassley Press', 'id': 294126378, 'id_str': '294126378', 'indices': [17, 31]}], 'urls': [{'url': 'https://t.co/XLZdwOeKm4', 'expanded_url': 'https://twitter.com/i/web/status/1335784542360186890', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}]}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_i

In [12]:
#We can use the built-in dir function to view a list of the attributes of each tweet
dir(tweet_list[0])

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_api',
 '_json',
 'author',
 'contributors',
 'coordinates',
 'created_at',
 'destroy',
 'entities',
 'favorite',
 'favorite_count',
 'favorited',
 'geo',
 'id',
 'id_str',
 'in_reply_to_screen_name',
 'in_reply_to_status_id',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id',
 'in_reply_to_user_id_str',
 'is_quote_status',
 'lang',
 'metadata',
 'parse',
 'parse_list',
 'place',
 'possibly_sensitive',
 'retweet',
 'retweet_count',
 'retweeted',
 'retweets',
 'source',
 'source_url',
 'text',
 'truncated',
 'user']
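
Most of that listing is Python's internal double-underscore machinery. A short filter leaves only the names of interest. The sketch below uses a `SimpleNamespace` as a stand-in for a `Status` object so that it runs without a Twitter connection; the attribute values are placeholders.

```python
from types import SimpleNamespace

# Stand-in for a tweepy Status object, with a few attribute names from the listing above.
fake_status = SimpleNamespace(id=1335784542360186890, lang="en", retweet_count=0, _json={})

# Keep only names that do not start with an underscore.
public_names = [name for name in dir(fake_status) if not name.startswith("_")]
print(public_names)
```

With a real search result, the same comprehension applied to `tweet_list[0]` should give the list above starting at `'author'`.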

In [13]:
#Let's display the text of each tweet we found.
[tweet.text for tweet in tweet_list]

['ICYMI - Chairman @GrassleyPress showing some love to Iowa charities for #GivingTuesday! Also works in a plug for th… https://t.co/XLZdwOeKm4',
 'Incredibly proud of our work on #GivingTuesday. What an undertaking, but so worth it. Beyond raising critical $ for… https://t.co/BAFZl601sJ',
 'RT @4THEGWORLS: This #GivingTuesday and Christmas holiday, consider giving the gift of safe housing and affirmative surgeries. \u2063\n\nWe are tr…',
 'RT @DrBiden: During the most challenging moments, small acts of kindness can often go the furthest, and the gift of your time can often be…',
 'The average survival rate for malignant brain tumor patients is only 36%. Support research in the search for a cure… https://t.co/6EfX3seOQ8',
 'Thank you for all your donations from last week for Giving Tuesday! We are so grateful. #givingtuesday #ecuador… https://t.co/OBnAK0M1wn',
 'RT @RealPaulWalker: 7 years later we remember #PaulWalker as the\xa0gracious and humble man we all love.\n\nWith tomorrow bein

By default, the REST API returns 15 tweets per search.  We can get up to 100 per request by using the `count` argument.

In [14]:
tweet_list = api.search(q='#givingTuesday', count = 100)
len(tweet_list)

100

In [15]:
[tweet.text for tweet in tweet_list]

["RT @FairDefense: Why support TFDP this #GivingTuesday? Because poverty is not a crime. \n\nCheck out this video about TFDP's work to mitigate…",
 'ICYMI - Chairman @GrassleyPress showing some love to Iowa charities for #GivingTuesday! Also works in a plug for th… https://t.co/XLZdwOeKm4',
 'Incredibly proud of our work on #GivingTuesday. What an undertaking, but so worth it. Beyond raising critical $ for… https://t.co/BAFZl601sJ',
 'RT @4THEGWORLS: This #GivingTuesday and Christmas holiday, consider giving the gift of safe housing and affirmative surgeries. \u2063\n\nWe are tr…',
 'RT @DrBiden: During the most challenging moments, small acts of kindness can often go the furthest, and the gift of your time can often be…',
 'The average survival rate for malignant brain tumor patients is only 36%. Support research in the search for a cure… https://t.co/6EfX3seOQ8',
 'Thank you for all your donations from last week for Giving Tuesday! We are so grateful. #givingtuesday #ecuador… https:/
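
Many of these results are retweets. If you only want original tweets, a simple heuristic is to drop texts that begin with `RT @`. The sketch below applies it to a few texts copied from the output above; the function name `looks_like_retweet` is made up for illustration. With real `Status` objects, checking `hasattr(tweet, 'retweeted_status')` is likely a more reliable test, though that is an assumption to verify against your own results.

```python
# A few texts copied from the output above.
texts = [
    "RT @DrBiden: During the most challenging moments, small acts of kindness can often go the furthest, and the gift of your time can often be…",
    "Incredibly proud of our work on #GivingTuesday. What an undertaking, but so worth it. Beyond raising critical $ for… https://t.co/BAFZl601sJ",
    "The average survival rate for malignant brain tumor patients is only 36%. Support research in the search for a cure… https://t.co/6EfX3seOQ8",
]

def looks_like_retweet(text):
    """Heuristic: classic retweets start with 'RT @'."""
    return text.startswith("RT @")

# Keep only the texts that do not look like retweets.
originals = [t for t in texts if not looks_like_retweet(t)]
print(len(originals), "of", len(texts), "are not retweets")
```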

If we want more than 100 tweets, we can use a `while` loop.  The `max_id` argument restricts the search to tweets with an ID at or below a given value, so passing one less than the ID of the oldest tweet we've seen so far collects only older tweets.

The `try/except/else` structure lets us fail gracefully in case the API search returns an error (e.g., if we run up against Twitter's rate limits).

In [16]:
num_needed = 200
tweet_list = []
last_id = -1 # id of last tweet seen
while len(tweet_list) < num_needed:
    try:
        new_tweets = api.search(q = '#givingTuesday', count = 100, max_id = str(last_id - 1), 
                               lang = 'en', tweet_mode = 'extended')
    except tweepy.TweepError as e:
        print("Error", e)
        break
    else:
        if not new_tweets:
            print("Could not find any more tweets!")
            break
        tweet_list.extend(new_tweets)
        last_id = new_tweets[-1].id

len(tweet_list)

200
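
The paging pattern in that loop can be tried offline. The sketch below swaps the Twitter search for a hypothetical `fake_search` function over a made-up list of IDs, so the IDs and page sizes are illustrative only. It shows why one is subtracted from the oldest ID: `max_id` is inclusive, so without the subtraction each page would repeat the last tweet of the previous one.

```python
# Fake "timeline" of tweet ids, newest first, standing in for Twitter's index.
ALL_IDS = list(range(1000, 0, -1))

def fake_search(count, max_id=None):
    """Return up to `count` ids that are <= max_id, newest first (hypothetical helper)."""
    eligible = [i for i in ALL_IDS if max_id is None or i <= max_id]
    return eligible[:count]

num_needed = 250
collected = []
max_id = None  # no upper bound on the first request
while len(collected) < num_needed:
    page = fake_search(count=100, max_id=max_id)
    if not page:
        break  # nothing older is available
    collected.extend(page)
    # Ask next time for ids strictly older than the oldest one seen so far.
    max_id = page[-1] - 1

# The two numbers match when no id was collected twice.
print(len(collected), len(set(collected)))
```

Because whole pages are appended, the loop can overshoot `num_needed`; slicing with `collected[:num_needed]` trims the extra.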

Note that the free REST API restricts both how many tweets you can retrieve and how far back you can search: you may not be able to retrieve tweets that are more than a week old.  Pay attention to this restriction as you approach your final project topic!
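
The loop above requests `tweet_mode = 'extended'`. In that mode the untruncated text is stored in a `full_text` attribute rather than `text`, and for retweets the complete text sits on the nested `retweeted_status` object. The helper below is a hedged sketch of how one might read the text in either mode. The function name `tweet_text` is hypothetical, and the `SimpleNamespace` objects are stand-ins with placeholder text, not real tweets.

```python
from types import SimpleNamespace

def tweet_text(status):
    """Return the longest text available on a Status-like object."""
    # For retweets, the original tweet carries the untruncated text.
    original = getattr(status, "retweeted_status", None)
    if original is not None:
        return tweet_text(original)
    # Extended mode stores the text in full_text; the default mode uses text.
    return getattr(status, "full_text", None) or getattr(status, "text", "")

# Stand-ins for tweepy Status objects (placeholder text, not real tweets).
compat = SimpleNamespace(text="short text…")
extended = SimpleNamespace(full_text="the complete text of the tweet")
retweet = SimpleNamespace(full_text="RT @someone: the complete te…", retweeted_status=extended)

for status in (compat, extended, retweet):
    print(tweet_text(status))
```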

### Method 2. The Streaming API

The Streaming API allows us to monitor Twitter in real time, grabbing tweets as they are made.

The ```tweepy``` package includes a class called ```StreamListener``` which monitors Twitter for us.  However, by default StreamListener does nothing with the tweets it collects.

In this demonstration, we'll modify ```StreamListener``` to make a class that prints each tweet we're interested in to the screen.  Later, you may wish to create your own class which saves information from tweets to a file.

In [20]:
#We create a subclass of tweepy.StreamListener to add a response to on_status

class PrintingStreamListener(tweepy.StreamListener):

    def on_status(self, status):
        print(status.text)
        
    #disconnect the stream if we receive an error message indicating we are overloading Twitter
    
    def on_error(self, status_code):
        if status_code == 420:
            #returning False in on_error disconnects the stream
            return False
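
As a starting point for a class that saves tweets to a file, here is a minimal sketch of the saving logic on its own. The class name `SavingListenerSketch`, the `limit` parameter, and the JSON-lines file layout are all assumptions made for illustration. The class deliberately does not inherit from `tweepy.StreamListener`, so it can be run and checked without a connection. To use it with a live stream you would inherit from `tweepy.StreamListener` as `PrintingStreamListener` does, and call `super().__init__()` in the constructor.

```python
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace

class SavingListenerSketch:
    """Append each tweet's raw JSON to a file, one tweet per line."""

    def __init__(self, out_path, limit=100):
        self.out_path = Path(out_path)
        self.limit = limit      # stop after this many tweets
        self.num_saved = 0

    def on_status(self, status):
        with self.out_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(status._json) + "\n")
        self.num_saved += 1
        # Returning False asks the stream to disconnect (assumed tweepy 3.x behaviour).
        return self.num_saved < self.limit

# Offline check with stand-in statuses instead of a live stream.
out_file = Path(tempfile.mkdtemp()) / "tweets.jsonl"
listener = SavingListenerSketch(out_file, limit=2)
results = [listener.on_status(SimpleNamespace(_json={"id": i, "text": f"tweet {i}"}))
           for i in range(2)]
saved = [json.loads(line) for line in out_file.read_text(encoding="utf-8").splitlines()]
print(results, len(saved))
```

Writing one JSON object per line keeps the file appendable, and each line can be parsed independently later.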
 

Once we have created our subclass, we can set up our own Twitter stream.

In [21]:
#We create an instance of our new PrintingStreamListener class and attach it to an authenticated stream

my_stream_listener = PrintingStreamListener()
my_stream = tweepy.Stream(auth = api.auth, listener=my_stream_listener)

We'll use the ```track``` parameter to look for tweets with a specific keyword.  You can read more about constructing searches with ```track``` in the <a href="https://dev.twitter.com/streaming/overview/request-parameters#track">Twitter streaming API documentation</a>.

In [19]:
# Now, we're ready to start streaming!  We'll look for new tweets which use the keyword "givingTuesday".
# You can pause the display of tweets by interrupting the Python kernel (use the menu bar at the top)

my_stream.filter(track=['givingTuesday'])

RT @simangeo: Esto pasa en Yemen auspiciado por Arabia Saudita apoyada y armada por EEUU, algunos gobiernos de Europa e Israel.
Pero como l…
RT @ladygaga: Join me and give to @BTWFoundation for #GivingTuesday. We’re working hard to support youth mental wellness and collaborate wi…
RT @simangeo: Esto pasa en Yemen auspiciado por Arabia Saudita apoyada y armada por EEUU, algunos gobiernos de Europa e Israel.
Pero como l…
We took the “opportunity “ to help many who appreciate more than most any help they can get #rotaryclub… https://t.co/KvbSBzjRL0
RT @SteveKornacki: As I explained to the Gap, I already have three pairs of pants, so I'm all set for next decade or so. I think this plan…
RT @FairDefense: Why support TFDP this #GivingTuesday? Because poverty is not a crime. 

Check out this video about TFDP's work to mitigate…
RT @tparamronnoCO: Today is #GivingTuesday - a day to give back to people around the world!
https://t.co/TBvTnSkobN https://t.co/nRgU3QRRqP
@MeekMill @MeekMill We've g

A great cause
RT @mjscopel: A great cause https://t.co/5QmdtMvxqy
RT @SteveKornacki: As I explained to the Gap, I already have three pairs of pants, so I'm all set for next decade or so. I think this plan…
@iamcardib https://t.co/IW7iOmkbTb
@iamcardib challenge accepted:)
I’m the MSW intern for a Korean-American 501C3 i… https://t.co/n9zhy08T6A
RT @MSNBC: .@SteveKornacki’s khakis are more than just an internet sensation. They are doing some good!

As a part of #GivingTuesday, Steve…
RT @mssociety: Tomorrow, Dec 1, the world will come together to make a difference. #GivingTuesday is a global day of giving. If you can, pl…
RT @SteveKornacki: As I explained to the Gap, I already have three pairs of pants, so I'm all set for next decade or so. I think this plan…
RT @nokidhungry: Citi is keeping the giving going! After matching donations up to $200K on #GivingTuesday, @Citibank will donate 10 cents e…
RT @heartful_ness: #GivingTuesday Your donations have brought a smile 😊  to the faces of f

@iamcardib ❗️
RT @guapojot: Please retweet this original tweet, and tag @iamcardib
@iamcardib we donated over 10.5k to @KhalsaAidUSA @Khalsa_Aid on GivingTuesday. We have receipts on a donations, th… https://t.co/3HHa2afc3p
RT @ElodieParre: @l_arrondi sur salaire ou comment l'entreprise donne la possibilité a ses collaborateurs de s'engager. Une initiative lanc…
@iamcardib ‼‼
RT @guapojot: @iamcardib on #GivingTuesday we collected over 10.5k in donations all going towards humanitarian efforts at @KhalsaAidUSA @Kh…
RT @AtacOrg: ATAC is working hard each day to improve the lives of every Arizonan. We understand that this can be a difficult time of year…
RT @guapojot: @iamcardib on #GivingTuesday we collected over 10.5k in donations all going towards humanitarian efforts at @KhalsaAidUSA @Kh…
RT @guapojot: @iamcardib on #GivingTuesday we collected over 10.5k in donations all going towards humanitarian efforts at @KhalsaAidUSA @Kh…
We would like to express our gratitude to everyone who don

ReadTimeoutError: HTTPSConnectionPool(host='stream.twitter.com', port=443): Read timed out.

In [None]:
# Even if you pause the display of tweets, your stream is still connected to Twitter!
# To disconnect (for example, if you want to change which words you are searching for), 
# use the disconnect() function.

my_stream.disconnect()

In [22]:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

In [24]:
pip install tweepy

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Import the necessary package to process data in JSON format
try:
    import json
except ImportError:
    import simplejson as json

# Import the tweepy library
import tweepy

# Variables that contain the user credentials to access the Twitter API
ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
ACCESS_SECRET = 'YOUR ACCESS TOKEN SECRET'
CONSUMER_KEY = 'YOUR API KEY'
CONSUMER_SECRET = 'ENTER YOUR API SECRET'

# Setup tweepy to authenticate with Twitter credentials:

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Create the api object to connect to Twitter with your credentials
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)
#---------------------------------------------------------------------------------------------------------------------
# wait_on_rate_limit=True makes the api automatically wait for rate limits to replenish
# wait_on_rate_limit_notify=True makes the api print a notification when Tweepy is waiting
# for rate limits to replenish
#---------------------------------------------------------------------------------------------------------------------


#---------------------------------------------------------------------------------------------------------------------
# The following loop will print most recent statuses, including retweets, posted by the authenticating
# user and that user’s friends. 
# This is the equivalent of /timeline/home on the Web.
#---------------------------------------------------------------------------------------------------------------------

for status in tweepy.Cursor(api.home_timeline).items(200):
    print(status._json)

#---------------------------------------------------------------------------------------------------------------------
# The Twitter API uses pagination for iterating through timelines, user lists, direct messages, etc.
# To make pagination easier, Tweepy provides the Cursor object.
#---------------------------------------------------------------------------------------------------------------------

---

## Suggestions for skills to learn

* Collect 1000 tweets matching a search, or all available in the current time window, whichever comes first.  (The number 1000 is arbitrary.)
* Extract just the fields you are most interested in from a search, and create a Pandas data frame
* Follow the graph of followers from a specific Twitter user
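
As a possible first step toward the data frame suggestion, the sketch below flattens tweet dictionaries into rows holding only a few fields. The sample dictionary is an abbreviated copy of the `_json` payload shown earlier for `tweet_list[0]`. The function name `to_row` and the choice of fields are illustrative assumptions.

```python
from datetime import datetime

# Abbreviated copy of the `_json` payload shown earlier for tweet_list[0].
raw_tweets = [
    {
        "created_at": "Mon Dec 07 03:13:47 +0000 2020",
        "id": 1335784542360186890,
        "text": "ICYMI - Chairman @GrassleyPress showing some love to Iowa charities for #GivingTuesday! Also works in a plug for th… https://t.co/XLZdwOeKm4",
        "entities": {"hashtags": [{"text": "GivingTuesday", "indices": [72, 86]}]},
    },
]

def to_row(tweet):
    """Flatten one tweet dictionary into the handful of fields we care about."""
    return {
        "id": tweet["id"],
        # Twitter's timestamp format, parsed into a timezone-aware datetime.
        "created_at": datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"),
        "text": tweet["text"],
        "hashtags": [h["text"] for h in tweet["entities"]["hashtags"]],
    }

rows = [to_row(t) for t in raw_tweets]
print(rows[0]["created_at"].isoformat(), rows[0]["hashtags"])
```

With live results, `raw_tweets` could be built as `[t._json for t in tweet_list]`, and a list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)`.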

---

## Useful resources and links

* [the structure of the Status object of Tweepy](https://gist.github.com/dev-techmoe/ef676cdd03ac47ac503e856282077bf2)
* [Tweet Data Dictionary](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)
* [Standard Operators](https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators) -- premium operators cost money.
* [Twitter operators by product](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/operators-by-product) -- by product they mean *paid access level*
* [How to use Twitter’s Search REST API most effectively](https://www.karambelkar.info/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)
* [Collecting Tweets with Tweepy](http://www.dealingdata.net/2016/07/23/PoGo-Series-Tweepy/)
