<div align="center">
    <h1><a href="index.ipynb">Knowledge Discovery in Digital Humanities</a></h1>
</div>

<div align="center">
    <h2>Class 10. Mining Twitter</h2>
    <img src="img/twitter-logo.png" width="300">
</div>

###Table of contents

- [Why Twitter?](#Why-Twitter?)
- [Connecting to Twitter](#Connecting-to-Twitter)
- [Trending topics](#Trending-topics)
- [Timelines](#Timelines)
- [Searching for tweets](#Searching-for-tweets)
- [Extracting information from tweets](#Extracting-information-from-tweets)

###Why Twitter?

[Twitter](http://twitter.com/) is a microblogging service that allows people to communicate with 140-character messages (called *tweets*).

- Tweets reflect people's thoughts in near real time

Twitter's *following* system connects people and creates networks. Its asymmetric model allows users to follow any other user even if there is no reciprocation. Other social media, such as Facebook and LinkedIn, require the mutual acceptance of a connection between users (which usually implies some kind of real-world connection).

- Twitter's asymmetric *following* model allows people to keep up with their interests

Interest graphs are a way of modeling connections between people and their interests. They can be mined to measure correlations between users and interests and to make recommendations, ranging from whom to follow on Twitter, to what to purchase online, to whom to date.

- Mining Twitter provides a way to discover people's opinions and interests 

###Connecting to Twitter

1. Create an app on Twitter
    1. Go to [https://apps.twitter.com/](https://apps.twitter.com/)
    2. Log in with your user account
    3. Click on Create new app
    4. Fill in the form, accept the agreement and click on Create your Twitter application
    5. Go to Keys and Access Tokens tab
    6. Scroll down and click on Create my access token
    7. Create a script named `credentials.py` (do not share it with anyone else) that contains this code:
```
TW_CONSUMER_KEY = 'Consumer Key (API Key)'
TW_CONSUMER_SECRET = 'Consumer Secret (API Secret)'
TW_ACCESS_TOKEN = 'Access Token'
TW_ACCESS_TOKEN_SECRET = 'Access Token Secret'
```
2. Authorize your application to access Twitter

In [1]:
import credentials
import tweepy

CONSUMER_KEY = credentials.TW_CONSUMER_KEY
CONSUMER_SECRET = credentials.TW_CONSUMER_SECRET
ACCESS_TOKEN = credentials.TW_ACCESS_TOKEN
ACCESS_TOKEN_SECRET = credentials.TW_ACCESS_TOKEN_SECRET

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

twitter_api = tweepy.API(auth)

In [2]:
# Auxiliary function to print JSON in a human-readable format
import json

def print_friendly_json(js):
    if isinstance(js, tweepy.models.Status):
        js = js.__dict__['_json']
    print json.dumps(js, indent=2)
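
The helper above relies on Python 2's `print` statement. A minimal sketch of a variant that also runs on Python 3 follows. It assumes only that the object keeps its raw payload in a `_json` attribute, as `tweepy.models.Status` does in the cell above. The names `to_friendly_json` and `FakeStatus` are hypothetical and exist only for illustration.

```python
import json


def to_friendly_json(obj):
    """Return obj as an indented JSON string."""
    # Assumption: tweepy-style models keep the raw API payload in `_json`;
    # plain dicts and lists have no such attribute and are used as they are.
    payload = getattr(obj, '_json', obj)
    return json.dumps(payload, indent=2, sort_keys=True)


class FakeStatus(object):
    """Hypothetical stand-in for tweepy.models.Status (no network needed)."""
    def __init__(self, payload):
        self._json = payload


status = FakeStatus({'id': 585638502106320897, 'retweet_count': 499})
print(to_friendly_json(status))
print(to_friendly_json([1, 2, 3]))
```

Returning the string instead of printing it directly keeps the helper easy to reuse, for example to write the JSON to a file.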

###Trending topics

Trending topics are topics that are popular *now*. The location is specified by its WOEID ([Yahoo! GeoPlanet](https://developer.yahoo.com/geo/geoplanet/)'s Where On Earth ID).
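
Every trend in the responses shown in this section has a readable `name` and a percent-encoded `query` string. A small sketch, using only the standard library and a few `query` values copied by hand from those outputs, shows how to turn the encoded strings back into readable text. The variable names are illustrative.

```python
try:
    from urllib.parse import unquote_plus  # Python 3
except ImportError:
    from urllib import unquote_plus        # Python 2

# `query` values copied from the trend outputs in this section
queries = ['%23FinalBBB15', '%22Paula+Fernandes%22', '%23WalterScott']

# %23 decodes to '#', %22 to a double quote, and '+' to a space
decoded = [unquote_plus(q) for q in queries]
for q in decoded:
    print(q)
```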

####Global

In [3]:
WORLD_WOEID = 1 # Worldwide
global_trends = twitter_api.trends_place(WORLD_WOEID) 
print_friendly_json(global_trends)

[
  {
    "created_at": "2015-04-08T02:57:11Z", 
    "trends": [
      {
        "url": "http://twitter.com/search?q=%23FinalBBB15", 
        "query": "%23FinalBBB15", 
        "name": "#FinalBBB15", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=%22Paula+Fernandes%22", 
        "query": "%22Paula+Fernandes%22", 
        "name": "Paula Fernandes", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=%23Telefe25A%C3%B1os", 
        "query": "%23Telefe25A%C3%B1os", 
        "name": "#Telefe25A\u00f1os", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=%23WalterScott", 
        "query": "%23WalterScott", 
        "name": "#WalterScott", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=%23FaceTimeMeNash", 
        "query": "%23FaceTimeMeNash", 
        "name": "#FaceTimeMeNash", 
        "promoted_content":

####Specific location

In [4]:
CA_WOEID = 23424775 # Canada
ca_trends = twitter_api.trends_place(CA_WOEID)
print_friendly_json(ca_trends)

[
  {
    "created_at": "2015-04-08T02:57:11Z", 
    "trends": [
      {
        "url": "http://twitter.com/search?q=%23IMFC", 
        "query": "%23IMFC", 
        "name": "#IMFC", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=%22Mark+Stone%22", 
        "query": "%22Mark+Stone%22", 
        "name": "Mark Stone", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=%23WalterScott", 
        "query": "%23WalterScott", 
        "name": "#WalterScott", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=%23FaceTimeMeNash", 
        "query": "%23FaceTimeMeNash", 
        "name": "#FaceTimeMeNash", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=%23GoJetsGo", 
        "query": "%23GoJetsGo", 
        "name": "#GoJetsGo", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/s

####Closeness

In [5]:
coordinates = (42.9837, -81.2497) # London ON
close_places = twitter_api.trends_closest(coordinates[0], coordinates[1])  
print_friendly_json(close_places)

[
  {
    "name": "Detroit", 
    "countryCode": "US", 
    "url": "http://where.yahooapis.com/v1/place/2391585", 
    "country": "United States", 
    "parentid": 23424977, 
    "placeType": {
      "code": 7, 
      "name": "Town"
    }, 
    "woeid": 2391585
  }
]


In [6]:
trends = []
for place in close_places:
    woeid = place['woeid']
    trends.append(twitter_api.trends_place(woeid))

for trend in trends:
    print_friendly_json(trend)

[
  {
    "created_at": "2015-04-08T02:57:11Z", 
    "trends": [
      {
        "url": "http://twitter.com/search?q=%23WalterScott", 
        "query": "%23WalterScott", 
        "name": "#WalterScott", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=%23BeingMaryJane", 
        "query": "%23BeingMaryJane", 
        "name": "#BeingMaryJane", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=Pittsburgh", 
        "query": "Pittsburgh", 
        "name": "Pittsburgh", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=Penguins", 
        "query": "Penguins", 
        "name": "Penguins", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.com/search?q=%23FindingCarter", 
        "query": "%23FindingCarter", 
        "name": "#FindingCarter", 
        "promoted_content": null
      }, 
      {
        "url": "http://twitter.

####Exercise 1
Which Canadian trending topics are also worldwide trending topics?
1. Calculate the set of global trending topics (use list comprehensions)
2. Calculate the set of Canadian trending topics (use list comprehensions)
3. Calculate the intersection of both sets

In [7]:
global_trends_list = [trend['name'] for trend in global_trends[0]['trends']]
global_trends_set = set(global_trends_list)

ca_trends_list = [trend['name'] for trend in ca_trends[0]['trends']]
ca_trends_set = set(ca_trends_list)

common_trends_set = ca_trends_set.intersection(global_trends_set)
print common_trends_set

set([u'#WalterScott', u'#FaceTimeMeNash'])
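
The same intersection can also be written with set comprehensions and the `&` operator. The sketch below is self-contained: it uses stand-in responses that contain only some of the trend names visible in the truncated outputs above, not the full API responses.

```python
# Stand-in responses with the same shape as the API output: a list holding
# one dictionary whose 'trends' key is a list of trend dictionaries.
sample_global_trends = [{'trends': [
    {'name': '#FinalBBB15'}, {'name': 'Paula Fernandes'},
    {'name': '#WalterScott'}, {'name': '#FaceTimeMeNash'},
]}]
sample_ca_trends = [{'trends': [
    {'name': '#IMFC'}, {'name': 'Mark Stone'}, {'name': '#WalterScott'},
    {'name': '#FaceTimeMeNash'}, {'name': '#GoJetsGo'},
]}]

global_names = {trend['name'] for trend in sample_global_trends[0]['trends']}
ca_names = {trend['name'] for trend in sample_ca_trends[0]['trends']}

common_names = ca_names & global_names    # trending in both places
only_canadian = ca_names - global_names   # trending in Canada only
print(sorted(common_names))
print(sorted(only_canadian))
```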


###Timelines

A timeline is a collection of tweets ordered from the most recent to the oldest.

####Home timeline
It is your own timeline.

In [8]:
home_timeline = twitter_api.home_timeline()

Most recent tweet or status:

In [9]:
th = home_timeline[0]
th.text

u'RT @symulation: CSDH-SCHN exec. nominations open: VP-French; Secretary; Member-at-large; Graduate Rep. 150 word bio to michaelesinatra@gmai\u2026'

####User's timeline
A specific user's timeline.

In [10]:
user_timeline = twitter_api.user_timeline('suarez_juanluis')

Most recent tweet:

In [11]:
tu = user_timeline[0]
tu.text

u'RT @FMAPFREHistoria: Libro: "Fiesta rito y pol\xedtica. Del Chile borb\xf3nico al republicano" de Jaime Valenzuela. V\xeda @Historia_UC http://t.co/\u2026'

####Mentions timeline
Tweets that contain your *@username*.

In [12]:
mentions_timeline = twitter_api.mentions_timeline()

Most recent tweet:

In [13]:
tm = mentions_timeline[0]
tm.text

u'@mavillard @nandi_d @n_mejiacaldas buen viaje! Hope to see you all again soon...#dhsi2015?'

####Retweets timeline
Your tweets that have been retweeted by others.

In [14]:
retweets_timeline = twitter_api.retweets_of_me()

Most recent tweet:

In [15]:
tr = retweets_timeline[0]
tr.text

u'@versae and beauty at #dh2014 http://t.co/zIqJHTepVj'

####Exercise 2
Get the screen names of the users who have retweeted your tweets.

In [16]:
set([t.user.screen_name for tr in retweets_timeline for t in tr.retweets()])

{u'antimony27',
 u'barrywellman',
 u'dhgermany',
 u'jamescosullivan',
 u'jborrego',
 u'marybethstart',
 u'mirian_se',
 u'nandodlrp',
 u'quiohqui',
 u'suarez_juanluis',
 u'versae',
 u'vlmavillard'}

###Searching for tweets

- Twitter search uses a *cursor* approach instead of *pagination*
- The reason is the highly dynamic state of Twitter resources: new tweets arrive constantly, so the content of a fixed page number would shift between requests (more information about *cursor vs pagination* [here](https://dev.twitter.com/rest/public/timelines))
- Search results contain a field named `next_results`, a string that provides the basis of a subsequent query
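
To see why a cursor suits a fast-changing timeline, consider a toy simulation. It is only a sketch: `fetch_page` and `fetch_up_to` are hypothetical stand-ins for a server, and tweets are reduced to integer IDs where a larger ID means a newer tweet.

```python
def fetch_page(timeline, page, count):
    """Page-number pagination: slice by position in the current timeline."""
    start = (page - 1) * count
    return timeline[start:start + count]


def fetch_up_to(timeline, count, max_id=None):
    """Cursor-style request: the newest `count` IDs not greater than max_id."""
    eligible = [i for i in timeline if max_id is None or i <= max_id]
    return eligible[:count]


timeline = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]  # newest first

# The first request is the same in both approaches
first_batch = fetch_page(timeline, 1, 3)

# Two new tweets arrive before the second request
timeline = [12, 11] + timeline

# Page 2 is computed by position, so tweets already seen show up again
second_page = fetch_page(timeline, 2, 3)
repeated = sorted(set(first_batch) & set(second_page))

# The cursor asks only for tweets older than the oldest one already seen
second_batch = fetch_up_to(timeline, 3, max_id=min(first_batch) - 1)

print(second_page)
print(repeated)
print(second_batch)
```

With page numbers, two of the three tweets in the second request were already in the first batch. With the cursor, the second batch continues exactly where the first one stopped.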

Search for a trending topic, or anything else for that matter (variable `q`).

In [17]:
q = 'Wisconsin' 
count = 100
search_results = twitter_api.search(q=q, count=count)
tweets = search_results

# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    next_results = search_results.next_results
    # No more results when next_results is None
    # If next_results exists, search again
    # next_results has the following form:
    # ?max_id=313519052523986943&q=NCAA&include_entities=1
    # Unpacking the values in a dictionary into keyword arguments
    # for the next search
    if next_results:
        kwargs = dict([ kw.split('=') for kw in next_results[1:].split("&") ])
        search_results = twitter_api.search(**kwargs)
        tweets += search_results

print "Total tweets retrieved:", len(tweets)

Total tweets retrieved: 600
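
The `split`-based parsing above leaves the values percent-encoded, so a query such as a hashtag may end up encoded a second time when it is passed back to the search call. A hedged alternative is to let the standard library parse the query string. `parse_next_results` is a hypothetical helper name, and the input is the example string from the code comments above.

```python
try:
    from urllib.parse import parse_qsl  # Python 3
except ImportError:
    from urlparse import parse_qsl      # Python 2


def parse_next_results(next_results):
    """Turn '?max_id=...&q=...' into a dictionary of keyword arguments."""
    # Drop the leading '?' and decode percent-escapes such as %23 -> '#'
    return dict(parse_qsl(next_results.lstrip('?')))


kwargs = parse_next_results('?max_id=313519052523986943&q=NCAA&include_entities=1')
print(kwargs)
```

The resulting dictionary can be unpacked with `**kwargs` in the same way as in the loop above.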


In [18]:
print_friendly_json(tweets[0])

{
  "contributors": null, 
  "truncated": false, 
  "text": "RT @NOTSportsCenter: The Duke-Wisconsin men's title game averaged 28 million viewers. \n\nThe UConn-Notre Dame women's title game averaged 28\u2026", 
  "in_reply_to_status_id": null, 
  "id": 585638502106320897, 
  "favorite_count": 0, 
  "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", 
  "retweeted": false, 
  "coordinates": null, 
  "entities": {
    "symbols": [], 
    "user_mentions": [
      {
        "id": 237147973, 
        "indices": [
          3, 
          19
        ], 
        "id_str": "237147973", 
        "screen_name": "NOTSportsCenter", 
        "name": "NOT SportsCenter"
      }
    ], 
    "hashtags": [], 
    "urls": []
  }, 
  "in_reply_to_screen_name": null, 
  "in_reply_to_user_id": null, 
  "retweet_count": 499, 
  "id_str": "585638502106320897", 
  "favorited": false, 
  "retweeted_status": {
    "contributors": null, 
    "truncated": false, 
   

###Extracting information from tweets

Let's have the previous tweet assigned to the variable `t`:

In [19]:
t = tweets[0]
type(t)

tweepy.models.Status

`t` is of type `tweepy.models.Status`. It is possible to explore its fields by typing `t.` + TAB. A list of fields will appear.

<div align="center">
    <figure>
        <img src="img/status_fields.png">
        <figcaption>Fields of `tweepy.models.Status`</figcaption>
    </figure>
</div>

####Tweet information
- Identifier:

In [20]:
t.id

585638502106320897

- Text:

In [21]:
t.text

u"RT @NOTSportsCenter: The Duke-Wisconsin men's title game averaged 28 million viewers. \n\nThe UConn-Notre Dame women's title game averaged 28\u2026"

- Author:

In [22]:
t.author.screen_name

u'yacone64'

- Counts. `favorite_count` and `retweet_count` give clues about a tweet's *interestingness*:

In [23]:
t.favorite_count

0

In [24]:
t.retweet_count

499

- If a tweet has been retweeted, the `retweeted_status` field provides significant detail about the *original tweet* itself and its author:

In [25]:
print_friendly_json(t.retweeted_status)

{
  "contributors": null, 
  "truncated": false, 
  "text": "The Duke-Wisconsin men's title game averaged 28 million viewers. \n\nThe UConn-Notre Dame women's title game averaged 28 viewers.", 
  "in_reply_to_status_id": null, 
  "id": 585636324083904512, 
  "favorite_count": 629, 
  "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", 
  "retweeted": false, 
  "coordinates": null, 
  "entities": {
    "symbols": [], 
    "user_mentions": [], 
    "hashtags": [], 
    "urls": []
  }, 
  "in_reply_to_screen_name": null, 
  "in_reply_to_user_id": null, 
  "retweet_count": 499, 
  "id_str": "585636324083904512", 
  "favorited": false, 
  "user": {
    "follow_request_sent": false, 
    "profile_use_background_image": true, 
    "profile_text_color": "333333", 
    "default_profile_image": false, 
    "id": 237147973, 
    "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/378800000067456322/7f974df590562be1

Only original tweets can be retweeted. If a user retweets a retweet, he/she is actually retweeting the original tweet.
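
Because of this, it is often convenient to normalize every status to its original tweet before analysing it. The following is a minimal sketch, assuming that the `retweeted_status` attribute is present only on retweets. `FakeStatus` and `original_tweet` are hypothetical names, and the texts are placeholders rather than real tweets.

```python
class FakeStatus(object):
    """Hypothetical stand-in for tweepy.models.Status."""
    def __init__(self, text, retweeted_status=None):
        self.text = text
        # Assumption: only retweets carry a `retweeted_status` attribute
        if retweeted_status is not None:
            self.retweeted_status = retweeted_status


def original_tweet(status):
    """Return the original tweet of a retweet, or the status itself."""
    return getattr(status, 'retweeted_status', status)


original = FakeStatus('placeholder text')
retweet = FakeStatus('RT @someone: placeholder text', original)

print(original_tweet(retweet).text)
print(original_tweet(original).text)
```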

- Retweeted by the authenticated user:

In [26]:
t.retweeted

False

####Tweet entities
Entities provide metadata and additional contextual information about content posted on Twitter.

Example:

In [27]:
tweets[1].entities

{u'hashtags': [],
 u'symbols': [],
 u'urls': [],
 u'user_mentions': [{u'id': 237147973,
   u'id_str': u'237147973',
   u'indices': [3, 19],
   u'name': u'NOT SportsCenter',
   u'screen_name': u'NOTSportsCenter'}]}
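
The `indices` pair marks where the entity sits in the tweet text: the start offset and the offset just past the end. A minimal sketch, with the beginning of the tweet text and the mention shown above copied by hand (no API call is made):

```python
# Beginning of the tweet text and its single user mention, copied from above
text = "RT @NOTSportsCenter: The Duke-Wisconsin men's title game"
mention = {'screen_name': 'NOTSportsCenter', 'indices': [3, 19]}

# Slicing the text with the two offsets recovers the mention as written
start, end = mention['indices']
fragment = text[start:end]
print(fragment)
```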

####Extracting text, screen names, and hashtags from tweets

In [28]:
texts = [t.text for t in tweets]
screen_names = [user_mention['screen_name']
                    for t in tweets
                         for user_mention in t.entities['user_mentions']
               ]
hashtags = [hashtag['text'] 
               for t in tweets
                   for hashtag in t.entities['hashtags']
           ]

# Explore the first 5 items for each...
print 'Texts:'
print_friendly_json(texts[: 5])
print 'Screen names:'
print_friendly_json(screen_names[: 5])
print 'Hashtags:'
print_friendly_json(hashtags[: 5])

Texts:
[
  "RT @NOTSportsCenter: The Duke-Wisconsin men's title game averaged 28 million viewers. \n\nThe UConn-Notre Dame women's title game averaged 28\u2026", 
  "RT @NOTSportsCenter: The Duke-Wisconsin men's title game averaged 28 million viewers. \n\nThe UConn-Notre Dame women's title game averaged 28\u2026", 
  "Justice Ann Walsh Bradley wins re-election to Wisconsin Supreme Court: Bradley defeated Ro... http://t.co/uI1qAq0f3d #politics #dem #gop", 
  "RT @NOTSportsCenter: The Duke-Wisconsin men's title game averaged 28 million viewers. \n\nThe UConn-Notre Dame women's title game averaged 28\u2026", 
  "RT @itsHousePorn: Wow. This Woodwork. http://t.co/NhQtfB8vOv"
]
Screen names:
[
  "NOTSportsCenter", 
  "NOTSportsCenter", 
  "NOTSportsCenter", 
  "itsHousePorn", 
  "MeninistTweet"
]
Hashtags:
[
  "politics", 
  "dem", 
  "gop", 
  "NationalBeerDay", 
  "LunesDeLabia"
]


####Exercise 3
Given a timeline assigned to the variable `tweets`, get the list of all the words from the first 5 tweets. 

In [29]:
[w for t in tweets[: 5] for w in t.text.split()]

[u'RT',
 u'@NOTSportsCenter:',
 u'The',
 u'Duke-Wisconsin',
 u"men's",
 u'title',
 u'game',
 u'averaged',
 u'28',
 u'million',
 u'viewers.',
 u'The',
 u'UConn-Notre',
 u'Dame',
 u"women's",
 u'title',
 u'game',
 u'averaged',
 u'28\u2026',
 u'RT',
 u'@NOTSportsCenter:',
 u'The',
 u'Duke-Wisconsin',
 u"men's",
 u'title',
 u'game',
 u'averaged',
 u'28',
 u'million',
 u'viewers.',
 u'The',
 u'UConn-Notre',
 u'Dame',
 u"women's",
 u'title',
 u'game',
 u'averaged',
 u'28\u2026',
 u'Justice',
 u'Ann',
 u'Walsh',
 u'Bradley',
 u'wins',
 u're-election',
 u'to',
 u'Wisconsin',
 u'Supreme',
 u'Court:',
 u'Bradley',
 u'defeated',
 u'Ro...',
 u'http://t.co/uI1qAq0f3d',
 u'#politics',
 u'#dem',
 u'#gop',
 u'RT',
 u'@NOTSportsCenter:',
 u'The',
 u'Duke-Wisconsin',
 u"men's",
 u'title',
 u'game',
 u'averaged',
 u'28',
 u'million',
 u'viewers.',
 u'The',
 u'UConn-Notre',
 u'Dame',
 u"women's",
 u'title',
 u'game',
 u'averaged',
 u'28\u2026',
 u'RT',
 u'@itsHousePorn:',
 u'Wow.',
 u'This',
 u'Woodwork.'

####Exercise 4
Given a timeline assigned to the variable `tweets`, get the frequency distribution of the hashtags in the first 20 tweets. The function `histogram` written in [exercise 6 of class 06](class06.ipynb#Exercise-6) can be used for any iterable sequence (strings, lists, etc.).

In [30]:
def histogram(sequence):
    d = {}
    for elem in sequence:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return d

hashtags = [hashtag['text'] 
               for t in tweets[: 20]
                   for hashtag in t.entities['hashtags']
           ]

histogram(hashtags)

{u'NationalBeerDay': 1, u'dem': 1, u'gop': 1, u'politics': 1}
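
As an alternative to a hand-written `histogram`, the standard library offers `collections.Counter`, which builds the same kind of mapping and can also rank the items. The sketch below uses a made-up list of hashtags with one repetition so that the ranking is visible. It does not use the tweets retrieved above.

```python
from collections import Counter

# Made-up sample with one repeated hashtag (not taken from the API results)
sample_hashtags = ['politics', 'dem', 'gop', 'politics']

counts = Counter(sample_hashtags)   # maps each hashtag to its frequency
top = counts.most_common(1)         # list of (hashtag, frequency) pairs
print(dict(counts))
print(top)
```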