# Twitter data

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

# Twitter API Access

Twitter implements OAuth 1.0A as its standard authentication mechanism. To make requests to Twitter's API, go to https://dev.twitter.com/apps and create a sample application.

Choose any name for your application, write a description, and use `http://google.com` for the website.

Under **Key and Access Tokens**, there are four primary identifiers you'll need to note for an OAuth 1.0A workflow: 
* consumer key, 
* consumer secret, 
* access token, and 
* access token secret (click Create Access Token to generate the access token and its secret).

Note that you will need an ordinary Twitter account in order to log in, create an app, and get these credentials.

The first time you execute the notebook, fill in all four credentials in the cell below so that they are saved to the `pkl` file. After that, you can remove the secret keys from the notebook, because they will be loaded from the `pkl` file.

The `pkl` file contains sensitive information that can be used to take control of your Twitter account. **Do not share it**.

In [32]:
import pickle
import os

In [33]:
if not os.path.exists('secret_twitter_credentials.pkl'):
    Twitter={}
    Twitter['Consumer Key'] = ''
    Twitter['Consumer Secret'] = ''
    Twitter['Access Token'] = ''
    Twitter['Access Token Secret'] = ''
    with open('secret_twitter_credentials.pkl','wb') as f:
        pickle.dump(Twitter, f)
else:
    # Use a context manager so the file handle is closed after loading
    with open('secret_twitter_credentials.pkl','rb') as f:
        Twitter = pickle.load(f)
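
If the cell above runs before the credentials are filled in, it saves empty strings to the `pkl` file and later runs load them back unchanged. A minimal sketch of a sanity check is shown below; `REQUIRED_KEYS` and `missing_credentials` are hypothetical names introduced here for illustration, and the example dictionary holds placeholder values, not real credentials.

```python
# Hypothetical helper; not part of the notebook or of the twitter package.
REQUIRED_KEYS = ('Consumer Key', 'Consumer Secret',
                 'Access Token', 'Access Token Secret')

def missing_credentials(credentials):
    """Return the required keys that are absent or empty in the dict."""
    return [key for key in REQUIRED_KEYS
            if not credentials.get(key)]

# Placeholder values only, to show the behavior of the check
example = {'Consumer Key': 'abc', 'Consumer Secret': '',
           'Access Token': 'xyz'}
print(missing_credentials(example))
```

In the notebook, the same check could be applied to the `Twitter` dictionary right after it is loaded.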

Install the `twitter` package to interface with the Twitter API:

In [7]:
!pip install twitter


## Example 1. Authorizing an application to access Twitter account data

In [34]:
import twitter

auth = twitter.oauth.OAuth(Twitter['Access Token'],
                           Twitter['Access Token Secret'],
                           Twitter['Consumer Key'],
                           Twitter['Consumer Secret'])

twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a
# defined variable

print(twitter_api)

<twitter.api.Twitter object at 0x7fc617da1828>


## Example 2. Retrieving trends

Twitter identifies locations using the Yahoo! Where On Earth ID.

The Yahoo! Where On Earth ID for the entire world is 1.
See https://dev.twitter.com/docs/api/1.1/get/trends/place and
http://developer.yahoo.com/geo/geoplanet/

You can also look at the BOSS placefinder here: https://developer.yahoo.com/boss/placefinder/

In [9]:
WORLD_WOE_ID = 1
US_WOE_ID = 23424977

Look up the WOEID for [San Diego](http://woeid.rosselliot.co.nz/lookup/san%20diego%20%20ca).

You can change it to another location.

In [35]:
LOCAL_WOE_ID=2487889

# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.

world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)
local_trends = twitter_api.trends.place(_id=LOCAL_WOE_ID)
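
The comment about the underscore prefix can be illustrated with a toy model of how a keyword argument ends up either in the URL path or in the query string. This is only a sketch of the behavior described in the comment, not the `twitter` package's actual code; the function name `sketch_request_url` is hypothetical, and the base URL is assumed to be the API 1.1 endpoint.

```python
from urllib.parse import urlencode

def sketch_request_url(path, **kwargs):
    """Toy model: `id` goes into the URL path, `_id` into the query string."""
    # Assumed base URL for API version 1.1
    url = 'https://api.twitter.com/1.1/' + path

    # Special case: a plain `id` keyword is appended to the URL itself
    resource_id = kwargs.pop('id', None)
    if resource_id is not None:
        url += '/{}'.format(resource_id)

    # An underscore-prefixed `_id` is sent as the query parameter `id`
    if '_id' in kwargs:
        kwargs['id'] = kwargs.pop('_id')

    url += '.json'
    if kwargs:
        url += '?' + urlencode(kwargs)
    return url

print(sketch_request_url('trends/place', _id=1))
print(sketch_request_url('trends/place', id=1))
```

The trends endpoint expects the WOEID as a query parameter, which is why the calls above use `_id`.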

In [36]:
world_trends[:2]

[{'trends': [{'name': '#TheBachelorette',
    'url': 'http://twitter.com/search?q=%23TheBachelorette',
    'promoted_content': None,
    'query': '%23TheBachelorette',
    'tweet_volume': 104192},
   {'name': '#RAWReunion',
    'url': 'http://twitter.com/search?q=%23RAWReunion',
    'promoted_content': None,
    'query': '%23RAWReunion',
    'tweet_volume': 99231},
   {'name': '#MenTellAll',
    'url': 'http://twitter.com/search?q=%23MenTellAll',
    'promoted_content': None,
    'query': '%23MenTellAll',
    'tweet_volume': 23026},
   {'name': '#9YearsOfOneDirection',
    'url': 'http://twitter.com/search?q=%239YearsOfOneDirection',
    'promoted_content': None,
    'query': '%239YearsOfOneDirection',
    'tweet_volume': 507091},
   {'name': '#تخيلو_البنات_من_غير_مكياج',
    'url': 'http://twitter.com/search?q=%23%D8%AA%D8%AE%D9%8A%D9%84%D9%88_%D8%A7%D9%84%D8%A8%D9%86%D8%A7%D8%AA_%D9%85%D9%86_%D8%BA%D9%8A%D8%B1_%D9%85%D9%83%D9%8A%D8%A7%D8%AC',
    'promoted_content': None,
    'query'

In [12]:
trends=local_trends
print(type(trends))
print(list(trends[0].keys()))
print(trends[0]['trends'])

<class 'twitter.api.TwitterListResponse'>
['trends', 'as_of', 'created_at', 'locations']
[{'name': '#MondayMotivation', 'url': 'http://twitter.com/search?q=%23MondayMotivation', 'promoted_content': None, 'query': '%23MondayMotivation', 'tweet_volume': 148828}, {'name': '#BigLittleLies', 'url': 'http://twitter.com/search?q=%23BigLittleLies', 'promoted_content': None, 'query': '%23BigLittleLies', 'tweet_volume': 121611}, {'name': 'Tom Hanks', 'url': 'http://twitter.com/search?q=%22Tom+Hanks%22', 'promoted_content': None, 'query': '%22Tom+Hanks%22', 'tweet_volume': 33477}, {'name': 'Mueller', 'url': 'http://twitter.com/search?q=Mueller', 'promoted_content': None, 'query': 'Mueller', 'tweet_volume': 327437}, {'name': '#SDCC', 'url': 'http://twitter.com/search?q=%23SDCC', 'promoted_content': None, 'query': '%23SDCC', 'tweet_volume': 140923}, {'name': '#MTVHottest', 'url': 'http://twitter.com/search?q=%23MTVHottest', 'promoted_content': None, 'query': '%23MTVHottest', 'tweet_volume': 4333687
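
Each response is a list whose first element holds a `trends` list of dictionaries. As a rough sketch of how to work with that structure, the helper below ranks trends by `tweet_volume`. The name `trends_by_volume` is hypothetical, and the sample data is an abbreviated, hand-copied subset of the fields shown in this notebook's outputs. Note that `tweet_volume` can be `None`, so the sketch treats a missing volume as 0.

```python
# Abbreviated sample in the same shape as a trends.place response
sample_response = [{'trends': [
    {'name': '#MondayMotivation', 'tweet_volume': 148828},
    {'name': 'Tom Hanks', 'tweet_volume': 33477},
    {'name': 'John Paul Jones', 'tweet_volume': None},
    {'name': 'Mueller', 'tweet_volume': 327437},
]}]

def trends_by_volume(response):
    """Return (name, volume) pairs sorted by volume, largest first."""
    trends = response[0]['trends']
    pairs = ((t['name'], t['tweet_volume'] or 0) for t in trends)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

for name, volume in trends_by_volume(sample_response):
    print('{:20} {:>8}'.format(name, volume))
```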

## Example 3. Displaying API responses as pretty-printed JSON

In [37]:
import json

print(json.dumps(us_trends[:2], indent=1))

[
 {
  "trends": [
   {
    "name": "#TheBachelorette",
    "url": "http://twitter.com/search?q=%23TheBachelorette",
    "promoted_content": null,
    "query": "%23TheBachelorette",
    "tweet_volume": 104192
   },
   {
    "name": "#RAWReunion",
    "url": "http://twitter.com/search?q=%23RAWReunion",
    "promoted_content": null,
    "query": "%23RAWReunion",
    "tweet_volume": 99231
   },
   {
    "name": "John Paul Jones",
    "url": "http://twitter.com/search?q=%22John+Paul+Jones%22",
    "promoted_content": null,
    "query": "%22John+Paul+Jones%22",
    "tweet_volume": null
   },
   {
    "name": "Kelly Kelly",
    "url": "http://twitter.com/search?q=%22Kelly+Kelly%22",
    "promoted_content": null,
    "query": "%22Kelly+Kelly%22",
    "tweet_volume": null
   },
   {
    "name": "#LHHATL",
    "url": "http://twitter.com/search?q=%23LHHATL",
    "promoted_content": null,
    "query": "%23LHHATL",
    "tweet_volume": null
   },
   {
    "name": "#9YearsOfOneDirection",
    "url":

## Example 4. Computing the intersection of two sets of trends

In [38]:
trends_set = {}
trends_set['world'] = {trend['name']
                       for trend in world_trends[0]['trends']}

trends_set['us'] = {trend['name']
                    for trend in us_trends[0]['trends']}

trends_set['san diego'] = {trend['name']
                           for trend in local_trends[0]['trends']}

In [39]:
for loc in ['world','us','san diego']:
    print(('-'*10,loc))
    print((','.join(trends_set[loc])))

('----------', 'world')
#あなたの精神は犬なのか猫なのか,#スッキリ,Morales Solá,#LHHATL,#FilminSonunda,#DesdeElLlano,Pat Patterson,Morumbi,Toró,#OddThingsToAskACoWorker,Raniel,#RAWReunion,#لو_القلب_يتكلم_قال,#العلاوه_السنويه20,Hernanes,Antony,#selfiesforzendaya,#مطبع_سعودي_في_فلسطين,LA CARITA DE TEMO NO,#linha11,#هكون_مبسوط_لما,#RedFlagsNotToIgnore,#MenTellAll,#WeMissOur131,#تخيلو_البنات_من_غير_مكياج,#قطر_خلف_تفجيرات_مقديشو,Cuca,ソアリン,#RevogaCarmenLucia,#hugsforseungyoun,#WelcomeBamBamtoThailand,#9YearsOfOneDirection,Tirei Emoji,#StandWithSicheng,Chape,#MalibuNightsMNL,Mr. Rogers,#ماهي_علامات_الحب,Homer Bailey,#temptationisland,#اشرح_حياتك_بكلمتين,#TheBachelorette,Vitor Bueno,#ItsTimeToRetireWhen,#SinLuz,#فوزيه_الدريع_في_ذمه_الله,#IStandWithTerrenceKWilliams,#mutrewards,#東海の人がやりがちなこと,#VOLEYenDEPORTV
('----------', 'us')
Pedro Rosselló,#AmericanNinjaWarrior,#90daytheotherway,John Paul Jones,#LHHATL,TRIPLE PLAY,#OddThingsToAskACoWorker,#BombaSquad,#RAWReunion,Sabathia,Kelly Kelly,#selfiesforzendaya,Eve Torre

In [16]:
print(( '='*10,'intersection of world and us'))
print((trends_set['world'].intersection(trends_set['us'])))

print(('='*10,'intersection of us and san-diego'))
print((trends_set['san diego'].intersection(trends_set['us'])))

{'#selfiesforzendaya', 'Sigma', '#9YearsOfOneDirection', '#OddThingsToAskACoWorker', 'Art Neville', 'Tim Duncan', '#SinLuz', '#RickyRenunciaAhora', 'Kashmir', '#TrumpGreetingCards', '#KhanMeetsTrump', 'Franken', '#PeopleWhoConfuseMe', 'Mr. Rogers'}
{'#ABeautifulDayMovie', '#OddThingsToAskACoWorker', 'Kashmir', '#OurFlowerHwasa', 'Marcus Hyde', '#KhanMeetsTrump', '#selfiesforzendaya', 'Ceballos', 'Sigma', 'The Big Fundamental', 'Lili and Cole', '#5sosisgoingtojailparty', 'Neville Brothers', '#AskSum', '#OddQuestionnaireQuestions', '#TrumpGreetingCards', 'Goal of the Week', '#ModernHomeworkExcuses', 'Logan Paul', 'Chris Kraft', '#9YearsOfOneDirection', 'Art Neville', 'Becky Hammon', 'Will Hardy', '#PeopleWhoConfuseMe', 'Afghanistan', 'Gideon', '#DVDChat', 'Bow Wow', 'San Sebastián', 'Tropical Depression Three', 'Shep Smith', 'Tim Duncan', '#SinLuz', '#IStandWithTerrenceKWilliams', '#RickyRenunciaAhora', '#GoSpursGo'}
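
Python sets support other operations besides `intersection`. The sketch below uses two small hand-picked sets of trend names (a made-up subset, not the full API results) to show the `&` operator, which is equivalent to `intersection`, and the `-` operator, which gives the trends found in one location only.

```python
# Small illustrative subsets of trend names
world = {'#TheBachelorette', '#RAWReunion', '#MenTellAll', 'Mr. Rogers'}
us = {'#TheBachelorette', '#RAWReunion', 'John Paul Jones', 'Kelly Kelly'}

# `&` is the operator form of set.intersection
common = world & us

# `-` keeps the elements of the left set that are not in the right set
only_us = us - world

print(sorted(common))
print(sorted(only_us))
```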


## Example 5. Collecting search results

Set the variable `q` to a trending topic,
or anything else for that matter. The example query below
was a trending topic when this content was being developed
and is used throughout the remainder of this chapter.

In [42]:
q = 'Mr. Rogers' 

number = 100

# See https://dev.twitter.com/docs/api/1.1/get/search/tweets

search_results = twitter_api.search.tweets(q=q, count=number)

statuses = search_results['statuses']

In [44]:
print(len(statuses))
print(statuses)

100
[{'created_at': 'Tue Jul 23 02:12:41 +0000 2019', 'id': 1153488069120712704, 'id_str': '1153488069120712704', 'text': "RT @TheRaDR: While you're in your Mr. Rogers feels, a few things to remind you about his radical theology.  \n\n1/x\n\nHe was a Presbyterian mi…", 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'TheRaDR', 'name': 'Rabbi Danya Ruttenberg', 'id': 61802282, 'id_str': '61802282', 'indices': [3, 11]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 829495277383979008, 'id_str': '829495277383979008', 'name': 'lane', 'screen_name': 'bbelladonnas', 'location': 'Texas', 'description': '☠️ she/her', 'url': None, 'entities': {'description

Twitter often returns duplicate results (for example, many retweets of the same tweet). We can filter them out by checking for duplicate texts:

In [45]:
all_text = []
filtered_statuses = []
for s in statuses:
    if s["text"] not in all_text:
        filtered_statuses.append(s)
        all_text.append(s["text"])
statuses = filtered_statuses     

In [46]:
len(statuses)

39
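
The filter above checks each text against a list, which is fine for 100 results. For larger collections, a set makes each membership test cheaper. The following is a minimal sketch of the same idea; `drop_duplicate_texts` is a hypothetical name, and the sample statuses are made up and contain only the fields the function needs.

```python
def drop_duplicate_texts(statuses):
    """Keep the first status for each distinct text, preserving order."""
    seen = set()
    unique = []
    for status in statuses:
        text = status['text']
        if text not in seen:
            seen.add(text)
            unique.append(status)
    return unique

# Made-up sample with one repeated text
sample = [
    {'id': 1, 'text': 'RT @a: hello'},
    {'id': 2, 'text': 'RT @b: hi'},
    {'id': 3, 'text': 'RT @a: hello'},
]
print([s['id'] for s in drop_duplicate_texts(sample)])
```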

In [47]:
[s['text'] for s in search_results['statuses']]

["RT @TheRaDR: While you're in your Mr. Rogers feels, a few things to remind you about his radical theology.  \n\n1/x\n\nHe was a Presbyterian mi…",
 "RT @StephenAtHome: I get Natalie Portman as Thor, but I'm confused about how Tom Hanks as Mr. Rogers fits into the MCU.",
 "RT @StephenAtHome: I get Natalie Portman as Thor, but I'm confused about how Tom Hanks as Mr. Rogers fits into the MCU.",
 "RT @StephenAtHome: I get Natalie Portman as Thor, but I'm confused about how Tom Hanks as Mr. Rogers fits into the MCU.",
 "RT @TheRaDR: While you're in your Mr. Rogers feels, a few things to remind you about his radical theology.  \n\n1/x\n\nHe was a Presbyterian mi…",
 "RT @TheRaDR: While you're in your Mr. Rogers feels, a few things to remind you about his radical theology.  \n\n1/x\n\nHe was a Presbyterian mi…",
 "RT @StephenAtHome: I get Natalie Portman as Thor, but I'm confused about how Tom Hanks as Mr. Rogers fits into the MCU.",
 'RT @RottenTomatoes: Good morning neighbor! Tom Hanks st

In [54]:
# Show one sample search result by indexing the list...
print(json.dumps(statuses[1], indent=1))

{
 "created_at": "Tue Jul 23 02:12:40 +0000 2019",
 "id": 1153488068394930176,
 "id_str": "1153488068394930176",
 "text": "RT @StephenAtHome: I get Natalie Portman as Thor, but I'm confused about how Tom Hanks as Mr. Rogers fits into the MCU.",
 "truncated": false,
 "entities": {
  "hashtags": [],
  "symbols": [],
  "user_mentions": [
   {
    "screen_name": "StephenAtHome",
    "name": "Stephen Colbert",
    "id": 16303106,
    "id_str": "16303106",
    "indices": [
     3,
     17
    ]
   }
  ],
  "urls": []
 },
 "metadata": {
  "iso_language_code": "en",
  "result_type": "recent"
 },
 "source": "<a href=\"http://tapbots.com/tweetbot\" rel=\"nofollow\">Tweetbot for i\u039fS</a>",
 "in_reply_to_status_id": null,
 "in_reply_to_status_id_str": null,
 "in_reply_to_user_id": null,
 "in_reply_to_user_id_str": null,
 "in_reply_to_screen_name": null,
 "user": {
  "id": 2224648789,
  "id_str": "2224648789",
  "name": "Marvin Vista",
  "screen_name": "marvinvista",
  "location": "Metro Manila

In [56]:
# Pick one status by its index and assign it to the variable t.
t = statuses[1]
# Alternatively, a list comprehension that filters on the status ID
# returns a one-element list whose first item can be assigned to t:
#[ status for status in statuses 
#          if status['id'] == 316948241264549888 ][0]

# Explore the variable t to get familiarized with the data structure...

print(t['retweet_count'])
print(t['retweeted'])


733
False


## Example 6. Extracting text, screen names, and hashtags from tweets

In [57]:
status_texts = [ status['text'] 
                 for status in statuses ]

screen_names = [ user_mention['screen_name'] 
                 for status in statuses
                     for user_mention in status['entities']['user_mentions'] ]

hashtags = [ hashtag['text'] 
             for status in statuses
                 for hashtag in status['entities']['hashtags'] ]

# Compute a collection of all words from all tweets
words = [ w 
          for t in status_texts 
              for w in t.split() ]

In [58]:
# Explore the first 5 items for each...

print(json.dumps(status_texts[0:5], indent=1))
print(json.dumps(screen_names[0:5], indent=1)) 
print(json.dumps(hashtags[0:5], indent=1))
print(json.dumps(words[0:5], indent=1))

[
 "RT @TheRaDR: While you're in your Mr. Rogers feels, a few things to remind you about his radical theology.  \n\n1/x\n\nHe was a Presbyterian mi\u2026",
 "RT @StephenAtHome: I get Natalie Portman as Thor, but I'm confused about how Tom Hanks as Mr. Rogers fits into the MCU.",
 "RT @RottenTomatoes: Good morning neighbor! Tom Hanks stars as Mr. Rogers in the first trailer for #ABeautifulDayMovie, coming this Thanksgi\u2026",
 "@ABeautifulDay Can\u2019t believe @tomhanks just made Mr. Rogers sound like Forest Gump.  Mr. Rogers was a loving, carin\u2026 https://t.co/grQz3Fba5q",
 "RT @RonHogan: This is a wonderful thread and reminds me, once again, of how Mr. Rogers\u2019 message resonated with a key theme of liberation th\u2026"
]
[
 "TheRaDR",
 "StephenAtHome",
 "RottenTomatoes",
 "ABeautifulDay",
 "tomhanks"
]
[
 "ABeautifulDayMovie",
 "MrRogers",
 "MustWatch",
 "Canadian",
 "CBC"
]
[
 "RT",
 "@TheRaDR:",
 "While",
 "you're",
 "in"
]
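
The nested list comprehensions in this example read left to right like nested `for` loops. The sketch below makes that explicit on two made-up statuses that contain only the fields used here; the variable names ending in `_loop` and the sample data are assumptions for illustration.

```python
# Made-up statuses with only the fields needed for the illustration
sample_statuses = [
    {'text': 'RT @TheRaDR: hello #MrRogers',
     'entities': {'user_mentions': [{'screen_name': 'TheRaDR'}],
                  'hashtags': [{'text': 'MrRogers'}]}},
    {'text': 'no entities here',
     'entities': {'user_mentions': [], 'hashtags': []}},
]

# Comprehension form, as used in the notebook
sample_screen_names = [mention['screen_name']
                       for status in sample_statuses
                           for mention in status['entities']['user_mentions']]

# Equivalent explicit loops: the outer loop comes first
sample_screen_names_loop = []
for status in sample_statuses:
    for mention in status['entities']['user_mentions']:
        sample_screen_names_loop.append(mention['screen_name'])

print(sample_screen_names)
print(sample_screen_names == sample_screen_names_loop)
```

A status with no mentions or hashtags simply contributes nothing to the resulting list.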


## Example 7. Creating a basic frequency distribution from the words in tweets

In [26]:
from collections import Counter

for item in [words, screen_names, hashtags]:
    c = Counter(item)
    print(c.most_common(10)) # top 10
    print()

[('RT', 24), ('the', 18), ('#MTVAwards', 15), ('#MTVHottest', 11), ('at', 10), ('a', 10), ('to', 10), ('of', 9), ('BTS', 9), ('and', 8)]

[('BTS_twt', 8), ('TeenMom', 5), ('MTV', 3), ('MTVAwards', 3), ('Dianagggg881', 2), ('CatelynnLowell', 2), ('TylerBaltierra', 2), ('TheRock', 2), ('ZacEfron', 2), ('ygofficialblink', 2)]

[('MTVAwards', 19), ('MTVHottest', 11), ('TeenMomOG', 6), ('MTVHottestBLACKPINK', 2), ('MTVHottes', 2), ('MTV', 2), ('BLACKPINK', 2), ('AvengersEndgame', 1), ('StrangerThings', 1), ('BTS', 1)]
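
In the word counts above, tokens such as `RT`, `the`, and `a` dominate. One possible refinement, sketched here under the assumption that case differences and a few common words are not of interest, is to lowercase the words and skip a small stopword set before counting. `STOPWORDS` and `top_words` are hypothetical names, and the stopword list is a hand-picked illustration rather than a standard one.

```python
from collections import Counter

# Small illustrative stopword set; a real analysis would use a fuller list
STOPWORDS = {'rt', 'the', 'a', 'to', 'of', 'and', 'at'}

def top_words(words, n=10):
    """Count lowercased words, ignoring the stopwords defined above."""
    normalized = (w.lower() for w in words)
    counts = Counter(w for w in normalized if w not in STOPWORDS)
    return counts.most_common(n)

# Made-up sample words
sample_words = ['RT', 'the', 'BTS', 'at', 'the', '#MTVAwards',
                'bts', 'The', '#MTVAwards', '#MTVAwards']
print(top_words(sample_words, 2))
```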



## Example 8. Create a prettyprint function to display tuples in a nice tabular format

In [27]:
def prettyprint_counts(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))

In [28]:
for label, data in (('Word', words), 
                    ('Screen Name', screen_names), 
                    ('Hashtag', hashtags)):
    
    c = Counter(data)
    prettyprint_counts(label, c.most_common(10))


        Word         | Count 
****************************************
RT                   |     24
the                  |     18
#MTVAwards           |     15
#MTVHottest          |     11
at                   |     10
a                    |     10
to                   |     10
of                   |      9
BTS                  |      9
and                  |      8

    Screen Name      | Count 
****************************************
BTS_twt              |      8
TeenMom              |      5
MTV                  |      3
MTVAwards            |      3
Dianagggg881         |      2
CatelynnLowell       |      2
TylerBaltierra       |      2
TheRock              |      2
ZacEfron             |      2
ygofficialblink      |      2

      Hashtag        | Count 
****************************************
MTVAwards            |     19
MTVHottest           |     11
TeenMomOG            |      6
MTVHottestBLACKPINK  |      2
MTVHottes            |      2
MTV                  |      2
BLAC

## Example 9. Finding the most popular retweets

In [29]:
retweets = [
            # Store out a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text'].replace("\n","\\")) 
            
            # ... for each status ...
            for status in statuses 
            
            # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]
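
To see what the comprehension above produces, here is a self-contained sketch on three made-up statuses, one of which is not a retweet and therefore lacks the `retweeted_status` key. The sample data and the names `sample_retweets` and `by_count` are assumptions for illustration. Sorting with an explicit `key` orders the tuples by retweet count alone, instead of comparing whole tuples.

```python
# Made-up statuses with only the fields used by the comprehension
sample_statuses = [
    {'retweet_count': 733, 'text': 'RT @StephenAtHome: sample text',
     'retweeted_status': {'user': {'screen_name': 'StephenAtHome'}}},
    {'retweet_count': 0, 'text': 'an original tweet'},
    {'retweet_count': 6726, 'text': 'RT @MTV: sample text',
     'retweeted_status': {'user': {'screen_name': 'MTV'}}},
]

# Same shape as the notebook's comprehension: only retweets are kept
sample_retweets = [(s['retweet_count'],
                    s['retweeted_status']['user']['screen_name'],
                    s['text'])
                   for s in sample_statuses
                       if 'retweeted_status' in s]

# Sort by the retweet count (first tuple element), largest first
by_count = sorted(sample_retweets, key=lambda r: r[0], reverse=True)
print(by_count[0][:2])
```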

We can build another `prettyprint` function to print entire tweets with their retweet count.

We also want to split the text of the tweet into up to three lines of 50 characters each, if needed.

In [30]:
row_template = "{:^7} | {:^15} | {:50}"
def prettyprint_tweets(list_of_tuples):
    print()
    print(row_template.format("Count", "Screen Name", "Text"))
    print("*"*60)
    for count, screen_name, text in list_of_tuples:
        print(row_template.format(count, screen_name, text[:50]))
        if len(text) > 50:
            print(row_template.format("", "", text[50:100]))
            if len(text) > 100:
                print(row_template.format("", "", text[100:]))
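
The function above cuts the text at fixed offsets, which can split a word across two rows. As an alternative sketch (not the notebook's method), the standard library's `textwrap.wrap` breaks on whitespace and can cap the number of lines. `wrap_tweet` is a hypothetical helper name, and the sample text is made up.

```python
import textwrap

def wrap_tweet(text, width=50, max_lines=3):
    """Split text into at most max_lines lines no longer than width."""
    # When the text does not fit, textwrap ends the last line with
    # its default placeholder, ' [...]'
    return textwrap.wrap(text, width=width, max_lines=max_lines)

# Made-up long text to exercise the line limit
sample_text = 'word ' * 60
for line in wrap_tweet(sample_text):
    print(line)
```

Each returned line could then be passed to `row_template.format` in place of the fixed slices.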

In [31]:
# Slice off the first 10 from the sorted results and display each item in the tuple

prettyprint_tweets(sorted(retweets, reverse=True)[:10])


 Count  |   Screen Name   | Text                                              
************************************************************
 6726   |       MTV       | RT @MTV: YES! Girl Power!! @brielarson #MTVAwards 
        |                 | https://t.co/QHBKbpNJe0                           
 5696   |   MarkRuffalo   | RT @MarkRuffalo: Love this team 3000 ♥ Grateful fo
        |                 | r everyone who voted for #AvengersEndgame at the #
        |                 | MTVAwards! https://t.co/tvh9kZa4HY                
 4500   |       MTV       | RT @MTV: My queens @chloexhalle slayed their perfo
        |                 | rmance at the #MTVAwards 😍 https://t.co/cDZ8kwmkF9
 3230   |    ZacEfron     | RT @ZacEfron: Always fan boy out when 0️⃣1️⃣1️⃣'s 
        |                 | around. Great to run into the #StrangerThings crew
        |                 | - even BETTER- handing @milliebbrown he…          
  616   |    ZacEfron     | RT @ZacEfron: Hanging with THE @AADaddari