# Twitter data

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

# Twitter API Access

Twitter implements OAuth 1.0A as its standard authentication mechanism, and in order to use it to make requests to Twitter's API, you'll need to go to https://dev.twitter.com/apps and create a sample application.

Choose any name for your application, write a description and use `http://google.com` for the website.

Under **Key and Access Tokens**, there are four primary identifiers you'll need to note for an OAuth 1.0A workflow: 
* consumer key, 
* consumer secret, 
* access token, and 
* access token secret (Click on Create Access Token to create those).

Note that you will need an ordinary Twitter account in order to log in, create an app, and get these credentials.

The first time you run the notebook, fill in all four credentials so that they are saved to the `pkl` file. After that, you can remove the secret keys from the notebook, because they will be loaded from the `pkl` file.

The `pkl` file contains sensitive information that can be used to take control of your Twitter account. **Do not share it.**

In [37]:
import pickle
import os 

In [41]:
if not os.path.exists("secret_twitter_credentials.pkl"):
    Twitter = {}
    Twitter["Consumer Key"] = "pZsFxQc02cs1WEjJZUKaLyk18"
    Twitter["Consumer Secret"] = "lPS8WoyFJCGDOL6LdEwnM0chBq2pmkVaYSPr8xVWeRWVM9B9Yk"
    Twitter["Access Token"] = "1145666867480186885-vKClUIjIOLpb7DlnV2cIIN9nQxBRC5"
    Twitter["Access Token Secret"] = "ErHsrD2ttBEEnVAU0ATMmnvHowj8OW4KC5X1KX6DheKDu"
    with open("secret_twitter_credentials.pkl", "wb") as f:
        pickle.dump(Twitter, f)
else:
    # Reuse the credentials saved on a previous run.
    with open("secret_twitter_credentials.pkl", "rb") as f:
        Twitter = pickle.load(f)
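The save-once, load-afterwards pattern described above can also be wrapped in a small helper. This is only a sketch: the function name `load_or_save_credentials` and its parameters are made up for illustration and are not part of the notebook or of any library.

```python
import os
import pickle


def load_or_save_credentials(path, credentials=None):
    """Load credentials from `path`; on the first run, save `credentials` there.

    Hypothetical helper: `path` is the pickle file, `credentials` is the
    dictionary of keys and tokens to store when no file exists yet.
    """
    if os.path.exists(path):
        # A previous run already saved the credentials, so reuse them.
        with open(path, "rb") as f:
            return pickle.load(f)
    if credentials is None:
        raise ValueError("no saved credentials found; pass them on the first run")
    with open(path, "wb") as f:
        pickle.dump(credentials, f)
    return credentials


# Example usage (commented out so the sketch does not write any file):
# Twitter = load_or_save_credentials("secret_twitter_credentials.pkl", Twitter)
```

Using one variable for the file name in both branches avoids the kind of mismatch where the file is written under one name and looked up under another.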

Install the `twitter` package to interface with the Twitter API.

In [42]:
!pip install twitter



## Example 1: Authorizing an application to access Twitter account data

In [43]:
import twitter

auth = twitter.oauth.OAuth(Twitter['Access Token'],
                           Twitter['Access Token Secret'],
                           Twitter['Consumer Key'],
                           Twitter['Consumer Secret'])
twitter_api = twitter.Twitter(auth = auth)

# Nothing to see by displaying twitter_api except that it's now a defined variable


print(twitter_api)

<twitter.api.Twitter object at 0x0000020869D34CC8>


## Example 2: Retrieving trends

Twitter identifies locations using the Yahoo! Where On Earth ID (WOEID).

The Yahoo! Where On Earth ID for the entire world is 1.
See https://dev.twitter.com/docs/api/1.1/get/trends/place and
http://developer.yahoo.com/geo/geoplanet/

You can also look at the BOSS PlaceFinder: https://developer.yahoo.com/boss/placefinder/

In [44]:
WORLD_WOE_ID = 1
US_WOE_ID = 23424977

Look up the WOEID for [San Diego](http://woeid.rosselliot.co.nz/lookup/san%20diego%20%20ca).

You can change it to another location.
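If a lookup site is unavailable, the WOEID can probably also be found in the list of locations for which Twitter publishes trends. The sketch below assumes that list is a sequence of dictionaries with `name`, `country`, and `woeid` keys, as in the v1.1 `trends/available` response. The helper `find_woeids` and the sample data are illustrative, not part of the notebook.

```python
def find_woeids(locations, name):
    """Return (name, country, woeid) for locations whose name contains `name`."""
    needle = name.casefold()
    return [(loc["name"], loc.get("country", ""), loc["woeid"])
            for loc in locations
            if needle in loc["name"].casefold()]


# Hand-written sample in the shape assumed for a trends/available response.
sample_locations = [
    {"name": "Worldwide", "country": "", "woeid": 1},
    {"name": "United States", "country": "United States", "woeid": 23424977},
    {"name": "San Diego", "country": "United States", "woeid": 2487889},
]

print(find_woeids(sample_locations, "san diego"))
```

With a live connection, the same function could be applied to the result of `twitter_api.trends.available()`, assuming that endpoint is still accessible to your application.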

In [45]:
LOCAL_WOE_ID = 2487889

# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.

world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)
local_trends = twitter_api.trends.place(_id=LOCAL_WOE_ID)
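To make the `_id` versus `id` distinction in the comment above concrete, here is a toy model of the convention. It is not the `twitter` package's implementation: `build_url` is a hypothetical function written only to show where the value ends up in each case.

```python
from urllib.parse import urlencode


def build_url(base, **kwargs):
    """Toy model of the id / _id convention (not the library's actual code)."""
    path = base
    if "id" in kwargs:
        # Plain `id`: the value becomes part of the URL path.
        path = f"{path}/{kwargs.pop('id')}"
    # `_id`: the underscore is dropped and the value goes in the query string.
    params = {("id" if key == "_id" else key): value
              for key, value in kwargs.items()}
    query = urlencode(params)
    return f"{path}.json?{query}" if query else f"{path}.json"


base = "https://api.twitter.com/1.1/trends/place"
print(build_url(base, _id=1))  # value sent as a query string parameter
print(build_url(base, id=1))   # value appended to the URL path
```

The `trends/place` endpoint expects `id` as a query parameter, which is why the calls above use `_id`.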

In [46]:
world_trends[:2]

[{'trends': [{'name': '#virüstürkiyede',
    'url': 'http://twitter.com/search?q=%23vir%C3%BCst%C3%BCrkiyede',
    'promoted_content': None,
    'query': '%23vir%C3%BCst%C3%BCrkiyede',
    'tweet_volume': 35813},
   {'name': 'センバツ中止',
    'url': 'http://twitter.com/search?q=%E3%82%BB%E3%83%B3%E3%83%90%E3%83%84%E4%B8%AD%E6%AD%A2',
    'promoted_content': None,
    'query': '%E3%82%BB%E3%83%B3%E3%83%90%E3%83%84%E4%B8%AD%E6%AD%A2',
    'tweet_volume': 53621},
   {'name': 'あなたのサークル',
    'url': 'http://twitter.com/search?q=%E3%81%82%E3%81%AA%E3%81%9F%E3%81%AE%E3%82%B5%E3%83%BC%E3%82%AF%E3%83%AB',
    'promoted_content': None,
    'query': '%E3%81%82%E3%81%AA%E3%81%9F%E3%81%AE%E3%82%B5%E3%83%BC%E3%82%AF%E3%83%AB',
    'tweet_volume': 121710},
   {'name': '#Hの文字でわかる性欲診断',
    'url': 'http://twitter.com/search?q=%23H%E3%81%AE%E6%96%87%E5%AD%97%E3%81%A7%E3%82%8F%E3%81%8B%E3%82%8B%E6%80%A7%E6%AC%B2%E8%A8%BA%E6%96%AD',
    'promoted_content': None,
    'query': '%23H%E3%81%AE%E6%96%87%E5%AD%97%E

In [47]:
trends = local_trends
print(type(trends))
print(list(trends[0].keys()))
print(trends[0]["trends"])

<class 'twitter.api.TwitterListResponse'>
['trends', 'as_of', 'created_at', 'locations']
[{'name': '#TheBachelorFinale', 'url': 'http://twitter.com/search?q=%23TheBachelorFinale', 'promoted_content': None, 'query': '%23TheBachelorFinale', 'tweet_volume': 99082}, {'name': 'Coachella', 'url': 'http://twitter.com/search?q=Coachella', 'promoted_content': None, 'query': 'Coachella', 'tweet_volume': 200042}, {'name': 'UCSD', 'url': 'http://twitter.com/search?q=UCSD', 'promoted_content': None, 'query': 'UCSD', 'tweet_volume': 14916}, {'name': '#COVID2019', 'url': 'http://twitter.com/search?q=%23COVID2019', 'promoted_content': None, 'query': '%23COVID2019', 'tweet_volume': 352047}, {'name': 'Italy', 'url': 'http://twitter.com/search?q=Italy', 'promoted_content': None, 'query': 'Italy', 'tweet_volume': 565149}, {'name': '#SuperTuesday2', 'url': 'http://twitter.com/search?q=%23SuperTuesday2', 'promoted_content': None, 'query': '%23SuperTuesday2', 'tweet_volume': 56667}, {'name': '#OnMyBlockS3', 

## Example 3: Displaying API responses as pretty-printed JSON

In [48]:
import json
print(json.dumps(us_trends[:2], indent=1))

[
 {
  "trends": [
   {
    "name": "#OnMyBlockS3",
    "url": "http://twitter.com/search?q=%23OnMyBlockS3",
    "promoted_content": null,
    "query": "%23OnMyBlockS3",
    "tweet_volume": null
   },
   {
    "name": "#ByeByeBernie",
    "url": "http://twitter.com/search?q=%23ByeByeBernie",
    "promoted_content": null,
    "query": "%23ByeByeBernie",
    "tweet_volume": 27057
   },
   {
    "name": "#LoseWithBiden",
    "url": "http://twitter.com/search?q=%23LoseWithBiden",
    "promoted_content": null,
    "query": "%23LoseWithBiden",
    "tweet_volume": 56220
   },
   {
    "name": "#BernieWarriors",
    "url": "http://twitter.com/search?q=%23BernieWarriors",
    "promoted_content": null,
    "query": "%23BernieWarriors",
    "tweet_volume": null
   },
   {
    "name": "#E32020",
    "url": "http://twitter.com/search?q=%23E32020",
    "promoted_content": null,
    "query": "%23E32020",
    "tweet_volume": null
   },
   {
    "name": "3 tsa",
    "url": "http://twitter.com/search?q=
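As the output shows, `tweet_volume` is `null` (`None` in Python) for some trends, so ranking trends by volume needs to skip those entries. A minimal sketch, using a few entries copied from the outputs above. The helper name `top_trends_by_volume` is an assumption for illustration.

```python
def top_trends_by_volume(trends, n=3):
    """Return the n (name, tweet_volume) pairs with the largest known volume."""
    # Drop trends whose volume is unknown (None) before sorting.
    with_volume = [t for t in trends if t.get("tweet_volume") is not None]
    ranked = sorted(with_volume, key=lambda t: t["tweet_volume"], reverse=True)
    return [(t["name"], t["tweet_volume"]) for t in ranked[:n]]


# A few trend entries taken from the outputs shown earlier.
sample_trends = [
    {"name": "#TheBachelorFinale", "tweet_volume": 99082},
    {"name": "Coachella", "tweet_volume": 200042},
    {"name": "UCSD", "tweet_volume": 14916},
    {"name": "#OnMyBlockS3", "tweet_volume": None},
]

print(top_trends_by_volume(sample_trends, n=2))
```

On live data the call would be something like `top_trends_by_volume(us_trends[0]["trends"])`.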

## Example 4: Computing the intersection of two sets of trends

In [49]:
trends_set = {}
trends_set["world"] = set([trend["name"]
                          for trend in world_trends[0]["trends"]])

trends_set["us"] = set([trend["name"]
                       for trend in us_trends[0]["trends"]])

trends_set["san diego"] = set([trend["name"]
                              for trend in local_trends[0]["trends"]])

In [50]:
for loc in ["world", "us", "san diego"]:
    print(("-"*10,loc))
    print((",".join(trends_set[loc])))
    

('----------', 'world')
Bank of England,#EXO_Repackage_album,#MamaBisa,#東日本大震災から9年,#ปังจริงจ้า,センバツ中止,#_بريده,#탐라분들_그림그릴때_뭐써요,#티비조선_갑질_출연계약서_공론화,#صباحات_الهلال,センバツ高校野球,#koronavirusu,休園延長,#よく呟く言葉からあなたは何県民か当てる,選抜中止,新エリア,#Çarşamba,#FelizMiércoles,#シンデレラガールの人,#SB19saPinasFM,ニコ超,enuygunfiyatta ttverilir,選抜高校野球,#Hの文字でわかる性欲診断,BTS WORLD DOMINATION,tacubaya,E3中止,あなたのサークル,会議中止,#Islamabad,#ByeByeBernie,#キティちゃんのお絵かき,#あなたの名前を直訳すると,#DukungOmnibusLaw,#virüstürkiyede,ファンタズミック,#負けるなポケ徹,日本高野連,#OnMyBlockS3,高校球児,#WaspadaCegahCorona,#JeSuisSneazzy,#BerkinElvan,ブギウギ,#くらすます,ド変態性癖ビンゴ,#عقارك_تمويل_شراء_رهن,#11Marzo,#JyotiradityaMScindia,コミケ
('----------', 'us')
Bank of England,Michigan,Lil Ricky,#CloseCuny,Van Jones,The Math,#MichaelMoore,road to kingdom,#LoseWithBiden,#DarkAges,#TeamBarb,Olympiakos,#cashappwisdom,Carville,BTS WORLD DOMINATION,tacubaya,Aunt Gloria,#NCT127RODEO,#ByeByeBernie,#yangang,Just Everywhere Already,Shrek in Spanish,#InkMaster,#BidensCognitiveDecline,Rout,Nipsey Struggle,#WednesdayWisd

In [51]:
print(( '='*10,'intersection of world and us'))
print((trends_set['world'].intersection(trends_set['us'])))

print(('='*10,'intersection of us and san-diego'))
print((trends_set['san diego'].intersection(trends_set['us'])))

{'Bank of England', '#ByeByeBernie', '#OnMyBlockS3', 'BTS WORLD DOMINATION', 'tacubaya'}
{'Bank of England', 'Michigan', 'Lil Ricky', '#CloseCuny', 'Van Jones', 'The Math', '#MichaelMoore', 'road to kingdom', '#LoseWithBiden', '#DarkAges', 'Olympiakos', '#cashappwisdom', 'Carville', 'BTS WORLD DOMINATION', 'Aunt Gloria', 'tacubaya', '#NCT127RODEO', '#ByeByeBernie', '#yangang', 'Just Everywhere Already', 'Shrek in Spanish', '#InkMaster', '#BidensCognitiveDecline', 'Rout', 'Nipsey Struggle', '#WednesdayWisdom', '#EricNaminSF', '#SuperTuesday2', 'ar-14', '1439 - michael osterholm', '#OnMyBlockS3', '#tobuildconfidence', '#RejectedDisneyShows', '#IMayBeDrunkBut', '#E32020', '#newamsterdam', 'Spy Kids', 'my next', '#BernieWarriors', '#MAR10Day', 'Italy', '#DemExit', 'Warzone', '#COVID2019', '3 tsa', '#whyiwasleftonread', 'My Last'}
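The same comparison can be generalized to every pair of locations, and set difference shows what is trending in one place but not another. The sketch below uses a small hand-picked subset of the trend names above rather than the full sets. `pairwise_overlap` is a hypothetical helper name.

```python
from itertools import combinations


def pairwise_overlap(named_sets):
    """Map each pair of set names to the intersection of the two sets."""
    return {(a, b): named_sets[a] & named_sets[b]
            for a, b in combinations(sorted(named_sets), 2)}


# Small hand-picked subset of the trend names shown above.
sample = {
    "world": {"Bank of England", "#ByeByeBernie", "センバツ中止"},
    "us": {"Bank of England", "#ByeByeBernie", "Michigan"},
    "san diego": {"Bank of England", "Michigan", "UCSD"},
}

for (a, b), common in pairwise_overlap(sample).items():
    print(f"{a} & {b}: {sorted(common)}")

# Trends in the US sample that are not in the worldwide sample.
print("us only:", sorted(sample["us"] - sample["world"]))
```

With the notebook's data, `pairwise_overlap(trends_set)` would cover all three pairs in one call.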


## Example 5: Collecting search results

Set the variable `q` to a trending topic, or anything else for that matter.
The example query below was a trending topic when this content was being
developed and is used throughout the remainder of this chapter.

In [52]:
q = "#MTVAwards"
# topics = "#HoweAwardsxMG"

number = 100

# See https://dev.twitter.com/docs/api/1.1/get/search/tweets
search_results = twitter_api.search.tweets(q = q, count = number)
statuses = search_results["statuses"]

In [53]:
len(statuses)

67
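The request asked for up to 100 tweets but returned 67, since `count` is an upper bound rather than a guarantee. In the v1.1 search API, further pages are usually reached through the `next_results` query string in the response's `search_metadata`. The sketch below only shows how that string could be turned into keyword arguments. The helper `next_page_kwargs` and the sample metadata (including its `max_id` value) are made up for illustration.

```python
from urllib.parse import parse_qsl


def next_page_kwargs(search_metadata):
    """Turn a `next_results` query string into keyword arguments, or None."""
    next_results = search_metadata.get("next_results")
    if not next_results:
        return None  # no further page was advertised
    # Strip the leading "?" and decode percent-escapes such as %23 -> "#".
    return dict(parse_qsl(next_results.lstrip("?")))


# Illustrative metadata only; the max_id value is invented.
sample_metadata = {
    "next_results": "?max_id=1237598291728072703&q=%23MTVAwards&count=100&include_entities=1"
}

print(next_page_kwargs(sample_metadata))

# Possible use against the live API (not executed here):
# kwargs = next_page_kwargs(search_results["search_metadata"])
# if kwargs:
#     search_results = twitter_api.search.tweets(**kwargs)
#     statuses += search_results["statuses"]
```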

In [54]:
print(statuses)

[{'created_at': 'Wed Mar 11 04:36:40 +0000 2020', 'id': 1237598291728072704, 'id_str': '1237598291728072704', 'text': 'RT @cxhupdates: Chloe and Halle performing at the #MTVAwards https://t.co/cZHo1RiYxU', 'truncated': False, 'entities': {'hashtags': [{'text': 'MTVAwards', 'indices': [50, 60]}], 'symbols': [], 'user_mentions': [{'screen_name': 'cxhupdates', 'name': 'CHLOE X HALLE UPDATES', 'id': 983789784182280192, 'id_str': '983789784182280192', 'indices': [3, 14]}], 'urls': [], 'media': [{'id': 1008838927623417856, 'id_str': '1008838927623417856', 'indices': [61, 84], 'media_url': 'http://pbs.twimg.com/media/DgAoXVXVQAA0LT6.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DgAoXVXVQAA0LT6.jpg', 'url': 'https://t.co/cZHo1RiYxU', 'display_url': 'pic.twitter.com/cZHo1RiYxU', 'expanded_url': 'https://twitter.com/MTV/status/1008887184684822528/video/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 1200, 'h': 675, 'resize': 'fit'}, 'small'

In [55]:
all_text = []
filtered_statuses = []
for s in statuses:
    if s["text"] not in all_text:
        filtered_statuses.append(s)
        all_text.append(s["text"])
statuses = filtered_statuses

In [56]:
len(statuses)

46
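The filtering step above checks each tweet's text against a list, which is fine for a few dozen tweets. For larger collections, a set makes each membership test cheaper while still keeping the first occurrence of each text in its original order. A minimal sketch, with a hypothetical function name and toy data:

```python
def dedupe_by_text(statuses):
    """Keep the first status for each distinct text, preserving order."""
    seen = set()
    unique = []
    for status in statuses:
        text = status["text"]
        if text not in seen:
            seen.add(text)
            unique.append(status)
    return unique


# Toy statuses with only the fields needed here.
sample_statuses = [
    {"id": 1, "text": "RT @cxhupdates: Chloe and Halle performing at the #MTVAwards"},
    {"id": 2, "text": "RT @cxhupdates: Chloe and Halle performing at the #MTVAwards"},
    {"id": 3, "text": "RT @MTV: My queens @chloexhalle slayed their performance"},
]

print([s["id"] for s in dedupe_by_text(sample_statuses)])
```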

In [57]:
[s["text"] for s in search_results["statuses"]]

['RT @cxhupdates: Chloe and Halle performing at the #MTVAwards https://t.co/cZHo1RiYxU',
 'RT @cxhupdates: Chloe and Halle performing at the #MTVAwards https://t.co/cZHo1RiYxU',
 'RT @MTVteenwolf: ✨ me and my pack watching the #MTVAwards tonight at 9/8c ✨ https://t.co/ERQpzN9zWt',
 'RT @HugoGloss: Noah, mozão, vem cá pra eu ver se você tem o melhor beijo mesmo! Kkk #MTVAwards',
 'RT @thecwban: Rebola assim perto de mim que eu não respondo pelos meus atos\n AMÉM CAMILA CABELLO \n#MTVAwards https://t.co/jfAJFvM9TS',
 'RT @GettyVIP: They’re here!  Johnny, Haechan, Mark, Jaehyun, Taeyong, Yuta, Taeil, and Doyoung of NCT 127 pose at the 2019 MTV EMAs in Sevi…',
 'RT @MTV: My queens @chloexhalle slayed their performance at the #MTVAwards 😍 https://t.co/cDZ8kwmkF9',
 'RT @MTV: @CW_Riverdale did you hear the news? #riverdale has been nominated at the 2018 #mtvawards! cast your votes at https://t.co/B6c16HK…',
 'RT @cxhupdates: Chloe and Halle performing at the #MTVAwards https://t.co/cZHo1RiYx

In [58]:
# Show one sample search result by indexing into the list
print(json.dumps(statuses[0],indent = 1))

{
 "created_at": "Wed Mar 11 04:36:40 +0000 2020",
 "id": 1237598291728072704,
 "id_str": "1237598291728072704",
 "text": "RT @cxhupdates: Chloe and Halle performing at the #MTVAwards https://t.co/cZHo1RiYxU",
 "truncated": false,
 "entities": {
  "hashtags": [
   {
    "text": "MTVAwards",
    "indices": [
     50,
     60
    ]
   }
  ],
  "symbols": [],
  "user_mentions": [
   {
    "screen_name": "cxhupdates",
    "name": "CHLOE X HALLE UPDATES",
    "id": 983789784182280192,
    "id_str": "983789784182280192",
    "indices": [
     3,
     14
    ]
   }
  ],
  "urls": [],
  "media": [
   {
    "id": 1008838927623417856,
    "id_str": "1008838927623417856",
    "indices": [
     61,
     84
    ],
    "media_url": "http://pbs.twimg.com/media/DgAoXVXVQAA0LT6.jpg",
    "media_url_https": "https://pbs.twimg.com/media/DgAoXVXVQAA0LT6.jpg",
    "url": "https://t.co/cZHo1RiYxU",
    "display_url": "pic.twitter.com/cZHo1RiYxU",
    "expanded_url": "https://twitter.com/MTV/status/100888718

In [59]:
# Take the first status as a sample tweet and bind it to the variable t.
t = statuses[0]
# Alternatively, select a tweet by ID: the result of this list
# comprehension is a list with only one element, which can be
# accessed by its index.
#[ status for status in statuses 
#          if status['id'] == 316948241264549888 ][0]

# Explore the variable t to get familiar with the data structure...

print(t["retweet_count"])
print(t["retweeted"])



726
False


## Example 6: Extracting text, screen names, and hashtags from tweets

In [60]:
status_texts = [status["text"]
               for status in statuses]

screen_names = [user_mention["screen_name"]
               for status in statuses
                   for user_mention in status["entities"]["user_mentions"]]

hashtags = [ hashtag["text"]
           for status in statuses
               for hashtag in status["entities"]["hashtags"] ]

# compute a collection of all words from all tweets
words = [w
          for t in status_texts
               for w in t.split()]

In [61]:
# Exploring the first 5 items for each.....

print(json.dumps(status_texts[0:5], indent =1))
print(json.dumps(screen_names[0:5], indent =1))
print(json.dumps(hashtags[0:5], indent =1))
print(json.dumps(words[0:5], indent =1))

[
 "RT @cxhupdates: Chloe and Halle performing at the #MTVAwards https://t.co/cZHo1RiYxU",
 "RT @MTVteenwolf: \u2728 me and my pack watching the #MTVAwards tonight at 9/8c \u2728 https://t.co/ERQpzN9zWt",
 "RT @HugoGloss: Noah, moz\u00e3o, vem c\u00e1 pra eu ver se voc\u00ea tem o melhor beijo mesmo! Kkk #MTVAwards",
 "RT @thecwban: Rebola assim perto de mim que eu n\u00e3o respondo pelos meus atos\n AM\u00c9M CAMILA CABELLO \n#MTVAwards https://t.co/jfAJFvM9TS",
 "RT @GettyVIP: They\u2019re here!  Johnny, Haechan, Mark, Jaehyun, Taeyong, Yuta, Taeil, and Doyoung of NCT 127 pose at the 2019 MTV EMAs in Sevi\u2026"
]
[
 "cxhupdates",
 "MTVteenwolf",
 "HugoGloss",
 "thecwban",
 "GettyVIP"
]
[
 "MTVAwards",
 "MTVAwards",
 "MTVAwards",
 "MTVAwards",
 "MTVAwards"
]
[
 "RT",
 "@cxhupdates:",
 "Chloe",
 "and",
 "Halle"
]


## Example 7: Creating a basic frequency distribution from the words in tweets

In [62]:
from collections import Counter

for item in [words,screen_names,hashtags]:
    c = Counter(item)
    print(c.most_common(10)) # top 10
    print()

[('RT', 38), ('#MTVAwards', 31), ('the', 20), ('at', 16), ('a', 10), ('de', 7), ('@MTVAwards:', 7), ('los', 7), ('and', 6), ('que', 6)]

[('MTV', 7), ('MTVAwards', 7), ('sanbullockof', 6), ('daenerwszx', 5), ('MTVteenwolf', 2), ('jadapsmith', 2), ('cxhupdates', 1), ('HugoGloss', 1), ('thecwban', 1), ('GettyVIP', 1)]

[('MTVAwards', 33), ('mtvawards', 3), ('InternationalWomensDay', 2), ('MTVawards', 2), ('SpotifyAwards2020', 2), ('SpotifyAwards', 2), ('SandraBullock', 2), ('LHHATL', 2), ('riverdale', 1), ('applemusic', 1)]
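Note that the hashtag counts treat `MTVAwards`, `mtvawards`, and `MTVawards` as different items, although Twitter hashtags are matched case-insensitively. One possible refinement is to normalize case before counting. The sketch below uses counts taken from the output above. `count_case_insensitive` is an illustrative name.

```python
from collections import Counter


def count_case_insensitive(items):
    """Count items after case-folding, so 'MTVAwards' and 'mtvawards' merge."""
    return Counter(item.casefold() for item in items)


# Rebuild part of the hashtag list from the counts printed above.
sample_hashtags = (["MTVAwards"] * 33 + ["mtvawards"] * 3
                   + ["MTVawards"] * 2 + ["riverdale"])

print(count_case_insensitive(sample_hashtags).most_common(2))
```

The same normalization could be applied to `words` and `screen_names`, at the cost of losing the original capitalization in the display.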



## Example 8: Creating a prettyprint function to display tuples in a nice tabular format

In [63]:
def prettyprint_counts(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))

In [64]:
for label, data in (("Word",words),
                   ("Screen name",screen_names),
                   ("Hashtag",hashtags)):
    c = Counter(data)
    prettyprint_counts(label, c.most_common(10))


        Word         | Count 
****************************************
RT                   |     38
#MTVAwards           |     31
the                  |     20
at                   |     16
a                    |     10
de                   |      7
@MTVAwards:          |      7
los                  |      7
and                  |      6
que                  |      6

    Screen name      | Count 
****************************************
MTV                  |      7
MTVAwards            |      7
sanbullockof         |      6
daenerwszx           |      5
MTVteenwolf          |      2
jadapsmith           |      2
cxhupdates           |      1
HugoGloss            |      1
thecwban             |      1
GettyVIP             |      1

      Hashtag        | Count 
****************************************
MTVAwards            |     33
mtvawards            |      3
InternationalWomensDay |      2
MTVawards            |      2
SpotifyAwards2020    |      2
SpotifyAwards        |      2
Sa
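In the hashtag table, `InternationalWomensDay` is longer than the fixed 20-character column, so its row is pushed out of alignment. One way around this is to size the first column from the data. The following is a sketch of that idea, not a replacement prescribed by the notebook. `format_counts_fitted` is a hypothetical name, and it returns the table as a string instead of printing it.

```python
def format_counts_fitted(label, list_of_tuples):
    """Format (item, count) pairs as a table whose first column fits the data."""
    rows = [(str(k), str(v)) for k, v in list_of_tuples]
    # The column must fit the label and the longest item.
    width = max([len(label)] + [len(k) for k, _ in rows])
    lines = [f"{label:^{width}} | {'Count':^6}",
             "*" * (width + 9)]  # 9 = len(" | ") + 6-character count column
    lines += [f"{k:{width}} | {v:>6}" for k, v in rows]
    return "\n".join(lines)


print(format_counts_fitted("Hashtag", [("MTVAwards", 33),
                                       ("InternationalWomensDay", 2)]))
```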

## Example 9: Finding the most popular retweets

In [65]:
retweets = [
            # Store out a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text'].replace("\n","\\")) 
            
            # ... for each status ...
            for status in statuses 
            
            # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]

We can build another prettyprint function to print entire tweets with their retweet counts.

We also want to split the text of each tweet across up to three lines if needed.

In [70]:
row_template = "{:^7} | {:^15} | {:50}"
def prettyprint_tweets(list_of_tuples):
    print()
    print(row_template.format("Count", "Screen name", "Text"))
    print("*"*60)
    for count, screen_name,text in list_of_tuples:
        print(row_template.format(count, screen_name,text[:50]))
        if len(text) > 50:
            print(row_template.format("", "", text[50:100]))
            if len(text) > 100:
                print(row_template.format("", "", text[100:]))
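The function above handles at most three lines, and the third line takes whatever text remains, however long. A more general variant could cut the text into as many fixed-width chunks as needed. The sketch below assumes the same row layout as `row_template`. `chunk_text` and `format_tweet_rows` are hypothetical names, and the rows are returned rather than printed.

```python
ROW_TEMPLATE = "{:^7} | {:^15} | {:50}"


def chunk_text(text, width=50):
    """Split text into consecutive pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)] or [""]


def format_tweet_rows(count, screen_name, text, width=50):
    """Return one formatted row per chunk; only the first shows count and name."""
    rows = []
    for i, piece in enumerate(chunk_text(text, width)):
        if i == 0:
            rows.append(ROW_TEMPLATE.format(count, screen_name, piece))
        else:
            rows.append(ROW_TEMPLATE.format("", "", piece))
    return rows


for row in format_tweet_rows(7517, "MTV", "x" * 120):
    print(row)
```

The default `width` matches the 50-character text column in the template. Changing one without the other would break the alignment.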
                

In [72]:
# Sort the retweets in descending order, slice off the first 10, and display them

prettyprint_tweets(sorted(retweets, reverse = True)[:10])


 Count  |   Screen name   | Text                                              
************************************************************
 7517   |       MTV       | RT @MTV: Let @dylanobrien guide you through a firs
        |                 | t look at Maze Runner: The Death Cure, exclusively
        |                 |  for the #MTVAwards tonight at 8/7c! 💥…           
 4553   |       MTV       | RT @MTV: My queens @chloexhalle slayed their perfo
        |                 | rmance at the #MTVAwards 😍 https://t.co/cDZ8kwmkF9
 3311   |  dunevillenuve  | RT @dunevillenuve: chloe and halle are angels beyo
        |                 | ncé really did bless them #mtvawards  https://t.co
        |                 | /y2fDHMj3rs                                       
 2674   | WolverineMovie  | RT @WolverineMovie: Like father, like daughter. @R
        |                 | ealHughJackman &amp; @DafneKeen win #MTVAwards for
        |                 |  Best Duo. #Logan https://t.co/yKnXIWc3kn