# Twitter data

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

### Please Note:

This notebook may look a little different from the video content in the course. It has been rewritten to use Tweepy, which is generally an easier package to work with. The old notebooks are still available on the course downloads page, but they will not be regularly updated.

# Twitter API Access

To make requests to Twitter's API, you'll need to go to https://dev.twitter.com/apps and create a sample application.

Choose any name for your application, write a description and use `http://google.com` for the website. Further instructions can be found in week 6 of the course.

Under **Key and Access Tokens**, there are four primary identifiers you'll need to note: 
* consumer key, 
* consumer secret, 
* access token, and 
* access token secret (Click on Create Access Token to create those).

Note that you will need an ordinary Twitter account in order to log in, create an app, and get these credentials.
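
Hard-coding these four values in a notebook makes them easy to leak when the notebook is shared. One possible alternative is to read them from environment variables. The sketch below is only an illustration: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are assumptions made for this example, not names defined by Twitter or Tweepy.

```python
import os

# Hypothetical environment variable names; choose whatever names you like.
CREDENTIAL_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_NAMES}


# Demonstration with a fake mapping, so the example runs without real secrets.
fake_env = {name: "placeholder-" + str(i) for i, name in enumerate(CREDENTIAL_NAMES)}
creds = load_twitter_credentials(fake_env)
print(sorted(creds))
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of string literals.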

Install the `tweepy` package to interface with the Twitter API:

In [2]:
#pip install for the package we will be using
!pip install tweepy

Collecting tweepy
  Using cached tweepy-3.9.0-py2.py3-none-any.whl (30 kB)
Collecting requests-oauthlib>=0.7.0
  Using cached requests_oauthlib-1.3.0-py2.py3-none-any.whl (23 kB)
Collecting oauthlib>=3.0.0
  Using cached oauthlib-3.1.0-py2.py3-none-any.whl (147 kB)
Installing collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-3.1.0 requests-oauthlib-1.3.0 tweepy-3.9.0


## Example 1. Authorizing an application to access Twitter account data

In [3]:
import tweepy

# Set up the keys and tokens.
# Replace each placeholder with the value from your own Twitter app;
# avoid sharing a notebook that contains real credentials.
c_k = "YOUR_CONSUMER_KEY"
c_s = "YOUR_CONSUMER_SECRET"

a_t = "YOUR_ACCESS_TOKEN"
a_s = "YOUR_ACCESS_TOKEN_SECRET"

auth = tweepy.OAuthHandler(c_k, c_s)
auth.set_access_token(a_t, a_s)
api = tweepy.API(auth)

# Nothing to see by displaying api except that it's now a
# defined variable

print(api)

<tweepy.api.API object at 0x000001C0F74DEF40>


## Example 2. Retrieving trends

Twitter identifies locations using the Yahoo! Where On Earth ID (WOE ID).

The Yahoo! Where On Earth ID for the entire world is 1.
See https://dev.twitter.com/docs/api/1.1/get/trends/place and
http://developer.yahoo.com/geo/geoplanet/ for details.

You can also look at the BOSS PlaceFinder here: https://developer.yahoo.com/boss/placefinder/

To look up the ID for an area, use:
https://www.findmecity.com/

In [4]:
WORLD_WOE_ID = 1
US_WOE_ID = 23424977

Look up the WOE ID for "Calgary" and you should find the ID defined below as `LOCAL_WOE_ID`.

You can change this to another location if you would like.

In [5]:
LOCAL_WOE_ID=8775 # Calgary

# trends_place takes the WOE ID of a location and returns a list
# containing one dict with the trends for that place.

world_trends = api.trends_place(WORLD_WOE_ID)
us_trends = api.trends_place(US_WOE_ID)
local_trends = api.trends_place(LOCAL_WOE_ID)

In [6]:
world_trends[:2]

[{'trends': [{'name': '#FreeCodeFridayContest',
    'url': 'http://twitter.com/search?q=%23FreeCodeFridayContest',
    'promoted_content': None,
    'query': '%23FreeCodeFridayContest',
    'tweet_volume': None},
   {'name': 'Oscar Isaac',
    'url': 'http://twitter.com/search?q=%22Oscar+Isaac%22',
    'promoted_content': None,
    'query': '%22Oscar+Isaac%22',
    'tweet_volume': 20609},
   {'name': '#OurHomeSoobinDay',
    'url': 'http://twitter.com/search?q=%23OurHomeSoobinDay',
    'promoted_content': None,
    'query': '%23OurHomeSoobinDay',
    'tweet_volume': 496282},
   {'name': '#luppo',
    'url': 'http://twitter.com/search?q=%23luppo',
    'promoted_content': None,
    'query': '%23luppo',
    'tweet_volume': None},
   {'name': '#BJKvKAS',
    'url': 'http://twitter.com/search?q=%23BJKvKAS',
    'promoted_content': None,
    'query': '%23BJKvKAS',
    'tweet_volume': 17343},
   {'name': '#STFOrganizacaoCriminosa',
    'url': 'http://twitter.com/search?q=%23STFOrganizacaoCrim

In [7]:
trends=local_trends
print(type(trends))
print(list(trends[0].keys()))
print(trends[0]['trends'])

<class 'list'>
['trends', 'as_of', 'created_at', 'locations']
[{'name': '#FreeCodeFridayContest', 'url': 'http://twitter.com/search?q=%23FreeCodeFridayContest', 'promoted_content': None, 'query': '%23FreeCodeFridayContest', 'tweet_volume': None}, {'name': 'Terry Fox', 'url': 'http://twitter.com/search?q=%22Terry+Fox%22', 'promoted_content': None, 'query': '%22Terry+Fox%22', 'tweet_volume': None}, {'name': 'Bytowne', 'url': 'http://twitter.com/search?q=Bytowne', 'promoted_content': None, 'query': 'Bytowne', 'tweet_volume': None}, {'name': 'Oscar Isaac', 'url': 'http://twitter.com/search?q=%22Oscar+Isaac%22', 'promoted_content': None, 'query': '%22Oscar+Isaac%22', 'tweet_volume': 20609}, {'name': 'CERB', 'url': 'http://twitter.com/search?q=CERB', 'promoted_content': None, 'query': 'CERB', 'tweet_volume': None}, {'name': 'Solid Snake', 'url': 'http://twitter.com/search?q=%22Solid+Snake%22', 'promoted_content': None, 'query': '%22Solid+Snake%22', 'tweet_volume': 20876}, {'name': '#FridayFe
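
Each trend is a plain dict, and `tweet_volume` is sometimes `None`, so sorting by volume needs a small guard. The sketch below ranks a few trends copied from the world output above (only the `name` and `tweet_volume` fields are kept). The helper name `rank_by_volume` and the choice to treat a missing volume as 0 are assumptions made for this illustration.

```python
# A few entries copied from the world_trends output, trimmed to two fields.
sample_trends = [
    {"name": "#FreeCodeFridayContest", "tweet_volume": None},
    {"name": "Oscar Isaac", "tweet_volume": 20609},
    {"name": "#OurHomeSoobinDay", "tweet_volume": 496282},
    {"name": "#luppo", "tweet_volume": None},
    {"name": "#BJKvKAS", "tweet_volume": 17343},
]


def rank_by_volume(trends):
    """Sort trends by tweet_volume, largest first; None counts as 0."""
    return sorted(trends, key=lambda t: t["tweet_volume"] or 0, reverse=True)


ranked_names = [t["name"] for t in rank_by_volume(sample_trends)]
print(ranked_names)
```

With the real data, the same function could be applied to `world_trends[0]['trends']`.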

## Example 3. Displaying API responses as pretty-printed JSON

In [8]:
import json

print((json.dumps(us_trends[:2], indent=1)))

[
 {
  "trends": [
   {
    "name": "#FreeCodeFridayContest",
    "url": "http://twitter.com/search?q=%23FreeCodeFridayContest",
    "promoted_content": null,
    "query": "%23FreeCodeFridayContest",
    "tweet_volume": null
   },
   {
    "name": "Oscar Isaac",
    "url": "http://twitter.com/search?q=%22Oscar+Isaac%22",
    "promoted_content": null,
    "query": "%22Oscar+Isaac%22",
    "tweet_volume": 20609
   },
   {
    "name": "#pmgiveaway",
    "url": "http://twitter.com/search?q=%23pmgiveaway",
    "promoted_content": null,
    "query": "%23pmgiveaway",
    "tweet_volume": 29742
   },
   {
    "name": "#NationalCookieDay",
    "url": "http://twitter.com/search?q=%23NationalCookieDay",
    "promoted_content": null,
    "query": "%23NationalCookieDay",
    "tweet_volume": null
   },
   {
    "name": "Selena",
    "url": "http://twitter.com/search?q=Selena",
    "promoted_content": null,
    "query": "Selena",
    "tweet_volume": 106578
   },
   {
    "name": "Pelosi",
    "url": "

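
By default `json.dumps` escapes non-ASCII characters, which makes trend names in other scripts hard to read. The sketch below compares the default output with `ensure_ascii=False` on a single trend name taken from the world trends in this notebook; the rest of the sample dict is trimmed for brevity.

```python
import json

# One trend name from the world trends, trimmed to two fields.
sample = [{"name": "リュウソウジャー", "tweet_volume": None}]

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(sample, indent=1)

# ensure_ascii=False keeps the characters; sort_keys gives a stable key order.
readable = json.dumps(sample, indent=1, ensure_ascii=False, sort_keys=True)

print(escaped)
print(readable)
```

Both strings decode back to the same Python object, so the choice only affects how the output looks.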
## Example 4. Computing the intersection of two sets of trends

In [9]:
trends_set = {}
trends_set['world'] = set([trend['name'] 
                        for trend in world_trends[0]['trends']])

trends_set['us'] = set([trend['name'] 
                     for trend in us_trends[0]['trends']]) 

trends_set['Calgary'] = set([trend['name'] 
                     for trend in local_trends[0]['trends']]) 

In [10]:
for loc in ['world','us','Calgary']:
    print(('-'*10,loc))
    print((','.join(trends_set[loc])))

('----------', 'world')
ReformunAdı TersKelepçe,Mika,Metal Gear Solid,Russell,KalptekiAşk KeremBürsin,Boba Fett,Pelosi,Mulan,Matt Gaetz,リュウソウジャー,COMO ME CONQUISTAR PELA BOCA,#STFOrganizacaoCriminosa,Marijuana,Joe Anderson,The House,Bottas,Kojima,The Royals,Marcius Melhem,Dani Calabresa,Atiba,#OurHomeSoobinDay,CNBC,Centro de Gobierno,Selena,Fred Hampton,#FreeCodeFridayContest,Momon,#BJKvKAS,Akaashi,Bonner,ういちゃん,Mank,Anouk,Oscar Isaac,PicPay,#luppo,じゅじゅさんぽ,Euphoria,Big Mouth,João Monlevade,FUBU,Ghezzal,Happy Founders,GelecekSende BilgiOcakta,Rick Santelli,Constituição,Solid Snake,Big Boss,Shakira
('----------', 'us')
Royals,Soda,#NationalCookieDay,Defense Business Board,Metal Gear Solid,Boba Fett,Pelosi,Queen of England,Matt Gaetz,Marijuana,Psycho Mantis,Cerberus,soobin,Beer,#pmgiveaway,The House,Michelle Bachman,Kojima,#TheMandalorian,Slave 4 U,DeJoy,Prada bag,CNBC,Mass Effect 2,5 Republicans,Selena,Fred Hampton,#FreeCodeFridayContest,Jay-Z,Akaashi,Reasonable Doubt,Tea Party,Mank,Oscar 

In [11]:
print(( '='*10,'intersection of world and us'))
print((trends_set['world'].intersection(trends_set['us'])))

print(('='*10,'intersection of us and Calgary'))
print((trends_set['Calgary'].intersection(trends_set['us'])))

{'#NationalCatDay', 'pewds', 'Willock', 'Davies', 'Paul Rudd', 'La Russa', 'Miracle Whip', 'Hahn', 'Corbyn', 'Ivy Park', 'Jerry Reinsdorf', 'White Sox', 'Tottenham', 'Vanity Fair', 'Intercept'}
{'#NationalCatDay', 'Jay Cutler', 'pewds', 'DaBaby', 'Mitchell Miller', 'Davies', 'Vanity Fair', 'Paul Rudd', 'Zlatan', 'Ivy Park', 'Kendrick', 'White Sox', 'Glenn', 'Greenwald', 'Willock', 'Tottenham', 'Intercept'}
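
The same set operations extend to more than two locations. The sketch below computes every pairwise intersection, the trends common to all locations, and the trends unique to one location. The sets are small hand-picked subsets of the trend names shown earlier, used only for illustration; they are not the full API results.

```python
from itertools import combinations

# Hand-picked subsets of the trend names above (illustrative only).
trends_sample = {
    "world": {"#FreeCodeFridayContest", "Oscar Isaac", "Boba Fett", "#luppo"},
    "us": {"#FreeCodeFridayContest", "Oscar Isaac", "Boba Fett", "#pmgiveaway"},
    "Calgary": {"#FreeCodeFridayContest", "Oscar Isaac", "Terry Fox", "CERB"},
}

# Intersection for every pair of locations.
pairwise = {
    (a, b): trends_sample[a] & trends_sample[b]
    for a, b in combinations(trends_sample, 2)
}

# Trends that appear in every location.
common_to_all = set.intersection(*trends_sample.values())

# Trends that appear only in Calgary.
only_calgary = trends_sample["Calgary"] - trends_sample["world"] - trends_sample["us"]

for pair, shared in pairwise.items():
    print(pair, sorted(shared))
print("all:", sorted(common_to_all))
print("Calgary only:", sorted(only_calgary))
```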


## Example 5. Collecting search results

Set the variable `q` to a trending topic, or anything else for that matter. The example query below was a trending topic when this content was being developed and is used throughout the remainder of this notebook.

In [11]:
# You can change this to whatever hashtag you want, but if the tag isn't
# popular enough you might not get back a lot of results
q = "Boba Fett"

number = 100

search_results = tweepy.Cursor(api.search, q=q, lang="en").items(number)

#This will give us an Iterator
print(search_results)

# We will be looking at the tags "retweeted", "retweet count",
# and the text we found earlier
tweets = []
retweeted = []
retweet_count = []

for tweet in search_results:
    tweets.append(tweet.text)
    retweet_count.append(tweet.retweet_count)
    # This if/else just checks the number of retweets and defines "retweeted"
    # based on that value
    if tweet.retweet_count > 0:
        retweeted.append(True)
    else:
        retweeted.append(False)


#tweets

<tweepy.cursor.ItemIterator object at 0x000001C0F8525550>
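
Note that `retweet_count > 0` is true both for retweets and for original tweets that other people have retweeted. If the goal is to tell the two apart, one option is to check whether the status carries a `retweeted_status` attribute. The sketch below assumes that Tweepy only sets this attribute on retweets, and it uses `SimpleNamespace` stand-ins rather than real Tweepy objects; `is_retweet` is a hypothetical helper.

```python
from types import SimpleNamespace


def is_retweet(tweet):
    """True if the status wraps another status (assumed Tweepy behaviour)."""
    return hasattr(tweet, "retweeted_status")


# Stand-in objects that only mimic the attributes used here.
original = SimpleNamespace(text="Boba Fett is back", retweet_count=12)
retweet = SimpleNamespace(
    text="RT @someone: Boba Fett is back",
    retweet_count=12,
    retweeted_status=original,
)

flags = [is_retweet(t) for t in (original, retweet)]
print(flags)
```

Here both stand-ins have a positive `retweet_count`, but only the second one is flagged as a retweet.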


In [12]:
# Not necessary, but this does make the data look pretty
import pandas as pd

df = pd.DataFrame({'Tweet':tweets, 'Retweeted':retweeted, "Retweet Count":retweet_count})

df

| | Tweet | Retweeted | Retweet Count |
|---|---|---|---|
| 0 | people are really fat shaming boba fett??? you... | False | 0 |
| 1 | RT @edckbar875: Favreau and Filoni, in just 30... | True | 610 |
| 2 | Boba Fett is back 🔥🔥🔥 https://t.co/NSttr9YuEk | False | 0 |
| 3 | RT @CaptainFireFeet: The #Mandalorian Chapter ... | True | 700 |
| 4 | “What follows is about 15 minutes of outstandi... | False | 0 |
| ... | ... | ... | ... |
| 95 | Boba Fett!!!! | False | 0 |
| 96 | RT @mauldaIorian: mando spoilers / #TheMandalo... | True | 72 |
| 97 | RT @edckbar875: Favreau and Filoni, in just 30... | True | 610 |
| 98 | RT @MattM0411: Boba Fett is a fucking GOAT #Th... | True | 103 |


Twitter often returns duplicate results. We can filter them out by checking for duplicate texts:

In [13]:
# Keep only the first occurrence of each tweet text
filtered_tweets = []
for t in tweets:
    if t not in filtered_tweets:
        filtered_tweets.append(t)
#filtered_tweets
filtered_tweets[0]

'people are really fat shaming boba fett??? you mfs weird fr'

In [14]:
#This gives us the number of all of the unique tweets from our search results
print(len(filtered_tweets))
if len(filtered_tweets) < len(tweets):
    print("There were duplicates in our search results!")

62
There were duplicates in our search results!
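
Checking membership in a list is fine for 100 tweets but gets slow for large collections, because every check scans the whole list. A common alternative is `dict.fromkeys`, which keeps the first occurrence of each item in its original order. The sketch below uses made-up sample texts, and `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


# Made-up sample texts standing in for the tweets list.
sample_tweets = ["Boba Fett!!!!", "RT @a: hi", "Boba Fett!!!!", "RT @a: hi", "new"]

unique = unique_in_order(sample_tweets)
print(len(sample_tweets), "->", len(unique))
```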


## Example 6. Creating a basic frequency distribution from the words in tweets

In [15]:
from collections import Counter

words = []

for t in tweets:
    for word in t.split():
        words.append(word)
        
c = Counter(words)
c.most_common(10)

[('Boba', 74),
 ('RT', 66),
 ('Fett', 59),
 ('the', 56),
 ('and', 45),
 ('in', 39),
 ('was', 35),
 ('a', 32),
 ('-', 30),
 ('#TheMandalorian', 24)]
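
The top of this list is dominated by filler tokens such as `RT`, `the`, and `-`, and a plain `split()` counts `Fett` and `Fett!!!!` separately. A small normalization step can help. The sketch below lower-cases tokens, trims punctuation from their edges, and drops a tiny hand-made stop list. The stop list and the `tokenize` helper are assumptions for this illustration; the sample texts are copied from tweets shown elsewhere in this notebook.

```python
from collections import Counter

# Assumption: a tiny hand-made stop list based on the filler tokens in the
# top-10 output above. Real stop lists are much longer.
STOP_WORDS = {"rt", "the", "and", "in", "was", "a", "-"}


def tokenize(text):
    """Lower-case each whitespace-separated token and trim edge punctuation."""
    tokens = (word.lower().strip(".,!?:;\"'") for word in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


sample_tweets = [
    "Boba Fett is back 🔥🔥🔥 https://t.co/NSttr9YuEk",
    "Boba Fett!!!!",
    "RT @CaptainFireFeet: The #Mandalorian Chapter 14 was lit. "
    "They brought in Boba Fett and everything https://t.co/mNmNgiScEr",
    "RT @GagliardiChase: Boba Fett after cracking open yet another "
    "stormtooper helmet #mandalorian https://t.co/9upnhQy7vk",
]

counts = Counter(token for tweet in sample_tweets for token in tokenize(tweet))
print(counts.most_common(3))
```

Lower-casing also merges `#Mandalorian` and `#mandalorian` into a single entry.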

## Example 7. Create a prettyprint function to display tuples in a nice tabular format

In [16]:
def prettyprint_counts(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))

In [17]:
for label, data in (('Word', words), 
                    ('Retweet_count', retweet_count)):
    
    c = Counter(data)
    prettyprint_counts(label, c.most_common()[:10])


        Word         | Count 
****************************************
Boba                 |     74
RT                   |     66
Fett                 |     59
the                  |     56
and                  |     45
in                   |     39
was                  |     35
a                    |     32
-                    |     30
#TheMandalorian      |     24

   Retweet_count     | Count 
****************************************
                   0 |     33
                 610 |     14
                 137 |      7
                 693 |      5
                   1 |      5
                 700 |      4
                 293 |      4
                 260 |      3
                   2 |      3
                  36 |      3


## Example 8. Finding the most popular retweets

In [18]:
# This sets up a filter for our dataset that only leaves data with Retweeted
# marked as true
filter1 = df['Retweeted'] == True

#This is a built in pandas operation that will filter the data given the filter
rt_df = df.where(filter1)

#Now we will have a new df without any NaN values
rt_df = rt_df.dropna()

#The indices will look odd, but this is because it is keeping the old indices
rt_df.head(10)

| | Tweet | Retweeted | Retweet Count |
|---|---|---|---|
| 1 | RT @edckbar875: Favreau and Filoni, in just 30... | 1.0 | 610.0 |
| 3 | RT @CaptainFireFeet: The #Mandalorian Chapter ... | 1.0 | 700.0 |
| 7 | RT @thchxsenone: cw // #TheMandalorian chapter... | 1.0 | 260.0 |
| 9 | RT @GagliardiChase: Boba Fett after cracking o... | 1.0 | 293.0 |
| 10 | RT @TheGeekyPeep: Boba Fett: “I’m just a simpl... | 1.0 | 693.0 |
| 14 | RT @TheGeekyPeep: Boba Fett: “I’m just a simpl... | 1.0 | 693.0 |
| 15 | RT @edckbar875: Favreau and Filoni, in just 30... | 1.0 | 610.0 |
| 16 | RT @24Orangemamba: Mando and Fennec watching B... | 1.0 | 128.0 |
| 17 | RT @Mericam49: Most people: “WOW Boba Fett’s f... | 1.0 | 137.0 |
| 18 | RT @Mericam49: Most people: “WOW Boba Fett’s f... | 1.0 | 137.0 |
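
In the output above, `Retweeted` and `Retweet Count` appear as `1.0` and `610.0` rather than `True` and `610`. This is most likely because `df.where` fills the non-matching rows with `NaN` before `dropna` removes them, which changes the column types. Boolean indexing selects the rows directly and keeps the original types. The sketch below uses a small made-up frame with the same column names.

```python
import pandas as pd

# Made-up rows with the same columns as df.
sample_df = pd.DataFrame({
    "Tweet": ["original tweet", "RT @a: hello", "RT @b: world"],
    "Retweeted": [False, True, True],
    "Retweet Count": [0, 610, 700],
})

# Select only the rows where the boolean column is True.
rt_sample = sample_df[sample_df["Retweeted"]]

print(rt_sample.dtypes)
print(rt_sample)
```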


We can sort this dataframe in descending order of the number of retweets using `df.sort_values()`.

In [19]:
rt_df_sorted = rt_df.sort_values(by="Retweet Count", ascending=False)

rt_df_sorted.head(5)

| | Tweet | Retweeted | Retweet Count |
|---|---|---|---|
| 3 | RT @CaptainFireFeet: The #Mandalorian Chapter ... | 1.0 | 700.0 |
| 24 | RT @CaptainFireFeet: The #Mandalorian Chapter ... | 1.0 | 700.0 |
| 50 | RT @CaptainFireFeet: The #Mandalorian Chapter ... | 1.0 | 700.0 |
| 54 | RT @CaptainFireFeet: The #Mandalorian Chapter ... | 1.0 | 700.0 |
| 80 | RT @TheGeekyPeep: Boba Fett: “I’m just a simpl... | 1.0 | 693.0 |
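
Because the search returned the same retweet several times, the top rows repeat one tweet. One way to list each tweet only once is to call `drop_duplicates` on the `Tweet` column before sorting. The sketch below uses a small made-up frame with the same column names; `reset_index(drop=True)` is optional and simply renumbers the rows.

```python
import pandas as pd

# Made-up rows with the same columns as rt_df; two rows share the same text.
rt_sample = pd.DataFrame({
    "Tweet": ["RT @a: hello", "RT @b: world", "RT @a: hello", "RT @c: again"],
    "Retweet Count": [700, 693, 700, 610],
})

top_unique = (
    rt_sample
    .drop_duplicates(subset="Tweet")                  # keep first copy of each text
    .sort_values(by="Retweet Count", ascending=False)  # most retweeted first
    .reset_index(drop=True)                            # renumber rows from 0
)

print(top_unique)
```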


We can build another `prettyprint` function to print entire tweets with their retweet count.

Ideally, the text of each tweet would also be split into up to 3 lines if needed. The version below does not wrap yet, so long tweets overflow the first column.

In [20]:
### Remember our prettyprint_counts function from above.
### This version keeps the same body for now.
def prettyprint_counts_modified(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))
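
One way to split tweet text into up to 3 lines is the standard library's `textwrap.wrap` with its `max_lines` argument. The sketch below is an illustration rather than the notebook's own implementation: `wrap_tweet` and `prettyprint_tweets` are hypothetical helpers, the column width of 40 is arbitrary, and the counts in the sample rows are made up.

```python
import textwrap


def wrap_tweet(text, width=60, max_lines=3):
    """Split text into at most max_lines lines of at most width characters."""
    return textwrap.wrap(text, width=width, max_lines=max_lines, placeholder=" ...")


def prettyprint_tweets(rows, width=60, max_lines=3):
    """Print (tweet, count) pairs, wrapping each tweet over several lines."""
    print("\n{:^{w}} | {:^6}".format("Tweet", "Count", w=width))
    print("*" * (width + 9))
    for text, count in rows:
        lines = wrap_tweet(text, width, max_lines) or [""]
        # The count goes on the first line; continuation lines leave it blank.
        print("{:{w}} | {:>6}".format(lines[0], count, w=width))
        for line in lines[1:]:
            print("{:{w}} |".format(line, w=width))


# Sample rows; the counts are illustrative.
sample_rows = [
    ("RT @CaptainFireFeet: The #Mandalorian Chapter 14 was lit. "
     "They brought in Boba Fett and everything https://t.co/mNmNgiScEr", 4),
    ("Boba Fett!!!!", 1),
]
prettyprint_tweets(sample_rows, width=40)
```

Newlines inside a tweet are treated as ordinary whitespace by `textwrap.wrap`, so multi-line tweets are reflowed rather than breaking the table.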

In [21]:
rt_tweets = rt_df_sorted["Tweet"]
rt_re_count = rt_df_sorted["Retweet Count"]

for label, data in (('Tweet', rt_tweets), 
                    ('Retweet_count', rt_re_count)):
    
    c2 = Counter(data)
    prettyprint_counts_modified(label, c2.most_common()[:5])


       Tweet         | Count 
****************************************
RT @edckbar875: Favreau and Filoni, in just 30 minutes, made me believe Boba Fett is the badass we’ve been told he was in the past 40 years… |     14
RT @Mericam49: Most people: “WOW Boba Fett’s fight scene was so cool! What a badass!!”

Me, thinking about how Jango was a foundling and fo… |      7
RT @TheGeekyPeep: Boba Fett: “I’m just a simple man making his way through the galaxy.”
Us:

#TheMandalorian https://t.co/RDv2QckjaB |      5
RT @CaptainFireFeet: The #Mandalorian Chapter 14 was lit. They brought in Boba Fett and everything https://t.co/mNmNgiScEr |      4
RT @GagliardiChase: Boba Fett after cracking open yet another stormtooper helmet #mandalorian https://t.co/9upnhQy7vk |      4

   Retweet_count     | Count 
****************************************
               610.0 |     14
               137.0 |      7
               693.0 |      5
                 1.0 |      5
               700.0 |      4
