Summary and original post @ http://kldavenport.com/examining-your-presence-on-twitter-with-python

## Before we start
To start, you will need a Twitter account so you can obtain credentials (an API key, API secret, access token, and access token secret) on the Twitter developer site. These credentials let you access the Twitter API.

If you don't have Python installed, skip native setups or Homebrew and go for [Conda](https://www.continuum.io/downloads). Then `pip install twitter`.

To get your credentials, follow these steps:

1. Create a Twitter user account if you don't have one already.
2. Log in to https://apps.twitter.com/ to access your Twitter dev account.
3. Click “Create New App” -> “Create your Twitter application”.
4. Click on the “Keys and Access Tokens” tab, and copy your “API key” and “API secret”. Scroll down and click “Create my access token”, and copy your “Access token” and “Access token secret”. You'll use these in the Python code below.

## Setting up our environment
There are many Python libraries that abstract away the details of accessing the Twitter API and RESTful APIs in general (the syntax alone doesn't build intuition about the underlying service calls). I'm not looking to write a pure Python implementation with urllib2 and json, so I reached for [Python Twitter Tools](https://pypi.python.org/pypi/twitter/1.7.1).

In [68]:
%matplotlib inline  

import json
from urllib import unquote # Need unquote to prevent url encoding errors in 'next_results'
import joblib
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from time import gmtime, strftime, strptime
from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream

# Variables that contain the user credentials to access the Twitter API
ACCESS_TOKEN = ''
ACCESS_SECRET = '' # ACCESS_TOKEN_SECRET
CONSUMER_KEY = ''
CONSUMER_SECRET = ''

oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
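
Pasting keys straight into a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, reading the four values from environment variables, is below. The variable names (`TWITTER_ACCESS_TOKEN` and so on) and the `load_credentials` helper are hypothetical placeholders, not something the `twitter` package defines.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
CREDENTIAL_VARS = ('TWITTER_ACCESS_TOKEN', 'TWITTER_ACCESS_SECRET',
                   'TWITTER_CONSUMER_KEY', 'TWITTER_CONSUMER_SECRET')

def load_credentials(environ=os.environ):
    """Return the four credentials in the order used for OAuth() in this post."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        # Fail early with a readable message instead of an opaque auth error later
        raise RuntimeError('Missing credentials: {}'.format(', '.join(missing)))
    return tuple(environ[name] for name in CREDENTIAL_VARS)
```

With that in place, `oauth = OAuth(*load_credentials())` could stand in for the hard-coded assignments, since the tuple comes back in the same order the post passes to `OAuth`.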

## Retrieving Tweets 
Some of the other Twitter packages have built-in search pagination functions. Below I've created my own based on the `next_results` attribute of the returned JSON. Twitter limits results to 100 per "page", so I'll need to follow the cursor and collect the results in chunks.

We'll be retrieving a list of dictionaries where each dict is a tweet and all associated metadata.
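
The one fiddly step in following `next_results` is turning its query string back into keyword arguments. As a rough sketch of that step in isolation, the standard library's `parse_qsl` can split and URL-decode the string in one go. The helper name `next_results_to_kwargs` is made up for illustration, and the example string is the one that shows up in the #cervelo run later in this post.

```python
try:
    from urllib.parse import parse_qsl  # Python 3
except ImportError:
    from urlparse import parse_qsl      # Python 2

def next_results_to_kwargs(next_results):
    """Turn '?max_id=...&q=%23tag' into a dict of keyword arguments."""
    # lstrip('?') drops the leading question mark; parse_qsl URL-decodes each value
    return dict(parse_qsl(next_results.lstrip('?')))

example = '?max_id=710195435001012223&q=%23cervelo&count=100&include_entities=1&result_type=mixed'
kwargs = next_results_to_kwargs(example)
```

Note that every value comes back as a string, so `count` here is `'100'` rather than an integer.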


I've annotated the code below.

- Twitter Search API per https://dev.twitter.com/rest/public/search
- Twitter API limit from https://dev.twitter.com/rest/public/rate-limits

We actually won't get many results per Twitter's documentation:

*indices of recent or popular Tweets and behaves similarly to, but not exactly like the Search feature available in Twitter mobile or web clients, such as Twitter.com search. The Twitter Search API searches against a sampling of recent Tweets published in the past 7 days. Before getting involved, it’s important to know that the Search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results*

If we wanted more data we could use the [Streaming API](https://dev.twitter.com/streaming/overview) to capture all data within daily usage limits over the next x days. This, however, won't let us hit the ground running as we won't have historical data. According to the documentation it looks like the Streaming API isn't post-processed by Twitter the way the Search API is.

In [69]:
twitter = Twitter(auth=oauth) # Initiate the connection to Twitter REST API

# 'count' is the number of tweets to return per page, up to a maximum of 100 (default 15).
# q is case insensitive
def twt_search(q, count):
    search_results = twitter.search.tweets(q=q,
                                           count=count,
                                           result_type='mixed',
                                           include_entities=True
                                           ) # result_type=popular, lang='en'
    statuses = search_results['statuses'] # Search Results returns a dict w/ keys ['search_metadata','statuses']

    # Iterate through up to 5 more batches of results by following the cursor
    for _ in range(5):
        print "{} distinct tweets".format(len(statuses))
        try:
            next_results = search_results['search_metadata']['next_results']
            print next_results
            print "Fetching next results"
        # No more results if next_results does not exist
        except KeyError: # KeyError when key does not exist in dict.
            print "No more results"
            break

        # Create a dictionary from next_results such as ?max_id=xxxxxx&include_entities=1
        # Splitting on the first '=' only guards against values that contain '='
        kwargs = dict(kv.split('=', 1) for kv in unquote(next_results[1:]).split("&")) # define keyword arguments

        search_results = twitter.search.tweets(**kwargs)
        statuses += search_results['statuses']

        # Store a snapshot of the results
        joblib.dump(statuses, '{}_twitter_statuses.pkl'.format(strftime("%Y%m%d_%H%M")))
        
    return statuses

In [70]:
statuses = twt_search('#absoluteblack',100) 

18 distinct tweets
No more results
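
The function labels its running total as "distinct tweets", but it only reports `len(statuses)` and never checks for repeats. Whether consecutive pages can actually overlap is not something I've verified here. If it matters for your analysis, a small sketch of de-duplicating on the tweet `id` could look like this (`dedupe_statuses` is a hypothetical helper, shown on toy data):

```python
def dedupe_statuses(statuses):
    """Keep the first occurrence of each tweet id, preserving order."""
    seen = set()
    unique = []
    for tweet in statuses:
        if tweet['id'] not in seen:
            seen.add(tweet['id'])
            unique.append(tweet)
    return unique

# Toy data standing in for real API results
sample = [{'id': 1, 'text': 'a'}, {'id': 2, 'text': 'b'}, {'id': 1, 'text': 'a'}]
unique_sample = dedupe_statuses(sample)
```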


## Some interesting metadata
Examining the keys from one of our dicts (tweets):

In [71]:
statuses[0].keys()

[u'contributors',
 u'truncated',
 u'text',
 u'is_quote_status',
 u'in_reply_to_status_id',
 u'id',
 u'favorite_count',
 u'entities',
 u'retweeted',
 u'coordinates',
 u'source',
 u'in_reply_to_screen_name',
 u'in_reply_to_user_id',
 u'retweet_count',
 u'id_str',
 u'favorited',
 u'user',
 u'geo',
 u'in_reply_to_user_id_str',
 u'possibly_sensitive',
 u'lang',
 u'created_at',
 u'in_reply_to_status_id_str',
 u'place',
 u'metadata']

Taking a look at one tweet and its metadata (IPython pretty prints JSON):

In [101]:
statuses[0]

{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Wed Mar 23 22:18:31 +0000 2016',
 u'entities': {u'hashtags': [{u'indices': [81, 95], u'text': u'absoluteBLACK'},
   {u'indices': [96, 106], u'text': u'evilbikes'},
   {u'indices': [107, 112], u'text': u'MTBR'}],
  u'media': [{u'display_url': u'pic.twitter.com/K7pqSyO9Yy',
    u'expanded_url': u'http://twitter.com/KevinLDavenport/status/712765496571990016/photo/1',
    u'id': 712765495506706434,
    u'id_str': u'712765495506706434',
    u'indices': [113, 136],
    u'media_url': u'http://pbs.twimg.com/media/CeRAcRfVIAIwagy.jpg',
    u'media_url_https': u'https://pbs.twimg.com/media/CeRAcRfVIAIwagy.jpg',
    u'sizes': {u'large': {u'h': 768, u'resize': u'fit', u'w': 1024},
     u'medium': {u'h': 450, u'resize': u'fit', u'w': 600},
     u'small': {u'h': 255, u'resize': u'fit', u'w': 340},
     u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}},
    u'type': u'photo',
    u'url': u'https://t.co/K7pqSyO9Yy'}],
  u'symbols': [

## Examining Our Data
If you were tracking a group of Twitter users, you might assess their engagement with their followers or their popularity in the community by examining their `retweet_count` and `favorite_count` values.
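
As a rough illustration of that idea, the sketch below adds up `retweet_count` and `favorite_count` per author and ranks the authors. Treating the plain sum as an "engagement" number is an assumption made for the example, not an established metric, and the tweets and counts are made up.

```python
from collections import defaultdict

def engagement_by_user(statuses):
    """Sum retweet_count and favorite_count per screen name."""
    totals = defaultdict(int)
    for tweet in statuses:
        name = tweet['user']['screen_name']
        totals[name] += tweet['retweet_count'] + tweet['favorite_count']
    # Highest combined count first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Toy tweets with made-up counts
sample = [
    {'user': {'screen_name': 'bob'}, 'retweet_count': 1, 'favorite_count': 2},
    {'user': {'screen_name': 'jane'}, 'retweet_count': 3, 'favorite_count': 1},
    {'user': {'screen_name': 'jane'}, 'retweet_count': 0, 'favorite_count': 3},
]
ranking = engagement_by_user(sample)
```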

Maybe we maintain a list of ambassadors somewhere and we would like to see if any of the tweets we retrieve are from authors in that list. We might even want to automatically retweet authors that are from our vetted ambassadors list. 
```python
if tweet["user"]["screen_name"] in ambassadors: 
    api.retweet(tweet["id"])
```
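
A slightly fuller sketch of the membership check is below. It lowercases both sides on the assumption that screen names should be matched case-insensitively, and it uses a set so each lookup is cheap. The helper name and the sample tweets are hypothetical.

```python
def tweets_from_ambassadors(statuses, ambassadors):
    """Return tweets whose author is in the ambassadors collection."""
    # A lowercase set gives fast, case-insensitive membership checks
    wanted = {name.lower() for name in ambassadors}
    return [t for t in statuses if t['user']['screen_name'].lower() in wanted]

# Toy tweets; only the fields the helper needs
sample = [{'id': 1, 'user': {'screen_name': 'KevinLDavenport'}},
          {'id': 2, 'user': {'screen_name': 'someone_else'}}]
matches = tweets_from_ambassadors(sample, ['kevinldavenport', 'bob'])
```
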
Let's look at a broad breakdown of author-specific attributes (followers, text) as well as post-specific ones (time of day, hashtags) below. Pay attention to the formatting tricks that we have to employ to get a table with pretty results.

**Note:** Don't use `map` with a lambda to construct your Pandas dataframe. A list comprehension is easier to read and more Pythonic :)

```python
[tweet['user']['screen_name'] for tweet in statuses]
# versus
map(lambda tweet: tweet['user']['screen_name'], statuses) 
```
See below for some performance numbers (though even a marginal speed gain wouldn't be a reason to switch to map):

In [73]:
%%timeit
[tweet['user']['screen_name'] for tweet in statuses]

100000 loops, best of 3: 5.89 µs per loop


In [74]:
%%timeit
map(lambda tweet: tweet['user']['screen_name'], statuses) 

100000 loops, best of 3: 7.37 µs per loop
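
Outside of IPython, the same comparison can be approximated with the standard `timeit` module. This is only a sketch on toy data: the 18 fake tweets stand in for `statuses`, and the absolute timings will differ from the numbers above depending on the machine and Python version.

```python
import timeit

# Toy stand-in for `statuses`: 18 minimal tweet dicts
sample = [{'user': {'screen_name': 'user{}'.format(i)}} for i in range(18)]

def with_comprehension():
    return [tweet['user']['screen_name'] for tweet in sample]

def with_map():
    # list() forces evaluation on Python 3, where map() is lazy
    return list(map(lambda tweet: tweet['user']['screen_name'], sample))

# Total seconds for 10,000 calls of each variant
comprehension_secs = timeit.timeit(with_comprehension, number=10000)
map_secs = timeit.timeit(with_map, number=10000)
```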


In [75]:
# funny pandas settings to make the column text in IPython more visible
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth',100) 

ambassadors_list = ['mattheroesking', 'Uninet01', 'KevinLDavenport','bob', 'jane'] # Could be a csv or xls on your desktop

tweets_df = pd.DataFrame()

# About the author
tweets_df['user'] = [tweet['user']['screen_name'] for tweet in statuses]
tweets_df['is_amb'] = [1 if tweet['user']['screen_name'] in ambassadors_list else 0 for tweet in statuses]
# In this case I actually find the lambda more readable >:) 
# map(lambda tweet: 1 if tweet['user']['screen_name'] in ambassadors_list else 0 , statuses)
tweets_df['text'] = [tweet['text'] for tweet in statuses]
tweets_df['lang'] = [tweet['lang'] for tweet in statuses] # 'und' when the language is undetermined
tweets_df['cntry'] = [tweet['place']['country'] if tweet['place'] is not None else None for tweet in statuses]
tweets_df['loc'] = [tweet['user']['location'] if tweet['user'] is not None else None for tweet in statuses]
tweets_df['followers'] = [tweet['user']['followers_count'] for tweet in statuses]

# About the tweet
tweets_df['fav_cnt'] = [tweet['favorite_count'] for tweet in statuses]
tweets_df['rt_cnt'] = [tweet['retweet_count'] for tweet in statuses]
tweets_df['ht_cnt'] = [len(tweet['entities']['hashtags']) for tweet in statuses]
tweets_df['contain_oval'] = [1 if tweet['text'].lower().find('oval') > -1 else 0 for tweet in statuses]
tweets_df['dow'] = [strftime('%a', strptime(tweet['created_at'],'%a %b %d %H:%M:%S +0000 %Y')) for tweet in statuses] 
tweets_df['dt'] = [strftime('%m-%d-%y %H:%M', strptime(tweet['created_at'],'%a %b %d %H:%M:%S +0000 %Y')) for tweet in statuses]

# sort_values() replaces DataFrame.sort(), which newer pandas releases no longer provide
tweets_df.sort_values('followers', ascending=False)

| | user | is_amb | text | lang | cntry | loc | followers | fav_cnt | rt_cnt | ht_cnt | contain_oval | dow | dt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KevinLDavenport | 1 | Thanks to Zoic, spank, and MTBR for the contest! @ZOICclothing @spankbikes @MTBR #absoluteBLACK ... | en | United States | Seattle, WA | 1468 | 1 | 0 | 3 | 0 | Wed | 03-23-16 22:18 |
| 11 | KevinLDavenport | 1 | Looks like the #absoluteBLACK 28T is perfect. First ride of the season. #evilbikes #9point8… htt... | en | United States | Seattle, WA | 1468 | 1 | 0 | 3 | 0 | Fri | 03-18-16 01:14 |
| 14 | gow_w | 0 | フロントシングル化して初走行なう。\nフロント変速捨てたらノイズも無くなって静かになった。\n今のところ違いはそれくらいしか…w\n#深夜自転車部 #absoluteblack https:/... | ja | | shizuoka hamamatsu | 670 | 0 | 0 | 2 | 0 | Wed | 03-16-16 11:26 |
| 15 | King_Dean27 | 0 | Loving this setup! #absoluteBLACK 34T Oval paired with a #praxisworks 11-40 cassette. \n#giantbi... | en | | | 414 | 0 | 0 | 4 | 1 | Wed | 03-16-16 09:55 |
| 10 | BikeZoneBarrie | 0 | Just in !!! All new #absoluteblack black diamond hub set with magnetic driver technology. https:... | en | | ÜT: 44.74145,-79.7862 | 211 | 1 | 0 | 1 | 0 | Fri | 03-18-16 14:22 |
| 6 | bryanlikesbikes | 0 | My bike eats hay. #absoluteblack #adventurecycling #bikelife #pastrybrestpastry https://t.co/BdQ... | en | | Chicago | 157 | 3 | 0 | 4 | 0 | Sun | 03-20-16 15:33 |
| 5 | SplatDoctor | 0 | Get out and Get It!\n#mtb #dirtbags #absoluteBLACK #OvalThis #morrobay #foesracingusa https://t.... | en | | Bakersfield, CA | 135 | 0 | 0 | 6 | 1 | Sun | 03-20-16 18:33 |
| 8 | SplatDoctor | 0 | Getting ready for some coastal shredding this weekend. \n#absoluteBLACK\n#OvalThis\n#Dirtbags… h... | en | | Bakersfield, CA | 135 | 1 | 0 | 3 | 1 | Sat | 03-19-16 06:02 |
| 16 | mattheroesking | 1 | Its hump day, time to get off the #SPIN bike, get on the #mountainbike and trash some #singletra... | en | | Perth Western Australia | 72 | 0 | 0 | 7 | 1 | Tue | 03-15-16 22:37 |
| 7 | mattheroesking | 1 | Went on a great #mountainbike #ride today. And thanks to my #absoluteblack oval chainring I PB'd... | en | | Perth Western Australia | 72 | 0 | 0 | 5 | 1 | Sat | 03-19-16 09:15 |
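
The `dow` and `dt` columns both come from parsing Twitter's `created_at` string. The cell above hard-codes `+0000` in the format string. As an alternative sketch for Python 3 only (Python 2's `strptime` does not accept `%z`), `datetime.strptime` can read the UTC offset and return a timezone-aware value. The constant and helper names here are illustrative, and the sample timestamp is the one from the tweet shown earlier.

```python
from datetime import datetime

# Same layout as the format used for the 'dow' and 'dt' columns, with %z for the offset
CREATED_AT_FORMAT = '%a %b %d %H:%M:%S %z %Y'

def parse_created_at(created_at):
    """Parse Twitter's created_at string into a timezone-aware datetime."""
    return datetime.strptime(created_at, CREATED_AT_FORMAT)

dt = parse_created_at('Wed Mar 23 22:18:31 +0000 2016')
dow = dt.strftime('%a')                # abbreviated weekday name
stamp = dt.strftime('%m-%d-%y %H:%M')  # same layout as the 'dt' column
```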


Let's take a look at the most common day of the week that people are posting. I'm guessing it'll be Friday or Saturday, since it's tough for most people to fit in a midweek ride.

In [76]:
print tweets_df['dow'].value_counts()

print tweets_df['lang'].value_counts()

print tweets_df['cntry'].value_counts()

Tue    4
Wed    4
Fri    3
Sat    2
Sun    2
Mon    2
Thu    1
dtype: int64
en    15
hu     1
ja     1
sv     1
dtype: int64
United States    2
dtype: int64


Given the location of the company I was expecting more posts from Europeans. Looks like we even have a post from Perth, Australia. I came up in the results and have the most followers, but in this case that might not translate into the most influence: I suspect more than half of my followers are from the software community, although coders and cyclists aren't mutually exclusive.

### A LITTLE Natural Language Processing
We might be curious to know what the most common word is that pops up amongst all these tweets. Maybe people who tweet the hashtag #absoluteBLACK always mention a mountain bike brand or location, or maybe it's a tooth size or a word like "easy" or "faster".

We'll use some nifty list comprehensions and itertools to create our list of words, then we'll use the famous Natural Language Toolkit to remove stopwords (common words that carry little meaning on their own, such as 'the', 'and', 'it').

We'll have to download the stopwords corpus first:

In [77]:
%%time
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/balthasar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
CPU times: user 25 ms, sys: 8.62 ms, total: 33.6 ms
Wall time: 890 ms


In [78]:
from itertools import chain
from collections import defaultdict, Counter
from nltk.corpus import stopwords

s = set(stopwords.words('english'))

all_words = list(chain.from_iterable([d['text'].split() for d in statuses]))
print '{} words before'.format(len(all_words))

all_words_nlp = filter(lambda w: w not in s, all_words)
print '{} words after'.format(len(all_words_nlp))

# let's remove variants of 'absoluteblack' from the list
exclude_list = ['#absoluteblack']

# I prefer the list comprehension's readability
# Counter(filter(lambda w: w not in exclude_list, [word.lower() for word in set(all_words_nlp)])).most_common()[0:20]

# Note: set() collapses repeated tokens before counting, so a count above 1
# only means the word appeared in more than one letter case
Counter([i for i in [word.lower() for word in set(all_words_nlp)] if i not in exclude_list]).most_common()[0:20]

227 words before
188 words after


[(u'thanks', 2),
 (u'#ovalthis', 2),
 (u'get', 2),
 (u'oval', 2),
 (u'all', 1),
 (u'#ride', 1),
 (u'https://t.co/ipfk3brpir', 1),
 (u'just', 1),
 (u'magnetic', 1),
 (u'#foesracingusa', 1),
 (u'#masterbath', 1),
 (u'perfect.', 1),
 (u'chainring', 1),
 (u'its', 1),
 (u'#mayfairgranite', 1),
 (u'@mtbr', 1),
 (u'setup!', 1),
 (u'shredding', 1),
 (u'black', 1),
 (u'absoluteblack.cc', 1)]
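
One thing to keep in mind when reading these numbers: the cell passes the words through `set(all_words_nlp)` before counting, so each distinct token is counted once and a 2 only reflects two capitalisations of the same word. If full frequencies are what you're after, a sketch along the following lines counts every occurrence instead. The tiny `STOPWORDS` set is a stand-in for NLTK's English list, the helper name is made up, and the sample texts are invented.

```python
from collections import Counter

# Tiny illustrative stopword list; the post uses NLTK's English corpus instead
STOPWORDS = {'the', 'and', 'it', 'to', 'a', 'my', 'on'}

def word_frequencies(texts, exclude=()):
    """Count lowercased tokens across texts, skipping stopwords and exclusions."""
    skip = STOPWORDS | {word.lower() for word in exclude}
    # Lowercase before filtering so 'The' and 'the' are treated alike
    tokens = (word.lower() for text in texts for word in text.split())
    return Counter(token for token in tokens if token not in skip)

sample_texts = ['Loving the oval ring #absoluteBLACK',
                'Oval ring on my bike #absoluteblack',
                'First ride to the coast']
freqs = word_frequencies(sample_texts, exclude=['#absoluteblack'])
```

Calling `freqs.most_common(20)` would then give a ranking comparable in shape to the list above.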

Given the limited dataset over the last 7 days, the results aren't especially revealing. Results such as 'oval' and 'ride' are expected. As I mentioned before, the Search API is limited to 7 days of results, so I think I'll collect a month's worth of data (a couple of gigs) from the Streaming API starting this weekend and give this analysis another go then. In the meantime, let's try again with a hashtag that returns more results.

### Trying again with a more ubiquitous hashtag

We'll identify trends with a hashtag that is more active on Twitter. I wanted to try #specialized, but then noticed that a majority of the tweets weren't referring to the bike brand. I took a look at a few mountain-bike-specific brands like [Transition](https://www.transitionbikes.com) or my favorite, [Evil](http://evil-bikes.com), but their presence on Twitter wasn't large enough; maybe they're too boutique. So I decided to try "Cervelo", which is unlikely to be used in another context and has a large fan base from previously sponsoring one of the largest pro teams in road cycling. I figured the hashtag would be very active since we know how narcissistic roadies are :)

In [79]:
statuses_cervelo = twt_search('#cervelo',100) 

100 distinct tweets
?max_id=710195435001012223&q=%23cervelo&count=100&include_entities=1&result_type=mixed
Fetching next results
127 distinct tweets
No more results


In [80]:
tweets_df_cervelo = pd.DataFrame()

# About the author
tweets_df_cervelo['user'] = [tweet['user']['screen_name'] for tweet in statuses_cervelo]
tweets_df_cervelo['is_amb'] = [1 if tweet['user']['screen_name'] in ambassadors_list else 0 for tweet in statuses_cervelo]
tweets_df_cervelo['text'] = [tweet['text'] for tweet in statuses_cervelo]
tweets_df_cervelo['lang'] = [tweet['lang'] for tweet in statuses_cervelo] # 'und' when the language is undetermined
tweets_df_cervelo['cntry'] = [tweet['place']['country'] if tweet['place'] is not None else None for tweet in statuses_cervelo]
tweets_df_cervelo['loc'] = [tweet['user']['location'] if tweet['user'] is not None else None for tweet in statuses_cervelo]
tweets_df_cervelo['followers'] = [tweet['user']['followers_count'] for tweet in statuses_cervelo]

# About the tweet
tweets_df_cervelo['fav_cnt'] = [tweet['favorite_count'] for tweet in statuses_cervelo]
tweets_df_cervelo['rt_cnt'] = [tweet['retweet_count'] for tweet in statuses_cervelo]
tweets_df_cervelo['ht'] = [len(tweet['entities']['hashtags']) for tweet in statuses_cervelo]
tweets_df_cervelo['contain_oval'] = [1 if tweet['text'].lower().find('oval') > -1 else 0 for tweet in statuses_cervelo]
tweets_df_cervelo['dow'] = [strftime('%a', strptime(tweet['created_at'],'%a %b %d %H:%M:%S +0000 %Y')) for tweet in statuses_cervelo] 
tweets_df_cervelo['dt'] = [strftime('%m-%d-%y %H:%M', strptime(tweet['created_at'],'%a %b %d %H:%M:%S +0000 %Y')) for tweet in statuses_cervelo]

# sort_values() replaces DataFrame.sort(), which newer pandas releases no longer provide
tweets_df_cervelo.sort_values('followers', ascending=False)

| | user | is_amb | text | lang | cntry | loc | followers | fav_cnt | rt_cnt | ht | contain_oval | dow | dt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 115 | SwissSide | 0 | A fast setup... \n#triathlon #hadron625 #cervelo #cervelop5 https://t.co/UvtV0JUooC | en | | Switzerland | 12794 | 11 | 0 | 4 | 0 | Tue | 03-15-16 18:32 |
| 22 | SwissSide | 0 | Pro-Triathlete Julia Gajer is exploring the roads of her new home in Austria. #juliagajer #cerve... | en | | Switzerland | 12794 | 10 | 0 | 3 | 0 | Tue | 03-22-16 17:35 |
| 99 | IRunColumbus | 0 | RT @dante_13: Cervelo R3 Disc in the house. #cervelo #bicycle #bikeporn https://t.co/VaSqddYamF ... | en | | Columbus, OH | 11986 | 0 | 1 | 4 | 0 | Wed | 03-16-16 20:06 |
| 57 | 3TCycling | 0 | #Repost @ irawandanoe : Muku 😋 #bike #cycling #ride #roadbike #cervelo #cerveloS5 #S5 #ga… htt... | in | | Brembate, Italy | 9677 | 2 | 1 | 9 | 0 | Sun | 03-20-16 07:09 |
| 104 | NUNZIAROJODELAV | 0 | STAY STRONG. Make them wonder how you are still smiling. 😬 #triathlete #yogi #lynx #cervelo… ht... | en | | MÉXICO | 8342 | 4 | 1 | 4 | 0 | Wed | 03-16-16 17:29 |
| 68 | richphoto | 0 | First ride on #sram etap wireless , on #cervelo , amazing! https://t.co/dCHEZWPus1 | en | | iPhone: 47.628832,-122.371681 | 5682 | 1 | 0 | 2 | 0 | Sat | 03-19-16 17:53 |
| 19 | richphoto | 0 | #sram #etap #cervelo maiden voyage. Nothing short of amazing #cyclinglife #cycling https://t.co/... | en | | iPhone: 47.628832,-122.371681 | 5682 | 1 | 0 | 5 | 0 | Tue | 03-22-16 23:40 |
| 93 | textter | 0 | Morning ride, before sunrise. #Miami #cervelo #ride https://t.co/SCGqAX3jSc | en | | Miami, FL | 4802 | 0 | 0 | 3 | 0 | Thu | 03-17-16 12:21 |
| 111 | WellbriX | 0 | trying to get a #cervelo for next Birthday https://t.co/OJSHx7tQIB | en | | Build.Your.Best.Body | 4796 | 1 | 0 | 1 | 0 | Tue | 03-15-16 20:14 |
| 118 | smallsunday | 0 | RT @alabikecom: Buenos días 😎 📷 @smallsunday #cervelo #cervelos3 #rotorbike #enve #speedplay h... | es | | Durango, CO | 4319 | 0 | 1 | 5 | 0 | Tue | 03-15-16 15:09 |


In [84]:
print tweets_df_cervelo['dow'].value_counts()

print tweets_df_cervelo['lang'].value_counts()

print tweets_df_cervelo['cntry'].value_counts()

Wed    31
Tue    26
Mon    21
Thu    15
Sun    14
Sat    12
Fri     8
dtype: int64
en     69
es     15
und    10
ja      8
in      5
de      5
ko      4
nl      3
fr      3
pt      2
tl      1
tr      1
it      1
dtype: int64
대한민국                      5
France                    3
México                    3
United States             3
United Kingdom            2
Indonesia                 2
España                    2
Deutschland               1
Republika ng Pilipinas    1
ประเทศไทย                 1
Brasil                    1
Nederland                 1
dtype: int64


With this many posts it might be interesting to see who the top posters are:

In [85]:
tweets_df_cervelo.user.value_counts()[:10]

masbetmen          11
alabikecom          9
climbcs             5
mixbysoul           5
Bergfahrrad         5
anverkaufonline     3
bikegallerymelb     3
alexisenciso        3
vatamoros           3
ProBikeTool         2
dtype: int64

### Complementary brands
Below we see the top 30 words used across these 127 Cervelo tweets. Other brands such as Speedplay, Rotor, and ENVE appear to be referenced often. Rotor is expected, as many Cervelo bikes include Rotor cranks. Speedplay may make sense because at one point the two sponsored a team together. ENVE is a high-end component manufacturer, so I'm not surprised to see people equipping their bikes with ENVE parts.

In [94]:
all_words = list(chain.from_iterable([d['text'].split() for d in statuses_cervelo]))
print '{} words before'.format(len(all_words))
all_words_nlp = filter(lambda w: w not in s, all_words)
print '{} words after'.format(len(all_words_nlp))

# let's remove variants of 'cervelo' from the list
exclude_list = ['cerv', '#cervelo', 'cervelo', '|', 'rt', '@']

# startswith() accepts a tuple of prefixes, so one call covers both variants
Counter([i for i in [word.lower() for word in set(all_words_nlp)]
         if i not in exclude_list and not i.startswith(('#cerv', 'cerv'))]).most_common()[0:30] # sloppy

1715 words before
1555 words after


[(u'bike', 2),
 (u'hace', 2),
 (u'new', 2),
 (u'racing', 2),
 (u'#parisnice', 2),
 (u'good', 2),
 (u'ya', 2),
 (u'nada', 2),
 (u'day', 2),
 (u'#vaucluse', 2),
 (u'el', 2),
 (u'#luberon', 2),
 (u'happy', 2),
 (u'#cycling', 2),
 (u'bike!', 2),
 (u'morning', 2),
 (u'see', 2),
 (u'#roadbike', 2),
 (u'#socal', 2),
 (u'muku', 2),
 (u'en', 2),
 (u'nice', 2),
 (u'@teamdidata', 2),
 (u':)', 1),
 (u'\uba38\ub9ac\uac00', 1),
 (u'https://t.co/w8t5dvj4w7', 1),
 (u'mile', 1),
 (u'/', 1),
 (u'#sram', 1),
 (u'https://t.co/mg1ooopcpr', 1)]

In [98]:
'#rotor' in [i for i in [word.lower() for word in set(all_words_nlp)] if i not in exclude_list][:30]

True
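
Since each tweet already carries its parsed hashtags under `entities['hashtags']` (see the sample tweet earlier), brand mentions could also be tallied from there instead of from whitespace-split text. The following is a minimal sketch on invented data; `hashtag_counts` and its `exclude_prefix` parameter are hypothetical names.

```python
from collections import Counter

def hashtag_counts(statuses, exclude_prefix=None):
    """Count lowercased hashtags taken from each tweet's entities."""
    counts = Counter()
    for tweet in statuses:
        for tag in tweet['entities']['hashtags']:
            text = tag['text'].lower()
            # Skip variants of the searched brand, e.g. 'cervelo', 'cervelos5'
            if exclude_prefix and text.startswith(exclude_prefix):
                continue
            counts[text] += 1
    return counts

# Toy tweets with only the field the helper reads
sample = [
    {'entities': {'hashtags': [{'text': 'cervelo'}, {'text': 'ROTOR'}, {'text': 'enve'}]}},
    {'entities': {'hashtags': [{'text': 'cerveloS5'}, {'text': 'rotor'}]}},
]
tags = hashtag_counts(sample, exclude_prefix='cerv')
```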

In the end, you could package this up in a .py file with command-line arguments for the text to search, the words to exclude (or a file that lists them), and who your ambassadors are (again, optionally a file).
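
As a starting point, the command-line interface might be sketched with `argparse` as below. The option names (`--count`, `--exclude`, `--ambassadors-file`) and the one-name-per-line file layout are assumptions for illustration. Nothing here calls the Twitter API.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description='Summarise tweets for a search term.')
    parser.add_argument('query', help='text or hashtag to search for')
    parser.add_argument('--count', type=int, default=100,
                        help='tweets per page (the Search API caps this at 100)')
    parser.add_argument('--exclude', nargs='*', default=[],
                        help='words to leave out of the word counts')
    parser.add_argument('--ambassadors-file',
                        help='path to a file with one screen name per line')
    return parser

def read_names(lines):
    """Return non-empty, stripped names from an iterable of lines (e.g. an open file)."""
    return [line.strip() for line in lines if line.strip()]

# Parse an explicit argument list here; a real script would call parse_args() with no arguments
args = build_parser().parse_args(['#cervelo', '--exclude', 'rt', '|'])
```

From there, `args.query` and `args.count` could be passed to `twt_search`, and `read_names(open(args.ambassadors_file))` could supply the ambassadors list when a file is given.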