# Case Study 1: Collecting Data from Twitter

* ------------

**TEAM Members:**

    Matt Curcio
    
    Tingting Meng
    
    Wenlu Du
    

**Required Readings:** 
* Chapter 1 and Chapter 9 of the book [Mining the Social Web](http://bit.ly/1pC5ujw) 
* The code for [Chapter 1](http://bit.ly/1qCtMrr) and [Chapter 9](http://bit.ly/1u7eP33)


**NOTE**
* Save the notebook frequently while working in IPython Notebook; otherwise your changes may be lost.

*----------------------

## Problem 1: Sampling Twitter Data with Streaming API about a certain topic

* Select a topic that you are interested in, for example, "WPI" or "Lady Gaga"
* Use the Twitter Streaming API to sample a collection of tweets about this topic in real time. (The recommended number of tweets is more than 200 but fewer than 1 million.)
* Store the tweets you downloaded into a local file (txt file or json file) 
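
The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the team's implementation (the notebook's own code below uses the Search API): `collect_tweets` and `load_tweets` are hypothetical helper names, and `stream` is taken to be any iterable of tweet dictionaries. With the `twitter` package, such an iterable would typically come from something like `twitter.TwitterStream(auth=twitter_api.auth).statuses.filter(track=q)`, as in the Chapter 9 examples of the book; that call is assumed here and not exercised by the sketch.

```python
import json


def collect_tweets(stream, path, limit=200):
    """Write up to `limit` tweets from `stream` to `path`, one JSON object per line.

    `stream` is assumed to be any iterable of dictionaries (hypothetical input).
    Returns the number of tweets written.
    """
    count = 0
    with open(path, "w") as f:
        if limit <= 0:
            return count
        for item in stream:
            # A stream may also yield non-tweet messages (for example delete
            # notices); keep only items that carry tweet text.
            if not isinstance(item, dict) or "text" not in item:
                continue
            f.write(json.dumps(item) + "\n")
            count += 1
            if count >= limit:
                break  # stop here instead of waiting for another item
    return count


def load_tweets(path):
    """Read back the tweets written by collect_tweets."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Writing one JSON object per line means a partially collected file is still readable if the stream is interrupted.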

In [48]:
import twitter
#---------------------------------------------
# Define a Function to Login Twitter API
def oauth_login():
    # Go to http://twitter.com/apps/new to create an app and get values
    # for these credentials. The names ck, cs, ot, and ots must already be
    # defined (for example, in an earlier cell) with your own credential strings.
    # See https://dev.twitter.com/docs/auth/oauth for more information 
    # on Twitter's OAuth implementation.
    
    CONSUMER_KEY = ck
    CONSUMER_SECRET = cs
    OAUTH_TOKEN = ot
    OAUTH_TOKEN_SECRET = ots
    
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
    
    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api



### Searching for tweets:

In [60]:
import json
import twitter

def twitter_search(twitter_api, q, max_results=100, **kw):

    # See https://dev.twitter.com/docs/api/1.1/get/search/tweets and 
    # https://dev.twitter.com/docs/using-search for details on advanced 
    # search criteria that may be useful for keyword arguments
    
    # Fetch the first batch of results (up to 100 per request)
    search_results = twitter_api.search.tweets(q=q, count=100, **kw)
    
    statuses = search_results['statuses']
    
    # Iterate through batches of results by following the cursor until we
    # reach the desired number of results, keeping in mind that OAuth users
    # can "only" make 180 search queries per 15-minute interval. See
    # https://dev.twitter.com/docs/rate-limiting/1.1/limits
    # for details. A reasonable number of results is ~1000, although
    # that number of results may not exist for all queries.
    
    # Enforce a reasonable limit
    max_results = min(1000, max_results)
    
    for _ in range(10): # 10*100 = 1000
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError: # No more results when next_results doesn't exist
            break
            
        # Create a dictionary from next_results, which has the following form:
        # ?max_id=313519052523986943&q=NCAA&include_entities=1
        kwargs = dict([ kv.split('=') 
                        for kv in next_results[1:].split("&") ])
        
        search_results = twitter_api.search.tweets(**kwargs)
        statuses += search_results['statuses']
        
        if len(statuses) > max_results: 
            break
            
    return statuses
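
The `next_results` string handled above is a URL query string, so splitting on `&` and `=` leaves its values percent-encoded (for example `iPhone%206`), and they may then be encoded a second time when passed back to the API. A possible alternative, sketched here for Python 3 (in Python 2 the same function lives in the `urlparse` module), is to let the standard library decode it. `parse_next_results` is a hypothetical helper name, not part of the notebook.

```python
from urllib.parse import parse_qsl


def parse_next_results(next_results):
    """Turn a string such as '?max_id=1&q=NCAA' into a dict of decoded values."""
    # Drop the leading '?' and let parse_qsl split and percent-decode the pairs.
    return dict(parse_qsl(next_results.lstrip("?")))


kwargs = parse_next_results("?max_id=313519052523986943&q=NCAA&include_entities=1")
```

The resulting dictionary could be passed on in the same way as `kwargs` in `twitter_search`.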


#### Store the tweets into  a local file

In [61]:
import json
import twitter

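def save_file(search_results):
    # Write the collected tweets to 'workfile' as JSON; the with-block
    # closes the file even if json.dump raises an error.
    with open('workfile', 'w') as f:
        json.dump(search_results, f)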
    
#usage
twitter_api = oauth_login()

q = "iPhone 6"
results = twitter_search(twitter_api, q, max_results=10)
save_file(results)

### Report some statistics about the tweets you collected 

* The topic of interest: iPhone 6

* The total number of tweets collected: 100
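
A few further statistics could be computed from the saved list of statuses along the following lines. This is only a sketch: `summarize` is a hypothetical helper, and it assumes each status is a dictionary in which a `user` entry holds the author's `screen_name` and a `retweeted_status` entry is present only for retweets.

```python
def summarize(statuses):
    """Return simple counts describing a list of tweet dictionaries."""
    total = len(statuses)
    # A status is treated as a retweet when it embeds the original tweet.
    retweets = sum(1 for s in statuses if "retweeted_status" in s)
    # Distinct authors, ignoring statuses without user information.
    authors = {s["user"]["screen_name"] for s in statuses if "user" in s}
    return {"total": total, "retweets": retweets, "unique_authors": len(authors)}
```

For the collection above it would be called as `summarize(results)`.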

*-----------------------

## Problem 2: Analyzing Tweets and Tweet Entities with Frequency Analysis

**1. Word Count:** 
* Use the tweets you collected in Problem 1, and compute the frequencies of the words being used in these tweets. 
* Plot a table of the top 30 words with their counts

In [56]:
from prettytable import PrettyTable
from collections import Counter
import twitter
import json

#Frequency analysis
def count_freq():
    # Load the tweets saved in Problem 1
    with open('workfile') as f:
        statuses = json.load(f)

    status_texts = [ status['text'] 
                     for status in statuses ]
    
    words = [ w 
              for t in status_texts 
                  for w in t.split() ]
    
    pt = PrettyTable(field_names=['Word','Count'])
    c = Counter(words)
    
    # Add the 30 most frequent words to the table
    for kv in c.most_common(30):
        pt.add_row(kv)
    pt.align['Word'],pt.align['Count']='l','r'
    print(pt)

    
#usage
count_freq()
    


+------------------------+-------+
| Word                   | Count |
+------------------------+-------+
| RT                     |    66 |
| 6                      |    63 |
| iPhone                 |    63 |
| u                      |    26 |
| thank                  |    25 |
| 6!!                    |    25 |
| mom                    |    24 |
| My                     |    21 |
| the                    |    21 |
| üòçüòç                   |    20 |
| IPhone                 |    20 |
| http://t.co/N3eTuyN7k2 |    18 |
| @traplxrde:            |    18 |
| @RelatableQuote:       |    10 |
| my                     |     8 |
| I                      |     8 |
| http://t.co/MeJQ20XIkg |     8 |
| http://t.co/qPBqfMLuuP |     8 |
| @iPhoneTeam:           |     8 |
| iphone                 |     7 |
| is                     |     7 |
| a                      |     6 |
| Plus                   |     5 |
| on                     |     5 |
| The                    |     5 |
| 6?            
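
The table counts `iPhone`, `IPhone`, and `iphone` separately, and tokens such as `RT`, mentions, and links take up several rows. One possible refinement, offered as a sketch rather than as part of the submitted analysis, is to normalize tokens before counting. `normalized_word_counts` is a hypothetical helper, and the choice of what to drop (links, mentions, `RT`, surrounding punctuation) is an assumption.

```python
from collections import Counter


def normalized_word_counts(texts, top_n=30):
    """Count lowercased words, skipping links, mentions, and the RT marker."""
    words = []
    for text in texts:
        for token in text.lower().split():
            if token == "rt" or token.startswith(("http://", "https://", "@")):
                continue
            # Remove punctuation attached to either end of the word.
            token = token.strip(".,!?:;\"'()")
            if token:
                words.append(token)
    return Counter(words).most_common(top_n)


sample = ["RT @a: iPhone 6!!", "IPhone 6 http://t.co/x", "my iphone"]
counts = normalized_word_counts(sample)
```

On this small sample the three spellings of "iphone" are merged into a single entry.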

**2. Find the most popular tweets in your collection of tweets**

Please plot a table of the top 10 tweets that are the most popular among your collection, i.e., the tweets with the largest number of retweet counts.


In [52]:
from prettytable import PrettyTable
from collections import Counter
import twitter
import json

#Find the most popular tweets
def retweets():
    statuses = json.loads(open('workfile').read())
    
    retweets = [
            # Store out a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text']) 
            
            # ... for each status ...
            for status in statuses 
            
            # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]

    # Slice off the first 10 from the sorted results and display each item in the tuple

    pt = PrettyTable(field_names=['Count', 'Screen Name', 'Text'])
    for row in sorted(retweets, reverse=True)[:10]:
        pt.add_row(row)
    pt.max_width['Text'] = 50
    pt.align= 'l'
    print(pt)

#usage
retweets()



+-------+----------------+----------------------------------------------------+
| Count | Screen Name    | Text                                               |
+-------+----------------+----------------------------------------------------+
| 36785 | justinbieber   | RT @justinbieber: Shots looks amazing on the new   |
|       |                | iPhone 6 Plus http://t.co/wzHgUWU6ih               |
| 9830  | GIF_A_GOGO     | RT @GIF_A_GOGO: Comment Apple se console de vos    |
|       |                | critiques sur l'Iphone 6 ...                       |
|       |                | http://t.co/6fqMWjxrxY                             |
| 6993  | yoshiiiiii_    | RT @yoshiiiiii_: rt pour la tablette milka         |
|       |                | fav pour l'iphone 6 http://t.co/Y2vQZpgPJJ         |
| 5782  | RelatableQuote | RT @RelatableQuote: The first guy to buy an iPhone |
|       |                | 6 😂 https://t.co/5fVpi4eQCF                       |
| 5632  | Dory           | RT @Dory: i
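
Because every retweet in the collection reports the retweet count of the tweet it repeats, the same original tweet may appear more than once in a ranking like the one above. A hedged sketch of one way to keep a single row per original tweet follows; `top_retweeted` is a hypothetical helper, and it assumes each retweet embeds the original under `retweeted_status` with `id`, `text`, and `user` fields.

```python
def top_retweeted(statuses, n=10):
    """Return up to n (count, screen_name, text) rows, one per original tweet."""
    best = {}
    for status in statuses:
        original = status.get("retweeted_status")
        if original is None:
            continue  # not a retweet
        row = (status["retweet_count"],
               original["user"]["screen_name"],
               original["text"])
        key = original["id"]
        # Keep the highest count seen for each original tweet.
        if key not in best or row[0] > best[key][0]:
            best[key] = row
    return sorted(best.values(), key=lambda r: r[0], reverse=True)[:n]
```

The returned rows could be added to a `PrettyTable` in the same way as in `retweets()`.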

**3. Find the most popular Tweet Entities in your collection of tweets**

Please plot a table of the top 10 hashtags, top 10 user mentions that are the most popular in your collection of tweets.

In [65]:
from prettytable import PrettyTable
from collections import Counter
import twitter
import json

#The most popular tweet entities
def top_entities():
    statuses = json.loads(open('workfile').read())
   
    #hashtags
    hashtags = [ hashtag['text'] 
                 for status in statuses
                     for hashtag in status['entities']['hashtags'] ]
    
    #user mentions
    user_mentions = [ user_mention['screen_name']
                      for status in statuses
                          for user_mention in status['entities']['user_mentions'] ]
    
    #plot the results in table 
    for label, data in ( ('Hashtag', hashtags ),
                          ('User_mention', user_mentions)): 
        pt = PrettyTable(field_names=[label,'Count'])
        c = Counter(data)          
        for kv in c.most_common(10):
            pt.add_row(kv)
        pt.align[label], pt.align['Count'] = 'l', 'r' # Set column alignment
        print(pt)

    
#usage
top_entities()


+---------+-------+
| Hashtag | Count |
+---------+-------+
| iPhone6 |    12 |
| iOS8    |    12 |
| iPhone  |     4 |
| The     |     2 |
| gadgets |     2 |
| tech    |     2 |
| iPad    |     1 |
| iTunes  |     1 |
| Also    |     1 |
| Got     |     1 |
+---------+-------+
+-----------------+-------+
| User_mention    | Count |
+-----------------+-------+
| traplxrde       |    14 |
| FITNESSSLEGEND  |    12 |
| ohnahcarter     |     7 |
| iPhoneTeam      |     7 |
| santosfcinforma |     6 |
| RelatableQuote  |     6 |
| pedhsm          |     5 |
| TheRealMikeEpps |     5 |
| BrianSinSuerte  |     5 |
| GobbiGustavo    |     4 |
+-----------------+-------+
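
The two counts above follow the same pattern, so they could be expressed with a single helper. The sketch below is an illustration only: `count_entities` and its parameters are hypothetical names, statuses without an `entities` entry are skipped as an assumed safeguard, and the optional case folding (which would merge hashtags that differ only in capitalization) is a design choice the notebook does not make.

```python
from collections import Counter


def count_entities(statuses, kind, field, top_n=10, fold_case=False):
    """Count the values of `field` over all entities of type `kind`."""
    values = []
    for status in statuses:
        # Tolerate statuses that carry no entities at all.
        for entity in status.get("entities", {}).get(kind, []):
            value = entity[field]
            values.append(value.lower() if fold_case else value)
    return Counter(values).most_common(top_n)
```

For example, `count_entities(statuses, 'hashtags', 'text')` and `count_entities(statuses, 'user_mentions', 'screen_name')` would correspond to the two lists built in `top_entities()`.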


*------------------------

## Problem 3: Getting "All" friends and "All" followers of a popular user in twitter


* Choose a popular Twitter user who has many followers, such as "ladygaga".
* Get the list of all friends and all followers of the Twitter user.
* Pick 20 of the followers and plot their ID numbers and screen names in a table.
* Pick 20 of the friends (if the user has more than 20 friends) and plot their ID numbers and screen names in a table.
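
Getting "all" friends or followers generally means following a cursor from page to page. The sketch below shows that loop in isolation. It assumes, as in Twitter's v1.1 cursored endpoints, that each page is a dictionary with an `ids` list and a `next_cursor` value, that `-1` requests the first page, and that `0` marks the end. `fetch_all_ids` and `fetch_page` are hypothetical names; with the `twitter` package, `fetch_page` might be something like `lambda c: twitter_api.followers.ids(screen_name='angryasian', cursor=c)`, which is an assumption and is not run here.

```python
def fetch_all_ids(fetch_page, max_pages=None):
    """Collect IDs from a cursored endpoint.

    `fetch_page` is any callable taking a cursor and returning a dict
    with 'ids' and 'next_cursor' keys (assumed response shape).
    """
    ids = []
    cursor = -1  # assumed convention: -1 asks for the first page
    pages = 0
    while cursor != 0:  # assumed convention: 0 means no more pages
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
        pages += 1
        if max_pages is not None and pages >= max_pages:
            break  # optional cap, useful under rate limits
    return ids
```

For a user with a very large following, rate limits make a complete download slow, which is why the optional `max_pages` cap is included.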

In [54]:
#Problem 3.1
from prettytable import PrettyTable
from collections import Counter
import twitter
import json

#Fetch the first 20 followers and friends of a user and plot them in tables.
def user_info():
    twitter_api = oauth_login() # Log in once and reuse the connection
    followers = twitter_api.followers.list(screen_name='angryasian', count=20)
    friends = twitter_api.friends.list(screen_name='angryasian', count=20)
    
    print('Followers')
    print('')
    pt_follower = PrettyTable(field_names = ['ID', 'Screen name'])
    
    for follower in followers['users']:
        pt_follower.add_row((follower['id'],follower['screen_name']))
    pt_follower.align['ID'], pt_follower.align['Screen name'] = 'l', 'r' # Set column alignment
    print(pt_follower)
    
    print('')
    print('Friends')
    print('')
    pt_friends = PrettyTable(field_names = ['ID', 'Screen name'])
    
    for friend in friends['users']:
        pt_friends.add_row((friend['id'],friend['screen_name']))
    pt_friends.align['ID'], pt_friends.align['Screen name'] = 'l', 'r' # Set column alignment
    print(pt_friends)
   
    
#usage
user_info()

Followers

+------------+-----------------+
| ID         |     Screen name |
+------------+-----------------+
| 17917640   |           20Ent |
| 1928730974 |   Lupine_Lights |
| 9587902    |     carborocket |
| 2671440313 |        RotorUsa |
| 260277853  |       neilpryde |
| 983738660  |  InsiderPeloton |
| 2776570843 |   ChrisKemp_sac |
| 2820506090 |   HernanEliezer |
| 37143075   |    TheFixStudio |
| 14869462   |       r_ashwell |
| 80411378   |    nikola_innov |
| 82917971   |          ahhdyu |
| 353799500  |     DonkeyLabel |
| 39188622   | NorCalBikeSport |
| 91023883   |     FormigliUSA |
| 20367168   |            RQCG |
| 193291202  |       Jonny_05_ |
| 39942436   |          ekitah |
| 81378085   | oldfashionedtoy |
| 2787793003 |         KaiSaiz |
+------------+-----------------+

Friends

+------------+-----------------+
| ID         |     Screen name |
+------------+-----------------+
| 18227861   |      piraterace |
| 983738660  |  InsiderPeloton |
| 2345404443 |        

* Compute the mutual friends within the two groups, i.e., the users who appear in both the friend list and the follower list, and plot their ID numbers and screen names in a table.

In [55]:
# Problem 3.2
from prettytable import PrettyTable
from collections import Counter
import twitter
import json


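#Find the mutual friends of a user.
def mutual_friend():
    # Only the first page (up to 100) of followers and of friends is compared
    twitter_api = oauth_login() # Log in once and reuse the connection
    followers = twitter_api.followers.list(screen_name='angryasian', count=100)
    friends = twitter_api.friends.list(screen_name='angryasian', count=100)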
    
    #Followers' set
    followers_set = set([ follower['id']
                         for follower in followers['users']])

    #Friends' set
    friends_set = set([ friend['id']
                         for friend in friends['users']])
    
    #find the results of the mutual friends
    mutual_friends = followers_set.intersection(friends_set)
    
    #Create table to show the results
    pt_friends = PrettyTable(field_names = ['ID', 'Screen name'])
    
    for friend in friends['users']:
        if friend['id'] in mutual_friends:
            pt_friends.add_row((friend['id'],friend['screen_name']))
    pt_friends.align['ID'], pt_friends.align['Screen name'] = 'l', 'r' # Set column alignment
    
    print(pt_friends)


#usage
mutual_friend()
    


+-----------+----------------+
| ID        |    Screen name |
+-----------+----------------+
| 983738660 | InsiderPeloton |
+-----------+----------------+


* -----------------
# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this IPython notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "ipython notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.


* **PPT Slides**: Please prepare PPT slides (for a 10-minute talk) presenting the case study. Two randomly selected teams will be asked to present their case studies in class.

* **Report**: Please prepare a report (fewer than 10 pages) describing what you found in the data.
    * What data did you collect? 
    * Why is this topic interesting or important to you? (Motivations)
    * How did you analyse the data?
    * What did you find in the data? 
 
     (Please include figures or tables in the report, but no source code.)

Please compress all the files into a single zip file.
