# Twitter Access & Influence

In this notebook, we write a script that collects tweets about the upcoming iPhone 7 from the Twitter API. The goal is to identify the users who are most influential on the topic, using the importance scores from the gradient boosting model in the [Predicting Social Influence](https://github.com/juliaawu/mis184n-social-media-analytics/tree/master/predicting-social-influence) assignment as weights. We then build a network of tweets and retweets about the iPhone 7.

In [3]:
import oauth2
import time
import urllib2
import json
import pandas as pd
import re
import sklearn
from sklearn import preprocessing

In [4]:
# Fixed authentication parameters for Twitter API
url1 = "https://api.twitter.com/1.1/search/tweets.json"
params = {
    "oauth_version": "1.0",
    "oauth_nonce": oauth2.generate_nonce(),
    "oauth_timestamp": int(time.time())
}

In [5]:
# Variable authentication parameters
# Replace the placeholders with your own credentials; never publish real keys
api_key = 'YOUR_API_KEY'
api_secret = 'YOUR_API_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_secret = 'YOUR_ACCESS_SECRET'

consumer = oauth2.Consumer(key=api_key, secret=api_secret)
token = oauth2.Token(key=access_token, secret=access_secret)

params["oauth_consumer_key"] = consumer.key
params["oauth_token"] = token.key
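
Hard-coding keys in a shared notebook exposes them to anyone who can read it. As a minimal sketch, the four credentials could instead be read from environment variables. The variable names (`TWITTER_API_KEY` and so on) and the helper `load_credentials` are hypothetical and not part of the original notebook.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = (
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(environ=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

The returned values would then be passed to `oauth2.Consumer` and `oauth2.Token` in place of the string literals.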

In [10]:
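# Search tweets for keyword "iPhone 7", paging backward through the results with max_id
maxID = -1
search_results_final = []

for page in range(65):
    params["q"] = "iPhone 7"
    params["count"] = 100
    params["lang"] = 'en'
    # Only send max_id once a previous page has set it
    if maxID > 0:
        params['max_id'] = maxID
    # Refresh the nonce and timestamp so that each signed request is unique
    params["oauth_nonce"] = oauth2.generate_nonce()
    params["oauth_timestamp"] = int(time.time())
    req = oauth2.Request(method="GET", url=url1, parameters=params)
    signature_method = oauth2.SignatureMethod_HMAC_SHA1()
    req.sign_request(signature_method, consumer, token)
    response = urllib2.Request(req.to_url())
    search_results = json.load(urllib2.urlopen(response))
    statuses = search_results['statuses']
    if not statuses:
        break  # no older tweets available
    for tweet in statuses:
        # The next page must end just below the oldest ID seen so far
        maxID = int(tweet['id_str']) - 1
        search_results_final.append(tweet)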
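
The loop above pages backward through the search results: each request asks only for tweets with an ID at or below `max_id`, and `max_id` is then moved to just below the oldest ID received. The sketch below isolates that logic from the HTTP and OAuth details. `collect_pages` and `fake_fetch` are hypothetical stand-ins for the signed API call, and the tweet IDs are made up.

```python
def collect_pages(fetch_page, max_pages):
    """Page backward through results using max_id-style cursoring."""
    collected = []
    max_id = None  # no upper bound on the first request
    for _ in range(max_pages):
        statuses = fetch_page(max_id)
        if not statuses:
            break  # nothing older is available
        collected.extend(statuses)
        # Ask next time for tweets strictly older than the oldest one seen
        max_id = min(int(s["id_str"]) for s in statuses) - 1
    return collected


# Hypothetical in-memory "API": tweets with IDs 10 down to 1, newest first
FAKE_TWEETS = [{"id_str": str(i)} for i in range(10, 0, -1)]


def fake_fetch(max_id, count=4):
    """Return up to `count` tweets whose ID is at or below max_id."""
    eligible = [t for t in FAKE_TWEETS
                if max_id is None or int(t["id_str"]) <= max_id]
    return eligible[:count]


tweets = collect_pages(fake_fetch, max_pages=5)
ids = [t["id_str"] for t in tweets]
```

Subtracting 1 from the oldest ID matters: `max_id` is inclusive, so without it the oldest tweet of one page would be returned again at the top of the next.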

In [11]:
# Retrieve user info, tweet text, and retweet count for each tweet
user_information = []
tweets = []
retweet_count = []

for tweet in search_results_final:
    # Blank out the nested 'entities' field so each user record stays flat
    tweet['user']['entities'] = ''
    user_information.append(tweet['user'])
    tweets.append(tweet['text'])
    retweet_count.append(tweet['retweet_count'])

In [21]:
# Build a DataFrame with one row per tweet: user fields plus text and retweet count
tweets_df = pd.DataFrame(user_information)
tweets_df['text'] = tweets
tweets_df['retweet_count'] = retweet_count

Now that we have the data, we can calculate the scores. We use the four most important attributes identified by the boosting model, skipping network features 1 and 2 because they are not part of this dataset.

In [22]:
# Get columns for calculating score; copy so later assignments do not touch a slice of tweets_df
score = tweets_df[['screen_name', 'listed_count', 'followers_count', 'friends_count', 'retweet_count']].copy()

In [23]:
# Mean-normalize columns: center each on its mean and scale by its range
cols_to_norm = ['listed_count', 'followers_count', 'friends_count', 'retweet_count']
score[cols_to_norm] = score[cols_to_norm].apply(lambda x: (x - x.mean()) / (x.max() - x.min()))
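
The lambda above applies mean normalization: each value is centered on the column mean and divided by the column range, so the result has mean 0 and spans exactly 1. Below is a small pure-Python sketch of the same formula on made-up follower counts. The zero-range guard is an added assumption, since the expression above would divide by zero for a constant column.

```python
def mean_normalize(values):
    """Center values on their mean and scale by their range (max - min)."""
    mean = sum(values) / float(len(values))
    span = max(values) - min(values)
    if span == 0:
        # Assumption: map a constant column to zeros instead of dividing by zero
        return [0.0 for _ in values]
    return [(v - mean) / span for v in values]


followers = [10, 250, 4000, 120000]  # hypothetical follower counts
normalized = mean_normalize(followers)
```

When a few very large accounts stretch the range, most values land slightly below zero, which would explain why `retweet_count` sits near -0.06 for almost every row in the results.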

For the weights, we divide each attribute's importance score by the sum of the importance scores of the four selected attributes, so that the weights sum to 1.
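
The weights used in the next cell sum to 1, which is what you get when each attribute's importance is divided by the total importance of the four attributes kept. Below is a minimal sketch of that rescaling. `normalize_weights` is a hypothetical helper, and the raw importance values are made up for illustration, not taken from the model.

```python
def normalize_weights(importances):
    """Rescale importance scores so that they sum to 1."""
    total = float(sum(importances.values()))
    return {name: value / total for name, value in importances.items()}


# Hypothetical raw importances, for illustration only
raw = {
    "listed_count": 0.30,
    "followers_count": 0.20,
    "friends_count": 0.10,
    "retweet_count": 0.10,
}
weights = normalize_weights(raw)

# The weights used in this notebook are already on that scale
notebook_weights = [0.423115134, 0.303900022, 0.143122359, 0.129862484]
notebook_total = sum(notebook_weights)  # 1 up to rounding in the last digit
```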

In [24]:
# Calculate score as a weighted sum of the normalized columns, using the importance
# scores from the gradient boosting model in Predicting Social Influence as weights
score['score'] = (0.423115134 * score['listed_count']
                  + 0.303900022 * score['followers_count']
                  + 0.143122359 * score['friends_count']
                  + 0.129862484 * score['retweet_count'])

In [25]:
# Sort descending by score to get top 20 influential users
# (DataFrame.sort was replaced by sort_values in newer pandas versions)
score.sort_values('score', ascending=False).head(20)

| index | screen_name | listed_count | followers_count | friends_count | retweet_count | score |
|---|---|---|---|---|---|---|
| 1543 | TheNextWeb | 0.993889 | 0.558258 | 0.001409 | -0.062545 | 0.582264 |
| 886 | TeenVogue | 0.29873 | 0.868529 | 0.012519 | -0.06274 | 0.383987 |
| 2914 | abpnewstv | 0.074455 | 0.998537 | -0.005915 | -0.063175 | 0.325908 |
| 1186 | macworld | 0.460977 | 0.165527 | -0.002704 | -0.062871 | 0.236799 |
| 2139 | scarletmonahan | 0.020404 | 0.590566 | 0.164909 | -0.06324 | 0.203496 |
| 1478 | applenws | 0.200966 | 0.347932 | -0.006379 | -0.062437 | 0.181747 |
| 1096 | intlspectator | 0.055694 | 0.170748 | 0.621862 | -0.061395 | 0.156485 |
| 3686 | phone_crazy | 0.00895 | 0.042664 | 0.993368 | -0.063261 | 0.15071 |
| 2953 | phone_crazy | 0.00895 | 0.042664 | 0.993368 | -0.063261 | 0.15071 |
| 2318 | leyahdeshon | -0.006015 | -0.001348 | -0.002888 | 0.936739 | 0.118279 |


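The table above ranks the most influential users on the topic of the iPhone 7. Each row is a single tweet, so an account that posted several matching tweets (such as phone_crazy) appears more than once.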
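
Because every row of `score` is a tweet, one way to rank accounts rather than tweets is to keep each user's highest score. The sketch below does this in plain Python on a few rows copied from the table. `best_score_per_user` is a hypothetical helper, not part of the original notebook.

```python
def best_score_per_user(rows):
    """Keep the highest score for each screen name, sorted descending."""
    best = {}
    for screen_name, value in rows:
        if screen_name not in best or value > best[screen_name]:
            best[screen_name] = value
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# (screen_name, score) pairs taken from the table above
rows = [
    ("TheNextWeb", 0.582264),
    ("intlspectator", 0.156485),
    ("phone_crazy", 0.15071),
    ("phone_crazy", 0.15071),
    ("leyahdeshon", 0.118279),
]
ranking = best_score_per_user(rows)
```

In pandas, calling `drop_duplicates(subset='screen_name')` on the sorted frame should have a similar effect, since it keeps the first (highest-scoring) row per user by default.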

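Next, we format the data so that we can build a network of tweets and retweets on the topic. Each row needs two endpoints: user1, the account that posted the tweet, and user2, the account being retweeted or replied to, which we extract from the tweet text.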

In [26]:
# Set user1 as screen_name
user1 = [('@'+x) for x in tweets_df['screen_name']]
user2 = []
tweet_type = []

In [27]:
# Extract user2 from the text: the account being retweeted ("RT @name: ...") or replied to ("@name ...").
# tweet_type is 'RT' for retweets and replies, 'Tweet' otherwise.
for index, value in enumerate(tweets_df['text']):
    if value[:2] == 'RT':
        # Take the token after "RT " and drop its trailing colon
        text = value[3:].split()[0][:-1]
        user2.append(text)
        tweet_type.append('RT')
    elif value[:1] == '@':
        # A reply starts with the handle of the user being addressed
        text = value.split()[0]
        user2.append(text)
        tweet_type.append('RT')
    else:
        # Plain tweet: the author points to themselves
        user2.append(user1[index])
        tweet_type.append('Tweet')
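
The slicing above assumes that a retweet always looks like `RT @name: ...`. A regular expression makes that assumption explicit and avoids cutting off a character when the colon is missing. This is a sketch rather than the notebook's method: `parse_edge` is a hypothetical helper, and it assumes handles consist only of letters, digits, and underscores.

```python
import re

RETWEET = re.compile(r"^RT @(\w+)")
REPLY = re.compile(r"^@(\w+)")


def parse_edge(screen_name, text):
    """Return (user1, user2, tweet_type) for one tweet."""
    user1 = "@" + screen_name
    match = RETWEET.match(text) or REPLY.match(text)
    if match:
        # Retweets and replies are both labeled 'RT', as in the cell above
        return (user1, "@" + match.group(1), "RT")
    # Plain tweet: the author points to themselves
    return (user1, user1, "Tweet")


examples = [
    parse_edge("alice", "RT @bob: iPhone 7 rumors"),
    parse_edge("alice", "@carol what do you think?"),
    parse_edge("alice", "Waiting for the iPhone 7"),
]
```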

In [29]:
# Combine the three lists into an edge list and save it as CSV
network = pd.concat([pd.Series(user1, name='user1'),
                     pd.Series(user2, name='user2'),
                     pd.Series(tweet_type, name='tweet_type')], axis=1)
network.to_csv('network.csv', index=False)
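
Once `network.csv` exists, a first look at the network is to count how often each account is the target of a retweet or reply. The sketch below does that with the standard library on a few made-up rows in the same three-column layout. The account names and the helper `count_incoming` are hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up rows in the same layout as network.csv
SAMPLE_CSV = u"""user1,user2,tweet_type
@alice,@bob,RT
@carol,@bob,RT
@dave,@dave,Tweet
@erin,@alice,RT
"""


def count_incoming(csv_file):
    """Count how often each account is retweeted or replied to by someone else."""
    incoming = Counter()
    for row in csv.DictReader(csv_file):
        if row["tweet_type"] == "RT" and row["user1"] != row["user2"]:
            incoming[row["user2"]] += 1
    return incoming


incoming = count_incoming(io.StringIO(SAMPLE_CSV))
top_accounts = incoming.most_common(2)
```

To run it on the real file, pass `open('network.csv')` instead of the `StringIO` object.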