#OAuth Exercise

In this exercise we will try to scrape twitter data and do a tf-idf analysis on that (src-uwes twitter analysis). We will need OAuth authentication, and we will follow a similar approach as detailed in the yelp analysis notebook. 

In [75]:
import jsonpickle, operator,json
import numpy as np
import pandas as pd
import oauth2 as oauth
import urllib2 as urllib

We will now need twitter api access. The following steps as available online will help you set up your twitter account and access the live 1% stream.

1. Create a twitter account if you do not already have one.
2. Go to https://dev.twitter.com/apps and log in with your twitter credentials.
3. Click "Create New App"
4. Fill out the form and agree to the terms. Put in a dummy website if you don't have one you want to use.
5. On the next page, click the "API Keys" tab along the top, then scroll all the way down until you see the section "Your Access Token"
6. Click the button "Create My Access Token". You can Read more about Oauth authorization online. 

Save the details of api_key, api_secret, access_token_key, access_token_secret in your vaule directory and load it in the notebook as shown in yelpSample notebook.

In [76]:
import sys
sys.path.append('/Users/mgalarny/VaultDSE')
import twitterKeys
api_key,api_secret,access_token_key,access_token_secret=twitterKeys.getkeys()

_debug = 0

oauth_token    = oauth.Token(key=access_token_key, secret=access_token_secret)
oauth_consumer = oauth.Consumer(key=api_key, secret=api_secret)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()

http_method = "GET"

http_handler  = urllib.HTTPHandler(debuglevel=_debug)
https_handler = urllib.HTTPSHandler(debuglevel=_debug)

Below is a twitter request method which will use the above user logins to sign, and open a twitter stream request

In [77]:
def getTwitterStream(url, method, parameters):
  req = oauth.Request.from_consumer_and_token(oauth_consumer,
                                             token=oauth_token,
                                             http_method=http_method,
                                             http_url=url, 
                                             parameters=parameters)

  req.sign_request(signature_method_hmac_sha1, oauth_consumer, oauth_token)

  headers = req.to_header()

  if http_method == "POST":
    encoded_post_data = req.to_postdata()
  else:
    encoded_post_data = None
    url = req.to_url()

  opener = urllib.OpenerDirector()
  opener.add_handler(http_handler)
  opener.add_handler(https_handler)

  response = opener.open(url, encoded_post_data)

  return response

We can use the above function to request a response as follows

In [78]:
#Now we will test the above function for a sample data provided by twitter stream here -  
url = "https://stream.twitter.com/1/statuses/sample.json"
parameters = []
response = getTwitterStream(url, "GET", parameters)

Write a function which will take a url and return the top 10 lines returned by the twitter stream

** Note ** The response returned needs to be intelligently parsed to get the text data which correspond to actual tweets. This part can be done in a number of ways and you are encouraged to try different approaches to parse the response data.

In [79]:
def fetchData(url):
    response = getTwitterStream(url, "GET", [])
    lines = response.read()
    allinfo = jsonpickle.loads(lines)
    statuses = allinfo['statuses']
    print 'Stream'
    print url.split('/')[-1][14:]
    print '\n'
    for i in range(10):
        try:
            print i+1
            print statuses[i]['text'],'\n'
        except:
            continue

In [80]:
queries = ['UCSD', 'Donald Trump', 'Syria']

for query in queries:
    #We can also request twitter stream data for specific search parameters as follows
    url= "https://api.twitter.com/1.1/search/tweets.json?q=" + query
    fetchData(url)

Stream
UCSD


1
Hang in there #Revelle. Only two more days... then finals week is over! #YouGotThis #UCSD https://t.co/aMKkWE4zCM 

2
https://t.co/SaWxNKMBLU Resources Here for Learning How to Dance #UCSD #reddit 

3
RT @Keeling_curve: 401.26 parts per million (ppm) CO2 in air 08-Dec-2015 https://t.co/5Q2FLbb4ix 

4
Are changing oceans and marine ecosystems included in the US's national adaptation plan? #AskUSCenter #ucsd #COP21 

5
In US, more water used in energy than in agriculture! Energy and water are intimately connected – Jonathan Pershing @US_Center #COP21 #ucsd 

6
RT @Jazz88: RETWEET til 8am PT to ENTER Pair of Seats CONTEST!&gt; Mark Dresser Septet @TheLoftatUCSD 12/11&gt;https://t.co/DrQjjn7ubg https://t.… 

7
RT @Jazz88: RETWEET til 8am PT to ENTER Pair of Seats CONTEST!&gt; Mark Dresser Septet @TheLoftatUCSD 12/11&gt;https://t.co/DrQjjn7ubg https://t.… 

8
RT @Jazz88: RETWEET til 8am PT to ENTER Pair of Seats CONTEST!&gt; Mark Dresser Septet @TheLoftatUCSD 12/11&gt;https:

Call the fetchData function to fetch latest live stream data for following search queries and output the first 5 lines

1. "UCSD"
2. "Donald Trump"
3. "Syria"

### TF-IDF###

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.It is among the most regularly used statistical tool for word cloud analysis. You can read more about it online (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

We base our analysis on the following

1. The weight of a term that occurs in a document is simply proportional to the term frequency
2. The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs

For this question we will perform tf-idf analysis o the stream data we retrieve for a given search parameter. Perform the steps below

1. use the twitterreq function to search for the query "syria" and save the top 200 lines in the file twitterStream.txt
2. load the saved file and output the count of occurrences for each term. This will be your term frequency
3. Calculate the inverse document frequency for each of the term in the output above.
4. Divide the term frequency for each of the term by corresponding inverse document frequency.
5. Sort the terms in the descending order based on their term freq/inverse document freq scores 
6. Print the top 10 terms.

In [81]:
#1.

writer = open('twitterStream.txt', 'a') 
url= "https://api.twitter.com/1.1/search/tweets.json?q="+"syria"
response = getTwitterStream(url, "GET", [])
lines = response.read()
allinfo = jsonpickle.loads(lines)
statuses = allinfo['statuses']
for i in range(200):
    try:
        writer.write(statuses[i]['text'].replace('\n',' ')+'\n\n')
    except:
        continue
        
writer.close()

In [82]:
#2.

def tf(name):
    char = '.,?"'
    text = open(name, 'r')
    line = text.read()
    text.close()
    word_list=line.lower().split()
    count_dict = {}
    for word in word_list:
        if word[-1] in char:
            word = word[:-1]
        if word not in count_dict:
            count_dict[word]=0
    for word in word_list:
        if word[-1] in char:
            word = word[:-1]
        count_dict[word]+=1
    return count_dict

name = 'twitterStream.txt'
tf = tf(name)
print tf

{'and': 1, 'invasion': 1, 'full': 1, 'force': 1, '#sadakat': 1, '#germany': 1, 'do': 1, 'https://t.co/trxv9d2tpv': 1, 'https://t.co/zyutbketkm': 1, 'saa': 2, '@bradcabana:': 1, 'used': 2, 'us': 1, 'a': 1, 'in': 2, 'vote': 1, 'russia': 1, 'rt': 4, 'forced': 1, '@amb_yakovenko': 1, 'truck-mounted': 2, 'troops': 1, 'northern': 1, 'till': 1, '(': 2, '@green_lemonnn:': 2, '#uncommonsense': 1, 'now!': 1, 'send': 1, 'should': 1, '#israil': 1, '#parisshooting': 1, '&amp;iraq': 1, 'wait': 1, 'war': 1, 'fra': 1, 'diy': 2, 'be': 1, '#news': 1, 'artillery': 2, 'iran': 1, 'trk': 1, 'congress': 1, '#lacancionde2015fue': 1, '#saudi': 1, '2a18m)': 2, "syria's": 1, '#iran': 1, 'act': 1, 'https://t.co/iqq2vfcrsf': 1, '#iraq': 1, '#cdnpoli': 1, 'https://t.co/kxupqlhdhp': 2, 'minister': 1, "don't": 1, 'assad': 1, '&amp;': 1, 'by': 2, 'must': 1, 'wont': 1, 'on': 1, 'https://t.co/1vboqsieeh': 1, '#syria': 4, 'd-30': 2, 'https://t.co/7f4lg3plnj': 1, '#isis': 1, '#tcot': 1, 'out:': 1, 'self-propelled': 2, 'jo

In [83]:
#3
def idf(name):
    docs = open(name, 'r')
    tot_docs = len(docs.readlines())
    count_dict = {}
    unique = []
    docs.close()
    
    #Get all unique terms
    docs = open(name, 'r')
    char = '.,?"'
    text_list = docs.read().lower().split()
    for word in text_list:
        if word[-1] in char:
            word = word[:-1]
        if word not in unique:
            unique.append(word)
            count_dict[word] = 0
    docs.close()
    
    docs = open(name, 'r')
    for line in docs.readlines():
        new_line = []
        for word in line.lower().split():
            if word[-1] in char:
                word = word[:-1]
            new_line.append(word)
        for term in unique:
            if term[-1] in char:
                term = term[:-1]
            if term in new_line:
                count_dict[term] += 1
            else:
                pass    
    docs.close()
    
    for key in count_dict:
        count_dict[key] = np.log10(float(tot_docs) / float(count_dict[key]))
    return count_dict
        
name = 'twitterStream.txt'    
idf = idf(name)
print idf

{'and': 1.146128035678238, 'invasion': 1.146128035678238, 'full': 1.146128035678238, 'force': 1.146128035678238, '#sadakat': 1.146128035678238, '#germany': 1.146128035678238, 'do': 1.146128035678238, 'https://t.co/trxv9d2tpv': 1.146128035678238, 'https://t.co/zyutbketkm': 1.146128035678238, 'saa': 0.84509804001425681, '@bradcabana:': 1.146128035678238, 'used': 0.84509804001425681, 'us': 1.146128035678238, 'a': 1.146128035678238, 'in': 0.84509804001425681, 'vote': 1.146128035678238, 'russia': 1.146128035678238, 'rt': 0.54406804435027567, 'forced': 1.146128035678238, '@amb_yakovenko': 1.146128035678238, 'truck-mounted': 0.84509804001425681, 'troops': 1.146128035678238, 'northern': 1.146128035678238, 'till': 1.146128035678238, '(': 0.84509804001425681, '@green_lemonnn:': 0.84509804001425681, '#uncommonsense': 1.146128035678238, 'now!': 1.146128035678238, 'send': 1.146128035678238, 'should': 1.146128035678238, '#israil': 1.146128035678238, '#parisshooting': 1.146128035678238, '&amp;iraq': 

In [84]:
#4. 

def tfidf(tf_dict, idf_dict):
    tfidf_dict = {}
    for term in tf_dict.keys():
        tfidf_dict[term] = tf_dict[term] * idf_dict[term]
    return tfidf_dict

tfidf = tfidf(tf, idf)
tfidf

{'#cdnpoli': 1.146128035678238,
 '#germany': 1.146128035678238,
 '#iran': 1.146128035678238,
 '#iraq': 1.146128035678238,
 '#isis': 1.146128035678238,
 '#israil': 1.146128035678238,
 '#lacancionde2015fue': 1.146128035678238,
 '#news': 1.146128035678238,
 '#p2': 1.146128035678238,
 '#palestine': 1.146128035678238,
 '#parisshooting': 1.146128035678238,
 '#sadakat': 1.146128035678238,
 '#saudi': 1.146128035678238,
 '#syria': 2.1762721774011027,
 '#tcot': 1.146128035678238,
 '#terrorismo': 1.146128035678238,
 '#uncommonsense': 1.146128035678238,
 '&amp;': 1.146128035678238,
 '&amp;iraq': 1.146128035678238,
 '(': 1.6901960800285136,
 '122mm': 1.6901960800285136,
 '2a18m)': 1.6901960800285136,
 '@amb_yakovenko': 1.146128035678238,
 '@bradcabana:': 1.146128035678238,
 '@green_lemonnn:': 1.6901960800285136,
 '@politics_pr:': 1.146128035678238,
 'a': 1.146128035678238,
 'act': 1.146128035678238,
 'and': 1.146128035678238,
 'artillery': 1.6901960800285136,
 'assad': 1.146128035678238,
 'be': 1.1

In [85]:
#5. 

freqScore = pd.DataFrame(tfidf.items(),columns=['Term','TF-IDF']).sort(ascending=False,columns=['TF-IDF'])
freqScore.head(10)

Unnamed: 0,Term,TF-IDF
60,#syria,2.176272
16,rt,2.176272
55,by,1.690196
19,truck-mounted,1.690196
34,saa,1.690196
67,self-propelled,1.690196
35,saudi,1.690196
23,(,1.690196
12,in,1.690196
51,syria,1.690196
