#Data Acquisition

Since we wanted to learn how to harvest and process Twitter data, we wrote a Python script to do just that: <link to script>. As a backup, we were also very fortunate to have access to a dataset of 3.5 million tweets generously provided by Sébastien Gruhier of http://onemilliontweetmap.com/, to whom we are very grateful. Due to Twitter policy, we are unable to provide a public link to this dataset.

For our study, we chose to focus on the United States because of the limited time available for the project and the ready availability of both tweet and health data for the US. For each data source, we wrote a separate script to parse and reformat the data for analysis. This involved selecting only US tweets, adding a US state of origin label to each tweet, and cleaning unwanted characters out of the tweets.
We encountered challenges in processing both datasets; those for our own harvested dataset are noted below. For the onemilliontweetmap dataset (hereafter referred to as omtm), the challenges were the huge file size and insufficient memory on our laptops. We worked around these by using the Unix grep command to select US tweets with the search term "United States", and by dividing the data (text files) into chunks of a million lines (tweets) using the Unix head command (head -1000000). These smaller files were then processed separately using the script.
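
The same filter-and-split workaround can also be done in Python without loading the whole file into memory. The sketch below is an illustration, not the commands we ran: the function name `split_us_tweets`, the chunk file naming and the file paths are hypothetical, and it assumes one tweet per line.

```python
import io


def split_us_tweets(in_path, out_prefix, chunk_size=1000000,
                    search_term=u"United States"):
    """Stream in_path line by line, keep lines containing search_term,
    and write them to numbered chunk files of at most chunk_size lines.

    Returns the list of chunk file names that were written.
    """
    chunk_paths = []
    out_file = None
    lines_in_chunk = 0
    with io.open(in_path, encoding="utf-8") as in_file:
        for line in in_file:  # only one line is held in memory at a time
            if search_term not in line:
                continue
            # Open a new chunk file when none is open or the current one is full
            if out_file is None or lines_in_chunk >= chunk_size:
                if out_file is not None:
                    out_file.close()
                path = u"{0}.{1:03d}.txt".format(out_prefix, len(chunk_paths))
                out_file = io.open(path, "w", encoding="utf-8")
                chunk_paths.append(path)
                lines_in_chunk = 0
            out_file.write(line)
            lines_in_chunk += 1
    if out_file is not None:
        out_file.close()
    return chunk_paths
```

Each returned chunk file can then be passed to the processing script on its own.
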
The online JSON viewer http://jsonviewer.stack.hu/ proved to be a very useful tool for identifying the fields to extract with our script. Similarly, the online JSON validator http://jsonviewer.stack.hu/ was helpful in confirming that our output files were valid JSON.
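
The validity check can also be scripted with the standard `json` module, which is convenient for files too large to paste into an online tool. This is a minimal sketch; the helper name `is_valid_json` is hypothetical.

```python
import io
import json


def is_valid_json(path):
    """Return True if the file at path parses as JSON, False otherwise."""
    try:
        with io.open(path, encoding="utf-8") as fp:
            json.load(fp)
    except ValueError:  # json.JSONDecodeError is a ValueError in Python 3
        return False
    return True
```
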

Notes on our script -- getting location data (state and county) by reverse geocoding with geopy timed out; next time use solution proposed by
GPS coordinates are good if you want to be more precise (street level) -- use the solution proposed by BF on Piazza
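
One common way to cope with geocoder timeouts is to retry the call after a growing pause. The sketch below is a generic illustration and not the solution referred to in the notes above; `retry_with_backoff` and its parameters are hypothetical names.

```python
import time


def retry_with_backoff(func, args=(), retries=3, first_delay=1.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Call func(*args); on one of the listed exceptions, wait and try again.

    The wait doubles after every failed attempt. The last failure is
    re-raised so the caller can decide what to do with that tweet.
    """
    delay = first_delay
    for attempt in range(retries):
        try:
            return func(*args)
        except exceptions:
            if attempt == retries - 1:
                raise
            sleep(delay)
            delay *= 2
```

With geopy, this could wrap the reverse lookup, for example `retry_with_backoff(geolocator.reverse, args=((latitude, longitude),), exceptions=(GeocoderTimedOut,))`, assuming `GeocoderTimedOut` is imported from `geopy.exc` in the installed version.
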




In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import json
import csv
import re
from geopy.geocoders import Nominatim
import ast
from time import sleep
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

###Script For Collecting Data From Twitter

In [None]:
from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

ckey = ''
csecret = ''
atoken = ''
asecret = ''

oauth = OAuth(atoken, asecret, ckey, csecret)

# Initiate the connection to Twitter Streaming API
twitter_stream = TwitterStream(auth=oauth)

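# Get a sample of the public data flowing through Twitter: English tweets
# inside a bounding box given as SW longitude, SW latitude, NE longitude, NE latitude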
iterator = twitter_stream.statuses.filter(locations='-126,-58,26,50', lang='en')

print "\nHow many tweets would you like to collect?"
# raw_input returns text, so convert it to an integer explicitly
tweet_count = int(raw_input())
with open('ourdata.csv','w') as tweet_file:
	final_dict = {'uid':[], 'tid':[], 'text':[], 'timestamp':[], 'city':[], 'country':[], 'bounding_box':[]}
	for tweet in iterator:
		tweet_count -= 1
		# The Python Twitter Tools package wraps each message returned by Twitter
		# as a TwitterDictResponse object, which can be read like a dict.
		# Copy the fields of interest into the column lists.
		for k,v in tweet.iteritems():
			if k == 'text':
				final_dict['text'].append(v)
			elif k == 'user':
				final_dict['uid'].append(v['id'])
			elif k == 'id':
				final_dict['tid'].append(v)
			elif k == 'timestamp_ms':
				final_dict['timestamp'].append(long(v))
			elif k == 'place':
				try:
					final_dict['city'].append(v['full_name'].split(',')[0])
					final_dict['country'].append(v['country'])
					final_dict['bounding_box'].append(v['bounding_box'])
				except (TypeError, KeyError):
					# 'place' is null or lacks one of these fields
					final_dict['city'].append('')
					final_dict['country'].append('')
					final_dict['bounding_box'].append('')


		if tweet_count <= 0:
			break

	
	tweet_df = pd.DataFrame(final_dict)
	tweet_df.to_csv(tweet_file)
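
An alternative way to collect the same fields is to build one flat record per tweet, which keeps all columns the same length even when a stream message lacks some fields. This is only a sketch: `flatten_tweet` is a hypothetical helper, and the field names are the ones used in the script above.

```python
def flatten_tweet(tweet):
    """Return one flat record for a tweet dict, or None for stream
    messages that are not tweets (they have no 'text' key)."""
    if "text" not in tweet:
        return None
    place = tweet.get("place") or {}  # 'place' may be missing or null
    user = tweet.get("user") or {}
    timestamp = tweet.get("timestamp_ms")
    return {
        "uid": user.get("id"),
        "tid": tweet.get("id"),
        "text": tweet["text"],
        "timestamp": int(timestamp) if timestamp is not None else None,
        "city": place.get("full_name", "").split(",")[0],
        "country": place.get("country", ""),
        "bounding_box": place.get("bounding_box", ""),
    }
```

A list of such records (skipping the `None` results) can be passed directly to `pd.DataFrame`.
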

In [2]:
###Script for Processing Data Collected from Twitter

In [3]:
geolocator = Nominatim()
def find_tweet_address(gps_polygon_text):
    """
    Get details about the location of origin of a tweet
    based on GPS coordinates
    """
    # The bounding box is stored as the text of a dict; parse it safely
    gps_polygon_dict = ast.literal_eval(gps_polygon_text)
    # Use the first corner of the bounding box as the tweet location
    # (coordinates are ordered longitude, latitude)
    longitude = gps_polygon_dict['coordinates'][0][0][0]
    latitude = gps_polygon_dict['coordinates'][0][0][1]
    tweetlocation = geolocator.reverse((latitude, longitude))
    tweetaddress_fields = tweetlocation.raw
    try:
        county = tweetaddress_fields['address']['county']
        state = tweetaddress_fields['address']['state']
        zipcode = tweetaddress_fields['address']['postcode']
    except KeyError:
        # If any one of the fields is missing, all three are left empty
        county = ''
        state = ''
        zipcode = ''
    return dict(county=county, state=state, zipcode=zipcode)


In [4]:
def tweet_cleaner(tweet):
    """
    Clean the text of a tweet.
    Adapted from http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/
    """
    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    #Collapse runs of whitespace into a single space
    tweet = re.sub(r'[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #Trim leading and trailing quote characters
    tweet = tweet.strip('\'"')
    return tweet
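
To make the cleaning steps concrete, the sketch below applies the same substitutions to a made-up tweet. The function is restated under the hypothetical name `clean_tweet_demo` only so that the snippet runs on its own; the sample text is invented.

```python
import re


def clean_tweet_demo(tweet):
    """Same substitutions as tweet_cleaner, restated for a standalone demo."""
    tweet = tweet.lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    tweet = re.sub(r'[\s]+', ' ', tweet)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    return tweet.strip('\'"')


cleaned = clean_tweet_demo(u"@Priz  I LOVE #DataScience, see https://t.co/abc")
# cleaned is now: AT_USER i love datascience, see URL
```

Note that punctuation attached to a hashtag or URL is treated as part of it, because the patterns match everything up to the next whitespace.
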


In [5]:
def parsecsv(tweet_data):
    """
    Parse one row of the harvested CSV file and extract the values of interest.
    Returns None for rows that are not tweets from the United States.
    """
    tweet_dict = None
    # Columns are addressed by position, following the layout of the CSV
    # written by the collection script
    if tweet_data[3] == "United States":
        tweetid = tweet_data[-3]
        userid = tweet_data[-1]
        place = tweet_data[2]
        coords = tweet_data[1]
        country = tweet_data[3]
        lang = ''  # the harvested CSV has no language column
        timestamp = tweet_data[-2]
        ttext = tweet_data[4]
        ttext_cleand = tweet_cleaner(ttext)
        # Look up the state from the bounding box of the tweet's place
        location_data = find_tweet_address(coords)
        state = location_data['state']
        tweet_dict = dict(tweetid=tweetid, userid=userid, place=place, coords=coords, country=country, state=state, lang=lang,
                         timestamp=timestamp, ttext=ttext, ttext_cleand=ttext_cleand)
    return tweet_dict

In [6]:
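def ourdata_main():
    """
    Load the harvested CSV file, reformat and clean each US tweet,
    and write the results to a JSON file as one array.
    Uses the global names incsvfile and outjsonfile.
    """
    line_count = 0  # stays 0 until the first record has been written
    with open(incsvfile) as data_file:
        data = csv.reader(data_file)
        with open(outjsonfile, 'w') as fp:
            fp.write('[' + '\n')
            for tweet_data in data:
                tweet_dict = parsecsv(tweet_data)
                # parsecsv returns None for rows that are not US tweets
                if tweet_dict is not None:
                    out_put = json.dumps(tweet_dict)
                    # Put a comma before every record except the first
                    if line_count == 0:
                        fp.write(out_put + '\n')
                    else:
                        fp.write("," + out_put + '\n')
                    line_count = 1
            fp.write(']' + '\n')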





In [7]:
%%time
incsvfile = ('./test.ourdata.csv')
outjsonfile = ('test.ourdata.usa.json')
ourdata_main()

CPU times: user 460 ms, sys: 20 ms, total: 480 ms
Wall time: 12.6 s


In [8]:
usdf = pd.read_json('test.ourdata.usa.json')
usdf.head(3)

Unnamed: 0,coords,country,lang,place,state,timestamp,ttext,ttext_cleand,tweetid,userid
0,"{u'type': u'Polygon', u'coordinates': [[[-86.3...",United States,,North Muskegon,,1449028922213,@westbrook_chloe @cc6163 @anniiikkkkaaa @Jenna...,AT_USER AT_USER AT_USER AT_USER AT_USER he won...,671902115011383296,531579492
1,"{u'type': u'Polygon', u'coordinates': [[[-79.3...",United States,,Mebane,North Carolina,1449028922366,My timehop is the most embarrassing thing ever...,my timehop is the most embarrassing thing ever...,671902115653120000,517114449
2,"{u'type': u'Polygon', u'coordinates': [[[-93.2...",United States,,Prien,Louisiana,1449028922422,I swea 🌚 he was going so fast I thought it was...,i swea 🌚 he was going so fast i thought it was...,671902115887898624,2765379648


In [9]:

'''
input: twitter data file in JSON format
output: JSON file with United States tweets only, state label added and tweet text cleaned

'''

#start process_tweet
def tweetcleaner(tweet):
    ''' function adapted from http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/'''
    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','URL',tweet)
    #Convert @username to AT_USER
    tweet = re.sub('@[^\s]+','AT_USER',tweet)
    #Remove additional white spaces
    tweet = re.sub('[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet
#end


def parsejson(tweet_data):
    """
    Parse one omtm tweet record and extract the values of interest.
    Returns None for tweets from outside the United States.
    """
    tweet_dict = None
    if tweet_data["_source"]["place"]["country_code"] == "US":
        tweetid = tweet_data["_id"]
        userid = tweet_data["_source"]["user"]["id"]
        place = tweet_data["_source"]["place"]["full_name"]
        coords = tweet_data["_source"]["coordinates"]
        country = tweet_data["_source"]["place"]["country"]
        lang = tweet_data["_source"]["lang"]
        timestamp = tweet_data["_source"]["timestamp_ms"]
        ttext = tweet_data["_source"]["text"]
        ttext_cleand = tweetcleaner(ttext)
        # City-level places look like "Seattle, WA": the state is the last token.
        # State-level places look like "Washington, USA": the state is the
        # part before the comma.
        state = place.strip()[-3:].strip()
        if state == 'USA':
            state = place.split(",")[0]
        tweet_dict = dict(tweetid=tweetid, userid=userid, place=place, coords=coords, country=country, state=state, lang=lang,
                         timestamp=timestamp, ttext=ttext, ttext_cleand=ttext_cleand)
    return tweet_dict
 

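def omtmdata_main():
    """
    Load an omtm JSON file, keep and clean the US tweets,
    and write them to a JSON file as one array.
    Uses the global names injsonfile and outjsonfile.
    """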
    line_count = 0
    with open(injsonfile) as data_file:
        data = json.load(data_file)
    with open(outjsonfile, 'w') as fp:
        fp.write('[' + '\n')
        for tweet_data in data:
            tweet_dict = parsejson(tweet_data)
            out_put = json.dumps(tweet_dict)
            if out_put != 'null':
                if line_count == 0:
                    fp.write(out_put + '\n')
                else:
                    fp.write("," + out_put + '\n')
                line_count = 1
        fp.write(']' + '\n')
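
The state label in `parsejson` above is derived from the text of the place name. The sketch below isolates that rule so it can be tried on a few names; `state_from_place` is a hypothetical helper and the place names are made-up examples.

```python
def state_from_place(place):
    """Derive a state label from a Twitter place full_name.

    "Seattle, WA"     -> "WA"          (city-level place)
    "Washington, USA" -> "Washington"  (state-level place)
    """
    suffix = place.strip()[-3:].strip()
    if suffix == "USA":
        return place.split(",")[0].strip()
    return suffix


labels = [state_from_place(p) for p in
          ["Seattle, WA", "Washington, USA", "Central Park"]]
# labels is now: ['WA', 'Washington', 'ark']
```

The last example shows a limitation of the rule: a name that follows neither pattern yields its last three characters, so the resulting labels may need to be checked against a list of valid states.
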





In [10]:
%%time
injsonfile = ('./test.omtmdata.json')
outjsonfile = ('test.omtmdata.usa.json')
omtmdata_main()

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 12.3 ms


In [11]:
usdf_omtm = pd.read_json('test.omtmdata.usa.json')
usdf_omtm.head(3)

Unnamed: 0,coords,country,lang,place,state,timestamp,ttext,ttext_cleand,tweetid,userid
0,"47.614937999999995,-122.3306025",United States,en,"Seattle, WA",WA,1447266708839,"After taking public transit in DC and Seattle,...","after taking public transit in dc and seattle,...",664510856407834624,537328079
1,"37.7050435,-122.162294",United States,en,"San Leandro, CA",CA,1447266710015,"Thankful for all the veterans out there, I lov...","thankful for all the veterans out there, i lov...",664510861340340224,3896359752
2,"37.7706565,-122.4359785",United States,en,"San Francisco, CA",CA,1447266716814,"@Priz I've been watching, but not really enjoy...","AT_USER i've been watching, but not really enj...",664510889857433600,15532647
