#QUESTION
------

The main question we wanted to answer is: do happier people generally experience better health, and do healthier people experience more happiness? To do this, we assumed that happiness as expressed in Twitter communication is a surrogate for genuine happiness in the human population. That is of course a big assumption, but we felt it was as good a starting point as any for investigating this very challenging problem. Our first task was to obtain Twitter data.


#Data Acquisition

Since we wanted to learn how to harvest and process Twitter data, we wrote a Python script to do just that: <link to script>. As a backup, we were also very fortunate to have access to a dataset of 3.5 million tweets generously provided by Sébastien Gruhier of http://onemilliontweetmap.com/, to whom we are very grateful. Due to Twitter policy, we are unable to provide a public link to this dataset.

For our study, we chose to focus on the United States because of the limited time available for the project and the ready availability of both tweet and health data for the US. For both data sources, we wrote separate scripts for parsing and reformatting the data appropriately for analysis. This involved selecting only US tweets, adding a US state of origin label to each tweet, and cleaning out unwanted characters in the tweets.
Some of the challenges we encountered in processing our own harvested dataset and the onemilliontweetmap dataset (hereafter referred to as omtm) are described below.

With the omtm dataset, the main problems were the huge file size and insufficient memory on our laptops. We worked around this by using the Unix grep command to select US tweets with the search term "United States" and dividing the data (text files) into chunks of a million lines (tweets) using the Unix head command (head -1000000). These smaller files were then processed separately using the script.
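
The grep-and-chunk workaround can also be sketched in Python. This is a minimal illustration rather than the commands we actually ran: it assumes Python 3, and the function name `split_matching_lines`, the chunk file naming, and the sample data are all made up for the example. Like grep, it works line by line, so a tweet whose text contains a newline would be split across lines.

```python
import os
import tempfile


def split_matching_lines(in_path, out_dir, needle="United States", chunk_size=1000000):
    """Write lines containing `needle` into numbered chunk files of at most
    `chunk_size` lines each, and return the list of chunk paths."""
    chunk_paths = []
    out_file = None
    lines_in_chunk = 0
    with open(in_path, encoding="utf-8") as in_file:
        # Iterating over the file object reads one line at a time,
        # so memory use stays flat regardless of the input size
        for line in in_file:
            if needle not in line:
                continue
            # Start a new chunk file at the beginning and whenever one fills up
            if out_file is None or lines_in_chunk == chunk_size:
                if out_file is not None:
                    out_file.close()
                path = os.path.join(out_dir, "chunk_%03d.txt" % len(chunk_paths))
                out_file = open(path, "w", encoding="utf-8")
                chunk_paths.append(path)
                lines_in_chunk = 0
            out_file.write(line)
            lines_in_chunk += 1
    if out_file is not None:
        out_file.close()
    return chunk_paths


if __name__ == "__main__":
    # Tiny demonstration with a made-up five-line file and a chunk size of 2
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.txt")
        with open(sample, "w", encoding="utf-8") as f:
            f.write("a,United States\nb,Canada\nc,United States\n"
                    "d,United States\ne,France\n")
        chunks = split_matching_lines(sample, tmp, chunk_size=2)
        print([os.path.basename(p) for p in chunks])
```

Each chunk file can then be passed to the processing script on its own, which keeps the memory footprint bounded by the chunk size.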
The online JSON viewer http://jsonviewer.stack.hu/ proved to be a very useful tool for easily identifying the fields to extract with our script. Similarly, the online JSON validator http://jsonviewer.stack.hu/ was helpful in confirming that our output files were valid JSON.
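
The same two checks can be approximated offline with Python's standard json module, which may be handy for files too large to paste into a web page. The helper names `is_valid_json` and `top_level_fields` and the sample record below are hypothetical, chosen only for this sketch.

```python
import json


def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def top_level_fields(text):
    """Return the sorted keys of a JSON object, to see which fields exist."""
    record = json.loads(text)
    return sorted(record.keys())


if __name__ == "__main__":
    # Made-up record, used only to illustrate the two helpers
    sample = '{"id": 1, "text": "hello", "place": {"country": "United States"}}'
    print(is_valid_json(sample))
    # A trailing comma inside an array is rejected by the parser
    print(is_valid_json('[{"id": 1},]'))
    print(top_level_fields(sample))
```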

In our own script, we obtained location data (state and county) by reverse geocoding with geopy, but the requests timed out. GPS coordinates are a good choice if you want to be more precise, down to street level. Next time we would use the solution proposed by BF on Piazza.
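
One generic way to cope with lookups that time out is to pause and retry. The sketch below is not the solution proposed on Piazza, which is not reproduced here. It is a plain retry helper under our own assumptions: the name `call_with_retries` and its parameters are hypothetical, and the flaky function in the demonstration only simulates a geocoder.

```python
from time import sleep


def call_with_retries(func, *args, max_attempts=3, wait_seconds=1.0,
                      retry_on=(Exception,)):
    """Call func(*args), retrying after a pause when it raises one of the
    exception types in `retry_on`. Re-raises after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == max_attempts:
                raise
            # Wait longer after each failure before trying again
            sleep(wait_seconds * attempt)


if __name__ == "__main__":
    attempts = []

    def flaky_lookup(coords):
        # Simulated geocoder: fails twice, then succeeds
        attempts.append(coords)
        if len(attempts) < 3:
            raise TimeoutError("simulated timeout")
        return {"state": "Louisiana"}

    result = call_with_retries(flaky_lookup, (30.2, -93.2),
                               wait_seconds=0.1, retry_on=(TimeoutError,))
    print(result, "after", len(attempts), "attempts")
```

With geopy, the wrapped call would presumably be `geolocator.reverse`, with `retry_on` set to the `GeocoderTimedOut` exception from `geopy.exc`.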




In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import json
import csv
import re
from geopy.geocoders import Nominatim
import ast
from time import sleep
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

###Script For Collecting Data From Twitter

###Script for Processing Data Collected from Twitter

In [2]:
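# Recent geopy releases require a user_agent; the name below is a placeholder
geolocator = Nominatim(user_agent="tweet_happiness_study")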
def find_tweet_address(gps_polygon_text):
    """
    Get details about the location of origin of a tweet
    based on GPS coordinates
    """
    # The polygon arrives as the text of a Python dict literal, so parse it safely
    gps_polygon_dict = ast.literal_eval(gps_polygon_text)
    # Use the first vertex of the polygon; coordinates are stored as [longitude, latitude]
    longitude = gps_polygon_dict['coordinates'][0][0][0]
    latitude = gps_polygon_dict['coordinates'][0][0][1]
    tweetlocation = geolocator.reverse((latitude, longitude))
    try:
        address = tweetlocation.raw['address']
        county = address['county']
        state = address['state']
        zipcode = address['postcode']
    except (AttributeError, KeyError):
        # No match, or one of the three fields is missing: blank all of them
        county = ''
        state = ''
        zipcode = ''
    return dict(county=county, state=state, zipcode=zipcode)
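
The function above geocodes the first vertex of the polygon, which is one corner of the bounding box. A possible refinement, shown here only as a sketch and not as what we ran, is to geocode the center of the box instead. The function name `polygon_center` and the coordinates in the demonstration are made up.

```python
import ast


def polygon_center(gps_polygon_text):
    """Return (latitude, longitude) at the center of the polygon's bounding box."""
    polygon = ast.literal_eval(gps_polygon_text)
    # Outer ring of the polygon: a list of [longitude, latitude] pairs
    ring = polygon['coordinates'][0]
    longitudes = [point[0] for point in ring]
    latitudes = [point[1] for point in ring]
    center_lat = (min(latitudes) + max(latitudes)) / 2.0
    center_lon = (min(longitudes) + max(longitudes)) / 2.0
    return center_lat, center_lon


if __name__ == "__main__":
    # Made-up bounding box in the same text format as the coords column
    text = ("{u'type': u'Polygon', u'coordinates': "
            "[[[-86.0, 43.0], [-86.0, 44.0], [-85.0, 44.0], [-85.0, 43.0]]]}")
    print(polygon_center(text))
```

The returned pair is already in the (latitude, longitude) order that `geolocator.reverse` is called with above.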


In [3]:
def tweet_cleaner(tweet):
    """
    tweet cleaning function
    adapted from http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/
    """
    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    #Collapse runs of white space into a single space
    tweet = re.sub(r'[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #Trim leading and trailing quote characters
    tweet = tweet.strip('\'"')
    return tweet
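
To see what each substitution contributes, the steps can be traced on a sample tweet. The sketch below repeats the same patterns in the same order as tweet_cleaner. The helper name `trace_cleaning` and the sample tweet are invented for the illustration.

```python
import re

# The same substitutions as tweet_cleaner, listed in the order they are applied
CLEANING_STEPS = [
    ("urls", r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL'),
    ("mentions", r'@[^\s]+', 'AT_USER'),
    ("whitespace", r'[\s]+', ' '),
    ("hashtags", r'#([^\s]+)', r'\1'),
]


def trace_cleaning(tweet):
    """Return a list of (step name, text after that step)."""
    text = tweet.lower()
    trace = [("lowercase", text)]
    for name, pattern, replacement in CLEANING_STEPS:
        text = re.sub(pattern, replacement, text)
        trace.append((name, text))
    trace.append(("strip quotes", text.strip('\'"')))
    return trace


if __name__ == "__main__":
    # Made-up tweet containing a mention, extra spaces, a link and a hashtag
    for step, text in trace_cleaning('Check this @Bob   http://t.co/abc #Happy'):
        print("%-12s %s" % (step, text))
```

Note that the mention pattern consumes everything up to the next white space, so punctuation attached to a user name (for example a closing quote) is replaced along with it.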


In [4]:
def parsecsv(tweet_data):
    """
    parse each tweet and extract values of interest
    """
    tweet_dict = None
    if tweet_data[3] == "United States":
        tweetid = tweet_data[-3]
        userid = tweet_data[-1]
        place = tweet_data[2]
        coords = tweet_data[1]
        country = tweet_data[3]
        lang = ''
        timestamp = tweet_data[-2]
        ttext = tweet_data[4]
        ttext_cleand = tweet_cleaner(ttext)
        location_data = find_tweet_address(coords)
        state = location_data['state']
        tweet_dict = dict(tweetid=tweetid, userid=userid, place=place, coords=coords, country=country, state=state, lang=lang,
                         timestamp=timestamp, ttext=ttext, ttext_cleand=ttext_cleand)
    # Rows from other countries fall through and return None
    return tweet_dict

In [5]:
def main():
    """
    Load the CSV of tweets, keep and clean the US rows, and write them
    out as a JSON array with one tweet per line
    """
    first_record = True
    with open(incsvfile) as data_file:
        data = csv.reader(data_file)
        with open(outjsonfile, 'w') as fp:
            fp.write('[' + '\n')
            for tweet_data in data:
                tweet_dict = parsecsv(tweet_data)
                # parsecsv returns None for rows that are not US tweets
                if tweet_dict is None:
                    continue
                # Every record after the first is preceded by a comma
                if not first_record:
                    fp.write(',')
                fp.write(json.dumps(tweet_dict) + '\n')
                first_record = False
            fp.write(']' + '\n')





In [6]:
%%time
incsvfile = ('./test.tweets.csv')
outjsonfile = ('test.tweets.usa.json')
main()

CPU times: user 436 ms, sys: 8 ms, total: 444 ms
Wall time: 12.3 s


In [7]:
usdf = pd.read_json('test.tweets.usa.json')
usdf.head(3)

Unnamed: 0,coords,country,lang,place,state,timestamp,ttext,ttext_cleand,tweetid,userid
0,"{u'type': u'Polygon', u'coordinates': [[[-86.3...",United States,,North Muskegon,,1449028922213,@westbrook_chloe @cc6163 @anniiikkkkaaa @Jenna...,AT_USER AT_USER AT_USER AT_USER AT_USER he won...,671902115011383296,531579492
1,"{u'type': u'Polygon', u'coordinates': [[[-79.3...",United States,,Mebane,North Carolina,1449028922366,My timehop is the most embarrassing thing ever...,my timehop is the most embarrassing thing ever...,671902115653120000,517114449
2,"{u'type': u'Polygon', u'coordinates': [[[-93.2...",United States,,Prien,Louisiana,1449028922422,I swea 🌚 he was going so fast I thought it was...,i swea 🌚 he was going so fast i thought it was...,671902115887898624,2765379648
