#Using Social Media to Study the Link between Health and Happiness

#### *Based on Twitter Data*


##A Data Science CS109 Class Project

Contributors
-----
In alphabetical order  

Alejandro Covarrubias | Jacob Lurye | Eliud Oloo | Qiu-Yue Zhong


#  
<div style="float: right; margin-left: 30px;"><img title="created by Stef Gibson at StefGibson.com"style="float: right;margin-left: 30px;" src="http://www.massage1.com/wp-content/uploads/healthhappiness.jpg" align=right height = 350 /><figcaption>    image source: http://www.massage1.com/wp-content/uploads/healthhappiness</figcaption></div>


#  

###MOTIVATION
------
We are a team with diverse backgrounds in statistics, computer science, public health and biomedical research. The motivation for this project is our common interest in applying social media and data science approaches to study and promote human health. The scientific goal of this work was to find a way to measure average happiness state by state using sentiment analysis of Twitter data, and then to determine how well the happiness of a state correlates with public health statistics such as morbidity, mortality, healthcare quality and other criteria.



###OVERVIEW
------

It is often stated that health and happiness are closely linked. Quantifiable evidence to that effect is, however, hard to come by. One of many reasons for this is that happiness is an ambiguous concept: easy to recognize in oneself but often harder to detect, much less measure, in others. Yet if there was ever an opportune time to make a reasonable attempt at quantifying happiness in the population, it is now.

The rapid worldwide adoption of social media platforms in recent years has tremendously increased the amount, spontaneity and frequency of human communication. Twitter usage alone grew by five orders of magnitude, from 5,000 tweets per day in 2007 to 500,000,000 tweets per day in 2013 (http://www.internetlivestats.com/twitter-statistics/). Because the data is recorded electronically, it has another very attractive advantage over verbal communication for a data scientist: persistence. It remains available to be parsed and analyzed long after it was generated, as new analysis methods and expertise become available. Studies have shown that self-disclosure in online communication is more frequent and revealing than in face-to-face communication, presumably due to anonymity and physical distance (https://en.wikipedia.org/wiki/Self-disclosure). Together, these factors translate into the availability of huge volumes of data to work with in trying to gauge a phenomenon as nebulous as happiness.

A number of researchers have taken advantage of these characteristics of modern online communication and attempted to measure happiness using Twitter data, for example http://hedonometer.org/index.html. We therefore drew on the experiences published by other researchers to inform our approach to this project. Being able to reliably measure happiness opens the door to finding out what the determinants of happiness in the human population are. In particular, we were interested in investigating whether public health outcomes released in various government and non-profit organization reports have any relationship with happiness as measured on Twitter.
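
As a rough illustration of the kind of analysis outlined above, the sketch below averages a per-word happiness score over tweets grouped by state and then computes a Pearson correlation against a health indicator. The word scores, tweets and health values are made-up toy numbers, and the names `WORD_SCORES`, `tweet_score`, `state_happiness` and `pearson` are hypothetical. It is not necessarily the scoring used in this project or by the hedonometer project.

```python
from collections import defaultdict
from math import sqrt

# Toy lexicon: word -> happiness score (made-up values for illustration only)
WORD_SCORES = {"happy": 8.0, "love": 8.5, "sad": 2.0, "sick": 2.5}


def tweet_score(text):
    """Mean score of the lexicon words in a tweet, or None if none match."""
    scores = [WORD_SCORES[w] for w in text.lower().split() if w in WORD_SCORES]
    return sum(scores) / len(scores) if scores else None


def state_happiness(tweets):
    """Average tweet score per state; tweets is an iterable of (state, text)."""
    per_state = defaultdict(list)
    for state, text in tweets:
        score = tweet_score(text)
        if score is not None:
            per_state[state].append(score)
    return {state: sum(v) / len(v) for state, v in per_state.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up tweets and a made-up health indicator, keyed by state
tweets = [("MA", "so happy today"), ("MA", "love this"), ("TX", "feeling sick"),
          ("TX", "happy but sick"), ("CA", "sad news")]
health_index = {"MA": 0.9, "TX": 0.5, "CA": 0.3}

happiness = state_happiness(tweets)
states = sorted(happiness)
r = pearson([happiness[s] for s in states], [health_index[s] for s in states])
print(happiness, round(r, 3))
```

With only a handful of states the correlation coefficient is very noisy, so a real analysis would need all states and a much larger lexicon.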

See our project homepage [here](http://hackersforhumanhealth.me).


###QUESTION
------

The main question we wanted to answer is: Do happier people generally experience better health, and do healthier people experience more happiness? To answer it, we assumed that happiness as expressed in Twitter communication is a surrogate for genuine happiness in the human population. That is of course a big assumption, but we felt it was as good a starting point as any for investigating this very challenging problem. Our first task was to obtain Twitter data.  
Since we wanted to learn how to harvest and process Twitter data, we wrote a Python script to do just that: <link to script>.  As a backup, we were also very fortunate to have access to a dataset of 3.5 million tweets generously provided by Sébastien Gruhier of http://onemilliontweetmap.com/ , to whom we are very grateful. Due to Twitter policy, we are unable to provide a public link to this dataset.

For our study, we chose to focus on the United States because of the limited time available for the project and the ready availability of both tweet and health data for the US. For each data source, we wrote a separate script for parsing and reformatting the data for analysis. This involved selecting only US tweets, adding a US state of origin label to each tweet and cleaning unwanted characters out of the tweets.

We ran into challenges with both datasets. For the onemilliontweetmap dataset (hereafter referred to as omtm), the main problems were huge file sizes and insufficient memory to process them on our laptops. We worked around this by using the Unix grep command to select US tweets with the search term "United States", and by dividing the resulting text files into chunks of a million lines (tweets) using the Unix head command (head -1000000). These smaller files were then processed separately using the script.

The online JSON viewer http://jsonviewer.stack.hu/ proved to be a very useful tool for identifying the fields to extract with our script. Similarly, the online JSON validator http://jsonviewer.stack.hu/ was helpful in confirming that our output files were valid JSON.
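
The filter-and-split workaround described above can also be done in a single streaming pass that never holds the whole file in memory. The following is a minimal sketch, assuming one tweet per line; the function name `filter_and_chunk`, the `us_tweets_` file prefix and the parameter names are hypothetical, and this is not the command sequence that was actually run.

```python
import os


def filter_and_chunk(in_path, out_dir, term="United States",
                     chunk_size=1000000, prefix="us_tweets_"):
    """Stream in_path line by line, keep lines containing term, and write
    them to numbered files of at most chunk_size lines each.
    Returns the list of chunk file paths written."""
    paths = []
    out = None
    kept = 0
    with open(in_path, encoding="utf-8") as src:
        for line in src:
            if term not in line:
                continue
            if kept % chunk_size == 0:
                # Current chunk is full (or this is the first match):
                # close it and start a new chunk file
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, "%s%03d.txt" % (prefix, len(paths)))
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
            out.write(line)
            kept += 1
    if out is not None:
        out.close()
    return paths
```

A call such as `filter_and_chunk("omtm_dump.txt", "chunks")` (both names are placeholders) would then produce files that can each be fed to the processing script separately.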

In our own script, we obtained location data (state and county) by reverse geocoding GPS coordinates with geopy, and these requests timed out. GPS coordinates are worthwhile if more precise, street-level locations are needed; next time we would use the solution proposed by BF on Piazza.
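
One common way to cope with geocoder timeouts is to retry a failed lookup a few times with a growing pause between attempts. The sketch below is generic and does not call any geocoding service; `with_retries` and its parameters are hypothetical names. With geopy, the exception to catch would typically be `geopy.exc.GeocoderTimedOut`. This is not necessarily the solution mentioned above.

```python
import time


def with_retries(func, retry_on=(Exception,), max_tries=3, base_delay=1.0,
                 sleep=time.sleep):
    """Wrap func so that calls raising one of retry_on are retried.

    Waits base_delay, then 2 * base_delay, and so on between attempts,
    and re-raises the last error once max_tries attempts have failed."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_tries + 1):
            try:
                return func(*args, **kwargs)
            except retry_on:
                if attempt == max_tries:
                    raise
                # Exponential backoff: 1x, 2x, 4x ... the base delay
                sleep(base_delay * 2 ** (attempt - 1))
    return wrapper
```

In the processing script this could wrap the lookup, for example `reverse = with_retries(geolocator.reverse, retry_on=(GeocoderTimedOut,))`, after importing `GeocoderTimedOut` from `geopy.exc`.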





Script for harvesting Twitter data:

In [None]:
from twitter import OAuth, TwitterStream
import pandas as pd

# Fill in your own Twitter API credentials
ckey = ''
csecret = ''
atoken = ''
asecret = ''

oauth = OAuth(atoken, asecret, ckey, csecret)

# Initiate the connection to Twitter Streaming API
twitter_stream = TwitterStream(auth=oauth)

# Get a sample of the public data flowing through Twitter
iterator = twitter_stream.statuses.filter(locations='-126,-58,26,50', lang='en')

print("\nHow many tweets would you like to collect?")
tweet_count = int(input())

# Alphabetical column order, which is the CSV layout the processing script reads
columns = ['bounding_box', 'city', 'country', 'text', 'tid', 'timestamp', 'uid']
final_dict = {column: [] for column in columns}

for tweet in iterator:
    # The stream also carries non-tweet messages (e.g. delete and limit
    # notices); skip them so that every column keeps the same length
    if 'text' not in tweet or 'user' not in tweet:
        continue
    # place is missing or null for tweets without a tagged location
    place = tweet.get('place') or {}
    final_dict['text'].append(tweet['text'])
    final_dict['uid'].append(tweet['user']['id'])
    final_dict['tid'].append(tweet['id'])
    final_dict['timestamp'].append(tweet.get('timestamp_ms', ''))
    final_dict['city'].append((place.get('full_name') or '').split(',')[0])
    final_dict['country'].append(place.get('country') or '')
    final_dict['bounding_box'].append(place.get('bounding_box') or '')

    tweet_count -= 1
    if tweet_count <= 0:
        break

# The default index is written as the first CSV column
tweet_df = pd.DataFrame(final_dict, columns=columns)
tweet_df.to_csv('tweets.csv', encoding='utf-8')

Script for processing our own Twitter dataset:

In [None]:
import sys
import re
import json
import csv
from geopy.geocoders import Nominatim
import ast
from time import sleep 
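# Recent geopy releases require an explicit user_agent for Nominatim;
# the name used here is only a placeholder
geolocator = Nominatim(user_agent="tweet_state_labeler")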


'''
input: twitter data file in CSV format
intermediate step: used command line grep "United States" tweets.csv 
output: JSON file with United States tweets only, state label added and tweet text cleaned

'''
 
 
def find_tweet_address(gps_polygon_text):
    """
    Get details about the location of origin of a tweet
    based on GPS coordinates
    """
    gps_polygon_dict = ast.literal_eval(gps_polygon_text)
    # Use the first corner of the bounding polygon; points are [longitude, latitude]
    longitude = gps_polygon_dict['coordinates'][0][0][0]
    latitude = gps_polygon_dict['coordinates'][0][0][1]
    tweetlocation = geolocator.reverse((latitude, longitude), timeout=None)
    # reverse() returns None when no address is found
    address = tweetlocation.raw.get('address', {}) if tweetlocation else {}
    # Look up each field on its own so that a missing county or postcode
    # does not blank out the state
    county = address.get('county', '')
    state = address.get('state', '')
    zipcode = address.get('postcode', '')
    return dict(county=county, state=state, zipcode=zipcode)
 

def tweet_cleaner(tweet):
    """
    tweet cleaning function
    adopted from http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/
    """
    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    #Remove additional white spaces
    tweet = re.sub(r'[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet

 
def parsecsv(tweet_data):
    """
    parse each tweet and extract values of interest
    """
    tweet_dict = None
    if tweet_data[3] == "United States":
        tweetid = tweet_data[-3]
        userid = tweet_data[-1]
        place = tweet_data[2]
        coords = tweet_data[1]
        country = tweet_data[3]
        lang = ''
        timestamp = tweet_data[-2]
        ttext = tweet_data[4]
        ttext_cleand = tweet_cleaner(ttext)
        sleep(1)
        location_data = find_tweet_address(coords)
        state = location_data['state']
        tweet_dict = dict(tweetid=tweetid, userid=userid, place=place, coords=coords, country=country, state=state, lang=lang,
                         timestamp=timestamp, ttext=ttext, ttext_cleand=ttext_cleand)
        # print "\n", tweet_dict['ttext'], "\n", tweet_dict['ttext_cleand'], "\n", tweet_dict['state']
    else:
        pass
    return tweet_dict
 
 
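def main():
    """
    Read the CSV file and print a JSON array of the processed US tweets
    """
    first_record = True
    # newline='' lets the csv module handle line endings itself
    with open(incsvfile, newline='', encoding='utf-8') as data_file:
        data = csv.reader(data_file)
        print('[')
        for tweet_data in data:
            tweet_dict = parsecsv(tweet_data)
            if tweet_dict is None:
                continue
            out_put = json.dumps(tweet_dict)
            # Separate records with commas so the output is a valid JSON array
            if first_record:
                print(out_put)
                first_record = False
            else:
                print("," + out_put)
        print(']')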
 
 
if __name__ == "__main__":
    incsvfile = sys.argv[1]
    main()
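
The script above reverse-geocodes the first corner of a place's bounding polygon. For large places, the centre of the box may be a more representative point. The following is a sketch of that alternative, assuming the bounding box is stored as the text of a GeoJSON-style polygon with [longitude, latitude] pairs; `bounding_box_center` is a hypothetical helper and the sample box uses made-up coordinates.

```python
import ast


def bounding_box_center(gps_polygon_text):
    """Return (latitude, longitude) of the centre of a bounding box given
    as the text of a dict like
    {'type': 'Polygon', 'coordinates': [[[lon, lat], ...]]}."""
    polygon = ast.literal_eval(gps_polygon_text)
    ring = polygon['coordinates'][0]
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return ((min(lats) + max(lats)) / 2.0, (min(lons) + max(lons)) / 2.0)


# Made-up bounding box for illustration
box = ("{'type': 'Polygon', 'coordinates': "
       "[[[-71.2, 42.2], [-71.2, 42.4], [-71.0, 42.4], [-71.0, 42.2]]]}")
print(bounding_box_center(box))
```

The returned pair is in (latitude, longitude) order, which is the order `geolocator.reverse` is given in the script.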

In [None]:
import json

#import regex
import re

'''
input: twitter data file in JSON format
output: JSON file with United States tweets only, state label added and tweet text cleaned

'''


#start process_tweet
def tweetcleaner(tweet):
    ''' function adopted from http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/'''
    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    #Remove additional white spaces
    tweet = re.sub(r'[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet
#end


def parsejson(tweet_data):
    """
    parse each tweet and extract values of interest
    """
    tweet_dict = None
    source = tweet_data["_source"]
    # Guard against tweets whose place field is missing or null
    place_data = source.get("place") or {}
    if place_data.get("country_code") == "US":
        tweetid = tweet_data["_id"]
        userid = source["user"]["id"]
        place = place_data["full_name"]
        coords = source["coordinates"]
        country = place_data["country"]
        lang = source["lang"]
        timestamp = source["timestamp_ms"]
        ttext = source["text"]
        ttext_cleand = tweetcleaner(ttext)
        # "City, ST" ends in the state code; "State, USA" starts with the state name
        state = place.strip()[-3:].strip()
        if state == 'USA':
            state = place.split(",")[0]

        tweet_dict = dict(tweetid=tweetid, userid=userid, place=place, coords=coords, country=country, state=state, lang=lang,
                         timestamp=timestamp, ttext=ttext, ttext_cleand=ttext_cleand)
    return tweet_dict
 

def main():
    """
    """
    line_count = 0
    with open(injsonfile) as data_file:
        data = json.load(data_file)
    with open(outjsonfile, 'w') as fp:
        fp.write('[' + '\n')
        for tweet_data in data:
            tweet_dict = parsejson(tweet_data)
            out_put = json.dumps(tweet_dict)
            if out_put != 'null':
                if line_count == 0:
                    fp.write(out_put + '\n')
                else:
                    fp.write("," + out_put + '\n')
                line_count = 1
        fp.write(']' + '\n')


if __name__ == "__main__":
    injsonfile = input("what is your input json file name? ")
    outjsonfile = input("what is your output json file name? ")
    main()
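
The state label in `parsejson` relies on the shape of the place's `full_name` string. The sketch below isolates that rule and runs it on two made-up place names; `state_from_place` is a hypothetical helper name.

```python
def state_from_place(place):
    """Apply the rule used in parsejson: take the last three characters of
    the place name; if they spell 'USA', the state is the part before the
    first comma, otherwise the trimmed suffix is the state."""
    suffix = place.strip()[-3:].strip()
    if suffix == 'USA':
        return place.split(",")[0]
    return suffix


for name in ["Boston, MA", "California, USA"]:
    print(name, "->", state_from_place(name))
```

Note that this rule yields two-letter codes for city-level places and full names for state-level places, so the labels need to be normalised to one form before tweets are grouped by state.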
